
How Prometheus Sends Alerts

Date: 2022-11-01 17:23:03

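Prometheus continuously evaluates the rules in its rule files to decide whether an alert needs to be sent.
When a rule's condition is met, it sends the alert to the addresses configured under alertmanagers.
The alert is delivered as an HTTP POST to that address. For example, with targets: ['192.168.1.104:8090'], Prometheus posts to http://192.168.1.104:8090/api/v2/alerts, or to /api/v1/alerts when it is using the older v1 Alertmanager API.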

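1. Goal

Alerts are normally handled by the Alertmanager component, but Alertmanager groups and inhibits them, so the original alert messages are never seen.
Here we write our own alertmanager program that receives the alerts Prometheus sends and prints them, in order to study the alert payload, the send frequency, and how alert resolution is handled.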

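Terms
evaluation_interval: the interval at which Prometheus evaluates its rules
for: the wait time configured in an alert rule before a pending alert fires; its value equals firedAt - activeAt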
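
The relationship between for and evaluation_interval can be sketched in a few lines of Go. This is a minimal illustration, not Prometheus code: it assumes a pending alert turns firing at the first rule evaluation at which at least for has elapsed since the expression first matched. The function name firedAfter is made up for this example.

```go
package main

import (
	"fmt"
	"time"
)

// firedAfter is a hypothetical helper: it steps through rule evaluations
// spaced interval apart and returns firedAt - activeAt, i.e. how long the
// alert stays pending when the rule has `for: hold`.
// interval must be positive.
func firedAfter(interval, hold time.Duration) time.Duration {
	activeAt := time.Unix(0, 0) // evaluation at which the expression first matched
	ts := activeAt
	for ts.Sub(activeAt) < hold {
		ts = ts.Add(interval) // next evaluation
	}
	return ts.Sub(activeAt)
}

func main() {
	// Values used in this article: evaluation_interval 15s, for 1m.
	fmt.Println(firedAfter(15*time.Second, time.Minute)) // 1m0s
	// An interval that does not divide `for` evenly rounds the wait up.
	fmt.Println(firedAfter(25*time.Second, time.Minute)) // 1m15s
}
```

With the 15s interval used here the wait is exactly for; in general it is for rounded up to the next evaluation.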

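2. Writing the alertmanager program

alertmanager.go

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
    "time"
)

// MyHandler prints every request body it receives.
type MyHandler struct{}

func (mh *MyHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    body, err := io.ReadAll(r.Body)
    if err != nil {
        fmt.Printf("read body err, %v\n", err)
        http.Error(w, "cannot read body", http.StatusBadRequest)
        return
    }
    // Print the receive time, then the raw JSON payload.
    fmt.Println(time.Now())
    fmt.Printf("%s\n\n", body)
}

func main() {
    h := &MyHandler{}
    // Prometheus posts to the v1 or the v2 path depending on the
    // Alertmanager API version it is using, so register both.
    http.Handle("/api/v1/alerts", h)
    http.Handle("/api/v2/alerts", h)
    log.Fatal(http.ListenAndServe(":8090", nil))
}

Start the program: go run alertmanager.go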
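
Before pointing Prometheus at the program, it can help to confirm that a handler of this shape accepts a POST and sees the body unchanged. The sketch below does this in memory with net/http/httptest, so no port is opened. The recorder type and the deliver and check helpers are names invented for this example, and the payload is a hand-written stand-in, not real Prometheus output.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// recorder is a stand-in for the receiver above: it keeps each request body
// in memory instead of printing it, so the result can be inspected.
type recorder struct{ bodies []string }

func (rc *recorder) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "cannot read body", http.StatusBadRequest)
		return
	}
	rc.bodies = append(rc.bodies, string(body))
}

// deliver sends payload to h as an in-memory POST and returns the HTTP status.
func deliver(h http.Handler, path, payload string) int {
	req := httptest.NewRequest(http.MethodPost, path, strings.NewReader(payload))
	req.Header.Set("Content-Type", "application/json")
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

// check posts payload once and returns the status and the body the handler saw.
func check(payload string) (int, string) {
	rc := &recorder{}
	code := deliver(rc, "/api/v1/alerts", payload)
	if len(rc.bodies) == 0 {
		return code, ""
	}
	return code, rc.bodies[0]
}

func main() {
	code, seen := check(`[{"labels":{"alertname":"InstanceDown"}}]`)
	fmt.Println(code, seen)
}
```

A handler that never writes a response, like the one in this article, is reported with status 200 by the recorder.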

3. Configuration files

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.1.104:8090']
rule_files:
  - "/etc/prometheus/rules.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
    - targets: ['192.168.1.200:9100']

The alertmanager address configured here, 192.168.1.104:8090, is the address of our own program, which receives the alerts from Prometheus and prints them.

rules.yml

groups:
- name: example
  rules:
  # Alert for any instance that is unreachable for more than 1 minute.
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      serverity: page
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minutes."

4. Reading the log

After stopping the node_exporter service on instance 192.168.1.200:9100, the program prints the following:

2022-11-01 16:04:01.4538613 +0800 CST m=+23803.323087701
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:08:01.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]

2022-11-01 16:05:16.4596299 +0800 CST m=+23878.328856301
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:09:16.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]

2022-11-01 16:06:31.4571604 +0800 CST m=+23953.326386801
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:10:31.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]

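The log shows two things:
Prometheus resends the alert every 1 minute 15 seconds. In this setup that equals evaluation_interval + for, but only because for happens to match the default resend delay of 1m (--rules.alert.resend-delay): a firing alert is resent at the first evaluation that falls more than the resend delay after the previous send, which is the fifth 15s evaluation, 75 seconds later.
endsAt lies in the future: it is the send time plus 4 minutes. Prometheus sets it to the evaluation time plus four times the larger of the resend delay and the evaluation interval, here 4 × 1m, so that the alert expires on its own if Prometheus stops resending it.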
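
The 1 minute 15 second spacing and the 4 minute endsAt offset seen in the log can both be reproduced from two settings: the evaluation interval and the alert resend delay (the --rules.alert.resend-delay flag, 1m by default). The Go sketch below is a simplified model based on a reading of the Prometheus rule manager, not a copy of its code; the function names are made up for this example.

```go
package main

import (
	"fmt"
	"time"
)

// needsResend reports whether a firing alert last sent at lastSent is due
// again at evaluation time ts: strictly more than resendDelay must have passed.
func needsResend(ts, lastSent time.Time, resendDelay time.Duration) bool {
	return lastSent.Add(resendDelay).Before(ts)
}

// resendGap steps through evaluations spaced interval apart and returns the
// time between two consecutive sends. interval must be positive.
func resendGap(interval, resendDelay time.Duration) time.Duration {
	lastSent := time.Unix(0, 0)
	ts := lastSent
	for {
		ts = ts.Add(interval) // next rule evaluation
		if needsResend(ts, lastSent, resendDelay) {
			return ts.Sub(lastSent)
		}
	}
}

// validFor returns how far past the send time endsAt is placed for a firing
// alert: four times the larger of the resend delay and the interval.
func validFor(interval, resendDelay time.Duration) time.Duration {
	delta := resendDelay
	if interval > delta {
		delta = interval
	}
	return 4 * delta
}

func main() {
	interval := 15 * time.Second // evaluation_interval from prometheus.yml
	resendDelay := time.Minute   // assumed default of --rules.alert.resend-delay

	fmt.Println(resendGap(interval, resendDelay)) // 1m15s
	fmt.Println(validFor(interval, resendDelay))  // 4m0s
}
```

The evaluation at exactly 60 seconds does not trigger a resend because the comparison is strict, which is why the gap is 75 seconds and not 60.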

node_exporter was then started again at 16:11:16.
The log shows:

2022-11-01 16:11:31.4596592 +0800 CST m=+24253.328885601
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:11:31.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]

2022-11-01 16:12:46.4868947 +0800 CST m=+24328.356121101
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:11:31.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]

...............
...............

2022-11-01 16:25:16.4505975 +0800 CST m=+25078.319823901
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:11:31.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]

2022-11-01 16:26:31.4480177 +0800 CST m=+25153.317244101
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:11:31.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]

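After node_exporter is started again, the output above also shows that Prometheus does not stop once the "resolved" alert has been sent: it keeps sending it, 13 times in total here,
at the same 1 minute 15 second interval as before.

Once the alert condition no longer holds, Prometheus sends one more alert at the next rule evaluation to signal that the alert is resolved. The first entry above (16:11:31) is that "resolved" alert. Why? Its endsAt equals the time it was sent, so by the time Alertmanager receives it, endsAt is already in the past, which means the alert has ended.

endsAt here is the time of the evaluation at which Prometheus found the alert resolved (16:11:31, shortly after node_exporter came back at 16:11:16), not inactiveAt + for. The repeated sends stop at 16:26:31, which matches the 15 minutes for which Prometheus keeps a resolved alert in memory and continues to resend it.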
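
The firing/resolved distinction described above can be checked mechanically. The following Go sketch decodes a payload of the shape shown in the log and compares endsAt with the receive time. The Alert struct lists only the fields visible in the log, and the status and classify helpers are names invented for this example; treating a missing endsAt as firing is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Alert holds the fields that appear in the logged payload.
type Alert struct {
	Labels       map[string]string `json:"labels"`
	Annotations  map[string]string `json:"annotations"`
	StartsAt     time.Time         `json:"startsAt"`
	EndsAt       time.Time         `json:"endsAt"`
	GeneratorURL string            `json:"generatorURL"`
}

// status returns "resolved" when endsAt is set and not after receivedAt,
// otherwise "firing".
func status(a Alert, receivedAt time.Time) string {
	if !a.EndsAt.IsZero() && !a.EndsAt.After(receivedAt) {
		return "resolved"
	}
	return "firing"
}

// classify decodes a JSON array of alerts and returns the status of each one
// as seen at receivedAt, given as an RFC 3339 timestamp.
func classify(body, receivedAt string) ([]string, error) {
	now, err := time.Parse(time.RFC3339, receivedAt)
	if err != nil {
		return nil, err
	}
	var alerts []Alert
	if err := json.Unmarshal([]byte(body), &alerts); err != nil {
		return nil, err
	}
	out := make([]string, 0, len(alerts))
	for _, a := range alerts {
		out = append(out, status(a, now))
	}
	return out, nil
}

func main() {
	// endsAt values and receive times taken from the log entries above;
	// the remaining fields are omitted for brevity.
	firing := `[{"endsAt":"2022-11-01T08:08:01.431Z","startsAt":"2022-11-01T02:01:31.431Z"}]`
	resolved := `[{"endsAt":"2022-11-01T08:11:31.431Z","startsAt":"2022-11-01T02:01:31.431Z"}]`

	fmt.Println(classify(firing, "2022-11-01T16:04:01+08:00"))
	fmt.Println(classify(resolved, "2022-11-01T16:12:46+08:00"))
}
```

The first call reports the alert as firing because its endsAt is four minutes ahead of the receive time; the second reports it as resolved because endsAt is already in the past.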

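5. Summary

Prometheus evaluates its alert rules every evaluation_interval to decide whether an alert should be sent.
While the alert condition holds, Prometheus resends the alert every 1 minute 15 seconds in this setup: the first evaluation more than the resend delay (1m by default) after the previous send. This happens to equal evaluation_interval + for here.
Keys in the logged payload:
startsAt: when the alert started firing, i.e. the time the expression first matched plus for
endsAt: when the alert ends. While the alert is firing this is a time in the future; once the alert is resolved it is the time of resolution, which is already in the past when the alert is received.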

From: https://www.cnblogs.com/zydev/p/16848444.html
