I. Alerting rules
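Alerting rules are defined in YAML. Each rule uses a PromQL expression to describe the condition that triggers the alert. The Prometheus server evaluates the rules periodically and, when a condition is met, raises the alert and sends a notification. Which rule files are loaded is configured in prometheus.yml. By default Prometheus evaluates alerting rules every 1 minute; the interval can be changed with the evaluation_interval setting under global. For example: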
global:
  evaluation_interval: 15s
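The rule_files setting can name individual files, or it can use a glob pattern to match the files in a particular directory. To keep things manageable, alerts can be split across several files that follow a naming convention; they are loaded once prometheus.yml references them.
The following example creates an alerting rule that checks whether node_exporter is UP. The steps are as follows.
1. Add the node_exporter scrape configuration to prometheus.yml. Four hosts are monitored here; see also: Prometheus node_exporter host monitoring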
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['172.18.250.66:9100','172.18.148.50:9100','172.18.148.51:9100','172.18.148.49:9100']
2. Specify the alerting rule file to load in prometheus.yml
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/node_up_rules.yml"
  # - "second_rules.yml"

# The rules directory
[root@iZwz97yqubb71vyxhuskfyZ prometheus]# pwd
/root/prometheus/prometheus
[root@iZwz97yqubb71vyxhuskfyZ prometheus]# ls
console_libraries consoles data LICENSE nohup.out NOTICE prometheus prometheus.yml promtool rules
3. Create the alerting rule in node_up_rules.yml
groups:
  - name: UP
    rules:
      - alert: node
        expr: up{job="node_exporter"}==0
        for: 3m
        labels:
          severity: critical
        annotations:
          # Note: the text below says 5 minutes, but with "for: 3m" the alert fires after 3 minutes.
          description: "{{ $labels.instance }} has been down for more than 5 minutes."
          summary: "{{ $labels.instance }} down"
4. Check that the rule file is valid
[root@iZwz97yqubb71vyxhuskfyZ prometheus]# pwd
/root/prometheus/prometheus
[root@iZwz97yqubb71vyxhuskfyZ prometheus]# ls
console_libraries consoles data LICENSE nohup.out NOTICE prometheus prometheus.yml promtool rules
[root@iZwz97yqubb71vyxhuskfyZ prometheus]# ./promtool check rules rules/node_up_rules.yml
Checking rules/node_up_rules.yml
  SUCCESS: 1 rules found
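Then restart Prometheus and open the web UI at http://47.107.88.98:9090/rules to confirm that the alerting rule has been added. The expression up{job="node_exporter"}==0 selects targets that are down: up is 0 when a target is offline and 1 when it is online.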
5. Simulate a node_exporter outage
Four node_exporter targets were configured above. Stop one of them to trigger the alerting rule, then open the Prometheus web UI at http://47.107.88.98:9090/alerts?search= to watch the alert state:
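When the page is first opened, the alert is in the PENDING state, because the rule sets for: 3m.
Once the condition has held continuously for 3 minutes, the alert actually fires and its state changes to FIRING.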
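The PENDING to FIRING transition can be pictured with a small simulation. The sketch below is a simplified model written for illustration, not Prometheus code. It assumes one evaluation per minute (the default interval), a for duration of 180 seconds, and made-up sample values for up. The names AlertTracker and simulate are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertTracker:
    """Simplified model of one alert moving through inactive -> pending -> firing."""

    for_seconds: float                    # the rule's "for" duration
    active_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, now: float, condition_true: bool) -> str:
        if not condition_true:
            # Condition no longer holds: the alert resets completely.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        # Fire only once the condition has held for the whole "for" duration.
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


def simulate(up_samples: List[int], interval: float, for_seconds: float) -> List[str]:
    """Evaluate the condition `up == 0` once per interval and record each state."""
    tracker = AlertTracker(for_seconds)
    return [
        tracker.evaluate(i * interval, up == 0)
        for i, up in enumerate(up_samples)
    ]


# Assumed data: the target is up, then down for four evaluations, then up again.
states = simulate([1, 0, 0, 0, 0, 1], interval=60, for_seconds=180)
print(states)
```

With these assumed values the alert stays pending for three evaluations and fires on the fourth, and a single healthy sample resets it to inactive.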
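6. Send to Alertmanager
Once Prometheus produces a FIRING alert, it sends the alert to Alertmanager, which pushes it through a webhook to the specified endpoint. The configuration is as follows: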
[root@iZwz97yqubb71vyxhuskfyZ alertmanager]# cat alertmanager.yml
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://120.79.188.142:8081/api/AlertManager/Receiver/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
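The inhibit_rules entry in this configuration mutes a warning alert while a critical alert with the same alertname, dev and instance labels is firing. The sketch below models that matching logic in Python to make the rule easier to read. It is a simplified illustration, not Alertmanager's implementation; the helper names matches and is_inhibited and the two warning alerts are invented for the example.

```python
from typing import Dict, Iterable

Labels = Dict[str, str]


def matches(labels: Labels, matchers: Labels) -> bool:
    """True if every matcher label is present with exactly the given value."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target: Labels, firing: Iterable[Labels], rule: dict) -> bool:
    """Return True if `target` is muted by some firing alert under `rule`."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # A missing label is treated like an empty value, so labels absent
        # on both sides (such as 'dev' here) still count as equal.
        if all(source.get(n, "") == target.get(n, "") for n in rule["equal"]):
            return True
    return False


# Mirrors the inhibit rule from alertmanager.yml.
RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}

critical = {"alertname": "node", "instance": "172.18.250.66:9100", "severity": "critical"}
# Hypothetical warning-level alerts, invented for this example.
warning_same = {"alertname": "node", "instance": "172.18.250.66:9100", "severity": "warning"}
warning_other = {"alertname": "node", "instance": "172.18.148.50:9100", "severity": "warning"}

print(is_inhibited(warning_same, [critical], RULE))
print(is_inhibited(warning_other, [critical], RULE))
```

Only the warning on the same instance as the critical alert is muted; the warning on a different instance would still be delivered.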
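After Alertmanager receives the alert, the message pushed by Prometheus appears in the Alertmanager web UI, in the area circled in red in the screenshot:
7. Webhook endpoint
A .NET Core Web API serves as the webhook endpoint here. The code is as follows: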
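/// <summary>
/// Receives service status information (the Alertmanager webhook payload)
/// </summary>
[HttpPost]
public async Task<string> Receiver()
{
    try
    {
        // Rewind the body; this only works because buffering is enabled in the middleware
        this.Request.Body.Position = 0;
        using var stream = new StreamReader(this.Request.Body);
        // Await the read instead of blocking the thread on .Result
        var body = await stream.ReadToEndAsync();
        _service.SendQiDian(body);
        return "ok";
    }
    catch (Exception e)
    {
        // Passing the exception lets the logger record the message and stack trace
        _logger.LogError(e, "Error receiving Alertmanager message");
        return "error";
    }
}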
Configure a middleware so that the request body stream can be read more than once:
app.Use(next => context =>
{
    // Allow the request body to be read multiple times
    context.Request.EnableBuffering();
    return next(context);
});
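For readers who are not on .NET, the same kind of receiver can be sketched with the Python standard library. This is an illustrative analogue under stated assumptions, not the author's implementation: the names handle_body, AlertHandler and run are hypothetical, port 8081 is taken from the webhook URL above, and the forwarding step is left as a comment.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def handle_body(raw: bytes) -> Tuple[int, str]:
    """Validate the posted body and choose the HTTP status and reply text."""
    try:
        payload = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        return 400, "error"
    if not isinstance(payload, dict) or "alerts" not in payload:
        return 400, "error"
    # A real receiver would forward `payload` to its notification channel here.
    return 200, "ok"


class AlertHandler(BaseHTTPRequestHandler):
    """Accepts POST requests carrying an Alertmanager webhook payload."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        status, text = handle_body(self.rfile.read(length))
        body = text.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def run(port: int = 8081) -> None:
    """Serve until interrupted; the port is an assumption taken from the webhook URL."""
    HTTPServer(("0.0.0.0", port), AlertHandler).serve_forever()
```

Calling run() starts the listener. Keeping the validation in handle_body, separate from the HTTP plumbing, makes it possible to exercise the logic without opening a socket.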
The JSON received looks like this:
{
  "receiver": "web\\.hook",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "node",
        "instance": "172.18.250.66:9100",
        "job": "node_exporter",
        "severity": "critical"
      },
      "annotations": {
        "description": "172.18.250.66:9100 has been down for more than 5 minutes.",
        "summary": "172.18.250.66:9100 down"
      },
      "startsAt": "2022-10-17T08:01:37.27Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://iZwz97yqubb71vyxhuskfyZ:9090/graph?g0.expr=up%7Bjob%3D%22node_exporter%22%7D+%3D%3D+0\u0026g0.tab=1",
      "fingerprint": "8dc92b025c20ac38"
    }
  ],
  "groupLabels": {
    "alertname": "node"
  },
  "commonLabels": {
    "alertname": "node",
    "instance": "172.18.250.66:9100",
    "job": "node_exporter",
    "severity": "critical"
  },
  "commonAnnotations": {
    "description": "172.18.250.66:9100 has been down for more than 5 minutes.",
    "summary": "172.18.250.66:9100 down"
  },
  "externalURL": "http://iZwz97yqubb71vyxhuskfyZ:9093",
  "version": "4",
  "groupKey": "{}:{alertname=\"node\"}",
  "truncatedAlerts": 0
}
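Before forwarding, a payload like this usually needs to be reduced to a short human-readable message. The sketch below shows one way to do that in Python. The function name summarize_alerts and the output format are assumptions made for illustration, and SAMPLE is a trimmed copy of the payload above that keeps only the fields the function reads.

```python
import json
from typing import List


def summarize_alerts(payload: str) -> List[str]:
    """Turn an Alertmanager webhook payload into one text line per alert."""
    data = json.loads(payload)
    lines = []
    for alert in data.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} ({severity}) {summary}".format(
                status=alert.get("status", "unknown").upper(),
                name=labels.get("alertname", "?"),
                severity=labels.get("severity", "none"),
                summary=annotations.get("summary", ""),
            )
        )
    return lines


# Trimmed copy of the payload shown above.
SAMPLE = json.dumps({
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "node",
                "instance": "172.18.250.66:9100",
                "job": "node_exporter",
                "severity": "critical",
            },
            "annotations": {"summary": "172.18.250.66:9100 down"},
        }
    ],
})

print(summarize_alerts(SAMPLE))
```

Each entry in the alerts array becomes one line, so a grouped notification that carries several alerts yields several lines.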
The message is then pushed on to Tencent Qidian (腾讯企点).
Tags: node, Alertmanager, rules, prometheus, exporter, alerting, yml
From: https://www.cnblogs.com/MrHSR/p/16790817.html