I. Download Alertmanager and webhook-dingtalk
Search www.github.com for alertmanager and webhook-dingtalk, and download the release tarballs.
1. Extract and install webhook-dingtalk
tar -zxvf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
mv prometheus-webhook-dingtalk-2.1.0.linux-amd64 /usr/local/webhook-dingtalk
cp /usr/local/webhook-dingtalk/config.example.yml /usr/local/webhook-dingtalk/config.yml
2. Create the systemd unit, enable it at boot, and start the service
vim /usr/lib/systemd/system/webhook.service

[Unit]
Description=prometheus-webhook-dingtalk
After=network.target

[Service]
ExecStart=/usr/local/webhook-dingtalk/prometheus-webhook-dingtalk --config.file=/usr/local/webhook-dingtalk/config.yml
User=root

[Install]
WantedBy=multi-user.target

systemctl enable webhook.service --now
3. Extract and install Alertmanager
tar zxf alertmanager-0.25.0-rc.2.linux-amd64.tar.gz
mv alertmanager-0.25.0-rc.2.linux-amd64 /usr/local/alertmanager
4. Create the systemd unit, enable it at boot, and start the service
vim /usr/lib/systemd/system/alertmanager.service

[Unit]
Description=Alertmanager
After=network.target

[Service]
ExecStart=/usr/local/alertmanager/alertmanager --cluster.advertise-address=0.0.0.0:9093 --config.file=/usr/local/alertmanager/alertmanager.yml
User=root

[Install]
WantedBy=multi-user.target

systemctl enable alertmanager.service --now
5. Verify that Alertmanager (port 9093) and webhook-dingtalk (port 8060) are listening
ss -ant|egrep "9093|8060"
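The grep above matches 9093 or 8060 anywhere in the output, so it also matches established connections and peer ports. A minimal sketch of a stricter check follows; it assumes `ss -lnt`-style output on stdin, and the function name `port_listening` is hypothetical, not part of any tool.

```shell
# Hypothetical helper: succeeds if the given port appears as a local
# listening address in `ss -lnt`-style output read from stdin.
port_listening() {
    port="$1"
    # Column 1 is the socket state, column 4 is "Local Address:Port".
    # Compare the tail of column 4 against ":<port>".
    awk -v p=":$port" '
        $1 == "LISTEN" && substr($4, length($4) - length(p) + 1) == p { found = 1 }
        END { exit !found }
    '
}

# Usage on a live host (not run here):
#   for p in 9093 8060; do
#       if ss -lnt | port_listening "$p"; then
#           echo "$p is listening"
#       else
#           echo "$p is NOT listening"
#       fi
#   done
```

Reading from stdin keeps the helper independent of `ss` itself, so it can be tried against saved output as well.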
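II. Configuration and testing
1. The webhook-dingtalk configuration is relatively simple; only the following three places need to be changed
2. Add a DingTalk alert template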
vim /usr/local/webhook-dingtalk/template.tmpl

{{ define "__subject" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ end }}

{{ define "__alert_list" }}{{ range . }}
---
{{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}

**Alert summary**: {{ .Annotations.summary }}

**Alert type**: {{ .Labels.alertname }}

**Severity**: {{ .Labels.severity }}

**Host**: {{ .Labels.instance }}

**Details**: {{ index .Annotations "description" }}

**Alert time**: {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}
{{ end }}{{ end }}

{{ define "__resolved_list" }}{{ range . }}
---
{{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}

**Alert summary**: {{ .Annotations.summary }}

**Alert type**: {{ .Labels.alertname }}

**Severity**: {{ .Labels.severity }}

**Host**: {{ .Labels.instance }}

**Details**: {{ index .Annotations "description" }}

**Alert time**: {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

**Recovery time**: {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}
{{ end }}{{ end }}

{{ define "default.title" }}
{{ template "__subject" . }}
{{ end }}

{{ define "default.content" }}
{{ if gt (len .Alerts.Firing) 0 }}
**====Detected {{ .Alerts.Firing | len }} fault(s)====**
{{ template "__alert_list" .Alerts.Firing }}
---
{{ end }}

{{ if gt (len .Alerts.Resolved) 0 }}
**====Recovered {{ .Alerts.Resolved | len }} fault(s)====**
{{ template "__resolved_list" .Alerts.Resolved }}
{{ end }}
{{ end }}

{{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}
{{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}
{{ template "default.title" . }}
{{ template "default.content" . }}
3. Restart the webhook service
systemctl restart webhook.service
4. Configure DingTalk alerting in Alertmanager
vim /usr/local/alertmanager/alertmanager.yml

route:
  group_by: ['dingding']
  group_wait: 30s
  group_interval: 1h
  repeat_interval: 1h
  receiver: 'dingding.webhook1'
  routes:
  - receiver: 'dingding.webhook1'
    match_re:
      alertname: ".*"
receivers:
- name: 'dingding.webhook1'
  webhook_configs:
  - url: 'http://IP:8060/dingtalk/webhook1/send'
    send_resolved: true
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']

Main places to modify
Restart Alertmanager
systemctl restart alertmanager.service
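Once Alertmanager is back up, the path from Alertmanager to DingTalk can be exercised without waiting for a real rule to fire, by posting a synthetic alert to Alertmanager's `/api/v2/alerts` endpoint. The sketch below only builds the JSON payload; the function name `test_alert_json`, the alert name `DingTalkTest`, and the label values are made up for illustration.

```shell
# Hypothetical helper: print a JSON array holding one synthetic alert, in the
# shape accepted by Alertmanager's POST /api/v2/alerts endpoint.
# Argument 1 (optional): value for the "instance" label.
test_alert_json() {
    instance="${1:-test-host}"
    cat <<EOF
[
  {
    "labels": {
      "alertname": "DingTalkTest",
      "severity": "warning",
      "instance": "$instance"
    },
    "annotations": {
      "summary": "Test alert for the DingTalk webhook",
      "description": "Sent manually to verify the alerting pipeline."
    }
  }
]
EOF
}

# Usage on a live host (replace IP as in alertmanager.yml; not run here):
#   test_alert_json my-host | curl -s -X POST \
#       -H 'Content-Type: application/json' \
#       --data-binary @- http://IP:9093/api/v2/alerts
```

With `group_wait: 30s` as configured above, the DingTalk message should arrive roughly half a minute after the request is accepted.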
5. Integrate Prometheus with Alertmanager and configure the alert rules
vim /usr/local/prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - IP:9093
rule_files:
  - "/usr/local/prometheus/rule/node_exporter.yml"
scrape_configs:
  - job_name: "VMware"
    static_configs:
      - targets: ["IP:59100"]

Key places to modify
6. Add node_exporter alert rules
mkdir -p /usr/local/prometheus/rule
vim /usr/local/prometheus/rule/node_exporter.yml

groups:
- name: Server resource monitoring
  rules:
  - alert: HighMemoryUsage
    expr: 100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} memory usage is too high, please handle it as soon as possible!"
      description: "{{ $labels.instance }} memory usage exceeds 80%, current usage {{ $value }}%."
  - alert: ServerDown
    expr: up == 0
    for: 1s
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} is down, please handle it as soon as possible!"
      description: "{{ $labels.instance }} is unreachable (up == 0), current value {{ $value }}."
  - alert: HighCpuLoad
    expr: 100 - (avg by (instance,job)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} CPU usage is too high, please handle it as soon as possible!"
      description: "{{ $labels.instance }} CPU usage exceeds 90%, current usage {{ $value }}%."
  - alert: HighDiskIO
    expr: avg(irate(node_disk_io_time_seconds_total[1m])) by (instance,job) * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} disk I/O utilization is too high, please handle it as soon as possible!"
      description: "{{ $labels.instance }} disk I/O utilization exceeds 90%, current utilization {{ $value }}%."
  - alert: NetworkReceive
    expr: ((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance,job)) / 100) > 102400
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} inbound network bandwidth is too high, please handle it as soon as possible!"
      description: "{{ $labels.instance }} inbound network bandwidth has stayed above 100M for 5 minutes. RX bandwidth usage {{ $value }}."
  - alert: NetworkTransmit
    expr: ((sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance,job)) / 100) > 102400
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} outbound network bandwidth is too high, please handle it as soon as possible!"
      description: "{{ $labels.instance }} outbound network bandwidth has stayed above 100M for 5 minutes. TX bandwidth usage {{ $value }}."
  - alert: TcpConnections
    expr: node_netstat_Tcp_CurrEstab > 10000
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Too many TCP_ESTABLISHED connections!"
      description: "{{ $labels.instance }} has more than 10000 TCP_ESTABLISHED connections, current count {{ $value }}."
  - alert: DiskUsage
    expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.mountpoint }} disk partition usage is too high, please handle it as soon as possible!"
      description: "{{ $labels.instance }} disk partition usage exceeds 90%, current usage {{ $value }}%."
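The two network rules divide a bytes-per-second rate by 100 and compare the result with 102400, while their descriptions speak of "100M". As a quick sanity check, the arithmetic below works out what rate that comparison corresponds to; it is only a worked calculation of the expression as written, not a statement of what the threshold ought to be.

```shell
# rate(node_network_*_bytes_total[5m]) is in bytes per second, and the rule
# fires when (rate / 100) > 102400, i.e. when rate > 102400 * 100.
threshold_bytes_per_sec=$((102400 * 100))

# Same threshold in other units (shell integer arithmetic truncates).
threshold_mbit_per_sec=$((threshold_bytes_per_sec * 8 / 1000000))
threshold_mib_per_sec=$((threshold_bytes_per_sec / 1024 / 1024))

echo "bytes/s: $threshold_bytes_per_sec"
echo "Mbit/s (truncated): $threshold_mbit_per_sec"
echo "MiB/s (truncated): $threshold_mib_per_sec"
```

As written, the rules trigger at about 10.24 million bytes per second, which is roughly 82 Mbit/s. If exactly 100 Mbit/s or 100 MiB/s is intended, the divisor or the constant would need adjusting.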
Restart Prometheus
systemctl restart prometheus.service
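A syntax error in any of the files edited above can keep a service from starting, so it may help to validate them before each restart. The Prometheus and Alertmanager tarballs ship `promtool` and `amtool` for this. The wrapper below is a minimal sketch that assumes both tools are on `PATH` (they may instead sit in the install directories); `run_check` is a hypothetical name.

```shell
# Hypothetical wrapper: run a validation tool only if it is installed,
# otherwise print a note and carry on.
run_check() {
    tool="$1"
    shift
    if command -v "$tool" >/dev/null 2>&1; then
        "$tool" "$@"
    else
        echo "skip: $tool not found in PATH"
    fi
}

# Usage on a live host (not run here):
#   run_check promtool check config /usr/local/prometheus/prometheus.yml
#   run_check promtool check rules /usr/local/prometheus/rule/node_exporter.yml
#   run_check amtool check-config /usr/local/alertmanager/alertmanager.yml
```

`promtool check config` also checks the rule files listed under `rule_files`, so the second command is mainly useful while editing a rule file on its own.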
Tags: Alertmanager, dingtalk, webhook, labels, instance, Prometheus, alerting
From: https://www.cnblogs.com/hm1825/p/17854166.html