
Implementing DingTalk Alerts with Prometheus + AlertManager + webhook-dingtalk

Date: 2023-08-25 12:02:14

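Prerequisites

In my previous post (Prometheus_彭阳的技术博客_51CTO博客) I covered how Prometheus monitoring works, how to set up the monitoring service, Prometheus's built-in functions, and more. If you have not yet set up Prometheus and node_exporter for basic host monitoring, follow that post first. This section focuses on using the prometheus-webhook-dingtalk plugin together with Prometheus Alertmanager to send alerts to DingTalk.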


Lab architecture

[Figure: architecture diagram of the lab setup]


Setup steps

Step 1: Create a DingTalk robot and save its webhook address

[Figure: screenshots of creating the DingTalk robot]
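
Before wiring up the other components, it can help to confirm that the robot's webhook address works on its own. The sketch below assumes the robot accepts a plain "text" message in DingTalk's custom robot JSON format; the function names and the `DINGTALK_WEBHOOK` variable are made up for illustration. If the robot's security setting is a custom keyword, the message text has to contain that keyword.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking the robot webhook saved in step 1.

# Build the JSON body of a DingTalk "text" message.
# The content is inserted verbatim, so avoid double quotes and backslashes in it.
build_text_message() {
  local content="$1"
  printf '{"msgtype":"text","text":{"content":"%s"}}' "$content"
}

# POST the message to the robot webhook (URL passed as the first argument).
send_text_message() {
  local webhook_url="$1" content="$2"
  curl -sS -X POST -H 'Content-Type: application/json' \
    -d "$(build_text_message "$content")" "$webhook_url"
}
```

A call such as `send_text_message "$DINGTALK_WEBHOOK" "alert test"` should make the message appear in the group. DingTalk answers with a small JSON body, and an `errcode` of 0 means the message was accepted.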

Step 2: Install and start prometheus-webhook-dingtalk

wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.1.0/prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
mv prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz /opt/prometheus/
cd /opt/prometheus/
tar xf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
rm -f prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
mv prometheus-webhook-dingtalk-2.1.0.linux-amd64 webhook_dingtalk


# DingTalk alerting configuration
cat > /opt/prometheus/webhook_dingtalk/dingtalk.yml << 'EOF'
templates:
  - /opt/prometheus/webhook_dingtalk/template.tmpl
timeout: 5s

targets:
  webhook_robot:
    # Webhook address generated when the DingTalk robot was created
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxx
  webhook_mention_all:
    # Webhook address generated when the DingTalk robot was created
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxx
    # Mention everyone in the group
    mention:
      all: true
EOF

# Alert message template
cat > /opt/prometheus/webhook_dingtalk/template.tmpl << 'EOF'
{{ define "__subject" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ end }}
  
  
{{ define "__alert_list" }}{{ range . }}
---
{{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}

**Alert summary**: {{ .Annotations.summary }}

**Alert name**: {{ .Labels.alertname }}

**Severity**: {{ .Labels.severity }}

**Host**: {{ .Labels.instance }}

**Details**: {{ index .Annotations "description" }}

**Started at**: {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}
{{ end }}{{ end }}
  
{{ define "__resolved_list" }}{{ range . }}
---
{{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}

**Alert summary**: {{ .Annotations.summary }}

**Alert name**: {{ .Labels.alertname }}

**Severity**: {{ .Labels.severity }}

**Host**: {{ .Labels.instance }}

**Details**: {{ index .Annotations "description" }}

**Started at**: {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

**Resolved at**: {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}
{{ end }}{{ end }}
  
  
{{ define "default.title" }}
{{ template "__subject" . }}
{{ end }}
  
{{ define "default.content" }}
{{ if gt (len .Alerts.Firing) 0 }}
**==== {{ .Alerts.Firing | len }} firing alert(s) detected ====**
{{ template "__alert_list" .Alerts.Firing }}
---
{{ end }}

{{ if gt (len .Alerts.Resolved) 0 }}
**==== {{ .Alerts.Resolved | len }} alert(s) resolved ====**
{{ template "__resolved_list" .Alerts.Resolved }}
{{ end }}
{{ end }}
  
  
{{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}
{{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}
{{ template "default.title" . }}
{{ template "default.content" . }}
EOF

# systemd unit so the service starts on boot
cat > /usr/lib/systemd/system/webhook_dingtalk.service << EOF
[Unit]
Description=https://prometheus.io
 
[Service]
Restart=on-failure
ExecStart=/opt/prometheus/webhook_dingtalk/prometheus-webhook-dingtalk --config.file=/opt/prometheus/webhook_dingtalk/dingtalk.yml --web.listen-address=:8060
 
[Install]
WantedBy=multi-user.target
EOF
 
 
systemctl daemon-reload
systemctl restart webhook_dingtalk.service
systemctl status webhook_dingtalk.service
systemctl enable webhook_dingtalk.service

# Verify that the service is listening
netstat -anput |grep 8060

Step 3: Install and start AlertManager

wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0-rc.2/alertmanager-0.25.0-rc.2.linux-amd64.tar.gz
mv alertmanager-0.25.0-rc.2.linux-amd64.tar.gz /opt/prometheus/
cd /opt/prometheus/
tar xf alertmanager-0.25.0-rc.2.linux-amd64.tar.gz
rm -f alertmanager-0.25.0-rc.2.linux-amd64.tar.gz
mv alertmanager-0.25.0-rc.2.linux-amd64/* .

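# Write the Alertmanager configuration (overwrites the sample file shipped in the tarball)
cat > alertmanager.yml << EOF
global:
  # How long to wait without updates before declaring an alert resolved
  resolve_timeout: 5m

route:
  # Labels used to group incoming alerts
  group_by: ["alertname"]
  # How long to wait before sending the first notification for a new group
  group_wait: 10s
  # How long to wait before notifying about new alerts added to an already-notified group
  group_interval: 30s
  # How long to wait before re-sending a notification for alerts that are still firing
  repeat_interval: 1h
  # Default receiver
  receiver: "dingtalk"

receivers:
  # DingTalk
  - name: 'dingtalk'
    webhook_configs:
      # Address of the prometheus-webhook-dingtalk service
      - url: http://10.0.0.63:8060/dingtalk/webhook_robot/send
        send_resolved: true

inhibit_rules:
  # Inhibition rule: a critical alert mutes matching warning alerts
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
EOF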


# systemd unit so the service starts on boot
cat > /usr/lib/systemd/system/alertmanager.service << EOF
[Unit]
Description=https://prometheus.io
 
[Service]
Restart=on-failure
ExecStart=/opt/prometheus/alertmanager --config.file=/opt/prometheus/alertmanager.yml --storage.path=/opt/prometheus/data/
 
[Install]
WantedBy=multi-user.target
EOF
 
 
systemctl daemon-reload
systemctl start alertmanager.service
systemctl status alertmanager.service
systemctl enable alertmanager.service

# Verify that the service is listening
netstat -anput |grep 9093
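
Once both services are listening, the whole notification path can be exercised without waiting for a real rule to fire. The sketch below assumes Alertmanager's v2 HTTP API is reachable and pushes one hand-made alert to `/api/v2/alerts`; the function names, the alert name `ManualTest`, and the label values are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for pushing a hand-made alert into Alertmanager.

# Build a one-element JSON array in the shape accepted by POST /api/v2/alerts.
build_test_alert() {
  local alertname="$1" instance="$2" severity="$3"
  printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"%s"},' \
    "$alertname" "$instance" "$severity"
  printf '"annotations":{"summary":"manual test alert","description":"sent by hand to test the DingTalk route"}}]'
}

# Send the test alert to the Alertmanager base URL given as the first argument.
send_test_alert() {
  local alertmanager_url="$1"
  curl -sS -X POST -H 'Content-Type: application/json' \
    -d "$(build_test_alert ManualTest test-host warning)" \
    "${alertmanager_url}/api/v2/alerts"
}
```

Running `send_test_alert http://10.0.0.63:9093` should produce a DingTalk message shortly after `group_wait` has elapsed. Because the payload carries no end time, the alert is expected to be marked resolved once `resolve_timeout` passes, which also exercises the resolved template.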


Step 4: Integrate Prometheus with AlertManager and configure alert rules

Edit the Prometheus configuration file prometheus.yml

[Figure: screenshot of the edited prometheus.yml]
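
The screenshot is not reproduced in this text, so the exact edit is unknown. As a rough sketch, the additions usually amount to an `alerting` section that points at Alertmanager and a `rule_files` entry. The target 10.0.0.63:9093 and the rule file path used in the usage line are assumptions taken from the other steps of this post, and the helper name is hypothetical. It only prints the fragment so that it can be merged into prometheus.yml by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the fragment to merge into prometheus.yml.
# Arguments: Alertmanager host:port, and the path of the rule file.
print_alerting_fragment() {
  local alertmanager_target="$1" rule_file="$2"
  cat << EOF
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["${alertmanager_target}"]

rule_files:
  - "${rule_file}"
EOF
}
```

For the setup described here the call would be `print_alerting_fragment 10.0.0.63:9093 /opt/prometheus/alert_rules.yml`.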

Alert rule configuration

# The quoted 'EOF' keeps the shell from expanding $labels and $value inside the rules
cat > /opt/prometheus/alert_rules.yml << 'EOF'
groups:
- name: server-resource-monitoring
  rules:
  - alert: HighMemoryUsage
    expr: 100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} memory usage is too high, please handle it as soon as possible!"
      description: "{{ $labels.instance }} memory usage is above 80%, current usage {{ $value }}%."
           
  - alert: HostDown
    expr: up == 0
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} is down, please handle it as soon as possible!"
      description: "{{$labels.instance}} has been unreachable for more than 3 minutes, current up value {{ $value }}."
  
  - alert: HighCpuLoad
    expr: 100 - (avg by (instance,job)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} CPU usage is too high, please handle it as soon as possible!"
      description: "{{$labels.instance}} CPU usage is above 90%, current usage {{ $value }}%."
       
  - alert: HighDiskIO
    expr: avg(irate(node_disk_io_time_seconds_total[1m])) by(instance,job)* 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} disk I/O utilization is too high, please handle it as soon as possible!"
      description: "{{$labels.instance}} disk I/O utilization is above 90%, current utilization {{ $value }}%."
  
  
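  - alert: HighNetworkReceive
    # rate() is in bytes/s, so "/ 100 > 102400" fires above 10,240,000 bytes/s (about 9.8 MiB/s); tune it to your link
    expr: ((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance,job)) / 100) > 102400
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} inbound network bandwidth is too high, please handle it as soon as possible!"
      description: "{{$labels.instance}} inbound network traffic has stayed above the threshold for 5 minutes. RX value {{$value}}."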
  
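  - alert: HighNetworkTransmit
    # rate() is in bytes/s, so "/ 100 > 102400" fires above 10,240,000 bytes/s (about 9.8 MiB/s); tune it to your link
    expr: ((sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance,job)) / 100) > 102400
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} outbound network bandwidth is too high, please handle it as soon as possible!"
      description: "{{$labels.instance}} outbound network traffic has stayed above the threshold for 5 minutes. TX value {{$value}}."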
   
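  - alert: TooManyTcpConnections
    expr: node_netstat_Tcp_CurrEstab > 10000
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} has too many established TCP connections!"
      description: "{{$labels.instance}} has more than 10000 connections in the ESTABLISHED state, current count {{ $value }}."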
  
  - alert: DiskSpaceLow
    expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes{fstype=~"ext4|xfs"}*100) > 90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.mountpoint}} disk partition usage is too high, please handle it as soon as possible!"
      description: "{{$labels.instance}} disk partition usage is above 90%, current usage {{ $value }}%."
EOF
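
The two network rules divide the byte rate by 100 and compare the result with 102400, which makes the real threshold easy to misread. The arithmetic below works it out, assuming the `rate()` of the `node_network_*_bytes_total` counters is in bytes per second. The helper names are illustrative only.

```shell
#!/usr/bin/env bash
# rate(bytes) / divisor > limit   is equivalent to   rate(bytes) > limit * divisor

# Threshold in bytes per second for a given divisor and limit.
threshold_bytes_per_sec() {
  local divisor="$1" limit="$2"
  echo $(( limit * divisor ))
}

# The same threshold in whole megabits per second (1 Mbit = 1,000,000 bits), rounded down.
threshold_mbit_per_sec() {
  local divisor="$1" limit="$2"
  echo $(( limit * divisor * 8 / 1000000 ))
}

threshold_bytes_per_sec 100 102400   # prints 10240000
threshold_mbit_per_sec 100 102400    # prints 81
```

So the rules as written fire at roughly 9.8 MiB/s. If a limit of 100 MiB/s were wanted instead, a divisor of 1024 would give it, since `threshold_bytes_per_sec 1024 102400` prints 104857600.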

Restart the Prometheus service

systemctl restart prometheus.service


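Open the Prometheus web page at http://10.0.0.63/alerts (add the port, 9090 by default, unless Prometheus sits behind a proxy) to see the newly added rules, as shown in the figure below: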

[Figure: Prometheus Alerts page listing the loaded rules]

Final result

[Figure: alert message received in DingTalk]

From: https://blog.51cto.com/u_15703497/7228398
