
Configuring Alertmanager for Prometheus (DingTalk Alerts)

Date: 2023-02-27 19:24:11
Tags: Alertmanager dingtalk labels webhook value instance Prometheus alertmanager alerts


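Introduction

  • Alertmanager receives the alerts that Prometheus sends. It supports many notification channels, such as email, WeChat, DingTalk, and Slack, and it makes it easy to deduplicate alerts, reduce noise, and group them, which makes it a very usable alert notification system.
  • The figure below shows the basic architecture of Alertmanager and Prometheus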
    screenshot_1612509722570

1. Deploy Alertmanager from the binary release

This article installs version 0.24.0.
image-20230227173019099

  • Choose an installation directory that suits your server and upload the package.
cd /prometheus
# Extract the archive
tar -xvzf alertmanager-0.24.0.linux-amd64.tar.gz
mv alertmanager-0.24.0.linux-amd64 alertmanager
cd alertmanager

image-20230227173428420

  • Write the systemd service unit

Create the alertmanager.service unit file:

cd /usr/lib/systemd/system
vim alertmanager.service
  • Put the following content into alertmanager.service, then save and quit with :wq
[Unit]
Description=https://prometheus.io

[Service]
Restart=on-failure
ExecStart=/prometheus/alertmanager/alertmanager --config.file=/prometheus/alertmanager/alertmanager.yml --storage.path=/prometheus/alertmanager/data/

[Install]
WantedBy=multi-user.target
  • Check the unit file
cat alertmanager.service 

image-20230227174343094

  • Reload the systemd configuration and start the service
systemctl daemon-reload
systemctl start alertmanager.service
  • Check the service status
systemctl status alertmanager.service

image-20230227174420137

  • Enable the service at boot
systemctl enable alertmanager.service

image-20230227174446681

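Accessing the web UI

  • Open http://<server-ip>:9093 and make sure the firewall or security group allows the port
    image-20230227174621398
  • If you see the page shown above, Alertmanager has been deployed successfully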
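
Besides opening the web UI, the deployment can also be checked from the shell. The sketch below polls Alertmanager's /-/healthy endpoint with curl. The helper name wait_for_http, the retry count, and the delay are illustrative choices rather than part of the original setup, and the sketch assumes curl is installed and that it runs on the Alertmanager host.

```shell
#!/bin/sh
# wait_for_http URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until curl gets a non-error HTTP response or the attempts run out.
wait_for_http() {
  url=$1
  attempts=${2:-10}
  delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: fail on HTTP errors (>= 400), -s: silent, --max-time: never hang
    if curl -fs --max-time 3 -o /dev/null "$url"; then
      echo "ok: $url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "unreachable: $url" >&2
  return 1
}

# Example (not run automatically):
#   wait_for_http http://localhost:9093/-/healthy
```

A non-zero exit status usually means the service is not running or the port is blocked; systemctl status alertmanager.service is the next place to look.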

2. Configure the DingTalk robot

  • Open the smart group assistant in DingTalk and click Add Robot

    image-20230227175038217

  • Choose Custom Robot

    image-20230227175058364

    image-20210210143832145

  • Copy the webhook URL, then click Save

    image-20210210143924405
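
Before wiring the robot into Alertmanager, it can help to confirm that the webhook URL works on its own. The sketch below posts a plain text message to the robot with curl. The function names are hypothetical, the token is the same placeholder used later in this article, and the sketch assumes the robot's security settings accept the request (for example, if the robot was created with a keyword filter, the message text must contain that keyword).

```shell
#!/bin/sh
# Build the JSON body for a DingTalk "text" message.
# Note: this simple helper does not escape quotes or backslashes in the text.
dingtalk_text_payload() {
  printf '{"msgtype":"text","text":{"content":"%s"}}' "$1"
}

# Post a text message to a robot webhook.
# $1 = full webhook URL (including access_token), $2 = message text
dingtalk_send_text() {
  curl -sS --max-time 5 -H 'Content-Type: application/json' \
    -d "$(dingtalk_text_payload "$2")" "$1"
}

# Example (placeholder token, not run automatically):
#   dingtalk_send_text 'https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx' 'Prometheus test message'
```

If the message does not show up in the group, the JSON response printed by curl normally carries an error code and message that point to the rejected security check.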

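3. Install the DingTalk service (installing with Docker is not recommended: the installation documentation for the new version has not been updated in a long time)

1) Binary installation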

image-20230227175936684

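  • Choose an installation directory that suits your server and upload the package.
  1. Once the package has been downloaded, start the installation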
cd /prometheus
tar -xvzf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
mv prometheus-webhook-dingtalk-2.1.0.linux-amd64 webhook_dingtalk
cd webhook_dingtalk

image-20230227180725016

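  • Write the configuration file and fill in the DingTalk webhook URL obtained above. Indent with spaces only: YAML rejects tab characters used for indentation, including in front of a comment, and the service then fails at startup. If in doubt, delete every # comment line after copying.
vim dingtalk.yml
timeout: 5s

targets:
  webhook_robot:
    # Webhook URL generated when the DingTalk robot was created
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
  webhook_mention_all:
    # Webhook URL generated when the DingTalk robot was created
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    # Mention everyone in the group
    mention:
      all: true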
  • Write the systemd service unit

Create the webhook_dingtalk.service unit file:

cd /usr/lib/systemd/system
vim webhook_dingtalk.service
  • Put the following content into webhook_dingtalk.service, then save and quit with :wq
[Unit]
Description=https://prometheus.io

[Service]
Restart=on-failure
ExecStart=/prometheus/webhook_dingtalk/prometheus-webhook-dingtalk --config.file=/prometheus/webhook_dingtalk/dingtalk.yml --web.listen-address=:8060

[Install]
WantedBy=multi-user.target
  • Check the unit file
cat webhook_dingtalk.service 

image-20230227180925914

  • Reload the systemd configuration and start the service
systemctl daemon-reload
systemctl start webhook_dingtalk.service
  • Check the service status
systemctl status webhook_dingtalk.service

image-20230227182038176

  • Enable the service at boot
systemctl enable webhook_dingtalk.service
  • Note down the value urls=http://localhost:8060/dingtalk/webhook_robot/send from the service output; the Alertmanager configuration that follows uses it
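
If the service refuses to start after dingtalk.yml was pasted from a web page, a stray tab character is a likely cause, because YAML does not accept tabs as indentation. The sketch below lists every line of a file that contains a tab. The function name find_tabs is hypothetical, and the path in the example is the one used in this article.

```shell
#!/bin/sh
# Print every line of a file that contains a tab character, with line numbers.
# Exit status is 0 when tabs were found and 1 when the file is clean.
find_tabs() {
  tab=$(printf '\t')
  grep -n "$tab" "$1"
}

# Example (not run automatically):
#   find_tabs /prometheus/webhook_dingtalk/dingtalk.yml && echo "replace the tabs above with spaces"
```

The same check applies to alertmanager.yml, alert_rules.yml, and prometheus.yml.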

Configure Alertmanager

  • Open /prometheus/alertmanager/alertmanager.yml and change it to the following

    global:
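      # How long to wait before an alert that is no longer being reported is declared resolved
      resolve_timeout: 5m
    
    route:
      # Labels by which incoming alerts are grouped
      group_by: ["alertname"]
      # How long to wait before sending the first notification for a new group
      group_wait: 10s
      # How long to wait before notifying about new alerts added to a group that was already notified
      group_interval: 30s
      # How long to wait before repeating a notification for alerts that are still firing
      repeat_interval: 5m
      # Default receiver
      receiver: "dingtalk"
    
    receivers:
      # DingTalk
      - name: 'dingtalk'
        webhook_configs:
          # Address of the prometheus-webhook-dingtalk service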
          - url: http://1xx.xx.xx.7:8060/dingtalk/webhook_robot/send
            send_resolved: true
    
  • In the root of the Prometheus installation directory, add an alert_rules.yml file with the following content

    groups:
      - name: alert_rules
        rules:
          - alert: CpuUsageAlertWarning
            expr: sum(avg(irate(node_cpu_seconds_total{mode!='idle'}[5m])) without (cpu)) by (instance) > 0.60
            for: 2m
            labels:
              level: warning
            annotations:
              summary: "Instance {{ $labels.instance }} CPU usage high"
              description: "{{ $labels.instance }} CPU usage above 60% (current value: {{ $value }})"
          - alert: CpuUsageAlertSerious
            #expr: sum(avg(irate(node_cpu_seconds_total{mode!='idle'}[5m])) without (cpu)) by (instance) > 0.85
            expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{job=~".*",mode="idle"}[5m])) * 100)) > 85
            for: 3m
            labels:
              level: serious
            annotations:
              summary: "Instance {{ $labels.instance }} CPU usage high"
              description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
          - alert: MemUsageAlertWarning
            expr: avg by(instance) ((1 - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) / node_memory_MemTotal_bytes) * 100) > 70
            for: 2m
            labels:
              level: warning
            annotations:
              summary: "Instance {{ $labels.instance }} MEM usage high"
              description: "{{$labels.instance}}: MEM usage is above 70% (current value is: {{ $value }})"
          - alert: MemUsageAlertSerious
            expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)/node_memory_MemTotal_bytes > 0.90
            for: 3m
            labels:
              level: serious
            annotations:
              summary: "Instance {{ $labels.instance }} MEM usage high"
              description: "{{ $labels.instance }} MEM usage above 90% (current value: {{ $value }})"
          - alert: DiskUsageAlertWarning
            expr: (1 - node_filesystem_free_bytes{fstype!="rootfs",mountpoint!="",mountpoint!~"/(run|var|sys|dev).*"} / node_filesystem_size_bytes) * 100 > 80
            for: 2m
            labels:
              level: warning
            annotations:
              summary: "Instance {{ $labels.instance }} Disk usage high"
              description: "{{$labels.instance}}: Disk usage is above 80% (current value is: {{ $value }})"
          - alert: DiskUsageAlertSerious
            expr: (1 - node_filesystem_free_bytes{fstype!="rootfs",mountpoint!="",mountpoint!~"/(run|var|sys|dev).*"} / node_filesystem_size_bytes) * 100 > 90
            for: 3m
            labels:
              level: serious
            annotations:
              summary: "Instance {{ $labels.instance }} Disk usage high"
              description: "{{$labels.instance}}: Disk usage is above 90% (current value is: {{ $value }})"
          - alert: NodeFileDescriptorUsage
            expr: avg by (instance) (node_filefd_allocated{} / node_filefd_maximum{}) * 100 > 60
            for: 2m
            labels:
              level: warning
            annotations:
              summary: "Instance {{ $labels.instance }} File Descriptor usage high"
              description: "{{$labels.instance}}: File Descriptor usage is above 60% (current value is: {{ $value }})"
          - alert: NodeLoad15
            expr: avg by (instance) (node_load15{}) > 80
            for: 2m
            labels:
              level: warning
            annotations:
              summary: "Instance {{ $labels.instance }} Load15 usage high"
              description: "{{$labels.instance}}: Load15 is above 80 (current value is: {{ $value }})"
          - alert: NodeAgentStatus
            expr: avg by (instance) (up{}) == 0
            for: 2m
            labels:
              level: warning
            annotations:
              summary: "{{$labels.instance}}: has been down"
              description: "{{$labels.instance}}: Node_Exporter Agent is down (current value is: {{ $value }})"
          - alert: NodeProcsBlocked
            expr: avg by (instance) (node_procs_blocked{}) > 10
            for: 2m
            labels:
              level: warning
            annotations:
              summary: "Instance {{ $labels.instance }}  Process Blocked usage high"
              description: "{{$labels.instance}}: Node Blocked Procs detected! above 10 (current value is: {{ $value }})"
          - alert: NetworkTransmitRate
            #expr:  avg by (instance) (floor(irate(node_network_transmit_bytes_total{device="ens192"}[2m]) / 1024 / 1024)) > 50
            expr:  avg by (instance) (floor(irate(node_network_transmit_bytes_total{}[2m]) / 1024 / 1024 * 8 )) > 40
            for: 1m
            labels:
              level: warning
            annotations:
              summary: "Instance {{ $labels.instance }} Network Transmit Rate usage high"
              description: "{{$labels.instance}}: Node Transmit Rate (Upload) is above 40Mbps (current value is: {{ $value }}Mbps)"
          - alert: NetworkReceiveRate
            #expr:  avg by (instance) (floor(irate(node_network_receive_bytes_total{device="ens192"}[2m]) / 1024 / 1024)) > 50
            expr:  avg by (instance) (floor(irate(node_network_receive_bytes_total{}[2m]) / 1024 / 1024 * 8 )) > 40
            for: 1m
            labels:
              level: warning
            annotations:
              summary: "Instance {{ $labels.instance }} Network Receive Rate usage high"
              description: "{{$labels.instance}}: Node Receive Rate (Download) is above 40Mbps (current value is: {{ $value }}Mbps)"
          - alert: DiskReadRate
            expr: avg by (instance) (floor(irate(node_disk_read_bytes_total{}[2m]) / 1024 )) > 200
            for: 2m
            labels:
              level: warning
            annotations:
              summary: "Instance {{ $labels.instance }} Disk Read Rate usage high"
              description: "{{$labels.instance}}: Node Disk Read Rate is above 200KB/s (current value is: {{ $value }}KB/s)"
          - alert: DiskWriteRate
            expr: avg by (instance) (floor(irate(node_disk_written_bytes_total{}[2m]) / 1024 / 1024 )) > 20
            for: 2m
            labels:
              level: warning
            annotations:
              summary: "Instance {{ $labels.instance }} Disk Write Rate usage high"
              description: "{{$labels.instance}}: Node Disk Write Rate is above 20MB/s (current value is: {{ $value }}MB/s)"
    
  • Edit prometheus.yml and change the three sections at the top of the file to the following

    global:
      scrape_interval:     15s 
      evaluation_interval: 15s 
    
    alerting:
      alertmanagers:
      - static_configs:
        # Alertmanager service address
        - targets: ['11x.xx.x.7:9093']
    
    rule_files:
      - "alert_rules.yml"
    
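  • Run curl -XPOST localhost:9090/-/reload to reload the Prometheus configuration. The reload endpoint only responds if Prometheus was started with --web.enable-lifecycle; otherwise restart Prometheus instead

  • Run systemctl restart alertmanager.service (or docker restart alertmanager if Alertmanager runs in Docker) to restart Alertmanager with the new configuration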
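
A syntax error in any of the three files only shows up at reload or restart time, so it can be worth validating them first. Alertmanager ships with amtool and Prometheus ships with promtool, and both have check subcommands. The sketch below wraps them in a small helper that skips a tool it cannot find. The helper name run_check and the PROM_DIR variable are hypothetical, and the paths in the example assume the directory layout used in this article.

```shell
#!/bin/sh
# run_check TOOL ARGS...
# Runs a validation tool if it is available; otherwise reports that it was skipped.
run_check() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "skip: $1 not found"
  fi
}

# Example (paths are assumptions, adjust to your layout; not run automatically):
#   PROM_DIR=/path/to/prometheus
#   run_check /prometheus/alertmanager/amtool check-config /prometheus/alertmanager/alertmanager.yml
#   run_check "$PROM_DIR/promtool" check rules "$PROM_DIR/alert_rules.yml"
#   run_check "$PROM_DIR/promtool" check config "$PROM_DIR/prometheus.yml"
```

Each tool exits with a non-zero status and prints the offending line when a file is invalid, which is easier to read than a failed service start.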

Verify the configuration

  • Open the Prometheus web UI; the Alerts tab now lists the new rules

    image-20230227182936328

  • Now stop one of the monitored nodes manually

    docker stop  mysqld
    
  • Refresh Prometheus: one rule changes color and enters the pending state

    image-20210210152723906

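  • Wait a moment. Once the rule's for duration in alert_rules.yml has elapsed (2m for NodeAgentStatus), the color turns red and the alert enters the firing state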

    image-20210210152829210

  • Check the Alertmanager UI: the corresponding alert also appears there

    image-20210210152851241

  • If everything is configured correctly, the DingTalk robot now posts a message

    image-20230227183644457

  • Now start the mysqld-exporter service again

    docker start mysqld
    
  • After the configured wait, provided the service has not gone down again in the meantime, the DingTalk robot sends a recovery message
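
The notification path can also be exercised without stopping a real service, by pushing a hand-made alert into Alertmanager through its v2 HTTP API. The sketch below is one way to do that. The function names, the alert name ManualTest, and the instance value demo-host are made up for illustration; the level label follows the rules in alert_rules.yml.

```shell
#!/bin/sh
# Build a one-element alert list in the shape accepted by POST /api/v2/alerts.
# $1 = alertname, $2 = instance label, $3 = summary annotation
# Note: values are inserted verbatim, so they must not contain double quotes.
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","instance":"%s","level":"warning"},"annotations":{"summary":"%s"}}]' "$1" "$2" "$3"
}

# Send the alert to Alertmanager. $1 = base URL, remaining arguments as above.
send_test_alert() {
  base=$1
  shift
  curl -sS --max-time 5 -H 'Content-Type: application/json' \
    -d "$(test_alert_payload "$@")" "$base/api/v2/alerts"
}

# Example (not run automatically):
#   send_test_alert http://localhost:9093 ManualTest demo-host 'manual test alert'
```

With the route shown earlier, the DingTalk message should arrive shortly after group_wait. Because the payload sets no end time and the alert is not re-sent, Alertmanager should mark it resolved once resolve_timeout has passed, and send_resolved: true then produces the recovery message.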

From: https://www.cnblogs.com/blogof-fusu/p/17161554.html
