1. Prometheus Alerting Overview
Alerting in the Prometheus architecture is split into two independent parts. AlertRules (alerting rules) are defined in Prometheus; Prometheus evaluates these rules periodically and, whenever a rule's trigger condition is met, sends the alert to Alertmanager.
Alertmanager is an independent component that receives and processes alerts coming from Prometheus Server (or from any other client program). It can process these alerts further: when a large number of duplicate alerts arrive it deduplicates them, groups them, and routes them to the correct receiver. Alertmanager has built-in support for notification channels such as email and Slack, and also supports Webhook integration for more customized scenarios. For example, DingTalk is not supported natively, but a user can integrate a DingTalk robot through a Webhook and receive alerts in DingTalk. Alertmanager also provides silences and inhibition to tune notification behavior.
1.1 Alertmanager Features
- Grouping
Grouping merges related alerts into a single notification. When, for example, an outage causes a large number of alerts to fire at the same time, grouping combines them into one notification, so you are not flooded with messages and can still locate the problem quickly.
How alerts are grouped, how long to wait before sending, and how notifications are delivered are all configured in the Alertmanager configuration file.
- Inhibition
Inhibition stops the repeated sending of other alerts that are caused by an alert that has already fired.
For example, when an alert fires because a cluster is unreachable, Alertmanager can be configured to ignore all other alerts related to that cluster. This avoids receiving a flood of notifications unrelated to the actual problem.
- Silences
Silences provide a simple mechanism to quickly mute alerts based on labels. If an incoming alert matches a configured silence, Alertmanager does not send a notification for it.
Silences are set up on the Alertmanager web UI. (A configuration sketch covering all three mechanisms follows this list.)
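A minimal sketch of how the three mechanisms are used in practice (label names and values here are illustrative, not taken from the final configuration built later in this article). Grouping and inhibition live in the Alertmanager configuration file, while silences are created in the web UI or from the command line with amtool, which ships with Alertmanager:

# alertmanager configuration sketch
route:
  # alerts that share the same alertname are merged into one notification
  group_by: ['alertname']
  group_wait: 10s
  receiver: 'email'
inhibit_rules:
  # while a critical alert is firing, suppress warning alerts
  # that carry the same alertname and instance labels
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

# create a 2-hour silence from the command line (alert name is illustrative)
docker exec -it alertmanager amtool silence add alertname="HostOutOfMemory" \
  --comment="planned maintenance" --duration=2h \
  --alertmanager.url=http://localhost:9093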
2. Alertmanager Configuration Overview
docker-compose.yaml:

version: '3.3'

volumes:
  prometheus_data: {}
  grafana_data: {}

networks:
  monitoring:
    driver: bridge

services:
  prometheus:
    image: prom/prometheus:v2.37.6
    container_name: prometheus
    restart: always
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - ./prometheus/:/etc/prometheus/
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      # enable hot reload of the configuration over HTTP
      - '--web.enable-lifecycle'
      # admin API
      #- '--web.enable-admin-api'
      # maximum retention of historical data, default 15 days
      - '--storage.tsdb.retention.time=30d'
    networks:
      - monitoring
    links:
      - alertmanager
      - cadvisor
      - node_exporter
    expose:
      - '9090'
    ports:
      - 9090:9090
    depends_on:
      - cadvisor

  alertmanager:
    image: prom/alertmanager:v0.25.0
    container_name: alertmanager
    restart: always
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - ./alertmanager/:/etc/alertmanager/
    command:
      - '--config.file=/etc/alertmanager/config.yml'
      - '--storage.path=/alertmanager'
    networks:
      - monitoring
    expose:
      - '9093'
    ports:
      - 9093:9093

  cadvisor:
    image: google/cadvisor:latest
    container_name: cadvisor
    restart: always
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    networks:
      - monitoring
    expose:
      - '8080'

  node_exporter:
    image: prom/node-exporter:v1.5.0
    container_name: node-exporter
    restart: always
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc|rootfs/var/lib/docker)($$|/)'
    networks:
      - monitoring
    ports:
      - '9100:9100'

  grafana:
    image: grafana/grafana:9.4.3
    container_name: grafana
    restart: always
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning/:/etc/grafana/provisioning/
    env_file:
      - ./grafana/config.monitoring
    networks:
      - monitoring
    links:
      - prometheus
    ports:
      - 3000:3000
    depends_on:
      - prometheus
Alertmanager is responsible for the unified handling of alerts produced by Prometheus, so its configuration generally contains the following main parts:
- Global configuration (global): defines global, shared parameters such as the global SMTP settings, Slack settings, and so on.
- Templates (templates): defines the templates used for notifications, such as HTML or email templates.
- Route (route): determines how an incoming alert is handled, based on label matching.
- Receivers (receivers): a receiver is an abstract destination; it can be an email address, WeChat, Slack, a Webhook, and so on. Receivers are normally used together with routes.
- Inhibition rules (inhibit_rules): well-chosen inhibition rules reduce the amount of noisy alerts.
The complete configuration format is as follows:

global:
  [ resolve_timeout: <duration> | default = 5m ]
  [ smtp_from: <tmpl_string> ]
  [ smtp_smarthost: <string> ]
  [ smtp_hello: <string> | default = "localhost" ]
  [ smtp_auth_username: <string> ]
  [ smtp_auth_password: <secret> ]
  [ smtp_auth_identity: <string> ]
  [ smtp_auth_secret: <secret> ]
  [ smtp_require_tls: <bool> | default = true ]
  [ slack_api_url: <secret> ]
  [ victorops_api_key: <secret> ]
  [ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
  [ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
  [ opsgenie_api_key: <secret> ]
  [ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
  [ hipchat_api_url: <string> | default = "https://api.hipchat.com/" ]
  [ hipchat_auth_token: <secret> ]
  [ wechat_api_url: <string> | default = "https://qyapi.weixin.qq.com/cgi-bin/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]
  [ http_config: <http_config> ]

templates:
  [ - <filepath> ... ]

route: <route>

receivers:
  - <receiver> ...

inhibit_rules:
  [ - <inhibit_rule> ... ]
The global parameter to pay attention to is resolve_timeout: it defines how long Alertmanager waits without receiving an alert before marking that alert as resolved. This affects when the recovery ("resolved") notification is received; the default is 5 minutes.
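For example, to have Alertmanager wait 10 minutes (an illustrative value, not the one used later in this article) before marking an alert as resolved:

global:
  resolve_timeout: 10m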
3. Prometheus Alerting Rules
Alerting rules in Prometheus let you define trigger conditions as PromQL expressions. The Prometheus server evaluates these rules periodically and fires an alert when a condition is met. By default, the rules and their current firing state can be inspected in the Prometheus web UI. Once Prometheus is connected to Alertmanager, the alerts are sent to an external service such as Alertmanager, where they can be processed further.
- Alerting rules are configured on the Prometheus server.
3.1 Associating Prometheus with Alertmanager
For Prometheus to send the alerts it produces to Alertmanager, the Prometheus configuration file must contain the corresponding Alertmanager settings.
(1) Edit prometheus.yml and add the Alertmanager access address, for example:
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
(2) Add a scrape job for Alertmanager so that Prometheus also collects Alertmanager's own metrics.
  - job_name: 'alertmanager'
    # override the global default: scrape the targets of this job every 15 seconds
    scrape_interval: 15s
    static_configs:
      - targets: ['alertmanager:9093']
After the configuration is complete, reload it:
curl -X POST http://localhost:9090/-/reload
3.2 Configuring Alerting Rules
After node_exporter has been installed, add the following configuration.
3.2.1 Add the scrape configuration to Prometheus
  - job_name: 'node-exporter'
    scrape_interval: 15s
    static_configs:
      - targets: ['node_exporter:9100']
        labels:
          instance: Prometheus服务器
      - targets: ['192.168.10.100:9100']
        labels:
          instance: test服务器
3.2.2 Create the alerting rule file
cd /data/docker-prometheus
vim prometheus/alert.yml

# Alerting rules
groups:
- name: Prometheus alert
  rules:
  # alert when any instance has been unreachable for more than 1 minute
  - alert: 服务告警
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "服务异常,实例:{{ $labels.instance }}"
      description: "{{ $labels.job }} 服务已关闭"
In the rule file, a set of related rules can be defined under one group, and each group can contain multiple alerting rules (rule). An alerting rule consists mainly of the following parts:
- alert: the name of the alerting rule.
- expr: the trigger condition as a PromQL expression, used to evaluate whether any time series satisfies the condition.
- for: optional evaluation wait time; the alert is only sent after the trigger condition has held for this duration. While waiting, a newly triggered alert is in the pending state.
- labels: custom labels; lets the user attach an additional set of labels to the alert.
- annotations: a set of additional information, such as text describing the alert in detail; annotations are sent to Alertmanager together with the alert when it fires.
3.2.3 Load the alerting rules
To make Prometheus pick up the defined alerting rules, the rule file paths must be specified with rule_files in the Prometheus configuration file. On startup, Prometheus automatically scans the rule files under these paths and uses the rules defined there to decide whether to send notifications:
vim prometheus/prometheus.yml

# alerting (trigger) rule configuration
rule_files:
  - "alert.yml"
  - "rules/*.yml"
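Before reloading, the configuration and the rule file can be validated with promtool. This assumes the docker compose stack from section 2, where ./prometheus/ is mounted at /etc/prometheus/ and the promtool binary is included in the prom/prometheus image:

# validate the main configuration (also loads the referenced rule files)
docker exec -it prometheus promtool check config /etc/prometheus/prometheus.yml
# validate the rule file on its own
docker exec -it prometheus promtool check rules /etc/prometheus/alert.yml
# reload Prometheus
curl -X POST http://localhost:9090/-/reload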
3.2.4 View alert status
The alerting rules and their current state can be viewed at http://192.168.10.14:9090/alerts?search=
Alerts received by Alertmanager can be viewed at http://192.168.10.14:9093/#/alerts
3.2.5 PromQL query
PromQL expression:
up == 0
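The same expression can be narrowed with label matchers; for example, to alert only on the node-exporter job (the job name comes from the scrape configuration above):

up{job="node-exporter"} == 0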
4. Configuring Alert Notifications
4.1 Email alerts
4.1.1 Obtain the mailbox authorization code and enable SMTP
Using a 163 mailbox as an example, enable the SMTP service under Settings > POP3/SMTP/IMAP and obtain the authorization code.
4.1.2 Modify the Alertmanager configuration
Edit the Alertmanager configuration file:
# for the docker install, edit:
cd /data/docker-prometheus
vim alertmanager/config.yml   # file name must match --config.file=/etc/alertmanager/config.yml in docker-compose.yaml

# content:
global:
  # 163 SMTP server
  smtp_smarthost: 'smtp.163.com:465'
  # sender address
  smtp_from: '[email protected]'
  # sender account username, i.e. your mailbox
  smtp_auth_username: '[email protected]'
  # sender password: the mailbox authorization code
  smtp_auth_password: 'your-password'
  # TLS verification; false disables it
  smtp_require_tls: false

route:
  group_by: ['alertname']
  # when an alert arrives, wait group_wait (10s) to see whether more alerts arrive; if so, send them together
  group_wait: 10s
  # if the previous notification was sent successfully and a new alert arrives, wait group_interval before sending it
  group_interval: 10s
  # if the previous notification was sent successfully and the problem is not resolved, wait repeat_interval before sending again
  repeat_interval: 4h
  # default receiver; required, and must match one of the receiver names below
  receiver: 'email'

receivers:
  - name: 'email'
    # recipient mailbox
    email_configs:
      - to: '[email protected]'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
Configuration for multiple recipient mailboxes:
receivers:
  - name: 'email'
    # recipient mailboxes
    email_configs:
      - to: '[email protected]'
      - to: '[email protected]'
      - to: '[email protected]'
Reload the configuration:
curl -X POST http://localhost:9093/-/reload
Check the status page:
http://192.168.10.14:9093/#/status
On the status page the rendered configuration masks secrets, so the password appears as auth_password: <secret>.
Open the Prometheus alerts page:
http://192.168.10.14:9090/alerts
- INACTIVE: the rule is not firing; everything is normal and no alert has been produced.
- PENDING: the threshold condition has been met, but not yet for the configured duration.
- FIRING: the threshold has been exceeded for longer than the configured duration and the alert has fired. (A worked example follows this list.)
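For example, with the rule defined above (expr: up == 0, for: 1m): after a target stops responding, the rule first enters PENDING; if the condition still holds one minute later, it switches to FIRING and the alert is sent to Alertmanager.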
4.1.3 Test
Stop one of the monitored services and check whether an alert email is received.
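For example, assuming the docker compose stack from section 2 (the service name node_exporter comes from docker-compose.yaml):

# stop a monitored exporter so that up == 0 and the 服务告警 rule fires
docker-compose stop node_exporter
# wait for the 'for' duration plus group_wait, check the mailbox, then restore the service
docker-compose start node_exporter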
4.1.4 Using a template
Optional: the default notification format also works without a template.
4.1.4.1 Create the template file
cd /data/docker-prometheus
# create the directory that holds the templates
mkdir alertmanager/template

cat > alertmanager/template/email.tmpl <<"EOF"
{{ define "email.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts }}
<h2>@告警通知</h2>
告警程序: prometheus_alert <br>
告警级别: {{ .Labels.severity }} 级 <br>
告警类型: {{ .Labels.alertname }} <br>
故障主机: {{ .Labels.instance }} <br>
告警主题: {{ .Annotations.summary }} <br>
告警详情: {{ .Annotations.description }} <br>
触发时间: {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }} <br>
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts }}
<h2>@告警恢复</h2>
告警程序: prometheus_alert <br>
故障主机: {{ .Labels.instance }}<br>
故障主题: {{ .Annotations.summary }}<br>
告警详情: {{ .Annotations.description }}<br>
告警时间: {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}<br>
恢复时间: {{ .EndsAt.Local.Format "2006-01-02 15:04:05" }}<br>
{{ end }}{{ end -}}
{{- end }}
EOF
4.1.4.2 Update the Alertmanager configuration
vim alertmanager/config.yml

# template configuration
templates:
  - '/etc/alertmanager/template/*.tmpl'
....
receivers:
  - name: 'email'
    # recipient mailbox
    email_configs:
      - to: '[email protected]'
        # email body, rendered from the template file
        html: '{{ template "email.html" .}}'
        send_resolved: true
Reload the configuration:
curl -X POST http://localhost:9093/-/reload
Check:
http://192.168.10.14:9093/#/status
Test:
Check the 163 alert emails and compare the format before and after the change.
4.2 DingTalk alerts
4.2.1 DingTalk setup
a. Register an enterprise DingTalk account
b. Fill in the company information
c. Add a robot
Robots can only be added from the DingTalk desktop client (the mobile client cannot add robots). The "测试钉钉报警" organization only contains me, so the alert messages are sent to the default "测试钉钉报警 全员群" (all-staff group). In a real deployment, create an ops group and add the relevant people to it.
After logging in to the desktop client, click the "..." icon at the bottom left and then open the admin console.
Click the enterprise name created earlier > Contacts > Organization > Add sub-department, then use batch management to move members into the department.
Back in DingTalk you can now see an "Ops" department. Open the department group settings > Robots > Add Robot > Custom > Add, give the robot a name, configure the security settings (IP segment: usually the Alertmanager server address), then copy the webhook URL and extract the access_token from it.
4.2.2 DingTalk alerting with prometheus-webhook-dingtalk
https://github.com/timonwong/prometheus-webhook-dingtalk/releases
Binary installation:

# download the binary package
wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.1.0/prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
# unpack
tar vxf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
ls -l
# move and rename
mv prometheus-webhook-dingtalk-2.1.0.linux-amd64 /opt/prometheus-webhook-dingtalk

# create the configuration file
cat > /opt/prometheus-webhook-dingtalk/config.yml <<"EOF"
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=<replace-with-your-own-TOKEN>
    secret: SEC000000000000000000000
EOF

# create the prometheus user
useradd -M -s /usr/sbin/nologin prometheus
# change ownership of the directory
chown prometheus:prometheus -R /opt/prometheus-webhook-dingtalk

# create the systemd service
cat > /etc/systemd/system/prometheus-webhook-dingtalk.service << "EOF"
[Unit]
Description=prometheus-webhook-dingtalk
Documentation=https://github.com/timonwong/prometheus-webhook-dingtalk
[Service]
User=prometheus
Group=prometheus
Restart=on-failure
ExecStart=/opt/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk \
  --config.file=/opt/prometheus-webhook-dingtalk/config.yml
[Install]
WantedBy=multi-user.target
EOF

# start prometheus-webhook-dingtalk
systemctl daemon-reload
systemctl start prometheus-webhook-dingtalk.service
systemctl enable prometheus-webhook-dingtalk.service
Docker installation of prometheus-webhook-dingtalk:
# create the data directory
mkdir /data/docker-prometheus/prometheus-webhook-dingtalk/ -p

# create the configuration file config.yml
cat > /data/docker-prometheus/prometheus-webhook-dingtalk/config.yml <<"EOF"
#templates:
#  - /etc/prometheus-webhook-dingtalk/templates/default.tmpl
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=<the TOKEN copied earlier>
    secret: SEC000000000000000000000
    #message:
    #  text: '{{ template "default.content" . }}'
EOF
docker-compose.yaml file:
Note: I installed prometheus-webhook-dingtalk on the Prometheus server; installing it on a different machine also works.
cd /data/docker-prometheus/prometheus-webhook-dingtalk/
cat > docker-compose.yaml << "EOF"
version: '3.3'
services:
  webhook:
    image: timonwong/prometheus-webhook-dingtalk:v2.1.0
    container_name: prometheus-webhook-dingtalk
    restart: "always"
    ports:
      - 8060:8060
    command:
      - '--config.file=/etc/prometheus-webhook-dingtalk/config.yml'
    volumes:
      - ./config.yml:/etc/prometheus-webhook-dingtalk/config.yml
      - /etc/localtime:/etc/localtime:ro
EOF
docker-compose up -d
Open: http://192.168.10.14:8060/
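To verify the forwarder works, you can watch its logs, or push a hand-crafted alert in the Alertmanager webhook format to the webhook1 target. The JSON below is a minimal, illustrative payload (field values are made up for the test):

# follow the container logs
docker logs -f prometheus-webhook-dingtalk

# send a test alert through the forwarder to the DingTalk robot
curl -s -H "Content-Type: application/json" \
  -d '{"version":"4","status":"firing","receiver":"dingtalk","groupLabels":{"alertname":"TestAlert"},"commonLabels":{"alertname":"TestAlert","severity":"critical"},"commonAnnotations":{"summary":"manual test"},"externalURL":"http://192.168.10.14:9093","alerts":[{"status":"firing","labels":{"alertname":"TestAlert","instance":"192.168.10.14","severity":"critical"},"annotations":{"summary":"manual test","description":"test message from curl"},"startsAt":"2024-01-01T00:00:00Z","endsAt":"0001-01-01T00:00:00Z"}]}' \
  http://192.168.10.14:8060/dingtalk/webhook1/send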
4.2.3 Alertmanager configuration
Add the following to alertmanager/config.yml:

route:
  receiver: 'dingtalk'   # change the default receiver; this must be modified

receivers:
  - name: 'email'
    # recipient mailbox
    email_configs:
      - to: '[email protected]'
        # whether to send a notification when the alert is resolved
        #send_resolved: true
        html: '{{ template "email.html" .}}'
  - name: 'dingtalk'
    webhook_configs:
      - url: 'http://192.168.10.14:8060/dingtalk/webhook1/send'
        send_resolved: true
Check the configuration:
# docker install: check
docker exec -it alertmanager amtool check-config /etc/alertmanager/config.yml
# binary install: check
/opt/prometheus/alertmanager/amtool check-config /opt/prometheus/alertmanager/alertmanager.yml
Reload:
curl -X POST http://localhost:9093/-/reload
4.2.4 Configure the alerting rules (triggers)
The basic rule was already configured earlier; here the rule file is extended.
vim prometheus/alert.yml
root@os:/data/docker-prometheus# cat prometheus/alert.yml
groups:
- name: Prometheus alert
  rules:
  # alert when any instance has been unreachable for more than 30 seconds
  - alert: 服务告警
    expr: up == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "服务异常,实例:{{ $labels.instance }}"
      description: "{{ $labels.job }} 服务已关闭"
- name: node-exporter
  rules:
  - alert: HostOutOfMemory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "主机内存不足,实例:{{ $labels.instance }}"
      description: "内存可用率<10%,当前值:{{ $value }}"
  - alert: HostMemoryUnderMemoryPressure
    expr: rate(node_vmstat_pgmajfault[1m]) > 1000
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "内存压力不足,实例:{{ $labels.instance }}"
      description: "节点内存压力大。 重大页面错误率高,当前值为:{{ $value }}"
  - alert: HostUnusualNetworkThroughputIn
    expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "异常流入网络吞吐量,实例:{{ $labels.instance }}"
      description: "网络流入流量 > 100 MB/s,当前值:{{ $value }}"
  - alert: HostUnusualNetworkThroughputOut
    expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "异常流出网络吞吐量,实例:{{ $labels.instance }}"
      description: "网络流出流量 > 100 MB/s,当前值为:{{ $value }}"
  - alert: HostUnusualDiskReadRate
    expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "异常磁盘读取,实例:{{ $labels.instance }}"
      description: "磁盘读取> 50 MB/s,当前值:{{ $value }}"
  - alert: HostUnusualDiskWriteRate
    expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "异常磁盘写入,实例:{{ $labels.instance }}"
      description: "磁盘写入> 50 MB/s,当前值:{{ $value }}"
  - alert: HostOutOfDiskSpace
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "磁盘空间不足告警,实例:{{ $labels.instance }}"
      description: "剩余磁盘空间< 10% ,当前值:{{ $value }}"
  - alert: HostDiskWillFillIn24Hours
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "磁盘空间将在24小时内耗尽,实例:{{ $labels.instance }}"
      description: "以当前写入速率预计磁盘空间将在 24 小时内耗尽,当前值:{{ $value }}"
  - alert: HostOutOfInodes
    expr: node_filesystem_files_free{mountpoint="/"} / node_filesystem_files{mountpoint="/"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly{mountpoint="/"} == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "磁盘Inodes不足,实例:{{ $labels.instance }}"
      description: "剩余磁盘 inodes < 10%,当前值: {{ $value }}"
  - alert: HostUnusualDiskReadLatency
    expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "异常磁盘读取延迟,实例:{{ $labels.instance }}"
      description: "磁盘读取延迟 > 100ms,当前值:{{ $value }}"
  - alert: HostUnusualDiskWriteLatency
    expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "异常磁盘写入延迟,实例:{{ $labels.instance }}"
      description: "磁盘写入延迟 > 100ms,当前值:{{ $value }}"
  - alert: high_load
    expr: node_load1 > 4
    for: 2m
    labels:
      severity: page
    annotations:
      summary: "CPU1分钟负载过高,实例:{{ $labels.instance }}"
      description: "CPU1分钟负载>4,已经持续2分钟。当前值为:{{ $value }}"
  - alert: HostCpuIsUnderUtilized
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "cpu负载高,实例:{{ $labels.instance }}"
      description: "cpu负载> 80%,当前值:{{ $value }}"
  - alert: HostCpuStealNoisyNeighbor
    expr: avg by(instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "CPU窃取率异常,实例:{{ $labels.instance }}"
      description: "CPU 窃取率 > 10%。 嘈杂的邻居正在扼杀 VM 性能,或者 Spot 实例可能失去信用,当前值:{{ $value }}"
  - alert: HostSwapIsFillingUp
    expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "磁盘swap空间使用率异常,实例:{{ $labels.instance }}"
      description: "磁盘swap空间使用率>80%"
  - alert: HostNetworkReceiveErrors
    expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "异常网络接收错误,实例:{{ $labels.instance }}"
      description: "网卡{{ $labels.device }}在过去2分钟接收错误率大于0.01,当前值:{{ $value }}"
  - alert: HostNetworkTransmitErrors
    expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "异常网络传输错误,实例:{{ $labels.instance }}"
      description: "网卡{{ $labels.device }}在过去2分钟传输错误率大于0.01,当前值:{{ $value }}"
  - alert: HostNetworkInterfaceSaturated
    expr: (rate(node_network_receive_bytes_total{device!~"^tap.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*"} > 0.8 < 10000
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "异常网络接口饱和,实例:{{ $labels.instance }}"
      description: "网卡{{ $labels.device }}正在超载,当前值{{ $value }}"
  - alert: HostConntrackLimit
    expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "异常连接数,实例:{{ $labels.instance }}"
      description: "连接数过大,当前连接数:{{ $value }}"
  - alert: HostClockSkew
    expr: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "异常时钟偏差,实例:{{ $labels.instance }}"
      description: "检测到时钟偏差,时钟不同步。值为:{{ $value }}"
  - alert: HostClockNotSynchronising
    expr: min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "时钟不同步,实例:{{ $labels.instance }}"
      description: "时钟不同步"
  - alert: NodeFileDescriptorLimit
    expr: node_filefd_allocated / node_filefd_maximum * 100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "预计内核将很快耗尽文件描述符限制"
      description: "{{ $labels.instance }}已分配的文件描述符数超过了限制的80%,当前值为:{{ $value }}"
- name: nginx
  rules:
  # alert when any instance has been unreachable for more than 30 seconds
  - alert: NginxDown
    expr: nginx_up == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "nginx异常,实例:{{ $labels.instance }}"
      description: "{{ $labels.job }} nginx已关闭"
Check the alert messages in the DingTalk group.
4.3 WeChat Work alerts
4.3.1 Register a WeChat Work account
Open https://work.weixin.qq.com/ in a browser, click Register, and fill in the information.
4.3.2 Webhook alerts (alternative to WeChat application alerts)
Add a group robot
After registering, install the WeChat Work mobile app and log in. Note: because this is a test organization, the group robot was created in the all-staff group; in a real deployment, create a dedicated department, add the people who should receive alerts, and create the robot in that department's group.
Binary installation of alertmanager-wechatrobot-webhook
cd /opt/prometheus/
# clone the code
git clone https://gitee.com/linge365/alertmanager-wechatrobot-webhook.git
# enter the directory
cd alertmanager-wechatrobot-webhook
# move the service unit into place
mv alertmanager-wechatrobot-webhook.service /etc/systemd/system/
# add the prometheus user (skip if it already exists)
useradd -M -s /usr/sbin/nologin prometheus
# change ownership
chown -R prometheus:prometheus /opt/prometheus
systemctl start alertmanager-wechatrobot-webhook

# modify the Alertmanager configuration
vim alertmanager/config.yml
# add the following:
route:
  receiver: wechat
receivers:
  - name: "wechat"
    webhook_configs:
      - url: 'http://192.168.10.14:8999/webhook?key=<the WeChat Work webhook key copied earlier>'
        send_resolved: true

# docker install: check the configuration
docker exec -it alertmanager amtool check-config /etc/alertmanager/config.yml
# binary install: check the configuration
/opt/alertmanager/amtool check-config /etc/alertmanager/config.yml

curl -X POST http://localhost:9093/-/reload
4.3.3 WeChat application alerts (alternative to webhook alerts)
- A WeChat Work application requires an IP allowlist to work properly.
- A domain name must be bound as the trusted domain.
Log in to the WeChat Work admin console in a browser.
- Create an application
Click Application Management > Create Application.
- Obtain the AgentId and Secret
After the application is created, copy the AgentId and view the Secret; the Secret is sent to the WeChat Work mobile app.
- Configure the IP allowlist and trusted domain, then verify them
- Obtain the department ID
- Obtain the corp_id
- Modify the Alertmanager configuration
vim alertmanager/config.yml

route:
  receiver: wechat

receivers:
  - name: 'wechat'
    wechat_configs:
      - send_resolved: true
        #to_user: '@all'   # WeChat Work user IDs to notify; @all means everyone
        #to_tag: '1'       # tag created in WeChat Work for alert recipients
        to_party: '1'                  # department ID
        agent_id: '1000002'            # ID of the application created in WeChat Work
        corp_id: 'ww75c7ff0bc812538c'  # enterprise ID in WeChat Work
        api_secret: '-rg8Xtzchefy6w94O6G_qT5gOMhDZt7MsZmHSELAOZw'  # the application's Secret
Check the configuration:
# docker install: check
docker exec -it alertmanager amtool check-config /etc/alertmanager/config.yml
# binary install: check
/opt/alertmanager/amtool check-config /etc/alertmanager/config.yml
curl -X POST http://localhost:9093/-/reload
View Prometheus alerts: http://192.168.10.14:9090/alerts
View Alertmanager alerts: http://192.168.10.14:9093/#/alerts
4.3.4 Using a template
(Optional; applies only to WeChat application alerts.)
cd /data/docker-prometheus
# create the directory that holds the templates (skip if it already exists)
mkdir alertmanager/template

cat > alertmanager/template/wechat.tmpl <<"EOF"
{{ define "wechat.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts }}
@告警通知
告警程序: prometheus_alert
告警级别: {{ .Labels.severity }}级别
告警类型: {{ .Labels.alertname }}
故障主机: {{ .Labels.instance }}
告警主题: {{ .Annotations.summary }}
告警详情: {{ .Annotations.description }}
触发时间: {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts }}
@告警恢复
告警程序: prometheus_alert
故障主机: {{ .Labels.instance }}
故障主题: {{ .Annotations.summary }}
告警详情: {{ .Annotations.description }}
告警时间: {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}
恢复时间: {{ .EndsAt.Local.Format "2006-01-02 15:04:05" }}
{{ end }}{{ end -}}
{{- end }}
EOF

# modify the Alertmanager configuration: add the templates section and the message field

# template configuration
templates:
  - '/etc/alertmanager/template/*.tmpl'
....
receivers:
  - name: 'wechat'
    wechat_configs:
      - send_resolved: true
        # only this line is added
        message: '{{ template "wechat.html" . }}'

# reload
curl -X POST http://localhost:9093/-/reload

Check the status page: http://192.168.10.14:9093/#/status