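Unveiling the guardians behind the business: real business scenarios that show what operations engineering can do!
I have read many blogs by technical experts. Most of them cover the technology alone and never build a business scenario around it, so many operations engineers cannot find a solution when they hit a problem. That is why I want to share real business scenarios, so that we can discuss business problems together, improve our skills faster, and earn promotions and raises sooner.
-----------------------Article begins-----------------------
With prometheus+grafana in place, host monitoring data can be visualized in grafana. When a metric crosses its configured threshold, alertmanager+prometheus-webhook-dingtalk can push the alert to a DingTalk group so that operations staff notice it and handle it.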
Prometheus rule configuration:
groups:
- name: consul-datacenter-node-alert
  rules:
  - alert: "Exporter not started or down"
    expr: up{job =~ "node-nginx"} == 0
    for: 30s
    labels:
      env: production
      app: datacenter-linux
    annotations:
      description: "Exporter for Job: {{ $labels.job }}, Instance: {{ $labels.instance }}, Role: {{ $labels.role }} is not started or is down, current value: {{ $value }}"
  - alert: "CPU usage too high"
    expr: (100 - (avg by(job,instance,role) (irate(node_cpu_seconds_total{job =~ "node-nginx",mode="idle"}[5m])) * 100)) > 80
    for: 2m
    labels:
      env: production
      app: datacenter-linux
    annotations:
      description: "{{ $labels.instance }} ({{ $labels.role }}) CPU usage exceeds 80%, current value: {{ $value }}"
  - alert: "Memory usage too high"
    expr: ((node_memory_MemTotal_bytes{job =~ "node-nginx"} - node_memory_MemAvailable_bytes{job =~ "node-nginx"}) / node_memory_MemTotal_bytes{job =~ "node-nginx"}) * 100 > 73
    for: 2m
    labels:
      env: production
      app: datacenter-linux
    annotations:
      description: "{{ $labels.instance }} ({{ $labels.role }}) memory usage exceeds 73%, current value: {{ $value }}"
  - alert: "Disk usage too high (>= 80%)"
    expr: 100 - ((node_filesystem_avail_bytes{fstype=~"ext4|xfs|nfs|nfs4", job =~ "node-nginx"} * 100) / node_filesystem_size_bytes{fstype=~"ext4|xfs|nfs|nfs4", job =~ "node-nginx"}) >= 80
    for: 2m
    labels:
      env: production
      app: datacenter-linux
    annotations:
      description: "{{ $labels.instance }} ({{ $labels.role }}) usage of partition {{ $labels.mountpoint }} has reached 80%, current value: {{ $value }}"
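The arithmetic behind the CPU, memory, and disk expressions above can be checked by hand. The following is a minimal Python sketch of that arithmetic only; the function names and sample values are hypothetical and are not part of Prometheus or of the original setup.

```python
def cpu_usage_percent(idle_rates):
    """Mirror of `100 - avg(irate(...mode="idle"...)) * 100` for one instance.

    idle_rates: per-core idle fractions between 0.0 and 1.0, i.e. the kind of
    value irate() yields for node_cpu_seconds_total{mode="idle"} on each core.
    """
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


def memory_usage_percent(mem_total_bytes, mem_available_bytes):
    """Mirror of `(MemTotal - MemAvailable) / MemTotal * 100`."""
    return (mem_total_bytes - mem_available_bytes) / mem_total_bytes * 100


def disk_usage_percent(size_bytes, avail_bytes):
    """Mirror of `100 - (avail * 100) / size` for one mount point."""
    return 100 - (avail_bytes * 100) / size_bytes


if __name__ == "__main__":
    # Hypothetical sample values, not measured on a real host.
    cpu = cpu_usage_percent([0.10, 0.20, 0.15, 0.15])
    mem = memory_usage_percent(16 * 1024**3, 4 * 1024**3)
    disk = disk_usage_percent(100 * 1024**3, 15 * 1024**3)
    # Thresholds copied from the rule expressions: > 80, > 73, >= 80.
    print(f"cpu={cpu:.1f}% fires={cpu > 80}")
    print(f"mem={mem:.1f}% fires={mem > 73}")
    print(f"disk={disk:.1f}% fires={disk >= 80}")
```

Note that a rule only becomes a firing alert once its expression has stayed true for the whole `for:` duration, which this sketch does not model.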
alertmanager:
1. Download alertmanager:
https://github.com/prometheus/alertmanager/tags
2. Set up a restart script for alertmanager:
vim restart.sh
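#!/bin/bash
# Look up the PID of the running alertmanager process, if any
pidnum=$(ps aux | grep alertmanager | grep -v grep | awk '{print $2}')
# Kill the old process only when one was found, so the first start does not fail
[ -n "${pidnum}" ] && kill -9 ${pidnum}
nohup /opt/jiankong/alertmanager/alertmanager --config.file=/opt/jiankong/alertmanager/dingtalk.yml --storage.path=/opt/jiankong/alertmanager/alertmanager_data --web.external-url=http://192.168.210.75:9093 &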
3. Check the alertmanager configuration:
./amtool check-config dingtalk.yml
4. alertmanager configuration:
[root@localhost alertmanager]# cat dingtalk.yml
global:
  resolve_timeout: 10m
#templates:
#- './config/*.tmpl'
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 5m
  #repeat_interval: 5s
  receiver: 'default'
  routes:
  # Push to the alert test group
  - match:
      app: "datacenter-linux"
    receiver: 'webhook-dingtalk-alter-test'
  # Push to the second alert test group
  - match:
      app: "blackbox_icmp"
    receiver: 'blackbox-dingtalk-alter-test'
inhibit_rules:
- source_match:
receivers:
- name: 'webhook-dingtalk-alter-test'
  webhook_configs:
  - send_resolved: true
    # Temporary: alert test group
    url: http://192.168.210.75:8060/dingtalk/webhook1/send
- name: 'blackbox-dingtalk-alter-test'
  webhook_configs:
  - send_resolved: true
    # Temporary: second alert group
    url: http://192.168.210.75:8060/dingtalk/webhook2/send
- name: 'default'
  webhook_configs:
  # Temporary: alert test group
  - url: http://192.168.210.75:8060/dingtalk/webhook1/send
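To make the routing tree above easier to reason about, here is a small Python sketch of how an alert's `app` label selects a receiver. It is a simplification under stated assumptions: it models only the exact-match `match:` blocks shown above, assumes the first matching child route wins (no `continue: true` is set), and ignores grouping, timing, and inhibition. `ROUTES`, `DEFAULT_RECEIVER`, and `pick_receiver` are hypothetical names for illustration, not Alertmanager APIs.

```python
# Child routes in the order they appear in dingtalk.yml: (match labels, receiver).
ROUTES = [
    ({"app": "datacenter-linux"}, "webhook-dingtalk-alter-test"),
    ({"app": "blackbox_icmp"}, "blackbox-dingtalk-alter-test"),
]

# Receiver of the top-level route, used when no child route matches.
DEFAULT_RECEIVER = "default"


def pick_receiver(labels):
    """Return the receiver name for an alert with the given label dict."""
    for match, receiver in ROUTES:
        # A route matches only if every label in its match block is equal.
        if all(labels.get(name) == value for name, value in match.items()):
            return receiver
    return DEFAULT_RECEIVER


if __name__ == "__main__":
    print(pick_receiver({"alertname": "CPU", "app": "datacenter-linux"}))
    print(pick_receiver({"alertname": "Ping", "app": "blackbox_icmp"}))
    print(pick_receiver({"alertname": "Other", "app": "something-else"}))
```

This is why every rule in the Prometheus rule file carries `app: datacenter-linux`: without that label the alert would fall through to the `default` receiver.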
prometheus-webhook-dingtalk
1. Download prometheus-webhook-dingtalk:
https://github.com/timonwong/prometheus-webhook-dingtalk/releases/tag
2. prometheus-webhook-dingtalk configuration:
[root@localhost webhook-dingtalk-2.0.0]# cat config-message.yml
#timeout: 5s
#no_builtin_template: true
templates:
  - /opt/jiankong/webhook-dingtalk-2.0.0/contrib/templates/alertmanager-dingtalk-message.tmpl
  - /opt/jiankong/webhook-dingtalk-2.0.0/contrib/templates/blackbox-icmp-message.tmpl
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=28b5xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    secret: SEC2bc3fb99xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxecbbd1273d1bb6
    mention:
      mobiles: ['173xxxxxx56']
    message:
      # Choose the template name here (a "#" inside the "text: |" block would be sent as message text, so the note lives outside it)
      text: |
        {{ template "linux.message" . }}
        @173xxxxxx56
  webhook2:
    url: https://oapi.dingtalk.com/robot/send?access_token=dad50cxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    secret: SEC4da8xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxe64ffbdbca044cfe1d678
    mention:
      mobiles: ['173xxxxxx56']
    message:
      # Choose the template name here
      text: |
        {{ template "email.to.message" . }}
        @173xxxxxx56
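The `secret` values above are the signing keys of DingTalk custom robots. prometheus-webhook-dingtalk is expected to sign each request for you when `secret` is set, so nothing below is required for the setup to work. As a rough sketch of what that signing involves, assuming DingTalk's documented scheme (HMAC-SHA256 over `timestamp + "\n" + secret`, keyed with the secret, then Base64 and URL encoding), the steps look like this in Python. The function names and the sample secret and token are hypothetical.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def sign_dingtalk(secret, timestamp_ms):
    """Return the URL-encoded signature for a millisecond timestamp."""
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        digestmod=hashlib.sha256,
    ).digest()
    return urllib.parse.quote_plus(base64.b64encode(digest))


def signed_url(webhook_url, secret, timestamp_ms=None):
    """Append timestamp and sign query parameters to a robot webhook URL."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    sign = sign_dingtalk(secret, timestamp_ms)
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={sign}"


if __name__ == "__main__":
    # Placeholder token and secret; substitute real values only in a private environment.
    url = "https://oapi.dingtalk.com/robot/send?access_token=TOKEN"
    print(signed_url(url, "SECexample"))
```

Because the timestamp is part of the signature, a host whose clock drifts far from real time may have its messages rejected, which is worth checking if alerts reach the webhook service but never appear in the group.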
3. prometheus-webhook-dingtalk template configuration:
[root@localhost webhook-dingtalk-2.0.0]# cat contrib/templates/alertmanager-dingtalk-message.tmpl
{{ define "linux.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
== = **Linux VM alert** = ==
**Alert source:** Alertmanager
**Alert type:** {{ $alert.Labels.alertname }}
**Affected host:** {{ $alert.Labels.instance }}
**Alert details:** <font color=#ff0000> {{ $alert.Annotations.message }}{{ $alert.Annotations.description}}</font>
**Started at:** {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
**Alert status:** <font color=#ff0000> {{ .Status }}</font>
== = **end** = ==
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
==== = **Linux VM alert resolved** = ====
**Alert source:** Alertmanager
**Alert type:** {{ $alert.Labels.alertname }}
**Affected host:** {{ $alert.Labels.instance }}
**Alert details:** <font color=#00ff00>{{ $alert.Annotations.message }}{{ $alert.Annotations.description}} </font>
**Started at:** {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
**Resolved at:** {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
**Alert status:** <font color=#00ff00> {{ .Status }} </font>
======= = **end** = =======
{{- end }}
{{- end }}
{{- end }}
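The `.Add 28800e9` calls in the template shift each timestamp by 28800e9 nanoseconds, which is 8 hours, presumably so that UTC times from Alertmanager display as China Standard Time (UTC+8). The Python sketch below reproduces that arithmetic and the `2006-01-02 15:04:05` layout; `shift_like_template` is a hypothetical helper for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Offset used in the template, expressed in nanoseconds as Go durations are.
SHIFT_NANOSECONDS = 28800e9


def shift_like_template(utc_time):
    """Add the template's offset and format like Go's "2006-01-02 15:04:05"."""
    shifted = utc_time + timedelta(seconds=SHIFT_NANOSECONDS / 1e9)
    return shifted.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    # Hypothetical alert start time in UTC.
    starts_at = datetime(2024, 1, 1, 20, 30, 0, tzinfo=timezone.utc)
    print(shift_like_template(starts_at))
```

The shift changes only the displayed clock value, not the zone attached to the timestamp, which is why the template prints no zone suffix.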
4. prometheus-webhook-dingtalk restart script:
[root@localhost webhook-dingtalk-2.0.0]# cat restart.sh
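#!/bin/bash
# Look up the PID of the running prometheus-webhook-dingtalk process, if any
pidnum=$(ps aux | grep prometheus-webhook-dingtalk | grep -v grep | awk '{print $2}')
# Kill the old process only when one was found, so the first start does not fail
[ -n "${pidnum}" ] && kill -9 ${pidnum}
nohup /opt/jiankong/webhook-dingtalk-2.0.0/prometheus-webhook-dingtalk --web.listen-address=:8060 --web.enable-ui --config.file=/opt/jiankong/webhook-dingtalk-2.0.0/config-message.yml &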
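Once both services are running, one way to exercise the whole chain without waiting for a real threshold breach is to post a hand-made alert to Alertmanager's v2 API (`POST /api/v2/alerts`). The sketch below builds such a payload in Python; the alert name, instance, and description are made-up test values, and the actual POST is left commented out so that nothing is sent by accident.

```python
import json
from datetime import datetime, timezone
from urllib import request


def build_test_alert(app="datacenter-linux", instance="test-host:9100"):
    """Build a one-alert payload whose `app` label matches a configured route."""
    return [{
        "labels": {
            "alertname": "ManualTestAlert",  # hypothetical name, not a real rule
            "app": app,
            "instance": instance,
            "env": "test",
        },
        "annotations": {
            "description": "Manual test alert sent to verify DingTalk delivery",
        },
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }]


def post_alert(alertmanager_url, alerts):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    req = request.Request(
        alertmanager_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=5) as resp:
        return resp.status


if __name__ == "__main__":
    print(json.dumps(build_test_alert(), indent=2))
    # Uncomment to send; the address is the one used in restart.sh above.
    # print(post_alert("http://192.168.210.75:9093", build_test_alert()))
```

With `group_wait: 30s` in the route, the message should reach the group roughly half a minute after the POST rather than immediately.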
----------------------------End of article-------------------------