Prometheus Monitoring + Alertmanager Alerting

Date: 2024-03-07 17:37:05
Tags: alertmanager dingtalk labels prometheus usr alerting pod


Configuring alert rules

  1. Create the rules directory.

mkdir -p /usr/local/prometheus/rules

  

  2. Write the alert rules file.

  vim /usr/local/prometheus/rules/rule.yml

# Add the following configuration
groups:
- name: instance-abnormal
  rules:
  - alert: PodNewNotReady
    # Pod is not Ready and none of its containers has restarted yet.
    expr: |
      kube_pod_status_ready{condition="true"} == 0
      and on(namespace, pod)
      kube_pod_container_status_restarts_total == 0
    for: 60s
    labels:
      name: instance
      severity: Warning
      instance: "{{ $labels.pod }}"
    annotations:
      summary: "k8s cluster alert!"
      description: "{{ $labels.pod }} is a newly added pod!"

  - alert: PodRestarting
    # Pod is not Ready and at least one of its containers has restarted.
    expr: |
      kube_pod_status_ready{condition="true"} == 0
      and on(namespace, pod)
      kube_pod_container_status_restarts_total > 0
    for: 60s
    labels:
      name: instance
      severity: Critical
    annotations:
      summary: "k8s cluster pod restart!"
      description: "{{ $labels.pod }} is restarting!"

- name: instance-down
  rules:
  - alert: K8sNodeDown
    expr: |
      kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 60s
    labels:
      severity: Critical
    annotations:
      summary: "k8s cluster node {{ $labels.node }} is down!"
      description: "Node {{ $labels.node }} is unavailable, please check as soon as possible!"

- name: resource-status
  rules:
  - alert: PodHighCpuUsage
    expr: |
      sum by (pod, namespace) (rate(container_cpu_usage_seconds_total{name!=""}[60s])) /
      sum by (pod, namespace) (kube_pod_container_resource_limits{resource="cpu"}) > 0.8
    for: 1m
    labels:
      severity: Warning
    annotations:
      summary: "CPU usage above 80%!"
      description: "{{ $labels.pod }} CPU usage has been above 80% for more than 1 minute, please check!"

  - alert: PodHighMemoryUsage
    # container_memory_usage_bytes is a gauge, so it is summed directly instead of being wrapped in rate().
    expr: |
      sum by (pod, namespace) (container_memory_usage_bytes{name!=""}) /
      sum by (pod, namespace) (kube_pod_container_resource_limits{resource="memory"}) > 0.8
    for: 1m
    labels:
      severity: Warning
    annotations:
      summary: "Memory usage above 80%!"
      description: "{{ $labels.pod }} memory usage has been above 80% for more than 1 minute, please check!"

  - alert: HostHighCpuUsage
    # Grouped by instance so the instance label survives aggregation; the "> 80" threshold was missing.
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) > 80
    for: 1m
    labels:
      severity: Warning
    annotations:
      summary: "{{ $labels.instance }} CPU usage above 80%!"
      description: "{{ $labels.instance }} CPU usage has been above 80% for more than 1 minute, please check!"

  - alert: HostHighMemoryUsage
    # Assumes node_exporter memory metrics are scraped (the CPU rule above already relies on node_exporter).
    expr: |
      (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80
    for: 1m
    labels:
      severity: Warning
    annotations:
      summary: "{{ $labels.instance }} memory usage above 80%!"
      description: "{{ $labels.instance }} memory usage has been above 80% for more than 1 minute, please check!"

  - alert: HostHighDiskUsage
    # Assumes node_exporter filesystem metrics are scraped; tmpfs and overlay mounts are excluded.
    expr: |
      (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} /
      node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 80
    for: 1m
    labels:
      severity: Warning
    annotations:
      summary: "{{ $labels.instance }} disk usage above 80%!"
      description: "{{ $labels.instance }} ({{ $labels.mountpoint }}) disk usage has been above 80% for more than 1 minute, please check!"

  Adjust the rules to your own needs.
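Before loading the rule file it can help to check its syntax. The following is a minimal sketch, assuming the promtool binary that ships in the Prometheus release tarball sits at /usr/local/prometheus/promtool. That path and the helper name check_rules are assumptions, not part of the original steps.

```shell
# Assumed locations; override via environment variables if yours differ.
PROMTOOL="${PROMTOOL:-/usr/local/prometheus/promtool}"
RULE_FILE="${RULE_FILE:-/usr/local/prometheus/rules/rule.yml}"

# check_rules is a hypothetical helper name used only in this sketch.
check_rules() {
  # "promtool check rules" exits non-zero on YAML or PromQL syntax errors.
  if "$PROMTOOL" check rules "$RULE_FILE"; then
    echo "rules OK"
  else
    echo "rules INVALID" >&2
    return 1
  fi
}

# Usage: check_rules
```

Running check_rules before every reload keeps a typo in rule.yml from being discovered only in the Prometheus log.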

 

  3. Add the alerting settings to the Prometheus configuration.

  vim /usr/local/prometheus/prometheus.yml

# Add the following configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - xxx.xxx.xxx.xxx:9093

rule_files:
  - "/usr/local/prometheus/rules/rule.yml"

 

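  4. Reload Prometheus. The /-/reload endpoint is only available when Prometheus was started with --web.enable-lifecycle; otherwise restart the service or send the process a SIGHUP.

curl -X POST http://localhost:9090/-/reload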

 

Deploying Alertmanager

  1. Download Alertmanager.

wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz

  

  2. Extract the archive, move it to the install directory, and create a version-independent symlink.

tar -zxf alertmanager-0.26.0.linux-amd64.tar.gz
mv alertmanager-0.26.0.linux-amd64 /opt/alertmanager-0.26.0
ln -s /opt/alertmanager-0.26.0 /usr/local/alertmanager

 

  3. Configure a systemd unit.

  vim /usr/lib/systemd/system/alertmanager.service

[Unit]
Description=Alertmanager Service
After=network.target

[Service]
ExecStart=/usr/local/alertmanager/alertmanager \
--storage.path=/usr/local/alertmanager/data \
--config.file=/usr/local/alertmanager/alertmanager.yml

[Install]
WantedBy=multi-user.target

 

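   4. Start Alertmanager and enable it at boot. Reload systemd first so that it picks up the new unit file.

systemctl daemon-reload
systemctl start alertmanager
systemctl enable alertmanager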
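To confirm that the service actually came up, one option is to poll Alertmanager's /-/healthy endpoint. This is a sketch under assumptions: the URL http://localhost:9093 (Alertmanager's default port on the local host), the retry count, and the helper name wait_for_alertmanager are all illustrative.

```shell
# Assumed address; adjust if Alertmanager listens elsewhere.
AM_URL="${AM_URL:-http://localhost:9093}"

# wait_for_alertmanager is a hypothetical helper; $1 is the number of attempts.
wait_for_alertmanager() {
  tries="${1:-10}"
  while [ "$tries" -gt 0 ]; do
    # -f makes curl fail on HTTP errors, -s hides the progress meter.
    if curl -fs "$AM_URL/-/healthy" >/dev/null; then
      echo "alertmanager is healthy"
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  echo "alertmanager did not become healthy" >&2
  return 1
}

# Usage: wait_for_alertmanager 15
```

If the check fails, systemctl status alertmanager and journalctl -u alertmanager are the usual places to look next.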

 

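  Configuring email alerts

  1. Edit alertmanager.yml to configure email notifications.

  vim /usr/local/alertmanager/alertmanager.yml

# Replace the file contents with the following
global:
  smtp_smarthost: 'smtp.139.com:25'        # SMTP server; first check that SMTP is enabled for the mailbox and that port 25 is reachable
  smtp_from: '[email protected]'            # sender address
  smtp_auth_username: 'xxxxxxxx'           # mailbox user
  smtp_auth_password: 'xxxxxxxx'           # client authorization code, not the login password; it is generated when SMTP is enabled and expires, after which it must be reset in the mail system
  smtp_require_tls: false           # whether to require an encrypted connection; defaults to true

route:
  group_by: ["alertname"]                  # label(s) used to group alerts
  group_wait: 30s                          # wait 30s after the first alert of a new group so related alerts go out in one notification
  group_interval: 5m                       # wait before notifying about new alerts added to a group that has already been notified
  repeat_interval: 1h                      # interval for re-sending a notification for an alert that is still firing
  receiver: email                          # receiver name; must match a name under receivers

receivers:
- name: 'email'                            # receiver name
  email_configs:
  - to: '[email protected]'                  # recipient address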

 

  2. Restart Alertmanager.

systemctl restart alertmanager
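A broken alertmanager.yml keeps the service from starting, so it is worth validating the file before each restart. The sketch below assumes the amtool binary from the Alertmanager release tarball is at /usr/local/alertmanager/amtool. The path and the helper name check_am_config are assumptions.

```shell
# Assumed locations; override via environment variables if yours differ.
AMTOOL="${AMTOOL:-/usr/local/alertmanager/amtool}"
AM_CONFIG="${AM_CONFIG:-/usr/local/alertmanager/alertmanager.yml}"

# check_am_config is a hypothetical helper name used only in this sketch.
check_am_config() {
  # "amtool check-config" exits non-zero if the file does not parse.
  if "$AMTOOL" check-config "$AM_CONFIG"; then
    echo "config OK"
  else
    echo "config INVALID" >&2
    return 1
  fi
}

# Usage: check_am_config && systemctl restart alertmanager
```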

 

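 Configuring DingTalk alerts

  1. Download the DingTalk notification tool (prometheus-webhook-dingtalk).

wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.1.0/prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz

 

  2. Extract the archive, move it to the install directory, and create a symlink.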

tar -zxf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
mv prometheus-webhook-dingtalk-2.1.0.linux-amd64 /opt/prometheus-webhook-dingtalk-2.1.0
ln -s /opt/prometheus-webhook-dingtalk-2.1.0 /usr/local/prometheus-webhook-dingtalk

 

  3. Create the DingTalk alert template.

mkdir /usr/local/prometheus-webhook-dingtalk/templates
vim /usr/local/prometheus-webhook-dingtalk/templates/service.tmpl

  Add the following to service.tmpl:

{{ template "service.title" . }}

{{ define "service.title" }}
{{ template "__subject" . }}
{{ end }}

{{ define "__subject" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ end }}

{{ template "service.content" . }}
{{ define "service.content" }}
{{ if gt (len .Alerts.Firing) 0 }}
========{{ .Alerts.Firing | len  }} alert(s) firing========
{{ template "__alert_list" .Alerts.Firing }}
---
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
========{{ .Alerts.Resolved | len  }} alert(s) resolved========
{{ template "__resolved_list" .Alerts.Resolved }}
---
{{ end }}
{{ end }}

{{ define "__alert_list" }}{{ range . }}
---
    **Alert name**: {{ .Labels.alertname }}
    **Severity**: {{ .Labels.severity }}
    **Status**: {{ .Status }}
    **Summary**: {{ .Annotations.summary }}
    **Details**: {{ .Annotations.description }}
    **Started at**: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end }}

{{ define "__resolved_list" }}{{ range . }}
---
    **Alert name**: {{ .Labels.alertname }}
    **Severity**: {{ .Labels.severity }}
    **Status**: {{ .Status }}
    **Summary**: {{ .Annotations.summary }}
    **Details**: {{ .Annotations.description }}
    **Started at**: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    **Resolved at**: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end }}
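The template adds 28800e9 to each timestamp. Go durations are counted in nanoseconds, so this is an 8-hour shift, presumably to display UTC timestamps in UTC+8. The arithmetic can be sketched as follows; the helper name utc_offset_ns is hypothetical and only serves to derive the constant for another time zone.

```shell
# Convert a UTC offset in whole hours into nanoseconds (the unit of Go's time.Duration).
utc_offset_ns() {
  echo $(( $1 * 3600 * 1000000000 ))
}

utc_offset_ns 8   # prints 28800000000000, i.e. 28800e9
```

For a different time zone, replace 28800e9 in the template with the value printed for that offset.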

 

  4. Edit the configuration file.

cp /usr/local/prometheus-webhook-dingtalk/config.example.yml /usr/local/prometheus-webhook-dingtalk/config.yml 
vim /usr/local/prometheus-webhook-dingtalk/config.yml 

  Edit config.yml as follows:

templates:
  - /usr/local/prometheus-webhook-dingtalk/templates/*.tmpl

targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx
    message:
      title: '{{ template "service.title" . }}'
      text:  '{{ template "service.content" . }}'

  Replace xxxxxxxx with the access token of your DingTalk robot.

 

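  5. Configure the systemd unit. The next step starts the service as dingtalk, so the unit file must be named dingtalk.service; the path below assumes the same directory as the Alertmanager unit.

  vim /usr/lib/systemd/system/dingtalk.service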

[Unit]
Description=prometheus webhook dingtalk
After=network.target

[Service]
ExecStart=/usr/local/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk \
--config.file=/usr/local/prometheus-webhook-dingtalk/config.yml

[Install]
WantedBy=multi-user.target

 

  6. Start the service and enable it at boot.

systemctl daemon-reload
systemctl start dingtalk
systemctl enable dingtalk

 

  7. Edit the Alertmanager configuration so that alerts are sent to both email and DingTalk.

  vim  /usr/local/alertmanager/alertmanager.yml

  Change it to the following:

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 2h
  receiver: 'default'


  routes:
  - receiver: 'email'
    continue: true                                              # keep matching the routes that follow
  - receiver: 'dingding'
    continue: true                                              # add more routes below if other receivers also need the alerts

receivers:
- name: 'default'
- name: 'email'
  email_configs:
  - to: '[email protected]'                                         # address that receives the alert emails
    from: '[email protected]'                                     # address that sends the alert emails
    smarthost: 'smtp.139.com:25'               
    auth_username: 'xxxxxxxx'
    auth_password: 'xxxxxxxx'
    require_tls: false
    send_resolved: true                                          # also send a notification when the alert resolves

- name: 'dingding'
  webhook_configs:
  - url: 'http://xxx.xxx.xxx.xxx:8060/dingtalk/webhook1/send'    # address of the prometheus-webhook-dingtalk service
    send_resolved: true                                          # also send a notification when the alert resolves

 

  8. Restart Alertmanager.

systemctl restart alertmanager
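To test the email and DingTalk paths without waiting for a real rule to fire, a test alert can be posted straight to Alertmanager's v2 API. This is a sketch, not part of the original steps: the URL http://localhost:9093, the alert name PipelineTest, and the helper names build_test_alert and send_test_alert are all assumptions.

```shell
# Assumed address; adjust if Alertmanager listens elsewhere.
AM_URL="${AM_URL:-http://localhost:9093}"

# Build a one-element JSON array in the shape POST /api/v2/alerts accepts.
# $1 = alertname label, $2 = severity label (both arbitrary test values).
build_test_alert() {
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"pipeline test","description":"manually injected test alert"}}]' "$1" "$2"
}

# Send the payload to Alertmanager; --data @- reads the body from stdin.
send_test_alert() {
  build_test_alert "$1" "$2" |
    curl -sS -X POST -H 'Content-Type: application/json' --data @- "$AM_URL/api/v2/alerts"
}

# Usage: send_test_alert PipelineTest Warning
```

With group_wait set to 10s, the notification should reach both receivers shortly after the request. Because the payload sets no end time, Alertmanager treats the alert as resolved once resolve_timeout (5m in the configuration above) has passed.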

 

From: https://www.cnblogs.com/NanZhiHan/p/18058725
