
Alertmanager Alerting: Introduction and Deployment (1)

Date: 2022-12-26 12:44:09
Tags: Alertmanager, group, deployment, alerts, team, alerting, receiver

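I. Overview

  Alerting is an important part of any monitoring system. In the Prometheus monitoring stack, metric collection and storage are separated from alert handling. Alerting rules are defined on the Prometheus server; only when a rule fires is the alert sent to Alertmanager, a standalone component, which processes it and finally notifies users through a receiver such as email.

  Prometheus server scrapes the monitoring metrics, and threshold-based alerting rules (Rules) are defined on those metrics in PromQL. Prometheus server evaluates the rules periodically; when a rule's condition is met, it generates an alert and pushes it to Alertmanager. Alertmanager then processes the alert, groups it (grouping), and routes it (routing) to the correct receiver, such as Email, PagerDuty, or HipChat.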
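
  As an illustration of a threshold rule evaluated by Prometheus server, here is a minimal sketch of a rule file. The file name, group name, and alert name are assumptions made for this example and do not come from the original setup; `up` is the per-target health metric that Prometheus records for every scrape.

```yaml
# rules/node_alerts.yml (hypothetical file name)
groups:
  - name: node-alerts              # hypothetical group name
    rules:
      - alert: InstanceDown        # hypothetical alert name
        # `up` is 1 when the last scrape succeeded and 0 when it failed
        expr: up == 0
        # the condition must hold for 5 minutes before the alert fires
        for: 5m
        labels:
          severity: critical       # label that Alertmanager can later route on
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
```

  The labels attached here (for example severity) are what Alertmanager later uses for grouping, routing, and inhibition.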

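  1. Alert grouping

    Grouping means that Alertmanager collects alerts of the same kind into a group and merges them into a single notification. This prevents a sudden burst of notifications that would keep administrators from locating the problem quickly. For example, in a large-scale cluster a failure may leave many applications unable to connect to the database. If the Prometheus alerting rules fire one alert per service instance, a large number of alerts reaches Alertmanager at once; with grouping, they are delivered as one notification.

    Alert grouping, notification timing, and receivers are all set in the Alertmanager configuration file.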
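
    A minimal sketch of how grouping and its timing might be expressed in alertmanager.yml. The receiver name is an assumption for this example, and the timing values are only illustrative; the webhook URL is the one used by the default configuration.

```yaml
route:
  receiver: 'ops-webhook'              # hypothetical receiver name
  # alerts with the same alertname and cluster values form one group
  group_by: ['alertname', 'cluster']
  group_wait: 30s                      # wait before the first notification of a new group
  group_interval: 5m                   # wait before notifying about new alerts in the group
  repeat_interval: 3h                  # wait before re-sending an unchanged notification
receivers:
  - name: 'ops-webhook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
```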

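  2. Alert inhibition

    Inhibition means that once a given alert is firing, notifications for other alerts caused by it are suppressed. To some extent this spares administrators from a flood of follow-on notifications. Inhibition is also configured in the Alertmanager configuration file.

  3. Alert silencing

    Silences provide a simple way to mute alerts quickly based on their labels. Each incoming alert is checked against the active silences; if it matches one, Alertmanager sends no notification for it. Administrators can temporarily mute specific alerts directly in the Alertmanager web UI.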

 

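II. Deployment and Configuration

  Alertmanager is also written in Go, so it can be used right after downloading and extracting the release archive. After extracting it, check the version; the version used here is v0.24.0.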

[root@iZwz97yqubb71vyxhuskfyZ alertmanager]# pwd
/root/prometheus/alertmanager
[root@iZwz97yqubb71vyxhuskfyZ alertmanager]# ./alertmanager --version
alertmanager, version 0.24.0 (branch: HEAD, revision: f484b17fa3c583ed1b2c8bbcec20ba1db2aa5f11)
  build user:       root@265f14f5c6fc
  build date:       20220325-09:31:33
  go version:       go1.17.8
  platform:         linux/amd64

  Alertmanager options: only a few important ones are listed here. To see all of them, run:  ./alertmanager -h

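Option               Description
--config.file        Path to the alertmanager.yml configuration file
--web.external-url   URL under which Alertmanager is externally reachable, for example http://0.0.0.0:9093; the listen address and port are set separately with --web.listen-address (default :9093)
--data.retention     How long to keep data; default 120h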

  Start it as follows:

 nohup ./alertmanager --config.file=alertmanager.yml --web.external-url=http://0.0.0.0:9093 > nohup.out 2>&1 &
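
  Once Alertmanager is running, Prometheus server has to be told where to push alerts. The following prometheus.yml fragment is a sketch that assumes Alertmanager runs on the same host on the default port 9093; the rule file path is hypothetical.

```yaml
# prometheus.yml (fragment)
rule_files:
  - 'rules/*.yml'                      # hypothetical location of the alerting rule files

alerting:
  alertmanagers:
    - static_configs:
        # assumption: Alertmanager listens locally on its default port
        - targets: ['127.0.0.1:9093']
```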

  

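III. Configuration Overview

  An Alertmanager configuration file is usually made up of these main sections: global (global settings), templates (notification templates), route (alert routing), receivers (receivers), and inhibit_rules (inhibition rules).

  Below is an example alertmanager.yml showing these sections. For more options see: https://prometheus.io/docs/alerting/latest/configuration/#filepath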

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'localhost:25'
  smtp_from: '[email protected]'

# The root route on which each incoming alert enters.
route:
  # The root route must not have any matchers as it is the entry point for
  # all alerts. It needs to have a receiver configured so alerts that do not
  # match any of the sub-routes are sent to someone.
  receiver: 'team-X-mails'

  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  #
  # To aggregate by all possible labels use '...' as the sole label name.
  # This effectively disables aggregation entirely, passing through all
  # alerts as-is. This is unlikely to be what you want, unless you have
  # a very low alert volume or your upstream notification system performs
  # its own grouping. Example: group_by: [...]
  group_by: ['alertname', 'cluster']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3h

  # All the above attributes are inherited by all child routes and can be
  # overwritten on each.

  # The child route trees.
  routes:
  # This route performs a regular expression match on alert labels to
  # catch alerts that are related to a list of services.
  - match_re:
      service: ^(foo1|foo2|baz)$
    receiver: team-X-mails

    # The service has a sub-route for critical alerts, any alerts
    # that do not match, i.e. severity != critical, fall-back to the
    # parent node and are sent to 'team-X-mails'
    routes:
    - match:
        severity: critical
      receiver: team-X-pager

  - match:
      service: files
    receiver: team-Y-mails

    routes:
    - match:
        severity: critical
      receiver: team-Y-pager

  # This route handles all alerts coming from a database service. If there's
  # no team to handle it, it defaults to the DB team.
  - match:
      service: database

    receiver: team-DB-pager
    # Also group alerts by affected database.
    group_by: [alertname, cluster, database]

    routes:
    - match:
        owner: team-X
      receiver: team-X-pager

    - match:
        owner: team-Y
      receiver: team-Y-pager


# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
- source_matchers:
    - severity="critical"
  target_matchers:
    - severity="warning"
  # Apply inhibition if the alertname is the same.
  # CAUTION: 
  #   If all label names listed in `equal` are missing 
  #   from both the source and target alerts,
  #   the inhibition rule will apply!
  equal: ['alertname']


receivers:
- name: 'team-X-mails'
  email_configs:
  - to: '[email protected], [email protected]'

- name: 'team-X-pager'
  email_configs:
  - to: '[email protected]'
  pagerduty_configs:
  - routing_key: <team-X-key>

- name: 'team-Y-mails'
  email_configs:
  - to: '[email protected]'

- name: 'team-Y-pager'
  pagerduty_configs:
  - routing_key: <team-Y-key>

- name: 'team-DB-pager'
  pagerduty_configs:
  - routing_key: <team-DB-key>

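  3.1 global

    Global settings. They serve as defaults for other configuration sections and can be overridden there.

    smtp_smarthost: address (host:port) of the SMTP server used to send mail.

    smtp_from: sender address used for the notification emails.

    smtp_auth_username: user name for SMTP authentication.

    smtp_auth_password: password or authorization code for SMTP authentication.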
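
    Put together, a global section using these four options might look like the sketch below. The host name, addresses, and password are placeholders invented for this example, not values from the original setup.

```yaml
global:
  smtp_smarthost: 'smtp.example.org:587'            # placeholder SMTP server (host:port)
  smtp_from: 'alertmanager@example.org'             # placeholder sender address
  smtp_auth_username: 'alertmanager@example.org'    # placeholder account name
  smtp_auth_password: '<authorization-code>'        # placeholder; replace with the real secret
```

    Individual email_configs entries can override these defaults for a specific receiver.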

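  3.2 templates

    At the same level as global. Notification templates customize the appearance of notifications and the alert data they contain. The templates section holds a list of paths to existing template files, for example: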

templates:
-  'templates/*.tmpl'
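
    A loaded template only takes effect when a receiver refers to it. The fragment below is a sketch of that link: the receiver name, the address, and the template name custom.email.html are hypothetical, and the name must match a {{ define "custom.email.html" }} block in one of the .tmpl files.

```yaml
templates:
  - 'templates/*.tmpl'

receivers:
  - name: 'team-mail'                                 # hypothetical receiver name
    email_configs:
      - to: 'team@example.org'                        # placeholder address
        # render the mail body with the custom template instead of the default one
        html: '{{ template "custom.email.html" . }}'  # hypothetical template name
```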

  3.3 route

    The route section describes the rules by which alerts received from Prometheus server are sent to the destination given by receiver. Here is an example:

route:
    receiver: 'admin-receiver'
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    group_by: [cluster,alertname]
    routes:
        - match:
            team: developers
          group_by: [product, environment]
          receiver: 'developer-pager'
        - match_re:
            service: mysql|redis
          receiver: 'database-pager'

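    Options:

      route is the root route.

      routes holds the child routes. match performs exact string matching: it checks whether the alert carries the label labelname with a value equal to labelvalue.

      group_by: lists the labels used for grouping. Alerts that share the same values for these labels are merged into one notification to the receiver; this is how alert grouping is implemented.

      match_re: matches by regular expression: it checks whether the alert's label value matches the given pattern.

    Explanation of the example:

      By default alerts go to the administrator receiver admin-receiver, and the root route groups alerts by cluster and alertname.

      In the first child route, if the alert's team label equals developers, Alertmanager groups the alerts by product and environment before sending the notification, which helps developers locate the fault quickly.

      The last rule uses a regular expression: if the alert has a service label whose value matches mysql or redis, the notification goes to the database administrators' receiver database-pager.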
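
    Since Alertmanager v0.22, match and match_re are deprecated in favor of the matchers list, which the v0.24.0 release used here also accepts. As a sketch, the same routing tree could be written as follows; the receiver names are those of the example above.

```yaml
route:
  receiver: 'admin-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  routes:
    # equality matcher, replaces `match`
    - matchers:
        - team="developers"
      group_by: [product, environment]
      receiver: 'developer-pager'
    # regular-expression matcher, replaces `match_re`
    - matchers:
        - service=~"mysql|redis"
      receiver: 'database-pager'
```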

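  3.4 receivers

    Receiver is a general term. Each receiver needs a globally unique name and maps to one or more notification methods, including email, WeChat, webhook, and others. The official recommendation is to implement custom notification integrations through the webhook receiver, which lets users tailor delivery to their needs.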
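
    A sketch of one receiver that combines two notification methods. The receiver name and the email address are assumptions for this example; the webhook URL is the one from the default configuration, and the email entry relies on SMTP settings in the global section.

```yaml
receivers:
  - name: 'ops-team'                     # hypothetical, must be unique among receivers
    email_configs:
      - to: 'ops@example.org'            # placeholder address
        send_resolved: true              # also notify when the alert is resolved
    webhook_configs:
      # Alertmanager sends an HTTP POST with a JSON payload to this endpoint
      - url: 'http://127.0.0.1:5001/'
```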

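  3.5 inhibit_rules

    The inhibit_rules section configures alert inhibition. It specifies which alerts to suppress under given conditions; well-chosen inhibition rules reduce the number of "junk" notifications.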
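
    As a sketch of a typical rule: while an alert reporting that a node is down is firing, warning-level alerts from the same instance are suppressed. The alert name NodeDown is hypothetical and would have to exist in the Prometheus rule files.

```yaml
inhibit_rules:
  - source_matchers:
      - alertname="NodeDown"       # hypothetical alert that does the inhibiting
    target_matchers:
      - severity="warning"         # alerts that get suppressed
    # only inhibit when source and target carry the same instance label value
    equal: ['instance']
```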

 

IV. The Default Alertmanager Configuration

  

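route:                            # routing section
  group_by: ['alertname']         # group alerts by the alertname label
  group_wait: 30s                 # wait 30s so alerts of a new group arriving in that window go out in one notification
  group_interval: 5m              # interval between notifications for the same group when new alerts join it
  repeat_interval: 1h             # interval before re-sending a notification that was already sent
  receiver: 'web.hook'            # the receiver to use is web.hook
receivers:                        # receivers section
  - name: 'web.hook'              # receiver named web.hook
    webhook_configs:              # webhook endpoint settings
      - url: 'http://127.0.0.1:5001/'
inhibit_rules:                    # alert inhibition section
  - source_match:
      severity: 'critical'        # while an alert with the source label is firing, alerts with the target label are inhibited
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance'] # inhibit only when these labels have equal values in both alerts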

    

From: https://www.cnblogs.com/MrHSR/p/16775718.html
