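1. Introduction to Prometheus alerting
Prometheus alerting is split into two independent parts: alert rules evaluated by Prometheus, and notification handling by Alertmanager. Alert rules (AlertRule) are defined in Prometheus, which evaluates them periodically. When a rule's trigger condition is met, Prometheus sends the alert to Alertmanager.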
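Alertmanager features: grouping, inhibition, silencing, and more.
Grouping: merges individual alerts into a single notification. For example, when a system outage fires a large number of alerts at once, grouping can combine them into one notification.
Inhibition: once an alert is firing, notifications for other alerts caused by it can be suppressed (configured in the Alertmanager configuration file).
Silencing: quickly mutes alerts that match a set of labels. Alertmanager sends no notifications for silenced alerts (configured in the Alertmanager web UI).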
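2. Defining alert rules
A group can contain multiple alert rules. The main parts of an alert rule are:
alert: the name of the alert rule.
expr: the trigger condition as a PromQL expression, used to evaluate whether any time series satisfies it.
for: the wait time. The alert is sent only after the trigger condition has held for this long. During the wait, a newly triggered alert is in the pending state.
labels: custom labels. Lets you attach a set of additional labels to the alert (used together with the Alertmanager configuration, for example regex matching, to notify different people).
annotations: a set of additional information, such as descriptive text about the alert, sent to Alertmanager along with it.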
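Example:
memory.yml
groups:
  - name: Memory alert rules
    rules:
      - alert: MemoryUsageHigh
        # Percentage of memory in use, computed from available vs. total bytes
        expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes))) * 100 > 40
        for: 10s
        labels:
          severity: warning
          team: frontend
        annotations:
          summary: "Server is low on available memory."
          description: "Memory usage has exceeded 40% (current value: {{ $value }}%)"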
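The inhibit rules in the cases later in this article suppress `warning` alerts while a `critical` alert with the same `alertname` and `instance` is firing. Below is a minimal sketch of a rule pair that such an inhibit rule could act on. The 80% critical threshold, the `for` durations, and the group and alert names are hypothetical and do not come from the original example.

```yaml
# Hypothetical file: rules/memory-tiers.yml (names and thresholds are assumptions)
groups:
  - name: memory-tiers
    rules:
      # Warning tier: same expression as the example above
      - alert: MemoryUsageTiered
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 40
        for: 1m
        labels:
          severity: warning
        annotations:
          description: "Memory usage on {{ $labels.instance }} is above 40% (current value: {{ $value }}%)"
      # Critical tier: same alert name, so an inhibit rule with
      # equal: ['alertname', 'instance'] can match the two alerts up
      - alert: MemoryUsageTiered
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "Memory usage on {{ $labels.instance }} is above 80% (current value: {{ $value }}%)"
```

Both tiers share one alert name because the inhibit rules in this article list `alertname` under `equal`. With different names, the warning would not be suppressed.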
Edit the Prometheus configuration file prometheus.yml and add the Alertmanager settings:
# Connect Prometheus to Alertmanager
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 127.0.0.1:9093

# Rule files to load
rule_files:
  - rules/*.yml
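3. Alertmanager configuration overview
global: global settings that define shared parameters, such as SMTP.
templates: templates used when rendering alert notifications, for example HTML or email templates.
route: the alert routing tree. Matches on labels to decide how each alert is handled.
receivers: notification receivers, such as WeChat, DingTalk, or a webhook.
inhibit_rules: inhibition rules. Set them sensibly to cut down on noisy alerts.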
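resolve_timeout: how long Alertmanager waits without an update before marking an alert as resolved. Per the official reference below, it applies only to alerts that do not include EndsAt, so it has no effect on alerts from Prometheus.
group_by: the grouping rule. Alerts that share the same values for the label names listed in group_by are merged into one notification sent to the receiver.
group_wait: how long to wait before sending the first notification for a new group. Alerts that arrive for the group during this window are merged into the same notification (typically seconds).
group_interval: the interval between notifications for the same group.
repeat_interval: how long to wait before re-sending a notification that has already been sent successfully (typically an hour or more).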
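The interaction of the three timers is easier to see in one place. Below is a minimal sketch of a route annotated with what each timer controls. The durations, the receiver name `ops-webhook`, and the webhook URL are hypothetical and chosen for illustration only.

```yaml
route:
  receiver: 'ops-webhook'          # hypothetical receiver name
  group_by: ['alertname', 'instance']
  # A new group is held for 30s so that alerts firing at almost the
  # same time go out together in the first notification.
  group_wait: 30s
  # After the first notification, alerts newly added to the same group
  # are sent at most once every 5m.
  group_interval: 5m
  # If nothing in the group changes and the alerts are still firing,
  # the same notification is sent again after 4h.
  repeat_interval: 4h

receivers:
  - name: 'ops-webhook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/alerts'   # hypothetical endpoint
```

Short `group_wait` values reduce notification delay, while longer values merge more alerts into the first message. Which trade-off fits depends on how bursty the alerts are.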
Full configuration reference from the official documentation:
global:
  # The default SMTP From header field.
  [ smtp_from: <tmpl_string> ]
  # The default SMTP smarthost used for sending emails, including port number.
  # Port number usually is 25, or 587 for SMTP over TLS (sometimes referred to as STARTTLS).
  # Example: smtp.example.org:587
  [ smtp_smarthost: <string> ]
  # The default hostname to identify to the SMTP server.
  [ smtp_hello: <string> | default = "localhost" ]
  # SMTP Auth using CRAM-MD5, LOGIN and PLAIN. If empty, Alertmanager doesn't authenticate to the SMTP server.
  [ smtp_auth_username: <string> ]
  # SMTP Auth using LOGIN and PLAIN.
  [ smtp_auth_password: <secret> ]
  # SMTP Auth using LOGIN and PLAIN.
  [ smtp_auth_password_file: <string> ]
  # SMTP Auth using PLAIN.
  [ smtp_auth_identity: <string> ]
  # SMTP Auth using CRAM-MD5.
  [ smtp_auth_secret: <secret> ]
  # The default SMTP TLS requirement.
  # Note that Go does not support unencrypted connections to remote SMTP endpoints.
  [ smtp_require_tls: <bool> | default = true ]

  # The API URL to use for Slack notifications.
  [ slack_api_url: <secret> ]
  [ slack_api_url_file: <filepath> ]
  [ victorops_api_key: <secret> ]
  [ victorops_api_key_file: <filepath> ]
  [ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
  [ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
  [ opsgenie_api_key: <secret> ]
  [ opsgenie_api_key_file: <filepath> ]
  [ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
  [ wechat_api_url: <string> | default = "https://qyapi.weixin.qq.com/cgi-bin/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]
  [ telegram_api_url: <string> | default = "https://api.telegram.org" ]
  [ webex_api_url: <string> | default = "https://webexapis.com/v1/messages" ]

  # The default HTTP client configuration
  [ http_config: <http_config> ]

  # ResolveTimeout is the default value used by alertmanager if the alert does
  # not include EndsAt, after this time passes it can declare the alert as resolved if it has not been updated.
  # This has no impact on alerts from Prometheus, as they always include EndsAt.
  [ resolve_timeout: <duration> | default = 5m ]

# Files from which custom notification template definitions are read.
# The last component may use a wildcard matcher, e.g. 'templates/*.tmpl'.
templates:
  [ - <filepath> ... ]

# The root node of the routing tree.
route: <route>

# A list of notification receivers.
receivers:
  - <receiver> ...

# A list of inhibition rules.
inhibit_rules:
  [ - <inhibit_rule> ... ]

# DEPRECATED: use time_intervals below.
# A list of mute time intervals for muting routes.
mute_time_intervals:
  [ - <mute_time_interval> ... ]

# A list of time intervals for muting/activating routes.
time_intervals:
  [ - <time_interval> ... ]
Case 1 (all alert notifications go to a single person or group):
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://11.0.1.1:5000/send'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
Case 2 (different categories of alerts notify different people or use different channels):
global:
  resolve_timeout: 1m
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'gdfawprxfuonbfcf'
  smtp_hello: '@qq.com'
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 20s
  repeat_interval: 5h
  receiver: 'default'
  routes:
    - receiver: "web.hook"    # webhook notification
      group_wait: 10s
      match_re:
        service: test
    - receiver: "mails"       # email notification
      group_by: [product, environment]
      match:
        team: frontend
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://11.0.1.1:5000/send'
  - name: "mails"
    email_configs:
      - to: '[email protected]'
        send_resolved: true   # also notify when an alert is resolved
  - name: "default"
    webhook_configs:
      - url: 'http://11.0.1.1:5000/senddef'
inhibit_rules:                # inhibition rules
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
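Newer Alertmanager releases (0.22 and later, which includes the 0.25.0 release used in this article) also accept a `matchers` list in place of `match` / `match_re`, and `source_matchers` / `target_matchers` in place of `source_match` / `target_match`. Below is a sketch of the routing and inhibition parts of case 2 in that syntax. It is a fragment only, and the `global` and `receivers` sections are assumed to stay as they are.

```yaml
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 20s
  repeat_interval: 5h
  receiver: 'default'
  routes:
    - receiver: 'web.hook'
      group_wait: 10s
      matchers:
        - service=~"test"        # regex match, replaces match_re
    - receiver: 'mails'
      group_by: [product, environment]
      matchers:
        - team="frontend"        # exact match, replaces match

inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'dev', 'instance']
```

The `team="frontend"` matcher lines up with the `team: frontend` label set in the memory rule earlier, so that alert would be routed to the `mails` receiver.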
4. Deploying Alertmanager
Method 1: binary deployment
Download: https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
Start the service: ./alertmanager --config.file=alertmanager.yml --cluster.advertise-address=0.0.0.0:9093
Note:
To check whether the Alertmanager settings in Prometheus have taken effect, open: http://11.0.1.141:9099/config
From: https://www.cnblogs.com/aroin/p/17061207.html