首页 > 其他分享 >prometheus 监控并发告警给企业微信

prometheus 监控并发告警给企业微信

时间:2024-05-24 17:56:08浏览次数:27  
标签:xxx name 微信 labels 7100 instance prometheus 告警

一、部署prometheus

采集系统数据的工具

时序图

1.1、部署node_exporte

node_exporter 是prometheus的一部分,用来装在被监控的服务器上

下载地址:https://prometheus.io/download/

# 1、解压安装包
tar -zxvf node_exporter-1.7.0.linux-amd64.tar.gz
# 2、启动 默认监听9100端口 建议使用7100 防止跟现有程序冲突
# 2.1 先查看一下是端口是否被占用:netstat -tlnp | grep 7100
nohup ./node_exporter --web.listen-address=":7100" &
# 3、查看7100是否打开
sudo firewall-cmd --list-ports
# 4、打开7100端口
sudo firewall-cmd --zone=public --add-port=7100/tcp --permanent
# 5、刷新端口
sudo firewall-cmd --reload

1.2 、部署prometheus主程序

Prometheus 是主要采集端,安装的服务器 不能是被采集的对象

下载地址:https://prometheus.io/download/

# 1、解压安装包
tar -zxvf prometheus-2.45.4.linux-amd64.tar.gz
# 2、配置 prometheus.yml

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]

  - job_name: '数据库:176'
    static_configs:
      - targets: ['xxx.xxx.xxx.176:7100']
        labels:
          instance: 'xxx.xxx.xxx.176'

  - job_name: '数据库:178'
    static_configs:
      - targets: ['xxx.xxx.xxx.178:7100']
        labels:
          instance: 'xxx.xxx.xxx.178'

  - job_name: '中间件:103'
    static_configs:
      - targets: ['xxx.xxx.xxx.103:7100']
        labels:
          instance: 'xxx.xxx.xxx.103'
        
  - job_name: '中间件:205'
    static_configs:
      - targets: ['xxx.xxx.xxx.205:7100']
        labels:
          instance: 'xxx.xxx.xxx.205'

  - job_name: '前置机:105'
    static_configs:
      - targets: ['xxx.xxx.xxx.105:7100']
        labels:
          instance: 'xxx.xxx.xxx.105'

  - job_name: '前置机:192'
    static_configs:
      - targets: ['xxx.xxx.xxx.192:7100']
        labels:
          instance: 'xxx.xxx.xxx.192'

  - job_name: '应用:55'
    static_configs:
      - targets: ['xxx.xxx.xxx.55:7100']
        labels:
          instance: 'xxx.xxx.xxx.55'
  
  - job_name: '应用:121'
    static_configs:
      - targets: ['xxx.xxx.xxx.121:7100']
        labels:
          instance: 'xxx.xxx.xxx.121'

  - job_name: '应用:227'
    static_configs:
      - targets: ['xxx.xxx.xxx.227:7100']
        labels:
          instance: 'xxx.xxx.xxx.227'
                 
# 启动
nohup ./prometheus --config.file=prometheus.yml --web.listen-address=:7200 > prometheus.log 2>&1 &

二、部署grafana

grafanas是可视化和分析平台, 本身并不会监听数据,只是通过分析prometheus采集到的数据然后通过图形报表等方式直观的展示出来

# 1、在线的方式安装 【不推荐】
sudo yum install -y https://dl.grafana.com/enterprise/release/grafana-enterprise-10.4.0-1.x86_64.rpm

# 3、配置服务和自启动
# 3.1、重新加载Systemd的守护程序配置
sudo systemctl daemon-reload 
# 3.2、启动grafana
sudo systemctl start grafana-server
# 3.3、查看Grafana服务器的状态
sudo systemctl status grafana-server
# 3.4、配置成自启动
sudo systemctl enable grafana-server.service

firewall-cmd --zone=public --add-port=443/tcp --permanent

三、部署alertmanager

1、企业微信机器人

1.1 配置Prometheus

Prometheus.yml:

prometheus与 alertmanager通信的配置

alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - xxx.xxx.xxx.112:7201
# 报警规则文件配置
rule_files:
  - rules/*.yml

rules 目录下的文件配置

node_alived.yml :

groups:
- name: 实例存活告警规则
  rules:
  - alert: 实例存活告警
    expr: up == 0
    for: 5m
    labels:
      user: prometheus
      severity: warning
    annotations:
      summary: "主机宕机 !!!"
      description: "{{ $labels.instance }} :实例主机已经宕机超过一分钟了。"

memory_over.yml

groups:
- name: 内存报警规则
  rules:
  - alert: 内存使用率告警
    expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes))) * 100 > 50
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "服务器可用内存不足。"
      description: "{{$labels.instance}}:内存使用率已超过50%(当前值:{{ $value }}%)"

dis_over.yml 硬盘告警

groups:
- name: 磁盘使用率报警规则
  rules:
  - alert: 磁盘使用率告警
    expr: 100 - node_filesystem_free_bytes{fstype=~"xfs|ext4"} / node_filesystem_size_bytes{fstype=~"xfs|ext4"} * 100 > 80
    for: 20m
    labels:
      severity: warning
    annotations:
      summary: "硬盘分区使用率过高"
      description: "{{ $labels.instance }}:分区使用大于80%(当前值:{{ $value }}%)"

cpu_over.yml cpu 告警

groups:
- name: CPU报警规则
  rules:
  - alert: CPU使用率告警
    expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]) )) * 100 > 50
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "CPU使用率正在飙升。"
      description: "{{ $labels.instance }}: CPU使用率超过50%(当前值:{{ $value }}%)"

1.2 配置 alertmanager.yml

发送消息的配置

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'wechat'
  
receivers:
  - name: 'wechat'
    wechat_configs:
      - corp_id: 'wwba418e5da69b30d2'
        to_party: '1'
        agent_id: '1000010'
        api_secret: 'j1zYCwAjh2j9PJv2oopPpkhQQ1YvZSUjup53PUq2Tvs'
        to_user : '@all'

标签:xxx,name,微信,labels,7100,instance,prometheus,告警
From: https://www.cnblogs.com/kangjunyun/p/18211457

相关文章