I. Deploying Prometheus
Prometheus is a tool for collecting system metrics.
[Figure: sequence diagram]
1.1 Deploying node_exporter
node_exporter is part of the Prometheus ecosystem and is installed on each server that is to be monitored.
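# 1. Extract the archive and enter the extracted directory
tar -zxvf node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64
# 2. Start node_exporter. It listens on port 9100 by default; 7100 is used here to avoid conflicts with existing programs.
# 2.1 First check that the port is not already in use: netstat -tlnp | grep 7100
nohup ./node_exporter --web.listen-address=":7100" > node_exporter.log 2>&1 &
# 3. Check whether port 7100 is open in the firewall
sudo firewall-cmd --list-ports
# 4. Open port 7100
sudo firewall-cmd --zone=public --add-port=7100/tcp --permanent
# 5. Reload the firewall rules so the change takes effect
sudo firewall-cmd --reload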
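Once node_exporter is running, it can help to confirm that the endpoint really serves host metrics before moving on. The following is a minimal sketch, assuming `curl` is installed and the exporter listens on port 7100 as above; the helper name `count_node_metrics` is made up for this illustration.

```shell
#!/bin/sh
# count_node_metrics (hypothetical helper): reads Prometheus text-format
# metrics on stdin and prints how many sample lines start with "node_".
# "# HELP" and "# TYPE" lines are not counted because they start with "#".
count_node_metrics() {
  grep -c '^node_' || true   # grep -c prints 0 and exits 1 when nothing matches
}

# Example (run on the monitored host):
#   curl -s http://localhost:7100/metrics | count_node_metrics
```

A count of 0 would suggest that the exporter is not running or that the port is wrong.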
1.2 Deploying the Prometheus server
The Prometheus server is the main collector. The server it is installed on should not be one of the monitored targets.
# 1. Extract the archive and enter the extracted directory
tar -zxvf prometheus-2.45.4.linux-amd64.tar.gz
cd prometheus-2.45.4.linux-amd64
# 2. Configure prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      # Must match the --web.listen-address used at startup (7200 in this guide; the default is 9090).
      - targets: ["localhost:7200"]
  - job_name: 'database:176'
    static_configs:
      - targets: ['xxx.xxx.xxx.176:7100']
        labels:
          instance: 'xxx.xxx.xxx.176'
  - job_name: 'database:178'
    static_configs:
      - targets: ['xxx.xxx.xxx.178:7100']
        labels:
          instance: 'xxx.xxx.xxx.178'
  - job_name: 'middleware:103'
    static_configs:
      - targets: ['xxx.xxx.xxx.103:7100']
        labels:
          instance: 'xxx.xxx.xxx.103'
  - job_name: 'middleware:205'
    static_configs:
      - targets: ['xxx.xxx.xxx.205:7100']
        labels:
          instance: 'xxx.xxx.xxx.205'
  - job_name: 'frontend:105'
    static_configs:
      - targets: ['xxx.xxx.xxx.105:7100']
        labels:
          instance: 'xxx.xxx.xxx.105'
  - job_name: 'frontend:192'
    static_configs:
      - targets: ['xxx.xxx.xxx.192:7100']
        labels:
          instance: 'xxx.xxx.xxx.192'
  - job_name: 'app:55'
    static_configs:
      - targets: ['xxx.xxx.xxx.55:7100']
        labels:
          instance: 'xxx.xxx.xxx.55'
  - job_name: 'app:121'
    static_configs:
      - targets: ['xxx.xxx.xxx.121:7100']
        labels:
          instance: 'xxx.xxx.xxx.121'
  - job_name: 'app:227'
    static_configs:
      - targets: ['xxx.xxx.xxx.227:7100']
        labels:
          instance: 'xxx.xxx.xxx.227'
# 3. Start Prometheus, listening on port 7200 instead of the default 9090
nohup ./prometheus --config.file=prometheus.yml --web.listen-address=:7200 > prometheus.log 2>&1 &
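The scrape jobs in prometheus.yml all share one shape, so they can be generated rather than typed by hand. Here is a small sketch of that idea; the helper name `emit_job`, the job name `db:176` and the scratch file `jobs.yml` are illustrative assumptions, not part of the original setup.

```shell
#!/bin/sh
# emit_job (hypothetical helper): prints one scrape_configs entry in the same
# shape as the jobs in prometheus.yml.
# Arguments: <job_name> <ip> [port]   (port defaults to 7100)
emit_job() {
  name=$1
  ip=$2
  port=${3:-7100}
  printf "  - job_name: '%s'\n" "$name"
  printf "    static_configs:\n"
  printf "      - targets: ['%s:%s']\n" "$ip" "$port"
  printf "        labels:\n"
  printf "          instance: '%s'\n" "$ip"
}

# Example: write a job to a scratch file, then review it and merge it by hand.
#   emit_job 'db:176' 'xxx.xxx.xxx.176' >> jobs.yml
```

The Prometheus archive also ships `promtool`; running `./promtool check config prometheus.yml` before starting the server is a cheap way to catch indentation mistakes.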
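II. Deploying Grafana
Grafana is a visualization and analytics platform. It does not collect data itself; it queries the data that Prometheus has collected and presents it through graphs, reports and dashboards.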
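# 1. Install online (not recommended)
sudo yum install -y https://dl.grafana.com/enterprise/release/grafana-enterprise-10.4.0-1.x86_64.rpm
# 2. Configure the service and enable start on boot
# 2.1 Reload the systemd daemon configuration
sudo systemctl daemon-reload
# 2.2 Start Grafana
sudo systemctl start grafana-server
# 2.3 Check the status of the Grafana server
sudo systemctl status grafana-server
# 2.4 Enable start on boot
sudo systemctl enable grafana-server.service
# 2.5 Open the port Grafana is served on (443 here; Grafana's default HTTP port is 3000), then reload the firewall
sudo firewall-cmd --zone=public --add-port=443/tcp --permanent
sudo firewall-cmd --reload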
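After the service starts, Grafana's `/api/health` endpoint gives a quick way to check that it is up. The sketch below assumes Grafana is reachable on its default port 3000 (adjust the URL if it is served on 443 or behind a proxy); the helper name `grafana_db_ok` is made up for this illustration.

```shell
#!/bin/sh
# grafana_db_ok (hypothetical helper): reads the JSON body returned by
# Grafana's /api/health endpoint on stdin and succeeds only if it
# reports "database": "ok".
grafana_db_ok() {
  grep -q '"database"[[:space:]]*:[[:space:]]*"ok"'
}

# Example, assuming Grafana listens on its default port 3000:
#   curl -s http://localhost:3000/api/health | grafana_db_ok && echo "Grafana is healthy"
```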
III. Deploying Alertmanager
1. WeCom (Enterprise WeChat) bot
1.1 Configuring Prometheus
prometheus.yml:
Configuration for communication between Prometheus and Alertmanager:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - xxx.xxx.xxx.112:7201
# Alert rule files
rule_files:
  - rules/*.yml
Rule files in the rules directory:
node_alived.yml:
groups:
  - name: Instance liveness alert rules
    rules:
      - alert: Instance liveness alert
        expr: up == 0
        for: 5m
        labels:
          user: prometheus
          severity: warning
        annotations:
          summary: "Host is down!!!"
          description: "{{ $labels.instance }}: the instance has been down for more than 5 minutes."
memory_over.yml:
groups:
  - name: Memory alert rules
    rules:
      - alert: Memory usage alert
        expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes))) * 100 > 50
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Server is running low on available memory."
          description: "{{ $labels.instance }}: memory usage is above 50% (current value: {{ $value }}%)"
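To see what the memory expression measures, the same arithmetic can be reproduced by hand on a monitored host. This is only a sketch: it assumes a Linux host whose /proc/meminfo provides `MemTotal` and `MemAvailable` (the values node_exporter reports, in kB rather than bytes), and the helper name `mem_used_percent` is invented for the example.

```shell
#!/bin/sh
# mem_used_percent (hypothetical helper): reads /proc/meminfo-style text on
# stdin and prints (1 - MemAvailable / MemTotal) * 100 with one decimal place,
# the same arithmetic as the alert expression.
mem_used_percent() {
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END { if (total > 0) printf "%.1f\n", (1 - avail / total) * 100 }
  '
}

# Example (on a Linux host):
#   mem_used_percent < /proc/meminfo
```

With 8000000 kB total and 2000000 kB available, the result is 75.0, which would trip the 50% threshold.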
disk_over.yml (disk alert):
groups:
  - name: Disk usage alert rules
    rules:
      - alert: Disk usage alert
        expr: 100 - (node_filesystem_free_bytes{fstype=~"xfs|ext4"} / node_filesystem_size_bytes{fstype=~"xfs|ext4"}) * 100 > 80
        for: 20m
        labels:
          severity: warning
        annotations:
          summary: "Disk partition usage is too high"
          description: "{{ $labels.instance }}: partition usage is above 80% (current value: {{ $value }}%)"
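A rough local cross-check of the disk rule can be done with `df`. The sketch below is an approximation under stated assumptions: it uses the "Available" column of `df -P -k`, which excludes blocks reserved for root and therefore corresponds more closely to `node_filesystem_avail_bytes` than to the `node_filesystem_free_bytes` metric used in the rule, so the numbers may differ slightly. The helper name `disks_over` is hypothetical, and mount points containing spaces are not handled.

```shell
#!/bin/sh
# disks_over (hypothetical helper): reads `df -P -k` output on stdin and prints
# "<mount point> <used %>" for every filesystem whose usage, computed as
# 100 - available / size * 100, is above the threshold (default 80).
disks_over() {
  awk -v limit="${1:-80}" '
    NR > 1 && $2 > 0 {
      used = 100 - $4 / $2 * 100
      if (used > limit) printf "%s %.1f\n", $6, used
    }
  '
}

# Example:
#   df -P -k | disks_over 80
```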
cpu_over.yml (CPU alert):
groups:
  - name: CPU alert rules
    rules:
      - alert: CPU usage alert
        expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]))) * 100 > 50
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage is spiking."
          description: "{{ $labels.instance }}: CPU usage is above 50% (current value: {{ $value }}%)"
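The CPU expression treats usage as everything that is not idle time. A simplified worked version of that arithmetic, using two samples of idle and total CPU time (for example taken from /proc/stat a few seconds apart), might look like the sketch below. It is an approximation of what `irate` plus `avg by (instance)` computes, not an exact reimplementation, and the helper name `cpu_used_percent` is hypothetical.

```shell
#!/bin/sh
# cpu_used_percent (hypothetical helper): given idle and total CPU time from
# two samples, prints 100 - (idle delta / total delta) * 100.
# Arguments: <idle1> <total1> <idle2> <total2>
cpu_used_percent() {
  awk -v i1="$1" -v t1="$2" -v i2="$3" -v t2="$4" 'BEGIN {
    dt = t2 - t1
    if (dt <= 0) exit 1   # the two samples must be taken at different times
    printf "%.1f\n", 100 - (i2 - i1) / dt * 100
  }'
}

# Example: idle time grew by 75 units while total time grew by 100 units.
#   cpu_used_percent 100 200 175 300
```

In that example 75% of the elapsed CPU time was idle, so usage is 25% and the alert would not fire.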
1.2 Configuring alertmanager.yml
Configuration for sending messages:
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'wechat'
receivers:
  - name: 'wechat'
    wechat_configs:
      # Replace corp_id, agent_id and api_secret with the values of your own WeCom application.
      - corp_id: 'wwba418e5da69b30d2'
        to_party: '1'
        agent_id: '1000010'
        api_secret: 'j1zYCwAjh2j9PJv2oopPpkhQQ1YvZSUjup53PUq2Tvs'
        to_user: '@all'
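To test the route and the WeCom receiver without waiting for a real outage, a test alert can be pushed to Alertmanager directly. The following is a minimal sketch, assuming Alertmanager is reachable on port 7201 (the port used in the Prometheus `alerting` section) and that its v2 HTTP API is enabled; the helper name `build_test_alert` and the values `TestAlert` and `demo-host` are placeholders.

```shell
#!/bin/sh
# build_test_alert (hypothetical helper): prints a JSON payload containing a
# single alert in the format accepted by Alertmanager's v2 API.
# Arguments: <alertname> <instance>
build_test_alert() {
  printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"warning"},' "$1" "$2"
  printf '"annotations":{"summary":"test alert"}}]\n'
}

# Example, run on the Alertmanager host:
#   curl -s -X POST -H 'Content-Type: application/json' \
#     -d "$(build_test_alert TestAlert demo-host)" \
#     http://localhost:7201/api/v2/alerts
```

With `group_wait: 30s`, the notification should be sent roughly half a minute after the alert is received.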
From: https://www.cnblogs.com/kangjunyun/p/18211457