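1. Process monitoring
To monitor processes on a host, such as the chronyd and sshd service processes or the running state of custom scripts, node exporter is not enough. Process exporter is the tool for monitoring process state.
Project page: https://github.com/ncabatoff/process-exporter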
2. Installing process-exporter
Downloads: https://github.com/ncabatoff/process-exporter/releases
2.1 Binary install
wget https://github.com/ncabatoff/process-exporter/releases/download/v0.7.10/process-exporter-0.7.10.linux-amd64.tar.gz
tar zxvf process-exporter-0.7.10.linux-amd64.tar.gz
mkdir /opt/prometheus -p
mv process-exporter-0.7.10.linux-amd64 /opt/prometheus/process_exporter
ls -l /opt/prometheus/process_exporter

# Create the service user
useradd -M -s /usr/sbin/nologin prometheus

# Fix directory ownership
chown prometheus:prometheus -R /opt/prometheus

# Write the config file (monitor every process)
# ">" is used so that re-running this step does not append a duplicate process_names key
cat > /opt/prometheus/process_exporter/process.yml <<"EOF"
process_names:
  - name: "{{.Comm}}"   # naming template
    cmdline:
    - '.+'              # regex matching any command line
EOF

# Create the systemd unit
cat <<"EOF" >/etc/systemd/system/process_exporter.service
[Unit]
Description=process_exporter
After=network.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/opt/prometheus/process_exporter/process-exporter -config.path=/opt/prometheus/process_exporter/process.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Start the service and enable it at boot
systemctl daemon-reload
systemctl start process_exporter
systemctl enable process_exporter
2.2 Docker install
Process-Exporter works from a configured list of process names: it searches for the matching processes and collects metrics from them, much like running ps -efl | grep xxx by hand.

# Create the data directory
mkdir /data/process_exporter -p
cd /data/process_exporter

# Create the config file (monitor every process)
mkdir config
cat > config/process.yml <<"EOF"
process_names:
  - name: "{{.Comm}}"   # naming template
    cmdline:
    - '.+'              # regex matching any command line
EOF

To monitor only specific processes, use the following as the content of config/process.yml instead:

process_names:
#  - name: "{{.Comm}}"
#    cmdline:
#    - '.+'
  - name: "{{.Matches}}"
    cmdline:
    - 'nginx'   # unique identifier
  - name: "{{.Matches}}"
    cmdline:
    - 'mongod'
  - name: "{{.Matches}}"
    cmdline:
    - 'mysqld'
  - name: "{{.Matches}}"
    cmdline:
    - 'redis-server'
  - name: "{{.Matches}}"
    cmdline:
    - 'org.apache.zookeeper.server.quorum.QuorumPeerMain'
  - name: "{{.Matches}}"
    cmdline:
    - 'org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer'
  - name: "{{.Matches}}"
    cmdline:
    - 'org.apache.hadoop.hdfs.qjournal.server.JournalNode'

Note: cmdline is the unique identifier of the selected process and can be looked up with ps -ef. If the process does not exist, no data is collected for it.
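Before adding a cmdline pattern to the config, it can help to preview which running processes it would select. The sketch below only approximates the exporter: it applies an awk extended regex to the command column of ps output, while process-exporter itself uses Go regular expressions, so edge cases may differ. The function name preview_match is made up for this illustration.

```shell
#!/bin/sh
# preview_match PATTERN
# Reads "PID COMMAND..." lines on stdin and prints the lines whose
# command part matches PATTERN (awk extended regex).
preview_match() {
    awk -v pat="$1" '{
        cmd = $0
        sub(/^[ \t]*[0-9]+[ \t]+/, "", cmd)   # drop the leading PID column
        if (cmd ~ pat) print $0
    }'
}

# Usage against the live process list:
# ps -eo pid,args | preview_match 'redis-server'
```

As with ps | grep, the awk process itself may show up in the output because the pattern appears in its own command line.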
2.3 Start with docker or docker-compose (pick one)
With docker:

docker run -d -p 9256:9256 --privileged -v /proc:/host/proc -v "$(pwd)/config":/config --name process-exporter ncabatoff/process-exporter --procfs /host/proc -config.path /config/process.yml

With docker-compose. The command line only carries flags, as in the docker run form above, and points at the config/process.yml file created in 2.2:

cat > docker-compose.yaml <<"EOF"
version: '3'
services:
  process-exporter:
    image: ncabatoff/process-exporter
    container_name: process-exporter
    restart: always
    privileged: true
    volumes:
      - /proc:/host/proc
      - ./config:/config
    command: --procfs /host/proc -config.path /config/process.yml
    ports:
      - "9256:9256"
EOF

Start: docker-compose up -d
Check:
http://192.168.10.100:9256/metrics
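The same check can be scripted. A minimal sketch, assuming the exporter address used above and assuming groupname is the only label on namedprocess_namegroup_num_procs; the helper name num_procs is hypothetical:

```shell
#!/bin/sh
# num_procs GROUP
# Reads Prometheus exposition text on stdin and prints the value of
# namedprocess_namegroup_num_procs for the given groupname.
num_procs() {
    awk -v g="$1" '
        index($0, "namedprocess_namegroup_num_procs{groupname=\"" g "\"}") == 1 {
            print $NF   # the sample value is the last field on the line
        }
    '
}

# Usage:
# curl -s http://192.168.10.100:9256/metrics | num_procs sshd
```

An empty result means the exporter is not reporting a group with that name, which usually points at the process_names configuration rather than at the process itself.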
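2.4 Configuration reference
- Naming templates
Template | Meaning
{{.Comm}} | Base name of the original executable, i.e. the second field of /proc/<pid>/stat
{{.ExeBase}} | Base name of the executable (default)
{{.ExeFull}} | Full path of the executable
{{.Username}} | User name of the process owner
{{.Matches}} | All matches produced by applying the cmdline regular expressions (recommended)
{{.PID}} | PID of the process; a group named by PID contains only one process (not recommended)
{{.StartTime}} | Start time of the process (not recommended)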
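As far as I can tell from the project README, .Matches is a map filled from named capture groups in the cmdline regular expressions, so a pattern without captures leaves it empty. The following sketch writes an example config that uses a fixed group name for one entry and a named capture for another. The scratch path, the capture name Main, and the group names are assumptions for illustration only.

```shell
#!/bin/sh
# Write an example config to a scratch path (hypothetical location).
cfg="${TMPDIR:-/tmp}/process-exporter-example.yml"
cat > "$cfg" <<"EOF"
process_names:
  # A literal name: every matching process lands in the group "nginx".
  - name: "nginx"
    cmdline:
    - 'nginx'
  # A named capture: the group is named after the captured Java main class.
  - name: "java:{{.Matches.Main}}"
    comm:
    - java
    cmdline:
    - '(?P<Main>org\.apache\.\S+)'
EOF
```

With this layout a ZooKeeper process would be expected to appear as groupname="java:org.apache.zookeeper.server.quorum.QuorumPeerMain", which is easier to filter on in alerts than a printed map.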
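3. Prometheus configuration
Append the job to prometheus.yml. This assumes scrape_configs is the last section of the file, and the reload endpoint only works when Prometheus runs with --web.enable-lifecycle.

cd /data/docker-prometheus
cat >> prometheus/prometheus.yml <<"EOF"

  - job_name: 'process'
    scrape_interval: 30s
    scrape_timeout: 15s
    static_configs:
    - targets: ['192.168.10.100:9256']
EOF

# Reload the configuration
curl -X POST http://localhost:9090/-/reload

Check: http://192.168.10.14:9090/targets?search=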
3.1 Metrics reference
All metrics share the namedprocess_ prefix.

# Zombie processes
namedprocess_namegroup_states{state="Zombie"}

# Number of context switches (Counter)
namedprocess_namegroup_context_switches_total

# CPU user/system time in seconds (Counter)
namedprocess_namegroup_cpu_seconds_total

# Major page faults (Counter)
namedprocess_namegroup_major_page_faults_total

# Minor page faults (Counter)
namedprocess_namegroup_minor_page_faults_total

# Memory usage in bytes (Gauge)
namedprocess_namegroup_memory_bytes

# Number of processes in the group (Gauge)
namedprocess_namegroup_num_procs

# Distribution of process states in the group (Gauge)
namedprocess_namegroup_states

# Number of threads (Gauge)
namedprocess_namegroup_num_threads

# Start timestamp of the oldest process in the group (Gauge)
namedprocess_namegroup_oldest_start_time_seconds

# Number of open file descriptors (Gauge)
namedprocess_namegroup_open_filedesc

# Open file descriptors / allowed file descriptors (Gauge)
namedprocess_namegroup_worst_fd_ratio

# Bytes read (Counter)
namedprocess_namegroup_read_bytes_total

# Bytes written (Counter)
namedprocess_namegroup_write_bytes_total

# Number of threads waiting on each kernel wchan (Gauge)
namedprocess_namegroup_threads_wchan
3.2 Common metrics
Metric | Meaning
namedprocess_namegroup_num_procs | Number of running processes
namedprocess_namegroup_num_threads | Number of threads
namedprocess_namegroup_states | Number of processes in the Running/Sleeping/Other/Zombie states
namedprocess_namegroup_cpu_seconds_total | Process CPU time (utime, stime) read from /proc/[pid]/stat
namedprocess_namegroup_read_bytes_total | Bytes read by the process, from /proc/[pid]/io
namedprocess_namegroup_write_bytes_total | Bytes written by the process, from /proc/[pid]/io
namedprocess_namegroup_memory_bytes | Memory used by the process, in bytes
namedprocess_namegroup_open_filedesc | Number of file descriptors used by the process
namedprocess_namegroup_worst_fd_ratio | File descriptor usage ratio of the process
namedprocess_namegroup_thread_count | Number of running threads
namedprocess_namegroup_thread_cpu_seconds_total | Thread CPU time
namedprocess_namegroup_thread_io_bytes_total | Thread I/O bytes
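These metrics can also be queried from the command line through the Prometheus HTTP API. A small sketch, assuming Prometheus listens on localhost:9090 as in the reload command, and using the memtype="resident" label value of namedprocess_namegroup_memory_bytes; the helper name top_memory_query is hypothetical:

```shell
#!/bin/sh
# top_memory_query N
# Prints a PromQL expression selecting the N process groups that use
# the most resident memory.
top_memory_query() {
    printf 'topk(%d, namedprocess_namegroup_memory_bytes{memtype="resident"})\n' "$1"
}

# Usage: let curl URL-encode the expression and send it to the query API.
# curl -sG http://localhost:9090/api/v1/query \
#      --data-urlencode "query=$(top_memory_query 5)"
```

Keeping the expression in a function makes it easy to paste the same text into the Prometheus expression browser while debugging.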
3.3 Adding alert rules
cat > prometheus/rules/process.yml <<"EOF"
groups:
- name: process
  rules:
  - alert: TooManyProcesses
    expr: sum(namedprocess_namegroup_states) by (instance) > 1000
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "More than 1000 processes"
      description: "The server currently has {{ $value }} processes"
  - alert: ZombieProcesses
    expr: sum by(instance, groupname) (namedprocess_namegroup_states{state="Zombie"}) > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Zombie processes present"
      description: "Process group {{ $labels.groupname }} has {{ $value }} zombie processes"
  - alert: ProcessRestarted
    expr: ceil(time() - max by(instance, groupname) (namedprocess_namegroup_oldest_start_time_seconds)) < 60
    for: 15s
    labels:
      severity: warning
    annotations:
      summary: "Process restarted"
      description: "Process group {{ $labels.groupname }} restarted {{ $value }} seconds ago"
  - alert: ProcessExited
    expr: max by(instance, groupname) (delta(namedprocess_namegroup_oldest_start_time_seconds{groupname=~"^java.*|^nginx.*"}[1d])) < 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Process exited"
      description: "Process group {{ $labels.groupname }} has exited"
  - alert: HighOpenFileDescriptors
    expr: namedprocess_namegroup_worst_fd_ratio * 100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Process file descriptor usage is too high"
      description: "Process group {{ $labels.groupname }} has too many open file descriptors"
EOF
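A syntax error in the rule file only surfaces when Prometheus reloads, so it can be worth validating the file first. The sketch below wraps promtool check rules, the validator shipped with Prometheus. It assumes promtool is on the PATH of the machine where the file lives; with a containerized Prometheus it would more likely be run inside the container. The function name check_rules is made up for this example.

```shell
#!/bin/sh
# check_rules FILE
# Returns 2 if FILE is missing, 3 if promtool is not installed,
# otherwise the exit status of "promtool check rules FILE".
check_rules() {
    file="$1"
    if [ ! -f "$file" ]; then
        echo "no such file: $file" >&2
        return 2
    fi
    if ! command -v promtool >/dev/null 2>&1; then
        echo "promtool not found in PATH" >&2
        return 3
    fi
    promtool check rules "$file"
}

# Usage: reload only when the rules validate.
# check_rules prometheus/rules/process.yml && curl -X POST http://localhost:9090/-/reload
```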
Reload:
curl -X POST http://localhost:9090/-/reload
Check: http://192.168.10.14:9090/alerts?search=
4. Grafana dashboard
https://grafana.com/grafana/dashboards/8378-system-processes-metrics/
The panels "Top processes by Total CPU cores used" and "Top processes by System CPU cores used" do not render correctly.
Since process-exporter 0.5.0, namedprocess_namegroup_cpu_user_seconds_total and namedprocess_namegroup_cpu_system_seconds_total have been merged into a single metric, namedprocess_namegroup_cpu_seconds_total, distinguished by the mode label:
namedprocess_namegroup_cpu_user_seconds_total became namedprocess_namegroup_cpu_seconds_total{mode="user"}
namedprocess_namegroup_cpu_system_seconds_total became namedprocess_namegroup_cpu_seconds_total{mode="system"}
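The raw metric is a cumulative counter in CPU seconds. The percentages in the table refer to its per-second rate, rate(...) * 100, relative to one core.
Metric | Meaning | Unit | Notes
namedprocess_namegroup_cpu_seconds_total{mode="system"} | Current kernel-space CPU usage | % | Cost of system context switching. A high value may indicate that the server is running too many processes or threads.
namedprocess_namegroup_cpu_seconds_total{mode="user"} | Current user-space CPU usage | % | CPU consumed by user processes.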
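The conversion from the counter to a percentage is plain arithmetic: the increase in CPU seconds divided by the elapsed wall-clock seconds, times 100. A sketch of that calculation for two samples of the counter, with the helper name cpu_percent and the sample values chosen purely for illustration:

```shell
#!/bin/sh
# cpu_percent V1 V2 SECONDS
# V1 and V2 are two readings of namedprocess_namegroup_cpu_seconds_total
# taken SECONDS apart; prints the share of one core used in between.
cpu_percent() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (b - a) / s * 100 }'
}

# 15 CPU-seconds consumed over a 30 s window is half a core.
cpu_percent 120.0 135.0 30   # prints 50.0
```

This is what rate() computes over the chosen range, which is why a multi-threaded process group can exceed 100.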
Fix:
Change the query of the "Top processes by System CPU cores used" panel to:

topk(5, rate(namedprocess_namegroup_cpu_seconds_total{mode="system",groupname=~"$processes",instance=~"$host"}[$interval]) or ( irate(namedprocess_namegroup_cpu_seconds_total{mode="system",groupname=~"$processes",instance=~"$host"}[5m])))

Change the query of the "Top processes by Total CPU cores used" panel to:

topk(5,sum by (groupname,instance) (rate(namedprocess_namegroup_cpu_seconds_total{groupname=~"$processes",instance=~"$host"}[$interval])) or sum by (groupname,instance) (irate(namedprocess_namegroup_cpu_seconds_total{groupname=~"$processes",instance=~"$host"}[5m])))

Alternatively, rename that panel to "Top processes by User CPU cores used" and rank by user-space CPU usage:

topk(5, rate(namedprocess_namegroup_cpu_seconds_total{mode="user",groupname=~"$processes",instance=~"$host"}[$interval]) or ( irate(namedprocess_namegroup_cpu_seconds_total{mode="user",groupname=~"$processes",instance=~"$host"}[5m])))

After these changes the panels display data again.
Source: https://www.cnblogs.com/yangmeichong/p/18156518