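After installing prometheus-operator, open the Alerts page of the Prometheus UI. It lists many alert rules, and some of them are already firing.
But where do these alerts come from, and how should they reach us? In a plain Prometheus setup, the AlertManager instances and the alert rules files are specified in the Prometheus configuration file. What about a deployment managed by the Operator? We can inspect the AlertManager configuration on the Config page of the Prometheus Dashboard: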
    alerting:
      alert_relabel_configs:
      - separator: ;
        regex: prometheus_replica
        replacement: $1
        action: labeldrop
      alertmanagers:
      - kubernetes_sd_configs:
        - role: endpoints
          namespaces:
            names:
            - monitoring
        scheme: http
        path_prefix: /
        timeout: 10s
        api_version: v1
        relabel_configs:
        - source_labels: [__meta_kubernetes_service_name]
          separator: ;
          regex: alertmanager-main
          replacement: $1
          action: keep
        - source_labels: [__meta_kubernetes_endpoint_port_name]
          separator: ;
          regex: web
          replacement: $1
          action: keep
    rule_files:
    - /etc/prometheus/rules/prometheus-k8s-rulefiles-0/*.yaml
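The alertmanagers section shows that the instances are found through Kubernetes service discovery with the endpoints role. The two keep rules retain only endpoints whose Service is named alertmanager-main and whose port is named web. Let's look at the alertmanager-main Service: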
    $ kubectl describe -n monitoring svc alertmanager-main
    Name:              alertmanager-main
    Namespace:         monitoring
    Labels:            alertmanager=main
    Annotations:       kubectl.kubernetes.io/last-applied-configuration:
                         {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"alertmanager":"main"},"name":"alertmanager-main","namespace":"...
    Selector:          alertmanager=main,app=alertmanager
    Type:              ClusterIP
    IP:                10.16.131.214
    Port:              web  9093/TCP
    TargetPort:        web/TCP
    Endpoints:         10.103.74.7:9093,10.103.75.9:9093,10.103.76.7:9093
    Session Affinity:  ClientIP
    Events:            <none>
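To make the effect of the two keep rules concrete, here is a minimal Python sketch of the filtering logic. It is only an illustration, not how Prometheus is implemented; the candidate label sets are made up, and the only behavior borrowed from Prometheus is that relabel regexes are fully anchored.

```python
import re

# The two `keep` rules from the alertmanagers relabel_configs shown earlier.
KEEP_RULES = [
    ("__meta_kubernetes_service_name", "alertmanager-main"),
    ("__meta_kubernetes_endpoint_port_name", "web"),
]


def keep_target(labels, rules=KEEP_RULES):
    """Return True only if every keep rule's regex fully matches its label value."""
    for source_label, regex in rules:
        value = labels.get(source_label, "")
        # Prometheus anchors relabel regexes, so a full match is required.
        if re.fullmatch(regex, value) is None:
            return False
    return True


# Hypothetical discovered endpoints (label values invented for illustration).
candidates = [
    {"__meta_kubernetes_service_name": "alertmanager-main",
     "__meta_kubernetes_endpoint_port_name": "web"},
    {"__meta_kubernetes_service_name": "alertmanager-main",
     "__meta_kubernetes_endpoint_port_name": "mesh"},
    {"__meta_kubernetes_service_name": "grafana",
     "__meta_kubernetes_endpoint_port_name": "http"},
]

kept = [c for c in candidates if keep_target(c)]
print(len(kept), "of", len(candidates), "endpoints kept")
```

Only the first candidate survives, because an endpoint is dropped as soon as either rule fails to match.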
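The Service is indeed named alertmanager-main and its port is named web, which matches the rules above, so Prometheus and AlertManager are wired together correctly. The alert rules are all the YAML files under the /etc/prometheus/rules/prometheus-k8s-rulefiles-0/ directory. We can enter the Prometheus Pod to verify that the directory contains YAML files: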
    $ kubectl exec -it prometheus-k8s-0 -n monitoring -- /bin/sh
    Defaulting container name to prometheus.
    Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
    /prometheus $ ls /etc/prometheus/rules/prometheus-k8s-rulefiles-0/
    monitoring-prometheus-k8s-rules.yaml
    /prometheus $ cat /etc/prometheus/rules/prometheus-k8s-rulefiles-0/monitoring-prometheus-k8s-rules.yaml
    groups:
    - name: k8s.rules
      rules:
      - expr: |
          sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])) by (namespace)
        record: namespace:container_cpu_usage_seconds_total:sum_rate
This YAML file is in fact the content of a PrometheusRule object we created earlier:
    $ cat prometheus-rules.yaml
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        prometheus: k8s
        role: alert-rules
      name: prometheus-k8s-rules
      namespace: monitoring
    spec:
      groups:
      - name: k8s.rules
        rules:
        - expr: |
            sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])) by (namespace)
          record: namespace:container_cpu_usage_seconds_total:sum_rate
Here the PrometheusRule is named prometheus-k8s-rules and lives in the monitoring namespace. From this we can guess that once a PrometheusRule object is created, a matching <namespace>-<name>.yaml file is generated automatically under the prometheus-k8s-rulefiles-0 directory. So whenever we need a custom alert, defining a PrometheusRule object is enough. Why does Prometheus pick up this PrometheusRule object? The answer is in the prometheus resource object we created: it has an important attribute, ruleSelector, which filters rule objects and requires them to carry the labels prometheus=k8s and role=alert-rules:
    ruleSelector:
      matchLabels:
        prometheus: k8s
        role: alert-rules
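The selection and naming behavior described here can be sketched in a few lines of Python. This is a simplified model, not the Operator's code: `selector_matches` and `rule_file_name` are hypothetical helper names, and the file name pattern is the one observed in this article, which other Operator versions may not follow.

```python
# matchLabels taken from the ruleSelector of the prometheus resource.
RULE_SELECTOR = {"prometheus": "k8s", "role": "alert-rules"}


def selector_matches(match_labels, object_labels):
    """matchLabels semantics: every selector pair must be present on the object."""
    return all(object_labels.get(key) == value for key, value in match_labels.items())


def rule_file_name(namespace, name):
    """File name as observed in this article: <namespace>-<name>.yaml."""
    return f"{namespace}-{name}.yaml"


rule_labels = {"prometheus": "k8s", "role": "alert-rules"}
if selector_matches(RULE_SELECTOR, rule_labels):
    print(rule_file_name("monitoring", "prometheus-k8s-rules"))
```

Extra labels on the PrometheusRule do not prevent a match; a missing or different value for either selector label does.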
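So to define a custom alert rule, we only need to create a PrometheusRule object with the labels prometheus=k8s and role=alert-rules. For example, let's add an alert that fires when disk usage on a cluster node exceeds 85%: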
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        prometheus: k8s
        role: alert-rules
      name: disk-free-rules
      namespace: monitoring
    spec:
      groups:
      - name: disk
        rules:
        - alert: diskFree
          annotations:
            summary: "Instance {{ $labels.instance }} of job {{ $labels.job }}: disk usage is above 85%"
            description: "{{ $labels.instance }} {{ $labels.mountpoint }} disk usage is above 85% (current value: {{ $value }}%), please handle it promptly"
          expr: |
            (1-(node_filesystem_free_bytes{fstype=~"ext4|xfs",mountpoint!="/boot"} / node_filesystem_size_bytes{fstype=~"ext4|xfs",mountpoint!="/boot"}) )*100 > 85
          for: 3m
          labels:
            level: disaster
Note that the labels must include at least prometheus=k8s and role=alert-rules. After creating the object, wait a moment and check the rules folder in the container again:
    /etc/prometheus/rules/prometheus-k8s-rulefiles-0 $ ls
    monitoring-disk-free-rules.yaml  monitoring-prometheus-k8s-rules.yaml
The rule file we created has been injected into the rulefiles folder, which confirms the assumption above. The new alert rule now also appears on the Alerts page of the Prometheus Dashboard.
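The arithmetic behind the diskFree expression is easy to check by hand. The sketch below reproduces only the formula (1 - free / size) * 100 > 85 in Python; the function names and the sample byte counts are invented for illustration, and the `for: 3m` hold time is not modeled.

```python
GIB = 2 ** 30


def disk_usage_percent(free_bytes, size_bytes):
    """Same formula as the rule: (1 - free / size) * 100."""
    return (1 - free_bytes / size_bytes) * 100


def exceeds_threshold(free_bytes, size_bytes, threshold=85.0):
    """True when usage is strictly above the threshold, like `> 85` in the rule."""
    return disk_usage_percent(free_bytes, size_bytes) > threshold


# Hypothetical filesystems: 100 GiB total with 10 GiB or 20 GiB free.
for free_gib in (10, 20):
    usage = disk_usage_percent(free_gib * GIB, 100 * GIB)
    print(f"{free_gib} GiB free -> {usage:.1f}% used, "
          f"alert condition: {exceeds_threshold(free_gib * GIB, 100 * GIB)}")
```

In Prometheus the condition additionally has to stay true for three minutes before the alert moves from pending to firing.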
Configuring alert notifications
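We now know how to add an alert rule, but how are the alerts actually delivered? With a standalone AlertManager we would configure receivers in its configuration file. Here the component is created from the alertmanager resource object provided by the Operator, so how do we change its configuration?
First, create an Ingress for the alertmanager-main Service. Once it is applied, the AlertManager configuration can be viewed under the status path of the web page: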
    $ cat ingress.yml
    # Note: extensions/v1beta1 Ingress was removed in Kubernetes 1.22;
    # newer clusters need networking.k8s.io/v1 and its backend.service syntax.
    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: kube-prometheus
      namespace: monitoring
    spec:
      rules:
      - host: prometheus.zsf.com
        http:
          paths:
          - path: /
            backend:
              serviceName: prometheus-k8s
              servicePort: 9090
      - host: grafana.zsf.com
        http:
          paths:
          - path: /
            backend:
              serviceName: grafana
              servicePort: 3000
      - host: alertmanager.zsf.com
        http:
          paths:
          - path: /
            backend:
              serviceName: alertmanager-main
              servicePort: 9093
The configuration actually comes from alertmanager/alertmanager-secret.yaml:
    # cat alertmanager/alertmanager-secret.yaml
    apiVersion: v1
    data:
      alertmanager.yaml: Imdsb2JhbCI6CiAgInJlc29sdmVfdGltZW91dCI6ICI1bSIKInJlY2VpdmVycyI6Ci0gIm5hbWUiOiAibnVsbCIKInJvdXRlIjoKICAiZ3JvdXBfYnkiOgogIC0gImpvYiIKICAiZ3JvdXBfaW50ZXJ2YWwiOiAiNW0iCiAgImdyb3VwX3dhaXQiOiAiMzBzIgogICJyZWNlaXZlciI6ICJudWxsIgogICJyZXBlYXRfaW50ZXJ2YWwiOiAiMTJoIgogICJyb3V0ZXMiOgogIC0gIm1hdGNoIjoKICAgICAgImFsZXJ0bmFtZSI6ICJXYXRjaGRvZyIKICAgICJyZWNlaXZlciI6ICJudWxsIg==
    kind: Secret
    metadata:
      name: alertmanager-main
      namespace: monitoring
    type: Opaque
Let's base64-decode the alertmanager.yaml value:
    $ echo 'Imdsb2JhbCI6CiAgInJlc29sdmVfdGltZW91dCI6ICI1bSIKInJlY2VpdmVycyI6Ci0gIm5hbWUiOiAibnVsbCIKInJvdXRlIjoKICAiZ3JvdXBfYnkiOgogIC0gImpvYiIKICAiZ3JvdXBfaW50ZXJ2YWwiOiAiNW0iCiAgImdyb3VwX3dhaXQiOiAiMzBzIgogICJyZWNlaXZlciI6ICJudWxsIgogICJyZXBlYXRfaW50ZXJ2YWwiOiAiMTJoIgogICJyb3V0ZXMiOgogIC0gIm1hdGNoIjoKICAgICAgImFsZXJ0bmFtZSI6ICJXYXRjaGRvZyIKICAgICJyZWNlaXZlciI6ICJudWxsIg==' | base64 -d
    "global":
      "resolve_timeout": "5m"
    "receivers":
    - "name": "null"
    "route":
      "group_by":
      - "job"
      "group_interval": "5m"
      "group_wait": "30s"
      "receiver": "null"
      "repeat_interval": "12h"
      "routes":
      - "match":
          "alertname": "Watchdog"
        "receiver": "null"
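The same encoding step can be reproduced with Python's standard library, which helps when checking by hand what a Secret's data field will contain. This is a small sketch; `encode_secret_value` and `decode_secret_value` are hypothetical helper names, and the sample text is just the first line of the decoded configuration.

```python
import base64


def encode_secret_value(text):
    """Encode text the way it is stored under `data` in a Secret manifest."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded):
    """Reverse of encode_secret_value, equivalent to `base64 -d`."""
    return base64.b64decode(encoded).decode("utf-8")


first_line = '"global":'
encoded = encode_secret_value(first_line)
print(encoded)                       # matches the first 12 characters of the Secret value
print(decode_secret_value(encoded))  # round-trips back to the original text
```

Base64 is an encoding, not encryption, so anyone who can read the Secret can recover the configuration.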
The content matches the configuration we saw on the status page, so to add our own receivers or message templates we can change this file:
    # cat alertmanager.yaml
    global:
      resolve_timeout: 5m
    receivers:
    - name: dingtalk-webhook
      webhook_configs:
      - send_resolved: true
        url: http://dingtalk-webhook:8060/dingtalk/guiji/send
    route:
      group_by:
      - job
      group_interval: 5m
      group_wait: 30s
      receiver: dingtalk-webhook
      repeat_interval: 12h
      routes:
      - receiver: dingtalk-webhook
        group_wait: 10s
Save the file above as alertmanager.yaml, then use it to create a Secret object:
    # First delete the existing Secret object
    $ kubectl delete secret alertmanager-main -n monitoring
    secret "alertmanager-main" deleted
    $ kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring
    secret "alertmanager-main" created
Configuring DingTalk alerts for prometheus-operator
Create the webhook configuration file:
    # vim dingTalk-webhook-configmap.yml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      namespace: monitoring
      name: dingtalk-webhook-config
    data:
      config.yml: |
        # Request timeout
        timeout: 5s

        ## Customizable templates path
        templates:
          - /etc/prometheus-webhook-dingtalk/templates/*.tmpl

        ## You can also override default template using `default_message`
        ## The following example to use the 'legacy' template from v0.3.0
        # default_message:
        #   title: '{{ template "legacy.title" . }}'
        #   text: '{{ template "legacy.content" . }}'

        ## Targets, previously was known as "profiles"
        targets:
          guiji:
            url: https://oapi.dingtalk.com/robot/send?access_token=5752a9d10727165d116b883b4e7d312b781a3ed90fefa5d1a8f4d61f06343a27
            message:
              title: '{{ template "ding.link.title" . }}'
              text: '{{ template "ding.link.content" . }}'
            mention:
              all: true
              mobiles: ['18001587880']
Create the alert template configuration file:
    # vim dingTalk-webhook-template.yml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      namespace: monitoring
      name: dingtalk-webhook-template
    data:
      template.tmpl: |
        {{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
        {{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}

        {{ define "__text_alert_list" }}{{ range . }}
        **Labels**
        {{ range .Labels.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
        {{ end }}
        **Annotations**
        {{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
        {{ end }}
        **Source:** [{{ .GeneratorURL }}]({{ .GeneratorURL }})
        {{ end }}{{ end }}

        {{ define "default.__text_alert_list" }}{{ range . }}
        ---
        **Severity:** {{ .Labels.severity | upper }}

        **Team:** {{ .Labels.team | upper }}

        **Started at:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

        **Event details:**
        {{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
        {{ end }}

        **Event labels:**
        {{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
        {{ end }}{{ end }}
        {{ end }}
        {{ end }}

        {{ define "default.__text_alertresovle_list" }}{{ range . }}
        ---
        **Severity:** {{ .Labels.severity | upper }}

        **Team:** {{ .Labels.team | upper }}

        **Started at:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

        **Ended at:** {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}

        **Event details:**
        {{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
        {{ end }}

        **Event labels:**
        {{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
        {{ end }}{{ end }}
        {{ end }}
        {{ end }}

        {{/* Default */}}
        {{ define "default.title" }}{{ template "__subject" . }}{{ end }}
        {{ define "default.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
        {{ if gt (len .Alerts.Firing) 0 -}}

        ![alert icon](https://ss0.bdstatic.com/70cFuHSh_Q1YnxGkpoWK1HF6hhy/it/u=3626076420,1196179712&fm=15&gp=0.jpg)

        **====Fault detected====**
        {{ template "default.__text_alert_list" .Alerts.Firing }}

        {{- end }}

        {{ if gt (len .Alerts.Resolved) 0 -}}
        {{ template "default.__text_alertresovle_list" .Alerts.Resolved }}

        {{- end }}
        {{- end }}

        {{/* Legacy */}}
        {{ define "legacy.title" }}{{ template "__subject" . }}{{ end }}
        {{ define "legacy.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
        {{ template "__text_alert_list" .Alerts.Firing }}
        {{- end }}

        {{/* Following names for compatibility */}}
        {{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}
        {{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}
Create the resource manifest for the webhook:
    # cat dingTalk-webhook-deployment.yml
    # apps/v1 is used here; extensions/v1beta1 no longer serves Deployments since Kubernetes 1.16
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      namespace: monitoring
      name: dingtalk-webhook
      labels:
        app: dingtalk-webhook
    spec:
      selector:
        matchLabels:
          app: dingtalk-webhook
      replicas: 1
      template:
        metadata:
          labels:
            app: dingtalk-webhook
        spec:
          containers:
          - name: dingtalk-webhook
            image: harbor.zsf.com/public/prometheus-webhook-dingtalk
            args:
            - --config.file=/etc/prometheus-webhook-dingtalk/config.yml
            #- --ding.profile=guiji=https://oapi.dingtalk.com/robot/send?access_token=5752a9d10727165d116b883b4e7d312b781a3ed90fefa5d1a8f4d61f06343a27
            ports:
            - containerPort: 8060
              protocol: TCP
            volumeMounts:
            # With subPath, mountPath is the full path of the single file being mounted
            - mountPath: "/etc/prometheus-webhook-dingtalk/config.yml"
              name: dingtalk-webhook-config
              subPath: config.yml
            - mountPath: "/etc/prometheus-webhook-dingtalk/templates/template.tmpl"
              name: dingtalk-webhook-template
              subPath: template.tmpl
          volumes:
          - name: dingtalk-webhook-config
            configMap:
              name: dingtalk-webhook-config
          - name: dingtalk-webhook-template
            configMap:
              name: dingtalk-webhook-template
    ---
    apiVersion: v1
    kind: Service
    metadata:
      namespace: monitoring
      name: dingtalk-webhook
      labels:
        app: dingtalk-webhook
    spec:
      selector:
        app: dingtalk-webhook
      ports:
      - name: http
        port: 8060
        targetPort: 8060
        protocol: TCP
After waiting a short while, the alert messages start to arrive.
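Rather than waiting for a real alert, it can help to see what AlertManager sends to the webhook. The sketch below builds the receiver URL for the guiji target and a sample payload shaped like AlertManager's webhook format (version 4), as far as I know it. All label values, the node-exporter job name, and the helper names are assumptions made up for illustration; nothing is sent over the network.

```python
import json
from datetime import datetime, timezone


def webhook_url(target, host="dingtalk-webhook", port=8060):
    """URL used in webhook_configs: /dingtalk/<target>/send on the webhook Service."""
    return f"http://{host}:{port}/dingtalk/{target}/send"


def sample_payload(alertname, instance, status="firing"):
    """Build a test payload shaped like an AlertManager webhook notification."""
    labels = {"alertname": alertname, "instance": instance,
              "job": "node-exporter", "level": "disaster"}  # hypothetical values
    annotations = {"summary": f"{instance} disk usage is above 85%"}
    alert = {
        "status": status,
        "labels": labels,
        "annotations": annotations,
        "startsAt": datetime.now(timezone.utc).isoformat(),
        "endsAt": "0001-01-01T00:00:00Z",  # zero time while the alert is still firing
        "generatorURL": "http://prometheus.zsf.com/graph",
    }
    return {
        "version": "4",
        "groupKey": '{}:{job="node-exporter"}',
        "status": status,
        "receiver": "dingtalk-webhook",
        "groupLabels": {"job": labels["job"]},
        "commonLabels": labels,
        "commonAnnotations": annotations,
        "externalURL": "http://alertmanager.zsf.com",
        "alerts": [alert],
    }


payload = sample_payload("diskFree", "10.0.0.1:9100")
print(webhook_url("guiji"))
print(json.dumps(payload, indent=2))
```

Posting this JSON to the printed URL from inside the cluster with any HTTP client should exercise the template without a real alert. Note that the template reads `.Labels.severity` and `.Labels.team`, which the diskFree rule does not set, so those fields would render empty.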
From: https://www.cnblogs.com/putaoo/p/17446151.html