Preface
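Kubernetes scales workloads through the Horizontal Pod Autoscaler (HPA), which supports CPU and memory metrics by default. The native HPA relies on the resource metrics pipeline (Heapster in older releases, metrics-server in current ones), which does not expose GPU metrics, but HPA can be extended with additional metrics through the custom metrics mechanism. We can deploy Prometheus Adapter as the custom metrics server: it registers Prometheus metrics with the API server so that HPA can query them. Once HPA is configured to use such a custom metric as its scaling metric, workloads can be scaled elastically on GPU metrics.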
Alibaba Cloud Container Service for Kubernetes monitoring: GPU monitoring
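To make the goal concrete, here is a minimal sketch of what an HPA driven by a GPU metric might look like once the adapter is in place. The helper name `render_gpu_hpa`, the Deployment name, the replica bounds, and the target value are all hypothetical. It also assumes the adapter exposes the metric under the name `nvidia_gpu_duty_cycle` as a per-pod metric, which depends on the adapter rules you configure.

```shell
# Hypothetical helper: prints an HPA manifest that scales a Deployment on a
# per-pod GPU metric served through the custom metrics API.
#   $1 = Deployment name (assumption)
#   $2 = target average metric value per pod (assumption)
render_gpu_hpa() {
  deployment="$1"
  target="$2"
  cat <<EOF
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: ${deployment}-gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ${deployment}
  minReplicas: 1          # assumed bounds, adjust to your workload
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: nvidia_gpu_duty_cycle   # assumed name as exposed by the adapter
      target:
        type: AverageValue
        averageValue: "${target}"
EOF
}
```

Something like `render_gpu_hpa my-gpu-app 20 | kubectl apply -f -` would then create the autoscaler; printing the manifest first makes it easy to review before applying.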
- Prepare GPU servers in the k8s cluster
# kubectl get node
NAME                    STATUS   ROLES    AGE    VERSION
master-11               Ready    master   466d   v1.18.20
master-12               Ready    master   466d   v1.18.20
master-13               Ready    master   466d   v1.18.20
slave-gpu-103           Ready    <none>   159d   v1.18.20
slave-gpu-105           Ready    <none>   160d   v1.18.20
slave-gpu-109           Ready    <none>   160d   v1.18.20
slave-rtx3080-gpu-111   Ready    <none>   6d3h   v1.18.20
- Label every GPU server and add a taint to it (shown here for one node; repeat for each GPU node)
kubectl label node slave-gpu-103 aliyun.accelerator/nvidia_name=yes
kubectl taint node slave-gpu-103 gpu_type=moviebook:NoSchedule
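Since the label and taint have to be applied to every GPU node, the two commands above can be wrapped in a loop. This is only a sketch: the function name `label_and_taint_gpu_nodes` and the `KUBECTL` override variable are hypothetical conveniences, and `--overwrite` is added so that re-running the loop does not fail on nodes that are already labeled or tainted.

```shell
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: applies the GPU label and taint to each node name given.
label_and_taint_gpu_nodes() {
  for node in "$@"; do
    # Label used by the exporter DaemonSet's node affinity.
    $KUBECTL label node "$node" aliyun.accelerator/nvidia_name=yes --overwrite
    # Taint that keeps ordinary workloads off the GPU node.
    $KUBECTL taint node "$node" gpu_type=moviebook:NoSchedule --overwrite
  done
}
```

For the cluster listed earlier, the call would be `label_and_taint_gpu_nodes slave-gpu-103 slave-gpu-105 slave-gpu-109 slave-rtx3080-gpu-111`.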
- Deploy the Prometheus GPU exporter, using hostNetwork for networking
# cat gpu-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  namespace: monitoring
  name: ack-prometheus-gpu-exporter
spec:
  selector:
    matchLabels:
      k8s-app: ack-prometheus-gpu-exporter
  template:
    metadata:
      labels:
        k8s-app: ack-prometheus-gpu-exporter
    spec:
      # Run only on nodes that carry the GPU label added earlier.
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: aliyun.accelerator/nvidia_name
                operator: Exists
      hostNetwork: true
      hostPID: true
      containers:
      - name: node-gpu-exporter
        image: registry.cn-hangzhou.aliyuncs.com/acs/gpu-prometheus-exporter:0.1-5cc5f27
        imagePullPolicy: Always
        ports:
        - name: http-metrics
          containerPort: 9445
        env:
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        resources:
          requests:
            memory: 50Mi
            cpu: 200m
          limits:
            memory: 100Mi
            cpu: 300m
        volumeMounts:
        - mountPath: /var/run/docker.sock
          name: docker-sock
      volumes:
      - hostPath:
          path: /var/run/docker.sock
          type: File
        name: docker-sock
      # The toleration key must match the taint on the GPU nodes (gpu_type);
      # the original used server_type, which would not tolerate that taint.
      tolerations:
      - effect: NoSchedule
        key: gpu_type
        operator: Exists
---
apiVersion: v1
kind: Service
metadata:
  name: node-gpu-exporter
  namespace: monitoring
  labels:
    k8s-app: ack-prometheus-gpu-exporter
spec:
  type: ClusterIP
  ports:
  - name: http-metrics
    port: 9445
    protocol: TCP
  selector:
    k8s-app: ack-prometheus-gpu-exporter
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ack-prometheus-gpu-exporter
  labels:
    release: ack-prometheus-operator
    app: ack-prometheus-gpu-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      k8s-app: ack-prometheus-gpu-exporter
  namespaceSelector:
    matchNames:
    - monitoring
  endpoints:
  - port: http-metrics
    interval: 30s
# Create the GPU exporter
kubectl apply -f gpu-exporter.yaml
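Before wiring the exporter into Prometheus, it can help to confirm that each node actually serves GPU metrics. The sketch below assumes the exporter serves the conventional `/metrics` path on port 9445 (the ServiceMonitor above sets no custom path, so that default applies there too). The function name `check_gpu_exporter` and the `CURL` override variable are hypothetical.

```shell
# CURL can be overridden for testing; it defaults to the real curl binary.
CURL="${CURL:-curl}"

# Hypothetical helper: prints how many nvidia_gpu_duty_cycle samples one
# exporter exposes. A count of 0 means the metric is missing on that node.
#   $1 = node IP (the exporter uses hostNetwork, so it listens on the node)
check_gpu_exporter() {
  $CURL -s "http://$1:9445/metrics" | grep -c '^nvidia_gpu_duty_cycle'
}
```

For example, `check_gpu_exporter 10.147.100.103` should print a non-zero count on a healthy GPU node.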
- Add the GPU server instances to the Prometheus scrape targets
# kubectl edit cm -n prometheus prometheus-conf
# The job name below means "GPU service monitoring"
- job_name: 'GPU服务监控'
  static_configs:
  #- targets: ['node-gpu-exporter.monitoring:9445']
  - targets:
    - 10.147.100.103:9445
    - 10.147.100.105:9445
    - 10.147.100.111:9445
    - 10.147.100.109:9445
# Restart Prometheus so that the configuration takes effect
# In Prometheus, check the GPU metrics, for example nvidia_gpu_duty_cycle
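The static target list above is typed by hand, so it can drift as GPU nodes are added or removed. As one possible alternative, the entries could be generated from the nodes that carry the GPU label. This is a sketch only: the function name `gpu_scrape_targets` and the `KUBECTL` override variable are hypothetical, and it assumes each node reports an `InternalIP` address that Prometheus can reach.

```shell
# KUBECTL can be overridden for testing; it defaults to the real kubectl binary.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: prints one "- <ip>:9445" target line per labeled GPU node,
# indented to sit under "- targets:" in the scrape config.
gpu_scrape_targets() {
  $KUBECTL get nodes -l aliyun.accelerator/nvidia_name \
    -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' |
  while read -r ip; do
    if [ -n "$ip" ]; then
      printf '    - %s:9445\n' "$ip"
    fi
  done
}
```

The output can be pasted under `- targets:` in the job definition; Prometheus still needs to be restarted afterwards for the change to take effect.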
- Certificates for Prometheus Adapter