Normally we autoscale only on the CPU and memory metrics the system provides, but that is quite inflexible: the JVM's footprint, for example, is not the same as the memory the application actually uses. If we could instead pull an arbitrary metric out of our monitoring data and scale on that, things would be far more flexible. After two days of digging I finally got it working, so I'm writing it down while it's fresh.
Step 1: install the prometheus-adapter component. If we want to serve external metrics, we need to install and deploy it.
[ec2-user@test-doex prometheus-adapter]$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
[ec2-user@test-doex prometheus-adapter]$ helm repo list
NAME                    URL
prometheus-community    https://prometheus-community.github.io/helm-charts
Step 2: install and deploy prometheus-adapter.
helm install prometheus-adapter prometheus-community/prometheus-adapter -f values.yaml -n prometheus
# Note: in values.yaml we need to point prometheus.url at prometheus-server; there are other changes too, covered below.
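The original post doesn't show the values.yaml change itself. A minimal sketch of the relevant excerpt, assuming your Prometheus server runs as a Service named prometheus-server in the prometheus namespace on port 80 (adjust to your setup):

# values.yaml (excerpt) -- the service name, namespace, and port here are
# assumptions for illustration; point this at wherever prometheus-server lives
prometheus:
  url: http://prometheus-server.prometheus.svc
  port: 80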
Step 3: once the deployment succeeds, edit the prometheus-adapter config.
Note: the config below contains both rules and externalRules sections; we are using externalRules here, so the rules section can be dropped entirely. A few things to watch when editing:
1. The metricsQuery field holds the PromQL; <<.Series>> is a placeholder you can think of as a variable.
2. {<<.LabelMatchers>>} must be included, otherwise the label selectors won't match later. I got bitten by this.
3. If your PromQL doesn't need a namespace, or has to query across namespaces, set namespaced: false.
4. If you later need to query by namespace or by pod, you can define namespace: {resource: "namespace"} or pod: {resource: "pod"} when defining resources (see the sketch after this list).
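For note 4, a minimal sketch of what that resources mapping looks like inside a rule, assuming your series actually carries namespace and pod labels (the seriesQuery and metricsQuery are elided here):

rules:
- seriesQuery: '...'
  resources:
    # map series labels onto Kubernetes resources so the API can filter by them
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  metricsQuery: '...'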
# In effect, the config below runs this PromQL: histogram_quantile(0.95, sum(rate(grpc_server_latency_bucket{app=~"bh-shard-2", method=~"CancelOrder", status="OK"}[5m])) by (le, app, instance, method))
[ec2-user@test-doex prometheus-adapter]$ kubectl -n prometheus edit cm prometheus-adapter
apiVersion: v1
data:
  config.yaml: |
    rules:
    - metricsQuery: 'histogram_quantile(0.95, sum(rate(<<.Series>>[5m])) by (le, app, instance,method))'
      resources:
        template: '<<.Resource>>'
      seriesQuery: 'grpc_server_latency_bucket{app=~"bh-shard-2", method=~"CancelOrder",status="OK"}'
    externalRules:
    - metricsQuery: 'histogram_quantile(0.95, sum(rate(<<.Series>>{<<.LabelMatchers>>}[5m])) by (le, app, instance,method))'
      seriesQuery: 'grpc_server_latency_bucket{app=~"bh-shard-2", method=~"CancelOrder",status="OK"}'
      resources:
        namespaced: false
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: prometheus-adapter
    meta.helm.sh/release-namespace: prometheus
  creationTimestamp: "2023-10-27T08:01:44Z"
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: prometheus-adapter
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: prometheus-adapter
    app.kubernetes.io/part-of: prometheus-adapter
    app.kubernetes.io/version: v0.11.1
    helm.sh/chart: prometheus-adapter-4.7.1
  name: prometheus-adapter
  namespace: prometheus
  resourceVersion: "323610963"
  uid: 5203bdeb-d9ac-42ff-85b3-5d45bd9b4b66
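For intuition (not spelled out in the original post): when the HPA later queries this external metric with a metricSelector of app=bh-shard-2, method=CancelOrder, the adapter substitutes <<.Series>> with the discovered series name and <<.LabelMatchers>> with matchers derived from the selector, so it effectively runs roughly:

histogram_quantile(0.95, sum(rate(grpc_server_latency_bucket{app="bh-shard-2",method="CancelOrder"}[5m])) by (le, app, instance, method))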
Step 4: after the config change, restart the Deployment (this effectively restarts the metrics API) so it takes effect.
kubectl -n prometheus rollout restart deployment prometheus-adapter
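Optionally, before testing, you can wait for the restart to finish; this is a standard kubectl command, not from the original post:

kubectl -n prometheus rollout status deployment prometheus-adapter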
Step 5: test whether the extended Kubernetes API is available. grpc_server_latency_bucket is our custom metric name; from now on every query goes through this name.
[ec2-user@test-doex prometheus-adapter]$ kubectl get --raw '/apis/external.metrics.k8s.io/v1beta1' | jq .
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "grpc_server_latency_bucket",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}
Step 6: query our custom metric and check that we actually get values back; the value field is what we want. (The values use Kubernetes quantity notation: the m suffix means milli-units, so 327368m is 327.368.)
# This command may print quite a lot
kubectl get --raw '/apis/external.metrics.k8s.io/v1beta1/namespaces/bluehelix/grpc_server_latency_bucket' | jq .
# so let's filter it down
# This is equivalent to running the PromQL below (make sure the PromQL and the --raw output return the same values; if they differ, something is wrong):
# histogram_quantile(0.95, sum(rate(grpc_server_latency_bucket{app=~"bh-shard-2", method=~"CancelOrder", status="OK"}[5m])) by (le, app, instance, method))
[ec2-user@test-doex prometheus-adapter]$ kubectl get --raw '/apis/external.metrics.k8s.io/v1beta1/namespaces/bluehelix/grpc_server_latency_bucket' | jq '.items[] | select(.metricLabels.app == "bh-shard-2" and .metricLabels.method == "CancelOrder")'
{
  "metricName": "grpc_server_latency_bucket",
  "metricLabels": {
    "app": "bh-shard-2",
    "instance": "10.33.48.156:7013",
    "method": "CancelOrder"
  },
  "timestamp": "2023-10-27T10:34:10Z",
  "value": "327368m"
}
{
  "metricName": "grpc_server_latency_bucket",
  "metricLabels": {
    "app": "bh-shard-2",
    "instance": "10.33.78.253:7013",
    "method": "CancelOrder"
  },
  "timestamp": "2023-10-27T10:34:10Z",
  "value": "49426m"
}
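To cross-check the adapter output against Prometheus itself, you can hit the Prometheus HTTP query API directly. A sketch, assuming the server is reachable at prometheus-server.prometheus.svc from wherever you run curl:

# the hostname is an assumption; use your actual prometheus-server address
curl -sG 'http://prometheus-server.prometheus.svc/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(grpc_server_latency_bucket{app=~"bh-shard-2", method=~"CancelOrder", status="OK"}[5m])) by (le, app, instance, method))' \
  | jq '.data.result'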
Step 7: now we can configure the HPA.
# Heads-up: the # comments inside the JSON annotation below are explanatory only;
# strip them before applying, since JSON does not allow comments.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: bh-shard-2
  namespace: bluehelix   # must live in the same namespace as the target Deployment
  annotations:
    autoscaling.alpha.kubernetes.io/metrics: |
      [
        {
          "type": "External",
          "external": {
            "metricName": "grpc_server_latency_bucket",   # the custom metric name we defined
            "metricSelector": {
              "matchLabels": {
                "app": "bh-shard-2",                      # label filters to pick out the data we want
                "method": "CancelOrder"
              }
            },
            "targetAverageValue": "4842m"                 # the threshold; above this, the HPA scales out
          }
        }
      ]
spec:
  maxReplicas: 2   # upper bound when scaling out
  minReplicas: 1   # lower bound when scaling in
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bh-shard-2
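For reference, on clusters where the autoscaling/v2 API is available (stable since Kubernetes 1.23), the same HPA can be written without the alpha annotation. A sketch with the same metric, selector, and threshold as above:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bh-shard-2
  namespace: bluehelix
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bh-shard-2
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: External
    external:
      metric:
        name: grpc_server_latency_bucket
        selector:
          matchLabels:
            app: bh-shard-2
            method: CancelOrder
      target:
        type: AverageValue
        averageValue: "4842m"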
Step 8: deploy it, then check the result. 4842m is the threshold we set; the current value is above it, so the HPA has scaled out by one pod.
Notes:
1. When the metric falls below the threshold, the HPA does not scale down immediately; the default stabilization window is 5 minutes (300 seconds).
2. To change that window, define the following (note that spec.behavior requires the autoscaling/v2 API):
......
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 180
3. If you have multiple pods, say 5, then the target's current value is the sum across metric items divided by 5; the HPA only triggers when that average exceeds the threshold. For example, with per-pod values of 2, 4, and 6 across 3 pods, the HPA compares (2+4+6)/3 = 4 against the threshold.
[ec2-user@test-doex ~]$ kubectl -n bluehelix get hpa
NAME         REFERENCE               TARGETS               MINPODS   MAXPODS   REPLICAS   AGE
bh-shard-2   Deployment/bh-shard-2   229357m/4842m (avg)   1         2         2          54m
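To see the individual scaling decisions and any errors fetching the external metric, describing the HPA shows its event log; this is standard kubectl, not from the original post:

kubectl -n bluehelix describe hpa bh-shard-2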
Summary:
Extended HPAs are relatively under-documented, and the official docs are a bit hard to follow, so expect some trial and error. There are many details, so work carefully.
Official docs: https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/docs/externalmetrics.md