首页 > 其他分享 >Docker和K8S集群调用GPU

Docker和K8S集群调用GPU

时间:2024-10-12 10:59:43浏览次数:1  
标签:02 I1012 device nvidia go GPU Docker K8S

参考:
安装Docker插件
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
Unbntu使用Docker调用GPU
https://blog.csdn.net/dw14132124/article/details/140534628
https://www.cnblogs.com/li508q/p/18444582

  1. 环境查看
    系统环境
# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.4 LTS
Release:	22.04
Codename:	jammy
# cat /etc/redhat-release 
Rocky Linux release 9.3 (Blue Onyx)

软件环境

# kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.25.16
WARNING: version difference between client (1.30) and server (1.25) exceeds the supported minor version skew of +/-1
  1. 安装Nvidia的Docker插件
    在有GPU资源的主机安装,改主机作为K8S集群的Node
    设置源
# curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

配置存储库以使用实验性软件包

# sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list

修改后把以下注释取消
image
更新

# sudo apt-get update

安装Toolkit

# sudo apt-get install -y nvidia-container-toolkit

配置Docker以使用Nvidia

# sudo nvidia-ctk runtime configure --runtime=docker
INFO[0000] Loading config from /etc/docker/daemon.json  
INFO[0000] Wrote updated config to /etc/docker/daemon.json 
INFO[0000] It is recommended that docker daemon be restarted. 

这条命令会修改配置文件/etc/docker/daemon.json添加runtimes配置

# cat /etc/docker/daemon.json 
{
    "insecure-registries": [
        "192.168.3.61"
    ],
    "registry-mirrors": [
        "https://7sl94zzz.mirror.aliyuncs.com",
        "https://hub.atomgit.com",
        "https://docker.awsl9527.cn"
    ],
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }

重启docker

# systemctl daemon-reload
# systemctl restart docker
  1. 使用Docker调用GPU
    验证配置
    启动一个镜像查看GPU信息
~#   docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Sat Oct 12 01:33:33 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
|  0%   53C    P2             59W /  450W |    4795MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

该输出结果显示了 GPU 的详细信息,包括型号、温度、功率使用情况和内存使用情况等。这表明 Docker 容器成功地访问到了 NVIDIA GPU,并且 NVIDIA Container Toolkit 安装和配置成功。
4. 使用K8S集群Pod调用GPU
以下操作在K8S机器的Master节点操作
安装K8S插件
下载最新版本

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml

yml文件内容如下

# cat nvidia-device-plugin.yml 
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.16.1
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

使用DaemonSet方式部署在每一台node服务器部署
查看Pod日志

# kubectl logs -f nvidia-device-plugin-daemonset-8bltf -n kube-system
I1012 02:15:37.171056       1 main.go:199] Starting FS watcher.
I1012 02:15:37.171239       1 main.go:206] Starting OS watcher.
I1012 02:15:37.172177       1 main.go:221] Starting Plugins.
I1012 02:15:37.172236       1 main.go:278] Loading configuration.
I1012 02:15:37.173224       1 main.go:303] Updating config with default resource matching patterns.
I1012 02:15:37.173717       1 main.go:314] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I1012 02:15:37.173760       1 main.go:317] Retrieving plugins.
E1012 02:15:37.174052       1 factory.go:87] Incompatible strategy detected auto
E1012 02:15:37.174086       1 factory.go:88] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E1012 02:15:37.174096       1 factory.go:89] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E1012 02:15:37.174104       1 factory.go:90] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E1012 02:15:37.174113       1 factory.go:91] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I1012 02:15:37.174123       1 main.go:346] No devices found. Waiting indefinitely.

驱动失败,错误提示已经清楚说明了失败原因

  1. 该Node部署GPU节点即该Node没有GPU资源
  2. 该Node有GPU资源,没有安装Docker驱动
    没有GPU资源的节点肯定无法使用,但是已经有GPU资源的Node节点也会报这个错误
    有GPU节点的修复方法,修改配置文件添加配置
# cat /etc/docker/daemon.json
{
    "insecure-registries": [
        "192.168.3.61"
    ],
    "registry-mirrors": [
        "https://7sl94zzz.mirror.aliyuncs.com",
        "https://hub.atomgit.com",
        "https://docker.awsl9527.cn"
    ],
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "/usr/bin/nvidia-container-runtime"
        }
    }
}

关键配置是以下行
image

再次查看Pod日志

# kubectl logs -f nvidia-device-plugin-daemonset-mp5ql -n kube-system
I1012 02:22:00.990246       1 main.go:199] Starting FS watcher.
I1012 02:22:00.990278       1 main.go:206] Starting OS watcher.
I1012 02:22:00.990373       1 main.go:221] Starting Plugins.
I1012 02:22:00.990382       1 main.go:278] Loading configuration.
I1012 02:22:00.990692       1 main.go:303] Updating config with default resource matching patterns.
I1012 02:22:00.990776       1 main.go:314] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I1012 02:22:00.990780       1 main.go:317] Retrieving plugins.
I1012 02:22:01.010950       1 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
I1012 02:22:01.011281       1 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I1012 02:22:01.012376       1 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet

查看GPU节点信息

# kubectl describe node aiserver003087

image
在k8s中测试GPU资源调用
测试Pod

# cat gpu_test.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: ffmpeg-pod
spec:
  nodeName: aiserver003087 #指定有gpu的节点
  containers:
    - name: ffmpeg-container
      image: nightseas/ffmpeg:latest #k8s中配置阿里的私有仓库遇到一些问题,暂时用公共镜像
      command: [ "/bin/bash", "-ce", "tail -f /dev/null" ]
      resources:
        limits:
          nvidia.com/gpu: 1 # 请求分配 1个 GPU

创建Pod

# kubectl apply -f gpu_test.yaml 
pod/ffmpeg-pod configured

往Pod内倒入一个视频进行转换测试

# kubectl cp test.mp4 ffmpeg-pod:/root

进入Pod

# kubectl exec -it ffmpeg-pod bash

转换测试视频

# ffmpeg -hwaccel cuvid -c:v h264_cuvid -i test.mp4 -vf scale_npp=1280:720 -vcodec h264_nvenc out.mp4

成功转换并且输出out.mp4则代表调用GPU资源成功

标签:02,I1012,device,nvidia,go,GPU,Docker,K8S
From: https://www.cnblogs.com/minseo/p/18460107

相关文章

  • Dockerfile(Jenkins)
    1.创建⼀个jenkins的Dockerfilemkdirtomcatcdtomcat2、上传需要的安装包apache-tomcat-8.5.47.tar.gzjdk-8u211-linux-x64.tar.gzjenkins.war3、编写DockerfilevimDockerfile#ThismyfirstjenkinsDockerfile#Version1.0FROMcentos:7MAINTAINERligaojie......
  • Docker 万字入门教程
    0.前言文章已经收录到GitHub个人博客项目,欢迎Star:https://github.com/chenyl8848/chenyl8848.github.io或者访问网站,进行在线浏览:https://chenyl8848.github.io/1.Docker简介1.1官方定义官方介绍:Wehaveacompletecontainersolutionforyou-nomatterwh......
  • docker 安装与使用
    0docker出现的原因软件在开发机器上可以跑,但是在其他机器上,无法跑,或者其他机器需要繁琐的环境配置。在另外的机器上能跑,必须保证:操作系统的设置各种库和组件的安装从根本上解决问题,软件带环境安装,安装的时候,把原始环境一模一样地复制过来。虚拟机是带环境安装的一......
  • DLJD_Docker学习_01
    第1章Docker概述1.1课程引入开发/运维互掐1.1.1开发与测试和运维间的矛盾,主要是由于环境的不同而引发的。如果能将开发人员使用的环境交给测试与运维使用,这些问题就都能解决。1.1.2DevOpsDevOps是一种思想,是一种管理模式,是一种执行规范与标准。它主要是用于促进开发、......
  • tensorflow案例1--天气识别,包含(Tensorflow的检查是否GPU、图像数据加载与划分、拿取
    ......
  • 在K8S中,镜像下载策略有哪些?
    在Kubernetes(K8s)中,镜像下载策略主要通过Pod的imagePullPolicy字段来配置。这一策略决定了当Pod启动时,Kubernetes如何拉取容器镜像。以下是K8s中常用的镜像下载策略:1.Always策略行为:每次创建Pod时,都会从镜像仓库拉取最新的镜像版本,即使本地已经存在同名镜像。适用场景:适用于需......
  • 在K8S中,ingress 有何作用?
    在Kubernetes(简称K8S)中,Ingress是一种API资源对象,用于管理从集群外部到集群内部服务的HTTP和HTTPS连接。它主要用来定义网络路由规则,允许外部用户通过特定的路径访问部署在集群内的应用服务。Ingress控制器则是实际实现这些规则的组件,通常运行在一个或多个Pod中,并监听......
  • 在K8S中,Scheduler 作用及实现原理有哪些?
    在Kubernetes(K8s)中,Scheduler是一个至关重要的核心组件,其主要作用是负责集群中Pod的调度。以下是Scheduler的详细作用及实现原理:1.Scheduler的作用资源分配:Scheduler根据每个Pod的资源请求(如CPU、内存)以及节点当前的可用资源情况,选择合适的Node来运行Pod。这有助于确保Pod能够......
  • 在K8S中,各模块如何与APlServer通信?
    在Kubernetes(K8s)中,各个模块与APIServer的通信是集群正常运作的关键。以下详细说明了各模块如何与APIServer进行通信:1.通信方式KubernetesAPIServer提供了一个统一的RESTful接口,用于集群内所有组件之间的通信。任何想要操作集群资源(如Pods、Services、ConfigMaps等)的组件,都......
  • 在K8S中,负载均衡器有何作用?
    在Kubernetes(K8S)中,负载均衡器(LoadBalancer)是一种服务类型(ServiceType),用于在集群内部的服务之间分配流量。负载均衡器的作用不仅仅是简单地转发请求,还包括确保应用程序能够可靠地处理来自客户端的请求,同时保持高可用性和可扩展性。以下是负载均衡器在Kubernetes中的一些主......