NVIDIA official documentation: NVIDIA/k8s-device-plugin: NVIDIA device plugin for Kubernetes
Prerequisites
Install the NVIDIA Container Toolkit
Official guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
yum install -y nvidia-container-toolkit
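To confirm the install, checking the package and CLI version is a quick sanity check (assumes the RPM install above):
rpm -q nvidia-container-toolkit
nvidia-ctk --version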
Update the runtime
Docker
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
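After the restart, a quick smoke test confirms Docker can see the GPU (sketch; any CUDA-capable base image works in place of ubuntu):
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi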
Docker running in rootless mode
nvidia-ctk runtime configure --runtime=docker --config=$HOME/.config/docker/daemon.json
systemctl --user restart docker
sudo nvidia-ctk config --set nvidia-container-cli.no-cgroups --in-place
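That last command flips the flag in the toolkit config; the relevant part of /etc/nvidia-container-runtime/config.toml should end up roughly as below (other keys omitted):
[nvidia-container-cli]
no-cgroups = true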
Containerd
nvidia-ctk runtime configure --runtime=containerd
systemctl daemon-reload
systemctl restart containerd
If you only use nerdctl, no configuration is needed; just run nerdctl run --gpus=all
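For reference, the configure step above adds an nvidia runtime entry to /etc/containerd/config.toml roughly like this (sketch; the exact layout varies with the containerd version):
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"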
If the changes above do not take effect, set the default runtime manually:
# Edit /etc/containerd/config.toml
# Under plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc, change
#   runtime_type = "io.containerd.runc.v2"
# to
#   runtime_type = "io.containerd.runtime.v1.linux"
# Then, in the io.containerd.runtime.v1.linux section, change
#   runtime = "runc"
# to
#   runtime = "nvidia-container-runtime"
systemctl daemon-reload
systemctl restart containerd
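On newer containerd versions, an alternative to editing the legacy v1 runtime section is to make nvidia the CRI default runtime (sketch, assuming the runtimes.nvidia entry shown earlier exists):
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"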
CRI-O
nvidia-ctk runtime configure --runtime=crio
systemctl restart crio
Deploy nvidia-device-plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/nvidia-device-plugin.yml
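After applying the manifest, check that the plugin pods are running (the daemonset name comes from the static manifest above):
kubectl -n kube-system get ds nvidia-device-plugin-daemonset
kubectl -n kube-system get pods -o wide | grep nvidia-device-plugin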
Check the device-plugin pod logs on the GPU node
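One way to pull these logs (kubectl logs against the daemonset picks a single pod, so target the pod on the GPU node directly if there are several):
kubectl -n kube-system logs ds/nvidia-device-plugin-daemonset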
Normal case (the GPU is registered with the kubelet):
I0820 08:26:47.926058 1 main.go:317] Retrieving plugins.
I0820 08:26:51.674618 1 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
I0820 08:26:51.682183 1 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0820 08:26:51.690851 1 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
GPU not found (try setting nvidia as the default runtime):
I0820 08:26:53.370854 1 main.go:317] Retrieving plugins.
E0820 08:26:53.370997 1 factory.go:87] Incompatible strategy detected auto
E0820 08:26:53.371004 1 factory.go:88] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0820 08:26:53.371010 1 factory.go:89] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0820 08:26:53.371016 1 factory.go:90] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0820 08:26:53.371021 1 factory.go:91] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0820 08:26:53.371029 1 main.go:346] No devices found. Waiting indefinitely.
After updating the runtime, restart the device-plugin daemonset
kubectl rollout restart ds -n kube-system nvidia-device-plugin-daemonset
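Once the plugin re-registers, nvidia.com/gpu should appear in the node's Capacity/Allocatable (using the GPU node from the test below):
kubectl describe node k8s-dell-r740-worker01 | grep nvidia.com/gpu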
Test
# Label the GPU node
kubectl label nodes k8s-dell-r740-worker01 nvidia.com/gpu=true
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1
      command: ['nvidia-smi']
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  #nodeSelector:
  #  nvidia.com/gpu: "true"
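To run the test, apply the manifest and read the pod log (gpu-pod.yaml is whatever file the manifest above was saved to; the exact output depends on the command/image):
kubectl apply -f gpu-pod.yaml
kubectl logs gpu-pod
# The vectorAdd sample ends with "Test PASSED"; with command: ['nvidia-smi'] you get the usual nvidia-smi table.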