Environment

  • kernel-3.10.0-1160.119.1.el7.x86_64

  • kernel-devel-3.10.0-1160.119.1.el7.x86_64

  • CentOS 7.9.2009

  • Linux x64 (AMD64/EM64T) Display Driver 550.107.02

  • GeForce RTX 2080 Ti

Kernels 5.4.227 and 6.9.7 were also tested; the driver would not install on either. Downgrading back to 3.10.0 fixed it.

Check the GPU

lspci | grep -i nvidia

d8:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] (rev a1)

d8:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)

d8:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1)

d8:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a

Disable the nouveau driver

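# Check whether nouveau is loaded
lsmod | grep nouveau
# Edit the blacklist file to blacklist nouveau
vi /lib/modprobe.d/dist-blacklist.conf

# Apply the change by rebuilding the initramfs.
# (update-initramfs -u is a Debian/Ubuntu command; CentOS 7 uses dracut.)
# Back up the current image
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
# Rebuild it
dracut /boot/initramfs-$(uname -r).img $(uname -r)

# Check again
lsmod | grep nouveau
# If nouveau is still loaded, reboot the server
reboot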
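The steps above do not show which lines go into the blacklist file. A common choice, assumed here rather than taken from this write-up, is `blacklist nouveau` plus `options nouveau modeset=0`. The sketch below appends them idempotently and checks lsmod output; the helper names `blacklist_nouveau` and `nouveau_loaded` are hypothetical.

```shell
# Append the nouveau blacklist entries to a modprobe config file, once.
# Assumption: these two lines are the commonly used ones; the original
# notes do not list them.
blacklist_nouveau() {
    conf="$1"
    for line in "blacklist nouveau" "options nouveau modeset=0"; do
        # -x: match the whole line, -F: fixed string, -q: quiet
        grep -qxF "$line" "$conf" 2>/dev/null || echo "$line" >> "$conf"
    done
}

# Return 0 if lsmod-style output on stdin lists the nouveau module.
nouveau_loaded() {
    grep -q '^nouveau[[:space:]]'
}
```

Possible usage: `blacklist_nouveau /lib/modprobe.d/dist-blacklist.conf`, then after rebuilding the initramfs, `lsmod | nouveau_loaded && echo "nouveau is still loaded"`.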

Download the driver

URL: https://www.nvidia.cn/Download/index.aspx?lang=cn

Install the driver

# Make the installer executable
chmod +x NVIDIA-Linux-x86_64-550.107.02.run
# Install
./NVIDIA-Linux-x86_64-550.107.02.run --no-x-check --no-opengl-files --kernel-source-path=/usr/src/kernels/3.10.0-1160.119.1.el7.x86_64

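--no-x-check: do not check for a running X server during installation. Optional here, since the graphical interface is already disabled.

--no-opengl-files: install only the driver files, without the OpenGL files.

--no-nouveau-check: skip the installer's check for a loaded nouveau module (it does not disable nouveau itself). Optional here, since nouveau is already disabled.

--kernel-source-path: location of the kernel source tree that the installer builds the kernel module against. If omitted, the tree for the running kernel is used.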
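Because the kernel source path has to match the kernel the module will run on, a quick check before launching the installer can save a failed build. This is a minimal sketch, assuming kernel-devel unpacks under /usr/src/kernels/<release> as it does on CentOS 7; the function name `kernel_src_dir` and its optional second argument are hypothetical.

```shell
# Print the kernel source directory for a release, or fail if it is missing.
# $1: kernel release (defaults to the running kernel)
# $2: base directory (defaults to /usr/src/kernels)
kernel_src_dir() {
    release="${1:-$(uname -r)}"
    base="${2:-/usr/src/kernels}"
    if [ -d "$base/$release" ]; then
        echo "$base/$release"
    else
        echo "no kernel sources for $release under $base; install the matching kernel-devel" >&2
        return 1
    fi
}
```

Possible usage: `src=$(kernel_src_dir) && ./NVIDIA-Linux-x86_64-550.107.02.run --kernel-source-path="$src"`.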

Verify the driver
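No command is listed for this step. The usual way to verify is nvidia-smi, which is installed together with the driver. A small sketch, with the hypothetical helper name `check_driver`, that prints the GPU name and driver version if the tool is available:

```shell
# Print "name, driver_version" for each GPU, or fail if nvidia-smi is missing.
check_driver() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; the driver is probably not installed" >&2
        return 1
    fi
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
}
```

Running `check_driver` on this machine should report the RTX 2080 Ti together with the installed driver version; running plain `nvidia-smi` shows the full status table instead.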

Install the CUDA Toolkit

Release information: CUDA 12.6 Release Notes (nvidia.com)

Corresponding driver versions

This walkthrough installs v12.4 GA for CentOS 7: CUDA Toolkit 12.4 Downloads | NVIDIA Developer

wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
sh cuda_12.4.0_550.54.14_linux.run

Known issues

Installation failed. See log at /var/log/cuda-installer.log for details.

Troubleshooting

Solution:

# Check the log
cat /var/log/cuda-installer.log
[INFO]: Initializing menu
[INFO]: nvidia-fs.setKOVersion(2.19.6)
[INFO]: Setup complete
[INFO]: Installing: Driver
[INFO]: Installing: 550.54.14
[INFO]: Executing NVIDIA-Linux-x86_64-550.54.14.run --ui=none --no-questions --accept-license --disable-nouveau --no-cc-version-check --install-libglvnd  2>&1
[INFO]: Finished with code: 256
[ERROR]: Install of driver component failed. Consult the driver log at /var/log/nvidia-installer.log for more details.
[ERROR]: Install of 550.54.14 failed, quitting

The NVIDIA driver component failed to install. Follow the hint and check the driver log.

cat /var/log/nvidia-installer.log 
Using built-in stream user interface
-> Detected 64 CPUs online; setting concurrency level to 32.
-> Scanning the initramfs with lsinitrd...
-> /usr/bin/lsinitrd requires a file path argument, but none was given.
-> Executing: /usr/bin/lsinitrd    /boot/initramfs-3.10.0-1160.119.1.el7.x86_64.img
-> Tagging shared libraries with chcon -t textrel_shlib_t.
-> The file '/tmp/.X0-lock' exists and appears to contain the process ID '2843' of a running X server.
-> You appear to be running an X server.  Installing the NVIDIA driver while X is running is not recommended, as doing so may prevent the installer from detecting some potential installation problems, and it may not be possible to start new graphics applications after a new driver is installed.  If you choose to continue installation, it is highly recommended that you reboot your computer after installation to use the newly installed driver. (Answer: Abort installation)
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Process 2843 is a running X server, so the installer aborted. Kill that process and run the CUDA installer again.
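Rather than copying the PID out of the log by hand, it can be read from the lock file the installer mentions. A minimal sketch; the helper name `x_lock_pid` is hypothetical, and it assumes the lock file holds just the PID padded with spaces, which is the usual X lock file layout.

```shell
# Print the PID recorded in an X lock file (default: /tmp/.X0-lock).
# The file holds the PID padded with spaces, so strip the whitespace.
x_lock_pid() {
    tr -d ' \n' < "${1:-/tmp/.X0-lock}"
}
```

Possible usage: `kill "$(x_lock_pid)"`. If a display manager restarts X right away, switching to the text-only target first (for example `systemctl isolate multi-user.target`) may be needed; that step is an assumption, not something tested in this walkthrough.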

Issue 2

ERROR: The installation was canceled due to the availability or presence of an alternate driver installation. Please see /var/log/nvidia-installer.log for more details.

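The driver bundled with the CUDA installer (550.54.14) is older than the NVIDIA driver already installed (550.107.02).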
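One way to avoid this conflict is to compare the two versions up front and, when the bundled driver is older, install only the toolkit. The sketch below relies on GNU `sort -V` for version ordering; the helper name `version_lt` is hypothetical, and the `--silent --toolkit` runfile options in the comment are a suggestion that was not tried in this walkthrough.

```shell
# Return 0 (true) when version $1 is strictly older than version $2.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

installed="550.107.02"   # driver installed earlier in this walkthrough
bundled="550.54.14"      # driver bundled with cuda_12.4.0_550.54.14_linux.run

if version_lt "$bundled" "$installed"; then
    echo "bundled driver is older; install the toolkit only"
    # e.g. sh cuda_12.4.0_550.54.14_linux.run --silent --toolkit
fi
```

In the interactive installer, the equivalent is to deselect the Driver entry in the component menu before continuing.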

Verify CUDA

/usr/local/cuda/bin/nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0