Environment
kernel-3.10.0-1160.119.1.el7.x86_64
kernel-devel-3.10.0-1160.119.1.el7.x86_64
CentOS 7.9.2009
Linux x64 (AMD64/EM64T) Display Driver 550.107.02
GeForce RTX 2080 Ti
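Kernels 5.4.227 and 6.9.7 were also tested, but the driver could not be installed on either. After downgrading to 3.10.0 the installation succeeded.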
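The driver installer builds its kernel module against the kernel-devel tree, so before installing it can help to confirm that a tree matching the running kernel is present. This is a minimal sketch, assuming the usual CentOS layout /usr/src/kernels/<release>. The function name has_kernel_devel and its optional second argument are hypothetical helpers for illustration only.

```shell
# has_kernel_devel RELEASE [ROOT]
# Succeeds if ROOT/RELEASE is a directory. ROOT defaults to /usr/src/kernels,
# which is where CentOS kernel-devel packages install their trees.
# (has_kernel_devel is a hypothetical helper, not part of any package.)
has_kernel_devel() {
    release=$1
    root=${2:-/usr/src/kernels}
    [ -d "$root/$release" ]
}

# Typical use on the target machine:
#   has_kernel_devel "$(uname -r)" || echo "kernel-devel for $(uname -r) is missing"
```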
Check the GPU
lspci | grep -i nvidia
d8:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] (rev a1)
d8:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
d8:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1)
d8:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a
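A single card shows up as several PCI functions (VGA, audio, USB), so a script that only cares about the GPU itself may want to filter the listing. A minimal sketch, assuming the lspci class names shown above; nvidia_display_devices is a hypothetical helper name.

```shell
# nvidia_display_devices
# Reads lspci output on stdin and prints only NVIDIA display functions
# (VGA or 3D controllers), skipping the audio and USB functions.
# (nvidia_display_devices is a hypothetical helper for illustration.)
nvidia_display_devices() {
    grep -i 'nvidia' | grep -E 'VGA compatible controller|3D controller'
}

# Typical use on the target machine:
#   lspci | nvidia_display_devices
```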
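Disable the nouveau driver
# Check whether nouveau is loaded
lsmod | grep nouveau
# Edit the blacklist file and add the two lines shown in the comments below
vi /lib/modprobe.d/dist-blacklist.conf
#   blacklist nouveau
#   options nouveau modeset=0
# Back up the current initramfs
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
# Rebuild the initramfs (update-initramfs is a Debian/Ubuntu tool; CentOS uses dracut)
dracut /boot/initramfs-$(uname -r).img $(uname -r)
# Reboot so the blacklist takes effect
reboot
# After the reboot, check again; no output means nouveau is no longer loaded
lsmod | grep nouveau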
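Before rebuilding the initramfs it is worth confirming that the blacklist entry is really in the file and not commented out. A minimal sketch, assuming the edit adds a "blacklist nouveau" line; nouveau_blacklisted is a hypothetical helper name.

```shell
# nouveau_blacklisted FILE...
# Succeeds if any of the given modprobe config files contains an
# uncommented "blacklist nouveau" line.
# (nouveau_blacklisted is a hypothetical helper for illustration.)
nouveau_blacklisted() {
    grep -Eqs '^[[:space:]]*blacklist[[:space:]]+nouveau([[:space:]]|$)' "$@"
}

# Typical use on the target machine:
#   nouveau_blacklisted /lib/modprobe.d/dist-blacklist.conf && echo "nouveau is blacklisted"
```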
Download the driver
URL: https://www.nvidia.cn/Download/index.aspx?lang=cn
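Install the driver
# Make the installer executable
chmod +x NVIDIA-Linux-x86_64-550.107.02.run
# Install
./NVIDIA-Linux-x86_64-550.107.02.run --no-x-check --no-opengl-files --kernel-source-path=/usr/src/kernels/3.10.0-1160.119.1.el7.x86_64
--no-x-check: do not check for a running X server during installation; optional here, since the graphical interface is already disabled
--no-opengl-files: install only the driver files, without the OpenGL files
--no-nouveau-check: skip the installer's check for nouveau; optional and not used above, since nouveau is already disabled
--kernel-source-path: location of the kernel source tree that the kernel module is built against; if omitted, the tree for the running kernel is used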
Verify the driver
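The usual check is nvidia-smi, which is installed together with the driver. A minimal sketch, assuming nvidia-smi is on PATH and that every GPU should report the version just installed; driver_version_ok is a hypothetical helper name.

```shell
# driver_version_ok EXPECTED
# Reads the output of
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
# on stdin (one line per GPU). Succeeds only if there is at least one
# non-empty line and every such line equals EXPECTED.
# (driver_version_ok is a hypothetical helper for illustration.)
driver_version_ok() {
    expected=$1
    seen=0
    while IFS= read -r line; do
        [ -n "$line" ] || continue
        seen=1
        [ "$line" = "$expected" ] || return 1
    done
    [ "$seen" -eq 1 ]
}

# Typical use on the target machine:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader | driver_version_ok 550.107.02
```

Running plain nvidia-smi without arguments prints the full status table, which is enough for a manual check.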
Install the CUDA Toolkit
Version information and the matching driver versions: CUDA 12.6 Release Notes (nvidia.com)
This guide installs v12.4 GA for CentOS 7: CUDA Toolkit 12.4 Downloads | NVIDIA Developer
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
sh cuda_12.4.0_550.54.14_linux.run
Known issues
Installation failed. See log at /var/log/cuda-installer.log for details.
Troubleshooting
Solution:
# View the log
cat /var/log/cuda-installer.log
[INFO]: Initializing menu
[INFO]: nvidia-fs.setKOVersion(2.19.6)
[INFO]: Setup complete
[INFO]: Installing: Driver
[INFO]: Installing: 550.54.14
[INFO]: Executing NVIDIA-Linux-x86_64-550.54.14.run --ui=none --no-questions --accept-license --disable-nouveau --no-cc-version-check --install-libglvnd 2>&1
[INFO]: Finished with code: 256
[ERROR]: Install of driver component failed. Consult the driver log at /var/log/nvidia-installer.log for more details.
[ERROR]: Install of 550.54.14 failed, quitting
The NVIDIA driver component failed to install. Following the hint in the log, check the driver installer log:
cat /var/log/nvidia-installer.log
Using built-in stream user interface
-> Detected 64 CPUs online; setting concurrency level to 32.
-> Scanning the initramfs with lsinitrd...
-> /usr/bin/lsinitrd requires a file path argument, but none was given.
-> Executing: /usr/bin/lsinitrd /boot/initramfs-3.10.0-1160.119.1.el7.x86_64.img
-> Tagging shared libraries with chcon -t textrel_shlib_t.
-> The file '/tmp/.X0-lock' exists and appears to contain the process ID '2843' of a running X server.
-> You appear to be running an X server. Installing the NVIDIA driver while X is running is not recommended, as doing so may prevent the installer from detecting some potential installation problems, and it may not be possible to start new graphics applications after a new driver is installed. If you choose to continue installation, it is highly recommended that you reboot your computer after installation to use the newly installed driver. (Answer: Abort installation)
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
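PID 2843 belongs to a running X server (see the /tmp/.X0-lock line in the log), and the installer aborts while X is running. Kill that process, then rerun the CUDA installer.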
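The PID can be read straight from the lock file instead of being copied from the log. A minimal sketch, assuming the lock file holds only the PID padded with whitespace; x_lock_pid is a hypothetical helper name.

```shell
# x_lock_pid FILE
# Prints the process ID stored in an X server lock file such as
# /tmp/.X0-lock, with the surrounding whitespace stripped.
# (x_lock_pid is a hypothetical helper for illustration.)
x_lock_pid() {
    tr -d '[:space:]' < "$1"
}

# Typical use on the target machine, as root:
#   pid=$(x_lock_pid /tmp/.X0-lock)
#   ps -p "$pid" -o pid,comm    # confirm it really is the X server
#   kill "$pid"
```

If a display manager restarts X right after the kill, switching to text mode with systemctl isolate multi-user.target before installing is an alternative worth trying.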
Issue 2
ERROR: The installation was canceled due to the availability or presence of an alternate driver installation. Please see /var/log/nvidia-installer.log for more details.
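Cause: the driver version bundled with the CUDA installer (550.54.14) is lower than the NVIDIA driver already installed (550.107.02).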
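One way to avoid the conflict is to compare the two driver versions first and, when the installed driver is newer, install only the toolkit. A minimal sketch, assuming GNU sort with -V is available and that the runfile accepts --silent --toolkit as described in NVIDIA's installation guide; version_lt is a hypothetical helper name.

```shell
# version_lt A B
# Succeeds if dotted version A sorts strictly before B (GNU sort -V).
# (version_lt is a hypothetical helper for illustration.)
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Typical use on the target machine: skip the bundled driver when it is
# older than the one already installed.
#   if version_lt 550.54.14 550.107.02; then
#       sh cuda_12.4.0_550.54.14_linux.run --silent --toolkit
#   fi
```

In the interactive installer the same effect comes from deselecting the Driver component in the menu.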
Verify the CUDA installation
/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0
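For scripted checks, the release number can be pulled out of that output. A minimal sketch, assuming the "release X.Y," wording shown above; nvcc_release is a hypothetical helper name.

```shell
# nvcc_release
# Reads `nvcc --version` output on stdin and prints the toolkit release
# number taken from the "release X.Y," field.
# (nvcc_release is a hypothetical helper for illustration.)
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Typical use on the target machine:
#   /usr/local/cuda/bin/nvcc --version | nvcc_release
```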