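Installing and Verifying the NVIDIA 580 Driver, CUDA 13, and cuDNN 9.13 on Win11, Ubuntu, and WSL2
This post records the full process of installing the NVIDIA 580 driver, CUDA 13, and cuDNN 9.13 on Windows 11, Ubuntu, and WSL2. It covers removing old drivers, installing through each system's package manager, configuring environment variables, and verifying versions. It also provides C++, PyTorch, and TensorFlow test code with a performance comparison, works around the lack of TensorFlow GPU support on native Windows, and includes a matrix multiplication speed optimization example.
无尘 · Published 2026/3/28 · Updated 2026/5/31
Overview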
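Linux: the distribution is Ubuntu 24.04, the NVIDIA driver package is nvidia-headless-no-dkms-580-server-open, the CUDA package is cuda-toolkit-13-0, and the cuDNN package is libcudnn9-dev-cuda-13 (9.13.1).
Windows: Win11, NVIDIA Studio driver 581.29, CUDA Version 13.0.1, cuDNN 9.13.1.
WSL2: Ubuntu 24.04 and Arch Linux, both sharing the NVIDIA driver of the Windows host. Among WSL distributions only Ubuntu offers CUDA 13, and WSL currently offers only cuDNN 8.9.2, which targets CUDA 12.
Verification: version checks, C++ compilation, PyTorch and TensorFlow functionality, plus solutions to common problems.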
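I. Before Installing
If you have never installed an NVIDIA driver, you can skip this step.
On Linux, make sure the old version is removed before installing (remove it first, or the packages will conflict).
The simplest way is to run:
sudo apt-get purge 'nvidia*'
Or, if you know the exact package name, run:
sudo apt remove nvidia-v
If you do not know the version, list the installed packages with:
dpkg -l | grep nvidia
In my case no old version turned up, and the server was new, so I took the most thorough approach (the patterns are quoted so the shell passes them to apt unexpanded):
sudo apt purge '*nvidia*' '*cuda*' '*cudnn*' '*nsight*'
sudo apt autoremove
sudo /usr/bin/nvidia-uninstall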
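Before purging, it can help to see exactly which NVIDIA packages are really installed. The following is a minimal sketch, assuming the usual column layout of dpkg -l (status flag, package name, version); the function name and the sample text are hypothetical and not taken from the post.

```python
def installed_nvidia_packages(dpkg_output: str) -> list[tuple[str, str]]:
    """Return (package, version) pairs for installed packages that mention nvidia."""
    packages = []
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Installed packages carry the status flag "ii"; the next two
        # columns are the package name and its version.
        if len(fields) >= 3 and fields[0] == "ii" and "nvidia" in fields[1]:
            packages.append((fields[1], fields[2]))
    return packages


# Hypothetical sample of `dpkg -l` output, for illustration only.
SAMPLE = """\
ii  nvidia-utils-580   580.65.06-0ubuntu1  amd64  NVIDIA driver support binaries
rc  nvidia-dkms-535    535.183.01-0ubuntu1 amd64  NVIDIA DKMS package
ii  libc6:amd64        2.39-0ubuntu8       amd64  GNU C Library
"""

for name, version in installed_nvidia_packages(SAMPLE):
    print(name, version)
```

To use it on a real machine, pass in the text captured from dpkg -l. Lines flagged rc (removed, configuration left behind) are skipped on purpose; apt purge is what clears those.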
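On Windows, if you have the NVIDIA App, you can open it and upgrade directly.
Mine was already up to date; if yours is not, click the update prompt in the app.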
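II. Installing the New Driver
1. Linux (Ubuntu)
First update apt:
sudo apt update
Then list the driver packages that are available:
apt search nvidia-driver
The output lists many variants. Choose driver if the machine has a display and headless if it does not; server is the better choice for servers, and open means the open-source kernel modules.
I was installing on a cloud server, so I used:
sudo apt install nvidia-headless-580-server-open
If you have no special requirements, install this one:
sudo apt install nvidia-driver-580-open
After installation, running nvidia-smi reports that nvidia-utils is not installed yet, so install it as well (for the server variant, remove the # sign):
sudo apt install nvidia-utils-580
After that, nvidia-smi shows normal output.
Note that it is best not to install a no-dkms variant. The first time, I installed nvidia-headless-no-dkms-580-server-open and nvidia-smi printed an error; after reinstalling nvidia-headless-580-server-open it worked.
If it still fails, run:
sudo apt-get install dkms
ls -l /usr/src/
sudo dkms install -m nvidia -v 580.65.06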
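One plausible reason nvidia-smi fails after a no-dkms install is that the kernel module was never built or loaded. The sketch below checks for it by reading /proc/modules, where the first field of each line is the module name. The helper names are hypothetical, and the check is meant for native Linux only, since under WSL2 the driver lives on the Windows host.

```python
from pathlib import Path


def nvidia_module_loaded(proc_modules_text: str) -> bool:
    """Return True if a kernel module named exactly 'nvidia' is listed."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "nvidia":
            return True
    return False


def check_current_system(path: str = "/proc/modules") -> str:
    """Describe the module state of this machine, or say the file is missing."""
    modules = Path(path)
    if not modules.is_file():
        return "cannot read " + path + " (not a Linux system?)"
    loaded = nvidia_module_loaded(modules.read_text())
    return "nvidia module loaded" if loaded else "nvidia module NOT loaded"


print(check_current_system())
```

If the module is reported as not loaded, the dkms commands above are the next thing to try.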
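2. Windows
If you have already updated the driver, skip this step; if you have no driver at all, download it from the official NVIDIA website.
You can compare the different driver branches before choosing; mine is Studio.
You can also install the NVIDIA App, which makes updates easier.
3. WSL2
Linux systems in WSL2 share the NVIDIA driver of the Windows host, so once the host driver is installed, no WSL distribution needs its own. Everything else must still be installed separately, in the same way as on a physical Linux machine.
My WSL Ubuntu 24.04 (currently Ubuntu is the only WSL distribution for which CUDA 13 can be downloaded).
My WSL Arch Linux (no installable CUDA or cuDNN support).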
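III. Installing the CUDA Toolkit
Official site: NVIDIA CUDA 13 download page.
Choose according to your system. The Windows network installer is linked at the top of the article and the local installer is shared separately; for everything else, see the commands below (all are x86_64 network installs; for sbsa, simply replace x86_64 with sbsa).
Also, because WSL-Ubuntu does not yet have a cuDNN build for CUDA 13, you need to install CUDA 12 as well if you want to use cuDNN.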
Amazon Linux 2023:
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/x86_64/cuda-amzn2023.repo
sudo dnf clean all
sudo dnf -y install cuda-toolkit-13-0
Azure Linux 3:
curl https://developer.download.nvidia.com/compute/cuda/repos/azl3/x86_64/cuda-azl3.repo | sudo tee /etc/yum.repos.d/cuda-azl3.repo
sudo tdnf -y install azurelinux-repos-extended
sudo tdnf clean all
sudo tdnf -y install cuda-toolkit-13-0
Debian 12:
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-0
Fedora 42:
sudo dnf config-manager addrepo --from-repofile https://developer.download.nvidia.com/compute/cuda/repos/fedora42/x86_64/cuda-fedora42.repo
sudo dnf clean all
sudo dnf -y install cuda-toolkit-13-0
KylinOS 10:
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/kylin10/x86_64/cuda-kylin10.repo
sudo dnf clean all
sudo dnf -y install cuda-toolkit-13-0
openSUSE 15:
sudo zypper addrepo https://developer.download.nvidia.com/compute/cuda/repos/opensuse15/x86_64/cuda-opensuse15.repo
sudo zypper refresh
sudo zypper install -y cuda-toolkit-13-0
RHEL 9:
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf clean all
sudo dnf -y install cuda-toolkit-13-0
SLES 15:
sudo zypper addrepo https://developer.download.nvidia.com/compute/cuda/repos/sles15/x86_64/cuda-sles15.repo
sudo zypper refresh
sudo zypper install -y cuda-toolkit-13-0
Ubuntu 24.04:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-0
WSL-Ubuntu:
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-0
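Commands for the same distribution family are essentially identical; just change the version tag inside them.
For example, for Azure Linux change azl3 to azl2, and for Ubuntu change 2404 to 2204.
Likewise, you can change cuda-toolkit-13-0 to cuda-toolkit-12-8 and so on, depending on which version your NVIDIA driver supports.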
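The substitution rule above can be written down as a small helper. This is only a sketch: it builds the strings following the URL pattern used in the commands above, and it does not check that NVIDIA actually publishes a repository for a given tag. The function names are hypothetical.

```python
BASE = "https://developer.download.nvidia.com/compute/cuda/repos"


def keyring_url(distro: str, arch: str = "x86_64") -> str:
    """Build the cuda-keyring URL for a distro tag such as 'ubuntu2404'."""
    return f"{BASE}/{distro}/{arch}/cuda-keyring_1.1-1_all.deb"


def toolkit_package(major: int, minor: int) -> str:
    """Build the toolkit package name, e.g. cuda-toolkit-12-8."""
    return f"cuda-toolkit-{major}-{minor}"


print(keyring_url("ubuntu2204"))
print(keyring_url("ubuntu2404", arch="sbsa"))
print(toolkit_package(12, 8))
```

Whether a generated URL exists still has to be confirmed against the official download page before using it.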
Run nvidia-smi to check the highest CUDA version the driver supports; it appears in the CUDA Version field of the header.
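If you want to read that field from a script, a minimal sketch could look like this. It assumes the header contains the text "CUDA Version: X.Y"; the sample line is hypothetical, and query_driver simply returns None when nvidia-smi is not on PATH.

```python
import re
import shutil
import subprocess


def max_cuda_version(smi_output: str):
    """Extract (major, minor) from the 'CUDA Version' field, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def query_driver():
    """Run nvidia-smi if it is available and parse its output."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, check=False
        )
    except OSError:
        return None
    return max_cuda_version(result.stdout)


# Hypothetical header line in the layout nvidia-smi prints.
SAMPLE = "| NVIDIA-SMI 580.65.06   Driver Version: 580.65.06   CUDA Version: 13.0 |"
print(max_cuda_version(SAMPLE))
print(query_driver())
```

The reported value is the highest CUDA version the driver can serve, so any cuda-toolkit package at or below it should be a reasonable choice.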
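IV. Installing cuDNN
Older releases can be downloaded from the official archive; for the latest release, see the official cuDNN 9.13.1 page.
1. Windows
pip installation pitfalls (Win11)
Read this part if you are curious; skipping it is fine.
I first tried installing with pip on Win11:
pip install nvidia-pyindex
This failed with an error. Although the message said No module named 'pip', upgrading pip several times did not change the error. The failure occurred at the Building wheel step, so I guessed that wheel was the problem and ran:
pip install nvidia-pyindex
pip install nvidia-cudnn
nvidia-pyindex installed successfully.
nvidia-cudnn still failed. In the end I gave up on pip and found that the standalone installer can be used directly.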
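2. Debian family (Ubuntu)
The apt repositories for Ubuntu 24.04 already package the latest cuDNN 9.13.1, and installation produced no errors, so installing through apt is enough. If this method does not work, see the next section.
WSL does not yet support cuDNN 9.13.1; for now you can run sudo apt install nvidia-cudnn to get cuDNN 8.9.2 for CUDA 12, choosing OK and I AGREE at the prompts.
Search apt for the available packages:
The output contains several variants. If you have no development needs, install libcudnn9-cuda-12 or libcudnn9-cuda-13 directly. frontend is the front end with the C++ API, dev is the development package, and backend/jit is the just-in-time compiled variant.
Use frontend if you drive the GPU from low-level C++, and backend/jit if you develop frameworks. If, like me, you only use Python frameworks, dev is enough.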
3. Other Linux distributions
Tarball:
wget https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-9.13.1.26_cuda13-archive.tar.xz
Debian 12:
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cudnn
openSUSE 15:
sudo zypper addrepo https://developer.download.nvidia.com/compute/cuda/repos/opensuse15/x86_64/cuda-opensuse15.repo
sudo zypper refresh
sudo zypper install -y cudnn
RHEL 10:
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel10/x86_64/cuda-rhel10.repo
sudo dnf clean all
sudo dnf -y install cudnn
SLES 15:
sudo zypper addrepo https://developer.download.nvidia.com/compute/cuda/repos/sles15/x86_64/cuda-sles15.repo
sudo zypper refresh
sudo zypper install -y cudnn
Ubuntu 24.04:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cudnn
For the JIT build, change cudnn to cudnn_jit.
For the tarball, change cuda13 to cuda12 to get the CUDA 12.x build; for the others, change -y cudnn to -y cudnn9-cuda-12.
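V. Verifying the Installation and Versions
1. NVIDIA Driver
2. CUDA Toolkit
Here I ran into the case where 12.0 was still reported after 13.0 had been installed.
I tried changing the symlink, which did not solve the problem.
So I edited the shell configuration file instead (if you are not familiar with Linux text editors, see my first blog post).
Open the file with nano, or with notepad on Windows:
nano ~/.bashrc
notepad ~/.bashrc
Add these two lines:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Then reload the configuration and check the version again:
source ~/.bashrc
nvcc --version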
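A stale nvcc found earlier on PATH is one plausible cause of the 12.0 result. The sketch below shows which nvcc the shell would pick and how to read the release number from the output of nvcc --version. It assumes the output contains the text "release X.Y"; the sample line and its build number are hypothetical.

```python
import re
import shutil


def nvcc_release(version_output: str):
    """Extract (major, minor) from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Hypothetical last line of `nvcc --version`, for illustration only.
SAMPLE = "Cuda compilation tools, release 13.0, V13.0.48"
print(nvcc_release(SAMPLE))

# The first nvcc on PATH wins; None means no nvcc is visible at all.
print(shutil.which("nvcc"))
```

If the printed path is not under /usr/local/cuda/bin, the PATH line in ~/.bashrc has not taken effect in the current shell.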
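3. cuDNN on Linux
cuDNN has no dedicated command for printing its version. To confirm that it is installed, use find to locate the header cudnn_version.h, which defines CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL.
find /usr -name "cudnn_version.h" 2>/dev/null
My result is /usr/include/x86_64-linux-gnu/cudnn_version.h.
So run:
ls /usr/include/x86_64-linux-gnu/
The output includes cudnn_cnn_v9.h, which shows that a cuDNN 9.x version is installed.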
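The exact version can also be read from the header itself. This is a minimal sketch that assumes cudnn_version.h contains plain #define lines for CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL; the function name and the sample text are hypothetical.

```python
import re


def cudnn_version_from_header(header_text: str):
    """Return (major, minor, patch) parsed from cudnn_version.h text, or None."""
    values = {}
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(
            rf"^\s*#\s*define\s+{name}\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None
        values[name] = int(match.group(1))
    return values["CUDNN_MAJOR"], values["CUDNN_MINOR"], values["CUDNN_PATCHLEVEL"]


# Hypothetical excerpt of the header, for illustration only.
SAMPLE = """\
#define CUDNN_MAJOR 9
#define CUDNN_MINOR 13
#define CUDNN_PATCHLEVEL 1
"""
print(cudnn_version_from_header(SAMPLE))
```

On a real system, read the file at the path that find reported and pass its text to the function.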
4. cuDNN on Win11
Get-ChildItem -Path C:\ -Name cudnn.h -File -Recurse -ErrorAction SilentlyContinue
Program Files\NVIDIA\CUDNN\v9.13\include\12.9\cudnn.h
Program Files\NVIDIA\CUDNN\v9.13\include\13.0\cudnn.h
If drive C holds a lot of files this search is slow. Everything is much faster, but it has to be installed separately, which is off topic here.
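VI. Framework Support Test Programs
Whether you install a driver, a toolkit, or a neural network library, what matters in the end is that the expected functionality works, so it is worth verifying it (this also doubles as GPU programming practice).
Note: Chinese characters, Chinese punctuation, and emoji in the code may come out garbled in some environments.
1. C++ support
(I write the file from the console here; you can also use your preferred IDE or text editor.)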
cat > test_cudnn.cu <<'EOF'
#include <iostream>
#include <cudnn.h>
#include <cuda_runtime.h>

int main() {
    int runtime_version = 0;
    int driver_version = 0;
    cudaRuntimeGetVersion(&runtime_version);
    cudaDriverGetVersion(&driver_version);
    // CUDA encodes its version as major * 1000 + minor * 10.
    std::cout << "CUDA Runtime Version: " << runtime_version / 1000 << "." << (runtime_version % 1000) / 10 << std::endl;
    std::cout << "CUDA Driver Version: " << driver_version / 1000 << "." << (driver_version % 1000) / 10 << std::endl;

    cudnnHandle_t handle;
    cudnnStatus_t status = cudnnCreate(&handle);
    size_t cudnn_version = cudnnGetVersion();
    if (status == CUDNN_STATUS_SUCCESS) {
        // cuDNN 9 encodes its version as major * 10000 + minor * 100 + patch.
        std::cout << "cuDNN Version: " << cudnn_version / 10000 << "." << (cudnn_version % 10000) / 100 << "." << cudnn_version % 100 << std::endl;
        std::cout << "✅ cuDNN installed successfully!" << std::endl;
        cudnnDestroy(handle);
    } else {
        std::cout << "❌ cuDNN initialization failed!" << std::endl;
    }
    return 0;
}
EOF
nvcc -o test_cudnn test_cudnn.cu -lcudnn
./test_cudnn
CUDA Runtime Version: 12.0
CUDA Driver Version: 13.0
cuDNN Version: 9.13.1
✅ cuDNN installed successfully!
rm ./test_cudnn
rm ./test_cudnn.cu
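The integer decoding used by the test program can be reproduced in a few lines, which is handy when a framework prints a raw number such as the 91002 that PyTorch reports. This sketch mirrors the arithmetic of the C++ code above and assumes the cuDNN 9 encoding; the function names are hypothetical.

```python
def decode_cuda_version(value: int) -> tuple[int, int]:
    """CUDA packs its version as major * 1000 + minor * 10."""
    return value // 1000, (value % 1000) // 10


def decode_cudnn_version(value: int) -> tuple[int, int, int]:
    """cuDNN 9 packs its version as major * 10000 + minor * 100 + patch."""
    return value // 10000, (value % 10000) // 100, value % 100


print(decode_cuda_version(13000))
print(decode_cudnn_version(91301))
print(decode_cudnn_version(91002))
```

Older cuDNN 8 releases pack the number differently (for example 8902 for 8.9.2), so this decoder should not be applied to them unchanged.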
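If you see the following message (usually on Windows, because on Linux gcc is pulled in as a dependency):
nvcc fatal : Cannot find compiler 'cl.exe' in PATH
add the C/C++ compiler (cl.exe) bundled with your Visual Studio to the PATH environment variable. nvcc on Windows requires cl.exe as its host compiler, so installing MinGW is not a substitute.
Then enter and save the Visual Studio compiler path.
If your Microsoft Visual Studio is not under C:\Program Files\, find its actual install location.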
2. PyTorch support
sudo apt install python3.12-venv
mkdir pytorch
cd pytorch
python3 -m venv torch
source torch/bin/activate
pip install torch
python3
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"cuDNN enabled: {torch.backends.cudnn.enabled}")
print(f"cuDNN version: {torch.backends.cudnn.version()}")
print(f"PyTorch built with CUDA version: {torch.version.cuda}")
(torch) ubuntu@ubuntu:~/pytorch$ python3
Python 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
/home/ubuntu/pytorch/torch/lib/python3.12/site-packages/torch/_subclasses/functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
>>> print(f"PyTorch version: {torch.__version__}")
PyTorch version: 2.8.0+cu128
>>> print(f"CUDA available: {torch.cuda.is_available()}")
CUDA available: True
>>> print(f"cuDNN enabled: {torch.backends.cudnn.enabled}")
cuDNN enabled: True
>>> print(f"cuDNN version: {torch.backends.cudnn.version()}")
cuDNN version: 91002
>>> print(f"PyTorch built with CUDA version: {torch.version.cuda}")
PyTorch built with CUDA version: 12.8
>>> quit()
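The last line is the CUDA version PyTorch was built against. It is fixed when PyTorch is compiled, but it still works alongside the 13.0 toolkit I installed, because the pip wheels bundle their own CUDA runtime libraries and only need a sufficiently new driver.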
Fixing a PyTorch build without CUDA support
PyTorch version: 2.8.0+cpu
CUDA available: False
cuDNN enabled: True
cuDNN version: None
PyTorch built with CUDA version: None
This is what I got on Win11, which means the PyTorch I had installed was a build without CUDA support.
To make sure you download a PyTorch build with CUDA support, visit the official PyTorch website.
The command is the same on Linux and Windows:
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu130
When I wrote this there was no CUDA 13 option yet, but the builds bundling 12.8 and 12.9 both work with CUDA 13; I verified this on both my Ubuntu and Win11 machines.
The demonstration below still uses 12.9, which does not affect the result.
If you cannot connect, you can use the Nanjing University mirror:
pip3 install torch torchvision torchaudio --index-url https://mirrors.nju.edu.cn/pytorch/whl/cu130
PyTorch version: 2.8.0+cu129
CUDA available: True
cuDNN enabled: True
cuDNN version: 91002
PyTorch built with CUDA version: 12.9
3. TensorFlow support
mkdir tensorflow
cd tensorflow
python3 -m venv tf
source tf/bin/activate
pip install tensorflow
python3
import tensorflow as tf
print(f"TensorFlow version: {tf.__version__}")
print(f"TensorFlow CUDA available: {tf.test.is_built_with_cuda()}")
print(f"GPU available: {tf.config.list_physical_devices('GPU')}")
(tf) ubuntu@ubuntu:~/tensorflow$ python3
Python 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2025-10-08 18:26:36.705030: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> print(f"TensorFlow version: {tf.__version__}")
TensorFlow version: 2.20.0
>>> print(f"TensorFlow CUDA available: {tf.test.is_built_with_cuda()}")
TensorFlow CUDA available: True
>>> print(f"GPU available: {tf.config.list_physical_devices('GPU')}")
GPU available: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>> quit()
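TensorFlow versions after 2.10 have no GPU support on native Windows (use WSL2)
See the official TensorFlow guide for installing on native Windows.
On Win11 I installed exactly the same version as on Ubuntu, yet TensorFlow reported no CUDA support:
TensorFlow CUDA available: False
GPU available: []
According to the official site, versions after TensorFlow 2.10 no longer provide GPU support on native Windows (so the options are WSL or version 2.10 and earlier).
Even if you do install TensorFlow 2.10, it cannot use the standalone CUDA 13 release covered here, so it is outside the scope of this post.
If a solution must be given, it is to use WSL2.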
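The rule above can be turned into a quick pre-install check. This sketch only encodes the statement from the official site as quoted in this post (native Windows GPU support ends with 2.10); the function names are hypothetical, and a True result means "not ruled out by this rule" rather than "guaranteed to work".

```python
import platform


def parse_major_minor(version: str) -> tuple[int, int]:
    """Turn a version string such as '2.20.0' into (2, 20)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def gpu_not_ruled_out(tf_version: str, system: str) -> bool:
    """Apply the native-Windows rule: only TensorFlow 2.10 and earlier qualify."""
    if system != "Windows":
        return True
    return parse_major_minor(tf_version) <= (2, 10)


print(gpu_not_ruled_out("2.20.0", "Windows"))
print(gpu_not_ruled_out("2.10.1", "Windows"))
# platform.system() reports "Linux" inside WSL2, so the rule does not apply there.
print(gpu_not_ruled_out("2.20.0", platform.system()))
```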
If you need the GPU, it is best to choose WSL-Ubuntu or another Debian-family distribution. While using WSL-Archlinux I kept running into statements that GPU support exists only for the Debian family, which was very frustrating.
Rebuild the environment in WSL2:
pip install tensorflow[and-cuda]
The [and-cuda] extra must be included; without it, running the checks gives this output:
>>> print(f"TensorFlow CUDA available: {tf.test.is_built_with_cuda()}")
TensorFlow CUDA available: True
>>> print(f"GPU available: {tf.config.list_physical_devices('GPU')}")
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1759992696.028735 3065 gpu_device.cc:2342] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
GPU available: []
TensorFlow reports that it is built with CUDA, but it cannot find a usable GPU and complains about missing GPU libraries.
If you hit this problem, reinstall with the full extra:
pip install tensorflow[and-cuda]
>>> print(f"TensorFlow CUDA available: {tf.test.is_built_with_cuda()}")
TensorFlow CUDA available: True
>>> print(f"GPU available: {tf.config.list_physical_devices('GPU')}")
GPU available: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
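VII. Advanced: GPU Matrix Multiplication and Speed Comparison
This is the most direct way to test compatibility and usability, and it also gives immediate hands-on practice with GPU programming on the NVIDIA driver.
1. C++ test program
Create the file with whichever tool suits your system:
nano test_gpu.cu
notepad test_gpu.cu
touch test_gpu.cu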
#include <iostream>
#include <chrono>
#include <cuda_runtime.h>
#include <cmath>
#include <cstdlib>  // rand, exit, EXIT_FAILURE
#include <iomanip>
// Abort with file and line information when a CUDA runtime call fails.
#define CUDA_CHECK(call) do { cudaError_t err = (call); if (err != cudaSuccess) { std::cerr << "CUDA error at " << __FILE__ << ":" << __LINE__ << " - " << cudaGetErrorString(err) << std::endl; exit(EXIT_FAILURE); } } while (0)
const int MATRIX_SIZE = 1024 ;
const int BLOCK_SIZE = 16 ;
void cpu_matrix_multiply (const float * A,const float * B,float * C,int size) {
for (int i = 0 ; i < size;++i){
for (int j = 0 ; j < size;++j){
float sum = 0.0f ;
for (int k = 0 ; k < size;++k){ sum += A[i * size + k]* B[k * size + j]; }
C[i * size + j]= sum;
}
}
}
__global__ void gpu_matrix_multiply_basic (float * A,float * B,float * C,int size) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < size && col < size){
float sum = 0.0f ;
for (int k = 0 ; k < size;++k){ sum += A[row * size + k]* B[k * size + col]; }
C[row * size + col]= sum;
}
}
__global__ void gpu_matrix_multiply_shared (float * A,float * B,float * C,int size) {
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
float sum = 0.0f ;
for (int t = 0 ; t<(size + BLOCK_SIZE -1 )/ BLOCK_SIZE;++t){
int tiledCol = t * BLOCK_SIZE + threadIdx.x;
int tiledRow = t * BLOCK_SIZE + threadIdx.y;
if (row < size && tiledCol < size){ As[threadIdx.y][threadIdx.x]= A[row * size + tiledCol]; }else { As[threadIdx.y][threadIdx.x]=0.0f ; }
if (tiledRow < size && col < size){ Bs[threadIdx.y][threadIdx.x]= B[tiledRow * size + col]; }else { Bs[threadIdx.y][threadIdx.x]=0.0f ; }
__syncthreads();
for (int k = 0 ; k < BLOCK_SIZE;++k){ sum += As[threadIdx.y][k]* Bs[k][threadIdx.x]; }
__syncthreads();
}
if (row < size && col < size){ C[row * size + col]= sum; }
}
int main () {
std::cout << "=== Large Matrix Multiplication Test ===" << std::endl;
std::cout << "Matrix size: " << MATRIX_SIZE << "x" << MATRIX_SIZE << std::endl;
std::cout << "Memory per matrix: " <<(MATRIX_SIZE * MATRIX_SIZE *sizeof (float )/(1024.0 *1024.0 ))<< " MB" << std::endl;
int deviceCount;
CUDA_CHECK (cudaGetDeviceCount (&deviceCount));
if (deviceCount == 0 ){ std::cerr << "Error: No CUDA devices found" << std::endl; return EXIT_FAILURE; }
cudaDeviceProp prop;
CUDA_CHECK (cudaGetDeviceProperties (&prop,0 ));
std::cout << "Using CUDA device: " << prop.name << std::endl;
std::cout << "Available GPU memory: " << prop.totalGlobalMem /(1024.0 *1024.0 ) << " MB" << std::endl;
const int size = MATRIX_SIZE;
const size_t mem_size = size * size *sizeof (float );
if (mem_size *3 > prop.totalGlobalMem){ std::cerr << "Error: Not enough GPU memory for " << size << "x" << size << " matrices" << std::endl; std::cerr << "Required: " <<(mem_size *3 /(1024.0 *1024.0 ))<< " MB" << std::endl; std::cerr << "Available: " << prop.totalGlobalMem /(1024.0 *1024.0 ) << " MB" << std::endl; return EXIT_FAILURE; }
float * h_A = new float [size * size];
float * h_B = new float [size * size];
float * h_C_cpu = new float [size * size];
float * h_C_gpu_basic = new float [size * size];
float * h_C_gpu_shared = new float [size * size];
std::cout << "Initializing matrices..." << std::endl;
for (int i = 0 ; i < size * size;++i){ h_A[i]=static_cast <float >(rand ())/ RAND_MAX; h_B[i]=static_cast <float >(rand ())/ RAND_MAX; }
std::cout << "--- CPU Computation ---" << std::endl;
auto start_cpu = std::chrono::high_resolution_clock::now ();
cpu_matrix_multiply (h_A, h_B, h_C_cpu, size);
auto end_cpu = std::chrono::high_resolution_clock::now ();
std::chrono::duration<double > cpu_duration = end_cpu - start_cpu;
std::cout << "CPU time: " << std::fixed << std::setprecision (3 )<< cpu_duration.count ()<< " seconds" << std::endl;
std::cout << "--- GPU Computation (Basic) ---" << std::endl;
float *d_A,*d_B,*d_C;
CUDA_CHECK (cudaMalloc (&d_A, mem_size));
CUDA_CHECK (cudaMalloc (&d_B, mem_size));
CUDA_CHECK (cudaMalloc (&d_C, mem_size));
CUDA_CHECK (cudaMemcpy (d_A, h_A, mem_size, cudaMemcpyHostToDevice));
CUDA_CHECK (cudaMemcpy (d_B, h_B, mem_size, cudaMemcpyHostToDevice));
dim3 threadsPerBlock (BLOCK_SIZE, BLOCK_SIZE) ;
dim3 blocksPerGrid ((size + threadsPerBlock.x -1 )/ threadsPerBlock.x,(size + threadsPerBlock.y -1 )/ threadsPerBlock.y) ;
auto start_gpu_basic = std::chrono::high_resolution_clock::now ();
gpu_matrix_multiply_basic<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, size);
CUDA_CHECK (cudaGetLastError ());
CUDA_CHECK (cudaDeviceSynchronize ());
auto end_gpu_basic = std::chrono::high_resolution_clock::now ();
std::chrono::duration<double > gpu_basic_duration = end_gpu_basic - start_gpu_basic;
CUDA_CHECK (cudaMemcpy (h_C_gpu_basic, d_C, mem_size, cudaMemcpyDeviceToHost));
std::cout << "GPU basic time: " << std::fixed << std::setprecision (3 )<< gpu_basic_duration.count ()<< " seconds" << std::endl;
std::cout << "--- GPU Computation (Shared Memory Optimized) ---" << std::endl;
auto start_gpu_shared = std::chrono::high_resolution_clock::now ();
gpu_matrix_multiply_shared<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, size);
CUDA_CHECK (cudaGetLastError ());
CUDA_CHECK (cudaDeviceSynchronize ());
auto end_gpu_shared = std::chrono::high_resolution_clock::now ();
std::chrono::duration<double > gpu_shared_duration = end_gpu_shared - start_gpu_shared;
CUDA_CHECK (cudaMemcpy (h_C_gpu_shared, d_C, mem_size, cudaMemcpyDeviceToHost));
std::cout << "GPU shared memory time: " << std::fixed << std::setprecision (3 )<< gpu_shared_duration.count ()<< " seconds" << std::endl;
CUDA_CHECK (cudaFree (d_A));
CUDA_CHECK (cudaFree (d_B));
CUDA_CHECK (cudaFree (d_C));
std::cout << "--- Performance Comparison ---" << std::endl;
std::cout << "CPU time: " << cpu_duration.count ()<< " seconds" << std::endl;
std::cout << "GPU basic time: " << gpu_basic_duration.count ()<< " seconds" << std::endl;
std::cout << "GPU shared memory time: " << gpu_shared_duration.count ()<< " seconds" << std::endl;
std::cout << "Speedup (basic vs CPU): " << std::fixed << std::setprecision (2 )<< cpu_duration.count ()/ gpu_basic_duration.count ()<< "x" << std::endl;
std::cout << "Speedup (shared vs CPU): " << std::fixed << std::setprecision (2 )<< cpu_duration.count ()/ gpu_shared_duration.count ()<< "x" << std::endl;
std::cout << "Speedup (shared vs basic): " << std::fixed << std::setprecision (2 )<< gpu_basic_duration.count ()/ gpu_shared_duration.count ()<< "x" << std::endl;
std::cout << "--- Result Verification ---" << std::endl;
float max_error_basic = 0.0f ;
for (int i = 0 ; i < size * size;++i){float error = fabs (h_C_cpu[i]- h_C_gpu_basic[i]); max_error_basic = fmax (max_error_basic, error);}
std::cout << "Max error (basic): " << std::scientific << max_error_basic << std::endl;
float max_error_shared = 0.0f ;
for (int i = 0 ; i < size * size;++i){float error = fabs (h_C_cpu[i]- h_C_gpu_shared[i]); max_error_shared = fmax (max_error_shared, error);}
std::cout << "Max error (shared): " << std::scientific << max_error_shared << std::endl;
if (max_error_basic < 1e-10) { std::cout << "✅ CPU and GPU results are perfectly consistent" << std::endl; }
else if (max_error_basic < 1e-5) { std::cout << "✅ CPU and GPU results are consistent (excellent accuracy)" << std::endl; }
else if (max_error_basic < 1e-3) { std::cout << "⚠️ CPU and GPU results show minor differences (acceptable for most applications)" << std::endl; }
else { std::cout << "❌ CPU and GPU results differ significantly" << std::endl; }
delete [] h_A;
delete [] h_B;
delete [] h_C_cpu;
delete [] h_C_gpu_basic;
delete [] h_C_gpu_shared;
std::cout << "Test completed successfully!" << std::endl;
return 0 ;
}
nvcc -o test_gpu test_gpu.cu
./test_gpu
My results (Ubuntu cloud server: 32-core CPU, 64 GB memory, NVIDIA T4 GPU):
=== Large Matrix Multiplication Test ===
Matrix size: 1024x1024
Memory per matrix: 4 MB
Using CUDA device: Tesla T4
Available GPU memory: 14912.7 MB
Initializing matrices...
--- CPU Computation ---
CPU time: 3.943 seconds
--- GPU Computation (Basic) ---
GPU basic time: 0.009 seconds
--- GPU Computation (Shared Memory Optimized) ---
GPU shared memory time: 0.006 seconds
--- Performance Comparison ---
CPU time: 3.943 seconds
GPU basic time: 0.009 seconds
GPU shared memory time: 0.006 seconds
Speedup (basic vs CPU): 422.91x
Speedup (shared vs CPU): 677.10x
Speedup (shared vs basic): 1.60x
--- Result Verification ---
Max error (basic): 9.16e-05
Max error (shared): 9.16e-05
⚠️ CPU and GPU results show minor differences (acceptable for most applications)
Test completed successfully!
CPU time: 3.943 s, typical for a single-threaded computation
GPU basic version: 0.009 s, dramatically faster
GPU optimized version: 0.006 s, faster still
Error level: 9.16e-05 (floating-point error in the fourth or fifth decimal place, acceptable)
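To put these times on a common scale, they can be converted to throughput. A dense N x N matrix product performs N^3 multiplications and N^3 additions, so about 2 * N^3 floating-point operations. The sketch below applies that count to the rounded times reported above; the function name is hypothetical, and because the GPU times are rounded to three decimals the GPU figures are only rough.

```python
def matmul_gflops(n: int, seconds: float) -> float:
    """Throughput of an n x n matrix product in GFLOP/s (2 * n^3 operations)."""
    return 2 * n**3 / seconds / 1e9


# Rounded times from the Tesla T4 run above.
for label, seconds in [("CPU", 3.943), ("GPU basic", 0.009), ("GPU shared", 0.006)]:
    print(f"{label}: {matmul_gflops(1024, seconds):.1f} GFLOP/s")
```

The same function works for the laptop numbers; only the seconds change.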
My results (Win11 laptop: i7-13650HX, RTX 4060 with 8 GB of video memory):
=== Large Matrix Multiplication Test ===
Matrix size: 1024x1024
Memory per matrix: 4 MB
Using CUDA device: NVIDIA GeForce RTX 4060 Laptop GPU
Available GPU memory: 8187.5 MB
Initializing matrices...
--- CPU Computation ---
CPU time: 2.736 seconds
--- GPU Computation (Basic) ---
GPU basic time: 0.092 seconds
--- GPU Computation (Shared Memory Optimized) ---
GPU shared memory time: 0.003 seconds
--- Performance Comparison ---
CPU time: 2.736 seconds
GPU basic time: 0.092 seconds
GPU shared memory time: 0.003 seconds
Speedup (basic vs CPU): 29.65x
Speedup (shared vs CPU): 869.73x
Speedup (shared vs basic): 29.33x
--- Result Verification ---
Max error (basic): 9.16e-05
Max error (shared): 9.16e-05
⚠️ CPU and GPU results show minor differences (acceptable for most applications)
Test completed successfully!
CPU time: 2.736 s, even a bit faster than the server
GPU basic version: 0.092 s, clearly slower than the GPU server but still faster than the CPU
GPU optimized version: 0.003 s, wow
2. PyTorch test program
cd pytorch
nano test_gpu.py
notepad test_gpu.py
touch test_gpu.py
import torch
import time

device_cpu = torch.device('cpu')
device_gpu = torch.device('cuda')

N = 1024
A_cpu = torch.randn(N, N, device=device_cpu)
B_cpu = torch.randn(N, N, device=device_cpu)
A_gpu = A_cpu.to(device_gpu)
B_gpu = B_cpu.to(device_gpu)

# Time the CPU product.
start_time = time.time()
C_cpu = torch.mm(A_cpu, B_cpu)
cpu_time = time.time() - start_time

# Warm-up run so CUDA initialization is not included in the timing.
_ = torch.mm(A_gpu, B_gpu)
torch.cuda.synchronize()

# Time the GPU product; synchronize so the kernel has finished before the clock stops.
start_time = time.time()
C_gpu = torch.mm(A_gpu, B_gpu)
torch.cuda.synchronize()
gpu_time = time.time() - start_time

print(f"\nPyTorch Matrix Multiplication ({N}x{N})")
print(f"CPU time: {cpu_time:.4f} seconds")
print(f"GPU time: {gpu_time:.4f} seconds")
print(f"Speedup: {cpu_time / gpu_time:.2f}x")

# Compare the results on the CPU.
C_gpu_cpu = C_gpu.cpu()
max_error = torch.max(torch.abs(C_cpu - C_gpu_cpu))
print(f"Maximum error between CPU and GPU: {max_error.item()}")
source torch/bin/activate
pip install numpy
python3 test_gpu.py
python test_gpu.py
My results (Ubuntu cloud server: 32-core CPU, 64 GB memory, NVIDIA T4 GPU):
PyTorch Matrix Multiplication (1024x1024)
CPU time: 0.0050 seconds
GPU time: 0.0006 seconds
Speedup: 7.95x
Maximum error between CPU and GPU: 7.62939453125e-05
My results (Win11 laptop: i7-13650HX, RTX 4060 with 8 GB of video memory):
PyTorch Matrix Multiplication (1024x1024)
CPU time: 0.0130 seconds
GPU time: 0.0010 seconds
Speedup: 12.99x
Maximum error between CPU and GPU: 6.103515625e-05
It may look as if Python does CPU math better than C++, but the difference comes from optimization: torch.mm calls an optimized, multithreaded BLAS routine, whereas the C++ CPU baseline above is a naive single-threaded triple loop.
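The timings above come from a single run measured with time.time(), which is sensitive to noise at the millisecond scale. A minimal sketch of a steadier measurement, assuming only the standard library: warm up first, repeat several times, and keep the best time from time.perf_counter(). The helper names and the pure-Python matrix product are hypothetical and only stand in for the real workload.

```python
import time


def best_of(fn, repeats: int = 5, warmup: int = 1) -> float:
    """Run fn a few times after a warm-up and return the fastest wall-clock time."""
    for _ in range(warmup):
        fn()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def naive_matmul(a, b):
    """Triple-loop matrix product on nested lists, like the C++ CPU baseline."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


N = 32
A = [[float(i + j) for j in range(N)] for i in range(N)]
B = [[float(i - j) for j in range(N)] for i in range(N)]
print(f"best of 5: {best_of(lambda: naive_matmul(A, B)):.6f} s")
```

When timing GPU work this way, the function passed in has to end with torch.cuda.synchronize(), as in the script above, or only the kernel launch is measured.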
3. TensorFlow test program
cd tensorflow
nano test_gpu.py
notepad test_gpu.py
touch test_gpu.py
import tensorflow as tf
import time

print("Available devices:")
for device in tf.config.list_physical_devices():
    print(f"  {device.device_type}: {device.name}")

N = 1024
A = tf.random.normal((N, N))
B = tf.random.normal((N, N))

print("\nRunning on CPU...")
with tf.device('/CPU:0'):
    A_cpu = tf.identity(A)
    B_cpu = tf.identity(B)
    start_time = time.time()
    C_cpu = tf.matmul(A_cpu, B_cpu)
    cpu_time = time.time() - start_time

gpu_available = tf.config.list_physical_devices('GPU')
if gpu_available:
    print("Running on GPU...")
    with tf.device('/GPU:0'):
        A_gpu = tf.identity(A)
        B_gpu = tf.identity(B)
        tf.matmul(A_gpu, B_gpu)  # warm-up run so one-time setup is not timed
        # GPU ops may be dispatched asynchronously, so this wall-clock time can under-report.
        start_time = time.time()
        C_gpu = tf.matmul(A_gpu, B_gpu)
        gpu_time = time.time() - start_time
else:
    print("GPU not available")
    gpu_time = float('inf')

print(f"\nTensorFlow Matrix Multiplication ({N}x{N})")
print(f"CPU time: {cpu_time:.4f} seconds")
if gpu_available:
    print(f"GPU time: {gpu_time:.4f} seconds")
    print(f"Speedup: {cpu_time / gpu_time:.2f}x")
    max_error = tf.reduce_max(tf.abs(C_cpu - C_gpu))
    print(f"Maximum error between CPU and GPU: {max_error.numpy()}")
else:
    print("GPU: Not available")
python3 test_gpu.py
python test_gpu.py
My results (Ubuntu cloud server: 32-core CPU, 64 GB memory, NVIDIA T4 GPU; TensorFlow 2.20.0):
2025-10-08 21:33:45.265201: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Available devices:
  CPU: /physical_device:CPU:0
  GPU: /physical_device:GPU:0
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1759930427.172182 100970 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13757 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:00:08.0, compute capability: 7.5
Running on CPU...
Running on GPU...
TensorFlow Matrix Multiplication (1024x1024)
CPU time: 0.0213 seconds
GPU time: 0.0003 seconds
Speedup: 84.88x
Maximum error between CPU and GPU: 8.392333984375e-05
My results (Win11 laptop in WSL2 Ubuntu 24.04: i7-13650HX, RTX 4060 with 8 GB of video memory):
2025-10-09 15:22:33.967581: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-10-09 15:22:34.000728: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-10-09 15:22:34.849191: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Available devices:
  CPU: /physical_device:CPU:0
  GPU: /physical_device:GPU:0
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1759994555.346035 3525 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5561 MB memory: -> device: 0, name: NVIDIA GeForce RTX 4060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.9
Running on CPU...
Running on GPU...
TensorFlow Matrix Multiplication (1024x1024)
CPU time: 0.0173 seconds
GPU time: 0.0004 seconds
Speedup: 48.21x
Maximum error between CPU and GPU: 0.04720115661621094
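The run printed a lot of optimization-related log messages, but for this task TensorFlow's GPU path appears to be the faster one. The larger maximum error on the RTX 4060 (about 0.047) may come from TensorFloat-32 math, which TensorFlow enables by default on Ampere and newer GPUs.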
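Summary
Setting up and testing the GPU programming environment on all three platforms took me a little over a day, writing this guide as I went. Because I was writing it down, I know exactly how I carried out each step, and wanting the guide to be complete pushed me beyond my original goal. In the end I learned a great deal.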