Installing and Verifying NVIDIA Driver 580, CUDA 13, and cuDNN 9.13 on Win11, Ubuntu, and WSL2 | 极客日志


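A detailed walkthrough of installing NVIDIA driver 580, CUDA 13, and cuDNN 9.13 on Windows 11, Ubuntu, and WSL2. It covers removing old drivers, installation through each system's package manager, environment-variable configuration, and version verification. It also provides C++, PyTorch, and TensorFlow test code with performance comparisons, works around the lack of TensorFlow GPU support on native Windows by using WSL2, and includes a matrix-multiplication speed-optimization example.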

Published by 无尘 on 2026/3/28 · Updated 2026/7/1845 views

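Overview

Linux: the distribution is Ubuntu 24.04, with NVIDIA driver package nvidia-headless-no-dkms-580-server-open, CUDA package cuda-toolkit-13-0, and cuDNN package libcudnn9-dev-cuda-13 (9.13.1).
Windows: Windows 11 with NVIDIA Studio driver 581.29, CUDA 13.0.1, and cuDNN 9.13.1.
WSL2: Ubuntu 24.04 and Arch Linux, both sharing the NVIDIA driver of the Windows host. Under WSL, only Ubuntu has CUDA 13, and at the time of writing the only cuDNN available for WSL is 8.9.2 for CUDA 12.
Verification: version checks, a C++ build, PyTorch and TensorFlow functional tests, and fixes for common problems.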

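I. Before Installing

If you have never installed an NVIDIA driver, skip this step.

On Linux, make sure the old version is removed before installing (remove it first, otherwise the packages will conflict).
The simplest way is to run:

sudo apt-get purge 'nvidia*'

Or, if you know the exact package name, run:

sudo apt remove nvidia-v # replace with the exact name of your installed version

If you don't know the version, list the installed packages with:

dpkg -l | grep nvidia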
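
To make the `dpkg -l` listing easier to act on, here is a minimal Python sketch that extracts the installed NVIDIA packages and their versions from that output. The function name and the sample text are hypothetical and only illustrate the usual column layout (status, name, version, architecture, description); on a real machine you would feed it the actual output of `dpkg -l`.

```python
def installed_nvidia_packages(dpkg_output: str) -> list[tuple[str, str]]:
    """Return (package, version) pairs for installed packages whose name contains 'nvidia'."""
    found = []
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Fully installed packages carry the status code "ii";
        # header lines and removed-but-configured ("rc") packages are skipped.
        if len(fields) < 3 or fields[0] != "ii":
            continue
        name = fields[1].split(":")[0]  # drop a multi-arch suffix such as ":amd64"
        if "nvidia" in name:
            found.append((name, fields[2]))
    return found


# Hypothetical sample, not taken from a real system.
SAMPLE = """\
Desired=Unknown/Install/Remove/Purge/Hold
||/ Name                 Version              Architecture Description
+++-====================-====================-============-===========
ii  nvidia-utils-580     580.65.06-0ubuntu1   amd64        NVIDIA driver support binaries
rc  nvidia-settings      470.57.01-0ubuntu3   amd64        Tool for configuring the NVIDIA graphics driver
ii  libnvidia-compute-580:amd64 580.65.06-0ubuntu1 amd64   NVIDIA libcompute package
"""

for package, version in installed_nvidia_packages(SAMPLE):
    print(package, version)
```

The names it returns are the ones to pass to `sudo apt remove`.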

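In my case no old version turned up, and the server was brand new, so I took the most thorough approach:

# Purge every NVIDIA-related package (patterns are quoted so the shell does not expand them)
sudo apt purge '*nvidia*' '*cuda*' '*cudnn*' '*nsight*'
# Remove the dependencies those packages pulled in
sudo apt autoremove
# Run NVIDIA's own uninstaller as well (if it exists)
sudo /usr/bin/nvidia-uninstall

On Windows, if the NVIDIA App is installed, you can open it and upgrade the driver directly. Mine was already up to date; if yours is not, use the update button in the app.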

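II. Installing the New Driver

1. Linux (Ubuntu)

Update apt first:

sudo apt update

Then list the driver packages that are available:

apt search nvidia-driver

The output lists several variants. Pick driver if the machine has a display attached and headless if it does not; server is the better choice for servers, and open means the open-source kernel modules. I was setting up a cloud compute server, so I used:

sudo apt install nvidia-headless-580-server-open

If you have no special requirements, install this one:

sudo apt install nvidia-driver-580-open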
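
The naming rule described above (driver or headless, optional server, optional open) can be sketched as a small helper. This is only an illustration inferred from the package names that appear in this article; the function and its parameters are hypothetical, and the result should always be checked against `apt search nvidia-driver` before installing.

```python
def nvidia_package(branch: int, *, headless: bool = False, server: bool = False,
                   open_modules: bool = True, dkms: bool = True) -> str:
    """Compose an Ubuntu driver package name from the choices discussed above."""
    parts = ["nvidia", "headless" if headless else "driver"]
    if headless and not dkms:
        parts.append("no-dkms")   # variant named in the overview of this article
    parts.append(str(branch))
    if server:
        parts.append("server")
    if open_modules:
        parts.append("open")
    return "-".join(parts)


# Desktop with a display, open kernel modules
print(nvidia_package(580))
# Headless cloud server
print(nvidia_package(580, headless=True, server=True))
```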

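After installation, running nvidia-smi reports that nvidia-utils is not installed yet, so install it as well (on a server, delete the # so that the package name ends in -server):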

sudo apt install nvidia-utils-580 # -server

sudo reboot

sudo apt-get install dkms # install dkms first
ls -l /usr/src/ # look here if you don't know the driver version
sudo dkms install -m nvidia -v 580.65.06 # replace with your own version

III. Installing the CUDA Toolkit

# Amazon Linux
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/x86_64/cuda-amzn2023.repo
sudo dnf clean all
sudo dnf -y install cuda-toolkit-13-0

#Azure-Linux
curl https://developer.download.nvidia.com/compute/cuda/repos/azl3/x86_64/cuda-azl3.repo |
sudo tee /etc/yum.repos.d/cuda-azl3.repo
sudo tdnf -y install azurelinux-repos-extended
sudo tdnf clean all
sudo tdnf -y install cuda-toolkit-13-0

#Debian
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-0

#Fedora
sudo dnf config-manager addrepo --from-repofile https://developer.download.nvidia.com/compute/cuda/repos/fedora42/x86_64/cuda-fedora42.repo
sudo dnf clean all
sudo dnf -y install cuda-toolkit-13-0

#KylinOS
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/kylin10/x86_64/cuda-kylin10.repo
sudo dnf clean all
sudo dnf -y install cuda-toolkit-13-0

#OpenSUSE
sudo zypper addrepo https://developer.download.nvidia.com/compute/cuda/repos/opensuse15/x86_64/cuda-opensuse15.repo
sudo zypper refresh
sudo zypper install -y cuda-toolkit-13-0

# Red Hat family
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf clean all
sudo dnf -y install cuda-toolkit-13-0

#SLES
sudo zypper addrepo https://developer.download.nvidia.com/compute/cuda/repos/sles15/x86_64/cuda-sles15.repo
sudo zypper refresh
sudo zypper install -y cuda-toolkit-13-0

#Ubuntu
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-0

#WSL-Ubuntu
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-0
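
All of the repository URLs above follow one pattern: a per-distribution directory under the same base URL, holding either a keyring `.deb` (Debian family) or a `.repo` file (RPM family). A minimal sketch of that pattern, assuming the layout stays as shown in the commands above; the helper names are hypothetical.

```python
BASE = "https://developer.download.nvidia.com/compute/cuda/repos"


def repo_dir(distro: str, arch: str = "x86_64") -> str:
    """Directory that holds the CUDA repository for one distribution tag."""
    return f"{BASE}/{distro}/{arch}"


def keyring_url(distro: str) -> str:
    """Keyring package used by the Debian, Ubuntu, and WSL-Ubuntu commands."""
    return f"{repo_dir(distro)}/cuda-keyring_1.1-1_all.deb"


def repo_file_url(distro: str) -> str:
    """Repo file used by the dnf, tdnf, and zypper commands."""
    return f"{repo_dir(distro)}/cuda-{distro}.repo"


print(keyring_url("ubuntu2404"))
print(repo_file_url("rhel9"))
```

Only the distribution tag (`ubuntu2404`, `wsl-ubuntu`, `rhel9`, and so on) changes between the blocks, which makes it easy to spot a mistyped URL.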

nvidia-smi

IV. Installing cuDNN

1. Windows

pip installation pitfalls (Win11)

pip install nvidia-cudnn

pip install nvidia-pyindex

pip install wheel

pip install nvidia-pyindex
pip install nvidia-cudnn

2. Debian family (Ubuntu)

apt search cudnn

3. Other Linux distributions

# Tarball
wget https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-9.13.1.26_cuda13-archive.tar.xz

#Debian
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cudnn

#OpenSUSE
sudo zypper addrepo https://developer.download.nvidia.com/compute/cuda/repos/opensuse15/x86_64/cuda-opensuse15.repo
sudo zypper refresh
sudo zypper install -y cudnn

# Red Hat family
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel10/x86_64/cuda-rhel10.repo
sudo dnf clean all
sudo dnf -y install cudnn

#SLES
sudo zypper addrepo https://developer.download.nvidia.com/compute/cuda/repos/sles15/x86_64/cuda-sles15.repo
sudo zypper refresh
sudo zypper install -y cudnn

#Ubuntu
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cudnn

V. Verifying the Installation and Versions

1. NVIDIA Driver

nvidia-smi

2. CUDA Toolkit

nvcc --version

nano ~/.bashrc # or: vim ~/.bashrc
notepad ~/.bashrc # on Windows, Notepad also works
#code ~/.bashrc # VS Code works on both Windows and Linux

# Put /usr/local/cuda/bin at the front of PATH
export PATH=/usr/local/cuda/bin:$PATH
# Make sure the library path is set as well
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

source ~/.bashrc # apply the change immediately
nvcc --version # check the version

3. cuDNN on Linux

find /usr -name "cudnn_version.h" 2>/dev/null

ls /usr/include/x86_64-linux-gnu/

4. cuDNN on Win11

Get-ChildItem -Path C:\ -Name cudnn.h -File -Recurse -ErrorAction SilentlyContinue

Program Files\NVIDIA\CUDNN\v9.13\include\12.9\cudnn.h
Program Files\NVIDIA\CUDNN\v9.13\include\13.0\cudnn.h

VI. Framework Support Tests

1. C++ support

cat > test_cudnn.cu <<'EOF'
#include <iostream>
#include <cudnn.h>
#include <cuda_runtime.h>
int main(){
    int runtime_version = 0;
    int driver_version = 0;
    cudaRuntimeGetVersion(&runtime_version);
    cudaDriverGetVersion(&driver_version);
    // CUDA encodes versions as 1000 * major + 10 * minor
    std::cout << "CUDA Runtime Version: " << runtime_version / 1000 << "." << (runtime_version % 1000) / 10 << std::endl;
    std::cout << "CUDA Driver Version: " << driver_version / 1000 << "." << (driver_version % 1000) / 10 << std::endl;
    // Create the cuDNN handle
    cudnnHandle_t handle;
    cudnnStatus_t status = cudnnCreate(&handle);
    size_t cudnn_version = cudnnGetVersion();
    if(status == CUDNN_STATUS_SUCCESS){
        // cuDNN 9 encodes versions as 10000 * major + 100 * minor + patch
        std::cout << "cuDNN Version: " << cudnn_version / 10000 << "." << (cudnn_version % 10000) / 100 << "." << cudnn_version % 100 << std::endl;
        std::cout << "✅ cuDNN installed successfully!" << std::endl;
        // Don't forget to destroy the handle
        cudnnDestroy(handle);
    }else{
        std::cout << "❌ cuDNN initialization failed!" << std::endl;
    }
    return 0;
}
EOF

nvcc -o test_cudnn test_cudnn.cu -lcudnn # compile with nvcc (the CUDA compiler) and link the cuDNN library (-lcudnn)
./test_cudnn # run it

CUDA Runtime Version: 12.0
CUDA Driver Version: 13.0
cuDNN Version: 9.13.1
✅ cuDNN installed successfully!
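
The integer-to-version arithmetic used by the test program is easy to check by hand. Here is a minimal Python sketch of the same decoding, assuming the CUDA encoding (1000 × major + 10 × minor) and the cuDNN 9 encoding (10000 × major + 100 × minor + patch) that the C++ code relies on; the function names are hypothetical.

```python
def decode_cuda(version: int) -> tuple[int, int]:
    """Decode the integer from cudaRuntimeGetVersion / cudaDriverGetVersion."""
    return version // 1000, (version % 1000) // 10


def decode_cudnn(version: int) -> tuple[int, int, int]:
    """Decode the integer from cudnnGetVersion (cuDNN 9 style encoding)."""
    return version // 10000, (version % 10000) // 100, version % 100


print(decode_cuda(13000))   # CUDA 13.0
print(decode_cudnn(91301))  # cuDNN 9.13.1
print(decode_cudnn(91002))  # the value PyTorch reports later in this article
```

The same decoding explains the `cuDNN version: 91002` line in the PyTorch output: it is cuDNN 9.10.2 bundled with the wheel, not the system-wide 9.13.1.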

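rm ./test_cudnn
rm ./test_cudnn.cu # get into the habit of cleaning up when you are done

On Windows, nvcc needs the MSVC compiler cl.exe; if it is not on PATH, compilation fails with:

nvcc fatal : Cannot find compiler 'cl.exe' in PATH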

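2. PyTorch support

sudo apt install python3.12-venv # Debian-family systems need this package before venv can be used
mkdir pytorch
cd pytorch # get into the habit of installing libraries inside a virtual environment
python3 -m venv torch # depending on your setup, this may simply be python
source torch/bin/activate # activate the virtual environment (Linux)
#torch/Scripts/activate # activation on Windows
pip install torch # install PyTorch
python3 # again, this may simply be python on your system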

import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"cuDNN enabled: {torch.backends.cudnn.enabled}")
print(f"cuDNN version: {torch.backends.cudnn.version()}")
print(f"PyTorch built with CUDA version: {torch.version.cuda}")

(torch) ubuntu@ubuntu:~/pytorch$ python3
Python 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
/home/ubuntu/pytorch/torch/lib/python3.12/site-packages/torch/_subclasses/functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
cpu = _conversion_method_template(device=torch.device("cpu"))
>>> print(f"PyTorch version: {torch.__version__}")
PyTorch version: 2.8.0+cu128
>>> print(f"CUDA available: {torch.cuda.is_available()}")
CUDA available: True
>>> print(f"cuDNN enabled: {torch.backends.cudnn.enabled}")
cuDNN enabled: True
>>> print(f"cuDNN version: {torch.backends.cudnn.version()}")
cuDNN version: 91002
>>> print(f"PyTorch built with CUDA version: {torch.version.cuda}")
PyTorch built with CUDA version: 12.8
>>> quit()

PyTorch without CUDA support: how to fix it

PyTorch version: 2.8.0+cpu
CUDA available: False
cuDNN enabled: True
cuDNN version: None
PyTorch built with CUDA version: None

pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu130

pip3 install torch torchvision torchaudio --index-url https://mirrors.nju.edu.cn/pytorch/whl/cu130

PyTorch version: 2.8.0+cu129
CUDA available: True
cuDNN enabled: True
cuDNN version: 91002
PyTorch built with CUDA version: 12.9
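
The suffix after the `+` in `torch.__version__` already tells you which build pip installed, which is a quick way to spot a CPU-only wheel. A minimal sketch, assuming the `+cuXYZ` and `+cpu` tags seen in the outputs above; the function name is hypothetical, and wheels without a local tag are reported as unknown rather than guessed.

```python
def build_variant(version: str) -> str:
    """Describe a PyTorch build from its version string, e.g. '2.8.0+cu128'."""
    _, _, local = version.partition("+")
    if local.startswith("cu") and local[2:].isdigit():
        digits = local[2:]
        # The last digit is the minor version: cu128 -> 12.8, cu130 -> 13.0
        return f"CUDA {digits[:-1]}.{digits[-1]}"
    if local == "cpu":
        return "CPU only"
    return "unknown"


for tag in ("2.8.0+cu128", "2.8.0+cpu", "2.8.0+cu129"):
    print(tag, "->", build_variant(tag))
```

A `CPU only` result means the wheel itself lacks CUDA support, so reinstalling from a CUDA wheel index is needed no matter how the driver is configured.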

deactivate # leave the virtual environment
cd .. # leave the directory

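3. TensorFlow support

mkdir tensorflow
cd tensorflow # get into the habit of installing libraries inside a virtual environment
python3 -m venv tf # depending on your setup, this may simply be python
source tf/bin/activate # activate the virtual environment (Linux)
#tf/Scripts/activate # activation on Windows
pip install tensorflow # install TensorFlow
python3 # again, this may simply be python on your system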

import tensorflow as tf
print(f"TensorFlow version: {tf.__version__}")
print(f"TensorFlow CUDA available: {tf.test.is_built_with_cuda()}")
print(f"GPU available: {tf.config.list_physical_devices('GPU')}")
# TensorFlow does not expose the CUDA and cuDNN versions as directly as PyTorch does, but if it can use the GPU the setup is working

(tf) ubuntu@ubuntu:~/tensorflow$ python3
Python 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2025-10-08 18:26:36.705030: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> print(f"TensorFlow version: {tf.__version__}")
TensorFlow version: 2.20.0
>>> print(f"TensorFlow CUDA available: {tf.test.is_built_with_cuda()}")
TensorFlow CUDA available: True
>>> print(f"GPU available: {tf.config.list_physical_devices('GPU')}")
GPU available: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>> quit()

deactivate # leave the virtual environment
cd .. # leave the directory

On native Windows, TensorFlow releases after 2.10 have no GPU support (use WSL2)

TensorFlow CUDA available: False
GPU available: []

pip install "tensorflow[and-cuda]"

>>> print(f"TensorFlow CUDA available: {tf.test.is_built_with_cuda()}")
TensorFlow CUDA available: True
>>> print(f"GPU available: {tf.config.list_physical_devices('GPU')}")
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1759992696.028735 3065 gpu_device.cc:2342] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
GPU available: []

pip install "tensorflow[and-cuda]"

>>> print(f"TensorFlow CUDA available: {tf.test.is_built_with_cuda()}")
TensorFlow CUDA available: True
>>> print(f"GPU available: {tf.config.list_physical_devices('GPU')}")
GPU available: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

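VII. Advanced: GPU Matrix Multiplication and Speed Comparison

1. C++ test program

nano test_gpu.cu
notepad test_gpu.cu
# or create the file first and edit it with another tool
touch test_gpu.cu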

#include <iostream>
#include <chrono>
#include <cuda_runtime.h>
#include <cmath>
#include <cstdlib>
#include <iomanip>
// CUDA error-checking macro
#define CUDA_CHECK(call) do { cudaError_t err = (call); if (err != cudaSuccess) { std::cerr << "CUDA error at " << __FILE__ << ":" << __LINE__ << " - " << cudaGetErrorString(err) << std::endl; exit(EXIT_FAILURE); } } while (0)
// Use a fairly large matrix
const int MATRIX_SIZE = 1024; // 1024x1024 matrix
const int BLOCK_SIZE = 16; // thread-block edge length (16x16 threads per block)
// CPU matrix multiplication
void cpu_matrix_multiply(const float* A, const float* B, float* C, int size){
    for(int i = 0; i < size; ++i){
        for(int j = 0; j < size; ++j){
            float sum = 0.0f;
            for(int k = 0; k < size; ++k){ sum += A[i * size + k] * B[k * size + j]; }
            C[i * size + j] = sum;
        }
    }
}
// GPU matrix multiplication kernel (basic version)
__global__ void gpu_matrix_multiply_basic(float* A, float* B, float* C, int size){
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if(row < size && col < size){
        float sum = 0.0f;
        for(int k = 0; k < size; ++k){ sum += A[row * size + k] * B[k * size + col]; }
        C[row * size + col] = sum;
    }
}
// GPU matrix multiplication kernel (optimized with shared memory)
__global__ void gpu_matrix_multiply_shared(float* A, float* B, float* C, int size){
    // Declare shared memory for each thread block
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    // Row and column of C handled by this thread
    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;
    // Loop over all the tiles that are needed
    for(int t = 0; t < (size + BLOCK_SIZE - 1) / BLOCK_SIZE; ++t){
        // Cooperatively load one tile of A and one tile of B into shared memory
        int tiledCol = t * BLOCK_SIZE + threadIdx.x;
        int tiledRow = t * BLOCK_SIZE + threadIdx.y;
        // Load the tile of A (zero-pad outside the matrix)
        if(row < size && tiledCol < size){ As[threadIdx.y][threadIdx.x] = A[row * size + tiledCol]; }else{ As[threadIdx.y][threadIdx.x] = 0.0f; }
        // Load the tile of B (zero-pad outside the matrix)
        if(tiledRow < size && col < size){ Bs[threadIdx.y][threadIdx.x] = B[tiledRow * size + col]; }else{ Bs[threadIdx.y][threadIdx.x] = 0.0f; }
        // Wait until every thread has finished loading
        __syncthreads();
        // Accumulate the partial sum from the data in shared memory
        for(int k = 0; k < BLOCK_SIZE; ++k){ sum += As[threadIdx.y][k] * Bs[k][threadIdx.x]; }
        // Wait until every thread has finished computing before the tiles are overwritten
        __syncthreads();
    }
    // Write the result to global memory
    if(row < size && col < size){ C[row * size + col] = sum; }
}
int main(){
    std::cout << "=== Large Matrix Multiplication Test ===" << std::endl;
    std::cout << "Matrix size: " << MATRIX_SIZE << "x" << MATRIX_SIZE << std::endl;
    std::cout << "Memory per matrix: " << (MATRIX_SIZE * MATRIX_SIZE * sizeof(float) / (1024.0 * 1024.0)) << " MB" << std::endl;
    // Check for a CUDA device
    int deviceCount;
    CUDA_CHECK(cudaGetDeviceCount(&deviceCount));
    if(deviceCount == 0){ std::cerr << "Error: No CUDA devices found" << std::endl; return EXIT_FAILURE; }
    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDeviceProperties(&prop, 0));
    std::cout << "Using CUDA device: " << prop.name << std::endl;
    std::cout << "Available GPU memory: " << prop.totalGlobalMem / (1024.0 * 1024.0) << " MB" << std::endl;
    const int size = MATRIX_SIZE;
    const size_t mem_size = size * size * sizeof(float);
    // Check that the three matrices fit in GPU memory
    if(mem_size * 3 > prop.totalGlobalMem){ std::cerr << "Error: Not enough GPU memory for " << size << "x" << size << " matrices" << std::endl; std::cerr << "Required: " << (mem_size * 3 / (1024.0 * 1024.0)) << " MB" << std::endl; std::cerr << "Available: " << prop.totalGlobalMem / (1024.0 * 1024.0) << " MB" << std::endl; return EXIT_FAILURE; }
    // Allocate the host matrices
    float* h_A = new float[size * size];
    float* h_B = new float[size * size];
    float* h_C_cpu = new float[size * size];
    float* h_C_gpu_basic = new float[size * size];
    float* h_C_gpu_shared = new float[size * size];
    // Initialize the inputs with random values
    std::cout << "Initializing matrices..." << std::endl;
    for(int i = 0; i < size * size; ++i){ h_A[i] = static_cast<float>(rand()) / RAND_MAX; h_B[i] = static_cast<float>(rand()) / RAND_MAX; }
    // CPU computation
    std::cout << "--- CPU Computation ---" << std::endl;
    auto start_cpu = std::chrono::high_resolution_clock::now();
    cpu_matrix_multiply(h_A, h_B, h_C_cpu, size);
    auto end_cpu = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> cpu_duration = end_cpu - start_cpu;
    std::cout << "CPU time: " << std::fixed << std::setprecision(3) << cpu_duration.count() << " seconds" << std::endl;
    // GPU computation: basic version
    std::cout << "--- GPU Computation (Basic) ---" << std::endl;
    // Allocate device memory
    float *d_A, *d_B, *d_C;
    CUDA_CHECK(cudaMalloc(&d_A, mem_size));
    CUDA_CHECK(cudaMalloc(&d_B, mem_size));
    CUDA_CHECK(cudaMalloc(&d_C, mem_size));
    // Copy the inputs to the device
    CUDA_CHECK(cudaMemcpy(d_A, h_A, mem_size, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_B, h_B, mem_size, cudaMemcpyHostToDevice));
    // Set the block and grid dimensions
    dim3 threadsPerBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 blocksPerGrid((size + threadsPerBlock.x - 1) / threadsPerBlock.x, (size + threadsPerBlock.y - 1) / threadsPerBlock.y);
    auto start_gpu_basic = std::chrono::high_resolution_clock::now();
    // Launch the basic kernel (as the first launch, it may include one-time startup overhead)
    gpu_matrix_multiply_basic<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, size);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());
    auto end_gpu_basic = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> gpu_basic_duration = end_gpu_basic - start_gpu_basic;
    // Copy the result back to the host
    CUDA_CHECK(cudaMemcpy(h_C_gpu_basic, d_C, mem_size, cudaMemcpyDeviceToHost));
    std::cout << "GPU basic time: " << std::fixed << std::setprecision(3) << gpu_basic_duration.count() << " seconds" << std::endl;
    // GPU computation: shared-memory version
    std::cout << "--- GPU Computation (Shared Memory Optimized) ---" << std::endl;
    auto start_gpu_shared = std::chrono::high_resolution_clock::now();
    // Launch the shared-memory kernel
    gpu_matrix_multiply_shared<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, size);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());
    auto end_gpu_shared = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> gpu_shared_duration = end_gpu_shared - start_gpu_shared;
    // Copy the result back to the host
    CUDA_CHECK(cudaMemcpy(h_C_gpu_shared, d_C, mem_size, cudaMemcpyDeviceToHost));
    std::cout << "GPU shared memory time: " << std::fixed << std::setprecision(3) << gpu_shared_duration.count() << " seconds" << std::endl;
    // Free device memory
    CUDA_CHECK(cudaFree(d_A));
    CUDA_CHECK(cudaFree(d_B));
    CUDA_CHECK(cudaFree(d_C));
    // Performance comparison
    std::cout << "--- Performance Comparison ---" << std::endl;
    std::cout << "CPU time: " << cpu_duration.count() << " seconds" << std::endl;
    std::cout << "GPU basic time: " << gpu_basic_duration.count() << " seconds" << std::endl;
    std::cout << "GPU shared memory time: " << gpu_shared_duration.count() << " seconds" << std::endl;
    std::cout << "Speedup (basic vs CPU): " << std::fixed << std::setprecision(2) << cpu_duration.count() / gpu_basic_duration.count() << "x" << std::endl;
    std::cout << "Speedup (shared vs CPU): " << std::fixed << std::setprecision(2) << cpu_duration.count() / gpu_shared_duration.count() << "x" << std::endl;
    std::cout << "Speedup (shared vs basic): " << std::fixed << std::setprecision(2) << gpu_basic_duration.count() / gpu_shared_duration.count() << "x" << std::endl;
    // Verify the results
    std::cout << "--- Result Verification ---" << std::endl;
    // Check the basic GPU version against the CPU result
    float max_error_basic = 0.0f;
    for(int i = 0; i < size * size; ++i){ float error = fabs(h_C_cpu[i] - h_C_gpu_basic[i]); max_error_basic = fmax(max_error_basic, error); }
    std::cout << "Max error (basic): " << std::scientific << max_error_basic << std::endl;
    // Check the shared-memory GPU version against the CPU result
    float max_error_shared = 0.0f;
    for(int i = 0; i < size * size; ++i){ float error = fabs(h_C_cpu[i] - h_C_gpu_shared[i]); max_error_shared = fmax(max_error_shared, error); }
    std::cout << "Max error (shared): " << std::scientific << max_error_shared << std::endl;
    // The verdict below is based on the basic kernel's error
    if(max_error_basic < 1e-10){ std::cout << "✅ CPU and GPU results are perfectly consistent" << std::endl; }
    else if(max_error_basic < 1e-5){ std::cout << "✅ CPU and GPU results are consistent (excellent accuracy)" << std::endl; }
    else if(max_error_basic < 1e-3){ std::cout << "⚠️ CPU and GPU results show minor differences (acceptable for most applications)" << std::endl; }
    else{ std::cout << "❌ CPU and GPU results differ significantly" << std::endl; }
    // Free host memory
    delete[] h_A;
    delete[] h_B;
    delete[] h_C_cpu;
    delete[] h_C_gpu_basic;
    delete[] h_C_gpu_shared;
    std::cout << "Test completed successfully!" << std::endl;
    return 0;
}
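
The tiling idea behind the shared-memory kernel can be followed on the CPU without any GPU. The sketch below is a plain-Python mirror of that loop structure (one pass per tile, zero-padding at the borders, partial sums accumulated per tile). It is an illustration only: the function names are hypothetical, the lists `As` and `Bs` stand in for the shared-memory arrays, and nothing here runs in parallel.

```python
def matmul_naive(A, B, n):
    """Reference n x n product, one full dot product per output element."""
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matmul_tiled(A, B, n, tile):
    """Same product computed tile by tile, mirroring the shared-memory kernel."""
    C = [[0.0] * n for _ in range(n)]
    num_tiles = (n + tile - 1) // tile            # ceil(n / tile), as in the kernel
    for bi in range(0, n, tile):                  # plays the role of blockIdx.y
        for bj in range(0, n, tile):              # plays the role of blockIdx.x
            for t in range(num_tiles):
                k0 = t * tile
                # "Load" one tile of A and one of B, zero-padded outside the matrix
                As = [[A[i][k] if i < n and k < n else 0.0
                       for k in range(k0, k0 + tile)] for i in range(bi, bi + tile)]
                Bs = [[B[k][j] if k < n and j < n else 0.0
                       for j in range(bj, bj + tile)] for k in range(k0, k0 + tile)]
                # Each (ti, tj) pair plays the role of one thread in the block
                for ti in range(tile):
                    for tj in range(tile):
                        i, j = bi + ti, bj + tj
                        if i < n and j < n:
                            C[i][j] += sum(As[ti][k] * Bs[k][tj] for k in range(tile))
    return C


n = 5  # deliberately not a multiple of the tile size, to exercise the padding
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float(i * j + 1) for j in range(n)] for i in range(n)]
print(matmul_tiled(A, B, n, tile=2) == matmul_naive(A, B, n))
```

On the GPU the benefit comes from each tile element being read from global memory once per block instead of once per thread; the Python version only shows that the arithmetic is unchanged.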

nvcc -o test_gpu test_gpu.cu # cuDNN is not used here
./test_gpu

=== Large Matrix Multiplication Test ===
Matrix size: 1024x1024
Memory per matrix: 4 MB
Using CUDA device: Tesla T4
Available GPU memory: 14912.7 MB
Initializing matrices...
--- CPU Computation ---
CPU time: 3.943 seconds
--- GPU Computation (Basic) ---
GPU basic time: 0.009 seconds
--- GPU Computation (Shared Memory Optimized) ---
GPU shared memory time: 0.006 seconds
--- Performance Comparison ---
CPU time: 3.943 seconds
GPU basic time: 0.009 seconds
GPU shared memory time: 0.006 seconds
Speedup (basic vs CPU): 422.91x
Speedup (shared vs CPU): 677.10x
Speedup (shared vs basic): 1.60x
--- Result Verification ---
Max error (basic): 9.16e-05
Max error (shared): 9.16e-05
⚠️ CPU and GPU results show minor differences (acceptable for most applications)
Test completed successfully!

=== Large Matrix Multiplication Test ===
Matrix size: 1024x1024
Memory per matrix: 4 MB
Using CUDA device: NVIDIA GeForce RTX 4060 Laptop GPU
Available GPU memory: 8187.5 MB
Initializing matrices...
--- CPU Computation ---
CPU time: 2.736 seconds
--- GPU Computation (Basic) ---
GPU basic time: 0.092 seconds
--- GPU Computation (Shared Memory Optimized) ---
GPU shared memory time: 0.003 seconds
--- Performance Comparison ---
CPU time: 2.736 seconds
GPU basic time: 0.092 seconds
GPU shared memory time: 0.003 seconds
Speedup (basic vs CPU): 29.65x
Speedup (shared vs CPU): 869.73x
Speedup (shared vs basic): 29.33x
--- Result Verification ---
Max error (basic): 9.16e-05
Max error (shared): 9.16e-05
⚠️ CPU and GPU results show minor differences (acceptable for most applications)
Test completed successfully!
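
To put the timings above in perspective, the run time can be converted into throughput: an n × n matrix product needs about 2·n³ floating-point operations (one multiply and one add per inner-loop step). A minimal sketch of that arithmetic follows; the helper names are hypothetical.

```python
def matmul_gflops(n: int, seconds: float) -> float:
    """Approximate throughput of an n x n matrix product in GFLOP/s."""
    return 2 * n**3 / seconds / 1e9


def speedup(baseline_seconds: float, seconds: float) -> float:
    """How many times faster `seconds` is than `baseline_seconds`."""
    return baseline_seconds / seconds


# Timings copied from the Tesla T4 log above
print(f"CPU:        {matmul_gflops(1024, 3.943):.2f} GFLOP/s")
print(f"GPU shared: {matmul_gflops(1024, 0.006):.1f} GFLOP/s")
print(f"Speedup from rounded times: {speedup(3.943, 0.009):.1f}x")
```

The log prints times to three decimals but computes the speedups from the unrounded values, so recomputing from the rounded 0.009 s gives roughly 438x rather than the logged 422.91x.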

2. PyTorch test program

cd pytorch
nano test_gpu.py
notepad test_gpu.py
# or just create the file first and edit it with another tool
touch test_gpu.py

import torch
import time
# Set up the devices
device_cpu = torch.device('cpu')
device_gpu = torch.device('cuda')
# Use a fairly large matrix
N = 1024
# Create random matrices
A_cpu = torch.randn(N, N, device=device_cpu)
B_cpu = torch.randn(N, N, device=device_cpu)
# Move them to the GPU
A_gpu = A_cpu.to(device_gpu)
B_gpu = B_cpu.to(device_gpu)
# CPU computation
start_time = time.time()
C_cpu = torch.mm(A_cpu, B_cpu)
cpu_time = time.time() - start_time
# Warm-up: run once without timing so the GPU finishes initializing
_ = torch.mm(A_gpu, B_gpu)
torch.cuda.synchronize()
# GPU computation
start_time = time.time()
C_gpu = torch.mm(A_gpu, B_gpu)
torch.cuda.synchronize()  # wait for the GPU to finish before stopping the clock
gpu_time = time.time() - start_time
# Print the results
print(f"\nPyTorch Matrix Multiplication ({N}x{N})")
print(f"CPU time: {cpu_time:.4f} seconds")
print(f"GPU time: {gpu_time:.4f} seconds")
print(f"Speedup: {cpu_time / gpu_time:.2f}x")
# Move the result back to the CPU for verification
C_gpu_cpu = C_gpu.cpu()
# Check that both results agree
max_error = torch.max(torch.abs(C_cpu - C_gpu_cpu))
print(f"Maximum error between CPU and GPU: {max_error.item()}")
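
A single timed run, as in the script above, is sensitive to noise at the sub-millisecond scale. One common refinement is to warm up, repeat the measurement, and report the median. The helper below is a generic sketch of that pattern using only the standard library; `benchmark` and its parameters are hypothetical names, and for GPU work the callable you pass in must itself wait for the device (for example by calling `torch.cuda.synchronize()` after `torch.mm`).

```python
import statistics
import time


def benchmark(fn, warmup: int = 1, repeats: int = 5) -> float:
    """Return the median wall-clock time of fn() in seconds, after warm-up runs."""
    for _ in range(warmup):
        fn()                                   # untimed runs absorb one-time startup costs
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()            # monotonic, high-resolution timer
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# CPU-only demonstration so the sketch runs anywhere
elapsed = benchmark(lambda: sum(i * i for i in range(100_000)), warmup=2, repeats=5)
print(f"median time: {elapsed:.6f} s")
```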

source torch/bin/activate # activate the virtual environment (Linux)
#torch/Scripts/activate # activation on Windows
pip install numpy # don't forget to install numpy

python3 test_gpu.py
python test_gpu.py

PyTorch Matrix Multiplication (1024x1024)
CPU time: 0.0050 seconds
GPU time: 0.0006 seconds
Speedup: 7.95x
Maximum error between CPU and GPU: 7.62939453125e-05

PyTorch Matrix Multiplication (1024x1024)
CPU time: 0.0130 seconds
GPU time: 0.0010 seconds
Speedup: 12.99x
Maximum error between CPU and GPU: 6.103515625e-05

deactivate # leave the virtual environment
cd .. # leave the directory

3. TensorFlow test program

cd tensorflow
nano test_gpu.py
notepad test_gpu.py
# or just create the file first and edit it with another tool
touch test_gpu.py

import tensorflow as tf
import time
# List the available devices
print("Available devices:")
for device in tf.config.list_physical_devices():
    print(f" {device.device_type}: {device.name}")
# Matrix size (a fairly large matrix)
N = 1024
# Create random matrices
A = tf.random.normal((N, N))
B = tf.random.normal((N, N))
# CPU computation
print("\nRunning on CPU...")
with tf.device('/CPU:0'):
    A_cpu = tf.identity(A)
    B_cpu = tf.identity(B)
    start_time = time.time()
    C_cpu = tf.matmul(A_cpu, B_cpu)
    cpu_time = time.time() - start_time
# GPU computation (if a GPU is available)
gpu_available = tf.config.list_physical_devices('GPU')
if gpu_available:
    print("Running on GPU...")
    with tf.device('/GPU:0'):
        A_gpu = tf.identity(A)
        B_gpu = tf.identity(B)
        # Warm-up (the first run is usually slower)
        tf.matmul(A_gpu, B_gpu)
        start_time = time.time()
        C_gpu = tf.matmul(A_gpu, B_gpu)
        gpu_time = time.time() - start_time
else:
    print("GPU not available")
    gpu_time = float('inf')
# Print the results
print(f"\nTensorFlow Matrix Multiplication ({N}x{N})")
print(f"CPU time: {cpu_time:.4f} seconds")
if gpu_available:
    print(f"GPU time: {gpu_time:.4f} seconds")
    print(f"Speedup: {cpu_time / gpu_time:.2f}x")
    # Check that both results agree
    max_error = tf.reduce_max(tf.abs(C_cpu - C_gpu))
    print(f"Maximum error between CPU and GPU: {max_error.numpy()}")
else:
    print("GPU: Not available")

source tf/bin/activate # activate the virtual environment (Linux)
#tf/Scripts/activate # activation on Windows

python3 test_gpu.py
python test_gpu.py

2025-10-08 21:33:45.265201: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Available devices:
 CPU: /physical_device:CPU:0
 GPU: /physical_device:GPU:0
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1759930427.172182 100970 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13757 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:00:08.0, compute capability: 7.5
Running on CPU...
Running on GPU...
TensorFlow Matrix Multiplication (1024x1024)
CPU time: 0.0213 seconds
GPU time: 0.0003 seconds
Speedup: 84.88x
Maximum error between CPU and GPU: 8.392333984375e-05

2025-10-09 15:22:33.967581: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-10-09 15:22:34.000728: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-10-09 15:22:34.849191: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Available devices:
 CPU: /physical_device:CPU:0
 GPU: /physical_device:GPU:0
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1759994555.346035 3525 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5561 MB memory: -> device: 0, name: NVIDIA GeForce RTX 4060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.9
Running on CPU...
Running on GPU...
TensorFlow Matrix Multiplication (1024x1024)
CPU time: 0.0173 seconds
GPU time: 0.0004 seconds
Speedup: 48.21x
Maximum error between CPU and GPU: 0.04720115661621094

deactivate # leave the virtual environment
cd .. # leave the directory

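Installing and Verifying NVIDIA Driver 580, CUDA 13, and cuDNN 9.13 on Win11, Ubuntu, and WSL2

Overview

I. Before Installing

II. Installing the New Driver

1. Linux (Ubuntu)

2. Windows

3. WSL2

III. Installing the CUDA Toolkit

IV. Installing cuDNN

1. Windows

pip installation pitfalls (Win11)

2. Debian family (Ubuntu)

3. Other Linux distributions

V. Verifying the Installation and Versions

1. NVIDIA Driver

2. CUDA Toolkit

3. cuDNN on Linux

4. cuDNN on Win11

VI. Framework Support Tests

1. C++ support

2. PyTorch support

PyTorch without CUDA support: how to fix it

3. TensorFlow support

On native Windows, TensorFlow releases after 2.10 have no GPU support (use WSL2)

VII. Advanced: GPU Matrix Multiplication and Speed Comparison

1. C++ test program

2. PyTorch test program

3. TensorFlow test program

Summary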
