C++ High-Performance Computing in Practice: Multithreading, SIMD, Memory Pools, and Concurrent Data Structures | 极客日志


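This article introduces the core optimization techniques of high-performance computing in C++: multithreaded programming (C++11/14/17), SIMD instruction-set optimization, memory pool design, and high-performance concurrent data structures. It covers performance metrics and tools, a thread pool implementation, AVX2-accelerated array arithmetic, fixed-size and variable-size memory pools, and CAS-based implementations of a lock-free queue and a hash table. The goal is to help developers raise throughput, lower latency, and reduce resource usage.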

山野诗人 · Published 2026/3/29 · Updated 2026/7/20 · 45 views

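1. Core Concepts and Optimization Goals of High-Performance Computing

1.1 Core performance bottlenecks and optimization directions

Performance bottlenecks in C++ programs fall into three main layers, and each one needs a targeted fix:

Compute bottleneck: CPU capacity is underused and instructions execute inefficiently (addressed with SIMD and multithreading);
Memory bottleneck: frequent allocation and deallocation, low cache hit rates, and severe fragmentation (addressed with memory pools and cache-friendly design);
Concurrency bottleneck: heavy contention between threads, high lock overhead, and frequent context switches (addressed with efficient concurrent data structures and lock-free programming).

Core optimization goals: high throughput (more tasks processed per unit of time), low latency (a short execution time per task), and low resource usage (reasonable memory and CPU utilization).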

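1.2 Performance metrics and tools

(1) Core performance metrics

Throughput: the number of tasks completed per unit of time (for example, records processed per second);
Latency: the total time from issuing a task to its completion (for example, allocation time or compute time);
Cache hit rate: CPU cache hits divided by total accesses (target > 90%);
Memory fragmentation rate: free memory blocks as a proportion of total memory (target < 10%);
Context switches: the number of thread context switches per second (lower is better).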
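
The first two metrics above can be measured directly in code. The following is a minimal sketch, assuming a single-threaded workload; `RunStats` and `measure` are hypothetical names introduced for illustration, not part of the original article.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical summary of one benchmark run.
struct RunStats {
    double throughput_per_sec;  // tasks completed per second
    double avg_latency_us;      // mean per-task latency in microseconds
    double max_latency_us;      // worst per-task latency in microseconds
};

// Runs `task` `count` times and records each call's latency with steady_clock.
template <typename Task>
RunStats measure(Task task, std::size_t count) {
    using clock = std::chrono::steady_clock;
    std::vector<double> latencies_us;
    latencies_us.reserve(count);

    const auto run_start = clock::now();
    for (std::size_t i = 0; i < count; ++i) {
        const auto t0 = clock::now();
        task();
        const auto t1 = clock::now();
        latencies_us.push_back(
            std::chrono::duration<double, std::micro>(t1 - t0).count());
    }
    const double total_sec =
        std::chrono::duration<double>(clock::now() - run_start).count();

    RunStats stats{};
    if (count == 0) {
        return stats;  // nothing was measured
    }
    stats.throughput_per_sec =
        total_sec > 0.0 ? static_cast<double>(count) / total_sec : 0.0;
    stats.avg_latency_us =
        std::accumulate(latencies_us.begin(), latencies_us.end(), 0.0) /
        static_cast<double>(count);
    stats.max_latency_us =
        *std::max_element(latencies_us.begin(), latencies_us.end());
    return stats;
}
```

Calling `measure([] { /* work */ }, 1000)` returns the three figures for that workload. Cache hit rate and context-switch counts are not visible from inside the program in this way and are better read from a profiler.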

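(2) Common performance analysis tools

Compilation analysis: GCC/Clang -O2/-O3 optimization options, and -S to emit assembly;
Performance monitoring: perf (Linux), VTune (Intel), Instruments (macOS);
Memory analysis: Valgrind (memory leaks via Memcheck, heap profiling via Massif), tcmalloc heap profiling;
Cache analysis: Cachegrind (cache hit and miss statistics).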

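2. C++ Multithreading in Practice (C++11/14/17)

C++11 introduced a standard thread library, so portable code no longer has to program directly against platform-specific pthread or Win32 threads. C++14 and C++17 add further features, giving high-performance concurrent code a solid foundation.

2.1 Thread basics: core C++11 components

(1) Thread management (std::thread)

std::thread is the core C++11 thread-management class. It supports creating a thread, waiting for it to finish, and detaching it. join() or detach() must be called before a std::thread object is destroyed; if the object is still joinable at that point, its destructor calls std::terminate.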

#include <iostream>
#include <thread>
#include <chrono>

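// Thread function
void thread_func(int num) {
    for (int i = 0; i < 3; ++i) {
        std::cout << "Thread " << num << " running: " << i << std::endl;
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}

// Reconstructed sketch: the thread names and the final message are assumed,
// because this function was garbled in the source.
int main() {
    // Create two threads
    std::thread t1(thread_func, 1);
    std::thread t2(thread_func, 2);
    // Wait for both threads to finish
    t1.join();
    t2.join();
    std::cout << "All threads finished" << std::endl;
    return 0;
}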
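
The join-or-detach rule above is easy to violate when an exception or an early return skips the join() call. One common way to make it automatic is a small RAII wrapper. The sketch below is illustrative only: `JoiningThread` and `run_workers` are hypothetical names, not part of the original article.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical RAII wrapper: joins the owned thread when it goes out of scope.
class JoiningThread {
public:
    explicit JoiningThread(std::thread t) : thread_(std::move(t)) {}
    ~JoiningThread() {
        if (thread_.joinable()) {
            thread_.join();
        }
    }
    JoiningThread(JoiningThread&&) noexcept = default;
    JoiningThread(const JoiningThread&) = delete;
    JoiningThread& operator=(const JoiningThread&) = delete;
    JoiningThread& operator=(JoiningThread&&) = delete;

private:
    std::thread thread_;
};

// Starts `workers` threads that each add 1 to a shared counter.
int run_workers(int workers) {
    std::atomic<int> counter{0};
    {
        std::vector<JoiningThread> guards;
        guards.reserve(static_cast<std::size_t>(workers));
        for (int i = 0; i < workers; ++i) {
            guards.emplace_back(std::thread([&counter] { counter.fetch_add(1); }));
        }
    }  // every guard joins here, so no std::thread is destroyed while joinable
    return counter.load();
}
```

`run_workers(4)` returns only after all four threads have finished, with no explicit join() at the call site.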


(2) Synchronization mechanisms

#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

std::mutex mtx;
std::condition_variable cv;
std::queue<int> task_queue;
bool is_running = true; // guarded by mtx

// Producer thread
void producer() {
    for (int i = 0; i < 5; ++i) {
        {
            std::lock_guard<std::mutex> lock(mtx); // locks here, unlocks at scope exit
            task_queue.push(i);
            std::cout << "Produce task: " << i << std::endl;
        }
        cv.notify_one(); // wake one waiting thread
        // Sleep outside the lock so the consumer is not blocked in the meantime
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
    }
}

// Consumer thread
void consumer() {
    while (true) {
        std::unique_lock<std::mutex> lock(mtx);
        // Wait until the queue is non-empty or shutdown has been requested
        cv.wait(lock, [] { return !task_queue.empty() || !is_running; });
        if (task_queue.empty()) {
            return; // shutdown requested and the queue is drained
        }
        int task = task_queue.front();
        task_queue.pop();
        std::cout << "Consume task: " << task << std::endl;
    }
}

int main() {
    std::thread prod(producer);
    std::thread cons(consumer);
    prod.join();
    {
        // is_running is read under mtx, so it must be written under mtx too
        std::lock_guard<std::mutex> lock(mtx);
        is_running = false;
    }
    cv.notify_one(); // wake the consumer so it can exit
    cons.join();
    return 0;
}

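2.2 C++14/17 threading enhancements

(1) C++14 features

#include <iostream>
#include <thread>
#include <shared_mutex>

// std::shared_timed_mutex and std::shared_lock are C++14;
// C++17 adds the lighter std::shared_mutex
std::shared_timed_mutex rw_mtx;
int shared_data = 0;

// Reader thread (shared lock: several readers may hold it at once)
void reader(int id) {
    std::shared_lock<std::shared_timed_mutex> lock(rw_mtx);
    std::cout << "Reader " << id << " read data: " << shared_data << std::endl;
}

// Writer thread (exclusive lock)
void writer() {
    std::unique_lock<std::shared_timed_mutex> lock(rw_mtx);
    shared_data++;
    std::cout << "Writer update data: " << shared_data << std::endl;
}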

int main() {
    std::thread r1(reader, 1), r2(reader, 2), w(writer), r3(reader, 3);
    r1.join();
    r2.join();
    w.join();
    r3.join();
    return 0;
}

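(2) C++17 features

#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <execution> // C++17 parallel algorithms
#include <iostream>
#include <thread>    // std::jthread and std::stop_token require C++20
#include <vector>

int main() {
    // Parallel sort (C++17)
    std::vector<int> vec(1000000);
    std::generate(vec.begin(), vec.end(), [] { return std::rand() % 10000; });
    // std::execution::par requests parallel execution
    // (with GCC's libstdc++ this typically requires linking against TBB)
    std::sort(std::execution::par, vec.begin(), vec.end());

    // std::jthread joins automatically in its destructor.
    // Note: it was added in C++20, not C++17.
    std::jthread t([](std::stop_token st) {
        while (!st.stop_requested()) {
            std::cout << "Jthread running..." << std::endl;
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(300));
    t.request_stop(); // ask the thread to stop cooperatively
    return 0;
}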

2.3 In practice: a high-performance thread pool

#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <iostream>
#include <memory>
#include <mutex>
#include <queue>
#include <stdexcept>
#include <thread>
#include <vector>

class ThreadPool {
public:
    // Constructor: start the worker threads
    explicit ThreadPool(std::size_t thread_num = std::thread::hardware_concurrency()) : is_running_(true) {
        // Defaults to the number of hardware threads; hardware_concurrency() may return 0
        if (thread_num == 0) {
            thread_num = 1;
        }
        for (std::size_t i = 0; i < thread_num; ++i) {
            threads_.emplace_back([this]() { this->worker(); });
        }
    }

    ThreadPool(const ThreadPool&) = delete;
    ThreadPool& operator=(const ThreadPool&) = delete;

    // Destructor: shut the pool down once the queued tasks have run
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            is_running_ = false;
        }
        cv_.notify_all(); // wake every worker thread
        for (auto& t : threads_) {
            if (t.joinable()) {
                t.join();
            }
        }
    }

    // Submit a task; the returned future yields its result
    template<typename F, typename... Args>
    auto submit(F&& f, Args&&... args) -> std::future<decltype(f(args...))> {
        using ReturnType = decltype(f(args...));
        // packaged_task is move-only, so it is held in a shared_ptr
        // to fit into the copyable std::function stored in the queue
        auto task = std::make_shared<std::packaged_task<ReturnType()>>(
            std::bind(std::forward<F>(f), std::forward<Args>(args)...));
        std::future<ReturnType> future = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mtx_);
            if (!is_running_) {
                throw std::runtime_error("ThreadPool is stopped");
            }
            task_queue_.emplace([task]() { (*task)(); });
        }
        cv_.notify_one(); // wake one worker thread
        return future;
    }

private:
    // Worker thread loop
    void worker() {
        while (true) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mtx_);
                // Wait for a task or for shutdown
                cv_.wait(lock, [this]() { return !is_running_ || !task_queue_.empty(); });
                if (!is_running_ && task_queue_.empty()) {
                    return; // pool stopped and no tasks left: exit
                }
                // Take the next task
                task = std::move(task_queue_.front());
                task_queue_.pop();
            }
            task(); // run the task outside the lock
        }
    }

private:
    std::vector<std::thread> threads_;             // worker threads
    std::queue<std::function<void()>> task_queue_; // pending tasks
    std::mutex mtx_;                               // protects task_queue_ and is_running_
    std::condition_variable cv_;                   // signals new tasks and shutdown
    bool is_running_;                              // pool state, guarded by mtx_
};

// Test
int main() {
    ThreadPool pool(4); // pool with 4 threads
    for (int i = 0; i < 8; ++i) {
        pool.submit([i]() {
            std::cout << "Task " << i << " run in thread " << std::this_thread::get_id() << std::endl;
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        });
    }
    return 0; // the destructor waits for all 8 tasks before returning
}
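
The submit() function above relies on std::packaged_task to carry a result, or an exception, back to the caller through a std::future. The following sketch shows that mechanism in isolation, with one plain thread and no pool; `sum_on_worker` and `worker_exception_reaches_caller` are hypothetical helper names used only for illustration.

```cpp
#include <future>
#include <numeric>
#include <stdexcept>
#include <thread>
#include <utility>
#include <vector>

// Sums `data` on a separate thread and returns the result through a future.
long long sum_on_worker(const std::vector<int>& data) {
    std::packaged_task<long long()> task([&data] {
        return std::accumulate(data.begin(), data.end(), 0LL);
    });
    std::future<long long> result = task.get_future();
    std::thread worker(std::move(task));  // the task runs on the new thread
    const long long total = result.get(); // blocks until the task has finished
    worker.join();
    return total;
}

// An exception thrown inside the task is stored in the shared state
// and rethrown by future::get() on the calling thread.
bool worker_exception_reaches_caller() {
    std::packaged_task<int()> task([]() -> int {
        throw std::runtime_error("task failed");
    });
    std::future<int> result = task.get_future();
    std::thread worker(std::move(task));
    worker.join();
    try {
        result.get();
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

The same applies to futures returned by a pool: a caller that wants a task's return value, or wants to learn that it failed, keeps the future and calls get() on it.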

3. SIMD Optimization in Practice

3.3 In practice: SIMD-optimized array arithmetic

#include <chrono>
#include <cstddef>
#include <immintrin.h> // AVX2 intrinsics; compile with -mavx2 on GCC/Clang
#include <iostream>
#include <vector>

// Scalar version: add element by element
void add_normal(const std::vector<int>& a, const std::vector<int>& b, std::vector<int>& c) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];
    }
}

// AVX2 version: process 8 ints at a time (256 bits = 8 x 32 bits)
void add_avx2(const std::vector<int>& a, const std::vector<int>& b, std::vector<int>& c) {
    const std::size_t n = a.size();
    const std::size_t batch_size = 8; // ints per AVX2 register
    std::size_t i = 0;
    // "i + batch_size <= n" avoids the unsigned underflow that
    // "n - batch_size + 1" would cause for very short arrays
    for (; i + batch_size <= n; i += batch_size) {
        // Load 8 ints into each AVX2 register (unaligned loads)
        __m256i vec_a = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a.data() + i));
        __m256i vec_b = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b.data() + i));
        // Vector add: 8 ints at once
        __m256i vec_c = _mm256_add_epi32(vec_a, vec_b);
        // Store the result back to memory
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(c.data() + i), vec_c);
    }
    // Handle the remaining elements (fewer than 8)
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

// Performance test
int main() {
    const std::size_t size = 10000000; // 10 million elements
    std::vector<int> a(size, 1), b(size, 2), c(size, 0);

    // Time the scalar version (at -O2/-O3 the compiler may auto-vectorize it)
    auto start = std::chrono::high_resolution_clock::now();
    add_normal(a, b, c);
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "Normal add time: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms" << std::endl;

    // Time the AVX2 version
    start = std::chrono::high_resolution_clock::now();
    add_avx2(a, b, c);
    end = std::chrono::high_resolution_clock::now();
    std::cout << "AVX2 add time: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms" << std::endl;

    return 0;
}

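4. Memory Pool Design and Implementation

4.2 Fixed-size memory pool implementation

#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <new>       // std::bad_alloc
#include <stdexcept> // std::invalid_argument

class FixedSizeMemoryPool {
public:
    // Constructor: initialize the pool
    FixedSizeMemoryPool(std::size_t block_size, std::size_t pool_size)
        : block_size_(align_up(block_size)), pool_size_(pool_size) {
        if (pool_size_ == 0) {
            throw std::invalid_argument("pool_size must be greater than 0");
        }
        // Total size = block size x number of blocks
        total_size_ = block_size_ * pool_size_;
        // Allocate the pool as one contiguous region on the heap
        pool_start_ = static_cast<char*>(std::malloc(total_size_));
        if (!pool_start_) {
            throw std::bad_alloc();
        }
        // Build the free list by chaining all blocks together
        free_list_ = pool_start_;
        char* curr = pool_start_;
        for (std::size_t i = 0; i < pool_size_ - 1; ++i) {
            // While a block is free, its first bytes store the next block's address
            *reinterpret_cast<char**>(curr) = curr + block_size_;
            curr += block_size_;
        }
        *reinterpret_cast<char**>(curr) = nullptr; // the last block has no successor
    }

    // The pool owns raw memory, so copying it would lead to a double free
    FixedSizeMemoryPool(const FixedSizeMemoryPool&) = delete;
    FixedSizeMemoryPool& operator=(const FixedSizeMemoryPool&) = delete;

    // Destructor: release the pool
    ~FixedSizeMemoryPool() {
        std::free(pool_start_);
    }

    // Allocate one block
    void* allocate() {
        if (!free_list_) {
            // Free list is empty: a real pool could grow here; this one returns null
            return nullptr;
        }
        // Pop the head of the free list
        void* ptr = free_list_;
        free_list_ = *reinterpret_cast<char**>(free_list_);
        return ptr;
    }

    // Release one block
    void deallocate(void* ptr) {
        if (!ptr) return;
        // Check that ptr lies inside the pool and on a block boundary
        char* p = static_cast<char*>(ptr);
        if (p < pool_start_ || p >= pool_start_ + total_size_ ||
            static_cast<std::size_t>(p - pool_start_) % block_size_ != 0) {
            throw std::invalid_argument("Invalid pointer to deallocate");
        }
        // Push the block onto the head of the free list
        *reinterpret_cast<char**>(p) = free_list_;
        free_list_ = p;
    }

private:
    // Round size up to a multiple of 8 bytes (adjust to 16/32 if needed);
    // a block must also be large enough to hold the free-list pointer
    static std::size_t align_up(std::size_t size) {
        const std::size_t align = 8;
        if (size < sizeof(char*)) {
            size = sizeof(char*);
        }
        return (size + align - 1) & ~(align - 1);
    }

private:
    char* pool_start_;       // start address of the pool
    char* free_list_;        // head of the free list
    std::size_t block_size_; // size of one block (after alignment)
    std::size_t pool_size_;  // number of blocks
    std::size_t total_size_; // total pool size in bytes
};

// Test
int main() {
    // Create a pool of 100 blocks, 32 bytes each
    FixedSizeMemoryPool pool(32, 100);
    // Allocate 5 blocks
    void* p1 = pool.allocate();
    void* p2 = pool.allocate();
    void* p3 = pool.allocate();
    void* p4 = pool.allocate();
    void* p5 = pool.allocate();
    std::cout << "Allocated pointers: " << p1 << " " << p2 << " " << p3 << " " << p4 << " " << p5 << std::endl;
    // Release p2 and p4
    pool.deallocate(p2);
    pool.deallocate(p4);
    // Allocate again: the freed blocks are reused, most recently freed first
    void* p6 = pool.allocate();
    void* p7 = pool.allocate();
    std::cout << "Reallocated pointers: " << p6 << " " << p7 << std::endl;
    return 0;
}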

4.3 Variable-size memory pool optimization

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <memory>
#include <stdexcept>
#include <vector>

// Fixed-size memory pool (reuses the earlier implementation)
class FixedSizeMemoryPool {
    // ... omitted here; identical to the implementation in 4.2 ...
};

class VariableSizeMemoryPool {
public:
    VariableSizeMemoryPool() {
        // One size class per block size: 8B, 16B, 32B, 64B, 128B,
        // each starting with a single pool of 100 blocks
        pool_sizes_ = {8, 16, 32, 64, 128};
        for (std::size_t size : pool_sizes_) {
            pools_.emplace_back();
            pools_.back().push_back(std::make_unique<FixedSizeMemoryPool>(size, 100));
        }
    }

    // Allocate memory
    void* allocate(std::size_t size) {
        if (size == 0) return nullptr;
        // Find the size class: the smallest block size >= size
        auto it = std::lower_bound(pool_sizes_.begin(), pool_sizes_.end(), size);
        if (it == pool_sizes_.end()) {
            // Larger than the biggest block size: fall back to malloc
            return std::malloc(size);
        }
        // Try every pool of this size class
        auto& chunks = pools_[it - pool_sizes_.begin()];
        for (auto& chunk : chunks) {
            if (void* ptr = chunk->allocate()) return ptr;
        }
        // All pools are full: grow by adding another 100-block pool.
        // Existing pools must stay alive, because outstanding pointers
        // still refer to their memory.
        chunks.push_back(std::make_unique<FixedSizeMemoryPool>(*it, 100));
        return chunks.back()->allocate();
    }

    // Release memory; size must be the size passed to allocate()
    void deallocate(void* ptr, std::size_t size) {
        if (!ptr || size == 0) return;
        // Find the size class
        auto it = std::lower_bound(pool_sizes_.begin(), pool_sizes_.end(), size);
        if (it == pool_sizes_.end()) {
            // Release memory that came from malloc
            std::free(ptr);
            return;
        }
        // Hand the block back to the pool that owns it. FixedSizeMemoryPool::deallocate
        // throws std::invalid_argument for a pointer outside its range.
        for (auto& chunk : pools_[it - pool_sizes_.begin()]) {
            try {
                chunk->deallocate(ptr);
                return;
            } catch (const std::invalid_argument&) {
                // not this pool: try the next one
            }
        }
        throw std::invalid_argument("Pointer does not belong to this pool");
    }

private:
    std::vector<std::size_t> pool_sizes_; // block size of each size class
    // For each size class, the fixed-size pools created so far
    std::vector<std::vector<std::unique_ptr<FixedSizeMemoryPool>>> pools_;
};

// Test
int main() {
    VariableSizeMemoryPool pool;
    void* p1 = pool.allocate(10);  // served by the 16B pool
    void* p2 = pool.allocate(40);  // served by the 64B pool
    void* p3 = pool.allocate(200); // larger than 128B: served by malloc
    std::cout << "Allocated: " << p1 << " " << p2 << " " << p3 << std::endl;
    pool.deallocate(p1, 10);
    pool.deallocate(p2, 40);
    pool.deallocate(p3, 200);
    return 0;
}

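4.4 Memory pool performance comparison

Allocation method	Time (ms)	Memory fragmentation rate
new/delete	~80	~25%
Fixed-size memory pool	~5	~5%
Variable-size memory pool	~8	~8%

Conclusion: in this comparison, allocating from a memory pool is roughly 10x to 16x faster than new/delete, and the fragmentation rate drops markedly.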
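
The workload behind the figures above is not shown, so they cannot be reproduced directly. As a rough guide, a comparison of this kind might be timed as in the sketch below. `BenchPool`, `time_alloc_free_ms`, and `compare_heap_and_pool` are hypothetical names, the pool is a minimal stand-in rather than the implementation from 4.2, and the resulting numbers depend heavily on the machine, the allocator, and the compiler flags.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <new>
#include <vector>

// Hypothetical minimal fixed-size pool, used only for this measurement.
class BenchPool {
public:
    BenchPool(std::size_t block_size, std::size_t count)
        : block_size_(round_up(block_size)), memory_(block_size_ * count) {
        // Thread every block onto the free list
        for (std::size_t i = 0; i < count; ++i) {
            deallocate(memory_.data() + i * block_size_);
        }
    }
    BenchPool(const BenchPool&) = delete;
    BenchPool& operator=(const BenchPool&) = delete;

    void* allocate() {
        if (!free_) return nullptr;
        void* block = free_;
        std::memcpy(&free_, block, sizeof free_); // next pointer lives in the block
        return block;
    }
    void deallocate(void* block) {
        std::memcpy(block, &free_, sizeof free_);
        free_ = block;
    }

private:
    // Blocks must hold a pointer and stay pointer-aligned
    static std::size_t round_up(std::size_t size) {
        const std::size_t unit = sizeof(void*);
        if (size < unit) size = unit;
        return (size + unit - 1) / unit * unit;
    }
    std::size_t block_size_;
    std::vector<unsigned char> memory_;
    void* free_ = nullptr;
};

// Allocates `batch` blocks and frees them again, `rounds` times over.
template <typename Alloc, typename Release>
double time_alloc_free_ms(std::size_t rounds, std::size_t batch,
                          Alloc alloc, Release release) {
    std::vector<void*> live(batch);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t r = 0; r < rounds; ++r) {
        for (auto& p : live) p = alloc();
        for (auto& p : live) release(p);
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

struct BenchResult {
    double heap_ms;  // time using operator new / operator delete
    double pool_ms;  // time using BenchPool
};

BenchResult compare_heap_and_pool(std::size_t rounds, std::size_t batch,
                                  std::size_t block_size) {
    BenchPool pool(block_size, batch);
    BenchResult result{};
    result.heap_ms = time_alloc_free_ms(
        rounds, batch,
        [block_size] { return ::operator new(block_size); },
        [](void* p) { ::operator delete(p); });
    result.pool_ms = time_alloc_free_ms(
        rounds, batch,
        [&pool] { return pool.allocate(); },
        [&pool](void* p) { pool.deallocate(p); });
    return result;
}
```

A call such as `compare_heap_and_pool(1000, 10000, 32)` gives one pair of timings. This sketch measures time only; estimating fragmentation needs a separate tool or allocator statistics.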

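5. High-Performance Concurrent Data Structures

5.2 In practice: a thread-safe queue

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

template<typename T>
class LockFreeQueue {
private:
    // Queue node
    struct Node {
        T data;
        std::atomic<Node*> next;
        Node* retired_next; // link used only on the retired list
        explicit Node(const T& data) : data(data), next(nullptr), retired_next(nullptr) {}
    };
    std::atomic<Node*> head_;    // current sentinel node
    std::atomic<Node*> tail_;    // last node, or one step behind it
    std::atomic<Node*> retired_; // old sentinels, freed in the destructor

    // Deleting a node while other threads may still hold a pointer to it is a
    // use-after-free, so dequeued sentinels are parked here instead. This trades
    // memory for safety; production code would use hazard pointers or
    // epoch-based reclamation.
    void retire(Node* node) {
        Node* old = retired_.load(std::memory_order_relaxed);
        do {
            node->retired_next = old;
        } while (!retired_.compare_exchange_weak(
            old, node, std::memory_order_release, std::memory_order_relaxed));
    }

public:
    LockFreeQueue() : retired_(nullptr) {
        // Sentinel (dummy) node: simplifies enqueue/dequeue logic
        Node* sentinel = new Node(T());
        head_.store(sentinel);
        tail_.store(sentinel);
    }

    ~LockFreeQueue() {
        // No other thread may use the queue here: free live and retired nodes
        Node* node = head_.load();
        while (node) {
            Node* next = node->next.load();
            delete node;
            node = next;
        }
        node = retired_.load();
        while (node) {
            Node* next = node->retired_next;
            delete node;
            node = next;
        }
    }

    // Enqueue
    void enqueue(const T& data) {
        Node* new_node = new Node(data);
        while (true) {
            Node* old_tail = tail_.load(std::memory_order_acquire);
            Node* next = old_tail->next.load(std::memory_order_acquire);
            if (next) {
                // tail_ is lagging: help advance it, then retry
                tail_.compare_exchange_weak(old_tail, next);
                continue;
            }
            // compare_exchange needs an lvalue for the expected value
            Node* expected = nullptr;
            if (old_tail->next.compare_exchange_weak(
                    expected, new_node, std::memory_order_release, std::memory_order_relaxed)) {
                // Advance tail_; failure is fine, another thread has already helped
                tail_.compare_exchange_strong(old_tail, new_node);
                return;
            }
        }
    }

    // Dequeue (returns whether it succeeded; the value is written to data)
    bool dequeue(T& data) {
        while (true) {
            Node* old_head = head_.load(std::memory_order_acquire);
            Node* new_head = old_head->next.load(std::memory_order_acquire);
            if (!new_head) {
                return false; // queue is empty
            }
            // Move head_ to the next node, which becomes the new sentinel
            if (head_.compare_exchange_weak(
                    old_head, new_head, std::memory_order_acq_rel, std::memory_order_relaxed)) {
                data = new_head->data; // only the thread that won the CAS reads this value
                retire(old_head);      // park the old sentinel instead of deleting it
                return true;
            }
        }
    }

    // Check whether the queue is empty
    bool empty() const {
        return head_.load()->next.load() == nullptr;
    }
};

// Test
int main() {
    LockFreeQueue<int> queue;
    const int task_num = 100000;
    std::atomic<int> count(0);
    // 4 producer threads enqueue
    std::vector<std::thread> producers;
    for (int i = 0; i < 4; ++i) {
        producers.emplace_back([&]() {
            for (int j = 0; j < task_num; ++j) {
                queue.enqueue(j);
            }
        });
    }
    // 4 consumer threads dequeue
    std::vector<std::thread> consumers;
    for (int i = 0; i < 4; ++i) {
        consumers.emplace_back([&]() {
            int data;
            while (count < task_num * 4) {
                if (queue.dequeue(data)) {
                    count++;
                }
            }
        });
    }
    for (auto& t : producers) t.join();
    for (auto& t : consumers) t.join();
    std::cout << "Total dequeued: " << count << " (expected: " << task_num * 4 << ")" << std::endl;
    return 0;
}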
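
When judging a lock-free queue, it helps to have a lock-based queue as a baseline. The sketch below is one such baseline, built from a mutex and a condition variable; `BlockingQueue` and `run_producers_consumers` are hypothetical names introduced for illustration. Unlike the spinning consumers in the test above, its consumers sleep while the queue is empty.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical lock-based queue: a mutex guards the data, a condition
// variable lets consumers sleep until an item arrives or the queue closes.
template <typename T>
class BlockingQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            queue_.push(std::move(value));
        }
        cv_.notify_one();
    }

    // Blocks until an item is available; returns false only once the queue
    // has been closed and fully drained.
    bool wait_pop(T& out) {
        std::unique_lock<std::mutex> lock(mtx_);
        cv_.wait(lock, [this] { return closed_ || !queue_.empty(); });
        if (queue_.empty()) {
            return false;
        }
        out = std::move(queue_.front());
        queue_.pop();
        return true;
    }

    // Signals that no more items will be pushed.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            closed_ = true;
        }
        cv_.notify_all();
    }

private:
    std::mutex mtx_;
    std::condition_variable cv_;
    std::queue<T> queue_;
    bool closed_ = false;
};

// Runs `producers` threads pushing `per_producer` items each and `consumers`
// threads draining the queue; returns how many items were consumed.
long long run_producers_consumers(int producers, int consumers, int per_producer) {
    BlockingQueue<int> queue;
    std::atomic<long long> consumed{0};

    std::vector<std::thread> consumer_threads;
    for (int i = 0; i < consumers; ++i) {
        consumer_threads.emplace_back([&queue, &consumed] {
            int value = 0;
            while (queue.wait_pop(value)) {
                consumed.fetch_add(1);
            }
        });
    }
    std::vector<std::thread> producer_threads;
    for (int i = 0; i < producers; ++i) {
        producer_threads.emplace_back([&queue, per_producer] {
            for (int j = 0; j < per_producer; ++j) {
                queue.push(j);
            }
        });
    }
    for (auto& t : producer_threads) t.join();
    queue.close();  // all items are queued; let consumers finish and exit
    for (auto& t : consumer_threads) t.join();
    return consumed.load();
}
```

Running the same producer and consumer counts through both queues, with timing around each run, gives a like-for-like comparison; under low contention a lock-based queue is often competitive.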

5.3 In practice: a lock-free hash table

#include <atomic>
#include <cstddef>
#include <functional>
#include <iostream>
#include <utility>
#include <vector>

template<typename K, typename V, std::size_t Capacity = 1024>
class LockFreeHashTable {
private:
    // Hash table node
    struct Node {
        K key;
        V value;
        std::atomic<Node*> next;
        Node(K key, V value) : key(key), value(value), next(nullptr) {}
    };
    std::vector<std::atomic<Node*>> table_; // bucket array
    std::hash<K> hash_;                     // hash function

public:
    // std::atomic is neither copyable nor movable, so the vector is sized at
    // construction; calling resize() on it would not compile
    LockFreeHashTable() : table_(Capacity) {
        // Every bucket starts empty
        for (auto& bucket : table_) {
            bucket.store(nullptr);
        }
    }

    ~LockFreeHashTable() {
        // Free all nodes
        for (std::size_t i = 0; i < Capacity; ++i) {
            Node* node = table_[i].load();
            while (node) {
                Node* next = node->next.load();
                delete node;
                node = next;
            }
        }
    }

    // Insert a key-value pair. An existing key is not updated in place: the new
    // node is pushed at the bucket head and shadows older entries in find().
    bool insert(const K& key, const V& value) {
        std::size_t idx = hash_(key) % Capacity;
        Node* new_node = new Node(key, value);
        Node* old_head = nullptr;
        do {
            old_head = table_[idx].load();
            new_node->next.store(old_head);
            // CAS the bucket head from old_head to new_node
        } while (!table_[idx].compare_exchange_weak(
            old_head, new_node, std::memory_order_release, std::memory_order_relaxed));
        return true;
    }

    // Look up a key-value pair
    bool find(const K& key, V& value) const {
        std::size_t idx = hash_(key) % Capacity;
        Node* node = table_[idx].load();
        while (node) {
            if (node->key == key) {
                value = node->value;
                return true;
            }
            node = node->next.load();
        }
        return false;
    }

    // Erase a key-value pair
    bool erase(const K& key) {
        std::size_t idx = hash_(key) % Capacity;
        Node* prev = nullptr;
        // ... deletion logic omitted in the source ...
        return true;
    }
};

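C++ High-Performance Computing in Practice: Multithreading, SIMD, Memory Pools, and Concurrent Data Structures

1. Core Concepts and Optimization Goals of High-Performance Computing
1.1 Core performance bottlenecks and optimization directions
1.2 Performance metrics and tools
(1) Core performance metrics
(2) Common performance analysis tools
2. C++ Multithreading in Practice (C++11/14/17)
2.1 Thread basics: core C++11 components
(1) Thread management (std::thread)
(2) Synchronization mechanisms
2.2 C++14/17 threading enhancements
(1) C++14 features
(2) C++17 features
2.3 In practice: a high-performance thread pool
2.4 Multithreading pitfalls
3. SIMD Optimization in Practice
3.1 SIMD core principles and instruction-set selection
(1) Core principles
(2) Instruction-set selection
3.2 Using compiler intrinsics
(1) Header files and compiler options
3.3 In practice: SIMD-optimized array arithmetic
3.4 SIMD optimization notes
4. Memory Pool Design and Implementation
4.1 Core design ideas of memory pools
4.2 Fixed-size memory pool implementation
4.3 Variable-size memory pool optimization
4.4 Memory pool performance comparison
5. High-Performance Concurrent Data Structures
5.1 Design principles for concurrent data structures
5.2 In practice: a thread-safe queue
5.3 In practice: a lock-free hash table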
