C++ 进阶：哈希表原理与实现

概念介绍

什么是哈希？

哈希（Hash），也称为散列：是一种将任意长度的输入数据（通常称为'键'或'关键字'）通过特定的数学算法（称为'哈希函数'）映射为固定长度输出的技术。这个输出值被称为'哈希值'、'散列值'或'哈希码'。哈希的核心目的是快速实现数据的查找、存储和比较，广泛应用于哈希表、密码学、数据校验等领域。

核心术语

一、哈希函数

哈希函数（Hash Function）：是哈希表（Hash Table）的核心组成部分，它的作用是将任意长度的输入数据（称为'键'或'关键字'）映射到一个固定长度的输出值（称为'哈希值'或'散列值'）。这个输出值通常用于确定该键在哈希表中的存储位置。

1. 哈希函数的核心特点是什么？

确定性：同一输入必须始终映射到同一个哈希值。例如：输入字符串"apple"每次通过哈希函数计算，结果都应相同。
压缩性：无论输入数据的长度如何，输出的哈希值长度是固定的。例如：常用的 MD5 哈希函数会将任意输入映射为 128 位的哈希值，而哈希表中常用的哈希函数可能将键映射为 0~n-1（n 为哈希表长度）的整数。
高效性：计算哈希值的过程应快速且易于实现，时间复杂度通常为 O(1) 或 O(k)（k 为输入数据的长度），避免成为哈希表操作的性能瓶颈。

2. 哈希函数的设计目标是什么？

均匀分布：理想情况下，哈希函数应将不同的键均匀地映射到哈希表的各个位置，避免大量键集中在少数位置（称为'哈希冲突'）。均匀分布能保证哈希表的操作（插入、查找、删除）效率接近 O(1)。
减少冲突：由于输入空间（可能的键）远大于输出空间（哈希表长度），哈希冲突无法完全避免，但好的哈希函数能最大限度降低冲突概率。

3. 常见的哈希函数有哪些？

直接定址法

直接定址法：通过直接利用关键字本身或关键字的某个线性函数来确定哈希地址，从而实现关键字到存储位置的映射。直接定址法是一种简单直观的哈希函数构造方法。

核心公式和基本原理： H(key) = key 或 H(key) = a × key + b
- key：是待映射的关键字。（需要存储的数据的标识）
- a 和 b：是常数。（a ≠ 0，用于对关键字进行线性变换）
- H(key)：是计算得到的哈希地址。（即：数据在哈希表中的存储位置）
优缺点与适用场景：
- 优点：简单高效；无冲突（只要关键字不重复，计算出的哈希地址一定唯一）。
- 缺点：空间浪费大（如果关键字的范围很大，哈希表需要开辟对应范围的空间）；关键字需为整数。
- 场景：关键字的范围较小且连续（或分布集中）。

除法散列法

除法散列法：核心逻辑是用关键字对一个整数取余，把大范围的关键字映射到哈希表的有效下标区间，以此确定存储位置。除法散列法是哈希函数构造方法里的经典手段。

核心公式与基本原理： H(key) = key % m
- key：是待映射的关键字。
- m：是哈希表的大小。（通常是数组长度，决定了哈希地址的范围）
- H(key)：是计算出的哈希地址。
关键特性与优缺点：
- 优点：实现简单；适用性广；控制范围灵活。
- 缺点：冲突概率与 m 强相关；依赖 m 的选取；不适用于动态扩容。
优化：m 的选取原则：
1. 选质数：优先选质数作为 m，能大幅降低冲突概率。
2. 避免 m=2^k/10^X：若 M = 2^X，key % M 等价于保留 key 的最后 X 位二进制数，易导致冲突。
3. 结合关键字分布调整：若已知关键字的分布，选 m 时尽量让余数覆盖更全。

乘法散列法

乘法散列法：先将关键字 key 与一个在 (0, 1) 之间的常数 A 相乘，得到的小数部分乘以哈希表的大小 m，最后向下取整，就得到了哈希值。

核心公式与基本原理： h(key) = ⌊m × (key × A mod 1)⌋
- key：是要进行哈希计算的关键字。
- A：是一个常数。（取值范围在 (0, 1) 之间，常见的取值如 (√5-1)/2 ≈ 0.618）
- m：是哈希表的大小。
关键特性与优缺点：
- 优点：哈希值分布均匀；对哈希表大小要求灵活；计算效率较高。
- 缺点：常数 A 的选择有难度；实现相对复杂。

全域散列法

全域散列法：是一种随机化哈希技术，通过从精心设计的哈希函数族中随机选择哈希函数，确保即使对于最坏情况下的输入也能获得良好的平均性能。

核心原理：对于任意两个不同的关键字 x 和 y，哈希函数族 H 满足：P(h(x) = h(y)) ≤ 1/m。这样就能保证在平均情况下，哈希冲突的数量是比较少的。

二、负载因子

1. 什么是负载因子？

负载因子（Load Factor）：是哈希表设计与性能分析中的核心概念，用于衡量哈希表的'填充程度'，直接影响哈希冲突概率和内存利用率。

计算公式：λ = n / m
- n：是哈希表中当前存储的有效元素数量。
- m：是哈希表的总容量（即桶数组的长度）。

2. 负载因子对哈希表的性能有什么影响？

负载因子越小 → 哈希冲突概率越低，但内存浪费严重。
负载因子越大 → 哈希冲突概率越高，操作时间复杂度会退化到 O(n)，但内存利用率高。

3. 负载因子超过阈值时会发生什么？

当负载因子超过阈值时，哈希表会触发扩容（Resize）：

新建更大的桶数组（新容量通常是原容量的 2 倍）。
重新映射所有元素。
释放旧内存。

三、哈希冲突

哈希冲突（Hash Collision）：指不同的关键字通过哈希函数计算后，得到相同的哈希地址的情况。

定义：对于两个不同的关键字 key1 ≠ key2，若它们的哈希值满足 h(key1) = h(key2)，则称这两个关键字发生了哈希冲突。
本质：哈希函数是'多对一'的映射，根据鸽巢原理，冲突必然存在。

四、冲突处理

方法一：开放定址法

开放定址法（Open Addressing）：所有元素都存储在哈希表数组本身中，通过探测序列寻找可用的空槽位。

线性探测

探测公式：h_i(key) = (h(key) + i) % m
缺点：容易产生'聚集'现象。

二次探测

探测公式：h_i(key) = (h(key) + c1 * i + c2 * i * i) % m
优点：相较于线性探测，能减少聚集现象。

双重散列

探测公式：h_i(key) = (h1(key) + i * h2(key)) % m
优点：探测序列比较均匀，能有效减少聚集。

方法二：链地址法

链地址法（Separate Chaining）：用数组 + 链表（或其他动态结构）的组合，让冲突元素'链'在一起。

原理：哈希表底层是一个数组，每个数组元素对应一个链表。插入时追加到链表末尾，查找时遍历链表。
优点：冲突处理简单；空间灵活；无聚集问题。
缺点：遍历开销；额外空间。

基本操作

怎么解决键 key 不能取模的问题？

通过仿函数（哈希函数对象），将 key 转换成一个可用于取模的整数。若 key 本身能较方便地转换为整数，直接使用默认仿函数；否则自行实现。

/*------------------任务：定义哈希表函数的'结构体模板'------------------*/
template<class K>
struct HashFunc {
    // 重载 () 运算符 ---> 作用：将 K 类型转化为 size_t 类型
    size_t operator()(const K& key) {
        return (size_t)key; // 默认为直接转换，适用于 int、long 等整数类型
    }
};

/*------------------任务：定义哈希函数的'模板特化'------------------*/
template<>
struct HashFunc<string> {
    // 实现：'() 运算符的重载' ---> 作用：将 string 类型的变量转化为哈希值
    size_t operator()(const string& s) {
        size_t hash = 0;
        // 使用范围 for 循环遍历字符串并用 BKDR 算法计算其哈希值
        for (auto it : s) {
            hash += it;
            hash *= 131; // BKDR 哈希算法认为：131 可有效减少冲突
        }
        return hash;
    }
};

代码实现

头文件

哈希表：HashTable.h

#pragma once
#include <iostream>
#include <vector>
using namespace std;

/*------------------任务：定义哈希表函数的'通用类模板'------------------*/
template<class K>
struct HashFunc {
    size_t operator()(const K& key) {
        return (size_t)key;
    }
};

/*------------------任务：定义哈希函数的'模板特化'------------------*/
template<>
struct HashFunc<string> {
    size_t operator()(const string& s) {
        size_t hash = 0;
        for (auto it : s) {
            hash += it;
            hash *= 131;
        }
        return hash;
    }
};

/*------------------任务：实现'获取下一个 >=n 的质数的函数'---> '用于哈希表扩容'------------------*/
inline unsigned long _stl_next_prime(unsigned long n) {
    static const int __stl_num_primes = 28;
    static const unsigned long _stl_prime_list[__stl_num_primes] = {
        53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593, 
        49157, 98317, 196613, 393241, 786433, 1572869, 3145739, 
        6291469, 12582917, 25165843, 50331653, 100663319, 201326611, 
        402653189, 805306457, 1610612741, 3221225473, 4294967291
    };
    const unsigned long* first = _stl_prime_list;
    const unsigned long* last = _stl_prime_list + __stl_num_primes;
    const unsigned long* pos = lower_bound(first, last, n);
    return pos == last ? *(last - 1) : *pos;
}

开放定址法：open_address.h

#pragma once
#include "HashTable.h"
namespace open_address {
    enum State { EXIST, EMPTY, DELETE };

    template<class K, class V>
    struct HashData {
        pair<K, V> _kv;
        State _state = EMPTY;
    };

    template<class K, class V, class Hash = HashFunc<K>>
    class HashTable {
    private:
        vector<HashData<K, V>> _tables;
        size_t _n;
    public:
        HashTable() : _tables(_stl_next_prime(0)), _n(0) {}

        HashData<K, V>* Find(const K& key) {
            Hash hash;
            size_t hash_0 = hash(key) % _tables.size();
            size_t hash_i = hash_0;
            size_t i = 1;
            while (_tables[hash_i]._state != EMPTY) {
                if (_tables[hash_i]._state == EXIST && _tables[hash_i]._kv.first == key) {
                    return &_tables[hash_i];
                }
                hash_i = (hash_0 + i) % _tables.size();
                ++i;
            }
            return nullptr;
        }

        bool Erase(const K& key) {
            HashData<K, V>* ret = Find(key);
            if (ret) {
                ret->_state = DELETE;
                --_n;
                return true;
            }
            return false;
        }

        bool Insert(const pair<K, V>& kv) {
            if (Find(kv.first)) return false;
            if (_n * 10 / _tables.size() >= 7) {
                HashTable<K, V, Hash> newHt;
                newHt._tables.resize(_stl_next_prime(_tables.size() + 1));
                for (auto& htData : _tables) {
                    if (htData._state == EXIST) {
                        newHt.Insert(htData._kv);
                    }
                }
                _tables.swap(newHt._tables);
            }
            Hash hashFunc;
            size_t hash_0 = hashFunc(kv.first) % _tables.size();
            size_t hash_i = hash_0;
            size_t i = 1;
            while (_tables[hash_i]._state == EXIST) {
                hash_i = (hash_0 + i) % _tables.size();
                ++i;
            }
            _tables[hash_i]._kv = kv;
            _tables[hash_i]._state = EXIST;
            ++_n;
            return true;
        }
    };
}

链地址法：hash_bucket.h

#pragma once
#include "HashTable.h"
namespace hash_bucket {
    template<class K, class V>
    struct HashNode {
        pair<K, V> _kv;
        HashNode<K, V>* _next;
        HashNode(const pair<K, V>& kv) : _kv(kv), _next(nullptr) {}
    };

    template<class K, class V, class Hash = HashFunc<K>>
    class HashTable {
    private:
        vector<HashNode<K, V>*> _tables;
        size_t _n;
        typedef HashNode<K, V> Node;
    public:
        HashTable() : _tables(_stl_next_prime(0)), _n(0) {}

        ~HashTable() {
            for (size_t i = 0; i < _tables.size(); ++i) {
                Node* current = _tables[i];
                while (current) {
                    Node* next = current->_next;
                    delete current;
                    current = next;
                }
                _tables[i] = nullptr;
            }
        }

        Node* Find(const K& key) {
            Hash hashFunc;
            size_t hash_i = hashFunc(key) % _tables.size();
            Node* current = _tables[hash_i];
            while (current) {
                if (current->_kv.first == key) return current;
                current = current->_next;
            }
            return nullptr;
        }

        bool Erase(const K& key) {
            Hash hashFunc;
            size_t hash_i = hashFunc(key) % _tables.size();
            Node* curr = _tables[hash_i];
            Node* prev = nullptr;
            while (curr) {
                if (curr->_kv.first == key) {
                    if (prev == nullptr) {
                        _tables[hash_i] = curr->_next;
                    } else {
                        prev->_next = curr->_next;
                    }
                    delete curr;
                    --_n;
                    return true;
                }
                prev = curr;
                curr = curr->_next;
            }
            return false;
        }

        bool Insert(const pair<K, V>& kv) {
            if (Find(kv.first)) return false;
            if (_n == _tables.size()) {
                vector<Node*> newVector(_tables.size() * 2);
                for (size_t i = 0; i < _tables.size(); i++) {
                    Node* current = _tables[i];
                    while (current) {
                        Node* next = current->_next;
                        Hash hashFunc;
                        size_t hash_i = hashFunc(current->_kv.first) % newVector.size();
                        current->_next = newVector[hash_i];
                        newVector[hash_i] = current;
                        current = next;
                    }
                    _tables[i] = nullptr;
                }
                _tables.swap(newVector);
            }
            Node* newNode = new Node(kv);
            Hash hashFunc;
            size_t hash_i = hashFunc(kv.first) % _tables.size();
            newNode->_next = _tables[hash_i];
            _tables[hash_i] = newNode;
            ++_n;
            return true;
        }
    };
}

测试文件：Test.cpp

#include "HashTable.h"
#include "open_address.h"
#include "hash_bucket.h"
#include <string>
#include <iostream>
using namespace std;

void printTestResult(const string& testName, bool result) {
    cout << (result ? "[PASS] " : "[FAIL] ") << testName << endl;
}

void test_open_address() {
    cout << "\n===== 测试开放寻址法哈希表 =====" << endl;
    open_address::HashTable<int, string> ht;
    cout << "创建哈希表成功" << endl;

    bool insert1 = ht.Insert({1, "A"});
    printTestResult("插入键 1 值 A", insert1);
    bool insert2 = ht.Insert({1, "B"});
    printTestResult("插入重复键 1 值 B（期望失败）", !insert2);
    bool insert3 = ht.Insert({2, "C"});
    printTestResult("插入键 2 值 C", insert3);

    auto node1 = ht.Find(1);
    printTestResult("查找键 1", node1 != nullptr && node1->_kv.second == "A");
    auto node2 = ht.Find(2);
    printTestResult("查找键 2", node2 != nullptr && node2->_kv.second == "C");
    auto node3 = ht.Find(3);
    printTestResult("查找不存在的键 3", node3 == nullptr);

    bool erase1 = ht.Erase(1);
    printTestResult("删除键 1", erase1);
    bool erase2 = ht.Erase(1);
    printTestResult("重复删除键 1（期望失败）", !erase2);

    cout << "\n--- 扩容测试 ---" << endl;
    for (int i = 3; i < 100; ++i) {
        ht.Insert({i, to_string(i)});
    }
    auto node99 = ht.Find(99);
    printTestResult("查找扩容后的键 99", node99 != nullptr && node99->_kv.second == "99");
}

void test_hash_bucket() {
    cout << "\n===== 测试链地址法哈希表 =====" << endl;
    hash_bucket::HashTable<string, int> ht;
    cout << "创建哈希表成功" << endl;

    bool insert1 = ht.Insert({"apple", 5});
    printTestResult("插入键 apple 值 5", insert1);
    bool insert2 = ht.Insert({"apple", 10});
    printTestResult("插入重复键 apple 值 10（期望失败）", !insert2);
    bool insert3 = ht.Insert({"banana", 8});
    printTestResult("插入键 banana 值 8", insert3);

    auto node1 = ht.Find("apple");
    printTestResult("查找键 apple", node1 != nullptr && node1->_kv.second == 5);
    auto node2 = ht.Find("banana");
    printTestResult("查找键 banana", node2 != nullptr && node2->_kv.second == 8);
    auto node3 = ht.Find("orange");
    printTestResult("查找不存在的 orange", node3 == nullptr);

    bool erase1 = ht.Erase("apple");
    printTestResult("删除键 apple", erase1);
    bool erase2 = ht.Erase("apple");
    printTestResult("重复删除键 apple（期望失败）", !erase2);

    cout << "\n--- 扩容测试 ---" << endl;
    for (int i = 0; i < 100; ++i) {
        string key = "key_" + to_string(i);
        ht.Insert({key, i});
    }
    auto node = ht.Find("key_99");
    printTestResult("查找扩容后的键 key_99", node != nullptr && node->_kv.second == 99);
}

struct Date {
    int _year;
    int _month;
    int _day;
    Date(int year = 1, int month = 1, int day = 1) : _year(year), _month(month), _day(day) {}
    bool operator==(const Date& d) const {
        return _year == d._year && _month == d._month && _day == d._day;
    }
};

struct DateHashFunc {
    size_t operator()(const Date& d) {
        size_t hash = 0;
        hash += d._year; hash *= 131;
        hash += d._month; hash *= 131;
        hash += d._day; hash *= 131;
        return hash;
    }
};

void test01() {
    hash_bucket::HashTable<string, string> ht1;
    const char* a1[] = {"abcd", "sort", "insert"};
    for (auto& it : a1) {
        ht1.Insert({it, it});
    }

    hash_bucket::HashTable<int, int> ht2;
    const int a2[] = {-19, -30, 5, 36, 13, 20, 21, 12};
    for (auto& it : a2) {
        ht2.Insert({it, it});
    }

    hash_bucket::HashTable<Date, int, DateHashFunc> ht3;
    ht3.Insert({{2025, 6, 29}, 1});
    ht3.Insert({{2025, 6, 30}, 1});
}

int main() {
    test_open_address();
    test_hash_bucket();
    test01();
    return 0;
}

运行结果

[图：测试结果输出示例] [图：程序运行 GIF 动效]