C++ 轻量级搜索引擎实战：构造正/倒排索引 | 极客日志

C++AI算法

C++ 轻量级搜索引擎实战：构造正/倒排索引

轻量级搜索引擎核心模块基于 C++ 实现，涵盖 Jieba 分词工具集成、正排与倒排索引的数据结构设计及关键函数实现。内容包括文档内容的解析存储、关键字映射逻辑、权重计算以及单例模式在索引管理中的应用。通过构建正倒排索引关系，实现了从文档 ID 检索内容及从关键词检索文档 ID 的基础功能，为搜索引擎底层原理提供工程实践参考。

宁静发布于 2026/3/24更新于 2026/6/1525 浏览

【一】Jieba 分词工具

在使用倒排索引时需要用到关键词，这些关键词由每个 HTML 文档的标题和内容生成，因此涉及分词处理。本项目使用 cppjieba 分词工具完成该功能。

若需使用 cppjieba 分词工具，可通过 Git 命令上传至本地服务器，并建立软链接指向头文件和词库目录。

【二】正/倒排索引结构设计

//正排结构
typedef struct Forward_index {
    //文档内容
    std::string title;
    std::string source;
    std::string chain;
    //对应正排 ID
    uint64_t doc_id;
} Forwardindex;

//倒排结构
typedef struct Inverted_index {
    //对应正排 ID
    int doc_id;
    //涉及到的关键字
    std::string word;
    //权重
    int weight;
} Invertedindex;

正排：根据 ID 映射对应的文档（标题、内容、URL），ID 利用 vector 下标存储。

倒排：根据关键字映射对应的 ID，利用 unordered_map 的快速搜索特性。

//正排存储
std::vector<Forwardindex> Forward;
typedef std::vector<Invertedindex> Stock_Inverted;

//倒排存储
std::unordered_map<std::string,Stock_Inverted> Inverted;

【三】关键函数设计

（1）由文档 ID 返回文档内容

含义：正排外部实现，根据 ID 返回 vector 中对应的具体内容。

//根据 ID 返回文档内容
Forwardindex* GetForward_index(const long long& id) {
    if(id >= Forward.size()) {
        std::cerr << "GetForward_index is errno" << std::endl;
        return nullptr;
    }
    return &Forward[id];
}

（2）由关键字返回倒排拉链

相关免费在线工具

加密/解密文本
使用加密算法（如AES、TripleDES、Rabbit或RC4）加密和解密文本明文。在线工具，加密/解密文本在线工具，online
RSA密钥对生成器
生成新的随机RSA私钥和公钥pem证书。在线工具，RSA密钥对生成器在线工具，online
Mermaid 预览与可视化编辑
基于 Mermaid.js 实时预览流程图、时序图等图表，支持源码编辑与即时渲染。在线工具，Mermaid 预览与可视化编辑在线工具，online
随机西班牙地址生成器
随机生成西班牙地址（支持马德里、加泰罗尼亚、安达卢西亚、瓦伦西亚筛选），支持数量快捷选择、显示全部与下载。在线工具，随机西班牙地址生成器在线工具，online
Gemini 图片去水印
基于开源反向 Alpha 混合算法去除 Gemini/Nano Banana 图片水印，支持批量处理与下载。在线工具，Gemini 图片去水印在线工具，online
Base64 字符串编码/解码
将字符串编码和解码为其 Base64 格式表示形式即可。在线工具，Base64 字符串编码/解码在线工具，online

//根据关键字返回倒排拉链
Stock_Inverted* GetInverted_index(const std::string word) {
    auto it = Inverted.find(word);
    if(it == Inverted.end()) {
        std::cerr << "GetInverted_index is errno" << std::endl;
        return nullptr;
    }
    return &it->second;
}

//正排存储
std::vector<Forwardindex> Forward;
typedef std::vector<Invertedindex> Stock_Inverted;

//倒排存储
std::unordered_map<std::string,Stock_Inverted> Inverted;

//建立索引
bool BuildIndex(const std::string &input) {
    std::ifstream in(input, std::ios::in | std::ios::binary);
    if(!in.is_open()) {
        std::cerr << "BuildIndex is errno" << std::endl;
        return false;
    }
    
    std::string line;
    while(std::getline(in, line)) {
        Forwardindex *doc = Build_Forward_Index(line);
        printf("正在建立索引:%lld\ntitle:%s\nchain:%s\n", doc->doc_id, doc->title.c_str(), doc->chain.c_str());
        if(doc == nullptr) continue;
        Build_Inverted_Index(*doc);
    }
    return true;
}

//建立正排索引
Forwardindex* Build_Forward_Index(const std::string& line) {
    Forwardindex* index = new Forwardindex();
    
    //截取 title
    size_t set_pos1 = line.find('\3');
    if(set_pos1 == std::string::npos) {
        delete index;
        return nullptr;
    }
    index->title = line.substr(0, set_pos1);
    if (index->title.empty()) index->title = "空";
    
    //截取 source
    size_t set_pos2 = line.find('\3', set_pos1+1);
    if(set_pos2 == std::string::npos) {
        delete index;
        return nullptr;
    }
    index->source = line.substr(set_pos1+1, set_pos2 - (set_pos1+1));
    if (index->source.empty()) index->source = "空";
    
    //截取 URL(chain)
    index->chain = line.substr(set_pos2+1);
    if (index->chain.empty()) index->chain = "空";
    
    index->doc_id = Forward.size();
    Forward.push_back(*index);
    return index;
}

//建立倒排索引
bool Build_Inverted_Index(const Forwardindex& doc) {
    JiebaUtil jieba;
    std::vector<std::string> S;
    
    //对标题统计
    S = jieba.Tokenize(doc.title);
    struct calculate {
        int title_size = 0;
        int source_size = 0;
    };
    std::unordered_map<std::string, calculate> V;
    
    for(auto e : S) {
        (V[e].title_size)++;
    }
    
    //统计内容出现次数
    S.clear();
    S = jieba.Tokenize(doc.source);
    for(auto e : S) {
        (V[e].source_size)++;
    }
    
    //遍历 V 的结果写入到 Inverted
    for(auto it : V) {
        Invertedindex index_t;
        index_t.word = it.first;
        index_t.doc_id = doc.doc_id;
        index_t.weight = ((it.second.title_size)*2 + (it.second.source_size)*1);
        Inverted[it.first].push_back(std::move(index_t));
    }
    return true;
}

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
class Index {
public:
    typedef std::vector<Invertedindex> Stock_Inverted;
    
    //禁止拷贝构造和赋值重载
    Index(const Index&) = delete;
    Index& operator=(const Index&) = delete;
    
    static Index* handle() {
        if(instance == nullptr) {
            pthread_mutex_lock(&mutex);
            if (instance == nullptr) {
                instance = new Index;
            }
            pthread_mutex_unlock(&mutex);
        }
        return instance;
    }
    
private:
    static Index* instance;
};

//静态成员初始化
Index* Index::instance = nullptr;

C++ 轻量级搜索引擎实战：构造正/倒排索引

【一】Jieba 分词工具

【二】正/倒排索引结构设计

【三】关键函数设计

（1）由文档 ID 返回文档内容

（2）由关键字返回倒排拉链

更多推荐文章

相关免费在线工具

（3）说明

（4）建立索引

（5）建立正排索引

（6）建立倒排索引

【四】单例模式

更多推荐文章

相关免费在线工具

C++ 轻量级搜索引擎实战：构造正/倒排索引

【一】Jieba 分词工具

【二】正/倒排索引结构设计

【三】关键函数设计

（1）由文档 ID 返回文档内容

（2）由关键字返回倒排拉链

微信扫一扫，关注极客日志

更多推荐文章

相关免费在线工具

（3）说明

（4）建立索引

（5）建立正排索引

（6）建立倒排索引

【四】单例模式

微信扫一扫，关注极客日志

更多推荐文章

相关免费在线工具