C++ Lightweight Search Engine in Practice: A Guide to Building Forward and Inverted Indexes | 极客日志

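This article implements the core module of a lightweight search engine in C++, focusing on the data structure design and build process for the forward and inverted indexes. It uses cppjieba for Chinese word segmentation, reads the cleaned data from a file, and builds two mappings: document ID to content (the forward index) and keyword to a list of document IDs (the inverted index). A singleton manages the index object, and STL containers backed by a hash table keep lookups fast, covering the full path from tokenization to index generation.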

古灵精怪 | Published 2026/3/30 | Updated 2026/6/1126 views

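This hands-on project focuses on the core workflow of a basic search engine, built on the C/C++ ecosystem: a crawler fetches pages from across the web; the server preprocesses them by stripping tags, cleaning the data, and building the index; an HTTP service then receives client requests, searches the index, and assembles the results page to return. Together these steps cover the end-to-end logic of a lightweight search engine. The project uses C++11, STL, and Boost as its core stack and is deployed on a CentOS 7 cloud server with a GCC toolchain. It meets the performance needs of back-end engineering, leaves room for optional front-end work to improve the user experience, and is a typical case study for understanding how search engines work underneath and how C++ is used in practice.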

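【1】The Jieba tokenizer

The inverted index is keyed by "keywords", which come from the title and content of each .html document, so word segmentation is required. We use the cppjieba tokenizer for this.

If you need cppjieba, you can upload it from your local machine straight to the server.

Then create symlinks to cppjieba/include/cppjieba and cppjieba/dict, which hold the header files and the dictionaries respectively.

Move cppjieba to the parent directory, then update the two symlinks to match.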

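【2】Forward and inverted index structure design

#include <cstdint>
#include <string>

// Forward index entry
typedef struct Forward_index {
    std::string title;   // document title
    std::string source;  // document content with tags stripped
    std::string chain;   // document URL
    uint64_t doc_id;     // document ID (its index in the forward vector)
} Forwardindex;

// Inverted index entry
typedef struct Inverted_index {
    uint64_t doc_id;     // ID of a document containing the word (same type as Forwardindex::doc_id)
    std::string word;    // keyword
    int weight;          // relevance weight of the word in that document
} Invertedindex;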

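Forward index: maps an ID to the corresponding document (title, content, URL); the ID is simply the element's index in a vector. Inverted index: maps a keyword to the IDs of the documents that contain it, using unordered_map for fast lookup.

// Forward index storage
std::vector<Forwardindex> Forward;
typedef std::vector<Invertedindex> Stock_Inverted;
// Inverted index storage
std::unordered_map<std::string, Stock_Inverted> Inverted;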
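To make the relationship between the two containers concrete, here is a minimal sketch that fills both by hand for two documents. The names `Doc`, `Posting`, `TinyIndex`, and `MakeSampleIndex`, as well as the example URLs and weights, are assumptions made for illustration and are not part of the article's code.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-ins for the article's structures (hypothetical names).
struct Doc {
    std::string title;
    std::string url;
};

struct Posting {
    std::uint64_t doc_id;  // index into the forward vector
    int weight;
};

struct TinyIndex {
    std::vector<Doc> forward;                                        // doc_id -> document
    std::unordered_map<std::string, std::vector<Posting>> inverted;  // word -> postings

    // Appends a document and returns its ID, which is simply its position.
    std::uint64_t AddDoc(const Doc& doc) {
        forward.push_back(doc);
        return forward.size() - 1;
    }

    // Records that `word` occurs in document `doc_id` with the given weight.
    void AddPosting(const std::string& word, std::uint64_t doc_id, int weight) {
        inverted[word].push_back(Posting{doc_id, weight});
    }
};

// Builds a two-document index by hand to show how the two containers relate.
inline TinyIndex MakeSampleIndex() {
    TinyIndex idx;
    std::uint64_t a = idx.AddDoc({"Boost filesystem", "https://example.com/a"});
    std::uint64_t b = idx.AddDoc({"Boost asio", "https://example.com/b"});
    idx.AddPosting("boost", a, 2);
    idx.AddPosting("boost", b, 2);
    idx.AddPosting("asio", b, 2);
    return idx;
}
```

In this sketch the keyword "asio" leads to a single posting whose `doc_id` indexes straight back into `forward`, which is the round trip a query performs.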

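【3】Key function design

（1）Returning a document by its ID

Purpose: this is the external interface to the forward index. Given an ID, it returns the corresponding entry in the vector.

// Takes an unsigned ID so that a negative value can never reach the bounds check.
Forwardindex* GetForward_index(uint64_t id) {
    if(id >= Forward.size()) {
        std::cerr << "GetForward_index: doc id " << id << " is out of range" << std::endl;
        return nullptr;
    }
    return &Forward[id];
}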

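（2）Returning the posting list for a keyword

// Pass the keyword by const reference to avoid copying it on every lookup.
Stock_Inverted* GetInverted_index(const std::string& word) {
    auto it = Inverted.find(word);
    if(it == Inverted.end()) {
        std::cerr << "GetInverted_index: no posting list for \"" << word << "\"" << std::endl;
        return nullptr;
    }
    return &it->second;
}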
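The two lookups are meant to be used together: find the posting list for a keyword, then resolve each document ID through the forward index. The sketch below shows one way that could look. `SearchTitles` is a hypothetical helper, and sorting by descending weight is an assumption about how results would be ranked rather than something the article specifies.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct Forwardindex {
    std::string title;
    std::string source;
    std::string chain;
    std::uint64_t doc_id;
};

struct Invertedindex {
    std::uint64_t doc_id;
    std::string word;
    int weight;
};

typedef std::vector<Invertedindex> Stock_Inverted;

// Hypothetical helper: titles of the documents containing `word`, highest weight first.
std::vector<std::string> SearchTitles(
        const std::vector<Forwardindex>& forward,
        const std::unordered_map<std::string, Stock_Inverted>& inverted,
        const std::string& word) {
    std::vector<std::string> titles;
    auto it = inverted.find(word);
    if (it == inverted.end()) return titles;  // unknown keyword: empty result

    Stock_Inverted hits = it->second;  // copy, so the index itself stays untouched
    std::stable_sort(hits.begin(), hits.end(),
                     [](const Invertedindex& a, const Invertedindex& b) {
                         return a.weight > b.weight;
                     });
    for (const auto& hit : hits) {
        // Guard against a posting that points past the end of the forward index.
        if (hit.doc_id < forward.size()) titles.push_back(forward[hit.doc_id].title);
    }
    return titles;
}
```

Passing the containers as parameters keeps the sketch self-contained; in the article they are members reached through the two getter functions.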

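（3）Notes

Both lookup functions return pointers into these two containers:

// Forward index storage
std::vector<Forwardindex> Forward;
typedef std::vector<Invertedindex> Stock_Inverted;
// Inverted index storage
std::unordered_map<std::string, Stock_Inverted> Inverted;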

（4）Building the index

bool Buildindex(const std::string &input) {
    std::ifstream in(input, std::ios::in | std::ios::binary);
    if(!in.is_open()) {
        std::cerr << "Buildindex: failed to open " << input << std::endl;
        return false;
    }
    std::string line;
    // Each line of the cleaned data file describes one document.
    while(std::getline(in, line)) {
        Forwardindex *doc = Build_Forward_Index(line);
        // Skip malformed lines; the null check must come before any use of doc.
        if(doc == nullptr) continue;
        printf("Building index: %llu\ntitle:%s\nchain:%s\n",
               static_cast<unsigned long long>(doc->doc_id), doc->title.c_str(), doc->chain.c_str());
        Build_Inverted_Index(*doc);
    }
    return true;
}

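（5）Building the forward index

Forwardindex* Build_Forward_Index(const std::string& line) {
    // Each line holds title, content, and URL, separated by the '\3' character.
    size_t set_pos1 = line.find('\3');
    if(set_pos1 == std::string::npos) return nullptr;
    size_t set_pos2 = line.find('\3', set_pos1+1);
    if(set_pos2 == std::string::npos) return nullptr;

    // Build the entry on the stack; the original heap allocation was never freed.
    Forwardindex index;
    index.title = line.substr(0, set_pos1);
    if (index.title.empty()) index.title = "空";  // placeholder meaning "empty"

    index.source = line.substr(set_pos1+1, set_pos2 - (set_pos1+1));
    if (index.source.empty()) index.source = "空";

    index.chain = line.substr(set_pos2+1);
    if (index.chain.empty()) index.chain = "空";

    // The document ID is the position the entry will occupy in Forward.
    index.doc_id = Forward.size();
    Forward.push_back(std::move(index));
    // Return the stored element. The pointer is only valid until the next
    // push_back, which may reallocate Forward.
    return &Forward.back();
}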
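The field splitting above can also be written as a small general-purpose helper. This is a sketch under one assumption: `SplitFields` is a hypothetical name, and unlike the article's function it splits on every separator rather than treating everything after the second one as the URL.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits `line` on `sep`, keeping empty fields.
std::vector<std::string> SplitFields(const std::string& line, char sep = '\3') {
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        std::size_t pos = line.find(sep, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start));  // last field runs to the end of the line
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}
```

With this helper, a record would be accepted only when exactly three fields come back, which corresponds to the two `npos` checks in the function above.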

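（6）Building the inverted index

bool Build_Inverted_Index(const Forwardindex& doc) {
    // Construct the tokenizer once rather than once per document
    // (assuming JiebaUtil loads its dictionaries in its constructor).
    static JiebaUtil jieba;

    // Per-word occurrence counts in the title and in the content.
    struct calculate {
        int title_size=0;
        int source_size=0;
    };
    std::unordered_map<std::string, calculate> V;

    std::vector<std::string> S = jieba.Tokenize(doc.title);
    for(const auto& e : S) {
        (V[e].title_size)++;
    }
    S = jieba.Tokenize(doc.source);
    for(const auto& e : S) {
        (V[e].source_size)++;
    }

    // One posting per distinct word; a title hit counts twice as much as a content hit.
    for(const auto& it : V) {
        Invertedindex index_t;
        index_t.word = it.first;
        index_t.doc_id = doc.doc_id;
        index_t.weight = ((it.second.title_size)*2 + (it.second.source_size)*1);
        Inverted[it.first].push_back(std::move(index_t));
    }
    return true;
}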
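The weighting rule is easy to check in isolation. The sketch below applies the same formula, two points per title occurrence and one per content occurrence, but swaps cppjieba for a plain whitespace tokenizer so it runs without any dictionaries. `SplitWords` and `WordWeights` are hypothetical names introduced for this example.

```cpp
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in tokenizer: splits on whitespace. The article uses cppjieba instead.
std::vector<std::string> SplitWords(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) words.push_back(w);
    return words;
}

// Returns word -> weight, where weight = 2 * (hits in title) + 1 * (hits in body).
std::unordered_map<std::string, int> WordWeights(const std::string& title,
                                                 const std::string& body) {
    std::unordered_map<std::string, int> weights;
    for (const auto& w : SplitWords(title)) weights[w] += 2;
    for (const auto& w : SplitWords(body)) weights[w] += 1;
    return weights;
}
```

For the title "boost asio" and the body "asio is part of boost boost", the word "boost" gets 1*2 + 2*1 = 4 and "asio" gets 1*2 + 1*1 = 3.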

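【4】Singleton pattern

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
class Index {
public:
    typedef std::vector<Invertedindex> Stock_Inverted;
    // Non-copyable: the process holds exactly one index.
    Index(const Index&) = delete;
    Index& operator=(const Index&) = delete;
    // Lazily creates the single instance using double-checked locking.
    // Note: the unlocked first read of `instance` is formally a data race under the
    // C++11 memory model unless `instance` is declared as std::atomic<Index*>.
    static Index* handle() {
        if(instance == nullptr) {
            pthread_mutex_lock(&mutex);
            if (instance == nullptr) {
                instance = new Index;
            }
            pthread_mutex_unlock(&mutex);
        }
        return instance;
    }
    ......
};

Index* Index::instance = nullptr;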
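Since the project targets C++11, a function-local static is a possible alternative to the hand-written locking, because the language guarantees that such a static is initialized exactly once even when several threads call the function concurrently. This is a sketch of that technique, not the article's implementation; the class name `IndexSketch` and its `Touch`/`touches` members are hypothetical and exist only to make the example observable.

```cpp
#include <cstddef>

// Hypothetical class; only the singleton mechanics are shown.
class IndexSketch {
public:
    IndexSketch(const IndexSketch&) = delete;
    IndexSketch& operator=(const IndexSketch&) = delete;

    // Since C++11, initialization of a function-local static is thread-safe,
    // so no explicit mutex or double-checked locking is needed here.
    static IndexSketch& Instance() {
        static IndexSketch instance;
        return instance;
    }

    void Touch() { ++touches_; }
    std::size_t touches() const { return touches_; }

private:
    IndexSketch() : touches_(0) {}  // private: only Instance() can construct
    std::size_t touches_;
};
```

Every call to `IndexSketch::Instance()` returns a reference to the same object. Only the construction is synchronized, so later concurrent writes to the index would still need their own locking.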
