C++ Lightweight Search Engine in Practice: Designing and Implementing Forward and Inverted Indexes


Author: 落日余晖. Published 2026/3/30, updated 2026/5/20.

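This article walks through the core module of a lightweight search engine built with C++11 and cppjieba. It focuses on the data structures behind the forward index (document ID to document content) and the inverted index (keyword to list of documents). The index is built by reading cleaned text data, tokenizing each document to extract keywords, and computing a weight for each keyword. A singleton, created under a mutex, manages the index object so that every thread shares the same instance. The code shows the key functions, from parsing the input file to building both indexes.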

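1. The Jieba tokenizer

An inverted index is keyed by keywords, and those keywords come from the title and body of each HTML document, so the text has to be tokenized first. The project uses the cppjieba tokenizer for this step.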
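To make the "text in, keyword list out" step concrete without depending on cppjieba or its dictionary files, here is a minimal stand-in. `TokenizeAscii` is a hypothetical helper, not part of cppjieba and not the article's `JiebaUtil`; it only splits ASCII text on non-alphanumeric bytes, so it cannot segment Chinese. It is meant as a sketch of the interface shape the indexing code relies on.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for a tokenizer wrapper. It splits on every byte that
// is not an ASCII letter or digit and lower-cases each token. This is NOT
// cppjieba; it only mirrors the "string in, vector of words out" shape.
std::vector<std::string> TokenizeAscii(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current.push_back(static_cast<char>(std::tolower(ch)));
        } else if (!current.empty()) {
            // A separator ends the current token.
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);  // flush the last token
    return words;
}
```

In the real project the same role is played by cppjieba, which also handles Chinese word segmentation.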

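2. Forward and inverted index structures

The forward index maps a document ID to its document (title, content, URL). IDs are assigned sequentially, so the ID doubles as the vector subscript and lookup is a direct array access.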

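#include <cstdint>
#include <string>

// One forward-index entry: everything stored about a single document.
typedef struct Forward_index {
    std::string title;   // document title
    std::string source;  // document body text
    std::string chain;   // document URL
    uint64_t doc_id;     // equals the entry's position in the Forward vector
} Forwardindex;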

The inverted index maps a keyword to the IDs of the documents that contain it, relying on the fast average-case lookup of unordered_map.

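// One posting: a keyword occurring in one document, with its weight.
typedef struct Inverted_index {
    uint64_t doc_id;   // same type as Forwardindex::doc_id
    std::string word;  // the keyword
    int weight;        // relevance of the keyword within this document
} Invertedindex;

std::vector<Forwardindex> Forward;                         // forward index, subscript == doc_id
typedef std::vector<Invertedindex> Stock_Inverted;         // posting list for one keyword
std::unordered_map<std::string, Stock_Inverted> Inverted;  // keyword -> posting list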
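The relationship between the two containers may be easier to see with a small self-contained sketch. The names below (`ForwardEntry`, `Posting`, `TinyIndex`, `AddDocument`, `AddPosting`) are hypothetical and chosen for illustration only; they mirror the article's structures rather than reproduce them.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical mirrors of the article's two record types.
struct ForwardEntry { std::string title, source, chain; uint64_t doc_id; };
struct Posting { uint64_t doc_id; std::string word; int weight; };
using PostingList = std::vector<Posting>;

struct TinyIndex {
    std::vector<ForwardEntry> forward;                      // subscript == doc_id
    std::unordered_map<std::string, PostingList> inverted;  // keyword -> postings

    // Appends a document and returns the ID assigned to it.
    uint64_t AddDocument(const std::string& title, const std::string& source,
                         const std::string& chain) {
        const uint64_t id = forward.size();
        forward.push_back(ForwardEntry{title, source, chain, id});
        return id;
    }

    // Records that `word` occurs in document `doc_id` with the given weight.
    void AddPosting(const std::string& word, uint64_t doc_id, int weight) {
        inverted[word].push_back(Posting{doc_id, word, weight});
    }
};
```

A query then goes keyword -> posting list -> doc_id -> `forward[doc_id]`, which is why the ID has to match the vector position.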

3. Key functions

(1) Looking up a document by ID

Given an ID, return the corresponding entry in the vector.

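// Returns the forward-index entry for `id`, or nullptr if the ID is out of range.
// The ID is unsigned so that the bounds check cannot be bypassed by a negative value.
Forwardindex* GetForward_index(uint64_t id) {
    if (id >= Forward.size()) {
        std::cerr << "GetForward_index error" << std::endl;
        return nullptr;
    }
    return &Forward[id];
}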

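(2) Looking up a posting list by keyword

Given a keyword, return the list of documents that contain it, that is, the vector stored under that key.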

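// Returns the posting list for `word`, or nullptr if the keyword is not indexed.
Stock_Inverted* GetInverted_index(const std::string& word) {
    auto it = Inverted.find(word);
    if (it == Inverted.end()) {
        // The original message text is missing from the source; this one
        // follows the pattern used by GetForward_index.
        std::cerr << "GetInverted_index error" << std::endl;
        return nullptr;
    }
    return &it->second;
}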
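A lookup like this is typically followed by ranking. The sketch below shows one way that could look: it copies the posting list for a keyword and orders it by weight, highest first. Sorting by weight is an assumption about how the weights are meant to be used, and `Posting`, `InvertedMap`, and `RankedLookup` are hypothetical names, not functions from the article.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, trimmed-down posting: just the document ID and its weight.
struct Posting { uint64_t doc_id; int weight; };
using InvertedMap = std::unordered_map<std::string, std::vector<Posting>>;

// Returns the postings for `word`, highest weight first. Ties are broken by
// ascending doc_id so the order is deterministic. An unknown word yields an
// empty vector instead of a null pointer.
std::vector<Posting> RankedLookup(const InvertedMap& inverted,
                                  const std::string& word) {
    auto it = inverted.find(word);
    if (it == inverted.end()) return {};
    std::vector<Posting> hits = it->second;  // copy, so the index stays untouched
    std::sort(hits.begin(), hits.end(), [](const Posting& a, const Posting& b) {
        if (a.weight != b.weight) return a.weight > b.weight;
        return a.doc_id < b.doc_id;
    });
    return hits;
}
```

Each returned `doc_id` can then be passed to the forward-index lookup to fetch the title and URL for display.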


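(3) Building the index

BuildIndex reads the cleaned input file line by line. Each line is one document: it is added to the forward index first, then to the inverted index.

bool BuildIndex(const std::string &input) {
    std::ifstream in(input, std::ios::in | std::ios::binary);
    if (!in.is_open()) {
        std::cerr << "BuildIndex error" << std::endl;
        return false;
    }
    std::string line;
    while (std::getline(in, line)) {
        Forwardindex *doc = Build_Forward_Index(line);
        // Check for a malformed line before touching doc; the original
        // dereferenced the pointer in printf before this check.
        if (doc == nullptr) continue;
        printf("Building index: %llu\ntitle:%s\nchain:%s\n",
               static_cast<unsigned long long>(doc->doc_id),
               doc->title.c_str(), doc->chain.c_str());
        Build_Inverted_Index(*doc);
    }
    return true;
}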

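(4) Building the forward index

// Parses one line of the form  title \3 content \3 url  and appends it to Forward.
// Returns a pointer to the stored entry, or nullptr if a separator is missing.
Forwardindex* Build_Forward_Index(const std::string& line) {
    const size_t set_pos1 = line.find('\3');
    if (set_pos1 == std::string::npos) return nullptr;
    const size_t set_pos2 = line.find('\3', set_pos1 + 1);
    if (set_pos2 == std::string::npos) return nullptr;

    // Build the entry on the stack; the original allocated it with new and
    // never freed it after copying it into the vector.
    Forwardindex index;
    index.title = line.substr(0, set_pos1);
    if (index.title.empty()) index.title = "空";    // placeholder, "空" means "empty"

    index.source = line.substr(set_pos1 + 1, set_pos2 - (set_pos1 + 1));
    if (index.source.empty()) index.source = "空";

    index.chain = line.substr(set_pos2 + 1);
    if (index.chain.empty()) index.chain = "空";

    index.doc_id = Forward.size();  // the ID is the position in the vector
    Forward.push_back(std::move(index));
    // Valid only until the next push_back reallocates the vector.
    return &Forward.back();
}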
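The parsing step can be tried in isolation. The following is a minimal sketch that assumes the same record layout as the function above (three fields separated by the byte `'\3'`); `ParsedRecord` and `SplitRecord` are hypothetical names used only for this illustration.

```cpp
#include <string>

// Hypothetical holder for the three fields of one input line.
struct ParsedRecord { std::string title, source, chain; };

// Splits `line` into title, source and chain at the first two '\3' bytes.
// Returns false, leaving `out` untouched, if either separator is missing.
// Any further '\3' bytes stay inside the chain field.
bool SplitRecord(const std::string& line, ParsedRecord* out) {
    const char kSep = '\3';
    const std::string::size_type first = line.find(kSep);
    if (first == std::string::npos) return false;
    const std::string::size_type second = line.find(kSep, first + 1);
    if (second == std::string::npos) return false;

    out->title = line.substr(0, first);
    out->source = line.substr(first + 1, second - first - 1);
    out->chain = line.substr(second + 1);
    return true;
}
```

Using a control character such as `'\3'` as the separator keeps it from colliding with anything that normally appears in a title, body text, or URL.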

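(5) Building the inverted index

// Tokenizes the title and body of one document, counts each keyword, and
// appends one posting per keyword. A title hit counts twice as much as a body hit.
bool Build_Inverted_Index(const Forwardindex& doc) {
    JiebaUtil jieba;  // the project's cppjieba wrapper

    // Per-keyword occurrence counts within this document.
    struct Calculate {
        int title_size = 0;
        int source_size = 0;
    };
    std::unordered_map<std::string, Calculate> V;

    std::vector<std::string> S = jieba.Tokenize(doc.title);
    for (const auto& e : S) {
        V[e].title_size++;
    }

    S = jieba.Tokenize(doc.source);
    for (const auto& e : S) {
        V[e].source_size++;
    }

    for (const auto& it : V) {
        Invertedindex index_t;
        index_t.word = it.first;
        index_t.doc_id = doc.doc_id;
        // weight = 2 * (occurrences in title) + 1 * (occurrences in body)
        index_t.weight = (it.second.title_size) * 2 + (it.second.source_size) * 1;
        Inverted[it.first].push_back(std::move(index_t));
    }
    return true;
}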
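The weighting rule is simple enough to check by hand. The sketch below computes the same quantity, 2 per title occurrence plus 1 per body occurrence, from two already-tokenized word lists. `ComputeWeights` and the two factor constants are hypothetical names for this illustration; the tokenization itself is left out.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Returns, for every distinct word, 2 * (count in title) + 1 * (count in body).
// Adding the factor once per occurrence gives the same result as counting
// first and multiplying afterwards.
std::unordered_map<std::string, int> ComputeWeights(
        const std::vector<std::string>& title_words,
        const std::vector<std::string>& body_words) {
    const int kTitleFactor = 2;  // a title hit counts double
    const int kBodyFactor = 1;
    std::unordered_map<std::string, int> weights;
    for (const auto& w : title_words) weights[w] += kTitleFactor;
    for (const auto& w : body_words) weights[w] += kBodyFactor;
    return weights;
}
```

For a title tokenized as {"index", "design"} and a body tokenized as {"index", "index", "search"}, the word "index" gets 1 * 2 + 2 * 1 = 4, "design" gets 2, and "search" gets 1.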

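4. Singleton pattern

#include <pthread.h>
#include <atomic>

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
class Index {
public:
    typedef std::vector<Invertedindex> Stock_Inverted;
    Index(const Index&) = delete;
    Index& operator=(const Index&) = delete;
    // Double-checked locking. The pointer is atomic so that the unlocked
    // first check does not race with the store made under the lock.
    static Index* handle() {
        Index* p = instance.load(std::memory_order_acquire);
        if (p == nullptr) {
            pthread_mutex_lock(&mutex);
            p = instance.load(std::memory_order_relaxed);
            if (p == nullptr) {
                p = new Index;
                instance.store(p, std::memory_order_release);
            }
            pthread_mutex_unlock(&mutex);
        }
        return p;
    }
private:
    // Must be declared explicitly: the deleted copy constructor suppresses
    // the implicit default constructor, so `new Index` would not compile.
    Index() = default;
    static std::atomic<Index*> instance;
};
std::atomic<Index*> Index::instance(nullptr);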
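As an alternative to hand-written locking, C++11 guarantees that a function-local static is initialized exactly once even when several threads reach it concurrently. The sketch below uses that guarantee; it is a different technique from the mutex-based version above, and `IndexRegistry` and `Instance` are hypothetical names rather than the article's class.

```cpp
// Singleton based on a function-local static (often called a Meyers singleton).
// Since C++11 the initialization of `instance` is thread-safe, so no explicit
// mutex is needed to create it.
class IndexRegistry {
public:
    static IndexRegistry& Instance() {
        static IndexRegistry instance;  // constructed on first call only
        return instance;
    }

    IndexRegistry(const IndexRegistry&) = delete;
    IndexRegistry& operator=(const IndexRegistry&) = delete;

private:
    IndexRegistry() = default;  // only Instance() may construct the object
};
```

This only covers construction. Concurrent reads and writes of the index data itself would still need their own synchronization.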
