C++ 基于正倒排索引的搜索引擎核心模块代码详解 | 极客日志

C++AI算法

C++ 基于正倒排索引的搜索引擎核心模块代码详解

C++ 搜索引擎核心模块实现正倒排索引，采用 Vector 存储文档元数据作为正排索引，利用哈希表映射关键词至倒排拉链。通过单例模式管理索引实例，确保线程安全与资源复用。构建过程包含分词、词频统计、权重计算及索引填充，支持通过文档 ID 获取内容或关键词检索倒排列表。

NodeJser发布于 2026/3/26更新于 2026/6/1321 浏览

正倒排索引在搜索引擎项目中非常重要，两者协同工作实现内容搜索。正排索引将文档内容映射到文档 ID，倒排索引根据文档映射关键词。构建时先创建正排索引，再基于其生成倒排索引。查询时利用倒排索引快速定位文档，结合正排索引展示结果。

1. 正倒排索引的结构

1.1 正排索引

存储文档内容及对应 ID。

typedef struct DocInfo {
    std::string title;   // 文档标题
    std::string content; // 文档内容
    std::string url;     // 文档 URL
    int doc_id;          // 文档 ID
} DocInfo;

1.2 倒排索引

存储文档 ID、关键字及权重。倒排拉链（InvertedList）包含一个关键字对应的多个文档信息。

struct InvertedElem {
    int doc_id;       // 文档 ID
    std::string word; // 关键字
    int weight;       // 权重
};
typedef std::vector<InvertedElem> InvertedList;

2. 正倒排序部分 Class 设计

2.1 数据结构准备

正排索引使用 Vector，下标即为文档 ID，便于访问。倒排索引使用哈希表，实现关键字到倒排文档列表的映射。

private:
    std::vector<DocInfo> forward_index;
    std::unordered_map<std::string, InvertedList> inverted_index;

2.2 单例模式

采用单例模式管理索引实例，减少资源浪费，确保全局逻辑统一，简化调用。

private:
    Index() {}
    Index(const Index&) = delete;
    Index& operator=(const Index&) = delete;
    static Index* instance;
    static std::mutex log;

public:
    ~Index();
    static Index*  {
         (instance == ) {
            log.();
             (instance == ) {
                instance =  ();
            }
            log.();
        }
         instance;
    }

相关免费在线工具

加密/解密文本
使用加密算法（如AES、TripleDES、Rabbit或RC4）加密和解密文本明文。在线工具，加密/解密文本在线工具，online
RSA密钥对生成器
生成新的随机RSA私钥和公钥pem证书。在线工具，RSA密钥对生成器在线工具，online
Mermaid 预览与可视化编辑
基于 Mermaid.js 实时预览流程图、时序图等图表，支持源码编辑与即时渲染。在线工具，Mermaid 预览与可视化编辑在线工具，online
随机西班牙地址生成器
随机生成西班牙地址（支持马德里、加泰罗尼亚、安达卢西亚、瓦伦西亚筛选），支持数量快捷选择、显示全部与下载。在线工具，随机西班牙地址生成器在线工具，online
Gemini 图片去水印
基于开源反向 Alpha 混合算法去除 Gemini/Nano Banana 图片水印，支持批量处理与下载。在线工具，Gemini 图片去水印在线工具，online
Base64 字符串编码/解码
将字符串编码和解码为其 Base64 格式表示形式即可。在线工具，Base64 字符串编码/解码在线工具，online

DocInfo* BuildForwardIndex(const std::string& line) {
    std::vector<std::string> results;
    ns_util::StringUtil::Split(line, &results, "\3");
    if (results.size() != 3) return nullptr;
    
    DocInfo doc;
    doc.title = results[0];
    doc.content = results[1];
    doc.url = results[2];
    doc.doc_id = forward_index.size();
    
    forward_index.push_back(doc);
    return &forward_index.back();
}

bool BuildInvertedIndex(const DocInfo& doc) {
    struct WordCnt {
        int title_cnt = 0;
        int content_cnt = 0;
    };
    std::unordered_map<std::string, WordCnt> word_map;
    
    std::vector<std::string> title_words;
    ns_util::JiebaUtil::CutString(doc.title, &title_words);
    for (auto& tw : title_words) {
        boost::to_lower(tw);
        word_map[tw].title_cnt++;
    }
    
    std::vector<std::string> content_words;
    ns_util::JiebaUtil::CutString(doc.content, &content_words);
    for (auto& cw : content_words) {
        boost::to_lower(cw);
        word_map[cw].content_cnt++;
    }
    
    #define X 10
    #define Y 1
    for (auto& word_pair : word_map) {
        InvertedElem item;
        item.doc_id = doc.doc_id;
        item.word = word_pair.first;
        item.weight = X * word_pair.second.title_cnt + Y * word_pair.second.content_cnt;
        inverted_index[word_pair.first].push_back(item);
    }
    return true;
}

bool BuildIndex(const std::string& input) {
    std::ifstream in(input, std::ios::in | std::ios::binary);
    if (!in.is_open()) {
        std::cout << input << "open error" << std::endl;
        return false;
    }
    
    int count = 0;
    std::string line;
    while (std::getline(in, line)) {
        DocInfo* doc = BuildForwardIndex(line);
        if (doc == nullptr) {
            std::cout << "BuildIndex error" << std::endl;
            continue;
        }
        BuildInvertedIndex(*doc);
        count++;
        if (count % 50 == 0) LOG1(NORMAL, "索引建立到：" + std::to_string(count));
    }
    return true;
}

DocInfo* GetForwardIndex(uint64_t doc_id) {
    if (doc_id >= forward_index.size()) {
        std::cout << "doc_id out range, error!" << std::endl;
        return nullptr;
    }
    return &forward_index[doc_id];
}

InvertedList* GetInvertedList(const std::string& word) {
    auto iter = inverted_index.find(word);
    if (iter == inverted_index.end()) {
        std::cout << word << "get error" << std::endl;
        return nullptr;
    }
    return &(iter->second);
}

#pragma once
#include <iostream>
#include <string>
#include <vector>
#include <unordered_map>
#include <fstream>
#include <mutex>
#include "usuallytool.hpp"
#include <boost/algorithm/string.hpp>
#include "log.hpp"

namespace ns_index {
    typedef struct DocInfo {
        std::string title;
        std::string content;
        std::string url;
        int doc_id;
    } DocInfo;

    struct InvertedElem {
        int doc_id;
        std::string word;
        int weight;
    };
    typedef std::vector<InvertedElem> InvertedList;

    class Index {
    private:
        std::vector<DocInfo> forward_index;
        std::unordered_map<std::string, InvertedList> inverted_index;

        Index() {}
        Index(const Index&) = delete;
        Index& operator=(const Index&) = delete;
        static Index* instance;
        static std::mutex log;

    public:
        ~Index();
        static Index* GetInstance() {
            if (instance == nullptr) {
                log.lock();
                if (instance == nullptr) {
                    instance = new Index();
                }
                log.unlock();
            }
            return instance;
        }

        DocInfo* GetForwardIndex(uint64_t doc_id) {
            if (doc_id >= forward_index.size()) {
                std::cout << "doc_id out range, error!" << std::endl;
                return nullptr;
            }
            return &forward_index[doc_id];
        }

        InvertedList* GetInvertedList(const std::string& word) {
            auto iter = inverted_index.find(word);
            if (iter == inverted_index.end()) {
                std::cout << word << "get error" << std::endl;
                return nullptr;
            }
            return &(iter->second);
        }

        bool BuildIndex(const std::string& input) {
            std::ifstream in(input, std::ios::in | std::ios::binary);
            if (!in.is_open()) {
                std::cout << input << "open error" << std::endl;
                return false;
            }
            int count = 0;
            std::string line;
            while (std::getline(in, line)) {
                DocInfo* doc = BuildForwardIndex(line);
                if (doc == nullptr) {
                    std::cout << "BuildIndex error" << std::endl;
                    continue;
                }
                BuildInvertedIndex(*doc);
                count++;
                if (count % 50 == 0) LOG1(NORMAL, "索引建立到：" + std::to_string(count));
            }
            return true;
        }

    private:
        DocInfo* BuildForwardIndex(const std::string& line) {
            std::vector<std::string> results;
            ns_util::StringUtil::Split(line, &results, "\3");
            if (results.size() != 3) return nullptr;
            DocInfo doc;
            doc.title = results[0];
            doc.content = results[1];
            doc.url = results[2];
            doc.doc_id = forward_index.size();
            forward_index.push_back(doc);
            return &forward_index.back();
        }

        bool BuildInvertedIndex(const DocInfo& doc) {
            struct WordCnt {
                int title_cnt = 0;
                int content_cnt = 0;
            };
            std::unordered_map<std::string, WordCnt> word_map;
            std::vector<std::string> title_words;
            ns_util::JiebaUtil::CutString(doc.title, &title_words);
            for (auto& tw : title_words) {
                boost::to_lower(tw);
                word_map[tw].title_cnt++;
            }
            std::vector<std::string> content_words;
            ns_util::JiebaUtil::CutString(doc.content, &content_words);
            for (auto& cw : content_words) {
                boost::to_lower(cw);
                word_map[cw].content_cnt++;
            }
            #define X 10
            #define Y 1
            for (auto& word_pair : word_map) {
                InvertedElem item;
                item.doc_id = doc.doc_id;
                item.word = word_pair.first;
                item.weight = X * word_pair.second.title_cnt + Y * word_pair.second.content_cnt;
                inverted_index[word_pair.first].push_back(item);
            }
            return true;
        }
    };
    Index* Index::instance = nullptr;
    std::mutex Index::log;
}

C++ 基于正倒排索引的搜索引擎核心模块代码详解

1. 正倒排索引的结构

1.1 正排索引

1.2 倒排索引

2. 正倒排序部分 Class 设计

2.1 数据结构准备

2.2 单例模式

更多推荐文章

相关免费在线工具

3. 构建正倒排索引

3.1 创建正排索引

3.2 创建倒排索引

3.3 索引构建入口

4. 获取索引数据

4.1 获取正排索引

4.2 获取倒排索引

5. 完整代码示例

更多推荐文章

相关免费在线工具

C++ 基于正倒排索引的搜索引擎核心模块代码详解

1. 正倒排索引的结构

1.1 正排索引

1.2 倒排索引

2. 正倒排序部分 Class 设计

2.1 数据结构准备

2.2 单例模式

微信扫一扫，关注极客日志

更多推荐文章

相关免费在线工具

3. 构建正倒排索引

3.1 创建正排索引

3.2 创建倒排索引

3.3 索引构建入口

4. 获取索引数据

4.1 获取正排索引

4.2 获取倒排索引

5. 完整代码示例

微信扫一扫，关注极客日志

更多推荐文章

相关免费在线工具