C++ 基于正倒排索引的 Boost 搜索引擎核心模块实现与详解 | 极客日志

C++算法

C++ 基于正倒排索引的 Boost 搜索引擎核心模块实现与详解

使用 C++ 和 Boost 库实现搜索引擎正倒排索引的核心逻辑。内容涵盖数据结构设计（DocInfo、InvertedElem）、单例模式的应用、索引构建流程（分词、词频统计、权重计算）以及检索接口实现。通过 Vector 存储正向索引，哈希表存储反向索引，实现了高效的文档搜索基础架构。

CryptoLab发布于 2026/3/30更新于 2026/7/2148 浏览

1. 正倒排索引的结构

1.1 正排索引

正排索引存储文档内容及其 ID。

// 定义文档信息结构体
typedef struct DocInfo {
    std::string title;   // 文档标题
    std::string content; // 文档内容
    std::string url;     // 文档 URL
    int doc_id;          // 文档 ID
} DocInfo;

1.2 倒排索引

倒排索引存储文档 ID、对应关键字及权重。InvertedList 通常称为倒排拉链，一个关键字可能对应多个文档。

// 定义倒排元素结构体
struct InvertedElem {
    int doc_id;      // 文档 ID
    std::string word;// 关键词
    int weight;      // 权重
};
typedef std::vector<InvertedElem> InvertedList;

2. 正倒排序部分 Class 的 Private 部分

2.1 准备工作

正向索引使用 Vector，下标即为文档 ID，便于访问。反向索引使用哈希表（unordered_map），将关键字映射到倒排文档列表。

private:
    // 正排索引使用 Vector，下标对应文档 ID
    std::vector<DocInfo> forward_index;
    // 使用哈希表进行映射
    std::unordered_map<std::string, InvertedList> inverted_index;

2.2 单例模式

本代码采用单例模式，主要优势包括：

减少资源浪费：避免反复创建或销毁实例导致的内存占用过高和 IO 开销增加。
确保全局逻辑统一：强制所有搜索请求复用同一套逻辑，避免结果混乱。
简化资源管理与调用：降低模块间耦合度，减少代码冗余。

因此需禁用拷贝构造函数和赋值运算符，并添加互斥锁以防止多线程并发创建实例。双重检查锁定机制用于确保线程安全。

private:
    Index() {};
    Index(const Index&) = delete;
    Index& operator=( Index&) = ;
     Index* instance;
     std::mutex log;

:
    ~();

    {
         (instance == ) {
            log.();
             (instance == ) {
                instance =  ();
            }
            log.();
        }
         instance;
    }

相关免费在线工具

加密/解密文本
使用加密算法（如AES、TripleDES、Rabbit或RC4）加密和解密文本明文。在线工具，加密/解密文本在线工具，online
Gemini 图片去水印
基于开源反向 Alpha 混合算法去除 Gemini/Nano Banana 图片水印，支持批量处理与下载。在线工具，Gemini 图片去水印在线工具，online
Base64 字符串编码/解码
将字符串编码和解码为其 Base64 格式表示形式即可。在线工具，Base64 字符串编码/解码在线工具，online
Base64 文件转换器
将字符串、文件或图像转换为其 Base64 表示形式。在线工具，Base64 文件转换器在线工具，online
Markdown转HTML
将 Markdown（GFM）转为 HTML 片段，浏览器内 marked 解析；与 HTML转Markdown 互为补充。在线工具，Markdown转HTML在线工具，online
HTML转Markdown
将 HTML 片段转为 GitHub Flavored Markdown，支持标题、列表、链接、代码块与表格等；浏览器内处理，可链接预填。在线工具，HTML转Markdown在线工具，online

DocInfo* BuildForwardIndex(const std::string& line) {
    std::vector<std::string> results;
    ns_util::StringUtil::Split(line, &results, "\3"); // 字符串切割，分离 title, content, url
    if (results.size() != 3) return nullptr;

    DocInfo doc;
    doc.title = results[0];
    doc.content = results[1];
    doc.url = results[2];
    doc.doc_id = forward_index.size(); // 文档 ID 为当前下标

    forward_index.push_back(doc);
    return &forward_index.back();
}

bool BuildInvertedIndex(const DocInfo& doc) {
    struct word_cnt {
        int title_cnt;
        int content_cnt;
        word_cnt() : title_cnt(0), content_cnt(0) {}
    };

    std::unordered_map<std::string, word_cnt> word_map;
    std::vector<std::string> title_words;
    ns_util::JiebaUsutl::CutString(doc.title, &title_words);
    for (auto& tw : title_words) {
        boost::to_lower(tw);
        word_map[tw].title_cnt++;
    }

    std::vector<std::string> content_words;
    ns_util::JiebaUsutl::CutString(doc.content, &content_words);
    for (auto& cw : content_words) {
        boost::to_lower(cw);
        word_map[cw].content_cnt++;
    }

    #define X 10
    #define Y 1
    for (auto& word_pair : word_map) {
        InvertedElem item;
        item.doc_id = doc.doc_id;
        item.word = word_pair.first;
        item.weight = X * word_pair.second.title_cnt + Y * word_pair.second.content_cnt;
        inverted_index[word_pair.first].push_back(item);
    }
    return true;
}

bool BuildIndex(const std::string& input) {
    std::ifstream in(input, std::ios::in | std::ios::binary);
    if (!in.is_open()) {
        std::cout << input << "open error" << std::endl;
        return false;
    }

    int count = 0;
    std::string line;
    while (std::getline(in, line)) {
        DocInfo* doc = BuildForwardIndex(line);
        if (doc == nullptr) {
            std::cout << "BuildIndex error" << std::endl;
            continue;
        }
        BuildInvertedIndex(*doc);
        count++;
        if (count % 50 == 0) LOG1(NORMAL, "索引建立到：" + std::to_string(count));
    }
    return true;
}

DocInfo* GetForwardIndex(uint64_t doc_id) {
    if (doc_id >= forward_index.size()) {
        std::cout << "doc_id out range, error!" << std::endl;
        return nullptr;
    }
    return &forward_index[doc_id];
}

InvertedList* GetInvertedList(const std::string& word) {
    auto iter = inverted_index.find(word);
    if (iter == inverted_index.end()) {
        std::cout << word << "get error" << std::endl;
        return nullptr;
    }
    return &(iter->second);
}

#pragma once
#include <iostream>
#include <string>
#include <vector>
#include <unordered_map>
#include <fstream>
#include <mutex>
#include "usuallytool.hpp"
#include <boost/algorithm/string.hpp>
#include "log.hpp"

namespace ns_index {
    typedef struct DocInfo {
        std::string title;
        std::string content;
        std::string url;
        int doc_id;
    } DocInfo;

    struct InvertedElem {
        int doc_id;
        std::string word;
        int weight;
    };
    typedef std::vector<InvertedElem> InvertedList;

    class Index {
    private:
        std::vector<DocInfo> forward_index;
        std::unordered_map<std::string, InvertedList> inverted_index;

        Index() {};
        Index(const Index&) = delete;
        Index& operator=(const Index&) = delete;
        static Index* instance;
        static std::mutex log;

    public:
        ~Index();

        static Index* Getinstance() {
            if (instance == nullptr) {
                log.lock();
                if (instance == nullptr) {
                    instance = new Index();
                }
                log.unlock();
            }
            return instance;
        }

        DocInfo* GetForwardIndex(uint64_t doc_id) {
            if (doc_id >= forward_index.size()) {
                std::cout << "doc_id out range, error!" << std::endl;
                return nullptr;
            }
            return &forward_index[doc_id];
        }

        InvertedList* GetInvertedList(const std::string& word) {
            auto iter = inverted_index.find(word);
            if (iter == inverted_index.end()) {
                std::cout << word << "get error" << std::endl;
                return nullptr;
            }
            return &(iter->second);
        }

        bool BuildIndex(const std::string& input) {
            std::ifstream in(input, std::ios::in | std::ios::binary);
            if (!in.is_open()) {
                std::cout << input << "open error" << std::endl;
                return false;
            }
            int count = 0;
            std::string line;
            while (std::getline(in, line)) {
                DocInfo* doc = BuildForwardIndex(line);
                if (doc == nullptr) {
                    std::cout << "BuildIndex error" << std::endl;
                    continue;
                }
                BuildInvertedIndex(*doc);
                count++;
                if (count % 50 == 0) LOG1(NORMAL, "索引建立到：" + std::to_string(count));
            }
            return true;
        }

    private:
        DocInfo* BuildForwardIndex(const std::string& line) {
            std::vector<std::string> results;
            ns_util::StringUtil::Split(line, &results, "\3");
            if (results.size() != 3) return nullptr;
            DocInfo doc;
            doc.title = results[0];
            doc.content = results[1];
            doc.url = results[2];
            doc.doc_id = forward_index.size();
            forward_index.push_back(doc);
            return &forward_index.back();
        }

        bool BuildInvertedIndex(const DocInfo& doc) {
            struct word_cnt {
                int title_cnt;
                int content_cnt;
                word_cnt() : title_cnt(0), content_cnt(0) {}
            };
            std::unordered_map<std::string, word_cnt> word_map;
            std::vector<std::string> title_words;
            ns_util::JiebaUsutl::CutString(doc.title, &title_words);
            for (auto& tw : title_words) {
                boost::to_lower(tw);
                word_map[tw].title_cnt++;
            }
            std::vector<std::string> content_words;
            ns_util::JiebaUsutl::CutString(doc.content, &content_words);
            for (auto& cw : content_words) {
                boost::to_lower(cw);
                word_map[cw].content_cnt++;
            }
            #define X 10
            #define Y 1
            for (auto& word_pair : word_map) {
                InvertedElem item;
                item.doc_id = doc.doc_id;
                item.word = word_pair.first;
                item.weight = X * word_pair.second.title_cnt + Y * word_pair.second.content_cnt;
                inverted_index[word_pair.first].push_back(item);
            }
            return true;
        }
    };
    Index* Index::instance = nullptr;
    std::mutex Index::log;
}

C++ 基于正倒排索引的 Boost 搜索引擎核心模块实现与详解

1. 正倒排索引的结构

1.1 正排索引

1.2 倒排索引

2. 正倒排序部分 Class 的 Private 部分

2.1 准备工作

2.2 单例模式

更多推荐文章

相关免费在线工具

3. 基于清洗后文档构建索引

3.1 构建正排索引

3.2 构建倒排索引

3.3 批量构建索引

4. 获取索引数据

4.1 获取正排索引

4.2 获取倒排索引

5. 总结

更多推荐文章

相关免费在线工具

C++ 基于正倒排索引的 Boost 搜索引擎核心模块实现与详解

1. 正倒排索引的结构

1.1 正排索引

1.2 倒排索引

2. 正倒排序部分 Class 的 Private 部分

2.1 准备工作

2.2 单例模式

微信扫一扫，关注极客日志

更多推荐文章

相关免费在线工具

3. 基于清洗后文档构建索引

3.1 构建正排索引

3.2 构建倒排索引

3.3 批量构建索引

4. 获取索引数据

4.1 获取正排索引

4.2 获取倒排索引

5. 总结

微信扫一扫，关注极客日志

更多推荐文章

相关免费在线工具