A Boost Search Engine in C++ Built on Forward and Inverted Indexes: Implementation and Explanation | 极客日志

C++ · AI · Algorithms


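This article implements the core index module of a search engine in C++ with the Boost library. It covers the data-structure design of the forward and inverted indexes, the use of the singleton pattern, document word segmentation, and the weight calculation logic. Document metadata is stored in a std::vector, and a hash table maps each keyword to its list of documents, so the index can be built from raw text and queried quickly. The implementation includes thread-safe singleton initialization, parsing of de-tagged documents, word-frequency counting, and weight assignment, which together provide the underlying data for the search feature.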

Published 2026/3/16 · Updated 2026/7/129 views


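The forward index and the inverted index work together to support content search. When the search engine processes documents and builds its indexes, it first creates the forward index and then generates the inverted index from it. When a user enters a query keyword, the engine uses the inverted index to quickly locate the documents that contain the keyword, then combines the forward index and other information to present the results.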

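Note: the forward index maps a document ID to that document's content, and the inverted index maps a keyword to the IDs of the documents that contain it. Both are built in advance; neither is generated on the fly at search time.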

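1. Structure of the Forward and Inverted Indexes

1.1 Forward Index

Stores each document's content together with its ID.

// Forward index entry
typedef struct DocInfo {
    std::string title;   // document title
    std::string content; // document content
    std::string url;     // document URL
    int doc_id;          // document ID
} DocInfo;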

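1.2 Inverted Index

Each entry stores a document ID, the keyword, and a weight. An InvertedList is commonly called a posting list (倒排拉链), because one keyword may correspond to many documents.

// Inverted index entry
struct InvertedElem {
    int doc_id;       // document ID
    std::string word; // keyword
    int weight;       // weight
};
typedef std::vector<InvertedElem> InvertedList;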

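2. Private Section of the Index Class

2.1 Preparation

The forward index uses std::vector, so an element's subscript doubles as its document ID and lookup is direct. The inverted index uses a hash table (std::unordered_map) that maps each keyword to its inverted list of documents.

private:
    // Forward index: a vector whose subscript is the document ID
    std::vector<DocInfo> forward_index;
    // Inverted index: hash map from keyword to its inverted list
    std::unordered_map<std::string, InvertedList> inverted_index;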
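
To make the two containers concrete, here is a minimal sketch of how they might be filled by hand. `AddDoc` and `AddPosting` are hypothetical helpers invented for illustration and are not part of the article's Index class; the struct definitions are repeated only so the snippet compiles on its own.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

struct DocInfo {
    std::string title;
    std::string content;
    std::string url;
    int doc_id;
};

struct InvertedElem {
    int doc_id;
    std::string word;
    int weight;
};
typedef std::vector<InvertedElem> InvertedList;

// Hypothetical helper: append a document and return its ID.
// The ID is the position the document occupies in the vector.
int AddDoc(std::vector<DocInfo>& forward, const std::string& title,
           const std::string& content, const std::string& url) {
    DocInfo doc;
    doc.title = title;
    doc.content = content;
    doc.url = url;
    doc.doc_id = static_cast<int>(forward.size());
    forward.push_back(doc);
    return doc.doc_id;
}

// Hypothetical helper: record that `word` occurs in document `doc_id`.
void AddPosting(std::unordered_map<std::string, InvertedList>& inverted,
                const std::string& word, int doc_id, int weight) {
    InvertedElem elem;
    elem.doc_id = doc_id;
    elem.word = word;
    elem.weight = weight;
    inverted[word].push_back(elem); // operator[] creates the list on first use
}
```

Because `operator[]` on `std::unordered_map` default-constructs an empty list the first time a keyword is seen, no separate step is needed to create the list.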

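2.2 Singleton Pattern

The index is managed as a singleton so that every caller shares the same index data and no resources are wasted on duplicate copies. The copy constructor and the assignment operator must be disabled, and a mutex is used to prevent several threads from creating the instance concurrently.

private:
    Index() {}
    Index(const Index&) = delete;            // no copy construction
    Index& operator=(const Index&) = delete; // no assignment
    static Index* instance;
    static std::mutex log; // guards creation of the instance

public:
    ~Index();
    static Index* Getinstance() {
        if (instance == nullptr) {
            log.lock();
            if (instance == nullptr) {
                instance = new Index();
            }
            log.unlock();
        }
        return instance;
    }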
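
As an alternative to the hand-written locking above, the following sketch relies on the C++11 guarantee that a function-local static is initialized exactly once, even when several threads reach it at the same time. This is a swapped-in technique, not the article's implementation. The motivation is that reading a plain pointer outside the lock, as double-checked locking does, is formally a data race under the C++ memory model. The class body is reduced to the singleton parts only.

```cpp
class Index {
public:
    // Returns the single shared instance, created on first use.
    static Index* Getinstance() {
        static Index instance; // initialized once; thread-safe since C++11
        return &instance;
    }

    Index(const Index&) = delete;            // no copy construction
    Index& operator=(const Index&) = delete; // no assignment

private:
    Index() {} // private, so only Getinstance can construct the object
};
```

With this form there is no mutex, no raw `new`, and no out-of-class definition of static members to maintain.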


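3. Building the Forward and Inverted Indexes from De-tagged Documents

3.1 Creating the Forward Index

// Parse one line of de-tagged input and append it to the forward index.
DocInfo* BuildForwardIndex(const std::string& line) {
    std::vector<std::string> results;
    // Split the line on '\3' to separate title, content, and url
    ns_util::StringUtil::Split(line, &results, "\3");
    if (results.size() != 3) return nullptr;

    DocInfo doc;
    doc.title = results[0];
    doc.content = results[1];
    doc.url = results[2];
    doc.doc_id = forward_index.size(); // the new element's subscript is its ID

    forward_index.push_back(doc);
    // The pointer stays valid only until the next push_back reallocates
    return &forward_index.back();
}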
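
The article does not show `ns_util::StringUtil::Split`, so the sketch below is only one way such a splitter could behave. `SplitFields` is a hypothetical name, and keeping empty fields (rather than collapsing adjacent separators) is an assumption made here so that a line with a missing field still yields three entries.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the article's Split helper.
// Splits `line` on every occurrence of `sep`; empty fields are kept.
std::vector<std::string> SplitFields(const std::string& line, char sep) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type pos = line.find(sep, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start)); // last field
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1; // continue after the separator
    }
    return fields;
}
```

A line in the expected `title\3content\3url` layout produces exactly three fields, which is the condition `BuildForwardIndex` checks before accepting the document.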

3.2 Creating the Inverted Index

bool BuildInvertedIndex(const DocInfo& doc) {
    // Occurrence counts of one word in the title and in the content
    struct word_cnt {
        int title_cnt;
        int content_cnt;
        word_cnt() : title_cnt(0), content_cnt(0) {}
    };

    std::unordered_map<std::string, word_cnt> word_map;

    // Segment the title and count each word, ignoring case
    std::vector<std::string> title_words;
    ns_util::JiebaUtil::CutString(doc.title, &title_words);
    for (auto& tw : title_words) {
        boost::to_lower(tw);
        word_map[tw].title_cnt++;
    }

    // Segment the content and count each word, ignoring case
    std::vector<std::string> content_words;
    ns_util::JiebaUtil::CutString(doc.content, &content_words);
    for (auto& cw : content_words) {
        boost::to_lower(cw);
        word_map[cw].content_cnt++;
    }

    // A title hit counts ten times as much as a content hit.
    // Local constants replace the macros so X and Y do not leak out of the function.
    const int X = 10;
    const int Y = 1;
    for (auto& word_pair : word_map) {
        InvertedElem item;
        item.doc_id = doc.doc_id;
        item.word = word_pair.first;
        item.weight = X * word_pair.second.title_cnt + Y * word_pair.second.content_cnt;
        inverted_index[word_pair.first].push_back(item);
    }
    return true;
}
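
The weight formula above, 10 × title count + 1 × content count, can be checked with a small worked example. The sketch below is not the article's code: it swaps Jieba segmentation for a whitespace tokenizer and `boost::to_lower` for `std::tolower`, and `CountWords` and `ComputeWeights` are hypothetical names.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_map>

// Adds `per_hit` to a word's weight for every occurrence in `text`.
// Whitespace tokenization is an assumption standing in for Jieba.
void CountWords(const std::string& text, int per_hit,
                std::unordered_map<std::string, int>& weights) {
    std::istringstream iss(text);
    std::string word;
    while (iss >> word) {
        for (char& c : word) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
        weights[word] += per_hit;
    }
}

// Returns weight = 10 * title_cnt + 1 * content_cnt for every word.
std::unordered_map<std::string, int> ComputeWeights(const std::string& title,
                                                    const std::string& content) {
    const int kTitleWeight = 10;
    const int kContentWeight = 1;
    std::unordered_map<std::string, int> weights;
    CountWords(title, kTitleWeight, weights);
    CountWords(content, kContentWeight, weights);
    return weights;
}
```

For the title "Boost Filesystem" and the content "boost filesystem path boost", the word "boost" appears once in the title and twice in the content, so its weight is 10 × 1 + 1 × 2 = 12; "filesystem" gets 11 and "path" gets 1.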

3.3 Calling the Functions to Build the Index

bool BuildIndex(const std::string& input) {
    std::ifstream in(input, std::ios::in | std::ios::binary);
    if (!in.is_open()) {
        std::cout << input << " open error" << std::endl;
        return false;
    }

    int count = 0;
    std::string line;
    // Each line of the input file describes one document
    while (std::getline(in, line)) {
        DocInfo* doc = BuildForwardIndex(line);
        if (doc == nullptr) {
            // Skip malformed lines and keep going
            std::cout << "BuildIndex error" << std::endl;
            continue;
        }
        BuildInvertedIndex(*doc);
        count++;
        if (count % 50 == 0) LOG1(NORMAL, "Index built up to: " + std::to_string(count));
    }
    return true;
}
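
The loop above reads one document per line and skips any line that does not parse. The sketch below isolates that skip-and-continue behaviour so it can be tried on an in-memory stream. `LoadStats` and `ScanInput` are hypothetical names, and counting separators is an assumption that stands in for the article's `Split` call.

```cpp
#include <algorithm>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical summary of one pass over the input.
struct LoadStats {
    int indexed = 0; // lines with title, content, and url
    int skipped = 0; // malformed lines
};

// Reads the stream line by line, as BuildIndex does, and classifies each line.
LoadStats ScanInput(std::istream& in) {
    LoadStats stats;
    std::string line;
    while (std::getline(in, line)) {
        // Three fields means exactly two '\3' separators
        if (std::count(line.begin(), line.end(), '\3') == 2) {
            ++stats.indexed;
        } else {
            ++stats.skipped;
        }
    }
    return stats;
}
```

Taking a `std::istream&` rather than a file name lets the same function run against a `std::ifstream` or a `std::istringstream`, which makes a dry run over test data straightforward.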

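4. Retrieving the Forward and Inverted Indexes

4.1 Retrieving the Forward Index

// Look up a document by ID; returns nullptr if the ID is out of range.
DocInfo* GetForwardIndex(uint64_t doc_id) {
    if (doc_id >= forward_index.size()) {
        std::cout << "doc_id out of range, error!" << std::endl;
        return nullptr;
    }
    return &forward_index[doc_id];
}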

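4.2 Retrieving the Inverted Index

// Look up the inverted list for a keyword; returns nullptr if the keyword is unknown.
InvertedList* GetInvertedList(const std::string& word) {
    auto iter = inverted_index.find(word);
    if (iter == inverted_index.end()) {
        std::cout << word << " get error" << std::endl;
        return nullptr;
    }
    return &(iter->second);
}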
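
Putting the two lookups together, a query could proceed roughly as follows: find the keyword's inverted list, order the hits by weight, then fetch each document from the forward index. This is a sketch under assumptions: `SearchTitles` is a hypothetical function, ranking by descending weight is one plausible use of the weight field, and the containers are passed in as parameters instead of living inside the article's Index class.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

struct DocInfo {
    std::string title;
    std::string content;
    std::string url;
    int doc_id;
};

struct InvertedElem {
    int doc_id;
    std::string word;
    int weight;
};
typedef std::vector<InvertedElem> InvertedList;

// Hypothetical helper: titles of the documents containing `word`, best match first.
std::vector<std::string> SearchTitles(
        const std::vector<DocInfo>& forward,
        const std::unordered_map<std::string, InvertedList>& inverted,
        std::string word) {
    // Index keys were lower-cased at build time, so lower-case the query too
    for (char& c : word) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    std::vector<std::string> titles;
    auto iter = inverted.find(word);
    if (iter == inverted.end()) return titles; // unknown keyword: no results

    InvertedList hits = iter->second; // copy, so sorting leaves the index untouched
    std::sort(hits.begin(), hits.end(),
              [](const InvertedElem& a, const InvertedElem& b) {
                  return a.weight > b.weight;
              });
    for (const InvertedElem& hit : hits) {
        // Same bounds check as GetForwardIndex
        if (hit.doc_id >= 0 &&
            static_cast<std::size_t>(hit.doc_id) < forward.size()) {
            titles.push_back(forward[hit.doc_id].title);
        }
    }
    return titles;
}
```

Only the inverted list for the queried keyword is touched, so the cost of a query depends on how many documents contain that keyword rather than on the size of the whole corpus.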

5. Summary

#pragma once
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>
#include <unordered_map>
#include <fstream>
#include <mutex>
#include "usuallytool.hpp"
#include <boost/algorithm/string.hpp>
#include "log.hpp"

namespace ns_index {
    typedef struct DocInfo {
        std::string title;
        std::string content;
        std::string url;
        int doc_id;
    } DocInfo;

    struct InvertedElem {
        int doc_id;
        std::string word;
        int weight;
    };
    typedef std::vector<InvertedElem> InvertedList;

    class Index {
    private:
        std::vector<DocInfo> forward_index;
        std::unordered_map<std::string, InvertedList> inverted_index;

    private:
        Index() {};
        Index(const Index&) = delete;
        Index& operator=(const Index&) = delete;
        static Index* instance;
        static std::mutex log;

    public:
        ~Index();
        static Index* Getinstance() {
            if (instance == nullptr) {
                log.lock();
                if (instance == nullptr) {
                    instance = new Index();
                }
                log.unlock();
            }
            return instance;
        }

    public:
        DocInfo* GetForwardIndex(uint64_t doc_id) {
            if (doc_id >= forward_index.size()) {
                std::cout << "doc_id out of range, error!" << std::endl;
                return nullptr;
            }
            return &forward_index[doc_id];
        }

        InvertedList* GetInvertedList(const std::string& word) {
            auto iter = inverted_index.find(word);
            if (iter == inverted_index.end()) {
                std::cout << word << " get error" << std::endl;
                return nullptr;
            }
            return &(iter->second);
        }

        bool BuildIndex(const std::string& input) {
            std::ifstream in(input, std::ios::in | std::ios::binary);
            if (!in.is_open()) {
                std::cout << input << " open error" << std::endl;
                return false;
            }
            int count = 0;
            std::string line;
            while (std::getline(in, line)) {
                DocInfo* doc = BuildForwardIndex(line);
                if (doc == nullptr) {
                    std::cout << "BuildIndex error" << std::endl;
                    continue;
                }
                BuildInvertedIndex(*doc);
                count++;
                if (count % 50 == 0) LOG1(NORMAL, "Index built up to: " + std::to_string(count));
            }
            return true;
        }

    private:
        DocInfo* BuildForwardIndex(const std::string& line) {
            std::vector<std::string> results;
            ns_util::StringUtil::Split(line, &results, "\3");
            if (results.size() != 3) return nullptr;
            DocInfo doc;
            doc.title = results[0];
            doc.content = results[1];
            doc.url = results[2];
            doc.doc_id = forward_index.size();
            forward_index.push_back(doc);
            return &forward_index.back();
        }

        bool BuildInvertedIndex(const DocInfo& doc) {
            struct word_cnt {
                int title_cnt;
                int content_cnt;
                word_cnt() : title_cnt(0), content_cnt(0) {}
            };
            std::unordered_map<std::string, word_cnt> word_map;
            std::vector<std::string> title_words;
            ns_util::JiebaUtil::CutString(doc.title, &title_words);
            for (auto& tw : title_words) {
                boost::to_lower(tw);
                word_map[tw].title_cnt++;
            }
            std::vector<std::string> content_words;
            ns_util::JiebaUtil::CutString(doc.content, &content_words);
            for (auto& cw : content_words) {
                boost::to_lower(cw);
                word_map[cw].content_cnt++;
            }
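            // Local constants replace the macros so X and Y do not leak out of the function
            const int X = 10; // weight of one title hit
            const int Y = 1;  // weight of one content hit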
            for (auto& word_pair : word_map) {
                InvertedElem item;
                item.doc_id = doc.doc_id;
                item.word = word_pair.first;
                item.weight = X * word_pair.second.title_cnt + Y * word_pair.second.content_cnt;
                inverted_index[word_pair.first].push_back(item);
            }
            return true;
        }
    };
    Index* Index::instance = nullptr;
    std::mutex Index::log;
}
