C++ Search Engine Core: A Walkthrough of a Searcher Built on Forward and Inverted Indexes | 极客日志


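This article walks through the core implementation of the Searcher module in a Boost-based C++ search engine. A singleton manages the forward and inverted indexes. A query passes through four stages: word segmentation, inverted-list lookup (triggering), hash-based deduplication with merge and sort, and JSON serialization. The InvertedElemPrint struct keeps the same document from appearing more than once in the results, and the GetDesc function produces a summary snippet around the matched keyword. Together these steps turn user input into structured search results; the focus here is on the intermediate data structures and on how weights are accumulated.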

MongoKing · Published 2026/3/24 · Updated 2026/6/2422 views

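Searcher Module Design

Searcher is the top-level wrapper that handles a user's search request and returns the results. It sits on top of the forward and inverted indexes and turns a natural-language query into a structured response through a fixed sequence of steps.

Index Initialization

The index is a global resource, so it is managed through the singleton pattern. During initialization, Searcher obtains the single Index instance and builds the index from the input file.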

private:
    ns_index::Index* index = nullptr;

public:
    Searcher() {}
    ~Searcher() {}

    void InitSearcher(const std::string& input) {
        // Fetch the singleton index object
        index = ns_index::Index::GetInstance();
        // Build the forward and inverted indexes from the input path
        index->BuildIndex(input);
        LOG1(NORMAL, "Index built successfully...");
    }

Note: if your index class still exposes this method as Getinstance, rename it to the conventional GetInstance.
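The article does not show how ns_index::Index implements GetInstance. As a minimal sketch of one common way to do it, the stand-in class below uses a function-local static, which C++11 guarantees is initialized exactly once even under concurrent calls. The class body, the BuildIndex placeholder, and the built() and source_path() accessors are assumptions made for illustration, not the project's actual code.

```cpp
#include <string>

// Hypothetical stand-in for ns_index::Index; the real class is not shown in the article.
class Index {
public:
    // Function-local static: constructed on first use, thread-safe since C++11.
    static Index* GetInstance() {
        static Index instance;
        return &instance;
    }

    // A singleton must not be copied.
    Index(const Index&) = delete;
    Index& operator=(const Index&) = delete;

    // Placeholder for real index construction; it only records the input path.
    bool BuildIndex(const std::string& input) {
        source_path_ = input;
        built_ = true;
        return true;
    }

    bool built() const { return built_; }
    const std::string& source_path() const { return source_path_; }

private:
    Index() = default;  // Only GetInstance can construct the object.
    std::string source_path_;
    bool built_ = false;
};
```

Every caller that goes through Index::GetInstance() sees the same object, so an index built once in InitSearcher is visible to all later searches.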

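Query Processing Flow

The Search function is the main entry point. It has four steps: word segmentation, triggering, merge and sort, and building the JSON result.

1. Segmentation and Triggering

The query string is first split into words with Jieba. Each word is then looked up in the singleton index to fetch its inverted list. Words are converted to lowercase before the lookup so that matching is case-insensitive.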

std::vector<std::string> words;
// Split the query into words with Jieba
ns_util::JiebaUtil::CutString(query, &words);

// doc_id -> merged entry, so each document is recorded only once
std::unordered_map<uint64_t, InvertedElemPrint> tokens_map;
for (std::string w : words) {  // copied on purpose: w is lowercased in place
    boost::to_lower(w);
    ns_index::InvertedList* inverted_list = index->GetInvertedList(w);
    if (inverted_list == nullptr) continue;  // no document contains this word
    // Walk the inverted list and accumulate each hit into the hash map
    for (const auto& elem : *inverted_list) {
        auto& item = tokens_map[elem.doc_id];
        item.doc_id = elem.doc_id;
        item.weight += elem.weight;
        item.words.push_back(elem.word);
    }
}
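To make the triggering step easier to try out in isolation, here is a small self-contained sketch. It replaces the project's index with a plain std::unordered_map from word to inverted list; the InvertedElem layout, the ToLower helper, and the Trigger function name are assumptions made for this example, since the article only shows how the fields are used.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Assumed layout of one inverted-list entry (the article only shows its fields in use).
struct InvertedElem {
    std::uint64_t doc_id;
    std::string word;
    int weight;
};
using InvertedList = std::vector<InvertedElem>;

// Merged per-document record, mirroring the article's InvertedElemPrint.
struct InvertedElemPrint {
    std::uint64_t doc_id = 0;
    int weight = 0;
    std::vector<std::string> words;
};

// Hypothetical helper: ASCII lowercase copy of a string.
std::string ToLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Look up every query word and fold the hits into one entry per document.
std::unordered_map<std::uint64_t, InvertedElemPrint> Trigger(
    const std::unordered_map<std::string, InvertedList>& inverted_index,
    const std::vector<std::string>& words) {
    std::unordered_map<std::uint64_t, InvertedElemPrint> tokens_map;
    for (const std::string& raw : words) {
        auto it = inverted_index.find(ToLower(raw));
        if (it == inverted_index.end()) continue;  // word not indexed
        for (const InvertedElem& elem : it->second) {
            InvertedElemPrint& item = tokens_map[elem.doc_id];
            item.doc_id = elem.doc_id;
            item.weight += elem.weight;       // weights of all matched words add up
            item.words.push_back(elem.word);  // remember which words matched
        }
    }
    return tokens_map;
}
```

With "boost" indexed for documents 1 and 2 and "filesystem" indexed for document 1 only, the query words "Boost" and "FileSystem" produce two entries, and document 1 carries the sum of both weights.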


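2. Merge and Sort

Because tokens_map is keyed by doc_id, a document matched by several query words appears only once, with its weights summed. The merged entries are moved into a vector and sorted by accumulated weight in descending order.

std::vector<InvertedElemPrint> inverted_list_all;
inverted_list_all.reserve(tokens_map.size());
// Iterate by non-const reference: std::move on a const element would silently copy
for (auto& item : tokens_map)
    inverted_list_all.push_back(std::move(item.second));

// Highest accumulated weight first
std::sort(inverted_list_all.begin(), inverted_list_all.end(),
          [](const InvertedElemPrint& e1, const InvertedElemPrint& e2) {
              return e1.weight > e2.weight;
          });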
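One detail worth noting is that std::unordered_map has no defined iteration order and std::sort is not stable, so documents with equal weight can come out in a different order from run to run. The sketch below adds a tie-break on doc_id to make the ranking deterministic. The tie-break and the RankResults function name are assumptions for this example and are not part of the original code.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Merged per-document record, mirroring the article's InvertedElemPrint.
struct InvertedElemPrint {
    std::uint64_t doc_id = 0;
    int weight = 0;
    std::vector<std::string> words;
};

// Takes the map by value so its entries can be moved out safely.
std::vector<InvertedElemPrint> RankResults(
    std::unordered_map<std::uint64_t, InvertedElemPrint> tokens_map) {
    std::vector<InvertedElemPrint> ranked;
    ranked.reserve(tokens_map.size());
    for (auto& kv : tokens_map)
        ranked.push_back(std::move(kv.second));

    std::sort(ranked.begin(), ranked.end(),
              [](const InvertedElemPrint& a, const InvertedElemPrint& b) {
                  if (a.weight != b.weight) return a.weight > b.weight;  // heavier first
                  return a.doc_id < b.doc_id;  // assumed tie-break: smaller id first
              });
    return ranked;
}
```

For three documents with weights 4, 9, and 4 and ids 7, 3, and 5, the result is always ordered 3, 5, 7.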

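3. Result Serialization

For each ranked entry, the forward index is queried for the document's title, content, and URL, and the fields are appended to a JSON array. The summary is generated from the first word that matched the document.

Json::Value root;
for (auto& item : inverted_list_all) {
    // Forward index lookup: doc_id -> document info
    ns_index::DocInfo* doc = index->GetForwardIndex(item.doc_id);
    if (doc == nullptr) continue;
    Json::Value elem;
    elem["title"] = doc->title;
    // words is never empty here: an entry exists only because some word matched
    elem["desc"] = GetDesc(doc->content, item.words[0]);
    elem["url"] = doc->url;
    root.append(elem);
}
Json::StyledWriter writer;
*json_string = writer.write(root);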
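To show the shape of the output without depending on jsoncpp, the following sketch serializes the same three fields by hand. This is a swapped-in illustration rather than what the project does: the Hit struct and the EscapeJson and ToJsonArray functions are hypothetical names, and the output is compact rather than the indented form a styled writer produces.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical flat record holding the three fields the article serializes.
struct Hit {
    std::string title;
    std::string desc;
    std::string url;
};

// Escape the characters that JSON strings require to be escaped.
std::string EscapeJson(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {  // remaining control characters
                    char buf[8];
                    std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(c));
                    out += buf;
                } else {
                    out += static_cast<char>(c);  // UTF-8 bytes pass through unchanged
                }
        }
    }
    return out;
}

// Build a compact JSON array of {"title", "desc", "url"} objects.
std::string ToJsonArray(const std::vector<Hit>& hits) {
    std::string out = "[";
    for (std::size_t i = 0; i < hits.size(); ++i) {
        if (i != 0) out += ",";
        out += "{\"title\":\"" + EscapeJson(hits[i].title) +
               "\",\"desc\":\"" + EscapeJson(hits[i].desc) +
               "\",\"url\":\"" + EscapeJson(hits[i].url) + "\"}";
    }
    out += "]";
    return out;
}
```

In practice a JSON library handles this escaping for you, which is the main reason the Searcher code builds a Json::Value instead of concatenating strings.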

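Summary Generation Strategy

GetDesc runs a case-insensitive search for the keyword in the document content and returns a window of up to 50 bytes before and 100 bytes after the first match, followed by an ellipsis.

std::string GetDesc(const std::string& html_content, const std::string& word) {
    const std::size_t prev_step = 50;
    const std::size_t next_step = 100;
    // Case-insensitive search; unsigned char keeps std::tolower well-defined for non-ASCII bytes
    auto iter = std::search(html_content.begin(), html_content.end(),
                            word.begin(), word.end(),
                            [](unsigned char x, unsigned char y) {
                                return std::tolower(x) == std::tolower(y);
                            });
    if (iter == html_content.end()) return "None1";
    const std::size_t pos = std::distance(html_content.begin(), iter);
    // Clamp the window to the bounds of the content
    const std::size_t start = pos > prev_step ? pos - prev_step : 0;
    const std::size_t end = std::min(pos + next_step, html_content.size());
    if (start >= end) return "None2";
    std::string desc = html_content.substr(start, end - start);
    desc += "...";
    return desc;
}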
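The windowing logic can be checked on its own with a free function. The sketch below follows the same approach with unsigned arithmetic; the MakeSnippet name and the configurable window sizes are assumptions for this example. Note that the window is measured in bytes, so on UTF-8 text it can cut through the middle of a multi-byte character.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <iterator>
#include <string>

// Return up to prev_step bytes before and next_step bytes after the first
// case-insensitive match of word in content, followed by "...".
std::string MakeSnippet(const std::string& content, const std::string& word,
                        std::size_t prev_step = 50, std::size_t next_step = 100) {
    auto iter = std::search(content.begin(), content.end(),
                            word.begin(), word.end(),
                            [](unsigned char x, unsigned char y) {
                                return std::tolower(x) == std::tolower(y);
                            });
    if (iter == content.end()) return "None1";  // keyword not found

    const std::size_t pos =
        static_cast<std::size_t>(std::distance(content.begin(), iter));
    // Clamp both ends so the window never leaves the string.
    const std::size_t start = pos > prev_step ? pos - prev_step : 0;
    const std::size_t end = std::min(pos + next_step, content.size());
    if (start >= end) return "None2";
    return content.substr(start, end - start) + "...";
}
```

For a short document such as "short text" and the keyword "TEXT", both ends of the window are clamped and the whole string comes back with the ellipsis appended.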

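Core Structure

InvertedElemPrint is the intermediate record used for deduplication: one entry per document, holding the summed weight and every query word that matched it.

struct InvertedElemPrint {
    uint64_t doc_id;                 // document the entry refers to
    int weight;                      // sum of the weights of all matched words
    std::vector<std::string> words;  // query words that hit this document
    InvertedElemPrint() : doc_id(0), weight(0) {}
};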

Complete Code Reference

#pragma once
#include "index.hpp"
#include "usuallytool.hpp"
#include "log.hpp"
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>
#include <boost/algorithm/string.hpp>
#include <jsoncpp/json/json.h>

namespace ns_searcher {

// One merged entry per document: summed weight plus every matched query word
struct InvertedElemPrint {
    uint64_t doc_id;
    int weight;
    std::vector<std::string> words;
    InvertedElemPrint() : doc_id(0), weight(0) {}
};

class Searcher {
private:
    ns_index::Index* index = nullptr;

public:
    Searcher() {}
    ~Searcher() {}

    void InitSearcher(const std::string& input) {
        // Fetch the singleton index object and build the index from the input path
        index = ns_index::Index::GetInstance();
        index->BuildIndex(input);
        LOG1(NORMAL, "Index built successfully...");
    }

    void Search(const std::string& query, std::string* json_string) {
        // 1. Segmentation
        std::vector<std::string> words;
        ns_util::JiebaUtil::CutString(query, &words);

        // 2. Triggering: look up each word and merge hits by doc_id
        std::unordered_map<uint64_t, InvertedElemPrint> tokens_map;
        for (std::string w : words) {  // copied on purpose: w is lowercased in place
            boost::to_lower(w);
            ns_index::InvertedList* inverted_list = index->GetInvertedList(w);
            if (inverted_list == nullptr) continue;
            for (const auto& elem : *inverted_list) {
                auto& item = tokens_map[elem.doc_id];
                item.doc_id = elem.doc_id;
                item.weight += elem.weight;
                item.words.push_back(elem.word);
            }
        }

        // 3. Merge and sort by accumulated weight, highest first
        std::vector<InvertedElemPrint> inverted_list_all;
        inverted_list_all.reserve(tokens_map.size());
        for (auto& item : tokens_map)  // non-const so std::move really moves
            inverted_list_all.push_back(std::move(item.second));
        std::sort(inverted_list_all.begin(), inverted_list_all.end(),
                  [](const InvertedElemPrint& e1, const InvertedElemPrint& e2) {
                      return e1.weight > e2.weight;
                  });

        // 4. Build the JSON result
        Json::Value root;
        for (auto& item : inverted_list_all) {
            ns_index::DocInfo* doc = index->GetForwardIndex(item.doc_id);
            if (doc == nullptr) continue;
            Json::Value elem;
            elem["title"] = doc->title;
            elem["desc"] = GetDesc(doc->content, item.words[0]);
            elem["url"] = doc->url;
            root.append(elem);
        }
        Json::StyledWriter writer;
        *json_string = writer.write(root);
    }

    std::string GetDesc(const std::string& html_content, const std::string& word) {
        const std::size_t prev_step = 50;
        const std::size_t next_step = 100;
        // Case-insensitive search; unsigned char keeps std::tolower well-defined
        auto iter = std::search(html_content.begin(), html_content.end(),
                                word.begin(), word.end(),
                                [](unsigned char x, unsigned char y) {
                                    return std::tolower(x) == std::tolower(y);
                                });
        if (iter == html_content.end()) return "None1";
        const std::size_t pos = std::distance(html_content.begin(), iter);
        // Clamp the window to the bounds of the content
        const std::size_t start = pos > prev_step ? pos - prev_step : 0;
        const std::size_t end = std::min(pos + next_step, html_content.size());
        if (start >= end) return "None2";
        std::string desc = html_content.substr(start, end - start);
        desc += "...";
        return desc;
    }
};

}  // namespace ns_searcher
