C++之基于正倒排索引的Boost搜索引擎项目searcher部分代码及详解

Ne0inhk

26 Mar 2026 — 9 min read

这个searcher.hpp的本质是一种使用其他文件，然后实现自己功能的一种更上层的封装。

它主要实现的是就是他用户的搜索词进行处理，接着根据这个处理结果来返回网页给用户。

1. 单例模式

这边的话我们使用的是单例模式来进行实例化。同时我们建立正倒排索引。

 private: ns_index::Index* index; public: Searcher(){}; ~Searcher(){}; public: void InitSearcher(const std::string& input) { //1 创建（获取）一个index对象 //在这里我们用的是单例模式 index=ns_index::Index::Getinstance(); //2根据对象建立索引 index->BuildIndex(input); //std::cout<<"建立索引成功"<<std::endl; LOG1(NORMAL,"建立索引成功..."); }

2. Search

这个函数的主要流程包括分词、触发、合并排序和构建 JSON 结果四个步骤。

2.1 分词

分词的话我们就是先创建一个words数组，然后使用CutSreing函数把用户提供的关键字分词并交给words。

2.2 触发

这一部分的话获取在单例模式中的倒排索引，通过index->GetInvertedList(w)获取关键词w对应的倒排索引列表，然后把他交给tokens_map，又因为tokens_map是哈希结构，所以会自动实现去重。那个to_lower是用来实现小写化的，因为我们不可以说把用户输入的大写和小写区分成两个关键字。

for(const auto &elem : *inverted_list)里面的item是加了&的，所以实际上加的是tokens_map。

2.3 合并

这边就是把处理好的倒排拉链全部交给inverted_list_all，这边之所以要交给它是因为vector访问更方便，接着把他们根据自身的权重来进行从大到小排序。

2.4 构建 JSON

这一步就是序列化，简单来说就是把我们的内容转变成标准的、线性的、可存储 / 可传输的格式。当然最后也需要通过反序列化来读取正确的内容。这边序列化的代码我们不自己写了，而是通过Json的非标准库来实现。把单个序列化的结果交给总的root。

Json::StyledWriter writer; 这行代码的作用是创建一个 “格式化的 JSON 写入器”，用于将内存中的 JSON 对象（如代码中的root）转换为带缩进、换行的可读性强的 JSON 字符串，最后通过write交给json_string这个字符串。

PS:在这份代码里面一会是获取正排索引，一会是获取倒排索引，但是各位有没有发现，他们都和inverted_list_all有关系，inverted_list_all的本质是对多个倒排拉链进行 “去重、合并、排序” 后的候选文档集合，是连接 “索引查询” 和 “结果返回” 的中间数据结构。

//queue是要搜索的关键字 //json_string是返回给用户的搜索结果 void Search(const std::string& query,std::string* json_string) { //1. 分词 std::vector<std::string>words; ns_util::JiebaUsutl::CutString(query,&words);//所以jieba里面的函数来进行分词 //2. 触发 //ns_index::InvertedList inverted_list_all; std::vector<InvertedElemPrint> inverted_list_all;//用来存放去重后的关键字 std::unordered_map<uint64_t,InvertedElemPrint> tokens_map; //这个哈希的作用是用来去重 for(std::string w:words) { boost::to_lower(w);//这一步的作用是忽略大小写 ns_index::InvertedList* inverted_list=index->GetInvertedList(w); if(inverted_list==nullptr) continue; //inverted_list_all.insert(inverted_list_all.end(),inverted_list->begin(),inverted_list->end()); for(const auto &elem : *inverted_list){ //这个for循环就相当于把这个inverted_list里面的值交给这个哈希 auto &item = tokens_map[elem.doc_id]; //[]:如果存在直接获取，如果不存在新建 //注意：这个item是&，所以实际上加的是tokens_map的[elem.doc_id]，这样写是为了避免重复查找，提高效率 //item一定是doc_id相同的print节点 item.doc_id = elem.doc_id; item.weight += elem.weight; item.words.push_back(elem.word); } } //3. 合并 for(const auto &item : tokens_map)//这边就是把通过哈希去重分好后的词交给inverted_list_all inverted_list_all.push_back(std::move(item.second)); /* std::sort(inverted_list_all.begin(),inverted_list_all.end(), [](const ns_index::InvertedElem& e1,const ns_index::InvertedElem& e2){//根据weight来从大到小排序 return e1.weight>e2.weight; });*/ std::sort(inverted_list_all.begin(), inverted_list_all.end(), [](const InvertedElemPrint &e1, const InvertedElemPrint &e2){ return e1.weight > e2.weight; }); //4. 构建 Json::Value root; for(auto& item:inverted_list_all) { ns_index::DocInfo* doc=index->GetForwardIndex(item.doc_id); if(doc==nullptr) continue; Json::Value elem;//通过json来进行序列化 elem["title"]=doc->title; elem["desc"]=GetDesc(doc->content,item.words[0]); elem["url"]=doc->url;//这是通过json然后可以将root序列化为JSON格式的字符串接着搜索结果以JSON形式返回。 root.append(elem); } // Json::StyledWriter writer; *json_string=writer.write(root);//完成序列化 }

3. GetDesc

这段代码就是寻找摘要，简单来说就是在html_content里面查找是否出现了关键字word。就相当于我们搜索一个词时，浏览器帮我们选出和这个词相关的网页信息的过程。

下面那几个if条件判断就是在确认这条网页信息的程度够不够，如果不够的话就直接全部拿过来，因为如果说没有怎么长我们还前50后100，那就会越界访问。

 //这个函数的作用就是找摘要，比如说关键词K,那么我们就把K前面和后面的一部分内容作为摘要 std::string GetDesc(const std::string& html_content,const std::string& word) { int prev_step=50; int next_step=100; auto iter=std::search(html_content.begin(),html_content.end(),word.begin(),word.end(),[](int x,int y){ return (std::tolower(x)==std::tolower(y)); });//在 html_content 这个字符串中查找 word 这个子序列 //return (std::tolower(x)==std::tolower(y))用来忽略大小写 if(iter==html_content.end()) return "None1"; int pos=std::distance(html_content.begin(),iter);//计算找到的子序列在 html_content 中的起始位置 pos。 if(pos==std::string::npos) return "None1"; int start=0; int end=html_content.size()-1; if(pos-prev_step>start) start=pos-prev_step; if(pos+next_step<end) end=pos+next_step; if(start>=end) return "None2"; std::string desc=html_content.substr(start,end-start); desc+="..."; return desc; }

4. InvertedElemPrint

这个结构体的出现就是为了解决重复文档的问题。作为 “去重与信息聚合” 的载体，解决 “同一文档因匹配多个关键词而重复出现” 的问题。

通过 doc_id（文档唯一标识）作为核心依据，配合哈希表（tokens_map）实现 “同一文档只被记录一次”。当用户查询包含多个关键词时，同一文档可能匹配多个关键词并出现在多个倒排拉链中，InvertedElemPrint 借助哈希表的键唯一性，避免文档在结果中重复出现。

struct InvertedElemPrint{//这个结构体的作用就是用来解决重复文档的问题 uint64_t doc_id; int weight; std::vector<std::string> words; InvertedElemPrint(): doc_id(0), weight(0) {} };

5. 总结

通过 “索引构建” 提前做好数据准备，通过 “查询处理” 高效匹配并排序文档，通过 “结果格式化” 衔接前端展示，最终实现 “用户输入关键词→得到清晰搜索结果” 的完整功能，以下是它的完整代码：

#pragma once #include"index.hpp" #include"usuallytool.hpp" #include<algorithm> #include<jsoncpp/json/json.h> #include"log.hpp" namespace ns_searcher{ struct InvertedElemPrint{//这个结构体的作用就是用来解决重复文档的问题 uint64_t doc_id; int weight; std::vector<std::string> words; InvertedElemPrint(): doc_id(0), weight(0) {} }; class Searcher{ private: ns_index::Index* index; public: Searcher(){}; ~Searcher(){}; public: void InitSearcher(const std::string& input) { //1 创建（获取）一个index对象 //在这里我们用的是单例模式 index=ns_index::Index::Getinstance(); //2根据对象建立索引 index->BuildIndex(input); //std::cout<<"建立索引成功"<<std::endl; LOG1(NORMAL,"建立索引成功..."); } //queue是要搜索的关键字 //json_string是返回给用户的搜索结果 void Search(const std::string& query,std::string* json_string) { //1. 分词 std::vector<std::string>words; ns_util::JiebaUsutl::CutString(query,&words);//所以jieba里面的函数来进行分词 //2. 触发 //ns_index::InvertedList inverted_list_all; std::vector<InvertedElemPrint> inverted_list_all;//用来存放去重后的关键字 std::unordered_map<uint64_t,InvertedElemPrint> tokens_map; //这个哈希的作用是用来去重 for(std::string w:words) { boost::to_lower(w);//这一步的作用是忽略大小写 ns_index::InvertedList* inverted_list=index->GetInvertedList(w); if(inverted_list==nullptr) continue; //inverted_list_all.insert(inverted_list_all.end(),inverted_list->begin(),inverted_list->end()); for(const auto &elem : *inverted_list){ //这个for循环就相当于把这个inverted_list里面的值交给这个哈希 auto &item = tokens_map[elem.doc_id]; //[]:如果存在直接获取，如果不存在新建 //注意：这个item是&，所以实际上加的是tokens_map的[elem.doc_id]，这样写是为了避免重复查找，提高效率 //item一定是doc_id相同的print节点 item.doc_id = elem.doc_id; item.weight += elem.weight; item.words.push_back(elem.word); } } //3. 合并 for(const auto &item : tokens_map)//这边就是把通过哈希去重分好后的词交给inverted_list_all inverted_list_all.push_back(std::move(item.second)); /* std::sort(inverted_list_all.begin(),inverted_list_all.end(), [](const ns_index::InvertedElem& e1,const ns_index::InvertedElem& e2){//根据weight来从大到小排序 return e1.weight>e2.weight; });*/ std::sort(inverted_list_all.begin(), inverted_list_all.end(), [](const InvertedElemPrint &e1, const InvertedElemPrint &e2){ return e1.weight > e2.weight; }); //4. 构建 Json::Value root; for(auto& item:inverted_list_all) { ns_index::DocInfo* doc=index->GetForwardIndex(item.doc_id); if(doc==nullptr) continue; Json::Value elem;//通过json来进行序列化 elem["title"]=doc->title; elem["desc"]=GetDesc(doc->content,item.words[0]); elem["url"]=doc->url;//这是通过json然后可以将root序列化为JSON格式的字符串接着搜索结果以JSON形式返回。 root.append(elem); } // Json::StyledWriter writer; *json_string=writer.write(root);//完成序列化 } //这个函数的作用就是找摘要，比如说关键词K,那么我们就把K前面和后面的一部分内容作为摘要 std::string GetDesc(const std::string& html_content,const std::string& word) { int prev_step=50; int next_step=100; auto iter=std::search(html_content.begin(),html_content.end(),word.begin(),word.end(),[](int x,int y){ return (std::tolower(x)==std::tolower(y)); });//在 html_content 这个字符串中查找 word 这个子序列 //return (std::tolower(x)==std::tolower(y))用来忽略大小写 if(iter==html_content.end()) return "None1"; int pos=std::distance(html_content.begin(),iter);//计算找到的子序列在 html_content 中的起始位置 pos。 if(pos==std::string::npos) return "None1"; int start=0; int end=html_content.size()-1; if(pos-prev_step>start) start=pos-prev_step; if(pos+next_step<end) end=pos+next_step; if(start>=end) return "None2"; std::string desc=html_content.substr(start,end-start); desc+="..."; return desc; } }; }

C++之基于正倒排索引的Boost搜索引擎项目searcher部分代码及详解

Ne0inhk

1. 单例模式

2. Search

2.1 分词

2.2 触发

2.3 合并

2.4 构建 JSON

3. GetDesc

4. InvertedElemPrint

5. 总结

Read more

人工智能赋能传统医疗设施设备改造：路径、挑战与未来展望

人工智能 Gemini 2.5 Pro：深度解析技术突破与实战应用

【保姆级教程】告别命令行！ClawX：可视化AI智能体，小白也能轻松玩出花！

2026 AI“龙虾”大战！OpenClaw、MaxClaw、AutoClaw、QClaw、ArkClaw、KimiClaw、LobsterAI等9款产品横评 + 场景推荐，谁值得你“养”？