C++ Search Engine Core: A Walkthrough of the Searcher Implementation Built on Forward and Inverted Indexes

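Summary (AI-generated): This module implements the core query logic of a C++ search engine built on forward and inverted indexes. It manages the global index object through a singleton, tokenizes user input with Jieba, and uses a hash table to deduplicate documents matched by multiple keywords while accumulating their weights. It then fetches document details from the forward index and serializes them to JSON, with a summary-extraction step that makes results easier to display. The overall flow has four stages: tokenization, triggering, merging and sorting, and serialization.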

RefactorPro · Published 2026/3/16 · Updated 2026/4/24 · 3 views

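searcher.hpp is the upper-layer wrapper module that sits between the underlying index and user requests. It processes the user's search terms and returns the matching web page information.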

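1. Singleton design

The index object is obtained through a singleton, which guarantees a single instance across the program. Searcher holds a pointer to that instance and uses it to build the forward and inverted indexes.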

private:
    ns_index::Index* index = nullptr;
public:
    Searcher(){}
    ~Searcher(){}
public:
    void InitSearcher(const std::string& input) {
        // 1. Get the global index instance through the singleton accessor
        index = ns_index::Index::Getinstance();
        // 2. Build the forward and inverted indexes from the input
        index->BuildIndex(input);
        LOG1(NORMAL, "Index built successfully...");
    }
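
For readers who have not seen index.hpp, the accessor called above might look roughly like the following. This is a minimal sketch and not the project's actual class: the Index here is a hypothetical stand-in, its BuildIndex only records the input path, and Source() is an invented helper added for illustration. The sketch relies on the C++11 guarantee that a function-local static is initialized exactly once, even with concurrent callers.

```cpp
#include <string>

// Hypothetical stand-in for ns_index::Index; the real class lives in index.hpp.
class Index {
public:
    // Returns the single shared instance. A function-local static is
    // initialized once, on first use, in a thread-safe way since C++11.
    static Index* Getinstance() {
        static Index instance;
        return &instance;
    }

    // Copying would create a second instance, so it is disabled.
    Index(const Index&) = delete;
    Index& operator=(const Index&) = delete;

    // Placeholder: only records the input path. The real method is
    // expected to build the forward and inverted indexes.
    bool BuildIndex(const std::string& input) {
        source_ = input;
        return true;
    }

    // Invented helper so the sketch has something observable.
    const std::string& Source() const { return source_; }

private:
    Index() = default;  // private constructor: only Getinstance() can create it
    std::string source_;
};
```

Every call to Index::Getinstance() returns the same pointer, so any Searcher that calls InitSearcher ends up sharing one index.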

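2. Query flow (Search)

The core logic of this function has four steps: tokenization, triggering, merging and sorting, and building the JSON result.

2.1 Tokenization

First, create a vector of strings and use the Jieba utility to split the user's query into words.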

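2.2 Triggering

Using the index obtained from the singleton, call index->GetInvertedList(w) to fetch the inverted list for each keyword. Because tokens_map is a hash map keyed by doc_id, a document hit by several keywords occupies a single entry, so results are deduplicated as they are inserted. Note that to_lower normalizes each word to lowercase so that a difference in case does not cause a miss.

While traversing an inverted list, item is bound by reference to the entry in tokens_map, so each element needs only one lookup and every update is written straight into the map.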
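
The accumulation step can be tried in isolation. The sketch below assumes an InvertedElem with doc_id, word, and weight fields (the fields the Search code reads; the real definition is in index.hpp) and uses a hypothetical helper, Accumulate, that takes already-fetched lists instead of calling the index.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Assumed layout of one posting; only the fields Search reads are modeled.
struct InvertedElem {
    uint64_t doc_id;
    std::string word;
    int weight;
};

// Per-document accumulator, mirroring InvertedElemPrint from searcher.hpp.
struct InvertedElemPrint {
    uint64_t doc_id = 0;
    int weight = 0;
    std::vector<std::string> words;
};

using InvertedList = std::vector<InvertedElem>;

// Hypothetical helper: folds several inverted lists into one entry per doc_id.
std::unordered_map<uint64_t, InvertedElemPrint>
Accumulate(const std::vector<InvertedList>& lists) {
    std::unordered_map<uint64_t, InvertedElemPrint> tokens_map;
    for (const auto& list : lists) {
        for (const auto& elem : list) {
            // operator[] default-constructs the entry on first sight of a
            // doc_id and returns a reference to it, so one lookup is enough.
            auto& item = tokens_map[elem.doc_id];
            item.doc_id = elem.doc_id;
            item.weight += elem.weight;       // sum weights across keywords
            item.words.push_back(elem.word);  // remember which words matched
        }
    }
    return tokens_map;
}
```

If document 1 appears in the lists of two keywords with weights 5 and 4, it ends up as one entry with weight 9 and two recorded words.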

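2.3 Merging

The accumulated entries are then handed over to inverted_list_all. They are moved into a vector because std::sort requires random-access iterators, which an unordered_map does not provide, and because a vector is more convenient to traverse afterwards. The vector is then sorted by weight in descending order.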
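
A minimal sketch of the merge-and-sort step follows. RankCandidates is a hypothetical name, and the tie-break on doc_id is an addition of this sketch (the original comparator compares weight only), included so that equal-weight results come out in a deterministic order.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Mirrors InvertedElemPrint from searcher.hpp.
struct InvertedElemPrint {
    uint64_t doc_id = 0;
    int weight = 0;
    std::vector<std::string> words;
};

// Hypothetical helper: drains the map into a vector and ranks it.
std::vector<InvertedElemPrint>
RankCandidates(std::unordered_map<uint64_t, InvertedElemPrint> tokens_map) {
    std::vector<InvertedElemPrint> inverted_list_all;
    inverted_list_all.reserve(tokens_map.size());

    // Iterate by non-const reference so std::move really moves the
    // words vector instead of silently copying it.
    for (auto& kv : tokens_map) {
        inverted_list_all.push_back(std::move(kv.second));
    }

    std::sort(inverted_list_all.begin(), inverted_list_all.end(),
              [](const InvertedElemPrint& a, const InvertedElemPrint& b) {
                  if (a.weight != b.weight) return a.weight > b.weight;  // heavier first
                  return a.doc_id < b.doc_id;  // tie-break added by this sketch
              });
    return inverted_list_all;
}
```

The map is taken by value so the caller can pass it with std::move once accumulation is finished.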

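2.4 Building the JSON

This step is serialization: converting in-memory data into a standard linear format for transmission. The JsonCpp library does this for us, so no hand-written serialization logic is needed.

Json::StyledWriter writer; creates a formatting writer that converts the root object into an indented, human-readable string, which is finally written to json_string.

Tip: forward and inverted index operations appear throughout the code. inverted_list_all is essentially the candidate document set obtained by deduplicating, merging, and sorting several inverted lists. It is the key intermediate data structure connecting index lookup to result output.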

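// query is the search term entered by the user; json_string receives the search results returned to the user
void Search(const std::string& query, std::string* json_string) {
    // 1. Tokenize: split the query into words
    std::vector<std::string> words;
    ns_util::JiebaUtil::CutString(query, &words);

    // 2. Trigger: look up each word and accumulate hits per document
    std::vector<InvertedElemPrint> inverted_list_all;
    std::unordered_map<uint64_t, InvertedElemPrint> tokens_map;

    for (std::string w : words) {
        boost::to_lower(w); // normalize case before the lookup
        ns_index::InvertedList* inverted_list = index->GetInvertedList(w);
        if (inverted_list == nullptr) continue;

        for (const auto &elem : *inverted_list) {
            auto &item = tokens_map[elem.doc_id];
            item.doc_id = elem.doc_id;
            item.weight += elem.weight;
            item.words.push_back(elem.word);
        }
    }

    // 3. Merge: move the deduplicated entries into a vector
    // (non-const reference, so std::move actually moves)
    for (auto &item : tokens_map) {
        inverted_list_all.push_back(std::move(item.second));
    }

    // Sort by weight, highest first
    std::sort(inverted_list_all.begin(), inverted_list_all.end(),
              [](const InvertedElemPrint &e1, const InvertedElemPrint &e2){
                  return e1.weight > e2.weight;
              });

    // 4. Build the JSON result from the forward index
    Json::Value root;
    for (auto& item : inverted_list_all) {
        ns_index::DocInfo* doc = index->GetForwardIndex(item.doc_id);
        if (doc == nullptr) continue;

        Json::Value elem;
        elem["title"] = doc->title;
        elem["desc"] = GetDesc(doc->content, item.words[0]);
        elem["url"] = doc->url;
        root.append(elem);
    }

    Json::StyledWriter writer;
    *json_string = writer.write(root);
}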

Full code reference:

#pragma once

#include "index.hpp"
#include "usuallytool.hpp"
#include "log.hpp"
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>
#include <boost/algorithm/string.hpp>
#include <jsoncpp/json/json.h>

namespace ns_searcher {

    // One candidate document: summed weight plus the keywords that hit it
    struct InvertedElemPrint {
        uint64_t doc_id;
        int weight;
        std::vector<std::string> words;
        InvertedElemPrint() : doc_id(0), weight(0) {}
    };

    class Searcher {
    private:
        ns_index::Index* index = nullptr;
    public:
        Searcher(){}
        ~Searcher(){}

        void InitSearcher(const std::string& input) {
            index = ns_index::Index::Getinstance();
            index->BuildIndex(input);
            LOG1(NORMAL, "Index built successfully...");
        }

        void Search(const std::string& query, std::string* json_string) {
            // 1. Tokenize
            std::vector<std::string> words;
            ns_util::JiebaUtil::CutString(query, &words);

            // 2. Trigger: accumulate hits per document
            std::vector<InvertedElemPrint> inverted_list_all;
            std::unordered_map<uint64_t, InvertedElemPrint> tokens_map;
            for (std::string w : words) {
                boost::to_lower(w);
                ns_index::InvertedList* inverted_list = index->GetInvertedList(w);
                if (inverted_list == nullptr) continue;
                for (const auto &elem : *inverted_list) {
                    auto &item = tokens_map[elem.doc_id];
                    item.doc_id = elem.doc_id;
                    item.weight += elem.weight;
                    item.words.push_back(elem.word);
                }
            }

            // 3. Merge and sort (non-const reference, so std::move actually moves)
            for (auto &item : tokens_map) {
                inverted_list_all.push_back(std::move(item.second));
            }
            std::sort(inverted_list_all.begin(), inverted_list_all.end(),
                      [](const InvertedElemPrint &e1, const InvertedElemPrint &e2){
                          return e1.weight > e2.weight;
                      });

            // 4. Build the JSON result
            Json::Value root;
            for (auto& item : inverted_list_all) {
                ns_index::DocInfo* doc = index->GetForwardIndex(item.doc_id);
                if (doc == nullptr) continue;
                Json::Value elem;
                elem["title"] = doc->title;
                elem["desc"] = GetDesc(doc->content, item.words[0]);
                elem["url"] = doc->url;
                root.append(elem);
            }
            Json::StyledWriter writer;
            *json_string = writer.write(root);
        }

        // Returns a window of text around the first case-insensitive match of word
        std::string GetDesc(const std::string& html_content, const std::string& word) {
            int prev_step = 50;   // characters kept before the match
            int next_step = 100;  // characters kept after the match start
            // unsigned char avoids undefined behavior in std::tolower for non-ASCII bytes
            auto iter = std::search(html_content.begin(), html_content.end(),
                                    word.begin(), word.end(),
                                    [](unsigned char x, unsigned char y){
                                        return (std::tolower(x) == std::tolower(y));
                                    });
            if (iter == html_content.end()) return "None1";
            int pos = std::distance(html_content.begin(), iter);

            int start = 0;
            int end = html_content.size() - 1;
            if (pos - prev_step > start) start = pos - prev_step;
            if (pos + next_step < end) end = pos + next_step;
            if (start >= end) return "None2";

            std::string desc = html_content.substr(start, end - start);
            desc += "...";
            return desc;
        }
    };
}
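
The window arithmetic in GetDesc can also be written with unsigned sizes, which avoids mixing int with std::string::size_type. The following is a sketch under stated assumptions rather than a drop-in replacement: ExtractSnippet is a hypothetical name, it returns an empty string instead of "None1"/"None2" when nothing matches, and its window can reach the final character of the content.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical variant of GetDesc: returns up to prev_step characters before
// the first case-insensitive match of word and up to next_step characters
// from the match start, followed by "...". Returns "" when there is no match.
std::string ExtractSnippet(const std::string& content, const std::string& word,
                           std::size_t prev_step = 50, std::size_t next_step = 100) {
    if (word.empty()) return "";

    // Compare bytes case-insensitively; unsigned char keeps std::tolower
    // well-defined for non-ASCII bytes such as UTF-8 sequences.
    auto iter = std::search(content.begin(), content.end(),
                            word.begin(), word.end(),
                            [](unsigned char x, unsigned char y) {
                                return std::tolower(x) == std::tolower(y);
                            });
    if (iter == content.end()) return "";

    const std::size_t pos =
        static_cast<std::size_t>(std::distance(content.begin(), iter));

    // Clamp the window to the bounds of the string without going negative.
    const std::size_t start = pos > prev_step ? pos - prev_step : 0;
    const std::size_t end = std::min(content.size(), pos + next_step);

    return content.substr(start, end - start) + "...";
}
```

Because the window is counted in bytes, a multi-byte UTF-8 character at either edge may be cut in half; the same is true of the original GetDesc.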
