C++ 搜索引擎 Searcher 模块源码解析：正倒排索引实现

C++ 搜索引擎 Searcher 模块源码解析正倒排索引实现。该模块负责处理用户搜索请求，通过单例模式管理索引，利用 Jieba 分词并结合倒排索引获取候选文档。核心流程涵盖触发查询、哈希去重合并权重、按权值排序及构建正向索引返回结果。摘要生成函数支持关键词定位与上下文截取，最终通过 JsonCpp 将搜索结果序列化为标准 JSON 格式返回给前端。涉及数据结构优化与内存管理细节。

涅槃凤凰发布于 2026/3/29更新于 2026/4/252 浏览

Searcher 类作为上层封装，依赖底层索引文件实现具体业务逻辑。核心职责是处理用户搜索词，并根据匹配结果返回网页信息。

1. 单例模式

初始化阶段采用单例模式管理索引对象，确保全局唯一性并构建正倒排索引。

private:
    ns_index::Index* index;
public:
    Searcher(){}
    ~Searcher(){}
public:
    void InitSearcher(const std::string& input) {
        // 1. 创建（获取）一个 index 对象
        // 在这里我们用的是单例模式
        index = ns_index::Index::GetInstance();
        // 2. 根据对象建立索引
        index->BuildIndex(input);
        LOG1(NORMAL, "建立索引成功...");
    }

2. Search

该函数的主要流程包括分词、触发查询、合并排序和构建 JSON 结果四个步骤。

2.1 分词

首先创建字符串向量存储分词结果，使用 Jieba 工具将用户提供的关键字分词并交给 words 容器。

2.2 触发

从单例索引中获取倒排列表，通过 index->GetInvertedList(w) 获取关键词对应的倒排索引列表，存入哈希表 tokens_map。由于哈希结构特性，可自动实现去重。调用 boost::to_lower 统一转为小写，避免大小写区分导致的关键字失效。

遍历倒排列表时，元素引用直接操作哈希表中的节点，避免重复查找提高效率。

2.3 合并

将处理好的倒排数据统一存入 inverted_list_all，利用 vector 访问更方便的特性，接着根据权重进行从大到小排序。

注：代码中频繁提及正排与倒排索引，但核心都关联到 inverted_list_all。它是对多个倒排拉链进行'去重、合并、排序'后的候选文档集合，是连接'索引查询'和'结果返回'的中间数据结构。

2.4 构建 JSON

最后进行序列化，将内存中的 JSON 对象转换为标准传输格式。此处使用 JsonCpp 库的非标准库实现，无需手动编写序列化逻辑。将单个序列化的结果追加到根节点 root，最终通过 StyledWriter 写入可读性强的 JSON 字符串。

// query 为搜索关键字，json_string 是返回给用户的搜索结果
void Search(const std::string& query, std::string* json_string) {
    // 1. 分词
    std::vector<std::string> words;
    ns_util::JiebaUtil::CutString(query, &words);

    // 2. 触发
    std::vector<InvertedElemPrint> inverted_list_all; // 用来存放去重后的关键字
    std::unordered_map<uint64_t, InvertedElemPrint> tokens_map; // 哈希用于去重

    for (std::string w : words) {
        boost::to_lower(w); // 忽略大小写
        ns_index::InvertedList* inverted_list = index->GetInvertedList(w);
        if (inverted_list == nullptr) continue;

        for (const auto& elem : *inverted_list) {
            auto& item = tokens_map[elem.doc_id];
            // []:如果存在直接获取，如果不存在新建
            // item 一定是 doc_id 相同的 print 节点
            item.doc_id = elem.doc_id;
            item.weight += elem.weight;
            item.words.push_back(elem.word);
        }
    }

    // 3. 合并
    for (const auto& item : tokens_map)
        inverted_list_all.push_back(std::move(item.second));

    std::sort(inverted_list_all.begin(), inverted_list_all.end(),
              [](const InvertedElemPrint& e1, const InvertedElemPrint& e2) {
                  return e1.weight > e2.weight;
              });

    // 4. 构建
    Json::Value root;
    for (auto& item : inverted_list_all) {
        ns_index::DocInfo* doc = index->GetForwardIndex(item.doc_id);
        if (doc == nullptr) continue;
        Json::Value elem;
        elem["title"] = doc->title;
        elem["desc"] = GetDesc(doc->content, item.words[0]);
        elem["url"] = doc->url;
        root.append(elem);
    }

    Json::StyledWriter writer;
    *json_string = writer.write(root);
}

3. GetDesc

此函数负责生成摘要，即在 HTML 内容中查找是否出现关键词。若找到，截取关键词前后指定长度的内容；若未找到或长度不足，则返回空值。边界检查防止越界访问。

std::string GetDesc(const std::string& html_content, const std::string& word) {
    int prev_step = 50;
    int next_step = 100;
    
    // 在 html_content 中查找 word，忽略大小写
    auto iter = std::search(html_content.begin(), html_content.end(), word.begin(), word.end(),
                            [](int x, int y) { return (std::tolower(x) == std::tolower(y)); });
    
    if (iter == html_content.end()) return "None1";
    int pos = std::distance(html_content.begin(), iter);
    
    if (pos == std::string::npos) return "None1";
    
    int start = 0;
    int end = html_content.size() - 1;
    
    if (pos - prev_step > start) start = pos - prev_step;
    if (pos + next_step < end) end = pos + next_step;
    
    if (start >= end) return "None2";
    
    std::string desc = html_content.substr(start, end - start);
    desc += "...";
    return desc;
}

4. InvertedElemPrint

该结构体用于解决文档重复问题。作为'去重与信息聚合'的载体，当同一文档匹配多个关键词时，借助哈希表的键唯一性，避免文档在结果中重复出现。

struct InvertedElemPrint {
    uint64_t doc_id;
    int weight;
    std::vector<std::string> words;
    InvertedElemPrint() : doc_id(0), weight(0) {}
};

5. 完整源码参考

通过'索引构建'提前做好数据准备，通过'查询处理'高效匹配并排序文档，通过'结果格式化'衔接前端展示，最终实现'用户输入关键词→得到清晰搜索结果'的完整功能。

#pragma once
#include "index.hpp"
#include "usuallytool.hpp"
#include <algorithm>
#include <jsoncpp/json/json.h>
#include "log.hpp"

namespace ns_searcher {

struct InvertedElemPrint {
    uint64_t doc_id;
    int weight;
    std::vector<std::string> words;
    InvertedElemPrint() : doc_id(0), weight(0) {}
};

class Searcher {
private:
    ns_index::Index* index;
public:
    Searcher() {}
    ~Searcher() {}

public:
    void InitSearcher(const std::string& input) {
        index = ns_index::Index::GetInstance();
        index->BuildIndex(input);
        LOG1(NORMAL, "建立索引成功...");
    }

    void Search(const std::string& query, std::string* json_string) {
        std::vector<std::string> words;
        ns_util::JiebaUtil::CutString(query, &words);

        std::vector<InvertedElemPrint> inverted_list_all;
        std::unordered_map<uint64_t, InvertedElemPrint> tokens_map;

        for (std::string w : words) {
            boost::to_lower(w);
            ns_index::InvertedList* inverted_list = index->GetInvertedList(w);
            if (inverted_list == nullptr) continue;

            for (const auto& elem : *inverted_list) {
                auto& item = tokens_map[elem.doc_id];
                item.doc_id = elem.doc_id;
                item.weight += elem.weight;
                item.words.push_back(elem.word);
            }
        }

        for (const auto& item : tokens_map)
            inverted_list_all.push_back(std::move(item.second));

        std::sort(inverted_list_all.begin(), inverted_list_all.end(),
                  [](const InvertedElemPrint& e1, const InvertedElemPrint& e2) {
                      return e1.weight > e2.weight;
                  });

        Json::Value root;
        for (auto& item : inverted_list_all) {
            ns_index::DocInfo* doc = index->GetForwardIndex(item.doc_id);
            if (doc == nullptr) continue;
            Json::Value elem;
            elem["title"] = doc->title;
            elem["desc"] = GetDesc(doc->content, item.words[0]);
            elem["url"] = doc->url;
            root.append(elem);
        }

        Json::StyledWriter writer;
        *json_string = writer.write(root);
    }

    std::string GetDesc(const std::string& html_content, const std::string& word) {
        int prev_step = 50;
        int next_step = 100;
        auto iter = std::search(html_content.begin(), html_content.end(), word.begin(), word.end(),
                                [](int x, int y) { return (std::tolower(x) == std::tolower(y)); });
        if (iter == html_content.end()) return "None1";
        int pos = std::distance(html_content.begin(), iter);
        if (pos == std::string::npos) return "None1";
        int start = 0;
        int end = html_content.size() - 1;
        if (pos - prev_step > start) start = pos - prev_step;
        if (pos + next_step < end) end = pos + next_step;
        if (start >= end) return "None2";
        std::string desc = html_content.substr(start, end - start);
        desc += "...";
        return desc;
    }
};
}

C++ 搜索引擎 Searcher 模块源码解析：正倒排索引实现

涅槃凤凰发布于 2026/3/29更新于 2026/4/252 浏览

private: ns_index::Index* index; public: Searcher(){} ~Searcher(){} public: void InitSearcher(const std::string& input) { // 1. 创建（获取）一个 index 对象 // 在这里我们用的是单例模式 index = ns_index::Index::GetInstance(); // 2. 根据对象建立索引 index->BuildIndex(input); LOG1(NORMAL, "建立索引成功..."); }

// query 为搜索关键字，json_string 是返回给用户的搜索结果 void Search(const std::string& query, std::string* json_string) { // 1. 分词 std::vector<std::string> words; ns_util::JiebaUtil::CutString(query, &words); // 2. 触发 std::vector<InvertedElemPrint> inverted_list_all; // 用来存放去重后的关键字 std::unordered_map<uint64_t, InvertedElemPrint> tokens_map; // 哈希用于去重 for (std::string w : words) { boost::to_lower(w); // 忽略大小写 ns_index::InvertedList* inverted_list = index->GetInvertedList(w); if (inverted_list == nullptr) continue; for (const auto& elem : *inverted_list) { auto& item = tokens_map[elem.doc_id]; // []:如果存在直接获取，如果不存在新建 // item 一定是 doc_id 相同的 print 节点 item.doc_id = elem.doc_id; item.weight += elem.weight; item.words.push_back(elem.word); } } // 3. 合并 for (const auto& item : tokens_map) inverted_list_all.push_back(std::move(item.second)); std::sort(inverted_list_all.begin(), inverted_list_all.end(), [](const InvertedElemPrint& e1, const InvertedElemPrint& e2) { return e1.weight > e2.weight; }); // 4. 构建 Json::Value root; for (auto& item : inverted_list_all) { ns_index::DocInfo* doc = index->GetForwardIndex(item.doc_id); if (doc == nullptr) continue; Json::Value elem; elem["title"] = doc->title; elem["desc"] = GetDesc(doc->content, item.words[0]); elem["url"] = doc->url; root.append(elem); } Json::StyledWriter writer; *json_string = writer.write(root); }

std::string GetDesc(const std::string& html_content, const std::string& word) { int prev_step = 50; int next_step = 100; // 在 html_content 中查找 word，忽略大小写 auto iter = std::search(html_content.begin(), html_content.end(), word.begin(), word.end(), [](int x, int y) { return (std::tolower(x) == std::tolower(y)); }); if (iter == html_content.end()) return "None1"; int pos = std::distance(html_content.begin(), iter); if (pos == std::string::npos) return "None1"; int start = 0; int end = html_content.size() - 1; if (pos - prev_step > start) start = pos - prev_step; if (pos + next_step < end) end = pos + next_step; if (start >= end) return "None2"; std::string desc = html_content.substr(start, end - start); desc += "..."; return desc; }

#pragma once #include "index.hpp" #include "usuallytool.hpp" #include <algorithm> #include <jsoncpp/json/json.h> #include "log.hpp" namespace ns_searcher { struct InvertedElemPrint { uint64_t doc_id; int weight; std::vector<std::string> words; InvertedElemPrint() : doc_id(0), weight(0) {} }; class Searcher { private: ns_index::Index* index; public: Searcher() {} ~Searcher() {} public: void InitSearcher(const std::string& input) { index = ns_index::Index::GetInstance(); index->BuildIndex(input); LOG1(NORMAL, "建立索引成功..."); } void Search(const std::string& query, std::string* json_string) { std::vector<std::string> words; ns_util::JiebaUtil::CutString(query, &words); std::vector<InvertedElemPrint> inverted_list_all; std::unordered_map<uint64_t, InvertedElemPrint> tokens_map; for (std::string w : words) { boost::to_lower(w); ns_index::InvertedList* inverted_list = index->GetInvertedList(w); if (inverted_list == nullptr) continue; for (const auto& elem : *inverted_list) { auto& item = tokens_map[elem.doc_id]; item.doc_id = elem.doc_id; item.weight += elem.weight; item.words.push_back(elem.word); } } for (const auto& item : tokens_map) inverted_list_all.push_back(std::move(item.second)); std::sort(inverted_list_all.begin(), inverted_list_all.end(), [](const InvertedElemPrint& e1, const InvertedElemPrint& e2) { return e1.weight > e2.weight; }); Json::Value root; for (auto& item : inverted_list_all) { ns_index::DocInfo* doc = index->GetForwardIndex(item.doc_id); if (doc == nullptr) continue; Json::Value elem; elem["title"] = doc->title; elem["desc"] = GetDesc(doc->content, item.words[0]); elem["url"] = doc->url; root.append(elem); } Json::StyledWriter writer; *json_string = writer.write(root); } std::string GetDesc(const std::string& html_content, const std::string& word) { int prev_step = 50; int next_step = 100; auto iter = std::search(html_content.begin(), html_content.end(), word.begin(), word.end(), [](int x, int y) { return (std::tolower(x) == std::tolower(y)); }); if (iter == html_content.end()) return "None1"; int pos = std::distance(html_content.begin(), iter); if (pos == std::string::npos) return "None1"; int start = 0; int end = html_content.size() - 1; if (pos - prev_step > start) start = pos - prev_step; if (pos + next_step < end) end = pos + next_step; if (start >= end) return "None2"; std::string desc = html_content.substr(start, end - start); desc += "..."; return desc; } }; }

C++ 搜索引擎 Searcher 模块源码解析：正倒排索引实现

1. 单例模式

2. Search

2.1 分词

2.2 触发

2.3 合并

2.4 构建 JSON

3. GetDesc

4. InvertedElemPrint

5. 完整源码参考

C++ 搜索引擎 Searcher 模块源码解析：正倒排索引实现

1. 单例模式

2. Search

2.1 分词

2.2 触发

2.3 合并

2.4 构建 JSON

3. GetDesc

4. InvertedElemPrint

5. 完整源码参考

更多推荐文章

相关免费在线工具

更多推荐文章

相关免费在线工具

C++ 搜索引擎 Searcher 模块源码解析：正倒排索引实现

1. 单例模式

2. Search

2.1 分词

2.2 触发

2.3 合并

2.4 构建 JSON

3. GetDesc

4. InvertedElemPrint

5. 完整源码参考

C++ 搜索引擎 Searcher 模块源码解析：正倒排索引实现

1. 单例模式

2. Search

2.1 分词

2.2 触发

2.3 合并

2.4 构建 JSON

3. GetDesc

4. InvertedElemPrint

5. 完整源码参考

微信扫一扫，关注极客日志

更多推荐文章

相关免费在线工具

微信扫一扫，关注极客日志

更多推荐文章

相关免费在线工具