C++ 搜索引擎核心：基于正倒排索引的 Searcher 实现详解

综述由AI生成C++ 搜索引擎 Searcher 模块采用单例模式管理倒排索引，通过 Jieba 分词处理用户查询。核心流程涵盖触发匹配、哈希去重合并权重、按权值排序文档，最终序列化为 JSON 返回。摘要生成逻辑基于关键词位置截取上下文。该实现展示了从索引构建到结果展示的关键技术点，重点解决了多关键词匹配时的文档去重与排序问题。

星落发布于 2026/3/21更新于 2026/4/251 浏览

这个 Searcher 模块本质上是对底层索引功能的上层封装，主要负责处理用户搜索词并返回结果。其核心流程包括初始化、分词、触发匹配、合并排序以及构建 JSON 响应。

1. 单例模式与索引初始化

为了高效管理资源，我们采用单例模式来实例化索引对象。在初始化阶段，获取全局唯一的 Index 实例并根据输入数据建立索引。

private:
    ns_index::Index* index;
public:
    Searcher() {}
    ~Searcher() {}

public:
    void InitSearcher(const std::string& input) {
        // 1. 创建（获取）一个 index 对象，使用单例模式
        index = ns_index::Index::GetInstance();
        // 2. 根据对象建立索引
        index->BuildIndex(input);
        LOG1(NORMAL, "建立索引成功...");
    }
};

2. 搜索核心流程

Search 函数是查询的核心，主要包含四个步骤：分词、触发、合并排序和构建 JSON。

2.1 分词

首先将用户输入的查询字符串进行切分。这里使用了 Jieba 工具类将关键字转换为单词数组。

std::vector<std::string> words;
ns_util::JiebaUtil::CutString(query, &words);

2.2 触发与去重

获取倒排索引列表后，我们需要将其映射到哈希表中。这一步非常关键，因为同一个文档可能匹配多个关键词，直接拼接会导致重复。利用 unordered_map 的特性，以 doc_id 为键自动实现去重，同时累加权重。

注意大小写不敏感处理，统一转为小写后再查询。

std::unordered_map<uint64_t, InvertedElemPrint> tokens_map;
for (std::string w : words) {
    boost::to_lower(w); // 忽略大小写
    ns_index::InvertedList* inverted_list = index->GetInvertedList(w);
    if (inverted_list == nullptr) continue;

    for (const auto &elem : *inverted_list) {
        // 通过引用直接操作 map 中的元素，避免重复查找
        auto &item = tokens_map[elem.doc_id];
        item.doc_id = elem.doc_id;
        item.weight += elem.weight;       // 累加权重
        item.words.push_back(elem.word);  // 记录匹配的关键词
    }
}

2.3 合并与排序

将哈希表中的数据转移回向量中，并根据权重从大到小排序。这里的 InvertedElemPrint 结构体专门用于解决重复文档问题，作为'去重与信息聚合'的载体。

std::vector<InvertedElemPrint> inverted_list_all;
for (const auto &item : tokens_map) {
    inverted_list_all.push_back(std::move(item.second));
}

// 按权重降序排列
std::sort(inverted_list_all.begin(), inverted_list_all.end(), 
    [](const InvertedElemPrint &e1, const InvertedElemPrint &e2) {
        return e1.weight > e2.weight;
    });

2.4 构建 JSON 结果

最后一步是序列化。我们遍历排序后的文档列表，获取正排索引信息（标题、内容、URL），并生成摘要，最终组装成 JSON 格式返回给前端。

Json::Value root;
for (auto& item : inverted_list_all) {
    ns_index::DocInfo* doc = index->GetForwardIndex(item.doc_id);
    if (doc == nullptr) continue;

    Json::Value elem;
    elem["title"] = doc->title;
    elem["desc"] = GetDesc(doc->content, item.words[0]); // 生成摘要
    elem["url"] = doc->url;
    root.append(elem);
}

Json::StyledWriter writer;
*json_string = writer.write(root);

3. 摘要生成逻辑

GetDesc 函数的作用是在 HTML 内容中定位关键词，并截取前后一定范围的文本作为摘要。如果未找到关键词或长度不足，则返回默认提示。

std::string GetDesc(const std::string& html_content, const std::string& word) {
    int prev_step = 50;
    int next_step = 100;
    
    // 查找子序列，忽略大小写
    auto iter = std::search(html_content.begin(), html_content.end(), 
        word.begin(), word.end(), 
        [](int x, int y) { return (std::tolower(x) == std::tolower(y)); });

    if (iter == html_content.end()) return "None1";
    
    int pos = std::distance(html_content.begin(), iter);
    if (pos == -1) return "None1"; // 兼容 npos 检查

    int start = 0;
    int end = static_cast<int>(html_content.size()) - 1;
    
    // 计算截取范围，防止越界
    if (pos - prev_step > start) start = pos - prev_step;
    if (pos + next_step < end) end = pos + next_step;
    
    if (start >= end) return "None2";
    
    std::string desc = html_content.substr(start, end - start);
    desc += "...";
    return desc;
}

4. 完整代码参考

以下是整合后的完整 Searcher 类定义，包含了上述所有逻辑及必要的头文件依赖。

#pragma once
#include "index.hpp"
#include "usuallytool.hpp"
#include <algorithm>
#include <jsoncpp/json/json.h>
#include <log.hpp>
#include <boost/algorithm/string.hpp>

namespace ns_searcher {

struct InvertedElemPrint {
    uint64_t doc_id;
    int weight;
    std::vector<std::string> words;
    InvertedElemPrint() : doc_id(0), weight(0) {}
};

class Searcher {
private:
    ns_index::Index* index;

public:
    Searcher() {}
    ~Searcher() {}

    void InitSearcher(const std::string& input) {
        index = ns_index::Index::GetInstance();
        index->BuildIndex(input);
        LOG1(NORMAL, "建立索引成功...");
    }

    void Search(const std::string& query, std::string* json_string) {
        std::vector<std::string> words;
        ns_util::JiebaUtil::CutString(query, &words);

        std::unordered_map<uint64_t, InvertedElemPrint> tokens_map;
        for (std::string w : words) {
            boost::to_lower(w);
            ns_index::InvertedList* inverted_list = index->GetInvertedList(w);
            if (inverted_list == nullptr) continue;

            for (const auto &elem : *inverted_list) {
                auto &item = tokens_map[elem.doc_id];
                item.doc_id = elem.doc_id;
                item.weight += elem.weight;
                item.words.push_back(elem.word);
            }
        }

        std::vector<InvertedElemPrint> inverted_list_all;
        for (const auto &item : tokens_map) {
            inverted_list_all.push_back(std::move(item.second));
        }

        std::sort(inverted_list_all.begin(), inverted_list_all.end(), 
            [](const InvertedElemPrint &e1, const InvertedElemPrint &e2) {
                return e1.weight > e2.weight;
            });

        Json::Value root;
        for (auto& item : inverted_list_all) {
            ns_index::DocInfo* doc = index->GetForwardIndex(item.doc_id);
            if (doc == nullptr) continue;
            Json::Value elem;
            elem["title"] = doc->title;
            elem["desc"] = GetDesc(doc->content, item.words[0]);
            elem["url"] = doc->url;
            root.append(elem);
        }

        Json::StyledWriter writer;
        *json_string = writer.write(root);
    }

    std::string GetDesc(const std::string& html_content, const std::string& word) {
        int prev_step = 50;
        int next_step = 100;
        auto iter = std::search(html_content.begin(), html_content.end(), 
            word.begin(), word.end(), 
            [](int x, int y) { return (std::tolower(x) == std::tolower(y)); });

        if (iter == html_content.end()) return "None1";
        int pos = std::distance(html_content.begin(), iter);
        if (pos == -1) return "None1";

        int start = 0;
        int end = static_cast<int>(html_content.size()) - 1;
        if (pos - prev_step > start) start = pos - prev_step;
        if (pos + next_step < end) end = pos + next_step;
        if (start >= end) return "None2";

        std::string desc = html_content.substr(start, end - start);
        desc += "...";
        return desc;
    }
};

} // namespace ns_searcher

C++ 搜索引擎核心：基于正倒排索引的 Searcher 实现详解

星落发布于 2026/3/21更新于 2026/4/251 浏览

private: ns_index::Index* index; public: Searcher() {} ~Searcher() {} public: void InitSearcher(const std::string& input) { // 1. 创建（获取）一个 index 对象，使用单例模式 index = ns_index::Index::GetInstance(); // 2. 根据对象建立索引 index->BuildIndex(input); LOG1(NORMAL, "建立索引成功..."); } };

std::unordered_map<uint64_t, InvertedElemPrint> tokens_map; for (std::string w : words) { boost::to_lower(w); // 忽略大小写 ns_index::InvertedList* inverted_list = index->GetInvertedList(w); if (inverted_list == nullptr) continue; for (const auto &elem : *inverted_list) { // 通过引用直接操作 map 中的元素，避免重复查找 auto &item = tokens_map[elem.doc_id]; item.doc_id = elem.doc_id; item.weight += elem.weight; // 累加权重 item.words.push_back(elem.word); // 记录匹配的关键词 } }

std::vector<InvertedElemPrint> inverted_list_all; for (const auto &item : tokens_map) { inverted_list_all.push_back(std::move(item.second)); } // 按权重降序排列 std::sort(inverted_list_all.begin(), inverted_list_all.end(), [](const InvertedElemPrint &e1, const InvertedElemPrint &e2) { return e1.weight > e2.weight; });

Json::Value root; for (auto& item : inverted_list_all) { ns_index::DocInfo* doc = index->GetForwardIndex(item.doc_id); if (doc == nullptr) continue; Json::Value elem; elem["title"] = doc->title; elem["desc"] = GetDesc(doc->content, item.words[0]); // 生成摘要 elem["url"] = doc->url; root.append(elem); } Json::StyledWriter writer; *json_string = writer.write(root);

std::string GetDesc(const std::string& html_content, const std::string& word) { int prev_step = 50; int next_step = 100; // 查找子序列，忽略大小写 auto iter = std::search(html_content.begin(), html_content.end(), word.begin(), word.end(), [](int x, int y) { return (std::tolower(x) == std::tolower(y)); }); if (iter == html_content.end()) return "None1"; int pos = std::distance(html_content.begin(), iter); if (pos == -1) return "None1"; // 兼容 npos 检查 int start = 0; int end = static_cast<int>(html_content.size()) - 1; // 计算截取范围，防止越界 if (pos - prev_step > start) start = pos - prev_step; if (pos + next_step < end) end = pos + next_step; if (start >= end) return "None2"; std::string desc = html_content.substr(start, end - start); desc += "..."; return desc; }

#pragma once #include "index.hpp" #include "usuallytool.hpp" #include <algorithm> #include <jsoncpp/json/json.h> #include <log.hpp> #include <boost/algorithm/string.hpp> namespace ns_searcher { struct InvertedElemPrint { uint64_t doc_id; int weight; std::vector<std::string> words; InvertedElemPrint() : doc_id(0), weight(0) {} }; class Searcher { private: ns_index::Index* index; public: Searcher() {} ~Searcher() {} void InitSearcher(const std::string& input) { index = ns_index::Index::GetInstance(); index->BuildIndex(input); LOG1(NORMAL, "建立索引成功..."); } void Search(const std::string& query, std::string* json_string) { std::vector<std::string> words; ns_util::JiebaUtil::CutString(query, &words); std::unordered_map<uint64_t, InvertedElemPrint> tokens_map; for (std::string w : words) { boost::to_lower(w); ns_index::InvertedList* inverted_list = index->GetInvertedList(w); if (inverted_list == nullptr) continue; for (const auto &elem : *inverted_list) { auto &item = tokens_map[elem.doc_id]; item.doc_id = elem.doc_id; item.weight += elem.weight; item.words.push_back(elem.word); } } std::vector<InvertedElemPrint> inverted_list_all; for (const auto &item : tokens_map) { inverted_list_all.push_back(std::move(item.second)); } std::sort(inverted_list_all.begin(), inverted_list_all.end(), [](const InvertedElemPrint &e1, const InvertedElemPrint &e2) { return e1.weight > e2.weight; }); Json::Value root; for (auto& item : inverted_list_all) { ns_index::DocInfo* doc = index->GetForwardIndex(item.doc_id); if (doc == nullptr) continue; Json::Value elem; elem["title"] = doc->title; elem["desc"] = GetDesc(doc->content, item.words[0]); elem["url"] = doc->url; root.append(elem); } Json::StyledWriter writer; *json_string = writer.write(root); } std::string GetDesc(const std::string& html_content, const std::string& word) { int prev_step = 50; int next_step = 100; auto iter = std::search(html_content.begin(), html_content.end(), word.begin(), word.end(), [](int x, int y) { return (std::tolower(x) == std::tolower(y)); }); if (iter == html_content.end()) return "None1"; int pos = std::distance(html_content.begin(), iter); if (pos == -1) return "None1"; int start = 0; int end = static_cast<int>(html_content.size()) - 1; if (pos - prev_step > start) start = pos - prev_step; if (pos + next_step < end) end = pos + next_step; if (start >= end) return "None2"; std::string desc = html_content.substr(start, end - start); desc += "..."; return desc; } }; } // namespace ns_searcher

C++ 搜索引擎核心：基于正倒排索引的 Searcher 实现详解

1. 单例模式与索引初始化

2. 搜索核心流程

2.1 分词

2.2 触发与去重

2.3 合并与排序

2.4 构建 JSON 结果

3. 摘要生成逻辑

4. 完整代码参考

C++ 搜索引擎核心：基于正倒排索引的 Searcher 实现详解

1. 单例模式与索引初始化

2. 搜索核心流程

2.1 分词

2.2 触发与去重

2.3 合并与排序

2.4 构建 JSON 结果

3. 摘要生成逻辑

4. 完整代码参考

更多推荐文章

相关免费在线工具

更多推荐文章

相关免费在线工具

C++ 搜索引擎核心：基于正倒排索引的 Searcher 实现详解

1. 单例模式与索引初始化

2. 搜索核心流程

2.1 分词

2.2 触发与去重

2.3 合并与排序

2.4 构建 JSON 结果

3. 摘要生成逻辑

4. 完整代码参考

C++ 搜索引擎核心：基于正倒排索引的 Searcher 实现详解

1. 单例模式与索引初始化

2. 搜索核心流程

2.1 分词

2.2 触发与去重

2.3 合并与排序

2.4 构建 JSON 结果

3. 摘要生成逻辑

4. 完整代码参考

微信扫一扫，关注极客日志

更多推荐文章

相关免费在线工具

微信扫一扫，关注极客日志

更多推荐文章

相关免费在线工具