C++ Search Engine Based on Forward and Inverted Indexes: The Searcher Module Explained | 极客日志

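This article walks through the core logic of the Searcher module in a C++ search engine: singleton index initialization, query tokenization and index lookup, merging and sorting inverted-index hits, and serializing the results. It focuses on how a hash table deduplicates documents and accumulates their weights, and on the boundary handling used when generating snippets. The complete header file is given at the end for reference.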

By 日志猎手. Published 2026/3/15, updated 2026/6/15.

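Overview

The Searcher module is the top-level wrapper of the search engine. It receives the user's query, calls the underlying index service, and returns the final results serialized in a standard format (JSON). Its core flow has four stages: index initialization, query processing, result merging and sorting, and serialization.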

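Index Initialization

The index is managed as a singleton, so the whole process shares a single instance. Initialization obtains the Index object (creating it on first use) and builds the forward and inverted indexes from the input path.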

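private:
    ns_index::Index* index = nullptr;  // set by InitSearcher
public:
    Searcher() = default;
    ~Searcher() = default;

    void InitSearcher(const std::string& input) {
        // Obtain the singleton index object
        index = ns_index::Index::GetInstance();
        // Build the forward and inverted indexes from the input path
        index->BuildIndex(input);
        LOG1(NORMAL, "Index built successfully...");
    }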
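
The Index class itself is not shown in this article, so the following is only a minimal sketch of how a GetInstance-style singleton is commonly written in C++11 and later. DemoIndex, SourcePath, and the path string are hypothetical names used for illustration; the real ns_index::Index may be implemented differently (for example with a mutex-guarded pointer).

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for ns_index::Index; the real class is not shown here.
class DemoIndex {
public:
    // Function-local static: constructed once on first call,
    // and that initialization is thread-safe since C++11.
    static DemoIndex* GetInstance() {
        static DemoIndex instance;
        return &instance;
    }

    // Placeholder for real index construction; it only records the path.
    bool BuildIndex(const std::string& input) {
        source_path_ = input;
        return !source_path_.empty();
    }

    const std::string& SourcePath() const { return source_path_; }

    // A singleton must not be copied.
    DemoIndex(const DemoIndex&) = delete;
    DemoIndex& operator=(const DemoIndex&) = delete;

private:
    DemoIndex() = default;
    std::string source_path_;
};

// Usage: every caller observes the same object.
inline void SingletonUsageExample() {
    DemoIndex* a = DemoIndex::GetInstance();
    DemoIndex* b = DemoIndex::GetInstance();
    assert(a == b);
    a->BuildIndex("data/raw.txt");  // hypothetical path
    assert(b->SourcePath() == "data/raw.txt");
}
```

Because the instance is shared, state written through one pointer is visible through every other one, which is why InitSearcher only needs to build the index once.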

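Search Flow

The Search function has four steps: tokenization, index lookup ("triggering"), merging and sorting, and building the JSON result.

1. Tokenization

The query string is split into a list of keywords with Jieba.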

std::vector<std::string> words;
ns_util::JiebaUtil::CutString(query, &words);

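2. Lookup and Deduplication

For each keyword, the corresponding posting list is fetched from the inverted index. The same document can match several keywords, so appending hits straight into a vector would produce duplicates. A hash table, tokens_map, keyed by doc_id is used instead: it deduplicates the documents and accumulates their weights. Each keyword is lowercased with boost::to_lower first so that matching ignores case.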

std::unordered_map<uint64_t, InvertedElemPrint> tokens_map;
for(std::string w : words) {
    boost::to_lower(w);
    ns_index::InvertedList* inverted_list = index->GetInvertedList(w);
    if(inverted_list == nullptr) continue;
    
    for(const auto &elem : *inverted_list) {
        auto &item = tokens_map[elem.doc_id];
        item.doc_id = elem.doc_id;
        item.weight += elem.weight;
        item.words.push_back(elem.word);
    }
}
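
To make the aggregation concrete, here is a self-contained sketch that runs the same loop against a toy in-memory inverted index. Posting, Aggregated, ToyInvertedIndex, Aggregate, and MakeToyIndex are hypothetical names for illustration only; they stand in for the project's InvertedList and InvertedElemPrint types, whose full definitions live in index.hpp.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical posting: one (document, word, weight) entry of an inverted list.
struct Posting {
    uint64_t doc_id;
    std::string word;
    int weight;
};

// Hypothetical aggregated hit, analogous to InvertedElemPrint.
struct Aggregated {
    uint64_t doc_id = 0;
    int weight = 0;
    std::vector<std::string> words;
};

using ToyInvertedIndex = std::unordered_map<std::string, std::vector<Posting>>;

// Merge the posting lists of all query words, keyed by doc_id.
inline std::unordered_map<uint64_t, Aggregated> Aggregate(
        const ToyInvertedIndex& index, const std::vector<std::string>& words) {
    std::unordered_map<uint64_t, Aggregated> tokens_map;
    for (const std::string& w : words) {
        auto it = index.find(w);
        if (it == index.end()) continue;  // word is not indexed
        for (const Posting& p : it->second) {
            // operator[] default-constructs the entry on first access,
            // so weight starts at 0 and can be accumulated directly.
            Aggregated& item = tokens_map[p.doc_id];
            item.doc_id = p.doc_id;
            item.weight += p.weight;
            item.words.push_back(p.word);
        }
    }
    return tokens_map;
}

// Toy data: document 1 matches both words, document 2 only "search".
inline ToyInvertedIndex MakeToyIndex() {
    ToyInvertedIndex index;
    index["search"] = {{1, "search", 10}, {2, "search", 3}};
    index["engine"] = {{1, "engine", 5}};
    return index;
}
```

With this toy data, a query for "search engine" yields two entries rather than three: document 1 appears once with the weights of both keywords summed.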

3. Merge and Sort

The aggregated entries in tokens_map are moved into a vector and sorted by accumulated weight in descending order.

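std::vector<InvertedElemPrint> inverted_list_all;
inverted_list_all.reserve(tokens_map.size());
// Iterate by non-const reference: std::move on a const element would silently copy
for(auto &item : tokens_map) {
    inverted_list_all.push_back(std::move(item.second));
}
// Highest accumulated weight first
std::sort(inverted_list_all.begin(), inverted_list_all.end(), [](const InvertedElemPrint &e1, const InvertedElemPrint &e2){
    return e1.weight > e2.weight;
});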
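
One detail the weight-only comparator leaves open is the order of documents with equal weight: std::sort is not stable, and unordered_map iteration order is unspecified, so ties can come out in any order. The sketch below shows one possible way to make the ranking deterministic by breaking ties on doc_id. Hit and RankHits are hypothetical names, and the tie-break rule is an assumption added for illustration, not part of the original module.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified hit: just the fields the ranking needs.
struct Hit {
    uint64_t doc_id;
    int weight;
};

// Move the map's values into a vector and sort them:
// higher weight first, and for equal weights the smaller doc_id first.
inline std::vector<Hit> RankHits(std::unordered_map<uint64_t, Hit> tokens_map) {
    std::vector<Hit> hits;
    hits.reserve(tokens_map.size());
    for (auto& kv : tokens_map) {
        hits.push_back(std::move(kv.second));
    }
    std::sort(hits.begin(), hits.end(), [](const Hit& a, const Hit& b) -> bool {
        if (a.weight != b.weight) return a.weight > b.weight;
        return a.doc_id < b.doc_id;  // deterministic tie-break
    });
    return hits;
}
```

The comparator remains a strict weak ordering, which std::sort requires, because it compares a pair of keys lexicographically.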

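4. Build the JSON Result

Each ranked document is looked up in the forward index and appended to the result as a JSON object with title, desc, and url fields.

Json::Value root;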
for(auto& item : inverted_list_all) {
    ns_index::DocInfo* doc = index->GetForwardIndex(item.doc_id);
    if(doc == nullptr) continue;
    Json::Value elem;
    elem["title"] = doc->title;
    elem["desc"] = GetDesc(doc->content, item.words[0]);
    elem["url"] = doc->url;
    root.append(elem);
}
Json::StyledWriter writer;
*json_string = writer.write(root);

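Snippet Generation Strategy

GetDesc finds the first case-insensitive occurrence of the keyword and returns a window from 50 bytes before the match to 100 bytes after it, clamped to the bounds of the content.

std::string GetDesc(const std::string& html_content, const std::string& word) {
    const std::size_t prev_step = 50;   // bytes kept before the match
    const std::size_t next_step = 100;  // bytes kept from the match onward
    // Case-insensitive search. Taking unsigned char keeps std::tolower
    // well defined for non-ASCII bytes (plain char may be negative).
    auto iter = std::search(html_content.begin(), html_content.end(), word.begin(), word.end(),
        [](unsigned char x, unsigned char y){ return std::tolower(x) == std::tolower(y); });

    if(iter == html_content.end()) return "None1";
    const std::size_t pos = static_cast<std::size_t>(std::distance(html_content.begin(), iter));

    // Clamp the window to [0, size) without unsigned underflow
    const std::size_t start = (pos > prev_step) ? pos - prev_step : 0;
    const std::size_t end = std::min(pos + next_step, html_content.size());

    if(start >= end) return "None2";
    std::string desc = html_content.substr(start, end - start);
    desc += "...";
    return desc;
}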
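
Because the window is measured in bytes, a fixed offset can land in the middle of a multi-byte UTF-8 character, which matters for Chinese content. The following sketch shows one possible way to nudge the window edges onto character boundaries. It assumes the content is valid UTF-8; IsUtf8Continuation, AlignBackward, AlignForward, and SafeWindow are hypothetical helpers, not part of the original module.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool IsUtf8Continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Move pos backward until it points at the first byte of a character.
inline std::size_t AlignBackward(const std::string& s, std::size_t pos) {
    while (pos > 0 && pos < s.size() &&
           IsUtf8Continuation(static_cast<unsigned char>(s[pos]))) {
        --pos;
    }
    return pos;
}

// Move pos forward until it points just past the end of a character.
inline std::size_t AlignForward(const std::string& s, std::size_t pos) {
    while (pos < s.size() &&
           IsUtf8Continuation(static_cast<unsigned char>(s[pos]))) {
        ++pos;
    }
    return pos;
}

// Cut the byte range [start, end) from s, widening it as needed
// so that no multi-byte character is split.
inline std::string SafeWindow(const std::string& s, std::size_t start, std::size_t end) {
    if (end > s.size()) end = s.size();
    if (start > end) start = end;
    start = AlignBackward(s, start);
    end = AlignForward(s, end);
    return s.substr(start, end - start);
}
```

In GetDesc, the final substr call could be replaced by such a helper; the snippet may then be a few bytes longer than the nominal window, but it never begins or ends with a broken character.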

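Data Structure Design

InvertedElemPrint represents one deduplicated search hit: a document, its accumulated weight, and every query keyword that matched it.

struct InvertedElemPrint {
    uint64_t doc_id;                 // document identifier
    int weight;                      // sum of the weights of all matched keywords
    std::vector<std::string> words;  // keywords that matched this document
    InvertedElemPrint(): doc_id(0), weight(0) {}
};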

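Complete Code Reference

#pragma once
#include "index.hpp"
#include "usuallytool.hpp"
#include "log.hpp"
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>
#include <boost/algorithm/string.hpp>
#include <jsoncpp/json/json.h>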

namespace ns_searcher {
    struct InvertedElemPrint {
        uint64_t doc_id;
        int weight;
        std::vector<std::string> words;
        InvertedElemPrint(): doc_id(0), weight(0) {}
    };

    class Searcher {
    private:
        ns_index::Index* index = nullptr;  // set by InitSearcher
    public:
        Searcher() = default;
        ~Searcher() = default;

        void InitSearcher(const std::string& input) {
            index = ns_index::Index::GetInstance();
            index->BuildIndex(input);
            LOG1(NORMAL, "Index built successfully...");
        }

        void Search(const std::string& query, std::string* json_string) {
            std::vector<std::string> words;
            ns_util::JiebaUtil::CutString(query, &words);
            
            std::vector<InvertedElemPrint> inverted_list_all;
            std::unordered_map<uint64_t, InvertedElemPrint> tokens_map;
            
            for(std::string w : words) {
                boost::to_lower(w);
                ns_index::InvertedList* inverted_list = index->GetInvertedList(w);
                if(inverted_list == nullptr) continue;
                
                for(const auto &elem : *inverted_list) {
                    auto &item = tokens_map[elem.doc_id];
                    item.doc_id = elem.doc_id;
                    item.weight += elem.weight;
                    item.words.push_back(elem.word);
                }
            }
            
            // Non-const reference so that std::move actually moves instead of copying
            for(auto &item : tokens_map) {
                inverted_list_all.push_back(std::move(item.second));
            }
            
            std::sort(inverted_list_all.begin(), inverted_list_all.end(), [](const InvertedElemPrint &e1, const InvertedElemPrint &e2){
                return e1.weight > e2.weight;
            });
            
            Json::Value root;
            for(auto& item : inverted_list_all) {
                ns_index::DocInfo* doc = index->GetForwardIndex(item.doc_id);
                if(doc == nullptr) continue;
                Json::Value elem;
                elem["title"] = doc->title;
                elem["desc"] = GetDesc(doc->content, item.words[0]);
                elem["url"] = doc->url;
                root.append(elem);
            }
            
            Json::StyledWriter writer;
            *json_string = writer.write(root);
        }

        std::string GetDesc(const std::string& html_content, const std::string& word) {
            const std::size_t prev_step = 50;   // bytes kept before the match
            const std::size_t next_step = 100;  // bytes kept from the match onward
            // Case-insensitive search; unsigned char keeps std::tolower well defined
            auto iter = std::search(html_content.begin(), html_content.end(), word.begin(), word.end(),
                [](unsigned char x, unsigned char y){ return std::tolower(x) == std::tolower(y); });

            if(iter == html_content.end()) return "None1";
            const std::size_t pos = static_cast<std::size_t>(std::distance(html_content.begin(), iter));

            // Clamp the window to [0, size) without unsigned underflow
            const std::size_t start = (pos > prev_step) ? pos - prev_step : 0;
            const std::size_t end = std::min(pos + next_step, html_content.size());

            if(start >= end) return "None2";
            std::string desc = html_content.substr(start, end - start);
            desc += "...";
            return desc;
        }
    };
}
