C++ Boost Search Engine: Core Implementation of the Forward and Inverted Indexes, Explained

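In a search engine project, the forward index and the inverted index are core components that work together. The forward index maps a document ID to that document's content, while the inverted index maps a keyword back to the list of documents that contain it. The build process normally creates the forward index first and then derives the inverted index from it. When a user submits a query, the system uses the inverted index to locate the relevant documents quickly, then uses the forward index to fetch the information shown in the results.

Note: the indexes are built ahead of time, before any search runs. They are not created on the fly for each query.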
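The build-then-query flow described above can be sketched with a toy index. This is a minimal illustration under stated assumptions, not the project's code: `ToyIndex`, `Add`, and `Search` are hypothetical names, and a whitespace tokenizer stands in for a real word segmenter.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Toy index for illustration only; all names here are hypothetical.
struct ToyIndex {
    std::vector<std::string> forward;                                    // doc ID -> text
    std::unordered_map<std::string, std::vector<std::size_t>> inverted;  // word -> doc IDs

    // Step 1: append the document to the forward index.
    // Step 2: derive the inverted entries from that same document.
    std::size_t Add(const std::string& text) {
        const std::size_t id = forward.size();
        forward.push_back(text);
        std::istringstream words(text);  // whitespace split stands in for a real segmenter
        std::string w;
        while (words >> w) {
            std::vector<std::size_t>& ids = inverted[w];
            if (ids.empty() || ids.back() != id) ids.push_back(id);  // one entry per doc per word
        }
        return id;
    }

    // Query: the inverted index yields doc IDs, the forward index supplies the text.
    std::vector<std::string> Search(const std::string& word) const {
        std::vector<std::string> hits;
        auto it = inverted.find(word);
        if (it == inverted.end()) return hits;
        for (std::size_t id : it->second) {
            assert(id < forward.size());  // every stored ID came from the forward index
            hits.push_back(forward[id]);
        }
        return hits;
    }
};
```

Adding two documents that both contain "boost" and then calling `Search("boost")` returns both texts, in the order they were added.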

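1. Data structure design

1.1 Forward index structure

The forward index stores the fields of each document: its title, content, and URL, together with its document ID.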

// Data structure used by the forward index
typedef struct DocInfo {
    std::string title;   // document title
    std::string content; // document content
    std::string url;     // document URL
    int doc_id;          // document ID
} DocInfo;

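1.2 Inverted index structure

The inverted index records the mapping from a keyword to document IDs, along with a weight for each match. One keyword may appear in many documents, and the resulting list of entries is the so-called "inverted list" (also known as a posting list).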

// One element of the inverted index
struct InvertedElem {
    int doc_id;      // document ID
    std::string word;// keyword
    int weight;      // weight
};

typedef std::vector<InvertedElem> InvertedList;
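To make the weight field concrete, here is a small sketch that fills an inverted list for one keyword. The weighting (10 per title occurrence, 1 per content occurrence) is taken from the complete listing at the end of this article; `ComputeWeight` and `AddPosting` are hypothetical helper names introduced only for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct InvertedElem {
    int doc_id;
    std::string word;
    int weight;
};
typedef std::vector<InvertedElem> InvertedList;

// Hypothetical helper: a title occurrence counts 10, a content occurrence counts 1.
int ComputeWeight(int title_cnt, int content_cnt) {
    assert(title_cnt >= 0 && content_cnt >= 0);  // counts are occurrence tallies
    const int kTitleWeight = 10;
    const int kContentWeight = 1;
    return kTitleWeight * title_cnt + kContentWeight * content_cnt;
}

// Hypothetical helper: append one entry for `word` as it occurs in document `doc_id`.
void AddPosting(InvertedList* list, int doc_id, const std::string& word,
                int title_cnt, int content_cnt) {
    InvertedElem elem;
    elem.doc_id = doc_id;
    elem.word = word;
    elem.weight = ComputeWeight(title_cnt, content_cnt);
    list->push_back(elem);
}
```

For instance, a word that occurs once in a title and three times in the body gets weight 10 * 1 + 1 * 3 = 13, while a word that occurs only twice in the body gets weight 2.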

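2. Core class design and the singleton pattern

2.1 Member variables

The forward index uses std::vector because document IDs are assigned sequentially from 0, so the ID doubles as the vector subscript: lookups take constant time and no separate key has to be stored. The inverted index uses std::unordered_map, a hash table that maps each keyword to its document list with average constant-time lookup.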

private:
    // Forward index: the vector subscript is the document ID
    std::vector<DocInfo> forward_index;
    // Inverted index: hash table keyed by keyword
    std::unordered_map<std::string, InvertedList> inverted_index;

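2.2 Singleton implementation

The index manager is a singleton so that the whole program shares one set of indexes and the data is loaded only once, which avoids duplicated memory use and file I/O. The copy constructor and copy assignment operator are deleted so that no second instance can be created, and a mutex protects the initialization of the instance in a multithreaded environment.

The code uses double-checked locking: it tests whether instance is null once before taking the lock and once more after acquiring it. The first test skips the lock once the instance exists; the second prevents two threads that both passed the first test from each creating an instance.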

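private:
    Index() {};
    Index(const Index&) = delete;
    Index& operator=(const Index&) = delete;

    static Index* instance;
    static std::mutex log;

public:
    ~Index();

    static Index* GetInstance() {
        if (instance == nullptr) {
            log.lock();
            if (instance == nullptr) {
                instance = new Index();
            }
            log.unlock();
        }
        return instance;
    }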
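One caveat about the snippet above: the first `instance == nullptr` test reads a plain pointer without holding the lock, and under the C++11 memory model that is a data race if another thread is writing the pointer at the same moment. A common alternative, which is not the approach used in this article, is a function-local static, whose initialization the language guarantees to be thread-safe. The class below is a stripped-down stand-in for illustration and omits the index containers.

```cpp
// Stand-in class for illustration; the real Index also owns the two index containers.
class Index {
public:
    // Since C++11, a function-local static is initialized exactly once,
    // even when several threads make the first call concurrently.
    static Index& GetInstance() {
        static Index instance;
        return instance;
    }

    // Deleting the copy operations keeps the instance unique.
    Index(const Index&) = delete;
    Index& operator=(const Index&) = delete;

private:
    Index() = default;
};
```

This variant returns a reference instead of a pointer, so callers would write `Index::GetInstance().BuildIndex(path)` rather than using `->`. Whether the change is worth it depends on how much existing code already relies on the pointer-returning interface.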

// Complete listing of the index module
#pragma once

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>
#include <unordered_map>
#include <fstream>
#include <mutex>
#include "usuallytool.hpp"
#include <boost/algorithm/string.hpp>
#include "log.hpp"

namespace ns_index {

typedef struct DocInfo {
    std::string title;
    std::string content;
    std::string url;
    int doc_id;
} DocInfo;

struct InvertedElem {
    int doc_id;
    std::string word;
    int weight;
};

typedef std::vector<InvertedElem> InvertedList;

class Index {
private:
    std::vector<DocInfo> forward_index;
    std::unordered_map<std::string, InvertedList> inverted_index;

    Index() {};
    Index(const Index&) = delete;
    Index& operator=(const Index&) = delete;

    static Index* instance;
    static std::mutex log;

public:
    ~Index();

    static Index* GetInstance() {
        if (instance == nullptr) {
            log.lock();
            if (instance == nullptr) {
                instance = new Index();
            }
            log.unlock();
        }
        return instance;
    }

    // Look up a document by ID; returns nullptr when the ID is out of range.
    DocInfo* GetForwardIndex(uint64_t doc_id) {
        if (doc_id >= forward_index.size()) {
            std::cout << "doc_id out range, error!" << std::endl;
            return nullptr;
        }
        return &forward_index[doc_id];
    }

    // Look up the inverted list of a keyword; returns nullptr when the word is not indexed.
    InvertedList* GetInvertedList(const std::string& word) {
        auto iter = inverted_index.find(word);
        if (iter == inverted_index.end()) {
            std::cout << word << " get error" << std::endl;
            return nullptr;
        }
        return &(iter->second);
    }

    // Build both indexes from the input file, one document per line.
    bool BuildIndex(const std::string& input) {
        std::ifstream in(input, std::ios::in | std::ios::binary);
        if (!in.is_open()) {
            std::cout << input << " open error" << std::endl;
            return false;
        }
        int count = 0;
        std::string line;
        while (std::getline(in, line)) {
            DocInfo* doc = BuildForwardIndex(line);
            if (doc == nullptr) {
                std::cout << "BuildIndex error" << std::endl;
                continue;
            }
            BuildInvertedIndex(*doc);
            count++;
            if (count % 50 == 0)
                LOG1(NORMAL, "Documents indexed so far: " + std::to_string(count));
        }
        return true;
    }

private:
    // Parse one line into title, content, and URL (separated by '\3') and append it.
    DocInfo* BuildForwardIndex(const std::string& line) {
        std::vector<std::string> results;
        ns_util::StringUtil::Split(line, &results, "\3");
        if (results.size() != 3) return nullptr;
        DocInfo doc;
        doc.title = results[0];
        doc.content = results[1];
        doc.url = results[2];
        doc.doc_id = static_cast<int>(forward_index.size()); // ID equals the vector subscript
        forward_index.push_back(doc);
        return &forward_index.back();
    }

    // Count each word in the title and content, then add one weighted entry per word.
    bool BuildInvertedIndex(const DocInfo& doc) {
        struct WordCnt {
            int title_cnt = 0;
            int content_cnt = 0;
        };
        std::unordered_map<std::string, WordCnt> word_map;
        std::vector<std::string> title_words, content_words;

        ns_util::JiebaUtil::CutString(doc.title, &title_words);
        for (auto& tw : title_words) {
            boost::to_lower(tw); // index words in lowercase
            word_map[tw].title_cnt++;
        }

        ns_util::JiebaUtil::CutString(doc.content, &content_words);
        for (auto& cw : content_words) {
            boost::to_lower(cw);
            word_map[cw].content_cnt++;
        }

        // A title occurrence counts 10, a content occurrence counts 1.
        const int kTitleWeight = 10;
        const int kContentWeight = 1;
        for (auto& pair : word_map) {
            InvertedElem item;
            item.doc_id = doc.doc_id;
            item.word = pair.first;
            item.weight = kTitleWeight * pair.second.title_cnt
                        + kContentWeight * pair.second.content_cnt;
            inverted_index[pair.first].push_back(item);
        }
        return true;
    }
};

Index* Index::instance = nullptr;
std::mutex Index::log;

} // namespace ns_index
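The listing calls ns_util::StringUtil::Split, whose source is not shown in this article. The sketch below is a hypothetical stand-in built only on the standard library, to illustrate how a line with '\3'-separated fields becomes a title, content, and URL. `SplitFields`, `ParsedDoc`, and `ParseLine` are names invented for this example, and the real helper may treat empty fields differently.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the project's split helper.
// Splits `line` on a single-character separator and keeps empty fields.
std::vector<std::string> SplitFields(const std::string& line, char sep) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type pos = line.find(sep, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start));  // last field runs to the end of the line
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}

// Hypothetical holder for the three fields of one input line.
struct ParsedDoc {
    std::string title;
    std::string content;
    std::string url;
};

// Mirrors the size check in BuildForwardIndex: exactly three fields, or the line is rejected.
bool ParseLine(const std::string& line, ParsedDoc* out) {
    const std::vector<std::string> fields = SplitFields(line, '\3');
    if (fields.size() != 3) return false;
    out->title = fields[0];
    out->content = fields[1];
    out->url = fields[2];
    return true;
}
```

A line with fewer or more than three fields makes `ParseLine` return false, which corresponds to the case where BuildForwardIndex returns nullptr and BuildIndex skips the line.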
