Implementing GraphRAG with the Milvus Vector Database: A Technical Guide | 极客日志

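Summary (AI-generated): GraphRAG uses a knowledge graph to strengthen retrieval in RAG applications. This article walks through a complete GraphRAG implementation on the Milvus vector database: environment setup, data preparation, index construction, loading entities and relationships, building the search engine, and running test queries. Entity embeddings are stored in Milvus and combined with an LLM for question answering and question generation. It also adds performance-tuning and deployment advice for production environments.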

Published by 女王 on 2025/2/7 · Updated 2026/6/10

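GraphRAG uses a knowledge graph to give RAG applications more precise retrieval over large bodies of text. This article shows how to implement it: how to build the index, and how to query that index with the Milvus vector database.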

1. Prerequisites

Before running the code in this article, make sure the following dependencies are installed:

pip install --upgrade pymilvus
pip install git+https://github.com/zc277584121/graphrag.git

Note: GraphRAG is installed from a fork because, at the time of writing, Milvus storage support had not yet been merged into the official repository.
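
Before moving on, it can help to confirm that both packages are visible to the current interpreter. The sketch below uses only the standard library. The helper name installed_versions is illustrative, and it assumes the fork installs under the distribution name graphrag.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # "graphrag" is the assumed distribution name of the fork installed above.
    for name, ver in installed_versions(["pymilvus", "graphrag"]).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```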

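2. Data preparation

GraphRAG indexing needs a small text file to work on. We download a text of roughly a thousand lines from Project Gutenberg that tells stories about Leonardo da Vinci. From this dataset we build a knowledge-graph index of the relationships around da Vinci, then use the Milvus vector database to retrieve the relevant knowledge and answer questions about him.

The following Python code downloads the text file and trims it: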

import nest_asyncio
nest_asyncio.apply()

import os
import urllib.request

index_root = os.path.join(os.getcwd(), 'graphrag_index')
os.makedirs(os.path.join(index_root, 'input'), exist_ok=True)
url = "https://www.gutenberg.org/cache/epub/7785/pg7785.txt"
file_path = os.path.join(index_root, 'input', 'davinci.txt')
urllib.request.urlretrieve(url, file_path)
with open(file_path, 'r+', encoding='utf-8') as file:
    # Keep only the first 934 lines; the rest is unrelated to this example.
    # To reduce API cost further, truncate the file to fewer lines.
    lines = file.readlines()
    file.seek(0)
    file.writelines(lines[:934])  # Lower this number to reduce API cost.
    file.truncate()
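
Since the comments above tie the line count to API cost, a rough size check before indexing can be useful. The sketch below is an estimate only: the ratio of four characters per token is an assumed rule of thumb for English text, and the helper names are illustrative. The indexing pipeline makes several LLM passes over the text, so the figure is best read as a lower bound on input size, not a cost prediction.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate; chars_per_token is an assumed average."""
    return int(len(text) / chars_per_token)


def summarize_corpus(lines, keep):
    """Compare the full text against its first `keep` lines."""
    kept = lines[:keep]
    return {
        "total_lines": len(lines),
        "kept_lines": len(kept),
        "estimated_tokens": estimate_tokens("".join(kept)),
    }


if __name__ == "__main__":
    # Stand-in data; in practice pass the result of file.readlines().
    sample = ["Leonardo was born in Vinci.\n"] * 10
    print(summarize_corpus(sample, keep=4))
```

For an exact count, a real tokenizer such as tiktoken (imported later in this article) would replace estimate_tokens.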

3. Initializing the workspace

Next, index the text file with GraphRAG. Start by running graphrag.index --init to initialize the workspace:

python -m graphrag.index --init --root ./graphrag_index

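4. Configuring the environment file

The index root directory contains a file named .env. To enable it, add your OpenAI API key to this file.

Notes:

This example uses OpenAI models, so have your API key ready.
GraphRAG indexing is relatively expensive because it runs the entire text corpus through an LLM, so running this demo will cost some money. To reduce the cost, consider truncating the text file.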
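
The query code later in this article reads the key from os.environ["OPENAI_API_KEY"]. If you prefer to keep keys in a .env file rather than exporting them in the shell, a minimal loader might look like the sketch below. It handles only plain KEY=VALUE lines, the function name load_env_file is illustrative, and a dedicated package such as python-dotenv covers more cases.

```python
import os


def load_env_file(path, environ=None):
    """Load simple KEY=VALUE lines into `environ` without overriding existing keys."""
    environ = os.environ if environ is None else environ
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            key, value = key.strip(), value.strip().strip("'\"")
            if key and key not in environ:
                environ[key] = value
                loaded[key] = value
    return loaded
```

A call such as load_env_file(os.path.join(index_root, ".env")) would then make the keys available to the rest of the session. Variables that are already set are left untouched, so a value exported in the shell wins over the file.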

5. Running the indexing pipeline

With the workspace configured, run the indexing pipeline:

python -m graphrag.index --root ./graphrag_index
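
Both command-line steps can also be driven from Python, which is convenient when the rest of the walkthrough runs in a notebook. The sketch below only assembles the flags already shown in the commands above (--init and --root); the helper names build_index_command and run_index are illustrative.

```python
import subprocess
import sys


def build_index_command(root, init=False):
    """Build the argv list for `python -m graphrag.index` with the flags used above."""
    command = [sys.executable, "-m", "graphrag.index"]
    if init:
        command.append("--init")
    command += ["--root", root]
    return command


def run_index(root, init=False):
    """Run the command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_index_command(root, init=init), check=True)


if __name__ == "__main__":
    # Printing only; call run_index("./graphrag_index") to actually start indexing.
    print(" ".join(build_index_command("./graphrag_index", init=True)))
    print(" ".join(build_index_command("./graphrag_index")))
```

Using sys.executable keeps the subprocess in the same environment where the GraphRAG fork was installed.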

6. Querying with the Milvus vector database

import os
import pandas as pd
import tiktoken
from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey
from graphrag.query.indexer_adapters import (
    read_indexer_entities,
    read_indexer_relationships,
    read_indexer_reports,
    read_indexer_text_units,
)
from graphrag.query.input.loaders.dfs import store_entity_semantic_embeddings
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
from graphrag.query.llm.oai.embedding import OpenAIEmbedding
from graphrag.query.llm.oai.typing import OpenaiApiType
from graphrag.query.question_gen.local_gen import LocalQuestionGen
from graphrag.query.structured_search.local_search.mixed_context import LocalSearchMixedContext
from graphrag.query.structured_search.local_search.search import LocalSearch
from graphrag.vector_stores import MilvusVectorStore

7. Loading data from the indexing process

output_dir = os.path.join(index_root, "output")
subdirs = [os.path.join(output_dir, d) for d in os.listdir(output_dir)]
latest_subdir = max(subdirs, key=os.path.getmtime)  # Most recently modified output directory
INPUT_DIR = os.path.join(latest_subdir, "artifacts")
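
Note that os.listdir returns files as well as directories, so a stray file in the output folder could be picked as the "latest" entry. A slightly more defensive variant is sketched below; the helper name latest_run_dir is illustrative.

```python
import os


def latest_run_dir(output_dir):
    """Return the most recently modified subdirectory of output_dir."""
    candidates = [
        entry.path for entry in os.scandir(output_dir) if entry.is_dir()
    ]
    if not candidates:
        raise FileNotFoundError(f"no run directories found in {output_dir!r}")
    return max(candidates, key=os.path.getmtime)
```

With this helper, INPUT_DIR could be built as os.path.join(latest_run_dir(output_dir), "artifacts"), and an empty output folder fails with a clear error instead of a ValueError from max.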

COMMUNITY_REPORT_TABLE = "create_final_community_reports"
ENTITY_TABLE = "create_final_nodes"
ENTITY_EMBEDDING_TABLE = "create_final_entities"
RELATIONSHIP_TABLE = "create_final_relationships"
COVARIATE_TABLE = "create_final_covariates"
TEXT_UNIT_TABLE = "create_final_text_units"
COMMUNITY_LEVEL = 2

entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
entity_embedding_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_EMBEDDING_TABLE}.parquet")

entities = read_indexer_entities(entity_df, entity_embedding_df, COMMUNITY_LEVEL)
description_embedding_store = MilvusVectorStore(collection_name="entity_description_embeddings")
# description_embedding_store.connect(uri="http://localhost:19530")  # For a Milvus Docker service
description_embedding_store.connect(uri="./milvus.db")  # For Milvus Lite
entity_description_embeddings = store_entity_semantic_embeddings(entities=entities, vectorstore=description_embedding_store)
print(f"Entity count: {len(entity_df)}")
entity_df.head()
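
The two connect calls above differ only in the URI: a local file path for Milvus Lite, or an HTTP address for a Milvus server. One way to switch between them without editing code is to read the URI from the environment, as sketched below. The variable name MILVUS_URI and both helper names are hypothetical conventions, not something GraphRAG or pymilvus defines.

```python
import os


def resolve_milvus_uri(environ=None, default="./milvus.db"):
    """Pick the Milvus URI from MILVUS_URI (hypothetical name), else Milvus Lite."""
    environ = os.environ if environ is None else environ
    return environ.get("MILVUS_URI") or default


def is_milvus_lite(uri):
    """Treat anything that is not an http(s) URL as a local Milvus Lite file."""
    return not uri.lower().startswith(("http://", "https://"))
```

The store could then be connected with description_embedding_store.connect(uri=resolve_milvus_uri()).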

relationship_df = pd.read_parquet(f"{INPUT_DIR}/{RELATIONSHIP_TABLE}.parquet")
relationships = read_indexer_relationships(relationship_df)

print(f"Relationship count: {len(relationship_df)}")
relationship_df.head()

report_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")
reports = read_indexer_reports(report_df, entity_df, COMMUNITY_LEVEL)

print(f"Report records: {len(report_df)}")
report_df.head()

text_unit_df = pd.read_parquet(f"{INPUT_DIR}/{TEXT_UNIT_TABLE}.parquet")
text_units = read_indexer_text_units(text_unit_df)

print(f"Text unit records: {len(text_unit_df)}")
text_unit_df.head()
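
If indexing stopped early, some of the parquet files read above may be missing, and pd.read_parquet will fail partway through. A quick pre-check is sketched below. The table names are the constants defined above, while the helper name missing_tables is illustrative. COVARIATE_TABLE is left out because this article never reads it (covariates is passed as None later).

```python
import os

REQUIRED_TABLES = [
    "create_final_nodes",
    "create_final_entities",
    "create_final_relationships",
    "create_final_community_reports",
    "create_final_text_units",
]


def missing_tables(input_dir, tables=REQUIRED_TABLES):
    """Return the table names whose .parquet file is absent from input_dir."""
    return [
        name
        for name in tables
        if not os.path.isfile(os.path.join(input_dir, f"{name}.parquet"))
    ]
```

Calling missing_tables(INPUT_DIR) before loading should return an empty list when the index is complete.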

8. Building the local search engine

api_key = os.environ["OPENAI_API_KEY"]  # Your OpenAI API key
llm_model = "gpt-4o"  # or gpt-4-turbo-preview
embedding_model = "text-embedding-3-small"

llm = ChatOpenAI(
    api_key=api_key,
    model=llm_model,
    api_type=OpenaiApiType.OpenAI,
    max_retries=20,
)
token_encoder = tiktoken.get_encoding("cl100k_base")
text_embedder = OpenAIEmbedding(
    api_key=api_key,
    api_base=None,
    api_type=OpenaiApiType.OpenAI,
    model=embedding_model,
    deployment_name=embedding_model,
    max_retries=20,
)

context_builder = LocalSearchMixedContext(
    community_reports=reports,
    text_units=text_units,
    entities=entities,
    relationships=relationships,
    covariates=None,
    entity_text_embeddings=description_embedding_store,
    embedding_vectorstore_key=EntityVectorStoreKey.ID,
    text_embedder=text_embedder,
    token_encoder=token_encoder,
)

local_context_params = {
    "text_unit_prop": 0.5,
    "community_prop": 0.1,
    "conversation_history_max_turns": 5,
    "conversation_history_user_turns_only": True,
    "top_k_mapped_entities": 10,
    "top_k_relationships": 10,
    "include_entity_rank": True,
    "include_relationship_weight": True,
    "include_community_rank": False,
    "return_candidate_context": False,
    "embedding_vectorstore_key": EntityVectorStoreKey.ID,
    "max_tokens": 12_000,  # Adjust to your model's token limit
}

llm_params = {
    "max_tokens": 2_000,  # Adjust to your model's token limit
    "temperature": 0.0,
}
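
To see what the proportions in local_context_params mean in practice, the arithmetic can be worked through directly. As we understand the local-search context builder, text_unit_prop and community_prop are fractions of the context max_tokens, and whatever remains goes to the entity and relationship tables. The sketch below illustrates that split under this assumption; split_budget is a hypothetical helper, not a GraphRAG function, and the library's own rounding may differ slightly.

```python
def split_budget(max_tokens, text_unit_prop, community_prop):
    """Split a context token budget into three parts by proportion."""
    if text_unit_prop + community_prop > 1:
        raise ValueError("text_unit_prop + community_prop must not exceed 1")
    text_units = round(max_tokens * text_unit_prop)
    community = round(max_tokens * community_prop)
    return {
        "text_units": text_units,
        "community_reports": community,
        "entities_and_relationships": max_tokens - text_units - community,
    }


if __name__ == "__main__":
    # Values taken from local_context_params above.
    print(split_budget(12_000, text_unit_prop=0.5, community_prop=0.1))
```

With the settings above, the 12,000-token context divides into 6,000 tokens for text units, 1,200 for community reports, and 4,800 for entities and relationships. The separate max_tokens of 2,000 in llm_params limits the length of the generated answer, not the context.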

search_engine = LocalSearch(
    llm=llm,
    context_builder=context_builder,
    token_encoder=token_encoder,
    llm_params=llm_params,
    context_builder_params=local_context_params,
    response_type="multiple paragraphs",
)

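9. Running a query

# Top-level await works in Jupyter/IPython; in a plain script, wrap the call in asyncio.run().
result = await search_engine.asearch("Tell me about Leonardo Da Vinci")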
print(result.response)
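
The query above relies on top-level await, which is only available in notebook-style environments. For a plain script, a small wrapper around asyncio.run can do the same job. The sketch below assumes only what the code above already uses: an engine object with an async asearch method whose result exposes a response attribute. The names ask and ask_sync are illustrative.

```python
import asyncio


async def ask(engine, question):
    """Await engine.asearch(question) and return the response text."""
    result = await engine.asearch(question)
    return result.response


def ask_sync(engine, question):
    """Blocking wrapper for scripts, where top-level await is unavailable."""
    return asyncio.run(ask(engine, question))
```

In a script this would be called as print(ask_sync(search_engine, "Tell me about Leonardo Da Vinci")). Inside a notebook an event loop is already running, which is why this article applies nest_asyncio at the start.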

10. Question generation

question_generator = LocalQuestionGen(
    llm=llm,
    context_builder=context_builder,
    token_encoder=token_encoder,
    llm_params=llm_params,
    context_builder_params=local_context_params,
)

question_history = [
    "Tell me about Leonardo Da Vinci",
    "Leonardo's early works",
]

candidate_questions = await question_generator.agenerate(
    question_history=question_history, context_data=None, question_count=5
)
candidate_questions.response
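
The generated questions can be tidied up before showing them to a user. The sketch below assumes candidate_questions.response is a list of strings and that some entries may carry a leading list marker such as "- ". The helper name format_questions is illustrative.

```python
def format_questions(questions):
    """Strip list markers and whitespace, drop empties, and number the rest."""
    cleaned = [q.strip().lstrip("-*• ").strip() for q in questions]
    cleaned = [q for q in cleaned if q]
    return "\n".join(f"{i}. {q}" for i, q in enumerate(cleaned, start=1))
```

It would be used as print(format_questions(candidate_questions.response)).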

# Optional cleanup: uncomment the lines below to delete the index directory.
# import shutil
#
# shutil.rmtree(index_root)
