How GraphRAG Improves LLM Summarization: Principles and Practice

How does GraphRAG improve an LLM's summarization ability?

Figure: GraphRAG architecture diagram

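GraphRAG is a retrieval-augmented generation (RAG) method built on knowledge graphs. Microsoft open-sourced the GraphRAG project in early July, and within about a month it had collected 13k stars. Compared with conventional RAG, GraphRAG does better at high-level summarization across multiple unstructured documents.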

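For example, given a collection of articles on environmental issues, GraphRAG is better at answering questions such as 'What are the five main themes of these articles?' No single document directly answers a question like this, so conventional RAG has nothing obvious to retrieve and struggles with it. Approaches to this kind of question did exist before GraphRAG.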

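One example is the method described in the paper "RAPTOR: RECURSIVE ABSTRACTIVE PROCESSING FOR TREE-ORGANIZED RETRIEVAL", which clusters the documents at several levels of abstraction and then summarizes each cluster, so that the summaries can later be retrieved by RAG.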
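The layered "group, then summarize" idea can be illustrated with a minimal sketch. It is not the RAPTOR implementation: the paper clusters by embedding similarity, whereas this sketch simply groups neighbouring nodes as a stand-in, and `build_summary_tree`, `toy_summarize`, and `group_size` are hypothetical names introduced here. The `summarize` callable is a placeholder for an LLM call.

```python
from typing import Callable, List


def build_summary_tree(
    chunks: List[str],
    group_size: int,
    summarize: Callable[[List[str]], str],
) -> List[List[str]]:
    """Return the tree levels: level 0 holds the raw chunks, the last level the root."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so that each level shrinks")
    levels = [list(chunks)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        # Stand-in for clustering: group neighbouring nodes by position.
        # (Assumption: a real system would cluster by semantic similarity.)
        groups = [current[i:i + group_size] for i in range(0, len(current), group_size)]
        # Each group is condensed into one summary node of the next level.
        levels.append([summarize(group) for group in groups])
    return levels


def toy_summarize(texts: List[str]) -> str:
    """Placeholder for an LLM summarization call: wraps the joined inputs."""
    return "S(" + "+".join(texts) + ")"


levels = build_summary_tree(["a", "b", "c", "d", "e"], group_size=2, summarize=toy_summarize)
for depth, nodes in enumerate(levels):
    print(depth, nodes)
```

Every level of the resulting tree can be indexed for retrieval, so a broad question can match a high-level summary while a detailed question can still match a leaf chunk.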

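This article analyzes and introduces GraphRAG with these questions in mind. The opening part briefly covers the problem GraphRAG solves and the motivation behind its design, the second chapter covers its principles and concepts, and the final part offers some opinions and reflections.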

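GraphRAG does not contain any 'original' innovation. Rather, it cleverly combines techniques that already existed: LLMs, knowledge graphs, community detection and aggregation algorithms, and some ideas borrowed from Map-Reduce. The essence of its design is captured by one sentence from the accompanying paper, "From Local to Global: A Graph RAG Approach to Query-Focused Summarization": 'Use the natural modularity of graphs to partition data for global summarization.'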

The rest of this article expands on that sentence.
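As a rough illustration of "partition, then summarize", the sketch below splits a small entity graph into groups, summarizes each group independently (the map step), and merges the partial results (the reduce step). It rests on loud assumptions: connected components are used as a simplified stand-in for a real community detection algorithm, and `partition_into_components` and `summarize_community` are hypothetical helpers, with the latter standing in for an LLM call.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def partition_into_components(edges: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Split an undirected graph into connected components.

    Assumption: this is only a stand-in for community detection, which would
    also split densely connected regions inside one component.
    """
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen: Set[str] = set()
    components: List[Set[str]] = []
    for start in sorted(adjacency):  # sorted for a deterministic order
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal from the start node
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def summarize_community(entities: Set[str]) -> str:
    """Placeholder for an LLM-written community summary."""
    return ", ".join(sorted(entities))


edges = [("rain", "flood"), ("flood", "dam"), ("solar", "wind")]
communities = partition_into_components(edges)
partial_summaries = [summarize_community(c) for c in communities]  # map step
global_summary = " | ".join(partial_summaries)                     # reduce step
print(global_summary)
```

Because each group is summarized on its own, no single call has to see the whole corpus, which is the property the quoted sentence relies on.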

01 Why GraphRAG?

1.1 What problem does GraphRAG solve?

To quote the article GraphRAG: Advanced Data Retrieval for Enhanced Insights:

Complex Information Traversal: It excels at connecting different pieces of information to provide new, synthesized insights.

Holistic Understanding: It performs better at understanding and summarizing large data collections, offering a more comprehensive grasp of the information.

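The second point, 'Holistic Understanding', is the summarization ability mentioned at the start of this article: the ability to handle QFS (query-focused summarization) tasks, which require high-level summarization and abstraction across many documents. The GraphRAG paper focuses mainly on this point, and an analysis of the code shows that much of the design also revolves around it.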

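The first point is an ability gained from the knowledge graph. **When GraphRAG builds its knowledge graph (covered in detail in the next chapter), it links together information scattered across different articles and passages.** At query time it can therefore retrieve the relevant passages, whereas conventional RAG, lacking a pre-built knowledge graph, cannot always retrieve everything that is needed. For questions that can only be answered by combining several passages, GraphRAG performs better.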
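To make the linking effect concrete, here is a minimal sketch under stated assumptions: the triples, document ids, and the function `documents_within_hops` are all hypothetical, and a real pipeline would extract the triples with an LLM rather than hard-code them. Starting from an entity mentioned in the question, a short graph traversal reaches documents that never mention that entity directly.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set, Tuple

# (subject, relation, object, id of the document the fact came from)
Triple = Tuple[str, str, str, str]


def documents_within_hops(triples: List[Triple], seed: str, max_hops: int) -> Set[str]:
    """Collect source documents of every edge reachable within max_hops of seed."""
    neighbours: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for subject, _relation, obj, doc_id in triples:
        neighbours[subject].append((obj, doc_id))
        neighbours[obj].append((subject, doc_id))
    found: Set[str] = set()
    visited = {seed}
    queue = deque([(seed, 0)])
    while queue:  # breadth-first traversal, bounded by max_hops
        entity, depth = queue.popleft()
        if depth == max_hops:
            continue
        for other, doc_id in neighbours.get(entity, []):
            found.add(doc_id)
            if other not in visited:
                visited.add(other)
                queue.append((other, depth + 1))
    return found


triples = [
    ("Alice", "works_at", "Acme", "doc1"),
    ("Acme", "based_in", "Berlin", "doc2"),
    ("Bob", "lives_in", "Paris", "doc3"),
]
print(documents_within_hops(triples, "Alice", max_hops=1))
print(documents_within_hops(triples, "Alice", max_hops=2))
```

A question such as "In which city does Alice's employer operate?" needs both `doc1` and `doc2`. The second document does not mention Alice at all, so it is reached only through the shared entity `Acme`.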

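The paper mainly discusses the improvement in summarization, and GraphRAG's evaluation is built around that ability as well. 'Complex Information Traversal', which combines information from multiple fragments to produce new insights, also comes into play during QFS.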

1.2 Why not summarize with a very-large-context LLM?

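The Claude 3 models have a 200K-token context window. Could we simply hand every article to the LLM in one pass and ask for a summary? There are two problems with this. The first is the context limit: 200K tokens may still be too few for a large corpus, since thousands of independent documents easily exceed it. In addition, processing several hundred thousand tokens on every request costs too much time and compute.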

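As later sections explain, GraphRAG uses hierarchical summaries, so intermediate results are computed once and can then be reused. The second problem is that, as the context grows longer, current LLMs tend to 'miss the point' or 'overlook some information'. The GraphRAG paper, "From Local to Global: A Graph RAG Approach to Query-Focused Summarization", puts it this way: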

The challenge remains, however, for query-focused abstractive summarization over an entire corpus. Such volumes of text can greatly exceed the limits of LLM context windows, and the expansion of such windows may not be enough given that information can be "lost in the middle" of longer contexts (Kuratov et al., 2024; Liu et al., 2023).
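The first problem can be made concrete with some back-of-the-envelope arithmetic. The corpus size and average document length below are hypothetical figures chosen only for illustration, and `full_context_calls` is a helper introduced here. Only the 200K window comes from the text.

```python
CONTEXT_WINDOW_TOKENS = 200_000  # the 200K window mentioned in the text


def full_context_calls(num_docs: int, avg_tokens_per_doc: int,
                       window: int = CONTEXT_WINDOW_TOKENS) -> int:
    """Minimum number of full-window calls needed just to read the corpus once."""
    total_tokens = num_docs * avg_tokens_per_doc
    return -(-total_tokens // window)  # ceiling division


# Hypothetical corpus: 10,000 documents of about 800 tokens each.
total = 10_000 * 800
print(total)                             # 8000000 tokens, 40 times the window
print(full_context_calls(10_000, 800))   # 40
```

Even under these modest assumptions the corpus cannot fit into one request, and re-reading it for every new question would repeat the same cost each time. That is why reusable intermediate summaries are attractive.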

"""Local search system prompts.""" LOCAL_SEARCH_SYSTEM_PROMPT = """ ---Role--- You are a helpful assistant responding to questions about data in the tables provided. ---Goal--- Generate a response of the target length and format that responds to the user's question, summarizing all information in the input data tables appropriate for the response length and format, and incorporating any relevant general knowledge. If you don't know the answer, just say so. Do not make anything up. Points supported by data should list their data references as follows: "This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]." Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. For example: "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Sources (15, 16), Reports (1), Entities (5, 7); Relationships (23); Claims (2, 7, 34, 46, 64, +more)]." where 15, 16, 1, 5, 7, 23, 2, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record. Do not include information where the supporting evidence for it is not provided. ---Target response length and format--- {response_type} ---Data tables--- {context_data} ---Goal--- Generate a response of the target length and format that responds to the user's question, summarizing all information in the input data tables appropriate for the response length and format, and incorporating any relevant general knowledge. If you don't know the answer, just say so. Do not make anything up. Points supported by data should list their data references as follows: "This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]." Do not list more than 5 record ids in a single reference. 
Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. For example: "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Sources (15, 16), Reports (1), Entities (5, 7); Relationships (23); Claims (2, 7, 34, 46, 64, +more)]." where 15, 16, 1, 5, 7, 23, 2, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record. Do not include information where the supporting evidence for it is not provided. ---Target response length and format--- {response_type} Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown. """

"""System prompts for global search.""" MAP_SYSTEM_PROMPT = """ ---Role--- You are a helpful assistant responding to questions about data in the tables provided. ---Goal--- Generate a response consisting of a list of key points that responds to the user's question, summarizing all relevant information in the input data tables. You should use the data provided in the data tables below as the primary context for generating the response. If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up. Each key point in the response should have the following element: - Description: A comprehensive description of the point. - Importance Score: An integer score between 0-100 that indicates how important the point is in answering the user's question. An 'I don't know' type of response should have a score of 0. The response should be JSON formatted as follows: { "points": [ {{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value}}, {{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}} ] } The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will". Points supported by data should list the relevant reports as references as follows: "This is an example sentence supported by data references [Data: Reports (report ids)]" **Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. For example: "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]" where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables. Do not include information where the supporting evidence for it is not provided. 
---Data tables--- {context_data} ---Goal--- Generate a response consisting of a list of key points that responds to the user's question, summarizing all relevant information in the input data tables. You should use the data provided in the data tables below as the primary context for generating the response. If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up. Each key point in the response should have the following element: - Description: A comprehensive description of the point. - Importance Score: An integer score between 0-100 that indicates how important the point is in answering the user's question. An 'I don't know' type of response should have a score of 0. The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will". Points supported by data should list the relevant reports as references as follows: "This is an example sentence supported by data references [Data: Reports (report ids)]" **Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. For example: "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]" where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables. Do not include information where the supporting evidence for it is not provided. The response should be JSON formatted as follows: { "points": [ {{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value}}, {{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}} ] } """

"""Global Search system prompts.""" REDUCE_SYSTEM_PROMPT = """ ---Role--- You are a helpful assistant responding to a dataset by synthesizing perspectives from multiple analysts. ---Goal--- Generate a response of the target length and format that responds to the user's question, summarize all the reports from multiple analysts who focused on different parts of the dataset. Note that the analysts' reports provided below are ranked in the **descending order of importance**. If you don't know the answer or if the provided reports do not contain sufficient information to provide an answer, just say so. Do not make anything up. The final response should remove all irrelevant information from the analysts' reports and merge the cleaned information into a comprehensive answer that provides explanations of all the key points and implications appropriate for the response length and format. Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown. The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will". The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process. **Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. For example: "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]" where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record. Do not include information where the supporting evidence for it is not provided. 
---Target response length and format--- {response_type} ---Analyst Reports--- {report_data} ---Goal--- Generate a response of the target length and format that responds to the user's question, summarize all the reports from multiple analysts who focused on different parts of the dataset. Note that the analysts' reports provided below are ranked in the **descending order of importance**. If you don't know the answer or if the provided reports do not contain sufficient information to provide an answer, just say so. Do not make anything up. The final response should remove all irrelevant information from the analysts' reports and merge the cleaned information into a comprehensive answer that provides explanations of all the key points and implications appropriate for the response length and format. The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will". The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process. **Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. For example: "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]" where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record. Do not include information where the supporting evidence for it is not provided. ---Target response length and format--- {response_type} Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown. """ NO_DATA_ANSWER = ( "I am sorry but I am unable to answer this question given the provided data." 
) GENERAL_KNOWLEDGE_INSTRUCTION = """ The response may also include relevant real-world knowledge outside the dataset, but it must be explicitly annotated with a verification tag [LLM: verify]. For example: "This is an example sentence supported by real-world knowledge [LLM: verify]." """
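The map and reduce prompts above imply a simple data flow: each map call returns JSON key points with importance scores, and the reduce call receives those points ranked in descending order of importance. The sketch below shows one way that hand-off could look. It is an illustration under assumptions, not the GraphRAG implementation: `parse_map_response`, `build_reduce_input`, the `max_points` cut-off, and the output line format are all hypothetical, and the map responses are hard-coded instead of coming from an LLM.

```python
import json
from typing import Dict, List


def parse_map_response(raw: str) -> List[Dict]:
    """Extract well-formed key points from one map response, tolerating bad JSON."""
    try:
        points = json.loads(raw).get("points", [])
    except (json.JSONDecodeError, AttributeError):
        return []  # malformed answers contribute nothing
    return [
        {"description": p["description"], "score": int(p["score"])}
        for p in points
        if isinstance(p, dict)
        and isinstance(p.get("description"), str)
        and isinstance(p.get("score"), (int, float))
    ]


def build_reduce_input(map_responses: List[str], max_points: int) -> str:
    """Merge all key points, drop score-0 answers, rank by importance, truncate."""
    points: List[Dict] = []
    for raw in map_responses:
        # A score of 0 marks an "I don't know" answer in the map prompt.
        points.extend(p for p in parse_map_response(raw) if p["score"] > 0)
    points.sort(key=lambda p: p["score"], reverse=True)  # descending importance
    return "\n".join(f"[score {p['score']}] {p['description']}" for p in points[:max_points])


map_responses = [
    '{"points": [{"description": "A", "score": 40}, {"description": "unknown", "score": 0}]}',
    '{"points": [{"description": "B", "score": 90}]}',
    "not json",
]
report_data = build_reduce_input(map_responses, max_points=5)
print(report_data)
```

The resulting `report_data` string is what would fill the `{report_data}` placeholder of the reduce prompt. Capping the number of points is one simple way to keep the reduce call within the context window.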
