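GraphRAG does not introduce a fundamentally new technique; rather, it cleverly combines techniques that already existed: LLMs, knowledge graphs, community detection and aggregation algorithms, plus some ideas borrowed from Map-Reduce. The essence of its design can be summed up in one sentence from the accompanying paper, "From Local to Global: A Graph RAG Approach to Query-Focused Summarization": "Use the natural modularity of graphs to partition data for global summarization."
The rest of this article expands on that sentence.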
01 Why GraphRAG?
1.1 What problem does GraphRAG solve?
To quote the article "GraphRAG: Advanced Data Retrieval for Enhanced Insights":
Complex Information Traversal: It excels at connecting different pieces of information to provide new, synthesized insights.
Holistic Understanding: It performs better at understanding and summarizing large data collections, offering a more comprehensive grasp of the information.
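As described later, GraphRAG uses hierarchical summarization, so intermediate results are computed once and can then be reused. The second reason is that current LLMs, as the context grows longer, tend to "miss the point" or "overlook some information". The GraphRAG paper, "From Local to Global: A Graph RAG Approach to Query-Focused Summarization", puts it this way: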
The challenge remains, however, for query-focused abstractive summarization over an entire corpus. Such volumes of text can greatly exceed the limits of LLM context windows, and the expansion of such windows may not be enough given that information can be "lost in the middle" of longer contexts (Kuratov et al., 2024; Liu et al., 2023).
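1.3 How does it differ from RAPTOR?
RAPTOR was likewise designed to address the QFS (query-focused summarization) problem. Its working principle, per the paper, is as follows:
RAPTOR builds a multi-layer tree before any query is issued. A few notes on how the tree is built:
Starting from the text chunks at the bottom layer, it clusters upward; each cluster is summarized by an LLM, and the resulting summary is stored in the node.
The difference lies in how the clustering is done. GraphRAG builds a knowledge graph and then performs multi-level clustering based on the relationships between the graph's nodes ("Use the natural modularity of graphs to partition data for global summarization"), whereas RAPTOR still clusters on the embeddings of the text. I have not yet found a direct, data-backed comparison of how much the two differ in effectiveness.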
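To make "the natural modularity of graphs" a little more concrete, the sketch below computes the modularity score of a partition of a tiny graph. It is not GraphRAG's code: the toy entity graph and the function name `modularity` are made up for illustration. Community detection algorithms such as Leiden, which the paper uses, look for partitions that score well on a quality measure of this kind.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 * m)) ** 2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    internal = defaultdict(int)  # L_c
    degree = defaultdict(int)    # d_c
    for u, v in edges:
        degree[community_of[u]] += 1
        degree[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Hypothetical entity graph: two triangles joined by a single edge (c-d).
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in "abcdef"}

q_split = modularity(edges, two_groups)   # one community per triangle
q_single = modularity(edges, one_group)   # everything in one community
```

Splitting along the single bridging edge scores higher than lumping all nodes together, which is the sense in which the graph's own structure suggests where to partition the data. An embedding-based approach such as RAPTOR's has no edges to exploit and has to rely on vector similarity instead.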
"""Local search system prompts."""
LOCAL_SEARCH_SYSTEM_PROMPT = """
---Role---
You are a helpful assistant responding to questions about data in the tables provided.
---Goal---
Generate a response of the target length and format that responds to the user's question, summarizing all information in the input data tables appropriate for the response length and format, and incorporating any relevant general knowledge.
If you don't know the answer, just say so. Do not make anything up.
Points supported by data should list their data references as follows:
"This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]."
Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Sources (15, 16), Reports (1), Entities (5, 7); Relationships (23); Claims (2, 7, 34, 46, 64, +more)]."
where 15, 16, 1, 5, 7, 23, 2, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.
Do not include information where the supporting evidence for it is not provided.
---Target response length and format---
{response_type}
---Data tables---
{context_data}
---Goal---
Generate a response of the target length and format that responds to the user's question, summarizing all information in the input data tables appropriate for the response length and format, and incorporating any relevant general knowledge.
If you don't know the answer, just say so. Do not make anything up.
Points supported by data should list their data references as follows:
"This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]."
Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Sources (15, 16), Reports (1), Entities (5, 7); Relationships (23); Claims (2, 7, 34, 46, 64, +more)]."
where 15, 16, 1, 5, 7, 23, 2, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.
Do not include information where the supporting evidence for it is not provided.
---Target response length and format---
{response_type}
Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown.
"""
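Here {context_data} is a placeholder whose value is the context built by the dataflow described above. A Local Search needs only a single LLM call, whereas a single Global Search query may invoke the LLM a dozen or more times. The following looks at how Global Search handles this.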
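As a rough illustration of the placeholder substitution, the sketch below fills an abbreviated stand-in for the Local Search prompt. It assumes the template is rendered with Python's `str.format`; the helper name `build_local_system_prompt` and the pipe-delimited table layout are hypothetical, not taken from GraphRAG.

```python
# Abbreviated stand-in for LOCAL_SEARCH_SYSTEM_PROMPT (not the full prompt).
LOCAL_PROMPT_TEMPLATE = """
---Role---
You are a helpful assistant responding to questions about data in the tables provided.
---Target response length and format---
{response_type}
---Data tables---
{context_data}
"""


def build_local_system_prompt(context_data: str, response_type: str) -> str:
    """Fill both placeholders; the result is sent to the LLM in a single call."""
    return LOCAL_PROMPT_TEMPLATE.format(
        context_data=context_data,
        response_type=response_type,
    )


# Hypothetical context; in GraphRAG it would come from the Local Search dataflow.
context = "-----Entities-----\nid|entity|description\n5|Company Y|A hypothetical company"
system_prompt = build_local_system_prompt(context, "multiple paragraphs")
```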
2.2.2. Global Search
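As mentioned earlier, the context used by Global Search differs greatly from that of Local Search.
Global Search uses the set of Community Reports at a particular level of the hierarchy. Because a single context window may not be able to hold all of these Community Reports, a MapReduce step is needed: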
"""System prompts for global search."""
MAP_SYSTEM_PROMPT = """
---Role---
You are a helpful assistant responding to questions about data in the tables provided.
---Goal---
Generate a response consisting of a list of key points that responds to the user's question, summarizing all relevant information in the input data tables.
You should use the data provided in the data tables below as the primary context for generating the response.
If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up.
Each key point in the response should have the following element:
- Description: A comprehensive description of the point.
- Importance Score: An integer score between 0-100 that indicates how important the point is in answering the user's question. An 'I don't know' type of response should have a score of 0.
The response should be JSON formatted as follows:
{{
"points": [
{{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value}},
{{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}}
]
}}
The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will".
Points supported by data should list the relevant reports as references as follows:
"This is an example sentence supported by data references [Data: Reports (report ids)]"
**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"
where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables.
Do not include information where the supporting evidence for it is not provided.
---Data tables---
{context_data}
---Goal---
Generate a response consisting of a list of key points that responds to the user's question, summarizing all relevant information in the input data tables.
You should use the data provided in the data tables below as the primary context for generating the response.
If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up.
Each key point in the response should have the following element:
- Description: A comprehensive description of the point.
- Importance Score: An integer score between 0-100 that indicates how important the point is in answering the user's question. An 'I don't know' type of response should have a score of 0.
The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will".
Points supported by data should list the relevant reports as references as follows:
"This is an example sentence supported by data references [Data: Reports (report ids)]"
**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"
where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables.
Do not include information where the supporting evidence for it is not provided.
The response should be JSON formatted as follows:
{{
"points": [
{{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value}},
{{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}}
]
}}
"""
As the prompt above shows, the output of the map stage is in JSON format:
{
"points": [
{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value},
{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}
]
}
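A minimal sketch of how a map-stage reply could be consumed is given below. It assumes the prompt is rendered with `str.format` (which is why literal JSON braces in the template are doubled) and that a reply which is not valid JSON is simply dropped; the function name `parse_map_response` and that fallback behaviour are assumptions for illustration, not GraphRAG's actual handling.

```python
import json

# In a str.format template, "{{" and "}}" render as literal "{" and "}".
MAP_TEMPLATE = 'Reply as JSON like {{"points": []}}\n---Data tables---\n{context_data}'
rendered = MAP_TEMPLATE.format(context_data="report table goes here")


def parse_map_response(raw: str) -> list[dict]:
    """Extract the scored key points from one map-stage reply."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []  # assumption: unparseable replies contribute nothing
    points = payload.get("points", []) if isinstance(payload, dict) else []
    parsed = []
    for point in points:
        if not isinstance(point, dict):
            continue
        if "description" not in point or "score" not in point:
            continue
        try:
            score = int(point["score"])
        except (TypeError, ValueError):
            continue  # skip points whose score is not a number
        parsed.append({"description": point["description"], "score": score})
    return parsed


reply = '{"points": [{"description": "Point 1 [Data: Reports (1, 3)]", "score": 80}]}'
key_points = parse_map_response(reply)
```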
The prompt for the reduce stage:
"""Global Search system prompts."""
REDUCE_SYSTEM_PROMPT = """
---Role---
You are a helpful assistant responding to a dataset by synthesizing perspectives from multiple analysts.
---Goal---
Generate a response of the target length and format that responds to the user's question, summarize all the reports from multiple analysts who focused on different parts of the dataset.
Note that the analysts' reports provided below are ranked in the **descending order of importance**.
If you don't know the answer or if the provided reports do not contain sufficient information to provide an answer, just say so. Do not make anything up.
The final response should remove all irrelevant information from the analysts' reports and merge the cleaned information into a comprehensive answer that provides explanations of all the key points and implications appropriate for the response length and format.
Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown.
The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will".
The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process.
**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"
where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.
Do not include information where the supporting evidence for it is not provided.
---Target response length and format---
{response_type}
---Analyst Reports---
{report_data}
---Goal---
Generate a response of the target length and format that responds to the user's question, summarize all the reports from multiple analysts who focused on different parts of the dataset.
Note that the analysts' reports provided below are ranked in the **descending order of importance**.
If you don't know the answer or if the provided reports do not contain sufficient information to provide an answer, just say so. Do not make anything up.
The final response should remove all irrelevant information from the analysts' reports and merge the cleaned information into a comprehensive answer that provides explanations of all the key points and implications appropriate for the response length and format.
The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will".
The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process.
**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"
where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.
Do not include information where the supporting evidence for it is not provided.
---Target response length and format---
{response_type}
Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown.
"""
NO_DATA_ANSWER = (
"I am sorry but I am unable to answer this question given the provided data."
)
GENERAL_KNOWLEDGE_INSTRUCTION = """
The response may also include relevant real-world knowledge outside the dataset, but it must be explicitly annotated with a verification tag [LLM: verify]. For example:
"This is an example sentence supported by real-world knowledge [LLM: verify]."
"""
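The sentence "Note that the analysts' reports provided below are ranked in the descending order of importance." in the prompt indicates that, before being handed to the LLM, the map-stage results have already been sorted by score in descending order.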
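The ordering step can be sketched as follows. This is only an illustration under stated assumptions: points with a score of 0 are discarded, a character budget stands in for a token budget, and the per-point layout ("----Analyst N----") is made up; the name `build_report_data` is hypothetical.

```python
def build_report_data(map_results, max_chars=2000):
    """Merge the key points of all map calls into one {report_data} string.

    map_results holds one list of {"description", "score"} dicts per map call.
    """
    points = [
        {**point, "analyst": index + 1}
        for index, batch in enumerate(map_results)
        for point in batch
        if point["score"] > 0  # assumption: "I don't know" points are dropped
    ]
    # Most important points first, matching the wording of the reduce prompt.
    points.sort(key=lambda point: point["score"], reverse=True)

    sections, used = [], 0
    for point in points:
        section = (
            f"----Analyst {point['analyst']}----\n"
            f"Importance Score: {point['score']}\n"
            f"{point['description']}"
        )
        if used + len(section) > max_chars:
            break  # stop once the budget for the reduce context is spent
        sections.append(section)
        used += len(section)
    return "\n\n".join(sections)


# Hypothetical output of two map calls.
map_results = [
    [{"description": "point alpha", "score": 40},
     {"description": "no answer", "score": 0}],
    [{"description": "point beta", "score": 90}],
]
report_data = build_report_data(map_results)
```

The resulting string would take the place of {report_data} in the reduce prompt, so the final answer costs one more LLM call on top of the map calls.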
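That covers the main flow of Global Search. Besides answering queries at different levels, GraphRAG can also recommend questions, thanks to its use of a knowledge graph; the next subsection briefly analyzes this capability.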
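Question recommendation can be sketched in the same spirit as the two searches. The snippet below is an illustration, not GraphRAG's implementation: it assumes a prompt template with {question_count} and {context_data} placeholders that ends in an "Example questions" marker (abbreviated here), that the user's earlier questions are appended after that marker, and that the model replies with "- " bullets. The helper names are hypothetical.

```python
# Abbreviated stand-in for a question-generation prompt template.
QUESTION_TEMPLATE = """
You are a helpful assistant generating a bulleted list of {question_count} questions about data in the tables provided.
---Data tables---
{context_data}
---Example questions---
"""


def build_question_prompt(context_data, previous_questions, question_count=5):
    """Fill the placeholders, then append the user's earlier questions."""
    head = QUESTION_TEMPLATE.format(
        question_count=question_count,
        context_data=context_data,
    )
    return head + "\n".join(previous_questions)


def parse_candidates(reply):
    """Collect the lines of the reply that are '- ' bullets."""
    return [
        line.strip()[2:].strip()
        for line in reply.splitlines()
        if line.strip().startswith("- ")
    ]


prompt = build_question_prompt(
    "id|entity\n5|Company Y",            # hypothetical context table
    ["Who owns Company Y?"],
    question_count=2,
)
# Hypothetical model reply.
candidates = parse_candidates("- What is Company Y accused of?\n- Who runs Company Y?")
```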
"""Question Generation system prompts."""
QUESTION_SYSTEM_PROMPT = """
---Role---
You are a helpful assistant generating a bulleted list of {question_count} questions about data in the tables provided.
---Data tables---
{context_data}
---Goal---
Given a series of example questions provided by the user, generate a bulleted list of {question_count} candidates for the next question. Use - marks as bullet points.
These candidate questions should represent the most important or urgent information content or themes in the data tables.
The candidate questions should be answerable using the data tables provided, but should not mention any specific data fields or data tables in the question text.
If the user's questions reference several named entities, then each candidate question should reference all named entities.
---Example questions---
"""