LLM RAG: Retrieval-Augmented Generation Principles and Applications Explained

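AI-generated summary: This article explains in detail how LLM RAG (Retrieval-Augmented Generation) works and how to implement it. It first lays out the core value of RAG in addressing stale knowledge, hallucination, and domain adaptation in large models. It then goes into the technical details: document loading, text splitting strategies, choice of embedding model, vector database storage, and several retrieval algorithms. It also covers optimizations such as hybrid retrieval and query rewriting, along with engineering practices such as evaluation metrics and latency and cost control, giving a complete technical reference for building high-quality RAG applications.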

涅槃凤凰 · Published 2025/2/6 · Updated 2026/6/218 views

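Preface

In 2024, as large models keep getting stronger, more and more LLM applications are reaching production. After an initial period of exploration and research, the industry has gradually converged on two main application directions: RAG (Retrieval-Augmented Generation) and Agents. Today, let's start with RAG.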

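1. Introduction to RAG

RAG stands for Retrieval-Augmented Generation. The core of the LLM wave set off by ChatGPT is Generation. Retrieval-augmented means strengthening the LLM, beyond what it has already learned, by attaching external data sources. These include external vector databases, external knowledge graphs, document data, and Web data.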

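[Figure: RAG system architecture flowchart]

As the figure above shows, a Doc Loader loads data from various sources, the data is vectorized by an Embedding model, and the vectors are stored in a vector database. This is the basic data-processing flow on the retrieval-augmented side. When a user asks a question through the QA interface, the question is used to recall the most similar context from the vector database. That context is sent to the LLM together with the question in a Prompt, and the LLM generates an answer from the question and the context and returns it to the user.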
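The flow above can be sketched end to end in a few lines of standard-library Python. This is only an illustration under loud assumptions: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector database, and the function names (`build_index`, `retrieve`, `build_prompt`) are hypothetical, not taken from any framework.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (a stand-in for a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Indexing step: embed every chunk and keep (text, vector) pairs in memory."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Retrieval step: return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Augmentation step: put the retrieved context and the question in one prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}"
    )


index = build_index([
    "RAG retrieves relevant context from a vector database.",
    "Agents let an LLM call external tools.",
    "Embeddings map text to vectors.",
])
question = "What does RAG retrieve from the vector database?"
prompt = build_prompt(question, retrieve(index, question, k=1))
print(prompt)  # this string is what would be sent to the LLM
```

In a real system each piece is swapped for a production component, but the order of the steps (index, retrieve, build the prompt, generate) stays the same.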

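A natural question: large models routinely have hundreds of billions of parameters and are trained on petabyte-scale data, so why attach external data sources at all? There are several reasons:

Data freshness: An LLM's knowledge generally stops at the cutoff date of its training data, and it cannot learn about new information in real time. An external knowledge base can supply newer, up-to-date information, so the model stays aware of emerging facts and the latest developments in a field.
Domain expertise: Although the training data is huge, some fields, such as medicine, law, or science, require deep specialist knowledge. An LLM may not be highly accurate in these fields, and it works better when it is given data from them.
Customization: In some scenarios users need the LLM to specialize in a particular area, such as a company's internal knowledge base or product specifications. An external knowledge base helps the model serve the needs of a specific user or organization.
Avoiding errors: In specialized domains an LLM may generate inaccurate or misleading information (the hallucination problem). Grounding answers in an external knowledge base improves accuracy and reduces such errors.

In practice, the external knowledge base is integrated with the LLM and customized to the needs of the user or enterprise, so the service it provides is more specialized, accurate, and personalized.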

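Now that we have covered the basic concept of RAG, let's dig into the technical details and see how RAG is implemented.

2. RAG Technical Implementation

2.1 Data Loading (Document Loaders)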

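The first problem RAG has to solve is where the data comes from. Data comes from many sources and in many formats, such as CSV, HTML, JSON, Markdown, and PDF. Each format needs a corresponding Document Loader to process it and extract the information correctly.

Take LangChain (an LLM application framework) as an example: the LangChain community has already implemented many document loaders. For example, the HTML loader: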

from langchain_community.document_loaders import UnstructuredHTMLLoader
loader = UnstructuredHTMLLoader("example_data/fake-content.html")
data = loader.load()
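To show what a loader does conceptually, here is a minimal standard-library sketch that extracts visible text from HTML and attaches source metadata. It is not how `UnstructuredHTMLLoader` is implemented; the `load_html` function and the dictionary shape (modeled loosely on a document with `page_content` and `metadata`) are assumptions made for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def load_html(html: str, source: str) -> dict:
    """Hypothetical loader: returns extracted text plus metadata about its origin."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return {"page_content": "\n".join(parser.parts), "metadata": {"source": source}}


doc = load_html(
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>RAG</b></p></body></html>",
    source="inline-example",
)
print(doc["page_content"])
```

Keeping the source in the metadata matters later: it lets a retrieved chunk be traced back to the file or page it came from.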

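The LangChain community loaders also cover data from many websites and platforms in China, such as Baidu Netdisk and Tencent Cloud documents, and even blockchain data.

2.2 Data Processing (Text Splitters)

2.2.1 Data Splitting

After loading, the next step is usually to split the data, especially when dealing with long texts. Splitting text sounds simple, for example just slicing every 400 characters, but in practice this often gives poor results.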
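A quick sketch of that naive approach makes the problem visible. The `split_fixed` helper below is hypothetical (it is not a LangChain API); it slices by character count only, so chunk boundaries fall wherever the count runs out, often in the middle of a sentence.

```python
def split_fixed(text: str, chunk_size: int = 400, overlap: int = 0) -> list[str]:
    """Slice text into chunk_size-character windows, optionally overlapping."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


text = "RAG loads documents. It splits them into chunks. Then it embeds each chunk."
chunks = split_fixed(text, chunk_size=30)
for chunk in chunks:
    print(repr(chunk))  # boundaries ignore sentence structure
```

Adding an overlap softens the effect, because text near a boundary appears in both neighboring chunks, but it does not make the boundaries semantic.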

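We usually want to keep semantically related pieces of text together. The key is 'semantically related'. For Chinese text, we want to split on the full stop (。); for a long piece of code, we want to split along the structure of the programming language, such as def and class in Python.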
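The two ideas in this paragraph can be sketched with the standard library. The helpers below are hypothetical and deliberately simple: the sentence splitter only knows the Chinese terminators 。！？, and the code splitter only recognizes top-level `def` and `class` lines (no decorators, no `async def`), so treat them as a starting point rather than a complete splitter.

```python
import re

# Zero-width split points right after a Chinese sentence terminator (an assumption;
# extend the character class for other languages).
_SENTENCE_END = re.compile(r"(?<=[。！？])")

# Zero-width split points at the start of a top-level def or class line.
_PY_BLOCK_START = re.compile(r"^(?=def |class )", re.MULTILINE)


def split_sentences(text: str) -> list[str]:
    """Split after each terminator, keeping the terminator with its sentence."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def pack_sentences(sentences: list[str], max_chars: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


def split_python_source(code: str) -> list[str]:
    """Split Python source so each top-level def/class starts a new piece."""
    return [p.rstrip() for p in _PY_BLOCK_START.split(code) if p.strip()]


sentences = split_sentences("今天天气很好。我们去公园散步！你想一起去吗？")
print(pack_sentences(sentences, max_chars=16))

pieces = split_python_source("import os\n\ndef a():\n    pass\n\nclass B:\n    pass\n")
print(pieces)
```

Both helpers rely on `re.split` accepting zero-width matches, which requires Python 3.7 or later.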

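Taking LangChain as an example, it currently supports splitting by HTML, by character, by MarkdownHeader, and for many programming languages, plus a semantic splitter that is still experimental.

Splitting by MarkdownHeader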

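from langchain.text_splitter import MarkdownHeaderTextSplitter

# Sample document for illustration (an assumption; any Markdown text works here).
markdown_document = "# Foo\n\n## Bar\n\nHi this is Jim\n\n### Boo\n\nHi this is Lance\n\n## Baz\n\nHi this is Molly"

# Each tuple maps a header marker to the metadata key stored on the resulting chunks.
# The marker/key pairs below are assumed values.
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
print(md_header_splits)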
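To make the mechanics of header-based splitting concrete, here is a standard-library sketch of the idea: text is grouped under its nearest headers, and those headers are recorded as metadata on each chunk. This is an approximation written for illustration, not LangChain's implementation; `split_by_headers` and its output shape are assumptions.

```python
def split_by_headers(markdown: str, headers: list[tuple[str, str]]) -> list[dict]:
    """Group lines under their nearest headers and record those headers as metadata."""
    # Check longer markers first so "##" is never mistaken for "#".
    ordered = sorted(headers, key=lambda h: len(h[0]), reverse=True)
    chunks: list[dict] = []
    active: dict[str, str] = {}  # metadata key -> current header title
    buffer: list[str] = []       # body lines collected since the last header

    def flush() -> None:
        content = "\n".join(buffer).strip()
        if content:
            chunks.append({"content": content, "metadata": dict(active)})
        buffer.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        for marker, name in ordered:
            if stripped.startswith(marker + " "):
                flush()
                # A new header replaces headers at the same or a deeper level.
                for other_marker, other_name in headers:
                    if len(other_marker) >= len(marker):
                        active.pop(other_name, None)
                active[name] = stripped[len(marker):].strip()
                break
        else:
            buffer.append(line)
    flush()
    return chunks


sample = (
    "# Foo\n\n## Bar\n\nHi this is Jim\n\n"
    "### Boo\n\nHi this is Lance\n\n## Baz\n\nHi this is Molly"
)
splits = split_by_headers(
    sample, [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
)
for chunk in splits:
    print(chunk)
```

The header metadata travels with each chunk, so a retrieved fragment still carries the section context it was written under.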

Comparison of retrieval methods:

Index Type	Uses an LLM	When to Use	Description
Vectorstore	No	If you are just getting started and looking for something quick and easy.	This is the simplest method and the one that is easiest to get started with. It involves creating embeddings for each piece of text.
Vectorstore + Document Store	No	If your pages have lots of smaller pieces of distinct information that are best indexed by themselves, but best retrieved all together.	This involves indexing multiple chunks for each document. Then you find the chunks that are most similar in embedding space, but you retrieve the whole parent document and return that (rather than individual chunks).
Vectorstore + Document Store	Sometimes during indexing	If you are able to extract information from documents that you think is more relevant to index than the text itself.	This involves creating multiple vectors for each document. Each vector could be created in a myriad of ways; examples include summaries of the text and hypothetical questions.
Vectorstore	Yes	If users are asking questions that are better answered by fetching documents based on metadata rather than similarity with the text.	This uses an LLM to transform user input into two things: (1) a string to look up semantically, (2) a metadata filter to go along with it. This is useful because oftentimes questions are about the metadata of documents (not the content itself).
Any	Sometimes	If you are finding that your retrieved documents contain too much irrelevant information and are distracting the LLM.	This puts a post-processing step on top of another retriever and extracts only the most relevant information from retrieved documents. This can be done with embeddings or an LLM.
Vectorstore	No	If you have timestamps associated with your documents, and you want to retrieve the most recent ones.	This fetches documents based on a combination of semantic similarity (as in normal vector retrieval) and recency (looking at timestamps of indexed documents).
Any	Yes	If users are asking questions that are complex and require multiple pieces of distinct information to respond.	This uses an LLM to generate multiple queries from the original one. This is useful when the original query needs pieces of information about multiple topics to be properly answered. By generating multiple queries, we can then fetch documents for each of them.
Any	No	If you have multiple retrieval methods and want to try combining them.	This fetches documents from multiple retrievers and then combines them.
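The last row, combining results from multiple retrievers, is easy to sketch. One common way to merge ranked lists is Reciprocal Rank Fusion (RRF), shown below as a minimal example. The choice of RRF, the constant `k = 60`, and the document IDs are assumptions for illustration; the table itself does not say which fusion rule is used.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over all lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order because sorted() is stable.
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["doc_a", "doc_b", "doc_c"]   # e.g. from an embedding search
keyword_hits = ["doc_b", "doc_d", "doc_a"]  # e.g. from a keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Because RRF uses only rank positions, it can merge retrievers whose raw scores are on different scales, such as cosine similarity and keyword-match scores.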

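2.2.2 Data Information (Metadata)

2.2.3 Splitting Parameters

2.3 Data Vectorization (Text embedding models)

2.4 Vector Databases (Vector stores)

2.5 Data Retrieval (Retrievers)

2.5.1 Retrieval Algorithms

3. RAG Practical Challenges and Optimization

3.1 Retrieval Precision Optimization

3.2 Building an Evaluation System

3.3 Latency and Cost Considerations

4. Summary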

Model	Chinese support
M3E	Yes
text2vec	Yes
OpenAIEmbeddings	Yes
