Building a ChatPDF Document Chat System with RAG + LangChain | 极客日志


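Summary (AI-generated): This article walks through the complete process of building a PDF document chat system with retrieval-augmented generation (RAG) and the LangChain framework. Five steps (document loading, text splitting, vectorized storage, retrieval, and answer generation) produce a chatbot that answers questions from private data. The article explains the RecursiveCharacterTextSplitter splitting strategy, the Chroma vector database, metadata filtering with SelfQueryRetriever, and the memory feature of ConversationalRetrievalChain, and provides a Flask-based web application example. It closes by comparing the Assistants API with a custom RAG solution and summarizing the key points for production deployment.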

Published 2025/2/6 · Updated 2026/6/219 views

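Large language models (LLMs) such as ChatGPT can answer many kinds of questions. An LLM on its own, however, only knows what it was trained on. It knows nothing about your private data, such as internal company documents that are not on the public internet, or data produced after training finished. (Even the latest GPT-4 Turbo has training data that extends only to April 2023.) It is therefore worthwhile to build a chatbot that can converse with our own documents, so that the LLM answers questions based on the information in those documents.

Here we build on the principles of RAG and use LangChain to chat with PDF documents.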

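What is RAG?

RAG is short for retrieval-augmented generation. It combines retrieval with generation: a text generation task is supplied with additional external knowledge (usually private or real-time data), so that outside information augments what the LLM knows. RAG pairs a conventional language generation model with a large external knowledge base, letting the model dynamically retrieve relevant information from that knowledge base while it generates a response. The goal is to make the generated content richer, more accurate, and better grounded, which is especially useful where specific details or external facts are needed.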

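RAG generally consists of the following steps:

Retrieval: For a given input (the question), the model first uses a retrieval system to find relevant documents or passages in a large document collection. This retrieval system is usually based on dense vector search.
Context encoding: Once the relevant documents or passages are found, the model places them in the prompt together with the original input (the question).
Generation: Using the encoded context, the model generates the output (the answer). This step is usually performed by a large language model.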
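The three steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: word overlap stands in for dense vector search, `fake_llm` is a hypothetical stub in place of a real model call, and the passages are invented sample data.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, passages, k=2):
    """Retrieval: rank passages by words shared with the question.
    Word overlap is a simple stand-in for dense vector search."""
    q = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, context):
    """Context encoding: put the retrieved passages and the question in one prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {question}\nAnswer:"
    )

def fake_llm(prompt):
    """Generation: hypothetical stub; a real system would call an LLM here."""
    return "stub answer"

passages = [
    "The refund window is 30 days from the delivery date.",
    "Support is available on weekdays from 9:00 to 18:00.",
    "Gift cards cannot be refunded.",
]
question = "How many days is the refund window?"
context = retrieve(question, passages, k=1)
prompt = build_prompt(question, context)
answer = fake_llm(prompt)
```

The point of the sketch is the data flow: the model never sees the whole collection, only the few passages that retrieval placed in the prompt.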

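Implementing It with LangChain

RAG still looks fairly abstract at this point, so next we implement it with LangChain. The work breaks down into the following five steps: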

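Document Loading: A document loader loads the documents into a form that LangChain can read.
Splitting: A text splitter cuts the documents into semantically meaningful chunks of a specified size, usually called "document chunks" or "document pieces".
Storage: The chunks from the previous step are stored in a vector database (Vector DB) as embeddings, forming individual "embedding chunks".
Retrieval: The application retrieves the split documents from storage (for example, by comparing cosine similarity to find the embedding chunks closest to the input question).
Output: The question and the similar document chunks are passed to the language model (LLM), which generates an answer from a prompt containing both.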
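The cosine-similarity comparison mentioned in the Retrieval step can be sketched as follows. The three-dimensional vectors are made-up toy values; real embeddings have hundreds or thousands of dimensions, and a vector database performs this search for you.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Toy embeddings for three chunks and one question
chunk_vecs = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
query_vec = [0.9, 0.1, 0.0]
nearest = top_k(query_vec, chunk_vecs, k=2)
print(nearest)  # prints [0, 1]
```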

Note: at the time of writing, the latest version of the openai library was not compatible with the LangChain version used here, so install openai 0.28.1.

!pip install openai==0.28.1

First, initialize the environment variables from a .env file.

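from langchain.document_loaders import PyPDFLoader
from langchain.memory import ConversationBufferMemory
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chat_models import AzureChatOpenAI
from langchain.chains import ConversationalRetrievalChain
# Imports needed by the retrieval code further down
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Load environment variables (API keys, endpoints) from the nearest .env file
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())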
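The article does not list which variables the .env file must contain. As a rough sketch, a small check like the one below can report missing settings before any API call is made. The variable names are an assumption based on a typical Azure OpenAI setup with the legacy openai library, not something the original specifies, and `missing_settings` is a hypothetical helper.

```python
import os

# Assumed variable names for an Azure OpenAI setup; adjust to your own .env file.
REQUIRED_VARS = (
    "OPENAI_API_TYPE",
    "OPENAI_API_BASE",
    "OPENAI_API_KEY",
    "OPENAI_API_VERSION",
)

def missing_settings(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Example with an explicit mapping instead of the real environment
example_env = {"OPENAI_API_TYPE": "azure", "OPENAI_API_KEY": "placeholder"}
print(missing_settings(example_env))  # prints ['OPENAI_API_BASE', 'OPENAI_API_VERSION']
```

Calling `missing_settings()` with no arguments checks the real process environment, which is what you would do right after `load_dotenv`.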

# Load the documents
pdffiles = [
    "docs/cs229_lectures/MachineLearning-Lecture01.pdf",
    "docs/cs229_lectures/MachineLearning-Lecture01.pdf",  # deliberately duplicated to simulate messy data
    "docs/cs229_lectures/MachineLearning-Lecture02.pdf",
    "docs/cs229_lectures/MachineLearning-Lecture03.pdf"
]
docs = []
for file_path in pdffiles:
    # PyPDFLoader returns one Document per page
    loader = PyPDFLoader(file_path)
    docs.extend(loader.load())

print(f"The number of docs:{len(docs)}")
# print(docs[0])

some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which ideas are related. For example, closely related ideas \
are in sentences. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.
and words are separated by space."""
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=20,
)
r_splitter.split_text(some_text)
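To show the idea behind recursive splitting, here is a simplified pure-Python sketch. It is not LangChain's implementation: it ignores `chunk_overlap`, and the function name and separator list are illustrative assumptions. The principle is to try the coarsest separator first (paragraph breaks) and fall back to finer ones (lines, words, single characters) only for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    preferring coarse separators over fine ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks

sample = "aaa bbb ccc\n\nddd eee"
print(recursive_split(sample, chunk_size=7))  # prints ['aaa bbb', 'ccc', 'ddd eee']
```

Note how the paragraph break is respected: "ccc" and "ddd" are never merged into one chunk, even though they would fit.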

# Split the documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,    # maximum characters per chunk
    chunk_overlap=150   # characters shared between neighboring chunks
)
splits = text_splitter.split_documents(docs)
print(f"The number of splits:{len(splits)}")
# print(splits[0])
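The role of `chunk_overlap` is easiest to see with fixed-size windows. The sketch below is a deliberate simplification: the real splitter cuts at separators rather than at fixed offsets, and `window_chunks` is a hypothetical helper. What it does show is why neighboring chunks share text, so a sentence that straddles a boundary still appears whole in at least one chunk.

```python
def window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows where each window repeats the
    last chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks

print(window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# prints ['abcd', 'defg', 'ghij']
```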

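# Store the chunks in the vector database
embedding = OpenAIEmbeddings()
persist_directory = 'docs/chroma/'

# Open (or create) a single persistent store, then add the chunks in batches:
# because of an API limit, only 16 chunks can be sent per request.
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)
for i in range(0, len(splits), 16):
    batch = splits[i:i+16]
    vectordb.add_documents(batch)
vectordb.persist()  # save to disk so the store can be reused later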
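The batching pattern used above is independent of any particular vector database. A minimal sketch, with `batched` as a hypothetical helper name and the limit of 16 taken from the comment in the code:

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

chunks = [f"chunk-{i}" for i in range(40)]
sizes = [len(batch) for batch in batched(chunks, 16)]
print(sizes)  # prints [16, 16, 8]
```

The last batch is simply shorter, so no chunk is dropped when the total is not a multiple of the batch size.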

# The chunks are already persisted, so load the vector database from disk
vectordb = Chroma(persist_directory=persist_directory,
                  embedding_function=embedding)
print(vectordb._collection.count())

# Retrieval
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]
document_content_description = "Lecture notes"
llm = AzureChatOpenAI(deployment_name="GPT-4", temperature=0)
self_query_retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    search_type="mmr",
    search_kwargs={'k': 5, 'fetch_k': 10},
    verbose=True
)
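The retriever above is configured with `search_type="mmr"`, `k=5` and `fetch_k=10`. Maximal marginal relevance (MMR) first shortlists the `fetch_k` chunks most similar to the query, then picks `k` of them one at a time, trading relevance against similarity to what has already been picked. The following is a small sketch of that selection rule with toy two-dimensional vectors. It is an illustration, not LangChain's code, and the 0.5 weighting is an assumed default.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k=2, fetch_k=3, lambda_mult=0.5):
    """Return indices of k documents chosen by maximal marginal relevance."""
    # Step 1: shortlist the fetch_k documents most similar to the query
    by_relevance = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_relevance[:fetch_k]
    selected = []
    # Step 2: greedily add the candidate with the best relevance/novelty balance
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query_vec = [1.0, 0.0]
doc_vecs = [[0.9, 0.1], [0.9, 0.12], [0.5, -0.5]]  # docs 0 and 1 are near-duplicates
print(mmr(query_vec, doc_vecs, k=2, fetch_k=3))  # prints [0, 2]
```

A plain similarity search would return the two near-duplicates (documents 0 and 1). MMR skips the second one in favor of a less similar but more diverse result, which matters here because the same lecture PDF was loaded twice.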

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=self_query_retriever
)

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)
qa = ConversationalRetrievalChain.from_llm(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
    memory=memory
)
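The memory object is what lets the chain handle follow-up questions. Roughly speaking, the stored conversation and the new question are combined into a prompt that asks the model for a standalone question, and that standalone question is what gets sent to the retriever. The sketch below imitates this with plain Python; the class, function, and prompt wording are illustrative assumptions, not LangChain's actual implementation or prompt text.

```python
class BufferMemory:
    """Keeps every question/answer pair of the conversation in order."""

    def __init__(self):
        self.turns = []

    def save(self, question, answer):
        self.turns.append((question, answer))

    def as_text(self):
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.turns)

def condense_prompt(memory, follow_up):
    """Build the prompt that asks a model to rewrite a follow-up question
    so that it can be understood without the chat history."""
    return (
        "Given the conversation and a follow-up question, "
        "rephrase the follow-up as a standalone question.\n\n"
        f"Chat history:\n{memory.as_text()}\n\n"
        f"Follow-up: {follow_up}\nStandalone question:"
    )

chat_memory = BufferMemory()
chat_memory.save("What is covered in lecture 3?", "Mostly regression.")
prompt = condense_prompt(chat_memory, "Which kinds are discussed?")
print(prompt)
```

Without the history, "Which kinds are discussed?" gives the retriever nothing to search for. With it, a model can rewrite the question to mention regression explicitly.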

{
    "query": "regression",
    "filter": "eq(\"source\", \"docs/cs229_lectures/MachineLearning-Lecture03.pdf\")"
}
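The structured query above pairs a search string with a metadata filter. As a sketch of what such an equality filter does, the code below applies it to chunks held in plain dictionaries. `eq` and `apply_filter` are hypothetical helpers and the chunk texts are invented; in the real system the vector database evaluates the filter.

```python
def eq(attribute, value):
    """Build a predicate that is true when metadata[attribute] equals value."""
    return lambda metadata: metadata.get(attribute) == value

def apply_filter(chunks, predicate):
    """Keep only the chunks whose metadata satisfies the predicate."""
    return [chunk for chunk in chunks if predicate(chunk["metadata"])]

lecture03 = "docs/cs229_lectures/MachineLearning-Lecture03.pdf"
chunks = [
    {"text": "sample text A", "metadata": {"source": lecture03, "page": 0}},
    {"text": "sample text B",
     "metadata": {"source": "docs/cs229_lectures/MachineLearning-Lecture01.pdf", "page": 4}},
    {"text": "sample text C", "metadata": {"source": lecture03, "page": 7}},
]
filtered = apply_filter(chunks, eq("source", lecture03))
print([c["text"] for c in filtered])  # prints ['sample text A', 'sample text C']
```

Only the chunks that pass the filter are then ranked against the search string, so a question about the third lecture cannot be answered from the first one.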

# Output: UI for the question-answering system
from flask import Flask, request, render_template

app = Flask(__name__)  # Flask app

@app.route('/', methods=['GET', 'POST'])
def home():
    if request.method == 'POST':
        # Take the user's input as the question
        question = request.form.get('question')
        # ConversationalRetrievalChain: read the question, generate the answer
        result = qa({"question": question})
        print(result)
        # Pass the model's answer back to the page for rendering
        return render_template('index.html', result=result)

    return render_template('index.html')

if __name__ == "__main__":
    app.run(host='0.0.0.0', debug=True, port=5000)
