from langchain.document_loaders import PyPDFLoader
from langchain.memory import ConversationBufferMemory
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chat_models import AzureChatOpenAI
from langchain.chains import ConversationalRetrievalChain
# Initialize environment variables from the .env file
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # read local .env file
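Document Loading
To build an application that chats with a PDF document, the PDF must first be loaded into a format that LangChain can work with. LangChain provides document loaders for this, with more than 80 different loader types available.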
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,    # maximum number of characters per chunk
    chunk_overlap=20,  # characters shared between neighboring chunks
)
r_splitter.split_text(some_text)
This is the output:
["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
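To make the roles of chunk_size and chunk_overlap concrete, here is a minimal sketch of a word-level splitter with overlap. It is a simplified stand-in written for illustration, not LangChain's RecursiveCharacterTextSplitter algorithm; the function names `split_with_overlap` and `_tail_within` are hypothetical.

```python
from typing import List


def _tail_within(words: List[str], limit: int) -> List[str]:
    """Return the longest run of trailing words whose joined length is <= limit."""
    tail: List[str] = []
    for word in reversed(words):
        if len(" ".join([word] + tail)) > limit:
            break
        tail.insert(0, word)
    return tail


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Greedily pack words into chunks of at most chunk_size characters.

    Each new chunk starts with the trailing words of the previous chunk
    (up to chunk_overlap characters). A single word longer than chunk_size
    becomes its own oversized chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks: List[str] = []
    current: List[str] = []  # words of the chunk being built
    for word in text.split():
        if current and len(" ".join(current + [word])) > chunk_size:
            chunks.append(" ".join(current))
            current = _tail_within(current, chunk_overlap)
            # Drop overlap words if they would push the new chunk over the limit
            while current and len(" ".join(current + [word])) > chunk_size:
                current.pop(0)
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


demo_chunks = split_with_overlap(
    "one two three four five six seven eight nine ten",
    chunk_size=20,
    chunk_overlap=6,
)
```

In this toy run, neighboring chunks share one word, which plays the same role as the repeated "For example," in the output above: the overlap keeps some context from being cut off at a chunk boundary.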
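Professor Andrew Ng's machine learning PDF files are now stored in the vector database as embedded document chunks. We only need to query this vector database to find broadly relevant information.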
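Retrieval
Once the documents are stored in the vector database, we need to extract the information most relevant to the question or task. The basic approach is to convert the question into a vector as well, compare it against the vectors in the database, and select the n most similar chunks. Those n chunks are then passed to the LLM together with the question to produce an answer.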
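The retrieval step described above can be sketched in plain Python. This is an illustration under stated assumptions, not how Chroma or LangChain implement it: the three-dimensional vectors are hand-made stand-ins for real embeddings, cosine similarity is assumed as the similarity measure, and the helper names (`cosine_similarity`, `top_n_chunks`, `build_prompt`) and the prompt layout are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_chunks(
    query_vec: Sequence[float],
    store: Sequence[Tuple[str, Sequence[float]]],
    n: int,
) -> List[str]:
    """Return the texts of the n stored chunks most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:n]]


def build_prompt(question: str, chunks: List[str]) -> str:
    """Place the selected chunks and the question into a single prompt string."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Toy "vector database": (chunk text, hand-made embedding) pairs
toy_store = [
    ("Linear regression fits a line to data.", [0.9, 0.1, 0.0]),
    ("The class meets on Mondays.", [0.0, 0.2, 0.9]),
    ("Logistic regression is used for classification.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedded question

best = top_n_chunks(query_vec, toy_store, n=2)
prompt = build_prompt("What is regression?", best)
```

Only the two regression chunks reach the prompt; the unrelated scheduling chunk is ranked last and left out.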
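# Retrieval
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.retrievers.self_query.base import SelfQueryRetriever

# Describe the metadata fields so the LLM can build filters on them
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]
document_content_description = "Lecture notes"
llm = AzureChatOpenAI(deployment_name="GPT-4", temperature=0)

# vectordb is the vector store that already holds the embedded chunks
self_query_retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    search_type="mmr",
    search_kwargs={'k': 5, 'fetch_k': 10},
    verbose=True
)

# Compress each retrieved chunk down to the parts relevant to the question
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=self_query_retriever
)

# Keep the chat history so follow-up questions can be resolved
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)
qa = ConversationalRetrievalChain.from_llm(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
    memory=memory
)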
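We first define a SelfQueryRetriever. The name carries a flavor of the saying "the way of the sage is complete within my own nature; there is no need to seek it outside", but in practice the retriever simply calls the LLM, using a FewShotPromptTemplate, to decide whether the retrieved chunks should be filtered by their metadata. If we ask "what did they say about regression in the third lecture?", the LLM adds a filter on the chunk metadata.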
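To illustrate what filtering by metadata means, here is a minimal sketch in plain Python. It is not SelfQueryRetriever's implementation: the `structured_query` dictionary is written by hand as a stand-in for what an LLM might produce from the question, the toy chunk texts are invented for the example, and `filter_by_metadata` is a hypothetical helper. Only the `source` paths and the `page` field come from the metadata description above.

```python
from typing import Any, Dict, List


def filter_by_metadata(
    chunks: List[Dict[str, Any]],
    conditions: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Keep only chunks whose metadata equals every key/value in conditions."""
    return [
        chunk
        for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in conditions.items())
    ]


# Toy chunks with the same metadata fields as in the article (source, page)
toy_chunks = [
    {
        "text": "Intro to the course.",
        "metadata": {"source": "docs/cs229_lectures/MachineLearning-Lecture01.pdf", "page": 0},
    },
    {
        "text": "Regression was introduced earlier.",
        "metadata": {"source": "docs/cs229_lectures/MachineLearning-Lecture02.pdf", "page": 5},
    },
    {
        "text": "Regression with least squares.",
        "metadata": {"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf", "page": 2},
    },
]

# Hand-written stand-in for the LLM's output: a semantic query plus a filter
structured_query = {
    "query": "regression",
    "filter": {"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"},
}

candidates = filter_by_metadata(toy_chunks, structured_query["filter"])
```

The point of the split is that "regression" is searched semantically while "the third lecture" becomes an exact condition on the metadata, so a chunk from another lecture that also mentions regression is excluded.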
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
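For example, we first ask "Is probability a class topic?" and the AI answers: Yes, the context indicates that familiarity with basic probability and statistics is assumed for the class.
We then ask: why are those prerequisites needed? Passed directly to the vector database, this question could not be answered, because "those prerequisites" has no meaning without the previous turn. With ConversationBufferMemory in place, the LLM consults the chat history and rewrites the question as: Why are basic probability and statistics considered prerequisites for the class?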
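The rewriting step can be sketched as follows. This is a simplified illustration, not the internals of ConversationBufferMemory or ConversationalRetrievalChain: the class `SimpleBufferMemory`, the function `build_condense_prompt`, and the prompt wording are all hypothetical. The sketch only shows how stored history and a follow-up question could be combined into one prompt that asks an LLM for a standalone question.

```python
from typing import List, Tuple


class SimpleBufferMemory:
    """Stores every (question, answer) turn in order, without truncation."""

    def __init__(self) -> None:
        self.turns: List[Tuple[str, str]] = []

    def save(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def as_text(self) -> str:
        """Render the history as alternating Human/AI lines."""
        return "\n".join(f"Human: {q}\nAI: {a}" for q, a in self.turns)


def build_condense_prompt(memory: SimpleBufferMemory, follow_up: str) -> str:
    """Build a prompt asking an LLM to rewrite follow_up as a standalone question."""
    return (
        "Given the following conversation and a follow up question, "
        "rephrase the follow up question to be a standalone question.\n\n"
        f"Chat History:\n{memory.as_text()}\n"
        f"Follow Up Input: {follow_up}\n"
        "Standalone question:"
    )


memory_demo = SimpleBufferMemory()
memory_demo.save(
    "Is probability a class topic?",
    "Yes, the context indicates that familiarity with basic probability "
    "and statistics is assumed for the class.",
)
condense_prompt = build_condense_prompt(memory_demo, "why are those prerequisites needed?")
```

Because the earlier answer is part of the prompt, an LLM has what it needs to resolve "those prerequisites" before the rewritten question is sent to the retriever.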