Building a RAG System with Python and a Local LLM | 极客日志



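How to build a local retrieval-augmented generation (RAG) system with Python, Ollama, and ChromaDB. The guide covers environment setup, OCR of PDF documents, preprocessing of multimodal data (text, tables, images), storage in a vector database, and a Streamlit front end. The system runs on an ordinary laptop without a GPU, and because everything stays local, the data remains private.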

橘子海 | Published 2025/2/7 | Updated 2026/6/9

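This article explains in detail how to build a Retrieval-Augmented Generation (RAG) system with Python and large language models (LLMs). The system comprises the full data-processing pipeline, a vector database, a front end, and a back end; it learns from your own documents and runs on a laptop without a GPU.

1. Introduction

Natural language processing (NLP) is the field of AI that studies the interaction between machines and human language. Its high point so far is the arrival of large language models (LLMs), which are trained on massive amounts of text and learn the patterns and variations of language. The term 'language model' became common with the rise of deep learning and neural networks. In particular, models built on the Transformer architecture, which Google introduced in 2017, sharply improved NLP performance from 2018 onward (for example Google's BERT and OpenAI's GPT).

Today LLMs are typically used for text-to-text tasks, or 'text generation' (translation, summarization, chatbots and virtual assistants, even writing entire books). This has driven the rise of generative AI (GenAI), which focuses on producing new content such as text, images, audio, and video.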

The most advanced LLMs at present include:

OpenAI's ChatGPT
Anthropic's Claude
Google's Bard
Meta's Llama
Microsoft's Phi (the smallest; it runs on a laptop without a GPU)
StabilityAI's StableLM
Cohere's CommandR
Snowflake's Arctic
Alibaba's Qwen
01AI's Yi
X's Grok
NVIDIA's Megatron
Amazon's Olympus (not yet released)
Apple's MM1 (not yet released)

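ChatGPT is the most widely used LLM, but companies face a problem: they cannot upload sensitive data to OpenAI, mainly for privacy and security reasons. So companies are building in-house AI services that apply the power of LLMs to their private knowledge bases. This task is called Retrieval-Augmented Generation (RAG): a technique that combines retrieval with a generative model, augmenting the LLM with facts fetched from external sources.

In an organization, the knowledge base usually consists of documents with multimodal content (text, images, spreadsheets). The biggest challenge is therefore processing all of that content in a form a machine can work with. In short, each document is first converted into an embedding vector; the user's query is then converted into the same vector space, which makes a cosine-similarity search possible.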
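
To make the cosine-similarity idea concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are invented for illustration (real embeddings have hundreds of dimensions), and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, not part of any library used later.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return document keys sorted from most to least similar to the query."""
    scores = {key: cosine_similarity(query_vec, vec) for key, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy "embeddings", made up for this example
docs = {
    "revenue": [0.9, 0.1, 0.0],
    "risk": [0.1, 0.9, 0.2],
    "board": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

ranking = rank_by_similarity(query, docs)
print(ranking)
```

The query vector points in nearly the same direction as the "revenue" vector, so that document ranks first. A vector database performs the same kind of comparison, only over many more vectors and with an index to keep it fast.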

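This article shows how to build a RAG application with an LLM and multimodal data that runs on an ordinary laptop without a GPU. It provides Python code that can easily be applied to similar cases, with comments on the code so that readers can reproduce the example.

2. Setup

In business settings, PDF is the most widely used document format, because most documents are converted to PDF before being shared. PDFs also make a good case study because they contain images, tables, and text. This article therefore uses the financial statements of a publicly traded company, in PDF format, as its dataset.

There are two ways to process a PDF: read it as text or parse it as an image. Neither is perfect and the right choice depends on the use case, but OCR (optical character recognition) tends to work better, so it is the approach taken here.

First, convert the document to images: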

import pdf2image #1.17.0
doc_img = pdf2image.convert_from_path("data/doc_nvidia.pdf", dpi=300)

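Then use Tesseract to recognize the text in the images. Tesseract is a leading OCR system that began at HP in 1985 and was later developed under Google's sponsorship.

import pytesseract #0.3.10
doc_txt = []
# Run OCR on every page image and collect the text, one string per page
for page in doc_img:
    text = pytesseract.image_to_string(page)
    doc_txt.append(text)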

3. Data Preprocessing

# Tag each paragraph with its section, using the table of contents
title_map = {
    "4-12" : "Business",
    "13-33" : "Risk Factors",
    "34-44" : "Financials",
    "45-46" : "Directors",
    "47-83" : "Data"
}
lst_docs, lst_ids, lst_metadata = [], [], []
for n, page in enumerate(doc_txt):
    try:
        ## Look up the section title for this page
        title = [v for k,v in title_map.items() if n in range(int(k.split("-")[0]), int(k.split("-")[1])+1)][0]
    except IndexError:
        ## The page falls outside every range in title_map, so skip it
        continue
    ## Clean the page
    page = page.replace("Table of Contents","")
    ## Split the page into paragraphs
    for i,p in enumerate(page.split('\n\n')):
        if len(p.strip())>5:
            lst_docs.append(p.strip())
            lst_ids.append(str(n)+"_"+str(i))
            lst_metadata.append({"title":title})
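
The two steps in the loop above, finding the section for a page and splitting a page into paragraphs, can also be written as small standalone functions, which makes them easier to check in isolation. This is only a sketch: the function names `section_for_page` and `split_paragraphs` are hypothetical, and the sample text is invented.

```python
def section_for_page(page_number, title_map):
    """Return the title whose "start-end" range contains page_number, else None."""
    for page_range, title in title_map.items():
        start, end = (int(x) for x in page_range.split("-"))
        if start <= page_number <= end:
            return title
    return None

def split_paragraphs(page_text, min_chars=5):
    """Split a page on blank lines and drop fragments of min_chars characters or fewer."""
    parts = (p.strip() for p in page_text.split("\n\n"))
    return [p for p in parts if len(p) > min_chars]

title_map = {"4-12": "Business", "13-33": "Risk Factors"}
print(section_for_page(4, title_map))
print(section_for_page(2, title_map))
print(split_paragraphs("Item 1.\n\nOur business is accelerated computing.\n\n12"))
```

Returning `None` for pages outside every range replaces the exception-based skip, so the caller can decide explicitly what to do with unmapped pages.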

import ollama #0.2.1
def keyword_generator(p, top=3):
    # Ask the model for `top` keywords; the paragraph itself must be part of the prompt
    prompt = f"summarize the following paragraph in {top} keywords separated by , : {p}"
    res = ollama.generate(model="phi3", prompt=prompt)["response"]
    return res.replace("\n", " ").strip()

from tqdm.notebook import tqdm
# Attach the generated keywords to each paragraph's metadata
for i, doc in tqdm(enumerate(lst_docs), total=len(lst_docs)):
    lst_metadata[i]["keywords"] = keyword_generator(doc)
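
Small models do not always follow the "separated by ," instruction exactly; a reply may come back as a bulleted list or with trailing punctuation. The sketch below shows one possible way to tidy such replies before storing them. The function `normalize_keywords` is a hypothetical helper, not part of `ollama`, and the sample reply is invented.

```python
def normalize_keywords(response, top=3):
    """Turn a raw LLM reply into at most `top` clean keywords joined by ', '."""
    # Treat newlines as separators too, in case the model answers one keyword per line
    pieces = response.replace("\n", ",").split(",")
    keywords = []
    for piece in pieces:
        word = piece.strip(" .-*\t")  # drop bullets, spaces, and trailing periods
        already_seen = word.lower() in (k.lower() for k in keywords)
        if word and not already_seen:
            keywords.append(word)
    return ", ".join(keywords[:top])

raw = "Renewable Energy, Supplier Engagement,\nEmission Goals, Extra."
print(normalize_keywords(raw))
```

The result stays a single string rather than a list, which keeps the metadata value a plain scalar.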

# Tables: OCR flattens a table into plain text, so ask the LLM to summarize it
table = lst_docs[376]
print("Table: \n", table)
prompt = f"Summarize the following table: {table}"
res = ollama.generate(model="phi3", prompt=prompt)["response"]
print("\nSummary : \n", res)

# Images: display the image first
from matplotlib import image, pyplot
image_file = "data/image.jpeg"
pyplot.imshow(image.imread(image_file))
pyplot.show()

# Encode the image file as a base64 string, the form passed to the model below
import base64
def encode_image(path):
    with open(path, "rb") as file:
        return base64.b64encode(file.read()).decode('utf-8')
img = encode_image(image_file)

# Ask a vision model (llava) to describe the image
prompt = "describe the image"
res = ollama.generate(model="llava", prompt=prompt, images=[img])["response"]
print(res)

# Plots: same steps, with a more specific prompt
image_file = "data/plot.png"
pyplot.imshow(image.imread(image_file))
pyplot.show()
img = encode_image(image_file)
prompt = "Describe the image in detail. Be specific about graphs, such as bar plots, line graphs, etc."
res = ollama.generate(model="llava", prompt=prompt, images=[img])["response"]
print(res)
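
The text does not show how these image descriptions reach the database. One possible approach, sketched here purely as an assumption, is to treat each description as one more text document and append it to the same three lists. The helper `add_image_description`, the `"img_"` ID prefix, and the `"source"` metadata key are all hypothetical.

```python
def add_image_description(lst_docs, lst_ids, lst_metadata, description, image_path, title):
    """Append a model-written image description as one more text document."""
    doc_id = "img_" + image_path  # hypothetical ID scheme; any unique string would do
    if doc_id in lst_ids:
        raise ValueError(f"duplicate id: {doc_id}")
    lst_docs.append(description.strip())
    lst_ids.append(doc_id)
    lst_metadata.append({"title": title, "source": image_path})

# Small stand-in lists so the example runs on its own
docs, ids, metadata = ["First paragraph."], ["4_0"], [{"title": "Business"}]
add_image_description(docs, ids, metadata, " A bar plot of revenue by segment. ", "data/plot.png", "Data")
print(ids)
print(metadata[-1])
```

Keeping the three lists the same length matters, because they are later passed together and matched by position.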

4. Database

import chromadb #0.5.0
from chromadb.utils import embedding_functions

# Create a persistent database instance
db = chromadb.PersistentClient()

# Get or create a collection named "nvidia"
collection_name = "nvidia"
collection = db.get_or_create_collection(
    name=collection_name,
    embedding_function=embedding_functions.DefaultEmbeddingFunction()
)

# Add the documents, IDs, and metadata to the collection
collection.add(
    documents=lst_docs,
    ids=lst_ids,
    metadatas=lst_metadata,
    images=None,
    embeddings=None
)

# Look at one sample from the collection
collection.peek(1)

{
    'embeddings': [
        [-0.06092095375061035, -0.01741098240017891, 0.0484163761138916, ...]
    ],
    'metadatas': [
        {
            'keywords': 'Renewable Energy Adoption, Supplier Engagement, Emission Reduction Goals', 
            'title': 'Business'
        }
    ],
    'documents': [
        'We aim to generate enough renewable energy to match 100% of our global electricity usage for our offices and data centers. In fiscal year 2023, we increased the percentage of our total electricity use matched by renewable energy purchases to 44%. By fiscal year 2026, we aim to engage manufacturing suppliers comprising at least 67% of NVIDIA's scope 3 category 1 GHG emissions with the goal of effecting supplier adoption of science-based targets.'
    ],
    'uris': None,
    'data': None
}

query = "how much is the revenue?"
res_db = collection.query(query_texts=[query])["documents"][0][0:10]
context = ' '.join(res_db).replace("\n", " ")
print(context)

Total revenue for fiscal year 2024 was $60.9 billion, up 126% from a year ago. Data Center revenue for fiscal year 2024 was up 217%. Strong demand was driven by enterprise software and consumer internet applications, and multiple industry verticals including automotive, financial services, and healthcare. Gaming revenue for fiscal year 2024 was up 15%. Professional Visualization revenue for fiscal year 2024 was up 1%. Automotive revenue for the fiscal year 2024 was up 21%. The increase primarily reflected growth in self-driving platforms. Gross margin increased in fiscal year 2024, primarily driven by Data Center revenue growth and lower net inventory provisions as a percentage of revenue. Operating expenses increased for fiscal year 2024, driven by growth in employees and compensation increases.
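
Joining every retrieved paragraph can produce a long context, and a small local model has a limited context window. The following sketch shows one way to cap the context by character count while keeping the passages in rank order. The function `build_context` and the `max_chars` budget are assumptions for illustration; a character count is only a rough proxy for tokens.

```python
def build_context(passages, max_chars=2000):
    """Join retrieved passages in rank order, stopping before max_chars is exceeded."""
    selected, used = [], 0
    for passage in passages:
        text = " ".join(passage.split())  # collapse newlines and repeated spaces
        cost = len(text) + (1 if selected else 0)  # +1 for the joining space
        if used + cost > max_chars:
            break
        selected.append(text)
        used += cost
    return " ".join(selected)

passages = [
    "Total revenue was $60.9 billion,\nup 126%.",
    "Gaming revenue was up 15%.",
    "x" * 5000,  # an oversized passage that does not fit the budget
]
context_demo = build_context(passages, max_chars=100)
print(context_demo)
```

Because the passages arrive sorted by similarity, cutting from the end drops the least relevant text first.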

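res = ollama.chat(
    model="phi3",
    messages=[
        # The retrieved context is appended to the system prompt
        {"role": "system", "content": "Give the most accurate answer using only the following information: \n" + context},
        {"role": "user", "content": query}
    ],
    stream=False  # with stream=True the call returns an iterator of chunks instead of one response
)
print(res["message"]["content"])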

The total recognized revenue for fiscal year 2024 was $60.9 billion, which represents an increase of 126% from the previous year. The breakdown by category in millions of dollars is as follows:
- Data Center: $47,525 million
- Gaming: $10,447 million
- Professional Visualization: $1,553 million
- Automotive: $1,091 million
- OEM and Other: $306 million

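res = ollama.chat(
    model="phi3",
    messages=[
        # Same retrieved context, but the model may also draw on its own knowledge
        {"role": "system", "content": "Give the most accurate answer using your knowledge and the following information: \n" + context},
        {"role": "user", "content": query}
    ],
    stream=False  # with stream=True the call returns an iterator of chunks instead of one response
)
print(res["message"]["content"])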

The total recognized revenue for fiscal year 2024 was 60,922 million (or $60.9 billion when expressed in billions). This amount represents an increase of 126% from the previous year's revenue. Additionally, there is a breakdown by product categories as follows:
- Data Center: $47,525 million
- Gaming: $10,447 million
- Professional Visualization: $1,553 million
- Automotive: $1,091 million
- OEM and Other: $306 million

It's also important to note that there are deferred revenue amounts of $233 million in fiscal 2024 and $35 million in fiscal 2023 related to customer advances, which will be recognized as revenue over future periods. The remaining performance obligations account for approximately $1.1 billion, with an expectation that about 40% of this amount will be recognized within the next twelve months.
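
The two calls differ only in the instruction placed in the system prompt, so the message list can be built by one small function. This is a sketch under that assumption; `build_messages` is a hypothetical helper, and the sample context string is invented.

```python
def build_messages(query, context, use_knowledge=False):
    """Build the chat messages, placing the retrieved context in the system prompt."""
    if use_knowledge:
        instruction = "Give the most accurate answer using your knowledge and the following information:"
    else:
        instruction = "Give the most accurate answer using only the following information:"
    return [
        {"role": "system", "content": instruction + "\n" + context},
        {"role": "user", "content": query},
    ]

messages = build_messages("how much is the revenue?", "Total revenue was $60.9 billion.")
for message in messages:
    print(message["role"], "->", message["content"])
```

The returned list has the same shape as the `messages` argument used in the calls above, so it could be passed to the chat call directly.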

5. Front End

import streamlit as st #1.35.0

## Layout
st.title("Write your questions")
st.sidebar.title("Chat History")
app = st.session_state
## Initialize the session state on first run
if 'messages' not in app:
    app['messages'] = [{"role": "assistant", "content": "I'm ready to retrieve information"}]
if 'history' not in app:
    app['history'] = []
if 'full_response' not in app:
    app['full_response'] = ''

{
    'history': [
        ': how much is the revenue?',
        ': The total revenue reported in the given information is 60 million'
    ],
    'messages': [
        {'role': 'assistant', 'content': "I'm ready to retrieve information"},
        {'role': 'user', 'content': 'how much is the revenue?'},
        {'role': 'assistant', 'content': 'The total revenue reported in the given information is 60 million'}
    ],
    'full_response': 'The total revenue reported in the given information is 60 million'
}

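## Keep earlier messages visible in the chat
for msg in app["messages"]:
    if msg["role"] == "user":
        st.chat_message(msg["role"], avatar="🧑").write(msg["content"])
    elif msg["role"] == "assistant":
        st.chat_message(msg["role"], avatar="🤖").write(msg["content"])

## Chat
if txt := st.chat_input():
    ### The user writes
    app["messages"].append({"role": "user", "content": txt})
    st.chat_message("user", avatar="🧑").write(txt)
    ### The AI answers with a streamed reply
    ### (`ai` is the AI instance from the back end; it must exist before this point in the script)
    app["full_response"] = ""
    with st.chat_message("assistant", avatar="🤖"):
        placeholder = st.empty()  # one element, overwritten as chunks arrive
        for chunk in ai.respond(app["messages"], use_knowledge=True):
            app["full_response"] += chunk
            placeholder.write(app["full_response"])
    ### Store the finished answer so it is redrawn on the next rerun
    app["messages"].append({"role": "assistant", "content": app["full_response"]})
    ### Show the history
    app['history'].append(": " + txt)
    app['history'].append(": " + app["full_response"])
    st.sidebar.markdown("<br />".join(app['history']) + "<br /><br />", unsafe_allow_html=True)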
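
Because the sidebar is rendered with `unsafe_allow_html=True`, any HTML in a question or an answer would be interpreted by the browser. A cautious variant escapes the text before joining it. The sketch below uses only the standard library; `render_history` is a hypothetical helper, and the sample lines are invented.

```python
import html

def render_history(history):
    """Join chat-history lines into an HTML snippet, escaping any markup in the text."""
    safe_lines = [html.escape(line) for line in history]
    return "<br />".join(safe_lines) + "<br /><br />"

snippet = render_history([": how much is the revenue?", ": <b>60.9</b> billion"])
print(snippet)
```

Only the `<br />` separators added by the function remain as real markup; everything typed by the user or produced by the model is shown as literal text.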

6. Back End

import chromadb #0.5.0
import ollama #0.5.0

class AI:
    def __init__(self):
        # Open the persistent database and the "nvidia" collection built earlier
        db = chromadb.PersistentClient()
        self.collection = db.get_or_create_collection("nvidia")

    def query(self, q, top=10):
        # Retrieve the `top` most similar paragraphs and join them into one context string
        res_db = self.collection.query(query_texts=[q], n_results=top)["documents"][0]
        context = ' '.join(res_db).replace("\n", " ")
        return context

    def respond(self, lst_messages, use_knowledge=False):
        # The latest user message is the question used for retrieval
        q = lst_messages[-1]["content"]
        context = self.query(q)
        if use_knowledge:
            prompt = "Give the most accurate answer using your knowledge and the following information: \n" + context
        else:
            prompt = "Give the most accurate answer using only the following information: \n" + context
        res_ai = ollama.chat(
            model="phi3",
            messages=[
                {"role": "system", "content": prompt},
                *lst_messages
            ],
            stream=True
        )
        # Yield the reply chunk by chunk; the front end accumulates the full response
        for res in res_ai:
            chunk = res["message"]["content"]
            yield chunk

ai = AI()
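
The streaming contract between the back end and the front end can be tried out without a model. In the sketch below, `fake_chat_stream` is an invented stand-in for a streamed chat call: it yields dictionaries shaped like the chunks read in `respond` (`res["message"]["content"]`). The function `stream_reply` and the chunk texts are hypothetical.

```python
def stream_reply(chat_stream):
    """Yield the text of each chunk from a stream of {"message": {"content": ...}} dicts."""
    for part in chat_stream:
        yield part["message"]["content"]

def fake_chat_stream():
    # Stand-in for a streamed chat call; the chunk texts are made up
    for piece in ["Total revenue ", "was $60.9 ", "billion."]:
        yield {"message": {"content": piece}}

# The consumer accumulates chunks, as the front-end loop does
full_response = ""
for chunk in stream_reply(fake_chat_stream()):
    full_response += chunk
print(full_response)
```

Keeping the accumulation on the consumer side means the generator has no dependency on the Streamlit session state, which makes the back end usable outside the app as well.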

7. Running the Example

streamlit run rag_app.py

The total reported revenue for fiscal year 2024 was 60,922 million. This represents an increase of 126% from a year ago. Additionally, there is a breakdown by product categories as follows:
- Data Center: 47,525 million
- Gaming: 10,447 million
- Professional Visualization: 1,553 million
- Automotive: 1,091 million
- OEM and Other: 306 million

NVIDIA achieved the reported revenue through a combination of factors, including:
- **Data Center Revenue Growth**: The Data Center segment saw a significant increase of 217%, driven by strong demand from enterprise software and consumer internet applications, as well as various industry verticals such as automotive, financial services, and healthcare. Customers access NVIDIA AI infrastructure both through the cloud and on-premises, with Data Center compute revenue growing by 244% and Networking revenue by 133%.
- **Gaming Revenue Increase**: Gaming revenue rose by 15%, reflecting higher sales to partners following the normalization of channel inventory levels and growing demand.
- **Professional Visualization and Automotive Revenue**: These segments also saw growth, with Professional Visualization revenue increasing by 1% and Automotive revenue by 21%, primarily due to growth in self-driving platforms.
- **Gross Margin Improvement**: Gross margin increased due to the growth in Data Center revenue and lower net inventory provisions as a percentage of revenue.
- **Operating Expenses**: Operating expenses increased due to the growth in employees and compensation increases.

The significant increase in Data Center revenue, up 217% from the previous year, highlights NVIDIA's growing dominance in the AI and data center markets. This growth is driven by several factors:
- **Enterprise Demand**: Strong demand from enterprise customers for AI infrastructure to support applications such as machine learning, data analytics, and cloud computing.
- **Diverse Industry Verticals**: The adoption of NVIDIA's AI solutions across various industries, including automotive, financial services, and healthcare, indicates a broad-based demand for AI capabilities.
- **Cloud and On-Premises**: The ability to serve customers through both cloud-based and on-premises solutions provides flexibility and meets different customer needs.
- **Compute and Networking**: The substantial growth in both Data Center compute revenue (up 244%) and Networking revenue (up 133%) underscores the importance of NVIDIA's comprehensive AI platform, which includes both hardware and software components.

While the Automotive revenue increased by 21%, the document does not explicitly list specific challenges faced in this segment. However, potential challenges in the automotive industry for NVIDIA could include:
- **Market Competition**: Intense competition from other technology companies and traditional automotive suppliers in the autonomous driving and AI solutions space.
- **Regulatory Hurdles**: Navigating the complex regulatory landscape for autonomous vehicles and AI technologies.
- **Technological Development**: The need to continuously innovate and improve AI algorithms and hardware to meet the demanding requirements of the automotive industry.
- **Supply Chain Issues**: Potential disruptions in the supply chain for critical components needed for NVIDIA's automotive solutions.

NVIDIA's strategy for future growth is likely to focus on several key areas:
- **Continued Expansion in Data Center**: Leveraging the strong growth in the Data Center segment by expanding its AI infrastructure offerings and targeting new industries and applications.
- **Gaming Market**: Maintaining and growing its presence in the gaming market by continuing to innovate and meet the demands of gamers and developers.
- **Professional Visualization**: Further developing its professional visualization solutions to support industries such as architecture, engineering, and design.
- **Automotive Innovation**: Investing in research and development to advance its autonomous driving and AI technologies for the automotive industry.
- **Sustainability**: Focusing on sustainability initiatives, such as increasing the use of renewable energy and setting science-based targets for emissions reductions.
- **Strategic Partnerships**: Forming strategic partnerships with other companies to expand its reach and capabilities in various markets.
