Building a Local AI Q&A System in Python: Environment Setup and RAG in Practice | 极客日志

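This article walks through the full process of building a local AI Q&A system in Python. It covers virtual-environment isolation, matching the PyTorch build to your CUDA version, dependency management, calling models through Ollama, a RAG system built on LangChain and FAISS, VRAM optimization strategies, and a Gradio web interface. It focuses on common problems such as CUDA incompatibility and out-of-memory crashes, and it includes network acceleration tips for users in mainland China plus a troubleshooting table for common errors, so that developers can quickly build a Q&A application over a private knowledge base.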

RustyLab | Published 2026/4/5 | Updated 2026/7/2454 views

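Preface

Want to run an AI Q&A system on your own machine? It sounds cool, but reality often looks like this:

'Why is my CUDA version incompatible with PyTorch?' 'Why does pip install run for ages, and I still get ModuleNotFoundError at runtime?' 'Why does memory blow up halfway through loading the model?'

90% of newcomers have hit these problems. This article builds a local AI Q&A system from scratch and systematically steers you around these 'classic traps'.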

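1. Architecture overview

Before getting hands-on, be clear about what we are building.

The system has three layers:

Input layer: user question + text preprocessing
Retrieval layer (optional): RAG (retrieval-augmented generation)
Inference layer: a local LLM generates the answer

The flow is:

User enters a question
Text preprocessing
Is retrieval needed?
- Yes: search the vector database (FAISS / ChromaDB) -> recall the relevant document chunks
- No: skip retrieval and use the question as-is
Build the prompt (RAG-augmented when chunks were retrieved)
Local LLM inference (Ollama / llama.cpp)
Generate the answer
Post-processing & output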
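
The flow above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `build_prompt` and `answer` are hypothetical names, the prompt wording is made up, and the retriever and LLM are passed in as ordinary callables so the sketch runs without any model installed.

```python
from typing import Callable, Optional, Sequence


def build_prompt(question: str, passages: Sequence[str]) -> str:
    """Build the final prompt; prepend retrieved passages when there are any."""
    if not passages:
        return question
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def answer(
    question: str,
    llm: Callable[[str], str],
    retriever: Optional[Callable[[str], Sequence[str]]] = None,
) -> str:
    """Run the pipeline: preprocess -> (optional) retrieve -> prompt -> LLM -> post-process."""
    cleaned = " ".join(question.split())                  # preprocessing: collapse whitespace
    passages = retriever(cleaned) if retriever else []    # retrieval layer is optional
    prompt = build_prompt(cleaned, passages)              # RAG-augmented prompt
    return llm(prompt).strip()                            # inference + post-processing


if __name__ == "__main__":
    # Stand-ins for a real retriever and a real local model.
    fake_retriever = lambda q: ["FAISS is a library for vector similarity search."]
    fake_llm = lambda prompt: f"(echo) {prompt}"
    print(answer("  What is   FAISS? ", fake_llm, fake_retriever))
```

Passing the retriever as an optional argument mirrors the "is retrieval needed?" branch: the same function serves both plain chat and RAG.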

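2. Distribution of newcomer pitfalls

According to community feedback, the problems newcomers hit fall mainly into these categories:

Python environment / dependency conflicts (32%)
CUDA / GPU driver incompatibility (25%)
Model download failed or corrupted (18%)
Out-of-memory / out-of-VRAM crashes (12%)
Incorrect API usage (8%)
Other configuration problems (5%)

The sections below tackle them one by one, in that order of priority.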

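3. Environment setup: the first step, and the easiest to get wrong

3.1 Isolate with a virtual environment; don't pollute the global install

❌ Common newcomer mistake:

pip install torch transformers langchain # installs straight into the global environment

✅ The right way: isolate the environment with venv or conda

# Option 1: venv (recommended, built into Python)
python -m venv ai-qa-env
source ai-qa-env/bin/activate # Linux/macOS
ai-qa-env\Scripts\activate     # Windows

# Option 2: conda
conda create -n ai-qa python=3.11
conda activate ai-qa

💡 Why isolate? Different projects depend on different versions of the same libraries. Installing everything globally causes version conflicts that are very hard to debug once something breaks.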
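
Before installing anything, it can help to confirm which environment the current interpreter actually belongs to. The sketch below uses only the standard library; `active_environment` is a hypothetical helper name. It relies on two facts: inside a venv `sys.prefix` differs from `sys.base_prefix`, and conda exports `CONDA_DEFAULT_ENV` for the active environment (including `base`).

```python
import os
import sys


def active_environment() -> str:
    """Describe the environment the running interpreter belongs to."""
    # In a venv, sys.prefix points at the venv while sys.base_prefix
    # still points at the Python installation it was created from.
    if sys.prefix != sys.base_prefix:
        return f"venv: {sys.prefix}"
    # conda sets this variable when an environment is activated.
    conda_env = os.environ.get("CONDA_DEFAULT_ENV")
    if conda_env:
        return f"conda: {conda_env}"
    return "global"


if __name__ == "__main__":
    print(active_environment())
    print(f"interpreter: {sys.executable}")
```

If this prints `global` right before a `pip install`, the packages are about to land in the system-wide site-packages.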

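3.2 Installing PyTorch: matching versions is the key

This is the most frequent pitfall. The PyTorch install command depends on your CUDA version, so you can't just blindly run pip install torch.

Step 1: check your CUDA version

nvidia-smi # shows the highest CUDA version the GPU driver supports
nvcc --version # shows the installed CUDA toolkit version (only if the toolkit is installed)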

Step 2: install the PyTorch build that matches your CUDA version

# Example install command for CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# No GPU, CPU only
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
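
The two commands differ only in the last path segment of the index URL (`cu121` versus `cpu`). The helper below is a sketch of that naming pattern; `torch_index_url` and `pip_command` are hypothetical names. It is an assumption that the same pattern holds for other CUDA versions, and PyTorch only publishes wheels for some of them, so confirm the exact command on the official PyTorch install page.

```python
from typing import List, Optional

BASE_URL = "https://download.pytorch.org/whl"


def torch_index_url(cuda_version: Optional[str]) -> str:
    """Map a CUDA version such as "12.1" to a wheel index URL; None means CPU-only."""
    if not cuda_version:
        return f"{BASE_URL}/cpu"
    parts = cuda_version.split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError(f"expected a version like '12.1', got {cuda_version!r}")
    major, minor = parts[:2]
    return f"{BASE_URL}/cu{major}{minor}"


def pip_command(cuda_version: Optional[str]) -> List[str]:
    """Build the pip install command as an argument list (nothing is executed here)."""
    return [
        "pip", "install", "torch", "torchvision", "torchaudio",
        "--index-url", torch_index_url(cuda_version),
    ]


if __name__ == "__main__":
    print(" ".join(pip_command("12.1")))
    print(" ".join(pip_command(None)))
```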

Step 3: verify the installation

import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    print(f"Current GPU: {torch.cuda.get_device_name(0)}")
    print(f"Total VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

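3.3 Dependency management: pin versions with requirements.txt

# Snapshot the dependencies of the current environment
pip freeze > requirements.txt

# Restore them in a new environment
pip install -r requirements.txt

An example requirements.txt with pinned versions:

torch==2.2.0
transformers==4.38.0
langchain==0.1.9
langchain-community==0.0.24
faiss-cpu==1.7.4
sentence-transformers==2.5.1
ollama==0.1.7
gradio==4.19.2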
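
To check whether the active environment really matches the pinned versions, the pins can be compared against the installed packages. This is a minimal sketch using the standard-library `importlib.metadata`; `parse_pins` and `check_pins` are hypothetical helpers, and only simple `name==version` lines are handled (no extras, markers, or version ranges).

```python
from importlib import metadata
from typing import Dict, Optional, Tuple


def parse_pins(text: str) -> Dict[str, str]:
    """Extract {package: version} from 'name==version' lines, ignoring comments."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def check_pins(pins: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {package: (wanted, installed or None)} for every mismatch."""
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None                   # package is not installed at all
        if installed != wanted:
            problems[name] = (wanted, installed)
    return problems


if __name__ == "__main__":
    sample = "torch==2.2.0\ntransformers==4.38.0\n"
    for pkg, (wanted, installed) in check_pins(parse_pins(sample)).items():
        print(f"{pkg}: wanted {wanted}, found {installed}")
```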

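4. Model download: don't let the network ruin your mood

4.1 Manage local models with Ollama (strongly recommended)

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Download models (ollama pull only downloads; use `ollama run <model>` to chat)
ollama pull llama3.2           # Meta Llama 3.2 (3B)
ollama pull qwen2.5:7b         # Alibaba Qwen 2.5 (7B)
ollama pull deepseek-r1:7b     # DeepSeek R1 (7B)

# List the downloaded models to verify
ollama list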
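
The same check can be done from Python before any chat call, which gives a clearer failure than a raw connection error. The sketch below queries the Ollama REST endpoint `GET /api/tags` with the standard library only; `list_ollama_models` is a hypothetical helper, and the default address `http://localhost:11434` assumes Ollama is running with its default settings.

```python
import json
import urllib.request
from typing import List, Optional


def list_ollama_models(
    base_url: str = "http://localhost:11434", timeout: float = 2.0
) -> Optional[List[str]]:
    """Return the names of locally available models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts (URLError is a subclass);
        # ValueError covers a response that is not valid JSON.
        return None
    return [m["name"] for m in payload.get("models", [])]


if __name__ == "__main__":
    models = list_ollama_models()
    if models is None:
        print("Ollama is not reachable; try starting it with `ollama serve`.")
    else:
        print("Available models:", ", ".join(models) or "(none pulled yet)")
```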

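4.2 Calling Ollama from Python

import ollama

def ask_local_llm(question: str, model: str = "qwen2.5:7b") -> str:
    """
    Ask a question to a local Ollama model.
    Args:
        question: the user's question
        model: model name
    Returns:
        the model's answer
    """
    response = ollama.chat(
        model=model,
        messages=[{"role": "system", "content": "You are a professional AI assistant. Answer questions concisely and accurately in Chinese."},
                  {"role": "user", "content": question}]
    )
    return response["message"]["content"]

# Quick test
if __name__ == "__main__":
    answer = ask_local_llm("What is the GIL in Python?")
    print(answer)

5. Building the RAG Q&A system

5.2 Complete implementation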

"""
Local RAG Q&A system
Dependencies: pip install langchain langchain-community faiss-cpu sentence-transformers ollama
"""
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader, DirectoryLoader
from langchain.chains import RetrievalQA
from langchain_community.llms import Ollama
import os

class LocalRAGSystem:
    """Local RAG Q&A system"""
    def __init__(self, docs_dir: str = "./docs", model_name: str = "qwen2.5:7b",
                 embedding_model: str = "BAAI/bge-small-zh-v1.5", chunk_size: int = 500,
                 chunk_overlap: int = 50):
        self.docs_dir = docs_dir
        self.model_name = model_name
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        print("🔧 Initializing the embedding model...")
        # Use a local embedding model so that no external API is called
        self.embeddings = HuggingFaceEmbeddings(
            model_name=embedding_model,
            model_kwargs={"device": "cpu"},  # change to "cuda" for GPU acceleration
            encode_kwargs={"normalize_embeddings": True}
        )
        self.vectorstore = None
        self.qa_chain = None

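    def load_and_index(self):
        """Load the documents and build the vector index"""
        print(f"📂 Loading document directory: {self.docs_dir}")
        # Only .txt files are loaded here; add other loaders to support more formats
        loader = DirectoryLoader(
            self.docs_dir, glob="**/*.txt", loader_cls=TextLoader,
            loader_kwargs={"encoding": "utf-8"}
        )
        documents = loader.load()
        print(f"✅ Loaded {len(documents)} documents")

        # Split the text into chunks (Chinese sentence-ending punctuation is included as separators)
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            separators=["\n\n", "\n", "。", "！", "？", " ", ""]
        )
        chunks = splitter.split_documents(documents)
        print(f"✅ Split into {len(chunks)} text chunks")

        # Build the vector index
        print("🔍 Building the vector index (slow the first time, please be patient)...")
        self.vectorstore = FAISS.from_documents(chunks, self.embeddings)
        print("✅ Vector index built")

        # Save the index locally (load it directly next time, no rebuild needed)
        self.vectorstore.save_local("./faiss_index")
        print("💾 Index saved to ./faiss_index")

    def load_existing_index(self):
        """Load an existing vector index"""
        if os.path.exists("./faiss_index"):
            print("📦 Loading the existing vector index...")
            # The index is stored with pickle: only load index files you created yourself
            self.vectorstore = FAISS.load_local(
                "./faiss_index", self.embeddings, allow_dangerous_deserialization=True
            )
            print("✅ Index loaded")
        else:
            print("⚠️ No existing index found; call load_and_index() first")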

    def build_qa_chain(self):
        """Build the question-answering chain"""
        if self.vectorstore is None:
            raise ValueError("Call load_and_index() or load_existing_index() first")
        print(f"🤖 Connecting to local LLM: {self.model_name}")
        llm = Ollama(model=self.model_name, temperature=0.1)  # low randomness for more stable answers
        retriever = self.vectorstore.as_retriever(
            search_type="similarity", search_kwargs={"k": 3}  # recall the 3 most relevant chunks
        )
        self.qa_chain = RetrievalQA.from_chain_type(
            llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True
        )
        print("✅ Q&A system ready!")

    def ask(self, question: str) -> dict:
        """
        Ask a question and get the answer.
        Returns:
            dict: {"answer": str, "sources": list}
        """
        if self.qa_chain is None:
            raise ValueError("Call build_qa_chain() first")
        result = self.qa_chain.invoke({"query": question})
        return {
            "answer": result["result"],
            "sources": [doc.metadata.get("source", "unknown source") for doc in result["source_documents"]]
        }

# ============ Usage example ============
if __name__ == "__main__":
    # Initialize the system
    rag = LocalRAGSystem(docs_dir="./my_docs", model_name="qwen2.5:7b")

    # First run: load the documents and build the index
    rag.load_and_index()

    # Later runs: load the existing index directly (faster)
    # rag.load_existing_index()

    # Build the Q&A chain
    rag.build_qa_chain()

    # Start asking questions
    while True:
        question = input("\n❓ Enter a question (q to quit): ").strip()
        if question.lower() == "q":
            break
        result = rag.ask(question)
        print(f"\n💡 Answer:\n{result['answer']}")
        print(f"\n📎 Sources: {', '.join(result['sources'])}")
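
The `chunk_size` and `chunk_overlap` parameters are easier to reason about with a stripped-down splitter. The sketch below is not how `RecursiveCharacterTextSplitter` works internally (that class tries the separators in order so chunks break at paragraph and sentence boundaries). It is a plain fixed-window illustration, with the hypothetical name `split_with_overlap`, of how consecutive chunks share `chunk_overlap` characters.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap          # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):    # the window already reached the end
            break
    return chunks


if __name__ == "__main__":
    for chunk in split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1):
        print(chunk)
```

The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of indexing some text twice.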

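6. Memory / VRAM management: don't let OOM ruin your day

6.1 VRAM requirements at a glance

Model size	Precision	VRAM required (approx.)
1B	FP16	~2 GB
3B	FP16	~6 GB
7B	FP16	~14 GB
13B	FP16	~26 GB
7B	4-bit quantized	~4 GB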
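
The figures in the table follow from simple arithmetic: parameter count times bytes per parameter (FP16 uses 2 bytes, 4-bit uses half a byte). The sketch below reproduces them; `estimate_weight_memory_gb` is a hypothetical helper, it counts 1 GB as 10^9 bytes, and it covers the weights only. The KV cache, activations, and framework overhead come on top, so treat the result as a lower bound.

```python
def estimate_weight_memory_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """Estimate the memory needed for model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for size in (1, 3, 7, 13):
        print(f"{size}B FP16 : ~{estimate_weight_memory_gb(size, 16):.1f} GB")
    print(f"7B 4-bit: ~{estimate_weight_memory_gb(7, 4):.1f} GB")
```

For a 7B model this gives 14 GB at FP16 and 3.5 GB at 4-bit, which the table rounds up to about 4 GB.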

6.2 Not enough VRAM? Compress the model with quantization

# Requires: pip install bitsandbytes accelerate (4-bit loading via bitsandbytes needs a CUDA GPU)
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit quantization config (about 75% less VRAM than FP16)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
model_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quantization_config, device_map="auto"  # place layers on GPU/CPU automatically
)
print(f"Model loaded, VRAM in use: {torch.cuda.memory_allocated()/1024**3:.2f} GB")

6.3 Stream the output to avoid waiting on timeouts

import ollama

def stream_answer(question: str, model: str = "qwen2.5:7b"):
    """Stream the output, printing tokens as they are generated"""
    print("💬 ", end="", flush=True)
    for chunk in ollama.chat(
        model=model,
        messages=[{"role": "user", "content": question}],
        stream=True  # enable streaming
    ):
        content = chunk["message"]["content"]
        print(content, end="", flush=True)
    print()

stream_answer("Explain the Transformer architecture in one sentence")
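
`stream_answer` prints the pieces but does not keep them. If the full text is also needed afterwards (for logging or for a chat history), the chunks can be accumulated while they are echoed. The sketch below, with the hypothetical name `collect_stream`, takes any iterable of chunks shaped like the ones above (`chunk["message"]["content"]`), so it can be tried with fake data and no model.

```python
from typing import Dict, Iterable


def collect_stream(chunks: Iterable[Dict], echo: bool = True) -> str:
    """Print streamed pieces as they arrive and return the concatenated text."""
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]
        if echo:
            print(piece, end="", flush=True)   # show partial output immediately
        parts.append(piece)
    if echo:
        print()                                # final newline after the stream ends
    return "".join(parts)


if __name__ == "__main__":
    fake_chunks = [
        {"message": {"content": "Attention "}},
        {"message": {"content": "is all "}},
        {"message": {"content": "you need."}},
    ]
    full_text = collect_stream(fake_chunks)
    print(f"{len(full_text)} characters received")
```

With a real model, the call would look like `collect_stream(ollama.chat(model=..., messages=..., stream=True))`.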

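7. Add a web interface (optional)

import gradio as gr
from local_rag import LocalRAGSystem  # assumes the RAG code above is saved as local_rag.py

# Initialize the RAG system
rag = LocalRAGSystem()
rag.load_existing_index()
rag.build_qa_chain()

def chat(message: str, history: list) -> str:
    """Gradio chat callback"""
    if not message.strip():
        return "Please enter a question"
    result = rag.ask(message)
    answer = result["answer"]
    sources = result["sources"]
    if sources:
        # sorted() keeps the order of the deduplicated sources stable
        answer += f"\n\n---\n📎 **Sources**: {', '.join(sorted(set(sources)))}"
    return answer

# Create the Gradio interface
demo = gr.ChatInterface(
    fn=chat,
    title="🤖 Local AI Q&A System",
    description="Private knowledge-base Q&A based on a local LLM + RAG",
    examples=["How does this system work?", "Please summarize the main content"],
    theme=gr.themes.Soft(),
)

if __name__ == "__main__":
    demo.launch(
        server_name="0.0.0.0",  # listens on all interfaces; use "127.0.0.1" for local-only access
        server_port=7860,
        share=False  # set to True to generate a public link
    )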

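8. Common errors quick reference

Error message	Cause	Fix
`CUDA out of memory`	Not enough VRAM	Use a quantized model or reduce batch_size
`ModuleNotFoundError: No module named 'torch'`	Virtual environment not activated, or the package is not installed in it	Activate the right venv/conda environment, then install the package there if it is still missing
`RuntimeError: CUDA error: no kernel image is available`	PyTorch build does not match the CUDA version or the GPU	Reinstall the PyTorch build for your CUDA version
`ConnectionRefusedError: [Errno 111]`	Ollama service not running	Run `ollama serve`
`OSError: [Errno 28] No space left on device`	Disk is full	Free up disk space or switch to another storage path
`ValueError: Tokenizer class ... not found`	transformers version too old	`pip install -U transformers`
`huggingface_hub.utils._errors.EntryNotFoundError`	Wrong model name or network problem	Check the model ID or use a mirror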
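
The table can also be turned into a small lookup that prints a hint when a script fails. This is only a sketch: `KNOWN_ERRORS` and `suggest_fix` are hypothetical names, the matching is a plain case-insensitive substring search over the messages listed above, and a match is a hint rather than a diagnosis (`Errno 111`, for instance, is raised for any refused connection, not only Ollama).

```python
from typing import Optional

# (substring to look for, suggested fix), taken from the quick-reference table
KNOWN_ERRORS = [
    ("CUDA out of memory", "Use a quantized model or reduce batch_size."),
    ("No module named", "Activate the right venv/conda environment."),
    ("no kernel image is available", "Reinstall the PyTorch build for your CUDA version."),
    ("Errno 111", "Start the Ollama service with `ollama serve`."),
    ("No space left on device", "Free up disk space or change the storage path."),
    ("Tokenizer class", "Upgrade transformers: pip install -U transformers."),
    ("EntryNotFoundError", "Check the model ID or use a mirror."),
]


def suggest_fix(error_text: str) -> Optional[str]:
    """Return the first matching hint for an error message, or None if nothing matches."""
    lowered = error_text.lower()
    for needle, hint in KNOWN_ERRORS:
        if needle.lower() in lowered:
            return hint
    return None


if __name__ == "__main__":
    try:
        import surely_missing_module_xyz  # noqa: F401  (deliberately fails)
    except ImportError as exc:
        print(f"{type(exc).__name__}: {exc}")
        print("Hint:", suggest_fix(str(exc)))
```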

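9. Speed-up tips for mainland China

# Set a HuggingFace mirror (faster access from mainland China)
export HF_ENDPOINT=https://hf-mirror.com

# Use the Tsinghua mirror with pip
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple transformers

# Use the Tsinghua mirror with conda
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main

# Set the mirror from Python code (must happen before transformers is imported)
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from transformers import AutoTokenizer

# Later downloads go through the mirror automatically
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")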
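
Hard-coding the mirror in a script overrides whatever the user exported in the shell. A slightly gentler variant, sketched below with the hypothetical helper name `use_hf_mirror`, sets `HF_ENDPOINT` only when it is not already defined. As in the snippet above, it has to run before `transformers` or `huggingface_hub` is imported, because the endpoint is read at import time.

```python
import os


def use_hf_mirror(endpoint: str = "https://hf-mirror.com") -> str:
    """Set HF_ENDPOINT unless the user already exported one; return the value in effect."""
    # setdefault leaves an existing value untouched and returns whichever value applies.
    return os.environ.setdefault("HF_ENDPOINT", endpoint)


if __name__ == "__main__":
    print("Using HuggingFace endpoint:", use_hf_mirror())
    # Import transformers only after this point so the setting is picked up.
```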

