Running LLaMA-13B on Google Colab with LangChain: A Hands-On Guide | 极客日志


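Load a LLaMA-13B model with llama.cpp on a free Google Colab instance and combine it with LangChain to build chat, routing, memory, and Python agent features. Quantization keeps resource use low, which shows that local inference with open-source large models is feasible.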

雾岛听风 · Published 2026/4/10 · Updated 2026/5/2216 views

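This article runs a LLaMA-13B model on a free Google Colab instance and tests several LangChain features, such as chat applications and agents. Every component is an open-source project, so the whole setup is free.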

Environment Setup and LLaMA.cpp

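LLaMA.cpp is an interesting open-source project. It was originally designed for the MacBook and now supports CUDA, OpenCL, and Apple Silicon, and it even runs on a Raspberry Pi. It is written in pure C/C++ with no external dependencies, and it can be connected to LangChain, so LangChain's features can be tested without an OpenAI key. Google Colab offers free Python notebooks with 12 GB of RAM and 16 GB of VRAM, which is enough to run it.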

Installing the library is simple: just enable LLAMA_CUBLAS before running pip:

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install llama-cpp-python
!pip3 install huggingface-hub
!pip3 install sentence-transformers langchain langchain-experimental
!huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir /content --local-dir-use-symlinks False

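For the first test I used the 7B model. The huggingface-hub library is installed so that the model can be downloaded automatically in GGUF format, and the LangChain libraries are installed for the later tests.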

Load the model and check that it works:

from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

n_gpu_layers = 40  # number of model layers to offload to the GPU
n_batch = 512  # number of prompt tokens processed in parallel
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

llm = LlamaCpp(
    model_path="/content/llama-2-7b-chat.Q4_K_M.gguf",
    temperature=0.1,
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,
)

Once the model is loaded, one line of code is enough to test it:

llm("What is the distance to the Moon? Write the short answer.")

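StreamingStdOutCallbackHandler is used here to get smooth streaming output similar to ChatGPT. As for resources, thanks to 4-bit quantization the 7B model fits comfortably within the Google Colab free-tier limits: it needs only about 1.6 GB of RAM and 4.2 GB of VRAM, so in theory it can run on almost any budget PC. On Google Colab we can even run a 13B model completely free of charge by changing the repository and file name in the download command: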

!huggingface-cli download TheBloke/Llama-2-13B-chat-GGUF llama-2-13b-chat.Q4_K_M.gguf --local-dir /content --local-dir-use-symlinks False

This model needs more resources, but it still fits on a free instance.
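As a rough sanity check on these numbers, the size of a quantized model can be estimated from its parameter count and the average number of bits stored per weight. The sketch below is back-of-the-envelope arithmetic, not a measurement: the parameter counts (about 6.74 billion and 13.0 billion) and the figure of roughly 4.85 bits per weight for Q4_K_M are assumptions, and real VRAM use is higher because of the context cache and scratch buffers.

```python
def estimate_model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed values, not taken from the article
Q4_K_M_BITS = 4.85
MODELS = {"7B": 6.74e9, "13B": 13.0e9}

for name, n_params in MODELS.items():
    fp16 = estimate_model_size_gb(n_params, 16)
    q4 = estimate_model_size_gb(n_params, Q4_K_M_BITS)
    print(f"{name}: fp16 ~{fp16:.1f} GB, Q4_K_M ~{q4:.1f} GB")
```

Under these assumptions the quantized 13B weights come to a little under 8 GB, which fits in 16 GB of VRAM, while the same model in 16-bit precision (about 26 GB) would not.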

LangChain Integration

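LangChain is an open-source Python framework for developing applications powered by language models. In theory it is cross-platform, but most examples in the official documentation target paid APIs. With LLaMA.cpp, the library can be learned at zero cost.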

Building a Basic Chain

from langchain.prompts import PromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.callbacks.tracers import ConsoleCallbackHandler

template = """<s>[INST] <<SYS>> Provide a correct and short answer to the question. <</SYS>> {question} [/INST]"""
prompt = PromptTemplate(template=template, input_variables=["question"])
chain = prompt | llm | StrOutputParser()
chain.invoke(
    {"question": "What is the distance to the Moon?"},
    # config={"callbacks": [ConsoleCallbackHandler()]},  # uncomment to trace each step
)
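
The `|` operator in the chain above passes the output of each step to the next one. Here is a minimal sketch of that idea in plain Python, assuming nothing about LangChain's internals: the `Step` class and the `fake_llm` function are hypothetical stand-ins, so the example runs without a model.

```python
from typing import Any, Callable


class Step:
    """A callable wrapper that can be chained with `|` (hypothetical, not LangChain's class)."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # Compose left to right: run self first, then feed the result to other
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


TEMPLATE = "<s>[INST] <<SYS>> Provide a correct and short answer to the question. <</SYS>> {question} [/INST]"


def fake_llm(prompt_text: str) -> str:
    # Stand-in for the real model: reports the prompt length instead of generating text
    return f"  (model reply to a {len(prompt_text)}-character prompt)  "


prompt_step = Step(lambda inputs: TEMPLATE.format(**inputs))  # dict -> prompt string
llm_step = Step(fake_llm)  # prompt string -> raw text
parser_step = Step(str.strip)  # raw text -> cleaned text

demo_chain = prompt_step | llm_step | parser_step
print(demo_chain.invoke({"question": "What is the distance to the Moon?"}))
```

Each step only needs to accept the previous step's output type, which is why the prompt, the model, and the output parser can be swapped independently.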

Combining Chains

template2 = """<s>[INST] <<SYS>> Use the summary {summary} and give 2 one sentence examples of practical applications of the subject <</SYS>> [/INST] """
prompt2 = PromptTemplate(input_variables=["summary"], template=template2)
chain2 = {"summary": prompt | llm | StrOutputParser()} | prompt2 | llm | StrOutputParser()
chain2.invoke(
    {"question": "What is the distance to the Moon?"},
    # config={"callbacks": [ConsoleCallbackHandler()]},  # uncomment to trace each step
)
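
In the combined chain, the dictionary with the `summary` key runs the first chain and hands its result to the second prompt under that key. Here is a minimal sketch of that data flow with plain functions; `fake_llm`, `first_chain`, `run_mapping`, and `second_chain` are hypothetical helpers, and the shortened templates are for illustration only.

```python
from typing import Any, Callable, Dict


def fake_llm(prompt_text: str) -> str:
    # Stand-in for the real model so the example runs anywhere
    return f"[answer to: {prompt_text}]"


def first_chain(inputs: Dict[str, str]) -> str:
    # Plays the role of `prompt | llm | StrOutputParser()`
    return fake_llm("Answer briefly: {question}".format(**inputs))


def run_mapping(mapping: Dict[str, Callable[[Any], Any]], inputs: Any) -> Dict[str, Any]:
    # Every value receives the same input; results are collected under the same keys
    return {key: func(inputs) for key, func in mapping.items()}


def second_chain(inputs: Dict[str, str]) -> str:
    # Step 1: the mapping produces {"summary": <output of the first chain>}
    variables = run_mapping({"summary": first_chain}, inputs)
    # Step 2: that dictionary fills the second template
    return fake_llm("Give 2 applications of: {summary}".format(**variables))


print(second_chain({"question": "What is the distance to the Moon?"}))
```

The printed string nests the first answer inside the second prompt, which makes the order of the two model calls visible.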

Automatic Routing

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.utils.math import cosine_similarity
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough

space_template = """<s>[INST] <<SYS>> You are an astronaut. You are great at answering questions about space. Provide a short answer to the question, understandable to a small kid. <</SYS>> {query} [/INST]"""
math_template = """<s>[INST] <<SYS>> You are a mathematician. You are great at answering math questions. Provide a short answer to the question. <</SYS>> {query} [/INST]"""
embeddings = HuggingFaceEmbeddings()
prompt_templates = [space_template, math_template]
prompt_embeddings = embeddings.embed_documents(prompt_templates)

def prompt_router(input):
    query_embedding = embeddings.embed_query(input["query"])
    similarity = cosine_similarity([query_embedding], prompt_embeddings)[0]
    most_similar = prompt_templates[similarity.argmax()]
    print("Using MATH" if most_similar == math_template else "Using SPACE")
    return PromptTemplate.from_template(most_similar)

chain = (
    {"query": RunnablePassthrough()}
    | RunnableLambda(prompt_router)
    | llm
    | StrOutputParser()
)
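
The router picks the template whose embedding is closest to the query by cosine similarity. The sketch below reproduces that selection step with a toy bag-of-words embedding in place of `HuggingFaceEmbeddings`, so it runs without downloading a model; the `embed` function, the vocabulary, and the shortened template texts are assumptions made for illustration.

```python
import math
from typing import Dict, List

VOCAB = ["space", "planet", "moon", "astronaut", "math", "number", "root", "equation"]

TEMPLATES: Dict[str, str] = {
    "SPACE": "You are an astronaut. You answer questions about space, the moon and any planet.",
    "MATH": "You are a mathematician. You answer math questions about a number, root or equation.",
}


def embed(text: str) -> List[float]:
    # Toy embedding: count how often each vocabulary word appears in the text
    words = [w.strip(".,?!") for w in text.lower().split()]
    return [float(words.count(term)) for term in VOCAB]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(query: str) -> str:
    # Return the name of the template most similar to the query
    query_vec = embed(query)
    scores = {name: cosine_similarity(query_vec, embed(text)) for name, text in TEMPLATES.items()}
    return max(scores, key=scores.get)


print(route("How far is the moon from our planet?"))
print(route("What is the square root of this number?"))
```

With the real chain, the query is passed in as a plain string, for example `chain.invoke("How far is the Sun?")`, because `RunnablePassthrough` places the raw input under the `query` key.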

Basic Chat

from langchain.chains import LLMChain
from langchain.prompts.chat import ChatPromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate
from langchain.schema import AIMessage, HumanMessage
from langchain_experimental.chat_models import Llama2Chat

sys_template = """<s>[INST] <<SYS>> Act as an experienced AI assistant. Write only one sentence answers. <</SYS>> [/INST] """
chat_prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(sys_template),
    HumanMessage(content="Hello, how are you doing?"),
    AIMessage(content="I'm doing well, thanks!"),
    HumanMessage(content="May I ask you a question about Moon?"),
    AIMessage(content="Yes, sure."),
    HumanMessagePromptTemplate.from_template("{question}"),
])
model = Llama2Chat(llm=llm)
chain = chat_prompt | model | StrOutputParser()
chain.invoke(
    {"question": "How big is it?"},
    # config={"callbacks": [ConsoleCallbackHandler()]},  # uncomment to trace each step
)
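
A chat wrapper such as `Llama2Chat` has to flatten the message list into a single prompt string in the Llama 2 chat format. The sketch below shows one way to do that by hand; `render_llama2_prompt` is a hypothetical helper, and the exact whitespace the real wrapper emits may differ.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content); role is "system", "human" or "ai"


def render_llama2_prompt(messages: List[Message]) -> str:
    """Flatten a conversation into one Llama 2 style prompt string."""
    system_text = ""
    parts: List[str] = []
    for role, content in messages:
        if role == "system":
            # The system text is embedded in the first user turn
            system_text = f"<<SYS>>\n{content}\n<</SYS>>\n\n"
        elif role == "human":
            parts.append(f"<s>[INST] {system_text}{content} [/INST]")
            system_text = ""  # only the first user turn carries the system text
        elif role == "ai":
            # A model reply closes the turn opened by the previous human message
            parts.append(f" {content} </s>")
        else:
            raise ValueError(f"unknown role: {role}")
    return "".join(parts)


conversation = [
    ("system", "Act as an experienced AI assistant. Write only one sentence answers."),
    ("human", "Hello, how are you doing?"),
    ("ai", "I'm doing well, thanks!"),
    ("human", "May I ask you a question about Moon?"),
    ("ai", "Yes, sure."),
    ("human", "How big is it?"),
]
print(render_llama2_prompt(conversation))
```

The rendered prompt ends with an open `[INST] ... [/INST]` turn, so the model's next tokens are the answer to the last question, with the earlier turns as context.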

Chat with Memory and Message Summarization

from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory, ConversationSummaryMemory, CombinedMemory, ChatMessageHistory

conv_memory = ConversationBufferMemory(memory_key="chat_history_lines", input_key="input")
summary_memory = ConversationSummaryMemory(llm=llm, input_key="input")
memory = CombinedMemory(memories=[conv_memory, summary_memory])

template = """<s>[INST] <<SYS>> Act as an experienced AI assistant. Write one-sentence answers only. <</SYS>> Summary of conversation: {history} Current conversation: {chat_history_lines} Human: {input} [/INST] """

summary_memory.save_context({"input":"Hi, how are you"},{"output":"Thanks, I am fine"})
summary_memory.save_context({"input":"May I ask you questions about Moon?"},{"output":"Yes, sure"})
summary_memory.load_memory_variables({})

prompt = PromptTemplate(input_variables=["history","input","chat_history_lines"], template=template,)
conversation = ConversationChain(llm=llm, verbose=True, memory=memory, prompt=prompt)
conversation.run("How far is it?")
conversation.run("And what about Mars?")
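
`CombinedMemory` feeds the prompt from two sources: the verbatim exchanges and a running summary. Here is a minimal sketch of that bookkeeping in plain Python; the `SimpleCombinedMemory` class is hypothetical, and its `summarize` callback (a trivial stand-in that keeps the first four words of each line) takes the place of the LLM call that `ConversationSummaryMemory` would make.

```python
from typing import Callable, Dict, List


def naive_summarize(previous_summary: str, new_lines: str) -> str:
    # Stand-in for an LLM summarizer: keep the first four words of each new line
    shortened = "; ".join(" ".join(line.split()[:4]) for line in new_lines.splitlines())
    return f"{previous_summary} {shortened}".strip()


class SimpleCombinedMemory:
    """Keeps a verbatim buffer and a running summary (hypothetical class)."""

    def __init__(self, summarize: Callable[[str, str], str]):
        self.summarize = summarize
        self.buffer: List[str] = []
        self.summary = ""

    def save_context(self, user_input: str, output: str) -> None:
        new_lines = f"Human: {user_input}\nAI: {output}"
        self.buffer.append(new_lines)
        self.summary = self.summarize(self.summary, new_lines)

    def load_memory_variables(self) -> Dict[str, str]:
        # Keys mirror the prompt variables used in the template above
        return {"history": self.summary, "chat_history_lines": "\n".join(self.buffer)}


memory_demo = SimpleCombinedMemory(naive_summarize)
memory_demo.save_context("Hi, how are you", "Thanks, I am fine")
memory_demo.save_context("May I ask you questions about Moon?", "Yes, sure")

template_demo = (
    "Summary of conversation: {history}\n"
    "Current conversation:\n{chat_history_lines}\n"
    "Human: {input}"
)
print(template_demo.format(input="How far is it?", **memory_demo.load_memory_variables()))
```

Because the earlier turns are in the prompt, a follow-up such as "How far is it?" can be resolved to the Moon even though the question itself never names it.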

Agents

from langchain_experimental.tools import PythonREPLTool
tool = PythonREPLTool()
tool.run('import math; print(math.sqrt(5))')

from langchain_experimental.agents.agent_toolkits import create_python_agent
from langchain.agents.agent_types import AgentType

agent = create_python_agent(llm=llm, tool=tool, verbose=True, agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION)
agent.agent.llm_chain.verbose = True
agent.run("What is a square root of 5?")
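
A zero-shot ReAct agent alternates between model output and tool calls: the model writes an `Action` and an `Action Input`, the framework runs the tool, and the result is appended as an `Observation` until the model emits a `Final Answer`. The sketch below imitates that loop with a scripted stand-in for the model; `scripted_llm`, `run_python`, and `run_agent` are hypothetical, and the parsing is far simpler than what LangChain does.

```python
import contextlib
import io
import re


def run_python(code: str) -> str:
    # Execute the code and capture whatever it prints (unsafe for untrusted input)
    stdout = io.StringIO()
    with contextlib.redirect_stdout(stdout):
        exec(code, {})
    return stdout.getvalue().strip()


def scripted_llm(transcript: str) -> str:
    # Stand-in for the model: first asks for the tool, then answers from the observation
    if "Observation:" not in transcript:
        return (
            "Thought: I should compute it.\n"
            "Action: Python_REPL\n"
            "Action Input: import math; print(math.sqrt(5))"
        )
    observation = transcript.rsplit("Observation:", 1)[1].strip()
    return f"Thought: I now know the answer.\nFinal Answer: {observation}"


def run_agent(question: str, llm, max_steps: int = 5) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        reply = llm(transcript)
        final = re.search(r"Final Answer:\s*(.*)", reply, re.DOTALL)
        if final:
            return final.group(1).strip()
        action = re.search(r"Action Input:\s*(.*)", reply, re.DOTALL)
        if action is None:
            raise ValueError("could not parse the model output")
        observation = run_python(action.group(1).strip())
        transcript += f"\n{reply}\nObservation: {observation}"
    raise RuntimeError("agent did not finish within max_steps")


print(run_agent("What is a square root of 5?", scripted_llm))
```

The `max_steps` limit matters with a real model, which can produce output that never reaches a final answer or that fails to parse.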
