LLMs for Everyone: Running the LLaMA-13B Model and LangChain in Google Colab

Study notes on quality articles

08 Apr 2026 — 16 min read

Original: towardsdatascience.com/llms-for-everyone-running-the-llama-13b-model-and-langchain-in-google-colab-68d88021cf0b

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/c1c2e1a8e6731e9a039722d2789772e2.png

Photo by Glib Albovsky, Unsplash

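In the first part of the story, we used a free Google Colab instance to run a Mistral-7B model and extract information using the FAISS (Facebook AI Similarity Search) database. In this part, we will go further, and I will show how to run a LLaMA 2 13B model; we will also test some extra LangChain functionality, such as building chat-based applications and using agents. As in the first part, all components used are based on open-source projects and are completely free to use.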

Let's get into it!

LLaMA.cpp

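LLaMA.CPP is a very interesting open-source project, originally designed to run LLaMA models on MacBooks, but its functionality has grown far beyond that. First, it is written in plain C/C++ without external dependencies and can run on a wide range of hardware (CUDA, OpenCL, and Apple silicon are supported; it can even run on a Raspberry Pi). Second, LLaMA.CPP can be connected with LangChain, which allows us to test a lot of its functionality for free, without an OpenAI key. Last but not least, because LLaMA.CPP runs almost everywhere, it is a good candidate for a free Google Colab instance. As a reminder, Google provides free access to Python notebooks with 12 GB of RAM and 16 GB of VRAM, which can be opened from the Colab Research page. The code is opened in a web browser and runs in the cloud, so everybody can access it, even from a minimal budget PC.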

Before using LLaMA, let's install the library. The installation itself is easy; we only need to enable LLAMA_CUBLAS before running pip:

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install llama-cpp-python
!pip3 install huggingface-hub
!pip3 install sentence-transformers langchain langchain-experimental
!huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir /content --local-dir-use-symlinks False

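For the first test, I will use a 7B model. Here, I also installed the huggingface-hub library, which allows us to automatically download the "Llama-2-7b-Chat" model in the GGUF format needed by LLaMA.CPP. I also installed the LangChain library, which will be used for further testing.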

Now, let's load the model and check that it works:

from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

n_gpu_layers = 40  # number of model layers to offload to the GPU
n_batch = 512      # number of prompt tokens processed in parallel

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = LlamaCpp(
    model_path="/content/llama-2-7b-chat.Q4_K_M.gguf",
    temperature=0.1,
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,
)

Once the model is loaded, we can test it with one line of code:

llm("What is the distance to the Moon? Write the short answer.")

Here, I also used StreamingStdOutCallbackHandler, which gives us smooth "streaming" output in the "ChatGPT" style:

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/2cbbbf238dfa17d368ee34c5acdd3800.png
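To make the streaming idea concrete, here is a minimal sketch that does not use LangChain at all. It assumes a handler object with an on_llm_new_token method (the method name mirrors the LangChain callback interface; StdoutStreamer and fake_generate are hypothetical names introduced only for this illustration) and shows why the text appears piece by piece: every token is written and flushed as soon as it is produced.

```python
import io
import sys


class StdoutStreamer:
    """Toy stand-in for a streaming callback: writes each token as it arrives."""

    def __init__(self, stream=None):
        # Default to the real stdout; a StringIO can be passed in for testing.
        self.stream = stream if stream is not None else sys.stdout

    def on_llm_new_token(self, token: str) -> None:
        self.stream.write(token)
        self.stream.flush()  # flush so the text shows up immediately


def fake_generate(tokens, handler):
    """Hypothetical generation loop: emit tokens one by one, return the full text."""
    pieces = []
    for token in tokens:
        handler.on_llm_new_token(token)
        pieces.append(token)
    return "".join(pieces)


buffer = io.StringIO()
text = fake_generate(["The ", "Moon ", "is ", "far."], StdoutStreamer(buffer))
```

Passing StdoutStreamer() without arguments would print the tokens to the notebook cell instead of collecting them in a buffer.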

As for resources, thanks to 4-bit quantization, a 7B model fits well within the limits of the free Google Colab tier:

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/1c0e16f0618dfa87875b4870c92cd935.png

Google Colab resources, Image by author

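As we can see, the model requires only about 1.6 GB of RAM and 4.2 GB of VRAM, so in theory it can run on almost any budget PC. With Google Colab, we can even run a 13B model completely for free! We only need to change the model repository and file name in the "download" command: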

!huggingface-cli download TheBloke/Llama-2-13B-chat-GGUF llama-2-13b-chat.Q4_K_M.gguf --local-dir /content --local-dir-use-symlinks False

Naturally, this model requires more resources, but they still fit within the free instance:

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/bfa6e3dd94d03ab16b9706f5018990d5.png

Google Colab resources, Image by author
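A rough back-of-the-envelope calculation helps explain why quantization matters here. The sketch below assumes nominal parameter counts (7 and 13 billion) and an average of about 4.5 bits per weight for a 4-bit format with per-block scaling factors; both numbers are assumptions for illustration, and the real memory footprint also includes the context cache and runtime buffers.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts; 4.5 bits per weight is an assumed average.
for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
    fp16 = weight_size_gb(n_params, 16)
    q4 = weight_size_gb(n_params, 4.5)
    print(f"{name}: 16-bit ~ {fp16:.1f} GB, 4-bit ~ {q4:.1f} GB")
```

Under these assumptions, the 16-bit weights of a 13B model alone would not fit in 16 GB of VRAM, while the 4-bit version is well below that limit.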

Our model is ready; let's see how we can use it in LangChain.

LangChain

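LangChain is an open-source Python framework for developing applications powered by language models. In theory, it is "cross-platform" and can work with different language models with minimal code changes. In practice, it is not always clear what needs to be changed, and most examples in the official documentation are written for OpenAI, which is not free: users pay for every API call. That makes LLaMA.CPP a good way to learn this library without extra cost. Let's get started!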

1. LLM Chain

LCEL (LangChain Expression Language) is one of the basic concepts of the LangChain library. It allows us to create a prompt and bind it to a language model:

from langchain.prompts import PromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.callbacks.tracers import ConsoleCallbackHandler

template = """<s>[INST] <<SYS>>
Provide a correct and short answer to the question.
<</SYS>>
{question} [/INST]"""

prompt = PromptTemplate(template=template, input_variables=["question"])
chain = prompt | llm | StrOutputParser()
chain.invoke({"question": "What is the distance to the Moon?"}, config={
    # "callbacks": [ConsoleCallbackHandler()]
})
#> Sure! The average distance from Earth to the Moon is about
#> 384,400 kilometers (238,900 miles).

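Here, I created a prompt that tells the LLaMA model what to do, connected it to the LLM created in the previous step, and added a StrOutputParser instance to clean up the output text. The callbacks entry is an optional parameter that lets us debug the chain; it is very useful if we want to see the actual prompt sent to the model.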
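The "|" syntax may look like magic, so here is a toy sketch of how such composition can work in plain Python. This is not LangChain's implementation; Step, render_prompt, and fake_llm are hypothetical names, and the model is replaced by a function that returns a canned answer. The only point is that "a | b" can build a new object that feeds the output of a into b.

```python
class Step:
    """Toy runnable: wraps a function and supports `|` composition."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new step that feeds a's output into b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


TEMPLATE = "<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n{question} [/INST]"


def render_prompt(inputs: dict) -> str:
    """Fill the Llama 2 style template with a system instruction and a question."""
    return TEMPLATE.format(
        system="Provide a correct and short answer to the question.",
        question=inputs["question"],
    )


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for the model: a canned answer with stray whitespace.
    return "  About 384,400 km.  "


chain = Step(render_prompt) | Step(fake_llm) | Step(str.strip)
answer = chain.invoke({"question": "What is the distance to the Moon?"})
```

Here str.strip plays the role of the output-cleaning step; the real StrOutputParser may behave differently.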

2. Combining Chains

With LCEL, we can easily combine two chains. Here, I add a second chain that uses the output of the first chain as its input.

template2 = """<s>[INST] <<SYS>>
Use the summary {summary} and give 2 one sentence examples of practical applications of the subject [/INST]
<</SYS>>
[/INST]
"""

prompt2 = PromptTemplate(
    input_variables=["summary"],
    template=template2,
)
chain2 = {"summary": prompt | llm | StrOutputParser()} | prompt2 | llm | StrOutputParser()
chain2.invoke({"question": "What is the distance to the Moon?"}, config={
    # "callbacks": [ConsoleCallbackHandler()]
})
#> The average distance from Earth to the Moon is approximately 384,400
#> kilometers (238,900 miles), and this information has several practical
#> applications, such as:
#> 1. Planning space missions: Knowing the exact distance between Earth
#> and the Moon is crucial for designing and executing space missions.
#> 2. Navigation and communication: The distance between Earth and the
#> Moon affects the time it takes for radio signals to travel between
#> the two bodies...

If we enable ConsoleCallbackHandler, we will see that the language model is called twice in this example:

[llm/start] Exiting Prompt run with output:
"<s>[INST] <<SYS>>\nProvide a correct and short answer to the question.\n<</SYS>>\nWhat is the distance to the Moon? [/INST]"
Exiting LLM run with output:
"The average distance from Earth to the Moon is about 384,400 kilometers (238,900 miles)."
[llm/start] Entering LLM run with input:
"<s>[INST] <<SYS>>\nUse the summary The average distance from Earth to the Moon is about 384,400 kilometers (238,900 miles). and give 2 one sentence examples of practical applications of the subject [/INST]\n<</SYS>>\n[/INST]"
Exiting LLM run with output:
...

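The LangChain library does all the required work for us and makes all LLM calls "under the hood". This is something to keep in mind, especially if we use a paid API instead of a free local model (a 2x increase in the bill can be an unpleasant surprise if we are not aware of it).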
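One simple way to keep an eye on hidden calls is to wrap the model in a counter. The sketch below is a toy under stated assumptions: CountingLLM, first_chain, and second_chain are hypothetical names, and the "model" just returns a numbered string. It reproduces the structure of the combined chain, where the second step cannot run until the first has called the model.

```python
class CountingLLM:
    """Toy model wrapper that counts how many times it is invoked."""

    def __init__(self):
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"response #{self.calls}"


llm_stub = CountingLLM()


def first_chain(question: str) -> str:
    return llm_stub(f"Answer briefly: {question}")


def second_chain(question: str) -> str:
    summary = first_chain(question)  # first model call
    return llm_stub(f"Use the summary {summary} and give 2 examples")  # second call


result = second_chain("What is the distance to the Moon?")
```

After one invocation of second_chain, llm_stub.calls is 2, which matches the two LLM runs visible in the debug log.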

3. Automatic Routing

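Let's test a more complex example and use different prompts for different requests. Here, I will use the HuggingFaceEmbeddings class and cosine similarity to determine whether a question is about space or math: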

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.utils.math import cosine_similarity
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough

space_template = """<s>[INST] <<SYS>>
You are an astronaut. You are great at answering questions about space. Provide a short answer to the question, understandable to a small kid.
<</SYS>>
{query} [/INST]"""

math_template = """<s>[INST] <<SYS>>
You are a mathematician. You are great at answering math questions. Provide a short answer to the question.
<</SYS>>
{query} [/INST]"""

embeddings = HuggingFaceEmbeddings()
prompt_templates = [space_template, math_template]
prompt_embeddings = embeddings.embed_documents(prompt_templates)


def prompt_router(input):
    """ Find a proper template for the input """
    query_embedding = embeddings.embed_query(input["query"])
    similarity = cosine_similarity([query_embedding], prompt_embeddings)[0]
    most_similar = prompt_templates[similarity.argmax()]
    print("Using MATH" if most_similar == math_template else "Using SPACE")
    return PromptTemplate.from_template(most_similar)


chain = (
    {"query": RunnablePassthrough()}
    | RunnableLambda(prompt_router)
    | llm
    | StrOutputParser()
)

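The logic here is simple. The HuggingFaceEmbeddings class converts the prompt templates and the question into numerical vectors. Then, we use the cosine similarity metric to determine whether the question is closer to "math" or "space".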

The output looks like this:

chain.invoke("How far is Mars?", config={
    # "callbacks": [ConsoleCallbackHandler()]
})
#> Using SPACE
#> Oh, wow! That's a really cool question! *adjusts spacesuit* Mars is
#> actually quite far from Earth! *grin* It's like, really, really far!
#> *estimates with hands* Let me see... if I hold out my hand like this
#> (gestures), that's how far Mars is from Earth! *smiling* It's about 140
#> million miles away!

As we can see, automatic prompt detection can be useful if questions are asked by different people, for example, adults and kids.
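The routing decision itself boils down to a few lines of arithmetic. The sketch below uses made-up three-dimensional vectors instead of real embeddings (real models produce hundreds of dimensions), and the names route, toy_templates, mars_question, and algebra_question are hypothetical. Cosine similarity is the dot product of two vectors divided by the product of their lengths, and the router simply picks the template with the highest score.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def route(query_vec, template_vecs):
    """Return the name of the template whose vector is closest to the query."""
    return max(
        template_vecs,
        key=lambda name: cosine_similarity(query_vec, template_vecs[name]),
    )


# Made-up 3-dimensional "embeddings", for illustration only.
toy_templates = {"space": [0.9, 0.1, 0.2], "math": [0.1, 0.9, 0.3]}
mars_question = [0.8, 0.2, 0.1]
algebra_question = [0.2, 0.7, 0.4]

print(route(mars_question, toy_templates))
print(route(algebra_question, toy_templates))
```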

4. Basic Chat

We can also interact with the LLM using the ChatPromptTemplate class, which allows users to hold a conversation with the model.

from langchain.chains import LLMChain
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain.schema import AIMessage, HumanMessage
from langchain_experimental.chat_models import Llama2Chat

sys_template = """<s>[INST] <<SYS>>
Act as an experienced AI assistant. Write only one sentence answers.
<</SYS>>
[/INST]
"""

chat_prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(sys_template),
    HumanMessage(content="Hello, how are you doing?"),
    AIMessage(content="I'm doing well, thanks!"),
    HumanMessage(content="May I ask you a question about Moon?"),
    AIMessage(content="Yes, sure."),
    HumanMessagePromptTemplate.from_template("{question}"),
])

model = Llama2Chat(llm=llm)
chain = chat_prompt | model | StrOutputParser()
chain.invoke({"question": "How big is it?"}, config={
    # "callbacks": [ConsoleCallbackHandler()]
})
#> The Moon has a diameter of approximately 2,159 miles (3,475 kilometers).

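Here, I created a SystemMessagePromptTemplate object that contains the instructions for the model and added the conversation history. LangChain does all the work needed to combine this data into the final prompt. We can enable ConsoleCallbackHandler to see the input sent to the model: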

[llm/start] Entering LLM run with input:
"System: <s>[INST] <<SYS>>\nAct as an experienced AI assistant. Write only one sentence answers.\n<</SYS>>\n[/INST]\n\nHuman: Hello, how are you doing?\nAI: I'm doing well, thanks!\nHuman: May I ask you a question about Moon?\nAI: Yes, sure.\nHuman: How big is it?"
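The logged input has a simple shape: every message becomes a "Role: text" entry, and the entries are joined with newlines. The sketch below only imitates that shape in plain Python; flatten_messages and history are hypothetical names, and the real formatting is done inside LangChain and may differ between chat model wrappers.

```python
def flatten_messages(messages):
    """Join (role, text) pairs into one prompt string, one 'Role: text' entry per line."""
    return "\n".join(f"{role}: {text}" for role, text in messages)


history = [
    ("System", "Write only one sentence answers."),
    ("Human", "May I ask you a question about Moon?"),
    ("AI", "Yes, sure."),
    ("Human", "How big is it?"),
]
prompt_text = flatten_messages(history)
print(prompt_text)
```

Because the whole history is resent on every turn, the model can resolve "it" in the last question to the Moon, but the prompt also grows with every message.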

5. Chat with Memory and Message Summarization

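Being able to store all messages in the prompt text is useful, but the final prompt can easily become too long. The same is true for humans: we usually cannot remember every phrase from past conversations, but we remember roughly what we talked about. With the ConversationSummaryMemory class, we can apply the same idea to an LLM.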

from langchain.chains import ConversationChain
from langchain.memory import (
    ConversationBufferMemory,
    ConversationSummaryMemory,
    CombinedMemory,
    ChatMessageHistory,
)

conv_memory = ConversationBufferMemory(memory_key="chat_history_lines", input_key="input")
summary_memory = ConversationSummaryMemory(llm=llm, input_key="input")
memory = CombinedMemory(memories=[conv_memory, summary_memory])

template = """<s>[INST] <<SYS>>
Act as an experienced AI assistant. Write one-sentence answers only.
<</SYS>>
Summary of conversation:
{history}
Current conversation:
{chat_history_lines}
Human: {input} [/INST]
"""

summary_memory.save_context({"input": "Hi, how are you"}, {"output": "Thanks, I am fine"})
summary_memory.save_context({"input": "May I ask you questions about Moon?"}, {"output": "Yes, sure"})
summary_memory.load_memory_variables({})

prompt = PromptTemplate(
    input_variables=["history", "input", "chat_history_lines"],
    template=template,
)
conversation = ConversationChain(llm=llm, verbose=True, memory=memory, prompt=prompt)

conversation.run("How far is it?")
#> The average distance from the Earth to the Moon is about 238,855 miles
#> (384,400 kilometers)

conversation.run("And what about Mars?")
#> The average distance from Earth to Mars is about 140 million miles
#> (225 million kilometers)

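The questions and answers look simple, but a lot is going on under the hood. Every time a new "request-response" pair is added, the ConversationSummaryMemory class calls the LLM. Every time the ConversationChain is called, the summary is also updated automatically. In practice, for our short conversation, the sequence looks like this: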

#> save_context({"input": "Hi, how are you"}, {"output": "Thanks, I am fine"})
The human greets the AI and asks how it is doing. The AI responds that it is fine.

#> save_context({"input": "May I ask you questions about Moon?"}, {"output": "Yes, sure"})
The human greets the AI and asks how it is doing. The AI responds that it is fine. The human asks if they can ask questions about the moon. The AI agrees.

#> conversation.run("How far is it?")
<s>[INST] <<SYS>>
Act as an experienced AI assistant. Write one-sentence answers only.
<</SYS>>
Summary of conversation:
The human greets the AI and asks how it is doing. The AI responds that it is fine. The human asks if they can ask questions about the moon. The AI agrees.
Current conversation:
Human: How far is it? [/INST]

The human greets the AI and asks how it is doing. The AI responds that it is fine. The human asks if they can ask questions about the moon. The AI agrees. The human asks how far the moon is from Earth, and the AI provides a one-sentence answer: "The average distance from the Earth to the Moon is about 238,855 miles (384,400 kilometers)."

#> conversation.run("And what about Mars?")
<s>[INST] <<SYS>>
Act as an experienced AI assistant. Write short answers only.
<</SYS>>
Summary of conversation:
The human says "hi" and the AI responds with a brief message indicating it is functioning properly. The human asks if they can ask questions about the moon. The AI agrees and provides information about the average distance from the Earth to the Moon.
END OF NEW SUMMARY
Please provide the new summary after each line of conversation, progressively adding onto the previous summary.
Current conversation:
Human: How far is it?
AI: Sure thing! I am ready to help answer your questions about the moon. The average distance from the Earth to the Moon is about 238,855 miles (384,400 kilometers).
Human: And what about Mars? [/INST]

The human says "hi" and the AI responds with a brief message indicating it is functioning properly. The human asks if they can ask questions about the moon. The AI agrees and provides information about the average distance from the Earth to the Moon. Now, the human wants to know about Mars.
Sure thing! Here is the updated summary of our conversation so far:\n\nHuman: Hi! AI: Hi! I'm functioning properly. Human: Can I ask questions about the moon? AI: Of course! I'd be happy to help. The average distance from the Earth to the Moon is about 238,855 miles (384,400 kilometers).\n\nNow, what would you like to know about Mars?

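There are several interesting points here. First, the conversation summary is not always perfect (at least for a 13B model), and the output can be rather long. In my tests, the chain sometimes returned an error because the number of tokens exceeded LLaMA's maximum limit. Second, as this example shows, we asked only two questions, but the LLM was executed six times! Again, this does not matter much for a local LLaMA model, but it can be a surprise when using a paid API and running many tests.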
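The count of six follows from the description above: one call for each saved request-response pair, plus one call for the answer and one for the summary update on every conversation turn. The toy sketch below reproduces only this accounting; CountingLLM, ToySummaryMemory, and run_turn are hypothetical names, and the real classes do considerably more.

```python
class CountingLLM:
    """Toy model that counts its invocations and returns a numbered string."""

    def __init__(self):
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"[model output {self.calls}]"


class ToySummaryMemory:
    """Keeps a running summary and asks the model to update it after every exchange."""

    def __init__(self, llm):
        self.llm = llm
        self.summary = ""

    def save_context(self, user_text: str, ai_text: str) -> None:
        # One model call per saved request-response pair.
        self.summary = self.llm(
            f"Current summary: {self.summary}\n"
            f"New lines: Human: {user_text} AI: {ai_text}\nNew summary:"
        )


def run_turn(llm, memory, user_text: str) -> str:
    answer = llm(f"Summary: {memory.summary}\nHuman: {user_text}")  # call 1: the answer
    memory.save_context(user_text, answer)  # call 2: refresh the summary
    return answer


counter = CountingLLM()
toy_memory = ToySummaryMemory(counter)
toy_memory.save_context("Hi, how are you", "Thanks, I am fine")
toy_memory.save_context("May I ask you questions about Moon?", "Yes, sure")
run_turn(counter, toy_memory, "How far is it?")
run_turn(counter, toy_memory, "And what about Mars?")
print(counter.calls)
```

Two saved pairs and two turns give 2 + 2 × 2 = 6 model calls, even though the user asked only two questions.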

7. Agents

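Connecting external agents to a language model is a powerful idea that allows the model to use "tools" for more specific tasks. In this example, I will use the PythonREPLTool class, which allows the model to execute Python code.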

As we can see from the GitHub source code, PythonREPLTool is just a wrapper that executes Python code using a multiprocessing.Process call. We can easily see how it works:

from langchain_experimental.tools import PythonREPLTool

tool = PythonREPLTool()
tool.run('import math; print(math.sqrt(5))')
#> 2.23606797749979
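For intuition, the core of such a tool can be sketched in a few lines: run the snippet, capture what it prints, and turn exceptions into text. This is a simplified stand-in and not the library's code; run_python is a hypothetical name, and the sketch executes the code in the current process rather than in a separate one.

```python
import contextlib
import io


def run_python(source: str) -> str:
    """Execute a snippet in a fresh namespace and return whatever it printed."""
    captured = io.StringIO()
    with contextlib.redirect_stdout(captured):
        try:
            exec(source, {})
        except Exception as exc:  # report errors as text, like a REPL observation
            return repr(exc)
    return captured.getvalue()


output = run_python("import math; print(math.sqrt(5))")
error = run_python("print(sqrt(5))")
```

Returning the error text instead of raising is what lets an agent read the failure as an observation and try again.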

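By the way, at the time of writing, this class performs no sanity checks internally, which can be dangerous. If a user asks, for example, to delete system files, the tool may execute this command without hesitation.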
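As an illustration of what a basic sanity check could look like, the sketch below parses the code with the standard ast module and rejects snippets that import modules from a small deny-list. The list and the name imports_blocked_module are assumptions made for this example. This is not a sandbox: such a filter is easy to bypass (for instance through __import__ or open), so real isolation needs a container or a separate restricted environment.

```python
import ast

# Illustrative deny-list, far from complete.
BLOCKED_MODULES = {"os", "subprocess", "shutil", "sys"}


def imports_blocked_module(source: str) -> bool:
    """Return True if the snippet imports any module on the deny-list."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        # Compare only the top-level package, so "os.path" counts as "os".
        if any(name.split(".")[0] in BLOCKED_MODULES for name in names):
            return True
    return False


print(imports_blocked_module("import os; os.remove('x')"))
print(imports_blocked_module("import math; print(math.sqrt(5))"))
```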

To use a Python agent, we need only a few lines of code:

from langchain_experimental.agents.agent_toolkits import create_python_agent
from langchain.agents.agent_types import AgentType

agent = create_python_agent(
    llm=llm,
    tool=tool,
    verbose=True,
    agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
)
agent.agent.llm_chain.verbose = True
agent.run("What is a square root of 5?")

The output looks like this:

#> You are an agent designed to write and execute python code to answer questions.
#> You have access to a python REPL, which you can use to execute python code.
#> If you get an error, debug your code and try again.
#> Only use the output of your code to answer the question.
#> You might know the answer without running any code, but you should still run the code to get the answer.
#> If it does not seem like you can write code to answer the question, just return "I don't know" as the answer.
#> Python_REPL: A Python shell. Use this to execute python commands.
#> Input should be a valid python command. If you want to see the output
#> of a value, you should print it out with `print(...)`.
#> Use the following format:
#> Question: the input question you must answer
#> Thought: you should always think about what to do
#> Action: the action to take, should be one of [Python_REPL]
#> Action Input: the input to the action
#> Observation: the result of the action
#> ... (this Thought/Action/Action Input/Observation can repeat N times)
#> Thought: I now know the final answer
#> Final Answer: the final answer to the original input question
#> Begin!
#> Question: What is a square root of 5?
#> Thought: Hmm, this one looks easy. I think I can just use the built-in `sqrt()` function.
#> Action: Python_REPL
#> Action Input: `print(sqrt(5))`
#> Observation: NameError("name 'sqrt' is not defined")
#> Thought: Oh dear, it looks like I need to import the math module first.
#> Action: Python_REPL
#> Action Input: `import math`
#> Observation:
#> Thought: Now we're getting somewhere!
#> Action: Python_REPL
#> Action Input: `print(math.sqrt(5))`
#> Observation: 2.23606797749979
Finished chain.
2.23606797749979

There are several interesting points here.

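As we can see, the prompt template is rather complex, especially for an open-source model. In practice, a 7B LLaMA model could not complete the challenge at all. After several rounds of trial and error, a 13B model did the job, but the results do not look consistent to me. For example, the "import math" statement was not added properly; the REPL command was executed twice, and the code worked only because of that.
As in the previous examples, the LLM was called several times, which may incur extra costs when using a paid API.
Allowing users to run arbitrary Python code on the system can be a serious security issue. Last but not least, because natural language prompts are used, harmful attacks can also be hard to detect.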
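The fragility mentioned above comes partly from the fact that the agent has to parse free-form model text. A minimal sketch of such parsing is shown below; it is an assumption-laden illustration rather than LangChain's parser, and parse_step and ACTION_PATTERN are hypothetical names. It looks for the "Action:" and "Action Input:" lines from the prompt format and strips the backticks that the model wrapped around its code.

```python
import re

ACTION_PATTERN = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", re.DOTALL)


def parse_step(llm_text: str):
    """Return ('final', answer) or ('action', tool_name, tool_input) for one agent step."""
    if "Final Answer:" in llm_text:
        return ("final", llm_text.split("Final Answer:", 1)[1].strip())
    match = ACTION_PATTERN.search(llm_text)
    if match is None:
        raise ValueError("could not parse the model output")
    tool_name = match.group(1).strip()
    tool_input = match.group(2).strip().strip("`")  # drop surrounding backticks
    return ("action", tool_name, tool_input)


step = parse_step(
    "Thought: I need the math module.\n"
    "Action: Python_REPL\n"
    "Action Input: `print(math.sqrt(5))`"
)
print(step)
```

If the model deviates from the expected format even slightly, the parser raises an error, which is one reason why smaller models struggle with agent prompts.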

For now, I am still a bit scared of using a Python agent in production, but for self-education it is interesting to see how it works.

Conclusion

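Working with large language models is fun. In this article, we were able to run a LLaMA-13B model on a free Google Colab instance and test its functionality using only free components. The open-source LangChain library allows us to do some complex things, such as chat summarization, in only a few lines of code. LLaMA.CPP is also an interesting project that allows us to use LLaMA and other language models (Alpaca, Vicuna, and so on) on different hardware. Last but not least, being able to run language models and frameworks for free is great for experiments, prototyping, or self-education.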

In the next part, I will show how to run the HuggingFace Text Generation Inference toolkit in Google Colab:

LLMs for Everyone: Running the HuggingFace Text Generation Inference in Google Colab

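If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors. You are also welcome to connect with me via LinkedIn. If you want to get the full source code for this and other posts, feel free to visit my Patreon page.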

Thanks for reading.
