A Practical Guide to Deploying Open-Source LLMs Locally on CPU with Ollama

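Ollama is an efficient tool for running large language models locally. It is built on llama.cpp and can run many open-source models on a CPU-only machine, including Meta's Llama 3, Google's Gemma, Microsoft's Phi-3, and Alibaba's Qwen2. Compared with a cloud API, local deployment keeps your data on your own machine and incurs no per-token fees. This guide covers installing Ollama, using it from the command line, calling it from Python, and integrating it into Jupyter Notebook.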

1. System Requirements and Installation

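Before you begin, make sure your machine meets the following basic requirements:

Operating system: macOS (Apple Silicon/Intel), Linux (x86_64/arm64), or Windows (native or WSL2).
Memory: at least 8 GB of RAM is recommended; 16 GB or more for larger models such as Llama3-8B.
Disk space: each model file is typically between 2 GB and 10 GB, so leave enough free space.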
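
To relate these numbers to a particular model, the size of the weights can be approximated as parameter count times bits per weight. The sketch below is only a back-of-the-envelope estimate. The function name and the 4-bit quantization figure are illustrative assumptions, and real model files and runtime memory use are somewhat larger because of metadata, the context cache, and runtime overhead.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes).

    This ignores file metadata and runtime overhead, so treat it as a lower bound.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Hypothetical examples: an 8B and a 7B model quantized to 4 bits per weight.
    for label, params in [("8B model", 8), ("7B model", 7)]:
        size = estimate_weights_gb(params, 4)
        print(f"{label} at 4 bits/weight: about {size:.1f} GB of weights")
```

An estimate of this kind is consistent with the 2 GB to 10 GB range quoted above for commonly used quantized models.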

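1.1 Download and Install

Download the installer for your operating system from the official website.

macOS: drag the application into the Applications folder.
Linux: run the official one-line install script.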
Windows: download the .exe installer and follow the setup wizard.

After installation, open a terminal and the ollama command is available.
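
If you prefer to verify the installation from Python rather than by hand, a minimal sketch is to look the executable up on PATH and ask it for its version. The helper names below are illustrative and not part of Ollama.

```python
import shutil
import subprocess


def find_executable(name: str = "ollama"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


def ollama_version():
    """Return the output of `ollama --version`, or None if ollama is not installed."""
    path = find_executable("ollama")
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False, timeout=10
    )
    # Some builds print warnings to stderr, so fall back to it if stdout is empty.
    return result.stdout.strip() or result.stderr.strip()


if __name__ == "__main__":
    version = ollama_version()
    print(version if version else "ollama was not found on PATH")
```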

2. Command-Line Usage and Model Management

Ollama's core features are exposed through a concise command-line interface.

2.1 Common Commands

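# Run a model; it is downloaded first if it is not present locally
ollama run qwen2

# Download a model without running it
ollama pull llama3

# List the models that have been downloaded
ollama list

# Remove a model
ollama rm llama3

# Show help
ollama help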
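
The same commands can be driven from a script. The following sketch calls ollama list through subprocess and extracts the model names. It assumes the output is a whitespace-separated table whose first row is a header and whose first column is the model name; the function names and the sample text are made up for illustration.

```python
import subprocess


def parse_model_names(list_output: str) -> list[str]:
    """Extract model names from the text printed by `ollama list`.

    Assumes the first non-empty line is a header row and the first
    column of every following row is the model name.
    """
    lines = [line for line in list_output.splitlines() if line.strip()]
    return [line.split()[0] for line in lines[1:]]


def installed_models() -> list[str]:
    """Run `ollama list` and return the names of the local models."""
    result = subprocess.run(
        ["ollama", "list"], capture_output=True, text=True, check=True
    )
    return parse_model_names(result.stdout)


if __name__ == "__main__":
    # Illustrative sample only; the IDs, sizes, and dates are invented.
    sample = (
        "NAME            ID      SIZE    MODIFIED\n"
        "qwen2:latest    aaaa    4.4 GB  2 days ago\n"
        "llama3:latest   bbbb    4.7 GB  3 weeks ago\n"
    )
    print(parse_model_names(sample))
```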

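2.2 Starting the Service

By default, the background service starts automatically when you run a model. You can also start it manually so that other programs can connect: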

ollama serve

The service listens on http://localhost:11434 by default.
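
Before connecting another program, it can help to confirm that the service is reachable. The sketch below sends a plain HTTP GET to the default address using only the standard library and reports whether it answered with status 200. The function name is illustrative.

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    if ollama_is_running():
        print("Ollama service is reachable")
    else:
        print("Ollama service is not reachable; try running `ollama serve`")
```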

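3. Calling Ollama from Python

Ollama provides a native Python library and also exposes an OpenAI-compatible API, so many existing AI application frameworks (such as LangChain and PandasAI) can use local models with little or no code change.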
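
To make the OpenAI-compatible format concrete, here is a minimal sketch that builds a chat completion request with only the standard library and sends it to the local service. It assumes the service is running on the default port and that the qwen2 model has been pulled. The function names are illustrative, and the Authorization value is a placeholder, since a local Ollama service does not validate API keys.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "qwen2",
                       base_url: str = "http://localhost:11434/v1"):
    """Build an OpenAI-style chat completion request for the local service."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer ollama",  # placeholder key
        },
        method="POST",
    )


def ask(prompt: str, model: str = "qwen2") -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, model), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(ask("Say hello in one sentence."))
```

Because the request and response follow the OpenAI schema, client libraries written for that schema usually only need their base URL changed to point at the local service.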

3.1 Using the Official Python Library

Install the dependency first:

pip install ollama

Example:

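import ollama

# Send a single-turn chat request and wait for the complete reply (no streaming)
response = ollama.chat(
    model='qwen2',
    stream=False,
    messages=[{'role': 'user', 'content': "Explain this joke: I'm no longer the broke kid I used to be. I'm this year's broke kid."}]
)

# The reply text is stored in the 'content' field of the returned message
print(response['message']['content'])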
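
On a CPU, generation can be slow, so streaming the reply as it is produced often feels more responsive. The sketch below assumes that ollama.chat called with stream=True returns an iterable of partial responses, each carrying a piece of text under ['message']['content']. The helper names are illustrative, and ollama is imported lazily so the text-joining helper can be used without the service.

```python
def join_stream(chunks) -> str:
    """Print each streamed piece as it arrives and return the full reply."""
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)
        parts.append(piece)
    print()  # finish the line after the last piece
    return "".join(parts)


def stream_chat(prompt: str, model: str = "qwen2") -> str:
    """Ask the local model a question and stream the answer to stdout."""
    import ollama  # requires `pip install ollama` and a running service

    chunks = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return join_stream(chunks)


if __name__ == "__main__":
    stream_chat("Describe Ollama in one sentence.")
```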

The class below wraps the OpenAI-compatible endpoint, keeps a multi-turn chat history, and registers a %%chat magic for use in Jupyter Notebook. It requires the openai and IPython packages.

import sys


class Ollama:
    def __init__(self, model='qwen2', max_chat_rounds=20, stream=True,
                 system=None, history=None):
        self.model = model
        self.history = [] if history is None else history
        self.max_chat_rounds = max_chat_rounds
        self.stream = stream
        self.system = system
        try:
            self.register_magic()
            # Send a test message to confirm that the service is reachable
            response = self('Hello')
            if not self.stream:
                print(response)
            print('register magic %%chat succeeded ...', file=sys.stderr)
            # Drop the test exchange from the history
            self.history = self.history[:-1]
        except Exception as err:
            print('register magic %%chat failed ...', file=sys.stderr)
            print(err)

    @classmethod
    def build_messages(cls, query=None, history=None, system=None):
        # Convert (prompt, response) pairs into the OpenAI message format
        messages = []
        history = history if history else []
        if system is not None:
            messages.append({'role': 'system', 'content': system})
        for prompt, response in history:
            pair = [{"role": "user", "content": prompt},
                    {"role": "assistant", "content": response}]
            messages.extend(pair)
        if query is not None:
            messages.append({"role": "user", "content": query})
        return messages

    def chat(self, messages, stream=True):
        from openai import OpenAI
        # The client requires an API key, but the local service does not check it
        client = OpenAI(base_url='http://localhost:11434/v1/', api_key='ollama')
        completion = client.chat.completions.create(
            messages=messages, model=self.model, stream=stream)
        return completion

    def __call__(self, query):
        from IPython.display import clear_output

        # Keep only the most recent max_chat_rounds exchanges
        len_his = len(self.history)
        if len_his >= self.max_chat_rounds + 1:
            self.history = self.history[len_his - self.max_chat_rounds:]
        messages = self.build_messages(
            query=query, history=self.history, system=self.system)

        if not self.stream:
            completion = self.chat(messages, stream=False)
            response = completion.choices[0].message.content
            self.history.append((query, response))
            return response

        completion = self.chat(messages, stream=True)
        response = ""
        for chunk in completion:
            # delta.content can be None (for example in the final chunk)
            response += chunk.choices[0].delta.content or ""
            print(response)
            clear_output(wait=True)
        self.history.append((query, response))
        return response

    def register_magic(self):
        import IPython
        from IPython.core.magic import Magics, magics_class, line_cell_magic

        @magics_class
        class ChatMagics(Magics):
            def __init__(self, shell, pipe):
                super().__init__(shell)
                self.pipe = pipe

            @line_cell_magic
            def chat(self, line, cell=None):
                # %chat <text> returns the reply; %%chat prints the reply to the cell body
                if cell is None:
                    return self.pipe(line)
                else:
                    print(self.pipe(cell))

        ipython = IPython.get_ipython()
        magic = ChatMagics(ipython, self)
        ipython.register_magics(magic)
