Llama 3.1: Local Deployment

Study notes on high-quality articles

10 Apr 2026 — 11 min read

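[1] Llama 3.1 deployment tutorial (very detailed): from zero to mastery, this one article is all you need
[2] Installing Ollama properly
[3] Configuring a China mirror source for Ollama on Linux to speed up model downloads
[4] Llama 3.1: introduction, deployment workflow, and efficient fine-tuning
Deployment server: H100 80G
Model: Llama-3.1-8B-Instruct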
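As a rough sanity check on the hardware choice, the memory needed for the weights alone can be estimated as parameter count times bytes per parameter. The sketch below assumes about 8e9 parameters for an "8B" model; the helper name `weight_memory_gb` is made up for illustration, and the estimate ignores the KV cache, activations, and framework overhead.

```python
# Rough weight-memory estimate: parameter count x bytes per parameter.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "bfloat16") -> float:
    """Return the approximate size of the model weights in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumption: roughly 8e9 parameters for an "8B" model.
estimate = weight_memory_gb(8e9, "bfloat16")
print(f"about {estimate:.0f} GB of weights on an 80 GB GPU")
```

Under these assumptions the half-precision weights take roughly a fifth of the card, which leaves headroom for the KV cache during generation.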

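I. Deploying the Model Locally

Download the model from Hugging Face: https://huggingface.co/meta-llama/Llama-3.1-8B (the Instruct variant used in this post is published as meta-llama/Llama-3.1-8B-Instruct)

Create a conda virtual environment (Python 3.10 or later)

conda create -n <env_name> python=3.11

Activate the environment

conda activate <env_name>

Install PyTorch in the virtual environment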

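nvidia-smi  # check the CUDA version
Go to the PyTorch website: https://pytorch.org/get-started/previous-versions/
Choose a PyTorch build whose CUDA version is as close as possible to, but not higher than, the highest CUDA version the host supports (as reported by nvidia-smi), then install it through a mirror.
For example, using the Tsinghua mirror:
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 -i https://pypi.tuna.tsinghua.edu.cn/simple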
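The selection rule above (closest CUDA build that does not exceed what the host supports) can be written down as a small helper. This is only a sketch: the function names are invented here, and the list of builds is a hypothetical input that you would read off the PyTorch website yourself.

```python
from typing import List, Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a version string such as '12.4' into a comparable tuple (12, 4)."""
    return tuple(int(part) for part in text.split("."))


def pick_cuda_build(host_cuda: str, available_builds: List[str]) -> Optional[str]:
    """Return the newest build that does not exceed the host's CUDA version."""
    host = parse_version(host_cuda)
    candidates = [b for b in available_builds if parse_version(b) <= host]
    return max(candidates, key=parse_version) if candidates else None


# Hypothetical list of CUDA builds offered for one PyTorch release.
builds = ["11.8", "12.4", "12.6"]
print(pick_cuda_build("12.5", builds))
```

Comparing tuples of integers rather than raw strings avoids mistakes such as treating "12.10" as older than "12.4".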

Upgrade pip

python -m pip install --upgrade pip

Check that wget and md5sum are available

wget --version
md5sum --version

If either tool is missing, install it with:

apt-get install wget
apt-get install coreutils  # md5sum is part of the coreutils package
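md5sum is typically used to confirm that a downloaded file arrived intact. If you prefer to do the same check from Python, a minimal sketch with the standard library could look like this; it assumes you have a published MD5 value to compare against, and the function names are illustrative.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 with an expected value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use flat even for multi-gigabyte weight files.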

Install transformers

pip install --upgrade transformers
pip install accelerate -i https://pypi.tuna.tsinghua.edu.cn/simple

Test:

import torch
import transformers

# Change this to your local model directory if the weights are already downloaded
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(messages, max_new_tokens=256)
# generated_text holds the whole conversation; the last entry is the assistant's reply
print(outputs[0]["generated_text"][-1])

II. Deploying a Long-Running API Service Locally

Install the required packages

pip install fastapi uvicorn pydantic -i https://pypi.tuna.tsinghua.edu.cn/simple

Create the API service file

# api_server.py
import logging
from typing import List, Optional

import torch
import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Model path: change this to your own model directory
MODEL_PATH = "/.../.../.../Llama-3.1-8B-Instruct"


# Request/response models
class ChatMessage(BaseModel):
    role: str
    content: str


class ChatRequest(BaseModel):
    messages: List[ChatMessage]
    max_tokens: Optional[int] = 200
    temperature: Optional[float] = 0.7
    top_p: Optional[float] = 0.9


class ChatResponse(BaseModel):
    response: str
    usage: dict


# Initialize the FastAPI application
app = FastAPI(title="Llama 3.1 API", version="1.0")

# Global model objects
model = None
tokenizer = None
pipe = None


@app.on_event("startup")
async def startup_event():
    """Load the model at startup."""
    global model, tokenizer, pipe
    logger.info("Loading model...")
    try:
        # Load the tokenizer and the model
        tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_PATH,
            torch_dtype=torch.float16,
            device_map="auto",
            low_cpu_mem_usage=True,
        )
        # Create the pipeline. device_map="auto" has already placed the model,
        # so no separate device argument is passed here.
        pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
        logger.info(f"Model loaded. Device: {model.device}")
        logger.info(f"CUDA available: {torch.cuda.is_available()}")
    except Exception as e:
        logger.error(f"Model loading failed: {e}")
        raise


@app.get("/")
async def root():
    """Root endpoint: returns the service status."""
    return {
        "service": "Llama 3.1 API",
        "status": "running",
        "model": "Llama-3.1-8B-Instruct",
        "device": str(model.device) if model else "not loaded",
    }


@app.get("/health")
async def health_check():
    """Health check."""
    return {"status": "healthy"}


@app.post("/chat/completions", response_model=ChatResponse)
async def chat_completions(request: ChatRequest):
    """Chat completion endpoint (similar to the OpenAI API format)."""
    try:
        # Format the conversation with the chat template
        text = tokenizer.apply_chat_template(
            [msg.model_dump() for msg in request.messages],
            tokenize=False,
            add_generation_prompt=True,
        )
        # Generate the reply. return_full_text=False returns only the new text,
        # which avoids splitting the output on the word "assistant".
        outputs = pipe(
            text,
            max_new_tokens=request.max_tokens,
            temperature=request.temperature,
            top_p=request.top_p,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            return_full_text=False,
        )
        response_text = outputs[0]["generated_text"].strip()
        # Count tokens; the templated text already contains its special tokens
        input_tokens = len(tokenizer.encode(text, add_special_tokens=False))
        output_tokens = len(tokenizer.encode(response_text, add_special_tokens=False))
        return ChatResponse(
            response=response_text,
            usage={
                "prompt_tokens": input_tokens,
                "completion_tokens": output_tokens,
                "total_tokens": input_tokens + output_tokens,
            },
        )
    except Exception as e:
        logger.error(f"Generation failed: {e}")
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/generate")
async def generate_text(prompt: str, max_tokens: int = 100):
    """Simple text generation endpoint."""
    try:
        outputs = pipe(prompt, max_new_tokens=max_tokens, temperature=0.7, do_sample=True)
        return {"text": outputs[0]["generated_text"], "prompt": prompt}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


if __name__ == "__main__":
    # Start the server
    uvicorn.run(
        app,
        host="0.0.0.0",  # allow external access
        port=8000,
        log_level="info",
    )
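The server relies on `apply_chat_template` with `add_generation_prompt=True` to turn the message list into a single prompt string. To make that step less opaque, here is a simplified, hand-written approximation of the Llama 3 chat layout. It is an illustration only: the function name is made up, and the template shipped with the Llama 3.1 tokenizer does more (for example, it adds date lines to the system block and handles tools), so always use the tokenizer's own template in real code.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Simplified rendering of the Llama 3 chat layout (illustrative only)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn starts with a role header and ends with an end-of-turn token.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(message["content"].strip() + "<|eot_id|>")
    if add_generation_prompt:
        # An open assistant header tells the model to answer as the assistant.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = render_llama3_prompt([{"role": "user", "content": "Who are you?"}])
print(prompt)
```

The trailing assistant header is what `add_generation_prompt=True` contributes: without it the model may continue the user's turn instead of replying.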

Start the API service

Run in the foreground (to watch the logs)
python api_server.py

Test the API

# Test the root endpoint
curl http://localhost:8000/

# Health check
curl http://localhost:8000/health

# Chat endpoint
curl -X POST "http://localhost:8000/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant"},
      {"role": "user", "content": "What is the capital of China?"}
    ],
    "max_tokens": 100
  }'

# Simple generation endpoint
curl -X POST "http://localhost:8000/generate?prompt=Hello&max_tokens=50"
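The same chat request can be issued from Python with only the standard library. The sketch below mirrors the curl call above; the helper names `build_chat_request` and `send` are hypothetical, while the route and the `response` field follow api_server.py.

```python
import json
import urllib.request


def build_chat_request(base_url, user_text, system_text=None, max_tokens=100):
    """Build (but do not send) a POST request for the /chat/completions route."""
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    body = json.dumps({"messages": messages, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and decode the JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))


request = build_chat_request("http://localhost:8000", "What is the capital of China?")
# With the server running, the reply text would be: send(request)["response"]
```

Building the request separately from sending it makes the payload easy to inspect before any network traffic happens.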

III. Deploying and Using the API Service Across Servers

1. Server side

H100 server (hosts the API)

# h100_server.py
import logging
import os
import time
from contextlib import asynccontextmanager
from typing import List, Optional

import torch
import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, ConfigDict, field_validator, model_validator
from transformers import AutoModelForCausalLM, AutoTokenizer

# Logging configuration
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Model path
MODEL_PATH = "/dddd/ddddd/LLMs/Llama-3.1-8B-Instruct"

# Global model objects
model = None
tokenizer = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model at startup
    global model, tokenizer
    logger.info("Loading Llama-3.1-8B-Instruct on the H100...")
    try:
        tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_PATH,
            torch_dtype=torch.float16,
            device_map="auto",
            low_cpu_mem_usage=True,
        )
        logger.info(f"✅ Model loaded. Device: {model.device}")
        yield
    except Exception as e:
        logger.error(f"❌ Model loading failed: {e}")
        raise RuntimeError("Model initialization failed") from e
    finally:
        # Release resources
        if model is not None:
            model = None
            torch.cuda.empty_cache()


app = FastAPI(
    title="Llama 3.1 Instruct API",
    description="OpenAI-compatible API for Llama-3.1-8B-Instruct on H100",
    version="1.0",
    lifespan=lifespan,
)


# ======================
# Pydantic Models
# ======================
class ChatMessage(BaseModel):
    role: str
    content: str


class ChatCompletionRequest(BaseModel):
    model_config = ConfigDict(extra="ignore")

    messages: Optional[List[ChatMessage]] = None
    prompt: Optional[str] = None
    model: str = "llama-3.1-8b"
    max_tokens: int = 200
    temperature: float = 0.7
    top_p: float = 0.9
    stream: bool = False

    @model_validator(mode="after")
    def validate_input(self):
        if not self.messages and not self.prompt:
            raise ValueError("Either 'messages' or 'prompt' must be provided")
        if self.prompt and not self.messages:
            # Backward compatibility: convert prompt into messages
            self.messages = [ChatMessage(role="user", content=self.prompt)]
        return self

    @field_validator("max_tokens")
    @classmethod
    def validate_max_tokens(cls, v):
        if v < 1 or v > 2048:
            raise ValueError("max_tokens must be between 1 and 2048")
        return v


class ChatCompletionResponseChoice(BaseModel):
    index: int = 0
    message: ChatMessage
    finish_reason: str = "stop"


class ChatCompletionResponse(BaseModel):
    id: str
    object: str = "chat.completion"
    created: int
    model: str
    choices: List[ChatCompletionResponseChoice]
    usage: dict


# ======================
# Routes
# ======================
@app.get("/")
async def root():
    return {
        "service": "Llama 3.1 Instruct API",
        "status": "running",
        "device": str(model.device) if model else "uninitialized",
    }


@app.get("/health")
async def health_check():
    return {"status": "healthy", "model_loaded": model is not None}


@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    if model is None or tokenizer is None:
        raise HTTPException(status_code=503, detail="The model has not finished loading")
    if request.stream:
        raise HTTPException(status_code=400, detail="Streaming output is not supported yet")
    try:
        # Prepare the input
        if request.messages is None:
            raise ValueError("messages must not be empty")
        # Make sure messages is a list of dicts
        messages_dict = [msg.model_dump() for msg in request.messages]
        # Apply the chat template; return_dict=True makes the result a
        # BatchEncoding with input_ids regardless of the transformers version
        encoding = tokenizer.apply_chat_template(
            messages_dict,
            add_generation_prompt=True,
            return_tensors="pt",
            return_dict=True,
        )
        # Extract input_ids
        input_ids = encoding.input_ids.to(model.device)
        input_length = input_ids.shape[1]

        # Generate
        start_time = time.time()
        with torch.no_grad():
            outputs = model.generate(
                input_ids=input_ids,
                max_new_tokens=request.max_tokens,
                temperature=request.temperature,
                top_p=request.top_p,
                do_sample=True,
                pad_token_id=tokenizer.pad_token_id,
                eos_token_id=tokenizer.eos_token_id,
                use_cache=True,
            )
        gen_time = time.time() - start_time

        # Decode only the newly generated part
        new_tokens = outputs[0][input_length:]
        response_text = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

        # Build the response
        response = ChatCompletionResponse(
            id=f"cmpl-{int(time.time())}",
            created=int(time.time()),
            model=request.model,
            choices=[
                ChatCompletionResponseChoice(
                    message=ChatMessage(role="assistant", content=response_text)
                )
            ],
            usage={
                "prompt_tokens": input_length,
                "completion_tokens": len(new_tokens),
                "total_tokens": input_length + len(new_tokens),
            },
        )
        logger.info(
            f"✅ Generation finished | input: {len(request.messages)} messages | "
            f"output: {len(response_text)} characters | time: {gen_time:.2f}s"
        )
        return response
    except Exception as e:
        logger.error(f"❌ Generation error: {e}", exc_info=True)
        raise HTTPException(status_code=500, detail=str(e))


if __name__ == "__main__":
    port = int(os.getenv("PORT", 8000))
    uvicorn.run(app, host="0.0.0.0", port=port, log_level="info", workers=1)
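For a decoder-only model, `model.generate` returns the prompt tokens followed by the continuation, which is why the server slices the output at `input_length` before decoding and counting. The bookkeeping can be illustrated with plain lists; the token ids below are invented and the helper name is hypothetical.

```python
def split_generation(output_ids, input_length):
    """Separate prompt tokens from newly generated tokens and count both."""
    new_tokens = output_ids[input_length:]
    usage = {
        "prompt_tokens": input_length,
        "completion_tokens": len(new_tokens),
        "total_tokens": input_length + len(new_tokens),
    }
    return new_tokens, usage


# Hypothetical token ids: 4 prompt tokens followed by 3 generated ones.
new_tokens, usage = split_generation([101, 7, 8, 9, 42, 43, 44], input_length=4)
print(new_tokens, usage)
```

Slicing by position is more reliable than searching the decoded text for the prompt, because decoding and re-encoding does not always round-trip exactly.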

Start the API service on the H100

# Run in the foreground (to watch the logs)
python h100_server.py
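Loading the weights takes a while, so a client script may want to wait until the `/health` route reports `model_loaded` before sending requests. A minimal polling sketch, assuming the default port 8000; the function names and the retry settings are illustrative choices, not part of the server.

```python
import json
import time
import urllib.request


def fetch_health(url="http://localhost:8000/health", timeout=5):
    """Return the parsed /health JSON, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as reply:
            return json.loads(reply.read().decode("utf-8"))
    except (OSError, ValueError):
        return None


def wait_until_ready(fetch=fetch_health, attempts=60, delay=5.0, sleep=time.sleep):
    """Poll until the health payload reports model_loaded, or give up."""
    for _ in range(attempts):
        status = fetch()
        if status and status.get("model_loaded"):
            return True
        sleep(delay)
    return False
```

Passing `fetch` and `sleep` as parameters keeps the loop easy to test without a running server.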

Local test on the server

# test.py
from transformers import AutoTokenizer

MODEL_PATH = "/ddd/ddddd/LLMs/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

messages = [
    {"role": "system", "content": "你是一个 helpful AI 助手。"},  # "You are a helpful AI assistant."
    {"role": "user", "content": "你好，请介绍一下你自己。"},  # "Hello, please introduce yourself."
]

print("Testing the fixed logic...")
# return_dict=True makes the result a BatchEncoding on every transformers version
encoding = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
)
print(f"encoding type: {type(encoding)}")
print(f"encoding.input_ids type: {type(encoding.input_ids)}")
print(f"encoding.input_ids.shape: {encoding.input_ids.shape}")
print("✅ Test passed!")

(llama_3_1) root@f2osdcap97m-0:/ddd/qddyj/LLMs# python test.py
Testing the fixed logic...
encoding type: <class 'transformers.tokenization_utils_base.BatchEncoding'>
encoding.input_ids type: <class 'torch.Tensor'>
encoding.input_ids.shape: torch.Size([1, 51])
✅ Test passed!

2. Client side

Set up the environment

conda create -n llama3 python=3.11
conda activate llama3
pip install openai  # required by the client script below

(1) In one terminal, open an SSH tunnel (port forwarding)

ssh -L 8000:localhost:8000 <user>@115.11.11.111 -p 30304  # 115.11.11.111 is the remote server's IP; replace <user> with your SSH username

Check that the tunnel is up

netstat -tuln | grep 8000

Any output means the tunnel is listening:

tcp   0  0 127.0.0.1:8000  0.0.0.0:*  LISTEN
tcp6  0  0 ::1:8000        :::*       LISTEN
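The same check can be done from Python, which is handy inside a client script. This sketch simply tries to open a TCP connection to the forwarded port; the function name and the defaults are illustrative.

```python
import socket


def is_port_open(host="127.0.0.1", port=8000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print("tunnel up" if is_port_open() else "tunnel down")
```

A successful connection only shows that something is listening locally; it does not prove that the API behind the tunnel is healthy, so a request to `/health` is still a useful second step.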

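(2) In another terminal, run the Python script

python client_llama3.py

# client_llama3.py
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # keep the /v1 suffix
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[
        {"role": "system", "content": "你是一个 helpful AI 助手。"},  # "You are a helpful AI assistant."
        {"role": "user", "content": "你好，请介绍一下你自己。"},  # "Hello, please introduce yourself."
    ],
    max_tokens=150,
)
print(response.choices[0].message.content)

The reply (translated from Chinese):

(llama3) rd-111s@rd-111s-Z790-PG-ITX-TB4:~/document/ddd/LLM$ python client_llama3.py
Hello! I am LLaMA, an artificial intelligence assistant. My name comes from "Large Language Model Application". I am a language model that can understand and generate human language.
I can answer all kinds of questions, provide information, help you complete tasks, and even write and hold conversations. My knowledge covers a wide range of fields, including but not limited to technology, history, culture, and entertainment.
I am a learning model that can keep learning and improving. Through interaction with users I can improve my understanding and generation abilities and provide better service.
I hope to become a useful assistant for you, helping you solve problems, find answers, and even have some fun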

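New problem 1: the server port is already in use, or the client cannot reach the server over SSH

Solution 1: use a different port on the server. If you change it, use the same remote port in the SSH tunnel command.

if __name__ == "__main__":
    # Start the server
    uvicorn.run(
        app,
        host="0.0.0.0",  # allow external access
        port=8000,  # change this to a free port
        log_level="info",
    )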
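If you are not sure which port is free, you can ask the operating system for one by binding to port 0. The sketch below prints a command line that passes the result through the `PORT` environment variable, which h100_server.py reads; the helper name is made up for illustration.

```python
import socket


def find_free_port(host="0.0.0.0"):
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind((host, 0))
        return probe.getsockname()[1]


port = find_free_port()
print(f"PORT={port} python h100_server.py")
```

There is a small race here: another process could take the port between the probe and the server start, so treat the result as a good candidate rather than a guarantee.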

Solution 2: stop all related processes

# Check port 8000
(llama_3_1) root@f2osdcap97m-0:/ddd/dd/LLMs# netstat -tulpn | grep :8000
tcp 0 0 127.0.0.1:8000 0.0.0.0:* LISTEN 1845141/ssh

Kill the process:
kill -9 1845141

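Open the tunnel a different way:
Open a new terminal window and run

# Run in a new terminal on the local machine
ssh -4 -N -L 8000:localhost:8000 <user>@115.11.11.111 -p 30304
# That is, insert -4 -N -L 8000:localhost:8000 right after "ssh" in the original command: ssh <user>@115.11.11.111 -p 30304
# -4 forces IPv4; -N opens the tunnel without running a remote command

After you enter the password there may be no output at all. Leave this terminal open and test from another one:

(llama3) rd-111s@rd-111s-Z790-PG-ITX-TB4:~/document/ddd/code/SGRL (v2)$ netstat -tuln | grep 8000
tcp 0 0 127.0.0.1:8000 0.0.0.0:* LISTEN
(llama3) rd-111s@rd-111s-Z790-PG-ITX-TB4:~/document/ddd/code/SGRL (v2)$

The connection is established.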

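New problem 2: the SSH connection is unstable

Solution: write a script that reconnects whenever the tunnel drops.
Usage:

# Set the environment variable
export SSH_TUNNEL_PASSWORD="your_password"
# Run the script
./ssh_tunnel.sh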

The script is as follows:

#!/bin/bash

# SSH tunnel parameters
SSH_HOST="<user>@<server_ip>"  # username@server IP address
SSH_PORT="111111"              # port of the SSH service
LOCAL_PORT="8000"              # local listening port
REMOTE_HOST="localhost"        # localhost here means the remote server itself
REMOTE_PORT="8000"             # port of the target service on the remote server

# Read the password from the environment variable
SSH_PASSWORD="${SSH_TUNNEL_PASSWORD}"

# Check that the password is set
if [ -z "$SSH_PASSWORD" ]; then
    echo "Error: please set the SSH_TUNNEL_PASSWORD environment variable"
    echo "Usage: export SSH_TUNNEL_PASSWORD='your_password'"
    exit 1
fi

# Log and PID files
LOG_FILE="ssh_tunnel.log"  # set your own path
PID_FILE="ssh_tunnel.pid"  # set your own path

# Function: check whether the local port is listening
check_port() {
    if command -v nc &> /dev/null; then
        nc -z localhost "$LOCAL_PORT"
    elif command -v telnet &> /dev/null; then
        echo "" | telnet localhost "$LOCAL_PORT" 2>&1 | grep -q "Connected"
    else
        ss -tln | grep -q ":$LOCAL_PORT"
    fi
}

# Function: start the SSH tunnel
start_tunnel() {
    echo "$(date): Starting SSH tunnel..." >> "$LOG_FILE"
    if ! command -v sshpass &> /dev/null; then
        echo "$(date): Error: sshpass is not installed" >> "$LOG_FILE"
        return 1
    fi
    if command -v autossh &> /dev/null; then
        sshpass -p "$SSH_PASSWORD" autossh -M 0 -4 -N -L "${LOCAL_PORT}:${REMOTE_HOST}:${REMOTE_PORT}" "${SSH_HOST}" -p "${SSH_PORT}" \
            -o "ServerAliveInterval=30" \
            -o "ServerAliveCountMax=3" \
            -o "ExitOnForwardFailure=yes" \
            -o "StrictHostKeyChecking=no" &
        echo $! > "$PID_FILE"
    else
        sshpass -p "$SSH_PASSWORD" ssh -4 -N -L "${LOCAL_PORT}:${REMOTE_HOST}:${REMOTE_PORT}" "${SSH_HOST}" -p "${SSH_PORT}" \
            -o "ServerAliveInterval=30" \
            -o "ServerAliveCountMax=3" \
            -o "ExitOnForwardFailure=yes" \
            -o "StrictHostKeyChecking=no" &
        echo $! > "$PID_FILE"
    fi
    sleep 3
}

# Function: stop the SSH tunnel
stop_tunnel() {
    if [ -f "$PID_FILE" ]; then
        PID=$(cat "$PID_FILE")
        if kill -0 "$PID" 2>/dev/null; then
            echo "$(date): Stopping existing SSH tunnel (PID: $PID)" >> "$LOG_FILE"
            kill "$PID"
            sleep 2
        fi
        rm -f "$PID_FILE"
    fi
    # Make sure every matching ssh process is cleaned up
    pkill -f "ssh.*-L ${LOCAL_PORT}:${REMOTE_HOST}:${REMOTE_PORT}" 2>/dev/null
}

# Cleanup function
cleanup() {
    echo "$(date): Exit signal received, cleaning up..." >> "$LOG_FILE"
    stop_tunnel
    exit 0
}

# Install signal handlers
trap cleanup SIGINT SIGTERM

# Main loop
echo "$(date): SSH tunnel monitor started" >> "$LOG_FILE"
echo "$(date): Target: ${SSH_HOST}:${SSH_PORT}" >> "$LOG_FILE"
echo "$(date): Local forward: ${LOCAL_PORT} -> ${REMOTE_HOST}:${REMOTE_PORT}" >> "$LOG_FILE"

while true; do
    # Check whether the tunnel is running
    if ! check_port; then
        echo "$(date): Tunnel is down or the port is unavailable, restarting..." >> "$LOG_FILE"
        stop_tunnel
        start_tunnel
    fi
    # Check every 10 seconds
    sleep 10
done
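The control flow of the watchdog (check, and on failure stop then start, then sleep) is easier to see with the shell details stripped away. Below is a Python sketch of that loop with the three actions passed in as callables. The names are hypothetical, and unlike the shell script it stops after `max_checks` iterations so that it can be exercised in a test.

```python
import time


def supervise(check, start, stop, max_checks, interval=10.0, sleep=time.sleep):
    """Restart the tunnel whenever check() fails; return the restart count."""
    restarts = 0
    for _ in range(max_checks):
        if not check():
            stop()   # clean up any half-dead tunnel first
            start()  # then bring up a fresh one
            restarts += 1
        sleep(interval)
    return restarts


# Demo with stand-in callables: the tunnel is "down" on the 1st and 3rd check.
health = iter([False, True, False, True])
events = []
restarts = supervise(
    check=lambda: next(health),
    start=lambda: events.append("start"),
    stop=lambda: events.append("stop"),
    max_checks=4,
    interval=0,
    sleep=lambda seconds: None,
)
print(restarts, events)
```

In a real deployment `check` could be a TCP probe of the local port and `start`/`stop` could wrap `subprocess` calls to ssh; those parts are left out here on purpose.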
