Comparing Efficient Deployment Methods for Large Models: LLaMA2 as an Example | 极客日志


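Summary (AI-generated): Using LLaMA2 deployment as an example, this article compares the strengths and weaknesses of eight open-source inference serving frameworks: vLLM, Text generation inference, CTranslate2, DeepSpeed-MII, OpenLLM, Ray Serve, MLC LLM, and LightLLM. It covers each framework's performance characteristics, supported features (such as continuous batching, quantization, and adapter support), and suitable use cases, and provides code examples and launch commands to help developers choose a deployment approach that fits their needs.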

Author: 全栈工匠 · Published 2025/2/6 · Updated 2026/5/28

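N Efficient Ways to Deploy Large Models: LLaMA2 as an Example

This article deploys LLaMA2 with several open-source LLM inference serving frameworks and compares their strengths and weaknesses.

It does not cover traditional serving libraries for deep learning models, such as TorchServe, KServe, or Triton Inference Server.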

1. vLLM

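Its throughput is reported to be 14x to 24x higher than huggingface transformers (HF) and 2.2x higher than Text Generation Inference (TGI). It provides continuous batching and PagedAttention, and integrates various decoding algorithms, including parallel sampling and beam search. However, it lacks support for adapters (LoRA, QLoRA, etc.).

Follow the official repository for later feature updates.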
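To make the idea of continuous batching concrete, here is a toy simulation that counts decode steps under two scheduling policies. It is only a sketch under simplifying assumptions (one token per sequence per step, a fixed number of batch slots, known output lengths) and is not how vLLM is implemented; the function names are made up for this illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch must finish before the next one starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # A static batch runs until its longest sequence is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished sequence's slot is refilled immediately."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight sequence
    steps = 0
    while waiting or running:
        # Fill free slots from the queue before every decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step emits one token per running sequence;
        # sequences that reach zero remaining tokens leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens) for six requests.
lengths = [2, 10, 3, 10, 2, 2]
print("static:", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With mixed short and long outputs, the continuous policy needs fewer steps because short requests no longer wait for the longest request in their batch.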

Local inference

# pip install vllm
from vllm import LLM, SamplingParams

prompts = [
    "Funniest joke ever:",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.95, top_p=0.95, max_tokens=200)
llm = LLM(model="huggyllama/llama-13b")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
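The `temperature` and `top_p` values passed to `SamplingParams` above control how the next token is sampled. The following toy sketch shows one common way these two knobs are defined; it is not vLLM's code, and the helper names are invented for this illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize over the kept tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


probs = softmax_with_temperature([2.0, 1.0, 0.5, -1.0], temperature=0.95)
print(top_p_filter(probs, top_p=0.95))
```

Sampling then draws the next token from the filtered, renormalized distribution instead of the full vocabulary.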

API server

# Start the server (the model is selected with --model):
python -m vllm.entrypoints.api_server --model huggyllama/llama-13b

# Query the model in shell:
curl http://localhost:8000/generate \
    -d '{
        "prompt": "Funniest joke ever:",
        "n": 1,
        "temperature": 0.95,
        "max_tokens": 200
    }'
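The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above; the helper names are made up for this example, and the shape of the JSON reply depends on the vLLM version, so inspect it before relying on specific keys.

```python
import json
import urllib.request


def build_generate_request(prompt, n=1, temperature=0.95, max_tokens=200):
    """Build the same JSON body as the curl example."""
    return {
        "prompt": prompt,
        "n": n,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def query_server(payload, url="http://localhost:8000/generate"):
    """POST the payload to the server and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `query_server(build_generate_request("Funniest joke ever:"))` sends the request and returns the parsed reply.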

2. Text generation inference

A Rust, Python, and gRPC server for text generation inference. It is used in production at HuggingFace to power the LLM API inference widget. It has built-in Prometheus metrics for monitoring server load and performance, and it can use Flashattention and PagedAttention. All dependencies are installed in Docker, and it supports HuggingFace models. It offers many options for managing model inference, including precision adjustment, quantization, tensor parallelism, and repetition penalty. It suits people who know Rust programming.
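Of the inference options listed for Text generation inference, the repetition penalty is easy to illustrate in isolation. The sketch below shows one widely used formulation (lower the score of every token that has already been generated); it is an assumption for illustration and may differ in detail from what the server does internally. The function name is hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Return a copy of logits with already-generated tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # For penalty > 1, dividing a positive score or multiplying a
        # negative one both push the score down.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


# Tokens 0 and 1 were already generated; token 2 is untouched.
print(apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1, 1], penalty=2.0))
```

A penalty of 1.0 leaves the logits unchanged; larger values discourage repeats more strongly.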

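Run the web server with Docker

mkdir data
# Mount the directory created above (a bare "data" would create a named Docker volume instead)
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v "$PWD/data:/data" ghcr.io/huggingface/text-generation-inference:0.9 \
  --model-id huggyllama/llama-13b \
  --num-shard 1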

Query

# pip install text-generation
from text_generation import Client

client = Client("http://127.0.0.1:8080")
prompt = "Funniest joke ever:"
print(client.generate(prompt, max_new_tokens=17, temperature=0.95).generated_text)

3. CTranslate2

pip install -qqq transformers ctranslate2

# The model should be first converted into the CTranslate2 model format:
ct2-transformers-converter --model huggyllama/llama-13b --output_dir llama-13b-ct2 --force

Query

import ctranslate2
import transformers

generator = ctranslate2.Generator("llama-13b-ct2", device="cuda", compute_type="float16")
tokenizer = transformers.AutoTokenizer.from_pretrained("huggyllama/llama-13b")

prompt = "Funniest joke ever:"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch(
    [tokens], 
    sampling_topk=1, 
    max_length=200, 
)
tokens = results[0].sequences_ids[0]
output = tokenizer.decode(tokens)
print(output)
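The `compute_type="float16"` argument above selects the numeric precision used at inference time, and CTranslate2 also accepts 8-bit integer types such as `"int8"`. As a rough intuition for what 8-bit weight quantization does, here is a toy symmetric quantizer in plain Python. It is a sketch for illustration only, not CTranslate2's actual scheme or kernels, and the function names are invented.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.9, -0.3, 0.05, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)
print(quantized)
print([round(r, 4) for r in restored])
```

Each weight is stored in one byte plus a shared scale, at the cost of a rounding error of at most half a quantization step per value.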

4. DeepSpeed-MII

# DON'T INSTALL USING pip install deepspeed-mii
# git clone https://github.com/microsoft/DeepSpeed-MII.git
# cd DeepSpeed-MII
# git reset --hard 60a85dc3da5bac3bcefa8824175f8646a0f12203
# pip install .
# pip3 install -U deepspeed

# ... and make sure that you have same CUDA versions:
# python -c "import torch;print(torch.version.cuda)" == nvcc --version

import mii

mii_configs = {
    "dtype": "fp16",
    "max_tokens": 200,
    "tensor_parallel": 1,
    "enable_load_balancing": False,
}
mii.deploy(task="text-generation",
           model="huggyllama/llama-13b",
           deployment_name="llama_13b_deployment",
           mii_config=mii_configs)

Query

import mii

generator = mii.mii_query_handle("llama_13b_deployment")
result = generator.query(  
  {"query": ["Funniest joke ever:"]}, 
  do_sample=True,
  max_new_tokens=200
)
print(result)

5. OpenLLM

pip install openllm scipy
openllm start llama --model-id huggyllama/llama-13b \
  --max-new-tokens 200 \
  --temperature 0.95 \
  --api-workers 1 \
  --workers-per-resource 1

Query

import openllm

client = openllm.client.HTTPClient('http://localhost:3000')
print(client.query("Funniest joke ever:"))

6. Ray Serve

# pip install "ray[serve]" "accelerate>=0.16.0" "transformers>=4.26.0" torch starlette pandas
# ray_serve.py
import pandas as pd

import ray
from ray import serve
from starlette.requests import Request

@serve.deployment(ray_actor_options={"num_gpus": 1})
class PredictDeployment:
    def __init__(self, model_id: str):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch

        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

    def generate(self, text: str) -> pd.DataFrame:
        input_ids = self.tokenizer(text, return_tensors="pt").input_ids.to(
            self.model.device
        )
        gen_tokens = self.model.generate(
            input_ids,
            do_sample=True,  # temperature only takes effect when sampling is enabled
            temperature=0.9,
            max_length=200,
        )
        return pd.DataFrame(
            self.tokenizer.batch_decode(gen_tokens), columns=["responses"]
        )

    async def __call__(self, http_request: Request) -> pd.DataFrame:
        payload = await http_request.json()
        # The query client posts a list of {"text": ...} objects; also accept a single object.
        if isinstance(payload, list):
            payload = payload[0] if payload else {}
        prompt = payload.get("text", "")
        return self.generate(prompt)

deployment = PredictDeployment.bind(model_id="huggyllama/llama-13b")

# then run from CLI command:
# serve run ray_serve:deployment

Query

import requests

sample_input = {"text": "Funniest joke ever:"}
output = requests.post("http://localhost:8000/", json=[sample_input]).json()
print(output)

7. MLC LLM

# 1. Make sure that you have python >= 3.9
# 2. You have to run it using conda:
conda create -n mlc-chat-venv -c mlc-ai -c conda-forge mlc-chat-nightly
conda activate mlc-chat-venv

# 3. Then install package:
pip install --pre --force-reinstall mlc-ai-nightly-cu118 \
  mlc-chat-nightly-cu118 \
  -f https://mlc.ai/wheels

# 4. Download the model weights from HuggingFace and binary libraries:
git lfs install && mkdir -p dist/prebuilt && \
  git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt/lib && \
  cd dist/prebuilt && \
  git clone https://huggingface.co/huggyllama/llama-13b dist/ && \
  cd ../..

Run the server

python -m mlc_chat.rest --device-name cuda --artifact-path dist

Query

import requests

payload = {
   # Model name adjusted to match the llama-13b weights downloaded above
   "model": "llama-13b",
   "messages": [{"role": "user", "content": "Funniest joke ever:"}],
   "stream": False
}
r = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
print(r.json()['choices'][0]['message']['content'])

8. LightLLM

python -m lightllm.server.api_server --model_dir /path/llama-7B --tp 1 --max_total_token_num 120000

curl 127.0.0.1:8000/generate \
    -X POST \
    -d '{"inputs":"What is AI?","parameters":{"max_new_tokens":17, "frequency_penalty":1}}' \
    -H 'Content-Type: application/json'

Access from a Python script

import json
import requests

url = 'http://localhost:8000/generate'
headers = {'Content-Type': 'application/json'}
data = {
    'inputs': 'What is AI?',
    'parameters': {
        'do_sample': False,
        'ignore_eos': False,
        'max_new_tokens': 1024,
    }
}
response = requests.post(url, headers=headers, data=json.dumps(data))
if response.status_code == 200:
    print(response.json())
else:
    print('Error:', response.status_code, response.text)

Other

type: task

env:
  - MODEL=huggyllama/llama-13b
  # (Optional) Specify your Hugging Face token
  - HUGGING_FACE_HUB_TOKEN=

ports:
  - 8000

commands:
  - conda install cuda # Required since vLLM will rebuild the CUDA kernel
  - pip install vllm
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
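The task above starts vLLM's OpenAI-compatible server on port 8000, so it can be queried with a completions-style request. The following is a minimal sketch using only the standard library; the helper names are made up for this example, and the host, port, and parameter defaults are assumptions taken from the configuration above.

```python
import json
import urllib.request


def build_completion_request(model, prompt, max_tokens=200, temperature=0.95):
    """Build a completions-style request body."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def extract_completion_text(response_json):
    """Return the text of the first choice in a completions-style reply."""
    return response_json["choices"][0]["text"]


def complete(payload, base_url="http://localhost:8000"):
    """POST the payload to /v1/completions and return the generated text."""
    request = urllib.request.Request(
        base_url + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return extract_completion_text(json.loads(response.read().decode("utf-8")))
```

With the server running, `complete(build_completion_request("huggyllama/llama-13b", "Funniest joke ever:"))` returns the generated text; the model name must match the one the server was started with.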

Article outline

1. vLLM: local inference, API server
2. Text generation inference: run the web server with Docker, query
3. CTranslate2: query
4. DeepSpeed-MII: query
5. OpenLLM: query
6. Ray Serve: query
7. MLC LLM: run the server, query
8. LightLLM: access from a Python script
Other
Conclusion
Deployment recommendations
