Xinference: Large Model Deployment and Distributed Inference Framework Guide | 极客日志

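Summary (AI-generated): Xinference is a distributed inference framework that supports large language models, speech recognition models, and multimodal models. This article covers installation, starting the service, model deployment configuration, and calling the API. It walks through managing models from the Web GUI, the command-line tool, and the Python SDK, and explains how to integrate LoRA fine-tuned models.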

PentesterX · Published 2025/2/6 · Updated 2026/6/6

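Xinference Basics

Overview

Xorbits Inference (Xinference) is a powerful, full-featured distributed inference framework. It can serve large language models (LLMs), speech recognition models, multimodal models, and other model types. With Xorbits Inference you can deploy your own models, or the built-in state-of-the-art open-source models, with one click.

GitHub: https://github.com/xorbitsai/inference

Official documentation: https://inference.readthedocs.io/zh-cn/latest/index.html

Installation

Xinference can be installed with pip on Linux, Windows, and macOS. To run inference, install the engine that matches the models you plan to serve.

Xinference currently supports the following inference engines: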

vllm
sglang
llama.cpp
transformers

Create a conda environment named xinference with Python 3.10 (activate it afterwards with `conda activate xinference`):

conda create -n xinference python=3.10

To be able to run every supported model, install all required dependencies with:

pip install "xinference[all]"

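Using other engines

# Transformers engine
pip install "xinference[transformers]"

# vLLM engine
pip install "xinference[vllm]"

# llama.cpp engine
# First install the base package:
pip install xinference
# Note: recent llama-cpp-python releases renamed these CMake options
# (for example -DGGML_CUDA=on); check its README for your version.
# Apple M-series:
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
# NVIDIA GPUs:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
# AMD GPUs:
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python

# SGLang engine
pip install 'xinference[sglang]'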

Note: the Xinference installation may fail with errors; consult the official documentation or the community FAQ to resolve them.
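
After installing, it can help to confirm which engines are actually importable in the current environment. The following is a minimal sketch, not part of Xinference itself: the module names in `ENGINE_MODULES` are assumptions about what each package is imported as (for instance, llama-cpp-python is imported as `llama_cpp`), and `available_engines` is an illustrative helper.

```python
import importlib.util

# Assumed import names for the packages installed by the pip commands above.
ENGINE_MODULES = {
    "vllm": "vllm",
    "sglang": "sglang",
    "llama.cpp": "llama_cpp",
    "transformers": "transformers",
}


def available_engines(modules=ENGINE_MODULES):
    """Return {engine: bool}, True when the engine's module can be located."""
    return {
        engine: importlib.util.find_spec(module) is not None
        for engine, module in modules.items()
    }


if __name__ == "__main__":
    for engine, ok in available_engines().items():
        print(f"{engine:<13} {'installed' if ok else 'missing'}")
```

A missing entry only means the Python package was not found; it says nothing about whether GPU support was compiled in.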

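Starting the Service

Xinference can run locally, in Docker, or on a cluster. This guide runs it locally.

Run the following command to start a local Xinference service. The second form sets the listen address and port explicitly; binding to 0.0.0.0 makes the service reachable from other machines.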

xinference-local

xinference-local --host 0.0.0.0 --port 9997

The startup log is as follows:

(xinference) root@master:~# xinference-local
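
Once the service is running, one quick way to confirm it is reachable is to request the model list endpoint (`/v1/models`). The sketch below uses only the standard library; the helper names are illustrative, and it assumes the address `http://localhost:9997` and that authentication is not enabled.

```python
import json
import urllib.error
import urllib.request


def models_url(base_url):
    """Build the model-list URL for a server base URL."""
    return base_url.rstrip("/") + "/v1/models"


def is_server_up(base_url="http://localhost:9997", timeout=3.0):
    """Return True if GET /v1/models answers 200 with a JSON body."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            json.load(resp)  # raises ValueError if the body is not JSON
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or non-JSON body.
        return False


if __name__ == "__main__":
    print("Xinference reachable:", is_server_up())
```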

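By default, Xinference uses `<HOME>/.xinference` as its home directory, where it stores data such as log files and model files.

Set the `XINFERENCE_HOME` environment variable to change the home directory, for example: XINFERENCE_HOME=/tmp/xinference xinference-local --host 0.0.0.0 --port 9997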

(xinference) root@master:~# ls .xinference/
cache  logs
(xinference) root@master:~# ls .xinference/cache/
chatglm3-pytorch-6b
(xinference) root@master:~# ls .xinference/logs/
local_1721628924181  local_1721629451488  local_1721697225558  local_1721698858667
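
The lookup rule described above can be mirrored in a few lines when a script needs to find the cache or log directory. This is a sketch of the described behavior, not Xinference's own code; `xinference_home` and `describe_layout` are hypothetical helper names.

```python
import os
from pathlib import Path


def xinference_home(env=None):
    """Resolve the home directory: XINFERENCE_HOME if set, else ~/.xinference."""
    env = os.environ if env is None else env
    custom = env.get("XINFERENCE_HOME")
    return Path(custom) if custom else Path.home() / ".xinference"


def describe_layout(home):
    """Map the sub-directories seen in the listing above to their paths."""
    return {name: home / name for name in ("cache", "logs")}


if __name__ == "__main__":
    for name, path in describe_layout(xinference_home()).items():
        print(f"{name}: {path} (exists: {path.exists()})")
```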

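Model Deployment

Model parameter configuration

Model Engine: the inference engine; which engines are available depends on the model.

Model Format: the model format, either quantized (ggml, gptq, etc.) or non-quantized (pytorch).

Model Size: the parameter count, which varies by model, for example 6B, 7B, 13B, or 70B.

Quantization: the quantization precision, for example 4-bit or 8-bit.

N-GPU: the number of GPUs the model uses. Choose Auto, CPU, or a GPU count; the default is Auto.

Replica: the number of model replicas; the default is 1.

Model UID: the model's UID, effectively a custom name for the model; it defaults to the original model name.

Request Limits: the maximum number of requests allowed for the model. The default is None, which means no limit.

Worker Ip: the IP of the worker that hosts the model in a distributed deployment.

Gpu Idx: the index of the GPU the model runs on.

Download hub: where the model is downloaded from: none, huggingface, or modelscope.

Lora Model Config: a list of PEFT (parameter-efficient fine-tuning) models and their paths.

Lora Load Kwargs for Image Model: a dictionary of LoRA load parameters for image models.

Lora Fuse Kwargs for Image Model: a dictionary of LoRA fuse parameters for image models.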
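
Most of these fields have a command-line counterpart. The sketch below collects a subset of them in a dataclass and turns it into an argument list; the flag names are the ones printed by `xinference launch --help` later in this article, the defaults follow the descriptions above, and `LaunchConfig` itself is a hypothetical helper rather than part of Xinference.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LaunchConfig:
    """A subset of the launch fields described above."""
    model_name: str
    model_engine: Optional[str] = None
    model_uid: Optional[str] = None
    size_in_billions: Optional[str] = None
    model_format: Optional[str] = None
    quantization: Optional[str] = None
    n_gpu: str = "auto"   # default per the text
    replica: int = 1      # default per the text

    def to_cli_args(self) -> List[str]:
        """Build the argument list for `xinference launch`."""
        args = ["xinference", "launch", "--model-name", self.model_name]
        optional = [
            ("--model-engine", self.model_engine),
            ("--model-uid", self.model_uid),
            ("--size-in-billions", self.size_in_billions),
            ("--model-format", self.model_format),
            ("--quantization", self.quantization),
        ]
        for flag, value in optional:
            if value is not None:  # omit flags the user did not set
                args += [flag, str(value)]
        args += ["--n-gpu", self.n_gpu, "--replica", str(self.replica)]
        return args


if __name__ == "__main__":
    cfg = LaunchConfig("chatglm3", model_engine="transformers",
                       model_format="pytorch", size_in_billions="6",
                       quantization="4-bit", model_uid="my-llm")
    print(" ".join(cfg.to_cli_args()))
```

The resulting list can be passed to `subprocess.run` on a machine where the `xinference` command is installed.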

API Endpoints

Chat endpoint

curl -X 'POST' \
  'http://localhost:9997/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "chatglm3",
    "messages": [
      {
        "role": "user",
        "content": "你好啊"
      }
    ]
  }'

{"id":"chat73f8c754-4898-11ef-89f6-000c2981d002","object":"chat.completion","created":1721700508,"model":"chatglm3","choices":[{"index":0,"message":{"role":"assistant","content":"你好👋！我是人工智能助手 ChatGLM3-6B，很高兴见到你，欢迎问我任何问题。"},"finish_reason":"stop"}],"usage":{"prompt_tokens":-1,"completion_tokens":-1,"total_tokens":-1}}
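
To use the reply in a program, the JSON body has to be unpacked. The sketch below parses a shortened copy of the response above; the helper names are illustrative, and reading a token count of -1 as "not reported" is an interpretation of the sample, not documented behavior.

```python
import json

# Shortened copy of the response shown above.
SAMPLE = """
{"object": "chat.completion", "model": "chatglm3",
 "choices": [{"index": 0,
              "message": {"role": "assistant", "content": "你好"},
              "finish_reason": "stop"}],
 "usage": {"prompt_tokens": -1, "completion_tokens": -1, "total_tokens": -1}}
"""


def first_reply(body):
    """Return the assistant text of the first choice."""
    data = json.loads(body)
    return data["choices"][0]["message"]["content"]


def token_usage(body):
    """Return total_tokens, or None when the server reports a negative value."""
    total = json.loads(body).get("usage", {}).get("total_tokens", -1)
    return total if total >= 0 else None


if __name__ == "__main__":
    print(first_reply(SAMPLE))
    print(token_usage(SAMPLE))
```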

Model list

curl -X 'GET' \
  'http://localhost:9997/v1/models' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json'

{"object":"list","data":[{"id":"chatglm3","object":"model","created":0,"owned_by":"xinference","model_type":"LLM","address":"0.0.0.0:38145","accelerators":["0"],"model_name":"chatglm3","model_lang":["en","zh"],"model_ability":["chat","tools"],"model_description":"ChatGLM3 is the third generation of ChatGLM, still open-source and trained on Chinese and English data.","model_format":"pytorch","model_size_in_billions":6,"model_family":"chatglm3","quantization":"4-bit","model_hub":"modelscope","revision":"v1.0.2","context_length":8192,"replica":1}]}
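
The `model_ability` field in this response makes it possible to pick a suitable model at runtime. A minimal sketch, using a trimmed copy of the response above; `models_with_ability` is an illustrative helper name.

```python
import json

# Trimmed copy of the /v1/models response shown above.
SAMPLE = json.dumps({
    "object": "list",
    "data": [{
        "id": "chatglm3",
        "model_type": "LLM",
        "model_ability": ["chat", "tools"],
        "context_length": 8192,
    }],
})


def models_with_ability(body, ability):
    """Return the ids of running models that list the given ability."""
    data = json.loads(body)["data"]
    return [m["id"] for m in data if ability in m.get("model_ability", [])]


if __name__ == "__main__":
    print(models_with_ability(SAMPLE, "chat"))
```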

Embedding models

curl -X 'POST' \
  'http://localhost:9997/v1/embeddings' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<embedding model name or UID>",
    "input": "你好啊"
  }'
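
Embeddings are usually compared with cosine similarity. The sketch below sends the same request from Python and compares two vectors. It assumes the response follows the OpenAI-style shape `{"data": [{"embedding": [...]}]}`, which the article does not show, so verify it against your server; `embed` and `cosine` are illustrative helper names.

```python
import json
import math
import urllib.request


def embed(text, model, base_url="http://localhost:9997"):
    """POST to /v1/embeddings and return the first embedding vector."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Assumption: OpenAI-style body with data[0].embedding.
        return json.load(resp)["data"][0]["embedding"]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

With a running embedding model, `cosine(embed("text a", uid), embed("text b", uid))` gives a similarity score, where `uid` is the UID of that model.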

Rerank models

curl -X 'POST' \
  'http://localhost:9997/v1/rerank' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
	  "model": "bge-reranker-base",
	  "query": "你是谁？",
	  "documents": [
		"你是一名乐于助人的 AI 助手。",
		"你的名字叫'rerank'"
	  ]
	}'
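
A rerank call returns one relevance score per document, which then has to be mapped back to the input list. The article does not show the response, so the body below is hypothetical: the `results`, `index`, and `relevance_score` field names and the score values are assumptions to check against a real response.

```python
import json

# Hypothetical response body; field names and scores are assumptions.
SAMPLE = json.dumps({
    "results": [
        {"index": 0, "relevance_score": 0.12},
        {"index": 1, "relevance_score": 0.87},
    ]
})


def rank_documents(documents, body):
    """Return (document, score) pairs, highest score first."""
    results = json.loads(body)["results"]
    ordered = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [(documents[r["index"]], r["relevance_score"]) for r in ordered]


if __name__ == "__main__":
    docs = ["first document", "second document"]
    for doc, score in rank_documents(docs, SAMPLE):
        print(f"{score:.2f}  {doc}")
```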

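Using the Xinference SDK

pip install xinference-client==${SERVER_VERSION}

from xinference.client import RESTfulClient

client = RESTfulClient("http://127.0.0.1:9997")
# Note: my-llm is the value passed to `--model-uid` when the model was launched.
model = client.get_model("my-llm")
# This call matches the client version used in this article;
# newer releases may expect a `messages` list instead.
print(model.chat(
    prompt="你好啊",
    system_prompt="你是一个乐于助人的 AI 助手。",
    chat_history=[]
))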

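Using the OpenAI SDK

from openai import OpenAI

# An empty key can produce an invalid Authorization header,
# so pass any non-empty placeholder when authentication is disabled.
client = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="not-needed")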

response = client.chat.completions.create(
    model="my-llm",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the largest animal?"}
    ]
)
print(response)

Command-Line Tool

Overview

pip install xinference

(xinference) root@master:~# xinference --help
Usage: xinference [OPTIONS] COMMAND [ARGS]...

  Xinference command-line interface for serving and deploying models.

Options:
  -v, --version       Show the current version of the Xinference tool.
  --log-level TEXT    Set the logger level. Options listed from most log to
                      least log are: DEBUG > INFO > WARNING > ERROR > CRITICAL
                      (Default level is INFO)
  -H, --host TEXT     Specify the host address for the Xinference server.
  -p, --port INTEGER  Specify the port number for the Xinference server.
  --help              Show this message and exit.

Commands:
  cached         List all cached models in Xinference.
  cal-model-mem  calculate gpu mem usage with specified model size and...
  chat           Chat with a running LLM.
  engine         Query the applicable inference engine by model name.
  generate       Generate text using a running LLM.
  launch         Launch a model with the Xinference framework with the...
  list           List all running models in Xinference.
  login          Login when the cluster is authenticated.
  register       Register a new model with Xinference for deployment.
  registrations  List all registered models in Xinference.
  remove-cache   Remove selected cached models in Xinference.
  stop-cluster   Stop a cluster using the Xinference framework with the...
  terminate      Terminate a deployed model through unique identifier...
  unregister     Unregister a model from Xinference, removing it from...
  vllm-models    Query and display models compatible with vLLM.

Launching a model

(xinference) root@master:~# xinference launch --help
Usage: xinference launch [OPTIONS]

  Launch a model with the Xinference framework with the given parameters.

Options:
  -e, --endpoint TEXT             Xinference endpoint.
  -n, --model-name TEXT           Provide the name of the model to be
                                  launched.  [required]
  -t, --model-type TEXT           Specify type of model, LLM as default.
  -en, --model-engine TEXT        Specify the inference engine of the model
                                  when launching LLM.
  -u, --model-uid TEXT            Specify UID of model, default is None.
  -s, --size-in-billions TEXT     Specify the model size in billions of
                                  parameters.
  -f, --model-format TEXT         Specify the format of the model, e.g. pytorch, ggmlv3, etc.
  -q, --quantization TEXT         Define the quantization settings for the
                                  model.
  -r, --replica INTEGER           The replica count of the model, default is
                                  1.
  --n-gpu TEXT                    The number of GPUs used by the model,
                                  default is "auto".
  -lm, --lora-modules <TEXT TEXT>...
                                  LoRA module configurations in the format
                                  name=path. Multiple modules can be
                                  specified.
  -ld, --image-lora-load-kwargs <TEXT TEXT>...
  -fd, --image-lora-fuse-kwargs <TEXT TEXT>...
  --worker-ip TEXT                Specify which worker this model runs on by
                                  ip, for distributed situation.
  --gpu-idx TEXT                  Specify which GPUs of a worker this model
                                  can run on, separated with commas.
  --trust-remote-code BOOLEAN     Whether or not to allow for custom models
                                  defined on the Hub in their own modeling
                                  files.
  -ak, --api-key TEXT             Api-Key for access xinference api with
                                  authorization.
  --help                          Show this message and exit.

xinference launch --model-engine transformers --model-uid my-llm --model-name chatglm3 --quantization 4-bit --size-in-billions 6 --model-format pytorch

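--model-engine transformers: the inference engine for the model
--model-uid: the model's UID; if omitted, Xinference assigns one automatically
--model-name: the model name
--quantization: the quantization precision
--size-in-billions: the model size, in billions of parameters
--model-format: the model format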

(xinference) root@master:~# xinference launch --model-engine transformers --model-uid myllm --model-name chatglm3 --quantization 4-bit --size-in-billions 6 --model-format pytorch
Launch model name: chatglm3 with kwargs: {}
Model uid: myllm

Engine parameters

(xinference) root@master:~# xinference engine --help
Usage: xinference engine [OPTIONS]

  Query the applicable inference engine by model name.

Options:
  -n, --model-name TEXT           The model name you want to query.
                                  [required]
  -en, --model-engine TEXT        Specify the `model_engine` to query the
                                  corresponding combination of other
                                  parameters.
  -f, --model-format TEXT         Specify the `model_format` to query the
                                  corresponding combination of other
                                  parameters.
  -s, --model-size-in-billions TEXT
                                  Specify the `model_size_in_billions` to
                                  query the corresponding combination of other
                                  parameters.
  -q, --quantization TEXT         Specify the `quantization` to query the
                                  corresponding combination of other
                                  parameters.
  -e, --endpoint TEXT             Xinference endpoint.
  -ak, --api-key TEXT             Api-Key for access xinference api with
                                  authorization.
  --help                          Show this message and exit.

(xinference) root@master:~# xinference engine --model-name chatglm3
Name      Engine        Format      Size (in billions)  Quantization
--------  ------------  --------  --------------------  --------------
chatglm3  Transformers  pytorch                      6  4-bit
chatglm3  Transformers  pytorch                      6  8-bit
chatglm3  Transformers  pytorch                      6  none
chatglm3  vLLM          pytorch                      6  none

(xinference) root@master:~# xinference engine --model-name chatglm3 --model-engine vllm
Name      Engine    Format      Size (in billions)  Quantization
--------  --------  --------  --------------------  --------------
chatglm3  vLLM      pytorch                      6  none

(xinference) root@master:~#  xinference engine --model-name chatglm3 --model-engine transformers
Name      Engine        Format      Size (in billions)  Quantization
--------  ------------  --------  --------------------  --------------
chatglm3  Transformers  pytorch                      6  4-bit
chatglm3  Transformers  pytorch                      6  8-bit
chatglm3  Transformers  pytorch                      6  none

(xinference) root@master:~# xinference engine --model-name qwen-chat -f ggufv2
Name       Engine     Format      Size (in billions)  Quantization
---------  ---------  --------  --------------------  --------------
qwen-chat  llama.cpp  ggufv2                       7  Q4_K_M
qwen-chat  llama.cpp  ggufv2                      14  Q4_K_M
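
The table printed by `xinference engine` is easy to post-process when a script needs to choose a valid combination. The sketch below parses a copy of the first table above. It assumes data rows never contain spaces inside a field, and the helper names are illustrative.

```python
# Copy of the `xinference engine --model-name chatglm3` output shown above.
TABLE = """
Name      Engine        Format      Size (in billions)  Quantization
--------  ------------  --------  --------------------  --------------
chatglm3  Transformers  pytorch                      6  4-bit
chatglm3  Transformers  pytorch                      6  8-bit
chatglm3  Transformers  pytorch                      6  none
chatglm3  vLLM          pytorch                      6  none
"""

FIELDS = ("name", "engine", "format", "size", "quantization")


def parse_engine_table(text):
    """Parse the whitespace-aligned table into a list of dicts."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    rows = []
    for line in lines[2:]:  # skip the header row and the dashed rule
        rows.append(dict(zip(FIELDS, line.split())))
    return rows


def quantizations(rows, engine):
    """List the quantization options offered by one engine."""
    return [r["quantization"] for r in rows if r["engine"].lower() == engine.lower()]


if __name__ == "__main__":
    rows = parse_engine_table(TABLE)
    print(quantizations(rows, "transformers"))
```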

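Other operations

# List all registered models of type LLM
xinference registrations -t LLM

# List all running models
xinference list

# Terminate a running model by its UID
xinference terminate --model-uid "my-llm"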

Integrating LoRA

Integrating LoRA at launch

xinference launch <options> \
  --lora-modules <lora_name1> <lora_model_path1> \
  --lora-modules <lora_name2> <lora_model_path2> \
  --image-lora-load-kwargs <load_params1> <load_value1> \
  --image-lora-load-kwargs <load_params2> <load_value2> \
  --image-lora-fuse-kwargs <fuse_params1> <fuse_value1> \
  --image-lora-fuse-kwargs <fuse_params2> <fuse_value2>

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

# Each LoRA entry pairs a name with a local path; replace the placeholders.
lora_model1 = {"lora_name": "<lora_name1>", "local_path": "<lora_model_path1>"}
lora_model2 = {"lora_name": "<lora_name2>", "local_path": "<lora_model_path2>"}
lora_models = [lora_model1, lora_model2]
# Extra load and fuse parameters apply to image models only.
image_lora_load_kwargs = {"<load_params1>": "<load_value1>", "<load_params2>": "<load_value2>"}
image_lora_fuse_kwargs = {"<fuse_params1>": "<fuse_value1>", "<fuse_params2>": "<fuse_value2>"}

peft_model_config = {
    "image_lora_load_kwargs": image_lora_load_kwargs,
    "image_lora_fuse_kwargs": image_lora_fuse_kwargs,
    "lora_list": lora_models,
}

client.launch_model(
    <other_options>,
    peft_model_config=peft_model_config
)
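
When several LoRA adapters are involved, building this configuration by hand is error-prone. The sketch below derives both the `peft_model_config` dictionary and the equivalent `--lora-modules` arguments from one name-to-path mapping. The key names come from the example above; the helper names and the example paths are hypothetical.

```python
def build_peft_model_config(loras, load_kwargs=None, fuse_kwargs=None):
    """Build a peft_model_config dict from a {lora_name: local_path} mapping."""
    if not loras:
        raise ValueError("at least one LoRA is required")
    config = {
        "lora_list": [
            {"lora_name": name, "local_path": path}
            for name, path in loras.items()
        ]
    }
    # The image-model options are added only when provided.
    if load_kwargs is not None:
        config["image_lora_load_kwargs"] = dict(load_kwargs)
    if fuse_kwargs is not None:
        config["image_lora_fuse_kwargs"] = dict(fuse_kwargs)
    return config


def lora_cli_args(loras):
    """Build the repeated `--lora-modules <name> <path>` arguments."""
    args = []
    for name, path in loras.items():
        args += ["--lora-modules", name, path]
    return args


if __name__ == "__main__":
    loras = {"style": "/models/lora/style"}  # hypothetical name and path
    print(build_peft_model_config(loras))
    print(lora_cli_args(loras))
```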

Applying LoRA at inference time

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<model_uid>")
model.chat(
    "<prompt>",
    <other_options>,
    generate_config={"lora_name": "<your_lora_name>"}
)
