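QLoRA Fine-Tuning and Deployment of Qwen3-VL with Llama-Factory: End-to-End Workflow (Open-EQA Example)

This article uses the embodied-AI dataset Open-EQA to walk through the full workflow for the multimodal model Qwen3-VL in the Llama-Factory framework: training with QLoRA and nested (double) quantization, evaluation, export, and deployment. The data has been preprocessed so that each sample contains eight images, and it is split into a train/validation set and a test set.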

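1. Fine-Tuning

Start by setting up the environment. With a CUDA GPU you can install Unsloth to speed up training and inference. Installing tensorboard is also recommended, so that the full training curves are recorded and can still be reviewed if the run is interrupted.

Create the config file saves/Qwen3-VL-2B-Instruct/qlora/train_openeqa/training_args.yaml with the content below. Adjust the base model path to match your own setup: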

### model
model_name_or_path: model/Qwen3-VL-2B-Instruct
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
lora_rank: 8
lora_alpha: 16
lora_dropout: 0.1

### whether to use unsloth acceleration
use_unsloth: false
#unsloth_max_seq_length: 2048
flash_attn: auto

### quantization (QLoRA)
quantization_bit: 4
quantization_method: bitsandbytes
double_quantization: true

### dataset
dataset: open_eqa_train_val
template: qwen3_vl_nothink
cutoff_len: 2048
max_samples: 100000
overwrite_cache: true
preprocessing_num_workers: 8

### output
output_dir: saves/Qwen3-VL-2B-Instruct/qlora/train_openeqa
logging_steps: 10
save_steps: 25
resume_from_checkpoint: false
overwrite_output_dir: true

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 5.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.05
fp16: true
ddp_timeout: 180000000

### optimization
optim: adamw_torch
report_to: tensorboard
plot_loss: true
video_max_pixels: 65536
video_min_pixels: 256
freeze_multi_modal_projector: true
freeze_vision_tower: true
image_max_pixels: 589824
image_min_pixels: 1024

### evaluation
do_eval: true
per_device_eval_batch_size: 2
val_size: 0.125
eval_strategy: steps
eval_steps: 25
eval_delay: 0
prediction_loss_only: true

### save & eval coupling
# load_best_model_at_end needs save_steps to be a multiple of eval_steps (both are 25 here)
load_best_model_at_end: true
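
Two numbers implied by the config above are easy to check by hand: the effective batch size and the approximate image resolution behind each pixel budget. The sketch below assumes a single GPU (as in this experiment) and square images purely for illustration; the helper names are hypothetical and not part of Llama-Factory.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of samples consumed per optimizer step."""
    return per_device * grad_accum * num_devices


def square_side(pixels: int) -> int:
    """Side length of a square image that fits the given pixel budget."""
    return math.isqrt(pixels)


# Values copied from training_args.yaml; one GPU is assumed.
print("effective batch size:", effective_batch_size(per_device=2, grad_accum=4))

pixel_budgets = {
    "image_max_pixels": 589824,
    "image_min_pixels": 1024,
    "video_max_pixels": 65536,
    "video_min_pixels": 256,
}
for name, pixels in pixel_budgets.items():
    side = square_side(pixels)
    print(f"{name}: {pixels} px is about {side} x {side}")
```

With these settings each optimizer step consumes 8 samples, and image_max_pixels corresponds to an image of roughly 768 x 768 pixels.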

The experiment ran on an NVIDIA Tesla T4 with 16 GB of memory. Start training with:

llamafactory-cli train saves/Qwen3-VL-2B-Instruct/qlora/train_openeqa/training_args.yaml

If training is interrupted unexpectedly, set resume_from_checkpoint to true in the config and rerun the same command to resume:

llamafactory-cli train saves/Qwen3-VL-2B-Instruct/qlora/train_openeqa/training_args.yaml
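
Hugging Face Trainer, which Llama-Factory builds on, saves checkpoints as checkpoint-<step> subdirectories of output_dir (every 25 steps with the config above). In Hugging Face TrainingArguments, resume_from_checkpoint is declared as a path string, so if the boolean form does not behave as expected in your version, passing an explicit checkpoint path is an alternative. The sketch below locates the most recent checkpoint; latest_checkpoint is a hypothetical helper, not part of Llama-Factory.

```python
import re
from pathlib import Path
from typing import Optional

# Matches directory names such as "checkpoint-125".
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare steps numerically so that checkpoint-100 beats checkpoint-25.
    return max(candidates, key=lambda item: item[0])[1]


print(latest_checkpoint("saves/Qwen3-VL-2B-Instruct/qlora/train_openeqa"))
```

The printed path can be used as the value of resume_from_checkpoint in the YAML file.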

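After training finishes, review the loss curves and metrics. The results for Qwen3-VL-2B-Instruct on Open-EQA are summarized below:

Training

Loss: starts high (about 5.5), drops quickly over the first 100 steps, and settles in the 1.0~1.5 range after step 200.
Final metric: the training loss converges to 1.3233, a good fit with little fluctuation.
Cost: about 2 hours 27 minutes in total and 48049026 GFLOPs of compute, which reflects how expensive multimodal training is.

Validation

Loss: the validation loss starts at about 1.8, flattens after step 100, and settles in the 1.25~1.3 range.
Final metric: the validation loss is 1.2683, slightly below the training loss, which suggests reasonable generalization with no obvious overfitting.
Cost: 173 samples take about 2 minutes 10 seconds at batch size 2.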
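
These figures can also be pulled out programmatically. Hugging Face Trainer records its logs in a trainer_state.json file under a log_history key. The sketch below assumes that file is present in output_dir (it is also written into each checkpoint subdirectory) and separates training and validation losses. The helper names are hypothetical.

```python
import json
from pathlib import Path


def split_losses(log_history):
    """Split Trainer log entries into (step, loss) lists for train and eval."""
    train = [(e["step"], e["loss"]) for e in log_history if "loss" in e]
    evals = [(e["step"], e["eval_loss"]) for e in log_history if "eval_loss" in e]
    return train, evals


def best_eval(evals):
    """Return the (step, eval_loss) pair with the lowest loss, or None."""
    return min(evals, key=lambda pair: pair[1]) if evals else None


# Assumed location; adjust to your own output_dir.
state_file = Path("saves/Qwen3-VL-2B-Instruct/qlora/train_openeqa/trainer_state.json")
if state_file.is_file():
    history = json.loads(state_file.read_text(encoding="utf-8"))["log_history"]
    train_points, eval_points = split_losses(history)
    print("last train loss:", train_points[-1] if train_points else None)
    print("best eval loss:", best_eval(eval_points))
```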

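Overall, the training and validation losses follow the same trend, and the model behaves stably on the Open-EQA task. If the TensorBoard logs on the cloud machine are incomplete, download the runs directory and inspect it locally. Install tensorboard on the local machine with: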

uv pip install tensorboard -i https://mirrors.aliyun.com/pypi/simple/
# or
pip install tensorboard -i https://mirrors.aliyun.com/pypi/simple/

Start the server, pointing --logdir at the downloaded runs directory:

tensorboard --logdir=/path/to/logs --port 6006

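2. Evaluation on the Test Set

Create the evaluation config saves/Qwen3-VL-2B-Instruct/qlora/eval_openeqa/eval_args.yaml. Remember to adjust the adapter path, the base model path, and the output directory: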

adapter_name_or_path: saves/Qwen3-VL-2B-Instruct/qlora/train_openeqa/
cutoff_len: 2048
dataset_dir: data
ddp_timeout: 180000000
do_predict: true
eval_dataset: open_eqa_test
finetuning_type: lora
flash_attn: auto
max_new_tokens: 128
max_samples: 99999
model_name_or_path: model/Qwen3-VL-2B-Instruct
output_dir: saves/Qwen3-VL-2B-Instruct/qlora/eval_openeqa
per_device_eval_batch_size: 2
predict_with_generate: true
preprocessing_num_workers: 4
report_to: none
stage: sft
temperature: 0.2
template: qwen3_vl_nothink
top_p: 1.0
trust_remote_code: true

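Run the evaluation. Prediction also goes through the train subcommand; it is switched on by do_predict: true in the config: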

llamafactory-cli train saves/Qwen3-VL-2B-Instruct/qlora/eval_openeqa/eval_args.yaml

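The model completed inference over the full test set on the 16 GB Tesla T4: 258 test samples, each with 8 associated images, at batch size 2, with no out-of-memory errors.

On the generation metrics, BLEU-4 is 29.4966 and ROUGE-1/2/L are 36.2965/7.9106/35.7659. ROUGE-1 and ROUGE-L are comparatively good, which suggests the model captures the key content. The low ROUGE-2 means that consecutive word pairs often differ from the reference, so fine-grained phrasing still has room to improve. Inference took about 2 hours 55 minutes, with the main bottleneck being multimodal feature extraction for 8 images per sample. Overall, the run shows that the QLoRA setup is workable on a low-end GPU.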
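
To make the gap between ROUGE-1 and ROUGE-2 concrete, here is a simplified n-gram overlap F1 in plain Python. It splits on whitespace and is only an illustration of the idea; the scorer used during evaluation may tokenize and aggregate differently, and the example sentences are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(prediction: str, reference: str, n: int) -> float:
    """F1 of clipped n-gram overlap between prediction and reference."""
    pred = ngrams(prediction.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((pred & ref).values())  # clipped counts of shared n-grams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical answer pair: one extra word breaks two of the bigrams.
prediction = "the sofa is dark blue"
reference = "the sofa is blue"
print(f"ROUGE-1 F1: {rouge_n_f1(prediction, reference, 1):.4f}")
print(f"ROUGE-2 F1: {rouge_n_f1(prediction, reference, 2):.4f}")
```

A single inserted word leaves every reference unigram matched but removes shared bigrams, which is why ROUGE-2 falls much faster than ROUGE-1 when the wording drifts.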

3. Merging and Exporting the Model

Merge the LoRA weights into the base model. Create saves/Qwen3-VL-2B-Instruct/qlora/merge/merge_openeqa.yaml:

### model
model_name_or_path: model/Qwen3-VL-2B-Instruct
adapter_name_or_path: saves/Qwen3-VL-2B-Instruct/qlora/train_openeqa/
template: qwen3_vl_nothink
finetuning_type: lora
trust_remote_code: true

### export
export_dir: saves/Qwen3-VL-2B-Instruct/qlora/merge
export_size: 2
export_device: auto
export_legacy_format: false

Run the export:

llamafactory-cli export saves/Qwen3-VL-2B-Instruct/qlora/merge/merge_openeqa.yaml
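
After the export, a quick sanity check of the output directory can catch an incomplete merge. The sketch below assumes the usual Hugging Face layout, a config.json plus one or more *.safetensors shards (export_legacy_format: false selects safetensors, and export_size: 2 caps each shard at about 2 GB). summarize_export is a hypothetical helper.

```python
from pathlib import Path


def summarize_export(export_dir: str) -> dict:
    """Report whether config.json exists and how large the weight shards are."""
    root = Path(export_dir)
    if not root.is_dir():
        return {"has_config": False, "num_shards": 0, "total_bytes": 0}
    shards = sorted(root.glob("*.safetensors"))
    return {
        "has_config": (root / "config.json").is_file(),
        "num_shards": len(shards),
        "total_bytes": sum(p.stat().st_size for p in shards),
    }


summary = summarize_export("saves/Qwen3-VL-2B-Instruct/qlora/merge")
print(summary)
print(f"weights: {summary['total_bytes'] / 1024**3:.2f} GiB")
```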

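4. Inference Deployment as an API Service

(1) Ollama

Place the merged model files in a local directory (for example /saves/Qwen3-VL-2B-Instruct/qlora/merge) and create a Modelfile there. Adjust the parameters as needed: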

FROM .
TEMPLATE """{{ if .System }}<|im_start|>system {{ .System }}<|im_end|> {{ end }}{{ range .Messages }}{{ if eq .Role "user" }}<|im_start|>user {{ .Content }}<|im_end|> <|im_start|>assistant {{ else if eq .Role "assistant" }}{{ .Content }}<|im_end|> {{ end }}{{ end }}"""
PARAMETER temperature 0.7
PARAMETER stop "<|im_end|>"
PARAMETER num_ctx 4096

Create the model from a terminal. The model name is up to you:

ollama create qwen3-vl-2b -f Modelfile

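List the models:

ollama list

Ways to call the model

Passing the prompt on the command line

ollama run qwen3-vl-2b "What is on the wall?" ./data/open_eqa_frames/0a0c0f2b9ba65d1b/000.jpg

Interactive mode

Run ollama run qwen3-vl-2b first, then enter the question and the image path.

Calling with curl

First convert the image to base64: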

IMG=$(base64 -i data/open_eqa_frames/0a0c0f2b9ba65d1b/000.jpg | tr -d '\n')

Send the request:

curl http://localhost:11434/api/generate -d '{
  "model": "qwen3-vl-2b",
  "system": "You are a robot control AI...",
  "prompt": "Look at the image...",
  "images": ["'$IMG'"],
  "format": "json",
  "stream": false,
  "options": {"temperature": 0.01, "num_predict": 300}
}'
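
The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; the function names are hypothetical, the "format": "json" option is left out for brevity, and the server is assumed to be listening on the default localhost:11434.

```python
import base64
import json
import urllib.request


def build_generate_payload(model, prompt, image_bytes, system=None):
    """Build the JSON body for Ollama's /api/generate with one image."""
    payload = {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
        "options": {"temperature": 0.01, "num_predict": 300},
    }
    if system:
        payload["system"] = system
    return payload


def generate(payload, url="http://localhost:11434/api/generate"):
    """POST the payload and return the generated text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

A call would look like generate(build_generate_payload("qwen3-vl-2b", "What is on the wall?", open(image_path, "rb").read())), where image_path points at one of the extracted frames.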

(2) LMDeploy

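Activate your environment and install the library:

pip install --no-cache-dir lmdeploy

Write a test script test_offline.py. Note: on the T4 this setup has to use the PyTorch backend, because TurboMind does not support the Qwen3-VL architecture.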

from lmdeploy import pipeline, PytorchEngineConfig, GenerationConfig
from lmdeploy.vl import load_image
import time

MODEL_PATH = "/workspace/LlamaFactory/saves/Qwen3-VL-2B-Instruct/qlora/merge"
IMAGE_PATH = "/workspace/LlamaFactory/data/open_eqa_frames/0a0c0f2b9ba65d1b/000.jpg"

# PyTorch backend settings for a single T4
engine_config = PytorchEngineConfig(
    tp=1,                       # one GPU, no tensor parallelism
    session_len=4096,           # maximum tokens per session
    max_batch_size=4,
    cache_max_entry_count=0.6,  # share of GPU memory for the KV cache
    eager_mode=True             # skip CUDA Graph capture
)

if __name__ == '__main__':
    print("🚀 Loading Qwen3-VL with the LMDeploy PyTorch backend...")
    pipe = pipeline(MODEL_PATH, backend_config=engine_config)
    print("✅ Model loaded!")

    # Single-image test
    image = load_image(IMAGE_PATH)
    prompts = [("Describe this image", image)]
    start = time.time()
    response = pipe(prompts, gen_config=GenerationConfig(max_new_tokens=256, temperature=0.7))
    latency = time.time() - start
    print(f"⏱️ Latency: {latency:.2f} s")
    print(f"📝 Output: {response[0].text}")

    # Batch test: four different questions about the same image
    prompts_batch = [
        ("Describe this image", image),
        ("How many people are in the picture?", image),
        ("What kind of scene is this?", image),
        ("What is the dominant color of the image?", image),
    ]
    start = time.time()
    responses = pipe(prompts_batch, gen_config=GenerationConfig(max_new_tokens=128))
    batch_latency = time.time() - start
    print(f"⏱️ Total batch latency: {batch_latency:.2f} s")
    print(f"⚡ Average per request: {batch_latency / len(prompts_batch):.2f} s")

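Result analysis

Functionality: single-image inference describes details accurately (such as the sofa color and the text on a wall painting), and all 4 different questions in the batch are answered correctly, with no errors or OOM.
Performance: single-image latency is about 9.63 s, in line with expectations for a T4 with the PyTorch backend. The batch of 4 requests takes 31.42 s in total, about 7.86 s per request, which is 18.4% lower than the single-request latency and roughly a 1.2x throughput gain, consistent with continuous batching taking effect. Note that the single run allows up to 256 new tokens while the batch run is capped at 128, so the comparison is indicative rather than exact.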
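
The percentages above follow directly from the two measured latencies. A short sketch of the arithmetic, using the figures reported in this section (batching_gain is a hypothetical helper name):

```python
def batching_gain(single_latency_s: float, batch_latency_s: float, batch_size: int):
    """Return (seconds per request, relative latency reduction, throughput speedup)."""
    per_request = batch_latency_s / batch_size
    reduction = 1 - per_request / single_latency_s
    speedup = single_latency_s / per_request
    return per_request, reduction, speedup


# Measured values from the run above: 9.63 s single, 31.42 s for a batch of 4.
per_request, reduction, speedup = batching_gain(9.63, 31.42, 4)
print(f"per request: {per_request:.3f} s")
print(f"latency reduction: {reduction:.1%}")
print(f"throughput speedup: {speedup:.2f}x")
```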

Running as a background service

nohup lmdeploy serve api_server /workspace/LlamaFactory/saves/Qwen3-VL-2B-Instruct/qlora/merge \
--model-name qwen3-vl --backend pytorch --tp 1 \
--session-len 4096 --cache-max-entry-count 0.6 \
--max-batch-size 4 --eager-mode --server-port 23333 > api_server.log 2>&1 &

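Key parameters:

--backend pytorch: required; TurboMind does not support Qwen3-VL.
--cache-max-entry-count 0.6: the main tuning knob; it sets aside up to 9.6 GB (0.6 × 16 GB) for the KV cache. In recent LMDeploy versions the ratio is applied to the GPU memory that is still free after the model loads, so the actual reservation is somewhat smaller.
--eager-mode: required; the T4 architecture is older, so CUDA Graph is disabled to avoid errors.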
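
The KV cache figure is simple arithmetic on the cache ratio. The sketch below shows both readings: the ratio applied to the whole 16 GB card, and the ratio applied to what is left after the weights are loaded. The 4 GB weight footprint is a hypothetical number chosen for illustration, not a measurement, and kv_cache_budget_gb is a hypothetical helper.

```python
def kv_cache_budget_gb(total_gb: float, ratio: float, already_used_gb: float = 0.0) -> float:
    """KV cache budget when `ratio` applies to memory left after `already_used_gb`."""
    return ratio * (total_gb - already_used_gb)


# Ratio applied to the full 16 GB card.
print(f"{kv_cache_budget_gb(16, 0.6):.1f} GB")
# Ratio applied to free memory, assuming (hypothetically) 4 GB taken by the weights.
print(f"{kv_cache_budget_gb(16, 0.6, already_used_gb=4):.1f} GB")
```

The first line prints 9.6 GB and the second 7.2 GB, which gives a feel for how much the two interpretations can differ.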

Check the log and the process:

tail -f api_server.log
ps aux | grep "lmdeploy serve api_server"

Stop the service, replacing 12684 with the PID that ps reports on your machine:

kill 12684

API test

BASE64_IMG=$(base64 -w 0 /workspace/LlamaFactory/data/open_eqa_frames/0a0c0f2b9ba65d1b/000.jpg)
curl -X POST http://localhost:23333/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{ \"model\": \"qwen3-vl\", \"messages\": [{ \"role\": \"user\", \"content\": [{\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,${BASE64_IMG}\"}}, {\"type\": \"text\", \"text\": \"Describe this image\"}]}], \"max_tokens\": 256, \"temperature\": 0.7 }"
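
Escaping JSON inside a shell string gets awkward quickly, so a Python client is often easier to maintain. The sketch below builds the same OpenAI-style chat request with the standard library only; the function names are hypothetical, and the server is assumed to be running on localhost:23333 with the model name qwen3-vl as configured above.

```python
import base64
import json
import urllib.request


def build_chat_payload(model, question, image_bytes, max_tokens=256, temperature=0.7):
    """Build a chat completion body with one JPEG image and one text question."""
    data_url = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": question},
            ],
        }],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def chat(payload, base_url="http://localhost:23333"):
    """POST to /v1/chat/completions and return the assistant's reply text."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read())
    return body["choices"][0]["message"]["content"]
```

A call would look like chat(build_chat_payload("qwen3-vl", "Describe this image", open(image_path, "rb").read())), where image_path is the frame used in the curl example.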
