Qwen3-VL QLoRA Fine-Tuning and Deployment with Llama-Factory: An End-to-End Walkthrough | 极客日志


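An end-to-end walkthrough of QLoRA fine-tuning and deployment for Qwen3-VL with Llama-Factory. It covers environment setup, training on the Open-EQA dataset, evaluation metric analysis, model merging, and two deployment options, Ollama and LMDeploy. The focus is on hands-on details for a T4 GPU with 16 GB of memory, including Unsloth acceleration, TensorBoard monitoring, and inference optimization on the PyTorch backend, giving a complete practical reference for multimodal embodied-AI tasks.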

Author: 山野来信 · Published 2026/4/7 · Updated 2026/6/4 · 19 views


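This walkthrough uses the embodied-AI dataset Open-EQA to show how to train Qwen3-VL-2B-Instruct with QLoRA and nested (double) quantization, then evaluate, export, and deploy it with Ollama or LMDeploy, all on an NVIDIA Tesla T4 with 16 GB of memory. Each sample contains eight images; after preprocessing, the data is split into a train-validation set and a test set.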

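1. Environment setup and training configuration

If you have a CUDA GPU, installing Unsloth is recommended to speed up training and inference. Also install TensorBoard so that the whole training run is logged; otherwise an interrupted run leaves you with only part of the curves: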

pip install unsloth tensorboard

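Create training_args.yaml under saves/Qwen3-VL-2B-Instruct/qlora/train_openeqa. The key is to choose the quantization parameters and the memory-usage strategy carefully. On a single-GPU setup such as the T4, double quantization (double_quantization) helps save GPU memory.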

### model
model_name_or_path: model/Qwen3-VL-2B-Instruct
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
lora_rank: 8
lora_alpha: 16
lora_dropout: 0.1
use_unsloth: false
flash_attn: auto

### quantization (QLoRA)
quantization_bit: 4
quantization_method: bitsandbytes
double_quantization: true

### dataset
dataset: open_eqa_train_val
template: qwen3_vl_nothink
cutoff_len: 2048
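To put a rough number on what double_quantization buys, here is a back-of-the-envelope sketch. It is not taken from the article: it assumes the block sizes described in the QLoRA paper (64 weights per block, 256 first-level constants per second-level block) and treats all of the roughly 2B parameters as quantized, so the result is an upper bound rather than a measurement.

```python
def quant_constant_bits_per_param(block_size: int = 64,
                                  double_quant: bool = False,
                                  second_block_size: int = 256) -> float:
    """Bits of quantization-constant overhead per model parameter."""
    if not double_quant:
        # One FP32 absmax constant is stored per block of weights.
        return 32 / block_size
    # With double quantization the per-block constants are stored in 8 bits,
    # plus one FP32 constant per group of `second_block_size` constants.
    return 8 / block_size + 32 / (block_size * second_block_size)


def overhead_megabytes(num_params: float, bits_per_param: float) -> float:
    """Convert a per-parameter bit overhead into megabytes."""
    return num_params * bits_per_param / 8 / 1e6


NUM_PARAMS = 2e9  # assumption: about 2B parameters, all of them quantized

plain = quant_constant_bits_per_param(double_quant=False)
nested = quant_constant_bits_per_param(double_quant=True)
saved_mb = overhead_megabytes(NUM_PARAMS, plain - nested)
print(f"{plain:.3f} bits/param -> {nested:.3f} bits/param, saves about {saved_mb:.0f} MB")
```

Under these assumptions the overhead drops from 0.500 to 0.127 bits per parameter, roughly 93 MB for a 2B model. That is modest in absolute terms, but it costs nothing in configuration effort on a 16 GB card.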


Then start training with this configuration:

llamafactory-cli train saves/Qwen3-VL-2B-Instruct/qlora/train_openeqa/training_args.yaml
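Besides TensorBoard, the loss curve can be inspected from a plain log file. The sketch below assumes the training output directory contains a JSON-lines log whose records carry `current_steps` and `loss` fields. LLaMA-Factory's trainer_log.jsonl is believed to have this shape, but the file name and field names are assumptions here, so check them against your installed version.

```python
import json
from typing import Iterable


def parse_loss_curve(lines: Iterable[str]) -> list[tuple[int, float]]:
    """Return (step, loss) pairs from JSON-lines log records."""
    curve = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Skip records without a training loss (for example eval-only entries).
        # The field names are assumptions, see the note above.
        if record.get("loss") is None or "current_steps" not in record:
            continue
        curve.append((int(record["current_steps"]), float(record["loss"])))
    return curve


def moving_average(values: list[float], window: int = 5) -> list[float]:
    """Trailing moving average; early points average over what is available."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

A typical use is to open the log file, pass the file object to `parse_loss_curve`, and smooth the loss column with `moving_average` before plotting or printing it.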

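2. Evaluation on the test set

Create eval_args.yaml under saves/Qwen3-VL-2B-Instruct/qlora/eval_openeqa. It loads the trained adapter and generates predictions on the test split:

adapter_name_or_path: saves/Qwen3-VL-2B-Instruct/qlora/train_openeqa/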
cutoff_len: 2048
dataset_dir: data
ddp_timeout: 180000000
do_predict: true
eval_dataset: open_eqa_test
finetuning_type: lora
flash_attn: auto
max_new_tokens: 128
max_samples: 99999
model_name_or_path: model/Qwen3-VL-2B-Instruct
output_dir: saves/Qwen3-VL-2B-Instruct/qlora/eval_openeqa
per_device_eval_batch_size: 2
predict_with_generate: true
preprocessing_num_workers: 4
report_to: none
stage: sft
temperature: 0.2
template: qwen3_vl_nothink
top_p: 1.0
trust_remote_code: true

Prediction uses the same train entry point, driven by do_predict in the config:

llamafactory-cli train saves/Qwen3-VL-2B-Instruct/qlora/eval_openeqa/eval_args.yaml
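For a quick sanity check of the predictions, a normalized exact-match score can be computed offline. This is a minimal sketch and a crude proxy, not the metric used in the article. It assumes the prediction run writes a JSON-lines file (commonly generated_predictions.jsonl in the output directory) in which each record has `predict` and `label` fields; treat both the file name and the field names as assumptions to verify.

```python
import json
import string
from pathlib import Path


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(records: list[dict]) -> float:
    """Fraction of records whose normalized prediction equals the label."""
    if not records:
        return 0.0
    hits = sum(normalize(r["predict"]) == normalize(r["label"]) for r in records)
    return hits / len(records)


def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]
```

Calling `exact_match(load_jsonl(path))` on the predictions file gives a single number between 0 and 1. Open-ended answers that are correct but phrased differently count as misses, so read the score as a lower bound.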

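3. Exporting the merged model

Create merge_openeqa.yaml under saves/Qwen3-VL-2B-Instruct/qlora/merge to merge the LoRA adapter into the base model:

### model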
model_name_or_path: model/Qwen3-VL-2B-Instruct
adapter_name_or_path: saves/Qwen3-VL-2B-Instruct/qlora/train_openeqa/
template: qwen3_vl_nothink
finetuning_type: lora
trust_remote_code: true

### export
export_dir: saves/Qwen3-VL-2B-Instruct/qlora/merge
export_size: 2
export_device: auto
export_legacy_format: false

Run the export:

llamafactory-cli export saves/Qwen3-VL-2B-Instruct/qlora/merge/merge_openeqa.yaml
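Conceptually, merging folds each low-rank update into its base weight matrix: W' = W + (lora_alpha / lora_rank) · B · A. The toy sketch below illustrates that arithmetic with plain Python lists. It is not the exporter's actual code, and the tiny matrices are made up for the example; with lora_rank 8 and lora_alpha 16 from the training config, the scaling factor is 2.

```python
def matmul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, rank: int, alpha: float):
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    weight: out x in, lora_a: rank x in, lora_b: out x rank.
    """
    scaling = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out x rank) @ (rank x in) = out x in
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: rank 1 and alpha 2 give the same scaling (2.0) as rank 8, alpha 16.
base = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]    # rank x in
b = [[1.0], [0.0]]  # out x rank
merged = merge_lora(base, a, b, rank=1, alpha=2.0)
print(merged)
```

After merging, the adapter is no longer needed at inference time, which is why the deployment steps point at the merge directory only.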

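4. Deploying an inference API service

(1) Ollama deployment

Write a Modelfile in the merged model directory (FROM . refers to the current directory):

FROM .
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ range .Messages }}{{ if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
<|im_start|>assistant
{{ else if eq .Role "assistant" }}{{ .Content }}<|im_end|>
{{ end }}{{ end }}"""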
PARAMETER temperature 0.7
PARAMETER stop "<|im_end|>"
PARAMETER num_ctx 4096

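Register the model with Ollama, then query it with a question and an image path:

ollama create qwen3-vl-2b -f Modelfile
ollama run qwen3-vl-2b "your question" /path/to/image.jpg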
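The same model can also be reached programmatically through Ollama's local REST API. The sketch below is an illustration rather than part of the original article. It assumes Ollama is serving on its default port 11434 and uses only the standard library; the helper names `build_payload` and `ask_ollama` are hypothetical.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes the default local endpoint


def build_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build a non-streaming generate request with one base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ask_ollama(model: str, prompt: str, image_path: str, url: str = OLLAMA_URL) -> str:
    """Send the request to a running Ollama server and return the answer text."""
    with open(image_path, "rb") as f:
        payload = build_payload(model, prompt, f.read())
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `ask_ollama("qwen3-vl-2b", "Describe this image", "/path/to/image.jpg")` should return the generated text as a string.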

(2) LMDeploy deployment

Install LMDeploy:

pip install --no-cache-dir lmdeploy

# Single-image inference with the merged model on LMDeploy's PyTorch backend.
from lmdeploy import pipeline, PytorchEngineConfig, GenerationConfig
from lmdeploy.vl import load_image
import time

MODEL_PATH = "/workspace/LlamaFactory/saves/Qwen3-VL-2B-Instruct/qlora/merge"
IMAGE_PATH = "/workspace/LlamaFactory/data/open_eqa_frames/0a0c0f2b9ba65d1b/000.jpg"

print("🚀 Loading Qwen3-VL with the LMDeploy PyTorch backend...")
engine_config = PytorchEngineConfig(
    tp=1,                       # single GPU, no tensor parallelism
    session_len=4096,           # maximum context length per session
    max_batch_size=4,
    cache_max_entry_count=0.6,  # share of GPU memory given to the KV cache
    eager_mode=True             # run eagerly, without CUDA graph capture
)

if __name__ == '__main__':
    pipe = pipeline(MODEL_PATH, backend_config=engine_config)
    print("✅ Model loaded.")
    image = load_image(IMAGE_PATH)
    # Each prompt is a (text, image) pair; a list of prompts returns a list of responses.
    prompts = [("Describe this image", image)]
    start = time.time()
    response = pipe(prompts, gen_config=GenerationConfig(max_new_tokens=256, temperature=0.7))
    latency = time.time() - start
    print(f"⏱️ Latency: {latency:.2f} s")
    print(f"📝 Output: {response[0].text}")

To expose the model as an API service, start the server in the background and send its output to a log file:

nohup lmdeploy serve api_server /workspace/LlamaFactory/saves/Qwen3-VL-2B-Instruct/qlora/merge \
--model-name qwen3-vl --backend pytorch --tp 1 \
--session-len 4096 --cache-max-entry-count 0.6 \
--max-batch-size 4 --eager-mode --server-port 23333 > api_server.log 2>&1 &
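Once the server is up, it can be queried over HTTP. The sketch below assumes the server exposes an OpenAI-style /v1/chat/completions route on port 23333 and accepts images as base64 data URLs; both are assumptions to verify against your LMDeploy version. The helper names `build_chat_request` and `ask_server` are hypothetical.

```python
import base64
import json
import urllib.request

API_URL = "http://localhost:23333/v1/chat/completions"  # port from the serve command


def build_chat_request(model: str, question: str, image_bytes: bytes,
                       mime: str = "image/jpeg") -> dict:
    """Build an OpenAI-style chat request with one text part and one image part."""
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
        "max_tokens": 256,
        "temperature": 0.7,
    }


def ask_server(question: str, image_path: str, model: str = "qwen3-vl",
               url: str = API_URL) -> str:
    """Post the request to a running server and return the first answer."""
    with open(image_path, "rb") as f:
        payload = build_chat_request(model, question, f.read())
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```

The model name must match the value passed to --model-name (qwen3-vl here). Sending the image inline as a data URL avoids having to make the file reachable from the server by path or URL.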
