Guide to Fine-Tuning and Deploying the Qwen-VL Multimodal Model on Custom Data | 极客日志

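Qwen-VL is a multimodal vision-language model developed by Alibaba Cloud that accepts images, text, and bounding boxes as input and produces text and bounding boxes as output. This guide walks through the full workflow of fine-tuning and deploying Qwen-VL on a custom dataset: hardware requirements, software environment setup, model download, data format preparation, LoRA and Q-LoRA fine-tuning, model merging, and inference with the fine-tuned model. By the end you should be able to apply a PEFT-based, parameter-efficient fine-tuning workflow suitable for plugging the model into downstream tasks.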

abccba, published 2025/2/7, updated 2026/6/1529 views

Qwen-VL is a Large Vision Language Model (LVLM) developed by Alibaba Cloud. It takes images, text, and bounding boxes as input, and produces text and bounding boxes as output.

Figure: Qwen-VL architecture diagram

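Qwen-VL-Chat = large language model (Qwen-7B) + visual image feature encoder (Openclip ViT-bigG) + position-aware vision-language adapter (a trainable adapter) + 1.5B image-text data + multiple rounds of training + alignment (Chat)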

Figure: Qwen-VL structure diagram

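Features of the Qwen-VL model series

Multilingual dialogue: natively supports conversation in English, Chinese, and other languages, with end-to-end recognition of long bilingual (Chinese and English) text in images.
Interleaved multi-image dialogue: supports multi-image input and comparison, question answering about a specified image, and creative writing across multiple images.
Open-domain grounding: marks bounding boxes from open-domain expressions written in Chinese.
Fine-grained recognition and understanding: the 448 resolution improves fine-grained text recognition, document question answering, and bounding-box annotation.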

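Hardware and deployment requirements

GPU memory usage and speed during fine-tuning are listed below (batch size 1). Adjust the sequence length to fit the GPU memory you have available.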

Method	Speed (512 Sequence Length)	Memory (512 Sequence Length)
LoRA (Base)	2.4s/it	37.3GB
LoRA (Chat)	2.3s/it	23.6GB
Q-LoRA	4.5s/it	17.2GB
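
As a rough planning aid, the table can be turned into a lookup that lists which fine-tuning methods fit a given card. This is only a sketch: the memory figures are copied from the table above, while `headroom_gb` is a hypothetical safety margin that does not come from the guide, and real usage will vary with sequence length.

```python
# Memory figures copied from the table above (batch size 1, sequence length 512).
FINETUNE_MEMORY_GB = {
    "LoRA (Base)": 37.3,
    "LoRA (Chat)": 23.6,
    "Q-LoRA": 17.2,
}


def methods_that_fit(free_gb, headroom_gb=2.0):
    """Return the methods whose measured footprint fits, largest first.

    headroom_gb is a hypothetical safety margin, not a value from the guide.
    """
    fitting = [
        (name, need)
        for name, need in FINETUNE_MEMORY_GB.items()
        if need + headroom_gb <= free_gb
    ]
    return [name for name, _ in sorted(fitting, key=lambda item: -item[1])]


if __name__ == "__main__":
    for card_gb in (16, 24, 40, 80):
        print(card_gb, methods_that_fit(card_gb))
```

With these numbers a 24GB card only leaves room for Q-LoRA once the margin is applied, which is why the sequence length is the first thing to reduce on smaller cards.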

GPU memory usage and speed during inference are as follows.

Quantization	Speed (2048 tokens)	Memory (2048 tokens)
BF16	28.87	22.60GB
Int4	37.79	11.82GB
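
The inference table gives no unit for the Speed column. The sketch below assumes it is tokens per second, which is an assumption and not something the table states, and uses it to estimate how long a generation of a given length might take and whether a precision mode fits in memory.

```python
# (speed, memory in GB) per mode, copied from the table above.
INFERENCE_PROFILE = {
    "BF16": (28.87, 22.60),
    "Int4": (37.79, 11.82),
}


def estimate_seconds(quantization, new_tokens):
    """Estimate generation time, ASSUMING the Speed column is tokens/second."""
    speed, _memory_gb = INFERENCE_PROFILE[quantization]
    return new_tokens / speed


def fits(quantization, free_gb):
    """True when the measured memory footprint is within free_gb."""
    _speed, memory_gb = INFERENCE_PROFILE[quantization]
    return memory_gb <= free_gb


if __name__ == "__main__":
    for mode in INFERENCE_PROFILE:
        print(mode, round(estimate_seconds(mode, 2048), 1), "s", fits(mode, 12))
```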

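For GPUs such as the A100, H100, RTX3060, and RTX3070, enable bf16 precision to save GPU memory.
For GPUs such as the V100, P100, and T4, enable fp16 precision to save GPU memory.
CPU inference needs about 32GB of RAM. Inference runs on the GPU by default and needs about 24GB of GPU memory.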
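
The precision advice above can be written as a small lookup that returns the `bf16`/`fp16` keyword arguments used when loading the model later in this guide. This is a minimal sketch: the card lists are copied from the two lines above, the substring matching is deliberately crude, and the function name `precision_kwargs` is made up for illustration.

```python
# Card families named in the guide for each precision.
BF16_CARDS = ("A100", "H100", "RTX3060", "RTX3070")
FP16_CARDS = ("V100", "P100", "T4")


def precision_kwargs(device_name):
    """Map a GPU name to bf16/fp16 keyword arguments for model loading."""
    normalized = device_name.upper().replace(" ", "")
    if any(card in normalized for card in BF16_CARDS):
        return {"bf16": True, "fp16": False}
    if any(card in normalized for card in FP16_CARDS):
        return {"bf16": False, "fp16": True}
    # Unknown card: pass nothing and let the model code pick its default.
    return {}


if __name__ == "__main__":
    for name in ("NVIDIA A100-SXM4-40GB", "Tesla T4", "NVIDIA GeForce RTX 3060"):
        print(name, precision_kwargs(name))
```

On a machine with PyTorch installed, the name could come from `torch.cuda.get_device_name(0)`, and the result can be unpacked into the `from_pretrained` call.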

Software environment setup

$ curl -O https://repo.anaconda.com/archive/Anaconda3-2019.03-Linux-x86_64.sh   # download the installer script from the official site
$ bash Anaconda3-2019.03-Linux-x86_64.sh
$ conda create -n qwen_vl python=3.10
$ conda activate qwen_vl
$ conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia

Quick start and model download

Installing the dependencies

pip3 install -r requirements.txt
pip3 install -r requirements_openai_api.txt
pip3 install -r requirements_web_demo.txt
pip3 install deepspeed
pip3 install peft
pip3 install optimum
pip3 install auto-gptq
pip3 install modelscope -U
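
After installing, it can help to confirm which of the packages actually landed in the active environment. The following sketch uses only the standard library; the distribution names are taken from the install commands above, and `torch` is assumed to be the metadata name of the conda-installed PyTorch package.

```python
from importlib import metadata

# Distribution names as used in the install commands above.
REQUIRED = ("torch", "deepspeed", "peft", "optimum", "auto-gptq", "modelscope")


def installed_versions(packages=REQUIRED):
    """Return {package: version string, or None when it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:12s} {version or 'MISSING'}")
```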

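Downloading the model files

from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Revision v1.1.0 supports online INT4/INT8 quantization; other revisions do not
model_id = 'qwen/Qwen-VL-Chat'
revision = 'v1.0.0'

# Download the model into the given directory
local_dir = "/root/autodl-tmp/Qwen-VL-Chat"

# The model ID is passed positionally, so the keyword name does not matter
snapshot_download(model_id, revision=revision, local_dir=local_dir)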
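
Before pointing the training script at `local_dir`, a quick sanity check of the download can save a failed run. This is a sketch under assumptions: it expects the usual Hugging Face style layout (a `config.json` plus `*.bin` or `*.safetensors` weight files), which the guide does not spell out, and `check_model_dir` is a hypothetical helper name.

```python
from pathlib import Path


def check_model_dir(local_dir):
    """Return a list of problems found in a downloaded model directory."""
    root = Path(local_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    # Assumed weight file extensions; adjust if the repository uses others.
    weights = list(root.glob("*.bin")) + list(root.glob("*.safetensors"))
    if not weights:
        problems.append("no *.bin or *.safetensors weight files found")
    return problems


if __name__ == "__main__":
    print(check_model_dir("/root/autodl-tmp/Qwen-VL-Chat") or "looks complete")
```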

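Qwen-VL-Chat inference

Option 1: using the web UI

# Launch command; the demo is reachable from the local network
python web_demo_mm.py --server-name 0.0.0.0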

Option 2: using code

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch

torch.manual_seed(1234)

# Note: pick the precision flags below according to your GPU memory.
# The tokenizer's default behavior has changed: protection against special-token injection is now off by default.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True, fp16=False).eval()

# First round of dialogue
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},  # either a local path or a URL
    {'text': '这是什么？'},  # "What is this?"
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# 图中是一名女子在沙滩上和狗玩耍，旁边是一只拉布拉多犬，它们处于沙滩上。
# (A woman is playing with a dog on the beach; beside her is a Labrador, and they are on the beach.)

# Second round of dialogue: "Draw a box around the high-five in the image"
response, history = model.chat(tokenizer, '框出图中击掌的位置', history=history)
print(response)
# <ref>击掌</ref><box>(536,509),(588,602)</box>
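
The grounding response is plain text, so downstream code usually has to parse it. The sketch below pulls the label and coordinates out of the `<ref>…</ref><box>…</box>` format shown above. The pixel conversion assumes the coordinates are normalized to a 0 to 1000 range; that is an assumption to verify against the model documentation, not something stated in this guide.

```python
import re

# Matches one "<ref>label</ref><box>(x1,y1),(x2,y2)</box>" group.
BOX_PATTERN = re.compile(
    r"<ref>(?P<label>.*?)</ref>"
    r"<box>\((?P<x1>\d+),(?P<y1>\d+)\),\((?P<x2>\d+),(?P<y2>\d+)\)</box>"
)


def parse_boxes(response):
    """Extract (label, (x1, y1, x2, y2)) pairs from a grounding response."""
    return [
        (m["label"], tuple(int(m[key]) for key in ("x1", "y1", "x2", "y2")))
        for m in BOX_PATTERN.finditer(response)
    ]


def to_pixels(box, width, height, scale=1000):
    """Convert a box to pixels, ASSUMING coordinates are normalized to 0..scale."""
    x1, y1, x2, y2 = box
    return (
        x1 * width // scale,
        y1 * height // scale,
        x2 * width // scale,
        y2 * height // scale,
    )


if __name__ == "__main__":
    reply = "<ref>击掌</ref><box>(536,509),(588,602)</box>"
    for label, box in parse_boxes(reply):
        print(label, box, to_pixels(box, width=2000, height=1000))
```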

Fine-tuning on custom data

Preparing the training data

[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "你好"
      },
      {
        "from": "assistant",
        "value": "我是 Qwen-VL，一个支持视觉输入的大模型。"
      }
    ]
  },
  {
    "id": "identity_1",
    "conversations": [
      {
        "from": "user",
        "value": "Picture 1: <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>\n图中的狗是什么品种？"
      },
      {
        "from": "assistant",
        "value": "图中是一只拉布拉多犬。"
      },
      {
        "from": "user",
        "value": "框出图中的格子衬衫"
      },
      {
        "from": "assistant",
        "value": "<ref>格子衬衫</ref><box>(588,499),(725,789)</box>"
      }
    ]
  },
  {
    "id": "identity_2",
    "conversations": [
      {
        "from": "user",
        "value": "Picture 1: <img>assets/mm_tutorial/Chongqing.jpeg</img>\nPicture 2: <img>assets/mm_tutorial/Beijing.jpeg</img>\n图中都是哪"
      },
      {
        "from": "assistant",
        "value": "第一张图片是重庆的城市天际线，第二张图片是北京的天际线。"
      }
    ]
  }
]
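
Writing this JSON by hand gets error-prone as the dataset grows. The following sketch builds samples in the same shape as the example above (an `id` plus alternating `user` and `assistant` turns, with images referenced as numbered `<img>` tags) and runs a few basic checks. The helper names and the image path `images/0001.jpg` are hypothetical, and the checks are a minimal selection rather than the training script's own validation.

```python
import json


def image_turn(image_paths, text):
    """Build a user message: numbered <img> tags followed by the question."""
    tags = "".join(
        f"Picture {i}: <img>{path}</img>\n"
        for i, path in enumerate(image_paths, start=1)
    )
    return tags + text


def make_sample(index, turns):
    """Build one sample from a list of (user_text, assistant_text) pairs."""
    conversations = []
    for user_text, assistant_text in turns:
        conversations.append({"from": "user", "value": user_text})
        conversations.append({"from": "assistant", "value": assistant_text})
    return {"id": f"identity_{index}", "conversations": conversations}


def validate(samples):
    """Return human-readable problems; an empty list means no issue was found."""
    problems = []
    for sample in samples:
        sid = sample.get("id", "<no id>")
        turns = sample.get("conversations", [])
        if not turns or len(turns) % 2:
            problems.append(f"{sid}: expected user/assistant pairs")
        for pos, turn in enumerate(turns):
            expected = "user" if pos % 2 == 0 else "assistant"
            if turn.get("from") != expected:
                problems.append(f"{sid}: turn {pos} should be from {expected}")
            value = turn.get("value", "")
            if value.count("<img>") != value.count("</img>"):
                problems.append(f"{sid}: unbalanced <img> tags in turn {pos}")
    return problems


if __name__ == "__main__":
    question = image_turn(["images/0001.jpg"], "图中的狗是什么品种？")
    samples = [make_sample(0, [(question, "图中是一只拉布拉多犬。")])]
    print(validate(samples))
    print(json.dumps(samples, ensure_ascii=False, indent=2))
```

Passing `ensure_ascii=False` keeps Chinese text readable in the output file instead of escaping it.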

Fine-tuning the model with LoRA

1. LoRA fine-tuning

# Single-GPU training
sh finetune/finetune_lora_single_gpu.sh
# Distributed training
sh finetune/finetune_lora_ds.sh

#!/bin/bash

export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`

MODEL="/root/autodl-tmp/Qwen-VL-Chat"
DATA="/root/autodl-tmp/data.json"

export CUDA_VISIBLE_DEVICES=0

python3 finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --bf16 True \
    --fix_vit True \
    --output_dir output_qwen \
    --num_train_epochs 5 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --model_max_length 600 \
    --lazy_preprocess True \
    --gradient_checkpointing \
    --use_lora
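
The flags in this script interact in ways that are easy to misjudge, so a little arithmetic helps before launching. The sketch below approximates the number of optimizer steps implied by the batch size, gradient accumulation, and epoch settings, and evaluates a linear-warmup cosine schedule of the kind selected by `--lr_scheduler_type "cosine"`. It is an approximation for planning purposes: the trainer's exact rounding may differ, and both function names are made up for illustration.

```python
import math


def training_plan(num_samples, num_gpus=1, per_device_batch=1, grad_accum=8,
                  epochs=5, warmup_ratio=0.01, save_steps=1000):
    """Approximate the optimizer-step budget implied by the script's flags."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = max(1, num_samples // effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
        # Intermediate checkpoints written by --save_steps during training.
        "checkpoints": total_steps // save_steps,
    }


def cosine_lr(step, total_steps, warmup_steps, peak_lr=1e-5):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    plan = training_plan(num_samples=800)
    print(plan)
    for step in (0, plan["warmup_steps"], plan["total_steps"] // 2):
        print(step, cosine_lr(step, plan["total_steps"], plan["warmup_steps"]))
```

One thing this makes visible: with a small dataset the total step count can stay below `--save_steps 1000`, in which case no intermediate checkpoint is written during training.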

2. Q-LoRA fine-tuning (fp16 only)

# Single-GPU training
sh finetune/finetune_qlora_single_gpu.sh
# Distributed training
sh finetune/finetune_qlora_ds.sh

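3. Merging the model and inference

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Placeholder paths: the adapter path matches --output_dir in the LoRA script
path_to_adapter = "output_qwen"
new_model_directory = "new_model_directory"

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter,  # path to the training output directory
    device_map="auto",
    trust_remote_code=True
).eval()

merged_model = model.merge_and_unload()
# max_shard_size and safe_serialization are optional:
# the first shards the checkpoint, the second saves the model as safetensors
merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True)

# The tokenizer is not saved with the model, so save it too; this assumes
# training wrote it to the adapter directory (otherwise load it from the base model)
tokenizer = AutoTokenizer.from_pretrained(path_to_adapter, trust_remote_code=True)
tokenizer.save_pretrained(new_model_directory)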
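
To make clear what merging does, here is a tiny numeric illustration of the idea behind folding a LoRA adapter into a base weight: the merged weight is W + (alpha / r) · B · A, so a single matrix gives the same output as the base layer plus the adapter. It uses plain Python lists and made-up numbers, and it illustrates the general LoRA formula rather than PEFT's actual implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A for one linear layer."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


def linear(weight, x):
    """Apply y = W x, with x given as a flat list."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (2x2)
    A = [[0.5, -0.5]]             # LoRA down-projection, rank 1 (1x2)
    B = [[2.0], [4.0]]            # LoRA up-projection (2x1)
    x = [3.0, 1.0]

    merged = merge_lora(W, A, B, alpha=2, rank=1)
    # Unmerged path: base output plus the scaled adapter output.
    adapter_out = linear(B, linear(A, x))
    separate = [b + 2.0 * a for b, a in zip(linear(W, x), adapter_out)]
    print(linear(merged, x), separate)
```

Because the two paths agree, the merged checkpoint can be loaded like an ordinary model with no PEFT dependency at inference time.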

4. Inference with the fine-tuned model

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("new_model_directory", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("new_model_directory", device_map="auto", trust_remote_code=True, bf16=True).eval()

# Test inference
query = tokenizer.from_list_format([
    {'image': 'test_image.jpg'},
    {'text': '请描述这张图片'}  # "Please describe this image"
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
