LoRA Fine-Tuning for Grounding Tasks with Qwen3-VL and LLaMA-Factory: A Hands-On Guide


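Model Background

Qwen3-VL brings a marked improvement in spatial perception. 2D grounding moves from absolute to relative coordinates, the model can judge object position, viewpoint changes, and occlusion relationships, and it even supports 3D grounding. OCR coverage is extended to 32 languages and is more robust under difficult lighting and blur.

On the technical side there are three main changes:

MRoPE-Interleave: the original MRoPE splits the feature dimensions into consecutive blocks for time, height, and width. Qwen3-VL interleaves the three axes instead, so each axis covers the full frequency range, which makes long-video understanding more robust.
DeepStack: fuses multi-level ViT features and injects visual tokens into several LLM layers rather than a single one, preserving information from low-level to high-level features and improving image-text alignment.
Text-timestamp alignment: replaces the earlier T-RoPE with interleaved "timestamp, video frame" input, natively supports output in both seconds and HMS format, and improves temporal reasoning accuracy.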
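
To make the difference between the chunked and the interleaved layout concrete, here is a toy sketch. It only labels which axis (t, h, w) each rotary frequency index is assigned to. The function names and the equal three-way split are assumptions for illustration, not Qwen3-VL's actual implementation.

```python
def chunked_layout(num_freqs: int) -> list[str]:
    """Chunked style: one contiguous block of frequency indices per axis."""
    per_axis = num_freqs // 3
    return ["t"] * per_axis + ["h"] * per_axis + ["w"] * (num_freqs - 2 * per_axis)


def interleaved_layout(num_freqs: int) -> list[str]:
    """Interleaved style: axes alternate, so each axis spans low and high indices."""
    axes = ("t", "h", "w")
    return [axes[i % 3] for i in range(num_freqs)]


def freq_range(layout: list[str], axis: str) -> tuple[int, int]:
    """Lowest and highest frequency index assigned to the given axis."""
    idx = [i for i, a in enumerate(layout) if a == axis]
    return idx[0], idx[-1]


if __name__ == "__main__":
    n = 12  # toy number of rotary frequencies
    print("chunked:    ", "".join(chunked_layout(n)))
    print("interleaved:", "".join(interleaved_layout(n)))
    print("t range, chunked:    ", freq_range(chunked_layout(n), "t"))
    print("t range, interleaved:", freq_range(interleaved_layout(n), "t"))
```

In the chunked layout the time axis only sees the first block of indices, while in the interleaved layout every axis reaches from the lowest to (nearly) the highest index, which is the "full frequency coverage" described above.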

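Environment Setup

Start by creating a dedicated Conda environment and installing the dependencies. The -n flag sets the environment name. The vllm version specifier must be quoted, otherwise the shell interprets > as output redirection. The last line uses uv, which has to be installed separately; plain pip works as well.

conda create -n Qwen3-vl python=3.10
conda activate Qwen3-vl
pip install accelerate
pip install qwen-vl-utils==0.0.14
uv pip install -U "vllm>=0.11.0"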
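
Before moving on, it can help to confirm what actually got installed. The following is a minimal sketch that reads package metadata without importing the packages. The package list is an assumption based on the install commands above and on the imports used by the inference script later in this guide.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} without importing the packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed list: the packages installed above plus the ones the inference script imports.
    required = ["torch", "transformers", "accelerate", "qwen-vl-utils", "vllm", "pillow"]
    for name, version in installed_versions(required).items():
        print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

Run it inside the activated Qwen3-vl environment; any line reporting NOT INSTALLED points at a missing dependency.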

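Model Loading and Inference

The weights can be downloaded from Hugging Face or ModelScope. Below is a basic inference example; the parts to pay attention to are how the message list is built and how the generation parameters are set.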

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch
from PIL import Image

def load_qwen3_vl_4b_model():
    """Load the Qwen3-VL-4B-Instruct model and its processor."""
    model = Qwen3VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen3-VL-4B-Instruct",
        torch_dtype=torch.bfloat16,
        device_map="auto",
        # Requires the flash-attn package; remove this argument to use the default attention.
        attn_implementation="flash_attention_2"
    )
    processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-4B-Instruct")
    return model, processor

def process_multimodal_query(model, processor, image_path, text_query):
    """Run a multimodal query (image + text) and return the reply text."""
    image = Image.open(image_path).convert('RGB')
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": text_query}
            ]
        }
    ]
    inputs = processor.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True,
        return_dict=True, return_tensors="pt"
    )
    # Move the input tensors to the model's device before generation.
    inputs = inputs.to(model.device)
    generated_ids = model.generate(
        **inputs, max_new_tokens=128, do_sample=True,
        temperature=0.7, top_p=0.8
    )
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    return output_text[0] if output_text else ""

if __name__ == "__main__":
    model, processor = load_qwen3_vl_4b_model()
    image_path = "example.jpg"
    query = "Describe the scene and the main objects in this image."
    result = process_multimodal_query(model, processor, image_path, query)
    print("Model reply:", result)

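Data Preparation

LLaMA-Factory expects datasets in a specific format. For grounding tasks the key step is converting the coordinate system. Qwen3-VL uses relative coordinates in the range 0-1000, while YOLO labels are normally stored as normalized [0, 1] values in center-x, center-y, width, height order, so they have to be converted to corner form and rescaled.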
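
As a worked example of the mapping: a YOLO box with center (0.5, 0.5), width 0.2, and height 0.4 has corners (0.4, 0.3) and (0.6, 0.7), which become [400, 300, 600, 700] on the 0-1000 scale. Going the other way is needed when drawing predictions on the original image. The helper below is a minimal sketch of that inverse step; the function name is hypothetical.

```python
def qwen_box_to_pixels(box, image_width, image_height, scale=1000):
    """Map [x_min, y_min, x_max, y_max] on the 0..scale grid to pixel coordinates."""
    x_min, y_min, x_max, y_max = box
    return [
        round(x_min * image_width / scale),
        round(y_min * image_height / scale),
        round(x_max * image_width / scale),
        round(y_max * image_height / scale),
    ]


if __name__ == "__main__":
    # The worked example above, projected onto a 1920x1080 image.
    print(qwen_box_to_pixels([400, 300, 600, 700], 1920, 1080))
```

Because the 0-1000 grid is relative, the same predicted box works for any resolution of the same image; only this final projection depends on the pixel size.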

Dataset Format Requirements

The dataset must follow the ShareGPT format, with a conversations field and an images field. It also has to be registered in dataset_info.json.
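
To show what one record looks like, here is a sketch of a single sample together with a small sanity check. The image path and the validate_sample helper are hypothetical. The check is not LLaMA-Factory's own validation; it only verifies that every turn has string from/value fields and that the number of <image> placeholders matches the number of image paths.

```python
import json

IMAGE_TOKEN = "<image>"

# Hypothetical sample; the image path is a placeholder.
EXAMPLE_SAMPLE = {
    "conversations": [
        {"from": "human", "value": "<image>\nLocate all objects in this image."},
        {
            "from": "gpt",
            "value": json.dumps({"objects": [{"cls_id": 0, "bbox_2d": [400, 300, 600, 700]}]}),
        },
    ],
    "images": ["/data/images/0001.jpg"],
}


def validate_sample(sample):
    """Return a list of problems found in one ShareGPT-style sample (empty if none)."""
    problems = []
    turns = sample.get("conversations") or []
    images = sample.get("images") or []
    if not turns:
        problems.append("'conversations' is missing or empty")
    for i, turn in enumerate(turns):
        if not isinstance(turn.get("from"), str) or not isinstance(turn.get("value"), str):
            problems.append(f"turn {i} needs string 'from' and 'value' fields")
    # Count image placeholders across all turns that carry text.
    placeholders = sum(
        turn["value"].count(IMAGE_TOKEN)
        for turn in turns
        if isinstance(turn.get("value"), str)
    )
    if placeholders != len(images):
        problems.append(
            f"{placeholders} {IMAGE_TOKEN} placeholder(s) but {len(images)} image path(s)"
        )
    return problems


if __name__ == "__main__":
    print(validate_sample(EXAMPLE_SAMPLE) or "sample looks consistent")
```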

YOLO to Qwen3-VL Conversion Script

The script below converts YOLO labels into the JSON format needed for Qwen3-VL. The core logic is the coordinate scaling.

import os
import json
from tqdm import tqdm

IMAGE_DIR = "images"
LABEL_DIR = "labels"
OUTPUT_JSON = "qwen3_vl_grounding_mllm.json"
CLASS_ID2NAME = {0: "house"}  # Adjust to your actual classes
USER_PROMPT = (
    "<image>\n"
    "Locate all objects in this image and output the bbox coordinates "
    "in JSON format using relative coordinates in the range [0, 1000]."
)

def yolo_to_xyxy_relative(xc, yc, w, h):
    """Convert a normalized YOLO box (center x/y, width, height) to clamped corner form."""
    x_min = xc - w / 2
    y_min = yc - h / 2
    x_max = xc + w / 2
    y_max = yc + h / 2
    return [
        max(0.0, min(1.0, x_min)),
        max(0.0, min(1.0, y_min)),
        max(0.0, min(1.0, x_max)),
        max(0.0, min(1.0, y_max))
    ]

def scale_to_qwen_coords(xyxy_rel, scale=1000):
    """Scale normalized [0, 1] corners to integers on Qwen3-VL's 0-1000 grid."""
    x_min, y_min, x_max, y_max = xyxy_rel
    return [
        int(round(x_min * scale)),
        int(round(y_min * scale)),
        int(round(x_max * scale)),
        int(round(y_max * scale))
    ]

def main():
    exts = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}
    files = sorted([f for f in os.listdir(IMAGE_DIR) if os.path.splitext(f)[1].lower() in exts])
    dataset = []
    for img_name in tqdm(files, desc="Converting"):
        img_path = os.path.join(IMAGE_DIR, img_name)
        base, _ = os.path.splitext(img_name)
        label_path = os.path.join(LABEL_DIR, base + ".txt")
        if not os.path.exists(label_path):
            continue
        bboxes_qwen, cls_ids = [], []
        with open(label_path, "r", encoding="utf-8") as f:
            for line in f:
                parts = line.strip().split()
                if len(parts) < 5:
                    continue
                cls_id = int(parts[0])
                xc, yc, w, h = float(parts[1]), float(parts[2]), float(parts[3]), float(parts[4])
                xyxy_rel = yolo_to_xyxy_relative(xc, yc, w, h)
                xyxy_qwen = scale_to_qwen_coords(xyxy_rel)
                bboxes_qwen.append(xyxy_qwen)
                cls_ids.append(cls_id)
        if not bboxes_qwen:
            continue
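        # Emit the readable class name next to the numeric id (CLASS_ID2NAME was otherwise unused).
        objects = [
            {"cls_id": cid, "label": CLASS_ID2NAME.get(cid, str(cid)), "bbox_2d": box}
            for cid, box in zip(cls_ids, bboxes_qwen)
        ]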
        answer_obj = {"objects": objects}
        sample = {
            "conversations": [
                {"from": "human", "value": USER_PROMPT},
                {"from": "gpt", "value": json.dumps(answer_obj, ensure_ascii=False)}
            ],
            "images": [os.path.abspath(img_path)]
        }
        dataset.append(sample)
    with open(OUTPUT_JSON, "w", encoding="utf-8") as f:
        json.dump(dataset, f, ensure_ascii=False, indent=2)
    print(f"Done. Wrote {len(dataset)} samples to {OUTPUT_JSON}")

if __name__ == "__main__":
    main()
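
Clamping to [0, 1] and rounding can collapse a thin box at the image border to zero width or height, so a quick pass over the generated file is worthwhile before training. The following is a minimal sketch that assumes the {"objects": [{"bbox_2d": ...}]} answer layout produced by the script above; the function name is hypothetical.

```python
import json
from pathlib import Path


def find_invalid_boxes(samples, scale=1000):
    """Yield (sample_index, box) for boxes that are out of range or have no area."""
    for i, sample in enumerate(samples):
        # The second turn holds the assistant answer as a JSON string.
        answer = json.loads(sample["conversations"][1]["value"])
        for obj in answer["objects"]:
            x_min, y_min, x_max, y_max = obj["bbox_2d"]
            in_range = all(0 <= v <= scale for v in (x_min, y_min, x_max, y_max))
            has_area = x_min < x_max and y_min < y_max
            if not (in_range and has_area):
                yield i, obj["bbox_2d"]


if __name__ == "__main__":
    path = Path("qwen3_vl_grounding_mllm.json")
    if path.exists():
        data = json.loads(path.read_text(encoding="utf-8"))
        for index, box in find_invalid_boxes(data):
            print(f"sample {index}: suspicious box {box}")
```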

Place the generated JSON file in the LLaMA-Factory/data directory and add the following entry to dataset_info.json:

"qwen3_vl_grounding_mllm": {
    "file_name": "qwen3_vl_grounding_mllm.json",
    "formatting": "sharegpt",
    "columns": {
        "messages": "conversations",
        "images": "images"
    }
}
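
Editing dataset_info.json by hand is easy to get wrong (a missing comma breaks the whole file), so the entry can also be merged programmatically. This is a minimal sketch; register_dataset is a hypothetical helper, not part of LLaMA-Factory, and the entry mirrors the configuration shown above.

```python
import json
from pathlib import Path

# Same entry as the configuration above.
ENTRY = {
    "file_name": "qwen3_vl_grounding_mllm.json",
    "formatting": "sharegpt",
    "columns": {"messages": "conversations", "images": "images"},
}


def register_dataset(info_path, name, entry):
    """Add or overwrite one dataset entry in dataset_info.json and return the full mapping."""
    path = Path(info_path)
    # Start from the existing registry if there is one, so other datasets are preserved.
    info = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    info[name] = entry
    path.write_text(json.dumps(info, ensure_ascii=False, indent=2) + "\n", encoding="utf-8")
    return info
```

Calling register_dataset("LLaMA-Factory/data/dataset_info.json", "qwen3_vl_grounding_mllm", ENTRY) from the directory that contains the LLaMA-Factory checkout adds the entry while keeping the datasets that are already registered.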

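Fine-Tuning

Training through LLaMA-Factory's web interface is the most convenient route.

Launch the WebUI: from the project directory, run llamafactory-cli webui.
Configure the parameters:
- Select the model path and the download source.
- For the compute type, pure_bf16 is recommended. It trains entirely in bfloat16 rather than in mixed precision, so it typically uses less GPU memory than fp16, and bfloat16's wider exponent range makes overflow less likely.
- In the dataset field, select the qwen3_vl_grounding_mllm dataset registered above.
Start training: save the parameters and click Start. Watch the loss during training; if you run out of GPU memory, reduce the batch size.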
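
Reducing the per-device batch size also shrinks the number of samples behind each optimizer step. A common way to compensate is to raise gradient accumulation so the effective batch stays the same. The arithmetic is sketched below; the function names and the numbers in the example are illustrative assumptions, not recommended settings.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus=1):
    """Number of samples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus


def accumulation_for(target_effective_batch, per_device_batch, num_gpus=1):
    """Gradient accumulation steps needed to reach at least the target effective batch."""
    return math.ceil(target_effective_batch / (per_device_batch * num_gpus))


if __name__ == "__main__":
    target = effective_batch_size(per_device_batch=4, grad_accum_steps=4)
    # After an out-of-memory error, drop to one sample per device and accumulate more steps.
    steps = accumulation_for(target, per_device_batch=1)
    print(f"effective batch {target}: use batch size 1 with {steps} accumulation steps")
```

Peak memory is driven mainly by the per-device batch, so this trade keeps the optimization behaviour roughly comparable at the cost of more forward and backward passes per step.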

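Testing and Export

After training, switch to Chat mode in the WebUI and load the fine-tuned model path to test it. Provide an image and a prompt, and check whether the predicted coordinates are accurate. Finally, use the export function to save the model weights for deployment.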
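
Checking coordinates by eye does not scale beyond a few images. One common way to quantify accuracy is intersection over union (IoU) between predicted and ground-truth boxes. The sketch below assumes the model replies in the same {"objects": [{"bbox_2d": ...}]} JSON layout used for training; both helper names are hypothetical.

```python
import json


def iou(box_a, box_b):
    """Intersection over union of two [x_min, y_min, x_max, y_max] boxes."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def parse_boxes(reply):
    """Extract bbox_2d lists from a reply; return [] if the reply is not the expected JSON."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    objects = data.get("objects", [])
    return [o["bbox_2d"] for o in objects if isinstance(o, dict) and "bbox_2d" in o]


if __name__ == "__main__":
    reply = '{"objects": [{"cls_id": 0, "bbox_2d": [410, 310, 600, 700]}]}'
    ground_truth = [400, 300, 600, 700]
    for box in parse_boxes(reply):
        print(f"predicted {box}, IoU with ground truth: {iou(box, ground_truth):.3f}")
```

Since both boxes live on the same 0-1000 grid, IoU can be computed directly without converting to pixels first.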

