1. 环境配置
conda create -n Qwen3-vl python=3.10
conda activate Qwen3-vl
pip install transformers accelerate
pip install qwen-vl-utils==0.0.14
uv pip install -U vllm
2. 下载代码
git clone https://github.com/QwenLM/Qwen3-VL
3. 下载权重文件
使用 ModelScope 下载模型库:
pip install modelscope
modelscope download --model Qwen/Qwen3-VL-4B-Instruct
4. 推理测试
修改模型路径和图片路径后运行以下脚本:
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch
from PIL import Image
def load_qwen3_vl_4b_model():
    """Load the Qwen3-VL-4B-Instruct checkpoint and its matching processor.

    Returns:
        tuple: ``(model, processor)`` — a ``Qwen3VLForConditionalGeneration``
        instance ready for generation, and the ``AutoProcessor`` that prepares
        its multimodal inputs.
    """
    model_id = "Qwen/Qwen3-VL-4B-Instruct"
    # bfloat16 + FlashAttention-2 keep memory usage down; device_map="auto"
    # lets accelerate place the weights across available GPUs/CPU.
    model = Qwen3VLForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        attn_implementation="flash_attention_2",
    )
    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor
def process_multimodal_query(model, processor, image_path, text_query):
    """Answer a single text question about a single image with Qwen3-VL.

    Args:
        model: Loaded ``Qwen3VLForConditionalGeneration`` instance.
        processor: The matching ``AutoProcessor``.
        image_path: Filesystem path to the input image.
        text_query: The user's question about the image.

    Returns:
        str: The decoded model answer with the prompt tokens stripped;
        an empty string if decoding yields nothing.
    """
    # Force RGB so RGBA/palette/grayscale images don't break preprocessing.
    image = Image.open(image_path).convert('RGB')
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": text_query}
            ]
        }
    ]
    inputs = processor.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True,
        return_dict=True, return_tensors="pt"
    ).to(model.device)  # with device_map="auto" the inputs must follow the model
    generated_ids = model.generate(
        **inputs, max_new_tokens=512, do_sample=True,
        temperature=0.7, top_p=0.8
    )
    # generate() echoes the prompt; keep only the newly generated tail.
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )
    return output_text[0] if output_text else ""
if __name__ == "__main__":
    # Demo driver: load the model once, then answer one query.
    model, processor = load_qwen3_vl_4b_model()
    image_path = "test.jpg"  # TODO: replace with your image path
    query = "Describe this image."  # TODO: replace with your question
    result = process_multimodal_query(model, processor, image_path, query)
    print("Answer:", result)


