Getting Started with Instruction Fine-Tuning Qwen2: A Hands-On Tutorial (Full Code Included)

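1. Background

Qwen2 is an open-source large language model from the Tongyi Qianwen (Qwen) team, developed by Alibaba Cloud's Tongyi Lab. Using Qwen2 as the base model and applying instruction tuning to reach high accuracy on a text classification task is a good first exercise in fine-tuning large language models.

Instruction tuning is the process of further training a large language model on a dataset of (instruction, output) pairs. The instruction is a human instruction given to the model, and the output is the desired response that follows that instruction. This process helps bridge the gap between the model's next-token prediction objective and the user's goal of having the model follow human instructions.

In this tutorial, we fine-tune Qwen2-1.5B-Instruct on the Fudan News classification dataset and use SwanLab to monitor and visualize the training process.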

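2. Environment Setup

This example requires Python>=3.8 and an NVIDIA GPU. The memory requirement is modest: roughly 10 GB of GPU memory is enough to run it.

Install the following Python libraries. Before doing so, make sure PyTorch and CUDA are already set up in your environment: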

swanlab
modelscope
transformers
datasets
peft
accelerate
pandas

Install everything with one command:

pip install swanlab modelscope transformers datasets peft pandas accelerate

Note: this example was tested with modelscope 1.14.0, transformers 4.41.2, datasets 2.18.0, peft 0.11.1, accelerate 0.30.1, and swanlab 0.3.9.
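
To compare a local environment against the versions listed above, a small check such as the following may help. It is a minimal sketch that uses only the standard library; the helper name installed_versions is an illustrative choice and not part of any of these packages.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages from the install command above.
PACKAGES = ["swanlab", "modelscope", "transformers", "datasets", "peft", "accelerate", "pandas"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name}: {ver or 'not installed'}")
```

Newer versions may well work too; the listed versions are simply the ones the example was tested with.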

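3. Preparing the Dataset

This example uses the zh_cls_fudan-news dataset, which is mainly used for training text classification models.

The dataset contains several thousand records, each with three fields: text, category, and output: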

text is the training text: the content of a book or news article.
category is a list of candidate categories for the text.
output is the single true category of the text.

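The training task is for the fine-tuned model to predict the correct output from a prompt built from text and category.

Download the dataset to a local directory: go to ModelScope and save train.jsonl and test.jsonl in the project root directory.

A sample, shown schematically (the raw JSONL fields are text, category, and output, as listed above):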

{
"PROMPT": "Text: 第四届全国大企业足球赛复赛结束新华社郑州5月3日电...",
"Category": "Sports, Politics",
"Output": "Sports"
}
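
Before converting the data, it can be worth confirming that the downloaded files really contain the three fields described above. The sketch below assumes each line of the file is one JSON object with text, category, and output keys; peek_jsonl is a hypothetical helper name.

```python
import json

REQUIRED_FIELDS = {"text", "category", "output"}  # fields described in this section


def peek_jsonl(path, n=3):
    """Read the first n records of a JSONL file and check that the expected fields exist."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if len(records) >= n:
                break
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            missing = REQUIRED_FIELDS - set(record)
            if missing:
                raise KeyError(f"record is missing fields: {sorted(missing)}")
            records.append(record)
    return records
```

Calling peek_jsonl("train.jsonl") after the download should return the first three records, or raise a KeyError if the file does not have the expected layout.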

4. Loading the Model

Here we download the Qwen2-1.5B-Instruct model with modelscope, then load it into Transformers for training.

from modelscope import snapshot_download, AutoTokenizer
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq
import torch

# Download the Qwen model from modelscope into the local directory
model_dir = snapshot_download("qwen/Qwen2-1.5B-Instruct", cache_dir="./", revision="master")

# Load the model weights with Transformers, from the path returned by snapshot_download
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", torch_dtype=torch.bfloat16)

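5. Configuring Training Visualization

We use SwanLab to monitor the whole training process and to evaluate the final model.

SwanLab's integration with Transformers makes this straightforward: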

from swanlab.integration.huggingface import SwanLabCallback

swanlab_callback = SwanLabCallback(...)

trainer = Trainer(
    ...
    callbacks=[swanlab_callback],
)

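If this is your first time using SwanLab, register an account on the official website, copy your API Key from the user settings page, and paste it in when prompted at the start of training.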

6. Full Code Implementation

The directory layout at the start of training is:

|--- train.py
|--- train.jsonl
|--- test.jsonl

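6.1 Data Processing and Conversion

First, convert the raw dataset into the format needed for fine-tuning: each JSONL record is rewritten as an object with instruction, input, and output fields.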

import json
import pandas as pd
import torch
from datasets import Dataset
from modelscope import snapshot_download, AutoTokenizer
from swanlab.integration.huggingface import SwanLabCallback
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq
import os
import swanlab

def dataset_jsonl_transfer(origin_path, new_path):
    """
    Convert the raw dataset into a new dataset in the format needed for fine-tuning.
    """
    messages = []

    # Read the original JSONL file
    with open(origin_path, "r", encoding="utf-8") as file:
        for line in file:
            # Parse the JSON object on each line
            data = json.loads(line)
            context = data["text"]
            category = data["category"]
            label = data["output"]
            # The prompt strings stay in Chinese to match the dataset language
            message = {
                "instruction": "你是一个文本分类领域的专家，你会接收到一段文本和几个潜在的分类选项，请输出文本内容的正确类型",
                "input": f"文本:{context},类型选型:{category}",
                "output": label,
            }
            messages.append(message)

    # Write the converted JSONL file
    with open(new_path, "w", encoding="utf-8") as file:
        for message in messages:
            file.write(json.dumps(message, ensure_ascii=False) + "\n")
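
To make the conversion concrete, the sketch below applies the same mapping to a single in-memory record. The record is made up for illustration, and to_message is a hypothetical helper that mirrors the body of the loop in dataset_jsonl_transfer.

```python
import json

INSTRUCTION = "你是一个文本分类领域的专家，你会接收到一段文本和几个潜在的分类选项，请输出文本内容的正确类型"


def to_message(record):
    """Map one raw record (text, category, output) to instruction/input/output."""
    return {
        "instruction": INSTRUCTION,
        "input": f"文本:{record['text']},类型选型:{record['category']}",
        "output": record["output"],
    }


# A made-up record in the raw format; real records come from train.jsonl.
raw = {"text": "第四届全国大企业足球赛复赛结束", "category": ["Sports", "Politics"], "output": "Sports"}
message = to_message(raw)
print(json.dumps(message, ensure_ascii=False, indent=2))
```

Because category is a list, the f-string embeds its Python representation, so the input field ends with 类型选型:['Sports', 'Politics'].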

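6.2 Preprocessing Function

The process_func function preprocesses each example: it tokenizes the text, truncates it, and masks the prompt tokens in the labels.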

def process_func(example):
    """
    Preprocess one example: tokenize it and mask the prompt tokens in the labels.
    """
    MAX_LENGTH = 384
    input_ids, attention_mask, labels = [], [], []
    instruction = tokenizer(
        f"<|im_start|>system\n你是一个文本分类领域的专家，你会接收到一段文本和几个潜在的分类选项，请输出文本内容的正确类型<|im_end|>\n<|im_start|>user\n{example['input']}<|im_end|>\n<|im_start|>assistant\n",
        add_special_tokens=False,
    )
    response = tokenizer(f"{example['output']}", add_special_tokens=False)
    input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    attention_mask = (
        instruction["attention_mask"] + response["attention_mask"] + [1]
    )
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
    if len(input_ids) > MAX_LENGTH:  # truncate sequences that are too long
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
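
The label-masking step is easier to see with toy token IDs. The following sketch reproduces only the list arithmetic from process_func, without loading a tokenizer; the IDs, the build_example name, and the eos_id value are invented for illustration. Positions labeled -100 are ignored by the cross-entropy loss, so only the response tokens and the terminating token contribute to training.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def build_example(prompt_ids, response_ids, eos_id, max_length):
    """Concatenate prompt and response IDs and mask the prompt in the labels."""
    input_ids = prompt_ids + response_ids + [eos_id]
    attention_mask = [1] * len(input_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [eos_id]
    # Truncate from the right, as process_func does.
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }


example = build_example(prompt_ids=[11, 12, 13], response_ids=[21, 22], eos_id=0, max_length=384)
print(example["labels"])  # [-100, -100, -100, 21, 22, 0]
```

Note that truncation is applied from the right, so if the prompt alone reaches max_length, the response is cut off and every label is -100. Under that reading, very long news texts may contribute no training signal unless the text is shortened before the prompt is built.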

6.3 Inference Function

Define an inference function for testing the model.

def predict(messages, model, tokenizer):
    device = "cuda"
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(response)
     
    return response
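
The list comprehension in predict relies on generate returning, for a decoder-only model, the prompt tokens followed by the newly generated tokens. A toy version with plain lists, using made-up token IDs and a hypothetical strip_prompt helper, shows the slicing:

```python
def strip_prompt(input_ids_batch, output_ids_batch):
    """Drop the leading prompt tokens from each generated sequence."""
    return [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(input_ids_batch, output_ids_batch)
    ]


prompts = [[101, 102, 103]]              # made-up prompt token IDs
generated = [[101, 102, 103, 7, 8, 9]]   # prompt followed by new tokens
print(strip_prompt(prompts, generated))  # [[7, 8, 9]]
```

Only the remaining tokens are decoded, which is why the returned response contains the predicted category without the prompt text.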

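6.4 Main Training Flow

This part ties the pieces together: loading the model, configuring LoRA, setting the training arguments, and running training.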

# Download the Qwen model from modelscope into the local directory
model_dir = snapshot_download("qwen/Qwen2-1.5B-Instruct", cache_dir="./", revision="master")

# Load the model weights with Transformers, from the path returned by snapshot_download
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", torch_dtype=torch.bfloat16)
model.enable_input_require_grads()  # required when gradient checkpointing is enabled

# Load and convert the training and test sets
train_dataset_path = "train.jsonl"
test_dataset_path = "test.jsonl"

train_jsonl_new_path = "new_train.jsonl"
test_jsonl_new_path = "new_test.jsonl"

if not os.path.exists(train_jsonl_new_path):
    dataset_jsonl_transfer(train_dataset_path, train_jsonl_new_path)
if not os.path.exists(test_jsonl_new_path):
    dataset_jsonl_transfer(test_dataset_path, test_jsonl_new_path)

# Build the training set
train_df = pd.read_json(train_jsonl_new_path, lines=True)
train_ds = Dataset.from_pandas(train_df)
train_dataset = train_ds.map(process_func, remove_columns=train_ds.column_names)

# LoRA configuration
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    inference_mode=False,  # training mode
    r=8,  # LoRA rank: controls the size of the low-rank matrices
    lora_alpha=32,  # LoRA alpha: scaling coefficient
    lora_dropout=0.1,  # dropout rate, to reduce overfitting
)

model = get_peft_model(model, config)

# Training arguments
args = TrainingArguments(
    output_dir="./output/Qwen2",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # accumulate gradients to simulate a larger batch size
    logging_steps=10,
    num_train_epochs=2,
    save_steps=100,
    learning_rate=1e-4,
    save_on_each_node=True,
    gradient_checkpointing=True,  # enable gradient checkpointing to save GPU memory
    report_to="none",
)

# Initialize the SwanLab callback
swanlab_callback = SwanLabCallback(
    project="Qwen2-fintune",
    experiment_name="Qwen2-1.5B-Instruct",
    description="Fine-tune the Tongyi Qianwen Qwen2-1.5B-Instruct model on the zh_cls_fudan-news dataset.",
    config={
        "model": "qwen/Qwen2-1.5B-Instruct",
        "dataset": "huangjintao/zh_cls_fudan-news",
    }
)

# Create the Trainer and start training
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
    callbacks=[swanlab_callback],
)

trainer.train()

# Test the model on the first 10 examples of the test set
test_df = pd.read_json(test_jsonl_new_path, lines=True)[:10]

test_text_list = []
for index, row in test_df.iterrows():
    instruction = row['instruction']
    input_value = row['input']
    
    messages = [
        {"role": "system", "content": f"{instruction}"},
        {"role": "user", "content": f"{input_value}"}
    ]

    response = predict(messages, model, tokenizer)
    messages.append({"role": "assistant", "content": f"{response}"})
    result_text = f"{messages[0]}\n\n{messages[1]}\n\n{messages[2]}"
    test_text_list.append(swanlab.Text(result_text, caption=response))
    
swanlab.log({"Prediction": test_text_list})
swanlab.finish()

When the progress bar appears, training has started.
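
The number of optimizer steps the run will take can be estimated from the training arguments above: each step consumes per_device_train_batch_size × gradient_accumulation_steps examples per GPU. The sketch below works through the arithmetic, assuming a single GPU. The dataset size of 4,000 is a placeholder, since the text only says the dataset has several thousand records, and the Trainer's own count may differ slightly because of rounding.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, num_gpus=1, epochs=1):
    """Estimate the effective batch size and the number of optimizer steps."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# 4000 is a placeholder dataset size, not the real count.
effective, total_steps = optimizer_steps(4000, per_device_batch=4, grad_accum=4, epochs=2)
print(effective, total_steps)  # 16 500
```

With save_steps=100, a run of that length would write a checkpoint every 100 optimizer steps.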

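7. Training Results

View the final training results on SwanLab.

After 2 epochs, the loss of the fine-tuned Qwen2 has dropped to a good level. For large models, though, the real measure of quality is still how the outputs look on inspection.

On a number of test examples, the fine-tuned Qwen2 gives the correct text category.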
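
Spot checks on a few examples can be complemented with a simple exact-match accuracy over the test set. The sketch below is one possible way to score predictions, assuming the model outputs only the category name; exact_match_accuracy is a hypothetical helper, and the sample values are invented rather than real model output.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to the reference label after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


# Invented values for illustration; in practice, collect the predict(...) outputs
# and the "output" column of the converted test set.
preds = ["Sports", "Politics ", "Art", "Economy"]
refs = ["Sports", "Politics", "History", "Economy"]
print(exact_match_accuracy(preds, refs))  # 0.75
```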

At this point, you have completed instruction fine-tuning of Qwen2!

8. Running Inference with the Trained Model

By default, the trained model is saved under the ./output/Qwen2 folder.

The inference code is as follows:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

def predict(messages, model, tokenizer):
    device = "cuda"

    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

    return response

# Load the tokenizer and model from the original download path
tokenizer = AutoTokenizer.from_pretrained("./qwen/Qwen2-1___5B-Instruct/", use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("./qwen/Qwen2-1___5B-Instruct/", device_map="auto", torch_dtype=torch.bfloat16)

# Load the trained LoRA adapter; replace checkpointXXX below with the actual checkpoint directory name
model = PeftModel.from_pretrained(model, model_id="./output/Qwen2/checkpointXXX")

test_texts = {
    'instruction': "你是一个文本分类领域的专家，你会接收到一段文本和几个潜在的分类选项，请输出文本内容的正确类型",
    'input': "文本：航空动力学报 JOURNAL OF AEROSPACE POWER1998 年 第 4 期 No.4 1998 科技期刊管路系统敷设的并行工程模型研究*陈志英* 马枚北京航空航天大学【摘要】提出了一种应用于并行工程模型转换研究的标号法，该法是将现行串行设计过程 (As-is) 转换为并行设计过程 (To-be)。本文应用该法将发动机外部管路系统敷设过程模型进行了串并行转换，应用并行工程过程重构的手段，得到了管路敷设并行过程模型。"
}

instruction = test_texts['instruction']
input_value = test_texts['input']

messages = [
    {"role": "system", "content": f"{instruction}"},
    {"role": "user", "content": f"{input_value}"}
]

response = predict(messages, model, tokenizer)
print(response)
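
The Trainer writes checkpoints into subdirectories named checkpoint-<step> under output_dir. Rather than typing the directory name by hand, a helper along these lines could pick the most recent one. latest_checkpoint is a hypothetical name, and the sketch assumes that naming pattern.

```python
import os
import re


def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-<step> directory, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    if not os.path.isdir(output_dir):
        return None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        # Compare step numbers as integers so that 200 ranks above 90.
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

The result of latest_checkpoint("./output/Qwen2") can then be passed as model_id to PeftModel.from_pretrained, after checking that it is not None.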

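9. Common Issues and Optimization Tips

Out of GPU memory: if the current batch size does not fit, reduce per_device_train_batch_size and, to keep the effective batch size unchanged, increase gradient_accumulation_steps accordingly. Enabling gradient_checkpointing also lowers memory use significantly.
Training speed: multi-GPU training is an option; adjust device_map or launch the script with accelerate's multi-GPU launcher.
LoRA parameters: a larger r means more trainable parameters, which may improve results but uses more GPU memory. r=8 or r=16 is usually a good starting point.
Learning rate: avoid large learning rates when fine-tuning large models; values between 5e-5 and 1e-4 are typical starting points.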
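
The effect of r on the number of trainable parameters can be estimated directly. For a linear layer with weight shape d_out × d_in, LoRA adds two matrices of shapes r × d_in and d_out × r, which is r·(d_in + d_out) parameters, and scales the update by lora_alpha / r. The layer size below is a placeholder chosen for illustration, not Qwen2's actual dimensions.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A (standard LoRA scaling)."""
    return lora_alpha / r


# Placeholder layer size for illustration only.
d_in, d_out = 1024, 1024
for r in (8, 16):
    print(r, lora_params(d_in, d_out, r), lora_scaling(32, r))
```

Doubling r doubles the added parameters per layer and, with lora_alpha held fixed, halves the scaling factor, which is one reason the two values are often tuned together.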

With the steps above, you can complete instruction fine-tuning of Qwen2 and apply the result to real-world scenarios.