A Guide to Prompt Design and Dataset Construction for LLM Instruction Fine-Tuning | 极客日志


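AI-generated summary: This article examines prompt design and dataset construction strategies for instruction fine-tuning of large models. It covers how to select instruction fine-tuning data for quality, the need for diversity, and the use of parameter-efficient methods such as LoRA. It compares the structure and suitable use cases of several mainstream instruction templates, including Stanford Alpaca, Llama2, and NousResearch. It focuses on the loss-masking mechanism for multi-turn dialogue, with code examples that show how masking the loss on the input portion improves training efficiency. It also summarizes best practices for efficient fine-tuning, including dataset preparation, hyperparameter tuning, the importance of template consistency, and remedies for common problems such as catastrophic forgetting.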

晚风告白 · Published 2025/2/7 · Updated 2026/6/3 · 27 views


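1. Instruction Fine-Tuning Dataset Formats and Quality Strategy

When fine-tuning a large language model (LLM), prompt design has a major effect on both training results and inference behavior. Many developers find at inference time that feeding input directly, without the specific prompt format, noticeably degrades model performance. This raises a core question: if no prompt was included during training, can input be fed directly at test time? How multi-turn and single-turn dialogues are constructed also directly affects the capabilities of the final model.

The many instruction fine-tuning data formats now in circulation make it hard to choose one. To address this, we offer the following core recommendations: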

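Quality first: The instruction fine-tuning dataset for a single fine-tuning run should be high-quality and highly diverse. Low-quality, noisy data seriously interferes with model convergence.
Resource utilization: When training resources are plentiful, add more datasets and longer samples to strengthen the model's generalization.
Unified format: Draw on several high-quality data sources to build one diverse dataset in a unified format for SFT (Supervised Fine-Tuning). A single fine-tuning pass is usually better than several successive passes, which can cause catastrophic forgetting or diminished results.
Incremental fine-tuning: If multiple rounds of fine-tuning are unavoidable, use a parameter-efficient method such as LoRA or QLoRA. Merging the trained LoRA weights into the original base model can reduce the harm that repeated fine-tuning does to the model's existing capabilities.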
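The "unified format" recommendation can be sketched as a small conversion step. The sketch below assumes two hypothetical source layouts, an Alpaca-style record (`instruction` / `input` / `output`) and a ShareGPT-style record (`conversations` with `from` / `value`), and maps both onto a single `conversation` schema with `human` / `assistant` keys. The function names, the target schema, and the exact-duplicate filter are illustrative assumptions, not a prescribed pipeline.

```python
import hashlib
import json


def from_alpaca(rec):
    """Convert an Alpaca-style record (instruction/input/output) to the unified schema."""
    user = rec["instruction"].strip()
    if rec.get("input", "").strip():
        user += "\n\n" + rec["input"].strip()
    return {"system": "", "conversation": [{"human": user, "assistant": rec["output"].strip()}]}


def from_sharegpt(rec):
    """Convert a ShareGPT-style record (alternating human/gpt turns) to the unified schema."""
    turns = rec["conversations"]
    pairs = []
    for human, gpt in zip(turns[0::2], turns[1::2]):
        if human["from"] != "human" or gpt["from"] != "gpt":
            raise ValueError("expected alternating human/gpt turns")
        pairs.append({"human": human["value"].strip(), "assistant": gpt["value"].strip()})
    return {"system": rec.get("instruction", ""), "conversation": pairs}


def dedup_and_filter(records, min_answer_chars=1):
    """Drop exact duplicates and records containing an answer shorter than min_answer_chars."""
    seen, kept = set(), []
    for rec in records:
        if any(len(turn["assistant"]) < min_answer_chars for turn in rec["conversation"]):
            continue
        payload = json.dumps(rec, sort_keys=True, ensure_ascii=False).encode("utf-8")
        key = hashlib.sha256(payload).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


alpaca = [
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"},
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"},  # duplicate
    {"instruction": "Say nothing.", "input": "", "output": ""},  # empty answer
]
sharegpt = [{"conversations": [{"from": "human", "value": "Hi"}, {"from": "gpt", "value": "Hello!"}]}]

unified = dedup_and_filter([from_alpaca(r) for r in alpaca] + [from_sharegpt(r) for r in sharegpt])
print(len(unified))  # 2: one Alpaca record and one ShareGPT record survive
```

Exact-match deduplication is only the simplest possible quality filter. Real pipelines usually add near-duplicate detection and content checks on top of it.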

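2. Analysis of Common Instruction Fine-Tuning Templates

Looking at the instruction fine-tuning datasets behind top-ranked models on the Hugging Face leaderboard and behind mainstream open-source projects, we can identify several common prompt template structures. Each fine-tuned model generally expects the template it was trained with, and mixing templates leads to poor generation quality.

2.1 Stanford Alpaca Template

This is one of the classic instruction fine-tuning templates and suits most basic instruction-following tasks.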

PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}
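As a usage sketch, the two entries can be selected per record depending on whether the `input` field is empty. The helper name `build_example` and the default `eos_token="</s>"` are assumptions made for illustration. In practice the end-of-sequence string should come from the tokenizer in use.

```python
PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}


def build_example(record, eos_token="</s>"):
    """Return (source, target) strings for one Alpaca-style record."""
    has_input = bool(record.get("input", "").strip())
    template = PROMPT_DICT["prompt_input" if has_input else "prompt_no_input"]
    source = template.format_map(record)          # the prompt the model conditions on
    target = f"{record['output']}{eos_token}"     # the text the model should learn to produce
    return source, target


source, target = build_example(
    {"instruction": "Add the numbers.", "input": "1 + 1", "output": "2"}
)
print(source)
print(target)
```

Keeping `source` and `target` separate makes it straightforward to mask the loss on the `source` tokens later.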

2.2 Llama2 Template

Llama2's chat format adds a system prompt, wrapped in <<SYS>> tags, which is used to set safety constraints and the assistant's role.

instruction = (
    "[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\n"
    "If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n{} [/INST]"
)
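For multi-turn use, the same markers are repeated for every turn, with the system block attached only to the first user message. The following is a minimal sketch of that layout. The function name `build_llama2_prompt` and its arguments are hypothetical, and the `<s>` / `</s>` markers are written as literal text purely for readability.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
BOS, EOS = "<s>", "</s>"


def build_llama2_prompt(system, history, user_message):
    """Render a Llama2-style chat prompt.

    history is a list of (user, assistant) pairs from earlier turns;
    user_message is the new message awaiting a reply.
    """
    turns = list(history) + [(user_message, None)]
    parts = []
    for i, (user, assistant) in enumerate(turns):
        if i == 0 and system:
            # The system prompt is folded into the first user message.
            user = B_SYS + system + E_SYS + user
        segment = f"{BOS}{B_INST} {user.strip()} {E_INST}"
        if assistant is not None:
            # Completed turns end with the assistant reply and an end-of-sequence marker.
            segment += f" {assistant.strip()} {EOS}"
        parts.append(segment)
    return "".join(parts)


print(build_llama2_prompt("Be brief.", [("Hi", "Hello!")], "Who are you?"))
```

When the prompt is tokenized with a tokenizer that inserts special tokens itself, the literal `<s>` and `</s>` strings should be dropped and the corresponding token ids added instead, otherwise they may be duplicated or encoded as plain text.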


2.3 Linly-AI Template

### Instruction:{prompt.strip()}  ### Response:

2.4 NousResearch (OpenLLM Leaderboard Top)

Without additional input:

### Instruction:
<prompt>

### Response:
<leave a newline blank for model to respond>

With additional input:

### Instruction:
<prompt>

### Input:
<additional context>

### Response:
<leave a newline blank for model to respond>

2.5 Yayi Template

prompt = "你是谁？"  # "Who are you?"
formatted_prompt = f"""<|System|>:
You are a helpful, respectful and honest assistant named YaYi developed by Beijing Wenge Technology Co.,Ltd. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.

<|Human|>:
{prompt}

<|YaYi|>:
"""

2.6 StableBeluga2 Template

### System:
This is a system prompt, please behave and help the user.

### User:
Your prompt here

### Assistant:
The output of Stable Beluga 2

system_prompt = "### System:\nYou are Stable Beluga, an AI that follows instructions extremely well. Help as much as can. Remember, be safe, and don't do anything illegal.\n\n"
message = "Write me a poem please"
prompt = f"{system_prompt}### User: {message}\n\n### Assistant:\n"

2.7 Common Template for the Guanaco Dataset

### Human: {prompt}
### Assistant:

prompt = "Introduce yourself"
# Adjacent string literals are concatenated as-is, so the first sentence
# needs its own trailing space.
formatted_prompt = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
    f"### Human: {prompt} ### Assistant:"
)
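Because mixing templates degrades generation, one way to keep training and inference consistent is to route both through a single rendering function. The sketch below registers simplified versions of several templates from this section. The registry name `TEMPLATES`, the keys, and the `render` helper are illustrative assumptions, and the `alpaca` entry omits the introductory sentence for brevity.

```python
TEMPLATES = {
    "linly": "### Instruction:{prompt}  ### Response:",
    "alpaca": "### Instruction:\n{prompt}\n\n### Response:\n",
    "stable_beluga": "### System:\n{system}\n\n### User: {prompt}\n\n### Assistant:\n",
    "guanaco": "### Human: {prompt}\n### Assistant:",
}


def render(template_name, prompt, system=""):
    """Format a prompt with the named template; fail loudly on an unknown name."""
    try:
        template = TEMPLATES[template_name]
    except KeyError:
        raise ValueError(f"unknown template: {template_name!r}") from None
    # Templates without a {system} slot simply ignore the extra argument.
    return template.format(prompt=prompt.strip(), system=system.strip())


# The same call is used when building training samples and when serving requests.
print(render("guanaco", "Introduce yourself"))
print(render("stable_beluga", "Write me a poem please", system="You are Stable Beluga."))
```

Storing the template name alongside the checkpoint (for example in its config) is one simple way to make sure the inference side picks the same entry.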

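3. Constructing Multi-Turn Dialogue Inputs and Outputs

# https://github.com/LinkSoul-AI/Chinese-Llama-2-7b/blob/main/train.py
# The constants and helpers below are not part of the excerpt; they are assumed
# definitions added so that the function can run on its own.
IGNORE_TOKEN_ID = -100  # assumed: the label value that the loss function skips
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
dummy_message = {  # assumed fallback sample, used when nothing trainable is left
    "system": "You are a helpful assistant.",
    "conversations": [
        {"from": "human", "value": "Who are you?"},
        {"from": "gpt", "value": "I am an AI assistant."},
    ],
}

def last_index(lst, value):
    # Index of the last element that differs from `value`, or -1 if there is none.
    return next((len(lst) - 1 - i for i, x in enumerate(reversed(lst)) if x != value), -1)

def safe_ids(ids, max_value, pad_id):
    # Replace any id outside the vocabulary with `pad_id`.
    return [i if i < max_value else pad_id for i in ids]

def tokenize(item, tokenizer):
    input_ids = []
    labels = []
    if "instruction" in item and len(item["instruction"]) > 0:
        system = item["instruction"]
    else:
        system = dummy_message["system"]
    system = B_SYS + system + E_SYS
    # Prepend the system prompt to the first turn. The turns are copied so that
    # the caller's data (and dummy_message) is not modified in place.
    turns = [dict(turn) for turn in item["conversations"]]
    turns[0]["value"] = system + turns[0]["value"]

    for turn in turns:
        role = turn["from"]
        content = turn["value"].strip()
        if role == "human":
            # User turns are part of the input but are masked out of the loss.
            content = f"{B_INST} {content} {E_INST} "
            content_ids = tokenizer.encode(content)
            labels += [IGNORE_TOKEN_ID] * len(content_ids)
        else:
            # Assistant turns (role "gpt") are the training targets, ending with EOS.
            content = f"{content} "
            content_ids = tokenizer.encode(content, add_special_tokens=False) + [tokenizer.eos_token_id]
            labels += content_ids
        input_ids += content_ids

    input_ids = input_ids[:tokenizer.model_max_length]
    labels = labels[:tokenizer.model_max_length]

    # Drop a trailing run of masked tokens, which would contribute nothing to the loss.
    trunc_id = last_index(labels, IGNORE_TOKEN_ID) + 1
    input_ids = input_ids[:trunc_id]
    labels = labels[:trunc_id]
    if len(labels) == 0:
        # Nothing left to train on: fall back to the dummy sample.
        return tokenize(dummy_message, tokenizer)
    input_ids = safe_ids(input_ids, tokenizer.vocab_size, tokenizer.pad_token_id)
    labels = safe_ids(labels, tokenizer.vocab_size, IGNORE_TOKEN_ID)
    return input_ids, labels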
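To see why filling the user-turn labels with `IGNORE_TOKEN_ID` restricts training to the assistant replies, it helps to write the masked loss out by hand. The sketch below is a pure-Python illustration, not the training code of the repository above. It assumes the sentinel value is -100 (the default `ignore_index` of PyTorch's `CrossEntropyLoss`) and uses a hypothetical function name.

```python
import math

IGNORE_TOKEN_ID = -100  # assumed sentinel; matches PyTorch's default ignore_index


def masked_next_token_loss(logits, labels, ignore_id=IGNORE_TOKEN_ID):
    """Average next-token cross-entropy over positions whose label is not ignore_id.

    logits: list of per-position score lists, logits[t][v] is the score of vocab item v
    labels: list of token ids aligned with the inputs (same length as logits)
    """
    total, count = 0.0, 0
    # Position t predicts token t + 1, so logits[:-1] is compared with labels[1:].
    for scores, target in zip(logits[:-1], labels[1:]):
        if target == ignore_id:
            continue  # masked position: contributes neither loss nor gradient
        peak = max(scores)
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_z - scores[target]  # negative log-probability of the target
        count += 1
    return total / count if count else 0.0


# Two prompt tokens (masked) followed by two response tokens (trained on).
labels = [IGNORE_TOKEN_ID, IGNORE_TOKEN_ID, 2, 3]
uniform = [[0.0, 0.0, 0.0, 0.0] for _ in labels]
print(masked_next_token_loss(uniform, labels))  # log(4), averaged over the 2 unmasked targets
```

Only the predictions for the two response tokens enter the average. The prediction made at position 0, whose target is a masked prompt token, can change arbitrarily without affecting the result.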

# https://github.com/yangjianxin1/Firefly/blob/master/component/dataset.py
import json
import logging

from torch.utils.data import Dataset

# The excerpt does not show where `logger` comes from; the standard logging
# module is assumed here.
logger = logging.getLogger(__name__)

class SFTDataset(Dataset):
    def __init__(self, file, tokenizer, max_seq_length):
        self.tokenizer = tokenizer
        self.bos_token_id = tokenizer.bos_token_id
        self.eos_token_id = tokenizer.eos_token_id
        self.eos_token = tokenizer.eos_token
        self.bos_token = tokenizer.bos_token
        self.max_seq_length = max_seq_length
        logger.info('Loading data: {}'.format(file))
        with open(file, 'r', encoding='utf8') as f:
            data_list = f.readlines()
        logger.info("there are {} data in dataset".format(len(data_list)))
        self.data_list = data_list

    def __len__(self):
        return len(self.data_list)

    def __getitem__(self, index):
        # Each sample is laid out as: <s>input1</s>target1</s>input2</s>target2</s>...
        data = self.data_list[index]
        data = json.loads(data)
        conversation = data['conversation']

        # Collect the turns of the multi-turn conversation
        utterances = []
        for x in conversation:
            utterances.append(x['human'])
            utterances.append(x['assistant'])
        utterances_ids = self.tokenizer(utterances, add_special_tokens=False).input_ids

        # Model input format: <s>input1</s>target1</s>input2</s>target2</s>...
        input_ids = [self.bos_token_id]
        target_mask = [0]  # masks the input so that loss is computed only on target tokens
        for i, utterances_id in enumerate(utterances_ids):
            input_ids += (utterances_id + [self.eos_token_id])
            if i % 2 == 0:
                # Even positions are human turns: masked (0), including the trailing </s>
                target_mask += [0] * (len(utterances_id) + 1)
            else:
                # Odd positions are assistant turns: trained on (1), including the trailing </s>
                target_mask += [1] * (len(utterances_id) + 1)
        assert len(input_ids) == len(target_mask)
        # Truncate to the maximum sequence length
        input_ids = input_ids[:self.max_seq_length]
        target_mask = target_mask[:self.max_seq_length]
        attention_mask = [1] * len(input_ids)
        assert len(input_ids) == len(target_mask) == len(attention_mask)
        inputs = {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'target_mask': target_mask
        }
        return inputs
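The dataset above returns a `target_mask` rather than ready-made labels, so a batching step still has to pad the samples and turn the mask into labels. One possible way to do that is sketched below with plain lists. The function name `collate`, the `IGNORE_INDEX` value of -100, and the right-padding strategy are assumptions for illustration, not the collator of the Firefly project.

```python
IGNORE_INDEX = -100  # assumed sentinel; the default ignore_index of PyTorch's CrossEntropyLoss


def collate(batch, pad_token_id, ignore_index=IGNORE_INDEX):
    """Right-pad SFTDataset-style items and derive labels from target_mask."""
    max_len = max(len(item["input_ids"]) for item in batch)
    out = {"input_ids": [], "attention_mask": [], "labels": []}
    for item in batch:
        pad = max_len - len(item["input_ids"])
        # Keep the token id where the mask is 1, hide it from the loss where it is 0.
        labels = [
            tok if mask == 1 else ignore_index
            for tok, mask in zip(item["input_ids"], item["target_mask"])
        ]
        out["input_ids"].append(item["input_ids"] + [pad_token_id] * pad)
        out["attention_mask"].append(item["attention_mask"] + [0] * pad)
        out["labels"].append(labels + [ignore_index] * pad)  # padding is never trained on
    return out


batch = [
    {"input_ids": [1, 5, 6, 2, 7, 2], "attention_mask": [1] * 6, "target_mask": [0, 0, 0, 0, 1, 1]},
    {"input_ids": [1, 8, 2, 9], "attention_mask": [1] * 4, "target_mask": [0, 0, 0, 1]},
]
padded = collate(batch, pad_token_id=0)
print(padded["labels"])
# [[-100, -100, -100, -100, 7, 2], [-100, -100, -100, 9, -100, -100]]
```

The resulting lists can be converted to tensors and passed to a causal language model. Hugging Face causal LM classes shift the labels by one position internally, so no manual shift is needed in that setting.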

