Data Distribution Problems in Synthetic Training Samples for Large Language Models (LLMs)
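When training large language models (LLMs), the quality and distribution of the data have a decisive effect on model performance. In a recent experiment on an LLM's ability to count the letters in a string, we found that the strategy used to generate the synthetic dataset directly affected how well the model generalized. This article uses that concrete case to analyze distribution bias in synthetic training samples and how to address it.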
1. Experimental Background and Initial Approach
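To test how well the model understands a simple counting task, we built a synthetic dataset from English words. Given a string, the model must output the total number of letters it contains (spaces are not counted). Tokenization is based on a predefined list of common English words.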
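To make the labelling rule concrete, here is a minimal sketch of how a prompt and its ground-truth answer could be built. The prompt wording is copied from the sample lines shown in this article; the names `PROMPT_TEMPLATE`, `count_letters` and `make_sample` are hypothetical and are not taken from the original training code.

```python
# Hypothetical helpers; only the prompt wording comes from the article's samples.
PROMPT_TEMPLATE = 'how many letters are there in the following string: "{text}"?'


def count_letters(text: str) -> int:
    """Count alphabetic characters; spaces (and any other non-letters) are ignored."""
    return sum(1 for ch in text if ch.isalpha())


def make_sample(words: list[str]) -> tuple[str, int]:
    """Join the words with single spaces and return (prompt, expected answer)."""
    text = " ".join(words)
    return PROMPT_TEMPLATE.format(text=text), count_letters(text)


prompt, answer = make_sample(["spread", "high"])
print(prompt, answer)
# how many letters are there in the following string: "spread high"? 10
```

Computing the label from the joined string, rather than from the word list, keeps the answer tied to exactly what the model sees.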
In the initial version, the code that synthesized the random strings was as follows:
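import random

# self.words is a list of common English words; its length is 3432
if random.random() < 0.1:
    # 10% of samples: a short string of 1 to 9 words
    ss = random.choices(self.words, k=random.randint(1, 9))
else:
    # remaining 90% of samples: 1 to 99 words
    ss = random.choices(self.words, k=random.randint(1, 99))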
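This logic uses uniform random sampling: words are drawn from the vocabulary with equal probability and with replacement, and 1 to 99 of them are joined into a sentence. In 10% of cases the length is restricted to 1 to 9 words. Examples of the generated samples: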
how many letters are there in the following string: "spread high"? 10
how many letters are there in the following string: "european contradictory"? 21
how many letters are there in the following string: "lock over constitution smart boil superior patient teenager graduation drop speaker pronounce contribution boring step carpet realize format surprise disappoint promote track thick rank affect nurse preparation armchair data warn pint construction tale organization tank wear understand vast tremble"? 261
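The length distribution implied by the sampling code above can be worked out exactly. The sketch below assumes the two branches behave exactly as written (`random.randint` is inclusive at both ends); the function name `word_count_pmf` and its parameters are hypothetical.

```python
def word_count_pmf(k: int, p_short: float = 0.1,
                   short_max: int = 9, long_max: int = 99) -> float:
    """Probability that one generated sample contains exactly k words."""
    p = 0.0
    if 1 <= k <= short_max:
        p += p_short / short_max        # 10% branch: randint(1, 9)
    if 1 <= k <= long_max:
        p += (1 - p_short) / long_max   # 90% branch: randint(1, 99)
    return p


pmf = {k: word_count_pmf(k) for k in range(1, 100)}
p_at_most_9 = sum(pmf[k] for k in range(1, 10))
mean_words = sum(k * p for k, p in pmf.items())

print(f"P(k = 2)  = {pmf[2]:.4f}")
print(f"P(k <= 9) = {p_at_most_9:.4f}")
print(f"E[k]      = {mean_words:.1f}")
```

Under these assumptions roughly 18% of samples have nine words or fewer, each individual short length accounts for about 2% of samples, and the average sample has about 45.5 words, so the training data is dominated by long strings.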
After about 12 hours of training on a single GPU, the model reached 99.937% accuracy on the test set. The result looked ideal, but manual edge-case testing revealed clear defects.
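2. Symptom: Failures Caused by Distribution Shift
Despite the high overall accuracy, the model performed very poorly on certain input patterns. These patterns are in fact simpler, or more common in real use, than the training samples: for example, repeated short words and high-frequency short words.
Examples of incorrect predictions: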
Input: how many letters are there in the following string: "a a"?
Output: 4 (Expected: 2)
how many letters are there the following : ?
(Expected: )
how many letters are there the following : ?
(Expected: )
how many letters are there the following : ( times)?
(Expected: )
how many letters are there the following : ?
(Expected: )
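One possible way to see why an input such as "a a" falls outside the training distribution is to estimate how often the sampler would produce it. The sketch below is a back-of-the-envelope calculation under two assumptions that the article does not state: that the word "a" is in `self.words`, and that all training samples come from the sampler shown in section 1. The constant and function names are hypothetical.

```python
VOCAB_SIZE = 3432  # vocabulary length given in the article


def p_two_words(p_short: float = 0.1) -> float:
    """Probability that a sample has exactly two words (both branches can yield k = 2)."""
    return p_short / 9 + (1 - p_short) / 99


def p_specific_pair(vocab_size: int = VOCAB_SIZE) -> float:
    """Probability of one specific two-word sample such as "a a" (assumes "a" is in the vocabulary)."""
    return p_two_words() / vocab_size ** 2


def p_any_repeated_pair(vocab_size: int = VOCAB_SIZE) -> float:
    """Probability that a sample is any word repeated exactly twice."""
    return p_two_words() / vocab_size


print(f"P(two words)         = {p_two_words():.4f}")
print(f"P(specific pair)     = {p_specific_pair():.3e}")
print(f"P(any repeated pair) = {p_any_repeated_pair():.3e}")
```

Under these assumptions a specific pair like "a a" has a probability on the order of 10^-9 per sample, and any doubled word on the order of 10^-6, so such inputs may be nearly absent from both the training set and a test set drawn the same way.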


