Hands-On with QLoRA: Efficient Fine-Tuning in Practice
A previous article covered the technical principles of QLoRA. The core idea of the technique is to fine-tune a model quantized to 4 bits without sacrificing performance relative to full 16-bit fine-tuning. Talk is cheap, though, so let's put it into practice.
Environment Setup
The base environment is configured as follows:
- Operating system: CentOS 7
- CPUs: a single node with 1 TB of RAM and Intel CPUs (64 physical CPUs, 16 cores each)
- GPUs: 8x A800 80 GB
- Python: 3.10 (upgrade OpenSSL to 1.1.1t first, then build and install Python from source)
- NVIDIA driver: 515.65.01 (choose the driver that matches your GPU model)
- CUDA toolkit: 11.7
- NCCL: nccl_2.14.3-1+cuda11.7
- cuDNN: 8.8.1.3_cuda11
Installation of the NVIDIA driver, CUDA, Python, and the other tools listed above is not covered here.
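Before continuing, an optional sanity check that the driver and CUDA toolkit are visible; the exact output will vary by machine:
nvidia-smi        # should report driver 515.65.01 and CUDA 11.7
nvcc --version    # should report the CUDA 11.7 toolkit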
Create and activate a virtual environment (qlora-venv-py310-cu117):
cd /workspace/virtual-venv
virtualenv -p /usr/bin/python3.10 qlora-venv-py310-cu117
source /workspace/virtual-venv/qlora-venv-py310-cu117/bin/activate
Install the transformers, accelerate, and peft libraries from source, pinned to specific commits:
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout 8f093fb
pip install .
git clone https://github.com/huggingface/accelerate.git
cd accelerate/
git checkout 665d518
pip install .
git clone https://github.com/huggingface/peft.git
cd peft
git checkout 189a6b8
pip install .
Install the remaining dependencies:
pip install -r requirements.txt
where requirements.txt contains:
bitsandbytes==0.39.0
einops==0.6.1
evaluate==0.4.0
scikit-learn==1.2.2
sentencepiece==0.1.99
tensorboardX
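After installation, it is worth confirming that bitsandbytes picks up the CUDA 11.7 runtime; this is the same diagnostic referenced in the training log later in this article:
python -m bitsandbytes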
Dataset Preparation
For the dataset, simply use alpaca_data.json, alpaca_data_cleaned_archive.json, or alpaca_data_gpt4.json from the alpaca-lora project.
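Each record in these files is an instruction/input/output triple. For reference, an entry looks roughly like this (the text below is illustrative):
{
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet ..."
}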
Converting the Model Weight Format
First, convert the original LLaMA model weights into the Hugging Face Transformers format. For the detailed conversion steps, see the earlier article: Reproducing Stanford Alpaca 7B from 0 to 1.
This article uses the LLaMA 7B and 65B models, both of which need to be converted in advance.
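For reference, the conversion is typically done with the script that ships with transformers; the input directory below is a placeholder for wherever the original LLaMA weights are stored:
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /data/pretrain/llama-original \
    --model_size 7B \
    --output_dir /data/pretrain/hf-llama-model/llama-7b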
Model Fine-Tuning
git clone https://github.com/artidoro/qlora.git
cd qlora
git checkout cc48811
python qlora.py \
--dataset "/data/alpaca_data_cleaned.json" \
--model_name_or_path "/data/pretrain/hf-llama-model/llama-7b" \
--output_dir "/workspace/output/llama-7b-qlora" \
--per_device_train_batch_size 1 \
--max_steps 1000 \
--save_total_limit 2
By default, the different layers of the model are placed on different GPUs, so the model runs with (naive) model parallelism.
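The rough idea is sketched below; this is a minimal illustration, not the exact code inside qlora.py. The base model is loaded as a 4-bit NF4 quantized model, and device_map="auto" lets Accelerate shard its layers across all visible GPUs; the max_memory cap is an optional, made-up example value.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization with double quantization, as described in the QLoRA paper.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "/data/pretrain/hf-llama-model/llama-7b",
    quantization_config=bnb_config,
    device_map="auto",  # shard layers across all visible GPUs
    max_memory={i: "20GiB" for i in range(torch.cuda.device_count())},  # optional per-GPU cap (example)
)
print(model.hf_device_map)  # shows which layer ended up on which GPU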
A sample of the training log:
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /workspace/virtual-venv/qlora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
/home/guodong.li/virtual-venv/qlora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/opt/rh/devtoolset-9/root/usr/lib/dyninst'), PosixPath('/opt/rh/devtoolset-7/root/usr/lib/dyninst')}
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /workspace/virtual-venv/qlora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Found a previous checkpoint at: /workspace/output/llama-7b-qlora/checkpoint-250
loading base model /data/pretrain/hf-llama-model/llama-7b...
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████| 33/33 [00:17<00:00, 1.93it/s]
Loading adapters from checkpoint.
trainable params: 79953920.0 || all params: 3660320768 || trainable: 2.184341894267557
loaded model
Adding special tokens.
Found cached dataset json (/home/.cache/huggingface/datasets/json/default-3c2be6958ca766f9/0.0.0)
Loading cached split indices for dataset at /home/.cache/huggingface/datasets/json/default-3c2be6958ca766f9/0.0.0/cache-d071c407d7bc0de0.arrow and /home/.cache/huggingface/datasets/json/default-3c2be6958ca766f9/0.0.0/cache-e716a74b2c29e789.arrow
Loading cached processed dataset at /home/.cache/huggingface/datasets/json/default-3c2be6958ca766f9/0.0.0/cache-01d5099f3f094d7.arrow
torch.float32 422326272 0.11537932153507864
torch.uint8 3238002688 0.8846206784649213
{'loss': 1.4282, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 1.469, 'learning_rate': 0.0002, 'epoch': 0.01}
...
{'loss': 1.4002, 'learning_rate': 0.0002, 'epoch': 0.08}
{'loss': 1.4261, 'learning_rate': 0.0002, 'epoch': 0.08}
{'loss': 2.4323, 'learning_rate': 0.0002, 'epoch': 0.09}
25%|██████████████████████▎ | 250/1000 [25:34<1:10:31, 5.64s/it]Saving PEFT checkpoint...
{'loss': 1.6007, 'learning_rate': 0.0002, 'epoch': 0.09}
{'loss': 1.6187, 'learning_rate': 0.0002, 'epoch': 0.09}
...
{'loss': 1.6242, 'learning_rate': 0.0002, 'epoch': 0.16}
{'loss': 1.6073, 'learning_rate': 0.0002, 'epoch': 0.16}
{'loss': 1.6825, 'learning_rate': 0.0002, 'epoch': 0.17}
{'loss': 2.6283, 'learning_rate': 0.0002, 'epoch': 0.17}
50%|█████████████████████████████████████████████▌ | 500/1000 [50:44<49:21, 5.92s/it]Saving PEFT checkpoint...
{'loss': 1.619, 'learning_rate': 0.0002, 'epoch': 0.17}
{'loss': 1.5394, 'learning_rate': 0.0002, 'epoch': 0.18}
...
{'loss': 1.5247, 'learning_rate': 0.0002, 'epoch': 0.25}
{'loss': 1.6054, 'learning_rate': 0.0002, 'epoch': 0.25}
{'loss': 2.3289, 'learning_rate': 0.0002, 'epoch': 0.26}
75%|██████████████████████████████████████████████████████████████████▊ | 750/1000 [1:15:27<23:37, 5.67s/it]Saving PEFT checkpoint...
{'loss': 1.6001, 'learning_rate': 0.0002, 'epoch': 0.26}
...
{'loss': 1.6287, 'learning_rate': 0.0002, 'epoch': 0.34}
{'loss': 2.3511, 'learning_rate': 0.0002, 'epoch': 0.34}
100%|████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [1:42:08<00:00, 7.34s/it]Saving PEFT checkpoint...
{'train_runtime': 6132.3668, 'train_samples_per_second': 2.609, 'train_steps_per_second': 0.163, 'train_loss': 1.7447978076934814, 'epoch': 0.34}
100%|████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [1:42:12<00:00, 6.13s/it]
Saving PEFT checkpoint...
***** train metrics *****
epoch = 0.34
train_loss = 1.7448
train_runtime = 1:42:12.36
train_samples_per_second = 2.609
train_steps_per_second = 0.163
The structure of the output weight directory:
llama-7b-qlora
├── all_results.json
├── checkpoint-1000
│ ├── adapter_config.json
│ ├── adapter_model
│ │ ├── adapter_config.json
│ │ ├── adapter_model.bin
│ │ └── README.md
│ ├── adapter_model.bin
│ ├── added_tokens.json
│ ├── optimizer.pt
│ ├── README.md
│ ├── rng_state.pth
│ ├── scheduler.pt
│ ├── special_tokens_map.json
│ ├── tokenizer_config.json
│ ├── tokenizer.model
│ ├── trainer_state.json
│ └── training_args.bin
├── checkpoint-750
├── completed
├── metrics.json
├── trainer_state.json
└── train_results.json
GPU memory usage:
+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                 GPU Memory  |
|        ID   ID                                                  Usage       |
|=============================================================================|
|    0   N/A  N/A     37939      C   python                          2513MiB  |
|    1   N/A  N/A     37939      C   python                          2819MiB  |
|    2   N/A  N/A     37939      C   python                          2819MiB  |
|  ...                                                                         |
+-----------------------------------------------------------------------------+
Merging the Model Weights
Create a weight-merging script (export_hf_checkpoint.py) that merges the LoRA weights back into the base weights.
import os

import torch
import transformers
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

BASE_MODEL = os.environ.get("BASE_MODEL", None)
LORA_MODEL = os.environ.get("LORA_MODEL", "tloen/alpaca-lora-7b")
HF_CHECKPOINT = os.environ.get("HF_CHECKPOINT", "./hf_ckpt")

assert (
    BASE_MODEL
), "Please specify a value for BASE_MODEL environment variable, e.g. `export BASE_MODEL=decapoda-research/llama-7b-hf`"

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)

# Load the base model on CPU in bf16 so the merge does not require GPU memory.
base_model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,
    device_map={"": "cpu"},
)

first_weight = base_model.model.layers[0].self_attn.q_proj.weight
first_weight_old = first_weight.clone()

# Attach the LoRA adapters to the base model.
lora_model = PeftModel.from_pretrained(
    base_model,
    LORA_MODEL,
)

lora_weight = lora_model.base_model.model.model.layers[
    0
].self_attn.q_proj.weight

# Loading the adapters alone must not modify the base weights.
assert torch.allclose(first_weight_old, first_weight)

# Merge the LoRA weights into the base weights layer by layer.
for layer in lora_model.base_model.model.model.layers:
    layer.self_attn.q_proj.merge_weights = True
    layer.self_attn.v_proj.merge_weights = True

lora_model.train(False)

# Drop the LoRA-specific entries and strip the PEFT prefix from the state dict.
lora_model_sd = lora_model.state_dict()
deloreanized_sd = {
    k.replace("base_model.model.", ""): v
    for k, v in lora_model_sd.items()
    if "lora" not in k
}

# Save the merged weights as a regular Hugging Face checkpoint.
LlamaForCausalLM.save_pretrained(
    base_model, HF_CHECKPOINT, state_dict=deloreanized_sd, max_shard_size="400MB"
)
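The script reads its inputs from environment variables, so for this setup it can be invoked as follows (the merged-output directory matches the path used by the inference script below):
BASE_MODEL=/data/pretrain/hf-llama-model/llama-7b \
LORA_MODEL=/workspace/output/llama-7b-qlora/checkpoint-1000/adapter_model \
HF_CHECKPOINT=/workspace/output/llama-7b-merge \
python export_hf_checkpoint.py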
The merged weight files can then be used for inference.
Model Inference
Create an inference script (inference.py):
from transformers import AutoModelForCausalLM, LlamaTokenizer
import torch

model_id = "/data/pretrain/hf-llama-model/llama-7b"
merge_model_id = "/workspace/output/llama-7b-merge"

# Load the merged model in 4-bit; device_map="auto" places it on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(merge_model_id, load_in_4bit=True, device_map="auto")

tokenizer = LlamaTokenizer.from_pretrained(model_id)

device = torch.device("cuda:0")

text = "Hello, my name is "
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=30, top_p=0.85)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Simple interactive loop: read a prompt, generate, and repeat until an empty line is entered.
print("\n------------------------------------------------\nInput: ")
line = input()
while line:
    inputs = tokenizer(line, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=30, top_p=0.85)
    print("Output: ", tokenizer.decode(outputs[0], skip_special_tokens=True))
    print("\n------------------------------------------------\nInput: ")
    line = input()
The run and its GPU memory footprint show that inference with the merged weights uses roughly 5.8 GB of GPU memory.
Alternatively, you can skip the merge and run inference directly on the base model plus adapters, as shown below.
Create another inference script (inference_qlora.py):
from transformers import AutoModelForCausalLM, LlamaTokenizer
import torch
from peft import PeftModel

model_id = "/data/pretrain/hf-llama-model/llama-7b"
lora_weights = "/workspace/output/llama-7b-qlora/checkpoint-1000/adapter_model"

# Load the base model in 4-bit, then attach the LoRA adapters on top of it.
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")
model = PeftModel.from_pretrained(
    model,
    lora_weights,
)

tokenizer = LlamaTokenizer.from_pretrained(model_id)

device = torch.device("cuda:0")

text = "Hello, my name is "
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=30, top_p=0.85)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Same interactive loop as in inference.py.
print("\n------------------------------------------------\nInput: ")
line = input()
while line:
    inputs = tokenizer(line, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=30, top_p=0.85)
    print("Output: ", tokenizer.decode(outputs[0], skip_special_tokens=True))
    print("\n------------------------------------------------\nInput: ")
    line = input()
As you can see, inference this way uses more GPU memory than inference with the merged weights.
Of course, the LoRA weights can also be merged back into the base model weights via the merge_and_unload() method.
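A minimal sketch of that approach is shown below, assuming the base model is loaded in bf16 on the CPU (merging directly into a 4-bit quantized model is generally not supported); the output path reuses the example merge directory from above:
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM

# Load the base model in bf16 on CPU, attach the adapters, then fold them into the base weights.
base_model = LlamaForCausalLM.from_pretrained(
    "/data/pretrain/hf-llama-model/llama-7b",
    torch_dtype=torch.bfloat16,
    device_map={"": "cpu"},
)
model = PeftModel.from_pretrained(
    base_model,
    "/workspace/output/llama-7b-qlora/checkpoint-1000/adapter_model",
)
merged_model = model.merge_and_unload()  # returns the base model with the LoRA weights merged in
merged_model.save_pretrained("/workspace/output/llama-7b-merge", max_shard_size="400MB")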
So far we have only experimented with the 7B model. How does LLaMA-65B fare in terms of GPU memory, and is 48 GB really enough, as the authors claim? With that question in mind, let's fine-tune LLaMA-65B with QLoRA.
Fine-Tuning LLaMA-65B
The single-GPU run:
CUDA_VISIBLE_DEVICES=0 python qlora.py \
--model_name_or_path /data/pretrain/hf-llama-model/llama-65b \
--dataset /data/alpaca_data_cleaned.json \
--output_dir /workspace/output/llama-65b-qlora \
--logging_steps 10 \
--save_strategy steps \
--data_seed 42 \
--save_steps 100 \
--save_total_limit 2 \
--evaluation_strategy steps \
--eval_dataset_size 128 \
--max_eval_samples 200 \
--per_device_eval_batch_size 1 \
--max_new_tokens 32 \
--dataloader_num_workers 3 \
--group_by_length \
--logging_strategy steps \
--remove_unused_columns False \
--do_train \
--do_eval \
--do_mmlu_eval \
--lora_r 64 \
--lora_alpha 16 \
--lora_modules all \
--double_quant \
--quant_type nf4 \
--bf16 \
--bits 4 \
--warmup_ratio 0.03 \
--lr_scheduler_type constant \
--gradient_checkpointing \
--source_max_len 16 \
--target_max_len 512 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--max_steps 200 \
--eval_steps 50 \
--learning_rate 0.0001 \
--adam_beta2 0.999 \
--max_grad_norm 0.3 \
--lora_dropout 0.05 \
--weight_decay 0.0 \
--seed 0 \
--report_to tensorboard
The training log shows that trainable parameters make up roughly 1.18% of all parameters, and GPU memory usage stays below 48 GB.
With multiple GPUs, fine-tuning on 8 cards keeps per-card GPU memory usage under 10 GB, which brings fine-tuning models with tens of billions of parameters within reach of many consumer GPUs.
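To verify the per-GPU usage during training, you can poll nvidia-smi with a standard query (shown here for reference; it prints the used memory of each card every 5 seconds):
nvidia-smi --query-gpu=index,memory.used --format=csv -l 5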
Conclusion
This article walked through fine-tuning LLaMA models with the memory-efficient QLoRA technique and showed how to run inference with the result, both with and without merging the adapter weights. The experiments confirm QLoRA's significant advantage in reducing GPU memory usage and offer a practical recipe for fine-tuning large language models.