LLaMA-Factory in Practice (Part 2)


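Field	Description
Model name	The list of all base models the tool supports; you have to download the model weights yourself. Note that some names in the list are ambiguous. For example, DeepSeek-R1-7B-Distill only tells you that DeepSeek-R1 served as the teacher model whose reasoning data was learned; it does not say which student model was used.
Model path	The local path where the model is stored.
Checkpoint path	When resuming training from a checkpoint, the local path of that checkpoint. Otherwise leave it empty.
Chat template	Chat templates differ between models. Pick the template of the same model family, or at least of a closely related model; a mismatched template may cause errors.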

Field	Description
Data path	The local directory that holds the training data and the dataset_info.json file.
Dataset	The name of a dataset defined in dataset_info.json.
In dataset_info.json, the key (chat-train) is the dataset name and file_name is the dataset file (train.jsonl):
{
  "chat-train": {
    "file_name": "train.jsonl"
  }
}
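The lookup from dataset name to data file can be sketched in a few lines of Python. This is only an illustration of how the two fields relate, not LLaMA-Factory's own loader: the helper name `resolve_dataset_file` is hypothetical, and the single record written to train.jsonl assumes an Alpaca-style layout (`instruction`, `input`, `output`), which the article does not specify.

```python
import json
import tempfile
from pathlib import Path


def resolve_dataset_file(data_dir: str, dataset_name: str) -> Path:
    """Look up a dataset name in dataset_info.json and return its data file path."""
    root = Path(data_dir)
    info = json.loads((root / "dataset_info.json").read_text(encoding="utf-8"))
    if dataset_name not in info:
        raise KeyError(f"{dataset_name!r} is not defined in dataset_info.json")
    data_file = root / info[dataset_name]["file_name"]
    if not data_file.is_file():
        raise FileNotFoundError(f"missing data file: {data_file}")
    return data_file


# Build a throwaway data directory that mirrors the example above.
demo_dir = tempfile.mkdtemp()
Path(demo_dir, "dataset_info.json").write_text(
    json.dumps({"chat-train": {"file_name": "train.jsonl"}}), encoding="utf-8"
)
# Assumed record layout (Alpaca-style); adapt it to your real data.
Path(demo_dir, "train.jsonl").write_text(
    json.dumps({"instruction": "hi", "input": "", "output": "hello"}) + "\n",
    encoding="utf-8",
)

resolved = resolve_dataset_file(demo_dir, "chat-train")
print(resolved.name)
```

Running a check like this before training catches a misspelled dataset name or a missing file early, instead of partway through a job.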

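Training stage	Description
Pre-Training	The starting stage of a large model: self-supervised training on large unlabeled corpora.
Supervised Fine-Tuning	Supervised training on labeled data.
Reward Modeling	Step one of RLHF (reinforcement learning from human feedback): train a reward model.
PPO	Step two of RLHF: the model adjusts its own policy based on feedback from the reward model.
DPO	Fine-tunes the model directly on human preference data, without a separate reward model.
KTO	Treats positive and negative feedback asymmetrically during training (mistakes are penalized more heavily than correct answers are rewarded).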
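To make the DPO row more concrete, the following is a minimal sketch of the DPO loss for a single preference pair, computed from summed sequence log-probabilities. The numbers are made up for illustration and `beta=0.1` is an assumed value; real training computes these log-probabilities with the policy model and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as log(1 + exp(-x)).
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: the loss is log(2).
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy has shifted probability toward the chosen answer: the loss drops.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(neutral, improved)
```

The loss only depends on preference pairs and the reference model, which is why DPO needs no separately trained reward model.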

Quantization level	Description
none	No quantization
8	8-bit quantization
4	4-bit quantization
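A back-of-the-envelope calculation shows what each level means for weight memory. This is a rough sketch under stated assumptions: a 7-billion-parameter model, 16-bit weights when quantization is set to none, and no accounting for quantization metadata (scales, zero points), activations, or the KV cache.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


NUM_PARAMS = 7e9  # assumed model size, for illustration only

for label, bits in [("none (16-bit weights)", 16), ("8", 8), ("4", 4)]:
    print(f"{label:>22}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Each halving of the bit width halves the weight memory; actual usage is somewhat higher because of the overheads listed above.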

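Method	Ease of use	Compression ratio	Speed	Typical use
bitsandbytes	⭐⭐⭐⭐	⭐⭐	⭐⭐	Quick personal deployment
HQQ	⭐⭐	⭐⭐⭐⭐	⭐	Phones / embedded devices
EETQ	⭐	⭐⭐	⭐⭐⭐⭐	Enterprise-grade high-performance inference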

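Technique	Applicable stage	Optimization focus	Typical users
FlashAttention-2	Training / inference	Attention computation	Researchers, enterprises
Unsloth	Training	Fine-tuning efficiency	Individual developers
Liger Kernel	Training	Multi-GPU training throughput and memory use (Triton kernels)	Teams training on multiple GPUs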

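Parameter	Description
Learning rate	0.1: used only for exploration, not for real training runs.
0.01: a standard initial learning rate when training a model from scratch.
0.001: for finer adjustments once the model is already close to the optimization target.
0.0001: for fine-tuning when the model is close to convergence.
0.00005: for the final adjustments at the end of pre-training.
Epochs	If the patterns to learn are simple, 1 epoch is enough. If the dataset is too small, do not compensate with many epochs; consider data augmentation instead.
Maximum gradient norm	The upper bound on the gradient norm; larger gradients are clipped to it to prevent exploding gradients.
Max samples	The maximum number of samples taken from the dataset for training.
Compute type	fp32: single-precision float, 32 bits. Accurate and stable, but the slowest and the most memory-hungry.
fp16: half-precision float, 16 bits. Halves memory use and computes fast, but overflows easily.
bf16: brain floating point, 16 bits. Halves memory use and computes fast, and rarely overflows.
pure_bf16: pure bf16 mode, 16 bits. Lowest memory use, but requires hardware support.
Cutoff length	The length at which input samples are truncated.
Batch size	Adjust according to the available GPU memory.
Gradient accumulation	Batch size × gradient accumulation steps is the number of samples each device processes per gradient update.
Validation set ratio	The fraction of all samples held out as the validation set.
Learning rate scheduler	Different phases of training need different learning rates; this parameter selects the scheduler that adjusts it. cosine is the most common choice.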
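Three of the parameters above are easy to express as formulas: the number of samples per gradient update, the cosine learning rate schedule, and clipping to a maximum gradient norm. The sketch below is a plain-Python illustration, not the trainer's implementation; the function names and the optional `warmup_steps` argument are assumptions made for the example.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Samples consumed per optimizer step across all devices."""
    return per_device_batch * grad_accum_steps * num_devices


def cosine_lr(step, total_steps, peak_lr, warmup_steps=0):
    """Linear warmup (optional) followed by cosine decay to zero."""
    if warmup_steps and step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]


# Batch size 1, 8 accumulation steps, 8 GPUs.
print(effective_batch_size(1, 8, 8))
# Learning rate at the start, middle, and end of a 100-step run.
print([cosine_lr(s, 100, 1e-4) for s in (0, 50, 100)])
# A gradient of norm 5 clipped to norm 1.
print(clip_by_global_norm([3.0, 4.0], 1.0))
```

When GPU memory forces a smaller batch size, raising gradient accumulation by the same factor keeps the samples per update unchanged.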

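Parameter	Description
LoRA rank	The rank of the LoRA matrices.
LoRA alpha	The size of the LoRA scaling coefficient.
LoRA dropout	The dropout probability applied to the LoRA weights.
LoRA+ LR ratio	Lets matrices A and B use different learning rates. In LoRA+, the learning rate `ηA` of adapter matrix A is the optimizer learning rate, and the learning rate `ηB` of adapter matrix B is `λ * ηA`, where `λ` is the value of `loraplus_lr_ratio`.
Use rslora	Rank-stabilized LoRA: scales the adapter by alpha / sqrt(rank) instead of alpha / rank, which keeps training stable at higher ranks.
Use DoRA	Decomposes the original weights into a magnitude part and a direction part and optimizes both, which makes fine-tuning more flexible.
Use PiSSA	Initializes the adapter from the principal components of the weight matrix and tunes those directly, which speeds up convergence.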
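The scaling and learning-rate rules in this table reduce to a few lines of arithmetic. The sketch below assumes the standard LoRA scaling alpha / rank, the rank-stabilized variant alpha / sqrt(rank) used by rsLoRA, and the LoRA+ rule `ηB = λ * ηA` quoted above; the function names and the example values (alpha 16, ratio 16) are illustrative only.

```python
import math


def lora_scaling(alpha, rank, use_rslora=False):
    """Factor applied to the low-rank update B @ A."""
    return alpha / math.sqrt(rank) if use_rslora else alpha / rank


def loraplus_learning_rates(base_lr, loraplus_lr_ratio):
    """LoRA+ learning rates: A uses the optimizer LR, B uses ratio * LR."""
    return {"lora_A": base_lr, "lora_B": base_lr * loraplus_lr_ratio}


for rank in (8, 64):
    plain = lora_scaling(16, rank)
    stabilized = lora_scaling(16, rank, use_rslora=True)
    print(f"rank={rank}: alpha/r={plain}, alpha/sqrt(r)={stabilized:.3f}")

print(loraplus_learning_rates(1e-4, 16))
```

With plain scaling the update shrinks quickly as the rank grows, which is the effect the rank-stabilized variant is meant to soften.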

llamafactory-cli export merge_config.yaml

# Base model path
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
# LoRA adapter path
adapter_name_or_path: saves/llama3-8b/lora/sft
# Chat template
template: llama3
# Fine-tuning type
finetuning_type: lora
# Output path of the merged model
export_dir: models/llama3_lora_sft
# Shard size of the exported model files (GB)
export_size: 2
# Device used for the export
export_device: cpu
# Export file format: true saves .bin files, false saves .safetensors files
export_legacy_format: false
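Conceptually, merging folds the low-rank update into the base weight, W' = W + (alpha / rank) · B · A, so the exported model no longer needs the adapter at inference time. The following toy example illustrates that formula on a 2×2 matrix in plain Python; it is a sketch of the idea, not the code the export command runs, and all values are invented.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out x rank) @ (rank x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # base weight, shape (2 x 2)
A = [[1.0, 2.0]]              # LoRA A, rank 1, shape (1 x 2)
B = [[0.5], [0.25]]           # LoRA B, shape (2 x 1)

merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)
```

Because the merged matrix has the same shape as the original weight, the exported model loads like any ordinary checkpoint.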

llamafactory-cli export quantization_config.yaml

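# Base model path
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
# Chat template
template: llama3
# Output path of the quantized model
export_dir: models/llama3_gptq
# Quantization level (bits)
export_quantization_bit: 4
# Calibration dataset for quantization
export_quantization_dataset: data/c4_demo.json
# Shard size of the exported model files (GB)
export_size: 2
# Device used for the export
export_device: cpu
# Export file format: true saves .bin files, false saves .safetensors files
export_legacy_format: false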
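To give a feel for what a 4-bit quantization level does to individual weights, here is a deliberately simple symmetric absmax quantizer. It is not the algorithm behind the export above (the export directory name suggests GPTQ, which uses the calibration dataset to reduce the error); it only shows the basic round-to-a-small-integer-grid idea. The weight values are invented.

```python
def quantize_absmax(values, bits=4):
    """Map floats to signed integer codes using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        scale = 1.0  # all-zero input: any scale works
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


weights = [0.7, -0.3, 0.12, 0.0]
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
print(codes)
print(restored)
```

The third weight (0.12) comes back as roughly 0.1: that rounding error is what calibration-based methods try to keep small where it matters most.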

FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/llama3_full_sft_ds3.yaml

FORCE_TORCHRUN=1 CUDA_VISIBLE_DEVICES=0,1 llamafactory-cli train config/config1.yaml

torchrun --standalone --nnodes=1 --nproc-per-node=8 src/train.py \
    --stage sft \
    --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
    --do_train \
    --dataset alpaca_en_demo \
    --template llama3 \
    --finetuning_type lora \
    --output_dir saves/llama3-8b/lora/ \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 100 \
    --save_steps 500 \
    --learning_rate 1e-4 \
    --num_train_epochs 2.0 \
    --plot_loss \
    --bf16

# accelerate_singleNode_config.yaml
compute_environment: LOCAL_MACHINE
debug: true
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

You can start training by running the following command:

accelerate launch \
    --config_file accelerate_singleNode_config.yaml \
    src/train.py training_config.yaml

# Run on node 0 (NODE_RANK=0)
FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=0 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 \
    llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml

# Run on node 1 (NODE_RANK=1)
FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=1 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 \
    llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml

# accelerate_multiNode_config.yaml
compute_environment: LOCAL_MACHINE
debug: true
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_process_ip: '192.168.0.1'
main_process_port: 29500
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

accelerate launch \
    --config_file accelerate_multiNode_config.yaml \
    train.py llm_config.yaml
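The numbers in the multi-node configuration fit together in a simple way: `num_processes` is the total number of processes across all machines, and every process gets a unique global rank. The sketch below assumes the 2 machines from the config each contribute 8 GPUs (16 / 2) and that ranks are assigned node by node, as torchrun-style launchers do; the helper names are hypothetical.

```python
def world_size(num_machines, procs_per_machine):
    """Total number of training processes across all machines."""
    return num_machines * procs_per_machine


def global_rank(machine_rank, local_rank, procs_per_machine):
    """Unique rank of a process, counting machine by machine."""
    return machine_rank * procs_per_machine + local_rank


PROCS_PER_MACHINE = 8  # assumed: num_processes (16) / num_machines (2)

print(world_size(2, PROCS_PER_MACHINE))
# First GPU on machine 0 and last GPU on machine 1.
print(global_rank(0, 0, PROCS_PER_MACHINE), global_rank(1, 7, PROCS_PER_MACHINE))
```

Each machine needs its own copy of the configuration with its own rank (`machine_rank` here, `NODE_RANK` in the llamafactory-cli commands), while the master address and port stay the same everywhere.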

FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/llama3_full_sft_ds3.yaml

 deepspeed: examples/deepspeed/ds_z3_config.json

deepspeed --include localhost:1 your_program.py <normal cl args> --deepspeed ds_config.json

deepspeed --num_gpus 8 src/train.py \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --stage sft \
    --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
    --do_train \
    --dataset alpaca_en \
    --template llama3 \
    --finetuning_type full \
    --output_dir saves/llama3-8b/lora/full \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 500 \
    --learning_rate 1e-4 \
    --num_train_epochs 2.0 \
    --plot_loss \
    --bf16

deepspeed --include localhost:1 your_program.py <normal cl args> --deepspeed ds_config.json

# Run on node 0 (NODE_RANK=0)
FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=0 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 llamafactory-cli train examples/train_lora/llama3_lora_sft_ds3.yaml

# Run on node 1 (NODE_RANK=1)
FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=1 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 llamafactory-cli train examples/train_lora/llama3_lora_sft_ds3.yaml

deepspeed --num_gpus 8 --num_nodes 2 --hostfile hostfile --master_addr hostname1 --master_port=9901 \
    your_program.py <normal cl args> --deepspeed ds_config.json

accelerate config

# deepspeed_config.yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 8
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_process_ip: '192.168.0.1'
main_process_port: 29500
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

accelerate launch \
    --config_file deepspeed_config.yaml \
    train.py llm_config.yaml

# ds_z0_config.json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 0,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "round_robin_gradients": true
  }
}

# ds_z2_config.json
{
  ...
  "zero_optimization": {
    "stage": 2,
    ...
  }
}

# ds_z2_offload_config.json
{
  ...
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    ...
  }
}

# ds_z3_config.json
{
  ...
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}

# ds_z3_offload_config.json
{
  ...
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    ...
  }
}
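The practical difference between the ZeRO stages in these files is how much model state each GPU has to hold. A rough estimate, using the accounting commonly quoted for mixed-precision Adam (2 bytes per parameter for 16-bit weights, 2 for gradients, 12 for the fp32 master weights plus the two Adam moments), can be sketched as follows. The 8-billion-parameter model and 8 GPUs are assumed values, and the estimate ignores activations, buffers, and CPU offload.

```python
def zero_model_state_gib(num_params, num_gpus, stage):
    """Approximate per-GPU memory for weights, gradients and optimizer state."""
    params = 2.0 * num_params  # 16-bit weights
    grads = 2.0 * num_params   # 16-bit gradients
    optim = 12.0 * num_params  # fp32 master weights + Adam momentum + variance
    if stage >= 1:
        optim /= num_gpus      # stage 1 partitions the optimizer state
    if stage >= 2:
        grads /= num_gpus      # stage 2 also partitions the gradients
    if stage >= 3:
        params /= num_gpus     # stage 3 also partitions the weights
    return (params + grads + optim) / 1024**3


NUM_PARAMS = 8e9  # assumed model size
NUM_GPUS = 8      # assumed number of GPUs

for stage in (0, 1, 2, 3):
    gib = zero_model_state_gib(NUM_PARAMS, NUM_GPUS, stage)
    print(f"ZeRO stage {stage}: about {gib:.1f} GiB per GPU")
```

Higher stages trade memory for communication, and the offload variants move part of the remaining state to CPU memory at the cost of speed.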

bash examples/extras/fsdp_qlora/train.sh

accelerate config

# /examples/accelerate/fsdp_config.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: true # offload may affect training speed
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: fp16 # or bf16
num_machines: 1 # the number of nodes
num_processes: 2 # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

accelerate launch \
    --config_file fsdp_config.yaml \
    src/train.py llm_config.yaml


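Table of contents
Preface
VI. The WebUI in detail
1. Base model
2. Dataset
3. Training stage
3.1 RLHF (Reward Modeling + PPO)
3.2 DPO
3.3 KTO
3.4 Summary
4. Training method
5. Quantization
5.1 bitsandbytes (8-bit/4-bit quantization)
5.2 HQQ (Half-Quadratic Quantization)
5.3 EETQ (Efficient Engine for Tensor Quantization)
5.4 Use cases
5.5 Summary
6. Acceleration methods
6.1 FlashAttention-2
6.2 Unsloth
6.3 Liger Kernel
6.4 Use cases
6.5 Summary
7. RoPE interpolation methods
7.1 Linear (linear scaling)
7.2 Dynamic (dynamic scaling)
7.3 YaRN (NTK-aware RoPE Scaling)
7.4 Improvements in LLaMA-3
7.5 One-sentence summary
7.6 Summary
8. General parameters
8.1 Summary
9. Other parameters
10. Partial-parameter fine-tuning settings
11. LoRA settings
11.1 Summary
12. RLHF settings
13. Multimodal settings
14. GaLore settings
15. APOLLO settings
16. BAdam settings
17. Save paths for models and configs
18. SwanLab settings
19. Prediction and evaluation
20. Inference
21. Exporting the model
22. Summary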

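VII. Merging and quantizing LoRA models
1. Merging the model
1.1 Using the WebUI
1.2 Using the command line
2. Quantizing the model
2.1 Using the WebUI
2.2 Using the command line
3. Summary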

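VIII. Distributed training
1. Overview
2. DDP
2.1 Single node, multiple GPUs
2.2 Multiple nodes, multiple GPUs
2.3 Summary
3. DeepSpeed
3.1 Single node, multiple GPUs
3.2 Multiple nodes, multiple GPUs
3.3 Summary
4. FSDP
4.1 llamafactory-cli
4.2 accelerate
4.3 Summary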
