class CallbackHandler(TrainerCallback):
    """Internal class that just calls the list of callbacks in order."""

    def __init__(self, callbacks, model, tokenizer, optimizer, lr_scheduler):
        self.callbacks = []
        for callback in callbacks:
            self.add_callback(callback)
        self.model = model
        self.tokenizer = tokenizer
        self.optimizer = optimizer
        self.lr_scheduler = lr_scheduler
        self.train_dataloader = None
        self.eval_dataloader = None
        # ... initialization logic ...

    def call_event(self, event, args, state, control, **kwargs):
        # Shared objects handed to every callback on every event.
        shared = dict(
            model=self.model,
            tokenizer=self.tokenizer,
            optimizer=self.optimizer,
            lr_scheduler=self.lr_scheduler,
            train_dataloader=self.train_dataloader,
            eval_dataloader=self.eval_dataloader,
        )
        for callback in self.callbacks:
            handler = getattr(callback, event)
            result = handler(args, state, control, **shared, **kwargs)
            # A callback may return an updated control object; None means
            # "keep the current one".
            if result is not None:
                control = result
        return control
## 三、`self._save_checkpoint` 源码解读 (Reading the `self._save_checkpoint` source)

### 1. 完整的源码 (The complete source)
def _save_checkpoint(self, model, trial, metrics=None):
    """Save a full training checkpoint (model, optimizer, scheduler, trainer state, RNG).

    Writes everything under `<run_dir>/<PREFIX_CHECKPOINT_DIR>-<global_step>`,
    updates the best-metric bookkeeping when `metrics` is provided, and
    optionally pushes the checkpoint to the Hub and rotates old checkpoints.
    """
    # In all cases, including ddp/dp/deepspeed, self.model is always a reference
    # to the model we want to save except FullyShardedDDP.
    assert unwrap_model(model) is self.model, "internal model should be a reference to self.model"

    # Save model checkpoint
    checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"

    # Only fold the running FLO counter into state when not doing an HP search.
    if self.hp_search_backend is None and trial is None:
        self.store_flos()

    run_dir = self._get_output_dir(trial=trial)
    output_dir = os.path.join(run_dir, checkpoint_folder)
    self.save_model(output_dir, _internal_call=True)
    if self.is_deepspeed_enabled:
        # under zero3 model file itself doesn't get saved since it's bogus!
        self.model_wrapped.save_checkpoint(output_dir)

    # Save optimizer and scheduler
    if self.sharded_ddp == ShardedDDPOption.SIMPLE:
        # Gather the sharded optimizer state onto rank 0 before saving.
        self.optimizer.consolidate_state_dict()

    if self.fsdp or self.is_fsdp_enabled:
        if self.is_fsdp_enabled:
            # accelerate-managed FSDP: let accelerate save the optimizer shards.
            save_fsdp_optimizer(
                self.accelerator.state.fsdp_plugin, self.accelerator, self.optimizer, self.model, output_dir
            )
        else:
            # Legacy FSDP path: build the full (unsharded) optimizer state dict;
            # it is written to disk further down, in the `should_save` branch.
            full_osd = self.model.__class__.full_optim_state_dict(self.model, self.optimizer)

    if is_torch_tpu_available():
        # XLA: synchronize all TPU cores before saving.
        xm.rendezvous("saving_optimizer_states")
        xm.save(self.optimizer.state_dict(), os.path.join(output_dir, OPTIMIZER_NAME))
        with warnings.catch_warnings(record=True) as caught_warnings:
            xm.save(self.lr_scheduler.state_dict(), os.path.join(output_dir, SCHEDULER_NAME))
            reissue_pt_warnings(caught_warnings)
    elif is_sagemaker_mp_enabled():
        # SageMaker model parallel: each rank saves a partial optimizer state.
        opt_state_dict = self.optimizer.local_state_dict(gather_if_shard=False)
        smp.barrier()
        if smp.rdp_rank() == 0 or smp.state.cfg.shard_optimizer_state:
            smp.save(opt_state_dict, os.path.join(output_dir, OPTIMIZER_NAME), partial=True, v3=smp.state.cfg.shard_optimizer_state)
        if self.args.should_save:
            with warnings.catch_warnings(record=True) as caught_warnings:
                torch.save(self.lr_scheduler.state_dict(), os.path.join(output_dir, SCHEDULER_NAME))
            reissue_pt_warnings(caught_warnings)
            if self.do_grad_scaling:
                torch.save(self.scaler.state_dict(), os.path.join(output_dir, SCALER_NAME))
    elif self.args.should_save and not self.is_deepspeed_enabled:
        # deepspeed.save_checkpoint above already saved the optimizer/scheduler.
        if self.fsdp and not self.is_fsdp_enabled:
            # `full_osd` was built in the legacy-FSDP branch above.
            torch.save(full_osd, os.path.join(output_dir, OPTIMIZER_NAME))
        else:
            torch.save(self.optimizer.state_dict(), os.path.join(output_dir, OPTIMIZER_NAME))
        with warnings.catch_warnings(record=True) as caught_warnings:
            torch.save(self.lr_scheduler.state_dict(), os.path.join(output_dir, SCHEDULER_NAME))
        reissue_pt_warnings(caught_warnings)
        if self.do_grad_scaling:
            torch.save(self.scaler.state_dict(), os.path.join(output_dir, SCALER_NAME))

    # Determine the new best metric / best model checkpoint
    if metrics is not None and self.args.metric_for_best_model is not None:
        metric_to_check = self.args.metric_for_best_model
        if not metric_to_check.startswith("eval_"):
            metric_to_check = f"eval_{metric_to_check}"
        metric_value = metrics[metric_to_check]

        operator = np.greater if self.args.greater_is_better else np.less
        if (self.state.best_metric is None or self.state.best_model_checkpoint is None or operator(metric_value, self.state.best_metric)):
            self.state.best_metric = metric_value
            self.state.best_model_checkpoint = output_dir

    # Save the Trainer state
    if self.args.should_save:
        self.state.save_to_json(os.path.join(output_dir, TRAINER_STATE_NAME))

    # Save RNG state in non-distributed training
    rng_states = {
        "python": random.getstate(),
        "numpy": np.random.get_state(),
        "cpu": torch.random.get_rng_state(),
    }
    if torch.cuda.is_available():
        if self.args.parallel_mode == ParallelMode.DISTRIBUTED:
            # In distributed mode every process saves the state of all GPUs it sees.
            rng_states["cuda"] = torch.cuda.random.get_rng_state_all()
        else:
            rng_states["cuda"] = torch.cuda.random.get_rng_state()

    if is_torch_tpu_available():
        rng_states["xla"] = xm.get_rng_state()

    # `output_dir` may not yet exist on non-saving ranks (save_model only wrote
    # on the main process), but every rank writes its own RNG file below.
    os.makedirs(output_dir, exist_ok=True)

    if self.args.world_size <= 1:
        torch.save(rng_states, os.path.join(output_dir, "rng_state.pth"))
    else:
        # One RNG file per process so each rank can restore its own streams.
        torch.save(rng_states, os.path.join(output_dir, f"rng_state_{self.args.process_index}.pth"))

    if self.args.push_to_hub:
        self._push_from_checkpoint(output_dir)

    # Maybe delete some older checkpoints.
    if self.args.should_save:
        self._rotate_checkpoints(use_mtime=True, output_dir=run_dir)
def store_flos(self):
    """Fold the running floating-point-operation counter into the trainer state.

    In distributed mode the per-process counters are summed across all ranks
    first; in either case the local counter is reset to zero afterwards.
    """
    distributed = self.args.parallel_mode == ParallelMode.DISTRIBUTED
    if distributed:
        flos = distributed_broadcast_scalars([self.current_flos], device=self.args.device).sum().item()
    else:
        flos = self.current_flos
    self.state.total_flos += flos
    self.current_flos = 0
def save_model(self, output_dir: Optional[str] = None, _internal_call: bool = False):
    """Save the model to `output_dir` (defaults to `args.output_dir`).

    Dispatches to the backend-specific save path (TPU, SageMaker MP,
    sharded-DDP/FSDP, DeepSpeed) or to the plain `_save` on the main process.
    `_internal_call` is True when invoked from checkpointing, in which case the
    Hub push at the end is skipped. NOTE: the `...` bodies below are elided in
    this excerpt.
    """
    if output_dir is None:
        output_dir = self.args.output_dir

    if is_torch_tpu_available():
        self._save_tpu(output_dir)
    elif is_sagemaker_mp_enabled():
        # Calling the state_dict needs to be done on the wrapped model and on all processes.
        ...
    elif (ShardedDDPOption.ZERO_DP_2 in self.args.sharded_ddp or ShardedDDPOption.ZERO_DP_3 in self.args.sharded_ddp or self.fsdp is not None or self.is_fsdp_enabled):
        ...
    elif self.is_deepspeed_enabled:
        ...
    elif self.args.should_save:
        self._save(output_dir)

    # Push to the Hub when `save_model` is called by the user.
    if self.args.push_to_hub and not _internal_call:
        self.push_to_hub(commit_message="Model save")
def _save(self, output_dir: Optional[str] = None, state_dict=None):
    """Save the model (via `save_pretrained` when possible), tokenizer and
    training arguments to `output_dir`.

    Falls back to saving only a state dict when neither the model nor its
    unwrapped inner model is a supported `save_pretrained`-capable class.
    """
    output_dir = output_dir if output_dir is not None else self.args.output_dir
    os.makedirs(output_dir, exist_ok=True)
    logger.info(f"Saving model checkpoint to {output_dir}")

    supported_classes = (PreTrainedModel,)
    # BUG FIX: the guard was inverted (`if not is_peft_available()`), which
    # referenced `PeftModel` exactly when peft is NOT installed (NameError)
    # and skipped PEFT support when it IS installed. Only extend the tuple
    # when peft is importable.
    if is_peft_available():
        supported_classes = (PreTrainedModel, PeftModel)

    if not isinstance(self.model, supported_classes):
        if state_dict is None:
            state_dict = self.model.state_dict()

        # The model may be wrapped (DDP, etc.): try saving the inner model.
        if isinstance(unwrap_model(self.model), supported_classes):
            unwrap_model(self.model).save_pretrained(
                output_dir, state_dict=state_dict, safe_serialization=self.args.save_safetensors
            )
        else:
            # NOTE(review): as excerpted, this branch only logs and never writes
            # the state dict to disk — confirm against the full upstream source.
            logger.info("Trainer.model is not a `PreTrainedModel`, only saving its state dict.")
    else:
        self.model.save_pretrained(
            output_dir, state_dict=state_dict, safe_serialization=self.args.save_safetensors
        )

    if self.tokenizer is not None:
        self.tokenizer.save_pretrained(output_dir)

    # Good practice: save your training arguments together with the trained model
    torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME))