ICML 2024 DoRA: Weight-Decomposed Low-Rank Adaptation Explained

Basic Information

Paper: https://openreview.net/forum?id=3d5CIRG1n2
Code: https://github.com/NVlabs/DoRA
Venue: ICML
Year: 2024

0 Abstract

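DoRA (Weight-Decomposed Low-Rank Adaptation) is a parameter-efficient fine-tuning (PEFT) method that aims to narrow the accuracy gap between LoRA and full fine-tuning (FT) while keeping LoRA's advantage of adding no inference overhead.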

Background

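LoRA and its variants are widely used because they add no inference cost, but they still trail full fine-tuning in accuracy. Earlier work mostly attributed this gap to the limited number of trainable parameters. Building on the idea of weight normalization, the paper proposes a weight decomposition analysis that splits model weights into two components, magnitude and direction, and uses it to expose a basic difference between LoRA and full fine-tuning. Under LoRA, magnitude and direction updates are positively correlated and roughly proportional, so LoRA lacks the ability to make fine-grained adjustments. Under full fine-tuning, the update pattern is more flexible: the two are negatively correlated, so either the magnitude or the direction can be optimized largely on its own.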
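The analysis above compares how much the magnitude and the direction of a weight matrix move during fine-tuning. A minimal sketch of two such measurements follows, assuming (as in the paper's notation, as far as I can tell) that magnitude is the per-column norm and direction change is one minus the cosine similarity per column, both averaged over columns. The function names and the toy matrices are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
import math


def column(matrix, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in matrix]


def norm(vec):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in vec))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def delta_m_and_d(w0, w_tuned):
    """Average magnitude change and direction change over the k columns."""
    k = len(w0[0])
    delta_m = 0.0
    delta_d = 0.0
    for j in range(k):
        c0, ct = column(w0, j), column(w_tuned, j)
        delta_m += abs(norm(ct) - norm(c0))   # change in column magnitude
        delta_d += 1.0 - cosine(ct, c0)       # change in column direction
    return delta_m / k, delta_d / k


# Toy pre-trained weight (hypothetical values).
w0 = [[3.0, 0.0],
      [4.0, 1.0]]
# Pure rescaling: every column is doubled, so only the magnitude changes.
w_scaled = [[6.0, 0.0],
            [8.0, 2.0]]

dm, dd = delta_m_and_d(w0, w_scaled)
print(dm, dd)
```

In this toy case the direction change is zero while the magnitude change is not, which is the kind of decoupled update the analysis associates with full fine-tuning rather than LoRA.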

Method Design

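Building on this analysis, DoRA decomposes the pre-trained weights into magnitude and direction and fine-tunes both. Because the direction component has many parameters, it is updated efficiently with LoRA, while the magnitude component is kept trainable. This design simplifies the task LoRA otherwise faces of learning magnitude and direction at the same time, and it improves training stability. After training, the decomposed components can be merged back into the pre-trained weights, so no inference latency is added. In addition, by detaching the norm of the direction component from the gradient graph (treating it as a constant during backpropagation), DoRA cuts training memory overhead substantially (by 24.4% in LLaMA fine-tuning).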
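The method description can be written compactly as follows. This is a sketch based on my reading of the paper rather than a quotation: W_0 is the pre-trained weight, B and A are the LoRA factors, the norm is taken per column, and C is my label for the norm once it is treated as a constant.

```latex
\begin{aligned}
W' &= m \, \frac{V + \Delta V}{\lVert V + \Delta V \rVert_c}
    = m \, \frac{W_0 + BA}{\lVert W_0 + BA \rVert_c} \\[4pt]
% gradient with respect to V' = V + \Delta V when the norm is differentiated
\nabla_{V'} \mathcal{L} &= \frac{m}{\lVert V' \rVert_c}
    \left( I - \frac{V' V'^{\top}}{\lVert V' \rVert_c^{2}} \right)
    \nabla_{W'} \mathcal{L} \\[4pt]
% gradient when the norm is detached and treated as a constant C
\nabla_{V'} \mathcal{L} &= \frac{m}{C} \, \nabla_{W'} \mathcal{L},
    \qquad C = \lVert V' \rVert_c
\end{aligned}
```

With the norm detached, the backward pass no longer has to keep the intermediate results needed to differentiate through the normalization, which is where the memory saving comes from.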

Experimental Results

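Across tasks of different modalities and across model backbones, DoRA consistently outperforms LoRA. On commonsense reasoning, accuracy improves by 3.7%/1.0% on LLaMA-7B/13B and by 4.4% on LLaMA3-8B; on visual instruction tuning (LLaVA-7B) it improves by 0.6%; on image-text and video-text understanding (VL-BART) it improves by 0.9% and 1.9% respectively. DoRA is also compatible with LoRA variants such as VeRA (the combination is called DVoRA), keeping its advantage while using fewer parameters, and it stays robust when training data is limited and across a range of rank settings. In addition, QDoRA, a DoRA-based variant, surpasses QLoRA in quantized fine-tuning, and DoRA also gives better personalization results in text-to-image generation.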

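Core Contributions

Proposes a weight decomposition analysis that reveals how the learning patterns of LoRA and full fine-tuning differ;
Designs the DoRA method, which achieves a learning capacity close to full fine-tuning without adding inference overhead;
Validates DoRA's effectiveness and compatibility on NLP, vision-language, and other tasks, and on models such as LLMs and LVLMs.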

1 Introduction

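LoRA and FT show clearly different update patterns, each with its own strengths. This observation motivates weight-decomposed low-rank adaptation (DoRA), which improves on LoRA across different tasks.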

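DoRA is a parameter-efficient fine-tuning (PEFT) method that incorporates weight decomposition and achieves a learning capacity close to FT without adding inference latency over LoRA.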

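Inspiration: borrowing from Weight Normalization, the weights are reparameterized as 'magnitude + direction' in order to analyze how the update patterns of FT and LoRA differ.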
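The 'magnitude + direction' reparameterization can be written as below. This is a sketch in what I understand to be the paper's notation, where the norm is taken over each column; the dimension names d and k are assumptions introduced here for illustration.

```latex
W = m \, \frac{V}{\lVert V \rVert_c}
  = \lVert W \rVert_c \, \frac{W}{\lVert W \rVert_c},
\qquad m \in \mathbb{R}^{1 \times k}, \quad V \in \mathbb{R}^{d \times k}
```

Each column of V divided by its norm is a unit vector, so m alone carries the size of each column and the normalized V alone carries where it points.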

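DoRA's core design: decompose the pre-trained weights into magnitude and direction, use LoRA to optimize the direction component (which addresses its large parameter count), and fine-tune the magnitude component at the same time, so that the learning pattern moves closer to FT.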
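To see why the direction component is handled with LoRA, it helps to count trainable parameters. The arithmetic below is a rough sketch: the 4096 x 4096 layer size and rank 16 are illustrative assumptions, not figures from the article, and the function names are hypothetical. It assumes one magnitude entry per column, following the paper's convention as I read it.

```python
def full_ft_params(d, k):
    """Trainable parameters when the whole d x k matrix is fine-tuned."""
    return d * k


def lora_params(d, k, r):
    """Trainable parameters of the low-rank factors B (d x r) and A (r x k)."""
    return r * (d + k)


def dora_params(d, k, r):
    """LoRA factors for the direction plus one magnitude value per column."""
    return lora_params(d, k, r) + k


# Illustrative sizes only (assumed, not taken from the article).
d = k = 4096
r = 16

print(full_ft_params(d, k))   # full fine-tuning
print(lora_params(d, k, r))   # LoRA
print(dora_params(d, k, r))   # DoRA
```

Under these assumptions the magnitude vector adds only k extra values on top of LoRA, which is small next to both the low-rank factors and the full matrix.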

Figure: overview of DoRA's core operations

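The method decomposes the pre-trained weights into two components, magnitude and direction, and fine-tunes them; specifically, LoRA is used to update the direction component efficiently. 'Magnitude' describes how large the weights are, and 'direction' describes where they point in vector space.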

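Decomposition (initialization) stage: the pre-trained weights are split into magnitude (think of it as a tool's size) and direction (the tool's style). The direction part starts out frozen (Frozen), while the magnitude part is trainable (Trainable).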
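The decomposition stage can be sketched on a tiny matrix. The example below assumes per-column norms and non-zero columns; the helper names and values are hypothetical, and plain lists are used to keep it dependency-free.

```python
import math


def decompose(w):
    """Split a d x k matrix (list of rows) into column magnitudes and unit-norm directions."""
    d, k = len(w), len(w[0])
    # One magnitude per column: the Euclidean norm of that column.
    m = [math.sqrt(sum(w[i][j] ** 2 for i in range(d))) for j in range(k)]
    # Direction: each column divided by its own norm (assumes no zero columns).
    direction = [[w[i][j] / m[j] for j in range(k)] for i in range(d)]
    return m, direction


def recompose(m, direction):
    """Rebuild the matrix by scaling each unit-norm column by its magnitude."""
    d, k = len(direction), len(m)
    return [[m[j] * direction[i][j] for j in range(k)] for i in range(d)]


# Toy pre-trained weight (hypothetical values).
w0 = [[3.0, 0.0],
      [4.0, 2.0]]

m, direction = decompose(w0)
w_rebuilt = recompose(m, direction)
print(m)
print(w_rebuilt)
```

Recomposing right after decomposing gives back the original matrix, so the decomposition by itself does not change the model before any training step.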

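Adaptation stage: the 'direction' part is adjusted by introducing a new trainable increment (ΔV). The direction as a whole, V + ΔV, now changes during training: V itself stays frozen, and only ΔV (the low-rank LoRA factors) is updated by gradient descent or a similar optimizer to fit the target task.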
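A small sketch of the increment at the start of the adaptation stage follows. It assumes ΔV is the product of two low-rank factors B and A with B initialized to zero, which matches the zero initialization of `lora_B` in the code excerpt later in this article; the sizes, seed, and variable names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]


random.seed(0)
d, k, r = 4, 3, 2  # toy sizes, with rank r smaller than d and k

# Frozen direction V (taken from the pre-trained weight; random here).
v = [[random.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]

# Low-rank factors: B starts at zero, A starts random.
b = [[0.0] * r for _ in range(d)]                                     # d x r
a = [[random.gauss(0.0, 1.0) for _ in range(k)] for _ in range(r)]    # r x k

delta_v = matmul(b, a)  # the trainable increment, all zeros at this point
v_adapted = [[v[i][j] + delta_v[i][j] for j in range(k)] for i in range(d)]

print(v_adapted == v)
```

Because B is zero at initialization, V + ΔV equals V before the first update, so training starts from the pre-trained direction.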

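Merge stage: the adjusted magnitude m and the direction V + ΔV are recombined into a new weight W', which is used for the model's forward pass on the target task and for subsequent training.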
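The merge stage can be sketched as follows, assuming the merged weight is m times the column-normalized V + ΔV (the paper's formulation as I understand it). The helper names and the numbers for m and ΔV are hypothetical stand-ins for learned values.

```python
import math


def col_norms(w):
    """Euclidean norm of each column of a matrix stored as a list of rows."""
    d, k = len(w), len(w[0])
    return [math.sqrt(sum(w[i][j] ** 2 for i in range(d))) for j in range(k)]


def merge(m, v, delta_v):
    """Return W' = m * (V + dV) / ||V + dV||_c as a plain d x k matrix."""
    d, k = len(v), len(v[0])
    v_new = [[v[i][j] + delta_v[i][j] for j in range(k)] for i in range(d)]
    norms = col_norms(v_new)
    return [[m[j] * v_new[i][j] / norms[j] for j in range(k)] for i in range(d)]


v = [[3.0, 0.0],
     [4.0, 2.0]]
delta_v = [[0.0, 1.0],
           [0.0, 0.0]]   # hypothetical learned direction update
m = [10.0, 0.5]          # hypothetical learned magnitudes

w_merged = merge(m, v, delta_v)
merged_norms = col_norms(w_merged)
print(w_merged)
print(merged_norms)
```

After merging, W' is an ordinary matrix of the original shape whose column norms equal the learned magnitudes, so inference needs no extra parameters or extra matrix products.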

2 Related Works

import math

import torch
import torch.nn as nn
import torch.nn.functional as F

# LoraLayer and transpose are helpers from the surrounding PEFT-style code (not shown in this excerpt).


class Linear(nn.Linear, LoraLayer):
    # Lora implemented in a dense layer
    def __init__(
        self,
        in_features: int,
        out_features: int,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.0,
        fan_in_fan_out: bool = False,
        merge_weights: bool = True,
        Wdecompose: bool = False,
        dora_simple: bool = True,
        **kwargs,
    ):
        nn.Linear.__init__(self, in_features, out_features, **kwargs)
        LoraLayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout, merge_weights=merge_weights)

        # Trainable magnitude vector m, stored as a weight of shape (out_features, 1).
        # Setting it to the norm of the pre-trained weight is not part of this excerpt.
        self.weight_m_wdecomp = nn.Linear(1, out_features, bias=False)

        self.fan_in_fan_out = fan_in_fan_out
        self.Wdecompose = Wdecompose    # True: no LoRA factors are created, forward only rescales W
        self.dora_simple = dora_simple  # True: detach the norm from the gradient graph
        if not self.Wdecompose:
            if r > 0:
                self.lora_A = nn.Linear(in_features, r, bias=False)
                self.lora_B = nn.Linear(r, out_features, bias=False)
                self.scaling = self.lora_alpha / self.r

        # Freeze the pre-trained weight matrix.
        self.weight.requires_grad = False
        self.reset_parameters()
        if fan_in_fan_out:
            self.weight.data = self.weight.data.T

    def reset_parameters(self):
        nn.Linear.reset_parameters(self)
        if hasattr(self, "lora_A"):
            # A uses the nn.Linear default init, B starts at zero, so B @ A is zero at the start.
            nn.init.kaiming_uniform_(self.lora_A.weight, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B.weight)

    def train(self, mode: bool = True):
        nn.Linear.train(self, mode)
        if not self.Wdecompose:
            self.lora_A.train(mode)
            self.lora_B.train(mode)
        self.weight_m_wdecomp.train(mode)

    def eval(self):
        nn.Linear.eval(self)
        if not self.Wdecompose:
            self.lora_A.eval()
            self.lora_B.eval()
        self.weight_m_wdecomp.eval()

    def forward(self, x: torch.Tensor):
        previous_dtype = self.weight.dtype

        if self.disable_adapters:
            raise NotImplementedError
        elif self.Wdecompose and not self.merged:
            # Magnitude-only branch: rescale each output row of the frozen weight.
            norm_scale = self.weight_m_wdecomp.weight.view(-1) / (torch.linalg.norm(self.weight, dim=1))
            org_result = F.linear(x, transpose(self.weight, self.fan_in_fan_out))
            result = org_result + (norm_scale - 1) * (
                F.linear(self.lora_dropout(x), transpose(self.weight, self.fan_in_fan_out))
            )
            if self.bias is not None:
                result += self.bias.view(1, -1).expand_as(result)
        elif self.r > 0 and not self.merged:
            # DoRA branch: direction = W0 + scaling * B @ A, normalized per output row.
            new_weight_v = self.weight + (self.lora_B.weight @ self.lora_A.weight) * self.scaling
            if self.dora_simple:
                # Treat the norm as a constant so no gradient flows through it.
                norm_scale = self.weight_m_wdecomp.weight.view(-1) / (torch.linalg.norm(new_weight_v, dim=1)).detach()
            else:
                norm_scale = self.weight_m_wdecomp.weight.view(-1) / (torch.linalg.norm(new_weight_v, dim=1))

            org_result = F.linear(x, transpose(self.weight, self.fan_in_fan_out))
            dropout_x = self.lora_dropout(x)
            result = org_result + (norm_scale - 1) * (
                F.linear(dropout_x, transpose(self.weight, self.fan_in_fan_out))
            )
            if self.bias is not None:
                result += self.bias.view(1, -1).expand_as(result)
            result += (norm_scale * (self.lora_B(self.lora_A(dropout_x.to(self.lora_A.weight.dtype))))) * self.scaling
        else:
            # Merged (or r == 0): a plain linear layer.
            result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)

        if result.dtype != previous_dtype:
            result = result.to(previous_dtype)
        return result
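The three-term sum in the DoRA branch of `forward` can look unrelated to the 'magnitude times normalized direction' idea, so a short derivation may help. The symbols below are my own shorthand, not names from the code: s stands for `self.scaling`, n for `norm_scale`, and b for the bias. The norm is written per row because the code takes it along `dim=1` of the (out_features, in_features) weight. The identity is exact only when dropout is inactive (evaluation mode or a dropout probability of 0), since the code leaves the first term without dropout.

```latex
\begin{aligned}
s &= \frac{\texttt{lora\_alpha}}{r}, \qquad
n = \frac{m}{\lVert W_0 + s\,BA \rVert_{\mathrm{row}}} \\[4pt]
y &= n \odot \bigl( (W_0 + s\,BA)\,x \bigr) + b \\
  &= \underbrace{W_0 x}_{\texttt{org\_result}}
     \;+\; (n - 1) \odot (W_0 x)
     \;+\; n \odot \bigl( s\,BA\,x \bigr) \;+\; b
\end{aligned}
```

Written this way, the layer never builds the rescaled weight matrix during training; it reuses the frozen product W_0 x and adds two correction terms.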
