The Origins of MoE: Reading the 1991 Paper Adaptive Mixtures of Local Experts

Paper title: Adaptive Mixtures of Local Experts

Paper link: https://people.engr.tamu.edu/rgutier/web_courses/cpsc636_s10/jacobs1991moe.pdf

Abstract

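The paper presents a new supervised learning procedure for systems composed of many separate networks, each of which learns to handle a subset of the complete set of training cases.

The new procedure can be viewed either as a modular version of a multilayer supervised network or as an associative version of competitive learning.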

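Multilayer supervised network: a network with hidden layers trained by supervised learning (backpropagation). A related idea is deep supervision, in which the outputs of intermediate (hidden) layers also enter the loss. In a BERT-style model, for example, the final hidden tensor of selected Transformer blocks is transformed (e.g. by average pooling), used to compute an auxiliary loss, and added to the final loss with a weight.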

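Competitive learning: similar to an online version of K-means. Suppose the data come from three sources (clusters). When a sample arrives, a competition rule (e.g. smallest Euclidean distance) determines the winning neuron, and only the winner's weights are updated.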
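The winner-take-all update described above can be sketched in a few lines. This is a minimal illustration, not code from the paper: the function name `competitive_update`, the learning rate, the three starting prototypes, and the sample stream are all assumptions chosen for the example.

```python
import math

def competitive_update(prototypes, x, lr=0.1):
    """Move only the winning prototype toward sample x; return the winner's index."""
    # Competition rule: the unit closest to x (Euclidean distance) wins.
    distances = [math.dist(w, x) for w in prototypes]
    winner = distances.index(min(distances))
    # Only the winner's weights change; all other units are left untouched.
    prototypes[winner] = [wi + lr * (xi - wi) for wi, xi in zip(prototypes[winner], x)]
    return winner

# Three units (hypothetical starting weights) and a short stream of 2-D samples.
prototypes = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]
stream = [[0.5, 0.2], [4.8, 5.1], [9.0, 0.5], [0.1, 0.4]]
winners = [competitive_update(prototypes, x) for x in stream]
print(winners)
```

Each sample moves exactly one prototype, which is the same locality that the mixture-of-experts procedure aims for when it assigns cases to experts.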

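The paper therefore establishes a new link between these two apparently different approaches. It demonstrates that the learning procedure divides a vowel discrimination task into appropriate subtasks, each of which can be solved by a very simple expert network.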

Making Associative Learning Competitive

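If backpropagation is used to train a single multilayer network to perform different subtasks on different occasions, there will generally be strong interference effects that lead to slow learning and poor generalization.

This sounds much like multi-task learning, which comes in two forms: hard parameter sharing and soft parameter sharing. During my graduate studies I tried jointly modeling multi-label text classification and label-count classification as a hard-parameter-sharing multi-task setup; the results were mediocre.

If we know in advance that a set of training cases can be naturally divided into subsets corresponding to distinct subtasks, interference can be reduced by using a system composed of several different 'expert' networks plus a gating network.

MoE can be seen as an architecture for multi-task learning that adds a gating mechanism so the model can choose among expert networks. The subtasks may be pieces of a single task or tasks with different learning objectives.

The gating network decides which expert should be used for each training case. Systems of this kind have been described both for the case where the division into subtasks is known before training and for the case where the system itself learns how to allocate cases to experts. The idea behind such a system is that the gating network allocates a new case to one or a few experts, and, if the output is incorrect, the weight changes are localized to these experts (and the gating network).

There is therefore no interference with the weights of other experts that specialize in quite different cases. In this sense the experts are local: the weights of one expert are decoupled from the weights of the others.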
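The locality argument can be checked on a toy system. The sketch below assumes scalar linear experts (`y = w * x`), a softmax gate with one scalar weight per expert, and routing to the single most probable expert. All names and numbers are hypothetical, and the gate itself is left untrained to keep the example short.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy system: each expert is a scalar linear map y = w * x,
# and the gate scores each expert with its own scalar weight.
expert_w = [0.5, -1.0, 2.0]
gate_w = [1.0, 0.0, -1.0]

def train_step(x, d, lr=0.1):
    """Route x to the single most probable expert and update only that expert."""
    p = softmax([g * x for g in gate_w])
    chosen = p.index(max(p))
    error = d - expert_w[chosen] * x
    # Gradient step on the squared error (d - w*x)^2 for the chosen expert only.
    expert_w[chosen] += lr * 2 * error * x
    return chosen

before = list(expert_w)
chosen = train_step(x=2.0, d=3.0)
changed = [i for i, (a, b) in enumerate(zip(before, expert_w)) if a != b]
print(chosen, changed)
```

Only the index of the chosen expert appears in `changed`; the weights of the experts that were not selected are exactly what they were before the step.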

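Soft combination

Based on this idea, the error function originally proposed for such a system is:

$ E^{c}=\left\| d^{c}-\sum_{i} p_{i}^{c} o_{i}^{c}\right\| ^{2} \tag1 $

where $o_{i}^{c}$ is the output vector of expert i on case c, $p_{i}^{c}$ is the proportional contribution of expert i to the combined output vector, and $d^{c}$ is the desired output vector in case c.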
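As a concrete reading of equation (1), the following sketch evaluates the error for one case with two experts. The helper name `cooperative_error` and the numbers are assumptions made for illustration only.

```python
def cooperative_error(d, outputs, p):
    """Equation (1): squared distance between target d and the gate-weighted blend."""
    dim = len(d)
    # Blend the expert outputs component by component using the proportions p.
    blend = [sum(p_i * o_i[k] for p_i, o_i in zip(p, outputs)) for k in range(dim)]
    return sum((d_k - b_k) ** 2 for d_k, b_k in zip(d, blend))

# Hypothetical case c: two experts with 2-D output vectors.
d = [1.0, 0.0]
outputs = [[1.0, 1.0], [0.0, -1.0]]
p = [0.75, 0.25]
error = cooperative_error(d, outputs, p)
print(error)
```

Here the blended output is `[0.75, 0.5]`, so the error is `0.25**2 + 0.5**2`.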

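To see how each expert is trained, we take the partial derivative of $E^c$ with respect to $o_i^c$; the two error functions, (1) and (2), have to be handled separately.

For equation (1):

$ \frac{\partial E^c}{\partial o_i^c} = -2\, p_i^c \left( d^c - \sum_{j} p_j^c o_j^c \right) $

The derivative shows that the weight update of any one expert is affected by all the other experts, because the residual $d^c - \sum_j p_j^c o_j^c$ contains every expert's output. When the weights of one expert change, the residual error changes, and so the error derivatives of all the other local experts change as well.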
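The coupling is easy to see numerically. The sketch below assumes scalar outputs and two experts with equal proportions (hypothetical values). It evaluates the derivative for expert 0 twice, changing only the output of expert 1 in between.

```python
def grad_cooperative(d, outputs, p, i):
    """dE/do_i for equation (1) with scalar outputs: -2 * p_i * (d - sum_j p_j * o_j)."""
    residual = d - sum(p_j * o_j for p_j, o_j in zip(p, outputs))
    return -2.0 * p[i] * residual

d, p = 1.0, [0.5, 0.5]
# Expert 0 outputs 0.0 in both cases; only expert 1's output differs.
g_a = grad_cooperative(d, [0.0, 0.0], p, i=0)
g_b = grad_cooperative(d, [0.0, 1.0], p, i=0)
print(g_a, g_b)
```

Expert 0 did nothing differently, yet its gradient changes because expert 1 moved the shared residual.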

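In this cooperative mode every expert contributes to the output. Even if one expert predicts poorly, the total error can still be small as long as the others compensate. This can cause the following:

Some experts become lazy (imprecision goes unpunished because the others make up for it);
Experts are hard to tell apart, so no real 'division of labor' emerges;
The gating network cannot clearly learn 'which expert is good at which kind of data'.

This is the so-called 'dilution of responsibility', a form of the credit assignment problem.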
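A two-expert toy case makes the compensation effect concrete. The numbers are hypothetical: each expert on its own is off by 1, but their equally weighted blend hits the target exactly, so the blended error of equation (1) gives neither expert any reason to improve.

```python
d = 1.0
outputs = [0.0, 2.0]   # hypothetical scalar outputs of two experts
p = [0.5, 0.5]         # hypothetical gate proportions

# Error of the blended output, as in equation (1).
blend = sum(p_i * o_i for p_i, o_i in zip(p, outputs))
combined_error = (d - blend) ** 2
# Error each expert would make if it had to answer alone.
individual_errors = [(d - o_i) ** 2 for o_i in outputs]
print(combined_error, individual_errors)
```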

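Hard competition

A simpler remedy is to redefine the error function so that the local experts are encouraged to compete rather than cooperate. Instead of linearly combining the outputs of all experts, imagine that only one expert is used on each occasion, chosen stochastically by the gating network with probability $p_i^c$. The error is then the expected squared difference between the desired output and the output of the selected expert:

$ E^{c}=\left\langle\left\| d^{c}-o_{i}^{c}\right\| ^{2}\right\rangle=\sum_{i} p_{i}^{c}\left\| d^{c}-o_{i}^{c}\right\| ^{2} \tag 2 $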
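Differentiating equation (2) with respect to one expert's output gives $\frac{\partial E^c}{\partial o_i^c} = -2\, p_i^c \left( d^c - o_i^c \right)$, which contains no other expert's output. The sketch below, with scalar outputs and hypothetical numbers, checks this decoupling. It also uses random sampling to confirm that picking one expert with probability $p_i^c$ and averaging its squared error approaches the sum in equation (2).

```python
import random

def competitive_error(d, outputs, p):
    """Equation (2): gate-weighted average of each expert's own squared error."""
    return sum(p_i * (d - o_i) ** 2 for p_i, o_i in zip(p, outputs))

def grad_competitive(d, outputs, p, i):
    """dE/do_i for equation (2) with scalar outputs: -2 * p_i * (d - o_i)."""
    return -2.0 * p[i] * (d - outputs[i])

d, p = 1.0, [0.5, 0.5]
# Changing expert 1's output leaves expert 0's gradient untouched.
g_a = grad_competitive(d, [0.0, 0.0], p, i=0)
g_b = grad_competitive(d, [0.0, 3.0], p, i=0)

# Monte Carlo check: pick one expert per trial with probability p_i and
# average its squared error; the mean should approach equation (2).
rng = random.Random(0)
outputs = [0.0, 3.0]
trials = 20000
sampled = sum(
    (d - outputs[rng.choices(range(len(p)), weights=p)[0]]) ** 2
    for _ in range(trials)
) / trials
exact = competitive_error(d, outputs, p)
print(g_a, g_b, exact, sampled)
```

Because each expert is now judged on its own output, an expert can no longer rely on the others to cancel its error.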

Experiment code

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, ConcatDataset
from tqdm import tqdm


class SingleExpert(nn.Module):
    """A single expert: a small three-layer MLP."""

    def __init__(self, input_dim=28 * 28, output_dim=20):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, output_dim)

    def forward(self, x):
        x = x.view(x.size(0), -1)  # flatten each image to a vector
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)


class MoE(nn.Module):
    """Mixture of experts with a linear gating network."""

    def __init__(self, input_dim=28 * 28, output_dim=20, num_experts=4):
        super().__init__()
        self.num_experts = num_experts
        self.experts = nn.ModuleList(
            [SingleExpert(input_dim, output_dim) for _ in range(num_experts)]
        )
        # Gating network: decides which expert to use.
        self.gating_network = nn.Linear(input_dim, num_experts)

    def forward(self, x):
        x_flat = x.view(x.size(0), -1)
        # Gate probabilities, shape [batch_size, num_experts].
        gate_outputs = torch.softmax(self.gating_network(x_flat), dim=1)
        # Output of every expert, shape [batch_size, num_experts, output_dim].
        expert_outputs = torch.stack([expert(x) for expert in self.experts], dim=1)
        # Sample one expert per example according to the gate probabilities.
        # squeeze(1) keeps shape [batch_size] even when batch_size == 1.
        expert_indices = torch.multinomial(gate_outputs, num_samples=1).squeeze(1)
        # Pick the sampled expert's output, shape [batch_size, output_dim].
        batch_index = torch.arange(x.size(0), device=x.device)
        final_output = expert_outputs[batch_index, expert_indices]
        return final_output, expert_outputs, gate_outputs, expert_indices


def moe_loss(targets, expert_outputs, gate_outputs):
    """Loss 3 ('moe'): negative log-likelihood of a gate-weighted Gaussian mixture.

    targets:        true output vectors, [batch_size, output_dim]
    expert_outputs: output of every expert, [batch_size, num_experts, output_dim]
    gate_outputs:   gate probabilities, [batch_size, num_experts]
    """
    # Squared error of each expert, shape [batch_size, num_experts].
    errors = torch.sum((expert_outputs - targets.unsqueeze(1)) ** 2, dim=2)
    # Gaussian-shaped likelihood exp(-0.5 * error), weighted by the gate.
    weighted = gate_outputs * torch.exp(-0.5 * errors)
    # Sum over experts and take -log; 1e-8 guards against log(0).
    loss = -torch.log(torch.sum(weighted, dim=1) + 1e-8)
    return loss.mean()


def mse_hard_loss(targets, expert_outputs, gate_outputs):
    """Loss 1 ('mse_hard'): equation (1), squared error of the gate-weighted blend.

    Argument shapes are the same as in moe_loss.
    """
    # Step 1: blend the expert outputs with the gate weights, [batch_size, output_dim].
    fused_output = torch.sum(gate_outputs.unsqueeze(-1) * expert_outputs, dim=1)
    # Step 2: squared L2 distance between the blend and the target, [batch_size].
    reconstruction_error = torch.sum((fused_output - targets) ** 2, dim=1)
    # Step 3: average over the batch.
    return torch.mean(reconstruction_error)


def mse_soft_loss(targets, expert_outputs, gate_outputs):
    """Loss 2 ('mse_soft'): equation (2), gate-weighted sum of per-expert squared errors.

    Argument shapes are the same as in moe_loss.
    """
    # Squared error of each expert, shape [batch_size, num_experts].
    errors = torch.sum((expert_outputs - targets.unsqueeze(1)) ** 2, dim=2)
    weighted_errors = gate_outputs * errors  # [batch_size, num_experts]
    # Sum over experts (as in equation (2)), then average over the batch.
    return weighted_errors.sum(dim=1).mean()


def one_hot_encoding(labels, num_classes=10):
    # Build the identity matrix on the same device as the labels.
    return torch.eye(num_classes, device=labels.device)[labels]


def train(model, dataloader, optimizer, device, num_classes, num_epochs=10, loss_name='moe'):
    """Train the model and count how often each expert is sampled."""
    loss_fns = {'moe': moe_loss, 'mse_hard': mse_hard_loss, 'mse_soft': mse_soft_loss}
    loss_fn = loss_fns[loss_name]
    model.train()
    expert_selection_count = torch.zeros(model.num_experts, dtype=torch.long, device=device)
    for epoch in range(num_epochs):
        running_loss, correct, total = 0.0, 0, 0
        for inputs, labels in tqdm(dataloader):
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            final_output, expert_outputs, gate_outputs, expert_indices = model(inputs)
            # Count how many times each expert was selected in this batch.
            expert_selection_count += torch.bincount(expert_indices, minlength=model.num_experts)
            one_hot_labels = one_hot_encoding(labels, num_classes=num_classes)
            loss = loss_fn(one_hot_labels, expert_outputs, gate_outputs)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            predicted = final_output.argmax(dim=1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {running_loss / len(dataloader):.4f}, '
              f'Accuracy: {100 * correct / total:.2f}%')
    print("\nExpert selection counts (training set):")
    for i, count in enumerate(expert_selection_count):
        print(f'Expert {i}: selected {count.item()} times')
    return expert_selection_count


def test_with_expert_statistics(model, dataloader, device, dataset_name=""):
    """Evaluate accuracy and count expert selections over a whole dataset."""
    model.eval()
    correct, total = 0, 0
    expert_selection_count = torch.zeros(model.num_experts, dtype=torch.long, device=device)
    with torch.no_grad():
        for inputs, labels in dataloader:
            inputs, labels = inputs.to(device), labels.to(device)
            final_output, _, _, expert_indices = model(inputs)
            expert_selection_count += torch.bincount(expert_indices, minlength=model.num_experts)
            predicted = final_output.argmax(dim=1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    accuracy = 100 * correct / total
    print(f'\n{dataset_name} test accuracy: {accuracy:.2f}%')
    print(f"\n{dataset_name} expert selection counts (dataset level):")
    for i, count in enumerate(expert_selection_count):
        print(f'Expert {i}: selected {count.item()} times, share {100 * count.item() / total:.2f}%')
    return accuracy, expert_selection_count


if __name__ == "__main__":
    # Main experiment
    batch_size = 1024
    num_experts = 4
    loss_name = ['moe', 'mse_hard', 'mse_soft'][1]
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

    # Data preprocessing
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,)),
    ])

    # Load MNIST and Fashion-MNIST
    mnist_train = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
    mnist_test = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)
    fashion_mnist_train = torchvision.datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform)
    fashion_mnist_test = torchvision.datasets.FashionMNIST(root='./data', train=False, download=True, transform=transform)

    # Shift Fashion-MNIST labels by 10 so they do not collide with MNIST labels.
    fashion_mnist_train.targets = fashion_mnist_train.targets + 10
    fashion_mnist_test.targets = fashion_mnist_test.targets + 10

    # Merge the training sets and the test sets.
    combined_train_data = ConcatDataset([mnist_train, fashion_mnist_train])
    combined_test_data = ConcatDataset([mnist_test, fashion_mnist_test])
    print(f"Training samples: {len(combined_train_data)}")
    print(f"Test samples: {len(combined_test_data)}")

    # Class names: MNIST is '0' to '9', followed by the Fashion-MNIST classes.
    mnist_classes = [str(i) for i in range(10)]
    fashion_mnist_classes = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
                             'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
    combined_classes = mnist_classes + fashion_mnist_classes
    num_classes = len(combined_classes)

    train_loader = DataLoader(combined_train_data, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(combined_test_data, batch_size=batch_size, shuffle=False)

    # MoE model
    moe_model = MoE(output_dim=num_classes, num_experts=num_experts).to(device)
    optimizer_moe = optim.Adam(moe_model.parameters(), lr=0.001)

    print("\nTraining the MoE model...")
    train_expert_selection = train(moe_model, train_loader, optimizer_moe, device,
                                   num_classes=num_classes, num_epochs=10, loss_name=loss_name)

    # Evaluate on each dataset separately.
    print("Testing the MoE model on MNIST and Fashion-MNIST...")
    print("Testing on MNIST...")
    mnist_test_loader = DataLoader(mnist_test, batch_size=batch_size, shuffle=False)
    test_accuracy_mnist, mnist_expert_selection = test_with_expert_statistics(
        moe_model, mnist_test_loader, device, dataset_name="MNIST")

    print("Testing on Fashion-MNIST...")
    fashion_mnist_test_loader = DataLoader(fashion_mnist_test, batch_size=batch_size, shuffle=False)
    test_accuracy_fashion_mnist, fashion_mnist_expert_selection = test_with_expert_statistics(
        moe_model, fashion_mnist_test_loader, device, dataset_name="Fashion-MNIST")
```
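For reference, the `moe_loss` function in the script above can be written in the notation of equations (1) and (2). The numbering (3) is an assumption taken from the 'MoE (3)' label in the results tables; the formula itself simply transcribes what the code computes (ignoring the small constant added inside the logarithm for numerical safety):

$ E^{c}=-\log \sum_{i} p_{i}^{c}\, e^{-\frac{1}{2}\left\| d^{c}-o_{i}^{c}\right\| ^{2}} \tag 3 $

Differentiating with respect to one expert's output gives

$ \frac{\partial E^c}{\partial o_i^c} = -\left[ \frac{p_i^c\, e^{-\frac{1}{2}\left\| d^c - o_i^c \right\|^2}}{\sum_{j} p_j^c\, e^{-\frac{1}{2}\left\| d^c - o_j^c \right\|^2}} \right] \left( d^c - o_i^c \right) $

The bracketed factor is the share of the total likelihood that expert i accounts for on case c. An expert that already fits the case well therefore receives a larger update than one that fits it poorly, whereas under equation (2) the weighting is just $p_i^c$.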

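Experiment analysis

| Loss function | MNIST accuracy | Fashion-MNIST accuracy | Training loss decrease | Expert diversity |
|---|---|---|---|---|
| MSE-Hard (1) | 93.73% | 85.36% | Moderate | Balanced across experts |
| MSE-Soft (2) | 97.57% | 88.30% | Very fast | Expert collapse |
| MoE (3) | 97.67% | 87.77% | Fast | Skewed toward a single expert (MNIST) |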

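| Loss function | MNIST expert distribution | Fashion-MNIST expert distribution | Specialized? | Sparse? |
|---|---|---|---|---|
| MSE-Hard | Spread across all experts (45%/19%/12%/23%) | Spread across all experts (43%/12%/16%/28%) | No | No |
| MSE-Soft | Expert 3 takes 99.86% | Experts 2 and 3 dominate | Collapsed | Pseudo-sparse |
| MoE Loss | Expert 2 takes 100% | Expert 2 (74%) + expert 1 (26%) | Yes | Yes |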
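One way to put a number on 'balanced' versus 'concentrated' expert usage is the normalized entropy of the selection shares: 1 means perfectly even usage and 0 means a single expert handles everything. This metric is an assumption added here for illustration and is not part of the original analysis. The shares are copied from the table above, and the MSE-Soft rows are left out because the table does not give their full distributions.

```python
import math

def normalized_entropy(shares):
    """Entropy of an expert-usage distribution, scaled to [0, 1]."""
    total = sum(shares)
    probs = [s / total for s in shares]  # renormalize: table values are rounded
    entropy = sum(q * math.log(1 / q) for q in probs if q > 0)
    return entropy / math.log(len(shares))

# Selection shares in percent for experts 0..3, taken from the table.
usage = {
    "MSE-Hard / MNIST": [45, 19, 12, 23],
    "MSE-Hard / Fashion-MNIST": [43, 12, 16, 28],
    "MoE / MNIST": [0, 0, 100, 0],
    "MoE / Fashion-MNIST": [0, 26, 74, 0],
}
scores = {name: normalized_entropy(shares) for name, shares in usage.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

A low score alone does not distinguish useful specialization from collapse, so it should be read together with the per-dataset accuracies.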
