Introduction
The best way to understand the core architecture of large language models is to implement one yourself. This article walks you through building a simplified LLaMA-style Mixture-of-Experts (MoE) model from scratch using Python and PyTorch. We will dig into the practical implementation of key components such as the MoE layer, RoPE positional encoding, and RMSNorm normalization.
Architecture Overview
The core idea of MoE is similar to assembling a team of specialists: instead of having one massive network handle every task, several smaller "expert" networks each handle what they are good at, while a router decides which experts process each input.
Take the sentence "The cat sat" as an example:
- Tokenization: split the text into tokens.
- Routing: the router analyzes each token's features and decides which experts to invoke (a noun, for instance, might be handled by an expert that is strong at semantics).
- Weighted combination: the selected experts produce their outputs, which are summed using the routing weights to form the final output.
This design preserves model capacity while significantly reducing the compute cost at inference time. Next, we break down the implementation step by step.
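The three routing steps above can be sketched in a few lines of PyTorch. This is a minimal illustration with hypothetical sizes (`d_model=8`, `num_experts=4`, `top_k=2`) and plain linear layers as "experts", not the article's final MoE layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    """Toy top-k MoE layer: route each token to k experts, blend by weight."""
    def __init__(self, d_model=8, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # routing logits per token
        self.experts = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(num_experts)
        )

    def forward(self, x):                               # x: (num_tokens, d_model)
        logits = self.router(x)                         # (num_tokens, num_experts)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)            # renormalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens sent to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

x = torch.randn(5, 8)   # 5 tokens, each an 8-dim vector
moe = SimpleMoE()
y = moe(x)
print(y.shape)          # torch.Size([5, 8])
```

Note that each token activates only `top_k` of the experts, which is exactly where the inference savings come from.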
Environment Setup and Data Loading
First, import the required libraries and configure the device. For ease of demonstration we use character-level tokenization, with a corpus taken from a passage of Alice's Adventures in Wonderland.
import torch
import torch.nn as nn
from torch.nn import functional as F
import math
import os
# Device configuration
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")
# Define the training corpus
corpus_raw = """ Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?' So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her. """
Tokenization and Encoding
Computers cannot work with raw characters directly, so we need a mapping from characters to integers. Here we adopt a simple character-level tokenization strategy.
# Collect all unique characters
chars = sorted(list(set(corpus_raw)))
vocab_size = len(chars)
char_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_char = {i: ch for i, ch in enumerate(chars)}
print(f"Vocabulary size: {vocab_size}")
encoded_corpus = [char_to_int[ch] for ch in corpus_raw]
full_data_sequence = torch.tensor(encoded_corpus, dtype=torch.long, device=device)
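Before moving on, it is worth verifying that the encoding is lossless: decoding the integers should recover the original text exactly. A small self-contained sanity check (restating the mappings above on a shortened corpus):

```python
# Rebuild the character-level mappings on a short sample corpus
corpus_raw = "Alice was beginning to get very tired"
chars = sorted(set(corpus_raw))
char_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_char = {i: ch for i, ch in enumerate(chars)}

# Round trip: text -> integers -> text
sample = "Alice"
encoded = [char_to_int[ch] for ch in sample]
decoded = ''.join(int_to_char[i] for i in encoded)
print(encoded)
print(decoded)   # prints "Alice"
```

Character-level tokenization keeps the vocabulary tiny (a few dozen symbols), which is ideal for a demo, though real LLMs use subword tokenizers such as BPE for efficiency.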


