Stanford CS336 Assignment in Practice: Implementing a Transformer Language Model Architecture from Scratch | 极客日志

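A detailed walkthrough of a Stanford CS336 assignment, covering from-scratch implementations of the core components of a Transformer language model: the linear layer, Embedding, RMSNorm, the SwiGLU feed-forward network, RoPE positional encoding, and causal multi-head self-attention. It focuses on each module's code logic and numerical-stability handling, tallies parameter counts and FLOPs for a GPT-2 XL-scale model, and analyzes how the compute cost is split among components. The pieces are finally assembled into a complete Transformer LM architecture, giving a hands-on basis for understanding how large models work underneath.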

月光旅人 · Published 2026/4/11 · Updated 2026/7/1929 views

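In the previous discussion we went over the requirements of the Transformer Language Model assignment. Today we dig into the implementation details of Assignment 1. These notes record the build-up from basic modules to the full architecture, to help readers understand the code-level logic underlying large models.

1. Linear Layer Implementation (Linear)

We need to implement a Linear class that inherits from torch.nn.Module. At its core it performs a linear transformation, but without a bias parameter. The interface should stay consistent with PyTorch's built-in module.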

import math
import torch
from torch import nn

class Linear(nn.Module):
    """ A bias-free Linear layer that matches torch.nn.Linear's interface (except it has no bias)
        Stores weight as W with shape (out_features, in_features)
    """
    def __init__(self, in_features: int, out_features: int, device=None, dtype=None):
        super().__init__()
        self.in_features = int(in_features)
        self.out_features = int(out_features)
        # Store W (NOT W^T): shape (d_out, d_in)
        self.weight = nn.Parameter(
            torch.empty((self.out_features, self.in_features), device=device, dtype=dtype)
        )
        # Init: N(0, 2/(d_in+d_out)), truncated to [-3σ,3σ]
        sigma = math.sqrt(2.0/(self.in_features + self.out_features))
        nn.init.trunc_normal_(self.weight, mean=0.0, std=sigma, a=-3.0*sigma, b=3.0*sigma)

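    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., in_features) -> output: (..., out_features)
        # Computes y = x @ W^T over the last dimension; leading dims are batch dims
        return torch.einsum("...i,oi->...o", x, self.weight)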
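
As a dependency-free illustration of the layer above, the following plain-Python sketch spells out what the contraction computes when W is stored with shape (out_features, in_features), and evaluates the initialization standard deviation. The helper names `init_sigma` and `linear_no_bias` are hypothetical and not part of the assignment code.

```python
import math

def init_sigma(in_features: int, out_features: int) -> float:
    """Std of the truncated-normal init: sqrt(2 / (d_in + d_out))."""
    return math.sqrt(2.0 / (in_features + out_features))

def linear_no_bias(x: list[float], W: list[list[float]]) -> list[float]:
    """y[o] = sum_i W[o][i] * x[i], i.e. y = W x for a single input vector."""
    return [sum(w_oi * x_i for w_oi, x_i in zip(row, x)) for row in W]

# W has shape (out_features=2, in_features=3)
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
x = [2.0, 4.0, 6.0]
y = linear_no_bias(x, W)      # one output value per row of W
sigma = init_sigma(3, 2)      # weights are truncated to [-3*sigma, 3*sigma]
```

Storing W rather than its transpose keeps the parameter layout the same as torch.nn.Linear, which should make it easier to load reference weights.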


2. Embedding Layer Implementation (Embedding)

import torch
from torch import nn

class Embedding(nn.Module):
    """ A learnable embedding lookup table, equivalent to torch.nn.Embedding
        This module maps integer token IDs to continuous vectors of fixed dimensionality.
    """
    def __init__(self, num_embeddings: int, embedding_dim: int, device=None, dtype=None):
        super().__init__()
        self.num_embeddings = int(num_embeddings)
        self.embedding_dim = int(embedding_dim)
        # Weight matrix with shape (vocab_size, d_model)
        self.weight = nn.Parameter(
            torch.empty((self.num_embeddings, self.embedding_dim), device=device, dtype=dtype)
        )
        # Init: N(0, 1), truncated to [-3, 3]
        nn.init.trunc_normal_(self.weight, mean=0.0, std=1.0, a=-3.0, b=3.0)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (...) int -> output: (..., d_model)
        return self.weight[token_ids]
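
The lookup `self.weight[token_ids]` simply selects rows of the weight matrix. A minimal plain-Python sketch, with hypothetical helper names `embed` and `embed_one_hot`, shows that this is the same as multiplying one-hot vectors by the matrix, only without the wasted arithmetic.

```python
def embed(weight: list[list[float]], token_ids: list[int]) -> list[list[float]]:
    """Row lookup: one d_model-sized vector per token ID."""
    return [weight[t] for t in token_ids]

def embed_one_hot(weight, token_ids):
    """Same result via one-hot vectors times the weight matrix (much slower)."""
    vocab, d_model = len(weight), len(weight[0])
    out = []
    for t in token_ids:
        one_hot = [1.0 if v == t else 0.0 for v in range(vocab)]
        out.append([sum(one_hot[v] * weight[v][j] for v in range(vocab))
                    for j in range(d_model)])
    return out

weight = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # (vocab_size=3, d_model=2)
ids = [2, 0, 2]
looked_up = embed(weight, ids)                   # rows 2, 0, 2 of the table
```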

3. RMSNorm Normalization

import torch
from torch import nn

class RMSNorm(nn.Module):
    """ Root Mean Square Layer Normalization (RMSNorm). 
        For an input vector a in R^{d_model}:
        RMS(a) = sqrt(mean(a^2) + eps)
        RMSNorm(a) = (a / RMS(a)) * g
    """
    def __init__(self, d_model: int, eps: float = 1e-5, device=None, dtype=None):
        super().__init__()
        self.d_model = int(d_model)
        self.eps = float(eps)
        # Learnable gain parameter (g), shape (d_model,)
        self.weight = nn.Parameter(torch.ones((self.d_model,), device=device, dtype=dtype))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        in_dtype = x.dtype
        x_fp32 = x.to(torch.float32)
        # Compute RMS over the last dimension: sqrt(mean(x^2) + eps)
        rms = torch.sqrt(x_fp32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        # Normalize and apply gain; do the math in float32, then cast back
        y = (x_fp32 / rms) * self.weight.to(torch.float32)
        return y.to(in_dtype)
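
To make the formula in the docstring concrete, here is a small worked example in plain Python. The function `rms_norm` is a hypothetical stand-in that follows the same formula for a single vector; it is a sketch, not the module itself.

```python
import math

def rms_norm(a: list[float], gain: list[float], eps: float = 1e-5) -> list[float]:
    """RMSNorm(a) = a / sqrt(mean(a^2) + eps) * g, elementwise over one vector."""
    mean_sq = sum(v * v for v in a) / len(a)
    rms = math.sqrt(mean_sq + eps)
    return [v / rms * g for v, g in zip(a, gain)]

a = [3.0, 4.0]                       # mean(a^2) = (9 + 16) / 2 = 12.5
out = rms_norm(a, gain=[1.0, 1.0])   # gain initialised to ones, as in the module
out_rms = math.sqrt(sum(v * v for v in out) / len(out))  # close to 1 after normalisation
scaled = rms_norm([30.0, 40.0], gain=[1.0, 1.0])         # same input scaled by 10x
```

Up to the small eps term, the output does not depend on the scale of the input, which is the point of the normalization.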

4. SwiGLU Feed-Forward Network

import math
import torch
from torch import nn

def round_up_to_multiple(x: int, multiple: int) -> int:
    if multiple <= 0:
        raise ValueError("multiple must be a positive integer")
    return int(((x + multiple - 1) // multiple) * multiple)

def default_d_ff(d_model: int, multiple_of: int = 64) -> int:
    raw = int(math.ceil((8.0 * d_model) / 3.0))
    return round_up_to_multiple(raw, multiple_of)

class SwiGLU(nn.Module):
    """ Position-wise feed-forward network using the SwiGLU nonlinearity.
        The transformation is: FFN(x) = W2( SiLU(W1 x) ⊙ (W3 x) )
    """
    def __init__(self, d_model: int, d_ff: int | None = None, *, multiple_of: int = 64,
                 device=None, dtype=None):
        super().__init__()
        self.d_model = int(d_model)
        self.d_ff = int(d_ff) if d_ff is not None else default_d_ff(self.d_model, multiple_of)
        # Two up-projections and one down-projection (no bias)
        self.w1 = Linear(self.d_model, self.d_ff, device=device, dtype=dtype)
        self.w2 = Linear(self.d_ff, self.d_model, device=device, dtype=dtype)
        self.w3 = Linear(self.d_model, self.d_ff, device=device, dtype=dtype)

    @staticmethod
    def silu(x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(x)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.w1(x)
        b = self.w3(x)
        gated = self.silu(a) * b
        return self.w2(gated)
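
The two ideas in this block, the rounded hidden size and the gated activation, can be checked by hand. The sketch below re-implements them in plain Python under the same formulas; the example values for `a` and `b` are made-up stand-ins for W1 x and W3 x.

```python
import math

def default_d_ff(d_model: int, multiple_of: int = 64) -> int:
    """ceil(8/3 * d_model), rounded up to the next multiple of `multiple_of`."""
    raw = math.ceil(8 * d_model / 3)
    return ((raw + multiple_of - 1) // multiple_of) * multiple_of

def silu(x: float) -> float:
    """SiLU(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

sizes = {d: default_d_ff(d) for d in (512, 768, 1600)}

# The gate multiplies SiLU(W1 x) by (W3 x) elementwise
a, b = [0.0, 2.0, -2.0], [5.0, 1.0, 1.0]     # stand-ins for W1 x and W3 x
gated = [silu(ai) * bi for ai, bi in zip(a, b)]
```

Rounding d_ff up to a multiple of 64 is a hardware-friendliness choice; it slightly increases the parameter count relative to the exact 8/3 ratio.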

5. RoPE Rotary Positional Encoding

import torch
from torch import nn

class RoPE(nn.Module):
    """ Rotary Positional Embeddings (RoPE). Applies a position-dependent rotation to the last dimension.
    """
    def __init__(self, theta: float, d_k: int, max_seq_len: int, device=None):
        super().__init__()
        if d_k % 2 != 0:
            raise ValueError(f"d_k must be even for RoPE, got d_k={d_k}")
        if max_seq_len <= 0:
            raise ValueError(f"max_seq_len must be positive, got {max_seq_len}")
        self.theta = float(theta)
        self.d_k = int(d_k)
        self.max_seq_len = int(max_seq_len)
        
        # Precompute inverse frequencies
        pair_idx = torch.arange(0, self.d_k, 2, device=device, dtype=torch.float32)
        inv_freq = self.theta ** (-pair_idx / self.d_k)
        positions = torch.arange(self.max_seq_len, device=device, dtype=torch.float32)
        angles = positions[:, None] * inv_freq[None, :]
        cos = torch.cos(angles)
        sin = torch.sin(angles)
        
        self.register_buffer("cos", cos, persistent=False)
        self.register_buffer("sin", sin, persistent=False)

    def forward(self, x: torch.Tensor, token_positions: torch.Tensor) -> torch.Tensor:
        if x.size(-1) != self.d_k:
            raise ValueError(f"Expected x.size(-1)==d_k=={self.d_k}, got {x.size(-1)}")
        
        pos = token_positions.to(device=x.device)
        cos = self.cos.index_select(0, pos.reshape(-1)).reshape(*pos.shape, -1)
        sin = self.sin.index_select(0, pos.reshape(-1)).reshape(*pos.shape, -1)
        
        x_fp32 = x.to(torch.float32)
        cos = cos.to(torch.float32)
        sin = sin.to(torch.float32)
        
        x_even = x_fp32[..., ::2]
        x_odd = x_fp32[..., 1::2]
        
        while cos.dim() < x_even.dim():
            cos = cos.unsqueeze(cos.dim()-2)
            sin = sin.unsqueeze(sin.dim()-2)
        
        out_even = x_even * cos - x_odd * sin
        out_odd = x_even * sin + x_odd * cos
        out = torch.stack((out_even, out_odd), dim=-1).flatten(-2)
        return out.to(dtype=x.dtype)
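
A useful property to verify is that, after rotation, the query-key dot product depends only on the distance between positions. The plain-Python sketch below uses the same pairing of even and odd channels and the same frequency formula as the module; `rope_rotate` and `dot` are hypothetical helpers, and theta=10000 is an assumed default.

```python
import math

def rope_rotate(x: list[float], pos: int, theta: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs (x[2k], x[2k+1]) by angle pos * theta**(-2k/d)."""
    d = len(x)
    out = []
    for k in range(d // 2):
        angle = pos * theta ** (-2 * k / d)
        c, s = math.cos(angle), math.sin(angle)
        x_even, x_odd = x[2 * k], x[2 * k + 1]
        out += [x_even * c - x_odd * s, x_even * s + x_odd * c]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]
# Same offset (3 positions apart), different absolute positions
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```

The two scores agree up to floating-point error, and position 0 leaves the vector unchanged because every rotation angle is zero.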

6. Numerically Stable Softmax

def softmax(x: torch.Tensor, dim: int) -> torch.Tensor:
    """ Numerically stable softmax over a given dimension."""
    x_max = torch.amax(x, dim=dim, keepdim=True)
    z = x - x_max
    exp_z = torch.exp(z)
    sum_exp = torch.sum(exp_z, dim=dim, keepdim=True)
    return exp_z / sum_exp
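
Why subtract the maximum? The sketch below contrasts a naive softmax with the max-subtracted version on large logits, in plain Python. Here the naive version raises OverflowError; in PyTorch the same overflow would surface as inf and nan values rather than an exception.

```python
import math

def softmax_naive(xs):
    exps = [math.exp(v) for v in xs]          # overflows for large inputs
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(xs):
    m = max(xs)                                # subtract the max first
    exps = [math.exp(v - m) for v in xs]       # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

logits = [1000.0, 1001.0, 1002.0]
try:
    softmax_naive(logits)
    naive_failed = False
except OverflowError:
    naive_failed = True                        # math.exp(1000.0) exceeds float range

probs = softmax_stable(logits)                 # same as softmax of [-2, -1, 0]
```

Subtracting a constant from every logit does not change the result, since the factor exp(-max) cancels between numerator and denominator.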

7. Scaled Dot-Product Attention

import math
import torch

def scaled_dot_product_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    mask: torch.Tensor | None = None,
) -> torch.Tensor:
    """ Scaled dot-product attention."""
    if query.dim() < 2 or key.dim() < 2 or value.dim() < 2:
        raise ValueError("query/key/value must have shape (..., seq_len, d_*)")
    if query.shape[:-2] != key.shape[:-2] or query.shape[:-2] != value.shape[:-2]:
        raise ValueError("batch dimensions of query, key, value must match")
    
    d_k = query.shape[-1]
    if d_k != key.shape[-1]:
        raise ValueError("query and key must have the same d_k")
    
    # Compute attention logits in float32 for stability
    q = query.to(torch.float32)
    k = key.to(torch.float32)
    v = value.to(torch.float32)
    scale = 1.0 / math.sqrt(d_k)
    
    logits = torch.einsum("... s d, ... t d -> ... s t", q, k) * scale
    
    if mask is not None:
        if mask.dtype != torch.bool:
            raise TypeError("mask must be a boolean tensor")
        neg_inf = torch.finfo(torch.float32).min
        logits = torch.where(mask.to(device=logits.device), logits, neg_inf)
    
    probs = softmax(logits, dim=-1)
    if mask is not None:
        probs = probs * mask.to(device=probs.device, dtype=probs.dtype)
    
    out = torch.einsum("... s t, ... t d -> ... s d", probs, v)
    return out.to(dtype=value.dtype)
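
The effect of the boolean mask is easiest to see on a tiny example. The following plain-Python sketch implements the same computation for 2-D lists. It uses -inf for masked positions instead of the finite minimum used above, and it assumes every row has at least one allowed position (true for a causal mask). The function name `attention` is hypothetical.

```python
import math

def attention(q, k, v, mask=None):
    """softmax(q k^T / sqrt(d_k)) v for 2-D lists; mask[s][t] True means 'may attend'."""
    d_k = len(q[0])
    out = []
    for s, q_s in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_s, k_t)) / math.sqrt(d_k) for k_t in k]
        if mask is not None:
            scores = [sc if mask[s][t] else -math.inf for t, sc in enumerate(scores)]
        m = max(scores)
        weights = [math.exp(sc - m) for sc in scores]      # exp(-inf) == 0.0
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(w * v_t[j] for w, v_t in zip(weights, v))
                    for j in range(len(v[0]))])
    return out

seq_len = 3
causal = [[t <= s for t in range(seq_len)] for s in range(seq_len)]  # lower-triangular
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attention(q, k, v, mask=causal)

# Changing the last value vector must not affect earlier positions
v_changed = [[1.0, 2.0], [3.0, 4.0], [-9.0, 9.0]]
out_changed = attention(q, k, v_changed, mask=causal)
```

With the causal mask, the first position can only attend to itself, so its output equals the first value vector.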

8. Causal Multi-Head Self-Attention

import math
import torch
from torch import nn

class CausalMultiHeadSelfAttention(nn.Module):
    """ Causal multi-head self-attention (no RoPE)."""
    def __init__(self, d_model: int, num_heads: int, device=None, dtype=None):
        super().__init__()
        self.d_model = int(d_model)
        self.num_heads = int(num_heads)
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        self.head_dim = self.d_model // self.num_heads
        
        self.q_proj = Linear(self.d_model, self.d_model, device=device, dtype=dtype)
        self.k_proj = Linear(self.d_model, self.d_model, device=device, dtype=dtype)
        self.v_proj = Linear(self.d_model, self.d_model, device=device, dtype=dtype)
        self.o_proj = Linear(self.d_model, self.d_model, device=device, dtype=dtype)

    @staticmethod
    def _causal_mask(seq_len: int, device: torch.device) -> torch.Tensor:
        return torch.tril(torch.ones((seq_len, seq_len), device=device, dtype=torch.bool))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.size(-1) != self.d_model:
            raise ValueError(f"Expected last dim {self.d_model}, got {x.size(-1)}")
        
        seq_len = x.size(-2)
        device = x.device
        
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)
        
        new_shape = q.shape[:-1] + (self.num_heads, self.head_dim)
        q = q.view(new_shape).transpose(-3, -2)
        k = k.view(new_shape).transpose(-3, -2)
        v = v.view(new_shape).transpose(-3, -2)
        
        mask = self._causal_mask(seq_len, device=device)
        out = scaled_dot_product_attention(q, k, v, mask=mask)
        
        out = out.transpose(-3, -2).contiguous().view(x.shape[:-1] + (self.d_model,))
        return self.o_proj(out)

class CausalMultiHeadSelfAttentionWithRoPE(nn.Module):
    """ Causal multi-head self-attention with RoPE applied to Q and K (not V)."""
    def __init__(self, d_model: int, num_heads: int, theta: float, max_seq_len: int, device=None, dtype=None):
        super().__init__()
        self.d_model = int(d_model)
        self.num_heads = int(num_heads)
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        self.head_dim = self.d_model // self.num_heads
        
        self.q_proj = Linear(self.d_model, self.d_model, device=device, dtype=dtype)
        self.k_proj = Linear(self.d_model, self.d_model, device=device, dtype=dtype)
        self.v_proj = Linear(self.d_model, self.d_model, device=device, dtype=dtype)
        self.output_proj = Linear(self.d_model, self.d_model, device=device, dtype=dtype)
        self.rope = RoPE(theta=theta, d_k=self.head_dim, max_seq_len=max_seq_len, device=device)

    @staticmethod
    def _causal_mask(seq_len: int, device: torch.device) -> torch.Tensor:
        return torch.tril(torch.ones((seq_len, seq_len), device=device, dtype=torch.bool))

    def forward(self, x: torch.Tensor, token_positions: torch.Tensor) -> torch.Tensor:
        if x.size(-1) != self.d_model:
            raise ValueError(f"Expected last dim {self.d_model}, got {x.size(-1)}")
        
        seq_len = x.size(-2)
        device = x.device
        
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)
        
        new_shape = q.shape[:-1] + (self.num_heads, self.head_dim)
        q = q.view(new_shape).transpose(-3, -2)
        k = k.view(new_shape).transpose(-3, -2)
        v = v.view(new_shape).transpose(-3, -2)
        
        q = self.rope(q, token_positions)
        k = self.rope(k, token_positions)
        
        mask = self._causal_mask(seq_len, device=device)
        out = scaled_dot_product_attention(q, k, v, mask=mask)
        
        out = out.transpose(-3, -2).contiguous().view(x.shape[:-1] + (self.d_model,))
        return self.output_proj(out)
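
The `view(...).transpose(-3, -2)` step and its inverse are the part of these classes most prone to shape mistakes. A plain-Python sketch with nested lists, using hypothetical helpers `split_heads` and `merge_heads`, shows the intended rearrangement for a single sequence.

```python
def split_heads(x, num_heads):
    """(seq_len, d_model) -> (num_heads, seq_len, head_dim) using nested lists."""
    d_model = len(x[0])
    head_dim = d_model // num_heads
    return [[row[h * head_dim:(h + 1) * head_dim] for row in x]
            for h in range(num_heads)]

def merge_heads(heads):
    """(num_heads, seq_len, head_dim) -> (seq_len, d_model): concatenate per position."""
    seq_len = len(heads[0])
    return [[val for head in heads for val in head[s]] for s in range(seq_len)]

x = [[0, 1, 2, 3, 4, 5],        # seq_len = 2, d_model = 6
     [6, 7, 8, 9, 10, 11]]
heads = split_heads(x, num_heads=3)   # head_dim = 2
restored = merge_heads(heads)         # inverse of split_heads
```

Each head sees a contiguous slice of the channel dimension for every position, and merging concatenates those slices back in head order before the output projection.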

9. Transformer Block

import torch
from torch import nn

class TransformerBlock(nn.Module):
    """ Pre-norm Transformer block. Structure (pre-norm):
        y = x + Attn(RMSNorm(x))
        z = y + FFN(RMSNorm(y))
    """
    def __init__(self, d_model: int, num_heads: int, d_ff: int, *, max_seq_len: int, theta: float, eps: float = 1e-5, device=None, dtype=None):
        super().__init__()
        self.d_model = int(d_model)
        self.num_heads = int(num_heads)
        self.d_ff = int(d_ff)
        self.ln1 = RMSNorm(self.d_model, eps=eps, device=device, dtype=dtype)
        self.attn = CausalMultiHeadSelfAttentionWithRoPE(
            d_model=self.d_model, num_heads=self.num_heads, theta=theta, max_seq_len=max_seq_len, device=device, dtype=dtype
        )
        self.ln2 = RMSNorm(self.d_model, eps=eps, device=device, dtype=dtype)
        self.ffn = SwiGLU(self.d_model, d_ff=self.d_ff, device=device, dtype=dtype)

    def forward(self, x: torch.Tensor, token_positions: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        x = x + self.attn(h, token_positions)
        h = self.ln2(x)
        x = x + self.ffn(h)
        return x

10. Assembling the Transformer LM

import torch
from torch import nn
from cs336_basics.modules import Embedding, Linear, RMSNorm, TransformerBlock

class TransformerLM(nn.Module):
    """ A Transformer language model composed of:
        token embedding -> N pre-norm Transformer blocks -> final RMSNorm -> LM head.
    """
    def __init__(self, vocab_size: int, context_length: int, d_model: int, num_layers: int, num_heads: int, d_ff: int, *, rope_theta: float, max_seq_len: int | None = None, eps: float = 1e-5, device=None, dtype=None):
        super().__init__()
        self.vocab_size = int(vocab_size)
        self.context_length = int(context_length)
        self.d_model = int(d_model)
        self.num_layers = int(num_layers)
        self.max_seq_len = int(max_seq_len if max_seq_len is not None else context_length)
        
        self.token_embeddings = Embedding(self.vocab_size, self.d_model, device=device, dtype=dtype)
        self.layers = nn.ModuleList([
            TransformerBlock(
                d_model=self.d_model, num_heads=num_heads, d_ff=d_ff,
                max_seq_len=self.max_seq_len, theta=rope_theta, eps=eps, device=device, dtype=dtype
            ) for _ in range(self.num_layers)
        ])
        self.ln_final = RMSNorm(self.d_model, eps=eps, device=device, dtype=dtype)
        self.lm_head = Linear(self.d_model, self.vocab_size, device=device, dtype=dtype)

    def forward(self, in_indices: torch.Tensor) -> torch.Tensor:
        if in_indices.dim() != 2:
            raise ValueError(f"in_indices must have shape (batch, seq_len), got {tuple(in_indices.shape)}")
        
        batch, seq_len = in_indices.shape
        if seq_len > self.context_length:
            raise ValueError(f"seq_len={seq_len} exceeds context_length={self.context_length}")
        
        token_positions = torch.arange(seq_len, device=in_indices.device, dtype=torch.long).view(1, seq_len)
        token_positions = token_positions.expand(batch, seq_len)
        
        x = self.token_embeddings(in_indices)
        for block in self.layers:
            x = block(x, token_positions)
        
        x = self.ln_final(x)
        logits = self.lm_head(x)
        return logits

11. Resource Accounting (FLOPs Accounting)

(a) GPT-2 XL parameter count
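
As a rough sketch of this accounting, the parameter count of the architecture built above can be tallied directly from its layer shapes. The configuration values below (vocabulary 50257, 48 layers, d_model 1600, d_ff 6400) are an assumed GPT-2 XL-like setup, and `count_parameters` is a hypothetical helper.

```python
def count_parameters(vocab_size, num_layers, d_model, d_ff):
    """Parameter count for the bias-free architecture built in this article."""
    embedding = vocab_size * d_model
    attention = 4 * d_model * d_model           # q, k, v and output projections
    ffn = 3 * d_model * d_ff                    # w1, w2, w3 of SwiGLU
    norms = 2 * d_model                         # two RMSNorm gains per block
    per_block = attention + ffn + norms
    final_norm = d_model
    lm_head = d_model * vocab_size              # not tied to the embedding here
    return embedding + num_layers * per_block + final_norm + lm_head

# Assumed GPT-2 XL-like configuration (an assumption, not taken from the text)
total = count_parameters(vocab_size=50257, num_layers=48, d_model=1600, d_ff=6400)
size_gb_fp32 = total * 4 / 1e9                  # 4 bytes per float32 parameter
```

Under these assumptions the model has roughly 2.13 billion parameters, or about 8.5 GB in float32. The RoPE cos and sin tables are buffers, not parameters, so they are not counted.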

(b) Forward-pass FLOPs
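
A minimal sketch of the forward-pass estimate, counting only matrix multiplications at 2*m*n*p FLOPs per (m x n) by (n x p) product and ignoring elementwise work such as normalization, softmax, and RoPE. The configuration is the same assumed GPT-2 XL-like setup with a 1024-token sequence, and `forward_flops` is a hypothetical helper.

```python
def forward_flops(vocab_size, context_length, num_layers, d_model, d_ff):
    """Matmul FLOPs for one sequence, counting 2*m*n*p per (m x n) @ (n x p) product."""
    T = context_length
    per_block = {
        "qkv_and_output_proj": 4 * 2 * T * d_model * d_model,
        "attention_scores_and_values": 2 * 2 * T * T * d_model,  # QK^T and probs @ V
        "swiglu_ffn": 3 * 2 * T * d_model * d_ff,
    }
    breakdown = {name: num_layers * f for name, f in per_block.items()}
    breakdown["lm_head"] = 2 * T * d_model * vocab_size
    return breakdown

# Assumed GPT-2 XL-like configuration (an assumption, not taken from the text)
flops = forward_flops(vocab_size=50257, context_length=1024,
                      num_layers=48, d_model=1600, d_ff=6400)
total = sum(flops.values())
shares = {name: f / total for name, f in flops.items()}   # fraction of total per component
```

Under these assumptions the total is about 4.5e12 FLOPs per sequence, with the feed-forward layers taking the largest share. Only the attention term grows quadratically with sequence length, so its share rises as the context gets longer.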

(c) Bottleneck analysis

(d) Effect of model scale

(e) Effect of sequence length
