CS336 Building a Language Model from Scratch: Transformer LM Architecture Implementation | 极客日志


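Summary (AI-generated): This article documents in detail the from-scratch implementation of the Transformer language model in Assignment 1 of Stanford's CS336 course. It covers the code for the linear layer, embedding layer, RMSNorm, the SwiGLU feed-forward network, RoPE positional encoding, softmax, scaled dot-product attention, and multi-head self-attention. These modules are then assembled into a complete Transformer block and Transformer LM, followed by an accounting of the parameter count and forward-pass FLOPs at GPT-2 XL scale, which shows how much of the compute each component consumes.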

神经兮兮 · Published 2026/4/6 · Updated 2026/5/20 · 32 views

Preface

This article documents my implementation of the Transformer Language Model Architecture part of CS336 Assignment 1: Basics, covering everything from the basic modules to the complete model.

Assignment 1: https://github.com/stanford-cs336/assignment1-basics/tree/main

1. Problem (linear): Implementing the linear module (1 point)

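Deliverable: Implement a Linear class that inherits from torch.nn.Module and performs a linear transformation. Your implementation should follow the interface of PyTorch's built-in nn.Linear module, except that it has no bias argument or bias parameter.

The recommended interface is:

def __init__(self, in_features, out_features, device=None, dtype=None)

Constructs a linear transformation module. This function should accept the following parameters:

in_features: int: final dimension of the input
out_features: int: final dimension of the output
device: torch.device | None = None: device to store the parameters on
dtype: torch.dtype | None = None: data type of the parameters

def forward(self, x: torch.Tensor) -> torch.Tensor

Applies the linear transformation to the input tensor.

Make sure to:

Subclass nn.Module
Call the superclass constructor (super().__init__())
Construct and store the parameter matrix as W (not W^T) for memory-ordering reasons, and put it in an nn.Parameter
Not use nn.Linear or nn.functional.linear

For parameter initialization, use the initialization settings from the assignment handout together with torch.nn.init.trunc_normal_ to initialize the weights.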
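To make the initialization rule concrete, here is a minimal sketch in plain Python (no PyTorch) that only computes the numbers passed to trunc_normal_. It assumes the handout's rule of a normal distribution with variance 2 / (in_features + out_features), truncated at three standard deviations; the helper name linear_init_bounds and the example sizes are made up for illustration.

```python
import math


def linear_init_bounds(in_features: int, out_features: int) -> tuple[float, float, float]:
    """Return (std, lower, upper) for the truncated-normal weight init.

    Assumes variance = 2 / (in_features + out_features) and truncation at +/- 3 std.
    """
    std = math.sqrt(2.0 / (in_features + out_features))
    return std, -3.0 * std, 3.0 * std


# Hypothetical layer sizes, chosen so the numbers come out round.
std, lower, upper = linear_init_bounds(512, 1536)
print(f"std={std}, truncation range=[{lower}, {upper}]")
```

With these sizes the variance is 2 / 2048 = 1 / 1024, so the standard deviation is 1 / 32 and samples are clipped to roughly [-0.094, 0.094].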

The implementation is as follows:

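import math
import torch
from torch import nn

class Linear(nn.Module):
    """ A bias-free Linear layer that matches torch.nn.Linear's interface (except it has no bias)
        Stores weight as W with shape (out_features, in_features)
    """
    def __init__(self, in_features: int, out_features: int, device: torch.device | None = None, dtype: torch.dtype | None = None):
        super().__init__()
        self.in_features = int(in_features)
        self.out_features = int(out_features)
        # Weight W with shape (out_features, in_features), stored as a learnable parameter
        self.weight = nn.Parameter(
            torch.empty((self.out_features, self.in_features), device=device, dtype=dtype)
        )
        # Init: N(0, sigma^2) with sigma^2 = 2 / (in_features + out_features), truncated to [-3*sigma, 3*sigma]
        sigma = math.sqrt(2.0 / (self.in_features + self.out_features))
        nn.init.trunc_normal_(self.weight, mean=0.0, std=sigma, a=-3.0 * sigma, b=3.0 * sigma)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., in_features), weight: (out_features, in_features)
        # Computes y = x @ W^T, giving an output of shape (..., out_features)
        return torch.einsum("...i,oi->...o", x, self.weight)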


def run_linear(
    d_in: int,
    d_out: int,
    weights: Float[Tensor, "d_out d_in"],
    in_features: Float[Tensor, "... d_in"],
) -> Float[Tensor, "... d_out"]:
    """ Given the weights of a Linear layer, compute the transformation of a batched input."""
    from cs336_basics.modules import Linear
    layer = Linear(d_in, d_out, device=in_features.device, dtype=in_features.dtype)
    layer.load_state_dict({"weight": weights.to(device=in_features.device, dtype=in_features.dtype)})
    return layer(in_features)

2. Problem (embedding): Implement the embedding module (1 point)

def __init__(self, num_embeddings, embedding_dim, device=None, dtype=None)

def forward(self, token_ids: torch.Tensor) -> torch.Tensor

import torch
from torch import nn

class Embedding(nn.Module):
    """ A learnable embedding lookup table, equivalent to torch.nn.Embedding
        This module maps integer token IDs to continuous vectors of fixed dimensionality (embedding_dim).
        The embedding matrix is stored as a learnable parameter of shape (num_embeddings, embedding_dim).
    """
    def __init__(self, num_embeddings: int, embedding_dim: int, device: torch.device | None = None, dtype: torch.dtype | None = None):
        super().__init__()
        self.num_embeddings = int(num_embeddings)
        self.embedding_dim = int(embedding_dim)
        # Weight matrix with shape (vocab_size, d_model)
        self.weight = nn.Parameter(
            torch.empty((self.num_embeddings, self.embedding_dim), device=device, dtype=dtype)
        )
        # Init: N(0, 1), truncated to [-3, 3]
        nn.init.trunc_normal_(self.weight, mean=0.0, std=1.0, a=-3.0, b=3.0)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (...) int -> output: (..., d_model)
        return self.weight[token_ids]
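The lookup in forward is plain row indexing. As a rough illustration in pure Python (nested lists instead of tensors, hypothetical helper names), the sketch below shows that indexing gives the same rows as multiplying a one-hot matrix by the embedding table, while skipping the matrix product entirely.

```python
def embed(table: list[list[float]], token_ids: list[int]) -> list[list[float]]:
    """Look up one embedding row per token id (row indexing, no matmul)."""
    return [table[t] for t in token_ids]


def one_hot_matmul(table: list[list[float]], token_ids: list[int]) -> list[list[float]]:
    """Same result computed the slow way, as one_hot(token_ids) @ table."""
    vocab_size, d_model = len(table), len(table[0])
    out = []
    for t in token_ids:
        one_hot = [1.0 if v == t else 0.0 for v in range(vocab_size)]
        out.append([sum(one_hot[v] * table[v][j] for v in range(vocab_size)) for j in range(d_model)])
    return out


# Toy table: vocab_size = 3, d_model = 2.
table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
ids = [2, 0, 2]
looked_up = embed(table, ids)
print(looked_up)
```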

def run_embedding(
    vocab_size: int,
    d_model: int,
    weights: Float[Tensor, "vocab_size d_model"],
    token_ids: Int[Tensor, "..."],
) -> Float[Tensor, "... d_model"]:
    """ Given the weights of an Embedding layer, get the embeddings for a batch of token ids."""
    from cs336_basics.modules import Embedding
    layer = Embedding(vocab_size, d_model, device=weights.device, dtype=weights.dtype)
    layer.load_state_dict({"weight": weights.to(device=weights.device, dtype=weights.dtype)})
    return layer(token_ids)

3. Problem (rmsnorm): Root Mean Square Layer Normalization (1 point)

def __init__(self, d_model:int, eps:float=1e-5, device=None, dtype=None)

def forward(self, x: torch.Tensor) -> torch.Tensor

import torch
from torch import nn

class RMSNorm(nn.Module):
    """ Root Mean Square Layer Normalization (RMSNorm). For an input vector a in R^{d_model}:
        RMS(a) = sqrt(mean(a^2) + eps)
        RMSNorm(a) = (a / RMS(a)) * g
    """
    def __init__(self, d_model: int, eps: float = 1e-5, device: torch.device | None = None, dtype: torch.dtype | None = None):
        super().__init__()
        self.d_model = int(d_model)
        self.eps = float(eps)
        # Learnable gain parameter (g), shape (d_model,)
        self.weight = nn.Parameter(torch.ones((self.d_model,), device=device, dtype=dtype))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        in_dtype = x.dtype
        x_fp32 = x.to(torch.float32)
        # Compute RMS over the last dimension: sqrt(mean(x^2) + eps)
        rms = torch.sqrt(x_fp32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        # Normalize and apply gain; do the math in float32, then cast back
        y = (x_fp32 / rms) * self.weight.to(torch.float32)
        return y.to(in_dtype)
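A small worked example may help check the formula by hand. The sketch below is a pure-Python version of the same computation on a single vector (the function name and inputs are illustrative only); with eps set to 0 the output has a mean square of exactly 1, and rescaling the input leaves the output unchanged.

```python
import math


def rmsnorm(x: list[float], gain: list[float], eps: float = 1e-5) -> list[float]:
    """RMSNorm over one vector: x / sqrt(mean(x^2) + eps) * gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    rms = math.sqrt(mean_sq + eps)
    return [v / rms * g for v, g in zip(x, gain)]


# For x = [3, 4]: mean(x^2) = 12.5, so RMS = sqrt(12.5).
y = rmsnorm([3.0, 4.0], gain=[1.0, 1.0], eps=0.0)
y_scaled = rmsnorm([30.0, 40.0], gain=[1.0, 1.0], eps=0.0)
print(y, y_scaled)
```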

def run_rmsnorm(
    d_model: int,
    eps: float,
    weights: Float[Tensor, "d_model"],
    in_features: Float[Tensor, "... d_model"],
) -> Float[Tensor, "... d_model"]:
    """ Given the weights of a RMSNorm affine transform, return the output of running RMSNorm on the input features."""
    from cs336_basics.modules import RMSNorm
    layer = RMSNorm(d_model=d_model, eps=eps, device=in_features.device, dtype=weights.dtype)
    layer.load_state_dict({"weight": weights.to(device=in_features.device, dtype=weights.dtype)})
    return layer(in_features)

4. Problem (positionwise_feedforward): Implement the position-wise feed-forward network (2 points)

import math
import torch
from torch import nn

def round_up_to_multiple(x: int, multiple: int) -> int:
    """Round x up to the nearest positive multiple of `multiple`."""
    if multiple <= 0:
        raise ValueError("multiple must be a positive integer")
    return int(((x + multiple - 1) // multiple) * multiple)

def default_d_ff(d_model: int, multiple_of: int = 64) -> int:
    """ Compute the recommended SwiGLU hidden size. We use d_ff ~= (8/3) * d_model and then round up to a hardware-friendly multiple (typically 64). """
    raw = int(math.ceil((8.0 * d_model) / 3.0))
    return round_up_to_multiple(raw, multiple_of)

class SwiGLU(nn.Module):
    """ Position-wise feed-forward network using the SwiGLU nonlinearity.
        The transformation is: FFN(x) = W2( SiLU(W1 x) ⊙ (W3 x) )
        where SiLU(z) = z * sigmoid(z), and ⊙ is elementwise multiplication.
    """
    def __init__(self, d_model: int, d_ff: int | None = None, *, multiple_of: int = 64, device: torch.device | None = None, dtype: torch.dtype | None = None):
        super().__init__()
        self.d_model = int(d_model)
        self.d_ff = int(d_ff) if d_ff is not None else default_d_ff(self.d_model, multiple_of)
        # Two up-projections and one down-projection (no bias)
        self.w1 = Linear(self.d_model, self.d_ff, device=device, dtype=dtype)
        self.w2 = Linear(self.d_ff, self.d_model, device=device, dtype=dtype)
        self.w3 = Linear(self.d_model, self.d_ff, device=device, dtype=dtype)

    @staticmethod
    def silu(x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(x)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.w1(x)
        b = self.w3(x)
        gated = self.silu(a) * b
        return self.w2(gated)
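The hidden-size rule used by default_d_ff is easy to verify with a little arithmetic. The sketch below repeats the two helper functions without the PyTorch imports and evaluates them for a few model widths; the widths are arbitrary examples, not values prescribed by the article.

```python
import math


def round_up_to_multiple(x: int, multiple: int) -> int:
    """Round x up to the nearest positive multiple of `multiple`."""
    return ((x + multiple - 1) // multiple) * multiple


def default_d_ff(d_model: int, multiple_of: int = 64) -> int:
    """d_ff ~= (8/3) * d_model, rounded up to a multiple of `multiple_of`."""
    return round_up_to_multiple(math.ceil(8 * d_model / 3), multiple_of)


# Example widths (illustrative): 8/3 * 512 = 1365.33 -> 1366 -> 1408, and so on.
sizes = {d_model: default_d_ff(d_model) for d_model in (512, 768, 1600)}
print(sizes)
```

Because SwiGLU has three weight matrices instead of two, choosing d_ff near (8/3) * d_model keeps the parameter count close to that of a two-matrix FFN with d_ff = 4 * d_model.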

def run_swiglu(
    d_model: int,
    d_ff: int,
    w1_weight: Float[Tensor, "d_ff d_model"],
    w2_weight: Float[Tensor, "d_model d_ff"],
    w3_weight: Float[Tensor, "d_ff d_model"],
    in_features: Float[Tensor, "... d_model"],
) -> Float[Tensor, "... d_model"]:
    """ Given the weights of a SwiGLU network, return the output of your implementation with these weights."""
    from cs336_basics.modules import SwiGLU
    swiglu = SwiGLU(d_model=d_model, d_ff=d_ff, device=in_features.device, dtype=w1_weight.dtype)
    swiglu.load_state_dict({
        "w1.weight": w1_weight.to(device=in_features.device, dtype=w1_weight.dtype),
        "w2.weight": w2_weight.to(device=in_features.device, dtype=w2_weight.dtype),
        "w3.weight": w3_weight.to(device=in_features.device, dtype=w3_weight.dtype)
    })
    return swiglu(in_features)

5. Problem (rope): Implement RoPE (2 points)

def __init__(self, theta:float, d_k:int, max_seq_len:int, device=None)

def forward(self, x: torch.Tensor, token_positions: torch.Tensor) -> torch.Tensor

import torch
from torch import nn

class RoPE(nn.Module):
    """ Rotary Positional Embeddings (RoPE). Applies a position-dependent rotation to the last dimension (d_k) of an input tensor.
        The rotation is applied pairwise on (x[..., 0], x[..., 1]), (x[..., 2], x[..., 3]), ...
        This module has no learnable parameters. It can precompute and cache cos/sin table.
    """
    def __init__(self, theta: float, d_k: int, max_seq_len: int, device: torch.device | None = None):
        super().__init__()
        if d_k % 2 != 0:
            raise ValueError(f"d_k must be even for RoPE, got d_k={d_k}")
        if max_seq_len <= 0:
            raise ValueError(f"max_seq_len must be positive, got {max_seq_len}")
        self.theta = float(theta)
        self.d_k = int(d_k)
        self.max_seq_len = int(max_seq_len)
        # Precompute inverse frequencies for even indices:
        # inv_freq[j] = theta^(-2j/d_k), where j indexes pairs (0, 1, ..., d_k/2 - 1).
        pair_idx = torch.arange(0, self.d_k, 2, device=device, dtype=torch.float32)
        inv_freq = self.theta ** (-pair_idx / self.d_k)
        # Positions [0, 1, ..., max_seq_len-1]
        positions = torch.arange(self.max_seq_len, device=device, dtype=torch.float32)
        # Angles: (max_seq_len, d_k/2)
        angles = positions[:, None] * inv_freq[None, :]
        cos = torch.cos(angles)  # (max_seq_len, d_k/2)
        sin = torch.sin(angles)  # (max_seq_len, d_k/2)
        # Cache as non-persistent buffers (not saved in state_dict)
        self.register_buffer("cos", cos, persistent=False)
        self.register_buffer("sin", sin, persistent=False)

    def forward(self, x: torch.Tensor, token_positions: torch.Tensor) -> torch.Tensor:
        """ Args:
            x: Tensor of shape (..., seq_len, d_k)
            token_positions: Tensor of shape (..., seq_len) with integer positions
            Returns: Tensor of shape (..., seq_len, d_k) after applying RoPE.
        """
        if x.size(-1) != self.d_k:
            raise ValueError(f"Expected x.size(-1)==d_k=={self.d_k}, got {x.size(-1)}")
        # token_positions is used to index the cached cos/sin tables along the sequence dimension.
        # Shapes after indexing: (..., seq_len, d_k/2)
        pos = token_positions.to(device=x.device)
        cos = self.cos.index_select(0, pos.reshape(-1)).reshape(*pos.shape, -1)
        sin = self.sin.index_select(0, pos.reshape(-1)).reshape(*pos.shape, -1)
        # Promote to float32 for numerical stability, then cast back
        x_fp32 = x.to(torch.float32)
        cos = cos.to(torch.float32)
        sin = sin.to(torch.float32)
        x_even = x_fp32[..., ::2]  # (..., seq_len, d_k/2)
        x_odd = x_fp32[..., 1::2]  # (..., seq_len, d_k/2)
        # make cos/sin broadcastable for inputs like (B, H, S, d_k)
        while cos.dim() < x_even.dim():
            cos = cos.unsqueeze(cos.dim()-2)
            sin = sin.unsqueeze(sin.dim()-2)
        # Apply 2D rotation for each pair.
        out_even = x_even * cos - x_odd * sin
        out_odd = x_even * sin + x_odd * cos
        # Interleave even/odd back to (..., seq_len, d_k)
        out = torch.stack((out_even, out_odd), dim=-1).flatten(-2)
        return out.to(dtype=x.dtype)
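The reason for rotating Q and K this way is that their dot product then depends only on the relative offset between positions. The sketch below checks this numerically in pure Python; it follows the same pairing and frequency formula as the module above, but the function names and test vectors are invented for the example.

```python
import math


def rope(x: list[float], position: int, theta: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs (x[0], x[1]), (x[2], x[3]), ... by position-dependent angles."""
    d_k = len(x)
    out = []
    for j in range(0, d_k, 2):
        angle = position * theta ** (-j / d_k)  # same inverse frequency as in the module
        c, s = math.cos(angle), math.sin(angle)
        out.extend((x[j] * c - x[j + 1] * s, x[j] * s + x[j + 1] * c))
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Both pairs of positions have the same offset (2), so the scores should agree.
score_a = dot(rope(q, 5), rope(k, 3))
score_b = dot(rope(q, 12), rope(k, 10))
print(score_a, score_b)
```

Rotations also preserve length, so applying RoPE does not change the norm of a query or key vector.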

def run_rope(
    d_k: int,
    theta: float,
    max_seq_len: int,
    in_query_or_key: Float[Tensor, "... sequence_length d_k"],
    token_positions: Int[Tensor, "... sequence_length"],
) -> Float[Tensor, "... sequence_length d_k"]:
    """ Run RoPE for a given input tensor."""
    from cs336_basics.modules import RoPE
    rope = RoPE(theta=theta, d_k=d_k, max_seq_len=max_seq_len, device=in_query_or_key.device)
    return rope(in_query_or_key, token_positions)

6. Problem (softmax): Implement softmax (1 point)

def softmax(x: torch.Tensor, dim: int) -> torch.Tensor:
    """ Numerically stable softmax over a given dimension.
        This implementation subtracts the maximum value along `dim` before exponentiation to improve numerical stability.
    """
    # Subtract max for numerical stability (keepdim for correct broadcasting)
    x_max = torch.amax(x, dim=dim, keepdim=True)
    z = x - x_max
    exp_z = torch.exp(z)
    sum_exp = torch.sum(exp_z, dim=dim, keepdim=True)
    return exp_z / sum_exp
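To see why the maximum is subtracted first, consider logits around 1000. The pure-Python sketch below (illustrative function names) compares a naive softmax with the stable one. In plain Python math.exp raises OverflowError for such inputs; in PyTorch the exponential would instead become inf and the division would produce nan.

```python
import math


def naive_softmax(xs: list[float]) -> list[float]:
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def stable_softmax(xs: list[float]) -> list[float]:
    m = max(xs)  # shifting by the max leaves the result unchanged mathematically
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1000.0, 1001.0, 1002.0]
try:
    naive_softmax(logits)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

probs = stable_softmax(logits)
print(naive_overflowed, probs)
```

The stable version returns the same probabilities as softmax([0, 1, 2]), since only the differences between logits matter.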

def run_softmax(in_features: Float[Tensor, "..."], dim: int) -> Float[Tensor, "..."]:
    """ Given a tensor of inputs, return the output of softmaxing the given `dim` of the input."""
    from cs336_basics.modules import softmax
    return softmax(in_features, dim)

7. Problem (scaled_dot_product_attention): Implement scaled dot-product attention (5 points)

import math
import torch

def scaled_dot_product_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    mask: torch.Tensor | None = None
) -> torch.Tensor:
    """ Scaled dot-product attention."""
    if query.dim() < 2 or key.dim() < 2 or value.dim() < 2:
        raise ValueError("query/key/value must have shape (..., seq_len, d_*)")
    if query.shape[:-2] != key.shape[:-2] or query.shape[:-2] != value.shape[:-2]:
        raise ValueError("batch dimensions of query, key, value must match")
    d_k = query.shape[-1]
    if d_k != key.shape[-1]:
        raise ValueError("query and key must have the same d_k")
    
    # Compute attention logits in float32 for stability
    q = query.to(torch.float32)
    k = key.to(torch.float32)
    v = value.to(torch.float32)
    scale = 1.0 / math.sqrt(d_k)
    
    # logits: (..., seq_len, seq_len)
    logits = torch.einsum("... s d, ... t d -> ... s t", q, k) * scale
    
    if mask is not None:
        if mask.dtype != torch.bool:
            raise TypeError("mask must be a boolean tensor")
        # Broadcast mask to logits shape: (..., seq_len, seq_len)
        # True = keep, False = mask out.
        neg_inf = torch.finfo(torch.float32).min
        logits = torch.where(mask.to(device=logits.device), logits, neg_inf)
    
    # probs: (..., seq_len, seq_len)
    probs = softmax(logits, dim=-1)
    
    if mask is not None:
        # Ensure exact zeros on masked positions
        probs = probs * mask.to(device=probs.device, dtype=probs.dtype)
    
    # out: (..., seq_len, d_v)
    out = torch.einsum("... s t, ... t d -> ... s d", probs, v)
    
    # Cast back to the original value dtype
    return out.to(dtype=value.dtype)
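For intuition, the same computation can be traced on a tiny example without tensors. The following pure-Python sketch is a simplified stand-in for the function above, not a replacement: it handles a single unbatched sequence, uses -inf for masked scores (the PyTorch code above uses the most negative float32 value instead), and assumes every query row has at least one allowed key.

```python
import math


def _softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) is 0.0, so masked keys get zero weight
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V for one sequence; mask[i][j] True means query i may see key j."""
    d_k = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if mask is not None:
            scores = [s if mask[i][j] else -math.inf for j, s in enumerate(scores)]
        weights = _softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out


Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
causal = [[True, False], [True, True]]
result = attention(Q, K, V, mask=causal)
print(result)
```

With the causal mask, the first query can only attend to the first key, so its output is exactly the first value row; the second query mixes both value rows.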

def run_scaled_dot_product_attention(
    Q: Float[Tensor, "... queries d_k"],
    K: Float[Tensor, "... keys d_k"],
    V: Float[Tensor, "... values d_v"],
    mask: Bool[Tensor, "... queries keys"] | None = None,
) -> Float[Tensor, "... queries d_v"]:
    """ Given key (K), query (Q), and value (V) tensors, return the output of your scaled dot product attention implementation."""
    from cs336_basics.modules import scaled_dot_product_attention
    return scaled_dot_product_attention(query=Q, key=K, value=V, mask=mask)

8. Problem (multihead_self_attention): Implement causal multi-head self-attention (5 points)

import math
import torch
from torch import nn

class CausalMultiHeadSelfAttention(nn.Module):
    """ Causal multi-head self-attention (no RoPE). This module computes:
        Q = W_Q x, K = W_K x, V = W_V x
        heads = SDPA(Q_heads, K_heads, V_heads, causal_mask)
        out = W_O concat(heads)
    """
    def __init__(self, d_model: int, num_heads: int, device: torch.device | None = None, dtype: torch.dtype | None = None):
        super().__init__()
        self.d_model = int(d_model)
        self.num_heads = int(num_heads)
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        self.head_dim = self.d_model // self.num_heads  # d_k = d_v = d_model / h
        # Separate projections (one matmul each). Combining into one is an optional optimization
        self.q_proj = Linear(self.d_model, self.d_model, device=device, dtype=dtype)
        self.k_proj = Linear(self.d_model, self.d_model, device=device, dtype=dtype)
        self.v_proj = Linear(self.d_model, self.d_model, device=device, dtype=dtype)
        self.o_proj = Linear(self.d_model, self.d_model, device=device, dtype=dtype)

    @staticmethod
    def _causal_mask(seq_len: int, device: torch.device) -> torch.Tensor:
        """ Build a (seq_len, seq_len) causal mask where True means 'allowed' """
        return torch.tril(torch.ones((seq_len, seq_len), device=device, dtype=torch.bool))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """ Args:
            x: Tensor of shape (..., seq_len, d_model)
            Returns: Tensor of shape (..., seq_len, d_model)
        """
        if x.size(-1) != self.d_model:
            raise ValueError(f"Expected last dim {self.d_model}, got {x.size(-1)}")
        seq_len = x.size(-2)
        device = x.device
        # Project to Q, K, V: (..., seq_len, d_model)
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)
        # Reshape into heads: (..., seq_len, num_heads, head_dim)
        # Then move heads into a batch-like dimension: (..., num_heads, seq_len, head_dim)
        new_shape = q.shape[:-1] + (self.num_heads, self.head_dim)
        q = q.view(new_shape).transpose(-3, -2)
        k = k.view(new_shape).transpose(-3, -2)
        v = v.view(new_shape).transpose(-3, -2)
        # Causal mask shared across heads and batches
        mask = self._causal_mask(seq_len, device=device)
        # SDPA: (..., num_heads, seq_len, head_dim)
        out = scaled_dot_product_attention(q, k, v, mask=mask)
        # Merge heads: (..., seq_len, d_model)
        out = out.transpose(-3, -2).contiguous().view(x.shape[:-1] + (self.d_model,))
        # Output projection: (..., seq_len, d_model)
        return self.o_proj(out)

class CausalMultiHeadSelfAttentionWithRoPE(nn.Module):
    """ Causal multi-head self-attention with RoPE applied to Q and K (not V). This version uses separate Q, K, and V projections so that the parameter names match the reference state_dict keys. """
    def __init__(self, d_model: int, num_heads: int, theta: float, max_seq_len: int, device: torch.device | None = None, dtype: torch.dtype | None = None):
        super().__init__()
        self.d_model = int(d_model)
        self.num_heads = int(num_heads)
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        self.head_dim = self.d_model // self.num_heads
        # Separate projections (matches reference state_dict keys).
        self.q_proj = Linear(self.d_model, self.d_model, device=device, dtype=dtype)
        self.k_proj = Linear(self.d_model, self.d_model, device=device, dtype=dtype)
        self.v_proj = Linear(self.d_model, self.d_model, device=device, dtype=dtype)
        self.output_proj = Linear(self.d_model, self.d_model, device=device, dtype=dtype)
        # RoPE operates on per-head dimension
        self.rope = RoPE(theta=theta, d_k=self.head_dim, max_seq_len=max_seq_len, device=device)

    @staticmethod
    def _causal_mask(seq_len: int, device: torch.device) -> torch.Tensor:
        """Build a (seq_len, seq_len) causal mask where True means 'allowed'."""
        return torch.tril(torch.ones((seq_len, seq_len), device=device, dtype=torch.bool))

    def forward(self, x: torch.Tensor, token_positions: torch.Tensor) -> torch.Tensor:
        """ Args:
            x: Tensor of shape (..., seq_len, d_model)
            token_positions: Tensor of shape (..., seq_len)
            Returns: Tensor of shape (..., seq_len, d_model)
        """
        if x.size(-1) != self.d_model:
            raise ValueError(f"Expected last dim {self.d_model}, got {x.size(-1)}")
        seq_len = x.size(-2)
        device = x.device
        # Project to Q, K, V: (..., seq_len, d_model)
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)
        # Reshape into heads: (..., seq_len, num_heads, head_dim)
        # Then transpose to (..., num_heads, seq_len, head_dim)
        new_shape = q.shape[:-1] + (self.num_heads, self.head_dim)
        q = q.view(new_shape).transpose(-3, -2)
        k = k.view(new_shape).transpose(-3, -2)
        v = v.view(new_shape).transpose(-3, -2)
        # Apply RoPE to Q and K for each head (heads are treated as batch-like dims)
        q = self.rope(q, token_positions)
        k = self.rope(k, token_positions)
        # Causal mask shared across heads and batches
        mask = self._causal_mask(seq_len, device=device)
        # Attention: (..., num_heads, seq_len, head_dim)
        out = scaled_dot_product_attention(q, k, v, mask=mask)
        # Merge heads back: (..., seq_len, d_model)
        out = out.transpose(-3, -2).contiguous().view(x.shape[:-1] + (self.d_model,))
        return self.output_proj(out)
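The view and transpose calls in both attention classes only regroup features; no values change. A rough pure-Python sketch of that bookkeeping for a single sequence is shown below, together with the lower-triangular causal mask. The helper names are hypothetical, and nested lists stand in for tensors.

```python
def split_heads(x: list[list[float]], num_heads: int) -> list[list[list[float]]]:
    """(seq_len, d_model) -> (num_heads, seq_len, head_dim): head h gets features [h*head_dim, (h+1)*head_dim)."""
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    head_dim = d_model // num_heads
    return [[row[h * head_dim:(h + 1) * head_dim] for row in x] for h in range(num_heads)]


def merge_heads(heads: list[list[list[float]]]) -> list[list[float]]:
    """(num_heads, seq_len, head_dim) -> (seq_len, d_model), concatenating heads per position."""
    seq_len = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(seq_len)]


def causal_mask(seq_len: int) -> list[list[bool]]:
    """True where key position j is at or before query position i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


x = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]  # seq_len = 2, d_model = 4
heads = split_heads(x, num_heads=2)
print(heads, causal_mask(3))
```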

def run_multihead_self_attention(
    d_model: int,
    num_heads: int,
    q_proj_weight: Float[Tensor, "d_k d_in"],
    k_proj_weight: Float[Tensor, "d_k d_in"],
    v_proj_weight: Float[Tensor, "d_v d_in"],
    o_proj_weight: Float[Tensor, "d_model d_v"],
    in_features: Float[Tensor, "... sequence_length d_in"],
) -> Float[Tensor, "... sequence_length d_out"]:
    """ Given the key, query, and value projection weights of a naive unbatched implementation of multi-head attention, return the output of an optimized batched implementation."""
    from cs336_basics.modules import CausalMultiHeadSelfAttention
    mha = CausalMultiHeadSelfAttention(d_model=d_model, num_heads=num_heads, device=in_features.device, dtype=q_proj_weight.dtype)
    mha.load_state_dict({
        "q_proj.weight": q_proj_weight.to(device=in_features.device, dtype=q_proj_weight.dtype),
        "k_proj.weight": k_proj_weight.to(device=in_features.device, dtype=k_proj_weight.dtype),
        "v_proj.weight": v_proj_weight.to(device=in_features.device, dtype=v_proj_weight.dtype),
        "o_proj.weight": o_proj_weight.to(device=in_features.device, dtype=o_proj_weight.dtype),
    })
    return mha(in_features)

def run_multihead_self_attention_with_rope(
    d_model: int,
    num_heads: int,
    max_seq_len: int,
    theta: float,
    q_proj_weight: Float[Tensor, "d_k d_in"],
    k_proj_weight: Float[Tensor, "d_k d_in"],
    v_proj_weight: Float[Tensor, "d_v d_in"],
    o_proj_weight: Float[Tensor, "d_model d_v"],
    in_features: Float[Tensor, "... sequence_length d_in"],
    token_positions: Int[Tensor, "... sequence_length"] | None = None,
) -> Float[Tensor, "... sequence_length d_out"]:
    """ Given the key, query, and value projection weights of a naive unbatched implementation of multi-head attention, return the output of an optimized batched implementation. This version of MHA should include RoPE."""
    from cs336_basics.modules import CausalMultiHeadSelfAttentionWithRoPE
    if token_positions is None:
        # Default positions: 0..seq_len-1, broadcast to batch-like dims
        seq_len = in_features.size(-2)
        token_positions = torch.arange(seq_len, device=in_features.device, dtype=torch.long)
        token_positions = token_positions.view(*([1]*(in_features.dim()-2)), seq_len)
    mha = CausalMultiHeadSelfAttentionWithRoPE(
        d_model=d_model, num_heads=num_heads, theta=theta, max_seq_len=max_seq_len, device=in_features.device, dtype=q_proj_weight.dtype
    )
    mha.load_state_dict({
        "q_proj.weight": q_proj_weight.to(device=in_features.device, dtype=q_proj_weight.dtype),
        "k_proj.weight": k_proj_weight.to(device=in_features.device, dtype=k_proj_weight.dtype),
        "v_proj.weight": v_proj_weight.to(device=in_features.device, dtype=v_proj_weight.dtype),
        "output_proj.weight": o_proj_weight.to(device=in_features.device, dtype=o_proj_weight.dtype),
    })
    return mha(in_features, token_positions)

9. Problem (transformer_block): Implement the Transformer block (3 points)

uv run pytest -k test_transformer_block

import torch
from torch import nn

class TransformerBlock(nn.Module):
    """ Pre-norm Transformer block. Structure (pre-norm):
        y = x + Attn(RMSNorm(x))
        z = y + FFN(RMSNorm(y))
        This block uses causal multi-head self-attention with RoPE
    """
    def __init__(self, d_model: int, num_heads: int, d_ff: int, *, max_seq_len: int, theta: float, eps: float = 1e-5, device: torch.device | None = None, dtype: torch.dtype | None = None):
        super().__init__()
        self.d_model = int(d_model)
        self.num_heads = int(num_heads)
        self.d_ff = int(d_ff)
        self.ln1 = RMSNorm(self.d_model, eps=eps, device=device, dtype=dtype)
        self.attn = CausalMultiHeadSelfAttentionWithRoPE(
            d_model=self.d_model, num_heads=self.num_heads, theta=theta, max_seq_len=max_seq_len, device=device, dtype=dtype
        )
        self.ln2 = RMSNorm(self.d_model, eps=eps, device=device, dtype=dtype)
        self.ffn = SwiGLU(self.d_model, d_ff=self.d_ff, device=device, dtype=dtype)

    def forward(self, x: torch.Tensor, token_positions: torch.Tensor) -> torch.Tensor:
        """ Args:
            x: Tensor of shape (batch, seq_len, d_model)
            token_positions: Tensor of shape (batch, seq_len) or broadcastable to it
            Returns: Tensor of shape (batch, seq_len, d_model)
        """
        # Pre-norm attention + residual
        h = self.ln1(x)
        x = x + self.attn(h, token_positions)
        # Pre-norm FFN + residual
        h = self.ln2(x)
        x = x + self.ffn(h)
        return x
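The pre-norm wiring can be separated from the actual sublayers. The sketch below is a schematic in pure Python where the norms, attention, and FFN are passed in as arbitrary functions (all names here are placeholders); it shows that the residual stream carries the input through untouched whenever the sublayers contribute nothing.

```python
from typing import Callable

Vector = list[float]
Sublayer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    return [p + q for p, q in zip(a, b)]


def pre_norm_block(x: Vector, norm1: Sublayer, attn: Sublayer, norm2: Sublayer, ffn: Sublayer) -> Vector:
    """y = x + attn(norm1(x)); z = y + ffn(norm2(y))."""
    y = add(x, attn(norm1(x)))
    return add(y, ffn(norm2(y)))


def identity(v: Vector) -> Vector:
    return list(v)


def zeros(v: Vector) -> Vector:
    return [0.0] * len(v)


def double(v: Vector) -> Vector:
    return [2.0 * e for e in v]


x = [1.0, -2.0, 3.0]
unchanged = pre_norm_block(x, identity, zeros, identity, zeros)
mixed = pre_norm_block(x, identity, double, identity, zeros)  # y = x + 2x = 3x
print(unchanged, mixed)
```

In pre-norm, normalization is applied only on the branch feeding each sublayer, never on the residual path itself.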

def run_transformer_block(
    d_model: int,
    num_heads: int,
    d_ff: int,
    max_seq_len: int,
    theta: float,
    weights: dict[str, Tensor],
    in_features: Float[Tensor, "batch sequence_length d_model"],
) -> Float[Tensor, "batch sequence_length d_model"]:
    """ Given the weights of a pre-norm Transformer block and input features, return the output of running the Transformer block on the input features. This function should use RoPE."""
    from cs336_basics.modules import TransformerBlock
    block = TransformerBlock(
        d_model=d_model, num_heads=num_heads, d_ff=d_ff, max_seq_len=max_seq_len, theta=theta, device=in_features.device, dtype=in_features.dtype
    )
    # Move weights to the right device/dtype
    sd = {k: v.to(device=in_features.device, dtype=in_features.dtype) for k, v in weights.items()}
    block.load_state_dict(sd)
    # Build token positions: shape (batch, seq_len)
    batch, seq_len, _ = in_features.shape
    token_position = torch.arange(seq_len, device=in_features.device, dtype=torch.long).view(1, seq_len).expand(batch, seq_len)
    return block(in_features, token_position)

10. Problem (transformer_lm): Implementing the Transformer LM (3 points)

uv run pytest -k test_transformer_lm

import torch
from torch import nn
from cs336_basics.modules import Embedding, Linear, RMSNorm, TransformerBlock

class TransformerLM(nn.Module):
    """ A Transformer language model composed of:
        token embedding -> N pre-norm Transformer blocks -> final RMSNorm -> LM head.
        This implementation uses RoPE inside each TransformerBlock's attention module.
    """
    def __init__(self, vocab_size: int, context_length: int, d_model: int, num_layers: int, num_heads: int, d_ff: int, *, rope_theta: float, max_seq_len: int | None = None, eps: float = 1e-5, device: torch.device | None = None, dtype: torch.dtype | None = None):
        super().__init__()
        self.vocab_size = int(vocab_size)
        self.context_length = int(context_length)
        self.d_model = int(d_model)
        self.num_layers = int(num_layers)
        self.max_seq_len = int(max_seq_len if max_seq_len is not None else context_length)
        # Token embedding table: (vocab_size, d_model)
        self.token_embeddings = Embedding(self.vocab_size, self.d_model, device=device, dtype=dtype)
        # Stack of pre-norm Transformer blocks
        self.layers = nn.ModuleList([
            TransformerBlock(
                d_model=self.d_model, num_heads=num_heads, d_ff=d_ff, max_seq_len=self.max_seq_len, theta=rope_theta, eps=eps, device=device, dtype=dtype
            ) for _ in range(self.num_layers)
        ])
        # Final normalization before the LM head
        self.ln_final = RMSNorm(self.d_model, eps=eps, device=device, dtype=dtype)
        # Output projection to vocabulary logits: weight shape (vocab_size, d_model)
        self.lm_head = Linear(self.d_model, self.vocab_size, device=device, dtype=dtype)

    def forward(self, in_indices: torch.Tensor) -> torch.Tensor:
        """ Args:
            in_indices: LongTensor of shape (batch, seq_len)
            Returns: logits: Tensor of shape (batch, seq_len, vocab_size)
        """
        if in_indices.dim() != 2:
            raise ValueError(f"in_indices must have shape (batch, seq_len), got {tuple(in_indices.shape)}")
        batch, seq_len = in_indices.shape
        if seq_len > self.context_length:
            raise ValueError(f"seq_len={seq_len} exceeds context_length={self.context_length}")
        # Token positions for RoPE: (batch, seq_len)
        token_positions = torch.arange(seq_len, device=in_indices.device, dtype=torch.long).view(1, seq_len)
        token_positions = token_positions.expand(batch, seq_len)
        # Embed tokens: (batch, seq_len, d_model)
        x = self.token_embeddings(in_indices)
        # Apply Transformer blocks
        for block in self.layers:
            x = block(x, token_positions)
        # Final norm and vocabulary projection
        x = self.ln_final(x)
        logits = self.lm_head(x)
        return logits

def run_transformer_lm(
    vocab_size: int,
    context_length: int,
    d_model: int,
    num_layers: int,
    num_heads: int,
    d_ff: int,
    rope_theta: float,
    weights: dict[str, Tensor],
    in_indices: Int[Tensor, "batch_size sequence_length"],
) -> Float[Tensor, "batch_size sequence_length vocab_size"]:
    r""" Given the weights of a Transformer language model and input indices, return the output of running a forward pass on the input indices. This function should use RoPE."""
    from cs336_basics.transformer_lm import TransformerLM
    model = TransformerLM(
        vocab_size=vocab_size, context_length=context_length, d_model=d_model, num_layers=num_layers, num_heads=num_heads, d_ff=d_ff, rope_theta=rope_theta, device=in_indices.device, dtype=torch.float32
    )
    # Move weights to the correct device/dtype before loading.
    sd = {k: v.to(device=in_indices.device, dtype=torch.float32) for k, v in weights.items()}
    model.load_state_dict(sd)
    return model(in_indices)
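For the resource accounting mentioned in the summary, the parameter count and forward-pass matmul FLOPs follow directly from the module shapes above. The sketch below is one way to tally them in plain Python. The GPT-2 XL configuration values (vocab_size 50257, context_length 1024, 48 layers, d_model 1600, 25 heads, d_ff 6400) are taken from the CS336 handout rather than from this article, so treat them as assumptions. FLOPs are counted as 2mnp for an (m x n) by (n x p) matrix product, and elementwise operations (norms, softmax, SiLU, RoPE) are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Assumed GPT-2 XL-sized configuration (from the assignment handout).
    vocab_size: int = 50257
    context_length: int = 1024
    num_layers: int = 48
    d_model: int = 1600
    num_heads: int = 25
    d_ff: int = 6400


def count_parameters(c: Config) -> int:
    embedding = c.vocab_size * c.d_model
    attention = 4 * c.d_model * c.d_model   # q_proj, k_proj, v_proj, output_proj
    ffn = 3 * c.d_model * c.d_ff            # w1, w2, w3
    norms = 2 * c.d_model                   # ln1 and ln2 gains
    per_layer = attention + ffn + norms
    final_norm = c.d_model
    lm_head = c.d_model * c.vocab_size      # no weight tying in this implementation
    return embedding + c.num_layers * per_layer + final_norm + lm_head


def forward_matmul_flops(c: Config, seq_len: int) -> dict[str, int]:
    s, d, n = seq_len, c.d_model, c.num_layers
    return {
        "attention projections": n * 4 * 2 * s * d * d,
        "attention scores and weighted sum": n * 2 * 2 * s * s * d,
        "feed-forward": n * 3 * 2 * s * d * c.d_ff,
        "lm_head": 2 * s * d * c.vocab_size,
    }


config = Config()
total_params = count_parameters(config)
flops = forward_matmul_flops(config, seq_len=config.context_length)
total_flops = sum(flops.values())
print(f"parameters: {total_params:,}")
for name, value in flops.items():
    print(f"{name}: {value:,} ({value / total_flops:.1%})")
```

Under these assumptions the model has about 2.13 billion parameters and needs about 4.5 trillion matmul FLOPs per forward pass at the full context length, with the feed-forward layers taking roughly two thirds of the total.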

