The Transformer's Three Attention Mechanisms Explained, with PyTorch Implementations | 极客日志



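Attention is the core of the Transformer model, and it comes in three main forms: self-attention, cross-attention, and causal self-attention. Self-attention lets the elements of one sequence interact with each other, cross-attention connects two different sequences, and causal self-attention enforces the one-directional dependency that generation tasks require. This article implements all three from scratch in PyTorch, works through the details of scaled dot-product attention, the multi-head extension, and masking, and discusses how these mechanisms are used and optimized in large language models. It covers input embeddings, QKV projections, softmax normalization, context-vector computation, and practical engineering advice on padding and numeric precision.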

忘忧 · Published 2025/2/7 · Updated 2026/6/218 views


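This article takes a close look at three key attention mechanisms in the Transformer model: self-attention, cross-attention, and causal self-attention. They are core components of large language models (LLMs) such as GPT-4 and Llama, and understanding them makes it easier to see how these models work and what they can be used for.

We will implement each mechanism from scratch in Python and PyTorch. Writing the code by hand is a direct way to understand what happens inside.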

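Self-Attention Overview

Since its introduction in the landmark 2017 paper "Attention Is All You Need", self-attention has become central to state-of-the-art deep learning models, particularly in natural language processing (NLP). Given how widely it is used, it is worth understanding exactly how it works.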

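In deep learning, the idea of "attention" was originally introduced to help recurrent neural networks (RNNs) handle long sequences or sentences. In machine translation, for example, translating word by word usually fails to capture the grammar and idiom of a language, and the result is a poor translation.

To address this, attention lets the model consider the entire input sequence at every step and focus selectively on the most relevant parts of the context. The Transformer architecture, introduced in 2017, took the idea further by making self-attention a standalone mechanism, so that RNNs were no longer needed.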

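Self-attention enriches each input embedding with information from its context, letting the model dynamically weigh the importance of the different elements in a sequence. This is especially valuable in NLP, where the meaning of a word often depends on its context in the sentence or document.

Many efficient variants of self-attention have been proposed, but the original scaled dot-product attention from "Attention Is All You Need" remains the most widely used. Its strong practical performance and computational efficiency in large-scale Transformer models keep it at the foundation of many models.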
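
For reference, the scaled dot-product attention from that paper can be written compactly as follows. Here $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension; the symbols follow the paper's notation and are not defined elsewhere in this section.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

The softmax is applied row by row, so each row of the resulting weight matrix sums to 1.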

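Embedding the Input Sentence

Before diving into self-attention itself, we will walk through the steps using an example sentence, "The sun rises in the east". As with other text-processing models, such as recurrent or convolutional neural networks, the first step is to create an embedding of the sentence.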

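To keep things simple, our dictionary dc contains only the words in the input sentence. Because the split is case-sensitive, 'The' and 'the' become separate entries. In real applications the dictionary is built from a much larger vocabulary, typically 30,000 to 50,000 words.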
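
As a rough illustration of what a larger vocabulary involves, the sketch below builds a word-to-index mapping from a tiny made-up corpus and reserves an index for unknown words. The corpus, the `<unk>` token, and the helper names `build_vocab` and `encode` are assumptions for illustration only; production tokenizers typically work on subword units instead.

```python
def build_vocab(corpus, unk_token="<unk>"):
    """Map every distinct whitespace-separated word to an integer index."""
    words = sorted({w for line in corpus for w in line.split()})
    vocab = {unk_token: 0}  # index 0 is reserved for unseen words
    for w in words:
        vocab[w] = len(vocab)
    return vocab


def encode(sentence, vocab, unk_token="<unk>"):
    """Convert a sentence to indices, falling back to the unknown token."""
    return [vocab.get(w, vocab[unk_token]) for w in sentence.split()]


corpus = ["The sun rises in the east", "The sun sets in the west"]
vocab = build_vocab(corpus)
ids = encode("The moon rises in the east", vocab)
print(vocab)
print(ids)  # "moon" is not in the corpus, so it maps to index 0
```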

sentence = 'The sun rises in the east'
dc = {s:i for i,s in enumerate(sorted(sentence.split()))}
print(dc)

Output:

{'The': 0, 'east': 1, 'in': 2, 'rises': 3, 'sun': 4, 'the': 5}

Next, we use this dictionary to convert each word in the sentence to its integer index.

import torch
sentence_int = torch.tensor([dc[s] for s in sentence.split()])
print(sentence_int)

Output:

tensor([0, 4, 3, 2, 5, 1])

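We then map each index to a 3-dimensional vector with an embedding layer, assuming a vocabulary of 50,000 entries:

vocab_size = 50_000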
torch.manual_seed(123)
embed = torch.nn.Embedding(vocab_size, 3)
embedded_sentence = embed(sentence_int).detach()
print(embedded_sentence)
print(embedded_sentence.shape)

tensor([[ 0.3374, -0.1778, -0.3035], 
        [ 0.1794, 1.8951, 0.4954], 
        [ 0.2692, -0.0770, -1.0205], 
        [-0.2196, -0.3792, 0.7671], 
        [-0.5880, 0.3486, 0.6603], 
        [-1.1925, 0.6984, -1.4097]])
torch.Size([6, 3])
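
An embedding layer is a trainable lookup table: feeding it an index returns the corresponding row of its weight matrix. The minimal sketch below checks this with a small table; the sizes (`vocab_size = 6`, `dim = 3`) are chosen for illustration and differ from the 50,000-entry table above.

```python
import torch

torch.manual_seed(0)
vocab_size, dim = 6, 3
embed = torch.nn.Embedding(vocab_size, dim)
idx = torch.tensor([0, 4, 3, 2, 5, 1])  # token indices of the example sentence

looked_up = embed(idx).detach()          # what the embedding layer returns
indexed = embed.weight.detach()[idx]     # plain row indexing into the table

# Equivalent but wasteful: one-hot vectors times the weight matrix
one_hot = torch.nn.functional.one_hot(idx, num_classes=vocab_size).float()
via_matmul = one_hot @ embed.weight.detach()

print(torch.equal(looked_up, indexed))
print(torch.allclose(looked_up, via_matmul))
```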

Weight Matrices for Scaled Dot-Product Attention

torch.manual_seed(123)
d = embedded_sentence.shape[1]
d_q, d_k, d_v = 2, 2, 4
W_query = torch.nn.Parameter(torch.rand(d, d_q))
W_key = torch.nn.Parameter(torch.rand(d, d_k))
W_value = torch.nn.Parameter(torch.rand(d, d_v))

Query, Key, and Value Transformations

x_3 = embedded_sentence[2]  # third element (index 2)
query_3 = x_3 @ W_query
key_3 = x_3 @ W_key
value_3 = x_3 @ W_value
print("Query shape:", query_3.shape)
print("Key shape:", key_3.shape)
print("Value shape:", value_3.shape)

Query shape: torch.Size([2])
Key shape: torch.Size([2])
Value shape: torch.Size([4])

keys = embedded_sentence @ W_key
values = embedded_sentence @ W_value
print("All keys shape:", keys.shape)
print("All values shape:", values.shape)

All keys shape: torch.Size([6, 2])
All values shape: torch.Size([6, 4])
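
The per-token projection and the whole-sequence projection are the same operation: multiplying the full embedding matrix by a weight matrix projects every token at once, and row i of the result is the projection of token i. A minimal sketch with random stand-in tensors (the names `x` and `W_q` and the sizes are assumptions that mirror the example):

```python
import torch

torch.manual_seed(0)
seq_len, d, d_q = 6, 3, 2
x = torch.randn(seq_len, d)      # stand-in for embedded_sentence
W_q = torch.rand(d, d_q)         # stand-in for W_query

queries = x @ W_q                # all queries at once, shape (6, 2)
query_3 = x[2] @ W_q             # query for the third token only, shape (2,)

print(queries.shape)
print(torch.allclose(queries[2], query_3, atol=1e-6))
```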

Computing Unnormalized Attention Weights in Self-Attention

omega_3 = query_3 @ keys.T
print("Unnormalized attention weights for query 3:")
print(omega_3)

Unnormalized attention weights for query 3:
tensor([ 0.8721, -0.5302,  2.1436, -1.7589,  0.9103,  1.3245])

max_score = omega_3.max()
min_score = omega_3.min()
max_index = omega_3.argmax()
min_index = omega_3.argmin()
print(f"Highest compatibility: {max_score:.4f} with input {max_index+1}")
print(f"Lowest compatibility: {min_score:.4f} with input {min_index+1}")

Highest compatibility: 2.1436 with input 3
Lowest compatibility: -1.7589 with input 4

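Normalizing Attention Weights and Computing the Context Vector

import torch.nn.functional as F
d_k = 2  # dimension of the key vectors
omega_3 = query_3 @ keys.T  # same scores as in the previous example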
attention_weights_3 = F.softmax(omega_3 / d_k**0.5, dim=0)
print("Normalized attention weights for input 3:")
print(attention_weights_3)

Normalized attention weights for input 3:
tensor([0.1565, 0.0581, 0.3847, 0.0244, 0.1608, 0.2155])

max_weight = attention_weights_3.max()
max_weight_index = attention_weights_3.argmax()
print(f"Input {max_weight_index+1} has the highest attention weight: {max_weight:.4f}")

Input 3 has the highest attention weight: 0.3847
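
To see what the 1/sqrt(d_k) scaling and the softmax do numerically, here is a small pure-Python sketch applied to the unnormalized scores printed earlier. The `softmax` helper is written out by hand for illustration; in practice `torch.softmax` does the same job.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

omega_3 = [0.8721, -0.5302, 2.1436, -1.7589, 0.9103, 1.3245]
d_k = 2

unscaled = softmax(omega_3)
scaled = softmax([s / math.sqrt(d_k) for s in omega_3])

print([round(w, 4) for w in scaled])
print(sum(scaled))                  # sums to 1 up to floating-point rounding
print(max(unscaled), max(scaled))   # scaling flattens the distribution
```

Dividing by sqrt(d_k) before the softmax keeps the largest weight from dominating as strongly, which matters more as the key dimension grows.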

context_vector_3 = attention_weights_3 @ values
print("Context vector shape:", context_vector_3.shape)
print("Context vector:")
print(context_vector_3)

Context vector shape: torch.Size([4])
Context vector:
tensor([0.6237, 0.9845, 1.0523, 1.2654])
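
The context vector is a weighted average of the value vectors, with the attention weights as coefficients. The sketch below spells out the matrix product as an explicit sum; the tensors are random stand-ins with the same shapes as in the example.

```python
import torch

torch.manual_seed(0)
weights = torch.softmax(torch.randn(6), dim=0)  # stand-in attention weights, sum to 1
values = torch.randn(6, 4)                      # stand-in value vectors, one row per token

context = weights @ values                      # shape (4,)

# The same result written as an explicit weighted sum over the rows of `values`
manual = torch.zeros(4)
for w, v in zip(weights, values):
    manual += w * v

print(context.shape)
print(torch.allclose(context, manual, atol=1e-6))
```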

Implementing Self-Attention in PyTorch

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_in, d_out_kq, d_out_v):
        super().__init__()
        self.d_out_kq = d_out_kq
        self.W_query = nn.Parameter(torch.rand(d_in, d_out_kq))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out_kq))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out_v))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(
            attn_scores / self.d_out_kq**0.5, dim=-1
        )
        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(123)
d_in, d_out_kq, d_out_v = 3, 2, 4
sa = SelfAttention(d_in, d_out_kq, d_out_v)
output = sa(embedded_sentence)
print(output)

tensor([[-0.1564,  0.1028, -0.0763, -0.0764], 
        [ 0.5313,  1.3607,  0.7891,  1.3110], 
        [-0.3542, -0.1234, -0.2627, -0.3706], 
        [ 0.0071,  0.3345,  0.0969,  0.1998], 
        [ 0.1008,  0.4780,  0.2021,  0.3674], 
        [-0.5296, -0.2799, -0.4107, -0.6006]], grad_fn=<MmBackward0>)
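
PyTorch 2.0 and later ship a fused `torch.nn.functional.scaled_dot_product_attention`. As a sanity check, the sketch below compares it with the manual computation used in the class above, on random stand-in tensors with an added batch dimension. It assumes a PyTorch version that provides this function; the tensor names and sizes are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, d_kq, d_v = 1, 6, 2, 4
queries = torch.randn(batch, seq_len, d_kq)
keys = torch.randn(batch, seq_len, d_kq)
values = torch.randn(batch, seq_len, d_v)

# Manual scaled dot-product attention; transpose(-2, -1) also works for batched input
scores = queries @ keys.transpose(-2, -1)
weights = torch.softmax(scores / d_kq**0.5, dim=-1)
manual = weights @ values

# Built-in version; by default it scales by 1 / sqrt(query feature dimension)
fused = F.scaled_dot_product_attention(queries, keys, values)

print(manual.shape)
print(torch.allclose(manual, fused, atol=1e-5))
```

Note the use of `transpose(-2, -1)` rather than `.T`: the class above assumes a single unbatched sequence, while this form extends to batches.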

Multi-Head Attention: An Advanced Extension of Self-Attention

class MultiHeadAttentionWrapper(nn.Module):
    def __init__(self, d_in, d_out_kq, d_out_v, num_heads):
        super().__init__()
        self.heads = nn.ModuleList(
            [SelfAttention(d_in, d_out_kq, d_out_v) 
             for _ in range(num_heads)]
        )

    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)

torch.manual_seed(123)
d_in, d_out_kq, d_out_v = 3, 2, 1
num_heads = 4
mha = MultiHeadAttentionWrapper(d_in, d_out_kq, d_out_v, num_heads)
context_vecs = mha(embedded_sentence)
print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

tensor([[-0.0185,  0.0170,  0.1999, -0.0860], 
        [ 0.4003,  1.7137,  1.3981,  1.0497], 
        [-0.1103, -0.1609,  0.0079, -0.2416], 
        [ 0.0668,  0.3534,  0.2322,  0.1008], 
        [ 0.1180,  0.6949,  0.3157,  0.2807], 
        [-0.1827, -0.2060, -0.2393, -0.3167]], grad_fn=<CatBackward0>)
context_vecs.shape: torch.Size([6, 4])
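
The wrapper above runs each head as a separate module. A common alternative, sketched here under assumed names and sizes, computes one large projection per Q, K, and V and reshapes it into heads so that all heads run in a single batched matrix product. This is an illustrative variant, not the implementation used in this article; `FusedMultiHeadAttention`, `d_head`, and the use of bias-free `nn.Linear` layers are choices made for the sketch.

```python
import torch
import torch.nn as nn

class FusedMultiHeadAttention(nn.Module):
    """Multi-head self-attention with one projection per Q/K/V (hypothetical sketch)."""

    def __init__(self, d_in, d_head, num_heads):
        super().__init__()
        self.d_head, self.num_heads = d_head, num_heads
        self.W_q = nn.Linear(d_in, d_head * num_heads, bias=False)
        self.W_k = nn.Linear(d_in, d_head * num_heads, bias=False)
        self.W_v = nn.Linear(d_in, d_head * num_heads, bias=False)

    def forward(self, x):                       # x: (seq_len, d_in)
        n = x.shape[0]

        # Split the feature axis into (heads, d_head) and move heads to the front
        def split(t):
            return t.view(n, self.num_heads, self.d_head).transpose(0, 1)

        q, k, v = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        scores = q @ k.transpose(-2, -1)        # (heads, seq_len, seq_len)
        weights = torch.softmax(scores / self.d_head**0.5, dim=-1)
        out = weights @ v                       # (heads, seq_len, d_head)
        # Concatenate the heads back along the feature axis
        return out.transpose(0, 1).reshape(n, self.num_heads * self.d_head)

torch.manual_seed(0)
mha_fused = FusedMultiHeadAttention(d_in=3, d_head=2, num_heads=4)
x = torch.randn(6, 3)                           # stand-in for embedded_sentence
out = mha_fused(x)
print(out.shape)
```

As with the wrapper, the output width is `num_heads * d_head`, one slice per head.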

Cross-Attention: A Bridge Between Different Input Sequences

class CrossAttention(nn.Module):
    def __init__(self, d_in, d_out_kq, d_out_v):
        super().__init__()
        self.d_out_kq = d_out_kq
        self.W_query = nn.Parameter(torch.rand(d_in, d_out_kq))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out_kq))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out_v))

    def forward(self, x_1, x_2):
        queries_1 = x_1 @ self.W_query
        keys_2 = x_2 @ self.W_key
        values_2 = x_2 @ self.W_value
        attn_scores = queries_1 @ keys_2.T
        attn_weights = torch.softmax(
            attn_scores / self.d_out_kq**0.5, dim=-1)
        context_vec = attn_weights @ values_2
        return context_vec

torch.manual_seed(123)
d_in, d_out_kq, d_out_v = 3, 2, 4
crossattn = CrossAttention(d_in, d_out_kq, d_out_v)
first_input = embedded_sentence
second_input = torch.rand(8, d_in)
print("First input shape:", first_input.shape)
print("Second input shape:", second_input.shape)
context_vectors = crossattn(first_input, second_input)
print(context_vectors)
print("Output shape:", context_vectors.shape)

First input shape: torch.Size([6, 3])
Second input shape: torch.Size([8, 3])
tensor([[0.4231, 0.8665, 0.6503, 1.0042], 
        [0.4874, 0.9718, 0.7359, 1.1353], 
        [0.4054, 0.8359, 0.6258, 0.9667], 
        [0.4357, 0.8886, 0.6678, 1.0311], 
        [0.4429, 0.9006, 0.6775, 1.0460], 
        [0.3860, 0.8021, 0.5985, 0.9250]], grad_fn=<MmBackward0>)
Output shape: torch.Size([6, 4])
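
PyTorch's built-in `nn.MultiheadAttention` can also act as cross-attention when the query comes from one sequence and the key and value come from another. A minimal sketch with assumed sizes (`embed_dim = 8`, two heads, sequences of length 6 and 5):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, num_heads = 8, 2
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x_1 = torch.randn(1, 6, embed_dim)   # sequence that supplies the queries
x_2 = torch.randn(1, 5, embed_dim)   # sequence that supplies keys and values

# Query from x_1, key and value from x_2
out, weights = attn(x_1, x_2, x_2)

print(out.shape)            # one output vector per query position
print(weights.shape)        # one weight per (query position, key position) pair
print(weights.sum(dim=-1))  # each query's weights sum to 1
```

The output length follows the query sequence, while the second sequence only determines how many positions each query can attend to.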

Causal Self-Attention

torch.manual_seed(123)
d_in, d_out_kq, d_out_v = 3, 2, 4
W_query = nn.Parameter(torch.rand(d_in, d_out_kq))
W_key = nn.Parameter(torch.rand(d_in, d_out_kq))
W_value = nn.Parameter(torch.rand(d_in, d_out_v))
x = embedded_sentence
keys = x @ W_key
queries = x @ W_query
values = x @ W_value
attn_scores = queries @ keys.T
print(attn_scores)
print(attn_scores.shape)

tensor([[ 0.0613, -0.3491,  0.1443, -0.0437, -0.1303,  0.1076], 
        [-0.6004,  3.4707, -1.5023,  0.4991,  1.2903, -1.3374], 
        [ 0.2432, -1.3934,  0.5869, -0.1851, -0.5191,  0.4730], 
        [-0.0794,  0.4487, -0.1807,  0.0518,  0.1677, -0.1197], 
        [-0.1510,  0.8626, -0.3597,  0.1112,  0.3216, -0.2787], 
        [ 0.4344, -2.5037,  1.0740, -0.3509, -0.9315,  0.9265]], 
       grad_fn=<MmBackward0>)
torch.Size([6, 6])

attn_weights = torch.softmax(attn_scores / d_out_kq**0.5, dim=1)
print(attn_weights)

tensor([[0.1772, 0.1326, 0.1879, 0.1645, 0.1547, 0.1831], 
        [0.0386, 0.6870, 0.0204, 0.0840, 0.1470, 0.0229], 
        [0.1965, 0.0618, 0.2506, 0.1452, 0.1146, 0.2312], 
        [0.1505, 0.2187, 0.1401, 0.1651, 0.1793, 0.1463], 
        [0.1347, 0.2758, 0.1162, 0.1621, 0.1881, 0.1231], 
        [0.1973, 0.0247, 0.3102, 0.1132, 0.0751, 0.2794]], 
       grad_fn=<SoftmaxBackward0>)

block_size = attn_scores.shape[0]
mask_simple = torch.tril(torch.ones(block_size, block_size))
print(mask_simple)

tensor([[1., 0., 0., 0., 0., 0.], 
        [1., 1., 0., 0., 0., 0.], 
        [1., 1., 1., 0., 0., 0.], 
        [1., 1., 1., 1., 0., 0.], 
        [1., 1., 1., 1., 1., 0.], 
        [1., 1., 1., 1., 1., 1.]])

masked_simple = attn_weights * mask_simple
print(masked_simple)

tensor([[0.1772, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000], 
        [0.0386, 0.6870, 0.0000, 0.0000, 0.0000, 0.0000], 
        [0.1965, 0.0618, 0.2506, 0.0000, 0.0000, 0.0000], 
        [0.1505, 0.2187, 0.1401, 0.1651, 0.0000, 0.0000], 
        [0.1347, 0.2758, 0.1162, 0.1621, 0.1881, 0.0000], 
        [0.1973, 0.0247, 0.3102, 0.1132, 0.0751, 0.2794]], 
       grad_fn=<MulBackward0>)

row_sums = masked_simple.sum(dim=1, keepdim=True)
masked_simple_norm = masked_simple / row_sums
print(masked_simple_norm)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000], 
        [0.0532, 0.9468, 0.0000, 0.0000, 0.0000, 0.0000], 
        [0.3862, 0.1214, 0.4924, 0.0000, 0.0000, 0.0000], 
        [0.2232, 0.3242, 0.2078, 0.2449, 0.0000, 0.0000], 
        [0.1536, 0.3145, 0.1325, 0.1849, 0.2145, 0.0000], 
        [0.1973, 0.0247, 0.3102, 0.1132, 0.0751, 0.2794]], 
       grad_fn=<DivBackward0>)
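
Renormalizing after masking is legitimate because the softmax denominator cancels. Writing $s_{ij}$ for the scaled score between query $i$ and key $j$, and $Z_i$ for the full softmax denominator of row $i$ (notation introduced here for illustration), the renormalized weight for $j \le i$ is:

```latex
\frac{e^{s_{ij}} / Z_i}{\sum_{k \le i} e^{s_{ik}} / Z_i}
  = \frac{e^{s_{ij}}}{\sum_{k \le i} e^{s_{ik}}},
\qquad Z_i = \sum_{k=1}^{n} e^{s_{ik}}
```

The right-hand side is a softmax taken over positions 1 through $i$ only, so the scores of future positions have no influence on the result.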

mask = torch.triu(torch.ones(block_size, block_size), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), float('-inf'))
print(masked)

tensor([[ 0.0613,    -inf,    -inf,    -inf,    -inf,    -inf], 
        [-0.6004,  3.4707,    -inf,    -inf,    -inf,    -inf], 
        [ 0.2432, -1.3934,  0.5869,    -inf,    -inf,    -inf], 
        [-0.0794,  0.4487, -0.1807,  0.0518,    -inf,    -inf], 
        [-0.1510,  0.8626, -0.3597,  0.1112,  0.3216,    -inf], 
        [ 0.4344, -2.5037,  1.0740, -0.3509, -0.9315,  0.9265]], 
       grad_fn=<MaskedFillBackward0>)

attn_weights = torch.softmax(masked / d_out_kq**0.5, dim=1)
print(attn_weights)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000], 
        [0.0532, 0.9468, 0.0000, 0.0000, 0.0000, 0.0000], 
        [0.3862, 0.1214, 0.4924, 0.0000, 0.0000, 0.0000], 
        [0.2232, 0.3242, 0.2078, 0.2449, 0.0000, 0.0000], 
        [0.1536, 0.3145, 0.1325, 0.1849, 0.2145, 0.0000], 
        [0.1973, 0.0247, 0.3102, 0.1132, 0.0751, 0.2794]], 
       grad_fn=<SoftmaxBackward0>)
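
Both masking routes should give the same weights, and a causal model's output at position i should not change when later tokens change. The sketch below checks both properties on random stand-in scores and values; the tensor names and sizes are assumptions for illustration.

```python
import torch

torch.manual_seed(0)
n, d_kq, d_v = 6, 2, 4
scores = torch.randn(n, n)                 # stand-in for attn_scores
values = torch.randn(n, d_v)               # stand-in for values

# Route 1: softmax, zero out the future, renormalize each row
tril = torch.tril(torch.ones(n, n))
w1 = torch.softmax(scores / d_kq**0.5, dim=1) * tril
w1 = w1 / w1.sum(dim=1, keepdim=True)

# Route 2: set future scores to -inf, then apply a single softmax
future = torch.triu(torch.ones(n, n), diagonal=1).bool()
w2 = torch.softmax(scores.masked_fill(future, float("-inf")) / d_kq**0.5, dim=1)

print(torch.allclose(w1, w2, atol=1e-6))

# Causality check: changing the last token's value leaves earlier outputs unchanged
out_before = w2 @ values
values_changed = values.clone()
values_changed[-1] += 10.0
out_after = w2 @ values_changed
print(torch.allclose(out_before[:-1], out_after[:-1]))
```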

