import os

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
# 假设 mountains 是加载的单通道图像数组
# mountains = np.load(os.path.join(figure_path, 'mountains.npy'))
# Dimensions of the example "Mountain at Dusk" image (single channel).
H = 60
W = 100
print(f'Mountain at Dusk is H = {H} and W = {W} pixels.')
这个图像的高度为 H=60,宽度为 W=100。我们将设置 P=20,因为它能够均匀地整除 H 和 W。
# Patch side length; chosen because it evenly divides both H and W.
P = 20
# Number of non-overlapping P x P patches tiling the H x W image.
# Integer floor division avoids the float round-trip of int((H*W)/(P**2)).
N = (H * W) // (P ** 2)
print(f'There will be {N} patches, each {P} by {P}.')
def get_sinusoid_encoding(num_tokens, token_len):
    """ Make Sinusoid Encoding Table

        Args:
            num_tokens (int): number of tokens
            token_len (int): length of a token

        Returns:
            (torch.FloatTensor) sinusoidal position encoding table,
            shape (1, num_tokens, token_len)
    """

    def get_position_angle_vec(i):
        # Dimension pair (2j, 2j+1) shares the frequency 1 / 10000^(2j / token_len).
        return [i / np.power(10000, 2 * (j // 2) / token_len) for j in range(token_len)]

    sinusoid_table = np.array([get_position_angle_vec(i) for i in range(num_tokens)])
    # Even indices get sine, odd indices cosine ("Attention Is All You Need", Eq. 5).
    sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])
    sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])
    # Add a leading batch dimension so the table broadcasts over a token batch.
    return torch.FloatTensor(sinusoid_table).unsqueeze(0)
# NOTE(review): `num_tokens`, `token_len`, and the token batch `x` are assumed
# to be defined in an earlier (not shown) part of this script — confirm.
# The +1 reserves a position for the prepended class/prediction token.
PE = get_sinusoid_encoding(num_tokens+1, token_len)
print('Position embedding dimensions are\n\tnumber of tokens:', PE.shape[1], '\n\ttoken length:', PE.shape[2])
# Broadcast-add the (1, num_tokens+1, token_len) table onto the token batch.
x = x + PE
print('Dimensions with Position Embedding are\n\tbatchsize:', x.shape[0], '\n\tnumber of tokens:', x.shape[1], '\n\ttoken length:', x.shape[2])
现在,我们的 Token 已经准备好进入编码块。
编码块
编码块是模型实际从图像标记中学习的地方。编码块的数量是用户设置的超参数。
编码块的代码如下。
from typing import Optional

import torch.nn.functional as F
class Encoding(nn.Module):
    def __init__(self,
                 dim: int,
                 num_heads: int = 1,
                 hidden_chan_mul: float = 4.,
                 qkv_bias: bool = False,
                 qk_scale: Optional[float] = None,
                 act_layer=nn.GELU,
                 norm_layer=nn.LayerNorm):
        """ Encoding Block

            Args:
                dim (int): size of a single token
                num_heads (int): number of attention heads in MSA
                hidden_chan_mul (float): multiplier to determine the number of hidden
                    channels (features) in the NeuralNet component
                qkv_bias (bool): determines if the qkv layer learns an additive bias
                qk_scale (Optional[float]): value to scale the queries and keys by;
                    if None, queries and keys are scaled by head_dim ** -0.5
                act_layer (nn.modules.activation): torch neural network layer class to use as activation
                norm_layer (nn.modules.normalization): torch neural network layer class to use as normalization
        """
        super().__init__()

        ## Define Layers
        self.norm1 = norm_layer(dim)
        # NOTE(review): Attention is assumed to be defined elsewhere in this file/project.
        self.attn = Attention(dim=dim,
                              chan=dim,
                              num_heads=num_heads,
                              qkv_bias=qkv_bias,
                              qk_scale=qk_scale)
        self.norm2 = norm_layer(dim)
        self.neuralnet = NeuralNet(in_chan=dim,
                                   hidden_chan=int(dim * hidden_chan_mul),
                                   out_chan=dim,
                                   act_layer=act_layer)

    def forward(self, x):
        # Pre-norm transformer block: residual attention, then residual MLP.
        x = x + self.attn(self.norm1(x))
        x = x + self.neuralnet(self.norm2(x))
        return x
class NeuralNet(nn.Module):
    def __init__(self,
                 in_chan: int,
                 hidden_chan: Optional[int] = None,
                 out_chan: Optional[int] = None,
                 act_layer=nn.GELU):
        """ Neural Network Module (two-layer MLP with one activation)

            Args:
                in_chan (int): number of channels (features) at input
                hidden_chan (Optional[int]): number of channels (features) in the hidden layer;
                    if None, number of channels in hidden layer is the same as the number of input channels
                out_chan (Optional[int]): number of channels (features) at output;
                    if None, number of output channels is same as the number of input channels
                act_layer (nn.modules.activation): torch neural network layer class to use as activation
        """
        super().__init__()

        ## Define Number of Channels
        # Fall back to the input width when hidden/output widths are not given.
        hidden_chan = hidden_chan or in_chan
        out_chan = out_chan or in_chan

        self.fc1 = nn.Linear(in_chan, hidden_chan)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_chan, out_chan)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.fc2(x)
        return x
# The first token (index 0) is the prepended class token; it serves as the
# aggregate representation used for prediction.
pred_token = x[:, 0]
print('Length of prediction token:', pred_token.shape[-1])
最后,将预测 Token 传递到头部以进行预测。头部通常是某种类型的神经网络,根据模型的不同而变化。在 An Image is Worth 16x16 Words 中,他们在预训练期间使用具有一个隐藏层的 MLP(多层感知器),在微调期间使用单个线性层。在 Tokens-to-Token ViT 中,他们使用单个线性层作为头部。此示例将使用输出形状为 1,以表示单个估计回归值。
# Regression head: a single linear layer mapping the class token to one value.
# NOTE(review): assumes `token_len` (defined earlier, not shown) matches the
# feature length of pred_token — confirm.
head = nn.Linear(token_len, 1)
pred = head(pred_token)
print('Length of prediction:', (pred.shape[0], pred.shape[1]))
print('Prediction:', float(pred))
这就是全部内容!模型已经进行了预测!
完整代码
为了创建完整的 ViT 模块,我们使用上面定义的 Patch Tokenization 模块和 ViT Backbone 模块。ViT Backbone 如下所定义,包含了 Token 处理、编码块和预测处理组件。
class ViT_Backbone(nn.Module):
    def __init__(self,
                 preds: int = 1,
                 token_len: int = 768,
                 num_heads: int = 1,
                 Encoding_hidden_chan_mul: float = 4.,
                 depth: int = 12,
                 qkv_bias=False,
                 qk_scale=None,
                 act_layer=nn.GELU,
                 norm_layer=nn.LayerNorm,
                 num_tokens: int = 500):
        """ VisTransformer Backbone

            Args:
                preds (int): number of predictions to output
                token_len (int): length of a token
                num_heads (int): number of attention heads in MSA
                Encoding_hidden_chan_mul (float): multiplier to determine the number of hidden
                    channels (features) in the NeuralNet component of the Encoding Module
                depth (int): number of encoding blocks in the model
                qkv_bias (bool): determines if the qkv layer learns an additive bias
                qk_scale (Optional[float]): value to scale the queries and keys by;
                    if None, queries and keys are scaled by head_dim ** -0.5
                act_layer (nn.modules.activation): torch neural network layer class to use as activation
                norm_layer (nn.modules.normalization): torch neural network layer class to use as normalization
                num_tokens (int): number of patch tokens in the input sequence; must match
                    the output of the patch tokenizer (added with a default to stay
                    backward-compatible — the original read self.num_tokens without setting it)
        """
        super().__init__()

        ## Defining Parameters
        self.num_heads = num_heads
        self.Encoding_hidden_chan_mul = Encoding_hidden_chan_mul
        self.depth = depth
        # Bug fix: token_len and num_tokens were read as attributes below but never stored,
        # which raised AttributeError on construction.
        self.token_len = token_len
        self.num_tokens = num_tokens

        ## Defining Token Processing Components
        self.cls_token = nn.Parameter(torch.zeros(1, 1, self.token_len))
        # Fixed (non-learned) sinusoidal position table; +1 covers the class token slot.
        self.pos_embed = nn.Parameter(data=get_sinusoid_encoding(num_tokens=self.num_tokens + 1,
                                                                 token_len=self.token_len),
                                      requires_grad=False)

        ## Defining Encoding blocks
        self.blocks = nn.ModuleList([Encoding(dim=self.token_len,
                                              num_heads=self.num_heads,
                                              hidden_chan_mul=self.Encoding_hidden_chan_mul,
                                              qkv_bias=qkv_bias,
                                              qk_scale=qk_scale,
                                              act_layer=act_layer,
                                              norm_layer=norm_layer)
                                     for i in range(self.depth)])

        ## Defining Prediction Processing
        self.norm = norm_layer(self.token_len)
        self.head = nn.Linear(self.token_len, preds)

        ## Make the class token sampled from a truncated normal distribution
        # timm.layers.trunc_normal_(self.cls_token, std=.02)

    def forward(self, x):
        ## Assumes x is already tokenized: (batch, num_tokens, token_len)
        ## Get Batch Size
        B = x.shape[0]
        ## Concatenate Class Token
        x = torch.cat((self.cls_token.expand(B, -1, -1), x), dim=1)
        ## Add Positional Embedding
        x = x + self.pos_embed
        ## Run Through Encoding Blocks
        for blk in self.blocks:
            x = blk(x)
        ## Take Norm
        x = self.norm(x)
        ## Make Prediction on Class Token
        x = self.head(x[:, 0])
        return x
通过 ViT Backbone 模块,我们可以定义完整的 ViT 模型。
class ViT_Model(nn.Module):
    def __init__(self,
                 img_size: tuple[int, int, int] = (1, 400, 100),
                 patch_size: int = 50,
                 token_len: int = 768,
                 preds: int = 1,
                 num_heads: int = 1,
                 Encoding_hidden_chan_mul: float = 4.,
                 depth: int = 12,
                 qkv_bias=False,
                 qk_scale=None,
                 act_layer=nn.GELU,
                 norm_layer=nn.LayerNorm):
        """ VisTransformer Model

            Args:
                img_size (tuple[int, int, int]): size of input (channels, height, width)
                patch_size (int): the side length of a square patch
                token_len (int): desired length of an output token
                preds (int): number of predictions to output
                num_heads (int): number of attention heads in MSA
                Encoding_hidden_chan_mul (float): multiplier to determine the number of hidden
                    channels (features) in the NeuralNet component of the Encoding Module
                depth (int): number of encoding blocks in the model
                qkv_bias (bool): determines if the qkv layer learns an additive bias
                qk_scale (Optional[float]): value to scale the queries and keys by;
                    if None, queries and keys are scaled by head_dim ** -0.5
                act_layer (nn.modules.activation): torch neural network layer class to use as activation
                norm_layer (nn.modules.normalization): torch neural network layer class to use as normalization
        """
        super().__init__()

        ## Defining Parameters
        self.img_size = img_size
        C, H, W = self.img_size
        self.patch_size = patch_size
        self.token_len = token_len
        self.num_heads = num_heads
        self.Encoding_hidden_chan_mul = Encoding_hidden_chan_mul
        self.depth = depth

        ## Defining Patch Embedding Module
        # NOTE(review): Patch_Tokenization is assumed to be defined elsewhere in this project.
        self.patch_tokens = Patch_Tokenization(img_size,
                                               patch_size,
                                               token_len)

        ## Defining ViT Backbone
        # NOTE(review): the backbone's position table must match the tokenizer's
        # patch count ((H // patch_size) * (W // patch_size)) — confirm against ViT_Backbone.
        self.backbone = ViT_Backbone(preds,
                                     self.token_len,
                                     self.num_heads,
                                     self.Encoding_hidden_chan_mul,
                                     self.depth,
                                     qkv_bias,
                                     qk_scale,
                                     act_layer,
                                     norm_layer)

        ## Initialize the Weights
        self.apply(self._init_weights)

    def _init_weights(self, m):
        """ Initialize the weights of the linear layers & the layernorms """
        ## For Linear Layers
        if isinstance(m, nn.Linear):
            ## Weights are initialized from a truncated normal distribution
            # timm.layers.trunc_normal_(m.weight, std=.02)
            if m.bias is not None:
                ## If bias is present, bias is initialized at zero
                nn.init.constant_(m.bias, 0)
        ## For Layernorm Layers
        elif isinstance(m, nn.LayerNorm):
            ## Weights are initialized at one
            nn.init.constant_(m.weight, 1.0)
            ## Bias is initialized at zero
            nn.init.constant_(m.bias, 0)

    @torch.jit.ignore  ## Tell pytorch to not compile as TorchScript
    def no_weight_decay(self):
        """ Used in Optimizer to ignore weight decay in the class token """
        return {'cls_token'}

    def forward(self, x):
        x = self.patch_tokens(x)
        x = self.backbone(x)
        return x