其中 A2C2F 就是 yolo12 中所提出的主要模块,A 表示的含义是 Area Attention。为了克服传统自注意力机制计算复杂度高的问题,YOLOv12 通过创新的区域注意力模块(Area Attention,A2),分辨率为 (H, W) 的特征图被划分为 l 个大小为 (H/l, W) 或 (H, W/l) 的段。这消除了显式的窗口划分,仅需要简单的重塑操作,从而实现更快的速度。将 l 的默认值设置为 4,将感受野减小到原来的 1/4,但仍保持较大的感受野。采用这种方法,注意力机制的计算成本从 2n²hd 降低到 1/2n²hd。尽管存在 n²的复杂度,但当 n 固定为 640 时(如果输入分辨率增加,则 n 会增加),这仍然足够高效,可以满足 YOLO 系统的实时要求。A2 降低了注意力机制的计算成本,同时保持较大的感受野,显著提升了检测精度。如下图所示,右侧的区域所表示的就是作者提出的注意力机制,相当于是以较少的计算量就捕捉到了相关的区域。
下面是 Area Attention 的实现,如下。
classA2C2f(nn.Module):
""" Area-Attention C2f module for enhanced feature extraction with area-based attention mechanisms. This module extends the C2f architecture by incorporating area-attention and ABlock layers for improved feature processing. It supports both area-attention and standard convolution modes.
Attributes:
cv1 (Conv): Initial 1x1 convolution layer that reduces input channels to hidden channels.
cv2 (Conv): Final 1x1 convolution layer that processes concatenated features.
gamma (nn.Parameter | None): Learnable parameter for residual scaling when using area attention.
m (nn.ModuleList): List of either ABlock or C3k modules for feature processing.
Methods:
forward: Processes input through area-attention or standard convolution pathway.
Examples:
>>> m = A2C2f(512, 512, n=1, a2=True, area=1)
>>> x = torch.randn(1, 512, 32, 32)
>>> output = m(x)
>>> print(output.shape)
torch.Size([1, 512, 32, 32])
"""def__init__(self, c1, c2, n=1, a2=True, area=1, residual=False, mlp_ratio=2.0, e=0.5, g=1, shortcut=True):
""" Area-Attention C2f module for enhanced feature extraction with area-based attention mechanisms.
Args:
c1 (int): Number of input channels.
c2 (int): Number of output channels.
n (int): Number of ABlock or C3k modules to stack.
a2 (bool): Whether to use area attention blocks. If False, uses C3k blocks instead.
area (int): Number of areas the feature map is divided.
residual (bool): Whether to use residual connections with learnable gamma parameter.
mlp_ratio (float): Expansion ratio for MLP hidden dimension.
e (float): Channel expansion ratio for hidden channels.
g (int): Number of groups for grouped convolutions.
shortcut (bool): Whether to use shortcut connections in C3k blocks.
"""super().__init__()
c_ = int(c2 * e)# hidden channelsassert c_ % 32==0,"Dimension of ABlock be a multiple of 32."self.cv1 = Conv(c1, c_,1,1)
self.cv2 = Conv((1+ n)* c_, c2,1)
self.gamma = nn.Parameter(0.01* torch.ones(c2), requires_grad=True) if a2 and residual elseNoneself.m = nn.ModuleList(
nn.Sequential(*(ABlock(c_, c_ // 32, mlp_ratio, area) for _ inrange(2))) if a2 else C3k(c_, c_, 2, shortcut, g) for _ inrange(n))
defforward(self, x):
"""Forward pass through R-ELAN layer."""
y = [self.cv1(x)]
y.extend(m(y[-1]) for m inself.m)
y = self.cv2(torch.cat(y, 1))
ifself.gamma isnotNone:
return x + self.gamma.view(-1, len(self.gamma), 1, 1) * y
return y
头部(Head)YOLOv11 的 Head 负责生成目标检测和分类方面的最终预测。它处理从颈部传递的特征映射,最终输出图像内对象的边界框和类标签。一般负责将特征进行映射到你对应的任务上,如果是检测任务,对应的就是 4 个边界框的值以及 1 个置信度的值和一个物体类别的值。如下所示。
# Ultralytics YOLO 🚀, AGPL-3.0 license"""Model head modules."""import copy
import math
import torch
import torch.nn as nn
from torch.nn.init import constant_, xavier_uniform_
from ultralytics.utils.tal import TORCH_1_10, dist2bbox, dist2rbox, make_anchors
from .block import DFL, BNContrastiveHead, ContrastiveHead, Proto
from .conv import Conv, DWConv
from .transformer import MLP, DeformableTransformerDecoder, DeformableTransformerDecoderLayer
from .utils import bias_init_with_prob, linear_init
__all__ ="Detect","Segment","Pose","Classify","OBB","RTDETRDecoder","v10Detect"
classConcat(nn.Module):
"""Concatenate a list of tensors along dimension."""def__init__(self, dimension=1):
"""Concatenates a list of tensors along a specified dimension."""super().__init__()
self.d = dimension
defforward(self, x):
"""Forward pass for the YOLOv8 mask Proto module."""return torch.cat(x, self.d)
classC2PSA(nn.Module):
""" C2PSA module with attention mechanism for enhanced feature extraction and processing. This module implements a convolutional block with attention mechanisms to enhance feature extraction and processing capabilities. It includes a series of PSABlock modules for self-attention and feed-forward operations.
Attributes:
c (int): Number of hidden channels.
cv1 (Conv): 1x1 convolution layer to reduce the number of input channels to 2*c.
cv2 (Conv): 1x1 convolution layer to reduce the number of output channels to c.
m (nn.Sequential): Sequential container of PSABlock modules for attention and feed-forward operations.
Methods:
forward: Performs a forward pass through the C2PSA module, applying attention and feed-forward operations.
Notes:
This module essentially is the same as PSA module, but refactored to allow stacking more PSABlock modules.
Examples:
>>> c2psa = C2PSA(c1=256, c2=256, n=3, e=0.5)
>>> input_tensor = torch.randn(1, 256, 64, 64)
>>> output_tensor = c2psa(input_tensor)
"""def__init__(self, c1, c2, n=1, e=0.5):
"""Initializes C2PSA module with specified input/output channels, number of layers, and expansion ratio."""super().__init__()
assert c1 == c2
self.c = int(c1 * e)
self.cv1 = Conv(c1, 2* self.c, 1, 1)
self.cv2 = Conv(2* self.c, c1, 1)
self.m = nn.Sequential(*(PSABlock(self.c, attn_ratio=0.5, num_heads=self.c // 64) for _ inrange(n)))
defforward(self, x):
"""Processes the input tensor 'x' through a series of PSA blocks and returns the transformed tensor."""
a, b = self.cv1(x).split((self.c, self.c), dim=1)
b = self.m(b)
returnself.cv2(torch.cat((a, b), 1))
[1] Zhang Y , Li H , Bu R ,et al.Fuzzy Multi-objective Requirements for NRP Based on Particle Swarm Optimization[C]//2020.DOI:10.1007/978-3-030-57881-7_13.
[2] Zhao N , Cao M , Song C ,et al.Trusted Component Decomposition Based on OR-Transition Colored Petri Net[C]//International Conference on Artificial Intelligence and Security.Springer, Cham, 2019.DOI:10.1007/978-3-030-24268-8_41.
DOI: 10.1109/ACCESS.2020.2973568
[3] Song C, Chang H. RST R-CNN: a triplet matching few-shot remote sensing object detection framework[C]//Fourth International Conference on Computer Vision, Application, and Algorithm (CVAA 2024). SPIE, 2025, 13486: 553-568.
[4] Zhou Q , Yu C . Point RCNN: An Angle-Free Framework for Rotated Object Detection[J]. Remote Sensing, 2022, 14.
[5] Zhang, Y., Li, H., Bu, R., Song, C., Li, T., Kang, Y., & Chen, T. (2020). Fuzzy Multi-objective Requirements for NRP Based on Particle Swarm Optimization. International Conference on Adaptive and Intelligent Systems.
[6] Li X , Deng J , Fang Y . Few-Shot Object Detection on Remote Sensing Images[J]. IEEE Transactions on Geoscience and Remote Sensing, 2021(99).
[7] Su W, Zhu X, Tao C, et al. Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information[J]. arXiv preprint arXiv:2211.09807, 2022.
[8] Chen Q, Wang J, Han C, et al. Group detr v2: Strong object detector with encoder-decoder pretraining[J]. arXiv preprint arXiv:2211.03594, 2022.
[9] Liu, Shilong, et al. 'Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.' arXiv preprint arXiv:2303.05499 (2023).
[10] Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 779-788.
[11] Redmon J, Farhadi A. YOLO9000: better, faster, stronger[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 7263-7271.
[12] Redmon J, Farhadi A. Yolov3: An incremental improvement[J]. arXiv preprint arXiv:1804.02767, 2018.
[13] Tian Z, Shen C, Chen H, et al. Fcos: Fully convolutional one-stage object detection[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2019: 9627-9636.
[14] Chen L C, Zhu Y, Papandreou G, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 801-818.
[15] Liu W, Anguelov D, Erhan D, et al. Ssd: Single shot multibox detector[C]//Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. Springer International Publishing, 2016: 21-37.
[16] Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 2117-2125.
[17] Cai Z, Vasconcelos N. Cascade r-cnn: Delving into high quality object detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 6154-6162.
[18] Ren S, He K, Girshick R, et al. Faster r-cnn: Towards real-time object detection with region proposal networks[J]. Advances in neural information processing systems, 2015, 28.
[19] Wang R, Shivanna R, Cheng D, et al. Dcn v2: Improved deep & cross network and practical lessons for web-scale learning to rank systems[C]//Proceedings of the web conference 2021. 2021: 1785-1797.
[20] Chen L C, Papandreou G, Schroff F, et al. Rethinking atrous convolution for semantic image segmentation[J]. arXiv preprint arXiv:1706.05587, 2017.
模型改进的基本流程(选看)
首先我们说说如何在 yolo 的基础模型上进行改进。
新增配置文件
在 task.py文件中引用。
在 init.py文件中引用。
在 block.py或者 conv.py中添加你要修改的模块,比如我在这里添加了 se 的类,包含了输入和输出的通道数。
速度方面的改进速度方面改进 2-GhostConvGhost Convolution 是一种轻量化卷积操作,首次提出于论文《GhostNet: More Features from Cheap Operations》(CVPR 2020)。GhostConv 的核心思想是利用便宜的操作生成额外的特征图,以减少计算复杂度和参数量。、GhostConv 的核心思想如是,卷积操作会生成冗余的特征图。许多特征图之间存在高相关性。GhostConv 的目标是通过减少冗余特征图的计算来加速网络的推理。GhostConv 的结构如下: