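A2C2f is the main module proposed in YOLOv12; the "A" stands for Area Attention. To overcome the high computational complexity of conventional self-attention, YOLOv12 introduces an area attention module (Area Attention, A2): a feature map of resolution (H, W) is divided into l segments of size (H/l, W) or (H, W/l). This avoids explicit window partitioning and needs only a simple reshape operation, which makes it faster. The default value of l is 4, which shrinks the receptive field to 1/4 of the original while still keeping it large. With this approach, the computational cost of the attention mechanism drops from 2n²hd to (1/2)n²hd, because each of the l = 4 areas attends over only n/4 tokens. Although the complexity is still quadratic in n, it remains efficient enough to meet the real-time requirements of YOLO at the fixed input size of 640 (n grows as the input resolution increases). A2 therefore lowers the cost of attention while retaining a large receptive field, and significantly improves detection accuracy. As the figure below shows, the region on the right represents the attention mechanism proposed by the authors: it captures the relevant regions with comparatively little computation.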
![image]
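The reshape idea described above can be sketched in a few lines of PyTorch. This is a minimal single-head illustration, not the YOLOv12 implementation: the function names `area_attention` and `attention_cost` are hypothetical, the query/key/value projections are omitted, and the tokens are assumed to be the feature map flattened in row-major order, so that contiguous chunks of the token axis correspond to horizontal strips of size (H/l, W).

```python
import torch
import torch.nn.functional as F


def area_attention(q, k, v, area=4):
    """Single-head attention restricted to `area` equal segments of the token axis.

    q, k, v have shape (B, N, C), where N = H * W tokens in row-major order.
    """
    b, n, c = q.shape
    assert n % area == 0, "token count must be divisible by the number of areas"
    # Fold the areas into the batch dimension: (B, N, C) -> (B * area, N / area, C).
    # No window partitioning is needed, only a reshape.
    q, k, v = (t.reshape(b * area, n // area, c) for t in (q, k, v))
    attn = F.softmax(q @ k.transpose(-2, -1) * c ** -0.5, dim=-1)
    out = attn @ v
    # Unfold back to the original token layout.
    return out.reshape(b, n, c)


def attention_cost(n, h, d, area=1):
    """Approximate attention cost: `area` segments, each costing 2 * (n / area)^2 * h * d."""
    return area * 2 * (n // area) ** 2 * h * d


x = torch.randn(2, 64, 16)  # e.g. an 8x8 feature map with 16 channels
y = area_attention(x, x, x, area=4)
print(y.shape)
print(attention_cost(400, 8, 32, area=4) / attention_cost(400, 8, 32, area=1))
```

With l areas the cost becomes l · 2(n/l)²hd = 2n²hd / l, so l = 4 gives the (1/2)n²hd figure quoted above.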
The implementation of Area Attention (the A2C2f module) is as follows.
import torch
import torch.nn as nn

# Conv, ABlock, C3k and PSABlock are defined elsewhere in the Ultralytics module files (conv.py / block.py).


class A2C2f(nn.Module):
    """
    Area-Attention C2f module for enhanced feature extraction with area-based attention mechanisms.

    This module extends the C2f architecture by incorporating area-attention and ABlock layers for improved
    feature processing. It supports both area-attention and standard convolution modes.

    Attributes:
        cv1 (Conv): Initial 1x1 convolution layer that reduces input channels to hidden channels.
        cv2 (Conv): Final 1x1 convolution layer that processes concatenated features.
        gamma (nn.Parameter | None): Learnable parameter for residual scaling when using area attention.
        m (nn.ModuleList): List of either ABlock or C3k modules for feature processing.

    Methods:
        forward: Processes input through area-attention or standard convolution pathway.

    Examples:
        >>> m = A2C2f(512, 512, n=1, a2=True, area=1)
        >>> x = torch.randn(1, 512, 32, 32)
        >>> output = m(x)
        >>> print(output.shape)
        torch.Size([1, 512, 32, 32])
    """

    def __init__(self, c1, c2, n=1, a2=True, area=1, residual=False, mlp_ratio=2.0, e=0.5, g=1, shortcut=True):
        """
        Area-Attention C2f module for enhanced feature extraction with area-based attention mechanisms.

        Args:
            c1 (int): Number of input channels.
            c2 (int): Number of output channels.
            n (int): Number of ABlock or C3k modules to stack.
            a2 (bool): Whether to use area attention blocks. If False, uses C3k blocks instead.
            area (int): Number of areas the feature map is divided.
            residual (bool): Whether to use residual connections with learnable gamma parameter.
            mlp_ratio (float): Expansion ratio for MLP hidden dimension.
            e (float): Channel expansion ratio for hidden channels.
            g (int): Number of groups for grouped convolutions.
            shortcut (bool): Whether to use shortcut connections in C3k blocks.
        """
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        assert c_ % 32 == 0, "Dimension of ABlock must be a multiple of 32."

        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv((1 + n) * c_, c2, 1)

        # Per-channel residual scale, only used with area attention and residual=True.
        self.gamma = nn.Parameter(0.01 * torch.ones(c2), requires_grad=True) if a2 and residual else None
        self.m = nn.ModuleList(
            nn.Sequential(*(ABlock(c_, c_ // 32, mlp_ratio, area) for _ in range(2)))
            if a2
            else C3k(c_, c_, 2, shortcut, g)
            for _ in range(n)
        )

    def forward(self, x):
        """Forward pass through R-ELAN layer."""
        y = [self.cv1(x)]
        y.extend(m(y[-1]) for m in self.m)  # each block consumes the previous block's output
        y = self.cv2(torch.cat(y, 1))
        if self.gamma is not None:
            return x + self.gamma.view(-1, len(self.gamma), 1, 1) * y
        return y


class Concat(nn.Module):
    """Concatenate a list of tensors along dimension."""

    def __init__(self, dimension=1):
        """Concatenates a list of tensors along a specified dimension."""
        super().__init__()
        self.d = dimension

    def forward(self, x):
        """Concatenate the input list of tensors along the configured dimension."""
        return torch.cat(x, self.d)


class C2PSA(nn.Module):
    """
    C2PSA module with attention mechanism for enhanced feature extraction and processing.

    This module implements a convolutional block with attention mechanisms to enhance feature extraction and
    processing capabilities. It includes a series of PSABlock modules for self-attention and feed-forward operations.

    Attributes:
        c (int): Number of hidden channels.
        cv1 (Conv): 1x1 convolution layer to reduce the number of input channels to 2*c.
        cv2 (Conv): 1x1 convolution layer to reduce the number of output channels to c.
        m (nn.Sequential): Sequential container of PSABlock modules for attention and feed-forward operations.

    Methods:
        forward: Performs a forward pass through the C2PSA module, applying attention and feed-forward operations.

    Notes:
        This module essentially is the same as PSA module, but refactored to allow stacking more PSABlock modules.

    Examples:
        >>> c2psa = C2PSA(c1=256, c2=256, n=3, e=0.5)
        >>> input_tensor = torch.randn(1, 256, 64, 64)
        >>> output_tensor = c2psa(input_tensor)
    """

    def __init__(self, c1, c2, n=1, e=0.5):
        """Initializes C2PSA module with specified input/output channels, number of layers, and expansion ratio."""
        super().__init__()
        assert c1 == c2
        self.c = int(c1 * e)
        self.cv1 = Conv(c1, 2 * self.c, 1, 1)
        self.cv2 = Conv(2 * self.c, c1, 1)

        self.m = nn.Sequential(*(PSABlock(self.c, attn_ratio=0.5, num_heads=self.c // 64) for _ in range(n)))

    def forward(self, x):
        """Processes the input tensor 'x' through a series of PSA blocks and returns the transformed tensor."""
        a, b = self.cv1(x).split((self.c, self.c), dim=1)
        b = self.m(b)
        return self.cv2(torch.cat((a, b), 1))
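Basic workflow for modifying the model (optional reading)
First, let's look at how to make changes on top of the base YOLO model.
Add a new configuration file.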
![image]
Import the new module in tasks.py.
![image]
Import it in __init__.py.
![image]
Add the module you want to modify to block.py or conv.py. For example, here I added an SE class, which takes the number of input and output channels.
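The SE class itself only appears in the screenshots, so the following is a minimal sketch of what such a module could look like rather than the author's exact code. The constructor signature `SE(c1, c2, reduction=16)` and the `reduction` parameter are assumptions made for illustration; only the idea of passing input and output channel counts comes from the text.

```python
import torch
import torch.nn as nn


class SE(nn.Module):
    """Squeeze-and-Excitation block (illustrative version, not the author's exact class)."""

    def __init__(self, c1, c2, reduction=16):
        super().__init__()
        # SE only rescales channels, so input and output channel counts must match.
        assert c1 == c2, "SE keeps the channel count unchanged"
        hidden = max(c1 // reduction, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average per channel
        self.fc = nn.Sequential(             # excitation: per-channel gate in (0, 1)
            nn.Conv2d(c1, hidden, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, c2, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Scale every channel of x by its learned gate.
        return x * self.fc(self.pool(x))


se = SE(64, 64)
x = torch.randn(1, 64, 32, 32)
print(se(x).shape)
```

Because the output shape equals the input shape, a block like this can be dropped between existing layers without changing the channel arguments of its neighbours.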
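Speed improvements
Speed improvement 2: GhostConv
Ghost Convolution is a lightweight convolution operation, first proposed in the paper "GhostNet: More Features from Cheap Operations" (CVPR 2020). Its core idea is to generate additional feature maps with cheap operations, reducing both computational cost and parameter count. The underlying observation is that ordinary convolutions produce redundant feature maps, many of which are highly correlated with one another. GhostConv aims to speed up inference by avoiding the computation of those redundant feature maps. The structure of GhostConv is as follows: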
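In code, that structure can be sketched roughly as below. This is a hedged illustration, not the reference implementation: the names `conv_bn_act` and `GhostConvSketch` are hypothetical, and the choices of producing half of the output channels with an ordinary convolution and the other half with a 5x5 depthwise convolution (the "cheap operation") are assumptions.

```python
import torch
import torch.nn as nn


def conv_bn_act(c1, c2, k, s=1, g=1):
    """Conv2d + BatchNorm + SiLU with 'same'-style padding (helper for this sketch)."""
    return nn.Sequential(
        nn.Conv2d(c1, c2, k, s, padding=k // 2, groups=g, bias=False),
        nn.BatchNorm2d(c2),
        nn.SiLU(),
    )


class GhostConvSketch(nn.Module):
    """Half of the output comes from a normal conv, the other half from a cheap depthwise conv."""

    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        c_ = c2 // 2  # channels produced by the primary convolution
        self.primary = conv_bn_act(c1, c_, k, s)
        self.cheap = conv_bn_act(c_, c_, 5, 1, g=c_)  # depthwise: one filter per channel

    def forward(self, x):
        y = self.primary(x)
        # "Ghost" features are derived from y instead of being computed from x directly.
        return torch.cat((y, self.cheap(y)), 1)


def count_params(module):
    return sum(p.numel() for p in module.parameters())


ghost = GhostConvSketch(64, 128, k=3)
standard = conv_bn_act(64, 128, 3)
print(ghost(torch.randn(1, 64, 16, 16)).shape)
print(count_params(ghost), count_params(standard))
```

For the same input and output channel counts, the sketch uses roughly half the parameters of the ordinary 3x3 convolution, since the depthwise branch adds only a small number of weights.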