YOLOv8 Model Network Architecture Explained

Figure: YOLOv8 architecture diagram

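1 YOLOv8 Configuration File

The YOLOv8 configuration file defines the model's key parameters and structure: the number of classes, the model scales, and the layer lists for the backbone and the head. Together these settings determine the model's capacity and computational cost.

Core parameters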

# Ultralytics YOLO 🚀, AGPL-3.0 license
nc: 80 # number of classes
scales:
  n: [0.33, 0.25, 1024] # depth, width, max_channels
  s: [0.33, 0.50, 1024]
  m: [0.67, 0.75, 768]
  l: [1.00, 1.00, 512]
  x: [1.00, 1.25, 512]

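nc: short for "number of classes", the total number of object categories the model detects. With the default COCO dataset, nc=80.
scales: defines the model variants (n, s, m, l, x). Each entry holds three values: depth (a multiplier on module repeat counts), width (a multiplier on channel counts), and max_channels (an upper bound on the channel count).
backbone: the backbone network, which extracts features from the input image. It uses a CSPDarknet-like structure.
head: the detection head, which produces the final detections. In the YAML file this section also lists the feature-fusion (neck) layers.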
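
To make the three scale values concrete, the sketch below applies them to one layer's repeat count and output channels. It assumes the rule used by the Ultralytics model parser: repeats are multiplied by depth and rounded, channels are capped at max_channels, multiplied by width, and rounded up to a multiple of 8. The helpers `make_divisible` and `scale_layer` are written here for illustration and are not taken from the article.

```python
import math

# Scale table copied from the configuration above: (depth, width, max_channels)
SCALES = {
    "n": (0.33, 0.25, 1024),
    "s": (0.33, 0.50, 1024),
    "m": (0.67, 0.75, 768),
    "l": (1.00, 1.00, 512),
    "x": (1.00, 1.25, 512),
}


def make_divisible(x, divisor=8):
    """Round x up to the nearest multiple of divisor."""
    return math.ceil(x / divisor) * divisor


def scale_layer(repeats, channels, scale):
    """Return (scaled_repeats, scaled_channels) for one layer row (assumed rule)."""
    depth, width, max_channels = SCALES[scale]
    # Only layers that repeat more than once are depth-scaled; never drop below 1.
    n = max(round(repeats * depth), 1) if repeats > 1 else repeats
    # Cap the channel count first, then apply the width multiplier.
    c = make_divisible(min(channels, max_channels) * width)
    return n, c


# The deepest C2f row of the backbone: 6 repeats in the P4 stage, 1024 channels in P5.
for name in SCALES:
    print(name, scale_layer(6, 1024, name))
```

Under these assumptions the same YAML row yields a much smaller layer for the n variant than for the x variant, which is how one file describes all five model sizes.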

Layer structure in detail

backbone:
  - [-1, 1, Conv, [64, 3, 2]]       # P1/2
  - [-1, 1, Conv, [128, 3, 2]]      # P2/4
  - [-1, 3, C2f, [128, True]]       # P2/4
  - [-1, 1, Conv, [256, 3, 2]]      # P3/8
  - [-1, 6, C2f, [256, True]]       # P3/8
  - [-1, 1, Conv, [512, 3, 2]]      # P4/16
  - [-1, 6, C2f, [512, True]]       # P4/16
  - [-1, 1, Conv, [1024, 3, 2]]     # P5/32
  - [-1, 3, C2f, [1024, True]]      # P5/32
  - [-1, 1, SPPF, [1024, 5]]        # SPPF
head:
  - [-1, 1, nn.Upsample, [None, 2, 'nearest']] # Upsample
  - [[-1, 6], 1, Concat, [1]]       # cat backbone P4
  - [-1, 3, C2f, [512]]             # C2f
  ...                               # (remaining layers omitted)
  - [[15, 18, 21], 1, Detect, [nc]] # Detect(P3, P4, P5)

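Each row has the form [from, repeats, module, args]: from is the index of the input layer (-1 means the previous layer), repeats is the number of times the module is stacked before depth scaling, and args are the module's arguments.

Conv: convolution layer; args are output channels, kernel size, and stride.
C2f: CSP Bottleneck with 2 convolutions; improves gradient flow. Its args are output channels and an optional shortcut flag.
SPPF: Spatial Pyramid Pooling - Fast; aggregates features at multiple scales.
Upsample: upsampling layer; increases the spatial resolution of the feature map.
Detect: final detection layer; outputs the predictions.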
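
The P1/2 to P5/32 comments in the backbone can be reproduced by walking the rows and multiplying the strides of the Conv layers. The following is a small sketch of that bookkeeping; the module names are kept as plain strings, and the 640x640 input size is an assumption chosen only for illustration.

```python
# Backbone rows copied from the configuration above: [from, repeats, module, args]
BACKBONE = [
    [-1, 1, "Conv", [64, 3, 2]],
    [-1, 1, "Conv", [128, 3, 2]],
    [-1, 3, "C2f", [128, True]],
    [-1, 1, "Conv", [256, 3, 2]],
    [-1, 6, "C2f", [256, True]],
    [-1, 1, "Conv", [512, 3, 2]],
    [-1, 6, "C2f", [512, True]],
    [-1, 1, "Conv", [1024, 3, 2]],
    [-1, 3, "C2f", [1024, True]],
    [-1, 1, "SPPF", [1024, 5]],
]

IMG = 640  # assumed input size


def cumulative_strides(layers):
    """Return the total downsampling factor after each layer."""
    stride, out = 1, []
    for _from, _repeats, module, args in layers:
        if module == "Conv":
            stride *= args[2]  # Conv args are [out_channels, kernel, stride]
        out.append(stride)     # C2f and SPPF keep the spatial size
    return out


strides = cumulative_strides(BACKBONE)
for i, (layer, s) in enumerate(zip(BACKBONE, strides)):
    print(f"{i}: {layer[2]:<5} stride {s:>2} -> {IMG // s}x{IMG // s}")
```

Layer 6 ends up at stride 16, which matches the head row `[[-1, 6], 1, Concat, [1]]` that concatenates with backbone P4.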

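2 YOLOv8 Network Structure

Network structure details

Backbone: extracts features from the input image; uses a CSPDarknet-like structure.
Neck: sits between the backbone and the head; fuses and enhances features across scales.
Head: the decision-making part of the detector; produces the final detections.

Key components

ConvModule: a convolution (Conv) followed by batch normalization (BN) and a SiLU activation, also called the CBS module.
DarknetBottleneck: adds depth through residual connections while staying efficient.
CSP Layer: a variant of the Cross Stage Partial structure that improves training efficiency.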

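Output feature-map size: $$ f_{out} = \left\lfloor \frac{f_{in} - k + 2p}{s} \right\rfloor + 1 $$ where $f_{in}$ is the input size, $k$ the kernel size, $p$ the padding, and $s$ the stride.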
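
As a quick check of the formula, the sketch below evaluates it for two cases that appear in this article: a 3x3 convolution with stride 2 and padding 1, and a 5x5 pooling window with stride 1 and padding 2. The 640-pixel input is an assumption used only for illustration.

```python
def conv_out_size(f_in, k, p, s):
    """Output size along one spatial dimension: floor((f_in - k + 2p) / s) + 1."""
    return (f_in - k + 2 * p) // s + 1


# A 3x3, stride-2, padding-1 convolution halves the resolution.
print(conv_out_size(640, k=3, p=1, s=2))  # 320

# A 5x5, stride-1, padding-2 window keeps the resolution unchanged.
print(conv_out_size(20, k=5, p=2, s=1))   # 20

# Five stride-2 convolutions in a row take 640 down to 20.
size = 640
for _ in range(5):
    size = conv_out_size(size, k=3, p=1, s=2)
print(size)  # 20
```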

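Loss functions

Bbox Loss (bounding-box regression loss), shown here in a simplified squared-error form: $$ Loss_{bbox} = \sum_{i=1}^N{(x_i - \hat{x}_i)^2} $$ where $x_i$ is a ground-truth box coordinate and $\hat{x}_i$ is the predicted coordinate. Squaring the error penalizes large deviations most heavily. Note that the Ultralytics YOLOv8 implementation itself regresses boxes with a CIoU loss combined with Distribution Focal Loss (DFL) rather than plain squared error.

Cls Loss (classification loss): $$ Loss_{cls} = -\sum_{c=1}^M y_{o,c} \log(p_{o,c}) $$ where $y_{o,c}$ is an indicator (1 if sample $o$ belongs to class $c$, otherwise 0) and $p_{o,c}$ is the predicted probability. Minimizing cross-entropy pushes the predicted distribution toward the true labels. In the Ultralytics implementation each class is scored independently with a sigmoid and binary cross-entropy, a per-class variant of this formula.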
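
The two formulas above can be evaluated directly on small inputs. The following is a minimal sketch in plain Python that follows the formulas exactly as written; the function names and sample values are made up for illustration and this is not the training loss code of any library.

```python
import math


def bbox_squared_error(true_boxes, pred_boxes):
    """Sum of squared coordinate differences over all boxes."""
    return sum(
        (x - x_hat) ** 2
        for t_box, p_box in zip(true_boxes, pred_boxes)
        for x, x_hat in zip(t_box, p_box)
    )


def cross_entropy(y, p, eps=1e-12):
    """Cross-entropy for one sample: -sum_c y_c * log(p_c)."""
    # eps guards against log(0) when a predicted probability is exactly zero.
    return -sum(y_c * math.log(max(p_c, eps)) for y_c, p_c in zip(y, p))


# One box whose four coordinates are each off by 1.
print(bbox_squared_error([[0, 0, 10, 10]], [[1, 1, 9, 9]]))

# One sample of class 1 (one-hot target) with 0.7 predicted for that class.
print(cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))
```

With a one-hot target only the true class contributes, so the second value is simply the negative log of the probability assigned to the correct class.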

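2.1 Conv module

import torch
import torch.nn as nn

def autopad(k, p=None, d=1):
    """Return the padding that keeps the output size unchanged at stride 1."""
    if d > 1: k = d * (k - 1) + 1  # effective kernel size under dilation
    if p is None: p = k // 2  # auto-pad when no padding is given
    return p

class Conv(nn.Module):
    """Standard convolution block: Conv2d -> BatchNorm2d -> SiLU."""
    default_act = nn.SiLU()  # default activation
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        # c1: input channels, c2: output channels, k: kernel, s: stride,
        # p: padding, g: groups, d: dilation, act: True or a custom nn.Module
        super().__init__()
        # bias is disabled because BatchNorm supplies its own shift term
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))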
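
The padding rule in `autopad` is easiest to see by checking output sizes. The sketch below copies the `autopad` logic for integer kernel sizes and combines it with the standard convolution output-size formula extended for dilation; `conv_out_size` is a helper introduced here for illustration and is not part of the article's code.

```python
def autopad(k, p=None, d=1):
    """Same logic as the autopad function above (integer kernel sizes only)."""
    if d > 1:
        k = d * (k - 1) + 1
    if p is None:
        p = k // 2
    return p


def conv_out_size(f_in, k, s=1, p=None, d=1):
    """Output size along one dimension for a convolution with dilation d."""
    p = autopad(k, p, d)
    k_eff = d * (k - 1) + 1  # span covered by a dilated kernel
    return (f_in + 2 * p - k_eff) // s + 1


# For odd kernels and stride 1 the spatial size is preserved, with or without dilation.
for k in (1, 3, 5, 7):
    for d in (1, 2, 3):
        print(f"k={k} d={d} pad={autopad(k, d=d)} out={conv_out_size(64, k, d=d)}")

# With stride 2 the size is halved, as in the backbone's downsampling Conv layers.
print(conv_out_size(640, k=3, s=2))
```

Note that this holds for odd kernel sizes; an even kernel with `k // 2` padding would grow the output by one pixel at stride 1.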

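2.2 C3 and C2f modules

C3 vs. C2f

class C3(nn.Module):
    """CSP Bottleneck with 3 convolutions."""
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)  # branch that feeds the bottleneck stack
        self.cv2 = Conv(c1, c_, 1, 1)  # bypass branch
        self.cv3 = Conv(2 * c_, c2, 1)  # fuses the two concatenated branches
        # Bottleneck (the residual block) is not listed in this article
        self.m = nn.Sequential(*(Bottleneck(c_, c_, shortcut, g, k=((1,1),(3,3)), e=1.0) for _ in range(n)))
    def forward(self, x):
        # only the output of the last Bottleneck is concatenated with the bypass branch
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), 1))

The C2f module borrows the idea of the ELAN module from YOLOv7. Unlike C3, it concatenates the output of every Bottleneck rather than only the last one, which gives the network richer gradient-flow information.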

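class C2f(nn.Module):
    """CSP Bottleneck with 2 convolutions."""
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):
        super().__init__()
        self.c = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, 2 * self.c, 1, 1)  # output is later split into two halves
        self.cv2 = Conv((2 + n) * self.c, c2, 1)  # fuses both halves plus n bottleneck outputs
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3,3),(3,3)), e=1.0) for _ in range(n))
    def forward(self, x):
        y = list(self.cv1(x).chunk(2, 1))  # split along the channel dimension
        y.extend(m(y[-1]) for m in self.m)  # each Bottleneck consumes the previous output
        return self.cv2(torch.cat(y, 1))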
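
The difference between the two modules shows up in how many channels reach the final fusing convolution. The sketch below mirrors the two forward passes using channel counts only, with no tensors involved; the function names are hypothetical helpers introduced for this illustration.

```python
def c3_concat_channels(c2, e=0.5):
    """Channels entering cv3 in C3: last bottleneck output plus the bypass branch."""
    c_ = int(c2 * e)
    return c_ + c_


def c2f_concat_channels(c2, n=1, e=0.5):
    """Channels entering cv2 in C2f, following its forward pass step by step."""
    c = int(c2 * e)
    y = [c, c]           # cv1 output (2c channels) split into two chunks
    for _ in range(n):
        y.append(c)      # each Bottleneck maps c -> c and its output is kept
    return sum(y)        # equals (2 + n) * c


for n in (1, 2, 3, 6):
    print(f"n={n}: C3 concat={c3_concat_channels(256)}, C2f concat={c2f_concat_channels(256, n)}")
```

The C3 concat width stays fixed regardless of n, while the C2f concat width grows with every Bottleneck because each intermediate result is retained.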

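2.3 SPPF module

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast."""
    def __init__(self, c1, c2, k=5):
        super().__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)  # reduce channels before pooling
        self.cv2 = Conv(c_ * 4, c2, 1, 1)  # fuse the input and the three pooled maps
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)  # size-preserving pool
    def forward(self, x):
        x = self.cv1(x)
        y1 = self.m(x)  # pooled once
        y2 = self.m(y1)  # pooled twice
        return self.cv2(torch.cat((x, y1, y2, self.m(y2)), 1))  # third pool applied inline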
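
SPPF applies the same 5x5 max pool three times in sequence. With stride 1, chaining pools this way covers the same windows as single pools of size 9 and 13, which is why it is a faster form of spatial pyramid pooling. The sketch below checks that property in one dimension with plain Python lists; `max_pool_1d` is a helper written for this illustration, and clipping the window at the borders stands in for padding with negative infinity.

```python
def max_pool_1d(x, k):
    """Stride-1 max pooling with padding k // 2 (window clipped at the borders)."""
    r = k // 2
    n = len(x)
    return [max(x[j] for j in range(max(0, i - r), min(n, i + r + 1))) for i in range(n)]


x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]

y1 = max_pool_1d(x, 5)    # same as one k=5 pool
y2 = max_pool_1d(y1, 5)   # two chained k=5 pools
y3 = max_pool_1d(y2, 5)   # three chained k=5 pools

print(y2 == max_pool_1d(x, 9))
print(y3 == max_pool_1d(x, 13))
```

Both comparisons hold, so the four tensors concatenated in `forward` correspond to receptive fields of 1, 5, 9, and 13 while only a 5x5 pool is ever computed.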

2.4 Upsample

torch.nn.Upsample(size=None, scale_factor=None, mode='nearest', align_corners=None)
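
In the head, Upsample is used with `scale_factor=2` and `mode='nearest'`, which for an integer factor amounts to repeating every pixel along both axes. The following is a minimal sketch of that behavior on a nested list; `upsample_nearest` is a hypothetical helper for illustration, not the PyTorch implementation.

```python
def upsample_nearest(grid, scale=2):
    """Repeat every element `scale` times along both axes of a 2D list."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(scale)]   # repeat along the width
        out.extend(list(wide) for _ in range(scale))    # repeat along the height
    return out


small = [[1, 2],
         [3, 4]]
for row in upsample_nearest(small):
    print(row)
```

A 2x2 map becomes 4x4, so a P5 feature map at stride 32 matches the spatial size of P4 at stride 16 and the two can be concatenated.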

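2.5 Detect layer

class Detect(nn.Module):
    """YOLOv8 detection head. make_anchors and dist2bbox are helpers not listed in this article."""
    dynamic = False  # force anchor-grid reconstruction on every call
    export = False  # export mode
    shape = None
    anchors = torch.empty(0)
    strides = torch.empty(0)

    def __init__(self, nc=80, ch=()):
        super().__init__()
        self.nc = nc  # number of classes
        self.nl = len(ch)  # number of detection levels
        self.reg_max = 16  # number of distribution bins per box side
        self.no = nc + self.reg_max * 4  # outputs per anchor
        self.stride = torch.zeros(self.nl)  # filled in when the model is built
        c2, c3 = max((16, ch[0]//4, self.reg_max * 4)), max(ch[0], self.nc)  # branch widths
        # box-regression branch, one per level
        self.cv2 = nn.ModuleList(
            nn.Sequential(Conv(x, c2, 3), Conv(c2, c2, 3), nn.Conv2d(c2, 4 * self.reg_max, 1))
            for x in ch
        )
        # classification branch, one per level
        self.cv3 = nn.ModuleList(
            nn.Sequential(Conv(x, c3, 3), Conv(c3, c3, 3), nn.Conv2d(c3, self.nc, 1))
            for x in ch
        )
        self.dfl = DFL(self.reg_max) if self.reg_max > 1 else nn.Identity()

    def forward(self, x):
        shape = x[0].shape  # BCHW
        for i in range(self.nl):
            # concatenate box and class outputs along the channel dimension
            x[i] = torch.cat((self.cv2[i](x[i]), self.cv3[i](x[i])), 1)
        if self.training:
            return x  # raw per-level maps go to the loss
        elif self.dynamic or self.shape != shape:
            # rebuild the anchor points only when the input shape changes
            self.anchors, self.strides = (x.transpose(0, 1) for x in make_anchors(x, self.stride, 0.5))
            self.shape = shape
        # flatten all levels, then split into box distributions and class logits
        box, cls = torch.cat([xi.view(shape[0], self.no, -1) for xi in x], 2).split((self.reg_max * 4, self.nc), 1)
        # decode distances to xywh boxes and scale back to input pixels
        dbox = dist2bbox(self.dfl(box), self.anchors.unsqueeze(0), xywh=True, dim=1) * self.strides
        y = torch.cat((dbox, cls.sigmoid()), 1)
        return y if self.export else (y, x)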
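
The tensor shapes flowing through Detect follow from `nc`, `reg_max`, and the three detection strides. The sketch below works them out with plain arithmetic, omitting the batch dimension. The 640x640 input size is an assumption for illustration, and the strides 8, 16, and 32 are taken from the P3, P4, and P5 levels in the configuration.

```python
def detect_shapes(nc=80, reg_max=16, img=640, strides=(8, 16, 32)):
    """Return the per-anchor output count and the tensor shapes (batch dim omitted)."""
    no = nc + 4 * reg_max                            # channels per anchor before decoding
    grids = [(img // s, img // s) for s in strides]  # feature-map size at each level
    anchors = sum(h * w for h, w in grids)           # one anchor point per grid cell
    return {
        "no": no,
        "train_maps": [(no, h, w) for h, w in grids],  # what forward returns in training
        "anchors": anchors,
        "inference": (4 + nc, anchors),                # decoded xywh plus class scores
    }


info = detect_shapes()
for key, value in info.items():
    print(key, value)
```

After decoding, the 64 distribution channels collapse to 4 box coordinates, so the inference output has `4 + nc` rows per anchor rather than `no`.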

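The DFL module (the integral step of Distribution Focal Loss) handles the distribution-based box regression: for each box side it converts the predicted distribution over reg_max bins into a single expected distance.

class DFL(nn.Module):
    """Integral module of Distribution Focal Loss."""
    def __init__(self, c1=16):
        super().__init__()
        # fixed 1x1 convolution whose weights are the bin indices 0..c1-1 (not trained)
        self.conv = nn.Conv2d(c1, 1, 1, bias=False).requires_grad_(False)
        x = torch.arange(c1, dtype=torch.float)
        self.conv.weight.data[:] = nn.Parameter(x.view(1, c1, 1, 1))
        self.c1 = c1
    def forward(self, x):
        b, c, a = x.shape  # batch, channels, anchors
        # softmax over the bins, then the weighted sum gives the expected distance per side
        return self.conv(x.view(b, 4, self.c1, a).transpose(2, 1).softmax(1)).view(b, 4, a)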
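
Stripped of tensor reshaping, the DFL forward pass is a softmax followed by an expectation over bin indices. The sketch below reproduces that for a single anchor in plain Python and then turns the four distances into a box. The decoding step assumes the distances are ordered left, top, right, bottom and measured from the anchor point in grid units, which is how `dist2bbox` is understood to work here; all function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dfl_expectation(logits):
    """Expected bin index under softmax(logits): sum_i i * p_i."""
    return sum(i * p for i, p in enumerate(softmax(logits)))


def decode_box(side_logits, anchor_xy, stride):
    """Map four per-side logit vectors to (cx, cy, w, h) in input pixels (assumed ltrb order)."""
    left, top, right, bottom = (dfl_expectation(s) for s in side_logits)
    ax, ay = anchor_xy
    x1, y1, x2, y2 = ax - left, ay - top, ax + right, ay + bottom
    return ((x1 + x2) / 2 * stride, (y1 + y2) / 2 * stride,
            (x2 - x1) * stride, (y2 - y1) * stride)


uniform = [0.0] * 16                                   # no preference among 16 bins
peaked = [50.0 if i == 3 else 0.0 for i in range(16)]  # almost all mass on bin 3

print(dfl_expectation(uniform))
print(dfl_expectation(peaked))
print(decode_box([peaked] * 4, anchor_xy=(10.5, 10.5), stride=8))
```

A flat distribution yields the midpoint of the bin range, while a sharply peaked one yields approximately the index of the peak, so the network can express both the box side position and how certain it is.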