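1. Network Architecture Design

1.1 Dense Block

Mathematical formulation of dense connectivity:

Layer $\ell$ takes as input the concatenation of the outputs of all preceding layers:

$x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}])$

Here $[x_0, x_1, \ldots, x_{\ell-1}]$ denotes concatenation along the channel dimension, and $H_\ell$ is the composite function BatchNorm → ReLU → Conv3×3.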
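
As a worked illustration of the formula, the recursion can be unrolled for a block with four layers. This only restates the definition for $\ell = 1, \ldots, 4$; the choice of four layers is an assumption made to keep the example short.

```latex
\begin{aligned}
x_1 &= H_1([x_0]) \\
x_2 &= H_2([x_0, x_1]) \\
x_3 &= H_3([x_0, x_1, x_2]) \\
x_4 &= H_4([x_0, x_1, x_2, x_3])
\end{aligned}
```

The output handed to the next stage is then the full concatenation $[x_0, x_1, x_2, x_3, x_4]$.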

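Advantages of dense connectivity:

Feature reuse: every layer has direct access to the feature maps of all preceding layers.

Gradient flow: the short paths between layers mitigate vanishing gradients and ease backpropagation.

Parameter efficiency: each layer adds only a few new feature maps instead of relearning redundant ones, which keeps the model compact.

Implicit deep supervision: every layer receives gradients from the final loss.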

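Growth rate $k$:

The growth rate $k$ is DenseNet's core hyperparameter: it is the number of new feature maps that each dense layer produces.

If the block input has $k_0$ feature maps, layer $\ell$ receives $k_0 + k \times (\ell-1)$ input feature maps.
A typical value is $k = 32$, which performs well on ImageNet.
A smaller $k$ makes the network more compact but may limit its representational capacity.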
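
The input-channel formula can be checked with a few lines of plain Python. This is a minimal sketch: the function names are made up for illustration, and the values $k_0 = 64$, $k = 32$ with four layers are example inputs, not fixed by the formula.

```python
def dense_layer_input_channels(k0: int, k: int, num_layers: int) -> list[int]:
    """Input channel count of each dense layer, for layers 1..num_layers."""
    return [k0 + k * (layer - 1) for layer in range(1, num_layers + 1)]


def dense_block_output_channels(k0: int, k: int, num_layers: int) -> int:
    """Channels after the last concatenation: the block input plus k per layer."""
    return k0 + k * num_layers


if __name__ == "__main__":
    # Example values (assumed for illustration): k0 = 64, k = 32, 4 layers.
    print(dense_layer_input_channels(64, 32, 4))   # [64, 96, 128, 160]
    print(dense_block_output_channels(64, 32, 4))  # 192
```

Each layer's input grows by exactly $k$ channels, so the per-layer cost rises linearly with depth inside a block.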

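1.2 Transition Layer

A transition layer sits between two dense blocks and keeps model complexity in check. It first applies a 1×1 convolution to reduce the number of channels, so that the channel count does not keep growing from block to block, and then applies 2×2 average pooling to shrink the spatial size of the feature maps and cut computation: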

from torch import nn, Tensor


class TransitionLayer(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        """
        Transition layer.

        Reduces the channel count with a 1x1 convolution, then downsamples with 2x2 average pooling.

        :param in_channels: number of channels in the input feature maps
        :param out_channels: number of channels in the output feature maps
        """
        super().__init__()
        self.transition = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),  # 1x1 convolution: reduce channels
            nn.AvgPool2d(kernel_size=2, stride=2)  # 2x2 average pooling: downsample
        )

    def forward(self, x: Tensor) -> Tensor:
        return self.transition(x)

Transition layer dimension test results:

from torchinfo import summary

model = TransitionLayer(in_channels=23, out_channels=10)
summary(model, input_size=(4, 23, 8, 8))

==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
TransitionLayer                          [4, 10, 4, 4]             --
├─Sequential: 1-1                        [4, 10, 4, 4]             --
│    └─BatchNorm2d: 2-1                  [4, 23, 8, 8]             46
│    └─ReLU: 2-2                         [4, 23, 8, 8]             --
│    └─Conv2d: 2-3                       [4, 10, 8, 8]             230
│    └─AvgPool2d: 2-4                    [4, 10, 4, 4]             --
==========================================================================================
Total params: 276
Trainable params: 276
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 0.06
==========================================================================================
Input size (MB): 0.02
Forward/backward pass size (MB): 0.07
Params size (MB): 0.00
Estimated Total Size (MB): 0.09
==========================================================================================

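1.3 Complete DenseNet Architecture

A simplified DenseNet design:

DenseNet borrows the overall layout of ResNet and alternates dense blocks with transition layers. The network built here is a simplified variant: the original DenseNet-121 uses dense blocks of 6, 12, 24 and 16 layers, each with an extra 1×1 bottleneck convolution, whereas this version uses 4 plain layers per block.

Stem: 7×7 Conv + 3×3 MaxPool (both with stride 2)

4 dense blocks: 4 layers each, growth rate k=32

3 transition layers: 1×1 Conv + 2×2 AvgPool

Global average pooling + fully connected output layer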

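Channel count calculation

Channel changes through each dense block:

Input: 64 channels
Dense block 1: 64 → 64+32×4 = 192 channels
Transition layer 1: 192 → 96 channels (halved)

Dense block 2: 96 → 96+32×4 = 224 channels
Transition layer 2: 224 → 112 channels

Dense block 3: 112 → 112+32×4 = 240 channels
Transition layer 3: 240 → 120 channels

Dense block 4: 120 → 120+32×4 = 248 channels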
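
The channel bookkeeping above follows one rule per stage, so it can be generated rather than computed by hand. The sketch below assumes that every transition layer halves the channel count and that the last block has no transition; `channel_schedule` is a hypothetical helper name, not part of the model code.

```python
def channel_schedule(stem_channels: int, growth_rate: int,
                     layers_per_block: list[int]) -> list[tuple[str, int, int]]:
    """Return (stage name, in_channels, out_channels) for each block and transition."""
    schedule = []
    channels = stem_channels
    for index, num_layers in enumerate(layers_per_block, start=1):
        # A dense block adds growth_rate channels per layer.
        out_channels = channels + growth_rate * num_layers
        schedule.append((f"dense block {index}", channels, out_channels))
        channels = out_channels
        # Every block except the last is followed by a transition layer
        # that halves the channel count.
        if index < len(layers_per_block):
            schedule.append((f"transition {index}", channels, channels // 2))
            channels //= 2
    return schedule


if __name__ == "__main__":
    for name, c_in, c_out in channel_schedule(64, 32, [4, 4, 4, 4]):
        print(f"{name}: {c_in} -> {c_out}")
```

The `out_channels` values for four blocks of four layers are 192, 96, 224, 112, 240, 120 and 248, matching the list above; the last value is the input width of the final fully connected layer.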

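DenseNet vs ResNet:

Parameter efficiency:

DenseNet-121: ~7.98M parameters
ResNet-50: ~25.5M parameters
Broadly comparable accuracy with far fewer parameters

Memory usage:

DenseNet consumes more memory (feature maps are concatenated)
ResNet is more computationally efficient (residual connections add feature maps instead of concatenating them)
Which one to choose depends on the application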

import torch
from torch import Tensor, nn


class DenseBlock(nn.Module):
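    def __init__(self, in_channels: int, layers_num: int, growth_rate: int):
        """
        Dense block.

        A stack of "BatchNorm → ReLU → Conv" units (dense layers). Each dense layer adds k channels,
        and its output is concatenated with its input along the channel dimension.

        :param in_channels: number of channels in the input feature maps
        :param layers_num: number of dense layers to stack
        :param growth_rate: growth rate (k), the number of new channels each dense layer produces;
                            they are concatenated onto the outputs of the preceding layers
        """
        super().__init__()

        self.growth_rate = growth_rate

        # Create the requested number of dense layers; layer i receives in_channels + growth_rate * i channels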
        self.dense_layers = nn.ModuleList([
            self._get_dense_layer(in_channels + growth_rate * i)
            for i in range(layers_num)
        ])

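    def _get_dense_layer(self, connected_channels: int) -> nn.Sequential:
        """
        Build a single dense layer: BatchNorm → ReLU → Conv(3x3).

        :param connected_channels: input channels of this dense layer, i.e. the block's input channels
                                   plus growth_rate for every preceding dense layer
        """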
        dense_layer = nn.Sequential(
            nn.BatchNorm2d(connected_channels),
            nn.ReLU(),
            nn.Conv2d(connected_channels, self.growth_rate, kernel_size=3, padding=1, bias=False)
        )
        return dense_layer

    def forward(self, x: Tensor) -> Tensor:
        for layer in self.dense_layers:
            out = layer(x)
            x = torch.cat((x, out), dim=1)  # concatenate along the channel dimension
        return x


class TransitionLayer(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        """
        Transition layer.

        Reduces the channel count with a 1x1 convolution, then downsamples with 2x2 average pooling.

        :param in_channels: number of channels in the input feature maps
        :param out_channels: number of channels in the output feature maps
        """
        super().__init__()
        self.transition = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),  # 1x1 convolution: reduce channels
            nn.AvgPool2d(kernel_size=2, stride=2)  # 2x2 average pooling: downsample
        )

    def forward(self, x: Tensor) -> Tensor:
        return self.transition(x)


class DenseNet(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()

        self.model = nn.Sequential(
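            # Stem: 7x7 convolution and 3x3 max pooling, both with stride 2 (single-channel input)
            nn.Conv2d(in_channels=1, out_channels=64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),

            # Stage 1: 64 -> 192 channels, then halved to 96
            DenseBlock(in_channels=64, layers_num=4, growth_rate=32),
            TransitionLayer(in_channels=192, out_channels=96),

            # Stage 2: 96 -> 224 channels, then halved to 112
            DenseBlock(in_channels=96, layers_num=4, growth_rate=32),
            TransitionLayer(in_channels=224, out_channels=112),

            # Stage 3: 112 -> 240 channels, then halved to 120
            DenseBlock(in_channels=112, layers_num=4, growth_rate=32),
            TransitionLayer(in_channels=240, out_channels=120),

            # Stage 4: 120 -> 248 channels, no transition afterwards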
            DenseBlock(in_channels=120, layers_num=4, growth_rate=32),

            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_features=248, out_features=num_classes)
        )

        self._initialize_weights()

    def _initialize_weights(self):
        # Kaiming initialization for convolutions, Xavier initialization for the classifier
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)

    def forward(self, x: Tensor) -> Tensor:
        return self.model(x)

DenseNet model structure summary:

from torchinfo import summary

model = DenseNet(num_classes=10)
summary(model, input_size=(1, 1, 224, 224))

===============================================================================================
Layer (type:depth-idx)                        Output Shape              Param #
===============================================================================================
DenseNet                                      [1, 10]                   --
├─Sequential: 1-1                             [1, 10]                   --
│    └─Conv2d: 2-1                            [1, 64, 112, 112]         3,136
│    └─BatchNorm2d: 2-2                       [1, 64, 112, 112]         128
│    └─ReLU: 2-3                              [1, 64, 112, 112]         --
│    └─MaxPool2d: 2-4                         [1, 64, 56, 56]           --
│    └─DenseBlock: 2-5                        [1, 192, 56, 56]          --
│    │    └─ModuleList: 3-1                   --                        129,920
│    └─TransitionLayer: 2-6                   [1, 96, 28, 28]           --
│    │    └─Sequential: 3-2                   [1, 96, 28, 28]           18,816
│    └─DenseBlock: 2-7                        [1, 224, 28, 28]          --
│    │    └─ModuleList: 3-3                   --                        167,040
│    └─TransitionLayer: 2-8                   [1, 112, 14, 14]          --
│    │    └─Sequential: 3-4                   [1, 112, 14, 14]          25,536
│    └─DenseBlock: 2-9                        [1, 240, 14, 14]          --
│    │    └─ModuleList: 3-5                   --                        185,600
│    └─TransitionLayer: 2-10                  [1, 120, 7, 7]            --
│    │    └─Sequential: 3-6                   [1, 120, 7, 7]            29,280
│    └─DenseBlock: 2-11                       [1, 248, 7, 7]            --
│    │    └─ModuleList: 3-7                   --                        194,880
│    └─AdaptiveAvgPool2d: 2-12                [1, 248, 1, 1]            --
│    └─Flatten: 2-13                          [1, 248]                  --
│    └─Linear: 2-14                           [1, 10]                   2,490
===============================================================================================
Total params: 756,826
Trainable params: 756,826
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 702.75
===============================================================================================
Input size (MB): 0.20
Forward/backward pass size (MB): 43.13
Params size (MB): 3.03
Estimated Total Size (MB): 46.35
===============================================================================================
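
The parameter counts in this summary can be reproduced by hand, which is a useful check that the architecture is wired as intended. The sketch below assumes bias-free convolutions and two learnable parameters per BatchNorm channel, as in the model code; all helper names are introduced here for illustration only.

```python
def bn_params(channels: int) -> int:
    """BatchNorm2d has one weight and one bias per channel."""
    return 2 * channels


def conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Conv2d without bias: c_in * c_out * kernel * kernel weights."""
    return c_in * c_out * kernel * kernel


def dense_block_params(c_in: int, num_layers: int, growth_rate: int) -> int:
    """Sum BatchNorm + 3x3 convolution parameters over one dense block."""
    total = 0
    for i in range(num_layers):
        channels = c_in + growth_rate * i
        total += bn_params(channels) + conv_params(channels, growth_rate, 3)
    return total


def transition_params(c_in: int, c_out: int) -> int:
    """BatchNorm + 1x1 convolution; average pooling has no parameters."""
    return bn_params(c_in) + conv_params(c_in, c_out, 1)


def densenet_params(num_classes: int = 10) -> int:
    """Total for the simplified network: stem, 4 blocks, 3 transitions, classifier."""
    total = conv_params(1, 64, 7) + bn_params(64)  # stem
    channels = 64
    for block in range(4):
        total += dense_block_params(channels, 4, 32)
        channels += 32 * 4
        if block < 3:
            total += transition_params(channels, channels // 2)
            channels //= 2
    total += channels * num_classes + num_classes  # linear layer with bias
    return total


if __name__ == "__main__":
    print(dense_block_params(64, 4, 32))  # 129920, first DenseBlock row
    print(transition_params(192, 96))     # 18816, first TransitionLayer row
    print(densenet_params(10))            # 756826, "Total params"
```

Almost all parameters sit in the 3×3 convolutions of the dense blocks; the transition layers and the classifier contribute comparatively little.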

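Memory usage:

Although DenseNet sharply reduces the parameter count through feature reuse, it pays for this in memory:

Concatenating feature maps makes the channel count accumulate quickly
Backpropagation has to keep all intermediate feature maps
Each concatenation copies its inputs into a new tensor, which adds computational overhead

For these reasons DenseNet is more memory-hungry than ResNet, especially with high-resolution inputs.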
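
A rough way to see the effect is to count the channels that a naive implementation materializes. The sketch below makes simplifying assumptions: it counts only the block input and the tensors returned by each `torch.cat` call, ignores BatchNorm/ReLU intermediates and any framework-level optimization, and assumes float32 activations with batch size 1. The function names are hypothetical.

```python
def materialized_channels(c_in: int, num_layers: int, growth_rate: int) -> int:
    """Channels summed over the block input and every tensor returned by torch.cat."""
    return sum(c_in + growth_rate * i for i in range(num_layers + 1))


def distinct_channels(c_in: int, num_layers: int, growth_rate: int) -> int:
    """Channels of distinct feature maps: the block input plus k new maps per layer."""
    return c_in + growth_rate * num_layers


def megabytes(channels: int, height: int, width: int, bytes_per_value: int = 4) -> float:
    """Size of a float32 activation with the given channel count, for batch size 1."""
    return channels * height * width * bytes_per_value / 1e6


if __name__ == "__main__":
    # First dense block of the network above: 64 input channels, 4 layers, k = 32, 56x56 maps.
    total = materialized_channels(64, 4, 32)  # 64 + 96 + 128 + 160 + 192 = 640
    unique = distinct_channels(64, 4, 32)     # 192
    print(total, unique)
    print(round(megabytes(total, 56, 56), 2), round(megabytes(unique, 56, 56), 2))
```

Under these assumptions the concatenated copies hold more than three times as many channels as there are distinct feature maps, and the total grows quadratically with the number of layers in a block.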

2. Model Training and Performance Analysis

As before, the model is trained and evaluated with the tools from training_tools.py:

import torch
from torch import Tensor, nn, optim

from training_tools import fashionMNIST_loader, Trainer


class DenseBlock(nn.Module):
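    def __init__(self, in_channels: int, layers_num: int, growth_rate: int):
        """
        Dense block.

        A stack of "BatchNorm → ReLU → Conv" units (dense layers). Each dense layer adds k channels,
        and its output is concatenated with its input along the channel dimension.

        :param in_channels: number of channels in the input feature maps
        :param layers_num: number of dense layers to stack
        :param growth_rate: growth rate (k), the number of new channels each dense layer produces;
                            they are concatenated onto the outputs of the preceding layers
        """
        super().__init__()

        self.growth_rate = growth_rate

        # Create the requested number of dense layers; layer i receives in_channels + growth_rate * i channels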
        self.dense_layers = nn.ModuleList([
            self._get_dense_layer(in_channels + growth_rate * i)
            for i in range(layers_num)
        ])

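    def _get_dense_layer(self, connected_channels: int) -> nn.Sequential:
        """
        Build a single dense layer: BatchNorm → ReLU → Conv(3x3).

        :param connected_channels: input channels of this dense layer, i.e. the block's input channels
                                   plus growth_rate for every preceding dense layer
        """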
        dense_layer = nn.Sequential(
            nn.BatchNorm2d(connected_channels),
            nn.ReLU(),
            nn.Conv2d(connected_channels, self.growth_rate, kernel_size=3, padding=1, bias=False)
        )
        return dense_layer

    def forward(self, x: Tensor) -> Tensor:
        for layer in self.dense_layers:
            out = layer(x)
            x = torch.cat((x, out), dim=1)  # concatenate along the channel dimension
        return x


class TransitionLayer(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        """
        Transition layer.

        Reduces the channel count with a 1x1 convolution, then downsamples with 2x2 average pooling.

        :param in_channels: number of channels in the input feature maps
        :param out_channels: number of channels in the output feature maps
        """
        super().__init__()
        self.transition = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),  # 1x1 convolution: reduce channels
            nn.AvgPool2d(kernel_size=2, stride=2)  # 2x2 average pooling: downsample
        )

    def forward(self, x: Tensor) -> Tensor:
        return self.transition(x)


class DenseNet(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()

        self.model = nn.Sequential(
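            # Stem: 7x7 convolution and 3x3 max pooling, both with stride 2 (single-channel input)
            nn.Conv2d(in_channels=1, out_channels=64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),

            # Stage 1: 64 -> 192 channels, then halved to 96
            DenseBlock(in_channels=64, layers_num=4, growth_rate=32),
            TransitionLayer(in_channels=192, out_channels=96),

            # Stage 2: 96 -> 224 channels, then halved to 112
            DenseBlock(in_channels=96, layers_num=4, growth_rate=32),
            TransitionLayer(in_channels=224, out_channels=112),

            # Stage 3: 112 -> 240 channels, then halved to 120
            DenseBlock(in_channels=112, layers_num=4, growth_rate=32),
            TransitionLayer(in_channels=240, out_channels=120),

            # Stage 4: 120 -> 248 channels, no transition afterwards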
            DenseBlock(in_channels=120, layers_num=4, growth_rate=32),

            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_features=248, out_features=num_classes)
        )

        self._initialize_weights()

    def _initialize_weights(self):
        # Kaiming initialization for convolutions, Xavier initialization for the classifier
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)

    def forward(self, x: Tensor) -> Tensor:
        return self.model(x)


if __name__ == '__main__':
    BATCH_SIZE = 256
    EPOCHS_NUM = 30
    LEARNING_RATE = 0.1

    model = DenseNet(num_classes=10)
    train_loader, test_loader = fashionMNIST_loader(BATCH_SIZE, resize=96)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), LEARNING_RATE)
    platform = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    with Trainer(model, train_loader, test_loader, criterion, optimizer, platform) as trainer:
        trainer.train(EPOCHS_NUM)
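
Because the training script feeds smaller images than the 224×224 input used in the model summary, it is worth checking how small the feature maps get. The sketch below applies the standard output-size formula for convolution and pooling, and assumes that `resize=96` yields 96×96 inputs; the helper names are hypothetical.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output size of a convolution or pooling window along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def spatial_sizes(input_size: int) -> list[int]:
    """Feature-map side length after the stem and after each of the three transition layers."""
    sizes = []
    size = conv_out(input_size, kernel=7, stride=2, padding=3)  # 7x7 stem convolution
    size = conv_out(size, kernel=3, stride=2, padding=1)        # 3x3 max pooling
    sizes.append(size)
    for _ in range(3):                                          # 2x2 average pooling per transition
        size = conv_out(size, kernel=2, stride=2)
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    print(spatial_sizes(224))  # [56, 28, 14, 7]
    print(spatial_sizes(96))   # [24, 12, 6, 3]
```

With 96×96 inputs the last dense block operates on 3×3 feature maps, which the global average pooling then reduces to a single value per channel.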

2.1 Detailed Training Record

Full DenseNet training log:

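Epoch 001/30, train loss: 0.5371, train accuracy: 80.44%, test loss: 0.6133, test accuracy: 79.65%
Epoch 002/30, train loss: 0.3150, train accuracy: 88.48%, test loss: 0.4794, test accuracy: 81.66%
Epoch 003/30, train loss: 0.2608, train accuracy: 90.45%, test loss: 0.2806, test accuracy: 89.82%
Epoch 004/30, train loss: 0.2315, train accuracy: 91.45%, test loss: 0.3462, test accuracy: 88.04%
Epoch 005/30, train loss: 0.2094, train accuracy: 92.28%, test loss: 1.0428, test accuracy: 75.68%
Epoch 006/30, train loss: 0.1932, train accuracy: 92.86%, test loss: 0.3108, test accuracy: 89.00%
Epoch 007/30, train loss: 0.1772, train accuracy: 93.53%, test loss: 0.3008, test accuracy: 89.49%
Epoch 008/30, train loss: 0.1638, train accuracy: 93.85%, test loss: 0.2650, test accuracy: 90.43%
Epoch 009/30, train loss: 0.1519, train accuracy: 94.37%, test loss: 1.1470, test accuracy: 71.91%
Epoch 010/30, train loss: 0.1374, train accuracy: 94.93%, test loss: 0.3286, test accuracy: 89.27%
Epoch 011/30, train loss: 0.1287, train accuracy: 95.17%, test loss: 0.5050, test accuracy: 84.50%
Epoch 012/30, train loss: 0.1170, train accuracy: 95.71%, test loss: 0.4150, test accuracy: 86.79%
Epoch 013/30, train loss: 0.1054, train accuracy: 96.11%, test loss: 0.6745, test accuracy: 82.44%
Epoch 014/30, train loss: 0.0994, train accuracy: 96.36%, test loss: 0.4001, test accuracy: 88.26%
Epoch 015/30, train loss: 0.0933, train accuracy: 96.57%, test loss: 0.3560, test accuracy: 90.23%
Epoch 016/30, train loss: 0.0814, train accuracy: 97.05%, test loss: 0.5095, test accuracy: 86.09%
Epoch 017/30, train loss: 0.0744, train accuracy: 97.34%, test loss: 0.2882, test accuracy: 91.84%
Epoch 018/30, train loss: 0.0678, train accuracy: 97.56%, test loss: 0.3881, test accuracy: 89.55%
Epoch 019/30, train loss: 0.0593, train accuracy: 97.85%, test loss: 0.4351, test accuracy: 89.45%
Epoch 020/30, train loss: 0.0600, train accuracy: 97.88%, test loss: 0.4277, test accuracy: 89.03%
Epoch 021/30, train loss: 0.0504, train accuracy: 98.23%, test loss: 0.3292, test accuracy: 91.43%
Epoch 022/30, train loss: 0.0447, train accuracy: 98.41%, test loss: 0.5910, test accuracy: 86.86%
Epoch 023/30, train loss: 0.0472, train accuracy: 98.35%, test loss: 0.6230, test accuracy: 84.13%
Epoch 024/30, train loss: 0.0420, train accuracy: 98.54%, test loss: 0.9643, test accuracy: 82.98%
Epoch 025/30, train loss: 0.0312, train accuracy: 98.92%, test loss: 0.8418, test accuracy: 83.77%
Epoch 026/30, train loss: 0.0393, train accuracy: 98.57%, test loss: 0.3783, test accuracy: 91.49%
Epoch 027/30, train loss: 0.0326, train accuracy: 98.83%, test loss: 0.4318, test accuracy: 90.47%
Epoch 028/30, train loss: 0.0258, train accuracy: 99.11%, test loss: 0.4270, test accuracy: 90.74%
Epoch 029/30, train loss: 0.0168, train accuracy: 99.42%, test loss: 0.4315, test accuracy: 91.65%
Epoch 030/30, train loss: 0.0346, train accuracy: 98.99%, test loss: 0.6006, test accuracy: 88.90%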
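
A few summary numbers make the log easier to read. The sketch below copies the per-epoch test accuracies and the final training accuracy from the log by hand and computes the best epoch, the final train/test gap and the largest drop between consecutive epochs; the variable names are illustrative.

```python
# Test accuracy (%) per epoch, copied from the training log above.
test_acc = [
    79.65, 81.66, 89.82, 88.04, 75.68, 89.00, 89.49, 90.43, 71.91, 89.27,
    84.50, 86.79, 82.44, 88.26, 90.23, 86.09, 91.84, 89.55, 89.45, 89.03,
    91.43, 86.86, 84.13, 82.98, 83.77, 91.49, 90.47, 90.74, 91.65, 88.90,
]
final_train_acc = 98.99  # training accuracy (%) in epoch 30

# Epoch (1-based) with the highest test accuracy.
best_epoch = max(range(len(test_acc)), key=lambda i: test_acc[i]) + 1
best_acc = test_acc[best_epoch - 1]

# Gap between training and test accuracy at the end of training.
final_gap = round(final_train_acc - test_acc[-1], 2)

# Largest drop in test accuracy from one epoch to the next.
largest_drop = round(max(test_acc[i - 1] - test_acc[i] for i in range(1, len(test_acc))), 2)

print(f"best test accuracy: {best_acc}% (epoch {best_epoch})")
print(f"final train/test gap: {final_gap} points")
print(f"largest epoch-to-epoch drop: {largest_drop} points")
```

Test accuracy peaks at 91.84% in epoch 17 while training accuracy keeps climbing to about 99%, which is consistent with overfitting. The log alone does not establish what causes the sharp epoch-to-epoch swings, such as the drop to 71.91% in epoch 9.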