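This article is based on material from the d2l project. It introduces GoogLeNet, the network that stood out in the ImageNet challenge, covering its core Inception block design and the idea of parallel connections.

Evolution of CNN architectures:

LeNet first brought convolution into computer vision. Later networks such as AlexNet, VGG, and NiN not only grew deeper and more complex, but also kept exploring which window size (convolution kernel size) suits ImageNet data best. Partly inspired by Inception, the science-fiction action thriller released by Warner Bros. in 2010, GoogLeNet emerged in 2014. At comparable accuracy and with lower computational complexity, it became one of the most effective models in the ImageNet image recognition challenge.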

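Two design ideas

We Need To Go Deeper: improve performance through a deeper network structure, borrowing from the NiN design

全面启动 (the film's title in Taiwan): combine several convolution kernel sizes in parallel (1×1, 3×3, and 5×5)

Dream within a dream: capture features at multiple levels by going deeper layer by layer

Influence of the film Inception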

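On the one hand, GoogLeNet borrows the film's line "We Need To Go Deeper" to stress that a deeper network structure can further improve performance. The implementation draws on NiN.

On the other hand, the film's title in Taiwan, 全面启动 (roughly "full-scale launch"), arguably describes GoogLeNet even better: the network combines convolution kernels of several sizes in parallel (1×1, 3×3, and 5×5) to capture features at multiple scales. This echoes the film's "dream within a dream" idea of descending layer by layer.

Figure: poster of the film Inception

Note on the simplified implementation:

As practice and frameworks have matured, the version presented here drops features of the original GoogLeNet that existed only to stabilize training and are no longer necessary, which simplifies the implementation.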

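1. GoogLeNet architecture design

1.1 The core idea of the Inception block

Design rationale:

The Inception block is the key to the parallel connections. To capture image features at different scales, it runs four different combinations of convolution windows in parallel and concatenates their outputs along the channel dimension.

1.2 The four parallel paths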

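1×1 convolution

Mixes information across channels and reduces the channel count
The cheapest path to compute

1×1 convolution → 3×3 convolution

First reduces the channel count to cut computation
Then extracts features over a larger spatial window

1×1 convolution → 5×5 convolution

First reduces the channel count to cut computation
Then extracts features over an even larger spatial window

3×3 max pooling → 1×1 convolution

First applies 3×3 max pooling with stride 1 and padding 1, which keeps the spatial size while retaining the strongest local responses
Then uses a 1×1 convolution to set the number of output channels before concatenation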
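
To make the "reduce channels first" argument concrete, the sketch below counts the parameters of a 5×5 convolution applied directly versus after a 1×1 reduction. The helper name conv_params is introduced here for illustration only. The channel numbers (192 in, 16 after the 1×1 layer, 32 out) are those of the third path of the first Inception block used later in this article.

```python
def conv_params(in_ch: int, out_ch: int, k: int) -> int:
    """Number of weights plus biases in a k×k convolution layer."""
    return in_ch * out_ch * k * k + out_ch


# 5×5 convolution applied directly to all 192 input channels
direct = conv_params(192, 32, 5)

# 1×1 reduction to 16 channels, followed by the 5×5 convolution
reduced = conv_params(192, 16, 1) + conv_params(16, 32, 5)

print(f"direct 5x5:        {direct:,} parameters")
print(f"1x1 then 5x5:      {reduced:,} parameters")
print(f"reduction factor:  {direct / reduced:.2f}")
```

Under these assumptions the reduced path needs 15,920 parameters instead of 153,632. The same ratio applies to the multiply-add count, because both layers run over the same spatial positions.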

Figure: structure of the Inception block

1.3 PyTorch implementation of the Inception block

The PyTorch implementation of the Inception block, IncepBlock, is as follows:

from typing import Tuple

import torch
from torch import nn


class IncepBlock(nn.Module):
    def __init__(self, in_channels: int, c1_out: int, c2_out: Tuple[int, int], c3_out: Tuple[int, int], c4_out: int):
        super().__init__()

        self.channel1 = nn.Sequential(  # Path 1: 1×1 convolution
            nn.Conv2d(in_channels, c1_out, kernel_size=1), nn.ReLU()
        )

        self.channel2 = nn.Sequential(  # Path 2: 1×1 convolution -> 3×3 convolution
            nn.Conv2d(in_channels, c2_out[0], kernel_size=1), nn.ReLU(),
            nn.Conv2d(c2_out[0], c2_out[1], kernel_size=3, padding=1), nn.ReLU()
        )

        self.channel3 = nn.Sequential(  # Path 3: 1×1 convolution -> 5×5 convolution
            nn.Conv2d(in_channels, c3_out[0], kernel_size=1), nn.ReLU(),
            nn.Conv2d(c3_out[0], c3_out[1], kernel_size=5, padding=2), nn.ReLU()
        )

        self.channel4 = nn.Sequential(  # Path 4: 3×3 max pooling -> 1×1 convolution
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, c4_out, kernel_size=1), nn.ReLU()
        )

    def forward(self, x):
        output1 = self.channel1(x)
        output2 = self.channel2(x)
        output3 = self.channel3(x)
        output4 = self.channel4(x)
        output = torch.cat([output1, output2, output3, output4], dim=1)
        return output
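
The padding values in IncepBlock matter because concatenating along the channel dimension requires all four outputs to share the same height and width. The following is a minimal sketch of the standard output-size formula, floor((n + 2p - k) / s) + 1, applied to the four paths. The helper name conv_out_size and the 12×12 input size are chosen here for illustration.

```python
def conv_out_size(n: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (n + 2 * padding - kernel) // stride + 1


n = 12  # illustrative input height/width
sizes = {
    "path 1: 1x1 conv": conv_out_size(n, kernel=1),
    "path 2: 3x3 conv, padding=1": conv_out_size(n, kernel=3, padding=1),
    "path 3: 5x5 conv, padding=2": conv_out_size(n, kernel=5, padding=2),
    "path 4: 3x3 max pool, stride=1, padding=1": conv_out_size(n, kernel=3, stride=1, padding=1),
}

for name, size in sizes.items():
    print(f"{name}: {n} -> {size}")
```

Every path maps 12 to 12, so only the channel counts differ between the four tensors passed to torch.cat.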

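2. The complete GoogLeNet architecture

2.1 Overall design

Architecture highlights:

Before reaching the Inception blocks, GoogLeNet first passes the input through a series of layers that gradually extract features and shrink the spatial dimensions: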

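A 7×7 convolution captures features with a large receptive field, followed by max pooling for downsampling

A 1×1 convolution fuses features across channels (in the implementation here it keeps the channel count at 64)

A 3×3 convolution extracts finer-grained features, followed by max pooling for downsampling

There are 9 Inception blocks in three groups of 2, 5, and 2, with a max pooling layer between groups to reduce the spatial size

Finally, a global average pooling layer and a fully connected layer produce the output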

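2.2 Adapting the design to Fashion-MNIST

Dataset adaptation:

To keep testing GoogLeNet on the Fashion-MNIST dataset, the images are resized to 96×96 (rather than the 224×224 used for ImageNet) to reduce computation. Reusing the Inception block, the PyTorch implementation of GoogLeNet is as follows:

Figure: complete GoogLeNet architecture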

from typing import Tuple

import torch
from torch import nn, Tensor


class IncepBlock(nn.Module):
    def __init__(self, in_channels: int, c1_out: int, c2_out: Tuple[int, int], c3_out: Tuple[int, int], c4_out: int):
        super().__init__()

        self.channel1 = nn.Sequential(  # Path 1: 1×1 convolution
            nn.Conv2d(in_channels, c1_out, kernel_size=1), nn.ReLU()
        )

        self.channel2 = nn.Sequential(  # Path 2: 1×1 convolution -> 3×3 convolution
            nn.Conv2d(in_channels, c2_out[0], kernel_size=1), nn.ReLU(),
            nn.Conv2d(c2_out[0], c2_out[1], kernel_size=3, padding=1), nn.ReLU()
        )

        self.channel3 = nn.Sequential(  # Path 3: 1×1 convolution -> 5×5 convolution
            nn.Conv2d(in_channels, c3_out[0], kernel_size=1), nn.ReLU(),
            nn.Conv2d(c3_out[0], c3_out[1], kernel_size=5, padding=2), nn.ReLU()
        )

        self.channel4 = nn.Sequential(  # Path 4: 3×3 max pooling -> 1×1 convolution
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, c4_out, kernel_size=1), nn.ReLU()
        )

    def forward(self, x) -> Tensor:
        output1 = self.channel1(x)
        output2 = self.channel2(x)
        output3 = self.channel3(x)
        output4 = self.channel4(x)
        output = torch.cat([output1, output2, output3, output4], dim=1)
        return output


class GoogLeNet(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),

            nn.Conv2d(in_channels=64, out_channels=64, kernel_size=1), nn.ReLU(),
            nn.Conv2d(in_channels=64, out_channels=192, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),

            IncepBlock(in_channels=192, c1_out=64, c2_out=(96, 128), c3_out=(16, 32), c4_out=32),
            IncepBlock(in_channels=256, c1_out=128, c2_out=(128, 192), c3_out=(32, 96), c4_out=64),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),

            IncepBlock(in_channels=480, c1_out=192, c2_out=(96, 208), c3_out=(16, 48), c4_out=64),
            IncepBlock(in_channels=512, c1_out=160, c2_out=(112, 224), c3_out=(24, 64), c4_out=64),
            IncepBlock(in_channels=512, c1_out=128, c2_out=(128, 256), c3_out=(24, 64), c4_out=64),
            IncepBlock(in_channels=512, c1_out=112, c2_out=(144, 288), c3_out=(32, 64), c4_out=64),
            IncepBlock(in_channels=528, c1_out=256, c2_out=(160, 320), c3_out=(32, 128), c4_out=128),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),

            IncepBlock(in_channels=832, c1_out=256, c2_out=(160, 320), c3_out=(32, 128), c4_out=128),
            IncepBlock(in_channels=832, c1_out=384, c2_out=(192, 384), c3_out=(48, 128), c4_out=128),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_features=1024, out_features=num_classes)
        )

        self._initialize_weights()

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
                if m.bias is not None: nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None: nn.init.constant_(m.bias, 0)

    def forward(self, x) -> Tensor:
        return self.model(x)
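
Because each Inception block concatenates its four paths, its output channel count is c1_out + c2_out[1] + c3_out[1] + c4_out, and that sum has to equal the in_channels of the next block. The sketch below cross-checks the nine configurations used in GoogLeNet above in plain Python. The names blocks and out_channels are introduced here for illustration.

```python
from typing import Tuple

# (in_channels, c1_out, c2_out, c3_out, c4_out), copied from the GoogLeNet definition
blocks = [
    (192, 64, (96, 128), (16, 32), 32),
    (256, 128, (128, 192), (32, 96), 64),
    (480, 192, (96, 208), (16, 48), 64),
    (512, 160, (112, 224), (24, 64), 64),
    (512, 128, (128, 256), (24, 64), 64),
    (512, 112, (144, 288), (32, 64), 64),
    (528, 256, (160, 320), (32, 128), 128),
    (832, 256, (160, 320), (32, 128), 128),
    (832, 384, (192, 384), (48, 128), 128),
]


def out_channels(c1: int, c2: Tuple[int, int], c3: Tuple[int, int], c4: int) -> int:
    """Channels after concatenating the four paths along dim=1."""
    return c1 + c2[1] + c3[1] + c4


outs = [out_channels(*cfg[1:]) for cfg in blocks]

# Each block's output must match the next block's in_channels.
# Max pooling between groups changes only the spatial size, not the channels.
for i in range(len(blocks) - 1):
    assert outs[i] == blocks[i + 1][0], f"mismatch after block {i + 1}"

print(outs)
```

The last value, 1024, is the in_features of the final nn.Linear layer, since global average pooling reduces each channel to a single number.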

2.3 Analyzing the network structure

Use the summary function from the torchinfo library to check the output dimensions:

from torchinfo import summary

model = GoogLeNet(num_classes=10)
summary(model, input_size=(1, 1, 96, 96))
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
GoogLeNet                                [1, 10]                   --
├─Sequential: 1-1                        [1, 10]                   --
│    └─Conv2d: 2-1                       [1, 64, 48, 48]           3,200
│    └─ReLU: 2-2                         [1, 64, 48, 48]           --
│    └─MaxPool2d: 2-3                    [1, 64, 24, 24]           --
│    └─Conv2d: 2-4                       [1, 64, 24, 24]           4,160
│    └─ReLU: 2-5                         [1, 64, 24, 24]           --
│    └─Conv2d: 2-6                       [1, 192, 24, 24]          110,784
│    └─ReLU: 2-7                         [1, 192, 24, 24]          --
│    └─MaxPool2d: 2-8                    [1, 192, 12, 12]          --
│    └─IncepBlock: 2-9                   [1, 256, 12, 12]          --
│    │    └─Sequential: 3-1              [1, 64, 12, 12]           12,352
│    │    └─Sequential: 3-2              [1, 128, 12, 12]          129,248
│    │    └─Sequential: 3-3              [1, 32, 12, 12]           15,920
│    │    └─Sequential: 3-4              [1, 32, 12, 12]           6,176
│    └─IncepBlock: 2-10                  [1, 480, 12, 12]          --
│    │    └─Sequential: 3-5              [1, 128, 12, 12]          32,896
│    │    └─Sequential: 3-6              [1, 192, 12, 12]          254,272
│    │    └─Sequential: 3-7              [1, 96, 12, 12]           85,120
│    │    └─Sequential: 3-8              [1, 64, 12, 12]           16,448
│    └─MaxPool2d: 2-11                   [1, 480, 6, 6]            --
│    └─IncepBlock: 2-12                  [1, 512, 6, 6]            --
│    │    └─Sequential: 3-9              [1, 192, 6, 6]            92,352
│    │    └─Sequential: 3-10             [1, 208, 6, 6]            226,096
│    │    └─Sequential: 3-11             [1, 48, 6, 6]             26,944
│    │    └─Sequential: 3-12             [1, 64, 6, 6]             30,784
│    └─IncepBlock: 2-13                  [1, 512, 6, 6]            --
│    │    └─Sequential: 3-13             [1, 160, 6, 6]            82,080
│    │    └─Sequential: 3-14             [1, 224, 6, 6]            283,472
│    │    └─Sequential: 3-15             [1, 64, 6, 6]             50,776
│    │    └─Sequential: 3-16             [1, 64, 6, 6]             32,832
│    └─IncepBlock: 2-14                  [1, 512, 6, 6]            --
│    │    └─Sequential: 3-17             [1, 128, 6, 6]            65,664
│    │    └─Sequential: 3-18             [1, 256, 6, 6]            360,832
│    │    └─Sequential: 3-19             [1, 64, 6, 6]             50,776
│    │    └─Sequential: 3-20             [1, 64, 6, 6]             32,832
│    └─IncepBlock: 2-15                  [1, 528, 6, 6]            --
│    │    └─Sequential: 3-21             [1, 112, 6, 6]            57,456
│    │    └─Sequential: 3-22             [1, 288, 6, 6]            447,408
│    │    └─Sequential: 3-23             [1, 64, 6, 6]             67,680
│    │    └─Sequential: 3-24             [1, 64, 6, 6]             32,832
│    └─IncepBlock: 2-16                  [1, 832, 6, 6]            --
│    │    └─Sequential: 3-25             [1, 256, 6, 6]            135,424
│    │    └─Sequential: 3-26             [1, 320, 6, 6]            545,760
│    │    └─Sequential: 3-27             [1, 128, 6, 6]            119,456
│    │    └─Sequential: 3-28             [1, 128, 6, 6]            67,712
│    └─MaxPool2d: 2-17                   [1, 832, 3, 3]            --
│    └─IncepBlock: 2-18                  [1, 832, 3, 3]            --
│    │    └─Sequential: 3-29             [1, 256, 3, 3]            213,248
│    │    └─Sequential: 3-30             [1, 320, 3, 3]            594,400
│    │    └─Sequential: 3-31             [1, 128, 3, 3]            129,184
│    │    └─Sequential: 3-32             [1, 128, 3, 3]            106,624
│    └─IncepBlock: 2-19                  [1, 1024, 3, 3]           --
│    │    └─Sequential: 3-33             [1, 384, 3, 3]            319,872
│    │    └─Sequential: 3-34             [1, 384, 3, 3]            823,872
│    │    └─Sequential: 3-35             [1, 128, 3, 3]            193,712
│    │    └─Sequential: 3-36             [1, 128, 3, 3]            106,624
│    └─AdaptiveAvgPool2d: 2-20           [1, 1024, 1, 1]           --
│    └─Flatten: 2-21                     [1, 1024]                 --
│    └─Linear: 2-22                      [1, 10]                   10,250
==========================================================================================
Total params: 5,977,530
Trainable params: 5,977,530
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 276.66
==========================================================================================
Input size (MB): 0.04
Forward/backward pass size (MB): 4.74
Params size (MB): 23.91
Estimated Total Size (MB): 28.69
==========================================================================================
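
The spatial sizes in the summary (96 → 48 → 24 → 12 → 6 → 3) can be reproduced by hand. As a rough check, the sketch below applies the output-size formula to the five stride-2 layers of the model. All other layers use stride 1 with size-preserving padding and are therefore skipped. The helper name out_size is an illustrative choice.

```python
def out_size(n: int, kernel: int, stride: int, padding: int) -> int:
    """floor((n + 2 * padding - kernel) / stride) + 1"""
    return (n + 2 * padding - kernel) // stride + 1


# (layer, kernel_size, stride, padding) for every stride-2 layer, in order
downsampling_layers = [
    ("7x7 conv", 7, 2, 3),
    ("max pool after stem conv", 3, 2, 1),
    ("max pool before Inception group 1", 3, 2, 1),
    ("max pool before Inception group 2", 3, 2, 1),
    ("max pool before Inception group 3", 3, 2, 1),
]

size = 96  # input height/width used in this article
trace = [size]
for name, k, s, p in downsampling_layers:
    size = out_size(size, k, s, p)
    trace.append(size)
    print(f"{name}: {size}x{size}")
```

The final 3×3 feature map is what AdaptiveAvgPool2d(1) averages down to 1×1 per channel.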

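Network statistics:

Total parameters: 5,977,530 (about 5.98 million)
Computational cost: about 276.66 million multiply-add operations for one forward pass on a single 96×96 input (torchinfo labels the unit as Units.MEGABYTES here, but the figure counts operations, not bytes)
Memory footprint: an estimated 28.69 MB in total
Takeaway: the parallel design of the Inception block keeps the parameter count under control while maintaining strong performance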
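
The size figures in the summary follow from simple arithmetic. The sketch below assumes 32-bit floats (4 bytes each) and 1 MB = 10^6 bytes. Both are assumptions made here, chosen because they reproduce the reported numbers. The forward/backward figure is copied from the summary rather than recomputed.

```python
BYTES_PER_FLOAT32 = 4   # assumption: parameters and inputs stored as float32
MB = 1e6                # assumption: decimal megabytes

total_params = 5_977_530
params_mb = total_params * BYTES_PER_FLOAT32 / MB

# One grayscale 96×96 image: batch=1, channels=1
input_mb = 1 * 1 * 96 * 96 * BYTES_PER_FLOAT32 / MB

forward_backward_mb = 4.74  # taken from the summary output, not recomputed

total_mb = input_mb + forward_backward_mb + params_mb

print(f"params: {params_mb:.2f} MB")
print(f"input:  {input_mb:.2f} MB")
print(f"total:  {total_mb:.2f} MB")
```

Parameters account for most of the estimated footprint at batch size 1. The activation term grows with the batch size, while the parameter term does not.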

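3. Model training and evaluation

3.1 Training configuration and implementation

As before, use the tools in training_tools.py to train and evaluate the model: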

from typing import Tuple

import torch
from torch import nn, Tensor, optim

from training_tools import fashionMNIST_loader, Trainer


class IncepBlock(nn.Module):
    def __init__(self, in_channels: int, c1_out: int, c2_out: Tuple[int, int], c3_out: Tuple[int, int], c4_out: int):
        super().__init__()

        self.channel1 = nn.Sequential(  # Path 1: 1×1 convolution
            nn.Conv2d(in_channels, c1_out, kernel_size=1), nn.ReLU()
        )

        self.channel2 = nn.Sequential(  # Path 2: 1×1 convolution -> 3×3 convolution
            nn.Conv2d(in_channels, c2_out[0], kernel_size=1), nn.ReLU(),
            nn.Conv2d(c2_out[0], c2_out[1], kernel_size=3, padding=1), nn.ReLU()
        )

        self.channel3 = nn.Sequential(  # Path 3: 1×1 convolution -> 5×5 convolution
            nn.Conv2d(in_channels, c3_out[0], kernel_size=1), nn.ReLU(),
            nn.Conv2d(c3_out[0], c3_out[1], kernel_size=5, padding=2), nn.ReLU()
        )

        self.channel4 = nn.Sequential(  # Path 4: 3×3 max pooling -> 1×1 convolution
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, c4_out, kernel_size=1), nn.ReLU()
        )

    def forward(self, x) -> Tensor:
        output1 = self.channel1(x)
        output2 = self.channel2(x)
        output3 = self.channel3(x)
        output4 = self.channel4(x)
        output = torch.cat([output1, output2, output3, output4], dim=1)
        return output


class GoogLeNet(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),

            nn.Conv2d(in_channels=64, out_channels=64, kernel_size=1), nn.ReLU(),
            nn.Conv2d(in_channels=64, out_channels=192, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),

            IncepBlock(in_channels=192, c1_out=64, c2_out=(96, 128), c3_out=(16, 32), c4_out=32),
            IncepBlock(in_channels=256, c1_out=128, c2_out=(128, 192), c3_out=(32, 96), c4_out=64),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),

            IncepBlock(in_channels=480, c1_out=192, c2_out=(96, 208), c3_out=(16, 48), c4_out=64),
            IncepBlock(in_channels=512, c1_out=160, c2_out=(112, 224), c3_out=(24, 64), c4_out=64),
            IncepBlock(in_channels=512, c1_out=128, c2_out=(128, 256), c3_out=(24, 64), c4_out=64),
            IncepBlock(in_channels=512, c1_out=112, c2_out=(144, 288), c3_out=(32, 64), c4_out=64),
            IncepBlock(in_channels=528, c1_out=256, c2_out=(160, 320), c3_out=(32, 128), c4_out=128),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),

            IncepBlock(in_channels=832, c1_out=256, c2_out=(160, 320), c3_out=(32, 128), c4_out=128),
            IncepBlock(in_channels=832, c1_out=384, c2_out=(192, 384), c3_out=(48, 128), c4_out=128),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_features=1024, out_features=num_classes)
        )

        self._initialize_weights()

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
                if m.bias is not None: nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None: nn.init.constant_(m.bias, 0)

    def forward(self, x) -> Tensor:
        return self.model(x)


if __name__ == '__main__':
    BATCH_SIZE = 128
    EPOCHS_NUM = 30
    LEARNING_RATE = 0.005

    model = GoogLeNet(num_classes=10)
    train_loader, test_loader = fashionMNIST_loader(BATCH_SIZE, resize=96)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), LEARNING_RATE)
    platform = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    with Trainer(model, train_loader, test_loader, criterion, optimizer, platform) as trainer:
        trainer.train(EPOCHS_NUM)

3.2 Training results and analysis

Full training log:

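Epoch 001/30, train loss: 1.0233, train accuracy: 64.77%, test loss: 0.5896, test accuracy: 78.30%
Epoch 002/30, train loss: 0.5145, train accuracy: 81.14%, test loss: 0.4663, test accuracy: 83.26%
Epoch 003/30, train loss: 0.4330, train accuracy: 84.10%, test loss: 0.4222, test accuracy: 84.27%
Epoch 004/30, train loss: 0.3915, train accuracy: 85.55%, test loss: 0.3902, test accuracy: 85.33%
Epoch 005/30, train loss: 0.3597, train accuracy: 86.72%, test loss: 0.6181, test accuracy: 78.45%
Epoch 006/30, train loss: 0.3360, train accuracy: 87.59%, test loss: 0.4024, test accuracy: 85.58%
Epoch 007/30, train loss: 0.3194, train accuracy: 88.19%, test loss: 0.3629, test accuracy: 86.59%
Epoch 008/30, train loss: 0.3041, train accuracy: 88.78%, test loss: 0.3193, test accuracy: 88.33%
Epoch 009/30, train loss: 0.2902, train accuracy: 89.14%, test loss: 0.3558, test accuracy: 86.62%
Epoch 010/30, train loss: 0.2797, train accuracy: 89.57%, test loss: 0.3258, test accuracy: 88.02%
Epoch 011/30, train loss: 0.2684, train accuracy: 90.09%, test loss: 0.2906, test accuracy: 89.48%
Epoch 012/30, train loss: 0.2612, train accuracy: 90.34%, test loss: 0.3176, test accuracy: 88.67%
Epoch 013/30, train loss: 0.2493, train accuracy: 90.71%, test loss: 0.2911, test accuracy: 89.44%
Epoch 014/30, train loss: 0.2429, train accuracy: 90.96%, test loss: 0.3492, test accuracy: 87.41%
Epoch 015/30, train loss: 0.2351, train accuracy: 91.34%, test loss: 0.3176, test accuracy: 88.10%
Epoch 016/30, train loss: 0.2292, train accuracy: 91.44%, test loss: 0.2931, test accuracy: 88.95%
Epoch 017/30, train loss: 0.2221, train accuracy: 91.71%, test loss: 0.3761, test accuracy: 86.24%
Epoch 018/30, train loss: 0.2123, train accuracy: 92.17%, test loss: 0.2816, test accuracy: 89.70%
Epoch 019/30, train loss: 0.2087, train accuracy: 92.14%, test loss: 0.3294, test accuracy: 88.39%
Epoch 020/30, train loss: 0.2000, train accuracy: 92.52%, test loss: 0.2823, test accuracy: 89.97%
Epoch 021/30, train loss: 0.1973, train accuracy: 92.78%, test loss: 0.2764, test accuracy: 90.12%
Epoch 022/30, train loss: 0.1918, train accuracy: 92.79%, test loss: 0.2800, test accuracy: 89.67%
Epoch 023/30, train loss: 0.1846, train accuracy: 93.20%, test loss: 0.2640, test accuracy: 90.43%
Epoch 024/30, train loss: 0.1796, train accuracy: 93.38%, test loss: 0.2875, test accuracy: 89.47%
Epoch 025/30, train loss: 0.1744, train accuracy: 93.61%, test loss: 0.2566, test accuracy: 90.57%
Epoch 026/30, train loss: 0.1676, train accuracy: 93.80%, test loss: 0.2848, test accuracy: 89.85%
Epoch 027/30, train loss: 0.1627, train accuracy: 94.03%, test loss: 0.2633, test accuracy: 90.86%
Epoch 028/30, train loss: 0.1585, train accuracy: 94.17%, test loss: 0.2793, test accuracy: 90.02%
Epoch 029/30, train loss: 0.1545, train accuracy: 94.38%, test loss: 0.2631, test accuracy: 90.83%
Epoch 030/30, train loss: 0.1463, train accuracy: 94.52%, test loss: 0.2876, test accuracy: 90.21%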
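
The log above can be summarized with a few lines of Python. The sketch below hard-codes the per-epoch test accuracies and the final training accuracy from the log and reports the best epoch and the final train/test gap. The variable names are illustrative, and nothing here is part of training_tools.py.

```python
# Test accuracy (%) for epochs 1..30, copied from the training log
test_acc = [
    78.30, 83.26, 84.27, 85.33, 78.45, 85.58, 86.59, 88.33, 86.62, 88.02,
    89.48, 88.67, 89.44, 87.41, 88.10, 88.95, 86.24, 89.70, 88.39, 89.97,
    90.12, 89.67, 90.43, 89.47, 90.57, 89.85, 90.86, 90.02, 90.83, 90.21,
]
final_train_acc = 94.52  # training accuracy (%) at epoch 30

# Epochs are 1-based in the log, list indices are 0-based
best_epoch = max(range(len(test_acc)), key=lambda i: test_acc[i]) + 1
best_test_acc = test_acc[best_epoch - 1]
final_gap = final_train_acc - test_acc[-1]

print(f"best test accuracy: {best_test_acc:.2f}% at epoch {best_epoch}")
print(f"final train/test gap: {final_gap:.2f} percentage points")
```

In this run, test accuracy peaks at 90.86% in epoch 27 and fluctuates around 90% afterwards while training accuracy keeps rising. The final gap of about 4.3 percentage points may indicate mild overfitting, although a single run is not enough to say so with confidence.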