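This article is based on material from the d2l project. It covers the core concepts and practical techniques of parameter management in deep learning: parameter access, initialization, and sharing.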

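1. Why parameter management matters

So far we have focused only on training models. In practice, more situations have to be considered.

Scenarios for parameter management:

Model reuse: reusing a trained model in another environment
Model analysis: analyzing, interpreting, and improving model performance
Transparency: keeping the model transparent and trustworthy
Parameter saving: extracting parameters from a model and saving them for deployment

These tasks involve accessing and visualizing parameters, sharing parameters, and initializing parameters.

1.1 Example model

We use an MLP with a single hidden layer to demonstrate parameter management: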

from torch import nn

mlp = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

2. Parameter access

2.1 Accessing all parameters

Using the parameters() method

The parameters() method returns an iterator over all of the model's parameters:

for param in mlp.parameters():
    print(param)

Output:

Parameter containing:
tensor([[ 0.1416, -0.2652,  0.0283, -0.4912],
        [ 0.2108,  0.0797, -0.1839,  0.2387],
        [-0.1042,  0.2752, -0.2129, -0.1715],
        [-0.4608,  0.3681,  0.0611, -0.0767],
        [-0.4061,  0.3216, -0.2284,  0.2368],
        [ 0.4824, -0.0715, -0.2320, -0.4492],
        [ 0.1156, -0.2577, -0.0868,  0.2522],
        [ 0.0710, -0.1287, -0.1870,  0.0530]], requires_grad=True)
Parameter containing:
tensor([-0.2088,  0.2378,  0.2101,  0.1412,  0.1108,  0.3565,  0.0615,  0.2228],
       requires_grad=True)
Parameter containing:
tensor([[-0.0041, -0.3456,  0.3524,  0.1401, -0.0913,  0.1707,  0.2127, -0.1229]],
       requires_grad=True)
Parameter containing:
tensor([-0.3494], requires_grad=True)

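Important note:

Here mlp is built with nn.Sequential, a container that holds no parameters of its own; all parameters live in its submodules. With recurse=False, parameters() therefore returns an empty iterator and nothing is printed.

Arguments

Optional argument of parameters():

recurse: bool = True: whether to also yield the parameters of submodules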
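
The effect of recurse can be checked directly. The following is a minimal sketch that reuses the example MLP from above; the variable names own_params, linear_params, and all_params are introduced here for illustration only.

```python
from torch import nn

mlp = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# The Sequential container owns no parameters itself, so nothing is yielded.
own_params = list(mlp.parameters(recurse=False))

# A Linear layer owns its weight and bias directly, so both are yielded.
linear_params = list(mlp[0].parameters(recurse=False))

# The default (recurse=True) walks into the submodules.
all_params = list(mlp.parameters())

print(len(own_params), len(linear_params), len(all_params))  # 0 2 4
```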

2.2 Accessing all parameters by name

Using the named_parameters() method

The named_parameters() method yields every parameter of the model together with its name:

for name, param in mlp.named_parameters():
    print(f'{name} = {param}')

Output:

0.weight = Parameter containing:
tensor([[-0.4334, -0.1572, -0.3452,  0.0636],
        [ 0.4990,  0.3165, -0.2827, -0.3702],
        [ 0.1865,  0.4365,  0.4086,  0.2605],
        [-0.3549, -0.2154, -0.2181, -0.2735],
        [ 0.0308,  0.3332, -0.4905, -0.3380],
        [ 0.1924, -0.1065, -0.4659,  0.4902],
        [ 0.2771,  0.3481, -0.0443,  0.3201],
        [-0.2433, -0.1370, -0.4293,  0.2107]], requires_grad=True)
0.bias = Parameter containing:
tensor([-0.0582, -0.4997,  0.1805, -0.1224, -0.0438, -0.1452, -0.1529,  0.2032],
       requires_grad=True)
2.weight = Parameter containing:
tensor([[ 0.2474,  0.1611,  0.1864, -0.1218, -0.0855, -0.1430, -0.0446, -0.2508]],
       requires_grad=True)
2.bias = Parameter containing:
tensor([0.2601], requires_grad=True)

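Arguments

prefix: str = '': prefix prepended to every parameter name
recurse: bool = True: whether to also yield the parameters of submodules
remove_duplicate: bool = True: whether to drop parameters that appear more than once

Typical uses:

Debugging: inspecting the name and shape of each layer's parameters
Parameter analysis: analyzing the parameter distribution of a specific layer
Selective operations: acting only on parameters with particular names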
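
As an illustration of these uses, the sketch below lists parameter shapes and then freezes the first layer by matching on the name prefix. Freezing via requires_grad_(False) is just one possible selective operation; the names shapes and trainable are assumptions made for this example.

```python
from torch import nn

mlp = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# Debugging: map every parameter name to its shape.
shapes = {name: tuple(param.shape) for name, param in mlp.named_parameters()}
print(shapes)

# Selective operation: freeze the first layer by matching its name prefix.
for name, param in mlp.named_parameters():
    if name.startswith('0.'):
        param.requires_grad_(False)

# Only the parameters of the last Linear layer remain trainable.
trainable = [name for name, p in mlp.named_parameters() if p.requires_grad]
print(trainable)  # ['2.weight', '2.bias']
```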

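2.3 Accessing the parameters of a specific layer

Layers can be visited by iterating over the container or by indexing into it (indexing works the same way on nested Sequential blocks):

for i, layer in enumerate(mlp):
    kind = layer.__class__.__name__
    # getattr with a default also covers layers created with bias=False
    weight = getattr(layer, 'weight', None)
    bias = getattr(layer, 'bias', None)
    if weight is not None:
        print(f'Layer {i} [{kind}]: weight = {weight.data}')
    if bias is not None:
        print(f'Layer {i} [{kind}]: bias = {bias.data}')
    if weight is None and bias is None:
        print(f"Layer {i} [{kind}]: has no weights or bias.")

Output: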

Layer 0 [Linear]: weight = tensor([[-4.7561e-01,  9.6813e-02,  3.4569e-02,  7.8308e-03],
        [-5.5281e-02,  2.5018e-01, -3.9521e-01, -3.5687e-02],
        [ 4.1087e-01, -5.9178e-02,  4.0543e-04,  3.3486e-01],
        [ 4.2845e-01,  8.2424e-03,  1.3492e-01,  1.4855e-01],
        [-4.1954e-02,  4.5593e-01, -1.9483e-01,  1.8994e-03],
        [-2.1427e-01, -1.9506e-01,  1.3504e-01, -6.4553e-02],
        [-3.9364e-01, -2.8565e-01,  4.7102e-01, -4.8467e-01],
        [ 1.6487e-01,  9.2206e-02, -1.8677e-01, -4.5183e-01]])
Layer 0 [Linear]: bias = tensor([-0.1909,  0.1203,  0.3650,  0.4064, -0.1391,  0.0739,  0.0105,  0.0505])
Layer 1 [ReLU]: has no weights or bias.
Layer 2 [Linear]: weight = tensor([[-0.1981, -0.0677, -0.1561,  0.0717, -0.2161,  0.1854, -0.2776, -0.1918]])
Layer 2 [Linear]: bias = tensor([0.3295])

Note on parameter access:

For each layer, layer.weight is a torch.nn.parameter.Parameter instance. Its data attribute gives access to the underlying values.
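
A short sketch of what a Parameter carries: the wrapped tensor, and a gradient that is only populated after a backward pass. The input of ones and the sum() loss are arbitrary choices made for this example.

```python
import torch
from torch import nn

layer = nn.Linear(4, 8)

print(type(layer.weight))       # <class 'torch.nn.parameter.Parameter'>
print(type(layer.weight.data))  # <class 'torch.Tensor'>

# No backward pass has run yet, so there is no gradient.
grad_before = layer.weight.grad
print(grad_before)              # None

# After a backward pass the gradient has the same shape as the parameter.
layer(torch.ones(2, 4)).sum().backward()
print(layer.weight.grad.shape)  # torch.Size([8, 4])
```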

2.4 Accessing parameters as an ordered dictionary

The state_dict() method returns all of the model's parameters and buffers as an ordered dictionary (OrderedDict):

for i in range(len(mlp)):
    print(mlp[i].state_dict())

for k, v in mlp.state_dict().items():
    print(f'{k}: {v}')

Output:

OrderedDict([('weight', tensor([[ 0.2438,  0.4527,  0.3543, -0.2325],
        [-0.3226,  0.1549,  0.4258,  0.0360],
        [-0.1136,  0.2583, -0.3417,  0.0637],
        [ 0.1996,  0.2257, -0.1983, -0.1986],
        [ 0.0203,  0.2687,  0.3132, -0.3821],
        [ 0.0177,  0.1618, -0.0728,  0.4820],
        [-0.4938,  0.2227, -0.4224, -0.4283],
        [ 0.3910, -0.2744,  0.1875, -0.4780]])), ('bias', tensor([-0.1088,  0.4029,  0.2719, -0.3215,  0.0109,  0.2277,  0.0511, -0.3369]))])
OrderedDict()
OrderedDict([('weight', tensor([[-0.0166,  0.2787, -0.2200, -0.2806,  0.2416, -0.2221,  0.1832, -0.2463]])), ('bias', tensor([-0.2722]))])

0.weight: tensor([[ 0.1944, -0.0307,  0.3028,  0.1538],
        [ 0.1045, -0.1321,  0.2132,  0.0064],
        [ 0.1400, -0.3735, -0.4175,  0.4298],
        [-0.0318, -0.2091,  0.3814,  0.3978],
        [ 0.1265,  0.0288, -0.1242,  0.3750],
        [ 0.1812, -0.4614,  0.1175, -0.0494],
        [-0.3612,  0.4967, -0.1158,  0.1046],
        [-0.2768, -0.2640, -0.1287, -0.4331]])
0.bias: tensor([-0.1913, -0.1836, -0.3315,  0.3204,  0.4959, -0.4941,  0.2261, -0.2725])
2.weight: tensor([[-0.2896,  0.0803,  0.1009, -0.1523, -0.0526, -0.2986,  0.3070,  0.1412]])
2.bias: tensor([0.3495])

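Arguments

destination=None: if given, the state is written into this existing ordered dictionary
prefix='': prefix prepended to every key
keep_vars=False: whether the returned tensors stay attached to autograd (by default they are detached)

Typical uses:

Saving a model: writing the model parameters to a file
Transferring parameters: moving parameters between devices
Comparing models: inspecting parameter differences between models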
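
The sketch below illustrates two of these points: comparing and copying parameters between two models through their state dictionaries, and the effect of keep_vars. The helpers make_mlp and max_abs_diff are hypothetical and exist only for this example.

```python
import torch
from torch import nn


def make_mlp():
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))


def max_abs_diff(m1, m2):
    """Largest absolute difference per parameter name (hypothetical helper)."""
    sd1, sd2 = m1.state_dict(), m2.state_dict()
    return {k: (sd1[k] - sd2[k]).abs().max().item() for k in sd1}


model_a, model_b = make_mlp(), make_mlp()
before = max_abs_diff(model_a, model_b)  # independently initialized models differ

# Copy every value from model_a into model_b.
model_b.load_state_dict(model_a.state_dict())
after = max_abs_diff(model_a, model_b)
print(after)  # every difference is 0.0

# keep_vars controls whether entries are detached from autograd.
detached = model_a.state_dict()['0.weight']
attached = model_a.state_dict(keep_vars=True)['0.weight']
print(detached.requires_grad, attached.requires_grad)  # False True
```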

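3. Parameter initialization

Good parameter initialization is essential for training a model.

3.1 Built-in initialization methods

PyTorch's nn.init module ships with a range of initialization methods:

Distribution-based initialization:

normal_(): normal distribution
uniform_(): uniform distribution
constant_(): constant value

Specialized initialization schemes:

kaiming_normal_(), kaiming_uniform_(): Kaiming initialization
xavier_normal_(), xavier_uniform_(): Xavier initialization
orthogonal_(): orthogonal initialization
sparse_(): sparse initialization

Naming convention:

A method with a trailing underscore (such as method_()) operates in place on an existing tensor. The API still keeps same-named methods without the trailing underscore, but they are marked as deprecated.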
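
A minimal sketch of the in-place behavior: the initializer writes into the tensor it is given and also returns it, so no reassignment is needed. The tensor shapes and fill values here are arbitrary.

```python
import torch
from torch import nn

w = torch.empty(3, 4)

# constant_ fills w in place; the returned tensor refers to the same storage.
result = nn.init.constant_(w, 0.5)
print(result.data_ptr() == w.data_ptr())  # True
print(w)

# uniform_ likewise overwrites an existing tensor, here with values in [-1, 1].
u = torch.empty(100)
nn.init.uniform_(u, -1.0, 1.0)
print(u.min().item() >= -1.0, u.max().item() <= 1.0)  # True True
```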

3.2 A general initialization pattern

A general pattern for initializing a model looks like this:

import torch.nn as nn
import torch.nn.init as init


class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # Define the network layers
        ...

    def forward(self, data):
        # Define the forward pass
        ...
        return data


def initialize(model):
    for m in model.modules():
        if isinstance(m, nn.Linear):
            init.xavier_normal_(m.weight)
            if m.bias is not None:
                init.constant_(m.bias, 0)
        elif isinstance(m, nn.Conv2d):
            init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            if m.bias is not None:
                init.constant_(m.bias, 0)
        # Add initialization for further layer types...


model_instance = Model()
initialize(model_instance)  # alternatively, pass a per-module function to the model's apply method
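
The apply route can be sketched as follows. apply calls the given function on every submodule, including nested ones, so the function only has to handle a single module. The function name init_linear and the small nested network are assumptions made for this example.

```python
from torch import nn


def init_linear(m):
    """Per-module initializer (hypothetical name): handles one module at a time."""
    if isinstance(m, nn.Linear):
        nn.init.xavier_normal_(m.weight)
        if m.bias is not None:
            nn.init.constant_(m.bias, 0)


net = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(),
    nn.Sequential(nn.Linear(8, 8), nn.ReLU()),  # nested block
    nn.Linear(8, 1),
)

# apply visits every submodule recursively, then the container itself.
net.apply(init_linear)

# All three Linear layers, including the nested one, now have zero biases.
biases_zero = all(
    bool((m.bias == 0).all()) for m in net.modules() if isinstance(m, nn.Linear)
)
print(biases_zero)  # True
```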

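3.3 Custom initialization

Sometimes the deep learning framework does not provide the initialization method we need. In the example below, we define an initializer for any weight parameter $w$ using the following distribution: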

$w\sim\left\{\begin{array}{ll}U(-10,-5)&P=0.25\\0&P=0.50\\U(+5,+10)&P=0.25\end{array}\right.$

Implementation

import torch.nn as nn
import torch.nn.init as init


def my_init(m):
    if isinstance(m, nn.Linear):
        init.uniform_(m.weight, -10, 10)  # draw weights uniformly from [-10, 10]
        m.weight.data *= m.weight.data.abs() >= 5  # keep weights with |w| >= 5; the rest are multiplied by False and become 0


mlp = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

mlp.apply(my_init)
for name, param in mlp.named_parameters():
    print(f"{name}:\n{param.data}")

Output:

0.weight:
tensor([[ 0.0000, -0.0000,  5.3036, -6.8710],
        [ 0.0000, -0.0000,  0.0000,  7.3253],
        [-0.0000, -0.0000, -0.0000,  0.0000],
        [-7.3227, -9.0059,  0.0000, -8.8338],
        [-0.0000, -7.7275, -0.0000, -7.2224],
        [ 6.2056, -8.1714,  0.0000, -0.0000],
        [-6.5656, -0.0000,  9.1092, -9.4554],
        [-0.0000,  0.0000, -6.7090,  6.3277]])
0.bias:
tensor([-0.4083, -0.2523, -0.1757, -0.2357, -0.0353, -0.0037,  0.4661,  0.0477])
2.weight:
tensor([[-0.0000,  0.0000, -7.6418,  0.0000,  0.0000, -7.1894,  0.0000,  8.6543]])
2.bias:
tensor([-0.0228])
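
To see that this recipe matches the target distribution, the fractions can be checked empirically on a large layer. This is only a sanity-check sketch: the layer size of 1000 x 1000 and the fixed seed are arbitrary choices, and the printed fractions are approximate.

```python
import torch
from torch import nn

torch.manual_seed(0)

# A large layer gives one million samples of the weight distribution.
layer = nn.Linear(1000, 1000)
nn.init.uniform_(layer.weight, -10, 10)
layer.weight.data *= layer.weight.data.abs() >= 5

w = layer.weight.data
frac_zero = (w == 0).float().mean().item()  # expected near 0.50
frac_neg = (w <= -5).float().mean().item()  # expected near 0.25
frac_pos = (w >= 5).float().mean().item()   # expected near 0.25
print(f"zero: {frac_zero:.3f}, negative: {frac_neg:.3f}, positive: {frac_pos:.3f}")
```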

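4. Parameter tying

4.1 The idea of parameter tying

Parameter tying shares parameters across layers by binding the same parameter object to several layers of a model.

How parameter tying works:

Forward pass: the shared parameter is reused along several computation paths
Backward pass: the gradients from every use of the shared parameter are accumulated (summed) into it
Memory efficiency: a shared layer is stored only once, which reduces the parameter count and memory use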
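
The gradient accumulation can be verified with a small experiment: use one layer twice, then compare its gradient with the sum of the gradients of two separate layers that hold identical values. This is a sketch under assumed names (shared, first, second) with an arbitrary input and a sum() loss.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(2, 8)

# Use the same layer twice within one forward pass.
shared = nn.Linear(8, 8)
shared(torch.relu(shared(x))).sum().backward()
grad_shared = shared.weight.grad.clone()

# Reference: two separate layers that start from the same values.
first, second = nn.Linear(8, 8), nn.Linear(8, 8)
first.load_state_dict(shared.state_dict())
second.load_state_dict(shared.state_dict())
second(torch.relu(first(x))).sum().backward()

# The gradient of the shared weight equals the sum of the per-use gradients.
matches = torch.allclose(
    grad_shared, first.weight.grad + second.weight.grad, atol=1e-6
)
print(matches)  # True
```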

4.2 Implementing parameter tying

from torch import nn

shared_layer = nn.Linear(8, 8)
mlp = nn.Sequential(
    nn.Linear(4, 8),  # 0
    nn.ReLU(),        # 1
    shared_layer,     # 2
    nn.ReLU(),        # 3
    shared_layer,     # 4 - shares parameters with layer 2
    nn.ReLU(),        # 5
    nn.Linear(8, 1)   # 6
)

print(*mlp.named_parameters(remove_duplicate=False), sep='\n\n')

Output:

('0.weight', Parameter containing:
tensor([[ 0.4498, -0.2106, -0.3657,  0.0569],
        [-0.1041,  0.4180,  0.1986,  0.1852],
        [ 0.1846, -0.2800, -0.2179,  0.4976],
        [-0.4961,  0.3372, -0.2459,  0.1330],
        [-0.0068,  0.4297,  0.3575, -0.3342],
        [ 0.3815,  0.2712, -0.1784,  0.4909],
        [ 0.1812,  0.3937,  0.2199, -0.3863],
        [ 0.0506, -0.4913, -0.4984,  0.3429]], requires_grad=True))

('0.bias', Parameter containing:
tensor([ 0.4860,  0.3686, -0.2496, -0.0575,  0.3068,  0.2675,  0.4123, -0.4558],
       requires_grad=True))

('2.weight', Parameter containing:
tensor([[ 0.0590,  0.1841,  0.0330,  0.3468,  0.2566,  0.0160,  0.1089,  0.2442],
        [ 0.2502, -0.0257,  0.0304,  0.1460, -0.3342, -0.2527, -0.3007,  0.0809],
        [ 0.1058,  0.0624,  0.2905, -0.0474, -0.0983,  0.0402, -0.2424, -0.0652],
        [-0.1226, -0.2152, -0.3290,  0.3441,  0.1590, -0.3139,  0.2169, -0.1834],
        [ 0.1637, -0.2153,  0.2685,  0.1763,  0.2828, -0.0979,  0.2355, -0.0575],
        [-0.3041, -0.2542, -0.3197, -0.0129,  0.1893, -0.3135,  0.2045,  0.3202],
        [ 0.0094, -0.2106,  0.1643, -0.2307,  0.1777,  0.1465,  0.1821,  0.3263],
        [ 0.2449, -0.1513,  0.2915,  0.2287, -0.0244, -0.1628, -0.1861,  0.3187]],
       requires_grad=True))

('2.bias', Parameter containing:
tensor([-0.1521,  0.2661,  0.0729, -0.0857, -0.1317,  0.3258, -0.0936,  0.2853],
       requires_grad=True))

('4.weight', Parameter containing:
tensor([[ 0.0590,  0.1841,  0.0330,  0.3468,  0.2566,  0.0160,  0.1089,  0.2442],
        [ 0.2502, -0.0257,  0.0304,  0.1460, -0.3342, -0.2527, -0.3007,  0.0809],
        [ 0.1058,  0.0624,  0.2905, -0.0474, -0.0983,  0.0402, -0.2424, -0.0652],
        [-0.1226, -0.2152, -0.3290,  0.3441,  0.1590, -0.3139,  0.2169, -0.1834],
        [ 0.1637, -0.2153,  0.2685,  0.1763,  0.2828, -0.0979,  0.2355, -0.0575],
        [-0.3041, -0.2542, -0.3197, -0.0129,  0.1893, -0.3135,  0.2045,  0.3202],
        [ 0.0094, -0.2106,  0.1643, -0.2307,  0.1777,  0.1465,  0.1821,  0.3263],
        [ 0.2449, -0.1513,  0.2915,  0.2287, -0.0244, -0.1628, -0.1861,  0.3187]],
       requires_grad=True))

('4.bias', Parameter containing:
tensor([-0.1521,  0.2661,  0.0729, -0.0857, -0.1317,  0.3258, -0.0936,  0.2853],
       requires_grad=True))

('6.weight', Parameter containing:
tensor([[-0.0486,  0.1527,  0.1151,  0.3060, -0.0270,  0.2657, -0.2306,  0.0666]],
       requires_grad=True))

('6.bias', Parameter containing:
tensor([0.1217], requires_grad=True))
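
Identical printed values alone do not prove that the two layers are tied. The sketch below checks object identity, shows that a change made through one layer is visible through the other, and compares parameter counts with and without duplicates. The variable names same_object, unique, and with_dups are introduced for this example.

```python
from torch import nn

shared_layer = nn.Linear(8, 8)
mlp = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(),
    shared_layer, nn.ReLU(),
    shared_layer, nn.ReLU(),
    nn.Linear(8, 1),
)

# Layers 2 and 4 hold the very same Parameter object, not a copy.
same_object = mlp[2].weight is mlp[4].weight
print(same_object)  # True

# A change made through layer 2 is visible through layer 4.
mlp[2].weight.data[0, 0] = 100.0
print(mlp[4].weight.data[0, 0])  # tensor(100.)

# parameters() counts the shared layer once: 40 + 72 + 9 = 121.
unique = sum(p.numel() for p in mlp.parameters())
# With duplicates kept, the shared 72 parameters are counted twice: 193.
with_dups = sum(
    p.numel() for _, p in mlp.named_parameters(remove_duplicate=False)
)
print(unique, with_dups)  # 121 193
```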

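4.3 Advantages of parameter tying

Computation and storage efficiency:

Fewer parameters: a shared layer is stored once, which reduces the parameter count and memory use
Lower bookkeeping cost: there are fewer distinct parameters to store, update, and synchronize, although the forward computation itself is unchanged
Potentially faster training: fewer parameters mean less optimizer state and fewer updates per step

Learning and generalization:

General-purpose features: sharing pushes the model toward features that are useful in more than one place
Better generalization: fewer free parameters can help limit overfitting
Structured design: sharing encourages a more structured model design

Where it applies:

Multi-task learning: sharing knowledge across several related tasks
Meta-learning: learning algorithms that adapt quickly to new tasks
Resource-constrained environments: deploying models where compute resources are limited

4.4 Classic applications

Classic applications of parameter sharing:

Convolutional neural networks (CNN): the same kernel is applied at every spatial position, so the same feature is detected anywhere in the image, which underlies translation invariance
Recurrent neural networks (RNN): the same transformation is applied at every time step of the sequence, which helps the model capture the structure of the whole sequence
Autoencoders: with tied weights, the decoder weight matrix is constrained by the encoder weight matrix (commonly as its transpose), so the parameters are used more economically
Transformer: parameter sharing in and around the multi-head attention mechanism, for example the same projection weights applied at every position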
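
The RNN case is easy to demonstrate: because the same weights are applied at every time step, the parameter count does not depend on the sequence length. The sketch below uses nn.RNN with arbitrary sizes (input size 4, hidden size 8) chosen for this example.

```python
import torch
from torch import nn

rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)

# weight_ih (8x4) + weight_hh (8x8) + two bias vectors (8 each) = 112
n_params = sum(p.numel() for p in rnn.parameters())

# The same 112 parameters process a 5-step and a 50-step sequence.
out_short, _ = rnn(torch.randn(1, 5, 4))
out_long, _ = rnn(torch.randn(1, 50, 4))

print(n_params)         # 112
print(out_short.shape)  # torch.Size([1, 5, 8])
print(out_long.shape)   # torch.Size([1, 50, 8])
```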

5. Best practices for parameter management

5.1 Best practices for parameter access

def analyze_model_parameters(model):
    """Print summary statistics of the model's parameters"""
    total_params = 0
    trainable_params = 0
    
    print("=" * 50)
    print("Model Parameter Analysis")
    print("=" * 50)
    
    for name, param in model.named_parameters():
        param_count = param.numel()
        total_params += param_count
        
        if param.requires_grad:
            trainable_params += param_count
            
        print(f"{name:20s} | Shape: {str(param.shape):20s} | Params: {param_count:8d} | Trainable: {param.requires_grad}")
    
    print("=" * 50)
    print(f"Total parameters: {total_params:,}")
    print(f"Trainable parameters: {trainable_params:,}")
    print(f"Non-trainable parameters: {total_params - trainable_params:,}")
    print("=" * 50)

# Usage example
analyze_model_parameters(mlp)

5.2 Best practices for parameter initialization

from torch import nn


def smart_init(model):
    """Initialize each layer with a scheme suited to its type"""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            # Xavier initialization for linear layers
            nn.init.xavier_uniform_(module.weight)
            if module.bias is not None:
                nn.init.constant_(module.bias, 0)
        elif isinstance(module, nn.Conv2d):
            # Kaiming initialization for convolutional layers
            nn.init.kaiming_normal_(module.weight, mode='fan_out', nonlinearity='relu')
            if module.bias is not None:
                nn.init.constant_(module.bias, 0)
        elif isinstance(module, (nn.BatchNorm2d, nn.GroupNorm)):
            # Normalization layers: scale 1, shift 0 (both are None when affine=False)
            if module.weight is not None:
                nn.init.constant_(module.weight, 1)
            if module.bias is not None:
                nn.init.constant_(module.bias, 0)

5.3 Saving and loading parameters

import torch

# `model` is assumed to be an existing nn.Module instance.

# Save the model parameters
torch.save(model.state_dict(), 'model_parameters.pth')

# Load the model parameters
model.load_state_dict(torch.load('model_parameters.pth'))

# Partial loading (for transfer learning)
pretrained_dict = torch.load('pretrained_model.pth')
model_dict = model.state_dict()

# Keep only entries whose name and shape match the current model
pretrained_dict = {k: v for k, v in pretrained_dict.items() if k in model_dict and v.size() == model_dict[k].size()}
model_dict.update(pretrained_dict)
model.load_state_dict(model_dict)
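
The partial-loading pattern can be tried end to end without touching the file system. The sketch below is one possible setup, not a prescribed workflow: it saves to an in-memory buffer instead of a .pth file, and the two small models (a hypothetical "pretrained" network with 3 outputs and a new one with 1 output) are assumptions made for this example.

```python
import io

import torch
from torch import nn

# A hypothetical pretrained model and a new model with a different output layer.
pretrained = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# Save to an in-memory buffer (stands in for a file on disk) and load it back.
buffer = io.BytesIO()
torch.save(pretrained.state_dict(), buffer)
buffer.seek(0)
pretrained_dict = torch.load(buffer)

# Keep only entries whose name and shape match the new model.
model_dict = model.state_dict()
matched = {
    k: v for k, v in pretrained_dict.items()
    if k in model_dict and v.size() == model_dict[k].size()
}
model_dict.update(matched)
model.load_state_dict(model_dict)

# Only the first layer matches; the output layer keeps its own initialization.
print(sorted(matched))  # ['0.bias', '0.weight']
```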