Deep Learning Fundamentals (2): Activation Functions, Loss Functions, Optimization Algorithms, and Parameter Management

Activation functions

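Activation functions fall into two classes: "saturating" and "non-saturating". sigmoid and tanh are saturating activation functions, while ReLU and its variants are non-saturating. Non-saturating activation functions have two main advantages:

First, they mitigate the so-called "vanishing gradient" problem.
Second, they speed up convergence.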

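ReLU function

ReLU provides a very simple nonlinear transformation. Given an element x, ReLU is defined as the maximum of that element and 0:

\text{ReLU}(x) = \max(x, 0).

By setting the corresponding activations to 0, ReLU keeps only the positive elements and discards all negative ones.

ReLU is popular because its derivative is so well behaved: it is 0 for negative inputs and 1 for positive inputs, so the gradient is either blocked or passed through unchanged. This makes optimization better behaved and mitigates the vanishing gradient problem that plagued earlier neural networks.

Drawback:
ReLU units can be fragile during training and can "die". For example, if a very large gradient flows through a ReLU neuron, the parameter update may leave the neuron in a state where it never activates on any data again, and from then on the gradient through that neuron is always 0. If the learning rate is set too high, as much as 40% of the neurons in a network can end up "dead".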

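Variants:

Leaky ReLU: Leaky ReLU introduces a small non-zero slope for negative inputs, where $\alpha$ is a small constant (greater than 0 and less than 1). This way the neuron keeps a non-zero gradient even when its input is negative:

\text{LeakyReLU}(x) = \max(\alpha x, x)

Parametric ReLU (PReLU): a generalization of Leaky ReLU in which $\alpha$ is no longer a fixed small constant but a learnable parameter. The model can then learn the most suitable $\alpha$ itself, which makes it more flexible and can improve performance.

\text{PReLU}(x) = \max(0, x) + \alpha \min(0, x).

Randomized Leaky ReLU (RReLU): a variant of Leaky ReLU. In RReLU the slope for negative values is sampled at random during training and is fixed at test time.

Exponential Linear Unit (ELU): ELU is another attempt to address ReLU's non-zero-centered output. It uses an exponential curve for negative inputs, which pushes the mean activation closer to zero and speeds up learning, while the identity on positive inputs avoids vanishing gradients:

\text{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha (e^{x} - 1) & x \leq 0 \end{cases}

Properties of ELU:

No dead-ReLU problem; the mean output is close to 0, so activations are roughly zero-centered.
By reducing the bias-shift effect, ELU brings the normal gradient closer to the unit natural gradient, so learning speeds up as mean activations move toward zero.
ELU saturates to a negative value for very negative inputs, which reduces the variation and information propagated forward.
ELU is more expensive to compute. As with Leaky ReLU, although it is better than ReLU in theory, there is currently no strong practical evidence that ELU is always better than ReLU.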

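import torch
from torch import nn

x = torch.tensor([-2.0, 0.0, 3.0])  # sample input

torch.relu(x)           # functional form
relu = nn.ReLU()        # module form
relu(x)

m = nn.LeakyReLU(0.1)   # negative_slope = 0.1
m = nn.PReLU()          # learnable slope, initialized to 0.25
m = nn.RReLU(0.1, 0.3)  # slope sampled from U(0.1, 0.3) during training
m = nn.ELU()            # alpha = 1.0 by default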
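
To make the formulas for ReLU and its variants concrete, here is a minimal pure-Python sketch that evaluates them on a few scalars. The function names and the sample values of alpha are illustrative assumptions, not part of any library.

```python
import math

def relu(x: float) -> float:
    # max(x, 0): negative inputs are clipped to zero
    return max(x, 0.0)

def leaky_relu(x: float, alpha: float = 0.01) -> float:
    # for 0 < alpha < 1, max(alpha * x, x) is x when x >= 0 and alpha * x otherwise
    return max(alpha * x, x)

def prelu(x: float, alpha: float) -> float:
    # same shape as Leaky ReLU; in a real network alpha would be learned
    return max(0.0, x) + alpha * min(0.0, x)

def elu(x: float, alpha: float = 1.0) -> float:
    # exponential curve for x <= 0, saturating towards -alpha
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), leaky_relu(x, 0.1), prelu(x, 0.25), elu(x))
```

All four functions agree for positive inputs and differ only in how they treat negative ones.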

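Sigmoid function

The sigmoid function maps its input to an output in the interval (0, 1). For this reason sigmoid is often called a squashing function: it squashes any input in the range (-inf, inf) to some value in (0, 1):

\operatorname{sigmoid}(x) = \frac{1}{1 + \exp(-x)}.

Sigmoid is still widely used as the activation on output units when we want to interpret the output as a probability for binary classification (sigmoid can be seen as a special case of softmax). In hidden layers, however, it is now rarely used, having mostly been replaced by ReLU, which is simpler and easier to train.

Properties, advantages and disadvantages of sigmoid:

The output range is 0 to 1, so the output of every neuron is normalized.
It suits models that output a predicted probability, since probabilities also range from 0 to 1.
The gradient is smooth, which avoids jumps in the output.
The function is differentiable, so the slope of the sigmoid curve is defined at every point.
Predictions are clear-cut, that is, very close to 1 or 0.
The output is not zero-centered, which makes weight updates less efficient.
Sigmoid involves an exponential, which is comparatively slow to compute.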

import torch
from torch import nn

m = nn.Sigmoid()
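
Sigmoid is a saturating activation, and the effect is easiest to see in its derivative, sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)). The following is a minimal pure-Python sketch; the helper names are illustrative.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    # derivative of sigmoid expressed through its own output
    s = sigmoid(x)
    return s * (1.0 - s)

grads = {x: sigmoid_grad(x) for x in (0.0, 2.0, 5.0, 10.0)}
for x, g in grads.items():
    print(f"x = {x:5.1f}  sigmoid'(x) = {g:.6f}")
```

The derivative peaks at 0.25 for x = 0 and shrinks rapidly as |x| grows, which is why gradients passed back through many sigmoid layers tend to vanish.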

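tanh function

The tanh (hyperbolic tangent) function also squashes its input, in this case to the interval (-1, 1):

\operatorname{tanh}(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)}.

Note that for inputs near 0, tanh is close to a linear transformation. Its shape resembles that of sigmoid, except that tanh is point-symmetric about the origin. It therefore fixes sigmoid's non-zero-centered output, but the vanishing gradient problem and the cost of computing exponentials remain.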

import torch
from torch import nn

m = nn.Tanh()
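
The similarity in shape between tanh and sigmoid is exact up to scaling: tanh(x) = 2 sigmoid(2x) - 1. A minimal pure-Python sketch that checks the formula above against this identity and against math.tanh (the helper names are illustrative):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_formula(x: float) -> float:
    # the definition used in the text
    return (1.0 - math.exp(-2.0 * x)) / (1.0 + math.exp(-2.0 * x))

def tanh_from_sigmoid(x: float) -> float:
    # rescaled and shifted sigmoid
    return 2.0 * sigmoid(2.0 * x) - 1.0

xs = (-3.0, -0.5, 0.0, 0.5, 3.0)
for x in xs:
    print(x, tanh_from_formula(x), tanh_from_sigmoid(x), math.tanh(x))
```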

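GLU function

GLU (Gated Linear Unit) uses two separate linear layers. The output of one is passed through a sigmoid, and the result is multiplied elementwise with the output of the other linear layer to give the final output:

\text{GLU}(x,W,V,b,c) = \sigma(xW+b) \otimes (xV+c)

Here W, V and b, c are the parameters of the two linear layers; $\sigma(xW+b)$ acts as the gate that controls how much of $xV+c$ passes through.

The gate here uses $\sigma$ as its activation function; variants that swap in a different activation often perform better.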
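
The GLU formula can be sketched as a small PyTorch module. This is one possible arrangement under stated assumptions: the class name GatedLinearUnit and the attribute names gate and value are hypothetical, chosen only to mirror the two linear layers in the formula.

```python
import torch
from torch import nn

class GatedLinearUnit(nn.Module):
    """Sketch of GLU(x) = sigmoid(xW + b) * (xV + c)."""

    def __init__(self, in_features: int, out_features: int) -> None:
        super().__init__()
        self.gate = nn.Linear(in_features, out_features)   # computes xW + b
        self.value = nn.Linear(in_features, out_features)  # computes xV + c

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # the sigmoid output in (0, 1) scales each element of the value branch
        return torch.sigmoid(self.gate(x)) * self.value(x)

torch.manual_seed(0)
glu = GatedLinearUnit(4, 8)
x = torch.rand(2, 4)
out = glu(x)
print(out.shape)
```

Replacing torch.sigmoid in forward with another activation gives the GLU variants mentioned above.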

SwiGLU

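Replacing the activation in the GLU formula with Swish gives the SwiGLU activation function.

Following LLaMA, the fully connected layer is an FFN with the SwiGLU activation and no bias terms:

\text{FFN}_{\text{SwiGLU}}(x, W, V, W_2) = (\text{Swish}_1(xW) \otimes xV) W_2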

import torch
from torch import nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int) -> None:
        super().__init__()
        self.w1 = nn.Linear(hidden_size, intermediate_size, bias=False)  # W in the formula (gate branch)
        self.w2 = nn.Linear(intermediate_size, hidden_size, bias=False)  # W_2 in the formula (output projection)
        self.w3 = nn.Linear(hidden_size, intermediate_size, bias=False)  # V in the formula (value branch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, seq_len, hidden_size)
        # w1(x) -> (batch_size, seq_len, intermediate_size)
        # w3(x) -> (batch_size, seq_len, intermediate_size)
        # w2(*) -> (batch_size, seq_len, hidden_size)
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

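Swish activation function

References (CSDN blog posts): "机器学习中的数学——激活函数（八）：Swish函数"; "Llama改进之——SwiGLU激活函数"

The design of Swish was inspired by the use of the sigmoid function for gating in LSTMs and highway networks. Swish uses the same value for both the gate and the gated input, which simplifies the gating mechanism and is called self-gating. The advantage of self-gating is that it needs only a single scalar input, whereas ordinary gating needs multiple scalar inputs. This lets self-gated activations such as Swish easily replace activations that take a single scalar input (such as ReLU) without changing the hidden size or the number of parameters.

Swish is unbounded above, which helps avoid the saturation that slows training as gradients approach 0.
Its derivative is not always positive: Swish is non-monotonic, with a small region of negative slope for negative inputs.
Smoothness plays an important role in optimization and generalization.

The Swish activation has the form:

\text{Swish}_\beta(x) = x \sigma(\beta x)

where $\sigma(x)$ is the sigmoid function and $\beta$ is either a constant or a learnable parameter.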

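import torch
import torch.nn.functional as F
from torch import nn

x = torch.tensor([-2.0, 0.0, 3.0])  # sample input

# for beta = 1, Swish is the same as SiLU
nn.SiLU()(x)  # module form
F.silu(x)     # functional form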

As $\beta$ approaches 0, Swish approaches the linear function $y = x/2$; as $\beta$ approaches infinity, Swish approaches ReLU; for $\beta = 1$, Swish is smooth and non-monotonic.
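
The limiting behaviour of Swish in $\beta$ can be checked numerically. The following is a minimal pure-Python sketch; the helper names and the particular values of beta are illustrative. Since sigmoid(0) = 1/2, a very small beta gives approximately x/2.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def swish(x: float, beta: float = 1.0) -> float:
    return x * sigmoid(beta * x)

for x in (-2.0, -1.0, 0.5, 3.0):
    near_linear = swish(x, beta=1e-6)  # close to x / 2
    near_relu = swish(x, beta=50.0)    # close to max(x, 0)
    print(x, near_linear, near_relu, swish(x))
```

With beta = 1 the values at -2, -1 and 0 show the non-monotonic dip: the function decreases from 0 down to a minimum for moderately negative inputs and then rises back towards 0.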

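GELU activation function

Piecewise-linear functions such as ReLU are not smooth and are not differentiable at their kink (the origin, for ReLU). Data fed to neural networks is usually expected to be zero-mean, so inputs near zero are common, and the fact that ReLU-like functions are not differentiable at zero can affect network performance to some degree. In addition, when a deterministic nonlinearity such as ReLU is used as the activation, the network often needs an added stochastic regularizer to generalize well. This raises the question of whether a nonlinearity with stochastic regularization built in could provide both the nonlinearity and the generalization at once.

The Gaussian Error Linear Unit (GELU) is defined as:

\text{GELU}(x) = x P(X \leq x) = x \Phi(x)

where $\Phi(x)$ is the cumulative distribution function of a Gaussian (GELU uses the standard normal, $\mu = 0$, $\sigma = 1$). In full:

\Phi(x) = \int_{-\infty}^{x} \frac{e^{-\frac{(X-\mu)^2}{2\sigma^2}}}{\sqrt{2\pi}\sigma} dX

Because $\Phi(x)$ has no closed form in elementary functions, $\text{GELU}(x)$ is often approximated as

\text{GELU}(x) \approx 0.5x(1+\tanh[\sqrt{2/\pi}(x+0.044715x^3)])

GELU is differentiable everywhere and smoother than ReLU, which may reduce the risk of exploding gradients, and because its gradient is not identically zero for negative inputs it is less prone to dead neurons.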
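
To see how close the tanh approximation is, it can be compared with the form based on the error function, using Phi(x) = 0.5 (1 + erf(x / sqrt(2))) for the standard normal. This is a minimal pure-Python sketch; the function names and the grid of test points are illustrative choices.

```python
import math

def gelu_exact(x: float) -> float:
    # x * Phi(x), with the standard normal CDF written via the error function
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # the tanh approximation from the text
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

grid = [i / 10.0 for i in range(-50, 51)]  # -5.0, -4.9, ..., 5.0
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"largest gap on [-5, 5]: {max_gap:.2e}")
```

On this grid the two versions differ only by a small amount, which is why the approximation is commonly used interchangeably with the erf form.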

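Loss functions

Cross-entropy loss

In PyTorch, CrossEntropyLoss combines two steps: a softmax and the cross-entropy computation. It measures the gap between the true values and the predicted values:

H(P, Q) = -\sum_{x} P(x) \log Q(x)

where P(x) is the true value and Q(x) is the predicted value. When P and Q are matrices, the term is computed for each element and the results are summed.

class torch.nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean', label_smoothing=0.0)

reduction sets how the individual losses are combined. The default is mean (average); none and sum are also supported, meaning respectively that each loss is returned as it is and that all losses are summed.

Example: for predictions (unnormalized scores) [0.8, 0.5, 0.2, 0.5], the target can be given as [1, 0, 0, 0] or as the class index 0.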

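import torch
from torch import nn

y_hat = torch.tensor([[0.8, 0.5, 0.2, 0.5]])  # predictions (logits), shape (batch, classes)
y = torch.tensor([0])                          # target class index for each sample

loss = nn.CrossEntropyLoss(reduction='none')
out = loss(y_hat, y)  # one loss value per sample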
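
The two steps inside the cross-entropy loss (softmax, then $-\sum P \log Q$) can be reproduced by hand for the example predictions [0.8, 0.5, 0.2, 0.5] with target class 0. This is a minimal pure-Python sketch; the helper name softmax and the variable names are illustrative.

```python
import math

logits = [0.8, 0.5, 0.2, 0.5]
target = 0

def softmax(z):
    # subtract the maximum first for numerical stability
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)

# target given as a class index: only the log-probability of that class counts
loss_index = -math.log(probs[target])

# target given as a one-hot vector: the same value via -sum P(x) log Q(x)
one_hot = [1.0, 0.0, 0.0, 0.0]
loss_one_hot = -sum(p * math.log(q) for p, q in zip(one_hot, probs))

print(loss_index, loss_one_hot)
```

Both target encodings give the same loss, because the one-hot vector zeroes out every term except the one for the true class.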

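Mean squared error loss

torch.nn.MSELoss(size_average=None, reduce=None, reduction: str = "mean")

size_average and reduce are deprecated in current versions of PyTorch; setting reduction alone is enough.

reduction accepts "none", "mean" and "sum":

reduction="none": computes the squared difference at every position and returns a tensor with the same shape as the inputs.

reduction="mean": returns the mean of the squared differences over all positions, which is a scalar.

reduction="sum": returns the sum of the squared differences over all positions, which is a scalar.

Given inputs y and pred, both of shape b×c, and loss_fn = torch.nn.MSELoss(reduction="mean"), the output is the sum of the squares of every element of (y − pred) divided by (b×c). In other words, the average is taken over both the batch and the feature dimension.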

import torch
from torch import nn

loss = nn.MSELoss(reduction='none')

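backward() can be called without arguments only on a scalar, so with reduction="none" the output has to be reduced to a scalar first (for example with .mean() or .sum()) before backpropagation.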
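
The three reduction modes and the scalar requirement of backward() can be seen side by side in a short sketch. The shapes (b = 3, c = 2) and the random inputs are illustrative assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)
pred = torch.randn(3, 2, requires_grad=True)  # b = 3, c = 2
y = torch.randn(3, 2)

per_elem = nn.MSELoss(reduction='none')(pred, y)   # shape (3, 2)
mean_loss = nn.MSELoss(reduction='mean')(pred, y)  # scalar: sum / (b * c)
sum_loss = nn.MSELoss(reduction='sum')(pred, y)    # scalar

print(per_elem.shape, mean_loss.item(), sum_loss.item() / pred.numel())

# the 'none' output is not a scalar, so reduce it before calling backward()
per_elem.mean().backward()
print(pred.grad.shape)
```

Keeping reduction='none' is useful when individual losses need to be weighted or masked before they are averaged.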

Optimization algorithms

SGD

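import torch
from torch import nn

net = nn.Linear(4, 1)  # placeholder model (an assumption) so that the snippet runs on its own
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)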
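
The update that SGD applies to each parameter is w ← w − lr · grad. The following pure-Python sketch shows that rule on a one-dimensional toy objective, f(w) = (w − 3)^2, which is an assumption chosen for illustration. It uses the exact gradient; in real training the gradient is estimated from a mini-batch, which is where the "stochastic" comes from.

```python
def sgd_step(w: float, grad: float, lr: float) -> float:
    # move against the gradient, scaled by the learning rate
    return w - lr * grad

w = 5.0   # initial parameter value
lr = 0.1  # learning rate, matching lr=0.1 above
for step in range(50):
    grad = 2.0 * (w - 3.0)  # derivative of f(w) = (w - 3)^2
    w = sgd_step(w, grad, lr)

print(w)  # approaches the minimizer w = 3
```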

Adam

Parameter management

When a model is defined with the Sequential class, any layer of the model can be accessed by index.

import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
X = torch.rand(size=(2, 4))
net(X)

net[i].state_dict(): maps the parameters of layer i to tensors and stores them in a dictionary

print(net[2].state_dict())

Output: OrderedDict([('weight', tensor([[-0.0427, -0.2939, -0.1894,  0.0220, -0.1709, -0.1522, -0.0334, -0.2263]])), ('bias', tensor([0.0887]))])

print(net.state_dict())
OrderedDict([('0.weight', tensor([[ 0.2861,  0.2400,  0.4630, -0.1340],
        [ 0.3398,  0.1188,  0.2201, -0.1069],
        [ 0.3857, -0.3230,  0.0230, -0.3195],
        [ 0.4652,  0.0321,  0.3840, -0.3612],
        [ 0.4141, -0.3728, -0.2018,  0.1321],
        [ 0.4022, -0.3111, -0.4738, -0.3092],
        [-0.1753,  0.1980, -0.4757,  0.2719],
        [ 0.3098,  0.4368, -0.1393, -0.0652]])), ('0.bias', tensor([ 0.1255,  0.4535, -0.2396, -0.2423,  0.3596, -0.2444, -0.2790,  0.2215])), ('2.weight', tensor([[-0.1768, -0.2927, -0.0773,  0.0006, -0.0372, -0.1224,  0.2603, -0.0563]])), ('2.bias', tensor([-0.1250]))])

Accessing a single parameter:

print(type(net[2].bias))
print(net[2].bias) # a Parameter instance holding the bias
print(net[2].bias.data) # the value of the bias parameter

Accessing all parameters at once:

print(*[(name, param.shape) for name, param in net[0].named_parameters()])
print(*[(name, param.shape) for name, param in net.named_parameters()])

Output:
('weight', torch.Size([8, 4])) ('bias', torch.Size([8]))
('0.weight', torch.Size([8, 4])) ('0.bias', torch.Size([8])) ('2.weight', torch.Size([1, 8])) ('2.bias', torch.Size([1]))

Note that the keys for a single layer carry no layer-index prefix.

net.state_dict()['2.bias'].data

A single parameter can also be accessed this way.

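In PyTorch, the state_dict of a torch.nn.Module holds the weights and biases that are learned during training. state_dict is a Python dictionary that maps each layer's parameter names to tensors. Note that only layers with learnable parameters (such as convolutional and fully connected layers) contribute entries, together with registered buffers: when the network contains batch normalization, as in some VGG variants, the state_dict also stores the batchnorm running_mean.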
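
The difference between learnable parameters and the extra batchnorm entries can be checked directly. This is a minimal sketch on a small hypothetical network; the layer sizes are arbitrary.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8), nn.ReLU(), nn.Linear(8, 1))

param_names = [name for name, _ in net.named_parameters()]
state_keys = list(net.state_dict().keys())

# keys that are in state_dict but are not learnable parameters
buffer_only = [k for k in state_keys if k not in param_names]

print(param_names)
print(buffer_only)
```

The ReLU layer (index 2) contributes no keys at all, while the BatchNorm1d layer (index 1) adds its running statistics on top of its learnable weight and bias.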

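An Optimizer object in torch.optim also has a state_dict. It is a dictionary with two entries, state and param_groups; the value under param_groups is a list of dictionaries that hold hyperparameters such as the learning rate and momentum.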

# encoding: utf-8

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
 
#define model
class TheModelClass(nn.Module):
    def __init__(self):
        super(TheModelClass,self).__init__()
        self.conv1=nn.Conv2d(3,6,5)
        self.pool=nn.MaxPool2d(2,2)
        self.conv2=nn.Conv2d(6,16,5)
        self.fc1=nn.Linear(16*5*5,120)
        self.fc2=nn.Linear(120,84)
        self.fc3=nn.Linear(84,10)
 
    def forward(self,x):
        x=self.pool(F.relu(self.conv1(x)))
        x=self.pool(F.relu(self.conv2(x)))
        x=x.view(-1,16*5*5)
        x=F.relu(self.fc1(x))
        x=F.relu(self.fc2(x))
        x=self.fc3(x)
        return x
 
def main():
    # Initialize model
    model = TheModelClass()
 
    #Initialize optimizer
    optimizer=optim.SGD(model.parameters(),lr=0.001,momentum=0.9)
 
    #print model's state_dict
    print('Model.state_dict:')
    for param_tensor in model.state_dict():
        # print each key together with the size of its tensor
        print(param_tensor,'\t',model.state_dict()[param_tensor].size())
 
    #print optimizer's state_dict
    print('Optimizer,s state_dict:')
    for var_name in optimizer.state_dict():
        print(var_name,'\t',optimizer.state_dict()[var_name])
 
 
 
if __name__=='__main__':
    main()

Output:

Model.state_dict:
conv1.weight 	 torch.Size([6, 3, 5, 5])
conv1.bias 	 torch.Size([6])
conv2.weight 	 torch.Size([16, 6, 5, 5])
conv2.bias 	 torch.Size([16])
fc1.weight 	 torch.Size([120, 400])
fc1.bias 	 torch.Size([120])
fc2.weight 	 torch.Size([84, 120])
fc2.bias 	 torch.Size([84])
fc3.weight 	 torch.Size([10, 84])
fc3.bias 	 torch.Size([10])
Optimizer,s state_dict:
state 	 {}
param_groups 	 [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [367949288, 367949432, 376459056, 381121808, 381121952, 381122024, 381121880, 381122168, 381122096, 381122312]}]

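add_module:

add_module() makes it quick to swap out a specific part of a network without changing much code.

add_module adds a child module to a Module under the given name. Usage:

add_module(name, module)

Here name is the name of the child module, which can then be used to access that child, and module is the child module being added.

Example: adding modules to a Sequential with add_module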

import torch
from torchsummary import summary  # assumed source of summary(): the third-party torchsummary package

class Net3(torch.nn.Module):
    def __init__(self):
        super(Net3, self).__init__()
        self.conv = torch.nn.Sequential()
        self.conv.add_module("conv1", torch.nn.Conv2d(3, 32, 3, 1, 1))  # no downsampling: kernel = 3, stride = 1, padding = 1
        self.conv.add_module("relu1", torch.nn.ReLU())
        self.conv.add_module("pool1", torch.nn.MaxPool2d(2))  # downsampling: feature size is halved
        self.dense = torch.nn.Sequential()
        self.dense.add_module("dense1", torch.nn.Linear(32 * 64 * 64, 128))  # input is c x h x w and must match the output above
        self.dense.add_module("relu2", torch.nn.ReLU())
        self.dense.add_module("dense2", torch.nn.Linear(128, 10))

    def forward(self, x):
        conv_out = self.conv(x)
        res = conv_out.view(conv_out.size(0), -1)  # flatten for the linear layers
        out = self.dense(res)
        return out

model3 = Net3()
print(model3)

print(model3)
model3.to('cuda')  # requires a CUDA device
summary(model3, (3, 128, 128))

Output:

Net3(
  (conv): Sequential(
    (conv1): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu1): ReLU()
    (pool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (dense): Sequential(
    (dense1): Linear(in_features=131072, out_features=128, bias=True)
    (relu2): ReLU()
    (dense2): Linear(in_features=128, out_features=10, bias=True)
  )
)
Net3(
  (conv): Sequential(
    (conv1): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu1): ReLU()
    (pool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (dense): Sequential(
    (dense1): Linear(in_features=131072, out_features=128, bias=True)
    (relu2): ReLU()
    (dense2): Linear(in_features=128, out_features=10, bias=True)
  )
)
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 32, 128, 128]             896
              ReLU-2         [-1, 32, 128, 128]               0
         MaxPool2d-3           [-1, 32, 64, 64]               0
            Linear-4                  [-1, 128]      16,777,344
              ReLU-5                  [-1, 128]               0
            Linear-6                   [-1, 10]           1,290
================================================================
Total params: 16,779,530
Trainable params: 16,779,530
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.19
Forward/backward pass size (MB): 9.00
Params size (MB): 64.01
Estimated Total Size (MB): 73.20
----------------------------------------------------------------

Parameter initialization

nn.init.calculate_gain

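Returns the recommended gain value for the given nonlinearity.

Parameters:

nonlinearity - the nonlinear function (nn.functional name)
param - optional parameter for the nonlinear function

>>> gain = nn.init.calculate_gain('leaky_relu', 0.2) # leaky_relu with negative_slope=0.2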

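nn.init.uniform_

nn.init.uniform_(tensor, a=0, b=1)

>>> w = torch.empty(3, 5)
>>> nn.init.uniform_(w)

Fills the input tensor with values drawn from the uniform distribution $U(a,b)$.

Parameters:

tensor - an n-dimensional torch.Tensor
a - lower bound of the uniform distribution
b - upper bound of the uniform distribution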

nn.init.normal_

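Fills the input tensor with values drawn from the normal distribution $\mathcal{N}(\text{mean}, \text{std}^2)$ with the given mean and standard deviation.

nn.init.normal_(tensor, mean=0, std=1)

>>> w = torch.empty(3, 5)
>>> nn.init.normal_(w)

Parameters:

tensor - an n-dimensional torch.Tensor
mean - mean of the normal distribution
std - standard deviation of the normal distribution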

nn.init.constant_

Fills the input tensor with the value val.

nn.init.constant_(tensor, val)

>>> w = torch.empty(3, 5)
>>> nn.init.constant_(w, 0.3)

Parameters:

tensor - an n-dimensional torch.Tensor
val - the value to fill the tensor with

nn.init.eye_

nn.init.eye_(tensor)

>>> w = torch.empty(3, 5)
>>> nn.init.eye_(w)

Fills a 2-dimensional input tensor with the identity matrix. In linear layers this preserves as much of the input as possible.

Parameters:

tensor - a 2-dimensional torch.Tensor

nn.init.dirac_

nn.init.dirac_(tensor)

>>> w = torch.empty(3, 16, 5, 5)
>>> nn.init.dirac_(w)

Fills a {3, 4, 5}-dimensional input tensor with the Dirac $\delta$ function. In convolutional layers this preserves as many of the input channels as possible.

Parameters:

tensor - a {3, 4, 5}-dimensional torch.Tensor

nn.init.xavier_uniform_

nn.init.xavier_uniform_(tensor, gain=1)

>>> w = torch.empty(3, 5)
>>> nn.init.xavier_uniform_(w, gain=nn.init.calculate_gain('relu'))

Fills the input tensor with values drawn from a uniform distribution. The resulting values are sampled from $U(-b, b)$ with $b = \text{gain} \times \sqrt{\frac{6}{n_\mathrm{in} + n_\mathrm{out}}}$.

Reference: Glorot, X. and Bengio, Y., "Understanding the difficulty of training deep feedforward neural networks"

Parameters:

tensor - an n-dimensional torch.Tensor
gain - optional scaling factor
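
The bound of the Xavier uniform distribution can be checked against an initialized tensor. This is a minimal sketch; the tensor shape and the choice of gain are illustrative. For a 2-D weight of shape (3, 5), PyTorch treats the second dimension as fan-in (5) and the first as fan-out (3).

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
w = torch.empty(3, 5)                  # fan_out = 3, fan_in = 5
gain = nn.init.calculate_gain('relu')  # recommended gain for ReLU, sqrt(2)
nn.init.xavier_uniform_(w, gain=gain)

bound = gain * math.sqrt(6.0 / (5 + 3))
print(w.abs().max().item(), bound)     # every entry lies within [-bound, bound]
```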

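nn.init.xavier_normal_

nn.init.xavier_normal_(tensor, gain=1)

>>> w = torch.empty(3, 5)
>>> nn.init.xavier_normal_(w)

Fills the input tensor with values drawn from a normal distribution. The resulting values are sampled from $W_{ij} \sim \mathcal{N}(0, \text{std}^2)$ with $\text{std} = \text{gain} \times \sqrt{\frac{2}{n_\mathrm{in} + n_\mathrm{out}}}$.

Reference: Glorot, X. and Bengio, Y., "Understanding the difficulty of training deep feedforward neural networks"

Parameters:

tensor - an n-dimensional torch.Tensor
gain - optional scaling factor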

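nn.init.kaiming_uniform_

nn.init.kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu')

>>> w = torch.empty(3, 5)
>>> nn.init.kaiming_uniform_(w, mode='fan_in')

Fills the input tensor with values drawn from a uniform distribution. With mode='fan_in', the resulting values are sampled from $U(-b, b)$ with $b = \sqrt{\frac{6}{(1+a^2)\, n_\mathrm{in}}}$, which reduces to $\sqrt{\frac{6}{n_\mathrm{in}}}$ for the default a=0.

Reference: He, K. et al., "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification"

Parameters:

tensor - an n-dimensional torch.Tensor
a - negative slope of the rectifier used after this layer (0 for ReLU, which is the default)
mode - either "fan_in" (default) or "fan_out"

"fan_in" preserves the magnitude of the variance of the weights in the forward pass
"fan_out" preserves the magnitudes in the backward pass

nonlinearity - the nonlinear function; recommended for use only with "relu" or "leaky_relu" (default).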
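
The role of the slope a in the Kaiming bound can be checked the same way. This is a minimal sketch with an illustrative slope of 0.2 and a (3, 5) weight, so fan_in = 5.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
a = 0.2                # negative slope of the leaky ReLU that follows the layer
w = torch.empty(3, 5)  # fan_in = 5 for a 2-D weight of shape (3, 5)
nn.init.kaiming_uniform_(w, a=a, mode='fan_in', nonlinearity='leaky_relu')

# gain = sqrt(2 / (1 + a^2)), std = gain / sqrt(fan_in), bound = sqrt(3) * std
bound = math.sqrt(6.0 / ((1.0 + a ** 2) * 5))
print(w.abs().max().item(), bound)
```

A larger slope a lowers the bound slightly, since a leaky ReLU lets more of the signal through than a plain ReLU does.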

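nn.init.kaiming_normal_

nn.init.kaiming_normal_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu')

>>> w = torch.empty(3, 5)
>>> nn.init.kaiming_normal_(w, mode='fan_out')

Fills the input tensor with values drawn from a normal distribution. With mode='fan_in', the resulting values are sampled from $W_{ij} \sim \mathcal{N}(0, \text{std}^2)$ with $\text{std} = \sqrt{\frac{2}{(1+a^2)\, n_\mathrm{in}}}$.

Reference: He, K. et al., "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification"

Parameters:

tensor - an n-dimensional torch.Tensor
a - negative slope of the rectifier used after this layer (0 for ReLU, which is the default)
mode - either "fan_in" (default) or "fan_out"

"fan_in" preserves the magnitude of the variance of the weights in the forward pass
"fan_out" preserves the magnitudes in the backward pass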

nn.init.orthogonal_

nn.init.orthogonal_(tensor, gain=1)

>>> w = torch.empty(3, 5)
>>> nn.init.orthogonal_(w)

Fills the input tensor with a (semi-)orthogonal matrix. The input tensor must have at least 2 dimensions; for tensors with more than 2 dimensions, the trailing dimensions are flattened.

Reference: Saxe, A. et al. (2013), "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks"

Parameters:

tensor - an n-dimensional torch.Tensor, where n >= 2
gain - optional scaling factor

nn.init.sparse_

nn.init.sparse_(tensor, sparsity, std=0.01)

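>>> w = torch.empty(3, 5)
>>> nn.init.sparse_(w, sparsity=0.1)

Fills the 2-dimensional input tensor as a sparse matrix, where the non-zero elements are drawn from a normal distribution with mean 0 and standard deviation std.

Reference: Martens, J. (2010), "Deep learning via Hessian-free optimization"

Parameters:

tensor - a 2-dimensional torch.Tensor
sparsity - the fraction of elements in each column to be set to zero
std - standard deviation of the normal distribution used to generate the non-zero values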

nn.init.ones_

w = torch.empty(3, 5)
nn.init.ones_(w)

nn.init.zeros_

w = torch.empty(3, 5)
nn.init.zeros_(w)

nn.init.trunc_normal_

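Fills the input tensor with values drawn from a truncated normal distribution. The values are effectively drawn from the normal distribution $N(\text{mean}, \text{std}^2)$, with values outside [a, b] redrawn until they fall within the bounds. The method used to generate the random values works best when $a \leq \text{mean} \leq b$.

Parameters:

tensor: [Tensor] an n-dimensional torch.Tensor
mean: [float] mean of the normal distribution
std: [float] standard deviation of the normal distribution
a: [float] minimum cutoff value
b: [float] maximum cutoff value

torch.nn.init.trunc_normal_(tensor, mean=0.0, std=1.0, a=-2.0, b=2.0)

w = torch.empty(3, 5)
nn.init.trunc_normal_(w)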

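Parameter tying

import torch
from torch import nn

X = torch.rand(size=(2, 4))
# Give the shared layer a name so that its parameters can be referenced
shared = nn.Linear(8, 8)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.Linear(8, 1))
net(X)
# Check that the parameters are equal
print(net[2].weight.data[0] == net[4].weight.data[0])
net[2].weight.data[0, 0] = 100
# Make sure they are actually the same object, not just equal in value
print(net[2].weight.data[0] == net[4].weight.data[0])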
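
Two consequences of tying are worth checking explicitly: the shared layer is counted only once among the model's parameters, and gradients from both positions accumulate into the same tensor. This is a minimal sketch that rebuilds the same network; the variable names are illustrative.

```python
import torch
from torch import nn

shared = nn.Linear(8, 8)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.Linear(8, 1))

# identity check: both positions hold the very same Parameter object
same_object = net[2].weight is net[4].weight

# parameters() skips duplicates, so the shared layer is counted once:
# 2 tensors (first layer) + 2 (shared) + 2 (last layer)
num_param_tensors = len(list(net.parameters()))

# one backward pass: contributions from both uses add up in shared.weight.grad
net(torch.rand(2, 4)).sum().backward()
print(same_object, num_param_tensors, shared.weight.grad.shape)
```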