Algorithm Frameworks - Deep Learning Frameworks - Model Optimization Methods: Theory + Engineering Practice

NoiseFloor · Posted on 2022-4-17 15:40

1 Label smoothing loss (labelsmooth)

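The paper "When Does Label Smoothing Help?" [1] studies how label smoothing affects the final activation layer of a deep neural network, and analyzes why it works, when to use it, and when it is not suitable.

Label smoothing is a modification of the loss function.
It changes the training target from "1" to "1 - label smoothing adjustment", so the network is trained to be less confident in its answers. The default value is usually 0.1, which means the target for the correct class is 0.9 (1 - 0.1) instead of 1.
In knowledge distillation, do not train the teacher network with label smoothing: it can improve the teacher's accuracy, but the student then does not learn enough from it.
The formula is as follows, with C classes, smoothing factor \epsilon, true class y and predicted probabilities p_i. The implementation below spreads \epsilon uniformly over all C classes, so the correct class actually receives 1 - \epsilon + \epsilon / C:

$$q_i = (1 - \epsilon)\,\mathbb{1}[i = y] + \frac{\epsilon}{C}$$
$$\text{loss} = -\sum_i q_i \log p_i = (1 - \epsilon)\,(-\log p_y) + \frac{\epsilon}{C} \sum_i (-\log p_i)$$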
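A small framework-free sketch of the idea, assuming the smoothing mass eps is spread uniformly over all classes (the same convention as the PyTorch module in this section). The logits and the helper names `smoothed_targets`, `log_softmax` and `label_smoothing_loss` are made up for illustration.

```python
import math

def smoothed_targets(num_classes, true_class, eps=0.1):
    """Spread eps uniformly over all classes and add 1 - eps to the true class."""
    q = [eps / num_classes] * num_classes
    q[true_class] += 1.0 - eps
    return q

def log_softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def label_smoothing_loss(logits, true_class, eps=0.1):
    """Cross-entropy between the smoothed target and the predicted distribution."""
    log_p = log_softmax(logits)
    q = smoothed_targets(len(logits), true_class, eps)
    return -sum(qi * lpi for qi, lpi in zip(q, log_p))

logits = [2.0, 0.5, -1.0, 0.0]          # hypothetical scores for 4 classes
q = smoothed_targets(4, true_class=0)   # roughly [0.925, 0.025, 0.025, 0.025]
plain = label_smoothing_loss(logits, 0, eps=0.0)    # ordinary cross-entropy
smooth = label_smoothing_loss(logits, 0, eps=0.1)   # smoothed version
```

When the model is already confident in the correct class, the smoothed loss is larger than the plain one, which is exactly the pressure against over-confidence described above.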

code:

import torch
import torch.nn as nn
import torch.nn.functional as F


class LabelSmoothingCrossEntropy(nn.Module):
    def __init__(self, eps=0.1, reduction='mean', ignore_index=-100):
        super(LabelSmoothingCrossEntropy, self).__init__()
        self.eps = eps
        self.reduction = reduction
        self.ignore_index = ignore_index

    def forward(self, output, target):
        c = output.size()[-1]  # number of classes
        log_pred = torch.log_softmax(output, dim=-1)
        # Uniform part: sum of -log p_i over all classes
        if self.reduction == 'sum':
            loss = -log_pred.sum()
        else:
            loss = -log_pred.sum(dim=-1)
            if self.reduction == 'mean':
                loss = loss.mean()
        # eps / c weights the uniform part, (1 - eps) weights the usual NLL on the true class
        return loss * self.eps / c + (1 - self.eps) * F.nll_loss(
            log_pred, target, reduction=self.reduction, ignore_index=self.ignore_index)

2 Temperature parameter

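From the perspective of contrastive loss [2]:
- Contrastive loss is a loss function that discovers hard negative samples on its own
- The temperature controls how much attention hard samples receive: the smaller the temperature, the more the loss focuses on separating a sample from the other samples most similar to it

Expressed as a simple formula, for logits z_i and temperature t:

$$p_i = \frac{\exp(z_i / t)}{\sum_j \exp(z_j / t)}$$

When t is large, the predicted distribution is smoother and the loss is larger, which can help avoid getting stuck in a local optimum. As training proceeds we decrease t, also called cooling, similar to simulated annealing.
code
The code is simple: add a hyperparameter t inside the softmax.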
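A minimal framework-free sketch of a softmax with a temperature hyperparameter t. The function name `softmax_with_temperature` and the logits are illustrative assumptions; in PyTorch the same effect is usually obtained by dividing the logits by t before calling softmax.

```python
import math

def softmax_with_temperature(logits, t=1.0):
    """Softmax of logits / t; t must be positive."""
    if t <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / t for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                              # hypothetical scores
sharp = softmax_with_temperature(logits, t=0.5)       # small t: peaked distribution
base = softmax_with_temperature(logits, t=1.0)        # t = 1: ordinary softmax
smooth = softmax_with_temperature(logits, t=5.0)      # large t: close to uniform
```

A small t concentrates probability on the largest logit, while a large t flattens the distribution.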
3 Learning rate warmup

At the start of training the model weights are randomly initialized, so a large learning rate can make training unstable (oscillation). With warmup, the learning rate is kept small for the first few epochs or steps so that the model can stabilize gradually. Once it is relatively stable, training continues with the preset learning rate, which makes convergence faster and the final result better.

transformers source code

transformers.get_linear_schedule_with_warmup
Parameters

optimizer (Optimizer): The optimizer for which to schedule the learning rate.
num_warmup_steps (int): The number of steps for the warmup phase, i.e. the number of ramp-up steps you choose.
num_training_steps (int): The total number of training steps.
last_epoch (int, optional, defaults to -1): The index of the last epoch when resuming training.

Create a schedule with a learning rate that decreases linearly from the initial lr set in the optimizer to 0, after a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer.
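To make the shape of this schedule concrete, here is a framework-free sketch that evaluates the same piecewise-linear multiplier for every step. The numbers (`base_lr`, `total_steps`, `warmup_steps`) and the function name `linear_warmup_decay` are illustrative assumptions, not values from the post.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier in [0, 1] for a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-3        # hypothetical peak learning rate set in the optimizer
total_steps = 100     # hypothetical number of optimizer steps
warmup_steps = 10     # e.g. a warmup ratio of 0.1

# Learning rate actually used at each step: rises for 10 steps, then decays to 0
schedule = [base_lr * linear_warmup_decay(s, warmup_steps, total_steps)
            for s in range(total_steps + 1)]
```

The peak is reached exactly at the end of warmup, and the rate returns to zero at the last training step.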

# Usage
from transformers import get_linear_schedule_with_warmup
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=int(warm_up_ratio * total_steps), num_training_steps=total_steps)

# Library source
from torch.optim.lr_scheduler import LambdaLR

def get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, last_epoch=-1):
    """
    Create a schedule with a learning rate that decreases linearly from the initial lr set in the optimizer to 0, after
    a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer.

    Args:
        optimizer ([`~torch.optim.Optimizer`]):
            The optimizer for which to schedule the learning rate.
        num_warmup_steps (`int`):
            The number of steps for the warmup phase.
        num_training_steps (`int`):
            The total number of training steps.
        last_epoch (`int`, *optional*, defaults to -1):
            The index of the last epoch when resuming training.

    Return:
        `torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
    """

    def lr_lambda(current_step: int):
        # Warmup: the multiplier rises linearly from 0 to 1
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1, num_warmup_steps))
        # Afterwards: the multiplier falls linearly from 1 to 0
        return max(
            0.0, float(num_training_steps - current_step) / float(max(1, num_training_steps - num_warmup_steps))
        )

    return LambdaLR(optimizer, lr_lambda, last_epoch)

4 Focal loss

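See reference [4] for an accessible explanation.

In practice, negative samples often come from negative sampling. Heavy negative sampling makes negatives far outnumber positives, so the training set is imbalanced. In addition, negatives obtained by soft negative sampling are usually very weak, and the model already predicts them with high confidence. Adding focal loss lets the model concentrate on hard, low-confidence samples and improves training.

In cross-entropy loss, p_t reflects how well the model recognizes a sample. In focal loss, the larger p_t is, the more that sample's contribution to the loss is suppressed.
The parameter α is related to the class frequencies. With α = 0.5 (binary case) both classes get the same weight, so it has no rebalancing effect.
The parameter γ is the difficulty weight: the larger it is, the stronger the suppression and the smaller the loss. With γ = 0, focal loss reduces to cross-entropy.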

code
class FocalLoss(nn.Module):
    """Multi-class Focal loss implementation"""
    def __init__(self, gamma=2, weight=None, reduction='mean', ignore_index=-100):
        super(FocalLoss, self).__init__()
        self.gamma = gamma
        self.weight = weight
        self.ignore_index = ignore_index
        self.reduction = reduction

    def forward(self, input, target):
        """
        input: [N, C]
        target: [N, ]
        """
        # log_softmax is log(softmax(x)); dim=1 normalizes over the classes of each row
        log_pt = torch.log_softmax(input, dim=1)
        pt = torch.exp(log_pt)
        # Down-weight well-classified samples by (1 - pt) ** gamma
        log_pt = (1 - pt) ** self.gamma * log_pt
        loss = torch.nn.functional.nll_loss(log_pt, target, self.weight, reduction=self.reduction, ignore_index=self.ignore_index)
        return loss

# Usage
if loss_type == 'ce':
    self.criterion = nn.CrossEntropyLoss(reduction=reduction)
elif loss_type == 'ls_ce':
    self.criterion = LabelSmoothingCrossEntropy(reduction=reduction)
else:
    self.criterion = FocalLoss(reduction=reduction)

5 Adversarial loss: PGD and FGM

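Adversarial examples: samples that "look" almost identical to a human but for which the model's prediction is completely different. What makes a good adversarial example? It generally has two properties:
- the perturbation added to the original input is small
- it makes the model err
The idea behind computing the perturbation: move the input one more step in the direction in which the loss increases. The resulting adversarial example causes a larger loss and raises the model's error rate.
What adversarial training is for:
- improving the model's robustness against malicious adversarial examples
- acting as a regularizer that reduces overfitting and improves generalization
Before studying the FGM code, first understand what optimizer.zero_grad(), loss.backward() and optimizer.step() do and how they work; see [13].
FGM code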

import torch
import torch.nn as nn


# FGM
class FGM:
    def __init__(self, model: nn.Module, eps=1.):
        # Unwrap DataParallel / DistributedDataParallel if needed
        self.model = (
            model.module if hasattr(model, "module") else model
        )
        self.eps = eps
        self.backup = {}

    # only attack word embedding
    def attack(self, emb_name='word_embeddings'):
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    r_at = self.eps * param.grad / norm
                    param.data.add_(r_at)  # perturb the parameter: x + r_at

    def restore(self, emb_name='word_embeddings'):
        for name, para in self.model.named_parameters():
            if para.requires_grad and emb_name in name:
                assert name in self.backup
                para.data = self.backup[name]
        self.backup = {}


# Initialization
fgm = FGM(model)
for batch_input, batch_label in data:
    # Normal training
    loss = model(batch_input, batch_label)
    loss.backward()  # backpropagate to get the normal gradients
    # Adversarial training
    fgm.attack()  # the embedding is now perturbed
    # optimizer.zero_grad()  # uncomment this if you do not want to accumulate gradients
    loss_sum = model(batch_input, batch_label)
    loss_sum.backward()  # backpropagate; adversarial gradients are added to the normal ones
    fgm.restore()  # restore the embedding parameters
    # Gradient descent: update the parameters
    optimizer.step()
    optimizer.zero_grad()

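FGM performs gradient ascent in a single step. PGD takes small steps, several of them; if the perturbation leaves the ball of radius ε, it is projected back onto the "sphere" so that the perturbation does not grow too large. Because PGD's theory and code are more involved, pseudocode is given first to aid understanding, followed by the code.

For each x:
  1. Compute the forward loss of x, backpropagate to get the gradients, and back them up
  For each step t:
    2. Compute r from the gradient of the embedding matrix and add it to the current embedding, i.e. x + r (project back into the epsilon ball if out of range)
    3. If t is not the last step: zero the gradients, then run forward and backward on the x + r from (2) to get new gradients
    4. If t is the last step: restore the gradients from (1), run forward and backward on the final x + r, and accumulate its gradients onto those of (1)
  5. Restore the embedding to its value at (1)
  6. Update the parameters with the gradients from (4)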
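The step-and-project part of these steps can be sketched without a framework. This is only an illustration under simplifying assumptions: the "embedding" is a plain list of floats, and the toy loss L(x) = 0.5 * ||x||^2 (whose gradient is x itself) stands in for a real model. The names `project` and `pgd` are hypothetical.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def project(x_adv, x_orig, eps):
    """Project x_adv back into the L2 ball of radius eps around x_orig."""
    r = [a - o for a, o in zip(x_adv, x_orig)]
    n = l2_norm(r)
    if n > eps:
        r = [eps * ri / n for ri in r]
    return [o + ri for o, ri in zip(x_orig, r)]

def pgd(x, grad_fn, eps=1.0, alpha=0.3, steps=3):
    """Take `steps` normalized gradient-ascent steps of size alpha, projecting each time."""
    x_adv = list(x)
    for _ in range(steps):
        g = grad_fn(x_adv)
        n = l2_norm(g)
        if n == 0 or math.isnan(n):
            break
        x_adv = [xi + alpha * gi / n for xi, gi in zip(x_adv, g)]
        x_adv = project(x_adv, x, eps)
    return x_adv

grad_fn = lambda v: list(v)   # gradient of the toy loss 0.5 * ||x||^2 (assumption)
x0 = [3.0, 4.0]
fgm_like = pgd(x0, grad_fn, eps=1.0, alpha=1.0, steps=1)   # one full-size step, as in FGM
pgd_out = pgd(x0, grad_fn, eps=1.0, alpha=0.3, steps=5)    # 5 * 0.3 > eps, so the result is clipped
```

Both calls end on the boundary of the ε-ball here because the toy gradient always points in the same direction; with a real model the multi-step path can bend, which is the point of taking several small steps.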

# PGD
class PGD:
    def __init__(self, model, eps=1., alpha=0.3):
        self.model = (
            model.module if hasattr(model, "module") else model
        )
        self.eps = eps      # radius of the perturbation ball
        self.alpha = alpha  # step size of each attack step
        self.emb_backup = {}
        self.grad_backup = {}

    def attack(self, emb_name='word_embeddings', is_first_attack=False):
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                if is_first_attack:
                    self.emb_backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    r_at = self.alpha * param.grad / norm
                    param.data.add_(r_at)
                    param.data = self.project(name, param.data)

    def restore(self, emb_name='word_embeddings'):
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                assert name in self.emb_backup
                param.data = self.emb_backup[name]
        self.emb_backup = {}

    def project(self, param_name, param_data):
        # Keep the total perturbation inside the ball of radius eps
        r = param_data - self.emb_backup[param_name]
        if torch.norm(r) > self.eps:
            r = self.eps * r / torch.norm(r)
        return self.emb_backup[param_name] + r

    def backup_grad(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and param.grad is not None:
                self.grad_backup[name] = param.grad.clone()

    def restore_grad(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and param.grad is not None:
                param.grad = self.grad_backup[name]


pgd = PGD(model)
K = 3
for batch_input, batch_label in data:
    # Normal training
    loss = model(batch_input, batch_label)
    loss.backward()  # backpropagate to get the normal gradients
    pgd.backup_grad()  # save the normal gradients
    # Adversarial training
    for t in range(K):
        pgd.attack(is_first_attack=(t == 0))  # perturb the embedding; param.data is backed up on the first attack
        if t != K - 1:
            optimizer.zero_grad()
        else:
            pgd.restore_grad()  # restore the normal gradients
        loss_sum = model(batch_input, batch_label)
        loss_sum.backward()  # backpropagate; adversarial gradients are added to the normal ones
    pgd.restore()  # restore the embedding parameters
    # Gradient descent: update the parameters
    optimizer.step()
    optimizer.zero_grad()

Why adversarial training works:
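Motivation: samples that "look" almost identical to a human but for which the model's prediction is completely different.
A perturbation added to the word embeddings is likely to turn an original "good" into "bad" and cause misclassification, so the computed adversarial loss is large. The part that computes the adversarial loss does not take part in the gradient computation. In other words, changes to the weights and biases of the model (the LSTM and the final dense layer) do not affect the adversarial loss, and the model can only reduce it by changing the word embedding weights. As the paper puts it:
Adversarial training ensures that the meaning of a sentence cannot be inverted via a small change, so these words with similar grammatical role but different meaning become separated.
Words with different meanings but similar grammatical roles are pulled apart by adversarial training, which improves the quality of the word embeddings and helps the model perform very well.
6 Scheduled sampling

About scheduled sampling: when training generative models, teacher forcing is commonly used to help the model converge faster. But using teacher forcing all the time makes the input distribution differ between training and inference, which is the well-known "Exposure bias".
Solution: a good strategy against "Exposure bias" is "Scheduled sampling": during training, mix the ground truth and the decoder's own prediction as the decoder input for the next time step. Concretely, at each time step teacher forcing is applied with probability p and skipped with probability (1 - p), and p decays with the batch or epoch count.
Why it works: early in training the model knows very little, so teacher forcing with the ground truth speeds up learning and convergence. Late in training the model has already captured many features of the data and its distribution, so the decoder input is replaced by the decoder's prediction, matching the inference stage and addressing "Exposure bias".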
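The per-time-step mixing described above can be sketched without a framework. This is a simplified illustration: the tokens are made up, the model predictions are given as a fixed list instead of being generated step by step, and the linear decay in `teacher_forcing_prob` is just one possible schedule.

```python
import random

def teacher_forcing_prob(epoch, num_epochs):
    """Linearly decay p from 1.0 in the first epoch to 0.0 in the last one."""
    if num_epochs <= 1:
        return 1.0
    return 1.0 - epoch / (num_epochs - 1)

def decoder_inputs(targets, predictions, p, rng):
    """Choose the decoder input for each step after the start token.

    With probability p the ground-truth token is used (teacher forcing),
    otherwise the model's own previous prediction is fed back.
    """
    inputs = []
    for truth, pred in zip(targets, predictions):
        inputs.append(truth if rng.random() < p else pred)
    return inputs

rng = random.Random(0)
targets = ["the", "cat", "sat"]        # hypothetical ground-truth tokens
predictions = ["a", "dog", "sat"]      # hypothetical model outputs
first_epoch = decoder_inputs(targets, predictions, teacher_forcing_prob(0, 10), rng)
last_epoch = decoder_inputs(targets, predictions, teacher_forcing_prob(9, 10), rng)
```

In the first epoch p is 1, so every input is the ground truth; in the last epoch p is 0, so the decoder only sees its own predictions, as at inference time.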
code

import random


# Compute, as training progresses, the probability of using teacher forcing
class ScheduledSampler():
    def __init__(self, phases):
        self.phases = phases
        # Precompute from the hyperparameter phases the threshold used in each epoch
        self.scheduled_probs = [i / (self.phases - 1) for i in range(self.phases)]

    def teacher_forcing(self, phase):
        # Draw a random number
        sampling_prob = random.random()
        # Compare it with the threshold of this epoch to decide whether to use teacher forcing
        return sampling_prob >= self.scheduled_probs[phase]


# During training, decide per epoch whether to use teacher forcing
scheduled_sampler = ScheduledSampler(10)
for epoch in range(10):
    teacher_forcing = scheduled_sampler.teacher_forcing(epoch)
    # In the forward pass of the model:
    # without teacher forcing, only the first decoder input is the first ground-truth token;
    # with teacher forcing, every step is fed the ground-truth token

7 Weight tying

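About weight tying: this strategy targets the same "Exposure bias" problem raised above. Besides dropping teacher forcing late in training, we can also correct the bias by making the encoder and decoder word embeddings as consistent as possible.
Solution: another good strategy against "Exposure bias" is "Weight tying", literally binding the weights: the encoder and decoder share one embedding weight matrix.
Why it works: sharing the embedding weight matrix makes the input word vectors of the encoder and decoder identical, which can alleviate "Exposure bias" to some extent.

As long as the pretrained word-vector dimension equals the hidden-layer dimension, a single weight matrix can be shared. See the reference "How does weight-tying work in a RNN?" [7], from which the figure referred to here is taken.

The upper group of braced equations:

U is the input weight matrix; it turns the one-hot vector x into the dense vector e, and the subscript t denotes the t-th token. U has shape C×D, where C is the vocabulary size and D is the word-vector dimension
h is the hidden state vector
s is the score vector (or the output probabilities), with shape C×1; V is the output weight matrix, C×D; P_dh is a projection matrix, D×H
the softmax layer

The lower group of braced equations:

Same as above, except that in the third equation, for the output probabilities, the output weight matrix becomes V, which has exactly the same shape as the input matrix and is used to share the input and output weights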
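A framework-free sketch of the sharing idea, assuming one C×D matrix is used both for the input lookup and for the output scores. The sizes, the random values and the helper names `embed` and `output_scores` are made up, and the projection P_dh is left out by taking the hidden size equal to D.

```python
import random

random.seed(0)
C, D = 5, 3   # hypothetical vocabulary size and embedding dimension

# One shared matrix U of shape C x D, used on both the input and the output side
U = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(C)]

def embed(token_id):
    """Input side: row lookup, equivalent to U^T x for a one-hot x."""
    return U[token_id]

def output_scores(h):
    """Output side: s = U h, one score per vocabulary entry (h has size D)."""
    return [sum(u * hi for u, hi in zip(row, h)) for row in U]

h = embed(2)                  # pretend the hidden state equals the embedding
scores = output_scores(h)     # C scores computed with the same matrix

untied_params = 2 * C * D     # separate input and output matrices
tied_params = C * D           # one shared matrix
```

Besides halving the number of embedding parameters, tying means every gradient step on the output layer also moves the input embeddings.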

code:
# Use the same weight matrix for encoding and decoding
weight_tying = True  # flag assumed by this snippet


class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, rnn_drop=0):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)

    def forward(self, x, decoder_embedding):
        if weight_tying:
            embedded = decoder_embedding(x)  # use the decoder embedding when encoding


class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, enc_hidden_size=None):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size  # embed_size and hidden_size are equal here

    def forward(self, x):
        # Fc1_out is produced by the earlier decoder layers, which the post does not show
        if weight_tying:
            Fc2_out = torch.mm(Fc1_out, torch.t(self.embedding.weight))
            # Fc1_out.size=[seq,hidden_size]
            # embedding.weight.size=[C,embed_size]
        p_vocab = F.softmax(Fc2_out, dim=1)


class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.encoder = Encoder(len(v), config.embed_size, config.hidden_size)
        self.decoder = Decoder(len(v), config.embed_size, config.hidden_size)

    def forward(self, x):
        encoder_output, encoder_states = self.encoder(x, self.decoder.embedding)

8 Different learning rates for different layers

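Simply split the original parameters into two or more parameter groups and set a learning rate for each. Here the original parameters are "split" into the fc3 layer's parameters and all the rest, and fc3 gets a larger learning rate.
For the optimizer's parameters, see [13].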

ignored_params = list(map(id, net.fc3.parameters()))  # ids (memory addresses) of the fc3 parameters
base_params = filter(lambda p: id(p) not in ignored_params, net.parameters())
optimizer = optim.SGD([
    {'params': base_params},
    {'params': net.fc3.parameters(), 'lr': 0.001 * 10}],
    0.001, momentum=0.9, weight_decay=1e-4)

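The first two lines together
separate the fc3 parameters net.fc3.parameters() from the full set net.parameters().
base_params is everything left after removing the fc3 parameters; the optimizer then sets a separate learning rate for the fc3 parameters.
optimizer = optim.SGD(......) means:
the layers in base_params use lr 0.001, momentum=0.9, weight_decay=1e-4
the fc3 layer uses learning rate 0.001*10 (momentum and weight_decay are inherited from the defaults)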
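How per-group options combine with the defaults can be sketched without PyTorch. This is a simplified model of the behaviour, not the library's code: parameters are plain floats stored in a dict, momentum is ignored in the update, and `build_param_groups` is a hypothetical helper name.

```python
def build_param_groups(groups, **defaults):
    """Fill each group's missing options from the optimizer-wide defaults."""
    resolved = []
    for group in groups:
        merged = dict(defaults)
        merged.update(group)   # values given in the group override the defaults
        resolved.append(merged)
    return resolved

params = {"conv1.weight": 1.0, "fc3.weight": 1.0}   # hypothetical scalar "parameters"
grads = {"conv1.weight": 0.5, "fc3.weight": 0.5}    # hypothetical gradients

param_groups = build_param_groups(
    [{"params": ["conv1.weight"]},
     {"params": ["fc3.weight"], "lr": 0.001 * 10}],
    lr=0.001, momentum=0.9, weight_decay=1e-4,
)

for group in param_groups:
    for name in group["params"]:
        # Plain SGD step with the group's own learning rate (momentum omitted)
        params[name] -= group["lr"] * grads[name]
```

With equal gradients, the fc3 parameter moves ten times as far as the base parameter, while both groups still share the default momentum and weight decay settings.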

What if several layers need a different learning rate?

With multiple layers:
conv5_params = list(map(id, net.conv5.parameters()))
conv4_params = list(map(id, net.conv4.parameters()))
base_params = filter(lambda p: id(p) not in conv5_params + conv4_params,
                     net.parameters())
optimizer = torch.optim.SGD([
    {'params': base_params},
    {'params': net.conv5.parameters(), 'lr': lr * 100},
    {'params': net.conv4.parameters(), 'lr': lr * 100},
], lr=lr, momentum=0.9)

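model.named_parameters() returns an iterator over the names and parameters of the network layers; model.parameters() returns an iterator over all parameters of the network.

params = model.named_parameters()
for name, param in params:
    print(name, param.data.shape)
# Output: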
embedding.weight torch.Size([21154, 256])
bilstm.weight_ih_l0 torch.Size([1536, 256])
bilstm.bias_ih_l0 torch.Size([1536])
classifier.weight torch.Size([7, 768])
classifier.bias torch.Size([7])
crf.start_transitions torch.Size([7])
crf.end_transitions torch.Size([7])
crf.transitions torch.Size([7, 7])

This is how the parameter groups are usually set up when using BERT. An example:

if config.full_fine_tuning:
    # model.named_parameters(): [bert, classifier, crf]
    bert_optimizer = list(model.bert.named_parameters())
    classifier_optimizer = list(model.classifier.named_parameters())
    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in bert_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay': config.weight_decay},
        {'params': [p for n, p in bert_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay': 0.0},
        {'params': [p for n, p in classifier_optimizer if not any(nd in n for nd in no_decay)],
         'lr': config.learning_rate * 5, 'weight_decay': config.weight_decay},
        {'params': [p for n, p in classifier_optimizer if any(nd in n for nd in no_decay)],
         'lr': config.learning_rate * 5, 'weight_decay': 0.0},
        {'params': model.crf.parameters(), 'lr': config.learning_rate * 5}
    ]
else:
    # only fine-tune the head classifier
    param_optimizer = list(model.classifier.named_parameters())
    optimizer_grouped_parameters = [{'params': [p for n, p in param_optimizer]}]

9 Gradient accumulation

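Gradient accumulation works around insufficient GPU memory and has the effect of a larger batch size: gradients are accumulated over several batches before the weights are updated.
loss.backward() computes and stores the gradients, while optimizer.step() updates the weights. If loss.backward() is called twice before the optimizer is called, the gradients are summed.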
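Why summing gradients emulates a larger batch can be checked with a little arithmetic. The sketch below is framework-free and assumes a mean-reduced loss and equally sized micro-batches; the data, the one-parameter model y = w * x and the name `mse_grad` are made up for illustration.

```python
def mse_grad(w, batch):
    """Gradient of mean((w * x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]   # hypothetical (x, y) pairs

full_batch_grad = mse_grad(w, data)       # gradient of one batch of size 4

accumulation_steps = 2
micro_batches = [data[:2], data[2:]]      # two micro-batches of size 2
accumulated = 0.0
for batch in micro_batches:
    # Dividing by accumulation_steps plays the role of `loss / accumulation_steps`
    accumulated += mse_grad(w, batch) / accumulation_steps

unscaled = sum(mse_grad(w, batch) for batch in micro_batches)
```

With the division, the accumulated gradient equals the full-batch gradient. Without it, the sum is `accumulation_steps` times larger, which behaves like a proportionally larger learning rate; a training loop that accumulates without dividing the loss should account for that.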
Gradient accumulation in PyTorch looks like this:
model = model.train()
optimizer.zero_grad()
for index, batch in enumerate(train_loader):
    input = batch[0].to(device)
    correct_answer = batch[1].to(device)
    output = model(input)
    loss = criterion(output, correct_answer)
    loss.backward()  # gradients are added to the ones already stored in .grad
    if (index + 1) % 2 == 0:  # update the weights every 2 batches
        optimizer.step()
        optimizer.zero_grad()

10 Mixed precision

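About mixed precision:

While training a model whose numeric precision is FP32, some operators run in FP16 and the rest in FP32. Which operators use FP16 and which use FP32 is not the user's concern; amp arranges this automatically.
Without changing the model and without lowering training accuracy, it shortens training time and reduces memory use, so larger batch sizes, larger models and larger inputs can be trained.
Since version 1.6, PyTorch provides this natively through torch.cuda.amp.
Alternatively, call NVIDIA's API directly: from apex import amp (not yet tested by the author).
For the theory, see the reference links [10].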
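The reason a gradient scaler is needed can be shown with the standard library alone: `struct` can round a float to IEEE half precision. The gradient value below is a made-up example, and the scale 2**16 is assumed here as a typical initial loss scale.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8             # hypothetical tiny gradient value
scale = 2.0 ** 16       # assumed loss scale

lost = to_fp16(grad)              # too small for FP16: becomes 0.0
scaled = to_fp16(grad * scale)    # after scaling it is representable
recovered = scaled / scale        # unscale again in higher precision
```

Multiplying the loss by the scale before the backward pass shifts small gradients into the FP16 range, and dividing by the same factor before the optimizer step restores their magnitude.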

# The following is excerpted from the author's own test program and uses torch.cuda.amp
import torch
import logging
from torch.cuda.amp import autocast
if use_fp16:
    scaler = torch.cuda.amp.GradScaler()
    with autocast():
        loss = model(**batch_data)[0]
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    scaler.step(optimizer)
    scaler.update()

Basic usage of torch.cuda.amp
# amp relies on the Tensor Core architecture, so the model parameters must be CUDA tensors
from torch.cuda.amp import autocast, GradScaler

model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)
# The GradScaler object performs gradient scaling automatically
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        # Run the forward pass inside the autocast-enabled region
        with autocast():
            # Eligible ops are run in FP16 during the forward pass
            output = model(input)
            loss = loss_fn(output, target)
        # Scale the loss with the scaler, then backward to get scaled gradients
        scaler.scale(loss).backward()
        # scaler.step() first unscales the gradients automatically
        # and skips the update if they contain nan or inf
        scaler.step(optimizer)
        # Update the scale factor
        scaler.update()

Basic usage of apex
from apex import amp
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")  # the letter "O" followed by one, not "zero one"
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

11 Gradient clipping

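Gradient clipping avoids exploding gradients by constraining the gradients to a range. It can be done with torch.nn.utils.clip_grad_norm_. Insert it after loss.backward() has computed the gradients and before optimizer.step() updates the weights.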
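What clipping by norm does can be sketched in a few lines of plain Python. This is a simplified model of the idea (all gradients flattened into one list, no small epsilon in the denominator), and the function name `clip_grad_norm` here is a local helper, not the PyTorch function.

```python
import math

def clip_grad_norm(grads, max_norm):
    """Scale grads in place so that their global L2 norm is at most max_norm.

    Returns the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for i in range(len(grads)):
            grads[i] *= scale
    return total_norm

grads = [30.0, 40.0]    # hypothetical gradient with L2 norm 50
norm_before = clip_grad_norm(grads, max_norm=10.0)
norm_after = math.sqrt(sum(g * g for g in grads))
```

The direction of the gradient is preserved; only its length is reduced to max_norm. Gradients that are already small enough are left untouched.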
loss = Crossentropy(...)
optimizer.zero_grad()
loss.backward()
# Clip after backward() and before step()
torch.nn.utils.clip_grad_norm_(parameters=model.parameters(), max_norm=10, norm_type=2)
optimizer.step()

12 Distributed training

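Model parallelism and data parallelism
- Model parallelism: for a very large model, place different layers on different GPUs
- Data parallelism: the model fits on one GPU; different data goes to different GPUs, which jointly train one model
- Within data parallelism:
  - Synchronous vs. asynchronous updates
    - Synchronous updates have to wait, so the speed is bounded by the slowest GPU. Asynchronous updates do not wait, but bring more complicated problems such as stale gradients and a loss curve that jitters strongly. In practice, synchronous updates are generally used.
  - Parameter Server vs. Ring AllReduce (both are synchronous updates)
PS
- bucket effect (bounded by the slowest worker) and communication time
Ring AllReduce
- the communication volume per GPU is nearly independent of the number of GPUs
torch.distributed
- each process trains independently and only exchanges a small amount of data such as gradients
- each process has its own optimizer
- once every process has computed its gradients, they are aggregated and averaged across processes (an all-reduce in DistributedDataParallel), so every process applies the same update
dataparallel
- only one global optimizer; the gradients from all GPUs are summed, the main GPU performs the update, and the updated parameters are then copied to the other GPUs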
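The Ring AllReduce idea can be simulated in plain Python to see why the traffic per worker barely depends on the number of workers. This is a single-process sketch with made-up gradient vectors, assuming the vector length divides evenly by the number of workers; `ring_allreduce` is a hypothetical helper, not a library call.

```python
def ring_allreduce(vectors):
    """Simulate ring all-reduce (sum) over equal-length vectors, one per worker.

    Returns the per-worker results and the number of values each worker sent.
    """
    n = len(vectors)
    size = len(vectors[0])
    assert size % n == 0, "this sketch assumes the length divides evenly"
    chunk = size // n
    data = [list(v) for v in vectors]
    sent = [0] * n

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, scatter-reduce: afterwards worker i holds the full sum of chunk (i + 1) % n
    for step in range(n - 1):
        messages = [(i, (i - step) % n, [data[i][k] for k in span((i - step) % n)])
                    for i in range(n)]
        for src, c, values in messages:
            dst = (src + 1) % n
            for k, v in zip(span(c), values):
                data[dst][k] += v
            sent[src] += chunk

    # Phase 2, all-gather: pass the finished chunks around the ring
    for step in range(n - 1):
        messages = [(i, (i + 1 - step) % n, [data[i][k] for k in span((i + 1 - step) % n)])
                    for i in range(n)]
        for src, c, values in messages:
            dst = (src + 1) % n
            for k, v in zip(span(c), values):
                data[dst][k] = v
            sent[src] += chunk
    return data, sent

grads = [[1, 2, 3, 4, 5, 6],
         [10, 20, 30, 40, 50, 60],
         [100, 200, 300, 400, 500, 600]]   # hypothetical gradients of 3 workers
reduced, sent = ring_allreduce(grads)
```

Each worker sends 2 * (n - 1) chunks of size/n values, i.e. a little under twice the vector length regardless of n, whereas a single parameter server has to receive data from every worker.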

# python -m torch.distributed.launch --nproc_per_node=4 test.py
# In DDP mode the launcher above starts one process per GPU (4 processes for 4 GPUs)
# DP mode does not need any process configuration
# See the reference links for the details of the code
import os
import time

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler

os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'


def ddp():
    start = time.time()
    # 1) Initialize
    torch.distributed.init_process_group(backend="nccl")  # put this at the very beginning
    input_size = 5
    output_size = 2
    batch_size = 30
    data_size = 9000
    # 2) Configure the GPU of each process
    local_rank = torch.distributed.get_rank()
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)
    print(local_rank)

    class RandomDataset(Dataset):
        def __init__(self, size, length):
            self.len = length
            self.data = torch.randn(length, size).to('cuda')

        def __getitem__(self, index):
            return self.data[index]

        def __len__(self):
            return self.len

    class Model(nn.Module):
        def __init__(self, input_size, output_size):
            super(Model, self).__init__()
            self.fc = nn.Linear(input_size, output_size)

        def forward(self, input):
            output = self.fc(input)
            print("  In Model: input size", input.size(),
                  "output size", output.size())
            return output

    dataset = RandomDataset(input_size, data_size)
    # print(dataset)
    # 3) Use DistributedSampler
    rand_loader = DataLoader(dataset=dataset,
                             batch_size=batch_size,
                             sampler=DistributedSampler(dataset))
    model = Model(input_size, output_size)
    # 4) Move the model to its GPU before wrapping it
    model.to(device)
    if torch.cuda.device_count() > 1:
        print("Let's use", torch.cuda.device_count(), "GPUs!")
        # 5) Wrap the model
        model = torch.nn.parallel.DistributedDataParallel(model,
                                                          device_ids=[local_rank],
                                                          output_device=local_rank)
    for data in rand_loader:
        input_var = data  # already on the GPU, RandomDataset creates CUDA tensors
        output = model(input_var)
        print("Outside: input size", input_var.size(), "output_size", output.size())
    end = time.time() - start
    print(end)


def dp():
    start = time.time()
    input_size = 5
    output_size = 2
    batch_size = 30
    data_size = 9000

    class RandomDataset(Dataset):
        def __init__(self, size, length):
            self.len = length
            self.data = torch.randn(length, size)

        def __getitem__(self, index):
            return self.data[index]

        def __len__(self):
            return self.len

    rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                             batch_size=batch_size, shuffle=True)

    class Model(nn.Module):
        # Our model
        def __init__(self, input_size, output_size):
            super(Model, self).__init__()
            self.fc = nn.Linear(input_size, output_size)

        def forward(self, input):
            output = self.fc(input)
            print("  In Model: input size", input.size(),
                  "output size", output.size())
            return output

    model = Model(input_size, output_size)
    if torch.cuda.is_available():
        model.cuda()
    if torch.cuda.device_count() > 1:
        print("Let's use", torch.cuda.device_count(), "GPUs!")
        # This one line is all DP needs
        model = nn.DataParallel(model)
    for data in rand_loader:
        # Variable is no longer needed; plain tensors work directly
        input_var = data.cuda() if torch.cuda.is_available() else data
        output = model(input_var)
        print("Outside: input size", input_var.size(), "output_size", output.size())
    end = time.time() - start
    print(end)


if __name__ == '__main__':
    dp()
    # ddp()

13 Notes on the optimizer

model = MyModel()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)

for epoch in range(1, epochs):
    for i, (inputs, labels) in enumerate(train_loader):
        output = model(inputs)
        loss = criterion(output, labels)

        # compute gradient and do SGD step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

In short, these three functions first zero the gradients (optimizer.zero_grad()), then backpropagate to compute the gradient of every parameter (loss.backward()), and finally perform one parameter update by gradient descent (optimizer.step()).
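param_groups: when an Optimizer is instantiated, its constructor creates a list param_groups containing num_groups param_group dicts (num_groups depends on how many parameter groups you passed when defining the optimizer). For SGD in the PyTorch version quoted here, each param_group has length 6 and holds the key-value pairs ['params', 'lr', 'momentum', 'dampening', 'weight_decay', 'nesterov'].
param_group['params']: the list of model parameters passed in for this group when the Optimizer was instantiated. If the parameters are not grouped, it is all model parameters model.parameters(). Each parameter is a torch.nn.parameter.Parameter object.
I. optimizer.zero_grad():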
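def zero_grad(self):
    r"""Clears the gradients of all optimized :class:`torch.Tensor` s."""
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is not None:
                p.grad.detach_()
                p.grad.zero_()

optimizer.zero_grad() iterates over all parameters of the model, uses p.grad.detach_() to cut the gradient off from the backward graph, and then uses p.grad.zero_() to set each parameter's gradient to 0, i.e. the gradient record of the previous step is cleared.
Because training usually uses mini-batches, gradients that are not zeroed would carry over from the previous batch, so this function is written before backpropagation and the gradient step. (In fact it is enough to zero the gradients at any point before the next batch's backward pass.)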
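II. loss.backward():
PyTorch's backpropagation (tensor.backward()) is implemented by the autograd package, which automatically computes gradients from the mathematical operations a tensor has gone through.
Specifically, torch.Tensor is the basic class of autograd. If you set a tensor's requires_grad to True, all operations on it are tracked. When you call tensor.backward() after the computation, all gradients are computed automatically and each leaf tensor's gradient is accumulated into its .grad attribute.
More specifically, the loss is obtained from all model weights w through a series of operations. If some w has requires_grad set to True, the tensors computed from it record the corresponding operation in their .grad_fn attribute. Calling loss.backward() then backpropagates layer by layer, computes the gradient of each w, and stores it in that w's .grad attribute.
Without a call to tensor.backward() the gradients are None, so loss.backward() must come before optimizer.step().
III. optimizer.step():
Taking SGD as an example, the source of torch.optim.SGD().step() is: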
def step(self, closure=None):
    """Performs a single optimization step.
    Arguments:
        closure (callable, optional): A closure that reevaluates the model
            and returns the loss.
    """
    # Note: add_(scalar, tensor) below is the older call form; newer PyTorch writes add_(tensor, alpha=scalar)
    loss = None
    if closure is not None:
        loss = closure()
    for group in self.param_groups:
        weight_decay = group['weight_decay']
        momentum = group['momentum']
        dampening = group['dampening']
        nesterov = group['nesterov']
        for p in group['params']:
            if p.grad is None:
                continue
            d_p = p.grad.data
            if weight_decay != 0:
                d_p.add_(weight_decay, p.data)
            if momentum != 0:
                param_state = self.state[p]
                if 'momentum_buffer' not in param_state:
                    buf = param_state['momentum_buffer'] = torch.clone(d_p).detach()
                else:
                    buf = param_state['momentum_buffer']
                    buf.mul_(momentum).add_(1 - dampening, d_p)
                if nesterov:
                    d_p = d_p.add(momentum, buf)
                else:
                    d_p = buf
            p.data.add_(-group['lr'], d_p)
    return loss

step() performs one optimization step, updating the parameter values by gradient descent. Because gradient descent relies on gradients, loss.backward() must be called before optimizer.step() to compute them.
Note: the optimizer is only responsible for optimizing via gradient descent, not for producing gradients. The gradients are produced by tensor.backward().
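The update rule inside step() can be traced by hand with a scalar parameter. The sketch below mirrors the branches of the quoted source in plain Python; the function name `sgd_momentum_step` and the numbers are illustrative assumptions.

```python
def sgd_momentum_step(p, grad, state, lr, momentum=0.9, dampening=0.0,
                      weight_decay=0.0, nesterov=False):
    """One SGD update for a scalar parameter, following the quoted step() logic."""
    d_p = grad
    if weight_decay != 0:
        d_p = d_p + weight_decay * p
    if momentum != 0:
        if "momentum_buffer" not in state:
            buf = d_p                                   # first step: the buffer starts as the gradient
        else:
            buf = momentum * state["momentum_buffer"] + (1 - dampening) * d_p
        state["momentum_buffer"] = buf
        d_p = (d_p + momentum * buf) if nesterov else buf
    return p - lr * d_p

state = {}
p = 1.0
p = sgd_momentum_step(p, grad=2.0, state=state, lr=0.1)   # buffer = 2.0, p is about 0.8
p = sgd_momentum_step(p, grad=2.0, state=state, lr=0.1)   # buffer = 0.9 * 2.0 + 2.0 = 3.8, p is about 0.42
```

With a constant gradient the momentum buffer keeps growing, so the second step is larger than the first, which is how momentum speeds up movement along a consistent direction.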
References

[1] Paper: When Does Label Smoothing Help? https://arxiv.org/abs/1906.02629
https://zhuanlan.zhihu.com/p/101553787
Fighting overfitting: how to use label smoothing regularization in PyTorch https://cloud.tencent.com/developer/article/1625899
https://github.com/km1994/NLP-Interview-Notes/tree/main/Trick/label_smoothing
[2] Paper: Understanding the Behaviour of Contrastive Loss https://arxiv.org/pdf/2012.09740.pdf
What is the temperature parameter in deep learning? https://zhuanlan.zhihu.com/p/132785733
CVPR 2021 self-supervised learning paper: understanding the properties of contrastive loss and the role of the temperature https://zhuanlan.zhihu.com/p/357071960?ivk_sa=1024320u
[3] Deep learning training strategies: learning rate warmup
[4] Focal Loss: from intuition to implementation (very accessible) https://zhuanlan.zhihu.com/p/103623160
https://blog.csdn.net/weixin_45839693/article/details/109469031
[5] Paper: Intriguing properties of neural networks https://arxiv.org/abs/1312.6199
Adversarial training in NLP https://wmathor.com/index.php/archives/1537/
[7] How does weight-tying work in a RNN? https://tomroth.com.au/weight_tying/
Using the Output Embedding to Improve Language Models https://arxiv.org/abs/1608.05859
[8] PyTorch study notes (5): fine-tuning and per-layer learning rates https://zhuanlan.zhihu.com/p/59780798
[10] [PyTorch] Mixed-precision acceleration with Apex https://zhuanlan.zhihu.com/p/79887894
PyTorch source walkthrough of torch.cuda.amp: automatic mixed precision in detail https://zhuanlan.zhihu.com/p/348554267
[12] [Distributed training] The right way to use multiple GPUs on one machine (1) https://zhuanlan.zhihu.com/p/72939003
[Distributed training] The right way to use multiple GPUs on one machine (3): PyTorch https://zhuanlan.zhihu.com/p/74792767
[13] Understanding what optimizer.zero_grad(), loss.backward() and optimizer.step() do and how they work
https://blog.csdn.net/PanYHHH/article/details/107361827
