Why does the loss computed by Flux `withgradient` not match my own calculation?


I am trying to train a simple CNN with Flux, but I have run into a strange problem... During training the loss appears to decrease (suggesting it is working), yet despite what the loss curve suggests, the outputs of the "trained" model are very poor. When I compute the loss manually, it does not match what training reported (the model behaves as if it had not been trained at all).

I then started comparing the loss returned from inside the gradient computation with the loss computed outside of it, and after a lot of digging I believe the problem is related to the `BatchNorm` layer. Consider the following minimal example:

using Flux
x = rand(100,100,1,1) #a 100x100 greyscale image: 1 channel, batch size 1
y = @. 5*x + 3 #output image, some relationship to the input values (doesn't matter for this)
m = Chain(BatchNorm(1),Conv((1,1),1=>1)) #very simple model (doesn't really do anything but illustrates the problem)
l_init = Flux.mse(m(x),y) #initial loss after model creation
l_grad, grad = Flux.withgradient(m -> Flux.mse(m(x),y), m) #loss calculated by gradient
l_final = Flux.mse(m(x),y) #loss calculated again using the model (no parameters have been updated)
println("initial loss: $l_init")
println("loss calculated in withgradient: $l_grad")
println("final loss: $l_final")

All of the losses above differ, sometimes quite significantly (running it just now I got 22.6, 30.7 and 23.0), whereas I would have thought they should all be identical?

Interestingly, if I remove the `BatchNorm` layer, the outputs are all the same, i.e. running:

using Flux
x = rand(100,100,1,1) #a 100x100 greyscale image: 1 channel, batch size 1
y = @. 5*x + 3 #output image
m = Chain(Conv((1,1),1=>1))
l_init = Flux.mse(m(x),y) #initial loss after model creation
l_grad, grad = Flux.withgradient(m -> Flux.mse(m(x),y), m)
l_final = Flux.mse(m(x),y)
println("initial loss: $l_init")
println("loss calculated in withgradient: $l_grad")
println("final loss: $l_final")

produces the same number for every loss calculation.

Why does including the `BatchNorm` layer change the loss values like this?

My (limited) understanding is that this layer simply normalises the input values. I can see how that would change the loss between the normalised and non-normalised cases, but I don't understand why it produces different loss values for the same input on the same model when none of that model's parameters have been updated?
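
For reference, the normalisation I had in mind is roughly the following hand-written sketch (ignoring the learnable shift/scale parameters and the small ϵ Flux adds for numerical stability), which is fully deterministic for a fixed input:

using Statistics
#what I assumed BatchNorm does to a single-channel input:
#normalise it with the statistics of that input itself
my_normalise(x) = (x .- mean(x)) ./ std(x, corrected=false)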

machine-learning julia flux
1 Answer

Look at the documentation for `BatchNorm`:
BatchNorm(channels::Integer, λ=identity;
            initβ=zeros32, initγ=ones32,
            affine=true, track_stats=true, active=nothing,
            eps=1f-5, momentum= 0.1f0)

  Batch Normalization (https://arxiv.org/abs/1502.03167) layer. channels should
  be the size of the channel dimension in your data (see below).

  Given an array with N dimensions, call the N-1th the channel dimension. For a
  batch of feature vectors this is just the data dimension, for WHCN images it's
  the usual channel dimension.

  BatchNorm computes the mean and variance for each D_1×...×D_{N-2}×1×D_N input
  slice and normalises the input accordingly.

  If affine=true, it also applies a shift and a rescale to the input through to
  learnable per-channel bias β and scale γ parameters.

  After normalisation, elementwise activation λ is applied.

  If track_stats=true, accumulates mean and var statistics in training phase that
  will be used to renormalize the input in test phase.

  Use testmode! during inference.

  Examples
  ≡≡≡≡≡≡≡≡≡≡

  julia> using Statistics
  
  julia> xs = rand(3, 3, 3, 2);  # a batch of 2 images, each having 3 channels
  
  julia> m = BatchNorm(3);
  
  julia> Flux.trainmode!(m);
  
  julia> isapprox(std(m(xs)), 1, atol=0.1) && std(xs) != std(m(xs))
  true

The key here is that `track_stats=true` by default. This is what causes the layer's output to change from call to call. If you don't want that behaviour, initialise your model with

m = Chain(BatchNorm(1, track_stats=false),Conv((1,1),1=>1)) #same model, but without tracking running statistics

and you will get the same output as in the second example.
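
You can watch this happening by inspecting the layer's accumulated statistics before and after the gradient call (a minimal check, using the original `m`, `x` and `y` from the question, i.e. with the default `track_stats=true`; `μ` and `σ²` are the fields in which `BatchNorm` stores its running mean and variance):

println(m[1].μ, " ", m[1].σ²)  #fresh layer: running mean 0, running variance 1
Flux.withgradient(m -> Flux.mse(m(x), y), m)  #the forward pass here runs in training mode
println(m[1].μ, " ", m[1].σ²)  #the running statistics have been pulled towards the batch statistics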

`BatchNorm` is initialised with a running mean of zero and a running variance of one, whereas your input data is neither zero-mean nor unit-variance. That is why, with `track_stats=true`, you get changing outputs even when you feed in the same input repeatedly: each training-mode forward pass nudges the running statistics towards the statistics of the batch, and as far as I can tell this converges fairly quickly. Note also that outside of a gradient call the layer normalises with the stored running statistics, whereas inside `withgradient` it normalises with the statistics of the current batch, which is why all three losses in your first example differ.
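
Alternatively, as the docstring notes, you can freeze the layer explicitly with `testmode!`: it then always normalises with its stored running statistics and stops updating them, so all three loss computations agree (a minimal sketch, reusing `m`, `x` and `y` from the first example):

Flux.testmode!(m)  #freeze normalisation layers: use stored statistics, stop updating them
l1 = Flux.mse(m(x), y)
l2, _ = Flux.withgradient(m -> Flux.mse(m(x), y), m)
l3 = Flux.mse(m(x), y)
println(l1 == l2 == l3) #true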
