
I've been looking deeper into how batch norm works in PyTorch and noticed that, for the code below:

import torch
import torch.nn as nn

torch.manual_seed(0)
# With Learnable Parameters
m = nn.BatchNorm2d(1)
# Without Learnable Parameters
#m = nn.BatchNorm2d(1, affine=False)
input = torch.randn(2, 1, 2, 2)  # (N, C, H, W)
output = m(input)
#print(input)
print(output)

the output below does not sum to 1:

tensor([[[[-0.1461, -0.0348],
          [ 0.4644, -0.0339]]],


        [[[ 0.6359, -0.0718],
          [-1.1104,  0.2967]]]], grad_fn=<NativeBatchNormBackward>)

It sums to 0 instead, and I'm guessing this is because batch norm makes the mean 0 (unless the scale and shift params change it). Isn't batch normalization supposed to produce a distribution per channel across the batch?
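A quick check (a sketch added here for illustration, reusing the same seed and module as above) seems to confirm that the per-channel mean is driven to 0:

# Sanity check (not part of the original snippet): inspect the output statistics.
import torch
import torch.nn as nn

torch.manual_seed(0)
m = nn.BatchNorm2d(1)
input = torch.randn(2, 1, 2, 2)
output = m(input)

# BatchNorm2d normalizes each channel over the (N, H, W) dimensions,
# so the per-channel mean is ~0 and the (biased) variance is ~1.
print(output.mean(dim=(0, 2, 3)))                   # ~0
print(output.var(dim=(0, 2, 3), unbiased=False))    # ~1 (up to eps)
print(output.sum())                                 # ~0, not 1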

  • I think you have BatchNorm confused with Softmax. Commented Nov 27, 2022 at 5:17
  • I understand what you mean, but isn't normalization also supposed to essentially produce a distribution that sums to 1, or am I misunderstanding something? Commented Nov 27, 2022 at 12:05
  • Also, can it then be said that batch normalization does NOT produce a probability distribution over the batch? If not, what exactly does it do? Commented Nov 27, 2022 at 12:08

1 Answer


I think you have BatchNorm confused with Softmax.

To answer your questions in the comments: normalization does not change the shape of the distribution - it simply shifts it to mean 0 and scales it to unit variance.

For example, if the data was from a uniform distribution, it remains uniform after normalizing, albeit with different statistics.
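As a rough illustration (my own sketch, not part of the original answer), normalizing uniform samples leaves them uniform - only the mean and scale change:

# Sketch: standard normalization of uniform data keeps the distribution uniform.
import torch

torch.manual_seed(0)
x = torch.rand(10_000) * 5 + 10          # uniform samples on [10, 15]
x_norm = (x - x.mean()) / x.std()        # shift to mean 0, scale to std 1

print(x_norm.mean(), x_norm.std())       # ~0 and ~1
# The histogram is still (roughly) flat - the shape hasn't changed,
# only the range of values on the X-axis.
print(torch.histc(x_norm, bins=5))       # roughly equal counts per bin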

Concretely, take the distribution below:

[Figure: histogram of the original (un-normalized) distribution]


After normalizing, this is what the distribution looks like:

[Figure: histogram of the same distribution after normalization]

Notice that the shape of the overall distribution and the number of samples in each bucket are exactly the same - what has changed is the mean value (i.e., the center) of the distribution. And though it is not visually obvious, one can check the new normalized values (the X-axis of the plot) and see that the variance is approximately 1.

This is precisely what BatchNorm does, with the X-axis values being the examples in a batch (per channel). For other kinds of norms, the dimension normalized over changes (e.g., from the batch dimension to the feature dimension in LayerNorm), but the effect is essentially the same.
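To make the dimension concrete, here is a sketch (mine, assuming training mode and no affine parameters) that reproduces BatchNorm2d by hand:

# Sketch: BatchNorm2d normalizes each channel over the batch and spatial dims.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 3, 4, 4)                          # (N, C, H, W)

bn = nn.BatchNorm2d(3, affine=False)                 # training mode by default
out_bn = bn(x)

# Manual equivalent: per-channel mean/variance over dims (N, H, W).
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
out_manual = (x - mean) / torch.sqrt(var + bn.eps)

print(torch.allclose(out_bn, out_manual, atol=1e-6))  # True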

If you wanted probabilities, you could simply divide the count in each bin by the total number of samples (i.e., scale the Y-axis instead of the X-axis). This would give a graph of exactly the same shape, with the X-axis values unchanged from the original graph and the Y-axis values scaled to represent probabilities.
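For instance (a sketch of my own, not part of the original answer), empirical probabilities can be read directly off the bin counts:

# Sketch: turn histogram bin counts into empirical probabilities.
import torch

torch.manual_seed(0)
samples = torch.randn(10_000)

counts = torch.histc(samples, bins=20)   # counts per bin (the Y-axis)
probs = counts / counts.sum()            # scale the Y-axis, not the X-axis

print(probs.sum())                       # 1.0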


Let's now see what Softmax does to the distribution. Applying softmax over the distribution gives the following graph:

[Figure: the distribution after applying softmax]

As you can see, softmax actually creates a probability distribution over the points: it assigns each point a probability of how likely it is, under the assumption that they are all sampled from a Gaussian distribution (the Gaussian part matters theoretically, since it is what gives the e in the softmax expression).

In contrast, simply scaling the Y-axis by the number of samples does not make the Gaussian assumption - it builds a distribution from the given points alone. Since any point outside the observed samples would get probability 0, that is useless for generalization. Hence, softmax is used instead of simply turning sample counts into probabilities.
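For comparison, a small sketch (mine) of what softmax does to a handful of points:

# Sketch: softmax weights each point by exp(x) and normalizes, so the
# resulting values form a probability distribution that sums to 1.
import torch

x = torch.tensor([-1.0, 0.0, 1.0, 2.0])
p = torch.softmax(x, dim=0)

print(p)        # approx. tensor([0.0321, 0.0871, 0.2369, 0.6439])
print(p.sum())  # 1.0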
