Batch normalization is a normalization method used to prevent exploding or vanishing gradients while training DNNs, introduced in a 2015 paper by Sergey Ioffe and Christian Szegedy. When it was first presented, it let various DNN models achieve higher performance metrics simply by adding BN layers between the original layers.
https://arxiv.org/abs/1502.03167
What is Batch Normalization?
Batch normalization is a small learnable transformation applied inside the original DNN or CNN layers to perform internal normalization. To be specific, a BN layer is inserted right before each neuron's activation function and is trained together with the entire model on the input dataset.
Why are they needed?
So, BN layers can normalize activations to keep their distribution stable. Good! But why do we need them in the first place?
All the effort to normalize activations comes from stochastic gradient descent (SGD), or gradient descent in general, the method used to train deeply layered neural networks. Unconstrained DNNs, however, suffered a lot from a problem called internal covariate shift that occurs during training.
Exploding/Vanishing Gradients
Simply passing computed values through neurons WITHOUT activation functions would produce a simple linear function, no matter how deep the layers are. To break this linearity, researchers inserted non-linear activation functions like sigmoid or ReLU between layers. However, introducing these kinds of functions caused gradients to vanish or explode because of their saturation zones. For example, the sigmoid function saturates where the input x is very large or very small. In those regions the gradients become too small to be back-propagated through the lower layers, which makes the gradients vanish.
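To make the saturation effect concrete, here is a tiny sketch of my own (not from the paper) showing how small the sigmoid's derivative gets once the input moves away from zero:

```python
import numpy as np

# Sigmoid and its derivative: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)).
# In the saturation zones (|x| large) the derivative is nearly zero, so the
# gradient shrinks every time it passes back through such a unit.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    print(f"x = {x:5.1f}   sigmoid = {s:.5f}   derivative = {s * (1 - s):.2e}")

# At x = 10 the derivative is ~4.5e-05; a few saturated layers in a row are
# enough to make the back-propagated gradient effectively vanish.
```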
Internal Covariate Shift
The exploding/vanishing gradients problem has several causes, and one of them is internal covariate shift. Internal covariate shift means that the distribution of each layer's inputs keeps changing during training, because the parameters of the preceding layers keep changing; activations can then drift into the saturation zones of the activation functions. So saying that batch normalization tries to handle internal covariate shift means it tries to stabilize the layers' output distributions throughout training.
Nice Try - Whitening
At first, to handle the problem of internal covariate shift, researchers tried to whiten the outputs (normalize them to have mean 0 and standard deviation 1). This seems to solve the problem, but then a new one comes up. If we normalize the activation $y=Wx+b$ as $\hat{y}=y-E[y]$, then, surprisingly, when we attempt to update the bias term $b$, the resulting normalized activation remains the same.
$$y'= y+\Delta b = Wx+b+\Delta b$$
$$\hat{y'}=y'-E[y']=Wx+b+\Delta b - E[Wx+b+\Delta b] = Wx+b - E[Wx+b]=\hat{y}$$
Since the update to $b$ never changes the normalized output (or the loss), nothing stops $b$ from growing: the bias term can blow up during training, which is exactly what the paper observed in its experiments.
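A tiny simulation can show this blow-up. The setup below is my own illustration (not the paper's actual experiment): the mean is subtracted outside of the gradient computation, so the update for $b$ ignores the fact that $E[y]$ depends on $b$.

```python
import numpy as np

# Centering is applied outside of gradient descent, so the gradient w.r.t. b
# treats the mean as a constant. The loss never improves, but b keeps growing.
rng = np.random.default_rng(0)
u = rng.normal(size=8)                 # fixed inputs to this unit
target = rng.normal(size=8) + 1.0      # arbitrary regression target
b, lr = 0.0, 0.1

for step in range(1001):
    x = u + b
    x_hat = x - x.mean()                        # "whitening" (centering) step
    loss = 0.5 * np.sum((x_hat - target) ** 2)
    grad_b = np.sum(x_hat - target)             # ignores d E[x] / d b
    b -= lr * grad_b
    if step % 250 == 0:
        print(f"step {step:4d}   loss {loss:8.3f}   b {b:10.2f}")
```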
The Batch Normalization Method
How To
1. Simple change from Whitening
To handle the problem that arose with whitening, researchers took a simpler approach. First, they normalized the input along each dimension. For example, if the input is d-dimensional, $x=(x^{(1)}, x^{(2)}, \dots, x^{(d)})$, each dimension is normalized over the input dataset as below.
$$\hat x^{(k)}=\frac{x^{(k)}-E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$$
Additionally, this makes each activation fall mainly into the linear section (center) of the activation function. The sigmoid function, for example, is roughly linear around $x=0$, which is exactly where the normalized outputs tend to fall. To break this linearity again, researchers added two learnable parameters, $\gamma$ and $\beta$, which can stretch and shift the normalized values to any region.
$$y^{(k)}=\gamma^{(k)}\hat x^{(k)}+\beta^{(k)}$$
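Here is a minimal sketch of this per-dimension normalization plus the learned scale and shift (my own illustration; the small `eps` term for numerical stability is borrowed from the mini-batch version described next):

```python
import numpy as np

# Normalize each feature dimension k over the dataset, then apply the learned
# scale gamma^(k) and shift beta^(k).
def normalize_affine(X, gamma, beta, eps=1e-5):
    mean = X.mean(axis=0)                     # E[x^(k)] per dimension
    var = X.var(axis=0)                       # Var[x^(k)] per dimension
    X_hat = (X - mean) / np.sqrt(var + eps)
    return gamma * X_hat + beta               # y^(k) = gamma^(k) x_hat^(k) + beta^(k)

X = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(1000, 4))
gamma, beta = np.ones(4), np.zeros(4)         # starts as a plain standardization
Y = normalize_affine(X, gamma, beta)
print(Y.mean(axis=0).round(3), Y.std(axis=0).round(3))   # ~0 and ~1 per dimension
```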
2. Applying Batch
While normalizing and breaking linearity as above, batched input is another issue to handle, because in the mini-batch setting we don't have access to the entire dataset's distribution. In this case, each mini-batch's own mean and variance are used during training as estimates of the dataset statistics.
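Below is a rough sketch of what a training/inference BN layer could look like. It is my own simplification: it keeps exponential moving averages of the batch statistics for inference, the way most frameworks do, rather than the paper's post-training averaging over many batches, and names like `momentum` and `running_mean` are my own choices.

```python
import numpy as np

class BatchNorm1d:
    """Per-dimension batch normalization with learnable gamma/beta."""
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)
        self.running_mean, self.running_var = np.zeros(dim), np.ones(dim)
        self.eps, self.momentum = eps, momentum

    def forward(self, x, training=True):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)        # mini-batch statistics
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mean, var = self.running_mean, self.running_var  # dataset estimates
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1d(4)
out = bn.forward(np.random.default_rng(1).normal(size=(32, 4)), training=True)
```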
Features
1. Differentiability
The paper also includes several lines of equations showing that the total loss is differentiable with respect to the batch normalization parameters and statistics. This means those layers can learn parameter values that reduce internal covariate shift while the whole model trains.
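Rather than copying the paper's full chain-rule derivation, here is a quick numerical sanity check of my own that the loss really is differentiable with respect to $\gamma$ and $\beta$: the analytic gradients $\partial\ell/\partial\gamma = \sum_i \partial\ell/\partial y_i \cdot \hat x_i$ and $\partial\ell/\partial\beta = \sum_i \partial\ell/\partial y_i$ should match a finite-difference estimate.

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    return gamma * x_hat + beta, x_hat

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 3))
gamma, beta = rng.normal(size=3), rng.normal(size=3)

def loss(g, b):
    y, _ = bn_forward(x, g, b)
    return 0.5 * np.sum(y ** 2)               # arbitrary scalar loss on top of BN

y, x_hat = bn_forward(x, gamma, beta)
dl_dy = y                                      # gradient of that loss w.r.t. y
grad_gamma = np.sum(dl_dy * x_hat, axis=0)     # analytic d loss / d gamma
grad_beta = np.sum(dl_dy, axis=0)              # analytic d loss / d beta

h, e0 = 1e-6, np.array([1.0, 0.0, 0.0])
print(grad_gamma[0], (loss(gamma + h * e0, beta) - loss(gamma, beta)) / h)
print(grad_beta[0],  (loss(gamma, beta + h * e0) - loss(gamma, beta)) / h)
# Each pair should agree closely, confirming the gradients flow through BN.
```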
2. Application to Convolutional Networks
The batch normalization method is still applicable to ConvNets with a small change: different elements of the same feature map should be normalized in the same way, so the statistics are computed per feature map over both the mini-batch and all spatial locations, and a single pair of $\gamma$, $\beta$ is learned per feature map. The paper refers to this as obeying the convolutional property.
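A rough sketch of the convolutional case (my own illustration, assuming NCHW tensors): the mean and variance are taken per channel over the batch and all spatial positions, and one $\gamma$, $\beta$ pair is shared by every location in that feature map.

```python
import numpy as np

def batchnorm2d(x, gamma, beta, eps=1e-5):
    # x: (N, C, H, W); statistics per channel, over batch and spatial dims
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # one gamma/beta per feature map, shared by all spatial locations
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.default_rng(2).normal(size=(8, 3, 5, 5))
out = batchnorm2d(x, np.ones(3), np.zeros(3))
print(out.mean(axis=(0, 2, 3)).round(4))       # ~0 for each of the 3 channels
```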
3. Higher Learning Rate
Since every layer's activations are normalized, we can train deep models with a higher learning rate without worrying about vanishing/exploding gradients. Let's see why, assuming a layer's parameters are scaled up by a constant $a$. The parameter matrix is $W$ and the input vector is $u$.
$$BN(Wu) = BN((aW)u)$$
$$\frac{\partial BN((aW)u)}{\partial u}=\frac{\partial BN(Wu)}{\partial u}$$
$$\frac{\partial BN((aW)u)}{\partial (aW)}=\frac{1}{a} \frac{\partial BN(Wu)}{\partial W}$$
First, the scale does not affect the gradient that back-propagates to the lower layers. Moreover, larger weights receive smaller gradients (scaled down by $1/a$), which leads to more stable parameter updates.
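The first identity is easy to check numerically; the snippet below is a quick illustrative sketch of my own (with `eps` kept tiny so it doesn't interfere with the scaling):

```python
import numpy as np

def bn(z, eps=1e-12):
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(3)
U = rng.normal(size=(64, 10))      # a batch of inputs u
W = rng.normal(size=(10, 4))       # layer parameters
a = 100.0                          # large constant scaling of the weights

# BN(Wu) == BN((aW)u): the scale cancels inside the normalization
print(np.allclose(bn(U @ W), bn(U @ (a * W)), atol=1e-6))   # True
```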
4. Regularizer
During training, every mini-batch has a slightly different distribution, so a single training example's effect on the model changes whenever its batch changes, which is essentially all the time. In other words, the output for a given example is non-deterministic during training, and this property acts as a regularizer.
Experiments
The paper evaluates the effect of the Batch Normalization layer on several models across classification tasks and achieves remarkable performance on most of them. Applying batch normalization makes training more stable in the initial steps, which leads to faster convergence and lower overall training time.
Outro
Batch normalization is widely used in the textbook I've been studying, and it was quite interesting to read this research paper to understand its implementation details and mathematical background. Although there are some downsides to using BN layers, I think they are still a pretty good solution for handling internal covariate shift.