Batch normalization is a normalization method used to prevent exploding or vanishing gradients while training DNNs, introduced in a 2015 paper by Sergey Ioffe and Christian Szegedy. When it was first presented, it let various DNN models achieve higher performance metrics simply by adding BN layers between the original layers.
https://arxiv.org/abs/1502.03167
What is Batch Normalization?
Batch normalization is a small learnable transformation applied inside the original DNN or CNN layers to perform internal normalization. To be specific, a BN layer is inserted right before each neuron's activation function and is trained together with the entire model on the input dataset.
Why are they needed?
So, BN layers can normalize activations to keep their distribution stable. Good! But why do we need them in the first place?
All the effort to normalize activations comes from stochastic gradient descent (SGD), or gradient descent in general, the method used to train deeply layered neural networks. Unconstrained DNNs, however, suffered a lot from a problem called internal covariate shift that occurs during training.
Exploding/Vanishing Gradients
Simply passing computed values through neurons WITHOUT activation functions would produce a simple linear function, no matter how deep the layers are. To break this linearity, researchers inserted non-linear activation functions like sigmoid or ReLU between layers. However, introducing these kinds of functions caused gradients to vanish or explode because of their saturation zones. For example, the sigmoid function saturates where the input x is very large or very small. In those regions the gradients become too small to be back-propagated through the lower layers, which makes the gradients vanish.
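To make the saturation effect concrete, here is a tiny sketch of my own (not from the paper) showing how small the sigmoid's derivative gets once the input moves away from zero:

```python
import numpy as np

# Sigmoid and its derivative: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)).
# In the saturation zones (|x| large) the derivative is nearly zero, so the
# gradient shrinks every time it passes back through such a unit.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    print(f"x = {x:5.1f}   sigmoid = {s:.5f}   derivative = {s * (1 - s):.2e}")

# At x = 10 the derivative is ~4.5e-05; a few saturated layers in a row are
# enough to make the back-propagated gradient effectively vanish.
```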
Internal Covariate Shift
The exploding/vanishing gradients problem has several causes, and one of them is internal covariate shift. Internal covariate shift means that the distribution of each layer's inputs keeps changing during training, because the parameters of the preceding layers keep changing; activations can then drift into the saturation zones of the activation functions. So saying that batch normalization tries to handle internal covariate shift means it tries to stabilize the layers' output distributions throughout training.
Nice Try - Whitening
At first, to handle the problem of internal covariate shift, researchers tried to whiten the outputs (normalize them to have mean 0 and standard deviation 1). This seems to solve the problem, but then a new one comes up. If we normalize the activation $y=Wx+b$ as $\hat{y}=y-E[y]$, then, surprisingly, when we attempt to update the bias term $b$, the resulting normalized activation remains the same.
$$y'= y+\Delta b = Wx+b+\Delta b$$
$$\hat{y'}=y'-E[y']=Wx+b+\Delta b - E[Wx+b+\Delta b] = Wx+b - E[Wx+b]=\hat{y}$$
Since the update to $b$ never changes the normalized output (or the loss), nothing stops $b$ from growing: the bias term can blow up during training, which is exactly what the paper observed in its experiments.
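A tiny simulation can show this blow-up. The setup below is my own illustration (not the paper's actual experiment): the mean is subtracted outside of the gradient computation, so the update for $b$ ignores the fact that $E[y]$ depends on $b$.

```python
import numpy as np

# Centering is applied outside of gradient descent, so the gradient w.r.t. b
# treats the mean as a constant. The loss never improves, but b keeps growing.
rng = np.random.default_rng(0)
u = rng.normal(size=8)                 # fixed inputs to this unit
target = rng.normal(size=8) + 1.0      # arbitrary regression target
b, lr = 0.0, 0.1

for step in range(1001):
    x = u + b
    x_hat = x - x.mean()                        # "whitening" (centering) step
    loss = 0.5 * np.sum((x_hat - target) ** 2)
    grad_b = np.sum(x_hat - target)             # ignores d E[x] / d b
    b -= lr * grad_b
    if step % 250 == 0:
        print(f"step {step:4d}   loss {loss:8.3f}   b {b:10.2f}")
```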
The Batch Normalization Method
How To
1. Simple change from Whitening
To handle the problem that arose with whitening, researchers took a simpler approach. First, they normalized the input along each dimension. For example, if the input is d-dimensional, $x=(x^{(1)}, x^{(2)}, \dots, x^{(d)})$, each dimension is normalized over the input dataset as below.
$$\hat x^{(k)}=\frac{x^{(k)}-E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$$
Additionally, this makes each activation fall mainly into the linear section (center) of the activation function. The sigmoid function, for example, is roughly linear around $x=0$, which is exactly where the normalized outputs tend to fall. To break this linearity again, researchers added two learnable parameters, $\gamma$ and $\beta$, which can stretch and shift the normalized values to any region.
$$y^{(k)}=\gamma^{(k)}\hat x^{(k)}+\beta^{(k)}$$
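Here is a minimal sketch of this per-dimension normalization plus the learned scale and shift (my own illustration; the small `eps` term for numerical stability is borrowed from the mini-batch version described next):

```python
import numpy as np

# Normalize each feature dimension k over the dataset, then apply the learned
# scale gamma^(k) and shift beta^(k).
def normalize_affine(X, gamma, beta, eps=1e-5):
    mean = X.mean(axis=0)                     # E[x^(k)] per dimension
    var = X.var(axis=0)                       # Var[x^(k)] per dimension
    X_hat = (X - mean) / np.sqrt(var + eps)
    return gamma * X_hat + beta               # y^(k) = gamma^(k) x_hat^(k) + beta^(k)

X = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(1000, 4))
gamma, beta = np.ones(4), np.zeros(4)         # starts as a plain standardization
Y = normalize_affine(X, gamma, beta)
print(Y.mean(axis=0).round(3), Y.std(axis=0).round(3))   # ~0 and ~1 per dimension
```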
2. Applying Batch
While normalizing and breaking linearity as above, batched input is another issue to handle, because in the mini-batch setting we don't have access to the entire dataset's distribution. In this case, each mini-batch's own mean and variance are used during training as estimates of the dataset statistics.
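Below is a rough sketch of what a training/inference BN layer could look like. It is my own simplification: it keeps exponential moving averages of the batch statistics for inference, the way most frameworks do, rather than the paper's post-training averaging over many batches, and names like `momentum` and `running_mean` are my own choices.

```python
import numpy as np

class BatchNorm1d:
    """Per-dimension batch normalization with learnable gamma/beta."""
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)
        self.running_mean, self.running_var = np.zeros(dim), np.ones(dim)
        self.eps, self.momentum = eps, momentum

    def forward(self, x, training=True):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)        # mini-batch statistics
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mean, var = self.running_mean, self.running_var  # dataset estimates
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1d(4)
out = bn.forward(np.random.default_rng(1).normal(size=(32, 4)), training=True)
```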
Features
1. Differentiability
The paper also includes several lines of equations showing that the total loss is differentiable with respect to the batch normalization parameters and statistics. This means those layers can learn parameter values that reduce internal covariate shift while the whole model trains.
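Rather than copying the paper's full chain-rule derivation, here is a quick numerical sanity check of my own that the loss really is differentiable with respect to $\gamma$ and $\beta$: the analytic gradients $\partial\ell/\partial\gamma = \sum_i \partial\ell/\partial y_i \cdot \hat x_i$ and $\partial\ell/\partial\beta = \sum_i \partial\ell/\partial y_i$ should match a finite-difference estimate.

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    return gamma * x_hat + beta, x_hat

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 3))
gamma, beta = rng.normal(size=3), rng.normal(size=3)

def loss(g, b):
    y, _ = bn_forward(x, g, b)
    return 0.5 * np.sum(y ** 2)               # arbitrary scalar loss on top of BN

y, x_hat = bn_forward(x, gamma, beta)
dl_dy = y                                      # gradient of that loss w.r.t. y
grad_gamma = np.sum(dl_dy * x_hat, axis=0)     # analytic d loss / d gamma
grad_beta = np.sum(dl_dy, axis=0)              # analytic d loss / d beta

h, e0 = 1e-6, np.array([1.0, 0.0, 0.0])
print(grad_gamma[0], (loss(gamma + h * e0, beta) - loss(gamma, beta)) / h)
print(grad_beta[0],  (loss(gamma, beta + h * e0) - loss(gamma, beta)) / h)
# Each pair should agree closely, confirming the gradients flow through BN.
```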
2. Application to Convolutional Networks
The batch normalization method is still applicable to ConvNets with a small change: different elements of the same feature map should be normalized in the same way, so the statistics are computed per feature map over both the mini-batch and all spatial locations, and a single pair of $\gamma$, $\beta$ is learned per feature map. The paper refers to this as obeying the convolutional property.
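A rough sketch of the convolutional case (my own illustration, assuming NCHW tensors): the mean and variance are taken per channel over the batch and all spatial positions, and one $\gamma$, $\beta$ pair is shared by every location in that feature map.

```python
import numpy as np

def batchnorm2d(x, gamma, beta, eps=1e-5):
    # x: (N, C, H, W); statistics per channel, over batch and spatial dims
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # one gamma/beta per feature map, shared by all spatial locations
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.default_rng(2).normal(size=(8, 3, 5, 5))
out = batchnorm2d(x, np.ones(3), np.zeros(3))
print(out.mean(axis=(0, 2, 3)).round(4))       # ~0 for each of the 3 channels
```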
3. Higher Learning Rate
Since every layer's activations are normalized, we can train deep models with a higher learning rate without worrying about vanishing/exploding gradients. Let's see why, assuming a layer's parameters are scaled up by a constant $a$. The parameter matrix is $W$ and the input vector is $u$.
$$BN(Wu) = BN((aW)u)$$
$$\frac{\partial BN((aW)u)}{\partial u}=\frac{\partial BN(Wu)}{\partial u}$$
$$\frac{\partial BN((aW)u)}{\partial (aW)}=\frac{1}{a} \frac{\partial BN(Wu)}{\partial W}$$
First, the scale does not affect the gradient that back-propagates to the lower layers. Moreover, larger weights receive smaller gradients (scaled down by $1/a$), which leads to more stable parameter updates.
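The first identity is easy to check numerically; the snippet below is a quick illustrative sketch of my own (with `eps` kept tiny so it doesn't interfere with the scaling):

```python
import numpy as np

def bn(z, eps=1e-12):
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(3)
U = rng.normal(size=(64, 10))      # a batch of inputs u
W = rng.normal(size=(10, 4))       # layer parameters
a = 100.0                          # large constant scaling of the weights

# BN(Wu) == BN((aW)u): the scale cancels inside the normalization
print(np.allclose(bn(U @ W), bn(U @ (a * W)), atol=1e-6))   # True
```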
4. Regularizer
During training, every mini-batch has a slightly different distribution, so a single training example's effect on the model changes whenever its batch changes, which is essentially all the time. In other words, the output for a given example is non-deterministic during training, and this property acts as a regularizer.
Experiments
The paper evaluates the effect of the Batch Normalization layer on several models across classification tasks and achieves remarkable performance on most of them. Applying batch normalization makes training more stable in the initial steps, which leads to faster convergence and lower overall training time.
Outro
Batch normalization is widely used in the textbook I've been studying, and it was quite interesting to read this research paper to understand its implementation details and mathematical background. Although there are some downsides to using BN layers, I think they are still a pretty good solution for handling internal covariate shift.