
Learning representations by back-propagating errors - Research Analysis

장민스기 2021. 7. 20. 12:55



1. Introduction

Learning representations by back-propagating errors, written by David E. Rumelhart, Geoffrey E. Hinton & Ronald J. Williams in 1986, introduced the very first idea of back-propagation for training multi-layer perceptrons (MLPs). In this short article, I'll explain some core concepts of the paper and focus on how each equation is derived from the previous ones.

 

2. Preliminary

 

As the hidden layers of perceptrons get deeper and wider to form a complicated neural network, there needs to be a time-efficient algorithm to calculate how much each perceptron and weight influences the total error. Before we can calculate this, we first need to know how an input passes through the layers and reaches the output.

Image of Neural Network - https://www.kdnuggets.com/2019/11/designing-neural-networks.html

As input data enters the input layer, it is propagated to the hidden layers level by level. The values are computed sequentially until the final output comes out of the output layer. When the output of a previous perceptron is passed as the next perceptron's input, it is multiplied by the weight of the edge that connects the two perceptrons.

The paper generalizes this case by defining node $i$ as the previous perceptron and node $j$ as the next one. The weight of the edge between these nodes is then $w_{ji}$. We also define the output of node $i$ as $y_i$ and the aggregated input of node $j$ as $x_j$.

The relationship describing how values are computed between layers is given by the following equation.

$$ (1)\qquad x_j = \sum_{i}^{}y_iw_{ji} $$

The paper also introduces a function that represents the relationship between $x_j$ and $y_j$.

$$ (2)\qquad y_j = \frac{1}{1+e^{-x_j}} $$

This function is applied after the aggregation of the lower layer's outputs, and its particular form simplifies the calculation, as will be explained soon.
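To make equations (1) and (2) concrete, here is a minimal NumPy sketch of a single forward step through one layer. The layer sizes, input values, and weight matrix are made up purely for illustration and do not come from the paper.

```python
import numpy as np

def sigmoid(x):
    # Equation (2): y_j = 1 / (1 + exp(-x_j))
    return 1.0 / (1.0 + np.exp(-x))

def forward_layer(y_prev, W):
    # Equation (1): x_j = sum_i y_i * w_ji  (row j of W holds the weights w_ji)
    x = W @ y_prev
    # Equation (2): squash the aggregated input into the unit's output
    return sigmoid(x)

# Toy example: 3 inputs feeding 2 units (all numbers are arbitrary)
y_input = np.array([0.5, -1.0, 2.0])
W = np.array([[0.1, -0.2, 0.3],
              [0.4,  0.5, -0.6]])
print(forward_layer(y_input, W))
```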

 

3. Back-Propagation

The main purpose of this method is to calculate the total error of the current model and to update every edge weight so as to minimize that total error. The aggregated error of the entire model is calculated as below.

$$ (3)\qquad E = \frac{1}{2}\sum_c^{}\sum_j^{}(y_{j, c}-d_{j, c})^2 $$
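As a quick sketch of equation (3), the total error can be computed as follows; the output values and desired values below are hypothetical numbers chosen only to show the shape of the computation (cases along the rows, output units along the columns).

```python
import numpy as np

def total_error(outputs, targets):
    # Equation (3): E = 1/2 * sum over cases c and output units j of (y_jc - d_jc)^2
    return 0.5 * np.sum((outputs - targets) ** 2)

# Hypothetical outputs y_jc and desired values d_jc for 2 cases and 3 output units
y = np.array([[0.8, 0.2, 0.6],
              [0.1, 0.9, 0.4]])
d = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
print(total_error(y, d))
```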

Forward-propagating the values and calculating the total error is not that hard, because it amounts to applying the equations above multiple times. However, back-propagating the total error to calculate every weight's gradient is a little more complicated.

Our ultimate goal is to calculate $\frac{\partial {E}}{\partial {w_{ji}}}$ for every weight on every layer, which tells us how much each weight causes the total error to go up or down. This can be expanded by applying the chain rule.

$$ (4)\qquad \frac{\partial {E}}{\partial {w_{ji}}} = \frac{\partial {E}}{\partial {x_{j}}} \cdot \frac{\partial {x_{j}}}{\partial {w_{ji}}} = \frac{\partial {E}}{\partial {x_{j}}} \cdot y_i $$

$\frac{\partial {x_{j}}}{\partial {w_{ji}}}$ can be substituted with $y_i$ by partially differentiating equation (1) with respect to $w_{ji}$. And by differentiating equation (2) we get the following equation.

$$ (5)\qquad \frac{dy_j}{dx_j} = y_j(1-y_j) $$
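Equation (5) is easy to check numerically: a central finite difference of the sigmoid at an arbitrary point should match $y_j(1-y_j)$. The point $x = 0.7$ and step size below are arbitrary choices for the check.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Check equation (5) at an arbitrary point with a central finite difference
x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
y = sigmoid(x)
analytic = y * (1 - y)          # equation (5): dy/dx = y * (1 - y)
print(numeric, analytic)        # the two values should closely agree
```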

Now, let's compute $\frac{\partial {E}}{\partial {x_{j}}}$ and finally substitute it into $\frac{\partial {E}}{\partial {w_{ji}}}$.

$$ (6)\qquad \frac{\partial {E}}{\partial {x_{j}}} = \frac{\partial {E}}{\partial {y_{j}}} \cdot \frac{dy_j}{dx_j} = \frac{\partial {E}}{\partial {y_{j}}}y_j(1-y_j) $$

$$ (7)\qquad \frac{\partial {E}}{\partial {w_{ji}}} = \frac{\partial {E}}{\partial {x_{j}}} \cdot y_i = \frac{\partial {E}}{\partial {y_{j}}}y_j(1-y_j)y_i $$

To summarize a little, we can now calculate the partial derivative $\frac{\partial {E}}{\partial {w_{ji}}}$ using only the values of $\frac{\partial {E}}{\partial {y_{j}}}$, $y_j$ and $y_i$. Our last job is to find the equation that expresses $\frac{\partial {E}}{\partial {y_{i}}}$ in terms of $\frac{\partial {E}}{\partial {y_{j}}}$, so that we can traverse all the way back to the input layer. Let's start by applying the chain rule to $\frac{\partial {E}}{\partial {y_{i}}}$ for a certain $j$.

$$ (8)\qquad \frac{\partial {E}}{\partial {y_{i}}} = \frac{\partial {E}}{\partial {x_{j}}} \cdot \frac{\partial {x_{j}}}{\partial {y_{i}}} = \frac{\partial {E}}{\partial {x_{j}}} \cdot w_{ji} \quad (\because equation(1)) $$

If we aggregate equation (8) over all nodes $j$, we get the final equation (9), which lets us continuously back-propagate $\frac{\partial {E}}{\partial {y}}$ and $\frac{\partial {E}}{\partial {w_{ji}}}$ all the way back to the input layer.

$$ (9)\qquad \frac{\partial {E}}{\partial {y_{i}}} = \sum_{j}^{} \frac{\partial {E}}{\partial {x_{j}}} \cdot w_{ji} = \sum_{j}^{} \frac{\partial {E}}{\partial {y_{j}}}y_j(1-y_j)w_{ji} \quad (\because equation(6)) $$

To conclude, if we apply equation (9) and equation (7) from the output layer back to the input layer, we can calculate how much each weight contributes to the model's total error. For the initial value of $\frac{\partial {E}}{\partial {y_{j}}}$ at the output layer, differentiating the error function (equation (3)) gives $y_j-d_j$.
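Putting equations (6), (7), and (9) together, here is a minimal sketch of one forward and one backward pass for a single training case. It follows the derivation above rather than any reference implementation; the 2-3-2 network size, random weights, input, and target are all made up, and the `forward`/`backward` names are my own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(y0, weights):
    # Forward pass (equations (1) and (2)); keep every layer's outputs for the backward pass
    ys = [y0]
    for W in weights:
        ys.append(sigmoid(W @ ys[-1]))
    return ys

def backward(ys, weights, d):
    # Output layer: dE/dy_j = y_j - d_j (derivative of equation (3))
    dE_dy = ys[-1] - d
    grads = []
    for W, y_in, y_out in zip(reversed(weights), reversed(ys[:-1]), reversed(ys[1:])):
        dE_dx = dE_dy * y_out * (1 - y_out)      # equation (6)
        grads.append(np.outer(dE_dx, y_in))      # equation (7): dE/dw_ji = dE/dx_j * y_i
        dE_dy = W.T @ dE_dx                      # equation (9): propagate to the layer below
    return list(reversed(grads))                 # gradients ordered from input to output

# Made-up 2-3-2 network and a single training case
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(2, 3))]
y0, d = np.array([0.2, 0.9]), np.array([1.0, 0.0])

ys = forward(y0, weights)
grads = backward(ys, weights, d)
print([g.shape for g in grads])   # one gradient matrix per weight matrix
```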

 

4. Supplementary

The main idea of back-propagation is finished in chapter 3, and it's time to look over some additional content in the paper. The authors say that, in order to increase the speed of convergence and keep the algorithm simple, it is preferable to use the simplest version, which uses $\frac{\partial {E}}{\partial {w}}$ obtained by accumulating $\frac{\partial {E}}{\partial {w_{ji}}}$ over all the input-output cases. This algorithm performs gradient descent, changing each weight by an amount proportional to the accumulated $\frac{\partial {E}}{\partial {w}}$.

$$ \Delta w = -\epsilon \frac{\partial E}{\partial w} $$

Moreover, a more refined method can be obtained by treating the weight change as a velocity, which increases the convergence speed without losing simplicity or locality.

$$ \Delta w(t) = -\epsilon \frac{\partial E}{\partial w(t)} + \alpha \Delta w(t-1) $$

In the above equation, $t$ increases by 1 after every epoch ends, and $\alpha$ is a hyperparameter that determines how much the previous weight change contributes to the current one.
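The following sketch illustrates both update rules on a made-up one-dimensional error function $E(w) = (w-3)^2$, so that the whole loop stays self-contained; the values of $\epsilon$ and $\alpha$ are hypothetical, and setting $\alpha = 0$ recovers the plain rule above.

```python
# Toy illustration of the weight-update rules on E(w) = (w - 3)^2, so dE/dw = 2 * (w - 3)
def dE_dw(w):
    return 2.0 * (w - 3.0)

epsilon, alpha = 0.1, 0.8        # hypothetical learning rate and momentum
w, delta_w = 0.0, 0.0            # weight and previous change Delta_w(t-1)

for t in range(50):              # t advances once per epoch
    delta_w = -epsilon * dE_dw(w) + alpha * delta_w   # momentum rule; alpha = 0 gives the plain rule
    w += delta_w

print(w)   # converges towards the minimum at w = 3
```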

 

5. Conclusion

This final part is not an explanation of the original paper but my personal thoughts. If it were not for this short, four-page paper introducing the core concept of training neural networks, there could not have been such development in deep learning and neural network technology. While reading the paper, I did not find any unnecessary equations or paragraphs in its introduction of back-propagation, and each equation was written and expressed simply enough for me to understand without heavy mathematical knowledge. I strongly recommend that everyone interested in deep learning read the paper in its original form and absorb the many details that are not explained in this article.