Article Introduction
YOLO, which stands for You Only Look (not Live!) Once, became the paper's title because it captures the main advantage of the new object detection model. Designed by Joseph Redmon et al., YOLO caused a sensation in the object detection realm by pushing prediction speed to almost real-time while trading away only a small amount of precision. In this article I'll explain the main architecture of the YOLO model and the various techniques used while training it.
You Only need to Look Once?
Before YOLO came out, there were already many CNN-based algorithms that detected objects in an image and located them inside bounding boxes. Systems like DPM (Deformable Parts Models) and R-CNN approached detection by repurposing classifiers. The two run quite differently, but both take a long time to process a single picture, because they either apply the same function to many different regions (sliding windows) or pass through complex post-processing pipelines. Because of this slowness, YOLO became popular wherever fast prediction is required, such as detecting objects in video.
But how can YOLO be so much faster than other models? YOLO doesn't push the image through a multi-stage pipeline or apply the same method several times to a single image. It simply feeds the input image into a trained CNN and reads off the output. This simple structure, in which a single model is applied once to each input image, is what makes YOLO astonishingly fast.
The Confidence Score
YOLO first divides the image into an $S \times S$ grid. Each grid cell is responsible for detecting objects whose center falls inside that cell. Each cell predicts $B$ boxes along with a confidence score for each box, so boxes that actually contain an object should receive high confidence scores. Redmon et al. define the confidence score as below, where IOU means Intersection over Union: the area of overlap divided by the area of union.
$$Confidence = Pr(Object) * IOU_{truth}^{pred}$$
$$IOU = \frac{Area \; of \; Intersection}{Area \; of \; Union}$$
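To make the IOU concrete, here is a minimal Python sketch of the computation for two boxes given in (x1, y1, x2, y2) corner coordinates. The function name and box format are my own choices for illustration, not something fixed by the paper.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes in (x1, y1, x2, y2) corner format."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: two partially overlapping 10x10 boxes
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```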
Predicting one box requires 5 outputs: x, y, w, h, and confidence. So each cell produces $5 \times B$ outputs for its boxes. On top of that, each cell also predicts the conditional class probabilities $C$, defined below.
$$C=Pr(Class_i|Object)$$
If we combine those two formulas above, we are able to get Class-Specific Confidence scores.
$$Confidence * C = Pr(Class_i|Object) * Pr(Object) * IOU_{truth}^{pred} = Pr(Class_i)*IOU_{truth}^{pred}$$
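To get a feel for the numbers: with the settings the paper uses for PASCAL VOC ($S=7$, $B=2$, $C=20$ classes), the output is a $7 \times 7 \times 30$ tensor, since each cell carries $B \times 5 + C = 30$ values. Below is a rough sketch of reading the class-specific confidence out of one cell; the slicing layout is my assumption for illustration, not necessarily how the original implementation orders its outputs.

```python
import numpy as np

S, B, C = 7, 2, 20                       # grid size, boxes per cell, classes (PASCAL VOC)
pred = np.random.rand(S, S, B * 5 + C)   # stand-in for the network output, shape (7, 7, 30)

cell = pred[3, 4]                        # one grid cell's prediction vector (30 values)
boxes = cell[:B * 5].reshape(B, 5)       # per box: x, y, w, h, confidence
class_probs = cell[B * 5:]               # Pr(Class_i | Object), length C

# Class-specific confidence = Pr(Class_i | Object) * Pr(Object) * IOU,
# where the box confidence already estimates Pr(Object) * IOU
class_specific = boxes[:, 4:5] * class_probs[np.newaxis, :]   # shape (B, C)
print(class_specific.shape)  # (2, 20)
```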
CNN Network
The network design was inspired by Google's GoogLeNet, but simplifies the Inception modules into alternating $1 \times 1$ reduction layers followed by $3 \times 3$ convolutional layers. The full model has 24 convolutional layers, while the faster version (Fast YOLO) uses 9.
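As a rough sketch of what one "$1 \times 1$ reduction layer followed by a $3 \times 3$ convolutional layer" pair could look like (written here in PyTorch for convenience; the original implementation uses the Darknet framework, and the channel sizes below are just an example):

```python
import torch.nn as nn

# One "reduction" pair as used repeatedly in YOLO's backbone:
# a 1x1 conv shrinks the channel count, then a 3x3 conv re-expands it.
def reduction_block(in_ch, mid_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1),             # 1x1 reduction layer
        nn.LeakyReLU(0.1),                                    # the paper uses leaky ReLU with slope 0.1
        nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1),  # 3x3 convolutional layer
        nn.LeakyReLU(0.1),
    )

# e.g. 512 -> 256 -> 512 channels, as in the middle of the network
block = reduction_block(512, 256, 512)
```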
Pretraining
Before detection training, YOLO pretrains the network on the ImageNet classification dataset. For this pretraining the input size is $224 \times 224$, and the last 4 convolutional layers are swapped out for an average pooling layer and a fully connected layer, so only the first 20 convolutional layers are pretrained. Check out the figure of the CNN architecture below.
After pretraining, the input resolution is increased to $448 \times 448$ to preserve the fine-grained information needed for accurate detection. As additional preprocessing, each bounding box's width and height are divided by the image's width and height so that they fall between 0 and 1. Similarly, the object center's x and y are expressed as offsets within the responsible grid cell, so they are also bounded between 0 and 1.
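A minimal sketch of this preprocessing, assuming a hypothetical normalize_box helper and $S = 7$:

```python
def normalize_box(cx, cy, w, h, img_w, img_h, S=7):
    """Width/height relative to the image; center as an offset inside its grid cell."""
    cell_w, cell_h = img_w / S, img_h / S
    col, row = int(cx // cell_w), int(cy // cell_h)   # which cell is responsible
    x = cx / cell_w - col                             # offset within the cell, in [0, 1)
    y = cy / cell_h - row
    return row, col, x, y, w / img_w, h / img_h

# A 100x50 box centered at (250, 130) in a 448x448 image
print(normalize_box(250, 130, 100, 50, 448, 448))
# (2, 3, 0.90625, 0.03125, 0.223..., 0.111...)
```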
Sum-squared Error
Sum-squared error is used as the optimization objective, but its simplicity comes with some drawbacks. A plain sum of squared errors weights localization error and classification error equally. Moreover, most grid cells contain no object, so gradient descent pushes the confidence scores of those empty cells straight toward 0, overpowering the gradient from the cells that actually contain objects. This can make training diverge early. To solve this problem, YOLO weights the loss from bounding box coordinate predictions heavily and weights the loss from confidence predictions of empty boxes lightly, using the following two parameters.
$$\lambda _{coord}=5, \lambda _{noobj}=0.5$$
Sum-squared error also weights errors in large boxes and small boxes equally, even though a small deviation matters much more for a small box. To partially reflect this, YOLO predicts the square root of the box width and height instead of the raw values.
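A simplified, per-box sketch of how these two tweaks shape the loss (the array layout and function name are my own for illustration):

```python
import numpy as np

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5

def box_and_confidence_loss(pred, target, has_object):
    """Per-box sum-squared error with YOLO's re-weighting.

    pred, target: arrays [x, y, w, h, confidence]; has_object: whether this
    predictor is responsible for a ground-truth object.
    """
    if has_object:
        coord = np.sum((pred[:2] - target[:2]) ** 2)
        # square roots soften the penalty on large boxes relative to small ones
        size = np.sum((np.sqrt(pred[2:4]) - np.sqrt(target[2:4])) ** 2)
        conf = (pred[4] - target[4]) ** 2
        return LAMBDA_COORD * (coord + size) + conf
    # empty cells only contribute a down-weighted confidence term (target confidence is 0)
    return LAMBDA_NOOBJ * (pred[4] - 0.0) ** 2
```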
Prediction Responsibility
YOLO predicts multiple boxes per grid cell, but at training time it assigns only one predictor to be "responsible" for each object: the one whose prediction has the highest IOU with the ground truth. This leads each cell's predictors to specialize, with each one getting better at certain sizes and aspect ratios of objects.
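In code, this assignment is just an argmax over the predictors' IOUs against the ground-truth box; a sketch reusing the iou helper from earlier (boxes again in corner format):

```python
def responsible_predictor(predicted_boxes, ground_truth_box):
    """Return the index of the one predictor in this cell that is 'responsible'
    for the object: the box with the highest IOU against the ground truth."""
    ious = [iou(box, ground_truth_box) for box in predicted_boxes]  # iou() as sketched above
    return ious.index(max(ious))
```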
Error Function
As the error function is quite long to explain in full, we'll highlight only the main points. The symbol $\mathbb{1}_{i}^{obj}$ denotes whether an object appears in cell $i$, and $\mathbb{1}_{ij}^{obj}$ denotes that the $j$th bounding box predictor in cell $i$ is responsible for that detection.
Looking at the individual terms, $\mathbb{1}_{i}^{obj}$ multiplies the classification term, so the model is penalized for classification error only if an object actually exists in that cell. Likewise, the terms with $\mathbb{1}_{ij}^{obj}$ penalize bounding box coordinate error only for the predictor that is responsible for that object. These indicator terms make the error function more reasonable.
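For reference, here is the full error function from the paper, assembled from the pieces discussed above. Here $\hat{C}_i$ is the predicted box confidence and $\hat{p}_i(c)$ the predicted conditional class probability.

$$
\begin{aligned}
\mathcal{L} =\; & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
+\; & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
+\; & \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
+\; & \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$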
Other processes
A few more small techniques are applied to avoid overfitting and instability, including a learning rate schedule, dropout, and data augmentation. The learning rate starts from a low value ($10^{-3}$) and is raised slowly to $10^{-2}$ at the beginning to prevent early divergence, then decreased in steps down to $10^{-4}$ over the rest of training. A dropout layer is applied right after the first fully connected layer to prevent co-adaptation between layers. For data augmentation, the same images are randomly transformed slightly in scale, translation, and exposure/saturation before being fed to the model.
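A rough sketch of that learning rate schedule as a Python function; the epoch counts (75, 30, 30) are the ones reported in the paper, while the number of warm-up epochs is my own assumption since the paper only says the learning rate is raised over "the first epochs".

```python
def learning_rate(epoch, warmup_epochs=5):
    """Piecewise learning-rate schedule roughly matching the paper's description."""
    if epoch < warmup_epochs:
        # warm up slowly from 1e-3 to 1e-2 so the model does not diverge early on
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < warmup_epochs + 75:
        return 1e-2        # 75 epochs at 1e-2
    if epoch < warmup_epochs + 75 + 30:
        return 1e-3        # 30 epochs at 1e-3
    return 1e-4            # final 30 epochs at 1e-4
```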
So, Is YOLO great?
YOLO mostly kept its mAP from falling while running roughly 3 to 40 times faster than the other recent, well-performing object detection models. This is a remarkable advantage for the object detection realm, since it makes detection usable on high-frame-rate video. However, YOLO also has some disadvantages. It can only predict two boxes and a single class per grid cell, so images crowded with small objects, like flocks of birds, are a problem for YOLO. Also, compared to other models, YOLO's main error source is incorrect localization: it may detect an object well, but the size or position of its box can be less accurate.
Article Conclusion
I first read about YOLO in a book and was recommended to look at the original paper. The paper is well organized and explains the model clearly. Moreover, the comparison with other models takes up almost 4 pages of the paper, and this detailed analysis convinced me that YOLO really does outperform the other models in terms of speed. Improved versions of YOLO have since been published, so I plan to take a look at them and see what has changed.