Article Introduction
YOLO was an outstanding CNN model specialized in object detection with extremely low latency. However, it had some drawbacks, including the inability to discriminate multiple small objects in an image (details are in the previous article). To improve on this model, YOLOv2 and YOLO9000 were newly introduced by Joseph Redmon et al. in 2016.
YOLOv2 & YOLO9000
YOLOv2 is an improved version of YOLO that adopts several techniques, such as passthrough layers. Its main difference comes from the architecture of the CNN model, and both general accuracy and adaptability have increased compared to the original YOLO.
YOLO9000 is based on YOLOv2 and focuses on merging different types of data. As the name says, YOLO9000 is capable of detecting more than 9000 classes of objects, including classes that weren't included in the detection training dataset. YOLO9000 achieves this by using data from both the classification task and the detection task, combining the two datasets' labels using a WordTree. This will be explained in detail later in this article.
Getting Better
YOLOv2 adopts several techniques to improve its performance as measured by mAP. We'll introduce the ones that seem most important in detail.
1. Basic implementation
Both Batch Normalization and High Resolution Classification are used in YOLOv2. Batch normalization removes the need for other forms of regularization, such as dropout, and boosts mAP simply by being applied after every conv layer.
Also, YOLOv2 maintains the high resolution of 448 X 448 while pre-training on the classification dataset for 10 epochs, which leads to an mAP increase of about 4%.
2. Anchor Boxes
While YOLO detects objects by predicting box coordinates directly, YOLOv2 uses pre-defined Anchor Boxes and expresses an object's location as offsets from them. This makes learning easier for the convolutional model if the prior anchors are well selected. Moreover, YOLOv2 uses a resolution of 416 X 416, which makes the output feature map 13 X 13 instead of the original 14 X 14 (YOLO downsamples images by a factor of 32). The odd size makes it easier for the model to learn to detect objects centered in the image, which is usually the case for large objects.
However, using anchor boxes decreases mAP slightly, which may seem undesirable. In exchange, the recall of the predictions is boosted significantly, which means the model has room to be improved further.
As mentioned above, we need priorly selected anchor boxes for YOLOv2 to use while training. While some models use hand-picked priors, YOLOv2 automatically selects the prior anchor boxes using the K-means Clustering algorithm. To find good priors that generalize well in terms of IOU score, the distance metric is adjusted as below.
$$d(box, centroid)=1-IOU(box, centroid)$$
For the value of k, k=5 is selected as a good tradeoff for general cases, and to achieve higher mAP, k=9 is preferable.
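To make the clustering concrete, here is a minimal sketch of k-means with the IOU-based distance above, operating only on box widths and heights. The function names are my own, not from the paper's code:

```python
import random

def iou_wh(box, centroid):
    """IOU of two boxes aligned at a common center, given as (w, h) pairs."""
    w1, h1 = box
    w2, h2 = centroid
    intersection = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + w2 * h2 - intersection
    return intersection / union

def kmeans_anchors(boxes, k, iterations=20, seed=0):
    """Cluster (w, h) boxes with the distance d = 1 - IOU(box, centroid)."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iterations):
        # Assign each box to the centroid with the smallest 1 - IOU distance.
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = min(range(k), key=lambda i: 1 - iou_wh(box, centroids[i]))
            clusters[best].append(box)
        # Update each centroid to the mean width/height of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = (sum(w for w, _ in cluster) / len(cluster),
                                sum(h for _, h in cluster) / len(cluster))
    return centroids
```

In the paper this clustering runs over the ground-truth boxes of the training set; here any list of (w, h) pairs will do.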
Although we have selected the priors, constraints on the locations of those boxes are needed to maintain stability while training the model. Without constraints, the boxes could end up at any point in the image, which makes it hard for the model to learn. To address this, YOLOv2 constrains the offset of each box to fall between 0 and 1 using the sigmoid function ($\sigma$), making it relative to the grid cell rather than the full image. This makes it easy for the cell that detected the object to generalize the anchor offsets. The model predicts five coordinates ($t_x, t_y, t_w, t_h, t_o$), and the actual coordinates ($b_x, b_y, b_w, b_h$) of the box are calculated by the equations below. $p_w$ and $p_h$ are the width and height of the prior box, and $(c_x, c_y)$ is the offset of the grid cell from the top-left corner of the image.
$$b_x = \sigma (t_x)+c_x$$
$$b_y = \sigma (t_y)+c_y$$
$$b_w = p_we^{t_w}$$
$$b_h = p_he^{t_h}$$
$$Pr(object)*IOU(b, object)=\sigma (t_o)$$
This method of constraining coordinates improves mAP by about 5% compared to plain anchor boxes without constraints.
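The equations above can be sketched as a small decoding function. The names are mine, and cell offsets and prior sizes are assumed to be expressed in grid-cell units:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Turn raw predictions (t_*) into box coordinates (b_*), in grid-cell units.

    (cx, cy) is the top-left offset of the grid cell; (pw, ph) is the prior size.
    """
    bx = sigmoid(tx) + cx    # center x, constrained to stay inside the cell
    by = sigmoid(ty) + cy    # center y
    bw = pw * math.exp(tw)   # width scales the prior multiplicatively
    bh = ph * math.exp(th)   # height
    return bx, by, bw, bh
```

With all t_* equal to 0, the box sits at the center of its cell with exactly the prior's size: decode_box(0, 0, 0, 0, cx=3, cy=4, pw=2, ph=5) returns (3.5, 4.5, 2.0, 5.0).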
3. Fine-Grained Features
YOLOv2 predicts detections on a 13 X 13 feature map. This may be enough for detecting large objects, but a more fine-grained feature map would make detecting smaller objects easier. This idea is implemented using a Passthrough Layer.
The passthrough layer concatenates features of different resolutions by stacking adjacent spatial features into different channels. Similar in spirit to the identity mappings of ResNet, it reduces the width and height by half and stacks each 2 X 2 neighborhood into 4 new feature maps.
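A minimal pure-Python sketch of the reorganization the passthrough layer performs (often called space-to-depth); the real layer then concatenates the result with the coarser feature map:

```python
def passthrough(feature_map):
    """Reorganize a [C][H][W] feature map into [4*C][H/2][W/2].

    Each 2x2 spatial neighborhood is split across 4 new channels,
    so no information is lost while the resolution is halved.
    """
    C = len(feature_map)
    H = len(feature_map[0])
    W = len(feature_map[0][0])
    out = [[[0.0] * (W // 2) for _ in range(H // 2)] for _ in range(4 * C)]
    offsets = [(0, 0), (0, 1), (1, 0), (1, 1)]
    for c in range(C):
        for i in range(H // 2):
            for j in range(W // 2):
                for k, (di, dj) in enumerate(offsets):
                    out[4 * c + k][i][j] = feature_map[c][2 * i + di][2 * j + dj]
    return out
```

In YOLOv2, the 26 X 26 X 512 map becomes 13 X 13 X 2048 this way, so it can be concatenated channel-wise with the 13 X 13 output features.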
4. Multi-Scaled Training
One constraint of the original version of YOLO was that the size of its input image was fixed at 448 X 448. For YOLOv2, the input size changes during training. Every 10 batches, YOLOv2 chooses a new input size that is a multiple of 32 (the downsample factor) in the range between 320 and 608. This trains YOLOv2 on images of many resolutions, providing an easy tradeoff between prediction speed and accuracy.
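The resizing schedule is easy to sketch. This picks a new square input size every 10 batches; uniform sampling is my assumption, and the official code may sample differently:

```python
import random

def sample_input_size(rng):
    """Pick a random input size that is a multiple of 32, between 320 and 608."""
    return 32 * rng.randint(10, 19)  # randint is inclusive: 320, 352, ..., 608

def training_sizes(num_batches, resize_every=10, seed=0):
    """Yield the input size for each batch, resampling every `resize_every` batches."""
    rng = random.Random(seed)
    size = sample_input_size(rng)
    for batch in range(num_batches):
        if batch > 0 and batch % resize_every == 0:
            size = sample_input_size(rng)
        yield size
```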
Getting Faster
To increase the prediction speed of YOLO, YOLOv2 adopts some strategies, which will be explained in brief.
Darknet-19
Compared to previous object detection frameworks, which used VGG-16 as the base feature extractor, YOLOv2 uses a newly created classification model named Darknet-19. The problem with VGG-16 was that it was accurate but quite slow, requiring more than 30 billion floating point operations for a single pass. Instead, the original YOLO used a custom network based on the GoogLeNet architecture, which required only 8.5 billion operations, trading away a little accuracy. Darknet-19 is designed as a mixture of the VGG models and the Network in Network (NiN) model. This combination reduces Darknet-19 to only 5.58 billion operations while reaching a top-5 accuracy of 91.2%.
Training for classification & detection
Training for classification is based on the ImageNet dataset with 1000 classes. After training on images sized 224 X 224, we fine-tune YOLOv2 at the higher resolution of 448 X 448 for only 10 epochs.
After training on the classification task, the model is trained on the detection task, with the last convolutional layer replaced by three 3 X 3 conv layers with 1024 filters each, followed by a 1 X 1 conv layer with the number of outputs needed for detection. Also, passthrough layers are added to utilize fine-grained features.
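As a sanity check on the detection head's shape, the final 1 X 1 conv must output, per grid cell, 5 coordinates plus the class scores for each anchor. A tiny sketch with VOC's 20 classes and 5 anchors assumed:

```python
def detection_outputs(grid=13, anchors=5, classes=20):
    """Total number of values produced by the final 1x1 conv layer.

    Each anchor predicts 5 values (tx, ty, tw, th, to) plus `classes` scores.
    """
    per_cell = anchors * (5 + classes)
    return grid * grid * per_cell
```

For VOC, each cell needs 5 * (5 + 20) = 125 filters, i.e. 13 * 13 * 125 = 21125 values in total.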
YOLO9000
The term YOLO9000 comes from the model's capability of detecting more than 9000 classes, even if those classes don't appear in the detection training dataset. To achieve this remarkable ability, YOLO9000 is trained on a dataset composed of both classification and detection datasets.
The main challenge of this method is that the labels differ considerably in granularity between the datasets. For example, detection labels include general words like dog and cat, while classification labels are more finely divided, like Yorkshire terrier or Munchkin.
WordTrees
To overcome this problem, YOLO9000 builds a WordTree using WordNet. WordNet is a graph-structured network that contains all labels from ImageNet while representing the relationships between them. To simplify its structure, the WordTree is the part of WordNet that only includes the shortest path from each label to the common root, physical object. We can perform classification using this tree by predicting a conditional probability at every node. The benefit of building this tree is that we can now predict that a certain image contains a dog even if the model is not sure whether it's a Yorkshire terrier or a Norfolk terrier. The absolute probability of a node is calculated by multiplying the conditional probabilities of all nodes on the path from the root.
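The path-product rule can be sketched with a toy parent map. The labels and conditional probabilities below are made up for illustration, not taken from the actual WordTree:

```python
# Hypothetical slice of a WordTree: child -> parent (the root has parent None).
PARENT = {
    "Yorkshire terrier": "terrier",
    "Norfolk terrier": "terrier",
    "terrier": "dog",
    "dog": "physical object",
    "physical object": None,
}

def absolute_prob(label, conditional, parent=PARENT):
    """Multiply conditional probabilities along the path from `label` to the root."""
    p = 1.0
    node = label
    while node is not None:
        p *= conditional[node]
        node = parent[node]
    return p
```

Even if the conditional split between Yorkshire and Norfolk terrier is uncertain, the probability of dog stays high, which is exactly the graceful degradation described above.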
The first probability, $Pr(\text{physical object})$, is given by YOLOv2's objectness predictor. The detector predicts both bounding boxes and the probabilities inside the WordTree.
Combining Datasets
By creating WordTrees, we can now combine two datasets that have different levels of label granularity. YOLO9000 is trained this way on a dataset composed of the COCO detection dataset and the ImageNet classification dataset. YOLO9000 backpropagates normally when it sees a detection image, but calculates its classification error only down to the label's level. For example, if YOLO9000 sees an image labeled Flower, the loss shouldn't differ whether the model predicted the image as Lily or Cosmos. When YOLO9000 sees a classification image, it finds the bounding box with the highest probability for that class and then calculates the loss based solely on the WordTree.
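The "only down to the label's level" rule can be sketched by scoring predictions only along the path from the labeled node up to the root. This is a toy negative-log-likelihood version with made-up labels; the paper's actual loss is squared error:

```python
import math

# Toy parent map for an image labeled "Flower" (illustrative only).
PARENT = {
    "Lily": "Flower",
    "Cosmos": "Flower",
    "Flower": "physical object",
    "physical object": None,
}

def wordtree_loss(label, conditional_pred, parent=PARENT):
    """Negative log-likelihood over the path from `label` up to the root.

    Nodes below the label's level contribute nothing, so predicting
    Lily vs Cosmos makes no difference for an image labeled Flower.
    """
    loss = 0.0
    node = label
    while node is not None:
        loss += -math.log(conditional_pred[node])
        node = parent[node]
    return loss
```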
The only drawback of this method is that the model fails to predict well on classes that are not well grouped by WordNet's synsets. However, this is not so bad, because performance would improve if we enhanced WordNet or replaced it with a better word-grouping module.
Article Conclusion
The remarkable prediction speed and accuracy were improved once more in YOLOv2 and YOLO9000. Even more, YOLO9000 can detect objects it was never directly trained on. This enhancement over the original YOLO was very inspiring to me, and I'm really looking forward to seeing this method of combining datasets gradually improve deep learning models in other areas.