# Review: YOLOv3

YOLOv3从前两代发展而来，融合了老式的YOLO系列的one-stage的特点和一些SOTA模型的tricks。要了解YOLOv3，最好是先读YOLOv1中关于回归概念和损失函数的描述，YOLOv2基本有了YOLOv3的形状但是还没有ResNet的思想，不如直接看YOLOv3。YOLOv3模型有这几大重点：

SOTA的锚框方法。虽然是YOLOv2就有的，但是YOLOv3做了改进，anchor-based的方法成全了YOLOv3；

• 这篇文章在英文版的review基础上修改得来，写的时候觉得之前review的是个什么东西，重点只把握一部分，没有认识到网络结构和损失函数的重要性。所以最后文章把原文放底下，细节算新写的。原文还是可以看的，当作熟悉整体思路，以及看YOLOv3的测试表现的描述。
• 理解YOLOv3重在理解网络结构的层，特别是检测层的功能。
• 我想的是，一定要展现YOLOv3为什么从YOLOv2能变强这么多，因为它俩实际上很像，但是些微的调整就带来了大的进步，ResNet的shortcut结构为什么能带来更好的特征描述，这些还要从ResNet讲起。话说我为什么要把残差网络当作我的第一篇review的论文...叫残暴起点算了。
• 文章里的很多东西，是论文里没写到的，细节一般来源于按照前人的from-scratch系列重写的代码
• 引用很精彩，用来做扩展阅读挺好。

## 网络结构

◎ Image credit: https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b ◎ Image credit: https://blog.csdn.net/leviopku/article/details/82660381
• 没有池化层。那怎么实现特征图尺寸变换呢？是卷积核的stride的改变来实现缩小的。YOLOv3和YOLOv2一样，输入经过五次缩小缩小比例为$1/2^5=1/32$，所以要求输出是32的倍数。如果输入为416×416，则最小的特征图是13×13（考虑属性的维度则是13×13×255）的。
• 没有全连接层。那输出不是scores了吗？对的，one-stage的YOLOv3，将边界框的定位和目标的分类当作回归问题，融合为一步了，输出的tensor里，既包括所有候选的位置信息的预测，也包括平常情况下全连接层所输出的分类分数

## 多尺度的输出

YOLOv3在feature map的每一个格子都会预测边界框。将目标的边界框的定位当作回归问题做，通过减小锚框和真实边界框的损失函数，学习预测锚框缩放大小和偏移。YOLOv3在一个特征图格子会预测三个候选框，multi-scale是通过三个拼接成的不同scale的特征图实现的，会有很多候选框，这三个scale是在网络中不同位置实现的。

### 检测层

 1  prediction = prediction.view(batch_size, grid_size*grid_size*num_anchors, bbox_attributes) 

• 第一维表示图片的ID，大小是batch_size

• 第二维是这张图片在这个scale的边界框ID，总数为此次feature map所有的边界框数目，即$n_{fm}\times n_{fm}\times B$；

• 第三维是所有的按顺序重排的一个边界框的所有属性项，总数是5+C=85=85。

◎ Credit: https://blog.paperspace.com/how-to-implement-a-yolo-object-detector-in-pytorch/

### 尺度变换

◎ 第一次预测结果 ◎ 第二次预测结果 ◎ 第三次预测结果

## 锚框方法

◎ Bounding box attributes

**锚框(anchors)是一类设定好的先验框(priors)，YOLOv3网络预测出对数空间上的转换关系，或者说是偏移，先验框因此转换为后验框，即候选。**这一句话有两个概念，第一个是锚框，YOLOv3定义了九个锚框，它们是在数据集上聚类得来的：

anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326


 1 2 3 4  # Sigmoid transform: centre_X, centre_Y, objectness score prediction[:, :, 0] = torch.sigmoid(prediction[:, :, 0]) prediction[:, :, 1] = torch.sigmoid(prediction[:, :, 1]) prediction[:, :, 4] = torch.sigmoid(prediction[:, :, 4]) 

◎ 转换到预测框

## 损失函数

◎ Credit: https://blog.csdn.net/qq_30159015/article/details/80446363
• 第一行为总平方误差，是位置预测的损失函数；
• 第二行为根号总平方误差，是宽度和高度的损失函数；
• 第三，四行对目标分数或者置信度用总平方误差作为损失函数；
• 第五行对分类分数用总平方误差作为损失函数；

YOLOv3对后三行做了修改，将总平方误差替换为分类任务中更好用的交叉熵误差。也就是说，YOLOv3中的目标置信度和分类分数通过逻辑回归实现了。同时，每一个真实框，只匹配一个边界框，也就是IoU最大的那个。

## 细节和问题

offsets为什么constrain到[0, 1]上而不是[-1, 1]上？因为是采用对格子左上角的相对坐标，一个格子只管右下方的坐标就行。(issue)

## 原文

### Abstract

We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320 × 320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 AP 50 in 51 ms on a Titan X, compared to 57.5 AP 50 in 198 ms by RetinaNet, similar performance but 3.8× faster. As always, all the code is online at https://pjreddie.com/yolo/.

### Aim

There is an unfinished issue existing in the second generation: poor performance in detection of small objects or detection of a group of small objects. The author managed to solve this issue and promote accuracy while still maintaining the speed of the model. He aimed to provide the information of what he made to make YOLOv2 better and how he did it.

### Summary

The author first shared the tricks he used to improve the performance.

• Bounding box prediction. YOLOv3 addresses the prediction of bounding boxes as a regression problem. The loss function is the MSE. And the gradient is the ground truth value minus the prediction: $\hat{t}_b-t_b$, where $t_b$ is the tensor of the coordinates for each bounding box. The tensor is used to calculate the predictions as:

The objectness score is 1 if the bounding box prior overlaps a ground truth object more than any other priors. If the box prior overlaps the ground truth by over 0.5 but is not the most, the prediction is not counted.

• Class prediction. Each box now predicts the classes using multilabel classification. Softmax is substituted with binary cross-entropy loss function. This modification is suitable on more complex datasets. Softmax assumes that each class is independent while some dataset contains classes which are overlapping (like Woman and Person)

• Prediction across scales. The author used a similar concept to SPP: adopting the feature extractor and the additional convolutional layers to encode a 3D tensor ($N\times N\times [3*(4+1+80)]$) prediction (at three scales) including the bounding box (4 coordinates), objectness (1 objectness score) and class predictions (80 classes). The algorithm takes the feature map from 2 layers previous and concatenates it with the $2\times$ unsampled feature map. This combined feature map is processed via a few more convolutional layers to predict a tensor at twice the size. The same process is applied to this tensor to create the third scaled feature map. Nine box priors and three scales are chosen via k-means clustering as before.

• Feature extractor. The authors used a new but significantly larger network called Darknet-53. Darknet-53, compared to Darknet-19, adopts residual shortcut connections (ResNet). It is proved to yield state-of-the-art results: Similar performance to ResNet-152 but $2\times$ faster. Also, Darknet-53 can better utilize the GPU.

• Training. They used multi-scale training instead of mining.

Then the author introduced some trials that didn't help.

• Anchor box x,y offset predictions. The linear activation of x,y offset as a multiple of the box width or height can decrease the stability.
• Linear x,y predictions instead of logistic functions lead to reduced mAP.
• Focal loss does not help improve the performance but drops the mAP.
• Dual IoU thresholds and truth assignment (Faster R-CNN) does not lead to good results.

Experiments are performed on COCO dataset. The model YOLOv3 is firstly trained on COCO trainset and then tested on the test set. Results indicate that YOLOv3 is comparable to SSD variants but $3\times$ faster, but behind the RetinaNet. However, YOLOv3 achieve state-of-the-art performance at .5IOU metric. In terms of small object detection, YOLOv3 performs better via multis-scale predictions but is worse while detecting larger objects.

#### My thoughts

The good side is: The paper presents explicit knowledge about what factors contributes to the improvements in YOLOv3's performance. The sections are easy to follow and understand. But compared to the paper of YOLOv2, this paper is less rigorous and precise.

• The authors revealed that he took the tables of performances of backbones from other research instead of reproducing and experimenting them uniformly.
• The authors didn't present all trials they made that failed but the things that they can remember. We may lack information of some potential but crucial issues that leads to poor performance.

#### Comparison

The paper in particular demonstrates the results of the performance at .5IOU metric. The best YOLOv3 model achieved state-of-the-art trade-off between accuracy and speed. It achieves 57.9 $AP_{50}$ in 51 ms on a Titan X, in contrast to 57.5 $AP_{50}$ in 198ms by RetinaNet. YOLOv3 was still not the most accurate detector but the most balanced and the fastest detector at that time.

#### Applications

Similar to YOLOv2, YOLOv3 can be used as the state-of-the-art detector at high FPS in high resolutions. However, in low resolutions or the cased when top accuracy is not necessary, YOLOv2 is much better because Darknet-19 can run at 171 FPS tested with $256\times 256$ input images compared to 78 by Darknet-53. The authors, in particular, addressed this issue, and they demonstrated that distinguishing an IOU of 0.3 from 0.5 is hard for humans. So for machines, .5IOU metric is also enough in object detection.

first commit