2016 · Object detection · part 4 of 4

YOLO. Skip the proposals.

You can keep removing things. Redmon, Divvala, Girshick, and Farhadi notice that even Faster R-CNN’s RPN is a hedge: it produces region proposals that the detection head later scores. If the network is going to predict the boxes anyway, why have two stages at all?

YOLO divides the image into an S×S grid (originally 7×7). Each cell predicts B bounding boxes (originally 2), their confidence scores, and a class distribution. One forward pass, one tensor, done. There is no proposal step, no per-region CNN call, no cropping. The whole thing is just a regression head on top of a backbone, trained end to end.
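As a rough sketch of what "one tensor" means here: with S = 7 and B = 2, and assuming the 20 PASCAL VOC classes the original model was trained on, each cell carries B·5 + C = 30 numbers, so the output is 7×7×30. The per-cell layout below (boxes first, then class probabilities) and the helper names `yolo_output_shape` and `decode_cell` are illustrative assumptions, not the authors’ code.

```python
def yolo_output_shape(S=7, B=2, C=20):
    """Shape of the prediction tensor: each cell holds B boxes
    (x, y, w, h, confidence) plus one C-way class distribution."""
    return (S, S, B * 5 + C)


def decode_cell(cell_vec, row, col, S=7, B=2, C=20):
    """Turn one cell's vector into B detections in normalized image coordinates.

    Assumed layout (hypothetical): [box_0 (5), ..., box_{B-1} (5), class_probs (C)].
    x, y are offsets inside the cell; w, h are fractions of the whole image.
    """
    class_probs = cell_vec[B * 5:B * 5 + C]
    best_class = max(range(C), key=lambda k: class_probs[k])
    detections = []
    for b in range(B):
        x, y, w, h, conf = cell_vec[b * 5:(b + 1) * 5]
        cx = (col + x) / S  # box center, as a fraction of image width
        cy = (row + y) / S  # box center, as a fraction of image height
        score = conf * class_probs[best_class]  # class-specific confidence
        detections.append((cx, cy, w, h, best_class, score))
    return detections


print(yolo_output_shape())  # (7, 7, 30)
```

Note that the class distribution is shared by all B boxes in a cell, which is one way to see why a single cell cannot report two objects of different classes.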

Two costs come with the simplicity. First, each grid cell is responsible for at most a small number of boxes, so small objects clustered close together get missed. Subsequent versions (YOLOv2, v3, v4…) bring back anchors and multi-scale predictions to fix this. Second, the network now emits many overlapping detections per object, because nearby grid cells all see the same thing. Filtering them is the last surviving piece of the pipeline, and it’s the same algorithm everyone has been using all along: non-max suppression.

NMS is short. Sort boxes by confidence; take the highest as a winner; kill every other box of the same class whose IoU with the winner is above a threshold; repeat on what remains. It is the operation every detector in the lineage ends with, and it is why raising the IoU threshold returns piles of duplicates and dropping it merges different objects into one detection.
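The loop described above can be sketched in a few lines of plain Python. This is a minimal greedy version for illustration, assuming boxes are `(x1, y1, x2, y2)` corner tuples and detections are dicts with hypothetical keys `box`, `score`, and `label`. Production detectors typically use a vectorized equivalent.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy per-class non-max suppression; returns the kept detections."""
    remaining = sorted(detections, key=lambda d: d["score"], reverse=True)
    kept = []
    while remaining:
        winner = remaining.pop(0)  # highest-scoring box still alive
        kept.append(winner)
        # Survivors: boxes of another class, or boxes that overlap the winner
        # by no more than the threshold.
        remaining = [
            d for d in remaining
            if d["label"] != winner["label"]
            or iou(d["box"], winner["box"]) <= iou_threshold
        ]
    return kept


dets = [
    {"box": (0, 0, 10, 10), "score": 0.9, "label": "dog"},
    {"box": (1, 1, 11, 11), "score": 0.8, "label": "dog"},   # near-duplicate of the first
    {"box": (20, 20, 30, 30), "score": 0.7, "label": "dog"}, # a separate object
    {"box": (1, 1, 11, 11), "score": 0.6, "label": "cat"},   # different class, never compared
]
print(len(nms(dets, iou_threshold=0.5)))
```

Here `iou_threshold` is the overlap above which a lower-scoring box of the same class is discarded. The two overlapping dog boxes have an IoU of 81/119, roughly 0.68, so whether the second survives depends on which side of that value the threshold sits.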

YOLO’s frame rate on a 2016 GPU is 45 frames per second. Faster R-CNN ran at five.

What came next

The four papers above set the template. Most of what came after is variations on it. RetinaNet kept anchors but introduced focal loss to deal with foreground-background imbalance. SSD did dense detection at multiple scales. Anchor-free methods (CenterNet, FCOS) replaced the anchor library with direct regression of box extents from each cell, simplifying training. EfficientDet swept the design space. DETR (Carion et al., 2020) replaced NMS with a transformer that emits a fixed-size set of detections directly, trained with bipartite matching to ground truth.
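To make the focal-loss idea mentioned above concrete, here is a minimal single-prediction sketch of the binary form, FL(p_t) = −α_t (1 − p_t)^γ log(p_t). The defaults γ = 2 and α = 0.25 are the values reported in the RetinaNet paper; the function name and scalar interface are assumptions made for illustration, not any library’s API.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the foreground class, strictly between 0 and 1.
    y: 1 for foreground, 0 for background.
    """
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_background = focal_loss(0.01, 0)  # confidently correct: almost no loss
hard_background = focal_loss(0.90, 0)  # confidently wrong: large loss
print(easy_background < hard_background)  # True
```

With γ = 0 the modulating factor disappears and this reduces to α-weighted cross-entropy, which is one way to see that the change targets the flood of easy background examples rather than the loss function’s overall shape.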

What hasn’t changed since 2014 is the choreography. Score candidate regions, refine their boxes, prune overlap. The improvements have been about how aggressively each step gets fused into the one next to it. Today’s real-time detectors are mostly that fusion taken to its logical end: a single tensor in, a clean set of boxes out, and a suppression step left over like an appendix.