Faster R-CNN. Learn the proposals.
With Fast R-CNN, most of the per-image work has been folded into a single backbone pass. Most of the time that remains goes to Selective Search, a hand-engineered relic in a network that is otherwise trained end to end. The fix, from Ren, He, Girshick, and Sun, is to replace it with a small convolutional head, the Region Proposal Network (RPN), that slides over the same feature map the detector uses. At every spatial location it predicts whether an object might be there, along with offsets that refine a reference box toward that object.
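To make the "small convolutional head" concrete, here is a minimal NumPy sketch of one forward pass: a 3x3 convolution over the shared feature map, followed by two sibling 1x1 convolutions that emit 2k scores and 4k box offsets per location for k anchors. The function name `rpn_head`, the tiny layer sizes, and the random weights are illustrative assumptions, and biases are left out for brevity; this is not the authors' implementation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rpn_head(feature_map, w_conv, w_cls, w_reg):
    """Hypothetical RPN head. feature_map has shape (H, W, C).

    Returns scores of shape (H, W, 2k) and deltas of shape (H, W, 4k).
    """
    # Zero-pad so a 3x3 window is defined at every spatial location.
    padded = np.pad(feature_map, ((1, 1), (1, 1), (0, 0)))
    # windows has shape (H, W, C, 3, 3): one 3x3 patch per location and channel.
    windows = sliding_window_view(padded, (3, 3), axis=(0, 1))
    # 3x3 convolution to a D-dimensional feature per location, then ReLU.
    hidden = np.maximum(np.einsum("hwcij,cijd->hwd", windows, w_conv), 0.0)
    # A 1x1 convolution is a matrix product over the channel axis.
    scores = hidden @ w_cls  # object / not-object logits, two per anchor
    deltas = hidden @ w_reg  # (tx, ty, tw, th), four per anchor
    return scores, deltas

# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
H, W, C, D, k = 4, 5, 8, 16, 9
feature_map = rng.normal(size=(H, W, C))
w_conv = rng.normal(size=(C, 3, 3, D))
w_cls = rng.normal(size=(D, 2 * k))
w_reg = rng.normal(size=(D, 4 * k))

scores, deltas = rpn_head(feature_map, w_conv, w_cls, w_reg)
print(scores.shape, deltas.shape)
```

The point of the sketch is the output shape: every location on the feature map produces a full set of scores and offsets, one group per anchor, from the same features the detector will reuse.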
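The piece that took some staring at is how the RPN parameterises its predictions. Instead of regressing a box from scratch, every location keeps a small library of reference boxes called anchors (three scales by three aspect ratios, nine in total) and predicts an objectness score and an offset for each anchor independently. During training, an anchor is labelled positive if its IoU with some ground-truth box exceeds 0.7. The anchor with the highest IoU to a given ground-truth box is held responsible for it even when no anchor clears that threshold. Anchors whose IoU stays below 0.3 for every ground-truth box are labelled background, and the rest are left out of the loss.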
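A short sketch of the anchor library and the offset parameterisation may help here. It builds the nine anchors for one location and converts between a box and its offsets relative to an anchor, using the centre-and-log-size form from the Faster R-CNN paper. The scales (128, 256, 512 pixels) and ratios (1:2, 1:1, 2:1) are the paper's defaults; the function names `make_anchors`, `encode`, and `decode` are made up for this example.

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return a (9, 4) array of anchors as (cx, cy, w, h), centred on the origin.

    Here ratio means height / width, and each anchor keeps area scale**2.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append((0.0, 0.0, w, h))
    return np.array(anchors)

def encode(box, anchor):
    """Offsets (tx, ty, tw, th) that turn `anchor` into `box`."""
    cx, cy, w, h = box
    ax, ay, aw, ah = anchor
    return np.array([(cx - ax) / aw, (cy - ay) / ah,
                     np.log(w / aw), np.log(h / ah)])

def decode(t, anchor):
    """Inverse of `encode`: apply predicted offsets to an anchor."""
    tx, ty, tw, th = t
    ax, ay, aw, ah = anchor
    return np.array([ax + tx * aw, ay + ty * ah,
                     aw * np.exp(tw), ah * np.exp(th)])

anchors = make_anchors()
box = np.array([10.0, -5.0, 200.0, 100.0])   # a made-up ground-truth box
t = encode(box, anchors[4])                  # offsets from the 256 x 256 anchor
print(t)
print(decode(t, anchors[4]))                 # recovers `box` up to rounding
```

Because the centre shift is divided by the anchor size and the size change is a log ratio, the regression targets stay in a similar range whether the anchor is small or large. That is what lets one set of weights serve all nine anchors.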
Slide the cell across the image and the assignment rule becomes visible: cells whose centre is near the target object contain anchors that match it well; cells far from any object contain none and are trained as background. The IoU thresholds are how training tells one from the other.
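The assignment rule can be written down in a few lines. The sketch below labels each anchor as object (1), background (0), or ignored (-1) from its IoU with the ground-truth boxes, using the 0.7 and 0.3 thresholds from the Faster R-CNN paper. The function names and the toy boxes are assumptions for illustration, and boxes here are in corner form (x1, y1, x2, y2).

```python
import numpy as np

def iou_matrix(anchors, gts):
    """IoU between every anchor and every ground-truth box.

    Both inputs hold (x1, y1, x2, y2) rows; the result has shape
    (num_anchors, num_gts).
    """
    a = anchors[:, None, :]
    g = gts[None, :, :]
    iw = np.clip(np.minimum(a[..., 2], g[..., 2]) - np.maximum(a[..., 0], g[..., 0]), 0, None)
    ih = np.clip(np.minimum(a[..., 3], g[..., 3]) - np.maximum(a[..., 1], g[..., 1]), 0, None)
    inter = iw * ih
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_g = (g[..., 2] - g[..., 0]) * (g[..., 3] - g[..., 1])
    return inter / (area_a + area_g - inter)

def assign_labels(anchors, gts, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 = object, 0 = background, -1 = ignored."""
    iou = iou_matrix(anchors, gts)
    best = iou.max(axis=1)                 # best overlap for each anchor
    labels = np.full(len(anchors), -1)
    labels[best < neg_thresh] = 0
    labels[best > pos_thresh] = 1
    # The highest-IoU anchor for each ground truth is positive even if it
    # falls short of pos_thresh, provided it overlaps the box at all.
    best_per_gt = iou.argmax(axis=0)
    has_overlap = iou.max(axis=0) > 0
    labels[best_per_gt[has_overlap]] = 1
    return labels

gts = np.array([[0.0, 0.0, 10.0, 10.0], [100.0, 100.0, 120.0, 120.0]])
anchors = np.array([
    [0.0, 0.0, 10.0, 8.0],         # IoU 0.8 with the first box: object
    [0.0, 0.0, 10.0, 5.0],         # IoU 0.5: between thresholds, ignored
    [50.0, 50.0, 60.0, 60.0],      # overlaps nothing: background
    [0.0, 0.0, 10.0, 2.0],         # IoU 0.2: background
    [100.0, 100.0, 120.0, 110.0],  # IoU 0.5, but best match for the second box
])
print(assign_labels(anchors, gts))  # labels: 1, -1, 0, 0, 1
```

The last anchor shows why the highest-IoU rule exists: without it, the second ground-truth box would have no positive anchor at all and would contribute nothing to training.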
Sharing the backbone between the RPN and the detection head removes the last separate stage. Inference: about 0.2 seconds per image, ~250x faster than R-CNN, with better accuracy. Detection is now a single network with one set of weights.