2015 · Object detection · part 3 of 4

Faster R-CNN. Learn the proposals.

With Fast R-CNN most of the per-image work has been folded into a single backbone pass. Most of the remaining time goes to Selective Search, a hand-engineered relic in a network that is otherwise end to end. The fix, from Ren, He, Girshick, and Sun, is to replace it with a small convolutional head, the Region Proposal Network (RPN), that slides over the same feature map the detector uses. At every spatial location it predicts whether an object might be there and a small offset that fits a reference box to it.

The piece that took some staring at is how the RPN parameterises its predictions. Instead of regressing a box from scratch, every location keeps a small library of anchor reference boxes — three scales by three aspect ratios, nine in total — and predicts an offset for each anchor independently. The anchor whose IoU to the ground-truth box is highest is held responsible for that ground truth.
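A minimal sketch of that parameterisation, in plain Python. The stride of 16 and the scale and ratio values are the paper's VGG16 defaults, taken here as assumptions; the cell-centre convention and the function names `anchors_at`, `encode`, and `decode` are hypothetical choices made for illustration.

```python
import math

# Assumed values: the paper's defaults for a VGG16 backbone.
STRIDE = 16                  # image pixels per feature-map cell
SCALES = (128, 256, 512)     # side length of the square anchor, in pixels
RATIOS = (0.5, 1.0, 2.0)     # aspect ratio, height / width

def anchors_at(col, row, stride=STRIDE, scales=SCALES, ratios=RATIOS):
    """Return the anchors centred on one feature-map cell as (cx, cy, w, h)."""
    # Assumption: an anchor is centred on the middle of its cell.
    cx = (col + 0.5) * stride
    cy = (row + 0.5) * stride
    boxes = []
    for s in scales:
        for r in ratios:
            # Hold the area at s*s while changing the shape to h / w = r.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            boxes.append((cx, cy, w, h))
    return boxes

def encode(anchor, gt):
    """Regression targets (tx, ty, tw, th) that move an anchor onto a box."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    # Centre shifts are measured in anchor sizes; sizes are log ratios.
    return ((gx - ax) / aw, (gy - ay) / ah,
            math.log(gw / aw), math.log(gh / ah))

def decode(anchor, t):
    """Apply predicted offsets to an anchor; the inverse of encode."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = t
    return (ax + tx * aw, ay + ty * ah,
            aw * math.exp(tw), ah * math.exp(th))

# Nine anchors for cell (4, 3), and every anchor on an 8x6 feature grid.
cell_anchors = anchors_at(4, 3)
grid_anchors = [a for row in range(6) for col in range(8)
                for a in anchors_at(col, row)]
```

Because the offsets are expressed relative to the anchor's own size, one set of regression weights per anchor shape can serve every location on the grid. An 8×6 grid with nine anchors per cell gives 432 candidate boxes before any filtering.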

Interactive figure: a scene overlaid with an 8×6 feature grid, with cell (4, 3) selected and a switchable target object.

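Move the cell across the image and you can read out the assignment rule physically: cells whose centre is near the target object contain anchors that match it well; cells far from any object contain none and are trained as background. IoU thresholds are how training tells one from the other: in the paper, an anchor with IoU above 0.7 to some ground-truth box is a positive, an anchor with IoU below 0.3 to every box is background, and anchors in between do not contribute to the loss.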
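The assignment rule can be sketched in a few lines of plain Python. The thresholds of 0.7 and 0.3 are the values reported in the paper, passed here as adjustable assumptions. The function names and the label encoding (1, 0, -1) are hypothetical. Boxes are given as corner coordinates `(x1, y1, x2, y2)`.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 object, 0 background, -1 ignored."""
    if not gt_boxes:
        return [0] * len(anchors)
    overlaps = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in overlaps:
        best = max(row)
        if best > pos_thresh:
            labels.append(1)      # matches some object well
        elif best < neg_thresh:
            labels.append(0)      # far from every object
        else:
            labels.append(-1)     # ambiguous, left out of the loss
    # Every ground-truth box also claims its highest-IoU anchor (ties
    # included), so an object is never left without a positive anchor.
    for j in range(len(gt_boxes)):
        column = [row[j] for row in overlaps]
        top = max(column)
        if top > 0:
            for i, value in enumerate(column):
                if value == top:
                    labels[i] = 1
    return labels
```

The second pass is what makes the "highest IoU is held responsible" rule matter: an object whose best anchor only reaches, say, 0.5 IoU would otherwise produce no positive training example at all.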

Sharing the backbone between the RPN and the detection head removes the last separate stage. Inference: about 0.2 seconds per image, ~250x faster than R-CNN, with better accuracy. Detection is now a single network with one set of weights.

Where the time went, redux

Fast R-CNN was about two seconds an image, almost all of it spent in Selective Search on the CPU. Faster R-CNN deletes that step and folds proposal generation back into the same conv backbone the detector already runs. The breakdown shifts again, and one bar disappears entirely:

Selective Search: removed

Conv backbone (VGG16): 140 ms

RPN proposals: 10 ms

RoI pool + heads × 300: 45 ms

NMS, post-process: 5 ms

Bars use square-root scaling; the strikethrough marks the bar that dominated the chart on the previous page. Total ~200 ms on VGG16, close to 5 fps on a 2015 GPU.

Selective Search is gone. The conv backbone is the new dominant cost: about 140 ms on VGG16, a single forward pass per image. Everything that comes after (the RPN, the RoI heads on the top-300 proposals, NMS) runs on top of that one feature map and adds roughly 60 ms combined. The whole detector clears in about a fifth of a second on a 2015 GPU, near five frames per second. The next paper to make a real wall-clock dent will not optimise this pipeline further; it will throw it away.
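The arithmetic behind those figures is easy to check. The short sketch below uses the per-stage times quoted in the breakdown; the dictionary keys and variable names are illustrative only, and the square-root bar lengths mirror the scaling the chart caption describes.

```python
import math

# Per-stage times in milliseconds, as quoted in the breakdown.
stages_ms = {
    "conv backbone (VGG16)": 140,
    "RPN proposals": 10,
    "RoI pool + heads x 300": 45,
    "NMS, post-process": 5,
}

total_ms = sum(stages_ms.values())
fps = 1000 / total_ms

# Everything that runs on top of the shared feature map.
after_backbone_ms = total_ms - stages_ms["conv backbone (VGG16)"]

# Fraction of the total spent in each stage.
shares = {name: ms / total_ms for name, ms in stages_ms.items()}

# Square-root scaling compresses the spread between long and short bars.
bar_lengths = {name: math.sqrt(ms) for name, ms in stages_ms.items()}

for name, ms in stages_ms.items():
    print(f"{name:<24} {ms:>4} ms  {shares[name]:.0%}")
print(f"total {total_ms} ms, {fps:.1f} fps, {after_backbone_ms} ms after the backbone")
```

With these numbers the backbone accounts for 70% of the 200 ms total, which is why further gains have to come from a cheaper backbone or a different pipeline rather than from the proposal stage.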