2015 · Object detection · part 2 of 4

Fast R-CNN. Share the features.

The observation

The previous paper, R-CNN, took 47 seconds an image (measured with a VGG16 backbone) because it ran a full CNN about two thousand times per picture, once for each Selective Search proposal. Each of those calls re-extracted edges, textures, and corners from a crop of the same image. Most of those low-level features were redundant with the ones extracted for the next crop, and the next, and the next.

A year later, same author. The fix is the obvious one in retrospect: compute the convolutional features for the whole image once, into a single feature map, and let each proposal pull its features out of that one map instead of running the network again.

Sharing the backbone

The shape of the compute graph changes more than it sounds. R-CNN’s graph is a fan of two thousand independent CNNs. Fast R-CNN’s graph is a single CNN with two thousand light taps off its last conv layer.

R-CNN (conv ~45 s): proposal 1, proposal 2, …, proposal 2000 → AlexNet → fc7

Two thousand independent forward passes through a CNN. Identical low-level features extracted on overlapping crops, over and over.

Fast R-CNN (conv ~150 ms): image → CNN → feature map → 2000 × RoI pool

One forward pass through a CNN. Two thousand cheap RoI pools on the same feature map. No re-computation.

The convolutional layers don’t care which crop of the image they’re looking at; they only care about pixels in their receptive field. Run them once over the whole image and the feature map you get out is the same feature map any individual crop’s forward pass would have produced for its own region, just assembled into one big spatial grid instead of two thousand tiles. The proposals don’t need their own forward pass. They need a way to read out their region of that grid.
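The claim above can be checked on a toy case. The sketch below assumes a single-channel image, one 3 × 3 filter, stride 1 and no padding; the names conv2d_valid, image and the proposal coordinates are illustrative, not from the paper. Under those assumptions, convolving a crop gives exactly a window of the map computed once over the whole image. Real detectors pad their conv layers, and R-CNN warped each crop to a fixed size, so in practice the match holds only approximately.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation: no padding, stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
kernel = rng.standard_normal((3, 3))

# One pass over the whole image (the Fast R-CNN way).
full_map = conv2d_valid(image, kernel)

# A hypothetical proposal: rows 5..19, columns 8..27 of the image.
y0, y1, x0, x1 = 5, 20, 8, 28

# A separate pass over just the crop (the R-CNN way, minus the warping).
crop_map = conv2d_valid(image[y0:y1, x0:x1], kernel)

# The crop's features are a window of the shared map.
kh, kw = kernel.shape
shared_window = full_map[y0:y1 - kh + 1, x0:x1 - kw + 1]
features_match = np.allclose(crop_map, shared_window)
```

With these settings features_match comes out True: every value the per-crop pass computes is already sitting in the shared map.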

RoI pooling

The technical hurdle is that the proposals are arbitrary rectangles, but the classifier head expects a fixed-size input. That is what RoI pooling is for.

Figure: RoI pooling from a 14 × 14 feature map to a 7 × 7 pooled output. The selected region of the feature map is divided into a fixed 7×7 grid, and each output cell takes the maximum feature value inside its bin.

For an H × W output (the paper uses 7×7 to match VGG’s pre-trained fc6 input), divide the proposal’s rectangle on the feature map into an H × W lattice of bins. Each bin max-pools whatever feature-map cells fall inside it. Output: a fixed grid no matter the input region’s shape. That grid feeds the classifier and box regressor.

Two subtleties hide in the geometry. First, the proposal coordinates are real-valued (in the original image), but feature-map cells are integers, so the bin boundaries get quantised at two places: when the proposal is mapped from image coordinates to feature-map coordinates, and when each bin’s max-pool window is rounded to whole cells. The misalignment that this introduces is small for classification but matters for pixel-accurate tasks; RoI Align in the Mask R-CNN paper will replace the rounding with bilinear sampling at exact locations and recover a couple of mAP points. Second, max-pool over an empty bin (when the proposal is small relative to the feature-map resolution) returns zero, which gives the regressor nothing to work with for tiny objects. How often that happens depends on the ratio of the proposal’s size in the image to the feature-map stride.
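The binning and the two rounding steps described above can be sketched in a few lines of NumPy. This is a minimal single-channel illustration, not the paper's implementation: the function name roi_max_pool, the (x1, y1, x2, y2) box convention and the default stride of 16 (VGG16's conv5 downsampling) are assumptions made for the example.

```python
import math
import numpy as np

def roi_max_pool(feature_map, roi, stride=16, out_h=7, out_w=7):
    """Max-pool one RoI of an (H, W) feature map into an (out_h, out_w) grid.

    roi is (x1, y1, x2, y2) in image pixels; stride is the assumed
    image-to-feature-map downsampling factor.
    """
    H, W = feature_map.shape
    # Quantisation 1: image coordinates -> integer feature-map cells.
    x1, y1, x2, y2 = (int(math.floor(c / stride + 0.5)) for c in roi)
    roi_h = max(y2 - y1 + 1, 1)
    roi_w = max(x2 - x1 + 1, 1)
    bin_h = roi_h / out_h
    bin_w = roi_w / out_w

    out = np.zeros((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Quantisation 2: bin edges rounded outward to whole cells,
            # then clipped to the feature map.
            ys = min(max(y1 + math.floor(i * bin_h), 0), H)
            ye = min(max(y1 + math.ceil((i + 1) * bin_h), 0), H)
            xs = min(max(x1 + math.floor(j * bin_w), 0), W)
            xe = min(max(x1 + math.ceil((j + 1) * bin_w), 0), W)
            if ye > ys and xe > xs:  # an empty bin stays at zero
                out[i, j] = feature_map[ys:ye, xs:xe].max()
    return out

# A 14 x 14 map whose values encode their own position.
feature_map = np.arange(196, dtype=float).reshape(14, 14)

# This box maps to cells 2..8 on each axis: 7 cells, one per bin.
pooled = roi_max_pool(feature_map, (32, 32, 128, 128))

# A wide, short box still comes out as 7 x 7.
pooled_wide = roi_max_pool(feature_map, (0, 0, 223, 100))
```

In the first call each bin covers exactly one cell, so pooled is just a copy of that 7 × 7 window. In the second the bins span different numbers of cells along each axis, yet the output shape is unchanged, which is the property the classifier head relies on.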

Critically, RoI pooling is differentiable. Gradients flow back through the max-pool to whichever feature-map cell won inside each bin, and from there into the convolutional backbone. The backbone now learns features that are good for detection, not features that happened to be good for ImageNet.
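How the gradient is routed through a single bin can be shown directly. The sketch below covers one bin only, with hypothetical helper names (max_bin_forward, max_bin_backward): the forward pass remembers which cell won the max, and the backward pass sends the incoming gradient to that cell and zero everywhere else. In the full layer, a feature-map cell that wins in several bins or several overlapping proposals receives the sum of their gradients.

```python
import numpy as np

def max_bin_forward(window):
    """Max over one RoI bin; also return the index of the winning cell."""
    winner = np.unravel_index(np.argmax(window), window.shape)
    return window[winner], winner

def max_bin_backward(window_shape, winner, upstream_grad):
    """Route the bin's gradient to the winning cell only."""
    grad = np.zeros(window_shape)
    grad[winner] = upstream_grad
    return grad

# One 2 x 2 bin of feature-map cells.
window = np.array([[0.1, 0.7],
                   [0.3, 0.2]])

value, winner = max_bin_forward(window)
grad = max_bin_backward(window.shape, winner, upstream_grad=2.0)
```

Here the cell holding 0.7 wins, so it receives the whole upstream gradient of 2.0 and the other three cells receive nothing.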

One loss, one network

R-CNN trained in three separate stages with three separate objectives. Fast R-CNN trains in one. After RoI pooling, the fixed grid feeds two sibling heads:

A softmax over K + 1 classes (the K object classes plus a background class). Cross-entropy loss.
A per-class bounding-box regressor that predicts the same four-tuple offset R-CNN used. The loss here is smooth L₁ (quadratic near zero, linear far away), so a single mis-localised proposal can’t blow up the gradient.

The two losses sum into one multi-task objective, with the box regression term gated to foreground proposals only:

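L = L_cls(p, u) + λ · [u ≥ 1] · L_loc(t^u, v)

Here p is the predicted class distribution, u is the ground-truth class (0 for background), t^u is the predicted box offset for class u, and v is the regression target. The bracket [u ≥ 1] is 1 for foreground proposals and 0 for background, and λ balances the two terms (the paper uses λ = 1).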
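For a single proposal, the objective can be written out as a short NumPy sketch. The function names and argument layout are assumptions for illustration; the smooth L₁ form (0.5·x² for |x| < 1, |x| − 0.5 otherwise) and the foreground gate follow the description above.

```python
import numpy as np

def smooth_l1(x):
    """0.5 * x^2 where |x| < 1, |x| - 0.5 elsewhere (elementwise)."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def multitask_loss(class_probs, u, box_preds, v, lam=1.0):
    """Loss for one RoI.

    class_probs: (K + 1,) softmax output p, index 0 = background.
    u:           ground-truth class label.
    box_preds:   (K + 1, 4) per-class offsets; row u is t^u.
    v:           (4,) regression target.
    """
    l_cls = -np.log(class_probs[u])
    # Box loss only counts for foreground proposals (u >= 1).
    l_loc = smooth_l1(box_preds[u] - v).sum() if u >= 1 else 0.0
    return l_cls + lam * l_loc

# Toy example with K = 2 object classes.
p = np.array([0.1, 0.7, 0.2])
t = np.zeros((3, 4))
t[1] = [0.5, 0.0, 2.0, 0.0]   # one small and one large offset error
v = np.zeros(4)

fg_loss = multitask_loss(p, u=1, box_preds=t, v=v)  # class + box terms
bg_loss = multitask_loss(p, u=0, box_preds=t, v=v)  # class term only
```

In the foreground case the error of 0.5 falls in the quadratic region and contributes 0.125, while the error of 2.0 falls in the linear region and contributes 1.5 rather than the 2.0 a squared loss would give.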

One forward pass produces both predictions; one backward pass updates the entire network including the conv backbone. The per-class SVMs go away. The cached fc7 features go away. The disk you needed to train R-CNN goes away.

Two more speedups, almost free

Two implementation tricks earn the rest of the wall-clock improvement.

Hierarchical sampling. Each training mini-batch samples just N = 2 images and R / N = 64 proposals from each (so the batch is R = 128 proposals from 2 images). The conv features are computed twice per batch instead of 128 times, roughly a 64× saving on the conv work compared with drawing every proposal from a different image, without any loss in accuracy. The proposals within a batch are correlated because they share images, but in practice it doesn’t matter.
Truncated SVD on the FC layers. The classifier head is two large fully-connected layers (fc6, fc7, both 4096-d). At inference, since we’re running them 2000 times per image, they dominate the post-conv cost. Replacing each weight matrix W by its truncated singular value decomposition W ≈ U Σ_t V^T, keeping only the top t singular values, splits the layer into two thinner ones. For a u × v matrix this cuts the parameter count from u·v to t(u + v) and reduces detection time by more than 30%, with a 0.3-point drop in mAP.
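The SVD trick can be sketched with NumPy on a toy matrix. Everything here is illustrative: the helper name truncated_svd_layers, the 64 × 128 matrix standing in for an FC weight, and the rank t = 8 are assumptions chosen so the example runs instantly, and biases are left out.

```python
import numpy as np

def truncated_svd_layers(W, t):
    """Split a (u, v) weight matrix into two thinner ones whose product
    is the rank-t approximation U_t Sigma_t V_t^T of W."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    first = S[:t, None] * Vt[:t]   # (t, v): Sigma_t V_t^T, applied first
    second = U[:, :t]              # (u, t): U_t, applied second
    return first, second

rng = np.random.default_rng(0)

# Toy stand-in for an FC weight: nearly rank 8, plus a little noise.
W = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 128))
W += 0.01 * rng.standard_normal(W.shape)
x = rng.standard_normal(128)

first, second = truncated_svd_layers(W, t=8)

y_full = W @ x                  # original layer
y_fast = second @ (first @ x)   # two thin layers in sequence
rel_error = np.linalg.norm(y_full - y_fast) / np.linalg.norm(y_full)

params_before = W.size                    # u * v
params_after = first.size + second.size   # t * (u + v)
```

On this toy matrix the parameter count drops from 8192 to 1536 while the output changes very little. How much accuracy survives on a real network depends on how quickly the singular values of the trained weights decay, which is an empirical question the paper answers for VGG16.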

Where the time went, redux

R-CNN was 47 seconds; Fast R-CNN with VGG16 is about 2 seconds per image. The breakdown shifts dramatically:

Per-image inference time by stage:

Selective Search: 1.8 s
Conv backbone (VGG16): 130 ms
RoI pool + heads × 2000: 80 ms
NMS, post-process: 10 ms

The conv pass is now smaller than Selective Search by an order of magnitude.
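The arithmetic behind the breakdown is easy to check. The snippet below only adds up the stage timings quoted above; the variable names are illustrative.

```python
# Stage timings from the breakdown above, in seconds per image.
stage_seconds = {
    "Selective Search": 1.8,
    "Conv backbone (VGG16)": 0.130,
    "RoI pool + heads x 2000": 0.080,
    "NMS, post-process": 0.010,
}

total = sum(stage_seconds.values())
network_only = total - stage_seconds["Selective Search"]

# Fraction of the per-image time spent in each stage.
share = {name: t / total for name, t in stage_seconds.items()}
```

The stages sum to about 2 seconds, of which only about 0.22 seconds is the network itself; Selective Search accounts for close to 90% of the total.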

Selective Search is now the slowest thing in the system. The conv backbone, previously the bottleneck, takes roughly 130 ms. The next paper, Faster R-CNN, will replace Selective Search with a small network that shares the same backbone, killing the last serial CPU step.

What it bought

Compared with R-CNN, on PASCAL VOC 2007 with VGG16: slightly better accuracy (66.9 vs 66.0 mAP), roughly 9× faster training (84 hours → about 9.5 hours), and inference that is about two orders of magnitude faster excluding Selective Search (about 45 s → about 0.2 s of network time per image). With Selective Search included the end-to-end gain is closer to 20× (47 s → about 2 s), because Selective Search now dominates.

The intellectual win is bigger than the runtime win. The whole detector is now one network, trainable end to end, with gradients reaching from the box regressor at the top through the conv backbone at the bottom. Every later paper will have this shape: one network, multi-task loss, RoI-style feature pooling somewhere in the middle. The R-CNN three-stage scaffolding is permanently gone.