2014 · Object detection · part 1 of 4

R-CNN. The slow ancestor.

The frame

Object detection is the task of taking a picture and returning a short list: cat at (40, 110, 110, 110), dog at (200, 50, 150, 170). A modern phone runs this thirty times a second on a live camera feed. A decade ago, the same task on a single still image took the better part of a minute on a workstation.

Four papers, between 2014 and 2016, did most of that work. Each one noticed a piece of redundant computation in the previous paper and removed it. By the end, everyone had stopped asking what region might contain an object and started asking which prediction wins this cell. This page is the first of the four. Read it as the start of a series; the next paper is linked at the bottom.

The metric, IoU

Before anything, a unit of measurement. Two boxes are good matches if their intersection is large compared with their union. The ratio is called Intersection over Union, IoU, and it is the metric that drives everything downstream — how training assigns predictions to ground truth, how evaluation scores a detector, how non-max suppression decides whether two boxes are really the same one.
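As a minimal sketch of the ratio just described, the function below computes IoU for two axis-aligned boxes. It assumes boxes are stored as (x, y, width, height) with (x, y) the top-left corner, which is a guess at the convention behind the example list at the top of the page; the name `iou` is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, w, h) boxes (assumed format)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap along each axis, clamped to zero when the boxes are disjoint.
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    # Union = sum of the two areas minus the part counted twice.
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


# Two 5x4 boxes (20 cells each) that share a 3x2 patch (6 cells).
print(iou((0, 0, 5, 4), (2, 2, 5, 4)))  # 6 / 34
```

The union is computed from the areas rather than by rasterising, so the same function works for pixel coordinates and for grid cells.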

[Interactive figure: two draggable, resizable boxes that snap to a grid of 60 cells. Readout: 20 cells; ∩ = 6 cells; ∪ = 34 cells; IoU = ∩ / ∪ = 6 / 34 ≈ 0.176.]

A few thresholds repeat throughout the rest of the series. A box with IoU ≥ 0.5 to a ground-truth label is, by convention, a correct detection. During training, anchors with IoU ≥ 0.7 are made positive examples; below 0.3, negative; the band in between gets ignored. At inference, two predictions with IoU ≥ 0.5 are usually the same object and one of them gets suppressed.
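The thresholds above amount to a small decision rule on a single IoU value. The sketch below encodes them; the function names and the 1 / 0 / -1 label encoding are assumptions made for illustration, not something any of the papers prescribe.

```python
def training_label(best_iou, pos_thresh=0.7, neg_thresh=0.3):
    """Label a box by its best IoU with any ground-truth box.

    Returns 1 (positive), 0 (negative) or -1 (ignored band in between).
    """
    if best_iou >= pos_thresh:
        return 1
    if best_iou < neg_thresh:
        return 0
    return -1


def is_correct_detection(iou_with_gt, thresh=0.5):
    """Evaluation convention: a detection counts if IoU >= 0.5."""
    return iou_with_gt >= thresh


for value in (0.82, 0.45, 0.10):
    print(value, training_label(value), is_correct_detection(value))
```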

What R-CNN proposed

Girshick, Donahue, Darrell, and Malik, in late 2013. Their pitch is in the title: Regions with CNN features. Until then, state-of-the-art detectors were sliding-window classifiers built on hand-crafted features (HOG, deformable part models) that the deep-learning revolution of 2012 had not yet reached. ImageNet-pretrained AlexNet was a year old and famously good at classification. R-CNN’s observation: turn detection into a large pile of classification problems and let the CNN do the heavy lifting.

A three-step recipe.

Propose. Use a region-proposal algorithm (Selective Search) to nominate about two thousand boxes per image that might contain an object.
Classify. Crop each proposal, warp it to 224×224, push it through AlexNet, and score the resulting feature vector with a per-class linear SVM.
Refine. A second linear head per class predicts a small offset that nudges the proposal box into a tighter fit on the object.

That’s the whole paper. The next four sections walk through each step, with the slow part — step two — visualised first.

Two thousand proposals

Selective Search (Uijlings et al., 2013) is the proposal generator R-CNN uses. It is not a neural network. It is a hand-engineered hierarchical clusterer: start by over-segmenting the image into super-pixels, then greedily merge the neighbouring regions whose colour, texture, size, and fill similarities are highest. At every level of the merge hierarchy, emit the bounding box of the current region. The output is a multi-scale grab-bag of rectangles: small, medium, large; objecty and not.

The number per image is around two thousand. That number is the recall budget: enough boxes that, for any object in any image, at least one of them has IoU ≥ 0.5 to ground truth more than 95% of the time.
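The recall budget can be made concrete with a short sketch: count how many ground-truth boxes are covered by at least one proposal at IoU ≥ 0.5. The (x, y, width, height) box format and the names `iou` and `proposal_recall` are assumptions for illustration; the boxes in the usage example are made up.

```python
def iou(a, b):
    """IoU of two (x, y, w, h) boxes (assumed format)."""
    iw = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def proposal_recall(gt_boxes, proposals, thresh=0.5):
    """Fraction of ground-truth boxes matched by at least one proposal."""
    if not gt_boxes:
        return 1.0  # nothing to miss
    covered = sum(
        1 for gt in gt_boxes
        if any(iou(gt, p) >= thresh for p in proposals)
    )
    return covered / len(gt_boxes)


# Toy data: the first object is covered, the second is missed.
gt = [(0, 0, 10, 10), (50, 50, 10, 10)]
props = [(1, 1, 10, 10), (100, 100, 5, 5)]
print(proposal_recall(gt, props))
```

Adding more proposals can only raise this number, which is why the count per image is better read as a budget than as a target.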

[Interactive figure: a "Show first N proposals" slider running from 1 to 2000. Readout at N = 120: 120 proposals, 120 AlexNet passes, inference time 2.76 s.]

Selective Search emits ~2000 proposals per image. AlexNet runs on every one. ~23 ms per pass on a 2014 GPU.

The picture above is what two thousand boxes feels like. Every rectangle is one classification problem in the recipe. Drag the slider to twenty and the screen reads as a sensible coarse pass. Drag it to two thousand and the visual collapses into noise; that noise is what R-CNN has to find an object inside.

The runtime indicator is the punchline. Selective Search itself takes about two seconds on a CPU. Running AlexNet on every one of its outputs takes about forty-five seconds on a 2014 GPU. The proposal stage is cheap; the next stage is where the day disappears.

The classifier

For each proposal box, R-CNN does roughly this, sequentially:

Crop: from image
Warp: 224×224
AlexNet: conv → fc
fc7: 4096-d
SVM: per class
Score: softmax-free
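The last two steps of that chain are just one dot product per class. Here is a minimal sketch, assuming the fc7 feature vector has already been computed; the function name, the dictionary layout, and the toy 3-d vectors (standing in for 4096-d features) are all illustrative.

```python
def svm_scores(feature, weights, biases):
    """One raw linear score per class: w_c . feature + b_c.

    There is no softmax, so scores are unnormalised margins and each
    class is judged independently of the others.
    """
    return {
        cls: sum(w * x for w, x in zip(weights[cls], feature)) + biases[cls]
        for cls in weights
    }


# Toy example with made-up numbers.
feature = [1.0, 2.0, 0.0]
weights = {"cat": [1.0, 0.5, 0.0], "dog": [-1.0, 0.0, 2.0]}
biases = {"cat": -1.0, "dog": 0.5}
print(svm_scores(feature, weights, biases))
```

A positive score means the proposal falls on the object side of that class's SVM; a proposal can score negative for every class, which is how background gets rejected without a dedicated background output.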

Three details are worth pausing on.

Warp, not crop. AlexNet expects a fixed 224×224 input. R-CNN doesn’t preserve aspect ratio: it stretches the proposal’s rectangle to a square, with a 16-pixel context border so the object isn’t cropped tight against the edge. This is crude. Later papers will replace it with RoI pooling, which keeps the geometry honest.
Per-class SVMs, not a single softmax. The natural thing would be to attach a 21-way softmax to AlexNet (twenty PASCAL classes + background) and train end-to-end. R-CNN doesn’t. It fine-tunes AlexNet’s features once with a softmax loss (positives are proposals with IoU ≥ 0.5), then freezes the features and trains a linear SVM per class on top, with a stricter labelling: only the ground-truth boxes count as positives, proposals with IoU < 0.3 are negatives, and everything in between is ignored. Empirically, the two-loss arrangement gave about 3 mAP more than a single softmax. It is exactly the kind of pipeline kludge the next paper will collapse.
Non-max suppression. The 2000 boxes overlap. After scoring, R-CNN sorts by score and keeps a box only if it has IoU < 0.3 with every already-kept box of the same class. It is the same algorithm every detector in this series ends with.
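The greedy suppression described in the last item can be sketched in a few lines. This version handles the detections of one class at a time and assumes (x, y, width, height) boxes; the names `iou` and `nms` and the toy detections are illustrative.

```python
def iou(a, b):
    """IoU of two (x, y, w, h) boxes (assumed format)."""
    iw = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_thresh=0.3):
    """Greedy non-max suppression for one class.

    detections: list of (score, box). Highest scores are visited first;
    a box survives only if it overlaps every kept box by less than
    iou_thresh.
    """
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) < iou_thresh for _, kept_box in kept):
            kept.append((score, box))
    return kept


dets = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 10, 10)),    # near-duplicate of the first box
    (0.7, (50, 50, 10, 10)),  # a separate object
]
print(nms(dets))
```

Running it per class matters: a cat box and a dog box that overlap heavily are both allowed to survive.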

The regressor

The classifier’s job is to say cat or dog; the regressor’s job is to make the box fit. R-CNN trains a per-class linear regressor on the same 4096-d AlexNet features that predicts a small parameterised offset for a proposal P: (d_x(P), d_y(P), d_w(P), d_h(P)). The first two are translations in units of the proposal’s own width and height; the last two are log-space scale factors for the proposal’s width and height. Applied to the proposal, those four numbers produce a refined box.
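A short sketch of how those four numbers act on a box, together with the inverse mapping that yields regression targets from a ground-truth box. It assumes boxes are given as (centre x, centre y, width, height); the function names and example values are illustrative.

```python
import math


def apply_offsets(proposal, offsets):
    """Refine a (cx, cy, w, h) proposal with offsets (dx, dy, dw, dh)."""
    px, py, pw, ph = proposal
    dx, dy, dw, dh = offsets
    return (
        px + pw * dx,        # shift measured in proposal widths
        py + ph * dy,        # shift measured in proposal heights
        pw * math.exp(dw),   # log-space width scaling
        ph * math.exp(dh),   # log-space height scaling
    )


def regression_targets(proposal, ground_truth):
    """Offsets that would map the proposal exactly onto the ground truth."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    return (
        (gx - px) / pw,
        (gy - py) / ph,
        math.log(gw / pw),
        math.log(gh / ph),
    )


proposal = (50.0, 40.0, 20.0, 10.0)
truth = (55.0, 42.0, 30.0, 8.0)
targets = regression_targets(proposal, truth)
print(targets)
print(apply_offsets(proposal, targets))  # recovers `truth` up to rounding
```

Dividing by the proposal's size and taking logs keeps the targets in a similar numeric range for small and large boxes, which makes them easier for a linear model to fit.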

On its own, regression buys about 3–4 mAP on PASCAL. Small but worth it. Every later paper in the lineage inherits this same parameterisation, which is why all four use the same offsets in their loss functions.

Where the time went

The 47-second figure breaks down roughly as:

Selective Search: 2.0 s
AlexNet × 2000: 45.0 s
SVM scoring: 50 ms
Box regress + NMS: 30 ms


Almost all of it is the second bullet: 2000 forward passes through AlexNet, on the same image, with massively overlapping crops. The conv layers that extract edges and textures are doing the same work over and over because each crop is processed independently. Two of the three remaining papers in this series exist to eliminate this waste.
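The arithmetic behind that claim is quick to check. The sketch below uses only the stage timings quoted in the breakdown; the dictionary keys and variable names are illustrative.

```python
# Stage timings in seconds, taken from the breakdown above.
stages_s = {
    "Selective Search": 2.0,
    "AlexNet x 2000": 45.0,
    "SVM scoring": 0.050,
    "Box regress + NMS": 0.030,
}

total_s = sum(stages_s.values())
shares = {name: t / total_s for name, t in stages_s.items()}
per_pass_ms = stages_s["AlexNet x 2000"] / 2000 * 1000

print(f"total: {total_s:.2f} s")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"per AlexNet pass: {per_pass_ms:.1f} ms")
```

The AlexNet stage comes out at roughly 96% of the total, and 45 s over 2000 passes is about 22.5 ms each, in line with the ~23 ms per pass quoted earlier.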

A three-stage training pipeline

R-CNN is also slow to train, and for a related reason: there is no single loss to descend.

Pretrain AlexNet on ImageNet classification.
Fine-tune the pretrained network on PASCAL VOC, with a 21-way softmax over 20 classes plus background. Positive samples are proposals with IoU ≥ 0.5 to a ground-truth box. This step warms the conv layers up for the detection distribution; once it’s done, the softmax is thrown away.
Train twenty per-class linear SVMs on top of the frozen fc7 features, then twenty per-class bounding-box regressors, both as separate problems on cached features. Hard-negative mining is used at this stage to handle the huge background-vs-foreground imbalance.
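The hard-negative mining mentioned in the third stage comes down to a selection rule: after each round, keep only the background examples the current SVM still gets wrong or nearly wrong. The sketch below shows that rule alone, assuming a hinge-loss SVM where a negative scoring above -1 still incurs loss; the function name and the sample scores are illustrative, and the exact schedule R-CNN used is not reproduced here.

```python
def mine_hard_negatives(negative_scores, margin=-1.0):
    """Indices of negatives that score above the margin.

    These are the background examples the current SVM either
    misclassifies or classifies with too little confidence; only they
    are added to the training cache for the next round.
    """
    return [i for i, s in enumerate(negative_scores) if s > margin]


# Scores of four background proposals under the current SVM (made up).
scores = [-2.5, -0.4, 0.7, -1.0]
print(mine_hard_negatives(scores))
```

Because most background proposals are easy and score far below the margin, the cache stays small even when the pool of negatives is enormous.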

Three stages, three different objectives, one disk full of cached fc7 vectors. The Fast R-CNN paper’s first contribution is to fold the whole thing into one loss, one training run, no caches.

Why the paper mattered

Numbers are the easiest way to see it. On PASCAL VOC 2010, the best deformable-part-model detector at the time scored about 33% mAP. R-CNN scored 53.7%. On PASCAL VOC 2012, R-CNN reached 53.3% with AlexNet, more than 30% above the previous best in relative terms, and 62.4% once the backbone was swapped for the deeper VGG-16. The gap is not incremental. It is the gap that ended a generation of hand-crafted features.

What R-CNN actually showed was narrower than the leaderboard jump. It showed that a CNN pretrained on ImageNet, fine-tuned on a small detection set, transferred its features extraordinarily well to a new task. That insight — pretrain large, fine-tune small — is the move every model in vision and language has made since. The detection-specific pieces (Selective Search, the per-class SVMs, the three training stages) all looked, in hindsight, like temporary scaffolding around a much more important idea.

The next paper, Fast R-CNN, started taking the scaffolding apart.