Vision Transformer (2020)

An image is worth 16×16 words.

ViT slices an image into a regular grid of patches, treats the patches like words, and runs a vanilla transformer over the sequence. The whole “convolution is the right inductive bias for vision” assumption gets thrown out, traded against more data and compute. It worked.

The frame

For a decade after AlexNet, vision belonged to convolution. CNNs carried two priors that everyone agreed were essential for images. Translation equivariance: a cat in the top-left and a cat in the bottom-right share the same kernel weights, so the network doesn’t have to learn the same feature twice. And locality: a 3×3 kernel only reaches three pixels on a side, biasing the network to look for nearby structure first.

Both priors are correct in the sense that they cut down the hypothesis space the network has to search. The question ViT (Dosovitskiy et al., 2020) asks is whether the priors are necessary, or whether enough data and compute could let the network learn them on its own from a more generic starting point.

The bet

Take the architecture that conquered NLP, the transformer encoder, and feed it images. With as few changes as possible. Almost no image-specific layers; almost no image-specific bias.

The transformer expects a sequence of token embeddings. So: chop the image into a regular grid of small patches, flatten each patch into a vector, run a single learned linear projection to put it in embedding space, add a position embedding, and prepend a learnable [CLS] token whose final state will be classified. Pass the whole sequence through a stack of standard transformer encoder blocks. Read out the class from the final [CLS].

An image is a sequence of patches

A 224×224 image with 16×16 patches splits into 14 rows of 14 patches: 196 patches, plus the [CLS] = 197 tokens. That sequence length is comparable to a short paragraph of text.
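The arithmetic above fits in a few lines. This is a minimal sketch; the function name `vit_token_count` and its `extra_tokens` parameter are illustrative choices, not from any library.

```python
def vit_token_count(height, width, patch, extra_tokens=1):
    """Return (rows, cols, patches, tokens) for a non-overlapping patch grid."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    rows, cols = height // patch, width // patch
    patches = rows * cols
    # extra_tokens counts non-patch tokens such as [CLS].
    return rows, cols, patches, patches + extra_tokens


print(vit_token_count(224, 224, 16))  # (14, 14, 196, 197)
```

Non-square inputs work the same way: a 256-pixel-tall, 384-pixel-wide image at patch size 16 gives 16 rows of 24 patches.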

[Interactive widget: at patch size 16 px the grid is 24×16, giving 384 patches and 385 tokens including [CLS]. ViT-Base/16 uses 16-pixel patches. ViT-Large/14 uses 14.]

Three regimes. 16-pixel patches are the original ViT-Base/16; the sequence is short enough that full attention is cheap and long enough that the model has something to attend to. 8-pixel patches give finer detail at a steep cost: the attention matrix grows as (L+1)², and halving the patch size quadruples L, so attention compute rises roughly sixteenfold. 32 and 64 are coarse enough that you start losing single objects; useful only at very high resolution or for ablations.
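To put a number on the cost of finer patches, here is a rough sketch that counts the entries in one head's attention matrix for a square image. Halving the patch size quadruples the patch count, and the matrix is quadratic in the token count, so the growth is close to sixteenfold. The helper name `attention_entries` is made up for this example, and it ignores everything except the size of the attention matrix.

```python
def attention_entries(image_size, patch, extra_tokens=1):
    """Entries in one head's attention matrix for a square image."""
    tokens = (image_size // patch) ** 2 + extra_tokens
    return tokens * tokens


base = attention_entries(224, 16)  # 197 tokens -> 197 * 197 entries
fine = attention_entries(224, 8)   # 785 tokens -> 785 * 785 entries
print(base, fine)                  # 38809 616225
print(fine / base)                 # a little under 16
```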

From patch to vector

Each P×P patch is 3P² numbers (RGB × pixels). A 16×16 patch is 768 numbers. ViT-Base’s embedding dimension is, conveniently, also 768, so the “projection” is a single learned linear map from ℝ⁷⁶⁸ to ℝ⁷⁶⁸; in code, a 2D convolution with a 16×16 kernel and stride 16 over the image. The same trick collapses patchify, flatten, project into one tensor op.

That is the single line of image-specific code in the model. Everything after it operates on a sequence of vectors with no memory of where the pixels came from.
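Spelled out step by step, the patch embedding looks roughly like the sketch below. It is a dependency-free illustration, not how a real model does it: a tensor library would use a strided convolution or a reshape. The toy sizes (8×8 image, 4×4 patches, width 6) and the helper names `patchify` and `linear` are assumptions made for the example.

```python
import random


def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patch vectors.

    Patches come back in row-major grid order, each with patch*patch*C numbers.
    """
    height, width = len(image), len(image[0])
    vectors = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vec = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vec.extend(image[y][x])  # append this pixel's C channel values
            vectors.append(vec)
    return vectors


def linear(vec, weight, bias):
    """Apply y = W x + b, with weight stored as one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


# Toy sizes keep the pure-Python loops fast; ViT-Base would use P=16, D=768.
rng = random.Random(0)
SIDE, P, C, D = 8, 4, 3, 6
image = [[[rng.random() for _ in range(C)] for _ in range(SIDE)] for _ in range(SIDE)]
weight = [[rng.gauss(0, 0.02) for _ in range(P * P * C)] for _ in range(D)]
bias = [0.0] * D

patches = patchify(image, P)
tokens = [linear(p, weight, bias) for p in patches]
print(len(patches), len(patches[0]), len(tokens[0]))  # 4 48 6
```

The same weight matrix is applied to every patch, which is exactly what a convolution with kernel size and stride both equal to P computes.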

Tag with position

Self-attention is permutation-equivariant: shuffle the input sequence and the output shuffles the same way. Without help, the model cannot tell the patch in the upper-left corner from the one in the bottom-right.

ViT adds a learnable position embedding (one vector per slot in the sequence, the same dimension as the patch embedding) and sums it into the input. Position is now carried in the values, not in the layer’s structure. Funnily enough, when you visualise the learned position embeddings after training, they recover a 2D grid topology on their own: nearby positions have similar embeddings, even though the model was given no hint of what 2D meant.

The CLS token

A learnable extra token, prepended to the sequence. It has no patch behind it — it’s just a parameter vector shared across every example in the dataset. Its job is to collect global information through attention as the layers stack: at every layer it can attend to all L patches and pull what it needs. After the final layer, the classification head reads only the [CLS] output and ignores the rest.
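Assembling the encoder input from the pieces so far might look like the sketch below. It assumes, as in the original paper, that the [CLS] slot gets its own position vector, so there are L+1 position embeddings. The name `build_input` and the toy sizes are illustrative; in a real model the [CLS] vector and the position table are trained parameters rather than the placeholder values used here.

```python
import random


def build_input(patch_tokens, cls_token, pos_embed):
    """Prepend [CLS], then add one position vector to every slot."""
    sequence = [cls_token] + patch_tokens
    if len(sequence) != len(pos_embed):
        raise ValueError("need one position embedding per token, [CLS] included")
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(sequence, pos_embed)]


rng = random.Random(0)
L, D = 4, 6  # toy values: 4 patches, embedding width 6
patch_tokens = [[rng.random() for _ in range(D)] for _ in range(L)]
cls_token = [0.0] * D  # placeholder; learned in a real model
pos_embed = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in range(L + 1)]

x = build_input(patch_tokens, cls_token, pos_embed)
print(len(x), len(x[0]))  # 5 6
```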

The pattern is borrowed directly from BERT. It is also gloriously hackable: later work (DINO, MAE, DeiT-III) experiments with different aggregation strategies — average-pooling all patch tokens, learnable pooling heads, or additional task-specific tokens — and the differences in downstream behaviour are not small.

Self-attention, applied to patches

From here, ViT is just a transformer. The attention explorable walks through the math on a seven-token sentence; everything there carries over verbatim. Q, K, V projections from each token; QKᵀ/√d for scores; softmax; weighted sum of V. ViT-Base does this with 12 attention heads in 12 layers; ViT-Large with 16 heads in 24.
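For reference, one attention head over a token sequence can be sketched in plain Python as follows. This is a single head with no masking, no output projection, and no multi-head splitting; the helper names and the tiny identity projections in the example are assumptions chosen to keep it readable.

```python
import math


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention; returns (output, weights)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, v), weights


# Three toy tokens of width 2, with identity Q/K/V projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
out, weights = attention(x, eye, eye, eye)
print(len(out), len(out[0]))  # 3 2
```

Row i of `weights` is the answer to "where is token i looking?", which is the quantity attention visualisations plot.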

What you can ask of a trained ViT is: where is each token looking? The widget below is a synthetic but qualitatively accurate stand-in. Click any patch to make it the query and see where its attention concentrates; or pick the [CLS] token with a target class and watch the classifier’s gaze settle on the foreground.

[Interactive widget: attention from [CLS] for the target class "cat", shown over a 12×8 grid of 96 patches at patch size 32. Any patch can be selected as the query instead.]

The shape that emerges in real models is the same: object patches attend strongly to other patches of the same object; background patches attend diffusely; the [CLS] token, by the final layer, is mostly looking at the foreground. The technique used to extract this from a trained network is called attention rollout (multiply the attention matrices through the layers, residual streams included), and the resulting heatmaps look strikingly like saliency maps, for free, without any segmentation supervision.
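Attention rollout itself is only a few lines. The sketch below follows one common formulation: average the heads in each layer, mix each attention matrix equally with the identity to stand in for the residual connection, and multiply the results from the first layer to the last. The 0.5/0.5 mix and the toy matrices are assumptions for illustration, and the input is expected to be head-averaged already.

```python
def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def attention_rollout(layers):
    """layers: head-averaged (n x n) attention matrices, first layer first."""
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        # Mix in the identity to account for the residual connection.
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
        # Later layers multiply on the left.
        rollout = matmul(mixed, rollout)
    return rollout


# Toy example: token 0 is [CLS], token 1 a foreground patch, token 2 background.
layer1 = [[0.2, 0.7, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
layer2 = [[0.1, 0.8, 0.1], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4]]
cls_row = attention_rollout([layer1, layer2])[0]
print(cls_row)  # [CLS] attribution over the three input tokens
```

Reshaping the patch entries of `cls_row` back onto the patch grid gives the heatmap.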

The data-scaling fact

ViT’s headline result was not just “a transformer works on images”. It was a curve: at small training scale, ViT loses to ResNet; at medium scale, it catches up; only at JFT-300M (300 million labelled images) does ViT clearly pull ahead.

[Figure: accuracy against pretraining data (log scale) at ImageNet-1k, ImageNet-21k, and JFT-300M, for CNN (ResNet) and ViT.]

Schematic, not to scale. The qualitative crossover at large data is the headline finding.

The interpretation is the moral of the paper. The CNN’s inductive biases (locality, translation equivariance) are not wrong — they save the model from learning structure it would otherwise have to discover from data. With limited data, built-in priors win. With enough data, the model can learn the priors that are actually true for its task and discard the parts of the prior that aren’t. ViT is not the discovery that transformers beat CNNs; it’s the discovery that architectural priors trade against data, and the trade-off tilts as datasets grow.

What it changed

DeiT (Touvron et al., 2021) showed that with the right regularisation and a distillation token, ViT trains on ImageNet-1k alone, closing the data-hunger gap. Swin (Liu et al., 2021) re-introduced locality through windowed attention and gave the model back its hierarchical structure; it became the default backbone for downstream vision tasks for a couple of years. MAE (He et al., 2021) showed that masking out 75% of patches and training the model to fill them in is a very strong self-supervised pretraining recipe: ViT finally had its BERT moment. DINO and DINOv2 (Caron et al., Oquab et al.) built on the same backbone for dense self-supervised features, and produced the now-standard observation that ViT’s attention maps come out unsupervised-segmentation-like.

Today’s vision foundation models — SAM, DINOv2, CLIP’s vision tower, every multimodal LLM’s image encoder — are descendants of this paper. The patches won.