Transformers - Additional Details

Where does the Transformer model get its general knowledge?

That “general knowledge” in a Transformer model — like knowing which sport an athlete plays even when it’s not stated in the current input — comes from patterns learned during pretraining, not from the context you provide at inference time.

1. Pretraining: where knowledge comes from

A Transformer model (e.g., GPT, BERT) is usually trained on massive corpora of text — books, articles, Wikipedia, news, websites, etc. During training, the model sees countless examples that link names to facts implicitly:

  • Sentences like “Serena Williams won the Wimbledon tennis championship…”
  • Tables in sports articles with player–sport associations
  • Fan pages, interviews, biographies, etc.

Over billions of examples, the model’s weights adjust so that the statistical associations between “Serena Williams” and concepts like “tennis,” “grand slam,” and “serve” are embedded in the learned vector space.
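For intuition, here is a minimal sketch of the next-token-prediction objective that drives this, assuming PyTorch; the vocabulary, model sizes, and random “sentence” are purely illustrative stand-ins, not the actual GPT or BERT training setup.

```python
import torch
import torch.nn as nn

# Toy causal language-modeling step: the model learns to predict the next
# token of sentences like "Serena Williams won the Wimbledon tennis ..."
# Vocabulary, sizes, and the random "sentence" below are illustrative only.
vocab_size, d_model = 1000, 64

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, 8))      # a pretend-tokenized sentence
inputs, targets = tokens[:, :-1], tokens[:, 1:]    # predict token t+1 from tokens <= t
seq_len = inputs.size(1)
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

hidden = encoder(embed(inputs), mask=causal_mask)  # each position attends only to the past
logits = lm_head(hidden)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()  # gradients nudge the weights toward the corpus's co-occurrence statistics
```

Repeating this step over billions of real sentences is what embeds the Serena Williams → tennis association into the weights.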

2. How it’s stored

  • This is not a database of facts, but rather distributed representations inside the weights of the attention layers and feed-forward networks.
  • Words and concepts that often co-occur in similar contexts have embeddings that are geometrically close in the model’s high-dimensional space.
  • Attention heads and MLP layers then learn to retrieve and combine these associations when needed.
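A toy illustration of that geometric closeness, using made-up 3-dimensional vectors; real models learn hundreds or thousands of dimensions from data, so the numbers below are purely illustrative.

```python
import numpy as np

# Made-up toy embeddings, only to illustrate "geometrically close in vector space".
embeddings = {
    "serena_williams": np.array([0.9, 0.8, 0.1]),
    "tennis":          np.array([0.85, 0.75, 0.2]),
    "basketball":      np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, ~0 means unrelated.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["serena_williams"], embeddings["tennis"]))      # ~0.99, very close
print(cosine(embeddings["serena_williams"], embeddings["basketball"]))  # ~0.30, much farther
```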

3. Inference: how it recalls

When you ask:

"Which sport does Serena Williams play?"

Even if your current context contains no explicit answer, the token embeddings for "Serena Williams" trigger activation patterns in the model that were learned during pretraining. These activations bias the output distribution toward tokens like "tennis" or "professional tennis", because those words had high conditional probability in similar contexts during training.
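Concretely, this is what "biasing the output distribution" looks like if you inspect a pretrained causal model. The sketch below assumes the Hugging Face transformers library and uses GPT-2 purely as a convenient stand-in; which exact tokens land in the top 5 depends on the model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 as a stand-in for "a pretrained causal Transformer".
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Serena Williams is a professional"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits             # shape: (1, seq_len, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()))
# Sports-related tokens get high probability here because of associations
# absorbed during pretraining, not because the prompt states the answer.
```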

4. Limitations

  • The knowledge is frozen to whatever was in the training data at the time of pretraining.
  • If facts change (e.g., a player switches sports, or a new athlete appears), the model won’t know unless retrained or fine-tuned.
  • It can hallucinate, producing incorrect facts if the associations seen in training were noisy or contradictory.

In short: the Transformer model’s “general knowledge” comes from statistical patterns learned during pretraining across huge datasets, not from reasoning at runtime or from external databases. That is why it can answer fact-based questions even when the answer is not in the current context.

Transformers for image recognition

The original Transformer from "Attention Is All You Need" (2017) was designed for text sequence-to-sequence tasks (machine translation), but the core idea — self-attention over a set of tokens — has since been adapted heavily for images in both recognition and generation.

In fact, using Transformers for images today usually involves several significant changes compared to the original text-based design:

1. The core challenge for images

Images are not 1D sequences of words — they are 2D grids of pixels (often 224×224×3 for recognition tasks, or much larger for generation). If you naively flattened all pixels and fed them into the original Transformer:

  • Sequence length = width × height (e.g., 50,176 tokens for a 224×224 image)
  • Attention complexity is O(n²) → completely infeasible for large n (see the back-of-envelope numbers after the next list)

So adaptations are made to:

  • Reduce sequence length (e.g., patch embeddings)
  • Add positional information for 2D spatial structure
  • Change attention computation to be more efficient for images
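To put numbers on the sequence-length point, here is the back-of-envelope comparison referenced above, for a 224×224 image tokenized per pixel versus with 16×16 patches:

```python
# Back-of-envelope attention cost for a 224x224 image.
height = width = 224
patch = 16

pixel_tokens = height * width                         # 50,176 tokens if every pixel is a token
patch_tokens = (height // patch) * (width // patch)   # 196 tokens with 16x16 patches

print(pixel_tokens, pixel_tokens ** 2)   # 50176  -> ~2.5 billion attention scores per head/layer
print(patch_tokens, patch_tokens ** 2)   # 196    -> 38,416 attention scores per head/layer
```

That roughly 65,000× reduction in the size of the attention matrix is what makes patch-based tokenization the default.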

2. Key modifications for vision tasks

How each component of the 2017 Transformer is modified for vision, and why:

  • Tokenization: instead of word tokens, the image is split into fixed-size patches (e.g., 16×16 pixels); each patch is flattened and linearly projected to a vector. Why: this drastically reduces sequence length while keeping local spatial information.
  • Positional encoding: 2D positional encodings or learned per-patch embeddings, sometimes a relative position bias instead of the sinusoidal scheme. Why: the Transformer needs to know the spatial arrangement of the patches.
  • Attention: for large images, attention can be restricted to local windows (Swin Transformer) or computed at multiple scales to reduce cost. Why: standard attention is O(n²), which is impractical for large n.
  • Architecture depth: often deeper, but with fewer heads and parameters per layer for efficiency (ViT, DeiT). Why: vision tasks involve different inductive biases and data scaling.
  • Pretraining: large-scale supervised (ImageNet-21k) or self-supervised (MAE, DINO) pretraining before fine-tuning. Why: Transformers need much more data than CNNs to work well on vision.
  • Decoder block (for generation): for image generation (e.g., DALL·E, Imagen), the decoder takes image tokens (discrete or continuous) and predicts them autoregressively or via diffusion models. Why: adapts the Transformer to generative objectives.
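A minimal sketch of the first two items (patch tokenization and positional information), assuming PyTorch; the ViT-Base-style sizes are illustrative:

```python
import torch
import torch.nn as nn

# ViT-style tokenization: 16x16 patches via a strided convolution,
# plus one learned positional embedding per patch.
img_size, patch_size, in_ch, d_model = 224, 16, 3, 768
num_patches = (img_size // patch_size) ** 2            # 196

patch_embed = nn.Conv2d(in_ch, d_model, kernel_size=patch_size, stride=patch_size)
pos_embed = nn.Parameter(torch.zeros(1, num_patches, d_model))

images = torch.randn(2, in_ch, img_size, img_size)     # a dummy batch of 2 RGB images
x = patch_embed(images)                                # (2, 768, 14, 14)
x = x.flatten(2).transpose(1, 2)                       # (2, 196, 768): one token per patch
x = x + pos_embed                                      # inject spatial position information
print(x.shape)                                         # torch.Size([2, 196, 768])
```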

3. Examples in Image Recognition

Vision Transformer (ViT) – 2020
  • Treats an image as a sequence of patch embeddings + class token.
  • Passes sequence through a standard Transformer encoder (no decoder).
  • The final hidden state of the class token is used for classification.
  • Needs massive datasets (e.g., JFT-300M) or heavy augmentation/distillation.
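A minimal sketch of that encoder-plus-class-token recipe, assuming PyTorch; the layer count is trimmed for brevity and these are not the published ViT hyperparameters:

```python
import torch
import torch.nn as nn

# ViT-style classification: prepend a learnable class token to the patch
# tokens, run a Transformer encoder, classify from the class token's output.
d_model, num_patches, num_classes = 768, 196, 1000

cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True),
    num_layers=2,                                 # far fewer layers than a real ViT
)
head = nn.Linear(d_model, num_classes)

patch_tokens = torch.randn(2, num_patches, d_model)  # e.g., from the patch-embedding sketch above
tokens = torch.cat([cls_token.expand(2, -1, -1), patch_tokens], dim=1)  # (2, 197, 768)
hidden = encoder(tokens)
logits = head(hidden[:, 0])                          # classify from the class token's final state
print(logits.shape)                                  # torch.Size([2, 1000])
```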
Swin Transformer – 2021
  • Introduces shifted window attention → local attention inside small windows that shift at each layer for cross-window communication.
  • Complexity scales linearly with image size rather than quadratically.

4. Examples in Image Generation

Autoregressive Transformers (DALL·E 1, Parti)
  • Convert images into discrete tokens using a learned VQ-VAE (vector quantization).
  • Train a Transformer decoder to predict tokens sequentially, similar to text generation (a simplified sketch follows this list).
Diffusion + Transformers (Imagen, DALL·E 2, Stable Diffusion XL)
  • Use Transformer blocks in (or as) the denoising network, with cross-attention for text–image fusion.
  • The attention mechanism naturally integrates text and image features.
Perceiver & Perceiver IO
  • Handle multi-modal inputs (images, audio, video) by cross-attending into a small latent array to reduce cost.
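A simplified sketch of the autoregressive setup above (DALL·E 1 / Parti style), assuming PyTorch; the codebook size, sequence lengths, and model sizes are illustrative, and the VQ-VAE that produces the discrete codes is assumed to already exist.

```python
import torch
import torch.nn as nn

# One training stream = caption tokens followed by the image's discrete VQ codes.
# A decoder-only Transformer (encoder stack + causal mask) predicts the next token.
text_vocab, image_vocab, d_model = 1000, 8192, 512
vocab = text_vocab + image_vocab

embed = nn.Embedding(vocab, d_model)
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
)
to_logits = nn.Linear(d_model, vocab)

text_ids = torch.randint(0, text_vocab, (1, 16))       # dummy caption tokens
image_ids = torch.randint(text_vocab, vocab, (1, 256))  # dummy 16x16 grid of VQ codes
stream = torch.cat([text_ids, image_ids], dim=1)

seq_len = stream.size(1) - 1
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
hidden = decoder(embed(stream[:, :-1]), mask=mask)
loss = nn.functional.cross_entropy(to_logits(hidden).reshape(-1, vocab), stream[:, 1:].reshape(-1))
# At generation time the caption is given, image codes are sampled one at a
# time, and the VQ-VAE decoder turns the finished code grid back into pixels.
```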

5. How it differs from the original 2017 Transformer

In short:

  • Input representation: Words → Image patches or VQ tokens
  • Positional encoding: 1D → 2D or relative
  • Attention: Global → Local / hierarchical for efficiency
  • Pretraining needs: Much larger datasets for vision
  • Decoder usage: For recognition, often no decoder (encoder only); for generation, decoder is adapted to autoregressive or diffusion-style prediction.

Bottom line: The original Transformer blueprint remains — multi-head self-attention, feed-forward layers, residual connections — but almost every part of the pipeline (tokenization, positional encoding, attention pattern) is modified to fit the structure and scale of image data. Today, ViT is the baseline for recognition, and transformer–diffusion hybrids dominate generation.