
Transformers - Encoders & Decoders


Figure: Transformer encoder-decoder architecture. Source: "Attention Is All You Need" (Vaswani et al., 2017).

🧩 Summary of Flow​

Inputs → Embedding → Positional Encoding → Encoder Stack.

Decoder takes shifted outputs → Embedding → Positional Encoding.

Decoder attends to both its own previous outputs and encoder’s outputs.

Final linear + softmax gives output probabilities.
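The same flow can be traced in a few lines of PyTorch. This is a minimal sketch (not the exact setup from the paper) using the built-in `torch.nn.Transformer` module; the vocabulary size, sequence lengths, and batch size are made-up values for illustration, and positional encoding is omitted for brevity:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512                 # assumed vocabulary size and model width
src = torch.randint(0, vocab_size, (2, 7))       # batch of 2 source sequences, length 7
tgt = torch.randint(0, vocab_size, (2, 5))       # shifted-right target sequences, length 5

embed = nn.Embedding(vocab_size, d_model)        # input/output embedding (positional encoding omitted)
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)
to_vocab = nn.Linear(d_model, vocab_size)        # final linear layer before the softmax

# causal mask so the decoder cannot attend to future target positions
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))

hidden = model(embed(src), embed(tgt), tgt_mask=tgt_mask)   # encoder stack + decoder stack
probs = to_vocab(hidden).softmax(dim=-1)                    # output probabilities
print(probs.shape)                                          # torch.Size([2, 5, 10000])
```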


This figure illustrates the complete architecture of the Transformer model, a neural network designed for sequence-to-sequence tasks like machine translation. It consists of an encoder on the left and a decoder on the right. The core idea is to process all input tokens in parallel using a mechanism called multi-head attention, which allows the model to weigh the importance of the words in a sequence relative to one another. This approach eliminates the need for sequential processing, which was a bottleneck in previous models like RNNs.


Encoder​

The Encoder is on the left and is composed of a stack of identical layers. Its purpose is to process the input sequence and produce a representation of it.

  • Input Embedding:

    • The input words are first converted into dense vector representations.
  • Positional Encoding:

    • Since the Transformer processes all words simultaneously, it has no inherent sense of word order.
    • The positional encoding adds a vector to each input embedding to give the model information about the position of a word in the sequence.
  • Multi-Head Attention:

    • This is the heart of the encoder. It calculates the relationship between all words in the input sequence.
    • For each word, it generates a new representation that is a weighted sum of all other words' representations.
    • The "multi-head" part means this process is done multiple times in parallel, allowing the model to focus on different aspects of the relationships between words.
    Details

    Key, Value, and Query in the Attention Function

    In the context of the Transformer's self-attention mechanism, the query, key, and value are concepts derived from retrieval systems and are used to compute the attention scores. All three are vector representations of the same input word, but each plays a different role in the calculation.

    The query ($Q$), key ($K$), and value ($V$) weight matrices are not pre-prepared during pretraining — they are learned during the training process itself.

    Where Q, K, V weights live in the Transformer​

    • Each Multi-Head Attention (MHA) block in both the encoder and decoder has:

      • $W_Q$: Query projection matrix
      • $W_K$: Key projection matrix
      • $W_V$: Value projection matrix
    • Locations:

      • Encoder:
        • Self-Attention → Q/K/V weights.
      • Decoder:
        • Self-Attention → Q/K/V weights.
        • Cross-Attention → Another set of Q/K/V weights (because queries come from decoder, keys/values from encoder outputs).

    How They're Created​

    1. Initialization

    • When you first create a Transformer model (before any training), the Q, K, V weight matrices are initialized — usually with random values drawn from an initialization scheme such as Xavier/Glorot or Kaiming.
    • This happens once, before training starts, not in the computation/communication loop yet.
    • At this stage, they know nothing about language, vision, etc.
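A minimal sketch of that initialization step, assuming PyTorch and a made-up hidden size `d_model`; the three projection matrices for one attention block start out as random Xavier-initialized tensors that know nothing about language yet:

```python
import torch.nn as nn

d_model = 512  # assumed hidden size

# one set of Q/K/V projection matrices for a single attention block
W_Q = nn.Linear(d_model, d_model, bias=False)
W_K = nn.Linear(d_model, d_model, bias=False)
W_V = nn.Linear(d_model, d_model, bias=False)

# initialized once, before training starts (Xavier/Glorot here; Kaiming is another option)
for proj in (W_Q, W_K, W_V):
    nn.init.xavier_uniform_(proj.weight)
```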

    2. Computation & Communication

    • For each word in the input sequence, its original embedding is transformed into three different vectors: a query (Q), a key (K), and a value (V). This is done by multiplying the word's embedding vector by three separate, trainable weight matrices: $W_Q$, $W_K$, and $W_V$.

      • Query (Q): $Q = \text{Embedding} \times W_Q$
      • Key (K): $K = \text{Embedding} \times W_K$
      • Value (V): $V = \text{Embedding} \times W_V$

      These matrices ($W_Q$, $W_K$, $W_V$) are learned during the model's training.

      Step-by-step in one Transformer block​

      (Example: Encoder Self-Attention Block — same logic for decoder’s attention)

      Forward pass → Computation Phase​
      1. Input embeddings / hidden states enter the MHA layer.
      2. They are multiplied by $W_Q, W_K, W_V$ to produce the Q, K, V vectors:

         $$Q = X W_Q,\quad K = X W_K,\quad V = X W_V$$
      3. Attention scores are computed:

         $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
      4. The output flows to the residual connection + layer norm, then to the MLP (Feed-Forward Network).

      At this point:
      ✅ Computation phase — no weights are updated yet, just activations are produced.
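A single-head sketch of this forward pass in PyTorch; the sizes are made up, and for simplicity one head with $d_k = d_\text{model}$ is used:

```python
import math
import torch
import torch.nn as nn

d_model, seq_len = 512, 6                       # assumed sizes
W_Q = nn.Linear(d_model, d_model, bias=False)   # trainable projection matrices
W_K = nn.Linear(d_model, d_model, bias=False)
W_V = nn.Linear(d_model, d_model, bias=False)

X = torch.randn(seq_len, d_model)               # input embeddings / hidden states

# 1-2. project the inputs into queries, keys, and values
Q, K, V = W_Q(X), W_K(X), W_V(X)

# 3. scaled dot-product attention
scores = Q @ K.T / math.sqrt(d_model)           # (seq_len, seq_len)
weights = scores.softmax(dim=-1)                # attention weights: activations, not parameters

# 4. weighted sum of values -> contextualized token representations
out = weights @ V                               # (seq_len, d_model)
```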


      Backward pass → Computation + Communication Phase​
      1. Loss computed (at the end of the model).

      2. Backpropagation starts:

        • Gradients flow back through MLP, then through attention outputs, then into Q, K, V computations.
        • Gradients for $W_Q, W_K, W_V$ are calculated here (computation phase).
      3. If distributed training:

        • Gradients for $W_Q, W_K, W_V$ are communicated across devices (all-reduce operation).
        • This is the communication phase — ensures all GPUs/nodes have the same gradients.

      Optimizer step → Computation Phase​
      1. The optimizer (e.g., Adam) updates $W_Q, W_K, W_V$ using their synchronized gradients.
      2. Updated weights are stored for the next forward pass.
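To make the three phases concrete, here is a hedged sketch of one training step in PyTorch. A built-in `nn.MultiheadAttention` module stands in for a real Transformer block (its Q/K/V projections live inside it), and the explicit `all_reduce` loop illustrates what libraries like `DistributedDataParallel` normally do for you:

```python
import torch
import torch.nn as nn
import torch.distributed as dist

# toy stand-in for a Transformer attention block (assumed sizes)
model = nn.MultiheadAttention(embed_dim=64, num_heads=4)  # holds the Q/K/V (and output) projections
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.randn(5, 1, 64)                 # (seq_len, batch, embed_dim)
out, _ = model(x, x, x)                   # forward pass: computation only, no weight updates
loss = out.pow(2).mean()                  # dummy loss for illustration

loss.backward()                           # backward pass: gradients for the Q/K/V weights computed

if dist.is_initialized():                 # communication phase (only when training is distributed)
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)   # sync gradients across devices
            p.grad /= dist.get_world_size()                 # average them

optimizer.step()                          # optimizer step: weights updated with synced gradients
optimizer.zero_grad()
```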

    3. Summary Table — Q/K/V Weight Updates in Transformer Blocks​

    | Transformer Block | Where Q/K/V Live | When Gradients Computed (Computation) | When Gradients Synced (Communication) | When Weights Updated |
    | --- | --- | --- | --- | --- |
    | Encoder Self-Attn | Inside MHA | Backward pass through self-attn | After backprop, before optimizer step | Optimizer step |
    | Decoder Self-Attn | Inside MHA | Backward pass through self-attn | Same | Same |
    | Decoder Cross-Attn | Inside MHA | Backward pass through cross-attn | Same | Same |

    What They Do​

    The core idea is to think of the attention mechanism as a search process.

    • Query: The query is the vector representing the current word you are focused on. It's what you are "searching" with. For example, if you are calculating the attention output for the word "it," your query vector will be for "it."
    • Keys: The keys are the vectors for all the words in the sequence. They represent what you are "searching against." Each key is used to determine how relevant its corresponding word is to the query word.
    • Values: The values are also vectors for all the words in the sequence. They contain the information that will be used to form the new representation for the current word; they are the actual "payload" of the attention mechanism.

    How the attention mechanism works​

    The core formula for scaled dot-product attention is:

    $$\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

    Step-by-step:

    1. Matching stage (Q and K)
    • For each position $i$ (query), we compute a similarity score with every position $j$ (key) using a dot product:

      $$\text{score}(i, j) = Q_i \cdot K_j$$
    • These scores tell us how much attention position $i$ should pay to position $j$.

    2. Softmax to get weights
    • The scores are scaled by $\frac{1}{\sqrt{d_k}}$ to stabilize gradients.

    • A softmax is applied across all $j$ for a given $i$, producing attention weights $\alpha_{ij}$:

      $$\alpha_{ij} = \frac{\exp(\text{score}(i, j))}{\sum_k \exp(\text{score}(i, k))}$$

    Attention weights are intermediate results inside the attention block, computed just before being multiplied by the value vectors $V$.
    Attention weights are recomputed on every forward pass; they depend on the current $Q$ and $K$, so they change with the input and layer state. They are not stored as part of the model weights; they disappear after the computation.

    Relationship Between $K$/$Q$/$V$ Weights and Attention Weights

    | Feature | Key/Query/Value Weight Matrices | Attention Weights (Softmax Output) |
    | --- | --- | --- |
    | Type | Learned parameters | Computed activations |
    | Persistence | Stored in model and updated during training | Temporary, discarded after forward pass |
    | Location | Inside attention block (parameter matrices) | Inside attention block (computed output) |
    | Phase Updated | Computation phase during training | Computed fresh every forward pass |
    | Function | Transform inputs into Q, K, V vectors | Decide "who attends to whom" |

    • Now, each weight says: "How much of token $j$'s content should be sent to token $i$?"
    3. Using the Values (V)
    • Finally, the Values hold the actual content vectors we want to aggregate.

    • For each position $i$, we take a weighted sum of all value vectors $V_j$ using the attention weights:

      $$\text{output}_i = \sum_j \alpha_{ij} \cdot V_j$$
    • This output is a contextualized representation of token $i$, because it blends information from other tokens according to relevance.

    • This means:

      • If $\alpha_{ij}$ is high, token $i$ borrows a lot from token $j$'s information $V_j$.
      • If $\alpha_{ij}$ is low, token $i$ mostly ignores $V_j$.
    • Why Values matter

      • Keys are just for matching.

      • Values are what actually gets transferred.

      • The "relevance" comes from $\alpha_{ij}$, and the "context" comes from pulling in pieces of $V_j$ for all $j$.

      • In theory, Q and K could be identical to V, but separating them allows the model to decide different ways to match and different content to pass along.

      • For example:

        • Keys might encode syntactic roles.
        • Values might carry semantic content.
    • An Analogy: Library Search

      Imagine you're at a library looking for books on a specific topic.

      • Query: Your search query (e.g., "History of AI") is the query vector.
      • Keys: The labels or keywords on all the books in the library (e.g., "Computer Science," "Robotics," "1960s") are the key vectors.
      • Values: The actual content of each book is the value vector.

      You match your query against all the keys. Books with keys that closely match your query get a high relevance score. Then, you use these scores to decide how much "information" (value) from each book you should "read" to form a comprehensive answer. The Transformer's self-attention mechanism works similarly, but it does this for every word in the sequence simultaneously.

    4. Linear Projection (Post-Attention)
    • In a Transformer implementation, this new weighted-sum vector goes through a final linear layer (often called $W_O$) to mix the multiple attention heads' results into one representation per token.
    • Formula (simplified for one head):

      $$Z_i = \left( \sum_j \alpha_{ij} \cdot V_j \right) W_O$$
    • If there are multiple heads, all head outputs are concatenated before multiplying by $W_O$ (see the sketch below).
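A sketch of that concatenate-then-project step, assuming PyTorch and made-up sizes (8 heads of dimension 64, giving a 512-dimensional model):

```python
import torch
import torch.nn as nn

num_heads, d_head, seq_len = 8, 64, 6
d_model = num_heads * d_head                     # 512

# pretend each head has already produced its weighted sum of values (sum_j alpha_ij * V_j)
head_outputs = [torch.randn(seq_len, d_head) for _ in range(num_heads)]

W_O = nn.Linear(d_model, d_model, bias=False)    # output projection

# concatenate the heads along the feature dimension, then mix them with W_O
concat = torch.cat(head_outputs, dim=-1)         # (seq_len, d_model)
Z = W_O(concat)                                  # one representation per token
```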
    5. Residual Connection
    • The Transformer adds the original input embedding (the one that was used to create $Q, K, V$) back to the attention output:

      $$\text{AttnOutput}_i = \text{Input}_i + Z_i$$

      This residual connection helps preserve the original token information while allowing the model to integrate the contextualized representation.

    6. Layer Normalization
    • After the residual sum, LayerNorm is applied to stabilize and normalize the representation:

      $$\text{NormOutput}_i = \text{LayerNorm}(\text{AttnOutput}_i)$$
    7. Feed-Forward Network (MLP)
    • The normalized vector is then sent to the position-wise feed-forward network (two linear layers with a nonlinearity like GELU/ReLU):

      $$\text{FFNOutput}_i = \text{MLP}(\text{NormOutput}_i)$$
    • Another residual connection + LayerNorm happens after this MLP.
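A minimal sketch of steps 5-7 (residual connection, LayerNorm, feed-forward, then another residual + LayerNorm), assuming PyTorch, made-up sizes, and a random stand-in `Z` for the attention output:

```python
import torch
import torch.nn as nn

d_model, d_ff, seq_len = 512, 2048, 6        # assumed sizes (d_ff = FFN hidden width)
X = torch.randn(seq_len, d_model)            # original input to the attention block
Z = torch.randn(seq_len, d_model)            # attention output for the same tokens (stand-in)

norm1 = nn.LayerNorm(d_model)
norm2 = nn.LayerNorm(d_model)
mlp = nn.Sequential(                         # position-wise feed-forward network
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

attn_out = norm1(X + Z)                          # steps 5-6: residual connection, then LayerNorm
block_out = norm2(attn_out + mlp(attn_out))      # step 7: FFN, then another residual + LayerNorm
```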

  • Add & Norm:

    • A residual connection (the "Add" part) adds the input of the sub-layer to its output, and then the result is passed through a layer normalization (the "Norm" part).
    • This helps with training by preventing the gradients from vanishing.
    Details

    Residual Connection (The "Add" Part)

    A residual connection is a shortcut that connects the input of a sub-layer directly to its output. In the Transformer, for any sub-layer $\text{Sublayer}(x)$, the output is calculated as:

    $$x + \text{Sublayer}(x)$$

    By allowing the gradient to flow directly through this "shortcut," it helps to mitigate the vanishing gradient problem. Without residual connections, the gradients could become extremely small as they are backpropagated through many layers, making it difficult to train the model effectively. This allows the network to learn a sub-layer's function as a modification to its input, rather than having to learn the entire transformation from scratch.


    Layer Normalization (The "Norm" Part)

    Following the residual connection, the result is passed through a layer normalization step. This component normalizes the activations across the features for each individual sample in the batch. Specifically, it computes the mean and variance of the activations within a single layer for a given input sequence and then uses these statistics to normalize the values.

    $$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

    Here, $\mu$ and $\sigma$ are the mean and standard deviation of the input values across the features, and $\gamma$ and $\beta$ are learnable parameters that allow the network to "undo" the normalization if it determines that this is a better configuration.

    The primary benefits of layer normalization are:

    • Faster and More Stable Training: It stabilizes the learning process by ensuring that the inputs to each sub-layer are within a consistent range, regardless of the previous layers' outputs.
    • Reduced Dependence on Learning Rate: Layer normalization makes the network less sensitive to the choice of the learning rate, which simplifies the tuning of hyperparameters.

    In essence, the combination of Add & Norm ensures that the deep Transformer architecture can be trained effectively by creating a direct path for gradient flow and stabilizing the activations within each layer.
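The LayerNorm formula above maps almost directly to code. A small sketch (assuming PyTorch) that computes it by hand and checks the result against the built-in `nn.LayerNorm`:

```python
import torch
import torch.nn as nn

d_model = 8
x = torch.randn(3, d_model)               # 3 tokens, d_model features each

gamma = torch.ones(d_model)               # learnable scale γ (initialized to 1)
beta = torch.zeros(d_model)               # learnable shift β (initialized to 0)
eps = 1e-5

mu = x.mean(dim=-1, keepdim=True)                      # mean over the features of each token
var = x.var(dim=-1, unbiased=False, keepdim=True)      # variance over the features of each token
manual = gamma * (x - mu) / torch.sqrt(var + eps) + beta

reference = nn.LayerNorm(d_model, eps=eps)(x)          # PyTorch's built-in LayerNorm
print(torch.allclose(manual, reference, atol=1e-5))    # True
```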

  • Feed Forward (aka MLP, Multi-Layer Perceptron):

    • A simple, fully connected neural network is applied to each position separately and identically.
    • Typically two linear layers with a ReLU in between.
    • The MLP injects non-linearity and feature transformation into the model, enabling it to learn more complex mappings at each position.

šŸ” This whole block is repeated N times (e.g., 6 in the base model).


Decoder​

The Decoder is on the right and also consists of a stack of identical layers. Its job is to generate the output sequence one token at a time. It receives the encoder's output and the previously generated output tokens as input.

  • Output Embedding & Positional Encoding:
    • Similar to the encoder, the previously generated output words are embedded, and positional information is added.
    • Inputs are shifted right so that predictions can't "see the future."
  • Masked Multi-Head Attention:
    • This is a crucial difference. It's the same as the encoder's multi-head attention but with a "mask" applied. The mask ensures that the model can only attend to the words it has already generated (to preserve autoregressive decoding).
    • Before applying the softmax, the attention scores for future positions are set to $-\infty$ (after passing through the softmax, these become zero); see the causal-mask sketch after this list.
    • This prevents the model from "cheating" by looking at future words in the output sequence during training.
  • Encoder-Decoder Attention:
    • After the masked attention layer, a second multi-head attention layer takes the output from the decoder's masked attention layer and the final output from the encoder stack.
    • This layer allows the decoder to focus on relevant parts of the input sequence (like attention in seq2seq models) while generating the next word, which is essential for tasks like translation.
  • Add & Norm and Feed Forward:
    • These layers function identically to their counterparts in the encoder, further refining the decoder's representation.
  • Linear & Softmax:
    • The final output of the decoder stack is fed into a linear layer, which projects it into a logits vector whose size equals the vocabulary size.
    • The softmax layer then converts these logits into probabilities, indicating the likelihood of each word in the vocabulary being the next word in the output sequence.
    • The word with the highest probability is chosen as the final output.
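A small sketch of the causal mask mentioned above (assuming PyTorch): scores for future positions are set to $-\infty$ before the softmax, so their attention weights come out as exactly zero:

```python
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)    # raw attention scores Q K^T / sqrt(d_k)

# upper-triangular mask: position i may only attend to positions j <= i
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))

weights = scores.softmax(dim=-1)          # future positions now receive zero weight
print(weights)                            # lower-triangular; each row sums to 1
```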

💡 Key Components Highlighted in the Diagram

| Component | Description |
| --- | --- |
| Multi-Head Attention | Allows the model to attend to different parts of the sequence in parallel. |
| Masked Multi-Head Attention | Prevents a position from attending to future positions. |
| Feed Forward | Two-layer FFN applied to each position independently. |
| Add & Norm | Residual connection followed by layer normalization. |
| Positional Encoding | Added to embeddings to encode order of tokens. |
| Softmax | Converts output logits to probabilities. |

Cases where only decoders are used​

1. Autoregressive Language Modeling​

  • Goal: Predict the next token given all previous tokens.

  • Why decoder-only?

    • The model sees only the tokens generated so far.
    • Masked self-attention enforces the "no peeking ahead" rule.
  • Examples:

    • GPT, GPT-2, GPT-3, GPT-4 — large language models for text generation, code generation, and chat.
    • OpenAI Codex, LLaMA, MPT, Falcon, etc.
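As a rough illustration of decoder-only generation (not any particular model's code), the sketch below builds a tiny causal self-attention stack out of PyTorch's encoder classes plus a causal mask, then generates tokens greedily; the model is untrained, so the output is meaningless, but the loop shows the "predict the next token from everything so far" pattern:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 64                       # toy sizes (assumptions)
embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)   # decoder-only = causal self-attention stack
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.tensor([[5, 17, 42]])                # the prompt is just the start of the sequence
for _ in range(5):                                  # greedy autoregressive generation
    causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
    hidden = backbone(embed(tokens), mask=causal)   # masked self-attention over the whole sequence
    next_token = lm_head(hidden[:, -1]).argmax(dim=-1, keepdim=True)
    tokens = torch.cat([tokens, next_token], dim=1)
print(tokens)                                       # prompt followed by 5 generated token ids
```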

2. Text Generation Without Separate Input Sequence​

  • Goal: Produce free-form text from a prompt or initial context.

  • Why decoder-only?

    • The "prompt" is treated as the first part of the output sequence.
    • No need for an encoder because there’s no separate source sequence to encode.
  • Examples:

    • Story generation, poetry, summarization (prompt-based).
    • InstructGPT-style models where instructions are part of the input text.

3. Single-Stream Tasks​

  • Goal: Process input and output in a single stream (one continuous sequence).

  • Why decoder-only?

    • Input and output are concatenated in one sequence.
    • Masking ensures that output tokens can’t see future output tokens, but can see all input tokens (since they appear before them in the sequence).
  • Examples:

    • Prompt-based question answering without separate encoding stage.
    • Few-shot in-context learning (the examples + question are just part of the sequence).

Key Difference from Encoder–Decoder​

  • Encoder–Decoder: Needed when there’s a source sequence and a target sequence (e.g., translation: French → English). The encoder processes the source, and the decoder generates the target conditioned on the source.

  • Decoder-Only: Suitable when:

    1. The model generates based on what it has already generated.
    2. Input and output share the same token stream.
    3. There’s no need for a separate encoded representation.

💡 Rule of thumb:

  • If you have an input sequence separate from the output sequence → use encoder–decoder.
  • If your task is just to generate a continuation of a sequence → use decoder-only.

Online sources:​

  • Stanford CS25: V2 I Introduction to Transformers w/ Andrej Karpathy

  • Visualizing transformers and attention | Talk for TNG Big Tech Day '24