
1.Transformer Architecture

📅 2026-05-17 (created during knowledge-base reorganization) 👉 #AI #LLM #Architecture #DeepLearning #Foundation 📎 Attention Is All You Need (Vaswani et al., 2017) 📎 The Illustrated Transformer (Jay Alammar) 📎 Hugging Face: Transformer Models

1. Overview

1.1. Definition & Why
  • The Transformer is a neural-network architecture introduced by Vaswani et al. (Google, 2017) in the paper Attention Is All You Need. It replaced the previously dominant recurrent (RNN/LSTM) and convolutional (CNN) approaches for sequence modeling and is the architectural foundation of every modern LLM (GPT, Claude, Gemini, Llama, Mistral, DeepSeek).
  • Key idea: instead of processing tokens one at a time (RNN) or via fixed local windows (CNN), the Transformer processes all tokens in a sequence simultaneously using self-attention.
  • Pain points solved over predecessors:
    • Long-range dependencies: RNNs struggle to retain information across long sequences (vanishing gradients); attention can directly relate any two positions in O(1) hops.
    • Parallelization: RNNs are sequential by design; Transformers process the entire sequence in parallel, dramatically improving training-time GPU utilization.
    • Scalability: this parallelism is the precondition for the scaling laws, the empirical finding that loss falls smoothly and predictably as model size, training data, and compute grow.
1.2. Where it sits in the AI stack
  • Foundation note 1.Foundation/1.Intro_to_LLM.md mentions "Transformers: looks at every word in a sentence simultaneously" — this note explains how that actually works.
  • Foundation note 1.Foundation/3.LLM_Application_Logic.md calls Transformer "the bedrock architecture of the AI industry".
  • Other algorithm notes build on this:
    • 2.Attention_Mechanism.md: drills into self-attention specifically.
    • 3.MoE_and_RLHF.md: covers MoE (a Transformer modification) and RLHF (a training technique for Transformer-based models).
    • 4.Quantization_TurboQuant.md: TurboQuant compresses the KV-Cache that arises during Transformer inference.

2. Concept, Component, & Architecture

2.1. The High-Level Picture

A Transformer takes a sequence of tokens (e.g., the text of a sentence, encoded as integer IDs) and produces a sequence of output vectors (or, for an LLM, a probability distribution over the next token).

The original 2017 architecture had two halves:
  • Encoder: reads the input sequence and produces contextualized representations. Used in BERT-style models.
  • Decoder: generates an output sequence one token at a time, attending to previously generated tokens and (in encoder-decoder models) to the encoder output.

Modern LLMs (GPT, Claude, Llama, etc.) are decoder-only — they use just the decoder half, generating text autoregressively.

```mermaid
flowchart LR
  Input[Token IDs] --> Embed[Token Embeddings]
  Embed --> PE[+ Positional Encoding]
  PE --> Block1[Transformer Block 1]
  Block1 --> Block2[Transformer Block 2]
  Block2 --> BlockN[... Block N]
  BlockN --> LN[Final LayerNorm]
  LN --> Head[LM Head: linear → softmax]
  Head --> Output[Next-token probabilities]
```
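
The pipeline in the diagram can be traced as a sequence of array shapes. The following is a minimal NumPy sketch, not a real model: the sizes are toy values, the weights are random, and `block` is an identity stand-in for the Transformer blocks described in 2.2 (all names here are assumptions made for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, seq_len = 50, 16, 5   # toy sizes, chosen for illustration

token_ids = rng.integers(0, vocab_size, size=seq_len)   # (seq_len,)
embed_table = rng.normal(size=(vocab_size, d_model))    # token embedding table
pos_table = rng.normal(size=(seq_len, d_model))         # stand-in positional encoding

x = embed_table[token_ids] + pos_table                  # (seq_len, d_model)

def block(h):
    # Identity placeholder: a real block also maps (seq_len, d_model) -> (seq_len, d_model).
    return h

for _ in range(3):                                       # "Block 1 ... Block N"
    x = block(x)

def layer_norm(h, eps=1e-5):
    return (h - h.mean(-1, keepdims=True)) / np.sqrt(h.var(-1, keepdims=True) + eps)

def softmax(z):
    z = z - z.max(-1, keepdims=True)                     # subtract max for stability
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

w_head = rng.normal(size=(d_model, vocab_size))          # LM head projection
logits = layer_norm(x) @ w_head                          # (seq_len, vocab_size)
probs = softmax(logits)
next_token_probs = probs[-1]                             # distribution over the next token
```

Every stage keeps the `(seq_len, d_model)` shape until the LM head, which is what allows blocks to be stacked freely.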
2.2. Inside a Transformer Block

A single Transformer block contains two sub-layers, each wrapped in a residual connection and layer normalization.

       ┌──────────────────────────────────────┐
       │  Multi-Head Self-Attention           │
input ─┤  + residual connection               ├─→ intermediate
       │  + LayerNorm (or RMSNorm in LLaMA)   │
       └──────────────────────────────────────┘
       ┌──────────────────────────────────────┐
       │  Feed-Forward Network (FFN/MLP)      │
       │  + residual connection               ├─→ output
       │  + LayerNorm                         │
       └──────────────────────────────────────┘

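Mathematically, in the post-norm arrangement of the original 2017 paper:

intermediate = LayerNorm(input + MultiHeadAttention(input))
output       = LayerNorm(intermediate + FFN(intermediate))

Most modern LLMs use the pre-norm arrangement instead (see Layer Normalization in 2.3), which normalizes the input of each sub-layer and leaves the residual path untouched:

intermediate = input + MultiHeadAttention(LayerNorm(input))
output       = intermediate + FFN(LayerNorm(intermediate))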

A modern LLM stacks 12-128+ such blocks. GPT-3 has 96; Llama 3 70B has 80.

2.3. The Core Components
(1) Token Embeddings
  • Each token ID is mapped to a dense vector via a learned embedding table of shape (vocab_size, d_model).
  • d_model is the model's hidden dimension (e.g., 768 for BERT-base, 4096 for Llama-3-8B, 12288 for GPT-3).
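
An embedding lookup is just row indexing into the table, which is equivalent to multiplying a one-hot vector by the table. A small NumPy sketch with toy sizes and random values (both are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 4                      # toy sizes
embed_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([3, 7, 3])

# Lookup: pick one row of the table per token ID.
looked_up = embed_table[token_ids]               # (3, d_model)

# Same result via one-hot vectors and a matrix product.
one_hot = np.eye(vocab_size)[token_ids]          # (3, vocab_size)
via_matmul = one_hot @ embed_table               # (3, d_model)
```

Identical token IDs (positions 0 and 2 here) receive identical vectors, which is one reason positional information has to be added separately.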
(2) Positional Encoding
  • Self-attention on its own is permutation-equivariant: shuffling the input tokens merely shuffles the outputs in the same way, so the layer has no notion of word order. To inject order, the model adds positional information.
  • Sinusoidal positional encoding (original 2017 paper): hand-crafted sin/cos waves of varying frequencies.
  • Learned positional embeddings (BERT, GPT-2): a second embedding table indexed by position.
  • RoPE (Rotary Position Embedding) — used by Llama, Qwen, DeepSeek, Mistral: rotates the query/key vectors in 2D subspaces by an angle proportional to position. Generalizes better to longer contexts than absolute encodings.
  • ALiBi (Attention with Linear Biases) — used by some open models: adds a position-dependent bias directly to attention scores; no learned position embeddings needed.
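
As an illustration of the first scheme in the list, here is a sketch of the sinusoidal encoding from the 2017 paper, where even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine. The function name and the toy sizes are assumptions for this example.

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model / 2)
    angles = pos / (10000.0 ** (2 * i / d_model))      # lower i -> higher frequency
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = sinusoidal_positions(seq_len=8, d_model=16)
# The encoding is added to the token embeddings: x = embeddings + pe
```

Because the table is computed rather than learned, it can be generated for any sequence length without adding parameters.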
(3) Multi-Head Self-Attention (MHA)
  • The heart of the Transformer. Detailed in 2.Attention_Mechanism.md.
  • Each token computes Query/Key/Value vectors, attends to every token in the sequence (itself included) via scaled dot-product attention, and the output is a weighted sum of the Value vectors.
  • "Multi-head" splits the hidden dimension into h independent heads (typically 8-128) that attend in parallel — this lets the model jointly attend to information from different representation subspaces.
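
The three bullets above can be sketched in NumPy as follows. This is a simplified, unmasked, single-sequence version with random weights and no biases; the function and weight names (`w_q`, `w_k`, `w_v`, `w_o`) are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_heads):
    seq_len, d_model = x.shape
    head_dim = d_model // n_heads

    def split(t):
        # (seq_len, d_model) -> (n_heads, seq_len, head_dim)
        return t.reshape(seq_len, n_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    # Scaled dot-product scores, one (seq_len, seq_len) matrix per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores)
    heads = weights @ v                                   # weighted sum of Values
    # Concatenate the heads back to d_model, then apply the output projection.
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 16, 4                      # toy sizes
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_heads)
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the Value vectors within its head.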
(4) Feed-Forward Network (FFN / MLP)
  • Applied independently to each position. Two linear layers with a non-linearity in between: FFN(x) = W_2 · activation(W_1 · x + b_1) + b_2
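  • Common activations:
    • ReLU (original): max(0, x)
    • GELU (BERT, GPT-2/3): a smooth alternative to ReLU that weights x by the Gaussian CDF, x · Φ(x)
    • SwiGLU (Llama, Mistral, DeepSeek): a gated variant, W_2 · (silu(W_1 · x) ⊙ (W_3 · x)), which adds a third matrix W_3 and usually drops the biases; widely adopted in recent LLMs for its quality gains
  • Hidden dimension is typically 4× d_model. SwiGLU models often shrink it to about 8/3 × d_model (≈2.7×) so that three matrices cost roughly as many parameters as the original two.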
  • The FFN holds the bulk of the model's parameters — making MoE (where FFNs become "experts") a natural place to add sparsity (see 3.MoE_and_RLHF.md).
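
The 4× versus roughly 2.7× sizing can be checked with simple arithmetic. The sketch below implements a bias-free SwiGLU FFN in NumPy and compares parameter counts; the names `w_gate`, `w_up`, and `w_down` are assumptions for illustration and do not correspond to any particular codebase.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))                 # z * sigmoid(z)

def swiglu_ffn(x, w_gate, w_up, w_down):
    # The gate branch passes through SiLU and multiplies the "up" branch elementwise.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

d_model = 12                                      # toy size, divisible by 3
d_ff_classic = 4 * d_model                        # classic two-matrix FFN
d_ff_swiglu = 8 * d_model // 3                    # about 2.67 x d_model

params_classic = 2 * d_model * d_ff_classic       # W_1 and W_2
params_swiglu = 3 * d_model * d_ff_swiglu         # gate, up, and down matrices

rng = np.random.default_rng(0)
x = rng.normal(size=(5, d_model))
w_gate = rng.normal(size=(d_model, d_ff_swiglu))
w_up = rng.normal(size=(d_model, d_ff_swiglu))
w_down = rng.normal(size=(d_ff_swiglu, d_model))
y = swiglu_ffn(x, w_gate, w_up, w_down)           # (5, d_model), applied per position
```

With these sizes both variants use the same number of weights, which is the reason for the 8/3 factor.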
(5) Layer Normalization
  • Stabilizes training by normalizing activations to zero mean and unit variance.
  • LayerNorm (original): per-token normalization with learned scale and shift.
  • RMSNorm (Llama, Mistral, DeepSeek): drops the mean centering and the shift, rescaling by the root mean square only; it is cheaper to compute and performs comparably in practice.
  • Modern LLMs almost universally use pre-norm (LayerNorm before each sub-layer) rather than post-norm — pre-norm trains more stably at depth.
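
The difference between the two normalizations is easiest to see side by side. A minimal NumPy sketch, assuming per-token normalization over the last axis and an `eps` of 1e-5 (a common default, chosen here for illustration):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Center, scale to unit variance, then apply learned scale and shift.
    mean = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # No centering and no shift: divide by the root mean square only.
    rms = np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)
    return gamma * x / rms

rng = np.random.default_rng(0)
d_model = 8
x = rng.normal(size=(3, d_model)) + 5.0          # deliberately off-center activations
gamma, beta = np.ones(d_model), np.zeros(d_model)

ln = layer_norm(x, gamma, beta)                  # per-token mean ~0, variance ~1
rn = rms_norm(x, gamma)                          # per-token mean square ~1, mean not removed
```

RMSNorm skips one reduction (the mean) and one parameter vector (the shift), which is where its speed advantage comes from.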
(6) Residual Connections
  • The + input skip-connections around each sub-layer let gradients flow directly back to earlier layers, enabling training of very deep networks (100+ layers).
2.4. Encoder-only, Decoder-only, Encoder-Decoder
| Variant | Representative models | Use case |
|---|---|---|
| Encoder-only | BERT, RoBERTa, ModernBERT | Classification, sentence embeddings, NER |
| Decoder-only | GPT, Claude, Llama, Gemini | Text generation, chat, code gen (all modern LLMs) |
| Encoder-Decoder | T5, BART, original Transformer | Translation, summarization, seq2seq tasks |

Decoder-only models use causal (masked) self-attention: the token at position t can only attend to positions ≤ t. Because earlier positions never see later ones, their Keys and Values do not change as new tokens are appended, so autoregressive generation can reuse them instead of recomputing them (see KV-Cache below).
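
A causal mask is usually applied by setting the scores of future positions to negative infinity before the softmax. The following NumPy sketch uses all-zero scores so that the effect of the mask alone is visible; the function name is an assumption for illustration.

```python
import numpy as np

def causal_softmax(scores):
    """Softmax over the last axis with positions j > t masked out for each row t."""
    seq_len = scores.shape[-1]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(-1, keepdims=True)   # the diagonal is never masked, so max is finite
    e = np.exp(masked)                                # exp(-inf) = 0 for future positions
    return e / e.sum(-1, keepdims=True)

scores = np.zeros((4, 4))          # equal scores: each row spreads weight evenly over its past
weights = causal_softmax(scores)
# Row 0 attends only to itself; row 3 splits its weight evenly over positions 0..3.
```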

3. The Training Objective

Modern LLMs are pre-trained with one simple objective: next-token prediction.

Given a sequence of tokens $x_1, x_2, ..., x_T$, the model is trained to maximize: $$\sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, ..., x_{t-1})$$

That is — predict each token given the previous ones. This is sometimes called causal language modeling or autoregressive language modeling. With enough data (trillions of tokens) and parameters (billions), this single objective produces emergent capabilities — reasoning, code generation, in-context learning.
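
In code, the objective is usually implemented as a cross-entropy between the logits at position t and the token at position t + 1, averaged over positions (minimizing this mean negative log-likelihood is equivalent to maximizing the sum above). A NumPy sketch under the assumption that `logits[t]` is the model's output after reading tokens up to and including position t; without a start-of-sequence token, the first token is not scored:

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Mean negative log-likelihood of each token given the tokens before it."""
    pred = logits[:-1]                     # predictions made at positions 0..T-2
    target = token_ids[1:]                 # the tokens that actually came next
    pred = pred - pred.max(-1, keepdims=True)
    log_probs = pred - np.log(np.exp(pred).sum(-1, keepdims=True))   # log-softmax
    return -log_probs[np.arange(len(target)), target].mean()

vocab_size = 8
token_ids = np.array([1, 2, 3, 4, 5])

# A model that knows nothing (uniform logits) scores exactly log(vocab_size).
uniform_loss = next_token_loss(np.zeros((len(token_ids), vocab_size)), token_ids)

# A model that puts high logits on the correct next token scores close to zero.
good_logits = np.full((len(token_ids), vocab_size), -10.0)
good_logits[np.arange(len(token_ids) - 1), token_ids[1:]] = 10.0
good_loss = next_token_loss(good_logits, token_ids)
```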

Post-training (instruction tuning + RLHF / DPO; see 3.MoE_and_RLHF.md) is what turns a raw "next-token predictor" into a useful assistant.

4. KV-Cache: Why Long Context is Expensive

During autoregressive generation, naively re-running the entire Transformer on the full prefix for every new token repeats work: the attention cost of each step is O(N²) for a prefix of length N. The KV-Cache stores the Key and Value tensors already computed for past tokens, so each new token computes only its own Query, Key, and Value and attends against the cache, which is O(N) per step.
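
The equivalence between full recomputation and cached decoding can be demonstrated for a single attention head. This is a simplified NumPy sketch (one head, one layer, random weights, all names assumed for illustration), not a production cache:

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def causal_attention_full(x, w_q, w_k, w_v):
    # Recompute Q, K, V for the whole sequence and apply a causal mask.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    return softmax(np.where(future, -np.inf, scores)) @ v

def causal_attention_cached(x, w_q, w_k, w_v):
    # Process one token per step, appending its K and V to the cache.
    k_cache, v_cache, outputs = [], [], []
    for x_t in x:
        k_cache.append(x_t @ w_k)          # computed once, reused at every later step
        v_cache.append(x_t @ w_v)
        q_t = x_t @ w_q                    # only the new token needs a Query
        keys, values = np.stack(k_cache), np.stack(v_cache)
        weights = softmax(keys @ q_t / np.sqrt(q_t.shape[-1]))
        outputs.append(weights @ values)
    return np.stack(outputs)

rng = np.random.default_rng(0)
seq_len, d = 6, 8                          # toy sizes
x = rng.normal(size=(seq_len, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.3 for _ in range(3))

full = causal_attention_full(x, w_q, w_k, w_v)
cached = causal_attention_cached(x, w_q, w_k, w_v)
```

Both paths produce the same outputs; the cached path trades memory (the stored K and V) for the compute it avoids.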

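  • Memory footprint per token per layer: 2 × num_kv_heads × head_dim × bytes_per_value (the factor 2 covers K and V).
  • For Llama-3-70B (80 layers, 8 KV heads via GQA, head_dim 128) in FP16: 2 × 8 × 128 × 2 bytes × 80 layers ≈ 0.31 MB per token.
  • A 128K-token context → ~40 GB of KV-Cache per sequence. With full multi-head attention (64 KV heads), the same model would need ~2.5 MB per token and ~320 GB.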
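
The per-token formula can be checked with a few lines of arithmetic. The sketch assumes Llama-3-70B's published shape (80 layers, 64 query heads, head_dim 128) and FP16 storage, and evaluates two cases: one K/V head per query head (64), and the 8 K/V heads the model uses under GQA. The function name is an assumption for illustration.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # Factor 2: one Key vector and one Value vector per KV head per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

MiB, GiB = 2**20, 2**30
context = 128 * 1024                       # 128K tokens

mha = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128)
gqa = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)

print(mha / MiB, mha * context / GiB)      # 2.5 MiB per token, 320.0 GiB per sequence
print(gqa / MiB, gqa * context / GiB)      # 0.3125 MiB per token, 40.0 GiB per sequence
```

The 8× gap between the two cases is exactly the ratio of query heads to KV heads, and the totals are per sequence, so they multiply again with batch size.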

This is why innovations target the KV-Cache:
  • Multi-Query Attention (MQA): one K/V head shared across all Q heads (used in PaLM and Falcon-7B). Cuts KV-Cache by a factor of num_heads.
  • Grouped-Query Attention (GQA): a middle ground in which small groups of Q heads share one K/V head (used by Llama-3, Mistral). Keeps quality close to full multi-head attention at a fraction of the memory.
  • Multi-Head Latent Attention (MLA): DeepSeek's innovation, which compresses K/V into a low-rank latent that is decompressed on-the-fly. Cuts KV-Cache by 90%+ (see 1.Foundation/2.LLM_Industry_Overview.md §1.2.4).
  • TurboQuant / PolarQuant: compress what's stored in the cache (see 4.Quantization_TurboQuant.md).
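
Grouped-query attention can be sketched by storing fewer K/V heads and repeating each one across its group of query heads at compute time. The NumPy example below is unmasked and uses random tensors for brevity; the function name and sizes are assumptions for illustration. Setting `n_kv_heads` to 1 gives MQA, and setting it equal to `n_q_heads` gives ordinary multi-head attention.

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def grouped_query_attention(q, k, v):
    # q: (n_q_heads, seq_len, head_dim); k, v: (n_kv_heads, seq_len, head_dim)
    n_q_heads, n_kv_heads = q.shape[0], k.shape[0]
    assert n_q_heads % n_kv_heads == 0
    group = n_q_heads // n_kv_heads
    # Only the small k and v need to be cached; the repeat happens at compute time.
    k_rep = np.repeat(k, group, axis=0)
    v_rep = np.repeat(v, group, axis=0)
    scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v_rep

rng = np.random.default_rng(0)
n_q_heads, n_kv_heads, seq_len, head_dim = 8, 2, 5, 4     # toy sizes
q = rng.normal(size=(n_q_heads, seq_len, head_dim))
k = rng.normal(size=(n_kv_heads, seq_len, head_dim))
v = rng.normal(size=(n_kv_heads, seq_len, head_dim))

out = grouped_query_attention(q, k, v)                     # (n_q_heads, seq_len, head_dim)
cache_reduction = n_q_heads // n_kv_heads                  # cached K/V is this many times smaller
```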

5. Practical Notes

5.1. Implementing a Minimal Transformer Block (PyTorch)
```python
from typing import Optional

import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Pre-norm Transformer block: x + Attn(LN(x)), then x + FFN(LN(x))."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, attn_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # x: (batch, seq_len, d_model). For a causal LM, pass a boolean attn_mask of
        # shape (seq_len, seq_len) in which True marks positions that may NOT be attended to.
        # Pre-norm self-attention: normalize once and reuse as query, key, and value.
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + self.dropout(a)
        # Pre-norm FFN
        x = x + self.dropout(self.ffn(self.norm2(x)))
        return x
```
5.2. Common Hyperparameters
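| Model | Layers | d_model | n_heads | Total params |
|---|---|---|---|---|
| BERT-base | 12 | 768 | 12 | 110M |
| GPT-2 (XL) | 48 | 1600 | 25 | 1.5B |
| GPT-3 | 96 | 12288 | 96 | 175B |
| Llama-3-8B | 32 | 4096 | 32 | 8B |
| Llama-3-70B | 80 | 8192 | 64 | 70B |
| DeepSeek-V3 | 61 | 7168 | 128 | 671B (MoE; 37B active) |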
5.3. Common Pitfalls
  • Forgetting attention masks: causal LMs must mask future tokens during training. Forgetting this leaks information from the future and the model "cheats".
  • Wrong positional encoding for long context: a model trained on 4K context will degrade severely at 32K unless its positional scheme extrapolates or is adapted for longer inputs (ALiBi, RoPE with NTK-aware or interpolation-based scaling, etc.).
  • Training instability without pre-norm: deep post-norm Transformers are hard to train without careful learning-rate warmup and initialization; pre-norm is the safer default at depth.

6. Common Q & A

  • Q: Why is "Attention Is All You Need" such a big deal?
  • A: It showed you don't need recurrence or convolution to model sequences — pure attention is enough, and it parallelizes beautifully. This unlocked the scaling laws that produced today's LLMs.
  • Q: What's the difference between encoder-only and decoder-only?
  • A: Encoder-only sees the whole input bidirectionally — good for understanding tasks. Decoder-only is causal (only sees the past) — good for generation. Modern LLMs use decoder-only because next-token prediction at scale produces emergent reasoning.
  • Q: Why did Transformers replace RNNs/LSTMs for everything?
  • A: Three reasons: (1) parallelism in training, (2) explicit access to long-range dependencies via attention, (3) scaling laws — Transformer performance improves smoothly with size, while RNN gains plateau.
  • Q: Are Transformers the final architecture?
  • A: Probably not. State-space models (Mamba), Hyena, and RWKV explore alternatives that avoid attention's quadratic cost in sequence length. As of 2026, Transformers remain dominant, but hybrid architectures such as Jamba (Transformer + Mamba) are emerging, and the Transformer itself keeps evolving (e.g., DeepSeek's MLA).