1.Transformer Architecture

📅 2026-05-17 (created during knowledge-base reorganization) 👉 #AI #LLM #Architecture #DeepLearning #Foundation 📎 Attention Is All You Need (Vaswani et al., 2017) 📎 The Illustrated Transformer (Jay Alammar) 📎 Hugging Face: Transformer Models

1. Overview

1.1. Definition & Why

The Transformer is a neural-network architecture introduced by Vaswani et al. (Google, 2017) in the paper Attention Is All You Need. It replaced the previously dominant recurrent (RNN/LSTM) and convolutional (CNN) approaches for sequence modeling and is the architectural foundation of nearly every modern LLM (GPT, Claude, Gemini, Llama, Mistral, DeepSeek).
Key idea: instead of processing tokens one at a time (RNN) or via fixed local windows (CNN), the Transformer processes all tokens in a sequence simultaneously using self-attention.
Pain points solved over predecessors:
Long-range dependencies: RNNs struggle to retain information across long sequences (vanishing gradients); attention can directly relate any two positions in O(1) hops.
Parallelization: RNNs are sequential by design; Transformers process the entire sequence in parallel, dramatically improving training-time GPU utilization.
Scalability: this parallelism is the precondition for scaling laws, the empirical finding that loss falls predictably as model size, training data, and compute grow together.

1.2. Where it sits in the AI stack

Foundation note 1.Foundation/1.Intro_to_LLM.md mentions "Transformers: looks at every word in a sentence simultaneously" — this note explains how that actually works.
Foundation note 1.Foundation/3.LLM_Application_Logic.md calls Transformer "the bedrock architecture of the AI industry".
Other algorithm notes build on this:
2.Attention_Mechanism.md — drills into self-attention specifically.
3.MoE_and_RLHF.md — covers MoE (a Transformer modification) and RLHF (a training technique for Transformer-based models).
4.Quantization_TurboQuant.md — TurboQuant compresses the KV-Cache that arises during Transformer inference.

2. Concept, Component, & Architecture

2.1. The High-Level Picture

A Transformer takes a sequence of tokens (e.g., the text of a sentence, encoded as integer IDs) and produces a sequence of output vectors (or, for an LLM, a probability distribution over the next token).

The original 2017 architecture had two halves:
- Encoder: reads the input sequence and produces contextualized representations. Encoder-only stacks are used in BERT-style models.
- Decoder: generates an output sequence one token at a time, attending to previously generated tokens and (in encoder-decoder models) to the encoder output.

Modern LLMs (GPT, Claude, Llama, etc.) are decoder-only — they use just the decoder half, generating text autoregressively.

flowchart LR
  Input[Token IDs] --> Embed[Token Embeddings]
  Embed --> PE[+ Positional Encoding]
  PE --> Block1[Transformer Block 1]
  Block1 --> Block2[Transformer Block 2]
  Block2 --> BlockN[... Block N]
  BlockN --> LN[Final LayerNorm]
  LN --> Head[LM Head: linear → softmax]
  Head --> Output[Next-token probabilities]

2.2. Inside a Transformer Block

A single Transformer block contains two sub-layers, each wrapped in a residual connection and layer normalization.

       ┌──────────────────────────────────────┐
       │  Multi-Head Self-Attention           │
input ─┤  + residual connection               ├─→ intermediate
       │  + LayerNorm (or RMSNorm in LLaMA)   │
       └──────────────────────────────────────┘
       ┌──────────────────────────────────────┐
       │  Feed-Forward Network (FFN/MLP)      │
       │  + residual connection               ├─→ output
       │  + LayerNorm                         │
       └──────────────────────────────────────┘

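Mathematically, in the original post-norm layout drawn above:

intermediate = LayerNorm(input + MultiHeadAttention(input))
output       = LayerNorm(intermediate + FFN(intermediate))

Most modern LLMs use the pre-norm layout instead (see (5) Layer Normalization), which normalizes each sub-layer's input and leaves the residual path untouched:

intermediate = input + MultiHeadAttention(Norm(input))
output       = intermediate + FFN(Norm(intermediate))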

A modern LLM stacks 12-128+ such blocks. GPT-3 has 96; Llama 3 70B has 80.

2.3. The Core Components

(1) Token Embeddings

Each token ID is mapped to a dense vector via a learned embedding table of shape (vocab_size, d_model).
d_model is the model's hidden dimension (e.g., 768 for BERT-base, 4096 for Llama-3-8B, 12288 for GPT-3).

(2) Positional Encoding

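Self-attention is permutation-equivariant: without positional information, shuffling the input tokens simply shuffles the outputs the same way, so the model cannot tell one word order from another. To inject order, the model adds positional information.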
Sinusoidal positional encoding (original 2017 paper): hand-crafted sin/cos waves of varying frequencies.
Learned positional embeddings (BERT, GPT-2): a second embedding table indexed by position.
RoPE (Rotary Position Embedding) — used by Llama, Qwen, DeepSeek, Mistral: rotates the query/key vectors in 2D subspaces by an angle proportional to position. Generalizes better to longer contexts than absolute encodings.
ALiBi (Attention with Linear Biases) — used by some open models: adds a position-dependent bias directly to attention scores; no learned position embeddings needed.
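The two most common schemes above can be sketched in a few lines of dependency-free Python. This is an illustrative sketch, not any particular model's implementation: the function names are made up for this note, `base=10000` follows the 2017 paper, and the RoPE version pairs adjacent dimensions (some implementations pair dimension i with i + d/2 instead). The last lines check the property that makes RoPE attractive: the query/key score depends only on the relative offset between positions.

```python
import math

def sinusoidal_encoding(position: int, d_model: int) -> list[float]:
    """Sinusoidal encoding for one position (d_model assumed even)."""
    encoding = []
    for i in range(0, d_model, 2):
        # Each (sin, cos) pair uses wavelength 10000^(i / d_model).
        angle = position / (10000 ** (i / d_model))
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding

def rope_rotate(vec: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate each adjacent (even, odd) pair of vec by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
# Same relative offset (3) at very different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```

Up to floating-point error, `score_near` and `score_far` agree, whereas adding `sinusoidal_encoding` to the embeddings encodes absolute position only.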

(3) Multi-Head Self-Attention (MHA)

The heart of the Transformer. Detailed in 2.Attention_Mechanism.md.
Each token computes Query/Key/Value vectors, scores its Query against the Keys of all tokens in the sequence (itself included) via scaled dot-product, and outputs the softmax-weighted sum of the Value vectors.
"Multi-head" splits the hidden dimension into h independent heads (typically 8-128) that attend in parallel — this lets the model jointly attend to information from different representation subspaces.
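As a rough illustration of the scaled dot-product step, here is a single-head sketch in plain Python. It assumes the learned Q/K/V projections are the identity (so the same vectors serve as queries, keys, and values) and omits masking; a multi-head layer would run the same routine on h slices of the hidden dimension and concatenate the results. The function names are illustrative only.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
outputs, weights = attention(x, x, x)
```

Each row of `weights` sums to 1, and the first token puts more weight on tokens 0 and 2 (which share its first coordinate) than on token 1.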

(4) Feed-Forward Network (FFN / MLP)

Applied independently to each position. Two linear layers with a non-linearity in between: FFN(x) = W_2 · activation(W_1 · x + b_1) + b_2
Common activations:
ReLU (original): max(0, x)
GELU (BERT, GPT-2/3): smooth gating
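SwiGLU (Llama, Mistral, DeepSeek): W_2 · (silu(W_1 · x) ⊙ (W_3 · x)) in Llama's naming, i.e. a SiLU-gated product that is then projected back to d_model. It improves quality and is widely used in modern LLMs.
Hidden dimension is typically 4× d_model (or roughly 8/3 ≈ 2.7× for SwiGLU, which has three matrices, so the parameter count stays about the same).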
The FFN holds the bulk of the model's parameters — making MoE (where FFNs become "experts") a natural place to add sparsity (see 3.MoE_and_RLHF.md).
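The claim that the FFN dominates the parameter budget can be checked with simple arithmetic. The sketch below counts weight matrices only (no biases or norms) and assumes full multi-head attention with four d_model × d_model projections; `block_param_counts` is a hypothetical helper written for this note.

```python
def block_param_counts(d_model: int, ffn_mult: float, ffn_matrices: int) -> dict:
    """Approximate weight counts for one Transformer block."""
    attn = 4 * d_model * d_model          # W_Q, W_K, W_V, W_O
    d_ff = int(ffn_mult * d_model)        # FFN hidden width
    ffn = ffn_matrices * d_model * d_ff   # 2 matrices (classic) or 3 (SwiGLU)
    return {"attention": attn, "ffn": ffn, "ffn_share": ffn / (attn + ffn)}

classic = block_param_counts(4096, 4.0, 2)    # 4x hidden, two matrices
swiglu = block_param_counts(4096, 8 / 3, 3)   # ~2.7x hidden, three matrices
```

In both configurations the FFN accounts for about two thirds of the block's weights, which is why shrinking the hidden width to roughly 8/3 × d_model keeps a SwiGLU block the same size as a classic one.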

(5) Layer Normalization

Stabilizes training by normalizing activations to zero mean and unit variance.
LayerNorm (original): per-token normalization with learned scale and shift.
RMSNorm (Llama, Mistral, DeepSeek): drops the mean centering and shift; faster and surprisingly works just as well.
Modern LLMs almost universally use pre-norm (LayerNorm before each sub-layer) rather than post-norm — pre-norm trains more stably at depth.
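The difference between the two normalizations is easiest to see side by side. This is a minimal per-token sketch in plain Python, assuming an epsilon of 1e-5 and unit scale; real implementations operate on tensors and learn `gamma` (and `beta` for LayerNorm).

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Center, scale to unit variance, then apply learned scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """Divide by the root mean square only: no centering, no shift."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

x = [2.0, 4.0, 6.0, 8.0]
ones, zeros = [1.0] * 4, [0.0] * 4
ln = layer_norm(x, ones, zeros)   # zero mean, unit variance
rn = rms_norm(x, ones)            # unit RMS, mean left as-is
```

RMSNorm skips the mean computation and the shift parameter, which is where its speed advantage comes from.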

(6) Residual Connections

The + input skip-connections around each sub-layer let gradients flow directly back to earlier layers, enabling training of very deep networks (100+ layers).

2.4. Encoder-only, Decoder-only, Encoder-Decoder

Variant	Representative models	Use case
Encoder-only	BERT, RoBERTa, ModernBERT	Classification, sentence embeddings, NER
Decoder-only	GPT, Claude, Llama, Gemini	Text generation, chat, code gen — all modern LLMs
Encoder-Decoder	T5, BART, original Transformer	Translation, summarization, seq2seq tasks

Decoder-only models use causal (masked) self-attention: the token at position t can only attend to positions ≤ t. Because earlier positions never see later ones, their Keys and Values stay valid as new tokens are appended, so generation can reuse them instead of re-running the network on already-generated tokens (see KV-Cache below).
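A causal mask is just a lower-triangular pattern applied to the attention scores before the softmax. The sketch below uses the convention "True means allowed", which is an assumption of this example; libraries differ (for instance, `torch.nn.MultiheadAttention` documents boolean masks with True meaning "not allowed"), so it is worth checking before passing a mask in.

```python
def causal_mask(seq_len: int) -> list[list[bool]]:
    """mask[t][s] is True when position t may attend to position s (s <= t)."""
    return [[s <= t for s in range(seq_len)] for t in range(seq_len)]

def apply_mask(scores, mask):
    """Set disallowed scores to -inf so the softmax gives them zero weight."""
    return [[sc if ok else float("-inf") for sc, ok in zip(row, mrow)]
            for row, mrow in zip(scores, mask)]

mask = causal_mask(4)
masked_scores = apply_mask([[0.0] * 4 for _ in range(4)], mask)
```

Row t of `masked_scores` keeps entries 0..t and blanks out everything to the right, so no token can see its future.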

3. The Training Objective

Modern LLMs are pre-trained with one simple objective: next-token prediction.

Given a sequence of tokens $x_1, x_2, \dots, x_T$, the model is trained to maximize:
$$\sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \dots, x_{t-1})$$

That is — predict each token given the previous ones. This is sometimes called causal language modeling or autoregressive language modeling. With enough data (trillions of tokens) and parameters (billions), this single objective produces emergent capabilities — reasoning, code generation, in-context learning.
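In practice the objective is minimized as an average negative log-likelihood (cross-entropy), with the logits at position t scored against the token at position t+1. The following is a small sketch under that assumption; `next_token_loss` is an illustrative name, and real training code works on batched tensors.

```python
import math

def log_softmax(logits: list[float]) -> list[float]:
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]

def next_token_loss(logits_per_position, token_ids):
    """Mean negative log-likelihood; logits at position t predict token t+1."""
    targets = token_ids[1:]  # shift by one: the first token has no prediction
    total = 0.0
    for logits, target in zip(logits_per_position, targets):
        total -= log_softmax(logits)[target]
    return total / len(targets)

vocab_size = 5
tokens = [2, 0, 4, 1]
# An untrained model with uniform logits assigns probability 1/vocab_size everywhere.
uniform_logits = [[0.0] * vocab_size for _ in tokens[:-1]]
loss = next_token_loss(uniform_logits, tokens)
```

With uniform predictions the loss equals log(vocab_size), which is a useful sanity check for the starting loss of a freshly initialized model.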

Post-training (instruction tuning + RLHF / DPO; see 3.MoE_and_RLHF.md) is what turns a raw "next-token predictor" into a useful assistant.

4. KV-Cache: Why Long Context is Expensive

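During autoregressive generation, naively re-running the entire Transformer for every new token recomputes the Keys and Values of all earlier tokens at each step: attention alone costs O(N²) per step, or O(N³) over an N-token generation. The KV-Cache stores the Key and Value tensors already computed for past tokens, so each new token computes only its own Q/K/V and attends against the cache, at O(N) attention cost per step. The price is memory that grows linearly with context length.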
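The saving can be illustrated with a toy single-head decoder. In this sketch `project` is a hypothetical stand-in for the learned K and V projections (plain scaling), and the counter tracks how many such projections each strategy performs; both strategies produce identical outputs.

```python
import math

def attend(q, keys, values):
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e / z * v[j] for e, v in zip(exps, values)) for j in range(len(values[0]))]

def project(x, scale):
    # Stand-in for a learned K or V projection (hypothetical: plain scaling).
    return [scale * v for v in x]

def decode_without_cache(tokens):
    outputs, projections = [], 0
    for t in range(len(tokens)):
        # Recompute K and V for the whole prefix at every step.
        keys = [project(x, 0.5) for x in tokens[: t + 1]]
        values = [project(x, 2.0) for x in tokens[: t + 1]]
        projections += 2 * (t + 1)
        outputs.append(attend(tokens[t], keys, values))
    return outputs, projections

def decode_with_cache(tokens):
    outputs, projections = [], 0
    k_cache, v_cache = [], []
    for x in tokens:
        # Only the newest token is projected; older K/V come from the cache.
        k_cache.append(project(x, 0.5))
        v_cache.append(project(x, 2.0))
        projections += 2
        outputs.append(attend(x, k_cache, v_cache))
    return outputs, projections

tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
slow, slow_projections = decode_without_cache(tokens)
fast, fast_projections = decode_with_cache(tokens)
```

For four tokens the uncached loop performs 20 projections against 8 with the cache, and the gap widens quadratically as the sequence grows.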

Memory footprint per token per layer: 2 × num_kv_heads × head_dim × precision
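For Llama-3-70B (80 layers, head_dim 128, 8 KV heads via GQA): ~0.31 MB per token across all layers in FP16. Without GQA (64 KV heads, one per query head) the same model would need ~2.5 MB per token.
A 128K-token context → ~40 GB of KV-Cache per sequence with GQA, versus ~320 GB without it.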
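The formula above is easy to evaluate directly. The sketch below assumes Llama-3-70B's published shape (80 layers, 64 query heads, head_dim 128, 8 KV heads under GQA) and 2 bytes per value for FP16; `kv_cache_bytes` is a helper written for this note. Setting the KV-head count to 64 (one per query head, as in plain multi-head attention) yields 2.5 MiB per token, while the 8 KV heads the model actually uses give an eighth of that.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, bytes_per_value, num_tokens=1):
    """Bytes needed to cache K and V for num_tokens tokens across all layers."""
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_value
    return per_token_per_layer * num_layers * num_tokens

MIB, GIB = 1024 ** 2, 1024 ** 3
context = 128 * 1024  # 128K tokens

mha_per_token = kv_cache_bytes(80, 64, 128, 2)   # one KV head per query head
gqa_per_token = kv_cache_bytes(80, 8, 128, 2)    # 8 shared KV heads (GQA)
mha_context = kv_cache_bytes(80, 64, 128, 2, context)
gqa_context = kv_cache_bytes(80, 8, 128, 2, context)

print(f"MHA: {mha_per_token / MIB:.4f} MiB/token, {mha_context / GIB:.0f} GiB at 128K")
print(f"GQA: {gqa_per_token / MIB:.4f} MiB/token, {gqa_context / GIB:.0f} GiB at 128K")
```

These figures are per sequence; serving a batch multiplies them by the batch size, which is why KV-head sharing and cache compression matter so much for long-context inference.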

This is why innovations target the KV-Cache:
- Multi-Query Attention (MQA): one K/V head shared across all Q heads (used in PaLM and some Falcon models). Cuts the KV-Cache by a factor of num_heads.
- Grouped-Query Attention (GQA): a middle ground in which small groups of Q heads share one K/V head (used by Llama-3, Mistral). Usually a good quality/memory tradeoff.
- Multi-Head Latent Attention (MLA): DeepSeek's innovation. Compresses K/V into a low-rank latent that is decompressed on the fly; the reported KV-Cache reduction is over 90% (see 1.Foundation/2.LLM_Industry_Overview.md §1.2.4).
- TurboQuant / PolarQuant: compress what is stored in the cache (see 4.Quantization_TurboQuant.md).

5. Practical Notes

5.1. Implementing a Minimal Transformer Block (PyTorch)

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
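        # Pre-norm self-attention: normalize once and reuse as query, key, and value.
        h = self.norm1(x)
        # For a boolean attn_mask, True marks positions that may NOT be attended to.
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + self.dropout(a)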
        # Pre-norm FFN
        x = x + self.dropout(self.ffn(self.norm2(x)))
        return x

5.2. Common Hyperparameters

Model	Layers	d_model	n_heads	Total params
BERT-base	12	768	12	110M
GPT-2 XL	48	1600	25	1.5B
GPT-3	96	12288	96	175B
Llama-3-8B	32	4096	32	8B
Llama-3-70B	80	8192	64	70B
DeepSeek-V3	61	7168	128	671B (MoE; 37B active)
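The dense rows of this table can be roughly reproduced from layers and d_model alone. The sketch below uses the common approximation of 12 × d_model² weights per block (4 for attention plus 8 for a 4× FFN with two matrices) and optionally adds the token-embedding table; it ignores biases, norms, and position embeddings. The BERT vocabulary size of 30522 is an assumption taken from the standard uncased checkpoint. The estimate does not fit Llama-3 (GQA and a wider SwiGLU FFN) or DeepSeek-V3 (MoE).

```python
def approx_params(n_layers: int, d_model: int, vocab_size: int = 0) -> int:
    """Rough dense-Transformer size: 12 * d_model^2 per block plus embeddings."""
    per_block = 4 * d_model**2 + 8 * d_model**2  # attention + 4x two-matrix FFN
    return n_layers * per_block + vocab_size * d_model

gpt3 = approx_params(96, 12288)              # table says 175B
gpt2_xl = approx_params(48, 1600)            # table says 1.5B
bert_base = approx_params(12, 768, 30522)    # table says 110M

for name, n in [("GPT-3", gpt3), ("GPT-2 (1.5B)", gpt2_xl), ("BERT-base", bert_base)]:
    print(f"{name}: ~{n / 1e9:.2f}B parameters")
```

The estimates land within a few percent of the published totals, which makes the formula a handy sanity check when reading model cards.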

5.3. Common Pitfalls

Forgetting attention masks: causal LMs must mask future tokens during training. Forgetting this leaks information from the future and the model "cheats".
Wrong positional encoding for long context: a model trained on 4K context will degrade severely at 32K unless trained with position-extrapolatable encodings (RoPE with NTK scaling, ALiBi, etc.).
Training instability with post-norm: deep post-norm Transformers are notoriously hard to train without careful learning-rate warmup and initialization; pre-norm is the usual remedy.

6. Common Q & A

Q: Why is "Attention Is All You Need" such a big deal?
A: It showed you don't need recurrence or convolution to model sequences — pure attention is enough, and it parallelizes beautifully. This unlocked the scaling laws that produced today's LLMs.
Q: What's the difference between encoder-only and decoder-only?
A: Encoder-only sees the whole input bidirectionally — good for understanding tasks. Decoder-only is causal (only sees the past) — good for generation. Modern LLMs use decoder-only because next-token prediction at scale produces emergent reasoning.
Q: Why did Transformers replace RNNs/LSTMs for everything?
A: Three reasons: (1) parallelism in training, (2) explicit access to long-range dependencies via attention, (3) scaling laws — Transformer performance improves smoothly with size, while RNN gains plateau.
Q: Are Transformers the final architecture?
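A: Probably not. Mamba (state-space models), Hyena, and RWKV replace or supplement attention with operators whose cost grows sub-quadratically with sequence length. As of 2026, Transformers remain dominant, but hybrid architectures such as Jamba (Transformer + Mamba layers) are emerging. DeepSeek's MLA, by contrast, is a refinement within the Transformer: it changes how attention stores K/V rather than replacing attention.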