📅 2026-05-17 (created during knowledge-base reorganization) 👉 #AI #LLM #MoE #RLHF #DPO #Training 📎 Mixture of Experts: A Survey (2024) 📎 Switch Transformers: Scaling to Trillion-Parameter Models 📎 Mixtral 8x7B technical report 📎 InstructGPT / RLHF (Ouyang et al., 2022) 📎 Direct Preference Optimization (Rafailov et al., 2023)

This note covers two seemingly different topics — MoE (an architectural pattern) and RLHF/DPO (post-training techniques) — because together they explain how modern LLMs achieve both high capability per dollar (MoE) and alignment with human preferences (RLHF / DPO). Both are referenced throughout the Foundation, Application, and Technology notes; this is the canonical home.

Part A — Mixture of Experts (MoE)

1. Overview

1.1. Why MoE?

A standard "dense" Transformer activates every parameter for every token it processes. If you scale the model from 8B to 70B parameters, every token now costs roughly 9× the compute (70 / 8 ≈ 8.75). That's expensive.

Mixture of Experts (MoE) breaks this 1:1 link between parameter count and compute per token. The idea: instead of one giant FFN per Transformer block, have many smaller FFNs (the "experts") and a small router network that decides which k experts to activate for each token.

The result: a model with 100B+ total parameters that activates only ~10B per token. You get the knowledge capacity of the bigger model with the inference cost of the smaller one.
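The saving can be made concrete with a small back-of-the-envelope sketch. It assumes the common rule of thumb of about 2 forward-pass FLOPs per active parameter per token, which ignores the sequence-length-dependent cost of attention; the function name is illustrative, and the sizes are the 100B-total / 10B-active figures from the paragraph above.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per token: ~2 per active parameter.

    Assumption: a rule of thumb that ignores attention's
    sequence-length-dependent cost.
    """
    return 2.0 * active_params


# Illustrative sizes from the text: 100B total parameters, ~10B active.
dense_100b = forward_flops_per_token(100e9)  # dense: every parameter is active
moe_100b = forward_flops_per_token(10e9)     # MoE: only the routed experts run

compute_ratio = dense_100b / moe_100b
print(f"dense / MoE compute per token: {compute_ratio:.1f}x")
```

Under this approximation the two models hold the same number of parameters, but the MoE spends about a tenth of the compute on each token.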

1.2. Models that use MoE (2026)

Mixtral 8×7B / 8×22B (Mistral): 8 experts per layer, 2 active per token. 8×7B: ~47B total, ~13B active. 8×22B: ~141B total, ~39B active.
DeepSeek-V3 / R1: 256 fine-grained experts per layer, 8 active. ~671B total, ~37B active. The "fine-grained" approach is a DeepSeek innovation.
Llama-4 Scout / Maverick: 16 / 128 experts respectively, 17B active in both.
Qwen-MoE: Alibaba's MoE family.
Switch Transformer / GLaM: early Google MoE work; not a current production model but the academic foundation.

2. How MoE Works

2.1. The Architecture

In a standard Transformer block, each token goes through one FFN. In an MoE Transformer block, each token goes through:

Router (a small linear layer): produces a score for each of N experts.
Top-k selection: pick the k experts with the highest scores (typically k = 1 or k = 2).
Expert FFN forward: the token is routed to those experts, each of which is a regular FFN.
Weighted combination: outputs are combined, weighted by the router's scores (after softmax).

       ┌────────────────────────────────────────┐
       │              Router                    │
input ─┤  → softmax over N experts → top-k pick │
       └────────────────────────────────────────┘
                 │              │
                 ▼              ▼
             Expert 3       Expert 7
             (active)       (active)
                 │              │
                 ▼              ▼
             output_3 × w_3 + output_7 × w_7  →  output

The other parts of the Transformer block — self-attention, layer norm, residual connections — are unchanged. MoE replaces only the FFN sub-layer.
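The four routing steps can be sketched in plain Python for a single token. This is a toy illustration rather than any model's actual implementation: `moe_forward`, `toy_router`, and `toy_experts` are hypothetical names, the experts are simple scaling functions instead of real FFNs, and renormalizing the top-k gate weights so they sum to 1 is one common design choice among several.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(token, router_weights, experts, k=2):
    """Route one token through its top-k experts and mix their outputs."""
    # 1. Router: a linear layer producing one score per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    # 2. Top-k selection: indices of the k highest-scoring experts.
    top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the k gate weights sum to 1 (an assumed design choice).
    gate_total = sum(probs[i] for i in top_k)
    output = [0.0] * len(token)
    for i in top_k:
        gate = probs[i] / gate_total
        # 3. Expert forward pass, then 4. weighted combination.
        expert_out = experts[i](token)
        output = [o + gate * v for o, v in zip(output, expert_out)]
    return output, top_k


# Toy experts (hypothetical): expert i scales the token by a constant.
toy_experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
toy_router = [[0.1, 0.0], [0.0, 0.5], [2.0, 0.0], [0.0, 0.2]]

mixed, chosen = moe_forward([1.0, 2.0], toy_router, toy_experts, k=2)
print(chosen, mixed)
```

Only the two selected experts are ever called, which is where the compute saving comes from; the other experts' weights sit idle for this token.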

2.2. Top-k Variants

Variant	Active experts/token	Used by
Top-1	1	Switch Transformer
Top-2	2	Mixtral 8×7B, Mixtral 8×22B
Top-k=8 (fine-grained)	8 of 256	DeepSeek-V3 / R1

DeepSeek's fine-grained approach uses many small experts and activates more of them per token. Empirically this gives more flexible expert specialization than fewer large experts.

2.3. Shared Experts (DeepSeek-style)

DeepSeek-V3 also adds a small number of "shared experts" — experts that are always activated for every token. These cover broad common-sense knowledge so the routed experts can specialize without each one re-learning the basics.
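A minimal sketch of how shared and routed experts might be combined for one token. It assumes the shared experts' outputs are simply added to the gated routed outputs; the function name and the toy scaling experts are hypothetical, and the router that produces `gates` is left out.

```python
def moe_with_shared_experts(token, shared_experts, routed_experts, gates):
    """Combine always-on shared experts with gated routed experts.

    gates: {routed_expert_index: weight} for the experts the router picked.
    """
    output = [0.0] * len(token)
    # Shared experts run for every token, with no routing decision.
    for expert in shared_experts:
        output = [o + v for o, v in zip(output, expert(token))]
    # Routed experts contribute only if selected, scaled by their gate.
    for index, gate in gates.items():
        expert_out = routed_experts[index](token)
        output = [o + gate * v for o, v in zip(output, expert_out)]
    return output


# Toy experts (hypothetical): each one scales the token by a constant.
shared = [lambda x: [0.5 * v for v in x]]
routed = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0)]

result = moe_with_shared_experts([1.0, 2.0], shared, routed, {0: 0.25, 2: 0.75})
print(result)
```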

2.4. Load Balancing

A naive router tends to collapse: a few popular experts get most of the traffic and the rest are barely trained. To avoid this, MoE training traditionally adds an auxiliary load-balancing loss that penalizes uneven expert utilization. Some recent implementations (e.g., DeepSeek-V3) favor auxiliary-loss-free approaches that adjust per-expert router biases instead.
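One widely used form of the auxiliary loss is the Switch Transformer formulation, `alpha * N * sum_i(f_i * P_i)`, where `f_i` is the fraction of tokens dispatched to expert `i` and `P_i` is the mean router probability for expert `i`. The sketch below assumes top-1 routing and toy inputs; the function and variable names are illustrative.

```python
def load_balancing_loss(router_probs, assignments, num_experts, alpha=0.01):
    """Switch-style auxiliary loss: alpha * N * sum_i(f_i * P_i).

    router_probs: per-token list of N router probabilities.
    assignments:  per-token index of the expert actually chosen (top-1).
    """
    num_tokens = len(assignments)
    # f_i: fraction of tokens actually dispatched to expert i.
    dispatch_fraction = [0.0] * num_experts
    for expert in assignments:
        dispatch_fraction[expert] += 1.0 / num_tokens
    # P_i: mean router probability assigned to expert i.
    mean_prob = [
        sum(tok[i] for tok in router_probs) / num_tokens
        for i in range(num_experts)
    ]
    weighted = sum(f * p for f, p in zip(dispatch_fraction, mean_prob))
    return alpha * num_experts * weighted


# Toy batch of 4 tokens and 2 experts; alpha=1.0 to make the numbers readable.
balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], 2, alpha=1.0)
collapsed = load_balancing_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], 2, alpha=1.0)
print(balanced, collapsed)
```

The loss is smallest when traffic is spread evenly and grows as routing concentrates on one expert, so adding it to the training objective pushes the router toward balanced utilization.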

3. The Engineering Reality

MoE is not free. Pain points include:

Communication: experts may be sharded across GPUs; routing tokens to the right device adds all-to-all communication overhead. This is often the dominant cost in distributed MoE training (see MegaBlocks, DeepSpeed-MoE).
Memory: even though you activate few experts per token, all experts must be stored in memory. A 671B-parameter MoE still needs 671B parameters' worth of VRAM for inference.
Expert collapse: without load balancing, you end up with a few overworked experts and many idle ones.
Routing instability: the router's decision changes during training, and this can cause gradient instability. Various tricks (jitter, noise) help.
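The memory point is easy to quantify. The sketch below counts weights only (no KV cache, activations, or framework overhead) and assumes 2 bytes per parameter for BF16/FP16 and 1 byte for 8-bit formats; the helper name is illustrative.

```python
def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


TOTAL_PARAMS = 671e9   # all experts must be resident, active or not
ACTIVE_PARAMS = 37e9   # what a single token actually uses

bf16_total = weight_memory_gb(TOTAL_PARAMS, 2)    # 16-bit weights
int8_total = weight_memory_gb(TOTAL_PARAMS, 1)    # 8-bit weights
bf16_active = weight_memory_gb(ACTIVE_PARAMS, 2)  # for comparison only

print(f"all weights, 16-bit: {bf16_total:.0f} GB")
print(f"all weights,  8-bit: {int8_total:.0f} GB")
print(f"active weights only, 16-bit: {bf16_active:.0f} GB")
```

The gap between the first and last figures is the price of MoE: compute scales with the active parameters, but the hardware has to hold all of them.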

Despite these costs, MoE is the dominant scaling approach in 2026, and it is a major reason DeepSeek can approach GPT-4-class capability at a fraction of the inference cost.

4. MoE vs. Dense — Decision Reference

Dimension	Dense	MoE
Active parameters	All active per token	Few active per token (roughly 5-30% of total)
Inference cost	Linear in size	Linear in active params, not total
VRAM (inference)	Linear in size	Linear in total params (all experts loaded)
Training complexity	Simpler	Routing + load-balancing complications
Quality per active param	Baseline	Usually better, since experts can specialize
Best-suited workloads	Latency-sensitive single-stream	High-throughput batch serving

Part B — RLHF, DPO, and the Alignment Pipeline

5. Why Pre-training Isn't Enough

A pre-trained LLM has learned to predict the next token from internet-scale data. It is not yet a useful assistant. Out of the box it will:

- Continue your prompt instead of answering it.
- Imitate the worst of its training data (offensive content, hallucinations, refusal to follow instructions).
- Produce completions that are technically plausible but not helpful, harmless, or honest.

Turning a raw next-token predictor into Claude / GPT-4 / Llama-3-Instruct requires post-training:

SFT (Supervised Fine-Tuning): train on (instruction, ideal response) pairs to teach the model to follow instructions.
Preference optimization (RLHF or DPO): train on human comparisons to align with what humans actually prefer.

These two steps are why a freshly downloaded base model from Hugging Face behaves so differently from its -instruct or -chat counterpart.

6. RLHF (Reinforcement Learning from Human Feedback)

6.1. The Three-Stage Pipeline (InstructGPT, 2022)

SFT: collect on the order of ten thousand (prompt, ideal-response) examples written by humans. Fine-tune the base model on these. The result: a model that follows instructions but isn't yet preference-aligned.
Reward Model (RM): collect comparisons. Show humans two model outputs for the same prompt; ask which is better. Train a separate model (typically the same architecture as the SFT model, with a scalar head) to predict "which response a human would prefer". The RM's output is a reward signal.
PPO (Proximal Policy Optimization): treat the SFT model as a policy. For each prompt, sample a response, score it with the RM, and use PPO (a reinforcement-learning algorithm) to push the policy toward higher-reward responses while staying close to the SFT model (so it doesn't catastrophically forget how to speak coherently).
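The reward-model stage is typically trained with a pairwise (Bradley-Terry style) loss, `-log sigmoid(r_chosen - r_rejected)`, as in InstructGPT. A minimal sketch on scalar rewards follows; in practice the rewards come from the RM's scalar head and the loss is averaged over a batch, and the function names here are illustrative.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: small when the chosen response scores higher."""
    return -log_sigmoid(reward_chosen - reward_rejected)


undecided = reward_model_loss(0.0, 0.0)  # RM cannot tell the two apart
confident = reward_model_loss(2.0, 0.0)  # RM agrees with the human label
wrong = reward_model_loss(0.0, 2.0)      # RM disagrees with the human label
print(undecided, confident, wrong)
```

Only the difference between the two rewards matters, so the RM learns a relative ranking rather than an absolute score.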

flowchart LR
  Pretrain[Base model<br>~1T tokens] --> SFT[SFT<br>~10K demonstrations]
  SFT --> Policy[Policy model]
  SFT --> RM[Reward model<br>trained on ~100K comparisons]
  Policy --> PPO{PPO loop}
  RM --> PPO
  PPO --> Aligned[Aligned model]

6.2. The KL Penalty

PPO doesn't optimize reward alone; it optimizes: $$\text{reward} - \beta \cdot \text{KL}(\pi_{\text{policy}} \,\|\, \pi_{\text{SFT}})$$

The KL penalty keeps the aligned model close to the SFT distribution, preventing reward hacking (the policy finding ways to score high on the RM that humans wouldn't actually like).
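In practice the KL term is usually estimated from the sampled response itself: the sum over tokens of `log pi_policy - log pi_SFT`. The sketch below assumes that single-sample estimate and applies the penalty once per response; real implementations often apply it per token. All names and numbers are illustrative.

```python
def kl_penalized_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """Reward minus beta times a single-sample KL estimate.

    logp_policy / logp_sft: per-token log-probabilities of the sampled
    response under the current policy and the frozen SFT model.
    """
    kl_estimate = sum(lp - ls for lp, ls in zip(logp_policy, logp_sft))
    return rm_score - beta * kl_estimate


# Toy two-token response: the policy is more confident than the SFT model.
shaped = kl_penalized_reward(
    rm_score=1.0,
    logp_policy=[-1.0, -0.5],
    logp_sft=[-1.5, -1.0],
    beta=0.1,
)
unchanged = kl_penalized_reward(1.0, [-1.0, -0.5], [-1.0, -0.5], beta=0.1)
print(shaped, unchanged)
```

A response the SFT model would also have produced keeps its full reward; the further the policy drifts, the more reward it has to earn to break even.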

6.3. Why RLHF is Hard

Reward hacking: the RM is imperfect; the policy will exploit its weaknesses.
Compute-intensive: PPO requires multiple model copies in memory (policy, reference, RM, value).
Hyperparameter-sensitive: KL coefficient, learning rate, batch size — all need tuning.
Comparison data is expensive: you need many human-annotated comparisons.

This is why, until DPO came along, RLHF at scale was run almost exclusively by big labs (OpenAI, Anthropic, Meta).

7. DPO (Direct Preference Optimization)

7.1. The Insight

DPO (Rafailov et al., 2023) showed that you can skip the explicit reward model and the PPO loop entirely. A clever derivation reformulates RLHF as a supervised learning problem on preference pairs.

7.2. The Loss

Given preference data (prompt, chosen_response, rejected_response), DPO directly minimizes:

$$\mathcal{L}_{\text{DPO}} = -\log \sigma\!\left( \beta \log \frac{\pi_\theta(\text{chosen})}{\pi_{\text{ref}}(\text{chosen})} - \beta \log \frac{\pi_\theta(\text{rejected})}{\pi_{\text{ref}}(\text{rejected})} \right)$$

In plain English: increase the probability of the chosen response relative to the reference model (π_ref, typically the frozen SFT model); decrease the probability of the rejected response relative to that same reference. No reward model. No RL.
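The loss for a single preference pair can be sketched in a few lines. It assumes each argument is the summed log-probability of a whole response under the policy or the frozen reference model; a real trainer would compute these from model logits and average over a batch. The function names and the toy numbers are illustrative.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # log(pi_theta / pi_ref) for each response.
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return -log_sigmoid(margin)


# Policy identical to the reference: margin is 0, loss is log(2).
at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has raised the chosen response and lowered the rejected one.
after_update = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(at_start, after_update)
```

Because the reference log-probabilities are constants, they can be precomputed once per dataset, which is part of why DPO needs fewer live model copies than PPO.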

7.3. Why DPO Won

Simplicity: ~50 lines of training code vs. PPO's hundreds.
Stability: behaves like supervised learning — no PPO instabilities.
Memory: only two model copies in memory (policy + reference) instead of four.
Quality: matches RLHF on most benchmarks.

DPO has become a default alignment method for open-source models (Llama-3, Mistral, Qwen), often combined with other methods, and increasingly for closed-source ones too.

8. GRPO (Group Relative Policy Optimization)

GRPO was introduced by DeepSeek in 2024 (in the DeepSeekMath work) and became widely used in 2025-2026.

8.1. The Setup

For each prompt, generate G responses (a "group"; e.g., 8 or 16).
Score each response (with a reward model, an automated judge, or a verifier — for math/code, by running unit tests).
Compute relative advantages within the group (each response's reward minus the group mean, usually divided by the group standard deviation).
Apply policy-gradient updates using these advantages.
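The advantage step can be sketched as follows. It covers only the group-relative advantage computation, not the policy-gradient update, and assumes rewards are plain floats (for example 1.0 for a passing unit test, 0.0 otherwise). Dividing by the group standard deviation follows the commonly described GRPO formulation; the `normalize` flag and `eps` guard are illustrative additions.

```python
import statistics


def group_relative_advantages(rewards, normalize=True, eps=1e-8):
    """Advantage of each response relative to its own group.

    rewards: one scalar reward per sampled response for the same prompt.
    """
    mean = statistics.fmean(rewards)
    centered = [r - mean for r in rewards]
    if not normalize:
        return centered
    # eps guards against division by zero when all rewards are equal.
    std = statistics.pstdev(rewards)
    return [c / (std + eps) for c in centered]


# Toy group of 4 responses scored by a verifier: two pass, two fail.
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
centered_only = group_relative_advantages(group_rewards, normalize=False)
print(advantages, centered_only)
```

The group mean plays the role that the learned value model plays in PPO, which is why no critic is needed. If every response in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.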

8.2. Why It Works

No critic / value model needed: a major simplification over PPO.
Excellent for verifiable tasks: math problems, code, logic puzzles where correctness can be checked automatically.
DeepSeek-R1's secret sauce: GRPO is what enabled DeepSeek-R1's chain-of-thought reasoning — the model learns to self-correct because longer thinking traces that produce correct answers get reinforced.

9. The Modern Alignment Recipe (2026)

A typical post-training pipeline for a state-of-the-art LLM in 2026:

Base pre-training — trillions of tokens of next-token prediction.
SFT — instruction-tuning on a curated mix of demonstrations.
DPO — preference-tune on human comparisons.
GRPO — for reasoning-heavy domains (math, code), reinforcement-learning from verifiable rewards.
Safety hardening — refuse harmful requests; RLHF or DPO with safety-specific data.
Constitutional AI (Anthropic) — use a model trained to follow a constitution to critique its own outputs and self-improve.

10. Common Q & A

Q: Are MoE models always better?
A: Better parameter efficiency — yes. But they're harder to train, take more total VRAM, and have higher communication overhead. For low-batch latency-critical apps, dense models are often preferable. For high-throughput serving, MoE wins.
Q: Can I fine-tune an MoE model with LoRA?
A: Yes, but there are subtleties — you may want to LoRA-tune only the routers, only specific experts, or all experts. Tooling support has matured rapidly in 2025-2026.
Q: Will RLHF be replaced?
A: For most use cases, DPO has already replaced it. RLHF (PPO) survives in scenarios where you genuinely need an explicit reward model — e.g., ongoing RL from production user feedback. GRPO is replacing both for reasoning tasks.
Q: What if I don't have human comparisons but I want to align?
A: Use AI feedback — have a strong model (GPT-4, Claude) label preferences. This is "RLAIF" or "DPO from AI feedback". Quality can match human-labeled data when the labeler model is much stronger than the model being trained.
Q: Is base-model SFT enough by itself?
A: SFT alone is decent for instruction-following but tends to produce verbose, hedged, or mode-collapsed responses. The preference step (DPO/RLHF) is what makes the model concise, decisive, and pleasant to interact with.
Q: How much data does each step need?
A: Rough rules of thumb: SFT works with 1K-100K examples; DPO works with 5K-100K preference pairs; GRPO can work on as few as a few thousand verifiable problems but benefits from more.