4. Fine-Tuning

👉 #AI #LLM #ML #Coding

I. Fine-Tuning — Adapting Large Language Models

📅 2026-04-28 Tuesday PST; Claude Opus 4.6 📎 LLM Fine-Tuning Complete Guide 2026 📎 LoRA and QLoRA: Fine-Tuning on Consumer GPUs 📎 Fine-Tuning Techniques: LoRA, DPO, GRPO

1. Overview

1.1. Definition & Why

Fine-Tuning: starting from a pretrained large model, continue training on domain-specific data so the model adapts to a specific task, style, or domain knowledge.
Key distinction: Pre-Training is "learning language from scratch"; Fine-Tuning is "specializing on top of existing capability".
Design intent: general models (GPT-4, Claude) are jacks of all trades but masters of none; Fine-Tuning makes the model an expert in a specific area.
Pain points solved:
Style adaptation: make the output match brand voice, format conventions, terminology
Instruction following: improve adherence to complex instructions (e.g., "always answer in JSON")
Domain knowledge: internalize medical / legal / financial expertise into the model weights
Cost optimization: on a narrow, well-defined task, a fine-tuned small model can replace a big one, cutting inference cost (e.g., a fine-tuned Llama 8B standing in for GPT-4o)
Latency optimization: small models infer faster — good for real-time scenarios
2026 key shift: PEFT (Parameter-Efficient Fine-Tuning) brought fine-tuning down from "needs 8 × A100" to "single consumer-grade GPU is enough"; cost dropped from tens of thousands of dollars to $10-100.

1.2. Features & Use Cases

Good for fine-tuning:
Output format is fixed: always emit a specific JSON schema or report template
Domain-term-heavy: medical diagnosis, legal clauses, financial analysis
Style consistency: brand voice, customer-service phrasing, technical-doc style
Small model replacing large: a fine-tuned 7B/8B reaching 70B-level performance on a specific task
Multilingual adaptation: better performance on a specific language (e.g., Chinese, Japanese)
Not good for fine-tuning:
Knowledge changes frequently → use RAG
Need real-time data → use Function Calling
One-off task → use Prompt Engineering
Insufficient data (< 100 high-quality samples) → use Few-Shot Prompting

1.3. Competitors

Fine-Tuning's position among LLM knowledge-augmentation approaches:

Dimension	Fine-Tuning	RAG (retrieval-augmented)	Prompt Engineering	Long-Context
Knowledge source	Internalized in weights	Real-time external retrieval	Written in the prompt	Stuffed into context
Update speed	Slow (re-train)	Fast (minutes)	Instant	Instant
Cost	Medium (one-time train)	Low (API calls)	Lowest	High (token cost)
Data scale	100-100K samples	Unlimited	< 20 examples	< 2M tokens
Core value	Behavior/style/format adaptation	Real-time knowledge	Quick experiments	Single-shot deep analysis

Decision flow:
Try Prompt Engineering first (zero cost, immediate)
Not enough? → add RAG (knowledge augmentation)
Still need style/format adaptation? → Fine-Tuning
Production: RAG + Fine-Tuning combined (best practice)

2. Concept, Component, & Architecture

2.1. Key Concepts

(1) Full Fine-Tuning

Update all model parameters (billions of weights).
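Usually the highest quality ceiling, but the most expensive: weights, gradients, and optimizer states for every parameter must fit in memory, which typically means a multi-GPU node (e.g., 8 × A100 80GB).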
In 2026, mostly used by large labs and research; individuals and small teams use PEFT.

(2) PEFT (Parameter-Efficient Fine-Tuning)

Core idea: freeze 99% of the original model's parameters, train only a small set of new parameters.
Benefits: VRAM down 10-100×, faster training, original model capability preserved.
In 2026, PEFT is the default for fine-tuning.

(3) LoRA (Low-Rank Adaptation)

The most mainstream PEFT method — practically synonymous with fine-tuning in 2026.
Mechanism: add two small low-rank matrices (A and B) in parallel with selected weight matrices, typically the attention projections; only A and B are trained.
Effective weight: W' = W + ΔW = W + B × A, with the original W frozen
B: d × r matrix; A: r × d matrix, so B × A has the same d × d shape as W (r is the rank, usually 8-64, far smaller than d)
Trainable parameter count: 0.1%-1% of original
Key hyperparameters:
r (rank): larger means more expressive but more parameters; usually 8-32
alpha: scaling factor, applied to the update as alpha / r; usually set to 2 × r
target_modules: layers to apply LoRA to; the minimal choice is q_proj, v_proj (Attention's Query and Value), while many recipes target all attention and MLP projections
Advantage: LoRA weights are "hot-swappable" — the same base model can load different LoRA adapters.
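
The mechanism above can be sketched in a few lines of NumPy. This is an illustration only, not how any training library implements it: the function names and the toy dimensions are assumptions, and the zero initialization of B follows the common convention that the adapter starts as a no-op.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r):
    """Compute y = W x + (alpha / r) * B (A x) for one input vector x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

def lora_param_fraction(d_out, d_in, r):
    """Trainable LoRA parameters as a fraction of the frozen matrix's parameters."""
    return r * (d_in + d_out) / (d_out * d_in)

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16                # toy sizes, chosen only for the demo
W = rng.normal(size=(d, d))            # frozen pretrained weight (d x d)
A = rng.normal(size=(r, d)) * 0.01     # trainable, small random init (r x d)
B = np.zeros((d, r))                   # trainable, zero init (d x r)
x = rng.normal(size=d)

y = lora_forward(x, W, A, B, alpha, r)
print(np.allclose(y, W @ x))               # True: with B = 0 the adapter changes nothing
print(lora_param_fraction(4096, 4096, 16)) # 0.0078125, i.e. under 1% for this one matrix
```

Because the adapter is just the pair (A, B), swapping adapters means swapping two small matrices while W stays loaded.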

(4) QLoRA (Quantized LoRA)

On top of LoRA, quantize the base model to 4-bit (NF4 format), further reducing VRAM.
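Effect: a 65-70B model can be fine-tuned on a single 48GB GPU, and models up to roughly 30B on a single 24GB GPU (e.g., RTX 4090); at 4 bits, 70B weights alone take about 35 GB, so they do not fit in 24 GB.
Accuracy loss: usually small (around 1-2% on typical benchmarks), acceptable for most tasks.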
The default for consumer-grade GPU fine-tuning in 2026.
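
A back-of-the-envelope check of which models fit on which GPU. This is a rough sketch that counts the weights only; the helper name is hypothetical, and real usage is higher because of quantization constants, LoRA parameters, optimizer state, and activations.

```python
def weight_memory_gb(n_params_billion, bits_per_param):
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * bits_per_param / 8

for size in (8, 70):
    fp16 = weight_memory_gb(size, 16)
    nf4 = weight_memory_gb(size, 4)
    print(f"{size}B: {fp16:.0f} GB at 16-bit, {nf4:.0f} GB at 4-bit")
# 8B: 16 GB at 16-bit, 4 GB at 4-bit
# 70B: 140 GB at 16-bit, 35 GB at 4-bit
```

The 4-bit figure is the floor, so a comfortable margin above it is needed for sequence-length-dependent activation memory.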

(5) SFT (Supervised Fine-Tuning)

The most basic kind of fine-tuning: train the model with (input, expected-output) pairs.
Data formats: instruction format (instruction, input, output) or chat format (messages).
Use: teach the model "how to answer" — format, style, task-specific behavior.

(6) RLHF (Reinforcement Learning from Human Feedback)

After SFT, use human preference data to further align model behavior.
Flow: SFT → train a Reward Model → PPO reinforcement learning to optimize.
Use: make the model more helpful, harmless, and honest.
Expensive — usually only model providers (OpenAI, Anthropic) use it.

(7) DPO (Direct Preference Optimization)

A simpler alternative to RLHF: no separate Reward Model — optimize directly from preference data.
Data format: (prompt, chosen_response, rejected_response) triples.
Advantages: simple to implement, stable training, results close to RLHF.
The mainstream alignment-training method by 2026.
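
To make "optimize directly from preference data" concrete, here is a minimal sketch of the DPO loss for a single (prompt, chosen, rejected) triple. It assumes the summed log-probabilities of each response under the policy and under a frozen reference model have already been computed; the function name and the default beta = 0.1 are illustrative choices.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)), where margin compares how much the policy
    favors the chosen response over the rejected one, relative to the reference."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    z = beta * margin
    # Numerically stable softplus(-z) = -log(sigmoid(z))
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: no preference learned yet, loss = log 2
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy has shifted probability toward the chosen response: loss is lower
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The reference log-probabilities act as an anchor: the loss rewards widening the chosen/rejected gap relative to the starting model rather than in absolute terms.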

(8) GRPO (Group Relative Policy Optimization)

A new method proposed by DeepSeek; rapidly popular in 2025-2026.
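Core: no separate value (critic) model as in PPO; for each prompt, sample a group of responses, score them (with a reward model or a rule-based checker), and use each response's reward relative to the rest of the group as its advantage.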
Advantages: better suited to reasoning tasks (math, code) than DPO; higher training efficiency.
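
The group-relative step can be sketched as follows. This is a simplified illustration, assuming one scalar reward per sampled response and normalization by the group's mean and population standard deviation; the function name and the epsilon guard are assumptions, not a library API.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one math prompt, scored 1 if correct and 0 otherwise
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

When every response in a group gets the same reward, all advantages are zero, so that prompt contributes no learning signal.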

2.2. Core Components

(1) Training Data

Quality > quantity: 100 carefully curated samples can beat 10K noisy ones.
Format: usually JSONL — one training sample per line.
SFT data example:

{"messages": [
  {"role": "system", "content": "You are a professional legal advisor"},
  {"role": "user", "content": "What is a force majeure clause?"},
  {"role": "assistant", "content": "A force majeure clause is..."}
]}

DPO data example:

{
  "prompt": "Explain quantum computing",
  "chosen": "Quantum computing leverages quantum-mechanics principles... (accurate, detailed)",
  "rejected": "Quantum computing is just a faster computer... (inaccurate)"
}
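
Since format errors in training data are a common source of failed runs, it can help to validate each JSONL line before training. Below is a minimal sketch for the chat-format SFT example above; the function name, the allowed-role set, and the specific checks are assumptions chosen for illustration.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_sft_line(line):
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value must be an object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: must be an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: content must be a non-empty string")
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message should be the assistant's target reply")
    return problems

good = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}'
bad = '{"messages": [{"role": "user", "content": "Hi"}]}'
print(validate_sft_line(good))  # []
print(validate_sft_line(bad))   # ["last message should be the assistant's target reply"]
```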

(2) Base Model Selection

Open-source picks (2026):
Llama 3.3 70B / Llama 3.1 8B: from Meta; largest community, most mature toolchain (Llama 3.3 ships only as 70B; the 8B model is Llama 3.1)
Mistral Large / Small: from Europe; excellent multilingual
Qwen 2.5: from Alibaba; strong on Chinese
DeepSeek V3: outstanding reasoning, great cost/performance
Closed-source fine-tuning:
OpenAI Fine-Tuning API: GPT-4o-mini / GPT-4o
Google Vertex AI: Gemini fine-tuning
Amazon Bedrock: many models supported

(3) Training Frameworks

Framework	Notes	Best fit
Hugging Face TRL	Most popular; supports SFT/DPO/GRPO	General fine-tuning
Unsloth	2× faster, 60% VRAM savings	Consumer-grade GPUs
Axolotl	YAML-config-driven, code-free	Quick experiments
LLaMA-Factory	Active Chinese community, GUI	Chinese scenarios
OpenAI API	Fully managed, no GPU needed	Closed-source model fine-tuning

(4) Evaluation Metrics

General: Perplexity, loss curves
Task-specific: Accuracy, F1, BLEU, ROUGE
Human evaluation: blind A/B (before vs. after fine-tuning)
Benchmarks: MMLU, HumanEval, MT-Bench
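
As a worked example of the first metric: perplexity is the exponential of the average negative log-likelihood per token. The sketch below assumes the per-token natural-log probabilities have already been extracted from the model; the function name is illustrative.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-probability over the evaluated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigns probability 0.25 to every token has perplexity of about 4:
# it is as uncertain as a uniform choice among four options.
print(perplexity([math.log(0.25)] * 4))
```

Lower is better, and values are only comparable between models that share a tokenizer and evaluation set.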

2.3. Architecture & Design

(1) Fine-Tuning Workflow Overview

flowchart LR
  A[Data prep] --> B[Cleaning + formatting]
  B --> C[Base-model selection]
  C --> D{Method}
  D -->|Behavior adaptation| E[SFT + LoRA/QLoRA]
  D -->|Preference alignment| F[DPO / GRPO]
  D -->|Comprehensive customization| G[SFT → DPO two-stage]

  E & F & G --> H[Training + monitoring]
  H --> I[Evaluation + testing]
  I --> J{Meets target?}
  J -->|No| K[Adjust hyperparams / data]
  K --> H
  J -->|Yes| L[Deployment]
  L --> M[Inference service: vLLM / TGI]

(2) LoRA Architecture Principle

flowchart LR
  X[Input x] --> W[Original weight W, frozen]
  X --> A[LoRA matrix A, r×d]
  A --> B[LoRA matrix B, d×r]
  W --> ADD((+))
  B --> ADD
  ADD --> Y[Output y = Wx + BAx]

  style W fill:#ccc,stroke:#999
  style A fill:#4CAF50,stroke:#388E3C,color:#fff
  style B fill:#4CAF50,stroke:#388E3C,color:#fff

Gray: frozen original weights (billions of params)
Green: trainable LoRA matrices (millions of params, 0.1%-1% of original)

2.4. Eco-system

Model hosting: Hugging Face Hub (open-source distribution), Ollama (local inference)
Training platforms: AWS SageMaker, Google Vertex AI, Lambda Cloud, RunPod, Vast.ai
Inference deployment: vLLM (high throughput), TGI (Hugging Face), Ollama (local), Together AI (API)
Data labeling: Argilla, Label Studio, Scale AI
Collaboration with RAG:
RAG provides real-time knowledge (external retrieval)
Fine-Tuning provides behavioral adaptation (internalized in weights)
Best combo: fine-tuned model + RAG pipeline = both domain-aware and real-time-knowledgeable

3. Install, Configure, Secure, & Cheatsheets

3.1. QLoRA Hands-On (Unsloth + Llama)

(1) Environment install

# Recommended: Unsloth (2× faster, 60% VRAM savings)
pip install unsloth
pip install trl datasets  # Hugging Face training libs

(2) Complete QLoRA training script

from unsloth import FastLanguageModel, is_bfloat16_supported  # import unsloth before trl/transformers
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Step 1: load base model (4-bit quantized)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",  # the 8B Instruct model is Llama 3.1; Llama 3.3 is 70B only
    max_seq_length=2048,
    load_in_4bit=True,  # QLoRA: 4-bit quantization
)

# Step 2: add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                    # LoRA rank
    lora_alpha=32,           # scaling factor (usually 2*r)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# Step 3: load data (one {"messages": [...]} object per line)
dataset = load_dataset("json", data_files="train_data.jsonl", split="train")

# Step 4: train
# Note: argument names follow older TRL releases. Newer releases expect
# processing_class= instead of tokenizer= and SFTConfig instead of TrainingArguments.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="./output",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,   # effective batch size = 4 * 4 = 16
        num_train_epochs=3,
        learning_rate=2e-4,
        warmup_steps=10,
        logging_steps=10,
        save_strategy="epoch",
        fp16=not is_bfloat16_supported(),  # use bf16 where the GPU supports it, fp16 otherwise
        bf16=is_bfloat16_supported(),
    ),
)
trainer.train()

# Step 5: save LoRA weights (adapter only, small)
model.save_pretrained("./lora_adapter")
tokenizer.save_pretrained("./lora_adapter")
# Or merge the adapter into the base weights and save the full model
model.save_pretrained_merged("./merged_model", tokenizer, save_method="merged_16bit")

3.2. OpenAI Fine-Tuning API (fully managed)

import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: upload training data
with open("train_data.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")

# Step 2: create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",  # base model
    hyperparameters={
        "n_epochs": 3,
        "batch_size": "auto",
        "learning_rate_multiplier": "auto",
    },
)

# Step 3: poll until the job reaches a terminal state
# status: validating_files → queued → running → succeeded / failed / cancelled
while job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(60)
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(job.status)

# Step 4: use fine-tuned model (fine_tuned_model stays None until the job succeeds)
if job.status == "succeeded":
    response = client.chat.completions.create(
        model=job.fine_tuned_model,  # ft:gpt-4o-mini:org:custom:id
        messages=[{"role": "user", "content": "your question"}],
    )
    print(response.choices[0].message.content)

3.3. Hyperparameter Quick Reference

Hyperparameter	Recommended	Notes
LoRA r	8-32	Rank; 16 is a common starting point
LoRA alpha	2 × r	Scaling factor
Learning rate	1e-4 - 3e-4	LoRA/QLoRA rates are usually higher than for full fine-tuning, which typically sits around 1e-5
Batch size	4-8 (per GPU)	Lower if VRAM is tight; compensate with gradient accumulation
Epochs	1-5	<1K samples → 3-5; >10K samples → 1-2
Warmup	5-10% total steps	Prevents over-large LR at the start
Max seq length	512-2048	Set per data; longer means more VRAM
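
The batch-size, epoch, and warmup rows interact, so it can be useful to compute the resulting step counts before launching a run. A small sketch follows; the function name and returned keys are hypothetical, and training frameworks may round step counts slightly differently.

```python
import math

def training_schedule(n_samples, per_device_batch, grad_accum, epochs,
                      n_gpus=1, warmup_ratio=0.05):
    """Approximate optimizer-step counts implied by the chosen hyperparameters."""
    effective_batch = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }

# 1K samples, batch 4 with 4 accumulation steps, 3 epochs, 5% warmup
print(training_schedule(1000, 4, 4, 3))
# {'effective_batch': 16, 'steps_per_epoch': 63, 'total_steps': 189, 'warmup_steps': 10}
```

With only about 190 optimizer steps in total, a 1K-sample run gives each step a noticeable share of the training, which is one reason small datasets are sensitive to the learning rate.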

3.4. Security Best Practices

Data security:
Training data should not contain PII (or must be de-identified)
When fine-tuning via closed-source APIs, confirm the data-usage terms (e.g., whether uploaded data may be used to train the provider's base models)
Prefer local fine-tuning: sensitive data stays in-house
Model security:
Fine-tuning may break the model's safety alignment — include safety samples in training data
Evaluate the fine-tuned model's refusal capability (does it still refuse harmful requests?)
Don't publish fine-tuned models on public platforms before they have passed a safety evaluation
Cost control:
Validate the pipeline on a small dataset (100 samples) before scaling up
Use QLoRA over full fine-tuning to save 10×+ cost
Monitor loss curves and stop early on overfitting
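
"Stop early on overfitting" can be made mechanical with a patience rule on the evaluation loss. The following is a minimal stand-alone sketch with a hypothetical function name; in a Hugging Face Trainer setup, the built-in EarlyStoppingCallback serves the same purpose.

```python
def should_stop(eval_losses, patience=2, min_delta=0.0):
    """True when none of the last `patience` evaluations beat the earlier best loss."""
    if len(eval_losses) <= patience:
        return False
    best_before = min(eval_losses[:-patience])
    return all(loss > best_before - min_delta for loss in eval_losses[-patience:])

print(should_stop([1.00, 0.80, 0.70, 0.72, 0.75]))  # True: loss bottomed out at 0.70, then rose
print(should_stop([1.00, 0.80, 0.70, 0.65]))        # False: still improving
```

The checkpoint to keep is the one with the lowest evaluation loss, not the last one written before stopping.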

4. Bootcamp & Workshops

4.1. Official & Classic Tutorials

Resource	Link	Goal
Hugging Face TRL docs	huggingface.co/docs/trl	SFT/DPO/GRPO end-to-end
Unsloth Wiki	github.com/unslothai/unsloth	Consumer-GPU fine-tuning acceleration
OpenAI Fine-Tuning Guide	platform.openai.com	Closed-source model fine-tuning
DeepLearning.AI - Finetuning LLMs	deeplearning.ai	Practical short course (taught by Sharon Zhou)
LLaMA-Factory	github.com/hiyouga/LLaMA-Factory	GUI, Chinese-friendly
Axolotl	github.com/axolotl-ai-cloud/axolotl	YAML-driven fine-tuning

4.2. Trouble Shooting

Symptom	Root Cause	Solution
OOM (Out of Memory)	Insufficient VRAM	Lower batch_size; use QLoRA 4-bit; enable gradient checkpointing
Loss not decreasing	Learning rate too low or bad data	Raise learning rate; check data format
Loss drops then rises	Overfitting	Reduce epochs; add more data; add dropout
Fine-tuned model gets dumber	Catastrophic forgetting	Lower learning rate; reduce epochs; mix in general-purpose data
Output format unstable	Inconsistent training-data format	Standardize format across all samples; add format-related samples
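Fine-tuned model refuses normal requests	Over-refusal: refusal samples over-represented or too broad in the training data	Rebalance safety samples with benign look-alike prompts; track the false-refusal rate alongside refusal of harmful requests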

4.3. Common Q & A

Q: How much training data do I need?
A: 50-100 high-quality samples can already show effects. Format/style adaptation: 100-500 is enough; domain knowledge injection: 1K-10K. Quality always matters more than quantity.
Q: LoRA or QLoRA?
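A: For an 8B-class model, if you have ≥ 48GB VRAM (e.g., A100 80GB), use LoRA (no quantization error, so slightly more accurate). If ≤ 24GB (e.g., RTX 4090), use QLoRA (typically 60%+ VRAM savings for roughly 1-2% accuracy loss). Larger models shift these thresholds upward.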
Q: Can the fine-tuned model be used commercially?
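A: Depends on the base model's license, so check the exact model card. Llama 3.3 (Llama 3.3 Community License: commercial use allowed, with conditions such as a separate license above 700 million monthly active users), Mistral (varies by model: some Small releases are Apache 2.0, while Mistral Large uses a research license that requires a separate commercial agreement), Qwen 2.5 (most sizes Apache 2.0; a few sizes use Qwen's own license). Models fine-tuned via closed-source APIs are usually commercial-OK but you can't export weights.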
Q: Fine-Tuning or RAG first?
A: Do RAG first (low cost, fast results). If RAG solves the knowledge issue but format/style/instruction adherence still isn't right, then add Fine-Tuning.
Q: How long and how much does one fine-tune cost?
A: 8B model + 1K data + QLoRA: about 1-2 hours on a single RTX 4090, < $1 in electricity. Cloud (RunPod / Lambda): $5-20. OpenAI API fine-tuning GPT-4o-mini: $10-50 (token-billed).