4. Fine-Tuning
👉 #AI #LLM #ML #Coding
I. Fine-Tuning — Adapting Large Language Models
📅 2026-04-28 Tuesday PST; Claude Opus 4.6
📎 LLM Fine-Tuning Complete Guide 2026
📎 LoRA and QLoRA: Fine-Tuning on Consumer GPUs
📎 Fine-Tuning Techniques: LoRA, DPO, GRPO
1. Overview
1.1. Definition & Why
- Fine-Tuning: starting from a pretrained large model, continue training on domain-specific data so the model adapts to a specific task, style, or domain knowledge.
- Key distinction: Pre-Training is "learning language from scratch"; Fine-Tuning is "specializing on top of existing capability".
- Design intent: general models (GPT-4, Claude) are generalists, broadly capable but not specialized; Fine-Tuning turns a model into an expert in one specific area.
- Pain points solved:
- Style adaptation: make the output match brand voice, format conventions, terminology
- Instruction following: improve adherence to complex instructions (e.g., "always answer in JSON")
- Domain knowledge: internalize medical / legal / financial expertise into the model weights
- Cost optimization: a fine-tuned small model can replace a big one, cutting inference cost (e.g., fine-tuned Llama 8B replaces GPT-4o)
- Latency optimization: small models infer faster — good for real-time scenarios
- 2026 key shift: PEFT (Parameter-Efficient Fine-Tuning) brought fine-tuning down from "needs 8 × A100" to "single consumer-grade GPU is enough"; cost dropped from tens of thousands of dollars to $10-100.
1.2. Features & Use Cases
- Good for fine-tuning:
- Output format is fixed: always emit a specific JSON schema or report template
- Domain-term-heavy: medical diagnosis, legal clauses, financial analysis
- Style consistency: brand voice, customer-service phrasing, technical-doc style
- Small model replacing large: a fine-tuned 7B/8B reaching 70B-level performance on a specific task
- Multilingual adaptation: better performance on a specific language (e.g., Chinese, Japanese)
- Not good for fine-tuning:
- Knowledge changes frequently → use RAG
- Need real-time data → use Function Calling
- One-off task → use Prompt Engineering
- Insufficient data (< 100 high-quality samples) → use Few-Shot Prompting
1.3. Competitors
- Fine-Tuning's position among LLM knowledge-augmentation approaches:
| Dimension | Fine-Tuning | RAG (retrieval-augmented) | Prompt Engineering | Long-Context |
|---|---|---|---|---|
| Knowledge source | Internalized in weights | Real-time external retrieval | Written in the prompt | Stuffed into context |
| Update speed | Slow (re-train) | Fast (minutes) | Instant | Instant |
| Cost | Medium (one-time train) | Low (API calls) | Lowest | High (token cost) |
| Data scale | 100-100K samples | Unlimited | < 20 examples | < 2M tokens |
| Core value | Behavior/style/format adaptation | Real-time knowledge | Quick experiments | Single-shot deep analysis |
- Decision flow:
- Try Prompt Engineering first (zero cost, immediate)
- Not enough? → add RAG (knowledge augmentation)
- Still need style/format adaptation? → Fine-Tuning
- Production: RAG + Fine-Tuning combined (best practice)
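The decision flow above can be written down as a small rule function. This is only a sketch: the `UseCase` fields and the `recommend` helper are hypothetical names invented for illustration, and the 100-sample threshold is taken from the "Insufficient data" rule in section 1.2.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    """Hypothetical description of a project; field names are illustrative."""
    prompt_engineering_sufficient: bool
    needs_fresh_or_external_knowledge: bool
    needs_style_or_format_adaptation: bool
    high_quality_samples: int


def recommend(use_case: UseCase) -> list:
    """Walk the decision flow top to bottom and return the suggested stack."""
    if use_case.prompt_engineering_sufficient:
        return ["prompt engineering"]
    stack = ["prompt engineering"]
    if use_case.needs_fresh_or_external_knowledge:
        stack.append("RAG")
    if use_case.needs_style_or_format_adaptation:
        # Below roughly 100 good samples, section 1.2 suggests few-shot prompting.
        if use_case.high_quality_samples >= 100:
            stack.append("fine-tuning")
        else:
            stack.append("few-shot prompting")
    return stack


print(recommend(UseCase(False, True, True, 5000)))
```

A production setup that needs both fresh knowledge and a fixed output style ends up with RAG and fine-tuning together, which matches the last bullet of the flow.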
2. Concept, Component, & Architecture
2.1. Key Concepts
(1) Full Fine-Tuning
- Update all model parameters (billions of weights).
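- Best results, but needs huge GPU resources: with Adam in mixed precision, weights, gradients, and optimizer states take roughly 16 bytes per parameter, so a 7B model already needs over 100GB before activations, and larger models need a multi-GPU node (e.g., 8 × A100 80GB).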
- In 2026, mostly used by large labs and research; individuals and small teams use PEFT.
(2) PEFT (Parameter-Efficient Fine-Tuning)
- Core idea: freeze the original model's parameters (all or nearly all of them) and train only a small set of added or selected parameters.
- Benefits: VRAM down 10-100×, faster training, original model capability preserved.
- In 2026, PEFT is the default for fine-tuning.
(3) LoRA (Low-Rank Adaptation)
- The most mainstream PEFT method — practically synonymous with fine-tuning in 2026.
- Mechanism: insert two small matrices (A and B) next to the model's attention layers; only A and B are trained.
- Adapted weights: W' = W + ΔW = W + B × A (W stays frozen; only A and B are trained)
- B: d × r matrix; A: r × d matrix, so that B × A has the same d × d shape as W (r is the rank, usually 8-64, far smaller than d)
- Trainable parameter count: 0.1%-1% of original
- Key hyperparameters:
r (rank): larger means more expressive but more parameters; usually 8-32
alpha: scaling factor; usually 2 × r
target_modules: layers to apply LoRA to; usually q_proj, v_proj (Attention's Query and Value)
- Advantage: LoRA weights are "hot-swappable" — the same base model can load different LoRA adapters.
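The "0.1%-1% of original" figure follows from simple counting. The sketch below assumes a single square projection matrix with d = 4096 (a hypothetical but typical hidden size); `lora_param_count` and `lora_fraction` are illustrative helpers, not library functions.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters for one adapted weight matrix of shape (d_out, d_in)."""
    # B is (d_out x r) and A is (r x d_in), so that B @ A matches the shape of W.
    return d_out * r + r * d_in


def lora_fraction(d_out: int, d_in: int, r: int) -> float:
    """LoRA parameters as a fraction of the frozen matrix they adapt."""
    return lora_param_count(d_out, d_in, r) / (d_out * d_in)


d = 4096  # assumed hidden size, for illustration only
for r in (8, 16, 64):
    print(f"r={r}: {lora_param_count(d, d, r)} params, {lora_fraction(d, d, r):.2%} of W")
```

This is the fraction per adapted matrix. The whole-model fraction is usually lower, because embeddings and any layers left out of `target_modules` add frozen parameters but no trainable ones.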
(4) QLoRA (Quantized LoRA)
- On top of LoRA, quantize the base model to 4-bit (NF4 format), further reducing VRAM.
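- Effect: the original QLoRA work fine-tuned a 65B model on a single 48GB GPU. On a 24GB card (e.g., RTX 4090), models in the 7B-13B range fit comfortably; a 70B model does not, since its 4-bit weights alone take roughly 35GB.
- Accuracy loss: usually small (around 1-2%), acceptable for most tasks.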
- The default for consumer-grade GPU fine-tuning in 2026.
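A rough weights-only estimate shows where the VRAM savings come from. This sketch deliberately ignores activations, the LoRA adapter, optimizer state, and quantization constants, so real usage is higher; `weight_memory_gb` is an illustrative helper.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the base weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for name, n_params in (("8B", 8e9), ("70B", 70e9)):
    fp16 = weight_memory_gb(n_params, 16)
    nf4 = weight_memory_gb(n_params, 4)
    print(f"{name}: 16-bit weights {fp16:.0f} GB, 4-bit weights {nf4:.0f} GB")
```

Going from 16-bit to 4-bit cuts the frozen weights by 4×, which is what moves an 8B-class model well inside a 24GB card.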
(5) SFT (Supervised Fine-Tuning)
- The most basic kind of fine-tuning: train the model with (input, expected-output) pairs.
- Data formats: instruction format (instruction, input, output) or chat format (messages).
- Use: teach the model "how to answer" — format, style, task-specific behavior.
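The two data formats carry the same information, so one can be converted into the other. A minimal sketch, assuming the instruction-format keys are literally `instruction`, `input`, and `output`; `instruction_to_messages` is a hypothetical helper.

```python
from typing import Optional


def instruction_to_messages(sample: dict, system_prompt: Optional[str] = None) -> dict:
    """Convert one (instruction, input, output) sample into chat format."""
    user_text = sample["instruction"]
    if sample.get("input"):
        # The optional input is appended to the instruction as extra context.
        user_text = f"{user_text}\n\n{sample['input']}"
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": sample["output"]})
    return {"messages": messages}


example = {"instruction": "Summarize the clause.", "input": "Clause 7: ...", "output": "It says ..."}
print(instruction_to_messages(example, system_prompt="You are a professional legal advisor"))
```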
(6) RLHF (Reinforcement Learning from Human Feedback)
- After SFT, use human preference data to further align model behavior.
- Flow: SFT → train a Reward Model → PPO reinforcement learning to optimize.
- Use: make the model more helpful, harmless, and honest.
- Expensive — usually only model providers (OpenAI, Anthropic) use it.
(7) DPO (Direct Preference Optimization)
- A simpler alternative to RLHF: no separate Reward Model — optimize directly from preference data.
- Data format: (prompt, chosen_response, rejected_response) triples.
- Advantages: simple to implement, stable training, results close to RLHF.
- The mainstream alignment-training method by 2026.
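To make "optimize directly from preference data" concrete, here is a sketch of the DPO loss for a single triple. It assumes the summed log-probabilities of each response under the trained policy and under a frozen reference model have already been computed; the argument names and the `beta=0.1` default are illustrative choices.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # loss = -log(sigmoid(margin)) = softplus(-margin), computed in a stable form
    z = -margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: the margin is zero, so the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy raises the chosen response and lowers the rejected one: the loss drops.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

The loss only depends on how much more the policy prefers the chosen response than the reference does, which is why no separate reward model appears anywhere in the computation.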
(8) GRPO (Group Relative Policy Optimization)
- A new method proposed by DeepSeek; rapidly popular in 2025-2026.
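- Core: no separate critic (value) model needed. For each prompt, sample a group of responses, score them (with a reward model or a rule-based checker), and use each response's score relative to the group average as its advantage.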
- Advantages: better suited to reasoning tasks (math, code) than DPO; higher training efficiency.
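The group-relative step can be sketched in a few lines. The example assumes rewards come from some scorer (here, a hypothetical rule-based checker giving 1 for a correct answer and 0 otherwise) and normalizes with the population standard deviation; the `eps` term is an assumption added to avoid dividing by zero when all rewards are equal.

```python
import statistics


def group_relative_advantages(rewards, eps: float = 1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one math prompt: two correct, two wrong.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that beat their group get a positive advantage and are reinforced; the rest are pushed down. If every response in a group scores the same, all advantages are zero and the group contributes no learning signal.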
2.2. Core Components
(1) Training Data
- Quality > quantity: a small, carefully curated set (on the order of 100 high-quality samples) often beats 10K low-quality ones.
- Format: usually JSONL — one training sample per line.
- SFT data example:
{"messages": [
{"role": "system", "content": "You are a professional legal advisor"},
{"role": "user", "content": "What is a force majeure clause?"},
{"role": "assistant", "content": "A force majeure clause is..."}
]}
- DPO data example (preference triple):
{
"prompt": "Explain quantum computing",
"chosen": "Quantum computing leverages quantum-mechanics principles... (accurate, detailed)",
"rejected": "Quantum computing is just a faster computer... (inaccurate)"
}
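Because one malformed line can break or silently skew a training run, it is worth checking the JSONL file before training. A minimal sketch for the chat (SFT) format; `validate_sft_line` and the set of allowed roles are assumptions for illustration, not part of any library.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_sft_line(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict) or msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role")
        elif not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i}: empty content")
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems


line = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}'
print(validate_sft_line(line))
```

In the actual JSONL file each sample sits on a single line; the multi-line examples above are pretty-printed for readability.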
(2) Base Model Selection
- Open-source picks (2026):
- Llama 3.3 70B / Llama 3.1 8B: from Meta; largest community, most mature toolchain (Llama 3.3 ships only as 70B; the 8B model belongs to the 3.1 release)
- Mistral Large / Small: from Europe; excellent multilingual
- Qwen 2.5: from Alibaba; strong on Chinese
- DeepSeek V3: outstanding reasoning, great cost/performance
- Closed-source fine-tuning:
- OpenAI Fine-Tuning API: GPT-4o-mini / GPT-4o
- Google Vertex AI: Gemini fine-tuning
- Amazon Bedrock: many models supported
(3) Training Frameworks
| Framework | Notes | Best fit |
|---|---|---|
| Hugging Face TRL | Most popular; supports SFT/DPO/GRPO | General fine-tuning |
| Unsloth | 2× faster, 60% VRAM savings | Consumer-grade GPUs |
| Axolotl | YAML-config-driven, code-free | Quick experiments |
| LLaMA-Factory | Active Chinese community, GUI | Chinese scenarios |
| OpenAI API | Fully managed, no GPU needed | Closed-source model fine-tuning |
(4) Evaluation Metrics
- General: Perplexity, loss curves
- Task-specific: Accuracy, F1, BLEU, ROUGE
- Human evaluation: blind A/B (before vs. after fine-tuning)
- Benchmarks: MMLU, HumanEval, MT-Bench
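Two of the metrics above are easy to compute by hand, which helps when sanity-checking a framework's numbers. A sketch assuming per-token natural-log probabilities for perplexity and binary labels for F1; both helpers are illustrative.

```python
import math


def perplexity(token_logprobs) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def f1_score(y_true, y_pred, positive=1) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# A model that assigns probability 0.25 to every token is as uncertain as
# a uniform choice among four options.
print(perplexity([math.log(0.25)] * 8))
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

Since perplexity is just the exponential of the mean cross-entropy loss, a falling loss curve and a falling perplexity tell the same story.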
2.3. Architecture & Design
(1) Fine-Tuning Workflow Overview
flowchart LR
A[Data prep] --> B[Cleaning + formatting]
B --> C[Base-model selection]
C --> D{Method}
D -->|Behavior adaptation| E[SFT + LoRA/QLoRA]
D -->|Preference alignment| F[DPO / GRPO]
D -->|Comprehensive customization| G[SFT → DPO two-stage]
E & F & G --> H[Training + monitoring]
H --> I[Evaluation + testing]
I --> J{Meets target?}
J -->|No| K[Adjust hyperparams / data]
K --> H
J -->|Yes| L[Deployment]
L --> M[Inference service: vLLM / TGI]
(2) LoRA Architecture Principle
flowchart LR
X[Input x] --> W[Original weight W, frozen]
X --> A[LoRA matrix A, r×d]
A --> B[LoRA matrix B, d×r]
W --> ADD((+))
B --> ADD
ADD --> Y[Output y = Wx + BAx]
style W fill:#ccc,stroke:#999
style A fill:#4CAF50,stroke:#388E3C,color:#fff
style B fill:#4CAF50,stroke:#388E3C,color:#fff
- Gray: frozen original weights (billions of params)
- Green: trainable LoRA matrices (millions of params, 0.1%-1% of original)
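The diagram can be checked numerically with tiny matrices. This pure-Python sketch uses d = 3 and r = 1 with made-up values, and includes the alpha / r scaling from the hyperparameter list (the diagram omits it for brevity). It also shows why an adapter can be merged into the base weights for deployment: both paths give the same output.

```python
def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(P, Q):
    """Matrix-matrix product for nested lists."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]


def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x): frozen path plus low-rank path."""
    scale = alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def merge(W, A, B, alpha, r):
    """Fold the adapter into the base weights: W' = W + (alpha / r) * B A."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen, d x d
A = [[1.0, 2.0, 3.0]]                                    # trainable, r x d
B = [[0.5], [0.0], [-0.5]]                               # trainable, d x r
x = [1.0, 1.0, 1.0]

print(lora_forward(W, A, B, x, alpha=2, r=1))
print(matvec(merge(W, A, B, alpha=2, r=1), x))
```

Keeping A and B separate is what makes adapters swappable; merging trades that flexibility for a single weight matrix with no extra inference cost.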
2.4. Eco-system
- Model hosting: Hugging Face Hub (open-source distribution), Ollama (local inference)
- Training platforms: AWS SageMaker, Google Vertex AI, Lambda Cloud, RunPod, Vast.ai
- Inference deployment: vLLM (high throughput), TGI (Hugging Face), Ollama (local), Together AI (API)
- Data labeling: Argilla, Label Studio, Scale AI
- Collaboration with RAG:
- RAG provides real-time knowledge (external retrieval)
- Fine-Tuning provides behavioral adaptation (internalized in weights)
- Best combo: fine-tuned model + RAG pipeline = both domain-aware and real-time-knowledgeable
3.1. QLoRA Hands-On (Unsloth + Llama)
(1) Environment install
# Recommended: Unsloth (2× faster, 60% VRAM savings)
pip install unsloth
pip install trl datasets # Hugging Face training libs
(2) Full fine-tuning code
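# Import unsloth first so it can patch transformers/trl
from unsloth import FastLanguageModel, is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Step 1: load base model (4-bit quantized)
model, tokenizer = FastLanguageModel.from_pretrained(
    # Llama 3.3 ships only as 70B; the 8B instruct model is Llama 3.1
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,  # QLoRA: 4-bit quantization
)

# Step 2: add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    lora_alpha=32,  # scaling factor (usually 2*r)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# Step 3: load data (one JSON object per line, each with a "messages" list)
# Recent TRL versions apply the tokenizer's chat template to a "messages" column
# automatically; older versions need a formatting function or a "text" column.
dataset = load_dataset("json", data_files="train_data.jsonl", split="train")

# Step 4: train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,  # newer TRL releases name this argument processing_class
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="./output",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # effective batch size = 4 * 4 = 16
        num_train_epochs=3,
        learning_rate=2e-4,
        warmup_steps=10,
        logging_steps=10,
        save_strategy="epoch",
        # Use bf16 where the GPU supports it (e.g., RTX 30/40 series), else fp16
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
    ),
)
trainer.train()

# Step 5: save LoRA weights (adapter only) together with the tokenizer
model.save_pretrained("./lora_adapter")
tokenizer.save_pretrained("./lora_adapter")

# Or merge and save the full model
model.save_pretrained_merged("./merged_model", tokenizer)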
3.2. OpenAI Fine-Tuning API (fully managed)
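import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: upload training data
with open("train_data.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")

# Step 2: create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",  # base model
    hyperparameters={
        "n_epochs": 3,
        "batch_size": "auto",
        "learning_rate_multiplier": "auto",
    },
)

# Step 3: monitor progress until the job reaches a terminal state
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(job.status)  # validating_files → queued → running → succeeded
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

if job.status != "succeeded":
    raise RuntimeError(f"Fine-tuning job ended with status: {job.status}")

# Step 4: use fine-tuned model
# fine_tuned_model is None until the job has succeeded, so read it from the final job object
response = client.chat.completions.create(
    model=job.fine_tuned_model,  # ft:gpt-4o-mini:org:custom:id
    messages=[{"role": "user", "content": "your question"}],
)
print(response.choices[0].message.content)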
3.3. Hyperparameter Quick Reference
| Hyperparameter | Recommended | Notes |
|---|---|---|
| LoRA r | 8-32 | Rank; 16 is a common starting point |
| LoRA alpha | 2 × r | Scaling factor |
| Learning rate | 1e-4 - 3e-4 | QLoRA usually higher than full fine-tuning |
| Batch size | 4-8 (per GPU) | Lower if VRAM is tight; compensate with gradient accumulation |
| Epochs | 1-5 | <1K samples → 3-5; >10K samples → 1-2 |
| Warmup | 5-10% total steps | Prevents over-large LR at the start |
| Max seq length | 512-2048 | Set per data; longer means more VRAM |
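Several of these settings interact: batch size and gradient accumulation set the effective batch, which together with dataset size and epochs fixes the number of optimizer steps and therefore the warmup length. A rough sketch of that arithmetic; `training_schedule` is a hypothetical helper, and real frameworks may round step counts slightly differently.

```python
import math


def training_schedule(n_samples: int, per_device_batch: int, grad_accum: int,
                      epochs: int, n_gpus: int = 1, warmup_ratio: float = 0.05) -> dict:
    """Estimate optimizer steps and warmup steps for a fine-tuning run."""
    effective_batch = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# 1K samples, batch 4, accumulation 4, 3 epochs, single GPU
print(training_schedule(1000, per_device_batch=4, grad_accum=4, epochs=3))
```

Halving the per-device batch and doubling gradient accumulation leaves the effective batch and the step count unchanged, which is why that trade is the standard response to tight VRAM.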
3.4. Security Best Practices
- Data security:
- Training data should not contain PII (or must be de-identified)
- When fine-tuning via closed-source APIs, confirm the data-usage terms (used to train base model?)
- Prefer local fine-tuning: sensitive data stays in-house
- Model security:
- Fine-tuning may break the model's safety alignment — include safety samples in training data
- Evaluate the fine-tuned model's refusal capability (does it still refuse harmful requests?)
- Don't publish fine-tuned models on public platforms before they have passed a safety evaluation
- Cost control:
- Validate the pipeline on a small dataset (100 samples) before scaling up
- Use QLoRA over full fine-tuning to save 10×+ cost
- Monitor loss curves and stop early on overfitting
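The "stop early on overfitting" rule can be expressed as a patience check on validation loss. A minimal sketch; `early_stop_step`, `patience`, and `min_delta` are illustrative names, and the loss values are made up.

```python
def early_stop_step(val_losses, patience: int = 2, min_delta: float = 0.0):
    """Return the index of the evaluation at which training would stop, or None."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss      # new best: reset the counter
            bad_evals = 0
        else:
            bad_evals += 1   # no improvement at this evaluation
            if bad_evals >= patience:
                return i
    return None


# Validation loss falls, then climbs while training loss keeps dropping: overfitting.
print(early_stop_step([1.0, 0.8, 0.7, 0.72, 0.75, 0.9], patience=2))
```

In practice the checkpoint with the best validation loss is the one to keep, not the last one written before stopping.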
4. Bootcamp & Workshops
4.1. Official & Classic Tutorials
4.2. Troubleshooting
| Symptom | Root Cause | Solution |
|---|---|---|
| OOM (Out of Memory) | Insufficient VRAM | Lower batch_size; use QLoRA 4-bit; enable gradient checkpointing |
| Loss not decreasing | Learning rate too low or bad data | Raise learning rate; check data format |
| Loss drops then rises | Overfitting | Reduce epochs; add more data; add dropout |
| Fine-tuned model gets dumber | Catastrophic forgetting | Lower learning rate; reduce epochs; mix in general-purpose data |
| Output format unstable | Inconsistent training-data format | Standardize format across all samples; add format-related samples |
| Fine-tuned model no longer refuses harmful requests | Safety alignment broken | Include safety samples in training; evaluate refusal rate |
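The "evaluate refusal rate" remedy can start as a simple before/after comparison on two prompt sets, one harmful and one benign. The sketch below uses a keyword heuristic; the marker list and helper names are assumptions for illustration, and a serious evaluation would rely on a classifier or human review instead.

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable", "i am unable")


def is_refusal(response: str) -> bool:
    """Crude keyword check for a refusal; expect false positives and negatives."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses) -> float:
    """Fraction of responses flagged as refusals."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


# Made-up model outputs for two benign prompts
benign_outputs = ["Sure, here is the summary.", "I can't help with that."]
print(refusal_rate(benign_outputs))
```

A healthy fine-tune keeps the refusal rate high on harmful prompts and low on benign ones; a shift in either direction relative to the base model is worth investigating.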
4.3. Common Q & A
- Q: How much training data do I need?
- A: 50-100 high-quality samples can already show effects. Format/style adaptation: 100-500 is enough; domain knowledge injection: 1K-10K. Quality always matters more than quantity.
- Q: LoRA or QLoRA?
- A: If you have ≥ 48GB VRAM (e.g., A100 80GB), use LoRA on 16-bit weights (more accurate). If ≤ 24GB (e.g., RTX 4090), use QLoRA (60%+ VRAM savings, 1-2% accuracy loss). These thresholds assume an 8B-class model; larger models shift them upward.
- Q: Can the fine-tuned model be used commercially?
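- A: Depends on the base model's license. Llama 3.3 uses Meta's Llama 3.3 Community License (commercial use allowed, subject to conditions such as a monthly-active-user threshold and an acceptable-use policy). Mistral's open-weight models such as Mistral 7B and Mistral Small 3 are Apache 2.0, while Mistral Large is released under a more restrictive research license. Most Qwen 2.5 sizes are Apache 2.0, but some sizes use a separate Qwen license, so check the model card. Models fine-tuned via closed-source APIs are usually commercial-OK but you can't export weights.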
- Q: Fine-Tuning or RAG first?
- A: Do RAG first (low cost, fast results). If RAG solves the knowledge issue but format/style/instruction adherence still isn't right, then add Fine-Tuning.
- Q: How long and how much does one fine-tune cost?
- A: 8B model + 1K data + QLoRA: about 1-2 hours on a single RTX 4090, < $1 in electricity. Cloud (RunPod / Lambda): $5-20. OpenAI API fine-tuning GPT-4o-mini: $10-50 (token-billed).