
4.Fine Tuning

👉 #AI #LLM #ML #Coding

I. Fine-Tuning — Adapting Large Language Models

📅 2026-04-28 Tuesday PST; Claude Opus 4.6 📎 LLM Fine-Tuning Complete Guide 2026 📎 LoRA and QLoRA: Fine-Tuning on Consumer GPUs 📎 Fine-Tuning Techniques: LoRA, DPO, GRPO

1. Overview

1.1. Definition & Why
  • Fine-Tuning: starting from a pretrained large model, continue training on domain-specific data so the model adapts to a specific task, style, or domain knowledge.
  • Key distinction: Pre-Training is "learning language from scratch"; Fine-Tuning is "specializing on top of existing capability".
  • Design intent: general models (GPT-4, Claude) are jacks of all trades but masters of none; Fine-Tuning makes the model an expert in a specific area.
  • Pain points solved:
    • Style adaptation: make the output match brand voice, format conventions, terminology
    • Instruction following: improve adherence to complex instructions (e.g., "always answer in JSON")
    • Domain knowledge: internalize medical / legal / financial expertise into the model weights
    • Cost optimization: a fine-tuned small model can replace a big one, cutting inference cost (e.g., fine-tuned Llama 8B replaces GPT-4o)
    • Latency optimization: small models infer faster, which suits real-time scenarios
  • 2026 key shift: PEFT (Parameter-Efficient Fine-Tuning) brought fine-tuning down from "needs 8 × A100" to "single consumer-grade GPU is enough"; cost dropped from tens of thousands of dollars to $10-100.
1.2. Features & Use Cases
  • Good for fine-tuning:
    • Output format is fixed: always emit a specific JSON schema or report template
    • Domain-term-heavy: medical diagnosis, legal clauses, financial analysis
    • Style consistency: brand voice, customer-service phrasing, technical-doc style
    • Small model replacing large: a fine-tuned 7B/8B reaching 70B-level performance on a specific task
    • Multilingual adaptation: better performance on a specific language (e.g., Chinese, Japanese)
  • Not good for fine-tuning:
    • Knowledge changes frequently → use RAG
    • Need real-time data → use Function Calling
    • One-off task → use Prompt Engineering
    • Insufficient data (< 100 high-quality samples) → use Few-Shot Prompting
1.3. Competitors
  • Fine-Tuning's position among LLM knowledge-augmentation approaches:
| Dimension | Fine-Tuning | RAG (retrieval-augmented) | Prompt Engineering | Long-Context |
|---|---|---|---|---|
| Knowledge source | Internalized in weights | Real-time external retrieval | Written in the prompt | Stuffed into context |
| Update speed | Slow (re-train) | Fast (minutes) | Instant | Instant |
| Cost | Medium (one-time train) | Low (API calls) | Lowest | High (token cost) |
| Data scale | 100-100K samples | Unlimited | < 20 examples | < 2M tokens |
| Core value | Behavior/style/format adaptation | Real-time knowledge | Quick experiments | Single-shot deep analysis |
  • Decision flow:
    • Try Prompt Engineering first (zero cost, immediate)
    • Not enough? → add RAG (knowledge augmentation)
    • Still need style/format adaptation? → Fine-Tuning
    • Production: RAG + Fine-Tuning combined (best practice)

2. Concept, Component, & Architecture

2.1. Key Concepts
(1) Full Fine-Tuning
  • Update all model parameters (billions of weights).
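  • Best results in principle, but GPU-hungry: weights, gradients, and optimizer states must all fit in memory, so models in the tens of billions of parameters typically need a multi-GPU node (e.g., 8 × A100 80GB).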
  • In 2026, mostly used by large labs and research; individuals and small teams use PEFT.
(2) PEFT (Parameter-Efficient Fine-Tuning)
  • Core idea: freeze the original model's parameters (typically 99%+ of the total) and train only a small set of added parameters.
  • Benefits: VRAM down 10-100×, faster training, original model capability preserved.
  • In 2026, PEFT is the default for fine-tuning.
(3) LoRA (Low-Rank Adaptation)
  • The most mainstream PEFT method — practically synonymous with fine-tuning in 2026.
  • Mechanism: add two small matrices (A and B) alongside selected weight matrices, typically the attention projections; only A and B are trained.
    • Effective weight: W' = W + ΔW = W + B × A, with the original W frozen
    • For a d × d weight W: B is a d × r matrix and A is an r × d matrix (r is the rank, usually 8-64, far smaller than d), so that B × A has the same shape as W
    • Trainable parameter count: 0.1%-1% of the original
  • Key hyperparameters:
    • r (rank): larger means more expressive but more parameters; usually 8-32
    • alpha: scaling factor, applied to the update as alpha / r; usually set to 2 × r
    • target_modules: layers to apply LoRA to; q_proj and v_proj (Attention's Query and Value) are the minimal common choice, while many recipes adapt all attention and MLP projections
  • Advantage: LoRA weights are "hot-swappable" — the same base model can load different LoRA adapters.
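As a rough check on the 0.1%-1% figure, the sketch below counts the parameters LoRA adds to a single square projection matrix. The hidden size of 4096 and the helper name `lora_param_count` are illustrative assumptions, not values taken from a specific model.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix."""
    # A has shape (r x d_in); B has shape (d_out x r)
    return r * d_in + d_out * r


d = 4096  # assumed hidden size (illustrative)
r = 16    # LoRA rank
base = d * d                       # parameters in the frozen matrix W
added = lora_param_count(d, d, r)  # parameters in A and B
fraction = added / base

print(f"base={base:,} added={added:,} fraction={fraction:.2%}")
```

For this one matrix the adapter adds under 1% of the parameters. The share of the whole model is usually lower still, because embeddings and any layers not listed in target_modules get no adapter.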
(4) QLoRA (Quantized LoRA)
  • On top of LoRA, quantize the base model to 4-bit (NF4 format), further reducing VRAM.
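  • Effect: the QLoRA paper reports fine-tuning a 65B model on a single 48GB GPU. On a single 24GB GPU (e.g., RTX 4090), smaller models are the realistic target (comfortably 7B-13B, with about 30B at the limit), since the 4-bit weights of a 70B model alone take about 35GB.
  • Accuracy loss: usually small (often quoted as 1-2%), acceptable for most tasks.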
  • The default for consumer-grade GPU fine-tuning in 2026.
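The VRAM effect of 4-bit quantization can be estimated with simple arithmetic. The sketch below covers weight storage only; activations, LoRA gradients, optimizer state, and quantization metadata come on top, so treat the result as a lower bound. The function name `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


for label, params in [("8B", 8_000_000_000), ("70B", 70_000_000_000)]:
    fp16 = weight_memory_gb(params, 16)
    nf4 = weight_memory_gb(params, 4)
    print(f"{label}: 16-bit ≈ {fp16:.0f} GB, 4-bit ≈ {nf4:.0f} GB")
```

Under these assumptions an 8B model drops from about 16 GB to about 4 GB of weights, which is what leaves room for training state on a 24GB card.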
(5) SFT (Supervised Fine-Tuning)
  • The most basic kind of fine-tuning: train the model with (input, expected-output) pairs.
  • Data formats: instruction format (instruction, input, output) or chat format (messages).
  • Use: teach the model "how to answer" — format, style, task-specific behavior.
(6) RLHF (Reinforcement Learning from Human Feedback)
  • After SFT, use human preference data to further align model behavior.
  • Flow: SFT → train a Reward Model → PPO reinforcement learning to optimize.
  • Use: make the model more helpful, harmless, and honest.
  • Expensive — usually only model providers (OpenAI, Anthropic) use it.
(7) DPO (Direct Preference Optimization)
  • A simpler alternative to RLHF: no separate Reward Model — optimize directly from preference data.
  • Data format: (prompt, chosen_response, rejected_response) triples.
  • Advantages: simple to implement, stable training, results close to RLHF.
  • The mainstream alignment-training method by 2026.
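To make "optimize directly from preference data" concrete, here is a minimal sketch of the DPO loss for one (prompt, chosen, rejected) triple. It assumes the summed log-probabilities of each response under the trained policy and under a frozen reference model are already computed; the function name, argument names, and `beta=0.1` are illustrative choices.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair."""
    # How much more the policy likes each response than the reference does
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: margin is 0, loss is log(2)
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has moved toward the chosen response and away from the rejected one
better = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(start, better)
```

The loss falls as the policy raises the chosen response and lowers the rejected one relative to the reference, which is why no separate reward model is needed.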
(8) GRPO (Group Relative Policy Optimization)
  • A new method proposed by DeepSeek; rapidly popular in 2025-2026.
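  • Core: no separate value (critic) model needed; for each prompt, sample a group of responses, score them with a reward model or a rule-based verifier, and use each reward relative to the group's average as the advantage.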
  • Advantages: better suited to reasoning tasks (math, code) than DPO; higher training efficiency.
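The group-relative idea can be sketched in a few lines: each sampled response is scored against the other responses to the same prompt. This is a simplified illustration, assuming rewards are already available; the use of the population standard deviation and the `eps` term are choices made here, and implementations differ in these details.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one math prompt; 1.0 = verifier accepted the answer
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat the group average get a positive advantage and are reinforced; those below it are pushed down. If every response scores the same, all advantages are zero and the group contributes no learning signal.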
2.2. Core Components
(1) Training Data
  • Quality > quantity: 100 high-quality samples can beat 10K low-quality ones.
  • Format: usually JSONL, with one training sample per line (the examples below are pretty-printed across several lines for readability).
  • SFT data example:
{"messages": [
  {"role": "system", "content": "You are a professional legal advisor"},
  {"role": "user", "content": "What is a force majeure clause?"},
  {"role": "assistant", "content": "A force majeure clause is..."}
]}
  • DPO data example:
{
  "prompt": "Explain quantum computing",
  "chosen": "Quantum computing leverages quantum-mechanics principles... (accurate, detailed)",
  "rejected": "Quantum computing is just a faster computer... (inaccurate)"
}
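Because JSONL expects each sample on a single line, it helps to validate and serialize records programmatically rather than by hand. The sketch below checks a chat-format sample and emits one JSONL line; the helper names and the specific checks are assumptions for illustration, not requirements of any particular trainer.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_chat_sample(sample: dict) -> None:
    """Raise ValueError if a chat-format sample is malformed."""
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("sample needs a non-empty 'messages' list")
    for msg in messages:
        if msg.get("role") not in VALID_ROLES:
            raise ValueError(f"unexpected role: {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            raise ValueError("every message needs non-empty string content")
    if messages[-1]["role"] != "assistant":
        raise ValueError("last message should be the assistant target")


def to_jsonl_line(sample: dict) -> str:
    """Validate a sample and serialize it as a single JSONL line."""
    validate_chat_sample(sample)
    return json.dumps(sample, ensure_ascii=False)


sample = {"messages": [
    {"role": "system", "content": "You are a professional legal advisor"},
    {"role": "user", "content": "What is a force majeure clause?"},
    {"role": "assistant", "content": "A force majeure clause is..."},
]}
line = to_jsonl_line(sample)
print(line)
```

Writing each returned line followed by a newline to train_data.jsonl yields a file that `load_dataset("json", ...)` can read.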
(2) Base Model Selection
  • Open-source picks (2026):
    • Llama 3.3 70B / Llama 3.1 8B: from Meta; largest community, most mature toolchain (Llama 3.3 is released only at 70B; the 8B model belongs to Llama 3.1)
    • Mistral Large / Small: from Mistral AI (France); excellent multilingual support
    • Qwen 2.5: from Alibaba; strong on Chinese
    • DeepSeek V3: outstanding reasoning, great cost/performance
  • Closed-source fine-tuning:
    • OpenAI Fine-Tuning API: GPT-4o-mini / GPT-4o
    • Google Vertex AI: Gemini fine-tuning
    • Amazon Bedrock: many models supported
(3) Training Frameworks
| Framework | Notes | Best fit |
|---|---|---|
| Hugging Face TRL | Most popular; supports SFT/DPO/GRPO | General fine-tuning |
| Unsloth | 2× faster, 60% VRAM savings | Consumer-grade GPUs |
| Axolotl | YAML-config-driven, code-free | Quick experiments |
| LLaMA-Factory | Active Chinese community, GUI | Chinese scenarios |
| OpenAI API | Fully managed, no GPU needed | Closed-source model fine-tuning |
(4) Evaluation Metrics
  • General: Perplexity, loss curves
  • Task-specific: Accuracy, F1, BLEU, ROUGE
  • Human evaluation: blind A/B (before vs. after fine-tuning)
  • Benchmarks: MMLU, HumanEval, MT-Bench
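Perplexity is the exponential of the average negative log-likelihood per token, so it can be computed directly from per-token log-probabilities. The sketch below assumes those log-probabilities (natural log) have already been extracted from the model on a held-out set; the function name is illustrative.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every correct token
uniform = perplexity([math.log(0.25)] * 4)
# A more confident model (probability 0.5 per token) scores lower
confident = perplexity([math.log(0.5)] * 4)
print(uniform, confident)
```

Lower is better. Comparing perplexity on the same held-out set before and after fine-tuning gives a quick sanity check, though it does not replace task-specific metrics.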
2.3. Architecture & Design
(1) Fine-Tuning Workflow Overview
flowchart LR
  A[Data prep] --> B[Cleaning + formatting]
  B --> C[Base-model selection]
  C --> D{Method}
  D -->|Behavior adaptation| E[SFT + LoRA/QLoRA]
  D -->|Preference alignment| F[DPO / GRPO]
  D -->|Comprehensive customization| G[SFT → DPO two-stage]

  E & F & G --> H[Training + monitoring]
  H --> I[Evaluation + testing]
  I --> J{Meets target?}
  J -->|No| K[Adjust hyperparams / data]
  K --> H
  J -->|Yes| L[Deployment]
  L --> M[Inference service: vLLM / TGI]
(2) LoRA Architecture Principle
flowchart LR
  X[Input x] --> W[Original weight W, frozen]
  X --> A[LoRA matrix A, r×d]
  A --> B[LoRA matrix B, d×r]
  W --> ADD((+))
  B --> ADD
  ADD --> Y[Output y = Wx + BAx]

  style W fill:#ccc,stroke:#999
  style A fill:#4CAF50,stroke:#388E3C,color:#fff
  style B fill:#4CAF50,stroke:#388E3C,color:#fff
  • Gray: frozen original weights (billions of params)
  • Green: trainable LoRA matrices (millions of params, 0.1%-1% of original)
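The data flow in the diagram can be traced with a tiny pure-Python sketch. It assumes the standard LoRA conventions: A has shape r × d_in, B has shape d_out × r, B starts at zero, and the update is scaled by alpha / r. The matrices and helper names are toy values chosen for illustration.

```python
def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha: float, r: int) -> list[float]:
    """y = W x + (alpha / r) * B (A x); W is frozen, only A and B would train."""
    base = matvec(W, x)              # frozen path
    delta = matvec(B, matvec(A, x))  # low-rank path: project down, then up
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen weight, 2 x 2
A = [[0.5, -0.5]]              # r x d_in = 1 x 2
B_init = [[0.0], [0.0]]        # d_out x r = 2 x 1, zero at initialization
B_trained = [[1.0], [0.0]]     # toy "after training" values
x = [1.0, 2.0]

y_init = lora_forward(W, A, B_init, x, alpha=2, r=1)
y_trained = lora_forward(W, A, B_trained, x, alpha=2, r=1)
print(y_init, y_trained)
```

With B at zero the adapter contributes nothing, so training starts exactly from the base model's behavior; once B is updated, the low-rank path shifts the output without touching W.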
2.4. Eco-system
  • Model hosting: Hugging Face Hub (open-source distribution), Ollama (local inference)
  • Training platforms: AWS SageMaker, Google Vertex AI, Lambda Cloud, RunPod, Vast.ai
  • Inference deployment: vLLM (high throughput), TGI (Hugging Face), Ollama (local), Together AI (API)
  • Data labeling: Argilla, Label Studio, Scale AI
  • Collaboration with RAG:
    • RAG provides real-time knowledge (external retrieval)
    • Fine-Tuning provides behavioral adaptation (internalized in weights)
    • Best combo: fine-tuned model + RAG pipeline = both domain-aware and real-time-knowledgeable

3. Install, Configure, Secure, & Cheatsheets

3.1. QLoRA Hands-On (Unsloth + Llama)
(1) Environment install
# Recommended: Unsloth (2× faster, 60% VRAM savings)
pip install unsloth
pip install trl datasets  # Hugging Face training libs
(2) Complete QLoRA fine-tuning script
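from unsloth import FastLanguageModel, is_bfloat16_supported  # import unsloth before trl/transformers
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Step 1: load base model (4-bit quantized)
model, tokenizer = FastLanguageModel.from_pretrained(
    # Llama 3.3 is released only at 70B; the 8B Instruct model is Llama 3.1
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,  # QLoRA: 4-bit quantization
)

# Step 2: add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                    # LoRA rank
    lora_alpha=32,           # scaling factor (usually 2*r)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],  # all attention + MLP projections
    use_gradient_checkpointing="unsloth",  # trades compute for lower VRAM
)

# Step 3: load data (JSONL, one sample per line)
# Depending on the TRL version, chat-format rows may need the tokenizer's
# chat template applied before training.
dataset = load_dataset("json", data_files="train_data.jsonl", split="train")

# Step 4: train
# Argument names vary across TRL releases (newer ones use processing_class
# instead of tokenizer); check the installed version.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="./output",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # effective batch size = 4 * 4 = 16
        num_train_epochs=3,
        learning_rate=2e-4,
        warmup_steps=10,
        logging_steps=10,
        save_strategy="epoch",
        fp16=not is_bfloat16_supported(),  # fall back to fp16 on older GPUs
        bf16=is_bfloat16_supported(),      # prefer bf16 where the GPU supports it
    ),
)
trainer.train()

# Step 5: save LoRA weights (adapter only) together with the tokenizer
model.save_pretrained("./lora_adapter")
tokenizer.save_pretrained("./lora_adapter")
# Or merge the adapter into the base weights and save the full model
model.save_pretrained_merged("./merged_model", tokenizer)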
3.2. OpenAI Fine-Tuning API (fully managed)
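import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: upload training data
with open("train_data.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")

# Step 2: create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",  # base model
    hyperparameters={
        "n_epochs": 3,
        "batch_size": "auto",
        "learning_rate_multiplier": "auto"
    }
)

# Step 3: poll until the job reaches a terminal state
# queued → running → succeeded (or failed / cancelled)
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(job.status)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

# Step 4: use fine-tuned model (fine_tuned_model is only set after success)
if job.status != "succeeded":
    raise RuntimeError(f"Fine-tuning ended with status: {job.status}")
response = client.chat.completions.create(
    model=job.fine_tuned_model,  # ft:gpt-4o-mini:org:custom:id
    messages=[{"role": "user", "content": "your question"}]
)
print(response.choices[0].message.content)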
3.3. Hyperparameter Quick Reference
| Hyperparameter | Recommended | Notes |
|---|---|---|
| LoRA r | 8-32 | Rank; 16 is a common starting point |
| LoRA alpha | 2 × r | Scaling factor |
| Learning rate | 1e-4 - 3e-4 | QLoRA usually higher than full fine-tuning |
| Batch size | 4-8 (per GPU) | Lower if VRAM is tight; compensate with gradient accumulation |
| Epochs | 1-5 | <1K samples → 3-5; >10K samples → 1-2 |
| Warmup | 5-10% total steps | Prevents over-large LR at the start |
| Max seq length | 512-2048 | Set per data; longer means more VRAM |
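Several of these settings interact: batch size and gradient accumulation set the effective batch size, which determines the number of optimizer steps and therefore how many steps a 5-10% warmup covers. The sketch below does that arithmetic for a hypothetical 1,000-sample dataset. The function name is illustrative, and frameworks round step counts slightly differently, so treat the numbers as approximate.

```python
import math


def training_schedule(num_samples: int, per_device_batch: int, grad_accum: int,
                      epochs: int, warmup_ratio: float, num_gpus: int = 1) -> dict:
    """Estimate optimizer steps and warmup steps for a fine-tuning run."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# 1,000 samples, batch 4, accumulation 4, 3 epochs, 5% warmup, one GPU
plan = training_schedule(1000, 4, 4, 3, 0.05)
print(plan)
```

With these inputs the run has roughly 189 optimizer steps, so a 5% warmup is about 10 steps. On much smaller datasets a fixed warmup of 10 steps could consume a large share of training, which is why the table expresses warmup as a percentage.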
3.4. Security Best Practices
  • Data security:
    • Training data should not contain PII (or must be de-identified)
    • When fine-tuning via closed-source APIs, confirm the data-usage terms (is your data used to train the base model?)
    • Prefer local fine-tuning: sensitive data stays in-house
  • Model security:
    • Fine-tuning may break the model's safety alignment, so include safety samples in training data
    • Evaluate the fine-tuned model's refusal capability (does it still refuse harmful requests?)
    • Don't publish fine-tuned models on public platforms before they have been safety-evaluated
  • Cost control:
    • Validate the pipeline on a small dataset (100 samples) before scaling up
    • Use QLoRA over full fine-tuning to save 10×+ cost
    • Monitor loss curves and stop early on overfitting
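"Stop early on overfitting" usually means watching validation loss and halting once it stops improving for a few evaluations. The sketch below shows that patience rule on a list of evaluation losses. It is a framework-independent illustration with hypothetical names and made-up loss values; training libraries typically offer their own early-stopping hooks.

```python
def early_stop_index(eval_losses: list[float], patience: int = 2,
                     min_delta: float = 0.0):
    """Return (stop_index, best_index); stop_index is None if never triggered."""
    best = float("inf")
    best_index = -1
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best - min_delta:
            best, best_index, bad_evals = loss, i, 0  # new best: reset patience
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i, best_index  # stop here, keep the best checkpoint
    return None, best_index


# Illustrative validation losses: improving, then rising (overfitting)
losses = [1.20, 0.95, 0.90, 0.93, 0.97, 1.05]
stop_at, best_at = early_stop_index(losses, patience=2)
print(stop_at, best_at)
```

Here training would halt at the fifth evaluation and the checkpoint from the third would be kept, saving the compute that later epochs would have spent making the model worse.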

4. Bootcamp & Workshops

4.1. Official & Classic Tutorials
| Resource | Link | Goal |
|---|---|---|
| Hugging Face TRL docs | huggingface.co/docs/trl | SFT/DPO/GRPO end-to-end |
| Unsloth Wiki | github.com/unslothai/unsloth | Consumer-GPU fine-tuning acceleration |
| OpenAI Fine-Tuning Guide | platform.openai.com | Closed-source model fine-tuning |
| DeepLearning.AI - Finetuning LLMs | deeplearning.ai | Practical short course on the DeepLearning.AI platform |
| LLaMA-Factory | github.com/hiyouga/LLaMA-Factory | GUI, Chinese-friendly |
| Axolotl | github.com/axolotl-ai-cloud/axolotl | YAML-driven fine-tuning |
4.2. Troubleshooting
| Symptom | Root Cause | Solution |
|---|---|---|
| OOM (Out of Memory) | Insufficient VRAM | Lower batch_size; use QLoRA 4-bit; enable gradient checkpointing |
| Loss not decreasing | Learning rate too low or bad data | Raise learning rate; check data format |
| Loss drops then rises | Overfitting | Reduce epochs; add more data; add dropout |
| Fine-tuned model gets dumber | Catastrophic forgetting | Lower learning rate; reduce epochs; mix in general-purpose data |
| Output format unstable | Inconsistent training-data format | Standardize format across all samples; add format-related samples |
| Fine-tuned model stops refusing harmful requests | Safety alignment broken | Include safety samples in training; evaluate refusal rate |
4.3. Common Q & A
  • Q: How much training data do I need?
  • A: 50-100 high-quality samples can already show effects. Format/style adaptation: 100-500 is enough; domain knowledge injection: 1K-10K. Quality always matters more than quantity.
  • Q: LoRA or QLoRA?
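  • A: It depends on model size and available VRAM. With ample headroom (e.g., an 8B model on a 48GB-80GB card such as an A6000 or A100 80GB), use LoRA on 16-bit weights (more accurate). With ≤ 24GB (e.g., RTX 4090), use QLoRA (60%+ VRAM savings, 1-2% accuracy loss).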
  • Q: Can the fine-tuned model be used commercially?
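  • A: Depends on the base model's license, so check the specific checkpoint. Llama 3.3 uses the Llama Community License (commercial use allowed, with conditions such as a monthly-active-user threshold). Mistral licenses vary by model: some are Apache 2.0, while others ship under more restrictive Mistral licenses. Most Qwen 2.5 sizes are Apache 2.0, but some use a separate Qwen license. Models fine-tuned via closed-source APIs are usually commercial-OK but you can't export weights.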
  • Q: Fine-Tuning or RAG first?
  • A: Do RAG first (low cost, fast results). If RAG solves the knowledge issue but format/style/instruction adherence still isn't right, then add Fine-Tuning.
  • Q: How long and how much does one fine-tune cost?
  • A: 8B model + 1K data + QLoRA: about 1-2 hours on a single RTX 4090, < $1 in electricity. Cloud (RunPod / Lambda): $5-20. OpenAI API fine-tuning GPT-4o-mini: $10-50 (token-billed).