Agent-Assisted Fine-Tuning
Using coding agents like Claude Code and OpenAI Codex to automate the entire LLM fine-tuning workflow—from GPU selection to model deployment—through natural language instructions.
Why Agent-Assisted Fine-Tuning?
Fine-tuning LLMs traditionally requires deep MLOps expertise: selecting hardware, configuring training scripts, managing datasets, monitoring jobs, and deploying models. Coding agents can now handle this entire workflow autonomously, making custom model training accessible to developers without ML infrastructure experience.
Automatic Hardware Selection
Agent picks optimal GPU based on model size, training method, and budget
Dataset Validation
Pre-flight checks on CPU before incurring GPU costs
Job Orchestration
Submit, monitor, and manage training runs via conversation
Local Deployment
Convert to GGUF and run with llama.cpp automatically
Architecture
┌───────────────────────────────────────────────────────────┐
│                       CODING AGENT                        │
│            (Claude Code / Codex / Gemini CLI)             │
│                                                           │
│  User: "Fine-tune Qwen-7B on my customer support data"    │
│                                                           │
│  Agent Actions:                                           │
│    1. Validate dataset format                             │
│    2. Select hardware (a10g-large for 7B + LoRA)          │
│    3. Generate training configuration                     │
│    4. Submit job to compute platform                      │
│    5. Monitor progress and report status                  │
│    6. Convert to GGUF for local deployment                │
└───────────────────────────────────────────────────────────┘
                              │
                              │  Skills / Plugins
                              │
         ┌────────────────────┼────────────────────┐
         │                    │                    │
         ▼                    ▼                    ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│  Hugging Face   │  │     Unsloth     │  │    Local LLM    │
│      Jobs       │  │                 │  │   (llama.cpp)   │
│                 │  │                 │  │                 │
│ - Managed GPU   │  │ - 2x faster     │  │ - Private data  │
│ - Auto scaling  │  │ - 30% less VRAM │  │ - No API costs  │
│ - Trackio logs  │  │ - GGUF export   │  │ - Offline use   │
└─────────────────┘  └─────────────────┘  └─────────────────┘
         │                    │                    │
         └────────────────────┼────────────────────┘
                              │
                              ▼
                     ┌─────────────────┐
                     │   Fine-Tuned    │
                     │      Model      │
                     │                 │
                     │  • HF Hub       │
                     │  • GGUF local   │
                     │  • API endpoint │
                     └─────────────────┘

Hugging Face Skills
The hf-llm-trainer skill teaches coding agents everything needed for fine-tuning: which GPU to pick, how to configure training, when to use LoRA vs full fine-tuning, and how to handle the dozens of decisions in a successful training run.
# Install Hugging Face Skills plugin
/plugin marketplace add huggingface/skills
/plugin install hf-llm-trainer@huggingface-skills
# Authenticate with Hugging Face (requires Pro plan for Jobs)
hf auth login
# Or set environment variable
export HF_TOKEN=hf_your_write_access_token_here
# Start fine-tuning with natural language
# Claude Code handles everything automatically:
# - GPU selection based on model size
# - Training script configuration
# - Job submission and monitoring
# - Model upload to Hub

# Simple fine-tuning request
User: "Fine-tune Qwen3-0.6B on the open-r1/codeforces-cots dataset
for instruction following."
# Agent automatically:
# 1. Validates dataset format
# 2. Selects hardware (t4-small for 0.6B model)
# 3. Configures training with Trackio monitoring
# 4. Submits job to Hugging Face Jobs
# 5. Reports cost estimate (~$0.30)
# Production run with specific parameters
User: "SFT Qwen-0.6B for production on my-org/support-conversations.
Checkpoints every 500 steps, 3 epochs, cosine learning rate."
# Multi-stage pipeline
User: "Train a math reasoning model:
1. SFT on openai/gsm8k
2. DPO alignment with preference data
3. Convert to GGUF Q4_K_M for local deployment"

Hardware & Cost Guide
| Model Size | Recommended GPU | Training Time | Estimated Cost |
|---|---|---|---|
| <1B | t4-small | Minutes | $1-2 |
| 1-3B | t4-medium / a10g-small | Hours | $5-15 |
| 3-7B | a10g-large (LoRA) | Hours | $15-40 |
| 7-13B | a100-large (LoRA) | Hours | $40-100 |
| 70B+ | Multi-GPU / QLoRA | Many hours | $100+ |
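For reference, the agent's job submission typically boils down to a couple of CLI calls. A sketch, assuming the hf jobs commands that ship with recent huggingface_hub (the script name is a placeholder; flavors come from the table above):

# Submit a training script to a managed GPU flavor
hf jobs uv run train_sft.py --flavor a10g-large
# List running jobs and stream logs
hf jobs ps
hf jobs logs <job-id>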
Cost Optimization
Keep GPU spend low by validating datasets on CPU first, running a short ~100-example test job before the full run, and preferring LoRA/QLoRA for models above roughly 3B parameters so smaller, cheaper flavors suffice.
Training Methods
Coding agents support multiple training methods, automatically selecting the best approach based on your dataset and goals:
# Best for: High-quality input-output demonstration pairs
# Use cases: Customer support, code generation, domain Q&A
User: "Fine-tune Qwen3-0.6B on my-org/support-conversations for 3 epochs."
# Agent selects:
# - LoRA for models >3B (memory efficient)
# - Full fine-tuning for smaller models
# - Appropriate batch size and learning rate
# Dataset format (messages column):
{
"messages": [
{"role": "user", "content": "How do I reset my password?"},
{"role": "assistant", "content": "To reset your password..."}
]
}

# Best for: Preference-annotated data (chosen vs rejected)
# Use cases: Alignment, reducing harmful outputs, style matching
User: "Run DPO on my-org/preference-data to align the SFT model.
Dataset has 'chosen' and 'rejected' columns."
# No separate reward model needed
# Typically applied after SFT
# Dataset format:
{
"prompt": "Explain quantum computing",
"chosen": "Quantum computing uses quantum bits...",
"rejected": "Quantum computing is magic..."
}

# Best for: Verifiable tasks with programmatic success criteria
# Use cases: Math reasoning, code generation, structured problems
User: "Train a math reasoning model using GRPO on openai/gsm8k
based on Qwen3-0.6B."
# Model generates responses and receives rewards
# Learning from verifiable outcomes
# Particularly effective for reasoning tasks
# The agent configures:
# - Reward function based on answer correctness
# - Multiple response sampling
# - Relative ranking within groups

When to Use Each Method
| Method | Best For | Dataset Requirements |
|---|---|---|
| SFT | Teaching specific behaviors, domain adaptation | messages column with conversations |
| DPO | Alignment, preference learning, safety | chosen and rejected columns |
| GRPO | Math, code, verifiable reasoning tasks | Tasks with programmatic success criteria |
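For GRPO in particular, the reward function is the piece the agent must synthesize. A minimal sketch of such a setup with TRL's GRPOTrainer; the exact-match reward and column handling here are illustrative assumptions, not the agent's actual generated code:

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# gsm8k ships "question"/"answer" columns; GRPOTrainer expects a "prompt" column
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.rename_column("question", "prompt")

def correctness_reward(completions, answer, **kwargs):
    # 1.0 if the completion contains the reference answer (text after "####"), else 0.0
    return [1.0 if a.split("####")[-1].strip() in c else 0.0
            for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="grpo-math", num_generations=8),
    train_dataset=dataset,
)
trainer.train()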
Multi-Stage Pipelines
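Agents can chain these methods in a single request, as in the SFT → DPO → GGUF prompt above. A compressed sketch of the underlying TRL calls (dataset names are the placeholders from earlier examples; details like tokenizer handling are elided):

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

# Stage 1: SFT on demonstration conversations
sft = SFTTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=load_dataset("my-org/support-conversations", split="train"),
    args=SFTConfig(output_dir="stage1-sft"),
)
sft.train()
sft.save_model("stage1-sft")

# Stage 2: DPO on preference pairs, starting from the SFT checkpoint
dpo = DPOTrainer(
    model="stage1-sft",
    train_dataset=load_dataset("my-org/preference-data", split="train"),
    args=DPOConfig(output_dir="stage2-dpo"),
)
dpo.train()
# Stage 3 (GGUF conversion) is shown in the Unsloth example below.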
Running with Local LLMs
For private data or to avoid API costs, connect coding agents to local LLMs via llama.cpp's OpenAI-compatible API:
# Build llama.cpp with GPU support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON  # or -DGGML_METAL=ON for Mac
cmake --build build --config Release
# Download quantized model (e.g., GLM-4.7-Flash 30B MoE)
pip install huggingface_hub
python -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
repo_id='unsloth/GLM-4.7-Flash-GGUF',
filename='GLM-4.7-Flash-UD-Q4_K_XL.gguf',
local_dir='./models'
)
"
# Start server with OpenAI-compatible API
./build/bin/llama-server \
-m ./models/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
--port 8000 \
--host 0.0.0.0 \
-c 32768 \
--temp 1.0 \
--top-p 0.95 \
  --jinja  # Enable tool calling support

# Point Claude Code to local server
export ANTHROPIC_BASE_URL="http://localhost:8000"
# Run with local model
claude --model unsloth/GLM-4.7-Flash
# For unrestricted execution (use with caution)
claude --model unsloth/GLM-4.7-Flash --dangerously-skip-permissions

# ~/.codex/config.toml
[model_providers.llama_cpp]
name = "llama.cpp"
base_url = "http://localhost:8000/v1"
wire_api = "chat"
# Run Codex with local model
codex --model unsloth/GLM-4.7-Flash -c model_provider=llama_cpp

Model Requirements
To drive a coding agent, a local model needs reliable tool calling (enabled via --jinja in llama-server above) and a long context window; the 32k context configured above is a practical minimum for agent workflows.
Recommended Models
Quantized GGUF builds of strong tool-calling open models work well; the GLM-4.7-Flash quant used above is one example, with larger quants preferable when VRAM allows.
Fine-Tuning Frameworks
Agents can drive various fine-tuning frameworks. Here are the most popular options:
# config.yaml - Axolotl configuration
base_model: Qwen/Qwen2.5-7B-Instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
# Dataset
datasets:
- path: my-org/training-data
type: chat_template
chat_template: chatml
# LoRA configuration
adapter: lora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true
# Training parameters
sequence_len: 4096
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 2e-4
lr_scheduler: cosine
warmup_ratio: 0.1
# Optimization
bf16: true
flash_attention: true
gradient_checkpointing: true
# Output
output_dir: ./outputs/qwen-finetuned
hub_model_id: username/qwen-finetuned
push_to_hub: true
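With the configuration saved as config.yaml, launching is a single command; a sketch assuming Axolotl's documented CLI entry points:

# Launch training from the YAML config
axolotl train config.yaml
# Older installs: accelerate launch -m axolotl.cli.train config.yaml

# LLaMA-Factory - zero-code fine-tuning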
# Install
pip install llamafactory
# Launch web UI for no-code training
llamafactory-cli webui
# Or use CLI with YAML config
llamafactory-cli train \
--model_name_or_path Qwen/Qwen2.5-7B-Instruct \
--dataset my_dataset \
--finetuning_type lora \
--lora_rank 32 \
--output_dir ./outputs \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 4 \
--num_train_epochs 3 \
--learning_rate 2e-4 \
--bf16 true
# Export to GGUF for local deployment
llamafactory-cli export \
--model_name_or_path ./outputs \
--export_quantization_bit 4 \
  --export_dir ./gguf-output

from unsloth import FastLanguageModel
import torch
# Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
max_seq_length=4096,
dtype=None, # Auto-detect
load_in_4bit=True,
)
# Add LoRA adapters (2x faster than standard)
model = FastLanguageModel.get_peft_model(
model,
r=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=64,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth", # 30% less VRAM
random_state=42,
)
# Train with HuggingFace Trainer
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=100,
num_train_epochs=3,
learning_rate=2e-4,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
output_dir="outputs",
),
)
trainer.train()
# Save and convert to GGUF
model.save_pretrained_gguf(
"outputs-gguf",
tokenizer,
quantization_method="q4_k_m"
)

Framework Comparison
| Framework | Best For | Key Features |
|---|---|---|
| Axolotl | Multi-GPU, production workloads | YAML config, DeepSpeed, FSDP, extensive model support |
| LLaMA-Factory | Beginners, quick experiments | Web UI, zero-code option, 100+ models supported |
| Unsloth | Speed and efficiency | 2x faster training, 30% less VRAM, native GGUF export |
| HF TRL | Maximum flexibility | Official HF library, RLHF support, research-grade |
LoRA and QLoRA
Parameter-efficient fine-tuning methods that make training large models feasible on consumer hardware:
LoRA (Low-Rank Adaptation)
Trains small adapter layers instead of full model weights
- Typically r=32, alpha=64
- ~1% of original parameters
- Preserves base model quality
- Multiple adapters per base model
QLoRA (Quantized LoRA)
Combines LoRA with 4-bit quantization
- 70B model on single 24GB GPU
- 4-bit NormalFloat quantization
- Double quantization for memory
- Paged optimizers for spikes
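For orientation, here is roughly what the QLoRA recipe above looks like with the transformers + peft + bitsandbytes stack; a minimal sketch, with hyperparameters taken from the cards above and the model name as an example:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NormalFloat quantization with double quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters: r=32, alpha=64 as above (~1% of original parameters)
lora_config = LoraConfig(
    r=32, lora_alpha=64, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()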
When to Use Which
Use plain LoRA when the base model fits in GPU memory at 16-bit precision; switch to QLoRA when it does not (e.g., 70B-class models on a single 24GB card), accepting slightly slower steps in exchange for the memory savings.
Skill Transfer with upskill
An alternative to weight-based fine-tuning: transfer expertise from expensive models to cheaper ones via structured context. Hugging Face's upskill tool automates this "Robin Hood" approach—teaching open models skills that frontier models have mastered.
# Install upskill tool
pip install upskill
# Or use uvx for one-off runs
uvx upskill --help
# Set API keys
export ANTHROPIC_API_KEY=sk-ant-...
export HF_TOKEN=hf_...
# Generate skill from task description (uses Opus as teacher)
upskill generate "build optimized CUDA kernels for PyTorch"
# Generate from existing execution trace
upskill generate "write kernels" --from ./claude-trace.md
# Improve existing skill
upskill generate "add more error handling" --from ./skills/kernel-builder/
# Evaluate skill on different models
upskill eval ./skills/kernel-builder/ --model haiku --model sonnet --runs 5
# Evaluate with local model via llama.cpp
llama-server -hf unsloth/GLM-4.7-Flash-GGUF:Q4_K_M
upskill eval ./skills/my-skill/ \
--model "unsloth/GLM-4.7-Flash-GGUF:Q4_0" \
  --base-url http://localhost:8080/v1

# Agent Skills follow the agentskills.io specification
# A skill is a directory with structured context
./skills/kernel-builder-cuda/
├── SKILL.md          # Main instructions (~500 tokens)
└── skill_meta.json   # Metadata and test cases
# SKILL.md contains domain expertise:
# - Project structure and conventions
# - Configuration file formats
# - API usage patterns
# - Common pitfalls and solutions
# - Best practices
# skill_meta.json defines evaluation:
{
"name": "h100-cuda-kernel-builder",
"version": "1.0.0",
"cases": [
{
"input": "Create a build.toml for CUDA targeting H100",
"expected": {"contains": "9.0"} // compute capability
},
{
"input": "Write a basic CUDA kernel template",
"expected": {"contains": "cuda_runtime.h"}
}
]
}

# The "Robin Hood" approach to skill transfer:
# 1. Expensive teacher model solves the problem
# 2. Export the solution as a skill
# 3. Cheaper student models use the skill
# Step 1: Teacher (Opus) builds solution interactively
User: "Help me write CUDA kernels for diffusers library"
Claude Opus: [solves problem, exports trace]
# Step 2: Convert execution trace to skill
upskill generate "write CUDA kernels" --from ./opus-trace.md
# Step 3: Evaluate skill lift on student models
upskill eval ./skills/cuda-kernels/ --model haiku --model kimi
# Results show improvement per model:
# Model | Without Skill | With Skill | Lift
# ------------|---------------|------------|------
# Haiku | 45% | 82% | +37%
# GLM-4.7 | 40% | 85% | +45%
# Kimi-K2 | 60% | 95% | +35%
# Opus 4.5 | 90% | 88% | -2% (skip!)
# Key insight: Skills don't work universally
# Always measure both accuracy AND token usage

Skills vs Fine-Tuning
| Approach | Modifies | Best For | Trade-offs |
|---|---|---|---|
| Fine-Tuning | Model weights | Consistent style, domain-specific knowledge, production | GPU cost, training time, potential catastrophic forgetting |
| Skill Transfer | Context only | Specialized tasks, quick iteration, model comparison | Token overhead, skill maintenance, varies by model |
When to Use Skills
Reach for skills when you need quick iteration on a specialized task, want to compare models, or cannot justify GPU training costs; reach for fine-tuning when behavior must be baked into the weights for consistent production use.
Measure Per Model
Skill lift varies widely across models, and can even be negative (see the Opus row above), so run upskill eval to measure both accuracy and token efficiency for every target model before deploying a skill.
End-to-End Workflow
Here's a complete agent-assisted fine-tuning workflow from dataset validation to local deployment:
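The step-1 check below is genuinely cheap: on the agent's side it amounts to a few CPU-only datasets calls, roughly like this sketch (repo name taken from the example):

# CPU-only pre-flight: check required columns before paying for GPU time
from datasets import load_dataset

ds = load_dataset("my-org/conversation-data", split="train")
cols = set(ds.column_names)
print("SFT ready:", "messages" in cols)
print("DPO ready:", {"chosen", "rejected"} <= cols)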
# Complete agent-assisted fine-tuning workflow
# Step 1: Dataset Validation (runs on CPU - no cost)
User: "Check if my-org/conversation-data works for SFT training."
Agent Response:
Dataset validation for my-org/conversation-data:
SFT: READY
Found 'messages' column with conversation format
DPO: INCOMPATIBLE
Missing 'chosen' and 'rejected' columns
# Step 2: Test Run (quick validation)
User: "Do a quick test run to SFT Qwen-0.6B with 100 examples."
Agent Response:
Configuration:
- Hardware: t4-small (~$0.75/hour)
- Estimated time: ~5 minutes
- Estimated cost: ~$0.10
Submitting test job...
Job ID: test-abc123
Monitor: https://huggingface.co/jobs/username/test-abc123
# Step 3: Production Training
User: "Looks good! Run full training with 3 epochs."
Agent Response:
Configuration:
- Hardware: t4-medium (~$1.50/hour)
- Estimated time: ~2 hours
- Estimated cost: ~$3.00
- Checkpoints: Every 500 steps
Job submitted: prod-xyz789
# Step 4: Monitor Progress
User: "How's my training job doing?"
Agent Response:
Job: prod-xyz789
Status: Running (step 1500/3000)
Training loss: 1.23 -> 0.45
Learning rate: 1.8e-4
ETA: ~45 minutes
# Step 5: Deploy Locally
User: "Training complete! Convert to GGUF Q4_K_M and push to Hub."
Agent Response:
Converting to GGUF...
Quantization: Q4_K_M (best size/quality balance)
Pushed to: username/qwen-finetuned-gguf
Run locally:
  llama-server -hf username/qwen-finetuned-gguf:Q4_K_M

Compatible Coding Agents
| Agent | HF Skills Support | Local LLM Support | Notes |
|---|---|---|---|
| Claude Code | Yes (plugin) | Yes (ANTHROPIC_BASE_URL) | Most capable; reportedly ~90% of its own code is written by Claude |
| OpenAI Codex | Yes (instructions) | Yes (config.toml) | Good for OpenAI ecosystem users |
| Gemini CLI | Yes (extensions) | Limited | Google Cloud integration |
| Aider | Manual | Yes | Git-focused, good for code changes |
| Anon Kode | Manual | Yes (native) | LLM-agnostic Claude Code fork |
Best Practices
1. Start Small
Run test jobs with 100 examples before full training. Validate dataset format on CPU first. This catches issues before incurring GPU costs.
2. Use Checkpoints
Save checkpoints every 500 steps for long runs. This allows recovery from failures and enables evaluation at different training stages.
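In transformers/TRL terms this is a single line of configuration, for example:

from transformers import TrainingArguments

# Save a checkpoint every 500 steps; cap retained checkpoints to bound disk use
args = TrainingArguments(output_dir="outputs", save_steps=500, save_total_limit=3)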
3. Monitor Loss Curves
Watch training loss via Trackio or W&B. Flat loss means the model isn't learning; spiking loss indicates issues with learning rate or data.
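Trackio's logging API mirrors wandb's, so a custom loop can log curves in a few lines; a sketch, assuming Trackio's wandb-style interface (with Trainer-based runs you would instead point report_to at your tracker in the training arguments):

import trackio

trackio.init(project="qwen-finetune")
for step, loss in enumerate([1.23, 0.98, 0.45]):  # placeholder loss values
    trackio.log({"train/loss": loss})
trackio.finish()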
4. Version Your Data
Push datasets to Hugging Face Hub with version tags. This ensures reproducibility and makes it easy to iterate on data quality.
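A minimal sketch with the datasets and huggingface_hub APIs (repo name reused from the earlier examples):

from datasets import load_dataset
from huggingface_hub import create_tag

ds = load_dataset("json", data_files="train.jsonl", split="train")
ds.push_to_hub("my-org/support-conversations")
# Tag the uploaded revision so training jobs can pin an exact data version
create_tag("my-org/support-conversations", tag="v1.0", repo_type="dataset")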
5. Test Before Deploy
Run the fine-tuned model through your evaluation suite before production. Use the agent to help write and run evaluation scripts.
Resources
upskill: Skill Transfer for LLMs - Teaching open models with Agent Skills
upskill Repository - CLI tool for skill generation and evaluation
Agent Skills Specification - Standard format for portable skills
Hugging Face Skills for Training - Official guide to agent-assisted fine-tuning
Unsloth + Claude Code Guide - Local LLM fine-tuning setup
Axolotl Framework - Production-grade fine-tuning
LLaMA-Factory - Zero-code fine-tuning with web UI
Unsloth - 2x faster fine-tuning