Model Fine-tuning Guide
When and How to Fine-tune vs. Prompt Engineering
Document ID: C9-MODEL-FINETUNING
Version: 1.0
Category: Technical Deep Dive
Decision Framework
When to Fine-tune vs. Prompt Engineer
START: Do you need custom model behavior?
│
├─► Task can be described in prompt? ──► PROMPT ENGINEERING
│ (Most cases)
│
├─► Need consistent output format? ──► STRUCTURED OUTPUT + PROMPTING
│ (JSON mode, function calling)
│
├─► Domain-specific terminology? ──► RAG + PROMPTING first
│ (Technical jargon) Then consider fine-tuning
│
├─► Specific writing style? ──► FEW-SHOT PROMPTING first
│ (Brand voice) Then consider fine-tuning
│
├─► High-volume, latency-sensitive? ──► CONSIDER FINE-TUNING
│ (Cost optimization) (Smaller fine-tuned > larger base)
│
└─► Truly novel task? ──► FINE-TUNING likely needed
(No existing capability)
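The flowchart above can be reduced to a small lookup; a minimal sketch (the branch keys are illustrative labels for the tree's categories, not names from any library):

```python
def recommend_approach(need: str) -> str:
    """Map a customization need to the first approach worth trying."""
    recommendations = {
        "describable_in_prompt": "prompt engineering",
        "consistent_format": "structured output + prompting",
        "domain_terminology": "RAG + prompting, then consider fine-tuning",
        "writing_style": "few-shot prompting, then consider fine-tuning",
        "high_volume_latency_sensitive": "consider fine-tuning a smaller model",
        "novel_task": "fine-tuning likely needed",
    }
    # Default to prompt engineering: it covers most cases and is cheapest to try.
    return recommendations.get(need, "prompt engineering")
```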
Cost-Benefit Matrix
| Approach | Development Cost | Per-Query Cost | Quality Ceiling | Iteration Speed |
|---|---|---|---|---|
| Prompt Engineering | Low | Higher (more tokens) | High | Fast |
| Few-Shot Learning | Low-Medium | Higher | High | Fast |
| RAG | Medium | Medium | Very High | Medium |
| Fine-tuning (SFT) | High | Lower | High | Slow |
| RLHF/DPO | Very High | Lower | Highest | Very Slow |
Prompt Engineering Techniques
Maximizing Base Model Performance
1. System Prompt Optimization
# Instead of fine-tuning for consistent format:
OPTIMIZED_SYSTEM_PROMPT = """
You are a customer service agent for TechCorp.
RESPONSE FORMAT (ALWAYS follow exactly):
1. Greeting: One sentence acknowledging the customer
2. Understanding: Restate the issue in your own words
3. Solution: Provide clear steps or information
4. Next Steps: One specific action for the customer
5. Closing: Professional sign-off
TONE: Professional, empathetic, concise
CONSTRAINTS:
- Never promise things outside policy
- Always offer escalation option for complaints
- Maximum response length: 200 words
"""
# This often achieves 90%+ of fine-tuning quality for style/format
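Wiring a system prompt like this into a chat call is a thin wrapper; a sketch (the API call is shown commented out, and the model name is an illustrative assumption):

```python
def build_messages(system_prompt: str, user_message: str) -> list:
    """Assemble the messages array a chat-completions API expects."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]

# With the OpenAI SDK this would be used roughly as:
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(
#     model="gpt-4o-mini",  # illustrative model name
#     messages=build_messages(OPTIMIZED_SYSTEM_PROMPT, "My order hasn't arrived."),
# )
```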
2. Few-Shot Examples
FEW_SHOT_PROMPT = """
Convert customer feedback to structured format.
Example 1:
Input: "The product arrived late and was damaged. Very disappointed!"
Output: {
"sentiment": "negative",
"issues": ["delivery_delay", "product_damage"],
"priority": "high",
"suggested_action": "replacement_expedited"
}
Example 2:
Input: "Love the new features! Works great with my setup."
Output: {
"sentiment": "positive",
"issues": [],
"priority": "low",
"suggested_action": "send_thank_you"
}
Now process:
Input: "{customer_feedback}"
Output:
"""
3. Chain-of-Thought for Complex Reasoning
COT_PROMPT = """
Analyze this contract clause for risks.
Think step by step:
1. First, identify the type of clause
2. List the obligations it creates
3. Identify any ambiguous language
4. Compare to standard market terms
5. Assess risk level with justification
Clause: {clause_text}
Analysis:
"""
When Fine-tuning Makes Sense
Scenario 1: High-Volume Cost Optimization
Before Fine-tuning:
- Model: GPT-4
- Avg prompt: 2,000 tokens (including instructions)
- Avg completion: 500 tokens
- Queries/day: 100,000
- Daily cost: ~$3,500
After Fine-tuning (on GPT-3.5):
- Model: Fine-tuned GPT-3.5
- Avg prompt: 500 tokens (instructions baked in)
- Avg completion: 400 tokens
- Same quality for this specific task
- Daily cost: ~$180
ROI: Fine-tuning cost recovered in 2 days
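The arithmetic behind these estimates is easy to reproduce; a sketch using illustrative per-1K-token prices (provider pricing changes over time, so check current rates before relying on any figure):

```python
def daily_cost(queries_per_day: int, prompt_tokens: int, completion_tokens: int,
               input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Total daily spend given average token counts and per-1K-token prices."""
    per_query = (prompt_tokens / 1000) * input_price_per_1k \
              + (completion_tokens / 1000) * output_price_per_1k
    return queries_per_day * per_query

def break_even_days(training_cost: float, base_daily: float, tuned_daily: float) -> float:
    """Days until the one-time fine-tuning cost is recovered by daily savings."""
    savings = base_daily - tuned_daily
    return float("inf") if savings <= 0 else training_cost / savings

# Illustrative prices, not current ones:
before = daily_cost(100_000, 2_000, 500, 0.01, 0.03)  # ≈ $3,500/day
```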
Scenario 2: Consistent Specialized Format
# Training data format for specialized extraction
TRAINING_EXAMPLE = {
"messages": [
{
"role": "system",
"content": "Extract medical entities from clinical notes."
},
{
"role": "user",
"content": "Patient presents with persistent cough for 2 weeks, mild fever. History of asthma. Currently taking albuterol."
},
{
"role": "assistant",
"content": """{
"symptoms": [
{"term": "persistent cough", "duration": "2 weeks"},
{"term": "mild fever", "duration": null}
],
"conditions": [
{"term": "asthma", "status": "history"}
],
"medications": [
{"term": "albuterol", "status": "current"}
]
}"""
}
]
}
Fine-tuning Implementation
Data Preparation
import json
from typing import List, Dict
class FineTuningDataPreparer:
def __init__(self, task_description: str):
self.task_description = task_description
self.examples = []
def add_example(self, input_text: str, output_text: str,
system_prompt: str = None):
"""Add a training example"""
messages = []
if system_prompt:
messages.append({"role": "system", "content": system_prompt})
messages.append({"role": "user", "content": input_text})
messages.append({"role": "assistant", "content": output_text})
self.examples.append({"messages": messages})
    def validate_examples(self) -> Dict:
        """Validate training data quality"""
        issues = []
        # Check minimum count
        if len(self.examples) < 100:
            issues.append(f"Only {len(self.examples)} examples. Recommend 100+")
        # Check for diversity; find the user turn by role, since it sits at
        # index 1 only when a system prompt is present
        unique_inputs = len(set(
            m['content']
            for e in self.examples
            for m in e['messages'] if m['role'] == 'user'
        ))
        if unique_inputs < len(self.examples) * 0.9:
            issues.append("Possible duplicate inputs detected")
        # Check token lengths (rough heuristic: ~1.3 tokens per whitespace word)
        for i, ex in enumerate(self.examples):
            total_tokens = sum(len(m['content'].split()) * 1.3
                               for m in ex['messages'])
            if total_tokens > 4096:
                issues.append(f"Example {i} may exceed token limit")
return {
'total_examples': len(self.examples),
'issues': issues,
'ready': len(issues) == 0
}
def export_jsonl(self, output_path: str):
"""Export to JSONL format for fine-tuning"""
with open(output_path, 'w') as f:
for example in self.examples:
f.write(json.dumps(example) + '\n')
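A quick sanity check on the exported file, independent of the preparer class (the expected record shape follows the chat fine-tuning format shown earlier):

```python
import io
import json

def check_jsonl(stream) -> int:
    """Count valid records; each line must be a JSON object whose
    'messages' list ends with the assistant turn (the training target)."""
    count = 0
    for line in stream:
        record = json.loads(line)
        messages = record["messages"]
        if messages[-1]["role"] != "assistant":
            raise ValueError(f"line {count}: last turn must be the assistant output")
        count += 1
    return count

# In-memory stand-in for an exported JSONL file:
sample = io.StringIO(
    '{"messages": [{"role": "user", "content": "hi"}, '
    '{"role": "assistant", "content": "hello"}]}\n'
)
n_valid = check_jsonl(sample)
```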
Fine-tuning with OpenAI
import time

from openai import OpenAI
client = OpenAI()
def fine_tune_model(training_file_path: str, model_base: str = "gpt-3.5-turbo"):
"""Fine-tune a model using OpenAI API"""
# Upload training file
with open(training_file_path, 'rb') as f:
file_response = client.files.create(
file=f,
purpose='fine-tune'
)
# Create fine-tuning job
job = client.fine_tuning.jobs.create(
training_file=file_response.id,
model=model_base,
hyperparameters={
"n_epochs": 3,
"batch_size": 4,
"learning_rate_multiplier": 1.0
}
)
print(f"Fine-tuning job created: {job.id}")
return job
def monitor_fine_tuning(job_id: str):
"""Monitor fine-tuning progress"""
while True:
job = client.fine_tuning.jobs.retrieve(job_id)
print(f"Status: {job.status}")
if job.status == 'succeeded':
print(f"Fine-tuned model: {job.fine_tuned_model}")
return job.fine_tuned_model
elif job.status == 'failed':
raise Exception(f"Fine-tuning failed: {job.error}")
time.sleep(60)
Fine-tuning with Open Source (LoRA)
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
def fine_tune_with_lora(
base_model: str,
training_data_path: str,
output_dir: str
):
"""Fine-tune using LoRA (Parameter Efficient Fine-Tuning)"""
    # Load base model quantized to 4-bit (requires the bitsandbytes package;
    # recent transformers versions prefer passing a BitsAndBytesConfig instead)
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        load_in_4bit=True,
        device_map="auto"
    )
tokenizer = AutoTokenizer.from_pretrained(base_model)
# Prepare for training
model = prepare_model_for_kbit_training(model)
# LoRA configuration
lora_config = LoraConfig(
r=16, # Rank
lora_alpha=32, # Scaling factor
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# Training arguments
training_args = TrainingArguments(
output_dir=output_dir,
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
logging_steps=10,
save_steps=100,
fp16=True
)
# Load training data
from datasets import load_dataset
dataset = load_dataset('json', data_files=training_data_path)
    # Train (SFTTrainer argument names vary across trl versions;
    # check the installed release's signature)
trainer = SFTTrainer(
model=model,
train_dataset=dataset['train'],
args=training_args,
tokenizer=tokenizer,
max_seq_length=2048
)
trainer.train()
# Save adapter
model.save_pretrained(output_dir)
return output_dir
Evaluation Framework
Comparing Base vs. Fine-tuned
class ModelComparator:
def __init__(self, base_model, fine_tuned_model, test_dataset):
self.base = base_model
self.fine_tuned = fine_tuned_model
self.test_data = test_dataset
def evaluate(self) -> Dict:
"""Compare models on test dataset"""
results = {
'base': {'correct': 0, 'total': 0, 'avg_tokens': 0},
'fine_tuned': {'correct': 0, 'total': 0, 'avg_tokens': 0}
}
        for example in self.test_data:
            # Base model (`generate`, `evaluate_output`, and `count_tokens`
            # are task-specific helpers supplied by the surrounding harness)
            base_output = self.base.generate(example['input'])
            base_correct = self.evaluate_output(base_output, example['expected'])
results['base']['correct'] += base_correct
results['base']['total'] += 1
results['base']['avg_tokens'] += count_tokens(base_output)
# Fine-tuned model
ft_output = self.fine_tuned.generate(example['input'])
ft_correct = self.evaluate_output(ft_output, example['expected'])
results['fine_tuned']['correct'] += ft_correct
results['fine_tuned']['total'] += 1
results['fine_tuned']['avg_tokens'] += count_tokens(ft_output)
# Calculate metrics
for model in results:
results[model]['accuracy'] = results[model]['correct'] / results[model]['total']
results[model]['avg_tokens'] /= results[model]['total']
return results
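The final roll-up step can be checked in isolation against hand-computed counts; a minimal sketch mirroring the metric loop above (the sample counts are invented for illustration):

```python
def summarize(results: dict) -> dict:
    """Turn raw correct/total/token counts into accuracy and average tokens."""
    for model in results:
        r = results[model]
        r["accuracy"] = r["correct"] / r["total"]
        r["avg_tokens"] /= r["total"]
    return results

summary = summarize({
    "base":       {"correct": 8, "total": 10, "avg_tokens": 12_000},
    "fine_tuned": {"correct": 9, "total": 10, "avg_tokens": 9_000},
})
```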
Quick Reference
| Need | First Try | Then Consider |
|---|---|---|
| Consistent format | Structured output + prompt | Fine-tune if >10K queries/day |
| Domain language | RAG with domain docs | Fine-tune if RAG insufficient |
| Writing style | Few-shot examples | Fine-tune if style critical |
| Cost reduction | Prompt optimization | Fine-tune smaller model |
| Latency reduction | Caching, smaller model | Fine-tune for shorter prompts |
| Novel capability | Prompt engineering | Fine-tune + evaluation |
Document maintained by CODITECT ML Team