
Model Fine-tuning Guide

When and How to Fine-tune vs. Prompt Engineering

Document ID: C9-MODEL-FINETUNING
Version: 1.0
Category: Technical Deep Dive


Decision Framework

When to Fine-tune vs. Prompt Engineer

START: Do you need custom model behavior?

├─► Task can be described in a prompt? ──► PROMPT ENGINEERING
│     (Most cases)
│
├─► Need consistent output format? ──► STRUCTURED OUTPUT + PROMPTING
│     (JSON mode, function calling)
│
├─► Domain-specific terminology? ──► RAG + PROMPTING first
│     (Technical jargon; then consider fine-tuning)
│
├─► Specific writing style? ──► FEW-SHOT PROMPTING first
│     (Brand voice; then consider fine-tuning)
│
├─► High-volume, latency-sensitive? ──► CONSIDER FINE-TUNING
│     (Cost optimization: a smaller fine-tuned model can beat a larger base model)
│
└─► Truly novel task? ──► FINE-TUNING likely needed
      (No existing capability)

Cost-Benefit Matrix

| Approach | Development Cost | Per-Query Cost | Quality Ceiling | Iteration Speed |
|---|---|---|---|---|
| Prompt Engineering | Low | Higher (more tokens) | High | Fast |
| Few-Shot Learning | Low-Medium | Higher | High | Fast |
| RAG | Medium | Medium | Very High | Medium |
| Fine-tuning (SFT) | High | Lower | High | Slow |
| RLHF/DPO | Very High | Lower | Highest | Very Slow |

Prompt Engineering Techniques

Maximizing Base Model Performance

1. System Prompt Optimization

# Instead of fine-tuning for consistent format:

OPTIMIZED_SYSTEM_PROMPT = """
You are a customer service agent for TechCorp.

RESPONSE FORMAT (ALWAYS follow exactly):
1. Greeting: One sentence acknowledging the customer
2. Understanding: Restate the issue in your own words
3. Solution: Provide clear steps or information
4. Next Steps: One specific action for the customer
5. Closing: Professional sign-off

TONE: Professional, empathetic, concise
CONSTRAINTS:
- Never promise things outside policy
- Always offer escalation option for complaints
- Maximum response length: 200 words
"""

# This often achieves 90%+ of fine-tuning quality for style/format
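To show how a prompt like this is wired into a request, the messages list can be built once and reused for every query. The `build_messages` helper below is a minimal illustrative sketch (it is not part of any SDK), and only message construction is shown, with no network call:

```python
# Minimal sketch: pair the optimized system prompt with each user query.
# build_messages is a hypothetical helper, not part of any SDK.
OPTIMIZED_SYSTEM_PROMPT = "You are a customer service agent for TechCorp. ..."  # full prompt above

def build_messages(user_text: str) -> list:
    """Build the chat messages for one customer query."""
    return [
        {"role": "system", "content": OPTIMIZED_SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

msgs = build_messages("My order hasn't arrived yet.")
```

Because the instructions live in the system prompt, the per-query payload stays small and the format rules apply uniformly to every request.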

2. Few-Shot Examples

FEW_SHOT_PROMPT = """
Convert customer feedback to structured format.

Example 1:
Input: "The product arrived late and was damaged. Very disappointed!"
Output: {
    "sentiment": "negative",
    "issues": ["delivery_delay", "product_damage"],
    "priority": "high",
    "suggested_action": "replacement_expedited"
}

Example 2:
Input: "Love the new features! Works great with my setup."
Output: {
    "sentiment": "positive",
    "issues": [],
    "priority": "low",
    "suggested_action": "send_thank_you"
}

Now process:
Input: "{customer_feedback}"
Output:
"""

3. Chain-of-Thought for Complex Reasoning

COT_PROMPT = """
Analyze this contract clause for risks.

Think step by step:
1. First, identify the type of clause
2. List the obligations it creates
3. Identify any ambiguous language
4. Compare to standard market terms
5. Assess risk level with justification

Clause: {clause_text}

Analysis:
"""

When Fine-tuning Makes Sense

Scenario 1: High-Volume Cost Optimization

Before Fine-tuning:
- Model: GPT-4
- Avg prompt: 2,000 tokens (including instructions)
- Avg completion: 500 tokens
- Queries/day: 100,000
- Daily cost: ~$3,500

After Fine-tuning (on GPT-3.5):
- Model: Fine-tuned GPT-3.5
- Avg prompt: 500 tokens (instructions baked in)
- Avg completion: 400 tokens
- Same quality for this specific task
- Daily cost: ~$180

ROI: Fine-tuning cost recovered in 2 days
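The break-even arithmetic behind that ROI claim can be sketched as follows. The daily costs come from the scenario above; the one-time fine-tuning cost is an assumed illustrative figure, since actual training cost depends on dataset size and epochs:

```python
# Break-even estimate for the scenario above. Daily costs are the scenario's
# figures; the one-time fine-tuning cost is a hypothetical assumption.
def breakeven_days(daily_cost_before: float, daily_cost_after: float,
                   finetune_cost: float) -> float:
    """Days until the one-time fine-tuning cost is paid back by savings."""
    daily_savings = daily_cost_before - daily_cost_after
    return finetune_cost / daily_savings

# With the scenario's $3,500 vs. $180 daily costs and an assumed $6,640
# one-time training cost, payback takes 2 days:
print(breakeven_days(3500, 180, 6640))  # -> 2.0
```

At this query volume the savings dominate quickly; at low volume, the same arithmetic often favors staying with prompt engineering.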

Scenario 2: Consistent Specialized Format

# Training data format for specialized extraction

TRAINING_EXAMPLE = {
    "messages": [
        {
            "role": "system",
            "content": "Extract medical entities from clinical notes."
        },
        {
            "role": "user",
            "content": "Patient presents with persistent cough for 2 weeks, mild fever. History of asthma. Currently taking albuterol."
        },
        {
            "role": "assistant",
            "content": """{
                "symptoms": [
                    {"term": "persistent cough", "duration": "2 weeks"},
                    {"term": "mild fever", "duration": null}
                ],
                "conditions": [
                    {"term": "asthma", "status": "history"}
                ],
                "medications": [
                    {"term": "albuterol", "status": "current"}
                ]
            }"""
        }
    ]
}

Fine-tuning Implementation

Data Preparation

import json
from typing import Dict, Optional

class FineTuningDataPreparer:
    def __init__(self, task_description: str):
        self.task_description = task_description
        self.examples = []

    def add_example(self, input_text: str, output_text: str,
                    system_prompt: Optional[str] = None):
        """Add a training example"""
        messages = []

        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})

        messages.append({"role": "user", "content": input_text})
        messages.append({"role": "assistant", "content": output_text})

        self.examples.append({"messages": messages})

    def validate_examples(self) -> Dict:
        """Validate training data quality"""
        issues = []

        # Check minimum count
        if len(self.examples) < 100:
            issues.append(f"Only {len(self.examples)} examples. Recommend 100+")

        # Check for diversity; the user turn is always second-to-last,
        # whether or not a system message is present
        unique_inputs = len(set(e['messages'][-2]['content'] for e in self.examples))
        if unique_inputs < len(self.examples) * 0.9:
            issues.append("Possible duplicate inputs detected")

        # Check token lengths (rough estimate: word count x 1.3)
        for i, ex in enumerate(self.examples):
            total_tokens = sum(len(m['content'].split()) * 1.3
                               for m in ex['messages'])
            if total_tokens > 4096:
                issues.append(f"Example {i} may exceed token limit")

        return {
            'total_examples': len(self.examples),
            'issues': issues,
            'ready': len(issues) == 0
        }

    def export_jsonl(self, output_path: str):
        """Export to JSONL format for fine-tuning"""
        with open(output_path, 'w') as f:
            for example in self.examples:
                f.write(json.dumps(example) + '\n')
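After exporting, it is worth re-reading the JSONL file before uploading it, since a single malformed line will fail the whole training job. Below is a minimal standalone sketch of such a sanity check (the `check_jsonl` helper is illustrative, not part of any SDK):

```python
# Minimal post-export sanity check: re-read the JSONL file and confirm every
# line parses as JSON and ends with an assistant turn.
import json

def check_jsonl(path: str) -> int:
    """Return the number of valid examples; raise on malformed lines."""
    count = 0
    with open(path) as f:
        for line_no, line in enumerate(f, 1):
            record = json.loads(line)  # raises ValueError on invalid JSON
            roles = [m["role"] for m in record["messages"]]
            assert roles[-1] == "assistant", f"line {line_no}: no assistant turn"
            count += 1
    return count
```

Running this on the exported file catches truncated writes and missing assistant responses before any upload cost is incurred.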

Fine-tuning with OpenAI

from openai import OpenAI

client = OpenAI()

def fine_tune_model(training_file_path: str, model_base: str = "gpt-3.5-turbo"):
    """Fine-tune a model using the OpenAI API"""

    # Upload training file
    with open(training_file_path, 'rb') as f:
        file_response = client.files.create(
            file=f,
            purpose='fine-tune'
        )

    # Create fine-tuning job
    job = client.fine_tuning.jobs.create(
        training_file=file_response.id,
        model=model_base,
        hyperparameters={
            "n_epochs": 3,
            "batch_size": 4,
            "learning_rate_multiplier": 1.0
        }
    )

    print(f"Fine-tuning job created: {job.id}")
    return job

import time

def monitor_fine_tuning(job_id: str):
    """Poll fine-tuning progress until the job finishes"""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        print(f"Status: {job.status}")

        if job.status == 'succeeded':
            print(f"Fine-tuned model: {job.fine_tuned_model}")
            return job.fine_tuned_model
        elif job.status == 'failed':
            raise Exception(f"Fine-tuning failed: {job.error}")

        time.sleep(60)

Fine-tuning with Open Source (LoRA)

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

def fine_tune_with_lora(
    base_model: str,
    training_data_path: str,
    output_dir: str
):
    """Fine-tune using LoRA (Parameter-Efficient Fine-Tuning)"""

    # Load base model in 4-bit quantization
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        load_in_4bit=True,
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(base_model)

    # Prepare the quantized model for training
    model = prepare_model_for_kbit_training(model)

    # LoRA configuration
    lora_config = LoraConfig(
        r=16,                # Rank of the low-rank update matrices
        lora_alpha=32,       # Scaling factor
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )

    model = get_peft_model(model, lora_config)

    # Training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        logging_steps=10,
        save_steps=100,
        fp16=True
    )

    # Load training data
    dataset = load_dataset('json', data_files=training_data_path)

    # Train
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset['train'],
        args=training_args,
        tokenizer=tokenizer,
        max_seq_length=2048
    )

    trainer.train()

    # Save only the LoRA adapter weights
    model.save_pretrained(output_dir)

    return output_dir
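To see why LoRA is parameter-efficient, a back-of-envelope count helps: each adapted weight matrix W of shape (d_out, d_in) gains two low-rank factors, A of shape (r, d_in) and B of shape (d_out, r), so only r·(d_in + d_out) parameters train per matrix. The dimensions below assume a 7B-class model (hidden size 4096, 32 layers, all four attention projections square); adjust for your base model:

```python
# Back-of-envelope count of LoRA trainable parameters. Each adapted matrix
# W (d_out x d_in) gains factors A (r x d_in) and B (d_out x r), i.e.
# r * (d_in + d_out) trainable parameters.
def lora_params(r: int, shapes: list, n_layers: int) -> int:
    """Total trainable LoRA parameters across all layers."""
    per_layer = sum(r * (d_in + d_out) for d_out, d_in in shapes)
    return per_layer * n_layers

# Assuming q_proj, k_proj, v_proj, o_proj are each 4096x4096 over 32 layers:
print(lora_params(16, [(4096, 4096)] * 4, 32))  # -> 16777216 (~16.8M)
```

Roughly 17M trainable parameters against 7B frozen ones (about 0.24%) is what makes single-GPU fine-tuning feasible here.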

Evaluation Framework

Comparing Base vs. Fine-tuned

class ModelComparator:
    def __init__(self, base_model, fine_tuned_model, test_dataset):
        self.base = base_model
        self.fine_tuned = fine_tuned_model
        self.test_data = test_dataset

    def evaluate(self) -> Dict:
        """Compare models on the test dataset.

        Assumes each model exposes generate(), and that evaluate_output()
        and a count_tokens() helper are defined elsewhere.
        """
        results = {
            'base': {'correct': 0, 'total': 0, 'avg_tokens': 0},
            'fine_tuned': {'correct': 0, 'total': 0, 'avg_tokens': 0}
        }

        for example in self.test_data:
            # Base model
            base_output = self.base.generate(example['input'])
            base_correct = self.evaluate_output(base_output, example['expected'])
            results['base']['correct'] += base_correct
            results['base']['total'] += 1
            results['base']['avg_tokens'] += count_tokens(base_output)

            # Fine-tuned model
            ft_output = self.fine_tuned.generate(example['input'])
            ft_correct = self.evaluate_output(ft_output, example['expected'])
            results['fine_tuned']['correct'] += ft_correct
            results['fine_tuned']['total'] += 1
            results['fine_tuned']['avg_tokens'] += count_tokens(ft_output)

        # Calculate metrics
        for model in results:
            results[model]['accuracy'] = results[model]['correct'] / results[model]['total']
            results[model]['avg_tokens'] /= results[model]['total']

        return results

Quick Reference

| Need | First Try | Then Consider |
|---|---|---|
| Consistent format | Structured output + prompt | Fine-tune if >10K queries/day |
| Domain language | RAG with domain docs | Fine-tune if RAG insufficient |
| Writing style | Few-shot examples | Fine-tune if style critical |
| Cost reduction | Prompt optimization | Fine-tune smaller model |
| Latency reduction | Caching, smaller model | Fine-tune for shorter prompts |
| Novel capability | Prompt engineering | Fine-tune + evaluation |

Document maintained by CODITECT ML Team