
Model Fine-tuning Guide

When and How to Fine-tune vs. Prompt Engineering

Document ID: C9-MODEL-FINETUNING
Version: 1.0
Category: Technical Deep Dive


Decision Framework

When to Fine-tune vs. Prompt Engineer

START: Do you need custom model behavior?

├─► Task can be described in a prompt? ──► PROMPT ENGINEERING
│     (Most cases)
│
├─► Need consistent output format? ──► STRUCTURED OUTPUT + PROMPTING
│     (JSON mode, function calling)
│
├─► Domain-specific terminology? ──► RAG + PROMPTING first
│     (Technical jargon; then consider fine-tuning)
│
├─► Specific writing style? ──► FEW-SHOT PROMPTING first
│     (Brand voice; then consider fine-tuning)
│
├─► High-volume, latency-sensitive? ──► CONSIDER FINE-TUNING
│     (Cost optimization: a smaller fine-tuned model can beat a larger base model)
│
└─► Truly novel task? ──► FINE-TUNING likely needed
      (No existing capability)

Cost-Benefit Matrix

| Approach | Development Cost | Per-Query Cost | Quality Ceiling | Iteration Speed |
|---|---|---|---|---|
| Prompt Engineering | Low | Higher (more tokens) | High | Fast |
| Few-Shot Learning | Low-Medium | Higher | High | Fast |
| RAG | Medium | Medium | Very High | Medium |
| Fine-tuning (SFT) | High | Lower | High | Slow |
| RLHF/DPO | Very High | Lower | Highest | Very Slow |

Prompt Engineering Techniques

Maximizing Base Model Performance

1. System Prompt Optimization

# Instead of fine-tuning for consistent format:

OPTIMIZED_SYSTEM_PROMPT = """
You are a customer service agent for TechCorp.

RESPONSE FORMAT (ALWAYS follow exactly):
1. Greeting: One sentence acknowledging the customer
2. Understanding: Restate the issue in your own words
3. Solution: Provide clear steps or information
4. Next Steps: One specific action for the customer
5. Closing: Professional sign-off

TONE: Professional, empathetic, concise
CONSTRAINTS:
- Never promise things outside policy
- Always offer escalation option for complaints
- Maximum response length: 200 words
"""

# This often achieves 90%+ of fine-tuning quality for style/format
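To show how a prompt like this is wired into a request, the messages list can be built once and reused for every query. The `build_messages` helper below is a minimal illustrative sketch (it is not part of any SDK), and only message construction is shown, with no network call:

```python
# Minimal sketch: pair the optimized system prompt with each user query.
# build_messages is a hypothetical helper, not part of any SDK.
OPTIMIZED_SYSTEM_PROMPT = "You are a customer service agent for TechCorp. ..."  # full prompt above

def build_messages(user_text: str) -> list:
    """Build the chat messages for one customer query."""
    return [
        {"role": "system", "content": OPTIMIZED_SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

msgs = build_messages("My order hasn't arrived yet.")
```

Because the instructions live in the system prompt, the per-query payload stays small and the format rules apply uniformly to every request.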

2. Few-Shot Examples

FEW_SHOT_PROMPT = """
Convert customer feedback to structured format.

Example 1:
Input: "The product arrived late and was damaged. Very disappointed!"
Output: {
    "sentiment": "negative",
    "issues": ["delivery_delay", "product_damage"],
    "priority": "high",
    "suggested_action": "replacement_expedited"
}

Example 2:
Input: "Love the new features! Works great with my setup."
Output: {
    "sentiment": "positive",
    "issues": [],
    "priority": "low",
    "suggested_action": "send_thank_you"
}

Now process:
Input: "{customer_feedback}"
Output:
"""

3. Chain-of-Thought for Complex Reasoning

COT_PROMPT = """
Analyze this contract clause for risks.

Think step by step:
1. First, identify the type of clause
2. List the obligations it creates
3. Identify any ambiguous language
4. Compare to standard market terms
5. Assess risk level with justification

Clause: {clause_text}

Analysis:
"""

When Fine-tuning Makes Sense

Scenario 1: High-Volume Cost Optimization

Before Fine-tuning:
- Model: GPT-4
- Avg prompt: 2,000 tokens (including instructions)
- Avg completion: 500 tokens
- Queries/day: 100,000
- Daily cost: ~$3,500

After Fine-tuning (on GPT-3.5):
- Model: Fine-tuned GPT-3.5
- Avg prompt: 500 tokens (instructions baked in)
- Avg completion: 400 tokens
- Same quality for this specific task
- Daily cost: ~$180

ROI: Fine-tuning cost recovered in 2 days
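The break-even arithmetic behind that ROI claim can be sketched as follows. The daily costs come from the scenario above; the one-time fine-tuning cost is an assumed illustrative figure, since actual training cost depends on dataset size and epochs:

```python
# Break-even estimate for the scenario above. Daily costs are the scenario's
# figures; the one-time fine-tuning cost is a hypothetical assumption.
def breakeven_days(daily_cost_before: float, daily_cost_after: float,
                   finetune_cost: float) -> float:
    """Days until the one-time fine-tuning cost is paid back by savings."""
    daily_savings = daily_cost_before - daily_cost_after
    return finetune_cost / daily_savings

# With the scenario's $3,500 vs. $180 daily costs and an assumed $6,640
# one-time training cost, payback takes 2 days:
print(breakeven_days(3500, 180, 6640))  # -> 2.0
```

At this query volume the savings dominate quickly; at low volume, the same arithmetic often favors staying with prompt engineering.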

Scenario 2: Consistent Specialized Format

# Training data format for specialized extraction

TRAINING_EXAMPLE = {
    "messages": [
        {
            "role": "system",
            "content": "Extract medical entities from clinical notes."
        },
        {
            "role": "user",
            "content": "Patient presents with persistent cough for 2 weeks, mild fever. History of asthma. Currently taking albuterol."
        },
        {
            "role": "assistant",
            "content": """{
                "symptoms": [
                    {"term": "persistent cough", "duration": "2 weeks"},
                    {"term": "mild fever", "duration": null}
                ],
                "conditions": [
                    {"term": "asthma", "status": "history"}
                ],
                "medications": [
                    {"term": "albuterol", "status": "current"}
                ]
            }"""
        }
    ]
}

Fine-tuning Implementation

Data Preparation

import json
from typing import Dict, Optional

class FineTuningDataPreparer:
    def __init__(self, task_description: str):
        self.task_description = task_description
        self.examples = []

    def add_example(self, input_text: str, output_text: str,
                    system_prompt: Optional[str] = None):
        """Add a training example"""
        messages = []

        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})

        messages.append({"role": "user", "content": input_text})
        messages.append({"role": "assistant", "content": output_text})

        self.examples.append({"messages": messages})

    def validate_examples(self) -> Dict:
        """Validate training data quality"""
        issues = []

        # Check minimum count
        if len(self.examples) < 100:
            issues.append(f"Only {len(self.examples)} examples. Recommend 100+")

        # Check for diversity; the user turn is always second-to-last,
        # whether or not a system message is present
        unique_inputs = len(set(e['messages'][-2]['content'] for e in self.examples))
        if unique_inputs < len(self.examples) * 0.9:
            issues.append("Possible duplicate inputs detected")

        # Check token lengths (rough estimate: word count x 1.3)
        for i, ex in enumerate(self.examples):
            total_tokens = sum(len(m['content'].split()) * 1.3
                               for m in ex['messages'])
            if total_tokens > 4096:
                issues.append(f"Example {i} may exceed token limit")

        return {
            'total_examples': len(self.examples),
            'issues': issues,
            'ready': len(issues) == 0
        }

    def export_jsonl(self, output_path: str):
        """Export to JSONL format for fine-tuning"""
        with open(output_path, 'w') as f:
            for example in self.examples:
                f.write(json.dumps(example) + '\n')
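After exporting, it is worth re-reading the JSONL file before uploading it, since a single malformed line will fail the whole training job. Below is a minimal standalone sketch of such a sanity check (the `check_jsonl` helper is illustrative, not part of any SDK):

```python
# Minimal post-export sanity check: re-read the JSONL file and confirm every
# line parses as JSON and ends with an assistant turn.
import json

def check_jsonl(path: str) -> int:
    """Return the number of valid examples; raise on malformed lines."""
    count = 0
    with open(path) as f:
        for line_no, line in enumerate(f, 1):
            record = json.loads(line)  # raises ValueError on invalid JSON
            roles = [m["role"] for m in record["messages"]]
            assert roles[-1] == "assistant", f"line {line_no}: no assistant turn"
            count += 1
    return count
```

Running this on the exported file catches truncated writes and missing assistant responses before any upload cost is incurred.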

Fine-tuning with OpenAI

from openai import OpenAI

client = OpenAI()

def fine_tune_model(training_file_path: str, model_base: str = "gpt-3.5-turbo"):
    """Fine-tune a model using the OpenAI API"""

    # Upload training file
    with open(training_file_path, 'rb') as f:
        file_response = client.files.create(
            file=f,
            purpose='fine-tune'
        )

    # Create fine-tuning job
    job = client.fine_tuning.jobs.create(
        training_file=file_response.id,
        model=model_base,
        hyperparameters={
            "n_epochs": 3,
            "batch_size": 4,
            "learning_rate_multiplier": 1.0
        }
    )

    print(f"Fine-tuning job created: {job.id}")
    return job

import time

def monitor_fine_tuning(job_id: str):
    """Poll fine-tuning progress until the job finishes"""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        print(f"Status: {job.status}")

        if job.status == 'succeeded':
            print(f"Fine-tuned model: {job.fine_tuned_model}")
            return job.fine_tuned_model
        elif job.status == 'failed':
            raise Exception(f"Fine-tuning failed: {job.error}")

        time.sleep(60)

Fine-tuning with Open Source (LoRA)

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

def fine_tune_with_lora(
    base_model: str,
    training_data_path: str,
    output_dir: str
):
    """Fine-tune using LoRA (Parameter-Efficient Fine-Tuning)"""

    # Load base model in 4-bit quantization
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        load_in_4bit=True,
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(base_model)

    # Prepare the quantized model for training
    model = prepare_model_for_kbit_training(model)

    # LoRA configuration
    lora_config = LoraConfig(
        r=16,                # Rank of the low-rank update matrices
        lora_alpha=32,       # Scaling factor
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )

    model = get_peft_model(model, lora_config)

    # Training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        logging_steps=10,
        save_steps=100,
        fp16=True
    )

    # Load training data
    dataset = load_dataset('json', data_files=training_data_path)

    # Train
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset['train'],
        args=training_args,
        tokenizer=tokenizer,
        max_seq_length=2048
    )

    trainer.train()

    # Save only the LoRA adapter weights
    model.save_pretrained(output_dir)

    return output_dir
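To see why LoRA is parameter-efficient, a back-of-envelope count helps: each adapted weight matrix W of shape (d_out, d_in) gains two low-rank factors, A of shape (r, d_in) and B of shape (d_out, r), so only r·(d_in + d_out) parameters train per matrix. The dimensions below assume a 7B-class model (hidden size 4096, 32 layers, all four attention projections square); adjust for your base model:

```python
# Back-of-envelope count of LoRA trainable parameters. Each adapted matrix
# W (d_out x d_in) gains factors A (r x d_in) and B (d_out x r), i.e.
# r * (d_in + d_out) trainable parameters.
def lora_params(r: int, shapes: list, n_layers: int) -> int:
    """Total trainable LoRA parameters across all layers."""
    per_layer = sum(r * (d_in + d_out) for d_out, d_in in shapes)
    return per_layer * n_layers

# Assuming q_proj, k_proj, v_proj, o_proj are each 4096x4096 over 32 layers:
print(lora_params(16, [(4096, 4096)] * 4, 32))  # -> 16777216 (~16.8M)
```

Roughly 17M trainable parameters against 7B frozen ones (about 0.24%) is what makes single-GPU fine-tuning feasible here.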

Evaluation Framework

Comparing Base vs. Fine-tuned

class ModelComparator:
    def __init__(self, base_model, fine_tuned_model, test_dataset):
        self.base = base_model
        self.fine_tuned = fine_tuned_model
        self.test_data = test_dataset

    def evaluate(self) -> Dict:
        """Compare models on the test dataset.

        Assumes each model exposes generate(), and that evaluate_output()
        and a count_tokens() helper are defined elsewhere.
        """
        results = {
            'base': {'correct': 0, 'total': 0, 'avg_tokens': 0},
            'fine_tuned': {'correct': 0, 'total': 0, 'avg_tokens': 0}
        }

        for example in self.test_data:
            # Base model
            base_output = self.base.generate(example['input'])
            base_correct = self.evaluate_output(base_output, example['expected'])
            results['base']['correct'] += base_correct
            results['base']['total'] += 1
            results['base']['avg_tokens'] += count_tokens(base_output)

            # Fine-tuned model
            ft_output = self.fine_tuned.generate(example['input'])
            ft_correct = self.evaluate_output(ft_output, example['expected'])
            results['fine_tuned']['correct'] += ft_correct
            results['fine_tuned']['total'] += 1
            results['fine_tuned']['avg_tokens'] += count_tokens(ft_output)

        # Calculate metrics
        for model in results:
            results[model]['accuracy'] = results[model]['correct'] / results[model]['total']
            results[model]['avg_tokens'] /= results[model]['total']

        return results

Quick Reference

| Need | First Try | Then Consider |
|---|---|---|
| Consistent format | Structured output + prompt | Fine-tune if >10K queries/day |
| Domain language | RAG with domain docs | Fine-tune if RAG insufficient |
| Writing style | Few-shot examples | Fine-tune if style critical |
| Cost reduction | Prompt optimization | Fine-tune smaller model |
| Latency reduction | Caching, smaller model | Fine-tune for shorter prompts |
| Novel capability | Prompt engineering | Fine-tune + evaluation |

Document maintained by CODITECT ML Team