Production Patterns

How to Use This Skill

Review the patterns and examples below
Apply the relevant patterns to your implementation
Follow the best practices outlined in this skill

Expert skill for production-ready code with circuit breakers, comprehensive error handling, observability, and fault tolerance.

When to Use

✅ Use this skill when:

Implementing external API calls (LM Studio, third-party services) that need circuit breakers
Building backend services for production (T2 V5 API) that need error handling and observability
Adding retry logic with exponential backoff (FDB operations, HTTP calls)
Preventing cascading failures in a multi-service architecture
Implementing async patterns with timeouts and bulkheads
Adding graceful degradation (falling back to cache when a service is down)
Adding metrics and structured logging for production monitoring
Reducing incident resolution time (stated target: 75% faster, from 40 to 10 minutes of debugging)

❌ Don't use this skill when:

Simple synchronous operations (no external I/O)
Quick prototypes or POCs (too much overhead)
Frontend-only code (different patterns apply)
Already using framework with built-in patterns (don't reinvent)

Circuit Breaker Pattern

Core Concept

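Prevent cascading failures by opening the circuit once the failure threshold is exceeded: calls then fail fast instead of reaching the struggling dependency, and a trial call is allowed again after a timeout.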

from enum import Enum
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Any, Optional
import asyncio


class CircuitState(Enum):
    """Circuit breaker states"""
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Circuit broken, fail fast
    HALF_OPEN = "half_open"  # Testing if service recovered


@dataclass
class CircuitBreakerConfig:
    """Circuit breaker configuration"""
    failure_threshold: int = 5  # Failures before opening circuit
    timeout: timedelta = timedelta(seconds=60)  # Time before trying half-open
    success_threshold: int = 2  # Successes in half-open before closing


class CircuitBreaker:
    """Circuit breaker for fault tolerance"""

    def __init__(self, config: CircuitBreakerConfig):
        self.config = config
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time: Optional[datetime] = None

    async def call(self, func: Callable, *args, **kwargs) -> Any:
        """Execute function with circuit breaker protection"""

        # If circuit is open, check if we should try half-open
        if self.state == CircuitState.OPEN:
            if self._should_attempt_reset():
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
            else:
                raise CircuitBreakerOpenError(
                    f"Circuit breaker open. Last failure: {self.last_failure_time}"
                )

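        try:
            # Call the function and await the result if it is a coroutine
            # (this also covers lambdas and partials that return coroutines)
            result = func(*args, **kwargs)
            if asyncio.iscoroutine(result):
                result = await result

            # On success
            self._on_success()
            return result

        except Exception:
            # On failure: record it, then re-raise for the caller to handle
            self._on_failure()
            raise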

    def _should_attempt_reset(self) -> bool:
        """Check if enough time has passed to try half-open"""
        if self.last_failure_time is None:
            return True
        return datetime.now() - self.last_failure_time > self.config.timeout

    def _on_success(self):
        """Handle successful call"""
        self.failure_count = 0

        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.config.success_threshold:
                self.state = CircuitState.CLOSED
                self.success_count = 0

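    def _on_failure(self):
        """Handle failed call"""
        self.failure_count += 1
        self.last_failure_time = datetime.now()

        # A failure during the half-open trial reopens the circuit immediately;
        # otherwise the circuit opens once the failure threshold is reached
        if (
            self.state == CircuitState.HALF_OPEN
            or self.failure_count >= self.config.failure_threshold
        ):
            self.state = CircuitState.OPEN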


class CircuitBreakerOpenError(Exception):
    """Raised when circuit breaker is open"""
    pass

Usage Example

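import logging

import httpx

logger = logging.getLogger(__name__)

# Create one circuit breaker per dependency and reuse it across calls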
circuit_breaker = CircuitBreaker(CircuitBreakerConfig(
    failure_threshold=5,
    timeout=timedelta(seconds=60),
    success_threshold=2
))

# Use with external service call
async def fetch_user_data(user_id: str) -> dict:
    """Fetch user data with circuit breaker protection"""
    async def _call():
        async with httpx.AsyncClient() as client:
            response = await client.get(f"https://api.example.com/users/{user_id}")
            response.raise_for_status()
            return response.json()

    try:
        return await circuit_breaker.call(_call)
    except CircuitBreakerOpenError as e:
        # Circuit is open: degrade gracefully with cached data or a default
        # (get_cached_user_data is an application-specific helper, not shown)
        logger.warning(f"Circuit breaker open for user service: {e}")
        return get_cached_user_data(user_id)
    except Exception as e:
        logger.error(f"Failed to fetch user data: {e}")
        raise

Error Handling Patterns

Comprehensive Error Handling

from typing import Optional, Dict, Any
from datetime import datetime
import traceback


class ApplicationError(Exception):
    """Base application error with structured context"""

    def __init__(
        self,
        message: str,
        error_code: str,
        details: Optional[Dict[str, Any]] = None,
        retry_able: bool = False,
        original_exception: Optional[Exception] = None
    ):
        super().__init__(message)
        self.message = message
        self.error_code = error_code
        self.details = details or {}
        self.retryable = retry_able
        self.original_exception = original_exception
        self.timestamp = datetime.now()
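        # Format the traceback of the wrapped exception itself; format_exc() would
        # only describe whichever exception is currently being handled, if any
        self.stack_trace = (
            "".join(traceback.format_exception(
                type(original_exception), original_exception, original_exception.__traceback__
            ))
            if original_exception else None
        )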

    def to_dict(self) -> Dict[str, Any]:
        """Serialize error for logging/reporting"""
        return {
            "error_code": self.error_code,
            "message": self.message,
            "details": self.details,
            "retryable": self.retryable,
            "timestamp": self.timestamp.isoformat(),
            "stack_trace": self.stack_trace,
        }


class DatabaseError(ApplicationError):
    """Database operation failed"""
    def __init__(self, message: str, **kwargs):
        super().__init__(
            message,
            error_code="DB_ERROR",
            retry_able=True,
            **kwargs
        )


class ValidationError(ApplicationError):
    """Input validation failed"""
    def __init__(self, message: str, field: str, **kwargs):
        super().__init__(
            message,
            error_code="VALIDATION_ERROR",
            details={"field": field},
            retry_able=False,
            **kwargs
        )


class ExternalServiceError(ApplicationError):
    """External service call failed"""
    def __init__(self, service_name: str, message: str, **kwargs):
        super().__init__(
            message,
            error_code="EXTERNAL_SERVICE_ERROR",
            details={"service": service_name},
            retry_able=True,
            **kwargs
        )

Error Recovery with Retry

import asyncio
import logging
import random
from functools import wraps
from typing import Awaitable, Callable, Tuple, Type, TypeVar

import httpx

logger = logging.getLogger(__name__)

T = TypeVar('T')


async def retry_with_backoff(
    func: Callable[[], Awaitable[T]],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    exponential_base: float = 2.0,
    jitter: bool = True,
    retryable_exceptions: Tuple[Type[BaseException], ...] = (Exception,)
) -> T:
    """Retry an async callable with exponential backoff.

    max_retries is the total number of attempts, including the first.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")

    for attempt in range(max_retries):
        try:
            return await func()
        except retryable_exceptions as e:
            if attempt == max_retries - 1:
                # Last attempt, re-raise
                raise

            # Calculate delay, capped at max_delay
            delay = min(base_delay * (exponential_base ** attempt), max_delay)

            # Add jitter (0.5x to 1.5x) to prevent thundering herd
            if jitter:
                delay = delay * (0.5 + random.random())

            logger.warning(
                f"Attempt {attempt + 1} failed: {e}. "
                f"Retrying in {delay:.2f}s..."
            )

            await asyncio.sleep(delay)

    raise RuntimeError("Should not reach here")


def with_retry(max_retries: int = 3, **kwargs):
    """Decorator for automatic retry"""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **func_kwargs):
            return await retry_with_backoff(
                lambda: func(*args, **func_kwargs),
                max_retries=max_retries,
                **kwargs
            )
        return wrapper
    return decorator


# Usage
@with_retry(max_retries=3, base_delay=1.0)
async def fetch_data(url: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        response.raise_for_status()
        return response.json()

Observability Hooks

Metrics Collection

from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Any, Optional
import time


@dataclass
class MetricPoint:
    """Single metric data point"""
    name: str
    value: float
    timestamp: datetime
    tags: Dict[str, str]


class MetricsCollector:
    """Collect and export metrics"""

    def __init__(self):
        self.metrics: list[MetricPoint] = []

    def increment(self, name: str, value: float = 1.0, tags: Optional[Dict[str, str]] = None):
        """Increment a counter"""
        self.metrics.append(MetricPoint(
            name=name,
            value=value,
            timestamp=datetime.now(),
            tags=tags or {}
        ))

    def gauge(self, name: str, value: float, tags: Optional[Dict[str, str]] = None):
        """Set a gauge value"""
        self.metrics.append(MetricPoint(
            name=name,
            value=value,
            timestamp=datetime.now(),
            tags=tags or {}
        ))

    def histogram(self, name: str, value: float, tags: Optional[Dict[str, str]] = None):
        """Record histogram value"""
        self.metrics.append(MetricPoint(
            name=name,
            value=value,
            timestamp=datetime.now(),
            tags=tags or {}
        ))


class TimingContext:
    """Context manager for timing operations"""

    def __init__(self, metrics: MetricsCollector, metric_name: str, tags: Optional[Dict[str, str]] = None):
        self.metrics = metrics
        self.metric_name = metric_name
        self.tags = tags or {}
        self.start_time: Optional[float] = None

    def __enter__(self):
        # perf_counter() is monotonic, so durations are immune to wall-clock adjustments
        self.start_time = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        duration = time.perf_counter() - self.start_time
        self.metrics.histogram(
            f"{self.metric_name}_duration_seconds",
            duration,
            tags={**self.tags, "success": str(exc_type is None)}
        )
        return False  # Don't suppress exceptions


# Usage
metrics = MetricsCollector()

async def process_request(request_id: str):
    metrics.increment("requests_total", tags={"endpoint": "/api/users"})

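    # Tag with low-cardinality values such as the endpoint; per-request IDs belong in logs
    with TimingContext(metrics, "process_request", tags={"endpoint": "/api/users"}):
        # Process request (do_work and get_active_count are application-specific placeholders)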
        result = await do_work()

        metrics.gauge("active_requests", get_active_count())

        return result

Structured Logging

import logging
import json
from datetime import datetime
from typing import Dict, Any, Optional


class StructuredLogger:
    """Structured logging with context"""

    def __init__(self, name: str):
        self.logger = logging.getLogger(name)
        self.context: Dict[str, Any] = {}

    def add_context(self, **kwargs):
        """Add context to all log messages"""
        self.context.update(kwargs)

    def clear_context(self):
        """Clear context"""
        self.context = {}

    def _log(self, level: int, message: str, **kwargs):
        """Log with structured data"""
        log_data = {
            "timestamp": datetime.now().isoformat(),
            "level": logging.getLevelName(level),
            "message": message,
            "context": self.context,
            **kwargs
        }
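        # default=str keeps logging from failing on values json cannot serialize (datetimes, UUIDs)
        self.logger.log(level, json.dumps(log_data, default=str))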

    def info(self, message: str, **kwargs):
        self._log(logging.INFO, message, **kwargs)

    def warning(self, message: str, **kwargs):
        self._log(logging.WARNING, message, **kwargs)

    def error(self, message: str, error: Optional[Exception] = None, **kwargs):
        if error:
            kwargs["error"] = {
                "type": type(error).__name__,
                "message": str(error),
            }
        self._log(logging.ERROR, message, **kwargs)


# Usage
logger = StructuredLogger("my_service")
logger.add_context(request_id="req_123", user_id="user_456")

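logger.info("Processing request", action="fetch_data")

try:
    rows = run_query("SELECT * FROM users")  # run_query is a hypothetical application helper
except Exception as exc:
    logger.error("Database query failed", error=exc, query="SELECT * FROM users")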

Async Patterns

Async Timeout

import asyncio
from typing import TypeVar, Awaitable

T = TypeVar('T')


async def with_timeout(
    coro: Awaitable[T],
    timeout: float,
    timeout_error_message: str = "Operation timed out"
) -> T:
    """Execute coroutine with timeout"""
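    try:
        return await asyncio.wait_for(coro, timeout=timeout)
    except asyncio.TimeoutError as exc:
        # Re-raise as the built-in TimeoutError with a descriptive message
        # (asyncio.TimeoutError is an alias of TimeoutError from Python 3.11 on)
        raise TimeoutError(timeout_error_message) from exc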


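# Usage (inside a coroutine, since await and return need a function body)
async def fetch_or_default(url: str, default_value: dict) -> dict:
    try:
        return await with_timeout(
            fetch_data(url),
            timeout=5.0,
            timeout_error_message=f"Failed to fetch data from {url} within 5 seconds"
        )
    except TimeoutError as e:
        logger.error(f"Timeout: {e}")
        return default_value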

Async Bulkhead

class Bulkhead:
    """Limit concurrent operations to prevent resource exhaustion"""

    def __init__(self, max_concurrent: int):
        self.semaphore = asyncio.Semaphore(max_concurrent)

    async def execute(self, coro: Awaitable[T]) -> T:
        """Execute with concurrency limit"""
        async with self.semaphore:
            return await coro


# Usage
bulkhead = Bulkhead(max_concurrent=10)

async def process_items(items: list):
    tasks = [bulkhead.execute(process_item(item)) for item in items]
    return await asyncio.gather(*tasks)
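
The timeout and bulkhead patterns above are often combined. The following is a minimal sketch under stated assumptions, not the reference implementation: `process_item` is a hypothetical worker that just sleeps, and the `in_flight` and `peak` counters exist only to show that the semaphore caps concurrent work. Here `execute` takes a coroutine function rather than an already-created coroutine, so the timeout clock starts only after a slot is acquired.

```python
import asyncio


class Bulkhead:
    """Semaphore-backed concurrency limit with a per-call timeout."""

    def __init__(self, max_concurrent: int):
        self._semaphore = asyncio.Semaphore(max_concurrent)

    async def execute(self, func, *args, timeout: float):
        # Acquire a slot first, then start the timeout, so time spent
        # queueing for a slot does not count against the operation itself.
        async with self._semaphore:
            return await asyncio.wait_for(func(*args), timeout=timeout)


in_flight = 0  # number of workers currently running
peak = 0       # highest value in_flight has reached


async def process_item(item: int) -> int:
    """Hypothetical worker that records how many copies run at once."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    try:
        await asyncio.sleep(0.01)
        return item * 2
    finally:
        in_flight -= 1


async def process_items(items, max_concurrent: int = 3):
    bulkhead = Bulkhead(max_concurrent)
    tasks = [bulkhead.execute(process_item, item, timeout=1.0) for item in items]
    return await asyncio.gather(*tasks)


def run_demo():
    """Run ten items through a bulkhead of three and report peak concurrency."""
    results = asyncio.run(process_items(list(range(10))))
    return results, peak
```

Whether queueing time should count toward the timeout is a design choice; wrapping the whole `execute` call in a timeout instead would bound total latency, including the wait for a slot.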

Executable Scripts

See core/circuit_breaker.py for circuit breaker implementation. See core/observability_hooks.py for metrics and logging utilities.

Best Practices

✅ DO

Add circuit breakers - Protect against cascading failures
Use structured errors - Include error codes, context, retryability
Implement retries - With exponential backoff and jitter
Add observability - Metrics, logs, traces for all critical paths
Use async patterns - Timeouts, bulkheads, concurrent limits
Fail fast - Don't retry non-retryable errors
Log context - Include request IDs, user IDs, etc.
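
To make "exponential backoff and jitter" concrete, here is a small sketch that tabulates the delay rule used by `retry_with_backoff` above. The function name `backoff_delay` and the injectable `rng` parameter are assumptions added for illustration and testing.

```python
import random


def backoff_delay(
    attempt: int,
    base_delay: float = 1.0,
    exponential_base: float = 2.0,
    max_delay: float = 60.0,
    jitter: bool = False,
    rng=random.random,
) -> float:
    """Delay before the retry that follows a failed attempt (0-indexed)."""
    # Exponential growth, capped at max_delay
    delay = min(base_delay * (exponential_base ** attempt), max_delay)
    if jitter:
        # Scale by a random factor in [0.5, 1.5). Because the jitter is applied
        # after the cap, a jittered delay can exceed max_delay by up to 50%.
        delay *= 0.5 + rng()
    return delay


# Delays without jitter for the first four failed attempts
schedule = [backoff_delay(attempt) for attempt in range(4)]
```

With the defaults, `schedule` is the 1s, 2s, 4s, 8s progression, and the cap takes over from the seventh attempt onward (2^6 = 64 > 60).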

❌ DON'T

Don't ignore errors - Always handle or propagate
Don't retry blindly - Check if error is retryable
Don't block - Use async/await for I/O
Don't leak exceptions - Wrap in application errors
Don't skip timeouts - All external calls need timeouts
Don't forget metrics - Track successes and failures
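
As an illustration of "Don't leak exceptions", the sketch below wraps a low-level `ValueError` in an application error while keeping the original as the cause. `ExternalServiceError` here is a trimmed-down stand-in for the class defined earlier, and `parse_price` and `get_price` are hypothetical helpers.

```python
class ExternalServiceError(Exception):
    """Trimmed-down stand-in for the ExternalServiceError class above."""

    def __init__(self, service_name: str, message: str, retryable: bool = True):
        super().__init__(message)
        self.error_code = "EXTERNAL_SERVICE_ERROR"
        self.details = {"service": service_name}
        self.retryable = retryable


def parse_price(raw: str) -> float:
    """Hypothetical low-level helper that can raise ValueError."""
    return float(raw)


def get_price(raw_response: str) -> float:
    try:
        return parse_price(raw_response)
    except ValueError as exc:
        # `raise ... from exc` stores the original exception as __cause__,
        # so the low-level traceback is preserved for debugging.
        raise ExternalServiceError(
            "pricing",
            f"Malformed price payload: {raw_response!r}",
            retryable=False,  # a malformed payload will not fix itself on retry
        ) from exc
```

Callers then only need to handle the application error hierarchy, while the chained cause still appears in logs and tracebacks.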

Integration with T2

Use cases in T2:

Circuit breakers on FDB calls
Retry logic for LM Studio API
Observability for agent coordination
Structured errors for agent failures
Bulkheads for parallel agent execution

Example:

// Circuit breaker for FDB operations
let fdb_circuit_breaker = CircuitBreaker::new(Config {
    failure_threshold: 5,
    timeout: Duration::from_secs(60),
    success_threshold: 2,
});

// Use in repository
pub async fn get_user(&self, user_id: &Uuid) -> Result<User, RepositoryError> {
    fdb_circuit_breaker.call(|| async {
        let key = format!("/{}/users/{}", tenant_id, user_id);
        let value = self.fdb.get(&key).await?;
        serde_json::from_slice(&value)
    }).await
}

Templates

See templates/error_handling_template.md for error handling patterns. See templates/async_patterns.md for async/await best practices.

Success Output

When this skill is successfully applied, you should see:

✅ SKILL COMPLETE: production-patterns

Completed:
- [x] Circuit breaker implemented for external services
- [x] Retry logic with exponential backoff added
- [x] Structured error handling (ApplicationError hierarchy)
- [x] Observability hooks configured (metrics, logging)
- [x] Async patterns applied (timeouts, bulkheads)
- [x] Integration tests passed (95%+ success rate)

Outputs:
- CircuitBreaker instances for critical dependencies (LM Studio, FDB)
- Retry decorators with configurable backoff (@with_retry)
- Error classes with context (DatabaseError, ValidationError, ExternalServiceError)
- MetricsCollector tracking successes, failures, latencies
- StructuredLogger with request context
- Async wrappers (with_timeout, Bulkhead for concurrency limits)

Performance Metrics:
- MTTR: 40 min → 10 min (75% faster incident resolution)
- Error recovery: 90%+ auto-recovery with retry logic
- Circuit breaker activation: opens after 5 consecutive failures, then allows a trial call after 60 seconds
- Observability coverage: 100% of critical paths instrumented

Completion Checklist

Before marking this skill as complete, verify:

Failure Indicators

This skill has FAILED if:

❌ No circuit breakers on external services (cascading failures possible)
❌ Retry logic missing or infinite retries (no max_retries limit)
❌ Errors without context (can't diagnose root cause)
❌ No metrics on critical paths (blind to performance issues)
❌ Plain text logging (can't parse or query logs)
❌ No timeouts on async operations (hangs indefinitely)
❌ Unbounded concurrency (resource exhaustion risk)
❌ Integration tests not covering error scenarios
❌ Circuit breaker never tested (don't know if it works)

When NOT to Use

Do NOT use this skill when:

Simple CRUD operations - Basic database queries don't need circuit breakers
Single-tenant scripts - Production patterns add overhead to throwaway code
Synchronous-only systems - If no async/await, skip async patterns
Frameworks with built-in patterns - Don't reinvent if using Spring Boot, Axum with middleware
Non-production environments - Development/staging can skip some patterns
Low-traffic applications - <10 req/sec doesn't justify full production hardening
Offline tools - Command-line utilities without external dependencies

Use alternatives:

Simple apps → Basic try/catch error handling
Frameworks → Use built-in middleware (Resilience4j via Spring Cloud Circuit Breaker, since Hystrix is in maintenance mode; Tower layers in Axum)
Development → Focus on functionality first, harden later
Low-traffic → Start with logging only, add patterns as traffic grows

Anti-Patterns (Avoid)

Anti-Pattern	Problem	Solution
No circuit breaker	One failing dependency takes down entire system	Add circuit breaker to all external calls (LM Studio, APIs)
Retry non-retryable errors	Validation failures retried 3x (waste)	Check error.retryable flag before retry
Blind retries	Retry immediately, hammer failing service	Use exponential backoff (1s, 2s, 4s, 8s)
No timeout	Async calls wait forever if service hangs	Always wrap async with timeout (5-30s typical)
Unbounded concurrency	1000 parallel requests exhaust memory	Use Bulkhead (semaphore) to limit concurrent operations
Plain text logs	Can't query logs for request_id or error_code	Use structured logging (JSON with context)
No metrics	Can't see failure rates or latency spikes	Add metrics to all critical paths (requests, queries)
Ignoring circuit state	Continue calling when circuit open	Check circuit state, return cached data or default when open
No jitter	All retries at same time (thundering herd)	Add random jitter to backoff delays
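
The "Retry non-retryable errors" row can be sketched as follows. This is one possible approach under stated assumptions: errors expose a boolean `retryable` attribute, as `ApplicationError` does, and anything without it is treated as permanent. `TransientError`, `PermanentError`, and `make_flaky` are hypothetical names used only for the demonstration.

```python
import asyncio


class TransientError(Exception):
    retryable = True


class PermanentError(Exception):
    retryable = False


async def retry_if_retryable(func, max_retries: int = 3, base_delay: float = 0.01):
    """Retry only errors whose `retryable` attribute is true."""
    for attempt in range(max_retries):
        try:
            return await func()
        except Exception as exc:
            is_last_attempt = attempt == max_retries - 1
            # Fail fast on permanent errors and on the final attempt
            if not getattr(exc, "retryable", False) or is_last_attempt:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt))
    raise ValueError("max_retries must be at least 1")


def make_flaky(failures: int, error_cls):
    """Hypothetical helper: an operation that fails `failures` times, then succeeds."""
    calls = {"count": 0}

    async def operation():
        calls["count"] += 1
        if calls["count"] <= failures:
            raise error_cls("boom")
        return "ok"

    return operation, calls
```

A transient failure is retried until it succeeds or attempts run out, while a permanent one surfaces after a single call.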

Principles

This skill embodies CODITECT automation principles:

#1 Recycle → Extend → Re-Use → Create

Recycle circuit breaker instances - One CircuitBreaker per dependency, reuse across calls
Extend error classes - Inherit from ApplicationError for consistency
Re-use retry decorators - @with_retry applies to any async function
Create custom patterns - Only when standard patterns don't fit

#2 First Principles Thinking

Understand failure modes - Circuit breaker prevents cascading failures
Know error types - Transient (retry) vs permanent (fail fast)
Measure observability ROI - 75% faster debugging justifies instrumentation cost

#5 Eliminate Ambiguity

Explicit error codes - DB_ERROR, VALIDATION_ERROR vs vague "error occurred"
Clear thresholds - 5 failures to open circuit is unambiguous
Concrete timeouts - 30 seconds instead of "reasonable timeout"

#6 Clear, Understandable, Explainable

Circuit state transitions - CLOSED → OPEN → HALF_OPEN → CLOSED (or back to OPEN if the trial call fails) is a clear state machine
Retry rationale - Explain why 3 retries with exponential backoff
Metrics dashboards - Visualize failure rates, latencies for team visibility

#8 No Assumptions

Test circuit breaker - Simulate failures, verify circuit opens
Validate retryability - Don't assume all errors are retryable
Check timeout values - Measure actual latencies, don't guess timeout values
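
One way to "simulate failures, verify circuit opens" without waiting out a real timeout is to inject the clock. The sketch below uses a deliberately simplified synchronous breaker (`MiniBreaker`, a hypothetical name) that closes after a single successful trial call; it is not the `CircuitBreaker` class above, only an illustration of the testing technique.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class OpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class MiniBreaker:
    """Simplified breaker with an injectable clock (assumption for testability)."""

    def __init__(self, failure_threshold: int, reset_after: float, clock):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.reset_after:
                raise OpenError("circuit open")
            self.state = State.HALF_OPEN  # allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
                self.state = State.OPEN
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = State.CLOSED
        return result


def failing():
    raise RuntimeError("dependency down")


def simulate():
    """Drive the breaker through open, rejected, and recovered phases."""
    now = [0.0]  # fake clock value, advanced manually
    breaker = MiniBreaker(failure_threshold=2, reset_after=60.0, clock=lambda: now[0])
    seen = []
    for _ in range(2):
        try:
            breaker.call(failing)
        except RuntimeError:
            pass
    seen.append(breaker.state)       # threshold reached
    try:
        breaker.call(lambda: "ok")
    except OpenError:
        seen.append("rejected")      # fail fast while open
    now[0] = 61.0                    # jump past the reset timeout
    breaker.call(lambda: "ok")
    seen.append(breaker.state)       # trial call succeeded
    return seen
```

The same idea applies to the async class: replacing direct `datetime.now()` calls with an injected clock makes the OPEN to HALF_OPEN transition testable in milliseconds.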

#10 Automation First

Auto-retry on transient failures - No manual intervention for network glitches
Auto-open circuit - Failure threshold hit → circuit opens automatically
Auto-collect metrics - Every request tracked, no manual logging
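
"Auto-collect metrics" could look like the decorator below, which records the outcome and duration of every call without per-function logging code. It is a sketch only: the `recorded` list stands in for a `MetricsCollector`, and `instrumented` and `fetch_profile` are hypothetical names.

```python
import asyncio
import functools
import time

recorded = []  # stand-in for a MetricsCollector instance


def instrumented(name: str):
    """Decorator that records success and duration for each call of an async function."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            success = False
            try:
                result = await func(*args, **kwargs)
                success = True
                return result
            finally:
                # Runs on both success and failure; exceptions still propagate
                recorded.append({
                    "name": name,
                    "success": success,
                    "duration_seconds": time.perf_counter() - start,
                })
        return wrapper
    return decorator


@instrumented("fetch_profile")
async def fetch_profile(user_id: str) -> dict:
    """Hypothetical handler used to exercise the decorator."""
    if not user_id:
        raise ValueError("user_id is required")
    await asyncio.sleep(0)
    return {"id": user_id}
```

Because the recording happens in `finally`, failed calls are counted as well as successful ones, which is what makes failure rates visible.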

Multi-Context Window Support

This skill supports long-running production pattern implementation across multiple context windows using Claude 4.5's enhanced state management capabilities.

State Tracking

Pattern Implementation State (JSON):

{
  "checkpoint_id": "ckpt_20251129_150000",
  "patterns_applied": [
    {"name": "circuit_breaker", "files": ["src/api/client.py"], "status": "complete"},
    {"name": "retry_backoff", "files": ["src/services/fdb.py"], "status": "in_progress"},
    {"name": "observability", "files": ["src/monitoring/metrics.py"], "status": "pending"}
  ],
  "integration_tests": {
    "passed": 12,
    "failed": 2,
    "pending": 5
  },
  "observability_configured": {
    "metrics": true,
    "logging": true,
    "tracing": false
  },
  "token_usage": 15000,
  "created_at": "2025-11-29T15:00:00Z"
}

Progress Notes (Markdown):

# Production Patterns Progress - 2025-11-29

## Completed
- Circuit breaker implemented for LM Studio API calls
- Retry logic with exponential backoff added to FDB operations
- Structured logging configured for all handlers

## In Progress
- Implementing bulkhead pattern for parallel agent execution
- Integration tests: 2 failures need investigation (timeout issues)

## Next Actions
- Fix timeout configuration in test suite
- Add distributed tracing with OpenTelemetry
- Configure Prometheus metrics export

Session Recovery

When starting a fresh context window after production pattern implementation:

Load Checkpoint State: Read .coditect/checkpoints/production-patterns-latest.json
Review Progress Notes: Check production-patterns-progress.md for narrative context
Verify Pattern Implementation: Use Read tool to confirm applied patterns
Check Integration Tests: Run test suite to verify current status
Resume Implementation: Continue from last completed pattern

Recovery Commands:

# 1. Check latest checkpoint
jq '.patterns_applied' .coditect/checkpoints/production-patterns-latest.json

# 2. Review progress
tail -30 production-patterns-progress.md

# 3. Verify tests
pytest tests/integration/ --tb=short

# 4. Check observability status
jq '.observability_configured' .coditect/checkpoints/production-patterns-latest.json

# 5. Resume from pending patterns
# Apply next pattern from checkpoint
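
The "resume from pending patterns" step can also be scripted. The sketch below assumes the checkpoint follows the JSON layout shown under State Tracking; `pending_patterns` and `demo` are hypothetical names, and the demo writes a sample file to a temporary directory rather than touching `.coditect/`.

```python
import json
import tempfile
from pathlib import Path


def pending_patterns(checkpoint_path) -> list:
    """Return the names of patterns that are not yet marked complete."""
    state = json.loads(Path(checkpoint_path).read_text(encoding="utf-8"))
    return [
        pattern["name"]
        for pattern in state.get("patterns_applied", [])
        if pattern.get("status") != "complete"
    ]


def demo() -> list:
    """Write a sample checkpoint and list what remains to be done."""
    sample = {
        "checkpoint_id": "ckpt_20251129_150000",
        "patterns_applied": [
            {"name": "circuit_breaker", "files": ["src/api/client.py"], "status": "complete"},
            {"name": "retry_backoff", "files": ["src/services/fdb.py"], "status": "in_progress"},
            {"name": "observability", "files": ["src/monitoring/metrics.py"], "status": "pending"},
        ],
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "production-patterns-latest.json"
        path.write_text(json.dumps(sample), encoding="utf-8")
        return pending_patterns(path)
```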

State Management Best Practices

Checkpoint Files (JSON Schema):

Store in .coditect/checkpoints/production-patterns-{timestamp}.json
Include pattern checksums for integrity validation
Track integration test results for quick status check
Record observability configuration state
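
The checksum item is not reflected in the example checkpoint, so the following is only one possible shape: a SHA-256 digest per file, stored under a `checksums` key. That key name, along with `file_checksum`, `pattern_entry`, and `verify_entry`, is an assumption made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path) -> str:
    """SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pattern_entry(name: str, files, status: str) -> dict:
    """Build a checkpoint entry that records a checksum for each file."""
    return {
        "name": name,
        "files": [str(f) for f in files],
        "status": status,
        "checksums": {str(f): file_checksum(f) for f in files},  # assumed field
    }


def verify_entry(entry: dict) -> bool:
    """True if every recorded file still matches its stored checksum."""
    return all(
        file_checksum(path) == expected
        for path, expected in entry["checksums"].items()
    )


def demo():
    """Show that a later edit to a file is detected."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "client.py"
        source.write_text("print('v1')\n", encoding="utf-8")
        entry = pattern_entry("circuit_breaker", [source], "complete")
        unchanged = verify_entry(entry)
        source.write_text("print('v2')\n", encoding="utf-8")
        changed = verify_entry(entry)
        return unchanged, changed
```

On session recovery, a failed verification signals that a file changed after the checkpoint and the pattern should be re-checked before resuming.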

Progress Tracking (Markdown Narrative):

Maintain production-patterns-progress.md with timestamp entries
Document pattern application decisions and trade-offs
Note integration test failures with investigation notes
List next patterns to apply with priority order

Git Integration:

Create checkpoint before applying critical patterns (circuit breakers)
Commit frequently with pattern names in messages
Use conventional commit format: feat(patterns): Add circuit breaker to API client
Tag checkpoints: git tag checkpoint-prod-patterns-{timestamp}

Progress Checkpoints

Natural Breaking Points:

After each pattern category implemented (circuit breakers, retries, observability)
After integration tests pass for a pattern
After observability hooks configured
Before applying fault tolerance patterns
After all patterns validated in production

Checkpoint Creation Pattern:

# Automatic checkpoint creation at critical phases
if patterns_applied_count >= 3 or integration_tests_failed > 0:
    create_checkpoint({
        "patterns": applied_patterns,
        "tests": test_results,
        "observability": config_status,
        "tokens": current_token_usage
    })
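
The snippet above leaves `create_checkpoint` undefined. A minimal sketch of what it might do, assuming the file naming and `checkpoint_id` format described in this document, is shown below; the `directory` parameter and the `should_checkpoint` helper are additions for illustration.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def should_checkpoint(patterns_applied_count: int, integration_tests_failed: int) -> bool:
    """Same trigger as the condition above: enough progress or any failing test."""
    return patterns_applied_count >= 3 or integration_tests_failed > 0


def create_checkpoint(state: dict, directory) -> Path:
    """Write the state to production-patterns-{timestamp}.json and return the path."""
    now = datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d_%H%M%S")
    payload = {
        "checkpoint_id": f"ckpt_{stamp}",
        "created_at": now.isoformat(),
        **state,
    }
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"production-patterns-{stamp}.json"
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path
```

In a project this would be called with `.coditect/checkpoints` as the directory; keeping it a parameter makes the function easy to test against a temporary folder.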

Example: Multi-Context Pattern Implementation

Context Window 1: Circuit Breakers & Retries

{
  "checkpoint_id": "ckpt_patterns_phase1",
  "phase": "error_handling_complete",
  "patterns_applied": ["circuit_breaker", "retry_backoff", "timeout"],
  "tests_passed": 8,
  "next_action": "Implement observability patterns",
  "token_usage": 9000
}

Context Window 2: Observability & Validation

# Load checkpoint from Phase 1
cat .coditect/checkpoints/ckpt_patterns_phase1.json

# Continue with observability patterns
# Apply: metrics collection, structured logging, tracing

# Token savings: ~6000 tokens (no need to re-read circuit breaker code)

Token Savings Analysis:

Without checkpoint: 15000 tokens (re-implement patterns + verify)
With checkpoint: 9000 tokens (resume from validated state)
Savings: 40% reduction (15000 → 9000 tokens)
