Multi-Provider LLM Fallback
How to Use This Skill
- Review the patterns and examples below
- Apply the relevant patterns to your implementation
- Follow the best practices outlined in this skill
Expert skill for intelligent routing across multiple LLM providers with automatic failover, cost optimization, and capability-based selection. Ensures high availability and optimal cost/performance balance.
When to Use
Use this skill when:
- Building production systems requiring high availability
- Need cost optimization across multiple providers
- Want automatic failover without manual intervention
- Different tasks need different model capabilities
- Rate limits on primary provider are a concern
Don't use this skill when:
- Single provider is sufficient and reliable
- Testing/development with no reliability requirements
- Fixed provider required by policy/compliance
- Latency-critical with no time for fallback
Supported Providers
| Provider | Models | Strengths |
|---|---|---|
| Anthropic | Claude 3.5 Sonnet, Claude 3 Opus | Reasoning, coding, long context |
| OpenAI | GPT-4, GPT-4 Turbo, GPT-3.5 | General purpose, fast |
| Google | Gemini Pro, Gemini Ultra | Multimodal, large context |
| Azure OpenAI | GPT-4, GPT-3.5 (hosted) | Enterprise compliance |
| Local | Ollama, LM Studio | Privacy, no API costs |
Core Algorithm
Provider Router
from typing import List, Dict, Optional, Callable, Any
from dataclasses import dataclass, field
from enum import Enum
import time
import asyncio
from abc import ABC, abstractmethod

class ProviderStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNAVAILABLE = "unavailable"
    RATE_LIMITED = "rate_limited"

class TaskType(Enum):
    CODING = "coding"
    REASONING = "reasoning"
    CREATIVE = "creative"
    FAST_RESPONSE = "fast_response"
    LONG_CONTEXT = "long_context"
    COST_SENSITIVE = "cost_sensitive"
@dataclass
class ProviderConfig:
    """Configuration for an LLM provider"""
    name: str
    api_key_env: str  # Environment variable holding the API key
    base_url: Optional[str] = None
    models: List[str] = field(default_factory=list)
    default_model: str = ""
    max_tokens: int = 4096
    cost_per_1k_input: float = 0.0
    cost_per_1k_output: float = 0.0
    rate_limit_rpm: int = 60
    timeout_seconds: int = 120
    strengths: List[TaskType] = field(default_factory=list)
    priority: int = 50  # Higher = preferred

@dataclass
class ProviderHealth:
    """Health status of a provider"""
    provider: str
    status: ProviderStatus
    last_success: float = 0.0
    last_failure: float = 0.0
    failure_count: int = 0
    rate_limit_reset: float = 0.0
    latency_ms: float = 0.0

@dataclass
class RoutingDecision:
    """Result of a routing decision"""
    provider: str
    model: str
    reason: str
    fallback_chain: List[str]
    estimated_cost: float
class LLMProvider(ABC):
    """Abstract base for LLM providers"""

    @abstractmethod
    async def complete(self, prompt: str, **kwargs) -> Dict:
        pass

    @abstractmethod
    def health_check(self) -> ProviderHealth:
        pass
class MultiProviderRouter:
    """
    Intelligent router across multiple LLM providers.

    Key features:
    - Automatic failover on errors
    - Cost-based routing
    - Capability-based selection
    - Health monitoring
    - Rate limit awareness
    """

    # Default provider configurations
    DEFAULT_PROVIDERS = {
        "anthropic": ProviderConfig(
            name="anthropic",
            api_key_env="ANTHROPIC_API_KEY",
            models=["claude-3-5-sonnet-20241022", "claude-3-opus-20240229"],
            default_model="claude-3-5-sonnet-20241022",
            max_tokens=8192,
            cost_per_1k_input=0.003,
            cost_per_1k_output=0.015,
            rate_limit_rpm=50,
            timeout_seconds=120,
            strengths=[TaskType.CODING, TaskType.REASONING, TaskType.LONG_CONTEXT],
            priority=90
        ),
        "openai": ProviderConfig(
            name="openai",
            api_key_env="OPENAI_API_KEY",
            models=["gpt-4-turbo", "gpt-4", "gpt-3.5-turbo"],
            default_model="gpt-4-turbo",
            max_tokens=4096,
            cost_per_1k_input=0.01,
            cost_per_1k_output=0.03,
            rate_limit_rpm=60,
            timeout_seconds=60,
            strengths=[TaskType.FAST_RESPONSE, TaskType.CREATIVE],
            priority=80
        ),
        "google": ProviderConfig(
            name="google",
            api_key_env="GOOGLE_API_KEY",
            base_url="https://generativelanguage.googleapis.com",
            models=["gemini-pro", "gemini-ultra"],
            default_model="gemini-pro",
            max_tokens=8192,
            cost_per_1k_input=0.0005,
            cost_per_1k_output=0.0015,
            rate_limit_rpm=60,
            timeout_seconds=90,
            strengths=[TaskType.LONG_CONTEXT, TaskType.COST_SENSITIVE],
            priority=70
        ),
        "local": ProviderConfig(
            name="local",
            api_key_env="",  # No key needed
            base_url="http://localhost:11434",  # Ollama default
            models=["llama2", "codellama", "mistral"],
            default_model="codellama",
            max_tokens=4096,
            cost_per_1k_input=0.0,
            cost_per_1k_output=0.0,
            rate_limit_rpm=1000,
            timeout_seconds=300,
            strengths=[TaskType.COST_SENSITIVE],
            priority=30
        ),
    }
    def __init__(self, providers: Optional[Dict[str, ProviderConfig]] = None):
        self.providers = providers or self.DEFAULT_PROVIDERS
        self.health: Dict[str, ProviderHealth] = {}
        self.request_counts: Dict[str, int] = {}
        self._initialize_health()

    def _initialize_health(self):
        """Initialize health status for all providers"""
        for name in self.providers:
            self.health[name] = ProviderHealth(
                provider=name,
                status=ProviderStatus.HEALTHY,
                last_success=time.time()
            )
            self.request_counts[name] = 0
    def select_provider(
        self,
        task_type: Optional[TaskType] = None,
        prefer_cost: bool = False,
        prefer_speed: bool = False,
        exclude: Optional[List[str]] = None,
        required_context_length: int = 4000
    ) -> RoutingDecision:
        """
        Select the best provider for the task.

        Args:
            task_type: Type of task (affects capability matching)
            prefer_cost: Prioritize cost over quality
            prefer_speed: Prioritize speed over quality
            exclude: Providers to skip (e.g., already failed)
            required_context_length: Minimum context window needed

        Returns:
            RoutingDecision with selected provider and fallback chain
        """
        exclude = exclude or []
        candidates = []
        for name, config in self.providers.items():
            # Skip excluded and unhealthy providers
            if name in exclude:
                continue
            health = self.health[name]
            if health.status == ProviderStatus.UNAVAILABLE:
                continue
            # Skip rate-limited providers until their limit resets
            if health.status == ProviderStatus.RATE_LIMITED:
                if time.time() < health.rate_limit_reset:
                    continue
            # Check context length requirement
            # (max_tokens is used as a rough proxy for the context window)
            if config.max_tokens < required_context_length:
                continue
            # Calculate score
            score = config.priority
            # Boost for matching task type
            if task_type and task_type in config.strengths:
                score += 20
            # Adjust for cost preference
            if prefer_cost:
                # Lower cost = higher score
                cost_score = 100 - (config.cost_per_1k_output * 1000)
                score += cost_score * 0.3
            # Adjust for speed preference
            if prefer_speed:
                # Lower timeout = faster expected response
                speed_score = 100 - (config.timeout_seconds / 3)
                score += speed_score * 0.3
            # Penalize degraded providers
            if health.status == ProviderStatus.DEGRADED:
                score -= 20
            # Penalize high recent latency
            if health.latency_ms > 5000:
                score -= 10
            candidates.append((name, score, config))
        # Sort by score, best first
        candidates.sort(key=lambda x: x[1], reverse=True)
        if not candidates:
            raise Exception("No available providers")
        selected = candidates[0]
        fallback_chain = [c[0] for c in candidates[1:4]]  # Next 3 as fallback
        return RoutingDecision(
            provider=selected[0],
            model=selected[2].default_model,
            reason=f"Score: {selected[1]:.1f}, Priority: {selected[2].priority}",
            fallback_chain=fallback_chain,
            estimated_cost=selected[2].cost_per_1k_output
        )
    async def complete_with_fallback(
        self,
        prompt: str,
        task_type: Optional[TaskType] = None,
        max_retries: int = 3,
        prefer_cost: bool = False,
        prefer_speed: bool = False,
        **kwargs
    ) -> Dict:
        """
        Complete a request with automatic fallback.
        Tries providers in order until one succeeds or all are exhausted.
        Routing preferences (prefer_cost, prefer_speed) are forwarded
        to select_provider.
        """
        attempted = []
        last_error = None
        for attempt in range(max_retries):
            try:
                decision = self.select_provider(
                    task_type=task_type,
                    prefer_cost=prefer_cost,
                    prefer_speed=prefer_speed,
                    exclude=attempted
                )
                provider_name = decision.provider
                attempted.append(provider_name)
                # Execute request
                start_time = time.time()
                result = await self._execute_request(
                    provider_name,
                    prompt,
                    decision.model,
                    **kwargs
                )
                latency = (time.time() - start_time) * 1000
                # Update health on success
                self._update_health(provider_name, success=True, latency=latency)
                return {
                    "content": result["content"],
                    "provider": provider_name,
                    "model": decision.model,
                    "attempts": len(attempted),
                    "latency_ms": latency
                }
            except RateLimitError as e:
                self._update_health(
                    attempted[-1],
                    success=False,
                    rate_limited=True,
                    reset_time=e.reset_time
                )
                last_error = e
            except ProviderError as e:
                self._update_health(attempted[-1], success=False)
                last_error = e
            except TimeoutError as e:
                self._update_health(attempted[-1], success=False, timeout=True)
                last_error = e
        raise FallbackExhaustedError(
            f"All providers failed after {max_retries} attempts",
            attempted=attempted,
            last_error=last_error
        )
    async def _execute_request(
        self,
        provider_name: str,
        prompt: str,
        model: str,
        **kwargs
    ) -> Dict:
        """Execute a request against a specific provider"""
        config = self.providers[provider_name]
        # Provider-specific implementation
        if provider_name == "anthropic":
            return await self._anthropic_complete(config, prompt, model, **kwargs)
        elif provider_name == "openai":
            return await self._openai_complete(config, prompt, model, **kwargs)
        elif provider_name == "google":
            return await self._google_complete(config, prompt, model, **kwargs)
        elif provider_name == "local":
            return await self._local_complete(config, prompt, model, **kwargs)
        else:
            raise ValueError(f"Unknown provider: {provider_name}")
    def _update_health(
        self,
        provider: str,
        success: bool,
        latency: float = 0,
        rate_limited: bool = False,
        reset_time: float = 0,
        timeout: bool = False
    ):
        """Update provider health status"""
        health = self.health[provider]
        if success:
            health.status = ProviderStatus.HEALTHY
            health.last_success = time.time()
            health.failure_count = 0
            health.latency_ms = latency
        else:
            health.last_failure = time.time()
            health.failure_count += 1
            if rate_limited:
                health.status = ProviderStatus.RATE_LIMITED
                health.rate_limit_reset = reset_time
            elif health.failure_count >= 3:
                health.status = ProviderStatus.UNAVAILABLE
            else:
                # Timeouts are currently treated like any other failure
                health.status = ProviderStatus.DEGRADED
    def get_health_report(self) -> Dict:
        """Get health status of all providers"""
        return {
            name: {
                "status": health.status.value,
                "last_success": health.last_success,
                "failure_count": health.failure_count,
                "latency_ms": health.latency_ms
            }
            for name, health in self.health.items()
        }

    # Provider-specific implementations (simplified stubs; fail loudly
    # rather than silently returning None)
    async def _anthropic_complete(self, config, prompt, model, **kwargs):
        # Implementation using the anthropic SDK
        raise NotImplementedError

    async def _openai_complete(self, config, prompt, model, **kwargs):
        # Implementation using the openai SDK
        raise NotImplementedError

    async def _google_complete(self, config, prompt, model, **kwargs):
        # Implementation using the google SDK
        raise NotImplementedError

    async def _local_complete(self, config, prompt, model, **kwargs):
        # Implementation using local Ollama/LM Studio
        raise NotImplementedError
# Custom exceptions
class ProviderError(Exception):
    pass

class RateLimitError(ProviderError):
    def __init__(self, message, reset_time=0):
        super().__init__(message)
        self.reset_time = reset_time

class FallbackExhaustedError(Exception):
    def __init__(self, message, attempted=None, last_error=None):
        super().__init__(message)
        self.attempted = attempted or []
        self.last_error = last_error
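One gap in the router above: a provider marked UNAVAILABLE after three failures is never brought back. A minimal recovery sweep could return long-quiet providers to DEGRADED so the scoring penalty keeps them as a last resort until a real success promotes them. This is a sketch, not part of the source: `recover_providers` is a hypothetical helper, and `ProviderHealth` is re-declared here in trimmed form to keep the example self-contained.

```python
import time
from dataclasses import dataclass
from enum import Enum

class ProviderStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNAVAILABLE = "unavailable"
    RATE_LIMITED = "rate_limited"

@dataclass
class ProviderHealth:
    provider: str
    status: ProviderStatus
    last_failure: float = 0.0
    failure_count: int = 0

def recover_providers(health: dict, cooldown_seconds: float = 300.0) -> list:
    """Return providers quiet past the cooldown to DEGRADED for re-probing."""
    recovered = []
    now = time.time()
    for name, h in health.items():
        if h.status is ProviderStatus.UNAVAILABLE and now - h.last_failure > cooldown_seconds:
            h.status = ProviderStatus.DEGRADED  # penalized, but selectable again
            h.failure_count = 0
            recovered.append(name)
    return recovered
```

Running this on a timer (or before each routing decision) gives the circuit-breaker a half-open state without any extra bookkeeping.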
Usage Examples
Basic Fallback
router = MultiProviderRouter()
# Simple completion with automatic fallback
result = await router.complete_with_fallback(
prompt="Explain the visitor pattern in Python",
task_type=TaskType.CODING
)
print(f"Provider: {result['provider']}")
print(f"Attempts: {result['attempts']}")
print(f"Content: {result['content']}")
Cost-Optimized Routing
# Prefer cheaper providers for simple tasks
result = await router.complete_with_fallback(
prompt="Summarize this text...",
task_type=TaskType.COST_SENSITIVE,
prefer_cost=True
)
# Will prefer Google Gemini or local models
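Note that `RoutingDecision.estimated_cost` above only carries the per-1K output rate. A fuller per-request estimate can be derived from both `ProviderConfig` pricing fields; `estimate_request_cost` is a hypothetical helper, and the token counts are assumed to come from the provider's usage metadata:

```python
def estimate_request_cost(cost_per_1k_input: float, cost_per_1k_output: float,
                          input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request from per-1K-token rates."""
    return (input_tokens / 1000) * cost_per_1k_input \
         + (output_tokens / 1000) * cost_per_1k_output

# Anthropic rates from DEFAULT_PROVIDERS: $0.003/1K in, $0.015/1K out
cost = estimate_request_cost(0.003, 0.015, input_tokens=2000, output_tokens=1000)
# 2 * 0.003 + 1 * 0.015 = 0.021
```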
Capability-Based Selection
# Select based on task requirements
decision = router.select_provider(
task_type=TaskType.LONG_CONTEXT,
required_context_length=100000
)
# Will select providers with large context windows (Anthropic, Google)
Health Monitoring
# Get current provider health
health = router.get_health_report()
for provider, status in health.items():
    print(f"{provider}: {status['status']} (latency: {status['latency_ms']}ms)")
Integration with CODITECT
With Adaptive Retry
from skills.adaptive_retry import AdaptiveRetryParameters

async def robust_llm_call(prompt: str) -> Dict:
    """Combine adaptive retry with multi-provider fallback"""
    router = MultiProviderRouter()
    retry_params = AdaptiveRetryParameters()
    for retry_count in range(3):
        try:
            # Get adjusted parameters for this retry
            params = retry_params.get_params(retry_count)
            result = await router.complete_with_fallback(
                prompt=prompt,
                max_tokens=params["max_tokens"],
                temperature=params["temperature"]
            )
            return result
        except FallbackExhaustedError:
            if retry_count == 2:
                raise
            # Wait before retrying all providers
            await asyncio.sleep(retry_params.get_backoff(retry_count))
With Orchestrator
# In orchestrator configuration
llm_routing:
  skill: multi-provider-llm-fallback
  default_task_type: coding
  cost_threshold: 0.10    # Max cost per request
  timeout_threshold: 60   # Seconds before fallback
Configuration
| Parameter | Default | Description |
|---|---|---|
| max_retries | 3 | Maximum fallback attempts |
| failure_threshold | 3 | Failures before marking unavailable |
| rate_limit_buffer | 0.9 | Use 90% of rate limit |
| health_check_interval | 60 | Seconds between health checks |
| cost_tracking | true | Track and report costs |
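The `rate_limit_buffer` parameter is described here but not wired into the router code above. One way to honor it is a client-side sliding one-minute window per provider; this is a sketch under that assumption, and `RpmLimiter` is a hypothetical name:

```python
import time
from collections import deque
from typing import Optional

class RpmLimiter:
    """Allow at most rate_limit_rpm * buffer requests in any rolling
    60-second window (buffer=0.9 matches rate_limit_buffer above)."""

    def __init__(self, rate_limit_rpm: int, buffer: float = 0.9):
        self.limit = int(rate_limit_rpm * buffer)
        self.timestamps = deque()

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # Evict requests that have left the 60-second window
        while self.timestamps and now - self.timestamps[0] >= 60:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True
```

`select_provider` could consult one limiter per provider and skip any provider whose `allow()` returns False, instead of waiting for a 429 to flip the status to RATE_LIMITED.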
Success Metrics
| Metric | Target |
|---|---|
| Request success rate | 99.5% |
| Average fallback attempts | <1.5 |
| Cost optimization | 20-40% savings |
| Latency overhead | <100ms |
Success Output
When this skill is successfully applied, output:
✅ SKILL COMPLETE: multi-provider-llm-fallback
Completed:
- [x] MultiProviderRouter configured with 4 providers (Anthropic, OpenAI, Google, Local)
- [x] Provider health monitoring active (status, latency, failure tracking)
- [x] Capability-based routing implemented (task type matching)
- [x] Automatic fallback tested (3 providers tried, success on 2nd attempt)
- [x] Cost optimization configured (prefer cheaper for cost-sensitive tasks)
Outputs:
- MultiProviderRouter class with intelligent routing
- ProviderConfig for all providers (API keys, models, costs, rate limits)
- Health monitoring dashboard (status, latency, failure counts)
- Fallback chain documentation (primary → fallback order)
Metrics:
- Request success rate: 99.5%
- Average fallback attempts: 1.2
- Cost savings: 32% (vs. always using most expensive)
- Average latency overhead: 65ms
Completion Checklist
Before marking this skill as complete, verify:
- At least 2 providers configured with valid API keys
- ProviderConfig includes models, costs, rate limits, strengths
- Health monitoring tracks status, failures, latency for each provider
- Routing logic selects provider based on task type, cost, speed preferences
- Automatic fallback tries alternative providers on failure
- Rate limit detection prevents hitting provider limits
- Provider degradation detected after 3 failures
- Health report shows current status of all providers
- Integration with adaptive retry (if applicable)
- Cost tracking and reporting implemented
Failure Indicators
This skill has FAILED if:
- ❌ All providers unavailable (FallbackExhaustedError)
- ❌ Primary provider always selected despite degradation
- ❌ Rate limits exceeded causing 429 errors
- ❌ No fallback attempted when primary fails
- ❌ Health status not updated on failures
- ❌ Cost optimization disabled or not working
- ❌ Provider selection ignores task type capabilities
- ❌ Latency overhead >500ms (routing too slow)
When NOT to Use
Do NOT use this skill when:
- Single provider sufficient and reliable - complexity not justified
- Fixed provider required by policy/compliance - no alternatives allowed
- Testing/development with no availability requirements - single provider adequate
- Latency-critical with no time for fallback (<100ms response time) - use single fast provider
- Budget for only one provider - can't afford multiple API subscriptions
- Provider selection requires human judgment - automated routing inappropriate
Use alternatives instead:
- Single reliable provider → Direct API calls with retry
- Compliance constraints → Use approved provider only
- Development → Mock LLM responses
- Latency-critical → Pre-computed responses or cache
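For the first alternative (a single reliable provider with retry), a plain exponential-backoff wrapper is usually sufficient; `call_with_retry` is a hypothetical helper sketch, with the provider call passed in as a zero-argument coroutine function:

```python
import asyncio
import random

async def call_with_retry(call, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry a zero-arg async call with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return await call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base, 2x, 4x... with up to 10% jitter
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay * (1 + random.random() * 0.1))
```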
Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| No health monitoring | Continues sending to failed provider | Track failures, mark unavailable after 3 |
| Always using most expensive | Unnecessary costs | Route cost-sensitive tasks to cheaper providers |
| No rate limit awareness | Hits 429 errors, wastes time | Track rate limits, pause provider when limited |
| Fallback to identical provider | No redundancy benefit | Ensure fallback chain has different providers |
| Ignoring task requirements | Wrong model for task (e.g., small context) | Match task type to provider strengths |
| No cost tracking | Budget overruns undetected | Log and aggregate costs per provider |
| Synchronous fallback only | High latency on failures | Consider parallel requests with circuit breaker |
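The last anti-pattern row suggests parallel requests instead of purely sequential fallback. A minimal hedged-request sketch with asyncio, assuming `primary` and `fallback` are zero-arg coroutine functions standing in for provider calls (`hedged_request` is a hypothetical name, not part of the router above):

```python
import asyncio

async def hedged_request(primary, fallback, hedge_delay: float = 0.5):
    """Start the fallback only if the primary hasn't answered within
    hedge_delay; whichever finishes first wins, the loser is cancelled."""
    primary_task = asyncio.create_task(primary())
    try:
        # shield() keeps the timeout from cancelling the primary task
        return await asyncio.wait_for(asyncio.shield(primary_task), timeout=hedge_delay)
    except asyncio.TimeoutError:
        pass  # primary is slow: race it against the fallback
    fallback_task = asyncio.create_task(fallback())
    done, pending = await asyncio.wait(
        {primary_task, fallback_task}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    return done.pop().result()
```

This bounds worst-case latency at roughly `hedge_delay` plus the fallback's latency, at the cost of occasionally paying for two requests.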
Principles
This skill embodies:
- #2 First Principles Thinking - Understand provider capabilities before routing
- #3 Keep It Simple (KISS) - Start with 2 providers, add more only if needed
- #4 Separation of Concerns - Router separate from provider-specific implementations
- #7 Automation - Automatic health monitoring and failover
- #8 No Assumptions - Verify provider availability, don't assume success
- #11 Resilience - Graceful degradation with fallback chains
Full Standard: CODITECT-STANDARD-AUTOMATION.md
Provider Comparison Quick Reference
| Provider | Best For | Cost ($/1K out) | Context | Latency | Reliability |
|---|---|---|---|---|---|
| Anthropic Claude | Coding, Reasoning | $0.015 | 200K | Medium | High |
| OpenAI GPT-4 | General, Creative | $0.03 | 128K | Fast | High |
| Google Gemini | Long Context, Cost | $0.0015 | 1M | Medium | Medium |
| Azure OpenAI | Enterprise, Compliance | $0.03 | 128K | Fast | Very High |
| Local (Ollama) | Privacy, Cost | $0.00 | 32K | Variable | Depends |
Provider Selection Decision Tree:
What's the primary requirement?
│
├── Coding/Technical → Anthropic Claude (primary)
│ └── Fallback: OpenAI GPT-4
│
├── Cost Optimization → Google Gemini (primary)
│ └── Fallback: Local Ollama
│
├── Enterprise Compliance → Azure OpenAI (primary)
│ └── Fallback: Anthropic Claude
│
├── Long Context (>128K) → Google Gemini (primary)
│ └── Fallback: Anthropic Claude
│
└── Maximum Reliability → Multi-provider with fallback chain
Anthropic → OpenAI → Google → Local
Recommended Fallback Chains:
| Task Type | Primary | Fallback 1 | Fallback 2 |
|---|---|---|---|
| Code Generation | Anthropic | OpenAI | Local |
| Creative Writing | OpenAI | Anthropic | |
| Long Document Analysis | Anthropic | OpenAI | |
| Enterprise/Compliance | Azure | Anthropic | |
| Cost-Sensitive Batch | Local | Anthropic | |
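The recommended chains above can be encoded as plain data and filtered by current availability; this is an illustrative sketch (the `FALLBACK_CHAINS` and `chain_for` names are hypothetical, and "azure" is assumed to be configured separately since it is not in DEFAULT_PROVIDERS):

```python
# Recommended chains from the table, keyed by task; names match
# DEFAULT_PROVIDERS keys ("azure" assumed configured separately)
FALLBACK_CHAINS = {
    "code_generation": ["anthropic", "openai", "local"],
    "creative_writing": ["openai", "anthropic"],
    "long_document_analysis": ["anthropic", "openai"],
    "enterprise_compliance": ["azure", "anthropic"],
    "cost_sensitive_batch": ["local", "anthropic"],
}

def chain_for(task: str, available: set) -> list:
    """Filter a task's chain down to currently available providers,
    preserving the preference order."""
    return [p for p in FALLBACK_CHAINS.get(task, []) if p in available]
```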
Source Reference
Pattern extracted from DeepCode multi-agent system.
See /submodules/labs/DeepCode/DEEP-ANALYSIS.md for complete analysis.