

Multi-Provider LLM Fallback

How to Use This Skill

  1. Review the patterns and examples below
  2. Apply the relevant patterns to your implementation
  3. Follow the best practices outlined in this skill

Expert skill for intelligent routing across multiple LLM providers with automatic failover, cost optimization, and capability-based selection. Ensures high availability and optimal cost/performance balance.

When to Use

Use this skill when:

  • Building production systems that require high availability
  • You need cost optimization across multiple providers
  • You want automatic failover without manual intervention
  • Different tasks need different model capabilities
  • Rate limits on the primary provider are a concern

Don't use this skill when:

  • Single provider is sufficient and reliable
  • Testing/development with no reliability requirements
  • Fixed provider required by policy/compliance
  • Latency-critical with no time for fallback

Supported Providers

| Provider | Models | Strengths |
|---|---|---|
| Anthropic | Claude 3.5 Sonnet, Claude 3 Opus | Reasoning, coding, long context |
| OpenAI | GPT-4, GPT-4 Turbo, GPT-3.5 | General purpose, fast |
| Google | Gemini Pro, Gemini Ultra | Multimodal, large context |
| Azure OpenAI | GPT-4, GPT-3.5 (hosted) | Enterprise compliance |
| Local | Ollama, LM Studio | Privacy, no API costs |

Core Algorithm

Provider Router

from typing import List, Dict, Optional
from dataclasses import dataclass, field
from enum import Enum
from abc import ABC, abstractmethod
import time


class ProviderStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNAVAILABLE = "unavailable"
    RATE_LIMITED = "rate_limited"


class TaskType(Enum):
    CODING = "coding"
    REASONING = "reasoning"
    CREATIVE = "creative"
    FAST_RESPONSE = "fast_response"
    LONG_CONTEXT = "long_context"
    COST_SENSITIVE = "cost_sensitive"


@dataclass
class ProviderConfig:
    """Configuration for an LLM provider"""
    name: str
    api_key_env: str  # Environment variable holding the API key
    base_url: Optional[str] = None
    models: List[str] = field(default_factory=list)
    default_model: str = ""
    max_tokens: int = 4096        # Maximum output tokens per request
    context_window: int = 8192    # Maximum input context (tokens)
    cost_per_1k_input: float = 0.0
    cost_per_1k_output: float = 0.0
    rate_limit_rpm: int = 60
    timeout_seconds: int = 120
    strengths: List[TaskType] = field(default_factory=list)
    priority: int = 50            # Higher = preferred


@dataclass
class ProviderHealth:
    """Health status of a provider"""
    provider: str
    status: ProviderStatus
    last_success: float = 0.0
    last_failure: float = 0.0
    failure_count: int = 0
    rate_limit_reset: float = 0.0
    latency_ms: float = 0.0


@dataclass
class RoutingDecision:
    """Result of a routing decision"""
    provider: str
    model: str
    reason: str
    fallback_chain: List[str]
    estimated_cost: float


class LLMProvider(ABC):
    """Abstract base for LLM providers"""

    @abstractmethod
    async def complete(self, prompt: str, **kwargs) -> Dict:
        ...

    @abstractmethod
    def health_check(self) -> ProviderHealth:
        ...
class MultiProviderRouter:
    """
    Intelligent router across multiple LLM providers.

    Key features:
    - Automatic failover on errors
    - Cost-based routing
    - Capability-based selection
    - Health monitoring
    - Rate limit awareness
    """

    # Default provider configurations
    DEFAULT_PROVIDERS = {
        "anthropic": ProviderConfig(
            name="anthropic",
            api_key_env="ANTHROPIC_API_KEY",
            models=["claude-3-5-sonnet-20241022", "claude-3-opus-20240229"],
            default_model="claude-3-5-sonnet-20241022",
            max_tokens=8192,
            context_window=200_000,
            cost_per_1k_input=0.003,
            cost_per_1k_output=0.015,
            rate_limit_rpm=50,
            timeout_seconds=120,
            strengths=[TaskType.CODING, TaskType.REASONING, TaskType.LONG_CONTEXT],
            priority=90
        ),
        "openai": ProviderConfig(
            name="openai",
            api_key_env="OPENAI_API_KEY",
            models=["gpt-4-turbo", "gpt-4", "gpt-3.5-turbo"],
            default_model="gpt-4-turbo",
            max_tokens=4096,
            context_window=128_000,
            cost_per_1k_input=0.01,
            cost_per_1k_output=0.03,
            rate_limit_rpm=60,
            timeout_seconds=60,
            strengths=[TaskType.FAST_RESPONSE, TaskType.CREATIVE],
            priority=80
        ),
        "google": ProviderConfig(
            name="google",
            api_key_env="GOOGLE_API_KEY",
            base_url="https://generativelanguage.googleapis.com",
            models=["gemini-pro", "gemini-ultra"],
            default_model="gemini-pro",
            max_tokens=8192,
            context_window=1_000_000,
            cost_per_1k_input=0.0005,
            cost_per_1k_output=0.0015,
            rate_limit_rpm=60,
            timeout_seconds=90,
            strengths=[TaskType.LONG_CONTEXT, TaskType.COST_SENSITIVE],
            priority=70
        ),
        "local": ProviderConfig(
            name="local",
            api_key_env="",  # No key needed
            base_url="http://localhost:11434",  # Ollama default
            models=["llama2", "codellama", "mistral"],
            default_model="codellama",
            max_tokens=4096,
            context_window=32_768,
            cost_per_1k_input=0.0,
            cost_per_1k_output=0.0,
            rate_limit_rpm=1000,
            timeout_seconds=300,
            strengths=[TaskType.COST_SENSITIVE],
            priority=30
        ),
    }

    def __init__(self, providers: Optional[Dict[str, ProviderConfig]] = None):
        self.providers = providers or self.DEFAULT_PROVIDERS
        self.health: Dict[str, ProviderHealth] = {}
        self.request_counts: Dict[str, int] = {}
        self._initialize_health()

    def _initialize_health(self):
        """Initialize health status for all providers"""
        for name in self.providers:
            self.health[name] = ProviderHealth(
                provider=name,
                status=ProviderStatus.HEALTHY,
                last_success=time.time()
            )
            self.request_counts[name] = 0

    def select_provider(
        self,
        task_type: Optional[TaskType] = None,
        prefer_cost: bool = False,
        prefer_speed: bool = False,
        exclude: Optional[List[str]] = None,
        required_context_length: int = 4000
    ) -> RoutingDecision:
        """
        Select the best provider for the task.

        Args:
            task_type: Type of task (affects capability matching)
            prefer_cost: Prioritize cost over quality
            prefer_speed: Prioritize speed over quality
            exclude: Providers to skip (e.g., already failed)
            required_context_length: Minimum context window needed

        Returns:
            RoutingDecision with selected provider and fallback chain
        """
        exclude = exclude or []
        candidates = []

        for name, config in self.providers.items():
            # Skip excluded and unhealthy providers
            if name in exclude:
                continue

            health = self.health[name]
            if health.status == ProviderStatus.UNAVAILABLE:
                continue

            # Skip rate-limited providers until their window resets
            if health.status == ProviderStatus.RATE_LIMITED:
                if time.time() < health.rate_limit_reset:
                    continue

            # Check context window requirement
            if config.context_window < required_context_length:
                continue

            # Calculate score
            score = config.priority

            # Boost for matching task type
            if task_type and task_type in config.strengths:
                score += 20

            # Adjust for cost preference
            if prefer_cost:
                # Lower cost = higher score
                cost_score = 100 - (config.cost_per_1k_output * 1000)
                score += cost_score * 0.3

            # Adjust for speed preference
            if prefer_speed:
                # Lower timeout = faster expected response
                speed_score = 100 - (config.timeout_seconds / 3)
                score += speed_score * 0.3

            # Penalize degraded providers
            if health.status == ProviderStatus.DEGRADED:
                score -= 20

            # Penalize high recent latency
            if health.latency_ms > 5000:
                score -= 10

            candidates.append((name, score, config))

        # Sort by score, best first
        candidates.sort(key=lambda x: x[1], reverse=True)

        if not candidates:
            raise ProviderError("No available providers")

        selected = candidates[0]
        fallback_chain = [c[0] for c in candidates[1:4]]  # Next 3 as fallbacks

        return RoutingDecision(
            provider=selected[0],
            model=selected[2].default_model,
            reason=f"Score: {selected[1]:.1f}, Priority: {selected[2].priority}",
            fallback_chain=fallback_chain,
            estimated_cost=selected[2].cost_per_1k_output
        )

    async def complete_with_fallback(
        self,
        prompt: str,
        task_type: Optional[TaskType] = None,
        prefer_cost: bool = False,
        prefer_speed: bool = False,
        max_retries: int = 3,
        **kwargs
    ) -> Dict:
        """
        Complete a request with automatic fallback.

        Tries providers in order until one succeeds or all are exhausted.
        """
        attempted = []
        last_error = None

        for attempt in range(max_retries):
            try:
                decision = self.select_provider(
                    task_type=task_type,
                    prefer_cost=prefer_cost,
                    prefer_speed=prefer_speed,
                    exclude=attempted
                )

                provider_name = decision.provider
                attempted.append(provider_name)

                # Execute request and measure latency
                start_time = time.time()
                result = await self._execute_request(
                    provider_name,
                    prompt,
                    decision.model,
                    **kwargs
                )
                latency = (time.time() - start_time) * 1000

                # Update health on success
                self._update_health(provider_name, success=True, latency=latency)

                return {
                    "content": result["content"],
                    "provider": provider_name,
                    "model": decision.model,
                    "attempts": len(attempted),
                    "latency_ms": latency
                }

            except RateLimitError as e:
                self._update_health(
                    attempted[-1],
                    success=False,
                    rate_limited=True,
                    reset_time=e.reset_time
                )
                last_error = e

            except ProviderError as e:
                # select_provider can raise before anything was attempted
                if attempted:
                    self._update_health(attempted[-1], success=False)
                last_error = e

            except TimeoutError as e:
                self._update_health(attempted[-1], success=False, timeout=True)
                last_error = e

        raise FallbackExhaustedError(
            f"All providers failed after {max_retries} attempts",
            attempted=attempted,
            last_error=last_error
        )

    async def _execute_request(
        self,
        provider_name: str,
        prompt: str,
        model: str,
        **kwargs
    ) -> Dict:
        """Execute a request against a specific provider"""
        config = self.providers[provider_name]

        # Provider-specific implementation
        if provider_name == "anthropic":
            return await self._anthropic_complete(config, prompt, model, **kwargs)
        elif provider_name == "openai":
            return await self._openai_complete(config, prompt, model, **kwargs)
        elif provider_name == "google":
            return await self._google_complete(config, prompt, model, **kwargs)
        elif provider_name == "local":
            return await self._local_complete(config, prompt, model, **kwargs)
        else:
            raise ValueError(f"Unknown provider: {provider_name}")

    def _update_health(
        self,
        provider: str,
        success: bool,
        latency: float = 0,
        rate_limited: bool = False,
        reset_time: float = 0,
        timeout: bool = False
    ):
        """Update provider health status"""
        health = self.health[provider]

        if success:
            health.status = ProviderStatus.HEALTHY
            health.last_success = time.time()
            health.failure_count = 0
            health.latency_ms = latency
        else:
            health.last_failure = time.time()
            health.failure_count += 1

            if rate_limited:
                health.status = ProviderStatus.RATE_LIMITED
                health.rate_limit_reset = reset_time
            elif health.failure_count >= 3:
                health.status = ProviderStatus.UNAVAILABLE
            else:
                health.status = ProviderStatus.DEGRADED

    def get_health_report(self) -> Dict:
        """Get health status of all providers"""
        return {
            name: {
                "status": health.status.value,
                "last_success": health.last_success,
                "failure_count": health.failure_count,
                "latency_ms": health.latency_ms
            }
            for name, health in self.health.items()
        }

    # Provider-specific implementations (simplified; raise rather than
    # silently return None so a missing implementation fails loudly)
    async def _anthropic_complete(self, config, prompt, model, **kwargs):
        # Implementation using the anthropic SDK
        raise NotImplementedError

    async def _openai_complete(self, config, prompt, model, **kwargs):
        # Implementation using the openai SDK
        raise NotImplementedError

    async def _google_complete(self, config, prompt, model, **kwargs):
        # Implementation using the google SDK
        raise NotImplementedError

    async def _local_complete(self, config, prompt, model, **kwargs):
        # Implementation using local Ollama/LM Studio
        raise NotImplementedError


# Custom exceptions
class ProviderError(Exception):
    """Base error for provider failures"""
    pass


class RateLimitError(ProviderError):
    """Raised when a provider reports rate limiting"""
    def __init__(self, message, reset_time=0):
        super().__init__(message)
        self.reset_time = reset_time


class FallbackExhaustedError(Exception):
    """Raised when every provider in the chain has failed"""
    def __init__(self, message, attempted=None, last_error=None):
        super().__init__(message)
        self.attempted = attempted or []
        self.last_error = last_error

Usage Examples

Basic Fallback

router = MultiProviderRouter()

# Simple completion with automatic fallback
result = await router.complete_with_fallback(
    prompt="Explain the visitor pattern in Python",
    task_type=TaskType.CODING
)

print(f"Provider: {result['provider']}")
print(f"Attempts: {result['attempts']}")
print(f"Content: {result['content']}")

Cost-Optimized Routing

# Prefer cheaper providers for simple tasks
result = await router.complete_with_fallback(
    prompt="Summarize this text...",
    task_type=TaskType.COST_SENSITIVE,
    prefer_cost=True
)
# Will prefer Google Gemini or local models
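The per-request cost behind this routing follows directly from the `cost_per_1k_input`/`cost_per_1k_output` fields in `ProviderConfig`. A minimal standalone sketch of that arithmetic (the helper name is illustrative, not part of the router's API):

```python
# Hypothetical helper mirroring the per-1K-token pricing fields in
# ProviderConfig; not part of the router's public API.

def estimate_cost(input_tokens: int, output_tokens: int,
                  cost_per_1k_input: float, cost_per_1k_output: float) -> float:
    """Dollar cost of one request at the given per-1K-token rates."""
    return (input_tokens / 1000) * cost_per_1k_input \
         + (output_tokens / 1000) * cost_per_1k_output

# 2K input / 1K output at Anthropic's listed rates ($0.003 / $0.015 per 1K):
print(round(estimate_cost(2000, 1000, 0.003, 0.015), 4))  # 0.021
```

At these rates the same request on GPT-4 Turbo ($0.01 / $0.03) would cost $0.05, which is why cost-sensitive routing favors Gemini or local models.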

Capability-Based Selection

# Select based on task requirements
decision = router.select_provider(
    task_type=TaskType.LONG_CONTEXT,
    required_context_length=100_000
)
# Will select providers with large context windows (Anthropic, Google)

Health Monitoring

# Get current provider health
health = router.get_health_report()

for provider, status in health.items():
    print(f"{provider}: {status['status']} (latency: {status['latency_ms']}ms)")
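Rate-limit awareness (the `rate_limit_rpm` field above) can be enforced client-side with a sliding 60-second window of request timestamps. A self-contained sketch of that idea, separate from the router class and illustrative only:

```python
from collections import deque

# Sketch of client-side RPM tracking: refuse to dispatch a request once
# the provider's rate_limit_rpm would be exceeded in the last 60 seconds.

class RpmTracker:
    def __init__(self, rate_limit_rpm: int):
        self.rate_limit_rpm = rate_limit_rpm
        self.timestamps = deque()  # send times within the current window

    def allow(self, now: float) -> bool:
        """Record a request at `now` if under the limit; else refuse it."""
        # Drop timestamps that have aged out of the 60-second window
        while self.timestamps and now - self.timestamps[0] >= 60:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.rate_limit_rpm:
            return False
        self.timestamps.append(now)
        return True

tracker = RpmTracker(rate_limit_rpm=3)
print([tracker.allow(t) for t in (0, 1, 2, 3)])  # [True, True, True, False]
print(tracker.allow(61))                         # True (window slid forward)
```

A router could consult such a tracker before dispatching, and route to a fallback provider instead of waiting for a 429.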

Integration with CODITECT

With Adaptive Retry

from skills.adaptive_retry import AdaptiveRetryParameters
import asyncio


async def robust_llm_call(prompt: str) -> Dict:
    """Combine adaptive retry with multi-provider fallback"""
    router = MultiProviderRouter()
    retry_params = AdaptiveRetryParameters()

    for retry_count in range(3):
        try:
            # Get adjusted parameters for this retry
            params = retry_params.get_params(retry_count)

            result = await router.complete_with_fallback(
                prompt=prompt,
                max_tokens=params["max_tokens"],
                temperature=params["temperature"]
            )
            return result

        except FallbackExhaustedError:
            if retry_count == 2:
                raise
            # Wait before retrying all providers
            await asyncio.sleep(retry_params.get_backoff(retry_count))

With Orchestrator

# In orchestrator configuration
llm_routing:
  skill: multi-provider-llm-fallback
  default_task_type: coding
  cost_threshold: 0.10     # Max cost per request
  timeout_threshold: 60    # Seconds before fallback

Configuration

| Parameter | Default | Description |
|---|---|---|
| max_retries | 3 | Maximum fallback attempts |
| failure_threshold | 3 | Failures before marking unavailable |
| rate_limit_buffer | 0.9 | Use 90% of rate limit |
| health_check_interval | 60 | Seconds between health checks |
| cost_tracking | true | Track and report costs |
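The `failure_threshold` parameter drives the health escalation seen in `_update_health`: zero failures is healthy, anything below the threshold is degraded, and at or beyond it the provider is marked unavailable. A tiny standalone sketch of that mapping (names are illustrative, not the router's API):

```python
from enum import Enum

class Status(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNAVAILABLE = "unavailable"

def escalate(failure_count: int, failure_threshold: int = 3) -> Status:
    """Map a consecutive-failure count to a health status."""
    if failure_count == 0:
        return Status.HEALTHY
    if failure_count < failure_threshold:
        return Status.DEGRADED
    return Status.UNAVAILABLE

print([escalate(n).value for n in range(4)])
# ['healthy', 'degraded', 'degraded', 'unavailable']
```

A successful request resets the count to zero, so one recovery is enough to restore a degraded provider to the candidate pool.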

Success Metrics

| Metric | Target |
|---|---|
| Request success rate | 99.5% |
| Average fallback attempts | <1.5 |
| Cost optimization | 20-40% savings |
| Latency overhead | <100ms |

Success Output

When this skill is successfully applied, output:

✅ SKILL COMPLETE: multi-provider-llm-fallback

Completed:
- [x] MultiProviderRouter configured with 4 providers (Anthropic, OpenAI, Google, Local)
- [x] Provider health monitoring active (status, latency, failure tracking)
- [x] Capability-based routing implemented (task type matching)
- [x] Automatic fallback tested (3 providers tried, success on 2nd attempt)
- [x] Cost optimization configured (prefer cheaper for cost-sensitive tasks)

Outputs:
- MultiProviderRouter class with intelligent routing
- ProviderConfig for all providers (API keys, models, costs, rate limits)
- Health monitoring dashboard (status, latency, failure counts)
- Fallback chain documentation (primary → fallback order)

Metrics:
- Request success rate: 99.5%
- Average fallback attempts: 1.2
- Cost savings: 32% (vs. always using most expensive)
- Average latency overhead: 65ms

Completion Checklist

Before marking this skill as complete, verify:

  • At least 2 providers configured with valid API keys
  • ProviderConfig includes models, costs, rate limits, strengths
  • Health monitoring tracks status, failures, latency for each provider
  • Routing logic selects provider based on task type, cost, speed preferences
  • Automatic fallback tries alternative providers on failure
  • Rate limit detection prevents hitting provider limits
  • Provider degradation detected after 3 failures
  • Health report shows current status of all providers
  • Integration with adaptive retry (if applicable)
  • Cost tracking and reporting implemented

Failure Indicators

This skill has FAILED if:

  • ❌ All providers unavailable (FallbackExhaustedError)
  • ❌ Primary provider always selected despite degradation
  • ❌ Rate limits exceeded causing 429 errors
  • ❌ No fallback attempted when primary fails
  • ❌ Health status not updated on failures
  • ❌ Cost optimization disabled or not working
  • ❌ Provider selection ignores task type capabilities
  • ❌ Latency overhead >500ms (routing too slow)

When NOT to Use

Do NOT use this skill when:

  • Single provider sufficient and reliable - complexity not justified
  • Fixed provider required by policy/compliance - no alternatives allowed
  • Testing/development with no availability requirements - single provider adequate
  • Latency-critical with no time for fallback (<100ms response time) - use single fast provider
  • Budget for only one provider - can't afford multiple API subscriptions
  • Provider selection requires human judgment - automated routing inappropriate

Use alternatives instead:

  • Single reliable provider → Direct API calls with retry
  • Compliance constraints → Use approved provider only
  • Development → Mock LLM responses
  • Latency-critical → Pre-computed responses or cache

Anti-Patterns (Avoid)

| Anti-Pattern | Problem | Solution |
|---|---|---|
| No health monitoring | Continues sending to failed provider | Track failures, mark unavailable after 3 |
| Always using most expensive | Unnecessary costs | Route cost-sensitive tasks to cheaper providers |
| No rate limit awareness | Hits 429 errors, wastes time | Track rate limits, pause provider when limited |
| Fallback to identical provider | No redundancy benefit | Ensure fallback chain has different providers |
| Ignoring task requirements | Wrong model for task (e.g., small context) | Match task type to provider strengths |
| No cost tracking | Budget overruns undetected | Log and aggregate costs per provider |
| Synchronous fallback only | High latency on failures | Consider parallel requests with circuit breaker |
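The last anti-pattern above suggests racing providers in parallel rather than falling back strictly in sequence. A hedged asyncio sketch of that idea, using dummy coroutines in place of real provider calls (this is not the `MultiProviderRouter` above, just the racing pattern in isolation):

```python
import asyncio

# Sketch of "parallel fallback": dispatch to several providers at once,
# take the first result that completes, cancel the rest.

async def fast_provider():
    await asyncio.sleep(0.01)   # stands in for a quick provider call
    return "fast-result"

async def slow_provider():
    await asyncio.sleep(1.0)    # stands in for a slow provider call
    return "slow-result"

async def race(*providers):
    tasks = [asyncio.create_task(p()) for p in providers]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:        # cancel the losers to avoid wasted work
        task.cancel()
    return done.pop().result()

print(asyncio.run(race(fast_provider, slow_provider)))  # fast-result
```

The trade-off is cost: racing pays for every dispatched request, so it fits latency-critical paths, not cost-sensitive batch work.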

Principles

This skill embodies:

  • #2 First Principles Thinking - Understand provider capabilities before routing
  • #3 Keep It Simple (KISS) - Start with 2 providers, add more only if needed
  • #4 Separation of Concerns - Router separate from provider-specific implementations
  • #7 Automation - Automatic health monitoring and failover
  • #8 No Assumptions - Verify provider availability, don't assume success
  • #11 Resilience - Graceful degradation with fallback chains

Full Standard: CODITECT-STANDARD-AUTOMATION.md

Provider Comparison Quick Reference

| Provider | Best For | Cost ($/1K out) | Context | Latency | Reliability |
|---|---|---|---|---|---|
| Anthropic Claude | Coding, Reasoning | $0.015 | 200K | Medium | High |
| OpenAI GPT-4 | General, Creative | $0.03 | 128K | Fast | High |
| Google Gemini | Long Context, Cost | $0.0015 | 1M | Medium | Medium |
| Azure OpenAI | Enterprise, Compliance | $0.03 | 128K | Fast | Very High |
| Local (Ollama) | Privacy, Cost | $0.00 | 32K | Variable | Depends |

Provider Selection Decision Tree:

What's the primary requirement?

├── Coding/Technical → Anthropic Claude (primary)
│   └── Fallback: OpenAI GPT-4
├── Cost Optimization → Google Gemini (primary)
│   └── Fallback: Local Ollama
├── Enterprise Compliance → Azure OpenAI (primary)
│   └── Fallback: Anthropic Claude
├── Long Context (>128K) → Google Gemini (primary)
│   └── Fallback: Anthropic Claude
└── Maximum Reliability → Multi-provider with fallback chain
    Anthropic → OpenAI → Google → Local

Recommended Fallback Chains:

| Task Type | Primary | Fallback 1 | Fallback 2 |
|---|---|---|---|
| Code Generation | Anthropic | OpenAI | Local |
| Creative Writing | OpenAI | Anthropic | Google |
| Long Document Analysis | Google | Anthropic | OpenAI |
| Enterprise/Compliance | Azure | Anthropic | Google |
| Cost-Sensitive Batch | Google | Local | Anthropic |
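These recommended chains can be kept as plain data and filtered against the current health report before dispatch. A standalone sketch (the dictionary mirrors the table above; the function name and keys are illustrative):

```python
# The fallback-chain table as lookup data; chain order is illustrative.
FALLBACK_CHAINS = {
    "code_generation": ["anthropic", "openai", "local"],
    "creative_writing": ["openai", "anthropic", "google"],
    "long_document_analysis": ["google", "anthropic", "openai"],
    "enterprise_compliance": ["azure", "anthropic", "google"],
    "cost_sensitive_batch": ["google", "local", "anthropic"],
}

def chain_for(task: str, unavailable: frozenset = frozenset()) -> list:
    """Return the fallback chain for a task, skipping unavailable providers."""
    return [p for p in FALLBACK_CHAINS[task] if p not in unavailable]

print(chain_for("code_generation", unavailable=frozenset({"anthropic"})))
# ['openai', 'local']
```

Keeping the chains as data rather than code makes them easy to override per deployment, e.g. from the orchestrator configuration shown earlier.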

Source Reference

Pattern extracted from DeepCode multi-agent system.

See /submodules/labs/DeepCode/DEEP-ANALYSIS.md for complete analysis.