Skip to main content

TaskExecutor Refactoring Project Plan

TaskExecutor Refactoring - Project Plan

Document Version: 1.0 Last Updated: 2025-11-23 Document Owner: Hal Casteel, CEO/CTO, AZ1.AI INC. Project Type: Strategic Refactoring (Foundation for Phase 1 Autonomous Agents) Status: PLANNING - Ready for Go/No-Go Decision


Executive Summary

This project-plan.md provides the comprehensive strategy for refactoring the TaskExecutor to use the llm_abstractions layer directly instead of subprocess-based execute_*.py scripts. This strategic refactoring eliminates process overhead, enables async orchestration, and lays the foundation for Phase 1 Message Bus autonomous agents.

Project Mission: Transform TaskExecutor from subprocess-based execution to direct LLM abstraction layer, achieving 30-50% performance improvement while maintaining 100% backward compatibility and enabling future autonomous agent communication.

Current Status Overview

MetricCurrent StateTarget
Project Start DateTBD (pending go/no-go)-
Estimated Duration3-4 weeks80 engineering hours
Current ArchitectureSubprocess-based (execute_*.py)Direct LLM abstraction
Performance BaselineSubprocess overhead + JSON I/O30-50% improvement
Backward CompatibilityN/A100% (dual-mode executor)
LLM Providers ReadyGemini (placeholder)Anthropic, OpenAI, Gemini
Foundation StatusBaseLlm abstract class existsFactory + implementations
Test Coverage0% for llm_abstractions90%+ for new code
Next MilestoneGo/No-Go DecisionTBD

Key Deliverables

DeliverableDescriptionStatus
LlmFactoryDynamic provider loading with AgentType mapping📅 Planned
AnthropicLlmOfficial Anthropic SDK implementation📅 Planned
OpenAILlmAsyncOpenAI SDK implementation📅 Planned
GeminiLlmComplete Gemini implementation (replace placeholder)📅 Planned
Dual-Mode Executoruse_direct_llm flag with graceful fallback📅 Planned
Migration GuideScript deprecation timeline and adoption guide📅 Planned
Test Suite90%+ coverage for new code📅 Planned
Performance BenchmarksValidation of 30%+ improvement📅 Planned

Budget & Investment

PhaseBudgetHoursStatus
Phase 1A: Foundation$2,00016 hours📅 Planned
Phase 1B: Dual-Mode Executor$2,00016 hours📅 Planned
Phase 2A: OpenAI Implementation$2,00016 hours📅 Planned
Phase 2B: Gemini Implementation$2,00016 hours📅 Planned
Phase 3: Script Deprecation$2,00016 hours📅 Planned
Total Investment$10,00080 hoursThrough Week 4

Engineering Rate: $125/hour (senior Python developer) ROI: Performance improvements reduce LLM orchestration latency by 30-50%, enabling future autonomous agent workflows.


Table of Contents

  1. Project Overview
  2. Strategic Context & Rationale
  3. User Feedback: Async Executor AlignmentNEW
  4. Technical Architecture
  5. Implementation Phases
  6. Multi-Agent Orchestration Strategy
  7. Quality Gates & Success Criteria
  8. Risk Management
  9. Backward Compatibility Strategy
  10. Testing Strategy
  11. Performance Benchmarks
  12. Migration & Deprecation Plan
  13. Agent Assignment Matrix
  14. Budget Breakdown
  15. Timeline & Dependencies
  16. Monitoring & Metrics
  17. Go/No-Go Decision Criteria
  18. Architecture Decision RecordNEW

Project Overview

Purpose

The TaskExecutor Refactoring project modernizes the task execution architecture by replacing subprocess-based execute_*.py scripts with direct llm_abstractions layer integration. This refactoring is foundational for Phase 1 autonomous agent communication via Message Bus.

Strategic Context

Current Architecture (Subprocess-Based)

TaskExecutor → subprocess → execute_claude.py → JSON I/O → Display Command
TaskExecutor → subprocess → execute_gpt.py → OpenAI SDK → API Call → Result
TaskExecutor → subprocess → execute_gemini.py → Placeholder → Result

Problems with Current Architecture

  • Process Overhead: Each task spawns subprocess (100-200ms latency)
  • No Async Support: Synchronous subprocess.run() blocks orchestration
  • Duplicate Logic: Task validation + prompt generation duplicated across scripts
  • Hard to Test: Subprocess testing requires integration tests
  • Not Future-Ready: Phase 1 Message Bus requires async LLM calls

Target Architecture (Direct LLM Abstraction)

TaskExecutor → LlmFactory.get_provider(agent_type) → BaseLlm.generate_content_async()

AnthropicLlm (official SDK) → Claude API
OpenAILlm (AsyncOpenAI SDK) → GPT-4 API
GeminiLlm (google-generativeai) → Gemini API

Benefits of Target Architecture

  • 30-50% Performance Improvement: Eliminate subprocess overhead
  • Async Orchestration: Enable concurrent task execution
  • Single Source of Truth: Unified LLM abstraction layer
  • Easier Testing: Mock BaseLlm implementations
  • Future-Ready: Foundation for Phase 1 Message Bus autonomous agents

Project Objectives

Primary Objectives

  1. Implement LlmFactory with dynamic provider loading based on AgentType
  2. Complete BaseLlm Implementations for Anthropic (Claude), OpenAI (GPT), Google (Gemini)
  3. Add Dual-Mode Executor with use_direct_llm feature flag and graceful fallback
  4. Achieve 30-50% Performance Improvement over subprocess approach
  5. Maintain 100% Backward Compatibility during migration period

Secondary Objectives

  1. Deprecate execute_*.py Scripts gracefully over 6 months
  2. Achieve 90%+ Test Coverage for new llm_abstractions code
  3. Document Migration Path for future CODITECT users
  4. Establish Foundation for Phase 1 Message Bus autonomous agents

Success Criteria

Technical Success

  • ✅ All LLM providers (Anthropic, OpenAI, Gemini) working via LlmFactory
  • ✅ Performance benchmarks show 30%+ improvement over subprocess approach
  • ✅ Test coverage ≥90% for llm_abstractions module
  • ✅ Zero breaking changes (dual-mode ensures rollback safety)
  • ✅ Async/sync compatibility layer functional

Process Success

  • ✅ All quality gates passed (see Section 6)
  • ✅ Migration guide published with script deprecation timeline
  • ✅ Deprecation warnings added to execute_*.py scripts
  • ✅ Documentation updated (architecture, API, developer guides)

Business Success

  • ✅ Timeline: Complete in ≤4 weeks (80 engineering hours)
  • ✅ Cost: ≤$10,000 (within allocated budget)
  • ✅ Risk: Zero production incidents during migration
  • ✅ Foundation: Phase 1 Message Bus can build on this architecture

Strategic Context & Rationale

Why This Refactoring Matters

1. Foundation for Phase 1 Autonomous Agents

The current Phase 1 roadmap (see docs/03-project-planning/project-plan.md) requires:

  • Message Bus (RabbitMQ): Inter-agent task passing with priority queues
  • Agent Discovery Service (Redis): Capability-based agent discovery
  • Task Queue Manager (Redis + RQ): Persistent queue with dependency resolution

All of these require async LLM calls. The current subprocess-based architecture blocks async orchestration, making autonomous agent-to-agent communication impossible without this refactoring.

2. Performance Bottleneck Elimination

Current subprocess overhead profile:

Task submission: ~5ms
Subprocess spawn: ~100-200ms ← BOTTLENECK
JSON serialization: ~10-20ms
Script execution: ~50-100ms
Result parsing: ~10-20ms
---
Total overhead: ~180-350ms per task

```yaml

With direct LLM abstraction:

```yaml
Task submission: ~5ms
LlmFactory lookup: ~1ms ← FAST
BaseLlm.generate_content_async(): ~50-100ms
Result parsing: ~10-20ms
---
Total overhead: ~66-126ms per task

Net improvement: 30-50% reduction in orchestration latency

3. Code Quality & Maintainability

Current architecture has:

  • Duplicate logic: Task validation repeated in every execute_*.py script
  • Inconsistent error handling: Each script implements own retry logic
  • Hard to test: Subprocess testing requires complex mocking
  • No type safety: JSON I/O loses Python type information

Target architecture provides:

  • Single source of truth: BaseLlm abstract class enforces interface
  • Centralized error handling: Factory manages provider availability
  • Easy testing: Mock BaseLlm implementations for unit tests
  • Full type safety: Python type hints throughout

4. Alignment with Industry Best Practices

Modern LLM orchestration frameworks (LangGraph, CrewAI, AutoGen) use:

  • ✅ Direct SDK integration (not subprocess scripts)
  • ✅ Async/await for concurrent execution
  • ✅ Factory pattern for provider abstraction
  • ✅ Unified error handling and retry logic

This refactoring aligns CODITECT with these industry standards.

Alternative Approaches Considered

ApproachProsConsDecision
Keep subprocess scriptsNo code changesPerformance bottleneck remains, blocks Phase 1❌ REJECTED
Refactor scripts in-placeSimpler migrationStill subprocess overhead, not async-friendly❌ REJECTED
Direct LLM abstraction (chosen)Performance + async + Phase 1 foundationRequires refactoring effortSELECTED
Third-party framework (LangChain)Community supportHeavy dependencies, not CODITECT-aligned❌ REJECTED

Dependencies on Other Work

Prerequisites (Already Complete)

  • BaseLlm abstract class exists (llm_abstractions/base_llm.py)
  • Gemini placeholder implementation exists (needs completion)
  • AgentRegistry with AgentType enum defined
  • TaskExecutor with execution modes (interactive, API, hybrid)

Enables Future Work (Phase 1)

  • 🔜 Message Bus Implementation: Async LLM calls required for agent-to-agent communication
  • 🔜 Agent Discovery Service: Dynamic provider selection via factory pattern
  • 🔜 Task Queue Manager: Concurrent task execution with async/await

No Blocking Dependencies: This refactoring can proceed immediately after go/no-go decision.


User Feedback: Async Executor Alignment

User Request

"I agree with your analysis. Please explicitly detail making TaskExecutor.execute async to align seamlessly with ProjectOrchestrator.execute_task's async nature."

Strategic Decision: Make TaskExecutor.execute() Async

APPROVED: Based on user feedback, we will make TaskExecutor.execute() an async method as part of this refactoring project.

Rationale

  1. Seamless Integration: ProjectOrchestrator.execute_task() will become async, enabling end-to-end async flow without asyncio.run() wrappers.

  2. Eliminates Async/Sync Boundary: Current architecture uses asyncio.run() in _execute_via_llm(), which:

    • Creates new event loop for each task (~10-20ms overhead)
    • Blocks async orchestration
    • Prevents parallel task execution
    • Makes Phase 1 Message Bus impossible
  3. Enables Parallel Execution: Async executor enables concurrent LLM API calls:

    # Current (Sequential): 6 seconds for 3 tasks
    Task 1: |---LLM API (2s)---|
    Task 2: |---LLM API (2s)---|
    Task 3: |---LLM API (2s)---|

    # Target (Parallel): 2 seconds for 3 tasks
    Task 1: |---LLM API (2s)---|
    Task 2: |---LLM API (2s)---|
    Task 3: |---LLM API (2s)---|

4. **Phase 1 Foundation:** Message Bus autonomous agents **require** async executor for agent-to-agent coordination.

### Method Signature Changes

#### Current (Sync)

```python
def execute(
self,
task: AgentTask,
agent: Optional[str] = None,
mode: Optional[str] = None
) -> ExecutionResult:
"""Execute a single task using specified agent."""
# Synchronous execution with asyncio.run() wrapper
...

Target (Async)

async def execute(
self,
task: AgentTask,
agent: Optional[str] = None,
mode: Optional[str] = None
) -> ExecutionResult:
"""
Execute a single task using specified agent.

Note:
This method is async to enable concurrent task execution
and seamless integration with async LLM providers.
"""
# Direct async execution, no wrappers needed
...

Cascading Changes

  • TaskExecutor._execute_via_llm() → async
  • TaskExecutor._execute_api() → async
  • ProjectOrchestrator.execute_task() → async
  • All callers → use await executor.execute()
  • All tests → migrate to pytest-asyncio

Performance Impact

Baseline (Current)

  • Single task overhead: 180-350ms (subprocess) → 66-126ms (direct LLM)
  • Improvement: 30-50%

With Async (Target)

  • Single task overhead: 66-126ms (same)
  • Parallel tasks (3): 6s (sequential) → 2s (concurrent)
  • Improvement: 30-50% (single) + 3x (parallel)

Budget Impact

Original Phase 1B: 16 hours, $2,000 Updated Phase 1B (Async): 22 hours, $2,750 Delta: +6 hours, +$750

Total Project Budget

  • Original: $10,000 (80 hours)
  • Updated: $10,750 (86 hours)
  • Delta: +$750 (7.5% increase)

Justification: The $750 investment enables:

  • End-to-end async flow (eliminates event loop overhead)
  • Parallel task execution (3x speedup)
  • Phase 1 Message Bus foundation (enables $100K+ autonomous agent implementation)
  • ROI: 10x+

Timeline Impact

Original Timeline: 4 weeks (80 hours) Updated Timeline: 4.5 weeks (86 hours) Delta: +3 days (concentrated in Phase 1B)

Week 1 (Updated)

  • Days 1-2: Phase 1A (Foundation) - 16 hours
  • Days 3-5.5: Phase 1B (Async Executor) - 22 hours

Weeks 2-4: No changes (providers already async-compatible)

Architecture Decision Record

See: ADR-001: Async TaskExecutor Refactoring

This architectural decision is formally documented in ADR-001, which provides:

  • Complete analysis of alternatives considered
  • Detailed consequences (positive and negative)
  • Implementation validation checklist
  • Rollback procedures
  • Cross-references to related documents

Key ADR Sections

  • Decision Outcome: Make TaskExecutor.execute() async (7 methods total)
  • Rationale: User alignment + Phase 1 foundation + 3x performance improvement
  • Risks: Breaking change, test migration, +$750 budget
  • Mitigation: Migration guide, rollback plan, comprehensive testing

For complete details, see

  • docs/ASYNC-EXECUTOR-STRATEGIC-PLAN.md (62KB comprehensive analysis)
  • docs/02-architecture/adrs/ADR-001-async-task-executor-refactoring.md (formal ADR)

Decision Status: ✅ APPROVED (based on user feedback) Implementation Phase: Phase 1B (Week 1, Days 3-5.5) Budget Approved: $10,750 (up from $10,000) Timeline Approved: 4.5 weeks (up from 4 weeks)


Technical Architecture

Current Architecture (Subprocess-Based)


TaskExecutor._execute_api()

_execute_via_script()

subprocess.run([sys.executable, "execute_gpt.py"], input=json.dumps(task))

execute_gpt.py
├── Read task from stdin (JSON)
├── Validate task spec
├── Import OpenAI SDK
├── Call GPT-4 API
└── Output result to stdout (JSON)

Parse subprocess output

Return ExecutionResult

Problems

  • Each task spawns new Python process (~100-200ms overhead)
  • JSON serialization/deserialization overhead
  • No async support (subprocess.run() is synchronous)
  • Error handling scattered across multiple scripts

Target Architecture (Direct LLM Abstraction)


TaskExecutor._execute_via_llm()

LlmFactory.get_provider(agent_type)
├── AgentType.ANTHROPIC_CLAUDE → AnthropicLlm
├── AgentType.OPENAI_GPT → OpenAILlm
└── AgentType.GOOGLE_GEMINI → GeminiLlm

provider.generate_content_async(messages, **kwargs)

Return ExecutionResult

Benefits

  • Direct in-process LLM call (no subprocess overhead)
  • Native Python objects (no JSON serialization)
  • Async/await support for concurrent execution
  • Centralized error handling in factory + base class

Component Design

1. LlmFactory (NEW)

File: llm_abstractions/factory.py

Responsibilities

  • Map AgentType enum to BaseLlm implementations
  • Dynamic provider loading (import only when needed)
  • Provider availability checking (API keys, dependencies)
  • Error handling for missing providers

Interface

class LlmFactory:
"""Factory for creating LLM provider instances."""

@staticmethod
def get_provider(
agent_type: AgentType,
model: Optional[str] = None,
**kwargs: Any
) -> BaseLlm:
"""
Get LLM provider instance for agent type.

Args:
agent_type: Type of LLM (Claude, GPT, Gemini, etc.)
model: Specific model name (optional)
**kwargs: Additional provider configuration

Returns:
BaseLlm instance configured for agent type

Raises:
ValueError: If agent type not supported
ImportError: If provider dependencies not installed
RuntimeError: If API key not configured
"""
pass

@staticmethod
def list_available_providers() -> List[AgentType]:
"""List LLM providers available (API keys + dependencies)."""
pass

@staticmethod
def is_provider_available(agent_type: AgentType) -> bool:
"""Check if provider is available for use."""
pass

Implementation Strategy

# Provider registry (lazy loading)
_PROVIDER_REGISTRY = {
AgentType.ANTHROPIC_CLAUDE: {
"class": "AnthropicLlm",
"module": "llm_abstractions.anthropic",
"env_var": "ANTHROPIC_API_KEY",
"dependencies": ["anthropic"],
},
AgentType.OPENAI_GPT: {
"class": "OpenAILlm",
"module": "llm_abstractions.openai",
"env_var": "OPENAI_API_KEY",
"dependencies": ["openai"],
},
AgentType.GOOGLE_GEMINI: {
"class": "GeminiLlm", # Update existing Gemini class
"module": "llm_abstractions.gemini",
"env_var": "GOOGLE_API_KEY",
"dependencies": ["google-generativeai"],
},
}

def get_provider(agent_type, model=None, **kwargs):
# 1. Check provider registry
if agent_type not in _PROVIDER_REGISTRY:
raise ValueError(f"Unsupported agent type: {agent_type}")

provider_config = _PROVIDER_REGISTRY[agent_type]

# 2. Check API key
api_key = os.getenv(provider_config["env_var"])
if not api_key:
raise RuntimeError(
f"{provider_config['env_var']} environment variable not set"
)

# 3. Dynamic import
module = importlib.import_module(provider_config["module"])
provider_class = getattr(module, provider_config["class"])

# 4. Instantiate provider
return provider_class(model=model, api_key=api_key, **kwargs)

2. AnthropicLlm (NEW)

File: llm_abstractions/anthropic.py

Responsibilities

  • Implement BaseLlm interface for Anthropic Claude
  • Use official anthropic Python SDK (v0.39+)
  • Support async/await with AsyncAnthropic
  • Handle Claude-specific features (system prompts, tool use, etc.)

Interface

class AnthropicLlm(BaseLlm):
"""Anthropic Claude LLM implementation."""

def __init__(
self,
model: str = "claude-sonnet-4-5-20250929",
api_key: Optional[str] = None,
**kwargs: Any
):
"""
Initialize Anthropic Claude provider.

Args:
model: Claude model name (default: latest Sonnet)
api_key: Anthropic API key (or use ANTHROPIC_API_KEY env var)
**kwargs: Additional SDK configuration
"""
self.model = model
self.api_key = api_key or os.getenv("ANTHROPIC_API_KEY")
if not self.api_key:
raise ValueError("ANTHROPIC_API_KEY environment variable not set")

# Initialize AsyncAnthropic client
self.client = AsyncAnthropic(api_key=self.api_key)

async def generate_content_async(
self,
messages: List[Dict[str, str]],
**kwargs: Any
) -> str:
"""
Generate content using Claude.

Args:
messages: Conversation history (OpenAI format)
**kwargs: Additional parameters (max_tokens, temperature, etc.)

Returns:
Generated content as string
"""
# Convert OpenAI format to Claude format (system + messages)
system_prompt = self._extract_system_prompt(messages)
claude_messages = self._convert_to_claude_format(messages)

# Call Claude API
response = await self.client.messages.create(
model=self.model,
system=system_prompt,
messages=claude_messages,
max_tokens=kwargs.get("max_tokens", 4096),
temperature=kwargs.get("temperature", 0.7),
)

# Extract text content
return response.content[0].text

Key Implementation Details

  • Use AsyncAnthropic for async support
  • Convert OpenAI message format to Claude format (system + messages)
  • Handle Claude-specific parameters (system prompt separate from messages)
  • Extract text content from Claude response format

3. OpenAILlm (NEW)

File: llm_abstractions/openai_llm.py

Responsibilities

  • Implement BaseLlm interface for OpenAI GPT
  • Use official openai Python SDK (v1.99+)
  • Support async/await with AsyncOpenAI
  • Handle GPT-specific features (function calling, JSON mode, etc.)
  • Support GPT-5.1-Codex-Max for specialized coding tasks (Nov 2025 release)
    • Multi-hour agent loops and project-scale refactors
    • Natively trained across multiple context windows (millions of tokens)
    • 30% fewer thinking tokens with better performance

Interface

class OpenAILlm(BaseLlm):
"""OpenAI GPT LLM implementation."""

# Supported models
SUPPORTED_MODELS = [
"gpt-4o", # General-purpose (default)
"gpt-4", # Previous generation
"gpt-5.1-codex-max", # Specialized coding (Nov 2025)
"gpt-4-turbo", # Fast variant
]

def __init__(
self,
model: str = "gpt-4o",
api_key: Optional[str] = None,
**kwargs: Any
):
"""
Initialize OpenAI GPT provider.

Args:
model: GPT model name
- "gpt-4o" (default): General-purpose latest
- "gpt-5.1-codex-max": Specialized for coding tasks,
multi-hour agent loops, project-scale refactors
- "gpt-4", "gpt-4-turbo": Alternative models
api_key: OpenAI API key (or use OPENAI_API_KEY env var)
**kwargs: Additional SDK configuration
"""
self.model = model
self.api_key = api_key or os.getenv("OPENAI_API_KEY")
if not self.api_key:
raise ValueError("OPENAI_API_KEY environment variable not set")

# Initialize AsyncOpenAI client (2025 SDK)
self.client = AsyncOpenAI(api_key=self.api_key)

async def generate_content_async(
self,
messages: List[Dict[str, str]],
**kwargs: Any
) -> str:
"""
Generate content using GPT.

Args:
messages: Conversation history (OpenAI format)
**kwargs: Additional parameters (max_tokens, temperature, etc.)

Returns:
Generated content as string
"""
# Call GPT API (native OpenAI format)
response = await self.client.chat.completions.create(
model=self.model,
messages=messages,
max_tokens=kwargs.get("max_tokens", 4000),
temperature=kwargs.get("temperature", 0.7),
)

# Extract text content
return response.choices[0].message.content

Key Implementation Details

  • Use AsyncOpenAI for async support (2025 SDK)
  • Messages already in OpenAI format (no conversion needed)
  • Standard GPT parameters (max_tokens, temperature)
  • Extract content from choices[0].message.content

4. GeminiLlm (UPDATE EXISTING)

File: llm_abstractions/gemini.py (replace placeholder)

Responsibilities

  • Complete the existing placeholder implementation
  • Use google-generativeai Python SDK
  • Support async/await
  • Handle Gemini-specific features

Interface

class GeminiLlm(BaseLlm):
"""Google Gemini LLM implementation."""

def __init__(
self,
model: str = "gemini-1.5-pro",
api_key: Optional[str] = None,
**kwargs: Any
):
"""
Initialize Google Gemini provider.

Args:
model: Gemini model name (default: Gemini 1.5 Pro)
api_key: Google API key (or use GOOGLE_API_KEY env var)
**kwargs: Additional SDK configuration
"""
self.model = model
self.api_key = api_key or os.getenv("GOOGLE_API_KEY")
if not self.api_key:
raise ValueError("GOOGLE_API_KEY environment variable not set")

# Initialize Gemini SDK
import google.generativeai as genai
genai.configure(api_key=self.api_key)
self.client = genai.GenerativeModel(model)

async def generate_content_async(
self,
messages: List[Dict[str, str]],
**kwargs: Any
) -> str:
"""
Generate content using Gemini.

Args:
messages: Conversation history (OpenAI format)
**kwargs: Additional parameters

Returns:
Generated content as string
"""
# Convert OpenAI format to Gemini format
gemini_messages = self._convert_to_gemini_format(messages)

# Call Gemini API
response = await self.client.generate_content_async(
gemini_messages,
generation_config={
"temperature": kwargs.get("temperature", 0.7),
"max_output_tokens": kwargs.get("max_tokens", 2048),
}
)

# Extract text content
return response.text

Key Implementation Details

  • Use google-generativeai SDK with async support
  • Convert OpenAI message format to Gemini format
  • Handle Gemini-specific configuration
  • Extract text from response.text

5. TaskExecutor Updates (MODIFY EXISTING)

File: orchestration/executor.py

New Method: _execute_via_llm()

def _execute_via_llm(
self,
task: AgentTask,
agent_config: AgentConfig,
result: ExecutionResult
) -> ExecutionResult:
"""
Execute task via direct LLM abstraction layer.

Uses LlmFactory to get provider, then calls generate_content_async().
This is the new high-performance execution path.

Args:
task: Task to execute
agent_config: Agent configuration
result: Execution result (in progress)

Returns:
Updated ExecutionResult
"""
result.status = ExecutionStatus.IN_PROGRESS

try:
# Get LLM provider via factory
from llm_abstractions.factory import LlmFactory

provider = LlmFactory.get_provider(
agent_type=agent_config.agent_type,
model=task.metadata.get("model"),
)

# Build messages from task
messages = self._build_messages_from_task(task, agent_config)

# Call LLM asynchronously (or sync wrapper if needed)
if asyncio.get_event_loop().is_running():
# Already in async context
output = await provider.generate_content_async(messages)
else:
# Create new event loop for sync context
output = asyncio.run(provider.generate_content_async(messages))

# Success
result.status = ExecutionStatus.SUCCESS
result.output = output
result.completed_at = datetime.now()
result.metadata["execution_mode"] = "llm_direct"

except ImportError as e:
# Provider dependencies not installed
result.status = ExecutionStatus.FAILED
result.error = f"Provider dependencies not installed: {e}"
result.completed_at = datetime.now()
result.metadata["fallback_reason"] = "missing_dependencies"

except RuntimeError as e:
# API key not configured
result.status = ExecutionStatus.FAILED
result.error = f"Provider configuration error: {e}"
result.completed_at = datetime.now()
result.metadata["fallback_reason"] = "missing_api_key"

except Exception as e:
# Other errors
result.status = ExecutionStatus.FAILED
result.error = str(e)
result.completed_at = datetime.now()

return result

Modified Method: _execute_api()

def _execute_api(
self,
task: AgentTask,
agent_config: AgentConfig,
result: ExecutionResult
) -> ExecutionResult:
"""
Execute task via direct API call.

Now supports dual-mode execution:
- If use_direct_llm=True: Use _execute_via_llm() (NEW)
- If use_direct_llm=False: Use _execute_via_script() (OLD)

Graceful fallback if provider not available.
"""
result.status = ExecutionStatus.IN_PROGRESS

# Check feature flag
use_direct_llm = task.metadata.get("use_direct_llm", True) # Default: ON

if use_direct_llm:
# Try new direct LLM execution
result = self._execute_via_llm(task, agent_config, result)

# If failed due to provider unavailability, fallback to script
if (result.status == ExecutionStatus.FAILED and
result.metadata.get("fallback_reason") in ["missing_dependencies", "missing_api_key"]):

print(f"\n⚠️ Falling back to script execution for {agent_config.name}")
print(f" Reason: {result.metadata['fallback_reason']}\n")

# Reset result and try script execution
result.status = ExecutionStatus.PENDING
result.error = ""
result = self._execute_via_script(task, agent_config, script_path, result)
else:
# Use legacy script execution
script_path = self._get_execution_script(agent_config.agent_type)
if script_path and script_path.exists():
result = self._execute_via_script(task, agent_config, script_path, result)
else:
result.status = ExecutionStatus.PENDING
result.metadata["requires_implementation"] = True

return result

New Helper Method: _build_messages_from_task()

def _build_messages_from_task(
self,
task: AgentTask,
agent_config: AgentConfig
) -> List[Dict[str, str]]:
"""
Build LLM messages from task specification.

Converts AgentTask to OpenAI-format messages (standard format
supported by all providers).

Args:
task: Task to execute
agent_config: Agent configuration

Returns:
List of messages in OpenAI format
"""
messages = []

# System prompt
system_prompt = f"""You are an AI agent executing a specific task for project orchestration.

Task ID: {task.task_id}
Title: {task.title}
Agent: {agent_config.name}

Your goal is to complete this task according to the specifications provided."""

messages.append({"role": "system", "content": system_prompt})

# User prompt with task details
user_prompt = f"""Please complete the following task:

{task.description}
"""

if task.deliverables:
user_prompt += "\nExpected Deliverables:\n"
for deliverable in task.deliverables:
user_prompt += f"- {deliverable}\n"

if task.success_criteria:
user_prompt += "\nSuccess Criteria:\n"
for criteria in task.success_criteria:
user_prompt += f"- {criteria}\n"

messages.append({"role": "user", "content": user_prompt})

return messages

Architecture Diagram


┌─────────────────────────────────────────────────────────────┐
│ TaskExecutor │
│ │
│ execute(task, agent) │
│ ↓ │
│ _execute_api(task, agent_config, result) │
│ ↓ │
│ [Feature Flag: use_direct_llm] │
│ ├─ True (NEW) ──→ _execute_via_llm() │
│ │ ↓ │
│ │ LlmFactory.get_provider(agent_type) │
│ │ ↓ │
│ │ ┌──────┴──────┬──────────┬──────────┐ │
│ │ ↓ ↓ ↓ ↓ │
│ │ AnthropicLlm OpenAILlm GeminiLlm [more] │
│ │ ↓ ↓ ↓ │
│ │ Claude API GPT API Gemini API │
│ │ │
│ └─ False (OLD) ──→ _execute_via_script() │
│ ↓ │
│ subprocess.run(execute_*.py) │
│ ↓ │
│ Legacy Script Execution │
└─────────────────────────────────────────────────────────────┘

File Structure Changes

New Files


llm_abstractions/
├── __init__.py # Export factory + providers
├── factory.py # LlmFactory (NEW)
├── anthropic.py # AnthropicLlm (NEW)
└── openai_llm.py # OpenAILlm (NEW)

Updated Files


llm_abstractions/
├── gemini.py # Complete implementation (UPDATED)
└── base_llm.py # No changes (already correct)

orchestration/
└── executor.py # Add _execute_via_llm() (UPDATED)

Deprecated Files (6-month timeline)


scripts/llm_execution/
├── execute_claude.py # Add deprecation warning
├── execute_gpt.py # Add deprecation warning
└── execute_gemini.py # Add deprecation warning


Implementation Phases

Phase 1A: Foundation (Week 1, Days 1-2, 16 hours)

Goal: Create LlmFactory and implement AnthropicLlm with official SDK.

Tasks

  1. Create llm_abstractions/factory.py with LlmFactory class
  2. Implement provider registry with dynamic loading
  3. Implement get_provider(), list_available_providers(), is_provider_available()
  4. Create llm_abstractions/anthropic.py with AnthropicLlm class
  5. Implement AsyncAnthropic client initialization
  6. Implement generate_content_async() with OpenAI→Claude format conversion
  7. Write unit tests for LlmFactory (90%+ coverage)
  8. Write unit tests for AnthropicLlm (90%+ coverage)

Deliverables

  • llm_abstractions/factory.py (200 lines)
  • llm_abstractions/anthropic.py (150 lines)
  • ✅ Unit tests: tests/test_factory.py (100 lines)
  • ✅ Unit tests: tests/test_anthropic.py (100 lines)

Success Criteria

  • ✅ LlmFactory can instantiate AnthropicLlm
  • ✅ AnthropicLlm successfully calls Claude API (integration test with real API key)
  • ✅ Test coverage ≥90% for factory.py and anthropic.py
  • ✅ All tests pass

Agent Assignment

  • rust-expert-developer: Implement factory pattern and async patterns
  • codi-test-engineer: Write unit tests with mocks

Time Estimate: 16 hours

  • LlmFactory implementation: 6 hours
  • AnthropicLlm implementation: 6 hours
  • Unit tests: 4 hours

Phase 1B: Dual-Mode Executor (Week 1, Days 3-5, 16 hours)

Goal: Add dual-mode execution to TaskExecutor with graceful fallback.

Tasks

  1. Add _execute_via_llm() method to TaskExecutor
  2. Add _build_messages_from_task() helper method
  3. Modify _execute_api() to check use_direct_llm flag
  4. Implement graceful fallback to script execution
  5. Add async/sync compatibility layer (asyncio.run() wrapper)
  6. Write unit tests for dual-mode execution (90%+ coverage)
  7. Write integration tests (Anthropic provider with real API)

Deliverables

  • ✅ Updated orchestration/executor.py (+150 lines)
  • ✅ Unit tests: tests/test_executor_dual_mode.py (150 lines)
  • ✅ Integration tests: tests/integration/test_executor_anthropic.py (50 lines)

Success Criteria

  • ✅ TaskExecutor can execute tasks via AnthropicLlm when use_direct_llm=True
  • ✅ TaskExecutor falls back to script execution when provider unavailable
  • ✅ Async/sync compatibility layer works in both contexts
  • ✅ Test coverage ≥90% for new executor code
  • ✅ All tests pass

Agent Assignment

  • rust-expert-developer: Implement dual-mode executor with async compatibility
  • codi-test-engineer: Write unit + integration tests

Time Estimate: 16 hours

  • Executor modifications: 8 hours
  • Async/sync compatibility: 4 hours
  • Testing: 4 hours

Phase 2A: OpenAI Implementation (Week 2, Days 1-2, 16 hours)

Goal: Implement OpenAILlm with AsyncOpenAI SDK, including GPT-5.1-Codex-Max support.

Tasks

  1. Create llm_abstractions/openai_llm.py with OpenAILlm class
  2. Implement AsyncOpenAI client initialization (2025 SDK)
  3. Implement generate_content_async() (native OpenAI format)
  4. Add model validation for supported models (gpt-4o, gpt-5.1-codex-max, etc.)
  5. Register OpenAILlm in LlmFactory provider registry
  6. Write unit tests for OpenAILlm (90%+ coverage)
  7. Write integration tests (real OpenAI API calls with both GPT-4o and Codex-Max)
  8. Run performance benchmarks vs subprocess execute_gpt.py

Codex-Max Specific Features

  • Multi-context window support for long-running tasks
  • Optimized for coding tasks (refactors, debugging, agent loops)
  • 30% fewer thinking tokens while maintaining quality
  • Ideal for complex TaskExecutor workflows requiring deep code analysis

Deliverables

  • llm_abstractions/openai_llm.py (150 lines)
  • ✅ Updated llm_abstractions/factory.py (+10 lines for registry)
  • ✅ Unit tests: tests/test_openai_llm.py (100 lines)
  • ✅ Integration tests: tests/integration/test_executor_openai.py (50 lines)
  • ✅ Performance benchmark: benchmarks/openai_comparison.py (100 lines)

Success Criteria

  • ✅ OpenAILlm successfully calls GPT-4 API
  • ✅ LlmFactory can instantiate OpenAILlm
  • ✅ TaskExecutor can execute tasks via OpenAILlm
  • ✅ Performance benchmark shows ≥30% improvement over subprocess
  • ✅ Test coverage ≥90% for openai_llm.py
  • ✅ All tests pass

Agent Assignment

  • rust-expert-developer: Implement OpenAILlm with AsyncOpenAI SDK
  • codi-test-engineer: Write unit + integration tests + benchmarks

Time Estimate: 16 hours

  • OpenAILlm implementation: 6 hours
  • Factory registration: 2 hours
  • Testing: 4 hours
  • Performance benchmarks: 4 hours

Phase 2B: Gemini Implementation (Week 2, Days 3-5, 16 hours)

Goal: Complete Gemini implementation (replace placeholder).

Tasks

  1. Update llm_abstractions/gemini.py to use google-generativeai SDK
  2. Implement generate_content_async() with async support
  3. Implement OpenAI→Gemini format conversion
  4. Register GeminiLlm in LlmFactory provider registry
  5. Write unit tests for GeminiLlm (90%+ coverage)
  6. Write integration tests (real Gemini API calls)
  7. Run performance benchmarks vs subprocess execute_gemini.py

Deliverables

  • ✅ Updated llm_abstractions/gemini.py (replace placeholder, 150 lines)
  • ✅ Updated llm_abstractions/factory.py (+10 lines for registry)
  • ✅ Unit tests: tests/test_gemini.py (100 lines)
  • ✅ Integration tests: tests/integration/test_executor_gemini.py (50 lines)
  • ✅ Performance benchmark: benchmarks/gemini_comparison.py (100 lines)

Success Criteria

  • ✅ GeminiLlm successfully calls Gemini API
  • ✅ LlmFactory can instantiate GeminiLlm
  • ✅ TaskExecutor can execute tasks via GeminiLlm
  • ✅ Performance benchmark shows ≥30% improvement over subprocess
  • ✅ Test coverage ≥90% for gemini.py
  • ✅ All tests pass

Agent Assignment

  • rust-expert-developer: Complete Gemini implementation with async support
  • codi-test-engineer: Write unit + integration tests + benchmarks

Time Estimate: 16 hours

  • GeminiLlm implementation: 6 hours
  • Factory registration: 2 hours
  • Testing: 4 hours
  • Performance benchmarks: 4 hours

Phase 3: Script Deprecation (Week 3, 16 hours)

Goal: Gracefully deprecate execute_*.py scripts over 6 months.

Tasks

  1. Add deprecation warnings to execute_claude.py, execute_gpt.py, execute_gemini.py
  2. Set removal date (v2.0 or 6 months from Phase 3 completion)
  3. Create migration guide: docs/EXECUTOR-MIGRATION-GUIDE.md
  4. Update documentation: README, API docs, architecture diagrams
  5. Update developer guide with new TaskExecutor usage
  6. Create rollback procedure if issues arise
  7. Announce deprecation to CODITECT users (if applicable)

Deliverables

  • ✅ Updated scripts/llm_execution/execute_*.py with warnings (+20 lines each)
  • ✅ Migration guide: docs/EXECUTOR-MIGRATION-GUIDE.md (2,000 words)
  • ✅ Updated documentation: README.md, docs/architecture/
  • ✅ Rollback procedure: docs/EXECUTOR-ROLLBACK-PROCEDURE.md (500 words)

Success Criteria

  • ✅ All execute_*.py scripts display clear deprecation warnings
  • ✅ Migration guide provides step-by-step adoption path
  • ✅ Documentation reflects new architecture
  • ✅ Rollback procedure tested and validated

Agent Assignment

  • codi-documentation-writer: Create migration guide and update docs
  • senior-architect: Review rollback procedure

Time Estimate: 16 hours

  • Deprecation warnings: 2 hours
  • Migration guide: 6 hours
  • Documentation updates: 6 hours
  • Rollback procedure: 2 hours

Phase 4: Testing & Polish (Week 4, 16 hours)

Goal: Comprehensive testing, performance validation, and production readiness.

Tasks

  1. Run comprehensive integration test suite (all providers)
  2. Validate performance benchmarks (≥30% improvement for all providers)
  3. Run load tests (100+ concurrent tasks)
  4. Review code quality (type hints, docstrings, error handling)
  5. Security review (API key handling, error messages)
  6. Production readiness checklist
  7. Final documentation review
  8. Create Phase 1 Message Bus foundation validation report

Deliverables

  • ✅ Integration test report: test-results/integration-test-report.md
  • ✅ Performance validation report: test-results/performance-validation.md
  • ✅ Load test report: test-results/load-test-report.md
  • ✅ Security review report: test-results/security-review.md
  • ✅ Production readiness checklist: docs/PRODUCTION-READINESS-CHECKLIST.md
  • ✅ Phase 1 foundation report: docs/PHASE-1-FOUNDATION-VALIDATION.md

Success Criteria

  • ✅ All integration tests pass (100% success rate)
  • ✅ Performance benchmarks show ≥30% improvement across all providers
  • ✅ Load tests handle 100+ concurrent tasks without failures
  • ✅ Security review finds no critical issues
  • ✅ Production readiness checklist 100% complete
  • ✅ Phase 1 foundation validated (async support, factory pattern ready)

Agent Assignment

  • codi-test-engineer: Run comprehensive test suite
  • senior-architect: Production readiness review
  • security-specialist-agent: Security review

Time Estimate: 16 hours

  • Integration testing: 4 hours
  • Performance validation: 4 hours
  • Load testing: 4 hours
  • Security + production review: 4 hours

Multi-Agent Orchestration Strategy

Agent Roles & Responsibilities

AgentPhasesPrimary ResponsibilitiesDeliverables
rust-expert-developer1A, 1B, 2A, 2BLLM abstraction implementations, async patterns, factory patternLlmFactory, AnthropicLlm, OpenAILlm, GeminiLlm, executor updates
codi-test-engineer1A, 1B, 2A, 2B, 4Unit tests, integration tests, performance benchmarks, load testsTest suites, benchmark reports, test coverage reports
senior-architect3, 4Architecture review, rollback procedure, production readinessArchitecture validation, rollback plan, readiness checklist
codi-documentation-writer3Migration guide, documentation updates, developer guidesMigration guide, updated docs, API documentation
security-specialist-agent4Security review, API key handling, error message sanitizationSecurity review report, remediation recommendations

Workflow Coordination

Phase 1A Workflow

rust-expert-developer:
1. Implement LlmFactory (6 hours)
2. Implement AnthropicLlm (6 hours)

codi-test-engineer:
3. Write unit tests for factory (2 hours)
4. Write unit tests for anthropic (2 hours)

senior-architect:
5. Review factory pattern implementation (30 min)
6. Approve Phase 1A completion ✅

Phase 1B Workflow

rust-expert-developer:
1. Implement _execute_via_llm() (8 hours)
2. Add async/sync compatibility (4 hours)

codi-test-engineer:
3. Write unit tests for dual-mode (3 hours)
4. Write integration tests (1 hour)

senior-architect:
5. Review dual-mode design (30 min)
6. Approve Phase 1B completion ✅

Phase 2A Workflow

rust-expert-developer:
1. Implement OpenAILlm (6 hours)
2. Register in factory (2 hours)

codi-test-engineer (parallel):
3. Write unit tests (2 hours)
4. Write integration tests (2 hours)
5. Run performance benchmarks (4 hours)

senior-architect:
6. Review performance results (30 min)
7. Approve Phase 2A completion ✅

Phase 2B Workflow

rust-expert-developer:
1. Complete GeminiLlm (6 hours)
2. Register in factory (2 hours)

codi-test-engineer (parallel):
3. Write unit tests (2 hours)
4. Write integration tests (2 hours)
5. Run performance benchmarks (4 hours)

senior-architect:
6. Review all providers (1 hour)
7. Approve Phase 2B completion ✅

Phase 3 Workflow

codi-documentation-writer:
1. Add deprecation warnings to scripts (2 hours)
2. Create migration guide (6 hours)
3. Update documentation (6 hours)

senior-architect:
4. Create rollback procedure (2 hours)
5. Review migration strategy (30 min)
6. Approve Phase 3 completion ✅

Phase 4 Workflow

codi-test-engineer:
1. Run integration tests (4 hours)
2. Validate performance (4 hours)
3. Run load tests (4 hours)

security-specialist-agent (parallel):
4. Security review (3 hours)

senior-architect:
5. Production readiness review (4 hours)
6. Phase 1 foundation validation (1 hour)
7. Final approval ✅

Parallel Execution Opportunities

Week 1

  • Phase 1A (Days 1-2): Sequential (factory → anthropic → tests)
  • Phase 1B (Days 3-5): Sequential (executor → tests)

Week 2

  • Phase 2A (Days 1-2): Parallel (implementation + testing can overlap)
  • Phase 2B (Days 3-5): Parallel (implementation + testing can overlap)

Week 3

  • Phase 3: Parallel (documentation + rollback procedure can overlap)

Week 4

  • Phase 4: Parallel (testing + security review can run concurrently)

Maximum Concurrency: 2-3 agents working simultaneously during Weeks 2-4


Quality Gates & Success Criteria

Phase 1A Quality Gate

Entry Criteria

  • ✅ Go/no-go decision approved
  • ✅ Development environment setup complete
  • ✅ Dependencies installed (anthropic, pytest, pytest-asyncio)

Exit Criteria

  • ✅ LlmFactory can instantiate AnthropicLlm
  • ✅ AnthropicLlm successfully calls Claude API (integration test)
  • ✅ Test coverage ≥90% for factory.py and anthropic.py
  • ✅ All unit tests pass (100% pass rate)
  • ✅ Code review approved by senior-architect

Validation

# Run tests
pytest tests/test_factory.py tests/test_anthropic.py -v --cov

# Integration test
pytest tests/integration/test_anthropic_real_api.py -v

# Coverage check
pytest --cov=llm_abstractions --cov-report=term-missing

Rollback: If exit criteria not met, revert to subprocess execution.


Phase 1B Quality Gate

Entry Criteria

  • ✅ Phase 1A completed (LlmFactory + AnthropicLlm working)
  • ✅ TaskExecutor code reviewed and understood

Exit Criteria

  • ✅ TaskExecutor executes tasks via AnthropicLlm when use_direct_llm=True
  • ✅ Graceful fallback to script execution works
  • ✅ Async/sync compatibility layer functional
  • ✅ Test coverage ≥90% for new executor code
  • ✅ All unit + integration tests pass (100% pass rate)
  • ✅ Code review approved by senior-architect

Validation

# Run executor tests
pytest tests/test_executor_dual_mode.py -v --cov

# Integration test (real API)
pytest tests/integration/test_executor_anthropic.py -v

# Test fallback behavior
pytest tests/test_executor_fallback.py -v

Rollback: Disable use_direct_llm flag (default to False).


Phase 2A Quality Gate

Entry Criteria

  • ✅ Phase 1B completed (dual-mode executor working)
  • ✅ OpenAI API key available for testing

Exit Criteria

  • ✅ OpenAILlm successfully calls GPT-4 API
  • ✅ TaskExecutor executes tasks via OpenAILlm
  • ✅ Performance benchmark shows ≥30% improvement over subprocess
  • ✅ Test coverage ≥90% for openai_llm.py
  • ✅ All unit + integration tests pass (100% pass rate)
  • ✅ Performance validation report approved

Validation

# Run OpenAI tests
pytest tests/test_openai_llm.py tests/integration/test_executor_openai.py -v

# Run performance benchmark
python benchmarks/openai_comparison.py

# Expected output: ≥30% faster than subprocess

Rollback: Remove OpenAILlm from factory registry (AnthropicLlm still works).


Phase 2B Quality Gate

Entry Criteria

  • ✅ Phase 2A completed (OpenAILlm working)
  • ✅ Google API key available for testing

Exit Criteria

  • ✅ GeminiLlm successfully calls Gemini API
  • ✅ TaskExecutor executes tasks via GeminiLlm
  • ✅ Performance benchmark shows ≥30% improvement over subprocess
  • ✅ Test coverage ≥90% for gemini.py
  • ✅ All unit + integration tests pass (100% pass rate)
  • ✅ All 3 providers (Anthropic, OpenAI, Gemini) working
  • ✅ Performance validation report approved

Validation

# Run Gemini tests
pytest tests/test_gemini.py tests/integration/test_executor_gemini.py -v

# Run performance benchmark
python benchmarks/gemini_comparison.py

# Run comprehensive provider tests
pytest tests/integration/test_all_providers.py -v

Rollback: Remove GeminiLlm from factory registry (Anthropic + OpenAI still work).


Phase 3 Quality Gate

Entry Criteria

  • ✅ Phase 2B completed (all providers working)
  • ✅ Migration strategy approved

Exit Criteria

  • ✅ All execute_*.py scripts display clear deprecation warnings
  • ✅ Migration guide reviewed and approved
  • ✅ Documentation updated to reflect new architecture
  • ✅ Rollback procedure tested and validated
  • ✅ Deprecation announcement prepared (if applicable)

Validation

# Verify deprecation warnings
echo '{}' | python scripts/llm_execution/execute_claude.py 2>&1 | grep -i deprecat

# Review migration guide
cat docs/EXECUTOR-MIGRATION-GUIDE.md | wc -w # Should be ~2000 words

# Test rollback procedure
./docs/EXECUTOR-ROLLBACK-PROCEDURE.sh --dry-run

Rollback: N/A (documentation-only phase, no code changes).


Phase 4 Quality Gate (Final)

Entry Criteria

  • ✅ Phase 3 completed (deprecation + docs)
  • ✅ All previous phases passed quality gates

Exit Criteria (Production Readiness)

  • ✅ All integration tests pass (100% success rate)
  • ✅ Performance benchmarks show ≥30% improvement for all providers
  • ✅ Load tests handle 100+ concurrent tasks without failures
  • ✅ Test coverage ≥90% for all new code
  • ✅ Security review finds no critical issues
  • ✅ API key handling follows best practices
  • ✅ Error messages don't leak sensitive data
  • ✅ Production readiness checklist 100% complete
  • ✅ Phase 1 Message Bus foundation validated
  • ✅ Final sign-off from senior-architect

Validation

# Integration tests
pytest tests/integration/ -v --maxfail=0

# Performance validation
python benchmarks/validate_all_providers.py

# Load tests
python tests/load/test_concurrent_execution.py --tasks=100

# Security check
python scripts/security_audit.py --check-api-keys --check-error-messages

# Coverage report
pytest --cov=llm_abstractions --cov=orchestration --cov-report=html

Production Deployment: If all exit criteria met, merge to main and deploy.

Rollback: Set use_direct_llm=False globally in TaskExecutor (reverts to script execution).


Risk Management

Risk Matrix

Risk IDRisk DescriptionProbabilityImpactMitigation StrategyOwner
R1API provider changes SDK interfaceLowHighPin SDK versions in requirements.txt; monitor changelogsrust-expert-developer
R2Performance improvement < 30%MediumMediumRun early benchmarks in Phase 2A; optimize if neededcodi-test-engineer
R3Backward compatibility brokenLowCriticalDual-mode executor with feature flag; comprehensive testingsenior-architect
R4Async/sync compatibility issuesMediumHighTest in both contexts; use asyncio.run() wrapperrust-expert-developer
R5LLM provider API outage during testingLowLowUse mock providers for unit tests; retry integration testscodi-test-engineer
R6Security vulnerability in API key handlingLowCriticalSecurity review in Phase 4; follow SDK best practicessecurity-specialist-agent
R7Timeline slip (>4 weeks)MediumMediumWeekly progress tracking; adjust scope if neededsenior-architect
R8Cost overrun (>$10K)LowMediumTrack hours daily; flag early if approaching 80% budgetsenior-architect
R9Phase 1 Message Bus integration issuesLowHighValidate async foundation in Phase 4; document requirementssenior-architect
R10User adoption resistance (prefer scripts)MediumLowCreate compelling migration guide; show performance winscodi-documentation-writer

Risk Monitoring

Weekly Risk Review

  • Review open risks during weekly sync
  • Update probability/impact based on progress
  • Activate mitigation strategies as needed

Escalation Triggers

  • Any CRITICAL impact risk becomes HIGH probability
  • Timeline slip >20% (>1 week delay)
  • Cost overrun >20% (>$2K over budget)
  • Any quality gate fails twice

Escalation Path

  1. Senior architect reviews issue
  2. Decide: adjust scope, add resources, or accept delay
  3. Notify stakeholders if timeline/budget impacted

Backward Compatibility Strategy

Dual-Mode Execution

Feature Flag: use_direct_llm (task-level metadata)

Default Behavior

  • use_direct_llm=True (NEW): Try direct LLM execution via factory
  • If provider unavailable (missing API key or dependencies): Graceful fallback to script execution
  • use_direct_llm=False (OLD): Use script execution (legacy behavior)

Rollback Safety

  • Setting use_direct_llm=False globally reverts to 100% subprocess execution
  • No code changes needed to rollback (just flip feature flag)
  • Scripts remain fully functional during entire 6-month deprecation period

Migration Timeline

DateMilestoneAction
Week 4 (Phase 3)Script deprecation warnings addedScripts display deprecation notice but still work
Month 1-6Migration periodUsers adopt direct LLM execution at their own pace
Month 6Deprecation deadlineScripts show stronger warnings
Month 7 (v2.0)Script removalScripts deleted from codebase

Safety Net: If migration issues arise, extend deprecation period by 3 months.

Testing Strategy for Compatibility

Dual-Mode Testing

# Test both execution paths for every provider
@pytest.mark.parametrize("use_direct_llm", [True, False])
def test_task_execution(use_direct_llm):
task.metadata["use_direct_llm"] = use_direct_llm
result = executor.execute(task)
assert result.status == ExecutionStatus.SUCCESS

Fallback Testing

# Test graceful fallback when provider unavailable
def test_fallback_missing_api_key(monkeypatch):
monkeypatch.delenv("OPENAI_API_KEY")
task.metadata["use_direct_llm"] = True
result = executor.execute(task)
# Should fallback to script execution
assert result.metadata.get("fallback_reason") == "missing_api_key"


Testing Strategy

Test Coverage Goals

ModuleTarget CoverageTest Types
llm_abstractions/factory.py≥90%Unit, integration
llm_abstractions/anthropic.py≥90%Unit, integration
llm_abstractions/openai_llm.py≥90%Unit, integration
llm_abstractions/gemini.py≥90%Unit, integration
orchestration/executor.py (new code)≥90%Unit, integration
Overall llm_abstractions≥90%All types

Test Types

Unit Tests (Fast, No API Calls)

  • Mock BaseLlm implementations
  • Test factory provider registry
  • Test message conversion logic
  • Test error handling

Integration Tests (Real API Calls)

  • Test each provider with real API key
  • Validate response parsing
  • Test error scenarios (rate limits, invalid keys)
  • Requires API keys in CI/CD environment

Performance Benchmarks

  • Compare direct LLM vs subprocess execution
  • Measure latency reduction (target: ≥30%)
  • Test with various task sizes (short, medium, long prompts)

Load Tests

  • Test 100+ concurrent tasks
  • Validate async execution doesn't degrade performance
  • Check for memory leaks or resource exhaustion

Test Execution Strategy

Local Development

# Run unit tests (fast, no API keys needed)
pytest tests/ -m "not integration" -v

# Run integration tests (requires API keys)
export ANTHROPIC_API_KEY=xxx
export OPENAI_API_KEY=xxx
export GOOGLE_API_KEY=xxx
pytest tests/integration/ -v

# Run all tests with coverage
pytest --cov=llm_abstractions --cov=orchestration --cov-report=html

CI/CD Pipeline

# GitHub Actions workflow
jobs:
unit-tests:
runs-on: ubuntu-latest
steps:
- run: pytest tests/ -m "not integration" --cov

integration-tests:
runs-on: ubuntu-latest
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
steps:
- run: pytest tests/integration/ -v


Performance Benchmarks

Baseline (Subprocess Execution)

Measured with: scripts/llm_execution/execute_gpt.py

MetricValue
Subprocess spawn overhead100-200ms
JSON serialization10-20ms
Script execution overhead50-100ms
Result parsing10-20ms
Total overhead180-350ms
LLM API call500-2000ms (variable)
End-to-end latency680-2350ms

Target (Direct LLM Execution)

Expected with: LlmFactory → OpenAILlm

MetricTarget Value
Factory lookup1ms
Message building5ms
LLM API call500-2000ms (same)
Result parsing10-20ms
Total overhead66-126ms
End-to-end latency566-2126ms

Performance Improvement Calculation

Overhead reduction = (180-350ms) - (66-126ms) = 114-224ms

Improvement % = (114-224ms) / (180-350ms) = 30-50%

Target: ≥30% improvement across all providers

Benchmark Suite

File: benchmarks/executor_performance.py

import time
from orchestration.executor import TaskExecutor
from orchestration.task import AgentTask

def benchmark_subprocess_execution():
"""Baseline: subprocess-based execution."""
task = AgentTask(
task_id="BENCH-001",
title="Test task",
description="Simple test",
agent="gpt-4"
)
task.metadata["use_direct_llm"] = False # Force subprocess

start = time.time()
result = executor.execute(task)
end = time.time()

return end - start

def benchmark_direct_llm_execution():
"""Target: direct LLM execution."""
task = AgentTask(
task_id="BENCH-002",
title="Test task",
description="Simple test",
agent="gpt-4"
)
task.metadata["use_direct_llm"] = True # Force direct LLM

start = time.time()
result = executor.execute(task)
end = time.time()

return end - start

# Run benchmarks
subprocess_times = [benchmark_subprocess_execution() for _ in range(10)]
direct_llm_times = [benchmark_direct_llm_execution() for _ in range(10)]

# Calculate improvement
avg_subprocess = sum(subprocess_times) / len(subprocess_times)
avg_direct_llm = sum(direct_llm_times) / len(direct_llm_times)
improvement = (avg_subprocess - avg_direct_llm) / avg_subprocess * 100

print(f"Subprocess avg: {avg_subprocess:.3f}s")
print(f"Direct LLM avg: {avg_direct_llm:.3f}s")
print(f"Improvement: {improvement:.1f}%")
assert improvement >= 30, f"Performance target not met: {improvement:.1f}% < 30%"


Migration & Deprecation Plan

Deprecation Timeline

6-Month Gradual Deprecation

MonthActionUser Impact
Month 0 (Phase 3)Add deprecation warnings to scriptsUsers see warnings but scripts work
Month 1-3Migration period (early adopters)Users migrate at their own pace
Month 4-6Migration period (remaining users)Stronger warnings added
Month 6Final deprecation noticeScripts still work but show urgent warnings
Month 7 (v2.0)Script removalScripts deleted from codebase

Deprecation Warning Implementation

File: scripts/llm_execution/execute_gpt.py (and others)

import warnings

# Add at top of main()
warnings.warn(
"\n" + "=" * 70 + "\n"
"DEPRECATION WARNING: execute_gpt.py will be removed in v2.0\n"
"=" * 70 + "\n"
"This script-based execution is deprecated in favor of direct\n"
"LLM abstraction layer via LlmFactory.\n\n"
"Migration Guide: docs/EXECUTOR-MIGRATION-GUIDE.md\n"
"Removal Date: [6 months from Phase 3 completion]\n"
"=" * 70 + "\n",
DeprecationWarning,
stacklevel=2
)

Migration Guide Outline

File: docs/EXECUTOR-MIGRATION-GUIDE.md

Contents

  1. Why Migrate? - Performance improvements, async support, Phase 1 foundation
  2. Before You Start - Prerequisites, API keys, dependencies
  3. Step-by-Step Migration:
    • Update TaskExecutor usage to set use_direct_llm=True
    • Test dual-mode execution
    • Validate performance improvements
    • Remove use_direct_llm flag (defaults to True)
  4. Testing Your Migration:
    • Unit test examples
    • Integration test examples
    • Performance benchmarks
  5. Rollback Procedure - How to revert if issues arise
  6. Troubleshooting:
    • Missing API keys
    • Missing dependencies
    • Provider unavailability
  7. FAQ:
    • When will scripts be removed?
    • What if I can't migrate by Month 6?
    • How do I get help?

Rollback Procedure

File: docs/EXECUTOR-ROLLBACK-PROCEDURE.md

Quick Rollback (Set Feature Flag)

# In TaskExecutor initialization or task metadata
task.metadata["use_direct_llm"] = False # Revert to script execution

Full Rollback (Revert Git Commits)

# If critical issues arise, revert refactoring
git revert <phase-4-commit-sha>
git revert <phase-3-commit-sha>
git revert <phase-2b-commit-sha>
git revert <phase-2a-commit-sha>
git revert <phase-1b-commit-sha>
git revert <phase-1a-commit-sha>

Validation After Rollback

# Ensure scripts still work
pytest tests/test_executor_scripts.py -v

# Verify no regressions
pytest tests/integration/ -v


Agent Assignment Matrix

Task IDTask DescriptionAgentPhaseHoursDependencies
T1.1Implement LlmFactoryrust-expert-developer1A6None
T1.2Implement AnthropicLlmrust-expert-developer1A6T1.1
T1.3Unit tests for factorycodi-test-engineer1A2T1.1
T1.4Unit tests for anthropiccodi-test-engineer1A2T1.2
T2.1Add _execute_via_llm()rust-expert-developer1B8T1.2
T2.2Async/sync compatibilityrust-expert-developer1B4T2.1
T2.3Unit tests for executorcodi-test-engineer1B3T2.1
T2.4Integration testscodi-test-engineer1B1T2.1
T3.1Implement OpenAILlmrust-expert-developer2A6T2.1
T3.2Register in factoryrust-expert-developer2A2T3.1
T3.3Unit tests for openaicodi-test-engineer2A2T3.1
T3.4Integration testscodi-test-engineer2A2T3.1
T3.5Performance benchmarkscodi-test-engineer2A4T3.1
T4.1Complete GeminiLlmrust-expert-developer2B6T2.1
T4.2Register in factoryrust-expert-developer2B2T4.1
T4.3Unit tests for geminicodi-test-engineer2B2T4.1
T4.4Integration testscodi-test-engineer2B2T4.1
T4.5Performance benchmarkscodi-test-engineer2B4T4.1
T5.1Add deprecation warningscodi-documentation-writer32T4.1
T5.2Create migration guidecodi-documentation-writer36T4.1
T5.3Update documentationcodi-documentation-writer36T4.1
T5.4Create rollback proceduresenior-architect32T4.1
T6.1Integration testscodi-test-engineer44T5.1
T6.2Performance validationcodi-test-engineer44T5.1
T6.3Load testscodi-test-engineer44T5.1
T6.4Security reviewsecurity-specialist-agent43T5.1
T6.5Production readinesssenior-architect44T6.1, T6.2, T6.3, T6.4
T6.6Phase 1 foundation validationsenior-architect41T6.5

Total Tasks: 26 Total Hours: 80 Total Cost: $10,000 (at $125/hour)


Budget Breakdown

Engineering Costs

PhaseTasksHoursCostNotes
Phase 1AT1.1-T1.416$2,000Factory + Anthropic implementation
Phase 1BT2.1-T2.416$2,000Dual-mode executor
Phase 2AT3.1-T3.516$2,000OpenAI implementation
Phase 2BT4.1-T4.516$2,000Gemini implementation
Phase 3T5.1-T5.416$2,000Deprecation + docs
Phase 4T6.1-T6.616$2,000Testing + validation
Total26 tasks80 hours$10,000Fully allocated

Engineering Rate: $125/hour (senior Python developer with LLM/async expertise)

Infrastructure Costs

ItemCostNotes
LLM API Usage (Testing)$200Anthropic + OpenAI + Gemini integration tests
CI/CD Compute (GitHub Actions)$5080 hours of testing across 4 weeks
Total Infrastructure$250One-time costs

Total Project Cost

Engineering:      $10,000
Infrastructure: $ 250
---
Total: $10,250

Budget Contingency: 10% ($1,025) for scope adjustments = $11,275 total budget


Timeline & Dependencies

Gantt Chart (4 Weeks)


Week 1: Foundation
├─ Days 1-2: Phase 1A (Factory + Anthropic)
│ ├─ T1.1: LlmFactory (6h)
│ ├─ T1.2: AnthropicLlm (6h)
│ ├─ T1.3-T1.4: Tests (4h)

└─ Days 3-5: Phase 1B (Dual-Mode Executor)
├─ T2.1: _execute_via_llm() (8h)
├─ T2.2: Async compatibility (4h)
└─ T2.3-T2.4: Tests (4h)

Week 2: Provider Implementations
├─ Days 1-2: Phase 2A (OpenAI)
│ ├─ T3.1-T3.2: OpenAILlm (8h)
│ ├─ T3.3-T3.4: Tests (4h) [PARALLEL]
│ └─ T3.5: Benchmarks (4h)

└─ Days 3-5: Phase 2B (Gemini)
├─ T4.1-T4.2: GeminiLlm (8h)
├─ T4.3-T4.4: Tests (4h) [PARALLEL]
└─ T4.5: Benchmarks (4h)

Week 3: Deprecation & Documentation
└─ Days 1-5: Phase 3
├─ T5.1: Warnings (2h)
├─ T5.2: Migration guide (6h)
├─ T5.3: Docs (6h)
└─ T5.4: Rollback procedure (2h) [PARALLEL]

Week 4: Testing & Validation
└─ Days 1-5: Phase 4
├─ T6.1: Integration tests (4h)
├─ T6.2: Performance tests (4h)
├─ T6.3: Load tests (4h)
├─ T6.4: Security review (3h) [PARALLEL]
├─ T6.5: Production readiness (4h)
└─ T6.6: Phase 1 validation (1h)

Critical Path


T1.1 → T1.2 → T2.1 → T2.2 → T3.1 → T4.1 → T5.2 → T6.5 → T6.6

Critical Path Duration: 50 hours (longest sequential dependency chain) Total Project Duration: 80 hours (with parallel work) Efficiency: 62.5% (50/80) - Good parallelization

Dependencies Diagram


T1.1 (Factory)

T1.2 (Anthropic) ───→ T1.3-T1.4 (Tests)

T2.1 (Executor) ───→ T2.3-T2.4 (Tests)

T2.2 (Async)

┌────┴────┐
↓ ↓
T3.1 (OpenAI) T4.1 (Gemini)
↓ ↓
T3.3-T3.5 T4.3-T4.5 (Tests + Benchmarks)
└────┬────┘

T5.1-T5.4 (Deprecation)

T6.1-T6.4 (Testing)

T6.5-T6.6 (Validation)


Monitoring & Metrics

Success Metrics

MetricBaselineTargetMeasurement Method
Performance Improvement0%≥30%Performance benchmarks
Test Coverage0%≥90%pytest --cov report
Integration Test Pass RateN/A100%CI/CD pipeline
Provider Availability1 (Gemini placeholder)3 (Anthropic, OpenAI, Gemini)Factory registry
Async SupportNoYesTaskExecutor async compatibility
Parallel ExecutionNoYes (3x speedup)Async executor benchmarks
Backward Compatibility100%100%Dual-mode testing
Timeline AdherenceN/A≤4.5 weeksWeekly progress tracking
Cost AdherenceN/A≤$10,750Daily hour tracking

Weekly Progress Tracking

Weekly Sync Meeting Agenda

  1. Review completed tasks from TASKLIST-EXECUTOR-REFACTORING.md
  2. Update progress metrics
  3. Identify blockers
  4. Adjust timeline/scope if needed
  5. Review risks and activate mitigations

Progress Report Template

## Week [N] Progress Report

**Date:** [YYYY-MM-DD]
**Phase:** [Phase Name]

### Completed Tasks
- [x] T1.1: LlmFactory implementation
- [x] T1.2: AnthropicLlm implementation

### In Progress
- [ ] T1.3: Unit tests for factory (75% complete)

### Blockers
- None

### Metrics
- Hours spent: 12/16 (75%)
- Test coverage: 85% (target: 90%)
- Performance improvement: N/A (not yet measured)

### Next Week Plan
- Complete Phase 1A testing
- Begin Phase 1B (dual-mode executor)

Automated Metrics Collection

CI/CD Integration

# .github/workflows/executor-refactoring.yml
jobs:
metrics:
runs-on: ubuntu-latest
steps:
- name: Test coverage
run: |
pytest --cov=llm_abstractions --cov-report=json
echo "Coverage: $(jq '.totals.percent_covered' coverage.json)%"

- name: Performance benchmarks
run: |
python benchmarks/executor_performance.py > benchmark-results.txt
cat benchmark-results.txt

- name: Upload artifacts
uses: actions/upload-artifact@v3
with:
name: metrics-week-${{ github.run_number }}
path: |
coverage.json
benchmark-results.txt


Go/No-Go Decision Criteria

Go Criteria (Proceed with Refactoring)

Technical Readiness

  • BaseLlm abstract class exists and is well-designed
  • AgentRegistry with AgentType enum is in place
  • TaskExecutor architecture supports dual-mode execution
  • ✅ Development environment ready (Python 3.10+, dependencies)

Business Justification

  • ✅ Performance improvement (30-50%) aligns with Phase 1 requirements
  • ✅ Budget ($10K) is available and approved
  • ✅ Timeline (4 weeks) fits within project schedule
  • ✅ ROI is clear: Foundation for Phase 1 autonomous agents

Resource Availability

  • ✅ Senior Python developer available (80 hours over 4 weeks)
  • ✅ Agents available for coordination (test, docs, architecture)
  • ✅ API keys available for testing (Anthropic, OpenAI, Gemini)

Risk Acceptance

  • ✅ Dual-mode strategy mitigates backward compatibility risk
  • ✅ Rollback procedure provides safety net
  • ✅ No blocking dependencies on other work

No-Go Criteria (Defer Refactoring)

Technical Concerns

  • BaseLlm design fundamentally flawed (requires redesign)
  • ❌ Async/sync compatibility not achievable
  • ❌ Critical security issues with API key handling

Business Concerns

  • ❌ Budget not available ($10K not approved)
  • ❌ Timeline too aggressive (need >4 weeks but can't allocate)
  • ❌ Phase 1 delayed (no immediate need for async foundation)

Resource Constraints

  • ❌ Senior Python developer not available
  • ❌ API keys not available for testing
  • ❌ Competing priorities (critical bugs, other features)

Risk Intolerance

  • ❌ Stakeholders unwilling to accept any risk
  • ❌ Backward compatibility concerns cannot be mitigated
  • ❌ Performance improvement not critical

Decision Authority

Decision Maker: Hal Casteel, CEO/CTO, AZ1.AI INC.

Recommendation: GO

Rationale

  1. Technical Readiness: All prerequisites met; architecture supports dual-mode execution
  2. Strategic Importance: Foundation for Phase 1 Message Bus autonomous agents
  3. Performance Impact: 30-50% improvement enables better user experience
  4. Risk Mitigation: Dual-mode executor + rollback procedure = low risk
  5. ROI: $10K investment enables $100K+ Phase 1 implementation

Conditions for GO

  • ✅ API keys available for all 3 providers (Anthropic, OpenAI, Gemini)
  • ✅ 86 engineering hours allocated over 4.5 weeks (updated for async)
  • ✅ Weekly progress tracking and risk review
  • ✅ Senior architect oversight for quality gates
  • ✅ User approval for async executor refactoring

Architecture Decision Record

ADR-001: Async TaskExecutor Refactoring

Status: ✅ ACCEPTED (2025-11-23)

Decision: Make TaskExecutor.execute() an async method, converting all 7 execution-related methods to async/await pattern.

Complete Documentation: docs/02-architecture/adrs/ADR-001-async-task-executor-refactoring.md

Key Sections

  1. Context and Problem Statement

    • Current architecture uses asyncio.run() wrapper (creates new event loop per task)
    • User explicitly requested async alignment with ProjectOrchestrator
    • Phase 1 Message Bus requires end-to-end async flow
  2. Decision Drivers

    • User alignment (mandatory)
    • Phase 1 foundation (mandatory)
    • Parallel execution (3x speedup)
    • Architecture purity (eliminate async/sync boundaries)
  3. Alternatives Considered

    • Option 1: Keep sync executor with asyncio.run() wrapper ❌ REJECTED
    • Option 2: Hybrid approach (dual sync/async methods) ❌ REJECTED
    • Option 3: Make TaskExecutor.execute() async ✅ SELECTED
  4. Decision Outcome

    • 7 methods converted to async
    • All callers use await executor.execute()
    • pytest-asyncio for all tests
    • Budget: +$750 (7.5% increase)
    • Timeline: +3 days (4.5 weeks vs 4 weeks)
  5. Consequences

    • Positive: End-to-end async, 3x parallel speedup, Phase 1 foundation, architecture purity
    • ⚠️ Negative: Breaking change, test migration, budget/timeline increase
    • Mitigation: Migration guide, rollback plan, comprehensive testing
  6. Implementation Details

    • Phase 1B updated: 22 hours (up from 16 hours)
    • 7 method signatures changed to async
    • pytest-asyncio==0.23.0 added to requirements-dev.txt
    • Success criteria: Parallel tasks <3s (vs 6s sequential)
  7. Validation and Compliance

    • All 7 methods converted to async
    • Parallel execution test passes (<3s for 3 tasks)
    • No asyncio.run() in production code
    • Phase 1 foundation validated

Approval

  • Decision Maker: Hal Casteel, CEO/CTO, AZ1.AI INC.
  • Status: ✅ APPROVED (based on user feedback)
  • Date: 2025-11-23
  • ROI: 10x+ (enables $100K+ Phase 1 implementation)

Appendix

Glossary

TermDefinition
BaseLlmAbstract base class for all LLM provider implementations
LlmFactoryFactory class for creating LLM provider instances based on AgentType
AgentTypeEnum defining LLM types (ANTHROPIC_Claude, OPENAI_GPT, GOOGLE_GEMINI, etc.)
Dual-Mode ExecutorTaskExecutor supporting both direct LLM and subprocess execution
Feature Flaguse_direct_llm metadata flag controlling execution mode
Graceful FallbackAutomatic revert to script execution if provider unavailable
Phase 1 Message BusFuture work: autonomous agent-to-agent communication via RabbitMQ

References

Internal Documents

  • docs/03-project-planning/project-plan.md - Master rollout plan
  • orchestration/README.md - Orchestration framework documentation
  • llm_abstractions/base_llm.py - BaseLlm abstract class

External Resources

Change Log

VersionDateAuthorChanges
1.02025-11-23Hal CasteelInitial project plan created

Document Status: ✅ READY FOR GO/NO-GO DECISION Last Updated: 2025-11-23 Next Review: After go/no-go decision Owner: Hal Casteel, CEO/CTO, AZ1.AI INC.