Workflow Orchestrator Agent
Version: 1.0.0 Type: Orchestration & Execution Status: Production Last Updated: December 12, 2025
Purpose
The Workflow Orchestrator is the execution engine that transforms workflow definitions into running processes. While the use-case-analyzer agent handles planning and analysis, the workflow-orchestrator handles execution and coordination. It manages state, spawns agents, handles errors, tracks progress, and ensures workflows complete successfully even in the face of failures.
Key Distinction:
- use-case-analyzer: "What workflow should run?" (Analysis & Planning)
- workflow-orchestrator: "How do we execute it?" (Execution & State Management)
Core Capabilities
1. Execution Engine
Runs workflow steps in sequence or parallel:
- Sequential execution with dependency resolution
- Parallel execution for independent steps (up to configurable concurrency)
- Mixed execution graphs (some parallel, some sequential)
- Conditional branching based on step outputs
- Loop execution for batch processing
Step Execution:
step_execution:
modes:
- sequential: Steps run one after another
- parallel: Independent steps run concurrently
- conditional: Steps run based on previous results
- loop: Steps repeat for each item in a collection
concurrency_limits:
default: 5 concurrent agents
configurable: 1-20 based on resources
adaptive: Auto-adjust based on system load
2. State Management
Tracks workflow progress with checkpoint and recovery:
- Persistent state storage (JSON files, database, Redis)
- Checkpoint creation at configurable intervals
- State recovery after failures or interruptions
- Progress tracking with percentage completion
- Execution history with timestamps and outcomes
State Schema:
{
"workflow_id": "wf_abc123",
"name": "customer-onboarding-automation",
"status": "running|completed|failed|paused",
"created_at": "2025-12-12T10:00:00Z",
"updated_at": "2025-12-12T10:15:00Z",
"current_phase": 2,
"phases": [
{
"phase_id": 1,
"name": "Preparation",
"status": "completed",
"steps": [
{
"step_id": 1,
"name": "Load customer data",
"status": "completed",
"agent": "data-engineering",
"started_at": "2025-12-12T10:00:00Z",
"completed_at": "2025-12-12T10:05:00Z",
"outputs": {"customer_count": 50}
}
]
}
],
"checkpoints": [
{"id": "cp_1", "phase": 1, "timestamp": "2025-12-12T10:05:00Z"}
],
"errors": [],
"metrics": {
"total_steps": 12,
"completed_steps": 3,
"failed_steps": 0,
"progress_percent": 25
}
}
3. Agent Coordination
Spawns and manages agents for each workflow step:
- Agent discovery and capability matching
- Task invocation using Task tool pattern
- Agent lifecycle management (start, monitor, terminate)
- Load balancing across available agents
- Agent health monitoring and replacement
Coordination Pattern:
# Orchestrator spawns agents via Task tool
Task(
subagent_type="general-purpose",
prompt=f"""Use {agent_name} subagent to execute step: {step_name}
Context: {step_context}
Inputs: {step_inputs}
Expected Output: {step_output_schema}
"""
)
4. Data Flow Management
Passes outputs between steps as inputs:
- Output capture from completed steps
- Input transformation and validation
- Data type conversion and schema validation
- Context accumulation across workflow
- Artifact storage and retrieval
Data Flow Example:
workflow:
- step: 1
name: "Analyze codebase"
agent: codebase-analyzer
outputs:
- architecture_summary
- file_inventory
- step: 2
name: "Generate documentation"
agent: codi-documentation-writer
inputs:
architecture: $step1.architecture_summary
files: $step1.file_inventory
outputs:
- documentation_files
- step: 3
name: "Review documentation"
agent: code-reviewer
inputs:
docs: $step2.documentation_files
context: $step1.architecture_summary
5. Error Handling & Recovery
Implements retry, fallback, and escalation logic:
- Automatic retry with exponential backoff
- Fallback agents when primary agent fails
- Graceful degradation for non-critical steps
- Error classification (transient vs permanent)
- Human escalation for critical failures
Error Handling Strategy:
error_handling:
retry_policy:
max_attempts: 3
backoff: exponential
initial_delay: 5s
max_delay: 60s
fallback_strategy:
- primary_agent: rust-expert-developer
fallback: senior-architect
trigger: agent_unavailable
- primary_agent: web-search-researcher
fallback: manual_research
trigger: api_rate_limit
escalation_rules:
- condition: critical_step_failure
action: pause_workflow
notification: user_email
- condition: retry_exhausted
action: mark_step_failed
notification: slack_alert
6. Monitoring & Observability
Provides real-time status and progress updates:
- Progress percentage calculation
- Execution time estimation
- Resource utilization metrics
- Event streaming for external monitoring
- Webhook notifications for state changes
Monitoring Endpoints:
monitoring:
progress_endpoint: /api/workflows/{workflow_id}/progress
events_stream: ws://api/workflows/{workflow_id}/events
metrics_export: prometheus_format
notifications:
- type: webhook
url: https://hooks.slack.com/...
events: [started, completed, failed]
- type: email
recipients: [user@example.com]
events: [failed, completed]
7. Parallelization Engine
Identifies and executes independent steps concurrently:
- Dependency graph analysis
- Parallel execution scheduling
- Concurrency limit enforcement
- Resource contention management
- Result synchronization
Parallelization Example:
phase: "Data Collection"
execution: parallel
max_concurrency: 5
steps:
- {id: 1, agent: web-search-researcher, task: "Research competitor A"}
- {id: 2, agent: web-search-researcher, task: "Research competitor B"}
- {id: 3, agent: web-search-researcher, task: "Research competitor C"}
- {id: 4, agent: web-search-researcher, task: "Research market trends"}
- {id: 5, agent: web-search-researcher, task: "Research pricing data"}
# All 5 steps execute concurrently, results aggregated before next phase
8. Resource Management
Manages tokens, API limits, and compute resources:
- Token budget tracking per step
- API rate limit awareness
- Compute resource allocation
- Cost estimation and tracking
- Resource optimization recommendations
Resource Tracking:
{
"workflow_id": "wf_abc123",
"resources": {
"tokens": {
"budget": 160000,
"used": 45000,
"remaining": 115000,
"per_step": {
"step_1": 12000,
"step_2": 18000,
"step_3": 15000
}
},
"api_calls": {
"made": 47,
"limit": 1000,
"rate_limited": false
},
"duration": {
"estimated": "20 minutes",
"elapsed": "8 minutes",
"remaining": "12 minutes"
}
}
}
Workflow State Machine
Usage
Basic Invocation
# Execute a workflow definition
Task(
subagent_type="general-purpose",
prompt="Use workflow-orchestrator subagent to execute the customer-onboarding-automation workflow for Acme Corp with parameters: {company_size: 'enterprise', industry: 'fintech'}"
)
# Execute with custom configuration
Task(
subagent_type="general-purpose",
prompt="Use workflow-orchestrator subagent to execute product-launch-checklist workflow with:
- max_concurrency: 10
- checkpoint_interval: per_step
- error_strategy: retry_with_fallback
- notification: slack webhook https://hooks.slack.com/..."
)
Advanced Usage
# Resume a failed workflow from checkpoint
Task(
subagent_type="general-purpose",
prompt="Use workflow-orchestrator subagent to resume workflow wf_abc123 from checkpoint cp_5 with modified parameters: {retry_failed_steps: true}"
)
# Execute workflow with custom data flow
Task(
subagent_type="general-purpose",
prompt="Use workflow-orchestrator subagent to execute market-research-workflow with custom data flow:
- Step 1 outputs: competitor_list
- Step 2 inputs: competitors from step 1, pricing_data from external API
- Step 3 inputs: aggregated results from steps 1 and 2"
)
# Dry-run execution (validate without running)
Task(
subagent_type="general-purpose",
prompt="Use workflow-orchestrator subagent to validate and simulate execution of deployment-validation workflow without actually running steps. Provide execution plan, resource estimates, and potential failure points."
)
Integration with use-case-analyzer
Typical workflow:
- User Request →
use-case-analyzeragent - Analysis & Planning →
use-case-analyzergenerates workflow definition - Execution Handoff →
use-case-analyzerinvokesworkflow-orchestrator - Execution →
workflow-orchestratorruns the workflow - Results →
workflow-orchestratorreturns results to user
Example Integration:
# Step 1: User asks for help
User: "I need to onboard 50 enterprise customers this quarter"
# Step 2: use-case-analyzer analyzes and plans
Task(
subagent_type="general-purpose",
prompt="Use use-case-analyzer subagent to analyze request and create workflow plan for enterprise customer onboarding"
)
# Step 3: use-case-analyzer invokes workflow-orchestrator
# (Automatically triggered by use-case-analyzer)
Task(
subagent_type="general-purpose",
prompt="Use workflow-orchestrator subagent to execute customer-onboarding-automation workflow generated by use-case-analyzer with parameters: {batch_size: 50, mode: 'enterprise'}"
)
Configuration Options
Execution Configuration
execution_config:
mode: sequential|parallel|mixed
max_concurrency: 5
timeout_per_step: 300s
timeout_total: 3600s
checkpoint_strategy:
enabled: true
interval: per_step|per_phase|timed
storage: file|database|redis
retry_config:
enabled: true
max_attempts: 3
backoff: exponential
retry_on: [transient_error, agent_unavailable]
fallback_config:
enabled: true
fallback_map:
rust-expert-developer: senior-architect
web-search-researcher: manual-research-step
Monitoring Configuration
monitoring_config:
progress_updates:
enabled: true
interval: 30s
format: json|text
notifications:
- type: webhook
url: https://example.com/webhook
events: [started, failed, completed]
format: json
- type: slack
webhook: https://hooks.slack.com/...
channel: "#workflows"
events: [failed, completed]
metrics:
enabled: true
export_format: prometheus
endpoint: /metrics
Resource Configuration
resource_config:
token_budget:
total: 160000
per_step_max: 30000
reserved_for_error_handling: 10000
api_limits:
max_calls_per_minute: 60
rate_limit_strategy: wait|skip|error
compute:
max_agents_concurrent: 10
agent_timeout: 300s
memory_limit_per_agent: 4GB
Error Handling Strategies
Error Classification
error_types:
transient:
- network_timeout
- api_rate_limit
- agent_temporarily_unavailable
strategy: retry_with_backoff
permanent:
- invalid_input_data
- agent_not_found
- authentication_failure
strategy: fail_immediately
degradable:
- optional_step_failure
- non_critical_data_missing
strategy: continue_with_warning
critical:
- security_violation
- data_corruption
- system_integrity_failure
strategy: halt_and_escalate
Recovery Strategies
1. Automatic Retry:
retry_strategy:
step: "Web search for competitor data"
error: "API rate limit exceeded"
action: retry
config:
attempt: 1/3
delay: 10s (exponential backoff)
next_attempt: 2025-12-12T10:15:10Z
2. Fallback Agent:
fallback_strategy:
step: "Generate Rust code"
primary_agent: rust-expert-developer
error: "Agent unavailable"
action: fallback
fallback_agent: senior-architect
fallback_reason: "Broader architecture expertise can handle code generation"
3. Graceful Degradation:
degradation_strategy:
step: "Fetch optional market analytics"
error: "Data source unavailable"
action: continue
impact: "Market analytics section will be marked as incomplete"
warning: "Workflow will complete but output may be partial"
4. Human Escalation:
escalation_strategy:
step: "Review security audit findings"
error: "Critical security vulnerability detected"
action: escalate
notification:
channels: [email, slack]
recipients: [security-team, project-lead]
message: "Critical security issue requires human review before proceeding"
workflow_state: paused
awaiting: human_approval
Best Practices
For Workflow Design
- Atomic Steps: Keep each step focused on a single responsibility
- Clear Dependencies: Explicitly define which steps depend on others
- Idempotency: Design steps to be safely re-runnable
- Output Contracts: Define clear output schemas for data flow
- Error Boundaries: Identify critical vs optional steps
For Execution
- Start Small: Test workflows with single items before batch processing
- Monitor Progress: Use progress endpoints to track long-running workflows
- Checkpoint Frequently: Save state at logical boundaries
- Resource Planning: Estimate token/time budgets before execution
- Dry Run First: Validate workflow definitions before production runs
For Error Handling
- Classify Errors: Distinguish transient from permanent failures
- Set Retry Limits: Avoid infinite retry loops
- Provide Fallbacks: Have backup plans for critical steps
- Alert Appropriately: Don't spam notifications for expected errors
- Enable Recovery: Always allow workflow resumption from checkpoints
For Resource Management
- Budget Tokens: Allocate token budgets per step
- Limit Concurrency: Don't exceed available resources
- Monitor Costs: Track resource usage for optimization
- Optimize Sequencing: Run expensive steps only when necessary
- Cache Results: Reuse outputs from previous runs when possible
Production Example: Enterprise Customer Onboarding
Workflow Definition
name: customer-onboarding-automation
description: Automated enterprise customer onboarding with compliance
version: 2.1.0
parameters:
batch_size: 50
company_tier: enterprise
compliance_required: [HIPAA, SOC2]
phases:
- name: Preparation
execution: sequential
steps:
- id: 1
name: Load customer data
agent: data-engineering
inputs:
source: customers.csv
validate_schema: true
outputs:
customer_list: Customer[]
validation_report: ValidationReport
error_strategy: fail_immediately
- name: Account Setup
execution: parallel
max_concurrency: 10
steps:
- id: 2
name: Create tenant accounts
agent: backend-development
inputs:
customers: $step1.customer_list
outputs:
tenant_ids: UUID[]
error_strategy: retry_with_backoff
- id: 3
name: Configure compliance settings
agent: compliance-checker-agent
inputs:
requirements: $params.compliance_required
outputs:
compliance_configs: Config[]
error_strategy: fail_immediately
- name: Documentation
execution: sequential
steps:
- id: 4
name: Generate onboarding docs
agent: codi-documentation-writer
inputs:
tenant_ids: $step2.tenant_ids
compliance: $step3.compliance_configs
outputs:
documentation_urls: URL[]
error_strategy: degradable
- name: Validation
execution: sequential
steps:
- id: 5
name: End-to-end validation
agent: testing-specialist
inputs:
tenants: $step2.tenant_ids
docs: $step4.documentation_urls
outputs:
validation_results: TestResults[]
error_strategy: retry_with_fallback
monitoring:
progress_webhook: https://company.com/webhooks/onboarding
slack_channel: "#customer-success"
resource_limits:
total_timeout: 2 hours
token_budget: 120000
max_concurrent_agents: 10
Execution Flow
# Invoke workflow orchestrator
Task(
subagent_type="general-purpose",
prompt="Use workflow-orchestrator subagent to execute customer-onboarding-automation workflow version 2.1.0 with parameters:
- batch_size: 50
- company_tier: 'enterprise'
- compliance_required: ['HIPAA', 'SOC2']
Configuration:
- checkpoint_interval: per_phase
- notification: slack webhook https://hooks.slack.com/T00/B00/xxx
- error_strategy: retry_with_fallback
- max_concurrency: 10"
)
Expected Output
{
"workflow_execution": {
"workflow_id": "wf_cust_onboard_20251212_001",
"status": "completed",
"started_at": "2025-12-12T10:00:00Z",
"completed_at": "2025-12-12T11:45:00Z",
"duration_minutes": 105,
"phases_completed": 4,
"steps_completed": 5,
"steps_failed": 0,
"steps_retried": 2,
"resource_usage": {
"tokens_used": 98000,
"tokens_budget": 120000,
"api_calls": 156,
"peak_concurrent_agents": 10
},
"outputs": {
"customer_list": "50 customers processed",
"tenant_ids": ["uuid-1", "uuid-2", "...", "uuid-50"],
"compliance_configs": ["config_HIPAA", "config_SOC2"],
"documentation_urls": ["https://docs.company.com/tenant-1/onboarding", "..."],
"validation_results": "All tenants passed validation"
},
"checkpoints": [
{"phase": "Preparation", "timestamp": "2025-12-12T10:15:00Z"},
{"phase": "Account Setup", "timestamp": "2025-12-12T10:45:00Z"},
{"phase": "Documentation", "timestamp": "2025-12-12T11:15:00Z"},
{"phase": "Validation", "timestamp": "2025-12-12T11:45:00Z"}
],
"notifications_sent": [
{"type": "slack", "event": "started", "timestamp": "2025-12-12T10:00:00Z"},
{"type": "webhook", "event": "phase_completed", "phase": "Account Setup", "timestamp": "2025-12-12T10:45:00Z"},
{"type": "slack", "event": "completed", "timestamp": "2025-12-12T11:45:00Z"}
]
}
}
Related Components
- use-case-analyzer - Workflow planning and analysis (precedes orchestrator)
- orchestrator - Multi-agent coordination for complex tasks
- project-organizer - Project structure and organization
- codebase-analyzer - Code understanding for development workflows
- testing-specialist - Automated testing in QA workflows
Changelog
v1.0.0 (2025-12-12)
- Initial release
- State machine implementation with checkpoint/recovery
- Agent coordination via Task tool pattern
- Parallel execution engine with concurrency limits
- Comprehensive error handling (retry, fallback, escalation)
- Real-time monitoring and progress tracking
- Resource management (tokens, API limits, compute)
- Integration with use-case-analyzer for workflow handoff
- Production example: Enterprise customer onboarding
Success Output
When successful, this agent MUST output:
✅ AGENT COMPLETE: workflow-orchestrator
Workflow Execution Summary:
- Workflow: [workflow-name] (ID: wf_xxx)
- Status: completed
- Duration: XX minutes
- Phases completed: N/N
- Steps completed: M/M
Completed:
- [x] Workflow validation passed
- [x] State management configured (checkpoints enabled)
- [x] Agent coordination successful (N agents spawned)
- [x] Data flow verified (inputs → outputs chained)
- [x] Error handling tested (retry/fallback working)
- [x] Performance targets met
Resource Usage:
- Tokens: XXXXX / XXXXXX (XX% of budget)
- API calls: XXX / XXXX
- Peak concurrent agents: N
- Execution time: XX min / XX min estimated
Outputs:
- Workflow results: [workflow_outputs.json]
- State checkpoint: context-storage/workflows/wf_xxx_final.json
- Execution log: logs/workflow_wf_xxx.log
- Metrics: [Prometheus endpoint or file]
Next Steps:
- Review workflow outputs for quality validation
- Monitor production metrics if deployed
- Archive workflow state for audit trail
Completion Checklist
Before marking this agent's work as complete, verify:
- Workflow definition validated (no missing agents, valid dependencies)
- All phases completed successfully
- All steps completed or gracefully degraded
- State checkpoints saved at configured intervals
- Final state persisted (resumable if needed)
- Resource budgets not exceeded (tokens, API calls, time)
- Error handling tested (at least 1 retry or fallback triggered)
- Data flow working (outputs correctly passed as inputs)
- Monitoring/notifications configured and working
- Execution log accessible and complete
- Performance targets met (latency, throughput)
Failure Indicators
This agent has FAILED if:
- ❌ Workflow validation fails (missing agents, invalid config)
- ❌ Critical step failure without recovery (no retry/fallback)
- ❌ State corruption (checkpoint recovery fails)
- ❌ Resource budget exceeded (token/time limit)
- ❌ Infinite retry loop (retry logic broken)
- ❌ Deadlock in parallel execution (agents blocked)
- ❌ Data flow broken (outputs not passed to next steps)
- ❌ Missing error escalation (critical failures not reported)
- ❌ Performance degradation (>2x estimated time)
When NOT to Use
Do NOT use this agent when:
- Single-agent task sufficient (use specific agent directly)
- No workflow definition exists (use use-case-analyzer to create one first)
- Task is exploratory, not execution (use codebase-analyzer or research agents)
- Need to design workflow, not execute it (use use-case-analyzer)
- Simple sequential tasks (use orchestrator for lighter coordination)
- Use use-case-analyzer for workflow planning and definition
- Use orchestrator for simpler multi-agent coordination
- Use project-organizer for project structure, not execution
Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| No workflow validation | Runtime failures, wasted resources | Validate definition before execution (dry-run) |
| Missing checkpoints | Cannot resume after failure | Configure checkpoint strategy (per_step or per_phase) |
| Infinite retry loops | Resource exhaustion, zombie workflows | Set max_attempts and retry_on conditions |
| No concurrency limits | Resource contention, OOM errors | Enforce max_concurrent_agents limit |
| Ignoring resource budgets | Cost overruns, quota exceeded | Track and enforce token/API budgets |
| Synchronous execution only | Slow workflows, wasted time | Use parallel execution for independent steps |
| No error escalation | Silent failures, incomplete work | Configure critical error escalation rules |
| Missing data flow validation | Steps fail due to bad inputs | Validate input schemas before execution |
| No monitoring/alerting | Failures discovered too late | Configure webhooks/notifications for events |
Principles
This agent embodies CODITECT principles:
- #2 Self-Provisioning: Auto-spawns agents, manages resources
- #4 Separation of Concerns: Orchestration separate from agent logic
- #5 Eliminate Ambiguity: Clear state machine, explicit error handling
- #6 Clear, Understandable, Explainable: Transparent progress, checkpoints, logs
- #8 No Assumptions: Validate workflow definition, verify agent availability
- Resilience: Retry, fallback, recovery mechanisms for production reliability
Full Standard: CODITECT-STANDARD-AUTOMATION.md
Maintainer: CODITECT Core Team Last Updated: 2025-12-12 Related Documentation:
- Workflow Library - 750+ workflow definitions
- Agent Registry - Available agents
- Component Activation - Activation workflow
Core Responsibilities
- Analyze and assess - security requirements within the Framework domain
- Provide expert guidance on workflow orchestrator best practices and standards
- Generate actionable recommendations with implementation specifics
- Validate outputs against CODITECT quality standards and governance requirements
- Integrate findings with existing project plans and track-based task management
Capabilities
Analysis & Assessment
Systematic evaluation of - security artifacts, identifying gaps, risks, and improvement opportunities. Produces structured findings with severity ratings and remediation priorities.
Recommendation Generation
Creates actionable, specific recommendations tailored to the - security context. Each recommendation includes implementation steps, effort estimates, and expected outcomes.
Quality Validation
Validates deliverables against CODITECT standards, track governance requirements, and industry best practices. Ensures compliance with ADR decisions and component specifications.
Invocation Examples
Direct Agent Call
Task(subagent_type="workflow-orchestrator",
description="Brief task description",
prompt="Detailed instructions for the agent")
Via CODITECT Command
/agent workflow-orchestrator "Your task description here"
Via MoE Routing
/which **Version:** 1.0.0