Workflow Orchestrator Agent

Version: 1.0.0
Type: Orchestration & Execution
Status: Production
Last Updated: December 12, 2025


Purpose

The Workflow Orchestrator is the execution engine that turns workflow definitions into running processes. The use-case-analyzer agent handles planning and analysis; the workflow-orchestrator handles execution and coordination. It manages state, spawns agents, handles errors, and tracks progress so that a workflow can run to completion, or resume from a checkpoint, when individual steps fail.

Key Distinction:

  • use-case-analyzer: "What workflow should run?" (Analysis & Planning)
  • workflow-orchestrator: "How do we execute it?" (Execution & State Management)

Core Capabilities

1. Execution Engine

Runs workflow steps in sequence or parallel:

  • Sequential execution with dependency resolution
  • Parallel execution for independent steps (up to configurable concurrency)
  • Mixed execution graphs (some parallel, some sequential)
  • Conditional branching based on step outputs
  • Loop execution for batch processing

Step Execution:

step_execution:
  modes:
    - sequential: Steps run one after another
    - parallel: Independent steps run concurrently
    - conditional: Steps run based on previous results
    - loop: Steps repeat for each item in a collection

  concurrency_limits:
    default: 5 concurrent agents
    configurable: 1-20 based on resources
    adaptive: Auto-adjust based on system load
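
The concurrency limit above can be illustrated with an asyncio semaphore. This is a minimal sketch, not the orchestrator's actual implementation: `run_parallel` and `run_step` are hypothetical names, and the short sleep stands in for a real agent invocation.

```python
import asyncio


async def run_parallel(step_names, max_concurrency=5):
    """Run hypothetical steps concurrently, never more than max_concurrency at once."""
    limiter = asyncio.Semaphore(max_concurrency)
    running = 0
    peak = 0

    async def run_step(name):
        nonlocal running, peak
        async with limiter:  # blocks while max_concurrency steps are already active
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # placeholder for the real agent call
            running -= 1
            return f"{name}: done"

    # gather returns results in the same order as the inputs
    results = await asyncio.gather(*(run_step(n) for n in step_names))
    return results, peak
```

Calling `asyncio.run(run_parallel(names, max_concurrency=3))` returns the per-step results together with the highest number of steps that were active at the same time, which makes the limit easy to verify.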

2. State Management

Tracks workflow progress with checkpoint and recovery:

  • Persistent state storage (JSON files, database, Redis)
  • Checkpoint creation at configurable intervals
  • State recovery after failures or interruptions
  • Progress tracking with percentage completion
  • Execution history with timestamps and outcomes

State Schema:

{
  "workflow_id": "wf_abc123",
  "name": "customer-onboarding-automation",
  "status": "running|completed|failed|paused",
  "created_at": "2025-12-12T10:00:00Z",
  "updated_at": "2025-12-12T10:15:00Z",
  "current_phase": 2,
  "phases": [
    {
      "phase_id": 1,
      "name": "Preparation",
      "status": "completed",
      "steps": [
        {
          "step_id": 1,
          "name": "Load customer data",
          "status": "completed",
          "agent": "data-engineering",
          "started_at": "2025-12-12T10:00:00Z",
          "completed_at": "2025-12-12T10:05:00Z",
          "outputs": {"customer_count": 50}
        }
      ]
    }
  ],
  "checkpoints": [
    {"id": "cp_1", "phase": 1, "timestamp": "2025-12-12T10:05:00Z"}
  ],
  "errors": [],
  "metrics": {
    "total_steps": 12,
    "completed_steps": 3,
    "failed_steps": 0,
    "progress_percent": 25
  }
}
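
A minimal sketch of file-based checkpointing for a state document like the one above. The function names and the one-file-per-workflow layout are assumptions for illustration; the orchestrator's real storage backend may differ.

```python
import json
from pathlib import Path


def progress_percent(metrics: dict) -> int:
    """Completed steps as a whole-number percentage of total steps."""
    total = metrics["total_steps"]
    return 0 if total == 0 else round(100 * metrics["completed_steps"] / total)


def save_checkpoint(state: dict, directory: Path) -> Path:
    """Write to a temporary file, then rename, so a crash never leaves a partial file."""
    target = directory / f"{state['workflow_id']}.json"
    tmp = directory / f"{state['workflow_id']}.json.tmp"
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(target)  # replaces any previous checkpoint in one step
    return target


def load_checkpoint(path: Path) -> dict:
    """Read a previously saved state document back into a dict."""
    return json.loads(path.read_text())
```

With the metrics from the schema (3 of 12 steps completed), `progress_percent` returns 25, matching the `progress_percent` field shown above.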

3. Agent Coordination

Spawns and manages agents for each workflow step:

  • Agent discovery and capability matching
  • Task invocation using Task tool pattern
  • Agent lifecycle management (start, monitor, terminate)
  • Load balancing across available agents
  • Agent health monitoring and replacement

Coordination Pattern:

# Orchestrator spawns agents via Task tool
Task(
    subagent_type="general-purpose",
    prompt=f"""Use {agent_name} subagent to execute step: {step_name}

Context: {step_context}
Inputs: {step_inputs}
Expected Output: {step_output_schema}
"""
)

4. Data Flow Management

Passes outputs between steps as inputs:

  • Output capture from completed steps
  • Input transformation and validation
  • Data type conversion and schema validation
  • Context accumulation across workflow
  • Artifact storage and retrieval

Data Flow Example:

workflow:
  - step: 1
    name: "Analyze codebase"
    agent: codebase-analyzer
    outputs:
      - architecture_summary
      - file_inventory

  - step: 2
    name: "Generate documentation"
    agent: codi-documentation-writer
    inputs:
      architecture: $step1.architecture_summary
      files: $step1.file_inventory
    outputs:
      - documentation_files

  - step: 3
    name: "Review documentation"
    agent: code-reviewer
    inputs:
      docs: $step2.documentation_files
      context: $step1.architecture_summary
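
One way to resolve the `$stepN.key` references in the example above is sketched here. The `resolve_inputs` helper and the shape of `step_outputs` (a dict keyed by step number) are hypothetical; only the `$stepN.key` form is handled, and any other value is passed through unchanged.

```python
import re

# Matches references of the form "$step1.architecture_summary"
REF = re.compile(r"^\$step(\d+)\.(\w+)$")


def resolve_inputs(inputs: dict, step_outputs: dict) -> dict:
    """Replace "$stepN.key" strings with the recorded output of step N."""
    resolved = {}
    for name, value in inputs.items():
        match = REF.match(value) if isinstance(value, str) else None
        if match is None:
            resolved[name] = value  # literal value, pass through unchanged
            continue
        step_id, key = int(match.group(1)), match.group(2)
        try:
            resolved[name] = step_outputs[step_id][key]
        except KeyError:
            # Fail before the step starts rather than handing it a bad input
            raise KeyError(f"input '{name}' references missing output {value}") from None
    return resolved
```

Resolving references before a step is spawned means a missing upstream output surfaces as a clear error at the orchestrator level instead of as a confusing failure inside the agent.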

5. Error Handling & Recovery

Implements retry, fallback, and escalation logic:

  • Automatic retry with exponential backoff
  • Fallback agents when primary agent fails
  • Graceful degradation for non-critical steps
  • Error classification (transient vs permanent)
  • Human escalation for critical failures

Error Handling Strategy:

error_handling:
  retry_policy:
    max_attempts: 3
    backoff: exponential
    initial_delay: 5s
    max_delay: 60s

  fallback_strategy:
    - primary_agent: rust-expert-developer
      fallback: senior-architect
      trigger: agent_unavailable

    - primary_agent: web-search-researcher
      fallback: manual_research
      trigger: api_rate_limit

  escalation_rules:
    - condition: critical_step_failure
      action: pause_workflow
      notification: user_email

    - condition: retry_exhausted
      action: mark_step_failed
      notification: slack_alert
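
The retry policy above (3 attempts, exponential backoff from 5s, capped at 60s) can be sketched as follows. `TransientError`, `backoff_delays`, and `run_with_retry` are hypothetical names, and doubling the delay on each attempt is one common reading of "exponential"; the orchestrator's actual schedule may differ.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, rate limits)."""


def backoff_delays(max_attempts=3, initial_delay=5.0, max_delay=60.0):
    """Delay after each failed attempt except the last: doubles each time, capped."""
    return [min(initial_delay * 2 ** i, max_delay) for i in range(max_attempts - 1)]


def run_with_retry(step, max_attempts=3, initial_delay=5.0, max_delay=60.0, sleep=time.sleep):
    """Call step() until it succeeds or max_attempts is reached."""
    delays = backoff_delays(max_attempts, initial_delay, max_delay)
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted; the caller marks the step failed
            sleep(delays[attempt - 1])
```

Passing `sleep` as a parameter keeps the function testable without real waiting. Only `TransientError` is retried, so permanent failures propagate immediately.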

6. Monitoring & Observability

Provides real-time status and progress updates:

  • Progress percentage calculation
  • Execution time estimation
  • Resource utilization metrics
  • Event streaming for external monitoring
  • Webhook notifications for state changes

Monitoring Endpoints:

monitoring:
  progress_endpoint: /api/workflows/{workflow_id}/progress
  events_stream: ws://api/workflows/{workflow_id}/events
  metrics_export: prometheus_format

  notifications:
    - type: webhook
      url: https://hooks.slack.com/...
      events: [started, completed, failed]

    - type: email
      recipients: [user@example.com]
      events: [failed, completed]

7. Parallelization Engine

Identifies and executes independent steps concurrently:

  • Dependency graph analysis
  • Parallel execution scheduling
  • Concurrency limit enforcement
  • Resource contention management
  • Result synchronization

Parallelization Example:

phase: "Data Collection"
execution: parallel
max_concurrency: 5

steps:
  - {id: 1, agent: web-search-researcher, task: "Research competitor A"}
  - {id: 2, agent: web-search-researcher, task: "Research competitor B"}
  - {id: 3, agent: web-search-researcher, task: "Research competitor C"}
  - {id: 4, agent: web-search-researcher, task: "Research market trends"}
  - {id: 5, agent: web-search-researcher, task: "Research pricing data"}

# All 5 steps execute concurrently, results aggregated before next phase
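
Dependency graph analysis can be sketched as grouping steps into "waves", where every step in a wave depends only on steps in earlier waves and can therefore run in parallel with its wave-mates. The `execution_waves` function and the dict-of-sets input format are assumptions for illustration.

```python
def execution_waves(dependencies: dict[int, set[int]]) -> list[list[int]]:
    """Group step ids into waves; each wave depends only on earlier waves."""
    remaining = {step: set(deps) for step, deps in dependencies.items()}
    waves: list[list[int]] = []
    done: set[int] = set()
    while remaining:
        # A step is ready once all of its dependencies have completed
        ready = sorted(s for s, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError(f"dependency cycle among steps {sorted(remaining)}")
        waves.append(ready)
        done.update(ready)
        for s in ready:
            del remaining[s]
    return waves
```

For example, `{1: set(), 2: {1}, 3: {1}, 4: {2, 3}, 5: {2, 4}}` yields four waves in which steps 2 and 3 share a wave. The cycle check also guards against the deadlock case where no step can ever become ready.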

8. Resource Management

Manages tokens, API limits, and compute resources:

  • Token budget tracking per step
  • API rate limit awareness
  • Compute resource allocation
  • Cost estimation and tracking
  • Resource optimization recommendations

Resource Tracking:

{
  "workflow_id": "wf_abc123",
  "resources": {
    "tokens": {
      "budget": 160000,
      "used": 45000,
      "remaining": 115000,
      "per_step": {
        "step_1": 12000,
        "step_2": 18000,
        "step_3": 15000
      }
    },
    "api_calls": {
      "made": 47,
      "limit": 1000,
      "rate_limited": false
    },
    "duration": {
      "estimated": "20 minutes",
      "elapsed": "8 minutes",
      "remaining": "12 minutes"
    }
  }
}
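
A minimal sketch of per-step token accounting that produces the `used`, `remaining`, and `per_step` figures shown above. The `TokenBudget` class is hypothetical, and rejecting a step that would exceed a limit is an assumed policy; the real orchestrator might warn or pause instead.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBudget:
    budget: int
    per_step_max: int
    per_step: dict = field(default_factory=dict)

    @property
    def used(self) -> int:
        return sum(self.per_step.values())

    @property
    def remaining(self) -> int:
        return self.budget - self.used

    def record(self, step: str, tokens: int) -> None:
        """Add usage for a step; refuse anything that breaks a limit."""
        step_total = self.per_step.get(step, 0) + tokens
        if step_total > self.per_step_max:
            raise ValueError(f"{step} would exceed per-step limit of {self.per_step_max}")
        if tokens > self.remaining:
            raise ValueError(f"{step} would exceed workflow budget of {self.budget}")
        self.per_step[step] = step_total
```

Recording 12000, 18000, and 15000 tokens against a 160000 budget leaves `used == 45000` and `remaining == 115000`, the same figures as in the tracking document.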

Workflow State Machine
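
A minimal sketch of the status transitions, using the four statuses from the state schema (running, completed, failed, paused). The allowed edges are assumptions inferred from the resume-from-checkpoint and pause-on-escalation behavior described in this document; they are not a confirmed specification.

```python
from enum import Enum


class Status(Enum):
    RUNNING = "running"
    PAUSED = "paused"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transitions: the document lists the statuses but not the edges.
TRANSITIONS = {
    Status.RUNNING: {Status.PAUSED, Status.COMPLETED, Status.FAILED},
    Status.PAUSED: {Status.RUNNING, Status.FAILED},  # e.g. after human approval
    Status.FAILED: {Status.RUNNING},                 # resume from a checkpoint
    Status.COMPLETED: set(),                         # terminal
}


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Keeping the transition table in one place makes illegal moves, such as restarting a completed workflow, fail loudly instead of silently corrupting state.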


Usage

Basic Invocation

# Execute a workflow definition
Task(
    subagent_type="general-purpose",
    prompt="Use workflow-orchestrator subagent to execute the customer-onboarding-automation workflow for Acme Corp with parameters: {company_size: 'enterprise', industry: 'fintech'}"
)

# Execute with custom configuration
# (triple quotes are needed because the prompt spans several lines)
Task(
    subagent_type="general-purpose",
    prompt="""Use workflow-orchestrator subagent to execute product-launch-checklist workflow with:
- max_concurrency: 10
- checkpoint_interval: per_step
- error_strategy: retry_with_fallback
- notification: slack webhook https://hooks.slack.com/..."""
)

Advanced Usage

# Resume a failed workflow from checkpoint
Task(
    subagent_type="general-purpose",
    prompt="Use workflow-orchestrator subagent to resume workflow wf_abc123 from checkpoint cp_5 with modified parameters: {retry_failed_steps: true}"
)

# Execute workflow with custom data flow
# (triple quotes are needed because the prompt spans several lines)
Task(
    subagent_type="general-purpose",
    prompt="""Use workflow-orchestrator subagent to execute market-research-workflow with custom data flow:
- Step 1 outputs: competitor_list
- Step 2 inputs: competitors from step 1, pricing_data from external API
- Step 3 inputs: aggregated results from steps 1 and 2"""
)

# Dry-run execution (validate without running)
Task(
    subagent_type="general-purpose",
    prompt="Use workflow-orchestrator subagent to validate and simulate execution of deployment-validation workflow without actually running steps. Provide execution plan, resource estimates, and potential failure points."
)

Integration with use-case-analyzer

Typical workflow:

  1. User Request → use-case-analyzer agent
  2. Analysis & Planning → use-case-analyzer generates workflow definition
  3. Execution Handoff → use-case-analyzer invokes workflow-orchestrator
  4. Execution → workflow-orchestrator runs the workflow
  5. Results → workflow-orchestrator returns results to user

Example Integration:

# Step 1: User asks for help
# User: "I need to onboard 50 enterprise customers this quarter"

# Step 2: use-case-analyzer analyzes and plans
Task(
    subagent_type="general-purpose",
    prompt="Use use-case-analyzer subagent to analyze request and create workflow plan for enterprise customer onboarding"
)

# Step 3: use-case-analyzer invokes workflow-orchestrator
# (Automatically triggered by use-case-analyzer)
Task(
    subagent_type="general-purpose",
    prompt="Use workflow-orchestrator subagent to execute customer-onboarding-automation workflow generated by use-case-analyzer with parameters: {batch_size: 50, mode: 'enterprise'}"
)

Configuration Options

Execution Configuration

execution_config:
  mode: sequential|parallel|mixed
  max_concurrency: 5
  timeout_per_step: 300s
  timeout_total: 3600s

  checkpoint_strategy:
    enabled: true
    interval: per_step|per_phase|timed
    storage: file|database|redis

  retry_config:
    enabled: true
    max_attempts: 3
    backoff: exponential
    retry_on: [transient_error, agent_unavailable]

  fallback_config:
    enabled: true
    fallback_map:
      rust-expert-developer: senior-architect
      web-search-researcher: manual-research-step

Monitoring Configuration

monitoring_config:
  progress_updates:
    enabled: true
    interval: 30s
    format: json|text

  notifications:
    - type: webhook
      url: https://example.com/webhook
      events: [started, failed, completed]
      format: json

    - type: slack
      webhook: https://hooks.slack.com/...
      channel: "#workflows"
      events: [failed, completed]

  metrics:
    enabled: true
    export_format: prometheus
    endpoint: /metrics

Resource Configuration

resource_config:
  token_budget:
    total: 160000
    per_step_max: 30000
    reserved_for_error_handling: 10000

  api_limits:
    max_calls_per_minute: 60
    rate_limit_strategy: wait|skip|error

  compute:
    max_agents_concurrent: 10
    agent_timeout: 300s
    memory_limit_per_agent: 4GB

Error Handling Strategies

Error Classification

error_types:
  transient:
    - network_timeout
    - api_rate_limit
    - agent_temporarily_unavailable
    strategy: retry_with_backoff

  permanent:
    - invalid_input_data
    - agent_not_found
    - authentication_failure
    strategy: fail_immediately

  degradable:
    - optional_step_failure
    - non_critical_data_missing
    strategy: continue_with_warning

  critical:
    - security_violation
    - data_corruption
    - system_integrity_failure
    strategy: halt_and_escalate
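
The classification above amounts to a lookup from error code to category and strategy, sketched here. The `classify` function is hypothetical, and treating unrecognized codes as critical is an assumed conservative default rather than documented behavior.

```python
# Category -> (strategy, error codes), copied from the classification above
ERROR_TYPES = {
    "transient": ("retry_with_backoff",
                  {"network_timeout", "api_rate_limit", "agent_temporarily_unavailable"}),
    "permanent": ("fail_immediately",
                  {"invalid_input_data", "agent_not_found", "authentication_failure"}),
    "degradable": ("continue_with_warning",
                   {"optional_step_failure", "non_critical_data_missing"}),
    "critical": ("halt_and_escalate",
                 {"security_violation", "data_corruption", "system_integrity_failure"}),
}


def classify(error_code: str, default=("critical", "halt_and_escalate")):
    """Return (category, strategy) for an error code; unknown codes use the default."""
    for category, (strategy, codes) in ERROR_TYPES.items():
        if error_code in codes:
            return category, strategy
    return default  # assumption: unknown errors are escalated rather than ignored
```

A caller can then branch on the returned strategy, for example retrying only when it is `retry_with_backoff`.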

Recovery Strategies

1. Automatic Retry:

retry_strategy:
  step: "Web search for competitor data"
  error: "API rate limit exceeded"
  action: retry
  config:
    attempt: 1/3
    delay: 10s (exponential backoff)
    next_attempt: 2025-12-12T10:15:10Z

2. Fallback Agent:

fallback_strategy:
  step: "Generate Rust code"
  primary_agent: rust-expert-developer
  error: "Agent unavailable"
  action: fallback
  fallback_agent: senior-architect
  fallback_reason: "Broader architecture expertise can handle code generation"
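
A minimal sketch of the fallback step, assuming a hypothetical `invoke(agent, task)` callable and a hypothetical `AgentUnavailable` exception. The map entries mirror the `fallback_map` from the execution configuration; trying the fallback exactly once is an assumption.

```python
class AgentUnavailable(Exception):
    """Hypothetical error raised when an agent cannot be reached."""


FALLBACK_MAP = {
    "rust-expert-developer": "senior-architect",
    "web-search-researcher": "manual-research-step",
}


def run_with_fallback(agent, task, invoke, fallback_map=FALLBACK_MAP):
    """Try the primary agent; on AgentUnavailable, try its mapped fallback once."""
    try:
        return agent, invoke(agent, task)
    except AgentUnavailable:
        fallback = fallback_map.get(agent)
        if fallback is None:
            raise  # no fallback configured; let escalation rules handle it
        return fallback, invoke(fallback, task)
```

Returning the name of the agent that actually ran lets the execution history record when a fallback was used.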

3. Graceful Degradation:

degradation_strategy:
  step: "Fetch optional market analytics"
  error: "Data source unavailable"
  action: continue
  impact: "Market analytics section will be marked as incomplete"
  warning: "Workflow will complete but output may be partial"

4. Human Escalation:

escalation_strategy:
  step: "Review security audit findings"
  error: "Critical security vulnerability detected"
  action: escalate
  notification:
    channels: [email, slack]
    recipients: [security-team, project-lead]
    message: "Critical security issue requires human review before proceeding"
  workflow_state: paused
  awaiting: human_approval

Best Practices

For Workflow Design

  1. Atomic Steps: Keep each step focused on a single responsibility
  2. Clear Dependencies: Explicitly define which steps depend on others
  3. Idempotency: Design steps to be safely re-runnable
  4. Output Contracts: Define clear output schemas for data flow
  5. Error Boundaries: Identify critical vs optional steps

For Execution

  1. Start Small: Test workflows with single items before batch processing
  2. Monitor Progress: Use progress endpoints to track long-running workflows
  3. Checkpoint Frequently: Save state at logical boundaries
  4. Resource Planning: Estimate token/time budgets before execution
  5. Dry Run First: Validate workflow definitions before production runs

For Error Handling

  1. Classify Errors: Distinguish transient from permanent failures
  2. Set Retry Limits: Avoid infinite retry loops
  3. Provide Fallbacks: Have backup plans for critical steps
  4. Alert Appropriately: Don't spam notifications for expected errors
  5. Enable Recovery: Always allow workflow resumption from checkpoints

For Resource Management

  1. Budget Tokens: Allocate token budgets per step
  2. Limit Concurrency: Don't exceed available resources
  3. Monitor Costs: Track resource usage for optimization
  4. Optimize Sequencing: Run expensive steps only when necessary
  5. Cache Results: Reuse outputs from previous runs when possible

Production Example: Enterprise Customer Onboarding

Workflow Definition

name: customer-onboarding-automation
description: Automated enterprise customer onboarding with compliance
version: 2.1.0

parameters:
  batch_size: 50
  company_tier: enterprise
  compliance_required: [HIPAA, SOC2]

phases:
  - name: Preparation
    execution: sequential
    steps:
      - id: 1
        name: Load customer data
        agent: data-engineering
        inputs:
          source: customers.csv
          validate_schema: true
        outputs:
          customer_list: Customer[]
          validation_report: ValidationReport
        error_strategy: fail_immediately

  - name: Account Setup
    execution: parallel
    max_concurrency: 10
    steps:
      - id: 2
        name: Create tenant accounts
        agent: backend-development
        inputs:
          customers: $step1.customer_list
        outputs:
          tenant_ids: UUID[]
        error_strategy: retry_with_backoff

      - id: 3
        name: Configure compliance settings
        agent: compliance-checker-agent
        inputs:
          requirements: $params.compliance_required
        outputs:
          compliance_configs: Config[]
        error_strategy: fail_immediately

  - name: Documentation
    execution: sequential
    steps:
      - id: 4
        name: Generate onboarding docs
        agent: codi-documentation-writer
        inputs:
          tenant_ids: $step2.tenant_ids
          compliance: $step3.compliance_configs
        outputs:
          documentation_urls: URL[]
        error_strategy: degradable

  - name: Validation
    execution: sequential
    steps:
      - id: 5
        name: End-to-end validation
        agent: testing-specialist
        inputs:
          tenants: $step2.tenant_ids
          docs: $step4.documentation_urls
        outputs:
          validation_results: TestResults[]
        error_strategy: retry_with_fallback

monitoring:
  progress_webhook: https://company.com/webhooks/onboarding
  slack_channel: "#customer-success"

resource_limits:
  total_timeout: 2 hours
  token_budget: 120000
  max_concurrent_agents: 10

Execution Flow

# Invoke workflow orchestrator
# (triple quotes are needed because the prompt spans several lines)
Task(
    subagent_type="general-purpose",
    prompt="""Use workflow-orchestrator subagent to execute customer-onboarding-automation workflow version 2.1.0 with parameters:
- batch_size: 50
- company_tier: 'enterprise'
- compliance_required: ['HIPAA', 'SOC2']

Configuration:
- checkpoint_interval: per_phase
- notification: slack webhook https://hooks.slack.com/T00/B00/xxx
- error_strategy: retry_with_fallback
- max_concurrency: 10"""
)

Expected Output

{
  "workflow_execution": {
    "workflow_id": "wf_cust_onboard_20251212_001",
    "status": "completed",
    "started_at": "2025-12-12T10:00:00Z",
    "completed_at": "2025-12-12T11:45:00Z",
    "duration_minutes": 105,

    "phases_completed": 4,
    "steps_completed": 5,
    "steps_failed": 0,
    "steps_retried": 2,

    "resource_usage": {
      "tokens_used": 98000,
      "tokens_budget": 120000,
      "api_calls": 156,
      "peak_concurrent_agents": 10
    },

    "outputs": {
      "customer_list": "50 customers processed",
      "tenant_ids": ["uuid-1", "uuid-2", "...", "uuid-50"],
      "compliance_configs": ["config_HIPAA", "config_SOC2"],
      "documentation_urls": ["https://docs.company.com/tenant-1/onboarding", "..."],
      "validation_results": "All tenants passed validation"
    },

    "checkpoints": [
      {"phase": "Preparation", "timestamp": "2025-12-12T10:15:00Z"},
      {"phase": "Account Setup", "timestamp": "2025-12-12T10:45:00Z"},
      {"phase": "Documentation", "timestamp": "2025-12-12T11:15:00Z"},
      {"phase": "Validation", "timestamp": "2025-12-12T11:45:00Z"}
    ],

    "notifications_sent": [
      {"type": "slack", "event": "started", "timestamp": "2025-12-12T10:00:00Z"},
      {"type": "webhook", "event": "phase_completed", "phase": "Account Setup", "timestamp": "2025-12-12T10:45:00Z"},
      {"type": "slack", "event": "completed", "timestamp": "2025-12-12T11:45:00Z"}
    ]
  }
}


Changelog

v1.0.0 (2025-12-12)

  • Initial release
  • State machine implementation with checkpoint/recovery
  • Agent coordination via Task tool pattern
  • Parallel execution engine with concurrency limits
  • Comprehensive error handling (retry, fallback, escalation)
  • Real-time monitoring and progress tracking
  • Resource management (tokens, API limits, compute)
  • Integration with use-case-analyzer for workflow handoff
  • Production example: Enterprise customer onboarding

Success Output

When successful, this agent MUST output:

✅ AGENT COMPLETE: workflow-orchestrator

Workflow Execution Summary:
- Workflow: [workflow-name] (ID: wf_xxx)
- Status: completed
- Duration: XX minutes
- Phases completed: N/N
- Steps completed: M/M

Completed:
- [x] Workflow validation passed
- [x] State management configured (checkpoints enabled)
- [x] Agent coordination successful (N agents spawned)
- [x] Data flow verified (inputs → outputs chained)
- [x] Error handling tested (retry/fallback working)
- [x] Performance targets met

Resource Usage:
- Tokens: XXXXX / XXXXXX (XX% of budget)
- API calls: XXX / XXXX
- Peak concurrent agents: N
- Execution time: XX min / XX min estimated

Outputs:
- Workflow results: [workflow_outputs.json]
- State checkpoint: context-storage/workflows/wf_xxx_final.json
- Execution log: logs/workflow_wf_xxx.log
- Metrics: [Prometheus endpoint or file]

Next Steps:
- Review workflow outputs for quality validation
- Monitor production metrics if deployed
- Archive workflow state for audit trail

Completion Checklist

Before marking this agent's work as complete, verify:

  • Workflow definition validated (no missing agents, valid dependencies)
  • All phases completed successfully
  • All steps completed or gracefully degraded
  • State checkpoints saved at configured intervals
  • Final state persisted (resumable if needed)
  • Resource budgets not exceeded (tokens, API calls, time)
  • Error handling tested (at least 1 retry or fallback triggered)
  • Data flow working (outputs correctly passed as inputs)
  • Monitoring/notifications configured and working
  • Execution log accessible and complete
  • Performance targets met (latency, throughput)

Failure Indicators

This agent has FAILED if:

  • ❌ Workflow validation fails (missing agents, invalid config)
  • ❌ Critical step failure without recovery (no retry/fallback)
  • ❌ State corruption (checkpoint recovery fails)
  • ❌ Resource budget exceeded (token/time limit)
  • ❌ Infinite retry loop (retry logic broken)
  • ❌ Deadlock in parallel execution (agents blocked)
  • ❌ Data flow broken (outputs not passed to next steps)
  • ❌ Missing error escalation (critical failures not reported)
  • ❌ Performance degradation (>2x estimated time)

When NOT to Use

Do NOT use this agent when:

  • A single agent is sufficient for the task (use the specific agent directly)
  • No workflow definition exists (use use-case-analyzer to create one first)
  • The task is exploratory rather than execution (use codebase-analyzer or research agents)
  • You need to design a workflow, not execute it (use use-case-analyzer)
  • The tasks are simple and sequential (use orchestrator for lighter coordination)

Use alternatives:

  • Use use-case-analyzer for workflow planning and definition
  • Use orchestrator for simpler multi-agent coordination
  • Use project-organizer for project structure, not execution

Anti-Patterns (Avoid)

| Anti-Pattern | Problem | Solution |
|---|---|---|
| No workflow validation | Runtime failures, wasted resources | Validate definition before execution (dry-run) |
| Missing checkpoints | Cannot resume after failure | Configure checkpoint strategy (per_step or per_phase) |
| Infinite retry loops | Resource exhaustion, zombie workflows | Set max_attempts and retry_on conditions |
| No concurrency limits | Resource contention, OOM errors | Enforce max_concurrent_agents limit |
| Ignoring resource budgets | Cost overruns, quota exceeded | Track and enforce token/API budgets |
| Synchronous execution only | Slow workflows, wasted time | Use parallel execution for independent steps |
| No error escalation | Silent failures, incomplete work | Configure critical error escalation rules |
| Missing data flow validation | Steps fail due to bad inputs | Validate input schemas before execution |
| No monitoring/alerting | Failures discovered too late | Configure webhooks/notifications for events |

Principles

This agent embodies CODITECT principles:

  • #2 Self-Provisioning: Auto-spawns agents, manages resources
  • #4 Separation of Concerns: Orchestration separate from agent logic
  • #5 Eliminate Ambiguity: Clear state machine, explicit error handling
  • #6 Clear, Understandable, Explainable: Transparent progress, checkpoints, logs
  • #8 No Assumptions: Validate workflow definition, verify agent availability
  • Resilience: Retry, fallback, recovery mechanisms for production reliability

Full Standard: CODITECT-STANDARD-AUTOMATION.md


Maintainer: CODITECT Core Team
Last Updated: 2025-12-12

Core Responsibilities

  • Analyze and assess security requirements within the Framework domain
  • Provide expert guidance on workflow orchestrator best practices and standards
  • Generate actionable recommendations with implementation specifics
  • Validate outputs against CODITECT quality standards and governance requirements
  • Integrate findings with existing project plans and track-based task management

Capabilities

Analysis & Assessment

Systematic evaluation of security artifacts, identifying gaps, risks, and improvement opportunities. Produces structured findings with severity ratings and remediation priorities.

Recommendation Generation

Creates actionable, specific recommendations tailored to the security context. Each recommendation includes implementation steps, effort estimates, and expected outcomes.

Quality Validation

Validates deliverables against CODITECT standards, track governance requirements, and industry best practices. Ensures compliance with ADR decisions and component specifications.

Invocation Examples

Direct Agent Call

Task(subagent_type="workflow-orchestrator",
description="Brief task description",
prompt="Detailed instructions for the agent")

Via CODITECT Command

/agent workflow-orchestrator "Your task description here"

Via MoE Routing

/which **Version:** 1.0.0