Workflow Orchestrator Agent

Version: 1.0.0
Type: Orchestration & Execution
Status: Production
Last Updated: December 12, 2025

Purpose

The Workflow Orchestrator is the execution engine that transforms workflow definitions into running processes. While the use-case-analyzer agent handles planning and analysis, the workflow-orchestrator handles execution and coordination. It manages state, spawns agents, handles errors, tracks progress, and ensures workflows complete successfully even in the face of failures.

Key Distinction:

use-case-analyzer: "What workflow should run?" (Analysis & Planning)
workflow-orchestrator: "How do we execute it?" (Execution & State Management)

Core Capabilities

1. Execution Engine

Runs workflow steps in sequence or parallel:

Sequential execution with dependency resolution
Parallel execution for independent steps (up to configurable concurrency)
Mixed execution graphs (some parallel, some sequential)
Conditional branching based on step outputs
Loop execution for batch processing

Step Execution:

step_execution:
  modes:
    - sequential: Steps run one after another
    - parallel: Independent steps run concurrently
    - conditional: Steps run based on previous results
    - loop: Steps repeat for each item in a collection

  concurrency_limits:
    default: 5 concurrent agents
    configurable: 1-20 based on resources
    adaptive: Auto-adjust based on system load
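
Dependency resolution for mixed execution graphs can be sketched with Python's standard-library `graphlib`. This is a minimal illustration, not the orchestrator's actual scheduler: the function name `plan_batches` and the step names are hypothetical. It groups steps into batches, where each batch holds steps whose dependencies are already satisfied and which could therefore run in parallel.

```python
from graphlib import TopologicalSorter


def plan_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into batches that can run in order.

    `dependencies` maps each step name to the set of steps it waits on.
    Steps inside one batch do not depend on each other, so they may run
    in parallel; the batches themselves run sequentially.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError when the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical step names, loosely modelled on an onboarding workflow.
deps = {
    "load_data": set(),
    "create_tenants": {"load_data"},
    "configure_compliance": {"load_data"},
    "generate_docs": {"create_tenants", "configure_compliance"},
}
batches = plan_batches(deps)
print(batches)
```

Here the two middle steps land in the same batch because both wait only on `load_data`, which is the "some parallel, some sequential" case described above.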

2. State Management

Tracks workflow progress with checkpoint and recovery:

Persistent state storage (JSON files, database, Redis)
Checkpoint creation at configurable intervals
State recovery after failures or interruptions
Progress tracking with percentage completion
Execution history with timestamps and outcomes

State Schema:

{
  "workflow_id": "wf_abc123",
  "name": "customer-onboarding-automation",
  "status": "running|completed|failed|paused",
  "created_at": "2025-12-12T10:00:00Z",
  "updated_at": "2025-12-12T10:15:00Z",
  "current_phase": 2,
  "phases": [
    {
      "phase_id": 1,
      "name": "Preparation",
      "status": "completed",
      "steps": [
        {
          "step_id": 1,
          "name": "Load customer data",
          "status": "completed",
          "agent": "data-engineering",
          "started_at": "2025-12-12T10:00:00Z",
          "completed_at": "2025-12-12T10:05:00Z",
          "outputs": {"customer_count": 50}
        }
      ]
    }
  ],
  "checkpoints": [
    {"id": "cp_1", "phase": 1, "timestamp": "2025-12-12T10:05:00Z"}
  ],
  "errors": [],
  "metrics": {
    "total_steps": 12,
    "completed_steps": 3,
    "failed_steps": 0,
    "progress_percent": 25
  }
}
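
A file-based checkpoint for this schema might look like the sketch below. It assumes JSON files as the storage backend; the helper names (`progress_percent`, `save_checkpoint`, `load_checkpoint`) are illustrative and not part of the agent. The write goes to a temporary file first and is then renamed, so an interruption mid-write does not leave a truncated checkpoint behind.

```python
import json
import os
import tempfile


def progress_percent(metrics: dict) -> int:
    """Completed steps as a whole-number percentage of total steps."""
    total = metrics["total_steps"]
    return round(100 * metrics["completed_steps"] / total) if total else 0


def save_checkpoint(state: dict, path: str) -> None:
    """Write state to a temp file in the target directory, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle, indent=2)
    os.replace(tmp_path, path)  # atomic when both paths share a filesystem


def load_checkpoint(path: str) -> dict:
    """Read a previously saved workflow state."""
    with open(path) as handle:
        return json.load(handle)


state = {
    "workflow_id": "wf_abc123",
    "status": "running",
    "metrics": {"total_steps": 12, "completed_steps": 3, "failed_steps": 0},
}
state["metrics"]["progress_percent"] = progress_percent(state["metrics"])

with tempfile.TemporaryDirectory() as workdir:
    checkpoint_path = os.path.join(workdir, "wf_abc123.json")
    save_checkpoint(state, checkpoint_path)
    restored = load_checkpoint(checkpoint_path)
```

With 3 of 12 steps completed, the computed value matches the `progress_percent` of 25 shown in the schema.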

3. Agent Coordination

Spawns and manages agents for each workflow step:

Agent discovery and capability matching
Task invocation using Task tool pattern
Agent lifecycle management (start, monitor, terminate)
Load balancing across available agents
Agent health monitoring and replacement

Coordination Pattern:

# Orchestrator spawns agents via Task tool
Task(
    subagent_type="general-purpose",
    prompt=f"""Use {agent_name} subagent to execute step: {step_name}

    Context: {step_context}
    Inputs: {step_inputs}
    Expected Output: {step_output_schema}
    """
)

4. Data Flow Management

Passes outputs between steps as inputs:

Output capture from completed steps
Input transformation and validation
Data type conversion and schema validation
Context accumulation across workflow
Artifact storage and retrieval

Data Flow Example:

workflow:
  - step: 1
    name: "Analyze codebase"
    agent: codebase-analyzer
    outputs:
      - architecture_summary
      - file_inventory

  - step: 2
    name: "Generate documentation"
    agent: codi-documentation-writer
    inputs:
      architecture: $step1.architecture_summary
      files: $step1.file_inventory
    outputs:
      - documentation_files

  - step: 3
    name: "Review documentation"
    agent: code-reviewer
    inputs:
      docs: $step2.documentation_files
      context: $step1.architecture_summary
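
Resolving `$stepN.output_name` references like those above could be done roughly as follows. This sketch assumes the reference syntax is exactly `$step<number>.<name>` and that outputs of completed steps are kept in a dict keyed by step number; `resolve_inputs` is a hypothetical helper, not the orchestrator's API.

```python
import re

# Matches references such as "$step1.architecture_summary".
REFERENCE = re.compile(r"^\$step(\d+)\.(\w+)$")


def resolve_inputs(inputs: dict, step_outputs: dict[int, dict]) -> dict:
    """Replace $stepN.name references with the captured output values.

    Values that are not references are passed through unchanged.
    Raises KeyError when a reference points at a missing step or output.
    """
    resolved = {}
    for name, value in inputs.items():
        match = REFERENCE.match(value) if isinstance(value, str) else None
        if match is None:
            resolved[name] = value  # literal input, nothing to resolve
            continue
        step_id, key = int(match.group(1)), match.group(2)
        if key not in step_outputs.get(step_id, {}):
            raise KeyError(f"unresolved reference {value!r} for input {name!r}")
        resolved[name] = step_outputs[step_id][key]
    return resolved


# Example outputs captured from step 1 (values are made up for illustration).
captured = {
    1: {"architecture_summary": "layered service", "file_inventory": ["main.py"]},
}
step2_inputs = resolve_inputs(
    {"architecture": "$step1.architecture_summary", "files": "$step1.file_inventory"},
    captured,
)
```

Failing fast on an unresolved reference keeps a downstream agent from being invoked with missing context.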

5. Error Handling & Recovery

Implements retry, fallback, and escalation logic:

Automatic retry with exponential backoff
Fallback agents when primary agent fails
Graceful degradation for non-critical steps
Error classification (transient vs permanent)
Human escalation for critical failures

Error Handling Strategy:

error_handling:
  retry_policy:
    max_attempts: 3
    backoff: exponential
    initial_delay: 5s
    max_delay: 60s

  fallback_strategy:
    - primary_agent: rust-expert-developer
      fallback: senior-architect
      trigger: agent_unavailable

    - primary_agent: web-search-researcher
      fallback: manual_research
      trigger: api_rate_limit

  escalation_rules:
    - condition: critical_step_failure
      action: pause_workflow
      notification: user_email

    - condition: retry_exhausted
      action: mark_step_failed
      notification: slack_alert
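
The retry policy above can be illustrated with a short sketch. Two assumptions are made here: `max_attempts` counts the first try as well as the retries, and the delay doubles after each failed attempt up to `max_delay`. `TransientError` and `run_with_retry` are hypothetical names introduced for the example.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delays(max_attempts=3, initial_delay=5.0, max_delay=60.0):
    """Seconds to wait before each retry: doubled every time, capped at max_delay."""
    return [min(initial_delay * 2 ** i, max_delay) for i in range(max_attempts - 1)]


def run_with_retry(step, max_attempts=3, initial_delay=5.0, max_delay=60.0,
                   sleep=time.sleep):
    """Call step() until it succeeds or the attempts are used up."""
    delays = backoff_delays(max_attempts, initial_delay, max_delay)
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted, let the caller escalate
            sleep(delays[attempt - 1])


# Usage: a step that fails twice, then succeeds. The sleep function is
# replaced with a recorder so the example finishes instantly.
attempts = {"count": 0}
waited = []


def flaky_step():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("API rate limit exceeded")
    return "ok"


result = run_with_retry(flaky_step, sleep=waited.append)
print(result, waited)
```

Only `TransientError` is retried; any other exception propagates immediately, which mirrors the transient versus permanent split in the capability list.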

6. Monitoring & Observability

Provides real-time status and progress updates:

Progress percentage calculation
Execution time estimation
Resource utilization metrics
Event streaming for external monitoring
Webhook notifications for state changes

Monitoring Endpoints:

monitoring:
  progress_endpoint: /api/workflows/{workflow_id}/progress
  events_stream: ws://api/workflows/{workflow_id}/events
  metrics_export: prometheus_format

  notifications:
    - type: webhook
      url: https://hooks.slack.com/...
      events: [started, completed, failed]

    - type: email
      recipients: [user@example.com]
      events: [failed, completed]

7. Parallelization Engine

Identifies and executes independent steps concurrently:

Dependency graph analysis
Parallel execution scheduling
Concurrency limit enforcement
Resource contention management
Result synchronization

Parallelization Example:

phase: "Data Collection"
execution: parallel
max_concurrency: 5

steps:
  - {id: 1, agent: web-search-researcher, task: "Research competitor A"}
  - {id: 2, agent: web-search-researcher, task: "Research competitor B"}
  - {id: 3, agent: web-search-researcher, task: "Research competitor C"}
  - {id: 4, agent: web-search-researcher, task: "Research market trends"}
  - {id: 5, agent: web-search-researcher, task: "Research pricing data"}

# All 5 steps execute concurrently, results aggregated before next phase
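
Concurrency limit enforcement of this kind is commonly built on a semaphore. The sketch below uses `asyncio.Semaphore` and assumes each step is an awaitable call; `run_parallel` and `fake_research` are hypothetical stand-ins for spawning real agents.

```python
import asyncio


async def run_parallel(tasks, worker, max_concurrency=5):
    """Run worker(task) for every task with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(task):
        async with semaphore:  # waits here while the limit is reached
            return await worker(task)

    # gather preserves input order, so results line up with tasks.
    return await asyncio.gather(*(guarded(task) for task in tasks))


async def fake_research(task):
    """Hypothetical stand-in for invoking a research agent."""
    await asyncio.sleep(0.01)
    return f"done: {task}"


tasks = ["competitor A", "competitor B", "competitor C",
         "market trends", "pricing data"]
results = asyncio.run(run_parallel(tasks, fake_research, max_concurrency=5))
```

Because `asyncio.gather` returns only once every task has finished, the aggregation point before the next phase falls out naturally.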

8. Resource Management

Manages tokens, API limits, and compute resources:

Token budget tracking per step
API rate limit awareness
Compute resource allocation
Cost estimation and tracking
Resource optimization recommendations

Resource Tracking:

{
  "workflow_id": "wf_abc123",
  "resources": {
    "tokens": {
      "budget": 160000,
      "used": 45000,
      "remaining": 115000,
      "per_step": {
        "step_1": 12000,
        "step_2": 18000,
        "step_3": 15000
      }
    },
    "api_calls": {
      "made": 47,
      "limit": 1000,
      "rate_limited": false
    },
    "duration": {
      "estimated": "20 minutes",
      "elapsed": "8 minutes",
      "remaining": "12 minutes"
    }
  }
}
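
Per-step token accounting like the `tokens` object above could be tracked with a small class. This is a sketch under assumptions: the per-step cap and the `BudgetExceeded` exception are illustrative choices, and how token counts are actually obtained is outside its scope.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Hypothetical error raised when a step would overrun a limit."""


@dataclass
class TokenBudget:
    budget: int
    per_step_max: int
    per_step: dict = field(default_factory=dict)

    @property
    def used(self) -> int:
        return sum(self.per_step.values())

    @property
    def remaining(self) -> int:
        return self.budget - self.used

    def record(self, step: str, tokens: int) -> None:
        """Add tokens to a step, refusing anything that breaks a limit."""
        step_total = self.per_step.get(step, 0) + tokens
        if step_total > self.per_step_max:
            raise BudgetExceeded(f"{step} would use {step_total} tokens")
        if tokens > self.remaining:
            raise BudgetExceeded(f"only {self.remaining} tokens remain")
        self.per_step[step] = step_total


tracker = TokenBudget(budget=160000, per_step_max=30000)
tracker.record("step_1", 12000)
tracker.record("step_2", 18000)
tracker.record("step_3", 15000)
print(tracker.used, tracker.remaining)
```

The three recorded steps reproduce the `used` and `remaining` figures from the JSON example.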

Workflow State Machine

Usage

Basic Invocation

# Execute a workflow definition
Task(
  subagent_type="general-purpose",
  prompt="Use workflow-orchestrator subagent to execute the customer-onboarding-automation workflow for Acme Corp with parameters: {company_size: 'enterprise', industry: 'fintech'}"
)

# Execute with custom configuration
Task(
  subagent_type="general-purpose",
  prompt="""Use workflow-orchestrator subagent to execute product-launch-checklist workflow with:
  - max_concurrency: 10
  - checkpoint_interval: per_step
  - error_strategy: retry_with_fallback
  - notification: slack webhook https://hooks.slack.com/..."""
)

Advanced Usage

# Resume a failed workflow from checkpoint
Task(
  subagent_type="general-purpose",
  prompt="Use workflow-orchestrator subagent to resume workflow wf_abc123 from checkpoint cp_5 with modified parameters: {retry_failed_steps: true}"
)

# Execute workflow with custom data flow
Task(
  subagent_type="general-purpose",
  prompt="""Use workflow-orchestrator subagent to execute market-research-workflow with custom data flow:
  - Step 1 outputs: competitor_list
  - Step 2 inputs: competitors from step 1, pricing_data from external API
  - Step 3 inputs: aggregated results from steps 1 and 2"""
)

# Dry-run execution (validate without running)
Task(
  subagent_type="general-purpose",
  prompt="Use workflow-orchestrator subagent to validate and simulate execution of deployment-validation workflow without actually running steps. Provide execution plan, resource estimates, and potential failure points."
)

Integration with use-case-analyzer

Typical workflow:

User Request → use-case-analyzer agent
Analysis & Planning → use-case-analyzer generates workflow definition
Execution Handoff → use-case-analyzer invokes workflow-orchestrator
Execution → workflow-orchestrator runs the workflow
Results → workflow-orchestrator returns results to user

Example Integration:

# Step 1: User asks for help
# User: "I need to onboard 50 enterprise customers this quarter"

# Step 2: use-case-analyzer analyzes and plans
Task(
  subagent_type="general-purpose",
  prompt="Use use-case-analyzer subagent to analyze request and create workflow plan for enterprise customer onboarding"
)

# Step 3: use-case-analyzer invokes workflow-orchestrator
# (Automatically triggered by use-case-analyzer)
Task(
  subagent_type="general-purpose",
  prompt="Use workflow-orchestrator subagent to execute customer-onboarding-automation workflow generated by use-case-analyzer with parameters: {batch_size: 50, mode: 'enterprise'}"
)

Configuration Options

Execution Configuration

execution_config:
  mode: sequential|parallel|mixed
  max_concurrency: 5
  timeout_per_step: 300s
  timeout_total: 3600s

  checkpoint_strategy:
    enabled: true
    interval: per_step|per_phase|timed
    storage: file|database|redis

  retry_config:
    enabled: true
    max_attempts: 3
    backoff: exponential
    retry_on: [transient_error, agent_unavailable]

  fallback_config:
    enabled: true
    fallback_map:
      rust-expert-developer: senior-architect
      web-search-researcher: manual-research-step

Monitoring Configuration

monitoring_config:
  progress_updates:
    enabled: true
    interval: 30s
    format: json|text

  notifications:
    - type: webhook
      url: https://example.com/webhook
      events: [started, failed, completed]
      format: json

    - type: slack
      webhook: https://hooks.slack.com/...
      channel: "#workflows"
      events: [failed, completed]

  metrics:
    enabled: true
    export_format: prometheus
    endpoint: /metrics

Resource Configuration

resource_config:
  token_budget:
    total: 160000
    per_step_max: 30000
    reserved_for_error_handling: 10000

  api_limits:
    max_calls_per_minute: 60
    rate_limit_strategy: wait|skip|error

  compute:
    max_agents_concurrent: 10
    agent_timeout: 300s
    memory_limit_per_agent: 4GB

Error Handling Strategies

Error Classification

error_types:
  transient:
    errors:
      - network_timeout
      - api_rate_limit
      - agent_temporarily_unavailable
    strategy: retry_with_backoff

  permanent:
    errors:
      - invalid_input_data
      - agent_not_found
      - authentication_failure
    strategy: fail_immediately

  degradable:
    errors:
      - optional_step_failure
      - non_critical_data_missing
    strategy: continue_with_warning

  critical:
    errors:
      - security_violation
      - data_corruption
      - system_integrity_failure
    strategy: halt_and_escalate
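
A lookup from error code to handling strategy, following the classification above, might be sketched like this. Treating unknown error codes as critical is an assumption made here for safety, not documented behaviour, and `strategy_for` is a hypothetical helper.

```python
# Error classes and strategies copied from the classification above.
ERROR_TYPES = {
    "transient": {
        "errors": {"network_timeout", "api_rate_limit",
                   "agent_temporarily_unavailable"},
        "strategy": "retry_with_backoff",
    },
    "permanent": {
        "errors": {"invalid_input_data", "agent_not_found",
                   "authentication_failure"},
        "strategy": "fail_immediately",
    },
    "degradable": {
        "errors": {"optional_step_failure", "non_critical_data_missing"},
        "strategy": "continue_with_warning",
    },
    "critical": {
        "errors": {"security_violation", "data_corruption",
                   "system_integrity_failure"},
        "strategy": "halt_and_escalate",
    },
}


def strategy_for(error_code: str) -> str:
    """Return the handling strategy for an error code."""
    for spec in ERROR_TYPES.values():
        if error_code in spec["errors"]:
            return spec["strategy"]
    # Assumption: an unclassified error is handled conservatively.
    return "halt_and_escalate"


print(strategy_for("api_rate_limit"))
```

Keeping the mapping in data rather than in branching code makes it easy to add a new error code without touching the dispatch logic.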

Recovery Strategies

1. Automatic Retry:

retry_strategy:
  step: "Web search for competitor data"
  error: "API rate limit exceeded"
  action: retry
  config:
    attempt: 1/3
    delay: 10s (exponential backoff)
    next_attempt: 2025-12-12T10:15:10Z

2. Fallback Agent:

fallback_strategy:
  step: "Generate Rust code"
  primary_agent: rust-expert-developer
  error: "Agent unavailable"
  action: fallback
  fallback_agent: senior-architect
  fallback_reason: "Broader architecture expertise can handle code generation"

3. Graceful Degradation:

degradation_strategy:
  step: "Fetch optional market analytics"
  error: "Data source unavailable"
  action: continue
  impact: "Market analytics section will be marked as incomplete"
  warning: "Workflow will complete but output may be partial"

4. Human Escalation:

escalation_strategy:
  step: "Review security audit findings"
  error: "Critical security vulnerability detected"
  action: escalate
  notification:
    channels: [email, slack]
    recipients: [security-team, project-lead]
    message: "Critical security issue requires human review before proceeding"
  workflow_state: paused
  awaiting: human_approval

Best Practices

For Workflow Design

Atomic Steps: Keep each step focused on a single responsibility
Clear Dependencies: Explicitly define which steps depend on others
Idempotency: Design steps to be safely re-runnable
Output Contracts: Define clear output schemas for data flow
Error Boundaries: Identify critical vs optional steps

For Execution

Start Small: Test workflows with single items before batch processing
Monitor Progress: Use progress endpoints to track long-running workflows
Checkpoint Frequently: Save state at logical boundaries
Resource Planning: Estimate token/time budgets before execution
Dry Run First: Validate workflow definitions before production runs

For Error Handling

Classify Errors: Distinguish transient from permanent failures
Set Retry Limits: Avoid infinite retry loops
Provide Fallbacks: Have backup plans for critical steps
Alert Appropriately: Don't spam notifications for expected errors
Enable Recovery: Always allow workflow resumption from checkpoints

For Resource Management

Budget Tokens: Allocate token budgets per step
Limit Concurrency: Don't exceed available resources
Monitor Costs: Track resource usage for optimization
Optimize Sequencing: Run expensive steps only when necessary
Cache Results: Reuse outputs from previous runs when possible

Production Example: Enterprise Customer Onboarding

Workflow Definition

name: customer-onboarding-automation
description: Automated enterprise customer onboarding with compliance
version: 2.1.0

parameters:
  batch_size: 50
  company_tier: enterprise
  compliance_required: [HIPAA, SOC2]

phases:
  - name: Preparation
    execution: sequential
    steps:
      - id: 1
        name: Load customer data
        agent: data-engineering
        inputs:
          source: customers.csv
          validate_schema: true
        outputs:
          customer_list: Customer[]
          validation_report: ValidationReport
        error_strategy: fail_immediately

  - name: Account Setup
    execution: parallel
    max_concurrency: 10
    steps:
      - id: 2
        name: Create tenant accounts
        agent: backend-development
        inputs:
          customers: $step1.customer_list
        outputs:
          tenant_ids: UUID[]
        error_strategy: retry_with_backoff

      - id: 3
        name: Configure compliance settings
        agent: compliance-checker-agent
        inputs:
          requirements: $params.compliance_required
        outputs:
          compliance_configs: Config[]
        error_strategy: fail_immediately

  - name: Documentation
    execution: sequential
    steps:
      - id: 4
        name: Generate onboarding docs
        agent: codi-documentation-writer
        inputs:
          tenant_ids: $step2.tenant_ids
          compliance: $step3.compliance_configs
        outputs:
          documentation_urls: URL[]
        error_strategy: continue_with_warning

  - name: Validation
    execution: sequential
    steps:
      - id: 5
        name: End-to-end validation
        agent: testing-specialist
        inputs:
          tenants: $step2.tenant_ids
          docs: $step4.documentation_urls
        outputs:
          validation_results: TestResults[]
        error_strategy: retry_with_fallback

monitoring:
  progress_webhook: https://company.com/webhooks/onboarding
  slack_channel: "#customer-success"

resource_limits:
  total_timeout: 2 hours
  token_budget: 120000
  max_concurrent_agents: 10
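
Before running a definition like this one, a dry-run check can confirm that every `$stepN.name` input refers to an output that will already exist. The sketch below works on the definition after it has been parsed into Python dicts (parsing is not shown). It assumes steps in a parallel phase cannot read each other's outputs, while steps in a sequential phase can read from earlier steps in the same phase. `validate_references` is a hypothetical helper.

```python
import re

REFERENCE = re.compile(r"^\$step(\d+)\.(\w+)$")


def validate_references(phases: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means all references resolve."""
    problems = []
    declared = {}  # step id -> set of output names available so far
    for phase in phases:
        sequential = phase.get("execution", "sequential") == "sequential"
        pending = {}  # outputs of a parallel phase, visible only afterwards
        for step in phase["steps"]:
            for name, value in step.get("inputs", {}).items():
                match = REFERENCE.match(value) if isinstance(value, str) else None
                if match is None:
                    continue  # literals and $params.* are not checked here
                ref_id, key = int(match.group(1)), match.group(2)
                if key not in declared.get(ref_id, set()):
                    problems.append(
                        f"step {step['id']}: input {name!r} cannot resolve {value}"
                    )
            outputs = set(step.get("outputs", {}))
            if sequential:
                declared[step["id"]] = outputs
            else:
                pending[step["id"]] = outputs
        declared.update(pending)
    return problems


# Abbreviated version of the first three phases of the definition above.
phases = [
    {"name": "Preparation", "execution": "sequential", "steps": [
        {"id": 1, "inputs": {"source": "customers.csv"},
         "outputs": {"customer_list": "Customer[]"}},
    ]},
    {"name": "Account Setup", "execution": "parallel", "steps": [
        {"id": 2, "inputs": {"customers": "$step1.customer_list"},
         "outputs": {"tenant_ids": "UUID[]"}},
        {"id": 3, "inputs": {"requirements": "$params.compliance_required"},
         "outputs": {"compliance_configs": "Config[]"}},
    ]},
    {"name": "Documentation", "execution": "sequential", "steps": [
        {"id": 4, "inputs": {"tenant_ids": "$step2.tenant_ids",
                             "compliance": "$step3.compliance_configs"},
         "outputs": {"documentation_urls": "URL[]"}},
    ]},
]
problems = validate_references(phases)
```

An empty `problems` list for these phases indicates the data flow is wired consistently; a typo in any reference would surface before any agent is spawned.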

Execution Flow

# Invoke workflow orchestrator
Task(
  subagent_type="general-purpose",
  prompt="""Use workflow-orchestrator subagent to execute customer-onboarding-automation workflow version 2.1.0 with parameters:
  - batch_size: 50
  - company_tier: 'enterprise'
  - compliance_required: ['HIPAA', 'SOC2']

  Configuration:
  - checkpoint_interval: per_phase
  - notification: slack webhook https://hooks.slack.com/T00/B00/xxx
  - error_strategy: retry_with_fallback
  - max_concurrency: 10"""
)

Expected Output

{
  "workflow_execution": {
    "workflow_id": "wf_cust_onboard_20251212_001",
    "status": "completed",
    "started_at": "2025-12-12T10:00:00Z",
    "completed_at": "2025-12-12T11:45:00Z",
    "duration_minutes": 105,

    "phases_completed": 4,
    "steps_completed": 5,
    "steps_failed": 0,
    "steps_retried": 2,

    "resource_usage": {
      "tokens_used": 98000,
      "tokens_budget": 120000,
      "api_calls": 156,
      "peak_concurrent_agents": 10
    },

    "outputs": {
      "customer_list": "50 customers processed",
      "tenant_ids": ["uuid-1", "uuid-2", "...", "uuid-50"],
      "compliance_configs": ["config_HIPAA", "config_SOC2"],
      "documentation_urls": ["https://docs.company.com/tenant-1/onboarding", "..."],
      "validation_results": "All tenants passed validation"
    },

    "checkpoints": [
      {"phase": "Preparation", "timestamp": "2025-12-12T10:15:00Z"},
      {"phase": "Account Setup", "timestamp": "2025-12-12T10:45:00Z"},
      {"phase": "Documentation", "timestamp": "2025-12-12T11:15:00Z"},
      {"phase": "Validation", "timestamp": "2025-12-12T11:45:00Z"}
    ],

    "notifications_sent": [
      {"type": "slack", "event": "started", "timestamp": "2025-12-12T10:00:00Z"},
      {"type": "webhook", "event": "phase_completed", "phase": "Account Setup", "timestamp": "2025-12-12T10:45:00Z"},
      {"type": "slack", "event": "completed", "timestamp": "2025-12-12T11:45:00Z"}
    ]
  }
}

Related Components

use-case-analyzer - Workflow planning and analysis (precedes orchestrator)
orchestrator - Multi-agent coordination for complex tasks
project-organizer - Project structure and organization
codebase-analyzer - Code understanding for development workflows
testing-specialist - Automated testing in QA workflows

Changelog

v1.0.0 (2025-12-12)

Initial release
State machine implementation with checkpoint/recovery
Agent coordination via Task tool pattern
Parallel execution engine with concurrency limits
Comprehensive error handling (retry, fallback, escalation)
Real-time monitoring and progress tracking
Resource management (tokens, API limits, compute)
Integration with use-case-analyzer for workflow handoff
Production example: Enterprise customer onboarding

Success Output

When successful, this agent MUST output:

✅ AGENT COMPLETE: workflow-orchestrator

Workflow Execution Summary:
- Workflow: [workflow-name] (ID: wf_xxx)
- Status: completed
- Duration: XX minutes
- Phases completed: N/N
- Steps completed: M/M

Completed:
- [x] Workflow validation passed
- [x] State management configured (checkpoints enabled)
- [x] Agent coordination successful (N agents spawned)
- [x] Data flow verified (inputs → outputs chained)
- [x] Error handling tested (retry/fallback working)
- [x] Performance targets met

Resource Usage:
- Tokens: XXXXX / XXXXXX (XX% of budget)
- API calls: XXX / XXXX
- Peak concurrent agents: N
- Execution time: XX min / XX min estimated

Outputs:
- Workflow results: [workflow_outputs.json]
- State checkpoint: context-storage/workflows/wf_xxx_final.json
- Execution log: logs/workflow_wf_xxx.log
- Metrics: [Prometheus endpoint or file]

Next Steps:
- Review workflow outputs for quality validation
- Monitor production metrics if deployed
- Archive workflow state for audit trail

Completion Checklist

Before marking this agent's work as complete, verify:

Failure Indicators

This agent has FAILED if:

❌ Workflow validation fails (missing agents, invalid config)
❌ Critical step failure without recovery (no retry/fallback)
❌ State corruption (checkpoint recovery fails)
❌ Resource budget exceeded (token/time limit)
❌ Infinite retry loop (retry logic broken)
❌ Deadlock in parallel execution (agents blocked)
❌ Data flow broken (outputs not passed to next steps)
❌ Missing error escalation (critical failures not reported)
❌ Performance degradation (>2x estimated time)

When NOT to Use

Do NOT use this agent when:

Single-agent task sufficient (use specific agent directly)
No workflow definition exists (use use-case-analyzer to create one first)
Task is exploratory, not execution (use codebase-analyzer or research agents)
Need to design workflow, not execute it (use use-case-analyzer)
Simple sequential tasks (use orchestrator for lighter coordination)
Use use-case-analyzer for workflow planning and definition
Use orchestrator for simpler multi-agent coordination
Use project-organizer for project structure, not execution

Anti-Patterns (Avoid)

Anti-Pattern	Problem	Solution
No workflow validation	Runtime failures, wasted resources	Validate definition before execution (dry-run)
Missing checkpoints	Cannot resume after failure	Configure checkpoint strategy (per_step or per_phase)
Infinite retry loops	Resource exhaustion, zombie workflows	Set max_attempts and retry_on conditions
No concurrency limits	Resource contention, OOM errors	Enforce max_concurrent_agents limit
Ignoring resource budgets	Cost overruns, quota exceeded	Track and enforce token/API budgets
Synchronous execution only	Slow workflows, wasted time	Use parallel execution for independent steps
No error escalation	Silent failures, incomplete work	Configure critical error escalation rules
Missing data flow validation	Steps fail due to bad inputs	Validate input schemas before execution
No monitoring/alerting	Failures discovered too late	Configure webhooks/notifications for events

Principles

This agent embodies CODITECT principles:

#2 Self-Provisioning: Auto-spawns agents, manages resources
#4 Separation of Concerns: Orchestration separate from agent logic
#5 Eliminate Ambiguity: Clear state machine, explicit error handling
#6 Clear, Understandable, Explainable: Transparent progress, checkpoints, logs
#8 No Assumptions: Validate workflow definition, verify agent availability
Resilience: Retry, fallback, recovery mechanisms for production reliability

Full Standard: CODITECT-STANDARD-AUTOMATION.md

Maintainer: CODITECT Core Team
Last Updated: 2025-12-12

Related Documentation:

Workflow Library - 750+ workflow definitions
Agent Registry - Available agents
Component Activation - Activation workflow

Core Responsibilities

Analyze and assess security requirements within the Framework domain
Provide expert guidance on workflow orchestrator best practices and standards
Generate actionable recommendations with implementation specifics
Validate outputs against CODITECT quality standards and governance requirements
Integrate findings with existing project plans and track-based task management

Capabilities

Analysis & Assessment

Systematic evaluation of security artifacts, identifying gaps, risks, and improvement opportunities. Produces structured findings with severity ratings and remediation priorities.

Recommendation Generation

Creates actionable, specific recommendations tailored to the security context. Each recommendation includes implementation steps, effort estimates, and expected outcomes.

Quality Validation

Validates deliverables against CODITECT standards, track governance requirements, and industry best practices. Ensures compliance with ADR decisions and component specifications.

Invocation Examples

Direct Agent Call

Task(subagent_type="workflow-orchestrator",
     description="Brief task description",
     prompt="Detailed instructions for the agent")

Via CODITECT Command

/agent workflow-orchestrator "Your task description here"

Via MoE Routing

/which **Version:** 1.0.0
