ADR-010: Autonomous Multi-Agent Orchestration System

Document: ADR-010-autonomous-orchestration-system
Version: 1.0.0
Purpose: Document architectural decisions for fully autonomous multi-agent task orchestration with automated sync, intelligent dispatch, and parallel execution
Audience: Framework contributors, developers, AI agents, operations teams
Date Created: 2025-12-19
Status: APPROVED
Related ADRs:
  - ADR-006-work-item-hierarchy (task data model)
  - ADR-001-async-task-executor-refactoring (execution patterns)
Related Documents:
  - scripts/autonomous-orchestrator.py
  - scripts/task-dispatcher.py
  - scripts/agent-executor.py
  - scripts/sync-daemon.py
  - config/orchestrator-config.json

Context and Problem Statement

The Autonomous Operation Problem

CODITECT's V2 project plan contains 122+ tasks organized in the ADR-006 hierarchy (Epic → Feature → Task), but executing them today suffers from five problems:

Manual Agent Coordination - Humans must assign tasks to appropriate agents
Manual Status Sync - Markdown checkboxes and database drift apart
Sequential Execution - Tasks executed one-at-a-time, not parallelized
No Dependency Management - Tasks executed in arbitrary order
No Progress Persistence - Session breaks lose execution state

Current State (Human-in-the-Loop):

User → Read Task → Pick Agent → Execute → Update Checkbox → Repeat
       └── Manual sync ──┘           └── Manual sync ────┘

Target State (95% Autonomous):

User → Start Orchestrator → Autonomous Loop
                              ├── Sync Daemon (bidirectional)
                              ├── Task Dispatcher (intelligent assignment)
                              ├── Agent Executor (parallel execution)
                              └── Checkpoint System (state preservation)

Projected Business Impact:

60% reduction in coordination overhead
10x increase in task throughput (parallel execution)
99.9% sync accuracy (automated bidirectional sync)
Zero state loss across sessions (checkpoint system)
March 11, 2026 launch timeline achievable

Decision Drivers

Time to Market - 83 days to public launch requires parallel execution
Human Bottleneck - Manual coordination cannot scale to 122+ tasks
Quality Consistency - Automated dispatch ensures correct agent-to-task matching
State Persistence - Multi-day execution requires checkpoint/resume capability
Audit Trail - Compliance requires complete execution logging

Considered Options

Option A: Enhanced Manual Workflow

Improve tooling but keep human-in-the-loop
Rejected: Does not solve coordination bottleneck

Option B: Simple Queue System

Basic FIFO task queue with single agent
Rejected: No parallelization, no intelligent dispatch

Option C: Full Autonomous System (Selected)

Sync daemon + task dispatcher + parallel executor + orchestrator
Selected: Achieves 95% autonomy with checkpoint recovery

Option D: External Workflow Engine (Airflow/Temporal)

Use enterprise workflow orchestration
Rejected: Over-engineered for 122 tasks, adds infrastructure complexity

Decision

Implement Option C: Full Autonomous System with four integrated components:

1. Sync Daemon (`sync-daemon.py`)

Purpose: Bidirectional synchronization between markdown tasklist and database

Architecture:

┌─────────────────────────────────────────────────────────────┐
│                     SYNC DAEMON                              │
├─────────────────────────────────────────────────────────────┤
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │ File Watcher │ ←→ │ Sync Engine  │ ←→ │ DB Poller    │   │
│  └──────────────┘    └──────────────┘    └──────────────┘   │
│         ↓                   ↓                   ↓            │
│  [MD Checkboxes]      [Debounce]        [Task Status]       │
│         ↓                   ↓                   ↓            │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              v2_plan_sync (Audit Log)               │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘

Key Features:

MD5 hash-based change detection
2-second debounce to prevent thrashing
WAL mode for concurrent read/write
Audit trail in v2_plan_sync table
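
The hash-based change detection and the 2-second debounce listed above can be sketched as follows. This is a minimal illustration, not the code in sync-daemon.py: the class name `DebouncedChangeDetector`, the `observe` method, and the injectable `clock` are all assumptions made for the example.

```python
import hashlib
import time


def content_hash(text: str) -> str:
    """Return the MD5 hex digest of the tasklist text (change detection only)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class DebouncedChangeDetector:
    """Hypothetical helper: reports a change once the content is stable for `debounce` seconds."""

    def __init__(self, initial_text: str, debounce: float = 2.0, clock=time.monotonic):
        self._synced_hash = content_hash(initial_text)
        self._pending_hash = None
        self._pending_since = 0.0
        self._debounce = debounce
        self._clock = clock

    def observe(self, text: str) -> bool:
        """Feed the latest file content; return True when a sync should run."""
        digest = content_hash(text)
        now = self._clock()
        if digest == self._synced_hash:
            self._pending_hash = None  # edit was reverted, nothing to sync
            return False
        if digest != self._pending_hash:
            # New edit: restart the debounce window.
            self._pending_hash, self._pending_since = digest, now
            return False
        if now - self._pending_since >= self._debounce:
            # Content unchanged for the whole window: sync once.
            self._synced_hash, self._pending_hash = digest, None
            return True
        return False
```

Calling `observe` on every file-watcher event means a burst of checkbox edits triggers a single sync after the edits settle, rather than one sync per keystroke.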

2. Task Dispatcher (`task-dispatcher.py`)

Purpose: Intelligent task-to-agent matching with dependency resolution

Agent Mapping Algorithm:

AGENT_MAPPINGS = {
    "devops-engineer": ["deploy", "docker", "kubernetes", "ci/cd"],
    "security-specialist": ["security", "auth", "compliance"],
    "testing-specialist": ["test", "validation", "coverage"],
    "database-architect": ["database", "schema", "migration"],
    "backend-development": ["api", "endpoint", "server"],
    "frontend-development-agent": ["ui", "component", "react"],
    "codi-documentation-writer": ["document", "guide", "readme"],
    "general-purpose": []  # Default fallback
}

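def match_agent(task_description, epic_name):
    """Return the agent whose keywords occur most often in the task text."""
    combined = f"{task_description} {epic_name}".lower()
    # Count how many of each agent's keywords appear as substrings.
    scores = {agent: sum(1 for kw in keywords if kw in combined)
              for agent, keywords in AGENT_MAPPINGS.items()}
    best = max(scores, key=scores.get)
    # Fall back when nothing matched; max() alone would return the first agent.
    return best if scores[best] > 0 else "general-purpose"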

Dispatch Queue:

Priority Order: P0 → P1 → P2
Dependency Check: blocked_by field must be empty
Duplicate Prevention: task_assignments tracking table
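
The three dispatch rules above can be combined into one query. The sketch below is illustrative only: the `v2_tasks` columns (`priority`, `status`, `blocked_by`) are assumed for the example, since the real table is defined by ADR-006, and `next_tasks` is a hypothetical helper name.

```python
import sqlite3

# Hypothetical minimal layout; the real v2_tasks table is defined by ADR-006.
SCHEMA = """
CREATE TABLE v2_tasks (
    task_id TEXT PRIMARY KEY,
    priority TEXT,
    status TEXT,
    blocked_by TEXT DEFAULT ''
);
CREATE TABLE task_assignments (
    task_id TEXT NOT NULL,
    status TEXT DEFAULT 'assigned'
);
"""

NEXT_TASKS = """
SELECT t.task_id
FROM v2_tasks AS t
WHERE t.status = 'pending'
  AND COALESCE(t.blocked_by, '') = ''          -- dependency check
  AND NOT EXISTS (                             -- duplicate prevention
      SELECT 1 FROM task_assignments AS a
      WHERE a.task_id = t.task_id
        AND a.status IN ('assigned', 'in_progress')
  )
ORDER BY t.priority, t.task_id                 -- 'P0' < 'P1' < 'P2' as text
LIMIT ?
"""


def next_tasks(conn: sqlite3.Connection, limit: int = 5) -> list:
    """Return up to `limit` dispatchable task ids in priority order."""
    return [row[0] for row in conn.execute(NEXT_TASKS, (limit,))]
```

Sorting on the priority text works only because the labels P0, P1, P2 happen to sort lexically; a scheme with P10 or named priorities would need an explicit ordering column.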

3. Agent Executor (`agent-executor.py`)

Purpose: Execute tasks via Claude Code with status tracking

Execution Flow:

Get assignment from task_assignments table
Update status to "in_progress"
Build prompt with task context
Execute via `claude --print "prompt"` (`-p` is the short form of `--print`)
Capture output and exit code
Update status to "completed" or "failed"
Log execution details to file
Trigger sync daemon

Timeout and Retry:

Default timeout: 2 hours per task
Max retries: 3 with exponential backoff
Failed tasks return to pending queue
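
A minimal sketch of the timeout and retry policy follows. It assumes that "3 retries" means one initial attempt plus up to three more, and that the backoff doubles from the configured `retry_delay` of 30 seconds; neither detail is spelled out in this ADR. The function names are hypothetical.

```python
import subprocess
import time


def backoff_delays(max_retries: int = 3, base_delay: float = 30.0) -> list:
    """Delays before each retry: base, 2*base, 4*base, ..."""
    return [base_delay * (2 ** attempt) for attempt in range(max_retries)]


def run_command(cmd: list, timeout: float = 7200) -> bool:
    """One attempt: True on exit code 0, False on non-zero exit or timeout."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def run_with_retry(attempt, max_retries: int = 3, base_delay: float = 30.0,
                   sleep=time.sleep) -> tuple:
    """Call `attempt()` until it returns True or retries are exhausted.

    Returns (succeeded, number_of_attempts).
    """
    delays = backoff_delays(max_retries, base_delay)
    for i in range(max_retries + 1):
        if attempt():
            return True, i + 1
        if i < max_retries:
            sleep(delays[i])  # wait longer after each failure
    return False, max_retries + 1
```

A caller would wrap the real command, for example `run_with_retry(lambda: run_command(cmd))`, and return the task to the pending queue when the first element of the result is False.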

4. Autonomous Orchestrator (`autonomous-orchestrator.py`)

Purpose: Master control loop coordinating all components

Architecture:

┌──────────────────────────────────────────────────────────────────┐
│                    AUTONOMOUS ORCHESTRATOR                        │
├──────────────────────────────────────────────────────────────────┤
│                                                                   │
│   ┌─────────────┐      ┌──────────────┐      ┌─────────────┐     │
│   │ Sync Daemon │      │   Task Pool  │      │ Checkpoint  │     │
│   │  (Thread)   │      │ (Executor)   │      │   Manager   │     │
│   └──────┬──────┘      └──────┬───────┘      └──────┬──────┘     │
│          │                    │                      │            │
│          v                    v                      v            │
│   ┌────────────────────────────────────────────────────────┐     │
│   │                  CONTROL LOOP                           │     │
│   │  ┌─────────────────────────────────────────────────┐   │     │
│   │  │ 1. Check completed futures                      │   │     │
│   │  │ 2. Update status (completed/failed)             │   │     │
│   │  │ 3. Check failure threshold (pause if exceeded)  │   │     │
│   │  │ 4. Get pending tasks (priority ordered)         │   │     │
│   │  │ 5. Assign to available agent slots              │   │     │
│   │  │ 6. Submit to thread pool                        │   │     │
│   │  │ 7. Create checkpoint if milestone reached       │   │     │
│   │  │ 8. Sleep(poll_interval)                         │   │     │
│   │  └─────────────────────────────────────────────────┘   │     │
│   └────────────────────────────────────────────────────────┘     │
│                                                                   │
│   Concurrent Agents: 5 (configurable)                            │
│   Poll Interval: 10 seconds                                       │
│   Checkpoint Interval: Every 10 tasks                            │
│                                                                   │
└──────────────────────────────────────────────────────────────────┘
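
The control loop in the diagram can be approximated with a thread pool from the standard library. This is a sketch under stated assumptions rather than the contents of autonomous-orchestrator.py: `run_control_loop`, `execute`, and `on_checkpoint` are hypothetical names, tasks are assumed to arrive already priority ordered, and the steps are reordered slightly so that slots are filled before waiting.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_control_loop(tasks, execute, max_agents=5, checkpoint_interval=10,
                     pause_on_failure_count=5, poll_interval=10.0, on_checkpoint=None):
    """Run `execute(task)` over `tasks` in parallel; `execute` returns True on success.

    Returns (completed, failed, paused).
    """
    pending = list(tasks)
    running = {}  # future -> task
    completed, failed = [], []
    with ThreadPoolExecutor(max_workers=max_agents) as pool:
        while pending or running:
            # Steps 4-6: fill free agent slots from the pending queue.
            while pending and len(running) < max_agents:
                task = pending.pop(0)
                running[pool.submit(execute, task)] = task
            # Step 8: wait up to poll_interval for at least one result.
            done, _ = wait(running, timeout=poll_interval, return_when=FIRST_COMPLETED)
            # Steps 1-2: collect finished futures and record their status.
            for future in done:
                task = running.pop(future)
                ok = future.exception() is None and future.result()
                (completed if ok else failed).append(task)
                # Step 7: checkpoint every `checkpoint_interval` completions.
                if ok and on_checkpoint and len(completed) % checkpoint_interval == 0:
                    on_checkpoint(len(completed), len(pending))
            # Step 3: pause when too many tasks have failed. Tasks still
            # running are allowed to finish, but their results are dropped here.
            if len(failed) >= pause_on_failure_count:
                return completed, failed, True
    return completed, failed, False
```

Using `wait(..., timeout=poll_interval)` instead of a fixed sleep lets the loop react as soon as any agent finishes while still waking at least once per poll interval.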

State Machine:

                 ┌──────────────┐
                 │   STOPPED    │
                 └──────┬───────┘
                        │ start()
                        v
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│    PAUSED    │ ←─│   RUNNING    │ ←─│    ERROR     │
│              │ ─→│              │ ─→│              │
└──────────────┘   └──────┬───────┘   └──────────────┘
        ↑                 │
        │        failures >= threshold
        └─────────────────┘
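
One way to enforce the diagram's transitions in code is a small lookup table. The sketch below encodes only the arrows drawn above; transitions back to STOPPED are not shown in the diagram and are therefore left out. The names `State`, `ALLOWED`, `transition`, and `after_failure` are illustrative.

```python
from enum import Enum


class State(Enum):
    STOPPED = "stopped"
    RUNNING = "running"
    PAUSED = "paused"
    ERROR = "error"


# Only the arrows drawn in the diagram.
ALLOWED = {
    State.STOPPED: {State.RUNNING},              # start()
    State.RUNNING: {State.PAUSED, State.ERROR},
    State.PAUSED: {State.RUNNING},
    State.ERROR: {State.RUNNING},
}


def transition(current: State, target: State) -> State:
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target


def after_failure(current: State, failures: int, threshold: int = 5) -> State:
    """RUNNING moves to PAUSED once failures >= threshold."""
    if current is State.RUNNING and failures >= threshold:
        return transition(current, State.PAUSED)
    return current
```

The enum values match the status strings allowed by the `orchestrator_state` table, so `state.value` can be written to the database directly.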

Database Schema Extensions

New Tables

-- Task assignment tracking
CREATE TABLE task_assignments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    task_id TEXT NOT NULL,
    agent_type TEXT NOT NULL,
    assigned_at TEXT NOT NULL,
    started_at TEXT,
    completed_at TEXT,
    status TEXT DEFAULT 'assigned'
        CHECK(status IN ('assigned', 'in_progress', 'completed', 'failed', 'cancelled')),
    result TEXT,
    FOREIGN KEY (task_id) REFERENCES v2_tasks(task_id)
);

-- Orchestrator state persistence
CREATE TABLE orchestrator_state (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    started_at TEXT NOT NULL,
    stopped_at TEXT,
    tasks_completed INTEGER DEFAULT 0,
    tasks_failed INTEGER DEFAULT 0,
    status TEXT DEFAULT 'running'
        CHECK(status IN ('running', 'stopped', 'paused', 'error'))
);

-- Checkpoint tracking
CREATE TABLE orchestrator_checkpoints (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    checkpoint_name TEXT NOT NULL,
    created_at TEXT NOT NULL,
    tasks_completed INTEGER,
    tasks_pending INTEGER,
    notes TEXT
);

-- Indexes for performance
CREATE INDEX idx_task_assignments_task ON task_assignments(task_id);
CREATE INDEX idx_task_assignments_status ON task_assignments(status);
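
The assignment lifecycle these tables support (assigned, then in_progress, then completed or failed) might be driven as sketched below. The helper names are hypothetical and timestamps are stored as ISO 8601 text, which is an assumption consistent with the TEXT columns. The foreign key is omitted so the example is self-contained; SQLite enforces foreign keys only when `PRAGMA foreign_keys = ON` is set on the connection.

```python
import sqlite3
from datetime import datetime, timezone

# Copy of the task_assignments table without the foreign key, for a standalone demo.
DDL = """
CREATE TABLE task_assignments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    task_id TEXT NOT NULL,
    agent_type TEXT NOT NULL,
    assigned_at TEXT NOT NULL,
    started_at TEXT,
    completed_at TEXT,
    status TEXT DEFAULT 'assigned'
        CHECK(status IN ('assigned', 'in_progress', 'completed', 'failed', 'cancelled')),
    result TEXT
);
"""


def now_iso() -> str:
    """Current UTC time as ISO 8601 text."""
    return datetime.now(timezone.utc).isoformat()


def assign(conn, task_id: str, agent_type: str) -> int:
    """Insert a new assignment and return its row id."""
    cur = conn.execute(
        "INSERT INTO task_assignments (task_id, agent_type, assigned_at) VALUES (?, ?, ?)",
        (task_id, agent_type, now_iso()))
    return cur.lastrowid


def start(conn, assignment_id: int) -> None:
    """Mark the assignment as in progress."""
    conn.execute(
        "UPDATE task_assignments SET status = 'in_progress', started_at = ? WHERE id = ?",
        (now_iso(), assignment_id))


def finish(conn, assignment_id: int, succeeded: bool, result: str = "") -> None:
    """Record the final status and captured output."""
    conn.execute(
        "UPDATE task_assignments SET status = ?, completed_at = ?, result = ? WHERE id = ?",
        ("completed" if succeeded else "failed", now_iso(), result, assignment_id))
```

Because of the CHECK constraint, writing any status outside the five listed values raises `sqlite3.IntegrityError`, which catches typos in status strings at write time.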

Configuration

orchestrator-config.json

{
  "orchestrator": {
    "max_concurrent_agents": 5,
    "poll_interval": 10,
    "checkpoint_interval": 10,
    "retry_limit": 3,
    "pause_on_failure_count": 5
  },
  "executor": {
    "timeout": 7200,
    "max_retries": 3,
    "retry_delay": 30
  },
  "sync": {
    "interval": 30,
    "debounce": 2.0
  },
  "agent_mappings": {
    "devops-engineer": ["deploy", "docker", "kubernetes"],
    "security-specialist": ["security", "auth", "compliance"],
    ...
  }
}
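
As an illustration of how the "orchestrator" section could be consumed, the sketch below loads it into a typed object, with defaults mirroring the values shown above. The dataclass and loader are hypothetical and are not taken from the orchestrator scripts.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class OrchestratorSettings:
    """Defaults mirror the "orchestrator" section of the example config."""
    max_concurrent_agents: int = 5
    poll_interval: int = 10
    checkpoint_interval: int = 10
    retry_limit: int = 3
    pause_on_failure_count: int = 5


def load_orchestrator_settings(raw_json: str) -> OrchestratorSettings:
    """Parse config text, rejecting unknown keys and impossible values."""
    section = json.loads(raw_json).get("orchestrator", {})
    unknown = set(section) - set(OrchestratorSettings.__dataclass_fields__)
    if unknown:
        raise ValueError(f"unknown orchestrator keys: {sorted(unknown)}")
    settings = OrchestratorSettings(**section)
    if settings.max_concurrent_agents < 1:
        raise ValueError("max_concurrent_agents must be at least 1")
    return settings
```

Rejecting unknown keys surfaces misspelled settings at startup instead of silently falling back to a default.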

Success Metrics

Metric	Target	Measurement
Autonomy Rate	95%+	(tasks without human intervention) / total tasks
Dispatch Latency	<5s (p95)	Time from task available to assigned
Task Throughput	10+ tasks/hour	Completed tasks per hour
Success Rate	95%+	Completed / (completed + failed)
Sync Accuracy	99.9%	MD checkboxes matching DB status
Checkpoint Coverage	100%	All milestone states recoverable
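
The two ratio metrics in the table reduce to simple arithmetic. The sketch below uses made-up counts purely to show the calculation; the function names are illustrative.

```python
def autonomy_rate(tasks_without_intervention: int, total_tasks: int) -> float:
    """(tasks without human intervention) / total tasks."""
    return tasks_without_intervention / total_tasks if total_tasks else 0.0


def success_rate(completed: int, failed: int) -> float:
    """completed / (completed + failed)."""
    attempted = completed + failed
    return completed / attempted if attempted else 0.0


# Illustrative numbers only: 116 of 122 tasks ran without intervention,
# and 95 of 100 finished executions succeeded.
print(f"autonomy: {autonomy_rate(116, 122):.1%}")
print(f"success:  {success_rate(95, 5):.1%}")
```

With 122 tasks, the 95% autonomy target allows at most 6 tasks to need human intervention.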

Usage Examples

Start Orchestrator (Full Autonomous)

cd submodules/core/coditect-core

# Start with 5 concurrent agents
python3 scripts/autonomous-orchestrator.py --max-agents 5

# P0 tasks only (critical path)
python3 scripts/autonomous-orchestrator.py --priority P0

# Preview mode (no actual execution)
python3 scripts/autonomous-orchestrator.py --dry-run

Manual Operations

# Check sync status
python3 scripts/sync-daemon.py --status

# Get next available tasks
python3 scripts/task-dispatcher.py --next 5

# Execute specific task
python3 scripts/agent-executor.py --task T001.002

# Create checkpoint
python3 scripts/autonomous-orchestrator.py --checkpoint "pre-deploy"

Monitor Progress

# Live dashboard
python3 scripts/autonomous-orchestrator.py --dashboard

# JSON status for scripting
python3 scripts/autonomous-orchestrator.py --status

# Sync status
python3 scripts/sync-project-plan.py --status

Consequences

Positive

95% Autonomous Operation - Minimal human intervention required
10x Throughput - Parallel execution with 5+ concurrent agents
Zero State Loss - Checkpoint/resume across sessions
Automated Sync - Bidirectional MD↔DB synchronization (99.9% accuracy target)
Full Audit Trail - Every execution logged with results
Intelligent Dispatch - Tasks matched to optimal agents

Negative

Complexity - Four integrated components to maintain
Resource Usage - Concurrent agents consume more compute
Debugging - Distributed execution harder to troubleshoot
Claude Dependency - Requires Claude Code binary availability

Mitigations

Dry-run Mode - Preview execution without changes
Pause Threshold - Auto-pause on repeated failures
Comprehensive Logging - Full execution logs per task
Checkpoint Recovery - Resume from any milestone

Implementation Checklist

References

ADR-006: Work Item Hierarchy (task data model)
V2-CONSOLIDATED-project-plan.md (project plan source)
V2-tasklist-with-checkboxes.md (markdown tasklist)
v2-work-items.json (JSON extraction)

Decision: APPROVED
Date: 2025-12-19
Author: CODITECT Orchestrator Agent
Reviewers: Hal Casteel (Founder/CEO/CTO)
