ADR-010: Autonomous Multi-Agent Orchestration System
Document: ADR-010-autonomous-orchestration-system
Version: 1.0.0
Purpose: Document architectural decisions for fully autonomous multi-agent task orchestration with automated sync, intelligent dispatch, and parallel execution
Audience: Framework contributors, developers, AI agents, operations teams
Date Created: 2025-12-19
Status: APPROVED
Related ADRs:
- ADR-006-work-item-hierarchy (task data model)
- ADR-001-async-task-executor-refactoring (execution patterns)
Related Documents:
- scripts/autonomous-orchestrator.py
- scripts/task-dispatcher.py
- scripts/agent-executor.py
- scripts/sync-daemon.py
- config/orchestrator-config.json
Context and Problem Statement
The Autonomous Operation Problem
CODITECT's V2 project plan contains 122+ tasks organized in the ADR-006 hierarchy (Epic → Feature → Task), but executing the plan today has the following problems:
- Manual Agent Coordination - Humans must assign tasks to appropriate agents
- Manual Status Sync - Markdown checkboxes and database drift apart
- Sequential Execution - Tasks executed one-at-a-time, not parallelized
- No Dependency Management - Tasks executed in arbitrary order
- No Progress Persistence - Session breaks lose execution state
Current State (Human-in-the-Loop):
User → Read Task → Pick Agent → Execute → Update Checkbox → Repeat
└── Manual sync ──┘ └── Manual sync ────┘
Target State (95% Autonomous):
User → Start Orchestrator → Autonomous Loop
├── Sync Daemon (bidirectional)
├── Task Dispatcher (intelligent assignment)
├── Agent Executor (parallel execution)
└── Checkpoint System (state preservation)
Business Impact:
- 60% reduction in coordination overhead
- 10x increase in task throughput (parallel execution)
- 99.9% sync accuracy (automated bidirectional sync)
- Zero state loss across sessions (checkpoint system)
- March 11, 2026 launch timeline achievable
Decision Drivers
- Time to Market - 83 days to public launch requires parallel execution
- Human Bottleneck - Manual coordination cannot scale to 122+ tasks
- Quality Consistency - Automated dispatch ensures correct agent-to-task matching
- State Persistence - Multi-day execution requires checkpoint/resume capability
- Audit Trail - Compliance requires complete execution logging
Considered Options
Option A: Enhanced Manual Workflow
- Improve tooling but keep human-in-the-loop
- Rejected: Does not solve coordination bottleneck
Option B: Simple Queue System
- Basic FIFO task queue with single agent
- Rejected: No parallelization, no intelligent dispatch
Option C: Full Autonomous System (Selected)
- Sync daemon + task dispatcher + parallel executor + orchestrator
- Selected: Achieves 95% autonomy with checkpoint recovery
Option D: External Workflow Engine (Airflow/Temporal)
- Use enterprise workflow orchestration
- Rejected: Over-engineered for 122 tasks, adds infrastructure complexity
Decision
Implement Option C: Full Autonomous System with four integrated components:
1. Sync Daemon (sync-daemon.py)
Purpose: Bidirectional synchronization between markdown tasklist and database
Architecture:
┌─────────────────────────────────────────────────────────────┐
│ SYNC DAEMON │
├─────────────────────────────────────────────────────────────┤
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ File Watcher │ ←→ │ Sync Engine │ ←→ │ DB Poller │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ↓ ↓ ↓ │
│ [MD Checkboxes] [Debounce] [Task Status] │
│ ↓ ↓ ↓ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ v2_plan_sync (Audit Log) │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Key Features:
- MD5 hash-based change detection
- 2-second debounce to prevent thrashing
- WAL mode for concurrent read/write
- Audit trail in the v2_plan_sync table
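The hash-based change detection and 2-second debounce listed above could be combined roughly as follows. This is a minimal sketch, not the code in sync-daemon.py: the class name `DebouncedChangeDetector`, the `observe` method, and the injectable `clock` parameter are hypothetical names chosen for illustration.

```python
import hashlib
import time


def content_hash(text: str) -> str:
    """Return the MD5 hex digest of the text (used for change detection, not security)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class DebouncedChangeDetector:
    """Hypothetical helper: reports a change only after the content has been
    stable for `debounce` seconds."""

    def __init__(self, debounce: float = 2.0, clock=time.monotonic):
        self.debounce = debounce
        self.clock = clock
        self.synced_hash = None    # hash of the last content that was synced
        self.pending_hash = None   # hash of an edit waiting out the debounce window
        self.pending_since = 0.0

    def observe(self, text: str) -> bool:
        """Feed the current file content; return True when a sync should run."""
        digest = content_hash(text)
        if digest == self.synced_hash:
            self.pending_hash = None       # unchanged, or reverted to the synced state
            return False
        now = self.clock()
        if digest != self.pending_hash:
            self.pending_hash = digest     # new edit: restart the debounce window
            self.pending_since = now
            return False
        if now - self.pending_since >= self.debounce:
            self.synced_hash = digest      # stable long enough: sync now
            self.pending_hash = None
            return True
        return False
```

Calling `observe` on every poll or file-watcher event means a burst of rapid edits triggers one sync after the file settles, rather than one sync per edit.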
2. Task Dispatcher (task-dispatcher.py)
Purpose: Intelligent task-to-agent matching with dependency resolution
Agent Mapping Algorithm:
AGENT_MAPPINGS = {
    "devops-engineer": ["deploy", "docker", "kubernetes", "ci/cd"],
    "security-specialist": ["security", "auth", "compliance"],
    "testing-specialist": ["test", "validation", "coverage"],
    "database-architect": ["database", "schema", "migration"],
    "backend-development": ["api", "endpoint", "server"],
    "frontend-development-agent": ["ui", "component", "react"],
    "codi-documentation-writer": ["document", "guide", "readme"],
    "general-purpose": [],  # Default fallback
}

def match_agent(task_description, epic_name):
    """Return the agent whose keywords occur most often in the task text."""
    combined = f"{task_description} {epic_name}".lower()
    # Substring match: count how many of each agent's keywords appear in the text.
    scores = {agent: sum(1 for kw in keywords if kw in combined)
              for agent, keywords in AGENT_MAPPINGS.items()}
    best = max(scores, key=scores.get)
    # max() always returns a key, so fall back explicitly when nothing matched.
    return best if scores[best] > 0 else "general-purpose"
Dispatch Queue:
Priority Order: P0 → P1 → P2
Dependency Check: blocked_by field must be empty
Duplicate Prevention: task_assignments tracking table
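The three dispatch rules above (priority order, empty blocked_by, no duplicate assignment) can be illustrated with a small in-memory sketch. The `Task` dataclass and `next_dispatchable` function are hypothetical; the actual dispatcher reads from the database, which is not shown here.

```python
from dataclasses import dataclass, field

PRIORITY_RANK = {"P0": 0, "P1": 1, "P2": 2}


@dataclass
class Task:
    task_id: str
    priority: str                                    # "P0", "P1", or "P2"
    blocked_by: list = field(default_factory=list)   # IDs of unfinished prerequisites


def next_dispatchable(tasks, assigned_ids, limit=5):
    """Return up to `limit` tasks that are unblocked and unassigned, P0 first."""
    ready = [
        t for t in tasks
        if not t.blocked_by                  # dependency check: blocked_by must be empty
        and t.task_id not in assigned_ids    # duplicate prevention
    ]
    # list.sort() is stable, so tasks keep their input order within a priority level.
    # Unknown priorities are assumed to sort after P2.
    ready.sort(key=lambda t: PRIORITY_RANK.get(t.priority, len(PRIORITY_RANK)))
    return ready[:limit]
```

Here `assigned_ids` stands in for the set of task IDs that already have an active row in task_assignments.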
3. Agent Executor (agent-executor.py)
Purpose: Execute tasks via Claude Code with status tracking
Execution Flow:
1. Get assignment from task_assignments table
2. Update status to "in_progress"
3. Build prompt with task context
4. Execute via `claude --print -p "prompt"`
5. Capture output and exit code
6. Update status to "completed" or "failed"
7. Log execution details to file
8. Trigger sync daemon
Timeout and Retry:
- Default timeout: 2 hours per task
- Max retries: 3 with exponential backoff
- Failed tasks return to pending queue
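One way to realize the timeout and retry policy above is sketched below. It makes two assumptions that the text does not settle: `max_retries` is treated as the total number of attempts, and the backoff doubles from `retry_delay` (30 s, 60 s, ...). The function name `run_with_retries` and the injectable `sleep` parameter are hypothetical, and a stand-in command is used instead of the Claude Code CLI.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, timeout=7200, max_retries=3, retry_delay=30, sleep=time.sleep):
    """Run `cmd`, retrying on non-zero exit or timeout with exponential backoff.

    Returns (succeeded, attempts_made, last_output).
    """
    output = ""
    for attempt in range(1, max_retries + 1):
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
            output = proc.stdout + proc.stderr
            if proc.returncode == 0:
                return True, attempt, output
        except subprocess.TimeoutExpired:
            output = f"timed out after {timeout}s"
        if attempt < max_retries:
            # Assumed schedule: retry_delay, then 2x, then 4x, ...
            sleep(retry_delay * 2 ** (attempt - 1))
    return False, max_retries, output


if __name__ == "__main__":
    # Stand-in command; the real executor would invoke the Claude Code CLI here.
    print(run_with_retries([sys.executable, "-c", "print('ok')"], timeout=60))
```

A `False` result is the point at which the caller would mark the assignment "failed" and return the task to the pending queue.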
4. Autonomous Orchestrator (autonomous-orchestrator.py)
Purpose: Master control loop coordinating all components
Architecture:
┌──────────────────────────────────────────────────────────────────┐
│ AUTONOMOUS ORCHESTRATOR │
├──────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ Sync Daemon │ │ Task Pool │ │ Checkpoint │ │
│ │ (Thread) │ │ (Executor) │ │ Manager │ │
│ └──────┬──────┘ └──────┬───────┘ └──────┬──────┘ │
│ │ │ │ │
│ v v v │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ CONTROL LOOP │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ 1. Check completed futures │ │ │
│ │ │ 2. Update status (completed/failed) │ │ │
│ │ │ 3. Check failure threshold (pause if exceeded) │ │ │
│ │ │ 4. Get pending tasks (priority ordered) │ │ │
│ │ │ 5. Assign to available agent slots │ │ │
│ │ │ 6. Submit to thread pool │ │ │
│ │ │ 7. Create checkpoint if milestone reached │ │ │
│ │ │ 8. Sleep(poll_interval) │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ Concurrent Agents: 5 (configurable) │
│ Poll Interval: 10 seconds │
│ Checkpoint Interval: Every 10 tasks │
│ │
└──────────────────────────────────────────────────────────────────┘
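The eight control-loop steps in the diagram can be sketched with `concurrent.futures.ThreadPoolExecutor`. This is an illustrative outline under stated assumptions, not the orchestrator's actual code: `control_loop`, `execute`, and `checkpoint` are hypothetical names, the pending list is assumed to be priority ordered already, and persistence to the database is left out.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def control_loop(pending, execute, max_agents=5, poll_interval=10,
                 checkpoint_interval=10, pause_on_failure_count=5,
                 checkpoint=lambda completed, failed: None, sleep=time.sleep):
    """Run `execute(task)` for each task, at most `max_agents` at a time.

    `execute` returns True on success. Returns (completed, failed, state).
    """
    queue = list(pending)      # assumed to be priority ordered already
    running = {}               # future -> task
    completed = failed = 0
    with ThreadPoolExecutor(max_workers=max_agents) as pool:
        while queue or running:
            # Steps 1-2: collect finished futures and record their outcome.
            for fut in [f for f in running if f.done()]:
                running.pop(fut)
                if fut.exception() is None and fut.result():
                    completed += 1
                    # Step 7: checkpoint at every Nth completed task.
                    if completed % checkpoint_interval == 0:
                        checkpoint(completed, failed)
                else:
                    failed += 1
            # Step 3: pause once the failure threshold is reached. Tasks still
            # in flight are awaited when the pool exits but are not counted here.
            if failed >= pause_on_failure_count:
                return completed, failed, "paused"
            # Steps 4-6: fill free agent slots from the queue.
            while queue and len(running) < max_agents:
                task = queue.pop(0)
                running[pool.submit(execute, task)] = task
            # Step 8: wait before polling again.
            if queue or running:
                sleep(poll_interval)
    return completed, failed, "stopped"
```

Polling `Future.done()` keeps the loop in one thread, so the counters and the checkpoint callback need no locking.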
State Machine:
┌──────────────┐
│ STOPPED │
└──────┬───────┘
│ start()
v
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ PAUSED │ ←─│ RUNNING │ ←─│ ERROR │
│ │ ─→│ │ ─→│ │
└──────────────┘ └──────┬───────┘ └──────────────┘
↑ │
│ failures >= threshold
└─────────────────┘
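The state machine can be expressed as an explicit transition table. The sketch below reads the STOPPED → RUNNING, RUNNING ↔ PAUSED, and RUNNING ↔ ERROR transitions off the diagram; the transitions back to STOPPED are an assumption, since the diagram does not draw them. `State`, `ALLOWED`, `transition`, and `after_failure` are hypothetical names.

```python
from enum import Enum


class State(Enum):
    STOPPED = "stopped"
    RUNNING = "running"
    PAUSED = "paused"
    ERROR = "error"


# Transitions read off the diagram; the moves back to STOPPED are an assumption.
ALLOWED = {
    State.STOPPED: {State.RUNNING},
    State.RUNNING: {State.PAUSED, State.ERROR, State.STOPPED},
    State.PAUSED: {State.RUNNING, State.STOPPED},
    State.ERROR: {State.RUNNING, State.STOPPED},
}


def transition(current: State, target: State) -> State:
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


def after_failure(current: State, failures: int, threshold: int = 5) -> State:
    """Pause a running orchestrator once failures reach the threshold."""
    if current is State.RUNNING and failures >= threshold:
        return transition(current, State.PAUSED)
    return current
```

The enum values match the status strings allowed by the orchestrator_state table, so a state can be stored as `state.value`.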
Database Schema Extensions
New Tables
-- Task assignment tracking
CREATE TABLE task_assignments (
id INTEGER PRIMARY KEY AUTOINCREMENT,
task_id TEXT NOT NULL,
agent_type TEXT NOT NULL,
assigned_at TEXT NOT NULL,
started_at TEXT,
completed_at TEXT,
status TEXT DEFAULT 'assigned'
CHECK(status IN ('assigned', 'in_progress', 'completed', 'failed', 'cancelled')),
result TEXT,
FOREIGN KEY (task_id) REFERENCES v2_tasks(task_id)
);
-- Orchestrator state persistence
CREATE TABLE orchestrator_state (
id INTEGER PRIMARY KEY AUTOINCREMENT,
started_at TEXT NOT NULL,
stopped_at TEXT,
tasks_completed INTEGER DEFAULT 0,
tasks_failed INTEGER DEFAULT 0,
status TEXT DEFAULT 'running'
CHECK(status IN ('running', 'stopped', 'paused', 'error'))
);
-- Checkpoint tracking
CREATE TABLE orchestrator_checkpoints (
id INTEGER PRIMARY KEY AUTOINCREMENT,
checkpoint_name TEXT NOT NULL,
created_at TEXT NOT NULL,
tasks_completed INTEGER,
tasks_pending INTEGER,
notes TEXT
);
-- Indexes for performance
CREATE INDEX idx_task_assignments_task ON task_assignments(task_id);
CREATE INDEX idx_task_assignments_status ON task_assignments(status);
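As a rough illustration of how the task_assignments table might be used from Python, the sketch below creates the table in an in-memory SQLite database and guards against duplicate active assignments. The helper names `assign` and `finish` are hypothetical, and the foreign key to v2_tasks is omitted so the example is self-contained.

```python
import sqlite3
from datetime import datetime, timezone

# Same columns as the ADR schema, minus the v2_tasks foreign key.
SCHEMA = """
CREATE TABLE IF NOT EXISTS task_assignments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    task_id TEXT NOT NULL,
    agent_type TEXT NOT NULL,
    assigned_at TEXT NOT NULL,
    started_at TEXT,
    completed_at TEXT,
    status TEXT DEFAULT 'assigned'
        CHECK(status IN ('assigned', 'in_progress', 'completed', 'failed', 'cancelled')),
    result TEXT
);
"""


def now_iso() -> str:
    """Current UTC time as an ISO 8601 string (timestamps are stored as TEXT)."""
    return datetime.now(timezone.utc).isoformat()


def assign(conn, task_id, agent_type):
    """Insert an assignment unless the task already has an active one.

    Returns the new row id, or None if the task is already assigned or running.
    """
    with conn:  # commits on success, rolls back on error
        duplicate = conn.execute(
            "SELECT 1 FROM task_assignments "
            "WHERE task_id = ? AND status IN ('assigned', 'in_progress')",
            (task_id,),
        ).fetchone()
        if duplicate:
            return None
        cur = conn.execute(
            "INSERT INTO task_assignments (task_id, agent_type, assigned_at) VALUES (?, ?, ?)",
            (task_id, agent_type, now_iso()),
        )
        return cur.lastrowid


def finish(conn, assignment_id, ok, result=""):
    """Mark an assignment completed or failed and store its result text."""
    with conn:
        conn.execute(
            "UPDATE task_assignments SET status = ?, completed_at = ?, result = ? WHERE id = ?",
            ("completed" if ok else "failed", now_iso(), result, assignment_id),
        )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

The check-then-insert in `assign` is only safe with a single writer; with several dispatcher processes, a uniqueness constraint on active rows would be a stronger guard.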
Configuration
orchestrator-config.json
{
  "orchestrator": {
    "max_concurrent_agents": 5,
    "poll_interval": 10,
    "checkpoint_interval": 10,
    "retry_limit": 3,
    "pause_on_failure_count": 5
  },
  "executor": {
    "timeout": 7200,
    "max_retries": 3,
    "retry_delay": 30
  },
  "sync": {
    "interval": 30,
    "debounce": 2.0
  },
  "agent_mappings": {
    "devops-engineer": ["deploy", "docker", "kubernetes"],
    "security-specialist": ["security", "auth", "compliance"],
    ...
  }
}
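A loader for this file might merge user-supplied values over the defaults shown above, so that a partial config still yields every key. The sketch below is one possible approach; `load_config` and `DEFAULTS` are hypothetical names, and the single validation rule is an assumption added for illustration.

```python
import json

# Default values copied from the configuration shown in this ADR.
DEFAULTS = {
    "orchestrator": {"max_concurrent_agents": 5, "poll_interval": 10,
                     "checkpoint_interval": 10, "retry_limit": 3,
                     "pause_on_failure_count": 5},
    "executor": {"timeout": 7200, "max_retries": 3, "retry_delay": 30},
    "sync": {"interval": 30, "debounce": 2.0},
}


def load_config(text: str) -> dict:
    """Parse config JSON and fill in any missing keys from DEFAULTS."""
    loaded = json.loads(text)
    merged = {}
    for section, defaults in DEFAULTS.items():
        # Values from the file win over defaults, key by key.
        merged[section] = {**defaults, **loaded.get(section, {})}
    merged["agent_mappings"] = loaded.get("agent_mappings", {})
    # Assumed sanity check: at least one agent slot is needed to make progress.
    if merged["orchestrator"]["max_concurrent_agents"] < 1:
        raise ValueError("max_concurrent_agents must be at least 1")
    return merged
```

For example, `load_config('{"orchestrator": {"max_concurrent_agents": 2}}')` keeps the default 10-second poll interval while lowering concurrency to 2.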
Success Metrics
| Metric | Target | Measurement |
|---|---|---|
| Autonomy Rate | 95%+ | (tasks without human intervention) / total tasks |
| Dispatch Latency | <5s (p95) | Time from task available to assigned |
| Task Throughput | 10+ tasks/hour | Completed tasks per hour |
| Success Rate | 95%+ | Completed / (completed + failed) |
| Sync Accuracy | 99.9% | MD checkboxes matching DB status |
| Checkpoint Coverage | 100% | All milestone states recoverable |
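The ratio metrics in the table follow directly from their Measurement column. A minimal sketch, assuming the counts and status maps are already available (the function names and argument shapes are hypothetical):

```python
def success_rate(completed: int, failed: int) -> float:
    """Completed / (completed + failed); 0.0 when nothing has finished yet."""
    finished = completed + failed
    return completed / finished if finished else 0.0


def autonomy_rate(total_tasks: int, tasks_with_intervention: int) -> float:
    """Share of tasks that ran without human intervention."""
    if total_tasks == 0:
        return 0.0
    return (total_tasks - tasks_with_intervention) / total_tasks


def sync_accuracy(md_status: dict, db_status: dict) -> float:
    """Share of task IDs whose markdown checkbox state matches the database status.

    Both arguments map task_id -> bool (done or not done); a task missing on
    either side counts as a mismatch.
    """
    ids = set(md_status) | set(db_status)
    if not ids:
        return 1.0
    matching = sum(1 for i in ids if i in md_status and i in db_status
                   and md_status[i] == db_status[i])
    return matching / len(ids)
```

Comparing the results against the targets (for example `success_rate(...) >= 0.95`) gives a simple pass/fail check per metric.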
Usage Examples
Start Orchestrator (Full Autonomous)
cd submodules/core/coditect-core
# Start with 5 concurrent agents
python3 scripts/autonomous-orchestrator.py --max-agents 5
# P0 tasks only (critical path)
python3 scripts/autonomous-orchestrator.py --priority P0
# Preview mode (no actual execution)
python3 scripts/autonomous-orchestrator.py --dry-run
Manual Operations
# Check sync status
python3 scripts/sync-daemon.py --status
# Get next available tasks
python3 scripts/task-dispatcher.py --next 5
# Execute specific task
python3 scripts/agent-executor.py --task T001.002
# Create checkpoint
python3 scripts/autonomous-orchestrator.py --checkpoint "pre-deploy"
Monitor Progress
# Live dashboard
python3 scripts/autonomous-orchestrator.py --dashboard
# JSON status for scripting
python3 scripts/autonomous-orchestrator.py --status
# Sync status
python3 scripts/sync-project-plan.py --status
Consequences
Positive
- 95% Autonomous Operation - Minimal human intervention required
- 10x Throughput - Parallel execution with 5+ concurrent agents
- Zero State Loss - Checkpoint/resume across sessions
- Automated Sync - Bidirectional MD↔DB synchronization (99.9% accuracy target)
- Full Audit Trail - Every execution logged with results
- Intelligent Dispatch - Tasks matched to optimal agents
Negative
- Complexity - Four integrated components to maintain
- Resource Usage - Concurrent agents consume more compute
- Debugging - Distributed execution harder to troubleshoot
- Claude Dependency - Requires Claude Code binary availability
Mitigations
- Dry-run Mode - Preview execution without changes
- Pause Threshold - Auto-pause on repeated failures
- Comprehensive Logging - Full execution logs per task
- Checkpoint Recovery - Resume from any milestone
Implementation Checklist
- Sync Daemon (sync-daemon.py)
- Task Dispatcher (task-dispatcher.py)
- Agent Executor (agent-executor.py)
- Autonomous Orchestrator (autonomous-orchestrator.py)
- Configuration (orchestrator-config.json)
- Database schema extensions (in scripts)
- ADR documentation (this document)
- Integration tests
- Load testing (10+ concurrent agents)
- Production deployment guide
References
- ADR-006: Work Item Hierarchy (task data model)
- V2-CONSOLIDATED-project-plan.md (project plan source)
- V2-tasklist-with-checkboxes.md (markdown tasklist)
- v2-work-items.json (JSON extraction)
Decision: APPROVED
Date: 2025-12-19
Author: CODITECT Orchestrator Agent
Reviewers: Hal Casteel (Founder/CEO/CTO)