Implementation Requirements: Agent Health Monitoring Layer
Document ID: IMPL-REQ-003
Priority: P1 (High)
Target ADR: ADR-110 (Proposed)
Estimated Effort: 4 Sprints (8 weeks, per the phase plan in Section 4)
Dependencies: Orchestrator, Checkpoint Service, Event Bus
1. Overview
1.1 Problem Statement
Autonomous agent loops can become stuck or unresponsive, or they can enter infinite retry cycles. Without active health monitoring, these failure modes:
- Consume resources indefinitely
- Block dependent tasks
- Generate costs without progress
- Require manual detection and intervention
Gas Town's GUPP principle ("If there is work on your Hook, YOU MUST RUN IT") and Witness/Deacon pattern provide a battle-tested model for stuck detection and intervention.
1.2 Objective
Implement a health monitoring layer that:
- Detects stuck or unresponsive agents
- Provides graduated intervention (nudge → escalate → terminate)
- Enforces circuit breaker patterns
- Enables self-healing through automatic recovery
- Maintains observability into agent health
1.3 Success Criteria
| Metric | Target |
|---|---|
| Stuck detection latency | < 5 minutes |
| False positive rate | < 1% |
| Recovery success rate | > 95% |
| Mean time to intervention | < 10 minutes |
2. Functional Requirements
2.1 Health Status Model
FR-001: Agent Health States
states:
  HEALTHY:
    description: "Agent is making progress"
    indicators:
      - Recent checkpoint update (< 10 min)
      - Tool calls being made
      - No error accumulation

  DEGRADED:
    description: "Agent showing warning signs"
    indicators:
      - Checkpoint stale (10-30 min)
      - High error rate (> 25%)
      - Token consumption anomaly
    actions:
      - Increase monitoring frequency
      - Prepare intervention

  STUCK:
    description: "Agent not making progress"
    indicators:
      - No checkpoint update (> 30 min)
      - Repeated identical operations
      - Context exhaustion without handoff
    actions:
      - Initiate nudge sequence
      - Alert orchestrator

  FAILING:
    description: "Agent in error loop"
    indicators:
      - Circuit breaker tripped
      - Consecutive failures (>= 3)
      - Unrecoverable errors detected
    actions:
      - Terminate agent
      - Initiate recovery from checkpoint

  TERMINATED:
    description: "Agent has been stopped"
    indicators:
      - Explicit termination
      - Resource limit exceeded
      - Unrecoverable failure
    actions:
      - Cleanup resources
      - Archive final state
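The state model above can be sketched as a TypeScript literal type plus a transition table. This is an illustrative sketch, not the final API: the names and the exact set of allowed transitions (e.g. whether HEALTHY must pass through DEGRADED before STUCK) are assumptions derived from the indicator thresholds.

```typescript
// Hypothetical sketch of the FR-001 state machine; names are illustrative.
type HealthState = 'HEALTHY' | 'DEGRADED' | 'STUCK' | 'FAILING' | 'TERMINATED';

// Allowed transitions, one plausible reading of the state descriptions:
// any active state may fail or be terminated; TERMINATED is terminal.
const TRANSITIONS: Record<HealthState, HealthState[]> = {
  HEALTHY:    ['DEGRADED', 'FAILING', 'TERMINATED'],
  DEGRADED:   ['HEALTHY', 'STUCK', 'FAILING', 'TERMINATED'],
  STUCK:      ['HEALTHY', 'DEGRADED', 'FAILING', 'TERMINATED'],
  FAILING:    ['TERMINATED'],
  TERMINATED: [], // terminal state: no outgoing transitions
};

function isValidTransition(from: HealthState, to: HealthState): boolean {
  return TRANSITIONS[from].includes(to);
}
```

Encoding the table as data (Step 1.1's "transition validation rules") keeps the validation logic trivial and makes the policy auditable.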
2.2 Health Check Protocol
FR-002: Health Check Operations
Heartbeat Protocol:
├── Agent emits heartbeat every 5 minutes
├── Heartbeat includes:
│ ├── agent_id
│ ├── task_id
│ ├── current_phase
│ ├── last_tool_call_timestamp
│ ├── token_count
│ ├── error_count
│ └── progress_indicator
├── Monitor records heartbeat receipt
└── Missing heartbeat triggers DEGRADED state
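The monitor-side staleness check can be sketched as below; the `Heartbeat` shape is trimmed to the fields the check needs, and the 600 s timeout mirrors the Section 7 configuration (2x the emission interval).

```typescript
// Illustrative missing-heartbeat detection; field names are assumptions.
interface Heartbeat {
  agentId: string;
  timestampMs: number; // monitor-side receipt time
}

const HEARTBEAT_TIMEOUT_MS = 600_000; // 10 min = 2x the 5 min interval

function isHeartbeatMissing(last: Heartbeat | null, nowMs: number): boolean {
  // No heartbeat ever received, or the last one is older than the timeout.
  return last === null || nowMs - last.timestampMs > HEARTBEAT_TIMEOUT_MS;
}
```

Using the monitor's receipt timestamp rather than the agent's own clock sidesteps the clock-skew risk noted in Section 10.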
Progress Detection:
├── Compare checkpoint timestamps
├── Analyze tool call patterns
│ ├── Repeated identical calls = stuck
│ ├── No calls for 15+ min = stuck
│ └── Diverse calls = healthy
├── Token consumption rate analysis
│ ├── Sudden spike = potential loop
│ ├── Zero consumption = stuck
│ └── Steady rate = healthy
└── Output analysis (file changes, test results)
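The "repeated identical calls = stuck" signal can be sketched as a window check over recent tool calls. The window size of 5 and the `argsHash` field are assumptions for illustration; the real detector would likely combine this with the token-rate and output signals above.

```typescript
// Hypothetical repetition detector for FR-002 progress analysis.
interface ToolCall {
  name: string;
  argsHash: string; // hash of the call arguments, assumed precomputed
}

function looksStuck(recent: ToolCall[], windowSize = 5): boolean {
  // Too few calls to judge: do not flag.
  if (recent.length < windowSize) return false;
  const window = recent.slice(-windowSize);
  const first = window[0];
  // Flag only if every call in the window is identical to the first.
  return window.every(
    (c) => c.name === first.name && c.argsHash === first.argsHash,
  );
}
```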
2.3 Intervention Protocol
FR-003: Graduated Intervention Sequence
Level 1 - NUDGE (soft intervention):
├── Trigger: STUCK state detected
├── Action: Inject reminder into agent context
│ "REMINDER: You have been working for {duration} without
│ checkpoint update. Please either:
│ - Update your progress
│ - Request handoff if stuck
│ - Report blockers"
├── Wait: 10 minutes
└── Escalate if no response
Level 2 - ESCALATE (orchestrator alert):
├── Trigger: Nudge unsuccessful (3 attempts)
├── Action:
│ ├── Alert orchestrator
│ ├── Log escalation event
│ └── Prepare recovery options
├── Orchestrator decision:
│ ├── Allow more time
│ ├── Force handoff
│ └── Terminate and recover
└── Wait: Orchestrator response or timeout (15 min)
Level 3 - TERMINATE (forced stop):
├── Trigger: Escalation timeout or orchestrator decision
├── Action:
│ ├── Force agent termination
│ ├── Save partial state to checkpoint
│ ├── Clean up resources
│ └── Initiate recovery from last valid checkpoint
└── Post-action: Recovery or human escalation
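The escalation ladder above reduces to a small decision function. The thresholds mirror FR-003 (three nudges before escalation, terminate on escalation timeout) but the function shape is an assumption, not the final InterventionService interface.

```typescript
// Sketch of the FR-003 graduated intervention policy; illustrative only.
type InterventionLevel = 'NUDGE' | 'ESCALATE' | 'TERMINATE';

const MAX_NUDGES = 3; // Level 1 attempts before escalating

function nextIntervention(
  nudgesSent: number,
  escalationTimedOut: boolean,
): InterventionLevel {
  if (escalationTimedOut) return 'TERMINATE'; // Level 3 trigger
  if (nudgesSent < MAX_NUDGES) return 'NUDGE'; // Level 1
  return 'ESCALATE'; // Level 2: nudges exhausted
}
```

Keeping the policy pure (no I/O) makes the unit tests in Step 4.2 straightforward.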
2.4 Circuit Breaker Pattern
FR-004: Circuit Breaker Implementation
States:
├── CLOSED (normal operation)
│ ├── Requests pass through
│ ├── Failures counted
│ └── Trips to OPEN on threshold
├── OPEN (blocking)
│ ├── Requests immediately fail
│ ├── Timer starts
│ └── Transitions to HALF_OPEN on timeout
└── HALF_OPEN (testing)
├── Limited requests allowed
├── Success → CLOSED
└── Failure → OPEN
Configuration:
  failure_threshold: 3          # Consecutive failures to trip
  recovery_timeout_seconds: 60  # Time in OPEN before HALF_OPEN
  half_open_requests: 1         # Requests allowed in HALF_OPEN
Per-Agent Circuit Breakers:
├── Tool execution circuit breaker
├── Checkpoint write circuit breaker
├── External API circuit breaker
└── Recovery attempt circuit breaker
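A minimal breaker matching the FR-004 state diagram is sketched below; the class and method names are illustrative, not the final interface. The OPEN-to-HALF_OPEN transition is advanced lazily inside `canExecute` rather than by a background timer, which keeps the sketch single-threaded.

```typescript
// Illustrative circuit breaker per FR-004; defaults match Section 7.
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class CircuitBreaker {
  state: BreakerState = 'CLOSED';
  private failures = 0;
  private openedAtMs = 0;

  constructor(
    private failureThreshold = 3,
    private recoveryTimeoutMs = 60_000,
  ) {}

  // Check before executing the guarded operation; fail fast while OPEN.
  canExecute(nowMs: number): boolean {
    if (this.state === 'OPEN' && nowMs - this.openedAtMs >= this.recoveryTimeoutMs) {
      this.state = 'HALF_OPEN'; // recovery timeout elapsed: allow a probe
    }
    return this.state !== 'OPEN';
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  recordFailure(nowMs: number): void {
    this.failures += 1;
    // A HALF_OPEN probe failure re-opens immediately; otherwise trip on threshold.
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAtMs = nowMs;
    }
  }
}
```

The per-agent breakers listed above would each be separate instances of such a class, configured via the Section 7 overrides.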
2.5 Self-Healing Protocol
FR-005: Automatic Recovery
Recovery Decision Tree:
├── Is last checkpoint valid?
│ ├── Yes → Spawn new agent from checkpoint
│ └── No → Try previous checkpoint
├── Is previous checkpoint valid?
│ ├── Yes → Spawn from previous (some work lost)
│ └── No → Continue searching chain
├── No valid checkpoint in chain?
│ ├── Reset to task start
│ └── Alert for human intervention
└── Recovery spawned?
├── Monitor closely for 15 min
└── If fails again → Human escalation
Recovery Limits:
├── Max recovery attempts per task: 3
├── Max recovery attempts per hour: 5
├── Backoff between attempts: exponential (1, 2, 4 min)
└── After limits: Mandatory human review
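The per-task limit and backoff schedule can be expressed as a small gating function; the schedule mirrors the `[60, 120, 240]` configuration in Section 7, and the 1-based `attempt` convention is an assumption.

```typescript
// Recovery gating sketch per FR-005 limits; illustrative only.
const MAX_ATTEMPTS_PER_TASK = 3;
const BACKOFF_SECONDS = [60, 120, 240]; // exponential: 1, 2, 4 min

// Returns the delay before the given (1-based) attempt, or null when the
// per-task limit is breached and mandatory human review applies.
function recoveryDelaySeconds(attempt: number): number | null {
  if (attempt < 1 || attempt > MAX_ATTEMPTS_PER_TASK) return null;
  return BACKOFF_SECONDS[attempt - 1];
}
```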
3. Non-Functional Requirements
3.1 Performance
| Requirement | Specification |
|---|---|
| NFR-001 | Health check latency < 100ms |
| NFR-002 | Heartbeat processing < 50ms |
| NFR-003 | Support 100+ concurrent agent monitors |
| NFR-004 | State transition < 1s |
| NFR-005 | Intervention injection < 500ms |
3.2 Reliability
| Requirement | Specification |
|---|---|
| NFR-006 | Monitor service 99.99% availability |
| NFR-007 | No false positives from monitor failures |
| NFR-008 | Graceful degradation if monitoring unavailable |
| NFR-009 | Persistent health state across monitor restarts |
3.3 Observability
| Requirement | Specification |
|---|---|
| NFR-010 | Real-time health dashboard |
| NFR-011 | Alert integration (PagerDuty, Slack) |
| NFR-012 | Historical health metrics (30 day retention) |
| NFR-013 | Intervention audit trail |
| NFR-014 | Distributed tracing for health checks |
3.4 Compliance
| Requirement | Specification |
|---|---|
| NFR-015 | Audit log of all interventions |
| NFR-016 | Retention per compliance policy |
| NFR-017 | Non-repudiation for termination events |
4. Implementation Steps
Phase 1: Health Status Infrastructure (Week 1-2)
Step 1.1: Health State Machine
├── Define HealthState enum
├── Implement state transition logic
├── Add transition validation rules
├── Create state persistence (FoundationDB)
└── Implement state change events
Step 1.2: Heartbeat System
├── Define heartbeat message schema
├── Implement agent-side heartbeat emission
│ ├── Periodic timer (5 min)
│ ├── Heartbeat content collection
│ └── Emission via event bus
├── Implement monitor-side heartbeat reception
│ ├── Heartbeat listener
│ ├── Timestamp recording
│ └── Missing heartbeat detection
└── Add heartbeat to agent base class
Step 1.3: Health Check Service
├── Create HealthCheckService interface
├── Implement checkpoint-based health check
│ ├── Query latest checkpoint
│ ├── Compare timestamps
│ └── Assess staleness
├── Implement tool-call-based health check
│ ├── Query recent tool calls
│ ├── Analyze patterns
│ └── Detect repetition
└── Implement composite health assessment
├── Aggregate all signals
├── Apply thresholds
└── Determine health state
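The composite assessment in Step 1.3 might reduce to something like the function below, combining checkpoint staleness and error rate into a single state. The thresholds mirror FR-001; the two-signal shape is a deliberate simplification of the full aggregation.

```typescript
// Hypothetical composite health assessment; thresholds from FR-001.
function assessHealth(
  checkpointAgeMin: number,
  errorRate: number, // fraction of recent operations that errored, 0..1
): 'HEALTHY' | 'DEGRADED' | 'STUCK' {
  if (checkpointAgeMin > 30) return 'STUCK';    // no update > 30 min
  if (checkpointAgeMin > 10 || errorRate > 0.25) return 'DEGRADED';
  return 'HEALTHY';
}
```

Ordering the checks from most to least severe ensures the worst applicable state wins, which is the safe default for monitoring.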
Phase 2: Monitoring Service (Week 3-4)
Step 2.1: Agent Monitor
├── Create AgentMonitor class
│ ├── Per-agent monitoring instance
│ ├── Health state tracking
│ ├── Heartbeat tracking
│ └── Intervention tracking
├── Implement monitoring loop
│ ├── Periodic health check (1 min)
│ ├── State transition evaluation
│ ├── Intervention trigger check
│ └── Metric emission
└── Add monitor lifecycle management
├── Start on agent spawn
├── Stop on agent termination
└── Cleanup resources
Step 2.2: Monitor Coordinator
├── Create MonitorCoordinator service
│ ├── Manages all AgentMonitors
│ ├── Provides aggregate health view
│ └── Coordinates interventions
├── Implement monitor discovery
│ ├── Subscribe to agent spawn events
│ ├── Create monitors for new agents
│ └── Remove monitors for terminated agents
└── Add health dashboard data provider
Step 2.3: Intervention Service
├── Create InterventionService
├── Implement nudge injection
│ ├── Generate nudge message
│ ├── Inject into agent context
│ │ (via orchestrator or directly)
│ └── Record nudge attempt
├── Implement escalation flow
│ ├── Notify orchestrator
│ ├── Await decision
│ └── Execute decision
└── Implement termination flow
├── Force stop agent
├── Save final state
└── Trigger cleanup
Phase 3: Circuit Breaker (Week 5-6)
Step 3.1: Circuit Breaker Core
├── Create CircuitBreaker class
│ ├── State management (CLOSED/OPEN/HALF_OPEN)
│ ├── Failure counting
│ ├── Timer management
│ └── State transition logic
├── Implement call wrapping
│ ├── Check state before call
│ ├── Execute or fail fast
│ ├── Record result
│ └── Update state
└── Add configuration support
├── Thresholds
├── Timeouts
└── Per-operation overrides
Step 3.2: Circuit Breaker Integration
├── Integrate with tool execution
│ ├── Wrap tool calls
│ ├── Track tool-specific breakers
│ └── Report breaker status
├── Integrate with checkpoint writes
│ ├── Prevent writes during OPEN
│ ├── Queue for retry
│ └── Alert on persistent failure
└── Integrate with external APIs
├── Per-endpoint breakers
├── Shared breaker for related endpoints
└── Graceful degradation
Step 3.3: Circuit Breaker Dashboard
├── Expose breaker states via API
├── Add to health dashboard
├── Implement manual reset capability
└── Add alerting on OPEN state
Phase 4: Self-Healing and Testing (Week 7-8)
Step 4.1: Recovery Service
├── Create RecoveryService
├── Implement checkpoint chain traversal
│ ├── Find last valid checkpoint
│ ├── Validate checkpoint integrity
│ └── Handle broken chains
├── Implement agent respawn
│ ├── Create new agent instance
│ ├── Inject checkpoint state
│ ├── Start with recovery context
│ └── Link to original task
└── Implement recovery limits
├── Track attempts per task
├── Enforce backoff
└── Escalate on limit breach
Step 4.2: Testing
├── Unit tests:
│ ├── State machine transitions
│ ├── Health check calculations
│ ├── Circuit breaker logic
│ └── Recovery decisions
├── Integration tests:
│ ├── Heartbeat flow
│ ├── Intervention sequence
│ ├── Circuit breaker triggering
│ └── Recovery spawning
└── Chaos tests:
├── Simulate stuck agents
├── Simulate failures
├── Test recovery under load
└── Verify no false positives
Step 4.3: Documentation and Runbooks
├── Architecture documentation
├── Configuration guide
├── Runbook: Manual intervention
├── Runbook: Circuit breaker reset
└── Runbook: Recovery troubleshooting
5. Event Definitions
// Health monitoring events
type HealthEvents =
  | { type: 'AGENT_HEARTBEAT'; payload: HeartbeatPayload }
  | { type: 'HEALTH_STATE_CHANGED'; payload: { agentId: string; from: HealthState; to: HealthState; reason: string } }
  | { type: 'NUDGE_SENT'; payload: { agentId: string; attempt: number; message: string } }
  | { type: 'ESCALATION_TRIGGERED'; payload: { agentId: string; reason: string } }
  | { type: 'AGENT_TERMINATED'; payload: { agentId: string; reason: string; checkpoint?: string } }
  | { type: 'RECOVERY_INITIATED'; payload: { taskId: string; fromCheckpoint: string; attempt: number } }
  | { type: 'RECOVERY_COMPLETED'; payload: { taskId: string; newAgentId: string } }
  | { type: 'RECOVERY_FAILED'; payload: { taskId: string; reason: string; attempts: number } }
  | { type: 'CIRCUIT_BREAKER_STATE_CHANGED'; payload: { name: string; from: BreakerState; to: BreakerState } };

interface HeartbeatPayload {
  agentId: string;
  taskId: string;
  timestamp: string;
  phase: string;
  lastToolCall: string;
  tokenCount: number;
  errorCount: number;
  progressIndicator: string;
}
6. API Specification
interface HealthMonitoringService {
  // Health queries
  getAgentHealth(agentId: AgentId): Promise<AgentHealth>;
  getAllAgentHealth(): Promise<AgentHealth[]>;
  getHealthHistory(agentId: AgentId, duration: Duration): Promise<HealthSnapshot[]>;

  // Heartbeat
  recordHeartbeat(heartbeat: Heartbeat): Promise<void>;
  getLastHeartbeat(agentId: AgentId): Promise<Heartbeat | null>;

  // Interventions
  sendNudge(agentId: AgentId, message?: string): Promise<NudgeResult>;
  escalateToOrchestrator(agentId: AgentId, reason: string): Promise<EscalationResult>;
  terminateAgent(agentId: AgentId, reason: string): Promise<TerminationResult>;

  // Circuit breakers
  getCircuitBreakerStatus(name: string): Promise<CircuitBreakerStatus>;
  getAllCircuitBreakers(): Promise<CircuitBreakerStatus[]>;
  resetCircuitBreaker(name: string): Promise<void>;

  // Recovery
  initiateRecovery(taskId: TaskId): Promise<RecoveryResult>;
  getRecoveryStatus(taskId: TaskId): Promise<RecoveryStatus>;
}

interface AgentHealth {
  agentId: string;
  taskId: string;
  state: HealthState;
  stateChangedAt: string;
  lastHeartbeat: string;
  lastCheckpoint: string;
  metrics: {
    uptimeSeconds: number;
    tokenCount: number;
    errorCount: number;
    toolCallCount: number;
    interventionCount: number;
  };
  circuitBreakers: CircuitBreakerStatus[];
}
7. Configuration Schema
# coditect-config.yaml
health_monitoring:
  enabled: true

  heartbeat:
    interval_seconds: 300  # 5 minutes
    timeout_seconds: 600   # 10 minutes (2x interval)

  health_check:
    interval_seconds: 60   # 1 minute
    checkpoint_stale_threshold_seconds: 1800  # 30 minutes

  intervention:
    nudge:
      enabled: true
      max_attempts: 3
      interval_seconds: 600  # 10 minutes between nudges
    escalation:
      timeout_seconds: 900   # 15 minutes
    termination:
      cleanup_timeout_seconds: 30

  circuit_breaker:
    default:
      failure_threshold: 3
      recovery_timeout_seconds: 60
      half_open_requests: 1
    overrides:
      tool_execution:
        failure_threshold: 5
        recovery_timeout_seconds: 30
      checkpoint_write:
        failure_threshold: 3
        recovery_timeout_seconds: 120

  recovery:
    max_attempts_per_task: 3
    max_attempts_per_hour: 5
    backoff_seconds: [60, 120, 240]  # Exponential backoff

  alerting:
    stuck_agent:
      channels: ["slack", "pagerduty"]
      severity: "warning"
    circuit_breaker_open:
      channels: ["slack"]
      severity: "warning"
    recovery_failed:
      channels: ["slack", "pagerduty"]
      severity: "critical"
8. Dashboard Requirements
Health Dashboard Components:
1. Agent Grid View
├── All active agents
├── Health state indicator (color-coded)
├── Time in current state
├── Last heartbeat
└── Quick actions (nudge, terminate)
2. Agent Detail View
├── Full health metrics
├── Heartbeat history (chart)
├── State transition timeline
├── Intervention history
└── Circuit breaker status
3. Circuit Breaker Panel
├── All breakers with state
├── Failure counts
├── Time until HALF_OPEN
└── Manual reset buttons
4. Recovery Status
├── Active recovery attempts
├── Recent recovery history
├── Success/failure rates
└── Manual recovery trigger
5. Alerts Feed
├── Recent health events
├── Intervention notifications
├── Circuit breaker trips
└── Recovery outcomes
9. Dependencies
| Dependency | Type | Status |
|---|---|---|
| Event Bus | Platform | ✅ Available |
| FoundationDB | Infrastructure | ✅ Available |
| Orchestrator | Platform | ⚠️ Requires integration |
| Checkpoint Service | Platform | 🔄 IMPL-REQ-001 |
| Alerting (Slack/PagerDuty) | External | ✅ Available |
| Dashboard Framework | Frontend | ⚠️ TBD |
10. Risks and Mitigations
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| False positive stuck detection | Unnecessary termination | Medium | Multiple signals, tunable thresholds |
| Monitor service failure | Undetected stuck agents | Low | Monitor redundancy, self-monitoring |
| Intervention storm | Resource exhaustion | Low | Rate limiting, backoff |
| Recovery loop | Infinite recovery attempts | Medium | Attempt limits, human escalation |
| Clock skew | Incorrect timestamp comparisons | Low | NTP sync, tolerance windows |
11. Acceptance Criteria
- Heartbeat system operational for all agent types
- Health state machine correctly tracks transitions
- Stuck agents detected within 5 minutes of threshold
- Nudge injection successfully reaches agents
- Escalation notifies orchestrator with full context
- Termination cleanly stops agents and preserves state
- Circuit breakers prevent cascading failures
- Recovery successfully respawns from checkpoints
- Dashboard displays real-time health status
- Alerts fire correctly for configured conditions
- False positive rate < 1% over 7-day test period
- Documentation and runbooks complete
Document Version: 1.0 | Last Updated: January 24, 2026