Implementation Requirements: Agent Health Monitoring Layer
Document ID: IMPL-REQ-003
Priority: P1 (High)
Target ADR: ADR-110 (Proposed)
Estimated Effort: 4 Sprints (8 weeks, per the phase plan in Section 4)
Dependencies: Orchestrator, Checkpoint Service, Event Bus
1. Overview
1.1 Problem Statement
Autonomous agent loops can become stuck or unresponsive, or they can enter infinite retry cycles. Without active health monitoring, these failure modes:
- Consume resources indefinitely
- Block dependent tasks
- Generate costs without progress
- Require manual detection and intervention
Gas Town's GUPP principle ("If there is work on your Hook, YOU MUST RUN IT") and Witness/Deacon pattern provide a battle-tested model for stuck detection and intervention.
1.2 Objective
Implement a health monitoring layer that:
- Detects stuck or unresponsive agents
- Provides graduated intervention (nudge → escalate → terminate)
- Enforces circuit breaker patterns
- Enables self-healing through automatic recovery
- Maintains observability into agent health
1.3 Success Criteria
| Metric | Target |
|---|---|
| Stuck detection latency | < 5 minutes |
| False positive rate | < 1% |
| Recovery success rate | > 95% |
| Mean time to intervention | < 10 minutes |
2. Functional Requirements
2.1 Health Status Model
FR-001: Agent Health States
states:
  HEALTHY:
    description: "Agent is making progress"
    indicators:
      - Recent checkpoint update (< 10 min)
      - Tool calls being made
      - No error accumulation

  DEGRADED:
    description: "Agent showing warning signs"
    indicators:
      - Checkpoint stale (10-30 min)
      - High error rate (> 25%)
      - Token consumption anomaly
    actions:
      - Increase monitoring frequency
      - Prepare intervention

  STUCK:
    description: "Agent not making progress"
    indicators:
      - No checkpoint update (> 30 min)
      - Repeated identical operations
      - Context exhaustion without handoff
    actions:
      - Initiate nudge sequence
      - Alert orchestrator

  FAILING:
    description: "Agent in error loop"
    indicators:
      - Circuit breaker tripped
      - Consecutive failures (>= 3)
      - Unrecoverable errors detected
    actions:
      - Terminate agent
      - Initiate recovery from checkpoint

  TERMINATED:
    description: "Agent has been stopped"
    indicators:
      - Explicit termination
      - Resource limit exceeded
      - Unrecoverable failure
    actions:
      - Cleanup resources
      - Archive final state
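The state model above can be sketched as a TypeScript literal type plus a transition table. This is an illustrative sketch, not the final API: the names and the exact set of allowed transitions (e.g. whether HEALTHY must pass through DEGRADED before STUCK) are assumptions derived from the indicator thresholds.

```typescript
// Hypothetical sketch of the FR-001 state machine; names are illustrative.
type HealthState = 'HEALTHY' | 'DEGRADED' | 'STUCK' | 'FAILING' | 'TERMINATED';

// Allowed transitions, one plausible reading of the state descriptions:
// any active state may fail or be terminated; TERMINATED is terminal.
const TRANSITIONS: Record<HealthState, HealthState[]> = {
  HEALTHY:    ['DEGRADED', 'FAILING', 'TERMINATED'],
  DEGRADED:   ['HEALTHY', 'STUCK', 'FAILING', 'TERMINATED'],
  STUCK:      ['HEALTHY', 'DEGRADED', 'FAILING', 'TERMINATED'],
  FAILING:    ['TERMINATED'],
  TERMINATED: [], // terminal state: no outgoing transitions
};

function isValidTransition(from: HealthState, to: HealthState): boolean {
  return TRANSITIONS[from].includes(to);
}
```

Encoding the table as data (Step 1.1's "transition validation rules") keeps the validation logic trivial and makes the policy auditable.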
2.2 Health Check Protocol
FR-002: Health Check Operations
Heartbeat Protocol:
├── Agent emits heartbeat every 5 minutes
├── Heartbeat includes:
│ ├── agent_id
│ ├── task_id
│ ├── current_phase
│ ├── last_tool_call_timestamp
│ ├── token_count
│ ├── error_count
│ └── progress_indicator
├── Monitor records heartbeat receipt
└── Missing heartbeat triggers DEGRADED state
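The monitor-side staleness check can be sketched as below; the `Heartbeat` shape is trimmed to the fields the check needs, and the 600 s timeout mirrors the Section 7 configuration (2x the emission interval).

```typescript
// Illustrative missing-heartbeat detection; field names are assumptions.
interface Heartbeat {
  agentId: string;
  timestampMs: number; // monitor-side receipt time
}

const HEARTBEAT_TIMEOUT_MS = 600_000; // 10 min = 2x the 5 min interval

function isHeartbeatMissing(last: Heartbeat | null, nowMs: number): boolean {
  // No heartbeat ever received, or the last one is older than the timeout.
  return last === null || nowMs - last.timestampMs > HEARTBEAT_TIMEOUT_MS;
}
```

Using the monitor's receipt timestamp rather than the agent's own clock sidesteps the clock-skew risk noted in Section 10.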
Progress Detection:
├── Compare checkpoint timestamps
├── Analyze tool call patterns
│ ├── Repeated identical calls = stuck
│ ├── No calls for 15+ min = stuck
│ └── Diverse calls = healthy
├── Token consumption rate analysis
│ ├── Sudden spike = potential loop
│ ├── Zero consumption = stuck
│ └── Steady rate = healthy
└── Output analysis (file changes, test results)
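The "repeated identical calls = stuck" signal can be sketched as a window check over recent tool calls. The window size of 5 and the `argsHash` field are assumptions for illustration; the real detector would likely combine this with the token-rate and output signals above.

```typescript
// Hypothetical repetition detector for FR-002 progress analysis.
interface ToolCall {
  name: string;
  argsHash: string; // hash of the call arguments, assumed precomputed
}

function looksStuck(recent: ToolCall[], windowSize = 5): boolean {
  // Too few calls to judge: do not flag.
  if (recent.length < windowSize) return false;
  const window = recent.slice(-windowSize);
  const first = window[0];
  // Flag only if every call in the window is identical to the first.
  return window.every(
    (c) => c.name === first.name && c.argsHash === first.argsHash,
  );
}
```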
2.3 Intervention Protocol
FR-003: Graduated Intervention Sequence
Level 1 - NUDGE (soft intervention):
├── Trigger: STUCK state detected
├── Action: Inject reminder into agent context
│ "REMINDER: You have been working for {duration} without
│ checkpoint update. Please either:
│ - Update your progress
│ - Request handoff if stuck
│ - Report blockers"
├── Wait: 10 minutes
└── Escalate if no response
Level 2 - ESCALATE (orchestrator alert):
├── Trigger: Nudge unsuccessful (3 attempts)
├── Action:
│ ├── Alert orchestrator
│ ├── Log escalation event
│ └── Prepare recovery options
├── Orchestrator decision:
│ ├── Allow more time
│ ├── Force handoff
│ └── Terminate and recover
└── Wait: Orchestrator response or timeout (15 min)
Level 3 - TERMINATE (forced stop):
├── Trigger: Escalation timeout or orchestrator decision
├── Action:
│ ├── Force agent termination
│ ├── Save partial state to checkpoint
│ ├── Clean up resources
│ └── Initiate recovery from last valid checkpoint
└── Post-action: Recovery or human escalation
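The escalation ladder above reduces to a small decision function. The thresholds mirror FR-003 (three nudges before escalation, terminate on escalation timeout) but the function shape is an assumption, not the final InterventionService interface.

```typescript
// Sketch of the FR-003 graduated intervention policy; illustrative only.
type InterventionLevel = 'NUDGE' | 'ESCALATE' | 'TERMINATE';

const MAX_NUDGES = 3; // Level 1 attempts before escalating

function nextIntervention(
  nudgesSent: number,
  escalationTimedOut: boolean,
): InterventionLevel {
  if (escalationTimedOut) return 'TERMINATE'; // Level 3 trigger
  if (nudgesSent < MAX_NUDGES) return 'NUDGE'; // Level 1
  return 'ESCALATE'; // Level 2: nudges exhausted
}
```

Keeping the policy pure (no I/O) makes the unit tests in Step 4.2 straightforward.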
2.4 Circuit Breaker Pattern
FR-004: Circuit Breaker Implementation
States:
├── CLOSED (normal operation)
│ ├── Requests pass through
│ ├── Failures counted
│ └── Trips to OPEN on threshold
├── OPEN (blocking)
│ ├── Requests immediately fail
│ ├── Timer starts
│ └── Transitions to HALF_OPEN on timeout
└── HALF_OPEN (testing)
├── Limited requests allowed
├── Success → CLOSED
└── Failure → OPEN
Configuration:
  failure_threshold: 3          # Consecutive failures to trip
  recovery_timeout_seconds: 60  # Time in OPEN before HALF_OPEN
  half_open_requests: 1         # Requests allowed in HALF_OPEN
Per-Agent Circuit Breakers:
├── Tool execution circuit breaker
├── Checkpoint write circuit breaker
├── External API circuit breaker
└── Recovery attempt circuit breaker
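A minimal breaker matching the FR-004 state diagram is sketched below; the class and method names are illustrative, not the final interface. The OPEN-to-HALF_OPEN transition is advanced lazily inside `canExecute` rather than by a background timer, which keeps the sketch single-threaded.

```typescript
// Illustrative circuit breaker per FR-004; defaults match Section 7.
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class CircuitBreaker {
  state: BreakerState = 'CLOSED';
  private failures = 0;
  private openedAtMs = 0;

  constructor(
    private failureThreshold = 3,
    private recoveryTimeoutMs = 60_000,
  ) {}

  // Check before executing the guarded operation; fail fast while OPEN.
  canExecute(nowMs: number): boolean {
    if (this.state === 'OPEN' && nowMs - this.openedAtMs >= this.recoveryTimeoutMs) {
      this.state = 'HALF_OPEN'; // recovery timeout elapsed: allow a probe
    }
    return this.state !== 'OPEN';
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  recordFailure(nowMs: number): void {
    this.failures += 1;
    // A HALF_OPEN probe failure re-opens immediately; otherwise trip on threshold.
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAtMs = nowMs;
    }
  }
}
```

The per-agent breakers listed above would each be separate instances of such a class, configured via the Section 7 overrides.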
2.5 Self-Healing Protocol
FR-005: Automatic Recovery
Recovery Decision Tree:
├── Is last checkpoint valid?
│ ├── Yes → Spawn new agent from checkpoint
│ └── No → Try previous checkpoint
├── Is previous checkpoint valid?
│ ├── Yes → Spawn from previous (some work lost)
│ └── No → Continue searching chain
├── No valid checkpoint in chain?
│ ├── Reset to task start
│ └── Alert for human intervention
└── Recovery spawned?
├── Monitor closely for 15 min
└── If fails again → Human escalation
Recovery Limits:
├── Max recovery attempts per task: 3
├── Max recovery attempts per hour: 5
├── Backoff between attempts: exponential (1, 2, 4 min)
└── After limits: Mandatory human review
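The per-task limit and backoff schedule can be expressed as a small gating function; the schedule mirrors the `[60, 120, 240]` configuration in Section 7, and the 1-based `attempt` convention is an assumption.

```typescript
// Recovery gating sketch per FR-005 limits; illustrative only.
const MAX_ATTEMPTS_PER_TASK = 3;
const BACKOFF_SECONDS = [60, 120, 240]; // exponential: 1, 2, 4 min

// Returns the delay before the given (1-based) attempt, or null when the
// per-task limit is breached and mandatory human review applies.
function recoveryDelaySeconds(attempt: number): number | null {
  if (attempt < 1 || attempt > MAX_ATTEMPTS_PER_TASK) return null;
  return BACKOFF_SECONDS[attempt - 1];
}
```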
3. Non-Functional Requirements
3.1 Performance
| Requirement | Specification |
|---|---|
| NFR-001 | Health check latency < 100ms |
| NFR-002 | Heartbeat processing < 50ms |
| NFR-003 | Support 100+ concurrent agent monitors |
| NFR-004 | State transition < 1s |
| NFR-005 | Intervention injection < 500ms |
3.2 Reliability
| Requirement | Specification |
|---|---|
| NFR-006 | Monitor service 99.99% availability |
| NFR-007 | No false positives from monitor failures |
| NFR-008 | Graceful degradation if monitoring unavailable |
| NFR-009 | Persistent health state across monitor restarts |
3.3 Observability
| Requirement | Specification |
|---|---|
| NFR-010 | Real-time health dashboard |
| NFR-011 | Alert integration (PagerDuty, Slack) |
| NFR-012 | Historical health metrics (30 day retention) |
| NFR-013 | Intervention audit trail |
| NFR-014 | Distributed tracing for health checks |
3.4 Compliance
| Requirement | Specification |
|---|---|
| NFR-015 | Audit log of all interventions |
| NFR-016 | Retention per compliance policy |
| NFR-017 | Non-repudiation for termination events |
4. Implementation Steps
Phase 1: Health Status Infrastructure (Week 1-2)
Step 1.1: Health State Machine
├── Define HealthState enum
├── Implement state transition logic
├── Add transition validation rules
├── Create state persistence (FoundationDB)
└── Implement state change events
Step 1.2: Heartbeat System
├── Define heartbeat message schema
├── Implement agent-side heartbeat emission
│ ├── Periodic timer (5 min)
│ ├── Heartbeat content collection
│ └── Emission via event bus
├── Implement monitor-side heartbeat reception
│ ├── Heartbeat listener
│ ├── Timestamp recording
│ └── Missing heartbeat detection
└── Add heartbeat to agent base class
Step 1.3: Health Check Service
├── Create HealthCheckService interface
├── Implement checkpoint-based health check
│ ├── Query latest checkpoint
│ ├── Compare timestamps
│ └── Assess staleness
├── Implement tool-call-based health check
│ ├── Query recent tool calls
│ ├── Analyze patterns
│ └── Detect repetition
└── Implement composite health assessment
├── Aggregate all signals
├── Apply thresholds
└── Determine health state
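The composite assessment in Step 1.3 might reduce to something like the function below, combining checkpoint staleness and error rate into a single state. The thresholds mirror FR-001; the two-signal shape is a deliberate simplification of the full aggregation.

```typescript
// Hypothetical composite health assessment; thresholds from FR-001.
function assessHealth(
  checkpointAgeMin: number,
  errorRate: number, // fraction of recent operations that errored, 0..1
): 'HEALTHY' | 'DEGRADED' | 'STUCK' {
  if (checkpointAgeMin > 30) return 'STUCK';    // no update > 30 min
  if (checkpointAgeMin > 10 || errorRate > 0.25) return 'DEGRADED';
  return 'HEALTHY';
}
```

Ordering the checks from most to least severe ensures the worst applicable state wins, which is the safe default for monitoring.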
Phase 2: Monitoring Service (Week 3-4)
Step 2.1: Agent Monitor
├── Create AgentMonitor class
│ ├── Per-agent monitoring instance
│ ├── Health state tracking
│ ├── Heartbeat tracking
│ └── Intervention tracking
├── Implement monitoring loop
│ ├── Periodic health check (1 min)
│ ├── State transition evaluation
│ ├── Intervention trigger check
│ └── Metric emission
└── Add monitor lifecycle management
├── Start on agent spawn
├── Stop on agent termination
└── Cleanup resources
Step 2.2: Monitor Coordinator
├── Create MonitorCoordinator service
│ ├── Manages all AgentMonitors
│ ├── Provides aggregate health view
│ └── Coordinates interventions
├── Implement monitor discovery
│ ├── Subscribe to agent spawn events
│ ├── Create monitors for new agents
│ └── Remove monitors for terminated agents
└── Add health dashboard data provider
Step 2.3: Intervention Service
├── Create InterventionService
├── Implement nudge injection
│ ├── Generate nudge message
│ ├── Inject into agent context
│ │ (via orchestrator or directly)
│ └── Record nudge attempt
├── Implement escalation flow
│ ├── Notify orchestrator
│ ├── Await decision
│ └── Execute decision
└── Implement termination flow
├── Force stop agent
├── Save final state
└── Trigger cleanup
Phase 3: Circuit Breaker (Week 5-6)
Step 3.1: Circuit Breaker Core
├── Create CircuitBreaker class
│ ├── State management (CLOSED/OPEN/HALF_OPEN)
│ ├── Failure counting
│ ├── Timer management
│ └── State transition logic
├── Implement call wrapping
│ ├── Check state before call
│ ├── Execute or fail fast
│ ├── Record result
│ └── Update state
└── Add configuration support
├── Thresholds
├── Timeouts
└── Per-operation overrides
Step 3.2: Circuit Breaker Integration
├── Integrate with tool execution
│ ├── Wrap tool calls
│ ├── Track tool-specific breakers
│ └── Report breaker status
├── Integrate with checkpoint writes
│ ├── Prevent writes during OPEN
│ ├── Queue for retry
│ └── Alert on persistent failure
└── Integrate with external APIs
├── Per-endpoint breakers
├── Shared breaker for related endpoints
└── Graceful degradation
Step 3.3: Circuit Breaker Dashboard
├── Expose breaker states via API
├── Add to health dashboard
├── Implement manual reset capability
└── Add alerting on OPEN state
Phase 4: Self-Healing and Testing (Week 7-8)
Step 4.1: Recovery Service
├── Create RecoveryService
├── Implement checkpoint chain traversal
│ ├── Find last valid checkpoint
│ ├── Validate checkpoint integrity
│ └── Handle broken chains
├── Implement agent respawn
│ ├── Create new agent instance
│ ├── Inject checkpoint state
│ ├── Start with recovery context
│ └── Link to original task
└── Implement recovery limits
├── Track attempts per task
├── Enforce backoff
└── Escalate on limit breach
Step 4.2: Testing
├── Unit tests:
│ ├── State machine transitions
│ ├── Health check calculations
│ ├── Circuit breaker logic
│ └── Recovery decisions
├── Integration tests:
│ ├── Heartbeat flow
│ ├── Intervention sequence
│ ├── Circuit breaker triggering
│ └── Recovery spawning
└── Chaos tests:
├── Simulate stuck agents
├── Simulate failures
├── Test recovery under load
└── Verify no false positives
Step 4.3: Documentation and Runbooks
├── Architecture documentation
├── Configuration guide
├── Runbook: Manual intervention
├── Runbook: Circuit breaker reset
└── Runbook: Recovery troubleshooting
5. Event Definitions
// Health monitoring events
type HealthEvents =
  | { type: 'AGENT_HEARTBEAT'; payload: HeartbeatPayload }
  | { type: 'HEALTH_STATE_CHANGED'; payload: { agentId: string; from: HealthState; to: HealthState; reason: string } }
  | { type: 'NUDGE_SENT'; payload: { agentId: string; attempt: number; message: string } }
  | { type: 'ESCALATION_TRIGGERED'; payload: { agentId: string; reason: string } }
  | { type: 'AGENT_TERMINATED'; payload: { agentId: string; reason: string; checkpoint?: string } }
  | { type: 'RECOVERY_INITIATED'; payload: { taskId: string; fromCheckpoint: string; attempt: number } }
  | { type: 'RECOVERY_COMPLETED'; payload: { taskId: string; newAgentId: string } }
  | { type: 'RECOVERY_FAILED'; payload: { taskId: string; reason: string; attempts: number } }
  | { type: 'CIRCUIT_BREAKER_STATE_CHANGED'; payload: { name: string; from: BreakerState; to: BreakerState } };

interface HeartbeatPayload {
  agentId: string;
  taskId: string;
  timestamp: string;
  phase: string;
  lastToolCall: string;
  tokenCount: number;
  errorCount: number;
  progressIndicator: string;
}
6. API Specification
interface HealthMonitoringService {
  // Health queries
  getAgentHealth(agentId: AgentId): Promise<AgentHealth>;
  getAllAgentHealth(): Promise<AgentHealth[]>;
  getHealthHistory(agentId: AgentId, duration: Duration): Promise<HealthSnapshot[]>;

  // Heartbeat
  recordHeartbeat(heartbeat: Heartbeat): Promise<void>;
  getLastHeartbeat(agentId: AgentId): Promise<Heartbeat | null>;

  // Interventions
  sendNudge(agentId: AgentId, message?: string): Promise<NudgeResult>;
  escalateToOrchestrator(agentId: AgentId, reason: string): Promise<EscalationResult>;
  terminateAgent(agentId: AgentId, reason: string): Promise<TerminationResult>;

  // Circuit breakers
  getCircuitBreakerStatus(name: string): Promise<CircuitBreakerStatus>;
  getAllCircuitBreakers(): Promise<CircuitBreakerStatus[]>;
  resetCircuitBreaker(name: string): Promise<void>;

  // Recovery
  initiateRecovery(taskId: TaskId): Promise<RecoveryResult>;
  getRecoveryStatus(taskId: TaskId): Promise<RecoveryStatus>;
}

interface AgentHealth {
  agentId: string;
  taskId: string;
  state: HealthState;
  stateChangedAt: string;
  lastHeartbeat: string;
  lastCheckpoint: string;
  metrics: {
    uptimeSeconds: number;
    tokenCount: number;
    errorCount: number;
    toolCallCount: number;
    interventionCount: number;
  };
  circuitBreakers: CircuitBreakerStatus[];
}
7. Configuration Schema
# coditect-config.yaml
health_monitoring:
  enabled: true

  heartbeat:
    interval_seconds: 300  # 5 minutes
    timeout_seconds: 600   # 10 minutes (2x interval)

  health_check:
    interval_seconds: 60   # 1 minute
    checkpoint_stale_threshold_seconds: 1800  # 30 minutes

  intervention:
    nudge:
      enabled: true
      max_attempts: 3
      interval_seconds: 600  # 10 minutes between nudges
    escalation:
      timeout_seconds: 900   # 15 minutes
    termination:
      cleanup_timeout_seconds: 30

  circuit_breaker:
    default:
      failure_threshold: 3
      recovery_timeout_seconds: 60
      half_open_requests: 1
    overrides:
      tool_execution:
        failure_threshold: 5
        recovery_timeout_seconds: 30
      checkpoint_write:
        failure_threshold: 3
        recovery_timeout_seconds: 120

  recovery:
    max_attempts_per_task: 3
    max_attempts_per_hour: 5
    backoff_seconds: [60, 120, 240]  # Exponential backoff

  alerting:
    stuck_agent:
      channels: ["slack", "pagerduty"]
      severity: "warning"
    circuit_breaker_open:
      channels: ["slack"]
      severity: "warning"
    recovery_failed:
      channels: ["slack", "pagerduty"]
      severity: "critical"
8. Dashboard Requirements
Health Dashboard Components:
1. Agent Grid View
├── All active agents
├── Health state indicator (color-coded)
├── Time in current state
├── Last heartbeat
└── Quick actions (nudge, terminate)
2. Agent Detail View
├── Full health metrics
├── Heartbeat history (chart)
├── State transition timeline
├── Intervention history
└── Circuit breaker status
3. Circuit Breaker Panel
├── All breakers with state
├── Failure counts
├── Time until HALF_OPEN
└── Manual reset buttons
4. Recovery Status
├── Active recovery attempts
├── Recent recovery history
├── Success/failure rates
└── Manual recovery trigger
5. Alerts Feed
├── Recent health events
├── Intervention notifications
├── Circuit breaker trips
└── Recovery outcomes
9. Dependencies
| Dependency | Type | Status |
|---|---|---|
| Event Bus | Platform | ✅ Available |
| FoundationDB | Infrastructure | ✅ Available |
| Orchestrator | Platform | ⚠️ Requires integration |
| Checkpoint Service | Platform | 🔄 IMPL-REQ-001 |
| Alerting (Slack/PagerDuty) | External | ✅ Available |
| Dashboard Framework | Frontend | ⚠️ TBD |
10. Risks and Mitigations
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| False positive stuck detection | Unnecessary termination | Medium | Multiple signals, tunable thresholds |
| Monitor service failure | Undetected stuck agents | Low | Monitor redundancy, self-monitoring |
| Intervention storm | Resource exhaustion | Low | Rate limiting, backoff |
| Recovery loop | Infinite recovery attempts | Medium | Attempt limits, human escalation |
| Clock skew | Incorrect timestamp comparisons | Low | NTP sync, tolerance windows |
11. Acceptance Criteria
- Heartbeat system operational for all agent types
- Health state machine correctly tracks transitions
- Stuck agents detected within 5 minutes of threshold
- Nudge injection successfully reaches agents
- Escalation notifies orchestrator with full context
- Termination cleanly stops agents and preserves state
- Circuit breakers prevent cascading failures
- Recovery successfully respawns from checkpoints
- Dashboard displays real-time health status
- Alerts fire correctly for configured conditions
- False positive rate < 1% over 7-day test period
- Documentation and runbooks complete
Document Version: 1.0 | Last Updated: January 24, 2026