Implementation Requirements: Agent Health Monitoring Layer

Document ID: IMPL-REQ-003
Priority: P1 (High)
Target ADR: ADR-110 (Proposed)
Estimated Effort: 3 Sprints
Dependencies: Orchestrator, Checkpoint Service, Event Bus


1. Overview

1.1 Problem Statement

Autonomous agent loops can become stuck, unresponsive, or enter infinite retry cycles. Without active health monitoring, these failure modes:

  • Consume resources indefinitely
  • Block dependent tasks
  • Generate costs without progress
  • Require manual detection and intervention

Gas Town's GUPP principle ("If there is work on your Hook, YOU MUST RUN IT") and Witness/Deacon pattern provide a battle-tested model for stuck detection and intervention.

1.2 Objective

Implement a health monitoring layer that:

  • Detects stuck or unresponsive agents
  • Provides graduated intervention (nudge → escalate → terminate)
  • Enforces circuit breaker patterns
  • Enables self-healing through automatic recovery
  • Maintains observability into agent health

1.3 Success Criteria

Metric                       Target
Stuck detection latency      < 5 minutes
False positive rate          < 1%
Recovery success rate        > 95%
Mean time to intervention    < 10 minutes

2. Functional Requirements

2.1 Health Status Model

FR-001: Agent Health States

states:
  HEALTHY:
    description: "Agent is making progress"
    indicators:
      - Recent checkpoint update (< 10 min)
      - Tool calls being made
      - No error accumulation

  DEGRADED:
    description: "Agent showing warning signs"
    indicators:
      - Checkpoint stale (10-20 min)
      - High error rate (> 25%)
      - Token consumption anomaly
    actions:
      - Increase monitoring frequency
      - Prepare intervention

  STUCK:
    description: "Agent not making progress"
    indicators:
      - No checkpoint update (> 30 min)
      - Repeated identical operations
      - Context exhaustion without handoff
    actions:
      - Initiate nudge sequence
      - Alert orchestrator

  FAILING:
    description: "Agent in error loop"
    indicators:
      - Circuit breaker tripped
      - Consecutive failures (> 3)
      - Unrecoverable errors detected
    actions:
      - Terminate agent
      - Initiate recovery from checkpoint

  TERMINATED:
    description: "Agent has been stopped"
    indicators:
      - Explicit termination
      - Resource limit exceeded
      - Unrecoverable failure
    actions:
      - Cleanup resources
      - Archive final state
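
The five states above form a small state machine. A minimal TypeScript sketch follows; the enum values mirror the spec, but `ALLOWED_TRANSITIONS` and `canTransition` are illustrative names, and the transition table is an assumption inferred from the indicators and actions above, not an exhaustive rule set:

```typescript
// Illustrative health state machine; transition table is an assumption.
enum HealthState {
  Healthy = "HEALTHY",
  Degraded = "DEGRADED",
  Stuck = "STUCK",
  Failing = "FAILING",
  Terminated = "TERMINATED",
}

// Plausible transitions implied by the state descriptions above.
const ALLOWED_TRANSITIONS: Record<HealthState, HealthState[]> = {
  [HealthState.Healthy]: [HealthState.Degraded, HealthState.Failing],
  [HealthState.Degraded]: [HealthState.Healthy, HealthState.Stuck, HealthState.Failing],
  [HealthState.Stuck]: [HealthState.Healthy, HealthState.Failing, HealthState.Terminated],
  [HealthState.Failing]: [HealthState.Terminated],
  [HealthState.Terminated]: [],
};

function canTransition(from: HealthState, to: HealthState): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}
```

A validation layer like this would reject nonsense transitions (e.g. TERMINATED back to HEALTHY) before they are persisted.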

2.2 Health Check Protocol

FR-002: Health Check Operations

Heartbeat Protocol:
├── Agent emits heartbeat every 5 minutes
├── Heartbeat includes:
│ ├── agent_id
│ ├── task_id
│ ├── current_phase
│ ├── last_tool_call_timestamp
│ ├── token_count
│ ├── error_count
│ └── progress_indicator
├── Monitor records heartbeat receipt
└── Missing heartbeat triggers DEGRADED state

Progress Detection:
├── Compare checkpoint timestamps
├── Analyze tool call patterns
│ ├── Repeated identical calls = stuck
│ ├── No calls for 15+ min = stuck
│ └── Diverse calls = healthy
├── Token consumption rate analysis
│ ├── Sudden spike = potential loop
│ ├── Zero consumption = stuck
│ └── Steady rate = healthy
└── Output analysis (file changes, test results)
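
The repeated-identical-call and staleness heuristics above can be sketched as a single predicate. The `ToolCall` shape, the function name, and the repeat threshold of 5 are assumptions for illustration; only the 15-minute staleness window comes from the spec:

```typescript
// Sketch of the Progress Detection heuristics; shapes and thresholds
// beyond the 15-minute window are assumptions.
interface ToolCall {
  name: string;
  argsHash: string;  // hash of the call arguments, for cheap equality
  timestamp: number; // epoch milliseconds
}

function looksStuck(
  calls: ToolCall[],
  now: number,
  repeatThreshold = 5,        // assumed: N identical calls in a row = stuck
  staleMs = 15 * 60 * 1000,   // spec: "No calls for 15+ min = stuck"
): boolean {
  if (calls.length === 0) return true;
  const last = calls[calls.length - 1];
  if (now - last.timestamp > staleMs) return true;
  if (calls.length >= repeatThreshold) {
    const tail = calls.slice(-repeatThreshold);
    const allSame = tail.every(
      (c) => c.name === tail[0].name && c.argsHash === tail[0].argsHash,
    );
    if (allSame) return true;
  }
  return false;
}
```

In practice this signal would be combined with checkpoint staleness and token-rate analysis before declaring STUCK, to keep the false positive rate under the 1% target.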

2.3 Intervention Protocol

FR-003: Graduated Intervention Sequence

Level 1 - NUDGE (soft intervention):
├── Trigger: STUCK state detected
├── Action: Inject reminder into agent context
│ "REMINDER: You have been working for {duration} without
│ checkpoint update. Please either:
│ - Update your progress
│ - Request handoff if stuck
│ - Report blockers"
├── Wait: 10 minutes
└── Escalate if no response

Level 2 - ESCALATE (orchestrator alert):
├── Trigger: Nudge unsuccessful (3 attempts)
├── Action:
│ ├── Alert orchestrator
│ ├── Log escalation event
│ └── Prepare recovery options
├── Orchestrator decision:
│ ├── Allow more time
│ ├── Force handoff
│ └── Terminate and recover
└── Wait: Orchestrator response or timeout (15 min)

Level 3 - TERMINATE (forced stop):
├── Trigger: Escalation timeout or orchestrator decision
├── Action:
│ ├── Force agent termination
│ ├── Save partial state to checkpoint
│ ├── Clean up resources
│ └── Initiate recovery from last valid checkpoint
└── Post-action: Recovery or human escalation
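
The three-level ladder can be sketched as a small escalation function. This is a hedged TypeScript sketch: `nextLevel` and `nudgeMessage` are illustrative names, and the reminder text paraphrases the template above:

```typescript
// Graduated intervention ladder; function names are assumptions.
type InterventionLevel = "NUDGE" | "ESCALATE" | "TERMINATE";

function nextLevel(current: InterventionLevel, nudgeAttempts: number): InterventionLevel {
  if (current === "NUDGE") {
    // Escalate only once the nudge budget (3 attempts) is exhausted.
    return nudgeAttempts >= 3 ? "ESCALATE" : "NUDGE";
  }
  // Escalation timeout or orchestrator decision leads to termination.
  return "TERMINATE";
}

function nudgeMessage(durationMin: number): string {
  return (
    `REMINDER: You have been working for ${durationMin} minutes without a ` +
    `checkpoint update. Please either update your progress, request a handoff ` +
    `if stuck, or report blockers.`
  );
}
```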

2.4 Circuit Breaker Pattern

FR-004: Circuit Breaker Implementation

States:
├── CLOSED (normal operation)
│ ├── Requests pass through
│ ├── Failures counted
│ └── Trips to OPEN on threshold
├── OPEN (blocking)
│ ├── Requests immediately fail
│ ├── Timer starts
│ └── Transitions to HALF_OPEN on timeout
└── HALF_OPEN (testing)
  ├── Limited requests allowed
  ├── Success → CLOSED
  └── Failure → OPEN

Configuration:
  failure_threshold: 3           # Consecutive failures to trip
  recovery_timeout_seconds: 60   # Time in OPEN before HALF_OPEN
  half_open_requests: 1          # Requests allowed in HALF_OPEN

Per-Agent Circuit Breakers:
├── Tool execution circuit breaker
├── Checkpoint write circuit breaker
├── External API circuit breaker
└── Recovery attempt circuit breaker
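
The three breaker states and the configuration above can be sketched as a small class. This is an illustrative TypeScript sketch, not the project's implementation; the class and method names are assumptions, while the defaults (failure_threshold 3, recovery timeout 60 s) come from the configuration:

```typescript
// Minimal circuit breaker matching CLOSED/OPEN/HALF_OPEN above.
// Class and method names are assumptions.
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 3,       // consecutive failures to trip
    private recoveryTimeoutMs = 60_000, // time in OPEN before HALF_OPEN
  ) {}

  // Effective state, honouring the OPEN -> HALF_OPEN timer.
  currentState(now: number): BreakerState {
    if (this.state === "OPEN" && now - this.openedAt >= this.recoveryTimeoutMs) {
      this.state = "HALF_OPEN";
    }
    return this.state;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "CLOSED";
  }

  recordFailure(now: number): void {
    if (this.state === "HALF_OPEN") {
      // Any failure while testing re-opens the breaker.
      this.state = "OPEN";
      this.openedAt = now;
      return;
    }
    this.failures += 1;
    if (this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.openedAt = now;
    }
  }
}
```

A call wrapper would check `currentState(now)` before executing and fail fast while the breaker is OPEN.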

2.5 Self-Healing Protocol

FR-005: Automatic Recovery

Recovery Decision Tree:
├── Is last checkpoint valid?
│ ├── Yes → Spawn new agent from checkpoint
│ └── No → Try previous checkpoint
├── Is previous checkpoint valid?
│ ├── Yes → Spawn from previous (some work lost)
│ └── No → Continue searching chain
├── No valid checkpoint in chain?
│ ├── Reset to task start
│ └── Alert for human intervention
└── Recovery spawned?
  ├── Monitor closely for 15 min
  └── If fails again → Human escalation

Recovery Limits:
├── Max recovery attempts per task: 3
├── Max recovery attempts per hour: 5
├── Backoff between attempts: exponential (1, 2, 4 min)
└── After limits: Mandatory human review
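
The per-task limit and exponential backoff above reduce to a single lookup. A minimal sketch; `recoveryBackoffMs` is an assumed name, while the values (3 attempts, 1/2/4-minute backoff) come from the Recovery Limits:

```typescript
// Recovery backoff per the limits above: max 3 attempts per task,
// exponential backoff of 1, 2, 4 minutes. Function name is assumed.
function recoveryBackoffMs(attempt: number, maxAttempts = 3): number | null {
  // attempt is 1-based; null means the limit is exhausted and a
  // mandatory human review is required.
  if (attempt > maxAttempts) return null;
  return 60_000 * 2 ** (attempt - 1); // 1 min, 2 min, 4 min
}
```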

3. Non-Functional Requirements

3.1 Performance

Requirement    Specification
NFR-001        Health check latency < 100ms
NFR-002        Heartbeat processing < 50ms
NFR-003        Support 100+ concurrent agent monitors
NFR-004        State transition < 1s
NFR-005        Intervention injection < 500ms

3.2 Reliability

Requirement    Specification
NFR-006        Monitor service 99.99% availability
NFR-007        No false positives from monitor failures
NFR-008        Graceful degradation if monitoring unavailable
NFR-009        Persistent health state across monitor restarts

3.3 Observability

Requirement    Specification
NFR-010        Real-time health dashboard
NFR-011        Alert integration (PagerDuty, Slack)
NFR-012        Historical health metrics (30-day retention)
NFR-013        Intervention audit trail
NFR-014        Distributed tracing for health checks

3.4 Compliance

Requirement    Specification
NFR-015        Audit log of all interventions
NFR-016        Retention per compliance policy
NFR-017        Non-repudiation for termination events

4. Implementation Steps

Phase 1: Health Status Infrastructure (Week 1-2)

Step 1.1: Health State Machine
├── Define HealthState enum
├── Implement state transition logic
├── Add transition validation rules
├── Create state persistence (FoundationDB)
└── Implement state change events

Step 1.2: Heartbeat System
├── Define heartbeat message schema
├── Implement agent-side heartbeat emission
│ ├── Periodic timer (5 min)
│ ├── Heartbeat content collection
│ └── Emission via event bus
├── Implement monitor-side heartbeat reception
│ ├── Heartbeat listener
│ ├── Timestamp recording
│ └── Missing heartbeat detection
└── Add heartbeat to agent base class

Step 1.3: Health Check Service
├── Create HealthCheckService interface
├── Implement checkpoint-based health check
│ ├── Query latest checkpoint
│ ├── Compare timestamps
│ └── Assess staleness
├── Implement tool-call-based health check
│ ├── Query recent tool calls
│ ├── Analyze patterns
│ └── Detect repetition
└── Implement composite health assessment
  ├── Aggregate all signals
  ├── Apply thresholds
  └── Determine health state

Phase 2: Monitoring Service (Week 3-4)

Step 2.1: Agent Monitor
├── Create AgentMonitor class
│ ├── Per-agent monitoring instance
│ ├── Health state tracking
│ ├── Heartbeat tracking
│ └── Intervention tracking
├── Implement monitoring loop
│ ├── Periodic health check (1 min)
│ ├── State transition evaluation
│ ├── Intervention trigger check
│ └── Metric emission
└── Add monitor lifecycle management
  ├── Start on agent spawn
  ├── Stop on agent termination
  └── Cleanup resources

Step 2.2: Monitor Coordinator
├── Create MonitorCoordinator service
│ ├── Manages all AgentMonitors
│ ├── Provides aggregate health view
│ └── Coordinates interventions
├── Implement monitor discovery
│ ├── Subscribe to agent spawn events
│ ├── Create monitors for new agents
│ └── Remove monitors for terminated agents
└── Add health dashboard data provider

Step 2.3: Intervention Service
├── Create InterventionService
├── Implement nudge injection
│ ├── Generate nudge message
│ ├── Inject into agent context
│ │ (via orchestrator or direct)
│ └── Record nudge attempt
├── Implement escalation flow
│ ├── Notify orchestrator
│ ├── Await decision
│ └── Execute decision
└── Implement termination flow
  ├── Force stop agent
  ├── Save final state
  └── Trigger cleanup

Phase 3: Circuit Breaker (Week 5-6)

Step 3.1: Circuit Breaker Core
├── Create CircuitBreaker class
│ ├── State management (CLOSED/OPEN/HALF_OPEN)
│ ├── Failure counting
│ ├── Timer management
│ └── State transition logic
├── Implement call wrapping
│ ├── Check state before call
│ ├── Execute or fail fast
│ ├── Record result
│ └── Update state
└── Add configuration support
  ├── Thresholds
  ├── Timeouts
  └── Per-operation overrides

Step 3.2: Circuit Breaker Integration
├── Integrate with tool execution
│ ├── Wrap tool calls
│ ├── Track tool-specific breakers
│ └── Report breaker status
├── Integrate with checkpoint writes
│ ├── Prevent writes during OPEN
│ ├── Queue for retry
│ └── Alert on persistent failure
└── Integrate with external APIs
  ├── Per-endpoint breakers
  ├── Shared breaker for related endpoints
  └── Graceful degradation

Step 3.3: Circuit Breaker Dashboard
├── Expose breaker states via API
├── Add to health dashboard
├── Implement manual reset capability
└── Add alerting on OPEN state

Phase 4: Self-Healing and Testing (Week 7-8)

Step 4.1: Recovery Service
├── Create RecoveryService
├── Implement checkpoint chain traversal
│ ├── Find last valid checkpoint
│ ├── Validate checkpoint integrity
│ └── Handle broken chains
├── Implement agent respawn
│ ├── Create new agent instance
│ ├── Inject checkpoint state
│ ├── Start with recovery context
│ └── Link to original task
└── Implement recovery limits
  ├── Track attempts per task
  ├── Enforce backoff
  └── Escalate on limit breach

Step 4.2: Testing
├── Unit tests:
│ ├── State machine transitions
│ ├── Health check calculations
│ ├── Circuit breaker logic
│ └── Recovery decisions
├── Integration tests:
│ ├── Heartbeat flow
│ ├── Intervention sequence
│ ├── Circuit breaker triggering
│ └── Recovery spawning
└── Chaos tests:
  ├── Simulate stuck agents
  ├── Simulate failures
  ├── Test recovery under load
  └── Verify no false positives

Step 4.3: Documentation and Runbooks
├── Architecture documentation
├── Configuration guide
├── Runbook: Manual intervention
├── Runbook: Circuit breaker reset
└── Runbook: Recovery troubleshooting

5. Event Definitions

// Health monitoring events
type HealthEvents =
  | { type: 'AGENT_HEARTBEAT'; payload: HeartbeatPayload }
  | { type: 'HEALTH_STATE_CHANGED'; payload: { agentId: string; from: HealthState; to: HealthState; reason: string } }
  | { type: 'NUDGE_SENT'; payload: { agentId: string; attempt: number; message: string } }
  | { type: 'ESCALATION_TRIGGERED'; payload: { agentId: string; reason: string } }
  | { type: 'AGENT_TERMINATED'; payload: { agentId: string; reason: string; checkpoint?: string } }
  | { type: 'RECOVERY_INITIATED'; payload: { taskId: string; fromCheckpoint: string; attempt: number } }
  | { type: 'RECOVERY_COMPLETED'; payload: { taskId: string; newAgentId: string } }
  | { type: 'RECOVERY_FAILED'; payload: { taskId: string; reason: string; attempts: number } }
  | { type: 'CIRCUIT_BREAKER_STATE_CHANGED'; payload: { name: string; from: BreakerState; to: BreakerState } };

interface HeartbeatPayload {
  agentId: string;
  taskId: string;
  timestamp: string;
  phase: string;
  lastToolCall: string;
  tokenCount: number;
  errorCount: number;
  progressIndicator: string;
}

6. API Specification

interface HealthMonitoringService {
  // Health queries
  getAgentHealth(agentId: AgentId): Promise<AgentHealth>;
  getAllAgentHealth(): Promise<AgentHealth[]>;
  getHealthHistory(agentId: AgentId, duration: Duration): Promise<HealthSnapshot[]>;

  // Heartbeat
  recordHeartbeat(heartbeat: Heartbeat): Promise<void>;
  getLastHeartbeat(agentId: AgentId): Promise<Heartbeat | null>;

  // Interventions
  sendNudge(agentId: AgentId, message?: string): Promise<NudgeResult>;
  escalateToOrchestrator(agentId: AgentId, reason: string): Promise<EscalationResult>;
  terminateAgent(agentId: AgentId, reason: string): Promise<TerminationResult>;

  // Circuit breakers
  getCircuitBreakerStatus(name: string): Promise<CircuitBreakerStatus>;
  getAllCircuitBreakers(): Promise<CircuitBreakerStatus[]>;
  resetCircuitBreaker(name: string): Promise<void>;

  // Recovery
  initiateRecovery(taskId: TaskId): Promise<RecoveryResult>;
  getRecoveryStatus(taskId: TaskId): Promise<RecoveryStatus>;
}

interface AgentHealth {
  agentId: string;
  taskId: string;
  state: HealthState;
  stateChangedAt: string;
  lastHeartbeat: string;
  lastCheckpoint: string;
  metrics: {
    uptimeSeconds: number;
    tokenCount: number;
    errorCount: number;
    toolCallCount: number;
    interventionCount: number;
  };
  circuitBreakers: CircuitBreakerStatus[];
}

7. Configuration Schema

# coditect-config.yaml
health_monitoring:
  enabled: true

  heartbeat:
    interval_seconds: 300                       # 5 minutes
    timeout_seconds: 600                        # 10 minutes (2x interval)

  health_check:
    interval_seconds: 60                        # 1 minute
    checkpoint_stale_threshold_seconds: 1800    # 30 minutes

  intervention:
    nudge:
      enabled: true
      max_attempts: 3
      interval_seconds: 600                     # 10 minutes between nudges
    escalation:
      timeout_seconds: 900                      # 15 minutes
    termination:
      cleanup_timeout_seconds: 30

  circuit_breaker:
    default:
      failure_threshold: 3
      recovery_timeout_seconds: 60
      half_open_requests: 1
    overrides:
      tool_execution:
        failure_threshold: 5
        recovery_timeout_seconds: 30
      checkpoint_write:
        failure_threshold: 3
        recovery_timeout_seconds: 120

  recovery:
    max_attempts_per_task: 3
    max_attempts_per_hour: 5
    backoff_seconds: [60, 120, 240]             # Exponential backoff

  alerting:
    stuck_agent:
      channels: ["slack", "pagerduty"]
      severity: "warning"
    circuit_breaker_open:
      channels: ["slack"]
      severity: "warning"
    recovery_failed:
      channels: ["slack", "pagerduty"]
      severity: "critical"

8. Dashboard Requirements

Health Dashboard Components:

1. Agent Grid View
├── All active agents
├── Health state indicator (color-coded)
├── Time in current state
├── Last heartbeat
└── Quick actions (nudge, terminate)

2. Agent Detail View
├── Full health metrics
├── Heartbeat history (chart)
├── State transition timeline
├── Intervention history
└── Circuit breaker status

3. Circuit Breaker Panel
├── All breakers with state
├── Failure counts
├── Time until HALF_OPEN
└── Manual reset buttons

4. Recovery Status
├── Active recovery attempts
├── Recent recovery history
├── Success/failure rates
└── Manual recovery trigger

5. Alerts Feed
├── Recent health events
├── Intervention notifications
├── Circuit breaker trips
└── Recovery outcomes

9. Dependencies

Dependency                    Type            Status
Event Bus                     Platform        ✅ Available
FoundationDB                  Infrastructure  ✅ Available
Orchestrator                  Platform        ⚠️ Requires integration
Checkpoint Service            Platform        🔄 IMPL-REQ-001
Alerting (Slack/PagerDuty)    External        ✅ Available
Dashboard Framework           Frontend        ⚠️ TBD

10. Risks and Mitigations

Risk                             Impact                           Likelihood  Mitigation
False positive stuck detection   Unnecessary termination          Medium      Multiple signals, tunable thresholds
Monitor service failure          Undetected stuck agents          Low         Monitor redundancy, self-monitoring
Intervention storm               Resource exhaustion              Low         Rate limiting, backoff
Recovery loop                    Infinite recovery attempts       Medium      Attempt limits, human escalation
Clock skew                       Incorrect timestamp comparisons  Low         NTP sync, tolerance windows

11. Acceptance Criteria

  • Heartbeat system operational for all agent types
  • Health state machine correctly tracks transitions
  • Stuck agents detected within 5 minutes of crossing the staleness threshold
  • Nudge injection successfully reaches agents
  • Escalation notifies orchestrator with full context
  • Termination cleanly stops agents and preserves state
  • Circuit breakers prevent cascading failures
  • Recovery successfully respawns from checkpoints
  • Dashboard displays real-time health status
  • Alerts fire correctly for configured conditions
  • False positive rate < 1% over 7-day test period
  • Documentation and runbooks complete

Document Version: 1.0 | Last Updated: January 24, 2026