# ADR-110: Agent Health Monitoring Layer

## Status

Proposed
## Context

### Problem Statement

Autonomous agent loops can become stuck or unresponsive, or enter infinite retry cycles. Without active health monitoring, these failure modes:

- Consume resources indefinitely
- Block dependent tasks
- Generate costs without progress
- Require manual detection and intervention
### Key Insight from Gas Town Architecture

Gas Town's GUPP principle ("If there is work on your Hook, YOU MUST RUN IT") and its Witness/Deacon pattern provide a battle-tested model for stuck detection and intervention:

- Witness - performs health checks and nudges stuck workers
- Deacon - runs patrol cycles and handles escalation
### Current State

- No heartbeat mechanism for agents
- No stuck detection
- No graduated intervention
- No circuit breaker pattern
- Manual detection required for failed agents
- No self-healing capability
## Decision

Implement a comprehensive health monitoring layer that detects stuck or unresponsive agents, applies graduated intervention, enforces circuit breaker patterns, and enables self-healing through automatic recovery from checkpoints.
### 1. Health Status Model

```yaml
health_states:
  HEALTHY:
    description: "Agent is making progress"
    indicators:
      - Recent checkpoint update (< 10 min)
      - Tool calls being made
      - No error accumulation

  DEGRADED:
    description: "Agent showing warning signs"
    indicators:
      - Checkpoint stale (10-30 min)
      - High error rate (> 25%)
      - Token consumption anomaly
    actions:
      - Increase monitoring frequency
      - Prepare intervention

  STUCK:
    description: "Agent not making progress"
    indicators:
      - No checkpoint update (> 30 min)
      - Repeated identical operations
      - Context exhaustion without handoff
    actions:
      - Initiate nudge sequence
      - Alert orchestrator

  FAILING:
    description: "Agent in error loop"
    indicators:
      - Circuit breaker tripped
      - Consecutive failures (> 3)
      - Unrecoverable errors detected
    actions:
      - Terminate agent
      - Initiate recovery from checkpoint

  TERMINATED:
    description: "Agent has been stopped"
    indicators:
      - Explicit termination
      - Resource limit exceeded
      - Unrecoverable failure
    actions:
      - Cleanup resources
      - Archive final state
```
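
To make the state model concrete, here is a minimal classifier sketch in TypeScript using the thresholds above. The `AgentVitals` shape and `classifyHealth` name are illustrative assumptions, not an existing API; `TERMINATED` is set explicitly by the intervention layer rather than inferred from telemetry.

```typescript
type HealthState = 'HEALTHY' | 'DEGRADED' | 'STUCK' | 'FAILING' | 'TERMINATED';

// Hypothetical telemetry snapshot; field names are illustrative.
interface AgentVitals {
  secondsSinceCheckpoint: number; // age of the last checkpoint update
  errorRate: number;              // errors / total operations, in [0, 1]
  consecutiveFailures: number;
  circuitBreakerTripped: boolean;
}

function classifyHealth(v: AgentVitals): HealthState {
  if (v.circuitBreakerTripped || v.consecutiveFailures > 3) return 'FAILING';
  if (v.secondsSinceCheckpoint > 1800) return 'STUCK';      // > 30 min stale
  if (v.secondsSinceCheckpoint > 600 || v.errorRate > 0.25) return 'DEGRADED';
  return 'HEALTHY';
}
```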
### 2. Heartbeat Protocol

```yaml
heartbeat:
  interval_seconds: 300   # 5 minutes
  timeout_seconds: 600    # 2x interval
  payload:
    agent_id: string
    task_id: string
    timestamp: datetime
    phase: string
    last_tool_call: datetime
    token_count: integer
    error_count: integer
    progress_indicator: string

progress_detection:
  checkpoint_comparison: true    # Compare checkpoint timestamps
  tool_call_analysis:
    repeated_identical: stuck    # Same calls repeated = stuck
    no_calls_threshold: 900      # 15+ min without calls = stuck
    diverse_calls: healthy
  token_consumption:
    sudden_spike: potential_loop
    zero_consumption: stuck
    steady_rate: healthy
```
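
A sketch of the agent-side emitter under these settings; the `collect` and `publish` callbacks are assumed injection points for gathering vitals and for the transport (queue, HTTP, event bus), not a defined interface:

```typescript
// Mirrors the heartbeat payload schema above (camelCase on the wire is an
// assumption here; the YAML schema uses snake_case field names).
interface HeartbeatPayload {
  agentId: string;
  taskId: string;
  timestamp: string;      // ISO-8601 datetime
  phase: string;
  lastToolCall: string;   // ISO-8601 datetime
  tokenCount: number;
  errorCount: number;
  progressIndicator: string;
}

function startHeartbeat(
  collect: () => HeartbeatPayload,
  publish: (hb: HeartbeatPayload) => Promise<void>,
  intervalSeconds = 300, // matches heartbeat.interval_seconds
): ReturnType<typeof setInterval> {
  return setInterval(() => {
    // A failed publish is logged, not fatal: the monitor's timeout
    // (2x the interval) is what ultimately flags a silent agent.
    publish(collect()).catch((err) => console.error('heartbeat failed', err));
  }, intervalSeconds * 1000);
}
```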
### 3. Intervention Protocol

```text
GRADUATED INTERVENTION SEQUENCE:

Level 1 - NUDGE (soft intervention):
├── Trigger: STUCK state detected
├── Action: Inject reminder into agent context
│     "REMINDER: You have been working for {duration} without
│      checkpoint update. Please either:
│      - Update your progress
│      - Request handoff if stuck
│      - Report blockers"
├── Wait: 10 minutes
└── Escalate if no response

Level 2 - ESCALATE (orchestrator alert):
├── Trigger: Nudge unsuccessful (3 attempts)
├── Action:
│   ├── Alert orchestrator
│   ├── Log escalation event
│   └── Prepare recovery options
├── Orchestrator decision:
│   ├── Allow more time
│   ├── Force handoff
│   └── Terminate and recover
└── Wait: Orchestrator response or timeout (15 min)

Level 3 - TERMINATE (forced stop):
├── Trigger: Escalation timeout or orchestrator decision
├── Action:
│   ├── Force agent termination
│   ├── Save partial state to checkpoint
│   ├── Clean up resources
│   └── Initiate recovery from last valid checkpoint
└── Post-action: Recovery or human escalation
```
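
A sketch of the Level 1 → Level 2 handoff, assuming `sendNudge` and `isResponsive` hooks into the agent runtime (hypothetical names). Attempt count and spacing mirror the configuration schema in section 7 (3 attempts, 10 minutes apart):

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function nudgeSequence(
  agentId: string,
  sendNudge: (agentId: string, message: string) => Promise<void>,
  isResponsive: (agentId: string) => Promise<boolean>,
  maxAttempts = 3,        // intervention.nudge.max_attempts
  intervalSeconds = 600,  // intervention.nudge.interval_seconds
): Promise<'recovered' | 'escalate'> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    await sendNudge(
      agentId,
      'REMINDER: no checkpoint update detected. Update your progress, ' +
        'request handoff if stuck, or report blockers.',
    );
    await sleep(intervalSeconds * 1000);                 // wait 10 minutes
    if (await isResponsive(agentId)) return 'recovered'; // nudge worked
  }
  return 'escalate'; // Level 2: alert the orchestrator
}
```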
### 4. Circuit Breaker Pattern

```yaml
circuit_breaker:
  states:
    CLOSED:
      description: "Normal operation"
      behavior: "Requests pass through, failures counted"
      transition: "Trips to OPEN at the failure threshold"
    OPEN:
      description: "Blocking"
      behavior: "Requests fail immediately"
      transition: "HALF_OPEN after the recovery timeout"
    HALF_OPEN:
      description: "Testing"
      behavior: "Limited requests allowed"
      transition: "CLOSED on success, OPEN on failure"

  configuration:
    failure_threshold: 3            # Consecutive failures to trip
    recovery_timeout_seconds: 60    # Time in OPEN before HALF_OPEN
    half_open_requests: 1           # Requests allowed in HALF_OPEN

  per_agent_breakers:
    - tool_execution
    - checkpoint_write
    - external_api
    - recovery_attempt
```
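
A minimal breaker implementing these states and defaults; a sketch rather than a production implementation (no metrics emission, and `half_open_requests` is fixed at one probe):

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class CircuitBreaker {
  private state: BreakerState = 'CLOSED';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 3,        // consecutive failures to trip
    private recoveryTimeoutMs = 60_000,  // time in OPEN before HALF_OPEN
  ) {}

  async exec<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt < this.recoveryTimeoutMs) {
        throw new Error('circuit open: request rejected'); // fail fast
      }
      this.state = 'HALF_OPEN'; // recovery timeout elapsed: allow a probe
    }
    try {
      const result = await fn();
      this.state = 'CLOSED';    // success closes the breaker
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';    // trip (or re-trip) and restart the timeout
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```

Each per-agent breaker (tool_execution, checkpoint_write, etc.) would hold its own instance, constructed with the overrides from the configuration schema in section 7.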
### 5. Self-Healing Protocol

```text
RECOVERY DECISION TREE:
├── Is last checkpoint valid?
│   ├── Yes → Spawn new agent from checkpoint
│   └── No  → Try previous checkpoint
├── Is previous checkpoint valid?
│   ├── Yes → Spawn from previous (some work lost)
│   └── No  → Continue searching the chain
├── No valid checkpoint in chain?
│   ├── Reset to task start
│   └── Alert for human intervention
└── Recovery spawned?
    ├── Monitor closely for 15 min
    └── If it fails again → Human escalation

RECOVERY LIMITS:
├── Max recovery attempts per task: 3
├── Max recovery attempts per hour: 5
├── Backoff between attempts: exponential (1, 2, 4 min)
└── After limits: Mandatory human review
```
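
The chain walk reduces to finding the newest valid checkpoint, as sketched below; `Checkpoint`, `isValid`, and `spawnAgentFrom` are assumed interfaces into the checkpoint store (ADR-108) and agent runtime, and the per-task/per-hour attempt limits and backoff are enforced by the caller:

```typescript
interface Checkpoint {
  id: string;
  parentId?: string; // previous checkpoint in the chain
}

async function recoverTask(
  chain: Checkpoint[], // ordered newest first
  isValid: (cp: Checkpoint) => Promise<boolean>,
  spawnAgentFrom: (cp: Checkpoint) => Promise<string>,
): Promise<string | 'human_escalation'> {
  for (const cp of chain) {
    if (await isValid(cp)) {
      // Spawning from an older checkpoint loses some work but preserves
      // the task; the new agent is monitored closely for 15 minutes.
      return spawnAgentFrom(cp); // resolves to the new agent id
    }
    // Invalid: fall back to the previous checkpoint in the chain.
  }
  return 'human_escalation'; // no valid checkpoint anywhere in the chain
}
```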
### 6. Event Definitions

```typescript
type HealthEvents =
  | { type: 'AGENT_HEARTBEAT'; payload: HeartbeatPayload }
  | { type: 'HEALTH_STATE_CHANGED'; payload: {
      agentId: string; from: HealthState; to: HealthState; reason: string;
    }}
  | { type: 'NUDGE_SENT'; payload: { agentId: string; attempt: number; message: string }}
  | { type: 'ESCALATION_TRIGGERED'; payload: { agentId: string; reason: string }}
  | { type: 'AGENT_TERMINATED'; payload: { agentId: string; reason: string; checkpoint?: string }}
  | { type: 'RECOVERY_INITIATED'; payload: { taskId: string; fromCheckpoint: string; attempt: number }}
  | { type: 'RECOVERY_COMPLETED'; payload: { taskId: string; newAgentId: string }}
  | { type: 'RECOVERY_FAILED'; payload: { taskId: string; reason: string; attempts: number }}
  | { type: 'CIRCUIT_BREAKER_STATE_CHANGED'; payload: {
      name: string; from: BreakerState; to: BreakerState;
    }};
```
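
The discriminated union narrows cleanly in consumers; a small illustrative handler (the event-bus wiring itself is outside this ADR):

```typescript
function onHealthEvent(event: HealthEvents): void {
  switch (event.type) {
    case 'HEALTH_STATE_CHANGED':
      console.log(
        `${event.payload.agentId}: ${event.payload.from} -> ${event.payload.to}` +
          ` (${event.payload.reason})`,
      );
      break;
    case 'RECOVERY_FAILED':
      console.error(
        `recovery failed for ${event.payload.taskId} after ${event.payload.attempts} attempts`,
      );
      break;
    default:
      break; // remaining events are forwarded to metrics/alerting
  }
}
```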
### 7. Configuration Schema

```yaml
health_monitoring:
  enabled: true

  heartbeat:
    interval_seconds: 300                      # 5 minutes
    timeout_seconds: 600                       # 10 minutes

  health_check:
    interval_seconds: 60                       # 1 minute
    checkpoint_stale_threshold_seconds: 1800   # 30 minutes

  intervention:
    nudge:
      enabled: true
      max_attempts: 3
      interval_seconds: 600                    # 10 minutes between nudges
    escalation:
      timeout_seconds: 900                     # 15 minutes
    termination:
      cleanup_timeout_seconds: 30

  circuit_breaker:
    default:
      failure_threshold: 3
      recovery_timeout_seconds: 60
      half_open_requests: 1
    overrides:
      tool_execution:
        failure_threshold: 5
        recovery_timeout_seconds: 30
      checkpoint_write:
        failure_threshold: 3
        recovery_timeout_seconds: 120

  recovery:
    max_attempts_per_task: 3
    max_attempts_per_hour: 5
    backoff_seconds: [60, 120, 240]

  alerting:
    stuck_agent:
      channels: ["slack", "pagerduty"]
      severity: "warning"
    circuit_breaker_open:
      channels: ["slack"]
      severity: "warning"
    recovery_failed:
      channels: ["slack", "pagerduty"]
      severity: "critical"
```
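
A sketch of loading and narrowing this file, assuming it lives at config/health_monitoring.yaml and js-yaml is available; `HealthMonitoringConfig` covers only the fields used here, and both the path and the type name are illustrative:

```typescript
import { readFileSync } from 'fs';
import { load } from 'js-yaml';

interface HealthMonitoringConfig {
  enabled: boolean;
  heartbeat: { interval_seconds: number; timeout_seconds: number };
  recovery: { max_attempts_per_task: number; backoff_seconds: number[] };
}

function loadConfig(path = 'config/health_monitoring.yaml'): HealthMonitoringConfig {
  const doc = load(readFileSync(path, 'utf8')) as {
    health_monitoring?: HealthMonitoringConfig;
  };
  if (!doc?.health_monitoring) {
    throw new Error(`missing health_monitoring section in ${path}`);
  }
  return doc.health_monitoring;
}
```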
## Consequences

### Positive

- Automatic Detection - Stuck agents are detected within 5 minutes
- Graduated Response - Nudge → Escalate → Terminate prevents premature termination
- Self-Healing - Automatic recovery from checkpoints
- Resource Protection - Circuit breakers prevent cascade failures
- Observability - Real-time health dashboard
- Cost Control - Stuck agents no longer consume tokens indefinitely

### Negative

- Latency Overhead - Health checks add monitoring load
- False Positives - Tuning is required to avoid premature intervention
- Complexity - More infrastructure to maintain
- Recovery Costs - Failed recoveries consume additional tokens
### Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| False positive stuck detection | Medium | Unnecessary termination | Multiple signals, tunable thresholds |
| Monitor service failure | Low | Undetected stuck agents | Monitor redundancy, self-monitoring |
| Intervention storm | Low | Resource exhaustion | Rate limiting, backoff |
| Recovery loop | Medium | Infinite recovery attempts | Attempt limits, human escalation |
| Clock skew | Low | Incorrect timestamps | NTP sync, tolerance windows |
## Performance Requirements
| Metric | Target |
|---|---|
| Stuck detection latency | < 5 minutes |
| False positive rate | < 1% |
| Recovery success rate | > 95% |
| Mean time to intervention | < 10 minutes |
| Health check latency | < 100ms |
| Heartbeat processing | < 50ms |
| Concurrent agent monitors | 100+ |
| State transition latency | < 1s |
## Observability Requirements
- Real-time health dashboard
- Alert integration (PagerDuty, Slack)
- Historical health metrics (30 day retention)
- Intervention audit trail
- Distributed tracing for health checks
## Implementation Phases

### Phase 1: Health Status Infrastructure (Weeks 1-2)

- Health state machine implementation
- Heartbeat system (agent emission + monitor reception)
- Health check service (checkpoint-based, tool-call-based)

### Phase 2: Monitoring Service (Weeks 3-4)

- AgentMonitor class (per-agent monitoring)
- MonitorCoordinator service
- InterventionService (nudge, escalate, terminate)

### Phase 3: Circuit Breaker (Weeks 5-6)

- CircuitBreaker core implementation
- Integration with tool execution and checkpoint writes
- Circuit breaker dashboard

### Phase 4: Self-Healing and Testing (Weeks 7-8)

- RecoveryService implementation
- Checkpoint chain traversal
- Chaos testing (simulated stuck agents and failures)
## Related ADRs
- ADR-002: PostgreSQL as Primary Database (health events stored in PostgreSQL)
- ADR-089: Two-Database Architecture (local SQLite + cloud PostgreSQL)
- ADR-108: Checkpoint Protocol (recovery from checkpoints)
- ADR-109: Browser Automation (browser resource monitoring)
- ADR-111: Token Economics (health metrics include token consumption)
- ADR-112: Ralph Wiggum Database Architecture (consolidates DB decisions)
## Glossary
| Term | Definition |
|---|---|
| Circuit Breaker | Design pattern that stops repeated failures from cascading, with CLOSED/OPEN/HALF_OPEN states |
| Health State | Agent operational status: HEALTHY, DEGRADED, STUCK, FAILING, or TERMINATED |
| Heartbeat | Periodic signal from agent indicating it is still operational and making progress |
| Nudge | Soft intervention injecting reminder into agent context to prompt progress |
| Escalation | Alerting orchestrator when nudge fails to restore agent progress |
| Termination | Forced stop of unresponsive agent with state preservation |
| Self-Healing | Automatic recovery from checkpoints without human intervention |
| Gas Town | Steve Yegge's multi-agent orchestration architecture with Witness/Deacon patterns |
| GUPP | "If there is work on your Hook, YOU MUST RUN IT" - Gas Town execution principle |
| Witness | Gas Town agent role that performs health checks and nudges stuck workers |
| Deacon | Gas Town agent role that patrols and escalates persistent issues |
| NTP | Network Time Protocol - ensures synchronized timestamps across systems |
| Checkpoint | Persistent snapshot of agent state enabling recovery after failures |
ADR-110 | Created: 2026-01-24 | Status: Proposed