ADR-110: Agent Health Monitoring Layer

Status

Proposed

Context

Problem Statement

Autonomous agent loops can become stuck, unresponsive, or enter infinite retry cycles. Without active health monitoring, these failure modes:

  • Consume resources indefinitely
  • Block dependent tasks
  • Generate costs without progress
  • Require manual detection and intervention

Key Insight from Gas Town Architecture

Gas Town's GUPP principle ("If there is work on your Hook, YOU MUST RUN IT") and Witness/Deacon pattern provide a battle-tested model for stuck detection and intervention:

  • Witness - Health checks, nudges stuck workers
  • Deacon - Patrol cycles, escalation

Current State

  • No heartbeat mechanism for agents
  • No stuck detection
  • No graduated intervention
  • No circuit breaker pattern
  • Manual detection required for failed agents
  • No self-healing capability

Decision

Implement a comprehensive health monitoring layer that detects stuck or unresponsive agents, provides graduated intervention, enforces circuit breaker patterns, and enables self-healing through automatic recovery.

1. Health Status Model

health_states:
  HEALTHY:
    description: "Agent is making progress"
    indicators:
      - Recent checkpoint update (< 10 min)
      - Tool calls being made
      - No error accumulation

  DEGRADED:
    description: "Agent showing warning signs"
    indicators:
      - Checkpoint stale (10-30 min)
      - High error rate (> 25%)
      - Token consumption anomaly
    actions:
      - Increase monitoring frequency
      - Prepare intervention

  STUCK:
    description: "Agent not making progress"
    indicators:
      - No checkpoint update (> 30 min)
      - Repeated identical operations
      - Context exhaustion without handoff
    actions:
      - Initiate nudge sequence
      - Alert orchestrator

  FAILING:
    description: "Agent in error loop"
    indicators:
      - Circuit breaker tripped
      - Consecutive failures (> 3)
      - Unrecoverable errors detected
    actions:
      - Terminate agent
      - Initiate recovery from checkpoint

  TERMINATED:
    description: "Agent has been stopped"
    indicators:
      - Explicit termination
      - Resource limit exceeded
      - Unrecoverable failure
    actions:
      - Cleanup resources
      - Archive final state
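
The thresholds above can be sketched as a small classifier. This is an illustrative sketch, not the project's implementation; the function name and signature are assumptions, and staleness between the degraded and stuck thresholds is treated as DEGRADED.

```typescript
type HealthState = "HEALTHY" | "DEGRADED" | "STUCK" | "FAILING" | "TERMINATED";

// Thresholds taken from the state model above.
const STALE_DEGRADED_MIN = 10;      // checkpoint older than this => warning sign
const STALE_STUCK_MIN = 30;         // checkpoint older than this => stuck
const ERROR_RATE_DEGRADED = 0.25;   // > 25% errors => degraded
const MAX_CONSECUTIVE_FAILURES = 3; // > 3 consecutive failures => failing

function classifyHealth(
  minutesSinceCheckpoint: number,
  errorRate: number,            // errors / total operations
  consecutiveFailures: number,
): HealthState {
  if (consecutiveFailures > MAX_CONSECUTIVE_FAILURES) return "FAILING";
  if (minutesSinceCheckpoint > STALE_STUCK_MIN) return "STUCK";
  if (minutesSinceCheckpoint >= STALE_DEGRADED_MIN || errorRate > ERROR_RATE_DEGRADED)
    return "DEGRADED";
  return "HEALTHY";
}
```

TERMINATED is reached only by explicit action, so it never appears as a classifier output here.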

2. Heartbeat Protocol

heartbeat:
  interval_seconds: 300   # 5 minutes
  timeout_seconds: 600    # 2x interval

  payload:
    agent_id: string
    task_id: string
    timestamp: datetime
    phase: string
    last_tool_call: datetime
    token_count: integer
    error_count: integer
    progress_indicator: string

progress_detection:
  checkpoint_comparison: true    # Compare timestamps
  tool_call_analysis:
    repeated_identical: stuck    # Same calls = stuck
    no_calls_threshold: 900      # 15+ min = stuck
    diverse_calls: healthy
  token_consumption:
    sudden_spike: potential_loop
    zero_consumption: stuck
    steady_rate: healthy
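
A minimal sketch of how a monitor might apply the timeout and no-tool-call thresholds above to the last received heartbeat. Field names (camelCase) and the function are assumptions for illustration, not the defined payload schema.

```typescript
interface Heartbeat {
  agentId: string;
  taskId: string;
  timestamp: number;     // epoch ms of heartbeat emission
  lastToolCall: number;  // epoch ms of most recent tool call
  tokenCount: number;
  errorCount: number;
}

const HEARTBEAT_TIMEOUT_MS = 600_000;  // 2x the 5-minute interval
const NO_CALLS_THRESHOLD_MS = 900_000; // 15+ min without tool calls = stuck

// Derive a coarse progress verdict from the last heartbeat received.
function detectProgress(hb: Heartbeat, now: number): "healthy" | "stuck" | "missed" {
  if (now - hb.timestamp > HEARTBEAT_TIMEOUT_MS) return "missed"; // heartbeat overdue
  if (now - hb.lastToolCall > NO_CALLS_THRESHOLD_MS) return "stuck"; // alive but idle
  return "healthy";
}
```

The other signals (repeated identical calls, token spikes) would need call-history and rate tracking beyond a single heartbeat, so they are omitted here.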

3. Intervention Protocol

GRADUATED INTERVENTION SEQUENCE:

Level 1 - NUDGE (soft intervention):
├── Trigger: STUCK state detected
├── Action: Inject reminder into agent context
│     "REMINDER: You have been working for {duration} without
│      checkpoint update. Please either:
│      - Update your progress
│      - Request handoff if stuck
│      - Report blockers"
├── Wait: 10 minutes
└── Escalate if no response

Level 2 - ESCALATE (orchestrator alert):
├── Trigger: Nudge unsuccessful (3 attempts)
├── Action:
│   ├── Alert orchestrator
│   ├── Log escalation event
│   └── Prepare recovery options
├── Orchestrator decision:
│   ├── Allow more time
│   ├── Force handoff
│   └── Terminate and recover
└── Wait: Orchestrator response or timeout (15 min)

Level 3 - TERMINATE (forced stop):
├── Trigger: Escalation timeout or orchestrator decision
├── Action:
│   ├── Force agent termination
│   ├── Save partial state to checkpoint
│   ├── Clean up resources
│   └── Initiate recovery from last valid checkpoint
└── Post-action: Recovery or human escalation
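
The graduated sequence reduces to a small decision function plus the Level 1 message template. `nextIntervention` and `nudgeMessage` are hypothetical names used only to illustrate the logic.

```typescript
type Intervention = "NUDGE" | "ESCALATE" | "TERMINATE";

const MAX_NUDGES = 3; // nudge attempts before escalating

// Pick the next intervention level for a stuck agent, given how many
// nudges have been sent and whether the orchestrator timed out.
function nextIntervention(nudgesSent: number, escalationTimedOut: boolean): Intervention {
  if (escalationTimedOut) return "TERMINATE";   // Level 3: forced stop
  if (nudgesSent < MAX_NUDGES) return "NUDGE";  // Level 1: soft reminder
  return "ESCALATE";                            // Level 2: orchestrator alert
}

// Level 1 reminder text, filled with the elapsed duration.
function nudgeMessage(durationMin: number): string {
  return `REMINDER: You have been working for ${durationMin} min without ` +
    `checkpoint update. Please either update your progress, request a ` +
    `handoff if stuck, or report blockers.`;
}
```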

4. Circuit Breaker Pattern

circuit_breaker:
  states:
    CLOSED:
      description: "Normal operation"
      behavior: "Requests pass through, failures counted"
      transition: "Trips to OPEN on threshold"
    OPEN:
      description: "Blocking"
      behavior: "Requests immediately fail"
      transition: "HALF_OPEN on timeout"
    HALF_OPEN:
      description: "Testing"
      behavior: "Limited requests allowed"
      transition: "CLOSED on success, OPEN on failure"

  configuration:
    failure_threshold: 3          # Consecutive failures to trip
    recovery_timeout_seconds: 60  # Time in OPEN before HALF_OPEN
    half_open_requests: 1         # Requests allowed in HALF_OPEN

  per_agent_breakers:
    - tool_execution
    - checkpoint_write
    - external_api
    - recovery_attempt

5. Self-Healing Protocol

RECOVERY DECISION TREE:

├── Is last checkpoint valid?
│   ├── Yes → Spawn new agent from checkpoint
│   └── No → Try previous checkpoint
├── Is previous checkpoint valid?
│   ├── Yes → Spawn from previous (some work lost)
│   └── No → Continue searching chain
├── No valid checkpoint in chain?
│   ├── Reset to task start
│   └── Alert for human intervention
└── Recovery spawned?
    ├── Monitor closely for 15 min
    └── If fails again → Human escalation

RECOVERY LIMITS:
├── Max recovery attempts per task: 3
├── Max recovery attempts per hour: 5
├── Backoff between attempts: exponential (1, 2, 4 min)
└── After limits: Mandatory human review
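
The checkpoint-chain walk and the attempt limits can be sketched as two helpers; names and shapes are illustrative assumptions.

```typescript
const MAX_ATTEMPTS_PER_TASK = 3;
const BACKOFF_MS = [60_000, 120_000, 240_000]; // exponential: 1, 2, 4 minutes

// Delay before recovery attempt N (0-based); null means limits exhausted,
// i.e. mandatory human review.
function nextRecoveryDelay(attempt: number): number | null {
  if (attempt >= MAX_ATTEMPTS_PER_TASK) return null;
  return BACKOFF_MS[attempt];
}

// Walk the checkpoint chain newest-first until a valid checkpoint is found.
// null means no valid checkpoint: reset to task start and alert a human.
function pickCheckpoint(chain: { id: string; valid: boolean }[]): string | null {
  for (const cp of chain) {
    if (cp.valid) return cp.id;
  }
  return null;
}
```

The per-hour cap (5 attempts) would additionally require a sliding-window counter across tasks, omitted here.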

6. Event Definitions

type HealthEvents =
  | { type: 'AGENT_HEARTBEAT'; payload: HeartbeatPayload }
  | { type: 'HEALTH_STATE_CHANGED'; payload: {
      agentId: string; from: HealthState; to: HealthState; reason: string
    }}
  | { type: 'NUDGE_SENT'; payload: { agentId: string; attempt: number; message: string }}
  | { type: 'ESCALATION_TRIGGERED'; payload: { agentId: string; reason: string }}
  | { type: 'AGENT_TERMINATED'; payload: { agentId: string; reason: string; checkpoint?: string }}
  | { type: 'RECOVERY_INITIATED'; payload: { taskId: string; fromCheckpoint: string; attempt: number }}
  | { type: 'RECOVERY_COMPLETED'; payload: { taskId: string; newAgentId: string }}
  | { type: 'RECOVERY_FAILED'; payload: { taskId: string; reason: string; attempts: number }}
  | { type: 'CIRCUIT_BREAKER_STATE_CHANGED'; payload: {
      name: string; from: BreakerState; to: BreakerState
    }};
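
A discriminated union like this lets consumers narrow on `type` with an exhaustive switch. The sketch below uses a simplified two-variant union standing in for the full `HealthEvents` type; the `describe` helper is hypothetical.

```typescript
// Simplified stand-in for two variants of the HealthEvents union above.
type HealthEvent =
  | { type: "NUDGE_SENT"; payload: { agentId: string; attempt: number } }
  | { type: "AGENT_TERMINATED"; payload: { agentId: string; reason: string } };

function describe(e: HealthEvent): string {
  switch (e.type) {
    case "NUDGE_SENT":
      // TypeScript has narrowed payload to the NUDGE_SENT shape here.
      return `nudge #${e.payload.attempt} -> ${e.payload.agentId}`;
    case "AGENT_TERMINATED":
      return `terminated ${e.payload.agentId}: ${e.payload.reason}`;
  }
}
```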

7. Configuration Schema

health_monitoring:
  enabled: true

  heartbeat:
    interval_seconds: 300                     # 5 minutes
    timeout_seconds: 600                      # 10 minutes

  health_check:
    interval_seconds: 60                      # 1 minute
    checkpoint_stale_threshold_seconds: 1800  # 30 minutes

  intervention:
    nudge:
      enabled: true
      max_attempts: 3
      interval_seconds: 600                   # 10 minutes between nudges
    escalation:
      timeout_seconds: 900                    # 15 minutes
    termination:
      cleanup_timeout_seconds: 30

  circuit_breaker:
    default:
      failure_threshold: 3
      recovery_timeout_seconds: 60
      half_open_requests: 1
    overrides:
      tool_execution:
        failure_threshold: 5
        recovery_timeout_seconds: 30
      checkpoint_write:
        failure_threshold: 3
        recovery_timeout_seconds: 120

  recovery:
    max_attempts_per_task: 3
    max_attempts_per_hour: 5
    backoff_seconds: [60, 120, 240]

  alerting:
    stuck_agent:
      channels: ["slack", "pagerduty"]
      severity: "warning"
    circuit_breaker_open:
      channels: ["slack"]
      severity: "warning"
    recovery_failed:
      channels: ["slack", "pagerduty"]
      severity: "critical"
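
One way the `circuit_breaker.default` / `overrides` layering could be resolved at load time; the merge helper is an assumption for illustration, not a defined API.

```typescript
interface BreakerConfig {
  failure_threshold: number;
  recovery_timeout_seconds: number;
  half_open_requests: number;
}

// Mirrors circuit_breaker.default in the schema above.
const DEFAULT_BREAKER: BreakerConfig = {
  failure_threshold: 3,
  recovery_timeout_seconds: 60,
  half_open_requests: 1,
};

// Shallow-merge a named override onto the defaults; unknown names
// fall back to the defaults unchanged.
function resolveBreakerConfig(
  overrides: Record<string, Partial<BreakerConfig>>,
  name: string,
): BreakerConfig {
  return { ...DEFAULT_BREAKER, ...(overrides[name] ?? {}) };
}
```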

Consequences

Positive

  1. Automatic Detection - Stuck agents detected within 5 minutes
  2. Graduated Response - Nudge → Escalate → Terminate prevents premature termination
  3. Self-Healing - Automatic recovery from checkpoints
  4. Resource Protection - Circuit breakers prevent cascade failures
  5. Observability - Real-time health dashboard
  6. Cost Control - Stuck agents don't consume indefinitely

Negative

  1. Latency Overhead - Health checks add monitoring load
  2. False Positives - Tuning required to avoid premature intervention
  3. Complexity - More infrastructure to maintain
  4. Recovery Costs - Failed recoveries consume additional tokens

Risks

Risk | Likelihood | Impact | Mitigation
---- | ---------- | ------ | ----------
False positive stuck detection | Medium | Unnecessary termination | Multiple signals, tunable thresholds
Monitor service failure | Low | Undetected stuck agents | Monitor redundancy, self-monitoring
Intervention storm | Low | Resource exhaustion | Rate limiting, backoff
Recovery loop | Medium | Infinite recovery attempts | Attempt limits, human escalation
Clock skew | Low | Incorrect timestamps | NTP sync, tolerance windows

Performance Requirements

Metric | Target
------ | ------
Stuck detection latency | < 5 minutes
False positive rate | < 1%
Recovery success rate | > 95%
Mean time to intervention | < 10 minutes
Health check latency | < 100 ms
Heartbeat processing | < 50 ms
Concurrent agent monitors | 100+
State transition | < 1 s

Observability Requirements

  • Real-time health dashboard
  • Alert integration (PagerDuty, Slack)
  • Historical health metrics (30 day retention)
  • Intervention audit trail
  • Distributed tracing for health checks

Implementation Phases

Phase 1: Health Status Infrastructure (Week 1-2)

  • Health state machine implementation
  • Heartbeat system (agent emission + monitor reception)
  • Health check service (checkpoint-based, tool-call-based)

Phase 2: Monitoring Service (Week 3-4)

  • AgentMonitor class (per-agent monitoring)
  • MonitorCoordinator service
  • InterventionService (nudge, escalate, terminate)

Phase 3: Circuit Breaker (Week 5-6)

  • CircuitBreaker core implementation
  • Integration with tool execution, checkpoint writes
  • Circuit breaker dashboard

Phase 4: Self-Healing and Testing (Week 7-8)

  • RecoveryService implementation
  • Checkpoint chain traversal
  • Chaos testing (simulate stuck agents, failures)

Related Decisions

  • ADR-002: PostgreSQL as Primary Database (health events stored in PostgreSQL)
  • ADR-089: Two-Database Architecture (local SQLite + cloud PostgreSQL)
  • ADR-108: Checkpoint Protocol (recovery from checkpoints)
  • ADR-109: Browser Automation (browser resource monitoring)
  • ADR-111: Token Economics (health metrics include token consumption)
  • ADR-112: Ralph Wiggum Database Architecture (consolidates DB decisions)

Glossary

Term | Definition
---- | ----------
Circuit Breaker | Design pattern that stops repeated failures from cascading, with CLOSED/OPEN/HALF_OPEN states
Health State | Agent operational status: HEALTHY, DEGRADED, STUCK, FAILING, or TERMINATED
Heartbeat | Periodic signal from an agent indicating it is still operational and making progress
Nudge | Soft intervention injecting a reminder into agent context to prompt progress
Escalation | Alerting the orchestrator when a nudge fails to restore agent progress
Termination | Forced stop of an unresponsive agent with state preservation
Self-Healing | Automatic recovery from checkpoints without human intervention
Gas Town | Steve Yegge's multi-agent orchestration architecture with Witness/Deacon patterns
GUPP | "If there is work on your Hook, YOU MUST RUN IT" - Gas Town execution principle
Witness | Gas Town agent role that performs health checks and nudges stuck workers
Deacon | Gas Town agent role that patrols and escalates persistent issues
NTP | Network Time Protocol - ensures synchronized timestamps across systems
Checkpoint | Persistent snapshot of agent state enabling recovery after failures


ADR-110 | Created: 2026-01-24 | Status: Proposed