# ADR-110: Agent Health Monitoring Layer

## Status

Proposed
## Context

### Problem Statement

Autonomous agent loops can become stuck or unresponsive, or enter infinite retry cycles. Without active health monitoring, these failure modes:

- Consume resources indefinitely
- Block dependent tasks
- Generate costs without progress
- Require manual detection and intervention
### Key Insight from Gas Town Architecture

Gas Town's GUPP principle ("If there is work on your Hook, YOU MUST RUN IT") and its Witness/Deacon pattern provide a battle-tested model for stuck detection and intervention:

- Witness - performs health checks and nudges stuck workers
- Deacon - runs patrol cycles and handles escalation
### Current State

- No heartbeat mechanism for agents
- No stuck detection
- No graduated intervention
- No circuit breaker pattern
- Manual detection required for failed agents
- No self-healing capability
## Decision

Implement a comprehensive health monitoring layer that detects stuck or unresponsive agents, applies graduated intervention, enforces circuit breaker patterns, and enables self-healing through automatic recovery from checkpoints.
### 1. Health Status Model

```yaml
health_states:
  HEALTHY:
    description: "Agent is making progress"
    indicators:
      - Recent checkpoint update (< 10 min)
      - Tool calls being made
      - No error accumulation

  DEGRADED:
    description: "Agent showing warning signs"
    indicators:
      - Checkpoint stale (10-30 min)
      - High error rate (> 25%)
      - Token consumption anomaly
    actions:
      - Increase monitoring frequency
      - Prepare intervention

  STUCK:
    description: "Agent not making progress"
    indicators:
      - No checkpoint update (> 30 min)
      - Repeated identical operations
      - Context exhaustion without handoff
    actions:
      - Initiate nudge sequence
      - Alert orchestrator

  FAILING:
    description: "Agent in error loop"
    indicators:
      - Circuit breaker tripped
      - Consecutive failures (> 3)
      - Unrecoverable errors detected
    actions:
      - Terminate agent
      - Initiate recovery from checkpoint

  TERMINATED:
    description: "Agent has been stopped"
    indicators:
      - Explicit termination
      - Resource limit exceeded
      - Unrecoverable failure
    actions:
      - Cleanup resources
      - Archive final state
```
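
To make the state model concrete, here is a minimal classifier sketch in TypeScript using the thresholds above. The `AgentVitals` shape and `classifyHealth` name are illustrative assumptions, not an existing API; `TERMINATED` is set explicitly by the intervention layer rather than inferred from telemetry.

```typescript
type HealthState = 'HEALTHY' | 'DEGRADED' | 'STUCK' | 'FAILING' | 'TERMINATED';

// Hypothetical telemetry snapshot; field names are illustrative.
interface AgentVitals {
  secondsSinceCheckpoint: number; // age of the last checkpoint update
  errorRate: number;              // errors / total operations, in [0, 1]
  consecutiveFailures: number;
  circuitBreakerTripped: boolean;
}

function classifyHealth(v: AgentVitals): HealthState {
  if (v.circuitBreakerTripped || v.consecutiveFailures > 3) return 'FAILING';
  if (v.secondsSinceCheckpoint > 1800) return 'STUCK';      // > 30 min stale
  if (v.secondsSinceCheckpoint > 600 || v.errorRate > 0.25) return 'DEGRADED';
  return 'HEALTHY';
}
```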
### 2. Heartbeat Protocol

```yaml
heartbeat:
  interval_seconds: 300   # 5 minutes
  timeout_seconds: 600    # 2x interval
  payload:
    agent_id: string
    task_id: string
    timestamp: datetime
    phase: string
    last_tool_call: datetime
    token_count: integer
    error_count: integer
    progress_indicator: string

progress_detection:
  checkpoint_comparison: true    # Compare checkpoint timestamps
  tool_call_analysis:
    repeated_identical: stuck    # Same calls repeated = stuck
    no_calls_threshold: 900      # 15+ min without calls = stuck
    diverse_calls: healthy
  token_consumption:
    sudden_spike: potential_loop
    zero_consumption: stuck
    steady_rate: healthy
```
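
A sketch of the agent-side emitter under these settings; the `collect` and `publish` callbacks are assumed injection points for gathering vitals and for the transport (queue, HTTP, event bus), not a defined interface:

```typescript
// Mirrors the heartbeat payload schema above (camelCase on the wire is an
// assumption here; the YAML schema uses snake_case field names).
interface HeartbeatPayload {
  agentId: string;
  taskId: string;
  timestamp: string;      // ISO-8601 datetime
  phase: string;
  lastToolCall: string;   // ISO-8601 datetime
  tokenCount: number;
  errorCount: number;
  progressIndicator: string;
}

function startHeartbeat(
  collect: () => HeartbeatPayload,
  publish: (hb: HeartbeatPayload) => Promise<void>,
  intervalSeconds = 300, // matches heartbeat.interval_seconds
): ReturnType<typeof setInterval> {
  return setInterval(() => {
    // A failed publish is logged, not fatal: the monitor's timeout
    // (2x the interval) is what ultimately flags a silent agent.
    publish(collect()).catch((err) => console.error('heartbeat failed', err));
  }, intervalSeconds * 1000);
}
```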
### 3. Intervention Protocol

```text
GRADUATED INTERVENTION SEQUENCE:

Level 1 - NUDGE (soft intervention):
├── Trigger: STUCK state detected
├── Action: Inject reminder into agent context
│     "REMINDER: You have been working for {duration} without
│      checkpoint update. Please either:
│      - Update your progress
│      - Request handoff if stuck
│      - Report blockers"
├── Wait: 10 minutes
└── Escalate if no response

Level 2 - ESCALATE (orchestrator alert):
├── Trigger: Nudge unsuccessful (3 attempts)
├── Action:
│   ├── Alert orchestrator
│   ├── Log escalation event
│   └── Prepare recovery options
├── Orchestrator decision:
│   ├── Allow more time
│   ├── Force handoff
│   └── Terminate and recover
└── Wait: Orchestrator response or timeout (15 min)

Level 3 - TERMINATE (forced stop):
├── Trigger: Escalation timeout or orchestrator decision
├── Action:
│   ├── Force agent termination
│   ├── Save partial state to checkpoint
│   ├── Clean up resources
│   └── Initiate recovery from last valid checkpoint
└── Post-action: Recovery or human escalation
```
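
A sketch of the Level 1 → Level 2 handoff, assuming `sendNudge` and `isResponsive` hooks into the agent runtime (hypothetical names). Attempt count and spacing mirror the configuration schema in section 7 (3 attempts, 10 minutes apart):

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function nudgeSequence(
  agentId: string,
  sendNudge: (agentId: string, message: string) => Promise<void>,
  isResponsive: (agentId: string) => Promise<boolean>,
  maxAttempts = 3,        // intervention.nudge.max_attempts
  intervalSeconds = 600,  // intervention.nudge.interval_seconds
): Promise<'recovered' | 'escalate'> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    await sendNudge(
      agentId,
      'REMINDER: no checkpoint update detected. Update your progress, ' +
        'request handoff if stuck, or report blockers.',
    );
    await sleep(intervalSeconds * 1000);                 // wait 10 minutes
    if (await isResponsive(agentId)) return 'recovered'; // nudge worked
  }
  return 'escalate'; // Level 2: alert the orchestrator
}
```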
### 4. Circuit Breaker Pattern

```yaml
circuit_breaker:
  states:
    CLOSED:
      description: "Normal operation"
      behavior: "Requests pass through, failures counted"
      transition: "Trips to OPEN at the failure threshold"
    OPEN:
      description: "Blocking"
      behavior: "Requests fail immediately"
      transition: "HALF_OPEN after the recovery timeout"
    HALF_OPEN:
      description: "Testing"
      behavior: "Limited requests allowed"
      transition: "CLOSED on success, OPEN on failure"

  configuration:
    failure_threshold: 3            # Consecutive failures to trip
    recovery_timeout_seconds: 60    # Time in OPEN before HALF_OPEN
    half_open_requests: 1           # Requests allowed in HALF_OPEN

  per_agent_breakers:
    - tool_execution
    - checkpoint_write
    - external_api
    - recovery_attempt
```
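
A minimal breaker implementing these states and defaults; a sketch rather than a production implementation (no metrics emission, and `half_open_requests` is fixed at one probe):

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class CircuitBreaker {
  private state: BreakerState = 'CLOSED';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 3,        // consecutive failures to trip
    private recoveryTimeoutMs = 60_000,  // time in OPEN before HALF_OPEN
  ) {}

  async exec<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt < this.recoveryTimeoutMs) {
        throw new Error('circuit open: request rejected'); // fail fast
      }
      this.state = 'HALF_OPEN'; // recovery timeout elapsed: allow a probe
    }
    try {
      const result = await fn();
      this.state = 'CLOSED';    // success closes the breaker
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';    // trip (or re-trip) and restart the timeout
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```

Each per-agent breaker (tool_execution, checkpoint_write, etc.) would hold its own instance, constructed with the overrides from the configuration schema in section 7.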
### 5. Self-Healing Protocol

```text
RECOVERY DECISION TREE:
├── Is last checkpoint valid?
│   ├── Yes → Spawn new agent from checkpoint
│   └── No  → Try previous checkpoint
├── Is previous checkpoint valid?
│   ├── Yes → Spawn from previous (some work lost)
│   └── No  → Continue searching the chain
├── No valid checkpoint in chain?
│   ├── Reset to task start
│   └── Alert for human intervention
└── Recovery spawned?
    ├── Monitor closely for 15 min
    └── If it fails again → Human escalation

RECOVERY LIMITS:
├── Max recovery attempts per task: 3
├── Max recovery attempts per hour: 5
├── Backoff between attempts: exponential (1, 2, 4 min)
└── After limits: Mandatory human review
```
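
The chain walk reduces to finding the newest valid checkpoint, as sketched below; `Checkpoint`, `isValid`, and `spawnAgentFrom` are assumed interfaces into the checkpoint store (ADR-108) and agent runtime, and the per-task/per-hour attempt limits and backoff are enforced by the caller:

```typescript
interface Checkpoint {
  id: string;
  parentId?: string; // previous checkpoint in the chain
}

async function recoverTask(
  chain: Checkpoint[], // ordered newest first
  isValid: (cp: Checkpoint) => Promise<boolean>,
  spawnAgentFrom: (cp: Checkpoint) => Promise<string>,
): Promise<string | 'human_escalation'> {
  for (const cp of chain) {
    if (await isValid(cp)) {
      // Spawning from an older checkpoint loses some work but preserves
      // the task; the new agent is monitored closely for 15 minutes.
      return spawnAgentFrom(cp); // resolves to the new agent id
    }
    // Invalid: fall back to the previous checkpoint in the chain.
  }
  return 'human_escalation'; // no valid checkpoint anywhere in the chain
}
```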
### 6. Event Definitions

```typescript
type HealthEvents =
  | { type: 'AGENT_HEARTBEAT'; payload: HeartbeatPayload }
  | { type: 'HEALTH_STATE_CHANGED'; payload: {
      agentId: string; from: HealthState; to: HealthState; reason: string;
    }}
  | { type: 'NUDGE_SENT'; payload: { agentId: string; attempt: number; message: string }}
  | { type: 'ESCALATION_TRIGGERED'; payload: { agentId: string; reason: string }}
  | { type: 'AGENT_TERMINATED'; payload: { agentId: string; reason: string; checkpoint?: string }}
  | { type: 'RECOVERY_INITIATED'; payload: { taskId: string; fromCheckpoint: string; attempt: number }}
  | { type: 'RECOVERY_COMPLETED'; payload: { taskId: string; newAgentId: string }}
  | { type: 'RECOVERY_FAILED'; payload: { taskId: string; reason: string; attempts: number }}
  | { type: 'CIRCUIT_BREAKER_STATE_CHANGED'; payload: {
      name: string; from: BreakerState; to: BreakerState;
    }};
```
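
The discriminated union narrows cleanly in consumers; a small illustrative handler (the event-bus wiring itself is outside this ADR):

```typescript
function onHealthEvent(event: HealthEvents): void {
  switch (event.type) {
    case 'HEALTH_STATE_CHANGED':
      console.log(
        `${event.payload.agentId}: ${event.payload.from} -> ${event.payload.to}` +
          ` (${event.payload.reason})`,
      );
      break;
    case 'RECOVERY_FAILED':
      console.error(
        `recovery failed for ${event.payload.taskId} after ${event.payload.attempts} attempts`,
      );
      break;
    default:
      break; // remaining events are forwarded to metrics/alerting
  }
}
```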
### 7. Configuration Schema

```yaml
health_monitoring:
  enabled: true

  heartbeat:
    interval_seconds: 300                      # 5 minutes
    timeout_seconds: 600                       # 10 minutes

  health_check:
    interval_seconds: 60                       # 1 minute
    checkpoint_stale_threshold_seconds: 1800   # 30 minutes

  intervention:
    nudge:
      enabled: true
      max_attempts: 3
      interval_seconds: 600                    # 10 minutes between nudges
    escalation:
      timeout_seconds: 900                     # 15 minutes
    termination:
      cleanup_timeout_seconds: 30

  circuit_breaker:
    default:
      failure_threshold: 3
      recovery_timeout_seconds: 60
      half_open_requests: 1
    overrides:
      tool_execution:
        failure_threshold: 5
        recovery_timeout_seconds: 30
      checkpoint_write:
        failure_threshold: 3
        recovery_timeout_seconds: 120

  recovery:
    max_attempts_per_task: 3
    max_attempts_per_hour: 5
    backoff_seconds: [60, 120, 240]

  alerting:
    stuck_agent:
      channels: ["slack", "pagerduty"]
      severity: "warning"
    circuit_breaker_open:
      channels: ["slack"]
      severity: "warning"
    recovery_failed:
      channels: ["slack", "pagerduty"]
      severity: "critical"
```
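
A sketch of loading and narrowing this file, assuming it lives at config/health_monitoring.yaml and js-yaml is available; `HealthMonitoringConfig` covers only the fields used here, and both the path and the type name are illustrative:

```typescript
import { readFileSync } from 'fs';
import { load } from 'js-yaml';

interface HealthMonitoringConfig {
  enabled: boolean;
  heartbeat: { interval_seconds: number; timeout_seconds: number };
  recovery: { max_attempts_per_task: number; backoff_seconds: number[] };
}

function loadConfig(path = 'config/health_monitoring.yaml'): HealthMonitoringConfig {
  const doc = load(readFileSync(path, 'utf8')) as {
    health_monitoring?: HealthMonitoringConfig;
  };
  if (!doc?.health_monitoring) {
    throw new Error(`missing health_monitoring section in ${path}`);
  }
  return doc.health_monitoring;
}
```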
## Consequences

### Positive

- Automatic Detection - Stuck agents are detected within 5 minutes
- Graduated Response - Nudge → Escalate → Terminate prevents premature termination
- Self-Healing - Automatic recovery from checkpoints
- Resource Protection - Circuit breakers prevent cascade failures
- Observability - Real-time health dashboard
- Cost Control - Stuck agents no longer consume tokens indefinitely

### Negative

- Latency Overhead - Health checks add monitoring load
- False Positives - Tuning is required to avoid premature intervention
- Complexity - More infrastructure to maintain
- Recovery Costs - Failed recoveries consume additional tokens
### Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| False positive stuck detection | Medium | Unnecessary termination | Multiple signals, tunable thresholds |
| Monitor service failure | Low | Undetected stuck agents | Monitor redundancy, self-monitoring |
| Intervention storm | Low | Resource exhaustion | Rate limiting, backoff |
| Recovery loop | Medium | Infinite recovery attempts | Attempt limits, human escalation |
| Clock skew | Low | Incorrect timestamps | NTP sync, tolerance windows |
## Performance Requirements
| Metric | Target |
|---|---|
| Stuck detection latency | < 5 minutes |
| False positive rate | < 1% |
| Recovery success rate | > 95% |
| Mean time to intervention | < 10 minutes |
| Health check latency | < 100ms |
| Heartbeat processing | < 50ms |
| Concurrent agent monitors | 100+ |
| State transition latency | < 1s |
## Observability Requirements
- Real-time health dashboard
- Alert integration (PagerDuty, Slack)
- Historical health metrics (30 day retention)
- Intervention audit trail
- Distributed tracing for health checks
## Implementation Phases

### Phase 1: Health Status Infrastructure (Weeks 1-2)

- Health state machine implementation
- Heartbeat system (agent emission + monitor reception)
- Health check service (checkpoint-based, tool-call-based)

### Phase 2: Monitoring Service (Weeks 3-4)

- AgentMonitor class (per-agent monitoring)
- MonitorCoordinator service
- InterventionService (nudge, escalate, terminate)

### Phase 3: Circuit Breaker (Weeks 5-6)

- CircuitBreaker core implementation
- Integration with tool execution and checkpoint writes
- Circuit breaker dashboard

### Phase 4: Self-Healing and Testing (Weeks 7-8)

- RecoveryService implementation
- Checkpoint chain traversal
- Chaos testing (simulated stuck agents and failures)
## Related ADRs
- ADR-002: PostgreSQL as Primary Database (health events stored in PostgreSQL)
- ADR-089: Two-Database Architecture (local SQLite + cloud PostgreSQL)
- ADR-108: Checkpoint Protocol (recovery from checkpoints)
- ADR-109: Browser Automation (browser resource monitoring)
- ADR-111: Token Economics (health metrics include token consumption)
- ADR-112: Ralph Wiggum Database Architecture (consolidates DB decisions)
## Glossary
| Term | Definition |
|---|---|
| Circuit Breaker | Design pattern that stops repeated failures from cascading, with CLOSED/OPEN/HALF_OPEN states |
| Health State | Agent operational status: HEALTHY, DEGRADED, STUCK, FAILING, or TERMINATED |
| Heartbeat | Periodic signal from agent indicating it is still operational and making progress |
| Nudge | Soft intervention injecting reminder into agent context to prompt progress |
| Escalation | Alerting orchestrator when nudge fails to restore agent progress |
| Termination | Forced stop of unresponsive agent with state preservation |
| Self-Healing | Automatic recovery from checkpoints without human intervention |
| Gas Town | Steve Yegge's multi-agent orchestration architecture with Witness/Deacon patterns |
| GUPP | "If there is work on your Hook, YOU MUST RUN IT" - Gas Town execution principle |
| Witness | Gas Town agent role that performs health checks and nudges stuck workers |
| Deacon | Gas Town agent role that patrols and escalates persistent issues |
| NTP | Network Time Protocol - ensures synchronized timestamps across systems |
| Checkpoint | Persistent snapshot of agent state enabling recovery after failures |
ADR-110 | Created: 2026-01-24 | Status: Proposed