ADR-108: Agent Checkpoint and Handoff Protocol

Status

Proposed

Context

Problem Statement

Current agent handoff relies on implicit state (context window contents, git history). The Ralph Wiggum community has empirically validated that explicit, structured checkpoints are essential for:

  • Reliable recovery from failures
  • Clean context window transitions
  • Compliance audit trails
  • Cost attribution per task segment

Key Insight from Ralph Wiggum Analysis

"Single-context loops degrade; fresh-context iterations maintain quality."

The Ralph Wiggum technique demonstrates that autonomous agent loops work best when:

  1. Each iteration starts with a fresh context window
  2. State persists externally (not in the context)
  3. Handoff protocols are explicit and structured

Current State

  • Agents rely on context window for state (degrades over time)
  • No standardized handoff format between agent iterations
  • Recovery from failures requires manual intervention
  • No compliance audit trail for state transitions
  • Cost attribution per task segment is impossible

CODITECT Advantage

CODITECT's database-backed, event-driven architecture offers stronger guarantees than the git-file-based state management used by Ralph implementations: transactional writes, indexed queries over checkpoint history, and tenant isolation. This ADR formalizes a checkpoint protocol that leverages this advantage.

Database Architecture Note (ADR-002, ADR-089):

  • Local: SQLite (context.db) for offline operation
  • Cloud: PostgreSQL with Row-Level Security (RLS) for multi-tenant isolation
  • This aligns with ADR-002's decision to use PostgreSQL for cloud, rejecting FoundationDB due to operational complexity and lack of managed services.

Decision

Implement a standardized checkpoint protocol that enables agents to persist state to the appropriate database layer (SQLite locally, PostgreSQL in cloud), allowing fresh-context iterations while maintaining full task continuity and compliance evidence.

1. Checkpoint Schema

checkpoint_schema:
  version: "1.0"

  metadata:
    checkpoint_id: string          # UUID v7 (time-ordered)
    task_id: string                # Parent task reference
    agent_id: string               # Executing agent identifier
    agent_type: enum               # [architecture, implementation, qa, documentation]
    iteration: integer             # Loop iteration number
    timestamp: datetime            # ISO 8601 with timezone

  execution_state:
    phase: enum                    # [planning, implementing, testing, reviewing, complete]
    completed_items: array         # List of completed work items
    pending_items: array           # Remaining work items
    blocked_items: array           # Items with blockers
    current_focus: string          # Active work item ID

  context_summary:
    key_decisions: array           # Architectural decisions made
    assumptions: array             # Assumptions in effect
    constraints: array             # Active constraints
    external_dependencies: array   # Third-party dependencies

  metrics:
    tokens_consumed: integer       # Total tokens this iteration
    tools_invoked: integer         # Tool call count
    files_modified: array          # List of changed files
    tests_status:
      passed: integer
      failed: integer
      skipped: integer
      coverage_percent: float

  recovery:
    last_successful_state: string  # Reference to prior checkpoint
    rollback_instructions: string  # How to undo current iteration
    continuation_prompt: string    # Prompt to resume work

  compliance:
    event_log_ref: string          # Event log stream reference (PostgreSQL in cloud, per ADR-002)
    hash: string                   # SHA-256 of checkpoint content
    signature: string              # Optional cryptographic signature
    retention_policy: string       # Compliance retention requirement
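
The schema says compliance.hash is the SHA-256 of the checkpoint content but does not say how that content is serialized. The following is a minimal sketch, assuming canonical JSON (sorted keys, compact separators) and assuming the hash covers everything except the hash and signature fields themselves. The function names seal and verify are hypothetical.

```python
import copy
import hashlib
import hmac
import json
from typing import Any


def canonical_bytes(checkpoint: dict[str, Any]) -> bytes:
    """Serialize deterministically, leaving out compliance.hash and compliance.signature."""
    body = {k: v for k, v in checkpoint.items() if k != "compliance"}
    body["compliance"] = {
        k: v
        for k, v in checkpoint.get("compliance", {}).items()
        if k not in ("hash", "signature")
    }
    # Assumption: sorted keys + compact separators is the canonical form.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def compute_hash(checkpoint: dict[str, Any]) -> str:
    return hashlib.sha256(canonical_bytes(checkpoint)).hexdigest()


def seal(checkpoint: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the checkpoint with compliance.hash filled in."""
    sealed = copy.deepcopy(checkpoint)
    digest = compute_hash(sealed)
    sealed.setdefault("compliance", {})["hash"] = digest
    return sealed


def verify(checkpoint: dict[str, Any]) -> bool:
    """True when the stored hash matches the recomputed one."""
    stored = checkpoint.get("compliance", {}).get("hash")
    if stored is None:
        return False
    return hmac.compare_digest(stored, compute_hash(checkpoint))
```

Excluding the hash and signature from the hashed bytes lets a signature be attached after sealing without invalidating the hash.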

2. Database Schema

Local SQLite (context.db):

CREATE TABLE checkpoints (
    id TEXT PRIMARY KEY,                -- UUID v7
    task_id TEXT NOT NULL,
    agent_id TEXT NOT NULL,
    agent_type TEXT NOT NULL,
    iteration INTEGER DEFAULT 1,
    phase TEXT NOT NULL,                -- planning, implementing, testing, reviewing, complete
    checkpoint_data JSON NOT NULL,      -- Full checkpoint schema
    hash TEXT NOT NULL,                 -- SHA-256 integrity
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (task_id) REFERENCES tasks(id)
);

CREATE INDEX idx_checkpoints_task ON checkpoints(task_id, created_at DESC);
CREATE INDEX idx_checkpoints_agent ON checkpoints(agent_id, created_at DESC);
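
As a rough illustration of how the local table might be used, the sketch below creates the schema in an in-memory SQLite database with Python's standard sqlite3 module and reads back the latest checkpoint for a task. The one-column tasks table and the helper names are assumptions made for the example. Two details are worth noting: SQLite only enforces the foreign key when PRAGMA foreign_keys is switched on, and CURRENT_TIMESTAMP has one-second resolution, so the query breaks ties on iteration.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE tasks (id TEXT PRIMARY KEY);  -- hypothetical minimal parent table
CREATE TABLE checkpoints (
    id TEXT PRIMARY KEY,
    task_id TEXT NOT NULL,
    agent_id TEXT NOT NULL,
    agent_type TEXT NOT NULL,
    iteration INTEGER DEFAULT 1,
    phase TEXT NOT NULL,
    checkpoint_data JSON NOT NULL,
    hash TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (task_id) REFERENCES tasks(id)
);
CREATE INDEX idx_checkpoints_task ON checkpoints(task_id, created_at DESC);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript(SCHEMA)
    return conn


def insert_checkpoint(conn, cp_id, task_id, agent_id, agent_type,
                      iteration, phase, data, digest):
    """Insert one row inside a transaction; rolls back on error."""
    with conn:
        conn.execute(
            "INSERT INTO checkpoints (id, task_id, agent_id, agent_type,"
            " iteration, phase, checkpoint_data, hash)"
            " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (cp_id, task_id, agent_id, agent_type, iteration, phase,
             json.dumps(data), digest),
        )


def latest_checkpoint(conn, task_id):
    """Return (id, iteration, phase, data) for the newest checkpoint, or None."""
    row = conn.execute(
        "SELECT id, iteration, phase, checkpoint_data FROM checkpoints"
        " WHERE task_id = ?"
        " ORDER BY created_at DESC, iteration DESC LIMIT 1",
        (task_id,),
    ).fetchone()
    if row is None:
        return None
    return row[0], row[1], row[2], json.loads(row[3])
```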

Cloud PostgreSQL (with RLS):

CREATE TABLE checkpoints (
    -- gen_random_uuid() yields a UUID v4; supply a UUID v7 from the application
    -- when time-ordered IDs are required, as the checkpoint schema specifies.
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    organization_id UUID NOT NULL REFERENCES organizations(id),
    task_id UUID NOT NULL REFERENCES tasks(id),
    agent_id TEXT NOT NULL,
    agent_type TEXT NOT NULL,
    iteration INTEGER DEFAULT 1,
    phase TEXT NOT NULL,
    checkpoint_data JSONB NOT NULL,     -- PostgreSQL JSONB for efficient queries
    hash TEXT NOT NULL,
    signature TEXT,                     -- Optional cryptographic signature
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Row-Level Security for multi-tenant isolation
ALTER TABLE checkpoints ENABLE ROW LEVEL SECURITY;

CREATE POLICY org_isolation_checkpoints ON checkpoints
    FOR ALL USING (
        organization_id IN (
            SELECT organization_id FROM organization_members
            WHERE user_id = current_setting('app.current_user_id')::UUID
        )
    );

CREATE INDEX idx_checkpoints_task ON checkpoints(task_id, created_at DESC);
CREATE INDEX idx_checkpoints_org ON checkpoints(organization_id, created_at DESC);
CREATE INDEX idx_checkpoints_data ON checkpoints USING gin(checkpoint_data);
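
The checkpoint schema calls for time-ordered UUID v7 identifiers, while the column default above produces random UUIDs. One option is to generate the ID in the application and pass it on insert. The sketch below follows the UUID v7 bit layout from RFC 9562 (48-bit millisecond timestamp, version, variant, random bits). The function name uuid7 and its ts_ms parameter are choices made for this example, not part of the ADR.

```python
import os
import time
import uuid
from typing import Optional


def uuid7(ts_ms: Optional[int] = None) -> uuid.UUID:
    """Build a UUID v7: 48-bit Unix time in ms, then version, variant, random bits."""
    if ts_ms is None:
        ts_ms = time.time_ns() // 1_000_000
    value = (ts_ms & 0xFFFF_FFFF_FFFF) << 80            # timestamp in the top 48 bits
    value |= int.from_bytes(os.urandom(10), "big")      # 80 random bits below it
    value = (value & ~(0xF << 76)) | (0x7 << 76)        # version field = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)        # variant field = 0b10
    return uuid.UUID(int=value)
```

IDs generated in different milliseconds sort by creation time. Within the same millisecond the order is random in this sketch, so queries that need a strict order should still sort on created_at or iteration.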

3. Checkpoint Operations

Write Operations:

  • create_checkpoint(task_id, agent_id, state) → checkpoint_id
  • update_checkpoint(checkpoint_id, partial_state) → checkpoint_id
  • finalize_checkpoint(checkpoint_id, status) → void
  • link_checkpoints(parent_id, child_id) → void

Read Operations:

  • get_checkpoint(checkpoint_id) → Checkpoint
  • get_latest_checkpoint(task_id) → Checkpoint
  • get_checkpoint_history(task_id, limit) → [Checkpoint]
  • get_checkpoints_by_agent(agent_id, time_range) → [Checkpoint]
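
To make the operation signatures concrete, here is a small in-memory sketch of most of the write and read operations listed above. It is illustrative only: the class name, the sequential IDs standing in for UUID v7, the shallow merge in update_checkpoint, and the rule that a finalized checkpoint rejects further updates are all assumptions, since the ADR does not define those semantics.

```python
import itertools
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Checkpoint:
    checkpoint_id: str
    task_id: str
    agent_id: str
    state: dict[str, Any]
    status: Optional[str] = None      # set by finalize_checkpoint
    parent_id: Optional[str] = None   # set by link_checkpoints


class InMemoryCheckpointStore:
    def __init__(self) -> None:
        self._items: dict[str, Checkpoint] = {}
        self._by_task: dict[str, list[str]] = {}
        self._seq = itertools.count(1)

    def create_checkpoint(self, task_id: str, agent_id: str, state: dict[str, Any]) -> str:
        checkpoint_id = f"cp-{next(self._seq)}"  # stand-in for a UUID v7
        self._items[checkpoint_id] = Checkpoint(checkpoint_id, task_id, agent_id, dict(state))
        self._by_task.setdefault(task_id, []).append(checkpoint_id)
        return checkpoint_id

    def update_checkpoint(self, checkpoint_id: str, partial_state: dict[str, Any]) -> str:
        cp = self._items[checkpoint_id]
        if cp.status is not None:
            raise ValueError(f"{checkpoint_id} is finalized")  # assumed immutability rule
        cp.state.update(partial_state)  # assumed shallow merge
        return checkpoint_id

    def finalize_checkpoint(self, checkpoint_id: str, status: str) -> None:
        self._items[checkpoint_id].status = status

    def link_checkpoints(self, parent_id: str, child_id: str) -> None:
        if parent_id not in self._items:
            raise KeyError(parent_id)
        self._items[child_id].parent_id = parent_id

    def get_checkpoint(self, checkpoint_id: str) -> Checkpoint:
        return self._items[checkpoint_id]

    def get_latest_checkpoint(self, task_id: str) -> Checkpoint:
        return self._items[self._by_task[task_id][-1]]

    def get_checkpoint_history(self, task_id: str, limit: int) -> list[Checkpoint]:
        """Newest first, at most `limit` entries."""
        ids = self._by_task.get(task_id, [])
        return [self._items[i] for i in reversed(ids)][:limit]
```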

4. Handoff Protocol

PRE-HANDOFF (Current Agent):
  1. Detect handoff trigger (context > 70% OR task complete OR error threshold)
  2. Generate continuation_prompt summarizing:
     - What was accomplished
     - What remains
     - Current blockers
     - Recommended next steps
  3. Write final checkpoint with phase="handoff"
  4. Emit AGENT_HANDOFF event to orchestrator

HANDOFF (Orchestrator):
  5. Receive AGENT_HANDOFF event
  6. Validate checkpoint integrity (hash verification)
  7. Select next agent (same type for continuation, different for phase change)
  8. Inject checkpoint context into new agent's system prompt
  9. Spawn new agent with fresh context window

POST-HANDOFF (New Agent):
  10. Read latest checkpoint
  11. Acknowledge checkpoint receipt
  12. Resume from continuation_prompt
  13. Create new checkpoint linking to parent
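
The orchestrator side of the protocol (steps 6 to 8) can be sketched as a pure function that validates the checkpoint and returns what the new agent should be started with. Everything named here is hypothetical: the PHASE_TO_AGENT mapping, the prompt layout, the next_phase argument, and the canonical-JSON hashing are assumptions made so the example runs on its own.

```python
import hashlib
import json
from typing import Any, Optional

# Hypothetical mapping from the phase being entered to the agent type that handles it.
PHASE_TO_AGENT = {
    "planning": "architecture",
    "implementing": "implementation",
    "testing": "qa",
    "reviewing": "documentation",
}


def content_hash(data: dict[str, Any]) -> str:
    """SHA-256 over a canonical JSON form (an assumed serialization)."""
    payload = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def select_agent_type(current_type: str, next_phase: Optional[str] = None) -> str:
    """Step 7: same type for a continuation, mapped type for a phase change."""
    if next_phase is None:
        return current_type
    return PHASE_TO_AGENT[next_phase]


def build_system_prompt(data: dict[str, Any]) -> str:
    """Step 8: render checkpoint context for the new agent's system prompt."""
    meta = data["metadata"]
    state = data["execution_state"]
    lines = [
        f"Task {meta['task_id']}, iteration {meta['iteration'] + 1}.",
        "Completed: " + ", ".join(state["completed_items"]),
        "Pending: " + ", ".join(state["pending_items"]),
        "Blocked: " + ", ".join(state["blocked_items"]),
        "",
        data["recovery"]["continuation_prompt"],
    ]
    return "\n".join(lines)


def plan_handoff(data: dict[str, Any], stored_hash: str,
                 next_phase: Optional[str] = None) -> dict[str, str]:
    """Steps 6-8. The caller performs step 9 (spawning) with the result."""
    if content_hash(data) != stored_hash:
        raise ValueError("checkpoint hash mismatch; refusing to hand off")
    return {
        "agent_type": select_agent_type(data["metadata"]["agent_type"], next_phase),
        "system_prompt": build_system_prompt(data),
        "parent_checkpoint_id": data["metadata"]["checkpoint_id"],
    }
```

Returning parent_checkpoint_id gives the new agent what it needs for step 13, where it links its first checkpoint to the parent.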

5. Recovery Protocol

ON AGENT FAILURE:
  1. Orchestrator detects agent termination (timeout, error, crash)
  2. Retrieve last successful checkpoint
  3. Analyze failure:
     - If recoverable: spawn new agent from checkpoint
     - If unrecoverable: mark task blocked, alert human
  4. Log recovery attempt to compliance trail

ON CHECKPOINT CORRUPTION:
  1. Detect via hash mismatch
  2. Attempt recovery from last_successful_state reference
  3. If chain broken: escalate to human intervention
  4. Never silently continue with corrupt state
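
The corruption path can be sketched as a walk back along last_successful_state references until a checkpoint with a matching hash is found. This is one possible reading, not the specified algorithm: the record layout (a dict with "data" and "hash"), the ChainBrokenError name, and the canonical-JSON hashing are assumptions. It also assumes the reference field is still readable in a corrupt record; an implementation could store that reference in its own column to avoid relying on the damaged payload.

```python
import hashlib
import json
from typing import Any


class ChainBrokenError(Exception):
    """No intact checkpoint is reachable; escalate to a human."""


def digest(data: dict[str, Any]) -> str:
    payload = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def last_good_checkpoint(records: dict[str, dict[str, Any]], start_id: str) -> str:
    """Return the ID of the newest checkpoint in the chain whose hash verifies."""
    seen: set[str] = set()
    current = start_id
    while current is not None:
        if current in seen:
            raise ChainBrokenError(f"cycle detected at {current}")
        seen.add(current)
        record = records.get(current)
        if record is None:
            raise ChainBrokenError(f"missing checkpoint {current}")
        if digest(record["data"]) == record["hash"]:
            return current
        # Hash mismatch: never continue with this state; follow the reference back.
        current = record["data"].get("recovery", {}).get("last_successful_state")
    raise ChainBrokenError("no intact checkpoint in chain")
```

Raising instead of returning a default keeps the "never silently continue" rule explicit: the caller has to handle the broken chain.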

6. Handoff Triggers

| Trigger | Threshold | Action |
|---|---|---|
| Context utilization | > 70% | Initiate handoff |
| Task phase complete | Phase boundary | Handoff to next phase agent |
| Error count | > 3 consecutive | Handoff with error context |
| Explicit request | Agent requests | Immediate handoff |
| Token budget | > 80% of task budget | Handoff with budget warning |
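
The trigger table translates fairly directly into a check that runs between agent steps. The sketch below is one way to write it. The AgentStatus fields, the returned reason strings, and the order in which triggers are tested are assumptions, since the ADR does not rank the triggers. The thresholds use integer arithmetic so the "> 70%" and "> 80%" boundaries are exact.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentStatus:
    context_used: int          # tokens currently in the context window
    context_limit: int         # context window size in tokens
    consecutive_errors: int
    tokens_consumed: int       # tokens spent on the task so far
    task_token_budget: int
    phase_complete: bool = False
    handoff_requested: bool = False


def handoff_trigger(s: AgentStatus) -> Optional[str]:
    """Return the reason for a handoff, or None when no trigger has fired."""
    if s.handoff_requested:
        return "explicit_request"
    if s.phase_complete:
        return "phase_complete"
    if s.consecutive_errors > 3:
        return "error_threshold"
    if s.tokens_consumed * 10 > s.task_token_budget * 8:   # > 80% of budget
        return "token_budget"
    if s.context_used * 10 > s.context_limit * 7:          # > 70% of context
        return "context_utilization"
    return None
```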

Consequences

Positive

  1. Reliable Recovery - Agents can resume from any checkpoint after failures
  2. Fresh Context Quality - Each iteration starts clean, avoiding degradation
  3. Compliance Ready - Full audit trail for FDA 21 CFR Part 11, HIPAA, SOC2
  4. Cost Attribution - Token consumption tracked per checkpoint/iteration
  5. Superior to Ralph - Database ACID transactions instead of file-based state tracked in git
  6. Multi-Tenant Ready - PostgreSQL RLS ensures complete tenant isolation

Negative

  1. Latency Overhead - Checkpoint writes add ~100ms per handoff
  2. Storage Growth - Checkpoint history accumulates (mitigated by retention policies)
  3. Complexity - More moving parts than simple context-based state

Risks

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Database latency spikes | Medium | Agent delays | Local SQLite buffer, async cloud sync |
| Checkpoint size exceeds limits | Low | Write failures | Compression, summary truncation |
| Concurrent modification | Medium | Data loss | Optimistic locking, conflict detection |
| Recovery loop | Medium | Infinite retries | Circuit breaker, max retry limits |
| Sync conflicts (local/cloud) | Medium | Data inconsistency | Last-write-wins with conflict log |
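
The recovery-loop mitigation (circuit breaker with a retry limit) might look like the sketch below. It is a simplified per-task counter rather than a full breaker with half-open states, and the class name, the default of three attempts, and the spawn_agent callback are hypothetical.

```python
from typing import Callable


class RecoveryCircuitBreaker:
    """Counts consecutive failed recoveries per task and opens after a limit."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts
        self._failures: dict[str, int] = {}

    def allow(self, task_id: str) -> bool:
        return self._failures.get(task_id, 0) < self.max_attempts

    def record_failure(self, task_id: str) -> None:
        self._failures[task_id] = self._failures.get(task_id, 0) + 1

    def record_success(self, task_id: str) -> None:
        self._failures.pop(task_id, None)


def recover(task_id: str, breaker: RecoveryCircuitBreaker,
            spawn_agent: Callable[[str], bool]) -> str:
    """Retry recovery until it succeeds or the breaker opens."""
    while breaker.allow(task_id):
        if spawn_agent(task_id):       # True means the new agent resumed the task
            breaker.record_success(task_id)
            return "recovered"
        breaker.record_failure(task_id)
    return "blocked"                   # caller marks the task blocked and alerts a human
```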

Performance Requirements

| Metric | Target |
|---|---|
| Checkpoint write latency | < 100ms p99 |
| Checkpoint read latency | < 10ms p99 |
| Recovery success rate | > 99.9% |
| Audit trail completeness | 100% |
| Context window at handoff | < 70% utilized |

Compliance Requirements

| Standard | Requirement | Implementation |
|---|---|---|
| FDA 21 CFR Part 11 | Electronic signatures | Cryptographic signing of checkpoints |
| HIPAA | Encryption | At-rest and in-transit encryption |
| SOC2 | Audit trail | Complete event log in PostgreSQL with RLS |
| Data Retention | Configurable | Per-task retention policy field |

Implementation Phases

Phase 1: Schema and Storage (2 weeks)

  • Define TypeScript/Rust types for Checkpoint
  • Implement SQLite repository (local) and PostgreSQL repository (cloud)
  • Add transaction handling and optimistic locking
  • Implement local-to-cloud sync (cursor-based polling per ADR-053)
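
For the optimistic locking item, a common approach is a version column that every update must match. The sketch below shows the idea against SQLite. The version column is hypothetical (it is not part of the schema in this ADR), as are the helper and exception names.

```python
import json
import sqlite3


class StaleCheckpointError(Exception):
    """Another writer updated the row after it was read."""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE checkpoints ("
        " id TEXT PRIMARY KEY,"
        " checkpoint_data JSON NOT NULL,"
        " version INTEGER NOT NULL DEFAULT 1)"  # hypothetical column for locking
    )
    return conn


def read_checkpoint(conn: sqlite3.Connection, checkpoint_id: str):
    """Return (data, version) so the caller can pass the version back on write."""
    row = conn.execute(
        "SELECT checkpoint_data, version FROM checkpoints WHERE id = ?",
        (checkpoint_id,),
    ).fetchone()
    if row is None:
        raise KeyError(checkpoint_id)
    return json.loads(row[0]), row[1]


def write_checkpoint(conn: sqlite3.Connection, checkpoint_id: str,
                     data: dict, expected_version: int) -> int:
    """Update only if the row still has the version the caller read."""
    with conn:
        cur = conn.execute(
            "UPDATE checkpoints SET checkpoint_data = ?, version = version + 1"
            " WHERE id = ? AND version = ?",
            (json.dumps(data), checkpoint_id, expected_version),
        )
    if cur.rowcount != 1:
        raise StaleCheckpointError(checkpoint_id)
    return expected_version + 1
```

A writer that receives StaleCheckpointError would re-read the row and decide whether to merge or abandon its change, which is where the conflict detection listed under Risks would apply.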

Phase 2: Checkpoint Operations (2 weeks)

  • Implement write operations with hash generation
  • Implement read operations with integrity verification
  • Add compliance layer (signing, audit events)

Phase 3: Handoff Protocol (2 weeks)

  • Implement handoff trigger detection
  • Create continuation prompt generation
  • Integrate with orchestrator

Phase 4: Recovery and Testing (2 weeks)

  • Implement failure detection and recovery
  • Add circuit breaker for repeated failures
  • Comprehensive testing (unit, integration, chaos)

Related ADRs

  • ADR-001: Event-Driven Architecture (checkpoint events)
  • ADR-002: PostgreSQL as Primary Database (cloud storage, RLS for multi-tenant)
  • ADR-053: Cloud Context Sync Architecture (local-to-cloud sync protocol)
  • ADR-089: Two-Database Architecture (local SQLite + cloud PostgreSQL)
  • ADR-109: QA Agent Browser Automation (depends on checkpoint for test results)
  • ADR-110: Agent Health Monitoring (uses checkpoint timestamps)
  • ADR-111: Token Economics (checkpoint includes token metrics)
  • ADR-112: Ralph Wiggum Database Architecture (consolidates DB decisions)

Glossary

| Term | Definition |
|---|---|
| ADR | Architecture Decision Record - document capturing significant architectural decisions with context and consequences |
| ACID | Atomicity, Consistency, Isolation, Durability - database transaction properties ensuring data integrity |
| Checkpoint | Persistent snapshot of agent execution state enabling recovery and handoff |
| Handoff | Transfer of execution context from one agent instance to another (typically with fresh context window) |
| RLS | Row-Level Security - PostgreSQL feature that automatically filters query results based on tenant context |
| Ralph Wiggum | Autonomous agent loop technique named after The Simpsons character, enabling persistent iteration until task completion |
| SHA-256 | Secure Hash Algorithm producing 256-bit hash for data integrity verification |
| UUID v7 | Universally Unique Identifier version 7 - time-ordered UUIDs for sortable, unique keys |
| FDA 21 CFR Part 11 | US FDA regulation for electronic records and signatures in regulated industries |
| HIPAA | Health Insurance Portability and Accountability Act - US healthcare data privacy regulation |
| SOC2 | Service Organization Control 2 - security compliance framework for service providers |

ADR-108 | Created: 2026-01-24 | Status: Proposed