ADR-108: Agent Checkpoint and Handoff Protocol
Status
Proposed
Context
Problem Statement
Current agent handoff relies on implicit state (context window contents, git history). The Ralph Wiggum community has empirically validated that explicit, structured checkpoints are essential for:
- Reliable recovery from failures
- Clean context window transitions
- Compliance audit trails
- Cost attribution per task segment
Key Insight from Ralph Wiggum Analysis
"Single-context loops degrade; fresh-context iterations maintain quality."
The Ralph Wiggum technique demonstrates that autonomous agent loops work best when:
- Each iteration starts with a fresh context window
- State persists externally (not in the context)
- Handoff protocols are explicit and structured
Current State
- Agents rely on context window for state (degrades over time)
- No standardized handoff format between agent iterations
- Recovery from failures requires manual intervention
- No compliance audit trail for state transitions
- Cost attribution per task segment is impossible
CODITECT Advantage
CODITECT's database-backed, event-driven architecture gives checkpoint state transactional (ACID) guarantees that the git-file-based state management used by Ralph implementations does not provide. This ADR formalizes a checkpoint protocol that leverages this advantage.
Database Architecture Note (ADR-002, ADR-089):
- Local: SQLite (context.db) for offline operation
- Cloud: PostgreSQL with Row-Level Security (RLS) for multi-tenant isolation
- This aligns with ADR-002's decision to use PostgreSQL for cloud, rejecting FoundationDB due to operational complexity and lack of managed services.
Decision
Implement a standardized checkpoint protocol that enables agents to persist state to the appropriate database layer (SQLite locally, PostgreSQL in cloud), allowing fresh-context iterations while maintaining full task continuity and compliance evidence.
1. Checkpoint Schema
checkpoint_schema:
  version: "1.0"
  metadata:
    checkpoint_id: string            # UUID v7 (time-ordered)
    task_id: string                  # Parent task reference
    agent_id: string                 # Executing agent identifier
    agent_type: enum                 # [architecture, implementation, qa, documentation]
    iteration: integer               # Loop iteration number
    timestamp: datetime              # ISO 8601 with timezone
  execution_state:
    phase: enum                      # [planning, implementing, testing, reviewing, handoff, complete]
    completed_items: array           # List of completed work items
    pending_items: array             # Remaining work items
    blocked_items: array             # Items with blockers
    current_focus: string            # Active work item ID
  context_summary:
    key_decisions: array             # Architectural decisions made
    assumptions: array               # Assumptions in effect
    constraints: array               # Active constraints
    external_dependencies: array     # Third-party dependencies
  metrics:
    tokens_consumed: integer         # Total tokens this iteration
    tools_invoked: integer           # Tool call count
    files_modified: array            # List of changed files
    tests_status:
      passed: integer
      failed: integer
      skipped: integer
      coverage_percent: float
  recovery:
    last_successful_state: string    # Reference to prior checkpoint
    rollback_instructions: string    # How to undo current iteration
    continuation_prompt: string      # Prompt to resume work
  compliance:
    event_log_ref: string            # Event log stream reference (PostgreSQL per ADR-002, not FoundationDB)
    hash: string                     # SHA-256 of checkpoint content
    signature: string                # Optional cryptographic signature
    retention_policy: string         # Compliance retention requirement
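The `hash` field can be computed in more than one way. The following is a minimal Python sketch, assuming the hash covers the canonical JSON form of the whole checkpoint with the `hash` and `signature` fields excluded. The function names `canonical_json`, `seal_checkpoint`, and `verify_checkpoint` are hypothetical, and the placeholder IDs stand in for real UUID v7 values.

```python
import hashlib
import json


def canonical_json(payload: dict) -> str:
    """Serialize with sorted keys and fixed separators so equal content hashes equally."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def seal_checkpoint(checkpoint: dict) -> dict:
    """Return a copy whose compliance.hash is the SHA-256 of the rest of the content.

    Assumption: hash and signature are excluded from the hashed content; any
    existing signature is dropped because it would have to be recomputed.
    """
    body = json.loads(json.dumps(checkpoint))  # deep copy via JSON round trip
    compliance = body.setdefault("compliance", {})
    compliance.pop("hash", None)
    compliance.pop("signature", None)
    compliance["hash"] = hashlib.sha256(canonical_json(body).encode("utf-8")).hexdigest()
    return body


def verify_checkpoint(checkpoint: dict) -> bool:
    """Recompute the hash and compare it with the stored value."""
    stored = checkpoint.get("compliance", {}).get("hash")
    return stored is not None and seal_checkpoint(checkpoint)["compliance"]["hash"] == stored


# Hypothetical, heavily abbreviated checkpoint used only for illustration.
draft = {
    "version": "1.0",
    "metadata": {"checkpoint_id": "cp-1", "task_id": "task-1", "iteration": 3},
    "execution_state": {"phase": "implementing", "pending_items": ["write tests"]},
    "compliance": {"retention_policy": "default"},
}
sealed = seal_checkpoint(draft)
```

Sorting keys before hashing matters here: two serializations of the same checkpoint that differ only in key order would otherwise produce different hashes and trigger false corruption alarms.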
2. Database Schema
Local SQLite (context.db):
CREATE TABLE checkpoints (
    id TEXT PRIMARY KEY,                   -- UUID v7
    task_id TEXT NOT NULL,
    agent_id TEXT NOT NULL,
    agent_type TEXT NOT NULL,
    iteration INTEGER DEFAULT 1,
    phase TEXT NOT NULL,                   -- planning, implementing, testing, reviewing, handoff, complete
    checkpoint_data JSON NOT NULL,         -- Full checkpoint schema
    hash TEXT NOT NULL,                    -- SHA-256 integrity
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (task_id) REFERENCES tasks(id)  -- Enforced only when PRAGMA foreign_keys = ON
);
CREATE INDEX idx_checkpoints_task ON checkpoints(task_id, created_at DESC);
CREATE INDEX idx_checkpoints_agent ON checkpoints(agent_id, created_at DESC);
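To illustrate how the local table might be used, here is a small sketch built on Python's standard `sqlite3` module and an in-memory database. The helper names (`open_db`, `insert_checkpoint`, `latest_checkpoint`) and the minimal `tasks` table are assumptions for the example, not part of this ADR. Because `CURRENT_TIMESTAMP` has one-second resolution, the sketch breaks ties on `iteration`; a time-ordered UUID v7 `id` could serve the same purpose.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE tasks (id TEXT PRIMARY KEY);  -- minimal stand-in for the real tasks table
CREATE TABLE checkpoints (
    id TEXT PRIMARY KEY,
    task_id TEXT NOT NULL,
    agent_id TEXT NOT NULL,
    agent_type TEXT NOT NULL,
    iteration INTEGER DEFAULT 1,
    phase TEXT NOT NULL,
    checkpoint_data JSON NOT NULL,
    hash TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (task_id) REFERENCES tasks(id)
);
CREATE INDEX idx_checkpoints_task ON checkpoints(task_id, created_at DESC);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
    conn.executescript(SCHEMA)
    return conn


def insert_checkpoint(conn, cp_id, task_id, agent_id, agent_type, iteration, phase, data, digest):
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(
            "INSERT INTO checkpoints "
            "(id, task_id, agent_id, agent_type, iteration, phase, checkpoint_data, hash) "
            "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (cp_id, task_id, agent_id, agent_type, iteration, phase, json.dumps(data), digest),
        )


def latest_checkpoint(conn, task_id):
    row = conn.execute(
        "SELECT id, iteration, phase, checkpoint_data FROM checkpoints "
        "WHERE task_id = ? ORDER BY created_at DESC, iteration DESC LIMIT 1",
        (task_id,),
    ).fetchone()
    if row is None:
        return None
    return {"id": row[0], "iteration": row[1], "phase": row[2], "data": json.loads(row[3])}
```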
Cloud PostgreSQL (with RLS):
CREATE TABLE checkpoints (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),  -- Application should supply a UUID v7; this default yields a random (v4) UUID
    organization_id UUID NOT NULL REFERENCES organizations(id),
    task_id UUID NOT NULL REFERENCES tasks(id),
    agent_id TEXT NOT NULL,
    agent_type TEXT NOT NULL,
    iteration INTEGER DEFAULT 1,
    phase TEXT NOT NULL,
    checkpoint_data JSONB NOT NULL,                 -- PostgreSQL JSONB for efficient queries
    hash TEXT NOT NULL,
    signature TEXT,                                 -- Optional cryptographic signature
    created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Row-Level Security for multi-tenant isolation
ALTER TABLE checkpoints ENABLE ROW LEVEL SECURITY;
CREATE POLICY org_isolation_checkpoints ON checkpoints
    FOR ALL USING (
        organization_id IN (
            SELECT organization_id FROM organization_members
            WHERE user_id = current_setting('app.current_user_id')::UUID
        )
    );
CREATE INDEX idx_checkpoints_task ON checkpoints(task_id, created_at DESC);
CREATE INDEX idx_checkpoints_org ON checkpoints(organization_id, created_at DESC);
CREATE INDEX idx_checkpoints_data ON checkpoints USING gin(checkpoint_data);
3. Checkpoint Operations
Write Operations:
create_checkpoint(task_id, agent_id, state) → checkpoint_id
update_checkpoint(checkpoint_id, partial_state) → checkpoint_id
finalize_checkpoint(checkpoint_id, status) → void
link_checkpoints(parent_id, child_id) → void
Read Operations:
get_checkpoint(checkpoint_id) → Checkpoint
get_latest_checkpoint(task_id) → Checkpoint
get_checkpoint_history(task_id, limit) → [Checkpoint]
get_checkpoints_by_agent(agent_id, time_range) → [Checkpoint]
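The intended semantics of these operations can be sketched with an in-memory store. This is an illustration only: the class name `InMemoryCheckpointStore`, the sequential IDs, the shallow merge in `update_checkpoint`, and the rule that finalized checkpoints become immutable are all assumptions, and `get_checkpoints_by_agent` is omitted for brevity.

```python
import copy
import itertools


class InMemoryCheckpointStore:
    """Toy stand-in for the SQLite/PostgreSQL repositories (illustrative only)."""

    def __init__(self):
        self._rows = {}                   # checkpoint_id -> record
        self._order = []                  # checkpoint IDs in creation order
        self._ids = itertools.count(1)    # sequential IDs instead of UUID v7

    def create_checkpoint(self, task_id, agent_id, state):
        checkpoint_id = f"cp-{next(self._ids)}"
        self._rows[checkpoint_id] = {
            "id": checkpoint_id, "task_id": task_id, "agent_id": agent_id,
            "state": copy.deepcopy(state), "status": "open", "parent_id": None,
        }
        self._order.append(checkpoint_id)
        return checkpoint_id

    def update_checkpoint(self, checkpoint_id, partial_state):
        row = self._rows[checkpoint_id]
        if row["status"] != "open":       # assumption: finalized checkpoints are immutable
            raise ValueError("checkpoint is finalized")
        row["state"].update(copy.deepcopy(partial_state))  # shallow merge of top-level keys
        return checkpoint_id

    def finalize_checkpoint(self, checkpoint_id, status):
        self._rows[checkpoint_id]["status"] = status

    def link_checkpoints(self, parent_id, child_id):
        if parent_id not in self._rows:
            raise KeyError(parent_id)
        self._rows[child_id]["parent_id"] = parent_id

    def get_checkpoint(self, checkpoint_id):
        return copy.deepcopy(self._rows[checkpoint_id])

    def get_checkpoint_history(self, task_id, limit):
        newest_first = [i for i in reversed(self._order) if self._rows[i]["task_id"] == task_id]
        return [self.get_checkpoint(i) for i in newest_first[:limit]]

    def get_latest_checkpoint(self, task_id):
        history = self.get_checkpoint_history(task_id, 1)
        return history[0] if history else None
```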
4. Handoff Protocol
PRE-HANDOFF (Current Agent):
1. Detect handoff trigger (context > 70%, phase complete, error threshold, explicit request, or token budget; see Handoff Triggers)
2. Generate continuation_prompt summarizing:
- What was accomplished
- What remains
- Current blockers
- Recommended next steps
3. Write final checkpoint with phase="handoff"
4. Emit AGENT_HANDOFF event to orchestrator
HANDOFF (Orchestrator):
5. Receive AGENT_HANDOFF event
6. Validate checkpoint integrity (hash verification)
7. Select next agent (same type for continuation, different for phase change)
8. Inject checkpoint context into new agent's system prompt
9. Spawn new agent with fresh context window
POST-HANDOFF (New Agent):
10. Read latest checkpoint
11. Acknowledge checkpoint receipt
12. Resume from continuation_prompt
13. Create new checkpoint linking to parent
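Steps 6 through 9 of the protocol above can be sketched as follows. The function names `build_handoff_prompt` and `handoff`, the prompt wording, and the `verify`/`spawn` callables are hypothetical; a real orchestrator would supply its own integrity check and agent launcher. For simplicity the sketch always continues with the same agent type.

```python
from typing import Callable


def build_handoff_prompt(checkpoint: dict) -> str:
    """Turn the checkpoint into the text injected into the new agent's system prompt."""
    meta = checkpoint["metadata"]
    state = checkpoint["execution_state"]

    def listing(items):
        return ", ".join(items) if items else "none"

    return "\n".join([
        f"You are continuing task {meta['task_id']} (iteration {meta['iteration'] + 1}).",
        f"Completed: {listing(state['completed_items'])}",
        f"Remaining: {listing(state['pending_items'])}",
        f"Blocked: {listing(state['blocked_items'])}",
        f"Next steps: {checkpoint['recovery']['continuation_prompt']}",
    ])


def handoff(
    checkpoint: dict,
    verify: Callable[[dict], bool],
    spawn: Callable[[str, str], str],
) -> str:
    """Validate the checkpoint, then spawn a fresh agent and return its ID."""
    if not verify(checkpoint):
        # Never hand corrupt state to a new agent.
        raise ValueError("checkpoint failed integrity check; refusing handoff")
    prompt = build_handoff_prompt(checkpoint)
    # Same agent type: this sketch covers continuation, not a phase change.
    return spawn(checkpoint["metadata"]["agent_type"], prompt)
```

Passing `verify` and `spawn` in as callables keeps the orchestration logic testable without a database or a live agent runtime.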
5. Recovery Protocol
ON AGENT FAILURE:
1. Orchestrator detects agent termination (timeout, error, crash)
2. Retrieve last successful checkpoint
3. Analyze failure:
- If recoverable: spawn new agent from checkpoint
- If unrecoverable: mark task blocked, alert human
4. Log recovery attempt to compliance trail
ON CHECKPOINT CORRUPTION:
1. Detect via hash mismatch
2. Attempt recovery from last_successful_state reference
3. If chain broken: escalate to human intervention
4. Never silently continue with corrupt state
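The corruption path above amounts to walking the `last_successful_state` chain until an intact checkpoint is found. A minimal sketch, assuming hypothetical `load` and `verify` callables and a hop limit to guard against cycles:

```python
class UnrecoverableCheckpoint(Exception):
    """Raised when no intact checkpoint can be reached; a human must intervene."""


def find_recoverable(start_id, load, verify, max_hops=10):
    """Follow last_successful_state references until a checkpoint passes verification.

    load(checkpoint_id) returns a checkpoint dict or None; verify(checkpoint)
    returns True when the stored hash matches. Both are supplied by the caller.
    """
    seen = set()
    current = start_id
    while current is not None and current not in seen and len(seen) < max_hops:
        seen.add(current)
        checkpoint = load(current)
        if checkpoint is None:  # chain broken: referenced checkpoint is missing
            break
        if verify(checkpoint):
            return checkpoint
        current = checkpoint.get("recovery", {}).get("last_successful_state")
    raise UnrecoverableCheckpoint(f"no intact checkpoint reachable from {start_id}")
```

One caveat: the reference read from a corrupt checkpoint may itself be corrupt, so a production implementation might prefer to look up the parent from an independent source such as the checkpoint link recorded at creation time.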
6. Handoff Triggers
| Trigger | Threshold | Action |
|---|---|---|
| Context utilization | > 70% | Initiate handoff |
| Task phase complete | Phase boundary | Handoff to next phase agent |
| Error count | > 3 consecutive | Handoff with error context |
| Explicit request | Agent requests | Immediate handoff |
| Token budget | > 80% of task budget | Handoff with budget warning |
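The trigger table can be expressed as a small pure function. In this sketch the `AgentStatus` fields and the trigger labels are hypothetical names; the thresholds are the ones listed above, and integer arithmetic is used so that a value exactly at a threshold does not fire.

```python
from dataclasses import dataclass


@dataclass
class AgentStatus:
    context_tokens: int            # tokens currently in the context window
    context_limit: int             # context window capacity
    task_tokens: int               # tokens spent on the task so far
    task_budget: int               # token budget for the whole task
    consecutive_errors: int = 0
    phase_complete: bool = False
    handoff_requested: bool = False


def fired_triggers(status: AgentStatus) -> list:
    """Return the labels of every handoff trigger that currently applies."""
    fired = []
    if status.handoff_requested:
        fired.append("explicit_request")
    if status.context_tokens * 100 > status.context_limit * 70:   # > 70%
        fired.append("context_utilization")
    if status.phase_complete:
        fired.append("phase_complete")
    if status.consecutive_errors > 3:
        fired.append("error_count")
    if status.task_tokens * 100 > status.task_budget * 80:        # > 80%
        fired.append("token_budget")
    return fired
```

Returning every fired trigger, rather than only the first, lets the outgoing checkpoint record the full reason for the handoff, for example both an error streak and a budget warning.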
Consequences
Positive
- Reliable Recovery - Agents can resume from any checkpoint after failures
- Fresh Context Quality - Each iteration starts clean, avoiding degradation
- Compliance Ready - Full audit trail for FDA 21 CFR Part 11, HIPAA, SOC2
- Cost Attribution - Token consumption tracked per checkpoint/iteration
- Superior to Ralph - Database ACID guarantees vs git-file eventual consistency
- Multi-Tenant Ready - PostgreSQL RLS ensures complete tenant isolation
Negative
- Latency Overhead - Checkpoint writes add up to ~100ms per handoff (the p99 write-latency target under Performance Requirements)
- Storage Growth - Checkpoint history accumulates (mitigated by retention policies)
- Complexity - More moving parts than simple context-based state
Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Database latency spikes | Medium | Agent delays | Local SQLite buffer, async cloud sync |
| Checkpoint size exceeds limits | Low | Write failures | Compression, summary truncation |
| Concurrent modification | Medium | Data loss | Optimistic locking, conflict detection |
| Recovery loop | Medium | Infinite retries | Circuit breaker, max retry limits |
| Sync conflicts (local/cloud) | Medium | Data inconsistency | Last-write-wins with conflict log |
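As an illustration of the "circuit breaker, max retry limits" mitigation for recovery loops, the sketch below counts failed recovery attempts per task and stops allowing new ones once a limit is reached. The class name and the default of three attempts are assumptions, not values set by this ADR.

```python
class RecoveryCircuitBreaker:
    """Stops automatic recovery for a task after too many failed attempts."""

    def __init__(self, max_attempts: int = 3):
        self.max_attempts = max_attempts
        self._failures = {}  # task_id -> consecutive failed recovery attempts

    def allow(self, task_id: str) -> bool:
        """True while the task is still eligible for automatic recovery."""
        return self._failures.get(task_id, 0) < self.max_attempts

    def record_failure(self, task_id: str) -> None:
        self._failures[task_id] = self._failures.get(task_id, 0) + 1

    def record_success(self, task_id: str) -> None:
        """A successful recovery resets the count for that task."""
        self._failures.pop(task_id, None)
```

When `allow` returns False, the orchestrator would mark the task blocked and alert a human, matching the unrecoverable branch of the recovery protocol.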
Performance Requirements
| Metric | Target |
|---|---|
| Checkpoint write latency | < 100ms p99 |
| Checkpoint read latency | < 10ms p99 |
| Recovery success rate | > 99.9% |
| Audit trail completeness | 100% |
| Context window at handoff | < 70% utilized |
Compliance Requirements
| Standard | Requirement | Implementation |
|---|---|---|
| FDA 21 CFR Part 11 | Electronic signatures | Cryptographic signing of checkpoints |
| HIPAA | Encryption | At-rest and in-transit encryption |
| SOC2 | Audit trail | Complete event log in PostgreSQL with RLS |
| Data Retention | Configurable | Per-task retention policy field |
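To show where the `signature` field fits, the sketch below signs the checkpoint hash with HMAC-SHA256 from Python's standard library. This is an assumption chosen for brevity: HMAC uses a shared secret and only demonstrates tamper detection. Whether it satisfies the electronic-signature requirements of FDA 21 CFR Part 11 is outside the scope of this example, and an asymmetric signature scheme tied to an individual signer may be required instead. The function names and the key are hypothetical.

```python
import hashlib
import hmac


def sign_hash(checkpoint_hash: str, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over the checkpoint's SHA-256 hash."""
    return hmac.new(key, checkpoint_hash.encode("ascii"), hashlib.sha256).hexdigest()


def signature_valid(checkpoint_hash: str, signature: str, key: bytes) -> bool:
    """Constant-time comparison avoids leaking how many characters matched."""
    return hmac.compare_digest(sign_hash(checkpoint_hash, key), signature)


# Illustrative values only; a real key would come from a secrets manager.
demo_key = b"example-key-not-for-production"
demo_hash = hashlib.sha256(b"checkpoint content").hexdigest()
demo_signature = sign_hash(demo_hash, demo_key)
```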
Implementation Phases
Phase 1: Schema and Storage (2 weeks)
- Define TypeScript/Rust types for Checkpoint
- Implement SQLite repository (local) and PostgreSQL repository (cloud)
- Add transaction handling and optimistic locking
- Implement local-to-cloud sync (cursor-based polling per ADR-053)
Phase 2: Checkpoint Operations (2 weeks)
- Implement write operations with hash generation
- Implement read operations with integrity verification
- Add compliance layer (signing, audit events)
Phase 3: Handoff Protocol (2 weeks)
- Implement handoff trigger detection
- Create continuation prompt generation
- Integrate with orchestrator
Phase 4: Recovery and Testing (2 weeks)
- Implement failure detection and recovery
- Add circuit breaker for repeated failures
- Comprehensive testing (unit, integration, chaos)
Related ADRs
- ADR-001: Event-Driven Architecture (checkpoint events)
- ADR-002: PostgreSQL as Primary Database (cloud storage, RLS for multi-tenant)
- ADR-053: Cloud Context Sync Architecture (local-to-cloud sync protocol)
- ADR-089: Two-Database Architecture (local SQLite + cloud PostgreSQL)
- ADR-109: QA Agent Browser Automation (depends on checkpoint for test results)
- ADR-110: Agent Health Monitoring (uses checkpoint timestamps)
- ADR-111: Token Economics (checkpoint includes token metrics)
- ADR-112: Ralph Wiggum Database Architecture (consolidates DB decisions)
Glossary
| Term | Definition |
|---|---|
| ADR | Architecture Decision Record - document capturing significant architectural decisions with context and consequences |
| ACID | Atomicity, Consistency, Isolation, Durability - database transaction properties ensuring data integrity |
| Checkpoint | Persistent snapshot of agent execution state enabling recovery and handoff |
| Handoff | Transfer of execution context from one agent instance to another (typically with fresh context window) |
| RLS | Row-Level Security - PostgreSQL feature that restricts which rows a query can read or modify according to per-table policies (here, organization membership) |
| Ralph Wiggum | Autonomous agent loop technique named after The Simpsons character, enabling persistent iteration until task completion |
| SHA-256 | Secure Hash Algorithm producing 256-bit hash for data integrity verification |
| UUID v7 | Universally Unique Identifier version 7 - time-ordered UUIDs for sortable, unique keys |
| FDA 21 CFR Part 11 | US FDA regulation for electronic records and signatures in regulated industries |
| HIPAA | Health Insurance Portability and Accountability Act - US healthcare data privacy regulation |
| SOC2 | System and Organization Controls 2 (formerly Service Organization Control 2) - AICPA security compliance framework for service providers |
References
- Ralph Wiggum Analysis
- CODITECT Impact Analysis
- IMPL-REQ-001
- Anthropic: Effective Harnesses for Long-Running Agents
- Geoffrey Huntley's Ralph Explanation
ADR-108 | Created: 2026-01-24 | Status: Proposed