
Implementation Requirements: Agent Checkpoint and Handoff Protocol

Document ID: IMPL-REQ-001
Priority: P0 (Critical Path)
Target ADR: ADR-108 (Proposed)
Estimated Effort: 2-3 Sprints
Dependencies: FoundationDB infrastructure, Multi-Agent Orchestration Layer


1. Overview

1.1 Problem Statement

Current agent handoff relies on implicit state (context window contents, git history). The Ralph Wiggum community has validated that explicit, structured checkpoints are essential for:

  • Reliable recovery from failures
  • Clean context window transitions
  • Compliance audit trails
  • Cost attribution per task segment

1.2 Objective

Implement a standardized checkpoint protocol that enables agents to persist state to FoundationDB, allowing fresh-context iterations while maintaining full task continuity and compliance evidence.

1.3 Success Criteria

| Metric | Target |
|---|---|
| Checkpoint write latency | < 100ms |
| Recovery success rate | > 99.9% |
| Audit trail completeness | 100% of state transitions |
| Context window utilization | < 70% at handoff |

2. Functional Requirements

2.1 Checkpoint Schema

# FR-001: Define checkpoint data structure
checkpoint_schema:
  version: "1.0"

  metadata:
    checkpoint_id: string  # UUID v7 (time-ordered)
    task_id: string  # Parent task reference
    agent_id: string  # Executing agent identifier
    agent_type: enum  # [architecture, implementation, qa, documentation]
    iteration: integer  # Loop iteration number
    timestamp: datetime  # ISO 8601 with timezone

  execution_state:
    phase: enum  # [planning, implementing, testing, reviewing, handoff, complete]
    completed_items: array  # List of completed work items
    pending_items: array  # Remaining work items
    blocked_items: array  # Items with blockers
    current_focus: string  # Active work item ID

  context_summary:
    key_decisions: array  # Architectural decisions made
    assumptions: array  # Assumptions in effect
    constraints: array  # Active constraints
    external_dependencies: array  # Third-party dependencies

  metrics:
    tokens_consumed: integer  # Total tokens this iteration
    tools_invoked: integer  # Tool call count
    files_modified: array  # List of changed files
    tests_status:
      passed: integer
      failed: integer
      skipped: integer
    coverage_percent: float

  recovery:
    last_successful_state: string  # Reference to prior checkpoint
    rollback_instructions: string  # How to undo current iteration
    continuation_prompt: string  # Prompt to resume work

  compliance:
    event_log_ref: string  # FoundationDB event stream ref
    hash: string  # SHA-256 of checkpoint content
    signature: string  # Optional cryptographic signature
    retention_policy: string  # Compliance retention requirement
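
The `hash` field above implies that every writer and reader must serialize a checkpoint identically before hashing. The following is a minimal sketch of one way to do that, assuming Node.js and its built-in `node:crypto` module. The `CheckpointCore` type (a small subset of the schema), the `canonicalJson` helper, and the sorted-key canonicalization are illustrative assumptions; the schema itself does not specify how content is serialized for hashing.

```typescript
import { createHash } from "node:crypto";

// Subset of the checkpoint schema, for illustration only.
interface CheckpointCore {
  checkpoint_id: string;
  task_id: string;
  iteration: number;
  phase: string;
  pending_items: string[];
}

// Hypothetical helper: serialize with sorted object keys so that equal
// content always yields the same string, whatever the key order.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map((v) => canonicalJson(v)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + canonicalJson(obj[k]));
    return "{" + body.join(",") + "}";
  }
  return JSON.stringify(value);
}

// SHA-256 over the canonical form, hex encoded (64 characters).
function computeHash(content: CheckpointCore): string {
  return createHash("sha256").update(canonicalJson(content)).digest("hex");
}

// Integrity check used on read: recompute and compare.
function verifyHash(content: CheckpointCore, expected: string): boolean {
  return computeHash(content) === expected;
}
```

The hash is computed over the content without the `hash` field itself; a real implementation would need to state that exclusion explicitly in the schema.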

2.2 Checkpoint Operations

FR-002: Checkpoint Write Operations

MUST support:
├── create_checkpoint(task_id, agent_id, state) → checkpoint_id
├── update_checkpoint(checkpoint_id, partial_state) → checkpoint_id
├── finalize_checkpoint(checkpoint_id, status) → void
└── link_checkpoints(parent_id, child_id) → void

Transactional requirements:
- All writes MUST be atomic
- Failed writes MUST NOT corrupt existing checkpoints
- Concurrent writes to same task MUST be serialized
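
One common way to meet the serialization requirement is a version check on every write (optimistic locking). The sketch below uses an in-memory `Map` as a stand-in for FoundationDB; `InMemoryCheckpointStore`, `VersionConflictError`, and the `version` field are hypothetical names introduced for illustration. A real implementation would presumably perform the same read-check-write inside a single database transaction.

```typescript
// Hypothetical error raised when a writer holds a stale version.
class VersionConflictError extends Error {}

interface StoredCheckpoint {
  checkpointId: string;
  version: number; // incremented on every successful write
  state: Record<string, unknown>;
}

class InMemoryCheckpointStore {
  private records = new Map<string, StoredCheckpoint>();

  create(checkpointId: string, state: Record<string, unknown>): StoredCheckpoint {
    if (this.records.has(checkpointId)) {
      throw new Error(`checkpoint ${checkpointId} already exists`);
    }
    const record: StoredCheckpoint = { checkpointId, version: 1, state: { ...state } };
    this.records.set(checkpointId, record);
    return record;
  }

  // Applies the partial update only if the caller saw the latest version.
  // On conflict nothing is written, so the stored record stays intact.
  update(
    checkpointId: string,
    expectedVersion: number,
    partial: Record<string, unknown>,
  ): StoredCheckpoint {
    const current = this.records.get(checkpointId);
    if (!current) throw new Error(`checkpoint ${checkpointId} not found`);
    if (current.version !== expectedVersion) {
      throw new VersionConflictError(
        `expected v${expectedVersion}, found v${current.version}`,
      );
    }
    const next: StoredCheckpoint = {
      checkpointId,
      version: current.version + 1,
      state: { ...current.state, ...partial },
    };
    this.records.set(checkpointId, next);
    return next;
  }
}
```

A writer that receives a conflict re-reads the checkpoint, reapplies its change, and retries, which serializes concurrent updates without holding locks.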

FR-003: Checkpoint Read Operations

MUST support:
├── get_checkpoint(checkpoint_id) → Checkpoint
├── get_latest_checkpoint(task_id) → Checkpoint
├── get_checkpoint_history(task_id, limit) → [Checkpoint]
├── get_checkpoints_by_agent(agent_id, time_range) → [Checkpoint]
└── search_checkpoints(query) → [Checkpoint]

Performance requirements:
- Single checkpoint read: < 10ms
- History query (100 items): < 100ms

2.3 Handoff Protocol

FR-004: Agent Handoff Sequence

PRE-HANDOFF (Current Agent):
1. Detect handoff trigger (context > 70% OR task complete OR error threshold)
2. Generate continuation_prompt summarizing:
   - What was accomplished
   - What remains
   - Current blockers
   - Recommended next steps
3. Write final checkpoint with phase="handoff"
4. Emit AGENT_HANDOFF event to orchestrator

HANDOFF (Orchestrator):
5. Receive AGENT_HANDOFF event
6. Validate checkpoint integrity (hash verification)
7. Select next agent (same type for continuation, different for phase change)
8. Inject checkpoint context into new agent's system prompt
9. Spawn new agent with fresh context window

POST-HANDOFF (New Agent):
10. Read latest checkpoint
11. Acknowledge checkpoint receipt
12. Resume from continuation_prompt
13. Create new checkpoint linking to parent
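
Step 7 of the sequence (same agent type for continuation, different type for a phase change) can be sketched as a small pure function. The phase-to-agent mapping in `AGENT_FOR_PHASE` is a hypothetical routing table invented for this example; the document does not define which agent type owns which phase.

```typescript
type AgentType = "architecture" | "implementation" | "qa" | "documentation";
type Phase = "planning" | "implementing" | "testing" | "reviewing";

// Hypothetical routing table; the real mapping is not specified here.
const AGENT_FOR_PHASE: Record<Phase, AgentType> = {
  planning: "architecture",
  implementing: "implementation",
  testing: "qa",
  reviewing: "qa",
};

interface HandoffRequest {
  currentAgentType: AgentType;
  currentPhase: Phase;
  nextPhase: Phase; // equals currentPhase when work simply continues
}

function selectNextAgentType(req: HandoffRequest): AgentType {
  // Continuation: keep the same agent type, only the context is fresh.
  if (req.nextPhase === req.currentPhase) return req.currentAgentType;
  // Phase change: route to whichever agent type owns the next phase.
  return AGENT_FOR_PHASE[req.nextPhase];
}
```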

2.4 Recovery Protocol

FR-005: Failure Recovery Sequence

ON AGENT FAILURE:
1. Orchestrator detects agent termination (timeout, error, crash)
2. Retrieve last successful checkpoint
3. Analyze failure:
   - If recoverable: spawn new agent from checkpoint
   - If unrecoverable: mark task blocked, alert human
4. Log recovery attempt to compliance trail

ON CHECKPOINT CORRUPTION:
1. Detect via hash mismatch
2. Attempt recovery from last_successful_state reference
3. If chain broken: escalate to human intervention
4. Never silently continue with corrupt state
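
The corruption sequence amounts to walking the `last_successful_state` chain until a checkpoint with a valid hash is found. Below is a minimal sketch under stated assumptions: `ChainNode`, `RecoveryPoint`, and `walkRecoveryChain` are hypothetical names, and `hashValid` stands in for the result of a real hash verification performed at read time.

```typescript
interface ChainNode {
  checkpointId: string;
  hashValid: boolean; // outcome of hash verification on read
  lastSuccessfulState: string | null; // parent reference, null at chain start
}

type RecoveryPoint =
  | { kind: "recovered"; checkpointId: string }
  | { kind: "escalate"; reason: string };

function walkRecoveryChain(
  startId: string,
  lookup: (id: string) => ChainNode | undefined,
): RecoveryPoint {
  const visited = new Set<string>();
  let currentId: string | null = startId;
  while (currentId !== null) {
    // Guard against a malformed chain that points back at itself.
    if (visited.has(currentId)) {
      return { kind: "escalate", reason: `cycle at ${currentId}` };
    }
    visited.add(currentId);
    const node: ChainNode | undefined = lookup(currentId);
    if (!node) {
      return { kind: "escalate", reason: `chain broken at ${currentId}` };
    }
    if (node.hashValid) {
      return { kind: "recovered", checkpointId: node.checkpointId };
    }
    currentId = node.lastSuccessfulState;
  }
  // Never continue with corrupt state: hand the decision to a human.
  return { kind: "escalate", reason: "no valid checkpoint in chain" };
}
```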

3. Non-Functional Requirements

3.1 Performance

| Requirement | Specification |
|---|---|
| NFR-001 | Checkpoint writes < 100ms p99 |
| NFR-002 | Checkpoint reads < 10ms p99 |
| NFR-003 | History queries < 100ms for 100 items |
| NFR-004 | Support 1000+ concurrent checkpoints |
| NFR-005 | Zero data loss on node failure (FoundationDB replication) |

3.2 Reliability

| Requirement | Specification |
|---|---|
| NFR-006 | 99.99% checkpoint service availability |
| NFR-007 | Automatic retry on transient failures (3x with backoff) |
| NFR-008 | Graceful degradation if FDB unavailable (queue locally) |
| NFR-009 | Checkpoint validation on every read |
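
NFR-007 (three retries with backoff) could look roughly like the sketch below. The exponential schedule, the 50 ms base delay, and the `isTransient` predicate are assumptions made for illustration; the requirement only states "3x with backoff".

```typescript
// Delay before retry number `attempt` (1-based): base, 2x base, 4x base.
function backoffDelayMs(attempt: number, baseMs: number): number {
  return baseMs * 2 ** (attempt - 1);
}

// Runs `operation`, retrying up to `maxRetries` times on transient errors.
// Non-transient errors and the final failure are rethrown to the caller.
async function withRetry<T>(
  operation: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  maxRetries: number = 3,
  baseMs: number = 50,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt >= maxRetries || !isTransient(err)) throw err;
      const delay = backoffDelayMs(attempt + 1, baseMs);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

With the defaults, a persistently failing write is attempted four times in total (one initial try plus three retries), waiting 50 ms, 100 ms, and 200 ms between attempts. That worst case already exceeds the 100 ms write target, so retried writes would need to be accounted for separately when measuring p99 latency.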

3.3 Compliance

| Requirement | Specification |
|---|---|
| NFR-010 | FDA 21 CFR Part 11: Electronic signatures on checkpoints |
| NFR-011 | HIPAA: Encryption at rest and in transit |
| NFR-012 | SOC2: Complete audit trail of all state changes |
| NFR-013 | Retention: Configurable per-task retention policy |
| NFR-014 | Immutability: Finalized checkpoints cannot be modified |

3.4 Observability

| Requirement | Specification |
|---|---|
| NFR-015 | Metrics: checkpoint_write_latency_ms histogram |
| NFR-016 | Metrics: checkpoint_read_latency_ms histogram |
| NFR-017 | Metrics: handoff_success_rate gauge |
| NFR-018 | Metrics: recovery_attempts_total counter |
| NFR-019 | Logs: Structured JSON for all checkpoint operations |
| NFR-020 | Traces: Distributed tracing across agent handoffs |

4. Implementation Steps

Phase 1: Schema and Storage (Week 1-2)

Step 1.1: Define Checkpoint Schema
├── Create TypeScript/Rust types for Checkpoint
├── Define FoundationDB key structure:
│ └── /coditect/checkpoints/{task_id}/{checkpoint_id}
├── Implement serialization (MessagePack for efficiency)
└── Add schema versioning for forward compatibility
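
Because UUID v7 begins with a 48-bit millisecond timestamp, keys under a task sort chronologically, which is what makes a "latest checkpoint" range read cheap. The sketch below generates a UUID v7 following the RFC 9562 layout and builds the key path from Step 1.1. It assumes Node.js (`node:crypto`); `uuidV7` and `checkpointKey` are hypothetical helper names, and the string path is illustrative only, since a real FoundationDB layout would more likely use the tuple or directory layer.

```typescript
import { randomBytes } from "node:crypto";

// Generates a UUID v7: 48-bit Unix ms timestamp, version nibble 7,
// RFC 9562 variant bits, and random bits everywhere else.
function uuidV7(nowMs: number = Date.now()): string {
  const bytes = randomBytes(16);
  // Bytes 0-5: big-endian timestamp in milliseconds.
  let ts = nowMs;
  for (let i = 5; i >= 0; i--) {
    bytes[i] = ts % 256;
    ts = Math.floor(ts / 256);
  }
  bytes[6] = (bytes[6] & 0x0f) | 0x70; // version 7
  bytes[8] = (bytes[8] & 0x3f) | 0x80; // variant 10xx
  const hex = bytes.toString("hex");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join("-");
}

// Key path from Step 1.1; lexicographic order within a task follows
// checkpoint creation time at millisecond granularity.
function checkpointKey(taskId: string, checkpointId: string): string {
  return `/coditect/checkpoints/${taskId}/${checkpointId}`;
}
```

Two IDs generated within the same millisecond are ordered randomly in this sketch; a production generator would typically add a monotonic counter if strict ordering matters.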

Step 1.2: Implement Storage Layer
├── Create CheckpointRepository interface
├── Implement FoundationDBCheckpointRepository
├── Add transaction handling for atomic writes
├── Implement optimistic locking for concurrent access
└── Add connection pooling and retry logic

Step 1.3: Add Compliance Layer
├── Implement SHA-256 hash generation
├── Add optional cryptographic signing
├── Create immutability enforcement (no updates after finalize)
└── Implement audit event emission
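
Immutability enforcement can be as simple as rejecting updates once a checkpoint carries a finalized status. The sketch below is a minimal in-memory illustration; the `status` field and the `applyUpdate` and `finalize` helpers are hypothetical names, since the schema does not currently define where finalization state is stored.

```typescript
type CheckpointStatus = "open" | "finalized";

interface MutableCheckpoint {
  checkpointId: string;
  status: CheckpointStatus;
  state: Record<string, unknown>;
}

// Returns a new checkpoint with the partial state merged in.
// Throws instead of writing if the checkpoint is already finalized.
function applyUpdate(
  cp: MutableCheckpoint,
  partial: Record<string, unknown>,
): MutableCheckpoint {
  if (cp.status === "finalized") {
    throw new Error(`checkpoint ${cp.checkpointId} is finalized and cannot be modified`);
  }
  return { ...cp, state: { ...cp.state, ...partial } };
}

// Marks the checkpoint finalized. Object.freeze is shallow, so the
// nested state object is frozen separately.
function finalize(cp: MutableCheckpoint): MutableCheckpoint {
  return Object.freeze({
    ...cp,
    status: "finalized" as const,
    state: Object.freeze({ ...cp.state }),
  });
}
```

Freezing only protects the in-process copy; the storage layer still has to refuse writes to finalized keys for the guarantee to hold.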

Phase 2: Checkpoint Operations (Week 3-4)

Step 2.1: Write Operations
├── Implement create_checkpoint()
│ ├── Validate schema
│ ├── Generate checkpoint_id (UUID v7)
│ ├── Calculate hash
│ ├── Write to FDB transactionally
│ └── Emit CHECKPOINT_CREATED event
├── Implement update_checkpoint()
│ ├── Verify checkpoint not finalized
│ ├── Merge partial state
│ ├── Recalculate hash
│ └── Write atomically
└── Implement finalize_checkpoint()
  ├── Mark as immutable
  ├── Apply retention policy
  └── Emit CHECKPOINT_FINALIZED event

Step 2.2: Read Operations
├── Implement get_checkpoint()
│ ├── Read from FDB
│ ├── Verify hash integrity
│ └── Deserialize and return
├── Implement get_latest_checkpoint()
│ ├── Query by task_id with ordering
│ └── Return most recent non-corrupt
├── Implement get_checkpoint_history()
│ ├── Range query with pagination
│ └── Build checkpoint chain
└── Implement search_checkpoints()
  ├── Secondary index queries
  └── Filter by agent, time, status

Phase 3: Handoff Protocol (Week 5-6)

Step 3.1: Handoff Trigger Detection
├── Implement context utilization monitor
│ └── Track token count vs limit
├── Define handoff triggers:
│ ├── context_utilization > 70%
│ ├── task_phase_complete
│ ├── error_count > threshold
│ └── explicit_handoff_request
└── Add trigger evaluation to agent loop
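
The four triggers in Step 3.1 can be evaluated once per agent loop iteration by a single pure function. This is a sketch under assumptions: the evaluation order, the `AgentLoopState` shape, and the `maxErrors` setting are hypothetical, since the error threshold is left open in this document.

```typescript
type HandoffTrigger =
  | "context_utilization"
  | "task_phase_complete"
  | "error_count"
  | "explicit_handoff_request";

interface AgentLoopState {
  tokensUsed: number;
  tokenLimit: number;
  phaseComplete: boolean;
  errorCount: number;
  handoffRequested: boolean;
}

interface TriggerConfig {
  maxContextUtilization: number; // 0.7 for the 70% threshold
  maxErrors: number; // hypothetical; not specified in this document
}

// Returns the first trigger that fires, or null to keep working.
function evaluateHandoffTrigger(
  state: AgentLoopState,
  config: TriggerConfig,
): HandoffTrigger | null {
  if (state.handoffRequested) return "explicit_handoff_request";
  if (state.phaseComplete) return "task_phase_complete";
  if (state.errorCount > config.maxErrors) return "error_count";
  const utilization = state.tokensUsed / state.tokenLimit;
  if (utilization > config.maxContextUtilization) return "context_utilization";
  return null;
}
```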

Step 3.2: Continuation Prompt Generation
├── Create prompt template for handoff
├── Implement state summarization:
│ ├── Extract key decisions
│ ├── List completed work
│ ├── Identify pending items
│ └── Note blockers and risks
├── Add context window budget for summary
└── Validate prompt fits target context
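
A continuation prompt template with a budget check might look like the sketch below. The four sections mirror the items listed in FR-004; the `estimateTokens` heuristic (roughly four characters per token) is an assumption used only to make the example self-contained, and a real implementation would use the target model's tokenizer.

```typescript
interface HandoffSummary {
  accomplished: string[];
  remaining: string[];
  blockers: string[];
  nextSteps: string[];
}

// Crude estimate (assumption: about 4 characters per token).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Renders one titled bullet list, with a placeholder when empty.
function renderSection(title: string, items: string[]): string {
  const body =
    items.length > 0 ? items.map((item) => `- ${item}`).join("\n") : "- (none)";
  return `${title}:\n${body}`;
}

// Builds the prompt and rejects it if it exceeds the summary budget.
function buildContinuationPrompt(summary: HandoffSummary, maxTokens: number): string {
  const prompt = [
    renderSection("Accomplished", summary.accomplished),
    renderSection("Remaining", summary.remaining),
    renderSection("Current blockers", summary.blockers),
    renderSection("Recommended next steps", summary.nextSteps),
  ].join("\n\n");
  const needed = estimateTokens(prompt);
  if (needed > maxTokens) {
    throw new Error(`continuation prompt needs ~${needed} tokens, budget is ${maxTokens}`);
  }
  return prompt;
}
```

Failing loudly when the summary exceeds its budget keeps the problem visible to the outgoing agent, which can then shorten the summary before writing the final checkpoint.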

Step 3.3: Agent Spawning Integration
├── Update Orchestrator to handle AGENT_HANDOFF
├── Implement checkpoint injection into system prompt
├── Add agent selection logic:
│ ├── Same agent type for continuation
│ └── Different type for phase transition
└── Ensure fresh context window creation

Phase 4: Recovery and Testing (Week 7-8)

Step 4.1: Recovery Implementation
├── Implement failure detection in Orchestrator
├── Add checkpoint chain traversal for recovery point
├── Create recovery decision logic:
│ ├── Analyze failure type
│ ├── Determine recoverability
│ └── Select recovery strategy
├── Implement agent respawn from checkpoint
└── Add circuit breaker for repeated failures
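
The circuit breaker in Step 4.1 exists to stop a task from being respawned indefinitely. A minimal per-task sketch follows; `RecoveryCircuitBreaker` and its method names are hypothetical, and the attempt limit is a configuration value this document does not fix.

```typescript
class RecoveryCircuitBreaker {
  private failures = new Map<string, number>();
  private readonly maxAttempts: number;

  constructor(maxAttempts: number) {
    this.maxAttempts = maxAttempts;
  }

  // True while the task is still allowed another recovery attempt.
  canAttempt(taskId: string): boolean {
    return (this.failures.get(taskId) ?? 0) < this.maxAttempts;
  }

  // Call after a recovery attempt fails.
  recordFailure(taskId: string): void {
    this.failures.set(taskId, (this.failures.get(taskId) ?? 0) + 1);
  }

  // Call after a recovered agent makes progress; resets the count.
  recordSuccess(taskId: string): void {
    this.failures.delete(taskId);
  }
}
```

When `canAttempt` returns false, the orchestrator would mark the task blocked and alert a human, matching the unrecoverable branch of FR-005.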

Step 4.2: Testing
├── Unit tests:
│ ├── Schema validation
│ ├── Storage operations
│ ├── Hash verification
│ └── Compliance enforcement
├── Integration tests:
│ ├── Full handoff cycle
│ ├── Recovery scenarios
│ ├── Concurrent checkpoint writes
│ └── FDB failure simulation
└── Load tests:
  ├── 1000 concurrent checkpoints
  ├── Sustained write throughput
  └── Recovery under load

Step 4.3: Documentation
├── API documentation
├── Handoff protocol specification
├── Recovery runbook
└── Compliance evidence guide

5. API Specification

5.1 Checkpoint Service Interface

interface CheckpointService {
  // Write operations
  createCheckpoint(params: CreateCheckpointParams): Promise<Checkpoint>;
  updateCheckpoint(id: CheckpointId, update: PartialCheckpoint): Promise<Checkpoint>;
  finalizeCheckpoint(id: CheckpointId, status: FinalStatus): Promise<void>;

  // Read operations
  getCheckpoint(id: CheckpointId): Promise<Checkpoint | null>;
  getLatestCheckpoint(taskId: TaskId): Promise<Checkpoint | null>;
  getCheckpointHistory(taskId: TaskId, options?: HistoryOptions): Promise<Checkpoint[]>;
  searchCheckpoints(query: CheckpointQuery): Promise<SearchResult<Checkpoint>>;

  // Handoff operations
  initiateHandoff(checkpointId: CheckpointId): Promise<HandoffResult>;
  acknowledgeHandoff(checkpointId: CheckpointId, newAgentId: AgentId): Promise<void>;

  // Recovery operations
  findRecoveryPoint(taskId: TaskId): Promise<Checkpoint | null>;
  executeRecovery(checkpointId: CheckpointId): Promise<RecoveryResult>;
}

5.2 Event Definitions

// Events emitted by Checkpoint Service
type CheckpointEvents =
  | { type: 'CHECKPOINT_CREATED'; payload: { checkpointId: string; taskId: string; agentId: string } }
  | { type: 'CHECKPOINT_UPDATED'; payload: { checkpointId: string; fields: string[] } }
  | { type: 'CHECKPOINT_FINALIZED'; payload: { checkpointId: string; status: string } }
  | { type: 'AGENT_HANDOFF'; payload: { fromCheckpoint: string; toAgent: string } }
  | { type: 'RECOVERY_INITIATED'; payload: { checkpointId: string; reason: string } }
  | { type: 'RECOVERY_COMPLETED'; payload: { checkpointId: string; newAgentId: string } }
  | { type: 'CHECKPOINT_CORRUPTED'; payload: { checkpointId: string; error: string } };
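
Because the events form a discriminated union on `type`, a consumer can switch over them and let the compiler flag any event it forgot to handle. The sketch below repeats a three-event subset of the union so that it compiles on its own; `describeEvent` is a hypothetical consumer, not part of the service.

```typescript
// Subset of the CheckpointEvents union, repeated for self-containment.
type CheckpointEvent =
  | { type: "CHECKPOINT_CREATED"; payload: { checkpointId: string; taskId: string; agentId: string } }
  | { type: "AGENT_HANDOFF"; payload: { fromCheckpoint: string; toAgent: string } }
  | { type: "CHECKPOINT_CORRUPTED"; payload: { checkpointId: string; error: string } };

function describeEvent(event: CheckpointEvent): string {
  switch (event.type) {
    case "CHECKPOINT_CREATED":
      return `created ${event.payload.checkpointId} for task ${event.payload.taskId}`;
    case "AGENT_HANDOFF":
      return `handoff from ${event.payload.fromCheckpoint} to ${event.payload.toAgent}`;
    case "CHECKPOINT_CORRUPTED":
      return `corrupt ${event.payload.checkpointId}: ${event.payload.error}`;
    default: {
      // Compile-time exhaustiveness check: adding a new event type to the
      // union without a matching case makes this assignment fail.
      const unreachable: never = event;
      throw new Error(`unhandled event: ${JSON.stringify(unreachable)}`);
    }
  }
}
```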

6. Dependencies

| Dependency | Type | Status |
|---|---|---|
| FoundationDB cluster | Infrastructure | ✅ Available |
| Event bus (for CHECKPOINT_* events) | Platform | ✅ Available |
| Agent Orchestrator | Platform | ⚠️ Requires update |
| Cryptographic signing service | Security | ⚠️ May need implementation |
| Observability stack (metrics, logs, traces) | Platform | ✅ Available |

7. Risks and Mitigations

| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| FoundationDB latency spikes | Agent delays | Medium | Local write queue, async finalization |
| Checkpoint size exceeds limits | Write failures | Low | Compression, summary truncation |
| Hash collision | Data integrity | Very Low | SHA-256 is collision-resistant |
| Concurrent modification | Data loss | Medium | Optimistic locking, conflict detection |
| Recovery loop | Infinite retries | Medium | Circuit breaker, max retry limits |

8. Acceptance Criteria

  • Checkpoints persist across agent restarts
  • Handoff completes in < 500ms total latency
  • Recovery succeeds from last valid checkpoint
  • Audit trail passes compliance review
  • No checkpoint data loss under simulated failures
  • Performance targets met under load test
  • Documentation complete and reviewed

Document Version: 1.0 | Last Updated: January 24, 2026