Implementation Requirements: Agent Checkpoint and Handoff Protocol
Document ID: IMPL-REQ-001
Priority: P0 (Critical Path)
Target ADR: ADR-108 (Proposed)
Estimated Effort: 2-3 Sprints
Dependencies: FoundationDB infrastructure, Multi-Agent Orchestration Layer
1. Overview
1.1 Problem Statement
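Current agent handoff relies on implicit state (context window contents, git history), which is neither validated nor queryable. Experience reported by the Ralph Wiggum community, where agents run as a loop of fresh-context iterations, indicates that explicit, structured checkpoints are essential for: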
- Reliable recovery from failures
- Clean context window transitions
- Compliance audit trails
- Cost attribution per task segment
1.2 Objective
Implement a standardized checkpoint protocol that enables agents to persist state to FoundationDB, allowing fresh-context iterations while maintaining full task continuity and compliance evidence.
1.3 Success Criteria
| Metric | Target |
|---|---|
| Checkpoint write latency | < 100ms (p99, see NFR-001) |
| Recovery success rate | > 99.9% |
| Audit trail completeness | 100% of state transitions |
| Context window utilization | < 70% at handoff |
2. Functional Requirements
2.1 Checkpoint Schema
# FR-001: Define checkpoint data structure
checkpoint_schema:
  version: "1.0"
  metadata:
    checkpoint_id: string          # UUID v7 (time-ordered)
    task_id: string                # Parent task reference
    agent_id: string               # Executing agent identifier
    agent_type: enum               # [architecture, implementation, qa, documentation]
    iteration: integer             # Loop iteration number
    timestamp: datetime            # ISO 8601 with timezone
  execution_state:
    phase: enum                    # [planning, implementing, testing, reviewing, handoff, complete]
    completed_items: array         # List of completed work items
    pending_items: array           # Remaining work items
    blocked_items: array           # Items with blockers
    current_focus: string          # Active work item ID
  context_summary:
    key_decisions: array           # Architectural decisions made
    assumptions: array             # Assumptions in effect
    constraints: array             # Active constraints
    external_dependencies: array   # Third-party dependencies
  metrics:
    tokens_consumed: integer       # Total tokens this iteration
    tools_invoked: integer         # Tool call count
    files_modified: array          # List of changed files
    tests_status:
      passed: integer
      failed: integer
      skipped: integer
      coverage_percent: float
  recovery:
    last_successful_state: string  # Reference to prior checkpoint
    rollback_instructions: string  # How to undo current iteration
    continuation_prompt: string    # Prompt to resume work
  compliance:
    event_log_ref: string          # FoundationDB event stream ref
    hash: string                   # SHA-256 of checkpoint content
    signature: string              # Optional cryptographic signature
    retention_policy: string       # Compliance retention requirement
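The schema above could be mirrored as TypeScript types roughly as follows. This is a sketch, not the agreed type definition: the `string[]` element types, the optional `signature`, and the `validateCheckpoint` rules are assumptions that the requirements do not specify.

```typescript
// Enum values are copied from the FR-001 schema; "handoff" is included
// because FR-004 step 3 writes phase="handoff".
type AgentType = "architecture" | "implementation" | "qa" | "documentation";
type Phase = "planning" | "implementing" | "testing" | "reviewing" | "handoff" | "complete";

interface Checkpoint {
  version: "1.0";
  metadata: {
    checkpoint_id: string; // UUID v7 (time-ordered)
    task_id: string;
    agent_id: string;
    agent_type: AgentType;
    iteration: number;
    timestamp: string; // ISO 8601 with timezone
  };
  execution_state: {
    phase: Phase;
    completed_items: string[]; // element type is an assumption
    pending_items: string[];
    blocked_items: string[];
    current_focus: string;
  };
  context_summary: {
    key_decisions: string[];
    assumptions: string[];
    constraints: string[];
    external_dependencies: string[];
  };
  metrics: {
    tokens_consumed: number;
    tools_invoked: number;
    files_modified: string[];
    tests_status: { passed: number; failed: number; skipped: number; coverage_percent: number };
  };
  recovery: {
    last_successful_state: string;
    rollback_instructions: string;
    continuation_prompt: string;
  };
  compliance: {
    event_log_ref: string;
    hash: string;
    signature?: string; // optional per the schema comment
    retention_policy: string;
  };
}

// Hypothetical sanity checks; the requirements do not define validation rules.
function validateCheckpoint(cp: Checkpoint): string[] {
  const errors: string[] = [];
  if (!Number.isInteger(cp.metadata.iteration) || cp.metadata.iteration < 0) {
    errors.push("iteration must be a non-negative integer");
  }
  if (Number.isNaN(Date.parse(cp.metadata.timestamp))) {
    errors.push("timestamp must be a parseable ISO 8601 string");
  }
  const coverage = cp.metrics.tests_status.coverage_percent;
  if (coverage < 0 || coverage > 100) {
    errors.push("coverage_percent must be between 0 and 100");
  }
  return errors;
}
```

`validateCheckpoint` returns a list of problems instead of throwing, so a caller can log every violation in one audit event.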
2.2 Checkpoint Operations
FR-002: Checkpoint Write Operations
MUST support:
├── create_checkpoint(task_id, agent_id, state) → checkpoint_id
├── update_checkpoint(checkpoint_id, partial_state) → checkpoint_id
├── finalize_checkpoint(checkpoint_id, status) → void
└── link_checkpoints(parent_id, child_id) → void
Transactional requirements:
- All writes MUST be atomic
- Failed writes MUST NOT corrupt existing checkpoints
- Concurrent writes to same task MUST be serialized
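The write operations and their invariants can be illustrated with an in-memory stand-in. Everything here is hypothetical scaffolding (the class name, the `cp-N` identifiers, the `finalStatus` field); a real implementation would run each method inside a FoundationDB transaction and use UUID v7 identifiers.

```typescript
interface StoredCheckpoint {
  id: string;
  taskId: string;
  agentId: string;
  state: Record<string, unknown>;
  finalStatus: string | null; // null while the checkpoint is still mutable
  parentId: string | null;
}

class InMemoryCheckpointStore {
  private records = new Map<string, StoredCheckpoint>();
  private counter = 0;

  createCheckpoint(taskId: string, agentId: string, state: Record<string, unknown>): string {
    const id = `cp-${++this.counter}`; // placeholder; the spec calls for UUID v7
    this.records.set(id, { id, taskId, agentId, state: { ...state }, finalStatus: null, parentId: null });
    return id;
  }

  updateCheckpoint(id: string, partial: Record<string, unknown>): string {
    const current = this.mustGetMutable(id);
    // Build the full replacement first, then swap it in with one set():
    // an error before this point leaves the stored record untouched.
    const next = { ...current, state: { ...current.state, ...partial } };
    this.records.set(id, next);
    return id;
  }

  finalizeCheckpoint(id: string, status: string): void {
    const current = this.mustGetMutable(id);
    this.records.set(id, { ...current, finalStatus: status });
  }

  linkCheckpoints(parentId: string, childId: string): void {
    if (!this.records.has(parentId)) throw new Error(`unknown parent ${parentId}`);
    const child = this.mustGetMutable(childId);
    this.records.set(childId, { ...child, parentId });
  }

  get(id: string): StoredCheckpoint | undefined {
    return this.records.get(id);
  }

  private mustGetMutable(id: string): StoredCheckpoint {
    const record = this.records.get(id);
    if (!record) throw new Error(`unknown checkpoint ${id}`);
    if (record.finalStatus !== null) throw new Error(`checkpoint ${id} is finalized`);
    return record;
  }
}
```

Because this sketch is synchronous and single-threaded, writes are trivially serialized; the serialization requirement only becomes interesting once several orchestrator processes share the store.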
FR-003: Checkpoint Read Operations
MUST support:
├── get_checkpoint(checkpoint_id) → Checkpoint
├── get_latest_checkpoint(task_id) → Checkpoint
├── get_checkpoint_history(task_id, limit) → [Checkpoint]
├── get_checkpoints_by_agent(agent_id, time_range) → [Checkpoint]
└── search_checkpoints(query) → [Checkpoint]
Performance requirements:
- Single checkpoint read: < 10ms
- History query (100 items): < 100ms
2.3 Handoff Protocol
FR-004: Agent Handoff Sequence
PRE-HANDOFF (Current Agent):
1. Detect handoff trigger (context > 70% OR task complete OR error threshold OR explicit handoff request)
2. Generate continuation_prompt summarizing:
   - What was accomplished
   - What remains
   - Current blockers
   - Recommended next steps
3. Write final checkpoint with phase="handoff"
4. Emit AGENT_HANDOFF event to orchestrator

HANDOFF (Orchestrator):
5. Receive AGENT_HANDOFF event
6. Validate checkpoint integrity (hash verification)
7. Select next agent (same type for continuation, different for phase change)
8. Inject checkpoint context into new agent's system prompt
9. Spawn new agent with fresh context window

POST-HANDOFF (New Agent):
10. Read latest checkpoint
11. Acknowledge checkpoint receipt
12. Resume from continuation_prompt
13. Create new checkpoint linking to parent
2.4 Recovery Protocol
FR-005: Failure Recovery Sequence
ON AGENT FAILURE:
1. Orchestrator detects agent termination (timeout, error, crash)
2. Retrieve last successful checkpoint
3. Analyze failure:
   - If recoverable: spawn new agent from checkpoint
   - If unrecoverable: mark task blocked, alert human
4. Log recovery attempt to compliance trail
ON CHECKPOINT CORRUPTION:
1. Detect via hash mismatch
2. Attempt recovery from last_successful_state reference
3. If chain broken: escalate to human intervention
4. Never silently continue with corrupt state
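The corruption path above amounts to walking the `last_successful_state` chain until an intact checkpoint is found. A minimal sketch, assuming hypothetical `load` and `isIntact` callbacks that stand in for the storage read and the hash check:

```typescript
interface ChainNode {
  id: string;
  lastSuccessfulState: string | null; // reference to the prior checkpoint
}

type RecoveryOutcome =
  | { kind: "recover"; checkpointId: string }
  | { kind: "escalate"; reason: string };

function findRecoveryPoint(
  startId: string,
  load: (id: string) => ChainNode | undefined,
  isIntact: (node: ChainNode) => boolean,
): RecoveryOutcome {
  const seen = new Set<string>();
  let id: string | null = startId;
  while (id !== null) {
    if (seen.has(id)) return { kind: "escalate", reason: `cycle in checkpoint chain at ${id}` };
    seen.add(id);
    const node = load(id);
    if (!node) return { kind: "escalate", reason: `chain broken at ${id}` };
    if (isIntact(node)) return { kind: "recover", checkpointId: node.id };
    id = node.lastSuccessfulState; // corrupt: fall back to the prior checkpoint
  }
  return { kind: "escalate", reason: "no intact checkpoint in chain" };
}
```

The function never returns a corrupt checkpoint: every path either yields a verified ID or an explicit escalation, which matches the rule against silently continuing with corrupt state.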
3. Non-Functional Requirements
3.1 Performance
| Requirement | Specification |
|---|---|
| NFR-001 | Checkpoint writes < 100ms p99 |
| NFR-002 | Checkpoint reads < 10ms p99 |
| NFR-003 | History queries < 100ms for 100 items |
| NFR-004 | Support 1000+ concurrent checkpoints |
| NFR-005 | Zero data loss on node failure (FoundationDB replication) |
3.2 Reliability
| Requirement | Specification |
|---|---|
| NFR-006 | 99.99% checkpoint service availability |
| NFR-007 | Automatic retry on transient failures (3x with backoff) |
| NFR-008 | Graceful degradation if FDB unavailable (queue locally) |
| NFR-009 | Checkpoint validation on every read |
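NFR-007 could be met with a small retry wrapper along these lines. The reading of "3x" as three retries after the first attempt, the 50ms base delay, and the doubling factor are all assumptions; `isTransient` and `sleep` are hypothetical injection points.

```typescript
// Delay before retry i (0-based): baseMs * 2^i, e.g. 50, 100, 200.
function backoffDelaysMs(retries: number, baseMs: number): number[] {
  return Array.from({ length: retries }, (_, i) => baseMs * 2 ** i);
}

async function withRetry<T>(
  operation: () => Promise<T>,
  isTransient: (error: unknown) => boolean,
  sleep: (ms: number) => Promise<void>,
  retries = 3,
  baseMs = 50,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Give up on non-transient errors or once the retry budget is spent.
      if (attempt >= retries || !isTransient(error)) throw error;
      await sleep(baseMs * 2 ** attempt);
    }
  }
}
```

Injecting `sleep` keeps the wrapper testable without real timers; production code would pass a `setTimeout`-based implementation and probably add jitter.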
3.3 Compliance
| Requirement | Specification |
|---|---|
| NFR-010 | FDA 21 CFR Part 11: Electronic signatures on checkpoints |
| NFR-011 | HIPAA: Encryption at rest and in transit |
| NFR-012 | SOC2: Complete audit trail of all state changes |
| NFR-013 | Retention: Configurable per-task retention policy |
| NFR-014 | Immutability: Finalized checkpoints cannot be modified |
3.4 Observability
| Requirement | Specification |
|---|---|
| NFR-015 | Metrics: checkpoint_write_latency_ms histogram |
| NFR-016 | Metrics: checkpoint_read_latency_ms histogram |
| NFR-017 | Metrics: handoff_success_rate gauge |
| NFR-018 | Metrics: recovery_attempts_total counter |
| NFR-019 | Logs: Structured JSON for all checkpoint operations |
| NFR-020 | Traces: Distributed tracing across agent handoffs |
4. Implementation Steps
Phase 1: Schema and Storage (Week 1-2)
Step 1.1: Define Checkpoint Schema
├── Create TypeScript/Rust types for Checkpoint
├── Define FoundationDB key structure:
│   └── /coditect/checkpoints/{task_id}/{checkpoint_id}
├── Implement serialization (MessagePack for efficiency)
└── Add schema versioning for forward compatibility
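The key structure in Step 1.1 can be sketched with plain strings. The helper names are hypothetical, and a production layout would more likely use FoundationDB's tuple layer than hand-built paths; the point is only that ordered keys plus time-ordered UUID v7 IDs make "latest checkpoint for a task" a range read.

```typescript
const CHECKPOINT_ROOT = "/coditect/checkpoints";

// encodeURIComponent escapes "/" so an ID can never add a path segment.
function checkpointKey(taskId: string, checkpointId: string): string {
  return `${CHECKPOINT_ROOT}/${encodeURIComponent(taskId)}/${encodeURIComponent(checkpointId)}`;
}

// Half-open range [begin, end) covering every checkpoint of one task.
// "0" is the character immediately after "/", so `${base}0` is the
// smallest key that sorts after all keys starting with `${base}/`.
function taskKeyRange(taskId: string): { begin: string; end: string } {
  const base = `${CHECKPOINT_ROOT}/${encodeURIComponent(taskId)}`;
  return { begin: `${base}/`, end: `${base}0` };
}

// With time-ordered checkpoint IDs, the last key in the range is the newest.
function latestCheckpointKey(keys: string[], taskId: string): string | undefined {
  const { begin, end } = taskKeyRange(taskId);
  const inRange = keys.filter((key) => key >= begin && key < end).sort();
  return inRange[inRange.length - 1];
}
```

Note that the range for `task-1` excludes keys of `task-10`, which a naive prefix match on `/coditect/checkpoints/task-1` would wrongly include.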
Step 1.2: Implement Storage Layer
├── Create CheckpointRepository interface
├── Implement FoundationDBCheckpointRepository
├── Add transaction handling for atomic writes
├── Implement optimistic locking for concurrent access
└── Add connection pooling and retry logic
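Optimistic locking in Step 1.2 can be illustrated with a version counter per key. This is an application-level sketch with hypothetical names; FoundationDB transactions already use optimistic concurrency control and reject conflicting commits, so the real repository may be able to lean on that instead of an explicit version field.

```typescript
interface Versioned<T> {
  value: T;
  version: number; // incremented on every successful write
}

class VersionConflictError extends Error {
  constructor(key: string, expected: number, actual: number) {
    super(`version conflict on ${key}: expected ${expected}, found ${actual}`);
    this.name = "VersionConflictError";
  }
}

class OptimisticStore<T> {
  private rows = new Map<string, Versioned<T>>();

  read(key: string): Versioned<T> | undefined {
    return this.rows.get(key);
  }

  // Writes succeed only if the caller saw the current version.
  // Use expectedVersion 0 to create a key that does not exist yet.
  write(key: string, value: T, expectedVersion: number): number {
    const currentVersion = this.rows.get(key)?.version ?? 0;
    if (currentVersion !== expectedVersion) {
      throw new VersionConflictError(key, expectedVersion, currentVersion);
    }
    const nextVersion = currentVersion + 1;
    this.rows.set(key, { value, version: nextVersion });
    return nextVersion;
  }
}
```

A writer that loses the race gets an error rather than overwriting the other agent's state; the usual response is to re-read, re-apply the change, and retry.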
Step 1.3: Add Compliance Layer
├── Implement SHA-256 hash generation
├── Add optional cryptographic signing
├── Create immutability enforcement (no updates after finalize)
└── Implement audit event emission
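Hash generation for Step 1.3 needs a deterministic serialization, otherwise two equal checkpoints with different key order hash differently. The sketch below assumes a sorted-key JSON canonical form and assumes that `compliance.hash` and `compliance.signature` are excluded from the hashed content; the requirements only say "SHA-256 of checkpoint content", so both choices are illustrative. It uses Node's built-in `node:crypto` module.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };
type JsonObject = { [key: string]: Json };

// Serialize with object keys sorted so equal content yields equal bytes.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  const obj = value;
  const parts = Object.keys(obj).sort().map((key) => `${JSON.stringify(key)}:${canonicalize(obj[key])}`);
  return `{${parts.join(",")}}`;
}

function hashCheckpoint(checkpoint: JsonObject): string {
  const content: JsonObject = { ...checkpoint };
  const compliance = checkpoint["compliance"];
  if (compliance !== null && typeof compliance === "object" && !Array.isArray(compliance)) {
    // Assumption: the hash and signature fields are not part of the hashed content.
    const rest: JsonObject = {};
    for (const key of Object.keys(compliance)) {
      if (key !== "hash" && key !== "signature") rest[key] = compliance[key];
    }
    content["compliance"] = rest;
  }
  return createHash("sha256").update(canonicalize(content), "utf8").digest("hex");
}

function verifyCheckpoint(checkpoint: JsonObject): boolean {
  const compliance = checkpoint["compliance"];
  if (compliance === null || typeof compliance !== "object" || Array.isArray(compliance)) return false;
  return compliance["hash"] === hashCheckpoint(checkpoint);
}
```

If checkpoints are stored as MessagePack, the same idea applies, but the hash input should be defined over one canonical encoding and documented so that auditors can recompute it.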
Phase 2: Checkpoint Operations (Week 3-4)
Step 2.1: Write Operations
├── Implement create_checkpoint()
│   ├── Validate schema
│   ├── Generate checkpoint_id (UUID v7)
│   ├── Calculate hash
│   ├── Write to FDB transactionally
│   └── Emit CHECKPOINT_CREATED event
├── Implement update_checkpoint()
│   ├── Verify checkpoint not finalized
│   ├── Merge partial state
│   ├── Recalculate hash
│   └── Write atomically
└── Implement finalize_checkpoint()
    ├── Mark as immutable
    ├── Apply retention policy
    └── Emit CHECKPOINT_FINALIZED event
Step 2.2: Read Operations
├── Implement get_checkpoint()
│   ├── Read from FDB
│   ├── Verify hash integrity
│   └── Deserialize and return
├── Implement get_latest_checkpoint()
│   ├── Query by task_id with ordering
│   └── Return most recent non-corrupt
├── Implement get_checkpoint_history()
│   ├── Range query with pagination
│   └── Build checkpoint chain
└── Implement search_checkpoints()
    ├── Secondary index queries
    └── Filter by agent, time, status
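The "most recent non-corrupt" rule for get_latest_checkpoint() can be sketched as a newest-first scan. The record shape and the `verify` callback are hypothetical; the ordering relies on UUID v7 strings sorting chronologically.

```typescript
interface CheckpointRecord {
  checkpointId: string; // UUID v7, so string order is time order
  body: string;
}

interface LatestResult {
  latest: CheckpointRecord | null;
  corrupted: string[]; // IDs skipped on the way, e.g. for CHECKPOINT_CORRUPTED events
}

function latestValidCheckpoint(
  records: CheckpointRecord[],
  verify: (record: CheckpointRecord) => boolean,
): LatestResult {
  const newestFirst = [...records].sort((a, b) =>
    a.checkpointId < b.checkpointId ? 1 : a.checkpointId > b.checkpointId ? -1 : 0,
  );
  const corrupted: string[] = [];
  for (const record of newestFirst) {
    if (verify(record)) return { latest: record, corrupted };
    corrupted.push(record.checkpointId);
  }
  return { latest: null, corrupted };
}
```

Returning the skipped IDs alongside the result lets the caller report corruption instead of hiding it behind a successful read.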
Phase 3: Handoff Protocol (Week 5-6)
Step 3.1: Handoff Trigger Detection
├── Implement context utilization monitor
│   └── Track token count vs limit
├── Define handoff triggers:
│   ├── context_utilization > 70%
│   ├── task_phase_complete
│   ├── error_count > threshold
│   └── explicit_handoff_request
└── Add trigger evaluation to agent loop
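Trigger evaluation for Step 3.1 can be a pure function called once per agent loop iteration. The state fields and the default error threshold of 3 are assumptions (the requirements leave the threshold open); the 70% utilization limit comes from the trigger list.

```typescript
type HandoffTrigger =
  | "context_utilization"
  | "task_phase_complete"
  | "error_threshold"
  | "explicit_handoff_request";

interface AgentLoopState {
  tokensUsed: number;
  tokenLimit: number;
  phaseComplete: boolean;
  errorCount: number;
  handoffRequested: boolean;
}

interface TriggerConfig {
  maxUtilization: number; // fraction of the context window, 0..1
  maxErrors: number;      // hypothetical default; not set by the requirements
}

const DEFAULT_TRIGGER_CONFIG: TriggerConfig = { maxUtilization: 0.7, maxErrors: 3 };

// Returns every trigger that currently fires; an empty array means "keep going".
function evaluateHandoffTriggers(
  state: AgentLoopState,
  config: TriggerConfig = DEFAULT_TRIGGER_CONFIG,
): HandoffTrigger[] {
  const fired: HandoffTrigger[] = [];
  if (state.tokensUsed / state.tokenLimit > config.maxUtilization) fired.push("context_utilization");
  if (state.phaseComplete) fired.push("task_phase_complete");
  if (state.errorCount > config.maxErrors) fired.push("error_threshold");
  if (state.handoffRequested) fired.push("explicit_handoff_request");
  return fired;
}
```

Returning all fired triggers, not just the first, gives the checkpoint and the audit trail the full reason for the handoff.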
Step 3.2: Continuation Prompt Generation
├── Create prompt template for handoff
├── Implement state summarization:
│   ├── Extract key decisions
│   ├── List completed work
│   ├── Identify pending items
│   └── Note blockers and risks
├── Add context window budget for summary
└── Validate prompt fits target context
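Prompt generation with a budget, as outlined in Step 3.2, might look like the sketch below. The four section titles follow the FR-004 summary list; the `length / 4` token estimate is a crude placeholder for a real tokenizer, and the truncation strategy (show fewer items per section until the prompt fits) is one possible choice among several.

```typescript
interface HandoffSummary {
  accomplished: string[];
  remaining: string[];
  blockers: string[];
  nextSteps: string[];
}

// Placeholder estimate; a real implementation would use the model's tokenizer.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function buildContinuationPrompt(summary: HandoffSummary, tokenBudget: number): string {
  const sections: [string, string[]][] = [
    ["What was accomplished", summary.accomplished],
    ["What remains", summary.remaining],
    ["Current blockers", summary.blockers],
    ["Recommended next steps", summary.nextSteps],
  ];
  // Render with at most `limit` items per section, noting what was cut.
  const render = (limit: number): string =>
    sections
      .map(([title, items]) => {
        const lines = items.slice(0, limit).map((item) => `- ${item}`);
        const omitted = items.length - lines.length;
        if (omitted > 0) lines.push(`- (${omitted} more omitted)`);
        return `## ${title}\n${lines.length > 0 ? lines.join("\n") : "- none"}`;
      })
      .join("\n\n");

  const maxItems = Math.max(0, ...sections.map(([, items]) => items.length));
  for (let limit = maxItems; limit >= 0; limit--) {
    const prompt = render(limit);
    if (estimateTokens(prompt) <= tokenBudget) return prompt;
  }
  throw new Error("continuation prompt cannot fit the token budget");
}
```

Throwing when even the headings do not fit makes the "validate prompt fits target context" step explicit instead of emitting a silently truncated prompt.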
Step 3.3: Agent Spawning Integration
├── Update Orchestrator to handle AGENT_HANDOFF
├── Implement checkpoint injection into system prompt
├── Add agent selection logic:
│   ├── Same agent type for continuation
│   └── Different type for phase transition
└── Ensure fresh context window creation
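The selection rule in Step 3.3 reduces to a lookup once each phase has an owning agent type. The phase-to-agent mapping below is purely hypothetical, since the requirements do not say which agent type owns which phase; only the "same type for continuation, different type for phase transition" rule comes from the text.

```typescript
type AgentType = "architecture" | "implementation" | "qa" | "documentation";
type WorkPhase = "planning" | "implementing" | "testing" | "reviewing" | "complete";

// Hypothetical ownership table; adjust to the real orchestration design.
const AGENT_FOR_PHASE: Record<WorkPhase, AgentType> = {
  planning: "architecture",
  implementing: "implementation",
  testing: "qa",
  reviewing: "architecture",
  complete: "documentation",
};

function selectNextAgentType(
  currentAgent: AgentType,
  currentPhase: WorkPhase,
  nextPhase: WorkPhase,
): AgentType {
  // Continuation: the phase is unchanged, so keep the same agent type
  // and only swap in a fresh context window.
  if (nextPhase === currentPhase) return currentAgent;
  // Phase transition: hand over to the agent type that owns the next phase.
  return AGENT_FOR_PHASE[nextPhase];
}
```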
Phase 4: Recovery and Testing (Week 7-8)
Step 4.1: Recovery Implementation
├── Implement failure detection in Orchestrator
├── Add checkpoint chain traversal for recovery point
├── Create recovery decision logic:
│   ├── Analyze failure type
│   ├── Determine recoverability
│   └── Select recovery strategy
├── Implement agent respawn from checkpoint
└── Add circuit breaker for repeated failures
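The circuit breaker in Step 4.1 exists to stop a recovery loop. A minimal per-task sketch, assuming a limit of three attempts (the requirements do not fix a number) and hypothetical method names:

```typescript
class RecoveryCircuitBreaker {
  private readonly maxAttempts: number;
  private attempts = new Map<string, number>();

  constructor(maxAttempts = 3) {
    this.maxAttempts = maxAttempts;
  }

  // Returns true and consumes one attempt if recovery may proceed;
  // returns false once the breaker is open and a human should be alerted.
  tryAcquire(taskId: string): boolean {
    const used = this.attempts.get(taskId) ?? 0;
    if (used >= this.maxAttempts) return false;
    this.attempts.set(taskId, used + 1);
    return true;
  }

  // A successful recovery closes the breaker for that task.
  recordSuccess(taskId: string): void {
    this.attempts.delete(taskId);
  }
}
```

An orchestrator would call `tryAcquire` before each respawn and mark the task blocked when it returns false, which bounds retries per task without affecting other tasks.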
Step 4.2: Testing
├── Unit tests:
│   ├── Schema validation
│   ├── Storage operations
│   ├── Hash verification
│   └── Compliance enforcement
├── Integration tests:
│   ├── Full handoff cycle
│   ├── Recovery scenarios
│   ├── Concurrent checkpoint writes
│   └── FDB failure simulation
└── Load tests:
    ├── 1000 concurrent checkpoints
    ├── Sustained write throughput
    └── Recovery under load
Step 4.3: Documentation
├── API documentation
├── Handoff protocol specification
├── Recovery runbook
└── Compliance evidence guide
5. API Specification
5.1 Checkpoint Service Interface
interface CheckpointService {
  // Write operations
  createCheckpoint(params: CreateCheckpointParams): Promise<Checkpoint>;
  updateCheckpoint(id: CheckpointId, update: PartialCheckpoint): Promise<Checkpoint>;
  finalizeCheckpoint(id: CheckpointId, status: FinalStatus): Promise<void>;
  linkCheckpoints(parentId: CheckpointId, childId: CheckpointId): Promise<void>; // required by FR-002

  // Read operations
  getCheckpoint(id: CheckpointId): Promise<Checkpoint | null>;
  getLatestCheckpoint(taskId: TaskId): Promise<Checkpoint | null>;
  getCheckpointHistory(taskId: TaskId, options?: HistoryOptions): Promise<Checkpoint[]>;
  getCheckpointsByAgent(agentId: AgentId, timeRange: TimeRange): Promise<Checkpoint[]>; // required by FR-003
  searchCheckpoints(query: CheckpointQuery): Promise<SearchResult<Checkpoint>>;

  // Handoff operations
  initiateHandoff(checkpointId: CheckpointId): Promise<HandoffResult>;
  acknowledgeHandoff(checkpointId: CheckpointId, newAgentId: AgentId): Promise<void>;

  // Recovery operations
  findRecoveryPoint(taskId: TaskId): Promise<Checkpoint | null>;
  executeRecovery(checkpointId: CheckpointId): Promise<RecoveryResult>;
}
5.2 Event Definitions
// Events emitted by Checkpoint Service
type CheckpointEvents =
  | { type: 'CHECKPOINT_CREATED'; payload: { checkpointId: string; taskId: string; agentId: string } }
  | { type: 'CHECKPOINT_UPDATED'; payload: { checkpointId: string; fields: string[] } }
  | { type: 'CHECKPOINT_FINALIZED'; payload: { checkpointId: string; status: string } }
  | { type: 'AGENT_HANDOFF'; payload: { fromCheckpoint: string; toAgent: string } }
  | { type: 'RECOVERY_INITIATED'; payload: { checkpointId: string; reason: string } }
  | { type: 'RECOVERY_COMPLETED'; payload: { checkpointId: string; newAgentId: string } }
  | { type: 'CHECKPOINT_CORRUPTED'; payload: { checkpointId: string; error: string } };
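Because the events form a discriminated union on `type`, a consumer can handle them with an exhaustive `switch`. The sketch below repeats the union so that it is self-contained; `describeEvent` is a hypothetical consumer that produces audit-log text.

```typescript
type CheckpointEvents =
  | { type: "CHECKPOINT_CREATED"; payload: { checkpointId: string; taskId: string; agentId: string } }
  | { type: "CHECKPOINT_UPDATED"; payload: { checkpointId: string; fields: string[] } }
  | { type: "CHECKPOINT_FINALIZED"; payload: { checkpointId: string; status: string } }
  | { type: "AGENT_HANDOFF"; payload: { fromCheckpoint: string; toAgent: string } }
  | { type: "RECOVERY_INITIATED"; payload: { checkpointId: string; reason: string } }
  | { type: "RECOVERY_COMPLETED"; payload: { checkpointId: string; newAgentId: string } }
  | { type: "CHECKPOINT_CORRUPTED"; payload: { checkpointId: string; error: string } };

function describeEvent(event: CheckpointEvents): string {
  switch (event.type) {
    case "CHECKPOINT_CREATED":
      return `created ${event.payload.checkpointId} for task ${event.payload.taskId}`;
    case "CHECKPOINT_UPDATED":
      return `updated ${event.payload.checkpointId}: ${event.payload.fields.join(", ")}`;
    case "CHECKPOINT_FINALIZED":
      return `finalized ${event.payload.checkpointId} as ${event.payload.status}`;
    case "AGENT_HANDOFF":
      return `handoff from ${event.payload.fromCheckpoint} to ${event.payload.toAgent}`;
    case "RECOVERY_INITIATED":
      return `recovery started at ${event.payload.checkpointId}: ${event.payload.reason}`;
    case "RECOVERY_COMPLETED":
      return `recovery completed at ${event.payload.checkpointId} by ${event.payload.newAgentId}`;
    case "CHECKPOINT_CORRUPTED":
      return `corruption in ${event.payload.checkpointId}: ${event.payload.error}`;
    default: {
      // Compile-time exhaustiveness check: adding an event type without a
      // case makes this assignment fail to type-check.
      const unreachable: never = event;
      throw new Error(`unhandled event: ${JSON.stringify(unreachable)}`);
    }
  }
}
```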
6. Dependencies
| Dependency | Type | Status |
|---|---|---|
| FoundationDB cluster | Infrastructure | ✅ Available |
| Event bus (for CHECKPOINT_* events) | Platform | ✅ Available |
| Agent Orchestrator | Platform | ⚠️ Requires update |
| Cryptographic signing service | Security | ⚠️ May need implementation |
| Observability stack (metrics, logs, traces) | Platform | ✅ Available |
7. Risks and Mitigations
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| FoundationDB latency spikes | Agent delays | Medium | Local write queue, async finalization |
| Checkpoint size exceeds limits | Write failures | Low | Compression, summary truncation |
| Hash collision | Data integrity | Very Low | SHA-256 is collision-resistant |
| Concurrent modification | Data loss | Medium | Optimistic locking, conflict detection |
| Recovery loop | Infinite retries | Medium | Circuit breaker, max retry limits |
8. Acceptance Criteria
- Checkpoints persist across agent restarts
- Handoff completes in < 500ms total latency
- Recovery succeeds from last valid checkpoint
- Audit trail passes compliance review
- No checkpoint data loss under simulated failures
- Performance targets met under load test
- Documentation complete and reviewed
Document Version: 1.0 | Last Updated: January 24, 2026