Implementation Requirements: Agent Checkpoint and Handoff Protocol

Document ID: IMPL-REQ-001
Priority: P0 (Critical Path)
Target ADR: ADR-108 (Proposed)
Estimated Effort: 2-3 Sprints
Dependencies: FoundationDB infrastructure, Multi-Agent Orchestration Layer

1. Overview

1.1 Problem Statement

Agent handoff currently relies on implicit state: the contents of the context window and git history. Experience reported by the Ralph Wiggum community indicates that explicit, structured checkpoints are essential for:

Reliable recovery from failures
Clean context window transitions
Compliance audit trails
Cost attribution per task segment

1.2 Objective

Implement a standardized checkpoint protocol that enables agents to persist state to FoundationDB, allowing fresh-context iterations while maintaining full task continuity and compliance evidence.

1.3 Success Criteria

Metric	Target
Checkpoint write latency	< 100ms (p99)
Recovery success rate	> 99.9%
Audit trail completeness	100% of state transitions
Context window utilization	< 70% at handoff

2. Functional Requirements

2.1 Checkpoint Schema

# FR-001: Define checkpoint data structure
checkpoint_schema:
  version: "1.0"
  
  metadata:
    checkpoint_id: string       # UUID v7 (time-ordered)
    task_id: string             # Parent task reference
    agent_id: string            # Executing agent identifier
    agent_type: enum            # [architecture, implementation, qa, documentation]
    iteration: integer          # Loop iteration number
    timestamp: datetime         # ISO 8601 with timezone
    
  execution_state:
    phase: enum                 # [planning, implementing, testing, reviewing, handoff, complete]
    completed_items: array      # List of completed work items
    pending_items: array        # Remaining work items
    blocked_items: array        # Items with blockers
    current_focus: string       # Active work item ID
    
  context_summary:
    key_decisions: array        # Architectural decisions made
    assumptions: array          # Assumptions in effect
    constraints: array          # Active constraints
    external_dependencies: array # Third-party dependencies
    
  metrics:
    tokens_consumed: integer    # Total tokens this iteration
    tools_invoked: integer      # Tool call count
    files_modified: array       # List of changed files
    tests_status:
      passed: integer
      failed: integer
      skipped: integer
      coverage_percent: float
      
  recovery:
    last_successful_state: string  # Reference to prior checkpoint
    rollback_instructions: string  # How to undo current iteration
    continuation_prompt: string    # Prompt to resume work
    
  compliance:
    event_log_ref: string       # FoundationDB event stream ref
    hash: string                # SHA-256 of checkpoint content, excluding the hash and signature fields
    signature: string           # Cryptographic signature; optional unless NFR-010 applies
    retention_policy: string    # Compliance retention requirement
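
The schema above could be rendered in TypeScript roughly as follows. This is a sketch, not the agreed type definitions: the array element types (`string[]`), the nullable `last_successful_state`, and the `validateCheckpoint` helper with its specific checks are assumptions made for illustration.

```typescript
// Sketch of the FR-001 schema as TypeScript types. Field names mirror the YAML;
// array element types are assumed, since the schema only says "array".
type AgentType = "architecture" | "implementation" | "qa" | "documentation";
// "handoff" is listed because FR-004 writes the final checkpoint with phase="handoff".
type Phase = "planning" | "implementing" | "testing" | "reviewing" | "handoff" | "complete";

interface Checkpoint {
  version: string;
  metadata: {
    checkpoint_id: string; // UUID v7
    task_id: string;
    agent_id: string;
    agent_type: AgentType;
    iteration: number;
    timestamp: string; // ISO 8601 with timezone
  };
  execution_state: {
    phase: Phase;
    completed_items: string[];
    pending_items: string[];
    blocked_items: string[];
    current_focus: string;
  };
  context_summary: {
    key_decisions: string[];
    assumptions: string[];
    constraints: string[];
    external_dependencies: string[];
  };
  metrics: {
    tokens_consumed: number;
    tools_invoked: number;
    files_modified: string[];
    tests_status: { passed: number; failed: number; skipped: number; coverage_percent: number };
  };
  recovery: {
    last_successful_state: string | null; // assumption: null for the first checkpoint of a task
    rollback_instructions: string;
    continuation_prompt: string;
  };
  compliance: { event_log_ref: string; hash: string; signature?: string; retention_policy: string };
}

// Hypothetical helper: returns a list of problems, empty when nothing was found.
function validateCheckpoint(cp: Checkpoint): string[] {
  const errors: string[] = [];
  if (cp.metadata.checkpoint_id.length === 0) errors.push("checkpoint_id must not be empty");
  if (!Number.isInteger(cp.metadata.iteration) || cp.metadata.iteration < 0) {
    errors.push("iteration must be a non-negative integer");
  }
  if (Number.isNaN(Date.parse(cp.metadata.timestamp))) {
    errors.push("timestamp must be a parseable ISO 8601 string");
  }
  const tests = cp.metrics.tests_status;
  if ([tests.passed, tests.failed, tests.skipped].some((n) => !Number.isInteger(n) || n < 0)) {
    errors.push("test counts must be non-negative integers");
  }
  if (tests.coverage_percent < 0 || tests.coverage_percent > 100) {
    errors.push("coverage_percent must be within 0..100");
  }
  return errors;
}

const exampleCheckpoint: Checkpoint = {
  version: "1.0",
  metadata: {
    checkpoint_id: "00000000-03e8-7000-8000-000000000000", task_id: "task-1", agent_id: "agent-1",
    agent_type: "implementation", iteration: 3, timestamp: "2026-01-24T10:00:00+00:00",
  },
  execution_state: {
    phase: "implementing", completed_items: ["item-1"], pending_items: ["item-2"],
    blocked_items: [], current_focus: "item-2",
  },
  context_summary: { key_decisions: [], assumptions: [], constraints: [], external_dependencies: [] },
  metrics: {
    tokens_consumed: 42000, tools_invoked: 17, files_modified: ["src/repo.ts"],
    tests_status: { passed: 12, failed: 0, skipped: 1, coverage_percent: 82.5 },
  },
  recovery: { last_successful_state: null, rollback_instructions: "git revert HEAD", continuation_prompt: "Continue with item-2." },
  compliance: { event_log_ref: "events/task-1", hash: "", retention_policy: "default" },
};
```

Structural typing catches missing or misspelled fields at compile time; the runtime validator only covers value-level constraints that the type system cannot express.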

2.2 Checkpoint Operations

FR-002: Checkpoint Write Operations

MUST support:
├── create_checkpoint(task_id, agent_id, state) → checkpoint_id
├── update_checkpoint(checkpoint_id, partial_state) → checkpoint_id
├── finalize_checkpoint(checkpoint_id, status) → void
└── link_checkpoints(parent_id, child_id) → void

Transactional requirements:
- All writes MUST be atomic
- Failed writes MUST NOT corrupt existing checkpoints
- Concurrent writes to same task MUST be serialized

FR-003: Checkpoint Read Operations

MUST support:
├── get_checkpoint(checkpoint_id) → Checkpoint
├── get_latest_checkpoint(task_id) → Checkpoint
├── get_checkpoint_history(task_id, limit) → [Checkpoint]
├── get_checkpoints_by_agent(agent_id, time_range) → [Checkpoint]
└── search_checkpoints(query) → [Checkpoint]

Performance requirements:
- Single checkpoint read: < 10ms (p99)
- History query (100 items): < 100ms

2.3 Handoff Protocol

FR-004: Agent Handoff Sequence

PRE-HANDOFF (Current Agent):
1. Detect handoff trigger (context > 70% OR task complete OR error threshold)
2. Generate continuation_prompt summarizing:
   - What was accomplished
   - What remains
   - Current blockers
   - Recommended next steps
3. Write final checkpoint with phase="handoff"
4. Emit AGENT_HANDOFF event to orchestrator

HANDOFF (Orchestrator):
5. Receive AGENT_HANDOFF event
6. Validate checkpoint integrity (hash verification)
7. Select next agent (same type for continuation, different for phase change)
8. Inject checkpoint context into new agent's system prompt
9. Spawn new agent with fresh context window

POST-HANDOFF (New Agent):
10. Read latest checkpoint
11. Acknowledge checkpoint receipt
12. Resume from continuation_prompt
13. Create new checkpoint linking to parent

2.4 Recovery Protocol

FR-005: Failure Recovery Sequence

ON AGENT FAILURE:
1. Orchestrator detects agent termination (timeout, error, crash)
2. Retrieve last successful checkpoint
3. Analyze failure:
   - If recoverable: spawn new agent from checkpoint
   - If unrecoverable: mark task blocked, alert human
4. Log recovery attempt to compliance trail

ON CHECKPOINT CORRUPTION:
1. Detect via hash mismatch
2. Attempt recovery from last_successful_state reference
3. If chain broken: escalate to human intervention
4. Never silently continue with corrupt state

3. Non-Functional Requirements

3.1 Performance

Requirement	Specification
NFR-001	Checkpoint writes < 100ms p99
NFR-002	Checkpoint reads < 10ms p99
NFR-003	History queries < 100ms for 100 items
NFR-004	Support 1000+ concurrent checkpoints
NFR-005	Zero data loss on node failure (FoundationDB replication)

3.2 Reliability

Requirement	Specification
NFR-006	99.99% checkpoint service availability
NFR-007	Automatic retry on transient failures (3x with backoff)
NFR-008	Graceful degradation if FDB unavailable (queue locally)
NFR-009	Checkpoint validation on every read
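
NFR-007 might be implemented along the following lines. The three retries come from the requirement; the exponential schedule, the 100 ms base delay, the 2000 ms cap, and the `isTransient` classifier are assumptions chosen for illustration.

```typescript
// Delay before each retry: baseMs, 2*baseMs, 4*baseMs, ... capped at capMs.
function backoffDelaysMs(maxRetries: number, baseMs: number, capMs: number): number[] {
  return Array.from({ length: maxRetries }, (_, attempt) => Math.min(capMs, baseMs * 2 ** attempt));
}

// Runs op, retrying only when isTransient(err) is true and retries remain.
// Non-transient errors and exhausted retries rethrow the last error.
async function withRetry<T>(
  op: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  delays: number[] = backoffDelaysMs(3, 100, 2000),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      if (!isTransient(err) || attempt >= delays.length) throw err;
      await new Promise<void>((resolve) => setTimeout(resolve, delays[attempt]));
    }
  }
}
```

With the defaults, a write is attempted at most four times (one initial attempt plus three retries), which adds up to 700 ms of waiting in the worst case. That budget would need to be reconciled with the 100 ms p99 target in NFR-001.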

3.3 Compliance

Requirement	Specification
NFR-010	FDA 21 CFR Part 11: Electronic signatures on checkpoints
NFR-011	HIPAA: Encryption at rest and in transit
NFR-012	SOC2: Complete audit trail of all state changes
NFR-013	Retention: Configurable per-task retention policy
NFR-014	Immutability: Finalized checkpoints cannot be modified

3.4 Observability

Requirement	Specification
NFR-015	Metrics: checkpoint_write_latency_ms histogram
NFR-016	Metrics: checkpoint_read_latency_ms histogram
NFR-017	Metrics: handoff_success_rate gauge
NFR-018	Metrics: recovery_attempts_total counter
NFR-019	Logs: Structured JSON for all checkpoint operations
NFR-020	Traces: Distributed tracing across agent handoffs

4. Implementation Steps

Phase 1: Schema and Storage (Week 1-2)

Step 1.1: Define Checkpoint Schema
├── Create TypeScript/Rust types for Checkpoint
├── Define FoundationDB key structure:
│   └── /coditect/checkpoints/{task_id}/{checkpoint_id}
├── Implement serialization (MessagePack for efficiency)
└── Add schema versioning for forward compatibility
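
The key layout above pairs naturally with UUID v7 identifiers: because the leading 48 bits are a millisecond timestamp, keys under one task sort chronologically, so the latest checkpoint is the last key in the range. A minimal sketch, assuming string keys for readability (a real FoundationDB client would more likely use its tuple or subspace encoding) and using `Math.random`, which is not suitable for production identifiers:

```typescript
// Minimal UUID v7 generator following the RFC 9562 field layout:
// 48-bit Unix timestamp in ms, version nibble 7, variant bits 10, remaining bits random.
// Ordering within the same millisecond is NOT guaranteed by this sketch.
function uuidV7(nowMs: number = Date.now()): string {
  const ts = nowMs.toString(16).padStart(12, "0");
  const randHex = (digits: number): string =>
    Array.from({ length: digits }, () => Math.floor(Math.random() * 16).toString(16)).join("");
  const variant = (8 + Math.floor(Math.random() * 4)).toString(16); // one of 8, 9, a, b
  return `${ts.slice(0, 8)}-${ts.slice(8, 12)}-7${randHex(3)}-${variant}${randHex(3)}-${randHex(12)}`;
}

// Key structure from Step 1.1: /coditect/checkpoints/{task_id}/{checkpoint_id}
function checkpointKey(taskId: string, checkpointId: string): string {
  return `/coditect/checkpoints/${taskId}/${checkpointId}`;
}

// Returns the lexicographically greatest key, which is the most recent checkpoint
// when all keys belong to the same task and use UUID v7 identifiers.
function latestKey(keys: string[]): string | undefined {
  const sorted = [...keys].sort();
  return sorted[sorted.length - 1];
}
```

The same ordering property is what lets `get_checkpoint_history` be served by a single range read over the task prefix.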

Step 1.2: Implement Storage Layer
├── Create CheckpointRepository interface
├── Implement FoundationDBCheckpointRepository
├── Add transaction handling for atomic writes
├── Implement optimistic locking for concurrent access
└── Add connection pooling and retry logic
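
To illustrate the optimistic-locking step, the sketch below uses an in-memory store with a per-key version counter. `InMemoryCheckpointStore` and `compareAndSet` are hypothetical names; FoundationDB transactions already detect read-write conflicts at commit time, so a real `FoundationDBCheckpointRepository` may be able to rely on that instead of an explicit version field.

```typescript
interface Versioned<T> {
  version: number;
  value: T;
}

type WriteResult =
  | { ok: true; version: number }
  | { ok: false; currentVersion: number };

class InMemoryCheckpointStore<T> {
  private readonly rows = new Map<string, Versioned<T>>();

  read(key: string): Versioned<T> | undefined {
    return this.rows.get(key);
  }

  // Writes only if the caller saw the current version.
  // expectedVersion 0 means "the key must not exist yet".
  compareAndSet(key: string, expectedVersion: number, value: T): WriteResult {
    const currentVersion = this.rows.get(key)?.version ?? 0;
    if (currentVersion !== expectedVersion) {
      return { ok: false, currentVersion };
    }
    const version = currentVersion + 1;
    this.rows.set(key, { version, value });
    return { ok: true, version };
  }
}
```

A writer that receives `ok: false` has to re-read the checkpoint, reapply its change, and retry, which is how concurrent writes to the same task end up serialized without holding a lock.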

Step 1.3: Add Compliance Layer
├── Implement SHA-256 hash generation
├── Add optional cryptographic signing
├── Create immutability enforcement (no updates after finalize)
└── Implement audit event emission
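
The hashing and immutability steps could look roughly like this. The sketch assumes Node.js (`node:crypto`) and hashes a canonical JSON form with sorted keys so that logically equal checkpoints produce the same digest; the canonicalization scheme, `SealedCheckpoint`, and the function names are illustrative assumptions, and signing and audit events are not covered.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Deterministic serialization: object keys are emitted in sorted order.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  const obj: { [key: string]: Json } = value;
  const entries = Object.keys(obj)
    .sort()
    .map((key) => `${JSON.stringify(key)}:${canonicalize(obj[key] as Json)}`);
  return `{${entries.join(",")}}`;
}

function sha256Hex(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// The hash is stored next to the body, never inside it, so it does not hash itself.
interface SealedCheckpoint {
  readonly body: Json;
  readonly hash: string;
  readonly finalized: boolean;
}

function seal(body: Json): SealedCheckpoint {
  return { body, hash: sha256Hex(canonicalize(body)), finalized: false };
}

function verifyIntegrity(cp: SealedCheckpoint): boolean {
  return sha256Hex(canonicalize(cp.body)) === cp.hash;
}

// Immutability enforcement: updates are rejected once a checkpoint is finalized.
function updateBody(cp: SealedCheckpoint, body: Json): SealedCheckpoint {
  if (cp.finalized) throw new Error("checkpoint is finalized and cannot be modified");
  return seal(body);
}

function finalize(cp: SealedCheckpoint): SealedCheckpoint {
  return { ...cp, finalized: true };
}
```

Hashing a canonical form rather than the MessagePack bytes is a design choice in this sketch: it keeps the digest stable if the serializer's field order changes, at the cost of a second serialization pass.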

Phase 2: Checkpoint Operations (Week 3-4)

Step 2.1: Write Operations
├── Implement create_checkpoint()
│   ├── Validate schema
│   ├── Generate checkpoint_id (UUID v7)
│   ├── Calculate hash
│   ├── Write to FDB transactionally
│   └── Emit CHECKPOINT_CREATED event
├── Implement update_checkpoint()
│   ├── Verify checkpoint not finalized
│   ├── Merge partial state
│   ├── Recalculate hash
│   └── Write atomically
└── Implement finalize_checkpoint()
    ├── Mark as immutable
    ├── Apply retention policy
    └── Emit CHECKPOINT_FINALIZED event

Step 2.2: Read Operations
├── Implement get_checkpoint()
│   ├── Read from FDB
│   ├── Verify hash integrity
│   └── Deserialize and return
├── Implement get_latest_checkpoint()
│   ├── Query by task_id with ordering
│   └── Return most recent non-corrupt
├── Implement get_checkpoint_history()
│   ├── Range query with pagination
│   └── Build checkpoint chain
└── Implement search_checkpoints()
    ├── Secondary index queries
    └── Filter by agent, time, status

Phase 3: Handoff Protocol (Week 5-6)

Step 3.1: Handoff Trigger Detection
├── Implement context utilization monitor
│   └── Track token count vs limit
├── Define handoff triggers:
│   ├── context_utilization > 70%
│   ├── task_phase_complete
│   ├── error_count > threshold
│   └── explicit_handoff_request
└── Add trigger evaluation to agent loop
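
The trigger evaluation in Step 3.1 can be sketched as a pure function that the agent loop calls once per iteration. The 70% utilization limit comes from the requirements; the `AgentLoopState` shape and the default error threshold of 3 are assumptions.

```typescript
type HandoffTrigger =
  | "context_utilization"
  | "task_phase_complete"
  | "error_threshold"
  | "explicit_handoff_request";

interface AgentLoopState {
  tokensUsed: number;
  tokenLimit: number;
  phaseComplete: boolean;
  errorCount: number;
  handoffRequested: boolean;
}

interface TriggerConfig {
  maxContextUtilization: number; // fraction of the context window, e.g. 0.7
  maxErrors: number;             // assumed value; the requirements only say "threshold"
}

const DEFAULT_TRIGGER_CONFIG: TriggerConfig = { maxContextUtilization: 0.7, maxErrors: 3 };

// Returns every trigger that fired; an empty array means the agent keeps working.
function evaluateHandoffTriggers(
  state: AgentLoopState,
  config: TriggerConfig = DEFAULT_TRIGGER_CONFIG,
): HandoffTrigger[] {
  const fired: HandoffTrigger[] = [];
  if (state.tokensUsed / state.tokenLimit > config.maxContextUtilization) {
    fired.push("context_utilization");
  }
  if (state.phaseComplete) fired.push("task_phase_complete");
  if (state.errorCount > config.maxErrors) fired.push("error_threshold");
  if (state.handoffRequested) fired.push("explicit_handoff_request");
  return fired;
}
```

Returning all fired triggers, rather than the first one, lets the handoff checkpoint record why the handoff happened, which is useful for the audit trail.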

Step 3.2: Continuation Prompt Generation
├── Create prompt template for handoff
├── Implement state summarization:
│   ├── Extract key decisions
│   ├── List completed work
│   ├── Identify pending items
│   └── Note blockers and risks
├── Add context window budget for summary
└── Validate prompt fits target context

Step 3.3: Agent Spawning Integration
├── Update Orchestrator to handle AGENT_HANDOFF
├── Implement checkpoint injection into system prompt
├── Add agent selection logic:
│   ├── Same agent type for continuation
│   └── Different type for phase transition
└── Ensure fresh context window creation
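
A possible shape for the selection and injection logic is shown below. The phase-to-agent mapping is entirely hypothetical, since the requirements do not define one, as are the `HandoffContext` fields and the wording of the injected prompt.

```typescript
type AgentType = "architecture" | "implementation" | "qa" | "documentation";
type Phase = "planning" | "implementing" | "testing" | "reviewing" | "complete";

// Hypothetical mapping used only for illustration.
const AGENT_FOR_PHASE: Record<Phase, AgentType> = {
  planning: "architecture",
  implementing: "implementation",
  testing: "qa",
  reviewing: "qa",
  complete: "documentation",
};

interface HandoffContext {
  checkpointId: string;
  taskId: string;
  fromAgentType: AgentType;
  nextPhase: Phase;
  phaseChanged: boolean;
  continuationPrompt: string;
}

// Same agent type for a continuation, mapped type when the phase changes.
function selectNextAgentType(ctx: HandoffContext): AgentType {
  return ctx.phaseChanged ? AGENT_FOR_PHASE[ctx.nextPhase] : ctx.fromAgentType;
}

// The new agent starts with only the base prompt plus the checkpoint summary,
// which is what keeps its context window fresh.
function buildSystemPrompt(basePrompt: string, ctx: HandoffContext): string {
  return [
    basePrompt,
    "",
    `Resuming task ${ctx.taskId} from checkpoint ${ctx.checkpointId}.`,
    "",
    ctx.continuationPrompt,
  ].join("\n");
}
```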

Phase 4: Recovery and Testing (Week 7-8)

Step 4.1: Recovery Implementation
├── Implement failure detection in Orchestrator
├── Add checkpoint chain traversal for recovery point
├── Create recovery decision logic:
│   ├── Analyze failure type
│   ├── Determine recoverability
│   └── Select recovery strategy
├── Implement agent respawn from checkpoint
└── Add circuit breaker for repeated failures
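
The circuit breaker in Step 4.1 could be as simple as a sliding-window failure counter per task. The class name, the window-based policy, and the explicit `nowMs` parameter (passed in to keep the logic testable) are assumptions of this sketch.

```typescript
class RecoveryCircuitBreaker {
  private readonly failureTimes = new Map<string, number[]>();
  private readonly maxFailures: number;
  private readonly windowMs: number;

  constructor(maxFailures: number, windowMs: number) {
    this.maxFailures = maxFailures;
    this.windowMs = windowMs;
  }

  // Records a failed recovery attempt and prunes entries older than the window.
  recordFailure(taskId: string, nowMs: number): void {
    const recent = this.recentFailures(taskId, nowMs);
    recent.push(nowMs);
    this.failureTimes.set(taskId, recent);
  }

  // True while the task has fewer than maxFailures failures inside the window.
  // When this returns false, the orchestrator should mark the task blocked
  // and alert a human instead of respawning another agent.
  allowRecovery(taskId: string, nowMs: number): boolean {
    return this.recentFailures(taskId, nowMs).length < this.maxFailures;
  }

  private recentFailures(taskId: string, nowMs: number): number[] {
    return (this.failureTimes.get(taskId) ?? []).filter((t) => nowMs - t < this.windowMs);
  }
}
```

Tracking failures per task means one persistently failing task cannot block recovery for unrelated tasks.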

Step 4.2: Testing
├── Unit tests:
│   ├── Schema validation
│   ├── Storage operations
│   ├── Hash verification
│   └── Compliance enforcement
├── Integration tests:
│   ├── Full handoff cycle
│   ├── Recovery scenarios
│   ├── Concurrent checkpoint writes
│   └── FDB failure simulation
└── Load tests:
    ├── 1000 concurrent checkpoints
    ├── Sustained write throughput
    └── Recovery under load

Step 4.3: Documentation
├── API documentation
├── Handoff protocol specification
├── Recovery runbook
└── Compliance evidence guide

5. API Specification

5.1 Checkpoint Service Interface

interface CheckpointService {
  // Write operations (FR-002)
  createCheckpoint(params: CreateCheckpointParams): Promise<Checkpoint>;
  updateCheckpoint(id: CheckpointId, update: PartialCheckpoint): Promise<Checkpoint>;
  finalizeCheckpoint(id: CheckpointId, status: FinalStatus): Promise<void>;
  linkCheckpoints(parentId: CheckpointId, childId: CheckpointId): Promise<void>;
  
  // Read operations (FR-003)
  getCheckpoint(id: CheckpointId): Promise<Checkpoint | null>;
  getLatestCheckpoint(taskId: TaskId): Promise<Checkpoint | null>;
  getCheckpointHistory(taskId: TaskId, options?: HistoryOptions): Promise<Checkpoint[]>;
  getCheckpointsByAgent(agentId: AgentId, timeRange: TimeRange): Promise<Checkpoint[]>;
  searchCheckpoints(query: CheckpointQuery): Promise<SearchResult<Checkpoint>>;
  
  // Handoff operations (FR-004)
  initiateHandoff(checkpointId: CheckpointId): Promise<HandoffResult>;
  acknowledgeHandoff(checkpointId: CheckpointId, newAgentId: AgentId): Promise<void>;
  
  // Recovery operations (FR-005)
  findRecoveryPoint(taskId: TaskId): Promise<Checkpoint | null>;
  executeRecovery(checkpointId: CheckpointId): Promise<RecoveryResult>;
}

5.2 Event Definitions

// Events emitted by Checkpoint Service
type CheckpointEvents = 
  | { type: 'CHECKPOINT_CREATED'; payload: { checkpointId: string; taskId: string; agentId: string } }
  | { type: 'CHECKPOINT_UPDATED'; payload: { checkpointId: string; fields: string[] } }
  | { type: 'CHECKPOINT_FINALIZED'; payload: { checkpointId: string; status: string } }
  | { type: 'AGENT_HANDOFF'; payload: { fromCheckpoint: string; toAgent: string } }
  | { type: 'RECOVERY_INITIATED'; payload: { checkpointId: string; reason: string } }
  | { type: 'RECOVERY_COMPLETED'; payload: { checkpointId: string; newAgentId: string } }
  | { type: 'CHECKPOINT_CORRUPTED'; payload: { checkpointId: string; error: string } };

6. Dependencies

Dependency	Type	Status
FoundationDB cluster	Infrastructure	✅ Available
Event bus (for CHECKPOINT_* events)	Platform	✅ Available
Agent Orchestrator	Platform	⚠️ Requires update
Cryptographic signing service	Security	⚠️ May need implementation
Observability stack (metrics, logs, traces)	Platform	✅ Available

7. Risks and Mitigations

Risk	Impact	Likelihood	Mitigation
FoundationDB latency spikes	Agent delays	Medium	Local write queue, async finalization
Checkpoint size exceeds FoundationDB limits (100 KB per value, 10 MB per transaction)	Write failures	Low	Compression, summary truncation
Hash collision	Data integrity	Very Low	SHA-256 is collision-resistant
Concurrent modification	Data loss	Medium	Optimistic locking, conflict detection
Recovery loop	Infinite retries	Medium	Circuit breaker, max retry limits

8. Acceptance Criteria

Checkpoints persist across agent restarts
Handoff completes in < 500ms total latency
Recovery succeeds from last valid checkpoint
Audit trail passes compliance review
No checkpoint data loss under simulated failures
Performance targets met under load test
Documentation complete and reviewed

Document Version: 1.0 | Last Updated: January 24, 2026
