ADR-108: Agent Checkpoint and Handoff Protocol

Status

Proposed

Context

Problem Statement

Current agent handoff relies on implicit state (context window contents, git history). The Ralph Wiggum community has empirically validated that explicit, structured checkpoints are essential for:

Reliable recovery from failures
Clean context window transitions
Compliance audit trails
Cost attribution per task segment

Key Insight from Ralph Wiggum Analysis

"Single-context loops degrade; fresh-context iterations maintain quality."

The Ralph Wiggum technique demonstrates that autonomous agent loops work best when:

Each iteration starts with a fresh context window
State persists externally (not in the context)
Handoff protocols are explicit and structured

Current State

Agents rely on context window for state (degrades over time)
No standardized handoff format between agent iterations
Recovery from failures requires manual intervention
No compliance audit trail for state transitions
Cost attribution per task segment is impossible

CODITECT Advantage

CODITECT's database-backed, event-driven architecture gives checkpoints transactional writes and queryable history, which the git-file-based state management used by Ralph implementations does not provide. This ADR formalizes a checkpoint protocol that leverages this advantage.

Database Architecture Note (ADR-002, ADR-089):

Local: SQLite (context.db) for offline operation

Cloud: PostgreSQL with Row-Level Security (RLS) for multi-tenant isolation

This aligns with ADR-002's decision to use PostgreSQL for cloud, rejecting FoundationDB due to operational complexity and lack of managed services.

Decision

Implement a standardized checkpoint protocol that enables agents to persist state to the appropriate database layer (SQLite locally, PostgreSQL in cloud), allowing fresh-context iterations while maintaining full task continuity and compliance evidence.

1. Checkpoint Schema

checkpoint_schema:
  version: "1.0"

  metadata:
    checkpoint_id: string       # UUID v7 (time-ordered)
    task_id: string             # Parent task reference
    agent_id: string            # Executing agent identifier
    agent_type: enum            # [architecture, implementation, qa, documentation]
    iteration: integer          # Loop iteration number
    timestamp: datetime         # ISO 8601 with timezone

  execution_state:
    phase: enum                 # [planning, implementing, testing, reviewing, handoff, complete]
    completed_items: array      # List of completed work items
    pending_items: array        # Remaining work items
    blocked_items: array        # Items with blockers
    current_focus: string       # Active work item ID

  context_summary:
    key_decisions: array        # Architectural decisions made
    assumptions: array          # Assumptions in effect
    constraints: array          # Active constraints
    external_dependencies: array # Third-party dependencies

  metrics:
    tokens_consumed: integer    # Total tokens this iteration
    tools_invoked: integer      # Tool call count
    files_modified: array       # List of changed files
    tests_status:
      passed: integer
      failed: integer
      skipped: integer
      coverage_percent: float

  recovery:
    last_successful_state: string  # Reference to prior checkpoint
    rollback_instructions: string  # How to undo current iteration
    continuation_prompt: string    # Prompt to resume work

  compliance:
    event_log_ref: string       # Event log stream reference (PostgreSQL in cloud, per ADR-002)
    hash: string                # SHA-256 of checkpoint content
    signature: string           # Optional cryptographic signature
    retention_policy: string    # Compliance retention requirement
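
The schema says `compliance.hash` is a SHA-256 of the checkpoint content but does not say how the content is serialized before hashing. The sketch below is one possible approach, assuming a canonical JSON form (sorted keys, fixed separators) and assuming that `hash` and `signature` are excluded from the digest, since they cannot cover themselves. The function names `checkpoint_hash` and `verify_checkpoint` are hypothetical and not part of this ADR.

```python
import hashlib
import json


def checkpoint_hash(checkpoint: dict) -> str:
    """Return the SHA-256 hex digest of a checkpoint, ignoring integrity fields."""
    # Deep copy via a JSON round-trip so the caller's dict is never mutated.
    body = json.loads(json.dumps(checkpoint))
    compliance = body.get("compliance", {})
    # Assumption: hash and signature are excluded from the hashed content.
    compliance.pop("hash", None)
    compliance.pop("signature", None)
    # Sorted keys and fixed separators give a byte-stable serialization.
    canonical = json.dumps(
        body, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_checkpoint(checkpoint: dict) -> bool:
    """True when the stored compliance.hash matches the recomputed digest."""
    stored = checkpoint.get("compliance", {}).get("hash")
    return stored is not None and stored == checkpoint_hash(checkpoint)
```

A writer would set `checkpoint["compliance"]["hash"] = checkpoint_hash(checkpoint)` just before persisting, and a reader would call `verify_checkpoint` before trusting the state.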

2. Database Schema

Local SQLite (context.db):

CREATE TABLE checkpoints (
    id TEXT PRIMARY KEY,              -- UUID v7
    task_id TEXT NOT NULL,
    agent_id TEXT NOT NULL,
    agent_type TEXT NOT NULL,
    iteration INTEGER DEFAULT 1,
    phase TEXT NOT NULL,              -- planning, implementing, testing, reviewing, handoff, complete
    checkpoint_data JSON NOT NULL,    -- Full checkpoint schema
    hash TEXT NOT NULL,               -- SHA-256 integrity
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (task_id) REFERENCES tasks(id)
);

CREATE INDEX idx_checkpoints_task ON checkpoints(task_id, created_at DESC);
CREATE INDEX idx_checkpoints_agent ON checkpoints(agent_id, created_at DESC);

Cloud PostgreSQL (with RLS):

CREATE TABLE checkpoints (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),  -- default yields UUID v4; writers should supply a UUID v7
    organization_id UUID NOT NULL REFERENCES organizations(id),
    task_id UUID NOT NULL REFERENCES tasks(id),
    agent_id TEXT NOT NULL,
    agent_type TEXT NOT NULL,
    iteration INTEGER DEFAULT 1,
    phase TEXT NOT NULL,
    checkpoint_data JSONB NOT NULL,   -- PostgreSQL JSONB for efficient queries
    hash TEXT NOT NULL,
    signature TEXT,                   -- Optional cryptographic signature
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Row-Level Security for multi-tenant isolation
ALTER TABLE checkpoints ENABLE ROW LEVEL SECURITY;

CREATE POLICY org_isolation_checkpoints ON checkpoints
    FOR ALL USING (
        organization_id IN (
            SELECT organization_id FROM organization_members
            WHERE user_id = current_setting('app.current_user_id')::UUID
        )
    );

CREATE INDEX idx_checkpoints_task ON checkpoints(task_id, created_at DESC);
CREATE INDEX idx_checkpoints_org ON checkpoints(organization_id, created_at DESC);
CREATE INDEX idx_checkpoints_data ON checkpoints USING gin(checkpoint_data);

3. Checkpoint Operations

Write Operations:

create_checkpoint(task_id, agent_id, state) → checkpoint_id
update_checkpoint(checkpoint_id, partial_state) → checkpoint_id
finalize_checkpoint(checkpoint_id, status) → void
link_checkpoints(parent_id, child_id) → void

Read Operations:

get_checkpoint(checkpoint_id) → Checkpoint
get_latest_checkpoint(task_id) → Checkpoint
get_checkpoint_history(task_id, limit) → [Checkpoint]
get_checkpoints_by_agent(agent_id, time_range) → [Checkpoint]
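
As an illustration of how a few of these operations might map onto the local SQLite layer, here is a minimal sketch using Python's standard `sqlite3` module. It is not the planned implementation: the table is a simplified subset of the schema in section 2, `CheckpointStore` and the returned dict layout are hypothetical, and `uuid.uuid4()` stands in for the UUID v7 identifiers this ADR specifies.

```python
import json
import sqlite3
import uuid
from typing import List, Optional

# Simplified subset of the SQLite table in section 2 (assumption: agent_type,
# phase, hash and the tasks foreign key are omitted to keep the sketch short).
SCHEMA = """
CREATE TABLE IF NOT EXISTS checkpoints (
    id TEXT PRIMARY KEY,
    task_id TEXT NOT NULL,
    agent_id TEXT NOT NULL,
    iteration INTEGER NOT NULL,
    checkpoint_data TEXT NOT NULL
)
"""


class CheckpointStore:
    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.conn.execute(SCHEMA)

    def create_checkpoint(self, task_id: str, agent_id: str, state: dict) -> str:
        checkpoint_id = str(uuid.uuid4())  # stand-in; the ADR specifies UUID v7
        with self.conn:  # commits on success, rolls back on exception
            (last,) = self.conn.execute(
                "SELECT COALESCE(MAX(iteration), 0) FROM checkpoints WHERE task_id = ?",
                (task_id,),
            ).fetchone()
            self.conn.execute(
                "INSERT INTO checkpoints VALUES (?, ?, ?, ?, ?)",
                (checkpoint_id, task_id, agent_id, last + 1, json.dumps(state)),
            )
        return checkpoint_id

    def get_checkpoint_history(self, task_id: str, limit: int) -> List[dict]:
        rows = self.conn.execute(
            "SELECT id, agent_id, iteration, checkpoint_data FROM checkpoints "
            "WHERE task_id = ? ORDER BY iteration DESC LIMIT ?",
            (task_id, limit),
        ).fetchall()
        return [
            {"id": r[0], "agent_id": r[1], "iteration": r[2], "state": json.loads(r[3])}
            for r in rows
        ]

    def get_latest_checkpoint(self, task_id: str) -> Optional[dict]:
        history = self.get_checkpoint_history(task_id, 1)
        return history[0] if history else None
```

Calling `CheckpointStore(sqlite3.connect(":memory:"))` gives a throwaway store for experimentation. The sketch orders by `iteration` rather than `created_at` so that two checkpoints written within the same second still sort deterministically.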

4. Handoff Protocol

PRE-HANDOFF (Current Agent):
1. Detect handoff trigger (context > 70%, phase complete, error threshold, explicit request, or token budget > 80%; see Handoff Triggers)
2. Generate continuation_prompt summarizing:
   - What was accomplished
   - What remains
   - Current blockers
   - Recommended next steps
3. Write final checkpoint with phase="handoff"
4. Emit AGENT_HANDOFF event to orchestrator

HANDOFF (Orchestrator):
5. Receive AGENT_HANDOFF event
6. Validate checkpoint integrity (hash verification)
7. Select next agent (same type for continuation, different for phase change)
8. Inject checkpoint context into new agent's system prompt
9. Spawn new agent with fresh context window

POST-HANDOFF (New Agent):
10. Read latest checkpoint
11. Acknowledge checkpoint receipt
12. Resume from continuation_prompt
13. Create new checkpoint linking to parent
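
The orchestrator's part of this protocol (steps 6 and 8) can be sketched roughly as follows. This is an illustration under stated assumptions, not the orchestrator's actual interface: the record layout (`checkpoint_id`, `hash`, `payload`), the canonical JSON hashing, and the names `content_hash`, `build_handoff_prompt`, and `CorruptCheckpointError` are all hypothetical.

```python
import hashlib
import json


class CorruptCheckpointError(Exception):
    """Raised when a checkpoint fails hash verification."""


def content_hash(payload: dict) -> str:
    # Assumption: the hash covers a canonical JSON form of the payload.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_handoff_prompt(record: dict) -> str:
    """Verify the checkpoint (step 6), then render context for the new agent (step 8)."""
    payload = record["payload"]
    if content_hash(payload) != record["hash"]:
        # Never continue with corrupt state; let the recovery protocol take over.
        raise CorruptCheckpointError(record["checkpoint_id"])
    state = payload["execution_state"]
    lines = [
        f"Resuming task {payload['task_id']} from checkpoint {record['checkpoint_id']}.",
        "Completed: " + (", ".join(state["completed_items"]) or "none"),
        "Pending: " + (", ".join(state["pending_items"]) or "none"),
        "Blocked: " + (", ".join(state["blocked_items"]) or "none"),
        payload["recovery"]["continuation_prompt"],
    ]
    return "\n".join(lines)
```

The returned string would be injected into the new agent's system prompt before it is spawned with a fresh context window.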

5. Recovery Protocol

ON AGENT FAILURE:
1. Orchestrator detects agent termination (timeout, error, crash)
2. Retrieve last successful checkpoint
3. Analyze failure:
   - If recoverable: spawn new agent from checkpoint
   - If unrecoverable: mark task blocked, alert human
4. Log recovery attempt to compliance trail

ON CHECKPOINT CORRUPTION:
1. Detect via hash mismatch
2. Attempt recovery from last_successful_state reference
3. If chain broken: escalate to human intervention
4. Never silently continue with corrupt state
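
The corruption path can be read as a walk back along `last_successful_state` references until an intact checkpoint is found. The sketch below illustrates that reading with an in-memory dict standing in for the database. It assumes the parent reference is stored alongside the hashed payload rather than inside it, so it remains readable when the payload is corrupt; `find_recoverable`, `is_intact`, and `BrokenChainError` are hypothetical names.

```python
import hashlib
import json


class BrokenChainError(Exception):
    """No intact checkpoint could be reached; a human must intervene."""


def is_intact(record: dict) -> bool:
    # Assumption: the hash covers a canonical JSON form of the payload.
    canonical = json.dumps(record["payload"], sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest() == record["hash"]


def find_recoverable(store: dict, start_id: str) -> dict:
    """Follow last_successful_state links until a checkpoint passes verification."""
    visited = set()
    current_id = start_id
    # The visited set guards against cycles in a damaged chain.
    while current_id is not None and current_id not in visited:
        visited.add(current_id)
        record = store.get(current_id)
        if record is None:
            break  # dangling reference: the chain is broken
        if is_intact(record):
            return record
        current_id = record.get("last_successful_state")
    # Escalate instead of silently continuing with corrupt state.
    raise BrokenChainError(f"no intact checkpoint reachable from {start_id}")
```

An orchestrator would catch `BrokenChainError`, mark the task blocked, and alert a human, matching step 3 above.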

6. Handoff Triggers

Trigger	Threshold	Action
Context utilization	> 70%	Initiate handoff
Task phase complete	Phase boundary	Handoff to next phase agent
Error count	> 3 consecutive	Handoff with error context
Explicit request	Agent requests	Immediate handoff
Token budget	> 80% of task budget	Handoff with budget warning
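
The trigger table translates fairly directly into a predicate. The following is a minimal sketch using the thresholds above. The `AgentStatus` fields, the returned trigger names, and the order in which triggers are checked are assumptions, since the table does not define precedence when several triggers fire at once.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentStatus:
    context_utilization: float  # fraction of the context window in use, 0.0 to 1.0
    phase_complete: bool
    consecutive_errors: int
    handoff_requested: bool
    tokens_used: int
    token_budget: int


def handoff_trigger(status: AgentStatus) -> Optional[str]:
    """Return the name of the first trigger that fires, or None to keep working."""
    if status.handoff_requested:
        return "explicit_request"
    if status.consecutive_errors > 3:
        return "error_count"
    if status.phase_complete:
        return "phase_complete"
    # Guard against a zero budget before dividing.
    if status.token_budget > 0 and status.tokens_used / status.token_budget > 0.80:
        return "token_budget"
    if status.context_utilization > 0.70:
        return "context_utilization"
    return None
```

An agent loop could call `handoff_trigger` after each tool invocation and start the pre-handoff steps as soon as it returns a value other than `None`.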

Consequences

Positive

Reliable Recovery - Agents can resume from any checkpoint after failures
Fresh Context Quality - Each iteration starts clean, avoiding degradation
Compliance Ready - Audit trail supporting FDA 21 CFR Part 11, HIPAA, and SOC2 requirements
Cost Attribution - Token consumption tracked per checkpoint/iteration
Superior to Ralph - Database ACID transactions versus git-file state with no transactional guarantees
Multi-Tenant Ready - PostgreSQL RLS ensures complete tenant isolation

Negative

Latency Overhead - Checkpoint writes add ~100ms per handoff
Storage Growth - Checkpoint history accumulates (mitigated by retention policies)
Complexity - More moving parts than simple context-based state

Risks

Risk	Likelihood	Impact	Mitigation
Database latency spikes	Medium	Agent delays	Local SQLite buffer, async cloud sync
Checkpoint size exceeds limits	Low	Write failures	Compression, summary truncation
Concurrent modification	Medium	Data loss	Optimistic locking, conflict detection
Recovery loop	Medium	Infinite retries	Circuit breaker, max retry limits
Sync conflicts (local/cloud)	Medium	Data inconsistency	Last-write-wins with conflict log
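
For the recovery-loop risk, the circuit breaker with a retry limit could look something like the sketch below. It is illustrative only: the class name `RecoveryCircuitBreaker`, its methods, and the default of three attempts are assumptions, as this ADR does not fix a retry limit.

```python
from typing import Dict


class RecoveryCircuitBreaker:
    """Stops automatic recovery for a task after too many consecutive failures."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts  # assumed default, not specified by the ADR
        self._failures: Dict[str, int] = {}

    def allow(self, task_id: str) -> bool:
        """True while the task is still below its retry limit."""
        return self._failures.get(task_id, 0) < self.max_attempts

    def record_failure(self, task_id: str) -> None:
        self._failures[task_id] = self._failures.get(task_id, 0) + 1

    def record_success(self, task_id: str) -> None:
        # A successful recovery resets the counter for that task.
        self._failures.pop(task_id, None)
```

The orchestrator would check `allow(task_id)` before spawning a replacement agent and, once it returns `False`, mark the task blocked and alert a human instead of retrying.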

Performance Requirements

Metric	Target
Checkpoint write latency	< 100ms p99
Checkpoint read latency	< 10ms p99
Recovery success rate	> 99.9%
Audit trail completeness	100%
Context window at handoff	< 70% utilized

Compliance Requirements

Standard	Requirement	Implementation
FDA 21 CFR Part 11	Electronic signatures	Cryptographic signing of checkpoints
HIPAA	Encryption	At-rest and in-transit encryption
SOC2	Audit trail	Complete event log in PostgreSQL with RLS
Data Retention	Configurable	Per-task retention policy field

Implementation Phases

Phase 1: Schema and Storage (2 weeks)

Define TypeScript/Rust types for Checkpoint
Implement SQLite repository (local) and PostgreSQL repository (cloud)
Add transaction handling and optimistic locking
Implement local-to-cloud sync (cursor-based polling per ADR-053)

Phase 2: Checkpoint Operations (2 weeks)

Implement write operations with hash generation
Implement read operations with integrity verification
Add compliance layer (signing, audit events)

Phase 3: Handoff Protocol (2 weeks)

Implement handoff trigger detection
Create continuation prompt generation
Integrate with orchestrator

Phase 4: Recovery and Testing (2 weeks)

Implement failure detection and recovery
Add circuit breaker for repeated failures
Comprehensive testing (unit, integration, chaos)

Related ADRs

ADR-001: Event-Driven Architecture (checkpoint events)
ADR-002: PostgreSQL as Primary Database (cloud storage, RLS for multi-tenant)
ADR-053: Cloud Context Sync Architecture (local-to-cloud sync protocol)
ADR-089: Two-Database Architecture (local SQLite + cloud PostgreSQL)
ADR-109: QA Agent Browser Automation (depends on checkpoint for test results)
ADR-110: Agent Health Monitoring (uses checkpoint timestamps)
ADR-111: Token Economics (checkpoint includes token metrics)
ADR-112: Ralph Wiggum Database Architecture (consolidates DB decisions)

Glossary

Term	Definition
ADR	Architecture Decision Record - document capturing significant architectural decisions with context and consequences
ACID	Atomicity, Consistency, Isolation, Durability - database transaction properties ensuring data integrity
Checkpoint	Persistent snapshot of agent execution state enabling recovery and handoff
Handoff	Transfer of execution context from one agent instance to another (typically with fresh context window)
RLS	Row-Level Security - PostgreSQL feature that automatically filters query results based on tenant context
Ralph Wiggum	Autonomous agent loop technique named after The Simpsons character, enabling persistent iteration until task completion
SHA-256	Secure Hash Algorithm producing 256-bit hash for data integrity verification
UUID v7	Universally Unique Identifier version 7 - time-ordered UUIDs for sortable, unique keys
FDA 21 CFR Part 11	US FDA regulation for electronic records and signatures in regulated industries
HIPAA	Health Insurance Portability and Accountability Act - US healthcare data privacy regulation
SOC2	Service Organization Control 2 - security compliance framework for service providers

References

ADR-108 | Created: 2026-01-24 | Status: Proposed
