ADR-XXX: Unified Persistent Workspace Architecture v2.0

Status: Accepted
Date: 2026-01-31
Author: CODITECT Architecture Team
Stakeholders: Platform Engineering, DevOps, Security, Executive


1. Context

1.1 Problem Statement

The initial CODITECT Development Studio architecture (v1.0) used ephemeral sandboxes with the following characteristics:

  • 4 separate containers (one per LLM: Claude, Gemini, Codex, Kimi)
  • 30-minute timeout on each sandbox
  • External routing of tasks to appropriate sandboxes
  • R2 storage with periodic snapshots
  • Cold start penalty of 5-10 seconds per sandbox

This design created significant friction:

  1. Poor multi-agent collaboration - Agents couldn't easily share context or work on the same files simultaneously
  2. Frequent timeouts - Long-running tasks were interrupted
  3. Cold start latency - Users experienced delays when starting sessions
  4. Complex coordination - External routing logic was brittle
  5. Data loss risk - Unsaved work lost on timeout

1.2 Decision Drivers

| Driver                    | Weight   | Rationale                          |
|---------------------------|----------|------------------------------------|
| Multi-agent collaboration | Critical | Core value proposition of CODITECT |
| Session persistence       | Critical | User experience, data safety       |
| Cold start elimination    | High     | Competitive requirement            |
| Cost efficiency           | Medium   | Sustainable unit economics         |
| Implementation complexity | Medium   | Time to market                     |

2. Decision

2.1 Selected Option: Unified Persistent Workspace

We will adopt a unified persistent workspace architecture with the following characteristics:

┌─────────────────────────────────────────────────┐
│                UNIFIED WORKSPACE                │
│   ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐   │
│   │ Claude │ │ Gemini │ │  Kimi  │ │ Codex  │   │
│   │ Agent  │ │ Agent  │ │ Agent  │ │ Agent  │   │
│   └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘   │
│       └──────────┴─────┬────┴──────────┘        │
│                        ▼                        │
│       ┌─────────────────────────────────┐       │
│       │     Shared Context Manager      │       │
│       │  • Task queue                   │       │
│       │  • File locks                   │       │
│       │  • Message bus                  │       │
│       └────────────────┬────────────────┘       │
│                        ▼                        │
│       ┌─────────────────────────────────┐       │
│       │     SQLite Database Cluster     │       │
│       │     (6 databases, WAL mode)     │       │
│       └────────────────┬────────────────┘       │
│                        ▼                        │
│       ┌─────────────────────────────────┐       │
│       │   GCS FUSE Mount (/projects)    │       │
│       └─────────────────────────────────┘       │
└─────────────────────────────────────────────────┘
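The ADR does not specify the lock schema, but the Shared Context Manager's file locks could be backed by one of the SQLite databases. The sketch below is an illustrative assumption, not the actual CODITECT implementation: a hypothetical `file_locks` table with TTL-based leases, where `INSERT` on a primary key makes acquisition atomic.

```python
import sqlite3
import time

def open_lock_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # concurrent readers + one writer
    conn.execute("""
        CREATE TABLE IF NOT EXISTS file_locks (
            path        TEXT PRIMARY KEY,
            agent       TEXT NOT NULL,
            acquired_at REAL NOT NULL
        )
    """)
    return conn

def try_acquire(conn: sqlite3.Connection, path: str, agent: str,
                ttl: float = 60.0) -> bool:
    """Atomically claim a file; stale locks older than ttl are evicted first."""
    now = time.time()
    with conn:  # single transaction: evict stale lock, then insert
        conn.execute(
            "DELETE FROM file_locks WHERE path = ? AND acquired_at < ?",
            (path, now - ttl),
        )
        try:
            conn.execute(
                "INSERT INTO file_locks (path, agent, acquired_at) VALUES (?, ?, ?)",
                (path, agent, now),
            )
            return True
        except sqlite3.IntegrityError:  # another agent holds the lock
            return False

def release(conn: sqlite3.Connection, path: str, agent: str) -> None:
    with conn:
        conn.execute(
            "DELETE FROM file_locks WHERE path = ? AND agent = ?",
            (path, agent),
        )

conn = open_lock_db(":memory:")
print(try_acquire(conn, "src/app.py", "claude"))  # True
print(try_acquire(conn, "src/app.py", "gemini"))  # False: already held
release(conn, "src/app.py", "claude")
print(try_acquire(conn, "src/app.py", "gemini"))  # True
```

Keeping eviction and insertion in one transaction avoids a race where two agents both see a stale lock and both claim it.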

2.2 Key Characteristics

| Aspect             | v1.0                  | v2.0 (This ADR)            |
|--------------------|-----------------------|----------------------------|
| Containers         | 4 ephemeral sandboxes | 1 persistent workspace     |
| Lifetime           | 30 minutes            | 8+ hours, renewable        |
| Storage            | R2 snapshots          | GCS FUSE real-time sync    |
| Databases          | Durable Objects only  | SQLite cluster + JSONL     |
| Agent Coordination | External router       | In-workspace orchestrator  |
| Cold Start         | 5-10 seconds          | 0 seconds                  |

3. Consequences

3.1 Positive Consequences

  1. Seamless Multi-Agent Collaboration

    • All 4 LLMs share the same filesystem
    • Agents can see each other's activity in real-time
    • File locks prevent conflicts
    • Shared context reduces token waste
  2. Zero Cold Start

    • Workspaces are always warm
    • Users reconnect instantly
    • Session survives browser close/reopen
  3. Strong Data Durability

    • SQLite WAL mode provides ACID guarantees
    • GCS FUSE syncs files to durable object storage in near real time
    • JSONL provides an immutable audit trail
    • Git integration for code versioning
  4. Simplified Architecture

    • Removed external routing complexity
    • Single deployment target
    • Easier debugging and monitoring
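The JSONL audit trail mentioned above is append-only: each event is one JSON object per line, and existing lines are never rewritten. A minimal sketch, with illustrative file names and event fields (not the actual CODITECT schema):

```python
import json
import os
import tempfile
import time

def append_event(log_path: str, agent: str, action: str, detail: str) -> None:
    """Append one event as a single JSON line; prior lines are never touched."""
    event = {"ts": time.time(), "agent": agent, "action": action, "detail": detail}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

def read_events(log_path: str) -> list:
    """Replay the full audit history in order."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

log = os.path.join(tempfile.mkdtemp(), "audit.jsonl")
append_event(log, "claude", "edit", "src/app.py")
append_event(log, "gemini", "review", "src/app.py")
print(len(read_events(log)))  # 2
```

Because writes are pure appends, a partially written last line is the only possible corruption after a crash, and the `line.strip()` guard plus per-line parsing keeps the rest of the history readable.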

3.2 Negative Consequences

  1. Higher Infrastructure Cost (+55%)

    • v1.0: $4.20/user/month @ 1K users
    • v2.0: $6.50/user/month @ 1K users
    • Persistent containers run 24/7
    • GCS operation costs are higher than R2's
  2. Increased Implementation Complexity

    • GCS FUSE requires careful tuning
    • SQLite clustering is non-trivial
    • File lock management needed
    • New failure modes (disk full, DB corruption)
  3. Resource Contention Risk

    • 4 agents share 2 vCPU / 4GB RAM
    • Noisy neighbor problem within workspace
    • Requires careful resource limits
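The +55% figure follows directly from the per-user costs above:

```python
v1_cost = 4.20  # $/user/month at 1K users (v1.0)
v2_cost = 6.50  # $/user/month at 1K users (v2.0)

increase = (v2_cost - v1_cost) / v1_cost  # (6.50 - 4.20) / 4.20 ≈ 0.548
print(f"+{increase:.0%}")  # +55%
```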

3.3 Mitigations

| Risk                | Mitigation                                  |
|---------------------|---------------------------------------------|
| Cost overrun        | Auto-sleep after 8h; aggressive downscaling |
| Resource contention | Per-agent CPU/memory limits; cgroups        |
| Data corruption     | Automated SQLite backups; checksums         |
| GCS latency         | R2 hot mirror; local caching layer          |
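The "automated SQLite backups; checksums" mitigation can be sketched with the standard library's online backup API, which is safe to run while the source database is in use (including under WAL mode). Paths and table names below are illustrative assumptions:

```python
import hashlib
import os
import sqlite3
import tempfile

def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def backup_with_checksum(src: sqlite3.Connection, dst_path: str) -> str:
    """Online backup of a live database, then checksum the copy."""
    dst = sqlite3.connect(dst_path)
    with dst:
        src.backup(dst)  # consistent snapshot even while src is in use
    dst.close()
    return sha256_of(dst_path)  # store next to the backup for later verification

# Demo: back up a small in-memory database and verify the copy.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT)")
src.execute("INSERT INTO tasks (title) VALUES ('wire up GCS FUSE')")
src.commit()

backup_path = os.path.join(tempfile.mkdtemp(), "tasks.db.bak")
digest = backup_with_checksum(src, backup_path)
assert sha256_of(backup_path) == digest  # checksum detects later corruption
rows = sqlite3.connect(backup_path).execute("SELECT title FROM tasks").fetchall()
print(rows)  # [('wire up GCS FUSE',)]
```

Recomputing the stored digest before a restore is what catches silent corruption of the backup file itself.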

4. Alternatives Considered

4.1 Alternative 1: Keep Ephemeral, Improve Coordination

Approach: Maintain 4 sandboxes but add better state sharing via Redis/Durable Objects

Rejected because:

  • Still has cold start problem
  • Network overhead for file sync
  • Complex consistency model
  • Doesn't solve timeout interruption

4.2 Alternative 2: Hybrid (Persistent + Ephemeral)

Approach: 1 persistent "home" sandbox + 3 ephemeral "worker" sandboxes

Rejected because:

  • Added complexity of two models
  • Workers still have cold start
  • File sync between home/workers is tricky
  • Doesn't simplify architecture

4.3 Alternative 3: Serverless Functions

Approach: Use Cloudflare Workers for everything, no containers

Rejected because:

  • Workers have 30s CPU limit
  • No filesystem for tools
  • Can't run long-lived agents
  • SQLite not available

5. Implementation

5.1 Phases

| Phase                 | Duration | Deliverable                                 |
|-----------------------|----------|---------------------------------------------|
| Phase 1: Foundation   | 4 weeks  | GCS FUSE, SQLite cluster, basic workspace   |
| Phase 2: Agents       | 3 weeks  | In-workspace orchestrator, 4 agent adapters |
| Phase 3: Frontend     | 2 weeks  | Multi-agent UI, agent activity panel        |
| Phase 4: Migration    | 2 weeks  | v1.0 → v2.0 data migration, cutover         |
| Phase 5: Optimization | 3 weeks  | Performance tuning, cost optimization       |

Total: 14 weeks

5.2 Success Metrics

| Metric                      | Target     | Measurement                  |
|-----------------------------|------------|------------------------------|
| Cold start time             | < 1 second | Time from click to ready     |
| Session availability        | 99.9%      | Uptime excluding maintenance |
| Agent collaboration latency | < 100 ms   | File lock acquisition time   |
| Data durability             | 99.999%    | Checksums + GCS durability   |
| Cost per user @ 1K          | < $7.00    | Monthly infrastructure cost  |


Decision Record: Approved by Architecture Review Board 2026-01-31
Next Review: 2026-04-01 (post-implementation)