ADR-XXX: Unified Persistent Workspace Architecture v2.0
Status: Accepted
Date: 2026-01-31
Author: CODITECT Architecture Team
Stakeholders: Platform Engineering, DevOps, Security, Executive
1. Context
1.1 Problem Statement
The initial CODITECT Development Studio architecture (v1.0) used ephemeral sandboxes with the following characteristics:
- 4 separate containers (one per LLM: Claude, Gemini, Codex, Kimi)
- 30-minute timeout on each sandbox
- External routing of tasks to appropriate sandboxes
- R2 storage with periodic snapshots
- Cold start penalty of 5-10 seconds per sandbox
This design created significant friction:
- Poor multi-agent collaboration - Agents couldn't easily share context or work on the same files simultaneously
- Frequent timeouts - Long-running tasks were interrupted
- Cold start latency - Users experienced delays when starting sessions
- Complex coordination - External routing logic was brittle
- Data loss risk - Work not yet captured in a snapshot was lost when a sandbox timed out
1.2 Decision Drivers
| Driver | Weight | Rationale |
|---|---|---|
| Multi-agent collaboration | Critical | Core value proposition of CODITECT |
| Session persistence | Critical | User experience, data safety |
| Cold start elimination | High | Competitive requirement |
| Cost efficiency | Medium | Sustainable unit economics |
| Implementation complexity | Medium | Time to market |
2. Decision
2.1 Selected Option: Unified Persistent Workspace
We will adopt a unified persistent workspace architecture with the following characteristics:
┌─────────────────────────────────────────────────────────────┐
│                      UNIFIED WORKSPACE                      │
│     ┌────────┐    ┌────────┐    ┌────────┐    ┌────────┐    │
│     │ Claude │    │ Gemini │    │  Kimi  │    │ Codex  │    │
│     │ Agent  │    │ Agent  │    │ Agent  │    │ Agent  │    │
│     └───┬────┘    └───┬────┘    └───┬────┘    └───┬────┘    │
│         └─────────────┴──────┬──────┴─────────────┘         │
│                              ▼                              │
│             ┌─────────────────────────────────┐             │
│             │     Shared Context Manager      │             │
│             │  • Task queue                   │             │
│             │  • File locks                   │             │
│             │  • Message bus                  │             │
│             └─────────────────────────────────┘             │
│                              ▼                              │
│             ┌─────────────────────────────────┐             │
│             │     SQLite Database Cluster     │             │
│             │     (6 databases, WAL mode)     │             │
│             └─────────────────────────────────┘             │
│                              ▼                              │
│             ┌─────────────────────────────────┐             │
│             │   GCS FUSE Mount (/projects)    │             │
│             └─────────────────────────────────┘             │
└─────────────────────────────────────────────────────────────┘
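The file-lock role of the Shared Context Manager could be sketched roughly as follows. This is a minimal in-process illustration, assuming advisory locks keyed by file path with a lease timeout. The names `FileLockTable`, `acquire`, and `release`, and the 30-second lease, are hypothetical and are not taken from the CODITECT design.

```python
import threading
import time
from typing import Dict, Optional, Tuple


class FileLockTable:
    """Advisory per-path locks with a lease, shared by agents in one workspace (hypothetical)."""

    def __init__(self, lease_seconds: float = 30.0) -> None:
        self._lease = lease_seconds
        self._guard = threading.Lock()
        # path -> (owning agent, lease expiry as a monotonic timestamp)
        self._owners: Dict[str, Tuple[str, float]] = {}

    def acquire(self, path: str, agent: str, now: Optional[float] = None) -> bool:
        """Return True if `agent` holds the lock on `path` after this call."""
        now = time.monotonic() if now is None else now
        with self._guard:
            holder = self._owners.get(path)
            if holder is not None and holder[0] != agent and holder[1] > now:
                return False  # another agent holds an unexpired lease
            self._owners[path] = (agent, now + self._lease)
            return True

    def release(self, path: str, agent: str) -> bool:
        """Release the lock if `agent` owns it; return whether anything was released."""
        with self._guard:
            holder = self._owners.get(path)
            if holder is None or holder[0] != agent:
                return False
            del self._owners[path]
            return True


# Explicit `now` values make the example deterministic.
locks = FileLockTable(lease_seconds=30.0)
claude_got_it = locks.acquire("/projects/app/main.py", "claude", now=0.0)
gemini_blocked = locks.acquire("/projects/app/main.py", "gemini", now=1.0)
gemini_after_expiry = locks.acquire("/projects/app/main.py", "gemini", now=31.0)
```

The lease is one possible way to keep a crashed agent from holding a file indefinitely; a production design would also need to persist lock state and notify waiting agents.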
2.2 Key Characteristics
| Aspect | v1.0 | v2.0 (This ADR) |
|---|---|---|
| Containers | 4 ephemeral sandboxes | 1 persistent workspace |
| Lifetime | 30 minutes | 8+ hours, renewable |
| Storage | R2 snapshots | GCS FUSE real-time sync |
| Databases | Durable Objects only | SQLite cluster + JSONL |
| Agent Coordination | External router | In-workspace orchestrator |
| Cold Start | 5-10 seconds | < 1 second (workspace stays warm) |
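The "8+ hours, renewable" lifetime in the table above can be read as a lease that activity extends. The sketch below illustrates that reading only; `WorkspaceLease` and its methods are hypothetical, and the actual renewal trigger (user input, agent activity, or an explicit API call) is an assumption not specified in this ADR.

```python
from dataclasses import dataclass

LEASE_SECONDS = 8 * 60 * 60  # 8-hour lifetime from the table above


@dataclass
class WorkspaceLease:
    """Tracks when a workspace becomes eligible for sleep (hypothetical helper)."""

    expires_at: float

    @classmethod
    def start(cls, now: float) -> "WorkspaceLease":
        return cls(expires_at=now + LEASE_SECONDS)

    def renew(self, now: float) -> None:
        """Extend the lease by a full period measured from `now`."""
        self.expires_at = now + LEASE_SECONDS

    def is_expired(self, now: float) -> bool:
        return now >= self.expires_at


HOUR = 3600.0
lease = WorkspaceLease.start(now=0.0)
expired_at_7h = lease.is_expired(7 * HOUR)    # still inside the first 8 hours
lease.renew(now=7 * HOUR)                     # activity at hour 7 pushes expiry to hour 15
expired_at_9h = lease.is_expired(9 * HOUR)
expired_at_16h = lease.is_expired(16 * HOUR)  # no further renewal, so the lease has lapsed
```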
3. Consequences
3.1 Positive Consequences
- Seamless Multi-Agent Collaboration
  - All 4 LLMs share the same filesystem
  - Agents can see each other's activity in real time
  - File locks prevent conflicting writes
  - Shared context reduces token waste
- Zero Cold Start
  - Workspaces are always warm
  - Users reconnect instantly
  - Session survives browser close/reopen
- Strong Data Durability
  - SQLite transactions are ACID; WAL mode lets readers proceed while a writer commits
  - GCS FUSE persists project files to Cloud Storage, so they outlive the container
  - Append-only JSONL provides an audit trail
  - Git integration for code versioning
- Simplified Architecture
  - Removed external routing complexity
  - Single deployment target
  - Easier debugging and monitoring
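The durability points above (SQLite in WAL mode plus a JSONL audit trail) might look roughly like this in practice. This is a sketch using Python's standard `sqlite3` module. The file names `tasks.db` and `audit.jsonl`, the `tasks` schema, and the `record_task` helper are hypothetical. The database is placed on local disk as an assumption, since SQLite's documentation states that WAL mode does not work over a network filesystem.

```python
import json
import os
import sqlite3
import tempfile
import time

workdir = tempfile.mkdtemp()                       # stand-in for a local workspace directory
db_path = os.path.join(workdir, "tasks.db")        # hypothetical database name
audit_path = os.path.join(workdir, "audit.jsonl")  # hypothetical audit log name

conn = sqlite3.connect(db_path)
# PRAGMA journal_mode returns the journal mode actually in effect.
journal_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
conn.execute(
    "CREATE TABLE IF NOT EXISTS tasks (id INTEGER PRIMARY KEY, agent TEXT, title TEXT)"
)


def record_task(agent: str, title: str) -> int:
    """Insert a task in a transaction, then append one JSON line to the audit log."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT INTO tasks (agent, title) VALUES (?, ?)", (agent, title)
        )
        task_id = cur.lastrowid
    event = {"ts": time.time(), "event": "task_created", "task_id": task_id, "agent": agent}
    with open(audit_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(event) + "\n")  # append-only: earlier lines are never rewritten
    return task_id


first_id = record_task("claude", "Refactor auth module")
second_id = record_task("kimi", "Write unit tests")

with open(audit_path, encoding="utf-8") as log:
    audit_events = [json.loads(line) for line in log]
```

Writing the audit line after the commit means a crash between the two steps can leave a committed row without a log entry; whether that ordering is acceptable is a design choice this sketch does not settle.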
3.2 Negative Consequences
- Higher Infrastructure Cost (+55%)
  - v1.0: $4.20/user/month @ 1K users
  - v2.0: $6.50/user/month @ 1K users
  - Persistent containers run 24/7
  - GCS operation costs are higher than R2's
- Increased Implementation Complexity
  - GCS FUSE requires careful tuning
  - SQLite clustering is non-trivial
  - File lock management needed
  - New failure modes (disk full, DB corruption)
- Resource Contention Risk
  - 4 agents share 2 vCPU / 4 GB RAM
  - Noisy neighbor problem within workspace
  - Requires careful resource limits
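The +55% figure follows directly from the two per-user costs quoted above. A short check, assuming both figures are monthly (as the Success Metrics table suggests) and that the 1K-user scale applies to both:

```python
V1_COST_PER_USER = 4.20  # USD per user at 1K users, from this ADR
V2_COST_PER_USER = 6.50  # USD per user at 1K users, from this ADR
USERS = 1_000

# Relative increase of v2.0 over v1.0, in percent.
increase_pct = (V2_COST_PER_USER - V1_COST_PER_USER) / V1_COST_PER_USER * 100
# Additional spend across the whole user base at the 1K-user scale.
extra_spend = (V2_COST_PER_USER - V1_COST_PER_USER) * USERS

print(f"Increase: {increase_pct:.1f}%")
print(f"Extra spend at {USERS} users: ${extra_spend:,.2f}")
```

The exact ratio is about 54.8%, which the ADR rounds to +55%; at 1K users the difference is roughly $2,300.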
3.3 Mitigations
| Risk | Mitigation |
|---|---|
| Cost overrun | Auto-sleep after 8h; aggressive downscaling |
| Resource contention | Per-agent CPU/memory limits; cgroups |
| Data corruption | Automated SQLite backups; checksums |
| GCS latency | R2 hot mirror; local caching layer |
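The "Automated SQLite backups; checksums" mitigation could be approached as in the sketch below. It relies on `sqlite3.Connection.backup` (the online backup API, available since Python 3.7), `hashlib.sha256`, and SQLite's `PRAGMA integrity_check`. The function names, file names, and the choice of SHA-256 are assumptions for illustration, not part of the ADR.

```python
import hashlib
import os
import sqlite3
import tempfile


def file_sha256(path: str) -> str:
    """Hash a file in 1 MiB chunks so large databases are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_with_checksum(src_path: str, dest_path: str) -> str:
    """Copy a SQLite database with the online backup API; return the copy's SHA-256."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # produces a consistent copy of the source database
    finally:
        dest.close()
        src.close()
    return file_sha256(dest_path)


def verify_backup(path: str, expected_sha256: str) -> bool:
    """Compare the stored checksum, then let SQLite check its own structure."""
    if file_sha256(path) != expected_sha256:
        return False
    conn = sqlite3.connect(path)
    try:
        return conn.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
    finally:
        conn.close()


workdir = tempfile.mkdtemp()
live_db = os.path.join(workdir, "sessions.db")           # hypothetical database
backup_db = os.path.join(workdir, "sessions.backup.db")  # hypothetical backup target

conn = sqlite3.connect(live_db)
conn.execute("CREATE TABLE sessions (id INTEGER PRIMARY KEY, owner TEXT)")
conn.execute("INSERT INTO sessions (owner) VALUES ('alice')")
conn.commit()
conn.close()

checksum = backup_with_checksum(live_db, backup_db)
backup_ok = verify_backup(backup_db, checksum)

# Simulate corruption by appending stray bytes to the backup file.
with open(backup_db, "ab") as fh:
    fh.write(b"\x00garbage")
corruption_detected = not verify_backup(backup_db, checksum)
```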
4. Alternatives Considered
4.1 Alternative 1: Keep Ephemeral, Improve Coordination
Approach: Maintain 4 sandboxes but add better state sharing via Redis/Durable Objects
Rejected because:
- Still has cold start problem
- Network overhead for file sync
- Complex consistency model
- Doesn't solve timeout interruption
4.2 Alternative 2: Hybrid (Persistent + Ephemeral)
Approach: 1 persistent "home" sandbox + 3 ephemeral "worker" sandboxes
Rejected because:
- Added complexity of two models
- Workers still have cold start
- File sync between home/workers is tricky
- Doesn't simplify architecture
4.3 Alternative 3: Serverless Functions
Approach: Use Cloudflare Workers for everything, no containers
Rejected because:
- Workers have 30s CPU limit
- No filesystem for tools
- Can't run long-lived agents
- No local file-based SQLite in the Workers runtime
5. Implementation
5.1 Phases
| Phase | Duration | Deliverable |
|---|---|---|
| Phase 1: Foundation | 4 weeks | GCS FUSE, SQLite cluster, basic workspace |
| Phase 2: Agents | 3 weeks | In-workspace orchestrator, 4 agent adapters |
| Phase 3: Frontend | 2 weeks | Multi-agent UI, agent activity panel |
| Phase 4: Migration | 2 weeks | v1.0 → v2.0 data migration, cutover |
| Phase 5: Optimization | 3 weeks | Performance tuning, cost optimization |
Total: 14 weeks
5.2 Success Metrics
| Metric | Target | Measurement |
|---|---|---|
| Cold start time | < 1 second | Time from click to ready |
| Session availability | 99.9% | Uptime excluding maintenance |
| Agent collaboration latency | < 100 ms | File lock acquisition time |
| Data durability | 99.999% | Checksums + GCS durability |
| Cost per user @ 1K | < $7.00 | Monthly infrastructure cost |
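The targets in the table above can be checked mechanically once measurements exist. The sketch below encodes them as thresholds and also converts the 99.9% availability target into a downtime budget (about 43.2 minutes over a 30-day month). The metric key names, the `evaluate` helper, and the sample measurements are hypothetical.

```python
from typing import Dict, Tuple

# Targets copied from the Success Metrics table; the key names are hypothetical.
TARGETS: Dict[str, Tuple[str, float]] = {
    "cold_start_seconds": ("<", 1.0),
    "session_availability": (">=", 0.999),
    "lock_acquisition_ms": ("<", 100.0),
    "data_durability": (">=", 0.99999),
    "cost_per_user_usd": ("<", 7.00),
}


def evaluate(measured: Dict[str, float]) -> Dict[str, bool]:
    """Return pass/fail per metric; a missing measurement counts as a failure."""
    results: Dict[str, bool] = {}
    for name, (op, target) in TARGETS.items():
        value = measured.get(name)
        if value is None:
            results[name] = False
        elif op == "<":
            results[name] = value < target
        else:
            results[name] = value >= target
    return results


def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Minutes of allowed downtime over `days` for a given availability target."""
    return (1.0 - availability) * days * 24 * 60


# Hypothetical measurements, for illustration only.
sample = {
    "cold_start_seconds": 0.4,
    "session_availability": 0.9995,
    "lock_acquisition_ms": 120.0,
    "data_durability": 0.99999,
    "cost_per_user_usd": 6.50,
}
report = evaluate(sample)
budget = downtime_budget_minutes(0.999)
```

In this made-up sample only the lock acquisition latency misses its target.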
6. Related Documents
- CODITECT-UNIFIED-PERSISTENT-ARCHITECTURE.md - Detailed architecture
- CODITECT-THIN-CLIENT-SDD-v2.md - System design
- CODITECT-THIN-CLIENT-TDD-v2.md - Technical design
- CODITECT-REVISED-ECONOMIC-MODEL.md - Cost analysis
Decision Record: Approved by Architecture Review Board 2026-01-31
Next Review: 2026-04-01 (post-implementation)