ADR-028 Part 1: Hybrid Storage Architecture - Problem & Analysis
Date: 2025-10-28
Status: Under Review → QA: ✅ CONDITIONAL PASS
Deciders: System Architect, Infrastructure Team
Related: ADR-029 (StatefulSet Migration), Analysis: docs/11-analysis/2025-10-28-persistent-storage-dynamic-pods.md
Context
The initial GKE deployment used StatefulSet volumeClaimTemplates with pod-local storage (50 GB per pod). This pattern is standard for databases and distributed systems where each pod manages unique data.
Example from Kubernetes documentation:
```yaml
# Standard StatefulSet pattern (databases, Kafka, Cassandra)
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 50Gi
```
Works great for:
- ✅ Databases (Postgres, MySQL) - each pod is a replica
- ✅ Distributed systems (Kafka, Cassandra) - each pod has unique data
Fails for:
- ❌ Multi-user IDEs - users need to access workspace from ANY pod
Problem Statement
Issue Discovered: 2025-10-28
During MVP scaling analysis for 20 users with HPA (3-30 pods), a critical data loss scenario was identified:
User creates files in Pod-0 → Pod-0 scales down → User data PERMANENTLY LOST
Root Cause: Architecture Mismatch
What we built (database pattern):
Pod-0 → workspace-coditect-combined-0 (50 GB) → User A's files LOCKED to Pod-0
Pod-1 → workspace-coditect-combined-1 (50 GB) → User B's files LOCKED to Pod-1
Pod-2 → workspace-coditect-combined-2 (50 GB) → User C's files LOCKED to Pod-2
Scale-down event: Pod-0 deleted → workspace-coditect-combined-0 deleted → User A data LOST
What we SHOULD have built (multi-user pattern):
Pod-0 ┐
Pod-1 ├→ workspace-user-A (10 GB) ← User A can access from ANY pod
Pod-2 ┘ workspace-user-B (10 GB) ← User B can access from ANY pod
workspace-user-C (10 GB) ← User C can access from ANY pod
Scale-down event: Pod-0 deleted → User A logs in → routed to Pod-1 → sees same files
Current Failure Modes
| Scenario | Current Behavior | Expected Behavior |
|---|---|---|
| User logs out, pod scales down | Data LOST | Data persists, accessible from any pod |
| User switches IP address | Routed to different pod, can't access files | Same files regardless of pod |
| Pod crashes | User loses all work since last save | User reconnects, sees exact same state |
| HPA scales 3→2 pods | 1/3 of users lose ALL data | All users unaffected |
Why This Wasn't Caught Earlier
1. Cargo Cult Kubernetes
What happened: Copied StatefulSet pattern from database examples without adapting for our use case.
Source: Kubernetes documentation, Helm charts for databases
Assumption: "StatefulSet = persistent storage = correct for all stateful apps"
Reality: StatefulSet is for pod-specific state, not user-portable state
2. Single-User Testing
Test coverage:
- ✅ One user logs in, creates files, sees files
- ❌ User switches between pods
- ❌ User logs in after pod restart
- ❌ User logs in after pod scale-down
Gap: Multi-user, multi-pod scenarios never tested
3. No Autoscaling Testing
Deployment history:
- October 13: Deployed StatefulSet with 3 fixed replicas
- October 19-26: Build/deploy iterations (no scaling changes)
- October 28: First discussion of autoscaling → problem discovered
Gap: Scaling wasn't considered until MVP planning
4. Missing Architectural Review
What was missing:
- ❌ No ADR for storage strategy
- ❌ No question: "What happens to user data when pods scale?"
- ❌ No multi-user access pattern analysis
- ❌ No comparison of storage options
Result: Fundamental architectural flaw shipped to production
Requirements
Functional Requirements
| ID | Requirement | Priority | Acceptance Criteria |
|---|---|---|---|
| FR-1 | Data Persistence | P0 | User files survive pod deletion, scale-down, crashes |
| FR-2 | Pod Portability | P0 | User can access workspace from any pod (no pod stickiness) |
| FR-3 | Performance | P1 | File operations <100ms (IDE responsive) |
| FR-4 | Multi-User Isolation | P1 | Users cannot access each other's workspaces |
| FR-5 | Cost Efficiency | P2 | Storage costs scale linearly with users |
| FR-6 | Backup/Recovery | P2 | Daily snapshots, 7-day retention, point-in-time restore |
Non-Functional Requirements
| ID | Requirement | Target | Measurement |
|---|---|---|---|
| NFR-1 | File Read Latency | <1ms | fio benchmark on SSD PVC |
| NFR-2 | File Write Latency | <5ms | fio benchmark on SSD PVC |
| NFR-3 | PVC Attach Time | <10s | kubectl PVC attach duration |
| NFR-4 | Storage IOPS | 15K-30K | GCE Persistent Disk SSD spec |
| NFR-5 | User Provisioning | <30s | PVC creation + pod assignment |
| NFR-6 | Cost per User | <$1/month | GCP billing (storage only) |
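NFR-1 and NFR-2 are meant to be measured with fio against an SSD PVC. As a crude illustrative stand-in (not a replacement for a real fio run), the sketch below samples synchronous 4 KiB write latency from Python; the function name and sampling parameters are assumptions, not part of the benchmark plan:

```python
import os
import statistics
import tempfile
import time

def sample_write_latency_ms(path: str, block: bytes, samples: int = 50) -> float:
    """Median latency (ms) of synchronous appends - a crude stand-in for the
    fio benchmarks cited in NFR-1/NFR-2 (run against a file on the SSD PVC)."""
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, block)
            os.fsync(fd)  # force the write to stable storage, as fio's sync mode does
            latencies.append((time.perf_counter() - start) * 1000)
    finally:
        os.close(fd)
    return statistics.median(latencies)

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        target = tmp.name
    median_ms = sample_write_latency_ms(target, b"\0" * 4096)
    os.unlink(target)
    print(f"median 4KiB sync write: {median_ms:.2f} ms")
```

Running this inside a pod with `path` pointing at the mounted PVC gives a quick sanity check before committing to a full fio sweep.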
Options Analysis
Option 1: Shared NFS (Google Filestore)
Architecture:
┌─────────────────────────────────────┐
│ Google Filestore (NFS Server) │
│ IP: 10.0.0.2 │
│ /workspace/users/{user_id}/ │
└──────────────┬──────────────────────┘
│ (NFS mount)
┌───────┴────────┐
│ │
┌───▼───┐ ┌───▼───┐
│ Pod-0 │ │ Pod-1 │ ... Pod-N
└───────┘ └───────┘
All pods mount /workspace via NFS
Implementation:
```yaml
# Persistent Volume (NFS)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: filestore-workspace
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.0.0.2  # Filestore IP
    path: /workspace
---
# PersistentVolumeClaim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-shared
  namespace: coditect-app
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""  # No storage class (manual PV binding)
  resources:
    requests:
      storage: 1Ti
```
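Because the claim is ReadWriteMany, every replica can mount it simultaneously. A minimal sketch of the consuming side (the Deployment name, label, and image are hypothetical placeholders, not part of this ADR):

```yaml
# Sketch: any replica mounting the shared RWX claim (names are placeholders)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coditect-ide
  namespace: coditect-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: coditect-ide
  template:
    metadata:
      labels:
        app: coditect-ide
    spec:
      containers:
        - name: ide
          image: coditect/ide:latest  # placeholder image
          volumeMounts:
            - name: workspace
              mountPath: /workspace   # per-user dirs under /workspace/users/{user_id}/
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: workspace-shared  # the RWX PVC defined above
```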
Pros:
- ✅ POSIX-Compliant: Full filesystem semantics (hard links, file locking, atomic operations)
- ✅ ReadWriteMany: All pods can mount simultaneously
- ✅ Simple Implementation: Standard NFS mount (no custom code)
- ✅ Familiar Model: Works like a traditional NAS
- ✅ IDE Compatible: No code changes needed (Theia sees an ordinary local filesystem)
Cons:
- ❌ Expensive: $204.80/month for 1TB (Basic tier: $0.20/GB/month)
- ❌ Fixed Overhead: Minimum 1TB even for 3 users
- ❌ Network Latency: 10-50ms for file operations (vs <1ms for local SSD)
- ❌ Single Point of Failure: Filestore outage = all pods down
- ❌ Scaling Limits: Max 100TB, 60MB/s per TB throughput
- ❌ No Built-in Versioning: Need separate backup solution
Cost Breakdown (Filestore Basic):
- Storage: 1024 GB × $0.20/GB = $204.80/month
- Operations: Included
- Total: $204.80/month (fixed cost regardless of users)
Performance (Google spec + real-world):
- Read latency: 5-15ms (cached), 10-50ms (uncached)
- Write latency: 10-30ms (sync writes)
- Throughput: 60-100 MB/s (Basic tier)
- IOPS: 1000-3000 (file-size dependent)
When to use:
- 50+ concurrent users
- Can tolerate 10-30ms latency
- Budget allows $200+/month storage
- Need true multi-user file sharing (Git, collaborative editing)
Decision: ❌ REJECTED - Too expensive for 10-20 user MVP ($204/month vs $7/month for Hybrid)
Option 2: Google Cloud Storage (GCS) with gcsfuse
Architecture:
┌─────────────────────────────────────┐
│ GCS Bucket: coditect-workspaces │
│ gs://coditect-workspaces/users/ │
└──────────────┬──────────────────────┘
│ (gcsfuse FUSE mount)
┌───────┴────────┐
│ │
┌───▼───┐ ┌───▼───┐
│ Pod-0 │ │ Pod-1 │ ... Pod-N
└───────┘ └───────┘
gcsfuse mounts GCS bucket at /workspace
Implementation:
```yaml
# Pod mounting a GCS bucket via the GKE-managed Cloud Storage FUSE CSI driver
# (the driver ships as a GKE add-on; no hand-rolled DaemonSet is required)
apiVersion: v1
kind: Pod
metadata:
  name: workspace-pod
  annotations:
    gke-gcsfuse/volumes: "true"  # injects the gcsfuse sidecar
spec:
  containers:
    - name: ide
      image: theia-ide:latest  # placeholder for the IDE image
      volumeMounts:
        - name: gcs-mount
          mountPath: /workspace
  volumes:
    - name: gcs-mount
      csi:
        driver: gcsfuse.csi.storage.gke.io
        volumeAttributes:
          bucketName: coditect-workspaces
          mountOptions: "implicit-dirs,file-mode=644,dir-mode=755"
```
Pros:
- ✅ Cost-Effective: $20-30/month for 1TB (Standard: $0.020/GB/month)
- ✅ Scalable: Unlimited capacity, auto-scales
- ✅ Durable: 99.999999999% (11 nines) durability
- ✅ Versioning: Built-in object versioning
- ✅ No Single Point of Failure: Highly available by design
- ✅ No Capacity Planning: No pre-provisioning needed
Cons:
- ❌ Not POSIX-Compliant: Limited metadata, no hard links, no atomic rename across directories
- ❌ Eventual Consistency: Directory listings may lag behind writes
- ❌ High Latency: 50-200ms for small file ops (network roundtrip to GCS API)
- ❌ Cache Complexity: Need aggressive caching for acceptable IDE performance
- ❌ Debugging Harder: FUSE layer adds complexity to troubleshooting
POSIX Limitations (breaks IDE features):
# These operations DON'T work properly with gcsfuse:
ln file1 file2 # Hard links not supported
mv dir1/file dir2/file # Not atomic across "directories" (objects with prefixes)
flock /workspace/file # File locking unreliable
stat /workspace/file # Missing metadata: ctime, inode number
Cost Breakdown (GCS Standard, 1TB):
- Storage: 1024 GB × $0.020/GB = $20.48/month
- Class A operations (writes): ~100K/month × $0.05/10K = $0.50
- Class B operations (reads): ~1M/month × $0.004/10K = $0.40
- Total: ~$21-30/month (variable with usage)
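The total above can be cross-checked from the quoted unit prices. A small sketch (prices are the ones stated in this ADR, not live GCP rates):

```python
# Cross-check of the GCS Standard cost figures above (prices as quoted in this ADR).
STORAGE_PER_GB = 0.020    # $/GB/month, GCS Standard
CLASS_A_PER_10K = 0.05    # $ per 10K write (Class A) operations
CLASS_B_PER_10K = 0.004   # $ per 10K read (Class B) operations

def gcs_monthly_cost(gb: int, writes: int, reads: int) -> float:
    storage = gb * STORAGE_PER_GB
    ops = (writes / 10_000) * CLASS_A_PER_10K + (reads / 10_000) * CLASS_B_PER_10K
    return storage + ops

if __name__ == "__main__":
    total = gcs_monthly_cost(1024, writes=100_000, reads=1_000_000)
    print(f"${total:.2f}/month")  # 20.48 + 0.50 + 0.40 = 21.38
```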
Performance (Google spec + gcsfuse overhead):
- Read latency: 50-150ms (uncached), 5-10ms (cached)
- Write latency: 100-300ms (API roundtrip + object creation)
- Throughput: 100-500 MB/s (network-limited)
- IOPS: 1000-5000 (highly variable, not guaranteed)
When to use:
- Cost is primary concern
- Workload is read-heavy (can cache aggressively)
- Can tolerate eventual consistency
- Don't need full POSIX (no Git, no file locking)
Decision: ❌ REJECTED - Latency (50-200ms) breaks IDE responsiveness, POSIX incompatibility breaks Git workflows
Option 3: User-Specific PVCs (50 GB per user)
Architecture:
┌─────────────────┐ ┌─────────────────┐
│ workspace-user-a│ │ workspace-user-b│
│ (50 GB PVC) │ │ (50 GB PVC) │
└────────┬────────┘ └────────┬────────┘
│ (pod affinity) │ (pod affinity)
┌───▼───┐ ┌───▼───┐
│ Pod-0 │ │ Pod-1 │ ... Pod-N
└───────┘ └───────┘
(user-a assigned) (user-b assigned)
Implementation:
```yaml
# User-specific PVC (created per user)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-user-{{ user_id }}
  namespace: coditect-app
  labels:
    user: {{ user_id }}
    app: coditect-workspace
spec:
  accessModes:
    - ReadWriteOnce  # Only one pod can mount
  resources:
    requests:
      storage: 50Gi
  storageClassName: standard-rwo
---
# StatefulSet with pod affinity (co-locates the user's pod with their PVC)
apiVersion: apps/v1
kind: StatefulSet
spec:
  template:
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: user
                    operator: In
                    values: ["{{ user_id }}"]
              topologyKey: kubernetes.io/hostname
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: workspace-user-{{ user_id }}  # must match the PVC name above
```
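The `{{ user_id }}` placeholders above imply a templating step at user-provisioning time. A minimal sketch of that rendering, assuming plain string substitution (a production system would more likely use Helm, Kustomize, or an operator; the function and template names are hypothetical):

```python
# Minimal sketch of rendering the per-user PVC manifest above.
PVC_TEMPLATE = """\
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-user-{user_id}
  namespace: coditect-app
  labels:
    user: {user_id}
    app: coditect-workspace
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: {size_gi}Gi
  storageClassName: standard-rwo
"""

def render_user_pvc(user_id: str, size_gi: int = 50) -> str:
    # Kubernetes resource names must be lowercase DNS-1123 labels;
    # enforce the cheap checks before the manifest ever reaches the API server.
    if not user_id.replace("-", "").isalnum() or user_id != user_id.lower():
        raise ValueError(f"invalid user id for a k8s resource name: {user_id!r}")
    return PVC_TEMPLATE.format(user_id=user_id, size_gi=size_gi)

if __name__ == "__main__":
    print(render_user_pvc("alice"))
```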
Pros:
- ✅ Full POSIX: Standard block storage (ext4/xfs filesystem)
- ✅ Per-User Isolation: Each user gets dedicated PVC (security compliant)
- ✅ Performance: Local SSD possible (sub-ms latency)
- ✅ Kubernetes-Native: No external dependencies (Filestore, GCS)
- ✅ Snapshot Support: GKE persistent disk snapshots built-in
Cons:
- ❌ Complex Scheduling: Need custom controller for PVC→Pod binding
- ❌ Scale Challenges: 1000 users = 1000 PVCs (management overhead with kubectl)
- ❌ Pod Stickiness: User tied to pod, can't easily switch
- ❌ Wasted Capacity: Idle user PVCs still consume storage (50 GB × 100 idle users = 5 TB waste)
- ❌ Cold Start: Attaching PVC to new pod takes 30-60s (user waits)
Cost Breakdown (50 GB × 20 users):
- Storage: 20 users × 50 GB × $0.020/GB = $20.00/month
- Snapshots: 20 users × 7 daily snapshots × 5 GB × $0.026/GB = $18.20/month
- Total: ~$38/month (at 20 users)
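The per-user cost model is linear, so the 20-, 100-, and 500-user figures all fall out of the same formula. A sketch using the unit prices quoted in this ADR:

```python
# Cross-check of the per-user PVC cost figures above (prices as quoted in this ADR).
PD_STANDARD_PER_GB = 0.020  # $/GB/month, persistent disk
SNAPSHOT_PER_GB = 0.026     # $/GB/month, snapshot storage

def pvc_monthly_cost(users: int, pvc_gb: int = 50,
                     daily_snapshots: int = 7, snapshot_gb: int = 5) -> float:
    storage = users * pvc_gb * PD_STANDARD_PER_GB
    snapshots = users * daily_snapshots * snapshot_gb * SNAPSHOT_PER_GB
    return storage + snapshots

if __name__ == "__main__":
    print(f"20 users:  ${pvc_monthly_cost(20):.2f}/month")   # 20.00 + 18.20 = 38.20
    print(f"100 users: ${pvc_monthly_cost(100):.2f}/month")  # 100.00 + 91.00 = 191.00
    print(f"500 users: ${pvc_monthly_cost(500):.2f}/month")  # 500.00 + 455.00 = 955.00
```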
Scaling Cost:
- 100 users: $100/month storage + $91/month snapshots = $191/month
- 500 users: $500/month storage + $455/month snapshots = $955/month
Performance (GCE Persistent Disk SSD):
- Read latency: <1ms (local SSD)
- Write latency: <5ms (SSD)
- Throughput: 240 MB/s (standard PD)
- IOPS: 15K-30K (SSD)
When to use:
- Need maximum performance
- User count < 100
- Can implement custom PVC lifecycle management
- Budget scales linearly with users
Decision: ⚠️ PARTIAL - Good performance, but wastes storage on duplicated base files (IDE tools, configs). Leads to Option 4 (Hybrid).
➡️ Option 4: Hybrid Storage - Shared Base + User Overlays (RECOMMENDED)
See ADR-028 Part 2 for full decision and implementation.
Comparative Summary
| Criteria | NFS (Filestore) | GCS (gcsfuse) | User PVCs (50 GB) | Hybrid (10 GB) |
|---|---|---|---|---|
| Cost (20 users) | $205/month | $30/month | $38/month | $7/month |
| Cost per user | $10.25 | $1.50 | $1.90 | $0.35 |
| File Read Latency | 10-50ms | 50-200ms | <1ms | <1ms |
| File Write Latency | 10-30ms | 100-300ms | <5ms | <5ms |
| POSIX Compliant | ✅ Yes | ❌ No | ✅ Yes | ✅ Yes |
| Pod Portability | ✅ Yes | ✅ Yes | ⚠️ Limited | ✅ Yes |
| Storage Waste | ❌ 1TB min | ✅ None | ⚠️ Moderate | ✅ Minimal |
| Implementation | Simple | Medium | Complex | Medium |
| Scaling (500 users) | $2,050/month | $300/month | $955/month | $100/month |
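The per-user and savings figures in this table derive directly from the 20-user totals. A sketch of that derivation (totals taken from the table itself):

```python
# Per-user monthly cost and savings, derived from the 20-user totals above.
MONTHLY_TOTALS_20_USERS = {
    "NFS (Filestore)": 205.0,
    "GCS (gcsfuse)": 30.0,
    "User PVCs (50 GB)": 38.0,
    "Hybrid (10 GB)": 7.0,
}

def per_user(total: float, users: int = 20) -> float:
    return total / users

def savings_vs(option: str, baseline: str = "NFS (Filestore)") -> float:
    """Percentage saved by `option` relative to `baseline`."""
    base = MONTHLY_TOTALS_20_USERS[baseline]
    return (base - MONTHLY_TOTALS_20_USERS[option]) / base * 100

if __name__ == "__main__":
    for name, total in MONTHLY_TOTALS_20_USERS.items():
        print(f"{name}: ${per_user(total):.2f}/user/month")
    # ~96.6%, quoted as "96%" in the summary line below the table
    print(f"Hybrid saves {savings_vs('Hybrid (10 GB)'):.1f}% vs NFS")
```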
Winner: Hybrid Storage - 96% cost savings vs NFS, <1ms performance, $0.35/user/month
Related ADRs
- ADR-029: StatefulSet with Persistent Storage Migration (2025-10-27) - Documents initial StatefulSet migration
- ADR-004: FoundationDB for Persistence - Session metadata storage (orthogonal to workspace files)
- ADR-006: OPFS for Browser Storage - Browser-side cache (orthogonal to server-side storage)
- ADR-020: GCP Deployment Strategy - OUTDATED (references Cloud Run, project uses GKE)
- ADR-026: Wrapper Persistence Architecture - UI wrapper state (orthogonal to workspace storage)
Lessons Learned
What Went Wrong
- ❌ Cargo Cult Kubernetes: Copied StatefulSet pattern without adapting for multi-user access
- ❌ No Testing of Scaling Scenarios: Didn't test pod scale-down, user switching between pods
- ❌ Assumed "Working Deployment" = "Correct Architecture": Initial deployment worked for 1 user, shipped without multi-user validation
- ❌ No ADR for Storage Strategy: Skipped architectural review, missed fundamental flaw
What Should Have Happened
- ✅ Written ADR for storage strategy BEFORE implementation
- ✅ Asked: "What happens to user data when pods scale down?"
- ✅ Tested multi-user scenarios: User switches pods, pod failures, autoscaling
- ✅ Designed for 100 users from day 1 (not just 3 pods)
Takeaway
Don't cargo cult Kubernetes patterns. Understand YOUR access patterns first:
- Databases: Pod-specific state (StatefulSet volumeClaimTemplates = correct)
- Multi-user IDEs: User-portable state (StatefulSet volumeClaimTemplates = WRONG)
QA Review Status
Quality Gate: ✅ CONDITIONAL PASS (2025-10-28)
Issues Found: 5 (2 Critical, 3 Important)
| Issue | Severity | Status |
|---|---|---|
| ConfigMap size limit (1 MB max, not 50 GB) | Critical | Addressed in Part 2 |
| Dynamic PVC strategy unclear | Critical | Addressed in Part 2 |
| Timeline underestimated (17-22h → 30-38h) | Important | Corrected in Part 2 |
| Missing backup strategy | Important | Added as Phase 6 in Part 2 |
| Missing ADR-029 reference | Minor | ✅ Fixed above |
Next: See ADR-028 Part 2 for decision, implementation plan, and QA-validated architecture.
Document Status: ✅ Part 1 Complete Next Step: Review Part 2 (Decision & Implementation) Approval Required: System Architect sign-off before implementation