
ADR-028 Part 1: Hybrid Storage Architecture - Problem & Analysis

Date: 2025-10-28
Status: Under Review → QA: ✅ CONDITIONAL PASS
Deciders: System Architect, Infrastructure Team
Related: ADR-029 (StatefulSet Migration), Analysis: docs/11-analysis/2025-10-28-persistent-storage-dynamic-pods.md


Context

The initial GKE deployment used StatefulSet volumeClaimTemplates with pod-local storage (50 GB per pod). This pattern is standard for databases and distributed systems where each pod manages unique data.

Example from Kubernetes documentation:

```yaml
# Standard StatefulSet pattern (databases, Kafka, Cassandra)
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 50Gi
```

Works great for:

  • ✅ Databases (Postgres, MySQL) - each pod is a replica
  • ✅ Distributed systems (Kafka, Cassandra) - each pod has unique data

Fails for:

  • ❌ Multi-user IDEs - users need to access their workspace from ANY pod

Problem Statement

Issue Discovered: 2025-10-28

During MVP scaling analysis for 20 users with HPA (3-30 pods), a critical data loss scenario was identified:

User creates files in Pod-0 → Pod-0 scales down → User data PERMANENTLY LOST

Root Cause: Architecture Mismatch

What we built (database pattern):

```
Pod-0 → workspace-coditect-combined-0 (50 GB) → User A's files LOCKED to Pod-0
Pod-1 → workspace-coditect-combined-1 (50 GB) → User B's files LOCKED to Pod-1
Pod-2 → workspace-coditect-combined-2 (50 GB) → User C's files LOCKED to Pod-2
```

Scale-down event: Pod-0 deleted → workspace-coditect-combined-0 deleted → User A data LOST

What we SHOULD have built (multi-user pattern):

```
Pod-0 ┐
Pod-1 ├→ workspace-user-A (10 GB) ← User A can access from ANY pod
Pod-2 ┘  workspace-user-B (10 GB) ← User B can access from ANY pod
         workspace-user-C (10 GB) ← User C can access from ANY pod
```

Scale-down event: Pod-0 deleted → User A logs in → routed to Pod-1 → sees same files
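
The user-keyed pattern above boils down to claims named after users rather than pod ordinals. A minimal illustrative sketch (the name, size, and storage class here are assumptions, not values from the deployment):

```yaml
# Hypothetical user-keyed claim: storage follows the user, not the pod.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-user-a        # keyed by user ID, not pod ordinal
  namespace: coditect-app
spec:
  accessModes:
    - ReadWriteOnce             # attached to whichever pod the user lands on
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard-rwo
```

Because the claim is independent of any StatefulSet ordinal, deleting Pod-0 leaves `workspace-user-a` intact for the next pod to mount.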

Current Failure Modes

| Scenario | Current Behavior | Expected Behavior |
|----------|------------------|-------------------|
| User logs out, pod scales down | Data LOST | Data persists, accessible from any pod |
| User switches IP address | Routed to different pod, can't access files | Same files regardless of pod |
| Pod crashes | User loses all work since last save | User reconnects, sees exact same state |
| HPA scales 3→2 pods | 1/3 of users lose ALL data | All users unaffected |

Why This Wasn't Caught Earlier

1. Cargo Cult Kubernetes

What happened: Copied StatefulSet pattern from database examples without adapting for our use case.

Source: Kubernetes documentation, Helm charts for databases

Assumption: "StatefulSet = persistent storage = correct for all stateful apps"

Reality: StatefulSet is for pod-specific state, not user-portable state

2. Single-User Testing

Test coverage:

  • ✅ One user logs in, creates files, sees files
  • ❌ User switches between pods
  • ❌ User logs in after pod restart
  • ❌ User logs in after pod scale-down

Gap: Multi-user, multi-pod scenarios never tested

3. No Autoscaling Testing

Deployment history:

  • October 13: Deployed StatefulSet with 3 fixed replicas
  • October 19-26: Build/deploy iterations (no scaling changes)
  • October 28: First discussion of autoscaling → problem discovered

Gap: Scaling wasn't considered until MVP planning

4. Missing Architectural Review

What was missing:

  • ❌ No ADR for storage strategy
  • ❌ No question: "What happens to user data when pods scale?"
  • ❌ No multi-user access pattern analysis
  • ❌ No comparison of storage options

Result: Fundamental architectural flaw shipped to production


Requirements

Functional Requirements

| ID | Requirement | Priority | Acceptance Criteria |
|----|-------------|----------|---------------------|
| FR-1 | Data Persistence | P0 | User files survive pod deletion, scale-down, crashes |
| FR-2 | Pod Portability | P0 | User can access workspace from any pod (no pod stickiness) |
| FR-3 | Performance | P1 | File operations <100ms (IDE responsive) |
| FR-4 | Multi-User Isolation | P1 | Users cannot access each other's workspaces |
| FR-5 | Cost Efficiency | P2 | Storage costs scale linearly with users |
| FR-6 | Backup/Recovery | P2 | Daily snapshots, 7-day retention, point-in-time restore |

Non-Functional Requirements

| ID | Requirement | Target | Measurement |
|----|-------------|--------|-------------|
| NFR-1 | File Read Latency | <1ms | fio benchmark on SSD PVC |
| NFR-2 | File Write Latency | <5ms | fio benchmark on SSD PVC |
| NFR-3 | PVC Attach Time | <10s | kubectl PVC attach duration |
| NFR-4 | Storage IOPS | 15K-30K | GCE Persistent Disk SSD spec |
| NFR-5 | User Provisioning | <30s | PVC creation + pod assignment |
| NFR-6 | Cost per User | <$1/month | GCP billing (storage only) |

Options Analysis

Option 1: Shared NFS (Google Filestore)

Architecture:

```
┌─────────────────────────────────────┐
│  Google Filestore (NFS Server)      │
│  IP: 10.0.0.2                       │
│  /workspace/users/{user_id}/        │
└──────────────┬──────────────────────┘
               │ (NFS mount)
       ┌───────┴────────┐
       │                │
   ┌───▼───┐        ┌───▼───┐
   │ Pod-0 │        │ Pod-1 │  ... Pod-N
   └───────┘        └───────┘

All pods mount /workspace via NFS
```

Implementation:

```yaml
# PersistentVolume (NFS)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: filestore-workspace
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.0.0.2  # Filestore IP
    path: /workspace
---
# PersistentVolumeClaim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-shared
  namespace: coditect-app
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""  # No storage class (manual PV binding)
  resources:
    requests:
      storage: 1Ti
```

Pros:

  • POSIX-Compliant: Full filesystem semantics (hard links, file locking, atomic operations)
  • ReadWriteMany: All pods can mount simultaneously
  • Simple Implementation: Standard NFS mount (no custom code)
  • Familiar Model: Works like a traditional NAS
  • IDE Compatible: No code changes needed (Theia sees a local filesystem)

Cons:

  • Expensive: $204.80/month for 1TB (Basic tier: $0.20/GB/month)
  • Fixed Overhead: Minimum 1TB even for 3 users
  • Network Latency: 10-50ms for file operations (vs <1ms for local SSD)
  • Single Point of Failure: Filestore outage = all pods down
  • Scaling Limits: Max 100TB, 60MB/s per TB throughput
  • No Built-in Versioning: Need separate backup solution

Cost Breakdown (Filestore Basic):

  • Storage: 1024 GB × $0.20/GB = $204.80/month
  • Operations: Included
  • Total: $204.80/month (fixed cost regardless of users)
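
These figures can be sanity-checked directly; a quick sketch (the $0.20/GB/month Basic-tier rate is taken from the breakdown above and may change):

```bash
# Verify the Filestore Basic monthly cost: capacity × per-GB rate.
capacity_gb=1024
rate_per_gb=0.20
monthly=$(awk -v gb="$capacity_gb" -v r="$rate_per_gb" 'BEGIN { printf "%.2f", gb * r }')
echo "Filestore Basic, ${capacity_gb} GB: \$${monthly}/month"
```

The cost is fixed: the bill is the same whether 3 users or 30 are active, which is what makes this option hard to justify at MVP scale.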

Performance (Google spec + real-world):

  • Read latency: 5-15ms (cached), 10-50ms (uncached)
  • Write latency: 10-30ms (sync writes)
  • Throughput: 60-100 MB/s (Basic tier)
  • IOPS: 1000-3000 (file-size dependent)

When to use:

  • 50+ concurrent users
  • Can tolerate 10-30ms latency
  • Budget allows $200+/month storage
  • Need true multi-user file sharing (Git, collaborative editing)

Decision: ❌ REJECTED - Too expensive for 10-20 user MVP ($204/month vs $7/month for Hybrid)


Option 2: Google Cloud Storage (GCS) with gcsfuse

Architecture:

```
┌─────────────────────────────────────┐
│  GCS Bucket: coditect-workspaces    │
│  gs://coditect-workspaces/users/    │
└──────────────┬──────────────────────┘
               │ (gcsfuse FUSE mount)
       ┌───────┴────────┐
       │                │
   ┌───▼───┐        ┌───▼───┐
   │ Pod-0 │        │ Pod-1 │  ... Pod-N
   └───────┘        └───────┘

gcsfuse mounts the GCS bucket at /workspace
```

Implementation:

```yaml
# DaemonSet deploying the GCS FUSE CSI driver on every node
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gcs-fuse-csi-driver
spec:
  selector:
    matchLabels:
      app: gcs-fuse-csi-driver
  template:
    metadata:
      labels:
        app: gcs-fuse-csi-driver
    spec:
      containers:
        - name: gcs-fuse
          image: gcr.io/gcs-fuse-csi-driver/gcs-fuse-csi-driver:latest
          securityContext:
            privileged: true
          volumeMounts:
            - name: gcs-mount
              mountPath: /workspace
              mountPropagation: Bidirectional
      volumes:
        - name: gcs-mount
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: coditect-workspaces
              mountOptions: "implicit-dirs,file-mode=644,dir-mode=755"
```

Pros:

  • Cost-Effective: $20-30/month for 1TB (Standard: $0.020/GB/month)
  • Scalable: Unlimited capacity, auto-scales
  • Durable: 99.999999999% (11 nines) durability
  • Versioning: Built-in object versioning
  • No Single Point of Failure: Highly available by design
  • No Capacity Planning: No pre-provisioning needed

Cons:

  • Not POSIX-Compliant: Limited metadata, no hard links, no atomic rename across directories
  • Eventual Consistency: Directory listings may lag behind writes
  • High Latency: 50-200ms for small file ops (network roundtrip to GCS API)
  • Cache Complexity: Need aggressive caching for acceptable IDE performance
  • Debugging Harder: FUSE layer adds complexity to troubleshooting

POSIX Limitations (breaks IDE features):

```bash
# These operations DON'T work properly with gcsfuse:
ln file1 file2          # Hard links not supported
mv dir1/file dir2/file  # Not atomic across "directories" (objects with prefixes)
flock /workspace/file   # File locking unreliable
stat /workspace/file    # Missing metadata: ctime, inode number
```

Cost Breakdown (GCS Standard, 1TB):

  • Storage: 1024 GB × $0.020/GB = $20.48/month
  • Class A operations (writes): ~100K/month × $0.05/10K = $0.50
  • Class B operations (reads): ~1M/month × $0.004/10K = $0.40
  • Total: ~$21-30/month (variable with usage)

Performance (Google spec + gcsfuse overhead):

  • Read latency: 50-150ms (uncached), 5-10ms (cached)
  • Write latency: 100-300ms (API roundtrip + object creation)
  • Throughput: 100-500 MB/s (network-limited)
  • IOPS: 1000-5000 (highly variable, not guaranteed)

When to use:

  • Cost is primary concern
  • Workload is read-heavy (can cache aggressively)
  • Can tolerate eventual consistency
  • Don't need full POSIX (no Git, no file locking)

Decision: ❌ REJECTED - Latency (50-200ms) breaks IDE responsiveness, POSIX incompatibility breaks Git workflows


Option 3: User-Specific PVCs (50 GB per user)

Architecture:

```
┌─────────────────┐      ┌─────────────────┐
│ workspace-user-a│      │ workspace-user-b│
│   (50 GB PVC)   │      │   (50 GB PVC)   │
└────────┬────────┘      └────────┬────────┘
         │ (pod affinity)         │ (pod affinity)
     ┌───▼───┐                ┌───▼───┐
     │ Pod-0 │                │ Pod-1 │  ... Pod-N
     └───────┘                └───────┘
 (user-a assigned)        (user-b assigned)
```

Implementation:

```yaml
# User-specific PVC (created per user)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-user-{{ user_id }}
  namespace: coditect-app
  labels:
    user: "{{ user_id }}"
    app: coditect-workspace
spec:
  accessModes:
    - ReadWriteOnce  # Only one pod can mount
  resources:
    requests:
      storage: 50Gi
  storageClassName: standard-rwo
---
# StatefulSet with pod affinity (assigns user to pod with their PVC)
apiVersion: apps/v1
kind: StatefulSet
spec:
  template:
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: user
                    operator: In
                    values: ["{{ user_id }}"]
              topologyKey: kubernetes.io/hostname
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: workspace-user-{{ user_id }}  # must match the PVC name above
```
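
The `{{ user_id }}` placeholders imply a render step before anything reaches the cluster. A minimal sketch using `sed` (the provisioning flow and names are assumptions; a real pipeline might use Helm or Kustomize instead):

```bash
# Render a per-user PVC manifest fragment from a template; the result could
# then be piped to `kubectl apply -f -`. The user ID here is illustrative.
user_id="alice"
manifest=$(sed "s/{{ user_id }}/${user_id}/g" <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-user-{{ user_id }}
  labels:
    user: "{{ user_id }}"
EOF
)
echo "$manifest"
```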

Pros:

  • Full POSIX: Standard block storage (ext4/xfs filesystem)
  • Per-User Isolation: Each user gets dedicated PVC (security compliant)
  • Performance: Local SSD possible (sub-ms latency)
  • Kubernetes-Native: No external dependencies (Filestore, GCS)
  • Snapshot Support: GKE persistent disk snapshots built-in

Cons:

  • Complex Scheduling: Need custom controller for PVC→Pod binding
  • Scale Challenges: 1000 users = 1000 PVCs (management overhead with kubectl)
  • Pod Stickiness: User tied to pod, can't easily switch
  • Wasted Capacity: Idle user PVCs still consume storage (50 GB × 100 idle users = 5 TB waste)
  • Cold Start: Attaching PVC to new pod takes 30-60s (user waits)

Cost Breakdown (50 GB × 20 users):

  • Storage: 20 users × 50 GB × $0.020/GB = $20.00/month
  • Snapshots: 20 users × 7 daily snapshots × 5 GB × $0.026/GB = $18.20/month
  • Total: ~$38/month (at 20 users)

Scaling Cost:

  • 100 users: $100/month storage + $91/month snapshots = $191/month
  • 500 users: $500/month storage + $455/month snapshots = $955/month

Performance (GCE Persistent Disk SSD):

  • Read latency: <1ms (local SSD)
  • Write latency: <5ms (SSD)
  • Throughput: 240 MB/s (standard PD)
  • IOPS: 15K-30K (SSD)

When to use:

  • Need maximum performance
  • User count < 100
  • Can implement custom PVC lifecycle management
  • Budget scales linearly with users

Decision: ⚠️ PARTIAL - Good performance, but wastes storage on duplicated base files (IDE tools, configs). Leads to Option 4 (Hybrid).


See ADR-028 Part 2 for full decision and implementation.


Comparative Summary

| Criteria | NFS (Filestore) | GCS (gcsfuse) | User PVCs (50 GB) | Hybrid (10 GB) |
|----------|-----------------|---------------|-------------------|----------------|
| Cost (20 users) | $205/month | $30/month | $38/month | $7/month |
| Cost per user | $10.25 | $1.50 | $1.90 | $0.35 |
| File Read Latency | 10-50ms | 50-200ms | <1ms | <1ms |
| File Write Latency | 10-30ms | 100-300ms | <5ms | <5ms |
| POSIX Compliant | ✅ Yes | ❌ No | ✅ Yes | ✅ Yes |
| Pod Portability | ✅ Yes | ✅ Yes | ⚠️ Limited | ✅ Yes |
| Storage Waste | ❌ 1TB min | ✅ None | ⚠️ Moderate | ✅ Minimal |
| Implementation | Simple | Medium | Complex | Medium |
| Scaling (500 users) | $2,050/month | $300/month | $955/month | $100/month |

Winner: Hybrid Storage - 96% cost savings vs NFS, <1ms performance, $0.35/user/month
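
The per-user row follows directly from the 20-user monthly totals; a quick check of the arithmetic:

```bash
# Reproduce the "Cost per user" row: monthly total at 20 users / 20.
users=20
for entry in "NFS:205" "GCS:30" "UserPVC:38" "Hybrid:7"; do
  name=${entry%%:*}    # option label
  total=${entry##*:}   # monthly total in dollars
  awk -v n="$name" -v t="$total" -v u="$users" \
    'BEGIN { printf "%-8s $%.2f/user/month\n", n, t / u }'
done
```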


Related Decisions

  • ADR-029: StatefulSet with Persistent Storage Migration (2025-10-27) - Documents initial StatefulSet migration
  • ADR-004: FoundationDB for Persistence - Session metadata storage (orthogonal to workspace files)
  • ADR-006: OPFS for Browser Storage - Browser-side cache (orthogonal to server-side storage)
  • ADR-020: GCP Deployment Strategy - OUTDATED (references Cloud Run, project uses GKE)
  • ADR-026: Wrapper Persistence Architecture - UI wrapper state (orthogonal to workspace storage)

Lessons Learned

What Went Wrong

  1. Cargo Cult Kubernetes: Copied StatefulSet pattern without adapting for multi-user access
  2. No Testing of Scaling Scenarios: Didn't test pod scale-down, user switching between pods
  3. Assumed "Working Deployment" = "Correct Architecture": Initial deployment worked for 1 user, shipped without multi-user validation
  4. No ADR for Storage Strategy: Skipped architectural review, missed fundamental flaw

What Should Have Happened

  1. Written ADR for storage strategy BEFORE implementation
  2. Asked: "What happens to user data when pods scale down?"
  3. Tested multi-user scenarios: User switches pods, pod failures, autoscaling
  4. Designed for 100 users from day 1 (not just 3 pods)

Takeaway

Don't cargo cult Kubernetes patterns. Understand YOUR access patterns first:

  • Databases: Pod-specific state (StatefulSet volumeClaimTemplates = correct)
  • Multi-user IDEs: User-portable state (StatefulSet volumeClaimTemplates = WRONG)

QA Review Status

Quality Gate: ✅ CONDITIONAL PASS (2025-10-28)

Issues Found: 5 (2 Critical, 2 Important, 1 Minor)

| Issue | Severity | Status |
|-------|----------|--------|
| ConfigMap size limit (1 MB max, not 50 GB) | Critical | Addressed in Part 2 |
| Dynamic PVC strategy unclear | Critical | Addressed in Part 2 |
| Timeline underestimated (17-22h → 30-38h) | Important | Corrected in Part 2 |
| Missing backup strategy | Important | Added as Phase 6 in Part 2 |
| Missing ADR-029 reference | Minor | ✅ Fixed above |

Next: See ADR-028 Part 2 for decision, implementation plan, and QA-validated architecture.


Document Status: ✅ Part 1 Complete
Next Step: Review Part 2 (Decision & Implementation)
Approval Required: System Architect sign-off before implementation