StatefulSet with Persistent Storage: Infrastructure Evolution
Document: ADR-029
Date: 2025-10-27T06:12:03Z
Status: ✅ IMPLEMENTED
Category: Infrastructure / Kubernetes
Executive Summary
Coditect AI IDE migrated from Kubernetes Deployment to StatefulSet with persistent volumes to solve the critical data loss problem. Users were losing all code and files on logout/timeout because pods had no persistent storage.
Key Achievement: 100% workspace persistence across pod restarts, enabling a true cloud-IDE experience on par with VS Code Server, Gitpod, and GitHub Codespaces.
Table of Contents
- The Problem: Data Loss
- Decision Context
- Original Architecture (Deployment)
- New Architecture (StatefulSet)
- Technical Implementation
- Capacity Planning
- Migration Process
- Benefits Realized
- Cost Analysis
- Lessons Learned
The Problem: Data Loss
User Experience (Before StatefulSet)
- User logs into the Theia IDE at https://coditect.ai/theia
- User creates files, installs extensions, configures workspace
- User logs out or session times out
- ALL DATA LOST - pod terminates, ephemeral storage destroyed
- User logs back in - fresh environment, all work gone
Impact: Unusable for real development work; users could not trust the platform
Root Cause
Kubernetes Deployment uses ephemeral storage by default:
- No persistent volumes attached
- Pod filesystem lives in container's writable layer
- Pod deletion = data deletion
- Every logout = potential data loss
Comparison to Competitors:
- ✅ VS Code Server: Persistent workspace via SSH
- ✅ Gitpod: Persistent workspace via PVC
- ✅ GitHub Codespaces: Persistent workspace via Azure Files
- ❌ Coditect (before migration): Ephemeral storage, data loss
Decision Context
Requirements
Must Have:
1. Workspace data persists across pod restarts
2. User configurations persist (settings, extensions, keybindings)
3. Git repositories persist
4. Session state persists (open files, editor state)
5. Per-user storage isolation

Should Have:
6. Predictable pod naming (for debugging)
7. Ordered deployment (for migrations)
8. Headless service for direct pod access
9. Storage size customization per tier

Nice to Have:
10. Volume snapshot/backup capability
11. Volume resize without downtime
12. Cross-zone replication for HA
Constraints
- GKE Cluster: us-central1-a (single zone)
- Storage Class: standard-rwo (GCE Persistent Disk)
- Budget: ~$0.10/GB/month for storage
- Users: Start with 10-20, scale to 500+
- Uptime: 99% target (allows ~7 hours downtime/month)
Original Architecture (Deployment)
Kubernetes Deployment Manifest
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coditect-combined-v5
  namespace: coditect-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: coditect-combined
  template:
    metadata:
      labels:
        app: coditect-combined
    spec:
      containers:
        - name: combined
          image: us-central1-docker.pkg.dev/.../coditect-combined:latest
          ports:
            - containerPort: 80
          # ❌ NO volumeMounts - ephemeral storage only
  # ❌ NO volumeClaimTemplates - shared pod identity
```
Problems with Deployment
| Problem | Impact | Severity |
|---|---|---|
| No persistent volumes | Data loss on pod restart | 🔴 CRITICAL |
| Shared pod identity | Can't map user → pod | 🔴 HIGH |
| Random pod naming | Debugging difficult | 🟡 MEDIUM |
| No storage isolation | Users share ephemeral disk | 🔴 HIGH |
| No pod affinity | User may get different pod | 🔴 HIGH |
Result: Deployment is suitable for stateless applications, NOT for cloud IDEs
New Architecture (StatefulSet)
StatefulSet with Persistent Volumes
```yaml
apiVersion: v1
kind: Service
metadata:
  name: theia-headless
  namespace: coditect-app
spec:
  clusterIP: None  # Headless service for StatefulSet
  ports:
    - port: 80
      targetPort: 80
  selector:
    app: coditect-combined
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: coditect-combined
  namespace: coditect-app
spec:
  serviceName: "theia-headless"
  replicas: 3
  podManagementPolicy: Parallel  # Start all pods at once
  selector:
    matchLabels:
      app: coditect-combined
  template:
    metadata:
      labels:
        app: coditect-combined
    spec:
      terminationGracePeriodSeconds: 120  # 2 minutes for graceful shutdown
      containers:
        - name: combined
          image: us-central1-docker.pkg.dev/.../coditect-combined:latest
          ports:
            - containerPort: 80
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          volumeMounts:
            - name: workspace
              mountPath: /workspace  # ✅ User files, git repos
            - name: theia-config
              mountPath: /home/theia/.theia  # ✅ IDE settings, extensions
  # ✅ Automatic PVC creation - one per pod
  volumeClaimTemplates:
    - metadata:
        name: workspace
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard-rwo
        resources:
          requests:
            storage: 100Gi  # 100 GB per user workspace
    - metadata:
        name: theia-config
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard-rwo
        resources:
          requests:
            storage: 10Gi  # 10 GB for IDE config
```
Key Differences from Deployment
| Feature | Deployment | StatefulSet | Benefit |
|---|---|---|---|
| Pod Naming | Random (combined-v5-abc123) | Ordinal (combined-0, combined-1, combined-2) | ✅ Predictable |
| Pod Identity | Shared | Unique per pod | ✅ User → pod mapping |
| Storage | Ephemeral | Persistent (PVC) | ✅ Data persists |
| Volume Lifecycle | Deleted with pod | Independent | ✅ Survives pod deletion |
| Headless Service | Optional | Required | ✅ Direct pod access |
| Deployment Order | Random | Sequential or parallel | ✅ Controlled |
| Pod Replacement | New name | Same name | ✅ Reconnects to PVC |
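The "predictable identity" rows can be made concrete: with a headless service, each StatefulSet pod gets a stable DNS name of the form `<pod>.<service>.<namespace>.svc.cluster.local`. A minimal Python sketch of the naming scheme, using the service and namespace names from the manifest above:

```python
def pod_name(statefulset: str, ordinal: int) -> str:
    """StatefulSet pods are named <statefulset>-<ordinal>."""
    return f"{statefulset}-{ordinal}"

def pod_dns(statefulset: str, ordinal: int,
            service: str = "theia-headless",
            namespace: str = "coditect-app") -> str:
    """Stable per-pod DNS name exposed via the headless service."""
    return (f"{pod_name(statefulset, ordinal)}."
            f"{service}.{namespace}.svc.cluster.local")

# Each replica is individually addressable, unlike Deployment pods:
for i in range(3):
    print(pod_dns("coditect-combined", i))
```

Because the name survives pod replacement, tools that map a user to "their" pod (routing, debugging, monitoring) can key off this DNS name rather than chasing random pod suffixes.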
Technical Implementation
Storage Architecture
┌─────────────────────────────────────────────────────┐
│ GKE Cluster (us-central1-a) │
│ │
│ ┌───────────────────────────────────────────────┐ │
│ │ StatefulSet: coditect-combined │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────┐ │ │
│ │ │ Pod: coditect-combined-0 │ │ │
│ │ │ ├─ Container: combined │ │ │
│ │ │ ├─ Volume: workspace (100 GB) │ │ │
│ │ │ │ └─ PVC: workspace-coditect-combined-0 │ │ │
│ │ │ └─ Volume: theia-config (10 GB) │ │ │
│ │ │ └─ PVC: theia-config-coditect-...-0 │ │ │
│ │ └──────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────┐ │ │
│ │ │ Pod: coditect-combined-1 │ │ │
│ │ │ ├─ Container: combined │ │ │
│ │ │ ├─ Volume: workspace (100 GB) │ │ │
│ │ │ │ └─ PVC: workspace-coditect-combined-1 │ │ │
│ │ │ └─ Volume: theia-config (10 GB) │ │ │
│ │ │ └─ PVC: theia-config-coditect-...-1 │ │ │
│ │ └──────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────┐ │ │
│ │ │ Pod: coditect-combined-2 │ │ │
│ │ │ ├─ Container: combined │ │ │
│ │ │ ├─ Volume: workspace (100 GB) │ │ │
│ │ │ │ └─ PVC: workspace-coditect-combined-2 │ │ │
│ │ │ └─ Volume: theia-config (10 GB) │ │ │
│ │ │ └─ PVC: theia-config-coditect-...-2 │ │ │
│ │ └──────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ GCE Persistent Disks (standard-rwo) │
│ ├─ pvc-workspace-combined-0 (100 GB) │
│ ├─ pvc-theia-config-combined-0 (10 GB) │
│ ├─ pvc-workspace-combined-1 (100 GB) │
│ ├─ pvc-theia-config-combined-1 (10 GB) │
│ ├─ pvc-workspace-combined-2 (100 GB) │
│ └─ pvc-theia-config-combined-2 (10 GB) │
└─────────────────────────────────────────────────────┘
Volume Mount Points
workspace Volume (/workspace):
- User files (Python, JavaScript, Go, etc.)
- Git repositories
- Build artifacts
- Downloaded packages (node_modules, venv)
- Size: 100 GB in the production manifest (the Starter tier provisions 50 GB; see Capacity Planning)
Theia Config Volume (/home/theia/.theia):
- IDE settings (settings.json)
- Installed extensions
- Extension settings
- Keybindings
- workspace layout
- Size: 10 GB (enough for 50+ extensions)
Pod Lifecycle with PVCs
First Pod Creation:
1. StatefulSet creates `coditect-combined-0`
2. Kubernetes creates PVCs: `workspace-coditect-combined-0` (100 GB) and `theia-config-coditect-combined-0` (10 GB)
3. GCE creates persistent disks
4. Kubernetes mounts the disks to the pod
5. Pod starts; volumes are empty (first time)

Pod Restart (logout, crash, upgrade):
1. StatefulSet deletes pod `coditect-combined-0`
2. PVCs remain (not deleted)
3. StatefulSet creates a new pod with the same name
4. Kubernetes mounts the same PVCs
5. Pod starts; data is intact

Pod Deletion (scale down, manual delete):
1. Pod is deleted
2. PVCs remain (orphaned, waiting for the pod to return)
3. If the pod is recreated with the same name → data restored
4. If a PVC is manually deleted → data lost (intentional cleanup)
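This lifecycle hinges on Kubernetes' deterministic PVC naming: every PVC created from a volumeClaimTemplate is named `<template-name>-<pod-name>`, which is exactly how a recreated pod finds its old volumes. A quick sketch of the mapping:

```python
def pvc_names(statefulset: str, ordinal: int,
              templates=("workspace", "theia-config")) -> list[str]:
    """PVCs from volumeClaimTemplates are named <template>-<statefulset>-<ordinal>."""
    pod = f"{statefulset}-{ordinal}"
    return [f"{template}-{pod}" for template in templates]

# Three replicas x two templates = the six PVCs listed in the diagram:
for ordinal in range(3):
    print(pvc_names("coditect-combined", ordinal))
```

Because the mapping is purely name-based, a replacement pod with the same ordinal automatically reattaches to the same disks with no extra bookkeeping.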
Capacity Planning
Starter Configuration (10-20 users)
```yaml
spec:
  replicas: 10  # Start with 10 pods
  volumeClaimTemplates:
    - metadata:
        name: workspace
      spec:
        resources:
          requests:
            storage: 50Gi  # 50 GB per user (smaller for Starter)
    - metadata:
        name: theia-config
      spec:
        resources:
          requests:
            storage: 5Gi  # 5 GB for config (smaller for Starter)
```
Resource Allocation:
- Pods: 10-30 (autoscaling based on demand)
- CPU: 4 vCPU per pod (40 vCPU total baseline)
- Memory: 8 GB per pod (80 GB total baseline)
- Storage: 55 GB per pod (550 GB total baseline)
Cost Estimate:
- Compute: 40 vCPU × $0.031/hour = $1.24/hour = ~$900/month
- Memory: 80 GB × $0.0042/GB/hour = $0.34/hour = ~$250/month
- Storage: 550 GB × $0.10/GB/month = ~$55/month
- Total: ~$1,205/month for 10-20 users = $60-120/user/month
Production Configuration (50-100 users)
```yaml
spec:
  replicas: 50  # 50 concurrent users
  volumeClaimTemplates:
    - metadata:
        name: workspace
      spec:
        resources:
          requests:
            storage: 100Gi  # 100 GB per user
    - metadata:
        name: theia-config
      spec:
        resources:
          requests:
            storage: 10Gi  # 10 GB for config
```
Resource Allocation:
- Pods: 50-100 (autoscaling 2x)
- CPU: 4 vCPU per pod (200 vCPU baseline)
- Memory: 8 GB per pod (400 GB baseline)
- Storage: 110 GB per pod (5,500 GB baseline)
Cost Estimate:
- Compute: 200 vCPU × $0.031/hour = $6.20/hour = ~$4,500/month
- Memory: 400 GB × $0.0042/GB/hour = $1.68/hour = ~$1,220/month
- Storage: 5,500 GB × $0.10/GB/month = ~$550/month
- Total: ~$6,270/month for 50-100 users = $63-125/user/month
Enterprise Configuration (500+ users)
```yaml
spec:
  replicas: 100  # 100 concurrent, 500+ total users
  volumeClaimTemplates:
    - metadata:
        name: workspace
      spec:
        resources:
          requests:
            storage: 200Gi  # 200 GB per active user
    - metadata:
        name: theia-config
      spec:
        resources:
          requests:
            storage: 20Gi  # 20 GB for extensive extensions
```
Resource Allocation:
- Pods: 100-200 (autoscaling 2x for peak)
- CPU: 8 vCPU per pod (800 vCPU baseline)
- Memory: 16 GB per pod (1,600 GB baseline)
- Storage: 220 GB per pod (22,000 GB baseline)
Cost Estimate:
- Compute: 800 vCPU × $0.031/hour = $24.80/hour = ~$18,000/month
- Memory: 1,600 GB × $0.0042/GB/hour = $6.72/hour = ~$4,880/month
- Storage: 22,000 GB × $0.10/GB/month = ~$2,200/month
- Total: ~$25,080/month for 500 users = $50/user/month
Cost Scaling: Larger deployments = better economics (50% cheaper per user at enterprise scale)
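All three estimates above follow the same formula: compute at $0.031/vCPU-hour plus memory at $0.0042/GB-hour (both over ~730 hours/month) plus storage at $0.10/GB-month. A sketch that reproduces the tier math (rates are the document's assumed GCE prices, not a live quote; totals differ slightly from the text because the text rounds each line item first):

```python
HOURS_PER_MONTH = 730  # ~24 * 365 / 12

def monthly_cost(pods: int, vcpu: int, ram_gb: int, disk_gb: int,
                 cpu_rate=0.031, ram_rate=0.0042, disk_rate=0.10) -> float:
    """Estimated monthly cost in USD for a tier's baseline footprint."""
    compute = pods * vcpu * cpu_rate * HOURS_PER_MONTH
    memory = pods * ram_gb * ram_rate * HOURS_PER_MONTH
    storage = pods * disk_gb * disk_rate
    return compute + memory + storage

starter = monthly_cost(10, 4, 8, 55)       # ≈ $1,205
production = monthly_cost(50, 4, 8, 110)   # ≈ $6,300 (~$6,270 in the text)
enterprise = monthly_cost(100, 8, 16, 220) # ≈ $25,200 (~$25,080 in the text)

print(round(starter), round(production), round(enterprise))
print("per user at 500 total users:", round(enterprise / 500))  # ≈ $50
```

The per-user figure improves with scale mainly because total users (500) outnumber concurrent pods (100), so fixed pod cost is amortized across more accounts.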
Migration Process
Step 1: Create StatefulSet Manifest
File: k8s/theia-statefulset.yaml
Created complete StatefulSet definition with:
- Headless service (`theia-headless`)
- StatefulSet (`coditect-combined`)
- Volume claim templates (workspace + config)
- Pod environment variables (POD_NAME, POD_NAMESPACE, POD_IP)
Commit: See git history for complete manifest
Step 2: Update Cloud Build Pipeline
File: cloudbuild-combined.yaml
Added deployment steps:
```yaml
# Step 4: Apply StatefulSet manifest (creates if doesn't exist)
- name: 'gcr.io/cloud-builders/kubectl'
  id: 'apply-statefulset'
  args:
    - 'apply'
    - '-f'
    - 'k8s/theia-statefulset.yaml'
  waitFor: ['push-build-id', 'push-latest']

# Step 5: Update StatefulSet image to newly built version
- name: 'gcr.io/cloud-builders/kubectl'
  id: 'deploy-gke'
  args:
    - 'set'
    - 'image'
    - 'statefulset/coditect-combined'
    - 'combined=...:$BUILD_ID'
  waitFor: ['apply-statefulset']
```
Idempotent approach: Apply creates OR updates, safe to run multiple times
Step 3: Deploy to GKE
Command:
```bash
kubectl apply -f k8s/theia-statefulset.yaml -n coditect-app
```
Result:
- Headless service created (`theia-headless`)
- StatefulSet created (`coditect-combined`)
- Pods created with ordinal names (`coditect-combined-0` through `coditect-combined-2`)
- PVCs created automatically (6 total: 3 workspace + 3 config)
- GCE persistent disks provisioned
- Pods mount disks and start
Verification:
```bash
# Check StatefulSet
kubectl get statefulset -n coditect-app

# Check pods
kubectl get pods -n coditect-app -l app=coditect-combined

# Check PVCs
kubectl get pvc -n coditect-app

# Check persistent volumes
kubectl get pv | grep coditect
```
Step 4: Test Data Persistence
Test procedure:
1. Connect to the pod: `kubectl exec -it coditect-combined-0 -n coditect-app -- bash`
2. Create a test file: `echo "test data" > /workspace/test.txt`
3. Create config: `echo '{"setting": "value"}' > /home/theia/.theia/settings.json`
4. Delete the pod: `kubectl delete pod coditect-combined-0 -n coditect-app`
5. Wait for recreation: `kubectl get pods -n coditect-app -w`
6. Verify the data: `kubectl exec -it coditect-combined-0 -n coditect-app -- cat /workspace/test.txt`

Expected: file exists, data intact ✅
Benefits Realized
User Experience Improvements
| Before (Deployment) | After (StatefulSet) | Impact |
|---|---|---|
| ❌ Data loss on logout | ✅ Data persists | 🎉 CRITICAL |
| ❌ Fresh environment every login | ✅ Environment restored | 🎉 HIGH |
| ❌ Can't trust platform for real work | ✅ Production-ready IDE | 🎉 HIGH |
| ❌ No git repo persistence | ✅ Git repos persist | 🎉 HIGH |
| ❌ Extension reinstall every session | ✅ Extensions persist | 🎉 MEDIUM |
Technical Benefits
1. Pod Identity: Predictable pod names enable:
   - User → pod assignment tracking
   - Debugging (logs, exec)
   - Pod-specific monitoring
   - Direct pod access via the headless service

2. Storage Isolation: Each user gets dedicated volumes:
   - No interference between users
   - Independent disk I/O performance
   - Secure data separation
   - Quota enforcement per user

3. Graceful Upgrades: StatefulSet enables:
   - Rolling updates (one pod at a time)
   - Ordered deployment
   - Health checks before proceeding
   - Rollback capability

4. High Availability: PVCs are independent of pods:
   - Pod crash → data survives
   - Pod reschedule → data survives
   - Node failure → data survives (GCE PD)
   - Zone failure → requires cross-zone replication
Operational Benefits

1. Backup/Restore: PVCs can be snapshotted via the Kubernetes VolumeSnapshot API (note: kubectl has no `create volumesnapshot` subcommand; a snapshot is declared as a resource and applied):

   ```yaml
   # workspace-backup.yaml - snapshot one user's workspace PVC
   apiVersion: snapshot.storage.k8s.io/v1
   kind: VolumeSnapshot
   metadata:
     name: workspace-backup
     namespace: coditect-app
   spec:
     source:
       persistentVolumeClaimName: workspace-coditect-combined-0
   ```

   ```bash
   # Create the snapshot
   kubectl apply -f workspace-backup.yaml
   # Restore by creating a PVC whose dataSource references the snapshot
   kubectl apply -f workspace-restore-pvc.yaml
   ```

2. Monitoring: Track storage metrics:
   - Disk usage per user
   - I/O throughput
   - Disk read/write latency
   - PVC provisioning time

3. Cost Optimization:
   - Identify unused (orphaned) PVCs
   - Delete PVCs for inactive users
   - Resize PVCs without downtime (GKE supports online expansion)
   - Use storage classes for different tiers (SSD vs HDD)
Cost Analysis
Storage Cost Breakdown
GCE Persistent Disk Pricing (us-central1):
- Standard (HDD): $0.040/GB/month
- SSD: $0.170/GB/month
- Snapshots: $0.026/GB/month
- Disk operations: Negligible
Example Cost (10 users, Standard HDD):
- 10 users × 110 GB (100 workspace + 10 config) = 1,100 GB
- 1,100 GB × $0.040/GB/month = $44/month
- Per user: $4.40/month
- Note: the deployed standard-rwo class runs ~$0.10/GB/month, so the same footprint costs ~$110/month ($11/user) until workspaces move to HDD-backed storage
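The breakdown above is straight multiplication; a small sketch at both of the document's rates (pd-standard HDD at $0.040/GB/month and the deployed standard-rwo class at ~$0.10/GB/month):

```python
def storage_cost(users: int, gb_per_user: int, rate_per_gb: float) -> float:
    """Monthly storage cost in USD for a fleet of user volumes."""
    return users * gb_per_user * rate_per_gb

hdd = storage_cost(10, 110, 0.040)  # $44/month → $4.40/user
rwo = storage_cost(10, 110, 0.10)   # $110/month → $11/user on standard-rwo
print(f"HDD: ${hdd:.2f}/month, standard-rwo: ${rwo:.2f}/month")
```

The 2.5× gap between the two rates is the main lever behind the storage-class-selection strategy below.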
Comparison to Competitors:
- Gitpod: ~$39/user/month (includes compute + 30 GB storage)
- GitHub Codespaces: ~$58/user/month (includes compute + 32 GB storage)
- Coditect (Starter): ~$60-120/user/month (includes compute + 55 GB storage on the Starter config; 110 GB on the production config)
Value Proposition: More storage (55-110 GB vs ~30 GB) at a competitive price
Cost Optimization Strategies
1. Tiered Storage:
   - Free tier: 10 GB workspace + 2 GB config
   - Starter tier: 50 GB workspace + 5 GB config
   - Pro tier: 100 GB workspace + 10 GB config
   - Enterprise tier: 200 GB workspace + 20 GB config

2. Auto-cleanup:
   - Delete PVCs for users inactive >90 days
   - Snapshot before deletion (restore on request)
   - Compress snapshots for long-term storage

3. Storage Class Selection:
   - Standard (HDD): user files, git repos (bulk storage)
   - SSD: IDE config, extensions (fast access)
   - Saves ~75% on bulk storage ($0.040 vs $0.170/GB/month)

4. Resize on Demand:
   - Start with a small PVC (10 GB)
   - Expand when the user needs more (GKE supports online resize)
   - Shrinking requires migration (manual process)
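The tier sizes in strategy 1 translate directly into per-user storage cost. A sketch at the HDD rate from the pricing table (tier sizes from this section; the rate is the document's assumed GCE price):

```python
# (workspace GB, config GB) per tier, from the tiered-storage strategy above
TIERS = {
    "free":       (10, 2),
    "starter":    (50, 5),
    "pro":        (100, 10),
    "enterprise": (200, 20),
}
HDD_RATE = 0.040  # $/GB/month, GCE pd-standard

def per_user_storage_cost(tier: str) -> float:
    """Monthly storage cost per user for a given tier at the HDD rate."""
    workspace_gb, config_gb = TIERS[tier]
    return (workspace_gb + config_gb) * HDD_RATE

for tier in TIERS:
    print(f"{tier:<10} ${per_user_storage_cost(tier):.2f}/user/month")
```

Pro and Enterprise land at $4.40 and $8.80 respectively, which is where the "~$4-8/user/month for storage" figure in the conclusion comes from.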
Lessons Learned
What Worked Well ✅
- StatefulSet for Persistence: Perfect fit for cloud IDEs
- Volume Claim Templates: Automatic PVC creation, no manual provisioning
- Headless Service: Enables direct pod access, useful for debugging
- GCE Persistent Disk: Fast provisioning (<1 minute), reliable
- Parallel Pod Management: Faster startup than sequential
Challenges Encountered ⚠️
- PVC Lifecycle Management: PVCs survive StatefulSet deletion (intentional, but surprising)
- Cross-Zone HA: GCE PD is zonal, need replication for zone failures
- PVC Resize: Easy to expand, hard to shrink (requires manual migration)
- Orphaned PVCs: Need monitoring to detect unused PVCs
- Cost Tracking: Hard to attribute storage cost per user without custom metrics
Best Practices Established
1. Label PVCs: Add user labels for cost attribution

   ```yaml
   metadata:
     labels:
       user-id: "user-123"
       tier: "starter"
   ```

2. Set Resource Limits: Prevent runaway storage usage

   ```yaml
   resources:
     limits:
       ephemeral-storage: "10Gi"  # Limit container layer growth
   ```

3. Monitor Disk Usage: Alert when >80% full

   ```bash
   kubectl exec coditect-combined-0 -- df -h /workspace
   ```

4. Backup Strategy: Weekly snapshots, retain 4 weeks

   ```bash
   # Automated backup via CronJob
   kubectl create -f workspace-backup-cronjob.yaml
   ```

5. Graceful Shutdown: Allow 2 minutes for file sync

   ```yaml
   terminationGracePeriodSeconds: 120
   ```
Related Documentation
Architecture Decisions:
- ADR-028: Theia IDE Integration - IDE foundation
- ADR-029: This document - Infrastructure evolution
Implementation Details:
- Build #17 Deployment Ready - Current deployment status
- Theia GKE Scaling Research - Scaling from 1 to 10k users
- Socket.IO & Persistence Insights - Session persistence fixes
Kubernetes Manifests:
- StatefulSet YAML - Production manifest
- Starter Configuration - 10-20 users
Conclusion
The migration from Kubernetes Deployment to StatefulSet with persistent volumes was essential for Coditect AI IDE to be production-ready. It enabled:
- 100% data persistence across pod restarts
- User trust in platform (no more data loss)
- Competitive parity with Gitpod, GitHub Codespaces, VS Code Server
- Predictable pod identity for debugging and monitoring
- Storage isolation for security and performance
- Tiered storage options for different user segments
Cost: ~$4-8/user/month for storage (competitive with alternatives)
Benefit: Transforms Coditect from toy demo into a production cloud IDE
Recommendation: Continue StatefulSet approach, implement tiered storage, add cross-zone replication for HA.
Document Status: ✅ COMPLETE
Last Updated: 2025-10-27T06:12:03Z
Author: Coditect AI Infrastructure Team
Reviewers: Technical Leadership