
StatefulSet with Persistent Storage: Infrastructure Evolution

Document: ADR-029
Date: 2025-10-27T06:12:03Z
Status: ✅ IMPLEMENTED
Category: Infrastructure / Kubernetes


Executive Summary

Coditect AI IDE migrated from a Kubernetes Deployment to a StatefulSet with persistent volumes to solve a critical data loss problem: users were losing all code and files on logout or session timeout because pods had no persistent storage.

Key Achievement: 100% workspace persistence across pod restarts, enabling a true cloud IDE experience comparable to VS Code Server, Gitpod, and GitHub Codespaces.


Table of Contents

  1. The Problem: Data Loss
  2. Decision Context
  3. Original Architecture (Deployment)
  4. New Architecture (StatefulSet)
  5. Technical Implementation
  6. Capacity Planning
  7. Migration Process
  8. Benefits Realized
  9. Cost Analysis
  10. Lessons Learned

The Problem: Data Loss

User Experience (Before StatefulSet)

  1. User logs in to the Theia IDE at https://coditect.ai/theia
  2. User creates files, installs extensions, configures workspace
  3. User logs out or session times out
  4. ALL DATA LOST - pod terminates, ephemeral storage destroyed
  5. User logs back in - fresh environment, all work gone

Impact: The platform was unusable for real development work because users could not trust it to keep their files.

Root Cause

Pods managed by a Kubernetes Deployment use only ephemeral storage unless volumes are explicitly attached:

  • No persistent volumes attached
  • Pod filesystem lives in container's writable layer
  • Pod deletion = data deletion
  • Every logout = potential data loss

Comparison to Competitors:

  • VS Code Server: Persistent workspace via SSH
  • Gitpod: Persistent workspace via PVC
  • GitHub Codespaces: Persistent workspace via Azure Files
  • Coditect (before migration): Ephemeral storage, data loss

Decision Context

Requirements

Must Have:

  1. Workspace data persists across pod restarts
  2. User configurations persist (settings, extensions, keybindings)
  3. Git repositories persist
  4. Session state persists (open files, editor state)
  5. Per-user storage isolation

Should Have:

  6. Predictable pod naming (for debugging)
  7. Ordered deployment (for migrations)
  8. Headless service for direct pod access
  9. Storage size customization per tier

Nice to Have:

  10. Volume snapshot/backup capability
  11. Volume resize without downtime
  12. Cross-zone replication for HA

Constraints

  1. GKE Cluster: us-central1-a (single zone)
  2. Storage Class: standard-rwo (GCE Persistent Disk)
  3. Budget: ~$0.10/GB/month for storage
  4. Users: Start with 10-20, scale to 500+
  5. Uptime: 99% target (allows ~7 hours downtime/month)
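
The downtime allowance quoted for the uptime target is plain arithmetic. A minimal sketch, assuming an average month of 730 hours (that figure and the function name are illustrative assumptions, not taken from the original):

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12


def allowed_downtime_hours(uptime_target: float, hours: float = HOURS_PER_MONTH) -> float:
    """Return the monthly downtime budget, in hours, for an uptime fraction."""
    return (1.0 - uptime_target) * hours


# 99% uptime leaves roughly 7 hours of downtime per month
print(f"{allowed_downtime_hours(0.99):.1f} hours/month")
```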

Original Architecture (Deployment)

Kubernetes Deployment Manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  name: coditect-combined-v5
  namespace: coditect-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: coditect-combined
  template:
    metadata:
      labels:
        app: coditect-combined
    spec:
      containers:
        - name: combined
          image: us-central1-docker.pkg.dev/.../coditect-combined:latest
          ports:
            - containerPort: 80
          # ❌ NO volumeMounts - ephemeral storage only
  # ❌ NO volumeClaimTemplates - shared pod identity

Problems with Deployment

| Problem | Impact | Severity |
|---|---|---|
| No persistent volumes | Data loss on pod restart | 🔴 CRITICAL |
| Shared pod identity | Can't map user → pod | 🔴 HIGH |
| Random pod naming | Debugging difficult | 🟡 MEDIUM |
| No storage isolation | Users share ephemeral disk | 🔴 HIGH |
| No pod affinity | User may get different pod | 🔴 HIGH |

Result: Deployment is suitable for stateless applications, NOT for cloud IDEs


New Architecture (StatefulSet)

StatefulSet with Persistent Volumes

apiVersion: v1
kind: Service
metadata:
  name: theia-headless
  namespace: coditect-app
spec:
  clusterIP: None # Headless service for StatefulSet
  ports:
    - port: 80
      targetPort: 80
  selector:
    app: coditect-combined
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: coditect-combined
  namespace: coditect-app
spec:
  serviceName: "theia-headless"
  replicas: 3
  podManagementPolicy: Parallel # Start all pods at once
  selector:
    matchLabels:
      app: coditect-combined
  template:
    metadata:
      labels:
        app: coditect-combined
    spec:
      terminationGracePeriodSeconds: 120 # 2 minutes for graceful shutdown
      containers:
        - name: combined
          image: us-central1-docker.pkg.dev/.../coditect-combined:latest
          ports:
            - containerPort: 80
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          volumeMounts:
            - name: workspace
              mountPath: /workspace # ✅ User files, git repos
            - name: theia-config
              mountPath: /home/theia/.theia # ✅ IDE settings, extensions

  # ✅ Automatic PVC creation - one per pod
  volumeClaimTemplates:
    - metadata:
        name: workspace
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: standard-rwo
        resources:
          requests:
            storage: 100Gi # 100 GB per user workspace
    - metadata:
        name: theia-config
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: standard-rwo
        resources:
          requests:
            storage: 10Gi # 10 GB for IDE config

Key Differences from Deployment

| Feature | Deployment | StatefulSet | Benefit |
|---|---|---|---|
| Pod Naming | Random (coditect-combined-v5-abc123) | Ordinal (coditect-combined-0, coditect-combined-1, coditect-combined-2) | ✅ Predictable |
| Pod Identity | Shared | Unique per pod | ✅ User → pod mapping |
| Storage | Ephemeral | Persistent (PVC) | ✅ Data persists |
| Volume Lifecycle | Deleted with pod | Independent | ✅ Survives pod deletion |
| Headless Service | Optional | Required | ✅ Direct pod access |
| Deployment Order | Random | Sequential or parallel | ✅ Controlled |
| Pod Replacement | New name | Same name | ✅ Reconnects to PVC |
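
Direct pod access works because a headless Service gives every StatefulSet pod a stable DNS name of the form `<pod>.<service>.<namespace>.svc.<cluster-domain>`. The sketch below derives those names for the manifest above. It assumes the default cluster domain `cluster.local`, and the helper name `pod_dns_names` is hypothetical:

```python
def pod_dns_names(statefulset: str, service: str, namespace: str,
                  replicas: int, cluster_domain: str = "cluster.local") -> list[str]:
    """Build the stable per-pod DNS names exposed by a headless Service."""
    return [
        f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


# Values taken from the StatefulSet and Service definitions above
for name in pod_dns_names("coditect-combined", "theia-headless", "coditect-app", 3):
    print(name)
```

Because the name depends only on the ordinal, a replacement pod answers at the same address as the pod it replaces.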

Technical Implementation

Storage Architecture

┌───────────────────────────────────────────────────────┐
│ GKE Cluster (us-central1-a)                           │
│                                                       │
│ ┌───────────────────────────────────────────────────┐ │
│ │ StatefulSet: coditect-combined                    │ │
│ │                                                   │ │
│ │ ┌───────────────────────────────────────────────┐ │ │
│ │ │ Pod: coditect-combined-0                      │ │ │
│ │ │ ├─ Container: combined                        │ │ │
│ │ │ ├─ Volume: workspace (100 GB)                 │ │ │
│ │ │ │  └─ PVC: workspace-coditect-combined-0      │ │ │
│ │ │ └─ Volume: theia-config (10 GB)               │ │ │
│ │ │    └─ PVC: theia-config-coditect-...-0        │ │ │
│ │ └───────────────────────────────────────────────┘ │ │
│ │                                                   │ │
│ │ ┌───────────────────────────────────────────────┐ │ │
│ │ │ Pod: coditect-combined-1                      │ │ │
│ │ │ ├─ Container: combined                        │ │ │
│ │ │ ├─ Volume: workspace (100 GB)                 │ │ │
│ │ │ │  └─ PVC: workspace-coditect-combined-1      │ │ │
│ │ │ └─ Volume: theia-config (10 GB)               │ │ │
│ │ │    └─ PVC: theia-config-coditect-...-1        │ │ │
│ │ └───────────────────────────────────────────────┘ │ │
│ │                                                   │ │
│ │ ┌───────────────────────────────────────────────┐ │ │
│ │ │ Pod: coditect-combined-2                      │ │ │
│ │ │ ├─ Container: combined                        │ │ │
│ │ │ ├─ Volume: workspace (100 GB)                 │ │ │
│ │ │ │  └─ PVC: workspace-coditect-combined-2      │ │ │
│ │ │ └─ Volume: theia-config (10 GB)               │ │ │
│ │ │    └─ PVC: theia-config-coditect-...-2        │ │ │
│ │ └───────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────┘


┌───────────────────────────────────────────────────────┐
│ GCE Persistent Disks (standard-rwo)                   │
│ ├─ pvc-workspace-combined-0 (100 GB)                  │
│ ├─ pvc-theia-config-combined-0 (10 GB)                │
│ ├─ pvc-workspace-combined-1 (100 GB)                  │
│ ├─ pvc-theia-config-combined-1 (10 GB)                │
│ ├─ pvc-workspace-combined-2 (100 GB)                  │
│ └─ pvc-theia-config-combined-2 (10 GB)                │
└───────────────────────────────────────────────────────┘

Volume Mount Points

Workspace Volume (/workspace):

  • User files (Python, JavaScript, Go, etc.)
  • Git repositories
  • Build artifacts
  • Downloaded packages (node_modules, venv)
  • Size: 100 GB in the manifest above (the Starter tier uses 50 GB)

Theia Config Volume (/home/theia/.theia):

  • IDE settings (settings.json)
  • Installed extensions
  • Extension settings
  • Keybindings
  • Workspace layout
  • Size: 10 GB (enough for 50+ extensions)

Pod Lifecycle with PVCs

First Pod Creation:

  1. StatefulSet creates coditect-combined-0
  2. Kubernetes creates PVCs:
    • workspace-coditect-combined-0 (100 GB)
    • theia-config-coditect-combined-0 (10 GB)
  3. GCE creates persistent disks
  4. Kubernetes mounts disks to pod
  5. Pod starts, volumes are empty (first time)

Pod Restart (logout, crash, upgrade):

  1. StatefulSet deletes pod coditect-combined-0
  2. PVCs remain (not deleted)
  3. StatefulSet creates new pod with same name
  4. Kubernetes mounts same PVCs
  5. Pod starts, data is intact

Pod Deletion (scale down, manual delete):

  1. Pod deleted
  2. PVCs remain (orphaned, waiting for pod to return)
  3. If pod recreated with same name → data restored
  4. If PVC manually deleted → data lost (intentional cleanup)
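
A recreated pod finds its data again because PVC names are deterministic: each claim is named `<template>-<statefulset>-<ordinal>`. A short sketch that enumerates the claims a StatefulSet will own (the helper name `expected_pvc_names` is hypothetical):

```python
def expected_pvc_names(templates: list[str], statefulset: str, replicas: int) -> list[str]:
    """List PVC names as <template>-<statefulset>-<ordinal>, one per template per pod."""
    return [
        f"{template}-{statefulset}-{ordinal}"
        for ordinal in range(replicas)
        for template in templates
    ]


# Template names and replica count from the StatefulSet manifest above
names = expected_pvc_names(["workspace", "theia-config"], "coditect-combined", 3)
for name in names:
    print(name)
print(len(names), "PVCs in total")
```

With 3 replicas and 2 claim templates this yields the 6 claims shown in the storage diagram.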

Capacity Planning

Starter Configuration (10-20 users)

spec:
  replicas: 10 # Start with 10 pods
  volumeClaimTemplates:
    - metadata:
        name: workspace
      spec:
        resources:
          requests:
            storage: 50Gi # 50 GB per user (smaller for Starter)
    - metadata:
        name: theia-config
      spec:
        resources:
          requests:
            storage: 5Gi # 5 GB for config (smaller for Starter)

Resource Allocation:

  • Pods: 10-30 (autoscaling based on demand)
  • CPU: 4 vCPU per pod (40 vCPU total baseline)
  • Memory: 8 GB per pod (80 GB total baseline)
  • Storage: 55 GB per pod (550 GB total baseline)

Cost Estimate:

  • Compute: 40 vCPU × $0.031/hour = $1.24/hour = ~$900/month
  • Memory: 80 GB × $0.0042/GB/hour = $0.34/hour = ~$250/month
  • Storage: 550 GB × $0.10/GB/month = ~$55/month
  • Total: ~$1,205/month for 10-20 users = $60-120/user/month

Production Configuration (50-100 users)

spec:
  replicas: 50 # 50 concurrent users
  volumeClaimTemplates:
    - metadata:
        name: workspace
      spec:
        resources:
          requests:
            storage: 100Gi # 100 GB per user
    - metadata:
        name: theia-config
      spec:
        resources:
          requests:
            storage: 10Gi # 10 GB for config

Resource Allocation:

  • Pods: 50-100 (autoscaling 2x)
  • CPU: 4 vCPU per pod (200 vCPU baseline)
  • Memory: 8 GB per pod (400 GB baseline)
  • Storage: 110 GB per pod (5,500 GB baseline)

Cost Estimate:

  • Compute: 200 vCPU × $0.031/hour = $6.20/hour = ~$4,500/month
  • Memory: 400 GB × $0.0042/GB/hour = $1.68/hour = ~$1,220/month
  • Storage: 5,500 GB × $0.10/GB/month = ~$550/month
  • Total: ~$6,270/month for 50-100 users = $63-125/user/month

Enterprise Configuration (500+ users)

spec:
  replicas: 100 # 100 concurrent, 500+ total users
  volumeClaimTemplates:
    - metadata:
        name: workspace
      spec:
        resources:
          requests:
            storage: 200Gi # 200 GB per active user
    - metadata:
        name: theia-config
      spec:
        resources:
          requests:
            storage: 20Gi # 20 GB for extensive extensions

Resource Allocation:

  • Pods: 100-200 (autoscaling 2x for peak)
  • CPU: 8 vCPU per pod (800 vCPU baseline)
  • Memory: 16 GB per pod (1,600 GB baseline)
  • Storage: 220 GB per pod (22,000 GB baseline)

Cost Estimate:

  • Compute: 800 vCPU × $0.031/hour = $24.80/hour = ~$18,000/month
  • Memory: 1,600 GB × $0.0042/GB/hour = $6.72/hour = ~$4,880/month
  • Storage: 22,000 GB × $0.10/GB/month = ~$2,200/month
  • Total: ~$25,080/month for 500 users = $50/user/month

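Cost Scaling: Per-user cost falls with scale: ~$50/user/month at the Enterprise tier versus $60-125/user/month at the Starter and Production tiers, because 500 registered users share 100 concurrently running pods.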
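
The three estimates can be reproduced with a small calculator. This is a sketch rather than a billing tool: the unit prices are the ones quoted in this section, the 730-hour month is an assumption, and the `Tier` class and `monthly_cost` function are illustrative names. Totals therefore differ slightly from the rounded figures above.

```python
from dataclasses import dataclass

VCPU_USD_PER_HOUR = 0.031          # from the estimates in this section
MEMORY_USD_PER_GB_HOUR = 0.0042    # from the estimates in this section
STORAGE_USD_PER_GB_MONTH = 0.10    # budgeted storage rate
HOURS_PER_MONTH = 730              # assumption: average month


@dataclass(frozen=True)
class Tier:
    name: str
    pods: int
    vcpu_per_pod: int
    memory_gb_per_pod: int
    storage_gb_per_pod: int
    users: int  # user count used for the per-user figure


def monthly_cost(tier: Tier) -> float:
    """Baseline monthly cost in USD: compute + memory + storage."""
    compute = tier.pods * tier.vcpu_per_pod * VCPU_USD_PER_HOUR * HOURS_PER_MONTH
    memory = tier.pods * tier.memory_gb_per_pod * MEMORY_USD_PER_GB_HOUR * HOURS_PER_MONTH
    storage = tier.pods * tier.storage_gb_per_pod * STORAGE_USD_PER_GB_MONTH
    return compute + memory + storage


TIERS = [
    Tier("Starter", pods=10, vcpu_per_pod=4, memory_gb_per_pod=8, storage_gb_per_pod=55, users=20),
    Tier("Production", pods=50, vcpu_per_pod=4, memory_gb_per_pod=8, storage_gb_per_pod=110, users=100),
    Tier("Enterprise", pods=100, vcpu_per_pod=8, memory_gb_per_pod=16, storage_gb_per_pod=220, users=500),
]

for tier in TIERS:
    total = monthly_cost(tier)
    print(f"{tier.name:<11} ${total:>10,.2f}/month  ${total / tier.users:>7.2f}/user")
```

The per-user column uses the upper user count of each tier, so it corresponds to the low end of each quoted per-user range.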


Migration Process

Step 1: Create StatefulSet Manifest

File: k8s/theia-statefulset.yaml

Created complete StatefulSet definition with:

  • Headless service (theia-headless)
  • StatefulSet (coditect-combined)
  • Volume claim templates (workspace + config)
  • Pod environment variables (POD_NAME, POD_NAMESPACE, POD_IP)

Commit: See git history for complete manifest

Step 2: Update Cloud Build Pipeline

File: cloudbuild-combined.yaml

Added deployment steps:

# Step 4: Apply StatefulSet manifest (creates it if it doesn't exist)
- name: 'gcr.io/cloud-builders/kubectl'
  id: 'apply-statefulset'
  args:
    - 'apply'
    - '-f'
    - 'k8s/theia-statefulset.yaml'
  waitFor: ['push-build-id', 'push-latest']

# Step 5: Update StatefulSet image to newly built version
- name: 'gcr.io/cloud-builders/kubectl'
  id: 'deploy-gke'
  args:
    - 'set'
    - 'image'
    - 'statefulset/coditect-combined'
    - 'combined=...:$BUILD_ID'
    - '--namespace=coditect-app' # Target the StatefulSet's namespace explicitly
  waitFor: ['apply-statefulset']

Idempotent approach: kubectl apply creates the resources if they are missing and updates them otherwise, so the step is safe to run multiple times

Step 3: Deploy to GKE

Command:

kubectl apply -f k8s/theia-statefulset.yaml -n coditect-app

Result:

  1. Headless service created (theia-headless)
  2. StatefulSet created (coditect-combined)
  3. Pods created with ordinal names (coditect-combined-0, coditect-combined-1, coditect-combined-2)
  4. PVCs created automatically (6 total: 3 workspace + 3 config)
  5. GCE persistent disks provisioned
  6. Pods mount disks and start

Verification:

# Check StatefulSet
kubectl get statefulset -n coditect-app

# Check pods
kubectl get pods -n coditect-app -l app=coditect-combined

# Check PVCs
kubectl get pvc -n coditect-app

# Check persistent volumes
kubectl get pv | grep coditect

Step 4: Test Data Persistence

Test procedure:

  1. Connect to pod: kubectl exec -it coditect-combined-0 -n coditect-app -- bash
  2. Create test file: echo "test data" > /workspace/test.txt
  3. Create config: echo '{"setting": "value"}' > /home/theia/.theia/settings.json
  4. Delete pod: kubectl delete pod coditect-combined-0 -n coditect-app
  5. Wait for recreation: kubectl get pods -n coditect-app -w
  6. Verify data: kubectl exec -it coditect-combined-0 -n coditect-app -- cat /workspace/test.txt
  7. Expected: File exists, data intact ✅
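
The manual procedure above can be scripted. The following sketch is an assumption-laden illustration: the helper names are hypothetical, it shells out to `kubectl` using the current context, and it polls for the recreated pod rather than watching. The command runner is injectable so the logic can be exercised without a cluster.

```python
import subprocess
import time
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], str]


def kubectl(args: Sequence[str]) -> str:
    """Run kubectl with the given arguments and return its stdout."""
    result = subprocess.run(["kubectl", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout


def wait_until_ready(run: Runner, pod: str, namespace: str,
                     attempts: int = 60, delay: float = 5.0) -> bool:
    """Poll until the pod reports the Ready condition, or give up."""
    query = ["get", "pod", pod, "-n", namespace, "-o",
             'jsonpath={.status.conditions[?(@.type=="Ready")].status}']
    for _ in range(attempts):
        try:
            if run(query).strip() == "True":
                return True
        except subprocess.CalledProcessError:
            pass  # the replacement pod may not exist yet
        time.sleep(delay)
    return False


def check_persistence(run: Runner = kubectl, pod: str = "coditect-combined-0",
                      namespace: str = "coditect-app", delay: float = 5.0) -> bool:
    """Write a marker file, delete the pod, and confirm the file survives."""
    marker, path = "test data", "/workspace/test.txt"
    run(["exec", pod, "-n", namespace, "--", "sh", "-c",
         f"echo '{marker}' > {path}"])
    run(["delete", "pod", pod, "-n", namespace])
    if not wait_until_ready(run, pod, namespace, delay=delay):
        return False
    return run(["exec", pod, "-n", namespace, "--", "cat", path]).strip() == marker
```

Calling `check_persistence()` with no arguments runs the check against whatever cluster the current kubectl context points to, and returns `True` when the file written before the deletion is still readable afterwards.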

Benefits Realized

User Experience Improvements

| Before (Deployment) | After (StatefulSet) | Impact |
|---|---|---|
| ❌ Data loss on logout | ✅ Data persists | 🎉 CRITICAL |
| ❌ Fresh environment every login | ✅ Environment restored | 🎉 HIGH |
| ❌ Can't trust platform for real work | ✅ Production-ready IDE | 🎉 HIGH |
| ❌ No git repo persistence | ✅ Git repos persist | 🎉 HIGH |
| ❌ Extension reinstall every session | ✅ Extensions persist | 🎉 MEDIUM |

Technical Benefits

  1. Pod Identity: Predictable pod names enable:

    • User → pod assignment tracking
    • Debugging (logs, exec)
    • Pod-specific monitoring
    • Direct pod access via headless service
  2. Storage Isolation: Each user gets dedicated volumes:

    • No interference between users
    • Independent disk I/O performance
    • Secure data separation
    • Quota enforcement per user
  3. Graceful Upgrades: StatefulSet enables:

    • Rolling updates (one pod at a time)
    • Ordered deployment
    • Health checks before proceeding
    • Rollback capability
  4. High Availability: PVCs independent of pods:

    • Pod crash → data survives
    • Pod reschedule → data survives
    • Node failure → data survives (GCE PD)
    • Zone failure → need cross-zone replication

Operational Benefits

  1. Backup/Restore: PVCs can be snapshotted:

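    # Create snapshot: VolumeSnapshot is a custom resource, so it is applied from a
    # manifest (kind: VolumeSnapshot with spec.source.persistentVolumeClaimName set to
    # workspace-coditect-combined-0). The file name below is illustrative.
    kubectl apply -f workspace-snapshot.yaml -n coditect-app

    # Restore from snapshot: a new PVC whose spec.dataSource references the snapshot
    kubectl apply -f workspace-restore-pvc.yaml -n coditect-app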
  2. Monitoring: Track storage metrics:

    • Disk usage per user
    • I/O throughput
    • Disk read/write latency
    • PVC provisioning time
  3. Cost Optimization:

    • Identify unused PVCs (orphaned)
    • Delete PVCs for inactive users
    • Expand PVCs without downtime (GKE supports online volume expansion; shrinking is not supported)
    • Use storage classes for different tiers (SSD vs HDD)
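
Snapshots of a PVC are requested through a `VolumeSnapshot` object (API group `snapshot.storage.k8s.io/v1`), which is created from a manifest. The sketch below builds such a manifest as JSON, which `kubectl apply -f` accepts alongside YAML. It assumes the cluster has the CSI snapshot CRDs and a VolumeSnapshotClass installed; the function name and the snapshot name `workspace-backup` are illustrative.

```python
import json
from typing import Optional


def volume_snapshot(name: str, pvc: str, namespace: str,
                    snapshot_class: Optional[str] = None) -> dict:
    """Build a VolumeSnapshot manifest that snapshots the given PVC."""
    spec: dict = {"source": {"persistentVolumeClaimName": pvc}}
    if snapshot_class is not None:
        # Omit this field to use the cluster's default VolumeSnapshotClass
        spec["volumeSnapshotClassName"] = snapshot_class
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


manifest = volume_snapshot("workspace-backup", "workspace-coditect-combined-0", "coditect-app")
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and applying it with `kubectl apply -f` requests the snapshot. Restoring means creating a new PVC whose `spec.dataSource` points at it.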

Cost Analysis

Storage Cost Breakdown

GCE Persistent Disk Pricing (us-central1):

  • Standard (HDD): $0.040/GB/month
  • SSD: $0.170/GB/month
  • Snapshots: $0.026/GB/month
  • Disk operations: Negligible

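Example Cost (10 users, priced at the Standard HDD rate):

  • 10 users × 110 GB (100 workspace + 10 config) = 1,100 GB
  • 1,100 GB × $0.040/GB/month = $44/month
  • Per user: $4.40/month
  • At the $0.10/GB/month rate budgeted in Capacity Planning, the same 1,100 GB costs $110/month ($11.00 per user)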

Comparison to Competitors:

  • Gitpod: ~$39/user/month (includes compute + 30 GB storage)
  • GitHub Codespaces: ~$58/user/month (includes compute + 32 GB storage)
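  • Coditect (Starter): ~$60-120/user/month (includes compute + 55 GB storage)

Value Proposition: More storage per workspace (55 GB at Starter versus 30-32 GB) at a per-user price that becomes competitive at higher utilization (~$60 with 20 Starter users, ~$50 at Enterprise scale)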

Cost Optimization Strategies

  1. Tiered Storage:

    • Free tier: 10 GB workspace + 2 GB config
    • Starter tier: 50 GB workspace + 5 GB config
    • Pro tier: 100 GB workspace + 10 GB config
    • Enterprise tier: 200 GB workspace + 20 GB config
  2. Auto-cleanup:

    • Delete PVCs for users inactive >90 days
    • Snapshot before deletion (restore on request)
    • Compress snapshots for long-term storage
  3. Storage Class Selection:

    • Standard (HDD): User files, git repos (bulk storage)
    • SSD: IDE config, extensions (fast access)
    • Saves 75% on bulk storage
  4. Resize on Demand:

    • Start with small PVC (10 GB)
    • Expand when user needs more (GKE supports online resize)
    • Shrink requires migration (manual process)
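
The effect of combining tiered sizes with storage class selection can be estimated from the list prices quoted under Storage Cost Breakdown. A rough sketch, assuming workspace volumes sit on Standard (HDD) disks and config volumes on SSD as strategy 3 suggests; the function names are illustrative:

```python
# (workspace GB, config GB) per tier, from the Tiered Storage list above
TIERS = {
    "Free": (10, 2),
    "Starter": (50, 5),
    "Pro": (100, 10),
    "Enterprise": (200, 20),
}
HDD_USD_PER_GB_MONTH = 0.040  # Standard (HDD) list price quoted above
SSD_USD_PER_GB_MONTH = 0.170  # SSD list price quoted above


def mixed_cost(tier: str) -> float:
    """Monthly storage cost with workspace on HDD and config on SSD."""
    workspace_gb, config_gb = TIERS[tier]
    return workspace_gb * HDD_USD_PER_GB_MONTH + config_gb * SSD_USD_PER_GB_MONTH


def all_ssd_cost(tier: str) -> float:
    """Monthly storage cost if both volumes were placed on SSD."""
    workspace_gb, config_gb = TIERS[tier]
    return (workspace_gb + config_gb) * SSD_USD_PER_GB_MONTH


for tier in TIERS:
    print(f"{tier:<10} mixed ${mixed_cost(tier):6.2f}  all-SSD ${all_ssd_cost(tier):6.2f}")

# Per-GB saving of HDD over SSD for bulk storage
bulk_saving = 1 - HDD_USD_PER_GB_MONTH / SSD_USD_PER_GB_MONTH
print(f"HDD saves {bulk_saving:.0%} per GB versus SSD")
```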

Lessons Learned

What Worked Well ✅

  1. StatefulSet for Persistence: Perfect fit for cloud IDEs
  2. Volume Claim Templates: Automatic PVC creation, no manual provisioning
  3. Headless Service: Enables direct pod access, useful for debugging
  4. GCE Persistent Disk: Fast provisioning (<1 minute), reliable
  5. Parallel Pod Management: Faster startup than sequential

Challenges Encountered ⚠️

  1. PVC Lifecycle Management: PVCs survive StatefulSet deletion (intentional, but surprising)
  2. Cross-Zone HA: GCE PD is zonal, need replication for zone failures
  3. PVC Resize: Easy to expand, hard to shrink (requires manual migration)
  4. Orphaned PVCs: Need monitoring to detect unused PVCs
  5. Cost Tracking: Hard to attribute storage cost per user without custom metrics
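
Orphaned PVCs can be detected by comparing claim names against running pods, since every claim name ends in the name of the pod that owns it. A minimal sketch, assuming the name lists come from `kubectl get pvc` and `kubectl get pods` (the function name is hypothetical):

```python
import re


def orphaned_pvcs(pvc_names: list[str], pod_names: list[str], statefulset: str) -> list[str]:
    """Return PVCs of the StatefulSet whose owning pod is not currently present."""
    pattern = re.compile(rf"^.+-{re.escape(statefulset)}-(?P<ordinal>\d+)$")
    running = set(pod_names)
    orphans = []
    for pvc in pvc_names:
        match = pattern.match(pvc)
        if match and f"{statefulset}-{match['ordinal']}" not in running:
            orphans.append(pvc)
    return orphans


pvcs = [
    "workspace-coditect-combined-0",
    "theia-config-coditect-combined-0",
    "workspace-coditect-combined-5",     # left behind after a scale-down
    "theia-config-coditect-combined-5",
]
pods = ["coditect-combined-0"]
print(orphaned_pvcs(pvcs, pods, "coditect-combined"))
```

A claim flagged here is only a candidate for cleanup: a pod that is scaled down temporarily will reattach to the same claim when it returns.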

Best Practices Established

  1. Label PVCs: Add user labels for cost attribution

    metadata:
      labels:
        user-id: "user-123"
        tier: "starter"
  2. Set Resource Limits: Prevent runaway storage usage

    resources:
      limits:
        ephemeral-storage: "10Gi" # Limit container layer growth
  3. Monitor Disk Usage: Alert when >80% full

    kubectl exec coditect-combined-0 -n coditect-app -- df -h /workspace
  4. Backup Strategy: Weekly snapshots, retain 4 weeks

    # Automated backup via CronJob
    kubectl create -f workspace-backup-cronjob.yaml
  5. Graceful Shutdown: Allow 2 minutes for file sync

    terminationGracePeriodSeconds: 120
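
The 80% alert rule from practice 3 could also be checked from inside the pod. A minimal sketch using Python's standard `shutil.disk_usage`; the function names are illustrative, and running such a check inside the IDE container is an assumption:

```python
import shutil

ALERT_THRESHOLD = 0.80  # alert when more than 80% of the volume is used


def usage_fraction(path: str) -> float:
    """Fraction of the filesystem's total capacity currently in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def should_alert(used_bytes: int, total_bytes: int,
                 threshold: float = ALERT_THRESHOLD) -> bool:
    """True when usage exceeds the threshold; empty or unknown sizes never alert."""
    return total_bytes > 0 and used_bytes / total_bytes > threshold


# "." is used so the sketch runs anywhere; inside a pod this would be "/workspace"
fraction = usage_fraction(".")
print(f"{fraction:.0%} used, alert: {fraction > ALERT_THRESHOLD}")
```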


Conclusion

The migration from Kubernetes Deployment to StatefulSet with persistent volumes was essential for Coditect AI IDE to be production-ready. It enabled:

  • 100% data persistence across pod restarts
  • User trust in platform (no more data loss)
  • Competitive parity with Gitpod, GitHub Codespaces, VS Code Server
  • Predictable pod identity for debugging and monitoring
  • Storage isolation for security and performance
  • Tiered storage options for different user segments

Cost: ~$4-8/user/month for storage (competitive with alternatives)
Benefit: Transforms Coditect from toy demo to production cloud IDE

Recommendation: Continue StatefulSet approach, implement tiered storage, add cross-zone replication for HA.


Document Status: ✅ COMPLETE
Last Updated: 2025-10-27T06:12:03Z
Author: Coditect AI Infrastructure Team
Reviewers: Technical Leadership