Coditect V5 - Scaling Architecture: 10 → 100K+ Users

Date: 2025-10-07
Purpose: Comprehensive scaling analysis and architecture for production growth
Target: Support 10, 100, 1,000, 10,000, and 100,000+ concurrent users


Executive Summary

Current Issue: Our current architecture has a critical scaling flaw: one workspace pod per user does not scale past ~1,000 users.

The Problem:

  • 1 user = 1 dedicated pod (theia + Sidecar in isolated namespace)
  • ❌ At 1,000 users = 1,000 pods = massive resource waste
  • ❌ At 10,000 users = 10,000 pods = impossible on GKE (cost + management nightmare)
  • ❌ Each pod consumes: 512Mi-2Gi RAM, 500m-2000m CPU → $500-5000/month per 100 users

The Solution:

  • Multi-tenant workspace pods (multiple users per pod)
  • Session-based isolation (not pod-based isolation)
  • Horizontal pod autoscaling (scale pods based on active sessions)
  • Resource pooling (share infrastructure, isolate data)

Scaling Targets:

Users      workspace Pods   Cost/Month        Status
10         2-3              $100-200          ✅ MVP Ready
100        5-10             $500-1,000        ✅ Current Design
1,000      20-50            $2,000-5,000      ⚠️ Need Multi-Tenancy
10,000     100-200          $10,000-20,000    ⚠️ Need Optimization
100,000    500-1,000        $50,000-100,000   ❌ Need Re-Architecture

This document provides:

  1. Scaling bottlenecks analysis
  2. Multi-tenant workspace architecture
  3. Resource optimization strategies
  4. Cost projections at each scale
  5. Migration path from current design

Table of Contents

  1. Current Architecture (1 User = 1 Pod)
  2. Scaling Bottlenecks
  3. Multi-Tenant workspace Architecture
  4. Scaling Plan by User Count
  5. Resource Optimization
  6. Cost Analysis
  7. Database Scaling (FoundationDB)
  8. Network Architecture
  9. Migration Strategy
  10. Monitoring & Observability

Current Architecture (1 User = 1 Pod)

What We Designed (V5 Provisioning Architecture)

User Registration

Provisioning Controller

┌─────────────────────────────────────────────────────────────┐
│ Create Dedicated Resources for EACH User: │
│ │
│ 1. Namespace: user-{user_id} │
│ 2. ServiceAccount + RBAC │
│ 3. PVC (10GB): workspace-pvc │
│ 4. Pod: workspace-{user_id} │
│ ├─ theia container (512Mi-2Gi RAM, 500m-2000m CPU) │
│ └─ WS Sidecar container (128Mi-256Mi RAM, 100m-200m CPU)│
│ 5. Service: workspace-{user_id}-service │
│ 6. Ingress: {user_id}.coditect.ai │
│ │
│ Total Resources Per User: │
│ - RAM: 640Mi - 2.25Gi │
│ - CPU: 600m - 2200m │
│ - Storage: 10GB PVC │
└─────────────────────────────────────────────────────────────┘

Why This Doesn't Scale

Problem 1: Resource Waste

  • Most users are idle 90% of the time
  • Dedicated pod runs 24/7 even when user is offline
  • Utilization: 10-20% average (80-90% waste)

Problem 2: Cost Explosion

Cost per user per month (GKE):
- CPU: 2000m (2 cores) × $0.031/core-hour × 730 hours = $45.26
- RAM: 2Gi × $0.0033/GB-hour × 730 hours = $4.82
- Storage: 10GB × $0.17/GB-month = $1.70
- Total: ~$52/user/month

At scale:
- 100 users: $5,200/month
- 1,000 users: $52,000/month
- 10,000 users: $520,000/month ❌ UNSUSTAINABLE

Problem 3: Kubernetes Limits

  • GKE node pools: Max 1000 nodes per cluster
  • Pods per node: ~110 pods
  • Max pods per cluster: ~110,000 (but cost prohibitive at 10K users)

Problem 4: Management Overhead

  • 1,000 users = 1,000 namespaces to monitor
  • 1,000 PVCs to backup
  • 1,000 Ingress rules to manage
  • 1,000 SSL certificates to rotate

Problem 5: Cold Start Latency

  • New user signup → provision pod → 2-3 minutes wait
  • User expects instant access (like VSCode.dev)

Scaling Bottlenecks

Bottleneck Analysis by Component

Component        Bottleneck at Scale               Impact     Mitigation
workspace Pods   1:1 user:pod ratio                Critical   Multi-tenant pods (100 users per pod)
FoundationDB     Write throughput (100K ops/sec)   High       Sharding, read replicas, caching
Kubernetes API   1000s of namespace operations     Medium     Batch operations, eventual consistency
Ingress          10K+ Ingress rules                Medium     Wildcard DNS, shared Ingress
PVC Storage      10K PVCs × 10GB = 100TB           High       Shared PVCs, S3 for cold storage
Backend API      Request throughput                Medium     Horizontal scaling (3-100 pods)
Network          Pod-to-pod traffic                Low        GKE native networking

Critical Scaling Thresholds

10 users:    ✅ Current design works fine
└─ 10 pods, 10 namespaces, minimal cost

100 users: ✅ Still manageable
└─ 100 pods, but cost is $5K/month (high for revenue)

1,000 users: ⚠️ FIRST CRITICAL THRESHOLD
└─ Need multi-tenant pods (100 users/pod = 10 pods)
└─ Need session-based routing (not namespace-based)
└─ Need shared PVCs with user directories

10,000 users: ⚠️ SECOND CRITICAL THRESHOLD
└─ Need FDB read replicas + caching (Redis)
└─ Need CDN for static assets
└─ Need multiple GKE clusters (geo-distributed)

100,000 users: ❌ REQUIRES RE-ARCHITECTURE
└─ Need serverless workspaces (AWS Lambda, Cloud Run)
└─ Need global CDN (Cloudflare Workers)
└─ Need multi-region FDB + CockroachDB

Multi-Tenant workspace Architecture

New Design: Shared workspace Pods

Key Principle: Multiple users share workspace pods, isolated by sessions not pods.

┌─────────────────────────────────────────────────────────────────────┐
│ Shared workspace Pod Pool │
│ (Horizontal Pod Autoscaler: 3-100 pods based on active sessions) │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ workspace Pod 1 (100 users capacity) │ │
│ │ ┌────────────────────────────────────────────────────────┐ │ │
│ │ │ theia Multi-User Server │ │ │
│ │ │ - Session Manager (tracks active user sessions) │ │ │
│ │ │ - File Isolation (user dirs: /workspace/{user_id}/) │ │ │
│ │ │ - Process Isolation (cgroups per user) │ │ │
│ │ │ - Resource Quotas (CPU/RAM limits per user) │ │ │
│ │ └────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ Active Sessions (tracked in memory): │ │
│ │ - user-123: /workspace/user-123/ (CPU: 200m, RAM: 512Mi) │ │
│ │ - user-456: /workspace/user-456/ (CPU: 150m, RAM: 384Mi) │ │
│ │ - user-789: /workspace/user-789/ (CPU: 300m, RAM: 768Mi) │ │
│ │ ... (up to 100 concurrent users) │ │
│ │ │ │
│ │ Resources: 16Gi RAM, 8 CPU cores │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ workspace Pod 2 (100 users capacity) │ │
│ │ ... (same structure as Pod 1) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ workspace Pod N (100 users capacity) │ │
│ │ ... (auto-scaled based on load) │ │
│ └──────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
↓ ↓
┌──────────────┐ ┌──────────────────┐
│ Shared PVC │ │ FoundationDB │
│ (1TB) │ │ (Session Routing)│
│ │ │ │
│ /workspace/ │ │ session_id → │
│ ├─ user-123/ │ │ pod_id mapping │
│ ├─ user-456/ │ └──────────────────┘
│ ├─ user-789/ │
│ └─ ... │
└──────────────┘

Session-Based Routing

Instead of: User → Dedicated Pod (1:1)
New: User Session → Load-Balanced Pod (N:1)

1. User logs in → Creates session (JWT token)

2. Frontend connects to backend API

3. Backend queries FDB: "Which pod has capacity?"

4. FDB returns: pod-3 (has 45/100 active sessions)

5. Backend creates session assignment:
- session_id: abc-123
- user_id: user-456
- pod_id: workspace-pod-3
- workspace_path: /workspace/user-456/

6. Frontend WebSocket connects to workspace-pod-3
- Header: Authorization: Bearer <JWT>
- WebSocket URL: wss://ide.coditect.ai/ws?session_id=abc-123

7. workspace pod validates JWT → loads user workspace

8. User sees theia IDE with their files from /workspace/user-456/
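Steps 3-5 of the flow above can be sketched in TypeScript. This is a minimal illustration, not the actual backend code; `PodInfo`, `findPodWithCapacity`, and `assignSession` are hypothetical names:

```typescript
// Illustrative shapes for the session-assignment step.
interface PodInfo {
  podId: string;
  activeSessions: number;
  capacity: number; // e.g. 100 users per pod
}

interface SessionAssignment {
  sessionId: string;
  userId: string;
  podId: string;
  workspacePath: string;
}

// Step 3-4: pick the least-loaded pod that still has capacity.
function findPodWithCapacity(pods: PodInfo[]): PodInfo | undefined {
  return pods
    .filter((p) => p.activeSessions < p.capacity)
    .sort((a, b) => a.activeSessions - b.activeSessions)[0];
}

// Step 5: build the assignment record that would be written to FDB.
function assignSession(
  sessionId: string,
  userId: string,
  pods: PodInfo[],
): SessionAssignment {
  const pod = findPodWithCapacity(pods);
  if (!pod) throw new Error('No workspace pod has free capacity');
  return {
    sessionId,
    userId,
    podId: pod.podId,
    workspacePath: `/workspace/${userId}/`,
  };
}
```

In production this lookup would hit FDB (or the Redis cache) rather than an in-memory array, but the selection logic is the same.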

File Isolation Strategy

Shared PVC with User Directories:

/workspace/
├── user-123/
│ ├── src/
│ ├── .git/
│ └── package.json
├── user-456/
│ ├── src/
│ └── cargo.toml
└── user-789/
└── ...

Mounted to ALL workspace pods as:
- Volume: shared-workspace-pvc
- MountPath: /workspace
- ReadWriteMany (NFS or GlusterFS)

Security: theia process enforces user directory isolation

// theia workspace initialization
const path = require('path');
const userWorkspace = `/workspace/${user_id}/`;

// Normalize the request path first so "../" segments cannot escape the user directory
const resolved = path.resolve('/', requestPath);
if (!resolved.startsWith(userWorkspace)) {
  throw new Error('Access denied');
}

Process Isolation (cgroups)

Linux cgroups limit per-user resource usage:

# Create cgroup for user
cgcreate -g cpu,memory:user-123

# Set limits
echo "20000" > /sys/fs/cgroup/cpu/user-123/cpu.cfs_quota_us  # 200m = 20% of one core (default 100ms period)
echo "536870912" > /sys/fs/cgroup/memory/user-123/memory.limit_in_bytes  # 512Mi

# Run user's theia process in cgroup
cgexec -g cpu,memory:user-123 node theia-start --user=user-123

Resource Quotas

Per-User Limits (enforced in theia):

interface UserQuota {
  cpu: '200m' | '500m' | '1000m';      // Based on license tier
  memory: '512Mi' | '1Gi' | '2Gi';     // Based on license tier
  storage: '10Gi' | '50Gi' | '100Gi';
  sessions: 1 | 5 | 10;                // Free vs Pro
}

const QUOTAS: Record<string, UserQuota> = {
  free:    { cpu: '200m',  memory: '512Mi', storage: '10Gi',  sessions: 1 },
  starter: { cpu: '500m',  memory: '1Gi',   storage: '50Gi',  sessions: 5 },
  pro:     { cpu: '1000m', memory: '2Gi',   storage: '100Gi', sessions: 10 },
};
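Enforcing the per-tier session limit at login time reduces to a small check. A minimal sketch, assuming active session counts are tracked per user; `canOpenSession` and `SESSION_LIMITS` are illustrative names:

```typescript
// Per-tier concurrent-session caps, mirroring the quota table above.
type Tier = 'free' | 'starter' | 'pro';

const SESSION_LIMITS: Record<Tier, number> = {
  free: 1,
  starter: 5,
  pro: 10,
};

// Reject a new session once the user's tier limit is reached.
function canOpenSession(tier: Tier, activeSessions: number): boolean {
  return activeSessions < SESSION_LIMITS[tier];
}
```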

Pod Capacity Planning

Single workspace Pod Capacity:

Pod Resources: 16Gi RAM, 8 CPU cores

Per-User Average:
- RAM: 150Mi (average active user)
- CPU: 100m (average active user)

Theoretical Capacity: 16Gi / 150Mi ≈ 109 users
Practical Capacity: 100 users (~8% headroom for system processes)

Peak Capacity (all users active):
- RAM: 100 users × 150Mi ≈ 15Gi (leaves ~1Gi for system)
- CPU: 100 users × 100m = 10 cores of peak demand vs 8 available cores (oversubscribed at peak; per-user cgroup limits throttle rather than fail)

Scaling Math:

Users      Pods Needed    Total Resources
10         1              16Gi RAM, 8 CPU
100        1              16Gi RAM, 8 CPU
1,000      10             160Gi RAM, 80 CPU
10,000     100            1.6Ti RAM, 800 CPU
100,000    1,000          16Ti RAM, 8,000 CPU
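The scaling math above reduces to a one-line calculation. This sketch assumes the 100-users-per-pod and 16Gi / 8-core figures from the capacity plan; function names are illustrative:

```typescript
// Capacity constants from the pod sizing above (assumptions, not measured).
const USERS_PER_POD = 100;
const POD_RAM_GI = 16;
const POD_CPU_CORES = 8;

// Pods needed for N concurrent users, rounded up, minimum one pod.
function podsNeeded(users: number): number {
  return Math.max(1, Math.ceil(users / USERS_PER_POD));
}

// Aggregate cluster resources for that many pods.
function totalResources(users: number): { ramGi: number; cpuCores: number } {
  const pods = podsNeeded(users);
  return { ramGi: pods * POD_RAM_GI, cpuCores: pods * POD_CPU_CORES };
}
```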

Scaling Plan by User Count

Scale 1: 10 Users (MVP / Beta)

Architecture:

- workspace Pods: 1-2 (for redundancy)
- Backend API Pods: 3
- FoundationDB Pods: 3
- GKE Nodes: 2 (n1-standard-8: 8 vCPU, 30Gi RAM each)

Resources:

  • Total RAM: 60Gi
  • Total CPU: 16 cores
  • Total Storage: 100Gi (shared PVC)

Cost: ~$437/month

  • GKE nodes: 2 × $200 = $400
  • LoadBalancer: $20
  • Storage: 100Gi × $0.17 = $17
  • Total: ~$437/month

Bottlenecks: None
Status: ✅ Current design is optimal


Scale 2: 100 Users

Architecture:

- workspace Pods: 2-5 (HPA: min=2, max=5)
- Backend API Pods: 3-10 (HPA: min=3, max=10)
- FoundationDB Pods: 3
- GKE Nodes: 3-5 (n1-standard-8)

Resources:

  • Total RAM: 90-150Gi
  • Total CPU: 24-40 cores
  • Total Storage: 1Ti (shared PVC)

Cost: ~$1,193/month

  • GKE nodes: 5 × $200 = $1,000
  • LoadBalancer: $20
  • Storage: 1Ti × $0.17 = $173
  • Total: ~$1,193/month

Revenue (assuming 20% paid conversion @ $29/mo):

  • 100 users × 20% × $29 = $580/month
  • Margin: -$613/month (loss leader during growth)

Bottlenecks: None
Status: ✅ Multi-tenant design handles easily


Scale 3: 1,000 Users ⚠️ CRITICAL THRESHOLD

Architecture:

- workspace Pods: 10-20 (HPA: min=10, max=20)
- Backend API Pods: 5-15 (HPA: min=5, max=15)
- FoundationDB Pods: 5 (with read replicas)
- Redis Cache: 3 nodes (for session routing)
- GKE Nodes: 10-15 (n1-standard-16: 16 vCPU, 60Gi RAM)

Resources:

  • Total RAM: 600-900Gi
  • Total CPU: 160-240 cores
  • Total Storage: 10Ti (shared PVC or distributed storage)

Cost: ~$11,800/month

  • GKE nodes: 15 × $660 = $9,900
  • LoadBalancer: $20
  • Storage: 10Ti × $0.17 = $1,730
  • Redis: 3 × $50 = $150
  • Total: ~$11,800/month

Revenue (30% paid conversion @ $29/mo):

  • 1,000 users × 30% × $29 = $8,700/month
  • Margin: -$3,100/month (still loss, but improving)

Bottlenecks:

  1. ⚠️ FoundationDB Write Throughput: Approaching 100K ops/sec limit
    • Mitigation: Add read replicas, cache in Redis
  2. ⚠️ Shared PVC Performance: 10K IOPS limit on single PVC
    • Mitigation: Use multiple PVCs (shard by user_id hash)
  3. ⚠️ Session Routing Latency: FDB queries for session assignment
    • Mitigation: Cache session → pod mapping in Redis

Required Changes:

  1. ✅ Implement Redis caching layer
  2. ✅ Shard PVCs (10 PVCs × 1Ti each instead of 1 × 10Ti)
  3. ✅ Add FDB read replicas (3 write nodes + 5 read replicas)

Status: ⚠️ Need optimizations listed above


Scale 4: 10,000 Users ⚠️ SECOND CRITICAL THRESHOLD

Architecture:

- workspace Pods: 100-200 (HPA: min=100, max=200)
- Backend API Pods: 20-50 (HPA: min=20, max=50)
- FoundationDB Pods: 10 (5 write + 5 read replicas)
- Redis Cache: 6 nodes (sharded)
- GKE Cluster: 2 clusters (US + EU for geo-distribution)
- CDN: Cloudflare for static assets
- GKE Nodes per cluster: 50-100 (n1-standard-16)

Resources (per cluster):

  • Total RAM: 3-6Ti
  • Total CPU: 800-1600 cores
  • Total Storage: 100Ti (distributed across 100 PVCs)

Cost: ~$150,740/month

  • GKE nodes: 2 clusters × 100 nodes × $660 = $132,000
  • LoadBalancer: 2 × $20 = $40
  • Storage: 100Ti × $0.17 = $17,000
  • Redis: 6 × $200 = $1,200
  • CDN: $500
  • Total: ~$150,740/month

Revenue (40% paid conversion @ $49 avg):

  • 10,000 users × 40% × $49 = $196,000/month
  • Margin: +$45,260/month ✅ PROFITABLE

Bottlenecks:

  1. ⚠️ FDB Cluster Capacity: Need horizontal sharding
    • Mitigation: Shard by tenant_id (multiple FDB clusters)
  2. ⚠️ Network I/O: 100 workspace pods × 100 users = high traffic
    • Mitigation: CDN for static assets, WebSocket connection pooling
  3. ⚠️ Kubernetes API Load: Managing 200 pods across 2 clusters
    • Mitigation: ArgoCD with eventual consistency, batch operations

Required Changes:

  1. ✅ Multi-region deployment (US + EU)
  2. ✅ FDB sharding by tenant_id (5 FDB clusters)
  3. ✅ CDN for static assets (Cloudflare)
  4. ✅ WebSocket connection pooling
  5. ✅ Dedicated GKE cluster per region

Status: ⚠️ Significant infrastructure investment required


Scale 5: 100,000 Users ❌ REQUIRES RE-ARCHITECTURE

Problem: GKE-based architecture hits hard limits:

  • Cost: $1.5M+/month on GKE alone
  • Management: 2,000 workspace pods is operationally complex
  • Latency: Global users need <100ms response times

Solution: Hybrid Serverless Architecture

┌────────────────────────────────────────────────────────────────┐
│ Global Architecture │
│ │
│ Cloudflare Workers (Edge) │
│ ├─ Static assets (CDN) │
│ ├─ WebSocket proxy (routing to nearest region) │
│ └─ Auth token validation │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ US-WEST │ │ US-EAST │ │ EU │ │
│ │ │ │ │ │ │ │
│ │ GKE Cluster │ │ GKE Cluster │ │ GKE Cluster │ │
│ │ - 50 pods │ │ - 50 pods │ │ - 50 pods │ │
│ │ - 5K users │ │ - 5K users │ │ - 5K users │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ↓ ↓ ↓ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Global FoundationDB (Multi-Region) │ │
│ │ - 15 nodes (5 per region) │ │
│ │ - Cross-region replication │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Serverless workspaces (Cold Start Optimization) │
│ ├─ AWS Lambda (Firecracker VMs) │
│ ├─ Cloud Run (for infrequent users) │
│ └─ GKE (for power users only) │
└────────────────────────────────────────────────────────────────┘

Cost: $50,000-100,000/month

  • Multi-region GKE: 3 clusters × $20K = $60K
  • Cloudflare Enterprise: $5K
  • FDB multi-region: $15K
  • Serverless compute (Lambda/Cloud Run): $20K
  • Total: ~$100K/month

Revenue (50% paid conversion @ $49 avg):

  • 100,000 users × 50% × $49 = $2,450,000/month
  • Margin: +$2,350,000/month ✅ HIGHLY PROFITABLE

Key Changes:

  1. Serverless workspaces for cold users (90% of users)
    • AWS Lambda with Firecracker VMs (200ms cold start)
    • Cloud Run for theia instances (scale to zero)
  2. GKE for power users only (10% of users needing persistent pods)
  3. Cloudflare Workers for edge routing (reduce latency to <50ms globally)
  4. CockroachDB as FDB alternative (geo-distributed SQL)
  5. S3/GCS for cold storage (move inactive workspaces to object storage)

Status: ❌ Requires 6-12 months engineering effort


Resource Optimization

Memory Optimization

Problem: theia is memory-heavy (512Mi-2Gi per user)

Solutions:

  1. Lazy Loading: Don't load all extensions on startup

    // Only load extensions when user opens relevant file type
    if (file.endsWith('.py')) {
      await loadExtension('ms-python.python');
    }
  2. Shared Extension Host: Single extension host for all users in pod

    // Instead of: 1 extension host per user (100 × 200Mi = 20Gi)
    // Use: 1 shared extension host (1 × 2Gi = 2Gi)
  3. Aggressive GC: Force garbage collection for idle users

    if (user.idleTime > 5 * 60 * 1000) {  // 5 min idle
      global.gc(); // Force GC (requires node --expose-gc)
    }

Result: Reduce per-user memory from 512Mi → 150Mi (70% reduction)

CPU Optimization

Problem: theia LSP servers are CPU-intensive

Solutions:

  1. Throttle LSP for idle users: Pause LSP when user inactive

    if (user.idleTime > 2 * 60 * 1000) {  // 2 min idle
      languageServer.pause();
    }
  2. Share TypeScript Server: Single tsserver for all TypeScript projects

    // Single tsserver handles multiple projects (project references)
  3. Offload to Backend: Move heavy operations to backend API

    // Code formatting, linting via backend API (not in browser)
    await fetch('/api/format', { method: 'POST', body: code });

Result: Reduce per-user CPU from 200m → 100m (50% reduction)

Storage Optimization

Problem: 10K users × 10Gi = 100Ti storage ($17K/month)

Solutions:

  1. Tiered Storage:

    Hot Storage (SSD PVC): Last 7 days files (1Gi per user)
    Warm Storage (HDD PVC): Last 30 days files (5Gi per user)
    Cold Storage (GCS): Inactive files (archive)
  2. Deduplication: Shared node_modules, .git objects

    # Instead of: 10K × 500Mi node_modules = 5Ti
    # Use: Shared node_modules with symlinks = 500Mi
  3. Compression: Enable PVC compression (30-50% savings)

    storageClassName: compressed-ssd

Result: Reduce per-user storage from 10Gi → 2Gi (80% reduction)


Cost Analysis

Cost Breakdown by Scale

Users      GKE Nodes   Storage   Redis    Total/Month   Revenue/Month            Margin
10         $400        $17       $0       $437          $58 (20% @ $29)          -$379
100        $1,000      $173      $0       $1,193        $580 (20% @ $29)         -$613
1,000      $9,900      $1,730    $150     $11,800       $8,700 (30% @ $29)       -$3,100
10,000     $132,000    $17,000   $1,200   $150,740      $196,000 (40% @ $49)     +$45,260
100,000    $60,000     $34,000   $5,000   $99,000       $2,450,000 (50% @ $49)   +$2,351,000

Key Insights:

  1. Break-even: ~8,000 users (with 40% conversion)
  2. Profitable scale: 10,000+ users
  3. 100K users: 95% profit margin (after infrastructure costs)
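The margin column above follows one formula. A minimal sketch (the function name is illustrative; inputs are the per-row conversion, price, and cost assumptions):

```typescript
// Monthly margin = revenue from paid users minus infrastructure cost.
function monthlyMargin(
  users: number,
  paidConversion: number, // e.g. 0.4 for 40%
  pricePerMonth: number,  // e.g. 49
  costPerMonth: number,   // infrastructure estimate for that scale
): number {
  const revenue = users * paidConversion * pricePerMonth;
  return revenue - costPerMonth;
}
```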

Revenue Projections

Conservative Model (assumes slow growth):

Month 1-3 (Beta):   100 users,      10% paid  → $290/month
Month 4-6:          500 users,      20% paid  → $2,900/month
Month 7-12:         2,000 users,    30% paid  → $17,400/month
Year 2:             10,000 users,   40% paid  → $196,000/month
Year 3:             50,000 users,   50% paid  → $1,225,000/month

Aggressive Model (viral growth):

Month 1-3 (Beta):   1,000 users,    10% paid  → $2,900/month
Month 4-6:          5,000 users,    20% paid  → $29,000/month
Month 7-12:         20,000 users,   30% paid  → $174,000/month
Year 2:             100,000 users,  40% paid  → $1,960,000/month

Database Scaling (FoundationDB)

FoundationDB Performance Characteristics

Single FDB Cluster Limits:

  • Write Throughput: 100,000 ops/sec
  • Read Throughput: 1,000,000 ops/sec (with read replicas)
  • Storage: 100Ti+ (horizontally scalable)
  • Latency: <5ms (single-region), <50ms (multi-region)

Scaling Strategy by User Count

10-100 Users:

FDB Cluster: 3 nodes (simple triple replication)
- Write capacity: 100K ops/sec → handles 10K requests/sec easily
- Read capacity: 1M ops/sec → handles 100K requests/sec

100-1,000 Users:

FDB Cluster: 5 nodes (3 write + 2 read replicas)
- Write capacity: 100K ops/sec
- Read capacity: 2M ops/sec (read replicas handle reads)

1,000-10,000 Users:

FDB Cluster: 10 nodes (5 write + 5 read replicas)
- Write capacity: 100K ops/sec
- Read capacity: 5M ops/sec

Redis Caching Layer: 3 nodes
- Cache session → pod mappings (90% hit rate)
- Cache user metadata (email, name, avatar)
- TTL: 5 minutes

10,000-100,000 Users:

Multi-Region FDB:
- US-WEST: 5 nodes (3 write + 2 read)
- US-EAST: 5 nodes (3 write + 2 read)
- EU: 5 nodes (3 write + 2 read)

Sharding Strategy:
- Shard by tenant_id hash (0-4 → cluster 1, 5-9 → cluster 2, etc.)
- Each cluster handles 20K users

Redis Caching: 6 nodes (sharded)
- 95% cache hit rate
- Reduces FDB load by 20x

FDB Key Design for Scaling

Current Key Schema (works up to 10K users):

users/{user_id}                              # User record
users/{user_id}/sessions/{session_id} # User sessions
tenants/{tenant_id}/users/{user_id} # Tenant membership
sessions/{session_id} # Session data
workspaces/{assignment_id} # workspace assignments

Optimized Key Schema (for 100K+ users):

# Shard by tenant_id (first 2 chars of UUID)
shard_ab/users/{user_id} # Users starting with ab-xxx
shard_cd/users/{user_id} # Users starting with cd-xxx
...

# Hot/cold data separation
hot/sessions/{session_id} # Active sessions (TTL: 1 hour)
cold/sessions/{session_id} # Archived sessions (TTL: 30 days)

# Denormalized for read performance
user_sessions/{user_id} # All sessions for user (JSON array)
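A sketch of how a key helper might derive the shard prefix from the first two characters of the ID, as described above; `userKey` is an illustrative helper, not the real FDB client API:

```typescript
// Build the sharded FDB key for a user record.
// Shard prefix = first 2 characters of the user's UUID.
function userKey(userId: string): string {
  const shard = userId.slice(0, 2); // e.g. "ab" from "ab12..."
  return `shard_${shard}/users/${userId}`;
}
```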

Network Architecture

Ingress Strategy by Scale

Current Design (1 user = 1 Ingress rule):

# ❌ Doesn't scale past 1K users
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: workspace-user-123
spec:
rules:
- host: abc123.coditect.ai
http:
paths:
- path: /
backend:
service:
name: workspace-abc123-service
port: 3000

Optimized Design (1 Ingress, path-based routing):

# ✅ Scales to 100K+ users
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: coditect-workspace-ingress
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
rules:
- host: ide.coditect.ai
http:
paths:
- path: /ws/([^/]+)(/|$)(.*)
pathType: Prefix
backend:
service:
name: workspace-router-service # Routes to correct pod
port: 8080

workspace Router Service:

// workspace-router/src/main.rs

async fn handle_websocket(
    ws: WebSocket,
    session_id: String,
    fdb_client: Arc<Database>,
    redis_client: Arc<RedisClient>,
) -> Result<()> {
    // 1. Validate session
    let session = fdb_client.get_session(&session_id).await?;

    // 2. Get pod assignment: check the Redis cache first, fall back to FDB
    let cache_key = format!("session:{}:pod", session_id);
    let pod_id = match redis_client.get(&cache_key).await? {
        Some(pod_id) => pod_id,
        None => fdb_client.get_session_pod(&session_id).await?,
    };

    // 3. Proxy the WebSocket to the correct pod
    let pod_url = format!(
        "ws://workspace-pod-{}.coditect-app.svc.cluster.local:3000",
        pod_id
    );
    proxy_websocket(ws, &pod_url).await?;

    Ok(())
}

CDN Strategy

Static Assets (Cloudflare CDN):

Frontend Assets:
- HTML/CSS/JS bundles
- theia static files
- Monaco editor assets
- VS Code extensions

Edge Locations: 275+ worldwide
Latency: <50ms globally
Cost: $500/month (up to 100K users)

Dynamic Content (Direct to Backend):

API Requests:
- Authentication (POST /api/auth/login)
- workspace provisioning (POST /api/workspaces/provision)
- File operations (GET/PUT /api/files/*)

No CDN (always fresh data)

Migration Strategy

Phase 1: Current Design → Multi-Tenant (Weeks 1-4)

Current: 1 user = 1 pod
Target: 100 users per pod

Steps:

  1. Build multi-tenant theia server

    • Session manager (track active users in pod)
    • File isolation (enforce user directory boundaries)
    • Process isolation (cgroups per user)
  2. Build workspace router service

    • Query FDB for session → pod mapping
    • Proxy WebSocket to correct pod
  3. Deploy new workspace pods (parallel to old)

    • New users → multi-tenant pods
    • Old users → keep dedicated pods (migrate later)
  4. Migrate existing users (1 per day)

    • Copy user files to shared PVC
    • Update session assignment
    • Delete old dedicated pod

Timeline: 4 weeks
Risk: Low (old users unaffected during migration)

Phase 2: Add Redis Caching (Week 5)

Purpose: Reduce FDB load for session routing

Steps:

  1. Deploy Redis cluster (3 nodes, sharded)
  2. Update workspace router to check Redis first
  3. Cache session → pod mappings (TTL: 5 min)
  4. Monitor cache hit rate (target: 90%+)
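Steps 2-3 amount to a cache-aside lookup. A minimal sketch, using an in-memory Map as a stand-in for Redis and a synchronous loader as a stand-in for the async FDB query; all names are illustrative:

```typescript
// Cached session → pod mapping with an expiry timestamp.
type CacheEntry = { podId: string; expiresAt: number };

class SessionPodCache {
  private cache = new Map<string, CacheEntry>();
  constructor(private ttlMs: number = 5 * 60 * 1000) {} // 5 min TTL, as above

  // Cache-aside: return the cached pod if fresh, otherwise load from
  // FDB and populate the cache.
  getPod(sessionId: string, loadFromFdb: (id: string) => string): string {
    const hit = this.cache.get(sessionId);
    if (hit && hit.expiresAt > Date.now()) return hit.podId; // cache hit
    const podId = loadFromFdb(sessionId); // cache miss → query FDB
    this.cache.set(sessionId, { podId, expiresAt: Date.now() + this.ttlMs });
    return podId;
  }
}
```

With a 90%+ hit rate, nine out of ten session lookups never touch FDB, which is the 10x reduction targeted below.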

Timeline: 1 week
Impact: 10x reduction in FDB queries

Phase 3: Shard PVCs (Week 6-7)

Purpose: Avoid single PVC bottleneck at 10K+ users

Steps:

  1. Create 10 PVCs (1Ti each) instead of 1 × 10Ti
  2. Shard by user_id hash: user_id % 10 = pvc_number
  3. Mount all 10 PVCs to each workspace pod:
    volumeMounts:
    - name: pvc-0
      mountPath: /workspace/shard-0
    - name: pvc-1
      mountPath: /workspace/shard-1
    ...
  4. Router calculates shard: /workspace/shard-{user_id % 10}/{user_id}/
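Step 4's shard calculation needs a stable hash, since user IDs are strings rather than integers. This sketch uses a simple illustrative hash; any stable function of user_id works:

```typescript
// Map a user ID to one of N shards via a stable string hash.
function shardIndex(userId: string, shards = 10): number {
  let h = 0;
  for (const ch of userId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % shards;
}

// Full workspace path on the sharded PVC layout above.
function workspacePath(userId: string): string {
  return `/workspace/shard-${shardIndex(userId)}/${userId}/`;
}
```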

Timeline: 2 weeks
Impact: 10x IOPS capacity (10K → 100K IOPS)

Phase 4: Multi-Region (Week 8-12)

Purpose: Global latency <100ms, disaster recovery

Steps:

  1. Deploy GKE cluster in EU (parallel to US)
  2. Deploy FDB cluster in EU (replicate from US)
  3. Deploy Cloudflare Workers for geo-routing
  4. Route EU users → EU cluster, US users → US cluster
  5. Enable FDB cross-region replication (async)

Timeline: 4 weeks
Impact: <100ms latency for EU users, 99.99% uptime


Monitoring & Observability

Key Metrics by Scale

10-100 Users:

  • Pod CPU/memory usage
  • API request latency (p95)
  • FoundationDB query latency

1,000 Users:

  • All above +
  • workspace pod utilization (users per pod)
  • Redis cache hit rate
  • Session provisioning time (p95)

10,000 Users:

  • All above +
  • FDB replication lag
  • PVC IOPS usage
  • Cross-region latency
  • Cost per active user

100,000 Users:

  • All above +
  • Serverless cold start latency
  • Global user distribution
  • Edge cache hit rate (Cloudflare)

Alerting Thresholds

alerts:
  # Critical
  - name: workspacePodsCrashing
    threshold: pod_restarts > 5 in 10 minutes
    severity: critical
    action: Page on-call engineer

  - name: FoundationDBDown
    threshold: fdb_available_nodes < 3
    severity: critical
    action: Page on-call + auto-failover

  # Warning
  - name: HighPodUtilization
    threshold: users_per_pod > 90
    severity: warning
    action: Trigger pod autoscaling

  - name: LowCacheHitRate
    threshold: redis_hit_rate < 80%
    severity: warning
    action: Investigate cache keys

  # Info
  - name: ScalingEvent
    threshold: pod_count increased by 5+
    severity: info
    action: Log for capacity planning

Conclusion

Scaling Summary

Scale           Architecture                  Cost/Month   Margin        Status
10 users        Current design (1:1 pods)     $437         -$379         ✅ Ready
100 users       Multi-tenant (2-5 pods)       $1,193       -$613         ✅ Ready with multi-tenant
1,000 users     Multi-tenant + Redis          $11,800      -$3,100       ⚠️ Need Redis + PVC sharding
10,000 users    Multi-region + FDB sharding   $150,740     +$45,260      ⚠️ 3-4 months work
100,000 users   Serverless hybrid             $99,000      +$2,351,000   ❌ 6-12 months work

Critical Decisions

For MVP (10-100 users):

  • ✅ Keep current 1:1 pod design (simpler, faster to market)
  • ✅ Plan multi-tenant migration for Month 3

For Growth (100-1,000 users):

  • ✅ Implement multi-tenant pods (Week 5-8)
  • ✅ Add Redis caching (Week 9)

For Scale (1,000-10,000 users):

  • ✅ Shard PVCs (Month 4)
  • ✅ Multi-region deployment (Month 5-6)
  • ✅ FDB sharding (Month 6-7)

For Massive Scale (100,000+ users):

  • ✅ Serverless workspaces (Year 2)
  • ✅ Global edge network (Year 2)

Month 1-3: Launch MVP with 1:1 pods

  • Focus on product-market fit
  • Cost is manageable (<$500/month)
  • Avoid premature optimization

Month 4-6: Migrate to multi-tenant

  • Start when you hit 100-200 users
  • Reduces cost by 10x
  • Enables path to 1,000+ users

Month 7-12: Scale to 1,000-10,000 users

  • Add Redis, shard PVCs
  • Deploy multi-region
  • Profitable at 8,000+ users

Year 2: Re-architect for 100,000+ users

  • Serverless workspaces
  • Global edge
  • 95%+ profit margins

Document Status: ✅ Complete
Last Updated: 2025-10-07
Next Review: Month 3 (when hitting 100+ users)
Owner: Engineering + Finance Teams