Coditect V5 - Scaling Architecture: 10 → 100K+ Users
Date: 2025-10-07
Purpose: Comprehensive scaling analysis and architecture for production growth
Target: Support 10, 100, 1,000, 10,000, and 100,000+ concurrent users
Executive Summary
Current Issue: Our current architecture has a critical scaling problem - one workspace pod per user doesn't scale past ~1,000 users.
The Problem:
- ❌ 1 user = 1 dedicated pod (theia + Sidecar in isolated namespace)
- ❌ At 1,000 users = 1,000 pods = massive resource waste
- ❌ At 10,000 users = 10,000 pods = impossible on GKE (cost + management nightmare)
- ❌ Each pod consumes 512Mi-2Gi RAM and 500m-2000m CPU → roughly $5,200/month per 100 users (see cost math below)
The Solution:
- ✅ Multi-tenant workspace pods (multiple users per pod)
- ✅ Session-based isolation (not pod-based isolation)
- ✅ Horizontal pod autoscaling (scale pods based on active sessions)
- ✅ Resource pooling (share infrastructure, isolate data)
Scaling Targets:
| Users | workspace Pods | Cost/Month | Status |
|---|---|---|---|
| 10 | 2-3 | ~$440 | ✅ MVP Ready |
| 100 | 5-10 | ~$1,200 | ✅ Current Design |
| 1,000 | 10-20 | ~$11,800 | ⚠️ Need Multi-Tenancy |
| 10,000 | 100-200 | ~$150,000 | ⚠️ Need Optimization |
| 100,000 | 500-1000 | ~$99,000 (serverless hybrid) | ❌ Need Re-Architecture |
This document provides:
- Scaling bottlenecks analysis
- Multi-tenant workspace architecture
- Resource optimization strategies
- Cost projections at each scale
- Migration path from current design
Table of Contents
- Current Architecture (1 User = 1 Pod)
- Scaling Bottlenecks
- Multi-Tenant workspace Architecture
- Scaling Plan by User Count
- Resource Optimization
- Cost Analysis
- Database Scaling (FoundationDB)
- Network Architecture
- Migration Strategy
- Monitoring & Observability
Current Architecture (1 User = 1 Pod)
What We Designed (V5 Provisioning Architecture)
User Registration
↓
Provisioning Controller
↓
┌─────────────────────────────────────────────────────────────┐
│ Create Dedicated Resources for EACH User: │
│ │
│ 1. Namespace: user-{user_id} │
│ 2. ServiceAccount + RBAC │
│ 3. PVC (10GB): workspace-pvc │
│ 4. Pod: workspace-{user_id} │
│ ├─ theia container (512Mi-2Gi RAM, 500m-2000m CPU) │
│ └─ WS Sidecar container (128Mi-256Mi RAM, 100m-200m CPU)│
│ 5. Service: workspace-{user_id}-service │
│ 6. Ingress: {user_id}.coditect.ai │
│ │
│ Total Resources Per User: │
│ - RAM: 640Mi - 2.25Gi │
│ - CPU: 600m - 2200m │
│ - Storage: 10GB PVC │
└─────────────────────────────────────────────────────────────┘
Why This Doesn't Scale
Problem 1: Resource Waste
- Most users are idle 90% of the time
- Dedicated pod runs 24/7 even when user is offline
- Utilization: 10-20% average (80-90% waste)
Problem 2: Cost Explosion
Cost per user per month (GKE):
- CPU: 2000m (2 cores) × $0.031/core-hour × 730 hours = $45.26
- RAM: 2Gi × $0.0033/GB-hour × 730 hours = $4.82
- Storage: 10GB × $0.17/GB-month = $1.70
- Total: ~$52/user/month
At scale:
- 100 users: $5,200/month
- 1,000 users: $52,000/month
- 10,000 users: $520,000/month ❌ UNSUSTAINABLE
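The per-user arithmetic above can be checked with a small helper (a sketch; the unit prices are the illustrative GKE figures used in this section, not live pricing):

```typescript
// Illustrative GKE unit prices from the analysis above (not live pricing).
const PRICE = {
  cpuPerCoreHour: 0.031,   // $/core-hour
  ramPerGiHour: 0.0033,    // $/GiB-hour
  storagePerGiMonth: 0.17, // $/GiB-month
  hoursPerMonth: 730,
};

// Monthly cost of one dedicated workspace pod at the given size.
function perUserMonthlyCost(cpuCores: number, ramGi: number, storageGi: number): number {
  const cpu = cpuCores * PRICE.cpuPerCoreHour * PRICE.hoursPerMonth;
  const ram = ramGi * PRICE.ramPerGiHour * PRICE.hoursPerMonth;
  const storage = storageGi * PRICE.storagePerGiMonth;
  return cpu + ram + storage;
}
```

With 2 cores, 2Gi RAM, and a 10Gi PVC this returns ~$51.78, matching the ~$52/user/month figure above.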
Problem 3: Kubernetes Limits
- GKE node pools: Max 1000 nodes per cluster
- Pods per node: ~110 pods
- Max pods per cluster: ~110,000 (but cost prohibitive at 10K users)
Problem 4: Management Overhead
- 1,000 users = 1,000 namespaces to monitor
- 1,000 PVCs to backup
- 1,000 Ingress rules to manage
- 1,000 SSL certificates to rotate
Problem 5: Cold Start Latency
- New user signup → provision pod → 2-3 minutes wait
- User expects instant access (like VSCode.dev)
Scaling Bottlenecks
Bottleneck Analysis by Component
| Component | Bottleneck at Scale | Impact | Mitigation |
|---|---|---|---|
| workspace Pods | 1:1 user:pod ratio | Critical | Multi-tenant pods (100 users per pod) |
| FoundationDB | Write throughput (100K ops/sec) | High | Sharding, read replicas, caching |
| Kubernetes API | 1000s of namespace operations | Medium | Batch operations, eventual consistency |
| Ingress | 10K+ Ingress rules | Medium | Wildcard DNS, shared Ingress |
| PVC Storage | 10K PVCs × 10GB = 100TB | High | Shared PVCs, S3 for cold storage |
| Backend API | Request throughput | Medium | Horizontal scaling (3-100 pods) |
| Network | Pod-to-pod traffic | Low | GKE native networking |
Critical Scaling Thresholds
10 users: ✅ Current design works fine
└─ 10 pods, 10 namespaces, minimal cost
100 users: ✅ Still manageable
└─ 100 pods, but cost is $5K/month (high for revenue)
1,000 users: ⚠️ FIRST CRITICAL THRESHOLD
└─ Need multi-tenant pods (100 users/pod = 10 pods)
└─ Need session-based routing (not namespace-based)
└─ Need shared PVCs with user directories
10,000 users: ⚠️ SECOND CRITICAL THRESHOLD
└─ Need FDB read replicas + caching (Redis)
└─ Need CDN for static assets
└─ Need multiple GKE clusters (geo-distributed)
100,000 users: ❌ REQUIRES RE-ARCHITECTURE
└─ Need serverless workspaces (AWS Lambda, Cloud Run)
└─ Need global CDN (Cloudflare Workers)
└─ Need multi-region FDB + CockroachDB
Multi-Tenant workspace Architecture
New Design: Shared workspace Pods
Key Principle: Multiple users share workspace pods, isolated by sessions not pods.
┌─────────────────────────────────────────────────────────────────────┐
│ Shared workspace Pod Pool │
│ (Horizontal Pod Autoscaler: 3-100 pods based on active sessions) │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ workspace Pod 1 (100 users capacity) │ │
│ │ ┌────────────────────────────────────────────────────────┐ │ │
│ │ │ theia Multi-User Server │ │ │
│ │ │ - Session Manager (tracks active user sessions) │ │ │
│ │ │ - File Isolation (user dirs: /workspace/{user_id}/) │ │ │
│ │ │ - Process Isolation (cgroups per user) │ │ │
│ │ │ - Resource Quotas (CPU/RAM limits per user) │ │ │
│ │ └────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ Active Sessions (tracked in memory): │ │
│ │ - user-123: /workspace/user-123/ (CPU: 200m, RAM: 512Mi) │ │
│ │ - user-456: /workspace/user-456/ (CPU: 150m, RAM: 384Mi) │ │
│ │ - user-789: /workspace/user-789/ (CPU: 300m, RAM: 768Mi) │ │
│ │ ... (up to 100 concurrent users) │ │
│ │ │ │
│ │ Resources: 16Gi RAM, 8 CPU cores │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ workspace Pod 2 (100 users capacity) │ │
│ │ ... (same structure as Pod 1) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ workspace Pod N (100 users capacity) │ │
│ │ ... (auto-scaled based on load) │ │
│ └──────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
↓ ↓
┌──────────────┐ ┌──────────────────┐
│ Shared PVC │ │ FoundationDB │
│ (1TB) │ │ (Session Routing)│
│ │ │ │
│ /workspace/ │ │ session_id → │
│ ├─ user-123/ │ │ pod_id mapping │
│ ├─ user-456/ │ └──────────────────┘
│ ├─ user-789/ │
│ └─ ... │
└──────────────┘
Session-Based Routing
Instead of: User → Dedicated Pod (1:1)
New: User Session → Load-Balanced Pod (N:1)
1. User logs in → Creates session (JWT token)
↓
2. Frontend connects to backend API
↓
3. Backend queries FDB: "Which pod has capacity?"
↓
4. FDB returns: pod-3 (has 45/100 active sessions)
↓
5. Backend creates session assignment:
- session_id: abc-123
- user_id: user-456
- pod_id: workspace-pod-3
- workspace_path: /workspace/user-456/
↓
6. Frontend WebSocket connects to workspace-pod-3
- Header: Authorization: Bearer <JWT>
- WebSocket URL: wss://ide.coditect.ai/ws?session_id=abc-123
↓
7. workspace pod validates JWT → loads user workspace
↓
8. User sees theia IDE with their files from /workspace/user-456/
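Steps 3-5 above can be sketched as follows; the `PodStatus` shape and the least-loaded selection policy are illustrative assumptions, not the shipped router:

```typescript
interface PodStatus { podId: string; activeSessions: number; capacity: number; }
interface SessionAssignment { sessionId: string; userId: string; podId: string; workspacePath: string; }

// Steps 3-4: pick the least-loaded pod that still has headroom.
function pickPod(pods: PodStatus[]): PodStatus {
  const candidates = pods.filter(p => p.activeSessions < p.capacity);
  if (candidates.length === 0) throw new Error("no capacity: trigger HPA scale-up");
  return candidates.reduce((a, b) => (a.activeSessions <= b.activeSessions ? a : b));
}

// Step 5: build the assignment record that would be written to FDB.
function assignSession(sessionId: string, userId: string, pods: PodStatus[]): SessionAssignment {
  const pod = pickPod(pods);
  return { sessionId, userId, podId: pod.podId, workspacePath: `/workspace/${userId}/` };
}
```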
File Isolation Strategy
Shared PVC with User Directories:
/workspace/
├── user-123/
│ ├── src/
│ ├── .git/
│ └── package.json
├── user-456/
│ ├── src/
│ └── cargo.toml
└── user-789/
└── ...
Mounted to ALL workspace pods as:
- Volume: shared-workspace-pvc
- MountPath: /workspace
- ReadWriteMany (NFS or GlusterFS)
Security: theia process enforces user directory isolation
// theia workspace initialization
import * as path from 'node:path';

const userWorkspace = `/workspace/${user_id}/`;

// Normalize the requested path first so '../' segments cannot escape the user directory
const resolved = path.resolve(requestPath);
if (!resolved.startsWith(userWorkspace)) {
  throw new Error('Access denied');
}
Process Isolation (cgroups)
Linux cgroups limit per-user resource usage:
# Create cgroup for user
cgcreate -g cpu,memory:user-123

# Set limits (cgroup v1; with the default cfs_period_us of 100000,
# a quota of 20000 = 20% of one core = 200m)
echo "20000" > /sys/fs/cgroup/cpu/user-123/cpu.cfs_quota_us # 200m CPU
echo "536870912" > /sys/fs/cgroup/memory/user-123/memory.limit_in_bytes # 512Mi

# Run user's theia process in cgroup
cgexec -g cpu,memory:user-123 node theia-start --user=user-123
Resource Quotas
Per-User Limits (enforced in theia):
interface UserQuota {
cpu_limit: '200m' | '500m' | '1000m', // Based on license tier
memory_limit: '512Mi' | '1Gi' | '2Gi', // Based on license tier
storage_limit: '10Gi' | '50Gi' | '100Gi',
concurrent_sessions: 1 | 5 | 10, // Free vs Pro
}
const QUOTAS = {
free: { cpu: '200m', memory: '512Mi', storage: '10Gi', sessions: 1 },
starter: { cpu: '500m', memory: '1Gi', storage: '50Gi', sessions: 5 },
pro: { cpu: '1000m', memory: '2Gi', storage: '100Gi', sessions: 10 },
};
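Wiring this up, a typed restatement of the tiers plus a small gate on the concurrent-session limit (the `canOpenSession` helper is illustrative, not the shipped enforcement path):

```typescript
type Tier = "free" | "starter" | "pro";

interface Quota { cpu: string; memory: string; storage: string; sessions: number; }

// Same tier table as above, with explicit types.
const QUOTAS: Record<Tier, Quota> = {
  free:    { cpu: "200m",  memory: "512Mi", storage: "10Gi",  sessions: 1 },
  starter: { cpu: "500m",  memory: "1Gi",   storage: "50Gi",  sessions: 5 },
  pro:     { cpu: "1000m", memory: "2Gi",   storage: "100Gi", sessions: 10 },
};

// Reject a new session once the tier's concurrent-session limit is reached.
function canOpenSession(tier: Tier, activeSessions: number): boolean {
  return activeSessions < QUOTAS[tier].sessions;
}
```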
Pod Capacity Planning
Single workspace Pod Capacity:
Pod Resources: 16Gi RAM, 8 CPU cores
Per-User Average:
- RAM: 150Mi (average active user)
- CPU: 100m (average active user)
Theoretical Capacity: 16Gi / 150Mi ≈ 109 users
Practical Capacity: 100 users (~9% headroom for system processes)
Peak Capacity (all users active):
- RAM: 100 users × 150Mi ≈ 14.6Gi (leaves ~1.4Gi for system)
- CPU: 100 users × 100m = 10 cores demanded vs 8 available (1.25x oversubscribed at peak; acceptable because users are rarely all CPU-bound at once)
Scaling Math:
| Users | Pods Needed | Total Resources |
|---|---|---|
| 10 | 1 | 16Gi RAM, 8 CPU |
| 100 | 1 | 16Gi RAM, 8 CPU |
| 1,000 | 10 | 160Gi RAM, 80 CPU |
| 10,000 | 100 | 1.6Ti RAM, 800 CPU |
| 100,000 | 1000 | 16Ti RAM, 8000 CPU |
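The sizing rule behind this math can be expressed directly. CPU is deliberately oversubscribed at peak, so pods are sized by RAM and the 100-user session cap (a sketch using the figures above):

```typescript
// Pod and per-user sizing figures from the capacity planning above.
const POD = { ramMi: 16 * 1024, maxUsers: 100 };
const PER_USER = { ramMi: 150 };

// Pods needed for a user count. CPU is intentionally oversubscribed at
// peak, so the binding constraints are RAM and the per-pod session cap.
function podsNeeded(users: number): number {
  const byRam = Math.ceil((users * PER_USER.ramMi) / POD.ramMi);
  const byCap = Math.ceil(users / POD.maxUsers);
  return Math.max(byRam, byCap);
}
```

For example, podsNeeded(1000) gives 10 and podsNeeded(10000) gives 100, matching the rows above.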
Scaling Plan by User Count
Scale 1: 10 Users (MVP / Beta)
Architecture:
- workspace Pods: 1-2 (for redundancy)
- Backend API Pods: 3
- FoundationDB Pods: 3
- GKE Nodes: 2 (n1-standard-8: 8 vCPU, 30Gi RAM each)
Resources:
- Total RAM: 60Gi
- Total CPU: 16 cores
- Total Storage: 100Gi (shared PVC)
Cost: ~$437/month
- GKE nodes: 2 × $200 = $400
- LoadBalancer: $20
- Storage: 100Gi × $0.17 = $17
- Total: ~$437/month
Bottlenecks: None
Status: ✅ Current design is optimal
Scale 2: 100 Users
Architecture:
- workspace Pods: 2-5 (HPA: min=2, max=5)
- Backend API Pods: 3-10 (HPA: min=3, max=10)
- FoundationDB Pods: 3
- GKE Nodes: 3-5 (n1-standard-8)
Resources:
- Total RAM: 90-150Gi
- Total CPU: 24-40 cores
- Total Storage: 1Ti (shared PVC)
Cost: ~$1,200/month
- GKE nodes: 5 × $200 = $1,000
- LoadBalancer: $20
- Storage: 1Ti × $0.17 = $173
- Total: ~$1,193/month
Revenue (assuming 20% paid conversion @ $29/mo):
- 100 users × 20% × $29 = $580/month
- Margin: -$613/month (loss leader during growth)
Bottlenecks: None
Status: ✅ Multi-tenant design handles easily
Scale 3: 1,000 Users ⚠️ CRITICAL THRESHOLD
Architecture:
- workspace Pods: 10-20 (HPA: min=10, max=20)
- Backend API Pods: 5-15 (HPA: min=5, max=15)
- FoundationDB Pods: 5 (with read replicas)
- Redis Cache: 3 nodes (for session routing)
- GKE Nodes: 10-15 (n1-standard-16: 16 vCPU, 60Gi RAM)
Resources:
- Total RAM: 600-900Gi
- Total CPU: 160-240 cores
- Total Storage: 10Ti (shared PVC or distributed storage)
Cost: ~$11,800/month
- GKE nodes: 15 × $660 = $9,900
- LoadBalancer: $20
- Storage: 10Ti × $0.17 = $1,730
- Redis: 3 × $50 = $150
- Total: ~$11,800/month
Revenue (30% paid conversion @ $29/mo):
- 1,000 users × 30% × $29 = $8,700/month
- Margin: -$3,100/month (still loss, but improving)
Bottlenecks:
- ⚠️ FoundationDB Write Throughput: Approaching 100K ops/sec limit
- Mitigation: Add read replicas, cache in Redis
- ⚠️ Shared PVC Performance: 10K IOPS limit on single PVC
- Mitigation: Use multiple PVCs (shard by user_id hash)
- ⚠️ Session Routing Latency: FDB queries for session assignment
- Mitigation: Cache session → pod mapping in Redis
Required Changes:
- ✅ Implement Redis caching layer
- ✅ Shard PVCs (10 PVCs × 1Ti each instead of 1 × 10Ti)
- ✅ Add FDB read replicas (3 write nodes + 5 read replicas)
Status: ⚠️ Need optimizations listed above
Scale 4: 10,000 Users ⚠️ SECOND CRITICAL THRESHOLD
Architecture:
- workspace Pods: 100-200 (HPA: min=100, max=200)
- Backend API Pods: 20-50 (HPA: min=20, max=50)
- FoundationDB Pods: 10 (5 write + 5 read replicas)
- Redis Cache: 6 nodes (sharded)
- GKE Cluster: 2 clusters (US + EU for geo-distribution)
- CDN: Cloudflare for static assets
- GKE Nodes per cluster: 50-100 (n1-standard-16)
Resources (per cluster):
- Total RAM: 3-6Ti
- Total CPU: 800-1600 cores
- Total Storage: 100Ti (distributed across 100 PVCs)
Cost: ~$150,700/month
- GKE nodes: 2 clusters × 100 nodes × $660 = $132,000
- LoadBalancer: 2 × $20 = $40
- Storage: 100Ti × $0.17 = $17,000
- Redis: 6 × $200 = $1,200
- CDN: $500
- Total: ~$150,740/month
Revenue (40% paid conversion @ $49 avg):
- 10,000 users × 40% × $49 = $196,000/month
- Margin: +$45,260/month ✅ PROFITABLE
Bottlenecks:
- ⚠️ FDB Cluster Capacity: Need horizontal sharding
- Mitigation: Shard by tenant_id (multiple FDB clusters)
- ⚠️ Network I/O: 100 workspace pods × 100 users = high traffic
- Mitigation: CDN for static assets, WebSocket connection pooling
- ⚠️ Kubernetes API Load: Managing 200 pods across 2 clusters
- Mitigation: ArgoCD with eventual consistency, batch operations
Required Changes:
- ✅ Multi-region deployment (US + EU)
- ✅ FDB sharding by tenant_id (5 FDB clusters)
- ✅ CDN for static assets (Cloudflare)
- ✅ WebSocket connection pooling
- ✅ Dedicated GKE cluster per region
Status: ⚠️ Significant infrastructure investment required
Scale 5: 100,000 Users ❌ REQUIRES RE-ARCHITECTURE
Problem: GKE-based architecture hits hard limits:
- Cost: $1.5M+/month on GKE alone
- Management: 2,000 workspace pods is operationally complex
- Latency: Global users need <100ms response times
Solution: Hybrid Serverless Architecture
┌────────────────────────────────────────────────────────────────┐
│ Global Architecture │
│ │
│ Cloudflare Workers (Edge) │
│ ├─ Static assets (CDN) │
│ ├─ WebSocket proxy (routing to nearest region) │
│ └─ Auth token validation │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ US-WEST │ │ US-EAST │ │ EU │ │
│ │ │ │ │ │ │ │
│ │ GKE Cluster │ │ GKE Cluster │ │ GKE Cluster │ │
│ │ - 50 pods │ │ - 50 pods │ │ - 50 pods │ │
│ │ - 5K users │ │ - 5K users │ │ - 5K users │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ↓ ↓ ↓ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Global FoundationDB (Multi-Region) │ │
│ │ - 15 nodes (5 per region) │ │
│ │ - Cross-region replication │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Serverless workspaces (Cold Start Optimization) │
│ ├─ AWS Lambda (Firecracker VMs) │
│ ├─ Cloud Run (for infrequent users) │
│ └─ GKE (for power users only) │
└────────────────────────────────────────────────────────────────┘
Cost: ~$100,000/month
- Multi-region GKE: 3 clusters × $20K = $60K
- Cloudflare Enterprise: $5K
- FDB multi-region: $15K
- Serverless compute (Lambda/Cloud Run): $20K
- Total: ~$100K/month
Revenue (50% paid conversion @ $49 avg):
- 100,000 users × 50% × $49 = $2,450,000/month
- Margin: +$2,350,000/month ✅ HIGHLY PROFITABLE
Key Changes:
- ✅ Serverless workspaces for cold users (90% of users)
- AWS Lambda with Firecracker VMs (200ms cold start)
- Cloud Run for theia instances (scale to zero)
- ✅ GKE for power users only (10% of users needing persistent pods)
- ✅ Cloudflare Workers for edge routing (reduce latency to <50ms globally)
- ✅ CockroachDB as FDB alternative (geo-distributed SQL)
- ✅ S3/GCS for cold storage (move inactive workspaces to object storage)
Status: ❌ Requires 6-12 months engineering effort
Resource Optimization
Memory Optimization
Problem: theia is memory-heavy (512Mi-2Gi per user)
Solutions:
1. Lazy Loading: Don't load all extensions on startup

```typescript
// Only load extensions when user opens relevant file type
if (file.endsWith('.py')) {
  await loadExtension('ms-python.python');
}
```

2. Shared Extension Host: Single extension host for all users in pod

```typescript
// Instead of: 1 extension host per user (100 × 200Mi = 20Gi)
// Use: 1 shared extension host (1 × 2Gi = 2Gi)
```

3. Aggressive GC: Force garbage collection for idle users

```typescript
if (user.idleTime > 5 * 60 * 1000) { // 5 min idle
  global.gc(); // Force GC (requires node --expose-gc)
}
```
Result: Reduce per-user memory from 512Mi → 150Mi (70% reduction)
CPU Optimization
Problem: theia LSP servers are CPU-intensive
Solutions:
1. Throttle LSP for idle users: Pause LSP when user inactive

```typescript
if (user.idleTime > 2 * 60 * 1000) { // 2 min idle
  languageServer.pause();
}
```

2. Share TypeScript Server: Single tsserver for all TypeScript projects

```typescript
// Single tsserver handles multiple projects (project references)
```

3. Offload to Backend: Move heavy operations to backend API

```typescript
// Code formatting, linting via backend API (not in browser)
await fetch('/api/format', { method: 'POST', body: code });
```
Result: Reduce per-user CPU from 200m → 100m (50% reduction)
Storage Optimization
Problem: 10K users × 10Gi = 100Ti storage ($17K/month)
Solutions:
1. Tiered Storage:
   - Hot Storage (SSD PVC): Last 7 days of files (1Gi per user)
   - Warm Storage (HDD PVC): Last 30 days of files (5Gi per user)
   - Cold Storage (GCS): Inactive files (archive)

2. Deduplication: Shared node_modules and .git objects

```shell
# Instead of: 10K × 500Mi node_modules = 5Ti
# Use: Shared node_modules with symlinks = 500Mi
```

3. Compression: Enable PVC compression (30-50% savings)

```yaml
storageClassName: compressed-ssd
```
Result: Reduce per-user storage from 10Gi → 2Gi (80% reduction)
Cost Analysis
Cost Breakdown by Scale
| Users | GKE Nodes | Storage | Redis | Total/Month | Revenue/Month | Margin |
|---|---|---|---|---|---|---|
| 10 | $400 | $17 | $0 | $437 | $58 (20% @ $29) | -$379 |
| 100 | $1,000 | $173 | $0 | $1,193 | $580 (20% @ $29) | -$613 |
| 1,000 | $9,900 | $1,730 | $150 | $11,780 | $8,700 (30% @ $29) | -$3,080 |
| 10,000 | $132,000 | $17,000 | $1,200 | $150,200 | $196,000 (40% @ $49) | +$45,800 ✅ |
| 100,000 | $60,000 | $34,000 | $5,000 | $99,000 | $2,450,000 (50% @ $49) | +$2,351,000 ✅ |
Key Insights:
- Break-even: ~8,000 users (with 40% conversion)
- Profitable scale: 10,000+ users
- 100K users: 95% profit margin (after infrastructure costs)
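The margin column in the table above follows from one formula (a sketch; conversion rates and prices are the assumptions stated in each row):

```typescript
// Margin model behind each table row: monthly revenue
// (users × conversion × price) minus monthly infrastructure cost.
function monthlyMargin(users: number, conversion: number, price: number, cost: number): number {
  return users * conversion * price - cost;
}
```

For the 10,000-user row, monthlyMargin(10000, 0.40, 49, 150200) gives +$45,800, matching the table.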
Revenue Projections
Conservative Model (assumes slow growth):
Month 1-3 (Beta): 100 users, 10% paid → $290/month
Month 4-6: 500 users, 20% paid → $2,900/month
Month 7-12: 2,000 users, 30% paid → $17,400/month
Year 2: 10,000 users, 40% paid → $196,000/month
Year 3: 50,000 users, 50% paid → $1,225,000/month
Aggressive Model (viral growth):
Month 1-3 (Beta): 1,000 users, 10% paid → $2,900/month
Month 4-6: 5,000 users, 20% paid → $29,000/month
Month 7-12: 20,000 users, 30% paid → $174,000/month
Year 2: 100,000 users, 40% paid → $1,960,000/month
Database Scaling (FoundationDB)
FoundationDB Performance Characteristics
Single FDB Cluster Limits:
- Write Throughput: 100,000 ops/sec
- Read Throughput: 1,000,000 ops/sec (with read replicas)
- Storage: 100Ti+ (horizontally scalable)
- Latency: <5ms (single-region), <50ms (multi-region)
Scaling Strategy by User Count
10-100 Users:
FDB Cluster: 3 nodes (simple triple replication)
- Write capacity: 100K ops/sec → handles 10K requests/sec easily
- Read capacity: 1M ops/sec → handles 100K requests/sec
100-1,000 Users:
FDB Cluster: 5 nodes (3 write + 2 read replicas)
- Write capacity: 100K ops/sec
- Read capacity: 2M ops/sec (read replicas handle reads)
1,000-10,000 Users:
FDB Cluster: 10 nodes (5 write + 5 read replicas)
- Write capacity: 100K ops/sec
- Read capacity: 5M ops/sec
Redis Caching Layer: 3 nodes
- Cache session → pod mappings (90% hit rate)
- Cache user metadata (email, name, avatar)
- TTL: 5 minutes
10,000-100,000 Users:
Multi-Region FDB:
- US-WEST: 5 nodes (3 write + 2 read)
- US-EAST: 5 nodes (3 write + 2 read)
- EU: 5 nodes (3 write + 2 read)
Sharding Strategy:
- Shard by tenant_id hash (0-4 → cluster 1, 5-9 → cluster 2, etc.)
- Each cluster handles 20K users
Redis Caching: 6 nodes (sharded)
- 95% cache hit rate
- Reduces FDB load by 20x
FDB Key Design for Scaling
Current Key Schema (works up to 10K users):
users/{user_id} # User record
users/{user_id}/sessions/{session_id} # User sessions
tenants/{tenant_id}/users/{user_id} # Tenant membership
sessions/{session_id} # Session data
workspaces/{assignment_id} # workspace assignments
Optimized Key Schema (for 100K+ users):
# Shard by tenant_id (first 2 chars of UUID)
shard_ab/users/{user_id} # Users starting with ab-xxx
shard_cd/users/{user_id} # Users starting with cd-xxx
...
# Hot/cold data separation
hot/sessions/{session_id} # Active sessions (TTL: 1 hour)
cold/sessions/{session_id} # Archived sessions (TTL: 30 days)
# Denormalized for read performance
user_sessions/{user_id} # All sessions for user (JSON array)
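A sketch of building the sharded key prefix from the optimized schema above, where the shard is the first two characters of the tenant UUID (the helper name and the lowercase normalization are illustrative, not the real key-builder API):

```typescript
// Build a sharded FDB user key: the shard prefix comes from the first
// two characters of the tenant UUID, per the schema above.
function shardedUserKey(tenantId: string, userId: string): string {
  const shard = tenantId.slice(0, 2).toLowerCase();
  return `shard_${shard}/users/${userId}`;
}
```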
Network Architecture
Ingress Strategy by Scale
Current Design (1 user = 1 Ingress rule):
# ❌ Doesn't scale past 1K users
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: workspace-user-123
spec:
  rules:
  - host: abc123.coditect.ai
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: workspace-abc123-service
            port:
              number: 3000
Optimized Design (1 Ingress, path-based routing):
# ✅ Scales to 100K+ users
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: coditect-workspace-ingress
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$3
spec:
  rules:
  - host: ide.coditect.ai
    http:
      paths:
      - path: /ws/([^/]+)(/|$)(.*)
        pathType: ImplementationSpecific # regex paths require this + use-regex
        backend:
          service:
            name: workspace-router-service # Routes to correct pod
            port:
              number: 8080
workspace Router Service:
// workspace-router/src/main.rs
async fn handle_websocket(
    ws: WebSocket,
    session_id: String,
    fdb_client: Arc<Database>,
    redis_client: Arc<RedisClient>, // cache for session → pod mappings
) -> Result<()> {
    // 1. Validate session (fails fast if the session is unknown or expired)
    let _session = fdb_client.get_session(&session_id).await?;

    // 2. Get pod assignment (Redis cache first, fall back to FDB)
    let pod_id = match redis_client
        .get(&format!("session:{}:pod", session_id))
        .await?
    {
        Some(pod_id) => pod_id,
        None => fdb_client.get_session_pod(&session_id).await?,
    };

    // 3. Proxy WebSocket to correct pod
    let pod_url = format!(
        "ws://workspace-pod-{}.coditect-app.svc.cluster.local:3000",
        pod_id
    );
    proxy_websocket(ws, &pod_url).await?;
    Ok(())
}
CDN Strategy
Static Assets (Cloudflare CDN):
Frontend Assets:
- HTML/CSS/JS bundles
- theia static files
- Monaco editor assets
- VS Code extensions
Edge Locations: 275+ worldwide
Latency: <50ms globally
Cost: $500/month (up to 100K users)
Dynamic Content (Direct to Backend):
API Requests:
- Authentication (POST /api/auth/login)
- workspace provisioning (POST /api/workspaces/provision)
- File operations (GET/PUT /api/files/*)
No CDN (always fresh data)
Migration Strategy
Phase 1: Current Design → Multi-Tenant (Weeks 1-4)
Current: 1 user = 1 pod
Target: 100 users per pod

Steps:

1. Build multi-tenant theia server
   - Session manager (track active users in pod)
   - File isolation (enforce user directory boundaries)
   - Process isolation (cgroups per user)

2. Build workspace router service
   - Query FDB for session → pod mapping
   - Proxy WebSocket to correct pod

3. Deploy new workspace pods (parallel to old)
   - New users → multi-tenant pods
   - Old users → keep dedicated pods (migrate later)

4. Migrate existing users (1 per day)
   - Copy user files to shared PVC
   - Update session assignment
   - Delete old dedicated pod

Timeline: 4 weeks
Risk: Low (old users unaffected during migration)
Phase 2: Add Redis Caching (Week 5)
Purpose: Reduce FDB load for session routing
Steps:
- Deploy Redis cluster (3 nodes, sharded)
- Update workspace router to check Redis first
- Cache session → pod mappings (TTL: 5 min)
- Monitor cache hit rate (target: 90%+)
Timeline: 1 week
Impact: 10x reduction in FDB queries
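The cache-aside lookup in Phase 2, sketched with minimal stand-ins for the clients (the `KV` interface and `fdbLookup` callback are assumptions, not the real Redis/FDB APIs):

```typescript
// Minimal key-value interface standing in for a real Redis client.
interface KV {
  get(k: string): Promise<string | null>;
  set(k: string, v: string, ttlSec: number): Promise<void>;
}

// Cache-aside: check Redis first, fall back to FDB on a miss, then
// populate the cache with a 5-minute TTL.
async function resolvePod(
  sessionId: string,
  redis: KV,
  fdbLookup: (id: string) => Promise<string>,
): Promise<string> {
  const key = `session:${sessionId}:pod`;
  const cached = await redis.get(key);
  if (cached !== null) return cached;        // cache hit (target: 90%+)
  const podId = await fdbLookup(sessionId);  // cache miss → query FDB
  await redis.set(key, podId, 300);          // TTL: 5 minutes
  return podId;
}
```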
Phase 3: Shard PVCs (Week 6-7)
Purpose: Avoid single PVC bottleneck at 10K+ users

Steps:

1. Create 10 PVCs (1Ti each) instead of 1 × 10Ti
2. Shard by user_id hash: user_id % 10 = pvc_number
3. Mount all 10 PVCs to each workspace pod:

```yaml
volumeMounts:
- name: pvc-0
  mountPath: /workspace/shard-0
- name: pvc-1
  mountPath: /workspace/shard-1
# ... through pvc-9
```

4. Router calculates the shard path: /workspace/shard-{user_id % 10}/{user_id}/

Timeline: 2 weeks
Impact: 10x IOPS capacity (10K → 100K IOPS)
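Since user ids are UUID strings, "user_id % 10" really means a stable hash mod 10. A sketch (the sum-style hash is purely illustrative; any stable hash works as long as every component uses the same one):

```typescript
// Map a user id string to one of 10 PVC shards via a stable hash.
function shardOf(userId: string, shards = 10): number {
  let h = 0;
  for (const ch of userId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % shards;
}

// Workspace path on the sharded layout, e.g. /workspace/shard-4/user-123/
function workspacePath(userId: string): string {
  return `/workspace/shard-${shardOf(userId)}/${userId}/`;
}
```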
Phase 4: Multi-Region (Week 8-12)
Purpose: Global latency <100ms, disaster recovery
Steps:
- Deploy GKE cluster in EU (parallel to US)
- Deploy FDB cluster in EU (replicate from US)
- Deploy Cloudflare Workers for geo-routing
- Route EU users → EU cluster, US users → US cluster
- Enable FDB cross-region replication (async)
Timeline: 4 weeks
Impact: <100ms latency for EU users, 99.99% uptime
Monitoring & Observability
Key Metrics by Scale
10-100 Users:
- Pod CPU/memory usage
- API request latency (p95)
- FoundationDB query latency
1,000 Users:
- All above +
- workspace pod utilization (users per pod)
- Redis cache hit rate
- Session provisioning time (p95)
10,000 Users:
- All above +
- FDB replication lag
- PVC IOPS usage
- Cross-region latency
- Cost per active user
100,000 Users:
- All above +
- Serverless cold start latency
- Global user distribution
- Edge cache hit rate (Cloudflare)
Alerting Thresholds
alerts:
# Critical
- name: workspacePodsCrashing
threshold: pod_restarts > 5 in 10 minutes
severity: critical
action: Page on-call engineer
- name: FoundationDBDown
threshold: fdb_available_nodes < 3
severity: critical
action: Page on-call + auto-failover
# Warning
- name: HighPodUtilization
threshold: users_per_pod > 90
severity: warning
action: Trigger pod autoscaling
- name: LowCacheHitRate
threshold: redis_hit_rate < 80%
severity: warning
action: Investigate cache keys
# Info
- name: ScalingEvent
threshold: pod_count increased by 5+
severity: info
action: Log for capacity planning
Conclusion
Scaling Summary
| Scale | Architecture | Cost/Month | Margin | Status |
|---|---|---|---|---|
| 10 users | Current design (1:1 pods) | $437 | -$379 | ✅ Ready |
| 100 users | Multi-tenant (2-5 pods) | $1,193 | -$613 | ✅ Ready with multi-tenant |
| 1,000 users | Multi-tenant + Redis | $11,780 | -$3,080 | ⚠️ Need Redis + PVC sharding |
| 10,000 users | Multi-region + FDB sharding | $150,200 | +$45,800 | ⚠️ 3-4 months work |
| 100,000 users | Serverless hybrid | $99,000 | +$2,351,000 | ❌ 6-12 months work |
Critical Decisions
For MVP (10-100 users):
- ✅ Keep current 1:1 pod design (simpler, faster to market)
- ✅ Plan multi-tenant migration for Month 3
For Growth (100-1,000 users):
- ✅ Implement multi-tenant pods (Week 5-8)
- ✅ Add Redis caching (Week 9)
For Scale (1,000-10,000 users):
- ✅ Shard PVCs (Month 4)
- ✅ Multi-region deployment (Month 5-6)
- ✅ FDB sharding (Month 6-7)
For Massive Scale (100,000+ users):
- ✅ Serverless workspaces (Year 2)
- ✅ Global edge network (Year 2)
Recommended Path
Month 1-3: Launch MVP with 1:1 pods
- Focus on product-market fit
- Cost is manageable (<$500/month)
- Avoid premature optimization
Month 4-6: Migrate to multi-tenant
- Start when you hit 100-200 users
- Reduces cost by 10x
- Enables path to 1,000+ users
Month 7-12: Scale to 1,000-10,000 users
- Add Redis, shard PVCs
- Deploy multi-region
- Profitable at 8,000+ users
Year 2: Re-architect for 100,000+ users
- Serverless workspaces
- Global edge
- 95%+ profit margins
Document Status: ✅ Complete
Last Updated: 2025-10-07
Next Review: Month 3 (when hitting 100+ users)
Owner: Engineering + Finance Teams