Coditect V5 - Scaling Architecture: 10 → 100K+ Users
Date: 2025-10-07
Purpose: Comprehensive scaling analysis and architecture for production growth
Target: Support 10, 100, 1,000, 10,000, and 100,000+ concurrent users
Executive Summary
Current Issue: Our current architecture has a critical scaling problem - one workspace pod per user doesn't scale past ~1,000 users.
The Problem:
- ❌ 1 user = 1 dedicated pod (theia + Sidecar in isolated namespace)
- ❌ At 1,000 users = 1,000 pods = massive resource waste
- ❌ At 10,000 users = 10,000 pods = impossible on GKE (cost + management nightmare)
- ❌ Each pod consumes 512Mi-2Gi RAM and 500m-2000m CPU → roughly $5,200/month per 100 users (see cost math below)
The Solution:
- ✅ Multi-tenant workspace pods (multiple users per pod)
- ✅ Session-based isolation (not pod-based isolation)
- ✅ Horizontal pod autoscaling (scale pods based on active sessions)
- ✅ Resource pooling (share infrastructure, isolate data)
Scaling Targets:
| Users | workspace Pods | Cost/Month | Status |
|---|---|---|---|
| 10 | 2-3 | ~$440 | ✅ MVP Ready |
| 100 | 5-10 | ~$1,200 | ✅ Current Design |
| 1,000 | 10-20 | ~$11,800 | ⚠️ Need Multi-Tenancy |
| 10,000 | 100-200 | ~$150,000 | ⚠️ Need Optimization |
| 100,000 | 500-1000 | ~$99,000 (serverless hybrid) | ❌ Need Re-Architecture |
This document provides:
- Scaling bottlenecks analysis
- Multi-tenant workspace architecture
- Resource optimization strategies
- Cost projections at each scale
- Migration path from current design
Table of Contents
- Current Architecture (1 User = 1 Pod)
- Scaling Bottlenecks
- Multi-Tenant workspace Architecture
- Scaling Plan by User Count
- Resource Optimization
- Cost Analysis
- Database Scaling (FoundationDB)
- Network Architecture
- Migration Strategy
- Monitoring & Observability
Current Architecture (1 User = 1 Pod)
What We Designed (V5 Provisioning Architecture)
User Registration
↓
Provisioning Controller
↓
┌─────────────────────────────────────────────────────────────┐
│ Create Dedicated Resources for EACH User: │
│ │
│ 1. Namespace: user-{user_id} │
│ 2. ServiceAccount + RBAC │
│ 3. PVC (10GB): workspace-pvc │
│ 4. Pod: workspace-{user_id} │
│ ├─ theia container (512Mi-2Gi RAM, 500m-2000m CPU) │
│ └─ WS Sidecar container (128Mi-256Mi RAM, 100m-200m CPU)│
│ 5. Service: workspace-{user_id}-service │
│ 6. Ingress: {user_id}.coditect.ai │
│ │
│ Total Resources Per User: │
│ - RAM: 640Mi - 2.25Gi │
│ - CPU: 600m - 2200m │
│ - Storage: 10GB PVC │
└─────────────────────────────────────────────────────────────┘
Why This Doesn't Scale
Problem 1: Resource Waste
- Most users are idle 90% of the time
- Dedicated pod runs 24/7 even when user is offline
- Utilization: 10-20% average (80-90% waste)
Problem 2: Cost Explosion
Cost per user per month (GKE):
- CPU: 2000m (2 cores) × $0.031/core-hour × 730 hours = $45.26
- RAM: 2Gi × $0.0033/GB-hour × 730 hours = $4.82
- Storage: 10GB × $0.17/GB-month = $1.70
- Total: ~$52/user/month
At scale:
- 100 users: $5,200/month
- 1,000 users: $52,000/month
- 10,000 users: $520,000/month ❌ UNSUSTAINABLE
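The per-user arithmetic above can be checked with a small helper (a sketch; the unit prices are the illustrative GKE figures used in this section, not live pricing):

```typescript
// Illustrative GKE unit prices from the analysis above (not live pricing).
const PRICE = {
  cpuPerCoreHour: 0.031,   // $/core-hour
  ramPerGiHour: 0.0033,    // $/GiB-hour
  storagePerGiMonth: 0.17, // $/GiB-month
  hoursPerMonth: 730,
};

// Monthly cost of one dedicated workspace pod at the given size.
function perUserMonthlyCost(cpuCores: number, ramGi: number, storageGi: number): number {
  const cpu = cpuCores * PRICE.cpuPerCoreHour * PRICE.hoursPerMonth;
  const ram = ramGi * PRICE.ramPerGiHour * PRICE.hoursPerMonth;
  const storage = storageGi * PRICE.storagePerGiMonth;
  return cpu + ram + storage;
}
```

With 2 cores, 2Gi RAM, and a 10Gi PVC this returns ~$51.78, matching the ~$52/user/month figure above.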
Problem 3: Kubernetes Limits
- GKE node pools: Max 1000 nodes per cluster
- Pods per node: ~110 pods
- Max pods per cluster: ~110,000 (but cost prohibitive at 10K users)
Problem 4: Management Overhead
- 1,000 users = 1,000 namespaces to monitor
- 1,000 PVCs to backup
- 1,000 Ingress rules to manage
- 1,000 SSL certificates to rotate
Problem 5: Cold Start Latency
- New user signup → provision pod → 2-3 minutes wait
- User expects instant access (like VSCode.dev)
Scaling Bottlenecks
Bottleneck Analysis by Component
| Component | Bottleneck at Scale | Impact | Mitigation |
|---|---|---|---|
| workspace Pods | 1:1 user:pod ratio | Critical | Multi-tenant pods (100 users per pod) |
| FoundationDB | Write throughput (100K ops/sec) | High | Sharding, read replicas, caching |
| Kubernetes API | 1000s of namespace operations | Medium | Batch operations, eventual consistency |
| Ingress | 10K+ Ingress rules | Medium | Wildcard DNS, shared Ingress |
| PVC Storage | 10K PVCs × 10GB = 100TB | High | Shared PVCs, S3 for cold storage |
| Backend API | Request throughput | Medium | Horizontal scaling (3-100 pods) |
| Network | Pod-to-pod traffic | Low | GKE native networking |
Critical Scaling Thresholds
10 users: ✅ Current design works fine
└─ 10 pods, 10 namespaces, minimal cost
100 users: ✅ Still manageable
└─ 100 pods, but cost is $5K/month (high for revenue)
1,000 users: ⚠️ FIRST CRITICAL THRESHOLD
└─ Need multi-tenant pods (100 users/pod = 10 pods)
└─ Need session-based routing (not namespace-based)
└─ Need shared PVCs with user directories
10,000 users: ⚠️ SECOND CRITICAL THRESHOLD
└─ Need FDB read replicas + caching (Redis)
└─ Need CDN for static assets
└─ Need multiple GKE clusters (geo-distributed)
100,000 users: ❌ REQUIRES RE-ARCHITECTURE
└─ Need serverless workspaces (AWS Lambda, Cloud Run)
└─ Need global CDN (Cloudflare Workers)
└─ Need multi-region FDB + CockroachDB
Multi-Tenant workspace Architecture
New Design: Shared workspace Pods
Key Principle: Multiple users share workspace pods, isolated by sessions not pods.
┌─────────────────────────────────────────────────────────────────────┐
│ Shared workspace Pod Pool │
│ (Horizontal Pod Autoscaler: 3-100 pods based on active sessions) │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ workspace Pod 1 (100 users capacity) │ │
│ │ ┌────────────────────────────────────────────────────────┐ │ │
│ │ │ theia Multi-User Server │ │ │
│ │ │ - Session Manager (tracks active user sessions) │ │ │
│ │ │ - File Isolation (user dirs: /workspace/{user_id}/) │ │ │
│ │ │ - Process Isolation (cgroups per user) │ │ │
│ │ │ - Resource Quotas (CPU/RAM limits per user) │ │ │
│ │ └────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ Active Sessions (tracked in memory): │ │
│ │ - user-123: /workspace/user-123/ (CPU: 200m, RAM: 512Mi) │ │
│ │ - user-456: /workspace/user-456/ (CPU: 150m, RAM: 384Mi) │ │
│ │ - user-789: /workspace/user-789/ (CPU: 300m, RAM: 768Mi) │ │
│ │ ... (up to 100 concurrent users) │ │
│ │ │ │
│ │ Resources: 16Gi RAM, 8 CPU cores │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ workspace Pod 2 (100 users capacity) │ │
│ │ ... (same structure as Pod 1) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ workspace Pod N (100 users capacity) │ │
│ │ ... (auto-scaled based on load) │ │
│ └──────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
↓ ↓
┌──────────────┐ ┌──────────────────┐
│ Shared PVC │ │ FoundationDB │
│ (1TB) │ │ (Session Routing)│
│ │ │ │
│ /workspace/ │ │ session_id → │
│ ├─ user-123/ │ │ pod_id mapping │
│ ├─ user-456/ │ └──────────────────┘
│ ├─ user-789/ │
│ └─ ... │
└──────────────┘
Session-Based Routing
Instead of: User → Dedicated Pod (1:1)
New: User Session → Load-Balanced Pod (N:1)
1. User logs in → Creates session (JWT token)
↓
2. Frontend connects to backend API
↓
3. Backend queries FDB: "Which pod has capacity?"
↓
4. FDB returns: pod-3 (has 45/100 active sessions)
↓
5. Backend creates session assignment:
- session_id: abc-123
- user_id: user-456
- pod_id: workspace-pod-3
- workspace_path: /workspace/user-456/
↓
6. Frontend WebSocket connects to workspace-pod-3
- Header: Authorization: Bearer <JWT>
- WebSocket URL: wss://ide.coditect.ai/ws?session_id=abc-123
↓
7. workspace pod validates JWT → loads user workspace
↓
8. User sees theia IDE with their files from /workspace/user-456/
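Steps 3-5 above can be sketched as follows; the `PodStatus` shape and the least-loaded selection policy are illustrative assumptions, not the shipped router:

```typescript
interface PodStatus { podId: string; activeSessions: number; capacity: number; }
interface SessionAssignment { sessionId: string; userId: string; podId: string; workspacePath: string; }

// Steps 3-4: pick the least-loaded pod that still has headroom.
function pickPod(pods: PodStatus[]): PodStatus {
  const candidates = pods.filter(p => p.activeSessions < p.capacity);
  if (candidates.length === 0) throw new Error("no capacity: trigger HPA scale-up");
  return candidates.reduce((a, b) => (a.activeSessions <= b.activeSessions ? a : b));
}

// Step 5: build the assignment record that would be written to FDB.
function assignSession(sessionId: string, userId: string, pods: PodStatus[]): SessionAssignment {
  const pod = pickPod(pods);
  return { sessionId, userId, podId: pod.podId, workspacePath: `/workspace/${userId}/` };
}
```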
File Isolation Strategy
Shared PVC with User Directories:
/workspace/
├── user-123/
│ ├── src/
│ ├── .git/
│ └── package.json
├── user-456/
│ ├── src/
│ └── cargo.toml
└── user-789/
└── ...
Mounted to ALL workspace pods as:
- Volume: shared-workspace-pvc
- MountPath: /workspace
- ReadWriteMany (NFS or GlusterFS)
Security: theia process enforces user directory isolation
// theia workspace initialization
import * as path from 'node:path';

const userWorkspace = `/workspace/${user_id}/`;

// Normalize the requested path first so '../' segments cannot escape the user directory
const resolved = path.resolve(requestPath);
if (!resolved.startsWith(userWorkspace)) {
  throw new Error('Access denied');
}
Process Isolation (cgroups)
Linux cgroups limit per-user resource usage:
# Create cgroup for user
cgcreate -g cpu,memory:user-123

# Set limits (cgroup v1; with the default cfs_period_us of 100000,
# a quota of 20000 = 20% of one core = 200m)
echo "20000" > /sys/fs/cgroup/cpu/user-123/cpu.cfs_quota_us # 200m CPU
echo "536870912" > /sys/fs/cgroup/memory/user-123/memory.limit_in_bytes # 512Mi

# Run user's theia process in cgroup
cgexec -g cpu,memory:user-123 node theia-start --user=user-123
Resource Quotas
Per-User Limits (enforced in theia):
interface UserQuota {
cpu_limit: '200m' | '500m' | '1000m', // Based on license tier
memory_limit: '512Mi' | '1Gi' | '2Gi', // Based on license tier
storage_limit: '10Gi' | '50Gi' | '100Gi',
concurrent_sessions: 1 | 5 | 10, // Free vs Pro
}
const QUOTAS = {
free: { cpu: '200m', memory: '512Mi', storage: '10Gi', sessions: 1 },
starter: { cpu: '500m', memory: '1Gi', storage: '50Gi', sessions: 5 },
pro: { cpu: '1000m', memory: '2Gi', storage: '100Gi', sessions: 10 },
};
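Wiring this up, a typed restatement of the tiers plus a small gate on the concurrent-session limit (the `canOpenSession` helper is illustrative, not the shipped enforcement path):

```typescript
type Tier = "free" | "starter" | "pro";

interface Quota { cpu: string; memory: string; storage: string; sessions: number; }

// Same tier table as above, with explicit types.
const QUOTAS: Record<Tier, Quota> = {
  free:    { cpu: "200m",  memory: "512Mi", storage: "10Gi",  sessions: 1 },
  starter: { cpu: "500m",  memory: "1Gi",   storage: "50Gi",  sessions: 5 },
  pro:     { cpu: "1000m", memory: "2Gi",   storage: "100Gi", sessions: 10 },
};

// Reject a new session once the tier's concurrent-session limit is reached.
function canOpenSession(tier: Tier, activeSessions: number): boolean {
  return activeSessions < QUOTAS[tier].sessions;
}
```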
Pod Capacity Planning
Single workspace Pod Capacity:
Pod Resources: 16Gi RAM, 8 CPU cores
Per-User Average:
- RAM: 150Mi (average active user)
- CPU: 100m (average active user)
Theoretical Capacity: 16Gi / 150Mi ≈ 109 users
Practical Capacity: 100 users (~9% headroom for system processes)
Peak Capacity (all users active):
- RAM: 100 users × 150Mi ≈ 14.6Gi (leaves ~1.4Gi for system)
- CPU: 100 users × 100m = 10 cores demanded vs 8 available (1.25x oversubscribed at peak; acceptable because users are rarely all CPU-bound at once)
Scaling Math:
| Users | Pods Needed | Total Resources |
|---|---|---|
| 10 | 1 | 16Gi RAM, 8 CPU |
| 100 | 1 | 16Gi RAM, 8 CPU |
| 1,000 | 10 | 160Gi RAM, 80 CPU |
| 10,000 | 100 | 1.6Ti RAM, 800 CPU |
| 100,000 | 1000 | 16Ti RAM, 8000 CPU |
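The sizing rule behind this math can be expressed directly. CPU is deliberately oversubscribed at peak, so pods are sized by RAM and the 100-user session cap (a sketch using the figures above):

```typescript
// Pod and per-user sizing figures from the capacity planning above.
const POD = { ramMi: 16 * 1024, maxUsers: 100 };
const PER_USER = { ramMi: 150 };

// Pods needed for a user count. CPU is intentionally oversubscribed at
// peak, so the binding constraints are RAM and the per-pod session cap.
function podsNeeded(users: number): number {
  const byRam = Math.ceil((users * PER_USER.ramMi) / POD.ramMi);
  const byCap = Math.ceil(users / POD.maxUsers);
  return Math.max(byRam, byCap);
}
```

For example, podsNeeded(1000) gives 10 and podsNeeded(10000) gives 100, matching the rows above.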
Scaling Plan by User Count
Scale 1: 10 Users (MVP / Beta)
Architecture:
- workspace Pods: 1-2 (for redundancy)
- Backend API Pods: 3
- FoundationDB Pods: 3
- GKE Nodes: 2 (n1-standard-8: 8 vCPU, 30Gi RAM each)
Resources:
- Total RAM: 60Gi
- Total CPU: 16 cores
- Total Storage: 100Gi (shared PVC)
Cost: ~$437/month
- GKE nodes: 2 × $200 = $400
- LoadBalancer: $20
- Storage: 100Gi × $0.17 = $17
- Total: ~$437/month
Bottlenecks: None
Status: ✅ Current design is optimal
Scale 2: 100 Users
Architecture:
- workspace Pods: 2-5 (HPA: min=2, max=5)
- Backend API Pods: 3-10 (HPA: min=3, max=10)
- FoundationDB Pods: 3
- GKE Nodes: 3-5 (n1-standard-8)
Resources:
- Total RAM: 90-150Gi
- Total CPU: 24-40 cores
- Total Storage: 1Ti (shared PVC)
Cost: ~$1,200/month
- GKE nodes: 5 × $200 = $1,000
- LoadBalancer: $20
- Storage: 1Ti × $0.17 = $173
- Total: ~$1,193/month
Revenue (assuming 20% paid conversion @ $29/mo):
- 100 users × 20% × $29 = $580/month
- Margin: -$613/month (loss leader during growth)
Bottlenecks: None
Status: ✅ Multi-tenant design handles easily
Scale 3: 1,000 Users ⚠️ CRITICAL THRESHOLD
Architecture:
- workspace Pods: 10-20 (HPA: min=10, max=20)
- Backend API Pods: 5-15 (HPA: min=5, max=15)
- FoundationDB Pods: 5 (with read replicas)
- Redis Cache: 3 nodes (for session routing)
- GKE Nodes: 10-15 (n1-standard-16: 16 vCPU, 60Gi RAM)
Resources:
- Total RAM: 600-900Gi
- Total CPU: 160-240 cores
- Total Storage: 10Ti (shared PVC or distributed storage)
Cost: ~$11,800/month
- GKE nodes: 15 × $660 = $9,900
- LoadBalancer: $20
- Storage: 10Ti × $0.17 = $1,730
- Redis: 3 × $50 = $150
- Total: ~$11,800/month
Revenue (30% paid conversion @ $29/mo):
- 1,000 users × 30% × $29 = $8,700/month
- Margin: -$3,100/month (still loss, but improving)
Bottlenecks:
- ⚠️ FoundationDB Write Throughput: Approaching 100K ops/sec limit
- Mitigation: Add read replicas, cache in Redis
- ⚠️ Shared PVC Performance: 10K IOPS limit on single PVC
- Mitigation: Use multiple PVCs (shard by user_id hash)
- ⚠️ Session Routing Latency: FDB queries for session assignment
- Mitigation: Cache session → pod mapping in Redis
Required Changes:
- ✅ Implement Redis caching layer
- ✅ Shard PVCs (10 PVCs × 1Ti each instead of 1 × 10Ti)
- ✅ Add FDB read replicas (3 write nodes + 5 read replicas)
Status: ⚠️ Need optimizations listed above
Scale 4: 10,000 Users ⚠️ SECOND CRITICAL THRESHOLD
Architecture:
- workspace Pods: 100-200 (HPA: min=100, max=200)
- Backend API Pods: 20-50 (HPA: min=20, max=50)
- FoundationDB Pods: 10 (5 write + 5 read replicas)
- Redis Cache: 6 nodes (sharded)
- GKE Cluster: 2 clusters (US + EU for geo-distribution)
- CDN: Cloudflare for static assets
- GKE Nodes per cluster: 50-100 (n1-standard-16)
Resources (per cluster):
- Total RAM: 3-6Ti
- Total CPU: 800-1600 cores
- Total Storage: 100Ti (distributed across 100 PVCs)
Cost: ~$150,700/month
- GKE nodes: 2 clusters × 100 nodes × $660 = $132,000
- LoadBalancer: 2 × $20 = $40
- Storage: 100Ti × $0.17 = $17,000
- Redis: 6 × $200 = $1,200
- CDN: $500
- Total: ~$150,740/month
Revenue (40% paid conversion @ $49 avg):
- 10,000 users × 40% × $49 = $196,000/month
- Margin: +$45,260/month ✅ PROFITABLE
Bottlenecks:
- ⚠️ FDB Cluster Capacity: Need horizontal sharding
- Mitigation: Shard by tenant_id (multiple FDB clusters)
- ⚠️ Network I/O: 100 workspace pods × 100 users = high traffic
- Mitigation: CDN for static assets, WebSocket connection pooling
- ⚠️ Kubernetes API Load: Managing 200 pods across 2 clusters
- Mitigation: ArgoCD with eventual consistency, batch operations
Required Changes:
- ✅ Multi-region deployment (US + EU)
- ✅ FDB sharding by tenant_id (5 FDB clusters)
- ✅ CDN for static assets (Cloudflare)
- ✅ WebSocket connection pooling
- ✅ Dedicated GKE cluster per region
Status: ⚠️ Significant infrastructure investment required
Scale 5: 100,000 Users ❌ REQUIRES RE-ARCHITECTURE
Problem: GKE-based architecture hits hard limits:
- Cost: $1.5M+/month on GKE alone
- Management: 2,000 workspace pods is operationally complex
- Latency: Global users need <100ms response times
Solution: Hybrid Serverless Architecture
┌────────────────────────────────────────────────────────────────┐
│ Global Architecture │
│ │
│ Cloudflare Workers (Edge) │
│ ├─ Static assets (CDN) │
│ ├─ WebSocket proxy (routing to nearest region) │
│ └─ Auth token validation │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ US-WEST │ │ US-EAST │ │ EU │ │
│ │ │ │ │ │ │ │
│ │ GKE Cluster │ │ GKE Cluster │ │ GKE Cluster │ │
│ │ - 50 pods │ │ - 50 pods │ │ - 50 pods │ │
│ │ - 5K users │ │ - 5K users │ │ - 5K users │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ↓ ↓ ↓ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Global FoundationDB (Multi-Region) │ │
│ │ - 15 nodes (5 per region) │ │
│ │ - Cross-region replication │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Serverless workspaces (Cold Start Optimization) │
│ ├─ AWS Lambda (Firecracker VMs) │
│ ├─ Cloud Run (for infrequent users) │
│ └─ GKE (for power users only) │
└────────────────────────────────────────────────────────────────┘
Cost: ~$100,000/month
- Multi-region GKE: 3 clusters × $20K = $60K
- Cloudflare Enterprise: $5K
- FDB multi-region: $15K
- Serverless compute (Lambda/Cloud Run): $20K
- Total: ~$100K/month
Revenue (50% paid conversion @ $49 avg):
- 100,000 users × 50% × $49 = $2,450,000/month
- Margin: +$2,350,000/month ✅ HIGHLY PROFITABLE
Key Changes:
- ✅ Serverless workspaces for cold users (90% of users)
- AWS Lambda with Firecracker VMs (200ms cold start)
- Cloud Run for theia instances (scale to zero)
- ✅ GKE for power users only (10% of users needing persistent pods)
- ✅ Cloudflare Workers for edge routing (reduce latency to <50ms globally)
- ✅ CockroachDB as FDB alternative (geo-distributed SQL)
- ✅ S3/GCS for cold storage (move inactive workspaces to object storage)
Status: ❌ Requires 6-12 months engineering effort
Resource Optimization
Memory Optimization
Problem: theia is memory-heavy (512Mi-2Gi per user)
Solutions:
1. Lazy Loading: Don't load all extensions on startup

```typescript
// Only load extensions when user opens relevant file type
if (file.endsWith('.py')) {
  await loadExtension('ms-python.python');
}
```

2. Shared Extension Host: Single extension host for all users in pod

```typescript
// Instead of: 1 extension host per user (100 × 200Mi = 20Gi)
// Use: 1 shared extension host (1 × 2Gi = 2Gi)
```

3. Aggressive GC: Force garbage collection for idle users

```typescript
if (user.idleTime > 5 * 60 * 1000) { // 5 min idle
  global.gc(); // Force GC (requires node --expose-gc)
}
```
Result: Reduce per-user memory from 512Mi → 150Mi (70% reduction)
CPU Optimization
Problem: theia LSP servers are CPU-intensive
Solutions:
1. Throttle LSP for idle users: Pause LSP when user inactive

```typescript
if (user.idleTime > 2 * 60 * 1000) { // 2 min idle
  languageServer.pause();
}
```

2. Share TypeScript Server: Single tsserver for all TypeScript projects

```typescript
// Single tsserver handles multiple projects (project references)
```

3. Offload to Backend: Move heavy operations to backend API

```typescript
// Code formatting, linting via backend API (not in browser)
await fetch('/api/format', { method: 'POST', body: code });
```
Result: Reduce per-user CPU from 200m → 100m (50% reduction)
Storage Optimization
Problem: 10K users × 10Gi = 100Ti storage ($17K/month)
Solutions:
1. Tiered Storage:
   - Hot Storage (SSD PVC): Last 7 days of files (1Gi per user)
   - Warm Storage (HDD PVC): Last 30 days of files (5Gi per user)
   - Cold Storage (GCS): Inactive files (archive)

2. Deduplication: Shared node_modules and .git objects

```shell
# Instead of: 10K × 500Mi node_modules = 5Ti
# Use: Shared node_modules with symlinks = 500Mi
```

3. Compression: Enable PVC compression (30-50% savings)

```yaml
storageClassName: compressed-ssd
```
Result: Reduce per-user storage from 10Gi → 2Gi (80% reduction)
Cost Analysis
Cost Breakdown by Scale
| Users | GKE Nodes | Storage | Redis | Total/Month | Revenue/Month | Margin |
|---|---|---|---|---|---|---|
| 10 | $400 | $17 | $0 | $437 | $58 (20% @ $29) | -$379 |
| 100 | $1,000 | $173 | $0 | $1,193 | $580 (20% @ $29) | -$613 |
| 1,000 | $9,900 | $1,730 | $150 | $11,780 | $8,700 (30% @ $29) | -$3,080 |
| 10,000 | $132,000 | $17,000 | $1,200 | $150,200 | $196,000 (40% @ $49) | +$45,800 ✅ |
| 100,000 | $60,000 | $34,000 | $5,000 | $99,000 | $2,450,000 (50% @ $49) | +$2,351,000 ✅ |
Key Insights:
- Break-even: ~8,000 users (with 40% conversion)
- Profitable scale: 10,000+ users
- 100K users: 95% profit margin (after infrastructure costs)
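The margin column in the table above follows from one formula (a sketch; conversion rates and prices are the assumptions stated in each row):

```typescript
// Margin model behind each table row: monthly revenue
// (users × conversion × price) minus monthly infrastructure cost.
function monthlyMargin(users: number, conversion: number, price: number, cost: number): number {
  return users * conversion * price - cost;
}
```

For the 10,000-user row, monthlyMargin(10000, 0.40, 49, 150200) gives +$45,800, matching the table.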
Revenue Projections
Conservative Model (assumes slow growth):
Month 1-3 (Beta): 100 users, 10% paid → $290/month
Month 4-6: 500 users, 20% paid → $2,900/month
Month 7-12: 2,000 users, 30% paid → $17,400/month
Year 2: 10,000 users, 40% paid → $196,000/month
Year 3: 50,000 users, 50% paid → $1,225,000/month
Aggressive Model (viral growth):
Month 1-3 (Beta): 1,000 users, 10% paid → $2,900/month
Month 4-6: 5,000 users, 20% paid → $29,000/month
Month 7-12: 20,000 users, 30% paid → $174,000/month
Year 2: 100,000 users, 40% paid → $1,960,000/month
Database Scaling (FoundationDB)
FoundationDB Performance Characteristics
Single FDB Cluster Limits:
- Write Throughput: 100,000 ops/sec
- Read Throughput: 1,000,000 ops/sec (with read replicas)
- Storage: 100Ti+ (horizontally scalable)
- Latency: <5ms (single-region), <50ms (multi-region)
Scaling Strategy by User Count
10-100 Users:
FDB Cluster: 3 nodes (simple triple replication)
- Write capacity: 100K ops/sec → handles 10K requests/sec easily
- Read capacity: 1M ops/sec → handles 100K requests/sec
100-1,000 Users:
FDB Cluster: 5 nodes (3 write + 2 read replicas)
- Write capacity: 100K ops/sec
- Read capacity: 2M ops/sec (read replicas handle reads)
1,000-10,000 Users:
FDB Cluster: 10 nodes (5 write + 5 read replicas)
- Write capacity: 100K ops/sec
- Read capacity: 5M ops/sec
Redis Caching Layer: 3 nodes
- Cache session → pod mappings (90% hit rate)
- Cache user metadata (email, name, avatar)
- TTL: 5 minutes
10,000-100,000 Users:
Multi-Region FDB:
- US-WEST: 5 nodes (3 write + 2 read)
- US-EAST: 5 nodes (3 write + 2 read)
- EU: 5 nodes (3 write + 2 read)
Sharding Strategy:
- Shard by tenant_id hash (0-4 → cluster 1, 5-9 → cluster 2, etc.)
- Each cluster handles 20K users
Redis Caching: 6 nodes (sharded)
- 95% cache hit rate
- Reduces FDB load by 20x
FDB Key Design for Scaling
Current Key Schema (works up to 10K users):
users/{user_id} # User record
users/{user_id}/sessions/{session_id} # User sessions
tenants/{tenant_id}/users/{user_id} # Tenant membership
sessions/{session_id} # Session data
workspaces/{assignment_id} # workspace assignments
Optimized Key Schema (for 100K+ users):
# Shard by tenant_id (first 2 chars of UUID)
shard_ab/users/{user_id} # Users starting with ab-xxx
shard_cd/users/{user_id} # Users starting with cd-xxx
...
# Hot/cold data separation
hot/sessions/{session_id} # Active sessions (TTL: 1 hour)
cold/sessions/{session_id} # Archived sessions (TTL: 30 days)
# Denormalized for read performance
user_sessions/{user_id} # All sessions for user (JSON array)
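A sketch of building the sharded key prefix from the optimized schema above, where the shard is the first two characters of the tenant UUID (the helper name and the lowercase normalization are illustrative, not the real key-builder API):

```typescript
// Build a sharded FDB user key: the shard prefix comes from the first
// two characters of the tenant UUID, per the schema above.
function shardedUserKey(tenantId: string, userId: string): string {
  const shard = tenantId.slice(0, 2).toLowerCase();
  return `shard_${shard}/users/${userId}`;
}
```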
Network Architecture
Ingress Strategy by Scale
Current Design (1 user = 1 Ingress rule):
# ❌ Doesn't scale past 1K users
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: workspace-user-123
spec:
  rules:
  - host: abc123.coditect.ai
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: workspace-abc123-service
            port:
              number: 3000
Optimized Design (1 Ingress, path-based routing):
# ✅ Scales to 100K+ users
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: coditect-workspace-ingress
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$3
spec:
  rules:
  - host: ide.coditect.ai
    http:
      paths:
      - path: /ws/([^/]+)(/|$)(.*)
        pathType: ImplementationSpecific # regex paths require this + use-regex
        backend:
          service:
            name: workspace-router-service # Routes to correct pod
            port:
              number: 8080
workspace Router Service:
// workspace-router/src/main.rs
async fn handle_websocket(
    ws: WebSocket,
    session_id: String,
    fdb_client: Arc<Database>,
    redis_client: Arc<RedisClient>, // cache for session → pod mappings
) -> Result<()> {
    // 1. Validate session (fails fast if the session is unknown or expired)
    let _session = fdb_client.get_session(&session_id).await?;

    // 2. Get pod assignment (Redis cache first, fall back to FDB)
    let pod_id = match redis_client
        .get(&format!("session:{}:pod", session_id))
        .await?
    {
        Some(pod_id) => pod_id,
        None => fdb_client.get_session_pod(&session_id).await?,
    };

    // 3. Proxy WebSocket to correct pod
    let pod_url = format!(
        "ws://workspace-pod-{}.coditect-app.svc.cluster.local:3000",
        pod_id
    );
    proxy_websocket(ws, &pod_url).await?;
    Ok(())
}
CDN Strategy
Static Assets (Cloudflare CDN):
Frontend Assets:
- HTML/CSS/JS bundles
- theia static files
- Monaco editor assets
- VS Code extensions
Edge Locations: 275+ worldwide
Latency: <50ms globally
Cost: $500/month (up to 100K users)
Dynamic Content (Direct to Backend):
API Requests:
- Authentication (POST /api/auth/login)
- workspace provisioning (POST /api/workspaces/provision)
- File operations (GET/PUT /api/files/*)
No CDN (always fresh data)
Migration Strategy
Phase 1: Current Design → Multi-Tenant (Weeks 1-4)
Current: 1 user = 1 pod
Target: 100 users per pod

Steps:

1. Build multi-tenant theia server
   - Session manager (track active users in pod)
   - File isolation (enforce user directory boundaries)
   - Process isolation (cgroups per user)

2. Build workspace router service
   - Query FDB for session → pod mapping
   - Proxy WebSocket to correct pod

3. Deploy new workspace pods (parallel to old)
   - New users → multi-tenant pods
   - Old users → keep dedicated pods (migrate later)

4. Migrate existing users (1 per day)
   - Copy user files to shared PVC
   - Update session assignment
   - Delete old dedicated pod

Timeline: 4 weeks
Risk: Low (old users unaffected during migration)
Phase 2: Add Redis Caching (Week 5)
Purpose: Reduce FDB load for session routing
Steps:
- Deploy Redis cluster (3 nodes, sharded)
- Update workspace router to check Redis first
- Cache session → pod mappings (TTL: 5 min)
- Monitor cache hit rate (target: 90%+)
Timeline: 1 week
Impact: 10x reduction in FDB queries
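The cache-aside lookup in Phase 2, sketched with minimal stand-ins for the clients (the `KV` interface and `fdbLookup` callback are assumptions, not the real Redis/FDB APIs):

```typescript
// Minimal key-value interface standing in for a real Redis client.
interface KV {
  get(k: string): Promise<string | null>;
  set(k: string, v: string, ttlSec: number): Promise<void>;
}

// Cache-aside: check Redis first, fall back to FDB on a miss, then
// populate the cache with a 5-minute TTL.
async function resolvePod(
  sessionId: string,
  redis: KV,
  fdbLookup: (id: string) => Promise<string>,
): Promise<string> {
  const key = `session:${sessionId}:pod`;
  const cached = await redis.get(key);
  if (cached !== null) return cached;        // cache hit (target: 90%+)
  const podId = await fdbLookup(sessionId);  // cache miss → query FDB
  await redis.set(key, podId, 300);          // TTL: 5 minutes
  return podId;
}
```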
Phase 3: Shard PVCs (Week 6-7)
Purpose: Avoid single PVC bottleneck at 10K+ users

Steps:

1. Create 10 PVCs (1Ti each) instead of 1 × 10Ti
2. Shard by user_id hash: user_id % 10 = pvc_number
3. Mount all 10 PVCs to each workspace pod:

```yaml
volumeMounts:
- name: pvc-0
  mountPath: /workspace/shard-0
- name: pvc-1
  mountPath: /workspace/shard-1
# ... through pvc-9
```

4. Router calculates the shard path: /workspace/shard-{user_id % 10}/{user_id}/

Timeline: 2 weeks
Impact: 10x IOPS capacity (10K → 100K IOPS)
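Since user ids are UUID strings, "user_id % 10" really means a stable hash mod 10. A sketch (the sum-style hash is purely illustrative; any stable hash works as long as every component uses the same one):

```typescript
// Map a user id string to one of 10 PVC shards via a stable hash.
function shardOf(userId: string, shards = 10): number {
  let h = 0;
  for (const ch of userId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % shards;
}

// Workspace path on the sharded layout, e.g. /workspace/shard-4/user-123/
function workspacePath(userId: string): string {
  return `/workspace/shard-${shardOf(userId)}/${userId}/`;
}
```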
Phase 4: Multi-Region (Week 8-12)
Purpose: Global latency <100ms, disaster recovery
Steps:
- Deploy GKE cluster in EU (parallel to US)
- Deploy FDB cluster in EU (replicate from US)
- Deploy Cloudflare Workers for geo-routing
- Route EU users → EU cluster, US users → US cluster
- Enable FDB cross-region replication (async)
Timeline: 4 weeks
Impact: <100ms latency for EU users, 99.99% uptime
Monitoring & Observability
Key Metrics by Scale
10-100 Users:
- Pod CPU/memory usage
- API request latency (p95)
- FoundationDB query latency
1,000 Users:
- All above +
- workspace pod utilization (users per pod)
- Redis cache hit rate
- Session provisioning time (p95)
10,000 Users:
- All above +
- FDB replication lag
- PVC IOPS usage
- Cross-region latency
- Cost per active user
100,000 Users:
- All above +
- Serverless cold start latency
- Global user distribution
- Edge cache hit rate (Cloudflare)
Alerting Thresholds
alerts:
# Critical
- name: workspacePodsCrashing
threshold: pod_restarts > 5 in 10 minutes
severity: critical
action: Page on-call engineer
- name: FoundationDBDown
threshold: fdb_available_nodes < 3
severity: critical
action: Page on-call + auto-failover
# Warning
- name: HighPodUtilization
threshold: users_per_pod > 90
severity: warning
action: Trigger pod autoscaling
- name: LowCacheHitRate
threshold: redis_hit_rate < 80%
severity: warning
action: Investigate cache keys
# Info
- name: ScalingEvent
threshold: pod_count increased by 5+
severity: info
action: Log for capacity planning
Conclusion
Scaling Summary
| Scale | Architecture | Cost/Month | Margin | Status |
|---|---|---|---|---|
| 10 users | Current design (1:1 pods) | $437 | -$379 | ✅ Ready |
| 100 users | Multi-tenant (2-5 pods) | $1,193 | -$613 | ✅ Ready with multi-tenant |
| 1,000 users | Multi-tenant + Redis | $11,780 | -$3,080 | ⚠️ Need Redis + PVC sharding |
| 10,000 users | Multi-region + FDB sharding | $150,200 | +$45,800 | ⚠️ 3-4 months work |
| 100,000 users | Serverless hybrid | $99,000 | +$2,351,000 | ❌ 6-12 months work |
Critical Decisions
For MVP (10-100 users):
- ✅ Keep current 1:1 pod design (simpler, faster to market)
- ✅ Plan multi-tenant migration for Month 3
For Growth (100-1,000 users):
- ✅ Implement multi-tenant pods (Week 5-8)
- ✅ Add Redis caching (Week 9)
For Scale (1,000-10,000 users):
- ✅ Shard PVCs (Month 4)
- ✅ Multi-region deployment (Month 5-6)
- ✅ FDB sharding (Month 6-7)
For Massive Scale (100,000+ users):
- ✅ Serverless workspaces (Year 2)
- ✅ Global edge network (Year 2)
Recommended Path
Month 1-3: Launch MVP with 1:1 pods
- Focus on product-market fit
- Cost is manageable (<$500/month)
- Avoid premature optimization
Month 4-6: Migrate to multi-tenant
- Start when you hit 100-200 users
- Reduces cost by 10x
- Enables path to 1,000+ users
Month 7-12: Scale to 1,000-10,000 users
- Add Redis, shard PVCs
- Deploy multi-region
- Profitable at 8,000+ users
Year 2: Re-architect for 100,000+ users
- Serverless workspaces
- Global edge
- 95%+ profit margins
Document Status: ✅ Complete
Last Updated: 2025-10-07
Next Review: Month 3 (when hitting 100+ users)
Owner: Engineering + Finance Teams