C2: Container Diagram - CODITECT Cloud Infrastructure
Level: Container (C4 Model Level 2) Scope: Internal Components of CODITECT Cloud Infrastructure Primary Audience: Technical Architects, Senior Developers, DevOps Engineers Last Updated: November 23, 2025
Overview
The Container Diagram shows the high-level technology choices in the CODITECT cloud infrastructure and how responsibilities are divided among its runtime containers.
Key Containers:
- Google Kubernetes Engine (GKE) cluster running FastAPI application
- Cloud SQL PostgreSQL for persistent data storage
- Cloud Memorystore Redis for session management
- Cloud Run for serverless background jobs
- Networking layer (VPC, Load Balancer, Cloud NAT)
- Security layer (Identity Platform, Cloud KMS, Secret Manager)
Container Diagram
Container Details
1. Google Cloud Load Balancer (Ingress)
Technology: Google Cloud HTTP(S) Load Balancer Purpose: SSL termination, DDoS protection, geographic routing Deployment: Global, multi-region (us-central1 primary)
Configuration:
Type: HTTPS Load Balancer
SSL Policy: TLS 1.3 only
Certificate: Let's Encrypt (auto-renewed)
Backend Service: NGINX Ingress Controller (GKE)
Health Check: /health endpoint (every 10s)
Session Affinity: Client IP (for WebSocket support)
Cloud Armor: Enabled (rate limiting, geo-blocking)
Responsibilities:
- Terminate SSL/TLS connections
- Distribute traffic across GKE ingress pods
- Protect against DDoS attacks (Cloud Armor)
- Rate limiting (100 req/min per IP)
- Geographic routing (future multi-region support)
Scalability:
- Auto-scales based on traffic
- Handles 1M+ req/sec globally
- 99.99% SLA for Premium Tier
2. GKE Cluster (Compute)
Technology: Google Kubernetes Engine (GKE) Purpose: Container orchestration for License API Deployment: Regional (us-central1), multi-zone
Configuration:
Cluster Name: coditect-dev
Kubernetes Version: 1.28 (auto-upgrade)
Node Count: 3-10 (auto-scaling)
Machine Type: n1-standard-2 (2 vCPU, 7.5GB RAM)
Node Pool: Preemptible (dev), Standard (prod)
Network: Custom VPC (10.0.0.0/16)
Pod CIDR: 10.1.0.0/16
Service CIDR: 10.2.0.0/16
Workload Identity: Enabled
Binary Authorization: Disabled (dev), Enabled (prod)
Responsibilities:
- Run License API pods (FastAPI)
- Auto-scaling based on CPU/memory
- Rolling updates with zero downtime
- Health monitoring and self-healing
- Workload Identity for GCP API access
Scalability:
- Horizontal Pod Autoscaler (HPA): 3-10 replicas
- Cluster Autoscaler: 3-20 nodes
- Handles 10,000 concurrent users at scale
3. License API Pods (Application)
Technology: FastAPI 0.104+ (Python 3.11) Purpose: License validation, session management, JWT generation Deployment: Kubernetes Deployment (3-10 replicas)
Container Specification:
Image: gcr.io/coditect/license-api:latest
Port: 8000 (HTTP)
CPU Request: 500m (0.5 vCPU)
CPU Limit: 2000m (2 vCPU)
Memory Request: 1GB
Memory Limit: 4GB
Liveness Probe: /health (every 30s)
Readiness Probe: /ready (every 10s)
Environment: Production
API Endpoints:
POST /api/v1/auth/register # User registration
POST /api/v1/auth/login # OAuth2 login
POST /api/v1/licenses/acquire # Acquire license seat
POST /api/v1/licenses/heartbeat # Extend session TTL
DELETE /api/v1/licenses/release # Release license seat
GET /api/v1/licenses/status # Check license status
GET /api/v1/analytics/usage # Usage analytics (admin)
Responsibilities:
- Validate license keys and hardware fingerprints
- Allocate/release seats atomically (Redis Lua scripts)
- Generate signed JWT tokens (Cloud KMS)
- Verify OAuth2 tokens (Identity Platform)
- Session heartbeat management (Redis TTL)
- Multi-tenant data isolation (PostgreSQL RLS)
Performance Characteristics:
- Request latency: <50ms p95 (local region)
- Throughput: 100 req/sec per pod
- Connection pooling: 10 DB connections per pod
- Async I/O: asyncio + aiohttp
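The pooling behaviour can be sketched without a database: the toy class below (hypothetical, stdlib-only) models the 10-connections-per-pod limit with an `asyncio.Semaphore`, which is the same back-pressure mechanism a real asyncpg or SQLAlchemy pool applies.

```python
import asyncio

class BoundedPool:
    """Toy stand-in for a per-pod DB connection pool (hypothetical; a real
    deployment would use e.g. asyncpg or SQLAlchemy pooling)."""

    def __init__(self, max_size: int = 10):
        self.max_size = max_size
        self._sem = asyncio.Semaphore(max_size)  # caps concurrent "connections"
        self.in_use = 0
        self.peak = 0

    async def query(self, delay: float = 0.001) -> None:
        async with self._sem:            # blocks when all 10 connections are busy
            self.in_use += 1
            self.peak = max(self.peak, self.in_use)
            await asyncio.sleep(delay)   # simulated query latency
            self.in_use -= 1

async def main() -> int:
    pool = BoundedPool(max_size=10)
    # 100 concurrent requests share 10 connections; excess requests queue
    await asyncio.gather(*(pool.query() for _ in range(100)))
    return pool.peak

peak = asyncio.run(main())
```

The semaphore is why a pod can accept far more concurrent requests than it holds database connections: requests above the pool size wait rather than opening new connections.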
4. Cloud SQL PostgreSQL (Database)
Technology: Cloud SQL PostgreSQL 16 Purpose: Persistent storage for licenses, tenants, users, audit logs Deployment: Regional HA (us-central1)
Configuration:
Instance Name: coditect-dev
Machine Type: db-custom-2-7680 (2 vCPU, 7.5GB RAM)
Storage: 100GB SSD (auto-resize to 200GB)
Version: PostgreSQL 16.1
Availability: Regional HA (auto-failover)
Private IP: 10.67.0.3 (VPC peering)
Public IP: Disabled
SSL/TLS: Required (self-signed CA)
Backup: Daily at 03:00 UTC
PITR: 7-day retention
Database Schema:
-- Multi-tenant architecture
CREATE TABLE tenants (
    id UUID PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE users (
    id UUID PRIMARY KEY,
    tenant_id UUID REFERENCES tenants(id),
    email VARCHAR(255) UNIQUE NOT NULL,
    oauth_provider VARCHAR(50),
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE licenses (
    id UUID PRIMARY KEY,
    tenant_id UUID REFERENCES tenants(id),
    license_key VARCHAR(64) UNIQUE NOT NULL,
    max_seats INTEGER NOT NULL,
    expires_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    hardware_fingerprints JSONB DEFAULT '[]'
);

CREATE TABLE audit_logs (
    id BIGSERIAL PRIMARY KEY,
    tenant_id UUID REFERENCES tenants(id),
    user_id UUID REFERENCES users(id),
    action VARCHAR(100) NOT NULL,
    details JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Row-Level Security for multi-tenant isolation
ALTER TABLE licenses ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON licenses
    USING (tenant_id = current_setting('app.current_tenant')::UUID);
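The policy only filters rows once the API tells PostgreSQL which tenant is active for the current request. A minimal sketch of that step, assuming asyncpg-style `$1` parameter binding (the helper name `tenant_scope_statement` is hypothetical):

```python
# Hypothetical helper: build the per-request statement that scopes RLS.
# With asyncpg this would run as:
#     await conn.execute(*tenant_scope_statement(tenant_id))
def tenant_scope_statement(tenant_id: str) -> tuple:
    # set_config(..., true) makes the setting transaction-local, so a single
    # pooled connection can safely serve requests from many tenants
    return ("SELECT set_config('app.current_tenant', $1, true)", tenant_id)

stmt, param = tenant_scope_statement("00000000-0000-0000-0000-000000000001")
```

The transaction-local flag is the important detail: without it, a tenant setting could leak between requests that share a pooled connection.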
Responsibilities:
- Store license records, tenant data, user accounts
- Enforce multi-tenant data isolation (RLS)
- Provide ACID guarantees for seat allocation
- Audit trail for compliance (SOC 2, GDPR)
- Backup and point-in-time recovery
Performance Characteristics:
- Max connections: 100 (configured)
- Query latency: <10ms p95
- Throughput: 500 QPS (current), 10K QPS (production)
- Automatic failover: <60 seconds
5. Cloud Memorystore Redis (Cache)
Technology: Cloud Memorystore Redis 7.0 Purpose: Session tracking, seat allocation, rate limiting, cache Deployment: Single-zone (BASIC tier, dev), Multi-zone (STANDARD HA, prod)
Configuration:
Instance Name: coditect-dev-redis
Version: Redis 7.0
Tier: BASIC (dev), STANDARD_HA (prod)
Memory: 6GB (dev), 16GB (prod)
Private IP: 10.121.42.67:6378
Network: VPC peering
Auth: Enabled (AUTH command)
Transit Encryption: SERVER_AUTHENTICATION
Persistence: RDB snapshots (automatic)
Data Structures:
# Active session tracking
SET license:{license_id}:session:{hardware_fp} "active" EX 360 # 6-minute TTL
# Seat allocation counter (atomic increment/decrement)
HSET license:{license_id}:seats count 0 max_seats 10
# Rate limiting (sliding window)
ZADD rate_limit:{user_id} {timestamp} {request_id}
ZREMRANGEBYSCORE rate_limit:{user_id} 0 {timestamp - 60}
# Cached license data (reduce DB queries)
HSET license:{license_id}:cache license_key "..." expires_at "..."
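The sliding-window rate limit above can be illustrated without Redis. This stdlib-only sketch (the `SlidingWindowLimiter` class is hypothetical; production uses Redis so all pods share one window) mirrors the ZREMRANGEBYSCORE-then-ZADD pattern:

```python
import time
from typing import Optional

class SlidingWindowLimiter:
    """In-process model of the Redis ZADD/ZREMRANGEBYSCORE pattern."""

    def __init__(self, limit: int = 100, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self.events = {}  # user_id -> list of event timestamps

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        window = self.events.setdefault(user_id, [])
        # ZREMRANGEBYSCORE equivalent: drop events older than the window
        window[:] = [t for t in window if t > now - self.window_s]
        if len(window) >= self.limit:
            return False          # over the limit, reject
        window.append(now)        # ZADD equivalent: record this request
        return True

limiter = SlidingWindowLimiter(limit=3, window_s=60.0)
results = [limiter.allow("u1", now=t) for t in (0, 1, 2, 3)]  # 4th is denied
later = limiter.allow("u1", now=70.0)                          # window has slid
```

Because old entries are pruned by score (timestamp), the limit is continuous rather than resetting at fixed minute boundaries.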
Lua Scripts (Atomic Operations):
-- Atomic seat allocation
local key = KEYS[1]
local max_seats = tonumber(ARGV[1])
local current = tonumber(redis.call('HGET', key, 'count') or 0)
if current < max_seats then
    redis.call('HINCRBY', key, 'count', 1)
    return 1  -- Success
else
    return 0  -- No seats available
end
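Why a Lua script rather than separate HGET/HINCRBY calls: the check and the increment must be one indivisible step, or two racing requests could both observe a free seat and oversell. The model below (hypothetical, stdlib-only) reproduces those semantics with a lock standing in for Redis's single-threaded script execution:

```python
import threading

class SeatCounter:
    """Models the Lua script's check-and-increment; the lock plays the role
    of Redis executing the whole script atomically (hypothetical sketch)."""

    def __init__(self, max_seats: int):
        self.max_seats = max_seats
        self.count = 0
        self._lock = threading.Lock()

    def acquire(self) -> int:
        with self._lock:             # check + increment is one atomic step
            if self.count < self.max_seats:
                self.count += 1
                return 1             # success, seat allocated
            return 0                 # no seats available

    def release(self) -> None:
        with self._lock:
            self.count = max(0, self.count - 1)

seats = SeatCounter(max_seats=10)
threads = [threading.Thread(target=seats.acquire) for _ in range(25)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# 25 clients raced for 10 seats; the counter never exceeds max_seats
```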
Responsibilities:
- Track active sessions with TTL-based expiration
- Atomic seat allocation/deallocation (Lua scripts)
- Rate limiting for API endpoints
- Cache frequently accessed license data
- Heartbeat tracking (update TTL every 5 minutes)
Performance Characteristics:
- Latency: <1ms p95 (in-region)
- Throughput: 50K ops/sec (BASIC), 500K ops/sec (STANDARD HA)
- Persistence: RDB snapshots every 15 minutes
- Automatic failover: <30 seconds (STANDARD HA)
6. Identity Platform (Authentication)
Technology: Google Identity Platform (Firebase Auth) Purpose: OAuth2 authentication, JWT validation, user management Deployment: Managed service (global)
Configuration:
Providers:
- Google OAuth2
- GitHub OAuth2
- Email/Password (disabled for now)
Token Settings:
Algorithm: RS256
Expiration: 1 hour (access token), 7 days (refresh token)
Multi-Factor Auth:
Status: Optional (future)
Methods: TOTP, SMS
Responsibilities:
- User registration and authentication
- OAuth2 token generation and validation
- JWT signature verification (RS256)
- User session management
- Multi-factor authentication (future)
Integration:
# FastAPI dependency injection (firebase_admin verifies Identity Platform tokens)
from fastapi import Depends, HTTPException
from fastapi.security import OAuth2PasswordBearer
from firebase_admin import auth

oauth2_scheme = OAuth2PasswordBearer(tokenUrl="api/v1/auth/login")

async def verify_token(token: str = Depends(oauth2_scheme)) -> dict:
    try:
        # Checks the RS256 signature, expiry, and audience for the project
        return auth.verify_id_token(token)
    except auth.InvalidIdTokenError:
        raise HTTPException(status_code=401, detail="Invalid token")
7. Cloud KMS (License Signing)
Technology: Google Cloud Key Management Service Purpose: RSA-4096 asymmetric key signing for license tokens Deployment: Regional (us-central1), HSM-backed
Configuration:
Key Ring: coditect-license-keys
Key Name: license-signing-key
Algorithm: RSA_SIGN_PKCS1_4096_SHA256
Purpose: ASYMMETRIC_SIGN
Protection Level: HSM (hardware security module)
Rotation: Manual (future: automatic annual rotation)
Usage:
# Sign license token (assumes the google-cloud-kms client library)
import base64
import hashlib
import json

from google.cloud.kms import KeyManagementServiceAsyncClient

async def sign_license(license_data: dict) -> str:
    kms_client = KeyManagementServiceAsyncClient()
    key_name = "projects/.../keyRings/.../cryptoKeys/.../cryptoKeyVersions/1"
    # Canonical serialization: sort_keys yields the same digest at verification
    message = json.dumps(license_data, sort_keys=True).encode()
    digest = hashlib.sha256(message).digest()
    # Sign the digest with the HSM-backed private key in Cloud KMS
    response = await kms_client.asymmetric_sign(
        request={"name": key_name, "digest": {"sha256": digest}}
    )
    # Return a base64-encoded signature for embedding in the license token
    return base64.b64encode(response.signature).decode()
Responsibilities:
- Sign license tokens with RSA-4096 private key
- Provide public key for offline verification
- Key rotation and versioning
- Audit trail for all signing operations
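Offline verification only works if the client recomputes exactly the bytes that were signed; the `sort_keys` canonical JSON serialization used at signing time is what makes that possible. A small stdlib sketch (the `license_digest` helper is hypothetical):

```python
import hashlib
import json

def license_digest(license_data: dict) -> bytes:
    """Canonical digest: identical bytes regardless of dict insertion order,
    so client-side verification recomputes what Cloud KMS actually signed."""
    message = json.dumps(license_data, sort_keys=True).encode()
    return hashlib.sha256(message).digest()

# Same fields in a different order still hash to the same digest
a = license_digest({"license_key": "ABC", "max_seats": 10})
b = license_digest({"max_seats": 10, "license_key": "ABC"})
```

If serialization were not canonical, a byte-for-byte identical digest could not be reproduced on the client, and every signature check would fail.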
8. Secret Manager (Secrets)
Technology: Google Cloud Secret Manager Purpose: Secure storage for API keys, credentials, tokens Deployment: Global (replicated)
Secrets Inventory:
Secrets (9 total):
1. db-password # Cloud SQL root password
2. db-app-user-password # Application database user
3. db-readonly-password # Read-only user for analytics
4. redis-auth-token # Redis AUTH password
5. django-secret-key # Django SECRET_KEY
6. stripe-api-key # Stripe API key
7. stripe-webhook-secret # Stripe webhook signature
8. jwt-secret-key # JWT signing key (backup)
9. sendgrid-api-key # SendGrid email API key
Access Control:
# Workload Identity binding
Service Account: license-api@coditect.iam.gserviceaccount.com
Role: roles/secretmanager.secretAccessor
Secrets: All (scoped by IAM policy)
Usage:
# Fetch secret at startup (assumes the google-cloud-secret-manager library)
import os

from google.cloud.secretmanager import SecretManagerServiceAsyncClient

async def get_secret(secret_id: str) -> str:
    client = SecretManagerServiceAsyncClient()
    name = f"projects/{PROJECT_ID}/secrets/{secret_id}/versions/latest"
    response = await client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")

# Inject into the environment during application startup (await is only
# valid inside an async function, e.g. a FastAPI lifespan handler)
async def load_secrets() -> None:
    os.environ["DATABASE_PASSWORD"] = await get_secret("db-password")
9. VPC Network (Networking)
Technology: Google Cloud VPC (Virtual Private Cloud) Purpose: Network isolation, private communication, security Deployment: Regional (us-central1)
Configuration:
VPC Name: coditect-dev-vpc
CIDR Ranges:
Primary Subnet: 10.0.0.0/16 (GKE nodes)
Pods Secondary: 10.1.0.0/16 (Kubernetes pods)
Services Secondary: 10.2.0.0/16 (Kubernetes services)
Private Google Access: Enabled
Flow Logs: Enabled (5-second interval, 100% sampling)
Routing Mode: Regional
MTU: 1460 (default)
Firewall Rules:
allow-health-checks:
Source: 35.191.0.0/16, 130.211.0.0/22 (Google health checks)
Target: gke-node
Ports: TCP 8000, 8080
allow-ingress-https:
Source: 0.0.0.0/0
Target: load-balancer
Ports: TCP 443
deny-all-ingress:
Priority: 65535 (lowest)
Action: Deny
VPC Peering:
- Cloud SQL: Private service connection (10.67.0.0/16)
- Redis: Private VPC peering (10.121.0.0/16)
10. Cloud NAT (Egress)
Technology: Google Cloud NAT Purpose: Egress traffic for private GKE nodes Deployment: Regional (us-central1)
Configuration:
NAT Name: coditect-dev-nat
Router: coditect-dev-router
NAT IP Allocation: AUTO_ONLY
Source Subnetworks: ALL_SUBNETWORKS_ALL_IP_RANGES
Min Ports per VM: 64
Logging: Enabled (ALL filter)
TCP Timeouts:
Established Idle: 1200s (20 minutes)
Transitory Idle: 30s
Time Wait: 120s
Responsibilities:
- Provide outbound internet access for private GKE nodes
- Static IP allocation for egress (rate limiting by external APIs)
- Logging and monitoring of egress traffic
11. Monitoring & Observability
Prometheus + Grafana:
Prometheus:
Deployment: Managed Prometheus (GKE)
Scrape Interval: 15 seconds
Retention: 15 days
Grafana:
Deployment: GKE pod (StatefulSet)
Data Source: Prometheus, Cloud SQL
Dashboards:
- License API performance (latency, throughput, errors)
- Database health (connections, query time, cache hit rate)
- Redis performance (ops/sec, memory usage, evictions)
- Seat utilization (per tenant, per license)
Cloud Logging:
Log Router:
- API request/response logs → BigQuery (analytics)
- Error logs → PagerDuty (alerting)
- Audit logs → Cloud Storage (compliance)
Log Format: Structured JSON
Retention: 30 days (logs), 7 years (audit)
Technology Decision Rationale
Why FastAPI over Django?
Decision: Use FastAPI for License API instead of Django REST Framework.
Rationale:
- Performance: Async/await support (3x faster than Django)
- Type Safety: Pydantic models prevent runtime errors
- Documentation: Auto-generated OpenAPI/Swagger docs
- Lightweight: Minimal overhead for microservices
- Modern: Built for Python 3.11+ with type hints
Trade-offs:
- Smaller ecosystem than Django
- Less built-in functionality (need to implement auth, admin)
- Fewer third-party packages
Why Cloud SQL over Self-Managed PostgreSQL?
Decision: Use Cloud SQL PostgreSQL instead of self-managed on GKE.
Rationale:
- Reliability: 99.95% SLA with automatic failover
- Backups: Automated daily backups + PITR (7 days)
- Maintenance: Automatic minor version upgrades
- Security: Encryption at rest, private IP only
- Cost: Comparable to self-managed Compute Engine + persistent disk + ops overhead
Trade-offs:
- Limited PostgreSQL extensions (no Citus)
- No direct filesystem access
- Higher cost than self-managed
Why Redis over Memcached?
Decision: Use Redis instead of Memcached for session storage.
Rationale:
- Data Structures: Supports hash, set, sorted set (not just key-value)
- Persistence: RDB snapshots prevent data loss on restart
- Atomic Operations: Lua scripting for seat allocation
- TTL: Built-in expiration for session management
- Pub/Sub: Future support for real-time notifications
Trade-offs:
- Single-threaded (lower throughput than Memcached)
- Higher memory overhead
- More complex to operate
Data Flow Examples
License Acquisition Flow
1. User starts CODITECT application
2. License Client SDK → Load Balancer (HTTPS POST /api/v1/licenses/acquire)
3. Load Balancer → NGINX Ingress Controller
4. Ingress → License API pod (FastAPI)
5. API validates JWT (Identity Platform)
6. API checks license in PostgreSQL (SELECT * FROM licenses WHERE license_key = ?)
7. API atomically allocates seat in Redis (Lua script HINCRBY)
8. API signs license token with Cloud KMS (RSA-4096)
9. API stores session in Redis (SET with 6-minute TTL)
10. API logs audit event (INSERT INTO audit_logs)
11. Return signed JWT to SDK
12. CODITECT stores JWT locally for offline mode
Total Latency: ~200ms (p95)
Heartbeat Flow
1. Background thread wakes every 5 minutes
2. License Client SDK → Load Balancer (HTTPS POST /api/v1/licenses/heartbeat)
3. Load Balancer → License API pod
4. API validates JWT signature (local verification, no KMS call)
5. API updates Redis TTL (EXPIRE license:{id}:session:{fp} 360)
6. Return 200 OK
Total Latency: ~20ms (p95)
Seat Release (Graceful Shutdown)
1. User closes CODITECT application
2. SDK → Load Balancer (HTTPS DELETE /api/v1/licenses/release)
3. Load Balancer → License API pod
4. API deletes Redis session (DEL license:{id}:session:{fp})
5. API decrements seat counter (HINCRBY license:{id}:seats count -1)
6. API logs audit event (INSERT INTO audit_logs)
7. Return 200 OK
Total Latency: ~50ms (p95)
Automatic Seat Release (Zombie Session Cleanup)
1. User crashes / network disconnects
2. Heartbeat stops sending
3. Redis TTL expires after 6 minutes (EXPIRE)
4. Seat automatically released (no API call needed)
5. Next heartbeat attempt returns 404 (session not found)
6. SDK re-acquires license on next startup
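The three session flows above reduce to one TTL state machine. This fake-clock sketch (the `SessionStore` class is hypothetical; Redis holds the real state) shows acquire, heartbeat refresh, and silent expiry:

```python
class SessionStore:
    """Fake-clock model of the Redis TTL lifecycle: acquire sets a 360s TTL,
    each heartbeat refreshes it, and silence lets it expire (hypothetical)."""

    TTL = 360  # seconds, matching the 6-minute Redis EX value

    def __init__(self):
        self.expiry = {}  # session key -> absolute expiry time

    def acquire(self, session: str, now: float) -> None:
        self.expiry[session] = now + self.TTL        # SET ... EX 360

    def heartbeat(self, session: str, now: float) -> bool:
        if session not in self.expiry or now >= self.expiry[session]:
            self.expiry.pop(session, None)
            return False                              # 404: SDK must re-acquire
        self.expiry[session] = now + self.TTL         # EXPIRE equivalent
        return True

store = SessionStore()
store.acquire("lic1:fp1", now=0)
alive = store.heartbeat("lic1:fp1", now=300)   # 5-minute heartbeat: refreshed
# client crashes; no heartbeat arrives for more than 6 minutes
dead = store.heartbeat("lic1:fp1", now=700)    # past 660s expiry: gone
```

The 5-minute heartbeat against a 6-minute TTL gives one minute of slack for network jitter before a live session is mistaken for a zombie.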
Scalability Analysis
Current Capacity (Development)
| Component | Configuration | Max Throughput |
|---|---|---|
| GKE Cluster | 3 nodes x 2 vCPU | ~100 concurrent users |
| License API | 3 pods x 500m CPU | ~300 req/sec |
| Cloud SQL | 2 vCPU, 100 connections | ~500 QPS |
| Redis | 6GB BASIC | ~50K ops/sec |
| Load Balancer | Auto-scaling | ~1M req/sec |
Production Target (10,000 users)
| Component | Configuration | Max Throughput |
|---|---|---|
| GKE Cluster | 10 nodes x 4 vCPU | ~10,000 concurrent users |
| License API | 10 pods x 2000m CPU | ~1,000 req/sec |
| Cloud SQL | 8 vCPU, 1000 connections | ~10K QPS |
| Redis | 16GB STANDARD HA | ~500K ops/sec |
| Load Balancer | Auto-scaling | ~1M req/sec |
Bottleneck Analysis:
- Current Bottleneck: Cloud SQL (100 connections)
- Mitigation: Connection pooling (10 per pod), read replicas
- Future Bottleneck: Redis (seat allocation contention)
- Mitigation: Sharding by tenant_id, Redis Cluster
Security Architecture
Defense in Depth
Layer 1: Network (VPC Firewall)
- Deny all ingress except HTTPS (443)
- Private GKE cluster (no public node IPs)
- Cloud NAT for egress (controlled IP ranges)
Layer 2: Application (API Gateway)
- Rate limiting (100 req/min per IP)
- Cloud Armor WAF (SQL injection, XSS prevention)
- HTTPS-only (TLS 1.3)
Layer 3: Authentication (Identity Platform)
- OAuth2 with JWT tokens (RS256)
- Hardware fingerprinting (device binding)
- Session expiration (1-hour access token, 7-day refresh)
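Hardware fingerprinting for device binding can be as simple as hashing stable machine attributes. A stdlib sketch (the function and its input attributes are hypothetical; the SDK's real attribute set is not specified here):

```python
import hashlib

def hardware_fingerprint(machine_id: str, mac: str, disk_serial: str) -> str:
    """Hypothetical fingerprint: hash stable machine attributes so a seat
    binds to one device without storing raw identifiers server-side."""
    raw = "|".join((machine_id, mac, disk_serial)).encode()
    return hashlib.sha256(raw).hexdigest()[:32]

fp1 = hardware_fingerprint("m-01", "aa:bb:cc:dd:ee:ff", "S123")
fp2 = hardware_fingerprint("m-01", "aa:bb:cc:dd:ee:ff", "S123")  # same device
fp3 = hardware_fingerprint("m-02", "aa:bb:cc:dd:ee:ff", "S123")  # different
```

Hashing keeps the fingerprint deterministic per device while revealing nothing about the underlying identifiers if the database leaks.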
Layer 4: Authorization (RBAC)
- Tenant-based isolation (PostgreSQL RLS)
- Role-based access control (admin, user, readonly)
- Workload Identity for GCP API access
Layer 5: Data (Encryption)
- Encryption at rest (Cloud SQL, Redis)
- Encryption in transit (TLS 1.3)
- Cloud KMS for key management (HSM-backed)
Cost Optimization Strategies
Current Costs (Development)
| Component | Monthly Cost | Optimization |
|---|---|---|
| GKE (3x preemptible) | $100 | Use preemptible nodes |
| Cloud SQL (Regional HA) | $150 | Committed use discount (57%) |
| Redis (6GB BASIC) | $30 | BASIC tier (no HA) |
| Networking | $20 | Cloud NAT (minimal egress) |
| Total | $300/month | $3,600/year |
Production Costs (10K users)
| Component | Monthly Cost | Optimization |
|---|---|---|
| GKE (10x standard) | $500 | 3-year committed use |
| Cloud SQL (8 vCPU HA) | $400 | Read replicas for analytics |
| Redis (16GB HA) | $150 | STANDARD HA for failover |
| Cloud KMS | $10 | Pay-per-use |
| Identity Platform | $50 | Free tier up to 50K MAU |
| Load Balancer | $50 | Premium tier for SLA |
| Monitoring | $40 | Managed Prometheus |
| Total | $1,200/month | $14,400/year |
Cost per User: $0.12/month per active user (at 10K users)
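Summing the itemized rows confirms the per-user figure (plain arithmetic over the table values above):

```python
# Monthly costs from the development and production tables
dev = {"gke": 100, "cloud_sql": 150, "redis": 30, "networking": 20}
prod = {"gke": 500, "cloud_sql": 400, "redis": 150, "kms": 10,
        "identity": 50, "lb": 50, "monitoring": 40}

dev_monthly = sum(dev.values())         # development total per month
prod_monthly = sum(prod.values())       # production total per month
cost_per_user = prod_monthly / 10_000   # per active user at 10K users
```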
Deployment Strategy
Blue/Green Deployment
# Blue environment (current production)
Namespace: coditect-prod-blue
Deployment: license-api-blue (version 1.2.3)
Service: license-api-blue
Ingress: Routes 100% traffic to blue
# Green environment (new version)
Namespace: coditect-prod-green
Deployment: license-api-green (version 1.3.0)
Service: license-api-green
Ingress: Routes 0% traffic initially
# Gradual rollout
1. Deploy green (0% traffic)
2. Health checks pass
3. Route 10% traffic to green (canary)
4. Monitor metrics for 30 minutes
5. Route 50% traffic to green
6. Route 100% traffic to green
7. Decommission blue after 24 hours
Rolling Update Strategy
Deployment Strategy: RollingUpdate
Max Surge: 1 (add 1 extra pod during update)
Max Unavailable: 0 (zero-downtime deployment)
Update Process:
1. Create new pod (version 1.3.0)
2. Wait for readiness probe (30s)
3. Add to service load balancer
4. Remove old pod (version 1.2.3)
5. Repeat until all pods updated
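The maxSurge/maxUnavailable invariant can be checked in a few lines: the simulation below (a hypothetical model, not Kubernetes API code) shows that with `maxSurge: 1` and `maxUnavailable: 0`, the total ready replica count never drops below the desired count.

```python
def rolling_update(replicas: int):
    """Simulate the RollingUpdate steps above: yield (ready_old, ready_new)
    after each step, surging one new pod before removing one old pod."""
    old, new = replicas, 0
    history = [(old, new)]
    while old > 0:
        new += 1                   # maxSurge: 1 — add one new pod, wait ready
        history.append((old, new))
        old -= 1                   # maxUnavailable: 0 — only now remove old
        history.append((old, new))
    return history

history = rolling_update(replicas=3)
min_available = min(o + n for o, n in history)  # never below desired count
```

Surging before removing is what buys zero downtime; the price is one extra pod's worth of capacity during the rollout.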
Disaster Recovery
Backup Strategy
Cloud SQL:
- Automated daily backups (03:00 UTC)
- Point-in-time recovery (7-day retention)
- Transaction log backups (every 5 minutes)
- Backup location: us-central1 + us-east1 (geo-redundant)
Redis:
- RDB snapshots every 15 minutes
- AOF persistence (future, for STANDARD HA)
- Manual snapshot before major changes
Secrets:
- Secret Manager automatic replication (global)
- Manual backup to encrypted GCS bucket (quarterly)
Recovery Procedures
Scenario 1: Pod Failure
- Detection: Liveness probe fails (30s)
- Action: Kubernetes auto-restarts pod
- Recovery Time: <60 seconds
- Data Loss: None (stateless pods)
Scenario 2: Node Failure
- Detection: Node NotReady (60s)
- Action: Pods rescheduled to healthy nodes
- Recovery Time: <2 minutes
- Data Loss: None (stateful data in Cloud SQL/Redis)
Scenario 3: Cloud SQL Failover
- Detection: Regional HA detects primary failure
- Action: Automatic failover to standby
- Recovery Time: <60 seconds
- Data Loss: None (synchronous replication)
Scenario 4: Redis Failure (BASIC tier)
- Detection: Connection timeout
- Action: Manual intervention (restart instance)
- Recovery Time: ~5 minutes
- Data Loss: Up to 15 minutes (RDB snapshot interval)
Scenario 5: Complete Region Failure
- Detection: Manual (monitoring alerts)
- Action: Failover to disaster recovery region (future)
- Recovery Time: ~4 hours (manual DR)
- Data Loss: Up to 1 hour (backup lag)
Related Diagrams
- C1: System Context Diagram - External view and integrations
- C3: GKE Component Diagram - Kubernetes internals
- C3: Networking Components - VPC and firewall details
- C3: Security Components - Authentication and encryption
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-11-23 | SDD Architect | Initial C2 Container diagram with all components |
Document Classification: Internal - Architecture Documentation Review Cycle: Quarterly (or upon infrastructure changes) Next Review Date: 2026-02-23