ADR-020: GCP Cloud Run Deployment Strategy
Status: Accepted
Date: 2025-10-06
Deciders: Development Team, DevOps Team, Infrastructure Team
Related: ADR-016 (NGINX), ADR-017 (WebSocket), ADR-004 (FoundationDB)
Context
The AZ1.AI LLM IDE requires a cloud deployment strategy for production use. We need:
- Scalability: Handle 1000+ concurrent users
- Cost-Efficiency: Pay for actual usage, not idle capacity
- Global Reach: Low latency worldwide
- Easy Deployment: Simple CI/CD pipeline
- WebSocket Support: For real-time communication
- Stateful Services: FoundationDB, Redis, file storage
Current State
- Local development with Docker Compose
- No production deployment infrastructure
- No CI/CD pipeline
- Manual deployments
Requirements
- Auto-Scaling: Scale from 0 to 1000+ instances
- Global CDN: Fast content delivery worldwide
- Managed Services: Minimize operational overhead
- Cost Control: Budget-friendly for startup
- Security: SSL/TLS, IAM, VPC
- Monitoring: Metrics, logs, alerts
- High Availability: 99.9% uptime SLA
Decision
We will deploy to Google Cloud Platform using Cloud Run with supporting managed services:
Architecture
┌────────────────────────────────────────────────────────────────┐
│                           Global CDN                           │
│                   (Cloud CDN + Cloud Armor)                    │
└──────────────────────────┬─────────────────────────────────────┘
                           │
                           ▼
┌────────────────────────────────────────────────────────────────┐
│                      Global Load Balancer                      │
│                 (Cloud Load Balancing - HTTPS)                 │
└──────────────────────────┬─────────────────────────────────────┘
                           │
                           ▼
┌────────────────────────────────────────────────────────────────┐
│                       Cloud Run Services                       │
│                                                                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │ theia        │  │ WebSocket    │  │ MCP Gateway  │          │
│  │ Frontend     │  │ Backend      │  │ Service      │          │
│  │ (Port 3000)  │  │ (Port 4000)  │  │              │          │
│  └──────────────┘  └──────────────┘  └──────────────┘          │
│         │                  │                  │                │
│         │                  │                  │                │
│         └──────────────────┼──────────────────┘                │
│                            │                                   │
└────────────────────────────┼───────────────────────────────────┘
                             │
                             ▼
┌────────────────────────────────────────────────────────────────┐
│                        Managed Services                        │
│                                                                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │ FoundationDB │  │ Memorystore  │  │ Cloud        │          │
│  │ (VMs on      │  │ (Redis)      │  │ Storage      │          │
│  │ Compute      │  │              │  │ (Files)      │          │
│  │ Engine)      │  │              │  │              │          │
│  └──────────────┘  └──────────────┘  └──────────────┘          │
│                                                                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │ Cloud SQL    │  │ Secret       │  │ Cloud        │          │
│  │ (PostgreSQL) │  │ Manager      │  │ Logging      │          │
│  │ (Metadata)   │  │              │  │              │          │
│  └──────────────┘  └──────────────┘  └──────────────┘          │
└────────────────────────────────────────────────────────────────┘
Service Breakdown
Compute:
- Cloud Run: theia frontend, WebSocket backend, MCP gateway
- Compute Engine: FoundationDB cluster (3-5 VMs)
Storage:
- Cloud Storage: User files, session data, static assets
- Memorystore (Redis): Session cache, MCP response cache
- Cloud SQL (PostgreSQL): User data, session metadata (alternative to FDB for some use cases)
Networking:
- Cloud Load Balancing: Global HTTPS load balancer
- Cloud CDN: Static asset caching
- Cloud Armor: DDoS protection, WAF
Observability:
- Cloud Logging: Centralized logs
- Cloud Monitoring: Metrics, dashboards
- Cloud Trace: Distributed tracing
- Error Reporting: Exception tracking
Security:
- Secret Manager: API keys, credentials
- Identity Platform: User authentication
- VPC: Private networking for services
- Cloud IAM: Fine-grained access control
Implementation
1. Cloud Run Service Definitions
# cloud-run/theia-frontend.yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: theia-frontend
  namespace: default
  annotations:
    run.googleapis.com/ingress: all
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: '1'
        autoscaling.knative.dev/maxScale: '100'
        run.googleapis.com/cpu-throttling: 'false' # Important for WebSocket
        run.googleapis.com/startup-cpu-boost: 'true'
        # The execution environment is a revision-level setting, so it goes on the template
        run.googleapis.com/execution-environment: gen2
    spec:
      containerConcurrency: 80
      timeoutSeconds: 3600 # 1 hour, the maximum request timeout for a Cloud Run service
      containers:
        - name: theia
          image: gcr.io/PROJECT_ID/theia-frontend:latest
          ports:
            - name: http1
              containerPort: 3000
          env:
            # PORT is a reserved variable: Cloud Run injects it from containerPort
            - name: NODE_ENV
              value: production
            - name: REDIS_HOST
              valueFrom:
                secretKeyRef:
                  # On Cloud Run, `name` is the Secret Manager secret and `key` is its version.
                  # `redis-host` is an assumed secret name holding the Memorystore host.
                  name: redis-host
                  key: latest
            - name: FDB_CLUSTER_FILE
              value: /etc/foundationdb/fdb.cluster
          resources:
            limits:
              cpu: '2000m'
              memory: '4Gi'
          volumeMounts:
            - name: fdb-config
              mountPath: /etc/foundationdb
              readOnly: true
      volumes:
        - name: fdb-config
          secret:
            secretName: fdb-cluster-file
            items:
              # Mount the latest secret version as fdb.cluster to match FDB_CLUSTER_FILE
              - key: latest
                path: fdb.cluster
# cloud-run/websocket-backend.yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: websocket-backend
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: '2'
        autoscaling.knative.dev/maxScale: '200'
        run.googleapis.com/cpu-throttling: 'false'
    spec:
      containerConcurrency: 100
      # Cloud Run caps the request timeout at 3600s (60 minutes). WebSocket
      # connections are closed at that point, so clients must reconnect.
      timeoutSeconds: 3600
      containers:
        - name: websocket
          image: gcr.io/PROJECT_ID/websocket-backend:latest
          ports:
            # WebSocket upgrades use HTTP/1.1; h2c is meant for end-to-end HTTP/2 (e.g. gRPC)
            - name: http1
              containerPort: 4000
          env:
            # PORT is a reserved variable: Cloud Run injects it from containerPort
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  # `name` is the Secret Manager secret, `key` is its version
                  name: redis-url
                  key: latest
            - name: GCS_BUCKET
              value: az1ai-user-files
          resources:
            limits:
              cpu: '4000m'
              memory: '8Gi'
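As a rough capacity check for the settings above: each open WebSocket counts against containerConcurrency, so the instance count needed for a given number of connections is a ceiling division. The sketch below is illustrative only. The function name and the one-connection-per-user assumption are not part of the original design.

```python
def instances_needed(connections: int, container_concurrency: int) -> int:
    """Smallest instance count able to hold `connections` concurrent requests."""
    if container_concurrency <= 0:
        raise ValueError("container_concurrency must be positive")
    if connections < 0:
        raise ValueError("connections must be non-negative")
    # Integer ceiling division avoids floating-point rounding.
    return -(-connections // container_concurrency)


# Values taken from websocket-backend.yaml.
CONTAINER_CONCURRENCY = 100
MAX_SCALE = 200

# Assumption: one WebSocket connection per concurrent user.
print(instances_needed(1000, CONTAINER_CONCURRENCY))  # instances for the 1000-user target
print(MAX_SCALE * CONTAINER_CONCURRENCY)              # connection ceiling at maxScale
```

Under these assumptions the 1000-user target fits well below maxScale, which leaves headroom for users who hold several connections at once.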
2. Dockerfile for Cloud Run
# Dockerfile.cloudrun
FROM node:20-slim AS builder
WORKDIR /app
# Copy package files
COPY package*.json ./
COPY tsconfig*.json ./
# Install all dependencies (dev dependencies are needed for the build)
RUN npm ci --include=dev
# Copy source
COPY src ./src
COPY theia-app ./theia-app
# Build theia application
RUN npm run theia:build:prod
# Production image
FROM node:20-slim
WORKDIR /app
# Install production dependencies only
COPY package*.json ./
RUN npm ci --omit=dev
# Copy built application
COPY --from=builder /app/lib ./lib
COPY --from=builder /app/theia-app ./theia-app
# Expose port
EXPOSE 3000
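# Health check (honored by local Docker only; Cloud Run ignores HEALTHCHECK
# and uses its own startup/liveness probes). Assumes healthcheck.js lives at
# the repository root.
COPY healthcheck.js ./
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s \
  CMD node healthcheck.js || exit 1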
# Start application
CMD ["npm", "run", "theia:start"]
3. FoundationDB on Compute Engine
#!/bin/bash
# scripts/deploy-fdb.sh
# Create FoundationDB VMs (3-node cluster)
for i in {1..3}; do
gcloud compute instances create fdb-node-$i \
--zone=us-central1-a \
--machine-type=n2-standard-8 \
--boot-disk-size=100GB \
--boot-disk-type=pd-ssd \
--image-family=ubuntu-2204-lts \
--image-project=ubuntu-os-cloud \
--metadata-from-file startup-script=install-fdb.sh \
--tags=fdb-cluster \
--scopes=cloud-platform
done
# Create firewall rule for FDB cluster communication
gcloud compute firewall-rules create allow-fdb-internal \
--network=default \
--allow=tcp:4500-4520 \
--source-tags=fdb-cluster \
--target-tags=fdb-cluster
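#!/bin/bash
# install-fdb.sh (startup script for FDB VMs; runs as root on each node)
set -euo pipefail
# Download and install FoundationDB
wget https://github.com/apple/foundationdb/releases/download/7.1.27/foundationdb-server_7.1.27-1_amd64.deb
wget https://github.com/apple/foundationdb/releases/download/7.1.27/foundationdb-clients_7.1.27-1_amd64.deb
dpkg -i foundationdb-clients_7.1.27-1_amd64.deb
dpkg -i foundationdb-server_7.1.27-1_amd64.deb
# Switch the storage engine to ssd. The server package already creates a
# single-node database on install, so `configure new` would fail here.
# NOTE: at this point each VM still forms its own cluster. Joining the nodes
# under one fdb.cluster file and then running `configure double` for
# replication is a manual step not covered by this script.
fdbcli --exec "configure ssd"
# Publish the cluster file to Cloud Storage so clients can fetch it.
# This copies connection info only; it is not a data backup.
gsutil cp /etc/foundationdb/fdb.cluster gs://az1ai-fdb-backups/cluster/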
4. CI/CD Pipeline (Cloud Build)
# cloudbuild.yaml
steps:
  # Build theia frontend
  - name: 'gcr.io/cloud-builders/docker'
    args:
      - 'build'
      - '-t'
      - 'gcr.io/$PROJECT_ID/theia-frontend:$COMMIT_SHA'
      - '-t'
      - 'gcr.io/$PROJECT_ID/theia-frontend:latest'
      - '-f'
      - 'Dockerfile.cloudrun'
      - '.'
  # Push images
  - name: 'gcr.io/cloud-builders/docker'
    args:
      - 'push'
      - 'gcr.io/$PROJECT_ID/theia-frontend:$COMMIT_SHA'
  # Deploy to Cloud Run
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: gcloud
    args:
      - 'run'
      - 'deploy'
      - 'theia-frontend'
      - '--image=gcr.io/$PROJECT_ID/theia-frontend:$COMMIT_SHA'
      - '--region=us-central1'
      - '--platform=managed'
      - '--allow-unauthenticated'
      - '--max-instances=100'
      - '--min-instances=1'
      - '--memory=4Gi'
      - '--cpu=2'
      - '--timeout=3600'
      - '--concurrency=80'
      - '--set-env-vars=NODE_ENV=production'
  # Run database migrations
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: bash
    args:
      - '-c'
      - |
        gcloud run jobs execute fdb-migrate \
          --region=us-central1 \
          --wait
images:
  - 'gcr.io/$PROJECT_ID/theia-frontend:$COMMIT_SHA'
  - 'gcr.io/$PROJECT_ID/theia-frontend:latest'
options:
  machineType: 'E2_HIGHCPU_8'
  logging: CLOUD_LOGGING_ONLY
5. Infrastructure as Code (Terraform)
# infrastructure/main.tf
terraform {
required_version = ">= 1.0"
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
backend "gcs" {
bucket = "az1ai-terraform-state"
prefix = "prod"
}
}
provider "google" {
project = var.project_id
region = var.region
}
# Cloud Run Service
resource "google_cloud_run_service" "theia_frontend" {
name = "theia-frontend"
location = var.region
template {
spec {
containers {
image = "gcr.io/${var.project_id}/theia-frontend:latest"
ports {
container_port = 3000
}
resources {
limits = {
cpu = "2000m"
memory = "4Gi"
}
}
env {
name = "NODE_ENV"
value = "production"
}
env {
name = "REDIS_URL"
value_from {
secret_key_ref {
name = google_secret_manager_secret.redis_url.secret_id
key = "latest"
}
}
}
}
container_concurrency = 80
timeout_seconds = 3600
service_account_name = google_service_account.cloud_run_sa.email
}
metadata {
annotations = {
"autoscaling.knative.dev/minScale" = "1"
"autoscaling.knative.dev/maxScale" = "100"
"run.googleapis.com/cpu-throttling" = "false"
"run.googleapis.com/startup-cpu-boost" = "true"
"run.googleapis.com/execution-environment" = "gen2"
}
}
}
traffic {
percent = 100
latest_revision = true
}
}
# Memorystore Redis
resource "google_redis_instance" "cache" {
name = "az1ai-cache"
tier = "STANDARD_HA"
memory_size_gb = 5
region = var.region
redis_version = "REDIS_7_0"
display_name = "AZ1.AI Cache"
authorized_network = google_compute_network.vpc.id
redis_configs = {
maxmemory-policy = "allkeys-lru"
}
}
# Cloud Storage Bucket
resource "google_storage_bucket" "user_files" {
name = "az1ai-user-files"
location = "US"
storage_class = "STANDARD"
uniform_bucket_level_access = true
lifecycle_rule {
condition {
age = 90
}
action {
type = "SetStorageClass"
storage_class = "NEARLINE"
}
}
lifecycle_rule {
condition {
age = 365
}
action {
type = "SetStorageClass"
storage_class = "COLDLINE"
}
}
cors {
origin = ["https://ide.az1.ai"]
method = ["GET", "HEAD", "PUT", "POST", "DELETE"]
response_header = ["*"]
max_age_seconds = 3600
}
}
# Cloud SQL (Alternative to FoundationDB for metadata)
resource "google_sql_database_instance" "metadata" {
name = "az1ai-metadata"
database_version = "POSTGRES_15"
region = var.region
settings {
tier = "db-custom-4-16384" # 4 vCPU, 16GB RAM
availability_type = "REGIONAL" # HA
disk_type = "PD_SSD"
disk_size = 100
backup_configuration {
enabled = true
point_in_time_recovery_enabled = true
start_time = "03:00"
transaction_log_retention_days = 7
}
ip_configuration {
ipv4_enabled = false
private_network = google_compute_network.vpc.id
}
database_flags {
name = "max_connections"
value = "200"
}
}
deletion_protection = true
}
# Load Balancer
resource "google_compute_global_address" "default" {
name = "az1ai-lb-ip"
}
resource "google_compute_global_forwarding_rule" "https" {
name = "az1ai-https-lb"
target = google_compute_target_https_proxy.default.id
port_range = "443"
ip_address = google_compute_global_address.default.address
}
resource "google_compute_target_https_proxy" "default" {
name = "az1ai-https-proxy"
url_map = google_compute_url_map.default.id
ssl_certificates = [google_compute_managed_ssl_certificate.default.id]
}
resource "google_compute_managed_ssl_certificate" "default" {
name = "az1ai-ssl-cert"
managed {
domains = ["ide.az1.ai", "www.ide.az1.ai"]
}
}
resource "google_compute_url_map" "default" {
name = "az1ai-url-map"
default_service = google_compute_backend_service.cloud_run.id
}
resource "google_compute_backend_service" "cloud_run" {
name = "az1ai-backend"
protocol = "HTTP"
port_name = "http"
timeout_sec = 3600
enable_cdn = true
backend {
group = google_compute_region_network_endpoint_group.cloud_run_neg.id
}
cdn_policy {
cache_mode = "CACHE_ALL_STATIC"
default_ttl = 3600
max_ttl = 86400
}
log_config {
enable = true
sample_rate = 1.0
}
}
resource "google_compute_region_network_endpoint_group" "cloud_run_neg" {
name = "cloud-run-neg"
network_endpoint_type = "SERVERLESS"
region = var.region
cloud_run {
service = google_cloud_run_service.theia_frontend.name
}
}
# VPC Network
resource "google_compute_network" "vpc" {
name = "az1ai-vpc"
auto_create_subnetworks = false
}
resource "google_compute_subnetwork" "private" {
name = "az1ai-private"
ip_cidr_range = "10.0.0.0/24"
region = var.region
network = google_compute_network.vpc.id
private_ip_google_access = true
}
# Service Account
resource "google_service_account" "cloud_run_sa" {
account_id = "cloud-run-sa"
display_name = "Cloud Run Service Account"
}
resource "google_project_iam_member" "cloud_run_sa_roles" {
for_each = toset([
"roles/cloudsql.client",
"roles/secretmanager.secretAccessor",
"roles/storage.objectAdmin",
"roles/logging.logWriter",
"roles/cloudtrace.agent"
])
project = var.project_id
role = each.value
member = "serviceAccount:${google_service_account.cloud_run_sa.email}"
}
# Secret Manager
resource "google_secret_manager_secret" "redis_url" {
secret_id = "redis-url"
replication {
# Provider 5.x replaced `automatic = true` with an `auto {}` block
auto {}
}
}
# Monitoring
resource "google_monitoring_alert_policy" "cloud_run_errors" {
display_name = "Cloud Run High Error Rate"
combiner = "OR"
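conditions {
display_name = "Error rate > 5%"
condition_threshold {
# Numerator: rate of 5xx responses
filter = "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_count\" AND metric.label.response_code_class=\"5xx\""
# Denominator: rate of all responses, so the threshold is a ratio (5%) rather than 0.05 requests/second
denominator_filter = "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_count\""
duration = "300s"
comparison = "COMPARISON_GT"
threshold_value = 0.05
aggregations {
alignment_period = "60s"
per_series_aligner = "ALIGN_RATE"
cross_series_reducer = "REDUCE_SUM"
}
denominator_aggregations {
alignment_period = "60s"
per_series_aligner = "ALIGN_RATE"
cross_series_reducer = "REDUCE_SUM"
}
}
}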
notification_channels = [google_monitoring_notification_channel.email.name]
}
resource "google_monitoring_notification_channel" "email" {
display_name = "Email Notifications"
type = "email"
labels = {
email_address = "alerts@az1.ai"
}
}
# Outputs
output "load_balancer_ip" {
value = google_compute_global_address.default.address
}
output "cloud_run_url" {
value = google_cloud_run_service.theia_frontend.status[0].url
}
output "redis_host" {
value = google_redis_instance.cache.host
}
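The intent behind the "Error rate > 5%" alert condition can be stated as a ratio of 5xx responses to all responses, sustained over the 300s duration in 60s alignment periods. The sketch below is only a model of that check for reasoning and unit tests. It does not call the Cloud Monitoring API, and the function and parameter names are hypothetical.

```python
from typing import Sequence, Tuple


def error_rate(count_5xx: int, count_total: int) -> float:
    """Fraction of responses that were 5xx; 0.0 when there was no traffic."""
    if count_total == 0:
        return 0.0
    return count_5xx / count_total


def should_alert(windows: Sequence[Tuple[int, int]], threshold: float = 0.05) -> bool:
    """Return True when every window exceeds the threshold.

    `windows` holds one (count_5xx, count_total) pair per 60s alignment
    period; five consecutive pairs model the 300s duration.
    """
    if not windows:
        return False
    return all(error_rate(errors, total) > threshold for errors, total in windows)


# Five consecutive minutes at 6% errors would fire; one minute at exactly 5% would not.
print(should_alert([(6, 100)] * 5))
print(should_alert([(6, 100)] * 4 + [(5, 100)]))
```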
Rationale
Why Cloud Run?
Serverless Benefits:
- ✅ Auto-scaling from 0 to 1000+ instances
- ✅ Pay only for actual usage when scaled to zero (services with min instances or always-allocated CPU are billed while idle)
- ✅ No server management
- ✅ Built-in load balancing
Performance:
- ✅ Gen 2 execution environment (faster cold starts)
- ✅ CPU boost on startup
- ✅ Native WebSocket support (over HTTP/1.1), plus HTTP/2 for gRPC and streaming
Cost:
- ✅ Free tier: 2M requests/month
- ✅ $0.00002400/vCPU-second (~$62.21/month for 1 vCPU 24/7, assuming a 30-day month)
- ✅ $0.00000250/GiB-second (~$6.48/month for 1 GiB 24/7, assuming a 30-day month)
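The monthly figures follow directly from the per-second rates. A minimal sketch of the arithmetic, assuming a 30-day month and ignoring the free tier and per-request charges (the helper name is hypothetical):

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: 30-day month = 2,592,000 s


def monthly_cost(rate_per_second: float, units: float = 1.0) -> float:
    """Cost of keeping `units` of a resource allocated for a full month."""
    return rate_per_second * units * SECONDS_PER_MONTH


VCPU_RATE = 0.000024   # USD per vCPU-second, as listed above
GIB_RATE = 0.0000025   # USD per GiB-second, as listed above

print(f"{monthly_cost(VCPU_RATE):.2f}")  # 62.21
print(f"{monthly_cost(GIB_RATE):.2f}")   # 6.48

# One always-on theia-frontend instance (2 vCPU, 4 GiB) under the same assumptions.
print(f"{monthly_cost(VCPU_RATE, 2) + monthly_cost(GIB_RATE, 4):.2f}")  # 150.34
```

Multiplying the last figure by the configured minimum instance counts gives a floor for the always-on portion of the bill.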
Why GCP vs AWS/Azure?
GCP Advantages:
- ✅ Cloud Run (best serverless containers)
- ✅ Global load balancer (built-in CDN)
- ✅ Generous free tier
- ✅ Simple pricing
- ✅ Excellent Kubernetes integration (GKE)
AWS Disadvantages:
- ❌ Fargate more expensive
- ❌ Complex networking setup
- ❌ More services to manage
Azure Disadvantages:
- ❌ Container Apps less mature
- ❌ Higher latency in some regions
- ❌ Less generous free tier
Why Memorystore vs Self-Hosted Redis?
Managed Service Benefits:
- ✅ High availability (HA tier)
- ✅ Automatic failover
- ✅ Automatic backups
- ✅ No maintenance
Cost:
- ✅ $0.053/GB-hour (~$38/month for 1GB HA)
- ✅ Cheaper than managing VMs
Alternatives Considered
Alternative 1: GKE (Google Kubernetes Engine)
Pros:
- More control
- Better for complex deployments
- Easier multi-cloud migration
Cons:
- ❌ More complex
- ❌ Higher cost (always-on nodes)
- ❌ More operational overhead
Deferred: Start with Cloud Run, migrate to GKE if needed
Alternative 2: AWS ECS Fargate
Pros:
- Similar to Cloud Run
- AWS ecosystem
Cons:
- ❌ More expensive
- ❌ More complex networking
- ❌ Slower cold starts
Rejected: Cloud Run is simpler and cheaper
Alternative 3: Vercel/Netlify
Pros:
- Extremely simple
- Great DX
- Built-in CI/CD
Cons:
- ❌ Expensive at scale
- ❌ Vendor lock-in
- ❌ Limited backend capabilities
Rejected: Need more control for backend
Alternative 4: Self-Hosted (Bare Metal)
Pros:
- Full control
- Predictable cost
Cons:
- ❌ High upfront cost
- ❌ Operational burden
- ❌ No auto-scaling
Rejected: Too much overhead for startup
Consequences
Positive
- ✅ Cost-Efficient: Pay only for usage, generous free tier
- ✅ Auto-Scaling: Handle traffic spikes automatically
- ✅ Global: Low latency worldwide with Cloud CDN
- ✅ Managed: Less operational overhead
- ✅ Fast Deployment: CI/CD with Cloud Build
- ✅ Secure: IAM, VPC, Secret Manager built-in
Negative
- ❌ Vendor Lock-In: GCP-specific features (Cloud Run)
- ❌ Cold Starts: ~1-2s for first request (mitigated with min instances)
- ❌ Cost Unpredictability: Usage-based pricing can spike
- ❌ WebSocket Limits: connections are subject to the request timeout, which is capped at 60 minutes, so clients must reconnect
Mitigation
Vendor Lock-In:
- Use Docker containers (portable)
- Abstract cloud-specific code
- Document migration path to GKE/other clouds
Cold Starts:
- Set min instances = 1 for critical services
- Use startup CPU boost
- Optimize container image size
Cost Unpredictability:
- Set budget alerts
- Monitor usage dashboards
- Use quotas to cap spending
WebSocket Limits:
- Document timeout behavior
- Implement reconnection logic
- Consider GKE for long-lived connections if needed
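One common way to implement the reconnection logic mentioned above is exponential backoff with a cap and optional jitter, so that many clients dropped at the same timeout do not all reconnect at once. The sketch below only computes the delay schedule. The function name, defaults, and jitter scheme are assumptions for illustration, not the project's client code.

```python
import random
from typing import List, Optional


def backoff_delays(
    attempts: int,
    base: float = 1.0,
    cap: float = 30.0,
    jitter: float = 0.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Delay in seconds before each reconnect attempt.

    Delays double from `base` up to `cap`; `jitter` is a fraction
    (e.g. 0.2 for +/-20%) applied to each delay to spread clients out.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            delay *= 1 + rng.uniform(-jitter, jitter)
        delays.append(delay)
    return delays


print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

A client would sleep for each delay in turn before retrying the connection, and reset the schedule after a successful reconnect.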
Implementation Plan
Phase 1: Development Environment ✅
- Dockerize application
- Local Docker Compose setup
- Development workflow
Phase 2: GCP Setup 🔲
- Create GCP project
- Enable required APIs
- Set up billing
- Configure IAM roles
Phase 3: Core Services 🔲
- Deploy Cloud Run services
- Configure Memorystore Redis
- Set up Cloud Storage
- Deploy FoundationDB cluster
Phase 4: Networking 🔲
- Configure load balancer
- Set up Cloud CDN
- Configure SSL certificates
- Set up Cloud Armor (WAF)
Phase 5: CI/CD 🔲
- Set up Cloud Build
- Create deployment pipeline
- Automated testing
- Rollback strategy
Phase 6: Observability 🔲
- Cloud Logging integration
- Cloud Monitoring dashboards
- Alert policies
- Error tracking
Phase 7: Production Hardening 🔲
- Load testing (10K users)
- Disaster recovery plan
- Backup strategy
- Security audit
Success Metrics
Performance:
- < 2s cold start time
- < 100ms p99 latency (warm)
- 1000+ concurrent users
Reliability:
- 99.9% uptime
- < 1% error rate
- Zero data loss
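To make the 99.9% uptime target concrete, it can be converted into a downtime budget. A minimal sketch, assuming a 30-day month (the function name is hypothetical):

```python
def downtime_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime over `days` for an availability target `slo`."""
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * days * 24 * 60


print(round(downtime_budget_minutes(0.999), 1))  # 43.2
```

At 99.9%, a single failed deployment that takes 45 minutes to roll back would exhaust the monthly budget, which is why the automated-rollback metric below matters.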
Cost:
- < $500/month for 100 active users
- < $5000/month for 1000 active users
Deployment:
- < 10 minutes deployment time
- Zero-downtime deployments
- Automated rollbacks
Related Decisions
- ADR-016: NGINX Load Balancer - Frontend LB
- ADR-017: WebSocket Backend - Backend architecture
- ADR-004: FoundationDB - Database
Status: ✅ Accepted
Next Review: 2025-11-06 (1 month)
Last Updated: 2025-10-06