
ADR-020: GCP Cloud Run Deployment Strategy

Status: Accepted
Date: 2025-10-06
Deciders: Development Team, DevOps Team, Infrastructure Team
Related: ADR-016 (NGINX), ADR-017 (WebSocket), ADR-004 (FoundationDB)


Context

The AZ1.AI LLM IDE requires a cloud deployment strategy for production use. We need:

  • Scalability: Handle 1000+ concurrent users
  • Cost-Efficiency: Pay for actual usage, not idle capacity
  • Global Reach: Low latency worldwide
  • Easy Deployment: Simple CI/CD pipeline
  • WebSocket Support: For real-time communication
  • Stateful Services: FoundationDB, Redis, file storage

Current State

  • Local development with Docker Compose
  • No production deployment infrastructure
  • No CI/CD pipeline
  • Manual deployments

Requirements

  1. Auto-Scaling: Scale from 0 to 1000+ instances
  2. Global CDN: Fast content delivery worldwide
  3. Managed Services: Minimize operational overhead
  4. Cost Control: Budget-friendly for startup
  5. Security: SSL/TLS, IAM, VPC
  6. Monitoring: Metrics, logs, alerts
  7. High Availability: 99.9% uptime SLA

Decision

We will deploy to Google Cloud Platform using Cloud Run with supporting managed services:

Architecture

┌────────────────────────────────────────────────────────────────┐
│                           Global CDN                           │
│                   (Cloud CDN + Cloud Armor)                    │
└──────────────────────────┬─────────────────────────────────────┘
                           │
                           ▼
┌────────────────────────────────────────────────────────────────┐
│                      Global Load Balancer                      │
│                 (Cloud Load Balancing - HTTPS)                 │
└──────────────────────────┬─────────────────────────────────────┘
                           │
                           ▼
┌────────────────────────────────────────────────────────────────┐
│                       Cloud Run Services                       │
│                                                                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │    Theia     │  │  WebSocket   │  │ MCP Gateway  │          │
│  │   Frontend   │  │   Backend    │  │   Service    │          │
│  │ (Port 3000)  │  │ (Port 4000)  │  │              │          │
│  └──────────────┘  └──────────────┘  └──────────────┘          │
│         │                  │                  │                │
│         │                  │                  │                │
│         └──────────────────┼──────────────────┘                │
│                            │                                   │
└────────────────────────────┼───────────────────────────────────┘
                             │
                             ▼
┌────────────────────────────────────────────────────────────────┐
│                        Managed Services                        │
│                                                                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │ FoundationDB │  │ Memorystore  │  │    Cloud     │          │
│  │   (VMs on    │  │   (Redis)    │  │   Storage    │          │
│  │   Compute    │  │              │  │   (Files)    │          │
│  │   Engine)    │  │              │  │              │          │
│  └──────────────┘  └──────────────┘  └──────────────┘          │
│                                                                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │  Cloud SQL   │  │    Secret    │  │    Cloud     │          │
│  │ (PostgreSQL) │  │   Manager    │  │   Logging    │          │
│  │  (Metadata)  │  │              │  │              │          │
│  └──────────────┘  └──────────────┘  └──────────────┘          │
└────────────────────────────────────────────────────────────────┘

Service Breakdown

Compute:

  • Cloud Run: Theia frontend, WebSocket backend, MCP gateway
  • Compute Engine: FoundationDB cluster (3-5 VMs)

Storage:

  • Cloud Storage: User files, session data, static assets
  • Memorystore (Redis): Session cache, MCP response cache
  • Cloud SQL (PostgreSQL): User data, session metadata (alternative to FDB for some use cases)

Networking:

  • Cloud Load Balancing: Global HTTPS load balancer
  • Cloud CDN: Static asset caching
  • Cloud Armor: DDoS protection, WAF

Observability:

  • Cloud Logging: Centralized logs
  • Cloud Monitoring: Metrics, dashboards
  • Cloud Trace: Distributed tracing
  • Error Reporting: Exception tracking

Security:

  • Secret Manager: API keys, credentials
  • Identity Platform: User authentication
  • VPC: Private networking for services
  • Cloud IAM: Fine-grained access control

Implementation

1. Cloud Run Service Definitions

# cloud-run/theia-frontend.yaml

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: theia-frontend
  namespace: default
  annotations:
    run.googleapis.com/ingress: all
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: '1'
        autoscaling.knative.dev/maxScale: '100'
        run.googleapis.com/cpu-throttling: 'false' # Important for WebSocket
        run.googleapis.com/startup-cpu-boost: 'true'
        # Revision-level setting, matching the Terraform definition
        run.googleapis.com/execution-environment: gen2
    spec:
      containerConcurrency: 80
      timeoutSeconds: 3600 # 1 hour for WebSocket connections (Cloud Run maximum)
      containers:
        - name: theia
          image: gcr.io/PROJECT_ID/theia-frontend:latest
          ports:
            - name: http1
              containerPort: 3000
          env:
            - name: NODE_ENV
              value: production
            - name: PORT
              value: '3000'
            - name: REDIS_HOST
              valueFrom:
                secretKeyRef:
                  name: redis-connection
                  key: host
            - name: FDB_CLUSTER_FILE
              value: /etc/foundationdb/fdb.cluster
          resources:
            limits:
              cpu: '2000m'
              memory: '4Gi'
          volumeMounts:
            - name: fdb-config
              mountPath: /etc/foundationdb
              readOnly: true
      volumes:
        - name: fdb-config
          secret:
            secretName: fdb-cluster-file

# cloud-run/websocket-backend.yaml

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: websocket-backend
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: '2'
        autoscaling.knative.dev/maxScale: '200'
        run.googleapis.com/cpu-throttling: 'false'
    spec:
      containerConcurrency: 100
      # Cloud Run caps request duration at 3600 seconds (60 minutes), and
      # WebSocket connections count as requests, so clients must reconnect.
      timeoutSeconds: 3600
      containers:
        - name: websocket
          image: gcr.io/PROJECT_ID/websocket-backend:latest
          ports:
            - name: http1 # WebSocket upgrades are made over HTTP/1.1
              containerPort: 4000
          env:
            - name: PORT
              value: '4000'
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: redis-connection
                  key: url
            - name: GCS_BUCKET
              value: az1ai-user-files
          resources:
            limits:
              cpu: '4000m'
              memory: '8Gi'
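
The concurrency settings above determine how many instances are needed for the 1000-user target. The following is a rough sizing sketch, assuming each user holds one long-lived connection that occupies one concurrency slot for its whole lifetime; the function name is illustrative and not part of any Cloud Run API.

```python
import math


def instances_needed(concurrent_connections: int, container_concurrency: int) -> int:
    """Smallest instance count that gives every connection a concurrency slot."""
    if container_concurrency <= 0:
        raise ValueError("container_concurrency must be positive")
    if concurrent_connections < 0:
        raise ValueError("concurrent_connections must not be negative")
    return math.ceil(concurrent_connections / container_concurrency)


# containerConcurrency values taken from the two service definitions above.
print(instances_needed(1000, 80))   # theia-frontend: 13
print(instances_needed(1000, 100))  # websocket-backend: 10
```

Under this assumption both services stay well below their maxScale values (100 and 200) at 1000 users, which leaves headroom for traffic spikes.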

2. Dockerfile for Cloud Run

# Dockerfile.cloudrun

FROM node:20-slim AS builder

WORKDIR /app

# Copy package files
COPY package*.json ./
COPY tsconfig*.json ./

# Install dependencies
RUN npm ci --include=dev

# Copy source
COPY src ./src
COPY theia-app ./theia-app

# Build theia application
RUN npm run theia:build:prod

# Production image
FROM node:20-slim

WORKDIR /app

# Install production dependencies only
COPY package*.json ./
RUN npm ci --omit=dev

# Copy built application
COPY --from=builder /app/lib ./lib
COPY --from=builder /app/theia-app ./theia-app

# Expose port
EXPOSE 3000

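# Health check for local Docker runs. Cloud Run ignores HEALTHCHECK and uses
# its own startup and liveness probes. Assumes healthcheck.js is present in /app.
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s \
  CMD node healthcheck.js || exit 1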

# Start application
CMD ["npm", "run", "theia:start"]

3. FoundationDB on Compute Engine

#!/bin/bash
# scripts/deploy-fdb.sh

# Create FoundationDB VMs (3-node cluster)
for i in {1..3}; do
  gcloud compute instances create "fdb-node-$i" \
    --zone=us-central1-a \
    --machine-type=n2-standard-8 \
    --boot-disk-size=100GB \
    --boot-disk-type=pd-ssd \
    --image-family=ubuntu-2204-lts \
    --image-project=ubuntu-os-cloud \
    --metadata-from-file startup-script=install-fdb.sh \
    --tags=fdb-cluster \
    --scopes=cloud-platform
done

# Create firewall rule for FDB cluster communication
gcloud compute firewall-rules create allow-fdb-internal \
  --network=default \
  --allow=tcp:4500-4520 \
  --source-tags=fdb-cluster \
  --target-tags=fdb-cluster
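
#!/bin/bash
# install-fdb.sh (startup script for FDB VMs)

# Download and install FoundationDB
wget https://github.com/apple/foundationdb/releases/download/7.1.27/foundationdb-server_7.1.27-1_amd64.deb
wget https://github.com/apple/foundationdb/releases/download/7.1.27/foundationdb-clients_7.1.27-1_amd64.deb

sudo dpkg -i foundationdb-clients_7.1.27-1_amd64.deb
sudo dpkg -i foundationdb-server_7.1.27-1_amd64.deb

# Configure the database. "single" keeps one copy of the data (no replication).
# As written, every VM runs this independently, so the three nodes only form
# one cluster once they share a cluster file that lists the coordinators.
sudo fdbcli --exec "configure new single ssd"

# Copy the cluster file to Cloud Storage so clients can retrieve it.
# This is not a data backup; data backups need a tool such as fdbbackup.
sudo gsutil cp /etc/foundationdb/fdb.cluster gs://az1ai-fdb-backups/cluster/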
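
For the three VMs to behave as one cluster, every node and client needs the same cluster file. FoundationDB cluster files are a single line of the form `description:ID@ip:port,ip:port,...`. The sketch below builds such a line; the description, ID, and IP addresses are placeholders invented for illustration, and port 4500 is assumed because it is the FoundationDB default and falls inside the firewall range above.

```python
import re


def build_cluster_file(description: str, cluster_id: str,
                       coordinator_ips: list[str], port: int = 4500) -> str:
    """Return the single-line contents of an fdb.cluster file."""
    if not coordinator_ips:
        raise ValueError("at least one coordinator is required")
    # The description allows letters, digits and underscores; the ID is alphanumeric.
    if not re.fullmatch(r"[A-Za-z0-9_]+", description):
        raise ValueError("invalid description")
    if not re.fullmatch(r"[A-Za-z0-9]+", cluster_id):
        raise ValueError("invalid cluster ID")
    addresses = ",".join(f"{ip}:{port}" for ip in coordinator_ips)
    return f"{description}:{cluster_id}@{addresses}"


# Hypothetical internal IPs for fdb-node-1..3.
print(build_cluster_file("az1ai_prod", "a1b2c3d4",
                         ["10.0.0.11", "10.0.0.12", "10.0.0.13"]))
```

The resulting line is what would be stored in the fdb-cluster-file secret mounted at /etc/foundationdb by the Cloud Run service.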

4. CI/CD Pipeline (Cloud Build)

# cloudbuild.yaml

steps:
  # Build Theia frontend
  - name: 'gcr.io/cloud-builders/docker'
    args:
      - 'build'
      - '-t'
      - 'gcr.io/$PROJECT_ID/theia-frontend:$COMMIT_SHA'
      - '-t'
      - 'gcr.io/$PROJECT_ID/theia-frontend:latest'
      - '-f'
      - 'Dockerfile.cloudrun'
      - '.'

  # Push images
  - name: 'gcr.io/cloud-builders/docker'
    args:
      - 'push'
      - 'gcr.io/$PROJECT_ID/theia-frontend:$COMMIT_SHA'

  # Deploy to Cloud Run
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: gcloud
    args:
      - 'run'
      - 'deploy'
      - 'theia-frontend'
      - '--image=gcr.io/$PROJECT_ID/theia-frontend:$COMMIT_SHA'
      - '--region=us-central1'
      - '--platform=managed'
      - '--allow-unauthenticated'
      - '--max-instances=100'
      - '--min-instances=1'
      - '--memory=4Gi'
      - '--cpu=2'
      - '--timeout=3600'
      - '--concurrency=80'
      - '--set-env-vars=NODE_ENV=production'

  # Run database migrations
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: bash
    args:
      - '-c'
      - |
        gcloud run jobs execute fdb-migrate \
          --region=us-central1 \
          --wait

images:
  - 'gcr.io/$PROJECT_ID/theia-frontend:$COMMIT_SHA'
  - 'gcr.io/$PROJECT_ID/theia-frontend:latest'

options:
  machineType: 'E2_HIGHCPU_8'
  logging: CLOUD_LOGGING_ONLY

5. Infrastructure as Code (Terraform)

# infrastructure/main.tf

terraform {
  required_version = ">= 1.0"
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }

  backend "gcs" {
    bucket = "az1ai-terraform-state"
    prefix = "prod"
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}

# Cloud Run Service
resource "google_cloud_run_service" "theia_frontend" {
  name     = "theia-frontend"
  location = var.region

  template {
    spec {
      containers {
        image = "gcr.io/${var.project_id}/theia-frontend:latest"

        ports {
          container_port = 3000
        }

        resources {
          limits = {
            cpu    = "2000m"
            memory = "4Gi"
          }
        }

        env {
          name  = "NODE_ENV"
          value = "production"
        }

        env {
          name = "REDIS_URL"
          value_from {
            secret_key_ref {
              name = google_secret_manager_secret.redis_url.secret_id
              key  = "latest"
            }
          }
        }
      }

      container_concurrency = 80
      timeout_seconds       = 3600

      service_account_name = google_service_account.cloud_run_sa.email
    }

    metadata {
      annotations = {
        "autoscaling.knative.dev/minScale"         = "1"
        "autoscaling.knative.dev/maxScale"         = "100"
        "run.googleapis.com/cpu-throttling"        = "false"
        "run.googleapis.com/startup-cpu-boost"     = "true"
        "run.googleapis.com/execution-environment" = "gen2"
      }
    }
  }

  traffic {
    percent         = 100
    latest_revision = true
  }
}

# Memorystore Redis
resource "google_redis_instance" "cache" {
  name           = "az1ai-cache"
  tier           = "STANDARD_HA"
  memory_size_gb = 5
  region         = var.region

  redis_version = "REDIS_7_0"
  display_name  = "AZ1.AI Cache"

  authorized_network = google_compute_network.vpc.id

  redis_configs = {
    maxmemory-policy = "allkeys-lru"
  }
}

# Cloud Storage Bucket
resource "google_storage_bucket" "user_files" {
  name          = "az1ai-user-files"
  location      = "US"
  storage_class = "STANDARD"

  uniform_bucket_level_access = true

  lifecycle_rule {
    condition {
      age = 90
    }
    action {
      type          = "SetStorageClass"
      storage_class = "NEARLINE"
    }
  }

  lifecycle_rule {
    condition {
      age = 365
    }
    action {
      type          = "SetStorageClass"
      storage_class = "COLDLINE"
    }
  }

  cors {
    origin          = ["https://ide.az1.ai"]
    method          = ["GET", "HEAD", "PUT", "POST", "DELETE"]
    response_header = ["*"]
    max_age_seconds = 3600
  }
}

# Cloud SQL (Alternative to FoundationDB for metadata)
resource "google_sql_database_instance" "metadata" {
  name             = "az1ai-metadata"
  database_version = "POSTGRES_15"
  region           = var.region

  settings {
    tier              = "db-custom-4-16384" # 4 vCPU, 16GB RAM
    availability_type = "REGIONAL"          # HA

    disk_type = "PD_SSD"
    disk_size = 100

    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
      start_time                     = "03:00"
      transaction_log_retention_days = 7
    }

    ip_configuration {
      ipv4_enabled    = false
      private_network = google_compute_network.vpc.id
    }

    database_flags {
      name  = "max_connections"
      value = "200"
    }
  }

  deletion_protection = true
}

# Load Balancer
resource "google_compute_global_address" "default" {
  name = "az1ai-lb-ip"
}

resource "google_compute_global_forwarding_rule" "https" {
  name       = "az1ai-https-lb"
  target     = google_compute_target_https_proxy.default.id
  port_range = "443"
  ip_address = google_compute_global_address.default.address
}

resource "google_compute_target_https_proxy" "default" {
  name             = "az1ai-https-proxy"
  url_map          = google_compute_url_map.default.id
  ssl_certificates = [google_compute_managed_ssl_certificate.default.id]
}

resource "google_compute_managed_ssl_certificate" "default" {
  name = "az1ai-ssl-cert"

  managed {
    domains = ["ide.az1.ai", "www.ide.az1.ai"]
  }
}

resource "google_compute_url_map" "default" {
  name            = "az1ai-url-map"
  default_service = google_compute_backend_service.cloud_run.id
}

resource "google_compute_backend_service" "cloud_run" {
  name        = "az1ai-backend"
  protocol    = "HTTP"
  port_name   = "http"
  timeout_sec = 3600
  enable_cdn  = true

  backend {
    group = google_compute_region_network_endpoint_group.cloud_run_neg.id
  }

  cdn_policy {
    cache_mode  = "CACHE_ALL_STATIC"
    default_ttl = 3600
    max_ttl     = 86400
  }

  log_config {
    enable      = true
    sample_rate = 1.0
  }
}

resource "google_compute_region_network_endpoint_group" "cloud_run_neg" {
  name                  = "cloud-run-neg"
  network_endpoint_type = "SERVERLESS"
  region                = var.region

  cloud_run {
    service = google_cloud_run_service.theia_frontend.name
  }
}

# VPC Network
resource "google_compute_network" "vpc" {
  name                    = "az1ai-vpc"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "private" {
  name          = "az1ai-private"
  ip_cidr_range = "10.0.0.0/24"
  region        = var.region
  network       = google_compute_network.vpc.id

  private_ip_google_access = true
}

# Service Account
resource "google_service_account" "cloud_run_sa" {
  account_id   = "cloud-run-sa"
  display_name = "Cloud Run Service Account"
}

resource "google_project_iam_member" "cloud_run_sa_roles" {
  for_each = toset([
    "roles/cloudsql.client",
    "roles/secretmanager.secretAccessor",
    "roles/storage.objectAdmin",
    "roles/logging.logWriter",
    "roles/cloudtrace.agent"
  ])

  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.cloud_run_sa.email}"
}

# Secret Manager
resource "google_secret_manager_secret" "redis_url" {
  secret_id = "redis-url"

  replication {
    # Provider 5.x replaced "automatic = true" with an "auto" block.
    auto {}
  }
}

# Monitoring
resource "google_monitoring_alert_policy" "cloud_run_errors" {
  display_name = "Cloud Run High Error Rate"
  combiner     = "OR"

  conditions {
    # ALIGN_RATE turns the 5xx request count into requests per second, so the
    # threshold below is an absolute rate, not a percentage of total traffic.
    display_name = "5xx rate > 0.05 requests/s"

    condition_threshold {
      filter          = "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_count\" AND metric.label.response_code_class=\"5xx\""
      duration        = "300s"
      comparison      = "COMPARISON_GT"
      threshold_value = 0.05

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_RATE"
      }
    }
  }

  notification_channels = [google_monitoring_notification_channel.email.name]
}

resource "google_monitoring_notification_channel" "email" {
  display_name = "Email Notifications"
  type         = "email"

  labels = {
    email_address = "alerts@az1.ai"
  }
}

# Outputs
output "load_balancer_ip" {
  value = google_compute_global_address.default.address
}

output "cloud_run_url" {
  value = google_cloud_run_service.theia_frontend.status[0].url
}

output "redis_host" {
  value = google_redis_instance.cache.host
}
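
The alert condition in the Terraform above filters on 5xx request counts aligned as a rate. A percentage-style error rate is instead the ratio of 5xx responses to all responses over the same window. The sketch below shows that ratio check in isolation; the function names and the 5% default are illustrative, and the counts would come from whatever metrics source is in use.

```python
def error_ratio(total_requests: int, server_errors: int) -> float:
    """Fraction of requests in a window that returned a 5xx response."""
    if total_requests < 0 or server_errors < 0:
        raise ValueError("counts must not be negative")
    if server_errors > total_requests:
        raise ValueError("server_errors cannot exceed total_requests")
    if total_requests == 0:
        return 0.0  # no traffic means nothing to alert on
    return server_errors / total_requests


def should_alert(total_requests: int, server_errors: int,
                 threshold: float = 0.05) -> bool:
    """True when the error ratio is strictly above the threshold."""
    return error_ratio(total_requests, server_errors) > threshold


print(should_alert(1000, 60))  # True: 6% of requests failed
print(should_alert(1000, 50))  # False: exactly 5% does not exceed the threshold
```

A ratio keeps the alert meaningful at both low and high traffic, whereas an absolute rate of 0.05 errors per second fires at very different failure percentages depending on load.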

Rationale

Why Cloud Run?

Serverless Benefits:

  • ✅ Auto-scaling from 0 to 1000+ instances
  • ✅ Pay only for actual usage (not idle time)
  • ✅ No server management
  • ✅ Built-in load balancing

Performance:

  • ✅ Gen 2 execution environment (faster CPU and network performance)
  • ✅ CPU boost on startup
  • ✅ WebSocket and HTTP/2 support

Cost:

  • ✅ Free tier: 2M requests/month
  • ✅ $0.00002400/vCPU-second (about $62.21 per 30-day month for 1 vCPU 24/7)
  • ✅ $0.00000250/GiB-second (about $6.48 per 30-day month for 1 GiB 24/7)
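
The monthly figures follow directly from the per-second rates. A minimal sketch of the arithmetic, assuming a 30-day month, the rates quoted above, and an instance that is billed continuously (free-tier credits are ignored):

```python
SECONDS_PER_DAY = 86_400

VCPU_RATE = 0.000024   # USD per vCPU-second, as quoted above
GIB_RATE = 0.0000025   # USD per GiB-second, as quoted above


def monthly_cost(rate_per_second: float, units: float = 1.0, days: int = 30) -> float:
    """Cost of running `units` of a resource continuously for `days` days."""
    return rate_per_second * units * SECONDS_PER_DAY * days


cpu = monthly_cost(VCPU_RATE)
memory = monthly_cost(GIB_RATE)
# One always-on theia-frontend instance: 2 vCPU and 4 GiB.
frontend = monthly_cost(VCPU_RATE, units=2) + monthly_cost(GIB_RATE, units=4)

print(f"1 vCPU:   ${cpu:.2f}")       # $62.21
print(f"1 GiB:    ${memory:.2f}")    # $6.48
print(f"frontend: ${frontend:.2f}")  # $150.34
```

The last line shows why minimum instances matter for the budget: each always-on frontend instance costs roughly $150 per month before any traffic arrives.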

Why GCP vs AWS/Azure?

GCP Advantages:

  • ✅ Cloud Run (best serverless containers)
  • ✅ Global load balancer (built-in CDN)
  • ✅ Generous free tier
  • ✅ Simple pricing
  • ✅ Excellent Kubernetes integration (GKE)

AWS Disadvantages:

  • ❌ Fargate more expensive
  • ❌ Complex networking setup
  • ❌ More services to manage

Azure Disadvantages:

  • ❌ Container Apps less mature
  • ❌ Higher latency in some regions
  • ❌ Less generous free tier

Why Memorystore vs Self-Hosted Redis?

Managed Service Benefits:

  • ✅ High availability (HA tier)
  • ✅ Automatic failover
  • ✅ Automatic backups
  • ✅ No maintenance

Cost:

  • ✅ $0.053/GB-hour (~$38/month for 1GB HA)
  • ✅ Cheaper than managing VMs

Alternatives Considered

Alternative 1: GKE (Google Kubernetes Engine)

Pros:

  • More control
  • Better for complex deployments
  • Easier multi-cloud migration

Cons:

  • ❌ More complex
  • ❌ Higher cost (always-on nodes)
  • ❌ More operational overhead

Deferred: Start with Cloud Run, migrate to GKE if needed

Alternative 2: AWS ECS Fargate

Pros:

  • Similar to Cloud Run
  • AWS ecosystem

Cons:

  • ❌ More expensive
  • ❌ More complex networking
  • ❌ Slower cold starts

Rejected: Cloud Run is simpler and cheaper

Alternative 3: Vercel/Netlify

Pros:

  • Extremely simple
  • Great DX
  • Built-in CI/CD

Cons:

  • ❌ Expensive at scale
  • ❌ Vendor lock-in
  • ❌ Limited backend capabilities

Rejected: Need more control for backend

Alternative 4: Self-Hosted (Bare Metal)

Pros:

  • Full control
  • Predictable cost

Cons:

  • ❌ High upfront cost
  • ❌ Operational burden
  • ❌ No auto-scaling

Rejected: Too much overhead for startup


Consequences

Positive

  • ✅ Cost-Efficient: Pay only for usage, generous free tier
  • ✅ Auto-Scaling: Handle traffic spikes automatically
  • ✅ Global: Low latency worldwide with Cloud CDN
  • ✅ Managed: Less operational overhead
  • ✅ Fast Deployment: CI/CD with Cloud Build
  • ✅ Secure: IAM, VPC, Secret Manager built-in

Negative

  • ❌ Vendor Lock-In: GCP-specific features (Cloud Run)
  • ❌ Cold Starts: ~1-2s for first request (mitigated with min instances)
  • ❌ Cost Unpredictability: Usage-based pricing can spike
  • ❌ WebSocket Limits: Cloud Run caps request duration at 60 minutes, so long-lived connections are closed at the timeout and clients must reconnect

Mitigation

Vendor Lock-In:

  • Use Docker containers (portable)
  • Abstract cloud-specific code
  • Document migration path to GKE/other clouds

Cold Starts:

  • Set min instances = 1 for critical services
  • Use startup CPU boost
  • Optimize container image size

Cost Unpredictability:

  • Set budget alerts
  • Monitor usage dashboards
  • Use quotas to cap spending

WebSocket Limits:

  • Document timeout behavior
  • Implement reconnection logic
  • Consider GKE for long-lived connections if needed
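
The reconnection logic mentioned above is commonly built on exponential backoff with jitter, so that many clients dropped at the same timeout do not all reconnect at once. The sketch below only computes the wait times; the attempt count, base delay, and cap are illustrative defaults, not values taken from this ADR.

```python
import random
from typing import Iterator, Optional


def reconnect_delays(max_attempts: int = 6, base: float = 1.0, cap: float = 30.0,
                     rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one wait time (seconds) per reconnection attempt.

    The upper bound doubles each attempt (1s, 2s, 4s, ...) up to `cap`, and
    the actual delay is drawn uniformly below that bound ("full jitter").
    """
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng.uniform(0, ceiling)


# A seeded generator makes the schedule reproducible for testing.
for attempt, delay in enumerate(reconnect_delays(rng=random.Random(0)), start=1):
    print(f"attempt {attempt}: wait {delay:.2f}s")
```

A client would sleep for each yielded delay before retrying the WebSocket handshake and reset the schedule once a connection succeeds.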

Implementation Plan

Phase 1: Development Environment ✅

  • Dockerize application
  • Local Docker Compose setup
  • Development workflow

Phase 2: GCP Setup 🔲

  • Create GCP project
  • Enable required APIs
  • Set up billing
  • Configure IAM roles

Phase 3: Core Services 🔲

  • Deploy Cloud Run services
  • Configure Memorystore Redis
  • Set up Cloud Storage
  • Deploy FoundationDB cluster

Phase 4: Networking 🔲

  • Configure load balancer
  • Set up Cloud CDN
  • Configure SSL certificates
  • Set up Cloud Armor (WAF)

Phase 5: CI/CD 🔲

  • Set up Cloud Build
  • Create deployment pipeline
  • Automated testing
  • Rollback strategy

Phase 6: Observability 🔲

  • Cloud Logging integration
  • Cloud Monitoring dashboards
  • Alert policies
  • Error tracking

Phase 7: Production Hardening 🔲

  • Load testing (10K users)
  • Disaster recovery plan
  • Backup strategy
  • Security audit

Success Metrics

Performance:

  • < 2s cold start time
  • < 100ms p99 latency (warm)
  • 1000+ concurrent users

Reliability:

  • 99.9% uptime
  • < 1% error rate
  • Zero data loss
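
To make the uptime target concrete, it helps to convert it into an allowed downtime budget. A small sketch of that conversion, assuming a 30-day month:

```python
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


print(f"{downtime_budget_minutes(99.9):.1f} minutes per month")  # 43.2
```

At 99.9% the whole month's budget is about 43 minutes, so a single slow rollback or FoundationDB failover can consume most of it.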

Cost:

  • < $500/month for 100 active users
  • < $5000/month for 1000 active users

Deployment:

  • < 10 minutes deployment time
  • Zero-downtime deployments
  • Automated rollbacks


References

Cloud Run:

GCP Services:

Best Practices:


Status: ✅ Accepted
Next Review: 2025-11-06 (1 month)
Last Updated: 2025-10-06