ADR-016: Use NGINX as Load Balancer for Frontend

Status: Accepted
Date: 2025-10-06
Deciders: Development Team, DevOps Team
Related: ADR-014 (theia), ADR-017 (WebSocket)


Context

The AZ1.AI LLM IDE needs a robust frontend load-balancing solution to:

  • Handle high traffic volumes
  • Distribute load across multiple theia instances
  • Terminate SSL/TLS connections
  • Serve static assets efficiently
  • Proxy WebSocket connections to backend
  • Provide health checks and failover

Current State

  • Single theia instance on port 3000
  • No load balancing
  • No SSL termination
  • Direct client β†’ server connection

Requirements

  1. Scalability: Handle 1000+ concurrent users
  2. Reliability: Automatic failover if instance fails
  3. Security: SSL/TLS termination, HTTP/2 support
  4. Performance: Static asset caching, compression
  5. WebSocket: Proxy WebSocket connections for real-time communication
  6. Monitoring: Health checks, metrics export

Decision

We will use NGINX as the frontend load balancer with the following configuration:

Architecture

                 Internet
                     β”‚
                     β–Ό
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚    NGINX     β”‚
              β”‚ Load Balancerβ”‚
              β”‚  (Port 443)  β”‚
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚
     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
     β”‚               β”‚               β”‚
     β–Ό               β–Ό               β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  theia   β”‚    β”‚  theia   β”‚    β”‚  theia   β”‚
β”‚Instance 1β”‚    β”‚Instance 2β”‚    β”‚Instance 3β”‚
β”‚  :3000   β”‚    β”‚  :3001   β”‚    β”‚  :3002   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

NGINX Configuration

Load Balancing Strategy: Least Connections (a good fit for IDE workloads, where sessions hold connections open for very different lengths of time)
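As a rough illustration (not NGINX's implementation), least-connections routing simply picks the backend with the fewest in-flight connections; the addresses mirror the upstream block in the configuration below.

```python
# Illustrative sketch of least-connections selection, as performed by
# NGINX's least_conn directive. Connection counts here are made up.
def pick_backend(active_connections):
    """Return the backend address with the fewest active connections."""
    return min(active_connections, key=active_connections.get)

conns = {"127.0.0.1:3000": 12, "127.0.0.1:3001": 4, "127.0.0.1:3002": 9}
assert pick_backend(conns) == "127.0.0.1:3001"
```

With round-robin, a backend stuck serving several long-lived IDE sessions would keep receiving new ones; least_conn avoids that.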

Features Enabled:

  • SSL/TLS termination
  • HTTP/2 and HTTP/3 (QUIC)
  • Gzip compression
  • Static asset caching
  • WebSocket proxying
  • Health checks
  • Rate limiting

Implementation​

1. NGINX Configuration File

# /etc/nginx/nginx.conf (http context)

# Shared-memory zone for rate limiting; limit_req_zone must be declared
# at http level, not inside a server block
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

upstream theia_backend {
    least_conn;  # Route each request to the least-busy instance

    # theia instances
    server 127.0.0.1:3000 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:3001 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:3002 max_fails=3 fail_timeout=30s;

    # Keep idle connections to the upstream open for reuse
    keepalive 32;
}

server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2;
    server_name ide.az1.ai;

    # SSL Configuration
    ssl_certificate /etc/letsencrypt/live/ide.az1.ai/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ide.az1.ai/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;

    # Security Headers
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;

    # Rate Limiting (zone declared above at http level)
    limit_req zone=api_limit burst=20 nodelay;

    # Gzip Compression
    gzip on;
    gzip_vary on;
    gzip_min_length 1024;
    gzip_types text/plain text/css text/xml text/javascript
               application/json application/javascript application/xml+rss;

    # Static Assets (proxy_cache_valid takes effect only once a
    # proxy_cache zone is configured; expires/Cache-Control apply regardless)
    location ~* \.(js|css|png|jpg|jpeg|gif|ico|svg|woff|woff2|ttf|eot)$ {
        proxy_pass http://theia_backend;
        proxy_cache_valid 200 1d;
        expires 1d;
        add_header Cache-Control "public, immutable";
    }

    # WebSocket Support
    location /services {
        proxy_pass http://theia_backend;

        # WebSocket headers
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # Standard proxy headers
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Timeouts for long-lived connections
        proxy_read_timeout 3600s;
        proxy_send_timeout 3600s;
    }

    # Main Application
    location / {
        proxy_pass http://theia_backend;

        # HTTP/1.1 with an empty Connection header is required for the
        # upstream keepalive pool to be used
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Timeouts
        proxy_read_timeout 300s;
        proxy_connect_timeout 75s;
    }

    # Health Check Endpoint
    location /health {
        access_log off;
        add_header Content-Type text/plain;
        return 200 "healthy\n";
    }
}

# Redirect HTTP to HTTPS
server {
    listen 80;
    listen [::]:80;
    server_name ide.az1.ai;
    return 301 https://$server_name$request_uri;
}

2. Docker Compose (Development)

# Note: inside this compose network the NGINX upstream should reference
# the service names theia1:3000, theia2:3000 and theia3:3000, not 127.0.0.1.
version: '3.8'

services:
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./ssl:/etc/letsencrypt:ro
    depends_on:
      - theia1
      - theia2
      - theia3
    networks:
      - theia-network

  theia1:
    build: .
    ports:
      - "3000:3000"
    networks:
      - theia-network
    environment:
      - NODE_ENV=production

  theia2:
    build: .
    ports:
      - "3001:3000"
    networks:
      - theia-network
    environment:
      - NODE_ENV=production

  theia3:
    build: .
    ports:
      - "3002:3000"
    networks:
      - theia-network
    environment:
      - NODE_ENV=production

networks:
  theia-network:
    driver: bridge

3. Health Check Script

#!/bin/bash
# /usr/local/bin/theia-health-check.sh

# Exit non-zero if the local theia instance stops responding
curl -fsS http://localhost:3000/health > /dev/null || exit 1
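The failover behaviour configured by max_fails=3 fail_timeout=30s in the upstream block can be modelled as follows; this is an illustrative sketch of the passive health-check semantics, not NGINX code:

```python
# Model of NGINX passive health checks: after max_fails consecutive
# failures, an instance leaves the rotation for fail_timeout seconds.
import time

class BackendState:
    def __init__(self, addr, max_fails=3, fail_timeout=30.0):
        self.addr = addr
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.fails = 0
        self.down_until = 0.0

    def record(self, ok, now=None):
        """Record the outcome of one proxied request or probe."""
        now = time.monotonic() if now is None else now
        if ok:
            self.fails = 0
        else:
            self.fails += 1
            if self.fails >= self.max_fails:
                self.down_until = now + self.fail_timeout

    def available(self, now=None):
        now = time.monotonic() if now is None else now
        return now >= self.down_until

b = BackendState("127.0.0.1:3000")
for _ in range(3):
    b.record(ok=False, now=100.0)
assert not b.available(now=110.0)  # still inside the fail_timeout window
assert b.available(now=130.0)      # back in rotation after 30s
```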

Rationale

Why NGINX?

Performance:

  • βœ… Event-driven architecture (handles 10K+ concurrent connections)
  • βœ… Low memory footprint (~10MB per worker)
  • βœ… Efficient static file serving
  • βœ… Built-in caching

Features:

  • βœ… Native WebSocket support
  • βœ… HTTP/2 and HTTP/3 (QUIC)
  • βœ… SSL/TLS termination
  • βœ… Load balancing algorithms (least_conn, ip_hash, round_robin)
  • βœ… Health checks and failover

Ecosystem:

  • βœ… Battle-tested (powers 30%+ of top websites)
  • βœ… Extensive documentation
  • βœ… Large community
  • βœ… Easy to configure

Cost:

  • βœ… Free and open source
  • βœ… Enterprise version available (NGINX Plus)

Alternatives Considered

Alternative 1: HAProxy

Pros:

  • Excellent load balancing features
  • Advanced health checks
  • Good for TCP/HTTP

Cons:

  • ❌ No built-in caching
  • ❌ More complex configuration
  • ❌ Requires separate SSL terminator

Rejected: NGINX provides more features out-of-the-box

Alternative 2: Traefik

Pros:

  • Modern cloud-native LB
  • Automatic service discovery
  • Let's Encrypt integration

Cons:

  • ❌ Higher resource usage
  • ❌ Younger, less mature
  • ❌ More complex for simple use case

Rejected: Overkill for current needs

Alternative 3: Cloud Load Balancer (GCP/AWS)

Pros:

  • Fully managed
  • Auto-scaling
  • Global CDN integration

Cons:

  • ❌ Vendor lock-in
  • ❌ Higher cost
  • ❌ Less control

Deferred: Consider for production scaling


Consequences

Positive

  • βœ… Scalability: Handle 1000+ concurrent users
  • βœ… Reliability: Automatic failover on instance failure
  • βœ… Security: SSL/TLS termination, security headers
  • βœ… Performance: Static caching, compression, HTTP/2
  • βœ… WebSocket: Native support for real-time connections
  • βœ… Monitoring: Built-in metrics, health checks
  • βœ… Cost: Free and open source

Negative

  • ❌ Complexity: Additional infrastructure component
  • ❌ Maintenance: Need to manage NGINX config
  • ❌ Single Point of Failure: Need HA setup for production
  • ❌ SSL Management: Need to handle certificate renewal

Mitigation

Complexity:

  • Use Docker Compose for easy deployment
  • Provide example configurations
  • Document common patterns

Maintenance:

  • Automate config updates with Ansible/Terraform
  • Version control NGINX configs
  • Use config validation before reload

HA Setup (Production):

  • Deploy multiple NGINX instances
  • Use Keepalived for VIP failover
  • Or use cloud LB in front of NGINX
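A minimal Keepalived sketch for the VIP-failover option above; the interface name, router ID, and virtual IP are placeholders to adapt, and the backup node would use state BACKUP with a lower priority:

```
# /etc/keepalived/keepalived.conf (primary node, illustrative values)
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        203.0.113.10
    }
}
```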

SSL Management:

  • Automate with Certbot/ACME
  • Use cert-manager in Kubernetes
  • Monitor expiration with alerts
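The "monitor expiration" item can be sketched with the standard library: given a certificate's notAfter field (as returned by ssl.getpeercert()), compute the days remaining and alert below a threshold. The dates here are illustrative:

```python
# Days until a TLS certificate expires, from its notAfter string
# (format used by ssl.getpeercert(), e.g. "Feb 18 10:28:20 2030 GMT").
import ssl
import time

def days_until_expiry(not_after, now=None):
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400

# Pin "now" so the example is deterministic
now = ssl.cert_time_to_seconds("Jan 01 00:00:00 2030 GMT")
days = days_until_expiry("Jan 31 00:00:00 2030 GMT", now=now)
assert round(days) == 30  # e.g. alert when fewer than 14 remain
```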

Implementation Plan

Phase 1: Development Setup

  • Create NGINX Docker container
  • Configure basic load balancing
  • Test with 3 theia instances
  • Add WebSocket proxying
  • Test health checks

Phase 2: Security Hardening

  • Enable SSL/TLS with self-signed certs (dev)
  • Add security headers
  • Implement rate limiting
  • Add request logging

Phase 3: Production Deployment

  • Obtain Let's Encrypt certificate
  • Enable HTTP/2 and HTTP/3
  • Configure caching
  • Set up monitoring (Prometheus, Grafana)
  • Load testing (10K concurrent users)
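For the monitoring item, NGINX's bundled stub_status module exposes basic connection and request counters that a separate exporter (such as nginx-prometheus-exporter) can scrape for Prometheus; the location name and allow-list below are illustrative:

```
# Inside the HTTPS server block; requires ngx_http_stub_status_module
location /nginx_status {
    stub_status;
    allow 127.0.0.1;   # restrict to the scraper's address
    deny all;
}
```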

Phase 4: High Availability

  • Deploy NGINX in HA mode (Keepalived)
  • Add health monitoring
  • Implement automatic scaling
  • Disaster recovery procedures

Success Metrics

Performance:

  • Handle 1000+ concurrent users
  • < 50ms latency overhead
  • Static assets served from cache (>90% hit rate)
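One hedged way to verify the cache-hit target: if the access log format is extended to include $upstream_cache_status, the hit rate can be computed from the log. The log lines below are fabricated for illustration:

```python
# Compute cache hit rate from access-log lines that record
# $upstream_cache_status (HIT/MISS/EXPIRED/BYPASS). Sample data is made up.
import re

sample_log = """\
GET /static/app.js HIT
GET /static/app.css HIT
GET /static/logo.png MISS
GET /static/app.js HIT
"""

statuses = re.findall(r"\b(HIT|MISS|EXPIRED|BYPASS)\b", sample_log)
hit_rate = statuses.count("HIT") / len(statuses)
assert abs(hit_rate - 0.75) < 1e-9  # target in production is > 0.90
```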

Reliability:

  • 99.9% uptime
  • < 5s failover time
  • Zero downtime deployments

Security:

  • A+ SSL Labs rating
  • All security headers present
  • Rate limiting prevents DoS



Status: βœ… Accepted
Next Review: 2025-11-06 (1 month)
Last Updated: 2025-10-06