ADR-016: Use NGINX as Load Balancer for Frontend
Status: Accepted
Date: 2025-10-06
Deciders: Development Team, DevOps Team
Related: ADR-014 (Theia), ADR-017 (WebSocket)
Context
The AZ1.AI LLM IDE needs a robust frontend load-balancing solution to:
- Handle high traffic volumes
- Distribute load across multiple Theia instances
- Terminate SSL/TLS connections
- Serve static assets efficiently
- Proxy WebSocket connections to backend
- Provide health checks and failover
Current State
- Single Theia instance on port 3000
- No load balancing
- No SSL termination
- Direct client → server connection
Requirements
- Scalability: Handle 1000+ concurrent users
- Reliability: Automatic failover if instance fails
- Security: SSL/TLS termination, HTTP/2 support
- Performance: Static asset caching, compression
- WebSocket: Proxy WebSocket connections for real-time communication
- Monitoring: Health checks, metrics export
Decision
We will use NGINX as the frontend load balancer with the following configuration:
Architecture
                Internet
                    │
                    ▼
             ┌─────────────┐
             │    NGINX    │
             │Load Balancer│
             │ (Port 443)  │
             └──────┬──────┘
                    │
     ┌──────────────┼──────────────┐
     │              │              │
     ▼              ▼              ▼
┌──────────┐   ┌──────────┐   ┌──────────┐
│  Theia   │   │  Theia   │   │  Theia   │
│Instance 1│   │Instance 2│   │Instance 3│
│  :3000   │   │  :3001   │   │  :3002   │
└──────────┘   └──────────┘   └──────────┘
NGINX Configuration
Load Balancing Strategy: Least Connections (best for IDE workloads)
Features Enabled:
- SSL/TLS termination
- HTTP/2 and HTTP/3 (QUIC)
- Gzip compression
- Static asset caching
- WebSocket proxying
- Health checks (passive, via max_fails and fail_timeout)
- Rate limiting
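HTTP/3 needs its own UDP listener in addition to the TLS listener on TCP. The following is a minimal sketch, assuming NGINX 1.25 or later built with the HTTP/3 module (check `nginx -V` for `--with-http_v3_module`) and UDP port 443 reachable from clients; the `ma` lifetime is an illustrative value, not a measured one.

```nginx
server {
    # TCP listener for HTTP/1.1 and HTTP/2
    listen 443 ssl;
    http2 on;                         # directive form, available since 1.25.1

    # Additional UDP listener for HTTP/3 over QUIC
    listen 443 quic reuseport;

    server_name ide.az1.ai;
    ssl_certificate     /etc/letsencrypt/live/ide.az1.ai/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ide.az1.ai/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;    # QUIC itself requires TLS 1.3

    # Advertise HTTP/3 so browsers can switch on later requests
    # (the ma value is an assumption)
    add_header Alt-Svc 'h3=":443"; ma=86400' always;
}
```

In a container setup, the UDP port would also have to be published (for example `443:443/udp`), otherwise clients silently stay on HTTP/2.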
Implementation
1. NGINX Configuration File
# /etc/nginx/nginx.conf

events {}  # Required top-level block; tune worker_connections for the expected load

http {
    # Rate-limiting zone (limit_req_zone is only valid in the http context)
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

    # Cache store for static assets (path and sizes are example values)
    proxy_cache_path /var/cache/nginx/theia keys_zone=theia_static:10m max_size=1g;

    upstream theia_backend {
        least_conn;  # Route to the server with the fewest active connections

        # Theia instances; max_fails/fail_timeout provide passive health checks
        server 127.0.0.1:3000 max_fails=3 fail_timeout=30s;
        server 127.0.0.1:3001 max_fails=3 fail_timeout=30s;
        server 127.0.0.1:3002 max_fails=3 fail_timeout=30s;

        # Idle upstream connections kept open per worker (not a health check)
        keepalive 32;
    }

    server {
        listen 443 ssl http2;
        listen [::]:443 ssl http2;
        server_name ide.az1.ai;

        # SSL Configuration
        ssl_certificate /etc/letsencrypt/live/ide.az1.ai/fullchain.pem;
        ssl_certificate_key /etc/letsencrypt/live/ide.az1.ai/privkey.pem;
        ssl_protocols TLSv1.2 TLSv1.3;
        ssl_ciphers HIGH:!aNULL:!MD5;
        ssl_prefer_server_ciphers on;

        # Security Headers
        # Note: a location that declares its own add_header does not inherit these
        add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
        add_header X-Frame-Options "SAMEORIGIN" always;
        add_header X-Content-Type-Options "nosniff" always;
        add_header X-XSS-Protection "1; mode=block" always;

        # Gzip Compression
        gzip on;
        gzip_vary on;
        gzip_min_length 1024;
        gzip_types text/plain text/css text/xml text/javascript
                   application/json application/javascript application/xml+rss;

        # Static Assets (Cache)
        location ~* \.(js|css|png|jpg|jpeg|gif|ico|svg|woff|woff2|ttf|eot)$ {
            proxy_pass http://theia_backend;
            proxy_http_version 1.1;          # needed for upstream keepalive
            proxy_set_header Connection "";
            proxy_cache theia_static;        # proxy_cache_valid has no effect without a zone
            proxy_cache_valid 200 1d;
            expires 1d;
            add_header Cache-Control "public, immutable";
        }

        # WebSocket Support
        location /services {
            proxy_pass http://theia_backend;

            # WebSocket headers
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";

            # Standard proxy headers
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            # Timeouts for long-lived connections
            proxy_read_timeout 3600s;
            proxy_send_timeout 3600s;
        }

        # Main Application
        location / {
            proxy_pass http://theia_backend;
            proxy_http_version 1.1;          # needed for upstream keepalive
            proxy_set_header Connection "";
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            # Timeouts
            proxy_read_timeout 300s;
            proxy_connect_timeout 75s;
        }

        # Health Check Endpoint (reports that NGINX itself is up, not the backends)
        location /health {
            access_log off;
            default_type text/plain;
            return 200 "healthy\n";
        }

        # Rate Limiting (applies to every location in this server)
        limit_req zone=api_limit burst=20 nodelay;
    }

    # Redirect HTTP to HTTPS
    server {
        listen 80;
        listen [::]:80;
        server_name ide.az1.ai;
        return 301 https://$server_name$request_uri;
    }
}
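The `/services` location sends `Connection: upgrade` on every request, including ones that are not WebSocket handshakes. The WebSocket proxying example in the NGINX documentation uses a `map` so the header is set only when the client asks for an upgrade. A sketch of that pattern follows; `$connection_upgrade` is a variable defined by the `map` itself, not a built-in.

```nginx
# http context
map $http_upgrade $connection_upgrade {
    default upgrade;   # client sent an Upgrade header: pass the upgrade through
    ''      close;     # no Upgrade header: treat as an ordinary request
}

server {
    # ... listen, ssl_*, and the other locations as in the main configuration ...

    location /services {
        proxy_pass http://theia_backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Host $host;
        proxy_read_timeout 3600s;
        proxy_send_timeout 3600s;
    }
}
```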
2. Docker Compose (Development)
version: '3.8'

services:
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./ssl:/etc/letsencrypt:ro
    depends_on:
      - theia1
      - theia2
      - theia3
    networks:
      - theia-network

  theia1:
    build: .
    ports:
      - "3000:3000"
    networks:
      - theia-network
    environment:
      - NODE_ENV=production

  theia2:
    build: .
    ports:
      - "3001:3000"
    networks:
      - theia-network
    environment:
      - NODE_ENV=production

  theia3:
    build: .
    ports:
      - "3002:3000"
    networks:
      - theia-network
    environment:
      - NODE_ENV=production

networks:
  theia-network:
    driver: bridge
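Inside the Compose network, `127.0.0.1` in the nginx container refers to the nginx container itself, so an upstream pointing at `127.0.0.1:3000` through `:3002` would not reach the Theia containers. A sketch of the upstream block for this setup, assuming the service names from the Compose file above and that each container listens on port 3000 internally (as the port mappings suggest):

```nginx
upstream theia_backend {
    least_conn;

    # Compose service names resolve through Docker's embedded DNS.
    # The host-side ports 3001 and 3002 are irrelevant on the internal network.
    server theia1:3000 max_fails=3 fail_timeout=30s;
    server theia2:3000 max_fails=3 fail_timeout=30s;
    server theia3:3000 max_fails=3 fail_timeout=30s;

    keepalive 32;
}
```

NGINX resolves these names once at startup, so the Theia containers need to exist before nginx starts, which the `depends_on` entries take care of.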
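3. Health Check Script
#!/bin/bash
# /usr/local/bin/theia-health-check.sh
# Exit non-zero if the Theia instance on port 3000 does not respond.
# -f: fail on HTTP errors; -sS: quiet, but print errors; --max-time: never hang
curl -fsS --max-time 5 -o /dev/null http://localhost:3000/health || exit 1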
Rationale
Why NGINX?
Performance:
- ✅ Event-driven architecture (handles 10K+ concurrent connections)
- ✅ Low memory footprint (~10MB per worker)
- ✅ Efficient static file serving
- ✅ Built-in caching
Features:
- ✅ Native WebSocket support
- ✅ HTTP/2 and HTTP/3 (QUIC)
- ✅ SSL/TLS termination
- ✅ Load balancing algorithms (round robin by default, least_conn, ip_hash)
- ✅ Passive health checks and failover (active health checks require NGINX Plus)
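Switching between the load-balancing algorithms is a one-directive change in the upstream block. The sketch below shows the alternatives side by side with one active at a time; the upstream name `theia_backend_sticky` is hypothetical, and whether this workload needs client affinity at all is an assumption to validate.

```nginx
upstream theia_backend_sticky {
    # Round robin is the default when no method directive is present.
    # least_conn;                   # fewest active connections (the choice in this ADR)
    # ip_hash;                      # pin clients by address (first three IPv4 octets)
    hash $remote_addr consistent;   # pin by full client address with consistent hashing

    server 127.0.0.1:3000;
    server 127.0.0.1:3001;
    server 127.0.0.1:3002;
}
```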
Ecosystem:
- ✅ Battle-tested (powers 30%+ of top websites)
- ✅ Extensive documentation
- ✅ Large community
- ✅ Easy to configure
Cost:
- ✅ Free and open source
- ✅ Enterprise version available (NGINX Plus)
Alternatives Considered
Alternative 1: HAProxy
Pros:
- Excellent load balancing features
- Advanced health checks
- Good for TCP/HTTP
Cons:
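- ❌ Not a web server: no static file serving, and only a small in-memory object cache (since 1.8)
- ❌ More complex configuration
- ❌ Native SSL/TLS termination only since version 1.5 (earlier releases needed a separate terminator)
Rejected: NGINX provides more of the required features (static assets, caching) out of the box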
Alternative 2: Traefik
Pros:
- Modern cloud-native LB
- Automatic service discovery
- Let's Encrypt integration
Cons:
- ❌ Higher resource usage
- ❌ Younger, less mature
- ❌ More complex for simple use case
Rejected: Overkill for current needs
Alternative 3: Cloud Load Balancer (GCP/AWS)
Pros:
- Fully managed
- Auto-scaling
- Global CDN integration
Cons:
- ❌ Vendor lock-in
- ❌ Higher cost
- ❌ Less control
Deferred: Consider for production scaling
Consequences
Positive
- ✅ Scalability: Handle 1000+ concurrent users
- ✅ Reliability: Automatic failover on instance failure
- ✅ Security: SSL/TLS termination, security headers
- ✅ Performance: Static caching, compression, HTTP/2
- ✅ WebSocket: Native support for real-time connections
- ✅ Monitoring: Built-in metrics, health checks
- ✅ Cost: Free and open source
Negative
- ❌ Complexity: Additional infrastructure component
- ❌ Maintenance: Need to manage NGINX config
- ❌ Single Point of Failure: Need HA setup for production
- ❌ SSL Management: Need to handle certificate renewal
Mitigation
Complexity:
- Use Docker Compose for easy deployment
- Provide example configurations
- Document common patterns
Maintenance:
- Automate config updates with Ansible/Terraform
- Version control NGINX configs
- Use config validation before reload
HA Setup (Production):
- Deploy multiple NGINX instances
- Use Keepalived for VIP failover
- Or use cloud LB in front of NGINX
SSL Management:
- Automate with Certbot/ACME
- Use cert-manager in Kubernetes
- Monitor expiration with alerts
Implementation Plan
Phase 1: Development Setup
- Create NGINX Docker container
- Configure basic load balancing
- Test with 3 Theia instances
- Add WebSocket proxying
- Test health checks
Phase 2: Security Hardening
- Enable SSL/TLS with self-signed certs (dev)
- Add security headers
- Implement rate limiting
- Add request logging
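For request logging, it helps to record which instance served each request and how long it took, since that also feeds the latency and failover checks. A minimal sketch, assuming the default log path; the format name `upstream_timing` is hypothetical.

```nginx
# http context
log_format upstream_timing '$remote_addr [$time_local] "$request" $status '
                           'upstream=$upstream_addr '
                           'request_time=$request_time '
                           'upstream_connect_time=$upstream_connect_time '
                           'upstream_response_time=$upstream_response_time';

# server context (inside the ide.az1.ai server block)
access_log /var/log/nginx/access.log upstream_timing;
```

The difference between `$request_time` and `$upstream_response_time` gives a rough view of the overhead added by the proxy layer.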
Phase 3: Production Deployment
- Obtain Let's Encrypt certificate
- Enable HTTP/2 and HTTP/3
- Configure caching
- Set up monitoring (Prometheus, Grafana)
- Load testing (10K concurrent users)
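For the monitoring step, open-source NGINX exposes basic connection counters through the stub_status module, which a Prometheus exporter can scrape. A sketch, assuming the module is compiled in (check `nginx -V` for `--with-http_stub_status_module`); the port 8080 and the path `/nginx_status` are illustrative choices.

```nginx
server {
    listen 127.0.0.1:8080;       # local-only listener (port is an assumption)

    location = /nginx_status {
        stub_status;             # active connections, accepts, handled, requests
        access_log off;
        allow 127.0.0.1;         # restrict to local scrapers
        deny all;
    }
}
```

In a container deployment the exporter would likely run as a separate container, so the listen address and the `allow` rule would need to cover the internal network instead of loopback.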
Phase 4: High Availability
- Deploy NGINX in HA mode (Keepalived)
- Add health monitoring
- Implement automatic scaling
- Disaster recovery procedures
Success Metrics
Performance:
- Handle 1000+ concurrent users
- < 50ms latency overhead
- Static assets served from cache (>90% hit rate)
Reliability:
- 99.9% uptime
- < 5s failover time
- Zero downtime deployments
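The failover target depends on how quickly NGINX gives up on a failed instance. With passive checks, a request sent to an unreachable instance can wait for the full connect timeout (75s in the main location) before another instance is tried. A sketch of tighter settings follows; the values are illustrative and would need tuning under load.

```nginx
location / {
    proxy_pass http://theia_backend;

    proxy_connect_timeout 2s;          # give up quickly on an unreachable instance

    # Retry on another instance after connection errors, timeouts, or gateway errors
    proxy_next_upstream error timeout http_502 http_503 http_504;
    proxy_next_upstream_tries 2;       # the original attempt plus one retry
    proxy_next_upstream_timeout 5s;    # total time budget across attempts
}
```

By default NGINX does not retry non-idempotent requests such as POST once they have been sent to an upstream server, so those requests can still fail during a failover.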
Security:
- A+ SSL Labs rating
- All security headers present
- Rate limiting blunts request floods from individual clients (it does not by itself stop distributed attacks)
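Per-client request limiting can be combined with a cap on concurrent connections, and both can answer with 429 instead of the default 503 so that throttled clients are distinguishable from backend failures in logs. A sketch under assumed values; the zone name `per_ip_conn` and the limit of 20 connections are hypothetical.

```nginx
# http context
limit_req_zone  $binary_remote_addr zone=api_limit:10m rate=10r/s;
limit_conn_zone $binary_remote_addr zone=per_ip_conn:10m;

server {
    limit_req_status  429;     # Too Many Requests instead of the default 503
    limit_conn_status 429;

    location / {
        limit_req  zone=api_limit burst=20 nodelay;   # short bursts pass, sustained floods are rejected
        limit_conn per_ip_conn 20;                    # concurrent connections per client address
        proxy_pass http://theia_backend;
    }
}
```

Users behind a shared NAT address count as one client here, so the limits should be checked against real usage before enforcing them.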
Related Decisions
- ADR-014: Eclipse Theia - Application server
- ADR-017: WebSocket Backend - Backend architecture
- ADR-020: GCP Deployment - Cloud infrastructure
Status: ✅ Accepted
Next Review: 2025-11-06 (1 month)
Last Updated: 2025-10-06