ADR-016: Use NGINX as Load Balancer for Frontend

Status: Accepted
Date: 2025-10-06
Deciders: Development Team, DevOps Team
Related: ADR-014 (theia), ADR-017 (WebSocket)


Context

The AZ1.AI LLM IDE needs a robust frontend load-balancing solution to:

  • Handle high traffic volumes
  • Distribute load across multiple theia instances
  • Terminate SSL/TLS connections
  • Serve static assets efficiently
  • Proxy WebSocket connections to backend
  • Provide health checks and failover

Current State

  • Single theia instance on port 3000
  • No load balancing
  • No SSL termination
  • Direct client β†’ server connection

Requirements

  1. Scalability: Handle 1000+ concurrent users
  2. Reliability: Automatic failover if instance fails
  3. Security: SSL/TLS termination, HTTP/2 support
  4. Performance: Static asset caching, compression
  5. WebSocket: Proxy WebSocket connections for real-time communication
  6. Monitoring: Health checks, metrics export

Decision

We will use NGINX as the frontend load balancer with the following configuration:

Architecture

                 Internet
                     β”‚
                     β–Ό
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚    NGINX     β”‚
              β”‚ Load Balancerβ”‚
              β”‚  (Port 443)  β”‚
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚
     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
     β”‚               β”‚               β”‚
     β–Ό               β–Ό               β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  theia   β”‚    β”‚  theia   β”‚    β”‚  theia   β”‚
β”‚Instance 1β”‚    β”‚Instance 2β”‚    β”‚Instance 3β”‚
β”‚  :3000   β”‚    β”‚  :3001   β”‚    β”‚  :3002   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

NGINX Configuration

Load Balancing Strategy: Least Connections (a good fit for IDE workloads, where sessions hold connections open for very different lengths of time)
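As a rough illustration (not NGINX's implementation), least-connections routing simply picks the backend with the fewest in-flight connections; the addresses mirror the upstream block in the configuration below.

```python
# Illustrative sketch of least-connections selection, as performed by
# NGINX's least_conn directive. Connection counts here are made up.
def pick_backend(active_connections):
    """Return the backend address with the fewest active connections."""
    return min(active_connections, key=active_connections.get)

conns = {"127.0.0.1:3000": 12, "127.0.0.1:3001": 4, "127.0.0.1:3002": 9}
assert pick_backend(conns) == "127.0.0.1:3001"
```

With round-robin, a backend stuck serving several long-lived IDE sessions would keep receiving new ones; least_conn avoids that.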

Features Enabled:

  • SSL/TLS termination
  • HTTP/2 and HTTP/3 (QUIC)
  • Gzip compression
  • Static asset caching
  • WebSocket proxying
  • Health checks
  • Rate limiting

Implementation​

1. NGINX Configuration File

# /etc/nginx/nginx.conf (http context)

# Shared-memory zone for rate limiting; limit_req_zone must be declared
# at http level, not inside a server block
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

upstream theia_backend {
    least_conn;  # Route each request to the least-busy instance

    # theia instances
    server 127.0.0.1:3000 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:3001 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:3002 max_fails=3 fail_timeout=30s;

    # Keep idle connections to the upstream open for reuse
    keepalive 32;
}

server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2;
    server_name ide.az1.ai;

    # SSL Configuration
    ssl_certificate /etc/letsencrypt/live/ide.az1.ai/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ide.az1.ai/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;

    # Security Headers
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;

    # Rate Limiting (zone declared above at http level)
    limit_req zone=api_limit burst=20 nodelay;

    # Gzip Compression
    gzip on;
    gzip_vary on;
    gzip_min_length 1024;
    gzip_types text/plain text/css text/xml text/javascript
               application/json application/javascript application/xml+rss;

    # Static Assets (proxy_cache_valid takes effect only once a
    # proxy_cache zone is configured; expires/Cache-Control apply regardless)
    location ~* \.(js|css|png|jpg|jpeg|gif|ico|svg|woff|woff2|ttf|eot)$ {
        proxy_pass http://theia_backend;
        proxy_cache_valid 200 1d;
        expires 1d;
        add_header Cache-Control "public, immutable";
    }

    # WebSocket Support
    location /services {
        proxy_pass http://theia_backend;

        # WebSocket headers
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # Standard proxy headers
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Timeouts for long-lived connections
        proxy_read_timeout 3600s;
        proxy_send_timeout 3600s;
    }

    # Main Application
    location / {
        proxy_pass http://theia_backend;

        # HTTP/1.1 with an empty Connection header is required for the
        # upstream keepalive pool to be used
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Timeouts
        proxy_read_timeout 300s;
        proxy_connect_timeout 75s;
    }

    # Health Check Endpoint
    location /health {
        access_log off;
        add_header Content-Type text/plain;
        return 200 "healthy\n";
    }
}

# Redirect HTTP to HTTPS
server {
    listen 80;
    listen [::]:80;
    server_name ide.az1.ai;
    return 301 https://$server_name$request_uri;
}

2. Docker Compose (Development)

# Note: inside this compose network the NGINX upstream should reference
# the service names theia1:3000, theia2:3000 and theia3:3000, not 127.0.0.1.
version: '3.8'

services:
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./ssl:/etc/letsencrypt:ro
    depends_on:
      - theia1
      - theia2
      - theia3
    networks:
      - theia-network

  theia1:
    build: .
    ports:
      - "3000:3000"
    networks:
      - theia-network
    environment:
      - NODE_ENV=production

  theia2:
    build: .
    ports:
      - "3001:3000"
    networks:
      - theia-network
    environment:
      - NODE_ENV=production

  theia3:
    build: .
    ports:
      - "3002:3000"
    networks:
      - theia-network
    environment:
      - NODE_ENV=production

networks:
  theia-network:
    driver: bridge

3. Health Check Script

#!/bin/bash
# /usr/local/bin/theia-health-check.sh

# Exit non-zero if the local theia instance stops responding
curl -fsS http://localhost:3000/health > /dev/null || exit 1
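The failover behaviour configured by max_fails=3 fail_timeout=30s in the upstream block can be modelled as follows; this is an illustrative sketch of the passive health-check semantics, not NGINX code:

```python
# Model of NGINX passive health checks: after max_fails consecutive
# failures, an instance leaves the rotation for fail_timeout seconds.
import time

class BackendState:
    def __init__(self, addr, max_fails=3, fail_timeout=30.0):
        self.addr = addr
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.fails = 0
        self.down_until = 0.0

    def record(self, ok, now=None):
        """Record the outcome of one proxied request or probe."""
        now = time.monotonic() if now is None else now
        if ok:
            self.fails = 0
        else:
            self.fails += 1
            if self.fails >= self.max_fails:
                self.down_until = now + self.fail_timeout

    def available(self, now=None):
        now = time.monotonic() if now is None else now
        return now >= self.down_until

b = BackendState("127.0.0.1:3000")
for _ in range(3):
    b.record(ok=False, now=100.0)
assert not b.available(now=110.0)  # still inside the fail_timeout window
assert b.available(now=130.0)      # back in rotation after 30s
```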

Rationale

Why NGINX?

Performance:

  • βœ… Event-driven architecture (handles 10K+ concurrent connections)
  • βœ… Low memory footprint (~10MB per worker)
  • βœ… Efficient static file serving
  • βœ… Built-in caching

Features:

  • βœ… Native WebSocket support
  • βœ… HTTP/2 and HTTP/3 (QUIC)
  • βœ… SSL/TLS termination
  • βœ… Load balancing algorithms (least_conn, ip_hash, round_robin)
  • βœ… Health checks and failover

Ecosystem:

  • βœ… Battle-tested (powers 30%+ of top websites)
  • βœ… Extensive documentation
  • βœ… Large community
  • βœ… Easy to configure

Cost:

  • βœ… Free and open source
  • βœ… Enterprise version available (NGINX Plus)

Alternatives Considered

Alternative 1: HAProxy

Pros:

  • Excellent load balancing features
  • Advanced health checks
  • Good for TCP/HTTP

Cons:

  • ❌ No built-in caching
  • ❌ More complex configuration
  • ❌ Requires separate SSL terminator

Rejected: NGINX provides more features out-of-the-box

Alternative 2: Traefik

Pros:

  • Modern cloud-native LB
  • Automatic service discovery
  • Let's Encrypt integration

Cons:

  • ❌ Higher resource usage
  • ❌ Younger, less mature
  • ❌ More complex for simple use case

Rejected: Overkill for current needs

Alternative 3: Cloud Load Balancer (GCP/AWS)

Pros:

  • Fully managed
  • Auto-scaling
  • Global CDN integration

Cons:

  • ❌ Vendor lock-in
  • ❌ Higher cost
  • ❌ Less control

Deferred: Consider for production scaling


Consequences

Positive

  • βœ… Scalability: Handle 1000+ concurrent users
  • βœ… Reliability: Automatic failover on instance failure
  • βœ… Security: SSL/TLS termination, security headers
  • βœ… Performance: Static caching, compression, HTTP/2
  • βœ… WebSocket: Native support for real-time connections
  • βœ… Monitoring: Built-in metrics, health checks
  • βœ… Cost: Free and open source

Negative

  • ❌ Complexity: Additional infrastructure component
  • ❌ Maintenance: Need to manage NGINX config
  • ❌ Single Point of Failure: Need HA setup for production
  • ❌ SSL Management: Need to handle certificate renewal

Mitigation

Complexity:

  • Use Docker Compose for easy deployment
  • Provide example configurations
  • Document common patterns

Maintenance:

  • Automate config updates with Ansible/Terraform
  • Version control NGINX configs
  • Use config validation before reload

HA Setup (Production):

  • Deploy multiple NGINX instances
  • Use Keepalived for VIP failover
  • Or use cloud LB in front of NGINX
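A minimal Keepalived sketch for the VIP-failover option above; the interface name, router ID, and virtual IP are placeholders to adapt, and the backup node would use state BACKUP with a lower priority:

```
# /etc/keepalived/keepalived.conf (primary node, illustrative values)
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        203.0.113.10
    }
}
```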

SSL Management:

  • Automate with Certbot/ACME
  • Use cert-manager in Kubernetes
  • Monitor expiration with alerts
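The "monitor expiration" item can be sketched with the standard library: given a certificate's notAfter field (as returned by ssl.getpeercert()), compute the days remaining and alert below a threshold. The dates here are illustrative:

```python
# Days until a TLS certificate expires, from its notAfter string
# (format used by ssl.getpeercert(), e.g. "Feb 18 10:28:20 2030 GMT").
import ssl
import time

def days_until_expiry(not_after, now=None):
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400

# Pin "now" so the example is deterministic
now = ssl.cert_time_to_seconds("Jan 01 00:00:00 2030 GMT")
days = days_until_expiry("Jan 31 00:00:00 2030 GMT", now=now)
assert round(days) == 30  # e.g. alert when fewer than 14 remain
```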

Implementation Plan

Phase 1: Development Setup

  • Create NGINX Docker container
  • Configure basic load balancing
  • Test with 3 theia instances
  • Add WebSocket proxying
  • Test health checks

Phase 2: Security Hardening

  • Enable SSL/TLS with self-signed certs (dev)
  • Add security headers
  • Implement rate limiting
  • Add request logging

Phase 3: Production Deployment

  • Obtain Let's Encrypt certificate
  • Enable HTTP/2 and HTTP/3
  • Configure caching
  • Set up monitoring (Prometheus, Grafana)
  • Load testing (10K concurrent users)
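For the monitoring item, NGINX's bundled stub_status module exposes basic connection and request counters that a separate exporter (such as nginx-prometheus-exporter) can scrape for Prometheus; the location name and allow-list below are illustrative:

```
# Inside the HTTPS server block; requires ngx_http_stub_status_module
location /nginx_status {
    stub_status;
    allow 127.0.0.1;   # restrict to the scraper's address
    deny all;
}
```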

Phase 4: High Availability

  • Deploy NGINX in HA mode (Keepalived)
  • Add health monitoring
  • Implement automatic scaling
  • Disaster recovery procedures

Success Metrics

Performance:

  • Handle 1000+ concurrent users
  • < 50ms latency overhead
  • Static assets served from cache (>90% hit rate)
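One hedged way to verify the cache-hit target: if the access log format is extended to include $upstream_cache_status, the hit rate can be computed from the log. The log lines below are fabricated for illustration:

```python
# Compute cache hit rate from access-log lines that record
# $upstream_cache_status (HIT/MISS/EXPIRED/BYPASS). Sample data is made up.
import re

sample_log = """\
GET /static/app.js HIT
GET /static/app.css HIT
GET /static/logo.png MISS
GET /static/app.js HIT
"""

statuses = re.findall(r"\b(HIT|MISS|EXPIRED|BYPASS)\b", sample_log)
hit_rate = statuses.count("HIT") / len(statuses)
assert abs(hit_rate - 0.75) < 1e-9  # target in production is > 0.90
```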

Reliability:

  • 99.9% uptime
  • < 5s failover time
  • Zero downtime deployments

Security:

  • A+ SSL Labs rating
  • All security headers present
  • Rate limiting prevents DoS



Status: βœ… Accepted
Next Review: 2025-11-06 (1 month)
Last Updated: 2025-10-06