Blue/Green Deployment Strategy
Type: Deployment Architecture
Purpose: Zero-downtime deployments with instant rollback capability
Risk Level: Low (validated in staging before production)
Last Updated: November 23, 2025
Overview
Blue/Green deployment is the primary deployment strategy for CODITECT cloud infrastructure, providing zero-downtime updates with instant rollback capability. Two identical production environments (Blue and Green) run in parallel, with traffic switched between them during deployments.
Key Benefits:
- Zero Downtime: Traffic switches instantly between environments
- Fast Rollback: Revert to previous version in <5 minutes
- Pre-validation: Test new version with real production data before cutover
- Risk Mitigation: Canary testing catches 95% of deployment issues
Trade-offs:
- Cost: Requires 2x infrastructure during deployment (temporary)
- Complexity: Need orchestration for traffic routing and health checks
- Database Migrations: Require backward-compatible schema changes
Architecture Diagram
Deployment Phases
Phase 0: Pre-Deployment Validation
Timeline: 1 hour before deployment
Owner: DevOps Engineer
Checklist:
- Green environment deployed (0% traffic)
- All green pods passing readiness probes
- Database migrations applied (backward-compatible)
- Smoke tests passed in green environment
- Monitoring dashboards configured
- Rollback plan documented
- On-call engineer notified
Validation:
```bash
# Verify green deployment
kubectl get pods -n coditect-prod-green -l app=license-api
# All pods should show STATUS: Running, READY: 1/1

# Check health endpoint
curl -H "Host: api.coditect.dev" http://green-service-ip:8000/health
# Should return: {"status": "healthy", "version": "1.3.0"}

# Run smoke tests
python tests/smoke_test.py --env=green
# All critical paths should pass
```
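The smoke-test script is referenced above but not shown. A minimal sketch of what `tests/smoke_test.py` might check for the green environment, assuming the `/health` endpoint returns the JSON shown above (the `evaluate_health` and `run_smoke_test` names are illustrative, not the real script):

```python
"""Smoke-test sketch -- the real tests/smoke_test.py may differ."""
import json
import urllib.request

EXPECTED_VERSION = "1.3.0"  # version being rolled out to green

def evaluate_health(payload: dict, expected_version: str) -> bool:
    """True when a /health response looks like a healthy pod on the new version."""
    return payload.get("status") == "healthy" and payload.get("version") == expected_version

def run_smoke_test(base_url: str) -> bool:
    """Fetch /health from the green service and evaluate the response."""
    req = urllib.request.Request(f"{base_url}/health", headers={"Host": "api.coditect.dev"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return evaluate_health(json.load(resp), EXPECTED_VERSION)
```

Keeping the pass/fail logic in a pure function (`evaluate_health`) makes the gate testable without a live cluster.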
Phase 1: Canary Testing (10% Traffic)
Timeline: 30 minutes
Traffic Split: Blue 90%, Green 10%
Purpose: Detect critical errors with minimal user impact
Actions:
```bash
# Update ingress to route 10% traffic to green
kubectl patch ingress license-api -n coditect-prod --type=json \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "10"}]'
```
Monitoring Metrics:
- Error Rate: <0.5% (same as blue baseline)
- Latency p95: <100ms (within 10% of blue)
- Throughput: 10-15 req/sec (proportional to traffic split)
- Database Queries: No slow queries (>1s)
- Redis Operations: No connection errors
Success Criteria:
- ✅ Error rate <0.5% for 30 minutes
- ✅ No P0/P1 alerts triggered
- ✅ Latency within acceptable range
- ✅ No user complaints in support channels
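The machine-checkable criteria above can be encoded as a simple promotion gate. A sketch with thresholds taken from the bullets (the `canary_passes` name is illustrative; the real deployment tooling may differ):

```python
def canary_passes(error_rate_pct: float, p95_ms: float,
                  baseline_p95_ms: float, p0_p1_alerts: int) -> bool:
    """Promotion gate mirroring the canary success criteria above."""
    if error_rate_pct >= 0.5:            # error rate must stay below 0.5%
        return False
    if p0_p1_alerts > 0:                 # any P0/P1 alert fails the canary
        return False
    if p95_ms > baseline_p95_ms * 1.10:  # latency must stay within 10% of blue
        return False
    return True
```

The "no user complaints" criterion stays a human judgment and is deliberately not encoded.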
Failure Response:
```bash
# Immediate rollback to 100% blue
kubectl patch ingress license-api -n coditect-prod --type=json \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "0"}]'

# Investigate logs
kubectl logs -n coditect-prod-green -l app=license-api --tail=100
```
Phase 2: Gradual Rollout (50% Traffic)
Timeline: 30 minutes
Traffic Split: Blue 50%, Green 50%
Purpose: Validate performance under higher load
Actions:
```bash
# Increase traffic to 50%
kubectl patch ingress license-api -n coditect-prod --type=json \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "50"}]'
```
Monitoring Metrics:
- Throughput: Green should handle ~50% of total traffic (50-75 req/sec)
- Pod CPU Usage: <70% average across green pods
- Pod Memory Usage: <80% of allocated limits
- Database Connection Pool: <50 active connections
- Redis Hit Rate: >90% (cache warming complete)
Success Criteria:
- ✅ Green environment stable for 30 minutes
- ✅ No performance degradation vs. blue
- ✅ Autoscaling triggers appropriately (if load increases)
- ✅ No database deadlocks or slow queries
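The Phase 2 resource thresholds above can also be collapsed into a single stability check, a sketch (the `green_stable` name is illustrative):

```python
def green_stable(cpu_pct: float, mem_pct: float,
                 db_connections: int, redis_hit_rate_pct: float) -> bool:
    """Phase 2 thresholds: CPU <70%, memory <80%, <50 DB connections, Redis hit rate >90%."""
    return (cpu_pct < 70
            and mem_pct < 80
            and db_connections < 50
            and redis_hit_rate_pct > 90)
```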
Phase 3: Full Cutover (100% Traffic)
Timeline: Immediate (if Phase 2 passes)
Traffic Split: Blue 0%, Green 100%
Purpose: Complete migration to new version
Actions:
```bash
# Route 100% traffic to green
kubectl patch ingress license-api -n coditect-prod --type=json \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "100"}]'

# Verify traffic routing
kubectl describe ingress license-api -n coditect-prod | grep "canary-weight"
```
Monitoring Focus:
- Error Rate: Must remain <0.5%
- Latency: p95 <100ms, p99 <500ms
- Throughput: 100% of production traffic (100-150 req/sec)
- Active Sessions: Redis session count should stabilize
Post-Cutover Actions:
- Monitor for 1 hour before decommissioning blue
- Keep blue pods running (but scaled down to 1 replica for cost savings)
- Update production deployment markers in Git
- Notify team in #engineering Slack channel
Phase 4: Blue Decommission (24-hour window)
Timeline: 24 hours after successful cutover
Purpose: Clean up old environment and free resources
Actions:
```bash
# Scale down blue pods (keep namespace for rollback)
kubectl scale deployment license-api -n coditect-prod-blue --replicas=0

# Optional: Delete blue namespace after 7 days
# kubectl delete namespace coditect-prod-blue
```
Rollback Window:
- 0-1 hour: Instant rollback via traffic routing
- 1-24 hours: Fast rollback (scale up blue pods, ~2 minutes)
- 24+ hours: Manual rollback (redeploy blue, ~15 minutes)
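The rollback windows above map cleanly to a decision helper; a sketch (the function and return labels are illustrative, not existing tooling):

```python
def rollback_method(hours_since_cutover: float) -> str:
    """Map time elapsed since cutover to the fastest available rollback path."""
    if hours_since_cutover < 1:
        return "traffic-routing"  # instant: set canary-weight back to 0
    if hours_since_cutover < 24:
        return "scale-up-blue"    # ~2 minutes: blue pods still exist, just scaled down
    return "redeploy-blue"        # ~15 minutes: blue was decommissioned, full redeploy
```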
Rollback Procedures
Scenario 1: Error Rate Spike (Critical)
Detection: PagerDuty alert "Error rate >2% for 5 minutes"
Action: Immediate rollback to blue
```bash
# Step 1: Route 100% traffic to blue (instant)
kubectl patch ingress license-api -n coditect-prod --type=json \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "0"}]'

# Step 2: Verify traffic routing
watch -n 1 'kubectl get pods -n coditect-prod-blue -o wide'
# Should show all pods receiving connections

# Step 3: Scale down green (optional, to reduce costs)
kubectl scale deployment license-api -n coditect-prod-green --replicas=1

# Step 4: Investigate green environment logs
kubectl logs -n coditect-prod-green -l app=license-api --since=10m > /tmp/green-rollback-$(date +%Y%m%d-%H%M%S).log
```
Recovery Time: <5 minutes
Data Loss: None (shared database)
Scenario 2: Latency Degradation (High Priority)
Detection: Grafana alert "p95 latency >200ms for 10 minutes"
Action: Gradual rollback with investigation
```bash
# Step 1: Reduce green traffic to 10% (investigate under low load)
kubectl patch ingress license-api -n coditect-prod --type=json \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "10"}]'

# Step 2: Check green pod resource usage
kubectl top pods -n coditect-prod-green

# Step 3: Analyze slow queries (if database-related)
psql -U postgres -h cloud-sql-instance -c "SELECT * FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"

# Step 4: If not resolved in 15 minutes, full rollback
kubectl patch ingress license-api -n coditect-prod --type=json \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "0"}]'
```
Scenario 3: Database Migration Failure (Critical)
Detection: Alembic migration fails during deployment
Action: Abort deployment, rollback database (if possible)
```bash
# Step 1: Do NOT cutover to green (stay 100% blue)
# Green pods will fail health checks if the database schema is incompatible

# Step 2: Rollback database migration (if safe)
kubectl exec -it <blue-pod> -n coditect-prod-blue -- alembic downgrade -1

# Step 3: Delete green deployment
kubectl delete deployment license-api -n coditect-prod-green

# Step 4: Fix migration script, redeploy
# Ensure migrations are backward-compatible in future
```
Prevention:
- Always test migrations in staging first
- Use backward-compatible schema changes (expand/contract pattern)
- Never drop columns or tables during deployment (use deprecation cycle)
Key Metrics Dashboard
Grafana Dashboard: Blue vs Green Comparison
Panels:
1. Request Rate (req/sec)
   - Blue: 90 req/sec → 0 req/sec (after cutover)
   - Green: 0 req/sec → 100 req/sec (after cutover)
2. Error Rate (%)
   - Threshold: <0.5% (yellow), <2% (red)
   - Alert if green error rate >2x blue error rate
3. Latency (p50, p95, p99)
   - Blue: p95 = 85ms
   - Green: p95 should be within 10% (76.5-93.5ms)
4. Pod CPU/Memory Usage
   - Blue: 50% CPU avg, 60% memory
   - Green: should be similar under the same traffic load
5. Database Connection Pool
   - Blue: 30 active connections
   - Green: should not exceed 50 connections (limit 100)
6. Redis Operations
   - Blue: 500 ops/sec
   - Green: should scale proportionally with traffic
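The comparative rules in the panels above (green error rate vs. 2x blue baseline, p95 within 10% of blue) can be expressed directly; a sketch with illustrative function names, not existing dashboard code:

```python
def green_error_alert(green_error_rate: float, blue_error_rate: float) -> bool:
    """Fire when green's error rate exceeds 2x the blue baseline."""
    return green_error_rate > 2 * blue_error_rate

def p95_within_band(green_p95_ms: float, blue_p95_ms: float, band: float = 0.10) -> bool:
    """Green p95 should stay within ±10% of blue (e.g. 85ms → 76.5-93.5ms)."""
    return abs(green_p95_ms - blue_p95_ms) <= band * blue_p95_ms
```

Comparing green against blue as a live baseline, rather than a fixed threshold, keeps the alert meaningful when overall traffic shifts.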
Cost Analysis
Infrastructure Costs During Deployment
| Phase | Blue Pods | Green Pods | Total Cost | Duration |
|---|---|---|---|---|
| Pre-Deployment | 3 | 3 | 2x normal | 1 hour |
| Canary (10%) | 3 | 3 | 2x normal | 30 min |
| Gradual (50%) | 3 | 3 | 2x normal | 30 min |
| Full Cutover | 3 | 3 | 2x normal | 1 hour |
| Post-Cutover | 1 | 3 | 1.33x normal | 23 hours |
| Total | - | - | ~1.5x normal | 24 hours |
Additional Costs:
- Load Balancer: No extra cost (same frontend)
- Cloud SQL: No extra cost (shared database)
- Redis: No extra cost (shared cache)
- Networking: Minimal egress cost increase (~$0.10/deployment)
Total Deployment Cost: ~$5/deployment (assuming $300/month infrastructure)
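The ~$5 figure can be reproduced from the phase table; a sketch under the stated assumptions ($300/month infrastructure, ~730 hours/month, and the egress figure above):

```python
MONTHLY_COST = 300.0
HOURLY_COST = MONTHLY_COST / 730  # ~$0.41/hour for the normal (1x) footprint

def deployment_cost() -> float:
    """Extra spend for one blue/green deployment, per the phase table above."""
    extra = 0.0
    extra += 3 * 1.00 * HOURLY_COST   # 3 hours at 2x normal (1x extra): pre-deploy through cutover
    extra += 23 * 0.33 * HOURLY_COST  # 23 hours at ~1.33x normal (0.33x extra): post-cutover
    extra += 0.10                     # networking egress increase per deployment
    return extra                      # ≈ $4.45, i.e. roughly $5 per deployment
```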
Automation with GitHub Actions
CI/CD Pipeline Integration
File: .github/workflows/deploy-production.yml
```yaml
name: Deploy to Production (Blue/Green)

on:
  push:
    tags:
      - 'v*.*.*'

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Authenticate to GCP
        uses: google-github-actions/auth@v1
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - name: Set up Cloud SDK
        uses: google-github-actions/setup-gcloud@v1

      - name: Deploy to Green environment
        run: |
          kubectl apply -f k8s/production/green/
          kubectl rollout status deployment/license-api -n coditect-prod-green

      - name: Run smoke tests on Green
        run: |
          python tests/smoke_test.py --env=green

      - name: Canary deployment (10% traffic)
        run: |
          kubectl patch ingress license-api -n coditect-prod --type=json \
            -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "10"}]'
          sleep 1800  # Wait 30 minutes

      - name: Check canary metrics
        run: |
          python scripts/check_deployment_health.py --env=green --threshold=0.5

      - name: Gradual rollout (50% traffic)
        if: success()
        run: |
          kubectl patch ingress license-api -n coditect-prod --type=json \
            -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "50"}]'
          sleep 1800  # Wait 30 minutes

      - name: Full cutover (100% traffic)
        if: success()
        run: |
          kubectl patch ingress license-api -n coditect-prod --type=json \
            -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "100"}]'

      - name: Rollback on failure
        if: failure()
        run: |
          kubectl patch ingress license-api -n coditect-prod --type=json \
            -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "0"}]'
          kubectl scale deployment license-api -n coditect-prod-green --replicas=0
```
Best Practices
Database Migration Strategy
Expand/Contract Pattern:
Step 1 (Blue version): Add new column, keep old column

```sql
ALTER TABLE licenses ADD COLUMN max_seats_v2 INTEGER;
-- Application reads from max_seats, writes to both max_seats and max_seats_v2
```

Step 2 (Green deployment): Read from new column, write to both

```python
# Green application code
max_seats = license.max_seats_v2 or license.max_seats  # Fall back to the old column
license.max_seats = value
license.max_seats_v2 = value  # Dual writes keep both columns in sync
```

Step 3 (After cutover): Backfill old rows, drop old column

```sql
UPDATE licenses SET max_seats_v2 = max_seats WHERE max_seats_v2 IS NULL;
ALTER TABLE licenses DROP COLUMN max_seats;
```
Traffic Split Considerations
Session Affinity:
- Use `sessionAffinity: ClientIP` in the Kubernetes Service to avoid mid-session environment switches
- Alternative: use consistent hashing based on `license_id` for deterministic routing
WebSocket Connections:
- For long-lived WebSocket connections, use separate deployment strategy (canary with sticky sessions)
- Or drain connections gracefully before cutover (send disconnect notice, allow reconnect)
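One way to implement the deterministic routing mentioned above, assuming `license_id` is available at the routing layer (the `route_environment` helper is illustrative, not existing code):

```python
import hashlib

def route_environment(license_id: str, green_weight: int) -> str:
    """Deterministically route a license to blue or green.

    Hashing license_id into a bucket in [0, 100) keeps every request for the
    same license in the same environment at a given canary weight, unlike
    random per-request weighting.
    """
    digest = hashlib.sha256(license_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "green" if bucket < green_weight else "blue"
```

Raising `green_weight` from 10 to 50 to 100 moves whole licenses over in stable cohorts, so a given customer never bounces between versions mid-session.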
Troubleshooting
Issue: Green pods crash-looping
Symptoms: `CrashLoopBackOff` status in `kubectl get pods -n coditect-prod-green`
Diagnosis:
```bash
kubectl logs -n coditect-prod-green <pod-name> --previous
kubectl describe pod -n coditect-prod-green <pod-name>
```
Common Causes:
- Database migration failure (check Alembic logs)
- Missing environment variable (check Secret/ConfigMap)
- Dependency service unavailable (Redis, Cloud KMS)
Resolution: Fix issue, redeploy green
Issue: Inconsistent traffic routing
Symptoms: Some users see old version after cutover
Diagnosis:
```bash
# Check ingress annotation
kubectl get ingress license-api -n coditect-prod -o yaml | grep canary

# Check service endpoints
kubectl get endpoints -n coditect-prod-blue
kubectl get endpoints -n coditect-prod-green
```
Common Causes:
- Browser caching (force reload with Ctrl+Shift+R)
- CDN caching (purge CloudFlare/Fastly cache)
- DNS propagation delay (wait 5 minutes)
Resolution: Clear caches, wait for propagation
Related Documents
- Rolling Update Strategy - Alternative deployment for minor updates
- Disaster Recovery Runbook - Recovery procedures for failures
- C2: Container Diagram - Infrastructure overview
Document Classification: Internal - DevOps Documentation
Review Cycle: Every production deployment
Next Review Date: After next major release
Last Updated: November 23, 2025
Owner: Platform Engineering Team
Status: Production Ready