
Blue/Green Deployment Strategy

Type: Deployment Architecture
Purpose: Zero-downtime deployments with instant rollback capability
Risk Level: Low (validated in staging before production)
Last Updated: November 23, 2025


Overview

Blue/Green deployment is the primary deployment strategy for CODITECT cloud infrastructure, providing zero-downtime updates with instant rollback capability. Two identical production environments (Blue and Green) run in parallel, with traffic switched between them during deployments.

Key Benefits:

  • Zero Downtime: Traffic switches instantly between environments
  • Fast Rollback: Revert to previous version in <5 minutes
  • Pre-validation: Test new version with real production data before cutover
  • Risk Mitigation: Canary testing catches 95% of deployment issues

Trade-offs:

  • Cost: Requires 2x infrastructure during deployment (temporary)
  • Complexity: Need orchestration for traffic routing and health checks
  • Database Migrations: Require backward-compatible schema changes

Architecture Diagram


Deployment Phases

Phase 0: Pre-Deployment Validation

Timeline: 1 hour before deployment
Owner: DevOps Engineer

Checklist:

  • Green environment deployed (0% traffic)
  • All green pods passing readiness probes
  • Database migrations applied (backward-compatible)
  • Smoke tests passed in green environment
  • Monitoring dashboards configured
  • Rollback plan documented
  • On-call engineer notified

Validation:

# Verify green deployment
kubectl get pods -n coditect-prod-green -l app=license-api
# All pods should show STATUS: Running, READY: 1/1

# Check health endpoint
curl -H "Host: api.coditect.dev" http://<green-service-ip>:8000/health
# Should return: {"status": "healthy", "version": "1.3.0"}

# Run smoke tests
python tests/smoke_test.py --env=green
# All critical paths should pass
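The contents of tests/smoke_test.py are not shown here; a minimal sketch of the kind of check it might run against the /health response documented above (the function name and expected version are illustrative assumptions, not the actual test suite):

```python
import json

EXPECTED_VERSION = "1.3.0"  # assumed version of the release under test

def check_health_payload(raw: str) -> bool:
    """Validate the /health response body shown above."""
    payload = json.loads(raw)
    return (payload.get("status") == "healthy"
            and payload.get("version") == EXPECTED_VERSION)

# Against the documented response:
assert check_health_payload('{"status": "healthy", "version": "1.3.0"}')
assert not check_health_payload('{"status": "degraded", "version": "1.3.0"}')
```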

Phase 1: Canary Testing (10% Traffic)

Timeline: 30 minutes
Traffic Split: Blue 90%, Green 10%
Purpose: Detect critical errors with minimal user impact

Actions:

# Update ingress to route 10% traffic to green
kubectl patch ingress license-api -n coditect-prod --type=json \
-p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "10"}]'

Monitoring Metrics:

  • Error Rate: <0.5% (same as blue baseline)
  • Latency p95: <100ms (within 10% of blue)
  • Throughput: 10-15 req/sec (proportional to traffic split)
  • Database Queries: No slow queries (>1s)
  • Redis Operations: No connection errors

Success Criteria:

  • ✅ Error rate <0.5% for 30 minutes
  • ✅ No P0/P1 alerts triggered
  • ✅ Latency within acceptable range
  • ✅ No user complaints in support channels
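A health-check script such as the scripts/check_deployment_health.py used later in the pipeline could apply these criteria roughly as follows; a sketch only, with function and parameter names as assumptions:

```python
def canary_healthy(error_rate_pct: float,
                   green_p95_ms: float,
                   blue_p95_ms: float,
                   p0_p1_alerts: int) -> bool:
    """Apply the Phase 1 success criteria listed above."""
    if error_rate_pct >= 0.5:              # error rate must stay below 0.5%
        return False
    if p0_p1_alerts > 0:                   # no P0/P1 alerts during the window
        return False
    if green_p95_ms > blue_p95_ms * 1.10:  # latency within 10% of blue
        return False
    return True

assert canary_healthy(0.2, 90.0, 85.0, 0)       # healthy canary
assert not canary_healthy(0.2, 120.0, 85.0, 0)  # latency regression
```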

Failure Response:

# Immediate rollback to 100% blue
kubectl patch ingress license-api -n coditect-prod --type=json \
-p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "0"}]'

# Investigate logs
kubectl logs -n coditect-prod-green -l app=license-api --tail=100

Phase 2: Gradual Rollout (50% Traffic)

Timeline: 30 minutes
Traffic Split: Blue 50%, Green 50%
Purpose: Validate performance under higher load

Actions:

# Increase traffic to 50%
kubectl patch ingress license-api -n coditect-prod --type=json \
-p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "50"}]'

Monitoring Metrics:

  • Throughput: Green should handle ~50% of total traffic (50-75 req/sec)
  • Pod CPU Usage: <70% average across green pods
  • Pod Memory Usage: <80% of allocated limits
  • Database Connection Pool: <50 active connections
  • Redis Hit Rate: >90% (cache warming complete)

Success Criteria:

  • ✅ Green environment stable for 30 minutes
  • ✅ No performance degradation vs. blue
  • ✅ Autoscaling triggers appropriately (if load increases)
  • ✅ No database deadlocks or slow queries
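The Phase 2 thresholds above can be combined into a single stability check; a minimal sketch, with the function name and inputs as illustrative assumptions rather than an existing script:

```python
def phase2_stable(cpu_pct: float, mem_pct: float,
                  db_connections: int,
                  redis_hits: int, redis_misses: int) -> bool:
    """Apply the Phase 2 monitoring thresholds listed above."""
    hit_rate = redis_hits / (redis_hits + redis_misses)
    return (cpu_pct < 70            # average pod CPU below 70%
            and mem_pct < 80        # memory below 80% of allocated limits
            and db_connections < 50 # connection pool headroom
            and hit_rate > 0.90)    # cache warming complete

assert phase2_stable(55, 60, 30, 950, 50)       # hit rate 95% -> stable
assert not phase2_stable(55, 60, 30, 850, 150)  # hit rate 85% -> not yet warm
```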

Phase 3: Full Cutover (100% Traffic)

Timeline: Immediate (if Phase 2 passes)
Traffic Split: Blue 0%, Green 100%
Purpose: Complete migration to new version

Actions:

# Route 100% traffic to green
kubectl patch ingress license-api -n coditect-prod --type=json \
-p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "100"}]'

# Verify traffic routing
kubectl describe ingress license-api -n coditect-prod | grep "canary-weight"

Monitoring Focus:

  • Error Rate: Must remain <0.5%
  • Latency: p95 <100ms, p99 <500ms
  • Throughput: 100% of production traffic (100-150 req/sec)
  • Active Sessions: Redis session count should stabilize

Post-Cutover Actions:

  1. Monitor for 1 hour before decommissioning blue
  2. Keep blue pods running (but scaled down to 1 replica for cost savings)
  3. Update production deployment markers in Git
  4. Notify team in #engineering Slack channel

Phase 4: Blue Decommission (24-hour window)

Timeline: 24 hours after successful cutover
Purpose: Clean up old environment and free resources

Actions:

# Scale down blue pods (keep namespace for rollback)
kubectl scale deployment license-api -n coditect-prod-blue --replicas=0

# Optional: Delete blue namespace after 7 days
# kubectl delete namespace coditect-prod-blue

Rollback Window:

  • 0-1 hour: Instant rollback via traffic routing
  • 1-24 hours: Fast rollback (scale up blue pods, ~2 minutes)
  • 24+ hours: Manual rollback (redeploy blue, ~15 minutes)
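The rollback windows above can be encoded as a small dispatch helper, e.g. in an operations script; a sketch, with the function name and wording assumed:

```python
def rollback_procedure(hours_since_cutover: float) -> str:
    """Map elapsed time since cutover to the rollback tier above."""
    if hours_since_cutover <= 1:
        return "instant: re-route traffic back to blue"
    if hours_since_cutover <= 24:
        return "fast: scale blue pods back up (~2 minutes)"
    return "manual: redeploy blue from scratch (~15 minutes)"

assert rollback_procedure(0.5).startswith("instant")
assert rollback_procedure(12).startswith("fast")
```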

Rollback Procedures

Scenario 1: Error Rate Spike (Critical)

Detection: PagerDuty alert "Error rate >2% for 5 minutes"
Action: Immediate rollback to blue

# Step 1: Route 100% traffic to blue (instant)
kubectl patch ingress license-api -n coditect-prod --type=json \
-p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "0"}]'

# Step 2: Verify traffic routing
watch -n 1 'kubectl get pods -n coditect-prod-blue -o wide'
# All blue pods should be Running and Ready to absorb the traffic

# Step 3: Scale down green (optional, to reduce costs)
kubectl scale deployment license-api -n coditect-prod-green --replicas=1

# Step 4: Investigate green environment logs
kubectl logs -n coditect-prod-green -l app=license-api --since=10m > /tmp/green-rollback-$(date +%Y%m%d-%H%M%S).log

Recovery Time: <5 minutes
Data Loss: None (shared database)


Scenario 2: Latency Degradation (High Priority)

Detection: Grafana alert "p95 latency >200ms for 10 minutes"
Action: Gradual rollback with investigation

# Step 1: Reduce green traffic to 10% (investigate under low load)
kubectl patch ingress license-api -n coditect-prod --type=json \
-p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "10"}]'

# Step 2: Check green pod resource usage
kubectl top pods -n coditect-prod-green

# Step 3: Analyze slow queries (if database-related)
psql -U postgres -h cloud-sql-instance -c "SELECT * FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"

# Step 4: If not resolved in 15 minutes, full rollback
kubectl patch ingress license-api -n coditect-prod --type=json \
-p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "0"}]'

Scenario 3: Database Migration Failure (Critical)

Detection: Alembic migration fails during deployment
Action: Abort deployment, roll back database (if possible)

# Step 1: Do NOT cutover to green (stay 100% blue)
# Green pods will fail health checks if database schema incompatible

# Step 2: Rollback database migration (if safe)
kubectl exec -it <blue-pod> -n coditect-prod-blue -- alembic downgrade -1

# Step 3: Delete green deployment
kubectl delete deployment license-api -n coditect-prod-green

# Step 4: Fix migration script, redeploy
# Ensure migrations are backward-compatible in future

Prevention:

  • Always test migrations in staging first
  • Use backward-compatible schema changes (expand/contract pattern)
  • Never drop columns or tables during deployment (use deprecation cycle)

Key Metrics Dashboard

Grafana Dashboard: Blue vs Green Comparison

Panels:

  1. Request Rate (req/sec)

    • Blue: 90 req/sec → 0 req/sec (after cutover)
    • Green: 0 req/sec → 100 req/sec (after cutover)
  2. Error Rate (%)

    • Threshold: <0.5% (yellow), <2% (red)
    • Alert if green error rate >2x blue error rate
  3. Latency (p50, p95, p99)

    • Blue: p95 = 85ms
    • Green: p95 should be within 10% (≈77-94ms)
  4. Pod CPU/Memory Usage

    • Blue: 50% CPU avg, 60% memory
    • Green: Should be similar under same traffic load
  5. Database Connection Pool

    • Blue: 30 active connections
    • Green: Should not exceed 50 connections (limit 100)
  6. Redis Operations

    • Blue: 500 ops/sec
    • Green: Should scale proportionally with traffic
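The "green error rate >2x blue error rate" alert rule from panel 2 could be implemented along these lines; a sketch, with the `floor` guard as an added assumption (not in the dashboard) to avoid firing on noise when blue's baseline is near zero:

```python
def green_error_alert(blue_error_rate: float, green_error_rate: float,
                      floor: float = 0.001) -> bool:
    """Fire when green's error rate exceeds 2x blue's baseline."""
    return green_error_rate > max(blue_error_rate * 2, floor)

assert green_error_alert(0.004, 0.010)       # 1.0% > 2 x 0.4% -> alert
assert not green_error_alert(0.004, 0.006)   # within 2x baseline -> quiet
```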

Cost Analysis

Infrastructure Costs During Deployment

| Phase          | Blue Pods | Green Pods | Total Cost   | Duration |
|----------------|-----------|------------|--------------|----------|
| Pre-Deployment | 3         | 3          | 2x normal    | 1 hour   |
| Canary (10%)   | 3         | 3          | 2x normal    | 30 min   |
| Gradual (50%)  | 3         | 3          | 2x normal    | 30 min   |
| Full Cutover   | 3         | 3          | 2x normal    | 1 hour   |
| Post-Cutover   | 1         | 3          | 1.33x normal | 23 hours |
| Total          | -         | -          | ~1.5x normal | 24 hours |

Additional Costs:

  • Load Balancer: No extra cost (same frontend)
  • Cloud SQL: No extra cost (shared database)
  • Redis: No extra cost (shared cache)
  • Networking: Minimal egress cost increase (~$0.10/deployment)

Total Deployment Cost: ~$5/deployment (assuming $300/month infrastructure)
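The ~$5 figure can be reproduced from the phase table above, assuming the $300/month baseline is billed hourly (the 730 hours/month figure is an assumption):

```python
MONTHLY_BASELINE = 300.0
HOURLY_BASELINE = MONTHLY_BASELINE / 730  # ~$0.41/hour for the normal 1x fleet

# Extra capacity beyond the normal 1x, per phase (duration_hours, extra_multiple)
extra_hours = [
    (1.0, 1.0),   # pre-deployment: +1x for 1 hour
    (0.5, 1.0),   # canary: +1x for 30 min
    (0.5, 1.0),   # gradual: +1x for 30 min
    (1.0, 1.0),   # full cutover: +1x for 1 hour
    (23.0, 1/3),  # post-cutover: blue kept at 1 of 3 replicas -> +0.33x
]
extra_cost = sum(hours * extra for hours, extra in extra_hours) * HOURLY_BASELINE
total = extra_cost + 0.10  # plus the egress estimate above
# total lands in the $4-5 range, consistent with the ~$5/deployment quoted
```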


Automation with GitHub Actions

CI/CD Pipeline Integration

File: .github/workflows/deploy-production.yml

name: Deploy to Production (Blue/Green)

on:
  push:
    tags:
      - 'v*.*.*'

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Authenticate to GCP
        uses: google-github-actions/auth@v1
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - name: Set up Cloud SDK
        uses: google-github-actions/setup-gcloud@v1

      - name: Deploy to Green environment
        run: |
          kubectl apply -f k8s/production/green/
          kubectl rollout status deployment/license-api -n coditect-prod-green

      - name: Run smoke tests on Green
        run: |
          python tests/smoke_test.py --env=green

      - name: Canary deployment (10% traffic)
        run: |
          kubectl patch ingress license-api -n coditect-prod --type=json \
            -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "10"}]'
          sleep 1800  # Wait 30 minutes

      - name: Check canary metrics
        run: |
          python scripts/check_deployment_health.py --env=green --threshold=0.5

      - name: Gradual rollout (50% traffic)
        if: success()
        run: |
          kubectl patch ingress license-api -n coditect-prod --type=json \
            -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "50"}]'
          sleep 1800  # Wait 30 minutes

      - name: Full cutover (100% traffic)
        if: success()
        run: |
          kubectl patch ingress license-api -n coditect-prod --type=json \
            -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "100"}]'

      - name: Rollback on failure
        if: failure()
        run: |
          kubectl patch ingress license-api -n coditect-prod --type=json \
            -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "0"}]'
          kubectl scale deployment license-api -n coditect-prod-green --replicas=0

Best Practices

Database Migration Strategy

Expand/Contract Pattern:

Step 1 (Blue version): Add new column, keep old column

ALTER TABLE licenses ADD COLUMN max_seats_v2 INTEGER;
-- Application reads from max_seats, writes to both max_seats and max_seats_v2

Step 2 (Green deployment): Read from new column, write to both

# Green application code
# Read: prefer the new column, testing for None explicitly — a bare
# `or` would wrongly fall back when max_seats_v2 is legitimately 0
max_seats = license.max_seats_v2 if license.max_seats_v2 is not None else license.max_seats
# Write: dual-write to both columns during the transition
license.max_seats = value
license.max_seats_v2 = value

Step 3 (After cutover): Backfill old rows, drop old column

UPDATE licenses SET max_seats_v2 = max_seats WHERE max_seats_v2 IS NULL;
ALTER TABLE licenses DROP COLUMN max_seats;

Traffic Split Considerations

Session Affinity:

  • Use sessionAffinity: ClientIP in Kubernetes Service to avoid mid-session environment switches
  • Alternative: Use consistent hashing based on license_id for deterministic routing
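The consistent-hashing alternative can be sketched as a small routing predicate: hash the license_id into a stable bucket and compare it to the canary weight, so a given license always lands in the same environment at a given weight. The function name and choice of SHA-256 are assumptions; any stable hash works:

```python
import hashlib

def routes_to_green(license_id: str, green_weight: int) -> bool:
    """Deterministically decide whether a license is served by green.

    green_weight is the canary percentage (0-100); the same license_id
    always gets the same answer for a given weight.
    """
    digest = hashlib.sha256(license_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < green_weight

# The same license always routes the same way:
assert routes_to_green("lic-1234", 50) == routes_to_green("lic-1234", 50)
# At weight 0 nothing goes to green; at 100 everything does:
assert not routes_to_green("lic-1234", 0)
assert routes_to_green("lic-1234", 100)
```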

WebSocket Connections:

  • For long-lived WebSocket connections, use separate deployment strategy (canary with sticky sessions)
  • Or drain connections gracefully before cutover (send disconnect notice, allow reconnect)

Troubleshooting

Issue: Green pods crash-looping

Symptoms: CrashLoopBackOff status in kubectl get pods -n coditect-prod-green

Diagnosis:

kubectl logs -n coditect-prod-green <pod-name> --previous
kubectl describe pod -n coditect-prod-green <pod-name>

Common Causes:

  • Database migration failure (check Alembic logs)
  • Missing environment variable (check Secret/ConfigMap)
  • Dependency service unavailable (Redis, Cloud KMS)

Resolution: Fix issue, redeploy green


Issue: Inconsistent traffic routing

Symptoms: Some users see old version after cutover

Diagnosis:

# Check ingress annotation
kubectl get ingress license-api -n coditect-prod -o yaml | grep canary

# Check service endpoints
kubectl get endpoints -n coditect-prod-blue
kubectl get endpoints -n coditect-prod-green

Common Causes:

  • Browser caching (force reload with Ctrl+Shift+R)
  • CDN caching (purge Cloudflare/Fastly cache)
  • DNS propagation delay (wait 5 minutes)

Resolution: Clear caches, wait for propagation



Document Classification: Internal - DevOps Documentation
Review Cycle: Every production deployment
Next Review Date: After next major release


Last Updated: November 23, 2025
Owner: Platform Engineering Team
Status: Production Ready