Blue/Green Deployment Strategy
Type: Deployment Architecture
Purpose: Zero-downtime deployments with instant rollback capability
Risk Level: Low (validated in staging before production)
Last Updated: November 23, 2025
Overview
Blue/Green deployment is the primary deployment strategy for CODITECT cloud infrastructure, providing zero-downtime updates with instant rollback capability. Two identical production environments (Blue and Green) run in parallel, with traffic switched between them during deployments.
Key Benefits:
- Zero Downtime: Traffic switches instantly between environments
- Fast Rollback: Revert to previous version in <5 minutes
- Pre-validation: Test new version with real production data before cutover
- Risk Mitigation: Canary testing catches 95% of deployment issues
Trade-offs:
- Cost: Requires 2x infrastructure during deployment (temporary)
- Complexity: Need orchestration for traffic routing and health checks
- Database Migrations: Require backward-compatible schema changes
Architecture Diagram
Deployment Phases
Phase 0: Pre-Deployment Validation
Timeline: 1 hour before deployment
Owner: DevOps Engineer
Checklist:
- Green environment deployed (0% traffic)
- All green pods passing readiness probes
- Database migrations applied (backward-compatible)
- Smoke tests passed in green environment
- Monitoring dashboards configured
- Rollback plan documented
- On-call engineer notified
Validation:
```bash
# Verify green deployment
kubectl get pods -n coditect-prod-green -l app=license-api
# All pods should show STATUS: Running, READY: 1/1

# Check health endpoint
curl -H "Host: api.coditect.dev" http://green-service-ip:8000/health
# Should return: {"status": "healthy", "version": "1.3.0"}

# Run smoke tests
python tests/smoke_test.py --env=green
# All critical paths should pass
```
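The smoke-test script is referenced above but not shown. A minimal sketch of what `tests/smoke_test.py` might check for the green environment, assuming the `/health` endpoint returns the JSON shown above (the `evaluate_health` and `run_smoke_test` names are illustrative, not the real script):

```python
"""Smoke-test sketch -- the real tests/smoke_test.py may differ."""
import json
import urllib.request

EXPECTED_VERSION = "1.3.0"  # version being rolled out to green

def evaluate_health(payload: dict, expected_version: str) -> bool:
    """True when a /health response looks like a healthy pod on the new version."""
    return payload.get("status") == "healthy" and payload.get("version") == expected_version

def run_smoke_test(base_url: str) -> bool:
    """Fetch /health from the green service and evaluate the response."""
    req = urllib.request.Request(f"{base_url}/health", headers={"Host": "api.coditect.dev"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return evaluate_health(json.load(resp), EXPECTED_VERSION)
```

Keeping the pass/fail logic in a pure function (`evaluate_health`) makes the gate testable without a live cluster.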
Phase 1: Canary Testing (10% Traffic)
Timeline: 30 minutes
Traffic Split: Blue 90%, Green 10%
Purpose: Detect critical errors with minimal user impact
Actions:
```bash
# Update ingress to route 10% traffic to green
kubectl patch ingress license-api -n coditect-prod --type=json \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "10"}]'
```
Monitoring Metrics:
- Error Rate: <0.5% (same as blue baseline)
- Latency p95: <100ms (within 10% of blue)
- Throughput: 10-15 req/sec (proportional to traffic split)
- Database Queries: No slow queries (>1s)
- Redis Operations: No connection errors
Success Criteria:
- ✅ Error rate <0.5% for 30 minutes
- ✅ No P0/P1 alerts triggered
- ✅ Latency within acceptable range
- ✅ No user complaints in support channels
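The machine-checkable criteria above can be encoded as a simple promotion gate. A sketch with thresholds taken from the bullets (the `canary_passes` name is illustrative; the real deployment tooling may differ):

```python
def canary_passes(error_rate_pct: float, p95_ms: float,
                  baseline_p95_ms: float, p0_p1_alerts: int) -> bool:
    """Promotion gate mirroring the canary success criteria above."""
    if error_rate_pct >= 0.5:            # error rate must stay below 0.5%
        return False
    if p0_p1_alerts > 0:                 # any P0/P1 alert fails the canary
        return False
    if p95_ms > baseline_p95_ms * 1.10:  # latency must stay within 10% of blue
        return False
    return True
```

The "no user complaints" criterion stays a human judgment and is deliberately not encoded.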
Failure Response:
```bash
# Immediate rollback to 100% blue
kubectl patch ingress license-api -n coditect-prod --type=json \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "0"}]'

# Investigate logs
kubectl logs -n coditect-prod-green -l app=license-api --tail=100
```
Phase 2: Gradual Rollout (50% Traffic)
Timeline: 30 minutes
Traffic Split: Blue 50%, Green 50%
Purpose: Validate performance under higher load
Actions:
```bash
# Increase traffic to 50%
kubectl patch ingress license-api -n coditect-prod --type=json \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "50"}]'
```
Monitoring Metrics:
- Throughput: Green should handle ~50% of total traffic (50-75 req/sec)
- Pod CPU Usage: <70% average across green pods
- Pod Memory Usage: <80% of allocated limits
- Database Connection Pool: <50 active connections
- Redis Hit Rate: >90% (cache warming complete)
Success Criteria:
- ✅ Green environment stable for 30 minutes
- ✅ No performance degradation vs. blue
- ✅ Autoscaling triggers appropriately (if load increases)
- ✅ No database deadlocks or slow queries
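The Phase 2 resource thresholds above can also be collapsed into a single stability check, a sketch (the `green_stable` name is illustrative):

```python
def green_stable(cpu_pct: float, mem_pct: float,
                 db_connections: int, redis_hit_rate_pct: float) -> bool:
    """Phase 2 thresholds: CPU <70%, memory <80%, <50 DB connections, Redis hit rate >90%."""
    return (cpu_pct < 70
            and mem_pct < 80
            and db_connections < 50
            and redis_hit_rate_pct > 90)
```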
Phase 3: Full Cutover (100% Traffic)
Timeline: Immediate (if Phase 2 passes)
Traffic Split: Blue 0%, Green 100%
Purpose: Complete migration to new version
Actions:
```bash
# Route 100% traffic to green
kubectl patch ingress license-api -n coditect-prod --type=json \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "100"}]'

# Verify traffic routing
kubectl describe ingress license-api -n coditect-prod | grep "canary-weight"
```
Monitoring Focus:
- Error Rate: Must remain <0.5%
- Latency: p95 <100ms, p99 <500ms
- Throughput: 100% of production traffic (100-150 req/sec)
- Active Sessions: Redis session count should stabilize
Post-Cutover Actions:
- Monitor for 1 hour before decommissioning blue
- Keep blue pods running (but scaled down to 1 replica for cost savings)
- Update production deployment markers in Git
- Notify team in #engineering Slack channel
Phase 4: Blue Decommission (24-hour window)
Timeline: 24 hours after successful cutover
Purpose: Clean up old environment and free resources
Actions:
```bash
# Scale down blue pods (keep namespace for rollback)
kubectl scale deployment license-api -n coditect-prod-blue --replicas=0

# Optional: Delete blue namespace after 7 days
# kubectl delete namespace coditect-prod-blue
```
Rollback Window:
- 0-1 hour: Instant rollback via traffic routing
- 1-24 hours: Fast rollback (scale up blue pods, ~2 minutes)
- 24+ hours: Manual rollback (redeploy blue, ~15 minutes)
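The rollback windows above map cleanly to a decision helper; a sketch (the function and return labels are illustrative, not existing tooling):

```python
def rollback_method(hours_since_cutover: float) -> str:
    """Map time elapsed since cutover to the fastest available rollback path."""
    if hours_since_cutover < 1:
        return "traffic-routing"  # instant: set canary-weight back to 0
    if hours_since_cutover < 24:
        return "scale-up-blue"    # ~2 minutes: blue pods still exist, just scaled down
    return "redeploy-blue"        # ~15 minutes: blue was decommissioned, full redeploy
```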
Rollback Procedures
Scenario 1: Error Rate Spike (Critical)
Detection: PagerDuty alert "Error rate >2% for 5 minutes"
Action: Immediate rollback to blue
```bash
# Step 1: Route 100% traffic to blue (instant)
kubectl patch ingress license-api -n coditect-prod --type=json \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "0"}]'

# Step 2: Verify traffic routing
watch -n 1 'kubectl get pods -n coditect-prod-blue -o wide'
# Should show all pods receiving connections

# Step 3: Scale down green (optional, to reduce costs)
kubectl scale deployment license-api -n coditect-prod-green --replicas=1

# Step 4: Investigate green environment logs
kubectl logs -n coditect-prod-green -l app=license-api --since=10m > /tmp/green-rollback-$(date +%Y%m%d-%H%M%S).log
```
Recovery Time: <5 minutes
Data Loss: None (shared database)
Scenario 2: Latency Degradation (High Priority)
Detection: Grafana alert "p95 latency >200ms for 10 minutes"
Action: Gradual rollback with investigation
```bash
# Step 1: Reduce green traffic to 10% (investigate under low load)
kubectl patch ingress license-api -n coditect-prod --type=json \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "10"}]'

# Step 2: Check green pod resource usage
kubectl top pods -n coditect-prod-green

# Step 3: Analyze slow queries (if database-related)
psql -U postgres -h cloud-sql-instance -c "SELECT * FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"

# Step 4: If not resolved in 15 minutes, full rollback
kubectl patch ingress license-api -n coditect-prod --type=json \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "0"}]'
```
Scenario 3: Database Migration Failure (Critical)
Detection: Alembic migration fails during deployment
Action: Abort deployment, rollback database (if possible)
```bash
# Step 1: Do NOT cutover to green (stay 100% blue)
# Green pods will fail health checks if the database schema is incompatible

# Step 2: Rollback database migration (if safe)
kubectl exec -it <blue-pod> -n coditect-prod-blue -- alembic downgrade -1

# Step 3: Delete green deployment
kubectl delete deployment license-api -n coditect-prod-green

# Step 4: Fix migration script, redeploy
# Ensure migrations are backward-compatible in future
```
Prevention:
- Always test migrations in staging first
- Use backward-compatible schema changes (expand/contract pattern)
- Never drop columns or tables during deployment (use deprecation cycle)
Key Metrics Dashboard
Grafana Dashboard: Blue vs Green Comparison
Panels:
1. Request Rate (req/sec)
   - Blue: 90 req/sec → 0 req/sec (after cutover)
   - Green: 0 req/sec → 100 req/sec (after cutover)
2. Error Rate (%)
   - Threshold: <0.5% (yellow), <2% (red)
   - Alert if green error rate >2x blue error rate
3. Latency (p50, p95, p99)
   - Blue: p95 = 85ms
   - Green: p95 should be within 10% (76.5-93.5ms)
4. Pod CPU/Memory Usage
   - Blue: 50% CPU avg, 60% memory
   - Green: should be similar under the same traffic load
5. Database Connection Pool
   - Blue: 30 active connections
   - Green: should not exceed 50 connections (limit 100)
6. Redis Operations
   - Blue: 500 ops/sec
   - Green: should scale proportionally with traffic
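The comparative rules in the panels above (green error rate vs. 2x blue baseline, p95 within 10% of blue) can be expressed directly; a sketch with illustrative function names, not existing dashboard code:

```python
def green_error_alert(green_error_rate: float, blue_error_rate: float) -> bool:
    """Fire when green's error rate exceeds 2x the blue baseline."""
    return green_error_rate > 2 * blue_error_rate

def p95_within_band(green_p95_ms: float, blue_p95_ms: float, band: float = 0.10) -> bool:
    """Green p95 should stay within ±10% of blue (e.g. 85ms → 76.5-93.5ms)."""
    return abs(green_p95_ms - blue_p95_ms) <= band * blue_p95_ms
```

Comparing green against blue as a live baseline, rather than a fixed threshold, keeps the alert meaningful when overall traffic shifts.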
Cost Analysis
Infrastructure Costs During Deployment
| Phase | Blue Pods | Green Pods | Total Cost | Duration |
|---|---|---|---|---|
| Pre-Deployment | 3 | 3 | 2x normal | 1 hour |
| Canary (10%) | 3 | 3 | 2x normal | 30 min |
| Gradual (50%) | 3 | 3 | 2x normal | 30 min |
| Full Cutover | 3 | 3 | 2x normal | 1 hour |
| Post-Cutover | 1 | 3 | 1.33x normal | 23 hours |
| Total | - | - | ~1.5x normal | 24 hours |
Additional Costs:
- Load Balancer: No extra cost (same frontend)
- Cloud SQL: No extra cost (shared database)
- Redis: No extra cost (shared cache)
- Networking: Minimal egress cost increase (~$0.10/deployment)
Total Deployment Cost: ~$5/deployment (assuming $300/month infrastructure)
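The ~$5 figure can be reproduced from the phase table; a sketch under the stated assumptions ($300/month infrastructure, ~730 hours/month, and the egress figure above):

```python
MONTHLY_COST = 300.0
HOURLY_COST = MONTHLY_COST / 730  # ~$0.41/hour for the normal (1x) footprint

def deployment_cost() -> float:
    """Extra spend for one blue/green deployment, per the phase table above."""
    extra = 0.0
    extra += 3 * 1.00 * HOURLY_COST   # 3 hours at 2x normal (1x extra): pre-deploy through cutover
    extra += 23 * 0.33 * HOURLY_COST  # 23 hours at ~1.33x normal (0.33x extra): post-cutover
    extra += 0.10                     # networking egress increase per deployment
    return extra                      # ≈ $4.45, i.e. roughly $5 per deployment
```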
Automation with GitHub Actions
CI/CD Pipeline Integration
File: .github/workflows/deploy-production.yml
```yaml
name: Deploy to Production (Blue/Green)

on:
  push:
    tags:
      - 'v*.*.*'

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Authenticate to GCP
        uses: google-github-actions/auth@v1
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - name: Set up Cloud SDK
        uses: google-github-actions/setup-gcloud@v1

      - name: Deploy to Green environment
        run: |
          kubectl apply -f k8s/production/green/
          kubectl rollout status deployment/license-api -n coditect-prod-green

      - name: Run smoke tests on Green
        run: |
          python tests/smoke_test.py --env=green

      - name: Canary deployment (10% traffic)
        run: |
          kubectl patch ingress license-api -n coditect-prod --type=json \
            -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "10"}]'
          sleep 1800  # Wait 30 minutes

      - name: Check canary metrics
        run: |
          python scripts/check_deployment_health.py --env=green --threshold=0.5

      - name: Gradual rollout (50% traffic)
        if: success()
        run: |
          kubectl patch ingress license-api -n coditect-prod --type=json \
            -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "50"}]'
          sleep 1800  # Wait 30 minutes

      - name: Full cutover (100% traffic)
        if: success()
        run: |
          kubectl patch ingress license-api -n coditect-prod --type=json \
            -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "100"}]'

      - name: Rollback on failure
        if: failure()
        run: |
          kubectl patch ingress license-api -n coditect-prod --type=json \
            -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "0"}]'
          kubectl scale deployment license-api -n coditect-prod-green --replicas=0
```
Best Practices
Database Migration Strategy
Expand/Contract Pattern:
Step 1 (Blue version): Add new column, keep old column

```sql
ALTER TABLE licenses ADD COLUMN max_seats_v2 INTEGER;
-- Application reads from max_seats, writes to both max_seats and max_seats_v2
```

Step 2 (Green deployment): Read from new column, write to both

```python
# Green application code
max_seats = license.max_seats_v2 or license.max_seats  # Fall back to the old column
license.max_seats = value
license.max_seats_v2 = value  # Dual writes keep both columns in sync
```

Step 3 (After cutover): Backfill old rows, drop old column

```sql
UPDATE licenses SET max_seats_v2 = max_seats WHERE max_seats_v2 IS NULL;
ALTER TABLE licenses DROP COLUMN max_seats;
```
Traffic Split Considerations
Session Affinity:
- Use `sessionAffinity: ClientIP` in the Kubernetes Service to avoid mid-session environment switches
- Alternative: use consistent hashing based on `license_id` for deterministic routing
WebSocket Connections:
- For long-lived WebSocket connections, use separate deployment strategy (canary with sticky sessions)
- Or drain connections gracefully before cutover (send disconnect notice, allow reconnect)
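One way to implement the deterministic routing mentioned above, assuming `license_id` is available at the routing layer (the `route_environment` helper is illustrative, not existing code):

```python
import hashlib

def route_environment(license_id: str, green_weight: int) -> str:
    """Deterministically route a license to blue or green.

    Hashing license_id into a bucket in [0, 100) keeps every request for the
    same license in the same environment at a given canary weight, unlike
    random per-request weighting.
    """
    digest = hashlib.sha256(license_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "green" if bucket < green_weight else "blue"
```

Raising `green_weight` from 10 to 50 to 100 moves whole licenses over in stable cohorts, so a given customer never bounces between versions mid-session.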
Troubleshooting
Issue: Green pods crash-looping
Symptoms: `CrashLoopBackOff` status in `kubectl get pods -n coditect-prod-green`
Diagnosis:
```bash
kubectl logs -n coditect-prod-green <pod-name> --previous
kubectl describe pod -n coditect-prod-green <pod-name>
```
Common Causes:
- Database migration failure (check Alembic logs)
- Missing environment variable (check Secret/ConfigMap)
- Dependency service unavailable (Redis, Cloud KMS)
Resolution: Fix issue, redeploy green
Issue: Inconsistent traffic routing
Symptoms: Some users see old version after cutover
Diagnosis:
```bash
# Check ingress annotation
kubectl get ingress license-api -n coditect-prod -o yaml | grep canary

# Check service endpoints
kubectl get endpoints -n coditect-prod-blue
kubectl get endpoints -n coditect-prod-green
```
Common Causes:
- Browser caching (force reload with Ctrl+Shift+R)
- CDN caching (purge CloudFlare/Fastly cache)
- DNS propagation delay (wait 5 minutes)
Resolution: Clear caches, wait for propagation
Related Documents
- Rolling Update Strategy - Alternative deployment for minor updates
- Disaster Recovery Runbook - Recovery procedures for failures
- C2: Container Diagram - Infrastructure overview
Document Classification: Internal - DevOps Documentation
Review Cycle: Every production deployment
Next Review Date: After next major release
Last Updated: November 23, 2025
Owner: Platform Engineering Team
Status: Production Ready