Production Deployment Guide
Target Audience: DevOps engineers, Site Reliability Engineers, Deployment teams
Purpose: Complete production deployment procedures
Reading Time: 20 minutes
Table of Contents
- Pre-Deployment Checklist
- Deployment Workflow
- Step-by-Step Deployment
- Post-Deployment Validation
- Health Checks
- Performance Verification
- Rollback Procedures
- Troubleshooting
Pre-Deployment Checklist
Critical Requirements
Complete all items before proceeding with deployment.
Infrastructure Readiness
- Infrastructure provisioned and tested (see INFRASTRUCTURE.md)
- Database instances running and accessible
- Network configuration verified (VPC, subnets, firewall)
- Load balancer configured and health-checked
- DNS records created and propagated
- SSL/TLS certificates installed and validated
- Cloud storage buckets created with proper IAM
- CDN configured (if applicable)
Application Readiness
- Docker image built and pushed to registry
- Image tagged with version and latest
- Environment variables configured
- Secrets stored in Secret Manager
- Database migrations tested in staging
- Application configuration validated
- Feature flags configured (if using)
- API keys and credentials verified
Security Checklist
- Security scan passed (container vulnerabilities)
- Dependency audit completed (no critical CVEs)
- Firewall rules reviewed and approved
- IAM roles and permissions audited
- Secrets rotation scheduled
- HTTPS enforced (HTTP → HTTPS redirect)
- CORS policies configured
- Rate limiting configured
Testing & Quality
- All tests passing in CI/CD
- Integration tests passed in staging
- Load testing completed
- Performance benchmarks met
- Staging deployment validated
- Smoke tests prepared
- Rollback plan documented
Operations Readiness
- Monitoring dashboards configured
- Alerting rules created and tested
- On-call schedule confirmed
- Incident response plan reviewed
- Runbooks updated
- Deployment window scheduled
- Stakeholders notified
- Communication channels ready
Documentation
- Architecture diagrams updated
- API documentation current
- Deployment procedures documented
- Rollback procedures documented
- Known issues documented
- Change log updated
- Release notes prepared
Deployment Workflow
Deployment Stages
1. Pre-Deployment Validation
↓
2. Database Migration (if required)
↓
3. Blue-Green Deployment Preparation
↓
4. Application Deployment
↓
5. Health Checks
↓
6. Traffic Switch (gradual or full)
↓
7. Post-Deployment Validation
↓
8. Monitoring & Observation
Deployment Timeline (Typical)
| Stage | Duration | Critical? |
|---|---|---|
| Pre-validation | 15 min | Yes |
| Database migration | 5-30 min | Yes |
| Application deployment | 10 min | Yes |
| Health checks | 5 min | Yes |
| Traffic switch | 5 min | Yes |
| Post-validation | 15 min | Yes |
| Total | 55-80 min | |
Step-by-Step Deployment
Phase 1: Pre-Deployment Validation (15 minutes)
1.1 Verify Prerequisites
# Set environment variables
export ENVIRONMENT=production
export VERSION=v1.2.3
export PROJECT_ID=my-gcp-project
export REGION=us-central1
# Verify authentication
gcloud auth list
gcloud config set project $PROJECT_ID
# Check current deployment
gcloud run services list --platform managed --region $REGION
# Verify database connectivity
gcloud sql instances list
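The prerequisite checks above assume that every variable is set and every CLI is installed. A minimal preflight sketch that fails fast when that is not the case follows. The function names `require_vars` and `require_tools` are hypothetical helpers, not part of any tool used in this guide.

```shell
#!/usr/bin/env bash
# Preflight sketch: fail fast when a required variable or CLI tool is missing.
# require_vars and require_tools are illustrative helper names.

# Print the names of unset or empty variables; return non-zero if any are missing.
require_vars() {
  missing=0
  for name in "$@"; do
    # eval-based lookup keeps this portable across POSIX shells
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Return non-zero if any listed command is not on PATH.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage before Phase 1:
#   require_vars ENVIRONMENT VERSION PROJECT_ID REGION || exit 1
#   require_tools gcloud docker psql jq || exit 1
```

Running both checks before any `gcloud` call avoids a partially executed phase caused by an empty `$REGION` or `$PROJECT_ID`.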
1.2 Validate Docker Image
# Pull production image
docker pull gcr.io/$PROJECT_ID/coditect:$VERSION
# Verify image
docker inspect gcr.io/$PROJECT_ID/coditect:$VERSION
# Check image size and layers
docker images gcr.io/$PROJECT_ID/coditect:$VERSION
# Run security scan (On-Demand Scanning; --remote scans the image in the registry)
gcloud artifacts docker images scan gcr.io/$PROJECT_ID/coditect:$VERSION --remote
1.3 Backup Current State
# Backup database
gcloud sql backups create \
--instance=coditect-postgres-production \
--description="Pre-deployment backup $VERSION"
# Export current configuration
gcloud run services describe coditect-app-production \
--platform managed \
--region $REGION \
--format yaml > backup-config-$(date +%Y%m%d-%H%M%S).yaml
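# Tag current image as rollback (compute the tag once so tag and push use the same name)
ROLLBACK_TAG=rollback-$(date +%Y%m%d-%H%M%S)
docker pull gcr.io/$PROJECT_ID/coditect:latest
docker tag gcr.io/$PROJECT_ID/coditect:latest \
gcr.io/$PROJECT_ID/coditect:$ROLLBACK_TAG
docker push gcr.io/$PROJECT_ID/coditect:$ROLLBACK_TAG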
Phase 2: Database Migration (5-30 minutes)
2.1 Pre-Migration Validation
# Connect to database (opens an interactive psql session)
gcloud sql connect coditect-postgres-production --user=postgres
-- At the psql prompt: verify schema version
SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 1;
-- Check database size
SELECT pg_size_pretty(pg_database_size('coditect'));
-- Verify active connections
SELECT count(*) FROM pg_stat_activity;
2.2 Run Migrations
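# Download migration scripts
# (note: with VERSION=v1.2.3 this path expands to .../vv1.2.3/; drop the extra "v" if your bucket uses a single prefix)
mkdir -p migrations
gsutil cp gs://$PROJECT_ID-migrations/v$VERSION/*.sql ./migrations/
# Apply migrations (with transaction wrapper)
# ON_ERROR_STOP=1 makes psql exit on the first error, so a failed migration never reaches COMMIT
psql -v ON_ERROR_STOP=1 "postgresql://coditect_admin@/coditect?host=/cloudsql/$PROJECT_ID:$REGION:coditect-postgres-production" <<EOF
BEGIN;
-- Run migration
\i migrations/001_add_user_preferences.sql
-- Verify changes
SELECT COUNT(*) FROM user_preferences;
-- Commit if successful
COMMIT;
EOF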
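The example above applies a single named file. When a release ships several migrations, one possible approach is to apply every downloaded file in filename order, one transaction per file, as sketched below. `DB_URI`, `list_migrations`, and `apply_migrations` are hypothetical names; `DB_URI` stands for the connection string used above.

```shell
#!/usr/bin/env bash
# Sketch: apply every downloaded migration in filename order, one transaction per file.
# DB_URI, list_migrations and apply_migrations are illustrative names.

# List *.sql files in a directory, sorted by name so numeric prefixes
# (001_, 002_, ...) define the order.
list_migrations() {
  dir=$1
  find "$dir" -maxdepth 1 -type f -name '*.sql' | LC_ALL=C sort
}

# Apply each migration and stop at the first failure.
apply_migrations() {
  dir=$1
  list_migrations "$dir" | while IFS= read -r file; do
    echo "Applying $file"
    # --single-transaction wraps the file in BEGIN/COMMIT;
    # ON_ERROR_STOP makes psql exit non-zero on the first error.
    psql "$DB_URI" -v ON_ERROR_STOP=1 --single-transaction -f "$file" || exit 1
  done
}

# Example usage:
#   DB_URI="postgresql://coditect_admin@/coditect?host=/cloudsql/$PROJECT_ID:$REGION:coditect-postgres-production"
#   apply_migrations ./migrations
```

Per-file transactions mean a failure leaves earlier files committed, so this sketch suits migrations that are individually safe to keep. A single all-or-nothing transaction, as in the example above, is the stricter alternative.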
2.3 Post-Migration Validation
# Verify schema changes
psql "postgresql://coditect_admin@/coditect?host=/cloudsql/$PROJECT_ID:$REGION:coditect-postgres-production" <<EOF
-- List tables
\dt
-- Verify constraints
SELECT conname, contype FROM pg_constraint WHERE conrelid = 'user_preferences'::regclass;
-- Check indexes
\di user_preferences*
EOF
Phase 3: Application Deployment (10 minutes)
3.1 Deploy New Version (Cloud Run)
# Deploy new revision (no traffic initially)
gcloud run deploy coditect-app-production \
--image gcr.io/$PROJECT_ID/coditect:$VERSION \
--platform managed \
--region $REGION \
--no-traffic \
--tag v$(echo $VERSION | tr . -) \
--set-env-vars "VERSION=$VERSION,ENVIRONMENT=production" \
--set-secrets "DATABASE_URL=database-url:latest,API_KEY=api-key:latest" \
--cpu 2 \
--memory 2Gi \
--min-instances 2 \
--max-instances 10 \
--concurrency 80 \
--timeout 300s
# Get revision name (the most recently created revision of the service)
NEW_REVISION=$(gcloud run services describe coditect-app-production \
--platform managed \
--region $REGION \
--format="value(status.latestCreatedRevisionName)")
echo "New revision: $NEW_REVISION"
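With `VERSION=v1.2.3`, the tag expression `v$(echo $VERSION | tr . -)` used in the deploy command expands to `vv1-2-3`, because the version already starts with `v`. The sketch below shows one way to derive the tag so that it carries exactly one leading `v`. `version_to_tag` is a hypothetical helper. If you adopt it, use it in both the deploy command and the preview URL lookup so the two tags match.

```shell
#!/usr/bin/env bash
# Sketch: derive a traffic tag from a version string.
# Lowercases the input, maps dots to hyphens, and adds a leading "v" only when absent.
# version_to_tag is an illustrative helper name.
version_to_tag() {
  tag=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr '.' '-')
  case "$tag" in
    v*) printf '%s\n' "$tag" ;;
    *)  printf 'v%s\n' "$tag" ;;
  esac
}

# Example usage:
#   TAG=$(version_to_tag "$VERSION")
#   gcloud run deploy ... --tag "$TAG"
```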
3.2 Deploy New Version (AWS ECS)
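# Register new task definition and capture its ARN
# (ECS revisions are integers, so coditect-app:$VERSION is not a valid reference)
TASK_DEF_ARN=$(aws ecs register-task-definition \
--cli-input-json file://task-definition-$VERSION.json \
--query 'taskDefinition.taskDefinitionArn' \
--output text)
# Update service (starts a rolling deployment; new tasks receive traffic once healthy)
aws ecs update-service \
--cluster coditect-production \
--service coditect-app \
--task-definition "$TASK_DEF_ARN" \
--desired-count 2 \
--deployment-configuration "maximumPercent=200,minimumHealthyPercent=100"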
# Wait for deployment
aws ecs wait services-stable \
--cluster coditect-production \
--services coditect-app
Phase 4: Health Checks (5 minutes)
4.1 Application Health Check
# Get preview URL (Cloud Run): look up the traffic entry for this version's tag
TAG=v$(echo $VERSION | tr . -)
PREVIEW_URL=$(gcloud run services describe coditect-app-production \
--platform managed \
--region $REGION \
--format=json | jq -r --arg tag "$TAG" '.status.traffic[] | select(.tag == $tag) | .url')
echo "Preview URL: $PREVIEW_URL"
# Test health endpoint
curl -f "$PREVIEW_URL/health"
# Expected response:
# {"status":"ok","version":"v1.2.3","timestamp":"2025-12-22T10:30:00Z"}
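A freshly deployed revision may need a few seconds before it answers, so a single `curl` can fail even when the deployment is fine. A minimal retry sketch with a fixed delay follows. The `retry` helper and the attempt and delay values in the usage comment are assumptions, not values taken from this guide.

```shell
#!/usr/bin/env bash
# Sketch: retry a command with a fixed delay until it succeeds or attempts run out.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
# retry is an illustrative helper name.
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "gave up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example: wait up to roughly one minute for the tagged revision to answer.
#   retry 12 5 curl -fsS "$PREVIEW_URL/health"
```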
4.2 Smoke Tests
# Run smoke tests against preview URL
cat > smoke-tests.sh <<'EOF'
#!/bin/bash
set -euo pipefail
BASE_URL=$1
echo "Running smoke tests against $BASE_URL"
# Test 1: Health check
echo "Test 1: Health check"
curl -f $BASE_URL/health | jq -e '.status == "ok"'
# Test 2: API endpoint
echo "Test 2: API endpoint"
curl -f $BASE_URL/api/v1/status | jq -e '.online == true'
# Test 3: Database connectivity
echo "Test 3: Database connectivity"
curl -f $BASE_URL/api/v1/db/ping | jq -e '.connected == true'
# Test 4: Authentication (API_KEY must be exported in the calling shell)
echo "Test 4: Authentication"
curl -f -H "Authorization: Bearer $API_KEY" \
"$BASE_URL/api/v1/me" | jq -e '.user.id != null'
echo "All smoke tests passed!"
EOF
chmod +x smoke-tests.sh
./smoke-tests.sh $PREVIEW_URL
4.3 Dependency Checks
# Verify external dependencies
cat > dependency-checks.sh <<'EOF'
#!/bin/bash
echo "Checking external dependencies..."
# Check database
pg_isready -h $DB_HOST -p 5432 || exit 1
# Check Redis (if used)
redis-cli -h $REDIS_HOST ping | grep PONG || exit 1
# Check GCS bucket
gsutil ls gs://$BUCKET_NAME || exit 1
# Check external APIs
curl -f https://api.external-service.com/health || exit 1
echo "All dependencies healthy!"
EOF
chmod +x dependency-checks.sh
./dependency-checks.sh
Phase 5: Traffic Switch (5 minutes)
5.1 Gradual Traffic Migration (Canary)
# Send 10% traffic to new revision
gcloud run services update-traffic coditect-app-production \
--platform managed \
--region $REGION \
--to-revisions $NEW_REVISION=10
echo "Observing metrics for 5 minutes..."
sleep 300
# Check error rates
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=coditect-app-production AND severity>=ERROR" \
--limit 50 \
--format json
# If healthy, increase to 50%
gcloud run services update-traffic coditect-app-production \
--platform managed \
--region $REGION \
--to-revisions $NEW_REVISION=50
echo "Observing metrics for 5 minutes..."
sleep 300
# If still healthy, switch to 100%
gcloud run services update-traffic coditect-app-production \
--platform managed \
--region $REGION \
--to-latest
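The canary steps above say "if healthy" but leave the decision to the operator. One way to make that gate explicit is to compare the observed error rate with the 1% rollback threshold used later in this guide, as sketched here. The helper names are hypothetical, and the request and error counts are assumed to come from your monitoring system.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a canary step may proceed, given request and error counts.
# error_rate_pct and canary_healthy are illustrative helper names.

# Print the error rate as a percentage with two decimals.
error_rate_pct() {
  awk -v t="$1" -v e="$2" 'BEGIN {
    if (t + 0 == 0) { print "0.00"; exit }
    printf "%.2f\n", (e / t) * 100
  }'
}

# Succeed when the error rate is at or below the threshold percentage.
# Zero traffic counts as healthy here, so make sure the canary actually
# received requests before trusting the result.
canary_healthy() {
  awk -v t="$1" -v e="$2" -v th="$3" 'BEGIN {
    rate = 0
    if (t + 0 > 0) rate = (e / t) * 100
    if (rate <= th + 0) exit 0
    exit 1
  }'
}

# Example usage between traffic steps (TOTAL and ERRORS are assumed inputs):
#   canary_healthy "$TOTAL" "$ERRORS" 1 || echo "Canary unhealthy, roll back"
```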
5.2 Blue-Green Deployment (Full Switch)
# Immediate full switch (use cautiously)
gcloud run services update-traffic coditect-app-production \
--platform managed \
--region $REGION \
--to-latest
# Monitor error rates closely
watch -n 5 'gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" --limit 10 --format json | jq -r ".[] | .textPayload"'
Phase 6: Post-Deployment Validation (15 minutes)
6.1 Functional Testing
# Full integration tests
cat > integration-tests.sh <<'EOF'
#!/bin/bash
set -euo pipefail
BASE_URL=https://api.coditect.ai
echo "Running integration tests..."
# User registration
USER_ID=$(curl -X POST $BASE_URL/api/v1/users \
-H "Content-Type: application/json" \
-d '{"email":"test@example.com","name":"Test User"}' \
| jq -r '.id')
# User login
TOKEN=$(curl -X POST $BASE_URL/api/v1/auth/login \
-H "Content-Type: application/json" \
-d '{"email":"test@example.com","password":"test123"}' \
| jq -r '.token')
# Create resource
RESOURCE_ID=$(curl -X POST $BASE_URL/api/v1/resources \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"name":"Test Resource"}' \
| jq -r '.id')
# Verify resource
curl -f $BASE_URL/api/v1/resources/$RESOURCE_ID \
-H "Authorization: Bearer $TOKEN"
# Cleanup
curl -X DELETE $BASE_URL/api/v1/resources/$RESOURCE_ID \
-H "Authorization: Bearer $TOKEN"
echo "Integration tests passed!"
EOF
chmod +x integration-tests.sh
./integration-tests.sh
6.2 Performance Verification
# Load test with Apache Bench
# (TOKEN must be set in this shell; the one obtained inside integration-tests.sh is not exported)
ab -n 1000 -c 10 -H "Authorization: Bearer $TOKEN" \
https://api.coditect.ai/api/v1/status
# Expected results:
# - Requests per second: > 100
# - Time per request: < 100ms (mean)
# - Failed requests: 0
# Load test with wrk
wrk -t4 -c100 -d30s --latency \
-H "Authorization: Bearer $TOKEN" \
https://api.coditect.ai/api/v1/status
# Expected results:
# - Latency 99th percentile: < 500ms
# - Requests/sec: > 500
# - Non-2xx responses: 0%
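The expected results above are checked by eye. As a sketch, the `ab` report can be parsed so the throughput expectation (more than 100 requests per second) becomes a pass/fail check. `ab_rps` and `meets_minimum` are hypothetical helpers, and the parsing assumes the usual `Requests per second:` line in `ab` output.

```shell
#!/usr/bin/env bash
# Sketch: extract "Requests per second" from ab output and compare it to a minimum.
# ab_rps and meets_minimum are illustrative helper names.

# Read an ab report on stdin and print the mean requests-per-second figure.
ab_rps() {
  awk -F: '/^Requests per second/ { split($2, parts, " "); print parts[1]; exit }'
}

# Succeed when the measured value meets or exceeds the minimum.
meets_minimum() {
  awk -v v="$1" -v min="$2" 'BEGIN { if (v + 0 >= min + 0) exit 0; exit 1 }'
}

# Example usage:
#   RPS=$(ab -n 1000 -c 10 https://api.coditect.ai/api/v1/status | ab_rps)
#   meets_minimum "$RPS" 100 || echo "Throughput below expectation: $RPS req/s"
```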
6.3 Database Verification
# Verify database integrity
psql "postgresql://coditect_admin@/coditect?host=/cloudsql/$PROJECT_ID:$REGION:coditect-postgres-production" <<EOF
-- Check table sizes
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
-- Verify constraints
SELECT COUNT(*) FROM information_schema.table_constraints WHERE constraint_type = 'FOREIGN KEY';
-- Check replication lag (run on a read replica; returns NULL on the primary)
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS replication_lag_seconds;
EOF
Health Checks
Application Health Endpoints
/health - Basic Health Check
GET /health
Response:
{
"status": "ok",
"version": "v1.2.3",
"timestamp": "2025-12-22T10:30:00Z"
}
/health/ready - Readiness Probe
GET /health/ready
Response:
{
"status": "ready",
"checks": {
"database": "ok",
"redis": "ok",
"storage": "ok"
}
}
/health/live - Liveness Probe
GET /health/live
Response:
{
"status": "alive",
"uptime": 3600,
"memory_usage": "450MB/2GB"
}
Monitoring Queries
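# Check request latency (Cloud Run) via the Cloud Monitoring API timeSeries.list method
# (date -d is GNU date syntax)
START=$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
curl -sS -G "https://monitoring.googleapis.com/v3/projects/$PROJECT_ID/timeSeries" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
--data-urlencode 'filter=metric.type="run.googleapis.com/request_latencies"' \
--data-urlencode "interval.startTime=$START" \
--data-urlencode "interval.endTime=$END"
# Check error rate (5xx responses over the same interval)
curl -sS -G "https://monitoring.googleapis.com/v3/projects/$PROJECT_ID/timeSeries" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
--data-urlencode 'filter=metric.type="run.googleapis.com/request_count" AND metric.labels.response_code_class="5xx"' \
--data-urlencode "interval.startTime=$START" \
--data-urlencode "interval.endTime=$END"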
# Check autoscaling bounds (min/max instances configured on the service)
gcloud run services describe coditect-app-production \
--platform managed \
--region $REGION \
--format=json | jq -r '.spec.template.metadata.annotations | ."autoscaling.knative.dev/minScale", ."autoscaling.knative.dev/maxScale"'
Performance Verification
Key Performance Indicators
| Metric | Target | Critical Threshold |
|---|---|---|
| Request latency (p50) | < 100ms | < 200ms |
| Request latency (p95) | < 500ms | < 1000ms |
| Request latency (p99) | < 1000ms | < 2000ms |
| Error rate | < 0.1% | < 1% |
| Availability | > 99.9% | > 99% |
| CPU usage | < 70% | < 90% |
| Memory usage | < 80% | < 95% |
| Database connections | < 80% of max | < 95% of max |
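The table gives two bounds per metric. For "lower is better" metrics such as latency, error rate, and CPU or memory usage, a measurement can be classified against them as sketched below. The `ok`, `warn`, and `critical` labels and the `classify_metric` helper are assumptions added for illustration. Availability is "higher is better" and is not covered by this sketch.

```shell
#!/usr/bin/env bash
# Sketch: classify a "lower is better" measurement against a target and a critical threshold.
# Usage: classify_metric <value> <target> <critical>
# classify_metric and its output labels are illustrative.
classify_metric() {
  awk -v v="$1" -v target="$2" -v crit="$3" 'BEGIN {
    if (v + 0 < target + 0)    print "ok"        # within target
    else if (v + 0 < crit + 0) print "warn"      # above target, below critical
    else                       print "critical"  # critical threshold breached
  }'
}

# Example usage with the p95 latency row (values in milliseconds):
#   classify_metric "$P95_MS" 500 1000
```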
Performance Testing Script
cat > performance-test.sh <<'EOF'
#!/bin/bash
BASE_URL=https://api.coditect.ai
echo "Running performance tests..."
# Latency test
echo "Testing latency..."
for i in {1..100}; do
curl -o /dev/null -s -w "%{time_total}\n" $BASE_URL/health
done | awk '{sum+=$1; count++} END {print "Average latency: " sum/count "s"}'
# Throughput test
echo "Testing throughput..."
ab -n 10000 -c 50 $BASE_URL/health | grep "Requests per second"
# Stress test
echo "Running stress test..."
wrk -t8 -c200 -d60s --latency $BASE_URL/api/v1/status
echo "Performance tests complete!"
EOF
chmod +x performance-test.sh
./performance-test.sh
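The latency test in the script reports an average, while the KPI table is expressed in percentiles. As a sketch, the same `curl` timings can be reduced to a percentile with the nearest-rank method. `percentile` is a hypothetical helper, and nearest-rank is one of several common percentile definitions.

```shell
#!/usr/bin/env bash
# Sketch: nearest-rank percentile of numbers read from stdin, one per line.
# Usage: percentile <p>   (p between 1 and 100)
# percentile is an illustrative helper name.
percentile() {
  p=$1
  LC_ALL=C sort -n | awk -v p="$p" '
    { values[NR] = $1 }
    END {
      if (NR == 0) exit 1
      x = p * NR / 100          # fractional rank
      rank = int(x)
      if (rank < x) rank++      # round up (nearest-rank uses the ceiling)
      if (rank < 1) rank = 1
      print values[rank]
    }'
}

# Example usage (timings are in seconds, as printed by curl):
#   for i in $(seq 1 100); do
#     curl -o /dev/null -s -w "%{time_total}\n" "$BASE_URL/health"
#   done | percentile 95
```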
Rollback Procedures
When to Rollback
Initiate rollback if any of these occur:
- Error rate > 1% for 5 minutes
- Request latency p99 > 2000ms for 5 minutes
- Critical functionality broken
- Database corruption detected
- Security vulnerability discovered
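The first two criteria are sustained conditions ("for 5 minutes"), so a single spike should not trigger a rollback. A minimal sketch of that check follows, assuming one sample per minute is available from your monitoring system. `sustained_breach` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: succeed (exit 0) when the last <window> samples on stdin all exceed <threshold>.
# Usage: sustained_breach <threshold> <window>
# sustained_breach is an illustrative helper name; one sample per minute is assumed.
sustained_breach() {
  threshold=$1
  window=$2
  tail -n "$window" | awk -v th="$threshold" -v w="$window" '
    ($1 + 0) > (th + 0) { breaches++ }
    END {
      # Require a full window of samples, every one above the threshold.
      if (NR >= w + 0 && breaches == NR) exit 0
      exit 1
    }'
}

# Example usage (error_rates.txt holds one error-rate percentage per minute):
#   sustained_breach 1 5 < error_rates.txt && echo "Rollback criteria met"
```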
Immediate Rollback (< 2 minutes)
# Get previous revision (assumes the list is newest first; confirm the name before switching traffic)
PREVIOUS_REVISION=$(gcloud run revisions list \
--service coditect-app-production \
--platform managed \
--region $REGION \
--format="value(name)" \
--limit=2 | tail -1)
echo "Rolling back to: $PREVIOUS_REVISION"
# Immediate traffic switch to previous revision
gcloud run services update-traffic coditect-app-production \
--platform managed \
--region $REGION \
--to-revisions $PREVIOUS_REVISION=100
# Verify rollback
curl -f https://api.coditect.ai/health | jq .version
Database Rollback
# List available backups
gcloud sql backups list --instance=coditect-postgres-production
# Restore specific backup (overwrites the current data on the target instance)
gcloud sql backups restore BACKUP_ID \
--restore-instance=coditect-postgres-production \
--backup-instance=coditect-postgres-production
# OR restore to new instance (safer)
gcloud sql instances clone coditect-postgres-production \
coditect-postgres-rollback \
--point-in-time '2025-12-22T10:00:00.000Z'
Post-Rollback Actions
- Notify stakeholders: incident report
- Investigate root cause: review logs and metrics
- Create bug report: document the issue
- Schedule fix: plan remediation
- Update runbooks: improve procedures
Troubleshooting
Issue 1: Deployment Fails
Symptom: gcloud run deploy fails with error
Diagnosis:
# Check deployment logs
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=coditect-app-production" \
--limit 50 \
--format json
# Check image accessibility
gcloud container images describe gcr.io/$PROJECT_ID/coditect:$VERSION
Solution:
- Verify image exists and is accessible
- Check service account permissions
- Verify environment variables and secrets
- Review resource limits (CPU, memory)
Issue 2: High Error Rate After Deployment
Symptom: 5xx errors spike after deployment
Diagnosis:
# Check error logs
gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" \
--limit 100 \
--format json | jq -r '.[] | .jsonPayload.message'
# Check database connections
psql "postgresql://coditect_admin@/coditect?host=/cloudsql/$PROJECT_ID:$REGION:coditect-postgres-production" -c "SELECT count(*) FROM pg_stat_activity;"
Solution:
- Rollback immediately if error rate > 1%
- Check database connectivity
- Verify environment variables
- Review recent code changes
Issue 3: Slow Performance After Deployment
Symptom: Request latency increased significantly
Diagnosis:
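# Check CPU usage via the Cloud Monitoring API (last 10 minutes; date -d is GNU date syntax)
curl -sS -G "https://monitoring.googleapis.com/v3/projects/$PROJECT_ID/timeSeries" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
--data-urlencode 'filter=metric.type="run.googleapis.com/container/cpu/utilizations"' \
--data-urlencode "interval.startTime=$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
--data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
# Check database query performance (requires the pg_stat_statements extension; mean_exec_time is the PostgreSQL 13+ column name)
psql "postgresql://coditect_admin@/coditect?host=/cloudsql/$PROJECT_ID:$REGION:coditect-postgres-production" -c "SELECT * FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"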
Solution:
- Increase instance resources (CPU, memory)
- Optimize slow database queries
- Check for memory leaks
- Review application code changes
Related Documentation
- INFRASTRUCTURE.md - Infrastructure provisioning
- CI-CD-DEPLOYMENT-GUIDE.md - Automated deployment pipelines
- DOCKER-DEVELOPMENT-GUIDE.md - Container development
Document Status: Production Ready
Last Validation: December 22, 2025
Next Review: March 2026
On-Call: DevOps Team (Slack: #devops-oncall)