Production Deployment Guide

Target Audience: DevOps engineers, Site Reliability Engineers, Deployment teams
Purpose: Complete production deployment procedures
Reading Time: 20 minutes


Table of Contents

  1. Pre-Deployment Checklist
  2. Deployment Workflow
  3. Step-by-Step Deployment
  4. Post-Deployment Validation
  5. Health Checks
  6. Performance Verification
  7. Rollback Procedures
  8. Troubleshooting

Pre-Deployment Checklist

Critical Requirements

Complete all items before proceeding with deployment.

Infrastructure Readiness

  • Infrastructure provisioned and tested (see INFRASTRUCTURE.md)
  • Database instances running and accessible
  • Network configuration verified (VPC, subnets, firewall)
  • Load balancer configured and health-checked
  • DNS records created and propagated
  • SSL/TLS certificates installed and validated
  • Cloud storage buckets created with proper IAM
  • CDN configured (if applicable)

Application Readiness

  • Docker image built and pushed to registry
  • Image tagged with version and latest
  • Environment variables configured
  • Secrets stored in Secret Manager
  • Database migrations tested in staging
  • Application configuration validated
  • Feature flags configured (if using)
  • API keys and credentials verified

Security Checklist

  • Security scan passed (container vulnerabilities)
  • Dependency audit completed (no critical CVEs)
  • Firewall rules reviewed and approved
  • IAM roles and permissions audited
  • Secrets rotation scheduled
  • HTTPS enforced (HTTP → HTTPS redirect)
  • CORS policies configured
  • Rate limiting configured

Testing & Quality

  • All tests passing in CI/CD
  • Integration tests passed in staging
  • Load testing completed
  • Performance benchmarks met
  • Staging deployment validated
  • Smoke tests prepared
  • Rollback plan documented

Operations Readiness

  • Monitoring dashboards configured
  • Alerting rules created and tested
  • On-call schedule confirmed
  • Incident response plan reviewed
  • Runbooks updated
  • Deployment window scheduled
  • Stakeholders notified
  • Communication channels ready

Documentation

  • Architecture diagrams updated
  • API documentation current
  • Deployment procedures documented
  • Rollback procedures documented
  • Known issues documented
  • Change log updated
  • Release notes prepared

Deployment Workflow

Deployment Stages

1. Pre-Deployment Validation

2. Database Migration (if required)

3. Blue-Green Deployment Preparation

4. Application Deployment

5. Health Checks

6. Traffic Switch (gradual or full)

7. Post-Deployment Validation

8. Monitoring & Observation

Deployment Timeline (Typical)

| Stage | Duration | Critical? |
|---|---|---|
| Pre-validation | 15 min | Yes |
| Database migration | 5-30 min | Yes |
| Application deployment | 10 min | Yes |
| Health checks | 5 min | Yes |
| Traffic switch | 5 min | Yes |
| Post-validation | 15 min | Yes |
| Total | 55-80 min | |

Step-by-Step Deployment

Phase 1: Pre-Deployment Validation (15 minutes)

1.1 Verify Prerequisites

# Set environment variables
export ENVIRONMENT=production
export VERSION=v1.2.3
export PROJECT_ID=my-gcp-project
export REGION=us-central1

# Verify authentication
gcloud auth list
gcloud config set project $PROJECT_ID

# Check current deployment
gcloud run services list --platform managed --region $REGION

# Verify database connectivity
gcloud sql instances list
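
The variable setup above can be guarded by a small preflight check before any gcloud command runs. The following is a minimal sketch: `require_vars` and `validate_version` are hypothetical helper names, and the `vMAJOR.MINOR.PATCH` pattern is an assumption based on the `v1.2.3` example in this guide.

```shell
# Hypothetical preflight helpers; not part of gcloud or any existing tooling.

# Return 0 only if every named variable is set and non-empty.
require_vars() {
  _rv_missing=0
  for _rv_name in "$@"; do
    eval "_rv_value=\${$_rv_name:-}"
    if [ -z "$_rv_value" ]; then
      echo "Missing required variable: $_rv_name" >&2
      _rv_missing=1
    fi
  done
  return "$_rv_missing"
}

# Return 0 if the argument looks like v1.2.3 (assumed version scheme).
validate_version() {
  printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}
```

A possible call site: `require_vars ENVIRONMENT VERSION PROJECT_ID REGION && validate_version "$VERSION" || exit 1`.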

1.2 Validate Docker Image

# Pull production image
docker pull gcr.io/$PROJECT_ID/coditect:$VERSION

# Verify image
docker inspect gcr.io/$PROJECT_ID/coditect:$VERSION

# Check image size and layers
docker images gcr.io/$PROJECT_ID/coditect:$VERSION

# Run security scan (On-Demand Scanning; requires the ondemandscanning API to be enabled)
gcloud artifacts docker images scan gcr.io/$PROJECT_ID/coditect:$VERSION --remote

1.3 Backup Current State

# Backup database
gcloud sql backups create \
--instance=coditect-postgres-production \
--description="Pre-deployment backup $VERSION"

# Export current configuration
gcloud run services describe coditect-app-production \
--platform managed \
--region $REGION \
--format yaml > backup-config-$(date +%Y%m%d-%H%M%S).yaml

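# Tag current image as rollback (assumes latest still points at the deployed image).
# Pull first so the image exists locally, and compute the timestamp once
# so the tag and push commands use the same name.
ROLLBACK_TAG=rollback-$(date +%Y%m%d-%H%M%S)
docker pull gcr.io/$PROJECT_ID/coditect:latest
docker tag gcr.io/$PROJECT_ID/coditect:latest \
gcr.io/$PROJECT_ID/coditect:$ROLLBACK_TAG
docker push gcr.io/$PROJECT_ID/coditect:$ROLLBACK_TAG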

Phase 2: Database Migration (5-30 minutes)

2.1 Pre-Migration Validation

# Connect to database (opens an interactive psql session)
gcloud sql connect coditect-postgres-production --user=postgres

-- Run the following statements inside the psql session.
-- Verify schema version
SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 1;

-- Check database size
SELECT pg_size_pretty(pg_database_size('coditect'));

-- Verify active connections
SELECT count(*) FROM pg_stat_activity;

2.2 Run Migrations

# Download migration scripts (VERSION already includes the leading "v")
mkdir -p migrations
gsutil cp "gs://$PROJECT_ID-migrations/$VERSION/*.sql" ./migrations/

# Apply migrations (with transaction wrapper; ON_ERROR_STOP aborts at the first error)
psql -v ON_ERROR_STOP=1 "postgresql://coditect_admin@/coditect?host=/cloudsql/$PROJECT_ID:$REGION:coditect-postgres-production" <<EOF
BEGIN;

-- Run migration
\i migrations/001_add_user_preferences.sql

-- Verify changes
SELECT COUNT(*) FROM user_preferences;

-- Commit if successful
COMMIT;
EOF
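
The example above applies a single hard-coded file. When a release ships several migration files, they can be applied in order with a loop. This is a minimal sketch under stated assumptions: file names carry zero-padded numeric prefixes (as in `001_add_user_preferences.sql`), and `list_migrations`, `apply_migrations`, and the `PSQL_CONN` variable are hypothetical names introduced for illustration.

```shell
# Print the .sql files in a directory in lexical order, one per line.
# Assumes zero-padded numeric prefixes (001_, 002_, ...) so lexical order
# matches the intended execution order.
list_migrations() {
  for _lm_file in "$1"/*.sql; do
    [ -e "$_lm_file" ] && printf '%s\n' "$_lm_file"
  done | sort
}

# Apply each file in its own transaction and stop at the first failure.
# PSQL_CONN is a hypothetical variable holding the connection string.
apply_migrations() {
  list_migrations "$1" | while IFS= read -r _am_file; do
    echo "Applying $_am_file"
    psql "$PSQL_CONN" -v ON_ERROR_STOP=1 --single-transaction -f "$_am_file" || exit 1
  done
}
```

With this sketch, `apply_migrations ./migrations` would run every downloaded file and return non-zero as soon as one fails, leaving later files unapplied.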

2.3 Post-Migration Validation

# Verify schema changes
psql "postgresql://coditect_admin@/coditect?host=/cloudsql/$PROJECT_ID:$REGION:coditect-postgres-production" <<EOF
-- List tables
\dt

-- Verify constraints
SELECT conname, contype FROM pg_constraint WHERE conrelid = 'user_preferences'::regclass;

-- Check indexes
\di user_preferences*
EOF

Phase 3: Application Deployment (10 minutes)

3.1 Deploy New Version (Cloud Run)

# Deploy new revision (no traffic initially)
gcloud run deploy coditect-app-production \
--image gcr.io/$PROJECT_ID/coditect:$VERSION \
--platform managed \
--region $REGION \
--no-traffic \
--tag v$(echo $VERSION | tr . -) \
--set-env-vars "VERSION=$VERSION,ENVIRONMENT=production" \
--set-secrets "DATABASE_URL=database-url:latest,API_KEY=api-key:latest" \
--cpu 2 \
--memory 2Gi \
--min-instances 2 \
--max-instances 10 \
--concurrency 80 \
--timeout 300s

# Get revision name (the revision created by the deploy above)
NEW_REVISION=$(gcloud run services describe coditect-app-production \
--platform managed \
--region $REGION \
--format="value(status.latestCreatedRevisionName)")

echo "New revision: $NEW_REVISION"

3.2 Deploy New Version (AWS ECS)

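# Register new task definition and capture its ARN
# (ECS revisions are integers, so coditect-app:$VERSION would not resolve)
TASK_DEF_ARN=$(aws ecs register-task-definition \
--cli-input-json file://task-definition-$VERSION.json \
--query 'taskDefinition.taskDefinitionArn' \
--output text)

# Update service (rolling update: new tasks receive traffic once they pass health checks)
aws ecs update-service \
--cluster coditect-production \
--service coditect-app \
--task-definition "$TASK_DEF_ARN" \
--desired-count 2 \
--deployment-configuration "maximumPercent=200,minimumHealthyPercent=100"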

# Wait for deployment
aws ecs wait services-stable \
--cluster coditect-production \
--services coditect-app

Phase 4: Health Checks (5 minutes)

4.1 Application Health Check

# Get preview URL (Cloud Run); the tag must match the --tag value used at deploy time
TAG="v$(echo $VERSION | tr . -)"
PREVIEW_URL=$(gcloud run services describe coditect-app-production \
--platform managed \
--region $REGION \
--format=json | jq -r --arg tag "$TAG" '.status.traffic[] | select(.tag == $tag) | .url')

echo "Preview URL: $PREVIEW_URL"

# Test health endpoint
curl -f $PREVIEW_URL/health

# Expected response:
# {"status":"ok","version":"v1.2.3","timestamp":"2025-12-22T10:30:00Z"}
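
A freshly deployed revision may need a few seconds before it answers, so a single `curl` can fail spuriously. The sketch below wraps any command in a bounded retry loop. The helper name `retry` and the attempt and delay values are assumptions chosen for illustration.

```shell
# Run a command until it succeeds, up to max_attempts times,
# sleeping `delay` seconds between attempts.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  _rt_max=$1
  _rt_delay=$2
  shift 2
  _rt_attempt=1
  while ! "$@"; do
    if [ "$_rt_attempt" -ge "$_rt_max" ]; then
      echo "Giving up after $_rt_attempt attempts: $*" >&2
      return 1
    fi
    _rt_attempt=$((_rt_attempt + 1))
    sleep "$_rt_delay"
  done
}
```

For example, `retry 10 6 curl -fsS "$PREVIEW_URL/health"` would poll the health endpoint for up to roughly one minute before reporting failure.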

4.2 Smoke Tests

# Run smoke tests against preview URL
cat > smoke-tests.sh <<'EOF'
#!/bin/bash
# pipefail makes a failed curl fail the whole pipeline, not just jq
set -euo pipefail

BASE_URL=${1:?Usage: smoke-tests.sh BASE_URL}
# API_KEY must be exported by the caller for Test 4
: "${API_KEY:?API_KEY is not set}"

echo "Running smoke tests against $BASE_URL"

# Test 1: Health check
echo "Test 1: Health check"
curl -fsS "$BASE_URL/health" | jq -e '.status == "ok"'

# Test 2: API endpoint
echo "Test 2: API endpoint"
curl -fsS "$BASE_URL/api/v1/status" | jq -e '.online == true'

# Test 3: Database connectivity
echo "Test 3: Database connectivity"
curl -fsS "$BASE_URL/api/v1/db/ping" | jq -e '.connected == true'

# Test 4: Authentication
echo "Test 4: Authentication"
curl -fsS -H "Authorization: Bearer $API_KEY" \
"$BASE_URL/api/v1/me" | jq -e '.user.id != null'

echo "All smoke tests passed!"
EOF

chmod +x smoke-tests.sh
./smoke-tests.sh $PREVIEW_URL

4.3 Dependency Checks

# Verify external dependencies
cat > dependency-checks.sh <<'EOF'
#!/bin/bash

echo "Checking external dependencies..."

# Check database
pg_isready -h $DB_HOST -p 5432 || exit 1

# Check Redis (if used)
redis-cli -h $REDIS_HOST ping | grep PONG || exit 1

# Check S3/GCS
gsutil ls gs://$BUCKET_NAME || exit 1

# Check external APIs
curl -f https://api.external-service.com/health || exit 1

echo "All dependencies healthy!"
EOF

chmod +x dependency-checks.sh
./dependency-checks.sh
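
The script above exits at the first failing dependency, which hides any further failures. An alternative is to run every check and report all results at the end. This is a minimal sketch: `run_check` and the `failures` counter are hypothetical names, not part of the script above.

```shell
# Number of failed checks recorded so far.
failures=0

# Run one named check; record a failure instead of exiting immediately.
# Usage: run_check NAME COMMAND [ARGS...]
run_check() {
  _rc_name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $_rc_name"
  else
    echo "FAIL $_rc_name"
    failures=$((failures + 1))
  fi
}
```

A possible use is `run_check database pg_isready -h "$DB_HOST" -p 5432`, followed by one call per dependency, then `[ "$failures" -eq 0 ] || exit 1` once all checks have run.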

Phase 5: Traffic Switch (5 minutes)

5.1 Gradual Traffic Migration (Canary)

# Send 10% traffic to new revision
gcloud run services update-traffic coditect-app-production \
--platform managed \
--region $REGION \
--to-revisions $NEW_REVISION=10

echo "Observing metrics for 5 minutes..."
sleep 300

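# Check error rates for the new revision during the observation window
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=coditect-app-production AND resource.labels.revision_name=$NEW_REVISION AND severity>=ERROR" \
--freshness 5m \
--limit 50 \
--format json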

# If healthy, increase to 50%
gcloud run services update-traffic coditect-app-production \
--platform managed \
--region $REGION \
--to-revisions $NEW_REVISION=50

echo "Observing metrics for 5 minutes..."
sleep 300

# If still healthy, switch to 100%
gcloud run services update-traffic coditect-app-production \
--platform managed \
--region $REGION \
--to-latest
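
The "if healthy" decisions in the canary steps can be made explicit with a numeric gate. The sketch below assumes the request and error counts for the observation window are already available (for example from monitoring) and reuses the 1% error-rate limit from the rollback criteria in this guide. `error_rate_pct` and `is_healthy` are hypothetical helper names.

```shell
# Percentage of failed requests, to two decimals.
# Prints 0.00 when there was no traffic at all.
# Usage: error_rate_pct ERROR_COUNT TOTAL_COUNT
error_rate_pct() {
  awk -v e="$1" -v t="$2" 'BEGIN {
    if (t == 0) printf "0.00\n"
    else printf "%.2f\n", e / t * 100
  }'
}

# Return 0 (healthy) when the rate is strictly below the limit.
# Usage: is_healthy RATE_PERCENT LIMIT_PERCENT
is_healthy() {
  awk -v r="$1" -v max="$2" 'BEGIN { exit !(r < max) }'
}
```

For instance, `is_healthy "$(error_rate_pct 5 1000)" 1` succeeds (0.50% is below 1%), so the rollout would proceed to the next traffic step.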

5.2 Blue-Green Deployment (Full Switch)

# Immediate full switch (use cautiously)
gcloud run services update-traffic coditect-app-production \
--platform managed \
--region $REGION \
--to-latest

# Monitor error rates closely
watch -n 5 'gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" --limit 10 --format json | jq -r ".[] | .textPayload"'

Phase 6: Post-Deployment Validation (15 minutes)

6.1 Functional Testing

# Full integration tests
cat > integration-tests.sh <<'EOF'
#!/bin/bash
set -euo pipefail

BASE_URL=https://api.coditect.ai

echo "Running integration tests..."

# curl -f fails on HTTP errors instead of passing an error body to jq.
# Note: the test user created here is not removed by this script.

# User registration
USER_ID=$(curl -fsS -X POST "$BASE_URL/api/v1/users" \
-H "Content-Type: application/json" \
-d '{"email":"test@example.com","name":"Test User"}' \
| jq -r '.id')

# User login
TOKEN=$(curl -fsS -X POST "$BASE_URL/api/v1/auth/login" \
-H "Content-Type: application/json" \
-d '{"email":"test@example.com","password":"test123"}' \
| jq -r '.token')

# Create resource
RESOURCE_ID=$(curl -fsS -X POST "$BASE_URL/api/v1/resources" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"name":"Test Resource"}' \
| jq -r '.id')

# Verify resource
curl -fsS "$BASE_URL/api/v1/resources/$RESOURCE_ID" \
-H "Authorization: Bearer $TOKEN"

# Cleanup
curl -fsS -X DELETE "$BASE_URL/api/v1/resources/$RESOURCE_ID" \
-H "Authorization: Bearer $TOKEN"

echo "Integration tests passed!"
EOF

chmod +x integration-tests.sh
./integration-tests.sh

6.2 Performance Verification

# Load test with Apache Bench
# TOKEN must be set in this shell; the token obtained inside integration-tests.sh is not exported
ab -n 1000 -c 10 -H "Authorization: Bearer $TOKEN" \
https://api.coditect.ai/api/v1/status

# Expected results:
# - Requests per second: > 100
# - Time per request: < 100ms (mean)
# - Failed requests: 0

# Load test with wrk
wrk -t4 -c100 -d30s --latency \
-H "Authorization: Bearer $TOKEN" \
https://api.coditect.ai/api/v1/status

# Expected results:
# - Latency 99th percentile: < 500ms
# - Requests/sec: > 500
# - Non-2xx responses: 0%
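
The expected results above can be checked automatically instead of by eye. The sketch below parses the "Requests per second" line that `ab` prints and compares it with a minimum. `ab_rps` and `exceeds_rps` are hypothetical helper names, and the 100 requests/second floor is taken from the expected results listed above.

```shell
# Extract the mean requests-per-second figure from `ab` output on stdin.
ab_rps() {
  awk '/^Requests per second:/ { print $4 }'
}

# Return 0 when the measured rate is strictly above the minimum.
# Usage: exceeds_rps MEASURED MINIMUM
exceeds_rps() {
  awk -v got="$1" -v min="$2" 'BEGIN { exit !(got > min) }'
}
```

One way to combine them: `RPS=$(ab -n 1000 -c 10 "$URL" | ab_rps)` followed by `exceeds_rps "$RPS" 100 || echo "Throughput below target"`, where `URL` is whichever endpoint is under test.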

6.3 Database Verification

# Verify database integrity
psql "postgresql://coditect_admin@/coditect?host=/cloudsql/$PROJECT_ID:$REGION:coditect-postgres-production" <<EOF
-- Check table sizes
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

-- Verify constraints
SELECT COUNT(*) FROM information_schema.table_constraints WHERE constraint_type = 'FOREIGN KEY';

-- Check replication lag (run on a read replica; returns NULL on the primary)
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS replication_lag_seconds;
EOF

Health Checks

Application Health Endpoints

/health - Basic Health Check

GET /health

Response:
{
  "status": "ok",
  "version": "v1.2.3",
  "timestamp": "2025-12-22T10:30:00Z"
}

/health/ready - Readiness Probe

GET /health/ready

Response:
{
  "status": "ready",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "storage": "ok"
  }
}

/health/live - Liveness Probe

GET /health/live

Response:
{
  "status": "alive",
  "uptime": 3600,
  "memory_usage": "450MB/2GB"
}

Monitoring Queries

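# Query window: last 10 minutes (GNU date syntax)
START=$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# Check request latency (Cloud Run) through the Cloud Monitoring API
curl -s -G "https://monitoring.googleapis.com/v3/projects/$PROJECT_ID/timeSeries" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
--data-urlencode 'filter=metric.type="run.googleapis.com/request_latencies"' \
--data-urlencode "interval.startTime=$START" \
--data-urlencode "interval.endTime=$END"

# Check error rate
curl -s -G "https://monitoring.googleapis.com/v3/projects/$PROJECT_ID/timeSeries" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
--data-urlencode 'filter=metric.type="run.googleapis.com/request_count" AND metric.labels.response_code_class="5xx"' \
--data-urlencode "interval.startTime=$START" \
--data-urlencode "interval.endTime=$END"

# Check configured instance limits (min/max scale)
gcloud run services describe coditect-app-production \
--platform managed \
--region $REGION \
--format=json | jq '.spec.template.metadata.annotations | {minScale: ."autoscaling.knative.dev/minScale", maxScale: ."autoscaling.knative.dev/maxScale"}'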

Performance Verification

Key Performance Indicators

| Metric | Target | Critical Threshold |
|---|---|---|
| Request latency (p50) | < 100ms | < 200ms |
| Request latency (p95) | < 500ms | < 1000ms |
| Request latency (p99) | < 1000ms | < 2000ms |
| Error rate | < 0.1% | < 1% |
| Availability | > 99.9% | > 99% |
| CPU usage | < 70% | < 90% |
| Memory usage | < 80% | < 95% |
| Database connections | < 80% of max | < 95% of max |
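
The two threshold columns can be read as a three-way classification: within target, between target and critical, or beyond critical. The sketch below shows that reading for "lower is better" metrics such as latency, error rate, CPU, and memory. The function name `classify_metric` and the labels `ok`, `warning`, and `critical` are assumptions introduced here; availability is "higher is better" and would need the comparisons inverted.

```shell
# Classify a "lower is better" metric against its two thresholds.
# Prints ok, warning, or critical.
# Usage: classify_metric VALUE TARGET CRITICAL   (same unit for all three)
classify_metric() {
  awk -v v="$1" -v target="$2" -v crit="$3" 'BEGIN {
    if (v < target) print "ok"
    else if (v < crit) print "warning"
    else print "critical"
  }'
}
```

For example, a p50 latency of 150 ms against the 100 ms target and 200 ms critical threshold gives `classify_metric 150 100 200`, which prints `warning`.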

Performance Testing Script

cat > performance-test.sh <<'EOF'
#!/bin/bash

BASE_URL=https://api.coditect.ai

echo "Running performance tests..."

# Latency test
echo "Testing latency..."
for i in {1..100}; do
curl -o /dev/null -s -w "%{time_total}\n" $BASE_URL/health
done | awk '{sum+=$1; count++} END {print "Average latency: " sum/count "s"}'

# Throughput test
echo "Testing throughput..."
ab -n 10000 -c 50 $BASE_URL/health | grep "Requests per second"

# Stress test
echo "Running stress test..."
wrk -t8 -c200 -d60s --latency $BASE_URL/api/v1/status

echo "Performance tests complete!"
EOF

chmod +x performance-test.sh
./performance-test.sh

Rollback Procedures

When to Rollback

Initiate rollback if any of these occur:

  • Error rate > 1% for 5 minutes
  • Request latency p99 > 2000ms for 5 minutes
  • Critical functionality broken
  • Database corruption detected
  • Security vulnerability discovered
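
The first two criteria are sustained conditions: a single bad sample should not trigger a rollback, but five minutes of continuous breach should. A minimal sketch of that rule follows, assuming one sample per minute is collected and fed in oldest first. `sustained_breach` is a hypothetical helper name.

```shell
# Return 0 when the last WINDOW samples on stdin (one number per line,
# oldest first) all exceed THRESHOLD. With fewer than WINDOW samples
# the condition is never met.
# Usage: sustained_breach THRESHOLD WINDOW < samples
sustained_breach() {
  awk -v threshold="$1" -v window="$2" '
    { samples[NR] = $1 + 0 }
    END {
      if (NR < window) exit 1
      for (i = NR - window + 1; i <= NR; i++)
        if (!(samples[i] > threshold + 0)) exit 1
      exit 0
    }'
}
```

With per-minute error-rate percentages in a file, `sustained_breach 1 5 < error_rates.txt` would succeed only when each of the last five readings is above 1%. The same helper covers the p99 latency rule with `sustained_breach 2000 5`.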

Immediate Rollback (< 2 minutes)

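# Get previous revision (sort newest first; the second entry precedes the current deploy)
PREVIOUS_REVISION=$(gcloud run revisions list \
--service coditect-app-production \
--platform managed \
--region $REGION \
--sort-by="~metadata.creationTimestamp" \
--format="value(name)" \
--limit=2 | tail -1)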

echo "Rolling back to: $PREVIOUS_REVISION"

# Immediate traffic switch to previous revision
gcloud run services update-traffic coditect-app-production \
--platform managed \
--region $REGION \
--to-revisions $PREVIOUS_REVISION=100

# Verify rollback
curl -f https://api.coditect.ai/health | jq .version

Database Rollback

# List available backups
gcloud sql backups list --instance=coditect-postgres-production

# Restore specific backup (overwrites the data on the target instance)
gcloud sql backups restore BACKUP_ID \
--restore-instance=coditect-postgres-production \
--backup-instance=coditect-postgres-production

# OR restore to new instance (safer)
gcloud sql instances clone coditect-postgres-production \
coditect-postgres-rollback \
--point-in-time '2025-12-22T10:00:00.000Z'

Post-Rollback Actions

  1. Notify stakeholders - Incident report
  2. Investigate root cause - Review logs and metrics
  3. Create bug report - Document issue
  4. Schedule fix - Plan remediation
  5. Update runbooks - Improve procedures

Troubleshooting

Issue 1: Deployment Fails

Symptom: gcloud run deploy fails with error

Diagnosis:

# Check deployment logs
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=coditect-app-production" \
--limit 50 \
--format json

# Check image accessibility
gcloud container images describe gcr.io/$PROJECT_ID/coditect:$VERSION

Solution:

  • Verify image exists and is accessible
  • Check service account permissions
  • Verify environment variables and secrets
  • Review resource limits (CPU, memory)

Issue 2: High Error Rate After Deployment

Symptom: 5xx errors spike after deployment

Diagnosis:

# Check error logs
gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" \
--limit 100 \
--format json | jq -r '.[] | .jsonPayload.message'

# Check database connections
psql "postgresql://coditect_admin@/coditect?host=/cloudsql/$PROJECT_ID:$REGION:coditect-postgres-production" -c "SELECT count(*) FROM pg_stat_activity;"

Solution:

  • Rollback immediately if error rate > 1%
  • Check database connectivity
  • Verify environment variables
  • Review recent code changes

Issue 3: Slow Performance After Deployment

Symptom: Request latency increased significantly

Diagnosis:

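# Check CPU utilization over the last 10 minutes (GNU date syntax);
# use container/memory/utilizations for memory
curl -s -G "https://monitoring.googleapis.com/v3/projects/$PROJECT_ID/timeSeries" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
--data-urlencode 'filter=metric.type="run.googleapis.com/container/cpu/utilizations"' \
--data-urlencode "interval.startTime=$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
--data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Check database query performance (requires the pg_stat_statements extension;
# the column is named mean_time before PostgreSQL 13)
psql "postgresql://coditect_admin@/coditect?host=/cloudsql/$PROJECT_ID:$REGION:coditect-postgres-production" -c "SELECT * FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"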

Solution:

  • Increase instance resources (CPU, memory)
  • Optimize slow database queries
  • Check for memory leaks
  • Review application code changes


Document Status: Production Ready
Last Validation: December 22, 2025
Next Review: March 2026
On-Call: DevOps Team (Slack: #devops-oncall)