SUP-02: GKE Deployment Architecture - CI/CD and Production Deployment
Document Type: Supplementary Diagram
Scope: Complete deployment workflow from code commit to production
Technology: GitHub Actions, Docker, Google Container Registry, GKE, Kubernetes
Status: Specification Complete - Ready for Implementation
Last Updated: November 30, 2025
Table of Contents
- Overview
- Deployment Architecture Diagram
- CI/CD Pipeline
- Docker Image Build
- Kubernetes Deployment Strategy
- Zero-Downtime Deployment
- Deployment Verification
- Rollback Procedures
- Infrastructure as Code Deployment
- Production Deployment Checklist
Overview
Purpose
This document specifies the complete deployment architecture for the CODITECT License Management Platform, covering:
- CI/CD pipeline automation (GitHub Actions)
- Docker image build and registry (Google Container Registry)
- Kubernetes deployment workflow
- Zero-downtime rolling updates
- Deployment verification and health checks
- Rollback procedures
- Infrastructure as Code (OpenTofu) deployment
Deployment Philosophy
Key Principles:
- Automated: No manual steps from commit to deployment, apart from the production approval gate
- Safe: Zero-downtime deployments with automatic rollback
- Verified: Automated testing and health checks at every stage
- Traceable: Complete audit trail of all deployments
- Reversible: One-command rollback to previous version
Deployment Frequency:
- Development: Continuous deployment on every commit to main
- Staging: Automatic deployment after dev tests pass
- Production: Manual approval required, deployments during business hours
Deployment Architecture Diagram
Complete Deployment Workflow
CI/CD Pipeline
GitHub Actions Workflow
File: .github/workflows/deploy-production.yml
name: Deploy to Production

on:
  push:
    branches:
      - main
    paths:
      - 'app/**'
      - 'config/**'
      - 'Dockerfile'
      - 'requirements.txt'
      - 'kubernetes/**'
  workflow_dispatch: # Manual trigger

env:
  PROJECT_ID: coditect-cloud-infra
  GKE_CLUSTER: production-gke-cluster
  GKE_ZONE: us-central1-a
  IMAGE_NAME: license-api
  DEPLOYMENT_NAME: license-api

jobs:
  test:
    name: Run Tests
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install pytest pytest-cov ruff black
      - name: Run linting
        run: |
          ruff check app/
          black --check app/
      - name: Run tests
        run: |
          pytest tests/ \
            --cov=app \
            --cov-report=xml \
            --cov-report=term \
            --junitxml=junit.xml
      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage.xml

  build-and-push:
    name: Build and Push Docker Image
    runs-on: ubuntu-latest
    needs: test
    outputs:
      image-tag: ${{ steps.generate-tag.outputs.tag }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Generate image tag
        id: generate-tag
        run: |
          # Tag format: v1.0.<run number>-<first 7 characters of the commit SHA>
          SHORT_SHA=$(echo "${GITHUB_SHA}" | cut -c1-7)
          TAG="v1.0.${GITHUB_RUN_NUMBER}-${SHORT_SHA}"
          echo "tag=${TAG}" >> "$GITHUB_OUTPUT"
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v1
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - name: Configure Docker for GCR
        run: |
          gcloud auth configure-docker
      - name: Build and push Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: |
            gcr.io/${{ env.PROJECT_ID }}/${{ env.IMAGE_NAME }}:${{ steps.generate-tag.outputs.tag }}
            gcr.io/${{ env.PROJECT_ID }}/${{ env.IMAGE_NAME }}:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max
      # The artifact path must point to a file, so write the tag to disk first
      - name: Write image tag to file
        run: echo "${{ steps.generate-tag.outputs.tag }}" > image-tag.txt
      - name: Upload image tag artifact
        uses: actions/upload-artifact@v4
        with:
          name: image-tag
          path: image-tag.txt

  deploy:
    name: Deploy to GKE
    runs-on: ubuntu-latest
    needs: build-and-push
    environment:
      name: production
      url: https://api.coditect.com
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v1
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - name: Get GKE credentials
        uses: google-github-actions/get-gke-credentials@v1
        with:
          cluster_name: ${{ env.GKE_CLUSTER }}
          location: ${{ env.GKE_ZONE }}
      - name: Update deployment image
        run: |
          IMAGE_TAG="${{ needs.build-and-push.outputs.image-tag }}"
          kubectl set image deployment/${{ env.DEPLOYMENT_NAME }} \
            django=gcr.io/${{ env.PROJECT_ID }}/${{ env.IMAGE_NAME }}:${IMAGE_TAG} \
            --namespace=coditect
          # Record the change cause for `kubectl rollout history`
          # (the --record flag is deprecated)
          kubectl annotate deployment/${{ env.DEPLOYMENT_NAME }} \
            kubernetes.io/change-cause="Deploy ${IMAGE_TAG}" \
            --namespace=coditect \
            --overwrite
      - name: Wait for rollout to complete
        run: |
          kubectl rollout status deployment/${{ env.DEPLOYMENT_NAME }} \
            --namespace=coditect \
            --timeout=10m
      - name: Verify deployment
        run: |
          kubectl get pods -n coditect -l app=license-api
          kubectl get deployment -n coditect ${{ env.DEPLOYMENT_NAME }}
      - name: Run smoke tests
        run: |
          # Wait for service to be ready
          sleep 30
          # Check health endpoint
          curl -f https://api.coditect.com/health/ready || exit 1
          # Check metrics endpoint
          curl -f https://api.coditect.com/metrics || exit 1
      - name: Notify Slack
        if: always()
        uses: 8398a7/action-slack@v3
        with:
          status: ${{ job.status }}
          text: |
            Deployment to production: ${{ job.status }}
            Image: gcr.io/${{ env.PROJECT_ID }}/${{ env.IMAGE_NAME }}:${{ needs.build-and-push.outputs.image-tag }}
            Deployed by: ${{ github.actor }}
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}

  rollback-on-failure:
    name: Rollback on Failure
    runs-on: ubuntu-latest
    needs: deploy
    # Run only when the deploy job itself failed. A bare failure() would also
    # fire when the test or build job fails and would undo a healthy release.
    if: ${{ always() && needs.deploy.result == 'failure' }}
    steps:
      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v1
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - name: Get GKE credentials
        uses: google-github-actions/get-gke-credentials@v1
        with:
          cluster_name: ${{ env.GKE_CLUSTER }}
          location: ${{ env.GKE_ZONE }}
      - name: Rollback deployment
        run: |
          kubectl rollout undo deployment/${{ env.DEPLOYMENT_NAME }} \
            --namespace=coditect
          kubectl rollout status deployment/${{ env.DEPLOYMENT_NAME }} \
            --namespace=coditect \
            --timeout=5m
      - name: Notify Slack of rollback
        uses: 8398a7/action-slack@v3
        with:
          status: failure
          text: |
            ⚠️ ROLLBACK EXECUTED
            Deployment failed, automatically rolled back to previous version.
            Failed deployment: ${{ github.sha }}
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
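The tag scheme from the "Generate image tag" step can be reproduced outside the workflow, for example to find the image that belongs to a given run. The following is a minimal sketch; `build_image_tag` and `image_reference` are hypothetical helper names, and the `v1.0.` prefix is hard-coded only because the workflow hard-codes it.

```python
def build_image_tag(run_number: int, commit_sha: str) -> str:
    """Mirror the workflow's tag scheme: v1.0.<run number>-<7-char SHA>."""
    if len(commit_sha) < 7:
        raise ValueError("commit SHA must have at least 7 characters")
    return f"v1.0.{run_number}-{commit_sha[:7]}"


def image_reference(project_id: str, image_name: str, tag: str) -> str:
    """Build the full registry path that `kubectl set image` receives."""
    return f"gcr.io/{project_id}/{image_name}:{tag}"


if __name__ == "__main__":
    # Illustrative values only; the SHA below is made up.
    tag = build_image_tag(42, "0123456789abcdef0123456789abcdef01234567")
    print(image_reference("coditect-cloud-infra", "license-api", tag))
```

Because the run number increases with every workflow run and the SHA identifies the commit, each tag points at exactly one build. The `latest` tag moves with every push and is therefore less suitable for rollbacks.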
Docker Image Build
Multi-Stage Dockerfile
File: Dockerfile
# Build stage
FROM python:3.11-slim AS builder
WORKDIR /app
# Install build dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    postgresql-client \
    && rm -rf /var/lib/apt/lists/*
# Install Python dependencies (pip --user as root installs into /root/.local)
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# Production stage
FROM python:3.11-slim
WORKDIR /app
# Install runtime dependencies
RUN apt-get update && apt-get install -y \
    libpq5 \
    curl \
    && rm -rf /var/lib/apt/lists/*
# Create non-root user before copying files so they can be owned by it
RUN useradd -m -u 1000 django && chown django:django /app
# Copy Python dependencies from builder into the django user's home.
# /root is not readable by a non-root user, so /root/.local would be unusable.
COPY --from=builder --chown=django:django /root/.local /home/django/.local
# Copy application code
COPY --chown=django:django app/ ./app/
COPY --chown=django:django config/ ./config/
COPY --chown=django:django manage.py .
USER django
# Put user-installed scripts (such as gunicorn) on PATH
ENV PATH=/home/django/.local/bin:$PATH
ENV PYTHONUNBUFFERED=1
ENV DJANGO_SETTINGS_MODULE=config.settings.production
# Expose port
EXPOSE 8000
# Health check
HEALTHCHECK --interval=30s --timeout=5s --start-period=40s --retries=3 \
    CMD curl -f http://localhost:8000/health/ready || exit 1
# Run gunicorn
CMD ["gunicorn", "config.wsgi:application", \
     "--bind", "0.0.0.0:8000", \
     "--workers", "4", \
     "--threads", "2", \
     "--worker-class", "gthread", \
     "--worker-tmp-dir", "/dev/shm", \
     "--timeout", "60", \
     "--access-logfile", "-", \
     "--error-logfile", "-", \
     "--log-level", "info"]
Kubernetes Deployment Strategy
Rolling Update Configuration
Key Parameters:
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod during update (total 4)
      maxUnavailable: 0  # Always maintain 3 running pods (zero downtime)
# Update flow:
# 1. Create 1 new pod (v1.2.3) → Total: 3 old + 1 new = 4 pods
# 2. Wait for new pod to be Ready
# 3. Terminate 1 old pod (v1.2.2) → Total: 2 old + 1 new = 3 pods
# 4. Repeat until all old pods replaced
Timeline:
0:00 - Start: 3 pods running v1.2.2
0:05 - Create new pod #1 (v1.2.3) → Total 4 pods
0:15 - New pod #1 Ready → Terminate old pod #1 → Total 3 pods
0:20 - Create new pod #2 (v1.2.3) → Total 4 pods
0:30 - New pod #2 Ready → Terminate old pod #2 → Total 3 pods
0:35 - Create new pod #3 (v1.2.3) → Total 4 pods
0:45 - New pod #3 Ready → Terminate old pod #3 → Total 3 pods
0:50 - Complete: 3 pods running v1.2.3
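Why This Avoids Downtime:
- With maxUnavailable: 0, the Service keeps at least 3 Ready pods throughout the rollout
- The load balancer routes only to pods that pass the readiness probe
- Old pods drain in-flight connections within the termination grace period before they exit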
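The 3 → 4 → 3 pattern follows directly from the two strategy parameters. The sketch below simulates the pod counts step by step; `simulate_rolling_update` is a hypothetical helper, and it assumes every new pod becomes Ready before the next step, which a real rollout only achieves when the readiness probe passes.

```python
from typing import List, Tuple


def simulate_rolling_update(
    replicas: int, max_surge: int, max_unavailable: int
) -> List[Tuple[int, int]]:
    """Return (old_pods, new_pods) after each scale-up and scale-down step."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    old, new = replicas, 0
    history = [(old, new)]
    while old > 0 or new < replicas:
        # Scale up: total pods may not exceed replicas + maxSurge.
        create = min(replicas + max_surge - (old + new), replicas - new)
        new += create
        history.append((old, new))
        # Scale down: available pods may not drop below replicas - maxUnavailable.
        remove = min(old, (old + new) - (replicas - max_unavailable))
        old -= remove
        history.append((old, new))
    return history


if __name__ == "__main__":
    for old, new in simulate_rolling_update(replicas=3, max_surge=1, max_unavailable=0):
        print(f"old={old} new={new} total={old + new}")
```

With the values from this specification, the total never falls below 3 and never exceeds 4, which matches the timeline above.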
Zero-Downtime Deployment
Health Check Configuration
Kubernetes Probes:
# Startup probe (runs until it first succeeds; liveness and readiness
# probes are held back until then)
startupProbe:
  httpGet:
    path: /health/startup
    port: 8000
  initialDelaySeconds: 0
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 30  # Allow 150 seconds for startup (30 * 5s)
# Liveness probe (restart if fails)
livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3  # Restart after 3 consecutive failures
# Readiness probe (remove from load balancer if fails)
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3  # Remove from LB after 3 consecutive failures
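The time budgets implied by these probe settings are simple products of period and failure threshold. The sketch below makes the arithmetic explicit; the `Probe` class is a hypothetical helper, and the results are approximations because they ignore `timeoutSeconds` and scheduling jitter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int

    def failure_window_seconds(self) -> int:
        """Approximate duration of consecutive failures before Kubernetes acts."""
        return self.period_seconds * self.failure_threshold

    def startup_budget_seconds(self) -> int:
        """Approximate time a container has to pass this probe for the first time."""
        return self.initial_delay_seconds + self.failure_window_seconds()


# Values copied from the probe configuration above.
STARTUP = Probe(initial_delay_seconds=0, period_seconds=5, failure_threshold=30)
LIVENESS = Probe(initial_delay_seconds=30, period_seconds=10, failure_threshold=3)
READINESS = Probe(initial_delay_seconds=10, period_seconds=5, failure_threshold=3)

if __name__ == "__main__":
    print("startup budget (s):", STARTUP.startup_budget_seconds())
    print("restart after failing for about (s):", LIVENESS.failure_window_seconds())
    print("removed from LB after about (s):", READINESS.failure_window_seconds())
```

A pod that hangs is therefore restarted after roughly 30 seconds of failed liveness checks, while a pod that loses its database connection stops receiving traffic after roughly 15 seconds.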
Health Endpoint Implementation:
File: app/health/views.py
import logging

from django.db import connection
from django.http import JsonResponse
from django_redis import get_redis_connection

logger = logging.getLogger(__name__)


def startup_check(request):
    """
    Startup probe - Check if application has started
    Returns 200 if Django is running
    """
    return JsonResponse({'status': 'ok'})


def liveness_check(request):
    """
    Liveness probe - Check if application is alive
    Returns 200 if Django can process requests
    """
    return JsonResponse({'status': 'ok'})


def readiness_check(request):
    """
    Readiness probe - Check if application is ready to serve traffic
    Checks:
    - Database connectivity
    - Redis connectivity
    """
    checks = {}
    is_ready = True

    # Check database
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
        checks['database'] = 'ok'
    except Exception as e:
        logger.exception("Readiness check failed: database")
        checks['database'] = f'error: {e}'
        is_ready = False

    # Check Redis
    try:
        redis_client = get_redis_connection('default')
        redis_client.ping()
        checks['redis'] = 'ok'
    except Exception as e:
        logger.exception("Readiness check failed: redis")
        checks['redis'] = f'error: {e}'
        is_ready = False

    # 503 tells Kubernetes to stop routing traffic to this pod
    status_code = 200 if is_ready else 503
    return JsonResponse(
        {
            'status': 'ready' if is_ready else 'not_ready',
            'checks': checks
        },
        status=status_code
    )
Graceful Shutdown
Django Signal Handler:
# app/middleware/graceful_shutdown.py
import logging
import signal
import sys

logger = logging.getLogger(__name__)

shutdown_requested = False
_previous_handlers = {}


def handle_shutdown_signal(signum, frame):
    """
    Handle SIGTERM gracefully
    When Kubernetes terminates pod:
    1. SIGTERM sent to container
    2. 30-second grace period
    3. SIGKILL if not exited
    """
    global shutdown_requested
    shutdown_requested = True
    logger.info("Received signal %s, initiating graceful shutdown...", signum)

    # Delegate to the handler that was installed before this module was
    # imported (for example Gunicorn's), so in-flight requests can finish.
    previous = _previous_handlers.get(signum)
    if callable(previous):
        previous(signum, frame)
    else:
        sys.exit(0)


# Register signal handlers and remember the ones they replace
for _sig in (signal.SIGTERM, signal.SIGINT):
    _previous_handlers[_sig] = signal.signal(_sig, handle_shutdown_signal)
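A shutdown flag is only useful if long-running work consults it. The sketch below shows one way a standalone worker process (for example a management command, not a Gunicorn worker) could stop between jobs once shutdown is requested; `ShutdownFlag` and `process_until_shutdown` are hypothetical names introduced for illustration.

```python
import signal
import threading
from typing import Callable, Iterable, List


class ShutdownFlag:
    """Thread-safe flag that a signal handler sets and workers poll."""

    def __init__(self) -> None:
        self._event = threading.Event()

    def handle_signal(self, signum, frame) -> None:
        # Keep the handler tiny: only record that shutdown was requested.
        self._event.set()

    def install(self, signals=(signal.SIGTERM, signal.SIGINT)) -> None:
        # signal.signal() may only be called from the main thread.
        for sig in signals:
            signal.signal(sig, self.handle_signal)

    @property
    def requested(self) -> bool:
        return self._event.is_set()


def process_until_shutdown(
    jobs: Iterable[Callable[[], str]], flag: ShutdownFlag
) -> List[str]:
    """Run jobs in order; stop before the next job once shutdown is requested."""
    finished = []
    for job in jobs:
        if flag.requested:
            break
        finished.append(job())
    return finished


if __name__ == "__main__":
    flag = ShutdownFlag()
    flag.install()
    print(process_until_shutdown([lambda: "job-1", lambda: "job-2"], flag))
```

The job that is running when the signal arrives is allowed to finish, so each job should be short enough to complete within the 30-second grace period.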
Deployment Verification
Automated Smoke Tests
File: scripts/smoke-tests.sh
#!/bin/bash
# Exit on errors, unset variables, and failures anywhere in a pipeline
set -euo pipefail
API_URL="${API_URL:-https://api.coditect.com}"
# TEST_TOKEN must be provided by the caller (for example, as a CI secret)
: "${TEST_TOKEN:?TEST_TOKEN must be set}"
echo "Running smoke tests against $API_URL..."
# Test 1: Health endpoint
echo "Testing health endpoint..."
curl -f -s "$API_URL/health/ready" | jq -e '.status == "ready"' > /dev/null
# Test 2: Metrics endpoint
echo "Testing metrics endpoint..."
curl -f -s "$API_URL/metrics" | grep -q "http_requests_total"
# Test 3: License acquisition (with test credentials)
echo "Testing license acquisition endpoint..."
curl -f -s -X POST "$API_URL/api/v1/licenses/acquire" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TEST_TOKEN" \
  -d '{"license_key": "TEST-KEY", "hardware_id": "test-hw-id"}' \
  | jq -e '.session_id != null' > /dev/null
# Test 4: Heartbeat endpoint
# No -f here: an expected error arrives as a 4xx response, and its JSON
# body must still reach jq
echo "Testing heartbeat endpoint..."
curl -s -X POST "$API_URL/api/v1/licenses/heartbeat" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TEST_TOKEN" \
  -d '{"session_id": "test-session-id"}' \
  | jq -e '.success == true or .error != null' > /dev/null # Either success or expected error
echo "✅ All smoke tests passed!"
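The workflow waits a fixed 30 seconds before its first health request. Polling until the endpoint reports ready, with an upper time limit, is one alternative. The sketch below assumes that approach; `wait_until_ready` and `health_probe` are hypothetical helpers built on the standard library, and the timeout and interval values are illustrative.

```python
import json
import time
import urllib.request
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe() until it returns True or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # connection errors and HTTP 503 mean "not ready yet"
        if clock() >= deadline:
            return False
        sleep(interval_s)


def health_probe(url: str) -> Callable[[], bool]:
    """Build a probe that expects HTTP 200 and {"status": "ready"} from url."""
    def probe() -> bool:
        # urlopen raises HTTPError for 4xx/5xx, which wait_until_ready catches.
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200 and json.load(resp).get("status") == "ready"
    return probe


if __name__ == "__main__":
    # Offline demo with a fake probe that succeeds on the third attempt.
    attempts = iter([False, False, True])
    print(wait_until_ready(lambda: next(attempts), timeout_s=10, interval_s=0))
```

In a pipeline, the call would look like `wait_until_ready(health_probe("https://api.coditect.com/health/ready"))`, followed by the remaining smoke tests only when it returns `True`.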
Rollback Procedures
Manual Rollback
# Rollback to previous deployment
kubectl rollout undo deployment/license-api -n coditect
# Rollback to specific revision
kubectl rollout undo deployment/license-api -n coditect --to-revision=5
# Check rollout history
kubectl rollout history deployment/license-api -n coditect
# View specific revision details
kubectl rollout history deployment/license-api -n coditect --revision=5
# Pause rollout (stop at current state)
kubectl rollout pause deployment/license-api -n coditect
# Resume rollout
kubectl rollout resume deployment/license-api -n coditect
Automatic Rollback
Configured in GitHub Actions (see CI/CD Pipeline above)
Triggers automatic rollback on:
- Health check failures
- Smoke test failures
- Deployment timeout (10 minutes)
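The rollback decision can be summarized as: roll back only if an image change was applied and the rollout then failed to complete. The sketch below wraps the same kubectl commands used in the workflow; `deploy_with_rollback` is a hypothetical helper, the command runner is injectable so the logic can be tested without a cluster, and smoke-test failures are not covered here.

```python
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], int]


def _run(cmd: List[str]) -> int:
    """Execute a command and return its exit code."""
    return subprocess.run(cmd, check=False).returncode


def deploy_with_rollback(
    image: str,
    run: Runner = _run,
    deployment: str = "license-api",
    container: str = "django",
    namespace: str = "coditect",
) -> str:
    """Return 'deployed', 'rolled_back', or 'not_deployed'."""
    target = f"deployment/{deployment}"
    ns = f"--namespace={namespace}"
    if run(["kubectl", "set", "image", target, f"{container}={image}", ns]) != 0:
        return "not_deployed"  # nothing changed, so there is nothing to undo
    if run(["kubectl", "rollout", "status", target, ns, "--timeout=10m"]) == 0:
        return "deployed"
    # The rollout did not finish in time: revert and wait for the old version.
    run(["kubectl", "rollout", "undo", target, ns])
    run(["kubectl", "rollout", "status", target, ns, "--timeout=5m"])
    return "rolled_back"


if __name__ == "__main__":
    def dry_run(cmd: List[str]) -> int:
        print("would run:", " ".join(cmd))
        return 0

    print(deploy_with_rollback("gcr.io/coditect-cloud-infra/license-api:example", run=dry_run))
```

Distinguishing "not_deployed" from "rolled_back" avoids undoing a healthy revision when the failure happened before any change reached the cluster.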
Infrastructure as Code Deployment
OpenTofu Workflow
# 1. Navigate to environment
cd opentofu/environments/production
# 2. Initialize OpenTofu
tofu init
# 3. Validate configuration
tofu validate
# 4. Plan changes (review carefully!)
tofu plan -out=tfplan
# 5. Apply changes (PRODUCTION)
tofu apply tfplan
# 6. Verify deployment
gcloud container clusters list
gcloud sql instances list
kubectl get nodes
Infrastructure Update Strategy
Safe Infrastructure Changes:
- Test in dev environment first
- Plan and review - Always run tofu plan and review output
- Gradual rollout - Update dev → staging → production
- Monitor after changes - Watch Cloud Monitoring for 30+ minutes
- Have rollback plan - Know how to revert changes
High-Risk Changes (Require Extra Approval):
- Database instance changes (may cause downtime)
- GKE cluster version upgrades
- Network configuration changes
- Key deletion or rotation
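Part of the plan review can be automated by scanning the JSON form of the saved plan (`tofu show -json tfplan`) for the high-risk categories listed above. The sketch below is one possible approach; `high_risk_changes` is a hypothetical helper, and the set of resource types is an assumption about which Google provider resources this project uses.

```python
from typing import List

# Assumed mapping of the high-risk categories to Google provider resource types.
HIGH_RISK_TYPES = {
    "google_sql_database_instance",  # database instance changes
    "google_container_cluster",      # GKE cluster changes and upgrades
    "google_compute_network",        # network configuration
    "google_compute_subnetwork",
    "google_kms_crypto_key",         # key deletion or rotation
}


def high_risk_changes(plan: dict) -> List[str]:
    """List planned changes that touch a high-risk type or delete a resource."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if actions in (["no-op"], ["read"]):
            continue  # nothing will be modified
        if rc.get("type") in HIGH_RISK_TYPES or "delete" in actions:
            flagged.append(f"{rc.get('address')}: {'/'.join(actions)}")
    return flagged


if __name__ == "__main__":
    # Hand-written sample in the shape of the plan JSON; not real output.
    sample = {
        "resource_changes": [
            {"address": "google_sql_database_instance.main",
             "type": "google_sql_database_instance",
             "change": {"actions": ["update"]}},
            {"address": "google_storage_bucket.logs",
             "type": "google_storage_bucket",
             "change": {"actions": ["no-op"]}},
        ]
    }
    for line in high_risk_changes(sample):
        print("NEEDS EXTRA APPROVAL:", line)
```

A non-empty result could block `tofu apply` until the extra approval is recorded. The script supplements the manual review of the plan output and does not replace it.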
Production Deployment Checklist
Pre-Deployment
- All tests passing in CI
- Code reviewed and approved
- Database migrations tested in staging
- Secrets and configs updated
- Deployment scheduled during business hours
- On-call engineer notified
- Rollback plan documented
During Deployment
- Monitor deployment progress
- Watch pod status: kubectl get pods -n coditect -w
- Check logs for errors: kubectl logs -n coditect -l app=license-api --tail=100
- Monitor Cloud Monitoring dashboards
- Verify health checks passing
Post-Deployment
- Run smoke tests
- Verify metrics in Grafana
- Check error rates (should be <1%)
- Monitor for 30+ minutes
- Update deployment notes
- Notify team in Slack
If Issues Occur
- Check pod logs: kubectl logs -n coditect <pod-name>
- Check events: kubectl get events -n coditect --sort-by='.lastTimestamp'
- Rollback if errors persist: kubectl rollout undo deployment/license-api -n coditect
- Create incident report
- Schedule postmortem
Summary
This SUP-02 GKE Deployment Architecture specification provides:
✅ Complete CI/CD pipeline
- GitHub Actions workflow (test → build → deploy)
- Automated testing (linting, unit tests, coverage)
- Docker image build and push to GCR
- Kubernetes deployment automation
✅ Docker image build
- Multi-stage Dockerfile for optimization
- Security (non-root user)
- Health checks built-in
- Production-ready configuration
✅ Kubernetes deployment strategy
- Rolling update with zero downtime
- maxSurge: 1, maxUnavailable: 0
- Gradual pod replacement (3 → 4 → 3 pattern)
- Service continuity guaranteed
✅ Zero-downtime deployment
- Startup, liveness, readiness probes
- Graceful shutdown handling
- Connection draining
- Load balancer integration
✅ Deployment verification
- Automated smoke tests
- Health check validation
- Metrics verification
- 30-minute monitoring period
✅ Rollback procedures
- Automatic rollback on failure
- Manual rollback commands
- Revision history tracking
- Pause/resume capabilities
✅ Infrastructure as Code
- OpenTofu deployment workflow
- Safe change procedures
- Gradual rollout strategy
- Risk assessment guidelines
✅ Production deployment checklist
- Pre-deployment preparation
- During-deployment monitoring
- Post-deployment verification
- Incident response procedures
Implementation Status: Specification Complete
Next Steps:
- Create GitHub Actions workflows (Phase 2)
- Set up GCR repository (Phase 2)
- Test deployment in dev environment (Phase 2)
- Create smoke test suite (Phase 2)
- Document rollback procedures (Phase 2)
- Production deployment (Phase 3)
Current Status:
- GitHub Actions: ⏸️ Not configured
- GCR Repository: ⏸️ Not created
- Deployment Workflow: ⏸️ Not tested
Dependencies:
- GitHub repository with Actions enabled
- GCP service account with GKE/GCR permissions
- kubectl configured for GKE cluster
Deployment Frequency:
- Development: Continuous (every commit)
- Staging: Automatic (after dev tests)
- Production: Manual approval + business hours
Total Lines: 750+ (complete production-ready deployment architecture)
Author: CODITECT Infrastructure Team
Date: November 30, 2025
Version: 1.0
Status: Ready for Implementation