SUP-02: GKE Deployment Architecture - CI/CD and Production Deployment

Document Type: Supplementary Diagram
Scope: Complete deployment workflow from code commit to production
Technology: GitHub Actions, Docker, Google Container Registry, GKE, Kubernetes
Status: Specification Complete - Ready for Implementation
Last Updated: November 30, 2025


Table of Contents

  1. Overview
  2. Deployment Architecture Diagram
  3. CI/CD Pipeline
  4. Docker Image Build
  5. Kubernetes Deployment Strategy
  6. Zero-Downtime Deployment
  7. Deployment Verification
  8. Rollback Procedures
  9. Infrastructure as Code Deployment
  10. Production Deployment Checklist

Overview

Purpose

This document specifies the complete deployment architecture for the CODITECT License Management Platform, covering:

  • CI/CD pipeline automation (GitHub Actions)
  • Docker image build and registry (Google Container Registry)
  • Kubernetes deployment workflow
  • Zero-downtime rolling updates
  • Deployment verification and health checks
  • Rollback procedures
  • Infrastructure as Code (OpenTofu) deployment

Deployment Philosophy

Key Principles:

  • Automated: Zero manual steps from commit to production
  • Safe: Zero-downtime deployments with automatic rollback
  • Verified: Automated testing and health checks at every stage
  • Traceable: Complete audit trail of all deployments
  • Reversible: One-command rollback to previous version

Deployment Frequency:

  • Development: Continuous deployment on every commit to main
  • Staging: Automatic deployment after dev tests pass
  • Production: Manual approval required, deployments during business hours

Deployment Architecture Diagram

Complete Deployment Workflow


CI/CD Pipeline

GitHub Actions Workflow

File: .github/workflows/deploy-production.yml

name: Deploy to Production

on:
  push:
    branches:
      - main
    paths:
      - 'app/**'
      - 'config/**'
      - 'Dockerfile'
      - 'requirements.txt'
      - 'kubernetes/**'
  workflow_dispatch: # Manual trigger

env:
  PROJECT_ID: coditect-cloud-infra
  GKE_CLUSTER: production-gke-cluster
  GKE_ZONE: us-central1-a
  IMAGE_NAME: license-api
  DEPLOYMENT_NAME: license-api

jobs:
  test:
    name: Run Tests
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install pytest pytest-cov ruff black

      - name: Run linting
        run: |
          ruff check app/
          black --check app/

      - name: Run tests
        run: |
          pytest tests/ \
            --cov=app \
            --cov-report=xml \
            --cov-report=term \
            --junitxml=junit.xml

      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage.xml

  build-and-push:
    name: Build and Push Docker Image
    runs-on: ubuntu-latest
    needs: test
    outputs:
      image-tag: ${{ steps.generate-tag.outputs.tag }}

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Generate image tag
        id: generate-tag
        run: |
          SHORT_SHA=$(echo ${{ github.sha }} | cut -c1-7)
          TAG="v1.0.${GITHUB_RUN_NUMBER}-${SHORT_SHA}"
          echo "tag=${TAG}" >> $GITHUB_OUTPUT

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v1
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - name: Configure Docker for GCR
        run: |
          gcloud auth configure-docker

      - name: Build and push Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: |
            gcr.io/${{ env.PROJECT_ID }}/${{ env.IMAGE_NAME }}:${{ steps.generate-tag.outputs.tag }}
            gcr.io/${{ env.PROJECT_ID }}/${{ env.IMAGE_NAME }}:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Write image tag file
        run: echo "${{ steps.generate-tag.outputs.tag }}" > image-tag.txt

      - name: Upload image tag artifact
        uses: actions/upload-artifact@v3
        with:
          name: image-tag
          path: image-tag.txt

  deploy:
    name: Deploy to GKE
    runs-on: ubuntu-latest
    needs: build-and-push
    environment:
      name: production
      url: https://api.coditect.com

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v1
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - name: Get GKE credentials
        uses: google-github-actions/get-gke-credentials@v1
        with:
          cluster_name: ${{ env.GKE_CLUSTER }}
          location: ${{ env.GKE_ZONE }}

      - name: Update deployment image
        run: |
          IMAGE_TAG="${{ needs.build-and-push.outputs.image-tag }}"
          kubectl set image deployment/${{ env.DEPLOYMENT_NAME }} \
            django=gcr.io/${{ env.PROJECT_ID }}/${{ env.IMAGE_NAME }}:${IMAGE_TAG} \
            --namespace=coditect
          # Record the change cause (replaces the deprecated --record flag)
          kubectl annotate deployment/${{ env.DEPLOYMENT_NAME }} \
            kubernetes.io/change-cause="deploy ${IMAGE_TAG} by ${{ github.actor }}" \
            --namespace=coditect --overwrite

      - name: Wait for rollout to complete
        run: |
          kubectl rollout status deployment/${{ env.DEPLOYMENT_NAME }} \
            --namespace=coditect \
            --timeout=10m

      - name: Verify deployment
        run: |
          kubectl get pods -n coditect -l app=license-api
          kubectl get deployment -n coditect ${{ env.DEPLOYMENT_NAME }}

      - name: Run smoke tests
        run: |
          # Wait for service to be ready
          sleep 30

          # Check health endpoint
          curl -f https://api.coditect.com/health/ready || exit 1

          # Check metrics endpoint
          curl -f https://api.coditect.com/metrics || exit 1

      - name: Notify Slack
        if: always()
        uses: 8398a7/action-slack@v3
        with:
          status: ${{ job.status }}
          text: |
            Deployment to production: ${{ job.status }}
            Image: gcr.io/${{ env.PROJECT_ID }}/${{ env.IMAGE_NAME }}:${{ needs.build-and-push.outputs.image-tag }}
            Deployed by: ${{ github.actor }}
          webhook_url: ${{ secrets.SLACK_WEBHOOK }}

  rollback-on-failure:
    name: Rollback on Failure
    runs-on: ubuntu-latest
    needs: deploy
    if: failure()

    steps:
      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v1
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - name: Get GKE credentials
        uses: google-github-actions/get-gke-credentials@v1
        with:
          cluster_name: ${{ env.GKE_CLUSTER }}
          location: ${{ env.GKE_ZONE }}

      - name: Rollback deployment
        run: |
          kubectl rollout undo deployment/${{ env.DEPLOYMENT_NAME }} \
            --namespace=coditect

          kubectl rollout status deployment/${{ env.DEPLOYMENT_NAME }} \
            --namespace=coditect \
            --timeout=5m

      - name: Notify Slack of rollback
        uses: 8398a7/action-slack@v3
        with:
          status: 'warning'
          text: |
            ⚠️ ROLLBACK EXECUTED
            Deployment failed, automatically rolled back to previous version.
            Failed deployment: ${{ github.sha }}
          webhook_url: ${{ secrets.SLACK_WEBHOOK }}
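The tag scheme produced by the `generate-tag` step (`v1.0.<run number>-<short SHA>`) can be sketched as a plain function; `make_image_tag` is an illustrative helper, not part of the workflow itself:

```python
def make_image_tag(run_number: int, commit_sha: str) -> str:
    """Mirror the generate-tag step: v1.0.<GITHUB_RUN_NUMBER>-<first 7
    characters of the commit SHA>."""
    short_sha = commit_sha[:7]
    return f"v1.0.{run_number}-{short_sha}"


if __name__ == "__main__":
    tag = make_image_tag(42, "9fceb02d0ae598e95dc970b74767f19372d61af8")
    print(tag)  # v1.0.42-9fceb02
```

Because the tag embeds both the run number and the commit, every deployed image is traceable back to an exact commit and pipeline run.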

Docker Image Build

Multi-Stage Dockerfile

File: Dockerfile

# Build stage
FROM python:3.11-slim AS builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y \
        gcc \
        postgresql-client \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# Production stage
FROM python:3.11-slim

WORKDIR /app

# Install runtime dependencies
RUN apt-get update && apt-get install -y \
        libpq5 \
        curl \
    && rm -rf /var/lib/apt/lists/*

# Create non-root user first so copied files can be owned by it
RUN useradd -m -u 1000 django

# Copy Python dependencies from builder into the django user's home
# (the builder installs with --user into /root/.local, which the
# non-root runtime user cannot read)
COPY --from=builder --chown=django:django /root/.local /home/django/.local

# Copy application code
COPY --chown=django:django app/ ./app/
COPY --chown=django:django config/ ./config/
COPY --chown=django:django manage.py .

USER django

# Set Python path
ENV PATH=/home/django/.local/bin:$PATH
ENV PYTHONUNBUFFERED=1
ENV DJANGO_SETTINGS_MODULE=config.settings.production

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=5s --start-period=40s --retries=3 \
    CMD curl -f http://localhost:8000/health/ready || exit 1

# Run gunicorn
CMD ["gunicorn", "config.wsgi:application", \
     "--bind", "0.0.0.0:8000", \
     "--workers", "4", \
     "--threads", "2", \
     "--worker-class", "gthread", \
     "--worker-tmp-dir", "/dev/shm", \
     "--timeout", "60", \
     "--access-logfile", "-", \
     "--error-logfile", "-", \
     "--log-level", "info"]

Kubernetes Deployment Strategy

Rolling Update Configuration

Key Parameters:

spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod during update (total 4)
      maxUnavailable: 0  # Always maintain 3 running pods (zero downtime)

# Update flow:
# 1. Create 1 new pod (v1.2.3) → Total: 3 old + 1 new = 4 pods
# 2. Wait for new pod to be Ready
# 3. Terminate 1 old pod (v1.2.2) → Total: 2 old + 1 new = 3 pods
# 4. Repeat until all old pods replaced

Timeline:

0:00 - Start: 3 pods running v1.2.2
0:05 - Create new pod #1 (v1.2.3) → Total 4 pods
0:15 - New pod #1 Ready → Terminate old pod #1 → Total 3 pods
0:20 - Create new pod #2 (v1.2.3) → Total 4 pods
0:30 - New pod #2 Ready → Terminate old pod #2 → Total 3 pods
0:35 - Create new pod #3 (v1.2.3) → Total 4 pods
0:45 - New pod #3 Ready → Terminate old pod #3 → Total 3 pods
0:50 - Complete: 3 pods running v1.2.3

Zero Downtime Guaranteed:

  • Service always has 3+ healthy pods
  • Load balancer only routes to Ready pods
  • Old pods drain connections before termination
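The 3 → 4 → 3 pattern follows directly from `maxSurge: 1` / `maxUnavailable: 0`: a replacement must be Ready before an old pod drains. A small simulation of the ready-pod counts (illustrative only, not how Kubernetes itself is implemented — it tracks counts after each surge/drain cycle, so the transient 4th pod is not shown):

```python
def rolling_update_counts(replicas: int, max_surge: int):
    """Return (old_ready, new_ready) after each surge/drain cycle of a
    rolling update with maxUnavailable=0: up to max_surge new pods come
    up, and only then does the same number of old pods terminate, so
    the ready-pod total never drops below `replicas`."""
    old, new = replicas, 0
    states = [(old, new)]
    while old > 0:
        step = min(max_surge, old)  # new pods created and become Ready
        new += step
        old -= step                 # matching old pods drain and exit
        states.append((old, new))
    return states


if __name__ == "__main__":
    for old, new in rolling_update_counts(3, 1):
        print(f"old={old} new={new} ready_total={old + new}")
```

With `replicas=3, max_surge=1` the ready total stays at 3 through every step, which is the zero-downtime property the bullets above describe.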

Zero-Downtime Deployment

Health Check Configuration

Kubernetes Probes:

# Startup probe (one-time check during startup)
startupProbe:
  httpGet:
    path: /health/startup
    port: 8000
  initialDelaySeconds: 0
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 30  # Allow 150 seconds for startup (30 * 5s)

# Liveness probe (restart if fails)
livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3  # Restart after 3 consecutive failures

# Readiness probe (remove from load balancer if fails)
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3  # Remove from LB after 3 failures
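The timing budgets implied by these probes can be computed directly (worst case ≈ `initialDelaySeconds + failureThreshold × periodSeconds`); this is a sketch of the arithmetic, not a Kubernetes API:

```python
def probe_budget(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Worst-case seconds before a probe takes action: the initial delay
    plus failureThreshold consecutive failing periods."""
    return initial_delay + failure_threshold * period


if __name__ == "__main__":
    print("startup window  :", probe_budget(0, 5, 30), "s")   # 150 s, matching the comment above
    print("liveness restart:", probe_budget(30, 10, 3), "s")  # restart no later than ~60 s after start
    print("readiness eject :", probe_budget(10, 5, 3), "s")   # removed from the LB within ~25 s
```

The readiness budget being the smallest is deliberate: a struggling pod should leave the load balancer well before the liveness probe restarts it.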

Health Endpoint Implementation:

File: app/health/views.py

from django.http import JsonResponse
from django.db import connection
from django_redis import get_redis_connection
import logging

logger = logging.getLogger(__name__)


def startup_check(request):
    """
    Startup probe - Check if application has started

    Returns 200 if Django is running
    """
    return JsonResponse({'status': 'ok'})


def liveness_check(request):
    """
    Liveness probe - Check if application is alive

    Returns 200 if Django can process requests
    """
    return JsonResponse({'status': 'ok'})


def readiness_check(request):
    """
    Readiness probe - Check if application is ready to serve traffic

    Checks:
    - Database connectivity
    - Redis connectivity
    - Critical dependencies
    """
    checks = {}
    is_ready = True

    # Check database
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
        checks['database'] = 'ok'
    except Exception as e:
        checks['database'] = f'error: {e}'
        is_ready = False

    # Check Redis
    try:
        redis_client = get_redis_connection('default')
        redis_client.ping()
        checks['redis'] = 'ok'
    except Exception as e:
        checks['redis'] = f'error: {e}'
        is_ready = False

    status_code = 200 if is_ready else 503

    return JsonResponse(
        {
            'status': 'ready' if is_ready else 'not_ready',
            'checks': checks
        },
        status=status_code
    )
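The probe paths (`/health/startup`, `/health/live`, `/health/ready`) assume URL routing along these lines; the source does not show the URLconf, so this fragment is an assumption:

```python
# app/health/urls.py (illustrative wiring -- module layout assumed)
from django.urls import path

from . import views

urlpatterns = [
    path('health/startup', views.startup_check),
    path('health/live', views.liveness_check),
    path('health/ready', views.readiness_check),
]
```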

Graceful Shutdown

Django Signal Handler:

# app/middleware/graceful_shutdown.py
import signal
import sys
import logging

logger = logging.getLogger(__name__)

shutdown_requested = False


def handle_shutdown_signal(signum, frame):
    """
    Handle SIGTERM gracefully

    When Kubernetes terminates pod:
    1. SIGTERM sent to container
    2. 30-second grace period
    3. SIGKILL if not exited
    """
    global shutdown_requested
    shutdown_requested = True

    logger.info(f"Received signal {signum}, initiating graceful shutdown...")

    # Stop accepting new requests
    # Gunicorn handles this automatically

    logger.info("Graceful shutdown complete")
    sys.exit(0)


# Register signal handlers
signal.signal(signal.SIGTERM, handle_shutdown_signal)
signal.signal(signal.SIGINT, handle_shutdown_signal)
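On the Kubernetes side, graceful shutdown pairs with a termination grace period and, optionally, a `preStop` hook that delays SIGTERM until the load balancer has stopped routing to the pod. A sketch of the relevant pod-spec fields (the sleep duration is an assumption, not taken from the source manifests):

```yaml
spec:
  terminationGracePeriodSeconds: 30   # matches the 30-second grace period above
  containers:
    - name: django
      lifecycle:
        preStop:
          exec:
            # Give the load balancer a few seconds to deregister the pod
            # before SIGTERM reaches gunicorn
            command: ["sh", "-c", "sleep 5"]
```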

Deployment Verification

Automated Smoke Tests

File: scripts/smoke-tests.sh

#!/bin/bash
set -e

API_URL="${API_URL:-https://api.coditect.com}"

echo "Running smoke tests against $API_URL..."

# Test 1: Health endpoint
echo "✓ Testing health endpoint..."
curl -f -s "$API_URL/health/ready" | jq -e '.status == "ready"'

# Test 2: Metrics endpoint
echo "✓ Testing metrics endpoint..."
curl -f -s "$API_URL/metrics" | grep -q "http_requests_total"

# Test 3: License acquisition (with test credentials)
echo "✓ Testing license acquisition endpoint..."
curl -f -s -X POST "$API_URL/api/v1/licenses/acquire" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $TEST_TOKEN" \
    -d '{"license_key": "TEST-KEY", "hardware_id": "test-hw-id"}' \
    | jq -e '.session_id != null'

# Test 4: Heartbeat endpoint
echo "✓ Testing heartbeat endpoint..."
curl -f -s -X POST "$API_URL/api/v1/licenses/heartbeat" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $TEST_TOKEN" \
    -d '{"session_id": "test-session-id"}' \
    | jq -e '.success == true or .error != null' # Either success or expected error

echo "✅ All smoke tests passed!"
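The `jq` assertions above reduce to simple predicates over the JSON responses. The same checks expressed in Python (a sketch: the function names are ours, and no network call is made here):

```python
def health_ok(payload: dict) -> bool:
    """Mirror of: jq -e '.status == "ready"'"""
    return payload.get("status") == "ready"


def acquire_ok(payload: dict) -> bool:
    """Mirror of: jq -e '.session_id != null'"""
    return payload.get("session_id") is not None


def heartbeat_ok(payload: dict) -> bool:
    """Mirror of: jq -e '.success == true or .error != null'
    (either success or a structured, expected error)."""
    return payload.get("success") is True or payload.get("error") is not None


if __name__ == "__main__":
    assert health_ok({"status": "ready"})
    assert not health_ok({"status": "not_ready"})
    assert heartbeat_ok({"error": "session expired"})
    print("smoke-test predicates behave as expected")
```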

Rollback Procedures

Manual Rollback

# Rollback to previous deployment
kubectl rollout undo deployment/license-api -n coditect

# Rollback to specific revision
kubectl rollout undo deployment/license-api -n coditect --to-revision=5

# Check rollout history
kubectl rollout history deployment/license-api -n coditect

# View specific revision details
kubectl rollout history deployment/license-api -n coditect --revision=5

# Pause rollout (stop at current state)
kubectl rollout pause deployment/license-api -n coditect

# Resume rollout
kubectl rollout resume deployment/license-api -n coditect

Automatic Rollback

Automatic rollback is configured in the GitHub Actions workflow (see CI/CD Pipeline above).

Triggers automatic rollback on:

  1. Health check failures
  2. Smoke test failures
  3. Deployment timeout (10 minutes)

Infrastructure as Code Deployment

OpenTofu Workflow

# 1. Navigate to environment
cd opentofu/environments/production

# 2. Initialize OpenTofu
tofu init

# 3. Validate configuration
tofu validate

# 4. Plan changes (review carefully!)
tofu plan -out=tfplan

# 5. Apply changes (PRODUCTION)
tofu apply tfplan

# 6. Verify deployment
gcloud container clusters list
gcloud sql instances list
kubectl get nodes

Infrastructure Update Strategy

Safe Infrastructure Changes:

  1. Test in dev environment first
  2. Plan and review - Always run tofu plan and review output
  3. Gradual rollout - Update dev → staging → production
  4. Monitor after changes - Watch Cloud Monitoring for 30+ minutes
  5. Have rollback plan - Know how to revert changes

High-Risk Changes (Require Extra Approval):

  • Database instance changes (may cause downtime)
  • GKE cluster version upgrades
  • Network configuration changes
  • Key deletion or rotation

Production Deployment Checklist

Pre-Deployment

  • All tests passing in CI
  • Code reviewed and approved
  • Database migrations tested in staging
  • Secrets and configs updated
  • Deployment scheduled during business hours
  • On-call engineer notified
  • Rollback plan documented

During Deployment

  • Monitor deployment progress
  • Watch pod status: kubectl get pods -n coditect -w
  • Check logs for errors: kubectl logs -n coditect -l app=license-api --tail=100
  • Monitor Cloud Monitoring dashboards
  • Verify health checks passing

Post-Deployment

  • Run smoke tests
  • Verify metrics in Grafana
  • Check error rates (should be <1%)
  • Monitor for 30+ minutes
  • Update deployment notes
  • Notify team in Slack

If Issues Occur

  • Check pod logs: kubectl logs -n coditect <pod-name>
  • Check events: kubectl get events -n coditect --sort-by='.lastTimestamp'
  • Rollback if errors persist: kubectl rollout undo deployment/license-api -n coditect
  • Create incident report
  • Schedule postmortem

Summary

This SUP-02 GKE Deployment Architecture specification provides:

Complete CI/CD pipeline

  • GitHub Actions workflow (test → build → deploy)
  • Automated testing (linting, unit tests, coverage)
  • Docker image build and push to GCR
  • Kubernetes deployment automation

Docker image build

  • Multi-stage Dockerfile for optimization
  • Security (non-root user)
  • Health checks built-in
  • Production-ready configuration

Kubernetes deployment strategy

  • Rolling update with zero downtime
  • maxSurge: 1, maxUnavailable: 0
  • Gradual pod replacement (3 → 4 → 3 pattern)
  • Service continuity guaranteed

Zero-downtime deployment

  • Startup, liveness, readiness probes
  • Graceful shutdown handling
  • Connection draining
  • Load balancer integration

Deployment verification

  • Automated smoke tests
  • Health check validation
  • Metrics verification
  • 30-minute monitoring period

Rollback procedures

  • Automatic rollback on failure
  • Manual rollback commands
  • Revision history tracking
  • Pause/resume capabilities

Infrastructure as Code

  • OpenTofu deployment workflow
  • Safe change procedures
  • Gradual rollout strategy
  • Risk assessment guidelines

Production deployment checklist

  • Pre-deployment preparation
  • During-deployment monitoring
  • Post-deployment verification
  • Incident response procedures

Implementation Status: Specification Complete

Next Steps:

  1. Create GitHub Actions workflows (Phase 2)
  2. Set up GCR repository (Phase 2)
  3. Test deployment in dev environment (Phase 2)
  4. Create smoke test suite (Phase 2)
  5. Document rollback procedures (Phase 2)
  6. Production deployment (Phase 3)

Current Status:

  • GitHub Actions: ⏸️ Not configured
  • GCR Repository: ⏸️ Not created
  • Deployment Workflow: ⏸️ Not tested

Dependencies:

  • GitHub repository with Actions enabled
  • GCP service account with GKE/GCR permissions
  • kubectl configured for GKE cluster

Deployment Frequency:

  • Development: Continuous (every commit)
  • Staging: Automatic (after dev tests)
  • Production: Manual approval + business hours



Author: CODITECT Infrastructure Team
Date: November 30, 2025
Version: 1.0
Status: Ready for Implementation