SUP-02: GKE Deployment Architecture - CI/CD and Production Deployment

Document Type: Supplementary Diagram
Scope: Complete deployment workflow from code commit to production
Technology: GitHub Actions, Docker, Google Container Registry, GKE, Kubernetes
Status: Specification Complete - Ready for Implementation
Last Updated: November 30, 2025

Table of Contents

Overview
Deployment Architecture Diagram
CI/CD Pipeline
Docker Image Build
Kubernetes Deployment Strategy
Zero-Downtime Deployment
Deployment Verification
Rollback Procedures
Infrastructure as Code Deployment
Production Deployment Checklist

Overview

Purpose

This document specifies the complete deployment architecture for the CODITECT License Management Platform, covering:

CI/CD pipeline automation (GitHub Actions)
Docker image build and registry (Google Container Registry)
Kubernetes deployment workflow
Zero-downtime rolling updates
Deployment verification and health checks
Rollback procedures
Infrastructure as Code (OpenTofu) deployment

Deployment Philosophy

Key Principles:

Automated: No manual steps from commit to production apart from the production approval gate
Safe: Zero-downtime deployments with automatic rollback
Verified: Automated testing and health checks at every stage
Traceable: Complete audit trail of all deployments
Reversible: One-command rollback to previous version

Deployment Frequency:

Development: Continuous deployment on every commit to main
Staging: Automatic deployment after dev tests pass
Production: Manual approval required, deployments during business hours

Deployment Architecture Diagram

Complete Deployment Workflow

CI/CD Pipeline

GitHub Actions Workflow

File: .github/workflows/deploy-production.yml

name: Deploy to Production

on:
  push:
    branches:
      - main
    paths:
      - 'app/**'
      - 'config/**'
      - 'Dockerfile'
      - 'requirements.txt'
      - 'kubernetes/**'
  workflow_dispatch:  # Manual trigger

env:
  PROJECT_ID: coditect-cloud-infra
  GKE_CLUSTER: production-gke-cluster
  GKE_ZONE: us-central1-a
  IMAGE_NAME: license-api
  DEPLOYMENT_NAME: license-api

jobs:
  test:
    name: Run Tests
    runs-on: ubuntu-latest

    steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.11'

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
        pip install pytest pytest-cov ruff black

    - name: Run linting
      run: |
        ruff check app/
        black --check app/

    - name: Run tests
      run: |
        pytest tests/ \
          --cov=app \
          --cov-report=xml \
          --cov-report=term \
          --junitxml=junit.xml

    - name: Upload coverage
      uses: codecov/codecov-action@v3
      with:
        files: ./coverage.xml

  build-and-push:
    name: Build and Push Docker Image
    runs-on: ubuntu-latest
    needs: test
    outputs:
      image-tag: ${{ steps.generate-tag.outputs.tag }}

    steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Generate image tag
      id: generate-tag
      run: |
        SHORT_SHA=$(echo ${{ github.sha }} | cut -c1-7)
        TAG="v1.0.${GITHUB_RUN_NUMBER}-${SHORT_SHA}"
        echo "tag=${TAG}" >> $GITHUB_OUTPUT

    - name: Set up Docker Buildx
      uses: docker/setup-buildx-action@v3

    - name: Authenticate to Google Cloud
      uses: google-github-actions/auth@v1
      with:
        credentials_json: ${{ secrets.GCP_SA_KEY }}

    - name: Configure Docker for GCR
      run: |
        gcloud auth configure-docker

    - name: Build and push Docker image
      uses: docker/build-push-action@v5
      with:
        context: .
        push: true
        tags: |
          gcr.io/${{ env.PROJECT_ID }}/${{ env.IMAGE_NAME }}:${{ steps.generate-tag.outputs.tag }}
          gcr.io/${{ env.PROJECT_ID }}/${{ env.IMAGE_NAME }}:latest
        cache-from: type=gha
        cache-to: type=gha,mode=max

    - name: Write image tag to file
      run: echo "${{ steps.generate-tag.outputs.tag }}" > image-tag.txt

    - name: Upload image tag artifact
      uses: actions/upload-artifact@v4
      with:
        name: image-tag
        path: image-tag.txt

  deploy:
    name: Deploy to GKE
    runs-on: ubuntu-latest
    needs: build-and-push
    environment:
      name: production
      url: https://api.coditect.com

    steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Authenticate to Google Cloud
      uses: google-github-actions/auth@v1
      with:
        credentials_json: ${{ secrets.GCP_SA_KEY }}

    - name: Get GKE credentials
      uses: google-github-actions/get-gke-credentials@v1
      with:
        cluster_name: ${{ env.GKE_CLUSTER }}
        location: ${{ env.GKE_ZONE }}

    - name: Update deployment image
      run: |
        IMAGE_TAG="${{ needs.build-and-push.outputs.image-tag }}"
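        kubectl set image deployment/${{ env.DEPLOYMENT_NAME }} \
          django=gcr.io/${{ env.PROJECT_ID }}/${{ env.IMAGE_NAME }}:${IMAGE_TAG} \
          --namespace=coditect
        # --record is deprecated; set the change-cause annotation explicitly
        kubectl annotate deployment/${{ env.DEPLOYMENT_NAME }} \
          kubernetes.io/change-cause="Deploy ${IMAGE_TAG} (${{ github.sha }})" \
          --namespace=coditect --overwrite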

    - name: Wait for rollout to complete
      run: |
        kubectl rollout status deployment/${{ env.DEPLOYMENT_NAME }} \
          --namespace=coditect \
          --timeout=10m

    - name: Verify deployment
      run: |
        kubectl get pods -n coditect -l app=license-api
        kubectl get deployment -n coditect ${{ env.DEPLOYMENT_NAME }}

    - name: Run smoke tests
      run: |
        # Wait for service to be ready
        sleep 30

        # Check health endpoint
        curl -f https://api.coditect.com/health/ready || exit 1

        # Check metrics endpoint
        curl -f https://api.coditect.com/metrics || exit 1

    - name: Notify Slack
      if: always()
      uses: 8398a7/action-slack@v3
      with:
        status: ${{ job.status }}
        text: |
          Deployment to production: ${{ job.status }}
          Image: gcr.io/${{ env.PROJECT_ID }}/${{ env.IMAGE_NAME }}:${{ needs.build-and-push.outputs.image-tag }}
          Deployed by: ${{ github.actor }}
        webhook_url: ${{ secrets.SLACK_WEBHOOK }}

  rollback-on-failure:
    name: Rollback on Failure
    runs-on: ubuntu-latest
    needs: deploy
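    # failure() is also true when an earlier job (test, build) fails and deploy
    # is skipped, so check the deploy job's own result before rolling back
    if: ${{ always() && needs.deploy.result == 'failure' }}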

    steps:
    - name: Authenticate to Google Cloud
      uses: google-github-actions/auth@v1
      with:
        credentials_json: ${{ secrets.GCP_SA_KEY }}

    - name: Get GKE credentials
      uses: google-github-actions/get-gke-credentials@v1
      with:
        cluster_name: ${{ env.GKE_CLUSTER }}
        location: ${{ env.GKE_ZONE }}

    - name: Rollback deployment
      run: |
        kubectl rollout undo deployment/${{ env.DEPLOYMENT_NAME }} \
          --namespace=coditect

        kubectl rollout status deployment/${{ env.DEPLOYMENT_NAME }} \
          --namespace=coditect \
          --timeout=5m

    - name: Notify Slack of rollback
      uses: 8398a7/action-slack@v3
      with:
        status: 'warning'
        text: |
          ⚠️ ROLLBACK EXECUTED
          Deployment failed, automatically rolled back to previous version.
          Failed deployment: ${{ github.sha }}
        webhook_url: ${{ secrets.SLACK_WEBHOOK }}
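
The tag format produced by the "Generate image tag" step can be reproduced locally, for instance when tracing a running image back to its commit. The following is a minimal Python sketch, assuming the same v1.0.<run number>-<7-character SHA> scheme used above; the function names are hypothetical and not part of the workflow.

```python
import re

# Matches tags shaped like the workflow's output, e.g. v1.0.42-0123456
TAG_PATTERN = re.compile(r"^v1\.0\.(?P<run>\d+)-(?P<sha>[0-9a-f]{7})$")


def make_image_tag(run_number: int, commit_sha: str) -> str:
    """Build a tag in the same shape as the workflow's generate-tag step."""
    return f"v1.0.{run_number}-{commit_sha[:7]}"


def parse_image_tag(tag: str) -> tuple[int, str]:
    """Split a tag into (run number, short SHA); raise on any other format."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unexpected image tag format: {tag!r}")
    return int(match.group("run")), match.group("sha")


if __name__ == "__main__":
    tag = make_image_tag(42, "0123456789abcdef0123456789abcdef01234567")
    print(tag)
    print(parse_image_tag(tag))
```

Rejecting tags that do not match the pattern keeps a mutable tag such as latest from being mistaken for a traceable release.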

Docker Image Build

Multi-Stage Dockerfile

File: Dockerfile

# Build stage
FROM python:3.11-slim AS builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    postgresql-client \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# Production stage
FROM python:3.11-slim

WORKDIR /app

# Install runtime dependencies
RUN apt-get update && apt-get install -y \
    libpq5 \
    curl \
    && rm -rf /var/lib/apt/lists/*

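# Create non-root user
RUN useradd -m -u 1000 django && chown django:django /app

# Copy Python dependencies from builder into the non-root user's home
# (/root/.local is not readable once the container runs as django)
COPY --from=builder --chown=django:django /root/.local /home/django/.local

# Copy application code
COPY --chown=django:django app/ ./app/
COPY --chown=django:django config/ ./config/
COPY --chown=django:django manage.py .

USER django

# Put user-installed scripts (such as gunicorn) on PATH
ENV PATH=/home/django/.local/bin:$PATH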
ENV PYTHONUNBUFFERED=1
ENV DJANGO_SETTINGS_MODULE=config.settings.production

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=5s --start-period=40s --retries=3 \
  CMD curl -f http://localhost:8000/health/ready || exit 1

# Run gunicorn
CMD ["gunicorn", "config.wsgi:application", \
     "--bind", "0.0.0.0:8000", \
     "--workers", "4", \
     "--threads", "2", \
     "--worker-class", "gthread", \
     "--worker-tmp-dir", "/dev/shm", \
     "--timeout", "60", \
     "--access-logfile", "-", \
     "--error-logfile", "-", \
     "--log-level", "info"]
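
The Gunicorn flags in the CMD above could alternatively live in a configuration file, which keeps the Dockerfile shorter and lets the values be reviewed like other code. The sketch below assumes a file named gunicorn.conf.py in the working directory, which Gunicorn loads by default when present. The file is not part of the original specification, and its values simply mirror the flags above.

```python
# gunicorn.conf.py (hypothetical file; values mirror the CMD flags above)
bind = "0.0.0.0:8000"
workers = 4
threads = 2
worker_class = "gthread"
worker_tmp_dir = "/dev/shm"  # heartbeat files on tmpfs rather than disk
timeout = 60
accesslog = "-"  # "-" sends the access log to stdout
errorlog = "-"   # "-" sends the error log to stderr
loglevel = "info"

# With gthread workers, each worker serves up to `threads` requests at once,
# so one pod handles roughly workers * threads concurrent requests.
```

With this file in place, the CMD would reduce to the executable and the WSGI module path.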

Kubernetes Deployment Strategy

Rolling Update Configuration

Key Parameters:

spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod during update (total 4)
      maxUnavailable: 0  # Always maintain 3 running pods (zero downtime)

  # Update flow:
  # 1. Create 1 new pod (v1.2.3) → Total: 3 old + 1 new = 4 pods
  # 2. Wait for new pod to be Ready
  # 3. Terminate 1 old pod (v1.2.2) → Total: 2 old + 1 new = 3 pods
  # 4. Repeat until all old pods replaced

Timeline:

00 - Start: 3 pods running v1.2.2
05 - Create new pod #1 (v1.2.3) → Total 4 pods
15 - New pod #1 Ready → Terminate old pod #1 → Total 3 pods
20 - Create new pod #2 (v1.2.3) → Total 4 pods
30 - New pod #2 Ready → Terminate old pod #2 → Total 3 pods
35 - Create new pod #3 (v1.2.3) → Total 4 pods
45 - New pod #3 Ready → Terminate old pod #3 → Total 3 pods
50 - Complete: 3 pods running v1.2.3

Zero Downtime Guaranteed:

Service always has 3+ healthy pods
Load balancer only routes to Ready pods
Old pods drain connections before termination
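
The pod counts in the update flow and timeline follow directly from maxSurge and maxUnavailable. The following Python sketch is a simplified model, not the Deployment controller's actual algorithm: it assumes every new pod becomes Ready before the next step, and the function name is hypothetical.

```python
def simulate_rolling_update(replicas, max_surge, max_unavailable):
    """Return the (old_pods, new_pods) state after each scale-up or scale-down.

    Simplified model: total pods never exceed replicas + max_surge, and
    Ready pods never drop below replicas - max_unavailable.
    """
    old, new = replicas, 0
    states = [(old, new)]
    while old > 0 or new < replicas:
        # Scale up new pods as far as the surge limit allows
        create = max(0, min(replicas + max_surge - (old + new), replicas - new))
        new += create
        if create:
            states.append((old, new))
        # Scale down old pods while keeping enough Ready pods
        remove = max(0, min(old, old + new - (replicas - max_unavailable)))
        old -= remove
        if remove:
            states.append((old, new))
        if create == 0 and remove == 0:
            raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return states


if __name__ == "__main__":
    for old, new in simulate_rolling_update(replicas=3, max_surge=1, max_unavailable=0):
        print(f"old={old} new={new} total={old + new}")
```

For the configuration above (3 replicas, maxSurge 1, maxUnavailable 0), the total alternates between 3 and 4 pods and never falls below 3, which matches the 3 → 4 → 3 pattern in the timeline.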

Zero-Downtime Deployment

Health Check Configuration

Kubernetes Probes:

# Startup probe (polled until it first succeeds; liveness and readiness checks wait for it)
startupProbe:
  httpGet:
    path: /health/startup
    port: 8000
  initialDelaySeconds: 0
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 30  # Allow 150 seconds for startup (30 * 5s)

# Liveness probe (restart if fails)
livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3  # Restart after 3 consecutive failures

# Readiness probe (remove from load balancer if fails)
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3  # Remove from LB after 3 failures
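
The timing implied by these probe settings can be checked with simple arithmetic: the kubelet acts after failureThreshold consecutive failures spaced periodSeconds apart. The Python sketch below reproduces the 150-second startup allowance noted in the comment above. The class and method names are hypothetical, and the estimate ignores probe timeouts and scheduling jitter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """Subset of Kubernetes probe fields needed for a timing estimate."""
    period_seconds: int
    failure_threshold: int
    initial_delay_seconds: int = 0

    def failure_window_seconds(self) -> int:
        """Approximate seconds of continuous failure before the kubelet acts."""
        return self.period_seconds * self.failure_threshold


# Values copied from the probe configuration above
startup = Probe(period_seconds=5, failure_threshold=30)
liveness = Probe(period_seconds=10, failure_threshold=3, initial_delay_seconds=30)
readiness = Probe(period_seconds=5, failure_threshold=3, initial_delay_seconds=10)

if __name__ == "__main__":
    for name, probe in [("startup", startup), ("liveness", liveness), ("readiness", readiness)]:
        print(f"{name}: ~{probe.failure_window_seconds()}s of failures tolerated")
```

Under these assumptions, an unhealthy pod leaves the load balancer after about 15 seconds and is restarted after about 30 seconds of continuous liveness failures.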

Health Endpoint Implementation:

File: app/health/views.py

from django.http import JsonResponse
from django.db import connection
from django_redis import get_redis_connection
import logging

logger = logging.getLogger(__name__)


def startup_check(request):
    """
    Startup probe - Check if application has started

    Returns 200 if Django is running
    """
    return JsonResponse({'status': 'ok'})


def liveness_check(request):
    """
    Liveness probe - Check if application is alive

    Returns 200 if Django can process requests
    """
    return JsonResponse({'status': 'ok'})


def readiness_check(request):
    """
    Readiness probe - Check if application is ready to serve traffic

    Checks:
    - Database connectivity
    - Redis connectivity
    - Critical dependencies
    """
    checks = {}
    is_ready = True

    # Check database
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
        checks['database'] = 'ok'
    except Exception as e:
        checks['database'] = f'error: {e}'
        is_ready = False

    # Check Redis
    try:
        redis_client = get_redis_connection('default')
        redis_client.ping()
        checks['redis'] = 'ok'
    except Exception as e:
        checks['redis'] = f'error: {e}'
        is_ready = False

    status_code = 200 if is_ready else 503

    return JsonResponse(
        {
            'status': 'ready' if is_ready else 'not_ready',
            'checks': checks
        },
        status=status_code
    )
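
If more dependencies are added to the readiness probe, the repeated try/except blocks could be factored into a small helper. The following is a framework-independent sketch of that idea, not the project's implementation: run_readiness_checks is a hypothetical name, and each check is assumed to be a zero-argument callable that raises on failure.

```python
from typing import Callable, Dict, Tuple


def run_readiness_checks(checks: Dict[str, Callable[[], None]]) -> Tuple[int, dict]:
    """Run each named check and return (HTTP status code, response payload)."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # any failure marks the dependency as down
            results[name] = f"error: {exc}"
    is_ready = all(value == "ok" for value in results.values())
    payload = {"status": "ready" if is_ready else "not_ready", "checks": results}
    return (200 if is_ready else 503), payload


if __name__ == "__main__":
    def database_ok():
        pass  # stand-in for the "SELECT 1" query

    def redis_down():
        raise ConnectionError("connection refused")  # simulated outage

    print(run_readiness_checks({"database": database_ok, "redis": redis_down}))
```

In the Django view, the returned pair would map onto JsonResponse(payload, status=status_code), which keeps the response shape identical to the version above.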

Graceful Shutdown

Django Signal Handler:

# app/middleware/graceful_shutdown.py
import logging
import signal
import sys

logger = logging.getLogger(__name__)

shutdown_requested = False

# Handlers that were installed before ours (for example by Gunicorn)
_previous_handlers = {}


def handle_shutdown_signal(signum, frame):
    """
    Handle SIGTERM gracefully

    When Kubernetes terminates pod:
    1. SIGTERM sent to container
    2. 30-second grace period
    3. SIGKILL if not exited
    """
    global shutdown_requested
    shutdown_requested = True

    logger.info("Received signal %s, initiating graceful shutdown...", signum)

    # If a handler such as Gunicorn's was already installed, delegate to it so
    # new requests stop and in-flight requests can drain. Exiting here
    # unconditionally would cut those requests short.
    previous = _previous_handlers.get(signum)
    if callable(previous):
        previous(signum, frame)
    else:
        sys.exit(0)


# Register signal handlers, remembering whatever was installed before
for _sig in (signal.SIGTERM, signal.SIGINT):
    _previous_handlers[_sig] = signal.signal(_sig, handle_shutdown_signal)

Deployment Verification

Automated Smoke Tests

File: scripts/smoke-tests.sh

#!/bin/bash
set -e

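API_URL="${API_URL:-https://api.coditect.com}"
# Fail early with a clear message if the test credentials are missing
: "${TEST_TOKEN:?TEST_TOKEN must be set}"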

echo "Running smoke tests against $API_URL..."

# Test 1: Health endpoint
echo "✓ Testing health endpoint..."
curl -f -s "$API_URL/health/ready" | jq -e '.status == "ready"'

# Test 2: Metrics endpoint
echo "✓ Testing metrics endpoint..."
curl -f -s "$API_URL/metrics" | grep -q "http_requests_total"

# Test 3: License acquisition (with test credentials)
echo "✓ Testing license acquisition endpoint..."
curl -f -s -X POST "$API_URL/api/v1/licenses/acquire" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TEST_TOKEN" \
  -d '{"license_key": "TEST-KEY", "hardware_id": "test-hw-id"}' \
  | jq -e '.session_id != null'

# Test 4: Heartbeat endpoint
echo "✓ Testing heartbeat endpoint..."
curl -f -s -X POST "$API_URL/api/v1/licenses/heartbeat" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TEST_TOKEN" \
  -d '{"session_id": "test-session-id"}' \
  | jq -e '.success == true or .error != null'  # Either success or expected error

echo "✅ All smoke tests passed!"
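
The response checks in the script can also be expressed as small Python functions, which makes them easier to unit-test than jq filters. The following is a sketch under that assumption. The function names are hypothetical, and fetch is a thin wrapper over the standard library that, like curl -f, fails on HTTP error statuses.

```python
import json
from urllib.request import urlopen


def fetch(url: str, timeout: float = 10.0) -> str:
    """GET a URL and return the body; urlopen raises on 4xx/5xx responses."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def ready_payload_ok(body: str) -> bool:
    """Equivalent of: jq -e '.status == "ready"'."""
    try:
        return json.loads(body).get("status") == "ready"
    except (ValueError, AttributeError):
        return False


def metrics_ok(body: str, metric: str = "http_requests_total") -> bool:
    """Equivalent of: grep -q "http_requests_total"."""
    return metric in body


def heartbeat_ok(body: str) -> bool:
    """Equivalent of: jq -e '.success == true or .error != null'."""
    try:
        data = json.loads(body)
        return data.get("success") is True or data.get("error") is not None
    except (ValueError, AttributeError):
        return False
```

A runner would call, for example, ready_payload_ok(fetch(api_url + "/health/ready")) and exit non-zero on the first False result.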

Rollback Procedures

Manual Rollback

# Rollback to previous deployment
kubectl rollout undo deployment/license-api -n coditect

# Rollback to specific revision
kubectl rollout undo deployment/license-api -n coditect --to-revision=5

# Check rollout history
kubectl rollout history deployment/license-api -n coditect

# View specific revision details
kubectl rollout history deployment/license-api -n coditect --revision=5

# Pause rollout (stop at current state)
kubectl rollout pause deployment/license-api -n coditect

# Resume rollout
kubectl rollout resume deployment/license-api -n coditect

Automatic Rollback

Configured in GitHub Actions (see CI/CD Pipeline above)

Triggers automatic rollback on:

Health check failures
Smoke test failures
Deployment timeout (10 minutes)
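
The same deploy-then-rollback sequence can be scripted outside GitHub Actions, for example for a manual release. The sketch below is illustrative only: deploy_with_rollback and kubectl_runner are hypothetical names, the container name django and the timeouts are taken from the workflow above, and smoke tests are left out for brevity.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def kubectl_runner(args: Sequence[str]) -> int:
    """Run kubectl with the given arguments and return its exit code."""
    return subprocess.run(["kubectl", *args]).returncode


def deploy_with_rollback(
    image: str,
    run: Runner = kubectl_runner,
    deployment: str = "license-api",
    namespace: str = "coditect",
    container: str = "django",
) -> str:
    """Update the image, wait for the rollout, and undo it if it fails."""
    target = f"deployment/{deployment}"
    ns = f"--namespace={namespace}"
    if run(["set", "image", target, f"{container}={image}", ns]) != 0:
        return "failed"  # nothing was changed, so there is nothing to undo
    if run(["rollout", "status", target, ns, "--timeout=10m"]) == 0:
        return "deployed"
    # The rollout did not become healthy in time: revert and wait again
    run(["rollout", "undo", target, ns])
    run(["rollout", "status", target, ns, "--timeout=5m"])
    return "rolled_back"
```

Passing the runner as a parameter lets the sequence be exercised with a fake in tests, without access to a cluster.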

Infrastructure as Code Deployment

OpenTofu Workflow

# 1. Navigate to environment
cd opentofu/environments/production

# 2. Initialize OpenTofu
tofu init

# 3. Validate configuration
tofu validate

# 4. Plan changes (review carefully!)
tofu plan -out=tfplan

# 5. Apply changes (PRODUCTION)
tofu apply tfplan

# 6. Verify deployment
gcloud container clusters list
gcloud sql instances list
kubectl get nodes

Infrastructure Update Strategy

Safe Infrastructure Changes:

Test in dev environment first
Plan and review - Always run tofu plan and review output
Gradual rollout - Update dev → staging → production
Monitor after changes - Watch Cloud Monitoring for 30+ minutes
Have rollback plan - Know how to revert changes

High-Risk Changes (Require Extra Approval):

Database instance changes (may cause downtime)
GKE cluster version upgrades
Network configuration changes
Key deletion or rotation
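
Part of the plan review can be automated by exporting the saved plan as JSON (tofu show -json tfplan) and flagging destructive or high-risk changes before approval. The Python sketch below assumes that JSON plan format. The function name is hypothetical, and the resource types in HIGH_RISK_TYPES are only a guess at which Google provider resources this project manages.

```python
import json

# Assumed mapping of the high-risk categories above to Google provider types
HIGH_RISK_TYPES = {
    "google_sql_database_instance",
    "google_container_cluster",
    "google_compute_network",
    "google_kms_crypto_key",
}


def risky_changes(plan: dict) -> list[str]:
    """List resource changes that should receive extra review."""
    findings = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if "delete" in actions:
            # Covers both plain deletes and delete/create replacements
            findings.append(f"{change['address']}: destroys or replaces resource")
        elif "update" in actions and change.get("type") in HIGH_RISK_TYPES:
            findings.append(f"{change['address']}: in-place update of high-risk type")
    return findings


if __name__ == "__main__":
    import sys

    # Usage: tofu show -json tfplan | python check_plan.py
    for finding in risky_changes(json.load(sys.stdin)):
        print(finding)
```

A check like this supplements the manual review of the tofu plan output and does not replace it.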

Production Deployment Checklist

Pre-Deployment

All tests passing in CI
Code reviewed and approved
Database migrations tested in staging
Secrets and configs updated
Deployment scheduled during business hours
On-call engineer notified
Rollback plan documented

During Deployment

Monitor deployment progress
Watch pod status: kubectl get pods -n coditect -w
Check logs for errors: kubectl logs -n coditect -l app=license-api --tail=100
Monitor Cloud Monitoring dashboards
Verify health checks passing

Post-Deployment

If Issues Occur

Check pod logs: kubectl logs -n coditect <pod-name>
Check events: kubectl get events -n coditect --sort-by='.lastTimestamp'
Rollback if errors persist: kubectl rollout undo deployment/license-api -n coditect
Create incident report
Schedule postmortem

Summary

This SUP-02 GKE Deployment Architecture specification provides:

✅ Complete CI/CD pipeline

GitHub Actions workflow (test → build → deploy)
Automated testing (linting, unit tests, coverage)
Docker image build and push to GCR
Kubernetes deployment automation

✅ Docker image build

Multi-stage Dockerfile for optimization
Security (non-root user)
Health checks built-in
Production-ready configuration

✅ Kubernetes deployment strategy

Rolling update with zero downtime
maxSurge: 1, maxUnavailable: 0
Gradual pod replacement (3 → 4 → 3 pattern)
Service continuity guaranteed

✅ Zero-downtime deployment

Startup, liveness, readiness probes
Graceful shutdown handling
Connection draining
Load balancer integration

✅ Deployment verification

Automated smoke tests
Health check validation
Metrics verification
30-minute monitoring period

✅ Rollback procedures

Automatic rollback on failure
Manual rollback commands
Revision history tracking
Pause/resume capabilities

✅ Infrastructure as Code

OpenTofu deployment workflow
Safe change procedures
Gradual rollout strategy
Risk assessment guidelines

✅ Production deployment checklist

Pre-deployment preparation
During-deployment monitoring
Post-deployment verification
Incident response procedures

Implementation Status: Specification Complete

Next Steps:

Create GitHub Actions workflows (Phase 2)
Set up GCR repository (Phase 2)
Test deployment in dev environment (Phase 2)
Create smoke test suite (Phase 2)
Document rollback procedures (Phase 2)
Production deployment (Phase 3)

Current Status:

GitHub Actions: ⏸️ Not configured
GCR Repository: ⏸️ Not created
Deployment Workflow: ⏸️ Not tested

Dependencies:

GitHub repository with Actions enabled
GCP service account with GKE/GCR permissions
kubectl configured for GKE cluster

Deployment Frequency:

Development: Continuous (every commit)
Staging: Automatic (after dev tests)
Production: Manual approval + business hours

Total Lines: 750+ (complete production-ready deployment architecture)

Author: CODITECT Infrastructure Team
Date: November 30, 2025
Version: 1.0
Status: Ready for Implementation
