CODITECT Emergency Response Playbook

Version: 1.0.0
Last Updated: 2026-01-08
Owner: AZ1.AI INC Operations Team
Classification: Internal - Operations


Overview

This playbook provides standardized procedures for responding to production incidents affecting the CODITECT platform. All team members should be familiar with this document before participating in on-call rotations.

Guiding Principles

  1. Safety First - Protect customer data and system integrity
  2. Communicate Early - Inform stakeholders as soon as impact is known
  3. Document Everything - Create timeline in real-time for post-mortems
  4. Escalate Quickly - When in doubt, escalate to senior team members
  5. Customer Focus - Prioritize restoring customer experience

Severity Levels

SEV-1: Critical

Definition: Complete service outage or security breach affecting all customers.

| Attribute | Details |
|-----------|---------|
| Response Time | < 15 minutes |
| Update Frequency | Every 15 minutes |
| Escalation | Immediate to CTO |
| Status Page | Major Outage |
| Customer Comms | Required |

Examples:

  • API completely down (5xx for all requests)
  • Database corruption or data loss
  • Security breach or data exposure
  • License validation failing for all customers
  • Payment processing completely down

SEV-2: High

Definition: Major functionality impaired or significant customer impact.

| Attribute | Details |
|-----------|---------|
| Response Time | < 30 minutes |
| Update Frequency | Every 30 minutes |
| Escalation | Team lead within 1 hour |
| Status Page | Partial Outage |
| Customer Comms | If > 1 hour |

Examples:

  • License validation intermittently failing
  • Significant latency (>5s response times)
  • One region/availability zone down
  • Critical feature broken (e.g., seat management)
  • Celery queue backed up significantly

SEV-3: Medium

Definition: Minor functionality impaired, workaround available.

| Attribute | Details |
|-----------|---------|
| Response Time | < 2 hours |
| Update Frequency | Every 2 hours |
| Escalation | If not resolved in 4 hours |
| Status Page | Degraded Performance |
| Customer Comms | No |

Examples:

  • Non-critical feature broken
  • Minor latency increase
  • Single customer affected
  • Documentation site down
  • Admin dashboard issues

SEV-4: Low

Definition: Minimal impact, cosmetic issues, or monitoring alerts.

| Attribute | Details |
|-----------|---------|
| Response Time | Next business day |
| Update Frequency | Daily |
| Escalation | N/A |
| Status Page | No update |
| Customer Comms | No |

Examples:

  • UI cosmetic issues
  • Non-customer-facing errors
  • Minor log warnings
  • Performance optimization opportunities
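
The four severity tables above can be encoded as data so that tooling applies the same response and update targets everywhere. The following is a minimal sketch, not an existing CODITECT module: `SeverityPolicy`, `POLICIES`, and `next_update_due` are hypothetical names, and SEV-4's "next business day" response time is represented as `None` for simplicity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    response_within: Optional[timedelta]  # None means "next business day"
    update_every: timedelta
    status_page: str
    customer_comms: str


# Values copied from the severity tables in this playbook.
POLICIES = {
    "SEV-1": SeverityPolicy(timedelta(minutes=15), timedelta(minutes=15), "Major Outage", "Required"),
    "SEV-2": SeverityPolicy(timedelta(minutes=30), timedelta(minutes=30), "Partial Outage", "If > 1 hour"),
    "SEV-3": SeverityPolicy(timedelta(hours=2), timedelta(hours=2), "Degraded Performance", "No"),
    "SEV-4": SeverityPolicy(None, timedelta(days=1), "No update", "No"),
}


def next_update_due(severity: str, last_update: datetime) -> datetime:
    """Return when the next stakeholder update is due for this severity."""
    return last_update + POLICIES[severity].update_every


last = datetime(2026, 1, 8, 14, 0, tzinfo=timezone.utc)
print(next_update_due("SEV-1", last).isoformat())  # 2026-01-08T14:15:00+00:00
```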

Incident Response Process

Phase 1: Detection & Triage (0-15 min)

┌─────────────────────────────────────────────────────────────────┐
│                        INCIDENT DETECTED                        │
│                                │                                │
│        ┌───────────────────────┼───────────────────────┐        │
│        ▼                       ▼                       ▼        │
│   Monitoring               Customer                 Manual      │
│      Alert                  Report                 Discovery    │
│        │                       │                       │        │
│        └───────────────────────┼───────────────────────┘        │
│                                ▼                                │
│                       ┌─────────────────┐                       │
│                       │ TRIAGE CHECKLIST│                       │
│                       │ □ Verify impact │                       │
│                       │ □ Assign SEV    │                       │
│                       │ □ Create ticket │                       │
│                       │ □ Page on-call  │                       │
│                       └─────────────────┘                       │
└─────────────────────────────────────────────────────────────────┘

Triage Checklist:

  1. Verify the incident - Confirm it's not a false positive
  2. Assess impact - How many customers? Which features?
  3. Assign severity - Use definitions above
  4. Create incident ticket - Use template below
  5. Page on-call (SEV-1/2) - Use PagerDuty or Slack
  6. Start incident channel - #incident-YYYY-MM-DD-description
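
The channel naming pattern in step 6 is easy to get inconsistent under pressure, so it may help to generate it. This is a small sketch under assumptions: `incident_channel_name` is a hypothetical helper, and lowercasing the description and replacing non-alphanumeric characters with hyphens is an assumed convention, not something this playbook specifies.

```python
import re
from datetime import date


def incident_channel_name(day: date, description: str) -> str:
    """Build a channel name in the #incident-YYYY-MM-DD-description pattern."""
    # Lowercase, then collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#incident-{day.isoformat()}-{slug}"


print(incident_channel_name(date(2026, 1, 8), "API 5xx Errors Spike"))
# #incident-2026-01-08-api-5xx-errors-spike
```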

Phase 2: Response & Mitigation (15 min - Resolution)

Incident Commander Responsibilities:

  • Coordinate response efforts
  • Make mitigation decisions
  • Manage communication
  • Request resources
  • Document timeline

Response Team Responsibilities:

  • Investigate root cause
  • Implement fixes
  • Test changes
  • Deploy to production
  • Monitor recovery

Communication Lead Responsibilities:

  • Update status page
  • Send customer communications
  • Update internal stakeholders
  • Document decisions

Phase 3: Resolution & Recovery

  1. Confirm resolution - Verify customer experience restored
  2. Monitor stability - Watch metrics for 30+ minutes
  3. Update status page - Mark incident resolved
  4. Send resolution comms - If customer comms were sent
  5. Schedule post-mortem - Within 48 hours for SEV-1/2
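
Step 2 ("watch metrics for 30+ minutes") can be made explicit as a check over recent samples. The sketch below is illustrative only: `is_stable` is a hypothetical name, the 1% default threshold is borrowed from the symptom line in Runbook 1, and the requirement that samples begin within 5 minutes of the window start is an assumption added to avoid declaring stability from a few fresh data points.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Tuple

Sample = Tuple[datetime, float]  # (timestamp, error rate as a fraction of requests)


def is_stable(samples: Iterable[Sample], now: datetime,
              window: timedelta = timedelta(minutes=30),
              max_error_rate: float = 0.01) -> bool:
    """True if samples cover the window and every one is at or below the threshold."""
    recent = [(ts, rate) for ts, rate in samples if now - window <= ts <= now]
    if not recent:
        return False
    oldest = min(ts for ts, _ in recent)
    # Require data close to the start of the window, not just a few recent points.
    covers_window = oldest - (now - window) <= timedelta(minutes=5)
    return covers_window and all(rate <= max_error_rate for _, rate in recent)
```

A caller would feed it error-rate samples exported from the monitoring system and mark the incident resolved only once it returns `True`.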

Common Incident Runbooks

Runbook 1: API 5xx Errors Spike

Symptoms: Monitoring shows >1% error rate, customer reports failures

Diagnosis:

# Check GKE pod status
kubectl get pods -n coditect-dev -l app=django-backend

# Check recent logs
kubectl logs -n coditect-dev -l app=django-backend --tail=100

# Check if pods are crashing
kubectl describe pod -n coditect-dev <pod-name>

# Check resource usage
kubectl top pods -n coditect-dev

Common Causes & Fixes:

| Cause | Fix |
|-------|-----|
| Pod OOM killed | Increase memory limits, restart pods |
| Database connection exhausted | Restart pods, check connection pool |
| External service down | Check Stripe, SendGrid status |
| Bad deployment | Rollback to previous version |
| Resource exhaustion | Scale up replicas |

Rollback Procedure:

# List revisions to find the last known-good one
kubectl rollout history deployment/django-backend -n coditect-dev

# Roll back to the previous revision
# (add --to-revision=<n> to target a specific revision from the history)
kubectl rollout undo deployment/django-backend -n coditect-dev

# Verify rollback
kubectl rollout status deployment/django-backend -n coditect-dev
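
The ">1% error rate" symptom in this runbook is a ratio of 5xx responses to all responses. As a rough sketch of that arithmetic (the function names are hypothetical, and the status codes are assumed to come from whatever access log or metrics export is available):

```python
from collections import Counter
from typing import Iterable


def error_rate_5xx(status_codes: Iterable[int]) -> float:
    """Fraction of responses with a 5xx status code (0.0 when there is no traffic)."""
    counts = Counter(code // 100 for code in status_codes)
    total = sum(counts.values())
    return counts[5] / total if total else 0.0


def breaches_threshold(status_codes: Iterable[int], threshold: float = 0.01) -> bool:
    """True when the 5xx rate exceeds the >1% symptom threshold."""
    return error_rate_5xx(status_codes) > threshold


codes = [200] * 97 + [500, 502, 503]
print(error_rate_5xx(codes), breaches_threshold(codes))  # 0.03 True
```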

Runbook 2: Database Connection Issues

Symptoms: "Connection refused" errors, slow queries, connection timeout

Diagnosis:

# Check Cloud SQL status (GCP Console or gcloud)
gcloud sql instances describe coditect-postgres --project coditect-citus-prod

# Check connection count from application
kubectl exec -n coditect-dev <pod> -- python manage.py shell -c "
from django.db import connection
cursor = connection.cursor()
cursor.execute('SELECT count(*) FROM pg_stat_activity;')
print(cursor.fetchone())
"

# Check for long-running queries
kubectl exec -n coditect-dev <pod> -- python manage.py shell -c "
from django.db import connection
cursor = connection.cursor()
cursor.execute('''
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle' AND now() - pg_stat_activity.query_start > interval '5 minutes';
''')
for row in cursor.fetchall(): print(row)
"

Fixes:

# Restart application pods (releases connections)
kubectl rollout restart deployment/django-backend -n coditect-dev

# Kill long-running queries (CAUTION)
# SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE ...;

# If Cloud SQL is unresponsive, check GCP status page
# May need to failover to replica
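
Before terminating anything, it can help to rank the rows printed by the long-running query check so the worst offenders are reviewed first. This is a sketch only: `Activity` and `long_running` are hypothetical names, and it assumes the database driver returns the `duration` interval column as a Python `timedelta`, as psycopg2 does.

```python
from datetime import timedelta
from typing import List, NamedTuple


class Activity(NamedTuple):
    pid: int
    duration: timedelta
    query: str


def long_running(rows: List[Activity],
                 threshold: timedelta = timedelta(minutes=5)) -> List[Activity]:
    """Return rows running longer than the threshold, longest first."""
    over = [row for row in rows if row.duration > threshold]
    return sorted(over, key=lambda row: row.duration, reverse=True)


rows = [
    Activity(101, timedelta(minutes=7), "SELECT ..."),
    Activity(102, timedelta(seconds=20), "SELECT ..."),
    Activity(103, timedelta(minutes=42), "UPDATE ..."),
]
print([row.pid for row in long_running(rows)])  # [103, 101]
```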

Runbook 3: License Validation Failures

Symptoms: Customers cannot validate licenses, "License invalid" errors

Diagnosis:

# Check license service logs
kubectl logs -n coditect-dev -l app=django-backend --tail=100 | grep -i license

# Test license validation endpoint
curl -X POST https://api.coditect.ai/api/v1/licenses/validate/ \
-H "Content-Type: application/json" \
-d '{"key_string": "TEST-LICENSE-KEY"}'

# Check Redis connectivity (for seat management)
kubectl exec -n coditect-dev <pod> -- python -c "
import redis
r = redis.from_url('redis://...')
print(r.ping())
"

Common Causes:

| Cause | Fix |
|-------|-----|
| Redis down | Check Redis Memorystore status |
| Database issues | See Runbook 2 |
| Certificate expired | Renew SSL certificates |
| Rate limiting | Check rate limit configuration |
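
When curl is not available (for example, from inside an application pod), the validation probe above can be reproduced with the Python standard library. The sketch below mirrors the curl command's URL, header, and body; `build_validate_request` and `probe` are hypothetical helpers, and the response format of the endpoint is not assumed beyond its HTTP status code.

```python
import json
import urllib.error
import urllib.request

VALIDATE_URL = "https://api.coditect.ai/api/v1/licenses/validate/"


def build_validate_request(key_string: str) -> urllib.request.Request:
    """Build the same POST the curl command sends, without sending it."""
    body = json.dumps({"key_string": key_string}).encode("utf-8")
    return urllib.request.Request(
        VALIDATE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def probe(key_string: str, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(build_validate_request(key_string), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # A 4xx/5xx response still shows that the service answered.
        return err.code
```

Calling `probe("TEST-LICENSE-KEY")` returns the status code; a connection error or timeout raises `urllib.error.URLError`, which itself points toward a network or ingress problem rather than license logic.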

Runbook 4: High Latency

Symptoms: Response times >2s, slow dashboard, customer complaints

Diagnosis:

# Check application metrics
kubectl top pods -n coditect-dev

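# Check the longest-running active queries in the database
kubectl exec -n coditect-dev <pod> -- python manage.py shell -c "
from django.db import connection
with connection.cursor() as cursor:
    cursor.execute('''
        SELECT pid, now() - query_start AS duration, left(query, 120)
        FROM pg_stat_activity
        WHERE state = 'active'
        ORDER BY duration DESC LIMIT 10;
    ''')
    for row in cursor.fetchall(): print(row)
"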

# Check external dependency latency (prints total request time in seconds)
curl -w "total: %{time_total}s\n" -o /dev/null -s https://api.stripe.com/v1/charges

Fixes:

| Cause | Fix |
|-------|-----|
| CPU saturation | Scale horizontally, optimize code |
| Slow queries | Add indexes, optimize queries |
| External API slow | Add timeouts, circuit breaker |
| Memory pressure | Increase limits, optimize memory |
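
The "circuit breaker" fix for slow external APIs means failing fast once a dependency has failed repeatedly, then retrying after a cool-down. The following is a minimal, single-threaded sketch of that pattern, not the platform's actual implementation: the class names, the failure limit of 3, and the 30-second cool-down are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when calls are short-circuited because the dependency is failing."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open the circuit
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

A caller would wrap each outbound request, for example `breaker.call(lambda: client.fetch())` with a hypothetical `client`, and treat `CircuitOpenError` as a signal to serve a fallback instead of waiting on a timeout.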

Runbook 5: GKE Cluster Issues

Symptoms: Nodes unhealthy, pods not scheduling, cluster unreachable

Diagnosis:

# Check node status
kubectl get nodes
kubectl describe node <node-name>

# Check system pods
kubectl get pods -n kube-system

# Check cluster events
kubectl get events --sort-by='.lastTimestamp' -A | tail -50

# Check GKE cluster status
gcloud container clusters describe coditect-gke-cluster \
--region us-central1 --project coditect-citus-prod

Fixes:

# Cordon unhealthy node
kubectl cordon <node-name>

# Drain node for maintenance
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# For cluster-wide issues, check GCP status page
# May need to contact GCP support for SEV-1
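
On a large cluster, scanning `kubectl get nodes` output by eye is slow. The sketch below filters the JSON form of the same listing (`kubectl get nodes -o json`) for nodes whose `Ready` condition is not `"True"`. `not_ready_nodes` is a hypothetical helper, and the sample document is fabricated purely to show the shape of the input.

```python
import json
from typing import List


def not_ready_nodes(nodes_json: str) -> List[str]:
    """Return names of nodes whose Ready condition is not "True"."""
    unhealthy = []
    for node in json.loads(nodes_json).get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing Ready condition, or a status of "False"/"Unknown", counts as unhealthy.
        if ready is None or ready.get("status") != "True":
            unhealthy.append(node["metadata"]["name"])
    return unhealthy


sample = json.dumps({"items": [
    {"metadata": {"name": "node-a"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "node-b"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
]})
print(not_ready_nodes(sample))  # ['node-b']
```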

Runbook 6: Security Incident

Symptoms: Unauthorized access, suspicious activity, data exposure

IMMEDIATE ACTIONS:

  1. DO NOT delete logs or evidence
  2. Contain - Isolate affected systems if safe
  3. Escalate - Page CTO immediately (SEV-1)
  4. Preserve - Take snapshots of affected systems
  5. Document - Record all observations with timestamps

Investigation:

# Check authentication logs
kubectl logs -n coditect-dev -l app=django-backend | grep -i "auth\|login\|401\|403"

# Check for unusual API activity
# Review Cloud Audit Logs in GCP Console

# Check for unauthorized access
gcloud logging read 'resource.type="gce_instance" AND protoPayload.methodName:"ssh"' \
--limit=50 --project coditect-citus-prod

DO NOT:

  • Attempt to "clean up" the breach
  • Communicate externally without legal approval
  • Make changes that destroy evidence
  • Assume the scope is limited

Communication Templates

Status Page - Investigating

[Investigating] Elevated Error Rates on API

We are currently investigating elevated error rates affecting the CODITECT API.
Some users may experience intermittent failures when using the platform.

We are actively working to identify and resolve the issue.
Updates will be provided every 15 minutes.

Posted: [TIME] UTC

Status Page - Identified

[Identified] Elevated Error Rates on API

The issue has been identified as [brief description].
Our team is implementing a fix.

Affected services:
- License validation
- Seat management

Workaround: [if available]

Next update: [TIME] UTC

Status Page - Resolved

[Resolved] Elevated Error Rates on API

The issue affecting the CODITECT API has been resolved.
All services are operating normally.

Root cause: [brief description]
Duration: [start time] - [end time] UTC

We apologize for any inconvenience. A full post-mortem will be published
within 48 hours.

Resolved: [TIME] UTC
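
Filling the bracketed placeholders by hand invites timezone mistakes, so the status page templates may be worth rendering programmatically. This sketch uses a shortened version of the Investigating template; `render_investigating` is a hypothetical helper, and the `YYYY-MM-DD HH:MM` timestamp format is an assumption, since the templates only specify `[TIME] UTC`.

```python
from datetime import datetime, timedelta, timezone
from string import Template

INVESTIGATING = Template(
    "[Investigating] $title\n\n"
    "We are currently investigating $summary.\n"
    "Updates will be provided every $interval minutes.\n\n"
    "Posted: $posted UTC"
)


def render_investigating(title: str, summary: str, posted: datetime,
                         interval_min: int = 15) -> str:
    """Fill the Investigating template; the timestamp is converted to UTC first."""
    posted_utc = posted.astimezone(timezone.utc)
    return INVESTIGATING.substitute(
        title=title,
        summary=summary,
        interval=interval_min,
        posted=posted_utc.strftime("%Y-%m-%d %H:%M"),
    )


local = datetime(2026, 1, 8, 9, 30, tzinfo=timezone(timedelta(hours=-5)))
print(render_investigating(
    "Elevated Error Rates on API",
    "elevated error rates affecting the CODITECT API",
    local,
))
```

The `posted` argument should be a timezone-aware `datetime`; a naive value would be interpreted as local time by `astimezone`.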

Customer Email - Major Incident

Subject: [CODITECT] Service Disruption - [Brief Description]

Dear CODITECT Customer,

We are currently experiencing a service disruption affecting [affected services].

Impact: [Description of customer impact]
Status: Our team is actively working to resolve the issue
ETA: [Estimated resolution time if known]

We apologize for any inconvenience this may cause. We will send an update
when the issue is resolved.

For urgent inquiries: 1@az1.ai

Best regards,
CODITECT Operations Team

Escalation Contacts

On-Call Rotation

| Role | Primary | Backup |
|------|---------|--------|
| On-Call Engineer | [Check PagerDuty] | [Check PagerDuty] |
| Engineering Lead | [Name] | [Name] |
| CTO | Hal Casteel | [Backup] |

Escalation Triggers

| Condition | Escalate To |
|-----------|-------------|
| SEV-1 detected | CTO immediately |
| SEV-2 > 1 hour | Engineering Lead |
| Security incident | CTO + Legal |
| Data breach | CTO + Legal + CEO |
| Customer escalation | Account Manager + Engineering Lead |
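
The trigger table can be expressed as a small decision function so that paging automation and humans follow the same rules. This is an illustrative sketch: `escalation_targets` and its parameters are hypothetical, the customer escalation row is omitted for brevity, and how multiple matching rows combine (here, a de-duplicated union) is an assumption.

```python
from typing import List


def escalation_targets(severity: str, minutes_open: int = 0,
                       security_incident: bool = False,
                       data_breach: bool = False) -> List[str]:
    """Return who to escalate to, following the Escalation Triggers table."""
    targets: List[str] = []
    if data_breach:
        targets += ["CTO", "Legal", "CEO"]
    elif security_incident:
        targets += ["CTO", "Legal"]
    if severity == "SEV-1" and "CTO" not in targets:
        targets.append("CTO")
    if severity == "SEV-2" and minutes_open > 60:
        targets.append("Engineering Lead")
    return targets


print(escalation_targets("SEV-2", minutes_open=90))      # ['Engineering Lead']
print(escalation_targets("SEV-1", data_breach=True))     # ['CTO', 'Legal', 'CEO']
```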

External Contacts

| Service | Contact |
|---------|---------|
| GCP Support | [Support Portal] |
| Stripe Support | [Support Portal] |
| SendGrid Support | [Support Portal] |

Post-Incident Process

Post-Mortem Template

# Post-Mortem: [Incident Title]

**Date:** YYYY-MM-DD
**Duration:** HH:MM
**Severity:** SEV-X
**Authors:** [Names]

## Summary
[2-3 sentence summary of what happened]

## Impact
- Customers affected: [Number]
- Duration: [Time]
- Revenue impact: [If applicable]

## Timeline (All times UTC)
- HH:MM - [Event]
- HH:MM - [Event]
- HH:MM - [Event]

## Root Cause
[Detailed technical explanation]

## Resolution
[What fixed the issue]

## What Went Well
- [Item]
- [Item]

## What Went Poorly
- [Item]
- [Item]

## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Action] | [Name] | [Date] | [ ] |

## Lessons Learned
[Key takeaways]

Post-Mortem Process

  1. Schedule meeting - Within 48 hours of SEV-1/2 resolution
  2. Gather timeline - Collect logs, messages, decisions
  3. Blameless analysis - Focus on systems, not individuals
  4. Identify actions - Concrete improvements with owners
  5. Share learnings - Publish internally (and externally if appropriate)
  6. Track actions - Follow up on completion
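
For step 2, events collected from logs and chat usually arrive out of order and in mixed timezones. A short sketch of normalizing them into the `HH:MM - [Event]` lines used by the post-mortem template follows; `format_timeline` is a hypothetical helper and expects timezone-aware `datetime` values.

```python
from datetime import datetime, timezone
from typing import List, Tuple


def format_timeline(events: List[Tuple[datetime, str]]) -> str:
    """Render events as "- HH:MM - text" lines in UTC, oldest first."""
    ordered = sorted(events, key=lambda event: event[0])
    return "\n".join(
        f"- {ts.astimezone(timezone.utc):%H:%M} - {text}" for ts, text in ordered
    )


events = [
    (datetime(2026, 1, 8, 14, 10, tzinfo=timezone.utc), "Rollback started"),
    (datetime(2026, 1, 8, 14, 2, tzinfo=timezone.utc), "Alert fired"),
]
print(format_timeline(events))
# - 14:02 - Alert fired
# - 14:10 - Rollback started
```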

Appendix

Useful Commands Quick Reference

# GKE
kubectl get pods -n coditect-dev
kubectl logs -n coditect-dev <pod> --tail=100
kubectl rollout restart deployment/django-backend -n coditect-dev
kubectl rollout undo deployment/django-backend -n coditect-dev

# Cloud SQL
gcloud sql instances describe coditect-postgres --project coditect-citus-prod

# Redis
redis-cli -h <redis-host> ping

# API Health
curl https://api.coditect.ai/api/v1/health/
curl https://api.coditect.ai/api/v1/health/ready/
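
The two health checks above can also be polled from a script during recovery. The sketch below is written under assumptions: `http_status` and `check_health` are hypothetical helpers, and treating HTTP 200 as "healthy" is an assumption, since this playbook does not document the endpoints' response contract.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

HEALTH_URLS = [
    "https://api.coditect.ai/api/v1/health/",
    "https://api.coditect.ai/api/v1/health/ready/",
]


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def check_health(fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each health URL to True when it answers 200, False otherwise."""
    results: Dict[str, bool] = {}
    for url in HEALTH_URLS:
        try:
            results[url] = fetch(url) == 200
        except OSError:
            # Connection failures and timeouts are reported as unhealthy.
            results[url] = False
    return results
```

Passing a custom `fetch` function keeps the check testable without network access; calling `check_health()` with no arguments performs the real requests.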

Monitoring Dashboards

| Dashboard | URL | Purpose |
|-----------|-----|---------|
| GKE Monitoring | [GCP Console] | Cluster health |
| Application Metrics | [Grafana URL] | API performance |
| Error Tracking | [Sentry URL] | Application errors |
| Uptime Monitoring | [Uptime Robot] | External availability |

Document Control:

  • Review: Quarterly
  • Owner: Operations Team
  • Approved: CTO

Revision History:

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0.0 | 2026-01-08 | Claude | Initial version |