CODITECT Emergency Response Playbook

Version: 1.0.0
Last Updated: 2026-01-08
Owner: AZ1.AI INC Operations Team
Classification: Internal - Operations

Table of Contents

1. Overview
2. Severity Levels
3. Incident Response Process
4. Common Incident Runbooks
5. Communication Templates
6. Escalation Contacts
7. Post-Incident Process

Overview

This playbook provides standardized procedures for responding to production incidents affecting the CODITECT platform. All team members should be familiar with this document before participating in on-call rotations.

Guiding Principles

1. Safety First - Protect customer data and system integrity
2. Communicate Early - Inform stakeholders as soon as impact is known
3. Document Everything - Keep a timeline in real time for the post-mortem
4. Escalate Quickly - When in doubt, escalate to senior team members
5. Customer Focus - Prioritize restoring the customer experience

Severity Levels

SEV-1: Critical

Definition: Complete service outage or security breach affecting all customers.

Attribute	Details
Response Time	< 15 minutes
Update Frequency	Every 15 minutes
Escalation	Immediate to CTO
Status Page	Major Outage
Customer Comms	Required

Examples:

API completely down (5xx for all requests)
Database corruption or data loss
Security breach or data exposure
License validation failing for all customers
Payment processing completely down

SEV-2: High

Definition: Major functionality impaired or significant customer impact.

Attribute	Details
Response Time	< 30 minutes
Update Frequency	Every 30 minutes
Escalation	Team lead within 1 hour
Status Page	Partial Outage
Customer Comms	If > 1 hour

Examples:

License validation intermittently failing
Significant latency (>5s response times)
One region/availability zone down
Critical feature broken (e.g., seat management)
Celery queue backed up significantly

SEV-3: Medium

Definition: Minor functionality impaired, workaround available.

Attribute	Details
Response Time	< 2 hours
Update Frequency	Every 2 hours
Escalation	If not resolved in 4 hours
Status Page	Degraded Performance
Customer Comms	No

Examples:

Non-critical feature broken
Minor latency increase
Single customer affected
Documentation site down
Admin dashboard issues

SEV-4: Low

Definition: Minimal impact, cosmetic issues, or monitoring alerts.

Attribute	Details
Response Time	Next business day
Update Frequency	Daily
Escalation	N/A
Status Page	No update
Customer Comms	No

Examples:

UI cosmetic issues
Non-customer-facing errors
Minor log warnings
Performance optimization opportunities
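
The targets in the four severity tables can be kept in one lookup so that alerting or chat tooling applies the same numbers. The following is only a minimal sketch: `SeverityPolicy`, `SEVERITY_POLICIES`, and `must_page_on_call` are hypothetical names, not part of any existing CODITECT tooling, and `None` is used here to stand for "next business day" and "daily".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    """Targets copied from the severity tables in this playbook."""
    response_minutes: Optional[int]  # None = next business day
    update_minutes: Optional[int]    # None = daily updates
    status_page: str


# Hypothetical lookup table; the values mirror the tables above.
SEVERITY_POLICIES = {
    "SEV-1": SeverityPolicy(15, 15, "Major Outage"),
    "SEV-2": SeverityPolicy(30, 30, "Partial Outage"),
    "SEV-3": SeverityPolicy(120, 120, "Degraded Performance"),
    "SEV-4": SeverityPolicy(None, None, "No update"),
}


def must_page_on_call(severity: str) -> bool:
    """The triage checklist pages on-call for SEV-1 and SEV-2 only."""
    return severity in ("SEV-1", "SEV-2")
```

Keeping the values in a frozen dataclass makes accidental edits at runtime fail loudly, which may be useful if several tools import the same table.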

Incident Response Process

Phase 1: Detection & Triage (0-15 min)

┌─────────────────────────────────────────────────────────────────┐
│                        INCIDENT DETECTED                         │
│                              │                                   │
│      ┌───────────────────────┼───────────────────────┐          │
│      ▼                       ▼                       ▼          │
│  Monitoring              Customer               Manual          │
│   Alert                  Report                 Discovery       │
│      │                       │                       │          │
│      └───────────────────────┼───────────────────────┘          │
│                              ▼                                   │
│                    ┌─────────────────┐                          │
│                    │ TRIAGE CHECKLIST│                          │
│                    │ □ Verify impact │                          │
│                    │ □ Assign SEV    │                          │
│                    │ □ Create ticket │                          │
│                    │ □ Page on-call  │                          │
│                    └─────────────────┘                          │
└─────────────────────────────────────────────────────────────────┘

Triage Checklist:

1. Verify the incident - Confirm it's not a false positive
2. Assess impact - How many customers? Which features?
3. Assign severity - Use definitions above
4. Create incident ticket - Use template below
5. Page on-call (SEV-1/2) - Use PagerDuty or Slack
6. Start incident channel - #incident-YYYY-MM-DD-description
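
The channel naming convention `#incident-YYYY-MM-DD-description` is easy to get slightly wrong under pressure. As an illustration, a small helper could derive the name from a free-text summary; `incident_channel_name` is a hypothetical function, and the slug rules (lowercase, hyphens for everything non-alphanumeric) are an assumption rather than a documented policy.

```python
import re
from datetime import date


def incident_channel_name(day: date, description: str) -> str:
    """Build '#incident-YYYY-MM-DD-description' from a free-text summary."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#incident-{day.isoformat()}-{slug}"


print(incident_channel_name(date(2026, 1, 8), "API 5xx spike"))
# prints: #incident-2026-01-08-api-5xx-spike
```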

Phase 2: Response & Mitigation (15 min - Resolution)

Incident Commander Responsibilities:

Coordinate response efforts
Make mitigation decisions
Manage communication
Request resources
Document timeline

Response Team Responsibilities:

Investigate root cause
Implement fixes
Test changes
Deploy to production
Monitor recovery

Communication Lead Responsibilities:

Update status page
Send customer communications
Update internal stakeholders
Document decisions

Phase 3: Resolution & Recovery

1. Confirm resolution - Verify customer experience restored
2. Monitor stability - Watch metrics for 30+ minutes
3. Update status page - Mark incident resolved
4. Send resolution comms - If customer comms were sent
5. Schedule post-mortem - Within 48 hours for SEV-1/2
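
The two time limits in this phase (a stability watch of at least 30 minutes, and a post-mortem within 48 hours for SEV-1/2) can be computed rather than remembered. This is a rough sketch with hypothetical function names; it assumes the 48-hour window starts at resolution, as the Post-Mortem Process section states.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

POSTMORTEM_WINDOW = timedelta(hours=48)   # SEV-1/2 only
STABILITY_WATCH = timedelta(minutes=30)   # minimum watch before marking resolved


def postmortem_deadline(resolved_at: datetime, severity: str) -> Optional[datetime]:
    """Latest time to hold the post-mortem, or None when none is required."""
    if severity in ("SEV-1", "SEV-2"):
        return resolved_at + POSTMORTEM_WINDOW
    return None


def earliest_all_clear(confirmed_at: datetime) -> datetime:
    """Earliest time the status page may be marked resolved."""
    return confirmed_at + STABILITY_WATCH


fixed = datetime(2026, 1, 8, 12, 0, tzinfo=timezone.utc)
print(earliest_all_clear(fixed).isoformat())            # 2026-01-08T12:30:00+00:00
print(postmortem_deadline(fixed, "SEV-1").isoformat())  # 2026-01-10T12:00:00+00:00
```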

Common Incident Runbooks

Runbook 1: API 5xx Errors Spike

Symptoms: Monitoring shows >1% error rate, customer reports failures
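
To make the ">1% error rate" symptom concrete, the check reduces to a ratio of 5xx responses to total requests over some window. The sketch below is illustrative only: the function names are hypothetical, and where the request counts come from (load balancer metrics, Sentry, application logs) is left open.

```python
ERROR_RATE_THRESHOLD = 0.01  # the ">1% error rate" symptom


def error_rate(total_requests: int, server_errors: int) -> float:
    """Fraction of requests that returned a 5xx status."""
    if total_requests <= 0:
        return 0.0  # no traffic in the window: nothing to measure
    return server_errors / total_requests


def is_5xx_spike(total_requests: int, server_errors: int) -> bool:
    """True when the 5xx share is strictly above the runbook threshold."""
    return error_rate(total_requests, server_errors) > ERROR_RATE_THRESHOLD


print(is_5xx_spike(1000, 11))  # True: 1.1% of requests failed
print(is_5xx_spike(1000, 10))  # False: exactly 1% does not exceed the threshold
```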

Diagnosis:

# Check GKE pod status
kubectl get pods -n coditect-dev -l app=django-backend

# Check recent logs
kubectl logs -n coditect-dev -l app=django-backend --tail=100

# Check if pods are crashing
kubectl describe pod -n coditect-dev <pod-name>

# Check resource usage
kubectl top pods -n coditect-dev

Common Causes & Fixes:

Cause	Fix
Pod OOM killed	Increase memory limits, restart pods
Database connection exhausted	Restart pods, check connection pool
External service down	Check Stripe, SendGrid status
Bad deployment	Rollback to previous version
Resource exhaustion	Scale up replicas

Rollback Procedure:

# Find previous working image
kubectl rollout history deployment/django-backend -n coditect-dev

# Rollback to previous version
kubectl rollout undo deployment/django-backend -n coditect-dev

# Verify rollback
kubectl rollout status deployment/django-backend -n coditect-dev

Runbook 2: Database Connection Issues

Symptoms: "Connection refused" errors, slow queries, connection timeout

Diagnosis:

# Check Cloud SQL status (GCP Console or gcloud)
gcloud sql instances describe coditect-postgres --project coditect-citus-prod

# Check connection count from application
# (manage.py shell loads Django settings; plain "python -c" only works
#  if DJANGO_SETTINGS_MODULE is already set in the pod)
kubectl exec -n coditect-dev <pod> -- python manage.py shell -c "
from django.db import connection
cursor = connection.cursor()
cursor.execute('SELECT count(*) FROM pg_stat_activity;')
print(cursor.fetchone())
"

# Check for long-running queries (non-idle for more than 5 minutes)
kubectl exec -n coditect-dev <pod> -- python manage.py shell -c "
from django.db import connection
cursor = connection.cursor()
cursor.execute('''
  SELECT pid, now() - pg_stat_activity.query_start AS duration, query
  FROM pg_stat_activity
  WHERE state != 'idle' AND now() - pg_stat_activity.query_start > interval '5 minutes';
''')
for row in cursor.fetchall(): print(row)
"

Fixes:

# Restart application pods (releases connections)
kubectl rollout restart deployment/django-backend -n coditect-dev

# Kill long-running queries (CAUTION)
# SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE ...;

# If Cloud SQL is unresponsive, check GCP status page
# May need to failover to replica

Runbook 3: License Validation Failures

Symptoms: Customers cannot validate licenses, "License invalid" errors

Diagnosis:

# Check license service logs
kubectl logs -n coditect-dev -l app=django-backend --tail=100 | grep -i license

# Test license validation endpoint
curl -X POST https://api.coditect.ai/api/v1/licenses/validate/ \
  -H "Content-Type: application/json" \
  -d '{"key_string": "TEST-LICENSE-KEY"}'

# Check Redis connectivity (for seat management)
kubectl exec -n coditect-dev <pod> -- python -c "
import redis
r = redis.from_url('redis://...')
print(r.ping())
"

Common Causes:

Cause	Fix
Redis down	Check Redis Memorystore status
Database issues	See Runbook 2
Certificate expired	Renew SSL certificates
Rate limiting	Check rate limit configuration
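
When curl is not available, for example inside a slim container, the same validation request can be sent from Python's standard library. This is a sketch under assumptions: `build_validate_request` and `probe` are hypothetical helpers, the URL and payload are copied from the curl command in the diagnosis, and the response format of the endpoint is not assumed, so only the HTTP status is returned.

```python
import json
import urllib.error
import urllib.request

VALIDATE_URL = "https://api.coditect.ai/api/v1/licenses/validate/"


def build_validate_request(key_string: str) -> urllib.request.Request:
    """Build the same POST that the curl command in the diagnosis sends."""
    body = json.dumps({"key_string": key_string}).encode("utf-8")
    return urllib.request.Request(
        VALIDATE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def probe(key_string: str, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code (needs network access)."""
    request = build_validate_request(key_string)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a useful status
```

Calling `probe("TEST-LICENSE-KEY")` from a pod and from a workstation helps separate an application fault from a network or certificate problem on the public endpoint.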

Runbook 4: High Latency

Symptoms: Response times >2s, slow dashboard, customer complaints

Diagnosis:

# Check application metrics
kubectl top pods -n coditect-dev

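# Check slow queries in database (lists the longest-running active queries)
kubectl exec -n coditect-dev <pod> -- python manage.py shell -c "
from django.db import connection
with connection.cursor() as cursor:
    cursor.execute('''
        SELECT pid, now() - query_start AS duration, left(query, 120)
        FROM pg_stat_activity WHERE state = 'active'
        ORDER BY duration DESC LIMIT 10;
    ''')
    for row in cursor.fetchall():
        print(row)
"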

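# Check external dependencies (inline timing format, no curl-format.txt file needed;
# an unauthenticated request is expected to be rejected, but the timings are still valid)
curl -s -o /dev/null \
  -w 'connect: %{time_connect}s  first byte: %{time_starttransfer}s  total: %{time_total}s\n' \
  https://api.stripe.com/v1/charges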

Fixes:

Cause	Fix
CPU saturation	Scale horizontally, optimize code
Slow queries	Add indexes, optimize queries
External API slow	Add timeouts, circuit breaker
Memory pressure	Increase limits, optimize memory
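
The "circuit breaker" fix for a slow external API may be unfamiliar, so here is a minimal sketch of the pattern. It is not the platform's implementation: `CircuitBreaker` is a hypothetical class, and the failure count and cool-down values are placeholders to tune. After a set number of consecutive failures it rejects calls immediately, then allows one trial call once the cool-down has passed.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated errors, then allow a trial call after a cool-down."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to stay open
        self.clock = clock              # injectable so tests can control time
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: skipping call")
            # Cool-down over: allow one trial call; a single failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0               # any success resets the failure count
        return result
```

In practice the wrapped function would be the outbound call itself, made with an explicit timeout, so that a slow dependency counts as a failure instead of tying up a worker.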

Runbook 5: GKE Cluster Issues

Symptoms: Nodes unhealthy, pods not scheduling, cluster unreachable

Diagnosis:

# Check node status
kubectl get nodes
kubectl describe node <node-name>

# Check system pods
kubectl get pods -n kube-system

# Check cluster events
kubectl get events --sort-by='.lastTimestamp' -A | tail -50

# Check GKE cluster status
gcloud container clusters describe coditect-gke-cluster \
  --region us-central1 --project coditect-citus-prod

Fixes:

# Cordon unhealthy node
kubectl cordon <node-name>

# Drain node for maintenance
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# For cluster-wide issues, check GCP status page
# May need to contact GCP support for SEV-1
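
On a large cluster, scanning `kubectl get nodes` output by eye is slow. As a sketch, the JSON form of the same data (`kubectl get nodes -o json`) can be filtered for nodes whose `Ready` condition is not `True`. The helper name `not_ready_nodes` and the sample payload are illustrative; the field layout follows the standard Kubernetes Node object.

```python
import json


def not_ready_nodes(node_list: dict) -> list:
    """Return names of nodes whose Ready condition is not 'True'."""
    unhealthy = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing Ready condition is treated as unhealthy too.
        if ready is None or ready.get("status") != "True":
            unhealthy.append(node["metadata"]["name"])
    return unhealthy


# Illustrative payload shaped like `kubectl get nodes -o json` output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "node-a"},
   "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
  {"metadata": {"name": "node-b"},
   "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}}
]}
""")
print(not_ready_nodes(sample))  # prints: ['node-b']
```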

Runbook 6: Security Incident

Symptoms: Unauthorized access, suspicious activity, data exposure

IMMEDIATE ACTIONS:

1. DO NOT delete logs or evidence
2. Contain - Isolate affected systems if safe
3. Escalate - Page CTO immediately (SEV-1)
4. Preserve - Take snapshots of affected systems
5. Document - Record all observations with timestamps

Investigation:

# Check authentication logs
kubectl logs -n coditect-dev -l app=django-backend | grep -i "auth\|login\|401\|403"

# Check for unusual API activity
# Review Cloud Audit Logs in GCP Console

# Check for unauthorized access
gcloud logging read 'resource.type="gce_instance" AND protoPayload.methodName:"ssh"' \
  --limit=50 --project coditect-citus-prod

DO NOT:

Attempt to "clean up" the breach
Communicate externally without legal approval
Make changes that destroy evidence
Assume the scope is limited

Communication Templates

Status Page - Investigating

[Investigating] Elevated Error Rates on API

We are currently investigating elevated error rates affecting the CODITECT API.
Some users may experience intermittent failures when using the platform.

We are actively working to identify and resolve the issue.
Updates will be provided every 15 minutes.

Posted: [TIME] UTC
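
Filling the placeholders by hand invites mistakes such as a local time posted as UTC. One possible approach, sketched here with a hypothetical `render_investigating` helper and a shortened version of the template above, is to render the post from a `string.Template` and always stamp it in UTC. The update interval is a parameter because it differs by severity (15 minutes for SEV-1, 30 for SEV-2).

```python
from datetime import datetime, timezone
from string import Template

# Shortened form of the Investigating template; wording is illustrative.
INVESTIGATING = Template(
    "[Investigating] $title\n\n"
    "We are currently investigating $summary.\n"
    "Updates will be provided every $interval minutes.\n\n"
    "Posted: $posted UTC"
)


def render_investigating(title, summary, interval_minutes, now=None):
    """Fill the template; `now` defaults to the current UTC time."""
    now = now or datetime.now(timezone.utc)
    return INVESTIGATING.substitute(
        title=title,
        summary=summary,
        interval=interval_minutes,
        posted=now.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M"),
    )
```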

Status Page - Identified

[Identified] Elevated Error Rates on API

The issue has been identified as [brief description].
Our team is implementing a fix.

Affected services:
- License validation
- Seat management

Workaround: [if available]

Next update: [TIME] UTC

Status Page - Resolved

[Resolved] Elevated Error Rates on API

The issue affecting the CODITECT API has been resolved.
All services are operating normally.

Root cause: [brief description]
Duration: [start time] - [end time] UTC

We apologize for any inconvenience. A full post-mortem will be published
within 48 hours.

Resolved: [TIME] UTC

Customer Email - Major Incident

Subject: [CODITECT] Service Disruption - [Brief Description]

Dear CODITECT Customer,

We are currently experiencing a service disruption affecting [affected services].

Impact: [Description of customer impact]
Status: Our team is actively working to resolve the issue
ETA: [Estimated resolution time if known]

We apologize for any inconvenience this may cause. We will send an update
when the issue is resolved.

For urgent inquiries: 1@az1.ai

Best regards,
CODITECT Operations Team

Escalation Contacts

On-Call Rotation

Role	Primary	Backup
On-Call Engineer	[Check PagerDuty]	[Check PagerDuty]
Engineering Lead	[Name]	[Name]
CTO	Hal Casteel	[Backup]

Escalation Triggers

Condition	Escalate To
SEV-1 detected	CTO immediately
SEV-2 > 1 hour	Engineering Lead
Security incident	CTO + Legal
Data breach	CTO + Legal + CEO
Customer escalation	Account Manager + Engineering Lead
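
The trigger table can also be expressed as a small function, which makes overlapping conditions (for example a SEV-1 that is also a data breach) unambiguous. This is a sketch only; `escalation_targets` and its parameter names are hypothetical, and the rules are transcribed from the table above.

```python
def escalation_targets(severity, minutes_open=0, security_incident=False,
                       data_breach=False, customer_escalation=False):
    """Return who to escalate to, per the Escalation Triggers table."""
    targets = []

    def add(*names):
        for name in names:
            if name not in targets:  # keep order, avoid duplicates
                targets.append(name)

    if severity == "SEV-1":
        add("CTO")
    if severity == "SEV-2" and minutes_open > 60:
        add("Engineering Lead")
    if security_incident:
        add("CTO", "Legal")
    if data_breach:
        add("CTO", "Legal", "CEO")
    if customer_escalation:
        add("Account Manager", "Engineering Lead")
    return targets


print(escalation_targets("SEV-1", data_breach=True))
# prints: ['CTO', 'Legal', 'CEO']
```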

External Contacts

Service	Contact
GCP Support	[Support Portal]
Stripe Support	[Support Portal]
SendGrid Support	[Support Portal]

Post-Incident Process

Post-Mortem Template

# Post-Mortem: [Incident Title]

**Date:** YYYY-MM-DD
**Duration:** HH:MM
**Severity:** SEV-X
**Authors:** [Names]

## Summary
[2-3 sentence summary of what happened]

## Impact
- Customers affected: [Number]
- Duration: [Time]
- Revenue impact: [If applicable]

## Timeline (All times UTC)
- HH:MM - [Event]
- HH:MM - [Event]
- HH:MM - [Event]

## Root Cause
[Detailed technical explanation]

## Resolution
[What fixed the issue]

## What Went Well
- [Item]
- [Item]

## What Went Poorly
- [Item]
- [Item]

## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Action] | [Name] | [Date] | [ ] |

## Lessons Learned
[Key takeaways]
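
Timeline entries collected during an incident often arrive in mixed time zones and out of order. The sketch below, using hypothetical helpers `timeline_line` and `render_timeline`, shows one way to produce the `- HH:MM - [Event]` lines of the template in UTC. It rejects naive timestamps rather than guessing their zone.

```python
from datetime import datetime, timezone


def timeline_line(when: datetime, event: str) -> str:
    """Format one entry as '- HH:MM - event', converting the time to UTC."""
    if when.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return f"- {when.astimezone(timezone.utc):%H:%M} - {event}"


def render_timeline(entries) -> str:
    """Sort (timestamp, event) pairs chronologically and render the section."""
    lines = [timeline_line(when, event) for when, event in sorted(entries)]
    return "\n".join(["## Timeline (All times UTC)"] + lines)
```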

Post-Mortem Process

1. Schedule meeting - Within 48 hours of SEV-1/2 resolution
2. Gather timeline - Collect logs, messages, decisions
3. Blameless analysis - Focus on systems, not individuals
4. Identify actions - Concrete improvements with owners
5. Share learnings - Publish internally (and externally if appropriate)
6. Track actions - Follow up on completion

Appendix

Useful Commands Quick Reference

# GKE
kubectl get pods -n coditect-dev
kubectl logs -n coditect-dev <pod> --tail=100
kubectl rollout restart deployment/django-backend -n coditect-dev
kubectl rollout undo deployment/django-backend -n coditect-dev

# Cloud SQL
gcloud sql instances describe coditect-postgres --project coditect-citus-prod

# Redis
redis-cli -h <redis-host> ping

# API Health
curl https://api.coditect.ai/api/v1/health/
curl https://api.coditect.ai/api/v1/health/ready/
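
The two health endpoints can also be polled from a script, for instance during the stability watch after a fix. The following is a minimal sketch; `http_status` and `check_health` are hypothetical helpers, and the only assumption about the endpoints is that a healthy service answers with HTTP 200.

```python
import urllib.error
import urllib.request

HEALTH_URLS = [
    "https://api.coditect.ai/api/v1/health/",
    "https://api.coditect.ai/api/v1/health/ready/",
]


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for a GET, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # must come first: HTTPError is a subclass of URLError
    except (urllib.error.URLError, OSError):
        return 0


def check_health(urls=HEALTH_URLS, fetch=http_status) -> dict:
    """Map each URL to True when it answers 200; `fetch` is injectable for tests."""
    return {url: fetch(url) == 200 for url in urls}
```

Passing a fake `fetch` keeps the logic testable without network access; calling `check_health()` with the defaults performs the real requests.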

Monitoring Dashboards

Dashboard	URL	Purpose
GKE Monitoring	[GCP Console]	Cluster health
Application Metrics	[Grafana URL]	API performance
Error Tracking	[Sentry URL]	Application errors
Uptime Monitoring	[Uptime Robot]	External availability

Document Control:

Review: Quarterly
Owner: Operations Team
Approved: CTO

Revision History:

Version	Date	Author	Changes
1.0.0	2026-01-08	Claude	Initial version
