CODITECT Emergency Response Playbook
Version: 1.0.0
Last Updated: 2026-01-08
Owner: AZ1.AI INC Operations Team
Classification: Internal - Operations
Table of Contents
- Overview
- Severity Levels
- Incident Response Process
- Common Incident Runbooks
- Communication Templates
- Escalation Contacts
- Post-Incident Process
- Appendix
Overview
This playbook provides standardized procedures for responding to production incidents affecting the CODITECT platform. All team members should be familiar with this document before participating in on-call rotations.
Guiding Principles
- Safety First - Protect customer data and system integrity
- Communicate Early - Inform stakeholders as soon as impact is known
- Document Everything - Keep a timeline in real time so the post-mortem has accurate data
- Escalate Quickly - When in doubt, escalate to senior team members
- Customer Focus - Prioritize restoring customer experience
Severity Levels
SEV-1: Critical
Definition: Complete service outage or security breach affecting all customers.
| Attribute | Details |
|---|---|
| Response Time | < 15 minutes |
| Update Frequency | Every 15 minutes |
| Escalation | Immediate to CTO |
| Status Page | Major Outage |
| Customer Comms | Required |
Examples:
- API completely down (5xx for all requests)
- Database corruption or data loss
- Security breach or data exposure
- License validation failing for all customers
- Payment processing completely down
SEV-2: High
Definition: Major functionality impaired or significant customer impact.
| Attribute | Details |
|---|---|
| Response Time | < 30 minutes |
| Update Frequency | Every 30 minutes |
| Escalation | Team lead within 1 hour |
| Status Page | Partial Outage |
| Customer Comms | If impact lasts > 1 hour |
Examples:
- License validation intermittently failing
- Significant latency (>5s response times)
- One region/availability zone down
- Critical feature broken (e.g., seat management)
- Celery queue backed up significantly
SEV-3: Medium
Definition: Minor functionality impaired, workaround available.
| Attribute | Details |
|---|---|
| Response Time | < 2 hours |
| Update Frequency | Every 2 hours |
| Escalation | If not resolved in 4 hours |
| Status Page | Degraded Performance |
| Customer Comms | No |
Examples:
- Non-critical feature broken
- Minor latency increase
- Single customer affected
- Documentation site down
- Admin dashboard issues
SEV-4: Low
Definition: Minimal impact, cosmetic issues, or monitoring alerts.
| Attribute | Details |
|---|---|
| Response Time | Next business day |
| Update Frequency | Daily |
| Escalation | N/A |
| Status Page | No update |
| Customer Comms | No |
Examples:
- UI cosmetic issues
- Non-customer-facing errors
- Minor log warnings
- Performance optimization opportunities
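The four severity tables above can be expressed as data, so that tooling (for example, a reminder bot in the incident channel) applies the same numbers people read here. The following is a minimal sketch; `SeverityPolicy`, `SEVERITY_POLICIES`, and `next_update_due` are illustrative names and not part of any existing CODITECT code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    """One row set from the severity tables in this playbook."""
    response_time: Optional[timedelta]  # None means "next business day"
    update_interval: timedelta
    status_page: str
    customer_comms: str


# Values copied from the SEV-1 to SEV-4 tables above.
SEVERITY_POLICIES = {
    "SEV-1": SeverityPolicy(timedelta(minutes=15), timedelta(minutes=15),
                            "Major Outage", "Required"),
    "SEV-2": SeverityPolicy(timedelta(minutes=30), timedelta(minutes=30),
                            "Partial Outage", "If > 1 hour"),
    "SEV-3": SeverityPolicy(timedelta(hours=2), timedelta(hours=2),
                            "Degraded Performance", "No"),
    "SEV-4": SeverityPolicy(None, timedelta(days=1), "No update", "No"),
}


def next_update_due(severity: str, last_update: datetime) -> datetime:
    """Return when the next stakeholder update is due for this severity."""
    return last_update + SEVERITY_POLICIES[severity].update_interval
```

Keeping the policy in one table means a change to a response target is made once and picked up by both the document and any automation that reads it.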
Incident Response Process
Phase 1: Detection & Triage (0-15 min)
┌─────────────────────────────────────────────────────────────────┐
│                        INCIDENT DETECTED                        │
│                                │                                │
│        ┌───────────────────────┼───────────────────────┐        │
│        ▼                       ▼                       ▼        │
│    Monitoring              Customer                 Manual      │
│      Alert                  Report                 Discovery    │
│        │                       │                       │        │
│        └───────────────────────┼───────────────────────┘        │
│                                ▼                                │
│                       ┌─────────────────┐                       │
│                       │ TRIAGE CHECKLIST│                       │
│                       │ □ Verify impact │                       │
│                       │ □ Assign SEV    │                       │
│                       │ □ Create ticket │                       │
│                       │ □ Page on-call  │                       │
│                       └─────────────────┘                       │
└─────────────────────────────────────────────────────────────────┘
Triage Checklist:
- Verify the incident - Confirm it's not a false positive
- Assess impact - How many customers? Which features?
- Assign severity - Use definitions above
- Create incident ticket - Use template below
- Page on-call (SEV-1/2) - Use PagerDuty or Slack
- Start incident channel - #incident-YYYY-MM-DD-description
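The channel naming pattern in the last checklist item (`#incident-YYYY-MM-DD-description`) is easy to get slightly wrong under pressure. A small helper can generate it consistently. This is a sketch only: the function name is hypothetical, and lowercasing the description into a hyphenated slug is an assumption, not a rule stated in this playbook.

```python
import re
from datetime import date


def incident_channel_name(day: date, description: str) -> str:
    """Build a channel name following #incident-YYYY-MM-DD-description."""
    # Assumption: collapse anything that is not a-z or 0-9 into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#incident-{day.isoformat()}-{slug}"


print(incident_channel_name(date(2026, 1, 8), "API 5xx errors spike"))
# prints: #incident-2026-01-08-api-5xx-errors-spike
```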
Phase 2: Response & Mitigation (15 min - Resolution)
Incident Commander Responsibilities:
- Coordinate response efforts
- Make mitigation decisions
- Manage communication
- Request resources
- Document timeline
Response Team Responsibilities:
- Investigate root cause
- Implement fixes
- Test changes
- Deploy to production
- Monitor recovery
Communication Lead Responsibilities:
- Update status page
- Send customer communications
- Update internal stakeholders
- Document decisions
Phase 3: Resolution & Recovery
- Confirm resolution - Verify that customer experience is restored
- Monitor stability - Watch metrics for 30+ minutes
- Update status page - Mark incident resolved
- Send resolution comms - If customer comms were sent
- Schedule post-mortem - Within 48 hours for SEV-1/2
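The "monitor stability" step can be made concrete as a check over recent error-rate samples. The sketch below assumes samples arrive as `(timestamp, error_rate)` pairs from whatever monitoring system is in use, and it borrows the 1% threshold from the symptom line in Runbook 1. Both the input shape and the reuse of that threshold are assumptions.

```python
from datetime import datetime, timedelta


def is_stable(samples, now, window=timedelta(minutes=30), max_error_rate=0.01):
    """Return True if every sample in the last `window` is at or below the threshold.

    `samples` is an iterable of (timestamp, error_rate) tuples.
    Returns False when history does not yet cover the whole window.
    """
    ordered = sorted(samples)
    cutoff = now - window
    # Require at least one sample at or before the cutoff, otherwise we have
    # not actually observed the full window.
    if not ordered or ordered[0][0] > cutoff:
        return False
    recent = [rate for ts, rate in ordered if ts >= cutoff]
    return bool(recent) and all(rate <= max_error_rate for rate in recent)
```

Returning `False` on insufficient history is deliberate: it prevents marking an incident resolved five minutes after a fix just because those five minutes looked clean.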
Common Incident Runbooks
Runbook 1: API 5xx Errors Spike
Symptoms: Monitoring shows a >1% error rate; customers report failed requests
Diagnosis:
# Check GKE pod status
kubectl get pods -n coditect-dev -l app=django-backend
# Check recent logs
kubectl logs -n coditect-dev -l app=django-backend --tail=100
# Check if pods are crashing
kubectl describe pod -n coditect-dev <pod-name>
# Check resource usage
kubectl top pods -n coditect-dev
Common Causes & Fixes:
| Cause | Fix |
|---|---|
| Pod OOM killed | Increase memory limits, restart pods |
| Database connection exhausted | Restart pods, check connection pool |
| External service down | Check Stripe, SendGrid status |
| Bad deployment | Rollback to previous version |
| Resource exhaustion | Scale up replicas |
Rollback Procedure:
# Find previous working image
kubectl rollout history deployment/django-backend -n coditect-dev
# Rollback to previous version
kubectl rollout undo deployment/django-backend -n coditect-dev
# Verify rollback
kubectl rollout status deployment/django-backend -n coditect-dev
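To confirm whether the ">1% error rate" symptom applies, before or after a rollback, the rate can be computed from a sample of response status codes. This is a minimal sketch: how the status codes are collected (access logs, load balancer metrics) is left open, and the function names are illustrative.

```python
from typing import Iterable


def error_rate_5xx(status_codes: Iterable[int]) -> float:
    """Fraction of responses with a 5xx status; 0.0 for an empty sample."""
    codes = list(status_codes)
    if not codes:
        return 0.0
    return sum(1 for code in codes if 500 <= code <= 599) / len(codes)


def exceeds_threshold(status_codes: Iterable[int], threshold: float = 0.01) -> bool:
    """True when the 5xx rate is strictly above the threshold (default 1%)."""
    return error_rate_5xx(status_codes) > threshold
```

Running the same check on a sample taken after `kubectl rollout undo` gives a quick signal of whether the rollback actually reduced errors.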
Runbook 2: Database Connection Issues
Symptoms: "Connection refused" errors, slow queries, connection timeout
Diagnosis:
# Check Cloud SQL status (GCP Console or gcloud)
gcloud sql instances describe coditect-postgres --project coditect-citus-prod
# Check connection count from application
kubectl exec -n coditect-dev <pod> -- python -c "
from django.db import connection
cursor = connection.cursor()
cursor.execute('SELECT count(*) FROM pg_stat_activity;')
print(cursor.fetchone())
"
# Check for long-running queries
kubectl exec -n coditect-dev <pod> -- python -c "
from django.db import connection
cursor = connection.cursor()
cursor.execute('''
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle' AND now() - pg_stat_activity.query_start > interval '5 minutes';
''')
for row in cursor.fetchall(): print(row)
"
Fixes:
# Restart application pods (releases connections)
kubectl rollout restart deployment/django-backend -n coditect-dev
# Kill long-running queries (CAUTION)
# SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE ...;
# If Cloud SQL is unresponsive, check GCP status page
# May need to failover to replica
Runbook 3: License Validation Failures
Symptoms: Customers cannot validate licenses, "License invalid" errors
Diagnosis:
# Check license service logs
kubectl logs -n coditect-dev -l app=django-backend --tail=100 | grep -i license
# Test license validation endpoint
curl -X POST https://api.coditect.ai/api/v1/licenses/validate/ \
-H "Content-Type: application/json" \
-d '{"key_string": "TEST-LICENSE-KEY"}'
# Check Redis connectivity (for seat management)
kubectl exec -n coditect-dev <pod> -- python -c "
import redis
r = redis.from_url('redis://...')
print(r.ping())
"
Common Causes:
| Cause | Fix |
|---|---|
| Redis down | Check Redis Memorystore status |
| Database issues | See Runbook 2 |
| Certificate expired | Renew SSL certificates |
| Rate limiting | Check rate limit configuration |
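For repeated checks, the `curl` probe in the diagnosis steps can be mirrored in Python using only the standard library. The sketch below reuses the URL and `key_string` payload field shown in this runbook. The function names and the 5-second timeout are assumptions, and the response format of the endpoint is not documented here, so only the HTTP status code is returned.

```python
import json
import urllib.error
import urllib.request

VALIDATE_URL = "https://api.coditect.ai/api/v1/licenses/validate/"


def build_validation_request(key_string: str, url: str = VALIDATE_URL) -> urllib.request.Request:
    """Build the same POST request as the curl command in this runbook."""
    body = json.dumps({"key_string": key_string}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def probe_validation(key_string: str, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code (performs a network call)."""
    request = build_validation_request(key_string)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses raise HTTPError; the status code is still useful.
        return exc.code
```

A 5xx status points toward the backend, Redis, or the database, while a timeout or TLS error raised by `urlopen` points toward networking or certificates.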
Runbook 4: High Latency
Symptoms: Response times >2s, slow dashboard, customer complaints
Diagnosis:
# Check application metrics
kubectl top pods -n coditect-dev
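# Log and time SQL for a suspect code path (affects this shell session only)
kubectl exec -n coditect-dev <pod> -- python manage.py shell -c "
from django.db import connection
connection.force_debug_cursor = True  # record queries even when DEBUG=False
# ... run the suspect ORM call here ...
for q in connection.queries: print(q['time'], q['sql'][:200])
"
# Check external dependencies (prints total request time; an unauthenticated call still measures latency)
curl -w "total: %{time_total}s\n" -o /dev/null -s https://api.stripe.com/v1/charges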
Fixes:
| Cause | Fix |
|---|---|
| CPU saturation | Scale horizontally, optimize code |
| Slow queries | Add indexes, optimize queries |
| External API slow | Add timeouts, circuit breaker |
| Memory pressure | Increase limits, optimize memory |
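Averages hide slow outliers, so it helps to judge the ">2s" symptom on a high percentile of sampled response times. The sketch below computes a 95th percentile with the standard library. Choosing p95 as the percentile is an assumption, since this runbook only states the 2-second figure.

```python
import statistics


def p95(latencies_s):
    """95th percentile of a list of latencies in seconds (needs at least 2 samples)."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(latencies_s, n=20)[-1]


def latency_breach(latencies_s, threshold_s=2.0):
    """True when the p95 latency is above the threshold (default 2 seconds)."""
    return p95(latencies_s) > threshold_s
```

Comparing p95 before and after a change (scaling out, adding an index) shows whether the fix helped the slowest requests and not only the typical one.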
Runbook 5: GKE Cluster Issues
Symptoms: Nodes unhealthy, pods not scheduling, cluster unreachable
Diagnosis:
# Check node status
kubectl get nodes
kubectl describe node <node-name>
# Check system pods
kubectl get pods -n kube-system
# Check cluster events
kubectl get events --sort-by='.lastTimestamp' -A | tail -50
# Check GKE cluster status
gcloud container clusters describe coditect-gke-cluster \
--region us-central1 --project coditect-citus-prod
Fixes:
# Cordon unhealthy node
kubectl cordon <node-name>
# Drain node for maintenance
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# For cluster-wide issues, check GCP status page
# May need to contact GCP support for SEV-1
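On larger clusters, scanning `kubectl get nodes` by eye is slow. The sketch below takes the text produced by `kubectl get nodes -o json` and lists nodes whose `Ready` condition is not `"True"`. The function name is illustrative, and the script does not invoke `kubectl` itself.

```python
import json


def not_ready_nodes(nodes_json: str) -> list:
    """Return names of nodes whose Ready condition is missing or not "True"."""
    unhealthy = []
    for node in json.loads(nodes_json)["items"]:
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # Kubernetes reports Ready as "True", "False", or "Unknown".
        if ready is None or ready.get("status") != "True":
            unhealthy.append(node["metadata"]["name"])
    return unhealthy
```

The returned names are candidates for `kubectl describe node`, followed by `kubectl cordon` if the node is confirmed unhealthy.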
Runbook 6: Security Incident
Symptoms: Unauthorized access, suspicious activity, data exposure
IMMEDIATE ACTIONS:
- DO NOT delete logs or evidence
- Contain - Isolate affected systems if safe
- Escalate - Page CTO immediately (SEV-1)
- Preserve - Take snapshots of affected systems
- Document - Record all observations with timestamps
Investigation:
# Check authentication logs
kubectl logs -n coditect-dev -l app=django-backend | grep -i "auth\|login\|401\|403"
# Check for unusual API activity
# Review Cloud Audit Logs in GCP Console
# Check for unauthorized access
gcloud logging read 'resource.type="gce_instance" AND protoPayload.methodName:"ssh"' \
--limit=50 --project coditect-citus-prod
DO NOT:
- Attempt to "clean up" the breach
- Communicate externally without legal approval
- Make changes that destroy evidence
- Assume the scope is limited
Communication Templates
Status Page - Investigating
[Investigating] Elevated Error Rates on API
We are currently investigating elevated error rates affecting the CODITECT API.
Some users may experience intermittent failures when using the platform.
We are actively working to identify and resolve the issue.
Updates will be provided every 15 minutes.
Posted: [TIME] UTC
Status Page - Identified
[Identified] Elevated Error Rates on API
The issue has been identified as [brief description].
Our team is implementing a fix.
Affected services:
- License validation
- Seat management
Workaround: [if available]
Next update: [TIME] UTC
Status Page - Resolved
[Resolved] Elevated Error Rates on API
The issue affecting the CODITECT API has been resolved.
All services are operating normally.
Root cause: [brief description]
Duration: [start time] - [end time] UTC
We apologize for any inconvenience. A full post-mortem will be published
within 48 hours.
Resolved: [TIME] UTC
Customer Email - Major Incident
Subject: [CODITECT] Service Disruption - [Brief Description]
Dear CODITECT Customer,
We are currently experiencing a service disruption affecting [affected services].
Impact: [Description of customer impact]
Status: Our team is actively working to resolve the issue
ETA: [Estimated resolution time if known]
We apologize for any inconvenience this may cause. We will send an update
when the issue is resolved.
For urgent inquiries: 1@az1.ai
Best regards,
CODITECT Operations Team
Escalation Contacts
On-Call Rotation
| Role | Primary | Backup |
|---|---|---|
| On-Call Engineer | [Check PagerDuty] | [Check PagerDuty] |
| Engineering Lead | [Name] | [Name] |
| CTO | Hal Casteel | [Backup] |
Escalation Triggers
| Condition | Escalate To |
|---|---|
| SEV-1 detected | CTO immediately |
| SEV-2 > 1 hour | Engineering Lead |
| Security incident | CTO + Legal |
| Data breach | CTO + Legal + CEO |
| Customer escalation | Account Manager + Engineering Lead |
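The trigger table above can be encoded as a small function so that paging automation and people follow the same rules. This is a sketch under stated assumptions: the function name and its parameters are hypothetical, and "SEV-2 > 1 hour" is read as elapsed time since the incident was declared.

```python
from datetime import timedelta


def escalation_targets(severity=None, elapsed=timedelta(0), security_incident=False,
                       data_breach=False, customer_escalation=False):
    """Return the roles to escalate to, in table order, without duplicates."""
    targets = []

    def add(*roles):
        for role in roles:
            if role not in targets:
                targets.append(role)

    if severity == "SEV-1":
        add("CTO")
    if severity == "SEV-2" and elapsed > timedelta(hours=1):
        add("Engineering Lead")
    if security_incident:
        add("CTO", "Legal")
    if data_breach:
        add("CTO", "Legal", "CEO")
    if customer_escalation:
        add("Account Manager", "Engineering Lead")
    return targets
```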
External Contacts
| Service | Contact |
|---|---|
| GCP Support | [Support Portal] |
| Stripe Support | [Support Portal] |
| SendGrid Support | [Support Portal] |
Post-Incident Process
Post-Mortem Template
# Post-Mortem: [Incident Title]
**Date:** YYYY-MM-DD
**Duration:** HH:MM
**Severity:** SEV-X
**Authors:** [Names]
## Summary
[2-3 sentence summary of what happened]
## Impact
- Customers affected: [Number]
- Duration: [Time]
- Revenue impact: [If applicable]
## Timeline (All times UTC)
- HH:MM - [Event]
- HH:MM - [Event]
- HH:MM - [Event]
## Root Cause
[Detailed technical explanation]
## Resolution
[What fixed the issue]
## What Went Well
- [Item]
- [Item]
## What Went Poorly
- [Item]
- [Item]
## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Action] | [Name] | [Date] | [ ] |
## Lessons Learned
[Key takeaways]
Post-Mortem Process
- Schedule meeting - Within 48 hours of SEV-1/2 resolution
- Gather timeline - Collect logs, messages, decisions
- Blameless analysis - Focus on systems, not individuals
- Identify actions - Concrete improvements with owners
- Share learnings - Publish internally (and externally if appropriate)
- Track actions - Follow up on completion
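Two values in the post-mortem workflow are simple arithmetic: the meeting deadline (within 48 hours of resolution for SEV-1/2) and the `Duration: HH:MM` field of the template. A minimal sketch follows, with illustrative function names. Returning `None` for SEV-3/4 reflects that this playbook sets no deadline for them.

```python
from datetime import datetime, timedelta


def postmortem_due(resolved_at: datetime, severity: str):
    """Latest time to hold the post-mortem, or None if no deadline is defined."""
    if severity in ("SEV-1", "SEV-2"):
        return resolved_at + timedelta(hours=48)
    return None


def format_duration(start: datetime, end: datetime) -> str:
    """Format an incident duration as HH:MM for the post-mortem template."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours:02d}:{minutes:02d}"
```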
Appendix
Useful Commands Quick Reference
# GKE
kubectl get pods -n coditect-dev
kubectl logs -n coditect-dev <pod> --tail=100
kubectl rollout restart deployment/django-backend -n coditect-dev
kubectl rollout undo deployment/django-backend -n coditect-dev
# Cloud SQL
gcloud sql instances describe coditect-postgres --project coditect-citus-prod
# Redis
redis-cli -h <redis-host> ping
# API Health
curl https://api.coditect.ai/api/v1/health/
curl https://api.coditect.ai/api/v1/health/ready/
Monitoring Dashboards
| Dashboard | URL | Purpose |
|---|---|---|
| GKE Monitoring | [GCP Console] | Cluster health |
| Application Metrics | [Grafana URL] | API performance |
| Error Tracking | [Sentry URL] | Application errors |
| Uptime Monitoring | [Uptime Robot] | External availability |
Document Control:
- Review: Quarterly
- Owner: Operations Team
- Approved: CTO
Revision History:
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2026-01-08 | Claude | Initial version |