/postmortem - Blameless Incident Postmortem

Generates structured, blameless incident postmortems. The command guides you through timeline construction, 5-Whys root cause analysis, and corrective action categorization (detect/prevent/mitigate/process), then produces a postmortem document in the selected output format.

System Prompt

EXECUTION DIRECTIVE: When the user invokes this command, you MUST:

  1. IMMEDIATELY execute - no questions first
  2. Load the skill at skills/incident-postmortem-patterns/SKILL.md
  3. Gather incident details - ID, description, timeline start, impact metrics
  4. Construct timeline - chronological sequence of events from detection to resolution
  5. Perform 5-Whys analysis - iterative root cause investigation
  6. Categorize corrective actions - detect (monitoring), prevent (safeguards), mitigate (response), process (improvements)
  7. Apply blameless principle - focus on systems and processes, not individuals
  8. Generate postmortem document - structured markdown, Confluence, or Notion format
  9. Extract action items - assign owners and due dates for follow-up
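
The steps above imply a small data model: an incident, a chain of whys, and corrective actions grouped into four categories. A minimal sketch of that model follows, assuming Python dataclasses. The names `ActionCategory`, `CorrectiveAction`, and `Postmortem` are hypothetical and are not taken from the skill file.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum


class ActionCategory(Enum):
    """The four corrective action categories named in step 6."""
    DETECT = "detect"      # monitoring and alerting
    PREVENT = "prevent"    # safeguards and design
    MITIGATE = "mitigate"  # incident response
    PROCESS = "process"    # organizational improvements


@dataclass
class CorrectiveAction:
    description: str
    category: ActionCategory
    owner: str   # hypothetical field: handle of the assignee
    due: date    # hypothetical field: follow-up deadline


@dataclass
class Postmortem:
    incident_id: str
    title: str
    whys: list[str] = field(default_factory=list)
    actions: list[CorrectiveAction] = field(default_factory=list)

    def actions_by_category(self) -> dict[ActionCategory, list[CorrectiveAction]]:
        """Group actions so every category appears, even when empty."""
        grouped = {category: [] for category in ActionCategory}
        for action in self.actions:
            grouped[action.category].append(action)
        return grouped
```

Keeping every category in the grouped result, including empty ones, makes a gap visible, for example a postmortem with prevention items but no detection items.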

Usage

# Basic postmortem
/postmortem --incident "Database outage on 2026-02-01"

# With specific template
/postmortem --incident INC-1234 --template security

# From timeline start
/postmortem --incident "API degradation" --timeline-from "2026-02-01T14:30:00Z"

# Output format selection
/postmortem --incident INC-5678 --output confluence

# Action items only (for tracking)
/postmortem --incident INC-9999 --action-items-only

Options

| Option | Description | Default |
|--------|-------------|---------|
| --incident | Incident ID or description | Interactive prompt |
| --template | Template: standard, abbreviated, or security | standard |
| --timeline-from | Timeline start timestamp (ISO 8601) | Auto-detect |
| --output | Output format: markdown, confluence, or notion | markdown |
| --action-items-only | Generate only corrective actions (no full postmortem) | false |
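
To make the option semantics concrete, here is a rough sketch of how these flags and defaults could be modeled with Python's standard `argparse`. It is an illustration only: the command itself is a prompt-driven slash command, and `build_parser` and `parse_iso8601` are hypothetical helpers.

```python
import argparse
from datetime import datetime


def parse_iso8601(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="/postmortem")
    # None stands in for the "Interactive prompt" default.
    parser.add_argument("--incident", default=None)
    parser.add_argument(
        "--template",
        choices=["standard", "abbreviated", "security"],
        default="standard",
    )
    # None stands in for the "Auto-detect" default.
    parser.add_argument("--timeline-from", type=parse_iso8601, default=None)
    parser.add_argument(
        "--output",
        choices=["markdown", "confluence", "notion"],
        default="markdown",
    )
    parser.add_argument("--action-items-only", action="store_true")
    return parser


args = build_parser().parse_args(["--incident", "INC-1234", "--template", "security"])
print(args.template, args.output, args.action_items_only)  # security markdown False
```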

Related Commands

  • /incident-response - Real-time incident response workflow
  • /triage - Defect triage for bugs discovered during an incident
  • /chaos-test - Validate fixes with chaos engineering

Success Output

✅ Postmortem Generated
━━━━━━━━━━━━━━━━━━━━━━━━━━

📋 Incident Summary
ID: INC-1234
Title: Database Connection Pool Exhaustion
Date: 2026-02-01
Duration: 2h 15m
Severity: S1 (Critical)
Impact: 100% of API requests failed, ~$45K revenue loss

⏱️ Timeline
14:30 - First alert: API error rate spike (PagerDuty)
14:32 - On-call engineer acknowledged
14:45 - Identified: Database connection pool exhausted
15:00 - Mitigation attempted: Restart application pods
15:15 - Mitigation failed: Pool exhausted again within 5m
15:30 - Root cause found: Connection leak in payment service
15:45 - Fix deployed: Patch connection handling + increase pool size
16:00 - Monitoring: Error rate dropping
16:45 - Resolved: All metrics returned to baseline

🔍 5-Whys Root Cause Analysis
1. Why did API fail? → Database connections exhausted
2. Why were connections exhausted? → Connections were not being returned to the pool
3. Why were connections not returned? → Payment service leaking connections
4. Why was service leaking? → Missing connection.close() in error path
5. Why was leak not detected? → No connection pool monitoring

Root Cause: Missing connection cleanup in error handling path, undetected due to lack of connection pool monitoring.

📝 Corrective Actions

🔍 DETECT (Monitoring & Alerting)
- [ ] Add connection pool utilization metrics (@alice, Due: 2026-02-08)
- [ ] Alert on pool >80% utilization (@bob, Due: 2026-02-08)
- [ ] Dashboard: connection pool health per service (@charlie, Due: 2026-02-15)

🛡️ PREVENT (Safeguards & Design)
- [ ] Code review checklist: verify connection cleanup (@dave, Due: 2026-02-05)
- [ ] Linter rule: flag missing close() in try-catch (@eve, Due: 2026-02-12)
- [ ] Connection pool sizing: calculate based on pod count (@alice, Due: 2026-02-15)

🚨 MITIGATE (Incident Response)
- [ ] Runbook: connection pool exhaustion diagnosis (@bob, Due: 2026-02-10)
- [ ] Auto-scaling: trigger on connection pool metric (@frank, Due: 2026-02-20)

📋 PROCESS (Organizational)
- [ ] Load testing: mandate connection pool stress test (@charlie, Due: 2026-02-12)
- [ ] Postmortem review: share learnings in engineering all-hands (@dave, Due: 2026-02-05)

📄 Document: postmortems/INC-1234-database-connection-pool-2026-02-01.md
🔗 Action Items: Exported to Linear/Jira
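
The Duration field in the sample output follows from the first and last timeline entries. A minimal sketch of that arithmetic, assuming the timeline is available as (time, description) pairs on a single day; `sort_timeline` and `format_duration` are hypothetical helper names:

```python
from datetime import datetime, timedelta

# Three entries from the sample timeline, deliberately out of order.
TIMELINE = [
    ("15:30", "Root cause found: Connection leak in payment service"),
    ("16:45", "Resolved: All metrics returned to baseline"),
    ("14:30", "First alert: API error rate spike (PagerDuty)"),
]


def sort_timeline(events, day="2026-02-01"):
    """Attach the incident date to each time and sort chronologically."""
    parsed = [
        (datetime.strptime(f"{day} {clock}", "%Y-%m-%d %H:%M"), text)
        for clock, text in events
    ]
    return sorted(parsed, key=lambda event: event[0])


def format_duration(delta: timedelta) -> str:
    """Render a timedelta in the 'Xh Ym' style used in the summary."""
    total_minutes = int(delta.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m"


ordered = sort_timeline(TIMELINE)
duration = ordered[-1][0] - ordered[0][0]
print(format_duration(duration))  # 2h 15m
```

This sketch assumes the incident does not cross midnight; a multi-day incident would need full timestamps on every entry.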

Completion Checklist

  • Incident details gathered
  • Timeline constructed (detection → resolution)
  • 5-Whys analysis completed
  • Root cause identified
  • Corrective actions categorized (detect/prevent/mitigate/process)
  • Action items assigned with owners and due dates
  • Blameless language verified
  • Postmortem document generated
  • Stakeholders notified
  • Action items tracked in project management tool
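
The "owners and due dates" item in the checklist can be checked mechanically. The sketch below assumes action items use the checkbox line format shown in the sample output, `- [ ] Description (@owner, Due: YYYY-MM-DD)`; `parse_action` and `unassigned` are hypothetical helpers.

```python
import re
from datetime import date

# Matches lines such as:
# - [ ] Alert on pool >80% utilization (@bob, Due: 2026-02-08)
ACTION_RE = re.compile(
    r"^- \[[ x]\] (?P<description>.+?) "
    r"\(@(?P<owner>[\w-]+), Due: (?P<due>\d{4}-\d{2}-\d{2})\)$"
)


def parse_action(line: str):
    """Return the action's fields, or None if owner or due date is missing."""
    match = ACTION_RE.match(line.strip())
    if match is None:
        return None
    return {
        "description": match["description"],
        "owner": match["owner"],
        "due": date.fromisoformat(match["due"]),
    }


def unassigned(lines):
    """List checkbox lines that lack a well-formed owner and due date."""
    return [
        line.strip()
        for line in lines
        if line.lstrip().startswith("- [") and parse_action(line) is None
    ]
```

For example, `unassigned(["- [ ] Add runbook"])` returns the line unchanged, signalling that it still needs an owner and a date.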

Failure Indicators

  • Blame language - postmortem targets individuals instead of systems
  • Shallow root cause - stopped at first "why" instead of iterating to systemic cause
  • No action items - postmortem documents the incident but drives no improvement
  • Missing timeline - events described but not chronologically ordered
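
Some of these indicators can be approximated with simple structural checks. The following is a sketch under stated assumptions: `failure_indicators` is a hypothetical function, and the threshold of two whys for a "shallow" analysis is an arbitrary choice for illustration, not a rule from the standard. Blame language is not covered here because it needs text inspection.

```python
from datetime import datetime


def failure_indicators(whys, timestamps, action_items, min_whys=2):
    """Return the names of structural failure indicators that apply.

    whys         -- list of answers in the 5-Whys chain
    timestamps   -- list of datetime objects, in document order
    action_items -- list of corrective actions (any representation)
    min_whys     -- assumed minimum depth; not defined by the source
    """
    problems = []
    if len(whys) < min_whys:
        problems.append("shallow root cause")
    if not action_items:
        problems.append("no action items")
    if not timestamps:
        problems.append("missing timeline")
    elif timestamps != sorted(timestamps):
        problems.append("timeline not chronologically ordered")
    return problems
```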

When NOT to Use

  • ❌ Minor incidents that don't require formal postmortem (use incident log instead)
  • ❌ Ongoing incidents (use /incident-response first, postmortem after resolution)
  • ❌ Security incidents requiring confidentiality, unless run with --template security and limited distribution

Anti-Patterns

  • ❌ Blame-focused language - "Engineer X forgot to..." instead of "Process lacks safeguard..."
  • ❌ Skipping 5-Whys - jumping to solution without understanding root cause
  • ❌ Action items without owners - corrective actions never get completed
  • ❌ Postmortem without review - document written but never discussed with team
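
The blame-language anti-pattern lends itself to a rough automated first pass before human review. The sketch below flags lines containing a few common blame phrasings; the pattern list is a hypothetical starting point chosen for illustration, not the rule set the command actually applies, and it will miss many cases.

```python
import re

# Hypothetical phrase list; extend it to match your team's writing.
BLAME_PATTERNS = [
    re.compile(r"\b(forgot|neglected)\s+to\b", re.IGNORECASE),
    re.compile(r"\b(careless|negligent|negligence|human error)\b", re.IGNORECASE),
    re.compile(r"@\w+\s+(should have|didn't|did not)\b", re.IGNORECASE),
]


def find_blame_language(text: str):
    """Return (line_number, line) pairs that match any blame pattern."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in BLAME_PATTERNS):
            findings.append((number, line.strip()))
    return findings


draft = """Engineer X forgot to close the connection.
Process lacks safeguard for connection cleanup in the error path."""
print(find_blame_language(draft))
# [(1, 'Engineer X forgot to close the connection.')]
```

A match is a prompt to rephrase in terms of systems and processes, not proof of blame; flagged lines still need a human read.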

Principles

  • #3 Complete Execution - Full postmortem lifecycle (timeline → analysis → actions → follow-up)
  • #9 Based on Facts - Use objective timeline data and metrics, avoid speculation

Full Standard: CODITECT-STANDARD-AUTOMATION.md