
Agent Skills Framework Extension

Incident Response Patterns Skill

When to Use This Skill

Use this skill when implementing incident response patterns in your codebase.

How to Use This Skill

  1. Review the patterns and examples below
  2. Apply the relevant patterns to your implementation
  3. Follow the best practices outlined in this skill

This skill covers runbook automation, incident classification, on-call rotation, and post-mortem processes.

Core Capabilities

  1. Incident Classification - Severity levels, impact assessment
  2. Runbook Automation - Automated remediation steps
  3. On-Call Management - Rotation schedules, escalation
  4. Communication - Status pages, stakeholder updates
  5. Post-Mortems - Blameless RCA, action items

Incident Classification

// src/incidents/classification.ts
export enum Severity {
  P1 = 'P1', // Critical - Service down, data loss
  P2 = 'P2', // High - Major feature broken
  P3 = 'P3', // Medium - Minor feature impacted
  P4 = 'P4', // Low - Cosmetic, workaround exists
}

export interface IncidentCriteria {
  severity: Severity;
  responseTime: string;
  updateFrequency: string;
  escalationTime: string;
  examples: string[];
}

export const INCIDENT_MATRIX: Record<Severity, IncidentCriteria> = {
  [Severity.P1]: {
    severity: Severity.P1,
    responseTime: '15 minutes',
    updateFrequency: '30 minutes',
    escalationTime: '30 minutes if unresolved',
    examples: [
      'Complete service outage',
      'Data breach or security incident',
      'Payment processing failure',
      'Authentication system down',
    ],
  },
  [Severity.P2]: {
    severity: Severity.P2,
    responseTime: '30 minutes',
    updateFrequency: '1 hour',
    escalationTime: '2 hours if unresolved',
    examples: [
      'Major feature unavailable',
      'Performance degradation >50%',
      'Partial data loss (recoverable)',
      'Key integration broken',
    ],
  },
  [Severity.P3]: {
    severity: Severity.P3,
    responseTime: '4 hours',
    updateFrequency: '4 hours',
    escalationTime: 'Next business day',
    examples: [
      'Minor feature broken',
      'Performance degradation <50%',
      'Non-critical integration issue',
      'Workaround available',
    ],
  },
  [Severity.P4]: {
    severity: Severity.P4,
    responseTime: 'Next business day',
    updateFrequency: 'Upon resolution',
    escalationTime: 'N/A',
    examples: [
      'Cosmetic issues',
      'Documentation errors',
      'Feature requests',
      'Non-urgent maintenance',
    ],
  },
};

export interface Incident {
  id: string;
  title: string;
  severity: Severity;
  status: 'open' | 'investigating' | 'identified' | 'resolved' | 'closed';
  impact: string;
  affectedSystems: string[];
  createdAt: Date;
  resolvedAt?: Date;
  timeline: IncidentEvent[];
  commander?: string;
  channel?: string;
}

export interface IncidentEvent {
  timestamp: Date;
  type: 'created' | 'status_change' | 'update' | 'escalation' | 'resolved';
  message: string;
  author: string;
}

export class IncidentManager {
  private incidents: Map<string, Incident> = new Map();

  async createIncident(params: {
    title: string;
    severity: Severity;
    impact: string;
    affectedSystems: string[];
    reporter: string;
  }): Promise<Incident> {
    const incident: Incident = {
      id: `INC-${Date.now()}`,
      title: params.title,
      severity: params.severity,
      status: 'open',
      impact: params.impact,
      affectedSystems: params.affectedSystems,
      createdAt: new Date(),
      timeline: [{
        timestamp: new Date(),
        type: 'created',
        message: `Incident created: ${params.title}`,
        author: params.reporter,
      }],
    };

    this.incidents.set(incident.id, incident);

    // Trigger notifications based on severity
    await this.notifyOnCall(incident);
    await this.createChannel(incident);
    await this.updateStatusPage(incident);

    return incident;
  }

  async updateStatus(
    incidentId: string,
    status: Incident['status'],
    message: string,
    author: string
  ): Promise<Incident> {
    const incident = this.incidents.get(incidentId);
    if (!incident) throw new Error(`Incident not found: ${incidentId}`);

    incident.status = status;
    incident.timeline.push({
      timestamp: new Date(),
      type: 'status_change',
      message: `Status changed to ${status}: ${message}`,
      author,
    });

    if (status === 'resolved') {
      incident.resolvedAt = new Date();
    }

    await this.updateStatusPage(incident);
    return incident;
  }

  private async notifyOnCall(incident: Incident): Promise<void> {
    // Integration with PagerDuty/Opsgenie
    console.log(`Notifying on-call for ${incident.severity}: ${incident.title}`);
  }

  private async createChannel(incident: Incident): Promise<void> {
    // Create Slack channel for incident
    incident.channel = `#incident-${incident.id.toLowerCase()}`;
  }

  private async updateStatusPage(incident: Incident): Promise<void> {
    // Update public status page
    console.log(`Updating status page for ${incident.id}`);
  }
}
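
The notifyOnCall and updateStatusPage methods above are left as stubs. As a rough sketch of what those integrations could look like, assuming a simple weekly round-robin rotation and a Statuspage-style REST endpoint; the schedule shape, URL, and payload here are hypothetical, not any real provider's API:

// src/incidents/oncall.ts - illustrative sketch; names and endpoint are hypothetical
import { Incident, Severity } from './classification';

interface RotationSchedule {
  team: string;
  engineers: string[]; // rotation order
  rotationStart: Date; // start of the first shift
  shiftDays: number;   // e.g. 7 for a weekly rotation
}

// Pick the engineer currently on call from a round-robin schedule.
export function currentOnCall(schedule: RotationSchedule, now: Date = new Date()): string {
  const msPerShift = schedule.shiftDays * 24 * 60 * 60 * 1000;
  const shiftIndex = Math.floor((now.getTime() - schedule.rotationStart.getTime()) / msPerShift);
  const n = schedule.engineers.length;
  // Double modulo guards against dates before rotationStart
  return schedule.engineers[((shiftIndex % n) + n) % n];
}

// Push an incident update to a status page. The endpoint and payload are
// illustrative; real providers (Statuspage, Instatus, etc.) have their own APIs.
export async function postStatusUpdate(incident: Incident, apiToken: string): Promise<void> {
  if (incident.severity === Severity.P4) return; // P4 issues are not customer-facing

  await fetch('https://status.example.com/api/v1/incidents', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiToken}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      name: incident.title,
      status: incident.status,
      impact: incident.impact,
    }),
  });
}

In IncidentManager, notifyOnCall would then resolve the rotation for the relevant team and page that engineer, escalating per INCIDENT_MATRIX if unacknowledged.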

Runbook Automation

// src/incidents/runbook.ts
export interface RunbookStep {
  name: string;
  command?: string;
  script?: string;
  manual?: string;
  timeout?: number;
  retries?: number;
  onFailure?: 'continue' | 'abort' | 'escalate';
}

export interface Runbook {
  id: string;
  name: string;
  description: string;
  triggers: string[];
  steps: RunbookStep[];
  escalation: {
    team: string;
    timeout: number;
  };
}

export const RUNBOOKS: Runbook[] = [
  {
    id: 'high-cpu',
    name: 'High CPU Usage',
    description: 'Remediation for sustained high CPU on application servers',
    triggers: ['alert:cpu_usage_high', 'alert:response_time_p99'],
    steps: [
      {
        name: 'Identify top processes',
        command: 'top -b -n 1 | head -20',
        timeout: 30,
      },
      {
        name: 'Check for stuck queries',
        script: `
          SELECT pid, now() - pg_stat_activity.query_start AS duration, query
          FROM pg_stat_activity
          WHERE state != 'idle'
          ORDER BY duration DESC
          LIMIT 10;
        `,
        timeout: 60,
      },
      {
        name: 'Scale up if needed',
        // kubectl scale takes an absolute count, so read the current replica count first
        command: 'kubectl scale deployment/api --replicas=$(($(kubectl get deployment/api -o jsonpath="{.spec.replicas}") + 2))',
        timeout: 120,
        onFailure: 'escalate',
      },
      {
        name: 'Verify scaling',
        command: 'kubectl get pods -l app=api',
        timeout: 60,
      },
    ],
    escalation: {
      team: 'platform-oncall',
      timeout: 300,
    },
  },
  {
    id: 'database-connection-exhausted',
    name: 'Database Connection Pool Exhausted',
    description: 'Handle database connection pool exhaustion',
    triggers: ['alert:db_connections_high'],
    steps: [
      {
        name: 'Check active connections',
        script: `
          SELECT count(*) AS total,
                 state,
                 usename
          FROM pg_stat_activity
          GROUP BY state, usename
          ORDER BY count(*) DESC;
        `,
        timeout: 30,
      },
      {
        name: 'Identify long-running queries',
        script: `
          SELECT pid, query, now() - query_start AS duration
          FROM pg_stat_activity
          WHERE state = 'active'
            AND query_start < now() - interval '5 minutes'
          ORDER BY duration DESC;
        `,
        timeout: 30,
      },
      {
        name: 'Terminate idle connections',
        script: `
          SELECT pg_terminate_backend(pid)
          FROM pg_stat_activity
          WHERE state = 'idle'
            AND query_start < now() - interval '10 minutes';
        `,
        timeout: 60,
        onFailure: 'continue',
      },
      {
        name: 'Restart connection-heavy services',
        manual: 'Evaluate which services to rolling-restart based on connection count',
        onFailure: 'escalate',
      },
    ],
    escalation: {
      team: 'database-oncall',
      timeout: 600,
    },
  },
];

export class RunbookExecutor {
  async execute(runbook: Runbook, context: Record<string, unknown>): Promise<{
    success: boolean;
    results: Array<{ step: string; output: string; success: boolean }>;
  }> {
    const results: Array<{ step: string; output: string; success: boolean }> = [];

    for (const step of runbook.steps) {
      try {
        console.log(`Executing step: ${step.name}`);

        if (step.manual) {
          // Manual step - log and continue
          results.push({
            step: step.name,
            output: `Manual step: ${step.manual}`,
            success: true,
          });
          continue;
        }

        const output = await this.executeStep(step, context);
        results.push({ step: step.name, output, success: true });
      } catch (error) {
        const errorMsg = error instanceof Error ? error.message : String(error);
        results.push({ step: step.name, output: errorMsg, success: false });

        if (step.onFailure === 'abort') {
          return { success: false, results };
        } else if (step.onFailure === 'escalate') {
          await this.escalate(runbook, step, errorMsg);
          return { success: false, results };
        }
        // 'continue' (or unset) - proceed to next step
      }
    }

    return { success: true, results };
  }

  private async executeStep(step: RunbookStep, context: Record<string, unknown>): Promise<string> {
    // Execute command or script
    // This would integrate with your execution environment
    return `Executed: ${step.command || step.script}`;
  }

  private async escalate(runbook: Runbook, step: RunbookStep, error: string): Promise<void> {
    console.log(`Escalating to ${runbook.escalation.team}: ${step.name} failed - ${error}`);
    // PagerDuty/Opsgenie escalation
  }
}
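
To wire the runbooks to monitoring, a webhook handler can match an incoming alert name against each runbook's triggers and hand the first match to the executor. A minimal sketch, assuming an Alertmanager-style alert name and label set (the handler and its shape are an illustration, not part of the skill's API):

// Sketch: route an incoming alert to a matching runbook (alert shape assumed).
import { RUNBOOKS, RunbookExecutor } from './runbook';

export async function handleAlert(alertName: string, labels: Record<string, unknown>): Promise<void> {
  const runbook = RUNBOOKS.find(rb => rb.triggers.includes(`alert:${alertName}`));
  if (!runbook) {
    console.log(`No runbook for alert:${alertName}; paging on-call directly`);
    return;
  }

  const executor = new RunbookExecutor();
  const { success, results } = await executor.execute(runbook, labels);

  for (const r of results) {
    console.log(`${r.success ? 'OK  ' : 'FAIL'} ${r.step}: ${r.output}`);
  }
  if (!success) {
    // Escalation (if configured) was already triggered inside the executor.
    console.log(`Runbook ${runbook.id} did not complete cleanly`);
  }
}

// Example: a CPU alert fires on host api-3
void handleAlert('cpu_usage_high', { host: 'api-3' });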

Post-Mortem Template

// src/incidents/postmortem.ts
import { Severity, Incident } from './classification';

export interface PostMortem {
  incidentId: string;
  title: string;
  date: Date;
  authors: string[];
  status: 'draft' | 'review' | 'published';

  summary: string;
  impact: {
    duration: string;
    affectedUsers: number;
    affectedSystems: string[];
    severity: Severity;
  };

  timeline: Array<{
    time: Date;
    event: string;
  }>;

  rootCause: string;
  contributingFactors: string[];

  detection: {
    howDetected: string;
    timeToDetect: string;
    whatWorked: string[];
    whatDidntWork: string[];
  };

  response: {
    timeToMitigate: string;
    timeToResolve: string;
    whatWorked: string[];
    whatDidntWork: string[];
  };

  actionItems: Array<{
    id: string;
    description: string;
    owner: string;
    priority: 'P1' | 'P2' | 'P3';
    dueDate: Date;
    status: 'open' | 'in_progress' | 'done';
    type: 'prevent' | 'detect' | 'respond';
  }>;

  lessonsLearned: string[];
}

export function generatePostMortemTemplate(incident: Incident): Partial<PostMortem> {
  const duration = incident.resolvedAt
    ? `${Math.round((incident.resolvedAt.getTime() - incident.createdAt.getTime()) / 60000)} minutes`
    : 'Ongoing';

  return {
    incidentId: incident.id,
    title: incident.title,
    date: new Date(),
    authors: [incident.commander || 'TBD'],
    status: 'draft',

    summary: `On ${incident.createdAt.toISOString()}, we experienced ${incident.title}. This affected ${incident.affectedSystems.join(', ')}.`,

    impact: {
      duration,
      affectedUsers: 0, // To be filled
      affectedSystems: incident.affectedSystems,
      severity: incident.severity,
    },

    timeline: incident.timeline.map(e => ({
      time: e.timestamp,
      event: e.message,
    })),

    rootCause: 'TBD - To be determined during post-mortem review',
    contributingFactors: [],

    detection: {
      howDetected: 'TBD',
      timeToDetect: 'TBD',
      whatWorked: [],
      whatDidntWork: [],
    },

    response: {
      timeToMitigate: 'TBD',
      timeToResolve: duration,
      whatWorked: [],
      whatDidntWork: [],
    },

    actionItems: [],
    lessonsLearned: [],
  };
}

// Markdown export
export function exportToMarkdown(pm: PostMortem): string {
  return `# Post-Mortem: ${pm.title}

**Incident ID:** ${pm.incidentId}
**Date:** ${pm.date.toISOString().split('T')[0]}
**Authors:** ${pm.authors.join(', ')}
**Status:** ${pm.status}

## Summary

${pm.summary}

## Impact

- **Duration:** ${pm.impact.duration}
- **Affected Users:** ${pm.impact.affectedUsers}
- **Affected Systems:** ${pm.impact.affectedSystems.join(', ')}
- **Severity:** ${pm.impact.severity}

## Timeline

| Time | Event |
|------|-------|
${pm.timeline.map(e => `| ${e.time.toISOString()} | ${e.event} |`).join('\n')}

## Root Cause

${pm.rootCause}

### Contributing Factors

${pm.contributingFactors.map(f => `- ${f}`).join('\n')}

## Detection

- **How Detected:** ${pm.detection.howDetected}
- **Time to Detect:** ${pm.detection.timeToDetect}

### What Worked
${pm.detection.whatWorked.map(w => `- ${w}`).join('\n')}

### What Didn't Work
${pm.detection.whatDidntWork.map(w => `- ${w}`).join('\n')}

## Response

- **Time to Mitigate:** ${pm.response.timeToMitigate}
- **Time to Resolve:** ${pm.response.timeToResolve}

### What Worked
${pm.response.whatWorked.map(w => `- ${w}`).join('\n')}

### What Didn't Work
${pm.response.whatDidntWork.map(w => `- ${w}`).join('\n')}

## Action Items

| ID | Description | Owner | Priority | Due Date | Status | Type |
|----|-------------|-------|----------|----------|--------|------|
${pm.actionItems.map(a => `| ${a.id} | ${a.description} | ${a.owner} | ${a.priority} | ${a.dueDate.toISOString().split('T')[0]} | ${a.status} | ${a.type} |`).join('\n')}

## Lessons Learned

${pm.lessonsLearned.map(l => `- ${l}`).join('\n')}
`;
}
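
Putting the pieces together, a post-mortem draft can be seeded straight from the incident record once it is resolved. A sketch of that flow (the output path is arbitrary, and the cast is safe here only because the template fills every field exportToMarkdown reads):

// Sketch: end-to-end flow from incident to post-mortem draft.
import { writeFileSync } from 'fs';
import { IncidentManager, Severity } from './classification';
import { generatePostMortemTemplate, exportToMarkdown, PostMortem } from './postmortem';

async function draftPostMortem(): Promise<void> {
  const manager = new IncidentManager();
  const incident = await manager.createIncident({
    title: 'Checkout API returning 500s',
    severity: Severity.P1,
    impact: 'All checkout attempts failing',
    affectedSystems: ['checkout-api', 'payments'],
    reporter: 'alice',
  });

  await manager.updateStatus(incident.id, 'resolved', 'Rolled back bad deploy', 'alice');

  // Seed the draft; TBD fields are filled in during the blameless review.
  const draft = generatePostMortemTemplate(incident);
  writeFileSync(`postmortems/${incident.id}.md`, exportToMarkdown(draft as PostMortem));
}

void draftPostMortem();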

Success Output

When successful, this skill MUST output:

✅ SKILL COMPLETE: incident-response-patterns

Completed:
- [x] Incident classification matrix defined (P1-P4)
- [x] Runbooks created for common issues
- [x] On-call rotation and escalation configured
- [x] Post-mortem template established

Outputs:
- Incident classification criteria with SLAs
- Automated runbooks with remediation steps
- On-call notification and escalation paths
- Post-mortem template with action item tracking
- Status page update workflow

Completion Checklist

Before marking this skill as complete, verify:

  • P1-P4 severity levels defined with response time SLAs
  • Runbooks created for top 5 incident types
  • On-call rotation configured with PagerDuty/Opsgenie
  • Incident Slack channel creation automated
  • Status page update workflow implemented
  • Post-mortem template includes timeline, root cause, action items
  • Blameless culture principles documented
  • Escalation paths defined with timeouts

Failure Indicators

This skill has FAILED if:

  • ❌ No clear severity criteria (ambiguous P1 vs P2)
  • ❌ Runbooks missing for critical systems
  • ❌ No on-call notification integration
  • ❌ Manual status page updates required
  • ❌ Post-mortem template incomplete
  • ❌ Blame-oriented incident culture
  • ❌ No escalation timeouts defined
  • ❌ Action items not tracked to completion

When NOT to Use

Do NOT use this skill when:

  • No production systems (dev/staging only)
  • Single developer project (no on-call needed)
  • No user-facing services (internal tools only)
  • Incidents extremely rare (<1 per quarter)
  • No SLA requirements
  • Team size < 3 (no rotation needed)

Use alternative approaches when:

  • Dev/staging → Simple alerting, no runbooks
  • Single dev → Manual incident handling
  • Internal tools → Best-effort support
  • Rare incidents → Ad-hoc response acceptable
  • No SLAs → Prioritize by severity informally

Anti-Patterns (Avoid)

| Anti-Pattern | Problem | Solution |
|--------------|---------|----------|
| Vague severity criteria | Inconsistent classification | Define clear P1-P4 examples and SLAs |
| Manual runbook steps only | Slow response, human error | Automate common remediation steps |
| No escalation timeouts | Incidents linger unresolved | Define escalation after X minutes |
| Blame-oriented post-mortems | Fear of reporting, hiding issues | Emphasize blameless culture |
| Action items without owners | Nothing gets fixed | Assign owner + due date to every action |
| Missing timeline | Hard to identify root cause | Document all events with timestamps |
| No detection analysis | Repeat incidents | Analyze "what worked/didn't work" |
| Ignoring contributing factors | Shallow fixes | Document all factors, not just root cause |

Principles

This skill embodies these CODITECT principles:

#5 Eliminate Ambiguity

  • Clear P1-P4 criteria with examples
  • Explicit SLAs for response time
  • Defined escalation paths

#6 Clear, Understandable, Explainable

  • Runbooks with step-by-step instructions
  • Post-mortems with timeline and rationale
  • Blameless culture for transparency

#7 Incremental Improvement

  • Action items from every post-mortem
  • Detection/response analysis
  • Lessons learned feed into prevention

#9 Automation First

  • Automated runbook execution
  • Automated channel creation
  • Automated status page updates

Reliability

  • Runbooks prevent repeated incidents
  • Escalation ensures resolution
  • Post-mortems drive improvements

Usage Examples

Create Incident Response Process

Apply incident-response-patterns skill to implement P1-P4 incident classification with SLAs

Automate Runbooks

Apply incident-response-patterns skill to create automated runbooks for common infrastructure issues

Setup Post-Mortem Process

Apply incident-response-patterns skill to implement blameless post-mortem workflow with action item tracking

Integration Points

  • monitoring-observability - Alert integration
  • cicd-pipeline-design - Rollback automation
  • error-handling-resilience - Circuit breaker triggers