# Incident Response Patterns Skill

An Agent Skills Framework extension.
## When to Use This Skill

Use this skill when implementing incident response patterns in your codebase: runbook automation, incident classification, on-call rotation, and post-mortem processes.

## How to Use This Skill

- Review the patterns and examples below
- Apply the relevant patterns to your implementation
- Follow the best practices outlined in this skill
## Core Capabilities

- **Incident Classification** - Severity levels, impact assessment
- **Runbook Automation** - Automated remediation steps
- **On-Call Management** - Rotation schedules, escalation
- **Communication** - Status pages, stakeholder updates
- **Post-Mortems** - Blameless RCA, action items
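The on-call rotation capability has no code example elsewhere in this skill, so here is a minimal week-based lookup sketch. The engineer names and rotation start date are hypothetical placeholders; in practice the schedule would come from PagerDuty/Opsgenie.

```typescript
// Minimal weekly rotation sketch. ROTATION and ROTATION_START are
// hypothetical; replace them with your real schedule source.
const ROTATION = ['alice', 'bob', 'carol'];
const ROTATION_START = new Date('2024-01-01T00:00:00Z'); // a Monday

function onCallFor(date: Date): string {
  const msPerWeek = 7 * 24 * 60 * 60 * 1000;
  const weeks = Math.floor((date.getTime() - ROTATION_START.getTime()) / msPerWeek);
  // Double modulo keeps the index non-negative for dates before ROTATION_START
  return ROTATION[((weeks % ROTATION.length) + ROTATION.length) % ROTATION.length];
}
```

The double modulo means the lookup still resolves correctly for dates before the rotation start, where `weeks` is negative.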
## Incident Classification

```typescript
// src/incidents/classification.ts
export enum Severity {
  P1 = 'P1', // Critical - Service down, data loss
  P2 = 'P2', // High - Major feature broken
  P3 = 'P3', // Medium - Minor feature impacted
  P4 = 'P4', // Low - Cosmetic, workaround exists
}

interface IncidentCriteria {
  severity: Severity;
  responseTime: string;
  updateFrequency: string;
  escalationTime: string;
  examples: string[];
}

export const INCIDENT_MATRIX: Record<Severity, IncidentCriteria> = {
  [Severity.P1]: {
    severity: Severity.P1,
    responseTime: '15 minutes',
    updateFrequency: '30 minutes',
    escalationTime: '30 minutes if unresolved',
    examples: [
      'Complete service outage',
      'Data breach or security incident',
      'Payment processing failure',
      'Authentication system down',
    ],
  },
  [Severity.P2]: {
    severity: Severity.P2,
    responseTime: '30 minutes',
    updateFrequency: '1 hour',
    escalationTime: '2 hours if unresolved',
    examples: [
      'Major feature unavailable',
      'Performance degradation >50%',
      'Partial data loss (recoverable)',
      'Key integration broken',
    ],
  },
  [Severity.P3]: {
    severity: Severity.P3,
    responseTime: '4 hours',
    updateFrequency: '4 hours',
    escalationTime: 'Next business day',
    examples: [
      'Minor feature broken',
      'Performance degradation <50%',
      'Non-critical integration issue',
      'Workaround available',
    ],
  },
  [Severity.P4]: {
    severity: Severity.P4,
    responseTime: 'Next business day',
    updateFrequency: 'Upon resolution',
    escalationTime: 'N/A',
    examples: [
      'Cosmetic issues',
      'Documentation errors',
      'Feature requests',
      'Non-urgent maintenance',
    ],
  },
};

// Exported so other modules (e.g. the post-mortem generator) can reuse these types
export interface Incident {
  id: string;
  title: string;
  severity: Severity;
  status: 'open' | 'investigating' | 'identified' | 'resolved' | 'closed';
  impact: string;
  affectedSystems: string[];
  createdAt: Date;
  resolvedAt?: Date;
  timeline: IncidentEvent[];
  commander?: string;
  channel?: string;
}

export interface IncidentEvent {
  timestamp: Date;
  type: 'created' | 'status_change' | 'update' | 'escalation' | 'resolved';
  message: string;
  author: string;
}

export class IncidentManager {
  private incidents: Map<string, Incident> = new Map();

  async createIncident(params: {
    title: string;
    severity: Severity;
    impact: string;
    affectedSystems: string[];
    reporter: string;
  }): Promise<Incident> {
    const incident: Incident = {
      id: `INC-${Date.now()}`,
      title: params.title,
      severity: params.severity,
      status: 'open',
      impact: params.impact,
      affectedSystems: params.affectedSystems,
      createdAt: new Date(),
      timeline: [{
        timestamp: new Date(),
        type: 'created',
        message: `Incident created: ${params.title}`,
        author: params.reporter,
      }],
    };
    this.incidents.set(incident.id, incident);

    // Trigger notifications based on severity
    await this.notifyOnCall(incident);
    await this.createChannel(incident);
    await this.updateStatusPage(incident);

    return incident;
  }

  async updateStatus(
    incidentId: string,
    status: Incident['status'],
    message: string,
    author: string
  ): Promise<Incident> {
    const incident = this.incidents.get(incidentId);
    if (!incident) throw new Error(`Incident not found: ${incidentId}`);

    incident.status = status;
    incident.timeline.push({
      timestamp: new Date(),
      type: 'status_change',
      message: `Status changed to ${status}: ${message}`,
      author,
    });

    if (status === 'resolved') {
      incident.resolvedAt = new Date();
    }

    await this.updateStatusPage(incident);
    return incident;
  }

  private async notifyOnCall(incident: Incident): Promise<void> {
    // Integration with PagerDuty/Opsgenie
    console.log(`Notifying on-call for ${incident.severity}: ${incident.title}`);
  }

  private async createChannel(incident: Incident): Promise<void> {
    // Create a dedicated Slack channel for the incident
    incident.channel = `#incident-${incident.id.toLowerCase()}`;
  }

  private async updateStatusPage(incident: Incident): Promise<void> {
    // Update the public status page
    console.log(`Updating status page for ${incident.id}`);
  }
}
```
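The SLA values in `INCIDENT_MATRIX` are human-readable strings; paging and escalation logic usually needs concrete deadlines. A sketch of deriving a response deadline from a severity, with a minimal minute map inlined (mirroring the matrix above) so the example stands alone:

```typescript
// Response-time SLAs in minutes, mirroring INCIDENT_MATRIX above
// (P4 'next business day' is approximated as 24h for this sketch).
const responseMinutes: Record<string, number> = {
  P1: 15,
  P2: 30,
  P3: 4 * 60,
  P4: 24 * 60,
};

function responseDeadline(severity: string, createdAt: Date): Date {
  const minutes = responseMinutes[severity];
  if (minutes === undefined) throw new Error(`Unknown severity: ${severity}`);
  return new Date(createdAt.getTime() + minutes * 60_000);
}

const created = new Date('2024-01-01T00:00:00Z');
console.log(responseDeadline('P1', created).toISOString()); // 2024-01-01T00:15:00.000Z
```

Storing SLAs as numbers alongside the display strings avoids parsing '15 minutes' at paging time.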
## Runbook Automation

```typescript
// src/incidents/runbook.ts
interface RunbookStep {
  name: string;
  command?: string;
  script?: string;
  manual?: string;
  timeout?: number; // seconds
  retries?: number;
  onFailure?: 'continue' | 'abort' | 'escalate';
}

interface Runbook {
  id: string;
  name: string;
  description: string;
  triggers: string[];
  steps: RunbookStep[];
  escalation: {
    team: string;
    timeout: number; // seconds
  };
}

export const RUNBOOKS: Runbook[] = [
  {
    id: 'high-cpu',
    name: 'High CPU Usage',
    description: 'Remediation for sustained high CPU on application servers',
    triggers: ['alert:cpu_usage_high', 'alert:response_time_p99'],
    steps: [
      {
        name: 'Identify top processes',
        command: 'top -b -n 1 | head -20',
        timeout: 30,
      },
      {
        name: 'Check for stuck queries',
        script: `
          SELECT pid, now() - pg_stat_activity.query_start AS duration, query
          FROM pg_stat_activity
          WHERE state != 'idle'
          ORDER BY duration DESC
          LIMIT 10;
        `,
        timeout: 60,
      },
      {
        name: 'Scale up if needed',
        // kubectl scale takes an absolute replica count ('--replicas=+2' is
        // invalid); choose the target based on your baseline replica count
        command: 'kubectl scale deployment/api --replicas=6',
        timeout: 120,
        onFailure: 'escalate',
      },
      {
        name: 'Verify scaling',
        command: 'kubectl get pods -l app=api',
        timeout: 60,
      },
    ],
    escalation: {
      team: 'platform-oncall',
      timeout: 300,
    },
  },
  {
    id: 'database-connection-exhausted',
    name: 'Database Connection Pool Exhausted',
    description: 'Handle database connection pool exhaustion',
    triggers: ['alert:db_connections_high'],
    steps: [
      {
        name: 'Check active connections',
        script: `
          SELECT count(*) AS total,
                 state,
                 usename
          FROM pg_stat_activity
          GROUP BY state, usename
          ORDER BY count(*) DESC;
        `,
        timeout: 30,
      },
      {
        name: 'Identify long-running queries',
        script: `
          SELECT pid, query, now() - query_start AS duration
          FROM pg_stat_activity
          WHERE state = 'active'
            AND query_start < now() - interval '5 minutes'
          ORDER BY duration DESC;
        `,
        timeout: 30,
      },
      {
        name: 'Terminate idle connections',
        script: `
          SELECT pg_terminate_backend(pid)
          FROM pg_stat_activity
          WHERE state = 'idle'
            AND query_start < now() - interval '10 minutes';
        `,
        timeout: 60,
        onFailure: 'continue',
      },
      {
        name: 'Restart connection-heavy services',
        manual: 'Evaluate which services to rolling-restart based on connection count',
        onFailure: 'escalate',
      },
    ],
    escalation: {
      team: 'database-oncall',
      timeout: 600,
    },
  },
];

export class RunbookExecutor {
  async execute(runbook: Runbook, context: Record<string, unknown>): Promise<{
    success: boolean;
    results: Array<{ step: string; output: string; success: boolean }>;
  }> {
    const results: Array<{ step: string; output: string; success: boolean }> = [];

    for (const step of runbook.steps) {
      try {
        console.log(`Executing step: ${step.name}`);

        if (step.manual) {
          // Manual step - log it and continue; a human performs the action
          results.push({
            step: step.name,
            output: `Manual step: ${step.manual}`,
            success: true,
          });
          continue;
        }

        const output = await this.executeStep(step, context);
        results.push({ step: step.name, output, success: true });
      } catch (error) {
        const errorMsg = error instanceof Error ? error.message : String(error);
        results.push({ step: step.name, output: errorMsg, success: false });

        if (step.onFailure === 'abort') {
          return { success: false, results };
        } else if (step.onFailure === 'escalate') {
          await this.escalate(runbook, step, errorMsg);
          return { success: false, results };
        }
        // 'continue' (or unset) - proceed to the next step
      }
    }

    return { success: true, results };
  }

  private async executeStep(step: RunbookStep, context: Record<string, unknown>): Promise<string> {
    // Execute the command or script; this would integrate with your
    // execution environment (SSH, kubectl exec, database client, etc.)
    return `Executed: ${step.command || step.script}`;
  }

  private async escalate(runbook: Runbook, step: RunbookStep, error: string): Promise<void> {
    // PagerDuty/Opsgenie escalation
    console.log(`Escalating to ${runbook.escalation.team}: ${step.name} failed - ${error}`);
  }
}
```
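`RunbookStep` declares an optional `retries` field that the executor above does not act on. One way to honor it is a small retry wrapper around each step attempt; this is a sketch under that assumption, not the framework's API:

```typescript
// Retry wrapper matching the optional `retries` field on RunbookStep:
// run fn up to (retries + 1) times, rethrowing the last error on exhaustion.
async function withRetries<T>(fn: () => Promise<T>, retries = 0): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}
```

Inside `execute`, the step call would become `await withRetries(() => this.executeStep(step, context), step.retries ?? 0)`, keeping the existing `onFailure` handling for the final failure.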
## Post-Mortem Template

```typescript
// src/incidents/postmortem.ts
import { Incident, Severity } from './classification';

interface PostMortem {
  incidentId: string;
  title: string;
  date: Date;
  authors: string[];
  status: 'draft' | 'review' | 'published';
  summary: string;
  impact: {
    duration: string;
    affectedUsers: number;
    affectedSystems: string[];
    severity: Severity;
  };
  timeline: Array<{
    time: Date;
    event: string;
  }>;
  rootCause: string;
  contributingFactors: string[];
  detection: {
    howDetected: string;
    timeToDetect: string;
    whatWorked: string[];
    whatDidntWork: string[];
  };
  response: {
    timeToMitigate: string;
    timeToResolve: string;
    whatWorked: string[];
    whatDidntWork: string[];
  };
  actionItems: Array<{
    id: string;
    description: string;
    owner: string;
    priority: 'P1' | 'P2' | 'P3';
    dueDate: Date;
    status: 'open' | 'in_progress' | 'done';
    type: 'prevent' | 'detect' | 'respond';
  }>;
  lessonsLearned: string[];
}

export function generatePostMortemTemplate(incident: Incident): Partial<PostMortem> {
  const duration = incident.resolvedAt
    ? `${Math.round((incident.resolvedAt.getTime() - incident.createdAt.getTime()) / 60000)} minutes`
    : 'Ongoing';

  return {
    incidentId: incident.id,
    title: incident.title,
    date: new Date(),
    authors: [incident.commander || 'TBD'],
    status: 'draft',
    summary: `On ${incident.createdAt.toISOString()}, we experienced ${incident.title}. This affected ${incident.affectedSystems.join(', ')}.`,
    impact: {
      duration,
      affectedUsers: 0, // To be filled in during review
      affectedSystems: incident.affectedSystems,
      severity: incident.severity,
    },
    timeline: incident.timeline.map(e => ({
      time: e.timestamp,
      event: e.message,
    })),
    rootCause: 'TBD - To be determined during post-mortem review',
    contributingFactors: [],
    detection: {
      howDetected: 'TBD',
      timeToDetect: 'TBD',
      whatWorked: [],
      whatDidntWork: [],
    },
    response: {
      timeToMitigate: 'TBD',
      timeToResolve: duration,
      whatWorked: [],
      whatDidntWork: [],
    },
    actionItems: [],
    lessonsLearned: [],
  };
}

// Markdown export
export function exportToMarkdown(pm: PostMortem): string {
  return `# Post-Mortem: ${pm.title}

**Incident ID:** ${pm.incidentId}
**Date:** ${pm.date.toISOString().split('T')[0]}
**Authors:** ${pm.authors.join(', ')}
**Status:** ${pm.status}

## Summary

${pm.summary}

## Impact

- **Duration:** ${pm.impact.duration}
- **Affected Users:** ${pm.impact.affectedUsers}
- **Affected Systems:** ${pm.impact.affectedSystems.join(', ')}
- **Severity:** ${pm.impact.severity}

## Timeline

| Time | Event |
|------|-------|
${pm.timeline.map(e => `| ${e.time.toISOString()} | ${e.event} |`).join('\n')}

## Root Cause

${pm.rootCause}

### Contributing Factors

${pm.contributingFactors.map(f => `- ${f}`).join('\n')}

## Detection

- **How Detected:** ${pm.detection.howDetected}
- **Time to Detect:** ${pm.detection.timeToDetect}

### What Worked

${pm.detection.whatWorked.map(w => `- ${w}`).join('\n')}

### What Didn't Work

${pm.detection.whatDidntWork.map(w => `- ${w}`).join('\n')}

## Response

- **Time to Mitigate:** ${pm.response.timeToMitigate}
- **Time to Resolve:** ${pm.response.timeToResolve}

## Action Items

| ID | Description | Owner | Priority | Due Date | Status | Type |
|----|-------------|-------|----------|----------|--------|------|
${pm.actionItems.map(a => `| ${a.id} | ${a.description} | ${a.owner} | ${a.priority} | ${a.dueDate.toISOString().split('T')[0]} | ${a.status} | ${a.type} |`).join('\n')}

## Lessons Learned

${pm.lessonsLearned.map(l => `- ${l}`).join('\n')}
`;
}
```
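Fields like `timeToResolve` and `timeToDetect` can be derived from the incident timeline rather than filled in by hand. A sketch, assuming events shaped like the `IncidentEvent` interface above (the `TimelineEvent` shape here is a local stand-in):

```typescript
// Compute minutes between the first occurrence of two event types
// in a timeline; returns null if either milestone is missing.
interface TimelineEvent {
  timestamp: Date;
  type: string;
}

function minutesBetween(events: TimelineEvent[], from: string, to: string): number | null {
  const start = events.find(e => e.type === from);
  const end = events.find(e => e.type === to);
  if (!start || !end) return null; // missing milestone: leave the field as 'TBD'
  return Math.round((end.timestamp.getTime() - start.timestamp.getTime()) / 60_000);
}
```

For example, `minutesBetween(incident.timeline, 'created', 'resolved')` gives time-to-resolve in minutes, falling back to `null` (and so 'TBD') while the incident is still open.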
## Success Output

When successful, this skill MUST output:

```
✅ SKILL COMPLETE: incident-response-patterns

Completed:
- [x] Incident classification matrix defined (P1-P4)
- [x] Runbooks created for common issues
- [x] On-call rotation and escalation configured
- [x] Post-mortem template established

Outputs:
- Incident classification criteria with SLAs
- Automated runbooks with remediation steps
- On-call notification and escalation paths
- Post-mortem template with action item tracking
- Status page update workflow
```
## Completion Checklist

Before marking this skill as complete, verify:

- [ ] P1-P4 severity levels defined with response time SLAs
- [ ] Runbooks created for top 5 incident types
- [ ] On-call rotation configured with PagerDuty/Opsgenie
- [ ] Incident Slack channel creation automated
- [ ] Status page update workflow implemented
- [ ] Post-mortem template includes timeline, root cause, action items
- [ ] Blameless culture principles documented
- [ ] Escalation paths defined with timeouts
## Failure Indicators
This skill has FAILED if:
- ❌ No clear severity criteria (ambiguous P1 vs P2)
- ❌ Runbooks missing for critical systems
- ❌ No on-call notification integration
- ❌ Manual status page updates required
- ❌ Post-mortem template incomplete
- ❌ Blame-oriented incident culture
- ❌ No escalation timeouts defined
- ❌ Action items not tracked to completion
## When NOT to Use
Do NOT use this skill when:
- No production systems (dev/staging only)
- Single developer project (no on-call needed)
- No user-facing services (internal tools only)
- Incidents extremely rare (<1 per quarter)
- No SLA requirements
- Team size < 3 (no rotation needed)
Use alternative approaches when:
- Dev/staging → Simple alerting, no runbooks
- Single dev → Manual incident handling
- Internal tools → Best-effort support
- Rare incidents → Ad-hoc response acceptable
- No SLAs → Prioritize by severity informally
## Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Vague severity criteria | Inconsistent classification | Define clear P1-P4 examples and SLAs |
| Manual runbook steps only | Slow response, human error | Automate common remediation steps |
| No escalation timeouts | Incidents linger unresolved | Define escalation after X minutes |
| Blame-oriented post-mortems | Fear of reporting, hiding issues | Emphasize blameless culture |
| Action items without owners | Nothing gets fixed | Assign owner + due date to every action |
| Missing timeline | Hard to identify root cause | Document all events with timestamps |
| No detection analysis | Repeat incidents | Analyze "what worked/didn't work" |
| Ignoring contributing factors | Shallow fixes | Document all factors, not just root cause |
## Principles

This skill embodies these CODITECT principles:

### #5 Eliminate Ambiguity

- Clear P1-P4 criteria with examples
- Explicit SLAs for response time
- Defined escalation paths

### #6 Clear, Understandable, Explainable

- Runbooks with step-by-step instructions
- Post-mortems with timeline and rationale
- Blameless culture for transparency

### #7 Incremental Improvement

- Action items from every post-mortem
- Detection/response analysis
- Lessons learned feed into prevention

### #9 Automation First

- Automated runbook execution
- Automated channel creation
- Automated status page updates

### Reliability

- Runbooks prevent repeated incidents
- Escalation ensures resolution
- Post-mortems drive improvements
## Usage Examples

### Create Incident Response Process

Apply the incident-response-patterns skill to implement P1-P4 incident classification with SLAs.

### Automate Runbooks

Apply the incident-response-patterns skill to create automated runbooks for common infrastructure issues.

### Set Up Post-Mortem Process

Apply the incident-response-patterns skill to implement a blameless post-mortem workflow with action item tracking.
## Integration Points

- `monitoring-observability` - Alert integration
- `cicd-pipeline-design` - Rollback automation
- `error-handling-resilience` - Circuit breaker triggers
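The alert-integration point reduces to routing an incoming alert name to a runbook via its `triggers` list. A sketch, where `RunbookRef` mirrors just the `id` and `triggers` fields of the `Runbook` interface defined earlier:

```typescript
// Route an alert to the first runbook whose triggers list contains it;
// returns the runbook id, or null when no runbook matches.
interface RunbookRef {
  id: string;
  triggers: string[];
}

function findRunbook(alert: string, runbooks: RunbookRef[]): string | null {
  const match = runbooks.find(r => r.triggers.includes(alert));
  return match ? match.id : null;
}
```

An unmatched alert (null result) should fall through to normal on-call paging rather than being dropped.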