# Incident Response Patterns Skill

An Agent Skills Framework extension.
## When to Use This Skill

Use this skill when implementing incident response patterns in your codebase: runbook automation, incident classification, on-call rotation, and post-mortem processes.

## How to Use This Skill

- Review the patterns and examples below
- Apply the relevant patterns to your implementation
- Follow the best practices outlined in this skill
## Core Capabilities

- **Incident Classification** - Severity levels, impact assessment
- **Runbook Automation** - Automated remediation steps
- **On-Call Management** - Rotation schedules, escalation
- **Communication** - Status pages, stakeholder updates
- **Post-Mortems** - Blameless RCA, action items
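The on-call rotation capability has no code example elsewhere in this skill, so here is a minimal week-based lookup sketch. The engineer names and rotation start date are hypothetical placeholders; in practice the schedule would come from PagerDuty/Opsgenie.

```typescript
// Minimal weekly rotation sketch. ROTATION and ROTATION_START are
// hypothetical; replace them with your real schedule source.
const ROTATION = ['alice', 'bob', 'carol'];
const ROTATION_START = new Date('2024-01-01T00:00:00Z'); // a Monday

function onCallFor(date: Date): string {
  const msPerWeek = 7 * 24 * 60 * 60 * 1000;
  const weeks = Math.floor((date.getTime() - ROTATION_START.getTime()) / msPerWeek);
  // Double modulo keeps the index non-negative for dates before ROTATION_START
  return ROTATION[((weeks % ROTATION.length) + ROTATION.length) % ROTATION.length];
}
```

The double modulo means the lookup still resolves correctly for dates before the rotation start, where `weeks` is negative.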
## Incident Classification

```typescript
// src/incidents/classification.ts
export enum Severity {
  P1 = 'P1', // Critical - Service down, data loss
  P2 = 'P2', // High - Major feature broken
  P3 = 'P3', // Medium - Minor feature impacted
  P4 = 'P4', // Low - Cosmetic, workaround exists
}

interface IncidentCriteria {
  severity: Severity;
  responseTime: string;
  updateFrequency: string;
  escalationTime: string;
  examples: string[];
}

export const INCIDENT_MATRIX: Record<Severity, IncidentCriteria> = {
  [Severity.P1]: {
    severity: Severity.P1,
    responseTime: '15 minutes',
    updateFrequency: '30 minutes',
    escalationTime: '30 minutes if unresolved',
    examples: [
      'Complete service outage',
      'Data breach or security incident',
      'Payment processing failure',
      'Authentication system down',
    ],
  },
  [Severity.P2]: {
    severity: Severity.P2,
    responseTime: '30 minutes',
    updateFrequency: '1 hour',
    escalationTime: '2 hours if unresolved',
    examples: [
      'Major feature unavailable',
      'Performance degradation >50%',
      'Partial data loss (recoverable)',
      'Key integration broken',
    ],
  },
  [Severity.P3]: {
    severity: Severity.P3,
    responseTime: '4 hours',
    updateFrequency: '4 hours',
    escalationTime: 'Next business day',
    examples: [
      'Minor feature broken',
      'Performance degradation <50%',
      'Non-critical integration issue',
      'Workaround available',
    ],
  },
  [Severity.P4]: {
    severity: Severity.P4,
    responseTime: 'Next business day',
    updateFrequency: 'Upon resolution',
    escalationTime: 'N/A',
    examples: [
      'Cosmetic issues',
      'Documentation errors',
      'Feature requests',
      'Non-urgent maintenance',
    ],
  },
};

// Exported so other modules (e.g. the post-mortem generator) can reuse these types
export interface Incident {
  id: string;
  title: string;
  severity: Severity;
  status: 'open' | 'investigating' | 'identified' | 'resolved' | 'closed';
  impact: string;
  affectedSystems: string[];
  createdAt: Date;
  resolvedAt?: Date;
  timeline: IncidentEvent[];
  commander?: string;
  channel?: string;
}

export interface IncidentEvent {
  timestamp: Date;
  type: 'created' | 'status_change' | 'update' | 'escalation' | 'resolved';
  message: string;
  author: string;
}

export class IncidentManager {
  private incidents: Map<string, Incident> = new Map();

  async createIncident(params: {
    title: string;
    severity: Severity;
    impact: string;
    affectedSystems: string[];
    reporter: string;
  }): Promise<Incident> {
    const incident: Incident = {
      id: `INC-${Date.now()}`,
      title: params.title,
      severity: params.severity,
      status: 'open',
      impact: params.impact,
      affectedSystems: params.affectedSystems,
      createdAt: new Date(),
      timeline: [{
        timestamp: new Date(),
        type: 'created',
        message: `Incident created: ${params.title}`,
        author: params.reporter,
      }],
    };
    this.incidents.set(incident.id, incident);

    // Trigger notifications based on severity
    await this.notifyOnCall(incident);
    await this.createChannel(incident);
    await this.updateStatusPage(incident);

    return incident;
  }

  async updateStatus(
    incidentId: string,
    status: Incident['status'],
    message: string,
    author: string
  ): Promise<Incident> {
    const incident = this.incidents.get(incidentId);
    if (!incident) throw new Error(`Incident not found: ${incidentId}`);

    incident.status = status;
    incident.timeline.push({
      timestamp: new Date(),
      type: 'status_change',
      message: `Status changed to ${status}: ${message}`,
      author,
    });

    if (status === 'resolved') {
      incident.resolvedAt = new Date();
    }

    await this.updateStatusPage(incident);
    return incident;
  }

  private async notifyOnCall(incident: Incident): Promise<void> {
    // Integration with PagerDuty/Opsgenie
    console.log(`Notifying on-call for ${incident.severity}: ${incident.title}`);
  }

  private async createChannel(incident: Incident): Promise<void> {
    // Create a dedicated Slack channel for the incident
    incident.channel = `#incident-${incident.id.toLowerCase()}`;
  }

  private async updateStatusPage(incident: Incident): Promise<void> {
    // Update the public status page
    console.log(`Updating status page for ${incident.id}`);
  }
}
```
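The SLA values in `INCIDENT_MATRIX` are human-readable strings; paging and escalation logic usually needs concrete deadlines. A sketch of deriving a response deadline from a severity, with a minimal minute map inlined (mirroring the matrix above) so the example stands alone:

```typescript
// Response-time SLAs in minutes, mirroring INCIDENT_MATRIX above
// (P4 'next business day' is approximated as 24h for this sketch).
const responseMinutes: Record<string, number> = {
  P1: 15,
  P2: 30,
  P3: 4 * 60,
  P4: 24 * 60,
};

function responseDeadline(severity: string, createdAt: Date): Date {
  const minutes = responseMinutes[severity];
  if (minutes === undefined) throw new Error(`Unknown severity: ${severity}`);
  return new Date(createdAt.getTime() + minutes * 60_000);
}

const created = new Date('2024-01-01T00:00:00Z');
console.log(responseDeadline('P1', created).toISOString()); // 2024-01-01T00:15:00.000Z
```

Storing SLAs as numbers alongside the display strings avoids parsing '15 minutes' at paging time.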
## Runbook Automation

```typescript
// src/incidents/runbook.ts
interface RunbookStep {
  name: string;
  command?: string;
  script?: string;
  manual?: string;
  timeout?: number; // seconds
  retries?: number;
  onFailure?: 'continue' | 'abort' | 'escalate';
}

interface Runbook {
  id: string;
  name: string;
  description: string;
  triggers: string[];
  steps: RunbookStep[];
  escalation: {
    team: string;
    timeout: number; // seconds
  };
}

export const RUNBOOKS: Runbook[] = [
  {
    id: 'high-cpu',
    name: 'High CPU Usage',
    description: 'Remediation for sustained high CPU on application servers',
    triggers: ['alert:cpu_usage_high', 'alert:response_time_p99'],
    steps: [
      {
        name: 'Identify top processes',
        command: 'top -b -n 1 | head -20',
        timeout: 30,
      },
      {
        name: 'Check for stuck queries',
        script: `
          SELECT pid, now() - pg_stat_activity.query_start AS duration, query
          FROM pg_stat_activity
          WHERE state != 'idle'
          ORDER BY duration DESC
          LIMIT 10;
        `,
        timeout: 60,
      },
      {
        name: 'Scale up if needed',
        // kubectl scale takes an absolute replica count ('--replicas=+2' is
        // invalid); choose the target based on your baseline replica count
        command: 'kubectl scale deployment/api --replicas=6',
        timeout: 120,
        onFailure: 'escalate',
      },
      {
        name: 'Verify scaling',
        command: 'kubectl get pods -l app=api',
        timeout: 60,
      },
    ],
    escalation: {
      team: 'platform-oncall',
      timeout: 300,
    },
  },
  {
    id: 'database-connection-exhausted',
    name: 'Database Connection Pool Exhausted',
    description: 'Handle database connection pool exhaustion',
    triggers: ['alert:db_connections_high'],
    steps: [
      {
        name: 'Check active connections',
        script: `
          SELECT count(*) AS total,
                 state,
                 usename
          FROM pg_stat_activity
          GROUP BY state, usename
          ORDER BY count(*) DESC;
        `,
        timeout: 30,
      },
      {
        name: 'Identify long-running queries',
        script: `
          SELECT pid, query, now() - query_start AS duration
          FROM pg_stat_activity
          WHERE state = 'active'
            AND query_start < now() - interval '5 minutes'
          ORDER BY duration DESC;
        `,
        timeout: 30,
      },
      {
        name: 'Terminate idle connections',
        script: `
          SELECT pg_terminate_backend(pid)
          FROM pg_stat_activity
          WHERE state = 'idle'
            AND query_start < now() - interval '10 minutes';
        `,
        timeout: 60,
        onFailure: 'continue',
      },
      {
        name: 'Restart connection-heavy services',
        manual: 'Evaluate which services to rolling-restart based on connection count',
        onFailure: 'escalate',
      },
    ],
    escalation: {
      team: 'database-oncall',
      timeout: 600,
    },
  },
];

export class RunbookExecutor {
  async execute(runbook: Runbook, context: Record<string, unknown>): Promise<{
    success: boolean;
    results: Array<{ step: string; output: string; success: boolean }>;
  }> {
    const results: Array<{ step: string; output: string; success: boolean }> = [];

    for (const step of runbook.steps) {
      try {
        console.log(`Executing step: ${step.name}`);

        if (step.manual) {
          // Manual step - log it and continue; a human performs the action
          results.push({
            step: step.name,
            output: `Manual step: ${step.manual}`,
            success: true,
          });
          continue;
        }

        const output = await this.executeStep(step, context);
        results.push({ step: step.name, output, success: true });
      } catch (error) {
        const errorMsg = error instanceof Error ? error.message : String(error);
        results.push({ step: step.name, output: errorMsg, success: false });

        if (step.onFailure === 'abort') {
          return { success: false, results };
        } else if (step.onFailure === 'escalate') {
          await this.escalate(runbook, step, errorMsg);
          return { success: false, results };
        }
        // 'continue' (or unset) - proceed to the next step
      }
    }

    return { success: true, results };
  }

  private async executeStep(step: RunbookStep, context: Record<string, unknown>): Promise<string> {
    // Execute the command or script; this would integrate with your
    // execution environment (SSH, kubectl exec, database client, etc.)
    return `Executed: ${step.command || step.script}`;
  }

  private async escalate(runbook: Runbook, step: RunbookStep, error: string): Promise<void> {
    // PagerDuty/Opsgenie escalation
    console.log(`Escalating to ${runbook.escalation.team}: ${step.name} failed - ${error}`);
  }
}
```
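`RunbookStep` declares an optional `retries` field that the executor above does not act on. One way to honor it is a small retry wrapper around each step attempt; this is a sketch under that assumption, not the framework's API:

```typescript
// Retry wrapper matching the optional `retries` field on RunbookStep:
// run fn up to (retries + 1) times, rethrowing the last error on exhaustion.
async function withRetries<T>(fn: () => Promise<T>, retries = 0): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}
```

Inside `execute`, the step call would become `await withRetries(() => this.executeStep(step, context), step.retries ?? 0)`, keeping the existing `onFailure` handling for the final failure.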
## Post-Mortem Template

```typescript
// src/incidents/postmortem.ts
import { Incident, Severity } from './classification';

interface PostMortem {
  incidentId: string;
  title: string;
  date: Date;
  authors: string[];
  status: 'draft' | 'review' | 'published';
  summary: string;
  impact: {
    duration: string;
    affectedUsers: number;
    affectedSystems: string[];
    severity: Severity;
  };
  timeline: Array<{
    time: Date;
    event: string;
  }>;
  rootCause: string;
  contributingFactors: string[];
  detection: {
    howDetected: string;
    timeToDetect: string;
    whatWorked: string[];
    whatDidntWork: string[];
  };
  response: {
    timeToMitigate: string;
    timeToResolve: string;
    whatWorked: string[];
    whatDidntWork: string[];
  };
  actionItems: Array<{
    id: string;
    description: string;
    owner: string;
    priority: 'P1' | 'P2' | 'P3';
    dueDate: Date;
    status: 'open' | 'in_progress' | 'done';
    type: 'prevent' | 'detect' | 'respond';
  }>;
  lessonsLearned: string[];
}

export function generatePostMortemTemplate(incident: Incident): Partial<PostMortem> {
  const duration = incident.resolvedAt
    ? `${Math.round((incident.resolvedAt.getTime() - incident.createdAt.getTime()) / 60000)} minutes`
    : 'Ongoing';

  return {
    incidentId: incident.id,
    title: incident.title,
    date: new Date(),
    authors: [incident.commander || 'TBD'],
    status: 'draft',
    summary: `On ${incident.createdAt.toISOString()}, we experienced ${incident.title}. This affected ${incident.affectedSystems.join(', ')}.`,
    impact: {
      duration,
      affectedUsers: 0, // To be filled in during review
      affectedSystems: incident.affectedSystems,
      severity: incident.severity,
    },
    timeline: incident.timeline.map(e => ({
      time: e.timestamp,
      event: e.message,
    })),
    rootCause: 'TBD - To be determined during post-mortem review',
    contributingFactors: [],
    detection: {
      howDetected: 'TBD',
      timeToDetect: 'TBD',
      whatWorked: [],
      whatDidntWork: [],
    },
    response: {
      timeToMitigate: 'TBD',
      timeToResolve: duration,
      whatWorked: [],
      whatDidntWork: [],
    },
    actionItems: [],
    lessonsLearned: [],
  };
}

// Markdown export
export function exportToMarkdown(pm: PostMortem): string {
  return `# Post-Mortem: ${pm.title}

**Incident ID:** ${pm.incidentId}
**Date:** ${pm.date.toISOString().split('T')[0]}
**Authors:** ${pm.authors.join(', ')}
**Status:** ${pm.status}

## Summary

${pm.summary}

## Impact

- **Duration:** ${pm.impact.duration}
- **Affected Users:** ${pm.impact.affectedUsers}
- **Affected Systems:** ${pm.impact.affectedSystems.join(', ')}
- **Severity:** ${pm.impact.severity}

## Timeline

| Time | Event |
|------|-------|
${pm.timeline.map(e => `| ${e.time.toISOString()} | ${e.event} |`).join('\n')}

## Root Cause

${pm.rootCause}

### Contributing Factors

${pm.contributingFactors.map(f => `- ${f}`).join('\n')}

## Detection

- **How Detected:** ${pm.detection.howDetected}
- **Time to Detect:** ${pm.detection.timeToDetect}

### What Worked

${pm.detection.whatWorked.map(w => `- ${w}`).join('\n')}

### What Didn't Work

${pm.detection.whatDidntWork.map(w => `- ${w}`).join('\n')}

## Response

- **Time to Mitigate:** ${pm.response.timeToMitigate}
- **Time to Resolve:** ${pm.response.timeToResolve}

## Action Items

| ID | Description | Owner | Priority | Due Date | Status | Type |
|----|-------------|-------|----------|----------|--------|------|
${pm.actionItems.map(a => `| ${a.id} | ${a.description} | ${a.owner} | ${a.priority} | ${a.dueDate.toISOString().split('T')[0]} | ${a.status} | ${a.type} |`).join('\n')}

## Lessons Learned

${pm.lessonsLearned.map(l => `- ${l}`).join('\n')}
`;
}
```
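Fields like `timeToResolve` and `timeToDetect` can be derived from the incident timeline rather than filled in by hand. A sketch, assuming events shaped like the `IncidentEvent` interface above (the `TimelineEvent` shape here is a local stand-in):

```typescript
// Compute minutes between the first occurrence of two event types
// in a timeline; returns null if either milestone is missing.
interface TimelineEvent {
  timestamp: Date;
  type: string;
}

function minutesBetween(events: TimelineEvent[], from: string, to: string): number | null {
  const start = events.find(e => e.type === from);
  const end = events.find(e => e.type === to);
  if (!start || !end) return null; // missing milestone: leave the field as 'TBD'
  return Math.round((end.timestamp.getTime() - start.timestamp.getTime()) / 60_000);
}
```

For example, `minutesBetween(incident.timeline, 'created', 'resolved')` gives time-to-resolve in minutes, falling back to `null` (and so 'TBD') while the incident is still open.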
## Success Output

When successful, this skill MUST output:

```
✅ SKILL COMPLETE: incident-response-patterns

Completed:
- [x] Incident classification matrix defined (P1-P4)
- [x] Runbooks created for common issues
- [x] On-call rotation and escalation configured
- [x] Post-mortem template established

Outputs:
- Incident classification criteria with SLAs
- Automated runbooks with remediation steps
- On-call notification and escalation paths
- Post-mortem template with action item tracking
- Status page update workflow
```
## Completion Checklist

Before marking this skill as complete, verify:

- [ ] P1-P4 severity levels defined with response time SLAs
- [ ] Runbooks created for top 5 incident types
- [ ] On-call rotation configured with PagerDuty/Opsgenie
- [ ] Incident Slack channel creation automated
- [ ] Status page update workflow implemented
- [ ] Post-mortem template includes timeline, root cause, action items
- [ ] Blameless culture principles documented
- [ ] Escalation paths defined with timeouts
## Failure Indicators
This skill has FAILED if:
- ❌ No clear severity criteria (ambiguous P1 vs P2)
- ❌ Runbooks missing for critical systems
- ❌ No on-call notification integration
- ❌ Manual status page updates required
- ❌ Post-mortem template incomplete
- ❌ Blame-oriented incident culture
- ❌ No escalation timeouts defined
- ❌ Action items not tracked to completion
## When NOT to Use
Do NOT use this skill when:
- No production systems (dev/staging only)
- Single developer project (no on-call needed)
- No user-facing services (internal tools only)
- Incidents extremely rare (<1 per quarter)
- No SLA requirements
- Team size < 3 (no rotation needed)
Use alternative approaches when:
- Dev/staging → Simple alerting, no runbooks
- Single dev → Manual incident handling
- Internal tools → Best-effort support
- Rare incidents → Ad-hoc response acceptable
- No SLAs → Prioritize by severity informally
## Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Vague severity criteria | Inconsistent classification | Define clear P1-P4 examples and SLAs |
| Manual runbook steps only | Slow response, human error | Automate common remediation steps |
| No escalation timeouts | Incidents linger unresolved | Define escalation after X minutes |
| Blame-oriented post-mortems | Fear of reporting, hiding issues | Emphasize blameless culture |
| Action items without owners | Nothing gets fixed | Assign owner + due date to every action |
| Missing timeline | Hard to identify root cause | Document all events with timestamps |
| No detection analysis | Repeat incidents | Analyze "what worked/didn't work" |
| Ignoring contributing factors | Shallow fixes | Document all factors, not just root cause |
## Principles

This skill embodies these CODITECT principles:

### #5 Eliminate Ambiguity

- Clear P1-P4 criteria with examples
- Explicit SLAs for response time
- Defined escalation paths

### #6 Clear, Understandable, Explainable

- Runbooks with step-by-step instructions
- Post-mortems with timeline and rationale
- Blameless culture for transparency

### #7 Incremental Improvement

- Action items from every post-mortem
- Detection/response analysis
- Lessons learned feed into prevention

### #9 Automation First

- Automated runbook execution
- Automated channel creation
- Automated status page updates

### Reliability

- Runbooks prevent repeated incidents
- Escalation ensures resolution
- Post-mortems drive improvements
## Usage Examples

### Create Incident Response Process

Apply the incident-response-patterns skill to implement P1-P4 incident classification with SLAs.

### Automate Runbooks

Apply the incident-response-patterns skill to create automated runbooks for common infrastructure issues.

### Set Up Post-Mortem Process

Apply the incident-response-patterns skill to implement a blameless post-mortem workflow with action item tracking.
## Integration Points

- `monitoring-observability` - Alert integration
- `cicd-pipeline-design` - Rollback automation
- `error-handling-resilience` - Circuit breaker triggers
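The alert-integration point reduces to routing an incoming alert name to a runbook via its `triggers` list. A sketch, where `RunbookRef` mirrors just the `id` and `triggers` fields of the `Runbook` interface defined earlier:

```typescript
// Route an alert to the first runbook whose triggers list contains it;
// returns the runbook id, or null when no runbook matches.
interface RunbookRef {
  id: string;
  triggers: string[];
}

function findRunbook(alert: string, runbooks: RunbookRef[]): string | null {
  const match = runbooks.find(r => r.triggers.includes(alert));
  return match ? match.id : null;
}
```

An unmatched alert (null result) should fall through to normal on-call paging rather than being dropped.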