
Agent Skills Framework Extension (Optional)

Monitoring & Observability Skill

When to Use This Skill

Use this skill when implementing monitoring and observability patterns in your codebase or infrastructure.

How to Use This Skill

  1. Review the patterns and examples below
  2. Apply the relevant patterns to your implementation
  3. Follow the best practices outlined in this skill

Prometheus metrics, Grafana dashboards, distributed tracing, and intelligent alerting for production-grade observability.

Core Capabilities

  1. Metrics Collection - Prometheus, custom metrics, exporters
  2. Visualization - Grafana dashboards, panels, variables
  3. Alerting - Alert rules, routing, escalation
  4. Distributed Tracing - OpenTelemetry, Jaeger, trace correlation
  5. Log Aggregation - Loki, ELK, structured logging
  6. SLO/SLI - Service level objectives and indicators

Prometheus Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: production
    region: us-central1

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
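The densest part of the config above is the `__address__` relabeling. As an illustration only (the helper below is hypothetical, not a Prometheus API), this sketch reproduces the mechanics: Prometheus joins the source labels with `;`, applies the fully anchored regex `([^:]+)(?::\d+)?;(\d+)`, and substitutes `$1:$2` into the target label.

```typescript
// Hypothetical helper mirroring the __address__ relabel rule above.
// Prometheus regexes are anchored, hence the explicit ^ and $.
function rewriteAddress(address: string, portAnnotation: string): string {
  const joined = `${address};${portAnnotation}`;
  const re = /^([^:]+)(?::\d+)?;(\d+)$/;
  const match = joined.match(re);
  // If the regex does not match (e.g. no prometheus.io/port annotation),
  // the relabel rule leaves the label unchanged.
  if (!match) return address;
  return `${match[1]}:${match[2]}`;
}

// A pod discovered at 10.8.1.7:8080 with annotation prometheus.io/port: "9102"
// is scraped on port 9102 instead; an address without a port gets one appended.
console.log(rewriteAddress("10.8.1.7:8080", "9102")); // → 10.8.1.7:9102
console.log(rewriteAddress("10.8.1.7", "9102"));      // → 10.8.1.7:9102
```

This is why the annotation-driven pattern works for any container port: the original scrape port is discarded and replaced by whatever the pod advertises.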

Alert Rules

# prometheus/rules/api-alerts.yml
groups:
  - name: api-availability
    interval: 30s
    rules:
      - alert: APIHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
          / sum(rate(http_requests_total{job="api"}[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High error rate on API"
          description: "API error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
          runbook_url: "https://wiki.example.com/runbooks/api-errors"

      - alert: APIHighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)
          ) > 1.0
        for: 10m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High latency on API"
          description: "P99 latency is {{ $value | humanizeDuration }}"

      - alert: APIInstanceDown
        expr: up{job="api"} == 0
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "API instance down"
          description: "API instance {{ $labels.instance }} is down"

  - name: slo-compliance
    interval: 1m
    rules:
      # Error budget burn rate
      - alert: ErrorBudgetBurnRate
        expr: |
          sum(rate(http_requests_total{job="api",status=~"5.."}[1h]))
          / sum(rate(http_requests_total{job="api"}[1h]))
          > (14.4 * (1 - 0.999))
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Error budget burning fast"
          description: "At current rate, monthly error budget will be exhausted"

      # SLO recording rules (availability)
      - record: slo:availability:ratio_rate5m
        expr: |
          1 - (
            sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m]))
          )

      - record: slo:availability:ratio_rate30d
        expr: avg_over_time(slo:availability:ratio_rate5m[30d])
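The `14.4 * (1 - 0.999)` threshold in ErrorBudgetBurnRate is the multiwindow burn-rate pattern popularized by Google's SRE workbook. A sketch of the arithmetic (plain numbers, no Prometheus involved):

```typescript
// Burn-rate arithmetic behind the ErrorBudgetBurnRate threshold above.
const slo = 0.999;                 // availability objective (99.9%)
const errorBudget = 1 - slo;       // fraction of requests allowed to fail
const burnRateFactor = 14.4;       // alert when burning 14.4x the sustainable rate

// The alert fires when the observed 1h error ratio exceeds this threshold:
const threshold = burnRateFactor * errorBudget; // 0.0144, i.e. 1.44% errors

// At a sustained burn rate of 14.4, a 30-day budget is exhausted in
// 30 / 14.4 days, i.e. about 50 hours -- hence the critical severity.
const hoursToExhaustion = (30 / burnRateFactor) * 24;

console.log(threshold.toFixed(4), hoursToExhaustion.toFixed(0));
```

A burn rate of 1.0 would consume the budget exactly at the end of the window; the factor trades detection speed against false positives.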

Alertmanager Configuration

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "${SLACK_WEBHOOK_URL}"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: critical
      receiver: 'slack-critical'
    - match:
        severity: warning
      receiver: 'slack-warnings'
    - match:
        team: platform
      receiver: 'slack-platform'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: "${PAGERDUTY_SERVICE_KEY}"
        severity: critical
        description: "{{ .GroupLabels.alertname }}"
        details:
          firing: "{{ .Alerts.Firing | len }}"
          resolved: "{{ .Alerts.Resolved | len }}"

  - name: 'slack-critical'
    slack_configs:
      - channel: '#alerts-critical'
        color: 'danger'
        title: '🚨 {{ .GroupLabels.alertname }}'
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ (index .Alerts 0).Annotations.runbook_url }}'
          - type: button
            text: 'Silence'
            url: '{{ template "__alertmanagerURL" . }}/#/silences/new?filter=%7B{{ range .GroupLabels.SortedPairs }}{{ .Name }}%3D%22{{ .Value }}%22%2C{{ end }}%7D'

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        color: 'warning'

  - name: 'slack-platform'
    slack_configs:
      - channel: '#platform-alerts'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
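The subtle part of the route tree above is `continue: true`: routes are evaluated top to bottom, and a match normally stops evaluation, so without `continue` a critical alert would page PagerDuty but never reach Slack. This hedged sketch (a toy model, not Alertmanager's actual matcher) shows the fan-out:

```typescript
// Toy model of Alertmanager's top-to-bottom route matching with `continue`.
type Labels = Record<string, string>;

interface Route {
  match: Labels;
  receiver: string;
  continue?: boolean; // keep evaluating later routes after a match
}

// Mirrors the routes: block in the config above.
const routes: Route[] = [
  { match: { severity: "critical" }, receiver: "pagerduty-critical", continue: true },
  { match: { severity: "critical" }, receiver: "slack-critical" },
  { match: { severity: "warning" }, receiver: "slack-warnings" },
  { match: { team: "platform" }, receiver: "slack-platform" },
];

function receiversFor(labels: Labels, fallback = "default"): string[] {
  const hit: string[] = [];
  for (const route of routes) {
    const matches = Object.entries(route.match).every(([k, v]) => labels[k] === v);
    if (!matches) continue;
    hit.push(route.receiver);
    if (!route.continue) return hit; // first non-continue match ends evaluation
  }
  return hit.length ? hit : [fallback]; // unmatched alerts use the root receiver
}

// A critical platform alert pages AND posts to #alerts-critical; it never
// reaches the team: platform route because slack-critical stops evaluation.
console.log(receiversFor({ severity: "critical", team: "platform" }));
// → ["pagerduty-critical", "slack-critical"]
```

The same logic explains why `team: platform` only catches alerts that are neither critical nor warning.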

Grafana Dashboard

{
  "title": "API Service Dashboard",
  "uid": "api-service",
  "tags": ["api", "production"],
  "timezone": "browser",
  "schemaVersion": 38,
  "version": 1,
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(http_requests_total, namespace)",
        "current": { "text": "production", "value": "production" }
      }
    ]
  },
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{namespace=\"$namespace\"}[5m])) by (status)",
          "legendFormat": "{{ status }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "reqps",
          "custom": {
            "drawStyle": "line",
            "lineInterpolation": "smooth"
          }
        }
      }
    },
    {
      "title": "Latency Percentiles",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
      "targets": [
        {
          "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[5m])) by (le))",
          "legendFormat": "p50"
        },
        {
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[5m])) by (le))",
          "legendFormat": "p95"
        },
        {
          "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[5m])) by (le))",
          "legendFormat": "p99"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "s"
        }
      }
    },
    {
      "title": "Error Rate SLI",
      "type": "stat",
      "gridPos": { "h": 4, "w": 6, "x": 0, "y": 8 },
      "targets": [
        {
          "expr": "1 - sum(rate(http_requests_total{namespace=\"$namespace\",status=~\"5..\"}[24h])) / sum(rate(http_requests_total{namespace=\"$namespace\"}[24h]))",
          "instant": true
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percentunit",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "red", "value": 0 },
              { "color": "yellow", "value": 0.99 },
              { "color": "green", "value": 0.999 }
            ]
          }
        }
      }
    }
  ]
}
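The latency panels above lean on `histogram_quantile()`. It estimates quantiles from cumulative bucket counts by linear interpolation within the bucket containing the target rank. A simplified sketch of that estimate (assumptions: cumulative counts, buckets sorted by `le`, last bucket is `+Inf`; PromQL edge cases like NaN handling are ignored):

```typescript
// Simplified model of PromQL's histogram_quantile() interpolation.
interface Bucket { le: number; count: number } // count is CUMULATIVE

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  for (let i = 0; i < buckets.length; i++) {
    if (buckets[i].count >= rank) {
      // Rank falls above the highest finite bucket: report that bound.
      if (!isFinite(buckets[i].le)) return buckets[i - 1].le;
      const lower = i === 0 ? 0 : buckets[i - 1].le;
      const prev = i === 0 ? 0 : buckets[i - 1].count;
      // Interpolate linearly inside the bucket holding the rank-th value.
      return lower + (buckets[i].le - lower) * ((rank - prev) / (buckets[i].count - prev));
    }
  }
  return NaN;
}

// 100 requests: 60 under 0.1s, 90 under 0.5s, all under 1s.
const buckets = [
  { le: 0.1, count: 60 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // → 0.75 (inside the 0.5–1.0s bucket)
```

This is also why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile lands in.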

OpenTelemetry Configuration

# otel-collector/otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 10s
          static_configs:
            - targets: ['0.0.0.0:8888']

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200

  attributes:
    actions:
      - key: environment
        value: production
        action: insert

exporters:
  googlecloud:
    project: ${GCP_PROJECT_ID}

  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: otel

  # NOTE: recent collector releases removed the dedicated jaeger exporter;
  # Jaeger >= 1.35 accepts OTLP directly, so an otlp exporter pointed at the
  # Jaeger collector is the current equivalent.
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true

  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, attributes]
      exporters: [jaeger, googlecloud]

    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheus, googlecloud]

    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki, googlecloud]

Application Instrumentation

// src/observability/tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const resource = new Resource({
  [SemanticResourceAttributes.SERVICE_NAME]: 'api',
  [SemanticResourceAttributes.SERVICE_VERSION]: process.env.VERSION || '1.0.0',
  [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
});

const sdk = new NodeSDK({
  resource,
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
    }),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});

SLO/SLI Definitions

# slo/api-slos.yml
slos:
  - name: api-availability
    description: "API availability SLO"
    service: api
    sli:
      type: availability
      metric: |
        1 - (
          sum(rate(http_requests_total{job="api",status=~"5.."}[{{.window}}]))
          / sum(rate(http_requests_total{job="api"}[{{.window}}]))
        )
    objective: 0.999 # 99.9%
    windows:
      - 30d
    error_budget:
      monthly_budget_minutes: 43.2 # 30 days * 24h * 60min * 0.001

  - name: api-latency
    description: "API latency SLO"
    service: api
    sli:
      type: latency
      metric: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket{job="api"}[{{.window}}])) by (le)
        )
    objective: 0.5 # 500ms p99
    comparison: "<="
    windows:
      - 30d
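The `monthly_budget_minutes: 43.2` figure above falls straight out of the objective: allowed downtime is the total minutes in the window times the failure fraction. A two-line sketch (the function name is illustrative, not part of any SLO tool):

```typescript
// Error-budget arithmetic behind monthly_budget_minutes above.
function monthlyBudgetMinutes(objective: number, days = 30): number {
  // Allowed downtime = minutes in the window * (1 - objective).
  return days * 24 * 60 * (1 - objective);
}

console.log(monthlyBudgetMinutes(0.999));  // → 43.2 minutes for a 99.9% SLO
console.log(monthlyBudgetMinutes(0.9999)); // → 4.32 minutes at 99.99%
```

Each extra nine divides the budget by ten, which is why tightening an SLO is an operational decision, not just a config change.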

Usage Examples

Configure Prometheus Monitoring

Apply monitoring-observability skill to set up Prometheus monitoring for Kubernetes with service discovery and custom metrics

Create Grafana Dashboard

Apply monitoring-observability skill to create a Grafana dashboard for API service with request rate, latency percentiles, and error rate SLI

Implement OpenTelemetry

Apply monitoring-observability skill to add OpenTelemetry instrumentation with traces, metrics, and logs exported to GCP

Success Output

When successful, this skill MUST output:

✅ SKILL COMPLETE: monitoring-observability

Completed:
- [x] Prometheus configuration deployed
- [x] Alert rules defined and loaded
- [x] Alertmanager routing configured
- [x] Grafana dashboards created
- [x] OpenTelemetry instrumentation added
- [x] SLO/SLI metrics defined
- [x] Monitoring stack verified

Outputs:
- prometheus/prometheus.yml
- prometheus/rules/*.yml
- alertmanager/alertmanager.yml
- grafana/dashboards/*.json
- otel-collector/otel-collector-config.yaml
- slo/api-slos.yml
- Application instrumentation: src/observability/tracing.ts

Verification:
- Prometheus targets up: curl http://prometheus:9090/api/v1/targets
- Grafana dashboards accessible: http://grafana:3000
- Traces visible in Jaeger: http://jaeger:16686
- Alerts firing correctly: http://alertmanager:9093

Completion Checklist

Before marking this skill as complete, verify:

  • Prometheus deployed and scraping targets successfully
  • All Kubernetes service discovery working (pods, endpoints)
  • Alert rules validated and loaded without errors
  • Alertmanager routing to correct receivers (Slack, PagerDuty)
  • Grafana dashboards show live data from Prometheus
  • OpenTelemetry collector receiving traces, metrics, logs
  • Application instrumentation exporting telemetry
  • Jaeger showing distributed traces across services
  • SLO/SLI metrics recording correctly
  • Error budget calculations accurate
  • Test alerts trigger and route correctly
  • Monitoring stack survives pod restarts
  • Data retention policies configured
  • Authentication/authorization enabled for UIs

Failure Indicators

This skill has FAILED if:

  • ❌ Prometheus cannot reach scrape targets (check network policies)
  • ❌ Alert rules fail validation (syntax errors)
  • ❌ Alertmanager not sending notifications (check routing/receivers)
  • ❌ Grafana dashboards empty (no data source connection)
  • ❌ OpenTelemetry collector crashing (check resource limits)
  • ❌ No traces appearing in Jaeger (instrumentation not working)
  • ❌ SLO metrics always show 100% (no error data captured)
  • ❌ Alerts not firing during test failure scenarios
  • ❌ High cardinality metrics causing memory issues
  • ❌ Data loss after Prometheus restart (no persistent volume)

When NOT to Use

Do NOT use monitoring-observability when:

  • Development environment only - Use simpler logging (console.log, print statements)
  • Proof-of-concept projects - Monitoring overhead not justified
  • Serverless applications - Use platform-native monitoring (CloudWatch, Azure Monitor)
  • Third-party SaaS only - Use vendor monitoring (Datadog, New Relic) instead
  • No production deployment - Premature optimization for non-production apps
  • Static websites - Simple uptime monitoring sufficient
  • Batch jobs only - Use job-specific logging instead

  • Use platform monitoring when: cloud-native services (AWS, GCP, Azure) with native tools
  • Use vendor SaaS when: the team lacks observability expertise or infrastructure capacity
  • Use this skill when: self-hosted production applications needing a full observability stack

Anti-Patterns (Avoid)

| Anti-Pattern | Problem | Solution |
| --- | --- | --- |
| No resource limits | Prometheus/Grafana consume all memory | Define memory/CPU limits and requests |
| High cardinality labels | Memory explosion, slow queries | Limit label values, avoid user IDs in labels |
| No retention policy | Disk fills up, cluster crashes | Configure retention time and size limits |
| Missing service discovery | Manual target configuration breaks | Use Kubernetes SD for auto-discovery |
| No persistent volumes | Data lost on pod restart | Use PVCs for Prometheus, Grafana, Alertmanager |
| Hardcoded credentials | Security vulnerability | Use Secrets for API tokens and passwords |
| Alert fatigue | Too many low-value alerts | Use SLO-based alerting, proper thresholds |
| No recording rules | Slow dashboard queries | Precompute complex queries as recording rules |
| Missing timestamps | Broken distributed tracing | Use NTP, enable timestamps in OTLP exporters |
| No backup strategy | Lose historical data on failure | Regular snapshots of Prometheus TSDB |
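The high-cardinality anti-pattern deserves a number: the series count for a metric is the product of each label's distinct values, so one unbounded label multiplies everything else. A back-of-the-envelope sketch:

```typescript
// Series count for one metric = product of per-label cardinalities.
function seriesCount(labelCardinalities: number[]): number {
  return labelCardinalities.reduce((a, b) => a * b, 1);
}

// method (5) x status (10) x path (50): manageable.
console.log(seriesCount([5, 10, 50]));         // → 2500 series
// Add a user_id label with 100k distinct users and it explodes.
console.log(seriesCount([5, 10, 50, 100000])); // → 250000000 series
```

Identifiers like user IDs, request IDs, or raw URLs belong in traces and logs, not in metric labels.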

Principles

This skill embodies the following CODITECT principles:

#2 First Principles Thinking:

  • Understand WHY we monitor: detect failures, measure performance, understand user impact
  • Apply observability pillars: metrics, logs, traces

#5 Eliminate Ambiguity:

  • Clear SLO definitions (99.9% availability = 43.2 min/month downtime)
  • Explicit alert severity levels (critical, warning, info)

#6 Clear, Understandable, Explainable:

  • Grafana dashboards with descriptive titles and units
  • Alert annotations include runbook URLs and descriptions
  • Metrics use semantic naming (http_requests_total, not metric1)

#8 No Assumptions:

  • Verify Prometheus targets healthy before trusting data
  • Test alerting with intentional failures
  • Validate retention policies match storage capacity

#10 Automation First:

  • Kubernetes service discovery (no manual target config)
  • Automated dashboard provisioning via ConfigMaps
  • GitOps for alert rule management

Full Standard: CODITECT-STANDARD-AUTOMATION.md

Integration Points

  • container-orchestration - Pod metrics, Kubernetes monitoring
  • cicd-pipeline-design - Deployment monitoring, canary metrics
  • infrastructure-as-code - Monitoring infrastructure provisioning
  • multi-tenant-security - Tenant-aware metrics and alerts