
Agent Skills Framework Extension (Optional)

Monitoring & Observability Skill

When to Use This Skill

Use this skill when implementing monitoring and observability patterns in your codebase or infrastructure.

How to Use This Skill

  1. Review the patterns and examples below
  2. Apply the relevant patterns to your implementation
  3. Follow the best practices outlined in this skill

Prometheus metrics, Grafana dashboards, distributed tracing, and intelligent alerting for production-grade observability.

Core Capabilities

  1. Metrics Collection - Prometheus, custom metrics, exporters
  2. Visualization - Grafana dashboards, panels, variables
  3. Alerting - Alert rules, routing, escalation
  4. Distributed Tracing - OpenTelemetry, Jaeger, trace correlation
  5. Log Aggregation - Loki, ELK, structured logging
  6. SLO/SLI - Service level objectives and indicators

Prometheus Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: production
    region: us-central1

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
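The densest part of the config above is the `__address__` relabeling. As an illustration only (the helper below is hypothetical, not a Prometheus API), this sketch reproduces the mechanics: Prometheus joins the source labels with `;`, applies the fully anchored regex `([^:]+)(?::\d+)?;(\d+)`, and substitutes `$1:$2` into the target label.

```typescript
// Hypothetical helper mirroring the __address__ relabel rule above.
// Prometheus regexes are anchored, hence the explicit ^ and $.
function rewriteAddress(address: string, portAnnotation: string): string {
  const joined = `${address};${portAnnotation}`;
  const re = /^([^:]+)(?::\d+)?;(\d+)$/;
  const match = joined.match(re);
  // If the regex does not match (e.g. no prometheus.io/port annotation),
  // the relabel rule leaves the label unchanged.
  if (!match) return address;
  return `${match[1]}:${match[2]}`;
}

// A pod discovered at 10.8.1.7:8080 with annotation prometheus.io/port: "9102"
// is scraped on port 9102 instead; an address without a port gets one appended.
console.log(rewriteAddress("10.8.1.7:8080", "9102")); // → 10.8.1.7:9102
console.log(rewriteAddress("10.8.1.7", "9102"));      // → 10.8.1.7:9102
```

This is why the annotation-driven pattern works for any container port: the original scrape port is discarded and replaced by whatever the pod advertises.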

Alert Rules

# prometheus/rules/api-alerts.yml
groups:
  - name: api-availability
    interval: 30s
    rules:
      - alert: APIHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
          / sum(rate(http_requests_total{job="api"}[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High error rate on API"
          description: "API error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
          runbook_url: "https://wiki.example.com/runbooks/api-errors"

      - alert: APIHighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)
          ) > 1.0
        for: 10m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High latency on API"
          description: "P99 latency is {{ $value | humanizeDuration }}"

      - alert: APIInstanceDown
        expr: up{job="api"} == 0
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "API instance down"
          description: "API instance {{ $labels.instance }} is down"

  - name: slo-compliance
    interval: 1m
    rules:
      # Error budget burn rate
      - alert: ErrorBudgetBurnRate
        expr: |
          sum(rate(http_requests_total{job="api",status=~"5.."}[1h]))
          / sum(rate(http_requests_total{job="api"}[1h]))
          > (14.4 * (1 - 0.999))
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Error budget burning fast"
          description: "At current rate, monthly error budget will be exhausted"

      # SLO recording rules (availability)
      - record: slo:availability:ratio_rate5m
        expr: |
          1 - (
            sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m]))
          )

      - record: slo:availability:ratio_rate30d
        expr: avg_over_time(slo:availability:ratio_rate5m[30d])
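The `14.4 * (1 - 0.999)` threshold in ErrorBudgetBurnRate is the multiwindow burn-rate pattern popularized by Google's SRE workbook. A sketch of the arithmetic (plain numbers, no Prometheus involved):

```typescript
// Burn-rate arithmetic behind the ErrorBudgetBurnRate threshold above.
const slo = 0.999;                 // availability objective (99.9%)
const errorBudget = 1 - slo;       // fraction of requests allowed to fail
const burnRateFactor = 14.4;       // alert when burning 14.4x the sustainable rate

// The alert fires when the observed 1h error ratio exceeds this threshold:
const threshold = burnRateFactor * errorBudget; // 0.0144, i.e. 1.44% errors

// At a sustained burn rate of 14.4, a 30-day budget is exhausted in
// 30 / 14.4 days, i.e. about 50 hours -- hence the critical severity.
const hoursToExhaustion = (30 / burnRateFactor) * 24;

console.log(threshold.toFixed(4), hoursToExhaustion.toFixed(0));
```

A burn rate of 1.0 would consume the budget exactly at the end of the window; the factor trades detection speed against false positives.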

Alertmanager Configuration

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "${SLACK_WEBHOOK_URL}"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: critical
      receiver: 'slack-critical'
    - match:
        severity: warning
      receiver: 'slack-warnings'
    - match:
        team: platform
      receiver: 'slack-platform'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: "${PAGERDUTY_SERVICE_KEY}"
        severity: critical
        description: "{{ .GroupLabels.alertname }}"
        details:
          firing: "{{ .Alerts.Firing | len }}"
          resolved: "{{ .Alerts.Resolved | len }}"

  - name: 'slack-critical'
    slack_configs:
      - channel: '#alerts-critical'
        color: 'danger'
        title: '🚨 {{ .GroupLabels.alertname }}'
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ (index .Alerts 0).Annotations.runbook_url }}'
          - type: button
            text: 'Silence'
            url: '{{ template "__alertmanagerURL" . }}/#/silences/new?filter=%7B{{ range .GroupLabels.SortedPairs }}{{ .Name }}%3D%22{{ .Value }}%22%2C{{ end }}%7D'

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        color: 'warning'

  - name: 'slack-platform'
    slack_configs:
      - channel: '#platform-alerts'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
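The subtle part of the route tree above is `continue: true`: routes are evaluated top to bottom, and a match normally stops evaluation, so without `continue` a critical alert would page PagerDuty but never reach Slack. This hedged sketch (a toy model, not Alertmanager's actual matcher) shows the fan-out:

```typescript
// Toy model of Alertmanager's top-to-bottom route matching with `continue`.
type Labels = Record<string, string>;

interface Route {
  match: Labels;
  receiver: string;
  continue?: boolean; // keep evaluating later routes after a match
}

// Mirrors the routes: block in the config above.
const routes: Route[] = [
  { match: { severity: "critical" }, receiver: "pagerduty-critical", continue: true },
  { match: { severity: "critical" }, receiver: "slack-critical" },
  { match: { severity: "warning" }, receiver: "slack-warnings" },
  { match: { team: "platform" }, receiver: "slack-platform" },
];

function receiversFor(labels: Labels, fallback = "default"): string[] {
  const hit: string[] = [];
  for (const route of routes) {
    const matches = Object.entries(route.match).every(([k, v]) => labels[k] === v);
    if (!matches) continue;
    hit.push(route.receiver);
    if (!route.continue) return hit; // first non-continue match ends evaluation
  }
  return hit.length ? hit : [fallback]; // unmatched alerts use the root receiver
}

// A critical platform alert pages AND posts to #alerts-critical; it never
// reaches the team: platform route because slack-critical stops evaluation.
console.log(receiversFor({ severity: "critical", team: "platform" }));
// → ["pagerduty-critical", "slack-critical"]
```

The same logic explains why `team: platform` only catches alerts that are neither critical nor warning.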

Grafana Dashboard

{
  "title": "API Service Dashboard",
  "uid": "api-service",
  "tags": ["api", "production"],
  "timezone": "browser",
  "schemaVersion": 38,
  "version": 1,
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(http_requests_total, namespace)",
        "current": { "text": "production", "value": "production" }
      }
    ]
  },
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{namespace=\"$namespace\"}[5m])) by (status)",
          "legendFormat": "{{ status }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "reqps",
          "custom": {
            "drawStyle": "line",
            "lineInterpolation": "smooth"
          }
        }
      }
    },
    {
      "title": "Latency Percentiles",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
      "targets": [
        {
          "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[5m])) by (le))",
          "legendFormat": "p50"
        },
        {
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[5m])) by (le))",
          "legendFormat": "p95"
        },
        {
          "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[5m])) by (le))",
          "legendFormat": "p99"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "s"
        }
      }
    },
    {
      "title": "Error Rate SLI",
      "type": "stat",
      "gridPos": { "h": 4, "w": 6, "x": 0, "y": 8 },
      "targets": [
        {
          "expr": "1 - sum(rate(http_requests_total{namespace=\"$namespace\",status=~\"5..\"}[24h])) / sum(rate(http_requests_total{namespace=\"$namespace\"}[24h]))",
          "instant": true
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percentunit",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "red", "value": 0 },
              { "color": "yellow", "value": 0.99 },
              { "color": "green", "value": 0.999 }
            ]
          }
        }
      }
    }
  ]
}
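The latency panels above lean on `histogram_quantile()`. It estimates quantiles from cumulative bucket counts by linear interpolation within the bucket containing the target rank. A simplified sketch of that estimate (assumptions: cumulative counts, buckets sorted by `le`, last bucket is `+Inf`; PromQL edge cases like NaN handling are ignored):

```typescript
// Simplified model of PromQL's histogram_quantile() interpolation.
interface Bucket { le: number; count: number } // count is CUMULATIVE

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  for (let i = 0; i < buckets.length; i++) {
    if (buckets[i].count >= rank) {
      // Rank falls above the highest finite bucket: report that bound.
      if (!isFinite(buckets[i].le)) return buckets[i - 1].le;
      const lower = i === 0 ? 0 : buckets[i - 1].le;
      const prev = i === 0 ? 0 : buckets[i - 1].count;
      // Interpolate linearly inside the bucket holding the rank-th value.
      return lower + (buckets[i].le - lower) * ((rank - prev) / (buckets[i].count - prev));
    }
  }
  return NaN;
}

// 100 requests: 60 under 0.1s, 90 under 0.5s, all under 1s.
const buckets = [
  { le: 0.1, count: 60 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // → 0.75 (inside the 0.5–1.0s bucket)
```

This is also why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile lands in.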

OpenTelemetry Configuration

# otel-collector/otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 10s
          static_configs:
            - targets: ['0.0.0.0:8888']

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200

  attributes:
    actions:
      - key: environment
        value: production
        action: insert

exporters:
  googlecloud:
    project: ${GCP_PROJECT_ID}

  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: otel

  # NOTE: recent collector releases removed the dedicated jaeger exporter;
  # Jaeger >= 1.35 accepts OTLP directly, so an otlp exporter pointed at the
  # Jaeger collector is the current equivalent.
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true

  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, attributes]
      exporters: [jaeger, googlecloud]

    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheus, googlecloud]

    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki, googlecloud]

Application Instrumentation

// src/observability/tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const resource = new Resource({
  [SemanticResourceAttributes.SERVICE_NAME]: 'api',
  [SemanticResourceAttributes.SERVICE_VERSION]: process.env.VERSION || '1.0.0',
  [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
});

const sdk = new NodeSDK({
  resource,
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
    }),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});

SLO/SLI Definitions

# slo/api-slos.yml
slos:
  - name: api-availability
    description: "API availability SLO"
    service: api
    sli:
      type: availability
      metric: |
        1 - (
          sum(rate(http_requests_total{job="api",status=~"5.."}[{{.window}}]))
          / sum(rate(http_requests_total{job="api"}[{{.window}}]))
        )
    objective: 0.999 # 99.9%
    windows:
      - 30d
    error_budget:
      monthly_budget_minutes: 43.2 # 30 days * 24h * 60min * 0.001

  - name: api-latency
    description: "API latency SLO"
    service: api
    sli:
      type: latency
      metric: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket{job="api"}[{{.window}}])) by (le)
        )
    objective: 0.5 # 500ms p99
    comparison: "<="
    windows:
      - 30d
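The `monthly_budget_minutes: 43.2` figure above falls straight out of the objective: allowed downtime is the total minutes in the window times the failure fraction. A two-line sketch (the function name is illustrative, not part of any SLO tool):

```typescript
// Error-budget arithmetic behind monthly_budget_minutes above.
function monthlyBudgetMinutes(objective: number, days = 30): number {
  // Allowed downtime = minutes in the window * (1 - objective).
  return days * 24 * 60 * (1 - objective);
}

console.log(monthlyBudgetMinutes(0.999));  // → 43.2 minutes for a 99.9% SLO
console.log(monthlyBudgetMinutes(0.9999)); // → 4.32 minutes at 99.99%
```

Each extra nine divides the budget by ten, which is why tightening an SLO is an operational decision, not just a config change.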

Usage Examples

Configure Prometheus Monitoring

Apply monitoring-observability skill to set up Prometheus monitoring for Kubernetes with service discovery and custom metrics

Create Grafana Dashboard

Apply monitoring-observability skill to create a Grafana dashboard for API service with request rate, latency percentiles, and error rate SLI

Implement OpenTelemetry

Apply monitoring-observability skill to add OpenTelemetry instrumentation with traces, metrics, and logs exported to GCP

Success Output

When successful, this skill MUST output:

✅ SKILL COMPLETE: monitoring-observability

Completed:
- [x] Prometheus configuration deployed
- [x] Alert rules defined and loaded
- [x] Alertmanager routing configured
- [x] Grafana dashboards created
- [x] OpenTelemetry instrumentation added
- [x] SLO/SLI metrics defined
- [x] Monitoring stack verified

Outputs:
- prometheus/prometheus.yml
- prometheus/rules/*.yml
- alertmanager/alertmanager.yml
- grafana/dashboards/*.json
- otel-collector/otel-collector-config.yaml
- slo/api-slos.yml
- Application instrumentation: src/observability/tracing.ts

Verification:
- Prometheus targets up: curl http://prometheus:9090/api/v1/targets
- Grafana dashboards accessible: http://grafana:3000
- Traces visible in Jaeger: http://jaeger:16686
- Alerts firing correctly: http://alertmanager:9093

Completion Checklist

Before marking this skill as complete, verify:

  • Prometheus deployed and scraping targets successfully
  • All Kubernetes service discovery working (pods, endpoints)
  • Alert rules validated and loaded without errors
  • Alertmanager routing to correct receivers (Slack, PagerDuty)
  • Grafana dashboards show live data from Prometheus
  • OpenTelemetry collector receiving traces, metrics, logs
  • Application instrumentation exporting telemetry
  • Jaeger showing distributed traces across services
  • SLO/SLI metrics recording correctly
  • Error budget calculations accurate
  • Test alerts trigger and route correctly
  • Monitoring stack survives pod restarts
  • Data retention policies configured
  • Authentication/authorization enabled for UIs

Failure Indicators

This skill has FAILED if:

  • ❌ Prometheus cannot reach scrape targets (check network policies)
  • ❌ Alert rules fail validation (syntax errors)
  • ❌ Alertmanager not sending notifications (check routing/receivers)
  • ❌ Grafana dashboards empty (no data source connection)
  • ❌ OpenTelemetry collector crashing (check resource limits)
  • ❌ No traces appearing in Jaeger (instrumentation not working)
  • ❌ SLO metrics always show 100% (no error data captured)
  • ❌ Alerts not firing during test failure scenarios
  • ❌ High cardinality metrics causing memory issues
  • ❌ Data loss after Prometheus restart (no persistent volume)

When NOT to Use

Do NOT use monitoring-observability when:

  • Development environment only - Use simpler logging (console.log, print statements)
  • Proof-of-concept projects - Monitoring overhead not justified
  • Serverless applications - Use platform-native monitoring (CloudWatch, Azure Monitor)
  • Third-party SaaS only - Use vendor monitoring (Datadog, New Relic) instead
  • No production deployment - Premature optimization for non-production apps
  • Static websites - Simple uptime monitoring sufficient
  • Batch jobs only - Use job-specific logging instead

  • Use platform monitoring when: cloud-native services (AWS, GCP, Azure) with native tools
  • Use vendor SaaS when: the team lacks observability expertise or infrastructure capacity
  • Use this skill when: self-hosted production applications needing a full observability stack

Anti-Patterns (Avoid)

| Anti-Pattern | Problem | Solution |
| --- | --- | --- |
| No resource limits | Prometheus/Grafana consume all memory | Define memory/CPU limits and requests |
| High cardinality labels | Memory explosion, slow queries | Limit label values, avoid user IDs in labels |
| No retention policy | Disk fills up, cluster crashes | Configure retention time and size limits |
| Missing service discovery | Manual target configuration breaks | Use Kubernetes SD for auto-discovery |
| No persistent volumes | Data lost on pod restart | Use PVCs for Prometheus, Grafana, Alertmanager |
| Hardcoded credentials | Security vulnerability | Use Secrets for API tokens and passwords |
| Alert fatigue | Too many low-value alerts | Use SLO-based alerting, proper thresholds |
| No recording rules | Slow dashboard queries | Precompute complex queries as recording rules |
| Missing timestamps | Broken distributed tracing | Use NTP, enable timestamps in OTLP exporters |
| No backup strategy | Lose historical data on failure | Regular snapshots of Prometheus TSDB |
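The high-cardinality anti-pattern deserves a number: the series count for a metric is the product of each label's distinct values, so one unbounded label multiplies everything else. A back-of-the-envelope sketch:

```typescript
// Series count for one metric = product of per-label cardinalities.
function seriesCount(labelCardinalities: number[]): number {
  return labelCardinalities.reduce((a, b) => a * b, 1);
}

// method (5) x status (10) x path (50): manageable.
console.log(seriesCount([5, 10, 50]));         // → 2500 series
// Add a user_id label with 100k distinct users and it explodes.
console.log(seriesCount([5, 10, 50, 100000])); // → 250000000 series
```

Identifiers like user IDs, request IDs, or raw URLs belong in traces and logs, not in metric labels.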

Principles

This skill embodies the following CODITECT principles:

#2 First Principles Thinking:

  • Understand WHY we monitor: detect failures, measure performance, understand user impact
  • Apply observability pillars: metrics, logs, traces

#5 Eliminate Ambiguity:

  • Clear SLO definitions (99.9% availability = 43.2 min/month downtime)
  • Explicit alert severity levels (critical, warning, info)

#6 Clear, Understandable, Explainable:

  • Grafana dashboards with descriptive titles and units
  • Alert annotations include runbook URLs and descriptions
  • Metrics use semantic naming (http_requests_total, not metric1)

#8 No Assumptions:

  • Verify Prometheus targets healthy before trusting data
  • Test alerting with intentional failures
  • Validate retention policies match storage capacity

#10 Automation First:

  • Kubernetes service discovery (no manual target config)
  • Automated dashboard provisioning via ConfigMaps
  • GitOps for alert rule management

Full Standard: CODITECT-STANDARD-AUTOMATION.md

Integration Points

  • container-orchestration - Pod metrics, Kubernetes monitoring
  • cicd-pipeline-design - Deployment monitoring, canary metrics
  • infrastructure-as-code - Monitoring infrastructure provisioning
  • multi-tenant-security - Tenant-aware metrics and alerts