Agent Skills Framework Extension (Optional)
Monitoring & Observability Skill
When to Use This Skill
Use this skill when implementing monitoring and observability patterns (metrics, dashboards, tracing, and alerting) in your codebase.
How to Use This Skill
- Review the patterns and examples below
- Apply the relevant patterns to your implementation
- Follow the best practices outlined in this skill
Prometheus metrics, Grafana dashboards, distributed tracing, and intelligent alerting for production-grade observability.
Core Capabilities
- Metrics Collection - Prometheus, custom metrics, exporters
- Visualization - Grafana dashboards, panels, variables
- Alerting - Alert rules, routing, escalation
- Distributed Tracing - OpenTelemetry, Jaeger, trace correlation
- Log Aggregation - Loki, ELK, structured logging
- SLO/SLI - Service level objectives and indicators
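
The custom-metrics capability above is usually provided by a client library such as prom-client; as a stdlib-only sketch of the text exposition format Prometheus scrapes, the following shows the idea (function names here are illustrative, not a library API, and a production endpoint would also emit `# HELP`/`# TYPE` lines):

```typescript
// Minimal sketch of the Prometheus text exposition format, using only
// Node's standard library. Real services should use a client library
// such as prom-client; inc/renderMetrics/startMetricsServer are
// illustrative names.
import http from 'node:http';

const counters = new Map<string, number>();

// Increment a counter for a fixed label set,
// e.g. http_requests_total{status="200"}.
export function inc(name: string, labels: Record<string, string>): void {
  const labelStr = Object.entries(labels)
    .map(([k, v]) => `${k}="${v}"`)
    .join(',');
  const key = `${name}{${labelStr}}`;
  counters.set(key, (counters.get(key) ?? 0) + 1);
}

// Render every counter as "series value", one per line, which is the
// shape Prometheus expects when it scrapes /metrics.
export function renderMetrics(): string {
  return [...counters.entries()]
    .map(([series, value]) => `${series} ${value}`)
    .join('\n') + '\n';
}

// Expose /metrics so a pod-annotation-based scrape job can collect it.
export function startMetricsServer(port = 9464): http.Server {
  return http
    .createServer((req, res) => {
      if (req.url === '/metrics') {
        res.writeHead(200, { 'Content-Type': 'text/plain; version=0.0.4' });
        res.end(renderMetrics());
      } else {
        res.writeHead(404).end();
      }
    })
    .listen(port);
}
```
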
Prometheus Configuration
```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: production
    region: us-central1

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```
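
The `__address__` relabel rule above is dense; this sketch reproduces what Prometheus computes for it: join the source labels with `;`, match the anchored regex, and substitute `$1:$2` (the `rewriteAddress` helper is a hypothetical name, not a Prometheus API):

```typescript
// Sketch of the kubernetes-pods __address__ relabel rule above.
// rewriteAddress is an illustrative helper name.
export function rewriteAddress(address: string, portAnnotation: string): string {
  // Prometheus joins the source label values with ';' before matching.
  const joined = `${address};${portAnnotation}`;
  // Same pattern as the config, anchored as Prometheus anchors it:
  // ([^:]+)(?::\d+)?;(\d+)
  const re = /^([^:]+)(?::\d+)?;(\d+)$/;
  const m = joined.match(re);
  // When the regex does not match, the target label is left unchanged.
  return m ? `${m[1]}:${m[2]}` : address;
}
```

So a pod scraped at `10.0.0.5:8080` with annotation port `9102` ends up with target address `10.0.0.5:9102`.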
Alert Rules
```yaml
# prometheus/rules/api-alerts.yml
groups:
  - name: api-availability
    interval: 30s
    rules:
      - alert: APIHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High error rate on API"
          description: "API error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
          runbook_url: "https://wiki.example.com/runbooks/api-errors"

      - alert: APIHighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)
          ) > 1.0
        for: 10m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High latency on API"
          description: "P99 latency is {{ $value | humanizeDuration }}"

      - alert: APIInstanceDown
        expr: up{job="api"} == 0
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "API instance down"
          description: "API instance {{ $labels.instance }} is down"

  - name: slo-compliance
    interval: 1m
    rules:
      # Error budget burn rate
      - alert: ErrorBudgetBurnRate
        expr: |
          sum(rate(http_requests_total{job="api",status=~"5.."}[1h]))
            / sum(rate(http_requests_total{job="api"}[1h]))
          > (14.4 * (1 - 0.999))
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Error budget burning fast"
          description: "At current rate, monthly error budget will be exhausted"

      # SLO recording rules (availability)
      - record: slo:availability:ratio_rate5m
        expr: |
          1 - (
            sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m]))
          )
      - record: slo:availability:ratio_rate30d
        expr: avg_over_time(slo:availability:ratio_rate5m[30d])
```
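
The 14.4 multiplier in `ErrorBudgetBurnRate` is the fast-burn factor from multi-window burn-rate alerting: at that rate, a 30-day error budget is gone in roughly two days. A small sketch of the arithmetic (helper names are illustrative):

```typescript
// Burn rate = observed error ratio / error budget (1 - SLO objective).
// The alert threshold is therefore burnRate * (1 - objective):
// 14.4 * (1 - 0.999) = 0.0144, i.e. a 1.44% error ratio over 1h.
export function burnRateThreshold(objective: number, burnRate: number): number {
  return burnRate * (1 - objective);
}

// Hours until a windowDays-long error budget is fully spent
// at a sustained burn rate: 30 days * 24h / 14.4 = 50 hours.
export function hoursToExhaustion(burnRate: number, windowDays = 30): number {
  return (windowDays * 24) / burnRate;
}
```
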
Alertmanager Configuration
```yaml
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "${SLACK_WEBHOOK_URL}"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: critical
      receiver: 'slack-critical'
    - match:
        severity: warning
      receiver: 'slack-warnings'
    - match:
        team: platform
      receiver: 'slack-platform'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: "${PAGERDUTY_SERVICE_KEY}"
        severity: critical
        description: "{{ .GroupLabels.alertname }}"
        details:
          firing: "{{ .Alerts.Firing | len }}"
          resolved: "{{ .Alerts.Resolved | len }}"
  - name: 'slack-critical'
    slack_configs:
      - channel: '#alerts-critical'
        color: 'danger'
        title: '🚨 {{ .GroupLabels.alertname }}'
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ (index .Alerts 0).Annotations.runbook_url }}'
          - type: button
            text: 'Silence'
            url: '{{ template "__alertmanagerURL" . }}/#/silences/new?filter=%7B{{ range .GroupLabels.SortedPairs }}{{ .Name }}%3D%22{{ .Value }}%22%2C{{ end }}%7D'
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        color: 'warning'
  - name: 'slack-platform'
    slack_configs:
      - channel: '#platform-alerts'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
```
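
The routing tree above follows first-match semantics, with `continue: true` letting a critical alert fan out to both PagerDuty and Slack. A deliberately simplified model of that behavior (equality matches only; real Alertmanager also supports `match_re`, nested routes, and grouping):

```typescript
// Simplified, illustrative model of Alertmanager routing: child routes
// are tried in order, the first match wins unless it sets continue: true,
// and unmatched alerts fall back to the default receiver.
interface Route {
  receiver: string;
  match?: Record<string, string>;
  continue?: boolean;
}

export function matchReceivers(
  labels: Record<string, string>,
  defaultReceiver: string,
  routes: Route[],
): string[] {
  const receivers: string[] = [];
  for (const route of routes) {
    const matches = Object.entries(route.match ?? {}).every(
      ([k, v]) => labels[k] === v,
    );
    if (matches) {
      receivers.push(route.receiver);
      if (!route.continue) break; // first match wins unless continue: true
    }
  }
  return receivers.length > 0 ? receivers : [defaultReceiver];
}
```

With the routes from the config, a `severity: critical` alert reaches both `pagerduty-critical` and `slack-critical`, while an alert matching no route falls through to `default`.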
Grafana Dashboard
```json
{
  "title": "API Service Dashboard",
  "uid": "api-service",
  "tags": ["api", "production"],
  "timezone": "browser",
  "schemaVersion": 38,
  "version": 1,
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(http_requests_total, namespace)",
        "current": { "text": "production", "value": "production" }
      }
    ]
  },
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{namespace=\"$namespace\"}[5m])) by (status)",
          "legendFormat": "{{ status }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "reqps",
          "custom": {
            "drawStyle": "line",
            "lineInterpolation": "smooth"
          }
        }
      }
    },
    {
      "title": "Latency Percentiles",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
      "targets": [
        {
          "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[5m])) by (le))",
          "legendFormat": "p50"
        },
        {
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[5m])) by (le))",
          "legendFormat": "p95"
        },
        {
          "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[5m])) by (le))",
          "legendFormat": "p99"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "s"
        }
      }
    },
    {
      "title": "Error Rate SLI",
      "type": "stat",
      "gridPos": { "h": 4, "w": 6, "x": 0, "y": 8 },
      "targets": [
        {
          "expr": "1 - sum(rate(http_requests_total{namespace=\"$namespace\",status=~\"5..\"}[24h])) / sum(rate(http_requests_total{namespace=\"$namespace\"}[24h]))",
          "instant": true
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percentunit",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "red", "value": null },
              { "color": "yellow", "value": 0.99 },
              { "color": "green", "value": 0.999 }
            ]
          }
        }
      }
    }
  ]
}
```
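
The latency panels rely on `histogram_quantile`, which estimates a quantile by linear interpolation inside cumulative buckets. A simplified sketch of that estimation (assumes buckets sorted by `le` with finite bounds and strictly increasing counts; the real PromQL function also handles `+Inf` and malformed input):

```typescript
// One cumulative histogram bucket: count of observations <= le.
interface Bucket { le: number; count: number }

// Simplified, illustrative sketch of PromQL's histogram_quantile:
// find the bucket the q-th rank falls into, then interpolate linearly
// between the bucket's lower and upper bound.
export function histogramQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  let prevLe = 0; // PromQL treats the first bucket's lower bound as 0
  let prevCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      return prevLe + ((le - prevLe) * (rank - prevCount)) / (count - prevCount);
    }
    prevLe = le;
    prevCount = count;
  }
  return buckets[buckets.length - 1].le;
}
```

This is why the estimate can only take values inside bucket boundaries: bucket layout directly bounds the accuracy of p95/p99 panels.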
OpenTelemetry Configuration
```yaml
# otel-collector/otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 10s
          static_configs:
            - targets: ['0.0.0.0:8888']

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200
  attributes:
    actions:
      - key: environment
        value: production
        action: insert

exporters:
  googlecloud:
    project: ${GCP_PROJECT_ID}
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: otel
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter first, batch last, per collector recommendations
      processors: [memory_limiter, attributes, batch]
      exporters: [jaeger, googlecloud]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheus, googlecloud]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki, googlecloud]
```
Application Instrumentation
```typescript
// src/observability/tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const resource = new Resource({
  [SemanticResourceAttributes.SERVICE_NAME]: 'api',
  [SemanticResourceAttributes.SERVICE_VERSION]: process.env.VERSION || '1.0.0',
  [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
});

const sdk = new NodeSDK({
  resource,
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
    }),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush pending telemetry before the process exits.
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.error('Error terminating tracing', error))
    .finally(() => process.exit(0));
});
```
SLO/SLI Definitions
```yaml
# slo/api-slos.yml
slos:
  - name: api-availability
    description: "API availability SLO"
    service: api
    sli:
      type: availability
      metric: |
        1 - (
          sum(rate(http_requests_total{job="api",status=~"5.."}[{{.window}}]))
          / sum(rate(http_requests_total{job="api"}[{{.window}}]))
        )
    objective: 0.999  # 99.9%
    windows:
      - 30d
    error_budget:
      monthly_budget_minutes: 43.2  # 30 days * 24h * 60min * 0.001

  - name: api-latency
    description: "API latency SLO"
    service: api
    sli:
      type: latency
      metric: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket{job="api"}[{{.window}}])) by (le)
        )
    objective: 0.5  # 500ms p99
    comparison: "<="
    windows:
      - 30d
```
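
The `monthly_budget_minutes: 43.2` figure follows directly from the 99.9% objective; a small helper keeps the arithmetic explicit (the function name is illustrative):

```typescript
// Error budget in minutes for a window: window length * (1 - objective).
// For 30 days at 99.9%: 30 * 24 * 60 * 0.001 = 43.2 minutes.
export function errorBudgetMinutes(objective: number, windowDays = 30): number {
  return windowDays * 24 * 60 * (1 - objective);
}
```
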
Usage Examples
Configure Prometheus Monitoring
Apply monitoring-observability skill to set up Prometheus monitoring for Kubernetes with service discovery and custom metrics
Create Grafana Dashboard
Apply monitoring-observability skill to create a Grafana dashboard for API service with request rate, latency percentiles, and error rate SLI
Implement OpenTelemetry
Apply monitoring-observability skill to add OpenTelemetry instrumentation with traces, metrics, and logs exported to GCP
Success Output
When successful, this skill MUST output:
✅ SKILL COMPLETE: monitoring-observability
Completed:
- [x] Prometheus configuration deployed
- [x] Alert rules defined and loaded
- [x] Alertmanager routing configured
- [x] Grafana dashboards created
- [x] OpenTelemetry instrumentation added
- [x] SLO/SLI metrics defined
- [x] Monitoring stack verified
Outputs:
- prometheus/prometheus.yml
- prometheus/rules/*.yml
- alertmanager/alertmanager.yml
- grafana/dashboards/*.json
- otel-collector/otel-collector-config.yaml
- slo/api-slos.yml
- Application instrumentation: src/observability/tracing.ts
Verification:
- Prometheus targets up: curl http://prometheus:9090/api/v1/targets
- Grafana dashboards accessible: http://grafana:3000
- Traces visible in Jaeger: http://jaeger:16686
- Alerts firing correctly: http://alertmanager:9093
Completion Checklist
Before marking this skill as complete, verify:
- Prometheus deployed and scraping targets successfully
- All Kubernetes service discovery working (pods, endpoints)
- Alert rules validated and loaded without errors
- Alertmanager routing to correct receivers (Slack, PagerDuty)
- Grafana dashboards show live data from Prometheus
- OpenTelemetry collector receiving traces, metrics, logs
- Application instrumentation exporting telemetry
- Jaeger showing distributed traces across services
- SLO/SLI metrics recording correctly
- Error budget calculations accurate
- Test alerts trigger and route correctly
- Monitoring stack survives pod restarts
- Data retention policies configured
- Authentication/authorization enabled for UIs
Failure Indicators
This skill has FAILED if:
- ❌ Prometheus cannot reach scrape targets (check network policies)
- ❌ Alert rules fail validation (syntax errors)
- ❌ Alertmanager not sending notifications (check routing/receivers)
- ❌ Grafana dashboards empty (no data source connection)
- ❌ OpenTelemetry collector crashing (check resource limits)
- ❌ No traces appearing in Jaeger (instrumentation not working)
- ❌ SLO metrics always show 100% (no error data captured)
- ❌ Alerts not firing during test failure scenarios
- ❌ High cardinality metrics causing memory issues
- ❌ Data loss after Prometheus restart (no persistent volume)
When NOT to Use
Do NOT use monitoring-observability when:
- Development environment only - Use simpler logging (console.log, print statements)
- Proof-of-concept projects - Monitoring overhead not justified
- Serverless applications - Use platform-native monitoring (CloudWatch, Azure Monitor)
- Third-party SaaS only - Use vendor monitoring (Datadog, New Relic) instead
- No production deployment - Premature optimization for non-production apps
- Static websites - Simple uptime monitoring sufficient
- Batch jobs only - Use job-specific logging instead
- Use platform monitoring when: cloud-native services (AWS, GCP, Azure) are covered by native tools
- Use vendor SaaS when: the team lacks observability expertise or infrastructure capacity
- Use this skill when: self-hosted production applications need a full observability stack
Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| No resource limits | Prometheus/Grafana consume all memory | Define memory/CPU limits and requests |
| High cardinality labels | Memory explosion, slow queries | Limit label values, avoid user IDs in labels |
| No retention policy | Disk fills up, cluster crashes | Configure retention time and size limits |
| Missing service discovery | Manual target configuration breaks | Use Kubernetes SD for auto-discovery |
| No persistent volumes | Data lost on pod restart | Use PVCs for Prometheus, Grafana, Alertmanager |
| Hardcoded credentials | Security vulnerability | Use Secrets for API tokens and passwords |
| Alert fatigue | Too many low-value alerts | Use SLO-based alerting, proper thresholds |
| No recording rules | Slow dashboard queries | Precompute complex queries as recording rules |
| Missing timestamps | Broken distributed tracing | Use NTP, enable timestamp in OTLP exporters |
| No backup strategy | Lose historical data on failure | Regular snapshots of Prometheus TSDB |
Principles
This skill embodies the following CODITECT principles:
#2 First Principles Thinking:
- Understand WHY we monitor: detect failures, measure performance, understand user impact
- Apply observability pillars: metrics, logs, traces
#5 Eliminate Ambiguity:
- Clear SLO definitions (99.9% availability = 43.2 min/month downtime)
- Explicit alert severity levels (critical, warning, info)
#6 Clear, Understandable, Explainable:
- Grafana dashboards with descriptive titles and units
- Alert annotations include runbook URLs and descriptions
- Metrics use semantic naming (http_requests_total, not metric1)
#8 No Assumptions:
- Verify Prometheus targets healthy before trusting data
- Test alerting with intentional failures
- Validate retention policies match storage capacity
#10 Automation First:
- Kubernetes service discovery (no manual target config)
- Automated dashboard provisioning via ConfigMaps
- GitOps for alert rule management
Full Standard: CODITECT-STANDARD-AUTOMATION.md
Integration Points
- container-orchestration - Pod metrics, Kubernetes monitoring
- cicd-pipeline-design - Deployment monitoring, canary metrics
- infrastructure-as-code - Monitoring infrastructure provisioning
- multi-tenant-security - Tenant-aware metrics and alerts