ADR-019: Monitoring and Observability
Status: Accepted
Date: 2025-11-30
Deciders: DevOps Team, Engineering Team
Tags: monitoring, observability, prometheus, grafana, alerting
Context
Production Monitoring Requirements
The CODITECT license server is critical infrastructure: when it is down, customers are locked out. We need comprehensive monitoring to satisfy the requirements below.
Operational Requirements:
- Uptime Monitoring: Detect outages within 60 seconds
- Performance Monitoring: Track API latency (p50, p95, p99)
- Error Tracking: Alert on error rate spikes
- Capacity Planning: Predict when to scale infrastructure
Business Requirements:
- SLA Compliance: 99.9% uptime guarantee
- Customer Impact: Know when customers are affected by issues
- Cost Optimization: Identify waste (unused resources)
Compliance Requirements:
- Audit Logs: Complete audit trail for SOC 2
- Security Monitoring: Detect suspicious activity
- Data Retention: 90 days for compliance
Observability Pillars
1. Metrics (Prometheus + Grafana)
- License API request rate
- Seat acquisition latency
- Active sessions per license
- Error rates and types
2. Logs (Structured JSON + Cloud Logging)
- License validation requests
- Seat acquisition/release events
- Error stack traces
- Audit trail
3. Traces (OpenTelemetry + Jaeger)
- Distributed tracing across microservices
- Request flow visualization
- Performance bottleneck identification
Decision
We will implement comprehensive observability with:
- Prometheus for metrics collection and alerting
- Grafana for metrics visualization and dashboards
- Google Cloud Logging for centralized log aggregation
- OpenTelemetry for distributed tracing
- PagerDuty for on-call alerts
Monitoring Architecture
┌────────────────────────────────────────────────────────────────┐
│                      Observability Stack                       │
└────────────────────────────────────────────────────────────────┘

License API (Django REST Framework)
 │
 ├─► Metrics ─────────────────┐
 │    • Request rate          │
 │    • Latency (p50/p95/p99) │
 │    • Error rate            │
 │    • Active sessions       │
 │                            │
 ├─► Logs ────────────────────┤
 │    • Structured JSON       │
 │    • Request/response      │
 │    • Error stack traces    │
 │    • Audit events          │
 │                            │
 └─► Traces ──────────────────┤
      • OpenTelemetry spans   │
      • Distributed traces    │
      • Service dependencies  │
                              │
                              ▼
                     ┌─────────────────┐
                     │   Prometheus    │
                     │   (Metrics)     │
                     │                 │
                     │ - Scrape        │
                     │   /metrics      │
                     │ - Alert rules   │
                     │ - Recording     │
                     │   rules         │
                     └────────┬────────┘
                              │
                              ▼
                     ┌─────────────────┐
                     │    Grafana      │
                     │ (Visualization) │
                     │                 │
                     │ - Dashboards    │
                     │ - Drill-downs   │
                     │ - Annotations   │
                     └────────┬────────┘
                              │
              ┌───────────────┴───────────────┐
              │                               │
              ▼                               ▼
     ┌────────────────┐             ┌────────────────┐
     │  Alertmanager  │             │   PagerDuty    │
     │                │             │                │
     │ - Route        │────────────►│ - On-call      │
     │   alerts       │             │ - Escalation   │
     │ - Dedup        │             │ - Incident     │
     │ - Silence      │             │   mgmt         │
     └────────────────┘             └────────────────┘
Implementation
1. Prometheus Metrics
File: backend/monitoring/metrics.py
from prometheus_client import Counter, Histogram, Gauge, Info

# License API Metrics
license_validation_requests = Counter(
    'license_validation_requests_total',
    'Total number of license validation requests',
    ['license_key', 'status', 'tier']
)

license_validation_latency = Histogram(
    'license_validation_latency_seconds',
    'License validation request latency',
    ['endpoint'],
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

license_validation_errors = Counter(
    'license_validation_errors_total',
    'Total number of license validation errors',
    ['error_type', 'tier']
)

# Seat Management Metrics
seat_acquisitions = Counter(
    'seat_acquisitions_total',
    'Total number of seat acquisitions',
    ['license_key', 'tenant_id', 'success']
)

seat_acquisition_latency = Histogram(
    'seat_acquisition_latency_seconds',
    'Seat acquisition latency',
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1]
)

active_sessions = Gauge(
    'active_sessions',
    'Number of active sessions',
    ['license_key', 'tenant_id']
)

seat_denials = Counter(
    'seat_denials_total',
    'Total number of seat denials (no seats available)',
    ['license_key', 'tenant_id']
)

# Redis Metrics
redis_operations = Counter(
    'redis_operations_total',
    'Total Redis operations',
    ['operation', 'status']
)

redis_latency = Histogram(
    'redis_latency_seconds',
    'Redis operation latency',
    ['operation'],
    buckets=[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05]
)

# Database Metrics
database_queries = Counter(
    'database_queries_total',
    'Total database queries',
    ['table', 'operation']
)

database_query_latency = Histogram(
    'database_query_latency_seconds',
    'Database query latency',
    ['table', 'operation'],
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5]
)

# System Metrics
app_info = Info('app', 'Application info')
app_info.info({
    'version': '1.0.0',
    'environment': 'production',
    'region': 'us-central1'
})
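The bucket lists above set the resolution of every latency quantile the dashboards can show: Prometheus histograms are cumulative, so each observation increments every bucket whose upper bound is at or above the observed value, plus an implicit +Inf bucket. A minimal stdlib sketch of that recording scheme (the `CumulativeHistogram` class is a hypothetical illustration, not the prometheus_client implementation):

```python
import math

class CumulativeHistogram:
    """Toy cumulative histogram mirroring Prometheus bucket semantics."""

    def __init__(self, buckets):
        # Upper bounds, sorted ascending, with an implicit +Inf bucket.
        self.bounds = sorted(buckets) + [math.inf]
        self.counts = [0] * len(self.bounds)
        self.total = 0.0

    def observe(self, value):
        # Cumulative: every bucket counts observations <= its bound.
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.counts[i] += 1
        self.total += value

h = CumulativeHistogram([0.001, 0.005, 0.01, 0.025, 0.05, 0.1])
for latency in [0.002, 0.004, 0.030, 0.200]:
    h.observe(latency)

# 0.002 and 0.004 first fit the 0.005 bucket; 0.030 first fits 0.05;
# 0.200 lands only in +Inf. Counts ramp upward cumulatively.
print(h.counts)  # [0, 2, 2, 2, 3, 3, 4]
```

Observations beyond the largest bound (like the 0.200 s request here) are only visible in +Inf, which is why the seat-acquisition buckets stop at 0.1 s: anything slower is simply "too slow" for that path.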
2. Metrics Endpoint
File: backend/api/metrics.py
from django.http import HttpResponse
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST
from rest_framework.decorators import api_view, permission_classes
from rest_framework.permissions import AllowAny
@api_view(['GET'])
@permission_classes([AllowAny])  # Prometheus scraper doesn't authenticate; restrict /metrics at the network layer
def metrics(request):
    """
    Prometheus metrics endpoint.

    Scraped by Prometheus every 15 seconds.

    Returns:
        Prometheus text format metrics
    """
    return HttpResponse(
        generate_latest(),
        content_type=CONTENT_TYPE_LATEST
    )
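The 15-second scrape needs a matching job on the Prometheus side. A sketch of the corresponding `prometheus.yml` fragment — the job name `license-api` matches the `up{job="license-api"}` alert below, but the metrics path and target host:port are assumptions, not taken from the actual deployment:

```yaml
scrape_configs:
  - job_name: 'license-api'      # must match up{job="license-api"} in alerts.yml
    scrape_interval: 15s
    metrics_path: /metrics       # assumed URL route for the metrics view above
    static_configs:
      - targets: ['license-api:8000']   # hypothetical service host:port
```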
3. Structured Logging
File: backend/monitoring/logging_config.py
import logging
import json
import sys
from datetime import datetime
class StructuredLogger(logging.Formatter):
    """
    Structured JSON formatter for Cloud Logging.

    Emits log records as JSON for easy parsing and querying.
    """

    def format(self, record):
        log_entry = {
            'timestamp': datetime.utcnow().isoformat() + 'Z',
            'severity': record.levelname,
            'message': record.getMessage(),
            'logger': record.name,
            'function': record.funcName,
            'line': record.lineno,
        }

        # Add extra fields
        if hasattr(record, 'license_key'):
            log_entry['license_key'] = record.license_key
        if hasattr(record, 'tenant_id'):
            log_entry['tenant_id'] = record.tenant_id
        if hasattr(record, 'user_email'):
            log_entry['user_email'] = record.user_email

        # Add exception info
        if record.exc_info:
            log_entry['exception'] = self.formatException(record.exc_info)

        return json.dumps(log_entry)

def setup_logging():
    """Configure structured logging on the root logger."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(StructuredLogger())

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
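Context fields reach the formatter through the standard `extra` mechanism: keys passed to a log call become attributes on the `LogRecord`, which the formatter lifts into the JSON payload. A self-contained sketch of that flow (it re-declares a condensed copy of the formatter so it runs standalone; the license key and tenant are made-up example values):

```python
import io
import json
import logging
from datetime import datetime

class StructuredLogger(logging.Formatter):
    """Condensed copy of the JSON formatter defined above."""

    def format(self, record):
        entry = {
            'timestamp': datetime.utcnow().isoformat() + 'Z',
            'severity': record.levelname,
            'message': record.getMessage(),
        }
        # `extra` keys become LogRecord attributes; copy the known ones.
        for field in ('license_key', 'tenant_id', 'user_email'):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

stream = io.StringIO()  # stand-in for sys.stdout so we can inspect output
handler = logging.StreamHandler(stream)
handler.setFormatter(StructuredLogger())
logger = logging.getLogger('seat-demo')
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info('Seat acquired', extra={'license_key': 'LIC-DEMO-001',
                                    'tenant_id': 'tenant-42'})

entry = json.loads(stream.getvalue())
print(entry['severity'], entry['license_key'])  # INFO LIC-DEMO-001
```

In Cloud Logging these JSON keys become queryable fields, so a search like `jsonPayload.license_key="LIC-DEMO-001"` pulls every event for one license.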
4. Prometheus Alert Rules
File: monitoring/prometheus/alerts.yml
groups:
  - name: license_api
    interval: 30s
    rules:
      # High Error Rate
      - alert: HighErrorRate
        expr: |
          (
            rate(license_validation_errors_total[5m])
            /
            rate(license_validation_requests_total[5m])
          ) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High license validation error rate"
          description: "{{ $value | humanizePercentage }} of license validations failing"

      # High Latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(license_validation_latency_seconds_bucket[5m])
          ) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High license validation latency"
          description: "p95 latency is {{ $value }}s (target: <1s)"

      # Seat Exhaustion
      - alert: SeatExhaustion
        expr: |
          (active_sessions / 50) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "License {{ $labels.license_key }} near seat limit"
          description: "{{ $value | humanizePercentage }} of seats in use"

      # Frequent Seat Denials
      - alert: FrequentSeatDenials
        expr: rate(seat_denials_total[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Frequent seat denials for {{ $labels.license_key }}"
          description: "{{ $value }} denials/second"

      # Redis Connection Failures
      - alert: RedisConnectionFailures
        expr: |
          rate(redis_operations_total{status="error"}[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Redis connection failures detected"
          description: "{{ $value }} Redis errors/second"

      # Database Slow Queries
      - alert: DatabaseSlowQueries
        expr: |
          histogram_quantile(0.95,
            rate(database_query_latency_seconds_bucket[5m])
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Database queries running slow"
          description: "p95 query latency is {{ $value }}s"

  - name: infrastructure
    interval: 60s
    rules:
      # API Server Down
      - alert: APIServerDown
        expr: up{job="license-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "License API server is down"
          description: "{{ $labels.instance }} has been down for 1 minute"

      # High CPU Usage
      - alert: HighCPUUsage
        expr: |
          avg(rate(container_cpu_usage_seconds_total{pod=~"license-api.*"}[5m]))
          * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on license API"
          description: "CPU usage is {{ $value }}%"

      # High Memory Usage
      - alert: HighMemoryUsage
        expr: |
          (
            container_memory_working_set_bytes{pod=~"license-api.*"}
            /
            container_spec_memory_limit_bytes{pod=~"license-api.*"}
          ) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on license API"
          description: "Memory usage is {{ $value | humanizePercentage }}"
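The `rate()` calls in these expressions compute the per-second increase of a counter over the lookback window, which is what makes monotonically growing counters comparable across time. A stdlib sketch of that calculation on two samples (simplified: real PromQL `rate()` also extrapolates to the window boundaries and uses all samples in the range; sample values here are made up):

```python
def simple_rate(samples):
    """Per-second increase between first and last (timestamp, value) samples.

    Simplified view of PromQL rate(): if the counter went backwards
    (a process restart reset it to zero), count only the new value.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    delta = v1 - v0
    if delta < 0:          # counter reset detected
        delta = v1
    return delta / (t1 - t0)

# 300 s (5m) window: error counter grew 120 -> 255, requests 9000 -> 12000.
errors_per_sec = simple_rate([(0, 120), (300, 255)])
requests_per_sec = simple_rate([(0, 9000), (300, 12000)])

# The HighErrorRate expression divides the two rates.
error_ratio = errors_per_sec / requests_per_sec
print(round(error_ratio, 4))  # 0.045 -> just below the 0.05 threshold
```

Dividing two rates rather than two raw counters is what keeps the `HighErrorRate` expression meaningful: raw totals would mix in all history since the last restart.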
5. Grafana Dashboards
File: monitoring/grafana/license-api-dashboard.json
{
  "dashboard": {
    "title": "CODITECT License API",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(license_validation_requests_total[5m])",
            "legendFormat": "{{tier}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Latency (p50, p95, p99)",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, rate(license_validation_latency_seconds_bucket[5m]))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, rate(license_validation_latency_seconds_bucket[5m]))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, rate(license_validation_latency_seconds_bucket[5m]))",
            "legendFormat": "p99"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(license_validation_errors_total[5m]) / rate(license_validation_requests_total[5m])",
            "legendFormat": "{{error_type}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Active Sessions",
        "targets": [
          {
            "expr": "active_sessions",
            "legendFormat": "{{license_key}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Seat Utilization",
        "targets": [
          {
            "expr": "(active_sessions / 50) * 100",
            "legendFormat": "{{license_key}}"
          }
        ],
        "type": "graph",
        "unit": "percent"
      }
    ]
  }
}
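The latency panels rely on `histogram_quantile()`, which estimates a quantile from the cumulative buckets by locating the bucket that contains the target rank and interpolating linearly inside it. A stdlib sketch under that assumption (the function and bucket data are illustrative, not the PromQL implementation):

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Mirrors the PromQL function's linear interpolation within the
    bucket containing the target rank; simplified (no NaN handling).
    """
    total = buckets[-1][1]            # the +Inf bucket holds all observations
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):     # rank fell in +Inf: report last finite bound
                return prev_bound
            # Linear interpolation between the bucket's boundaries.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Cumulative counts: 50 obs <= 0.1s, 90 obs <= 0.5s, 100 obs total.
buckets = [(0.1, 50), (0.5, 90), (math.inf, 100)]
print(histogram_quantile(0.5, buckets))   # 0.1 (rank 50 sits exactly at the 0.1 bound)
print(histogram_quantile(0.95, buckets))  # 0.5 (rank 95 falls in the +Inf bucket)
```

This is why a p95 estimate is only as precise as the buckets around it: with no bound between 0.1 s and 0.5 s, every quantile in that range is a straight-line guess between those two edges.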
6. PagerDuty Integration
File: monitoring/alertmanager/config.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'pagerduty'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<PAGERDUTY_SERVICE_KEY>'
        description: '{{ .GroupLabels.alertname }}'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'

  - name: 'slack'
    slack_configs:
      - api_url: '<SLACK_WEBHOOK_URL>'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
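The routing tree above sends critical alerts to PagerDuty (with `continue: true` letting later sibling routes also be evaluated), warnings to Slack, and anything unmatched to the root receiver. A stdlib sketch of that first-match-wins logic (the `route_alert` helper is hypothetical; real Alertmanager also handles grouping, timing, nested routes, and inhibition):

```python
def route_alert(labels, routes, default_receiver):
    """Return the receivers for an alert, sketching Alertmanager's
    top-level routing: first matching route wins, `continue` keeps
    checking later siblings, unmatched alerts use the root receiver.
    """
    receivers = []
    for route in routes:
        if all(labels.get(k) == v for k, v in route['match'].items()):
            receivers.append(route['receiver'])
            if not route.get('continue', False):
                break
    return receivers or [default_receiver]

# Mirrors the route block in config.yml above.
routes = [
    {'match': {'severity': 'critical'}, 'receiver': 'pagerduty', 'continue': True},
    {'match': {'severity': 'warning'}, 'receiver': 'slack'},
]

print(route_alert({'severity': 'critical'}, routes, 'pagerduty'))  # ['pagerduty']
print(route_alert({'severity': 'warning'}, routes, 'pagerduty'))   # ['slack']
print(route_alert({'severity': 'info'}, routes, 'pagerduty'))      # ['pagerduty']
```

The `inhibit_rules` block then suppresses a warning when a critical alert with the same `alertname`, `cluster`, and `service` is already firing, so on-call engineers see one page per incident, not two.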
Consequences
Positive
✅ Proactive Issue Detection
- Alerts fire within 60 seconds of incidents
- Prevent customer impact via early detection
- Automatic escalation to on-call engineer
✅ Performance Visibility
- Real-time latency tracking (p50/p95/p99)
- Identify slow queries and bottlenecks
- Capacity planning based on metrics
✅ Debugging Efficiency
- Structured logs enable quick searches
- Distributed traces show request flow
- Error stack traces pinpoint issues
✅ SLA Compliance
- 99.9% uptime monitoring
- Historical data for SLA reports
- Incident postmortem data
✅ Cost Optimization
- Identify unused resources
- Right-size infrastructure based on metrics
- Reduce waste (40% cost savings possible)
Negative
⚠️ Infrastructure Cost
- Prometheus: $50/month (GKE)
- Grafana: $50/month (managed)
- Cloud Logging: $100/month
- PagerDuty: $25/user/month
- Total: $225+/month
⚠️ Alert Fatigue
- Too many alerts → ignored
- Mitigation: Tune thresholds, aggregate alerts
- Review alert effectiveness monthly
⚠️ Operational Overhead
- Dashboard maintenance
- Alert rule tuning
- On-call rotation management
Related ADRs
- ADR-009: GCP Infrastructure Architecture (monitoring infrastructure)
- ADR-011: Zombie Session Cleanup Strategy (metrics for cleanup)
References
Last Updated: 2025-11-30
Owner: DevOps Team
Review Cycle: Monthly (alert tuning)