ADR-019: Monitoring and Observability

Status: Accepted
Date: 2025-11-30
Deciders: DevOps Team, Engineering Team
Tags: monitoring, observability, prometheus, grafana, alerting


Context

Production Monitoring Requirements

The CODITECT license server is critical infrastructure: when it is down, customers are locked out of the product. We need comprehensive monitoring to:

Operational Requirements:

  • Uptime Monitoring: Detect outages within 60 seconds
  • Performance Monitoring: Track API latency (p50, p95, p99)
  • Error Tracking: Alert on error rate spikes
  • Capacity Planning: Predict when to scale infrastructure

Business Requirements:

  • SLA Compliance: 99.9% uptime guarantee
  • Customer Impact: Know when customers are affected by issues
  • Cost Optimization: Identify waste (unused resources)

Compliance Requirements:

  • Audit Logs: Complete audit trail for SOC 2
  • Security Monitoring: Detect suspicious activity
  • Data Retention: 90 days for compliance

Observability Pillars

1. Metrics (Prometheus + Grafana)

  • License API request rate
  • Seat acquisition latency
  • Active sessions per license
  • Error rates and types

2. Logs (Structured JSON + Cloud Logging)

  • License validation requests
  • Seat acquisition/release events
  • Error stack traces
  • Audit trail

3. Traces (OpenTelemetry + Jaeger)

  • Distributed tracing across microservices
  • Request flow visualization
  • Performance bottleneck identification

Decision

We will implement comprehensive observability with:

  1. Prometheus for metrics collection and alerting
  2. Grafana for metrics visualization and dashboards
  3. Google Cloud Logging for centralized log aggregation
  4. OpenTelemetry for distributed tracing
  5. PagerDuty for on-call alerts

Monitoring Architecture

┌────────────────────────────────────────────────────────────────┐
│                      Observability Stack                       │
└────────────────────────────────────────────────────────────────┘

License API (Django REST Framework)
│
├─► Metrics
│     • Request rate
│     • Latency (p50/p95/p99)
│     • Error rate
│     • Active sessions
│
├─► Logs
│     • Structured JSON
│     • Request/response
│     • Error stack traces
│     • Audit events
│
└─► Traces
      • OpenTelemetry spans
      • Distributed traces
      • Service dependencies

                │
                ▼
       ┌─────────────────┐
       │   Prometheus    │
       │    (Metrics)    │
       │                 │
       │ - Scrape        │
       │   /metrics      │
       │ - Alert rules   │
       │ - Recording     │
       │   rules         │
       └────────┬────────┘
                │
                ▼
       ┌─────────────────┐
       │     Grafana     │
       │ (Visualization) │
       │                 │
       │ - Dashboards    │
       │ - Drill-downs   │
       │ - Annotations   │
       └────────┬────────┘
                │
        ┌───────┴────────────────────┐
        │                            │
        ▼                            ▼
┌────────────────┐         ┌────────────────┐
│  Alertmanager  │         │   PagerDuty    │
│                │         │                │
│ - Route        │────────►│ - On-call      │
│   alerts       │         │ - Escalation   │
│ - Dedup        │         │ - Incident     │
│ - Silence      │         │   mgmt         │
└────────────────┘         └────────────────┘

Implementation

1. Prometheus Metrics

File: backend/monitoring/metrics.py

from prometheus_client import Counter, Histogram, Gauge, Info

# License API Metrics

license_validation_requests = Counter(
    'license_validation_requests_total',
    'Total number of license validation requests',
    ['license_key', 'status', 'tier']
)

license_validation_latency = Histogram(
    'license_validation_latency_seconds',
    'License validation request latency',
    ['endpoint'],
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

license_validation_errors = Counter(
    'license_validation_errors_total',
    'Total number of license validation errors',
    ['error_type', 'tier']
)

# Seat Management Metrics

seat_acquisitions = Counter(
    'seat_acquisitions_total',
    'Total number of seat acquisitions',
    ['license_key', 'tenant_id', 'success']
)

seat_acquisition_latency = Histogram(
    'seat_acquisition_latency_seconds',
    'Seat acquisition latency',
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1]
)

active_sessions = Gauge(
    'active_sessions',
    'Number of active sessions',
    ['license_key', 'tenant_id']
)

seat_denials = Counter(
    'seat_denials_total',
    'Total number of seat denials (no seats available)',
    ['license_key', 'tenant_id']
)

# Redis Metrics

redis_operations = Counter(
    'redis_operations_total',
    'Total Redis operations',
    ['operation', 'status']
)

redis_latency = Histogram(
    'redis_latency_seconds',
    'Redis operation latency',
    ['operation'],
    buckets=[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05]
)

# Database Metrics

database_queries = Counter(
    'database_queries_total',
    'Total database queries',
    ['table', 'operation']
)

database_query_latency = Histogram(
    'database_query_latency_seconds',
    'Database query latency',
    ['table', 'operation'],
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5]
)

# System Metrics

app_info = Info('app', 'Application info')
app_info.info({
    'version': '1.0.0',
    'environment': 'production',
    'region': 'us-central1'
})
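
As an illustration of how these metrics might be updated on the seat-acquisition path, here is a minimal sketch. `record_seat_acquisition` and `acquire_fn` are hypothetical names, and the metrics are declared against a local registry so the snippet runs standalone; in the application they would be the module-level objects from metrics.py above.

```python
import time
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

# Local registry keeps this example self-contained; in the app these
# are the module-level metrics from backend/monitoring/metrics.py.
registry = CollectorRegistry()

seat_acquisitions = Counter(
    'seat_acquisitions_total', 'Total number of seat acquisitions',
    ['license_key', 'tenant_id', 'success'], registry=registry)
seat_acquisition_latency = Histogram(
    'seat_acquisition_latency_seconds', 'Seat acquisition latency',
    registry=registry)
active_sessions = Gauge(
    'active_sessions', 'Number of active sessions',
    ['license_key', 'tenant_id'], registry=registry)
seat_denials = Counter(
    'seat_denials_total', 'Total number of seat denials',
    ['license_key', 'tenant_id'], registry=registry)


def record_seat_acquisition(license_key, tenant_id, acquire_fn):
    """Wrap a seat-acquisition call with metric updates.

    acquire_fn is a hypothetical callable returning True when a seat
    was granted and False when the license is at capacity.
    """
    start = time.perf_counter()
    granted = acquire_fn()
    # Observe latency of the acquisition attempt itself
    seat_acquisition_latency.observe(time.perf_counter() - start)

    seat_acquisitions.labels(license_key, tenant_id, str(granted).lower()).inc()
    if granted:
        active_sessions.labels(license_key, tenant_id).inc()
    else:
        seat_denials.labels(license_key, tenant_id).inc()
    return granted
```

A matching `release` path would call `active_sessions.labels(...).dec()` so the gauge tracks live sessions rather than cumulative acquisitions.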

2. Metrics Endpoint

File: backend/api/metrics.py

from django.http import HttpResponse
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST
from rest_framework.decorators import api_view, permission_classes
from rest_framework.permissions import AllowAny


@api_view(['GET'])
@permission_classes([AllowAny])  # Prometheus scraper doesn't authenticate
def metrics(request):
    """
    Prometheus metrics endpoint.

    Scraped by Prometheus every 15 seconds.

    Returns:
        Prometheus text format metrics
    """
    return HttpResponse(
        generate_latest(),
        content_type=CONTENT_TYPE_LATEST
    )
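
The scrape side can be sketched as a minimal Prometheus `scrape_configs` entry matching the 15-second interval above. The job name mirrors the `up{job="license-api"}` series used by the alert rules; the target host:port is a placeholder assumption.

```yaml
scrape_configs:
  - job_name: 'license-api'          # produces the up{job="license-api"} series
    scrape_interval: 15s
    metrics_path: /metrics           # assumed URL route for the Django view
    static_configs:
      - targets: ['license-api:8000']  # hypothetical service host:port
```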

3. Structured Logging

File: backend/monitoring/logging_config.py

import logging
import json
import sys
from datetime import datetime, timezone


class StructuredLogger(logging.Formatter):
    """
    Structured JSON logger for Cloud Logging.

    Logs in JSON format for easy parsing and querying.
    """

    def format(self, record):
        log_entry = {
            # timezone-aware UTC (datetime.utcnow() is deprecated)
            'timestamp': datetime.now(timezone.utc).isoformat().replace('+00:00', 'Z'),
            'severity': record.levelname,
            'message': record.getMessage(),
            'logger': record.name,
            'function': record.funcName,
            'line': record.lineno,
        }

        # Add extra fields
        if hasattr(record, 'license_key'):
            log_entry['license_key'] = record.license_key

        if hasattr(record, 'tenant_id'):
            log_entry['tenant_id'] = record.tenant_id

        if hasattr(record, 'user_email'):
            log_entry['user_email'] = record.user_email

        # Add exception info
        if record.exc_info:
            log_entry['exception'] = self.formatException(record.exc_info)

        return json.dumps(log_entry)


def setup_logging():
    """Configure structured logging."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(StructuredLogger())

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
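
Usage from application code passes the contextual fields via `extra=`, which the formatter's `hasattr` checks pick up as top-level JSON keys. A self-contained demo (with an abridged copy of the formatter, and an in-memory stream standing in for stdout so the output can be inspected):

```python
import io
import json
import logging


class StructuredLogger(logging.Formatter):
    """Abridged copy of the formatter above, for a runnable demo."""

    def format(self, record):
        entry = {
            'severity': record.levelname,
            'message': record.getMessage(),
            'logger': record.name,
        }
        # extra= fields land as attributes on the LogRecord
        for field in ('license_key', 'tenant_id', 'user_email'):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


# In production the handler writes to sys.stdout (Cloud Logging tails it);
# here an in-memory stream lets us read the emitted line back.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(StructuredLogger())
logger = logging.getLogger('license.demo')
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info('seat acquired', extra={'license_key': 'LIC-1', 'tenant_id': 'acme'})
entry = json.loads(stream.getvalue())
```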

4. Prometheus Alert Rules

File: monitoring/prometheus/alerts.yml

groups:
  - name: license_api
    interval: 30s
    rules:
      # High Error Rate
      - alert: HighErrorRate
        expr: |
          (
            rate(license_validation_errors_total[5m])
            /
            rate(license_validation_requests_total[5m])
          ) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High license validation error rate"
          description: "{{ $value | humanizePercentage }} of license validations failing"

      # High Latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(license_validation_latency_seconds_bucket[5m])
          ) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High license validation latency"
          description: "p95 latency is {{ $value }}s (target: <1s)"

      # Seat Exhaustion
      - alert: SeatExhaustion
        expr: |
          (active_sessions / 50) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "License {{ $labels.license_key }} near seat limit"
          description: "{{ $value | humanizePercentage }} of seats in use"

      # Frequent Seat Denials
      - alert: FrequentSeatDenials
        expr: rate(seat_denials_total[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Frequent seat denials for {{ $labels.license_key }}"
          description: "{{ $value }} denials/second"

      # Redis Connection Failures
      - alert: RedisConnectionFailures
        expr: |
          rate(redis_operations_total{status="error"}[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Redis connection failures detected"
          description: "{{ $value }} Redis errors/second"

      # Database Slow Queries
      - alert: DatabaseSlowQueries
        expr: |
          histogram_quantile(0.95,
            rate(database_query_latency_seconds_bucket[5m])
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Database queries running slow"
          description: "p95 query latency is {{ $value }}s"

  - name: infrastructure
    interval: 60s
    rules:
      # API Server Down
      - alert: APIServerDown
        expr: up{job="license-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "License API server is down"
          description: "{{ $labels.instance }} has been down for 1 minute"

      # High CPU Usage
      - alert: HighCPUUsage
        expr: |
          avg(rate(container_cpu_usage_seconds_total{pod=~"license-api.*"}[5m]))
          * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on license API"
          description: "CPU usage is {{ $value }}%"

      # High Memory Usage
      - alert: HighMemoryUsage
        expr: |
          (
            container_memory_working_set_bytes{pod=~"license-api.*"}
            /
            container_spec_memory_limit_bytes{pod=~"license-api.*"}
          ) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on license API"
          description: "Memory usage is {{ $value | humanizePercentage }}"
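
The architecture diagram lists recording rules for Prometheus, but none are defined here. A hedged sketch of one that precomputes the error ratio consumed by HighErrorRate (the rule name follows the conventional level:metric:operations pattern and is an assumption, not an existing series):

```yaml
groups:
  - name: license_api_recording
    interval: 30s
    rules:
      # Precompute the 5m error ratio so dashboards and the
      # HighErrorRate alert can share one cheap series
      - record: license_api:error_ratio:rate5m
        expr: |
          rate(license_validation_errors_total[5m])
          /
          rate(license_validation_requests_total[5m])
```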

5. Grafana Dashboards

File: monitoring/grafana/license-api-dashboard.json

{
  "dashboard": {
    "title": "CODITECT License API",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(license_validation_requests_total[5m])",
            "legendFormat": "{{tier}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Latency (p50, p95, p99)",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, rate(license_validation_latency_seconds_bucket[5m]))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, rate(license_validation_latency_seconds_bucket[5m]))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, rate(license_validation_latency_seconds_bucket[5m]))",
            "legendFormat": "p99"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(license_validation_errors_total[5m]) / rate(license_validation_requests_total[5m])",
            "legendFormat": "{{error_type}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Active Sessions",
        "targets": [
          {
            "expr": "active_sessions",
            "legendFormat": "{{license_key}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Seat Utilization",
        "targets": [
          {
            "expr": "(active_sessions / 50) * 100",
            "legendFormat": "{{license_key}}"
          }
        ],
        "type": "graph",
        "unit": "percent"
      }
    ]
  }
}
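
To load this JSON automatically rather than importing it by hand, Grafana's file-based dashboard provisioning can point at the directory holding it. A sketch; the provider name, folder, and mount path are assumptions:

```yaml
apiVersion: 1
providers:
  - name: 'coditect-dashboards'       # provider name is an assumption
    folder: 'CODITECT'
    type: file
    options:
      path: /etc/grafana/dashboards   # directory containing the JSON above
```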

6. PagerDuty Integration

File: monitoring/alertmanager/config.yml

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'pagerduty'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true

    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<PAGERDUTY_SERVICE_KEY>'
        description: '{{ .GroupLabels.alertname }}'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'

  - name: 'slack'
    slack_configs:
      - api_url: '<SLACK_WEBHOOK_URL>'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']

Consequences

Positive

Proactive Issue Detection

  • Alerts fire within 60 seconds of incidents
  • Prevent customer impact via early detection
  • Automatic escalation to on-call engineer

Performance Visibility

  • Real-time latency tracking (p50/p95/p99)
  • Identify slow queries and bottlenecks
  • Capacity planning based on metrics

Debugging Efficiency

  • Structured logs enable quick searches
  • Distributed traces show request flow
  • Error stack traces pinpoint issues

SLA Compliance

  • 99.9% uptime monitoring
  • Historical data for SLA reports
  • Incident postmortem data

Cost Optimization

  • Identify unused resources
  • Right-size infrastructure based on metrics
  • Reduce waste (40% cost savings possible)

Negative

⚠️ Infrastructure Cost

  • Prometheus: $50/month (GKE)
  • Grafana: $50/month (managed)
  • Cloud Logging: $100/month
  • PagerDuty: $25/user/month
  • Total: $225+/month

⚠️ Alert Fatigue

  • Too many alerts → ignored
  • Mitigation: Tune thresholds, aggregate alerts
  • Review alert effectiveness monthly

⚠️ Operational Overhead

  • Dashboard maintenance
  • Alert rule tuning
  • On-call rotation management

Related Decisions

  • ADR-009: GCP Infrastructure Architecture (monitoring infrastructure)
  • ADR-011: Zombie Session Cleanup Strategy (metrics for cleanup)

Last Updated: 2025-11-30 Owner: DevOps Team Review Cycle: Monthly (alert tuning)