ADR-019: Monitoring and Observability

Status: Accepted
Date: 2025-11-30
Deciders: DevOps Team, Engineering Team
Tags: monitoring, observability, prometheus, grafana, alerting


Context

Production Monitoring Requirements

The CODITECT license server is critical infrastructure: when it is down, customers are locked out of the product. We need comprehensive monitoring to:

Operational Requirements:

  • Uptime Monitoring: Detect outages within 60 seconds
  • Performance Monitoring: Track API latency (p50, p95, p99)
  • Error Tracking: Alert on error rate spikes
  • Capacity Planning: Predict when to scale infrastructure

Business Requirements:

  • SLA Compliance: 99.9% uptime guarantee
  • Customer Impact: Know when customers are affected by issues
  • Cost Optimization: Identify waste (unused resources)

Compliance Requirements:

  • Audit Logs: Complete audit trail for SOC 2
  • Security Monitoring: Detect suspicious activity
  • Data Retention: 90 days for compliance

Observability Pillars

1. Metrics (Prometheus + Grafana)

  • License API request rate
  • Seat acquisition latency
  • Active sessions per license
  • Error rates and types

2. Logs (Structured JSON + Cloud Logging)

  • License validation requests
  • Seat acquisition/release events
  • Error stack traces
  • Audit trail

3. Traces (OpenTelemetry + Jaeger)

  • Distributed tracing across microservices
  • Request flow visualization
  • Performance bottleneck identification

Decision

We will implement comprehensive observability with:

  1. Prometheus for metrics collection and alerting
  2. Grafana for metrics visualization and dashboards
  3. Google Cloud Logging for centralized log aggregation
  4. OpenTelemetry for distributed tracing
  5. PagerDuty for on-call alerts

Monitoring Architecture

┌────────────────────────────────────────────────────────────────┐
│                      Observability Stack                       │
└────────────────────────────────────────────────────────────────┘

License API (Django REST Framework)
│
├─► Metrics
│     • Request rate
│     • Latency (p50/p95/p99)
│     • Error rate
│     • Active sessions
│
├─► Logs
│     • Structured JSON
│     • Request/response
│     • Error stack traces
│     • Audit events
│
└─► Traces
      • OpenTelemetry spans
      • Distributed traces
      • Service dependencies

                │
                ▼
       ┌─────────────────┐
       │   Prometheus    │
       │    (Metrics)    │
       │                 │
       │ - Scrape        │
       │   /metrics      │
       │ - Alert rules   │
       │ - Recording     │
       │   rules         │
       └────────┬────────┘
                │
                ▼
       ┌─────────────────┐
       │     Grafana     │
       │ (Visualization) │
       │                 │
       │ - Dashboards    │
       │ - Drill-downs   │
       │ - Annotations   │
       └────────┬────────┘
                │
        ┌───────┴────────────────────┐
        │                            │
        ▼                            ▼
┌────────────────┐         ┌────────────────┐
│  Alertmanager  │         │   PagerDuty    │
│                │         │                │
│ - Route        │────────►│ - On-call      │
│   alerts       │         │ - Escalation   │
│ - Dedup        │         │ - Incident     │
│ - Silence      │         │   mgmt         │
└────────────────┘         └────────────────┘

Implementation

1. Prometheus Metrics

File: backend/monitoring/metrics.py

from prometheus_client import Counter, Histogram, Gauge, Info

# License API Metrics

license_validation_requests = Counter(
    'license_validation_requests_total',
    'Total number of license validation requests',
    ['license_key', 'status', 'tier']
)

license_validation_latency = Histogram(
    'license_validation_latency_seconds',
    'License validation request latency',
    ['endpoint'],
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

license_validation_errors = Counter(
    'license_validation_errors_total',
    'Total number of license validation errors',
    ['error_type', 'tier']
)

# Seat Management Metrics

seat_acquisitions = Counter(
    'seat_acquisitions_total',
    'Total number of seat acquisitions',
    ['license_key', 'tenant_id', 'success']
)

seat_acquisition_latency = Histogram(
    'seat_acquisition_latency_seconds',
    'Seat acquisition latency',
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1]
)

active_sessions = Gauge(
    'active_sessions',
    'Number of active sessions',
    ['license_key', 'tenant_id']
)

seat_denials = Counter(
    'seat_denials_total',
    'Total number of seat denials (no seats available)',
    ['license_key', 'tenant_id']
)

# Redis Metrics

redis_operations = Counter(
    'redis_operations_total',
    'Total Redis operations',
    ['operation', 'status']
)

redis_latency = Histogram(
    'redis_latency_seconds',
    'Redis operation latency',
    ['operation'],
    buckets=[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05]
)

# Database Metrics

database_queries = Counter(
    'database_queries_total',
    'Total database queries',
    ['table', 'operation']
)

database_query_latency = Histogram(
    'database_query_latency_seconds',
    'Database query latency',
    ['table', 'operation'],
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5]
)

# System Metrics

app_info = Info('app', 'Application info')
app_info.info({
    'version': '1.0.0',
    'environment': 'production',
    'region': 'us-central1'
})
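
As an illustration of how these metrics might be updated on the seat-acquisition path, here is a minimal sketch. `record_seat_acquisition` and `acquire_fn` are hypothetical names, and the metrics are declared against a local registry so the snippet runs standalone; in the application they would be the module-level objects from metrics.py above.

```python
import time
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

# Local registry keeps this example self-contained; in the app these
# are the module-level metrics from backend/monitoring/metrics.py.
registry = CollectorRegistry()

seat_acquisitions = Counter(
    'seat_acquisitions_total', 'Total number of seat acquisitions',
    ['license_key', 'tenant_id', 'success'], registry=registry)
seat_acquisition_latency = Histogram(
    'seat_acquisition_latency_seconds', 'Seat acquisition latency',
    registry=registry)
active_sessions = Gauge(
    'active_sessions', 'Number of active sessions',
    ['license_key', 'tenant_id'], registry=registry)
seat_denials = Counter(
    'seat_denials_total', 'Total number of seat denials',
    ['license_key', 'tenant_id'], registry=registry)


def record_seat_acquisition(license_key, tenant_id, acquire_fn):
    """Wrap a seat-acquisition call with metric updates.

    acquire_fn is a hypothetical callable returning True when a seat
    was granted and False when the license is at capacity.
    """
    start = time.perf_counter()
    granted = acquire_fn()
    # Observe latency of the acquisition attempt itself
    seat_acquisition_latency.observe(time.perf_counter() - start)

    seat_acquisitions.labels(license_key, tenant_id, str(granted).lower()).inc()
    if granted:
        active_sessions.labels(license_key, tenant_id).inc()
    else:
        seat_denials.labels(license_key, tenant_id).inc()
    return granted
```

A matching `release` path would call `active_sessions.labels(...).dec()` so the gauge tracks live sessions rather than cumulative acquisitions.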

2. Metrics Endpoint

File: backend/api/metrics.py

from django.http import HttpResponse
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST
from rest_framework.decorators import api_view, permission_classes
from rest_framework.permissions import AllowAny


@api_view(['GET'])
@permission_classes([AllowAny])  # Prometheus scraper doesn't authenticate
def metrics(request):
    """
    Prometheus metrics endpoint.

    Scraped by Prometheus every 15 seconds.

    Returns:
        Prometheus text format metrics
    """
    return HttpResponse(
        generate_latest(),
        content_type=CONTENT_TYPE_LATEST
    )
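
The scrape side can be sketched as a minimal Prometheus `scrape_configs` entry matching the 15-second interval above. The job name mirrors the `up{job="license-api"}` series used by the alert rules; the target host:port is a placeholder assumption.

```yaml
scrape_configs:
  - job_name: 'license-api'          # produces the up{job="license-api"} series
    scrape_interval: 15s
    metrics_path: /metrics           # assumed URL route for the Django view
    static_configs:
      - targets: ['license-api:8000']  # hypothetical service host:port
```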

3. Structured Logging

File: backend/monitoring/logging_config.py

import logging
import json
import sys
from datetime import datetime, timezone


class StructuredLogger(logging.Formatter):
    """
    Structured JSON logger for Cloud Logging.

    Logs in JSON format for easy parsing and querying.
    """

    def format(self, record):
        log_entry = {
            # timezone-aware UTC (datetime.utcnow() is deprecated)
            'timestamp': datetime.now(timezone.utc).isoformat().replace('+00:00', 'Z'),
            'severity': record.levelname,
            'message': record.getMessage(),
            'logger': record.name,
            'function': record.funcName,
            'line': record.lineno,
        }

        # Add extra fields
        if hasattr(record, 'license_key'):
            log_entry['license_key'] = record.license_key

        if hasattr(record, 'tenant_id'):
            log_entry['tenant_id'] = record.tenant_id

        if hasattr(record, 'user_email'):
            log_entry['user_email'] = record.user_email

        # Add exception info
        if record.exc_info:
            log_entry['exception'] = self.formatException(record.exc_info)

        return json.dumps(log_entry)


def setup_logging():
    """Configure structured logging."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(StructuredLogger())

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
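
Usage from application code passes the contextual fields via `extra=`, which the formatter's `hasattr` checks pick up as top-level JSON keys. A self-contained demo (with an abridged copy of the formatter, and an in-memory stream standing in for stdout so the output can be inspected):

```python
import io
import json
import logging


class StructuredLogger(logging.Formatter):
    """Abridged copy of the formatter above, for a runnable demo."""

    def format(self, record):
        entry = {
            'severity': record.levelname,
            'message': record.getMessage(),
            'logger': record.name,
        }
        # extra= fields land as attributes on the LogRecord
        for field in ('license_key', 'tenant_id', 'user_email'):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


# In production the handler writes to sys.stdout (Cloud Logging tails it);
# here an in-memory stream lets us read the emitted line back.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(StructuredLogger())
logger = logging.getLogger('license.demo')
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info('seat acquired', extra={'license_key': 'LIC-1', 'tenant_id': 'acme'})
entry = json.loads(stream.getvalue())
```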

4. Prometheus Alert Rules

File: monitoring/prometheus/alerts.yml

groups:
  - name: license_api
    interval: 30s
    rules:
      # High Error Rate
      - alert: HighErrorRate
        expr: |
          (
            rate(license_validation_errors_total[5m])
            /
            rate(license_validation_requests_total[5m])
          ) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High license validation error rate"
          description: "{{ $value | humanizePercentage }} of license validations failing"

      # High Latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(license_validation_latency_seconds_bucket[5m])
          ) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High license validation latency"
          description: "p95 latency is {{ $value }}s (target: <1s)"

      # Seat Exhaustion
      - alert: SeatExhaustion
        expr: |
          (active_sessions / 50) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "License {{ $labels.license_key }} near seat limit"
          description: "{{ $value | humanizePercentage }} of seats in use"

      # Frequent Seat Denials
      - alert: FrequentSeatDenials
        expr: rate(seat_denials_total[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Frequent seat denials for {{ $labels.license_key }}"
          description: "{{ $value }} denials/second"

      # Redis Connection Failures
      - alert: RedisConnectionFailures
        expr: |
          rate(redis_operations_total{status="error"}[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Redis connection failures detected"
          description: "{{ $value }} Redis errors/second"

      # Database Slow Queries
      - alert: DatabaseSlowQueries
        expr: |
          histogram_quantile(0.95,
            rate(database_query_latency_seconds_bucket[5m])
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Database queries running slow"
          description: "p95 query latency is {{ $value }}s"

  - name: infrastructure
    interval: 60s
    rules:
      # API Server Down
      - alert: APIServerDown
        expr: up{job="license-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "License API server is down"
          description: "{{ $labels.instance }} has been down for 1 minute"

      # High CPU Usage
      - alert: HighCPUUsage
        expr: |
          avg(rate(container_cpu_usage_seconds_total{pod=~"license-api.*"}[5m]))
          * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on license API"
          description: "CPU usage is {{ $value }}%"

      # High Memory Usage
      - alert: HighMemoryUsage
        expr: |
          (
            container_memory_working_set_bytes{pod=~"license-api.*"}
            /
            container_spec_memory_limit_bytes{pod=~"license-api.*"}
          ) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on license API"
          description: "Memory usage is {{ $value | humanizePercentage }}"
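
The architecture diagram lists recording rules for Prometheus, but none are defined here. A hedged sketch of one that precomputes the error ratio consumed by HighErrorRate (the rule name follows the conventional level:metric:operations pattern and is an assumption, not an existing series):

```yaml
groups:
  - name: license_api_recording
    interval: 30s
    rules:
      # Precompute the 5m error ratio so dashboards and the
      # HighErrorRate alert can share one cheap series
      - record: license_api:error_ratio:rate5m
        expr: |
          rate(license_validation_errors_total[5m])
          /
          rate(license_validation_requests_total[5m])
```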

5. Grafana Dashboards

File: monitoring/grafana/license-api-dashboard.json

{
  "dashboard": {
    "title": "CODITECT License API",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(license_validation_requests_total[5m])",
            "legendFormat": "{{tier}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Latency (p50, p95, p99)",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, rate(license_validation_latency_seconds_bucket[5m]))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, rate(license_validation_latency_seconds_bucket[5m]))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, rate(license_validation_latency_seconds_bucket[5m]))",
            "legendFormat": "p99"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(license_validation_errors_total[5m]) / rate(license_validation_requests_total[5m])",
            "legendFormat": "{{error_type}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Active Sessions",
        "targets": [
          {
            "expr": "active_sessions",
            "legendFormat": "{{license_key}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Seat Utilization",
        "targets": [
          {
            "expr": "(active_sessions / 50) * 100",
            "legendFormat": "{{license_key}}"
          }
        ],
        "type": "graph",
        "unit": "percent"
      }
    ]
  }
}
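
To load this JSON automatically rather than importing it by hand, Grafana's file-based dashboard provisioning can point at the directory holding it. A sketch; the provider name, folder, and mount path are assumptions:

```yaml
apiVersion: 1
providers:
  - name: 'coditect-dashboards'       # provider name is an assumption
    folder: 'CODITECT'
    type: file
    options:
      path: /etc/grafana/dashboards   # directory containing the JSON above
```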

6. PagerDuty Integration

File: monitoring/alertmanager/config.yml

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'pagerduty'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true

    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<PAGERDUTY_SERVICE_KEY>'
        description: '{{ .GroupLabels.alertname }}'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'

  - name: 'slack'
    slack_configs:
      - api_url: '<SLACK_WEBHOOK_URL>'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']

Consequences

Positive

Proactive Issue Detection

  • Alerts fire within 60 seconds of incidents
  • Prevent customer impact via early detection
  • Automatic escalation to on-call engineer

Performance Visibility

  • Real-time latency tracking (p50/p95/p99)
  • Identify slow queries and bottlenecks
  • Capacity planning based on metrics

Debugging Efficiency

  • Structured logs enable quick searches
  • Distributed traces show request flow
  • Error stack traces pinpoint issues

SLA Compliance

  • 99.9% uptime monitoring
  • Historical data for SLA reports
  • Incident postmortem data

Cost Optimization

  • Identify unused resources
  • Right-size infrastructure based on metrics
  • Reduce waste (40% cost savings possible)

Negative

⚠️ Infrastructure Cost

  • Prometheus: $50/month (GKE)
  • Grafana: $50/month (managed)
  • Cloud Logging: $100/month
  • PagerDuty: $25/user/month
  • Total: $225+/month

⚠️ Alert Fatigue

  • Too many alerts → ignored
  • Mitigation: Tune thresholds, aggregate alerts
  • Review alert effectiveness monthly

⚠️ Operational Overhead

  • Dashboard maintenance
  • Alert rule tuning
  • On-call rotation management

Related Decisions

  • ADR-009: GCP Infrastructure Architecture (monitoring infrastructure)
  • ADR-011: Zombie Session Cleanup Strategy (metrics for cleanup)

Last Updated: 2025-11-30 Owner: DevOps Team Review Cycle: Monthly (alert tuning)