FP&A Platform — Monitoring & Alerting Specification
Version: 1.0
Last Updated: 2026-02-03
Document ID: OPS-002
Classification: Internal
1. Overview
This document defines the monitoring, alerting, and observability strategy for the FP&A Platform.
Observability Stack
| Component | Tool | Purpose |
|---|---|---|
| Metrics | Prometheus + Grafana | Time-series metrics, dashboards |
| Logs | Loki | Log aggregation and querying |
| Traces | OpenTelemetry + Jaeger | Distributed tracing |
| Alerting | Alertmanager + PagerDuty | Alert routing and escalation |
| Synthetic | Blackbox Exporter | Endpoint monitoring |
| APM | Custom + OpenTelemetry | Application performance |
2. Service Level Objectives (SLOs)
2.1 Availability SLOs
| Service Tier | SLO | Error Budget (monthly) | Measurement |
|---|---|---|---|
| Tier 1 - Critical | 99.9% | 43.2 minutes | Successful requests / total requests |
| Tier 2 - Essential | 99.5% | 3.6 hours | Successful requests / total requests |
| Tier 3 - Important | 99.0% | 7.2 hours | Successful requests / total requests |
Tier 1 Services: API Gateway, Authentication, GL Service, Database
Tier 2 Services: Reconciliation, Reporting, Integrations
Tier 3 Services: AI Agents, Forecasting, Analytics
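The error-budget column follows directly from each SLO; a minimal sketch of the arithmetic, assuming a 30-day month:

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed unavailability per period for a given availability SLO."""
    return days * 24 * 60 * (1 - slo)

# Tier 1: 99.9% over 30 days -> 43.2 minutes
print(round(error_budget_minutes(0.999), 1))       # 43.2
# Tier 2: 99.5% -> 3.6 hours; Tier 3: 99.0% -> 7.2 hours
print(round(error_budget_minutes(0.995) / 60, 1))  # 3.6
print(round(error_budget_minutes(0.99) / 60, 1))   # 7.2
```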
2.2 Latency SLOs
| Endpoint Category | P50 | P95 | P99 |
|---|---|---|---|
| Authentication | 100ms | 300ms | 500ms |
| Simple Reads | 50ms | 200ms | 500ms |
| Complex Queries | 500ms | 2s | 5s |
| Report Generation | 2s | 10s | 30s |
| AI Agent Response | 5s | 15s | 30s |
| Batch Operations | 30s | 2m | 5m |
2.3 SLI Definitions
slis:
  availability:
    name: "Service Availability"
    description: "Percentage of successful requests"
    formula: |
      sum(rate(http_requests_total{status!~"5.."}[5m])) /
      sum(rate(http_requests_total[5m]))
  latency_p99:
    name: "Request Latency P99"
    description: "99th percentile request latency"
    formula: |
      histogram_quantile(0.99,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
      )
  error_rate:
    name: "Error Rate"
    description: "Percentage of requests resulting in errors"
    formula: |
      sum(rate(http_requests_total{status=~"5.."}[5m])) /
      sum(rate(http_requests_total[5m]))
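A common way to act on the error-rate SLI is burn-rate alerting: compare the measured error rate to the rate that would consume the budget exactly at the end of the SLO window. A sketch of the core ratio (illustrative only; the function name is not part of the spec):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Rate of error-budget consumption.
    1.0 means the budget runs out exactly at the end of the SLO window;
    higher values exhaust it proportionally sooner."""
    return error_rate / (1 - slo)

# A 1% error rate against a 99.9% SLO burns the budget 10x too fast,
# i.e. a 30-day budget would be gone in about 3 days.
print(round(burn_rate(0.01, 0.999), 2))  # 10.0
```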
3. Metrics Collection
3.1 Infrastructure Metrics
Kubernetes Metrics
# kube-state-metrics configuration
metrics:
  pods:
    - kube_pod_status_phase
    - kube_pod_container_status_restarts_total
    - kube_pod_container_resource_requests
    - kube_pod_container_resource_limits
  deployments:
    - kube_deployment_status_replicas
    - kube_deployment_status_replicas_available
    - kube_deployment_status_replicas_unavailable
  nodes:
    - kube_node_status_condition
    - kube_node_status_capacity
    - kube_node_status_allocatable
Resource Metrics
| Metric | Source | Alert Threshold |
|---|---|---|
| container_cpu_usage_seconds_total | cAdvisor | >80% sustained |
| container_memory_usage_bytes | cAdvisor | >85% of limit |
| container_fs_usage_bytes | cAdvisor | >80% of limit |
| container_network_receive_bytes_total | cAdvisor | Anomaly detection |
| container_network_transmit_bytes_total | cAdvisor | Anomaly detection |
3.2 Database Metrics (PostgreSQL)
postgres_metrics:
  connections:
    - pg_stat_activity_count            # Active connections
    - pg_settings_max_connections       # Max allowed
  performance:
    - pg_stat_statements_total_time     # Query time
    - pg_stat_statements_calls          # Query count
    - pg_stat_user_tables_seq_scan      # Sequential scans
    - pg_stat_user_tables_idx_scan      # Index scans
  replication:
    - pg_stat_replication_lag           # Replication lag (seconds)
    - pg_stat_wal_receiver_lag          # WAL receiver lag
  storage:
    - pg_database_size_bytes            # Database size
    - pg_stat_user_tables_dead_tuples   # Dead tuples (vacuum needed)
  locks:
    - pg_locks_count                    # Lock count by mode
    - pg_stat_activity_waiting          # Queries waiting on locks

# Alert thresholds
alerts:
  - name: PostgresConnectionsHigh
    expr: pg_stat_activity_count / pg_settings_max_connections > 0.8
    for: 5m
    severity: warning
  - name: PostgresReplicationLag
    expr: pg_stat_replication_lag > 10   # seconds
    for: 1m
    severity: warning                    # escalates to critical at >30s (see 4.1)
3.3 Application Metrics
HTTP Metrics (RED Method)
# FastAPI metrics middleware
from prometheus_client import Counter, Gauge, Histogram

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status', 'service']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint', 'service'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 30]
)

REQUEST_IN_PROGRESS = Gauge(
    'http_requests_in_progress',
    'HTTP requests currently in progress',
    ['method', 'endpoint', 'service']
)
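For intuition on how the bucket boundaries above turn into latency SLIs: `histogram_quantile` finds the bucket containing the target rank and interpolates linearly within it. A rough pure-Python sketch of that estimate (illustrative only, not the production query path; the `+Inf` bucket is omitted):

```python
def bucket_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) histogram
    buckets, interpolating linearly inside the containing bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:
                return bound
            # linear interpolation between the bucket's bounds
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Cumulative counts: 90 requests <= 0.1s, 99 <= 0.25s, 100 <= 0.5s
buckets = [(0.1, 90), (0.25, 99), (0.5, 100)]
print(round(bucket_quantile(0.99, buckets), 4))  # 0.25
```

This is why bucket choice matters: a P99 estimate can never be more precise than the bucket edges around it.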
Business Metrics
business_metrics:
  accounting:
    - fpa_journal_entries_total{status}                 # Journal entries by status
    - fpa_journal_entries_amount_total{entity}          # Total amounts
    - fpa_period_close_duration_seconds{entity}         # Close time
  reconciliation:
    - fpa_reconciliation_match_rate{entity}             # Match percentage
    - fpa_reconciliation_exceptions_total{entity,type}  # Exception count
    - fpa_reconciliation_duration_seconds{entity}       # Recon time
  forecasting:
    - fpa_forecast_accuracy_mape{entity,horizon}        # Forecast accuracy
    - fpa_forecast_generation_duration_seconds          # Generation time
  ai_agents:
    - fpa_agent_requests_total{agent_type}              # Agent invocations
    - fpa_agent_tokens_used_total{agent_type,model}     # Token usage
    - fpa_agent_duration_seconds{agent_type}            # Agent latency
    - fpa_agent_confidence_score{agent_type}            # Confidence distribution
3.4 AI/ML Metrics
ml_metrics:
  inference:
    - fpa_ml_inference_latency_seconds{model}
    - fpa_ml_inference_requests_total{model,status}
    - fpa_ml_tokens_input_total{model}
    - fpa_ml_tokens_output_total{model}
  model_performance:
    - fpa_ml_prediction_confidence{model}
    - fpa_ml_recon_match_accuracy{model}   # Rolling accuracy
    - fpa_ml_forecast_mape{model,horizon}  # Forecast error
  resource_usage:
    - fpa_ml_gpu_utilization{node}
    - fpa_ml_gpu_memory_used_bytes{node}
    - fpa_ml_queue_depth{model}
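The `fpa_ml_forecast_mape` gauge reports mean absolute percentage error, which has a simple definition worth keeping at hand when reading the dashboards. A sketch (zero actuals are skipped here to avoid division by zero; the production metric may handle them differently):

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error; lower is better. Zero actuals are skipped."""
    pairs = [(a, f) for a, f in zip(actuals, forecasts) if a != 0]
    return sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)

# 10% average miss on both points -> MAPE of 0.10,
# comfortably under the 0.15 warning threshold in section 4.3
print(round(mape([100, 200], [90, 220]), 2))  # 0.1
```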
4. Alerting Rules
4.1 Critical Alerts (P1 - Page Immediately)
groups:
  - name: critical
    rules:
      - alert: ServiceDown
        expr: up{job=~"fpa-.*"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          runbook: "https://wiki.fpa/runbooks/service-down"
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL database is down"
          runbook: "https://wiki.fpa/runbooks/database-down"
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) /
          sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }}) for {{ $labels.service }}"
      - alert: DatabaseReplicationLag
        expr: pg_stat_replication_lag > 30
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database replication lag is {{ $value }}s"
      - alert: CertificateExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "SSL certificate expires in {{ $value | humanizeDuration }}"
4.2 High Severity Alerts (P2 - Respond within 1 hour)
groups:
  - name: high
    rules:
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 5
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "P99 latency is {{ $value | humanizeDuration }} for {{ $labels.service }}"
      - alert: HighMemoryUsage
        expr: |
          container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.85
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Memory usage at {{ $value | humanizePercentage }} for {{ $labels.pod }}"
      - alert: HighCPUUsage
        expr: |
          sum(rate(container_cpu_usage_seconds_total[5m])) by (pod) /
          sum(container_spec_cpu_quota / container_spec_cpu_period) by (pod) > 0.8
        for: 10m
        labels:
          severity: high
        annotations:
          summary: "CPU usage at {{ $value | humanizePercentage }} of limit for {{ $labels.pod }}"
      - alert: PodRestartLoop
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Pod {{ $labels.pod }} has restarted {{ $value }} times in 1 hour"
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Less than 15% disk space free on {{ $labels.instance }}"
4.3 Warning Alerts (P3 - Respond within 4 hours)
groups:
  - name: warning
    rules:
      - alert: ElevatedErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) /
          sum(rate(http_requests_total[5m])) by (service) > 0.01
        for: 10m
        labels:
          severity: warning
      - alert: SlowQueries
        expr: |
          pg_stat_statements_mean_time_seconds{query!~".*pg_.*"} > 1
        for: 15m
        labels:
          severity: warning
      - alert: AIAgentLowConfidence
        expr: |
          histogram_quantile(0.5,
            sum(rate(fpa_agent_confidence_score_bucket[30m])) by (le, agent_type)
          ) < 0.7
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "AI agent median confidence below 70%"
      - alert: ReconciliationMatchRateDrop
        expr: |
          fpa_reconciliation_match_rate < 0.85
        for: 1h
        labels:
          severity: warning
      - alert: ForecastAccuracyDegraded
        expr: |
          fpa_ml_forecast_mape{horizon="13_weeks"} > 0.15
        for: 1d
        labels:
          severity: warning
4.4 Alert Routing
# Alertmanager configuration
route:
  receiver: 'default'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: critical
      receiver: 'slack-incidents'
    - match:
        severity: high
      receiver: 'pagerduty-high'
    - match:
        severity: warning
      receiver: 'slack-alerts'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_CRITICAL_KEY}'
        severity: critical
  - name: 'pagerduty-high'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_HIGH_KEY}'
        severity: high
  - name: 'slack-incidents'
    slack_configs:
      - channel: '#incidents'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
  - name: 'slack-alerts'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
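Note that a critical alert fans out to both PagerDuty and Slack only because the first critical route sets `continue: true`; without it, matching stops at the first hit. A toy sketch of that first-match-plus-continue semantics (not Alertmanager's actual implementation):

```python
def route(labels, routes, default="default"):
    """Return the receivers for an alert: the first matching route wins,
    unless it sets continue, in which case later matches also fire."""
    receivers = []
    for r in routes:
        if all(labels.get(k) == v for k, v in r["match"].items()):
            receivers.append(r["receiver"])
            if not r.get("continue"):
                break
    return receivers or [default]

routes = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical", "continue": True},
    {"match": {"severity": "critical"}, "receiver": "slack-incidents"},
    {"match": {"severity": "warning"}, "receiver": "slack-alerts"},
]
print(route({"severity": "critical"}, routes))  # ['pagerduty-critical', 'slack-incidents']
print(route({"severity": "warning"}, routes))   # ['slack-alerts']
```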
5. Dashboards
5.1 Executive Dashboard
Purpose: SLA compliance and business KPIs for leadership
Panels:
- Platform Availability (30-day rolling)
- Error Budget Remaining
- Monthly Active Users
- Transaction Volume
- Reconciliation Match Rate (trend)
- Forecast Accuracy (trailing)
- Top 5 Errors This Week
5.2 Service Health Dashboard
Purpose: Real-time service status for engineering
Panels per Service:
┌─────────────────────────────────────────────────────────────┐
│ GL SERVICE │
├─────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Availability │ │ Error Rate │ │ P99 Lat │ │
│ │ 99.95% │ │ 0.02% │ │ 245ms │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Request Rate (5m avg) │ │
│ │ ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇ │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Latency Distribution │ │
│ │ P50: 45ms | P90: 120ms | P99: 245ms │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ Pods: 3/3 Running | CPU: 45% | Memory: 62% │
└─────────────────────────────────────────────────────────────┘
5.3 On-Call Dashboard
Purpose: Incident triage and quick diagnostics
Panels:
- Active Alerts (grouped by severity)
- Recent Deployments
- Error Rate by Endpoint (last 1h)
- Slow Queries
- Pod Restarts
- External Dependencies Status
- Quick Links (runbooks, logs, traces)
5.4 AI/ML Dashboard
Purpose: Model performance and resource monitoring
Panels:
- Model Inference Latency (by model)
- Token Usage (by agent)
- GPU Utilization
- Queue Depth
- Confidence Score Distribution
- Reconciliation Model Accuracy (rolling 7-day)
- Forecast Model MAPE (by horizon)
- Hallucination Detection Rate
5.5 Cost Dashboard
Purpose: Cloud spend monitoring
Panels:
- Daily/Monthly Spend (vs budget)
- Cost by Service
- AI Token Costs (by model)
- Resource Efficiency (CPU/Memory utilization)
- Top 10 Expensive Queries
- Recommendations for Optimization
6. Logging Standards
6.1 Log Format
{
  "timestamp": "2026-02-03T10:15:30.123Z",
  "level": "INFO",
  "service": "gl-service",
  "version": "1.2.3",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "tenant_id": "tenant_acme",
  "user_id": "user_john",
  "message": "Journal entry created",
  "attributes": {
    "journal_entry_id": "je_001",
    "amount": 10000.00,
    "duration_ms": 45
  }
}
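A minimal stdlib `logging` formatter that emits this shape can look like the sketch below. The trace, tenant, and user fields would normally come from request context and are omitted here; the class name is illustrative, not part of the spec.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON matching the platform log schema."""

    def __init__(self, service: str, version: str):
        super().__init__()
        self.service, self.version = service, version

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "service": self.service,
            "version": self.version,
            "message": record.getMessage(),
            # extra={"attributes": {...}} on the logging call lands here
            "attributes": getattr(record, "attributes", {}),
        }
        return json.dumps(entry)

logger = logging.getLogger("gl-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("gl-service", "1.2.3"))
logger.addHandler(handler)
logger.info("Journal entry created", extra={"attributes": {"journal_entry_id": "je_001"}})
```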
6.2 Log Levels
| Level | Use Case | Examples |
|---|---|---|
| ERROR | Failures requiring attention | Unhandled exceptions, failed operations |
| WARN | Potential issues | Deprecated API usage, near-limit resources |
| INFO | Normal operations | Request completed, job finished |
| DEBUG | Detailed troubleshooting | SQL queries, intermediate state |
6.3 PII Redaction
import re

REDACT_PATTERNS = {
    'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
    'credit_card': r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b',
    'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    'api_key': r'\b(sk|pk|api)_[A-Za-z0-9]{20,}\b',
}

def redact_pii(log_message: str) -> str:
    """Replace anything matching a known PII pattern before the line is emitted."""
    for field, pattern in REDACT_PATTERNS.items():
        log_message = re.sub(pattern, f'[REDACTED:{field}]', log_message)
    return log_message
7. Distributed Tracing
7.1 Trace Propagation
# OpenTelemetry configuration
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor

# Auto-instrument frameworks
FastAPIInstrumentor.instrument_app(app)
SQLAlchemyInstrumentor().instrument(engine=engine)
RedisInstrumentor().instrument()

# Custom spans for business operations
tracer = trace.get_tracer(__name__)

async def create_journal_entry(entry: JournalEntry):
    with tracer.start_as_current_span("create_journal_entry") as span:
        span.set_attribute("entity_id", entry.entity_id)
        span.set_attribute("amount", float(entry.total_debit))

        # Nested span for validation
        with tracer.start_as_current_span("validate_entry"):
            await validate_entry(entry)

        # Nested span for persistence
        with tracer.start_as_current_span("persist_entry"):
            await db.save(entry)
7.2 Sampling Strategy
sampling:
  default: 0.1                          # 10% of traces sampled by default
  rules:
    - service: api-gateway
      sample_rate: 0.05                 # High volume, lower sampling
    - operation: "create_journal_entry"
      sample_rate: 1.0                  # Always sample financial operations
    - http.status_code: "5xx"
      sample_rate: 1.0                  # Always sample errors
    - duration_ms: ">5000"
      sample_rate: 1.0                  # Always sample slow requests
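A toy head-sampling decision implementing these rules can be sketched as below. The precedence (always-sample rules before the service override) and the attribute names are assumptions for illustration, not the sampler's actual configuration keys.

```python
import random

def sample_rate(attrs: dict, default: float = 0.1) -> float:
    """Return the sampling probability for a span, mirroring the rules above."""
    if attrs.get("operation") == "create_journal_entry":
        return 1.0                        # always keep financial operations
    if str(attrs.get("http.status_code", "")).startswith("5"):
        return 1.0                        # always keep errors
    if attrs.get("duration_ms", 0) > 5000:
        return 1.0                        # always keep slow requests
    if attrs.get("service") == "api-gateway":
        return 0.05                       # high volume, lower sampling
    return default

def should_sample(attrs: dict) -> bool:
    return random.random() < sample_rate(attrs)

print(sample_rate({"http.status_code": 503}))   # 1.0
print(sample_rate({"service": "api-gateway"}))  # 0.05
```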
8. Health Checks
8.1 Kubernetes Probes
# Deployment health probes
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 5
  failureThreshold: 30
8.2 Health Check Endpoints
from fastapi.responses import JSONResponse

@app.get("/health/live")
async def liveness():
    """Basic liveness check - is the process running?"""
    return {"status": "ok"}

@app.get("/health/ready")
async def readiness():
    """Readiness check - can the service handle requests?"""
    checks = {
        "database": await check_database(),
        "redis": await check_redis(),
        "dependencies": await check_dependencies(),
    }
    all_healthy = all(c["healthy"] for c in checks.values())
    return JSONResponse(
        content={"status": "ready" if all_healthy else "not_ready", "checks": checks},
        status_code=200 if all_healthy else 503,
    )

@app.get("/health/startup")
async def startup():
    """Startup check - has initialization completed?"""
    if not app.state.initialized:
        return JSONResponse({"status": "initializing"}, status_code=503)
    return {"status": "started"}
9. Synthetic Monitoring
9.1 Critical User Journeys
synthetic_tests:
  - name: "Login Flow"
    schedule: "*/5 * * * *"              # Every 5 minutes
    steps:
      - visit: "https://app.fpa-platform.com/login"
      - fill: { selector: "#email", value: "${SYNTHETIC_USER}" }
      - fill: { selector: "#password", value: "${SYNTHETIC_PASS}" }
      - click: { selector: "#login-button" }
      - wait: { selector: "#dashboard", timeout: 10s }
    assertions:
      - response_time: "<3s"
      - status: 200

  - name: "Create Journal Entry"
    schedule: "*/15 * * * *"
    steps:
      - authenticate: "${SYNTHETIC_TOKEN}"
      - post:
          url: "/api/gl/journal-entries"
          body: { ... }
      - assert:
          status: 201
          response_time: "<2s"

  - name: "Run Reconciliation"
    schedule: "0 * * * *"                # Every hour
    steps:
      - authenticate: "${SYNTHETIC_TOKEN}"
      - post:
          url: "/api/recon/sessions"
          body: { entity_id: "test_entity" }
      - poll:
          url: "/api/recon/sessions/${session_id}"
          until: { status: "completed" }
          timeout: 5m
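The `poll ... until` step in the reconciliation journey can be sketched as a generic helper; the `fetch_status` callable, timings, and injectable `sleep`/`clock` parameters are placeholders for illustration:

```python
import time

def poll_until(fetch_status, want="completed", timeout_s=300, interval_s=5,
               sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns `want` or the timeout elapses.
    Returns True on success, False on timeout. sleep/clock are injectable
    so tests can run without real waiting."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if fetch_status() == want:
            return True
        sleep(interval_s)
    return False

# Simulated session that completes on the third check (no real sleeping)
states = iter(["pending", "running", "completed"])
assert poll_until(lambda: next(states), sleep=lambda _: None) is True
```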