
FP&A Platform — Monitoring & Alerting Specification

Version: 1.0
Last Updated: 2026-02-03
Document ID: OPS-002
Classification: Internal


1. Overview

This document defines the comprehensive monitoring, alerting, and observability strategy for the FP&A Platform.

Observability Stack

| Component | Tool | Purpose |
|-----------|------|---------|
| Metrics | Prometheus + Grafana | Time-series metrics, dashboards |
| Logs | Loki | Log aggregation and querying |
| Traces | OpenTelemetry + Jaeger | Distributed tracing |
| Alerting | Alertmanager + PagerDuty | Alert routing and escalation |
| Synthetic | Blackbox Exporter | Endpoint monitoring |
| APM | Custom + OpenTelemetry | Application performance |

2. Service Level Objectives (SLOs)

2.1 Availability SLOs

| Service Tier | SLO | Error Budget (monthly) | Measurement |
|--------------|-----|------------------------|-------------|
| Tier 1 - Critical | 99.9% | 43.2 minutes | Successful requests / total requests |
| Tier 2 - Essential | 99.5% | 3.6 hours | Successful requests / total requests |
| Tier 3 - Important | 99.0% | 7.2 hours | Successful requests / total requests |

Tier 1 Services: API Gateway, Authentication, GL Service, Database
Tier 2 Services: Reconciliation, Reporting, Integrations
Tier 3 Services: AI Agents, Forecasting, Analytics
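The error-budget column follows directly from the SLO. As a reference, a minimal Python sketch of the conversion (a 30-day month is assumed, matching the table):

```python
def monthly_error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime per month for a given availability SLO.

    Assumes a 30-day month, as in the table above:
    99.9% -> 43.2 minutes, 99.5% -> 216 minutes (3.6 hours).
    """
    return (1.0 - slo) * days * 24 * 60
```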

2.2 Latency SLOs

| Endpoint Category | P50 | P95 | P99 |
|-------------------|-----|-----|-----|
| Authentication | 100ms | 300ms | 500ms |
| Simple Reads | 50ms | 200ms | 500ms |
| Complex Queries | 500ms | 2s | 5s |
| Report Generation | 2s | 10s | 30s |
| AI Agent Response | 5s | 15s | 30s |
| Batch Operations | 30s | 2m | 5m |

2.3 SLI Definitions

slis:
  availability:
    name: "Service Availability"
    description: "Percentage of successful requests"
    formula: |
      sum(rate(http_requests_total{status!~"5.."}[5m])) /
      sum(rate(http_requests_total[5m]))

  latency_p99:
    name: "Request Latency P99"
    description: "99th percentile request latency"
    formula: |
      histogram_quantile(0.99,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
      )

  error_rate:
    name: "Error Rate"
    description: "Percentage of requests resulting in errors"
    formula: |
      sum(rate(http_requests_total{status=~"5.."}[5m])) /
      sum(rate(http_requests_total[5m]))
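For intuition, histogram_quantile finds the cumulative bucket containing the target rank and interpolates linearly within it. A simplified stdlib illustration (the bucket values here are hypothetical, and this sketch ignores the +Inf bucket):

```python
def estimate_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a sorted list of (le_upper_bound, cumulative_count)
    pairs, mimicking how Prometheus's histogram_quantile interpolates
    linearly inside the bucket that contains the target rank.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if count == prev_count:
                return le
            # Linear interpolation within this bucket
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]
```

With 100 requests where 50 finished under 100ms and 90 under 500ms, the estimated P95 lands at 0.75s, inside the 0.5s-1s bucket.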

3. Metrics Collection

3.1 Infrastructure Metrics

Kubernetes Metrics

# kube-state-metrics configuration
metrics:
  pods:
    - kube_pod_status_phase
    - kube_pod_container_status_restarts_total
    - kube_pod_container_resource_requests
    - kube_pod_container_resource_limits

  deployments:
    - kube_deployment_status_replicas
    - kube_deployment_status_replicas_available
    - kube_deployment_status_replicas_unavailable

  nodes:
    - kube_node_status_condition
    - kube_node_status_capacity
    - kube_node_status_allocatable

Resource Metrics

| Metric | Source | Alert Threshold |
|--------|--------|-----------------|
| container_cpu_usage_seconds_total | cAdvisor | >80% sustained |
| container_memory_usage_bytes | cAdvisor | >85% of limit |
| container_fs_usage_bytes | cAdvisor | >80% of limit |
| container_network_receive_bytes_total | cAdvisor | Anomaly detection |
| container_network_transmit_bytes_total | cAdvisor | Anomaly detection |

3.2 Database Metrics (PostgreSQL)

postgres_metrics:
  connections:
    - pg_stat_activity_count          # Active connections
    - pg_settings_max_connections     # Max allowed

  performance:
    - pg_stat_statements_total_time   # Query time
    - pg_stat_statements_calls        # Query count
    - pg_stat_user_tables_seq_scan    # Sequential scans
    - pg_stat_user_tables_idx_scan    # Index scans

  replication:
    - pg_stat_replication_lag         # Replication lag (seconds)
    - pg_stat_wal_receiver_lag        # WAL receiver lag

  storage:
    - pg_database_size_bytes          # Database size
    - pg_stat_user_tables_dead_tuples # Dead tuples (vacuum needed)

  locks:
    - pg_locks_count                  # Lock count by mode
    - pg_stat_activity_waiting        # Queries waiting on locks

# Alert thresholds
alerts:
  - name: PostgresConnectionsHigh
    expr: pg_stat_activity_count / pg_settings_max_connections > 0.8
    for: 5m
    severity: warning

  - name: PostgresReplicationLag
    expr: pg_stat_replication_lag > 10
    for: 1m
    severity: critical

3.3 Application Metrics

HTTP Metrics (RED Method)

# FastAPI metrics middleware
from prometheus_client import Counter, Gauge, Histogram

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status', 'service']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint', 'service'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 30]
)

REQUEST_IN_PROGRESS = Gauge(
    'http_requests_in_progress',
    'HTTP requests currently in progress',
    ['method', 'endpoint', 'service']
)

Business Metrics

business_metrics:
  accounting:
    - fpa_journal_entries_total{status}                # Journal entries by status
    - fpa_journal_entries_amount_total{entity}         # Total amounts
    - fpa_period_close_duration_seconds{entity}        # Close time

  reconciliation:
    - fpa_reconciliation_match_rate{entity}            # Match percentage
    - fpa_reconciliation_exceptions_total{entity,type} # Exception count
    - fpa_reconciliation_duration_seconds{entity}      # Recon time

  forecasting:
    - fpa_forecast_accuracy_mape{entity,horizon}       # Forecast accuracy
    - fpa_forecast_generation_duration_seconds         # Generation time

  ai_agents:
    - fpa_agent_requests_total{agent_type}             # Agent invocations
    - fpa_agent_tokens_used_total{agent_type,model}    # Token usage
    - fpa_agent_duration_seconds{agent_type}           # Agent latency
    - fpa_agent_confidence_score{agent_type}           # Confidence distribution

3.4 AI/ML Metrics

ml_metrics:
  inference:
    - fpa_ml_inference_latency_seconds{model}
    - fpa_ml_inference_requests_total{model,status}
    - fpa_ml_tokens_input_total{model}
    - fpa_ml_tokens_output_total{model}

  model_performance:
    - fpa_ml_prediction_confidence{model}
    - fpa_ml_recon_match_accuracy{model}   # Rolling accuracy
    - fpa_ml_forecast_mape{model,horizon}  # Forecast error

  resource_usage:
    - fpa_ml_gpu_utilization{node}
    - fpa_ml_gpu_memory_used_bytes{node}
    - fpa_ml_queue_depth{model}
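fpa_ml_forecast_mape tracks mean absolute percentage error. For reference, a minimal sketch of the computation (skipping zero actuals, one common convention; the exact handling in the platform's pipeline may differ):

```python
def mape(actuals, forecasts) -> float:
    """Mean absolute percentage error over paired observations.

    Periods with a zero actual are skipped to avoid division by zero;
    returns a fraction (0.15 == 15% average error).
    """
    pairs = [(a, f) for a, f in zip(actuals, forecasts) if a != 0]
    return sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)
```

A value above 0.15 on the 13-week horizon trips the ForecastAccuracyDegraded warning defined in section 4.3.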

4. Alerting Rules

4.1 Critical Alerts (P1 - Page Immediately)

groups:
  - name: critical
    rules:
      - alert: ServiceDown
        expr: up{job=~"fpa-.*"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          runbook: "https://wiki.fpa/runbooks/service-down"

      - alert: DatabaseDown
        expr: pg_up == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL database is down"
          runbook: "https://wiki.fpa/runbooks/database-down"

      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) /
          sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }}) for {{ $labels.service }}"

      - alert: DatabaseReplicationLag
        expr: pg_stat_replication_lag > 30
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database replication lag is {{ $value }}s"

      - alert: CertificateExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "SSL certificate expires in {{ $value | humanizeDuration }}"

4.2 High Severity Alerts (P2 - Respond within 1 hour)

groups:
  - name: high
    rules:
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 5
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "P99 latency is {{ $value | humanizeDuration }} for {{ $labels.service }}"

      - alert: HighMemoryUsage
        expr: |
          container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.85
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Memory usage at {{ $value | humanizePercentage }} for {{ $labels.pod }}"

      - alert: HighCPUUsage
        expr: |
          sum(rate(container_cpu_usage_seconds_total[5m])) by (pod) /
          sum(container_spec_cpu_quota / container_spec_cpu_period) by (pod) > 0.8
        for: 10m
        labels:
          severity: high

      - alert: PodRestartLoop
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Pod {{ $labels.pod }} has restarted {{ $value }} times in 1 hour"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
        for: 5m
        labels:
          severity: high

4.3 Warning Alerts (P3 - Respond within 4 hours)

groups:
  - name: warning
    rules:
      - alert: ElevatedErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) /
          sum(rate(http_requests_total[5m])) by (service) > 0.01
        for: 10m
        labels:
          severity: warning

      - alert: SlowQueries
        expr: |
          pg_stat_statements_mean_time_seconds{query!~".*pg_.*"} > 1
        for: 15m
        labels:
          severity: warning

      - alert: AIAgentLowConfidence
        expr: |
          histogram_quantile(0.5, fpa_agent_confidence_score_bucket) < 0.7
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "AI agent median confidence below 70%"

      - alert: ReconciliationMatchRateDrop
        expr: |
          fpa_reconciliation_match_rate < 0.85
        for: 1h
        labels:
          severity: warning

      - alert: ForecastAccuracyDegraded
        expr: |
          fpa_ml_forecast_mape{horizon="13_weeks"} > 0.15
        for: 1d
        labels:
          severity: warning

4.4 Alert Routing

# Alertmanager configuration
route:
  receiver: 'default'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true

    - match:
        severity: critical
      receiver: 'slack-incidents'

    - match:
        severity: high
      receiver: 'pagerduty-high'

    - match:
        severity: warning
      receiver: 'slack-alerts'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_CRITICAL_KEY}'
        severity: critical

  - name: 'pagerduty-high'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_HIGH_KEY}'
        severity: high

  - name: 'slack-incidents'
    slack_configs:
      - channel: '#incidents'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'

  - name: 'slack-alerts'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true

5. Dashboards

5.1 Executive Dashboard

Purpose: SLA compliance and business KPIs for leadership

Panels:

  • Platform Availability (30-day rolling)
  • Error Budget Remaining
  • Monthly Active Users
  • Transaction Volume
  • Reconciliation Match Rate (trend)
  • Forecast Accuracy (trailing)
  • Top 5 Errors This Week

5.2 Service Health Dashboard

Purpose: Real-time service status for engineering

Panels per Service:

┌───────────────────────────────────────────────────────┐
│ GL SERVICE                                            │
├───────────────────────────────────────────────────────┤
│ ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│ │ Availability │  │ Error Rate   │  │ P99 Latency  │  │
│ │ 99.95%       │  │ 0.02%        │  │ 245ms        │  │
│ └──────────────┘  └──────────────┘  └──────────────┘  │
│                                                       │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Request Rate (5m avg)                             │ │
│ │ ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇                             │ │
│ └───────────────────────────────────────────────────┘ │
│                                                       │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Latency Distribution                              │ │
│ │ P50: 45ms | P90: 120ms | P99: 245ms               │ │
│ └───────────────────────────────────────────────────┘ │
│                                                       │
│ Pods: 3/3 Running | CPU: 45% | Memory: 62%            │
└───────────────────────────────────────────────────────┘

5.3 On-Call Dashboard

Purpose: Incident triage and quick diagnostics

Panels:

  • Active Alerts (grouped by severity)
  • Recent Deployments
  • Error Rate by Endpoint (last 1h)
  • Slow Queries
  • Pod Restarts
  • External Dependencies Status
  • Quick Links (runbooks, logs, traces)

5.4 AI/ML Dashboard

Purpose: Model performance and resource monitoring

Panels:

  • Model Inference Latency (by model)
  • Token Usage (by agent)
  • GPU Utilization
  • Queue Depth
  • Confidence Score Distribution
  • Reconciliation Model Accuracy (rolling 7-day)
  • Forecast Model MAPE (by horizon)
  • Hallucination Detection Rate

5.5 Cost Dashboard

Purpose: Cloud spend monitoring

Panels:

  • Daily/Monthly Spend (vs budget)
  • Cost by Service
  • AI Token Costs (by model)
  • Resource Efficiency (CPU/Memory utilization)
  • Top 10 Expensive Queries
  • Recommendations for Optimization

6. Logging Standards

6.1 Log Format

{
  "timestamp": "2026-02-03T10:15:30.123Z",
  "level": "INFO",
  "service": "gl-service",
  "version": "1.2.3",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "tenant_id": "tenant_acme",
  "user_id": "user_john",
  "message": "Journal entry created",
  "attributes": {
    "journal_entry_id": "je_001",
    "amount": 10000.00,
    "duration_ms": 45
  }
}
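One way to emit this shape from Python's stdlib logging, as a sketch: the service name is hard-coded here for illustration, and in practice the version, tenant, and trace fields would be injected from configuration and the tracing context.

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON matching the schema above."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": time.strftime(
                "%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)
            ) + f".{int(record.msecs):03d}Z",
            "level": record.levelname,
            "service": "gl-service",  # placeholder; set per service
            "message": record.getMessage(),
        }
        # Structured payload passed via logger.info(..., extra={"attributes": ...})
        if hasattr(record, "attributes"):
            entry["attributes"] = record.attributes
        return json.dumps(entry)
```

Attaching the formatter to a handler makes `logger.info("Journal entry created", extra={"attributes": {...}})` produce the entry shown above.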

6.2 Log Levels

| Level | Use Case | Examples |
|-------|----------|----------|
| ERROR | Failures requiring attention | Unhandled exceptions, failed operations |
| WARN | Potential issues | Deprecated API usage, near-limit resources |
| INFO | Normal operations | Request completed, job finished |
| DEBUG | Detailed troubleshooting | SQL queries, intermediate state |

6.3 PII Redaction

import re

REDACT_PATTERNS = {
    'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
    'credit_card': r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b',
    'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    'api_key': r'\b(sk|pk|api)_[A-Za-z0-9]{20,}\b',
}

def redact_pii(log_message: str) -> str:
    for field, pattern in REDACT_PATTERNS.items():
        log_message = re.sub(pattern, f'[REDACTED:{field}]', log_message)
    return log_message
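For illustration, a self-contained excerpt showing the substitution behavior on a sample log line (two of the patterns repeated so the snippet runs on its own):

```python
import re

# Excerpt of the redaction patterns, repeated here for a runnable example
PATTERNS = {
    'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
    'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
}

def redact(message: str) -> str:
    """Replace each match with a [REDACTED:<field>] placeholder."""
    for field, pattern in PATTERNS.items():
        message = re.sub(pattern, f'[REDACTED:{field}]', message)
    return message

# redact("User john@acme.com, SSN 123-45-6789")
#   -> "User [REDACTED:email], SSN [REDACTED:ssn]"
```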

7. Distributed Tracing

7.1 Trace Propagation

# OpenTelemetry configuration
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor

# Auto-instrument frameworks
FastAPIInstrumentor.instrument_app(app)
SQLAlchemyInstrumentor().instrument(engine=engine)
RedisInstrumentor().instrument()

# Custom spans for business operations
tracer = trace.get_tracer(__name__)

async def create_journal_entry(entry: JournalEntry):
    with tracer.start_as_current_span("create_journal_entry") as span:
        span.set_attribute("entity_id", entry.entity_id)
        span.set_attribute("amount", float(entry.total_debit))

        # Nested span for validation
        with tracer.start_as_current_span("validate_entry"):
            await validate_entry(entry)

        # Nested span for persistence
        with tracer.start_as_current_span("persist_entry"):
            await db.save(entry)

7.2 Sampling Strategy

sampling:
  default: 0.1  # 10% of traces sampled by default

  rules:
    - service: api-gateway
      sample_rate: 0.05  # High volume, lower sampling

    - operation: "create_journal_entry"
      sample_rate: 1.0  # Always sample financial operations

    - http.status_code: "5xx"
      sample_rate: 1.0  # Always sample errors

    - duration_ms: ">5000"
      sample_rate: 1.0  # Always sample slow requests
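Head-based sampling decisions must be consistent across services so a trace is kept or dropped as a whole. A hedged sketch of one common approach (not necessarily what the deployed sampler does): hash the trace ID into [0, 1) and compare against the rate.

```python
import hashlib

def should_sample(trace_id: str, sample_rate: float) -> bool:
    """Deterministic keep/drop decision for head-based sampling.

    Every service hashing the same trace ID reaches the same
    decision, so a trace is never half-sampled across hops.
    """
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16)
    return bucket / 0x1_0000_0000 < sample_rate
```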

8. Health Checks

8.1 Kubernetes Probes

# Deployment health probes
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 5
  failureThreshold: 30

8.2 Health Check Endpoints

from fastapi.responses import JSONResponse

@app.get("/health/live")
async def liveness():
    """Basic liveness check - is the process running?"""
    return {"status": "ok"}

@app.get("/health/ready")
async def readiness():
    """Readiness check - can the service handle requests?"""
    checks = {
        "database": await check_database(),
        "redis": await check_redis(),
        "dependencies": await check_dependencies(),
    }

    all_healthy = all(c["healthy"] for c in checks.values())
    status_code = 200 if all_healthy else 503

    return JSONResponse(
        content={"status": "ready" if all_healthy else "not_ready", "checks": checks},
        status_code=status_code
    )

@app.get("/health/startup")
async def startup():
    """Startup check - has initialization completed?"""
    if not app.state.initialized:
        return JSONResponse({"status": "initializing"}, status_code=503)
    return {"status": "started"}
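Each dependency check (check_database and friends, defined elsewhere) should be bounded by its own timeout so one hung dependency cannot stall the whole readiness probe. A minimal sketch of such a wrapper, assuming each probe is an awaitable:

```python
import asyncio

async def run_check(probe, timeout_s: float = 2.0) -> dict:
    """Run one dependency probe with a hard deadline.

    A hung dependency fails fast with healthy=False instead of
    blocking the readiness endpoint past the kubelet's probe timeout.
    """
    try:
        await asyncio.wait_for(probe, timeout=timeout_s)
        return {"healthy": True}
    except Exception as exc:
        return {"healthy": False, "error": type(exc).__name__}
```

check_database could then be implemented as `await run_check(ping(), timeout_s=2.0)` for whatever ping coroutine the database client provides.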

9. Synthetic Monitoring

9.1 Critical User Journeys

synthetic_tests:
  - name: "Login Flow"
    schedule: "*/5 * * * *"  # Every 5 minutes
    steps:
      - visit: "https://app.fpa-platform.com/login"
      - fill: { selector: "#email", value: "${SYNTHETIC_USER}" }
      - fill: { selector: "#password", value: "${SYNTHETIC_PASS}" }
      - click: { selector: "#login-button" }
      - wait: { selector: "#dashboard", timeout: 10s }
    assertions:
      - response_time: "<3s"
      - status: 200

  - name: "Create Journal Entry"
    schedule: "*/15 * * * *"
    steps:
      - authenticate: "${SYNTHETIC_TOKEN}"
      - post:
          url: "/api/gl/journal-entries"
          body: { ... }
      - assert:
          status: 201
          response_time: "<2s"

  - name: "Run Reconciliation"
    schedule: "0 * * * *"  # Every hour
    steps:
      - authenticate: "${SYNTHETIC_TOKEN}"
      - post:
          url: "/api/recon/sessions"
          body: { entity_id: "test_entity" }
      - poll:
          url: "/api/recon/sessions/${session_id}"
          until: { status: "completed" }
          timeout: 5m