FP&A Platform — Monitoring & Alerting Specification
Version: 1.0
Last Updated: 2026-02-03
Document ID: OPS-002
Classification: Internal
1. Overview
This document defines the monitoring, alerting, and observability strategy for the FP&A Platform.
Observability Stack
| Component | Tool | Purpose |
|---|---|---|
| Metrics | Prometheus + Grafana | Time-series metrics, dashboards |
| Logs | Loki | Log aggregation and querying |
| Traces | OpenTelemetry + Jaeger | Distributed tracing |
| Alerting | Alertmanager + PagerDuty | Alert routing and escalation |
| Synthetic | Blackbox Exporter | Endpoint monitoring |
| APM | Custom + OpenTelemetry | Application performance |
2. Service Level Objectives (SLOs)
2.1 Availability SLOs
| Service Tier | SLO | Error Budget (monthly) | Measurement |
|---|---|---|---|
| Tier 1 - Critical | 99.9% | 43.2 minutes | Successful requests / total requests |
| Tier 2 - Essential | 99.5% | 3.6 hours | Successful requests / total requests |
| Tier 3 - Important | 99.0% | 7.2 hours | Successful requests / total requests |
Tier 1 Services: API Gateway, Authentication, GL Service, Database
Tier 2 Services: Reconciliation, Reporting, Integrations
Tier 3 Services: AI Agents, Forecasting, Analytics
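The error-budget column follows directly from each SLO; a minimal sketch of the arithmetic, assuming a 30-day month:

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed unavailability per period for a given availability SLO."""
    return days * 24 * 60 * (1 - slo)

# Tier 1: 99.9% over 30 days -> 43.2 minutes
print(round(error_budget_minutes(0.999), 1))       # 43.2
# Tier 2: 99.5% -> 3.6 hours; Tier 3: 99.0% -> 7.2 hours
print(round(error_budget_minutes(0.995) / 60, 1))  # 3.6
print(round(error_budget_minutes(0.99) / 60, 1))   # 7.2
```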
2.2 Latency SLOs
| Endpoint Category | P50 | P95 | P99 |
|---|---|---|---|
| Authentication | 100ms | 300ms | 500ms |
| Simple Reads | 50ms | 200ms | 500ms |
| Complex Queries | 500ms | 2s | 5s |
| Report Generation | 2s | 10s | 30s |
| AI Agent Response | 5s | 15s | 30s |
| Batch Operations | 30s | 2m | 5m |
2.3 SLI Definitions
slis:
  availability:
    name: "Service Availability"
    description: "Percentage of successful requests"
    formula: |
      sum(rate(http_requests_total{status!~"5.."}[5m])) /
      sum(rate(http_requests_total[5m]))
  latency_p99:
    name: "Request Latency P99"
    description: "99th percentile request latency"
    formula: |
      histogram_quantile(0.99,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
      )
  error_rate:
    name: "Error Rate"
    description: "Percentage of requests resulting in errors"
    formula: |
      sum(rate(http_requests_total{status=~"5.."}[5m])) /
      sum(rate(http_requests_total[5m]))
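A common way to act on the error-rate SLI is burn-rate alerting: compare the measured error rate to the rate that would consume the budget exactly at the end of the SLO window. A sketch of the core ratio (illustrative only; the function name is not part of the spec):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Rate of error-budget consumption.
    1.0 means the budget runs out exactly at the end of the SLO window;
    higher values exhaust it proportionally sooner."""
    return error_rate / (1 - slo)

# A 1% error rate against a 99.9% SLO burns the budget 10x too fast,
# i.e. a 30-day budget would be gone in about 3 days.
print(round(burn_rate(0.01, 0.999), 2))  # 10.0
```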
3. Metrics Collection
3.1 Infrastructure Metrics
Kubernetes Metrics
# kube-state-metrics configuration
metrics:
  pods:
    - kube_pod_status_phase
    - kube_pod_container_status_restarts_total
    - kube_pod_container_resource_requests
    - kube_pod_container_resource_limits
  deployments:
    - kube_deployment_status_replicas
    - kube_deployment_status_replicas_available
    - kube_deployment_status_replicas_unavailable
  nodes:
    - kube_node_status_condition
    - kube_node_status_capacity
    - kube_node_status_allocatable
Resource Metrics
| Metric | Source | Alert Threshold |
|---|---|---|
| container_cpu_usage_seconds_total | cAdvisor | >80% sustained |
| container_memory_usage_bytes | cAdvisor | >85% of limit |
| container_fs_usage_bytes | cAdvisor | >80% of limit |
| container_network_receive_bytes_total | cAdvisor | Anomaly detection |
| container_network_transmit_bytes_total | cAdvisor | Anomaly detection |
3.2 Database Metrics (PostgreSQL)
postgres_metrics:
  connections:
    - pg_stat_activity_count            # Active connections
    - pg_settings_max_connections       # Max allowed
  performance:
    - pg_stat_statements_total_time     # Query time
    - pg_stat_statements_calls          # Query count
    - pg_stat_user_tables_seq_scan      # Sequential scans
    - pg_stat_user_tables_idx_scan      # Index scans
  replication:
    - pg_stat_replication_lag           # Replication lag (seconds)
    - pg_stat_wal_receiver_lag          # WAL receiver lag
  storage:
    - pg_database_size_bytes            # Database size
    - pg_stat_user_tables_dead_tuples   # Dead tuples (vacuum needed)
  locks:
    - pg_locks_count                    # Lock count by mode
    - pg_stat_activity_waiting          # Queries waiting on locks

# Alert thresholds
alerts:
  - name: PostgresConnectionsHigh
    expr: pg_stat_activity_count / pg_settings_max_connections > 0.8
    for: 5m
    severity: warning
  - name: PostgresReplicationLag
    expr: pg_stat_replication_lag > 10   # seconds
    for: 1m
    severity: warning                    # escalates to critical at >30s (see 4.1)
3.3 Application Metrics
HTTP Metrics (RED Method)
# FastAPI metrics middleware
from prometheus_client import Counter, Gauge, Histogram

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status', 'service']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint', 'service'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 30]
)

REQUEST_IN_PROGRESS = Gauge(
    'http_requests_in_progress',
    'HTTP requests currently in progress',
    ['method', 'endpoint', 'service']
)
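For intuition on how the bucket boundaries above turn into latency SLIs: `histogram_quantile` finds the bucket containing the target rank and interpolates linearly within it. A rough pure-Python sketch of that estimate (illustrative only, not the production query path; the `+Inf` bucket is omitted):

```python
def bucket_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) histogram
    buckets, interpolating linearly inside the containing bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:
                return bound
            # linear interpolation between the bucket's bounds
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Cumulative counts: 90 requests <= 0.1s, 99 <= 0.25s, 100 <= 0.5s
buckets = [(0.1, 90), (0.25, 99), (0.5, 100)]
print(round(bucket_quantile(0.99, buckets), 4))  # 0.25
```

This is why bucket choice matters: a P99 estimate can never be more precise than the bucket edges around it.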
Business Metrics
business_metrics:
  accounting:
    - fpa_journal_entries_total{status}                 # Journal entries by status
    - fpa_journal_entries_amount_total{entity}          # Total amounts
    - fpa_period_close_duration_seconds{entity}         # Close time
  reconciliation:
    - fpa_reconciliation_match_rate{entity}             # Match percentage
    - fpa_reconciliation_exceptions_total{entity,type}  # Exception count
    - fpa_reconciliation_duration_seconds{entity}       # Recon time
  forecasting:
    - fpa_forecast_accuracy_mape{entity,horizon}        # Forecast accuracy
    - fpa_forecast_generation_duration_seconds          # Generation time
  ai_agents:
    - fpa_agent_requests_total{agent_type}              # Agent invocations
    - fpa_agent_tokens_used_total{agent_type,model}     # Token usage
    - fpa_agent_duration_seconds{agent_type}            # Agent latency
    - fpa_agent_confidence_score{agent_type}            # Confidence distribution
3.4 AI/ML Metrics
ml_metrics:
  inference:
    - fpa_ml_inference_latency_seconds{model}
    - fpa_ml_inference_requests_total{model,status}
    - fpa_ml_tokens_input_total{model}
    - fpa_ml_tokens_output_total{model}
  model_performance:
    - fpa_ml_prediction_confidence{model}
    - fpa_ml_recon_match_accuracy{model}   # Rolling accuracy
    - fpa_ml_forecast_mape{model,horizon}  # Forecast error
  resource_usage:
    - fpa_ml_gpu_utilization{node}
    - fpa_ml_gpu_memory_used_bytes{node}
    - fpa_ml_queue_depth{model}
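The `fpa_ml_forecast_mape` gauge reports mean absolute percentage error, which has a simple definition worth keeping at hand when reading the dashboards. A sketch (zero actuals are skipped here to avoid division by zero; the production metric may handle them differently):

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error; lower is better. Zero actuals are skipped."""
    pairs = [(a, f) for a, f in zip(actuals, forecasts) if a != 0]
    return sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)

# 10% average miss on both points -> MAPE of 0.10,
# comfortably under the 0.15 warning threshold in section 4.3
print(round(mape([100, 200], [90, 220]), 2))  # 0.1
```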
4. Alerting Rules
4.1 Critical Alerts (P1 - Page Immediately)
groups:
  - name: critical
    rules:
      - alert: ServiceDown
        expr: up{job=~"fpa-.*"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          runbook: "https://wiki.fpa/runbooks/service-down"
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL database is down"
          runbook: "https://wiki.fpa/runbooks/database-down"
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) /
          sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }}) for {{ $labels.service }}"
      - alert: DatabaseReplicationLag
        expr: pg_stat_replication_lag > 30
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database replication lag is {{ $value }}s"
      - alert: CertificateExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "SSL certificate expires in {{ $value | humanizeDuration }}"
4.2 High Severity Alerts (P2 - Respond within 1 hour)
groups:
  - name: high
    rules:
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 5
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "P99 latency is {{ $value | humanizeDuration }} for {{ $labels.service }}"
      - alert: HighMemoryUsage
        expr: |
          container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.85
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Memory usage at {{ $value | humanizePercentage }} for {{ $labels.pod }}"
      - alert: HighCPUUsage
        expr: |
          sum(rate(container_cpu_usage_seconds_total[5m])) by (pod) /
          sum(container_spec_cpu_quota / container_spec_cpu_period) by (pod) > 0.8
        for: 10m
        labels:
          severity: high
        annotations:
          summary: "CPU usage at {{ $value | humanizePercentage }} of limit for {{ $labels.pod }}"
      - alert: PodRestartLoop
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Pod {{ $labels.pod }} has restarted {{ $value }} times in 1 hour"
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Less than 15% disk space free on {{ $labels.instance }}"
4.3 Warning Alerts (P3 - Respond within 4 hours)
groups:
  - name: warning
    rules:
      - alert: ElevatedErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) /
          sum(rate(http_requests_total[5m])) by (service) > 0.01
        for: 10m
        labels:
          severity: warning
      - alert: SlowQueries
        expr: |
          pg_stat_statements_mean_time_seconds{query!~".*pg_.*"} > 1
        for: 15m
        labels:
          severity: warning
      - alert: AIAgentLowConfidence
        expr: |
          histogram_quantile(0.5,
            sum(rate(fpa_agent_confidence_score_bucket[30m])) by (le, agent_type)
          ) < 0.7
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "AI agent median confidence below 70%"
      - alert: ReconciliationMatchRateDrop
        expr: |
          fpa_reconciliation_match_rate < 0.85
        for: 1h
        labels:
          severity: warning
      - alert: ForecastAccuracyDegraded
        expr: |
          fpa_ml_forecast_mape{horizon="13_weeks"} > 0.15
        for: 1d
        labels:
          severity: warning
4.4 Alert Routing
# Alertmanager configuration
route:
  receiver: 'default'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: critical
      receiver: 'slack-incidents'
    - match:
        severity: high
      receiver: 'pagerduty-high'
    - match:
        severity: warning
      receiver: 'slack-alerts'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_CRITICAL_KEY}'
        severity: critical
  - name: 'pagerduty-high'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_HIGH_KEY}'
        severity: high
  - name: 'slack-incidents'
    slack_configs:
      - channel: '#incidents'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
  - name: 'slack-alerts'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
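Note that a critical alert fans out to both PagerDuty and Slack only because the first critical route sets `continue: true`; without it, matching stops at the first hit. A toy sketch of that first-match-plus-continue semantics (not Alertmanager's actual implementation):

```python
def route(labels, routes, default="default"):
    """Return the receivers for an alert: the first matching route wins,
    unless it sets continue, in which case later matches also fire."""
    receivers = []
    for r in routes:
        if all(labels.get(k) == v for k, v in r["match"].items()):
            receivers.append(r["receiver"])
            if not r.get("continue"):
                break
    return receivers or [default]

routes = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical", "continue": True},
    {"match": {"severity": "critical"}, "receiver": "slack-incidents"},
    {"match": {"severity": "warning"}, "receiver": "slack-alerts"},
]
print(route({"severity": "critical"}, routes))  # ['pagerduty-critical', 'slack-incidents']
print(route({"severity": "warning"}, routes))   # ['slack-alerts']
```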
5. Dashboards
5.1 Executive Dashboard
Purpose: SLA compliance and business KPIs for leadership
Panels:
- Platform Availability (30-day rolling)
- Error Budget Remaining
- Monthly Active Users
- Transaction Volume
- Reconciliation Match Rate (trend)
- Forecast Accuracy (trailing)
- Top 5 Errors This Week
5.2 Service Health Dashboard
Purpose: Real-time service status for engineering
Panels per Service:
┌─────────────────────────────────────────────────────────────┐
│ GL SERVICE │
├─────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Availability │ │ Error Rate │ │ P99 Lat │ │
│ │ 99.95% │ │ 0.02% │ │ 245ms │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Request Rate (5m avg) │ │
│ │ ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇ │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Latency Distribution │ │
│ │ P50: 45ms | P90: 120ms | P99: 245ms │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ Pods: 3/3 Running | CPU: 45% | Memory: 62% │
└─────────────────────────────────────────────────────────────┘
5.3 On-Call Dashboard
Purpose: Incident triage and quick diagnostics
Panels:
- Active Alerts (grouped by severity)
- Recent Deployments
- Error Rate by Endpoint (last 1h)
- Slow Queries
- Pod Restarts
- External Dependencies Status
- Quick Links (runbooks, logs, traces)
5.4 AI/ML Dashboard
Purpose: Model performance and resource monitoring
Panels:
- Model Inference Latency (by model)
- Token Usage (by agent)
- GPU Utilization
- Queue Depth
- Confidence Score Distribution
- Reconciliation Model Accuracy (rolling 7-day)
- Forecast Model MAPE (by horizon)
- Hallucination Detection Rate
5.5 Cost Dashboard
Purpose: Cloud spend monitoring
Panels:
- Daily/Monthly Spend (vs budget)
- Cost by Service
- AI Token Costs (by model)
- Resource Efficiency (CPU/Memory utilization)
- Top 10 Expensive Queries
- Recommendations for Optimization
6. Logging Standards
6.1 Log Format
{
  "timestamp": "2026-02-03T10:15:30.123Z",
  "level": "INFO",
  "service": "gl-service",
  "version": "1.2.3",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "tenant_id": "tenant_acme",
  "user_id": "user_john",
  "message": "Journal entry created",
  "attributes": {
    "journal_entry_id": "je_001",
    "amount": 10000.00,
    "duration_ms": 45
  }
}
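A minimal stdlib `logging` formatter that emits this shape can look like the sketch below. The trace, tenant, and user fields would normally come from request context and are omitted here; the class name is illustrative, not part of the spec.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON matching the platform log schema."""

    def __init__(self, service: str, version: str):
        super().__init__()
        self.service, self.version = service, version

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "service": self.service,
            "version": self.version,
            "message": record.getMessage(),
            # extra={"attributes": {...}} on the logging call lands here
            "attributes": getattr(record, "attributes", {}),
        }
        return json.dumps(entry)

logger = logging.getLogger("gl-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("gl-service", "1.2.3"))
logger.addHandler(handler)
logger.info("Journal entry created", extra={"attributes": {"journal_entry_id": "je_001"}})
```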
6.2 Log Levels
| Level | Use Case | Examples |
|---|---|---|
| ERROR | Failures requiring attention | Unhandled exceptions, failed operations |
| WARN | Potential issues | Deprecated API usage, near-limit resources |
| INFO | Normal operations | Request completed, job finished |
| DEBUG | Detailed troubleshooting | SQL queries, intermediate state |
6.3 PII Redaction
import re

REDACT_PATTERNS = {
    'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
    'credit_card': r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b',
    'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    'api_key': r'\b(sk|pk|api)_[A-Za-z0-9]{20,}\b',
}

def redact_pii(log_message: str) -> str:
    """Replace anything matching a known PII pattern before the line is emitted."""
    for field, pattern in REDACT_PATTERNS.items():
        log_message = re.sub(pattern, f'[REDACTED:{field}]', log_message)
    return log_message
7. Distributed Tracing
7.1 Trace Propagation
# OpenTelemetry configuration
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor

# Auto-instrument frameworks
FastAPIInstrumentor.instrument_app(app)
SQLAlchemyInstrumentor().instrument(engine=engine)
RedisInstrumentor().instrument()

# Custom spans for business operations
tracer = trace.get_tracer(__name__)

async def create_journal_entry(entry: JournalEntry):
    with tracer.start_as_current_span("create_journal_entry") as span:
        span.set_attribute("entity_id", entry.entity_id)
        span.set_attribute("amount", float(entry.total_debit))

        # Nested span for validation
        with tracer.start_as_current_span("validate_entry"):
            await validate_entry(entry)

        # Nested span for persistence
        with tracer.start_as_current_span("persist_entry"):
            await db.save(entry)
7.2 Sampling Strategy
sampling:
  default: 0.1                          # 10% of traces sampled by default
  rules:
    - service: api-gateway
      sample_rate: 0.05                 # High volume, lower sampling
    - operation: "create_journal_entry"
      sample_rate: 1.0                  # Always sample financial operations
    - http.status_code: "5xx"
      sample_rate: 1.0                  # Always sample errors
    - duration_ms: ">5000"
      sample_rate: 1.0                  # Always sample slow requests
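A toy head-sampling decision implementing these rules can be sketched as below. The precedence (always-sample rules before the service override) and the attribute names are assumptions for illustration, not the sampler's actual configuration keys.

```python
import random

def sample_rate(attrs: dict, default: float = 0.1) -> float:
    """Return the sampling probability for a span, mirroring the rules above."""
    if attrs.get("operation") == "create_journal_entry":
        return 1.0                        # always keep financial operations
    if str(attrs.get("http.status_code", "")).startswith("5"):
        return 1.0                        # always keep errors
    if attrs.get("duration_ms", 0) > 5000:
        return 1.0                        # always keep slow requests
    if attrs.get("service") == "api-gateway":
        return 0.05                       # high volume, lower sampling
    return default

def should_sample(attrs: dict) -> bool:
    return random.random() < sample_rate(attrs)

print(sample_rate({"http.status_code": 503}))   # 1.0
print(sample_rate({"service": "api-gateway"}))  # 0.05
```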
8. Health Checks
8.1 Kubernetes Probes
# Deployment health probes
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 5
  failureThreshold: 30
8.2 Health Check Endpoints
from fastapi.responses import JSONResponse

@app.get("/health/live")
async def liveness():
    """Basic liveness check - is the process running?"""
    return {"status": "ok"}

@app.get("/health/ready")
async def readiness():
    """Readiness check - can the service handle requests?"""
    checks = {
        "database": await check_database(),
        "redis": await check_redis(),
        "dependencies": await check_dependencies(),
    }
    all_healthy = all(c["healthy"] for c in checks.values())
    return JSONResponse(
        content={"status": "ready" if all_healthy else "not_ready", "checks": checks},
        status_code=200 if all_healthy else 503,
    )

@app.get("/health/startup")
async def startup():
    """Startup check - has initialization completed?"""
    if not app.state.initialized:
        return JSONResponse({"status": "initializing"}, status_code=503)
    return {"status": "started"}
9. Synthetic Monitoring
9.1 Critical User Journeys
synthetic_tests:
  - name: "Login Flow"
    schedule: "*/5 * * * *"              # Every 5 minutes
    steps:
      - visit: "https://app.fpa-platform.com/login"
      - fill: { selector: "#email", value: "${SYNTHETIC_USER}" }
      - fill: { selector: "#password", value: "${SYNTHETIC_PASS}" }
      - click: { selector: "#login-button" }
      - wait: { selector: "#dashboard", timeout: 10s }
    assertions:
      - response_time: "<3s"
      - status: 200

  - name: "Create Journal Entry"
    schedule: "*/15 * * * *"
    steps:
      - authenticate: "${SYNTHETIC_TOKEN}"
      - post:
          url: "/api/gl/journal-entries"
          body: { ... }
      - assert:
          status: 201
          response_time: "<2s"

  - name: "Run Reconciliation"
    schedule: "0 * * * *"                # Every hour
    steps:
      - authenticate: "${SYNTHETIC_TOKEN}"
      - post:
          url: "/api/recon/sessions"
          body: { entity_id: "test_entity" }
      - poll:
          url: "/api/recon/sessions/${session_id}"
          until: { status: "completed" }
          timeout: 5m
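The `poll ... until` step in the reconciliation journey can be sketched as a generic helper; the `fetch_status` callable, timings, and injectable `sleep`/`clock` parameters are placeholders for illustration:

```python
import time

def poll_until(fetch_status, want="completed", timeout_s=300, interval_s=5,
               sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns `want` or the timeout elapses.
    Returns True on success, False on timeout. sleep/clock are injectable
    so tests can run without real waiting."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if fetch_status() == want:
            return True
        sleep(interval_s)
    return False

# Simulated session that completes on the third check (no real sleeping)
states = iter(["pending", "running", "completed"])
assert poll_until(lambda: next(states), sleep=lambda _: None) is True
```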