Observability and SLOs

Logging Standard

  • Structured JSON logs in production.
  • Required fields: timestamp, level, trace_id, flow, step, event_topic, request_id, duration_ms, error.
  • Log redaction for secrets and PII.
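
The standard above could be sketched with Python's standard logging module as follows. This is a minimal illustration, not the project's implementation: the `JsonFormatter` class, the `redact` helper, the `SENSITIVE_KEYS` set, the email pattern, and the extra `message` and `context` fields are all assumptions made for the example.

```python
import json
import logging
import re
from datetime import datetime, timezone

# Required fields from the logging standard.
REQUIRED_FIELDS = ("timestamp", "level", "trace_id", "flow", "step",
                   "event_topic", "request_id", "duration_ms", "error")

# Hypothetical redaction rules; a real deployment needs a reviewed list.
SENSITIVE_KEYS = {"password", "token", "authorization", "api_key"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(value):
    """Recursively mask sensitive keys and email-like strings."""
    if isinstance(value, dict):
        return {k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else redact(v)
                for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        return EMAIL_RE.sub("[REDACTED_EMAIL]", value)
    return value


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record; missing required fields become null."""

    def format(self, record):
        entry = {name: getattr(record, name, None) for name in REQUIRED_FIELDS}
        entry["timestamp"] = datetime.fromtimestamp(
            record.created, timezone.utc).isoformat()
        entry["level"] = record.levelname
        entry["message"] = record.getMessage()          # assumed extra field
        entry["context"] = getattr(record, "context", {})  # assumed extra field
        return json.dumps(redact(entry))
```

Callers would supply the remaining fields through the `extra` argument, for example `logger.info("step finished", extra={"trace_id": tid, "flow": "checkout", "duration_ms": 12})`, where `tid` and the flow name are placeholders.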

Tracing Standard

  • Create a trace at API entry and attach its context to every emitted event.
  • Create a child span for each step execution and each state or stream operation.
  • Propagate trace_id across event adapter boundaries.
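
One way to picture the propagation rules above is a small, library-free sketch. The `SpanContext` type, the envelope layout, and the `emit_event` and `handle_event` functions are hypothetical names chosen for illustration; a production system would more likely rely on an established tracing library.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None

    def child(self) -> "SpanContext":
        """Open a child span that shares the trace_id."""
        return SpanContext(self.trace_id, uuid.uuid4().hex[:16], self.span_id)


def start_trace() -> SpanContext:
    """Called once at API entry."""
    return SpanContext(uuid.uuid4().hex, uuid.uuid4().hex[:16])


def emit_event(topic: str, payload: dict, ctx: SpanContext) -> dict:
    """Wrap the payload in an envelope that carries the trace context."""
    return {
        "topic": topic,
        "payload": payload,
        "trace": {"trace_id": ctx.trace_id, "parent_span_id": ctx.span_id},
    }


def handle_event(envelope: dict) -> SpanContext:
    """Adapter side: restore the context and open a span for the step."""
    trace = envelope["trace"]
    return SpanContext(trace["trace_id"], uuid.uuid4().hex[:16],
                       trace["parent_span_id"])
```

Because the envelope carries both `trace_id` and the emitting span's id, a step that runs behind an event adapter can still be linked to the API request that triggered it.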

Metrics Baseline

  • api.request.count
  • api.request.error.count
  • api.request.latency_ms
  • step.execution.count
  • step.execution.error.count
  • step.execution.latency_ms
  • queue.depth
  • queue.lag_ms
  • stream.subscriptions.active
  • state.operation.count
  • state.operation.error.count
  • runtime.memory.rss_bytes
  • runtime.cpu.percent
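
As an illustration of how the count, error count, and latency triplets above relate to each other, the sketch below records all three from a single context manager. The in-memory `counters` and `histograms` stores and the `measure` helper are assumptions for the example; a real service would export to its metrics backend instead.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory stores standing in for a metrics backend.
counters = defaultdict(int)
histograms = defaultdict(list)


@contextmanager
def measure(prefix):
    """Record <prefix>.count, <prefix>.error.count and <prefix>.latency_ms."""
    start = time.perf_counter()
    counters[f"{prefix}.count"] += 1
    try:
        yield
    except Exception:
        counters[f"{prefix}.error.count"] += 1
        raise  # never swallow the caller's error
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        histograms[f"{prefix}.latency_ms"].append(elapsed_ms)
```

Wrapping a step body in `with measure("step.execution"):` keeps the three metrics consistent, since every execution is counted once and timed once whether it succeeds or fails.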

Initial SLO Targets

  • API availability: 99.9% monthly.
  • API latency: p95 < 300ms, p99 < 1s for lightweight endpoints.
  • Step execution error rate: < 1% over 30 days.
  • Event processing lag: p95 < 2s.
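
The latency targets above can be checked against raw samples with a percentile function. The sketch below uses the nearest-rank definition, which is one of several common percentile definitions and is an assumption here; `percentile` and `latency_slo_met` are hypothetical helper names.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_slo_met(latencies_ms):
    """Check the API latency targets: p95 < 300 ms and p99 < 1000 ms."""
    return (percentile(latencies_ms, 95) < 300
            and percentile(latencies_ms, 99) < 1000)
```

With 100 samples, a single slow request does not break the p99 target under this definition, but two slow requests do, which shows how sensitive tail percentiles are at low traffic volumes.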

Error Budgets

  • 0.1% monthly unavailability budget for API.
  • 1% monthly error budget for step executions.
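
To make these budgets concrete, the arithmetic can be sketched as below. The 30-day month is an assumption (calendar months vary in length), and both function names are hypothetical.

```python
def downtime_budget_minutes(availability_target, days=30):
    """Minutes of allowed unavailability for a target such as 0.999.

    Assumes a 30-day month unless told otherwise.
    """
    return (1 - availability_target) * days * 24 * 60


def budget_remaining(total_executions, failed_executions, error_budget=0.01):
    """Fraction of the step execution error budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    allowed_failures = total_executions * error_budget
    return 1 - failed_executions / allowed_failures
```

Under these assumptions, a 99.9% availability target leaves roughly 43.2 minutes of downtime in a 30-day month, and 2,500 failures out of 1,000,000 step executions consume a quarter of the 1% budget.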

Alerting

  • Error rate stays above 2x baseline for 5 minutes.
  • p95 latency exceeds target for 10 minutes.
  • Queue depth exceeds configured max for 10 minutes.
  • Stream subscription failures exceed 1% over 5 minutes.
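
Each rule above pairs a threshold with a duration, so a single bad sample should not fire an alert. A minimal sketch of that "sustained breach" check follows; the `sustained_breach` name and the `(timestamp_seconds, value)` sample format are assumptions for illustration.

```python
def sustained_breach(samples, threshold, duration_s):
    """Return True when the most recent samples have all exceeded the
    threshold for at least duration_s seconds.

    samples: list of (timestamp_seconds, value), sorted by timestamp.
    """
    if not samples:
        return False
    latest_ts = samples[-1][0]
    streak_start = None
    # Walk backwards until the first sample that is within the threshold.
    for ts, value in reversed(samples):
        if value <= threshold:
            break
        streak_start = ts
    return streak_start is not None and latest_ts - streak_start >= duration_s
```

For the error rate rule, the threshold would be twice the measured baseline and `duration_s` would be 300. A recovery resets the streak, so the alert clears as soon as one sample falls back within the threshold.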

Reporting

  • Weekly SLO report with trend lines for latency, error rate, and availability.
  • Release gates require no SLO violations in the last 7 days.
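
The release gate could be expressed as a simple lookback check over recorded SLO violations. This is a sketch under assumptions: `release_allowed` is a hypothetical helper, and violations are assumed to be stored as timezone-aware timestamps.

```python
from datetime import datetime, timedelta, timezone


def release_allowed(violation_times, now, lookback=timedelta(days=7)):
    """Return True when no SLO violation falls inside the lookback window."""
    cutoff = now - lookback
    return not any(cutoff <= t <= now for t in violation_times)


# Example: a violation two days ago blocks the release.
now = datetime(2024, 1, 15, tzinfo=timezone.utc)
blocked = not release_allowed([now - timedelta(days=2)], now)
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the gate deterministic and easy to test in a CI pipeline.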