Logging Standard
- Structured JSON logs in production.
- Required fields: timestamp, level, trace_id, flow, step, event_topic, request_id, duration_ms, error.
- Log redaction for secrets and PII.
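A minimal sketch of a formatter that satisfies the points above, using only the Python standard library. The `REDACTED_KEYS` set, the `message` field, and the `context` attribute are illustrative assumptions, not part of this standard; the required field names come from the list above.

```python
import json
import logging
import time

# Hypothetical set of sensitive keys; replace with your own redaction policy.
REDACTED_KEYS = {"password", "token", "authorization", "email"}

# Required fields from the standard, other than timestamp and level,
# which are derived from the log record itself.
REQUIRED_FIELDS = (
    "trace_id", "flow", "step", "event_topic",
    "request_id", "duration_ms", "error",
)


def redact(value):
    """Recursively replace values stored under sensitive keys."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if k.lower() in REDACTED_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record with every required field present."""

    def format(self, record):
        entry = {
            "timestamp": time.strftime(
                "%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)
            ),
            "level": record.levelname,
            "message": record.getMessage(),  # assumption: not a required field
        }
        # Missing fields are emitted as null so the schema stays stable.
        for field in REQUIRED_FIELDS:
            entry[field] = getattr(record, field, None)
        # Assumption: free-form data is passed via extra={"context": {...}}.
        context = getattr(record, "context", None)
        if context is not None:
            entry["context"] = redact(context)
        return json.dumps(entry)
```

Callers would attach the fields through `extra`, for example `logger.info("step done", extra={"trace_id": tid, "step": "charge", "duration_ms": 12})`. Key-based redaction only catches secrets stored under known keys; values embedded in free-text messages need a separate pass.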
Tracing Standard
- Create a trace at API entry and attach it to every event emission.
- Child spans per step execution and state/stream operations.
- Propagate trace_id across event adapter boundaries.
Metrics Baseline
- api.request.count
- api.request.error.count
- api.request.latency_ms
- step.execution.count
- step.execution.error.count
- step.execution.latency_ms
- queue.depth
- queue.lag_ms
- stream.subscriptions.active
- state.operation.count
- state.operation.error.count
- runtime.memory.rss_bytes
- runtime.cpu.percent
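The `count`, `error.count`, and `latency_ms` triplets above follow one pattern, which a single helper can emit. The sketch below uses a hypothetical in-memory recorder (`InMemoryMetrics`) as a stand-in for whatever metrics backend is actually in use.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class InMemoryMetrics:
    """Hypothetical recorder; swap in a real metrics client in production."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def incr(self, name, value=1):
        self.counters[name] += value

    def observe(self, name, value):
        self.timings[name].append(value)


@contextmanager
def measure(metrics, prefix):
    """Record <prefix>.count, <prefix>.error.count, and <prefix>.latency_ms."""
    start = time.perf_counter()
    metrics.incr(f"{prefix}.count")
    try:
        yield
    except Exception:
        metrics.incr(f"{prefix}.error.count")
        raise
    finally:
        # Latency is recorded for both successful and failed executions.
        elapsed_ms = (time.perf_counter() - start) * 1000
        metrics.observe(f"{prefix}.latency_ms", elapsed_ms)
```

Wrapping a step as `with measure(metrics, "step.execution"): run_step()` keeps the three metrics consistent with each other, so error rate can always be derived as `error.count / count`.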
Initial SLO Targets
- API availability: 99.9% monthly.
- API latency: p95 < 300ms, p99 < 1s for lightweight endpoints.
- Step execution error rate: < 1% per 30 days.
- Event processing lag: p95 < 2s.
Error Budgets
- 0.1% monthly unavailability budget for API.
- 1% monthly error budget for step executions.
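To make these budgets concrete: assuming a 30-day month, a 0.1% unavailability budget is roughly 43.2 minutes of downtime. The helper names below are illustrative only.

```python
def allowed_downtime_minutes(availability_target, days=30):
    """Minutes of downtime permitted by an availability target over `days`."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_target)


def budget_remaining(total_events, failed_events, budget_fraction):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = total_events * budget_fraction
    if allowed_failures == 0:
        return 0.0
    return 1 - failed_events / allowed_failures
```

For example, `allowed_downtime_minutes(0.999)` gives about 43.2, and 250 failed step executions out of 100,000 against the 1% budget leaves about 75% of the budget unspent.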
Alerting
- Error rate spikes above 2x baseline for 5 minutes.
- p95 latency exceeds target for 10 minutes.
- Queue depth exceeds configured max for 10 minutes.
- Stream subscription failures exceed 1% over 5 minutes.
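Each rule above pairs a threshold with a duration, so the alert should fire only when the breach is sustained. A minimal sketch of that check follows; the function name and the `(timestamp_seconds, value)` sample format are assumptions.

```python
def sustained_breach(samples, threshold, window_s, now):
    """Return True if the metric stayed above `threshold` for the whole window.

    `samples` is an iterable of (timestamp_seconds, value) pairs. The check
    requires data from at or before the window start, so that a short burst
    of recent samples cannot trigger the alert on its own.
    """
    samples = list(samples)
    window_start = now - window_s
    has_full_window = any(t <= window_start for t, _ in samples)
    recent = [v for t, v in samples if window_start <= t <= now]
    return has_full_window and bool(recent) and all(v > threshold for v in recent)
```

The p95 latency rule would then be `sustained_breach(p95_samples, 300, 600, now)`, and the error-rate rule `sustained_breach(error_rate_samples, 2 * baseline, 300, now)`, where `baseline` is computed separately.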
Reporting
- Weekly SLO report with trend lines for latency, error rate, and availability.
- Release gates require no SLO violations in the last 7 days.