Platform Hardening Plan

Goals

  • Make Motia safe for production multi-tenant use.
  • Provide explicit access control for diagnostic and admin capabilities.
  • Enforce operational limits for stability and predictability.
  • Establish observability standards and SLOs.
  • Prove scalability with automated tests and failure injection.

Scope

  • Gate dev/diagnostic endpoints behind config or auth and fix CORS to allowlist + credentials.
  • Add configurable limits for request sizes, timeouts, and max concurrency.
  • Avoid CLI-arg payload transfer for large inputs.
  • Make event input validation optionally strict with explicit runtime config.
  • Strengthen subscription lifecycle correctness and error handling.
  • Add platform observability, SLOs, and error budgets.
  • Prove scalability with load, soak, and failure-injection tests.
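
The CLI-arg avoidance item above can be sketched as handing large payloads to a worker process over stdin rather than argv (argv has OS-level size limits and is visible in process listings). This is a minimal illustration of the transport pattern only; the inline worker script and payload shape are assumptions, not Motia's actual runner protocol:

```typescript
import { spawnSync } from "node:child_process";

// Send a large JSON payload to a child process over stdin instead of
// embedding it in a command-line argument.
function runWorkerWithStdin(payload: unknown): string {
  const child = spawnSync(
    process.execPath,
    [
      "-e",
      `let data = "";
       process.stdin.on("data", (c) => (data += c));
       process.stdin.on("end", () => {
         const input = JSON.parse(data);
         process.stdout.write(String(input.n * 2)); // stand-in for step logic
       });`,
    ],
    { input: JSON.stringify(payload), encoding: "utf8" },
  );
  return child.stdout.trim();
}

const doubled = runWorkerWithStdin({ n: 21 });
```

For payloads too large to buffer comfortably, the same idea extends to a temp file whose path (not contents) is passed to the worker.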

Out of Scope

  • Product feature expansion unrelated to platform reliability/security.
  • New language runners beyond current JS/TS/Python/Ruby.

Assumptions

  • Motia runtime remains horizontally scalable by externalizing state, queues, and streams.
  • Platform will support multi-tenant deployments in the near term.
  • Express remains the core HTTP server for now.

Workstreams

  1. Access Control and CORS Hardening
  2. Runtime Limits and Payload Transport
  3. Input Validation Strictness
  4. Event Subscription Lifecycle Reliability
  5. Observability and SLOs
  6. Scalability Proof via Tests and Chaos

Milestones and Deliverables

  • M1: Access control foundation
      • Deliver RBAC model and enforcement points.
      • Gate __motia endpoints behind config + role check.
      • Replace permissive CORS with allowlist + credentials rules.
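
The allowlist-plus-credentials rule reduces to a small origin decision. A sketch, assuming the allowlist comes from config; the header names follow the CORS protocol, but the function and its wiring are illustrative:

```typescript
// Decide CORS response headers for a request origin. When credentials
// are enabled, the CORS protocol forbids Access-Control-Allow-Origin: "*",
// so the only safe behavior is to echo an origin that is explicitly
// allowlisted, and to emit nothing for any other origin.
function corsHeaders(
  origin: string | undefined,
  allowlist: string[],
): Record<string, string> {
  if (!origin || !allowlist.includes(origin)) {
    return {}; // unknown origin: no CORS headers at all
  }
  return {
    "Access-Control-Allow-Origin": origin, // echo the origin, never "*"
    "Access-Control-Allow-Credentials": "true",
    "Vary": "Origin", // responses differ per origin; keep shared caches correct
  };
}
```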

  • M2: Runtime safety limits
      • Configurable request size limits with sensible defaults.
      • Per-step timeout enforcement and global max concurrency.
      • Payload transport path for large inputs using stdin or temp files.
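
A configurable size limit can mirror the human-readable syntax Express's body parser already accepts (e.g. express.json({ limit: "10mb" })). The parser and check below are a hedged sketch of that configuration surface, not Motia's implementation:

```typescript
// Parse a human-readable size limit ("10mb", "512kb") into bytes, then
// enforce it against a request's declared Content-Length.
const UNITS: Record<string, number> = { b: 1, kb: 1024, mb: 1024 ** 2 };

function parseSizeLimit(limit: string): number {
  const m = /^(\d+)(b|kb|mb)$/i.exec(limit.trim());
  if (!m) throw new Error(`unparseable size limit: ${limit}`);
  return Number(m[1]) * UNITS[m[2].toLowerCase()];
}

function withinLimit(contentLengthBytes: number, limit: string): boolean {
  return contentLengthBytes <= parseSizeLimit(limit);
}
```

A real enforcement point would also cap the bytes actually read, since Content-Length can lie or be absent for chunked bodies.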

  • M3: Validation and subscription reliability
      • Runtime flag for strict event input validation.
      • Reliable subscription setup with deterministic lifecycle and error reporting.
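
The strict-validation flag amounts to running the same check in two modes: reject (strict) or record-and-proceed (warn). A sketch under assumed event shape and mode names; neither is Motia's actual schema or config key:

```typescript
type ValidationMode = "strict" | "warn";

// Validate an incoming event. In "strict" mode a malformed event is
// rejected; in "warn" mode it is accepted but the problems are recorded
// so operators can tighten the flag safely later.
function validateEvent(
  event: { topic?: unknown; data?: unknown },
  mode: ValidationMode,
  warnings: string[] = [],
): boolean {
  const problems: string[] = [];
  if (typeof event.topic !== "string" || event.topic.length === 0) {
    problems.push("topic must be a non-empty string");
  }
  if (event.data === undefined) {
    problems.push("data is required");
  }
  if (problems.length === 0) return true;
  if (mode === "strict") {
    throw new Error(`invalid event: ${problems.join("; ")}`);
  }
  warnings.push(...problems); // warn mode: let it through, but surface it
  return true;
}
```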

  • M4: Observability and SLOs
      • Structured logs with request and trace identifiers.
      • Metrics for latency, errors, queue depth, and stream throughput.
      • Initial SLOs and error budget policy.
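
The error-budget policy rests on simple arithmetic: a 99.9% availability target over a 30-day window leaves 0.1% of that window, about 43.2 minutes, as budget. The target and window here are illustrative placeholders, not committed SLOs:

```typescript
// Minutes of allowed unavailability implied by an availability SLO
// over a rolling window. E.g. 0.999 over 30 days -> ~43.2 minutes.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - sloTarget);
}
```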

  • M5: Scalability proof
      • Load tests for API and event processing.
      • Soak tests for long-running stability.
      • Failure injection for queue/Redis/network disruptions.
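
"Pass with documented thresholds" implies a mechanical gate, e.g. failing CI when a latency percentile exceeds its budget. A sketch using the nearest-rank percentile method; the function names and any budget numbers are assumptions:

```typescript
// Nearest-rank percentile: sort samples, take the value at
// ceil(p/100 * n), 1-indexed.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Gate a load-test run: fail if observed p95 latency exceeds the
// documented budget for that scenario.
function passesLatencyGate(samplesMs: number[], p95BudgetMs: number): boolean {
  return percentile(samplesMs, 95) <= p95BudgetMs;
}
```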

Acceptance Criteria

  • All diagnostic endpoints are disabled by default or require authorized access.
  • CORS policy rejects the wildcard (*) origin whenever credentials are enabled and supports explicit origin allowlists.
  • Request size limits and timeouts are configurable and enforced.
  • Large payloads do not depend on CLI-arg transfer.
  • Strict validation is configurable and covered by tests.
  • Subscription lifecycle is deterministic with clear logs on failure.
  • SLOs are defined, measured, and reported in CI.
  • Load, soak, and chaos tests pass with documented thresholds.

Dependencies

  • RBAC model and enforcement layer.
  • Config system that is readable by core runtime and workbench.
  • Metrics backend or pluggable interface for export.

Risks

  • Backward compatibility if defaults tighten too aggressively.
  • Performance impact from extra validation and logging.
  • Feature creep in observability integrations.

Rollout Strategy

  • Introduce defaults in warning mode first.
  • Ship config flags with deprecation notices for unsafe defaults.
  • Provide migration guide for production deployments.
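
The warn-then-enforce sequence above can be captured in one helper so every tightened default shares the same rollout behavior. A sketch only; the mode names and wiring are assumptions, not Motia's config API:

```typescript
type RolloutMode = "warn" | "enforce";

// Apply a tightened default in two phases: first log violations with a
// deprecation notice ("warn"), then reject them ("enforce") in a later
// release once deployments have migrated.
function applyPolicy(
  violation: string | null,
  mode: RolloutMode,
  log: (msg: string) => void,
): { allowed: boolean } {
  if (violation === null) return { allowed: true };
  if (mode === "warn") {
    log(`DEPRECATION: ${violation} will be rejected in a future release`);
    return { allowed: true };
  }
  return { allowed: false };
}
```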

Open Questions

  • Should dev endpoints be enabled only in NODE_ENV=development by default?
  • What is the minimum RBAC role set required for OSS vs hosted?
  • Which metrics backend should be the default for hosted deployments?