Platform Hardening Plan
Goals
- Make Motia safe for production multi-tenant use.
- Provide explicit access control for diagnostic and admin capabilities.
- Enforce operational limits for stability and predictability.
- Establish observability standards and SLOs.
- Prove scalability with automated tests and failure injection.
Scope
- Gate dev/diagnostic endpoints behind config or auth and fix CORS to allowlist + credentials.
- Add configurable limits for request sizes, timeouts, and max concurrency.
- Avoid CLI-arg payload transfer for large inputs.
- Make event input validation optionally strict with explicit runtime config.
- Strengthen subscription lifecycle correctness and error handling.
- Add platform observability, SLOs, and error budgets.
- Prove scalability with load, soak, and failure-injection tests.
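The "avoid CLI-arg payload transfer" item above can be sketched as a transport decision: keep small inputs inline and spill large ones to a temp file, handing the step runner a path instead of an argv string. The 64 KiB threshold and the `PayloadRef` shape below are illustrative assumptions, not Motia's actual API.

```typescript
import { mkdtempSync, writeFileSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Illustrative inline-size cutoff; real deployments would make this configurable.
const INLINE_LIMIT = 64 * 1024;

type PayloadRef =
  | { kind: "inline"; data: string }   // small enough to pass directly
  | { kind: "file"; path: string };    // spilled to disk, pass the path

function preparePayload(data: string): PayloadRef {
  if (Buffer.byteLength(data) <= INLINE_LIMIT) return { kind: "inline", data };
  // Unique temp dir per payload avoids collisions between concurrent steps.
  const dir = mkdtempSync(join(tmpdir(), "motia-payload-"));
  const path = join(dir, "payload.json");
  writeFileSync(path, data);
  return { kind: "file", path };
}
```

A stdin-based variant would avoid disk entirely but requires the runner process to cooperate by reading its stdin; the temp-file form works with any runner that can open a path.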
Out of Scope
- Product feature expansion unrelated to platform reliability/security.
- New language runners beyond current JS/TS/Python/Ruby.
Assumptions
- Motia runtime remains horizontally scalable by externalizing state, queues, and streams.
- Platform will support multi-tenant deployments in the near term.
- Express remains the core HTTP server for now.
Workstreams
- Access Control and CORS Hardening
- Runtime Limits and Payload Transport
- Input Validation Strictness
- Event Subscription Lifecycle Reliability
- Observability and SLOs
- Scalability Proof via Tests and Chaos
Milestones and Deliverables
- M1: Access control foundation
  - Deliver RBAC model and enforcement points.
  - Gate __motia endpoints behind config + role check.
  - Replace permissive CORS with allowlist + credentials rules.
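The allowlist + credentials rule in M1 can be sketched as a pure origin check. The `ALLOWED_ORIGINS` set and `resolveCorsHeaders` name are illustrative, not Motia's actual configuration surface.

```typescript
// Sketch of an exact-match origin allowlist for credentialed CORS.
const ALLOWED_ORIGINS = new Set([
  "https://app.example.com",
  "https://workbench.example.com",
]);

// Returns the CORS response headers for a request origin, or null to deny.
// Never echoes "*" when credentials are allowed: the origin must match exactly.
function resolveCorsHeaders(origin: string | undefined): Record<string, string> | null {
  if (!origin || !ALLOWED_ORIGINS.has(origin)) return null;
  return {
    "Access-Control-Allow-Origin": origin,       // exact origin, never "*"
    "Access-Control-Allow-Credentials": "true",
    "Vary": "Origin",                            // caches must key on Origin when echoing it
  };
}
```

Echoing the request origin (rather than `*`) is what makes credentialed requests work, since browsers reject `Access-Control-Allow-Origin: *` combined with credentials.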
- M2: Runtime safety limits
  - Configurable request size limits with sensible defaults.
  - Per-step timeout enforcement and global max concurrency.
  - Payload transport path for large inputs using stdin or temp files.
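The per-step timeout deliverable in M2 can be sketched with `Promise.race`. The `withTimeout` helper is an illustrative assumption, not part of Motia's runtime API; the global max-concurrency cap would pair this with a semaphore around step dispatch.

```typescript
// Minimal per-step wall-clock budget: whichever of the step and the timer
// settles first wins, and the timer is always cleared afterwards.
async function withTimeout<T>(run: () => Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`step exceeded ${timeoutMs}ms`)), timeoutMs);
  });
  try {
    return await Promise.race([run(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

Note that racing does not cancel the underlying work; a full implementation would also signal the step (e.g. via AbortController) so it stops consuming resources.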
- M3: Validation and subscription reliability
  - Runtime flag for strict event input validation.
  - Reliable subscription setup with deterministic lifecycle and error reporting.
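The strict-validation flag in M3 amounts to one decision point: reject invalid events outright, or log and continue. A minimal sketch, where the `ValidationIssue` shape and `gateEventInput` name are assumptions for illustration:

```typescript
interface ValidationIssue { path: string; message: string }

// Decides whether an event with validation issues is accepted.
// strict=true  -> reject invalid events outright.
// strict=false -> accept, but surface a warning for each bad event.
function gateEventInput(
  issues: ValidationIssue[],
  strict: boolean,
  warn: (msg: string) => void,
): { accepted: boolean } {
  if (issues.length === 0) return { accepted: true };
  if (strict) return { accepted: false };
  const summary = issues.map((i) => `${i.path}: ${i.message}`).join("; ");
  warn(`event input failed validation: ${summary}`);
  return { accepted: true };
}
```

Keeping lenient mode noisy (one warning per rejected-in-strict event) gives operators the data to flip the flag safely, which also supports the warning-first rollout strategy below.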
- M4: Observability and SLOs
  - Structured logs with request and trace identifiers.
  - Metrics for latency, errors, queue depth, and stream throughput.
  - Initial SLOs and error budget policy.
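The structured-log deliverable in M4 reduces to one JSON line per event with correlation identifiers attached. Field names like `traceId` and `requestId` below are illustrative, not a mandated schema:

```typescript
// Emits one self-describing JSON log line carrying correlation identifiers,
// so a single request can be followed across steps and processes.
function logLine(
  level: "info" | "error",
  msg: string,
  ctx: { traceId: string; requestId: string },
): string {
  return JSON.stringify({
    ts: new Date().toISOString(),
    level,
    msg,
    traceId: ctx.traceId,
    requestId: ctx.requestId,
  });
}
```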
- M5: Scalability proof
  - Load tests for API and event processing.
  - Soak tests for long-running stability.
  - Failure injection for queue/Redis/network disruptions.
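The failure-injection item in M5 can start as small as a wrapper that makes a dependency call fail a configurable fraction of the time, so retry and fallback paths get exercised in tests. The `withChaos` helper is a toy sketch, not a proposed production facility:

```typescript
// Wraps an async dependency call (e.g. a queue publish or Redis read) so it
// fails with probability failRate. The rng parameter lets tests be deterministic.
function withChaos<T>(
  call: () => Promise<T>,
  failRate: number,
  rng: () => number = Math.random,
): () => Promise<T> {
  return async () => {
    if (rng() < failRate) throw new Error("injected failure");
    return call();
  };
}
```

Real chaos runs would also cover disruptions a wrapper cannot simulate, such as network partitions and slow (rather than failed) responses.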
Acceptance Criteria
- All diagnostic endpoints are disabled by default or require authorized access.
- CORS policy never combines a wildcard (*) origin with credentials and supports allowlists.
- Request size limits and timeouts are configurable and enforced.
- Large payloads do not depend on CLI-arg transfer.
- Strict validation is configurable and covered by tests.
- Subscription lifecycle is deterministic with clear logs on failure.
- SLOs are defined, measured, and reported in CI.
- Load, soak, and chaos tests pass with documented thresholds.
Dependencies
- RBAC model and enforcement layer.
- Config system that is readable by core runtime and workbench.
- Metrics backend or pluggable interface for export.
Risks
- Backward compatibility if defaults tighten too aggressively.
- Performance impact from extra validation and logging.
- Feature creep in observability integrations.
Rollout Strategy
- Introduce defaults in warning mode first.
- Ship config flags with deprecation notices for unsafe defaults.
- Provide migration guide for production deployments.
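The warning-first rollout can be modeled as a per-limit mode switch. A minimal sketch of the idea, using a request-size check; the `LimitMode` values and `checkBodySize` name are assumptions, not Motia's config schema:

```typescript
type LimitMode = "off" | "warn" | "enforce";

// Returns true when the request should proceed. In "warn" mode an over-limit
// body is still accepted, but the violation is surfaced so operators can
// tighten to "enforce" once the warnings stop.
function checkBodySize(
  bytes: number,
  limit: number,
  mode: LimitMode,
  warn: (msg: string) => void,
): boolean {
  if (mode === "off" || bytes <= limit) return true;
  if (mode === "warn") {
    warn(`request body ${bytes}B exceeds limit ${limit}B (warning mode)`);
    return true;
  }
  return false; // enforce: reject
}
```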
Open Questions
- Should dev endpoints be enabled only when NODE_ENV=development by default?
- What is the minimum RBAC role set required for OSS vs hosted?
- Which metrics backend should be the default for hosted deployments?