Technical Design Document (TDD)
1. Component Topology
1.1 Control Plane Services
- tenant-service: tenant lifecycle, region and residency policy.
- identity-service: users, teams, orgs, RBAC, service accounts.
- project-service: projects, environments, flows, versions.
- policy-service: policy definitions, enforcement hooks, validation modes.
- audit-service: immutable audit chain, verification, export.
- reporting-service: contracts, status reports, session logs.
- usage-service: quotas, rate limits, metering.
1.2 Data Plane Services
- runtime-service (Rust): step execution and orchestration.
- event-gateway: ingest, validate, route to runtime.
- stream-service: websocket state and log fan-out.
- artifact-service: payload storage and retrieval.
1.3 Workbench
- React Flow UI for workflow authoring and runtime inspection.
- Mobile-responsive layout with reduced panels and adaptive nav.
1.4 IDE Integration
- GCP Cloud Workstations are opened in a new browser tab.
- Platform provides a secure, audited, signed access URL.
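As an illustration of what a signed, short-lived access URL could look like, the sketch below appends an expiry and an HMAC-SHA256 signature to a workstation link. Everything named here is an assumption for illustration: the functions `sign_workstation_url` and `verify_workstation_url`, the query parameters `user`, `exp`, and `sig`, the example host, and the in-code signing key. The platform's actual signing scheme is not specified in this document, and the audit record for each issued link is not shown.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical signing key; a real deployment would load it from a secret manager.
SIGNING_KEY = b"example-signing-key"


def _signature(base_url: str, user_id: str, expires: int) -> str:
    """HMAC-SHA256 over the URL, the user, and the expiry timestamp."""
    message = f"{base_url}|{user_id}|{expires}".encode("utf-8")
    return hmac.new(SIGNING_KEY, message, hashlib.sha256).hexdigest()


def sign_workstation_url(base_url: str, user_id: str, ttl_seconds: int,
                         now: Optional[float] = None) -> str:
    """Return base_url with user, expiry, and signature query parameters."""
    issued = time.time() if now is None else now
    expires = int(issued) + ttl_seconds
    query = urlencode({
        "user": user_id,
        "exp": expires,
        "sig": _signature(base_url, user_id, expires),
    })
    return f"{base_url}?{query}"


def verify_workstation_url(url: str, now: Optional[float] = None) -> bool:
    """Accept the URL only if the signature matches and it has not expired."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        user_id = params["user"][0]
        expires = int(params["exp"][0])
        signature = params["sig"][0]
    except (KeyError, ValueError):
        return False
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    expected = _signature(base_url, user_id, expires)
    current = time.time() if now is None else now
    # Constant-time comparison avoids leaking signature prefixes via timing.
    matches = hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8"))
    return matches and current < expires
```

A link signed with a 60-second TTL verifies until it expires, and changing the `user` parameter invalidates the signature.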
1.5 Repository Organization
- coditect-step-dev-platform hosts the platform runtime, UI, and rewrite documentation.
- coditect-core remains the parent intelligence framework and standards source.
2. Runtime Execution Flow
- Event arrives at event-gateway.
- AuthN and RBAC are enforced at tenant and project scopes.
- Payload validation applies configured strictness.
- Event is routed to runtime-service.
- Runtime loads workflow graph and executes steps.
- State writes and outputs are emitted to streams and storage.
- Observability events are emitted to logs, traces, and metrics.
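The flow above can be sketched as a single in-process function. This is a simplified illustration, not the runtime-service implementation: `Event`, `Result`, and `handle_event` are hypothetical names, RBAC is reduced to a set of allowed (tenant, project) pairs, validation is reduced to a required-field check, and plain lists stand in for streams, storage, and observability sinks.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    tenant_id: str
    project_id: str
    topic: str
    payload: dict


@dataclass
class Result:
    status: str
    outputs: list = field(default_factory=list)  # stands in for streams and storage
    log: list = field(default_factory=list)      # stands in for logs, traces, metrics


def handle_event(event, caller_scopes, required_fields, flow_steps):
    """Walk one event through gateway checks and step execution.

    caller_scopes: set of (tenant_id, project_id) pairs the caller may access.
    required_fields: payload keys the gateway requires.
    flow_steps: ordered (name, callable) pairs standing in for the workflow graph.
    """
    # AuthN/RBAC at tenant and project scope.
    if (event.tenant_id, event.project_id) not in caller_scopes:
        return Result(status="forbidden", log=["rbac: denied"])

    # Payload validation (required fields only in this sketch).
    missing = [key for key in required_fields if key not in event.payload]
    if missing:
        return Result(status="rejected", log=[f"validation: missing {missing}"])

    # Routing to the runtime: load the graph and execute steps in order.
    result = Result(status="running")
    state = dict(event.payload)
    for name, step in flow_steps:
        state = step(state)
        result.outputs.append((name, dict(state)))  # state write / output emission
        result.log.append(f"step {name}: ok")       # observability event
    result.status = "completed"
    return result
```

Each early return corresponds to a stage at which the gateway can stop an event before it reaches the runtime.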
3. API Surfaces
3.1 REST (Control Plane)
/tenants, /teams, /users, /projects, /roles
/contracts, /reports, /session-logs
/audit/verify, /audit/export
/usage/quotas, /usage/limits
/workstations (list and access link generation)
3.2 gRPC (Runtime and Internal)
Runtime.ExecuteStep
Runtime.EmitEvent
Runtime.StreamState
Artifacts.PutPayload, Artifacts.GetPayload
Audit.AppendBlock, Audit.VerifyChain
4. Data Model (High-Level)
- tenants: id, name, region, status, kms_key_id
- teams: id, tenant_id, name, lead_user_id
- users: id, tenant_id, email, role, status, mfa_enabled
- roles: id, name, description
- role_bindings: user_id, role_id, scope
- projects: id, tenant_id, team_id, name, environment
- flows: id, project_id, name, version
- steps: id, flow_id, name, type, config_json
- executions: id, step_id, trace_id, status, latency_ms
- events: id, topic, payload_ref, trace_id
- contracts: id, tenant_id, status, value, dates
- status_reports: id, project_id, status, summary
- session_logs: id, project_id, visibility, content
- workstations: id, tenant_id, user_id, status, region, gcp_resource
- audit_blocks: id, hash, prev_hash, payload_ref, created_at
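To show how the role_bindings table could drive an access check, here is a minimal sketch. The scope string format ("tenant:<id>" or "project:<id>") and the `has_role` helper are assumptions for illustration. The document only states that a binding has a user_id, a role_id, and a scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleBinding:
    user_id: str
    role_id: str
    scope: str  # assumed format: "tenant:<id>" or "project:<id>"


def has_role(bindings, user_id, role_id, tenant_id, project_id):
    """True if the user holds role_id on the project or on its parent tenant."""
    accepted_scopes = {f"tenant:{tenant_id}", f"project:{project_id}"}
    return any(
        b.user_id == user_id and b.role_id == role_id and b.scope in accepted_scopes
        for b in bindings
    )
```

Under this assumption a tenant-scoped binding implies the same role on every project in that tenant, while a project-scoped binding applies to that project only.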
5. Security and Policy Enforcement
- mTLS between all services.
- OIDC/SAML with short-lived tokens.
- CORS allowlist; credentialed requests are permitted only from approved origins.
- Dev and diagnostic endpoints require explicit config and RBAC.
- Every privileged action is recorded in an immutable, hash-verified audit chain.
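A hash-linked audit chain can be illustrated with the audit_blocks fields from the data model (id, hash, prev_hash, payload_ref, created_at). The sketch below is an assumption about how such a chain might work, not the audit-service implementation: the genesis value, the JSON serialization used for hashing, and the function names `append_block` and `verify_chain` are all hypothetical, although they mirror the Audit.AppendBlock and Audit.VerifyChain RPC names.

```python
import hashlib
import json
from dataclasses import dataclass

# Assumed sentinel used as prev_hash for the first block.
GENESIS_HASH = "0" * 64


@dataclass(frozen=True)
class AuditBlock:
    id: int
    hash: str
    prev_hash: str
    payload_ref: str
    created_at: str


def _digest(block_id, prev_hash, payload_ref, created_at):
    """SHA-256 over a canonical JSON encoding of the block's other fields."""
    body = json.dumps([block_id, prev_hash, payload_ref, created_at],
                      separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_block(chain, payload_ref, created_at):
    """Return a new chain with one block linked to the previous block's hash."""
    prev_hash = chain[-1].hash if chain else GENESIS_HASH
    block_id = len(chain) + 1
    block_hash = _digest(block_id, prev_hash, payload_ref, created_at)
    return chain + [AuditBlock(block_id, block_hash, prev_hash, payload_ref, created_at)]


def verify_chain(chain):
    """Recompute every hash and check that each block links to its predecessor."""
    prev_hash = GENESIS_HASH
    for block in chain:
        if block.prev_hash != prev_hash:
            return False
        expected = _digest(block.id, block.prev_hash, block.payload_ref, block.created_at)
        if block.hash != expected:
            return False
        prev_hash = block.hash
    return True
```

Because each hash covers the previous block's hash, altering any stored field in an earlier block causes verification to fail at that block.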
6. Limits and Concurrency
- Tenant-wide concurrency limits and per-project execution pools.
- Request size limits for API and event ingest.
- Timeouts for step execution, event gateway, and streaming.
- Large payloads are passed as object storage references, not as CLI arguments.
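One way to combine a tenant-wide concurrency limit, a per-project execution pool, and a step timeout is a pair of nested semaphores. The sketch below uses asyncio and is only an illustration under assumptions: the class `ConcurrencyLimiter`, its parameters, and the choice of semaphores are hypothetical, and the runtime-service itself is written in Rust.

```python
import asyncio


class ConcurrencyLimiter:
    """Caps concurrent step executions per tenant and per project."""

    def __init__(self, tenant_limit, project_limit):
        self._tenant_limit = tenant_limit
        self._project_limit = project_limit
        self._tenant_sems = {}
        self._project_sems = {}

    @staticmethod
    def _get(table, key, limit):
        # Semaphores are created lazily, inside the running event loop.
        if key not in table:
            table[key] = asyncio.Semaphore(limit)
        return table[key]

    async def run(self, tenant_id, project_id, step, timeout_s):
        """Run the coroutine function `step` under both limits and a timeout."""
        tenant_sem = self._get(self._tenant_sems, tenant_id, self._tenant_limit)
        project_sem = self._get(
            self._project_sems, (tenant_id, project_id), self._project_limit
        )
        # Always acquire tenant first, then project, so the lock order is fixed.
        async with tenant_sem, project_sem:
            return await asyncio.wait_for(step(), timeout=timeout_s)
```

A step that exceeds `timeout_s` is cancelled by `asyncio.wait_for`, and both slots are released when the `async with` block exits.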
7. Event Validation Modes
validation_mode = strict | permissive
- Strict mode rejects unknown fields and invalid schemas.
- Permissive mode logs violations and continues.
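The difference between the two modes can be sketched with a small validator. This is an illustration under assumptions: the schema is reduced to a mapping of field name to expected Python type, and `validate_event` and `ValidationError` are hypothetical names rather than the event-gateway's API.

```python
import logging

logger = logging.getLogger("event_validation")


class ValidationError(Exception):
    """Raised in strict mode when a payload violates the schema."""


def validate_event(payload, schema, validation_mode="strict"):
    """Check payload against schema (field name -> expected type).

    Returns the list of violations. In strict mode any violation raises
    ValidationError; in permissive mode violations are logged and returned.
    """
    if validation_mode not in ("strict", "permissive"):
        raise ValueError(f"unknown validation_mode: {validation_mode!r}")

    violations = []
    for name in payload:
        if name not in schema:
            violations.append(f"unknown field: {name}")
    for name, expected_type in schema.items():
        if name not in payload:
            violations.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected_type):
            violations.append(f"wrong type for field: {name}")

    if violations and validation_mode == "strict":
        raise ValidationError("; ".join(violations))
    for violation in violations:
        logger.warning("validation violation: %s", violation)
    return violations
```

With the same payload, strict mode stops the event at the gateway while permissive mode lets it through and leaves a log trail for later cleanup.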
8. Subscription Lifecycle and Adapter Hardening
- Adapter wiring must handle errors with retries and clear failure states.
- Subscriptions must be idempotent and auditable.
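The two requirements above can be illustrated together: a subscribe call that retries transient adapter errors with exponential backoff, ends in a clear failure state, is a no-op when repeated, and records each outcome. All names here (`SubscriptionManager`, `SubscriptionError`, the `connect` callable, the audit entry labels) are hypothetical, and the in-memory audit list stands in for the audit-service.

```python
import time


class SubscriptionError(Exception):
    """Terminal failure state after all retry attempts are exhausted."""


class SubscriptionManager:
    def __init__(self, connect, max_attempts=3, base_delay_s=0.1, sleep=time.sleep):
        self._connect = connect  # hypothetical adapter callable: connect(topic) -> handle
        self._max_attempts = max_attempts
        self._base_delay_s = base_delay_s
        self._sleep = sleep
        self.subscriptions = {}  # (subscriber_id, topic) -> handle
        self.audit_log = []      # stand-in for the audit chain

    def subscribe(self, subscriber_id, topic):
        key = (subscriber_id, topic)
        # Idempotency: repeating a subscribe returns the existing handle.
        if key in self.subscriptions:
            self.audit_log.append(("subscribe_noop", key))
            return self.subscriptions[key]

        last_error = None
        for attempt in range(1, self._max_attempts + 1):
            try:
                handle = self._connect(topic)
            except ConnectionError as exc:
                last_error = exc
                self.audit_log.append(("subscribe_retry", key, attempt))
                if attempt < self._max_attempts:
                    # Exponential backoff: base, 2x base, 4x base, ...
                    self._sleep(self._base_delay_s * 2 ** (attempt - 1))
                continue
            self.subscriptions[key] = handle
            self.audit_log.append(("subscribe_ok", key))
            return handle

        self.audit_log.append(("subscribe_failed", key))
        raise SubscriptionError(f"could not subscribe to {topic}") from last_error
```

Injecting `sleep` keeps the backoff testable without real delays.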
9. Observability
- OpenTelemetry traces across gateway, runtime, and workbench APIs.
- Metrics: latency, error rate, queue depth, concurrency, saturation.
- Logs: structured JSON with tenant and trace identifiers.
- SLO alerts per step, per project, and per tenant.
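For the structured-log requirement, a minimal sketch using Python's standard logging module is shown below. The field names `tenant_id` and `trace_id` follow the bullet above; the formatter class, the `build_logger` helper, and the exact JSON layout are assumptions, and trace propagation through OpenTelemetry is not shown.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with tenant and trace identifiers."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Populated from the `extra` dict passed at the call site, if any.
            "tenant_id": getattr(record, "tenant_id", None),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry, sort_keys=True)


def build_logger(name, stream):
    """Return a logger that writes JSON lines to the given text stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger
```

A call such as `logger.info("step completed", extra={"tenant_id": "t1", "trace_id": "abc123"})` then produces a single JSON line that can be filtered by tenant or joined to a trace.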
10. Testing
- Contract tests for API boundaries.
- Load tests for runtime and gateway.
- Soak tests for long-running streams.
- Failure injection for event adapters and queue backpressure.
11. Deployment Model
- Separate scaling domains for control plane and data plane.
- Multi-region clusters with residency enforcement.
- Blue/green deployments for runtime upgrades.
12. Mobile Responsiveness
- All UI layouts must adapt to phone and tablet breakpoints.
- Tables must support horizontal scrolling on small screens.
- Primary actions must remain accessible without hover-only controls.