RFC: Observability System for Motia Framework
Status
- RFC Date: 2025-06-02
- Status: Implemented
- Authors: Ytallo Layon
- Reviewers: Motia Team
Summary
This RFC proposes implementing an observability system for Motia that provides comprehensive tracing and real-time monitoring through an intuitive horizontal timeline interface. The system will enable users to track execution traces, monitor performance, and debug issues with detailed insights into their workflow executions, focusing on Motia-specific concepts like States, Emit events, Streams, and step interactions through a clean, horizontal visualization.
Background
Currently, Motia provides basic logging capabilities through:
- Terminal-based logs during development
- Real-time log streaming in the Workbench
- Basic analytics tracking for internal telemetry
However, users lack intuitive observability into their workflow executions, making it difficult to:
- Debug complex multi-step workflows in a visual timeline
- Monitor performance bottlenecks across state and stream operations
- Track execution patterns and event flow between steps
- Understand system behavior with clear step-by-step progression
- Trace how events propagate through steps horizontally over time
Goals
Primary Goals
- Horizontal Timeline Visualization: Intuitive left-to-right timeline showing step execution over time
- Comprehensive Trace Tracking: Track execution traces with clean step-by-step progression
- Cross-Trace Correlation: Link related traces across multiple triggers for logical flow visualization
- Motia Operations Monitoring: Track state, emit, and stream operations within each step
- Real-time Flow Status: Live status indicators in the Workbench showing active executions
- Enhanced Logging Integration: Leverage existing logging infrastructure with structured events
- Persistent Trace Storage: Store recent traces (last 50) with efficient retrieval
- Clean UI Experience: Horizontal timeline with operation badges and intuitive navigation
- Performance Insights: Timing and duration metrics for steps and operations
- Search and Filter: Find traces by step, state key, event topic, flow name, or correlation ID
Secondary Goals
- Minimal Performance Impact: Lightweight tracking through enhanced logging
- Developer Experience: Intuitive interfaces for monitoring and debugging
- Extensibility: Design for future enhancements and integrations
- Mobile Responsive: Adaptive design for different screen sizes
Cross-Trace Correlation: Critical Requirement
The Challenge
Real-world workflows often involve multiple related executions that form a single logical business process, but individual traces don't capture these relationships:
Problem Scenarios
- Chat Threads: Each message creates a separate API trace, but they're all part of one conversation
- Mixed API/Event Flows: Initial API → background processing → external input → continuation API → completion
- User Sessions: Multiple interactions across different endpoints that belong to the same user journey
- Order Processing: Payment confirmation, inventory updates, shipping notifications - all separate triggers, one logical order
Required Solution
✅ Required: Correlated Logical Flows
Chat Thread → [Trace A] → [Trace B] → [Trace C]
└─────── All part of same conversation ──────┘
Order Flow → [API: Create] → [Event: Process] → [API: Confirm] → [Event: Ship]
└─────────── Complete business process visibility ────────────┘
Correlation Strategies
1. Automatic Correlation
- State-based: Auto-detect traces using similar state key patterns
- Event-based: Link traces through shared event topics and data
- Temporal: Connect traces with logical timing relationships
2. Manual Correlation
- Context API:
await context.correlate('business_process_id') - HTTP Headers:
X-Motia-Correlation-Idfor API calls
3. Logical Flow Visualization
Chat Session: user_123_thread_abc (3 traces, 2 min duration)
├─ Trace 1: "Hello, weather?" (✓ 150ms) - 2 min ago
│ └─ get-weather → format-response → send-reply
├─ ⏱️ [Gap: 45s - user thinking]
├─ Trace 2: "What about tomorrow?" (✓ 200ms) - 1 min ago
│ └─ get-context → get-forecast → format-response
└─ Trace 3: "Thanks!" (🔄 running) - 10s ago
└─ process-gratitude → [running]
Architecture Overview
High-Level System Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Motia Core System │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌──────────────────┐ ┌────────────┐ │
│ │ Enhanced │───►│ Trace Builder │───►│ Trace │ │
│ │ Logging │ │ Service │ │ Storage │ │
│ │ (call-step) │ │ │ │(In-Memory) │ │
│ └─────────────────┘ └──────────────────┘ └────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────┐ ┌──────────────────┐ ┌────────────┐ │
│ │ Structured │ │ Real-time │ │ Workbench │ │
│ │ Log Events │ │ Updates │ │ API │ │
│ └─────────────────┘ └──────────────────┘ └────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Workbench UI Layer │
│ ┌─────────────┐ ┌─────────────────┐ ┌─────────────────────┐ │
│ │ Traces List │ │ Horizontal │ │ Real-time Flow │ │
│ │ Page │ │ Timeline │ │ Status │ │
│ └─────────────┘ └─────────────────┘ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Data Flow Architecture
┌─────────────────┐
│ Step Execution │
│ (any step) │
└─────┬───────────┘
│
▼
┌─────────────────┐ ┌──────────────────┐
│ Enhanced Logger │────►│ ObservabilityEvent│
│ (logEvent) │ │ - step_start │
└─────────────────┘ │ - step_end │
│ │ - state_op │
│ │ - emit_op │
│ │ - stream_op │
▼ └──────────────────┘
┌─────────────────┐ │
│ Log Stream │◄─────────────┘
│ (existing) │
└─────┬───────────┘
│
▼
┌─────────────────┐ ┌──────────────────┐
│ TraceBuilder │────►│ Trace │
│ Service │ │ - id │
│ (aggregator) │ │ - steps[] │
└─────────────────┘ │ - metadata │
│ └──────────────────┘
▼ │
┌─────────────────┐ │
│ In-Memory │◄─────────────┘
│ Trace Cache │
└─────┬───────────┘
│
▼
┌─────────────────┐ ┌──────────────────┐
│ API Endpoints │────►│ UI Components │
│ /motia/traces │ │ - TracesPage │
│ /motia/traces/:id│ │ - Timeline │
└─────────────────┘ │ - FlowStatus │
└──────────────────┘
Core Design Components
1. Data Models
Trace Structure
interface Trace {
id: string
correlationId?: string // Links related traces across triggers
parentTraceId?: string // For child/continuation traces
flowName: string
status: 'running' | 'completed' | 'failed'
startTime: number
duration?: number
entryPoint: { type: 'api' | 'cron' | 'event', stepName: string }
steps: Step[]
metadata: {
totalSteps: number,
completedSteps: number,
errorCount: number,
traceIndex?: number, // Position in logical flow sequence
isChildTrace?: boolean, // Indicates continuation of another trace
correlationContext?: any // Additional correlation metadata
}
}
interface Step {
name: string
status: 'waiting' | 'running' | 'completed' | 'failed'
startTime?: number
duration?: number
operations: { state: number, emit: number, stream: number }
error?: { message: string, code?: string | number }
}
Trace Group Structure
interface TraceGroup {
id: string // Unique identifier (same as correlationId)
correlationId: string // Primary correlation identifier
name: string // Business process name (e.g., "Chat Thread", "Order Processing")
status: 'active' | 'completed' | 'failed' | 'stalled'
startTime: number
lastActivity: number
totalDuration?: number
traces: Trace[] // All related traces in chronological order
metadata: {
totalTraces: number,
completedTraces: number,
activeTraces: number,
totalSteps: number,
averageStepDuration: number,
gapsCount: number, // Number of waiting periods between traces
totalGapDuration: number // Total time spent waiting between traces
}
}
Enhanced Log Event Structure
interface ObservabilityEvent extends Log {
eventType: 'step_start' | 'step_end' | 'state_op' | 'emit_op' | 'stream_op'
| 'correlation_start' | 'correlation_continue' // 🆕 Correlation events
traceId: string
correlationId?: string // 🆕 Links to logical flow
parentTraceId?: string // 🆕 For trace relationships
stepName: string
duration?: number
metadata?: {
operation?: 'get' | 'set' | 'delete' | 'clear'
key?: string
topic?: string
streamName?: string
success?: boolean
correlationContext?: any // 🆕 Additional correlation data
correlationMethod?: 'automatic' | 'manual' | 'state-based' | 'event-based' // 🆕
}
}
Observability Output Example
Example: User Registration Flow Execution
When a user registration flow executes, the observability system will capture comprehensive data about each step and operation:
Before Execution
Flow: user-registration
Status: Not Started
Steps: 0/3 completed
During Execution - Step 1: validate-user
Trace ID: trace_abc123_20241230_143052
Flow: user-registration
Status: Running
Current Step: validate-user (started at 143:052ms)
Step Operations:
├─ 🗄️ state.get('user.email') - 15ms ✓
├─ 🗄️ state.set('validation.status', 'pending') - 8ms ✓
├─ 🗄️ state.set('validation.result', validationData) - 12ms ✓
└─ 📡 emit('user.validated', userData) → triggers save-user - 5ms ✓
Step Result: ✓ Completed in 95ms
Operations: 3 state, 1 emit, 0 stream
During Execution - Step 2: save-user
Current Step: save-user (started at 143:147ms)
Step Operations:
├─ 🗄️ state.get('validation.result') - 8ms ✓
├─ 🌊 stream.set('users', userId, userData) - 45ms ✓
├─ 🗄️ state.set('user.saved', true) - 10ms ✓
└─ 📡 emit('user.saved', { userId, email }) → triggers send-email - 3ms ✓
Step Result: ✓ Completed in 155ms
Operations: 2 state, 1 emit, 1 stream
During Execution - Step 3: send-email
Current Step: send-email (started at 143:302ms)
Step Operations:
├─ 🗄️ state.get('user.saved') - 5ms ✓
├─ 🌊 stream.set('email-queue', emailId, emailData) - 25ms ✓
└─ 🗄️ state.set('email.queued', true) - 8ms ✓
Step Result: ✓ Completed in 75ms
Operations: 2 state, 0 emit, 1 stream
Final Observability Data Available
{
"id": "trace_abc123_20241230_143052",
"flowName": "user-registration",
"status": "completed",
"startTime": 1703943052000,
"duration": 325,
"entryPoint": {
"type": "api",
"stepName": "validate-user"
},
"steps": [
{
"name": "validate-user",
"status": "completed",
"startTime": 0,
"duration": 95,
"operations": { "state": 3, "emit": 1, "stream": 0 }
},
{
"name": "save-user",
"status": "completed",
"startTime": 95,
"duration": 155,
"operations": { "state": 2, "emit": 1, "stream": 1 }
},
{
"name": "send-email",
"status": "completed",
"startTime": 250,
"duration": 75,
"operations": { "state": 2, "emit": 0, "stream": 1 }
}
],
"metadata": {
"totalSteps": 3,
"completedSteps": 3,
"errorCount": 0
}
}
Timeline Visualization Available
user-registration (✓ 325ms) - Started 2 minutes ago
Time → 0ms 100ms 200ms 300ms 400ms
| | | |
┌─────┐
│█████│ validate-user ✓ 95ms
└──┬──┘ 🗄️(3) 📡(1) 🌊(0)
│
▼
┌──────────────┐
│██████████████│ save-user ✓ 155ms
└──────┬───────┘ 🗄️(2) 📡(1) 🌊(1)
│
▼
┌─────────┐
│█████████│ send-email ✓ 75ms
└─────────┘ 🗄️(2) 📡(0) 🌊(1)
Total Operations: 7 state, 2 emit, 2 stream
Performance: All steps < 200ms ✓
Available Analytics & Insights
- Flow Performance: 325ms total execution time
- Step Breakdown: validate-user (29%), save-user (48%), send-email (23%)
- Operation Distribution: 7 state operations, 2 emit operations, 2 stream operations
- Bottleneck Identification: save-user step took longest due to stream operation
- Event Flow: user.validated → user.saved (step-to-step communication tracked)
- State Usage: 4 unique state keys accessed across the flow
- Stream Usage: 2 streams utilized (users, email-queue)
This comprehensive observability data enables developers to understand exactly what happened during the workflow execution, identify performance bottlenecks, debug issues, and optimize their flows.
2. Horizontal Timeline Visualization
Timeline Layout Concept
Time → 0ms 100ms 250ms 400ms [ongoing]
| | | |
StepA [████]
|
├─ 🗄️ StateOp(get): user.email (25ms)
├─ 📡 EmitOp: user.validated → StepB
└─ ✓ Complete: 95ms
|
v
StepB [████████]
| |
├─ 🗄️ StateOp(set): validation.result (15ms)
├─ 🌊 StreamOp: users.create (35ms)
├─ 📡 EmitOp: user.saved → StepC
└─ ✓ Complete: 155ms
|
v
StepC [██████]
| |
├─ 🌊 StreamOp: emails.queue (20ms)
└─ ✓ Complete: 75ms
UI Component Architecture
TracesPage
├── TracesList (1/3 width)
│ ├── TraceSearch
│ ├── TraceFilters
│ └── TraceItems[]
│ └── TraceListItem
│ ├── Status Indicator
│ ├── Duration & Steps
│ └── Flow Information
└── TraceTimeline (2/3 width)
├── TimelineHeader
│ ├── TraceMetadata
│ └── TimeAxis
└── TimelineBody
└── StepRow[]
├── StepLabel
├── StepExecutionBar
├── OperationBadges[]
└── DurationInfo
3. Real-time Integration Architecture
Workbench Flow Status
┌─────────────────────────────────────────────────────────────────┐
│ User Registration Flow │
├─────────────────────────────────────────────────────────────────┤
│ │
│ validate-user ────→ save-user ────→ send-email │
│ 🟢 2 🟡 1 ⚪ 0 │
│ (2 running) (1 running) (waiting) │
│ │
│ 📊 Live Stats: │
│ • Active traces: 3 │
│ • Avg duration: 245ms │
│ • Success rate: 98.5% │
│ │
└─────────────────────────────────────────────────────────────────┘
Real-time Update Flow
Step Execution
│
▼
Enhanced Logging ─────► ObservabilityEvent
│ │
▼ ▼
Log Stream ──────────► TraceBuilder
│ │
▼ ▼
WebSocket ────────────► UI Updates
│ │
▼ ▼
Live Status ──────────► Timeline Animation
Integration Points
1. Enhanced Logging Integration
- Extend existing
call-step-file.tswith structured observability events - Leverage current logging infrastructure without major changes
- Add operation tracking for state, emit, and stream operations
- Maintain backward compatibility with existing log systems
2. TraceBuilder Service
- Aggregate log events into meaningful trace objects
- Maintain in-memory cache of recent traces (LRU eviction)
- Provide real-time updates via WebSocket integration
- Support filtering and search across trace data
3. Workbench API Extensions
- Add
/motia/tracesendpoints for trace retrieval - Integrate with existing Workbench authentication
- Provide real-time trace updates via WebSocket
- Support filtering by flow, status, step, and time range
4. UI Component Integration
- New dedicated Traces page in Workbench navigation
- Real-time flow status overlay on existing flow visualizations
- Horizontal timeline component with responsive design
- Integration with existing Workbench design system
Technical Considerations
Performance Impact
- Lightweight Logging: Leverage existing logging infrastructure with minimal overhead
- Efficient Aggregation: Process log events asynchronously in background
- Smart Rendering: Virtualized timeline for large traces
- Memory Management: LRU cache with configurable size limits
Scalability Considerations
- Log-based Architecture: Scales naturally with existing logging system
- Incremental Updates: Real-time updates without full trace recomputation
- Configurable Retention: Adjustable trace history limits based on usage
- Horizontal Scaling: Stateless TraceBuilder service design
Developer Experience
- Intuitive Timeline: Natural left-to-right execution flow visualization
- Operation Visibility: Clear visual indicators for Motia operations
- Quick Navigation: Fast filtering and search capabilities
- Mobile Support: Responsive design for all device sizes
Success Metrics
Technical Metrics
- Performance Overhead: <2% impact on step execution time
- UI Responsiveness: Timeline renders in <100ms for typical traces
- Memory Usage: Trace storage under 50MB for 50 traces
User Experience Metrics
- Developer Adoption: Usage metrics from Workbench analytics
- Error Detection: Reduced time to identify workflow issues
- Debugging Efficiency: Faster resolution of multi-step workflow problems
Conclusion
This observability system provides comprehensive visibility into Motia workflows through an intuitive horizontal timeline interface. By leveraging the existing logging infrastructure and focusing on clean, performant UI components, we can deliver powerful debugging and monitoring capabilities with minimal complexity and overhead. The architecture prioritizes simplicity, performance, and developer experience while providing the essential observability features needed for complex workflow debugging.