ADR-001: Event Processing Architecture
Status: Accepted
Date: 2025-01-06
Deciders: Architecture Team
Context
The file monitor system needs to handle high-velocity file system events reliably without resource exhaustion, while providing production-grade observability and graceful shutdown capabilities.
Decision
We implement a multi-stage event processing pipeline with the following key architectural decisions:
1. Semaphore-Based Rate Limiting
Decision: Use tokio::sync::Semaphore for bounded concurrency control.
Rationale:
- Prevents unbounded task spawning during event storms (npm install, build systems)
- Provides explicit backpressure mechanism
- Try-acquire pattern allows graceful event dropping with metrics
Alternatives Considered:
- Channel-based backpressure: Rejected because blocking behavior would propagate back to the watcher
- Token bucket: Rejected due to higher complexity for similar outcome
- No limiting: Rejected due to OOM risk in production
Consequences:
- ✅ Predictable memory usage under load
- ✅ Clear error handling path (RateLimitExceeded)
- ⚠️ Events may be dropped under extreme load (acceptable, with metrics)
2. Streaming Checksum Calculation
Decision: Use 8KB buffered streaming for SHA-256 calculation with hard size limits.
Rationale:
- Original implementation loaded entire files into memory (OOM on large files)
- Streaming approach maintains constant ~8KB memory per checksum operation
- Timeout protection prevents hanging on slow I/O
Code Pattern:
use sha2::{Digest, Sha256};
use tokio::io::AsyncReadExt;

let mut hasher = Sha256::new();
let mut buffer = vec![0u8; 8192]; // constant 8 KB working set
loop {
    let bytes_read = file.read(&mut buffer).await?;
    if bytes_read == 0 { break; } // EOF reached
    hasher.update(&buffer[..bytes_read]);
}
let checksum = hasher.finalize();
Alternatives Considered:
- Memory-mapped files: Rejected due to platform inconsistencies
- Larger buffers: 8KB balances I/O efficiency with memory overhead
- Async hasher: Not available in sha2 crate
Consequences:
- ✅ Handles arbitrarily large files (up to configured limit)
- ✅ Predictable memory footprint
- ⚠️ Slower than in-memory for small files (acceptable tradeoff)
3. Time-Based Debouncing
Decision: Implement debouncing using HashMap<String, Instant> with periodic cleanup.
Rationale:
- Text editors generate 3-5 events per save operation
- Build tools generate hundreds of temporary file events
- Debouncing reduces downstream load by 70-90% in typical scenarios
Implementation:
match last_events.get(key) {
    Some(&last_time) if now.duration_since(last_time) < window => false,
    _ => {
        last_events.insert(key, now);
        true
    }
}
Alternatives Considered:
- Event-count debouncing: Rejected due to unpredictable behavior
- LRU cache: Rejected as overkill for this use case
- No debouncing: Rejected due to downstream system overload
Consequences:
- ✅ Significantly reduced event volume
- ✅ Configurable window balances responsiveness vs deduplication
- ⚠️ Requires periodic cleanup to prevent memory growth
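Putting the match expression and the periodic cleanup together, a minimal self-contained debouncer might look like this. Names (`Debouncer`, `should_process`, `evict_stale`) are illustrative, not the actual module API.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Minimal sketch of the time-based debouncer: one timestamp per key,
/// with periodic eviction to keep the map bounded.
struct Debouncer {
    window: Duration,
    last_events: HashMap<String, Instant>,
}

impl Debouncer {
    fn new(window: Duration) -> Self {
        Self { window, last_events: HashMap::new() }
    }

    /// Returns true if the event should be forwarded downstream;
    /// false if it falls inside the debounce window for this key.
    fn should_process(&mut self, key: &str) -> bool {
        let now = Instant::now();
        match self.last_events.get(key) {
            Some(&last) if now.duration_since(last) < self.window => false,
            _ => {
                self.last_events.insert(key.to_string(), now);
                true
            }
        }
    }

    /// Periodic cleanup: drop entries older than the window so the map
    /// cannot grow without bound.
    fn evict_stale(&mut self) {
        let now = Instant::now();
        let window = self.window;
        self.last_events
            .retain(|_, &mut last| now.duration_since(last) < window);
    }

    fn len(&self) -> usize {
        self.last_events.len()
    }
}
```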
4. Graceful Shutdown with Drain Period
Decision: Implement coordinated shutdown with broadcast channel and drain timeout.
Rationale:
- File system events may still be in-flight during shutdown
- Processing tasks may be mid-operation
- Need to guarantee delivery of events already received
Architecture:
shutdown_coordinator.shutdown() → broadcast signal
        ↓
Stop watcher (no new events)
        ↓
Drain channel with timeout
        ↓
Wait for tasks with timeout
        ↓
Force shutdown after timeout
Alternatives Considered:
- Immediate shutdown: Rejected due to event loss
- Infinite drain: Rejected due to hang risk
- Checkpoint-based: Rejected as overly complex for this use case
Consequences:
- ✅ No event loss during normal shutdown
- ✅ Bounded shutdown time (configurable timeout)
- ⚠️ Events may be lost if timeout exceeded (logged and metered)
5. Modular Architecture with Explicit Dependencies
Decision: Separate concerns into focused modules with clear responsibilities.
Module Structure:
config        → Configuration and validation
events        → Event types and serialization
error         → Domain errors and context
checksum      → Streaming hash calculation
debouncer     → Time-based deduplication
rate_limiter  → Backpressure control
processor     → Event processing pipeline
lifecycle     → Shutdown coordination
observability → Metrics and tracing
monitor       → Public API and orchestration
Rationale:
- Each module has single responsibility
- Dependencies flow in one direction (no cycles)
- Easy to test in isolation
- Clear boundaries for future changes
Alternatives Considered:
- Monolithic module: Rejected due to testing and maintenance challenges
- More granular modules: Rejected as over-engineering
- Trait-based abstraction: Deferred until multiple implementations needed
Consequences:
- ✅ High testability (90%+ coverage achievable)
- ✅ Clear boundaries for modification
- ✅ Easy to understand and maintain
- ⚠️ More files to navigate (acceptable with good organization)
Metrics and Observability
All decisions support comprehensive observability:
- Rate limiter: fs_monitor.rate_limiter.utilization (gauge)
- Debouncing: fs_monitor.events.debounced (counter)
- Checksums: fs_monitor.checksum.duration_ms (histogram)
- Errors: fs_monitor.errors (counter, with type label)
- Processing: fs_monitor.processing.latency_us (histogram)
Performance Targets
Based on these decisions, we target:
- Throughput: 10,000+ events/sec (no checksums)
- Memory: <50MB under typical load
- Latency: <5ms processing pipeline (p99)
- Availability: 99.9% uptime during normal operation
Migration Path
For existing deployments:
- Deploy with conservative rate limits (50 concurrent tasks)
- Monitor the fs_monitor.events.dropped metric
- Tune limits based on actual load patterns
- Gradually enable checksums based on file size distribution
References
- Original issue: Resource exhaustion during npm install
- Related: Production incident #2024-11-15
- notify crate documentation: https://docs.rs/notify/
- Tokio semaphore: https://docs.rs/tokio/latest/tokio/sync/struct.Semaphore.html
Review Notes
Approved by: Lead Architect, SRE Team
Next Review: After 3 months in production