ADR-001: Event Processing Architecture

Status: Accepted
Date: 2025-01-06
Deciders: Architecture Team

Context

The file monitor system needs to handle high-velocity file system events reliably without resource exhaustion, while providing production-grade observability and graceful shutdown capabilities.

Decision

We implement a multi-stage event processing pipeline with the following key architectural decisions:

1. Semaphore-Based Rate Limiting

Decision: Use tokio::sync::Semaphore for bounded concurrency control.

Rationale:

  • Prevents unbounded task spawning during event storms (npm install, build systems)
  • Provides explicit backpressure mechanism
  • Try-acquire pattern allows graceful event dropping with metrics (see the sketch below)

Alternatives Considered:

  • Channel-based backpressure: Rejected due to blocking behavior propagating to watcher
  • Token bucket: Rejected due to higher complexity for similar outcome
  • No limiting: Rejected due to OOM risk in production

Consequences:

  • ✅ Predictable memory usage under load
  • ✅ Clear error handling path (RateLimitExceeded)
  • ❌ Events may be dropped under extreme load (acceptable, with metrics)
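
The try-acquire pattern referenced above, as a minimal sketch using tokio::sync::Semaphore (the handler signature and event type are illustrative, not the actual API):

use std::sync::Arc;
use tokio::sync::Semaphore;

// Illustrative handler; real event types and metric calls are elided.
fn handle_event(limiter: Arc<Semaphore>, event: String) {
    match limiter.clone().try_acquire_owned() {
        Ok(permit) => {
            // Capacity available: process on a spawned task. The permit
            // is released when it is dropped at the end of the task.
            tokio::spawn(async move {
                let _permit = permit;
                // ... debounce, checksum, and dispatch `event` here ...
                let _ = event;
            });
        }
        Err(_) => {
            // At capacity: drop the event rather than block the watcher,
            // and account for it (e.g. fs_monitor.events.dropped).
            eprintln!("rate limit exceeded; event dropped");
        }
    }
}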

2. Streaming Checksum Calculation

Decision: Use 8KB buffered streaming for SHA-256 calculation with hard size limits.

Rationale:

  • Original implementation loaded entire files into memory (OOM on large files)
  • Streaming approach maintains constant ~8KB memory per checksum operation
  • Timeout protection prevents hanging on slow I/O

Code Pattern:

use sha2::{Digest, Sha256};
use tokio::io::AsyncReadExt;

let mut hasher = Sha256::new();
let mut buffer = vec![0u8; 8192];
loop {
    let bytes_read = file.read(&mut buffer).await?;
    if bytes_read == 0 { break; } // EOF: nothing left to hash
    hasher.update(&buffer[..bytes_read]);
}

Alternatives Considered:

  • Memory-mapped files: Rejected due to platform inconsistencies
  • Larger buffers: Rejected; 8KB balances I/O efficiency against memory overhead
  • Async hasher: Not available in sha2 crate

Consequences:

  • ✅ Handles large files up to the configured size limit
  • ✅ Predictable memory footprint
  • ❌ Slower than in-memory hashing for small files (acceptable tradeoff)

3. Time-Based Debouncing

Decision: Implement debouncing using HashMap<String, Instant> with periodic cleanup.

Rationale:

  • Text editors generate 3-5 events per save operation
  • Build tools generate hundreds of temporary file events
  • Debouncing reduces downstream load by 70-90% in typical scenarios

Implementation:

// Returns true if the event should pass through, false if debounced.
match last_events.get(key) {
    // Seen within the window: suppress the duplicate.
    Some(&last_time) if now.duration_since(last_time) < window => false,
    // First sighting, or the window has elapsed: record and pass through.
    _ => {
        last_events.insert(key.to_string(), now);
        true
    }
}

Alternatives Considered:

  • Event-count debouncing: Rejected due to unpredictable behavior
  • LRU cache: Rejected as overkill for this use case
  • No debouncing: Rejected due to downstream system overload

Consequences:

  • ✅ Significantly reduced event volume
  • ✅ Configurable window balances responsiveness vs deduplication
  • ❌ Requires periodic cleanup to prevent memory growth (see the sketch below)
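
The cleanup mentioned in the last point can be a periodic retain pass; a minimal sketch, assuming the HashMap<String, Instant> from the decision above and the same debounce window:

use std::collections::HashMap;
use std::time::{Duration, Instant};

// Drop entries whose window has already elapsed; removing them cannot
// change debounce decisions, since an elapsed window passes through anyway.
fn cleanup(last_events: &mut HashMap<String, Instant>, window: Duration) {
    let now = Instant::now();
    last_events.retain(|_, last_time| now.duration_since(*last_time) < window);
}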

4. Graceful Shutdown with Drain Period

Decision: Implement coordinated shutdown with broadcast channel and drain timeout.

Rationale:

  • File system events may still be in-flight during shutdown
  • Processing tasks may be mid-operation
  • Need to guarantee delivery of events already received

Architecture:

shutdown_coordinator.shutdown() → broadcast signal
        ↓
Stop watcher (no new events)
        ↓
Drain channel with timeout
        ↓
Wait for tasks with timeout
        ↓
Force shutdown after timeout
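
A minimal sketch of this coordination with a tokio broadcast channel and a bounded drain (the coordinator struct and method names are illustrative, not the actual API):

use std::time::Duration;
use tokio::{sync::broadcast, time::timeout};

// Illustrative coordinator; real channel draining and task tracking are elided.
struct ShutdownCoordinator {
    signal: broadcast::Sender<()>,
}

impl ShutdownCoordinator {
    async fn shutdown(&self, drain: Duration) {
        // Broadcast: the watcher and every processing task see this signal.
        let _ = self.signal.send(());
        // Drain and wait, but never for longer than the configured timeout.
        if timeout(drain, self.drain_and_wait()).await.is_err() {
            // Timeout exceeded: force shutdown. Remaining events are lost,
            // which is logged and metered (see consequences below).
            eprintln!("drain timeout exceeded; forcing shutdown");
        }
    }

    async fn drain_and_wait(&self) {
        // Placeholder: drain the event channel, then await in-flight tasks.
    }
}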

Alternatives Considered:

  • Immediate shutdown: Rejected due to event loss
  • Infinite drain: Rejected due to hang risk
  • Checkpoint-based: Rejected as overly complex for this use case

Consequences:

  • ✅ No event loss during normal shutdown
  • ✅ Bounded shutdown time (configurable timeout)
  • ❌ Events may be lost if timeout exceeded (logged and metered)

5. Modular Architecture with Explicit Dependencies

Decision: Separate concerns into focused modules with clear responsibilities.

Module Structure:

config        → Configuration and validation
events        → Event types and serialization
error         → Domain errors and context
checksum      → Streaming hash calculation
debouncer     → Time-based deduplication
rate_limiter  → Backpressure control
processor     → Event processing pipeline
lifecycle     → Shutdown coordination
observability → Metrics and tracing
monitor       → Public API and orchestration

Rationale:

  • Each module has single responsibility
  • Dependencies flow in one direction (no cycles)
  • Easy to test in isolation
  • Clear boundaries for future changes

Alternatives Considered:

  • Monolithic module: Rejected due to testing and maintenance challenges
  • More granular modules: Rejected as over-engineering
  • Trait-based abstraction: Deferred until multiple implementations needed

Consequences:

  • ✅ High testability (90%+ coverage achievable)
  • ✅ Clear boundaries for modification
  • ✅ Easy to understand and maintain
  • ❌ More files to navigate (acceptable with good organization)
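
For concreteness, the structure above maps onto a conventional lib.rs; visibility here is our guess at the public-API boundary, not the actual declarations:

// lib.rs — one module per responsibility; dependencies flow downward only.
pub mod config;    // Configuration and validation
pub mod events;    // Event types and serialization
pub mod error;     // Domain errors and context
mod checksum;      // Streaming hash calculation
mod debouncer;     // Time-based deduplication
mod rate_limiter;  // Backpressure control
mod processor;     // Event processing pipeline
mod lifecycle;     // Shutdown coordination
mod observability; // Metrics and tracing
pub mod monitor;   // Public API and orchestration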

Metrics and Observability

All decisions support comprehensive observability:

  • Rate limiter: fs_monitor.rate_limiter.utilization gauge
  • Debouncing: fs_monitor.events.debounced counter
  • Checksums: fs_monitor.checksum.duration_ms histogram
  • Errors: fs_monitor.errors counter with type label
  • Processing: fs_monitor.processing.latency_us histogram
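
As an illustration, these could be emitted through the metrics facade crate; this is an assumption on our part (shown against the 0.22+ macro API), and the project's actual instrumentation layer may differ:

use metrics::{counter, gauge, histogram};

// Hypothetical call sites; the values are placeholders.
gauge!("fs_monitor.rate_limiter.utilization").set(0.42);
counter!("fs_monitor.events.debounced").increment(1);
histogram!("fs_monitor.checksum.duration_ms").record(12.5);
counter!("fs_monitor.errors", "type" => "io").increment(1);
histogram!("fs_monitor.processing.latency_us").record(830.0);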

Performance Targets

Based on these decisions, we target:

  • Throughput: 10,000+ events/sec (no checksums)
  • Memory: <50MB under typical load
  • Latency: <5ms processing pipeline (p99)
  • Availability: 99.9% uptime during normal operation

Migration Path

For existing deployments:

  1. Deploy with conservative rate limits (50 concurrent tasks; see the config sketch below)
  2. Monitor fs_monitor.events.dropped metric
  3. Tune limits based on actual load patterns
  4. Gradually enable checksums based on file size distribution
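
A purely hypothetical starting configuration for step 1 (the struct and field names below are ours, not the real schema):

// Hypothetical config mirroring the rollout steps; not the real schema.
struct MonitorConfig {
    max_concurrent_tasks: usize, // step 1: conservative rate limit
    checksums_enabled: bool,     // step 4: flip on once load is understood
}

let initial = MonitorConfig {
    max_concurrent_tasks: 50,
    checksums_enabled: false,
};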

Review Notes

Approved by: Lead Architect, SRE Team
Next Review: After 3 months in production