ADR-001: Event Processing Architecture

Status: Accepted
Date: 2025-01-06
Deciders: Architecture Team

Context

The file monitor system needs to handle high-velocity file system events reliably without resource exhaustion, while providing production-grade observability and graceful shutdown capabilities.

Decision

We implement a multi-stage event processing pipeline with the following key architectural decisions:

1. Semaphore-Based Rate Limiting

Decision: Use tokio::sync::Semaphore for bounded concurrency control.

Rationale:

  • Prevents unbounded task spawning during event storms (npm install, build systems)
  • Provides explicit backpressure mechanism
  • Try-acquire pattern allows graceful event dropping with metrics (see the sketch below)

Alternatives Considered:

  • Channel-based backpressure: Rejected due to blocking behavior propagating to watcher
  • Token bucket: Rejected due to higher complexity for similar outcome
  • No limiting: Rejected due to OOM risk in production

Consequences:

  • ✅ Predictable memory usage under load
  • ✅ Clear error handling path (RateLimitExceeded)
  • ❌ Events may be dropped under extreme load (acceptable, with metrics)
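
The try-acquire pattern referenced above, as a minimal sketch using tokio::sync::Semaphore (the handler signature and event type are illustrative, not the actual API):

use std::sync::Arc;
use tokio::sync::Semaphore;

// Illustrative handler; real event types and metric calls are elided.
fn handle_event(limiter: Arc<Semaphore>, event: String) {
    match limiter.clone().try_acquire_owned() {
        Ok(permit) => {
            // Capacity available: process on a spawned task. The permit
            // is released when it is dropped at the end of the task.
            tokio::spawn(async move {
                let _permit = permit;
                // ... debounce, checksum, and dispatch `event` here ...
                let _ = event;
            });
        }
        Err(_) => {
            // At capacity: drop the event rather than block the watcher,
            // and account for it (e.g. fs_monitor.events.dropped).
            eprintln!("rate limit exceeded; event dropped");
        }
    }
}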

2. Streaming Checksum Calculation

Decision: Use 8KB buffered streaming for SHA-256 calculation with hard size limits.

Rationale:

  • Original implementation loaded entire files into memory (OOM on large files)
  • Streaming approach maintains constant ~8KB memory per checksum operation
  • Timeout protection prevents hanging on slow I/O

Code Pattern:

use sha2::{Digest, Sha256};
use tokio::io::AsyncReadExt;

let mut hasher = Sha256::new();
let mut buffer = vec![0u8; 8192];
loop {
    let bytes_read = file.read(&mut buffer).await?;
    if bytes_read == 0 { break; } // EOF: nothing left to hash
    hasher.update(&buffer[..bytes_read]);
}

Alternatives Considered:

  • Memory-mapped files: Rejected due to platform inconsistencies
  • Larger buffers: Rejected; 8KB balances I/O efficiency against memory overhead
  • Async hasher: Not available in sha2 crate

Consequences:

  • ✅ Handles large files up to the configured size limit
  • ✅ Predictable memory footprint
  • ❌ Slower than in-memory hashing for small files (acceptable tradeoff)

3. Time-Based Debouncing

Decision: Implement debouncing using HashMap<String, Instant> with periodic cleanup.

Rationale:

  • Text editors generate 3-5 events per save operation
  • Build tools generate hundreds of temporary file events
  • Debouncing reduces downstream load by 70-90% in typical scenarios

Implementation:

// Returns true if the event should pass through, false if debounced.
match last_events.get(key) {
    // Seen within the window: suppress the duplicate.
    Some(&last_time) if now.duration_since(last_time) < window => false,
    // First sighting, or the window has elapsed: record and pass through.
    _ => {
        last_events.insert(key.to_string(), now);
        true
    }
}

Alternatives Considered:

  • Event-count debouncing: Rejected due to unpredictable behavior
  • LRU cache: Rejected as overkill for this use case
  • No debouncing: Rejected due to downstream system overload

Consequences:

  • ✅ Significantly reduced event volume
  • ✅ Configurable window balances responsiveness vs deduplication
  • ❌ Requires periodic cleanup to prevent memory growth (see the sketch below)
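
The cleanup mentioned in the last point can be a periodic retain pass; a minimal sketch, assuming the HashMap<String, Instant> from the decision above and the same debounce window:

use std::collections::HashMap;
use std::time::{Duration, Instant};

// Drop entries whose window has already elapsed; removing them cannot
// change debounce decisions, since an elapsed window passes through anyway.
fn cleanup(last_events: &mut HashMap<String, Instant>, window: Duration) {
    let now = Instant::now();
    last_events.retain(|_, last_time| now.duration_since(*last_time) < window);
}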

4. Graceful Shutdown with Drain Period

Decision: Implement coordinated shutdown with broadcast channel and drain timeout.

Rationale:

  • File system events may still be in-flight during shutdown
  • Processing tasks may be mid-operation
  • Need to guarantee delivery of events already received

Architecture:

shutdown_coordinator.shutdown() → broadcast signal
        ↓
Stop watcher (no new events)
        ↓
Drain channel with timeout
        ↓
Wait for tasks with timeout
        ↓
Force shutdown after timeout
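
A minimal sketch of this coordination with a tokio broadcast channel and a bounded drain (the coordinator struct and method names are illustrative, not the actual API):

use std::time::Duration;
use tokio::{sync::broadcast, time::timeout};

// Illustrative coordinator; real channel draining and task tracking are elided.
struct ShutdownCoordinator {
    signal: broadcast::Sender<()>,
}

impl ShutdownCoordinator {
    async fn shutdown(&self, drain: Duration) {
        // Broadcast: the watcher and every processing task see this signal.
        let _ = self.signal.send(());
        // Drain and wait, but never for longer than the configured timeout.
        if timeout(drain, self.drain_and_wait()).await.is_err() {
            // Timeout exceeded: force shutdown. Remaining events are lost,
            // which is logged and metered (see consequences below).
            eprintln!("drain timeout exceeded; forcing shutdown");
        }
    }

    async fn drain_and_wait(&self) {
        // Placeholder: drain the event channel, then await in-flight tasks.
    }
}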

Alternatives Considered:

  • Immediate shutdown: Rejected due to event loss
  • Infinite drain: Rejected due to hang risk
  • Checkpoint-based: Rejected as overly complex for this use case

Consequences:

  • ✅ No event loss during normal shutdown
  • ✅ Bounded shutdown time (configurable timeout)
  • ❌ Events may be lost if timeout exceeded (logged and metered)

5. Modular Architecture with Explicit Dependencies

Decision: Separate concerns into focused modules with clear responsibilities.

Module Structure:

config        → Configuration and validation
events        → Event types and serialization
error         → Domain errors and context
checksum      → Streaming hash calculation
debouncer     → Time-based deduplication
rate_limiter  → Backpressure control
processor     → Event processing pipeline
lifecycle     → Shutdown coordination
observability → Metrics and tracing
monitor       → Public API and orchestration

Rationale:

  • Each module has single responsibility
  • Dependencies flow in one direction (no cycles)
  • Easy to test in isolation
  • Clear boundaries for future changes

Alternatives Considered:

  • Monolithic module: Rejected due to testing and maintenance challenges
  • More granular modules: Rejected as over-engineering
  • Trait-based abstraction: Deferred until multiple implementations needed

Consequences:

  • ✅ High testability (90%+ coverage achievable)
  • ✅ Clear boundaries for modification
  • ✅ Easy to understand and maintain
  • ❌ More files to navigate (acceptable with good organization)
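
For concreteness, the structure above maps onto a conventional lib.rs; visibility here is our guess at the public-API boundary, not the actual declarations:

// lib.rs — one module per responsibility; dependencies flow downward only.
pub mod config;    // Configuration and validation
pub mod events;    // Event types and serialization
pub mod error;     // Domain errors and context
mod checksum;      // Streaming hash calculation
mod debouncer;     // Time-based deduplication
mod rate_limiter;  // Backpressure control
mod processor;     // Event processing pipeline
mod lifecycle;     // Shutdown coordination
mod observability; // Metrics and tracing
pub mod monitor;   // Public API and orchestration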

Metrics and Observability

All decisions support comprehensive observability:

  • Rate limiter: fs_monitor.rate_limiter.utilization gauge
  • Debouncing: fs_monitor.events.debounced counter
  • Checksums: fs_monitor.checksum.duration_ms histogram
  • Errors: fs_monitor.errors counter with type label
  • Processing: fs_monitor.processing.latency_us histogram
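
As an illustration, these could be emitted through the metrics facade crate; this is an assumption on our part (shown against the 0.22+ macro API), and the project's actual instrumentation layer may differ:

use metrics::{counter, gauge, histogram};

// Hypothetical call sites; the values are placeholders.
gauge!("fs_monitor.rate_limiter.utilization").set(0.42);
counter!("fs_monitor.events.debounced").increment(1);
histogram!("fs_monitor.checksum.duration_ms").record(12.5);
counter!("fs_monitor.errors", "type" => "io").increment(1);
histogram!("fs_monitor.processing.latency_us").record(830.0);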

Performance Targets

Based on these decisions, we target:

  • Throughput: 10,000+ events/sec (no checksums)
  • Memory: <50MB under typical load
  • Latency: <5ms processing pipeline (p99)
  • Availability: 99.9% uptime during normal operation

Migration Path

For existing deployments:

  1. Deploy with conservative rate limits (50 concurrent tasks; see the config sketch below)
  2. Monitor fs_monitor.events.dropped metric
  3. Tune limits based on actual load patterns
  4. Gradually enable checksums based on file size distribution
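
A purely hypothetical starting configuration for step 1 (the struct and field names below are ours, not the real schema):

// Hypothetical config mirroring the rollout steps; not the real schema.
struct MonitorConfig {
    max_concurrent_tasks: usize, // step 1: conservative rate limit
    checksums_enabled: bool,     // step 4: flip on once load is understood
}

let initial = MonitorConfig {
    max_concurrent_tasks: 50,
    checksums_enabled: false,
};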

Review Notes

Approved by: Lead Architect, SRE Team
Next Review: After 3 months in production