ADR-028: CODI2 Separation of Concerns Architecture (v4) - Part 1: Human Narrative
Document Specification Block
Document: ADR-028-v4-codi2-separation-of-concerns
Version: 2.0.0
Purpose: Redesign CODI with proper separation of logging, messaging, and state management to eliminate race conditions
Audience: Developers, DevOps Engineers, Platform Architects, Business Leaders
Date Created: 2025-09-06
Date Modified: 2025-09-06
Status: DRAFT
Key Innovation: Separates audit logging from inter-agent messaging and state management
Table of Contents
- Vision: A Race-Free Future
- The Problem Story
- User Stories
- The Solution: Three Distinct Systems
- Architecture Overview
- Migration Benefits
- Success Metrics
- Implementation Timeline
1. Vision: A Race-Free Future
Imagine a world where your development monitoring system never loses data, never experiences race conditions, and operates 10-100x faster than before. CODI2 achieves this by recognizing a fundamental truth: logging, messaging, and state management are three different concerns that require three different solutions.
2. The Problem Story
Chapter 1: The Discovery
Sarah, a senior developer at CODITECT, was debugging why her AI agents kept stepping on each other's work. She discovered that five different Claude sessions were all writing to the same log file simultaneously, creating a mess of interleaved JSON that no parser could understand.
Chapter 2: The Investigation
After analyzing 23 different race condition scenarios, the team realized they had been using logging for everything:
- Communication: "Hey agent-2, please work on ADR-005"
- State Management: "Current task: ADR-005, Status: In Progress"
- Actual Logging: "User authenticated successfully"
Chapter 3: The Revelation
The root cause was simple but profound: we were abusing logging as a universal communication mechanism. It was like trying to run a modern office using only a bulletin board: workable at small scale, but it fell apart under load.
3. User Stories
As a Developer
"I want my file changes tracked without race conditions so I never lose work due to monitoring system failures."
Acceptance Criteria:
- Zero data loss during concurrent operations
- Sub-millisecond tracking latency
- Clear separation of audit events from messages
As an AI Agent
"I want to communicate with other agents through a proper message bus so we can coordinate work without conflicts."
Acceptance Criteria:
- Guaranteed message delivery
- Proper routing and filtering
- No file I/O bottlenecks
As an Operations Engineer
"I want system state stored in a proper database so I can query current status without parsing gigabytes of logs."
Acceptance Criteria:
- ACID transactions for state updates
- Efficient queries by time range
- Consistent view across all readers
As a Business Leader
"I want a monitoring system that scales with our platform so we don't have to rebuild it as we grow."
Acceptance Criteria:
- 10-100x performance improvement
- Linear scalability with load
- Reduced operational costs
4. The Solution: Three Distinct Systems
System 1: Audit Logger (What Actually IS Logging)
Purpose: Immutable record of significant events for compliance, debugging, and analytics.
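One way to make "immutable" checkable is to chain each record to the previous one by hash, so that any later edit breaks the chain. The sketch below illustrates that idea only. The `AuditLog` class, its methods, and the use of SHA-256 are assumptions made for the example, not the CODI2 design, and the records live in memory rather than in FDB.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


@dataclass
class AuditLog:
    """Append-only, hash-chained event log (hypothetical, for illustration)."""

    _records: list = field(default_factory=list)

    def append(self, event: dict) -> str:
        # Each record commits to the hash of the record before it.
        prev_hash = self._records[-1]["hash"] if self._records else GENESIS
        payload = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        self._records.append({"prev": prev_hash, "event": payload, "hash": digest})
        return digest

    def verify(self) -> bool:
        # Recompute every hash; an edited or reordered record fails the check.
        prev_hash = GENESIS
        for rec in self._records:
            expected = hashlib.sha256((prev_hash + rec["event"]).encode()).hexdigest()
            if rec["prev"] != prev_hash or rec["hash"] != expected:
                return False
            prev_hash = rec["hash"]
        return True


log = AuditLog()
log.append({"action": "USER_AUTH", "result": "success"})
log.append({"action": "TASK_ASSIGN", "task": "ADR-028"})
chain_ok = log.verify()
```

A chain like this detects tampering but does not prevent it. Storage-level protections would still be needed for a compliance-grade trail.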
System 2: Message Bus (What Should NOT Be Logging)
Purpose: High-performance, in-memory communication between agents with proper routing.
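As a rough illustration of in-memory routing, the sketch below gives each agent its own inbox and supports both direct sends and topic fan-out. The `MessageBus` class and its method names are hypothetical. Note that Python's `queue.Queue` takes a lock internally, so this shows the routing shape only and says nothing about the lock-free behaviour or latency targets described in this ADR.

```python
import queue
from collections import defaultdict


class MessageBus:
    """In-memory bus with one inbox per agent (hypothetical sketch)."""

    def __init__(self):
        self._inboxes = {}                # agent_id -> queue.Queue
        self._topics = defaultdict(set)   # topic -> set of subscribed agent_ids

    def register(self, agent_id):
        # Many producers may put into an inbox; only its owner reads from it.
        self._inboxes[agent_id] = queue.Queue()

    def subscribe(self, agent_id, topic):
        self._topics[topic].add(agent_id)

    def send(self, to, message):
        """Deliver a message directly to one agent's inbox."""
        self._inboxes[to].put(message)

    def publish(self, topic, message):
        """Fan a message out to every subscriber; return the recipient count."""
        for agent_id in self._topics[topic]:
            self._inboxes[agent_id].put(message)
        return len(self._topics[topic])

    def receive(self, agent_id, timeout=1.0):
        """Block until a message arrives (raises queue.Empty on timeout)."""
        return self._inboxes[agent_id].get(timeout=timeout)


bus = MessageBus()
for agent in ("agent-1", "agent-2"):
    bus.register(agent)
bus.subscribe("agent-2", "status")

bus.send("agent-2", {"type": "TASK_ASSIGN", "task": "ADR-005"})
delivered = bus.publish("status", {"task": "ADR-005", "status": "in_progress"})
first = bus.receive("agent-2")
second = bus.receive("agent-2")
```

Nothing here touches the file system, which is the point of the separation: coordination traffic never competes with audit writes for the same file.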
System 3: State Store (What Should NEVER Be in Logs)
Purpose: Consistent, distributed storage of system state with ACID guarantees.
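To show what a transactional state update buys over a log line, the sketch below has two agents try to claim the same task, and exactly one succeeds. It uses `sqlite3` from the Python standard library purely as a runnable stand-in for the FDB-backed store named in this ADR. The `tasks` table and the `claim_task` function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (task_id TEXT PRIMARY KEY, owner TEXT, status TEXT)"
)
conn.execute("INSERT INTO tasks VALUES ('ADR-005', NULL, 'pending')")
conn.commit()


def claim_task(conn, task_id, agent_id):
    """Atomically claim a pending task; return False if it was already taken."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE tasks SET owner = ?, status = 'in_progress' "
            "WHERE task_id = ? AND status = 'pending'",
            (agent_id, task_id),
        )
        # The WHERE clause makes the check and the update a single step,
        # so a second claimant matches zero rows instead of overwriting.
        return cur.rowcount == 1


first_claim = claim_task(conn, "ADR-005", "agent-1")
second_claim = claim_task(conn, "ADR-005", "agent-2")
owner = conn.execute(
    "SELECT owner FROM tasks WHERE task_id = 'ADR-005'"
).fetchone()[0]
```

The current status is read with one query rather than by replaying log entries.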
5. Architecture Overview
Before and After Comparison
| Aspect | CODI v1 (Current) | CODI2 (New) |
|---|---|---|
| Architecture | Everything → Log File | Separated Concerns |
| Performance | 100+ ms per operation | <1 ms audit, <0.1 ms messaging |
| Concurrency | File locks, race conditions | Lock-free, wait-free |
| Scalability | Limited by file I/O | Linear with resources |
| Data Loss | Common under load | Impossible by design |
| Complexity | Simple but broken | Proper but maintainable |
Data Flow Example: Task Assignment
Old Way (Everything in Logs):
{"action": "TASK_ASSIGN", "from": "orchestrator", "to": "agent-1", "task": "ADR-028"}
{"action": "TASK_ACK", "from": "agent-1", "task": "ADR-028"}
{"action": "STATUS_UPDATE", "task": "ADR-028", "status": "in_progress"}
New Way (Proper Separation):
- Message: Orchestrator → Agent-1 via message bus (0.1 ms)
- State: Update task status in FoundationDB (FDB) atomically (5 ms)
- Audit: Log assignment event for compliance (10 ms, async)
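The three steps above can be sketched as one function that touches three separate channels. Everything here is a simplified stand-in chosen so the example runs on its own: a `queue.Queue` plays the message bus, a lock-guarded dict plays the transactional state store, and a background thread plays the asynchronous audit writer. The names `assign_task`, `agent_inbox`, `task_state`, and `audit_queue` are hypothetical.

```python
import queue
import threading

agent_inbox = queue.Queue()   # stand-in for the message bus
audit_queue = queue.Queue()   # events waiting for the audit writer
audit_records = []            # what the audit writer has persisted
task_state = {}               # stand-in for the state store
state_lock = threading.Lock()


def audit_writer():
    """Drain audit events off the hot path; None signals shutdown."""
    while True:
        event = audit_queue.get()
        if event is None:
            break
        audit_records.append(event)


def assign_task(task, agent):
    # 1. Message: tell the agent directly instead of writing a log line.
    agent_inbox.put({"type": "TASK_ASSIGN", "task": task, "to": agent})
    # 2. State: record the current status where it can be queried.
    with state_lock:
        task_state[task] = {"owner": agent, "status": "in_progress"}
    # 3. Audit: enqueue the event; the caller does not wait for the write.
    audit_queue.put({"action": "TASK_ASSIGN", "task": task, "to": agent})


writer = threading.Thread(target=audit_writer)
writer.start()
assign_task("ADR-028", "agent-1")
audit_queue.put(None)  # flush and stop the writer
writer.join()
```

Only the audit step is allowed to be slow, and it runs off the caller's path, which is how a 10 ms audit write can coexist with a 0.1 ms message.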
6. Migration Benefits
Immediate Benefits (Day 1)
- No More Race Conditions: Single-writer pattern eliminates conflicts
- 10x Faster Operations: In-memory messaging vs file I/O
- Data Integrity: ACID transactions for all state changes
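The single-writer pattern mentioned above can be illustrated with five producer threads, mirroring the five sessions in the problem story, that hand events to one writer thread through a queue. Because only the writer touches the file, lines cannot interleave. This is a minimal sketch under assumed names (`producer`, `writer`, `events`), not the CODI2 implementation.

```python
import json
import os
import queue
import tempfile
import threading

events = queue.Queue()
SENTINEL = None  # tells the writer to stop


def writer(path):
    """The only code that writes to the file: one event per line."""
    with open(path, "w", encoding="utf-8") as f:
        while True:
            event = events.get()
            if event is SENTINEL:
                break
            f.write(json.dumps(event) + "\n")


def producer(session_id, count):
    """Simulates one session emitting events concurrently with the others."""
    for seq in range(count):
        events.put({"session": session_id, "seq": seq})


fd, log_path = tempfile.mkstemp(suffix=".jsonl")
os.close(fd)

writer_thread = threading.Thread(target=writer, args=(log_path,))
writer_thread.start()

producers = [threading.Thread(target=producer, args=(s, 100)) for s in range(5)]
for p in producers:
    p.start()
for p in producers:
    p.join()

events.put(SENTINEL)
writer_thread.join()

# Every line parses as JSON because no two threads ever shared the file.
with open(log_path, encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f]
os.remove(log_path)
```

The queue also preserves each producer's own ordering, so events from one session are written in the order that session emitted them.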
Medium-term Benefits (Month 1)
- Advanced Queries: "Show all tasks assigned in last hour"
- Real-time Dashboards: WebSocket feeds from message bus
- Debugging Tools: Trace specific agent interactions
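A query such as "show all tasks assigned in the last hour" becomes an indexed range scan once state lives in a database rather than in log files. The sketch below again uses `sqlite3` only as a runnable stand-in for the real store. The `assignments` table, the `assigned_since` helper, and the task ID `ADR-030` are made up for the example.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE assignments (task_id TEXT, agent TEXT, assigned_at REAL)"
)
# An index on the timestamp lets range queries avoid a full table scan.
conn.execute("CREATE INDEX idx_assigned_at ON assignments (assigned_at)")

now = time.time()
conn.executemany(
    "INSERT INTO assignments VALUES (?, ?, ?)",
    [
        ("ADR-005", "agent-1", now - 7200),  # two hours ago
        ("ADR-028", "agent-2", now - 600),   # ten minutes ago
        ("ADR-030", "agent-1", now - 60),    # one minute ago
    ],
)
conn.commit()


def assigned_since(conn, window_seconds, now):
    """Return task IDs assigned within the last window_seconds, oldest first."""
    cur = conn.execute(
        "SELECT task_id FROM assignments "
        "WHERE assigned_at >= ? ORDER BY assigned_at",
        (now - window_seconds,),
    )
    return [row[0] for row in cur]


recent = assigned_since(conn, 3600, now)
```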
Long-term Benefits (Year 1)
- 100x Scale: Handle millions of events per second
- AI Training Data: Clean, structured audit logs
- Compliance Ready: Immutable audit trail with proof
7. Success Metrics
Performance Metrics (Current → Target)
- Message Latency: 100+ ms → <0.1 ms (p99)
- State Updates: 200+ ms → <5 ms (p99)
- Audit Writes: 150+ ms → <10 ms (p99)
- Query Performance: 5+ seconds → <50 ms for time-range queries
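Since the targets above are stated at p99, it may help to be explicit about how a percentile is checked against a threshold. The sketch below uses the nearest-rank definition, which is one of several common conventions. The `percentile` helper and the sample latencies are invented for illustration and are not measurements from CODI or CODI2.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct percent
    of all samples at or below it. pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# 100 made-up message latencies in milliseconds, including one slow outlier.
latencies_ms = [0.05] * 98 + [0.09, 4.0]

p99 = percentile(latencies_ms, 99)
worst = max(latencies_ms)
meets_target = p99 < 0.1  # the message-latency target from the list above
```

With these samples the single 4.0 ms outlier falls outside p99, so the target is met even though the worst case is far above it. A p99 target therefore does not bound the slowest operations.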
Reliability Metrics (Current → Target)
- Data Loss: ~1% under load → 0 events lost
- Race Conditions: 23 identified → 0 detected
- Uptime: 95% (due to locks) → 99.99% availability
- Recovery Time: 30+ seconds → <5 seconds
Business Metrics (Current → Target)
- Development Velocity: Baseline → 2x faster feature delivery
- Operational Cost: Baseline → 50% reduction in compute
- Developer Time on Race Bugs: 20% → 0%
- Audit Compliance: 90% → 100% event capture
Test Coverage Requirements
- Unit Tests: 100% coverage (no exceptions)
- Integration Tests: 100% coverage (no exceptions)
- Critical Path Tests: 100% coverage (message delivery, state consistency, audit integrity)
Zero-Tolerance Policy: CODI2 is too critical for partial coverage. Every code path must be tested.
8. Implementation Timeline
Phase 1: Foundation (Week 1)
- Build message bus with MPSC channels
- Implement basic state store interface
- Create audit logger with FDB backend
- Start with "session coordination" use case
Phase 2: Migration (Week 2)
- Port file monitoring to new architecture
- Update AI agent communication
- Migrate existing log parsers
- Maintain backward compatibility
Phase 3: Enhancement (Week 3)
- Add WebSocket streaming
- Build query interface
- Implement retention policies
- Create monitoring dashboards
Phase 4: Optimization (Week 4)
- Performance tuning
- Add caching layers
- Implement batching
- Deploy to production
Version History
- 2.0.0 (2025-09-06): Updated with baseline metrics and test coverage requirements
- 1.0.0 (2025-09-06): Initial version
Approval
Product Owner: ___________________ Date: ___________
Technical Lead: ___________________ Date: ___________
QA Review: ___________________ Date: ___________
Next:
- Part 2: Technical Implementation - Complete implementation details
- Part 3: Comprehensive Testing - Exhaustive test strategy