ADR-060: MoE Verification Layer - Constitutional Court Architecture
Status
Accepted - January 8, 2026
Context
Problem Statement
The CODITECT Mixture of Experts (MoE) classification system requires a verification layer to ensure classification quality through multi-perspective evaluation. Key challenges:
- Single-Point Bias: Single-model evaluation risks systematic blind spots
- No Verification: Analyst classifications lack independent verification
- Missing Audit Trail: No provenance tracking for evaluation decisions
- Unclear Standards: Inconsistent evaluation criteria across components
- Static Prompts: No mechanism for continuous prompt improvement
Research Foundation
Based on research in:
- analyze-new-artifacts/MOE-JUDGES-RESEARCH/ - PoLL (Panel of LLM Judges) patterns
- analyze-new-artifacts/CLAUDE-CODE-EVAL-LOOPS/ - Self-improving eval loops
- Constitutional AI principles for multi-perspective evaluation
Requirements
- Multi-perspective evaluation with diverse judge personas
- Multi-model diversity to reduce single-vendor bias
- Debate protocol for resolving disagreements
- Full provenance tracking (model, tokens, latency, timestamp)
- ADR-derived rubrics for consistent standards
- Self-improving eval loop for continuous quality improvement
Decision
Implement a Constitutional Court verification architecture with five interconnected subsystems:
1. Judge Persona System (H.3.1)
Specialized judge personas with distinct expertise areas:
┌─────────────────────────────────────────────────────────────────────────────┐
│ JUDGE PERSONA SYSTEM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Technical │ │ Compliance │ │ Security │ │
│ │ Architect │ │ Auditor │ │ Analyst │ │
│ │ Marcus Rivera │ │ Dr. Okonkwo │ │ James Nakamura │ │
│ │ 22 years exp │ │ HIPAA/FDA/SOC2 │ │ OWASP Top 10 │ │
│ │ claude-sonnet │ │ gpt-4o │ │ deepseek-v3 │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │ │
│ └────────────────────┼────────────────────┘ │
│ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Domain │ │ QA │ │ Additional │ │
│ │ Expert │ │ Evaluator │ │ Personas │ │
│ │ Dr. Vasquez │ │ Priya Sharma │ │ (Extensible) │ │
│ │ Healthcare/HL7 │ │ Test Coverage │ │ │ │
│ │ qwen2.5-72b │ │ claude-haiku │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Configuration: config/judge-personas.json
Loader: scripts/moe_classifier/core/persona_loader.py
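The persona config schema is not reproduced in this ADR; the sketch below shows how persona_loader.py might build a registry from a parsed judge-personas.json document. The field names (`id`, `name`, `expertise`, `model`) and the `parse_personas` function are illustrative assumptions, not the actual loader API:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class JudgePersona:
    persona_id: str
    name: str
    expertise: List[str]
    model: str  # default model family routed to this judge

def parse_personas(raw: dict) -> Dict[str, JudgePersona]:
    """Build a persona registry keyed by persona ID (hypothetical schema)."""
    return {
        p["id"]: JudgePersona(p["id"], p["name"], p["expertise"], p["model"])
        for p in raw["personas"]
    }

sample = {
    "personas": [
        {"id": "technical-architect", "name": "Marcus Rivera",
         "expertise": ["architecture", "scalability"], "model": "claude-sonnet"},
        {"id": "security-analyst", "name": "James Nakamura",
         "expertise": ["OWASP Top 10"], "model": "deepseek-v3"},
    ]
}
registry = parse_personas(sample)
```

Keeping the registry keyed by ID makes the "Additional Personas (Extensible)" slot in the diagram a pure config change: new judges are added to the JSON without touching loader code.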
2. ADR-to-Rubric Generator (H.3.2)
Automatic extraction of evaluation rubrics from Architecture Decision Records:
┌─────────────────────────────────────────────────────────────────────────────┐
│ ADR-TO-RUBRIC PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ADR Documents Constraint Extraction Generated Rubric │
│ ┌──────────────┐ ┌──────────────────┐ ┌──────────────┐ │
│ │ ADR-001.md │──────────▶│ MUST: mandatory │────────▶│ Dimension 1 │ │
│ │ ADR-002.md │ │ SHOULD: recomm. │ │ Scale: 1-3 │ │
│ │ ... │ │ MAY: optional │ │ Weight: 0.25 │ │
│ │ ADR-058.md │ │ Technical terms │ └──────────────┘ │
│ └──────────────┘ └──────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ Merge with Persona Rubrics │ │
│ │ │ │
│ │ Base Persona Rubric + ADR Rubrics = Merged Rubric │ │
│ │ (5 dimensions) (N dimensions) (5+N dimensions) │ │
│ │ │ │
│ │ Weights renormalized to sum to 1.0 │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Generator: scripts/adr-rubric-generator.py
Merger: scripts/moe_classifier/core/rubric_merger.py
Output: config/generated-rubrics/, config/merged-rubrics/
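The weight renormalization step in the merge box above reduces to dividing every dimension weight by the new total. A minimal sketch, where `merge_rubrics` and the dimension names are illustrative rather than the actual rubric_merger.py interface:

```python
from typing import Dict

def merge_rubrics(base: Dict[str, float], adr: Dict[str, float]) -> Dict[str, float]:
    """Merge persona and ADR rubric dimension weights, renormalizing to sum to 1.0."""
    merged = {**base, **adr}          # base persona dims + ADR-derived dims
    total = sum(merged.values())
    return {dim: w / total for dim, w in merged.items()}

base = {"correctness": 0.3, "clarity": 0.2, "completeness": 0.5}  # persona rubric
adr = {"adr-001-compliance": 0.25}                                # ADR-derived dim
merged = merge_rubrics(base, adr)
# merged has 4 dimensions whose weights again sum to 1.0
```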
3. Debate Protocol (H.3.3)
Multi-round debate when judges disagree:
┌─────────────────────────────────────────────────────────────────────────────┐
│ DEBATE PROTOCOL │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Round 0: Initial Evaluation │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Judge A │ │ Judge B │ │ Judge C │ │
│ │ APPROVE │ │ REJECT │ │ APPROVE │ ◄── Disagreement detected │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ └────────────┼────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Debate Context Preparation │ │
│ │ • Identify disagreement points (verdict-level, dimension-level) │ │
│ │ • Format positions with rationale excerpts │ │
│ │ • Distribute to all judges │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Round 1-N: Debate Rounds (max 3) │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Judge A │ │ Judge B │ │ Judge C │ │
│ │ APPROVE │ │ APPROVE │ │ APPROVE │ ◄── Convergence! │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ Convergence Threshold: 80% agreement │
│ Max Rounds: 3 │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Implementation: scripts/moe_classifier/core/debate.py
Integration: scripts/moe_classifier/core/consensus.py
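The convergence check pictured above reduces to a majority-fraction test against the 80% threshold. A minimal sketch (the function name is illustrative, not the debate.py API):

```python
from collections import Counter
from typing import List

CONVERGENCE_THRESHOLD = 0.80  # from the protocol above
MAX_ROUNDS = 3

def has_converged(verdicts: List[str]) -> bool:
    """True once the most common verdict reaches the agreement threshold."""
    top_count = Counter(verdicts).most_common(1)[0][1]
    return top_count / len(verdicts) >= CONVERGENCE_THRESHOLD

round0 = ["APPROVE", "REJECT", "APPROVE"]   # 67% agreement -> debate continues
round1 = ["APPROVE", "APPROVE", "APPROVE"]  # 100% agreement -> converged
```

With three judges, 80% effectively demands unanimity (2/3 is only 67%), so the threshold matters most for larger panels.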
4. Multi-Model Judge Panel (H.3.5)
Diverse LLM providers to prevent single-vendor bias:
┌─────────────────────────────────────────────────────────────────────────────┐
│ MULTI-MODEL JUDGE PANEL │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Provider Diversity Requirements │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ • Minimum 3 model families │ │
│ │ • Maximum 40% weight on single model │ │
│ │ • Automatic fallback on failure │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ Supported Providers │
│ ┌───────────────┬─────────────────────┬──────────────────────────────┐ │
│ │ Provider │ Models │ Use Case │ │
│ ├───────────────┼─────────────────────┼──────────────────────────────┤ │
│ │ Anthropic │ claude-opus-4.5 │ Deep reasoning │ │
│ │ │ claude-sonnet-4 │ Balanced performance │ │
│ │ │ claude-haiku-4.5 │ Fast evaluation │ │
│ ├───────────────┼─────────────────────┼──────────────────────────────┤ │
│ │ OpenAI │ gpt-4o │ General analysis │ │
│ │ │ gpt-4o-mini │ Cost-effective │ │
│ ├───────────────┼─────────────────────┼──────────────────────────────┤ │
│ │ DeepSeek │ deepseek-v3 │ Cost-effective analysis │ │
│ ├───────────────┼─────────────────────┼──────────────────────────────┤ │
│ │ Alibaba │ qwen2.5-72b │ Multilingual │ │
│ ├───────────────┼─────────────────────┼──────────────────────────────┤ │
│ │ Meta │ llama-3.3-70b │ Open-source perspective │ │
│ ├───────────────┼─────────────────────┼──────────────────────────────┤ │
│ │ Google │ gemini-2.0-flash │ Multimodal capability │ │
│ └───────────────┴─────────────────────┴──────────────────────────────┘ │
│ │
│ Fallback Strategy │
│ 1. Retry with exponential backoff (max 2 retries) │
│ 2. Fall back to backup model if configured │
│ 3. Record all attempts in provenance chain │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Client: scripts/moe_classifier/core/multi_model_client.py
LLM Judge: scripts/moe_classifier/judges/llm_judge.py
Config: config/judge-model-routing.json
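The three-step fallback strategy can be sketched as one loop that records every attempt for the provenance chain. `call_with_fallback` and the record shape are illustrative assumptions, not the actual multi_model_client.py interface:

```python
import time
from typing import Callable, List, Tuple

def call_with_fallback(models: List[str], call: Callable[[str], str],
                       max_retries: int = 2, base_delay: float = 1.0) -> Tuple[str, list]:
    """Try each model in order with exponential backoff, recording all attempts."""
    attempts = []
    for model in models:
        for retry in range(max_retries + 1):
            try:
                result = call(model)
                attempts.append({"model": model, "retry": retry, "ok": True})
                return result, attempts
            except Exception as exc:
                attempts.append({"model": model, "retry": retry,
                                 "ok": False, "error": str(exc)})
                time.sleep(base_delay * (2 ** retry))  # exponential backoff
    raise RuntimeError(f"All models failed after {len(attempts)} attempts")

# Demo: primary model always fails, backup succeeds immediately.
def fake_call(model: str) -> str:
    if model == "gpt-4o":
        raise ConnectionError("rate limited")
    return f"verdict from {model}"

result, chain = call_with_fallback(["gpt-4o", "claude-haiku-4.5"], fake_call,
                                   max_retries=1, base_delay=0.0)
```

Returning the attempt list alongside the result lets the caller append failures as well as successes to the provenance chain (step 3 of the strategy).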
5. Self-Improving Eval Loop (H.3.6)
Automated evaluation and prompt improvement:
┌─────────────────────────────────────────────────────────────────────────────┐
│ SELF-IMPROVING EVAL LOOP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Load Evals │───▶│ Run Model │───▶│ Score F1 │ │
│ │ (JSONL) │ │ (Multi-LLM) │ │ micro/macro │ │
│ └──────────────┘ └──────────────┘ └──────┬───────┘ │
│ │ │
│ ┌────────────────────▼─────────┐ │
│ │ Target F1 Reached? │ │
│ │ (default: 0.90) │ │
│ └────────────────┬─────────────┘ │
│ NO │ YES │
│ │ │ │ │
│ ▼ │ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │ ┌──────────┐ │
│ │ Applier │◄───│ Critic │ │ │ DONE │ │
│ │ (Update files│ │ (Analyze │ │ │ Return │ │
│ │ with backup)│ │ failures) │ │ │ Results │ │
│ └──────┬───────┘ └──────────────┘ │ └──────────┘ │
│ │ │ │
│ └────────────────────────────────────┘ │
│ Next Round │
│ │
│ Components: │
│ • eval_runner.py - Execute evals, compute F1 │
│ • critic_agent.py - Analyze failures, propose changes │
│ • improvement_applier.py - Apply changes with validation │
│ • eval_loop.py - Orchestrate improvement cycle │
│ • run_ci.py - CI quality gates │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Skill: skills/self-improving-eval/
CI: .github/workflows/eval-improvement.yml
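The "Score F1 micro/macro" step can be sketched as a standalone illustration (this is not the actual eval_runner.py code) for single-label classification evals:

```python
from collections import defaultdict
from typing import List, Tuple

def f1_scores(gold: List[str], pred: List[str]) -> Tuple[float, float]:
    """Compute micro- and macro-averaged F1 over per-label TP/FP/FN counts."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gets a false positive
            fn[g] += 1  # gold label gets a false negative

    def f1(t: int, f_p: int, f_n: int) -> float:
        denom = 2 * t + f_p + f_n
        return 2 * t / denom if denom else 0.0

    labels = set(gold) | set(pred)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro

micro, macro = f1_scores(["A", "A", "B", "B"], ["A", "B", "B", "B"])
# micro = 0.75; the loop exits only once micro reaches the target (default 0.90)
```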
6. Provenance Tracking (H.3.4)
Full audit trail for every evaluation:
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List

@dataclass
class JudgeDecision:
    judge: str                        # Judge persona ID
    approved: bool                    # Verdict
    reason: str                       # Detailed reasoning
    confidence: float                 # 0.0-1.0
    dimension_scores: Dict[str, int]  # Per-dimension scores
    # Provenance fields (H.3.4)
    model_used: str                   # e.g., "claude-sonnet-4"
    timestamp: datetime               # When evaluated
    token_usage: int                  # Total tokens consumed
    raw_response: str                 # Raw LLM response
    evaluation_start_time: datetime
    evaluation_end_time: datetime
    duration_ms: float                # Latency

@dataclass
class ConsensusResult:
    verdict: str                      # APPROVED, REJECTED, etc.
    confidence: float
    reasoning: str
    # Provenance chain (H.3.4)
    provenance_chain: List[Dict]      # All judge decisions
    dissenting_views: List[Dict]      # Disagreeing judges
    total_token_usage: int
    total_latency_ms: float
```
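Rolling per-judge provenance entries up into the panel-level totals might look like the following sketch; it uses plain dicts and an illustrative `summarize_provenance` helper rather than the real consensus code:

```python
from typing import Dict, List

def summarize_provenance(decisions: List[Dict]) -> Dict:
    """Aggregate per-judge provenance into panel-level consensus fields."""
    return {
        # sum gives sequential wall-clock latency; use max() for parallel panels
        "total_token_usage": sum(d["token_usage"] for d in decisions),
        "total_latency_ms": sum(d["duration_ms"] for d in decisions),
        "dissenting_views": [d for d in decisions if not d["approved"]],
    }

chain = [
    {"judge": "technical-architect", "approved": True,
     "token_usage": 1200, "duration_ms": 950.0},
    {"judge": "security-analyst", "approved": False,
     "token_usage": 900, "duration_ms": 1100.0},
]
summary = summarize_provenance(chain)
```

Because every judge decision carries its own model, token, and timing fields, the aggregate is derivable from the chain alone, which keeps the audit trail the single source of truth.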
Consequences
Positive
- Multi-perspective Quality: 5+ judge personas catch different issues
- Reduced Bias: 3+ model families prevent vendor lock-in
- Full Auditability: Complete provenance for every decision
- ADR Alignment: Rubrics automatically derived from architecture decisions
- Continuous Improvement: Self-improving loop enhances prompts over time
- Disagreement Resolution: Debate protocol builds consensus
Negative
- Increased Costs: Multiple LLM calls per evaluation
- Latency: Multi-model evaluation slower than single model
- Complexity: More components to maintain and test
- API Dependencies: Requires multiple LLM provider accounts
Mitigations
- Cost Control: Use cheaper models (Haiku, DeepSeek) for initial screening
- Parallelism: Run judge evaluations concurrently
- Caching: Cache common evaluation patterns
- Fallbacks: Graceful degradation when providers unavailable
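The parallelism mitigation can be sketched with a thread pool, so panel latency tracks the slowest judge instead of the sum of all judges. `evaluate_panel` is illustrative, not the project's actual orchestration code:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def evaluate_panel(judges: List[str], evaluate: Callable[[str], dict]) -> List[dict]:
    """Run per-judge evaluations concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=len(judges)) as pool:
        return list(pool.map(evaluate, judges))

verdicts = evaluate_panel(
    ["technical-architect", "compliance-auditor", "security-analyst"],
    lambda judge: {"judge": judge, "approved": True},  # stand-in for an LLM call
)
```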
Implementation
Components Created
| Component | Path | Tests |
|---|---|---|
| Persona Loader | core/persona_loader.py | 43 tests |
| ADR Rubric Generator | scripts/adr-rubric-generator.py | 62 tests |
| Rubric Merger | core/rubric_merger.py | Part of above |
| Debate Protocol | core/debate.py | 30 tests |
| Multi-Model Client | core/multi_model_client.py | 49 tests |
| LLM Judge | judges/llm_judge.py | 17 tests |
| Provenance | core/models.py (enhanced) | 30 tests |
| Self-Improving Eval | skills/self-improving-eval/ | 28 tests |
Total: 260 tests
Commands
| Command | Purpose |
|---|---|
| /moe-judges <target> | Assemble judge panel and evaluate |
| /classify <path> | Classify documents with verification |
| /eval-improve | Run self-improving eval loop |
Configuration Files
config/
├── judge-personas.json # 6 judge persona definitions
├── judge-model-routing.json # Model routing per persona
├── generated-rubrics/ # 27 ADR-derived rubrics
└── merged-rubrics/ # 5 merged persona rubrics
References
- H.3.1-H.3.6 implementation in PILOT-PARALLEL-EXECUTION-PLAN.md
- PoLL (Panel of LLM Judges) research pattern
- Constitutional AI multi-perspective evaluation
- analyze-new-artifacts/MOE-JUDGES-RESEARCH/
- analyze-new-artifacts/CLAUDE-CODE-EVAL-LOOPS/
Related ADRs
- ADR-052: Intent-Aware Context Management
- ADR-053: Cloud Context Sync Architecture
- ADR-054: Track Nomenclature Extensibility
Author: CODITECT Team
Reviewers: Architecture Council