
ADR-060: MoE Verification Layer - Constitutional Court Architecture

Status

Accepted - January 8, 2026

Context

Problem Statement

The CODITECT Mixture of Experts (MoE) classification system requires a verification layer to ensure classification quality through multi-perspective evaluation. Key challenges:

  1. Single-Point Bias: Single-model evaluation risks systematic blind spots
  2. No Verification: Analyst classifications lack independent verification
  3. Missing Audit Trail: No provenance tracking for evaluation decisions
  4. Unclear Standards: Inconsistent evaluation criteria across components
  5. Static Prompts: No mechanism for continuous prompt improvement

Research Foundation

Based on research in:

  • analyze-new-artifacts/MOE-JUDGES-RESEARCH/ - PoLL (Panel of LLM Judges) patterns
  • analyze-new-artifacts/CLAUDE-CODE-EVAL-LOOPS/ - Self-improving eval loops
  • Constitutional AI principles for multi-perspective evaluation

Requirements

  1. Multi-perspective evaluation with diverse judge personas
  2. Multi-model diversity to reduce single-vendor bias
  3. Debate protocol for resolving disagreements
  4. Full provenance tracking (model, tokens, latency, timestamp)
  5. ADR-derived rubrics for consistent standards
  6. Self-improving eval loop for continuous quality improvement

Decision

Implement a Constitutional Court verification architecture with five interconnected subsystems:

1. Judge Persona System (H.3.1)

Specialized judge personas with distinct expertise areas:

┌─────────────────────────────────────────────────────────────────────────────┐
│ JUDGE PERSONA SYSTEM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Technical │ │ Compliance │ │ Security │ │
│ │ Architect │ │ Auditor │ │ Analyst │ │
│ │ Marcus Rivera │ │ Dr. Okonkwo │ │ James Nakamura │ │
│ │ 22 years exp │ │ HIPAA/FDA/SOC2 │ │ OWASP Top 10 │ │
│ │ claude-sonnet │ │ gpt-4o │ │ deepseek-v3 │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │ │
│ └────────────────────┼────────────────────┘ │
│ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Domain │ │ QA │ │ Additional │ │
│ │ Expert │ │ Evaluator │ │ Personas │ │
│ │ Dr. Vasquez │ │ Priya Sharma │ │ (Extensible) │ │
│ │ Healthcare/HL7 │ │ Test Coverage │ │ │ │
│ │ qwen2.5-72b │ │ claude-haiku │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Configuration: config/judge-personas.json
Loader: scripts/moe_classifier/core/persona_loader.py
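A minimal sketch of how the persona loader might map `judge-personas.json` entries to typed persona objects. The field names (`persona_id`, `expertise`, `model`) and the config fragment are illustrative assumptions, not the actual schema:

```python
import json
from dataclasses import dataclass
from typing import List

@dataclass
class JudgePersona:
    persona_id: str
    name: str
    expertise: str
    model: str  # default LLM routed to this persona

def load_personas(raw_json: str) -> List[JudgePersona]:
    """Parse a judge-personas.json payload into persona objects."""
    data = json.loads(raw_json)
    return [JudgePersona(**entry) for entry in data["personas"]]

# Hypothetical config fragment mirroring the diagram above
sample = """
{"personas": [
  {"persona_id": "technical_architect", "name": "Marcus Rivera",
   "expertise": "system architecture", "model": "claude-sonnet"},
  {"persona_id": "compliance_auditor", "name": "Dr. Okonkwo",
   "expertise": "HIPAA/FDA/SOC2", "model": "gpt-4o"}
]}
"""
panel = load_personas(sample)
print([p.name for p in panel])  # ['Marcus Rivera', 'Dr. Okonkwo']
```

The dataclass keeps persona definitions declarative, so the "Additional Personas (Extensible)" slot in the diagram is just another config entry.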

2. ADR-to-Rubric Generator (H.3.2)

Automatic extraction of evaluation rubrics from Architecture Decision Records:

┌─────────────────────────────────────────────────────────────────────────────┐
│ ADR-TO-RUBRIC PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ADR Documents Constraint Extraction Generated Rubric │
│ ┌──────────────┐ ┌──────────────────┐ ┌──────────────┐ │
│ │ ADR-001.md │──────────▶│ MUST: mandatory │────────▶│ Dimension 1 │ │
│ │ ADR-002.md │ │ SHOULD: recomm. │ │ Scale: 1-3 │ │
│ │ ... │ │ MAY: optional │ │ Weight: 0.25 │ │
│ │ ADR-058.md │ │ Technical terms │ └──────────────┘ │
│ └──────────────┘ └──────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ Merge with Persona Rubrics │ │
│ │ │ │
│ │ Base Persona Rubric + ADR Rubrics = Merged Rubric │ │
│ │ (5 dimensions) (N dimensions) (5+N dimensions) │ │
│ │ │ │
│ │ Weights renormalized to sum to 1.0 │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Generator: scripts/adr-rubric-generator.py
Merger: scripts/moe_classifier/core/rubric_merger.py
Output: config/generated-rubrics/, config/merged-rubrics/
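The merge-and-renormalize step in the pipeline above can be sketched as follows. The dimension names and weights are illustrative; only the invariant (merged weights sum to 1.0) comes from the diagram:

```python
from typing import Dict

def merge_rubrics(base: Dict[str, float], adr: Dict[str, float]) -> Dict[str, float]:
    """Merge base persona dimensions with ADR-derived dimensions,
    then renormalize weights so they sum to 1.0."""
    merged = {**base, **adr}
    total = sum(merged.values())
    return {dim: weight / total for dim, weight in merged.items()}

# Hypothetical dimensions: 3 base (5 in practice) + 2 ADR-derived
base = {"correctness": 0.4, "clarity": 0.3, "completeness": 0.3}
adr = {"adr_001_must": 0.25, "adr_002_should": 0.15}

merged = merge_rubrics(base, adr)
assert abs(sum(merged.values()) - 1.0) < 1e-9  # weights renormalized
```

Renormalizing after the merge means ADR constraints dilute, rather than displace, the base persona's evaluation criteria.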

3. Debate Protocol (H.3.3)

Multi-round debate when judges disagree:

┌─────────────────────────────────────────────────────────────────────────────┐
│ DEBATE PROTOCOL │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Round 0: Initial Evaluation │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Judge A │ │ Judge B │ │ Judge C │ │
│ │ APPROVE │ │ REJECT │ │ APPROVE │ ◄── Disagreement detected │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ └────────────┼────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Debate Context Preparation │ │
│ │ • Identify disagreement points (verdict-level, dimension-level) │ │
│ │ • Format positions with rationale excerpts │ │
│ │ • Distribute to all judges │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Round 1-N: Debate Rounds (max 3) │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Judge A │ │ Judge B │ │ Judge C │ │
│ │ APPROVE │ │ APPROVE │ │ APPROVE │ ◄── Convergence! │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ Convergence Threshold: 80% agreement │
│ Max Rounds: 3 │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Implementation: scripts/moe_classifier/core/debate.py
Integration: scripts/moe_classifier/core/consensus.py
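The convergence check that ends a debate round can be sketched as below; the 80% threshold and 3-round cap come from the protocol above, while the function name and signature are illustrative:

```python
from collections import Counter
from typing import List

CONVERGENCE_THRESHOLD = 0.8  # 80% agreement, per the protocol
MAX_ROUNDS = 3

def has_converged(verdicts: List[str],
                  threshold: float = CONVERGENCE_THRESHOLD) -> bool:
    """True when the most common verdict reaches the agreement threshold."""
    if not verdicts:
        return False
    _, count = Counter(verdicts).most_common(1)[0]
    return count / len(verdicts) >= threshold

print(has_converged(["APPROVE", "REJECT", "APPROVE"]))   # 2/3 < 0.8 -> False
print(has_converged(["APPROVE", "APPROVE", "APPROVE"]))  # 3/3 -> True
```

With a three-judge panel, an 80% threshold effectively requires unanimity; larger panels can converge with a dissenter, which is why dissenting views are preserved in the provenance chain.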

4. Multi-Model Judge Panel (H.3.5)

Diverse LLM providers to prevent single-vendor bias:

┌─────────────────────────────────────────────────────────────────────────────┐
│ MULTI-MODEL JUDGE PANEL │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Provider Diversity Requirements │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ • Minimum 3 model families │ │
│ │ • Maximum 40% weight on single model │ │
│ │ • Automatic fallback on failure │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ Supported Providers │
│ ┌───────────────┬─────────────────────┬──────────────────────────────┐ │
│ │ Provider │ Models │ Use Case │ │
│ ├───────────────┼─────────────────────┼──────────────────────────────┤ │
│ │ Anthropic │ claude-opus-4.5 │ Deep reasoning │ │
│ │ │ claude-sonnet-4 │ Balanced performance │ │
│ │ │ claude-haiku-4.5 │ Fast evaluation │ │
│ ├───────────────┼─────────────────────┼──────────────────────────────┤ │
│ │ OpenAI │ gpt-4o │ General analysis │ │
│ │ │ gpt-4o-mini │ Cost-effective │ │
│ ├───────────────┼─────────────────────┼──────────────────────────────┤ │
│ │ DeepSeek │ deepseek-v3 │ Cost-effective analysis │ │
│ ├───────────────┼─────────────────────┼──────────────────────────────┤ │
│ │ Alibaba │ qwen2.5-72b │ Multilingual │ │
│ ├───────────────┼─────────────────────┼──────────────────────────────┤ │
│ │ Meta │ llama-3.3-70b │ Open-source perspective │ │
│ ├───────────────┼─────────────────────┼──────────────────────────────┤ │
│ │ Google │ gemini-2.0-flash │ Multimodal capability │ │
│ └───────────────┴─────────────────────┴──────────────────────────────┘ │
│ │
│ Fallback Strategy │
│ 1. Retry with exponential backoff (max 2 retries) │
│ 2. Fall back to backup model if configured │
│ 3. Record all attempts in provenance chain │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Client: scripts/moe_classifier/core/multi_model_client.py
LLM Judge: scripts/moe_classifier/judges/llm_judge.py
Config: config/judge-model-routing.json
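The three-step fallback strategy above can be sketched as a retry loop. The max-2-retries limit, exponential backoff, backup model, and per-attempt provenance recording come from the diagram; the function signature and attempt-record shape are illustrative assumptions:

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

def call_with_fallback(invoke: Callable[[str], str],
                       primary: str,
                       backup: Optional[str] = None,
                       max_retries: int = 2,
                       base_delay: float = 0.5) -> Tuple[str, List[Dict]]:
    """Retry the primary model with exponential backoff, then fall back
    to the backup model; every attempt is recorded for provenance."""
    attempts: List[Dict] = []
    for model in filter(None, [primary, backup]):
        for retry in range(max_retries + 1):
            try:
                result = invoke(model)
                attempts.append({"model": model, "retry": retry, "ok": True})
                return result, attempts
            except Exception as exc:
                attempts.append({"model": model, "retry": retry,
                                 "ok": False, "error": str(exc)})
                if retry < max_retries:
                    time.sleep(base_delay * 2 ** retry)  # exponential backoff
    raise RuntimeError("all judge models failed")

# Demo: primary provider is down, backup succeeds
def flaky(model: str) -> str:
    if model == "gpt-4o":
        raise TimeoutError("provider unavailable")
    return f"verdict from {model}"

result, attempts = call_with_fallback(flaky, primary="gpt-4o",
                                      backup="claude-haiku", base_delay=0.0)
print(result, len(attempts))  # verdict from claude-haiku 4
```

Returning the attempt list alongside the result lets the caller append it directly to the provenance chain.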

5. Self-Improving Eval Loop (H.3.6)

Automated evaluation and prompt improvement:

┌─────────────────────────────────────────────────────────────────────────────┐
│ SELF-IMPROVING EVAL LOOP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Load Evals │───▶│ Run Model │───▶│ Score F1 │ │
│ │ (JSONL) │ │ (Multi-LLM) │ │ micro/macro │ │
│ └──────────────┘ └──────────────┘ └──────┬───────┘ │
│ │ │
│ ┌────────────────────▼─────────┐ │
│ │ Target F1 Reached? │ │
│ │ (default: 0.90) │ │
│ └────────────────┬─────────────┘ │
│ NO │ YES │
│ │ │ │ │
│ ▼ │ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │ ┌──────────┐ │
│ │ Applier │◄───│ Critic │ │ │ DONE │ │
│ │ (Update files│ │ (Analyze │ │ │ Return │ │
│ │ with backup)│ │ failures) │ │ │ Results │ │
│ └──────┬───────┘ └──────────────┘ │ └──────────┘ │
│ │ │ │
│ └────────────────────────────────────┘ │
│ Next Round │
│ │
│ Components: │
│ • eval_runner.py - Execute evals, compute F1 │
│ • critic_agent.py - Analyze failures, propose changes │
│ • improvement_applier.py - Apply changes with validation │
│ • eval_loop.py - Orchestrate improvement cycle │
│ • run_ci.py - CI quality gates │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Skill: skills/self-improving-eval/
CI: .github/workflows/eval-improvement.yml
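The orchestration in the loop diagram can be sketched as below, assuming the runner, critic, and applier are injected as callables. The 0.90 default target is from the diagram; the `max_rounds` bound and function names are illustrative:

```python
from typing import Callable, List

def improvement_loop(run_evals: Callable[[], float],
                     critique: Callable[[], list],
                     apply_changes: Callable[[list], None],
                     target_f1: float = 0.90,
                     max_rounds: int = 5) -> List[float]:
    """Run the eval -> critic -> applier cycle until the target F1
    is reached or the round budget is exhausted."""
    history: List[float] = []
    for _ in range(max_rounds):
        f1 = run_evals()            # execute evals, compute F1
        history.append(f1)
        if f1 >= target_f1:
            return history          # target reached: DONE
        proposals = critique()      # analyze failures, propose changes
        apply_changes(proposals)    # update files (with backup)
    return history

# Demo with stub components whose F1 improves each round
scores = iter([0.70, 0.85, 0.92])
history = improvement_loop(run_evals=lambda: next(scores),
                           critique=lambda: ["tighten prompt wording"],
                           apply_changes=lambda proposals: None)
print(history)  # [0.7, 0.85, 0.92]
```

Keeping the F1 history makes the CI quality gate simple: fail the build if the final score regresses below the target.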

6. Provenance Tracking (H.3.4)

Full audit trail for every evaluation:

from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List

@dataclass
class JudgeDecision:
    judge: str                         # Judge persona ID
    approved: bool                     # Verdict
    reason: str                        # Detailed reasoning
    confidence: float                  # 0.0-1.0
    dimension_scores: Dict[str, int]   # Per-dimension scores

    # Provenance fields (H.3.4)
    model_used: str                    # e.g., "claude-sonnet-4"
    timestamp: datetime                # When evaluated
    token_usage: int                   # Total tokens consumed
    raw_response: str                  # Raw LLM response
    evaluation_start_time: datetime
    evaluation_end_time: datetime
    duration_ms: float                 # Latency

@dataclass
class ConsensusResult:
    verdict: str                       # APPROVED, REJECTED, etc.
    confidence: float
    reasoning: str

    # Provenance chain (H.3.4)
    provenance_chain: List[Dict]       # All judge decisions
    dissenting_views: List[Dict]       # Disagreeing judges
    total_token_usage: int
    total_latency_ms: float
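A sketch of how per-judge provenance records might roll up into the consensus-level totals and dissenting views. The dict-based records and the majority rule are illustrative assumptions; the field names follow the dataclasses above:

```python
from typing import Dict, List, Tuple

def aggregate_provenance(decisions: List[Dict]) -> Tuple[int, float, List[Dict]]:
    """Sum token usage and latency across the provenance chain and
    collect judges who dissent from the majority verdict."""
    total_tokens = sum(d["token_usage"] for d in decisions)
    total_latency = sum(d["duration_ms"] for d in decisions)
    majority_approved = sum(d["approved"] for d in decisions) * 2 > len(decisions)
    dissenting = [d for d in decisions if d["approved"] != majority_approved]
    return total_tokens, total_latency, dissenting

chain = [
    {"judge": "technical_architect", "approved": True,
     "token_usage": 1200, "duration_ms": 850.0},
    {"judge": "compliance_auditor", "approved": False,
     "token_usage": 950, "duration_ms": 1100.0},
    {"judge": "security_analyst", "approved": True,
     "token_usage": 1010, "duration_ms": 920.0},
]
tokens, latency, dissent = aggregate_provenance(chain)
print(tokens, latency, [d["judge"] for d in dissent])
# 3160 2870.0 ['compliance_auditor']
```

Storing dissenting views verbatim, rather than discarding minority verdicts, is what makes the audit trail complete: a reviewer can reconstruct why consensus was reached despite disagreement.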

Consequences

Positive

  1. Multi-perspective Quality: 5+ judge personas catch different issues
  2. Reduced Bias: 3+ model families prevent vendor lock-in
  3. Full Auditability: Complete provenance for every decision
  4. ADR Alignment: Rubrics automatically derived from architecture decisions
  5. Continuous Improvement: Self-improving loop enhances prompts over time
  6. Disagreement Resolution: Debate protocol builds consensus

Negative

  1. Increased Costs: Multiple LLM calls per evaluation
  2. Latency: Multi-model evaluation slower than single model
  3. Complexity: More components to maintain and test
  4. API Dependencies: Requires multiple LLM provider accounts

Mitigations

  1. Cost Control: Use cheaper models (Haiku, DeepSeek) for initial screening
  2. Parallelism: Run judge evaluations concurrently
  3. Caching: Cache common evaluation patterns
  4. Fallbacks: Graceful degradation when providers unavailable

Implementation

Components Created

Component              Path                            Tests
Persona Loader         core/persona_loader.py          43 tests
ADR Rubric Generator   scripts/adr-rubric-generator.py 62 tests
Rubric Merger          core/rubric_merger.py           Part of above
Debate Protocol        core/debate.py                  30 tests
Multi-Model Client     core/multi_model_client.py      49 tests
LLM Judge              judges/llm_judge.py             17 tests
Provenance             core/models.py (enhanced)       30 tests
Self-Improving Eval    skills/self-improving-eval/     28 tests

Total: 260 tests

Commands

Command                Purpose
/moe-judges <target>   Assemble judge panel and evaluate
/classify <path>       Classify documents with verification
/eval-improve          Run self-improving eval loop

Configuration Files

config/
├── judge-personas.json # 6 judge persona definitions
├── judge-model-routing.json # Model routing per persona
├── generated-rubrics/ # 27 ADR-derived rubrics
└── merged-rubrics/ # 5 merged persona rubrics

References

  • H.3.1-H.3.6 implementation in PILOT-PARALLEL-EXECUTION-PLAN.md
  • PoLL (Panel of LLM Judges) research pattern
  • Constitutional AI multi-perspective evaluation
  • analyze-new-artifacts/MOE-JUDGES-RESEARCH/
  • analyze-new-artifacts/CLAUDE-CODE-EVAL-LOOPS/
  • ADR-052: Intent-Aware Context Management
  • ADR-053: Cloud Context Sync Architecture
  • ADR-054: Track Nomenclature Extensibility

Author: CODITECT Team
Reviewers: Architecture Council