
ADR-060: MoE Verification Layer - Constitutional Court Architecture

Status

Accepted - January 8, 2026

Context

Problem Statement

The CODITECT Mixture of Experts (MoE) classification system requires a verification layer to ensure classification quality through multi-perspective evaluation. Key challenges:

  1. Single-Point Bias: Single-model evaluation risks systematic blind spots
  2. No Verification: Analyst classifications lack independent verification
  3. Missing Audit Trail: No provenance tracking for evaluation decisions
  4. Unclear Standards: Inconsistent evaluation criteria across components
  5. Static Prompts: No mechanism for continuous prompt improvement

Research Foundation

Based on research in:

  • analyze-new-artifacts/MOE-JUDGES-RESEARCH/ - PoLL (Panel of LLM Judges) patterns
  • analyze-new-artifacts/CLAUDE-CODE-EVAL-LOOPS/ - Self-improving eval loops
  • Constitutional AI principles for multi-perspective evaluation

Requirements

  1. Multi-perspective evaluation with diverse judge personas
  2. Multi-model diversity to reduce single-vendor bias
  3. Debate protocol for resolving disagreements
  4. Full provenance tracking (model, tokens, latency, timestamp)
  5. ADR-derived rubrics for consistent standards
  6. Self-improving eval loop for continuous quality improvement

Decision

Implement a Constitutional Court verification architecture with five interconnected subsystems:

1. Judge Persona System (H.3.1)

Specialized judge personas with distinct expertise areas:

┌─────────────────────────────────────────────────────────────────────────────┐
│ JUDGE PERSONA SYSTEM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Technical │ │ Compliance │ │ Security │ │
│ │ Architect │ │ Auditor │ │ Analyst │ │
│ │ Marcus Rivera │ │ Dr. Okonkwo │ │ James Nakamura │ │
│ │ 22 years exp │ │ HIPAA/FDA/SOC2 │ │ OWASP Top 10 │ │
│ │ claude-sonnet │ │ gpt-4o │ │ deepseek-v3 │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │ │
│ └────────────────────┼────────────────────┘ │
│ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Domain │ │ QA │ │ Additional │ │
│ │ Expert │ │ Evaluator │ │ Personas │ │
│ │ Dr. Vasquez │ │ Priya Sharma │ │ (Extensible) │ │
│ │ Healthcare/HL7 │ │ Test Coverage │ │ │ │
│ │ qwen2.5-72b │ │ claude-haiku │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Configuration: config/judge-personas.json
Loader: scripts/moe_classifier/core/persona_loader.py
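A minimal sketch of how the persona loader might map `judge-personas.json` entries to typed persona objects. The field names (`persona_id`, `expertise`, `model`) and the config fragment are illustrative assumptions, not the actual schema:

```python
import json
from dataclasses import dataclass
from typing import List

@dataclass
class JudgePersona:
    persona_id: str
    name: str
    expertise: str
    model: str  # default LLM routed to this persona

def load_personas(raw_json: str) -> List[JudgePersona]:
    """Parse a judge-personas.json payload into persona objects."""
    data = json.loads(raw_json)
    return [JudgePersona(**entry) for entry in data["personas"]]

# Hypothetical config fragment mirroring the diagram above
sample = """
{"personas": [
  {"persona_id": "technical_architect", "name": "Marcus Rivera",
   "expertise": "system architecture", "model": "claude-sonnet"},
  {"persona_id": "compliance_auditor", "name": "Dr. Okonkwo",
   "expertise": "HIPAA/FDA/SOC2", "model": "gpt-4o"}
]}
"""
panel = load_personas(sample)
print([p.name for p in panel])  # ['Marcus Rivera', 'Dr. Okonkwo']
```

The dataclass keeps persona definitions declarative, so the "Additional Personas (Extensible)" slot in the diagram is just another config entry.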

2. ADR-to-Rubric Generator (H.3.2)

Automatic extraction of evaluation rubrics from Architecture Decision Records:

┌─────────────────────────────────────────────────────────────────────────────┐
│ ADR-TO-RUBRIC PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ADR Documents Constraint Extraction Generated Rubric │
│ ┌──────────────┐ ┌──────────────────┐ ┌──────────────┐ │
│ │ ADR-001.md │──────────▶│ MUST: mandatory │────────▶│ Dimension 1 │ │
│ │ ADR-002.md │ │ SHOULD: recomm. │ │ Scale: 1-3 │ │
│ │ ... │ │ MAY: optional │ │ Weight: 0.25 │ │
│ │ ADR-058.md │ │ Technical terms │ └──────────────┘ │
│ └──────────────┘ └──────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ Merge with Persona Rubrics │ │
│ │ │ │
│ │ Base Persona Rubric + ADR Rubrics = Merged Rubric │ │
│ │ (5 dimensions) (N dimensions) (5+N dimensions) │ │
│ │ │ │
│ │ Weights renormalized to sum to 1.0 │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Generator: scripts/adr-rubric-generator.py
Merger: scripts/moe_classifier/core/rubric_merger.py
Output: config/generated-rubrics/, config/merged-rubrics/
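The merge-and-renormalize step in the pipeline above can be sketched as follows. The dimension names and weights are illustrative; only the invariant (merged weights sum to 1.0) comes from the diagram:

```python
from typing import Dict

def merge_rubrics(base: Dict[str, float], adr: Dict[str, float]) -> Dict[str, float]:
    """Merge base persona dimensions with ADR-derived dimensions,
    then renormalize weights so they sum to 1.0."""
    merged = {**base, **adr}
    total = sum(merged.values())
    return {dim: weight / total for dim, weight in merged.items()}

# Hypothetical dimensions: 3 base (5 in practice) + 2 ADR-derived
base = {"correctness": 0.4, "clarity": 0.3, "completeness": 0.3}
adr = {"adr_001_must": 0.25, "adr_002_should": 0.15}

merged = merge_rubrics(base, adr)
assert abs(sum(merged.values()) - 1.0) < 1e-9  # weights renormalized
```

Renormalizing after the merge means ADR constraints dilute, rather than displace, the base persona's evaluation criteria.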

3. Debate Protocol (H.3.3)

Multi-round debate when judges disagree:

┌─────────────────────────────────────────────────────────────────────────────┐
│ DEBATE PROTOCOL │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Round 0: Initial Evaluation │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Judge A │ │ Judge B │ │ Judge C │ │
│ │ APPROVE │ │ REJECT │ │ APPROVE │ ◄── Disagreement detected │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ └────────────┼────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Debate Context Preparation │ │
│ │ • Identify disagreement points (verdict-level, dimension-level) │ │
│ │ • Format positions with rationale excerpts │ │
│ │ • Distribute to all judges │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Round 1-N: Debate Rounds (max 3) │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Judge A │ │ Judge B │ │ Judge C │ │
│ │ APPROVE │ │ APPROVE │ │ APPROVE │ ◄── Convergence! │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ Convergence Threshold: 80% agreement │
│ Max Rounds: 3 │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Implementation: scripts/moe_classifier/core/debate.py
Integration: scripts/moe_classifier/core/consensus.py
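The convergence check that ends a debate round can be sketched as below; the 80% threshold and 3-round cap come from the protocol above, while the function name and signature are illustrative:

```python
from collections import Counter
from typing import List

CONVERGENCE_THRESHOLD = 0.8  # 80% agreement, per the protocol
MAX_ROUNDS = 3

def has_converged(verdicts: List[str],
                  threshold: float = CONVERGENCE_THRESHOLD) -> bool:
    """True when the most common verdict reaches the agreement threshold."""
    if not verdicts:
        return False
    _, count = Counter(verdicts).most_common(1)[0]
    return count / len(verdicts) >= threshold

print(has_converged(["APPROVE", "REJECT", "APPROVE"]))   # 2/3 < 0.8 -> False
print(has_converged(["APPROVE", "APPROVE", "APPROVE"]))  # 3/3 -> True
```

With a three-judge panel, an 80% threshold effectively requires unanimity; larger panels can converge with a dissenter, which is why dissenting views are preserved in the provenance chain.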

4. Multi-Model Judge Panel (H.3.5)

Diverse LLM providers to prevent single-vendor bias:

┌─────────────────────────────────────────────────────────────────────────────┐
│ MULTI-MODEL JUDGE PANEL │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Provider Diversity Requirements │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ • Minimum 3 model families │ │
│ │ • Maximum 40% weight on single model │ │
│ │ • Automatic fallback on failure │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ Supported Providers │
│ ┌───────────────┬─────────────────────┬──────────────────────────────┐ │
│ │ Provider │ Models │ Use Case │ │
│ ├───────────────┼─────────────────────┼──────────────────────────────┤ │
│ │ Anthropic │ claude-opus-4.5 │ Deep reasoning │ │
│ │ │ claude-sonnet-4 │ Balanced performance │ │
│ │ │ claude-haiku-4.5 │ Fast evaluation │ │
│ ├───────────────┼─────────────────────┼──────────────────────────────┤ │
│ │ OpenAI │ gpt-4o │ General analysis │ │
│ │ │ gpt-4o-mini │ Cost-effective │ │
│ ├───────────────┼─────────────────────┼──────────────────────────────┤ │
│ │ DeepSeek │ deepseek-v3 │ Cost-effective analysis │ │
│ ├───────────────┼─────────────────────┼──────────────────────────────┤ │
│ │ Alibaba │ qwen2.5-72b │ Multilingual │ │
│ ├───────────────┼─────────────────────┼──────────────────────────────┤ │
│ │ Meta │ llama-3.3-70b │ Open-source perspective │ │
│ ├───────────────┼─────────────────────┼──────────────────────────────┤ │
│ │ Google │ gemini-2.0-flash │ Multimodal capability │ │
│ └───────────────┴─────────────────────┴──────────────────────────────┘ │
│ │
│ Fallback Strategy │
│ 1. Retry with exponential backoff (max 2 retries) │
│ 2. Fall back to backup model if configured │
│ 3. Record all attempts in provenance chain │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Client: scripts/moe_classifier/core/multi_model_client.py
LLM Judge: scripts/moe_classifier/judges/llm_judge.py
Config: config/judge-model-routing.json
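The three-step fallback strategy above can be sketched as a retry loop. The max-2-retries limit, exponential backoff, backup model, and per-attempt provenance recording come from the diagram; the function signature and attempt-record shape are illustrative assumptions:

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

def call_with_fallback(invoke: Callable[[str], str],
                       primary: str,
                       backup: Optional[str] = None,
                       max_retries: int = 2,
                       base_delay: float = 0.5) -> Tuple[str, List[Dict]]:
    """Retry the primary model with exponential backoff, then fall back
    to the backup model; every attempt is recorded for provenance."""
    attempts: List[Dict] = []
    for model in filter(None, [primary, backup]):
        for retry in range(max_retries + 1):
            try:
                result = invoke(model)
                attempts.append({"model": model, "retry": retry, "ok": True})
                return result, attempts
            except Exception as exc:
                attempts.append({"model": model, "retry": retry,
                                 "ok": False, "error": str(exc)})
                if retry < max_retries:
                    time.sleep(base_delay * 2 ** retry)  # exponential backoff
    raise RuntimeError("all judge models failed")

# Demo: primary provider is down, backup succeeds
def flaky(model: str) -> str:
    if model == "gpt-4o":
        raise TimeoutError("provider unavailable")
    return f"verdict from {model}"

result, attempts = call_with_fallback(flaky, primary="gpt-4o",
                                      backup="claude-haiku", base_delay=0.0)
print(result, len(attempts))  # verdict from claude-haiku 4
```

Returning the attempt list alongside the result lets the caller append it directly to the provenance chain.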

5. Self-Improving Eval Loop (H.3.6)

Automated evaluation and prompt improvement:

┌─────────────────────────────────────────────────────────────────────────────┐
│ SELF-IMPROVING EVAL LOOP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Load Evals │───▶│ Run Model │───▶│ Score F1 │ │
│ │ (JSONL) │ │ (Multi-LLM) │ │ micro/macro │ │
│ └──────────────┘ └──────────────┘ └──────┬───────┘ │
│ │ │
│ ┌────────────────────▼─────────┐ │
│ │ Target F1 Reached? │ │
│ │ (default: 0.90) │ │
│ └────────────────┬─────────────┘ │
│ NO │ YES │
│ │ │ │ │
│ ▼ │ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │ ┌──────────┐ │
│ │ Applier │◄───│ Critic │ │ │ DONE │ │
│ │ (Update files│ │ (Analyze │ │ │ Return │ │
│ │ with backup)│ │ failures) │ │ │ Results │ │
│ └──────┬───────┘ └──────────────┘ │ └──────────┘ │
│ │ │ │
│ └────────────────────────────────────┘ │
│ Next Round │
│ │
│ Components: │
│ • eval_runner.py - Execute evals, compute F1 │
│ • critic_agent.py - Analyze failures, propose changes │
│ • improvement_applier.py - Apply changes with validation │
│ • eval_loop.py - Orchestrate improvement cycle │
│ • run_ci.py - CI quality gates │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Skill: skills/self-improving-eval/
CI: .github/workflows/eval-improvement.yml
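The orchestration in the loop diagram can be sketched as below, assuming the runner, critic, and applier are injected as callables. The 0.90 default target is from the diagram; the `max_rounds` bound and function names are illustrative:

```python
from typing import Callable, List

def improvement_loop(run_evals: Callable[[], float],
                     critique: Callable[[], list],
                     apply_changes: Callable[[list], None],
                     target_f1: float = 0.90,
                     max_rounds: int = 5) -> List[float]:
    """Run the eval -> critic -> applier cycle until the target F1
    is reached or the round budget is exhausted."""
    history: List[float] = []
    for _ in range(max_rounds):
        f1 = run_evals()            # execute evals, compute F1
        history.append(f1)
        if f1 >= target_f1:
            return history          # target reached: DONE
        proposals = critique()      # analyze failures, propose changes
        apply_changes(proposals)    # update files (with backup)
    return history

# Demo with stub components whose F1 improves each round
scores = iter([0.70, 0.85, 0.92])
history = improvement_loop(run_evals=lambda: next(scores),
                           critique=lambda: ["tighten prompt wording"],
                           apply_changes=lambda proposals: None)
print(history)  # [0.7, 0.85, 0.92]
```

Keeping the F1 history makes the CI quality gate simple: fail the build if the final score regresses below the target.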

6. Provenance Tracking (H.3.4)

Full audit trail for every evaluation:

from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List

@dataclass
class JudgeDecision:
    judge: str                         # Judge persona ID
    approved: bool                     # Verdict
    reason: str                        # Detailed reasoning
    confidence: float                  # 0.0-1.0
    dimension_scores: Dict[str, int]   # Per-dimension scores

    # Provenance fields (H.3.4)
    model_used: str                    # e.g., "claude-sonnet-4"
    timestamp: datetime                # When evaluated
    token_usage: int                   # Total tokens consumed
    raw_response: str                  # Raw LLM response
    evaluation_start_time: datetime
    evaluation_end_time: datetime
    duration_ms: float                 # Latency

@dataclass
class ConsensusResult:
    verdict: str                       # APPROVED, REJECTED, etc.
    confidence: float
    reasoning: str

    # Provenance chain (H.3.4)
    provenance_chain: List[Dict]       # All judge decisions
    dissenting_views: List[Dict]       # Disagreeing judges
    total_token_usage: int
    total_latency_ms: float
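A sketch of how per-judge provenance records might roll up into the consensus-level totals and dissenting views. The dict-based records and the majority rule are illustrative assumptions; the field names follow the dataclasses above:

```python
from typing import Dict, List, Tuple

def aggregate_provenance(decisions: List[Dict]) -> Tuple[int, float, List[Dict]]:
    """Sum token usage and latency across the provenance chain and
    collect judges who dissent from the majority verdict."""
    total_tokens = sum(d["token_usage"] for d in decisions)
    total_latency = sum(d["duration_ms"] for d in decisions)
    majority_approved = sum(d["approved"] for d in decisions) * 2 > len(decisions)
    dissenting = [d for d in decisions if d["approved"] != majority_approved]
    return total_tokens, total_latency, dissenting

chain = [
    {"judge": "technical_architect", "approved": True,
     "token_usage": 1200, "duration_ms": 850.0},
    {"judge": "compliance_auditor", "approved": False,
     "token_usage": 950, "duration_ms": 1100.0},
    {"judge": "security_analyst", "approved": True,
     "token_usage": 1010, "duration_ms": 920.0},
]
tokens, latency, dissent = aggregate_provenance(chain)
print(tokens, latency, [d["judge"] for d in dissent])
# 3160 2870.0 ['compliance_auditor']
```

Storing dissenting views verbatim, rather than discarding minority verdicts, is what makes the audit trail complete: a reviewer can reconstruct why consensus was reached despite disagreement.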

Consequences

Positive

  1. Multi-perspective Quality: 5+ judge personas catch different issues
  2. Reduced Bias: 3+ model families prevent vendor lock-in
  3. Full Auditability: Complete provenance for every decision
  4. ADR Alignment: Rubrics automatically derived from architecture decisions
  5. Continuous Improvement: Self-improving loop enhances prompts over time
  6. Disagreement Resolution: Debate protocol builds consensus

Negative

  1. Increased Costs: Multiple LLM calls per evaluation
  2. Latency: Multi-model evaluation slower than single model
  3. Complexity: More components to maintain and test
  4. API Dependencies: Requires multiple LLM provider accounts

Mitigations

  1. Cost Control: Use cheaper models (Haiku, DeepSeek) for initial screening
  2. Parallelism: Run judge evaluations concurrently
  3. Caching: Cache common evaluation patterns
  4. Fallbacks: Graceful degradation when providers unavailable

Implementation

Components Created

Component              Path                            Tests
Persona Loader         core/persona_loader.py          43 tests
ADR Rubric Generator   scripts/adr-rubric-generator.py 62 tests
Rubric Merger          core/rubric_merger.py           Part of above
Debate Protocol        core/debate.py                  30 tests
Multi-Model Client     core/multi_model_client.py      49 tests
LLM Judge              judges/llm_judge.py             17 tests
Provenance             core/models.py (enhanced)       30 tests
Self-Improving Eval    skills/self-improving-eval/     28 tests

Total: 260 tests

Commands

Command                Purpose
/moe-judges <target>   Assemble judge panel and evaluate
/classify <path>       Classify documents with verification
/eval-improve          Run self-improving eval loop

Configuration Files

config/
├── judge-personas.json # 6 judge persona definitions
├── judge-model-routing.json # Model routing per persona
├── generated-rubrics/ # 27 ADR-derived rubrics
└── merged-rubrics/ # 5 merged persona rubrics

References

  • H.3.1-H.3.6 implementation in PILOT-PARALLEL-EXECUTION-PLAN.md
  • PoLL (Panel of LLM Judges) research pattern
  • Constitutional AI multi-perspective evaluation
  • analyze-new-artifacts/MOE-JUDGES-RESEARCH/
  • analyze-new-artifacts/CLAUDE-CODE-EVAL-LOOPS/
  • ADR-052: Intent-Aware Context Management
  • ADR-053: Cloud Context Sync Architecture
  • ADR-054: Track Nomenclature Extensibility

Author: CODITECT Team
Reviewers: Architecture Council