ADR-012: Mixture of Experts (MoE) Analysis Framework
Document: ADR-012-moe-analysis-framework
Version: 1.0.0
Purpose: Document architectural decisions for multi-agent research analysis with certainty scoring
Audience: Framework contributors, developers, AI agents
Date Created: 2025-12-19
Status: APPROVED
Depends On:
- ADR-011-uncertainty-quantification-framework
Related ADRs:
- ADR-010-autonomous-orchestration-system
- ADR-013-moe-judges-framework
Related Components:
- commands/moe-analyze.md
- agents/uncertainty-orchestrator.md
- skills/uncertainty-quantification/SKILL.md
Research Foundation:
- docs/09-research-analysis/ACADEMIC-RESEARCH-REFERENCES-UQ-MOE-2024-2025.md
Context and Problem Statement
The Research Quality Problem
Current multi-agent research workflows suffer from:
- Implicit Certainty - Findings presented without confidence indicators
- Evidence Opacity - Sources not distinguished by reliability or recency
- Hidden Gaps - Missing information not explicitly documented
- Overconfident Assertions - Claims made without acknowledging uncertainty
- No Inference Transparency - Speculative conclusions lack reasoning traces
Research Foundation
This ADR is supported by peer-reviewed research from 2024-2025:
| Research | Venue | Contribution | Certainty |
|---|---|---|---|
| Semantic Density | NeurIPS 2024 | Best AUROC 26/28 cases | 96% |
| Self-Consistency | ICLR 2023 | +17.9% GSM8K improvement | 97% |
| Mixture-of-Agents | arXiv 2024 | 65.1% AlpacaEval win rate | 90% |
| UoT Framework | NeurIPS 2024 | 38.1% task completion improvement | 93% |
| Chain-of-Verification | ACL 2024 | 23% F1 improvement | 92% |
Full citations: See docs/09-research-analysis/ACADEMIC-RESEARCH-REFERENCES-UQ-MOE-2024-2025.md
Decision Drivers
- Factual Accuracy - CODITECT must never report unverified information as fact
- Explicit Uncertainty - When certainty is lacking, this must be communicated
- Research Validation - Claims must be traceable to evidence sources
- Logical Transparency - Inferred conclusions must show reasoning chains
- Ensemble Wisdom - Multiple perspectives reduce individual agent bias
Considered Options
Option A: Single-Agent Research with Confidence Prompting
- Single agent with uncertainty-aware prompting
- Rejected: Single-source bias, no cross-validation, limited perspectives
Option B: Multi-Agent Research without Certainty Scoring
- Multiple agents aggregate findings informally
- Rejected: No explicit certainty, inconsistent quality, hidden disagreements
Option C: MoE Analysis Framework with Certainty Scoring (Selected)
- Multiple specialized agents with structured certainty outputs
- Composite certainty scoring from weighted factors
- Evidence validation protocol with source classification
- Logical inference chains for speculative claims
- Selected: Addresses all decision drivers with research backing
Option D: External Fact-Checking Service Integration
- Third-party verification API
- Rejected: Latency, cost, availability dependencies
Decision
Implement Option C: MoE Analysis Framework with the following architecture:
1. Multi-Agent Analyst Panel
Agent Composition (4-6 agents per analysis):
analysts = [
{"type": "web-search-researcher", "focus": "External evidence gathering"},
{"type": "thoughts-analyzer", "focus": "Internal documentation analysis"},
{"type": "codebase-analyzer", "focus": "Implementation verification"},
{"type": "qa-reviewer", "focus": "Analysis quality validation"},
# Optional domain-specific agents
]
Research Basis: Mixture-of-Agents (arXiv 2024) demonstrates ensemble wisdom outperforms single powerful models (65.1% vs 57.5% AlpacaEval).
2. Certainty Scoring System
Composite Formula (from ADR-011):
certainty_score = (
evidence_support * 0.40 + # Quality of supporting sources
source_reliability * 0.25 + # Credibility of sources
internal_consistency * 0.20 + # Agent agreement level
recency * 0.15 # Information freshness
)
Research Basis: Semantic Density (NeurIPS 2024) achieves best AUROC in 26/28 test cases using multi-factor confidence analysis.
Certainty Levels:
| Score | Level | Required Action |
|---|---|---|
| 85-100% | HIGH | Report with confidence |
| 60-84% | MEDIUM | Note limitations, provide sources |
| 30-59% | LOW | Explicitly state uncertainty |
| 0-29% | INFERRED | Require logical inference chain |
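The composite formula and level thresholds above can be sketched in Python. This is a minimal illustration of the scoring logic; the function names are illustrative, not part of the framework's actual API.

```python
def certainty_score(evidence_support, source_reliability,
                    internal_consistency, recency):
    """Weighted composite of the four certainty factors (each 0-100),
    using the ADR-011 weights: 0.40 / 0.25 / 0.20 / 0.15."""
    return (evidence_support * 0.40
            + source_reliability * 0.25
            + internal_consistency * 0.20
            + recency * 0.15)

def certainty_level(score):
    """Map a 0-100 composite score to the four certainty levels."""
    if score >= 85:
        return "HIGH"
    if score >= 60:
        return "MEDIUM"
    if score >= 30:
        return "LOW"
    return "INFERRED"

score = certainty_score(95, 90, 85, 80)
print(score, certainty_level(score))  # → 89.5 HIGH
```

Note that because evidence support carries a 0.40 weight, a claim with weak evidence cannot reach HIGH even with perfect scores on the other three factors.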
3. Evidence Validation Protocol
Source Classification:
| Source Type | Base Reliability | Recency Penalty |
|---|---|---|
| Peer-reviewed | 95% | -30% if >5 years |
| Government | 90% | -15% if >3 years |
| Academic institution | 85% | -15% if >3 years |
| Industry leader | 80% | -5% if >1 year |
| Reputable news | 70% | -10% if >1 year |
| Industry blog | 60% | -15% if >1 year |
| Unknown/no source | 20% | N/A |
Research Basis: RAGAS framework (industry standard) achieves 95% human agreement on faithfulness validation.
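The classification table can be encoded as a lookup of (base reliability, penalty, age threshold in years). Treating the recency penalty as a flat percentage-point deduction once the threshold is crossed is an assumption; the type keys and function name are illustrative.

```python
# (base reliability, recency penalty, age threshold in years)
SOURCE_TYPES = {
    "peer_reviewed":   (95, 30, 5),
    "government":      (90, 15, 3),
    "academic":        (85, 15, 3),
    "industry_leader": (80,  5, 1),
    "reputable_news":  (70, 10, 1),
    "industry_blog":   (60, 15, 1),
    "unknown":         (20,  0, None),  # no recency penalty applies
}

def source_reliability(source_type, age_years):
    """Base reliability, minus the recency penalty once the source
    is older than its threshold."""
    base, penalty, threshold = SOURCE_TYPES[source_type]
    if threshold is not None and age_years > threshold:
        return base - penalty
    return base

print(source_reliability("peer_reviewed", 6))  # → 65
```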
Required Evidence Format:
{
"claim": "Specific statement being made",
"certainty_factor": 0.85,
"certainty_basis": "evidence_backed",
"evidence": [
{
"url": "https://example.com/source",
"title": "Source Title",
"venue": "Publication Venue",
"year": 2024,
"evidence_strength": "strong",
"summary": "How source supports claim"
}
],
"missing_information": ["What would increase certainty"]
}
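A structural check of this format could look like the sketch below. The key sets mirror the JSON schema above; the validator itself (name and return shape) is an illustration, not part of the framework.

```python
REQUIRED_CLAIM_KEYS = {"claim", "certainty_factor", "certainty_basis",
                       "evidence", "missing_information"}
REQUIRED_EVIDENCE_KEYS = {"url", "title", "venue", "year",
                          "evidence_strength", "summary"}

def validate_claim(claim):
    """Return (ok, detail) after checking required keys at both levels."""
    missing = REQUIRED_CLAIM_KEYS - claim.keys()
    if missing:
        return False, f"claim missing keys: {sorted(missing)}"
    for i, ev in enumerate(claim["evidence"]):
        ev_missing = REQUIRED_EVIDENCE_KEYS - ev.keys()
        if ev_missing:
            return False, f"evidence[{i}] missing keys: {sorted(ev_missing)}"
    return True, "ok"
```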
4. Logical Inference Protocol
When evidence is insufficient, generate explicit reasoning:
## Inferred Conclusion: [Statement]
**Inference Type:** Deduction | Induction | Abduction
**Certainty:** [X%] (INFERRED)
### Reasoning Chain
1. **Premise 1:** [Statement]
- Evidence: [Source or "Assumed based on..."]
- Certainty: [X%]
2. **Premise 2:** [Statement]
- Evidence: [Source or "Domain practice"]
- Certainty: [X%]
3. **Therefore:** [Conclusion]
### Assumptions
- [List assumptions that, if false, invalidate conclusion]
### Falsification Criteria
- [Evidence that would disprove this inference]
Research Basis: Uncertainty of Thoughts (NeurIPS 2024) demonstrates 38.1% improvement through explicit uncertainty modeling.
5. Quality Gates
| Condition | Action |
|---|---|
| Claim without evidence | Mark INFERRED, require reasoning chain |
| Source >2 years old | Flag recency concern, reduce reliability |
| Single-source claim | Mark LOW certainty minimum |
| Contradictory sources | Document conflict, require reconciliation |
| Agent disagreement >1.5σ | Investigate and document dissent |
Research Basis: Chain-of-Verification (ACL 2024) shows 23% F1 improvement with verification requirements.
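The gate table can be sketched as a per-finding check. Interpreting ">1.5σ" as any agent's score deviating from the panel mean by more than 1.5 population standard deviations is an assumption, as is the finding structure; the contradictory-source gate is omitted because it requires claim-level semantics.

```python
import statistics

def quality_gate_flags(finding, agent_scores, current_year=2025):
    """Apply the quality-gate table to one finding; return triggered flags."""
    flags = []
    evidence = finding.get("evidence", [])
    if not evidence:
        flags.append("No evidence: mark INFERRED, require reasoning chain")
    elif len(evidence) == 1:
        flags.append("Single-source claim: mark LOW certainty minimum")
    for src in evidence:
        if current_year - src["year"] > 2:
            flags.append(f"Recency concern: {src['title']}")
    if len(agent_scores) >= 2:
        mean = statistics.mean(agent_scores)
        sigma = statistics.pstdev(agent_scores)
        if sigma > 0 and any(abs(s - mean) > 1.5 * sigma for s in agent_scores):
            flags.append("Agent disagreement >1.5 sigma: document dissent")
    return flags
```

For example, a finding with no evidence scored [80, 82, 81, 30] by the panel would trigger both the INFERRED gate and the disagreement gate.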
Architecture
Workflow Phases
┌─────────────────────────────────────────────────────────────┐
│ Phase 1: Dispatch │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Analyst │ │ Analyst │ │ Analyst │ │ Analyst │ │
│ │ 1 │ │ 2 │ │ 3 │ │ 4 │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ └────────────┴─────┬──────┴────────────┘ │
│ ▼ │
├─────────────────────────────────────────────────────────────┤
│ Phase 2: Aggregation │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Cross-validate claims, identify contradictions, │ │
│ │ calculate composite certainty, document gaps │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
├─────────────────────────────────────────────────────────────┤
│ Phase 3: Conditional Research │
│ IF low_certainty_claims > threshold: │
│ Dispatch targeted follow-up research │
│ ELSE: │
│ Proceed to synthesis │
│ │ │
│ ▼ │
├─────────────────────────────────────────────────────────────┤
│ Phase 4: Synthesis │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Generate findings report with: │ │
│ │ - Certainty scores per finding │ │
│ │ - Evidence citations │ │
│ │ - Logical inference chains (where needed) │ │
│ │ - Explicit gaps and recommendations │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
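The Phase 3 gate in the diagram can be sketched as a simple predicate over the aggregated findings. The threshold of two low-certainty findings is illustrative; the ADR does not fix a value.

```python
def needs_followup_research(findings, max_low_certainty=2):
    """Return True when enough findings fall below MEDIUM (60%) certainty
    to justify dispatching targeted follow-up research."""
    low = [f for f in findings if f["certainty"]["score"] < 60]
    return len(low) > max_low_certainty
```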
Output Schema
{
"analysis_id": "moe-[timestamp]",
"query": "Original research question",
"findings": [
{
"id": "F1",
"statement": "Finding statement",
"certainty": {
"score": 85,
"level": "HIGH",
"factors": {
"evidence_support": 90,
"source_reliability": 85,
"internal_consistency": 80,
"recency": 75
}
},
"evidence": [...],
"inference_chain": null
},
{
"id": "F2",
"statement": "Inferred finding",
"certainty": {
"score": 25,
"level": "INFERRED"
},
"evidence": [],
"inference_chain": {
"type": "deduction",
"premises": [...],
"assumptions": [...],
"falsification": [...]
}
}
],
"gaps": ["List of missing information"],
"recommendations": ["Suggested follow-up research"],
"metadata": {
"agents_used": 4,
"sources_validated": 12,
"inference_chains_generated": 2,
"overall_certainty": 72
}
}
Consequences
Positive
- Reduced Overconfidence - Explicit certainty prevents unwarranted assertions
- Evidence Traceability - All claims linked to sources
- Transparent Reasoning - Inference chains expose logic
- Gap Documentation - Missing information explicitly captured
- Quality Improvement - Gaps enable targeted follow-up research
Negative
- Increased Token Usage - ~1000-2000 tokens per analysis for certainty metadata
- Latency - Multi-agent coordination adds processing time
- Complexity - More sophisticated output parsing required
- Training - Users must understand certainty levels
Neutral
- Shifts from implicit to explicit uncertainty expression
- Changes agent prompting requirements
Implementation
Phase 1: Core Components (Week 1-2)
- Create `moe-analyze` command specification
- Create `uncertainty-orchestrator` agent
- Create `uncertainty-quantification` skill
- Implement certainty scoring functions
- Define evidence validation schema
Phase 2: Integration (Week 3-4)
- Integrate with existing analyst agents
- Add certainty requirements to agent prompts
- Implement quality gate checks
- Create output formatting
Phase 3: Validation (Week 5-6)
- Test against known-answer datasets
- Calibrate certainty thresholds
- Gather user feedback
- Document edge cases
Validation Criteria
| Metric | Target | Measurement |
|---|---|---|
| Overconfidence Rate | <10% | Claims with HIGH certainty that are incorrect |
| Evidence Coverage | >90% | Claims with valid source citations |
| Gap Documentation | 100% | Analyses that document missing information |
| Inference Transparency | 100% | INFERRED claims with reasoning chains |
| User Certainty Understanding | >80% | Survey comprehension of certainty levels |
References
Primary Research (Tier 1: 95%+ Certainty)
- Semantic Density - Qiu & Miikkulainen, NeurIPS 2024
  - URL: https://neurips.cc/virtual/2024/poster/95598
  - Contribution: Multi-factor confidence scoring methodology
- Self-Consistency (CoT-SC) - Wang et al., ICLR 2023
  - URL: https://arxiv.org/abs/2203.11171
  - Contribution: Internal consistency measurement approach
- Uncertainty of Thoughts - Hu et al., NeurIPS 2024
  - URL: https://arxiv.org/abs/2402.03271
  - Contribution: Explicit uncertainty modeling patterns
- Chain-of-Verification - Meta AI, ACL 2024
  - URL: https://arxiv.org/abs/2309.11495
  - Contribution: Evidence validation protocol design
Secondary Research (Tier 2: 85-94% Certainty)
- Mixture-of-Agents - Together AI, arXiv 2024
  - URL: https://arxiv.org/abs/2406.04692
  - Contribution: Multi-agent orchestration pattern
- RAGAS Framework - Explodinggradients
  - URL: https://docs.ragas.io
  - Contribution: Evidence validation metrics
CODITECT Components
- commands/moe-analyze.md - Command specification
- agents/uncertainty-orchestrator.md - Orchestration agent
- skills/uncertainty-quantification/SKILL.md - Certainty calculation patterns
- docs/09-research-analysis/ACADEMIC-RESEARCH-REFERENCES-UQ-MOE-2024-2025.md - Full research catalog
Document Version: 1.0.0 | Last Updated: 2025-12-19 | Author: CODITECT Research Team | Status: APPROVED