Academic Research References: Uncertainty Quantification & MoE Evaluation Frameworks (2024-2025)
Document: ACADEMIC-RESEARCH-REFERENCES-UQ-MOE-2024-2025
Version: 1.0.0
Purpose: Source reference artifact for CODITECT uncertainty quantification and MoE evaluation frameworks
Classification: Research Reference - High Reliability
Date Created: 2025-12-19
Last Updated: 2025-12-19
Status: ACTIVE
Evidence Quality: Tier 1 (Peer-Reviewed) + Tier 2 (Industry Validated)
Usage: Required reading before implementing uncertainty-aware components
Related ADRs:
- ADR-011-UNCERTAINTY-QUANTIFICATION-FRAMEWORK
- ADR-012-MOE-ANALYSIS-FRAMEWORK
- ADR-013-MOE-JUDGES-FRAMEWORK
Purpose & Usage Guidelines
This document serves as the canonical source reference artifact for all academic research supporting CODITECT's uncertainty quantification and MoE evaluation frameworks. Per CODITECT's commitment to factual accuracy:
- Before implementing any UQ or evaluation feature, consult this document
- When making claims about methodology effectiveness, cite sources from this document
- When certainty is lacking, explicitly state "INFERRED" and provide reasoning chain
- During re-evaluation, use these sources to validate or update approaches
Reliability Tiers (assigned by evidence type; per-entry certainty percentages are recorded alongside each tier):
- Tier 1: Peer-reviewed at top venues (NeurIPS, ICLR, ACL, EMNLP, ICML); certainty typically 90-100%
- Tier 2: Industry-validated, highly cited, or OpenReview-accepted; certainty typically 85-94%
- Tier 3: arXiv preprints with strong methodology; certainty typically 70-85%
- INFERRED: Logical inference from established research; certainty below 70%
1. Semantic Entropy & Token-Level Uncertainty Methods
1.1 Kernel Language Entropy (KLE)
| Field | Value |
|---|---|
| Title | Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs |
| Authors | Nikitin, Kossen, Gal, Marttinen |
| Venue | NeurIPS 2024 |
| Reliability | Tier 1 (95%) |
| URL | https://neurips.cc/virtual/2024/poster/93979 |
| GitHub | N/A |
Core Contribution: Defines positive semidefinite, unit-trace kernels over semantic similarity between sampled answers and quantifies uncertainty via their von Neumann entropy.
Key Metrics:
- Improved UQ across multiple NLG datasets
- Theoretically proven to generalize semantic entropy
- Both white-box and black-box implementations
CODITECT Application: Composite certainty scoring, semantic clustering in MoE Analysis.
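The von Neumann entropy at the core of KLE is directly computable once a semantic kernel has been built. A minimal Python sketch, assuming the kernel construction (pairwise semantic similarity, normalized to unit trace) has already been done upstream:

```python
import numpy as np

def von_neumann_entropy(K: np.ndarray) -> float:
    """Von Neumann entropy of a positive semidefinite, unit-trace kernel.

    In KLE, K encodes pairwise semantic similarity between sampled answers;
    higher entropy means more semantic spread, i.e. more uncertainty.
    """
    eigvals = np.linalg.eigvalsh(K)
    eigvals = eigvals[eigvals > 1e-12]  # drop numerical zeros before log
    return float(-np.sum(eigvals * np.log(eigvals)))

# K = I/n (n mutually dissimilar answers) gives the maximum entropy ln(n);
# a rank-one kernel (all answers semantically identical) gives 0.
```

This recovers the two limiting cases that make the metric interpretable: identical answers score 0, and n mutually unrelated answers score ln(n).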
1.2 Semantic Density
| Field | Value |
|---|---|
| Title | Semantic Density: Quantifying Confidence in LLMs |
| Authors | Xin Qiu, Risto Miikkulainen (Cognizant AI Labs, UT Austin) |
| Venue | NeurIPS 2024 |
| Reliability | Tier 1 (96%) |
| URL | https://neurips.cc/virtual/2024/poster/95598 |
| GitHub | https://github.com/cognizant-ai-labs/semantic-density-paper |
| OpenReview | https://openreview.net/forum?id=LOH6qzI7T6 |
Core Contribution: Analyzes probability distributions in semantic space for confidence quantification.
Key Metrics:
- Best AUROC in 26/28 test cases
- Best AUPR in 27/28 cases
- Statistical significance: p-values < 10^-6
- Tested on 7 LLMs (Llama 3, Mixtral-8x22B, etc.)
CODITECT Application: Primary certainty scoring methodology for MoE Analysis.
1.3 Beyond Semantic Entropy (SNNE)
| Field | Value |
|---|---|
| Title | Beyond Semantic Entropy: Boosting LLM UQ with Pairwise Semantic Similarity |
| Authors | Nguyen et al. |
| Venue | ACL 2025 (Findings) |
| Reliability | Tier 1 (93%) |
| URL | https://aclanthology.org/2025.findings-acl.234/ |
| arXiv | https://arxiv.org/abs/2506.00245 |
Core Contribution: Black-box UQ using intra-cluster and inter-cluster similarity.
Key Metrics:
- Superior performance on Phi3 and Llama3
- Effective for longer responses (modern LLMs)
- No NLI model clustering required
CODITECT Application: Backup certainty method for longer agent outputs.
1.4 Semantic Entropy Probes (SEPs)
| Field | Value |
|---|---|
| Title | Semantic Entropy Probes: Efficient UQ from Hidden States |
| Authors | Various |
| Venue | arXiv 2024 |
| Reliability | Tier 3 (85%) |
| URL | https://arxiv.org/abs/2406.15927 |
Core Contribution: Approximates semantic entropy from single generation hidden states.
Key Metrics:
- ~10x computational savings vs. traditional semantic entropy
- Robust hallucination detection
- Better OOD generalization than previous probing methods
CODITECT Application: Efficient uncertainty estimation for high-throughput scenarios.
2. Self-Consistency Methods
2.1 Self-Consistency with Chain-of-Thought (CoT-SC)
| Field | Value |
|---|---|
| Title | Self-Consistency Improves Chain of Thought Reasoning |
| Authors | Wang et al. |
| Venue | ICLR 2023 (foundational) |
| Reliability | Tier 1 (97%) |
| URL | https://arxiv.org/abs/2203.11171 |
| OpenReview | https://openreview.net/forum?id=1PL1NIMMrw |
Core Contribution: Sample diverse reasoning paths, select most consistent answer via majority voting.
Key Metrics:
- GSM8K: +17.9% improvement
- SVAMP: +11.0% improvement
- AQuA: +12.2% improvement
- StrategyQA: +6.4% improvement
CODITECT Application: Internal consistency scoring (20% weight in composite certainty).
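The CoT-SC selection rule itself is a plain majority vote over the final answers extracted from each sampled reasoning path. A minimal sketch (answer extraction from the reasoning traces is assumed to happen upstream):

```python
from collections import Counter

def self_consistent_answer(samples: list[str]) -> tuple[str, float]:
    """Majority vote over final answers from diverse reasoning paths.

    Returns the most consistent answer and its agreement ratio; the ratio
    doubles as a crude confidence signal for downstream scoring.
    """
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)
```

The agreement ratio is what feeds an internal-consistency component of a composite certainty score.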
2.2 Soft Self-Consistency (SOFT-SC)
| Field | Value |
|---|---|
| Title | Soft Self-Consistency for Language Model Agents |
| Authors | Various |
| Venue | ACL 2024 |
| Reliability | Tier 1 (92%) |
| URL | https://aclanthology.org/2024.acl-short.28.pdf |
Core Contribution: Handles cases with no unique majority answer and scales better with the number of samples k.
CODITECT Application: Agent agreement measurement when no clear consensus exists.
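One way to realize the soft-voting idea is to aggregate each answer's model-assigned probability instead of counting discrete votes, which still ranks answers when every sample is unique. A hedged sketch of that idea (not the paper's exact aggregation rule):

```python
def soft_self_consistency(samples: list[tuple[str, float]]) -> str:
    """Soft vote: sum the model-assigned probability of each distinct
    answer rather than counting occurrences. Unlike hard majority voting,
    this still produces a ranking when all sampled answers differ.
    """
    totals: dict[str, float] = {}
    for answer, prob in samples:
        totals[answer] = totals.get(answer, 0.0) + prob
    return max(totals, key=totals.get)
```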
3. Calibration Methods
3.1 Adaptive Temperature Scaling (ATS)
| Field | Value |
|---|---|
| Title | Adaptive Temperature Scaling for RLHF-Tuned LLMs |
| Authors | Xie, Chen, Lee, Mitchell, Finn |
| Venue | EMNLP 2024 |
| Reliability | Tier 1 (94%) |
| URL | https://arxiv.org/abs/2409.19817 |
| ACL | https://aclanthology.org/2024.emnlp-main.1007.pdf |
Core Contribution: Per-token temperature scaling to address RLHF calibration degradation.
Key Metrics:
- 10-50% calibration improvement across 3 NL benchmarks
- Preserves RLHF performance gains
CODITECT Application: Post-hoc calibration for production models.
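The base operation underneath ATS is ordinary temperature scaling of logits; ATS's contribution is predicting a separate temperature per token rather than fitting one global value. A sketch of the base operation:

```python
import numpy as np

def scale_probs(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Temperature-scaled softmax. T > 1 softens an overconfident
    distribution; T < 1 sharpens it. ATS learns a per-token T instead
    of a single global one."""
    z = logits / temperature
    z = z - z.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```

Calibration then reduces to choosing T (globally or per token) so that predicted probabilities match empirical accuracy.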
3.2 Thermometer: Universal Calibration
| Field | Value |
|---|---|
| Title | Thermometer: Universal LLM Calibration |
| Authors | Shen, Das, Greenewald, Sattigeri, Wornell, Ghosh (MIT, IBM) |
| Venue | ICML 2024 |
| Reliability | Tier 1 (95%) |
| URL | https://arxiv.org/html/2403.08819v1 |
| Paper | https://sia.mit.edu/wp-content/uploads/2024/12/2024-shen-das-greenewald-sattigeri-wornell-ghosh-icml.pdf |
Core Contribution: Auxiliary recognition network for dataset-specific temperature prediction.
Key Metrics:
- Universal: Works on new tasks without retraining
- Transferable: 7B model calibrates 13B, 70B models
- Inference overhead: ~0.5%
- Accuracy-preserving
CODITECT Application: Cross-model calibration for multi-agent systems.
4. Conformal Prediction for LLMs
4.1 Enhanced Conformal Prediction
| Field | Value |
|---|---|
| Title | Enhanced Conformal Prediction for LLM Validity Guarantees |
| Authors | Cherian, Gibbs, Candès (Stanford) |
| Venue | NeurIPS 2024 |
| Reliability | Tier 1 (96%) |
| URL | https://arxiv.org/abs/2406.09714 |
| OpenReview | https://openreview.net/forum?id=JD3NYpeQ3R |
Core Contribution: Conditional validity across response topics, differentiable scoring function.
Key Metrics:
- Rigorous statistical guarantees
- Better preserves valuable/accurate claims
CODITECT Application: Statistical quality gates for high-stakes outputs.
4.2 Conformal Language Modeling
| Field | Value |
|---|---|
| Title | Conformal Language Modeling |
| Authors | Quach et al. (MIT CSAIL, Google Research) |
| Venue | ICLR 2024 |
| Reliability | Tier 1 (96%) |
| URL | https://arxiv.org/abs/2306.10193 |
| OpenReview | https://openreview.net/forum?id=pzUhfQ74c5 |
Core Contribution: Calibrated stopping rule + rejection rule for LLM outputs.
Key Metrics:
- Validated on open-domain QA, summarization, radiology
- Identifies hallucination-free subsets
- Theoretical coverage guarantees
CODITECT Application: Quality gate implementation for MoE outputs.
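The coverage guarantee in split conformal prediction comes from a single calibration quantile; components whose nonconformity score exceeds it are rejected. A simplified sketch of that threshold computation, assuming exchangeable calibration data (this is the generic split-conformal recipe, not the paper's full stopping/rejection procedure):

```python
import math

def conformal_threshold(cal_scores: list[float], alpha: float) -> float:
    """Split-conformal quantile: under exchangeability, a fresh score
    falls at or below this threshold with probability >= 1 - alpha.
    Outputs scoring above it would be rejected by a quality gate."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # conservative finite-sample rank
    return sorted(cal_scores)[min(k, n) - 1]
```

With 100 calibration scores and alpha = 0.1, the rule keeps the 91st order statistic, slightly conservative relative to the naive 90th percentile.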
5. Internal State Analysis
5.1 LLM-Check: Eigenvalue Analysis
| Field | Value |
|---|---|
| Title | LLM-Check: Hallucination Detection via Eigenvalue Analysis |
| Authors | Sriramanan et al. |
| Venue | NeurIPS 2024 |
| Reliability | Tier 1 (95%) |
| URL | https://openreview.net/pdf?id=LYx4w3CAgy |
| GitHub | https://github.com/GaurangSriramanan/LLM_Check_Hallucination_Detection |
Core Contribution: Single forward pass analysis using attention kernel maps and hidden activations.
Key Metrics:
- 45x to 450x speedup vs. baselines
- Detects consistent hallucination patterns
CODITECT Application: Efficient real-time hallucination detection.
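The intuition behind eigenvalue-based detection can be sketched as a spectral-volume score over one response's hidden states; low spectral spread has been associated with hallucinated content. This is a sketch of the idea, not LLM-Check's exact scoring function:

```python
import numpy as np

def eigen_score(hidden: np.ndarray, eps: float = 1e-3) -> float:
    """Mean log-eigenvalue of the regularized Gram matrix of hidden
    states (shape [tokens, dim]) from a single forward pass. Degenerate,
    low-diversity representations score lower than well-spread ones."""
    h = hidden - hidden.mean(axis=0, keepdims=True)
    gram = h @ h.T / hidden.shape[1]
    eigvals = np.linalg.eigvalsh(gram + eps * np.eye(gram.shape[0]))
    return float(np.mean(np.log(eigvals)))
```

Because it needs only one generation and one eigendecomposition over a tokens-by-tokens matrix, this style of score is what makes the reported speedups over sampling-based baselines plausible.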
5.2 MIND: Unsupervised Hallucination Detection
| Field | Value |
|---|---|
| Title | MIND: Unsupervised Real-Time Hallucination Detection |
| Authors | Various |
| Venue | ACL 2024 (Findings) |
| Reliability | Tier 1 (93%) |
| URL | https://aclanthology.org/2024.findings-acl.854/ |
Core Contribution: Unsupervised framework using internal states, trained on Wikipedia text.
Key Metrics:
- Real-time detection
- No labeled data required
- Generalizable across tasks
CODITECT Application: Quality gate for agent outputs.
6. LLM-as-Judge Frameworks
6.1 G-Eval
| Field | Value |
|---|---|
| Title | G-Eval: NLG Evaluation with GPT-4 + Chain-of-Thought |
| Authors | Microsoft Research |
| Venue | EMNLP 2023 (foundational) |
| Reliability | Tier 1 (95%) |
| Documentation | https://www.confident-ai.com/blog/g-eval-the-definitive-guide |
| DeepEval | https://deepeval.com/docs/metrics-llm-evals |
Core Contribution: Form-filling paradigm with automatic CoT generation for evaluation.
Key Metrics:
- Spearman correlation: 0.514 (text summarization)
- Outperformed all baselines on coherence, consistency, fluency, relevance
- 8M+ evaluations in March 2025
CODITECT Application: Primary grading methodology for MoE Judges.
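G-Eval's final score is not the single rating token the judge emits but the rubric scores weighted by the probability the judge assigns to each score token, which yields a continuous value. A minimal sketch, assuming the per-score-token probabilities have already been read from the judge model:

```python
def g_eval_score(token_probs: dict[int, float]) -> float:
    """Probability-weighted rubric rating: each candidate score (e.g.
    1-5) is weighted by the judge model's probability for that score
    token, then normalized. Gives finer-grained scores than argmax."""
    total = sum(token_probs.values())
    return sum(score * p for score, p in token_probs.items()) / total
```

A judge that is split 50/50 between "4" and "5" thus grades 4.5 rather than snapping to either integer.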
6.2 LLM-Rubric
| Field | Value |
|---|---|
| Title | LLM-Rubric: Multi-Dimensional Calibrated Evaluation |
| Authors | Various |
| Venue | ACL 2024 |
| Reliability | Tier 1 (92%) |
| URL | https://arxiv.org/abs/2501.00274 |
| Implementation | https://www.promptfoo.dev/docs/configuration/expected-outputs/model-graded/llm-rubric/ |
Core Contribution: Multi-dimensional rubric-based evaluation with a learned calibration network over the LLM judge's outputs.
CODITECT Application: Rubric design for prompt and response quality assessment.
6.3 Mixture-of-Agents (MoA)
| Field | Value |
|---|---|
| Title | Mixture-of-Agents Enhances LLM Capabilities |
| Authors | Together AI |
| Venue | arXiv 2024 |
| Reliability | Tier 2 (90%) |
| URL | https://arxiv.org/abs/2406.04692 |
Core Contribution: Layered architecture with ensemble wisdom from multiple LLM agents.
Key Metrics:
- AlpacaEval 2.0: 65.1% win rate (vs. GPT-4o's 57.5%)
- MT-Bench: 9.40/10 with GPT-4o
- State-of-the-art with open-source models
CODITECT Application: Multi-agent orchestration pattern for MoE Analysis.
6.4 Agent-as-a-Judge
| Field | Value |
|---|---|
| Title | Agent-as-a-Judge: Agentic Systems Evaluating Agentic Systems |
| Authors | Various |
| Venue | arXiv 2024 |
| Reliability | Tier 2 (88%) |
| URL | https://arxiv.org/html/2410.10934v2 |
Core Contribution: Multi-step reasoning and tool use for agent evaluation.
Key Metrics:
- ~70% alignment with human judge consensus
- Higher alignment than individual human evaluators
CODITECT Application: Evaluation pattern for multi-agent workflows.
6.5 ChatEval: Multi-Agent Referee Team
| Field | Value |
|---|---|
| Title | ChatEval: Multi-Agent Discussion for Text Quality |
| Authors | Various |
| Venue | ICLR 2024 |
| Reliability | Tier 1 (91%) |
| URL | https://openreview.net/forum?id=FQepisCUWu |
Core Contribution: Multi-agent referee team with collaborative discussion.
Key Metrics:
- Superior accuracy vs. single-agent evaluation
- Higher correlation with human assessment
CODITECT Application: Judge panel coordination pattern.
7. Prompt Engineering for Uncertainty
7.1 Uncertainty of Thoughts (UoT)
| Field | Value |
|---|---|
| Title | Uncertainty of Thoughts: Uncertainty-Aware Planning |
| Authors | Zhiyuan Hu et al. |
| Venue | NeurIPS 2024 |
| Reliability | Tier 1 (93%) |
| URL | https://arxiv.org/abs/2402.03271 |
| GitHub | https://github.com/zhiyuanhubj/UoT |
Core Contribution: Information-seeking behavior through uncertainty-aware planning.
Key Metrics:
- 38.1% average improvement in task completion
- 57.8% peak improvement
- Tested on Llama-3-70B, Mistral-Large, GPT-4
CODITECT Application: Agent prompt design for explicit uncertainty handling.
7.2 Chain-of-Verification (CoVe)
| Field | Value |
|---|---|
| Title | Chain-of-Verification Reduces Hallucination in LLMs |
| Authors | Meta AI |
| Venue | ACL 2024 (Findings) |
| Reliability | Tier 1 (92%) |
| URL | https://arxiv.org/abs/2309.11495 |
| ACL | https://aclanthology.org/2024.findings-acl.212.pdf |
Core Contribution: Self-verification via planned verification questions answered independently.
Key Metrics:
- 23% improvement in F1 score (closed-book QA)
- 40% accuracy increase for technical writing
- 28% improvement in FACTSCORE
CODITECT Application: Evidence validation protocol for claims.
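The CoVe loop (draft, plan verification questions, answer them independently so the draft cannot bias them, then revise) is easy to express as a pipeline over any LLM callable. A hedged sketch; the prompt strings are illustrative placeholders, not the paper's templates:

```python
from typing import Callable

def chain_of_verification(
    question: str,
    llm: Callable[[str], str],
    plan_prompt: str = "List verification questions, one per line, for: ",
) -> str:
    """CoVe sketch: draft -> plan verification questions -> answer each
    question in isolation (independent of the draft) -> revise the draft
    against the verification evidence."""
    draft = llm(question)
    plan = llm(plan_prompt + draft)
    verification_qs = [q for q in plan.splitlines() if q.strip()]
    answers = [llm(q) for q in verification_qs]  # answered independently
    evidence = "\n".join(
        f"Q: {q}\nA: {a}" for q, a in zip(verification_qs, answers)
    )
    return llm(f"Revise the draft using the checks.\nDraft: {draft}\n{evidence}")
```

The key design point is that each verification question is sent to the model without the draft attached, which is what breaks the self-reinforcement that lets hallucinations survive naive self-critique.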
7.3 Verbal Uncertainty Calibration (VUF/MUC)
| Field | Value |
|---|---|
| Title | Calibrating Verbal Uncertainty as a Linear Feature |
| Authors | Various |
| Venue | arXiv 2025 |
| Reliability | Tier 3 (85%) |
| URL | https://arxiv.org/html/2503.14477 |
Core Contribution: Mechanistic calibration via Verbal Uncertainty Feature discovery.
Key Metrics:
- AUROC: 70-81% for hallucination detection
- ~30% reduction in confident hallucinations
CODITECT Application: Verbal uncertainty marker calibration tables.
8. Hallucination Detection Systems
8.1 HHEM (Hughes Hallucination Evaluation Model)
| Field | Value |
|---|---|
| Title | HHEM 2.1: Hallucination Detection for RAG |
| Authors | Vectara |
| Venue | Industry (Production) |
| Reliability | Tier 2 (93%) |
| URL | https://www.vectara.com/blog/hhem-2-1-a-better-hallucination-detection-model |
| HuggingFace | https://huggingface.co/vectara/hallucination_evaluation_model |
| GitHub | https://github.com/vectara/hallucination-leaderboard |
Core Contribution: RAG-optimized hallucination detection model.
Key Metrics:
- Outperforms GPT-3.5-Turbo and GPT-4 in balanced accuracy
- 2M+ downloads
- <600MB RAM, ~1.5s per 2k tokens
CODITECT Application: Quality gate for RAG-based research outputs.
8.2 SelfCheckGPT
| Field | Value |
|---|---|
| Title | SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection |
| Authors | Cambridge University |
| Venue | EMNLP 2023 |
| Reliability | Tier 1 (92%) |
| URL | https://arxiv.org/abs/2303.08896 |
| GitHub | https://github.com/potsawee/selfcheckgpt |
Core Contribution: Zero-resource detection via sampling consistency.
Key Metrics:
- AUC-PR: 93.42% (non-factual)
- 67.09% (factual)
- No external database required
CODITECT Application: Zero-resource validation when evidence is unavailable.
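The core idea (a claim unsupported by independently sampled responses is likely hallucinated) can be sketched in a few lines. SelfCheckGPT's published variants measure support with NLI, BERTScore, or QA models; plain token overlap is used below only as a dependency-free stand-in:

```python
def selfcheck_score(sentence: str, samples: list[str]) -> float:
    """Inconsistency score in [0, 1]: how poorly a sentence is supported
    by other sampled responses from the same model. Higher means more
    likely hallucinated. Token overlap is a crude proxy for the NLI/
    BERTScore support measures used in the actual method."""
    words = set(sentence.lower().split())
    if not words or not samples:
        return 1.0
    support = [
        len(words & set(s.lower().split())) / len(words) for s in samples
    ]
    return 1.0 - sum(support) / len(samples)
```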
8.3 MiniCheck
| Field | Value |
|---|---|
| Title | MiniCheck: Efficient Fact-Checking on Grounding Documents |
| Authors | Various |
| Venue | EMNLP 2024 |
| Reliability | Tier 1 (94%) |
| URL | https://github.com/Liyan06/MiniCheck |
| HuggingFace | https://huggingface.co/bespokelabs/Bespoke-MiniCheck-7B |
Core Contribution: GPT-4-level performance at 400x lower cost.
Key Metrics:
- 500 docs/min on a single A6000 GPU
- 4.3% improvement over AlignScore
- State-of-the-art on LLM-AggreFact benchmark
CODITECT Application: Efficient fact-checking for evidence validation.
8.4 FactScore
| Field | Value |
|---|---|
| Title | FActScore: Fine-grained Atomic Evaluation of Factual Precision |
| Authors | Sewon Min et al. |
| Venue | EMNLP 2023 |
| Reliability | Tier 1 (95%) |
| URL | https://arxiv.org/abs/2305.14251 |
| GitHub | https://github.com/shmsw25/FActScore |
Core Contribution: Atomic fact decomposition for long-form factuality.
Key Metrics:
- <2% error vs. human evaluation
- Foundational methodology for factuality assessment
CODITECT Application: Claim decomposition for evidence validation protocol.
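Once a response is decomposed into atomic facts, the score itself is just the supported fraction. A sketch with both the decomposition output and the support check injected (in the paper, an LLM produces the atomic facts and a retriever-backed model judges support):

```python
from typing import Callable

def factscore(atomic_facts: list[str],
              is_supported: Callable[[str], bool]) -> float:
    """FActScore = fraction of atomic facts supported by a knowledge
    source. Decomposition and the support judgment are delegated to
    the caller; this function only aggregates."""
    if not atomic_facts:
        return 0.0
    return sum(map(is_supported, atomic_facts)) / len(atomic_facts)
```

Keeping aggregation separate from decomposition is what lets the same protocol swap in different evidence sources for validation.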
9. RAG & Grounding Evaluation
9.1 RAGAS
| Field | Value |
|---|---|
| Title | RAGAS: Retrieval-Augmented Generation Assessment |
| Authors | Explodinggradients |
| Venue | Open-source framework |
| Reliability | Tier 2 (90%) |
| URL | https://docs.ragas.io/en/v0.1.21/concepts/metrics/ |
Core Metrics:
- Faithfulness: 95% agreement with humans
- Answer Relevancy: 78% agreement
- Context Precision/Recall: Retrieval quality
CODITECT Application: RAG-based research validation metrics.
9.2 DeepEval
| Field | Value |
|---|---|
| Title | DeepEval: LLM Evaluation Framework |
| Authors | Confident AI |
| Venue | Open-source framework |
| Reliability | Tier 2 (88%) |
| URL | https://github.com/confident-ai/deepeval |
| Docs | https://deepeval.com/docs/metrics-introduction |
Key Features:
- Answer Relevancy, Faithfulness, Hallucination metrics
- Contextual Precision/Recall
- G-Eval integration
- 5 lines of code for evaluation
CODITECT Application: Automated quality metrics for MoE outputs.
10. Benchmarks & Evaluation Standards
10.1 LM-Polygraph
| Field | Value |
|---|---|
| Title | LM-Polygraph: Benchmarking UQ Methods for LLMs |
| Authors | Fadeeva, Vashurin et al. |
| Venue | TACL 2024 |
| Reliability | Tier 1 (97%) |
| URL | https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00737 |
| GitHub | https://github.com/IINemo/lm-polygraph |
Core Contribution: Battery of state-of-the-art uncertainty estimation (UE) methods with standardized evaluation.
Key Metrics:
- PRR (Prediction Rejection Ratio) as primary metric
- Evaluated across 11 tasks
- Pre-implemented methods with demo application
CODITECT Application: Validation and benchmarking of UQ implementations.
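PRR compares the rejection curve obtained by discarding the most-uncertain predictions against random rejection (PRR = 0) and oracle rejection by true loss (PRR = 1). A sketch under the assumption that per-example losses and uncertainty scores are available:

```python
import numpy as np

def prediction_rejection_ratio(losses: np.ndarray,
                               uncertainty: np.ndarray) -> float:
    """PRR: how close rejecting by estimated uncertainty comes to
    rejecting by true loss, relative to random rejection. 1.0 means the
    UQ method ranks errors as well as an oracle; <= 0 means no better
    than (or worse than) random."""
    def rejection_auc(order: np.ndarray) -> float:
        # Mean loss over retained items as progressively fewer of the
        # best-ranked examples are kept, averaged over all cutoffs.
        kept = losses[order]
        curve = [kept[:k].mean() for k in range(1, len(kept) + 1)]
        return float(np.mean(curve))

    auc_uq = rejection_auc(np.argsort(uncertainty))  # keep most certain
    auc_oracle = rejection_auc(np.argsort(losses))   # keep lowest loss
    auc_random = float(losses.mean())                # flat curve
    return (auc_random - auc_uq) / (auc_random - auc_oracle)
```

Benchmarking a custom certainty score against LM-Polygraph's pre-implemented methods on the same PRR scale is the natural validation step before deployment.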
10.2 Arena-Hard
| Field | Value |
|---|---|
| Title | Arena-Hard: High-Quality LLM Benchmark |
| Authors | LMSYS |
| Venue | LMSYS Blog 2024 |
| Reliability | Tier 2 (90%) |
| URL | https://lmsys.org/blog/2024-04-19-arena-hard/ |
Key Metrics:
- 87.4% separability
- 89.1% agreement with Chatbot Arena
- 500 challenging prompts
CODITECT Application: Benchmark for evaluating MoE system quality.
Citation Guidelines
When Citing Research in CODITECT Outputs
Format for High Certainty Claims (Tier 1):
[Claim]. (Source: [Author] et al., [Venue] [Year], Certainty: 95%)
Format for Medium Certainty Claims (Tier 2):
[Claim]. (Source: [Author] et al., [Year], Certainty: 85-94%)
Format for Inferred Claims:
[INFERRED] [Claim].
Reasoning: [Logical chain]
Assumptions: [List]
Evidence Gap: [What would increase certainty]
Research Update Schedule
| Update Type | Frequency | Responsibility |
|---|---|---|
| New publication review | Weekly | AI Research Team |
| Metric validation | Monthly | Quality Assurance |
| Reference deprecation | Quarterly | Documentation Lead |
| Full audit | Annually | Architecture Review Board |
Version History
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2025-12-19 | Initial comprehensive research compilation |
Document Status: ACTIVE Last Validation: 2025-12-19 Next Review: 2026-03-19 Maintainer: CODITECT Research Team