Academic Research References: Uncertainty Quantification & MoE Evaluation Frameworks (2024-2025)

Document: ACADEMIC-RESEARCH-REFERENCES-UQ-MOE-2024-2025
Version: 1.0.0
Purpose: Source reference artifact for CODITECT uncertainty quantification and MoE evaluation frameworks
Classification: Research Reference - High Reliability
Date Created: 2025-12-19
Last Updated: 2025-12-19
Status: ACTIVE
Evidence Quality: Tier 1 (Peer-Reviewed) + Tier 2 (Industry Validated)
Usage: Required reading before implementing uncertainty-aware components
Related ADRs:
- ADR-011-UNCERTAINTY-QUANTIFICATION-FRAMEWORK
- ADR-012-MOE-ANALYSIS-FRAMEWORK
- ADR-013-MOE-JUDGES-FRAMEWORK

Purpose & Usage Guidelines

This document serves as the canonical source reference artifact for all academic research supporting CODITECT's uncertainty quantification and MoE evaluation frameworks. Per CODITECT's commitment to factual accuracy:

  1. Before implementing any UQ or evaluation feature, consult this document
  2. When making claims about methodology effectiveness, cite sources from this document
  3. When certainty is lacking, explicitly state "INFERRED" and provide reasoning chain
  4. During re-evaluation, use these sources to validate or update approaches

Reliability Tiers:

  • Tier 1 (95-100% certainty): Peer-reviewed at top venues (NeurIPS, ICLR, ACL, EMNLP, ICML)
  • Tier 2 (85-94% certainty): Industry-validated, highly cited, or OpenReview accepted
  • Tier 3 (70-84% certainty): arXiv preprints with strong methodology
  • INFERRED (<70% certainty): Logical inference from established research
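
The tier boundaries above can be expressed as a small helper. This is a hypothetical illustration (the function name and the idea of encoding the rubric in code are not part of the source), sketching how the certainty percentages map to tier labels:

```python
def reliability_tier(certainty: float) -> str:
    """Map a certainty percentage (0-100) to the reliability tiers
    defined in this document: Tier 1 (95-100), Tier 2 (85-94),
    Tier 3 (70-84), INFERRED (<70)."""
    if certainty >= 95:
        return "Tier 1"
    if certainty >= 85:
        return "Tier 2"
    if certainty >= 70:
        return "Tier 3"
    return "INFERRED"
```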

1. Semantic Entropy & Token-Level Uncertainty Methods

1.1 Kernel Language Entropy (KLE)

Title: Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs
Authors: Nikitin, Kossen, Gal, Marttinen
Venue: NeurIPS 2024
Reliability: Tier 1 (95%)
URL: https://neurips.cc/virtual/2024/poster/93979
GitHub: N/A

Core Contribution: Defines positive semidefinite, unit-trace kernels over semantic similarity and quantifies uncertainty via their von Neumann entropy.

Key Metrics:

  • Improved UQ across multiple NLG datasets
  • Theoretically proven to generalize semantic entropy
  • Both white-box and black-box implementations

CODITECT Application: Composite certainty scoring, semantic clustering in MoE Analysis.


1.2 Semantic Density

Title: Semantic Density: Quantifying Confidence in LLMs
Authors: Xin Qiu, Risto Miikkulainen (Cognizant AI Labs, UT Austin)
Venue: NeurIPS 2024
Reliability: Tier 1 (96%)
URL: https://neurips.cc/virtual/2024/poster/95598
GitHub: https://github.com/cognizant-ai-labs/semantic-density-paper
OpenReview: https://openreview.net/forum?id=LOH6qzI7T6

Core Contribution: Analyzes probability distributions in semantic space for confidence quantification.

Key Metrics:

  • Best AUROC in 26/28 test cases
  • Best AUPR in 27/28 cases
  • Statistical significance: p-values < 10^-6
  • Tested on 7 LLMs (Llama 3, Mixtral-8x22B, etc.)

CODITECT Application: Primary certainty scoring methodology for MoE Analysis.


1.3 Beyond Semantic Entropy (SNNE)

Title: Beyond Semantic Entropy: Boosting LLM UQ with Pairwise Semantic Similarity
Authors: Nguyen et al.
Venue: ACL 2025 (Findings)
Reliability: Tier 1 (93%)
URL: https://aclanthology.org/2025.findings-acl.234/
arXiv: https://arxiv.org/abs/2506.00245

Core Contribution: Black-box UQ using intra-cluster and inter-cluster similarity.

Key Metrics:

  • Superior performance on Phi3 and Llama3
  • Effective for the longer responses typical of modern LLMs
  • No NLI model clustering required

CODITECT Application: Backup certainty method for longer agent outputs.


1.4 Semantic Entropy Probes (SEPs)

Title: Semantic Entropy Probes: Efficient UQ from Hidden States
Authors: Various
Venue: arXiv 2024
Reliability: Tier 3 (84%)
URL: https://arxiv.org/abs/2406.15927

Core Contribution: Approximates semantic entropy from single generation hidden states.

Key Metrics:

  • ~10x computational savings vs. traditional semantic entropy
  • Robust hallucination detection
  • Better OOD generalization than previous probing methods

CODITECT Application: Efficient uncertainty estimation for high-throughput scenarios.


2. Self-Consistency Methods

2.1 Self-Consistency with Chain-of-Thought (CoT-SC)

Title: Self-Consistency Improves Chain of Thought Reasoning
Authors: Wang et al.
Venue: ICLR 2023 (foundational)
Reliability: Tier 1 (97%)
URL: https://arxiv.org/abs/2203.11171
OpenReview: https://openreview.net/forum?id=1PL1NIMMrw

Core Contribution: Sample diverse reasoning paths, select most consistent answer via majority voting.

Key Metrics:

  • GSM8K: +17.9% improvement
  • SVAMP: +11.0% improvement
  • AQuA: +12.2% improvement
  • StrategyQA: +6.4% improvement

CODITECT Application: Internal consistency scoring (20% weight in composite certainty).
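
The majority-voting core of self-consistency is straightforward to sketch. The function below is a minimal illustration (the sampling of reasoning paths is assumed to have happened upstream; the idea of reusing the vote share as the consistency score is an assumption, not something the paper prescribes):

```python
from collections import Counter

def self_consistency(answers: list[str]) -> tuple[str, float]:
    """Majority vote over final answers extracted from k sampled
    reasoning paths. Returns the modal answer and its vote share;
    the share can serve as an internal-consistency signal."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# e.g. 5 sampled chains, 4 of which agree:
best, score = self_consistency(["42", "42", "41", "42", "42"])
# best == "42", score == 0.8
```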


2.2 Soft Self-Consistency (SOFT-SC)

Title: Soft Self-Consistency for Language Model Agents
Authors: Various
Venue: ACL 2024
Reliability: Tier 1 (92%)
URL: https://aclanthology.org/2024.acl-short.28.pdf

Core Contribution: Handles cases with no unique majority answer and scales better with the number of samples k.

CODITECT Application: Agent agreement measurement when no clear consensus exists.


3. Calibration Methods

3.1 Adaptive Temperature Scaling (ATS)

Title: Adaptive Temperature Scaling for RLHF-Tuned LLMs
Authors: Xie, Chen, Lee, Mitchell, Finn
Venue: EMNLP 2024
Reliability: Tier 1 (94%)
URL: https://arxiv.org/abs/2409.19817
ACL: https://aclanthology.org/2024.emnlp-main.1007.pdf

Core Contribution: Per-token temperature scaling to address RLHF calibration degradation.

Key Metrics:

  • 10-50% calibration improvement across 3 NL benchmarks
  • Preserves RLHF performance gains

CODITECT Application: Post-hoc calibration for production models.
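
For context, plain (non-adaptive) temperature scaling can be sketched as below. This is the classical baseline ATS generalizes: ATS predicts a per-token temperature with a learned head, whereas here T is a single constant; the example logits are illustrative:

```python
import numpy as np

def scale_logits(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Temperature-scaled softmax: T > 1 flattens the distribution
    (less confident), T < 1 sharpens it. Ranking is unchanged."""
    z = logits / temperature
    z = z - z.max()              # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([4.0, 1.0, 0.5])
p_raw = scale_logits(logits, 1.0)
p_cal = scale_logits(logits, 2.0)   # flattened: top probability drops
```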


3.2 Thermometer: Universal Calibration

Title: Thermometer: Universal LLM Calibration
Authors: Shen, Das, Greenewald, Sattigeri, Wornell, Ghosh (MIT, IBM)
Venue: ICML 2024
Reliability: Tier 1 (95%)
URL: https://arxiv.org/html/2403.08819v1
Paper: https://sia.mit.edu/wp-content/uploads/2024/12/2024-shen-das-greenewald-sattigeri-wornell-ghosh-icml.pdf

Core Contribution: Auxiliary recognition network for dataset-specific temperature prediction.

Key Metrics:

  • Universal: Works on new tasks without retraining
  • Transferable: 7B model calibrates 13B, 70B models
  • Inference overhead: ~0.5%
  • Accuracy-preserving

CODITECT Application: Cross-model calibration for multi-agent systems.


4. Conformal Prediction for LLMs

4.1 Enhanced Conformal Prediction

Title: Enhanced Conformal Prediction for LLM Validity Guarantees
Authors: Cherian, Gibbs, Candès (Stanford)
Venue: NeurIPS 2024
Reliability: Tier 1 (96%)
URL: https://arxiv.org/abs/2406.09714
OpenReview: https://openreview.net/forum?id=JD3NYpeQ3R

Core Contribution: Conditional validity across response topics, differentiable scoring function.

Key Metrics:

  • Rigorous statistical guarantees
  • Better preserves valuable/accurate claims

CODITECT Application: Statistical quality gates for high-stakes outputs.


4.2 Conformal Language Modeling

Title: Conformal Language Modeling
Authors: Google Research
Venue: ICLR 2024
Reliability: Tier 1 (96%)
URL: https://arxiv.org/abs/2306.10193
OpenReview: https://openreview.net/forum?id=pzUhfQ74c5

Core Contribution: Calibrated stopping rule + rejection rule for LLM outputs.

Key Metrics:

  • Validated on open-domain QA, summarization, radiology
  • Identifies hallucination-free subsets
  • Theoretical coverage guarantees

CODITECT Application: Quality gate implementation for MoE outputs.
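
The rejection-rule idea rests on standard split conformal prediction, which can be sketched as follows. This is a generic illustration of the calibration step, not the paper's specific stopping/rejection rules; the calibration scores and alpha are illustrative:

```python
import math

def conformal_threshold(cal_scores: list[float], alpha: float) -> float:
    """Split-conformal quantile: given nonconformity scores from a
    held-out calibration set, accepting test outputs whose score is
    <= this threshold yields ~(1 - alpha) coverage."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))   # conformal quantile index
    return sorted(cal_scores)[min(k, n) - 1]

cal = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
tau = conformal_threshold(cal, alpha=0.2)  # n=9, k=ceil(10*0.8)=8 -> 0.8
accept = lambda score: score <= tau        # the quality-gate decision
```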


5. Internal State Analysis

5.1 LLM-Check: Eigenvalue Analysis

Title: LLM-Check: Hallucination Detection via Eigenvalue Analysis
Authors: Sriramanan et al.
Venue: NeurIPS 2024
Reliability: Tier 1 (95%)
URL: https://openreview.net/pdf?id=LYx4w3CAgy
GitHub: https://github.com/GaurangSriramanan/LLM_Check_Hallucination_Detection

Core Contribution: Single forward pass analysis using attention kernel maps and hidden activations.

Key Metrics:

  • 45x to 450x speedup vs. baselines
  • Detects consistent hallucination patterns

CODITECT Application: Efficient real-time hallucination detection.


5.2 MIND: Unsupervised Hallucination Detection

Title: MIND: Unsupervised Real-Time Hallucination Detection
Authors: Various
Venue: ACL 2024 (Findings)
Reliability: Tier 1 (93%)
URL: https://aclanthology.org/2024.findings-acl.854/

Core Contribution: Unsupervised framework using internal states, trained on Wikipedia text.

Key Metrics:

  • Real-time detection
  • No labeled data required
  • Generalizable across tasks

CODITECT Application: Quality gate for agent outputs.


6. LLM-as-Judge Frameworks

6.1 G-Eval

Title: G-Eval: NLG Evaluation with GPT-4 + Chain-of-Thought
Authors: Microsoft Research
Venue: EMNLP 2023 (foundational)
Reliability: Tier 1 (95%)
Documentation: https://www.confident-ai.com/blog/g-eval-the-definitive-guide
DeepEval: https://deepeval.com/docs/metrics-llm-evals

Core Contribution: Form-filling paradigm with automatic CoT generation for evaluation.

Key Metrics:

  • Spearman correlation: 0.514 (text summarization)
  • Outperformed all baselines on coherence, consistency, fluency, relevance
  • 8M+ evaluations in March 2025

CODITECT Application: Primary grading methodology for MoE Judges.
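
G-Eval's scoring step weights each candidate rating by the judge LLM's token probability rather than taking a single sampled rating. A minimal sketch, assuming the token probabilities for ratings 1-5 have already been extracted from the judge model (the example distribution is illustrative):

```python
def g_eval_score(token_probs: dict[int, float]) -> float:
    """Probability-weighted G-Eval score: expected rating under the
    judge LLM's distribution over the candidate score tokens."""
    total = sum(token_probs.values())      # normalize, in case of truncation
    return sum(s * p for s, p in token_probs.items()) / total

# Judge assigns most mass to rating 4:
score = g_eval_score({1: 0.05, 2: 0.05, 3: 0.2, 4: 0.5, 5: 0.2})
# score == 3.75, a finer-grained value than any single integer rating
```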


6.2 LLM-Rubric

Title: LLM-Rubric: Multi-Dimensional Calibrated Evaluation
Authors: Various
Venue: ACL 2024
Reliability: Tier 1 (92%)
URL: https://arxiv.org/abs/2501.00274
Implementation: https://www.promptfoo.dev/docs/configuration/expected-outputs/model-graded/llm-rubric/

Core Contribution: Calibrated rubric-based evaluation with web search integration.

CODITECT Application: Rubric design for prompt and response quality assessment.


6.3 Mixture-of-Agents (MoA)

Title: Mixture-of-Agents Enhances LLM Capabilities
Authors: Together AI
Venue: arXiv 2024
Reliability: Tier 2 (90%)
URL: https://arxiv.org/abs/2406.04692

Core Contribution: Layered architecture with ensemble wisdom from multiple LLM agents.

Key Metrics:

  • AlpacaEval 2.0: 65.1% win rate (vs. GPT-4's 57.5%)
  • MT-Bench: 9.40/10 with GPT-4o
  • State-of-the-art with open-source models

CODITECT Application: Multi-agent orchestration pattern for MoE Analysis.


6.4 Agent-as-a-Judge

Title: Agent-as-a-Judge: Agentic Systems Evaluating Agentic Systems
Authors: Various
Venue: arXiv 2024
Reliability: Tier 2 (88%)
URL: https://arxiv.org/html/2410.10934v2

Core Contribution: Multi-step reasoning and tool use for agent evaluation.

Key Metrics:

  • ~70% alignment with human judge consensus
  • Higher alignment than individual human evaluators

CODITECT Application: Evaluation pattern for multi-agent workflows.


6.5 ChatEval: Multi-Agent Referee Team

Title: ChatEval: Multi-Agent Discussion for Text Quality
Authors: Various
Venue: ICLR 2024
Reliability: Tier 1 (91%)
URL: https://openreview.net/forum?id=FQepisCUWu

Core Contribution: Multi-agent referee team with collaborative discussion.

Key Metrics:

  • Superior accuracy vs. single-agent evaluation
  • Higher correlation with human assessment

CODITECT Application: Judge panel coordination pattern.


7. Prompt Engineering for Uncertainty

7.1 Uncertainty of Thoughts (UoT)

Title: Uncertainty of Thoughts: Uncertainty-Aware Planning
Authors: Zhiyuan Hu et al.
Venue: NeurIPS 2024
Reliability: Tier 1 (93%)
URL: https://arxiv.org/abs/2402.03271
GitHub: https://github.com/zhiyuanhubj/UoT

Core Contribution: Information-seeking behavior through uncertainty-aware planning.

Key Metrics:

  • 38.1% average improvement in task completion
  • 57.8% peak improvement
  • Tested on Llama-3-70B, Mistral-Large, GPT-4

CODITECT Application: Agent prompt design for explicit uncertainty handling.


7.2 Chain-of-Verification (CoVe)

Title: Chain-of-Verification Reduces Hallucination in LLMs
Authors: Meta AI
Venue: ACL 2024 (Findings)
Reliability: Tier 1 (92%)
URL: https://arxiv.org/abs/2309.11495
ACL: https://aclanthology.org/2024.findings-acl.212.pdf

Core Contribution: Self-verification via planned verification questions answered independently.

Key Metrics:

  • 23% improvement in F1 score (closed-book QA)
  • 40% accuracy increase for technical writing
  • 28% improvement in FACTSCORE

CODITECT Application: Evidence validation protocol for claims.


7.3 Verbal Uncertainty Calibration (VUF/MUC)

Title: Calibrating Verbal Uncertainty as a Linear Feature
Authors: Various
Venue: arXiv 2025
Reliability: Tier 3 (84%)
URL: https://arxiv.org/html/2503.14477

Core Contribution: Mechanistic calibration via Verbal Uncertainty Feature discovery.

Key Metrics:

  • AUROC: 70-81% for hallucination detection
  • ~30% reduction in confident hallucinations

CODITECT Application: Verbal uncertainty marker calibration tables.


8. Hallucination Detection Systems

8.1 HHEM (Hughes Hallucination Evaluation Model)

Title: HHEM 2.1: Hallucination Detection for RAG
Authors: Vectara
Venue: Industry (Production)
Reliability: Tier 2 (93%)
URL: https://www.vectara.com/blog/hhem-2-1-a-better-hallucination-detection-model
HuggingFace: https://huggingface.co/vectara/hallucination_evaluation_model
GitHub: https://github.com/vectara/hallucination-leaderboard

Core Contribution: RAG-optimized hallucination detection model.

Key Metrics:

  • Outperforms GPT-3.5-Turbo and GPT-4 in balanced accuracy
  • 2M+ downloads
  • <600MB RAM, ~1.5s per 2k tokens

CODITECT Application: Quality gate for RAG-based research outputs.


8.2 SelfCheckGPT

Title: SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection
Authors: Cambridge University
Venue: EMNLP 2023
Reliability: Tier 1 (92%)
URL: https://arxiv.org/abs/2303.08896
GitHub: https://github.com/potsawee/selfcheckgpt

Core Contribution: Zero-resource detection via sampling consistency.

Key Metrics:

  • AUC-PR: 93.42% (non-factual)
  • 67.09% (factual)
  • No external database required

CODITECT Application: Zero-resource validation when evidence is unavailable.
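
The sampling-consistency idea can be sketched with a stand-in support checker. The paper's actual variants use BERTScore, NLI, QA, or n-gram consistency; the substring check below is a deliberately crude placeholder, and `supports` is a hypothetical parameter name:

```python
def selfcheck_score(claim_sent: str, samples: list[str], supports) -> float:
    """Fraction of independently sampled responses that do NOT support
    the sentence; higher = more likely hallucinated. `supports` stands
    in for SelfCheckGPT's NLI / QA / n-gram consistency checkers."""
    unsupported = sum(1 for s in samples if not supports(claim_sent, s))
    return unsupported / len(samples)

# Toy stand-in: substring containment as "support"
samples = ["Paris is the capital of France.",
           "France's capital is Paris.",
           "The capital of France is Paris."]
score = selfcheck_score("Paris", samples, lambda c, s: c in s)
# score == 0.0: every resampled response is consistent with the claim
```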


8.3 MiniCheck

Title: MiniCheck: Efficient Fact-Checking on Grounding Documents
Authors: Various
Venue: EMNLP 2024
Reliability: Tier 1 (94%)
URL: https://github.com/Liyan06/MiniCheck
HuggingFace: https://huggingface.co/bespokelabs/Bespoke-MiniCheck-7B

Core Contribution: GPT-4-level performance at 400x lower cost.

Key Metrics:

  • 500 docs/min on single A6000 GPU
  • 4.3% improvement over AlignScore
  • State-of-the-art on LLM-AggreFact benchmark

CODITECT Application: Efficient fact-checking for evidence validation.


8.4 FactScore

Title: FActScore: Fine-grained Atomic Evaluation of Factual Precision
Authors: Sewon Min et al.
Venue: EMNLP 2023
Reliability: Tier 1 (95%)
URL: https://arxiv.org/abs/2305.14251
GitHub: https://github.com/shmsw25/FActScore

Core Contribution: Atomic fact decomposition for long-form factuality.

Key Metrics:

  • <2% error vs. human evaluation
  • Foundational methodology for factuality assessment

CODITECT Application: Claim decomposition for evidence validation protocol.
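
The precision computation at the heart of FActScore can be sketched in a few lines. Decomposition into atomic facts (done by an LLM in the paper) and support checking against the knowledge source are assumed upstream; the example facts and knowledge set are illustrative:

```python
def factscore(atomic_facts: list[str], is_supported) -> float:
    """FActScore-style precision: fraction of atomic facts supported
    by the knowledge source. `is_supported` stands in for retrieval
    plus an entailment check."""
    if not atomic_facts:
        return 0.0
    return sum(1 for f in atomic_facts if is_supported(f)) / len(atomic_facts)

facts = ["born in 1954", "won a Grammy", "directed two films"]
known = {"born in 1954", "directed two films"}
precision = factscore(facts, known.__contains__)   # 2 of 3 supported
```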


9. RAG & Grounding Evaluation

9.1 RAGAS

Title: RAGAS: Retrieval-Augmented Generation Assessment
Authors: Explodinggradients
Venue: Open-source framework
Reliability: Tier 2 (90%)
URL: https://docs.ragas.io/en/v0.1.21/concepts/metrics/

Core Metrics:

  • Faithfulness: 95% agreement with humans
  • Answer Relevancy: 78% agreement
  • Context Precision/Recall: Retrieval quality

CODITECT Application: RAG-based research validation metrics.


9.2 DeepEval

Title: DeepEval: LLM Evaluation Framework
Authors: Confident AI
Venue: Open-source framework
Reliability: Tier 2 (88%)
URL: https://github.com/confident-ai/deepeval
Docs: https://deepeval.com/docs/metrics-introduction

Key Features:

  • Answer Relevancy, Faithfulness, Hallucination metrics
  • Contextual Precision/Recall
  • G-Eval integration
  • 5 lines of code for evaluation

CODITECT Application: Automated quality metrics for MoE outputs.


10. Benchmarks & Evaluation Standards

10.1 LM-Polygraph

Title: LM-Polygraph: Benchmarking UQ Methods for LLMs
Authors: Fadeeva, Vashurin et al.
Venue: TACL 2024
Reliability: Tier 1 (97%)
URL: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00737
GitHub: https://github.com/IINemo/lm-polygraph

Core Contribution: Battery of state-of-the-art UE methods with standardized evaluation.

Key Metrics:

  • PRR (Prediction Rejection Ratio) as primary metric
  • Evaluated across 11 tasks
  • Pre-implemented methods with demo application

CODITECT Application: Validation and benchmarking of UQ implementations.
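
One common formulation of the Prediction Rejection Ratio can be sketched as follows: reject instances from most-uncertain down, track the error rate of what remains, and normalize the resulting area against random and oracle rejection orders. This is an illustrative reconstruction, not LM-Polygraph's exact implementation; the example arrays are synthetic:

```python
import numpy as np

def rejection_auc(errors: np.ndarray, order: np.ndarray) -> float:
    """Area under the error-vs-rejection-rate curve when instances
    are rejected in `order` (most-suspect first). O(n^2); fine for
    a sketch."""
    kept_err = []
    remaining = list(order)
    while remaining:
        kept_err.append(errors[remaining].mean())
        remaining.pop(0)               # reject the next most-uncertain
    return float(np.mean(kept_err))

def prr(errors: np.ndarray, uncertainty: np.ndarray) -> float:
    """Prediction Rejection Ratio: 1.0 means the uncertainty ranking
    matches the oracle (errors rejected first); 0 means random."""
    auc_unc = rejection_auc(errors, np.argsort(-uncertainty))
    auc_orc = rejection_auc(errors, np.argsort(-errors))
    auc_rnd = float(errors.mean())     # expected AUC under random order
    return (auc_rnd - auc_unc) / (auc_rnd - auc_orc)

errors = np.array([1.0, 0.0, 1.0, 0.0])       # 1 = wrong answer
uncertainty = np.array([0.9, 0.1, 0.8, 0.2])  # perfectly ranks the errors
```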


10.2 Arena-Hard

Title: Arena-Hard: High-Quality LLM Benchmark
Authors: LMSYS
Venue: LMSYS Blog 2024
Reliability: Tier 2 (90%)
URL: https://lmsys.org/blog/2024-04-19-arena-hard/

Key Metrics:

  • 87.4% separability
  • 89.1% agreement with Chatbot Arena
  • 500 challenging prompts

CODITECT Application: Benchmark for evaluating MoE system quality.


Citation Guidelines

When Citing Research in CODITECT Outputs

Format for High Certainty Claims (Tier 1):

[Claim]. (Source: [Author] et al., [Venue] [Year], Certainty: 95%)

Format for Medium Certainty Claims (Tier 2):

[Claim]. (Source: [Author] et al., [Year], Certainty: 85-94%)

Format for Inferred Claims:

[INFERRED] [Claim].
Reasoning: [Logical chain]
Assumptions: [List]
Evidence Gap: [What would increase certainty]
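
The Tier 1 citation format above is mechanical enough to template. A hypothetical helper (the function name and example claim are illustrative, not part of the source):

```python
def cite_tier1(claim: str, author: str, venue: str,
               year: int, certainty: int) -> str:
    """Render a high-certainty (Tier 1) claim in the citation format
    defined above."""
    return (f"{claim}. (Source: {author} et al., {venue} {year}, "
            f"Certainty: {certainty}%)")

line = cite_tier1("Semantic density outperforms baselines on AUROC",
                  "Qiu", "NeurIPS", 2024, 96)
# "Semantic density outperforms baselines on AUROC.
#  (Source: Qiu et al., NeurIPS 2024, Certainty: 96%)"
```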

Research Update Schedule

  • New publication review: Weekly (AI Research Team)
  • Metric validation: Monthly (Quality Assurance)
  • Reference deprecation: Quarterly (Documentation Lead)
  • Full audit: Annually (Architecture Review Board)

Version History

  • 1.0.0 (2025-12-19): Initial comprehensive research compilation

Document Status: ACTIVE
Last Validation: 2025-12-19
Next Review: 2026-03-19
Maintainer: CODITECT Research Team