CODITECT Uncertainty Quantification & MoE Evaluation Framework

Version: 1.0.0 | Created: 2025-12-19 | Author: AI Research Team | Status: Production-Ready Design | Classification: CONFIDENTIAL - AZ1.AI Inc.


Executive Summary

This document defines a comprehensive framework for managing uncertainty in LLM interactions through:

  1. MoE Analysis Framework (/moe-analyze) - Multi-agent research with certainty scoring
  2. MoE Judges Framework (/moe-judge) - Evaluation and grading of inputs/outputs
  3. Uncertainty Orchestrator Agent - Coordination of uncertainty-aware workflows
  4. Uncertainty Quantification Skill - Reusable patterns for measuring confidence

Key Research-Backed Principles (2024-2025)

| Principle | Source | Implementation |
|---|---|---|
| Semantic Entropy | Kuhn et al. 2023, NeurIPS | Multiple-sampling consistency |
| Verbal Uncertainty (VOCAL) | OpenReview 2024 | Calibrated confidence markers |
| Uncertainty of Thoughts (UoT) | ICLR 2024 | Explicit uncertainty modeling |
| Conformal Prediction | Google 2024 | Statistical guarantees |
| Internal State Analysis | MIND Framework 2024 | Hidden state probing |

Table of Contents

  1. Problem Statement
  2. Research Foundation
  3. MoE Analysis Framework
  4. MoE Judges Framework
  5. Uncertainty Orchestrator Agent
  6. Uncertainty Quantification Skill
  7. Implementation Components
  8. Grading Rubrics
  9. Integration Guide
  10. Research Sources

1. Problem Statement

The Dual Uncertainty Challenge

Outbound Uncertainty (LLM → User):

  • Risk of hallucinations and factual errors
  • Inconsistent confidence calibration
  • Over-confident assertions without evidence
  • Unclear knowledge boundaries

Inbound Uncertainty (User → LLM):

  • Ambiguous prompts leading to skewed responses
  • Missing context causing incorrect assumptions
  • Prompt engineering weaknesses amplifying errors
  • Lack of clarity in requirements

Goals

  1. Quantify Certainty - Measure confidence in both prompts and responses
  2. Support with Evidence - Require research-backed assertions
  3. Expose Gaps - Clearly flag information deficits
  4. Logical Inference - When data is unavailable, show reasoning chains
  5. Judge Quality - Grade both input prompts and output responses

2. Research Foundation

2.1 Uncertainty Quantification Methods (2024-2025)

Semantic Entropy

Source: Kuhn et al., NeurIPS 2023; follow-up work 2024 URL: https://arxiv.org/abs/2302.09664

Methodology:

  • Generate multiple responses to the same prompt
  • Cluster responses by semantic meaning (not lexical similarity)
  • High entropy = high uncertainty (many semantically different answers)
  • Low entropy = high confidence (responses converge semantically)

Implementation:

def calculate_semantic_entropy(responses: List[str]) -> float:
    """Calculate semantic entropy across multiple responses."""
    embeddings = embed_responses(responses)
    clusters = cluster_by_similarity(embeddings, threshold=0.85)
    probabilities = [len(c) / len(responses) for c in clusters]
    return -sum(p * log(p) for p in probabilities if p > 0)

Certainty Score Mapping:

| Semantic Entropy | Certainty Level | Description |
|---|---|---|
| 0.0 - 0.3 | HIGH (90%+) | Strong agreement across samples |
| 0.3 - 0.7 | MEDIUM (60-90%) | Some variation in responses |
| 0.7 - 1.5 | LOW (30-60%) | Significant disagreement |
| > 1.5 | VERY LOW (<30%) | No consensus |
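
The mapping above can be encoded directly as a small helper; assigning band edges to the higher-certainty side is a convention chosen here, which the table leaves open:

```python
def entropy_to_certainty_level(entropy: float) -> str:
    """Map a semantic entropy value to the certainty bands in the table."""
    if entropy <= 0.3:
        return "HIGH"      # 90%+: strong agreement across samples
    if entropy <= 0.7:
        return "MEDIUM"    # 60-90%: some variation in responses
    if entropy <= 1.5:
        return "LOW"       # 30-60%: significant disagreement
    return "VERY_LOW"      # <30%: no consensus
```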

Self-Consistency (CoT-SC)

Source: Wang et al., 2023; extended 2024 URL: https://arxiv.org/abs/2203.11171

Methodology:

  • Use chain-of-thought prompting
  • Sample multiple reasoning paths
  • Majority vote on final answers
  • Confidence = proportion agreeing with majority

Implementation:

def self_consistency_score(
    prompt: str,
    model: LLM,
    samples: int = 10
) -> Tuple[str, float]:
    """Generate multiple CoT responses and calculate consistency."""
    responses = [model.generate(prompt, temperature=0.7) for _ in range(samples)]
    answers = [extract_final_answer(r) for r in responses]
    majority = mode(answers)
    confidence = answers.count(majority) / len(answers)
    return majority, confidence

Verbal Uncertainty Calibration (VOCAL)

Source: OpenReview 2024 URL: https://openreview.net/forum?id=verbal-uncertainty

Methodology:

  • LLMs express confidence through linguistic markers
  • Calibrate verbal cues to match actual accuracy
  • Map phrases like "I'm confident", "likely", "possibly" to probability ranges

Calibration Table:

| Verbal Marker | Target Probability | Example |
|---|---|---|
| "I am certain" | 95-100% | "I am certain that Python was created by Guido van Rossum" |
| "Very likely" | 85-95% | "It is very likely that the error is in the authentication module" |
| "Probably" | 65-85% | "This is probably caused by a race condition" |
| "Possibly" | 35-65% | "This could possibly be a caching issue" |
| "Uncertain" | 15-35% | "I'm uncertain, but it might be related to memory" |
| "Cannot determine" | 0-15% | "I cannot determine the root cause without more information" |

Conformal Prediction for LLMs

Source: Google Research 2024, ICLR 2024 URL: https://arxiv.org/abs/2309.03893

Methodology:

  • Provides statistical guarantees on predictions
  • Creates prediction sets with coverage guarantees
  • The true answer is contained in the prediction set with probability ≥ 1-α
  • Smaller set = higher precision

Application:

def conformal_prediction_set(
    prompt: str,
    calibration_data: List[Tuple[str, str]],
    alpha: float = 0.1
) -> List[str]:
    """Generate prediction set with (1-alpha) coverage guarantee."""
    # Calibrate on held-out data
    scores = calibrate_nonconformity_scores(calibration_data)
    quantile = np.quantile(scores, 1 - alpha)

    # Generate candidates and filter
    candidates = generate_candidates(prompt)
    return [c for c in candidates if nonconformity_score(c) <= quantile]

2.2 Hallucination Detection Methods

HHEM (Hallucination Evaluation Model)

Source: Vectara 2024 URL: https://huggingface.co/vectara/hallucination_evaluation_model

Methodology:

  • Cross-encoder model trained on hallucination detection
  • Compares generated text against source documents
  • Outputs probability that text is grounded in source

FactScore

Source: Min et al., 2023; extended 2024 URL: https://arxiv.org/abs/2305.14251

Methodology:

  • Decompose claims into atomic facts
  • Verify each atomic fact independently
  • Score = proportion of verified facts

def factscore(response: str, sources: List[str]) -> float:
    """Calculate FactScore by decomposing and verifying atomic facts."""
    atomic_facts = decompose_to_atomic_facts(response)
    verified = [verify_fact(f, sources) for f in atomic_facts]
    return sum(verified) / len(verified) if verified else 0.0

SelfCheckGPT

Source: Manakul et al., 2023 URL: https://arxiv.org/abs/2303.08896

Methodology:

  • Zero-resource hallucination detection
  • Generate multiple samples from same prompt
  • Check consistency without external knowledge
  • Inconsistent claims likely hallucinated
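
The sample-consistency idea can be sketched as follows. Note that the actual SelfCheckGPT paper scores sentences with NLI, QA, or n-gram models; the token-overlap proxy below is a simplifying assumption to keep the sketch self-contained:

```python
from typing import List

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase tokens -- a crude stand-in for the
    NLI/QA-based consistency scorers used in the actual paper."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def selfcheck_scores(sentences: List[str], samples: List[str]) -> List[float]:
    """Score each sentence of the main response against resampled responses.
    A low average overlap flags the sentence as potentially hallucinated."""
    return [
        sum(token_overlap(sent, s) for s in samples) / len(samples)
        for sent in sentences
    ]
```

Sentences supported across many samples score near 1; fabricated details that never reappear in the resamples drift toward 0.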

2.3 LLM-as-Judge Frameworks

G-Eval

Source: Liu et al., 2023 URL: https://arxiv.org/abs/2303.16634

Methodology:

  • Chain-of-thought evaluation
  • Structured evaluation criteria
  • Probability-weighted scoring
  • Multiple dimension assessment
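
Probability-weighted scoring can be illustrated with a toy distribution; in G-Eval the probabilities would come from the judge model's logprobs over the score tokens, which is assumed rather than shown here:

```python
from typing import Dict

def probability_weighted_score(score_probs: Dict[int, float]) -> float:
    """Weight each candidate score (e.g. 1-5) by the model's probability
    of emitting that score token, yielding a continuous score instead of
    a coarse argmax choice."""
    total = sum(score_probs.values())  # renormalize: probs may not sum to 1
    return sum(s * p for s, p in score_probs.items()) / total
```

For instance, a judge that puts 0.6 on "4", 0.3 on "5", and 0.1 on "3" yields a score of about 4.2 rather than a flat 4.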

Multi-Agent Debate

Source: Du et al., 2023; extended 2024

Methodology:

  • Multiple LLM agents debate answers
  • Iterative refinement through discussion
  • Consensus emerges through argumentation
  • Final answer more reliable than single agent
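
A skeleton of the debate loop, with agents modeled as plain callables (a simplification; real implementations prompt separate LLM instances and pass richer transcripts):

```python
from collections import Counter
from typing import Callable, List

# An agent maps (prompt, peers' current answers) -> its (possibly revised) answer.
Agent = Callable[[str, List[str]], str]

def debate(prompt: str, agents: List[Agent], rounds: int = 3) -> str:
    """Iterative debate: each round, every agent sees the other agents'
    previous answers and may revise its own; the majority answer wins."""
    answers = [agent(prompt, []) for agent in agents]
    for _ in range(rounds - 1):
        answers = [
            agent(prompt, [a for j, a in enumerate(answers) if j != i])
            for i, agent in enumerate(agents)
        ]
    return Counter(answers).most_common(1)[0][0]
```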

3. MoE Analysis Framework

3.1 Command: /moe-analyze

Purpose: Execute research workflows with explicit certainty scoring and evidence requirements.

Workflow:

User Request → Orchestrator → [Analyst Agents] → [Web Search] → Synthesis → Certainty Report

3.2 Analyst Agent Types

| Agent Role | Responsibility | Certainty Contribution |
|---|---|---|
| Domain Expert | Deep domain knowledge | High-confidence baseline |
| Researcher | Web search and source validation | Evidence gathering |
| Devil's Advocate | Challenge assumptions | Uncertainty exposure |
| Synthesizer | Combine findings | Weighted integration |
| Fact Checker | Verify claims | Grounding validation |

3.3 Certainty Scoring Protocol

Each finding MUST include:

## Finding: [Topic]

**Certainty Level:** [HIGH/MEDIUM/LOW/INFERRED]
**Certainty Score:** [0-100%]

### Supporting Evidence
1. **Source:** [URL]
- **Reliability:** [Peer-reviewed/Industry/Blog/Unknown]
- **Recency:** [Date]
- **Relevance:** [Direct/Indirect/Tangential]

2. **Source:** [URL]
- ...

### Evidence Gaps
- [What information is missing]
- [What would increase certainty]

### Logical Inference (if applicable)
**Premise 1:** [Statement with evidence]
**Premise 2:** [Statement with evidence]
**Inference Rule:** [Deduction/Induction/Abduction]
**Conclusion:** [Inferred statement]
**Inference Confidence:** [0-100%]
**Decision Tree:**

IF [condition A] AND [condition B] THEN [conclusion C] ELSE IF [condition D] THEN [conclusion E]
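
One conservative way to score such an inference chain, assuming the premises are independent, is to multiply the premise confidences and discount by the reliability of the inference rule. The per-rule discount figures below are illustrative, not prescribed by the framework:

```python
from math import prod
from typing import List

# Illustrative discounts per inference rule -- not part of the framework spec.
RULE_RELIABILITY = {"deduction": 0.95, "induction": 0.7, "abduction": 0.5}

def inference_confidence(premise_confidences: List[float], rule: str) -> float:
    """Confidence (0-1) in a conclusion that requires ALL premises to hold,
    treating premises as independent and discounting by rule reliability."""
    return prod(premise_confidences) * RULE_RELIABILITY[rule]
```

Two premises held at 0.9 and 0.8, combined by deduction, thus yield an inference confidence of roughly 0.68.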

3.4 Certainty Calculation Algorithm

@dataclass
class CertaintyScore:
    """Composite certainty score with breakdown."""
    overall: float               # 0-100
    evidence_support: float      # Weight: 40%
    source_reliability: float    # Weight: 25%
    internal_consistency: float  # Weight: 20%
    recency: float               # Weight: 15%

    @property
    def weighted_score(self) -> float:
        return (
            self.evidence_support * 0.40 +
            self.source_reliability * 0.25 +
            self.internal_consistency * 0.20 +
            self.recency * 0.15
        )

    @property
    def confidence_level(self) -> str:
        score = self.weighted_score
        if score >= 85: return "HIGH"
        if score >= 60: return "MEDIUM"
        if score >= 30: return "LOW"
        return "VERY_LOW"

def calculate_source_reliability(source_type: str) -> float:
    """Score source reliability by type."""
    reliability_scores = {
        "peer_reviewed": 95,
        "government": 90,
        "academic_institution": 85,
        "industry_leader": 80,
        "reputable_news": 70,
        "industry_blog": 60,
        "personal_blog": 40,
        "unknown": 20,
        "no_source": 0
    }
    return reliability_scores.get(source_type, 20)

4. MoE Judges Framework

4.1 Command: /moe-judge

Purpose: Evaluate both outbound prompts and inbound responses for quality and correctness.

Dual Evaluation:

  1. Prompt Quality Assessment - Is the outbound prompt clear, complete, and unambiguous?
  2. Response Quality Assessment - Is the inbound response accurate, grounded, and well-calibrated?

4.2 Judge Panel Composition

| Judge Role | Evaluation Focus | Weight |
|---|---|---|
| Accuracy Judge | Factual correctness | 25% |
| Groundedness Judge | Evidence support | 20% |
| Coherence Judge | Logical consistency | 15% |
| Completeness Judge | Coverage of requirements | 15% |
| Calibration Judge | Confidence alignment | 15% |
| Ambiguity Judge | Clarity and precision | 10% |

4.3 Grading Dimensions

A. Prompt Quality Rubric (Outbound Assessment)

| Dimension | Weight | 5 (Excellent) | 3 (Adequate) | 1 (Failing) |
|---|---|---|---|---|
| Clarity | 20% | Unambiguous, single interpretation | Minor ambiguities | Multiple interpretations |
| Completeness | 20% | All necessary context provided | Some context missing | Critical context missing |
| Specificity | 20% | Precise requirements stated | General requirements | Vague requirements |
| Scope Definition | 15% | Clear boundaries defined | Implicit boundaries | Unbounded scope |
| Success Criteria | 15% | Measurable outcomes defined | Outcomes implied | No outcomes defined |
| Constraint Clarity | 10% | All constraints explicit | Some constraints implied | Constraints unclear |

Prompt Ambiguity Score:

def calculate_prompt_ambiguity(prompt: str) -> AmbiguityReport:
    """Analyze prompt for potential ambiguities."""
    issues = []

    # Check for vague quantifiers (word-boundary match avoids false hits
    # such as "etc" inside "fetch")
    vague_terms = ["some", "many", "few", "various", "etc", "and so on"]
    for term in vague_terms:
        if re.search(rf"\b{re.escape(term)}\b", prompt.lower()):
            issues.append(AmbiguityIssue(
                type="vague_quantifier",
                term=term,
                severity="MEDIUM",
                suggestion=f"Replace '{term}' with specific quantity"
            ))

    # Check for undefined references
    pronouns_without_antecedent = find_dangling_pronouns(prompt)
    for pronoun in pronouns_without_antecedent:
        issues.append(AmbiguityIssue(
            type="undefined_reference",
            term=pronoun,
            severity="HIGH",
            suggestion=f"Clarify what '{pronoun}' refers to"
        ))

    # Check for missing success criteria
    if not contains_success_criteria(prompt):
        issues.append(AmbiguityIssue(
            type="missing_criteria",
            term="success criteria",
            severity="HIGH",
            suggestion="Add explicit success criteria or expected outcomes"
        ))

    # Clamp so many issues cannot drive the score below zero
    return AmbiguityReport(issues=issues, score=max(0, 100 - len(issues) * 10))

B. Response Quality Rubric (Inbound Assessment)

| Dimension | Weight | 5 (Excellent) | 3 (Adequate) | 1 (Failing) |
|---|---|---|---|---|
| Factual Accuracy | 25% | All claims verifiable | Most claims verifiable | Unverifiable claims |
| Evidence Grounding | 20% | All assertions sourced | Some assertions sourced | No sources |
| Confidence Calibration | 15% | Confidence matches accuracy | Slight miscalibration | Severe miscalibration |
| Logical Coherence | 15% | Consistent reasoning | Minor inconsistencies | Contradictions |
| Completeness | 15% | Addresses all aspects | Addresses main aspects | Incomplete coverage |
| Uncertainty Honesty | 10% | Gaps clearly stated | Some gaps noted | Overconfident assertions |

4.4 Aggregate Grading System

@dataclass
class JudgeVerdict:
    """Individual judge's evaluation."""
    judge_role: str
    dimension_scores: Dict[str, float]  # 1-5 scale
    weighted_score: float
    justification: str
    evidence_citations: List[str]
    improvement_suggestions: List[str]

@dataclass
class MoEJudgement:
    """Aggregated MoE panel judgement."""
    individual_verdicts: List[JudgeVerdict]
    overall_score: float  # 0-100
    consensus_level: str  # HIGH, MEDIUM, LOW
    key_strengths: List[str]
    key_weaknesses: List[str]
    correctness_assessment: "CorrectnessVerdict"

    @property
    def grade(self) -> str:
        # A, B, C, D, F -- derived from overall_score rather than stored as a
        # field, since a dataclass field and a property cannot share a name
        if self.overall_score >= 90: return "A"
        if self.overall_score >= 80: return "B"
        if self.overall_score >= 70: return "C"
        if self.overall_score >= 60: return "D"
        return "F"

@dataclass
class CorrectnessVerdict:
    """Final correctness determination."""
    is_correct: bool
    confidence: float  # 0-100
    correct_elements: List[str]
    incorrect_elements: List[str]
    uncertain_elements: List[str]
    reasoning: str

4.5 Consensus Calculation

def calculate_judge_consensus(verdicts: List[JudgeVerdict]) -> float:
    """Calculate agreement level among judges."""
    scores = [v.weighted_score for v in verdicts]
    std_dev = np.std(scores)

    # Lower std dev = higher consensus
    if std_dev < 0.5:
        return 1.0  # HIGH consensus
    elif std_dev < 1.0:
        return 0.7  # MEDIUM consensus
    else:
        return 0.4  # LOW consensus

def final_moe_score(
    verdicts: List[JudgeVerdict],
    weights: Dict[str, float]
) -> float:
    """Calculate final weighted MoE score."""
    total = 0.0
    weight_sum = 0.0

    for verdict in verdicts:
        weight = weights.get(verdict.judge_role, 0.1)
        total += verdict.weighted_score * weight
        weight_sum += weight

    return (total / weight_sum) * 20  # rescale the 1-5 average to 0-100

5. Uncertainty Orchestrator Agent

5.1 Agent Specification

---
name: uncertainty-orchestrator
description: |
  Coordinates uncertainty-aware multi-agent workflows for research,
  analysis, and evaluation. Manages certainty scoring, evidence
  gathering, and MoE judging across specialized analyst and judge agents.
tools:
  - TodoWrite
  - Read
  - Write
  - Edit
  - Bash
  - Grep
  - Glob
  - Task
  - WebSearch
  - WebFetch
model: sonnet

context_awareness:
  uncertainty_patterns:
    high_certainty: ["confirmed", "verified", "proven", "established"]
    medium_certainty: ["likely", "probably", "suggests", "indicates"]
    low_certainty: ["possibly", "may", "uncertain", "unclear"]
    no_data: ["unknown", "no information", "cannot determine"]

workflow_types:
  analysis: "/moe-analyze"
  judgement: "/moe-judge"
  combined: "/moe-analyze --with-judgement"

auto_trigger_integration:
  enabled: true
  always_active_skills:
    - uncertainty-quantification
    - evidence-validation
  event_triggered_skills:
    on_claim_made:
      - fact-checking
      - source-verification
    on_low_confidence:
      - additional-research
      - expert-consultation
---

5.2 Agent Behavior

You are the Uncertainty Orchestrator, responsible for coordinating
uncertainty-aware research and evaluation workflows.

## Core Responsibilities

1. **Research Coordination**
- Dispatch analyst agents with specific research mandates
- Collect and validate evidence from multiple sources
- Calculate certainty scores using standardized methodology
- Synthesize findings with explicit uncertainty acknowledgment

2. **Evidence Validation**
- Verify source reliability and recency
- Cross-reference claims across sources
- Flag unsupported assertions
- Calculate grounding scores

3. **Certainty Reporting**
- Generate certainty scores for each finding
- Document evidence gaps explicitly
- Present logical inference chains when data is unavailable
- Provide decision trees for inferred conclusions

4. **Judge Panel Coordination**
- Dispatch judge agents for evaluation
- Aggregate verdicts with consensus weighting
- Generate final grades and correctness assessments
- Identify improvement opportunities

## Mandatory Outputs

Every analysis MUST include:
- Certainty score (0-100) with level (HIGH/MEDIUM/LOW/INFERRED)
- Evidence citations with reliability ratings
- Explicit gaps and limitations
- Logical reasoning chains for inferred conclusions
- Decision tree visualization where applicable

## Quality Gates

- Claims without evidence: REJECT or mark as INFERRED
- Sources older than 2 years: Flag for recency concerns
- Single-source claims: Mark as LOW certainty
- Contradictory sources: Require reconciliation
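
These gates can be expressed as a small check. The `Claim` fields and action labels below are illustrative names, not part of the framework's schema, and the contradictory-sources gate is omitted because it requires pairwise claim comparison:

```python
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class Claim:  # field names are illustrative
    text: str
    sources: List[str]
    source_dates: List[date]

def apply_quality_gates(claim: Claim, today: date) -> List[str]:
    """Return the gate actions a claim triggers, per the rules above."""
    actions = []
    if not claim.sources:
        actions.append("REJECT_OR_MARK_INFERRED")  # claim without evidence
    elif len(claim.sources) == 1:
        actions.append("MARK_LOW_CERTAINTY")       # single-source claim
    if any((today - d).days > 730 for d in claim.source_dates):
        actions.append("FLAG_RECENCY")             # source older than 2 years
    return actions
```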

6. Uncertainty Quantification Skill

6.1 Skill Specification

---
name: uncertainty-quantification
description: |
  Reusable patterns for measuring and expressing uncertainty in LLM
  interactions. Includes semantic entropy, self-consistency, verbal
  calibration, and confidence scoring methodologies.
allowed-tools: [Read, Write, WebSearch]
metadata:
  token-efficiency: "Adds ~500 tokens per evaluation for uncertainty analysis"
  integration: "Uncertainty Orchestrator, MoE Judges"
  tech-stack: "Python dataclasses, statistical analysis, NLP"
  tags: [uncertainty, calibration, confidence, evaluation]
version: 1.0.0
status: production
---

6.2 Core Patterns

Semantic Entropy Pattern

def semantic_entropy_uncertainty(
    prompt: str,
    model: LLM,
    samples: int = 5
) -> UncertaintyReport:
    """Calculate uncertainty via semantic entropy."""
    responses = [model.generate(prompt, temperature=0.7) for _ in range(samples)]
    embeddings = [embed(r) for r in responses]

    # Cluster by semantic similarity
    clusters = hierarchical_cluster(embeddings, threshold=0.85)

    # Calculate entropy
    probs = [len(c) / samples for c in clusters]
    entropy = -sum(p * log(p) for p in probs if p > 0)

    # Normalize to certainty score
    max_entropy = log(samples)
    certainty = 1 - (entropy / max_entropy) if max_entropy > 0 else 1.0

    return UncertaintyReport(
        certainty_score=certainty * 100,
        entropy=entropy,
        num_clusters=len(clusters),
        responses=responses
    )

Self-Consistency Pattern

def self_consistency_uncertainty(
    prompt: str,
    model: LLM,
    samples: int = 10
) -> UncertaintyReport:
    """Calculate uncertainty via answer consistency."""
    responses = [
        model.generate(prompt + "\nThink step by step.", temperature=0.7)
        for _ in range(samples)
    ]

    final_answers = [extract_conclusion(r) for r in responses]
    answer_counts = Counter(final_answers)

    majority = answer_counts.most_common(1)[0]
    consistency = majority[1] / samples

    return UncertaintyReport(
        certainty_score=consistency * 100,
        majority_answer=majority[0],
        answer_distribution=dict(answer_counts),
        responses=responses
    )

Verbal Uncertainty Detection

def detect_verbal_uncertainty(response: str) -> VocalAnalysis:
    """Analyze verbal uncertainty markers in response."""
    markers = {
        "certain": (95, 100),
        "confident": (90, 100),
        "very likely": (85, 95),
        "likely": (70, 85),
        "probably": (65, 80),
        "possibly": (35, 65),
        "might": (30, 60),
        "uncertain": (15, 35),
        "unclear": (10, 30),
        "cannot determine": (0, 15),
        "unknown": (0, 10)
    }

    found_markers = []
    for marker, (low, high) in markers.items():
        # Word-boundary match so "certain" is not detected inside "uncertain"
        if re.search(rf"\b{re.escape(marker)}\b", response.lower()):
            found_markers.append((marker, (low + high) / 2))

    avg_confidence = np.mean([m[1] for m in found_markers]) if found_markers else 50

    return VocalAnalysis(
        markers_found=found_markers,
        implied_confidence=avg_confidence,
        calibration_needed=len(found_markers) == 0
    )

7. Implementation Components

7.1 File Locations

| Component | Path | Purpose |
|---|---|---|
| Analysis Command | commands/moe-analyze.md | Multi-agent research with certainty |
| Judge Command | commands/moe-judge.md | Evaluation and grading |
| Orchestrator Agent | agents/uncertainty-orchestrator.md | Workflow coordination |
| UQ Skill | skills/uncertainty-quantification/SKILL.md | Reusable patterns |
| Rubrics | skills/uncertainty-quantification/templates/ | Evaluation rubrics |

7.2 Dependencies

  • Existing Skills: evaluation-framework, web-search-researcher
  • Existing Agents: orchestrator, thoughts-analyzer, qa-reviewer
  • External APIs: None required (all LLM-based)

8. Detailed Grading Rubrics

8.1 Prompt Quality Rubric (Outbound)

| Criterion | Weight | 5 | 4 | 3 | 2 | 1 |
|---|---|---|---|---|---|---|
| Clarity | 20% | Zero ambiguity | Minor ambiguity | Moderate ambiguity | Significant ambiguity | Incomprehensible |
| Context | 20% | Full context | Most context | Partial context | Minimal context | No context |
| Specificity | 20% | Exact requirements | Clear requirements | General requirements | Vague requirements | No requirements |
| Scope | 15% | Precise boundaries | Clear boundaries | Implied boundaries | Loose boundaries | Unbounded |
| Criteria | 15% | Measurable success | Clear success | Implied success | Vague success | No success criteria |
| Constraints | 10% | All explicit | Most explicit | Some explicit | Few explicit | None explicit |

8.2 Response Quality Rubric (Inbound)

| Criterion | Weight | 5 | 4 | 3 | 2 | 1 |
|---|---|---|---|---|---|---|
| Accuracy | 25% | 100% verifiable | 90%+ verifiable | 70%+ verifiable | 50%+ verifiable | <50% verifiable |
| Grounding | 20% | All sourced | Most sourced | Some sourced | Few sourced | None sourced |
| Calibration | 15% | Perfect match | Slight mismatch | Moderate mismatch | Significant mismatch | Severe mismatch |
| Coherence | 15% | Fully consistent | Minor inconsistency | Moderate inconsistency | Significant inconsistency | Contradictory |
| Completeness | 15% | 100% coverage | 90%+ coverage | 70%+ coverage | 50%+ coverage | <50% coverage |
| Honesty | 10% | All gaps stated | Most gaps stated | Some gaps stated | Few gaps stated | No gaps stated |
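
Both rubrics reduce to the same weighted computation. This sketch reuses the A-F thresholds from MoEJudgement; the response-rubric weights in the usage example are taken from the table above:

```python
from typing import Dict, Tuple

def rubric_grade(scores: Dict[str, int],
                 weights: Dict[str, float]) -> Tuple[float, str]:
    """Combine per-criterion 1-5 scores into a weighted 0-100 score and a
    letter grade, using the MoEJudgement grade thresholds."""
    weighted = sum(scores[c] * weights[c] for c in scores) / sum(weights.values())
    pct = weighted * 20  # rescale the 1-5 average to 0-100
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if pct >= cutoff:
            return pct, letter
    return pct, "F"
```

For example, straight 5s across the response rubric yield a score of 100 and grade A; a response that earns mostly 3s lands in the C-D range.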

9. Integration Guide

9.1 Quick Start

# Run MoE analysis with certainty scoring
/moe-analyze "Evaluate current functional requirements for completeness"

# Run MoE judgement on previous output
/moe-judge --target analysis-report.md

# Combined workflow
/moe-analyze --with-judgement "Research DMS compliance frameworks"

9.2 Example Workflow

  1. User Request: "Analyze our DMS against HIPAA requirements"

  2. Orchestrator Dispatches:

    • Compliance Expert Agent
    • Healthcare Research Agent
    • Gap Analysis Agent
    • Fact Checker Agent
  3. Each Agent Returns:

    Finding: "HIPAA requires encryption at rest"
    Certainty: HIGH (95%)
    Evidence:
    - Source: https://hhs.gov/hipaa/regulations
    Reliability: Government (90%)
    Recency: 2024
    Gaps: None
  4. Judge Panel Evaluates:

    • Accuracy Judge: 5/5 (verified against HHS)
    • Grounding Judge: 5/5 (government source)
    • Completeness Judge: 4/5 (missing some details)
    • Overall: 4.5/5 (90%, Grade A)
  5. Final Output:

    # MoE Analysis Report: HIPAA Compliance

    ## Overall Certainty: 92% (HIGH)
    ## Grade: A

    ### Findings with Certainty Scores
    1. [HIGH 95%] Encryption at rest required
    2. [HIGH 90%] Access controls mandatory
    3. [MEDIUM 75%] Audit logging recommended
    4. [INFERRED 60%] Estimated compliance timeline

    ### Evidence Gaps
    - Specific encryption algorithm requirements unclear
    - State-level variations not researched

    ### Recommendations
    ...

10. Research Sources

Primary References

  1. Semantic Entropy for Uncertainty Quantification

  2. VOCAL: Verbal Uncertainty Calibration

  3. Uncertainty of Thoughts (UoT)

  4. Conformal Prediction for LLMs

  5. MIND: LLM Hallucination Detection


Document Version: 1.0.0 | Last Updated: 2025-12-19 | Status: Production-Ready Design | Copyright: 2025 AZ1.AI Inc. All Rights Reserved.