
ADR-011: Uncertainty Quantification Framework for LLM Interactions

Document: ADR-011-uncertainty-quantification-framework
Version: 1.0.0
Purpose: Document architectural decisions for measuring, expressing, and managing uncertainty in LLM-based multi-agent workflows
Audience: Framework contributors, developers, AI agents, researchers
Date Created: 2025-12-19
Status: APPROVED
Related ADRs:
- ADR-010-autonomous-orchestration-system (orchestration patterns)
- ADR-012-moe-analysis-framework (analysis implementation)
- ADR-013-moe-judges-framework (evaluation implementation)
Related Documents:
- skills/uncertainty-quantification/SKILL.md
- agents/uncertainty-orchestrator.md
- commands/moe-analyze.md
- commands/moe-judge.md
- docs/09-research-analysis/UNCERTAINTY-QUANTIFICATION-MOE-FRAMEWORK.md
Research Foundation:
- docs/09-research-analysis/ACADEMIC-RESEARCH-REFERENCES-UQ-MOE-2024-2025.md (Canonical source catalog)
Research References:
- Semantic Entropy: https://arxiv.org/abs/2302.09664 (Kuhn et al., ICLR 2023)
- Self-Consistency: https://arxiv.org/abs/2203.11171 (Wang et al. 2023)
- Uncertainty of Thoughts: https://arxiv.org/abs/2402.03271 (ICLR 2024)
- VOCAL: OpenReview 2024
- LLM-Rubric: https://arxiv.org/abs/2501.00274 (ACL 2024)

Context and Problem Statement

The Dual Uncertainty Problem

LLM-based systems face uncertainty from two directions:

Outbound Uncertainty (LLM → User):

  • Hallucinations and factual errors in generated content
  • Overconfident assertions without supporting evidence
  • Unclear knowledge boundaries leading to unreliable outputs
  • Miscalibrated confidence expressions

Inbound Uncertainty (User → LLM):

  • Ambiguous prompts causing skewed responses
  • Missing context leading to incorrect assumptions
  • Vague requirements amplifying interpretation errors
  • Prompt engineering weaknesses degrading output quality

Current State Problems

  1. No Certainty Scoring - Agent outputs lack confidence indicators
  2. Evidence Not Required - Claims made without source validation
  3. Gaps Not Documented - Missing information silently ignored
  4. Overconfidence - Agents assert without acknowledging uncertainty
  5. No Inference Transparency - Speculative conclusions lack reasoning traces
  6. No Input Quality Assessment - Prompt ambiguity not measured

Business Impact

| Problem | Impact | Severity |
|---|---|---|
| Hallucinations accepted as fact | Incorrect decisions | CRITICAL |
| Overconfident recommendations | Misallocated resources | HIGH |
| Hidden information gaps | Incomplete solutions | HIGH |
| Ambiguous requirements | Implementation rework | MEDIUM |
| Uncalibrated confidence | Eroded trust | MEDIUM |

Decision Drivers

  1. Research Validation - 2024-2025 papers provide proven methodologies
  2. Enterprise Requirements - High-stakes decisions require explicit uncertainty
  3. Audit Trails - Compliance needs evidence documentation
  4. Trust Building - Calibrated confidence improves user trust
  5. Quality Improvement - Identifying gaps enables targeted research

Considered Options

Option A: No Uncertainty Management

  • Continue with current implicit confidence
  • Rejected: Does not address hallucination/overconfidence problems

Option B: Simple Confidence Scores

  • Add single 0-100 confidence number to outputs
  • Rejected: Oversimplified; doesn't explain basis or enable calibration

Option C: Comprehensive UQ Framework (Selected)

  • Multi-method certainty calculation
  • Evidence validation requirements
  • Logical inference chains for speculative claims
  • Calibrated verbal uncertainty markers
  • Selected: Addresses all uncertainty dimensions with research backing

Option D: External Fact-Checking Service

  • Integrate third-party verification
  • Rejected: Latency, cost, and API dependency concerns

Decision

Implement Option C: Comprehensive Uncertainty Quantification Framework with these components:

1. Certainty Scoring System

Composite Score Formula:

certainty_score = (
    evidence_support     * 0.40 +  # 40% weight on source quality
    source_reliability   * 0.25 +  # 25% weight on source credibility
    internal_consistency * 0.20 +  # 20% weight on agent agreement
    recency              * 0.15    # 15% weight on information freshness
)
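As a concrete illustration, the composite formula can be written as a small Python helper. The function name and the [0, 1] range check are assumptions of this sketch, not part of the ADR:

```python
def certainty_score(evidence_support: float,
                    source_reliability: float,
                    internal_consistency: float,
                    recency: float) -> float:
    """Weighted composite certainty score in [0, 1], per the formula above."""
    components = (evidence_support, source_reliability,
                  internal_consistency, recency)
    if any(not 0.0 <= c <= 1.0 for c in components):
        raise ValueError("all component scores must be in [0, 1]")
    return (evidence_support * 0.40        # source quality
            + source_reliability * 0.25    # source credibility
            + internal_consistency * 0.20  # agent agreement
            + recency * 0.15)              # information freshness
```

Because the weights sum to 1.0, the result stays in [0, 1] whenever the components do.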

Certainty Levels:

| Score Range | Level | Description |
|---|---|---|
| 85-100% | HIGH | Strong evidence, reliable sources, consensus |
| 60-84% | MEDIUM | Good evidence, some gaps or disagreement |
| 30-59% | LOW | Limited evidence, significant uncertainty |
| 0-29% | INFERRED | No direct evidence, logical inference only |
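A minimal mapping from a 0-100 score to the level names might look like this, with thresholds taken directly from the table (the function name is illustrative):

```python
def certainty_level(score: float) -> str:
    """Map a 0-100 composite certainty score to its level name."""
    if score >= 85:
        return "HIGH"
    if score >= 60:
        return "MEDIUM"
    if score >= 30:
        return "LOW"
    return "INFERRED"
```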

2. Evidence Validation Protocol

Required Fields for Every Claim:

{
  "claim": "Statement being made",
  "certainty_factor": 0.85,
  "certainty_basis": "evidence_backed",
  "evidence": [
    {
      "url": "https://source.example.com",
      "title": "Source Title",
      "venue": "Publication",
      "year": 2024,
      "evidence_strength": "strong",
      "summary": "How this supports the claim"
    }
  ],
  "missing_information": ["What would increase certainty"]
}
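A lightweight validator for this record shape could look like the following sketch. The field names mirror the JSON example above; the specific checks and the list-of-problems return convention are assumptions:

```python
REQUIRED_CLAIM_FIELDS = {"claim", "certainty_factor", "certainty_basis",
                         "evidence", "missing_information"}
REQUIRED_EVIDENCE_FIELDS = {"url", "title", "venue", "year",
                            "evidence_strength", "summary"}

def validate_claim(record: dict) -> list[str]:
    """Return a list of validation problems; an empty list means valid."""
    problems = []
    missing = REQUIRED_CLAIM_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    factor = record.get("certainty_factor")
    if not isinstance(factor, (int, float)) or not 0.0 <= factor <= 1.0:
        problems.append("certainty_factor must be a number in [0, 1]")
    for i, source in enumerate(record.get("evidence", [])):
        gap = REQUIRED_EVIDENCE_FIELDS - source.keys()
        if gap:
            problems.append(f"evidence[{i}] missing: {sorted(gap)}")
    return problems
```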

3. Logical Inference Protocol

When evidence is insufficient:

## Inferred Conclusion: [Statement]

**Inference Type:** Deduction | Induction | Abduction
**Certainty:** [X%] (INFERRED)

### Reasoning Chain
1. Premise: [Statement] - Evidence: [Source] - Certainty: [X%]
2. Premise: [Statement] - Evidence: [Source] - Certainty: [X%]
3. Therefore: [Conclusion]

### Decision Tree
IF [condition A] AND [condition B]
THEN [conclusion C] WITH certainty Z%
ELSE [alternative conclusion]

### Assumptions (if false, conclusion invalid)
- [Assumption 1]
- [Assumption 2]

### Falsification Criteria
- [Evidence that would disprove this]

4. Uncertainty Methods (Research-Backed)

Semantic Entropy (Kuhn et al., ICLR 2023):

  • Generate multiple responses with temperature sampling
  • Cluster by semantic similarity
  • High entropy = high uncertainty
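The steps above can be sketched as follows. A production implementation clusters responses with a bidirectional-entailment (NLI) model; this sketch substitutes exact-match clustering on normalized text, which is only a stand-in:

```python
import math
from collections import Counter

def semantic_entropy(responses: list[str]) -> float:
    """Entropy (bits) over semantic clusters of sampled responses.

    Stand-in clustering: normalized exact match. Real systems cluster
    pairs of responses by bidirectional entailment instead.
    """
    clusters = Counter(r.strip().lower() for r in responses)
    n = len(responses)
    return -sum((c / n) * math.log2(c / n) for c in clusters.values())
```

Identical answers collapse to one cluster (entropy 0); an even split across k clusters yields log2(k) bits, i.e. maximal uncertainty.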

Self-Consistency (Wang et al., 2023):

  • Sample multiple chain-of-thought paths
  • Majority vote on conclusions
  • Confidence = proportion agreeing
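The majority-vote confidence can be computed directly from the sampled final answers; this sketch assumes the answers have already been extracted from each chain-of-thought path as strings:

```python
from collections import Counter

def self_consistency_confidence(final_answers: list[str]) -> tuple[str, float]:
    """Return the majority answer and the proportion of paths agreeing."""
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    answer, votes = Counter(final_answers).most_common(1)[0]
    return answer, votes / len(final_answers)
```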

Verbal Uncertainty Calibration (VOCAL, 2024):

  • Map linguistic markers to probability ranges
  • "Certain" = 95-100%, "Possibly" = 35-65%, etc.
  • Detect and correct miscalibration
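One way to express the marker-to-probability mapping is a lookup table. Only the ranges for "certain" and "possibly" come from the text above; the other markers are illustrative placeholders:

```python
# "certain" and "possibly" ranges are from the text; the rest are assumptions.
VERBAL_RANGES = {
    "certain":  (0.95, 1.00),
    "likely":   (0.65, 0.85),   # assumed range
    "possibly": (0.35, 0.65),
    "unlikely": (0.10, 0.35),   # assumed range
}

def is_calibrated(marker: str, stated_probability: float) -> bool:
    """True if the stated probability falls inside the marker's range."""
    low, high = VERBAL_RANGES[marker]
    return low <= stated_probability <= high
```

A miscalibration detector would flag any (marker, probability) pair for which this returns False and suggest the marker whose range does contain the probability.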

Uncertainty of Thoughts (UoT, ICLR 2024):

  • Explicit uncertainty modeling in queries
  • Information-seeking behavior when uncertain
  • Prefer clarification over guessing

5. Quality Gates

| Condition | Action |
|---|---|
| Claim without evidence | Mark INFERRED or REJECT |
| Source >2 years old | Flag recency concern |
| Single-source claim | Mark LOW certainty |
| Contradictory sources | Require reconciliation |
| Agent disagreement >1.5σ | Investigate conflict |
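The first three gate conditions can be applied mechanically to a claim record shaped like the evidence schema above. The field names and flag messages in this sketch are assumptions:

```python
def quality_gate_flags(claim: dict, current_year: int = 2025) -> list[str]:
    """Apply the evidence-presence, source-count, and recency gates."""
    flags = []
    evidence = claim.get("evidence", [])
    if not evidence:
        flags.append("claim without evidence: mark INFERRED or reject")
    elif len(evidence) == 1:
        flags.append("single-source claim: mark LOW certainty")
    for source in evidence:
        year = source.get("year", current_year)
        if current_year - year > 2:
            flags.append(f"source from {year} is >2 years old: flag recency")
    return flags
```

The contradiction and agent-disagreement gates need cross-claim and cross-agent context, so they would live at the orchestrator level rather than in a per-claim check.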

Consequences

Positive

  • Reduced Hallucinations - Evidence requirements catch unsupported claims
  • Calibrated Trust - Users know when to verify vs. accept
  • Transparent Reasoning - Inference chains expose logic
  • Improved Quality - Gap documentation enables targeted improvement
  • Audit Compliance - Full evidence trails for decisions

Negative

  • Increased Token Usage - ~500-1000 tokens per uncertainty analysis
  • Slower Responses - Multi-sample methods add latency
  • Implementation Complexity - Multiple UQ methods to maintain
  • Training Required - Users must understand certainty levels

Neutral

  • Shifts burden from implicit to explicit uncertainty
  • Changes agent prompt engineering requirements

Implementation

Phase 1: Core Infrastructure (Week 1-2)

  • Implement uncertainty-quantification skill
  • Create certainty scoring functions
  • Define evidence validation schema
  • Build logical inference templates

Phase 2: Agent Integration (Week 3-4)

  • Deploy uncertainty-orchestrator agent
  • Integrate with existing analysts
  • Add certainty requirements to prompts
  • Implement quality gate checks

Phase 3: Command Layer (Week 5)

  • Create /moe-analyze command
  • Create /moe-judge command
  • Build output formatting
  • Add CLI options

Phase 4: Validation (Week 6)

  • Test against known-answer datasets
  • Calibrate confidence thresholds
  • Gather user feedback
  • Document edge cases

Validation Criteria

| Metric | Target | Measurement |
|---|---|---|
| Hallucination Detection | >85% | Known-false claims flagged |
| Confidence Calibration | <0.1 ECE | Expected Calibration Error |
| User Trust Improvement | +20% | Survey before/after |
| Inference Transparency | 100% | All INFERRED claims have chains |
| Evidence Coverage | >90% | Claims with valid sources |
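Expected Calibration Error (the <0.1 target above) is conventionally computed by binning predictions by confidence and averaging the gap between mean confidence and accuracy per bin, weighted by bin size. This is a standard sketch, not the framework's own code:

```python
def expected_calibration_error(confidences: list[float],
                               correct: list[bool],
                               n_bins: int = 10) -> float:
    """Binned ECE: size-weighted mean |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - mean_conf)
    return ece
```

A perfectly calibrated system (claims made at 80% confidence are right 80% of the time) scores 0; the <0.1 target allows a 10-point average gap.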

References

Primary Research (2024-2025)

  1. Semantic Entropy for Uncertainty Quantification: arXiv:2302.09664 (Kuhn et al., ICLR 2023)

  2. Self-Consistency with Chain-of-Thought: arXiv:2203.11171 (Wang et al., 2023)

  3. Uncertainty of Thoughts (UoT): arXiv:2402.03271 (ICLR 2024)

  4. VOCAL: Verbal Uncertainty Calibration: OpenReview 2024 (conference paper)

  5. LLM-Rubric: Calibrated Evaluation: arXiv:2501.00274 (ACL 2024)

CODITECT Implementation

  • skills/uncertainty-quantification/SKILL.md
  • agents/uncertainty-orchestrator.md
  • commands/moe-analyze.md
  • commands/moe-judge.md

Document Version: 1.0.0 | Last Updated: 2025-12-19 | Author: CODITECT Research Team | Status: APPROVED