ADR-011: Uncertainty Quantification Framework for LLM Interactions
Document: ADR-011-uncertainty-quantification-framework
Version: 1.0.0
Purpose: Document architectural decisions for measuring, expressing, and managing uncertainty in LLM-based multi-agent workflows
Audience: Framework contributors, developers, AI agents, researchers
Date Created: 2025-12-19
Status: APPROVED
Related ADRs:
- ADR-010-autonomous-orchestration-system (orchestration patterns)
- ADR-012-moe-analysis-framework (analysis implementation)
- ADR-013-moe-judges-framework (evaluation implementation)
Related Documents:
- skills/uncertainty-quantification/SKILL.md
- agents/uncertainty-orchestrator.md
- commands/moe-analyze.md
- commands/moe-judge.md
- docs/09-research-analysis/UNCERTAINTY-QUANTIFICATION-MOE-FRAMEWORK.md
Research Foundation:
- docs/09-research-analysis/ACADEMIC-RESEARCH-REFERENCES-UQ-MOE-2024-2025.md (Canonical source catalog)
Research References:
- Semantic Entropy: https://arxiv.org/abs/2302.09664 (ICLR 2023)
- Self-Consistency: https://arxiv.org/abs/2203.11171 (Wang et al. 2023)
- Uncertainty of Thoughts: https://arxiv.org/abs/2402.03271 (ICLR 2024)
- VOCAL: OpenReview 2024
- LLM-Rubric: https://arxiv.org/abs/2501.00274 (ACL 2024)
Context and Problem Statement
The Dual Uncertainty Problem
LLM-based systems face uncertainty from two directions:
Outbound Uncertainty (LLM → User):
- Hallucinations and factual errors in generated content
- Overconfident assertions without supporting evidence
- Unclear knowledge boundaries leading to unreliable outputs
- Miscalibrated confidence expressions
Inbound Uncertainty (User → LLM):
- Ambiguous prompts causing skewed responses
- Missing context leading to incorrect assumptions
- Vague requirements amplifying interpretation errors
- Prompt engineering weaknesses degrading output quality
Current State Problems
- No Certainty Scoring - Agent outputs lack confidence indicators
- Evidence Not Required - Claims made without source validation
- Gaps Not Documented - Missing information silently ignored
- Overconfidence - Agents assert without acknowledging uncertainty
- No Inference Transparency - Speculative conclusions lack reasoning traces
- No Input Quality Assessment - Prompt ambiguity not measured
Business Impact
| Problem | Impact | Severity |
|---|---|---|
| Hallucinations accepted as fact | Incorrect decisions | CRITICAL |
| Overconfident recommendations | Misallocated resources | HIGH |
| Hidden information gaps | Incomplete solutions | HIGH |
| Ambiguous requirements | Implementation rework | MEDIUM |
| Uncalibrated confidence | Eroded trust | MEDIUM |
Decision Drivers
- Research Validation - 2024-2025 papers provide proven methodologies
- Enterprise Requirements - High-stakes decisions require explicit uncertainty
- Audit Trails - Compliance needs evidence documentation
- Trust Building - Calibrated confidence improves user trust
- Quality Improvement - Identifying gaps enables targeted research
Considered Options
Option A: No Uncertainty Management
- Continue with current implicit confidence
- Rejected: Does not address hallucination/overconfidence problems
Option B: Simple Confidence Scores
- Add single 0-100 confidence number to outputs
- Rejected: Oversimplified; doesn't explain basis or enable calibration
Option C: Comprehensive UQ Framework (Selected)
- Multi-method certainty calculation
- Evidence validation requirements
- Logical inference chains for speculative claims
- Calibrated verbal uncertainty markers
- Selected: Addresses all uncertainty dimensions with research backing
Option D: External Fact-Checking Service
- Integrate third-party verification
- Rejected: Latency, cost, and API dependency concerns
Decision
Implement Option C: Comprehensive Uncertainty Quantification Framework with these components:
1. Certainty Scoring System
Composite Score Formula:
```python
certainty_score = (
    evidence_support     * 0.40 +  # 40% weight on evidence strength
    source_reliability   * 0.25 +  # 25% weight on source credibility
    internal_consistency * 0.20 +  # 20% weight on agent agreement
    recency              * 0.15    # 15% weight on information freshness
)
```
Certainty Levels:
| Score Range | Level | Description |
|---|---|---|
| 85-100% | HIGH | Strong evidence, reliable sources, consensus |
| 60-84% | MEDIUM | Good evidence, some gaps or disagreement |
| 30-59% | LOW | Limited evidence, significant uncertainty |
| 0-29% | INFERRED | No direct evidence, logical inference only |
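The formula and level table above can be sketched together as follows; the weights and thresholds come from this ADR, while the function and variable names are illustrative, not part of the specification.

```python
def certainty_score(evidence_support, source_reliability,
                    internal_consistency, recency):
    """Composite score; each factor is a value in [0, 1]."""
    return (evidence_support * 0.40
            + source_reliability * 0.25
            + internal_consistency * 0.20
            + recency * 0.15)

def certainty_level(score):
    """Map a [0, 1] score onto the ADR's four certainty levels."""
    pct = score * 100
    if pct >= 85:
        return "HIGH"
    if pct >= 60:
        return "MEDIUM"
    if pct >= 30:
        return "LOW"
    return "INFERRED"
```

Note that a claim can only reach HIGH when the heavily weighted evidence factors are strong; maximal recency alone contributes at most 15 points.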
2. Evidence Validation Protocol
Required Fields for Every Claim:
```json
{
  "claim": "Statement being made",
  "certainty_factor": 0.85,
  "certainty_basis": "evidence_backed",
  "evidence": [
    {
      "url": "https://source.example.com",
      "title": "Source Title",
      "venue": "Publication",
      "year": 2024,
      "evidence_strength": "strong",
      "summary": "How this supports the claim"
    }
  ],
  "missing_information": ["What would increase certainty"]
}
```
3. Logical Inference Protocol
When evidence is insufficient:
```markdown
## Inferred Conclusion: [Statement]
**Inference Type:** Deduction | Induction | Abduction
**Certainty:** [X%] (INFERRED)

### Reasoning Chain
1. Premise: [Statement] - Evidence: [Source] - Certainty: [X%]
2. Premise: [Statement] - Evidence: [Source] - Certainty: [X%]
3. Therefore: [Conclusion]

### Decision Tree
IF [condition A] AND [condition B]
THEN [conclusion C] WITH certainty Z%
ELSE [alternative conclusion]

### Assumptions (if false, conclusion invalid)
- [Assumption 1]
- [Assumption 2]

### Falsification Criteria
- [Evidence that would disprove this]
```
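The ADR does not fix a rule for deriving a conclusion's certainty from its premises. Two common conservative choices are sketched below; both are assumptions for illustration, not part of the specification.

```python
def chain_certainty_product(premises):
    """Product rule: treats premises as independent requirements,
    so certainty decays with every additional premise."""
    c = 1.0
    for p in premises:
        c *= p
    return c

def chain_certainty_min(premises):
    """Weakest-link rule: the chain is only as strong as its
    least certain premise."""
    return min(premises)
```

For example, two premises at 80% and 50% yield 40% under the product rule and 50% under the weakest-link rule; either way the result lands in the INFERRED/LOW bands, which matches the intent of flagging speculative conclusions.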
4. Uncertainty Methods (Research-Backed)
Semantic Entropy (Kuhn et al., ICLR 2023):
- Generate multiple responses with temperature sampling
- Cluster by semantic similarity
- High entropy = high uncertainty
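The clustering-and-entropy step can be sketched as below. The `same_meaning` predicate is a placeholder: the original method uses an NLI model to test bidirectional entailment between responses, which is out of scope here.

```python
import math
from collections import Counter

def semantic_entropy(responses, same_meaning):
    """Greedily cluster responses by meaning, then compute the
    entropy of the cluster distribution (natural log)."""
    clusters = []  # one representative response per cluster
    labels = []
    for r in responses:
        for i, rep in enumerate(clusters):
            if same_meaning(r, rep):
                labels.append(i)
                break
        else:
            clusters.append(r)
            labels.append(len(clusters) - 1)
    counts = Counter(labels)
    n = len(responses)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

With exact string equality as a stand-in predicate, identical samples give entropy 0 (no uncertainty), while an even split across two meanings gives ln 2.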
Self-Consistency (Wang et al., 2023):
- Sample multiple chain-of-thought paths
- Majority vote on conclusions
- Confidence = proportion agreeing
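The majority-vote step reduces to a few lines; this sketch assumes the sampled chain-of-thought paths have already been reduced to comparable final answers.

```python
from collections import Counter

def self_consistency(answers):
    """Return the majority answer and the agreement ratio,
    used as the confidence score."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)
```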
Verbal Uncertainty Calibration (VOCAL, 2024):
- Map linguistic markers to probability ranges
- "Certain" = 95-100%, "Possibly" = 35-65%, etc.
- Detect and correct miscalibration
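A marker-to-probability table and a calibration check might look like this. Only the "certain" and "possibly" ranges come from the text above; the other entries are plausible placeholders, not values from VOCAL.

```python
# Verbal marker -> (low, high) probability range. Entries other than
# "certain" and "possibly" are illustrative assumptions.
MARKER_RANGES = {
    "certain":  (0.95, 1.00),
    "likely":   (0.65, 0.85),
    "possibly": (0.35, 0.65),
    "unlikely": (0.10, 0.35),
}

def is_calibrated(marker, observed_accuracy):
    """True if the empirically observed accuracy of claims tagged
    with this marker falls inside the marker's stated range."""
    lo, hi = MARKER_RANGES[marker]
    return lo <= observed_accuracy <= hi
```

Miscalibration is then detected by comparison: if claims tagged "certain" are only 70% accurate, the marker is overconfident and should be downgraded.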
Uncertainty of Thoughts (UoT, ICLR 2024):
- Explicit uncertainty modeling in queries
- Information-seeking behavior when uncertain
- Prefer clarification over guessing
5. Quality Gates
| Condition | Action |
|---|---|
| Claim without evidence | Mark INFERRED or REJECT |
| Source >2 years old | Flag recency concern |
| Single-source claim | Mark LOW certainty |
| Contradictory sources | Require reconciliation |
| Agent disagreement >1.5σ | Investigate conflict |
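The gate table above can be mirrored as a hypothetical check over the section-2 claim record; the function name and the `agent_scores_sigma` parameter are illustrative. Reconciliation of contradictory sources is omitted because it requires semantic comparison.

```python
def quality_gate_actions(claim, agent_scores_sigma=0.0, current_year=2025):
    """Apply the ADR's quality gates to one claim record and
    return the list of required actions (empty = all gates pass)."""
    actions = []
    evidence = claim.get("evidence", [])
    if not evidence:
        actions.append("mark INFERRED or REJECT")
    elif len(evidence) == 1:
        actions.append("mark LOW certainty")
    if any(current_year - e.get("year", current_year) > 2 for e in evidence):
        actions.append("flag recency concern")
    if agent_scores_sigma > 1.5:
        actions.append("investigate agent conflict")
    return actions
```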
Consequences
Positive
- Reduced Hallucinations - Evidence requirements catch unsupported claims
- Calibrated Trust - Users know when to verify vs. accept
- Transparent Reasoning - Inference chains expose logic
- Improved Quality - Gap documentation enables targeted improvement
- Audit Compliance - Full evidence trails for decisions
Negative
- Increased Token Usage - ~500-1000 tokens per uncertainty analysis
- Slower Responses - Multi-sample methods add latency
- Implementation Complexity - Multiple UQ methods to maintain
- Training Required - Users must understand certainty levels
Neutral
- Shifts burden from implicit to explicit uncertainty
- Changes agent prompt engineering requirements
Implementation
Phase 1: Core Infrastructure (Week 1-2)
- Implement `uncertainty-quantification` skill
- Create certainty scoring functions
- Define evidence validation schema
- Build logical inference templates
Phase 2: Agent Integration (Week 3-4)
- Deploy `uncertainty-orchestrator` agent
- Integrate with existing analysts
- Add certainty requirements to prompts
- Implement quality gate checks
Phase 3: Command Layer (Week 5)
- Create `/moe-analyze` command
- Create `/moe-judge` command
- Build output formatting
- Add CLI options
Phase 4: Validation (Week 6)
- Test against known-answer datasets
- Calibrate confidence thresholds
- Gather user feedback
- Document edge cases
Validation Criteria
| Metric | Target | Measurement |
|---|---|---|
| Hallucination Detection | >85% | Known-false claims flagged |
| Confidence Calibration | <0.1 ECE | Expected Calibration Error |
| User Trust Improvement | +20% | Survey before/after |
| Inference Transparency | 100% | All INFERRED claims have chains |
| Evidence Coverage | >90% | Claims with valid sources |
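The <0.1 ECE target can be measured with a standard binned estimator: group predictions into equal-width confidence bins and take the weighted mean gap between confidence and accuracy. This is a minimal sketch, not the framework's implementation.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece
```

A system that claims 90% confidence but is right only half the time scores an ECE of 0.4, well above the 0.1 target.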
References
Primary Research (2024-2025)
- Semantic Entropy for Uncertainty Quantification
  - URL: https://arxiv.org/abs/2302.09664
  - Kuhn et al., ICLR 2023
  - Reliability: Peer-reviewed, highly cited
- Self-Consistency with Chain-of-Thought
  - URL: https://arxiv.org/abs/2203.11171
  - Wang et al., 2023
  - Reliability: Peer-reviewed, foundational work
- Uncertainty of Thoughts (UoT)
  - URL: https://arxiv.org/abs/2402.03271
  - ICLR 2024
  - Reliability: Peer-reviewed, novel approach
- VOCAL: Verbal Uncertainty Calibration
  - OpenReview 2024
  - Reliability: Conference paper
- LLM-Rubric: Calibrated Evaluation
  - URL: https://arxiv.org/abs/2501.00274
  - ACL 2024
  - Reliability: Peer-reviewed
CODITECT Implementation
- skills/uncertainty-quantification/SKILL.md
- agents/uncertainty-orchestrator.md
- commands/moe-analyze.md
- commands/moe-judge.md
Document Version: 1.0.0 | Last Updated: 2025-12-19 | Author: CODITECT Research Team | Status: APPROVED