ADR-011: Uncertainty Quantification Framework for LLM Interactions
Document: ADR-011-uncertainty-quantification-framework
Version: 1.0.0
Purpose: Document architectural decisions for measuring, expressing, and managing uncertainty in LLM-based multi-agent workflows
Audience: Framework contributors, developers, AI agents, researchers
Date Created: 2025-12-19
Status: APPROVED
Related ADRs:
- ADR-010-autonomous-orchestration-system (orchestration patterns)
- ADR-012-moe-analysis-framework (analysis implementation)
- ADR-013-moe-judges-framework (evaluation implementation)
Related Documents:
- skills/uncertainty-quantification/SKILL.md
- agents/uncertainty-orchestrator.md
- commands/moe-analyze.md
- commands/moe-judge.md
- docs/09-research-analysis/UNCERTAINTY-QUANTIFICATION-MOE-FRAMEWORK.md
Research Foundation:
- docs/09-research-analysis/ACADEMIC-RESEARCH-REFERENCES-UQ-MOE-2024-2025.md (Canonical source catalog)
Research References:
- Semantic Entropy: https://arxiv.org/abs/2302.09664 (ICLR 2023)
- Self-Consistency: https://arxiv.org/abs/2203.11171 (Wang et al. 2023)
- Uncertainty of Thoughts: https://arxiv.org/abs/2402.03271 (ICLR 2024)
- VOCAL: OpenReview 2024
- LLM-Rubric: https://arxiv.org/abs/2501.00274 (ACL 2024)
Context and Problem Statement
The Dual Uncertainty Problem
LLM-based systems face uncertainty from two directions:
Outbound Uncertainty (LLM → User):
- Hallucinations and factual errors in generated content
- Overconfident assertions without supporting evidence
- Unclear knowledge boundaries leading to unreliable outputs
- Miscalibrated confidence expressions
Inbound Uncertainty (User → LLM):
- Ambiguous prompts causing skewed responses
- Missing context leading to incorrect assumptions
- Vague requirements amplifying interpretation errors
- Prompt engineering weaknesses degrading output quality
Current State Problems
- No Certainty Scoring - Agent outputs lack confidence indicators
- Evidence Not Required - Claims made without source validation
- Gaps Not Documented - Missing information silently ignored
- Overconfidence - Agents assert without acknowledging uncertainty
- No Inference Transparency - Speculative conclusions lack reasoning traces
- No Input Quality Assessment - Prompt ambiguity not measured
Business Impact
| Problem | Impact | Severity |
|---|---|---|
| Hallucinations accepted as fact | Incorrect decisions | CRITICAL |
| Overconfident recommendations | Misallocated resources | HIGH |
| Hidden information gaps | Incomplete solutions | HIGH |
| Ambiguous requirements | Implementation rework | MEDIUM |
| Uncalibrated confidence | Eroded trust | MEDIUM |
Decision Drivers
- Research Validation - 2024-2025 papers provide proven methodologies
- Enterprise Requirements - High-stakes decisions require explicit uncertainty
- Audit Trails - Compliance needs evidence documentation
- Trust Building - Calibrated confidence improves user trust
- Quality Improvement - Identifying gaps enables targeted research
Considered Options
Option A: No Uncertainty Management
- Continue with current implicit confidence
- Rejected: Does not address hallucination/overconfidence problems
Option B: Simple Confidence Scores
- Add single 0-100 confidence number to outputs
- Rejected: Oversimplified; doesn't explain basis or enable calibration
Option C: Comprehensive UQ Framework (Selected)
- Multi-method certainty calculation
- Evidence validation requirements
- Logical inference chains for speculative claims
- Calibrated verbal uncertainty markers
- Selected: Addresses all uncertainty dimensions with research backing
Option D: External Fact-Checking Service
- Integrate third-party verification
- Rejected: Latency, cost, and API dependency concerns
Decision
Implement Option C: Comprehensive Uncertainty Quantification Framework with these components:
1. Certainty Scoring System
Composite Score Formula:
```python
certainty_score = (
    evidence_support     * 0.40 +  # 40% weight on evidence strength
    source_reliability   * 0.25 +  # 25% weight on source credibility
    internal_consistency * 0.20 +  # 20% weight on agent agreement
    recency              * 0.15    # 15% weight on information freshness
)
```
Certainty Levels:
| Score Range | Level | Description |
|---|---|---|
| 85-100% | HIGH | Strong evidence, reliable sources, consensus |
| 60-84% | MEDIUM | Good evidence, some gaps or disagreement |
| 30-59% | LOW | Limited evidence, significant uncertainty |
| 0-29% | INFERRED | No direct evidence, logical inference only |
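The formula and level table above can be sketched together as follows; the weights and thresholds come from this ADR, while the function and variable names are illustrative, not part of the specification.

```python
def certainty_score(evidence_support, source_reliability,
                    internal_consistency, recency):
    """Composite score; each factor is a value in [0, 1]."""
    return (evidence_support * 0.40
            + source_reliability * 0.25
            + internal_consistency * 0.20
            + recency * 0.15)

def certainty_level(score):
    """Map a [0, 1] score onto the ADR's four certainty levels."""
    pct = score * 100
    if pct >= 85:
        return "HIGH"
    if pct >= 60:
        return "MEDIUM"
    if pct >= 30:
        return "LOW"
    return "INFERRED"
```

Note that a claim can only reach HIGH when the heavily weighted evidence factors are strong; maximal recency alone contributes at most 15 points.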
2. Evidence Validation Protocol
Required Fields for Every Claim:
```json
{
  "claim": "Statement being made",
  "certainty_factor": 0.85,
  "certainty_basis": "evidence_backed",
  "evidence": [
    {
      "url": "https://source.example.com",
      "title": "Source Title",
      "venue": "Publication",
      "year": 2024,
      "evidence_strength": "strong",
      "summary": "How this supports the claim"
    }
  ],
  "missing_information": ["What would increase certainty"]
}
```
3. Logical Inference Protocol
When evidence is insufficient:
```markdown
## Inferred Conclusion: [Statement]
**Inference Type:** Deduction | Induction | Abduction
**Certainty:** [X%] (INFERRED)

### Reasoning Chain
1. Premise: [Statement] - Evidence: [Source] - Certainty: [X%]
2. Premise: [Statement] - Evidence: [Source] - Certainty: [X%]
3. Therefore: [Conclusion]

### Decision Tree
IF [condition A] AND [condition B]
THEN [conclusion C] WITH certainty Z%
ELSE [alternative conclusion]

### Assumptions (if false, conclusion invalid)
- [Assumption 1]
- [Assumption 2]

### Falsification Criteria
- [Evidence that would disprove this]
```
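The ADR does not fix a rule for deriving a conclusion's certainty from its premises. Two common conservative choices are sketched below; both are assumptions for illustration, not part of the specification.

```python
def chain_certainty_product(premises):
    """Product rule: treats premises as independent requirements,
    so certainty decays with every additional premise."""
    c = 1.0
    for p in premises:
        c *= p
    return c

def chain_certainty_min(premises):
    """Weakest-link rule: the chain is only as strong as its
    least certain premise."""
    return min(premises)
```

For example, two premises at 80% and 50% yield 40% under the product rule and 50% under the weakest-link rule; either way the result lands in the INFERRED/LOW bands, which matches the intent of flagging speculative conclusions.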
4. Uncertainty Methods (Research-Backed)
Semantic Entropy (Kuhn et al., ICLR 2023):
- Generate multiple responses with temperature sampling
- Cluster by semantic similarity
- High entropy = high uncertainty
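The clustering-and-entropy step can be sketched as below. The `same_meaning` predicate is a placeholder: the original method uses an NLI model to test bidirectional entailment between responses, which is out of scope here.

```python
import math
from collections import Counter

def semantic_entropy(responses, same_meaning):
    """Greedily cluster responses by meaning, then compute the
    entropy of the cluster distribution (natural log)."""
    clusters = []  # one representative response per cluster
    labels = []
    for r in responses:
        for i, rep in enumerate(clusters):
            if same_meaning(r, rep):
                labels.append(i)
                break
        else:
            clusters.append(r)
            labels.append(len(clusters) - 1)
    counts = Counter(labels)
    n = len(responses)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

With exact string equality as a stand-in predicate, identical samples give entropy 0 (no uncertainty), while an even split across two meanings gives ln 2.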
Self-Consistency (Wang et al., 2023):
- Sample multiple chain-of-thought paths
- Majority vote on conclusions
- Confidence = proportion agreeing
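The majority-vote step reduces to a few lines; this sketch assumes the sampled chain-of-thought paths have already been reduced to comparable final answers.

```python
from collections import Counter

def self_consistency(answers):
    """Return the majority answer and the agreement ratio,
    used as the confidence score."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)
```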
Verbal Uncertainty Calibration (VOCAL, 2024):
- Map linguistic markers to probability ranges
- "Certain" = 95-100%, "Possibly" = 35-65%, etc.
- Detect and correct miscalibration
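A marker-to-probability table and a calibration check might look like this. Only the "certain" and "possibly" ranges come from the text above; the other entries are plausible placeholders, not values from VOCAL.

```python
# Verbal marker -> (low, high) probability range. Entries other than
# "certain" and "possibly" are illustrative assumptions.
MARKER_RANGES = {
    "certain":  (0.95, 1.00),
    "likely":   (0.65, 0.85),
    "possibly": (0.35, 0.65),
    "unlikely": (0.10, 0.35),
}

def is_calibrated(marker, observed_accuracy):
    """True if the empirically observed accuracy of claims tagged
    with this marker falls inside the marker's stated range."""
    lo, hi = MARKER_RANGES[marker]
    return lo <= observed_accuracy <= hi
```

Miscalibration is then detected by comparison: if claims tagged "certain" are only 70% accurate, the marker is overconfident and should be downgraded.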
Uncertainty of Thoughts (UoT, ICLR 2024):
- Explicit uncertainty modeling in queries
- Information-seeking behavior when uncertain
- Prefer clarification over guessing
5. Quality Gates
| Condition | Action |
|---|---|
| Claim without evidence | Mark INFERRED or REJECT |
| Source >2 years old | Flag recency concern |
| Single-source claim | Mark LOW certainty |
| Contradictory sources | Require reconciliation |
| Agent disagreement >1.5σ | Investigate conflict |
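The gate table above can be mirrored as a hypothetical check over the section-2 claim record; the function name and the `agent_scores_sigma` parameter are illustrative. Reconciliation of contradictory sources is omitted because it requires semantic comparison.

```python
def quality_gate_actions(claim, agent_scores_sigma=0.0, current_year=2025):
    """Apply the ADR's quality gates to one claim record and
    return the list of required actions (empty = all gates pass)."""
    actions = []
    evidence = claim.get("evidence", [])
    if not evidence:
        actions.append("mark INFERRED or REJECT")
    elif len(evidence) == 1:
        actions.append("mark LOW certainty")
    if any(current_year - e.get("year", current_year) > 2 for e in evidence):
        actions.append("flag recency concern")
    if agent_scores_sigma > 1.5:
        actions.append("investigate agent conflict")
    return actions
```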
Consequences
Positive
- Reduced Hallucinations - Evidence requirements catch unsupported claims
- Calibrated Trust - Users know when to verify vs. accept
- Transparent Reasoning - Inference chains expose logic
- Improved Quality - Gap documentation enables targeted improvement
- Audit Compliance - Full evidence trails for decisions
Negative
- Increased Token Usage - ~500-1000 tokens per uncertainty analysis
- Slower Responses - Multi-sample methods add latency
- Implementation Complexity - Multiple UQ methods to maintain
- Training Required - Users must understand certainty levels
Neutral
- Shifts burden from implicit to explicit uncertainty
- Changes agent prompt engineering requirements
Implementation
Phase 1: Core Infrastructure (Week 1-2)
- Implement `uncertainty-quantification` skill
- Create certainty scoring functions
- Define evidence validation schema
- Build logical inference templates
Phase 2: Agent Integration (Week 3-4)
- Deploy `uncertainty-orchestrator` agent
- Integrate with existing analysts
- Add certainty requirements to prompts
- Implement quality gate checks
Phase 3: Command Layer (Week 5)
- Create `/moe-analyze` command
- Create `/moe-judge` command
- Build output formatting
- Add CLI options
Phase 4: Validation (Week 6)
- Test against known-answer datasets
- Calibrate confidence thresholds
- Gather user feedback
- Document edge cases
Validation Criteria
| Metric | Target | Measurement |
|---|---|---|
| Hallucination Detection | >85% | Known-false claims flagged |
| Confidence Calibration | <0.1 ECE | Expected Calibration Error |
| User Trust Improvement | +20% | Survey before/after |
| Inference Transparency | 100% | All INFERRED claims have chains |
| Evidence Coverage | >90% | Claims with valid sources |
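The <0.1 ECE target can be measured with a standard binned estimator: group predictions into equal-width confidence bins and take the weighted mean gap between confidence and accuracy. This is a minimal sketch, not the framework's implementation.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece
```

A system that claims 90% confidence but is right only half the time scores an ECE of 0.4, well above the 0.1 target.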
References
Primary Research (2024-2025)
- Semantic Entropy for Uncertainty Quantification
  - URL: https://arxiv.org/abs/2302.09664
  - Kuhn et al., ICLR 2023
  - Reliability: Peer-reviewed, highly cited
- Self-Consistency with Chain-of-Thought
  - URL: https://arxiv.org/abs/2203.11171
  - Wang et al., 2023
  - Reliability: Peer-reviewed, foundational work
- Uncertainty of Thoughts (UoT)
  - URL: https://arxiv.org/abs/2402.03271
  - ICLR 2024
  - Reliability: Peer-reviewed, novel approach
- VOCAL: Verbal Uncertainty Calibration
  - OpenReview 2024
  - Reliability: Conference paper
- LLM-Rubric: Calibrated Evaluation
  - URL: https://arxiv.org/abs/2501.00274
  - ACL 2024
  - Reliability: Peer-reviewed
CODITECT Implementation
- skills/uncertainty-quantification/SKILL.md
- agents/uncertainty-orchestrator.md
- commands/moe-analyze.md
- commands/moe-judge.md
Document Version: 1.0.0 | Last Updated: 2025-12-19 | Author: CODITECT Research Team | Status: APPROVED