Uncertainty Quantification Skill
How to Use This Skill
- Review the patterns and examples below
- Apply the relevant patterns to your implementation
- Follow the best practices outlined in this skill
Expert skill for measuring, expressing, and managing uncertainty in LLM interactions. Implements research-backed methods for certainty scoring, evidence validation, and confidence calibration.
When to Use
Use this skill when:
- Calculating certainty scores for research findings
- Validating evidence quality and reliability
- Calibrating confidence in LLM outputs
- Detecting overclaimed assertions
- Implementing self-consistency checks
- Creating logical inference chains with explicit uncertainty
Don't use this skill when:
- Simple pass/fail checks (use basic validation)
- Aesthetic/subjective judgments (rubrics work better for subjective criteria)
- Single quick queries (overhead not justified)
Core Uncertainty Methods
1. Semantic Entropy
Source: Kuhn et al., ICLR 2023; extensions 2024. URL: https://arxiv.org/abs/2302.09664
Principle: Generate multiple responses, cluster by semantic meaning, calculate entropy across clusters.
```python
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class SemanticEntropyResult:
    certainty_score: float  # 0-100
    entropy: float
    num_clusters: int
    cluster_distribution: List[float]
    responses: List[str]

def calculate_semantic_entropy(
    prompt: str,
    model: "LLM",  # assumed model wrapper exposing .generate(prompt, temperature=...)
    samples: int = 5,
    similarity_threshold: float = 0.85
) -> SemanticEntropyResult:
    """Calculate uncertainty via semantic entropy across response samples."""
    # Generate multiple responses with temperature to induce variation
    responses = [
        model.generate(prompt, temperature=0.7)
        for _ in range(samples)
    ]
    # Embed responses for semantic comparison (embed_text is an assumed helper)
    embeddings = [embed_text(r) for r in responses]
    # Cluster by semantic similarity; hierarchical_cluster is an assumed helper
    # returning a list of clusters, each a list of member responses
    clusters = hierarchical_cluster(embeddings, threshold=similarity_threshold)
    # Calculate entropy over the cluster-size distribution
    probs = [len(c) / samples for c in clusters]
    entropy = -sum(p * np.log(p) for p in probs if p > 0)
    # Normalize to certainty score (0-100)
    max_entropy = np.log(samples)
    certainty = (1 - (entropy / max_entropy)) * 100 if max_entropy > 0 else 100
    return SemanticEntropyResult(
        certainty_score=certainty,
        entropy=entropy,
        num_clusters=len(clusters),
        cluster_distribution=probs,
        responses=responses
    )
```
Certainty Mapping:
| Semantic Entropy | Certainty Level | Interpretation |
|---|---|---|
| 0.0 - 0.3 | HIGH (90%+) | Strong agreement across samples |
| 0.3 - 0.7 | MEDIUM (60-90%) | Some variation in responses |
| 0.7 - 1.5 | LOW (30-60%) | Significant disagreement |
| > 1.5 | VERY LOW (<30%) | No consensus |
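The band boundaries above can be expressed as a small helper; `entropy_to_level` is a hypothetical name, and boundary values are assigned to the lower-uncertainty band:

```python
def entropy_to_level(entropy: float) -> str:
    """Map a natural-log semantic entropy to the table's certainty band."""
    if entropy <= 0.3:
        return "HIGH"       # strong agreement across samples
    if entropy <= 0.7:
        return "MEDIUM"     # some variation in responses
    if entropy <= 1.5:
        return "LOW"        # significant disagreement
    return "VERY LOW"       # no consensus

print(entropy_to_level(0.2))  # HIGH
print(entropy_to_level(1.8))  # VERY LOW
```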
2. Self-Consistency (CoT-SC)
Source: Wang et al., ICLR 2023; extensions 2024. URL: https://arxiv.org/abs/2203.11171
Principle: Sample multiple chain-of-thought reasoning paths, majority vote on conclusions.
```python
from collections import Counter
from dataclasses import dataclass
from typing import List

@dataclass
class SelfConsistencyResult:
    certainty_score: float
    majority_answer: str
    majority_count: int
    answer_distribution: dict
    reasoning_samples: List[str]

def self_consistency_uncertainty(
    prompt: str,
    model: "LLM",  # assumed model wrapper exposing .generate(prompt, temperature=...)
    samples: int = 10
) -> SelfConsistencyResult:
    """Calculate uncertainty via answer consistency across CoT samples."""
    cot_prompt = prompt + "\n\nThink step by step before giving your final answer."
    responses = [
        model.generate(cot_prompt, temperature=0.7)
        for _ in range(samples)
    ]
    # Extract final answers from each response (extract_final_answer is an assumed helper)
    final_answers = [extract_final_answer(r) for r in responses]
    answer_counts = Counter(final_answers)
    majority = answer_counts.most_common(1)[0]
    consistency = (majority[1] / samples) * 100
    return SelfConsistencyResult(
        certainty_score=consistency,
        majority_answer=majority[0],
        majority_count=majority[1],
        answer_distribution=dict(answer_counts),
        reasoning_samples=responses
    )
```
3. Verbal Uncertainty Detection (VOCAL)
Source: OpenReview 2024
Principle: Analyze linguistic uncertainty markers and calibrate them to probability ranges.
```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

import numpy as np

@dataclass
class VocalAnalysis:
    markers_found: List[Tuple[str, float]]
    implied_confidence: float
    calibration_needed: bool
    suggested_rewording: Optional[str] = None

VERBAL_CONFIDENCE_MAP = {
    "certain": (95, 100),
    "confident": (90, 100),
    "very likely": (85, 95),
    "highly probable": (85, 95),
    "likely": (70, 85),
    "probably": (65, 80),
    "possibly": (35, 65),
    "might": (30, 60),
    "uncertain": (15, 35),
    "unclear": (10, 30),
    "cannot determine": (0, 15),
    "unknown": (0, 10),
    "no information": (0, 5)
}

def detect_verbal_uncertainty(response: str) -> VocalAnalysis:
    """Analyze verbal uncertainty markers in response text."""
    found_markers = []
    response_lower = response.lower()
    for marker, (low, high) in VERBAL_CONFIDENCE_MAP.items():
        # Word-boundary match so "certain" does not fire inside "uncertain".
        # Note: a phrase like "very likely" still also matches "likely".
        if re.search(r"\b" + re.escape(marker) + r"\b", response_lower):
            midpoint = (low + high) / 2
            found_markers.append((marker, midpoint))
    if found_markers:
        avg_confidence = float(np.mean([m[1] for m in found_markers]))
    else:
        avg_confidence = 50  # Default neutral when no markers
    # Flag longer responses that carry no explicit uncertainty language
    calibration_needed = len(found_markers) == 0 and len(response) > 100
    return VocalAnalysis(
        markers_found=found_markers,
        implied_confidence=avg_confidence,
        calibration_needed=calibration_needed
    )
```
Verbal Marker Calibration Table:
| Marker | Target Probability | Usage Context |
|---|---|---|
| "I am certain" | 95-100% | Only for verified facts |
| "Very likely" | 85-95% | Strong evidence, high confidence |
| "Probably" | 65-80% | Good evidence, some uncertainty |
| "Possibly" | 35-65% | Limited evidence, unclear |
| "Uncertain" | 15-35% | Low evidence, significant gaps |
| "Cannot determine" | 0-15% | Insufficient information |
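Going the other direction — choosing a marker for a numeric confidence — can be sketched with a hypothetical `calibrated_phrase` helper (the marker list below is a subset of the table, and the fallback for gaps between ranges is an assumption, not part of the skill's API):

```python
# Hypothetical inverse mapping: pick the table's verbal marker whose
# target range contains a numeric confidence (percent).
MARKER_RANGES = [
    ("I am certain", 95, 100),
    ("very likely", 85, 95),
    ("probably", 65, 80),
    ("possibly", 35, 65),
    ("uncertain", 15, 35),
    ("cannot determine", 0, 15),
]

def calibrated_phrase(confidence: float) -> str:
    """Return a calibrated verbal marker for a confidence in [0, 100]."""
    for marker, low, high in MARKER_RANGES:
        if low <= confidence <= high:
            return marker
    # Gaps in the table (e.g. 80-85) fall back to the nearest lower band
    for marker, low, high in MARKER_RANGES:
        if confidence >= low:
            return marker
    return "cannot determine"

print(calibrated_phrase(90))  # very likely
```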
4. Composite Certainty Score
Combines multiple signals into a weighted certainty score:
```python
from dataclasses import dataclass

@dataclass
class CertaintyScore:
    overall: float               # 0-100
    evidence_support: float      # Weight: 40%
    source_reliability: float    # Weight: 25%
    internal_consistency: float  # Weight: 20%
    recency: float               # Weight: 15%
    level: str                   # HIGH/MEDIUM/LOW/INFERRED

    @classmethod
    def calculate(cls,
                  evidence_support: float,
                  source_reliability: float,
                  internal_consistency: float,
                  recency: float) -> 'CertaintyScore':
        """Calculate weighted composite certainty score."""
        overall = (
            evidence_support * 0.40 +
            source_reliability * 0.25 +
            internal_consistency * 0.20 +
            recency * 0.15
        )
        if overall >= 85:
            level = "HIGH"
        elif overall >= 60:
            level = "MEDIUM"
        elif overall >= 30:
            level = "LOW"
        else:
            level = "INFERRED"
        return cls(
            overall=overall,
            evidence_support=evidence_support,
            source_reliability=source_reliability,
            internal_consistency=internal_consistency,
            recency=recency,
            level=level
        )
```
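As a standalone sanity check of the weighting (this re-implements the formula rather than importing the class, so the example is self-contained):

```python
# The four component weights from the composite formula above
WEIGHTS = {
    "evidence_support": 0.40,
    "source_reliability": 0.25,
    "internal_consistency": 0.20,
    "recency": 0.15,
}

def composite_score(components: dict) -> float:
    """Weighted sum of the four 0-100 component scores."""
    return sum(components[k] * w for k, w in WEIGHTS.items())

# Example: strong evidence and consistency, but stale sources
score = composite_score({
    "evidence_support": 90,
    "source_reliability": 80,
    "internal_consistency": 85,
    "recency": 40,
})
print(score)  # 90*0.40 + 80*0.25 + 85*0.20 + 40*0.15 = 79.0 -> MEDIUM band
```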
5. Source Reliability Scoring
```python
from typing import Optional

SOURCE_RELIABILITY_SCORES = {
    "peer_reviewed": 95,
    "government": 90,
    "academic_institution": 85,
    "industry_leader": 80,
    "reputable_news": 70,
    "industry_blog": 60,
    "personal_blog": 40,
    "social_media": 25,
    "unknown": 20,
    "no_source": 0
}

def calculate_source_reliability(source_type: str, year: Optional[int] = None) -> float:
    """Score source reliability by type and recency."""
    base_score = SOURCE_RELIABILITY_SCORES.get(source_type.lower(), 20)
    # Apply recency penalty
    if year:
        current_year = 2025  # NOTE: hardcoded; prefer datetime.date.today().year
        age = current_year - year
        if age > 5:
            base_score *= 0.7   # 30% penalty for old sources
        elif age > 3:
            base_score *= 0.85  # 15% penalty for somewhat old sources
        elif age > 1:
            base_score *= 0.95  # 5% penalty for 1-year-old sources
    return min(base_score, 100)
```
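A worked example of the recency penalty, factored into a hypothetical `apply_recency_penalty` helper and assuming the hardcoded 2025 current year from the code above:

```python
def apply_recency_penalty(base_score: float, age: int) -> float:
    """Apply the same age-based multipliers as calculate_source_reliability."""
    if age > 5:
        return base_score * 0.7
    if age > 3:
        return base_score * 0.85
    if age > 1:
        return base_score * 0.95
    return base_score

# A peer-reviewed source (base 95) from 2018 is 7 years old in 2025:
print(apply_recency_penalty(95, 2025 - 2018))  # 95 * 0.7 = 66.5
```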
Evidence Schema
Standard format for evidence items:
```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EvidenceItem:
    url: str
    title: str
    venue: Optional[str] = None
    year: Optional[int] = None
    evidence_strength: str = "moderate"  # strong/moderate/weak
    summary: Optional[str] = None
    support_type: str = "direct"  # direct/indirect/tangential

@dataclass
class Certainty:
    certainty_factor: float  # 0.0-1.0
    certainty_basis: str  # evidence_backed/domain_heuristic/speculative
    evidence: Optional[List[EvidenceItem]] = None
```
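A minimal validation sketch for this schema, using plain dicts so the example stays self-contained; `validate_evidence_item` and its required-field list are hypothetical, not part of the skill's API:

```python
# Required fields mirror the completion checklist (url, title always;
# venue/year are recommended but optional in the schema above).
REQUIRED = ("url", "title")
ALLOWED_STRENGTH = {"strong", "moderate", "weak"}

def validate_evidence_item(item: dict) -> list:
    """Return a list of problems; an empty list means the item is valid."""
    problems = [f"missing field: {f}" for f in REQUIRED if not item.get(f)]
    strength = item.get("evidence_strength", "moderate")
    if strength not in ALLOWED_STRENGTH:
        problems.append(f"invalid evidence_strength: {strength}")
    return problems

item = {"url": "https://arxiv.org/abs/2302.09664",
        "title": "Semantic Uncertainty", "evidence_strength": "strong"}
print(validate_evidence_item(item))  # []
```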
Logical Inference Template
When evidence is insufficient, use structured inference:
```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class InferenceChain:
    premises: List[Tuple[str, float, str]]  # (statement, certainty, evidence)
    inference_type: str  # deduction/induction/abduction
    conclusion: str
    confidence: float
    assumptions: List[str]
    falsification_criteria: List[str]

@dataclass
class DecisionNode:
    node_id: str
    type: str  # decision/outcome
    condition: Optional[str] = None
    rationale: Optional[str] = None
    children: Optional[List[str]] = None
    recommended: bool = False

@dataclass
class DecisionTree:
    is_inferred: bool
    root_node_id: str
    nodes: List[DecisionNode]
```
Integration Patterns
With Uncertainty Orchestrator
```python
# The orchestrator invokes this skill for certainty calculations
from skills.uncertainty_quantification import (
    calculate_semantic_entropy,
    self_consistency_uncertainty,
    detect_verbal_uncertainty,
    CertaintyScore
)

# Calculate composite score (semantic_result, consistency_result,
# avg_source_score, and recency_score come from the earlier steps)
score = CertaintyScore.calculate(
    evidence_support=semantic_result.certainty_score,
    source_reliability=avg_source_score,
    internal_consistency=consistency_result.certainty_score,
    recency=recency_score
)
```
With MoE Analysis
```python
# Each analyst agent uses this skill
for finding in analyst_findings:
    finding.certainty = CertaintyScore.calculate(...)
    finding.evidence = validate_evidence(finding.sources)
    if finding.certainty.level == "INFERRED":
        finding.inference_chain = generate_inference(finding)
```
With MoE Judges
```python
# Judges use calibrated confidence
from skills.uncertainty_quantification import calibrated_confidence

judge_score = DimensionScore(
    score=4.0,
    confidence=calibrated_confidence(
        self_certainty=0.8,
        evidence_quality=0.7,
        ground_truth_available=False
    ),
    rationale="..."
)
```
Success Output
When successful, this skill MUST output:
```
✅ SKILL COMPLETE: uncertainty-quantification

Completed:
- [x] Certainty scores calculated for all findings
- [x] Evidence validation completed
- [x] Inference chains generated for speculative claims
- [x] Verbal uncertainty markers calibrated

Outputs:
- Composite certainty scores (0-100 scale)
- Evidence quality assessment
- Source reliability ratings
- Inference chains (if applicable)
- Uncertainty report with calibrated language
```
Completion Checklist
Before marking this skill as complete, verify:
- [ ] All findings have associated certainty scores (0-100)
- [ ] Certainty levels classified (HIGH/MEDIUM/LOW/INFERRED)
- [ ] Evidence items include url, title, venue, year
- [ ] Source reliability scored using SOURCE_RELIABILITY_SCORES
- [ ] Verbal uncertainty markers match VERBAL_CONFIDENCE_MAP
- [ ] Inference chains documented for INFERRED findings
- [ ] Composite scores include all 4 weighted components
- [ ] No overclaimed assertions without evidence
Failure Indicators
This skill has FAILED if:
- ❌ Certainty scores missing from findings
- ❌ High certainty claimed without evidence backing
- ❌ Source reliability not assessed
- ❌ Contradictory sources ignored without explanation
- ❌ Speculative claims lack inference chains
- ❌ Verbal markers don't match probability ranges
- ❌ Old sources (>5 years) used without recency penalty
- ❌ No distinction between evidence types
When NOT to Use
Do NOT use this skill when:
- Simple pass/fail validation (use basic checks instead)
- Aesthetic or subjective judgments (rubrics better for subjective criteria)
- Single quick queries where overhead not justified
- User explicitly requests opinion without certainty
- Task is purely conversational (no claims to validate)
- Time-sensitive queries where processing time critical
Use alternative approaches when:
- Need deterministic validation → Use schema validation
- Need aesthetic judgment → Use scoring rubrics
- Need quick response → Skip certainty quantification
- Need binary decision → Use simple classification
Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| High certainty without evidence | Overclaimed confidence, trust erosion | Always require evidence for >70% certainty |
| Skipping source reliability | Treats blog posts = peer reviewed | Apply SOURCE_RELIABILITY_SCORES consistently |
| Vague verbal markers | "probably" without score | Map all verbal markers to VERBAL_CONFIDENCE_MAP |
| Ignoring contradictory sources | Confirmation bias | Document contradictions, adjust certainty down |
| No recency penalty | Old sources treated as current | Apply age penalties (>5 years = 30% reduction) |
| Missing inference chains | Speculative claims appear factual | Generate InferenceChain for all INFERRED findings |
| Single-sample certainty | No self-consistency check | Use semantic entropy or self-consistency methods |
| Assuming LLM certainty | Model confidence ≠ accuracy | Always validate with evidence |
Principles
This skill embodies these CODITECT principles:
#5 Eliminate Ambiguity
- Explicit certainty levels (HIGH/MEDIUM/LOW/INFERRED)
- Quantified scores (0-100) instead of vague language
- Clear evidence-to-certainty mappings
#6 Clear, Understandable, Explainable
- Verbal markers calibrated to probability ranges
- Composite scores show weighted components
- Inference chains make reasoning transparent
#8 No Assumptions
- Every claim requires evidence or inference chain
- Source reliability explicitly scored
- Missing information documented, not assumed
Trust & Transparency (CODITECT-STANDARD-TRUST-AND-TRANSPARENCY.md)
- Certainty scoring required for all findings
- Source citations mandatory
- Gaps and contradictions documented
- Calibrated confidence reporting
Best Practices
DO:
- Always provide certainty scores with findings
- Distinguish between evidence types clearly
- Document gaps and missing information
- Use calibrated verbal uncertainty markers
- Generate inference chains for speculative claims
DON'T:
- Claim high certainty without evidence
- Ignore contradictory sources
- Skip uncertainty when confident
- Use vague language like "probably" without score
- Assume old sources are still valid
Templates
See the templates/ directory for:
- `certainty_report_template.md` - Standard certainty reporting format
- `inference_chain_template.md` - Logical inference documentation
- `evidence_validation_template.md` - Source validation checklist
Multi-Context Window Support
State Tracking for Long Workflows:
```json
{
  "checkpoint_id": "ckpt_uq_20251219",
  "certainty_calculations": [
    {"finding": "F1", "score": 85.2, "status": "complete"},
    {"finding": "F2", "score": 72.1, "status": "complete"},
    {"finding": "F3", "score": 0, "status": "pending"}
  ],
  "evidence_validated": 15,
  "inference_chains_generated": 3,
  "token_usage": 8500
}
```
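A resuming run can parse that checkpoint and pick up where it left off; a minimal sketch (field names follow the example above; the inline JSON is a trimmed copy of it):

```python
import json

# Parse a checkpoint like the example above and list pending findings.
checkpoint = json.loads("""
{
  "checkpoint_id": "ckpt_uq_20251219",
  "certainty_calculations": [
    {"finding": "F1", "score": 85.2, "status": "complete"},
    {"finding": "F3", "score": 0, "status": "pending"}
  ]
}
""")

pending = [c["finding"] for c in checkpoint["certainty_calculations"]
           if c["status"] == "pending"]
print(pending)  # ['F3']
```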
Recovery Commands:
```shell
jq '.certainty_calculations' .coditect/checkpoints/uq-latest.json
tail -30 uncertainty-progress.md
```
Version: 1.0.0
Last Updated: 2025-12-19
Research Foundation:
- Semantic Entropy: https://arxiv.org/abs/2302.09664
- Self-Consistency: https://arxiv.org/abs/2203.11171
- Uncertainty of Thoughts: https://arxiv.org/abs/2402.03271
- VOCAL: OpenReview 2024
- LLM-Rubric: ACL 2024