
Uncertainty Quantification Skill

How to Use This Skill

  1. Review the patterns and examples below
  2. Apply the relevant patterns to your implementation
  3. Follow the best practices outlined in this skill

Expert skill for measuring, expressing, and managing uncertainty in LLM interactions. Implements research-backed methods for certainty scoring, evidence validation, and confidence calibration.

When to Use

Use this skill when:

  • Calculating certainty scores for research findings
  • Validating evidence quality and reliability
  • Calibrating confidence in LLM outputs
  • Detecting overclaimed assertions
  • Implementing self-consistency checks
  • Creating logical inference chains with explicit uncertainty

Don't use this skill when:

  • Simple pass/fail checks (use basic validation)
  • Aesthetic/subjective judgments (rubrics handle subjective criteria better)
  • Single quick queries (overhead not justified)

Core Uncertainty Methods

1. Semantic Entropy

Source: Kuhn et al., ICLR 2023; extensions 2024. URL: https://arxiv.org/abs/2302.09664

Principle: Generate multiple responses, cluster by semantic meaning, calculate entropy across clusters.

from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class SemanticEntropyResult:
    certainty_score: float  # 0-100
    entropy: float
    num_clusters: int
    cluster_distribution: List[float]
    responses: List[str]

def calculate_semantic_entropy(
    prompt: str,
    model: LLM,
    samples: int = 5,
    similarity_threshold: float = 0.85
) -> SemanticEntropyResult:
    """Calculate uncertainty via semantic entropy across response samples."""
    # Generate multiple responses with temperature
    responses = [
        model.generate(prompt, temperature=0.7)
        for _ in range(samples)
    ]

    # Embed responses for semantic comparison
    embeddings = [embed_text(r) for r in responses]

    # Cluster by semantic similarity
    clusters = hierarchical_cluster(embeddings, threshold=similarity_threshold)

    # Calculate entropy across cluster probabilities
    probs = [len(c) / samples for c in clusters]
    entropy = -sum(p * np.log(p) for p in probs if p > 0)

    # Normalize to certainty score (0-100)
    max_entropy = np.log(samples)
    certainty = (1 - (entropy / max_entropy)) * 100 if max_entropy > 0 else 100

    return SemanticEntropyResult(
        certainty_score=certainty,
        entropy=entropy,
        num_clusters=len(clusters),
        cluster_distribution=probs,
        responses=responses
    )
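As a quick sanity check of the normalization step, the entropy math can be run on its own with hypothetical cluster sizes and no model calls (`entropy_to_certainty` is an illustrative helper mirroring the code above, not part of the skill's API):

```python
import numpy as np

def entropy_to_certainty(cluster_sizes: list[int], samples: int) -> float:
    # Mirrors the entropy/normalization step of calculate_semantic_entropy
    probs = [c / samples for c in cluster_sizes]
    entropy = -sum(p * np.log(p) for p in probs if p > 0)
    max_entropy = np.log(samples)
    return (1 - entropy / max_entropy) * 100 if max_entropy > 0 else 100.0

print(entropy_to_certainty([5], 5))              # all 5 samples agree -> 100.0
print(entropy_to_certainty([1, 1, 1, 1, 1], 5))  # total disagreement -> ~0.0
```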

Certainty Mapping:

| Semantic Entropy | Certainty Level | Interpretation |
|---|---|---|
| 0.0 - 0.3 | HIGH (90%+) | Strong agreement across samples |
| 0.3 - 0.7 | MEDIUM (60-90%) | Some variation in responses |
| 0.7 - 1.5 | LOW (30-60%) | Significant disagreement |
| > 1.5 | VERY LOW (<30%) | No consensus |

2. Self-Consistency (CoT-SC)

Source: Wang et al., 2023; extensions 2024 URL: https://arxiv.org/abs/2203.11171

Principle: Sample multiple chain-of-thought reasoning paths, majority vote on conclusions.

from collections import Counter

@dataclass
class SelfConsistencyResult:
    certainty_score: float
    majority_answer: str
    majority_count: int
    answer_distribution: dict
    reasoning_samples: List[str]

def self_consistency_uncertainty(
    prompt: str,
    model: LLM,
    samples: int = 10
) -> SelfConsistencyResult:
    """Calculate uncertainty via answer consistency across CoT samples."""
    cot_prompt = prompt + "\n\nThink step by step before giving your final answer."

    responses = [
        model.generate(cot_prompt, temperature=0.7)
        for _ in range(samples)
    ]

    # Extract final answers from each response
    final_answers = [extract_final_answer(r) for r in responses]
    answer_counts = Counter(final_answers)

    majority = answer_counts.most_common(1)[0]
    consistency = (majority[1] / samples) * 100

    return SelfConsistencyResult(
        certainty_score=consistency,
        majority_answer=majority[0],
        majority_count=majority[1],
        answer_distribution=dict(answer_counts),
        reasoning_samples=responses
    )
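A minimal end-to-end illustration of the vote-counting step, using hypothetical extracted answers in place of live model calls:

```python
from collections import Counter

# Hypothetical final answers extracted from 5 CoT samples
final_answers = ["42", "42", "41", "42", "42"]

counts = Counter(final_answers)
majority_answer, majority_count = counts.most_common(1)[0]
certainty = majority_count / len(final_answers) * 100

print(majority_answer, certainty)  # 42 80.0
```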

3. Verbal Uncertainty Detection (VOCAL)

Source: OpenReview 2024

Principle: Analyze linguistic uncertainty markers and calibrate to probability ranges.

import re

@dataclass
class VocalAnalysis:
    markers_found: List[Tuple[str, float]]
    implied_confidence: float
    calibration_needed: bool
    suggested_rewording: str | None = None

VERBAL_CONFIDENCE_MAP = {
    "certain": (95, 100),
    "confident": (90, 100),
    "very likely": (85, 95),
    "highly probable": (85, 95),
    "likely": (70, 85),
    "probably": (65, 80),
    "possibly": (35, 65),
    "might": (30, 60),
    "uncertain": (15, 35),
    "unclear": (10, 30),
    "cannot determine": (0, 15),
    "unknown": (0, 10),
    "no information": (0, 5)
}

def detect_verbal_uncertainty(response: str) -> VocalAnalysis:
    """Analyze verbal uncertainty markers in response text."""
    found_markers = []
    response_lower = response.lower()

    for marker, (low, high) in VERBAL_CONFIDENCE_MAP.items():
        # Word-boundary match so "certain" does not fire inside "uncertain"
        if re.search(r"\b" + re.escape(marker) + r"\b", response_lower):
            midpoint = (low + high) / 2
            found_markers.append((marker, midpoint))

    if found_markers:
        avg_confidence = np.mean([m[1] for m in found_markers])
    else:
        avg_confidence = 50  # Default neutral when no markers present

    # A long answer with no hedging language should be explicitly calibrated
    calibration_needed = len(found_markers) == 0 and len(response) > 100

    return VocalAnalysis(
        markers_found=found_markers,
        implied_confidence=avg_confidence,
        calibration_needed=calibration_needed
    )
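One pitfall worth demonstrating: "certain" is a substring of "uncertain", so naive `in` matching would fire both markers at once. A word-boundary regex (an implementation choice here, not prescribed by the VOCAL paper) keeps the scan precise:

```python
import re

# Midpoints for two markers from VERBAL_CONFIDENCE_MAP
markers = {"certain": 97.5, "uncertain": 25.0}

text = "The outcome is uncertain at this stage."
found = [m for m in markers
         if re.search(r"\b" + re.escape(m) + r"\b", text.lower())]

print(found)  # ['uncertain'] -- "certain" does not match inside "uncertain"
```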

Verbal Marker Calibration Table:

| Marker | Target Probability | Usage Context |
|---|---|---|
| "I am certain" | 95-100% | Only for verified facts |
| "Very likely" | 85-95% | Strong evidence, high confidence |
| "Probably" | 65-80% | Good evidence, some uncertainty |
| "Possibly" | 35-65% | Limited evidence, unclear |
| "Uncertain" | 15-35% | Low evidence, significant gaps |
| "Cannot determine" | 0-15% | Insufficient information |

4. Composite Certainty Score

Combines multiple signals into a weighted certainty score:

@dataclass
class CertaintyScore:
    overall: float                # 0-100
    evidence_support: float       # Weight: 40%
    source_reliability: float     # Weight: 25%
    internal_consistency: float   # Weight: 20%
    recency: float                # Weight: 15%
    level: str                    # HIGH/MEDIUM/LOW/INFERRED

    @classmethod
    def calculate(cls,
                  evidence_support: float,
                  source_reliability: float,
                  internal_consistency: float,
                  recency: float) -> 'CertaintyScore':
        """Calculate weighted composite certainty score."""
        overall = (
            evidence_support * 0.40 +
            source_reliability * 0.25 +
            internal_consistency * 0.20 +
            recency * 0.15
        )

        if overall >= 85:
            level = "HIGH"
        elif overall >= 60:
            level = "MEDIUM"
        elif overall >= 30:
            level = "LOW"
        else:
            level = "INFERRED"

        return cls(
            overall=overall,
            evidence_support=evidence_support,
            source_reliability=source_reliability,
            internal_consistency=internal_consistency,
            recency=recency,
            level=level
        )
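A worked example of the weighting, recomputed standalone with illustrative component values:

```python
# Weights and thresholds mirror CertaintyScore.calculate above
WEIGHTS = {"evidence_support": 0.40, "source_reliability": 0.25,
           "internal_consistency": 0.20, "recency": 0.15}

components = {"evidence_support": 80, "source_reliability": 90,
              "internal_consistency": 70, "recency": 60}

overall = sum(components[k] * w for k, w in WEIGHTS.items())
level = ("HIGH" if overall >= 85 else "MEDIUM" if overall >= 60
         else "LOW" if overall >= 30 else "INFERRED")

print(round(overall, 2), level)  # 77.5 MEDIUM
```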

5. Source Reliability Scoring

from datetime import date

SOURCE_RELIABILITY_SCORES = {
    "peer_reviewed": 95,
    "government": 90,
    "academic_institution": 85,
    "industry_leader": 80,
    "reputable_news": 70,
    "industry_blog": 60,
    "personal_blog": 40,
    "social_media": 25,
    "unknown": 20,
    "no_source": 0
}

def calculate_source_reliability(source_type: str, year: int | None = None) -> float:
    """Score source reliability by type and recency."""
    base_score = SOURCE_RELIABILITY_SCORES.get(source_type.lower(), 20)

    # Apply recency penalty based on source age
    if year:
        age = date.today().year - year
        if age > 5:
            base_score *= 0.7   # 30% penalty for old sources
        elif age > 3:
            base_score *= 0.85  # 15% penalty for somewhat old sources
        elif age > 1:
            base_score *= 0.95  # 5% penalty for slightly dated sources

    return min(base_score, 100)
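For instance, a 2019 peer-reviewed paper evaluated in 2025 is more than 5 years old, so its base score of 95 takes the 30% penalty:

```python
base_score = 95        # SOURCE_RELIABILITY_SCORES["peer_reviewed"]
age = 2025 - 2019      # 6 years old

# Same penalty ladder as calculate_source_reliability
if age > 5:
    base_score *= 0.7
elif age > 3:
    base_score *= 0.85
elif age > 1:
    base_score *= 0.95

print(round(base_score, 1))  # 66.5
```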

Evidence Schema

Standard format for evidence items:

@dataclass
class EvidenceItem:
    url: str
    title: str
    venue: str | None = None
    year: int | None = None
    evidence_strength: str = "moderate"  # strong/moderate/weak
    summary: str | None = None
    support_type: str = "direct"  # direct/indirect/tangential

@dataclass
class Certainty:
    certainty_factor: float  # 0.0-1.0
    certainty_basis: str     # evidence_backed/domain_heuristic/speculative
    evidence: List[EvidenceItem] | None = None

Logical Inference Template

When evidence is insufficient, use structured inference:

@dataclass
class InferenceChain:
    premises: List[Tuple[str, float, str]]  # (statement, certainty, evidence)
    inference_type: str  # deduction/induction/abduction
    conclusion: str
    confidence: float
    assumptions: List[str]
    falsification_criteria: List[str]

@dataclass
class DecisionNode:
    node_id: str
    type: str  # decision/outcome
    condition: str | None = None
    rationale: str | None = None
    children: List[str] | None = None
    recommended: bool = False

@dataclass
class DecisionTree:
    is_inferred: bool
    root_node_id: str
    nodes: List[DecisionNode]
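A minimal sketch of how premise certainties might be combined into a conclusion confidence. The product rule below is one conservative convention, an assumption for illustration rather than part of the skill:

```python
# (statement, certainty, evidence) triples, as in InferenceChain.premises
premises = [
    ("Library X dropped Python 3.8 support in v2.0", 0.90, "changelog"),
    ("The project pins library X >= 2.0",            0.95, "requirements.txt"),
]

# Conservative convention: conclusion confidence = product of premise certainties
confidence = 1.0
for _, certainty, _ in premises:
    confidence *= certainty

conclusion = "The project requires Python 3.9 or newer"
print(round(confidence, 3))  # 0.855
```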

Integration Patterns

With Uncertainty Orchestrator

# The orchestrator invokes this skill for certainty calculations
from skills.uncertainty_quantification import (
    calculate_semantic_entropy,
    self_consistency_uncertainty,
    detect_verbal_uncertainty,
    CertaintyScore
)

# Calculate composite score
score = CertaintyScore.calculate(
    evidence_support=semantic_result.certainty_score,
    source_reliability=avg_source_score,
    internal_consistency=consistency_result.certainty_score,
    recency=recency_score
)

With MoE Analysis

# Each analyst agent uses this skill
for finding in analyst_findings:
    finding.certainty = CertaintyScore.calculate(...)
    finding.evidence = validate_evidence(finding.sources)
    if finding.certainty.level == "INFERRED":
        finding.inference_chain = generate_inference(finding)

With MoE Judges

# Judges use calibrated confidence
from skills.uncertainty_quantification import calibrated_confidence

judge_score = DimensionScore(
    score=4.0,
    confidence=calibrated_confidence(
        self_certainty=0.8,
        evidence_quality=0.7,
        ground_truth_available=False
    ),
    rationale="..."
)

Success Output

When successful, this skill MUST output:

✅ SKILL COMPLETE: uncertainty-quantification

Completed:
- [x] Certainty scores calculated for all findings
- [x] Evidence validation completed
- [x] Inference chains generated for speculative claims
- [x] Verbal uncertainty markers calibrated

Outputs:
- Composite certainty scores (0-100 scale)
- Evidence quality assessment
- Source reliability ratings
- Inference chains (if applicable)
- Uncertainty report with calibrated language

Completion Checklist

Before marking this skill as complete, verify:

  • All findings have associated certainty scores (0-100)
  • Certainty levels classified (HIGH/MEDIUM/LOW/INFERRED)
  • Evidence items include url, title, venue, year
  • Source reliability scored using SOURCE_RELIABILITY_SCORES
  • Verbal uncertainty markers match VERBAL_CONFIDENCE_MAP
  • Inference chains documented for INFERRED findings
  • Composite scores include all 4 weighted components
  • No overclaimed assertions without evidence

Failure Indicators

This skill has FAILED if:

  • ❌ Certainty scores missing from findings
  • ❌ High certainty claimed without evidence backing
  • ❌ Source reliability not assessed
  • ❌ Contradictory sources ignored without explanation
  • ❌ Speculative claims lack inference chains
  • ❌ Verbal markers don't match probability ranges
  • ❌ Old sources (>5 years) used without recency penalty
  • ❌ No distinction between evidence types

When NOT to Use

Do NOT use this skill when:

  • Simple pass/fail validation (use basic checks instead)
  • Aesthetic or subjective judgments (rubrics better for subjective criteria)
  • Single quick queries where overhead not justified
  • User explicitly requests opinion without certainty
  • Task is purely conversational (no claims to validate)
  • Time-sensitive queries where processing time critical

Use alternative approaches when:

  • Need deterministic validation → Use schema validation
  • Need aesthetic judgment → Use scoring rubrics
  • Need quick response → Skip certainty quantification
  • Need binary decision → Use simple classification

Anti-Patterns (Avoid)

| Anti-Pattern | Problem | Solution |
|---|---|---|
| High certainty without evidence | Overclaimed confidence, trust erosion | Always require evidence for >70% certainty |
| Skipping source reliability | Treats blog posts = peer reviewed | Apply SOURCE_RELIABILITY_SCORES consistently |
| Vague verbal markers | "probably" without score | Map all verbal markers to VERBAL_CONFIDENCE_MAP |
| Ignoring contradictory sources | Confirmation bias | Document contradictions, adjust certainty down |
| No recency penalty | Old sources treated as current | Apply age penalties (>5 years = 30% reduction) |
| Missing inference chains | Speculative claims appear factual | Generate InferenceChain for all INFERRED findings |
| Single-sample certainty | No self-consistency check | Use semantic entropy or self-consistency methods |
| Assuming LLM certainty | Model confidence ≠ accuracy | Always validate with evidence |

Principles

This skill embodies these CODITECT principles:

#5 Eliminate Ambiguity

  • Explicit certainty levels (HIGH/MEDIUM/LOW/INFERRED)
  • Quantified scores (0-100) instead of vague language
  • Clear evidence-to-certainty mappings

#6 Clear, Understandable, Explainable

  • Verbal markers calibrated to probability ranges
  • Composite scores show weighted components
  • Inference chains make reasoning transparent

#8 No Assumptions

  • Every claim requires evidence or inference chain
  • Source reliability explicitly scored
  • Missing information documented, not assumed

Trust & Transparency (CODITECT-STANDARD-TRUST-AND-TRANSPARENCY.md)

  • Certainty scoring required for all findings
  • Source citations mandatory
  • Gaps and contradictions documented
  • Calibrated confidence reporting

Best Practices

DO:

  • Always provide certainty scores with findings
  • Distinguish between evidence types clearly
  • Document gaps and missing information
  • Use calibrated verbal uncertainty markers
  • Generate inference chains for speculative claims

DON'T:

  • Claim high certainty without evidence
  • Ignore contradictory sources
  • Skip uncertainty when confident
  • Use vague language like "probably" without score
  • Assume old sources are still valid

Templates

See templates/ directory for:

  • certainty_report_template.md - Standard certainty reporting format
  • inference_chain_template.md - Logical inference documentation
  • evidence_validation_template.md - Source validation checklist

Multi-Context Window Support

State Tracking for Long Workflows:

{
  "checkpoint_id": "ckpt_uq_20251219",
  "certainty_calculations": [
    {"finding": "F1", "score": 85.2, "status": "complete"},
    {"finding": "F2", "score": 72.1, "status": "complete"},
    {"finding": "F3", "score": 0, "status": "pending"}
  ],
  "evidence_validated": 15,
  "inference_chains_generated": 3,
  "token_usage": 8500
}

Recovery Commands:

cat .coditect/checkpoints/uq-latest.json | jq '.certainty_calculations'
cat uncertainty-progress.md | tail -30

Version: 1.0.0
Last Updated: 2025-12-19
Research Foundation: