
Uncertainty Quantification Skill

How to Use This Skill

  1. Review the patterns and examples below
  2. Apply the relevant patterns to your implementation
  3. Follow the best practices outlined in this skill

Expert skill for measuring, expressing, and managing uncertainty in LLM interactions. Implements research-backed methods for certainty scoring, evidence validation, and confidence calibration.

When to Use

Use this skill when:

  • Calculating certainty scores for research findings
  • Validating evidence quality and reliability
  • Calibrating confidence in LLM outputs
  • Detecting overclaimed assertions
  • Implementing self-consistency checks
  • Creating logical inference chains with explicit uncertainty

Don't use this skill when:

  • Simple pass/fail checks (use basic validation)
  • Aesthetic/subjective judgments (rubrics handle subjective criteria better)
  • Single quick queries (overhead not justified)

Core Uncertainty Methods

1. Semantic Entropy

Source: Kuhn et al., ICLR 2023; extensions 2024. URL: https://arxiv.org/abs/2302.09664

Principle: Generate multiple responses, cluster by semantic meaning, calculate entropy across clusters.

from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class SemanticEntropyResult:
    certainty_score: float  # 0-100
    entropy: float
    num_clusters: int
    cluster_distribution: List[float]
    responses: List[str]

def calculate_semantic_entropy(
    prompt: str,
    model: LLM,
    samples: int = 5,
    similarity_threshold: float = 0.85
) -> SemanticEntropyResult:
    """Calculate uncertainty via semantic entropy across response samples."""
    # Generate multiple responses with temperature
    responses = [
        model.generate(prompt, temperature=0.7)
        for _ in range(samples)
    ]

    # Embed responses for semantic comparison
    embeddings = [embed_text(r) for r in responses]

    # Cluster by semantic similarity
    clusters = hierarchical_cluster(embeddings, threshold=similarity_threshold)

    # Calculate entropy across cluster probabilities
    probs = [len(c) / samples for c in clusters]
    entropy = -sum(p * np.log(p) for p in probs if p > 0)

    # Normalize to certainty score (0-100)
    max_entropy = np.log(samples)
    certainty = (1 - (entropy / max_entropy)) * 100 if max_entropy > 0 else 100

    return SemanticEntropyResult(
        certainty_score=certainty,
        entropy=entropy,
        num_clusters=len(clusters),
        cluster_distribution=probs,
        responses=responses
    )
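As a quick sanity check of the normalization step, the entropy math can be run on its own with hypothetical cluster sizes and no model calls (`entropy_to_certainty` is an illustrative helper mirroring the code above, not part of the skill's API):

```python
import numpy as np

def entropy_to_certainty(cluster_sizes: list[int], samples: int) -> float:
    # Mirrors the entropy/normalization step of calculate_semantic_entropy
    probs = [c / samples for c in cluster_sizes]
    entropy = -sum(p * np.log(p) for p in probs if p > 0)
    max_entropy = np.log(samples)
    return (1 - entropy / max_entropy) * 100 if max_entropy > 0 else 100.0

print(entropy_to_certainty([5], 5))              # all 5 samples agree -> 100.0
print(entropy_to_certainty([1, 1, 1, 1, 1], 5))  # total disagreement -> ~0.0
```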

Certainty Mapping:

| Semantic Entropy | Certainty Level | Interpretation |
|---|---|---|
| 0.0 - 0.3 | HIGH (90%+) | Strong agreement across samples |
| 0.3 - 0.7 | MEDIUM (60-90%) | Some variation in responses |
| 0.7 - 1.5 | LOW (30-60%) | Significant disagreement |
| > 1.5 | VERY LOW (<30%) | No consensus |

2. Self-Consistency (CoT-SC)

Source: Wang et al., 2023; extensions 2024 URL: https://arxiv.org/abs/2203.11171

Principle: Sample multiple chain-of-thought reasoning paths, majority vote on conclusions.

from collections import Counter

@dataclass
class SelfConsistencyResult:
    certainty_score: float
    majority_answer: str
    majority_count: int
    answer_distribution: dict
    reasoning_samples: List[str]

def self_consistency_uncertainty(
    prompt: str,
    model: LLM,
    samples: int = 10
) -> SelfConsistencyResult:
    """Calculate uncertainty via answer consistency across CoT samples."""
    cot_prompt = prompt + "\n\nThink step by step before giving your final answer."

    responses = [
        model.generate(cot_prompt, temperature=0.7)
        for _ in range(samples)
    ]

    # Extract final answers from each response
    final_answers = [extract_final_answer(r) for r in responses]
    answer_counts = Counter(final_answers)

    majority = answer_counts.most_common(1)[0]
    consistency = (majority[1] / samples) * 100

    return SelfConsistencyResult(
        certainty_score=consistency,
        majority_answer=majority[0],
        majority_count=majority[1],
        answer_distribution=dict(answer_counts),
        reasoning_samples=responses
    )
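A minimal end-to-end illustration of the vote-counting step, using hypothetical extracted answers in place of live model calls:

```python
from collections import Counter

# Hypothetical final answers extracted from 5 CoT samples
final_answers = ["42", "42", "41", "42", "42"]

counts = Counter(final_answers)
majority_answer, majority_count = counts.most_common(1)[0]
certainty = majority_count / len(final_answers) * 100

print(majority_answer, certainty)  # 42 80.0
```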

3. Verbal Uncertainty Detection (VOCAL)

Source: OpenReview 2024

Principle: Analyze linguistic uncertainty markers and calibrate to probability ranges.

import re

@dataclass
class VocalAnalysis:
    markers_found: List[Tuple[str, float]]
    implied_confidence: float
    calibration_needed: bool
    suggested_rewording: str | None = None

VERBAL_CONFIDENCE_MAP = {
    "certain": (95, 100),
    "confident": (90, 100),
    "very likely": (85, 95),
    "highly probable": (85, 95),
    "likely": (70, 85),
    "probably": (65, 80),
    "possibly": (35, 65),
    "might": (30, 60),
    "uncertain": (15, 35),
    "unclear": (10, 30),
    "cannot determine": (0, 15),
    "unknown": (0, 10),
    "no information": (0, 5)
}

def detect_verbal_uncertainty(response: str) -> VocalAnalysis:
    """Analyze verbal uncertainty markers in response text."""
    found_markers = []
    response_lower = response.lower()

    for marker, (low, high) in VERBAL_CONFIDENCE_MAP.items():
        # Word-boundary match so "certain" does not fire inside "uncertain"
        if re.search(r"\b" + re.escape(marker) + r"\b", response_lower):
            midpoint = (low + high) / 2
            found_markers.append((marker, midpoint))

    if found_markers:
        avg_confidence = np.mean([m[1] for m in found_markers])
    else:
        avg_confidence = 50  # Default neutral when no markers present

    # A long answer with no hedging language should be explicitly calibrated
    calibration_needed = len(found_markers) == 0 and len(response) > 100

    return VocalAnalysis(
        markers_found=found_markers,
        implied_confidence=avg_confidence,
        calibration_needed=calibration_needed
    )
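One pitfall worth demonstrating: "certain" is a substring of "uncertain", so naive `in` matching would fire both markers at once. A word-boundary regex (an implementation choice here, not prescribed by the VOCAL paper) keeps the scan precise:

```python
import re

# Midpoints for two markers from VERBAL_CONFIDENCE_MAP
markers = {"certain": 97.5, "uncertain": 25.0}

text = "The outcome is uncertain at this stage."
found = [m for m in markers
         if re.search(r"\b" + re.escape(m) + r"\b", text.lower())]

print(found)  # ['uncertain'] -- "certain" does not match inside "uncertain"
```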

Verbal Marker Calibration Table:

| Marker | Target Probability | Usage Context |
|---|---|---|
| "I am certain" | 95-100% | Only for verified facts |
| "Very likely" | 85-95% | Strong evidence, high confidence |
| "Probably" | 65-80% | Good evidence, some uncertainty |
| "Possibly" | 35-65% | Limited evidence, unclear |
| "Uncertain" | 15-35% | Low evidence, significant gaps |
| "Cannot determine" | 0-15% | Insufficient information |

4. Composite Certainty Score

Combines multiple signals into a weighted certainty score:

@dataclass
class CertaintyScore:
    overall: float                # 0-100
    evidence_support: float       # Weight: 40%
    source_reliability: float     # Weight: 25%
    internal_consistency: float   # Weight: 20%
    recency: float                # Weight: 15%
    level: str                    # HIGH/MEDIUM/LOW/INFERRED

    @classmethod
    def calculate(cls,
                  evidence_support: float,
                  source_reliability: float,
                  internal_consistency: float,
                  recency: float) -> 'CertaintyScore':
        """Calculate weighted composite certainty score."""
        overall = (
            evidence_support * 0.40 +
            source_reliability * 0.25 +
            internal_consistency * 0.20 +
            recency * 0.15
        )

        if overall >= 85:
            level = "HIGH"
        elif overall >= 60:
            level = "MEDIUM"
        elif overall >= 30:
            level = "LOW"
        else:
            level = "INFERRED"

        return cls(
            overall=overall,
            evidence_support=evidence_support,
            source_reliability=source_reliability,
            internal_consistency=internal_consistency,
            recency=recency,
            level=level
        )
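A worked example of the weighting, recomputed standalone with illustrative component values:

```python
# Weights and thresholds mirror CertaintyScore.calculate above
WEIGHTS = {"evidence_support": 0.40, "source_reliability": 0.25,
           "internal_consistency": 0.20, "recency": 0.15}

components = {"evidence_support": 80, "source_reliability": 90,
              "internal_consistency": 70, "recency": 60}

overall = sum(components[k] * w for k, w in WEIGHTS.items())
level = ("HIGH" if overall >= 85 else "MEDIUM" if overall >= 60
         else "LOW" if overall >= 30 else "INFERRED")

print(round(overall, 2), level)  # 77.5 MEDIUM
```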

5. Source Reliability Scoring

from datetime import date

SOURCE_RELIABILITY_SCORES = {
    "peer_reviewed": 95,
    "government": 90,
    "academic_institution": 85,
    "industry_leader": 80,
    "reputable_news": 70,
    "industry_blog": 60,
    "personal_blog": 40,
    "social_media": 25,
    "unknown": 20,
    "no_source": 0
}

def calculate_source_reliability(source_type: str, year: int | None = None) -> float:
    """Score source reliability by type and recency."""
    base_score = SOURCE_RELIABILITY_SCORES.get(source_type.lower(), 20)

    # Apply recency penalty based on source age
    if year:
        age = date.today().year - year
        if age > 5:
            base_score *= 0.7   # 30% penalty for old sources
        elif age > 3:
            base_score *= 0.85  # 15% penalty for somewhat old sources
        elif age > 1:
            base_score *= 0.95  # 5% penalty for slightly dated sources

    return min(base_score, 100)
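For instance, a 2019 peer-reviewed paper evaluated in 2025 is more than 5 years old, so its base score of 95 takes the 30% penalty:

```python
base_score = 95        # SOURCE_RELIABILITY_SCORES["peer_reviewed"]
age = 2025 - 2019      # 6 years old

# Same penalty ladder as calculate_source_reliability
if age > 5:
    base_score *= 0.7
elif age > 3:
    base_score *= 0.85
elif age > 1:
    base_score *= 0.95

print(round(base_score, 1))  # 66.5
```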

Evidence Schema

Standard format for evidence items:

@dataclass
class EvidenceItem:
    url: str
    title: str
    venue: str | None = None
    year: int | None = None
    evidence_strength: str = "moderate"  # strong/moderate/weak
    summary: str | None = None
    support_type: str = "direct"  # direct/indirect/tangential

@dataclass
class Certainty:
    certainty_factor: float  # 0.0-1.0
    certainty_basis: str     # evidence_backed/domain_heuristic/speculative
    evidence: List[EvidenceItem] | None = None

Logical Inference Template

When evidence is insufficient, use structured inference:

@dataclass
class InferenceChain:
    premises: List[Tuple[str, float, str]]  # (statement, certainty, evidence)
    inference_type: str  # deduction/induction/abduction
    conclusion: str
    confidence: float
    assumptions: List[str]
    falsification_criteria: List[str]

@dataclass
class DecisionNode:
    node_id: str
    type: str  # decision/outcome
    condition: str | None = None
    rationale: str | None = None
    children: List[str] | None = None
    recommended: bool = False

@dataclass
class DecisionTree:
    is_inferred: bool
    root_node_id: str
    nodes: List[DecisionNode]
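A minimal sketch of how premise certainties might be combined into a conclusion confidence. The product rule below is one conservative convention, an assumption for illustration rather than part of the skill:

```python
# (statement, certainty, evidence) triples, as in InferenceChain.premises
premises = [
    ("Library X dropped Python 3.8 support in v2.0", 0.90, "changelog"),
    ("The project pins library X >= 2.0",            0.95, "requirements.txt"),
]

# Conservative convention: conclusion confidence = product of premise certainties
confidence = 1.0
for _, certainty, _ in premises:
    confidence *= certainty

conclusion = "The project requires Python 3.9 or newer"
print(round(confidence, 3))  # 0.855
```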

Integration Patterns

With Uncertainty Orchestrator

# The orchestrator invokes this skill for certainty calculations
from skills.uncertainty_quantification import (
    calculate_semantic_entropy,
    self_consistency_uncertainty,
    detect_verbal_uncertainty,
    CertaintyScore
)

# Calculate composite score
score = CertaintyScore.calculate(
    evidence_support=semantic_result.certainty_score,
    source_reliability=avg_source_score,
    internal_consistency=consistency_result.certainty_score,
    recency=recency_score
)

With MoE Analysis

# Each analyst agent uses this skill
for finding in analyst_findings:
    finding.certainty = CertaintyScore.calculate(...)
    finding.evidence = validate_evidence(finding.sources)
    if finding.certainty.level == "INFERRED":
        finding.inference_chain = generate_inference(finding)

With MoE Judges

# Judges use calibrated confidence
from skills.uncertainty_quantification import calibrated_confidence

judge_score = DimensionScore(
    score=4.0,
    confidence=calibrated_confidence(
        self_certainty=0.8,
        evidence_quality=0.7,
        ground_truth_available=False
    ),
    rationale="..."
)

Success Output

When successful, this skill MUST output:

✅ SKILL COMPLETE: uncertainty-quantification

Completed:
- [x] Certainty scores calculated for all findings
- [x] Evidence validation completed
- [x] Inference chains generated for speculative claims
- [x] Verbal uncertainty markers calibrated

Outputs:
- Composite certainty scores (0-100 scale)
- Evidence quality assessment
- Source reliability ratings
- Inference chains (if applicable)
- Uncertainty report with calibrated language

Completion Checklist

Before marking this skill as complete, verify:

  • All findings have associated certainty scores (0-100)
  • Certainty levels classified (HIGH/MEDIUM/LOW/INFERRED)
  • Evidence items include url, title, venue, year
  • Source reliability scored using SOURCE_RELIABILITY_SCORES
  • Verbal uncertainty markers match VERBAL_CONFIDENCE_MAP
  • Inference chains documented for INFERRED findings
  • Composite scores include all 4 weighted components
  • No overclaimed assertions without evidence

Failure Indicators

This skill has FAILED if:

  • ❌ Certainty scores missing from findings
  • ❌ High certainty claimed without evidence backing
  • ❌ Source reliability not assessed
  • ❌ Contradictory sources ignored without explanation
  • ❌ Speculative claims lack inference chains
  • ❌ Verbal markers don't match probability ranges
  • ❌ Old sources (>5 years) used without recency penalty
  • ❌ No distinction between evidence types

When NOT to Use

Do NOT use this skill when:

  • Simple pass/fail validation (use basic checks instead)
  • Aesthetic or subjective judgments (rubrics better for subjective criteria)
  • Single quick queries where overhead not justified
  • User explicitly requests opinion without certainty
  • Task is purely conversational (no claims to validate)
  • Time-sensitive queries where processing time critical

Use alternative approaches when:

  • Need deterministic validation → Use schema validation
  • Need aesthetic judgment → Use scoring rubrics
  • Need quick response → Skip certainty quantification
  • Need binary decision → Use simple classification

Anti-Patterns (Avoid)

| Anti-Pattern | Problem | Solution |
|---|---|---|
| High certainty without evidence | Overclaimed confidence, trust erosion | Always require evidence for >70% certainty |
| Skipping source reliability | Treats blog posts = peer reviewed | Apply SOURCE_RELIABILITY_SCORES consistently |
| Vague verbal markers | "probably" without score | Map all verbal markers to VERBAL_CONFIDENCE_MAP |
| Ignoring contradictory sources | Confirmation bias | Document contradictions, adjust certainty down |
| No recency penalty | Old sources treated as current | Apply age penalties (>5 years = 30% reduction) |
| Missing inference chains | Speculative claims appear factual | Generate InferenceChain for all INFERRED findings |
| Single-sample certainty | No self-consistency check | Use semantic entropy or self-consistency methods |
| Assuming LLM certainty | Model confidence ≠ accuracy | Always validate with evidence |

Principles

This skill embodies these CODITECT principles:

#5 Eliminate Ambiguity

  • Explicit certainty levels (HIGH/MEDIUM/LOW/INFERRED)
  • Quantified scores (0-100) instead of vague language
  • Clear evidence-to-certainty mappings

#6 Clear, Understandable, Explainable

  • Verbal markers calibrated to probability ranges
  • Composite scores show weighted components
  • Inference chains make reasoning transparent

#8 No Assumptions

  • Every claim requires evidence or inference chain
  • Source reliability explicitly scored
  • Missing information documented, not assumed

Trust & Transparency (CODITECT-STANDARD-TRUST-AND-TRANSPARENCY.md)

  • Certainty scoring required for all findings
  • Source citations mandatory
  • Gaps and contradictions documented
  • Calibrated confidence reporting

Best Practices

DO:

  • Always provide certainty scores with findings
  • Distinguish between evidence types clearly
  • Document gaps and missing information
  • Use calibrated verbal uncertainty markers
  • Generate inference chains for speculative claims

DON'T:

  • Claim high certainty without evidence
  • Ignore contradictory sources
  • Skip uncertainty when confident
  • Use vague language like "probably" without score
  • Assume old sources are still valid

Templates

See templates/ directory for:

  • certainty_report_template.md - Standard certainty reporting format
  • inference_chain_template.md - Logical inference documentation
  • evidence_validation_template.md - Source validation checklist

Multi-Context Window Support

State Tracking for Long Workflows:

{
  "checkpoint_id": "ckpt_uq_20251219",
  "certainty_calculations": [
    {"finding": "F1", "score": 85.2, "status": "complete"},
    {"finding": "F2", "score": 72.1, "status": "complete"},
    {"finding": "F3", "score": 0, "status": "pending"}
  ],
  "evidence_validated": 15,
  "inference_chains_generated": 3,
  "token_usage": 8500
}

Recovery Commands:

cat .coditect/checkpoints/uq-latest.json | jq '.certainty_calculations'
cat uncertainty-progress.md | tail -30

Version: 1.0.0
Last Updated: 2025-12-19
Research Foundation: