# ADR-004: Risk Scoring Model

**Status:** Proposed
**Date:** 2026-02-18
**Deciders:** CODITECT Architecture Team
**Source Research:** ClawGuard AI Agent Security Ecosystem Evaluation (2026-02-18)
## Context
When the CODITECT agent security layer (ADR-001) evaluates a tool call input or output, it may match zero, one, or multiple security patterns (ADR-002). The security layer must produce a risk assessment that:
- Determines the action to take (block, redact, confirm, warn, log) for the current evaluation.
- Is auditable — security officers and compliance teams must be able to understand why a particular action was taken.
- Supports policy thresholds — operators must be able to configure "block if risk exceeds X" without understanding internal implementation details.
- Enables trending and analytics — session-level risk scores should be comparable across sessions, tenants, and time periods.
The three ClawGuard repositories each implement a different risk model:
### maxxie114: Numeric Scoring (0-100)

ClawGuard (maxxie114) uses a numeric risk-aggregation approach in `sanitizer.py`:
- Each pattern match contributes a numeric weight to a running total.
- The total is capped at 100.
- The final score determines whether the content is classified as `clean`, `suspicious`, or `dangerous`.
**Strength:** Enables threshold-based policy (`block_if_score_above: 75`). Supports partial matches: a message that matches three medium-severity patterns accumulates score without any single pattern being critical.

**Weakness:** Score values are arbitrary. A score of 72 versus 78 is meaningful in the implementation but not self-explanatory to a reviewer. The score-to-action mapping must be configured separately.
### ClawGuardian (superglue-ai): Categorical Severity Enum

ClawGuardian (superglue-ai) uses a severity enumeration with directly mapped actions in `patterns/*.ts`:
```typescript
// From ClawGuardian source (reconstructed from research context)
type Severity = "critical" | "high" | "medium" | "low";
type Action = "block" | "redact" | "confirm" | "warn" | "log";

// Severity → Action mapping
const DEFAULT_ACTIONS: Record<Severity, Action> = {
  critical: "block",
  high: "redact",
  medium: "confirm",
  low: "warn",
};
```
**Strength:** Immediately interpretable: "critical: block" is self-explanatory in an audit log. No threshold configuration is required for default behavior, and severities map directly to policy.

**Weakness:** Binary per pattern: a pattern either matches (at its severity) or it does not. Multiple lower-severity matches do not aggregate, so a tool call that matches 20 low-severity patterns is treated identically to one that matches a single low-severity pattern.
### JaydenBeard: Category-Grouped Arrays

clawguard (JaydenBeard) groups patterns into named severity arrays in `lib/risk-analyzer.js`:
```javascript
const CRITICAL_SHELL_PATTERNS = [/* 11 patterns */];
const HIGH_RISK_SHELL_PATTERNS = [/* 30+ patterns */];
const MEDIUM_RISK_SHELL_PATTERNS = [/* 20+ patterns */];
```
The severity is implicit in the array name. Risk is determined by the highest-severity category that produces a match.
**Strength:** Simple pattern management; it is easy to add patterns to the right severity tier.

**Weakness:** No aggregation, no numeric score, no cross-category accumulation, and not analytically queryable.
None of the three models alone satisfies all CODITECT requirements. CODITECT needs both numeric thresholds (for policy configuration) and categorical severity (for auditability and policy mapping).
## Decision
Use a hybrid risk scoring model that produces both a numeric score (0-100) and a categorical severity (critical/high/medium/low) for every security evaluation.
### Model Definition

#### Severity Weights
Each pattern in the YAML pattern library (ADR-002) has an assigned severity. The following base weights apply:
| Severity | Base Weight |
|---|---|
| critical | 40 |
| high | 20 |
| medium | 8 |
| low | 2 |
#### Score Calculation
When a tool call input/output is evaluated:
1. Run all applicable patterns (filtered by the `applies_to` lifecycle point).
2. For each matching pattern, add its base weight to a running total.
3. Apply a diminishing-returns function to prevent trivial pattern spam from inflating scores: the nth match of the same severity contributes `weight * (0.85 ^ (n-1))`.
4. Cap the final score at 100.
```python
# Score calculation (runnable sketch of the model above)
from dataclasses import dataclass

SEVERITY_WEIGHTS = {"critical": 40, "high": 20, "medium": 8, "low": 2}

@dataclass
class PatternMatch:
    pattern_id: str
    severity: str  # critical | high | medium | low

def calculate_score(matches: list[PatternMatch]) -> int:
    """Sum severity weights with per-severity diminishing returns; cap at 100."""
    score = 0.0
    severity_counts: dict[str, int] = {}
    # Sorting makes iteration order deterministic; damping is keyed per severity,
    # so the total is independent of ordering.
    for match in sorted(matches, key=lambda m: SEVERITY_WEIGHTS[m.severity], reverse=True):
        n = severity_counts.get(match.severity, 0)  # prior matches at this severity
        weight = SEVERITY_WEIGHTS[match.severity] * (0.85 ** n)
        score += weight
        severity_counts[match.severity] = n + 1
    return min(100, int(round(score)))
```
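To make the diminishing-returns behavior concrete, the contribution of repeated matches at one severity can be tabulated in isolation. This is a standalone sketch: `accumulated_score` is a hypothetical helper, and the medium base weight of 8 comes from the table above.

```python
# Standalone illustration: the nth match of a severity contributes
# weight * (0.85 ** (n - 1)), so repeats raise risk sub-linearly.
def accumulated_score(weight: int, match_count: int) -> int:
    """Score contributed by match_count matches of a single severity."""
    total = sum(weight * (0.85 ** n) for n in range(match_count))
    return min(100, round(total))

# One medium match scores 8; three score 8 + 6.8 + 5.78, rounded to 21,
# so "many medium matches" reads as elevated risk without exploding.
print(accumulated_score(8, 1))  # 8
print(accumulated_score(8, 3))  # 21
```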
#### Categorical Severity Derivation
The categorical severity of an evaluation is the highest severity of any single matching pattern. This is independent of the numeric score.
```python
def derive_categorical_severity(matches: list[PatternMatch]) -> str | None:
    """Return the highest severity among the matches, or None for a clean pass."""
    if not matches:
        return None
    priority = {"critical": 4, "high": 3, "medium": 2, "low": 1}
    return max(matches, key=lambda m: priority[m.severity]).severity
```
#### Action Determination
The action to take is determined by the categorical severity (not the numeric score), using the action map from the pattern definitions (ADR-002). If multiple patterns match at the same severity with different actions, the most restrictive action wins:
`block > redact > confirm > warn > log`
**Exception:** If the numeric score exceeds a tenant-configured `score_override_threshold` (default: 85), the action is escalated to `block` regardless of categorical severity. This handles the case where many medium-severity matches accumulate into a high aggregate risk that the categorical model alone would underweight.
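The most-restrictive-wins rule and the score override can be sketched together as follows. This is illustrative only: `determine_action` and `ACTION_PRIORITY` are hypothetical names, with the default threshold of 85 taken from the text above.

```python
# Illustrative sketch of action determination (not the production implementation).
# The most restrictive matched action wins; a high aggregate score escalates to "block".
ACTION_PRIORITY = {"block": 5, "redact": 4, "confirm": 3, "warn": 2, "log": 1}

def determine_action(matched_actions: list[str], numeric_score: int,
                     score_override_threshold: int = 85) -> str:
    """Pick the most restrictive action; escalate when the score override trips."""
    if not matched_actions:
        return "allow"  # clean pass
    if numeric_score > score_override_threshold:
        return "block"  # audited as action_reason: score_override_threshold
    return max(matched_actions, key=lambda a: ACTION_PRIORITY[a])

# Two same-severity patterns with different actions: "confirm" beats "warn".
print(determine_action(["warn", "confirm"], numeric_score=16))  # confirm
```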
#### Evaluation Result Schema
Every security evaluation produces the following result object:
```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SecurityEvaluation:
    # Identity
    evaluation_id: str   # UUID
    tool_call_id: str    # References the tool invocation
    lifecycle_point: str # pre-agent-start | pre-tool-call | post-tool-result
    timestamp: datetime

    # Matches
    matched_patterns: list[PatternMatch]  # All patterns that matched
    match_count: int

    # Hybrid risk model output
    numeric_score: int                # 0-100
    categorical_severity: str | None  # critical | high | medium | low | None
    action: str                       # block | redact | confirm | warn | log | allow

    # Audit fields
    scan_duration_ms: float
    pattern_library_version: str
    tenant_id: str
```
This schema is written to `sessions.db` for every evaluation — both positive detections and clean passes (score 0, no matches). Clean-pass logging enables retrospective analysis to confirm the security layer was active and functioning.
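As a sketch of what clean-pass logging could look like, the following inserts a minimal row into a `security_evaluations` table. SQLite is assumed for `sessions.db`, and the column set is reduced for illustration; the authoritative schema is defined elsewhere in this ADR's schema file.

```python
# Sketch: logging a clean-pass evaluation (score 0, no matches, action "allow").
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for sessions.db
conn.execute("""
    CREATE TABLE IF NOT EXISTS security_evaluations (
        evaluation_id        TEXT PRIMARY KEY,
        tool_call_id         TEXT NOT NULL,
        numeric_score        INTEGER NOT NULL,
        categorical_severity TEXT,              -- NULL on clean passes
        action               TEXT NOT NULL,
        match_count          INTEGER NOT NULL
    )
""")
# Clean passes are logged too, so the audit trail proves the layer was active.
conn.execute(
    "INSERT INTO security_evaluations VALUES (?, ?, ?, ?, ?, ?)",
    ("eval-001", "call-123", 0, None, "allow", 0),
)
conn.commit()
row = conn.execute(
    "SELECT action, match_count FROM security_evaluations").fetchone()
print(row)  # ('allow', 0)
```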
## Consequences

### Positive
- **Policy expressibility:** The numeric score enables threshold-based policy (`block_if_score_above: 85`) for operators who need fine-grained control; the categorical severity enables simple policy ("block all critical").
- **Auditability:** Every evaluation in `sessions.db` contains both the numeric score and the categorical severity, plus the matched pattern IDs. Compliance reviewers can reconstruct exactly why an action was taken.
- **Analytics:** Numeric scores can be trended over time, correlated with tenant activity, and used to identify emerging attack patterns (rising baseline scores across a tenant's sessions).
- **Aggregation without losing signal:** The diminishing-returns formula prevents trivial multi-match inflation while preserving the signal that many medium-severity matches carry meaningfully more risk than a single one.
- **Clean-pass logging:** Every tool call generates an evaluation record, not just those with detections. This provides a complete audit trail and a baseline for anomaly detection.
### Negative
- **Increased session log volume:** Logging every evaluation (including clean passes) adds write volume to `sessions.db`. Mitigation: clean-pass records can use a compact representation, while match-positive records use the full schema. `sessions.db` is explicitly classified as regenerable (ADR-118), so retention policies can be aggressive.
- **Weight calibration is opinionated:** The base weights (40/20/8/2) and the diminishing-returns factor (0.85) are engineering judgments, not derived from empirical data, and will require tuning based on real-world detection feedback. Initial values are informed by maxxie114's scoring approach but not directly ported.
- **Score override threshold:** The `score_override_threshold` adds a secondary decision path that can surprise operators ("why was this blocked? the severity was only medium"). Mitigation: when a score override triggers a block, the audit log explicitly states `action_reason: score_override_threshold`.
### Neutral
- The hybrid model is a superset of both maxxie114's numeric model and ClawGuardian's categorical model. Existing tooling that consumes only numeric scores or only categorical severities remains compatible with this model.
## Alternatives Considered

### Alternative A: Numeric Only (maxxie114 Model)
**Approach:** Produce a 0-100 score. The action is determined entirely by score thresholds.
Rejected because:
- A single critical pattern match (e.g., `ignore all instructions`) might score 40, which maps to a warning under some threshold configurations but deserves an immediate block.
- Score thresholds are tenant-configurable, meaning a misconfigured tenant could set a threshold that allows critical-severity matches to pass. Categorical severity provides a safety backstop.
- Audit logs that say "blocked: score 72" are harder to defend to a compliance reviewer than "blocked: critical severity match on `ignore_instructions` pattern."
### Alternative B: Categorical Only (ClawGuardian Model)

**Approach:** Each pattern has a severity and a directly mapped action. No numeric score. The highest-severity match determines the action.
Rejected because:
- No way to express "many medium matches = elevated risk." A session that triggers 20 medium-severity patterns looks identical to one that triggers 1.
- Operators cannot configure threshold-based policies ("alert if cumulative risk is high today").
- Session-level risk trending is impossible without a numeric dimension.
### Alternative C: Weighted Category Score (JaydenBeard Extension)

**Approach:** Assign a fixed weight to each severity category: score = `(critical_count * 40) + (high_count * 20) + (medium_count * 8) + (low_count * 2)`, capped at 100.
**Not selected as primary** (it is a simplified version of the chosen model):
- No diminishing returns — a single tool call with 50 low-severity matches would score 100, equivalent to a critical match. This produces misleading scores in high-noise environments.
- The chosen hybrid model extends this approach with diminishing returns and explicit categorical severity tracking, strictly adding capability without removing JaydenBeard's core insight.
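The gap is easy to quantify. For 50 low-severity matches (base weight 2, decay factor 0.85 from the model definition), the fixed-weight formula saturates the cap while the diminishing-returns model stays modest. This is a standalone arithmetic sketch, not code from any of the three repositories.

```python
# Compare Alternative C (fixed weights) with the chosen diminishing-returns model
# for 50 low-severity matches (base weight 2, decay 0.85, cap 100).
LOW_WEIGHT, DECAY, CAP = 2, 0.85, 100

fixed = min(CAP, LOW_WEIGHT * 50)  # Alternative C: 50 * 2 hits the cap
damped = min(CAP, round(sum(LOW_WEIGHT * DECAY ** n for n in range(50))))

print(fixed)   # 100 -- indistinguishable from a critical-severity stack
print(damped)  # 13  -- elevated, but well below the block range
```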
## Score Reference Table
For operator guidance and documentation:
| Score Range | Typical Pattern | Recommended Default Action |
|---|---|---|
| 0 | No matches | allow |
| 1-15 | 1-7 low-severity matches | log |
| 16-30 | 1-2 medium matches or many low | warn |
| 31-55 | Multiple medium or 1-2 high matches | confirm (or warn) |
| 56-79 | 1-2 high + medium, or many high | redact (or confirm) |
| 80-100 | Any critical, or high accumulation | block |
**Note:** The categorical-severity action (from pattern definitions) takes precedence over the score-based action, except when `score_override_threshold` is exceeded.
## Implementation Notes
- **Scoring engine:** `scripts/core/security/risk_scorer.py`
- **Schema:** `config/schemas/security-evaluation.yaml`
- **Storage:** `sessions.db`, table `security_evaluations`
- **Clean-pass record format:** Compact row with `match_count=0`, `numeric_score=0`, `action=allow`; no `matched_patterns` JSON blob, to reduce storage.
- **Calibration plan:** After 30 days of production data (pilot tenants), analyze score distributions and false-positive rates. Adjust base weights if critical matches are not clearly separated from high matches in score space.
- **Score override threshold:** Default 85, configurable per tenant in `config/tenants/{tenant-id}/security.yaml`.
## References
- maxxie114 scoring: `sanitizer.py`, `calculate_risk_score()` function
- ClawGuardian severity model: `patterns/*.ts`, `Severity` type and `DEFAULT_ACTIONS` mapping
- JaydenBeard category arrays: `lib/risk-analyzer.js`, `CRITICAL_SHELL_PATTERNS`, `HIGH_RISK_SHELL_PATTERNS`, etc.
- Research context: `analyze-new-artifacts/clawguard-ai-agent-security/artifacts/research-context.json`
- Related ADRs: ADR-001 (Security Layer Architecture), ADR-002 (Pattern Library Format), ADR-003 (Fail Behavior)