# ADR-004: Risk Scoring Model

**Status:** Proposed
**Date:** 2026-02-18
**Deciders:** CODITECT Architecture Team
**Source Research:** ClawGuard AI Agent Security Ecosystem Evaluation (2026-02-18)
## Context
When the CODITECT agent security layer (ADR-001) evaluates a tool call input or output, it may match zero, one, or multiple security patterns (ADR-002). The security layer must produce a risk assessment that:
- Determines the action to take (block, redact, confirm, warn, log) for the current evaluation.
- Is auditable — security officers and compliance teams must be able to understand why a particular action was taken.
- Supports policy thresholds — operators must be able to configure "block if risk exceeds X" without understanding internal implementation details.
- Enables trending and analytics — session-level risk scores should be comparable across sessions, tenants, and time periods.
The three ClawGuard repositories each implement a different risk model:
### maxxie114: Numeric Scoring (0-100)

ClawGuard (maxxie114) uses a numeric risk-aggregation approach in `sanitizer.py`:
- Each pattern match contributes a numeric weight to a running total.
- The total is capped at 100.
- The final score determines whether the content is classified as `clean`, `suspicious`, or `dangerous`.
**Strength:** Enables threshold-based policy (`block_if_score_above: 75`). Supports partial matches: a message that matches three medium-severity patterns accumulates score without any single pattern being critical.

**Weakness:** Score values are arbitrary. A score of 72 versus 78 is meaningful in the implementation but not self-explanatory to a reviewer. The score-to-action mapping must be configured separately.
### ClawGuardian (superglue-ai): Categorical Severity Enum

ClawGuardian (superglue-ai) uses a severity enumeration with directly mapped actions in `patterns/*.ts`:
```typescript
// From ClawGuardian source (reconstructed from research context)
type Severity = "critical" | "high" | "medium" | "low";
type Action = "block" | "redact" | "confirm" | "warn" | "log";

// Severity → Action mapping
const DEFAULT_ACTIONS: Record<Severity, Action> = {
  critical: "block",
  high: "redact",
  medium: "confirm",
  low: "warn",
};
```
**Strength:** Immediately interpretable: "critical: block" is self-explanatory in an audit log. No threshold configuration is required for default behavior, and severities map directly to policy.

**Weakness:** Binary per pattern: a pattern either matches (at its severity) or it does not. Multiple lower-severity matches do not aggregate, so a tool call that matches 20 low-severity patterns is treated identically to one that matches a single low-severity pattern.
### JaydenBeard: Category-Grouped Arrays

clawguard (JaydenBeard) groups patterns into named severity arrays in `lib/risk-analyzer.js`:
```javascript
const CRITICAL_SHELL_PATTERNS = [/* 11 patterns */];
const HIGH_RISK_SHELL_PATTERNS = [/* 30+ patterns */];
const MEDIUM_RISK_SHELL_PATTERNS = [/* 20+ patterns */];
```
The severity is implicit in the array name. Risk is determined by the highest-severity category that produces a match.
**Strength:** Simple pattern management; it is easy to add patterns to the right severity tier.

**Weakness:** No aggregation, no numeric score, no cross-category accumulation, and not analytically queryable.
None of the three models alone satisfies all CODITECT requirements. CODITECT needs both numeric thresholds (for policy configuration) and categorical severity (for auditability and policy mapping).
## Decision
Use a hybrid risk scoring model that produces both a numeric score (0-100) and a categorical severity (critical/high/medium/low) for every security evaluation.
### Model Definition

#### Severity Weights
Each pattern in the YAML pattern library (ADR-002) has an assigned severity. The following base weights apply:
| Severity | Base Weight |
|---|---|
| critical | 40 |
| high | 20 |
| medium | 8 |
| low | 2 |
#### Score Calculation
When a tool call input/output is evaluated:
1. Run all applicable patterns (filtered by the `applies_to` lifecycle point).
2. For each matching pattern, add its base weight to a running total.
3. Apply a diminishing-returns function to prevent trivial pattern spam from inflating scores: the nth match of the same severity contributes `weight * (0.85 ^ (n-1))`.
4. Cap the final score at 100.
```python
# Score calculation (runnable sketch of the model above)
from dataclasses import dataclass

SEVERITY_WEIGHTS = {"critical": 40, "high": 20, "medium": 8, "low": 2}

@dataclass
class PatternMatch:
    pattern_id: str
    severity: str  # critical | high | medium | low

def calculate_score(matches: list[PatternMatch]) -> int:
    """Sum severity weights with per-severity diminishing returns; cap at 100."""
    score = 0.0
    severity_counts: dict[str, int] = {}
    # Sorting makes iteration order deterministic; damping is keyed per severity,
    # so the total is independent of ordering.
    for match in sorted(matches, key=lambda m: SEVERITY_WEIGHTS[m.severity], reverse=True):
        n = severity_counts.get(match.severity, 0)  # prior matches at this severity
        weight = SEVERITY_WEIGHTS[match.severity] * (0.85 ** n)
        score += weight
        severity_counts[match.severity] = n + 1
    return min(100, int(round(score)))
```
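To make the diminishing-returns behavior concrete, the contribution of repeated matches at one severity can be tabulated in isolation. This is a standalone sketch: `accumulated_score` is a hypothetical helper, and the medium base weight of 8 comes from the table above.

```python
# Standalone illustration: the nth match of a severity contributes
# weight * (0.85 ** (n - 1)), so repeats raise risk sub-linearly.
def accumulated_score(weight: int, match_count: int) -> int:
    """Score contributed by match_count matches of a single severity."""
    total = sum(weight * (0.85 ** n) for n in range(match_count))
    return min(100, round(total))

# One medium match scores 8; three score 8 + 6.8 + 5.78, rounded to 21,
# so "many medium matches" reads as elevated risk without exploding.
print(accumulated_score(8, 1))  # 8
print(accumulated_score(8, 3))  # 21
```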
#### Categorical Severity Derivation
The categorical severity of an evaluation is the highest severity of any single matching pattern. This is independent of the numeric score.
```python
def derive_categorical_severity(matches: list[PatternMatch]) -> str | None:
    """Return the highest severity among the matches, or None for a clean pass."""
    if not matches:
        return None
    priority = {"critical": 4, "high": 3, "medium": 2, "low": 1}
    return max(matches, key=lambda m: priority[m.severity]).severity
```
#### Action Determination
The action to take is determined by the categorical severity (not the numeric score), using the action map from the pattern definitions (ADR-002). If multiple patterns match at the same severity with different actions, the most restrictive action wins:
`block > redact > confirm > warn > log`
**Exception:** If the numeric score exceeds a tenant-configured `score_override_threshold` (default: 85), the action is escalated to `block` regardless of categorical severity. This handles the case where many medium-severity matches accumulate into a high aggregate risk that the categorical model alone would underweight.
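The most-restrictive-wins rule and the score override can be sketched together as follows. This is illustrative only: `determine_action` and `ACTION_PRIORITY` are hypothetical names, with the default threshold of 85 taken from the text above.

```python
# Illustrative sketch of action determination (not the production implementation).
# The most restrictive matched action wins; a high aggregate score escalates to "block".
ACTION_PRIORITY = {"block": 5, "redact": 4, "confirm": 3, "warn": 2, "log": 1}

def determine_action(matched_actions: list[str], numeric_score: int,
                     score_override_threshold: int = 85) -> str:
    """Pick the most restrictive action; escalate when the score override trips."""
    if not matched_actions:
        return "allow"  # clean pass
    if numeric_score > score_override_threshold:
        return "block"  # audited as action_reason: score_override_threshold
    return max(matched_actions, key=lambda a: ACTION_PRIORITY[a])

# Two same-severity patterns with different actions: "confirm" beats "warn".
print(determine_action(["warn", "confirm"], numeric_score=16))  # confirm
```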
#### Evaluation Result Schema
Every security evaluation produces the following result object:
```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SecurityEvaluation:
    # Identity
    evaluation_id: str   # UUID
    tool_call_id: str    # References the tool invocation
    lifecycle_point: str # pre-agent-start | pre-tool-call | post-tool-result
    timestamp: datetime

    # Matches
    matched_patterns: list[PatternMatch]  # All patterns that matched
    match_count: int

    # Hybrid risk model output
    numeric_score: int                # 0-100
    categorical_severity: str | None  # critical | high | medium | low | None
    action: str                       # block | redact | confirm | warn | log | allow

    # Audit fields
    scan_duration_ms: float
    pattern_library_version: str
    tenant_id: str
```
This schema is written to `sessions.db` for every evaluation — both positive detections and clean passes (score 0, no matches). Clean-pass logging enables retrospective analysis to confirm the security layer was active and functioning.
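As a sketch of what clean-pass logging could look like, the following inserts a minimal row into a `security_evaluations` table. SQLite is assumed for `sessions.db`, and the column set is reduced for illustration; the authoritative schema is defined elsewhere in this ADR's schema file.

```python
# Sketch: logging a clean-pass evaluation (score 0, no matches, action "allow").
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for sessions.db
conn.execute("""
    CREATE TABLE IF NOT EXISTS security_evaluations (
        evaluation_id        TEXT PRIMARY KEY,
        tool_call_id         TEXT NOT NULL,
        numeric_score        INTEGER NOT NULL,
        categorical_severity TEXT,              -- NULL on clean passes
        action               TEXT NOT NULL,
        match_count          INTEGER NOT NULL
    )
""")
# Clean passes are logged too, so the audit trail proves the layer was active.
conn.execute(
    "INSERT INTO security_evaluations VALUES (?, ?, ?, ?, ?, ?)",
    ("eval-001", "call-123", 0, None, "allow", 0),
)
conn.commit()
row = conn.execute(
    "SELECT action, match_count FROM security_evaluations").fetchone()
print(row)  # ('allow', 0)
```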
## Consequences

### Positive
- **Policy expressibility:** The numeric score enables threshold-based policy (`block_if_score_above: 85`) for operators who need fine-grained control; the categorical severity enables simple policy ("block all critical").
- **Auditability:** Every evaluation in `sessions.db` contains both the numeric score and the categorical severity, plus the matched pattern IDs. Compliance reviewers can reconstruct exactly why an action was taken.
- **Analytics:** Numeric scores can be trended over time, correlated with tenant activity, and used to identify emerging attack patterns (rising baseline scores across a tenant's sessions).
- **Aggregation without losing signal:** The diminishing-returns formula prevents trivial multi-match inflation while preserving the signal that many medium-severity matches carry meaningfully more risk than a single one.
- **Clean-pass logging:** Every tool call generates an evaluation record, not just those with detections. This provides a complete audit trail and a baseline for anomaly detection.
### Negative
- **Increased session log volume:** Logging every evaluation (including clean passes) adds write volume to `sessions.db`. Mitigation: clean-pass records can use a compact representation, while match-positive records use the full schema. `sessions.db` is explicitly classified as regenerable (ADR-118), so retention policies can be aggressive.
- **Weight calibration is opinionated:** The base weights (40/20/8/2) and the diminishing-returns factor (0.85) are engineering judgments, not derived from empirical data, and will require tuning based on real-world detection feedback. Initial values are informed by maxxie114's scoring approach but not directly ported.
- **Score override threshold:** The `score_override_threshold` adds a secondary decision path that can surprise operators ("why was this blocked? the severity was only medium"). Mitigation: when a score override triggers a block, the audit log explicitly states `action_reason: score_override_threshold`.
### Neutral
- The hybrid model is a superset of both maxxie114's numeric model and ClawGuardian's categorical model. Existing tooling that consumes only numeric scores or only categorical severities remains compatible with this model.
## Alternatives Considered

### Alternative A: Numeric Only (maxxie114 Model)
**Approach:** Produce a 0-100 score. The action is determined entirely by score thresholds.
Rejected because:
- A single critical pattern match (e.g., `ignore all instructions`) might score 40, which maps to a warning under some threshold configurations but deserves an immediate block.
- Score thresholds are tenant-configurable, meaning a misconfigured tenant could set a threshold that allows critical-severity matches to pass. Categorical severity provides a safety backstop.
- Audit logs that say "blocked: score 72" are harder to defend to a compliance reviewer than "blocked: critical severity match on `ignore_instructions` pattern."
### Alternative B: Categorical Only (ClawGuardian Model)

**Approach:** Each pattern has a severity and a directly mapped action. No numeric score. The highest-severity match determines the action.
Rejected because:
- No way to express "many medium matches = elevated risk." A session that triggers 20 medium-severity patterns looks identical to one that triggers 1.
- Operators cannot configure threshold-based policies ("alert if cumulative risk is high today").
- Session-level risk trending is impossible without a numeric dimension.
### Alternative C: Weighted Category Score (JaydenBeard Extension)

**Approach:** Assign a fixed weight to each severity category: score = `(critical_count * 40) + (high_count * 20) + (medium_count * 8) + (low_count * 2)`, capped at 100.
**Not selected as primary** (it is a simplified version of the chosen model):
- No diminishing returns — a single tool call with 50 low-severity matches would score 100, equivalent to a critical match. This produces misleading scores in high-noise environments.
- The chosen hybrid model extends this approach with diminishing returns and explicit categorical severity tracking, strictly adding capability without removing JaydenBeard's core insight.
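The gap is easy to quantify. For 50 low-severity matches (base weight 2, decay factor 0.85 from the model definition), the fixed-weight formula saturates the cap while the diminishing-returns model stays modest. This is a standalone arithmetic sketch, not code from any of the three repositories.

```python
# Compare Alternative C (fixed weights) with the chosen diminishing-returns model
# for 50 low-severity matches (base weight 2, decay 0.85, cap 100).
LOW_WEIGHT, DECAY, CAP = 2, 0.85, 100

fixed = min(CAP, LOW_WEIGHT * 50)  # Alternative C: 50 * 2 hits the cap
damped = min(CAP, round(sum(LOW_WEIGHT * DECAY ** n for n in range(50))))

print(fixed)   # 100 -- indistinguishable from a critical-severity stack
print(damped)  # 13  -- elevated, but well below the block range
```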
## Score Reference Table
For operator guidance and documentation:
| Score Range | Typical Pattern | Recommended Default Action |
|---|---|---|
| 0 | No matches | allow |
| 1-15 | 1-7 low-severity matches | log |
| 16-30 | 1-2 medium matches or many low | warn |
| 31-55 | Multiple medium or 1-2 high matches | confirm (or warn) |
| 56-79 | 1-2 high + medium, or many high | redact (or confirm) |
| 80-100 | Any critical, or high accumulation | block |
**Note:** The categorical-severity action (from pattern definitions) takes precedence over the score-based action, except when `score_override_threshold` is exceeded.
## Implementation Notes
- **Scoring engine:** `scripts/core/security/risk_scorer.py`
- **Schema:** `config/schemas/security-evaluation.yaml`
- **Storage:** `sessions.db`, table `security_evaluations`
- **Clean-pass record format:** Compact row with `match_count=0`, `numeric_score=0`, `action=allow`; no `matched_patterns` JSON blob, to reduce storage.
- **Calibration plan:** After 30 days of production data (pilot tenants), analyze score distributions and false-positive rates. Adjust base weights if critical matches are not clearly separated from high matches in score space.
- **Score override threshold:** Default 85, configurable per tenant in `config/tenants/{tenant-id}/security.yaml`.
## References
- maxxie114 scoring: `sanitizer.py`, `calculate_risk_score()` function
- ClawGuardian severity model: `patterns/*.ts`, `Severity` type and `DEFAULT_ACTIONS` mapping
- JaydenBeard category arrays: `lib/risk-analyzer.js`, `CRITICAL_SHELL_PATTERNS`, `HIGH_RISK_SHELL_PATTERNS`, etc.
- Research context: `analyze-new-artifacts/clawguard-ai-agent-security/artifacts/research-context.json`
- Related ADRs: ADR-001 (Security Layer Architecture), ADR-002 (Pattern Library Format), ADR-003 (Fail Behavior)