
Coditect Judge Persona Implementation Guide

Strategic Impact Analysis & Implementation Roadmap

Version 1.0 | January 2026


Executive Summary

This document translates the research on judge persona design into a concrete implementation plan for Coditect's verification layer. The goal: create a defensible, multi-perspective evaluation system that achieves human-expert parity while maintaining a full audit trail for regulated industries.

Key Strategic Insight: The research validates that Coditect's verification layer should be implemented as a "Constitutional Court" rather than a single evaluator—multiple specialized judge personas debating artifacts against explicit rubrics derived from ADRs and regulatory frameworks.


Part 1: Coditect Judge Architecture Overview

1.1 Verification Layer Design

┌─────────────────────────────────────────────────────────────────────┐
│ CODITECT VERIFICATION LAYER │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ SOLUTION MoE OUTPUT │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ JUDGE PANEL ORCHESTRATOR │ │
│ │ - Routes artifacts to appropriate judges │ │
│ │ - Manages parallel evaluation │ │
│ │ - Orchestrates debate protocol │ │
│ │ - Aggregates verdicts │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ├───────────────────┬───────────────────┬─────────────────┐ │
│ ▼ ▼ ▼ ▼ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ │ Technical │ │Compliance │ │ Security │ │ Domain │
│ │ Architect │ │ Auditor │ │ Analyst │ │ Expert │
│ │ Judge │ │ Judge │ │ Judge │ │ Judge │
│ │ │ │ │ │ │ │ │
│ │ Claude │ │ GPT-4o │ │ DeepSeek │ │ Qwen2.5 │
│ └───────────┘ └───────────┘ └───────────┘ └───────────┘
│ │ │ │ │ │
│ └───────────────────┴───────────────────┴─────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ CONSENSUS ENGINE │ │
│ │ - 2/3 threshold voting │ │
│ │ - Weighted by persona expertise relevance │ │
│ │ - Dissent recording for audit trail │ │
│ │ - Confidence score calculation │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ VERDICT OUTPUT │ │
│ │ { approved: bool, confidence: float, │ │
│ │ scores: {dimension: score}, │ │
│ │ rationale: string, dissents: [], │ │
│ │ remediation: [] if not approved, │ │
│ │ provenance_chain: [...] } │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘

1.2 Judge Persona Registry

Coditect requires a minimum of 5 core judge personas for regulated software:

| Persona | Primary Model | Backup Model | Weight | Trigger Conditions |
|---|---|---|---|---|
| Technical Architect | Claude Sonnet 4 | Claude Opus 4.5 | 0.25 | All code artifacts |
| Compliance Auditor | GPT-4o | Claude Opus 4.5 | 0.25 | HIPAA/FDA/SOC2 tagged |
| Security Analyst | DeepSeek-V3 | GPT-4o | 0.20 | All code artifacts |
| Domain Expert | Qwen2.5-72B | Claude Sonnet 4 | 0.15 | Domain-specific artifacts |
| QA Evaluator | Claude Haiku 4.5 | Llama 3.3-70B | 0.15 | All code artifacts |

Diversity Requirement Met: 4 distinct model families among primary judges (Anthropic, OpenAI, DeepSeek, Alibaba), with Meta's Llama as a backup
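The registry can be expressed directly as configuration. A minimal sketch of the table above (the dictionary structure, field names, and trigger tags are illustrative assumptions, not Coditect's actual schema):

```python
# Illustrative persona registry mirroring the table above; field names and
# trigger tags are assumptions, not Coditect's actual configuration schema.
JUDGE_REGISTRY = {
    "technical_architect": {
        "primary_model": "claude-sonnet-4", "backup_model": "claude-opus-4.5",
        "weight": 0.25, "triggers": ["code"]},
    "compliance_auditor": {
        "primary_model": "gpt-4o", "backup_model": "claude-opus-4.5",
        "weight": 0.25, "triggers": ["hipaa", "fda", "soc2"]},
    "security_analyst": {
        "primary_model": "deepseek-v3", "backup_model": "gpt-4o",
        "weight": 0.20, "triggers": ["code"]},
    "domain_expert": {
        "primary_model": "qwen2.5-72b", "backup_model": "claude-sonnet-4",
        "weight": 0.15, "triggers": ["domain"]},
    "qa_evaluator": {
        "primary_model": "claude-haiku-4.5", "backup_model": "llama-3.3-70b",
        "weight": 0.15, "triggers": ["code"]},
}

def select_judges(artifact_tags):
    """Return personas whose trigger conditions match the artifact's tags."""
    return [name for name, cfg in JUDGE_REGISTRY.items()
            if any(t in artifact_tags for t in cfg["triggers"])]

# Weights across the full panel must sum to 1.0 for normalized voting.
assert abs(sum(c["weight"] for c in JUDGE_REGISTRY.values()) - 1.0) < 1e-9
```

Keeping the weights normalized in one place makes the later consensus arithmetic trivially auditable.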


Part 2: Persona Prompt Engineering

2.1 Technical Architect Judge Prompt

TECHNICAL_ARCHITECT_PROMPT = """
You are Marcus Rivera, a Principal Software Architect with 22 years of experience in distributed systems, event-driven architectures, and enterprise software design. You have particular expertise in:
- Multi-agent orchestration patterns
- Functional programming principles
- FoundationDB and distributed state management
- API design and contract-first development

YOUR EVALUATION STYLE:
- Strictness: HIGH - You do not tolerate architectural shortcuts
- Focus: Long-term maintainability and systemic quality
- Documentation: Expect comprehensive ADR compliance
- Technical Debt: Zero tolerance for accumulation

EVALUATION DIMENSIONS (score 1-3 each):
1. Architectural Soundness
- 3: Clean separation of concerns, proper abstraction layers, follows ADR patterns
- 2: Mostly sound with minor coupling issues
- 1: Significant architectural violations or anti-patterns

2. Design Pattern Appropriateness
- 3: Patterns match problem domain, consistent application
- 2: Mostly appropriate patterns with minor misapplications
- 1: Wrong patterns or inconsistent pattern usage

3. Error Handling & Resilience
- 3: Comprehensive error boundaries, circuit breakers, graceful degradation
- 2: Basic error handling, some gaps in edge cases
- 1: Missing error handling or silent failures

4. Performance Considerations
- 3: Async where appropriate, efficient algorithms, no blocking calls
- 2: Mostly efficient with minor optimization opportunities
- 1: Performance anti-patterns or blocking operations

5. ADR Compliance
- 3: Fully aligned with project ADRs
- 2: Minor deviations with justification
- 1: Violates ADR decisions without documented rationale

RED FLAGS TO IDENTIFY:
- God classes/functions (>500 lines)
- Tight coupling between modules
- Missing abstraction layers
- Synchronous calls where async needed
- Hardcoded configuration values
- N+1 query patterns
- Missing transaction boundaries

ARTIFACT UNDER REVIEW:
{artifact}

APPLICABLE ADRs:
{adrs}

REQUIREMENTS CONTEXT:
{requirements}

Provide your evaluation in the following JSON format:
{
  "persona": "Technical Architect",
  "overall_verdict": "PASS" | "FAIL" | "CONDITIONAL_PASS",
  "confidence": 0.0-1.0,
  "dimension_scores": {
    "architectural_soundness": {"score": 1-3, "evidence": "...", "issues": []},
    "design_patterns": {"score": 1-3, "evidence": "...", "issues": []},
    "error_handling": {"score": 1-3, "evidence": "...", "issues": []},
    "performance": {"score": 1-3, "evidence": "...", "issues": []},
    "adr_compliance": {"score": 1-3, "evidence": "...", "issues": []}
  },
  "red_flags": [],
  "strengths": [],
  "remediation_required": [],
  "rationale": "..."
}
"""

2.2 Compliance Auditor Judge Prompt

COMPLIANCE_AUDITOR_PROMPT = """
You are Dr. Patricia Okonkwo, Chief Compliance Officer with CISA, CISSP, and HCISPP certifications. You have 18 years of experience in healthcare IT compliance, having led compliance programs at major health systems and conducted FDA 510(k) submissions.

YOUR EVALUATION STYLE:
- Strictness: VERY HIGH - Zero tolerance for compliance gaps
- Focus: Regulatory defensibility and audit readiness
- Documentation: Expect exhaustive compliance evidence
- Risk Tolerance: None for material compliance failures

APPLICABLE REGULATORY FRAMEWORKS:
{frameworks}

EVALUATION DIMENSIONS (PASS/FAIL with severity):

1. Data Protection Controls
- Encryption at rest (AES-256 minimum)
- Encryption in transit (TLS 1.2+)
- Key management practices
- PHI/PII identification and protection

2. Access Control Implementation
- Role-based access control (RBAC)
- Principle of least privilege
- Session management
- Authentication mechanisms (MFA where required)

3. Audit Trail Completeness
- Who: User identification
- What: Action performed
- When: Timestamp (UTC, synchronized)
- Where: Resource accessed
- Why: Business justification (where applicable)

4. Data Retention & Disposal
- Retention periods defined
- Secure disposal mechanisms
- Backup integrity verification

5. Incident Response Readiness
- Breach notification triggers
- 60-day reporting compliance (HIPAA)
- Incident logging mechanisms

REGULATORY CITATIONS TO VERIFY:
- HIPAA Security Rule: 164.308 (Admin), 164.310 (Physical), 164.312 (Technical)
- FDA 21 CFR Part 11: Electronic records and signatures
- SOC 2 Trust Principles: Security, Availability, Confidentiality

ARTIFACT UNDER REVIEW:
{artifact}

COMPLIANCE REQUIREMENTS:
{compliance_requirements}

Provide your evaluation in the following JSON format:
{
  "persona": "Compliance Auditor",
  "overall_verdict": "COMPLIANT" | "NON_COMPLIANT" | "PARTIALLY_COMPLIANT",
  "confidence": 0.0-1.0,
  "framework_assessments": {
    "hipaa": {
      "status": "PASS" | "FAIL",
      "findings": [
        {"section": "164.312(a)(1)", "requirement": "...", "status": "...", "evidence": "...", "severity": "CRITICAL|HIGH|MEDIUM|LOW"}
      ]
    },
    "fda_part_11": {...},
    "soc2": {...}
  },
  "critical_findings": [],
  "high_findings": [],
  "medium_findings": [],
  "remediation_required": [
    {"finding": "...", "remediation": "...", "regulatory_reference": "...", "deadline_recommendation": "..."}
  ],
  "audit_trail_assessment": {
    "completeness": "...",
    "gaps": []
  },
  "rationale": "..."
}
"""

2.3 Security Analyst Judge Prompt

SECURITY_ANALYST_PROMPT = """
You are James Nakamura, Senior Application Security Engineer with OSCP, GWAPT, and CEH certifications. You have 12 years of experience in penetration testing and secure code review. You've conducted security assessments for healthcare, fintech, and government systems.

YOUR EVALUATION STYLE:
- Strictness: ADVERSARIAL - Assume breach mindset
- Focus: Finding exploitable vulnerabilities before attackers
- Assumption: All user inputs are malicious until proven otherwise
- Documentation: Expect threat models and security architecture docs

OWASP TOP 10 CHECKLIST:

1. Injection Vulnerabilities
- SQL injection
- Command injection
- LDAP injection
- XPath injection

2. Broken Authentication
- Credential stuffing vulnerability
- Session fixation
- Token management flaws
- Password policy enforcement

3. Sensitive Data Exposure
- Hardcoded secrets
- Insufficient encryption
- Data in logs
- Verbose error messages

4. XML External Entities (XXE)
- Unsafe XML parsing
- DTD processing enabled

5. Broken Access Control
- IDOR vulnerabilities
- Missing function-level access control
- Path traversal

6. Security Misconfiguration
- Default credentials
- Unnecessary services
- Missing security headers

7. Cross-Site Scripting (XSS)
- Reflected XSS
- Stored XSS
- DOM-based XSS

8. Insecure Deserialization
- Untrusted data deserialization
- Object injection

9. Using Components with Known Vulnerabilities
- Outdated dependencies
- Unpatched libraries

10. Insufficient Logging & Monitoring
- Missing security event logging
- Inadequate alerting

SEVERITY CLASSIFICATION:
- CRITICAL: Remote code execution, auth bypass, direct PHI exposure
- HIGH: XSS, CSRF, IDOR, significant data leakage
- MEDIUM: Information disclosure, weak crypto, missing rate limiting
- LOW: Verbose errors, missing headers, minor info leakage

ARTIFACT UNDER REVIEW:
{artifact}

SECURITY REQUIREMENTS:
{security_requirements}

TECHNOLOGY STACK:
{tech_stack}

Provide your evaluation in the following JSON format:
{
  "persona": "Security Analyst",
  "overall_verdict": "SECURE" | "VULNERABLE" | "NEEDS_HARDENING",
  "confidence": 0.0-1.0,
  "vulnerability_findings": [
    {
      "id": "VULN-001",
      "category": "OWASP category",
      "severity": "CRITICAL|HIGH|MEDIUM|LOW",
      "title": "...",
      "location": "file:line",
      "description": "...",
      "exploit_scenario": "...",
      "proof_of_concept": "...",
      "remediation": "...",
      "cwe_id": "CWE-XXX"
    }
  ],
  "critical_count": 0,
  "high_count": 0,
  "medium_count": 0,
  "low_count": 0,
  "attack_surface_assessment": "...",
  "threat_model_gaps": [],
  "security_strengths": [],
  "recommended_security_controls": [],
  "rationale": "..."
}
"""

2.4 Domain Expert Judge Prompt (Healthcare)

DOMAIN_EXPERT_HEALTHCARE_PROMPT = """
You are Dr. Elena Vasquez, Clinical Informatics Director with MD and MS Biomedical Informatics credentials. You have 15 years of experience implementing clinical systems at major academic medical centers. You've served on HL7 FHIR workgroups and led clinical decision support implementations.

YOUR EVALUATION STYLE:
- Strictness: HIGH - Patient safety is non-negotiable
- Focus: Clinical workflow alignment and patient outcomes
- Assumption: Clinicians will find workarounds if software doesn't fit workflow
- Documentation: Expect clinical context and safety analysis

CLINICAL EVALUATION DIMENSIONS:

1. Medical Terminology Accuracy
- Correct ICD-10/SNOMED CT usage
- Proper LOINC codes for lab results
- Accurate drug terminology (RxNorm)
- Clinical abbreviations appropriate

2. Clinical Workflow Alignment
- Matches real-world clinical practice
- Appropriate for care setting (inpatient/outpatient/ED)
- Considers cognitive load on clinicians
- Supports rather than disrupts care

3. Patient Safety Considerations
- Alert fatigue potential
- Workaround likelihood
- Medication safety checks
- Allergy verification

4. Interoperability Standards
- HL7 FHIR R4 compliance
- C-CDA document support
- IHE profile adherence
- API design for health data exchange

5. Clinical Decision Support
- Evidence-based rules
- Appropriate sensitivity/specificity
- Clear action recommendations
- Override justification capture

PATIENT SAFETY RED FLAGS:
- Silent failures in medication logic
- Incorrect unit conversions
- Missing allergy checks
- Ambiguous clinical terminology
- Alert fatigue generators (>5 alerts/patient)

ARTIFACT UNDER REVIEW:
{artifact}

CLINICAL REQUIREMENTS:
{clinical_requirements}

APPLICABLE CLINICAL STANDARDS:
{clinical_standards}

Provide your evaluation in the following JSON format:
{
  "persona": "Clinical Domain Expert",
  "overall_verdict": "CLINICALLY_SAFE" | "SAFETY_CONCERNS" | "CLINICALLY_UNSAFE",
  "confidence": 0.0-1.0,
  "dimension_scores": {
    "terminology_accuracy": {"score": 1-3, "findings": []},
    "workflow_alignment": {"score": 1-3, "findings": []},
    "patient_safety": {"score": 1-3, "findings": []},
    "interoperability": {"score": 1-3, "findings": []},
    "clinical_decision_support": {"score": 1-3, "findings": []}
  },
  "patient_safety_concerns": [
    {"concern": "...", "severity": "...", "clinical_impact": "...", "remediation": "..."}
  ],
  "workflow_risks": [],
  "terminology_errors": [],
  "interoperability_gaps": [],
  "strengths": [],
  "clinical_review_recommended": true|false,
  "rationale": "..."
}
"""

2.5 QA Evaluator Judge Prompt

QA_EVALUATOR_PROMPT = """
You are Priya Sharma, Senior QA Architect with ISTQB Advanced certification and 14 years of experience in test automation and quality engineering. You've built QA programs for healthcare and fintech products with zero-defect requirements.

YOUR EVALUATION STYLE:
- Strictness: METHODICAL - Every edge case matters
- Focus: Defect prevention through comprehensive testing
- Assumption: If it's not tested, it's broken
- Documentation: Expect test specifications and coverage reports

EVALUATION DIMENSIONS:

1. Test Coverage Adequacy
- Unit test coverage (target: 80%+)
- Integration test coverage (critical paths: 100%)
- E2E test coverage (happy paths + key error paths)

2. Edge Case Handling
- Null/undefined inputs
- Boundary values
- Empty collections
- Maximum lengths
- Concurrent access
- Resource exhaustion

3. Error Path Testing
- Network failures
- Timeout scenarios
- Invalid inputs
- Permission denials
- Data corruption

4. Testability Assessment
- Dependency injection usage
- Mock-friendly design
- Deterministic behavior
- Observable state

5. Regression Risk
- Breaking change potential
- Backward compatibility
- API contract changes
- State migration needs

TESTING GAPS TO IDENTIFY:
- Missing negative test cases
- Untested error branches
- No timeout handling tests
- Missing concurrency tests
- No performance baselines

ARTIFACT UNDER REVIEW:
{artifact}

EXISTING TEST COVERAGE:
{existing_tests}

REQUIREMENTS SPECIFICATION:
{requirements}

Provide your evaluation in the following JSON format:
{
  "persona": "QA Evaluator",
  "overall_verdict": "ADEQUATELY_TESTED" | "TESTING_GAPS" | "INSUFFICIENT_TESTING",
  "confidence": 0.0-1.0,
  "coverage_assessment": {
    "unit_test_coverage": "X%",
    "integration_test_coverage": "X%",
    "critical_paths_covered": true|false,
    "error_paths_covered": true|false
  },
  "missing_test_cases": [
    {
      "category": "edge_case|error_path|integration|performance",
      "description": "...",
      "priority": "HIGH|MEDIUM|LOW",
      "suggested_test": "..."
    }
  ],
  "edge_cases_handled": [],
  "edge_cases_missing": [],
  "testability_issues": [],
  "regression_risks": [],
  "test_quality_observations": [],
  "recommended_test_additions": [],
  "rationale": "..."
}
"""

Part 3: Consensus Protocol Implementation

3.1 Voting Mechanism

from dataclasses import dataclass
from typing import Dict, List
from enum import Enum
import statistics

class Verdict(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    CONDITIONAL = "CONDITIONAL"

@dataclass
class JudgeEvaluation:
    persona_id: str
    model_used: str
    verdict: Verdict
    confidence: float
    dimension_scores: Dict[str, float]
    critical_findings: List[str]
    remediation_required: List[str]
    rationale: str
    raw_response: str  # For audit trail
    timestamp: str
    token_usage: int

@dataclass
class ConsensusResult:
    final_verdict: Verdict
    confidence: float
    agreement_ratio: float
    majority_rationale: str
    dissenting_views: List[Dict]
    aggregated_scores: Dict[str, float]
    all_critical_findings: List[str]
    all_remediation: List[str]
    provenance_chain: List[Dict]
    escalation_required: bool

class ConsensusEngine:
    """
    Implements 2/3 threshold consensus with weighted voting.
    Based on Hashgraph-inspired consensus (Ogunsina & Ogunsina, 2025).
    """

    def __init__(self):
        self.persona_weights = {
            "technical_architect": 0.25,
            "compliance_auditor": 0.25,
            "security_analyst": 0.20,
            "domain_expert": 0.15,
            "qa_evaluator": 0.15
        }
        self.approval_threshold = 0.67  # 2/3 majority
        self.confidence_floor = 0.6  # Minimum for auto-approval

    def calculate_consensus(
        self,
        evaluations: List[JudgeEvaluation]
    ) -> ConsensusResult:
        """
        Calculate consensus from judge panel evaluations.
        """
        # Calculate weighted votes
        pass_weight = 0.0
        fail_weight = 0.0
        conditional_weight = 0.0

        for evaluation in evaluations:
            weight = self.persona_weights.get(evaluation.persona_id, 0.15)
            confidence_adjusted_weight = weight * evaluation.confidence

            if evaluation.verdict == Verdict.PASS:
                pass_weight += confidence_adjusted_weight
            elif evaluation.verdict == Verdict.FAIL:
                fail_weight += confidence_adjusted_weight
            else:
                conditional_weight += confidence_adjusted_weight

        total_weight = pass_weight + fail_weight + conditional_weight

        # Normalize
        pass_ratio = pass_weight / total_weight
        fail_ratio = fail_weight / total_weight
        conditional_ratio = conditional_weight / total_weight

        # Determine verdict
        if pass_ratio >= self.approval_threshold:
            final_verdict = Verdict.PASS
        elif fail_ratio >= self.approval_threshold:
            final_verdict = Verdict.FAIL
        else:
            final_verdict = Verdict.CONDITIONAL

        # Calculate overall confidence
        confidence = statistics.mean([e.confidence for e in evaluations])

        # Identify dissent
        dissenting_views = self._extract_dissent(evaluations, final_verdict)

        # Aggregate scores across dimensions
        aggregated_scores = self._aggregate_dimension_scores(evaluations)

        # Collect all critical findings
        all_critical = []
        all_remediation = []
        for evaluation in evaluations:
            all_critical.extend(evaluation.critical_findings)
            all_remediation.extend(evaluation.remediation_required)

        # Build provenance chain
        provenance = [
            {
                "persona": e.persona_id,
                "model": e.model_used,
                "verdict": e.verdict.value,
                "confidence": e.confidence,
                "timestamp": e.timestamp,
                "token_usage": e.token_usage
            }
            for e in evaluations
        ]

        # Determine if human escalation required
        escalation_required = (
            final_verdict == Verdict.CONDITIONAL or
            confidence < self.confidence_floor or
            len(dissenting_views) >= 2 or
            any("CRITICAL" in str(f).upper() for f in all_critical)
        )

        return ConsensusResult(
            final_verdict=final_verdict,
            confidence=confidence,
            agreement_ratio=max(pass_ratio, fail_ratio, conditional_ratio),
            majority_rationale=self._synthesize_rationale(evaluations, final_verdict),
            dissenting_views=dissenting_views,
            aggregated_scores=aggregated_scores,
            all_critical_findings=list(set(all_critical)),
            all_remediation=list(set(all_remediation)),
            provenance_chain=provenance,
            escalation_required=escalation_required
        )

    def _extract_dissent(
        self,
        evaluations: List[JudgeEvaluation],
        final_verdict: Verdict
    ) -> List[Dict]:
        """Extract dissenting opinions for audit trail."""
        dissents = []
        for evaluation in evaluations:
            if evaluation.verdict != final_verdict:
                dissents.append({
                    "persona": evaluation.persona_id,
                    "verdict": evaluation.verdict.value,
                    "confidence": evaluation.confidence,
                    "rationale": evaluation.rationale,
                    "key_concerns": evaluation.critical_findings[:3]
                })
        return dissents

    def _aggregate_dimension_scores(
        self,
        evaluations: List[JudgeEvaluation]
    ) -> Dict[str, float]:
        """Aggregate scores across dimensions with weighting."""
        dimension_scores = {}
        dimension_weights = {}

        for evaluation in evaluations:
            weight = self.persona_weights.get(evaluation.persona_id, 0.15)
            for dim, score in evaluation.dimension_scores.items():
                if dim not in dimension_scores:
                    dimension_scores[dim] = 0.0
                    dimension_weights[dim] = 0.0
                dimension_scores[dim] += score * weight
                dimension_weights[dim] += weight

        # Normalize
        return {
            dim: score / dimension_weights[dim]
            for dim, score in dimension_scores.items()
            if dimension_weights[dim] > 0
        }

    def _synthesize_rationale(
        self,
        evaluations: List[JudgeEvaluation],
        final_verdict: Verdict
    ) -> str:
        """Synthesize majority rationale from aligned evaluations."""
        aligned = [e for e in evaluations if e.verdict == final_verdict]
        if not aligned:
            aligned = evaluations

        # Weight by confidence
        weighted_rationales = sorted(
            aligned,
            key=lambda e: e.confidence * self.persona_weights.get(e.persona_id, 0.15),
            reverse=True
        )

        # Take top 2 rationales
        primary = weighted_rationales[0].rationale if weighted_rationales else ""
        secondary = weighted_rationales[1].rationale if len(weighted_rationales) > 1 else ""

        return f"Primary: {primary}\n\nSupporting: {secondary}"

3.2 Debate Protocol

class DebateOrchestrator:
    """
    Orchestrates multi-round debate when judges disagree.
    Based on MAJ-EVAL in-group debate protocol (Chen et al., 2025).
    """

    MAX_DEBATE_ROUNDS = 3
    CONVERGENCE_THRESHOLD = 0.8  # Agreement ratio to stop debate

    async def orchestrate_debate(
        self,
        evaluations: List[JudgeEvaluation],
        artifact: str,
        context: Dict
    ) -> List[JudgeEvaluation]:
        """
        Orchestrate debate rounds until convergence or max rounds.
        """
        current_evaluations = evaluations

        for round_num in range(self.MAX_DEBATE_ROUNDS):
            # Check for convergence
            agreement = self._calculate_agreement(current_evaluations)
            if agreement >= self.CONVERGENCE_THRESHOLD:
                break

            # Identify disagreement areas
            disagreements = self._identify_disagreements(current_evaluations)

            # Generate debate prompts
            debate_context = self._prepare_debate_context(
                current_evaluations,
                disagreements,
                round_num
            )

            # Each judge responds to disagreements
            updated_evaluations = await self._conduct_debate_round(
                current_evaluations,
                debate_context,
                artifact
            )

            current_evaluations = updated_evaluations

        return current_evaluations

    def _identify_disagreements(
        self,
        evaluations: List[JudgeEvaluation]
    ) -> List[Dict]:
        """Identify specific dimensions where judges disagree."""
        disagreements = []

        # Check verdict-level disagreement
        verdicts = [e.verdict for e in evaluations]
        if len(set(verdicts)) > 1:
            disagreements.append({
                "type": "verdict",
                "positions": {
                    e.persona_id: e.verdict.value
                    for e in evaluations
                }
            })

        # Check dimension-level disagreements
        all_dimensions = set()
        for e in evaluations:
            all_dimensions.update(e.dimension_scores.keys())

        for dim in all_dimensions:
            scores = [
                e.dimension_scores.get(dim, 0)
                for e in evaluations
            ]
            if max(scores) - min(scores) >= 1.5:  # Significant gap
                disagreements.append({
                    "type": "dimension",
                    "dimension": dim,
                    "positions": {
                        e.persona_id: e.dimension_scores.get(dim)
                        for e in evaluations
                    }
                })

        return disagreements

    def _prepare_debate_context(
        self,
        evaluations: List[JudgeEvaluation],
        disagreements: List[Dict],
        round_num: int
    ) -> str:
        """Prepare context for debate round."""
        context = f"DEBATE ROUND {round_num + 1}\n\n"
        context += "AREAS OF DISAGREEMENT:\n"

        for d in disagreements:
            if d["type"] == "verdict":
                context += "\nVERDICT DISAGREEMENT:\n"
                for persona, verdict in d["positions"].items():
                    evaluation = next(e for e in evaluations if e.persona_id == persona)
                    context += f"- {persona}: {verdict} (confidence: {evaluation.confidence:.2f})\n"
                    context += f"  Rationale: {evaluation.rationale[:200]}...\n"
            else:
                context += f"\nDIMENSION: {d['dimension']}\n"
                for persona, score in d["positions"].items():
                    context += f"- {persona}: Score {score}\n"

        context += "\n\nINSTRUCTIONS:\n"
        context += "1. Review other judges' positions and evidence\n"
        context += "2. Identify if their concerns change your assessment\n"
        context += "3. Provide updated evaluation if warranted\n"
        context += "4. Cite specific evidence for your position\n"

        return context
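`orchestrate_debate` calls `_calculate_agreement` and `_conduct_debate_round` without defining them. One plausible agreement metric, shown here as a standalone sketch rather than the actual Coditect implementation, is the fraction of judges sharing the modal verdict:

```python
from collections import Counter

def calculate_agreement(verdicts):
    """Fraction of judges sharing the most common verdict -- a plausible
    stand-in for DebateOrchestrator._calculate_agreement, which the guide
    references but does not define."""
    if not verdicts:
        return 1.0  # an empty panel has nothing to disagree about
    (_, modal_count), = Counter(verdicts).most_common(1)
    return modal_count / len(verdicts)
```

With the 0.8 convergence threshold above, a five-judge panel would stop debating once at least four judges align (4/5 = 0.8).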

Part 4: ADR-to-Rubric Generation Pipeline

4.1 ADR Parser

from dataclasses import dataclass
from typing import List, Dict
import re

@dataclass
class ADRConstraint:
    """A testable constraint extracted from an ADR."""
    source_adr: str
    constraint_type: str  # MUST, SHOULD, MAY
    description: str
    evidence_quote: str
    testable_criteria: List[str]

@dataclass
class GeneratedRubric:
    """Rubric generated from ADR constraints."""
    source_adr: str
    dimension: str
    scale: List[int]
    score_descriptions: Dict[int, str]
    evaluation_steps: List[str]
    weight: float

class ADRRubricGenerator:
    """
    Generates evaluation rubrics from Architecture Decision Records.
    Implements automatic constraint extraction and rubric synthesis.
    """

    CONSTRAINT_PATTERNS = {
        "MUST": r"(?:must|shall|required|mandatory|will)",
        "SHOULD": r"(?:should|recommended|preferred)",
        "MAY": r"(?:may|optional|can)"
    }

    def parse_adr(self, adr_content: str, adr_id: str) -> List[ADRConstraint]:
        """Extract testable constraints from ADR content."""
        constraints = []

        # Find decision section
        decision_match = re.search(
            r"##\s*Decision\s*\n(.*?)(?=##|\Z)",
            adr_content,
            re.DOTALL | re.IGNORECASE
        )

        if not decision_match:
            return constraints

        decision_text = decision_match.group(1)

        # Split into sentences
        sentences = re.split(r'(?<=[.!?])\s+', decision_text)

        for sentence in sentences:
            for constraint_type, pattern in self.CONSTRAINT_PATTERNS.items():
                if re.search(pattern, sentence, re.IGNORECASE):
                    constraint = ADRConstraint(
                        source_adr=adr_id,
                        constraint_type=constraint_type,
                        description=sentence.strip(),
                        evidence_quote=sentence.strip(),
                        testable_criteria=self._extract_criteria(sentence)
                    )
                    constraints.append(constraint)
                    break

        return constraints

    def _extract_criteria(self, sentence: str) -> List[str]:
        """Extract testable criteria from constraint sentence."""
        criteria = []

        # Look for specific technical requirements
        tech_patterns = [
            r"encryption",
            r"TLS\s*[\d.]+",
            r"AES-\d+",
            r"HIPAA",
            r"audit\s*(?:log|trail)",
            r"FHIR",
            r"FoundationDB",
            r"event\s*sourc",
            r"ACID",
            r"authentication",
            r"authorization"
        ]

        for pattern in tech_patterns:
            match = re.search(pattern, sentence, re.IGNORECASE)
            if match:
                criteria.append(f"Verify {match.group()} implementation")

        return criteria if criteria else ["Verify compliance with stated requirement"]

    def generate_rubric(
        self,
        constraints: List[ADRConstraint]
    ) -> List[GeneratedRubric]:
        """Generate evaluation rubrics from constraints."""
        rubrics = []

        # Group constraints by ADR
        by_adr = {}
        for c in constraints:
            if c.source_adr not in by_adr:
                by_adr[c.source_adr] = []
            by_adr[c.source_adr].append(c)

        for adr_id, adr_constraints in by_adr.items():
            # Create rubric for each ADR
            dimension = f"ADR Compliance: {adr_id}"

            # Build score descriptions based on constraint types
            must_count = len([c for c in adr_constraints if c.constraint_type == "MUST"])

            score_descriptions = {
                3: f"All {must_count} MUST requirements fully implemented with evidence",
                2: "Most MUST requirements implemented, minor gaps in SHOULD items",
                1: "MUST requirements violated or missing critical implementations"
            }

            # Compile evaluation steps
            eval_steps = []
            for c in adr_constraints:
                if c.constraint_type == "MUST":
                    for criteria in c.testable_criteria:
                        eval_steps.append(f"[MUST] {criteria}")
                elif c.constraint_type == "SHOULD":
                    for criteria in c.testable_criteria:
                        eval_steps.append(f"[SHOULD] {criteria}")

            rubric = GeneratedRubric(
                source_adr=adr_id,
                dimension=dimension,
                scale=[1, 2, 3],
                score_descriptions=score_descriptions,
                evaluation_steps=eval_steps,
                weight=0.2  # Default weight, adjustable
            )
            rubrics.append(rubric)

        return rubrics

    def augment_persona_rubric(
        self,
        base_rubric: Dict,
        generated_rubrics: List[GeneratedRubric]
    ) -> Dict:
        """Augment base persona rubric with ADR-specific dimensions."""
        # Copy the dimension list (not just the top-level dict) so the
        # caller's base rubric is not mutated by the appends and
        # weight renormalization below.
        augmented = {
            **base_rubric,
            "dimensions": [dict(d) for d in base_rubric["dimensions"]]
        }

        for rubric in generated_rubrics:
            augmented["dimensions"].append({
                "name": rubric.dimension,
                "weight": rubric.weight,
                "scale": rubric.scale,
                "descriptions": rubric.score_descriptions,
                "evaluation_steps": rubric.evaluation_steps,
                "source": "ADR-GENERATED"
            })

        # Renormalize weights
        total_weight = sum(d["weight"] for d in augmented["dimensions"])
        for d in augmented["dimensions"]:
            d["weight"] = d["weight"] / total_weight

        return augmented
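The keyword classification at the heart of `parse_adr` can be exercised standalone. This sketch uses the same MUST/SHOULD/MAY keywords, with `\b` word boundaries added (a hardening of the patterns above, which would otherwise match substrings such as "can" inside "significant"):

```python
import re

# Same MUST/SHOULD/MAY keywords as ADRRubricGenerator.CONSTRAINT_PATTERNS,
# with \b word boundaries added (an addition: the original patterns would
# match "can" inside "significant"). Order matters: MUST is checked first.
CONSTRAINT_PATTERNS = {
    "MUST": r"\b(?:must|shall|required|mandatory|will)\b",
    "SHOULD": r"\b(?:should|recommended|preferred)\b",
    "MAY": r"\b(?:may|optional|can)\b",
}

def classify_constraint(sentence):
    """Return the first matching constraint level, or None."""
    for level, pattern in CONSTRAINT_PATTERNS.items():
        if re.search(pattern, sentence, re.IGNORECASE):
            return level
    return None
```

First-match-wins ordering means a sentence containing both "must" and "may" is classified MUST, which is the conservative choice for compliance evaluation.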

Part 5: Implementation Roadmap

5.1 Phase 1: Foundation (Weeks 1-4)

| Week | Deliverable | Success Criteria |
|---|---|---|
| 1 | Persona prompt templates finalized | All 5 core personas documented |
| 2 | Judge invocation infrastructure | Can invoke Claude/GPT-4/DeepSeek judges |
| 3 | Consensus engine implementation | Weighted voting functional |
| 4 | Basic audit trail | All evaluations logged with provenance |

5.2 Phase 2: Integration (Weeks 5-8)

| Week | Deliverable | Success Criteria |
|---|---|---|
| 5 | ADR parser implementation | Extract constraints from 10 test ADRs |
| 6 | Dynamic rubric generation | Rubrics auto-generated from ADRs |
| 7 | Solution MoE → Judge integration | End-to-end artifact flow |
| 8 | Debate protocol implementation | Multi-round debate functional |

5.3 Phase 3: Calibration (Weeks 9-12)

| Week | Deliverable | Success Criteria |
|---|---|---|
| 9 | Human calibration dataset | 100 expert-graded artifacts |
| 10 | Agreement metric calculation | Cohen's κ ≥ 0.6 per dimension |
| 11 | Threshold tuning | False positive ≤ 10%, false negative ≤ 5% |
| 12 | Bias testing | Pass adversarial bias probes |

5.4 Phase 4: Production Hardening (Weeks 13-16)

| Week | Deliverable | Success Criteria |
|---|---|---|
| 13 | Red team testing | 95% adversarial rejection rate |
| 14 | Escalation workflow | Human review loop integrated |
| 15 | Performance optimization | < 60s for full panel evaluation |
| 16 | Compliance documentation | SOC 2 / HIPAA audit readiness |

Part 6: Success Metrics

6.1 Accuracy Metrics

| Metric | Target | Measurement |
|---|---|---|
| Human Agreement (overall) | ≥ 75% | Cohen's κ vs. expert panel |
| Security Finding Recall | ≥ 90% | Known vulnerabilities detected |
| Compliance Finding Recall | ≥ 95% | Known violations detected |
| False Positive Rate | ≤ 10% | Good code wrongly failed |
| False Negative Rate | ≤ 5% | Bad code wrongly passed |
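Cohen's κ, the agreement metric used here and in the Phase 3 calibration gate, corrects raw agreement for chance. A minimal two-rater implementation for comparing judge verdicts against expert verdicts (illustrative, not part of the Coditect codebase):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's marginals."""
    assert rater_a and len(rater_a) == len(rater_b)
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters constant and identical
    return (p_o - p_e) / (1 - p_e)
```

For example, judge verdicts ["PASS", "PASS", "FAIL", "FAIL"] against expert verdicts ["PASS", "PASS", "FAIL", "PASS"] give κ = 0.5 despite 75% raw agreement, which would fall short of the κ ≥ 0.6 Phase 3 target.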

6.2 Operational Metrics

| Metric | Target | Measurement |
|---|---|---|
| Evaluation Latency | < 60s | P95 for full panel |
| Judge Availability | 99.5% | Uptime across all judges |
| Cost per Evaluation | < $2.00 | Total API costs |
| Escalation Rate | < 15% | Human review required |

6.3 Defensibility Metrics

| Metric | Target | Measurement |
|---|---|---|
| Provenance Completeness | 100% | All decisions traceable |
| Dissent Recording | 100% | All minority views captured |
| Audit Trail Integrity | 100% | Immutable, timestamped |
| Rationale Quality | ≥ 4/5 | Human rating of explanations |

Appendix A: Quick Reference Card

Invoking Judge Panel

from coditect.verification import JudgePanel, JudgePanelConfig

config = JudgePanelConfig(
    personas=["technical_architect", "compliance_auditor", "security_analyst"],
    regulatory_frameworks=["HIPAA", "SOC2"],
    approval_threshold=0.67,
    max_debate_rounds=3
)

panel = JudgePanel(config)
result = await panel.evaluate(
    artifact=code_artifact,
    context={
        "adrs": project_adrs,
        "requirements": requirements_doc,
        "tech_stack": tech_stack_info
    }
)

if result.final_verdict == Verdict.PASS:
    print(f"Approved with {result.confidence:.0%} confidence")
else:
    print(f"Remediation required: {result.all_remediation}")

Output Structure

{
  "final_verdict": "PASS|FAIL|CONDITIONAL",
  "confidence": 0.85,
  "agreement_ratio": 0.92,
  "aggregated_scores": {
    "architectural_soundness": 2.7,
    "security_compliance": 2.9,
    "regulatory_alignment": 3.0,
    "code_quality": 2.5
  },
  "critical_findings": [],
  "remediation_required": [],
  "dissenting_views": [],
  "provenance_chain": [
    {"persona": "...", "model": "...", "verdict": "...", "timestamp": "..."}
  ],
  "escalation_required": false
}

Document Version: 1.0 | For: Coditect Autonomous Development Platform | January 2026