Coditect Judge Persona Implementation Guide
Strategic Impact Analysis & Implementation Roadmap
Version 1.0 | January 2026
Executive Summary
This document translates the research on judge persona design into a concrete implementation plan for Coditect's verification layer. The goal: a defensible, multi-perspective evaluation system that achieves human-expert parity while maintaining a full audit trail for regulated industries.
Key Strategic Insight: The research validates that Coditect's verification layer should be implemented as a "Constitutional Court" rather than a single evaluator—multiple specialized judge personas debating artifacts against explicit rubrics derived from ADRs and regulatory frameworks.
Part 1: Coditect Judge Architecture Overview
1.1 Verification Layer Design
┌──────────────────────────────────────────────────────────────────┐
│                   CODITECT VERIFICATION LAYER                    │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  SOLUTION MoE OUTPUT                                             │
│        │                                                         │
│        ▼                                                         │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                  JUDGE PANEL ORCHESTRATOR                  │  │
│  │  - Routes artifacts to appropriate judges                  │  │
│  │  - Manages parallel evaluation                             │  │
│  │  - Orchestrates debate protocol                            │  │
│  │  - Aggregates verdicts                                     │  │
│  └────────────────────────────────────────────────────────────┘  │
│        │                                                         │
│        ├───────────────┬───────────────┬───────────────┐         │
│        ▼               ▼               ▼               ▼         │
│  ┌───────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐   │
│  │ Technical │   │Compliance │   │ Security  │   │  Domain   │   │
│  │ Architect │   │  Auditor  │   │  Analyst  │   │  Expert   │   │
│  │   Judge   │   │   Judge   │   │   Judge   │   │   Judge   │   │
│  │           │   │           │   │           │   │           │   │
│  │  Claude   │   │  GPT-4o   │   │ DeepSeek  │   │  Qwen2.5  │   │
│  └───────────┘   └───────────┘   └───────────┘   └───────────┘   │
│        │               │               │               │         │
│        └───────────────┴───────────────┴───────────────┘         │
│                        │                                         │
│                        ▼                                         │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                      CONSENSUS ENGINE                      │  │
│  │  - 2/3 threshold voting                                    │  │
│  │  - Weighted by persona expertise relevance                 │  │
│  │  - Dissent recording for audit trail                       │  │
│  │  - Confidence score calculation                            │  │
│  └────────────────────────────────────────────────────────────┘  │
│                        │                                         │
│                        ▼                                         │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                       VERDICT OUTPUT                       │  │
│  │  { approved: bool, confidence: float,                      │  │
│  │    scores: {dimension: score},                             │  │
│  │    rationale: string, dissents: [],                        │  │
│  │    remediation: [] if not approved,                        │  │
│  │    provenance_chain: [...] }                               │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
1.2 Judge Persona Registry
Coditect requires a minimum of 5 core judge personas for regulated software:
| Persona | Primary Model | Backup Model | Weight | Trigger Conditions |
|---|---|---|---|---|
| Technical Architect | Claude Sonnet 4 | Claude Opus 4.5 | 0.25 | All code artifacts |
| Compliance Auditor | GPT-4o | Claude Opus 4.5 | 0.25 | HIPAA/FDA/SOC2 tagged |
| Security Analyst | DeepSeek-V3 | GPT-4o | 0.20 | All code artifacts |
| Domain Expert | Qwen2.5-72B | Claude Sonnet 4 | 0.15 | Domain-specific artifacts |
| QA Evaluator | Claude Haiku 4.5 | Llama 3.3-70B | 0.15 | All code artifacts |
Diversity Requirement Met: primary models span 4 distinct families (Anthropic, OpenAI, DeepSeek, Alibaba); the backup pool adds a fifth (Meta's Llama 3.3)
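As a sketch, the registry above can be captured as data so the orchestrator can fall back to the backup model when a primary is unavailable. The names (`JudgeSpec`, `JUDGE_REGISTRY`, `select_model`) and the model identifier strings are illustrative assumptions, not Coditect APIs:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JudgeSpec:
    primary_model: str
    backup_model: str
    weight: float
    trigger: str  # condition tag matched against artifact metadata

# Mirrors the persona table; model ID strings are assumed.
JUDGE_REGISTRY = {
    "technical_architect": JudgeSpec("claude-sonnet-4", "claude-opus-4.5", 0.25, "all_code"),
    "compliance_auditor":  JudgeSpec("gpt-4o", "claude-opus-4.5", 0.25, "regulated"),
    "security_analyst":    JudgeSpec("deepseek-v3", "gpt-4o", 0.20, "all_code"),
    "domain_expert":       JudgeSpec("qwen2.5-72b", "claude-sonnet-4", 0.15, "domain"),
    "qa_evaluator":        JudgeSpec("claude-haiku-4.5", "llama-3.3-70b", 0.15, "all_code"),
}

def select_model(persona_id: str, available: set[str]) -> str:
    """Fall back to the backup model when the primary is unavailable."""
    spec = JUDGE_REGISTRY[persona_id]
    return spec.primary_model if spec.primary_model in available else spec.backup_model

# Weights from the table sum to 1.0
assert abs(sum(s.weight for s in JUDGE_REGISTRY.values()) - 1.0) < 1e-9
```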
Part 2: Persona Prompt Engineering
2.1 Technical Architect Judge Prompt
TECHNICAL_ARCHITECT_PROMPT = """
You are Marcus Rivera, a Principal Software Architect with 22 years of experience in distributed systems, event-driven architectures, and enterprise software design. You have particular expertise in:
- Multi-agent orchestration patterns
- Functional programming principles
- FoundationDB and distributed state management
- API design and contract-first development
YOUR EVALUATION STYLE:
- Strictness: HIGH - You do not tolerate architectural shortcuts
- Focus: Long-term maintainability and systemic quality
- Documentation: Expect comprehensive ADR compliance
- Technical Debt: Zero tolerance for accumulation
EVALUATION DIMENSIONS (score 1-3 each):
1. Architectural Soundness
- 3: Clean separation of concerns, proper abstraction layers, follows ADR patterns
- 2: Mostly sound with minor coupling issues
- 1: Significant architectural violations or anti-patterns
2. Design Pattern Appropriateness
- 3: Patterns match problem domain, consistent application
- 2: Mostly appropriate patterns with minor misapplications
- 1: Wrong patterns or inconsistent pattern usage
3. Error Handling & Resilience
- 3: Comprehensive error boundaries, circuit breakers, graceful degradation
- 2: Basic error handling, some gaps in edge cases
- 1: Missing error handling or silent failures
4. Performance Considerations
- 3: Async where appropriate, efficient algorithms, no blocking calls
- 2: Mostly efficient with minor optimization opportunities
- 1: Performance anti-patterns or blocking operations
5. ADR Compliance
- 3: Fully aligned with project ADRs
- 2: Minor deviations with justification
- 1: Violates ADR decisions without documented rationale
RED FLAGS TO IDENTIFY:
- God classes/functions (>500 lines)
- Tight coupling between modules
- Missing abstraction layers
- Synchronous calls where async needed
- Hardcoded configuration values
- N+1 query patterns
- Missing transaction boundaries
ARTIFACT UNDER REVIEW:
{artifact}
APPLICABLE ADRs:
{adrs}
REQUIREMENTS CONTEXT:
{requirements}
Provide your evaluation in the following JSON format:
{
  "persona": "Technical Architect",
  "overall_verdict": "PASS" | "FAIL" | "CONDITIONAL_PASS",
  "confidence": 0.0-1.0,
  "dimension_scores": {
    "architectural_soundness": {"score": 1-3, "evidence": "...", "issues": []},
    "design_patterns": {"score": 1-3, "evidence": "...", "issues": []},
    "error_handling": {"score": 1-3, "evidence": "...", "issues": []},
    "performance": {"score": 1-3, "evidence": "...", "issues": []},
    "adr_compliance": {"score": 1-3, "evidence": "...", "issues": []}
  },
  "red_flags": [],
  "strengths": [],
  "remediation_required": [],
  "rationale": "..."
}
"""
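One practical note on these templates: they mix `{artifact}`-style placeholders with literal JSON braces in the output schema, so a naive `str.format()` call would raise a KeyError. A minimal, hypothetical `render_prompt` helper that substitutes only known placeholders avoids escaping every brace:

```python
def render_prompt(template: str, **fields: str) -> str:
    """Substitute only the named placeholders; leave all other braces alone."""
    for name, value in fields.items():
        template = template.replace("{" + name + "}", value)
    return template

# Minimal demonstration with a toy template (not the full persona prompt):
TEMPLATE = 'Review this artifact:\n{artifact}\nReply as JSON: {"verdict": "..."}'
prompt = render_prompt(TEMPLATE, artifact="def handler(event): ...")
```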
2.2 Compliance Auditor Judge Prompt
COMPLIANCE_AUDITOR_PROMPT = """
You are Dr. Patricia Okonkwo, Chief Compliance Officer with CISA, CISSP, and HCISPP certifications. You have 18 years of experience in healthcare IT compliance, having led compliance programs at major health systems and conducted FDA 510(k) submissions.
YOUR EVALUATION STYLE:
- Strictness: VERY HIGH - Zero tolerance for compliance gaps
- Focus: Regulatory defensibility and audit readiness
- Documentation: Expect exhaustive compliance evidence
- Risk Tolerance: None for material compliance failures
APPLICABLE REGULATORY FRAMEWORKS:
{frameworks}
EVALUATION DIMENSIONS (PASS/FAIL with severity):
1. Data Protection Controls
- Encryption at rest (AES-256 minimum)
- Encryption in transit (TLS 1.2+)
- Key management practices
- PHI/PII identification and protection
2. Access Control Implementation
- Role-based access control (RBAC)
- Principle of least privilege
- Session management
- Authentication mechanisms (MFA where required)
3. Audit Trail Completeness
- Who: User identification
- What: Action performed
- When: Timestamp (UTC, synchronized)
- Where: Resource accessed
- Why: Business justification (where applicable)
4. Data Retention & Disposal
- Retention periods defined
- Secure disposal mechanisms
- Backup integrity verification
5. Incident Response Readiness
- Breach notification triggers
- 60-day reporting compliance (HIPAA)
- Incident logging mechanisms
REGULATORY CITATIONS TO VERIFY:
- HIPAA Security Rule: 164.308 (Admin), 164.310 (Physical), 164.312 (Technical)
- FDA 21 CFR Part 11: Electronic records and signatures
- SOC 2 Trust Principles: Security, Availability, Confidentiality
ARTIFACT UNDER REVIEW:
{artifact}
COMPLIANCE REQUIREMENTS:
{compliance_requirements}
Provide your evaluation in the following JSON format:
{
  "persona": "Compliance Auditor",
  "overall_verdict": "COMPLIANT" | "NON_COMPLIANT" | "PARTIALLY_COMPLIANT",
  "confidence": 0.0-1.0,
  "framework_assessments": {
    "hipaa": {
      "status": "PASS" | "FAIL",
      "findings": [
        {"section": "164.312(a)(1)", "requirement": "...", "status": "...", "evidence": "...", "severity": "CRITICAL|HIGH|MEDIUM|LOW"}
      ]
    },
    "fda_part_11": {...},
    "soc2": {...}
  },
  "critical_findings": [],
  "high_findings": [],
  "medium_findings": [],
  "remediation_required": [
    {"finding": "...", "remediation": "...", "regulatory_reference": "...", "deadline_recommendation": "..."}
  ],
  "audit_trail_assessment": {
    "completeness": "...",
    "gaps": []
  },
  "rationale": "..."
}
"""
2.3 Security Analyst Judge Prompt
SECURITY_ANALYST_PROMPT = """
You are James Nakamura, Senior Application Security Engineer with OSCP, GWAPT, and CEH certifications. You have 12 years of experience in penetration testing and secure code review. You've conducted security assessments for healthcare, fintech, and government systems.
YOUR EVALUATION STYLE:
- Strictness: ADVERSARIAL - Assume breach mindset
- Focus: Finding exploitable vulnerabilities before attackers
- Assumption: All user inputs are malicious until proven otherwise
- Documentation: Expect threat models and security architecture docs
OWASP TOP 10 CHECKLIST:
1. Injection Vulnerabilities
- SQL injection
- Command injection
- LDAP injection
- XPath injection
2. Broken Authentication
- Credential stuffing vulnerability
- Session fixation
- Token management flaws
- Password policy enforcement
3. Sensitive Data Exposure
- Hardcoded secrets
- Insufficient encryption
- Data in logs
- Verbose error messages
4. XML External Entities (XXE)
- Unsafe XML parsing
- DTD processing enabled
5. Broken Access Control
- IDOR vulnerabilities
- Missing function-level access control
- Path traversal
6. Security Misconfiguration
- Default credentials
- Unnecessary services
- Missing security headers
7. Cross-Site Scripting (XSS)
- Reflected XSS
- Stored XSS
- DOM-based XSS
8. Insecure Deserialization
- Untrusted data deserialization
- Object injection
9. Using Components with Known Vulnerabilities
- Outdated dependencies
- Unpatched libraries
10. Insufficient Logging & Monitoring
- Missing security event logging
- Inadequate alerting
SEVERITY CLASSIFICATION:
- CRITICAL: Remote code execution, auth bypass, direct PHI exposure
- HIGH: XSS, CSRF, IDOR, significant data leakage
- MEDIUM: Information disclosure, weak crypto, missing rate limiting
- LOW: Verbose errors, missing headers, minor info leakage
ARTIFACT UNDER REVIEW:
{artifact}
SECURITY REQUIREMENTS:
{security_requirements}
TECHNOLOGY STACK:
{tech_stack}
Provide your evaluation in the following JSON format:
{
  "persona": "Security Analyst",
  "overall_verdict": "SECURE" | "VULNERABLE" | "NEEDS_HARDENING",
  "confidence": 0.0-1.0,
  "vulnerability_findings": [
    {
      "id": "VULN-001",
      "category": "OWASP category",
      "severity": "CRITICAL|HIGH|MEDIUM|LOW",
      "title": "...",
      "location": "file:line",
      "description": "...",
      "exploit_scenario": "...",
      "proof_of_concept": "...",
      "remediation": "...",
      "cwe_id": "CWE-XXX"
    }
  ],
  "critical_count": 0,
  "high_count": 0,
  "medium_count": 0,
  "low_count": 0,
  "attack_surface_assessment": "...",
  "threat_model_gaps": [],
  "security_strengths": [],
  "recommended_security_controls": [],
  "rationale": "..."
}
"""
2.4 Domain Expert Judge Prompt (Healthcare)
DOMAIN_EXPERT_HEALTHCARE_PROMPT = """
You are Dr. Elena Vasquez, Clinical Informatics Director with MD and MS Biomedical Informatics credentials. You have 15 years of experience implementing clinical systems at major academic medical centers. You've served on HL7 FHIR workgroups and led clinical decision support implementations.
YOUR EVALUATION STYLE:
- Strictness: HIGH - Patient safety is non-negotiable
- Focus: Clinical workflow alignment and patient outcomes
- Assumption: Clinicians will find workarounds if software doesn't fit workflow
- Documentation: Expect clinical context and safety analysis
CLINICAL EVALUATION DIMENSIONS:
1. Medical Terminology Accuracy
- Correct ICD-10/SNOMED CT usage
- Proper LOINC codes for lab results
- Accurate drug terminology (RxNorm)
- Clinical abbreviations appropriate
2. Clinical Workflow Alignment
- Matches real-world clinical practice
- Appropriate for care setting (inpatient/outpatient/ED)
- Considers cognitive load on clinicians
- Supports rather than disrupts care
3. Patient Safety Considerations
- Alert fatigue potential
- Workaround likelihood
- Medication safety checks
- Allergy verification
4. Interoperability Standards
- HL7 FHIR R4 compliance
- C-CDA document support
- IHE profile adherence
- API design for health data exchange
5. Clinical Decision Support
- Evidence-based rules
- Appropriate sensitivity/specificity
- Clear action recommendations
- Override justification capture
PATIENT SAFETY RED FLAGS:
- Silent failures in medication logic
- Incorrect unit conversions
- Missing allergy checks
- Ambiguous clinical terminology
- Alert fatigue generators (>5 alerts/patient)
ARTIFACT UNDER REVIEW:
{artifact}
CLINICAL REQUIREMENTS:
{clinical_requirements}
APPLICABLE CLINICAL STANDARDS:
{clinical_standards}
Provide your evaluation in the following JSON format:
{
"persona": "Clinical Domain Expert",
"overall_verdict": "CLINICALLY_SAFE" | "SAFETY_CONCERNS" | "CLINICALLY_UNSAFE",
"confidence": 0.0-1.0,
"dimension_scores": {
"terminology_accuracy": {"score": 1-3, "findings": []},
"workflow_alignment": {"score": 1-3, "findings": []},
"patient_safety": {"score": 1-3, "findings": []},
"interoperability": {"score": 1-3, "findings": []},
"clinical_decision_support": {"score": 1-3, "findings": []}
},
"patient_safety_concerns": [
{"concern": "...", "severity": "...", "clinical_impact": "...", "remediation": "..."}
],
"workflow_risks": [],
"terminology_errors": [],
"interoperability_gaps": [],
"strengths": [],
"clinical_review_recommended": true|false,
"rationale": "..."
}
"""
2.5 QA Evaluator Judge Prompt
QA_EVALUATOR_PROMPT = """
You are Priya Sharma, Senior QA Architect with ISTQB Advanced certification and 14 years of experience in test automation and quality engineering. You've built QA programs for healthcare and fintech products with zero-defect requirements.
YOUR EVALUATION STYLE:
- Strictness: METHODICAL - Every edge case matters
- Focus: Defect prevention through comprehensive testing
- Assumption: If it's not tested, it's broken
- Documentation: Expect test specifications and coverage reports
EVALUATION DIMENSIONS:
1. Test Coverage Adequacy
- Unit test coverage (target: 80%+)
- Integration test coverage (critical paths: 100%)
- E2E test coverage (happy paths + key error paths)
2. Edge Case Handling
- Null/undefined inputs
- Boundary values
- Empty collections
- Maximum lengths
- Concurrent access
- Resource exhaustion
3. Error Path Testing
- Network failures
- Timeout scenarios
- Invalid inputs
- Permission denials
- Data corruption
4. Testability Assessment
- Dependency injection usage
- Mock-friendly design
- Deterministic behavior
- Observable state
5. Regression Risk
- Breaking change potential
- Backward compatibility
- API contract changes
- State migration needs
TESTING GAPS TO IDENTIFY:
- Missing negative test cases
- Untested error branches
- No timeout handling tests
- Missing concurrency tests
- No performance baselines
ARTIFACT UNDER REVIEW:
{artifact}
EXISTING TEST COVERAGE:
{existing_tests}
REQUIREMENTS SPECIFICATION:
{requirements}
Provide your evaluation in the following JSON format:
{
  "persona": "QA Evaluator",
  "overall_verdict": "ADEQUATELY_TESTED" | "TESTING_GAPS" | "INSUFFICIENT_TESTING",
  "confidence": 0.0-1.0,
  "coverage_assessment": {
    "unit_test_coverage": "X%",
    "integration_test_coverage": "X%",
    "critical_paths_covered": true|false,
    "error_paths_covered": true|false
  },
  "missing_test_cases": [
    {
      "category": "edge_case|error_path|integration|performance",
      "description": "...",
      "priority": "HIGH|MEDIUM|LOW",
      "suggested_test": "..."
    }
  ],
  "edge_cases_handled": [],
  "edge_cases_missing": [],
  "testability_issues": [],
  "regression_risks": [],
  "test_quality_observations": [],
  "recommended_test_additions": [],
  "rationale": "..."
}
"""
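All five personas are asked for JSON, but models sometimes wrap it in prose or a markdown code fence. A defensive parsing sketch the orchestrator could run before building a `JudgeEvaluation` (the `parse_judge_response` name is an assumption, not part of the Coditect API):

```python
import json
import re

def parse_judge_response(raw: str) -> dict:
    """Extract the first JSON object from a model reply that may wrap it
    in prose or a markdown code fence. Assumption: the verdict JSON is
    the outermost {...} block in the reply."""
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", raw, re.DOTALL)
    if fenced:
        candidate = fenced.group(1)
    else:
        # Fall back to the widest brace-delimited span
        start = raw.find("{")
        end = raw.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("no JSON object found in judge response")
        candidate = raw[start:end + 1]
    return json.loads(candidate)
```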
Part 3: Consensus Protocol Implementation
3.1 Voting Mechanism
from dataclasses import dataclass
from typing import Dict, List
from enum import Enum
import statistics


class Verdict(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    CONDITIONAL = "CONDITIONAL"


@dataclass
class JudgeEvaluation:
    persona_id: str
    model_used: str
    verdict: Verdict
    confidence: float
    dimension_scores: Dict[str, float]
    critical_findings: List[str]
    remediation_required: List[str]
    rationale: str
    raw_response: str  # For audit trail
    timestamp: str
    token_usage: int


@dataclass
class ConsensusResult:
    final_verdict: Verdict
    confidence: float
    agreement_ratio: float
    majority_rationale: str
    dissenting_views: List[Dict]
    aggregated_scores: Dict[str, float]
    all_critical_findings: List[str]
    all_remediation: List[str]
    provenance_chain: List[Dict]
    escalation_required: bool
class ConsensusEngine:
    """
    Implements 2/3 threshold consensus with weighted voting.
    Based on Hashgraph-inspired consensus (Ogunsina & Ogunsina, 2025).
    """

    def __init__(self):
        self.persona_weights = {
            "technical_architect": 0.25,
            "compliance_auditor": 0.25,
            "security_analyst": 0.20,
            "domain_expert": 0.15,
            "qa_evaluator": 0.15
        }
        self.approval_threshold = 0.67  # 2/3 majority
        self.confidence_floor = 0.6  # Minimum for auto-approval

    def calculate_consensus(
        self,
        evaluations: List[JudgeEvaluation]
    ) -> ConsensusResult:
        """Calculate consensus from judge panel evaluations."""
        if not evaluations:
            raise ValueError("cannot calculate consensus without evaluations")

        # Calculate weighted votes (avoid shadowing the builtin `eval`)
        pass_weight = 0.0
        fail_weight = 0.0
        conditional_weight = 0.0
        for ev in evaluations:
            weight = self.persona_weights.get(ev.persona_id, 0.15)
            confidence_adjusted_weight = weight * ev.confidence
            if ev.verdict == Verdict.PASS:
                pass_weight += confidence_adjusted_weight
            elif ev.verdict == Verdict.FAIL:
                fail_weight += confidence_adjusted_weight
            else:
                conditional_weight += confidence_adjusted_weight

        total_weight = pass_weight + fail_weight + conditional_weight
        if total_weight == 0:
            # Every judge reported zero confidence: all ratios become 0,
            # forcing a CONDITIONAL verdict and human escalation below
            total_weight = 1.0

        # Normalize
        pass_ratio = pass_weight / total_weight
        fail_ratio = fail_weight / total_weight
        conditional_ratio = conditional_weight / total_weight

        # Determine verdict
        if pass_ratio >= self.approval_threshold:
            final_verdict = Verdict.PASS
        elif fail_ratio >= self.approval_threshold:
            final_verdict = Verdict.FAIL
        else:
            final_verdict = Verdict.CONDITIONAL

        # Calculate overall confidence
        confidence = statistics.mean([e.confidence for e in evaluations])

        # Identify dissent
        dissenting_views = self._extract_dissent(evaluations, final_verdict)

        # Aggregate scores across dimensions
        aggregated_scores = self._aggregate_dimension_scores(evaluations)

        # Collect all critical findings
        all_critical = []
        all_remediation = []
        for ev in evaluations:
            all_critical.extend(ev.critical_findings)
            all_remediation.extend(ev.remediation_required)

        # Build provenance chain
        provenance = [
            {
                "persona": e.persona_id,
                "model": e.model_used,
                "verdict": e.verdict.value,
                "confidence": e.confidence,
                "timestamp": e.timestamp,
                "token_usage": e.token_usage
            }
            for e in evaluations
        ]

        # Determine if human escalation required
        escalation_required = (
            final_verdict == Verdict.CONDITIONAL or
            confidence < self.confidence_floor or
            len(dissenting_views) >= 2 or
            any("CRITICAL" in str(f).upper() for f in all_critical)
        )

        return ConsensusResult(
            final_verdict=final_verdict,
            confidence=confidence,
            agreement_ratio=max(pass_ratio, fail_ratio, conditional_ratio),
            majority_rationale=self._synthesize_rationale(evaluations, final_verdict),
            dissenting_views=dissenting_views,
            aggregated_scores=aggregated_scores,
            all_critical_findings=list(set(all_critical)),
            all_remediation=list(set(all_remediation)),
            provenance_chain=provenance,
            escalation_required=escalation_required
        )

    def _extract_dissent(
        self,
        evaluations: List[JudgeEvaluation],
        final_verdict: Verdict
    ) -> List[Dict]:
        """Extract dissenting opinions for audit trail."""
        dissents = []
        for ev in evaluations:
            if ev.verdict != final_verdict:
                dissents.append({
                    "persona": ev.persona_id,
                    "verdict": ev.verdict.value,
                    "confidence": ev.confidence,
                    "rationale": ev.rationale,
                    "key_concerns": ev.critical_findings[:3]
                })
        return dissents

    def _aggregate_dimension_scores(
        self,
        evaluations: List[JudgeEvaluation]
    ) -> Dict[str, float]:
        """Aggregate scores across dimensions with weighting."""
        dimension_scores = {}
        dimension_weights = {}
        for ev in evaluations:
            weight = self.persona_weights.get(ev.persona_id, 0.15)
            for dim, score in ev.dimension_scores.items():
                if dim not in dimension_scores:
                    dimension_scores[dim] = 0.0
                    dimension_weights[dim] = 0.0
                dimension_scores[dim] += score * weight
                dimension_weights[dim] += weight
        # Normalize
        return {
            dim: score / dimension_weights[dim]
            for dim, score in dimension_scores.items()
            if dimension_weights[dim] > 0
        }

    def _synthesize_rationale(
        self,
        evaluations: List[JudgeEvaluation],
        final_verdict: Verdict
    ) -> str:
        """Synthesize majority rationale from aligned evaluations."""
        aligned = [e for e in evaluations if e.verdict == final_verdict]
        if not aligned:
            aligned = evaluations
        # Weight by confidence
        weighted_rationales = sorted(
            aligned,
            key=lambda e: e.confidence * self.persona_weights.get(e.persona_id, 0.15),
            reverse=True
        )
        # Take top 2 rationales
        primary = weighted_rationales[0].rationale if weighted_rationales else ""
        secondary = weighted_rationales[1].rationale if len(weighted_rationales) > 1 else ""
        return f"Primary: {primary}\n\nSupporting: {secondary}"
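To make the confidence-scaled, weighted 2/3 vote concrete, here is a worked numeric example following the same tally as `calculate_consensus`. The votes and confidences are illustrative:

```python
# Three PASS votes and one FAIL, each scaled by persona weight x confidence.
weights = {"technical_architect": 0.25, "compliance_auditor": 0.25,
           "security_analyst": 0.20, "qa_evaluator": 0.15}
votes = {"technical_architect": ("PASS", 0.90),
         "compliance_auditor": ("PASS", 0.80),
         "security_analyst": ("FAIL", 0.70),
         "qa_evaluator": ("PASS", 0.85)}

tallies = {"PASS": 0.0, "FAIL": 0.0, "CONDITIONAL": 0.0}
for persona, (verdict, confidence) in votes.items():
    tallies[verdict] += weights[persona] * confidence

total = sum(tallies.values())          # 0.5525 + 0.14 = 0.6925
pass_ratio = tallies["PASS"] / total   # 0.5525 / 0.6925 ≈ 0.798
approved = pass_ratio >= 0.67          # True: weighted PASS share clears 2/3
```

Note that the lone FAIL vote at 0.70 confidence is not enough to block approval here, but it would still be recorded as a dissenting view in the audit trail.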
3.2 Debate Protocol
class DebateOrchestrator:
    """
    Orchestrates multi-round debate when judges disagree.
    Based on MAJ-EVAL in-group debate protocol (Chen et al., 2025).
    """

    MAX_DEBATE_ROUNDS = 3
    CONVERGENCE_THRESHOLD = 0.8  # Agreement ratio to stop debate

    async def orchestrate_debate(
        self,
        evaluations: List[JudgeEvaluation],
        artifact: str,
        context: Dict
    ) -> List[JudgeEvaluation]:
        """Orchestrate debate rounds until convergence or max rounds."""
        current_evaluations = evaluations
        for round_num in range(self.MAX_DEBATE_ROUNDS):
            # Check for convergence
            agreement = self._calculate_agreement(current_evaluations)
            if agreement >= self.CONVERGENCE_THRESHOLD:
                break
            # Identify disagreement areas
            disagreements = self._identify_disagreements(current_evaluations)
            # Generate debate prompts
            debate_context = self._prepare_debate_context(
                current_evaluations,
                disagreements,
                round_num
            )
            # Each judge responds to disagreements
            updated_evaluations = await self._conduct_debate_round(
                current_evaluations,
                debate_context,
                artifact
            )
            current_evaluations = updated_evaluations
        return current_evaluations

    def _identify_disagreements(
        self,
        evaluations: List[JudgeEvaluation]
    ) -> List[Dict]:
        """Identify specific dimensions where judges disagree."""
        disagreements = []
        # Check verdict-level disagreement
        verdicts = [e.verdict for e in evaluations]
        if len(set(verdicts)) > 1:
            disagreements.append({
                "type": "verdict",
                "positions": {
                    e.persona_id: e.verdict.value
                    for e in evaluations
                }
            })
        # Check dimension-level disagreements
        all_dimensions = set()
        for e in evaluations:
            all_dimensions.update(e.dimension_scores.keys())
        for dim in all_dimensions:
            scores = [
                e.dimension_scores.get(dim, 0)
                for e in evaluations
            ]
            if max(scores) - min(scores) >= 1.5:  # Significant gap
                disagreements.append({
                    "type": "dimension",
                    "dimension": dim,
                    "positions": {
                        e.persona_id: e.dimension_scores.get(dim)
                        for e in evaluations
                    }
                })
        return disagreements

    def _prepare_debate_context(
        self,
        evaluations: List[JudgeEvaluation],
        disagreements: List[Dict],
        round_num: int
    ) -> str:
        """Prepare context for debate round."""
        context = f"DEBATE ROUND {round_num + 1}\n\n"
        context += "AREAS OF DISAGREEMENT:\n"
        for d in disagreements:
            if d["type"] == "verdict":
                context += "\nVERDICT DISAGREEMENT:\n"
                for persona, verdict in d["positions"].items():
                    ev = next(e for e in evaluations if e.persona_id == persona)
                    context += f"- {persona}: {verdict} (confidence: {ev.confidence:.2f})\n"
                    context += f"  Rationale: {ev.rationale[:200]}...\n"
            else:
                context += f"\nDIMENSION: {d['dimension']}\n"
                for persona, score in d["positions"].items():
                    context += f"- {persona}: Score {score}\n"
        context += "\n\nINSTRUCTIONS:\n"
        context += "1. Review other judges' positions and evidence\n"
        context += "2. Identify if their concerns change your assessment\n"
        context += "3. Provide updated evaluation if warranted\n"
        context += "4. Cite specific evidence for your position\n"
        return context
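`orchestrate_debate` relies on a `_calculate_agreement` helper that is not shown in the class above. One plausible sketch, assuming agreement is measured as the fraction of judges holding the modal verdict:

```python
from collections import Counter

def calculate_agreement(verdicts: list[str]) -> float:
    """Fraction of judges holding the most common verdict.
    Sketch of the convergence check; the real helper would take
    JudgeEvaluation objects and read .verdict from each."""
    if not verdicts:
        return 0.0
    modal_count = Counter(verdicts).most_common(1)[0][1]
    return modal_count / len(verdicts)

# 3 of 4 judges agreeing gives 0.75, below the 0.8 convergence
# threshold, so another debate round would run.
```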
Part 4: ADR-to-Rubric Generation Pipeline
4.1 ADR Parser
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ADRConstraint:
    """A testable constraint extracted from an ADR."""
    source_adr: str
    constraint_type: str  # MUST, SHOULD, MAY
    description: str
    evidence_quote: str
    testable_criteria: List[str]


@dataclass
class GeneratedRubric:
    """Rubric generated from ADR constraints."""
    source_adr: str
    dimension: str
    scale: List[int]
    score_descriptions: Dict[int, str]
    evaluation_steps: List[str]
    weight: float


class ADRRubricGenerator:
    """
    Generates evaluation rubrics from Architecture Decision Records.
    Implements automatic constraint extraction and rubric synthesis.
    """

    # Word boundaries prevent substring matches (e.g. "may" inside "mayor")
    CONSTRAINT_PATTERNS = {
        "MUST": r"\b(?:must|shall|required|mandatory|will)\b",
        "SHOULD": r"\b(?:should|recommended|preferred)\b",
        "MAY": r"\b(?:may|optional|can)\b"
    }

    def parse_adr(self, adr_content: str, adr_id: str) -> List[ADRConstraint]:
        """Extract testable constraints from ADR content."""
        constraints = []
        # Find decision section
        decision_match = re.search(
            r"##\s*Decision\s*\n(.*?)(?=##|\Z)",
            adr_content,
            re.DOTALL | re.IGNORECASE
        )
        if not decision_match:
            return constraints
        decision_text = decision_match.group(1)
        # Split into sentences
        sentences = re.split(r'(?<=[.!?])\s+', decision_text)
        for sentence in sentences:
            for constraint_type, pattern in self.CONSTRAINT_PATTERNS.items():
                if re.search(pattern, sentence, re.IGNORECASE):
                    constraints.append(ADRConstraint(
                        source_adr=adr_id,
                        constraint_type=constraint_type,
                        description=sentence.strip(),
                        evidence_quote=sentence.strip(),
                        testable_criteria=self._extract_criteria(sentence)
                    ))
                    break
        return constraints

    def _extract_criteria(self, sentence: str) -> List[str]:
        """Extract testable criteria from constraint sentence."""
        criteria = []
        # Look for specific technical requirements
        tech_patterns = [
            r"encryption",
            r"TLS\s*[\d.]+",
            r"AES-\d+",
            r"HIPAA",
            r"audit\s*(?:log|trail)",
            r"FHIR",
            r"FoundationDB",
            r"event\s*sourc",
            r"ACID",
            r"authentication",
            r"authorization"
        ]
        for pattern in tech_patterns:
            match = re.search(pattern, sentence, re.IGNORECASE)
            if match:
                criteria.append(f"Verify {match.group()} implementation")
        return criteria if criteria else ["Verify compliance with stated requirement"]

    def generate_rubric(
        self,
        constraints: List[ADRConstraint]
    ) -> List[GeneratedRubric]:
        """Generate evaluation rubrics from constraints."""
        rubrics = []
        # Group constraints by ADR
        by_adr = {}
        for c in constraints:
            by_adr.setdefault(c.source_adr, []).append(c)
        for adr_id, adr_constraints in by_adr.items():
            # Create rubric for each ADR
            dimension = f"ADR Compliance: {adr_id}"
            # Build score descriptions based on constraint types
            must_count = len([c for c in adr_constraints if c.constraint_type == "MUST"])
            score_descriptions = {
                3: f"All {must_count} MUST requirements fully implemented with evidence",
                2: "Most MUST requirements implemented, minor gaps in SHOULD items",
                1: "MUST requirements violated or missing critical implementations"
            }
            # Compile evaluation steps
            eval_steps = []
            for c in adr_constraints:
                if c.constraint_type in ("MUST", "SHOULD"):
                    for criterion in c.testable_criteria:
                        eval_steps.append(f"[{c.constraint_type}] {criterion}")
            rubrics.append(GeneratedRubric(
                source_adr=adr_id,
                dimension=dimension,
                scale=[1, 2, 3],
                score_descriptions=score_descriptions,
                evaluation_steps=eval_steps,
                weight=0.2  # Default weight, adjustable
            ))
        return rubrics

    def augment_persona_rubric(
        self,
        base_rubric: Dict,
        generated_rubrics: List[GeneratedRubric]
    ) -> Dict:
        """Augment base persona rubric with ADR-specific dimensions."""
        augmented = base_rubric.copy()
        for rubric in generated_rubrics:
            augmented["dimensions"].append({
                "name": rubric.dimension,
                "weight": rubric.weight,
                "scale": rubric.scale,
                "descriptions": rubric.score_descriptions,
                "evaluation_steps": rubric.evaluation_steps,
                "source": "ADR-GENERATED"
            })
        # Renormalize weights
        total_weight = sum(d["weight"] for d in augmented["dimensions"])
        for d in augmented["dimensions"]:
            d["weight"] = d["weight"] / total_weight
        return augmented
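The extraction pass can be illustrated standalone on a toy ADR. This snippet mirrors the Decision-section regex and the MUST/SHOULD/MAY keyword matching used by `ADRRubricGenerator` (word boundaries assumed, to avoid substring matches like `may` inside `mayor`):

```python
import re

# Toy ADR with one constraint of each strength.
ADR = """## Decision
All PHI at rest must be encrypted with AES-256. Services should emit
an audit trail for every access. Caching may be added later."""

PATTERNS = {"MUST": r"\b(?:must|shall|required|mandatory|will)\b",
            "SHOULD": r"\b(?:should|recommended|preferred)\b",
            "MAY": r"\b(?:may|optional|can)\b"}

decision = re.search(r"##\s*Decision\s*\n(.*)", ADR, re.DOTALL).group(1)
constraints = []
for sentence in re.split(r"(?<=[.!?])\s+", decision):
    for ctype, pattern in PATTERNS.items():
        if re.search(pattern, sentence, re.IGNORECASE):
            constraints.append((ctype, sentence.strip()))
            break  # strongest keyword wins; stop at first match
```

Running this yields one MUST, one SHOULD, and one MAY constraint, in document order; each would then be expanded into `[MUST]`/`[SHOULD]` evaluation steps by `generate_rubric`.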
Part 5: Implementation Roadmap
5.1 Phase 1: Foundation (Weeks 1-4)
| Week | Deliverable | Success Criteria |
|---|---|---|
| 1 | Persona prompt templates finalized | All 5 core personas documented |
| 2 | Judge invocation infrastructure | Can invoke Claude/GPT-4o/DeepSeek judges |
| 3 | Consensus engine implementation | Weighted voting functional |
| 4 | Basic audit trail | All evaluations logged with provenance |
5.2 Phase 2: Integration (Weeks 5-8)
| Week | Deliverable | Success Criteria |
|---|---|---|
| 5 | ADR parser implementation | Extract constraints from 10 test ADRs |
| 6 | Dynamic rubric generation | Rubrics auto-generated from ADRs |
| 7 | Solution MoE → Judge integration | End-to-end artifact flow |
| 8 | Debate protocol implementation | Multi-round debate functional |
5.3 Phase 3: Calibration (Weeks 9-12)
| Week | Deliverable | Success Criteria |
|---|---|---|
| 9 | Human calibration dataset | 100 expert-graded artifacts |
| 10 | Agreement metric calculation | Cohen's κ ≥ 0.6 per dimension |
| 11 | Threshold tuning | False positive ≤ 10%, false negative ≤ 5% |
| 12 | Bias testing | Pass adversarial bias probes |
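The week-10 agreement check can be computed directly. A minimal Cohen's kappa sketch on toy verdict labels (the real calibration run would use the 100 expert-graded artifacts, per dimension):

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Kappa = (observed agreement - chance agreement) / (1 - chance)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

judge  = ["PASS", "PASS", "FAIL", "PASS", "FAIL", "PASS"]
expert = ["PASS", "FAIL", "FAIL", "PASS", "FAIL", "PASS"]
kappa = cohens_kappa(judge, expert)  # 2/3 ≈ 0.67, above the 0.6 target
```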
5.4 Phase 4: Production Hardening (Weeks 13-16)
| Week | Deliverable | Success Criteria |
|---|---|---|
| 13 | Red team testing | 95% adversarial rejection rate |
| 14 | Escalation workflow | Human review loop integrated |
| 15 | Performance optimization | < 60s for full panel evaluation |
| 16 | Compliance documentation | SOC 2 / HIPAA audit readiness |
Part 6: Success Metrics
6.1 Accuracy Metrics
| Metric | Target | Measurement |
|---|---|---|
| Human Agreement (overall) | ≥ 75% | Cohen's κ vs. expert panel |
| Security Finding Recall | ≥ 90% | Known vulnerabilities detected |
| Compliance Finding Recall | ≥ 95% | Known violations detected |
| False Positive Rate | ≤ 10% | Good code wrongly failed |
| False Negative Rate | ≤ 5% | Bad code wrongly passed |
6.2 Operational Metrics
| Metric | Target | Measurement |
|---|---|---|
| Evaluation Latency | < 60s | P95 for full panel |
| Judge Availability | 99.5% | Uptime across all judges |
| Cost per Evaluation | < $2.00 | Total API costs |
| Escalation Rate | < 15% | Human review required |
6.3 Defensibility Metrics
| Metric | Target | Measurement |
|---|---|---|
| Provenance Completeness | 100% | All decisions traceable |
| Dissent Recording | 100% | All minority views captured |
| Audit Trail Integrity | 100% | Immutable, timestamped |
| Rationale Quality | ≥ 4/5 | Human rating of explanations |
Appendix A: Quick Reference Card
Invoking Judge Panel
from coditect.verification import JudgePanel, JudgePanelConfig

config = JudgePanelConfig(
    personas=["technical_architect", "compliance_auditor", "security_analyst"],
    regulatory_frameworks=["HIPAA", "SOC2"],
    approval_threshold=0.67,
    max_debate_rounds=3
)

panel = JudgePanel(config)
result = await panel.evaluate(
    artifact=code_artifact,
    context={
        "adrs": project_adrs,
        "requirements": requirements_doc,
        "tech_stack": tech_stack_info
    }
)

if result.final_verdict == Verdict.PASS:
    print(f"Approved with {result.confidence:.0%} confidence")
else:
    print(f"Remediation required: {result.all_remediation}")
Output Structure
{
  "final_verdict": "PASS|FAIL|CONDITIONAL",
  "confidence": 0.85,
  "agreement_ratio": 0.92,
  "aggregated_scores": {
    "architectural_soundness": 2.7,
    "security_compliance": 2.9,
    "regulatory_alignment": 3.0,
    "code_quality": 2.5
  },
  "critical_findings": [],
  "remediation_required": [],
  "dissenting_views": [],
  "provenance_chain": [
    {"persona": "...", "model": "...", "verdict": "...", "timestamp": "..."}
  ],
  "escalation_required": false
}
Document Version: 1.0 | For: Coditect Autonomous Development Platform | January 2026