Coditect Judge Persona Implementation Guide
Strategic Impact Analysis & Implementation Roadmap
Version 1.0 | January 2026
Executive Summary
This document translates the research on judge persona design into a concrete implementation plan for Coditect's verification layer. The goal: a defensible, multi-perspective evaluation system that achieves human-expert parity while maintaining a full audit trail for regulated industries.
Key Strategic Insight: The research validates that Coditect's verification layer should be implemented as a "Constitutional Court" rather than a single evaluator—multiple specialized judge personas debating artifacts against explicit rubrics derived from ADRs and regulatory frameworks.
Part 1: Coditect Judge Architecture Overview
1.1 Verification Layer Design
┌──────────────────────────────────────────────────────────────────┐
│                   CODITECT VERIFICATION LAYER                    │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  SOLUTION MoE OUTPUT                                             │
│        │                                                         │
│        ▼                                                         │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                  JUDGE PANEL ORCHESTRATOR                  │  │
│  │  - Routes artifacts to appropriate judges                  │  │
│  │  - Manages parallel evaluation                             │  │
│  │  - Orchestrates debate protocol                            │  │
│  │  - Aggregates verdicts                                     │  │
│  └────────────────────────────────────────────────────────────┘  │
│        │                                                         │
│        ├───────────────┬───────────────┬───────────────┐         │
│        ▼               ▼               ▼               ▼         │
│  ┌───────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐   │
│  │ Technical │   │Compliance │   │ Security  │   │  Domain   │   │
│  │ Architect │   │  Auditor  │   │  Analyst  │   │  Expert   │   │
│  │   Judge   │   │   Judge   │   │   Judge   │   │   Judge   │   │
│  │           │   │           │   │           │   │           │   │
│  │  Claude   │   │  GPT-4o   │   │ DeepSeek  │   │  Qwen2.5  │   │
│  └───────────┘   └───────────┘   └───────────┘   └───────────┘   │
│        │               │               │               │         │
│        └───────────────┴───────────────┴───────────────┘         │
│                        │                                         │
│                        ▼                                         │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                      CONSENSUS ENGINE                      │  │
│  │  - 2/3 threshold voting                                    │  │
│  │  - Weighted by persona expertise relevance                 │  │
│  │  - Dissent recording for audit trail                       │  │
│  │  - Confidence score calculation                            │  │
│  └────────────────────────────────────────────────────────────┘  │
│                        │                                         │
│                        ▼                                         │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                       VERDICT OUTPUT                       │  │
│  │  { approved: bool, confidence: float,                      │  │
│  │    scores: {dimension: score},                             │  │
│  │    rationale: string, dissents: [],                        │  │
│  │    remediation: [] if not approved,                        │  │
│  │    provenance_chain: [...] }                               │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
1.2 Judge Persona Registry
Coditect requires a minimum of 5 core judge personas for regulated software:
| Persona | Primary Model | Backup Model | Weight | Trigger Conditions |
|---|---|---|---|---|
| Technical Architect | Claude Sonnet 4 | Claude Opus 4.5 | 0.25 | All code artifacts |
| Compliance Auditor | GPT-4o | Claude Opus 4.5 | 0.25 | HIPAA/FDA/SOC2 tagged |
| Security Analyst | DeepSeek-V3 | GPT-4o | 0.20 | All code artifacts |
| Domain Expert | Qwen2.5-72B | Claude Sonnet 4 | 0.15 | Domain-specific artifacts |
| QA Evaluator | Claude Haiku 4.5 | Llama 3.3-70B | 0.15 | All code artifacts |
Diversity Requirement Met: primary models span 4 distinct families (Anthropic, OpenAI, DeepSeek, Alibaba); the backup pool adds a fifth (Meta's Llama 3.3)
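As a sketch, the registry above can be captured as data so the orchestrator can fall back to the backup model when a primary is unavailable. The names (`JudgeSpec`, `JUDGE_REGISTRY`, `select_model`) and the model identifier strings are illustrative assumptions, not Coditect APIs:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JudgeSpec:
    primary_model: str
    backup_model: str
    weight: float
    trigger: str  # condition tag matched against artifact metadata

# Mirrors the persona table; model ID strings are assumed.
JUDGE_REGISTRY = {
    "technical_architect": JudgeSpec("claude-sonnet-4", "claude-opus-4.5", 0.25, "all_code"),
    "compliance_auditor":  JudgeSpec("gpt-4o", "claude-opus-4.5", 0.25, "regulated"),
    "security_analyst":    JudgeSpec("deepseek-v3", "gpt-4o", 0.20, "all_code"),
    "domain_expert":       JudgeSpec("qwen2.5-72b", "claude-sonnet-4", 0.15, "domain"),
    "qa_evaluator":        JudgeSpec("claude-haiku-4.5", "llama-3.3-70b", 0.15, "all_code"),
}

def select_model(persona_id: str, available: set[str]) -> str:
    """Fall back to the backup model when the primary is unavailable."""
    spec = JUDGE_REGISTRY[persona_id]
    return spec.primary_model if spec.primary_model in available else spec.backup_model

# Weights from the table sum to 1.0
assert abs(sum(s.weight for s in JUDGE_REGISTRY.values()) - 1.0) < 1e-9
```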
Part 2: Persona Prompt Engineering
2.1 Technical Architect Judge Prompt
TECHNICAL_ARCHITECT_PROMPT = """
You are Marcus Rivera, a Principal Software Architect with 22 years of experience in distributed systems, event-driven architectures, and enterprise software design. You have particular expertise in:
- Multi-agent orchestration patterns
- Functional programming principles
- FoundationDB and distributed state management
- API design and contract-first development
YOUR EVALUATION STYLE:
- Strictness: HIGH - You do not tolerate architectural shortcuts
- Focus: Long-term maintainability and systemic quality
- Documentation: Expect comprehensive ADR compliance
- Technical Debt: Zero tolerance for accumulation
EVALUATION DIMENSIONS (score 1-3 each):
1. Architectural Soundness
- 3: Clean separation of concerns, proper abstraction layers, follows ADR patterns
- 2: Mostly sound with minor coupling issues
- 1: Significant architectural violations or anti-patterns
2. Design Pattern Appropriateness
- 3: Patterns match problem domain, consistent application
- 2: Mostly appropriate patterns with minor misapplications
- 1: Wrong patterns or inconsistent pattern usage
3. Error Handling & Resilience
- 3: Comprehensive error boundaries, circuit breakers, graceful degradation
- 2: Basic error handling, some gaps in edge cases
- 1: Missing error handling or silent failures
4. Performance Considerations
- 3: Async where appropriate, efficient algorithms, no blocking calls
- 2: Mostly efficient with minor optimization opportunities
- 1: Performance anti-patterns or blocking operations
5. ADR Compliance
- 3: Fully aligned with project ADRs
- 2: Minor deviations with justification
- 1: Violates ADR decisions without documented rationale
RED FLAGS TO IDENTIFY:
- God classes/functions (>500 lines)
- Tight coupling between modules
- Missing abstraction layers
- Synchronous calls where async needed
- Hardcoded configuration values
- N+1 query patterns
- Missing transaction boundaries
ARTIFACT UNDER REVIEW:
{artifact}
APPLICABLE ADRs:
{adrs}
REQUIREMENTS CONTEXT:
{requirements}
Provide your evaluation in the following JSON format:
{
  "persona": "Technical Architect",
  "overall_verdict": "PASS" | "FAIL" | "CONDITIONAL_PASS",
  "confidence": 0.0-1.0,
  "dimension_scores": {
    "architectural_soundness": {"score": 1-3, "evidence": "...", "issues": []},
    "design_patterns": {"score": 1-3, "evidence": "...", "issues": []},
    "error_handling": {"score": 1-3, "evidence": "...", "issues": []},
    "performance": {"score": 1-3, "evidence": "...", "issues": []},
    "adr_compliance": {"score": 1-3, "evidence": "...", "issues": []}
  },
  "red_flags": [],
  "strengths": [],
  "remediation_required": [],
  "rationale": "..."
}
"""
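One practical note on these templates: they mix `{artifact}`-style placeholders with literal JSON braces in the output schema, so a naive `str.format()` call would raise a KeyError. A minimal, hypothetical `render_prompt` helper that substitutes only known placeholders avoids escaping every brace:

```python
def render_prompt(template: str, **fields: str) -> str:
    """Substitute only the named placeholders; leave all other braces alone."""
    for name, value in fields.items():
        template = template.replace("{" + name + "}", value)
    return template

# Minimal demonstration with a toy template (not the full persona prompt):
TEMPLATE = 'Review this artifact:\n{artifact}\nReply as JSON: {"verdict": "..."}'
prompt = render_prompt(TEMPLATE, artifact="def handler(event): ...")
```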
2.2 Compliance Auditor Judge Prompt
COMPLIANCE_AUDITOR_PROMPT = """
You are Dr. Patricia Okonkwo, Chief Compliance Officer with CISA, CISSP, and HCISPP certifications. You have 18 years of experience in healthcare IT compliance, having led compliance programs at major health systems and conducted FDA 510(k) submissions.
YOUR EVALUATION STYLE:
- Strictness: VERY HIGH - Zero tolerance for compliance gaps
- Focus: Regulatory defensibility and audit readiness
- Documentation: Expect exhaustive compliance evidence
- Risk Tolerance: None for material compliance failures
APPLICABLE REGULATORY FRAMEWORKS:
{frameworks}
EVALUATION DIMENSIONS (PASS/FAIL with severity):
1. Data Protection Controls
- Encryption at rest (AES-256 minimum)
- Encryption in transit (TLS 1.2+)
- Key management practices
- PHI/PII identification and protection
2. Access Control Implementation
- Role-based access control (RBAC)
- Principle of least privilege
- Session management
- Authentication mechanisms (MFA where required)
3. Audit Trail Completeness
- Who: User identification
- What: Action performed
- When: Timestamp (UTC, synchronized)
- Where: Resource accessed
- Why: Business justification (where applicable)
4. Data Retention & Disposal
- Retention periods defined
- Secure disposal mechanisms
- Backup integrity verification
5. Incident Response Readiness
- Breach notification triggers
- 60-day reporting compliance (HIPAA)
- Incident logging mechanisms
REGULATORY CITATIONS TO VERIFY:
- HIPAA Security Rule: 164.308 (Admin), 164.310 (Physical), 164.312 (Technical)
- FDA 21 CFR Part 11: Electronic records and signatures
- SOC 2 Trust Principles: Security, Availability, Confidentiality
ARTIFACT UNDER REVIEW:
{artifact}
COMPLIANCE REQUIREMENTS:
{compliance_requirements}
Provide your evaluation in the following JSON format:
{
  "persona": "Compliance Auditor",
  "overall_verdict": "COMPLIANT" | "NON_COMPLIANT" | "PARTIALLY_COMPLIANT",
  "confidence": 0.0-1.0,
  "framework_assessments": {
    "hipaa": {
      "status": "PASS" | "FAIL",
      "findings": [
        {"section": "164.312(a)(1)", "requirement": "...", "status": "...", "evidence": "...", "severity": "CRITICAL|HIGH|MEDIUM|LOW"}
      ]
    },
    "fda_part_11": {...},
    "soc2": {...}
  },
  "critical_findings": [],
  "high_findings": [],
  "medium_findings": [],
  "remediation_required": [
    {"finding": "...", "remediation": "...", "regulatory_reference": "...", "deadline_recommendation": "..."}
  ],
  "audit_trail_assessment": {
    "completeness": "...",
    "gaps": []
  },
  "rationale": "..."
}
"""
2.3 Security Analyst Judge Prompt
SECURITY_ANALYST_PROMPT = """
You are James Nakamura, Senior Application Security Engineer with OSCP, GWAPT, and CEH certifications. You have 12 years of experience in penetration testing and secure code review. You've conducted security assessments for healthcare, fintech, and government systems.
YOUR EVALUATION STYLE:
- Strictness: ADVERSARIAL - Assume breach mindset
- Focus: Finding exploitable vulnerabilities before attackers
- Assumption: All user inputs are malicious until proven otherwise
- Documentation: Expect threat models and security architecture docs
OWASP TOP 10 CHECKLIST:
1. Injection Vulnerabilities
- SQL injection
- Command injection
- LDAP injection
- XPath injection
2. Broken Authentication
- Credential stuffing vulnerability
- Session fixation
- Token management flaws
- Password policy enforcement
3. Sensitive Data Exposure
- Hardcoded secrets
- Insufficient encryption
- Data in logs
- Verbose error messages
4. XML External Entities (XXE)
- Unsafe XML parsing
- DTD processing enabled
5. Broken Access Control
- IDOR vulnerabilities
- Missing function-level access control
- Path traversal
6. Security Misconfiguration
- Default credentials
- Unnecessary services
- Missing security headers
7. Cross-Site Scripting (XSS)
- Reflected XSS
- Stored XSS
- DOM-based XSS
8. Insecure Deserialization
- Untrusted data deserialization
- Object injection
9. Using Components with Known Vulnerabilities
- Outdated dependencies
- Unpatched libraries
10. Insufficient Logging & Monitoring
- Missing security event logging
- Inadequate alerting
SEVERITY CLASSIFICATION:
- CRITICAL: Remote code execution, auth bypass, direct PHI exposure
- HIGH: XSS, CSRF, IDOR, significant data leakage
- MEDIUM: Information disclosure, weak crypto, missing rate limiting
- LOW: Verbose errors, missing headers, minor info leakage
ARTIFACT UNDER REVIEW:
{artifact}
SECURITY REQUIREMENTS:
{security_requirements}
TECHNOLOGY STACK:
{tech_stack}
Provide your evaluation in the following JSON format:
{
  "persona": "Security Analyst",
  "overall_verdict": "SECURE" | "VULNERABLE" | "NEEDS_HARDENING",
  "confidence": 0.0-1.0,
  "vulnerability_findings": [
    {
      "id": "VULN-001",
      "category": "OWASP category",
      "severity": "CRITICAL|HIGH|MEDIUM|LOW",
      "title": "...",
      "location": "file:line",
      "description": "...",
      "exploit_scenario": "...",
      "proof_of_concept": "...",
      "remediation": "...",
      "cwe_id": "CWE-XXX"
    }
  ],
  "critical_count": 0,
  "high_count": 0,
  "medium_count": 0,
  "low_count": 0,
  "attack_surface_assessment": "...",
  "threat_model_gaps": [],
  "security_strengths": [],
  "recommended_security_controls": [],
  "rationale": "..."
}
"""
2.4 Domain Expert Judge Prompt (Healthcare)
DOMAIN_EXPERT_HEALTHCARE_PROMPT = """
You are Dr. Elena Vasquez, Clinical Informatics Director with MD and MS Biomedical Informatics credentials. You have 15 years of experience implementing clinical systems at major academic medical centers. You've served on HL7 FHIR workgroups and led clinical decision support implementations.
YOUR EVALUATION STYLE:
- Strictness: HIGH - Patient safety is non-negotiable
- Focus: Clinical workflow alignment and patient outcomes
- Assumption: Clinicians will find workarounds if software doesn't fit workflow
- Documentation: Expect clinical context and safety analysis
CLINICAL EVALUATION DIMENSIONS:
1. Medical Terminology Accuracy
- Correct ICD-10/SNOMED CT usage
- Proper LOINC codes for lab results
- Accurate drug terminology (RxNorm)
- Clinical abbreviations appropriate
2. Clinical Workflow Alignment
- Matches real-world clinical practice
- Appropriate for care setting (inpatient/outpatient/ED)
- Considers cognitive load on clinicians
- Supports rather than disrupts care
3. Patient Safety Considerations
- Alert fatigue potential
- Workaround likelihood
- Medication safety checks
- Allergy verification
4. Interoperability Standards
- HL7 FHIR R4 compliance
- C-CDA document support
- IHE profile adherence
- API design for health data exchange
5. Clinical Decision Support
- Evidence-based rules
- Appropriate sensitivity/specificity
- Clear action recommendations
- Override justification capture
PATIENT SAFETY RED FLAGS:
- Silent failures in medication logic
- Incorrect unit conversions
- Missing allergy checks
- Ambiguous clinical terminology
- Alert fatigue generators (>5 alerts/patient)
ARTIFACT UNDER REVIEW:
{artifact}
CLINICAL REQUIREMENTS:
{clinical_requirements}
APPLICABLE CLINICAL STANDARDS:
{clinical_standards}
Provide your evaluation in the following JSON format:
{
"persona": "Clinical Domain Expert",
"overall_verdict": "CLINICALLY_SAFE" | "SAFETY_CONCERNS" | "CLINICALLY_UNSAFE",
"confidence": 0.0-1.0,
"dimension_scores": {
"terminology_accuracy": {"score": 1-3, "findings": []},
"workflow_alignment": {"score": 1-3, "findings": []},
"patient_safety": {"score": 1-3, "findings": []},
"interoperability": {"score": 1-3, "findings": []},
"clinical_decision_support": {"score": 1-3, "findings": []}
},
"patient_safety_concerns": [
{"concern": "...", "severity": "...", "clinical_impact": "...", "remediation": "..."}
],
"workflow_risks": [],
"terminology_errors": [],
"interoperability_gaps": [],
"strengths": [],
"clinical_review_recommended": true|false,
"rationale": "..."
}
"""
2.5 QA Evaluator Judge Prompt
QA_EVALUATOR_PROMPT = """
You are Priya Sharma, Senior QA Architect with ISTQB Advanced certification and 14 years of experience in test automation and quality engineering. You've built QA programs for healthcare and fintech products with zero-defect requirements.
YOUR EVALUATION STYLE:
- Strictness: METHODICAL - Every edge case matters
- Focus: Defect prevention through comprehensive testing
- Assumption: If it's not tested, it's broken
- Documentation: Expect test specifications and coverage reports
EVALUATION DIMENSIONS:
1. Test Coverage Adequacy
- Unit test coverage (target: 80%+)
- Integration test coverage (critical paths: 100%)
- E2E test coverage (happy paths + key error paths)
2. Edge Case Handling
- Null/undefined inputs
- Boundary values
- Empty collections
- Maximum lengths
- Concurrent access
- Resource exhaustion
3. Error Path Testing
- Network failures
- Timeout scenarios
- Invalid inputs
- Permission denials
- Data corruption
4. Testability Assessment
- Dependency injection usage
- Mock-friendly design
- Deterministic behavior
- Observable state
5. Regression Risk
- Breaking change potential
- Backward compatibility
- API contract changes
- State migration needs
TESTING GAPS TO IDENTIFY:
- Missing negative test cases
- Untested error branches
- No timeout handling tests
- Missing concurrency tests
- No performance baselines
ARTIFACT UNDER REVIEW:
{artifact}
EXISTING TEST COVERAGE:
{existing_tests}
REQUIREMENTS SPECIFICATION:
{requirements}
Provide your evaluation in the following JSON format:
{
  "persona": "QA Evaluator",
  "overall_verdict": "ADEQUATELY_TESTED" | "TESTING_GAPS" | "INSUFFICIENT_TESTING",
  "confidence": 0.0-1.0,
  "coverage_assessment": {
    "unit_test_coverage": "X%",
    "integration_test_coverage": "X%",
    "critical_paths_covered": true|false,
    "error_paths_covered": true|false
  },
  "missing_test_cases": [
    {
      "category": "edge_case|error_path|integration|performance",
      "description": "...",
      "priority": "HIGH|MEDIUM|LOW",
      "suggested_test": "..."
    }
  ],
  "edge_cases_handled": [],
  "edge_cases_missing": [],
  "testability_issues": [],
  "regression_risks": [],
  "test_quality_observations": [],
  "recommended_test_additions": [],
  "rationale": "..."
}
"""
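All five personas are asked for JSON, but models sometimes wrap it in prose or a markdown code fence. A defensive parsing sketch the orchestrator could run before building a `JudgeEvaluation` (the `parse_judge_response` name is an assumption, not part of the Coditect API):

```python
import json
import re

def parse_judge_response(raw: str) -> dict:
    """Extract the first JSON object from a model reply that may wrap it
    in prose or a markdown code fence. Assumption: the verdict JSON is
    the outermost {...} block in the reply."""
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", raw, re.DOTALL)
    if fenced:
        candidate = fenced.group(1)
    else:
        # Fall back to the widest brace-delimited span
        start = raw.find("{")
        end = raw.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("no JSON object found in judge response")
        candidate = raw[start:end + 1]
    return json.loads(candidate)
```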
Part 3: Consensus Protocol Implementation
3.1 Voting Mechanism
from dataclasses import dataclass
from typing import Dict, List
from enum import Enum
import statistics


class Verdict(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    CONDITIONAL = "CONDITIONAL"


@dataclass
class JudgeEvaluation:
    persona_id: str
    model_used: str
    verdict: Verdict
    confidence: float
    dimension_scores: Dict[str, float]
    critical_findings: List[str]
    remediation_required: List[str]
    rationale: str
    raw_response: str  # For audit trail
    timestamp: str
    token_usage: int


@dataclass
class ConsensusResult:
    final_verdict: Verdict
    confidence: float
    agreement_ratio: float
    majority_rationale: str
    dissenting_views: List[Dict]
    aggregated_scores: Dict[str, float]
    all_critical_findings: List[str]
    all_remediation: List[str]
    provenance_chain: List[Dict]
    escalation_required: bool
class ConsensusEngine:
    """
    Implements 2/3 threshold consensus with weighted voting.
    Based on Hashgraph-inspired consensus (Ogunsina & Ogunsina, 2025).
    """

    def __init__(self):
        self.persona_weights = {
            "technical_architect": 0.25,
            "compliance_auditor": 0.25,
            "security_analyst": 0.20,
            "domain_expert": 0.15,
            "qa_evaluator": 0.15
        }
        self.approval_threshold = 0.67  # 2/3 majority
        self.confidence_floor = 0.6  # Minimum for auto-approval

    def calculate_consensus(
        self,
        evaluations: List[JudgeEvaluation]
    ) -> ConsensusResult:
        """Calculate consensus from judge panel evaluations."""
        if not evaluations:
            raise ValueError("cannot calculate consensus without evaluations")

        # Calculate weighted votes (avoid shadowing the builtin `eval`)
        pass_weight = 0.0
        fail_weight = 0.0
        conditional_weight = 0.0
        for ev in evaluations:
            weight = self.persona_weights.get(ev.persona_id, 0.15)
            confidence_adjusted_weight = weight * ev.confidence
            if ev.verdict == Verdict.PASS:
                pass_weight += confidence_adjusted_weight
            elif ev.verdict == Verdict.FAIL:
                fail_weight += confidence_adjusted_weight
            else:
                conditional_weight += confidence_adjusted_weight

        total_weight = pass_weight + fail_weight + conditional_weight
        if total_weight == 0:
            # Every judge reported zero confidence: all ratios become 0,
            # forcing a CONDITIONAL verdict and human escalation below
            total_weight = 1.0

        # Normalize
        pass_ratio = pass_weight / total_weight
        fail_ratio = fail_weight / total_weight
        conditional_ratio = conditional_weight / total_weight

        # Determine verdict
        if pass_ratio >= self.approval_threshold:
            final_verdict = Verdict.PASS
        elif fail_ratio >= self.approval_threshold:
            final_verdict = Verdict.FAIL
        else:
            final_verdict = Verdict.CONDITIONAL

        # Calculate overall confidence
        confidence = statistics.mean([e.confidence for e in evaluations])

        # Identify dissent
        dissenting_views = self._extract_dissent(evaluations, final_verdict)

        # Aggregate scores across dimensions
        aggregated_scores = self._aggregate_dimension_scores(evaluations)

        # Collect all critical findings
        all_critical = []
        all_remediation = []
        for ev in evaluations:
            all_critical.extend(ev.critical_findings)
            all_remediation.extend(ev.remediation_required)

        # Build provenance chain
        provenance = [
            {
                "persona": e.persona_id,
                "model": e.model_used,
                "verdict": e.verdict.value,
                "confidence": e.confidence,
                "timestamp": e.timestamp,
                "token_usage": e.token_usage
            }
            for e in evaluations
        ]

        # Determine if human escalation required
        escalation_required = (
            final_verdict == Verdict.CONDITIONAL or
            confidence < self.confidence_floor or
            len(dissenting_views) >= 2 or
            any("CRITICAL" in str(f).upper() for f in all_critical)
        )

        return ConsensusResult(
            final_verdict=final_verdict,
            confidence=confidence,
            agreement_ratio=max(pass_ratio, fail_ratio, conditional_ratio),
            majority_rationale=self._synthesize_rationale(evaluations, final_verdict),
            dissenting_views=dissenting_views,
            aggregated_scores=aggregated_scores,
            all_critical_findings=list(set(all_critical)),
            all_remediation=list(set(all_remediation)),
            provenance_chain=provenance,
            escalation_required=escalation_required
        )

    def _extract_dissent(
        self,
        evaluations: List[JudgeEvaluation],
        final_verdict: Verdict
    ) -> List[Dict]:
        """Extract dissenting opinions for audit trail."""
        dissents = []
        for ev in evaluations:
            if ev.verdict != final_verdict:
                dissents.append({
                    "persona": ev.persona_id,
                    "verdict": ev.verdict.value,
                    "confidence": ev.confidence,
                    "rationale": ev.rationale,
                    "key_concerns": ev.critical_findings[:3]
                })
        return dissents

    def _aggregate_dimension_scores(
        self,
        evaluations: List[JudgeEvaluation]
    ) -> Dict[str, float]:
        """Aggregate scores across dimensions with weighting."""
        dimension_scores = {}
        dimension_weights = {}
        for ev in evaluations:
            weight = self.persona_weights.get(ev.persona_id, 0.15)
            for dim, score in ev.dimension_scores.items():
                if dim not in dimension_scores:
                    dimension_scores[dim] = 0.0
                    dimension_weights[dim] = 0.0
                dimension_scores[dim] += score * weight
                dimension_weights[dim] += weight
        # Normalize
        return {
            dim: score / dimension_weights[dim]
            for dim, score in dimension_scores.items()
            if dimension_weights[dim] > 0
        }

    def _synthesize_rationale(
        self,
        evaluations: List[JudgeEvaluation],
        final_verdict: Verdict
    ) -> str:
        """Synthesize majority rationale from aligned evaluations."""
        aligned = [e for e in evaluations if e.verdict == final_verdict]
        if not aligned:
            aligned = evaluations
        # Weight by confidence
        weighted_rationales = sorted(
            aligned,
            key=lambda e: e.confidence * self.persona_weights.get(e.persona_id, 0.15),
            reverse=True
        )
        # Take top 2 rationales
        primary = weighted_rationales[0].rationale if weighted_rationales else ""
        secondary = weighted_rationales[1].rationale if len(weighted_rationales) > 1 else ""
        return f"Primary: {primary}\n\nSupporting: {secondary}"
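To make the confidence-scaled, weighted 2/3 vote concrete, here is a worked numeric example following the same tally as `calculate_consensus`. The votes and confidences are illustrative:

```python
# Three PASS votes and one FAIL, each scaled by persona weight x confidence.
weights = {"technical_architect": 0.25, "compliance_auditor": 0.25,
           "security_analyst": 0.20, "qa_evaluator": 0.15}
votes = {"technical_architect": ("PASS", 0.90),
         "compliance_auditor": ("PASS", 0.80),
         "security_analyst": ("FAIL", 0.70),
         "qa_evaluator": ("PASS", 0.85)}

tallies = {"PASS": 0.0, "FAIL": 0.0, "CONDITIONAL": 0.0}
for persona, (verdict, confidence) in votes.items():
    tallies[verdict] += weights[persona] * confidence

total = sum(tallies.values())          # 0.5525 + 0.14 = 0.6925
pass_ratio = tallies["PASS"] / total   # 0.5525 / 0.6925 ≈ 0.798
approved = pass_ratio >= 0.67          # True: weighted PASS share clears 2/3
```

Note that the lone FAIL vote at 0.70 confidence is not enough to block approval here, but it would still be recorded as a dissenting view in the audit trail.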
3.2 Debate Protocol
class DebateOrchestrator:
    """
    Orchestrates multi-round debate when judges disagree.
    Based on MAJ-EVAL in-group debate protocol (Chen et al., 2025).
    """

    MAX_DEBATE_ROUNDS = 3
    CONVERGENCE_THRESHOLD = 0.8  # Agreement ratio to stop debate

    async def orchestrate_debate(
        self,
        evaluations: List[JudgeEvaluation],
        artifact: str,
        context: Dict
    ) -> List[JudgeEvaluation]:
        """Orchestrate debate rounds until convergence or max rounds."""
        current_evaluations = evaluations
        for round_num in range(self.MAX_DEBATE_ROUNDS):
            # Check for convergence
            agreement = self._calculate_agreement(current_evaluations)
            if agreement >= self.CONVERGENCE_THRESHOLD:
                break
            # Identify disagreement areas
            disagreements = self._identify_disagreements(current_evaluations)
            # Generate debate prompts
            debate_context = self._prepare_debate_context(
                current_evaluations,
                disagreements,
                round_num
            )
            # Each judge responds to disagreements
            updated_evaluations = await self._conduct_debate_round(
                current_evaluations,
                debate_context,
                artifact
            )
            current_evaluations = updated_evaluations
        return current_evaluations

    def _identify_disagreements(
        self,
        evaluations: List[JudgeEvaluation]
    ) -> List[Dict]:
        """Identify specific dimensions where judges disagree."""
        disagreements = []
        # Check verdict-level disagreement
        verdicts = [e.verdict for e in evaluations]
        if len(set(verdicts)) > 1:
            disagreements.append({
                "type": "verdict",
                "positions": {
                    e.persona_id: e.verdict.value
                    for e in evaluations
                }
            })
        # Check dimension-level disagreements
        all_dimensions = set()
        for e in evaluations:
            all_dimensions.update(e.dimension_scores.keys())
        for dim in all_dimensions:
            scores = [
                e.dimension_scores.get(dim, 0)
                for e in evaluations
            ]
            if max(scores) - min(scores) >= 1.5:  # Significant gap
                disagreements.append({
                    "type": "dimension",
                    "dimension": dim,
                    "positions": {
                        e.persona_id: e.dimension_scores.get(dim)
                        for e in evaluations
                    }
                })
        return disagreements

    def _prepare_debate_context(
        self,
        evaluations: List[JudgeEvaluation],
        disagreements: List[Dict],
        round_num: int
    ) -> str:
        """Prepare context for debate round."""
        context = f"DEBATE ROUND {round_num + 1}\n\n"
        context += "AREAS OF DISAGREEMENT:\n"
        for d in disagreements:
            if d["type"] == "verdict":
                context += "\nVERDICT DISAGREEMENT:\n"
                for persona, verdict in d["positions"].items():
                    ev = next(e for e in evaluations if e.persona_id == persona)
                    context += f"- {persona}: {verdict} (confidence: {ev.confidence:.2f})\n"
                    context += f"  Rationale: {ev.rationale[:200]}...\n"
            else:
                context += f"\nDIMENSION: {d['dimension']}\n"
                for persona, score in d["positions"].items():
                    context += f"- {persona}: Score {score}\n"
        context += "\n\nINSTRUCTIONS:\n"
        context += "1. Review other judges' positions and evidence\n"
        context += "2. Identify if their concerns change your assessment\n"
        context += "3. Provide updated evaluation if warranted\n"
        context += "4. Cite specific evidence for your position\n"
        return context
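`orchestrate_debate` relies on a `_calculate_agreement` helper that is not shown in the class above. One plausible sketch, assuming agreement is measured as the fraction of judges holding the modal verdict:

```python
from collections import Counter

def calculate_agreement(verdicts: list[str]) -> float:
    """Fraction of judges holding the most common verdict.
    Sketch of the convergence check; the real helper would take
    JudgeEvaluation objects and read .verdict from each."""
    if not verdicts:
        return 0.0
    modal_count = Counter(verdicts).most_common(1)[0][1]
    return modal_count / len(verdicts)

# 3 of 4 judges agreeing gives 0.75, below the 0.8 convergence
# threshold, so another debate round would run.
```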
Part 4: ADR-to-Rubric Generation Pipeline
4.1 ADR Parser
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ADRConstraint:
    """A testable constraint extracted from an ADR."""
    source_adr: str
    constraint_type: str  # MUST, SHOULD, MAY
    description: str
    evidence_quote: str
    testable_criteria: List[str]


@dataclass
class GeneratedRubric:
    """Rubric generated from ADR constraints."""
    source_adr: str
    dimension: str
    scale: List[int]
    score_descriptions: Dict[int, str]
    evaluation_steps: List[str]
    weight: float


class ADRRubricGenerator:
    """
    Generates evaluation rubrics from Architecture Decision Records.
    Implements automatic constraint extraction and rubric synthesis.
    """

    # Word boundaries prevent substring matches (e.g. "may" inside "mayor")
    CONSTRAINT_PATTERNS = {
        "MUST": r"\b(?:must|shall|required|mandatory|will)\b",
        "SHOULD": r"\b(?:should|recommended|preferred)\b",
        "MAY": r"\b(?:may|optional|can)\b"
    }

    def parse_adr(self, adr_content: str, adr_id: str) -> List[ADRConstraint]:
        """Extract testable constraints from ADR content."""
        constraints = []
        # Find decision section
        decision_match = re.search(
            r"##\s*Decision\s*\n(.*?)(?=##|\Z)",
            adr_content,
            re.DOTALL | re.IGNORECASE
        )
        if not decision_match:
            return constraints
        decision_text = decision_match.group(1)
        # Split into sentences
        sentences = re.split(r'(?<=[.!?])\s+', decision_text)
        for sentence in sentences:
            for constraint_type, pattern in self.CONSTRAINT_PATTERNS.items():
                if re.search(pattern, sentence, re.IGNORECASE):
                    constraints.append(ADRConstraint(
                        source_adr=adr_id,
                        constraint_type=constraint_type,
                        description=sentence.strip(),
                        evidence_quote=sentence.strip(),
                        testable_criteria=self._extract_criteria(sentence)
                    ))
                    break
        return constraints

    def _extract_criteria(self, sentence: str) -> List[str]:
        """Extract testable criteria from constraint sentence."""
        criteria = []
        # Look for specific technical requirements
        tech_patterns = [
            r"encryption",
            r"TLS\s*[\d.]+",
            r"AES-\d+",
            r"HIPAA",
            r"audit\s*(?:log|trail)",
            r"FHIR",
            r"FoundationDB",
            r"event\s*sourc",
            r"ACID",
            r"authentication",
            r"authorization"
        ]
        for pattern in tech_patterns:
            match = re.search(pattern, sentence, re.IGNORECASE)
            if match:
                criteria.append(f"Verify {match.group()} implementation")
        return criteria if criteria else ["Verify compliance with stated requirement"]

    def generate_rubric(
        self,
        constraints: List[ADRConstraint]
    ) -> List[GeneratedRubric]:
        """Generate evaluation rubrics from constraints."""
        rubrics = []
        # Group constraints by ADR
        by_adr = {}
        for c in constraints:
            by_adr.setdefault(c.source_adr, []).append(c)
        for adr_id, adr_constraints in by_adr.items():
            # Create rubric for each ADR
            dimension = f"ADR Compliance: {adr_id}"
            # Build score descriptions based on constraint types
            must_count = len([c for c in adr_constraints if c.constraint_type == "MUST"])
            score_descriptions = {
                3: f"All {must_count} MUST requirements fully implemented with evidence",
                2: "Most MUST requirements implemented, minor gaps in SHOULD items",
                1: "MUST requirements violated or missing critical implementations"
            }
            # Compile evaluation steps
            eval_steps = []
            for c in adr_constraints:
                if c.constraint_type in ("MUST", "SHOULD"):
                    for criterion in c.testable_criteria:
                        eval_steps.append(f"[{c.constraint_type}] {criterion}")
            rubrics.append(GeneratedRubric(
                source_adr=adr_id,
                dimension=dimension,
                scale=[1, 2, 3],
                score_descriptions=score_descriptions,
                evaluation_steps=eval_steps,
                weight=0.2  # Default weight, adjustable
            ))
        return rubrics

    def augment_persona_rubric(
        self,
        base_rubric: Dict,
        generated_rubrics: List[GeneratedRubric]
    ) -> Dict:
        """Augment base persona rubric with ADR-specific dimensions."""
        augmented = base_rubric.copy()
        for rubric in generated_rubrics:
            augmented["dimensions"].append({
                "name": rubric.dimension,
                "weight": rubric.weight,
                "scale": rubric.scale,
                "descriptions": rubric.score_descriptions,
                "evaluation_steps": rubric.evaluation_steps,
                "source": "ADR-GENERATED"
            })
        # Renormalize weights
        total_weight = sum(d["weight"] for d in augmented["dimensions"])
        for d in augmented["dimensions"]:
            d["weight"] = d["weight"] / total_weight
        return augmented
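The extraction pass can be illustrated standalone on a toy ADR. This snippet mirrors the Decision-section regex and the MUST/SHOULD/MAY keyword matching used by `ADRRubricGenerator` (word boundaries assumed, to avoid substring matches like `may` inside `mayor`):

```python
import re

# Toy ADR with one constraint of each strength.
ADR = """## Decision
All PHI at rest must be encrypted with AES-256. Services should emit
an audit trail for every access. Caching may be added later."""

PATTERNS = {"MUST": r"\b(?:must|shall|required|mandatory|will)\b",
            "SHOULD": r"\b(?:should|recommended|preferred)\b",
            "MAY": r"\b(?:may|optional|can)\b"}

decision = re.search(r"##\s*Decision\s*\n(.*)", ADR, re.DOTALL).group(1)
constraints = []
for sentence in re.split(r"(?<=[.!?])\s+", decision):
    for ctype, pattern in PATTERNS.items():
        if re.search(pattern, sentence, re.IGNORECASE):
            constraints.append((ctype, sentence.strip()))
            break  # strongest keyword wins; stop at first match
```

Running this yields one MUST, one SHOULD, and one MAY constraint, in document order; each would then be expanded into `[MUST]`/`[SHOULD]` evaluation steps by `generate_rubric`.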
Part 5: Implementation Roadmap
5.1 Phase 1: Foundation (Weeks 1-4)
| Week | Deliverable | Success Criteria |
|---|---|---|
| 1 | Persona prompt templates finalized | All 5 core personas documented |
| 2 | Judge invocation infrastructure | Can invoke Claude/GPT-4o/DeepSeek judges |
| 3 | Consensus engine implementation | Weighted voting functional |
| 4 | Basic audit trail | All evaluations logged with provenance |
5.2 Phase 2: Integration (Weeks 5-8)
| Week | Deliverable | Success Criteria |
|---|---|---|
| 5 | ADR parser implementation | Extract constraints from 10 test ADRs |
| 6 | Dynamic rubric generation | Rubrics auto-generated from ADRs |
| 7 | Solution MoE → Judge integration | End-to-end artifact flow |
| 8 | Debate protocol implementation | Multi-round debate functional |
5.3 Phase 3: Calibration (Weeks 9-12)
| Week | Deliverable | Success Criteria |
|---|---|---|
| 9 | Human calibration dataset | 100 expert-graded artifacts |
| 10 | Agreement metric calculation | Cohen's κ ≥ 0.6 per dimension |
| 11 | Threshold tuning | False positive ≤ 10%, false negative ≤ 5% |
| 12 | Bias testing | Pass adversarial bias probes |
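The week-10 agreement check can be computed directly. A minimal Cohen's kappa sketch on toy verdict labels (the real calibration run would use the 100 expert-graded artifacts, per dimension):

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Kappa = (observed agreement - chance agreement) / (1 - chance)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

judge  = ["PASS", "PASS", "FAIL", "PASS", "FAIL", "PASS"]
expert = ["PASS", "FAIL", "FAIL", "PASS", "FAIL", "PASS"]
kappa = cohens_kappa(judge, expert)  # 2/3 ≈ 0.67, above the 0.6 target
```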
5.4 Phase 4: Production Hardening (Weeks 13-16)
| Week | Deliverable | Success Criteria |
|---|---|---|
| 13 | Red team testing | 95% adversarial rejection rate |
| 14 | Escalation workflow | Human review loop integrated |
| 15 | Performance optimization | < 60s for full panel evaluation |
| 16 | Compliance documentation | SOC 2 / HIPAA audit readiness |
Part 6: Success Metrics
6.1 Accuracy Metrics
| Metric | Target | Measurement |
|---|---|---|
| Human Agreement (overall) | ≥ 75% | Cohen's κ vs. expert panel |
| Security Finding Recall | ≥ 90% | Known vulnerabilities detected |
| Compliance Finding Recall | ≥ 95% | Known violations detected |
| False Positive Rate | ≤ 10% | Good code wrongly failed |
| False Negative Rate | ≤ 5% | Bad code wrongly passed |
6.2 Operational Metrics
| Metric | Target | Measurement |
|---|---|---|
| Evaluation Latency | < 60s | P95 for full panel |
| Judge Availability | 99.5% | Uptime across all judges |
| Cost per Evaluation | < $2.00 | Total API costs |
| Escalation Rate | < 15% | Human review required |
6.3 Defensibility Metrics
| Metric | Target | Measurement |
|---|---|---|
| Provenance Completeness | 100% | All decisions traceable |
| Dissent Recording | 100% | All minority views captured |
| Audit Trail Integrity | 100% | Immutable, timestamped |
| Rationale Quality | ≥ 4/5 | Human rating of explanations |
Appendix A: Quick Reference Card
Invoking Judge Panel
from coditect.verification import JudgePanel, JudgePanelConfig

config = JudgePanelConfig(
    personas=["technical_architect", "compliance_auditor", "security_analyst"],
    regulatory_frameworks=["HIPAA", "SOC2"],
    approval_threshold=0.67,
    max_debate_rounds=3
)

panel = JudgePanel(config)
result = await panel.evaluate(
    artifact=code_artifact,
    context={
        "adrs": project_adrs,
        "requirements": requirements_doc,
        "tech_stack": tech_stack_info
    }
)

if result.final_verdict == Verdict.PASS:
    print(f"Approved with {result.confidence:.0%} confidence")
else:
    print(f"Remediation required: {result.all_remediation}")
Output Structure
{
  "final_verdict": "PASS|FAIL|CONDITIONAL",
  "confidence": 0.85,
  "agreement_ratio": 0.92,
  "aggregated_scores": {
    "architectural_soundness": 2.7,
    "security_compliance": 2.9,
    "regulatory_alignment": 3.0,
    "code_quality": 2.5
  },
  "critical_findings": [],
  "remediation_required": [],
  "dissenting_views": [],
  "provenance_chain": [
    {"persona": "...", "model": "...", "verdict": "...", "timestamp": "..."}
  ],
  "escalation_required": false
}
Document Version: 1.0 | For: Coditect Autonomous Development Platform | January 2026