
Judge Persona Design Methodology for Multi-Agent Verification Systems

A Research-Backed Framework for Constructing Effective AI Judges

Research Synthesis | January 2026


Executive Summary

Designing effective judge personas for multi-agent verification systems is not arbitrary—it requires systematic methodology grounded in domain expertise, stakeholder analysis, and empirical validation. This document synthesizes 2024-2025 academic research into an actionable framework for constructing judge personas that achieve human-competitive evaluation accuracy while maintaining defensibility and interpretability.

The key insight from recent research: automatically extracted, domain-grounded personas improve correlation with human expert judgments by 15-47% relative to manually crafted generic personas (Chen et al., 2025; MAJ-EVAL).


Part 1: The Science of Judge Persona Construction

1.1 Why Personas Matter

Single-model LLM judges exhibit systematic biases that undermine evaluation reliability:

| Bias Type | Description | Impact on Accuracy |
|---|---|---|
| Position Bias | Favoring responses based on presentation order | 10%+ accuracy shifts |
| Verbosity Bias | Preferring longer outputs regardless of quality | Systematic over-scoring |
| Self-Enhancement Bias | Favoring outputs similar to judge's own style | 5-15% preference skew |
| Intra-Model Bias | Same-family models rating each other higher | Correlated errors |

Solution: Multi-persona, multi-model judge panels (PoLL architecture) reduce these biases by 40-60% through diversity (Verga et al., 2024).
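The debiasing mechanics can be sketched in a few lines: a panel queries each judge in both presentation orders (neutralizing position bias) and takes a majority vote across judges (diluting any single model's style preferences). The `judge` callables below are toy stand-ins for real model calls, not part of the PoLL paper's implementation.

```python
from collections import Counter
from typing import Callable, List

def debiased_verdict(judge: Callable[[str, str], str], a: str, b: str) -> str:
    """Query the judge in both presentation orders; return 'tie' on disagreement."""
    first = judge(a, b)                      # "A" or "B" with a shown first
    second = judge(b, a)                     # same pair, order swapped
    swapped = {"A": "B", "B": "A"}[second]   # map back to original labels
    return first if first == swapped else "tie"

def panel_verdict(judges: List[Callable[[str, str], str]], a: str, b: str) -> str:
    """Aggregate debiased per-judge verdicts by simple majority."""
    votes = Counter(debiased_verdict(j, a, b) for j in judges)
    winner, _ = votes.most_common(1)[0]
    return winner

# Toy judges standing in for different model families:
always_a = lambda x, y: "A"                                   # position-biased
prefers_longer = lambda x, y: "A" if len(x) >= len(y) else "B"  # verbosity-biased

print(panel_verdict([always_a, prefers_longer, prefers_longer],
                    "short", "a longer answer"))  # prints "B"
```

The position-biased judge self-contradicts under order swapping and is demoted to "tie", so only order-consistent preferences survive aggregation.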

1.2 The MAJ-EVAL Framework: Automatic Persona Extraction

Chen et al. (2025) introduced the most rigorous methodology for judge persona construction:

Step 1: Evaluative Dimension Extraction

Given domain-specific documents (research papers, standards, regulations):

  1. Extract stakeholder categories (who evaluates this content?)
  2. Identify evaluation dimensions per stakeholder (what do they care about?)
  3. Gather evidence quotes grounding each dimension

Example from Healthcare Domain:

Stakeholder: Clinician
Dimensions:
- Clinical Accuracy: "Evidence-based recommendations aligned with current guidelines"
- Actionability: "Clear next steps the care team can implement"
- Risk Communication: "Appropriate uncertainty quantification for prognosis"

Stakeholder: Patient
Dimensions:
- Comprehensibility: "Layperson-accessible language without jargon"
- Emotional Appropriateness: "Sensitive delivery of difficult information"
- Empowerment: "Information enabling informed decision-making"

Step 2: Dimension-Based Persona Construction

Each persona includes five key attributes (Chen et al., 2025):

  1. Demographic Information: Name, age, profession, years of experience
  2. Evaluative Dimension: The specific aspect this persona evaluates
  3. Domain Specialty: Deep expertise area within the domain
  4. Psychological Traits: Evaluation style (strict/lenient, detail-oriented/holistic)
  5. Social Relationships: How they interact with other stakeholders

Example Persona:

{
  "name": "Dr. Sarah Chen",
  "demographics": {
    "age": 45,
    "profession": "Chief Compliance Officer",
    "experience_years": 18
  },
  "evaluative_dimension": "Regulatory Compliance",
  "domain_specialty": "HIPAA Security Rule, FDA 21 CFR Part 11",
  "psychological_traits": {
    "strictness": "high",
    "orientation": "risk-averse",
    "focus": "systematic checklist verification"
  },
  "social_relationships": {
    "reports_to": "Board of Directors",
    "collaborates_with": ["Security Team", "Legal Counsel"],
    "advocates_for": "Patient data protection"
  }
}

1.3 The PoLL Principle: Panel Diversity

Verga et al. (2024) demonstrated that judge panels from disjoint model families outperform single large judges:

Optimal Panel Composition:

  • Minimum 3 models from different families
  • Mix of model sizes (large + smaller specialized)
  • Different training paradigms (instruction-tuned, RLHF, base models)

Recommended Panel:

Panel Slot 1: Claude (Anthropic family) - Constitutional AI training
Panel Slot 2: GPT-4/GPT-4o (OpenAI family) - RLHF optimization
Panel Slot 3: DeepSeek-V3 (DeepSeek family) - MoE architecture
Panel Slot 4: Qwen/Llama (Open-source family) - Alternative training data

Results:

  • 7x cost reduction vs. single GPT-4 judge
  • Higher correlation with human judgments
  • Reduced intra-model bias through aggregation

Part 2: Domain-Specific Rubric Design

2.1 Rubric Architecture Principles

Effective rubrics follow the G-EVAL pattern (Liu et al., 2023):

  1. Criterion Decomposition: Break complex evaluations into atomic dimensions
  2. Explicit Scoring Scales: 3-5 point scales with clear boundary descriptions
  3. Reference Examples: Concrete examples for each score level
  4. Chain-of-Thought Steps: Explicit evaluation procedure for consistency

Critical Finding: Binary or 3-point scales significantly outperform 10+ point scales for reliability (EvidentlyAI, 2024; Monte Carlo, 2025).

2.2 Software Development Rubric Template

rubric:
  name: "Code Artifact Evaluation"
  version: "1.0"

  dimensions:
    - name: "Functional Correctness"
      weight: 0.30
      scale: [1, 2, 3]
      descriptions:
        3: "Code compiles, passes all tests, handles edge cases correctly"
        2: "Code compiles, passes primary tests, minor edge case gaps"
        1: "Code has compilation errors or fails critical tests"
      evaluation_steps:
        - "Check if code compiles without errors"
        - "Verify unit tests pass (if provided)"
        - "Trace logic against requirements specification"
        - "Identify unhandled edge cases"

    - name: "Security Compliance"
      weight: 0.25
      scale: [1, 2, 3]
      descriptions:
        3: "No vulnerabilities detected, follows OWASP guidelines, proper input validation"
        2: "Minor security gaps, no critical vulnerabilities"
        1: "Critical vulnerabilities present (injection, auth bypass, data exposure)"
      evaluation_steps:
        - "Scan for injection vulnerabilities (SQL, XSS, command)"
        - "Verify authentication/authorization patterns"
        - "Check for sensitive data exposure"
        - "Validate input sanitization"

    - name: "Regulatory Alignment"
      weight: 0.25
      scale: [1, 2, 3]
      descriptions:
        3: "Full compliance with applicable regulations, complete audit trail"
        2: "Substantial compliance, minor documentation gaps"
        1: "Material compliance failures or missing required controls"
      evaluation_steps:
        - "Map code to regulatory requirements (HIPAA/SOC2/FDA)"
        - "Verify audit logging implementation"
        - "Check data handling against privacy rules"
        - "Confirm encryption standards"

    - name: "Code Quality"
      weight: 0.20
      scale: [1, 2, 3]
      descriptions:
        3: "Clean architecture, proper error handling, well-documented"
        2: "Acceptable structure, basic error handling, minimal docs"
        1: "Poor architecture, missing error handling, no documentation"
      evaluation_steps:
        - "Assess separation of concerns"
        - "Review error handling completeness"
        - "Check documentation coverage"
        - "Evaluate test coverage"
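Once per-dimension scores are collected, the weighted aggregation implied by the rubric above is a straightforward weighted mean. A minimal sketch follows; the normalization to [0, 1] and the example scores are illustrative assumptions, not part of the rubric itself.

```python
# Dimension names and weights mirror the rubric YAML above.
WEIGHTS = {
    "Functional Correctness": 0.30,
    "Security Compliance": 0.25,
    "Regulatory Alignment": 0.25,
    "Code Quality": 0.20,
}

def overall_score(scores: dict) -> float:
    """Weighted mean of 1-3 dimension scores, rescaled so 1 -> 0.0 and 3 -> 1.0."""
    assert set(scores) == set(WEIGHTS), "every dimension must be scored"
    weighted = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
    return (weighted - 1.0) / 2.0  # map the 1..3 range onto 0..1

scores = {
    "Functional Correctness": 3,
    "Security Compliance": 2,
    "Regulatory Alignment": 3,
    "Code Quality": 2,
}
print(round(overall_score(scores), 3))  # → 0.775
```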

2.3 Question-Specific vs. Question-Agnostic Rubrics

Research on code evaluation (ICER 2025) demonstrates that question-specific rubrics significantly outperform generic rubrics:

| Rubric Type | Human Correlation | Use Case |
|---|---|---|
| Question-Agnostic | 0.35-0.50 | Generic code review |
| Question-Specific | 0.65-0.80 | Task-targeted evaluation |

Implication: For Coditect's verification layer, rubrics should be dynamically generated based on:

  • The specific requirements document
  • Applicable regulatory framework
  • Technology stack constraints
  • ADR-defined architectural principles

Part 3: Judge Persona Catalog for Regulated Software Development

3.1 Core Stakeholder Analysis

For software development in regulated industries (healthcare, fintech), essential stakeholder perspectives include:

┌────────────────────────────────────────────────────────┐
│                  STAKEHOLDER MAPPING                   │
├────────────────────────────────────────────────────────┤
│                                                        │
│  TECHNICAL               COMPLIANCE                    │
│  ┌───────────────────┐   ┌───────────────────┐         │
│  │ Senior Engineer   │   │ Compliance        │         │
│  │ - Code quality    │   │ Officer           │         │
│  │ - Architecture    │   │ - Regulatory      │         │
│  │ - Performance     │   │ - Documentation   │         │
│  └───────────────────┘   └───────────────────┘         │
│                                                        │
│  SECURITY                DOMAIN EXPERT                 │
│  ┌───────────────────┐   ┌───────────────────┐         │
│  │ Security          │   │ Clinical/Finance  │         │
│  │ Analyst           │   │ SME               │         │
│  │ - Vulnerabilities │   │ - Domain logic    │         │
│  │ - Access control  │   │ - Terminology     │         │
│  └───────────────────┘   └───────────────────┘         │
│                                                        │
│  QA/TESTING              END USER                      │
│  ┌───────────────────┐   ┌───────────────────┐         │
│  │ QA Lead           │   │ Product Owner     │         │
│  │ - Test coverage   │   │ - Requirements    │         │
│  │ - Edge cases      │   │ - Usability       │         │
│  └───────────────────┘   └───────────────────┘         │
│                                                        │
└────────────────────────────────────────────────────────┘

3.2 Detailed Persona Specifications

PERSONA 1: Technical Architect Judge

{
  "persona_id": "TECH_ARCH_001",
  "name": "Marcus Rivera",
  "role": "Principal Software Architect",
  "experience": "22 years distributed systems",

  "evaluation_dimensions": [
    "Architectural soundness",
    "Design pattern appropriateness",
    "Scalability considerations",
    "Technical debt assessment",
    "API design quality"
  ],

  "evaluation_style": {
    "strictness": "high",
    "focus": "systemic_quality",
    "tolerance_for_shortcuts": "low",
    "documentation_expectation": "comprehensive"
  },

  "rubric_weights": {
    "code_structure": 0.30,
    "design_patterns": 0.25,
    "error_handling": 0.20,
    "performance": 0.15,
    "documentation": 0.10
  },

  "red_flags": [
    "God classes or functions",
    "Tight coupling between modules",
    "Missing abstraction layers",
    "Synchronous calls where async needed",
    "Hardcoded configuration"
  ],

  "prompt_template": "You are Marcus Rivera, a Principal Software Architect with 22 years of experience in distributed systems. You evaluate code with a focus on long-term maintainability, architectural soundness, and systemic quality. You have low tolerance for shortcuts that create technical debt. Evaluate the following artifact against these dimensions: {dimensions}. For each dimension, provide a score (1-3) with specific evidence from the code."
}
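A minimal sketch of how such a spec might be consumed at runtime: the persona JSON is loaded and its `prompt_template` is rendered by filling the `{dimensions}` placeholder. The abbreviated spec and the `render_prompt` helper are illustrative, not a defined API.

```python
import json

# Abbreviated persona spec in the format shown above:
persona_json = """
{
  "name": "Marcus Rivera",
  "evaluation_dimensions": ["Architectural soundness", "API design quality"],
  "prompt_template": "You are Marcus Rivera, a Principal Software Architect. Evaluate the following artifact against these dimensions: {dimensions}. For each dimension, provide a score (1-3) with specific evidence."
}
"""

def render_prompt(spec: dict) -> str:
    """Join the persona's dimensions into the template's {dimensions} slot."""
    dims = "; ".join(spec["evaluation_dimensions"])
    return spec["prompt_template"].format(dimensions=dims)

spec = json.loads(persona_json)
print(render_prompt(spec))
```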

PERSONA 2: Compliance Auditor Judge

{
  "persona_id": "COMPLIANCE_001",
  "name": "Dr. Patricia Okonkwo",
  "role": "Chief Compliance Officer",
  "certifications": ["CISA", "CISSP", "HCISPP"],
  "experience": "18 years healthcare IT compliance",

  "evaluation_dimensions": [
    "HIPAA Security Rule alignment",
    "FDA 21 CFR Part 11 compliance",
    "Audit trail completeness",
    "Access control implementation",
    "Data encryption standards"
  ],

  "evaluation_style": {
    "strictness": "very_high",
    "focus": "regulatory_defensibility",
    "tolerance_for_gaps": "zero",
    "documentation_expectation": "exhaustive"
  },

  "regulatory_frameworks": {
    "hipaa": {
      "security_rule": ["164.308", "164.310", "164.312"],
      "privacy_rule": ["164.502", "164.514"],
      "breach_notification": ["164.400-414"]
    },
    "fda": {
      "part_11": ["11.10", "11.30", "11.50", "11.70"]
    },
    "soc2": {
      "trust_principles": ["Security", "Availability", "Confidentiality"]
    }
  },

  "evaluation_checklist": [
    "Audit logging captures who/what/when/where",
    "PHI/PII properly encrypted at rest and in transit",
    "Role-based access control implemented",
    "Session management follows best practices",
    "Data retention policies enforced",
    "Breach notification triggers defined"
  ],

  "prompt_template": "You are Dr. Patricia Okonkwo, a Chief Compliance Officer with CISA, CISSP, and HCISPP certifications and 18 years in healthcare IT compliance. You evaluate software artifacts for regulatory defensibility with zero tolerance for compliance gaps. The applicable frameworks are: {frameworks}. Evaluate the following artifact against these compliance requirements, citing specific regulatory sections and providing pass/fail assessments with remediation requirements for failures."
}

PERSONA 3: Security Analyst Judge

{
  "persona_id": "SECURITY_001",
  "name": "James Nakamura",
  "role": "Senior Application Security Engineer",
  "certifications": ["OSCP", "GWAPT", "CEH"],
  "experience": "12 years penetration testing and secure code review",

  "evaluation_dimensions": [
    "Injection vulnerability presence",
    "Authentication/authorization flaws",
    "Sensitive data exposure risks",
    "Security misconfiguration",
    "Cryptographic failures"
  ],

  "evaluation_style": {
    "strictness": "adversarial",
    "focus": "exploit_potential",
    "assumption": "assume_breach_mindset",
    "documentation_expectation": "threat_model"
  },

  "vulnerability_taxonomy": {
    "critical": ["SQL injection", "Command injection", "Auth bypass", "IDOR"],
    "high": ["XSS", "CSRF", "Insecure deserialization", "Path traversal"],
    "medium": ["Information disclosure", "Weak crypto", "Missing rate limiting"],
    "low": ["Verbose errors", "Missing security headers", "Outdated dependencies"]
  },

  "evaluation_methodology": [
    "Trace all user inputs to outputs (taint analysis)",
    "Identify authentication boundaries",
    "Check authorization at each resource access",
    "Verify cryptographic implementations",
    "Assess error handling for information leakage"
  ],

  "prompt_template": "You are James Nakamura, a Senior Application Security Engineer with OSCP, GWAPT, and CEH certifications. You approach code review with an adversarial, assume-breach mindset. Your job is to find exploitable vulnerabilities before attackers do. Analyze the following code for security weaknesses, categorizing findings as Critical/High/Medium/Low with specific exploit scenarios and remediation guidance."
}

PERSONA 4: Domain Expert Judge (Healthcare)

{
  "persona_id": "DOMAIN_HEALTH_001",
  "name": "Dr. Elena Vasquez",
  "role": "Clinical Informatics Director",
  "credentials": "MD, MS Biomedical Informatics",
  "experience": "15 years clinical systems implementation",

  "evaluation_dimensions": [
    "Clinical workflow alignment",
    "Medical terminology accuracy",
    "Patient safety considerations",
    "Interoperability standards (HL7 FHIR)",
    "Clinical decision support appropriateness"
  ],

  "evaluation_style": {
    "strictness": "high",
    "focus": "patient_safety_first",
    "tolerance_for_ambiguity": "low",
    "documentation_expectation": "clinical_context"
  },

  "domain_knowledge": {
    "standards": ["HL7 FHIR R4", "C-CDA", "ICD-10", "SNOMED CT", "LOINC"],
    "workflows": ["Order entry", "Documentation", "Results review", "Care coordination"],
    "safety_concerns": ["Alert fatigue", "Workarounds", "Information overload"]
  },

  "evaluation_checklist": [
    "Medical terminology used correctly",
    "Clinical workflows match real-world practice",
    "Appropriate clinical decision support logic",
    "Patient safety alerts implemented correctly",
    "Interoperability standards followed",
    "Clinician cognitive load considered"
  ],

  "prompt_template": "You are Dr. Elena Vasquez, a Clinical Informatics Director with MD and MS Biomedical Informatics credentials. You evaluate healthcare software with patient safety as the top priority. Assess the following artifact for clinical accuracy, workflow alignment, and patient safety. Flag any potential for clinician workarounds, alert fatigue, or patient harm scenarios."
}

PERSONA 5: Quality Assurance Judge

{
"persona_id": "QA_001",
"name": "Priya Sharma",
"role": "Senior QA Architect",
"certifications": ["ISTQB Advanced", "Certified Scrum Master"],
"experience": "14 years test automation and quality engineering",

"evaluation_dimensions": [
"Test coverage adequacy",
"Edge case handling",
"Error boundary completeness",
"Regression risk assessment",
"Testability of design"
],

"evaluation_style": {
"strictness": "methodical",
"focus": "defect_prevention",
"assumption": "find_the_edge_cases",
"documentation_expectation": "test_specifications"
},

"testing_framework": {
"unit_test_coverage": "minimum 80%",
"integration_test_coverage": "critical paths 100%",
"edge_cases": ["null inputs", "boundary values", "concurrent access", "failure modes"],
"negative_testing": ["invalid inputs", "timeout scenarios", "resource exhaustion"]
},

"evaluation_checklist": [
"Happy path tests present",
"Error paths tested",
"Boundary conditions covered",
"Null/empty inputs handled",
"Concurrent access scenarios tested",
"Failure recovery tested",
"Performance under load considered"
],

"prompt_template": "You are Priya Sharma, a Senior QA Architect with ISTQB Advanced certification. You evaluate code with a focus on testability and defect prevention. Analyze the following artifact for test coverage gaps, missing edge case handling, and testability issues. Provide specific test cases that should exist but don't."
}

3.3 Persona Orchestration Protocol

The judge panel operates through a structured debate protocol:

PHASE 1: INDEPENDENT EVALUATION (Parallel)
┌──────────────────────────────────────────────────────────┐
│ Each judge independently evaluates the artifact          │
│ - Uses persona-specific prompt template                  │
│ - Applies dimension-specific rubric                      │
│ - Produces scored assessment with evidence               │
└──────────────────────────────────────────────────────────┘


PHASE 2: INITIAL STANCE DECLARATION
┌──────────────────────────────────────────────────────────┐
│ Judges share initial scores and key findings             │
│ - Identify areas of agreement                            │
│ - Flag areas of disagreement for debate                  │
└──────────────────────────────────────────────────────────┘


PHASE 3: STRUCTURED DEBATE (2-3 rounds)
┌──────────────────────────────────────────────────────────┐
│ Judges challenge each other on disagreements             │
│ - Must cite specific evidence from artifact              │
│ - May revise scores based on new perspectives            │
│ - Moderator ensures all concerns addressed               │
└──────────────────────────────────────────────────────────┘


PHASE 4: AGGREGATION AND CONSENSUS
┌──────────────────────────────────────────────────────────┐
│ Weighted voting produces final assessment                │
│ - 2/3 threshold for approval decisions                   │
│ - Dissenting opinions recorded for audit trail           │
│ - Confidence score calculated                            │
└──────────────────────────────────────────────────────────┘


PHASE 5: OUTPUT GENERATION
┌──────────────────────────────────────────────────────────┐
│ Produce defensible evaluation record                     │
│ - Aggregated scores per dimension                        │
│ - Majority rationale with supporting evidence            │
│ - Dissenting views with minority rationale               │
│ - Overall verdict with confidence level                  │
│ - Remediation requirements if not passing                │
└──────────────────────────────────────────────────────────┘
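Phase 4's weighted vote with a 2/3 threshold and recorded dissent can be sketched as follows. The `Assessment` and `Verdict` types, judge weights, and rationales are illustrative assumptions, not a specified interface.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Assessment:
    judge: str
    weight: float
    approve: bool
    rationale: str

@dataclass
class Verdict:
    approved: bool
    confidence: float            # share of total weight behind the majority side
    dissents: List[str] = field(default_factory=list)

def aggregate(assessments: List[Assessment], threshold: float = 2 / 3) -> Verdict:
    """Weighted approval vote; dissenting opinions are kept for the audit trail."""
    total = sum(a.weight for a in assessments)
    approve_weight = sum(a.weight for a in assessments if a.approve)
    approved = approve_weight / total >= threshold
    majority_weight = approve_weight if approved else total - approve_weight
    dissents = [f"{a.judge}: {a.rationale}" for a in assessments if a.approve != approved]
    return Verdict(approved, majority_weight / total, dissents)

panel = [
    Assessment("TECH_ARCH_001", 0.25, True, "sound architecture"),
    Assessment("COMPLIANCE_001", 0.25, True, "controls in place"),
    Assessment("SECURITY_001", 0.25, True, "no critical findings"),
    Assessment("QA_001", 0.25, False, "edge-case coverage gaps"),
]
v = aggregate(panel)
print(v.approved, v.confidence, v.dissents)
```

With equal weights, 3 of 4 approvals clears the 2/3 bar, and the QA judge's objection is preserved as a dissent rather than discarded.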

Part 4: Dynamic Persona Generation from ADRs

4.1 ADR-to-Rubric Pipeline

Architecture Decision Records (ADRs) provide the constitutional basis for judge personas:

# Conceptual Pipeline: ADR → Rubric → Persona

from dataclasses import dataclass, field
from typing import List, Optional

# Lightweight type stubs so the conceptual pipeline is self-contained:

@dataclass
class ArchitectureDecisionRecord:
    title: str
    context: str
    decision: str

@dataclass
class Constraint:
    text: str

@dataclass
class EvaluationRubric:
    dimensions: List[str] = field(default_factory=list)

@dataclass
class JudgePersona:
    name: str
    rubric: Optional[EvaluationRubric] = None


class ADRRubricGenerator:
    """Generate evaluation rubrics from Architecture Decision Records."""

    def extract_constraints(self, adr: ArchitectureDecisionRecord) -> List[Constraint]:
        """
        Extract evaluable constraints from ADR context and decision sections.

        Example ADR excerpt:
            "We will use FoundationDB for distributed state management
            because it provides ACID guarantees and automatic sharding."

        Extracted constraints:
            - Must use FoundationDB for state management
            - Must preserve ACID properties
            - Must support automatic sharding
        """
        pass

    def generate_rubric(self, constraints: List[Constraint]) -> EvaluationRubric:
        """
        Convert constraints into a scored evaluation rubric.

        Constraint: "Must use FoundationDB for state management"
        Rubric entry:
            - Dimension: "State Management Compliance"
            - Score 3: "Uses FoundationDB with proper transaction handling"
            - Score 2: "Uses FoundationDB but transaction handling incomplete"
            - Score 1: "Does not use FoundationDB or violates ACID"
        """
        pass

    def instantiate_persona(self, rubric: EvaluationRubric,
                            base_persona: JudgePersona) -> JudgePersona:
        """
        Combine a generated rubric with a base persona template.

        The persona inherits:
            - Evaluation style from base (strictness, focus)
            - Domain expertise from base
            - Project-specific rubric from the ADR
        """
        pass

4.2 Example: Transforming ADR to Judge Rubric

Input ADR:

# ADR-003: Event Sourcing for Audit Trail

## Status
Accepted

## Context
Regulatory requirements (HIPAA, SOC 2) mandate complete audit trails.
Traditional CRUD patterns lose historical state.

## Decision
We will implement event sourcing for all domain events with:
- Immutable event log
- Event replay capability
- Point-in-time reconstruction
- Cryptographic event signing

## Consequences
- All state changes must be events
- Events must be immutable once persisted
- Event schema versioning required
- Storage requirements increase

Generated Rubric:

rubric:
  source: "ADR-003"
  dimension: "Audit Trail Implementation"

  criteria:
    - name: "Event Immutability"
      scale: [1, 2, 3]
      descriptions:
        3: "Events stored immutably, no update/delete operations on event store"
        2: "Events mostly immutable, soft-delete present but justified"
        1: "Events can be modified or deleted, violating ADR"

    - name: "State Reconstruction"
      scale: [1, 2, 3]
      descriptions:
        3: "Any historical state reconstructable via event replay"
        2: "Recent state reconstructable, older states may have gaps"
        1: "Point-in-time reconstruction not possible"

    - name: "Cryptographic Integrity"
      scale: [1, 2, 3]
      descriptions:
        3: "All events cryptographically signed, chain verifiable"
        2: "Events signed but chain verification incomplete"
        1: "No cryptographic signing implemented"

    - name: "Schema Evolution"
      scale: [1, 2, 3]
      descriptions:
        3: "Event versioning implemented, backward compatibility maintained"
        2: "Versioning present but migration path unclear"
        1: "No schema versioning, breaking changes possible"

Part 5: Calibration and Continuous Improvement

5.1 Human-AI Alignment Calibration

Research shows LLM judges align with human experts 64-80% depending on domain (Szymanski et al., 2024). Calibration requires:

  1. Initial Calibration Set: 50-100 human-graded samples per domain
  2. Dimension-Level Agreement: Track agreement per evaluation dimension
  3. Threshold Tuning: Adjust pass/fail thresholds to match human rates
  4. Continuous Monitoring: Sample 5% of evaluations for human review

Calibration Metrics:

Cohen's Kappa: Target ≥ 0.6 (substantial agreement)
Pearson Correlation: Target ≥ 0.7 per dimension
False Positive Rate: Target ≤ 10% (wrongly passing bad code)
False Negative Rate: Target ≤ 5% (wrongly failing good code)
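Cohen's kappa can be computed directly from paired verdicts on the calibration set. A minimal sketch for binary pass/fail labels follows; the sample verdicts are invented for illustration.

```python
from collections import Counter
from typing import List

def cohens_kappa(judge: List[str], human: List[str]) -> float:
    """Chance-corrected agreement between judge verdicts and human labels."""
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    jc, hc = Counter(judge), Counter(human)
    # Expected agreement under independence of the two raters:
    expected = sum(jc[label] * hc[label] for label in set(judge) | set(human)) / (n * n)
    return (observed - expected) / (1 - expected)

judge = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
human = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]
print(round(cohens_kappa(judge, human), 3))  # → 0.75
```

Here 7 of 8 verdicts agree (0.875 observed) against 0.5 expected by chance, giving kappa 0.75, above the 0.6 target for substantial agreement.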

5.2 Feedback Loop Architecture

┌──────────────────────────────────────────────────────────┐
│                     CALIBRATION LOOP                     │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐          │
│  │  Judge   │────▶│  Sample  │────▶│  Human   │          │
│  │  Output  │     │    5%    │     │  Review  │          │
│  └──────────┘     └──────────┘     └──────────┘          │
│       │                                 │                │
│       │                                 ▼                │
│       │              ┌──────────────────────────┐        │
│       │              │ Agreement Analysis       │        │
│       │              │ - Per dimension          │        │
│       │              │ - Per persona            │        │
│       │              │ - Per artifact type      │        │
│       │              └──────────────────────────┘        │
│       │                                 │                │
│       │                                 ▼                │
│       │              ┌──────────────────────────┐        │
│       │              │ Drift Detection          │        │
│       │              │ - Threshold alerts       │        │
│       │              │ - Bias emergence         │        │
│       │              └──────────────────────────┘        │
│       │                                 │                │
│       ▼                                 ▼                │
│  ┌──────────────────────────────────────────┐            │
│  │        Persona/Rubric Update             │            │
│  │        - Prompt refinement               │            │
│  │        - Weight adjustment               │            │
│  │        - Example augmentation            │            │
│  └──────────────────────────────────────────┘            │
│                                                          │
└──────────────────────────────────────────────────────────┘

5.3 Red-Team Testing Protocol

Before production deployment, each judge persona must pass adversarial testing:

  1. Bias Probes: Test for position, verbosity, and style biases
  2. Evasion Attempts: Try to pass obviously bad code
  3. False Positive Triggers: Try to fail obviously good code
  4. Prompt Injection: Attempt to override judge instructions
  5. Edge Case Bombardment: Unusual but valid constructs

Required Pass Rate: 95% adversarial rejection rate
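A sketch of how the evasion and false-positive probes might be scored: each probe pairs an artifact with its expected verdict, and the fraction of correct rejections and acceptances is compared against the 95% bar. The toy judge and probe strings below are placeholders, not real test fixtures.

```python
from typing import Callable, List, Tuple

def rejection_rate(judge: Callable[[str], bool],
                   probes: List[Tuple[str, bool]]) -> float:
    """Fraction of probes where the judge's verdict matches the expected label.

    Each probe is (artifact, expected_pass); expected_pass=False marks a
    known-bad artifact the judge must reject.
    """
    correct = sum(judge(code) == expected for code, expected in probes)
    return correct / len(probes)

# Toy judge: rejects anything containing a hard-coded credential
toy_judge = lambda code: "password=" not in code

probes = [
    ("query(user_input)", True),                 # valid construct, must pass
    ("conn = db('password=hunter2')", False),    # evasion attempt, must fail
    ("hash(salt + pw)", True),
    ("url + 'password=' + pw", False),
]
rate = rejection_rate(toy_judge, probes)
print(rate >= 0.95)  # prints True
```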


Part 6: Implementation Checklist

6.1 Persona Design Checklist

  • Stakeholder Analysis Complete

    • All relevant stakeholders identified
    • Evaluation dimensions extracted per stakeholder
    • Evidence grounding each dimension documented
  • Persona Specifications Complete

    • Demographic details defined
    • Domain expertise specified
    • Psychological traits characterized
    • Evaluation style documented
    • Prompt template created
  • Rubric Architecture Complete

    • Dimensions mapped to scoring scales
    • Boundary descriptions written for each level
    • Reference examples provided
    • Chain-of-thought steps defined
  • Panel Diversity Verified

    • Minimum 3 model families represented
    • No single model > 40% weight
    • Aggregation protocol defined
    • Dissent recording mechanism in place
  • Calibration Infrastructure Ready

    • Human-graded calibration set collected
    • Agreement metrics defined
    • Threshold tuning procedure documented
    • Continuous monitoring pipeline active
  • Adversarial Testing Complete

    • Bias probes passed
    • Evasion attempts rejected
    • False positive triggers avoided
    • Prompt injection resistant
    • Edge cases handled

References

  1. Chen, J., et al. (2025). "Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation." arXiv:2507.21028

  2. Verga, P., et al. (2024). "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models." arXiv:2404.18796

  3. Liu, Y., et al. (2023). "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment." EMNLP 2023

  4. Szymanski, M., et al. (2024). "Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks." arXiv

  5. Hashemi, H., et al. (2024). "LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts." ACL 2024

  6. Jiang, D., et al. (2025). "Survey on LLM-as-a-Judge." arXiv:2411.15594

  7. "Rubric Is All You Need: Improving LLM-Based Code Evaluation With Question-Specific Rubrics." ICER 2025

  8. Wei, H., et al. (2024). "Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks." arXiv:2408.13006


Document Version: 1.0 | Generated: January 2026 | For: Coditect Autonomous Development Platform