Annotated Bibliography: MoE Agents + Judge Systems for Defensible AI Decision-Making

Research Sources, 2024-2026


Category 1: Mixture of Experts (MoE) Architectures

1.1 Comprehensive Surveys

Cai, W., Jiang, J., Wang, F., Tang, J., Kim, S., & Huang, J. (2024/2025). "A Survey on Mixture of Experts in Large Language Models."

  • URL: https://arxiv.org/abs/2407.06204
  • Published: June 2024 (v1), revised April 2025 (v3)
  • Journal: IEEE TKDE 2025 (accepted)
  • Description: Comprehensive 50+ page survey covering MoE taxonomy, gating functions, routing strategies, and empirical evaluations across NLP, computer vision, and multimodal domains. Maintains a companion GitHub resource repository for ongoing research tracking.
  • Key Contribution: Proposes new taxonomy of MoE architectures; catalogs open-source implementations and hyperparameter configurations; identifies expert diversity as critical factor for model performance.
  • Relevance: Foundation for understanding how MoE can enable specialized solution agents and judge agents.
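The gating functions surveyed by Cai et al. mostly follow a common pattern: softmax over expert logits, keep the top-k experts, renormalize. A minimal sketch of that routing step (illustrative only; real routers also handle load balancing, batching, and capacity limits):

```python
import math

def top_k_route(logits, k=2):
    """Minimal top-k gating: pick the k highest-scoring experts for a
    token and renormalize their softmax weights. A simplification of
    the routing schemes catalogued in the survey."""
    # Numerically stable softmax over expert logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the top-k experts and renormalize their weights
    topk = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)[:k]
    z = sum(probs[i] for i in topk)
    return [(i, probs[i] / z) for i in topk]
```

For logits `[1.0, 3.0, 2.0]` with `k=2`, the router selects experts 1 and 2, and the two returned weights sum to 1.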

Joshi, S. (2025). "A Survey of Mixture of Experts Models: Architectures and Applications in Business and Finance."

  • URL: https://hal.science/hal-05113196v1/file/20250527153855_FEI-2025-3-006.1.pdf
  • Published: 2025
  • Journal: International Journal of Future Engineering Innovations, 2(3), 127-134
  • Description: Reviews MoE evolution from early neural networks to trillion-parameter models (GPT-4, Mixtral). Covers theoretical foundations, hardware innovations, and real-world deployments in finance and healthcare.
  • Key Contribution: Identifies expert parallelism and load balancing as key challenges; discusses MoE applications in time series forecasting and tabular data analysis.
  • Relevance: Business/finance applications relevant to Coditect's regulated industry focus.

1.2 MoE Architecture Innovations

DeepSeek-AI. (2024). "DeepSeek-V3 Technical Report."

  • URL: https://arxiv.org/abs/2412.19437
  • Published: December 2024
  • Description: Technical report for 671B parameter MoE model (37B active per token). Details Multi-head Latent Attention (MLA), DeepSeekMoE architecture, auxiliary-loss-free load balancing, and multi-token prediction training.
  • Key Contribution: Pioneers auxiliary-loss-free load balancing that maintains routing stability without performance degradation. Achieves training cost of $5.6M vs. estimated $50-100M for comparable dense models.
  • Relevance: Proves MoE production viability; demonstrates cost efficiency critical for enterprise deployment.

DeepSeek-AI. (2025). "DeepSeek-V3.2 Release Notes."

  • URL: https://api-docs.deepseek.com/news/news251201
  • Published: December 2025
  • Description: Release documentation for V3.2 and V3.2-Speciale variants. Introduces DeepSeek Sparse Attention (DSA), "Thinking in Tool-Use" capability, and massive agent training data synthesis covering 1,800+ environments.
  • Key Contribution: First model to integrate thinking directly into tool-use; gold-medal performance on IMO 2025 and IOI 2025; introduces self-verification mechanisms.
  • Relevance: Validates AI reasoning at human-competitive levels; demonstrates integrated verification capability.

Lo, K.M., et al. (2024). "A Closer Look into Mixture-of-Experts in Large Language Models."

  • URL: https://arxiv.org/abs/2406.18219
  • Published: June 2024 (v1), revised June 2025 (v3)
  • Description: Empirical study of three popular MoE models examining parametric and behavioral features. Analyzes router selection patterns, expert diversity across layers, and neuron-level specialization.
  • Key Contribution: Discovers that neurons act like fine-grained experts; routers prefer experts with larger output norms; expert diversity increases through model depth except for outlier last layer.
  • Relevance: Informs expert allocation and router design for solution/judge separation.

Category 2: LLM-as-Judge and Multi-Agent Evaluation

2.1 Foundational Surveys

Jiang, X., et al. (2024/2025). "A Survey on LLM-as-a-Judge."

  • URL: https://arxiv.org/abs/2411.15594
  • Published: November 2024 (v1), revised October 2025 (v6)
  • Description: Comprehensive survey addressing core question: "How can reliable LLM-as-a-Judge systems be built?" Covers consistency enhancement, bias mitigation, and adaptation to diverse assessment scenarios.
  • Key Contribution: Identifies position bias, verbosity bias, and self-enhancement bias as primary reliability threats; proposes standardized protocols including chain-of-thought reasoning and inter-judge reliability metrics (Cohen's Kappa, Krippendorff's Alpha).
  • Relevance: Establishes reliability requirements for judge agents; provides bias mitigation strategies.

Zhuge, M., et al. (2024). "Agent-as-a-Judge: Evaluating Agents Through Complete Action Chain Analysis."

  • URL: Referenced in https://arxiv.org/html/2508.02994v1
  • Published: 2024
  • Description: Introduces Agent-as-a-Judge framework where agents evaluate other agents by examining entire chains of actions and decisions rather than just final answers.
  • Key Contribution: Demonstrates that process evaluation outperforms outcome-only evaluation; addresses extended sequences of actions in autonomous agents.
  • Relevance: Critical for Coditect's multi-step development workflows where intermediate artifacts matter.

2.2 Multi-Agent Evaluation Frameworks

Li, Y., et al. (2025). "Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments."

  • URL: https://arxiv.org/abs/2504.17087
  • Published: April 2025
  • Description: Three-stage meta-judge selection pipeline: (1) comprehensive rubric development with GPT-4 and human experts, (2) three advanced LLM agents scoring judgments, (3) threshold filtering for low-scoring judgments.
  • Key Contribution: Multi-agent collaboration achieves 15.55% improvement over raw judgments and 8.37% improvement over single-agent baseline on JudgeBench dataset.
  • Relevance: Validates multi-layer verification architecture; provides concrete implementation pipeline.
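Stages 2 and 3 of the Li et al. pipeline (multiple agents score each judgment, then low scorers are filtered) can be sketched as follows. The `scorers` callables stand in for the paper's LLM agents; the interface and threshold value are assumptions for illustration:

```python
def filter_judgments(judgments, scorers, threshold=0.7):
    """Score each judgment with several scoring agents and keep only
    judgments whose mean score clears the threshold (a simplified
    version of the meta-judge score-and-filter stages)."""
    kept = []
    for j in judgments:
        scores = [score(j) for score in scorers]      # each agent rates the judgment
        if sum(scores) / len(scores) >= threshold:    # threshold filtering
            kept.append(j)
    return kept
```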

Chen, J., et al. (2025). "Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation."

  • URL: https://arxiv.org/abs/2507.21028
  • Published: July 2025
  • Description: MAJ-EVAL framework automatically constructs multiple evaluator personas with distinct dimensions from relevant documents (research papers, clinical guidelines). Evaluated in educational and medical domains.
  • Key Contribution: Demonstrates automatic persona construction for domain-specific evaluation; achieves better alignment with human experts than conventional LLM-as-a-judge methods.
  • Relevance: Enables automated creation of domain-specific judge agents from compliance documentation.

Verga, P., et al. (2024). "Panels of LLM Evaluators (PoLL): Replacing Single Large Judges with Diverse Model Ensembles."

  • URL: Referenced in multiple sources; Cohere research
  • Published: 2024
  • Description: Proposes using panels of smaller, diverse LLMs instead of single large models for evaluation. Demonstrates reduced intra-model bias and 7x cost reduction.
  • Key Contribution: Establishes that model diversity (different families) is more valuable than model size for reliable evaluation; cost-effectiveness enables production deployment.
  • Relevance: Direct architecture recommendation for Coditect's judge layer.
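The PoLL aggregation step reduces to independent votes from diverse small judges plus a majority tally. A minimal sketch, where the `judges` callables are placeholders for real models from different families:

```python
from collections import Counter

def poll_verdict(judges, candidate):
    """Panel-of-LLM-evaluators aggregation in the spirit of Verga et
    al. (2024): each judge votes independently; the majority verdict
    wins, and the agreement ratio serves as a rough confidence."""
    votes = [judge(candidate) for judge in judges]
    verdict, count = Counter(votes).most_common(1)[0]
    return verdict, count / len(votes)
```

With three judges voting "pass", "pass", "fail", the panel returns ("pass", 2/3).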

2.3 Agent Evaluation and Debate Mechanisms

Chan, C.M., et al. (2023/2024). "ChatEval: Towards Better LLM-based Evaluators Through Multi-Agent Debate."

  • URL: Referenced in multiple survey papers
  • Published: 2023-2024
  • Description: Implements multiple debate agents for autonomous discussion and evaluation of AI-generated content. Demonstrates that collaborative evaluation through debate leads to more reliable assessments.
  • Key Contribution: Introduces adversarial debate protocol for evaluation; shows debate-based consensus improves factuality and reasoning.
  • Relevance: Provides mechanism for judge agents to challenge and refine each other's assessments.

Category 3: Consensus Mechanisms for AI Verification

Ogunsina, K.E., & Ogunsina, M.A. (2025). "A Hashgraph-Inspired Consensus Mechanism for Reliable Multi-Model Reasoning."

  • URL: https://arxiv.org/abs/2505.03553
  • Published: May 2025
  • Description: Adapts Hashgraph distributed ledger consensus for multi-model AI reasoning. Proposes gossip-about-gossip and virtual voting for treating proprietary reasoning models as black-box peers.
  • Key Contribution: Provides formal guarantees: if at least 2/3 of the models hold a correct fact, the consensus output reflects that fact with high confidence. Multi-round deliberation enables collective reasoning until unanimous or high-confidence agreement is reached.
  • Relevance: Core technology for defensible consensus in Coditect's verification layer.
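The paper's 2/3 guarantee implies a simple acceptance rule at the aggregation layer: accept a claim only under supermajority agreement, otherwise trigger another deliberation round. A sketch of that rule (the full gossip-about-gossip and virtual-voting machinery is abstracted away):

```python
from collections import Counter

def supermajority_consensus(answers, quorum=2 / 3):
    """Accept a claim only when at least a 2/3 supermajority of
    black-box models agree, per the Hashgraph-inspired guarantee;
    otherwise signal that another deliberation round is needed."""
    top, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= quorum:
        return top  # consensus reached
    return None     # no supermajority; deliberate again
```

For answers `["a", "a", "b"]` the quorum is met and "a" is accepted; for `["a", "b", "c"]` no answer clears 2/3 and the function returns `None`.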

Pokharel, P., et al. (2025). "Deliberation-Based Consensus Mechanism for Multi-LLM Decision Making."

  • URL: Referenced in Ogunsina paper
  • Published: 2025
  • Description: Introduces deliberation-based consensus where multiple LLM agents discuss to reach unanimous decisions. Maintains classic blockchain consensus properties while handling hallucinations and malicious models.
  • Key Contribution: Graded consensus with multi-round deliberation; outputs distribution of opinions for subjective matters; explicit handling of degeneration of thoughts.
  • Relevance: Addresses edge cases where full consensus isn't achievable.

Bandara, E., et al. (2025). "Towards Responsible and Explainable AI Agents with Consensus-Driven Reasoning."

  • URL: https://arxiv.org/abs/2512.21699
  • Published: December 2025
  • Description: Production-grade agentic workflow architecture based on multi-model consensus and reasoning-layer governance. Designed for Responsible AI (RAI) and Explainable AI (XAI) requirements.
  • Key Contribution: Bridges academic consensus research with production deployment requirements; addresses governance and explainability simultaneously.
  • Relevance: Directly applicable to Coditect's compliance-native architecture requirements.

Category 4: Constitutional AI and Scalable Oversight

Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback."

  • URL: https://arxiv.org/abs/2212.08073
  • Published: December 2022
  • Description: Foundational paper introducing Constitutional AI methodology. Uses explicit principles (constitution) for self-improvement without human labels for harmlessness. Combines supervised learning (self-critique and revision) with RLAIF.
  • Key Contribution: Demonstrates AI can train AI using explicit principles; achieves harmlessness improvements competitive with human-labeled RLHF; chain-of-thought improves principle application.
  • Relevance: Framework for encoding compliance requirements as constitutional principles for judge agents.
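The supervised stage of Constitutional AI is a critique-then-revise loop over explicit principles. A minimal sketch, where `critique` and `revise` stand in for LLM calls and their signatures are assumptions for illustration:

```python
def constitutional_revision(draft, principles, critique, revise, rounds=2):
    """Sketch of the supervised stage of Constitutional AI (Bai et
    al., 2022): the model critiques its own draft against each
    explicit principle, then revises the draft to address the
    critique. Placeholder callables model the LLM calls."""
    text = draft
    for _ in range(rounds):
        for p in principles:
            issue = critique(text, p)          # self-critique against one principle
            if issue:
                text = revise(text, p, issue)  # revise to address the critique
    return text
```

In Coditect's setting, the `principles` list could be populated from compliance requirements (e.g. HIPAA clauses) rather than harmlessness rules.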

Huang, S., Siddarth, D., Lovitt, L., et al. (2024). "Collective Constitutional AI: Aligning a Language Model with Public Input."

  • URL: https://facctconference.org/static/papers24/facct24-94.pdf
  • Published: FAccT '24, June 2024
  • Description: Extends Constitutional AI with public input via Polis platform. Creates constitutions reflecting diverse stakeholder values through wiki-survey methods.
  • Key Contribution: Democratic principle generation; addresses concern that single-author constitutions reflect hegemonic perspectives; achieves improved helpfulness and harmlessness Elo scores.
  • Relevance: Method for incorporating diverse stakeholder requirements (FDA, HIPAA, enterprise policies) into constitutional principles.

COCOA Framework. (2025). "Governance in Motion: Co-evolution of Constitutions and AI Models for Scalable Oversight."

  • URL: https://aclanthology.org/2025.emnlp-main.869.pdf
  • Published: EMNLP 2025
  • Description: Two-stage architecture in which model and constitution co-evolve: the Actor learns to follow the current principles, while the constitution is revised by reflecting on the Actor's behavior. Tested on HH-RLHF Red Team, PKU-SafeRLHF, and BigBench-HHH datasets.
  • Key Contribution: Addresses static constitution limitation; achieves highest robustness against jailbreak attacks (StrongReject) while maintaining low over-refusal rate (XSTest).
  • Relevance: Self-improving governance aligned with Coditect's autonomous evolution goals.

Category 5: Multi-Agent Systems and Agentic AI

5.1 Surveys and Frameworks

ScienceDirect. (2025). "AgentAI: A Comprehensive Survey on Autonomous Agents in Distributed AI for Industry 4.0."

  • URL: https://www.sciencedirect.com/science/article/pii/S0957417425020238
  • Published: June 2025
  • Description: Survey on autonomous agents in industrial contexts. Covers multi-agent coordination, decision-making, and Industry 4.0 applications.
  • Key Contribution: Bridges academic agent research with industrial deployment requirements.
  • Relevance: Industry 4.0 context aligns with enterprise software development automation.

Agentic AI Survey. (2025). "A Comprehensive Survey of Agentic AI: Architectures, Applications, and Future Directions."

  • URL: https://arxiv.org/html/2510.25445v1
  • Published: October 2025
  • Description: PRISMA-based review of 90 studies (2018-2025) using dual-paradigm framework: Symbolic/Classical vs. Neural/Generative. Analyzes healthcare, finance, and robotics domains.
  • Key Contribution: Identifies neuro-symbolic integration and decentralized agent networks as key emerging trends; finds symbolic/hybrid architectures dominate safety-critical applications.
  • Relevance: Validates hybrid approach for regulated industries; blockchain coordination for verifiable governance.

Masterman, T., Besen, S., & Sawtell, M. (2024). "The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling."


5.2 Protocol and Coordination Standards

Google Research. (2025). "A2A: Agent-to-Agent Protocol."

  • URL: https://github.com/google/A2A
  • Published: 2025
  • Description: Standardized protocol for multi-agent coordination. Enables memory management, goal coordination, task invocation, and capability discovery.
  • Key Contribution: Industry-backed standard for agent interoperability; emphasizes security and scalability.
  • Relevance: Potential standard for Coditect's solution-judge agent communication.

Agentic AI Frameworks Survey. (2025). "Agentic AI Frameworks: Architectures, Protocols, and Design Challenges."

  • URL: https://arxiv.org/html/2508.10146v1
  • Published: August 2025
  • Description: Comprehensive analysis of communication protocols (MCP, A2A, Agent Network Protocol). Covers tool integration, error handling, and coordination challenges.
  • Key Contribution: Identifies unifying principle: "eliminate need for manual integration through standardized frameworks"; compares client-server (MCP) vs. agent-oriented (A2A) architectures.
  • Relevance: Protocol selection guidance for Coditect's multi-agent infrastructure.

5.3 Responsible Agentic AI

Responsible Agentic Reasoning Survey. (2025). "Responsible Agentic Reasoning and AI Agents: A Critical Survey."

  • URL: https://www.techrxiv.org/users/574774/articles/1329333/master/file/data/review/review.pdf
  • Published: 2025
  • Description: Critical analysis of LLM-based agent capabilities across public benchmarks. Compares GPT-4, Claude, DeepSeek, and open-source models on agentic tasks. Documents performance gaps between humans and AI.
  • Key Contribution: Humans solve ~98% of private ARC tasks vs. 87.5% for fine-tuned O3; provides performance baselines across 13 ML experiment tasks.
  • Relevance: Establishes human parity benchmarks; identifies domains where AI agents already exceed human performance.

Mitchell, M., Ghosh, A., Luccioni, A.S., & Pistilli, G. (2025). "Fully Autonomous AI Agents Should Not be Developed."

  • URL: https://arxiv.org/html/2502.02649v3
  • Published: February 2025 (v1), revised October 2025 (v3)
  • Description: Position paper analyzing risks of fully autonomous agents. Examines accuracy, cascading errors, trust dynamics, and privacy implications.
  • Key Contribution: Articulates specific failure modes; argues for meaningful human control; identifies risk multipliers when autonomy co-occurs with humanlike cues.
  • Relevance: Risk framework for Coditect's human-in-the-loop design; failure mode taxonomy for judge agents.

Category 6: Scientific Discovery and Complex Reasoning

Agentic Science Survey. (2025). "From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery."

  • URL: https://arxiv.org/html/2508.14111v1
  • Published: August 2025
  • Description: Survey on AI agents for scientific discovery. Analyzes five foundational capabilities: Reasoning and Planning, Tool Integration, Memory Mechanisms, Multi-Agent Collaboration, and Optimization and Evolution.
  • Key Contribution: Identifies deliberative, refinement-based collaboration as key strategy; documents ReConcile, METAL, and DS-Agent frameworks for iterative solution improvement.
  • Relevance: Multi-agent collaboration patterns applicable to software development; scientific rigor standards for verification.

Category 7: Industry and Enterprise Applications

Deloitte. (2024). "Autonomous Generative AI Agents: 2025 Technology Predictions."


Summary Statistics

Category              Papers Reviewed   Date Range   Key Venues
MoE Architectures     6                 2024-2025    arXiv, IEEE TKDE, NeurIPS
LLM-as-Judge          8                 2023-2025    arXiv, FAccT, EMNLP
Consensus Mechanisms  3                 2025         arXiv
Constitutional AI     4                 2022-2025    arXiv, FAccT, EMNLP, ACL
Multi-Agent Systems   8                 2024-2025    arXiv, IEEE, ScienceDirect
Industry Reports      2                 2024-2025    Deloitte, Industry

Total Unique Sources: 31 academic papers + 5 industry reports + 3 technical documentation sources


Quick Reference: Most Important Papers

Must-Read (Top 5)

  1. Cai et al. (2024/2025) — MoE survey providing complete architecture understanding
  2. Jiang et al. (2024/2025) — LLM-as-Judge survey establishing reliability requirements
  3. Ogunsina & Ogunsina (2025) — Hashgraph consensus for AI verification
  4. Bai et al. (2022) — Constitutional AI foundational methodology
  5. Li et al. (2025) — Meta-judges multi-agent framework with quantified improvements

Implementation-Focused (Top 3)

  1. DeepSeek-V3 Technical Report — Production MoE architecture details
  2. Verga et al. (2024) — PoLL implementation for cost-effective evaluation
  3. Bandara et al. (2025) — Production-grade consensus architecture