
LLM Council Pattern: Technical Architecture Analysis for Coditect

Executive Summary

Karpathy's LLM Council demonstrates a deliberation pattern for consensus-seeking across multiple models. Coditect's autonomous development platform requires a delegation pattern for task decomposition. However, the council's peer review mechanism offers a valuable quality assurance primitive that can be adapted for Coditect's compliance-critical workflows.


Pattern Comparison

LLM Council (Deliberation)

┌─────────────────────────────────────────────────────────────┐
│                         User Query                          │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│   Stage 1: Parallel Dispatch (Same Prompt → All Models)     │
│   ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐        │
│   │ GPT-5.1 │  │ Gemini  │  │ Claude  │  │  Grok   │        │
│   └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘        │
│        │            │            │            │             │
│        ▼            ▼            ▼            ▼             │
│   Response A   Response B   Response C   Response D         │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│   Stage 2: Anonymous Peer Review                            │
│   Each model ranks others (identities hidden as A/B/C/D)    │
│   Output: aggregate_rankings + evaluation_text              │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│   Stage 3: Chairman Synthesis                               │
│   Single model synthesizes with ranking context             │
└─────────────────────────────────────────────────────────────┘

Coditect (Delegation)

┌─────────────────────────────────────────────────────────────┐
│                      Development Task                       │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                     Orchestrator Agent                      │
│   - Task decomposition                                      │
│   - Agent capability matching                               │
│   - Token budget allocation                                 │
│   - Checkpoint management                                   │
└─────────────────────────────────────────────────────────────┘
        ┌──────────────────────┼──────────────────────┐
        ▼                      ▼                      ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│   Architect   │      │  Implementer  │      │   Reviewer    │
│     Agent     │      │     Agent     │      │   Ensemble    │
│               │      │               │      │   (COUNCIL)   │
│ - Design      │      │ - Code gen    │      │ - Security    │
│ - ADRs        │      │ - Tests       │      │ - Compliance  │
│ - C4 models   │      │ - Docs       │      │ - Style       │
└───────────────┘      └───────────────┘      └───────────────┘
        │                      │                      │
        └──────────────────────┼──────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                Aggregation + Merge Decision                 │
│        (With audit trail for regulated industries)          │
└─────────────────────────────────────────────────────────────┘

Key Architectural Differences

Dimension      | LLM Council               | Coditect Multi-Agent
---------------|---------------------------|-----------------------------
Coordination   | Parallel → Synthesize     | Hierarchical delegation
Agent Roles    | Homogeneous (same prompt) | Heterogeneous (specialized)
State          | Stateless per query       | Checkpoint-based persistence
Tool Usage     | None                      | Tool-aware with constraints
Quality Signal | Peer ranking              | Domain-specific review
Failure Mode   | Graceful degradation      | Circuit breaker + recovery
Audit          | None                      | Full trace for compliance

Adaptation: Reviewer Council Pattern

The LLM Council's peer review mechanism maps cleanly to Coditect's QA phase. Instead of multiple models reviewing the same content, specialized reviewer agents evaluate code artifacts against domain-specific criteria.

Reviewer Council Architecture

from dataclasses import dataclass, field
from typing import List, Dict, Any, Optional
from enum import Enum
import asyncio

class ReviewDomain(Enum):
    SECURITY = "security"
    COMPLIANCE = "compliance"  # HIPAA, SOC2, FDA 21 CFR Part 11
    PERFORMANCE = "performance"
    MAINTAINABILITY = "maintainability"
    TEST_COVERAGE = "test_coverage"

@dataclass
class ReviewerConfig:
    """Configuration for a specialized reviewer agent."""
    domain: ReviewDomain
    model: str  # e.g., "anthropic/claude-sonnet-4.5"
    system_prompt: str
    evaluation_rubric: Dict[str, float]  # criteria → weight
    severity_thresholds: Dict[str, int]  # finding_type → max_allowed

@dataclass
class ReviewFinding:
    """Individual finding from a reviewer."""
    domain: ReviewDomain
    severity: str  # "critical", "high", "medium", "low", "info"
    location: str  # file:line or AST path
    description: str
    recommendation: str
    confidence: float

@dataclass
class ReviewResult:
    """Complete review from a single reviewer."""
    reviewer_id: str
    domain: ReviewDomain
    findings: List[ReviewFinding]
    overall_score: float  # 0.0 - 1.0
    pass_fail: bool
    raw_evaluation: str  # Full LLM response for audit
    token_usage: int

@dataclass
class CouncilVerdict:
    """Aggregated verdict from reviewer council."""
    individual_results: Dict[ReviewDomain, ReviewResult]
    aggregate_score: float
    blocking_findings: List[ReviewFinding]
    merge_decision: str  # "approve", "request_changes", "reject"
    chairman_synthesis: str
    consensus_level: float  # Agreement metric across reviewers
    audit_hash: str  # SHA256 of all inputs for compliance
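
For illustration, a security reviewer might be configured as follows. This is a sketch: the model id, rubric criteria, weights, and thresholds are placeholders rather than Coditect defaults, and minimal stand-ins for `ReviewDomain` and `ReviewerConfig` are repeated so the snippet runs on its own.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict

# Minimal stand-ins for the types defined above, so this snippet is self-contained.
class ReviewDomain(Enum):
    SECURITY = "security"

@dataclass
class ReviewerConfig:
    domain: ReviewDomain
    model: str
    system_prompt: str
    evaluation_rubric: Dict[str, float]
    severity_thresholds: Dict[str, int]

security_reviewer = ReviewerConfig(
    domain=ReviewDomain.SECURITY,
    model="anthropic/claude-sonnet-4.5",  # placeholder model id
    system_prompt="You are a security reviewer. Flag injection, authz, and secrets issues.",
    evaluation_rubric={  # hypothetical criteria; weights should sum to 1.0
        "input_validation": 0.4,
        "authn_authz": 0.4,
        "secrets_handling": 0.2,
    },
    severity_thresholds={"critical": 0, "high": 1},  # zero criticals tolerated
)

# Keeping rubric weights normalized makes domain scores comparable across reviewers.
assert abs(sum(security_reviewer.evaluation_rubric.values()) - 1.0) < 1e-9
```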


class ReviewerCouncil:
    """
    Multi-agent review council for code quality assurance.
    Adapted from Karpathy's LLM Council pattern with enterprise hardening.
    """

    def __init__(
        self,
        reviewers: List[ReviewerConfig],
        chairman_model: str,
        checkpoint_store: 'CheckpointStore',
        compliance_mode: bool = True
    ):
        self.reviewers = {r.domain: r for r in reviewers}
        self.chairman_model = chairman_model
        self.checkpoint_store = checkpoint_store
        self.compliance_mode = compliance_mode
        self.circuit_breakers: Dict[str, 'CircuitBreaker'] = {}

    async def review_artifact(
        self,
        artifact: 'CodeArtifact',
        context: Dict[str, Any]
    ) -> CouncilVerdict:
        """
        Execute full council review with anonymized cross-evaluation.

        Stage 1: Parallel specialized reviews
        Stage 2: Cross-reviewer ranking (anonymized)
        Stage 3: Chairman synthesis with merge decision
        """

        # Checkpoint: Start
        checkpoint_id = await self.checkpoint_store.create(
            stage="review_start",
            artifact_hash=artifact.hash,
            context=context
        )

        try:
            # Stage 1: Parallel specialized reviews
            stage1_results = await self._stage1_collect_reviews(artifact, context)

            await self.checkpoint_store.update(
                checkpoint_id,
                stage="stage1_complete",
                results=stage1_results
            )

            # Stage 2: Cross-reviewer ranking (anonymized)
            stage2_rankings, label_mapping = await self._stage2_cross_evaluate(
                stage1_results
            )

            await self.checkpoint_store.update(
                checkpoint_id,
                stage="stage2_complete",
                rankings=stage2_rankings
            )

            # Stage 3: Chairman synthesis
            verdict = await self._stage3_synthesize_verdict(
                artifact,
                stage1_results,
                stage2_rankings,
                label_mapping
            )

            # Finalize checkpoint with audit trail
            await self.checkpoint_store.finalize(
                checkpoint_id,
                verdict=verdict,
                compliance_hash=self._compute_audit_hash(
                    artifact, stage1_results, stage2_rankings, verdict
                )
            )

            return verdict

        except Exception as e:
            await self.checkpoint_store.mark_failed(checkpoint_id, str(e))
            raise

    async def _stage1_collect_reviews(
        self,
        artifact: 'CodeArtifact',
        context: Dict[str, Any]
    ) -> Dict[ReviewDomain, ReviewResult]:
        """
        Dispatch artifact to all specialized reviewers in parallel.
        Each reviewer evaluates against their domain-specific rubric.
        """

        async def review_with_circuit_breaker(
            domain: ReviewDomain,
            config: ReviewerConfig
        ) -> ReviewResult:
            cb = self.circuit_breakers.get(domain.value)
            if cb and cb.is_open:
                # Fallback: Return degraded result
                return self._degraded_review_result(domain)

            try:
                result = await self._execute_review(artifact, config, context)
                if cb:
                    cb.record_success()
                return result
            except Exception:
                if cb:
                    cb.record_failure()
                raise

        # Parallel execution with timeout
        tasks = [
            asyncio.create_task(
                asyncio.wait_for(
                    review_with_circuit_breaker(domain, config),
                    timeout=60.0  # Per-reviewer timeout
                )
            )
            for domain, config in self.reviewers.items()
        ]

        results = await asyncio.gather(*tasks, return_exceptions=True)

        # Process results, handling partial failures
        review_results = {}
        for domain, result in zip(self.reviewers.keys(), results):
            if isinstance(result, Exception):
                review_results[domain] = self._error_review_result(domain, result)
            else:
                review_results[domain] = result

        return review_results

    async def _stage2_cross_evaluate(
        self,
        stage1_results: Dict[ReviewDomain, ReviewResult]
    ) -> tuple[Dict[str, float], Dict[str, ReviewDomain]]:
        """
        Each reviewer ranks the other reviews (anonymized).
        Key insight from LLM Council: Prevent model favoritism.
        """

        # Anonymize: Map domains to neutral labels
        labels = ['Alpha', 'Beta', 'Gamma', 'Delta', 'Epsilon']
        label_mapping = {}
        anonymized_reviews = {}

        for i, (domain, result) in enumerate(stage1_results.items()):
            label = labels[i]
            label_mapping[label] = domain
            anonymized_reviews[label] = {
                'findings': [f.__dict__ for f in result.findings],
                'overall_score': result.overall_score,
                'evaluation': result.raw_evaluation
            }

        # Each reviewer ranks the others
        ranking_tasks = []
        for domain, config in self.reviewers.items():
            # Exclude self from ranking
            others = {
                label: review
                for label, review in anonymized_reviews.items()
                if label_mapping[label] != domain
            }

            ranking_tasks.append(
                self._request_ranking(config, others)
            )

        rankings = await asyncio.gather(*ranking_tasks)

        # Aggregate rankings (label → average peer rank)
        aggregate_rankings = self._aggregate_rankings(
            rankings,
            list(self.reviewers.keys())
        )

        return aggregate_rankings, label_mapping

    async def _stage3_synthesize_verdict(
        self,
        artifact: 'CodeArtifact',
        stage1_results: Dict[ReviewDomain, ReviewResult],
        stage2_rankings: Dict[str, float],
        label_mapping: Dict[str, ReviewDomain]
    ) -> CouncilVerdict:
        """
        Chairman synthesizes final verdict with merge decision.
        Unlike LLM Council, we have a specific decision output.
        """

        # Collect all blocking findings
        blocking_findings = []
        for result in stage1_results.values():
            for finding in result.findings:
                if finding.severity in ['critical', 'high']:
                    blocking_findings.append(finding)

        # Build chairman prompt
        chairman_prompt = self._build_chairman_prompt(
            artifact,
            stage1_results,
            stage2_rankings,
            label_mapping
        )

        # Chairman decision
        chairman_response = await self._call_llm(
            self.chairman_model,
            chairman_prompt,
            response_format="json"
        )

        # Parse structured decision
        decision = self._parse_chairman_decision(chairman_response)

        # Compute consensus level
        consensus = self._compute_consensus(stage1_results, stage2_rankings)

        return CouncilVerdict(
            individual_results=stage1_results,
            aggregate_score=decision['aggregate_score'],
            blocking_findings=blocking_findings,
            merge_decision=decision['merge_decision'],
            chairman_synthesis=decision['synthesis'],
            consensus_level=consensus,
            audit_hash=""  # Computed by caller
        )

    def _build_chairman_prompt(
        self,
        artifact: 'CodeArtifact',
        results: Dict[ReviewDomain, ReviewResult],
        rankings: Dict[str, float],
        label_mapping: Dict[str, ReviewDomain]
    ) -> str:
        """Build chairman synthesis prompt with all context."""

        return f"""You are the Chairman of a Code Review Council for a regulated software system.

## Artifact Under Review
- File: {artifact.path}
- Language: {artifact.language}
- Lines: {artifact.line_count}
- Compliance Context: {artifact.compliance_tags}

## Individual Reviews

{self._format_reviews(results)}

## Cross-Reviewer Rankings

{self._format_rankings(rankings, label_mapping)}

## Your Task

Synthesize all reviews into a final verdict. You must provide:

1. **Aggregate Score** (0.0-1.0): Weighted by reviewer consensus and finding severity
2. **Merge Decision**: One of "approve", "request_changes", "reject"
3. **Synthesis**: 2-3 paragraph summary of key findings and rationale

Decision Criteria:
- Any CRITICAL finding → reject
- >2 HIGH findings → request_changes
- Compliance domain failure → reject (regulated context)
- <0.7 aggregate score → request_changes

Respond in JSON:
{{
  "aggregate_score": 0.85,
  "merge_decision": "approve",
  "synthesis": "..."
}}"""

    @staticmethod
    def _aggregate_rankings(
        rankings: List[Dict],
        reviewer_domains: List[ReviewDomain]
    ) -> Dict[str, float]:
        """
        Compute average rank position across all peer evaluations.
        Direct adaptation of LLM Council's calculate_aggregate_rankings.
        """

        rank_scores = {}

        for ranking in rankings:
            for label, position in ranking.items():
                if label not in rank_scores:
                    rank_scores[label] = []
                rank_scores[label].append(position)

        # Average position (lower is better)
        return {
            label: sum(positions) / len(positions)
            for label, positions in rank_scores.items()
        }
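
Stage 1 above depends on a `CircuitBreaker` exposing `is_open`, `record_success`, and `record_failure`, which the listing never defines. A minimal count-based sketch follows; the failure threshold is illustrative, and production breakers typically add a half-open state and time-based reset, which are omitted here.

```python
class CircuitBreaker:
    """Minimal count-based breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        # Open means: stop calling the reviewer, return degraded results instead.
        return self.consecutive_failures >= self.failure_threshold

    def record_success(self) -> None:
        self.consecutive_failures = 0  # any success closes the breaker

    def record_failure(self) -> None:
        self.consecutive_failures += 1

cb = CircuitBreaker(failure_threshold=2)
cb.record_failure()
assert not cb.is_open
cb.record_failure()
assert cb.is_open       # two consecutive failures trip it
cb.record_success()
assert not cb.is_open   # a success resets the count
```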

Integration Points

1. Pipeline Integration

class CoditectPipeline:
    """Main development pipeline with council-based review."""

    async def execute(self, task: DevelopmentTask) -> PipelineResult:
        # Phase 1: Architecture
        architecture = await self.architect_agent.design(task)

        # Phase 2: Implementation
        artifacts = await self.implementer_agent.generate(architecture)

        # Phase 3: Council Review (NEW)
        review_verdict = await self.reviewer_council.review_artifact(
            artifacts,
            context={
                'architecture': architecture,
                'compliance_requirements': task.compliance_tags,
                'risk_level': task.risk_assessment
            }
        )

        # Phase 4: Decision Gate
        if review_verdict.merge_decision == 'reject':
            return PipelineResult(
                status='failed',
                reason=review_verdict.chairman_synthesis,
                artifacts=None
            )

        if review_verdict.merge_decision == 'request_changes':
            # Recursive refinement
            refined = await self.implementer_agent.refine(
                artifacts,
                feedback=review_verdict.blocking_findings
            )
            return await self.execute_review_only(refined)

        # Approved
        return PipelineResult(
            status='success',
            artifacts=artifacts,
            audit_trail=review_verdict.audit_hash
        )

2. Compliance Audit Trail

import hashlib
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict

@dataclass(frozen=True)
class ComplianceAuditRecord:
    """
    Immutable audit record for regulated environments.
    Required for FDA 21 CFR Part 11, SOC2, HIPAA.
    """

    timestamp: datetime
    artifact_hash: str
    reviewer_council_config: Dict[str, Any]
    stage1_hashes: Dict[str, str]  # domain → response hash
    stage2_hashes: Dict[str, str]  # reviewer → ranking hash
    chairman_hash: str
    verdict_hash: str
    electronic_signature: str  # For 21 CFR Part 11

    def compute_chain_hash(self) -> str:
        """Compute hash chain for tamper detection."""
        chain = hashlib.sha256()
        chain.update(self.artifact_hash.encode())
        for h in sorted(self.stage1_hashes.values()):
            chain.update(h.encode())
        for h in sorted(self.stage2_hashes.values()):
            chain.update(h.encode())
        chain.update(self.chairman_hash.encode())
        chain.update(self.verdict_hash.encode())
        return chain.hexdigest()
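
A quick property check of the chain hash, with dummy values in place of real stage hashes: because the stage hashes are sorted before hashing, dict insertion order does not affect the result, while tampering with any recorded hash does. The free function below mirrors the `compute_chain_hash` logic so the check is self-contained.

```python
import hashlib

def chain_hash(artifact_hash, stage1_hashes, stage2_hashes, chairman_hash, verdict_hash):
    # Same hashing order as ComplianceAuditRecord.compute_chain_hash, as a free function.
    chain = hashlib.sha256()
    chain.update(artifact_hash.encode())
    for h in sorted(stage1_hashes.values()):
        chain.update(h.encode())
    for h in sorted(stage2_hashes.values()):
        chain.update(h.encode())
    chain.update(chairman_hash.encode())
    chain.update(verdict_hash.encode())
    return chain.hexdigest()

a = chain_hash("art", {"security": "s1", "compliance": "s2"}, {"r1": "k1"}, "ch", "v")
# Dict insertion order does not matter: values are sorted before hashing.
b = chain_hash("art", {"compliance": "s2", "security": "s1"}, {"r1": "k1"}, "ch", "v")
assert a == b
# Tampering with any recorded stage hash changes the chain hash.
c = chain_hash("art", {"security": "TAMPERED", "compliance": "s2"}, {"r1": "k1"}, "ch", "v")
assert a != c
```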

Token Economics

LLM Council Cost Model

Per Query:
- Stage 1: 4 models × ~2000 tokens = 8,000 tokens
- Stage 2: 4 models × ~3000 tokens = 12,000 tokens
- Stage 3: 1 model × ~4000 tokens = 4,000 tokens
Total: ~24,000 tokens per query

Coditect Reviewer Council Cost Model

Per Code Review:
- Stage 1: 5 reviewers × ~3000 tokens = 15,000 tokens
- Stage 2: 5 reviewers × ~2000 tokens = 10,000 tokens
- Stage 3: 1 chairman × ~5000 tokens = 5,000 tokens
Total: ~30,000 tokens per review

Accounting for the ~15x token multiplier typical of multi-agent workflows:
- Single complex file: ~30K tokens
- Full PR (10 files): ~300K tokens
- Amortized cost: ~50 tokens per line of code
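
The arithmetic behind these figures can be sketched directly. The per-stage token counts are the rough estimates from the cost models above, not measurements, and the ~600 lines-per-file assumption is inferred from the ~50 tokens/line figure rather than stated explicitly.

```python
def council_review_tokens(n_reviewers=5, stage1_per=3000, stage2_per=2000, chairman=5000):
    """Rough token estimate for one council review, per the cost model above."""
    return n_reviewers * stage1_per + n_reviewers * stage2_per + chairman

per_file = council_review_tokens()   # 5*3000 + 5*2000 + 5000 = 30,000 tokens
per_pr = 10 * per_file               # ~300,000 tokens for a 10-file PR
per_line = per_file / 600            # ~50 tokens/line, assuming ~600 lines per file

assert per_file == 30_000
assert per_pr == 300_000
assert per_line == 50.0
```

Plugging in the LLM Council's numbers (4 reviewers, 2000/3000 tokens per stage, a 4000-token chairman) reproduces its ~24K-per-query total with the same formula.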

Key Adaptations from LLM Council

LLM Council Feature | Coditect Adaptation
--------------------|-------------------------------------------
OpenRouter routing  | Multi-provider with compliance controls
JSON file storage   | FoundationDB with audit trail
Anonymous labels    | Domain-aware anonymization
Ranking extraction  | Structured JSON response format
Chairman synthesis  | Merge decision with gate criteria
No authentication   | Role-based access + electronic signatures
No checkpointing    | Full checkpoint/recovery
No circuit breakers | Per-reviewer circuit breakers

Conclusion

The LLM Council pattern provides a clean reference for consensus-based quality signals in multi-agent systems. For Coditect, the key adaptation is transforming the deliberation pattern into a review gate within the delegation workflow, with enterprise hardening for:

  1. Compliance audit trails
  2. Deterministic replay
  3. Graceful degradation
  4. Structured decision outputs

The anonymized peer review mechanism is the most valuable primitive to adopt: it prevents model bias in QA evaluation while providing an aggregate confidence signal.