
LLM Council Pattern: Technical Architecture Analysis for Coditect

Executive Summary

Karpathy's LLM Council demonstrates a deliberation pattern for consensus-seeking across multiple models. Coditect's autonomous development platform requires a delegation pattern for task decomposition. However, the council's peer review mechanism offers a valuable quality assurance primitive that can be adapted for Coditect's compliance-critical workflows.


Pattern Comparison

LLM Council (Deliberation)

┌─────────────────────────────────────────────────────────────┐
│                         User Query                          │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│   Stage 1: Parallel Dispatch (Same Prompt → All Models)     │
│   ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐        │
│   │ GPT-5.1 │  │ Gemini  │  │ Claude  │  │  Grok   │        │
│   └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘        │
│        │            │            │            │             │
│        ▼            ▼            ▼            ▼             │
│   Response A   Response B   Response C   Response D         │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│   Stage 2: Anonymous Peer Review                            │
│   Each model ranks others (identities hidden as A/B/C/D)    │
│   Output: aggregate_rankings + evaluation_text              │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│   Stage 3: Chairman Synthesis                               │
│   Single model synthesizes with ranking context             │
└─────────────────────────────────────────────────────────────┘

Coditect (Delegation)

┌─────────────────────────────────────────────────────────────┐
│                      Development Task                       │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                     Orchestrator Agent                      │
│   - Task decomposition                                      │
│   - Agent capability matching                               │
│   - Token budget allocation                                 │
│   - Checkpoint management                                   │
└─────────────────────────────────────────────────────────────┘
        ┌──────────────────────┼──────────────────────┐
        ▼                      ▼                      ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│   Architect   │      │  Implementer  │      │   Reviewer    │
│     Agent     │      │     Agent     │      │   Ensemble    │
│               │      │               │      │   (COUNCIL)   │
│ - Design      │      │ - Code gen    │      │ - Security    │
│ - ADRs        │      │ - Tests       │      │ - Compliance  │
│ - C4 models   │      │ - Docs       │      │ - Style       │
└───────────────┘      └───────────────┘      └───────────────┘
        │                      │                      │
        └──────────────────────┼──────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                Aggregation + Merge Decision                 │
│        (With audit trail for regulated industries)          │
└─────────────────────────────────────────────────────────────┘

Key Architectural Differences

Dimension      | LLM Council               | Coditect Multi-Agent
---------------|---------------------------|-----------------------------
Coordination   | Parallel → Synthesize     | Hierarchical delegation
Agent Roles    | Homogeneous (same prompt) | Heterogeneous (specialized)
State          | Stateless per query       | Checkpoint-based persistence
Tool Usage     | None                      | Tool-aware with constraints
Quality Signal | Peer ranking              | Domain-specific review
Failure Mode   | Graceful degradation      | Circuit breaker + recovery
Audit          | None                      | Full trace for compliance

Adaptation: Reviewer Council Pattern

The LLM Council's peer review mechanism maps cleanly to Coditect's QA phase. Instead of multiple models reviewing the same content, specialized reviewer agents evaluate code artifacts against domain-specific criteria.

Reviewer Council Architecture

from dataclasses import dataclass, field
from typing import List, Dict, Any, Optional
from enum import Enum
import asyncio

class ReviewDomain(Enum):
    SECURITY = "security"
    COMPLIANCE = "compliance"  # HIPAA, SOC2, FDA 21 CFR Part 11
    PERFORMANCE = "performance"
    MAINTAINABILITY = "maintainability"
    TEST_COVERAGE = "test_coverage"

@dataclass
class ReviewerConfig:
    """Configuration for a specialized reviewer agent."""
    domain: ReviewDomain
    model: str  # e.g., "anthropic/claude-sonnet-4.5"
    system_prompt: str
    evaluation_rubric: Dict[str, float]  # criteria → weight
    severity_thresholds: Dict[str, int]  # finding_type → max_allowed

@dataclass
class ReviewFinding:
    """Individual finding from a reviewer."""
    domain: ReviewDomain
    severity: str  # "critical", "high", "medium", "low", "info"
    location: str  # file:line or AST path
    description: str
    recommendation: str
    confidence: float

@dataclass
class ReviewResult:
    """Complete review from a single reviewer."""
    reviewer_id: str
    domain: ReviewDomain
    findings: List[ReviewFinding]
    overall_score: float  # 0.0 - 1.0
    pass_fail: bool
    raw_evaluation: str  # Full LLM response for audit
    token_usage: int

@dataclass
class CouncilVerdict:
    """Aggregated verdict from reviewer council."""
    individual_results: Dict[ReviewDomain, ReviewResult]
    aggregate_score: float
    blocking_findings: List[ReviewFinding]
    merge_decision: str  # "approve", "request_changes", "reject"
    chairman_synthesis: str
    consensus_level: float  # Agreement metric across reviewers
    audit_hash: str  # SHA256 of all inputs for compliance
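
For illustration, a security reviewer might be configured as follows. This is a sketch: the model id, rubric criteria, weights, and thresholds are placeholders rather than Coditect defaults, and minimal stand-ins for `ReviewDomain` and `ReviewerConfig` are repeated so the snippet runs on its own.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict

# Minimal stand-ins for the types defined above, so this snippet is self-contained.
class ReviewDomain(Enum):
    SECURITY = "security"

@dataclass
class ReviewerConfig:
    domain: ReviewDomain
    model: str
    system_prompt: str
    evaluation_rubric: Dict[str, float]
    severity_thresholds: Dict[str, int]

security_reviewer = ReviewerConfig(
    domain=ReviewDomain.SECURITY,
    model="anthropic/claude-sonnet-4.5",  # placeholder model id
    system_prompt="You are a security reviewer. Flag injection, authz, and secrets issues.",
    evaluation_rubric={  # hypothetical criteria; weights should sum to 1.0
        "input_validation": 0.4,
        "authn_authz": 0.4,
        "secrets_handling": 0.2,
    },
    severity_thresholds={"critical": 0, "high": 1},  # zero criticals tolerated
)

# Keeping rubric weights normalized makes domain scores comparable across reviewers.
assert abs(sum(security_reviewer.evaluation_rubric.values()) - 1.0) < 1e-9
```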


class ReviewerCouncil:
    """
    Multi-agent review council for code quality assurance.
    Adapted from Karpathy's LLM Council pattern with enterprise hardening.
    """

    def __init__(
        self,
        reviewers: List[ReviewerConfig],
        chairman_model: str,
        checkpoint_store: 'CheckpointStore',
        compliance_mode: bool = True
    ):
        self.reviewers = {r.domain: r for r in reviewers}
        self.chairman_model = chairman_model
        self.checkpoint_store = checkpoint_store
        self.compliance_mode = compliance_mode
        self.circuit_breakers: Dict[str, 'CircuitBreaker'] = {}

    async def review_artifact(
        self,
        artifact: 'CodeArtifact',
        context: Dict[str, Any]
    ) -> CouncilVerdict:
        """
        Execute full council review with anonymized cross-evaluation.

        Stage 1: Parallel specialized reviews
        Stage 2: Cross-reviewer ranking (anonymized)
        Stage 3: Chairman synthesis with merge decision
        """

        # Checkpoint: Start
        checkpoint_id = await self.checkpoint_store.create(
            stage="review_start",
            artifact_hash=artifact.hash,
            context=context
        )

        try:
            # Stage 1: Parallel specialized reviews
            stage1_results = await self._stage1_collect_reviews(artifact, context)

            await self.checkpoint_store.update(
                checkpoint_id,
                stage="stage1_complete",
                results=stage1_results
            )

            # Stage 2: Cross-reviewer ranking (anonymized)
            stage2_rankings, label_mapping = await self._stage2_cross_evaluate(
                stage1_results
            )

            await self.checkpoint_store.update(
                checkpoint_id,
                stage="stage2_complete",
                rankings=stage2_rankings
            )

            # Stage 3: Chairman synthesis
            verdict = await self._stage3_synthesize_verdict(
                artifact,
                stage1_results,
                stage2_rankings,
                label_mapping
            )

            # Finalize checkpoint with audit trail
            await self.checkpoint_store.finalize(
                checkpoint_id,
                verdict=verdict,
                compliance_hash=self._compute_audit_hash(
                    artifact, stage1_results, stage2_rankings, verdict
                )
            )

            return verdict

        except Exception as e:
            await self.checkpoint_store.mark_failed(checkpoint_id, str(e))
            raise

    async def _stage1_collect_reviews(
        self,
        artifact: 'CodeArtifact',
        context: Dict[str, Any]
    ) -> Dict[ReviewDomain, ReviewResult]:
        """
        Dispatch artifact to all specialized reviewers in parallel.
        Each reviewer evaluates against their domain-specific rubric.
        """

        async def review_with_circuit_breaker(
            domain: ReviewDomain,
            config: ReviewerConfig
        ) -> ReviewResult:
            cb = self.circuit_breakers.get(domain.value)
            if cb and cb.is_open:
                # Fallback: Return degraded result
                return self._degraded_review_result(domain)

            try:
                result = await self._execute_review(artifact, config, context)
                if cb:
                    cb.record_success()
                return result
            except Exception:
                if cb:
                    cb.record_failure()
                raise

        # Parallel execution with timeout
        tasks = [
            asyncio.create_task(
                asyncio.wait_for(
                    review_with_circuit_breaker(domain, config),
                    timeout=60.0  # Per-reviewer timeout
                )
            )
            for domain, config in self.reviewers.items()
        ]

        results = await asyncio.gather(*tasks, return_exceptions=True)

        # Process results, handling partial failures
        review_results = {}
        for domain, result in zip(self.reviewers.keys(), results):
            if isinstance(result, Exception):
                review_results[domain] = self._error_review_result(domain, result)
            else:
                review_results[domain] = result

        return review_results

    async def _stage2_cross_evaluate(
        self,
        stage1_results: Dict[ReviewDomain, ReviewResult]
    ) -> tuple[Dict[str, float], Dict[str, ReviewDomain]]:
        """
        Each reviewer ranks the other reviews (anonymized).
        Key insight from LLM Council: Prevent model favoritism.
        """

        # Anonymize: Map domains to neutral labels
        labels = ['Alpha', 'Beta', 'Gamma', 'Delta', 'Epsilon']
        label_mapping = {}
        anonymized_reviews = {}

        for i, (domain, result) in enumerate(stage1_results.items()):
            label = labels[i]
            label_mapping[label] = domain
            anonymized_reviews[label] = {
                'findings': [f.__dict__ for f in result.findings],
                'overall_score': result.overall_score,
                'evaluation': result.raw_evaluation
            }

        # Each reviewer ranks the others
        ranking_tasks = []
        for domain, config in self.reviewers.items():
            # Exclude self from ranking
            others = {
                label: review
                for label, review in anonymized_reviews.items()
                if label_mapping[label] != domain
            }

            ranking_tasks.append(
                self._request_ranking(config, others)
            )

        rankings = await asyncio.gather(*ranking_tasks)

        # Aggregate rankings (label → average peer rank)
        aggregate_rankings = self._aggregate_rankings(
            rankings,
            list(self.reviewers.keys())
        )

        return aggregate_rankings, label_mapping

    async def _stage3_synthesize_verdict(
        self,
        artifact: 'CodeArtifact',
        stage1_results: Dict[ReviewDomain, ReviewResult],
        stage2_rankings: Dict[str, float],
        label_mapping: Dict[str, ReviewDomain]
    ) -> CouncilVerdict:
        """
        Chairman synthesizes final verdict with merge decision.
        Unlike LLM Council, we have a specific decision output.
        """

        # Collect all blocking findings
        blocking_findings = []
        for result in stage1_results.values():
            for finding in result.findings:
                if finding.severity in ['critical', 'high']:
                    blocking_findings.append(finding)

        # Build chairman prompt
        chairman_prompt = self._build_chairman_prompt(
            artifact,
            stage1_results,
            stage2_rankings,
            label_mapping
        )

        # Chairman decision
        chairman_response = await self._call_llm(
            self.chairman_model,
            chairman_prompt,
            response_format="json"
        )

        # Parse structured decision
        decision = self._parse_chairman_decision(chairman_response)

        # Compute consensus level
        consensus = self._compute_consensus(stage1_results, stage2_rankings)

        return CouncilVerdict(
            individual_results=stage1_results,
            aggregate_score=decision['aggregate_score'],
            blocking_findings=blocking_findings,
            merge_decision=decision['merge_decision'],
            chairman_synthesis=decision['synthesis'],
            consensus_level=consensus,
            audit_hash=""  # Computed by caller
        )

    def _build_chairman_prompt(
        self,
        artifact: 'CodeArtifact',
        results: Dict[ReviewDomain, ReviewResult],
        rankings: Dict[str, float],
        label_mapping: Dict[str, ReviewDomain]
    ) -> str:
        """Build chairman synthesis prompt with all context."""

        return f"""You are the Chairman of a Code Review Council for a regulated software system.

## Artifact Under Review
- File: {artifact.path}
- Language: {artifact.language}
- Lines: {artifact.line_count}
- Compliance Context: {artifact.compliance_tags}

## Individual Reviews

{self._format_reviews(results)}

## Cross-Reviewer Rankings

{self._format_rankings(rankings, label_mapping)}

## Your Task

Synthesize all reviews into a final verdict. You must provide:

1. **Aggregate Score** (0.0-1.0): Weighted by reviewer consensus and finding severity
2. **Merge Decision**: One of "approve", "request_changes", "reject"
3. **Synthesis**: 2-3 paragraph summary of key findings and rationale

Decision Criteria:
- Any CRITICAL finding → reject
- >2 HIGH findings → request_changes
- Compliance domain failure → reject (regulated context)
- <0.7 aggregate score → request_changes

Respond in JSON:
{{
  "aggregate_score": 0.85,
  "merge_decision": "approve",
  "synthesis": "..."
}}"""

    @staticmethod
    def _aggregate_rankings(
        rankings: List[Dict],
        reviewer_domains: List[ReviewDomain]
    ) -> Dict[str, float]:
        """
        Compute average rank position across all peer evaluations.
        Direct adaptation of LLM Council's calculate_aggregate_rankings.
        """

        rank_scores = {}

        for ranking in rankings:
            for label, position in ranking.items():
                if label not in rank_scores:
                    rank_scores[label] = []
                rank_scores[label].append(position)

        # Average position (lower is better)
        return {
            label: sum(positions) / len(positions)
            for label, positions in rank_scores.items()
        }
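
Stage 1 above depends on a `CircuitBreaker` exposing `is_open`, `record_success`, and `record_failure`, which the listing never defines. A minimal count-based sketch follows; the failure threshold is illustrative, and production breakers typically add a half-open state and time-based reset, which are omitted here.

```python
class CircuitBreaker:
    """Minimal count-based breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        # Open means: stop calling the reviewer, return degraded results instead.
        return self.consecutive_failures >= self.failure_threshold

    def record_success(self) -> None:
        self.consecutive_failures = 0  # any success closes the breaker

    def record_failure(self) -> None:
        self.consecutive_failures += 1

cb = CircuitBreaker(failure_threshold=2)
cb.record_failure()
assert not cb.is_open
cb.record_failure()
assert cb.is_open       # two consecutive failures trip it
cb.record_success()
assert not cb.is_open   # a success resets the count
```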

Integration Points

1. Pipeline Integration

class CoditectPipeline:
    """Main development pipeline with council-based review."""

    async def execute(self, task: DevelopmentTask) -> PipelineResult:
        # Phase 1: Architecture
        architecture = await self.architect_agent.design(task)

        # Phase 2: Implementation
        artifacts = await self.implementer_agent.generate(architecture)

        # Phase 3: Council Review (NEW)
        review_verdict = await self.reviewer_council.review_artifact(
            artifacts,
            context={
                'architecture': architecture,
                'compliance_requirements': task.compliance_tags,
                'risk_level': task.risk_assessment
            }
        )

        # Phase 4: Decision Gate
        if review_verdict.merge_decision == 'reject':
            return PipelineResult(
                status='failed',
                reason=review_verdict.chairman_synthesis,
                artifacts=None
            )

        if review_verdict.merge_decision == 'request_changes':
            # Recursive refinement
            refined = await self.implementer_agent.refine(
                artifacts,
                feedback=review_verdict.blocking_findings
            )
            return await self.execute_review_only(refined)

        # Approved
        return PipelineResult(
            status='success',
            artifacts=artifacts,
            audit_trail=review_verdict.audit_hash
        )

2. Compliance Audit Trail

import hashlib
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict

@dataclass(frozen=True)
class ComplianceAuditRecord:
    """
    Immutable audit record for regulated environments.
    Required for FDA 21 CFR Part 11, SOC2, HIPAA.
    """

    timestamp: datetime
    artifact_hash: str
    reviewer_council_config: Dict[str, Any]
    stage1_hashes: Dict[str, str]  # domain → response hash
    stage2_hashes: Dict[str, str]  # reviewer → ranking hash
    chairman_hash: str
    verdict_hash: str
    electronic_signature: str  # For 21 CFR Part 11

    def compute_chain_hash(self) -> str:
        """Compute hash chain for tamper detection."""
        chain = hashlib.sha256()
        chain.update(self.artifact_hash.encode())
        for h in sorted(self.stage1_hashes.values()):
            chain.update(h.encode())
        for h in sorted(self.stage2_hashes.values()):
            chain.update(h.encode())
        chain.update(self.chairman_hash.encode())
        chain.update(self.verdict_hash.encode())
        return chain.hexdigest()
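
A quick property check of the chain hash, with dummy values in place of real stage hashes: because the stage hashes are sorted before hashing, dict insertion order does not affect the result, while tampering with any recorded hash does. The free function below mirrors the `compute_chain_hash` logic so the check is self-contained.

```python
import hashlib

def chain_hash(artifact_hash, stage1_hashes, stage2_hashes, chairman_hash, verdict_hash):
    # Same hashing order as ComplianceAuditRecord.compute_chain_hash, as a free function.
    chain = hashlib.sha256()
    chain.update(artifact_hash.encode())
    for h in sorted(stage1_hashes.values()):
        chain.update(h.encode())
    for h in sorted(stage2_hashes.values()):
        chain.update(h.encode())
    chain.update(chairman_hash.encode())
    chain.update(verdict_hash.encode())
    return chain.hexdigest()

a = chain_hash("art", {"security": "s1", "compliance": "s2"}, {"r1": "k1"}, "ch", "v")
# Dict insertion order does not matter: values are sorted before hashing.
b = chain_hash("art", {"compliance": "s2", "security": "s1"}, {"r1": "k1"}, "ch", "v")
assert a == b
# Tampering with any recorded stage hash changes the chain hash.
c = chain_hash("art", {"security": "TAMPERED", "compliance": "s2"}, {"r1": "k1"}, "ch", "v")
assert a != c
```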

Token Economics

LLM Council Cost Model

Per Query:
- Stage 1: 4 models × ~2000 tokens = 8,000 tokens
- Stage 2: 4 models × ~3000 tokens = 12,000 tokens
- Stage 3: 1 model × ~4000 tokens = 4,000 tokens
Total: ~24,000 tokens per query

Coditect Reviewer Council Cost Model

Per Code Review:
- Stage 1: 5 reviewers × ~3000 tokens = 15,000 tokens
- Stage 2: 5 reviewers × ~2000 tokens = 10,000 tokens
- Stage 3: 1 chairman × ~5000 tokens = 5,000 tokens
Total: ~30,000 tokens per review

Accounting for the ~15x token multiplier typical of multi-agent workflows:
- Single complex file: ~30K tokens
- Full PR (10 files): ~300K tokens
- Amortized cost: ~50 tokens per line of code
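
The arithmetic behind these figures can be sketched directly. The per-stage token counts are the rough estimates from the cost models above, not measurements, and the ~600 lines-per-file assumption is inferred from the ~50 tokens/line figure rather than stated explicitly.

```python
def council_review_tokens(n_reviewers=5, stage1_per=3000, stage2_per=2000, chairman=5000):
    """Rough token estimate for one council review, per the cost model above."""
    return n_reviewers * stage1_per + n_reviewers * stage2_per + chairman

per_file = council_review_tokens()   # 5*3000 + 5*2000 + 5000 = 30,000 tokens
per_pr = 10 * per_file               # ~300,000 tokens for a 10-file PR
per_line = per_file / 600            # ~50 tokens/line, assuming ~600 lines per file

assert per_file == 30_000
assert per_pr == 300_000
assert per_line == 50.0
```

Plugging in the LLM Council's numbers (4 reviewers, 2000/3000 tokens per stage, a 4000-token chairman) reproduces its ~24K-per-query total with the same formula.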

Key Adaptations from LLM Council

LLM Council Feature | Coditect Adaptation
--------------------|-------------------------------------------
OpenRouter routing  | Multi-provider with compliance controls
JSON file storage   | FoundationDB with audit trail
Anonymous labels    | Domain-aware anonymization
Ranking extraction  | Structured JSON response format
Chairman synthesis  | Merge decision with gate criteria
No authentication   | Role-based access + electronic signatures
No checkpointing    | Full checkpoint/recovery
No circuit breakers | Per-reviewer circuit breakers

Conclusion

The LLM Council pattern provides a clean reference for consensus-based quality signals in multi-agent systems. For Coditect, the key adaptation is transforming the deliberation pattern into a review gate within the delegation workflow, with enterprise hardening for:

  1. Compliance audit trails
  2. Deterministic replay
  3. Graceful degradation
  4. Structured decision outputs

The anonymized peer review mechanism is the most valuable primitive to adopt: it prevents model bias in QA evaluation while providing an aggregate confidence signal.