LLM Council Pattern: Technical Architecture Analysis for Coditect
Executive Summary
Karpathy's LLM Council demonstrates a deliberation pattern for consensus-seeking across multiple models. Coditect's autonomous development platform requires a delegation pattern for task decomposition. However, the council's peer review mechanism offers a valuable quality assurance primitive that can be adapted for Coditect's compliance-critical workflows.
Pattern Comparison
LLM Council (Deliberation)
┌─────────────────────────────────────────────────────────────┐
│ User Query │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Stage 1: Parallel Dispatch (Same Prompt → All Models) │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ GPT-5.1 │ │ Gemini │ │ Claude │ │ Grok │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ Response A Response B Response C Response D │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Stage 2: Anonymous Peer Review │
│ Each model ranks others (identities hidden as A/B/C/D) │
│ Output: aggregate_rankings + evaluation_text │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Stage 3: Chairman Synthesis │
│ Single model synthesizes with ranking context │
└─────────────────────────────────────────────────────────────┘
Coditect (Delegation)
┌─────────────────────────────────────────────────────────────┐
│ Development Task │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Orchestrator Agent │
│ - Task decomposition │
│ - Agent capability matching │
│ - Token budget allocation │
│ - Checkpoint management │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Architect │ │ Implementer │ │ Reviewer │
│ Agent │ │ Agent │ │ Ensemble │
│ │ │ │ │ (COUNCIL) │
│ - Design │ │ - Code gen │ │ - Security │
│ - ADRs │ │ - Tests │ │ - Compliance │
│ - C4 models │ │ - Docs │ │ - Style │
└───────────────┘ └───────────────┘ └───────────────┘
│ │ │
└─────────────────────┼─────────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ Aggregation + Merge Decision │
│ (With audit trail for regulated industries) │
└─────────────────────────────────────────────────────────────┘
Key Architectural Differences
| Dimension | LLM Council | Coditect Multi-Agent |
|---|---|---|
| Coordination | Parallel → Synthesize | Hierarchical delegation |
| Agent Roles | Homogeneous (same prompt) | Heterogeneous (specialized) |
| State | Stateless per query | Checkpoint-based persistence |
| Tool Usage | None | Tool-aware with constraints |
| Quality Signal | Peer ranking | Domain-specific review |
| Failure Mode | Graceful degradation | Circuit breaker + recovery |
| Audit | None | Full trace for compliance |
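The coordination contrast in the table can be sketched in a few lines of Python (`call_model` is a hypothetical stand-in for a provider call): deliberation fans one prompt out to every model, while delegation decomposes the task and routes each subtask to a single specialist.

```python
import asyncio

async def call_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for an LLM provider call."""
    return f"{model}: response to {prompt!r}"

async def deliberate(models: list[str], query: str) -> list[str]:
    # Council: the same prompt goes to every model in parallel.
    return await asyncio.gather(*(call_model(m, query) for m in models))

async def delegate(specialists: dict[str, str], task: str) -> dict[str, str]:
    # Coditect: the task is decomposed and each subtask routed
    # to exactly one specialized agent.
    subtasks = {role: f"{task} [{role} subtask]" for role in specialists}
    results = await asyncio.gather(
        *(call_model(specialists[role], subtasks[role]) for role in subtasks)
    )
    return dict(zip(subtasks, results))
```

The homogeneous/heterogeneous distinction in the table falls out directly: `deliberate` varies only the model, `delegate` varies both the model and the prompt.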
Adaptation: Reviewer Council Pattern
The LLM Council's peer review mechanism maps cleanly to Coditect's QA phase. Instead of multiple models reviewing the same content, specialized reviewer agents evaluate code artifacts against domain-specific criteria.
Reviewer Council Architecture
from dataclasses import dataclass, field
from typing import List, Dict, Any, Optional
from enum import Enum
import asyncio
class ReviewDomain(Enum):
SECURITY = "security"
COMPLIANCE = "compliance" # HIPAA, SOC2, FDA 21 CFR Part 11
PERFORMANCE = "performance"
MAINTAINABILITY = "maintainability"
TEST_COVERAGE = "test_coverage"
@dataclass
class ReviewerConfig:
"""Configuration for a specialized reviewer agent."""
domain: ReviewDomain
model: str # e.g., "anthropic/claude-sonnet-4.5"
system_prompt: str
evaluation_rubric: Dict[str, float] # criteria → weight
severity_thresholds: Dict[str, int] # finding_type → max_allowed
@dataclass
class ReviewFinding:
"""Individual finding from a reviewer."""
domain: ReviewDomain
severity: str # "critical", "high", "medium", "low", "info"
location: str # file:line or AST path
description: str
recommendation: str
confidence: float
@dataclass
class ReviewResult:
"""Complete review from a single reviewer."""
reviewer_id: str
domain: ReviewDomain
findings: List[ReviewFinding]
overall_score: float # 0.0 - 1.0
pass_fail: bool
raw_evaluation: str # Full LLM response for audit
token_usage: int
@dataclass
class CouncilVerdict:
"""Aggregated verdict from reviewer council."""
    individual_results: Dict[ReviewDomain, ReviewResult]
aggregate_score: float
blocking_findings: List[ReviewFinding]
merge_decision: str # "approve", "request_changes", "reject"
chairman_synthesis: str
consensus_level: float # Agreement metric across reviewers
audit_hash: str # SHA256 of all inputs for compliance
class ReviewerCouncil:
"""
Multi-agent review council for code quality assurance.
Adapted from Karpathy's LLM Council pattern with enterprise hardening.
"""
def __init__(
self,
reviewers: List[ReviewerConfig],
chairman_model: str,
checkpoint_store: 'CheckpointStore',
compliance_mode: bool = True
):
self.reviewers = {r.domain: r for r in reviewers}
self.chairman_model = chairman_model
self.checkpoint_store = checkpoint_store
self.compliance_mode = compliance_mode
self.circuit_breakers: Dict[str, 'CircuitBreaker'] = {}
async def review_artifact(
self,
artifact: 'CodeArtifact',
context: Dict[str, Any]
) -> CouncilVerdict:
"""
Execute full council review with anonymized cross-evaluation.
Stage 1: Parallel specialized reviews
Stage 2: Cross-reviewer ranking (anonymized)
Stage 3: Chairman synthesis with merge decision
"""
# Checkpoint: Start
checkpoint_id = await self.checkpoint_store.create(
stage="review_start",
artifact_hash=artifact.hash,
context=context
)
try:
# Stage 1: Parallel specialized reviews
stage1_results = await self._stage1_collect_reviews(artifact, context)
await self.checkpoint_store.update(
checkpoint_id,
stage="stage1_complete",
results=stage1_results
)
# Stage 2: Cross-reviewer ranking (anonymized)
stage2_rankings, label_mapping = await self._stage2_cross_evaluate(
stage1_results
)
await self.checkpoint_store.update(
checkpoint_id,
stage="stage2_complete",
rankings=stage2_rankings
)
# Stage 3: Chairman synthesis
verdict = await self._stage3_synthesize_verdict(
artifact,
stage1_results,
stage2_rankings,
label_mapping
)
# Finalize checkpoint with audit trail
await self.checkpoint_store.finalize(
checkpoint_id,
verdict=verdict,
compliance_hash=self._compute_audit_hash(
artifact, stage1_results, stage2_rankings, verdict
)
)
return verdict
except Exception as e:
await self.checkpoint_store.mark_failed(checkpoint_id, str(e))
raise
async def _stage1_collect_reviews(
self,
artifact: 'CodeArtifact',
context: Dict[str, Any]
) -> Dict[ReviewDomain, ReviewResult]:
"""
Dispatch artifact to all specialized reviewers in parallel.
Each reviewer evaluates against their domain-specific rubric.
"""
async def review_with_circuit_breaker(
domain: ReviewDomain,
config: ReviewerConfig
) -> ReviewResult:
cb = self.circuit_breakers.get(domain.value)
if cb and cb.is_open:
# Fallback: Return degraded result
return self._degraded_review_result(domain)
try:
result = await self._execute_review(artifact, config, context)
if cb:
cb.record_success()
return result
except Exception as e:
if cb:
cb.record_failure()
raise
# Parallel execution with timeout
tasks = [
asyncio.create_task(
asyncio.wait_for(
review_with_circuit_breaker(domain, config),
timeout=60.0 # Per-reviewer timeout
)
)
for domain, config in self.reviewers.items()
]
results = await asyncio.gather(*tasks, return_exceptions=True)
# Process results, handling partial failures
review_results = {}
for domain, result in zip(self.reviewers.keys(), results):
if isinstance(result, Exception):
review_results[domain] = self._error_review_result(domain, result)
else:
review_results[domain] = result
return review_results
async def _stage2_cross_evaluate(
self,
stage1_results: Dict[ReviewDomain, ReviewResult]
    ) -> tuple[Dict[str, float], Dict[str, ReviewDomain]]:
"""
Each reviewer ranks the other reviews (anonymized).
Key insight from LLM Council: Prevent model favoritism.
"""
# Anonymize: Map domains to neutral labels
labels = ['Alpha', 'Beta', 'Gamma', 'Delta', 'Epsilon']
label_mapping = {}
anonymized_reviews = {}
for i, (domain, result) in enumerate(stage1_results.items()):
label = labels[i]
label_mapping[label] = domain
anonymized_reviews[label] = {
'findings': [f.__dict__ for f in result.findings],
'overall_score': result.overall_score,
'evaluation': result.raw_evaluation
}
# Each reviewer ranks the others
ranking_tasks = []
for domain, config in self.reviewers.items():
# Exclude self from ranking
others = {
label: review
for label, review in anonymized_reviews.items()
if label_mapping[label] != domain
}
ranking_tasks.append(
self._request_ranking(config, others)
)
rankings = await asyncio.gather(*ranking_tasks)
# Aggregate rankings
aggregate_rankings = self._aggregate_rankings(
rankings,
list(self.reviewers.keys())
)
return aggregate_rankings, label_mapping
async def _stage3_synthesize_verdict(
self,
artifact: 'CodeArtifact',
stage1_results: Dict[ReviewDomain, ReviewResult],
        stage2_rankings: Dict[str, float],
label_mapping: Dict[str, ReviewDomain]
) -> CouncilVerdict:
"""
Chairman synthesizes final verdict with merge decision.
Unlike LLM Council, we have a specific decision output.
"""
# Collect all blocking findings
blocking_findings = []
for domain, result in stage1_results.items():
for finding in result.findings:
if finding.severity in ['critical', 'high']:
blocking_findings.append(finding)
# Build chairman prompt
chairman_prompt = self._build_chairman_prompt(
artifact,
stage1_results,
stage2_rankings,
label_mapping
)
# Chairman decision
chairman_response = await self._call_llm(
self.chairman_model,
chairman_prompt,
response_format="json"
)
# Parse structured decision
decision = self._parse_chairman_decision(chairman_response)
# Compute consensus level
consensus = self._compute_consensus(stage1_results, stage2_rankings)
return CouncilVerdict(
individual_results=stage1_results,
aggregate_score=decision['aggregate_score'],
blocking_findings=blocking_findings,
merge_decision=decision['merge_decision'],
chairman_synthesis=decision['synthesis'],
consensus_level=consensus,
audit_hash="" # Computed by caller
)
def _build_chairman_prompt(
self,
artifact: 'CodeArtifact',
results: Dict[ReviewDomain, ReviewResult],
        rankings: Dict[str, float],
label_mapping: Dict[str, ReviewDomain]
) -> str:
"""Build chairman synthesis prompt with all context."""
return f"""You are the Chairman of a Code Review Council for a regulated software system.
## Artifact Under Review
- File: {artifact.path}
- Language: {artifact.language}
- Lines: {artifact.line_count}
- Compliance Context: {artifact.compliance_tags}
## Individual Reviews
{self._format_reviews(results)}
## Cross-Reviewer Rankings
{self._format_rankings(rankings, label_mapping)}
## Your Task
Synthesize all reviews into a final verdict. You must provide:
1. **Aggregate Score** (0.0-1.0): Weighted by reviewer consensus and finding severity
2. **Merge Decision**: One of "approve", "request_changes", "reject"
3. **Synthesis**: 2-3 paragraph summary of key findings and rationale
Decision Criteria:
- Any CRITICAL finding → reject
- >2 HIGH findings → request_changes
- Compliance domain failure → reject (regulated context)
- <0.7 aggregate score → request_changes
Respond in JSON:
{{
"aggregate_score": 0.85,
"merge_decision": "approve",
"synthesis": "..."
}}"""
@staticmethod
def _aggregate_rankings(
rankings: List[Dict],
reviewer_domains: List[ReviewDomain]
) -> Dict[str, float]:
"""
Compute average rank position across all peer evaluations.
Direct adaptation of LLM Council's calculate_aggregate_rankings.
"""
rank_scores = {}
for ranking in rankings:
for label, position in ranking.items():
if label not in rank_scores:
rank_scores[label] = []
rank_scores[label].append(position)
# Average position (lower is better)
return {
label: sum(positions) / len(positions)
for label, positions in rank_scores.items()
}
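The chairman's decision criteria can also be enforced as a deterministic gate outside the LLM, so a malformed or overly lenient synthesis cannot bypass policy. A minimal sketch of such a gate (the function name and flat severity list are assumptions, not part of the class above):

```python
def gate_decision(severities: list[str], aggregate_score: float,
                  compliance_failed: bool) -> str:
    """Apply the chairman's stated criteria as a deterministic gate.

    Mirrors the prompt rules: any critical finding or a compliance-domain
    failure rejects; more than two high findings or an aggregate score
    below 0.7 requests changes; otherwise approve.
    """
    if "critical" in severities or compliance_failed:
        return "reject"
    if severities.count("high") > 2 or aggregate_score < 0.7:
        return "request_changes"
    return "approve"
```

Running this check against the chairman's JSON output and taking the stricter of the two decisions keeps the gate auditable even when the synthesis is wrong.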
Integration Points
1. Pipeline Integration
class CoditectPipeline:
"""Main development pipeline with council-based review."""
async def execute(self, task: DevelopmentTask) -> PipelineResult:
# Phase 1: Architecture
architecture = await self.architect_agent.design(task)
# Phase 2: Implementation
artifacts = await self.implementer_agent.generate(architecture)
# Phase 3: Council Review (NEW)
review_verdict = await self.reviewer_council.review_artifact(
artifacts,
context={
'architecture': architecture,
'compliance_requirements': task.compliance_tags,
'risk_level': task.risk_assessment
}
)
# Phase 4: Decision Gate
if review_verdict.merge_decision == 'reject':
return PipelineResult(
status='failed',
reason=review_verdict.chairman_synthesis,
artifacts=None
)
if review_verdict.merge_decision == 'request_changes':
# Recursive refinement
refined = await self.implementer_agent.refine(
artifacts,
feedback=review_verdict.blocking_findings
)
return await self.execute_review_only(refined)
# Approved
return PipelineResult(
status='success',
artifacts=artifacts,
audit_trail=review_verdict.audit_hash
)
2. Compliance Audit Trail
import hashlib
from datetime import datetime

@dataclass
class ComplianceAuditRecord:
"""
Immutable audit record for regulated environments.
Required for FDA 21 CFR Part 11, SOC2, HIPAA.
"""
timestamp: datetime
artifact_hash: str
reviewer_council_config: Dict[str, Any]
stage1_hashes: Dict[str, str] # domain → response hash
stage2_hashes: Dict[str, str] # reviewer → ranking hash
chairman_hash: str
verdict_hash: str
electronic_signature: str # For 21 CFR Part 11
def compute_chain_hash(self) -> str:
"""Compute hash chain for tamper detection."""
chain = hashlib.sha256()
chain.update(self.artifact_hash.encode())
for h in sorted(self.stage1_hashes.values()):
chain.update(h.encode())
for h in sorted(self.stage2_hashes.values()):
chain.update(h.encode())
chain.update(self.chairman_hash.encode())
chain.update(self.verdict_hash.encode())
return chain.hexdigest()
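To illustrate the tamper-detection property, here is a standalone sketch of the same hash-chain construction: any mutation of a recorded hash changes the chain digest, while reordering the (canonically sorted) stage hashes does not.

```python
import hashlib

def chain_hash(artifact_hash: str, stage_hashes: list[str],
               verdict_hash: str) -> str:
    # Hash inputs in a canonical (sorted) order so the chain is
    # order-independent for stage hashes but sensitive to content.
    chain = hashlib.sha256()
    chain.update(artifact_hash.encode())
    for h in sorted(stage_hashes):
        chain.update(h.encode())
    chain.update(verdict_hash.encode())
    return chain.hexdigest()

original = chain_hash("abc", ["r1", "r2"], "v1")
# Any mutation of a recorded hash changes the chain digest.
assert chain_hash("abc", ["r1", "r2-modified"], "v1") != original
# Reordering stage hashes does not, because they are canonically sorted.
assert chain_hash("abc", ["r2", "r1"], "v1") == original
```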
Token Economics
LLM Council Cost Model
Per Query:
- Stage 1: 4 models × ~2000 tokens = 8,000 tokens
- Stage 2: 4 models × ~3000 tokens = 12,000 tokens
- Stage 3: 1 model × ~4000 tokens = 4,000 tokens
Total: ~24,000 tokens per query
Coditect Reviewer Council Cost Model
Per Code Review:
- Stage 1: 5 reviewers × ~3000 tokens = 15,000 tokens
- Stage 2: 5 reviewers × ~2000 tokens = 10,000 tokens
- Stage 3: 1 chairman × ~5000 tokens = 5,000 tokens
Total: ~30,000 tokens per review
Factoring in the ~15x token multiplier typical of multi-agent workflows:
- Single complex file: ~30K tokens
- Full PR (10 files): ~300K tokens
- Amortized per line of code: ~50 tokens
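Both cost models reduce to the same formula, which makes the totals easy to sanity-check:

```python
def council_tokens(n_reviewers: int, stage1_per: int,
                   stage2_per: int, chairman: int) -> int:
    """Total tokens for one council pass: stage 1 + stage 2 + chairman."""
    return n_reviewers * stage1_per + n_reviewers * stage2_per + chairman

# LLM Council: 4 models per stage, single chairman pass.
assert council_tokens(4, 2000, 3000, 4000) == 24_000
# Coditect reviewer council: 5 reviewers per stage.
assert council_tokens(5, 3000, 2000, 5000) == 30_000
```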
Key Adaptations from LLM Council
| LLM Council Feature | Coditect Adaptation |
|---|---|
| OpenRouter routing | Multi-provider with compliance controls |
| JSON file storage | FoundationDB with audit trail |
| Anonymous labels | Domain-aware anonymization |
| Ranking extraction | Structured JSON response format |
| Chairman synthesis | Merge decision with gate criteria |
| No authentication | Role-based access + electronic signatures |
| No checkpointing | Full checkpoint/recovery |
| No circuit breakers | Per-reviewer circuit breakers |
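The `CircuitBreaker` referenced by `ReviewerCouncil` is never defined above. A minimal count-based sketch (the threshold and reset-on-success policy are assumptions) could look like:

```python
class CircuitBreaker:
    """Minimal count-based breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        # Open means the reviewer is skipped and a degraded result returned.
        return self.consecutive_failures >= self.failure_threshold

    def record_success(self) -> None:
        self.consecutive_failures = 0  # any success closes the breaker

    def record_failure(self) -> None:
        self.consecutive_failures += 1
```

A production breaker would typically add a half-open state with a cooldown timer so a skipped reviewer is periodically retried rather than excluded permanently.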
Conclusion
The LLM Council pattern provides a clean reference for consensus-based quality signals in multi-agent systems. For Coditect, the key adaptation is transforming the deliberation pattern into a review gate within the delegation workflow, with enterprise hardening for:
- Compliance audit trails
- Deterministic replay
- Graceful degradation
- Structured decision outputs
The anonymized peer review mechanism is the most valuable primitive to adopt: it prevents model favoritism in QA evaluation while providing aggregate confidence signals.