ADR-016: LLM Council Pattern
Status: Accepted
Date: 2025-12-20
Deciders: Hal Casteel, CODITECT Core Team
Categories: Architecture, Code Review, Multi-Agent, Consensus
Context
CODITECT serves regulated industries (healthcare, finance) that require:
- Defensible QA evidence - Audit trails for compliance
- Multi-perspective review - Different domain experts
- Bias prevention - Objective quality signals
- Structured decisions - Clear approve/reject with rationale
Inspiration: Andrej Karpathy released "LLM Council" (November 2025), demonstrating:
- Multiple LLMs reviewing the same query
- Anonymous peer ranking to prevent model bias
- Chairman synthesis for final answer
The Gap: Karpathy's implementation is a "weekend hack" lacking:
- Authentication/authorization
- Compliance audit trails
- Circuit breakers and retry logic
- PII redaction
- Electronic signatures
CODITECT needs the pattern with enterprise hardening.
Decision
Implement the LLM Council pattern as a 3-stage multi-agent code review pipeline with:
- Stage 1: Parallel Specialized Review - Domain experts review independently
- Stage 2: Anonymous Cross-Evaluation - Reviewers rank each other (identities hidden)
- Stage 3: Chairman Synthesis - Final verdict with structured decision
Architecture
┌─────────────────────────────────────────────────────────────┐
│ STAGE 1: Parallel Specialized Reviews │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Security │ │Compliance│ │Performance│ │Testing │ │
│ │Reviewer │ │Reviewer │ │Reviewer │ │Reviewer│ │
│ └────┬────┘ └────┬────┘ └────┬─────┘ └───┬────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ [Findings] [Findings] [Findings] [Findings] │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STAGE 2: Anonymous Cross-Evaluation │
│ │
│ Reviews anonymized: Security → Alpha, Compliance → Beta │
│ │
│ Each reviewer ranks others: │
│ Alpha ranks: [Beta, Gamma, Delta] │
│ Beta ranks: [Alpha, Gamma, Delta] │
│ ... │
│ │
│ Consensus calculated via Kendall's W │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STAGE 3: Chairman Synthesis │
│ │
│ - Sees de-anonymized reviews │
│ - Applies hard decision thresholds │
│ - Generates structured verdict │
│ - Signs audit record (hash chain) │
│ │
│ Output: APPROVE | REQUEST_CHANGES | REJECT │
└─────────────────────────────────────────────────────────────┘
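The three stages above can be sketched as a single control flow. This is a minimal, model-free illustration: `run_council`, the plain-callable reviewers, and the placeholder ranking are assumptions for exposition, not the CODITECT implementation (the real agents are LLM-backed and Stage 1 runs in parallel).

```python
import random

def run_council(code, reviewers, chairman):
    """reviewers: {name: review_fn}; chairman: fn(reviews, rankings) -> verdict."""
    # Stage 1: independent reviews (parallel in the real pipeline).
    reviews = {name: fn(code) for name, fn in reviewers.items()}
    # Stage 2: hide reviewer identities behind shuffled labels,
    # then each label ranks the other labels.
    labels = ["Alpha", "Beta", "Gamma", "Delta"][: len(reviews)]
    random.shuffle(labels)
    anonymized = dict(zip(labels, reviews.values()))
    rankings = {
        # Placeholder ranking; real reviewers rank by review quality.
        label: sorted(other for other in anonymized if other != label)
        for label in anonymized
    }
    # Stage 3: the chairman sees everything and issues the verdict.
    return chairman(reviews, rankings)
```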
Components Implemented
| Component | Type | Purpose |
|---|---|---|
| council-review | Skill | Core pattern: anonymization, ranking, consensus |
| council-orchestrator | Agent | Coordinates 3-stage pipeline |
| council-chairman | Agent | Synthesizes verdicts with compliance |
| /council-review | Command | Entry point for users |
Decision Thresholds
| Condition | Decision | Rationale |
|---|---|---|
| Critical findings > 0 | REJECT | Zero tolerance for critical |
| High findings > 3 | REQUEST_CHANGES | Too many significant issues |
| Aggregate score < 0.70 | REQUEST_CHANGES | Below quality threshold |
| Consensus < 0.50 with a blocking finding | HUMAN_REVIEW | Disagreement signals uncertainty |
| All pass | APPROVE | Quality requirements met |
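The thresholds above reduce to a deterministic decision function evaluated in order of severity. A minimal sketch; the `ReviewSummary` fields and `decide` helper are illustrative names, not the production API:

```python
from dataclasses import dataclass

@dataclass
class ReviewSummary:
    """Hypothetical aggregate of council output, for illustration only."""
    critical_findings: int
    high_findings: int
    aggregate_score: float    # 0.0-1.0 quality score
    consensus: float          # Kendall's W from Stage 2
    has_blocking_finding: bool

def decide(s: ReviewSummary) -> str:
    """Apply the hard thresholds, most severe first."""
    if s.critical_findings > 0:
        return "REJECT"
    if s.high_findings > 3:
        return "REQUEST_CHANGES"
    if s.aggregate_score < 0.70:
        return "REQUEST_CHANGES"
    if s.consensus < 0.50 and s.has_blocking_finding:
        return "HUMAN_REVIEW"
    return "APPROVE"
```

Because the checks are ordered, a critical finding always rejects regardless of score or consensus, matching the override-protection requirement.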
Consensus Calculation
Using Kendall's W (coefficient of concordance):
```python
from collections import defaultdict
from typing import Dict, List

def compute_consensus(rankings: Dict[str, List[str]]) -> float:
    """
    Returns 0.0-1.0 where:
    - 1.0 = Perfect agreement (all rank identically)
    - 0.7+ = Good agreement
    - 0.5-0.7 = Moderate agreement
    - <0.5 = Low agreement (flag for human review)
    """
    n_items = len(next(iter(rankings.values())))
    n_rankers = len(rankings)
    rank_sums = defaultdict(int)
    for ordered_labels in rankings.values():
        for position, label in enumerate(ordered_labels, start=1):
            rank_sums[label] += position
    mean_rank_sum = sum(rank_sums.values()) / n_items
    ss = sum((rs - mean_rank_sum) ** 2 for rs in rank_sums.values())
    max_ss = (n_rankers ** 2 * (n_items ** 3 - n_items)) / 12
    return ss / max_ss if max_ss > 0 else 1.0
```
Audit Trail
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict

@dataclass
class ComplianceAuditRecord:
    timestamp: datetime
    artifact_hash: str             # SHA256 of reviewed code
    stage1_hashes: Dict[str, str]  # reviewer → response hash
    stage2_hashes: Dict[str, str]  # reviewer → ranking hash
    chairman_hash: str
    verdict_hash: str
    chain_hash: str                # SHA256 of all above

    def verify(self) -> bool:
        """Verify hash chain integrity for auditors."""
        expected = compute_chain_hash(...)
        return expected == self.chain_hash
```
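The `chain_hash` can be derived from a canonical encoding of the other fields, so auditors can recompute and compare it. A minimal sketch of such a `compute_chain_hash`; this is an illustrative stand-in, not the production implementation:

```python
import hashlib
import json

def compute_chain_hash(artifact_hash, stage1_hashes, stage2_hashes,
                       chairman_hash, verdict_hash):
    """SHA256 over a canonical JSON encoding of all prior hashes.

    sort_keys makes the encoding deterministic, so the same record
    always yields the same chain hash on re-verification.
    """
    payload = json.dumps({
        "artifact": artifact_hash,
        "stage1": stage1_hashes,
        "stage2": stage2_hashes,
        "chairman": chairman_hash,
        "verdict": verdict_hash,
    }, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Any tampering with a stored review or verdict changes its hash, which in turn changes the chain hash and fails `verify()`.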
Rationale
Why Anonymous Cross-Evaluation?
Models have known biases toward their own "family" of responses. GPT-4 tends to prefer GPT-4-style outputs. By hiding model identity during ranking, we get more objective quality signals.
Research Support: Karpathy's experiments showed that anonymization reduces self-preference bias by ~15%.
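The anonymization step amounts to a shuffled relabeling whose mapping is withheld until Stage 3. A minimal sketch, assuming a fixed label pool; `anonymize_reviews` is an illustrative name, not the actual skill API:

```python
import random

LABELS = ["Alpha", "Beta", "Gamma", "Delta"]

def anonymize_reviews(reviews: dict) -> tuple:
    """Map reviewer identities to shuffled neutral labels.

    Returns (anonymized_reviews, label_to_reviewer). The mapping is
    kept secret during Stage 2 and revealed to the chairman in Stage 3.
    """
    labels = LABELS[: len(reviews)]
    random.shuffle(labels)
    label_to_reviewer = {}
    anonymized = {}
    for label, (reviewer, text) in zip(labels, reviews.items()):
        label_to_reviewer[label] = reviewer
        anonymized[label] = text
    return anonymized, label_to_reviewer
```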
Why 3 Stages (Not 2 or 4)?
| Stages | Description | Tradeoff |
|---|---|---|
| 1 | Just reviews | No cross-validation |
| 2 | Reviews + synthesis | No peer ranking |
| 3 | Reviews + ranking + synthesis | Optimal balance |
| 4+ | Additional stages | Diminishing returns, higher cost |
Decision: 3 stages provides peer validation without excessive token cost.
Why Kendall's W for Consensus?
| Metric | Use Case | Limitations |
|---|---|---|
| Simple majority | Binary decisions | Ignores ranking order |
| Fleiss' Kappa | Category agreement | Not designed for rankings |
| Kendall's W | Ranking agreement | Purpose-built for this |
Decision: Kendall's W is the standard statistical measure for inter-rater agreement on rankings.
Why Hard Thresholds?
Regulated industries require deterministic decisions. A model saying "probably approve" isn't auditable. Hard thresholds provide:
- Predictability - Same inputs → same decision
- Auditability - Can explain why decision was made
- Override Protection - Critical issues can't be synthesized away
Cost Analysis
| Stage | Tokens | Cost @ $3/1M |
|---|---|---|
| Stage 1 (4 reviewers) | ~12,000 | ~$0.036 |
| Stage 2 (4 rankings) | ~8,000 | ~$0.024 |
| Stage 3 (chairman) | ~5,000 | ~$0.015 |
| Total per file | ~25,000 | ~$0.075 |
For a 10-file PR: ~$0.75 (the "quality tax")
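The per-file figure follows directly from the token counts at the assumed $3-per-million-token rate; a quick check:

```python
RATE_PER_TOKEN = 3 / 1_000_000  # assumed blended rate: $3 per 1M tokens

stage_tokens = {"stage1_reviews": 12_000, "stage2_rankings": 8_000, "stage3_chairman": 5_000}
per_file = sum(stage_tokens.values()) * RATE_PER_TOKEN  # 25,000 tokens -> $0.075
per_pr = 10 * per_file                                  # 10-file PR -> $0.75
```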
Consequences
Positive
- Defensible QA - Multi-reviewer consensus is auditable evidence
- Bias Reduction - Anonymous ranking prevents model favoritism
- Compliance Ready - Hash chains satisfy FDA 21 CFR Part 11
- Clear Decisions - Structured verdicts integrate with CI/CD
- Configurable - Choose reviewers, thresholds, compliance frameworks
Negative
- Higher Cost - 4-6x more tokens than single-agent review
- Latency - 3 sequential stages take longer
- Complexity - More moving parts than simple review
Mitigations
| Risk | Mitigation |
|---|---|
| Cost | Tiered review (quick pre-check for trivial changes) |
| Latency | Parallel Stage 1, streaming Stage 3 |
| Complexity | Encapsulated in skill, abstracted by command |
Alternatives Considered
1. Single-Agent Deep Review
Have one agent do comprehensive review with multiple passes.
Rejected: No consensus signal, no bias prevention, single point of failure.
2. Voting Without Ranking
Simple majority vote on approve/reject.
Rejected: Loses the nuance of which review is most valuable; ranking preserves that signal.
3. Human-in-the-Loop Always
Require human approval for all reviews.
Rejected: Defeats the purpose of automation. The council provides human-level confidence for most cases and escalates only when uncertain.
4. External Review Services
Use third-party code review tools (CodeClimate, SonarQube).
Rejected: Not LLM-based, can't do semantic analysis. Complementary, not replacement.
Usage Examples
Standard Code Review
```bash
/council-review src/auth/login.rs --reviewers security,testing
```
Compliance-Critical Review
```bash
/council-review src/patient_records/ \
  --compliance hipaa \
  --audit \
  --threshold 0.8 \
  --recursive
```
CI/CD Integration
```yaml
# GitHub Actions
- name: Council Review
  run: |
    claude "/council-review src/ --format ci" > review.json
    if jq -e '.decision != "approve"' review.json; then
      exit 1
    fi
```
Comparison with Existing Components
| Component | Pattern | Best For |
|---|---|---|
| /council-review | Multi-agent consensus | Compliance-critical, high-stakes |
| /code-review | Single agent | Quick feedback, low risk |
| orchestrator-code-review | ADR compliance | CODITECT v4 standards |
| uncertainty-orchestrator | MoE judges | Research, not code review |
Key Differentiator: Council pattern adds anonymous peer ranking and consensus scoring. Others lack this bias prevention mechanism.
Related ADRs
- ADR-014-COMPONENT-CAPABILITY-SCHEMA - How council components are indexed
- ADR-015-SELF-AWARENESS-FRAMEWORK - How orchestrators discover council
- ADR-012-MOE-ANALYSIS-FRAMEWORK - Related pattern for research (not code)
References
- Karpathy's LLM Council - Original inspiration
- VentureBeat Analysis - Enterprise gap analysis
- docs/09-research-analysis/llm-council-pattern/ - Full research docs
- skills/council-review/SKILL.md - Implementation
Approved: 2025-12-20 Review Date: 2026-03-20