
ADR-016: LLM Council Pattern

Status: Accepted
Date: 2025-12-20
Deciders: Hal Casteel, CODITECT Core Team
Categories: Architecture, Code Review, Multi-Agent, Consensus


Context

CODITECT serves regulated industries (healthcare, finance) that require:

  • Defensible QA evidence - Audit trails for compliance
  • Multi-perspective review - Different domain experts
  • Bias prevention - Objective quality signals
  • Structured decisions - Clear approve/reject with rationale

Inspiration: Andrej Karpathy released "LLM Council" (November 2025), demonstrating:

  • Multiple LLMs reviewing the same query
  • Anonymous peer ranking to prevent model bias
  • Chairman synthesis for final answer

The Gap: Karpathy's implementation is a "weekend hack" lacking:

  • Authentication/authorization
  • Compliance audit trails
  • Circuit breakers and retry logic
  • PII redaction
  • Electronic signatures

CODITECT needs the pattern with enterprise hardening.


Decision

Implement the LLM Council pattern as a 3-stage multi-agent code review pipeline with:

  1. Stage 1: Parallel Specialized Review - Domain experts review independently
  2. Stage 2: Anonymous Cross-Evaluation - Reviewers rank each other (identities hidden)
  3. Stage 3: Chairman Synthesis - Final verdict with structured decision

Architecture

┌─────────────────────────────────────────────────────────────┐
│            STAGE 1: Parallel Specialized Reviews            │
│                                                             │
│  ┌──────────┐  ┌──────────┐  ┌───────────┐  ┌─────────┐     │
│  │ Security │  │Compliance│  │Performance│  │ Testing │     │
│  │ Reviewer │  │ Reviewer │  │ Reviewer  │  │Reviewer │     │
│  └────┬─────┘  └────┬─────┘  └─────┬─────┘  └───┬─────┘     │
│       │             │              │            │           │
│       ▼             ▼              ▼            ▼           │
│   [Findings]    [Findings]     [Findings]   [Findings]      │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│             STAGE 2: Anonymous Cross-Evaluation             │
│                                                             │
│  Reviews anonymized: Security → Alpha, Compliance → Beta    │
│                                                             │
│  Each reviewer ranks others:                                │
│    Alpha ranks: [Beta, Gamma, Delta]                        │
│    Beta ranks:  [Alpha, Gamma, Delta]                       │
│    ...                                                      │
│                                                             │
│  Consensus calculated via Kendall's W                       │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                 STAGE 3: Chairman Synthesis                 │
│                                                             │
│  - Sees de-anonymized reviews                               │
│  - Applies hard decision thresholds                         │
│  - Generates structured verdict                             │
│  - Signs audit record (hash chain)                          │
│                                                             │
│  Output: APPROVE | REQUEST_CHANGES | REJECT                 │
└─────────────────────────────────────────────────────────────┘
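The three boxes above can be read as a single orchestration loop. A minimal sketch, with `run_reviewer` stubbed out as a hypothetical stand-in for the per-reviewer LLM call (verdict logic omitted):

```python
import asyncio

REVIEWERS = ["security", "compliance", "performance", "testing"]
LABELS = ["Alpha", "Beta", "Gamma", "Delta"]

# Hypothetical stand-in for the per-reviewer LLM call in Stage 1.
async def run_reviewer(name: str, artifact: str) -> str:
    return f"{name} findings for {artifact}"

async def council_review(artifact: str) -> dict:
    # Stage 1: specialized reviewers run in parallel.
    results = await asyncio.gather(
        *(run_reviewer(r, artifact) for r in REVIEWERS))
    findings = dict(zip(REVIEWERS, results))

    # Stage 2: identities are hidden before cross-evaluation;
    # the label -> reviewer mapping is retained for the chairman.
    anonymized = dict(zip(LABELS, findings.values()))

    # Stage 3: the chairman sees both views plus the consensus score.
    return {"findings": findings, "anonymized": anonymized}

verdict = asyncio.run(council_review("src/auth/login.rs"))
```

Stage 1 is the only inherently parallel stage; Stages 2 and 3 depend on its outputs, which is why latency mitigation focuses there.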

Components Implemented

| Component | Type | Purpose |
|---|---|---|
| council-review | Skill | Core pattern: anonymization, ranking, consensus |
| council-orchestrator | Agent | Coordinates 3-stage pipeline |
| council-chairman | Agent | Synthesizes verdicts with compliance |
| /council-review | Command | Entry point for users |

Decision Thresholds

| Condition | Decision | Rationale |
|---|---|---|
| Critical findings > 0 | REJECT | Zero tolerance for critical |
| High findings > 3 | REQUEST_CHANGES | Too many significant issues |
| Aggregate score < 0.70 | REQUEST_CHANGES | Below quality threshold |
| Consensus < 0.50 + blocking | HUMAN_REVIEW | Disagreement signals uncertainty |
| All pass | APPROVE | Quality requirements met |
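The thresholds reduce to a deterministic decision function. A minimal sketch; the `Findings` fields and the exact rule ordering are illustrative, not the shipped schema:

```python
from dataclasses import dataclass

@dataclass
class Findings:
    critical: int
    high: int
    aggregate_score: float  # 0.0-1.0
    consensus: float        # Kendall's W
    has_blocking: bool

def decide(f: Findings) -> str:
    # Ordered checks: the first matching rule wins, so critical
    # issues can never be "synthesized away" by a good score.
    if f.critical > 0:
        return "REJECT"
    if f.consensus < 0.50 and f.has_blocking:
        return "HUMAN_REVIEW"
    if f.high > 3 or f.aggregate_score < 0.70:
        return "REQUEST_CHANGES"
    return "APPROVE"
```

Because the function is pure, identical inputs always yield identical decisions, which is the auditability property the pattern depends on.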

Consensus Calculation

Using Kendall's W (coefficient of concordance):

from collections import defaultdict
from typing import Dict, List

def compute_consensus(rankings: Dict[str, List[str]]) -> float:
    """
    Kendall's W (coefficient of concordance) over a set of rankings.
    Assumes every ranker orders the same set of items.

    Returns 0.0-1.0 where:
    - 1.0 = Perfect agreement (all rank identically)
    - 0.7+ = Good agreement
    - 0.5-0.7 = Moderate agreement
    - <0.5 = Low agreement (flag for human review)
    """
    n_items = len(next(iter(rankings.values())))
    n_rankers = len(rankings)

    # Sum each item's rank position across all rankers.
    rank_sums = defaultdict(int)
    for ordered_labels in rankings.values():
        for position, label in enumerate(ordered_labels, start=1):
            rank_sums[label] += position

    # W = observed variance of rank sums / maximum possible variance.
    # Three identical rankings give W == 1.0; two exactly opposed
    # rankings cancel out to W == 0.0.
    mean_rank_sum = sum(rank_sums.values()) / n_items
    ss = sum((rs - mean_rank_sum) ** 2 for rs in rank_sums.values())
    max_ss = (n_rankers ** 2 * (n_items ** 3 - n_items)) / 12

    return ss / max_ss if max_ss > 0 else 1.0

Audit Trail

from dataclasses import dataclass
from datetime import datetime
from typing import Dict

@dataclass
class ComplianceAuditRecord:
    timestamp: datetime
    artifact_hash: str              # SHA256 of reviewed code
    stage1_hashes: Dict[str, str]   # reviewer → response hash
    stage2_hashes: Dict[str, str]   # reviewer → ranking hash
    chairman_hash: str
    verdict_hash: str
    chain_hash: str                 # SHA256 of all above

    def verify(self) -> bool:
        """Verify hash chain integrity for auditors."""
        expected = compute_chain_hash(...)
        return expected == self.chain_hash
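`compute_chain_hash` is referenced but not shown. One way it might be implemented with `hashlib`; the field names and serialization order here are assumptions, not the shipped code:

```python
import hashlib
import json

def compute_chain_hash(artifact_hash: str,
                       stage1_hashes: dict,
                       stage2_hashes: dict,
                       chairman_hash: str,
                       verdict_hash: str) -> str:
    # Canonical JSON (sorted keys) so the same record always
    # serializes to the same bytes before hashing.
    payload = json.dumps({
        "artifact": artifact_hash,
        "stage1": stage1_hashes,
        "stage2": stage2_hashes,
        "chairman": chairman_hash,
        "verdict": verdict_hash,
    }, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()
```

Any change to any stage's output changes the chain hash, so an auditor can detect after-the-fact tampering with a single comparison.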

Rationale

Why Anonymous Cross-Evaluation?

Models have known biases toward their own "family" of responses. GPT-4 tends to prefer GPT-4-style outputs. By hiding model identity during ranking, we get more objective quality signals.

Research Support: Karpathy's experiments showed that anonymization reduces self-preference bias by ~15%.
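Anonymization itself can be as simple as a shuffled label mapping, regenerated per review session so labels carry no information across runs. A sketch, not the shipped implementation:

```python
import random

LABELS = ["Alpha", "Beta", "Gamma", "Delta"]

def anonymize(reviews: dict) -> tuple:
    """Return (label -> review) plus the secret mapping for Stage 3."""
    labels = random.sample(LABELS, k=len(reviews))
    mapping = dict(zip(reviews, labels))          # reviewer -> label
    anonymized = {mapping[r]: text for r, text in reviews.items()}
    return anonymized, mapping
```

The mapping is withheld during Stage 2 ranking and handed only to the chairman for de-anonymized synthesis in Stage 3.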

Why 3 Stages (Not 2 or 4)?

| Stages | Description | Tradeoff |
|---|---|---|
| 1 | Just reviews | No cross-validation |
| 2 | Reviews + synthesis | No peer ranking |
| 3 | Reviews + ranking + synthesis | Optimal balance |
| 4+ | Additional stages | Diminishing returns, higher cost |

Decision: 3 stages provides peer validation without excessive token cost.

Why Kendall's W for Consensus?

| Metric | Use Case | Limitations |
|---|---|---|
| Simple majority | Binary decisions | Ignores ranking order |
| Fleiss' Kappa | Category agreement | Not designed for rankings |
| Kendall's W | Ranking agreement | Purpose-built for this |

Decision: Kendall's W is the standard statistical measure for inter-rater agreement on rankings.

Why Hard Thresholds?

Regulated industries require deterministic decisions. A model saying "probably approve" isn't auditable. Hard thresholds provide:

  • Predictability - Same inputs → same decision
  • Auditability - Can explain why decision was made
  • Override Protection - Critical issues can't be synthesized away

Cost Analysis

| Stage | Tokens | Cost @ $3/1M |
|---|---|---|
| Stage 1 (4 reviewers) | ~12,000 | ~$0.036 |
| Stage 2 (4 rankings) | ~8,000 | ~$0.024 |
| Stage 3 (chairman) | ~5,000 | ~$0.015 |
| Total per file | ~25,000 | ~$0.075 |

For a 10-file PR: ~$0.75 (the "quality tax")
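The figures follow directly from the token counts; a quick check of the arithmetic:

```python
PRICE_PER_TOKEN = 3.00 / 1_000_000  # $3 per 1M tokens

stage_tokens = {"reviews": 12_000, "rankings": 8_000, "chairman": 5_000}
per_file = sum(stage_tokens.values()) * PRICE_PER_TOKEN  # 25,000 tokens
per_pr = per_file * 10                                   # 10-file PR

print(f"${per_file:.3f} per file, ${per_pr:.2f} per PR")
```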


Consequences

Positive

  1. Defensible QA - Multi-reviewer consensus is auditable evidence
  2. Bias Reduction - Anonymous ranking prevents model favoritism
  3. Compliance Ready - Hash chains satisfy FDA 21 CFR Part 11
  4. Clear Decisions - Structured verdicts integrate with CI/CD
  5. Configurable - Choose reviewers, thresholds, compliance frameworks

Negative

  1. Higher Cost - 4-6x more tokens than single-agent review
  2. Latency - 3 sequential stages take longer
  3. Complexity - More moving parts than simple review

Mitigations

| Risk | Mitigation |
|---|---|
| Cost | Tiered review (quick pre-check for trivial changes) |
| Latency | Parallel Stage 1, streaming Stage 3 |
| Complexity | Encapsulated in skill, abstracted by command |

Alternatives Considered

1. Single-Agent Deep Review

Have one agent do comprehensive review with multiple passes.

Rejected: No consensus signal, no bias prevention, single point of failure.

2. Voting Without Ranking

Simple majority vote on approve/reject.

Rejected: Loses the nuance of which review is most valuable; ranking preserves that signal.

3. Human-in-the-Loop Always

Require human approval for all reviews.

Rejected: Defeats automation purpose. Council provides human-level confidence for most cases, escalates only when uncertain.

4. External Review Services

Use third-party code review tools (CodeClimate, SonarQube).

Rejected: Not LLM-based, can't do semantic analysis. Complementary, not replacement.


Usage Examples

Standard Code Review

/council-review src/auth/login.rs --reviewers security,testing

Compliance-Critical Review

/council-review src/patient_records/ \
  --compliance hipaa \
  --audit \
  --threshold 0.8 \
  --recursive

CI/CD Integration

# GitHub Actions
- name: Council Review
  run: |
    claude "/council-review src/ --format ci" > review.json
    if jq -e '.decision != "approve"' review.json; then
      exit 1
    fi
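The `--format ci` check above assumes a machine-readable verdict. A hypothetical shape for `review.json` (field names are illustrative; only `.decision` is relied on by the jq filter):

```json
{
  "decision": "request_changes",
  "aggregate_score": 0.64,
  "consensus": 0.71,
  "findings": { "critical": 0, "high": 4, "medium": 2 },
  "chain_hash": "sha256:..."
}
```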

Comparison with Existing Components

| Component | Pattern | Best For |
|---|---|---|
| /council-review | Multi-agent consensus | Compliance-critical, high-stakes |
| /code-review | Single agent | Quick feedback, low risk |
| orchestrator-code-review | ADR compliance | CODITECT v4 standards |
| uncertainty-orchestrator | MoE judges | Research, not code review |

Key Differentiator: Council pattern adds anonymous peer ranking and consensus scoring. Others lack this bias prevention mechanism.


References

  • ADR-014-COMPONENT-CAPABILITY-SCHEMA - How council components are indexed
  • ADR-015-SELF-AWARENESS-FRAMEWORK - How orchestrators discover council
  • ADR-012-MOE-ANALYSIS-FRAMEWORK - Related pattern for research (not code)


Approved: 2025-12-20
Review Date: 2026-03-20