ADR-016: LLM Council Pattern
Status: Accepted
Date: 2025-12-20
Deciders: Hal Casteel, CODITECT Core Team
Categories: Architecture, Code Review, Multi-Agent, Consensus
Context
CODITECT serves regulated industries (healthcare, finance) that require:
- Defensible QA evidence - Audit trails for compliance
- Multi-perspective review - Different domain experts
- Bias prevention - Objective quality signals
- Structured decisions - Clear approve/reject with rationale
Inspiration: Andrej Karpathy released "LLM Council" (November 2025), demonstrating:
- Multiple LLMs reviewing the same query
- Anonymous peer ranking to prevent model bias
- Chairman synthesis for final answer
The Gap: Karpathy's implementation is a "weekend hack" lacking:
- Authentication/authorization
- Compliance audit trails
- Circuit breakers and retry logic
- PII redaction
- Electronic signatures
CODITECT needs the pattern with enterprise hardening.
Decision
Implement the LLM Council pattern as a 3-stage multi-agent code review pipeline with:
- Stage 1: Parallel Specialized Review - Domain experts review independently
- Stage 2: Anonymous Cross-Evaluation - Reviewers rank each other (identities hidden)
- Stage 3: Chairman Synthesis - Final verdict with structured decision
Architecture
┌─────────────────────────────────────────────────────────────┐
│ STAGE 1: Parallel Specialized Reviews │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Security │ │Compliance│ │Performance│ │Testing │ │
│ │Reviewer │ │Reviewer │ │Reviewer │ │Reviewer│ │
│ └────┬────┘ └────┬────┘ └────┬─────┘ └───┬────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ [Findings] [Findings] [Findings] [Findings] │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STAGE 2: Anonymous Cross-Evaluation │
│ │
│ Reviews anonymized: Security → Alpha, Compliance → Beta │
│ │
│ Each reviewer ranks others: │
│ Alpha ranks: [Beta, Gamma, Delta] │
│ Beta ranks: [Alpha, Gamma, Delta] │
│ ... │
│ │
│ Consensus calculated via Kendall's W │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STAGE 3: Chairman Synthesis │
│ │
│ - Sees de-anonymized reviews │
│ - Applies hard decision thresholds │
│ - Generates structured verdict │
│ - Signs audit record (hash chain) │
│ │
│ Output: APPROVE | REQUEST_CHANGES | REJECT │
└─────────────────────────────────────────────────────────────┘
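The three stages above can be sketched as a single control flow. This is a minimal, model-free illustration: `run_council`, the plain-callable reviewers, and the placeholder ranking are assumptions for exposition, not the CODITECT implementation (the real agents are LLM-backed and Stage 1 runs in parallel).

```python
import random

def run_council(code, reviewers, chairman):
    """reviewers: {name: review_fn}; chairman: fn(reviews, rankings) -> verdict."""
    # Stage 1: independent reviews (parallel in the real pipeline).
    reviews = {name: fn(code) for name, fn in reviewers.items()}
    # Stage 2: hide reviewer identities behind shuffled labels,
    # then each label ranks the other labels.
    labels = ["Alpha", "Beta", "Gamma", "Delta"][: len(reviews)]
    random.shuffle(labels)
    anonymized = dict(zip(labels, reviews.values()))
    rankings = {
        # Placeholder ranking; real reviewers rank by review quality.
        label: sorted(other for other in anonymized if other != label)
        for label in anonymized
    }
    # Stage 3: the chairman sees everything and issues the verdict.
    return chairman(reviews, rankings)
```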
Components Implemented
| Component | Type | Purpose |
|---|---|---|
| council-review | Skill | Core pattern: anonymization, ranking, consensus |
| council-orchestrator | Agent | Coordinates 3-stage pipeline |
| council-chairman | Agent | Synthesizes verdicts with compliance |
| /council-review | Command | Entry point for users |
Decision Thresholds
| Condition | Decision | Rationale |
|---|---|---|
| Critical findings > 0 | REJECT | Zero tolerance for critical |
| High findings > 3 | REQUEST_CHANGES | Too many significant issues |
| Aggregate score < 0.70 | REQUEST_CHANGES | Below quality threshold |
| Consensus < 0.50 with a blocking finding | HUMAN_REVIEW | Disagreement signals uncertainty |
| All pass | APPROVE | Quality requirements met |
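The thresholds above reduce to a deterministic decision function evaluated in order of severity. A minimal sketch; the `ReviewSummary` fields and `decide` helper are illustrative names, not the production API:

```python
from dataclasses import dataclass

@dataclass
class ReviewSummary:
    """Hypothetical aggregate of council output, for illustration only."""
    critical_findings: int
    high_findings: int
    aggregate_score: float    # 0.0-1.0 quality score
    consensus: float          # Kendall's W from Stage 2
    has_blocking_finding: bool

def decide(s: ReviewSummary) -> str:
    """Apply the hard thresholds, most severe first."""
    if s.critical_findings > 0:
        return "REJECT"
    if s.high_findings > 3:
        return "REQUEST_CHANGES"
    if s.aggregate_score < 0.70:
        return "REQUEST_CHANGES"
    if s.consensus < 0.50 and s.has_blocking_finding:
        return "HUMAN_REVIEW"
    return "APPROVE"
```

Because the checks are ordered, a critical finding always rejects regardless of score or consensus, matching the override-protection requirement.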
Consensus Calculation
Using Kendall's W (coefficient of concordance):
```python
from collections import defaultdict
from typing import Dict, List

def compute_consensus(rankings: Dict[str, List[str]]) -> float:
    """
    Returns 0.0-1.0 where:
    - 1.0 = Perfect agreement (all rank identically)
    - 0.7+ = Good agreement
    - 0.5-0.7 = Moderate agreement
    - <0.5 = Low agreement (flag for human review)
    """
    n_items = len(next(iter(rankings.values())))
    n_rankers = len(rankings)
    rank_sums = defaultdict(int)
    for ordered_labels in rankings.values():
        for position, label in enumerate(ordered_labels, start=1):
            rank_sums[label] += position
    mean_rank_sum = sum(rank_sums.values()) / n_items
    ss = sum((rs - mean_rank_sum) ** 2 for rs in rank_sums.values())
    max_ss = (n_rankers ** 2 * (n_items ** 3 - n_items)) / 12
    return ss / max_ss if max_ss > 0 else 1.0
```
Audit Trail
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict

@dataclass
class ComplianceAuditRecord:
    timestamp: datetime
    artifact_hash: str             # SHA256 of reviewed code
    stage1_hashes: Dict[str, str]  # reviewer → response hash
    stage2_hashes: Dict[str, str]  # reviewer → ranking hash
    chairman_hash: str
    verdict_hash: str
    chain_hash: str                # SHA256 of all above

    def verify(self) -> bool:
        """Verify hash chain integrity for auditors."""
        expected = compute_chain_hash(...)
        return expected == self.chain_hash
```
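The `chain_hash` can be derived from a canonical encoding of the other fields, so auditors can recompute and compare it. A minimal sketch of such a `compute_chain_hash`; this is an illustrative stand-in, not the production implementation:

```python
import hashlib
import json

def compute_chain_hash(artifact_hash, stage1_hashes, stage2_hashes,
                       chairman_hash, verdict_hash):
    """SHA256 over a canonical JSON encoding of all prior hashes.

    sort_keys makes the encoding deterministic, so the same record
    always yields the same chain hash on re-verification.
    """
    payload = json.dumps({
        "artifact": artifact_hash,
        "stage1": stage1_hashes,
        "stage2": stage2_hashes,
        "chairman": chairman_hash,
        "verdict": verdict_hash,
    }, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Any tampering with a stored review or verdict changes its hash, which in turn changes the chain hash and fails `verify()`.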
Rationale
Why Anonymous Cross-Evaluation?
Models have known biases toward their own "family" of responses. GPT-4 tends to prefer GPT-4-style outputs. By hiding model identity during ranking, we get more objective quality signals.
Research Support: Karpathy's experiments showed that anonymization reduces self-preference bias by ~15%.
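The anonymization step amounts to a shuffled relabeling whose mapping is withheld until Stage 3. A minimal sketch, assuming a fixed label pool; `anonymize_reviews` is an illustrative name, not the actual skill API:

```python
import random

LABELS = ["Alpha", "Beta", "Gamma", "Delta"]

def anonymize_reviews(reviews: dict) -> tuple:
    """Map reviewer identities to shuffled neutral labels.

    Returns (anonymized_reviews, label_to_reviewer). The mapping is
    kept secret during Stage 2 and revealed to the chairman in Stage 3.
    """
    labels = LABELS[: len(reviews)]
    random.shuffle(labels)
    label_to_reviewer = {}
    anonymized = {}
    for label, (reviewer, text) in zip(labels, reviews.items()):
        label_to_reviewer[label] = reviewer
        anonymized[label] = text
    return anonymized, label_to_reviewer
```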
Why 3 Stages (Not 2 or 4)?
| Stages | Description | Tradeoff |
|---|---|---|
| 1 | Just reviews | No cross-validation |
| 2 | Reviews + synthesis | No peer ranking |
| 3 | Reviews + ranking + synthesis | Optimal balance |
| 4+ | Additional stages | Diminishing returns, higher cost |
Decision: 3 stages provides peer validation without excessive token cost.
Why Kendall's W for Consensus?
| Metric | Use Case | Limitations |
|---|---|---|
| Simple majority | Binary decisions | Ignores ranking order |
| Fleiss' Kappa | Category agreement | Not designed for rankings |
| Kendall's W | Ranking agreement | Purpose-built for this |
Decision: Kendall's W is the standard statistical measure for inter-rater agreement on rankings.
Why Hard Thresholds?
Regulated industries require deterministic decisions. A model saying "probably approve" isn't auditable. Hard thresholds provide:
- Predictability - Same inputs → same decision
- Auditability - Can explain why decision was made
- Override Protection - Critical issues can't be synthesized away
Cost Analysis
| Stage | Tokens | Cost @ $3/1M |
|---|---|---|
| Stage 1 (4 reviewers) | ~12,000 | ~$0.036 |
| Stage 2 (4 rankings) | ~8,000 | ~$0.024 |
| Stage 3 (chairman) | ~5,000 | ~$0.015 |
| Total per file | ~25,000 | ~$0.075 |
For a 10-file PR: ~$0.75 (the "quality tax")
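The per-file figure follows directly from the token counts at the assumed $3-per-million-token rate; a quick check:

```python
RATE_PER_TOKEN = 3 / 1_000_000  # assumed blended rate: $3 per 1M tokens

stage_tokens = {"stage1_reviews": 12_000, "stage2_rankings": 8_000, "stage3_chairman": 5_000}
per_file = sum(stage_tokens.values()) * RATE_PER_TOKEN  # 25,000 tokens -> $0.075
per_pr = 10 * per_file                                  # 10-file PR -> $0.75
```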
Consequences
Positive
- Defensible QA - Multi-reviewer consensus is auditable evidence
- Bias Reduction - Anonymous ranking prevents model favoritism
- Compliance Ready - Hash chains satisfy FDA 21 CFR Part 11
- Clear Decisions - Structured verdicts integrate with CI/CD
- Configurable - Choose reviewers, thresholds, compliance frameworks
Negative
- Higher Cost - 4-6x more tokens than single-agent review
- Latency - 3 sequential stages take longer
- Complexity - More moving parts than simple review
Mitigations
| Risk | Mitigation |
|---|---|
| Cost | Tiered review (quick pre-check for trivial changes) |
| Latency | Parallel Stage 1, streaming Stage 3 |
| Complexity | Encapsulated in skill, abstracted by command |
Alternatives Considered
1. Single-Agent Deep Review
Have one agent do comprehensive review with multiple passes.
Rejected: No consensus signal, no bias prevention, single point of failure.
2. Voting Without Ranking
Simple majority vote on approve/reject.
Rejected: Loses the nuance of which review is most valuable; ranking preserves that signal.
3. Human-in-the-Loop Always
Require human approval for all reviews.
Rejected: Defeats the purpose of automation. The council provides human-level confidence for most cases and escalates only when uncertain.
4. External Review Services
Use third-party code review tools (CodeClimate, SonarQube).
Rejected: Not LLM-based, can't do semantic analysis. Complementary, not replacement.
Usage Examples
Standard Code Review
```bash
/council-review src/auth/login.rs --reviewers security,testing
```
Compliance-Critical Review
```bash
/council-review src/patient_records/ \
  --compliance hipaa \
  --audit \
  --threshold 0.8 \
  --recursive
```
CI/CD Integration
```yaml
# GitHub Actions
- name: Council Review
  run: |
    claude "/council-review src/ --format ci" > review.json
    if jq -e '.decision != "approve"' review.json; then
      exit 1
    fi
```
Comparison with Existing Components
| Component | Pattern | Best For |
|---|---|---|
| /council-review | Multi-agent consensus | Compliance-critical, high-stakes |
| /code-review | Single agent | Quick feedback, low risk |
| orchestrator-code-review | ADR compliance | CODITECT v4 standards |
| uncertainty-orchestrator | MoE judges | Research, not code review |
Key Differentiator: Council pattern adds anonymous peer ranking and consensus scoring. Others lack this bias prevention mechanism.
Related ADRs
- ADR-014-COMPONENT-CAPABILITY-SCHEMA - How council components are indexed
- ADR-015-SELF-AWARENESS-FRAMEWORK - How orchestrators discover council
- ADR-012-MOE-ANALYSIS-FRAMEWORK - Related pattern for research (not code)
References
- Karpathy's LLM Council - Original inspiration
- VentureBeat Analysis - Enterprise gap analysis
- docs/09-research-analysis/llm-council-pattern/ - Full research docs
- skills/council-review/SKILL.md - Implementation
Approved: 2025-12-20 Review Date: 2026-03-20