Council Review Skill

When to Use This Skill

Use this skill when implementing council review patterns in your codebase.

How to Use This Skill

  1. Review the patterns and examples below
  2. Apply the relevant patterns to your implementation
  3. Follow the best practices outlined in this skill

This skill provides a multi-agent code review pattern with anonymized peer evaluation, consensus scoring, and structured verdicts. It is based on the LLM Council pattern, adapted for enterprise compliance requirements.

Overview

The Council Review pattern implements a 3-stage review pipeline:

  1. Stage 1: Parallel Specialized Review - Multiple domain experts review independently
  2. Stage 2: Anonymous Cross-Evaluation - Reviewers rank each other's work (identities hidden)
  3. Stage 3: Chairman Synthesis - Final verdict with merge decision
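The three stages above can be sketched end to end. This is an illustrative outline, not the skill's actual implementation: `run_review` and `rank_peers` are stand-in stubs, and the final verdict is hard-coded where the real pipeline would apply the decision thresholds.

```python
import random

def run_review(reviewer: str, artifact: str) -> str:
    """Stage 1 stub: a specialist produces findings for the artifact."""
    return f"{reviewer} findings for {artifact}"

def rank_peers(ranker_label: str, anonymized: dict) -> list:
    """Stage 2 stub: rank the other labels (here, alphabetically)."""
    return sorted(label for label in anonymized if label != ranker_label)

def council_review(artifact: str, reviewers: list) -> dict:
    # Stage 1: independent specialized reviews (parallel in a real pipeline)
    reviews = {r: run_review(r, artifact) for r in reviewers}

    # Stage 2: anonymize identities, then collect peer rankings
    labels = random.sample(
        ['Alpha', 'Beta', 'Gamma', 'Delta', 'Epsilon'][:len(reviews)], len(reviews)
    )
    label_for = dict(zip(reviews, labels))
    anonymized = {label_for[r]: body for r, body in reviews.items()}
    rankings = {lbl: rank_peers(lbl, anonymized) for lbl in anonymized}

    # Stage 3: chairman synthesizes a verdict from de-anonymized context
    return {
        "decision": "approve",  # stub: real logic applies the decision thresholds
        "rankings": rankings,
        "label_mapping": {v: k for k, v in label_for.items()},
    }
```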

Key Differentiators

| Feature | Standard Review | Council Review |
| --- | --- | --- |
| Reviewers | Single agent | 4-6 specialized agents |
| Bias Prevention | None | Anonymous peer ranking |
| Consensus Signal | None | Kendall's W coefficient |
| Decision Output | Comments | Structured verdict (approve/changes/reject) |
| Audit Trail | Basic logs | Hash-chained compliance records |

Core Components

1. Anonymization Pattern

```python
import random
from typing import Dict, Tuple

# Prevent model bias by hiding reviewer identities during cross-evaluation
NEUTRAL_LABELS = ['Alpha', 'Beta', 'Gamma', 'Delta', 'Epsilon']

def anonymize_reviews(reviews: Dict[str, "ReviewResult"]) -> Tuple[Dict, Dict]:
    """
    Convert reviewer IDs to neutral labels.

    Returns:
        anonymized_reviews: Dict[label, content]
        label_mapping: Dict[label, original_reviewer_id]
    """
    # Sampling all elements of the sliced list yields a random permutation
    shuffled_labels = random.sample(NEUTRAL_LABELS[:len(reviews)], len(reviews))
    label_mapping = {}
    anonymized = {}

    for (reviewer_id, review), label in zip(reviews.items(), shuffled_labels):
        label_mapping[label] = reviewer_id
        anonymized[label] = sanitize_content(review)  # strip identifying details

    return anonymized, label_mapping
```

`ReviewResult` and `sanitize_content` are defined elsewhere in the skill's implementation.

2. Ranking Aggregation

```python
from collections import defaultdict
from typing import Dict, List

def compute_consensus(rankings: Dict[str, List[str]]) -> float:
    """
    Compute Kendall's W (coefficient of concordance).

    Returns:
        0.0-1.0, where 1.0 = perfect agreement among rankers
    """
    n_items = len(next(iter(rankings.values())))
    n_rankers = len(rankings)

    # Sum each item's position across all rankings
    rank_sums = defaultdict(int)
    for ranker, ordered_labels in rankings.items():
        for position, label in enumerate(ordered_labels, start=1):
            rank_sums[label] += position

    # W = S / max(S), where S is the sum of squared deviations of the rank sums
    mean_rank_sum = sum(rank_sums.values()) / n_items
    ss = sum((rs - mean_rank_sum) ** 2 for rs in rank_sums.values())
    max_ss = (n_rankers ** 2 * (n_items ** 3 - n_items)) / 12

    return ss / max_ss if max_ss > 0 else 1.0
```
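As a sanity check, two rankers in perfect agreement over three items should yield W = 1.0. The computation can be done by hand with the same formula:

```python
from collections import defaultdict

# Worked example: two rankers who agree perfectly give W = 1.0
rankings = {"r1": ["Alpha", "Beta", "Gamma"], "r2": ["Alpha", "Beta", "Gamma"]}
n_rankers, n_items = len(rankings), 3

rank_sums = defaultdict(int)
for ordered in rankings.values():
    for position, label in enumerate(ordered, start=1):
        rank_sums[label] += position  # Alpha=2, Beta=4, Gamma=6

mean = sum(rank_sums.values()) / n_items                    # 4.0
ss = sum((rs - mean) ** 2 for rs in rank_sums.values())     # 8.0
max_ss = (n_rankers ** 2 * (n_items ** 3 - n_items)) / 12   # 8.0
print(ss / max_ss)  # 1.0
```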

3. Chairman Verdict Structure

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CouncilVerdict:
    """Structured output from council review."""
    decision: str                        # "approve" | "request_changes" | "reject"
    aggregate_score: float               # 0.0-1.0
    blocking_findings: List["Finding"]   # Finding is defined elsewhere
    consensus_level: float               # Kendall's W
    chairman_synthesis: str
    audit_hash: str                      # SHA-256 chain for compliance

    def to_ci_status(self) -> str:
        """Convert to a GitHub/GitLab CI status."""
        return {
            "approve": "success",
            "request_changes": "pending",
            "reject": "failure",
        }[self.decision]
```

Invocation Patterns

Via Task Tool

```python
# Full council review
Task(
    subagent_type="council-orchestrator",
    prompt="""
    Execute council review on src/auth/jwt_handler.rs with:
    - Security reviewer focus on OWASP Top 10
    - Compliance reviewer focus on HIPAA
    - Performance reviewer focus on latency
    Require consensus >= 0.6 for approval
    """
)
```

Via Command

```
/council-review src/api/handlers.rs --reviewers security,compliance,performance --threshold 0.7
```

Reviewer Types

SecurityReviewer

Focus: Injection vulnerabilities, authentication issues, data exposure, cryptographic weaknesses

```yaml
evaluation_rubric:
  injection_vulnerabilities: 0.25
  authentication_issues: 0.20
  data_exposure: 0.20
  cryptographic_weaknesses: 0.15
  input_validation: 0.10
  dependency_risks: 0.10
```
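The rubric weights sum to 1.0, so a reviewer's overall score can be computed as a weighted sum of per-criterion scores. A minimal sketch, assuming each criterion is scored 0.0-1.0 (the helper name and scoring scale are illustrative, not part of the skill's API):

```python
SECURITY_RUBRIC = {
    "injection_vulnerabilities": 0.25,
    "authentication_issues": 0.20,
    "data_exposure": 0.20,
    "cryptographic_weaknesses": 0.15,
    "input_validation": 0.10,
    "dependency_risks": 0.10,
}

def weighted_score(criterion_scores: dict, rubric: dict) -> float:
    """Combine per-criterion scores (0.0-1.0) using the rubric weights."""
    assert abs(sum(rubric.values()) - 1.0) < 1e-9, "weights must sum to 1.0"
    return sum(rubric[c] * criterion_scores.get(c, 0.0) for c in rubric)

# e.g. strong everywhere except input validation
scores = {c: 1.0 for c in SECURITY_RUBRIC}
scores["input_validation"] = 0.5
print(round(weighted_score(scores, SECURITY_RUBRIC), 2))  # 0.95
```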

ComplianceReviewer

Focus: Data handling, audit logging, access control, encryption, retention

```yaml
evaluation_rubric:
  data_handling: 0.30
  audit_logging: 0.25
  access_control: 0.20
  encryption: 0.15
  retention: 0.10

frameworks:
  - HIPAA
  - SOC2
  - FDA_21_CFR_Part_11
```

PerformanceReviewer

Focus: Algorithmic complexity, memory usage, I/O efficiency, concurrency

```yaml
evaluation_rubric:
  algorithmic_complexity: 0.30
  memory_usage: 0.25
  io_efficiency: 0.20
  concurrency: 0.15
  caching: 0.10
```

TestCoverageReviewer

Focus: Test completeness, edge cases, mock quality, integration testing

```yaml
evaluation_rubric:
  coverage_percentage: 0.30
  edge_case_handling: 0.25
  mock_quality: 0.20
  integration_tests: 0.15
  test_maintainability: 0.10
```

Pipeline Workflow

```
┌─────────────────────────────────────────────────────────────┐
│ Stage 1: Parallel Specialized Reviews                       │
│ ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌─────────┐         │
│ │ Security │ │Compliance│ │Performance│ │ TestCov │         │
│ └────┬─────┘ └────┬─────┘ └─────┬─────┘ └────┬────┘         │
│      │            │             │            │              │
│      ▼            ▼             ▼            ▼              │
│  Findings     Findings      Findings     Findings           │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ Stage 2: Anonymous Cross-Evaluation                         │
│ Each reviewer ranks the others (Alpha, Beta, Gamma, Delta)  │
│ - Identities hidden to prevent model favoritism             │
│ - Rankings aggregated via Kendall's W                       │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ Stage 3: Chairman Synthesis                                 │
│ - Sees de-anonymized context                                │
│ - Applies hard thresholds                                   │
│ - Generates structured verdict                              │
│ - Signs audit record                                        │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ Output: CouncilVerdict                                      │
│ - decision: approve | request_changes | reject              │
│ - aggregate_score: 0.0-1.0                                  │
│ - consensus_level: Kendall's W                              │
│ - audit_hash: SHA-256 chain                                 │
└─────────────────────────────────────────────────────────────┘
```

Decision Thresholds

| Condition | Action |
| --- | --- |
| Any CRITICAL finding | REJECT |
| >3 HIGH findings | REQUEST_CHANGES |
| Aggregate score < 0.7 | REQUEST_CHANGES |
| Consensus < 0.5 + blocking findings | FLAG FOR HUMAN REVIEW |
| All pass + consensus >= 0.7 | APPROVE |
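The threshold table can be read as a short decision function. This is a sketch: the finding shape (`{"severity": ...}`) is an assumption for illustration, and the precedence between "any CRITICAL finding" and "low consensus on blocking findings" is an interpretation that escalates to a human before auto-rejecting.

```python
def decide(findings: list, aggregate_score: float, consensus: float) -> str:
    critical = [f for f in findings if f["severity"] == "CRITICAL"]
    high = [f for f in findings if f["severity"] == "HIGH"]
    blocking = critical  # blocking findings are the CRITICAL ones here

    if blocking and consensus < 0.5:
        return "flag_for_human_review"  # reviewers disagree on a blocker
    if critical:
        return "reject"
    if len(high) > 3 or aggregate_score < 0.7:
        return "request_changes"
    if consensus >= 0.7:
        return "approve"
    return "request_changes"  # clean, but consensus only in [0.5, 0.7)
```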

Compliance Integration

Audit Trail Structure

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict

@dataclass
class ComplianceAuditRecord:
    timestamp: datetime
    artifact_hash: str             # SHA-256 of the reviewed code
    stage1_hashes: Dict[str, str]  # reviewer -> response hash
    stage2_hashes: Dict[str, str]  # reviewer -> ranking hash
    chairman_hash: str
    verdict_hash: str
    chain_hash: str                # SHA-256 over all of the above

    def verify(self) -> bool:
        """Verify hash chain integrity for auditors."""
        expected = compute_chain_hash(
            self.artifact_hash,
            self.stage1_hashes,
            self.stage2_hashes,
            self.chairman_hash,
            self.verdict_hash,
        )
        return expected == self.chain_hash
```
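`compute_chain_hash` could be implemented as SHA-256 over a canonical serialization of the component hashes. A minimal sketch, assuming JSON with sorted keys as the canonical form (the real scheme may differ):

```python
import hashlib
import json

def compute_chain_hash(artifact_hash, stage1_hashes, stage2_hashes,
                       chairman_hash, verdict_hash) -> str:
    payload = json.dumps(
        {
            "artifact": artifact_hash,
            "stage1": stage1_hashes,
            "stage2": stage2_hashes,
            "chairman": chairman_hash,
            "verdict": verdict_hash,
        },
        sort_keys=True,  # canonical ordering so the hash is reproducible
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Any change to any component changes the chain hash
h1 = compute_chain_hash("a", {"sec": "x"}, {"sec": "y"}, "c", "v")
h2 = compute_chain_hash("a", {"sec": "x"}, {"sec": "y"}, "c", "v2")
assert h1 != h2 and len(h1) == 64
```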

Supported Frameworks

  • FDA 21 CFR Part 11 - Electronic signatures and audit trails
  • HIPAA - PHI handling and access controls
  • SOC2 - Security and availability controls
  • GDPR - Data protection compliance

Token Economics

| Stage | Tokens | Est. Cost |
| --- | --- | --- |
| Stage 1 (4 reviewers) | ~12,000 | ~$0.036 |
| Stage 2 (4 cross-evals) | ~8,000 | ~$0.024 |
| Stage 3 (chairman) | ~5,000 | ~$0.015 |
| Total per file | ~25,000 | ~$0.075 |

For a typical PR (10 files): ~$0.75

Related Resources

  • council-orchestrator agent - Coordinates the 3-stage pipeline
  • council-chairman agent - Synthesizes final verdicts
  • /council-review command - Entry point for council reviews
  • uncertainty-orchestrator agent - Alternative for research (not code review)

Version History

| Version | Date | Changes |
| --- | --- | --- |
| 1.0.0 | 2025-12-20 | Initial implementation based on LLM Council pattern |

Origin: Adapted from Andrej Karpathy's LLM Council pattern with enterprise hardening for CODITECT compliance requirements.

Last Updated: 2025-12-20


Success Output

When successful, this skill MUST output:

✅ SKILL COMPLETE: council-review

Completed:
- [x] Stage 1: {REVIEWER_COUNT} specialized reviews completed
- [x] Stage 2: Anonymous cross-evaluation rankings aggregated
- [x] Stage 3: Chairman synthesis and verdict generated
- [x] Audit trail hash chain created
- [x] Compliance report generated

Council Decision: {APPROVE|REQUEST_CHANGES|REJECT}

Verdict Details:
- Aggregate score: {SCORE} (0.0-1.0)
- Consensus level: {KENDALLS_W} (Kendall's W)
- Blocking findings: {BLOCKING_COUNT}
- Critical issues: {CRITICAL_COUNT}
- High issues: {HIGH_COUNT}

Reviewers:
- {REVIEWER_1}: Score {SCORE_1}/100
- {REVIEWER_2}: Score {SCORE_2}/100
- {REVIEWER_3}: Score {SCORE_3}/100
- {REVIEWER_4}: Score {SCORE_4}/100

Outputs:
- Council verdict: council-verdict-{TIMESTAMP}.json
- Audit record: audit-trail-{HASH}.json
- Compliance report: compliance-report-{TIMESTAMP}.md

Completion Checklist

Before marking this skill as complete, verify:

  • At least 3 specialized reviewers participated
  • All reviewers completed Stage 1 independent reviews
  • Identities anonymized in Stage 2 (Alpha, Beta, Gamma, Delta)
  • Cross-evaluation rankings collected from each reviewer
  • Kendall's W coefficient calculated (consensus metric)
  • Chairman reviewed de-anonymized context
  • Blocking findings identified (CRITICAL severity)
  • Decision threshold logic applied correctly
  • Structured verdict generated with all fields
  • Audit trail hash chain created and verified
  • CouncilVerdict JSON serialized to file
  • Compliance report generated (if required)

Failure Indicators

This skill has FAILED if:

  • ❌ Fewer than 3 reviewers participated (insufficient diversity)
  • ❌ Reviewer identities leaked during Stage 2 (bias risk)
  • ❌ Kendall's W not calculated (no consensus metric)
  • ❌ Blocking findings not surfaced in verdict
  • ❌ Decision threshold logic bypassed
  • ❌ Verdict missing required fields (decision, score, consensus)
  • ❌ Audit hash chain verification fails
  • ❌ Compliance requirements not met (HIPAA, SOC2, etc.)
  • ❌ Chairman synthesis contradicts reviewer consensus
  • ❌ Token cost exceeds budget (>30,000 tokens/file)

When NOT to Use

Do NOT use this skill when:

  • Simple code review - Use standard PR review for minor changes
  • Urgent hotfixes - Council review takes 15-30 minutes, too slow for emergencies
  • Non-critical code - Reserve for security-sensitive, compliance-required, or architectural changes
  • Low compliance requirements - Use single-agent review for internal tools
  • Budget constraints - Council review costs ~$0.75 per file (4-6 reviewers)
  • Early prototyping - Overkill for experimental code
  • Documentation-only PRs - Use doc review agent instead

Anti-Patterns (Avoid)

| Anti-Pattern | Problem | Solution |
| --- | --- | --- |
| Using for every PR | Noise, review fatigue, high cost | Reserve for critical paths (auth, payments, PHI) |
| Skipping anonymization | Model bias towards certain agents | Always anonymize in Stage 2 |
| Ignoring low consensus | Reviewers disagree, verdict unreliable | Flag for human review when W < 0.5 |
| Auto-approving high scores | Score doesn't capture blocking issues | Check blocking findings first |
| No audit trail | Compliance failure, no record | Always generate hash chain |
| Chairman overriding consensus | Defeats purpose of council | Chairman synthesizes, doesn't override |
| Same reviewers every time | Stale perspectives | Rotate reviewers based on code domain |

Principles

This skill embodies:

  • #5 Eliminate Ambiguity - Structured verdict with clear decision criteria
  • #6 Clear, Understandable, Explainable - Chairman synthesis explains reasoning
  • #8 No Assumptions - Multiple reviewers catch different issues
  • #10 Security First - Security reviewer mandatory for auth/crypto code
  • #11 Quality is Non-Negotiable - Multi-agent review ensures thorough analysis
  • #12 Transparent Documentation - Audit trail provides complete record
  • #15 Compliance is Essential - Built for FDA 21 CFR Part 11, HIPAA, SOC2

Full Standard: CODITECT-STANDARD-AUTOMATION.md