ADR-004: Scaling Model for Agent Selection

Status

Proposed

Context

CODITECT currently selects agents and orchestration patterns using heuristic-based logic:

  • MoE classifier routes to "best" agent based on skill similarity scores
  • Multi-agent invocations use fixed counts (e.g., always 3 agents for peer review)
  • No quantitative feedback on whether adding more agents improves or degrades performance
  • Coordination overhead not measured or considered in selection

Brainqub3 Agent Labs provides a paper-aligned scaling model (arXiv:2512.08296) that predicts:

Performance Scaling:

P_hat = clip(beta0 + sum(beta_i*z_i) + sum(beta_ij*z_i*z_j), 0, 1)

Where:

  • P_hat = predicted performance (0-1 scale)
  • z_i = standardized task/architecture features
  • beta_i = feature coefficients (learned from experiments)
  • beta_ij = interaction terms (e.g., agent_count × task_complexity)
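
The equation above can be sketched directly in code. This is a hypothetical illustration of the formula, not the Agent Labs API; the function name and the dict-based feature/coefficient representation are assumptions for clarity.

```python
def predict_performance(z, beta0, beta, beta_int):
    """P_hat = clip(beta0 + sum(beta_i*z_i) + sum(beta_ij*z_i*z_j), 0, 1).

    z        -- standardized features, e.g. {"agent_count": 0.5, "complexity": 1.2}
    beta     -- per-feature coefficients, keyed like z
    beta_int -- interaction coefficients, keyed by (feature_i, feature_j) tuples
    """
    linear = sum(beta[name] * value for name, value in z.items())
    interactions = sum(coef * z[i] * z[j] for (i, j), coef in beta_int.items())
    raw = beta0 + linear + interactions
    return max(0.0, min(1.0, raw))  # clip to the [0, 1] performance scale
```

With the illustrative code_analysis coefficients from the storage schema below, `predict_performance({"agent_count": 1.0, "complexity": 0.5}, 0.45, {"agent_count": 0.12, "complexity": 0.38}, {("agent_count", "complexity"): -0.08})` evaluates to 0.72.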

Elasticity Estimation:

x_hat = clamp(x_base * (n/n0)^eta_n * (T/T0)^eta_T, lo, hi)

Where:

  • x_hat = predicted value (performance, cost, time, etc.), clamped to a valid range [lo, hi]
  • n = agent count, T = task complexity
  • n0, T0 = baseline (reference) agent count and task complexity
  • eta_n, eta_T = elasticity coefficients
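
A minimal sketch of the elasticity equation, with the same caveat: the function name, default baselines (n0 = 1, T0 = 1.0), and clamp bounds are assumptions, not the Agent Labs implementation.

```python
def predict_elasticity(x_base, n, T, eta_n, eta_T, n0=1, T0=1.0, lo=0.0, hi=float("inf")):
    """x_hat = clamp(x_base * (n/n0)^eta_n * (T/T0)^eta_T, lo, hi).

    Power-law scaling: eta_n (or eta_T) is the percentage change in x
    per percentage change in agent count (or task complexity).
    """
    x_hat = x_base * (n / n0) ** eta_n * (T / T0) ** eta_T
    return max(lo, min(hi, x_hat))  # clamp to the metric's valid range
```

For example, with eta_n = 0.5 a 4x increase in agent count only doubles the predicted value: `predict_elasticity(0.5, 4, 1.0, 0.5, 1.0)` returns 1.0, showing diminishing returns.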

Coordination Metrics (5):

  1. Overhead %: Time spent coordinating vs productive work
  2. Message Density: Messages per agent per task
  3. Redundancy: Duplicate work across agents
  4. Efficiency: Output quality per unit coordination cost
  5. Error Amplification: How errors compound with agent count
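
The first three metrics are simple ratios over run telemetry. The sketch below shows one plausible way to compute them; the field names and denominators are assumptions (the paper may define them differently).

```python
from dataclasses import dataclass

@dataclass
class CoordinationMetrics:
    coordination_seconds: float  # time spent in messaging/synchronization
    productive_seconds: float    # time spent producing output
    messages: int                # total inter-agent messages
    agents: int
    tasks: int
    duplicate_outputs: int       # outputs judged redundant with another agent's
    total_outputs: int

    @property
    def overhead_pct(self) -> float:
        total = self.coordination_seconds + self.productive_seconds
        return 100.0 * self.coordination_seconds / total if total else 0.0

    @property
    def message_density(self) -> float:
        return self.messages / (self.agents * self.tasks)

    @property
    def redundancy(self) -> float:
        return self.duplicate_outputs / self.total_outputs if self.total_outputs else 0.0
```

A run with 20s of coordination against 80s of productive work yields overhead_pct = 20.0; 30 messages across 3 agents and 2 tasks yields message_density = 5.0.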

The question: Should CODITECT use Agent Labs scaling predictions to dynamically select agent count and architecture at runtime, rather than using static configurations?

Example scenarios where this matters:

  • Simple task → Scaling model might predict coordination overhead > benefit for multi-agent
  • Complex task → Model might recommend 5 agents (decentralised) over 3 (centralised)
  • High-redundancy detection → Switch from Independent to Centralised to reduce waste

Decision

Use Agent Labs scaling model predictions to inform CODITECT agent selection decisions, with the following design:

1. Two-Phase Approach

Phase 1: Offline Calibration (Manual)

  • Run Agent Labs experiments on representative CODITECT tasks
  • Generate scaling curves for each (task_type, architecture) pair
  • Store coefficients (beta, eta) in ~/.coditect-data/scaling-models/
  • Update periodically (e.g., quarterly or when new agent types added)

Phase 2: Runtime Prediction (Automated)

  • Before invoking multi-agent orchestration, query scaling model
  • Predict performance for candidate configurations (1 agent, 3 agents, 5 agents; centralised vs decentralised)
  • Select configuration with best predicted efficiency (quality / coordination_cost)
  • Fall back to heuristic if no calibration data available

2. Integration Points

MoE Router Enhancement:

# Current: always select single best agent
best_agent = moe_router.select(task)

# Enhanced: compare single-agent vs multi-agent predictions
sas_prediction = scaling_model.predict(task, architecture="sas", agents=1)
mas_prediction = scaling_model.predict(task, architecture="centralised", agents=3)

if mas_prediction.efficiency > sas_prediction.efficiency * 1.2:  # 20% threshold
    agents = moe_router.select_top_k(task, k=3)
else:
    agents = [best_agent]

Architecture Selection:

# For known multi-agent tasks, compare architectures
predictions = {
    "centralised": scaling_model.predict(task, architecture="centralised", agents=3),
    "decentralised": scaling_model.predict(task, architecture="decentralised", agents=3),
    "hybrid": scaling_model.predict(task, architecture="hybrid", agents=3),
}
best_arch = max(predictions.items(), key=lambda x: x[1].efficiency)[0]

3. Coordination Collapse Detection

Trigger re-evaluation if coordination metrics exceed thresholds:

  • overhead_pct > 40% → Reduce agent count or switch to SAS
  • message_density > 10 → Switch from Decentralised to Centralised
  • redundancy > 0.5 → Consolidate agents or use Independent
  • error_amplification > 1.2 → Reduce agent count (errors compounding)
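
The threshold rules above can be collected into a single check that maps breached metrics to the mitigations listed. This is an illustrative sketch; the function name and dict shapes are assumptions, though the default thresholds match the storage schema later in this ADR.

```python
def check_collapse(metrics: dict, thresholds: dict) -> list:
    """Return recommended mitigations for each coordination threshold breached."""
    actions = []
    if metrics["overhead_pct"] > thresholds.get("overhead_pct_max", 40.0):
        actions.append("reduce agent count or switch to SAS")
    if metrics["message_density"] > thresholds.get("message_density_max", 10.0):
        actions.append("switch from decentralised to centralised")
    if metrics["redundancy"] > thresholds.get("redundancy_max", 0.5):
        actions.append("consolidate agents or use independent")
    if metrics.get("error_amplification", 0.0) > thresholds.get("error_amplification_max", 1.2):
        actions.append("reduce agent count (errors compounding)")
    return actions
```

An empty return list means no re-evaluation is triggered; otherwise the orchestrator (or the shadow-mode logger during rollout) acts on each recommendation.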

4. Guardrails

  • Always support manual override: --agents 5 --architecture hybrid ignores predictions
  • Gradual rollout: Start with logging predictions only (no action), validate accuracy
  • Fallback to heuristic: If scaling model unavailable or low confidence (<0.7), use existing logic
  • Cost limits: Never exceed user-configured max agent count or API budget

Alternatives Considered

1. Static Configuration (Status Quo)

Pros:

  • Simple, predictable
  • No runtime overhead
  • Proven to work

Cons:

  • Suboptimal for many tasks (over/under-provision agents)
  • No adaptation to task complexity
  • Coordination overhead invisible
  • Cannot detect coordination collapse

2. Heuristic-Based Scaling (No Model)

Pros:

  • Lightweight, fast
  • No calibration needed

Cons:

  • Rules brittle (e.g., "if task_tokens > 1000, use 3 agents")
  • No quantitative basis
  • Cannot predict efficiency vs overhead tradeoff
  • Misses interaction effects (agent_count × complexity)

3. Reinforcement Learning Agent Selector

Pros:

  • Could learn optimal policies end-to-end
  • Adapts continuously

Cons:

  • Requires large training dataset (hundreds of runs)
  • Black box (no interpretability)
  • High maintenance (model drift, retraining)
  • Violates CODITECT principle of explainability

4. Always Use Maximum Agents (Brute Force)

Pros:

  • Maximizes chance of success

Cons:

  • Extremely expensive (10x API cost for 10 agents vs 1)
  • High coordination overhead (diminishing returns)
  • Slower (more latency)
  • Violates cost optimization goals

5. User-Configured Policies (No Automation)

Pros:

  • User has full control
  • No surprises

Cons:

  • High cognitive load (users must understand scaling dynamics)
  • Requires expertise in coordination theory
  • Error-prone (easy to misconfigure)

Consequences

Positive

  1. Data-Driven Decisions: Agent selection based on empirical evidence, not guesswork
  2. Cost Optimization: Avoid over-provisioning agents when coordination overhead > benefit
  3. Performance Improvement: Select architectures that maximize efficiency for specific task types
  4. Coordination Collapse Prevention: Detect and mitigate when adding agents hurts performance
  5. Transparent Tradeoffs: Users see predicted performance vs cost before execution
  6. Continuous Improvement: Scaling model improves as more calibration runs accumulate
  7. Adaptability: Responds to task complexity variations automatically
  8. Explainability: Model coefficients and predictions are interpretable (vs black-box ML)

Negative

  1. Calibration Overhead: Requires running experiments to generate scaling curves
  2. Model Accuracy Dependency: Predictions only useful if model is well-calibrated
  3. Runtime Latency: Querying scaling model adds <100ms to invocation time
  4. Complexity Increase: More moving parts in agent selection logic
  5. Potential Misclassification: Incorrect task type classification → wrong scaling curve
  6. Storage Requirements: Must persist scaling coefficients for each (task_type, architecture) pair
  7. Maintenance Burden: Scaling model needs periodic recalibration as agents evolve

Risks

  1. Model Overfitting: Scaling curves trained on narrow task set don't generalize

    • Mitigation: Use diverse calibration tasks; validate on held-out tasks; fallback to heuristic if prediction confidence low
  2. Prediction Errors Lead to Poor Performance: Model recommends 5 agents, coordination collapse occurs

    • Mitigation: Monitor actual coordination metrics; trigger re-evaluation if thresholds exceeded; allow manual override
  3. Calibration Cost: Running 100s of experiments expensive in API tokens

    • Mitigation: Use small-scale tasks for calibration; apply findings to high-volume production tasks
  4. Stale Coefficients: CODITECT agents improve, but scaling model not recalibrated

    • Mitigation: Version scaling models alongside agent releases; flag "stale" warnings after 90 days
  5. Task Type Explosion: Every new task type requires separate calibration

    • Mitigation: Group similar tasks (e.g., "code_analysis", "document_generation"); use task embeddings for clustering
  6. User Confusion: Predictions contradict user intuition, eroding trust

    • Mitigation: Explain predictions ("3 agents predicted 20% faster with 15% lower cost"); log decision rationale

Implementation Notes

1. Calibration Workflow

# Define calibration task set
cat > calibration_tasks.yaml <<EOF
tasks:
  - id: code_review_simple
    type: code_analysis
    complexity: low
    description: "Review 50-line Python function"

  - id: code_review_complex
    type: code_analysis
    complexity: high
    description: "Review 500-line distributed system module"

  - id: doc_generation_api
    type: documentation
    complexity: medium
    description: "Generate API reference from OpenAPI spec"
EOF

# Run experiments (all architectures, agent counts 1-7)
/scaling-calibrate \
  --tasks calibration_tasks.yaml \
  --architectures sas,independent,centralised,decentralised,hybrid \
  --agent-counts 1,2,3,5,7 \
  --output ~/.coditect-data/scaling-models/calibration-2026-02-16/

# Train scaling model (fit coefficients)
/scaling-train \
  --experiments ~/.coditect-data/scaling-models/calibration-2026-02-16/ \
  --output ~/.coditect-data/scaling-models/coditect-v1.json

# Validate predictions
/scaling-validate \
  --model ~/.coditect-data/scaling-models/coditect-v1.json \
  --holdout-tasks validation_tasks.yaml

2. Runtime Prediction API

# scripts/scaling-analysis/predictor.py

import json
from pathlib import Path
from typing import Dict, Tuple


class ScalingPredictor:
    def __init__(self, model_path: Path):
        with open(model_path) as f:
            self.model = json.load(f)
        self.coefficients = self.model["coefficients"]
        self.elasticities = self.model["elasticities"]

    def predict(
        self,
        task_type: str,
        architecture: str,
        agent_count: int,
    ) -> Dict[str, float]:
        """
        Returns:
            {
                "performance": 0.85,   # 0-1 scale
                "overhead_pct": 25.0,
                "efficiency": 3.4,
                "confidence": 0.82
            }
        """
        # Implementation: apply scaling law equations
        pass

    def recommend_config(self, task_type: str) -> Tuple[str, int]:
        """Returns (architecture, agent_count) with best efficiency."""
        best_config = None
        best_efficiency = 0.0

        for arch in ["sas", "independent", "centralised", "decentralised", "hybrid"]:
            for n in [1, 2, 3, 5, 7]:
                pred = self.predict(task_type, arch, n)
                if pred["efficiency"] > best_efficiency:
                    best_efficiency = pred["efficiency"]
                    best_config = (arch, n)

        return best_config

3. MoE Integration

# Enhance existing MoE router
from pathlib import Path

from scripts.scaling_analysis.predictor import ScalingPredictor

# expanduser() is required so "~" resolves to the user's home directory
predictor = ScalingPredictor(
    Path("~/.coditect-data/scaling-models/coditect-v1.json").expanduser()
)

def select_agents(task: str, task_type: str = "code_analysis"):
    # Get recommendation
    recommended_arch, recommended_count = predictor.recommend_config(task_type)

    # Predict performance for recommended config vs SAS baseline
    mas_pred = predictor.predict(task_type, recommended_arch, recommended_count)
    sas_pred = predictor.predict(task_type, "sas", 1)

    # Log predictions
    print(f"[scaling] SAS: perf={sas_pred['performance']:.2f}")
    print(
        f"[scaling] {recommended_arch} (n={recommended_count}): "
        f"perf={mas_pred['performance']:.2f}, overhead={mas_pred['overhead_pct']:.0f}%"
    )

    # Decide based on efficiency gain threshold
    if mas_pred["efficiency"] > sas_pred["efficiency"] * 1.2:
        # Use multi-agent
        agents = moe_router.select_top_k(task, k=recommended_count)
        return agents, recommended_arch
    else:
        # Use single agent
        return [moe_router.select(task)], "sas"

4. Storage Schema

// ~/.coditect-data/scaling-models/coditect-v1.json
{
  "version": "1.0",
  "created": "2026-02-16T10:30:00Z",
  "calibration_run_count": 245,
  "task_types": ["code_analysis", "documentation", "testing"],
  "coefficients": {
    "code_analysis": {
      "beta0": 0.45,
      "beta_agent_count": 0.12,
      "beta_complexity": 0.38,
      "beta_agent_count_x_complexity": -0.08
    }
  },
  "elasticities": {
    "code_analysis": {
      "eta_agent_count": 0.65,
      "eta_task_complexity": 1.12
    }
  },
  "thresholds": {
    "overhead_pct_max": 40.0,
    "message_density_max": 10.0,
    "redundancy_max": 0.5
  }
}
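
The "Stale Coefficients" mitigation (flag warnings after 90 days) can hang off the "created" field in this schema. A possible loader sketch; `load_model` and its warning format are hypothetical, not part of the predictor above.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

def load_model(path: Path, max_age_days: int = 90) -> dict:
    """Load a scaling model file and warn when it is older than max_age_days."""
    model = json.loads(path.read_text())
    # "created" is an ISO-8601 UTC timestamp; normalize the trailing "Z"
    created = datetime.fromisoformat(model["created"].replace("Z", "+00:00"))
    if datetime.now(timezone.utc) - created > timedelta(days=max_age_days):
        print(f"[scaling] WARNING: model {model['version']} is stale "
              f"(created {model['created']}); recalibration recommended")
    return model
```

Versioning the model file alongside agent releases (per the mitigation list) means this check can also compare `model["version"]` against the running agent version.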

Rollout Plan

Stage 1 (Weeks 1-2): Calibration

  • Run 200+ experiments on diverse CODITECT tasks
  • Generate initial scaling model v1.0
  • Validate predictions on held-out tasks (target: R² > 0.7)

Stage 2 (Weeks 3-4): Shadow Mode

  • Integrate predictor into MoE router
  • Log predictions only (no action)
  • Compare predictions vs actual performance
  • Tune thresholds and confidence levels

Stage 3 (Weeks 5-6): Gradual Activation

  • Enable predictions for low-risk tasks (documentation, simple code analysis)
  • Monitor coordination metrics
  • Collect user feedback

Stage 4 (Weeks 7-8): Full Deployment

  • Enable for all task types
  • Continuous monitoring of prediction accuracy
  • Recalibration every 90 days

References

  • Paper: arXiv:2512.08296 - "Scaling Laws for Multi-Agent Systems"
  • Scaling Model Equations: Section 3 "Mathematical Framework"
  • Coordination Metrics: Section 4 "Coordination Dynamics"
  • CODITECT MoE Router: scripts/moe_classifier/router.py
  • Related ADRs:
    • ADR-001: Agent Labs Adoption
    • ADR-002: Integration Pattern
    • ADR-003: Agent Orchestration Mapping
    • ADR-005: Experiment Data Governance

Author: Claude (Sonnet 4.5) Date: 2026-02-16 Track: H (Framework) Task ID: H.0