ADR-004: Scaling Model for Agent Selection

Status

Proposed

Context

CODITECT currently selects agents and orchestration patterns using heuristic-based logic:

  • MoE classifier routes to "best" agent based on skill similarity scores
  • Multi-agent invocations use fixed counts (e.g., always 3 agents for peer review)
  • No quantitative feedback on whether adding more agents improves or degrades performance
  • Coordination overhead not measured or considered in selection

Brainqub3 Agent Labs provides a paper-aligned scaling model (arXiv:2512.08296) that predicts:

Performance Scaling:

P_hat = clip(beta0 + sum(beta_i*z_i) + sum(beta_ij*z_i*z_j), 0, 1)

Where:

  • P_hat = predicted performance (0-1 scale)
  • z_i = standardized task/architecture features
  • beta_i = feature coefficients (learned from experiments)
  • beta_ij = interaction terms (e.g., agent_count × task_complexity)
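
The equation above can be sketched directly in code. This is a hypothetical illustration of the formula, not the Agent Labs API; the function name and the dict-based feature/coefficient representation are assumptions for clarity.

```python
def predict_performance(z, beta0, beta, beta_int):
    """P_hat = clip(beta0 + sum(beta_i*z_i) + sum(beta_ij*z_i*z_j), 0, 1).

    z        -- standardized features, e.g. {"agent_count": 0.5, "complexity": 1.2}
    beta     -- per-feature coefficients, keyed like z
    beta_int -- interaction coefficients, keyed by (feature_i, feature_j) tuples
    """
    linear = sum(beta[name] * value for name, value in z.items())
    interactions = sum(coef * z[i] * z[j] for (i, j), coef in beta_int.items())
    raw = beta0 + linear + interactions
    return max(0.0, min(1.0, raw))  # clip to the [0, 1] performance scale
```

With the illustrative code_analysis coefficients from the storage schema below, `predict_performance({"agent_count": 1.0, "complexity": 0.5}, 0.45, {"agent_count": 0.12, "complexity": 0.38}, {("agent_count", "complexity"): -0.08})` evaluates to 0.72.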

Elasticity Estimation:

x_hat = clamp(x_base * (n/n0)^eta_n * (T/T0)^eta_T, lo, hi)

Where:

  • x_hat = predicted value (performance, cost, time, etc.), clamped to a valid range [lo, hi]
  • n = agent count, T = task complexity
  • n0, T0 = baseline (reference) agent count and task complexity
  • eta_n, eta_T = elasticity coefficients
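
A minimal sketch of the elasticity equation, with the same caveat: the function name, default baselines (n0 = 1, T0 = 1.0), and clamp bounds are assumptions, not the Agent Labs implementation.

```python
def predict_elasticity(x_base, n, T, eta_n, eta_T, n0=1, T0=1.0, lo=0.0, hi=float("inf")):
    """x_hat = clamp(x_base * (n/n0)^eta_n * (T/T0)^eta_T, lo, hi).

    Power-law scaling: eta_n (or eta_T) is the percentage change in x
    per percentage change in agent count (or task complexity).
    """
    x_hat = x_base * (n / n0) ** eta_n * (T / T0) ** eta_T
    return max(lo, min(hi, x_hat))  # clamp to the metric's valid range
```

For example, with eta_n = 0.5 a 4x increase in agent count only doubles the predicted value: `predict_elasticity(0.5, 4, 1.0, 0.5, 1.0)` returns 1.0, showing diminishing returns.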

Coordination Metrics (5):

  1. Overhead %: Time spent coordinating vs productive work
  2. Message Density: Messages per agent per task
  3. Redundancy: Duplicate work across agents
  4. Efficiency: Output quality per unit coordination cost
  5. Error Amplification: How errors compound with agent count
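
The first three metrics are simple ratios over run telemetry. The sketch below shows one plausible way to compute them; the field names and denominators are assumptions (the paper may define them differently).

```python
from dataclasses import dataclass

@dataclass
class CoordinationMetrics:
    coordination_seconds: float  # time spent in messaging/synchronization
    productive_seconds: float    # time spent producing output
    messages: int                # total inter-agent messages
    agents: int
    tasks: int
    duplicate_outputs: int       # outputs judged redundant with another agent's
    total_outputs: int

    @property
    def overhead_pct(self) -> float:
        total = self.coordination_seconds + self.productive_seconds
        return 100.0 * self.coordination_seconds / total if total else 0.0

    @property
    def message_density(self) -> float:
        return self.messages / (self.agents * self.tasks)

    @property
    def redundancy(self) -> float:
        return self.duplicate_outputs / self.total_outputs if self.total_outputs else 0.0
```

A run with 20s of coordination against 80s of productive work yields overhead_pct = 20.0; 30 messages across 3 agents and 2 tasks yields message_density = 5.0.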

The question: Should CODITECT use Agent Labs scaling predictions to dynamically select agent count and architecture at runtime, rather than using static configurations?

Example scenarios where this matters:

  • Simple task → Scaling model might predict coordination overhead > benefit for multi-agent
  • Complex task → Model might recommend 5 agents (decentralised) over 3 (centralised)
  • High-redundancy detection → Switch from Independent to Centralised to reduce waste

Decision

Use Agent Labs scaling model predictions to inform CODITECT agent selection decisions, with the following design:

1. Two-Phase Approach

Phase 1: Offline Calibration (Manual)

  • Run Agent Labs experiments on representative CODITECT tasks
  • Generate scaling curves for each (task_type, architecture) pair
  • Store coefficients (beta, eta) in ~/.coditect-data/scaling-models/
  • Update periodically (e.g., quarterly or when new agent types added)

Phase 2: Runtime Prediction (Automated)

  • Before invoking multi-agent orchestration, query scaling model
  • Predict performance for candidate configurations (1 agent, 3 agents, 5 agents; centralised vs decentralised)
  • Select configuration with best predicted efficiency (quality / coordination_cost)
  • Fall back to heuristic if no calibration data available

2. Integration Points

MoE Router Enhancement:

# Current: always select single best agent
best_agent = moe_router.select(task)

# Enhanced: compare single-agent vs multi-agent predictions
sas_prediction = scaling_model.predict(task, architecture="sas", agents=1)
mas_prediction = scaling_model.predict(task, architecture="centralised", agents=3)

if mas_prediction.efficiency > sas_prediction.efficiency * 1.2:  # 20% threshold
    agents = moe_router.select_top_k(task, k=3)
else:
    agents = [best_agent]

Architecture Selection:

# For known multi-agent tasks, compare architectures
predictions = {
    "centralised": scaling_model.predict(task, architecture="centralised", agents=3),
    "decentralised": scaling_model.predict(task, architecture="decentralised", agents=3),
    "hybrid": scaling_model.predict(task, architecture="hybrid", agents=3),
}
best_arch = max(predictions.items(), key=lambda x: x[1].efficiency)[0]

3. Coordination Collapse Detection

Trigger re-evaluation if coordination metrics exceed thresholds:

  • overhead_pct > 40% → Reduce agent count or switch to SAS
  • message_density > 10 → Switch from Decentralised to Centralised
  • redundancy > 0.5 → Consolidate agents or use Independent
  • error_amplification > 1.2 → Reduce agent count (errors compounding)
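
The threshold rules above can be collected into a single check that maps breached metrics to the mitigations listed. This is an illustrative sketch; the function name and dict shapes are assumptions, though the default thresholds match the storage schema later in this ADR.

```python
def check_collapse(metrics: dict, thresholds: dict) -> list:
    """Return recommended mitigations for each coordination threshold breached."""
    actions = []
    if metrics["overhead_pct"] > thresholds.get("overhead_pct_max", 40.0):
        actions.append("reduce agent count or switch to SAS")
    if metrics["message_density"] > thresholds.get("message_density_max", 10.0):
        actions.append("switch from decentralised to centralised")
    if metrics["redundancy"] > thresholds.get("redundancy_max", 0.5):
        actions.append("consolidate agents or use independent")
    if metrics.get("error_amplification", 0.0) > thresholds.get("error_amplification_max", 1.2):
        actions.append("reduce agent count (errors compounding)")
    return actions
```

An empty return list means no re-evaluation is triggered; otherwise the orchestrator (or the shadow-mode logger during rollout) acts on each recommendation.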

4. Guardrails

  • Always support manual override: --agents 5 --architecture hybrid ignores predictions
  • Gradual rollout: Start with logging predictions only (no action), validate accuracy
  • Fallback to heuristic: If scaling model unavailable or low confidence (<0.7), use existing logic
  • Cost limits: Never exceed user-configured max agent count or API budget

Alternatives Considered

1. Static Configuration (Status Quo)

Pros:

  • Simple, predictable
  • No runtime overhead
  • Proven to work

Cons:

  • Suboptimal for many tasks (over/under-provision agents)
  • No adaptation to task complexity
  • Coordination overhead invisible
  • Cannot detect coordination collapse

2. Heuristic-Based Scaling (No Model)

Pros:

  • Lightweight, fast
  • No calibration needed

Cons:

  • Rules brittle (e.g., "if task_tokens > 1000, use 3 agents")
  • No quantitative basis
  • Cannot predict efficiency vs overhead tradeoff
  • Misses interaction effects (agent_count × complexity)

3. Reinforcement Learning Agent Selector

Pros:

  • Could learn optimal policies end-to-end
  • Adapts continuously

Cons:

  • Requires large training dataset (hundreds of runs)
  • Black box (no interpretability)
  • High maintenance (model drift, retraining)
  • Violates CODITECT principle of explainability

4. Always Use Maximum Agents (Brute Force)

Pros:

  • Maximizes chance of success

Cons:

  • Extremely expensive (10x API cost for 10 agents vs 1)
  • High coordination overhead (diminishing returns)
  • Slower (more latency)
  • Violates cost optimization goals

5. User-Configured Policies (No Automation)

Pros:

  • User has full control
  • No surprises

Cons:

  • High cognitive load (users must understand scaling dynamics)
  • Requires expertise in coordination theory
  • Error-prone (easy to misconfigure)

Consequences

Positive

  1. Data-Driven Decisions: Agent selection based on empirical evidence, not guesswork
  2. Cost Optimization: Avoid over-provisioning agents when coordination overhead > benefit
  3. Performance Improvement: Select architectures that maximize efficiency for specific task types
  4. Coordination Collapse Prevention: Detect and mitigate when adding agents hurts performance
  5. Transparent Tradeoffs: Users see predicted performance vs cost before execution
  6. Continuous Improvement: Scaling model improves as more calibration runs accumulate
  7. Adaptability: Responds to task complexity variations automatically
  8. Explainability: Model coefficients and predictions are interpretable (vs black-box ML)

Negative

  1. Calibration Overhead: Requires running experiments to generate scaling curves
  2. Model Accuracy Dependency: Predictions only useful if model is well-calibrated
  3. Runtime Latency: Querying scaling model adds <100ms to invocation time
  4. Complexity Increase: More moving parts in agent selection logic
  5. Potential Misclassification: Incorrect task type classification → wrong scaling curve
  6. Storage Requirements: Must persist scaling coefficients for each (task_type, architecture) pair
  7. Maintenance Burden: Scaling model needs periodic recalibration as agents evolve

Risks

  1. Model Overfitting: Scaling curves trained on narrow task set don't generalize

    • Mitigation: Use diverse calibration tasks; validate on held-out tasks; fallback to heuristic if prediction confidence low
  2. Prediction Errors Lead to Poor Performance: Model recommends 5 agents, coordination collapse occurs

    • Mitigation: Monitor actual coordination metrics; trigger re-evaluation if thresholds exceeded; allow manual override
  3. Calibration Cost: Running 100s of experiments expensive in API tokens

    • Mitigation: Use small-scale tasks for calibration; apply findings to high-volume production tasks
  4. Stale Coefficients: CODITECT agents improve, but scaling model not recalibrated

    • Mitigation: Version scaling models alongside agent releases; flag "stale" warnings after 90 days
  5. Task Type Explosion: Every new task type requires separate calibration

    • Mitigation: Group similar tasks (e.g., "code_analysis", "document_generation"); use task embeddings for clustering
  6. User Confusion: Predictions contradict user intuition, eroding trust

    • Mitigation: Explain predictions ("3 agents predicted 20% faster with 15% lower cost"); log decision rationale

Implementation Notes

1. Calibration Workflow

# Define calibration task set
cat > calibration_tasks.yaml <<EOF
tasks:
  - id: code_review_simple
    type: code_analysis
    complexity: low
    description: "Review 50-line Python function"

  - id: code_review_complex
    type: code_analysis
    complexity: high
    description: "Review 500-line distributed system module"

  - id: doc_generation_api
    type: documentation
    complexity: medium
    description: "Generate API reference from OpenAPI spec"
EOF

# Run experiments (all architectures, agent counts 1-7)
/scaling-calibrate \
  --tasks calibration_tasks.yaml \
  --architectures sas,independent,centralised,decentralised,hybrid \
  --agent-counts 1,2,3,5,7 \
  --output ~/.coditect-data/scaling-models/calibration-2026-02-16/

# Train scaling model (fit coefficients)
/scaling-train \
  --experiments ~/.coditect-data/scaling-models/calibration-2026-02-16/ \
  --output ~/.coditect-data/scaling-models/coditect-v1.json

# Validate predictions
/scaling-validate \
  --model ~/.coditect-data/scaling-models/coditect-v1.json \
  --holdout-tasks validation_tasks.yaml

2. Runtime Prediction API

# scripts/scaling-analysis/predictor.py

import json
from pathlib import Path
from typing import Dict, Tuple


class ScalingPredictor:
    def __init__(self, model_path: Path):
        with open(model_path) as f:
            self.model = json.load(f)
        self.coefficients = self.model["coefficients"]
        self.elasticities = self.model["elasticities"]

    def predict(
        self,
        task_type: str,
        architecture: str,
        agent_count: int,
    ) -> Dict[str, float]:
        """
        Returns:
            {
                "performance": 0.85,   # 0-1 scale
                "overhead_pct": 25.0,
                "efficiency": 3.4,
                "confidence": 0.82
            }
        """
        # Implementation: apply scaling law equations
        pass

    def recommend_config(self, task_type: str) -> Tuple[str, int]:
        """Returns (architecture, agent_count) with best efficiency."""
        best_config = None
        best_efficiency = 0.0

        for arch in ["sas", "independent", "centralised", "decentralised", "hybrid"]:
            for n in [1, 2, 3, 5, 7]:
                pred = self.predict(task_type, arch, n)
                if pred["efficiency"] > best_efficiency:
                    best_efficiency = pred["efficiency"]
                    best_config = (arch, n)

        return best_config

3. MoE Integration

# Enhance existing MoE router
from pathlib import Path

from scripts.scaling_analysis.predictor import ScalingPredictor

# expanduser() is required so "~" resolves to the user's home directory
predictor = ScalingPredictor(
    Path("~/.coditect-data/scaling-models/coditect-v1.json").expanduser()
)

def select_agents(task: str, task_type: str = "code_analysis"):
    # Get recommendation
    recommended_arch, recommended_count = predictor.recommend_config(task_type)

    # Predict performance for recommended config vs SAS baseline
    mas_pred = predictor.predict(task_type, recommended_arch, recommended_count)
    sas_pred = predictor.predict(task_type, "sas", 1)

    # Log predictions
    print(f"[scaling] SAS: perf={sas_pred['performance']:.2f}")
    print(
        f"[scaling] {recommended_arch} (n={recommended_count}): "
        f"perf={mas_pred['performance']:.2f}, overhead={mas_pred['overhead_pct']:.0f}%"
    )

    # Decide based on efficiency gain threshold
    if mas_pred["efficiency"] > sas_pred["efficiency"] * 1.2:
        # Use multi-agent
        agents = moe_router.select_top_k(task, k=recommended_count)
        return agents, recommended_arch
    else:
        # Use single agent
        return [moe_router.select(task)], "sas"

4. Storage Schema

// ~/.coditect-data/scaling-models/coditect-v1.json
{
  "version": "1.0",
  "created": "2026-02-16T10:30:00Z",
  "calibration_run_count": 245,
  "task_types": ["code_analysis", "documentation", "testing"],
  "coefficients": {
    "code_analysis": {
      "beta0": 0.45,
      "beta_agent_count": 0.12,
      "beta_complexity": 0.38,
      "beta_agent_count_x_complexity": -0.08
    }
  },
  "elasticities": {
    "code_analysis": {
      "eta_agent_count": 0.65,
      "eta_task_complexity": 1.12
    }
  },
  "thresholds": {
    "overhead_pct_max": 40.0,
    "message_density_max": 10.0,
    "redundancy_max": 0.5
  }
}
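
The "Stale Coefficients" mitigation (flag warnings after 90 days) can hang off the "created" field in this schema. A possible loader sketch; `load_model` and its warning format are hypothetical, not part of the predictor above.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

def load_model(path: Path, max_age_days: int = 90) -> dict:
    """Load a scaling model file and warn when it is older than max_age_days."""
    model = json.loads(path.read_text())
    # "created" is an ISO-8601 UTC timestamp; normalize the trailing "Z"
    created = datetime.fromisoformat(model["created"].replace("Z", "+00:00"))
    if datetime.now(timezone.utc) - created > timedelta(days=max_age_days):
        print(f"[scaling] WARNING: model {model['version']} is stale "
              f"(created {model['created']}); recalibration recommended")
    return model
```

Versioning the model file alongside agent releases (per the mitigation list) means this check can also compare `model["version"]` against the running agent version.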

Rollout Plan

Stage 1 (Weeks 1-2): Calibration

  • Run 200+ experiments on diverse CODITECT tasks
  • Generate initial scaling model v1.0
  • Validate predictions on held-out tasks (target: R² > 0.7)

Stage 2 (Weeks 3-4): Shadow Mode

  • Integrate predictor into MoE router
  • Log predictions only (no action)
  • Compare predictions vs actual performance
  • Tune thresholds and confidence levels

Stage 3 (Weeks 5-6): Gradual Activation

  • Enable predictions for low-risk tasks (documentation, simple code analysis)
  • Monitor coordination metrics
  • Collect user feedback

Stage 4 (Weeks 7-8): Full Deployment

  • Enable for all task types
  • Continuous monitoring of prediction accuracy
  • Recalibration every 90 days

References

  • Paper: arXiv:2512.08296 - "Scaling Laws for Multi-Agent Systems"
  • Scaling Model Equations: Section 3 "Mathematical Framework"
  • Coordination Metrics: Section 4 "Coordination Dynamics"
  • CODITECT MoE Router: scripts/moe_classifier/router.py
  • Related ADRs:
    • ADR-001: Agent Labs Adoption
    • ADR-002: Integration Pattern
    • ADR-003: Agent Orchestration Mapping
    • ADR-005: Experiment Data Governance

Author: Claude (Sonnet 4.5) Date: 2026-02-16 Track: H (Framework) Task ID: H.0