# Brainqub3 Agent Labs — Quick Start (1-2-3)

## Overview
Brainqub3 Agent Labs is an open-source measurement rig for empirically comparing single-agent (SAS) vs. multi-agent (MAS) architectures. Built to validate findings from the research paper "Towards a Science of Scaling Agent Systems" (arXiv:2512.08296), it provides:
- Architecture comparison framework: SAS baseline vs. 4 MAS patterns (independent, centralised, decentralised, hybrid)
- Mixed-effects scaling model: Empirical elasticity layer predicting scaling behavior across agent count, tool availability, task complexity
- Paper-aligned coordination metrics: overhead%, message density, redundancy, efficiency, error amplification
- SAS-relative delta scoring: All MAS performance measured against the simplest alternative (SAS) to quantify actual improvement
- Live agent execution: Claude Agent SDK integration for real multi-agent runs with tool access enforcement
- Telemetry stack: SQLite-backed run persistence, JSON schemas, HTML/Plotly dashboard
Tech stack: Python 3.11+, uv package manager, Claude Agent SDK, numpy/scikit-learn (scaling model), PyYAML (config), SQLite (telemetry)
Key dependencies:
- `claude-agent-sdk>=0.1.33` — Anthropic's agent SDK for live runs with permission callbacks
- `numpy`, `scikit-learn` — statistical modeling, mixed-effects regression for elasticity analysis
- `PyYAML` — task/scenario configuration
- `plotly`, `jinja2` — interactive dashboard rendering
## Step 1: Local Setup

### Clone and Bootstrap Environment
```bash
# Clone repository
git clone https://github.com/brainqub3/agent-labs.git
cd agent-labs

# Install uv package manager if missing
python3.11 -m pip install --user uv

# Sync all dependencies (creates .venv, installs packages)
uv sync --all-extras

# Configure API key
cp .env.example .env
# Edit .env → set ANTHROPIC_API_KEY=sk-ant-...

# Verify environment health
uv run brainqub3 doctor
```
Expected output from doctor:
```text
✓ Python 3.11+ detected
✓ uv environment active
✓ ANTHROPIC_API_KEY configured
✓ SQLite database writable
✓ Claude Agent SDK installed (v0.1.33)
All checks passed.
```
If `doctor` fails, DO NOT PROCEED. Common issues:

- Missing `ANTHROPIC_API_KEY` in `.env`
- Python version < 3.11
- Filesystem permissions preventing SQLite writes to `data/runs/`
### Preflight Script Alternative (One-Shot Bootstrap)

```bash
bash scripts/preflight.sh --bootstrap-uv
```
This script:

- Detects Python 3.11+ or exits with installation guidance
- Installs `uv` if missing
- Runs `uv sync --all-extras`
- Checks for `.env` and prompts if missing
- Runs `uv run brainqub3 doctor`
Use when onboarding new contributors or provisioning CI environments.
## Step 2: Run Your First Experiment

### Evaluator-First Workflow (Test-Driven Agent Design)
1. Validate evaluators before running agents (prevents expensive API calls with broken tests):
```bash
uv run pytest brainqub3/tasks/examples/hello_world/tests -q
```
Expected: `3 passed in 0.12s` (evaluator logic validated against mock instances).
2. Run SAS baseline (single-agent reference performance):
```bash
uv run brainqub3 run sas \
  --task hello_world \
  --model claude-sonnet-4-5 \
  --instances 3 \
  --require-live
```
Flags explained:

- `--instances 3` — run the evaluator against 3 task instances
- `--require-live` — fail if `ANTHROPIC_API_KEY` is missing (prevents accidental mock runs)
- Output: `data/runs/<run_id>/sas_baseline.json` with success rate, latency, cost
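The exact schema of `sas_baseline.json` isn't spelled out here; as a rough sketch with hypothetical field names (the real file may differ), it could look like:

```json
{
  "task": "hello_world",
  "model": "claude-sonnet-4-5",
  "instances": 3,
  "success_rate": 1.0,
  "mean_latency_s": 4.2,
  "total_cost_usd": 0.018
}
```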
3. Run MAS comparison (multi-agent with 3 workers):
```bash
uv run brainqub3 run mas \
  --task hello_world \
  --arch hybrid \
  --model claude-sonnet-4-5 \
  --n-agents 3 \
  --instances 3 \
  --require-live
```
Architectures available:
| Architecture | Worker Behavior | Orchestrator Role | Peer Exchange | Aggregation |
|---|---|---|---|---|
| independent | Parallel, no communication | None | None | Majority vote |
| centralised | Draft answers independently | Synthesizes all drafts | None | Orchestrator decides |
| decentralised | Propose initial answers | None | Multi-round refinement | Consensus vote |
| hybrid | Assigned subtasks | Synthesizes results | Refine via peer review | Orchestrator + peer feedback |
Key insight: Every MAS run auto-generates a paired SAS baseline if one doesn't exist for the same task/model/instances combination. This ensures fair comparison.
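The pairing logic can be sketched as a cache keyed by the run configuration. This is an illustrative Python sketch with hypothetical names (`BaselineKey`, `BaselineCache`), not the actual brainqub3 implementation:

```python
# Sketch of the paired-baseline idea: a MAS run reuses an existing SAS
# baseline only when task, model, and instance count all match; otherwise
# it auto-generates one. Names here are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class BaselineKey:
    task: str
    model: str
    instances: int

class BaselineCache:
    def __init__(self):
        self._runs: dict[BaselineKey, dict] = {}

    def get_or_run(self, key: BaselineKey, run_sas) -> dict:
        # Auto-generate the SAS baseline only if none exists for this key.
        if key not in self._runs:
            self._runs[key] = run_sas(key)
        return self._runs[key]

cache = BaselineCache()
key = BaselineKey("hello_world", "claude-sonnet-4-5", 3)
calls = []
baseline = cache.get_or_run(key, lambda k: calls.append(k) or {"success_rate": 1.0})
again = cache.get_or_run(key, lambda k: calls.append(k) or {"success_rate": 0.0})
# The second call reuses the cached baseline: run_sas executed only once.
```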
Tool access enforcement: Claude Agent SDK permission callbacks enforce tool availability at runtime. Tool count maps to a fixed prefix ordering:

- `6 tools` — core set (Read, Write, Edit, Bash, Grep, Glob)
- `8 tools` — core + web search/fetch
4. Launch interactive dashboard:
```bash
uv run brainqub3 dashboard
# Opens http://127.0.0.1:8765
```
Navigate to:
- Comparison tab → View SAS vs. MAS delta metrics
- Runs tab → Inspect individual agent traces, messages, tool calls
- Scaling Laws tab → Visualize elasticity calibration (see Step 3)
### Task Authoring Scaffold

Create a new evaluation task:

```bash
uv run brainqub3 task init my_task
```
Generated structure:
```text
brainqub3/tasks/my_task/
├── __init__.py
├── evaluator.py           # Pass/fail logic for agent outputs
├── instances.json         # Task test cases
├── prompts.yaml           # SAS/MAS prompts
└── tests/
    └── test_evaluator.py  # Pytest cases for evaluator logic
```
Critical workflow:

1. Write evaluator + instances first
2. Test the evaluator with `pytest brainqub3/tasks/my_task/tests -q`
3. ONLY THEN run live agents (avoids wasted API calls)
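A minimal `evaluator.py` might look like the sketch below. This is hypothetical — the actual brainqub3 evaluator interface may differ — but it illustrates the point of the workflow: pure pass/fail logic you can pytest against mock outputs before spending a single API call.

```python
# Hypothetical evaluator sketch for the evaluator-first workflow.
# The real brainqub3 evaluator interface may differ.

def evaluate(output: str, instance: dict) -> bool:
    """Pass if the agent's output contains the expected answer (case-insensitive)."""
    return instance["expected"].lower() in output.lower()

# Evaluator tests run against mock outputs -- no live agents needed.
def test_evaluator_accepts_correct_answer():
    assert evaluate("The answer is Hello, World!", {"expected": "hello, world"})

def test_evaluator_rejects_wrong_answer():
    assert not evaluate("42", {"expected": "hello, world"})
```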
Run data persistence: All runs are saved to `data/runs/<run_id>/`:

- `sas_baseline.json` — SAS performance
- `mas_<arch>_<n_agents>.json` — MAS performance
- `traces/` — full agent message logs, tool calls, reasoning chains
- `metadata.json` — run config, timestamp, content hash
Run immutability: Completed runs are finalized with SHA-256 content hashes and cannot be modified in place (prevents results tampering). Re-run with `--force` to produce a fresh run under a new `run_id`.
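The content-hash idea can be sketched in a few lines. This is illustrative only — the fields brainqub3 actually hashes, and its canonicalization scheme, may differ:

```python
# Sketch of run finalization via a SHA-256 content hash (illustrative).
import hashlib
import json

def content_hash(run: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the hash is
    # stable regardless of dict insertion order.
    canonical = json.dumps(run, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

run = {"task": "hello_world", "instances": [1, 2, 3], "success_rate": 1.0}
h1 = content_hash(run)
# Any in-place modification changes the hash, so tampering is detectable.
run["success_rate"] = 0.5
h2 = content_hash(run)
```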
## Step 3: Elasticity Calibration & Scaling Analysis

### The Key Scaling Question
"At what agent count / tool count does coordination overhead overwhelm performance gains?"
The paper's mixed-effects model predicts:

- Low task complexity (`P_SA` near 1.0) → SAS already solves it; MAS adds cost with no benefit
- High agent count (10+ workers) → coordination overhead can cause performance collapse
- More tools + more agents → compounding coordination cost (message explosion)
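The intuition behind these predictions can be shown with a toy calculation. To be clear, this is NOT the paper's fitted mixed-effects model — just a sketch assuming diminishing independent-attempt gains against a hypothetical linear per-agent overhead:

```python
# Toy illustration of the overhead-vs-gain trade-off (assumed functional
# forms, not the paper's model): gains saturate as agents are added,
# while coordination cost keeps growing.

def toy_mas_delta(p_sa: float, n_agents: int, overhead_per_agent: float = 0.04) -> float:
    gain = (1 - (1 - p_sa) ** n_agents) - p_sa   # diminishing returns over SAS
    cost = overhead_per_agent * (n_agents - 1)   # assumed linear overhead
    return gain - cost

# Easy task (P_SA = 0.95): little headroom, overhead dominates quickly.
easy = [round(toy_mas_delta(0.95, n), 3) for n in range(1, 8)]
# Hard task (P_SA = 0.40): real gains at small N, decline as overhead compounds.
hard = [round(toy_mas_delta(0.40, n), 3) for n in range(1, 8)]
```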
### Running Elasticity Calibration

```bash
uv run brainqub3 run elasticity \
  --task hello_world \
  --arch hybrid \
  --model claude-sonnet-4-5 \
  --batch-id my-elasticity-batch \
  --n-agents-grid 3,4 \
  --tool-count-grid 6,8 \
  --instances 3 \
  --require-live
```
What this does:
- Sweeps agent count (3, 4) × tool count (6, 8) = 4 MAS runs
- Generates paired SAS baselines for each tool count
- Collects coordination metrics (overhead%, message density, redundancy)
- Fits mixed-effects regression model to predict scaling behavior
Output: `data/elasticity/<batch_id>/` with fitted model coefficients (JSON) and residual plots.
### Interpreting the Scaling Laws Dashboard

Navigate to the Scaling Laws tab:

- Select an elasticity batch (NOT a comparison batch) — only batches run with `--n-agents-grid` appear here
- X-axis options:
  - Number of agents (3, 4, 5, ...)
  - Tool count (6, 8, 10, ...)
- Y-axis: delta performance vs. SAS baseline (positive = MAS wins, negative = SAS wins)
- Adjustable parameters:
  - `P_SA` (SAS baseline success rate, proxy for task difficulty) — slider 0.0 to 1.0
  - Intelligence index (model capability, 0-100) — defaults from model tier
  - Tool count (if the X-axis is agent count)
Key insights from paper validation:
| Scenario | Prediction | Dashboard Evidence |
|---|---|---|
| Easy task (P_SA = 0.95) | SAS sufficient, MAS adds overhead | MAS delta near 0 or negative |
| Hard task (P_SA = 0.40) | MAS gains until overhead dominates | MAS delta positive at low N, drops at high N |
| 3→5 agents | Coordination overhead grows | Overhead% increases, efficiency drops |
| 5→10 agents | Potential collapse | Error amplification >1.0, negative delta |
Architecture-specific patterns:
| Architecture | Overhead Characteristics | When to Use |
|---|---|---|
| independent | Lowest overhead (no messages) | High task parallelizability, no synthesis needed |
| centralised | Moderate overhead (N→1 messages) | Tasks needing expert orchestrator, low peer dependency |
| decentralised | High overhead (N² messages) | Consensus-critical, no clear leader |
| hybrid | Variable overhead (orchestrator + peers) | Complex tasks with subtask decomposition + synthesis |
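The overhead column above follows from simple message-count arithmetic. As a back-of-envelope sketch (assuming one message per communication link per round — real runs also include broadcasts, retries, etc.):

```python
# Rough message counts per coordination round for each architecture
# (illustrative assumption: one message per link per round).

def messages_per_round(arch: str, n: int, rounds: int = 1) -> int:
    if arch == "independent":
        return 0                          # workers never communicate
    if arch == "centralised":
        return n                          # N workers -> 1 orchestrator
    if arch == "decentralised":
        return n * (n - 1) * rounds       # all-pairs peer exchange, ~N^2
    if arch == "hybrid":
        return n + n * (n - 1) * rounds   # orchestrator links + peer review
    raise ValueError(f"unknown architecture: {arch}")

growth = {a: messages_per_round(a, 5)
          for a in ("independent", "centralised", "decentralised", "hybrid")}
```

Doubling agents roughly quadruples decentralised traffic, which is why the table flags it as the highest-overhead pattern.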
### Scenario Analysis (Multi-Task Comparison)

```bash
uv run brainqub3 scenario run --scenario data/scenarios/my_scenario.yaml
```
Scenario YAML structure:
```yaml
name: "Production Readiness Test"
tasks:
  - task: code_review
    instances: 5
  - task: bug_triage
    instances: 5
architectures: [sas, hybrid]
models: [claude-sonnet-4-5]
n_agents_grid: [3, 5]
tool_count_grid: [6, 8]
```
Use cases:
- Compare architectures across task diversity
- Production cost estimation (aggregates API costs)
- Regression testing (compare scenario results across code versions)
## Gotchas & Production Considerations

### Mock Mode Limitations
Running without `ANTHROPIC_API_KEY` (or with the `--mock` flag) produces deterministic mock responses:

```bash
uv run brainqub3 run sas --task hello_world --instances 3 --mock
```
Mock mode is ONLY useful for:
- Testing evaluator logic against known outputs
- CI/CD pipeline validation (no API costs)
- Onboarding contributors without API keys
Mock mode is USELESS for:
- Architecture comparison (deterministic = no coordination variance)
- Scaling analysis (no real overhead)
- Production readiness assessment
### Rate Limit Management

Tier 1 Anthropic subscriptions (default for new accounts):

- 50 requests/minute (RPM)
- Parallel agent runs with N=5 agents, 3 instances = 15 concurrent requests
- Will hit rate limits in <12 seconds

Mitigation strategies:

- Use `--instances 1` for initial experiments
- Reduce `--n-agents-grid` (e.g., `3,4` instead of `3,5,7,10`)
- Upgrade to a Tier 2+ Anthropic subscription (500 RPM)
- Add `--rate-limit-delay 1.2` (seconds between requests)
### Tool Count Semantics
Tool count is not arbitrary — it maps to SDK permission callbacks with fixed ordering:
| Tool Count | Included Tools |
|---|---|
| 6 | Read, Write, Edit, Bash, Grep, Glob |
| 8 | Above + WebSearch, WebFetch |
| 10 | Above + DatabaseQuery, EmailSend |
Implication: `--tool-count-grid 5,7,9` will fail (no tool set is defined for those counts).
Use the pre-defined sets from `brainqub3/core/tool_registry.py` or extend them with custom tool sets.
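The fixed-prefix mapping in the table amounts to slicing an ordered tool list at the defined counts. A hypothetical sketch (the real registry lives in `brainqub3/core/tool_registry.py` and may be structured differently):

```python
# Sketch of the fixed-prefix tool mapping (hypothetical helper).
TOOL_PREFIX = [
    "Read", "Write", "Edit", "Bash", "Grep", "Glob",   # 6: core set
    "WebSearch", "WebFetch",                           # 8: + web
    "DatabaseQuery", "EmailSend",                      # 10: + external services
]
VALID_COUNTS = {6, 8, 10}

def tools_for_count(n: int) -> list[str]:
    # Reject counts with no defined set, mirroring the CLI failure mode.
    if n not in VALID_COUNTS:
        raise ValueError(f"no defined tool set for tool count {n}")
    return TOOL_PREFIX[:n]
```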
### Run Immutability & Content Hashing
Completed runs are finalized with SHA-256 content hashes (covers instances, evaluator outputs, agent traces).
Prevents:

- Accidental result modification
- Results tampering in research contexts
- Inconsistent re-runs under the same `run_id`
To re-run with the same config:

```bash
uv run brainqub3 run sas --task hello_world --instances 3 --force
# Generates a NEW run_id, preserves the old run
```
To delete a run:

```bash
rm -rf data/runs/<run_id>
# Dashboard auto-refreshes and removes it from the index
```
### Dashboard Performance with Large Batches
Elasticity batches with `--n-agents-grid 3,4,5,6,7,8,9,10` × `--tool-count-grid 6,8,10` = 24 runs.
Dashboard renders all traces in-browser (Plotly.js). With 10 agents × 5 instances × 20 messages/agent = 1000 message objects per run.
Symptom: Browser tab freezes on Scaling Laws tab load.
Fix: Use dashboard filters:

```bash
uv run brainqub3 dashboard --max-runs 50 --max-traces-per-run 100
```
Or export to JSON for custom analysis:

```bash
uv run brainqub3 export --batch-id my-elasticity-batch --format json > results.json
```
## Next Steps

- Read the paper: arXiv:2512.08296 — understand the theoretical grounding
- Explore example tasks: `brainqub3/tasks/examples/` — code_review, data_analysis, creative_writing
- Author your own task: `uv run brainqub3 task init <name>` → follow the evaluator-first workflow
- Run a production scenario: define a YAML scenario with business-critical tasks; analyze cost/performance trade-offs
- Contribute calibration data: submit elasticity batch results to the community dataset (see `CONTRIBUTING.md`)
Repository: https://github.com/brainqub3/agent-labs
Paper: arXiv:2512.08296
License: MIT
Maintainer: Brainqub3 Research

Version: 1.0.0 | Last Updated: 2026-02-16