# Brainqub3 Agent Labs — Quick Start (1-2-3)

## Overview
Brainqub3 Agent Labs is an open-source measurement rig for empirically comparing single-agent (SAS) vs. multi-agent (MAS) architectures. Built to validate findings from the research paper "Towards a Science of Scaling Agent Systems" (arXiv:2512.08296), it provides:
- Architecture comparison framework: SAS baseline vs. 4 MAS patterns (independent, centralised, decentralised, hybrid)
- Mixed-effects scaling model: Empirical elasticity layer predicting scaling behavior across agent count, tool availability, task complexity
- Paper-aligned coordination metrics: overhead%, message density, redundancy, efficiency, error amplification
- SAS-relative delta scoring: All MAS performance measured against the simplest alternative (SAS) to quantify actual improvement
- Live agent execution: Claude Agent SDK integration for real multi-agent runs with tool access enforcement
- Telemetry stack: SQLite-backed run persistence, JSON schemas, HTML/Plotly dashboard
Tech stack: Python 3.11+, uv package manager, Claude Agent SDK, numpy/scikit-learn (scaling model), PyYAML (config), SQLite (telemetry)
Key dependencies:
- `claude-agent-sdk>=0.1.33` — Anthropic's agent SDK for live runs with permission callbacks
- `numpy`, `scikit-learn` — statistical modeling, mixed-effects regression for elasticity analysis
- `PyYAML` — task/scenario configuration
- `plotly`, `jinja2` — interactive dashboard rendering
## Step 1: Local Setup

### Clone and Bootstrap Environment
```bash
# Clone repository
git clone https://github.com/brainqub3/agent-labs.git
cd agent-labs

# Install uv package manager if missing
python3.11 -m pip install --user uv

# Sync all dependencies (creates .venv, installs packages)
uv sync --all-extras

# Configure API key
cp .env.example .env
# Edit .env → set ANTHROPIC_API_KEY=sk-ant-...

# Verify environment health
uv run brainqub3 doctor
```
Expected output from doctor:
```text
✓ Python 3.11+ detected
✓ uv environment active
✓ ANTHROPIC_API_KEY configured
✓ SQLite database writable
✓ Claude Agent SDK installed (v0.1.33)
All checks passed.
```
If `doctor` fails, DO NOT PROCEED. Common issues:

- Missing `ANTHROPIC_API_KEY` in `.env`
- Python version < 3.11
- Filesystem permissions preventing SQLite writes to `data/runs/`
### Preflight Script Alternative (One-Shot Bootstrap)

```bash
bash scripts/preflight.sh --bootstrap-uv
```
This script:

- Detects Python 3.11+ or exits with installation guidance
- Installs `uv` if missing
- Runs `uv sync --all-extras`
- Checks for `.env` and prompts if missing
- Runs `uv run brainqub3 doctor`
Use when onboarding new contributors or provisioning CI environments.
## Step 2: Run Your First Experiment

### Evaluator-First Workflow (Test-Driven Agent Design)
1. Validate evaluators before running agents (prevents expensive API calls with broken tests):
```bash
uv run pytest brainqub3/tasks/examples/hello_world/tests -q
```
Expected: `3 passed in 0.12s` (evaluator logic validated against mock instances).
2. Run SAS baseline (single-agent reference performance):
```bash
uv run brainqub3 run sas \
  --task hello_world \
  --model claude-sonnet-4-5 \
  --instances 3 \
  --require-live
```
Flags explained:

- `--instances 3` — run the evaluator against 3 task instances
- `--require-live` — fail if `ANTHROPIC_API_KEY` is missing (prevents accidental mock runs)
- Output: `data/runs/<run_id>/sas_baseline.json` with success rate, latency, cost
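The exact schema of `sas_baseline.json` isn't spelled out here; as a rough sketch with hypothetical field names (the real file may differ), it could look like:

```json
{
  "task": "hello_world",
  "model": "claude-sonnet-4-5",
  "instances": 3,
  "success_rate": 1.0,
  "mean_latency_s": 4.2,
  "total_cost_usd": 0.018
}
```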
3. Run MAS comparison (multi-agent with 3 workers):
```bash
uv run brainqub3 run mas \
  --task hello_world \
  --arch hybrid \
  --model claude-sonnet-4-5 \
  --n-agents 3 \
  --instances 3 \
  --require-live
```
Architectures available:
| Architecture | Worker Behavior | Orchestrator Role | Peer Exchange | Aggregation |
|---|---|---|---|---|
| independent | Parallel, no communication | None | None | Majority vote |
| centralised | Draft answers independently | Synthesizes all drafts | None | Orchestrator decides |
| decentralised | Propose initial answers | None | Multi-round refinement | Consensus vote |
| hybrid | Assigned subtasks | Synthesizes results | Refine via peer review | Orchestrator + peer feedback |
Key insight: Every MAS run auto-generates a paired SAS baseline if one doesn't exist for the same task/model/instances combination. This ensures fair comparison.
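The pairing logic can be sketched as a cache keyed by the run configuration. This is an illustrative Python sketch with hypothetical names (`BaselineKey`, `BaselineCache`), not the actual brainqub3 implementation:

```python
# Sketch of the paired-baseline idea: a MAS run reuses an existing SAS
# baseline only when task, model, and instance count all match; otherwise
# it auto-generates one. Names here are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class BaselineKey:
    task: str
    model: str
    instances: int

class BaselineCache:
    def __init__(self):
        self._runs: dict[BaselineKey, dict] = {}

    def get_or_run(self, key: BaselineKey, run_sas) -> dict:
        # Auto-generate the SAS baseline only if none exists for this key.
        if key not in self._runs:
            self._runs[key] = run_sas(key)
        return self._runs[key]

cache = BaselineCache()
key = BaselineKey("hello_world", "claude-sonnet-4-5", 3)
calls = []
baseline = cache.get_or_run(key, lambda k: calls.append(k) or {"success_rate": 1.0})
again = cache.get_or_run(key, lambda k: calls.append(k) or {"success_rate": 0.0})
# The second call reuses the cached baseline: run_sas executed only once.
```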
Tool access enforcement: Claude Agent SDK permission callbacks enforce tool availability at runtime. Tool count maps to a fixed prefix ordering:

- `6 tools` — core set (Read, Write, Edit, Bash, Grep, Glob)
- `8 tools` — core + web search/fetch
4. Launch interactive dashboard:
```bash
uv run brainqub3 dashboard
# Opens http://127.0.0.1:8765
```
Navigate to:
- Comparison tab → View SAS vs. MAS delta metrics
- Runs tab → Inspect individual agent traces, messages, tool calls
- Scaling Laws tab → Visualize elasticity calibration (see Step 3)
### Task Authoring Scaffold

Create a new evaluation task:

```bash
uv run brainqub3 task init my_task
```
Generated structure:
```text
brainqub3/tasks/my_task/
├── __init__.py
├── evaluator.py           # Pass/fail logic for agent outputs
├── instances.json         # Task test cases
├── prompts.yaml           # SAS/MAS prompts
└── tests/
    └── test_evaluator.py  # Pytest cases for evaluator logic
```
Critical workflow:

1. Write evaluator + instances first
2. Test the evaluator with `pytest brainqub3/tasks/my_task/tests -q`
3. ONLY THEN run live agents (avoids wasted API calls)
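A minimal `evaluator.py` might look like the sketch below. This is hypothetical — the actual brainqub3 evaluator interface may differ — but it illustrates the point of the workflow: pure pass/fail logic you can pytest against mock outputs before spending a single API call.

```python
# Hypothetical evaluator sketch for the evaluator-first workflow.
# The real brainqub3 evaluator interface may differ.

def evaluate(output: str, instance: dict) -> bool:
    """Pass if the agent's output contains the expected answer (case-insensitive)."""
    return instance["expected"].lower() in output.lower()

# Evaluator tests run against mock outputs -- no live agents needed.
def test_evaluator_accepts_correct_answer():
    assert evaluate("The answer is Hello, World!", {"expected": "hello, world"})

def test_evaluator_rejects_wrong_answer():
    assert not evaluate("42", {"expected": "hello, world"})
```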
Run data persistence: All runs are saved to `data/runs/<run_id>/`:

- `sas_baseline.json` — SAS performance
- `mas_<arch>_<n_agents>.json` — MAS performance
- `traces/` — full agent message logs, tool calls, reasoning chains
- `metadata.json` — run config, timestamp, content hash
Run immutability: Completed runs are finalized with SHA-256 content hashes and cannot be modified in place (prevents results tampering). Re-run with `--force` to produce a fresh run under a new `run_id`.
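The content-hash idea can be sketched in a few lines. This is illustrative only — the fields brainqub3 actually hashes, and its canonicalization scheme, may differ:

```python
# Sketch of run finalization via a SHA-256 content hash (illustrative).
import hashlib
import json

def content_hash(run: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the hash is
    # stable regardless of dict insertion order.
    canonical = json.dumps(run, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

run = {"task": "hello_world", "instances": [1, 2, 3], "success_rate": 1.0}
h1 = content_hash(run)
# Any in-place modification changes the hash, so tampering is detectable.
run["success_rate"] = 0.5
h2 = content_hash(run)
```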
## Step 3: Elasticity Calibration & Scaling Analysis

### The Key Scaling Question
"At what agent count / tool count does coordination overhead overwhelm performance gains?"
The paper's mixed-effects model predicts:

- Low task complexity (`P_SA` near 1.0) → SAS already solves it; MAS adds cost with no benefit
- High agent count (10+ workers) → coordination overhead can cause performance collapse
- More tools + more agents → compounding coordination cost (message explosion)
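The intuition behind these predictions can be shown with a toy calculation. To be clear, this is NOT the paper's fitted mixed-effects model — just a sketch assuming diminishing independent-attempt gains against a hypothetical linear per-agent overhead:

```python
# Toy illustration of the overhead-vs-gain trade-off (assumed functional
# forms, not the paper's model): gains saturate as agents are added,
# while coordination cost keeps growing.

def toy_mas_delta(p_sa: float, n_agents: int, overhead_per_agent: float = 0.04) -> float:
    gain = (1 - (1 - p_sa) ** n_agents) - p_sa   # diminishing returns over SAS
    cost = overhead_per_agent * (n_agents - 1)   # assumed linear overhead
    return gain - cost

# Easy task (P_SA = 0.95): little headroom, overhead dominates quickly.
easy = [round(toy_mas_delta(0.95, n), 3) for n in range(1, 8)]
# Hard task (P_SA = 0.40): real gains at small N, decline as overhead compounds.
hard = [round(toy_mas_delta(0.40, n), 3) for n in range(1, 8)]
```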
### Running Elasticity Calibration

```bash
uv run brainqub3 run elasticity \
  --task hello_world \
  --arch hybrid \
  --model claude-sonnet-4-5 \
  --batch-id my-elasticity-batch \
  --n-agents-grid 3,4 \
  --tool-count-grid 6,8 \
  --instances 3 \
  --require-live
```
What this does:
- Sweeps agent count (3, 4) × tool count (6, 8) = 4 MAS runs
- Generates paired SAS baselines for each tool count
- Collects coordination metrics (overhead%, message density, redundancy)
- Fits mixed-effects regression model to predict scaling behavior
Output: `data/elasticity/<batch_id>/` with fitted model coefficients (JSON) and residual plots.
### Interpreting the Scaling Laws Dashboard

Navigate to the Scaling Laws tab:

- Select an elasticity batch (NOT a comparison batch) — only batches run with `--n-agents-grid` appear here
- X-axis options:
  - Number of agents (3, 4, 5, ...)
  - Tool count (6, 8, 10, ...)
- Y-axis: delta performance vs. SAS baseline (positive = MAS wins, negative = SAS wins)
- Adjustable parameters:
  - `P_SA` (SAS baseline success rate, proxy for task difficulty) — slider 0.0 to 1.0
  - Intelligence index (model capability, 0-100) — defaults from model tier
  - Tool count (if the X-axis is agent count)
Key insights from paper validation:
| Scenario | Prediction | Dashboard Evidence |
|---|---|---|
| Easy task (P_SA = 0.95) | SAS sufficient, MAS adds overhead | MAS delta near 0 or negative |
| Hard task (P_SA = 0.40) | MAS gains until overhead dominates | MAS delta positive at low N, drops at high N |
| 3→5 agents | Coordination overhead grows | Overhead% increases, efficiency drops |
| 5→10 agents | Potential collapse | Error amplification >1.0, negative delta |
Architecture-specific patterns:
| Architecture | Overhead Characteristics | When to Use |
|---|---|---|
| independent | Lowest overhead (no messages) | High task parallelizability, no synthesis needed |
| centralised | Moderate overhead (N→1 messages) | Tasks needing expert orchestrator, low peer dependency |
| decentralised | High overhead (N² messages) | Consensus-critical, no clear leader |
| hybrid | Variable overhead (orchestrator + peers) | Complex tasks with subtask decomposition + synthesis |
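The overhead column above follows from simple message-count arithmetic. As a back-of-envelope sketch (assuming one message per communication link per round — real runs also include broadcasts, retries, etc.):

```python
# Rough message counts per coordination round for each architecture
# (illustrative assumption: one message per link per round).

def messages_per_round(arch: str, n: int, rounds: int = 1) -> int:
    if arch == "independent":
        return 0                          # workers never communicate
    if arch == "centralised":
        return n                          # N workers -> 1 orchestrator
    if arch == "decentralised":
        return n * (n - 1) * rounds       # all-pairs peer exchange, ~N^2
    if arch == "hybrid":
        return n + n * (n - 1) * rounds   # orchestrator links + peer review
    raise ValueError(f"unknown architecture: {arch}")

growth = {a: messages_per_round(a, 5)
          for a in ("independent", "centralised", "decentralised", "hybrid")}
```

Doubling agents roughly quadruples decentralised traffic, which is why the table flags it as the highest-overhead pattern.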
### Scenario Analysis (Multi-Task Comparison)

```bash
uv run brainqub3 scenario run --scenario data/scenarios/my_scenario.yaml
```
Scenario YAML structure:
```yaml
name: "Production Readiness Test"
tasks:
  - task: code_review
    instances: 5
  - task: bug_triage
    instances: 5
architectures: [sas, hybrid]
models: [claude-sonnet-4-5]
n_agents_grid: [3, 5]
tool_count_grid: [6, 8]
```
Use cases:
- Compare architectures across task diversity
- Production cost estimation (aggregates API costs)
- Regression testing (compare scenario results across code versions)
## Gotchas & Production Considerations

### Mock Mode Limitations
Running without `ANTHROPIC_API_KEY` (or with the `--mock` flag) produces deterministic mock responses:

```bash
uv run brainqub3 run sas --task hello_world --instances 3 --mock
```
Mock mode is ONLY useful for:
- Testing evaluator logic against known outputs
- CI/CD pipeline validation (no API costs)
- Onboarding contributors without API keys
Mock mode is USELESS for:
- Architecture comparison (deterministic = no coordination variance)
- Scaling analysis (no real overhead)
- Production readiness assessment
### Rate Limit Management

Tier 1 Anthropic subscriptions (default for new accounts):

- 50 requests/minute (RPM)
- Parallel agent runs with N=5 agents, 3 instances = 15 concurrent requests
- Will hit rate limits in <12 seconds

Mitigation strategies:

- Use `--instances 1` for initial experiments
- Reduce `--n-agents-grid` (e.g., `3,4` instead of `3,5,7,10`)
- Upgrade to a Tier 2+ Anthropic subscription (500 RPM)
- Add `--rate-limit-delay 1.2` (seconds between requests)
### Tool Count Semantics
Tool count is not arbitrary — it maps to SDK permission callbacks with fixed ordering:
| Tool Count | Included Tools |
|---|---|
| 6 | Read, Write, Edit, Bash, Grep, Glob |
| 8 | Above + WebSearch, WebFetch |
| 10 | Above + DatabaseQuery, EmailSend |
Implication: `--tool-count-grid 5,7,9` will fail (no tool set is defined for those counts).
Use the pre-defined sets from `brainqub3/core/tool_registry.py` or extend them with custom tool sets.
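The fixed-prefix mapping in the table amounts to slicing an ordered tool list at the defined counts. A hypothetical sketch (the real registry lives in `brainqub3/core/tool_registry.py` and may be structured differently):

```python
# Sketch of the fixed-prefix tool mapping (hypothetical helper).
TOOL_PREFIX = [
    "Read", "Write", "Edit", "Bash", "Grep", "Glob",   # 6: core set
    "WebSearch", "WebFetch",                           # 8: + web
    "DatabaseQuery", "EmailSend",                      # 10: + external services
]
VALID_COUNTS = {6, 8, 10}

def tools_for_count(n: int) -> list[str]:
    # Reject counts with no defined set, mirroring the CLI failure mode.
    if n not in VALID_COUNTS:
        raise ValueError(f"no defined tool set for tool count {n}")
    return TOOL_PREFIX[:n]
```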
### Run Immutability & Content Hashing
Completed runs are finalized with SHA-256 content hashes (covers instances, evaluator outputs, agent traces).
Prevents:

- Accidental result modification
- Results tampering in research contexts
- Inconsistent re-runs under the same `run_id`
To re-run with the same config:

```bash
uv run brainqub3 run sas --task hello_world --instances 3 --force
# Generates a NEW run_id, preserves the old run
```
To delete a run:

```bash
rm -rf data/runs/<run_id>
# Dashboard auto-refreshes and removes it from the index
```
### Dashboard Performance with Large Batches
Elasticity batches with `--n-agents-grid 3,4,5,6,7,8,9,10` × `--tool-count-grid 6,8,10` = 24 runs.
Dashboard renders all traces in-browser (Plotly.js). With 10 agents × 5 instances × 20 messages/agent = 1000 message objects per run.
Symptom: Browser tab freezes on Scaling Laws tab load.
Fix: Use dashboard filters:

```bash
uv run brainqub3 dashboard --max-runs 50 --max-traces-per-run 100
```
Or export to JSON for custom analysis:

```bash
uv run brainqub3 export --batch-id my-elasticity-batch --format json > results.json
```
## Next Steps

- Read the paper: arXiv:2512.08296 — understand the theoretical grounding
- Explore example tasks: `brainqub3/tasks/examples/` — code_review, data_analysis, creative_writing
- Author your own task: `uv run brainqub3 task init <name>` → follow the evaluator-first workflow
- Run a production scenario: define a YAML scenario with business-critical tasks; analyze cost/performance trade-offs
- Contribute calibration data: submit elasticity batch results to the community dataset (see `CONTRIBUTING.md`)
Repository: https://github.com/brainqub3/agent-labs
Paper: arXiv:2512.08296
License: MIT
Maintainer: Brainqub3 Research

Version: 1.0.0 | Last Updated: 2026-02-16