
Brainqub3 Agent Labs — Quick Start (1-2-3)

Overview

Brainqub3 Agent Labs is an open-source measurement rig for empirically comparing single-agent (SAS) vs. multi-agent (MAS) architectures. Built to validate findings from the research paper "Towards a Science of Scaling Agent Systems" (arXiv:2512.08296), it provides:

  • Architecture comparison framework: SAS baseline vs. 4 MAS patterns (independent, centralised, decentralised, hybrid)
  • Mixed-effects scaling model: Empirical elasticity layer predicting scaling behavior across agent count, tool availability, task complexity
  • Paper-aligned coordination metrics: overhead%, message density, redundancy, efficiency, error amplification
  • SAS-relative delta scoring: All MAS performance measured against the simplest alternative (SAS) to quantify actual improvement
  • Live agent execution: Claude Agent SDK integration for real multi-agent runs with tool access enforcement
  • Telemetry stack: SQLite-backed run persistence, JSON schemas, HTML/Plotly dashboard
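The delta scoring and overhead metrics above reduce to simple arithmetic. A minimal sketch — the function and field names here (success rates, token counts) are illustrative assumptions, not the repo's actual schema:

```python
# Sketch of SAS-relative delta scoring and coordination overhead.
# Field names are assumptions for illustration, not the repo's schema.

def sas_relative_delta(mas_success_rate: float, sas_success_rate: float) -> float:
    """Positive delta means the MAS outperformed its paired SAS baseline."""
    return mas_success_rate - sas_success_rate

def coordination_overhead_pct(coordination_tokens: int, total_tokens: int) -> float:
    """Share of total token spend attributable to inter-agent messaging."""
    if total_tokens == 0:
        return 0.0
    return 100.0 * coordination_tokens / total_tokens

delta = sas_relative_delta(0.80, 0.65)              # ≈ +0.15: MAS wins
overhead = coordination_overhead_pct(3_000, 12_000)  # 25% of tokens on coordination
```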

Tech stack: Python 3.11+, uv package manager, Claude Agent SDK, numpy/scikit-learn (scaling model), PyYAML (config), SQLite (telemetry)

Key dependencies:

  • claude-agent-sdk>=0.1.33 — Anthropic's agent SDK for live runs with permission callbacks
  • numpy, scikit-learn — Statistical modeling, mixed-effects regression for elasticity analysis
  • PyYAML — Task/scenario configuration
  • plotly, jinja2 — Interactive dashboard rendering

Step 1: Local Setup

Clone and Bootstrap Environment

# Clone repository
git clone https://github.com/brainqub3/agent-labs.git
cd agent-labs

# Install uv package manager if missing
python3.11 -m pip install --user uv

# Sync all dependencies (creates .venv, installs packages)
uv sync --all-extras

# Configure API key
cp .env.example .env
# Edit .env → set ANTHROPIC_API_KEY=sk-ant-...

# Verify environment health
uv run brainqub3 doctor

Expected output from doctor:

✓ Python 3.11+ detected
✓ uv environment active
✓ ANTHROPIC_API_KEY configured
✓ SQLite database writable
✓ Claude Agent SDK installed (v0.1.33)
All checks passed.

If doctor fails, DO NOT PROCEED. Common issues:

  • Missing ANTHROPIC_API_KEY in .env
  • Python version <3.11
  • Filesystem permissions preventing SQLite write to data/runs/

Preflight Script Alternative (One-Shot Bootstrap)

bash scripts/preflight.sh --bootstrap-uv

This script:

  1. Detects Python 3.11+ or exits with installation guidance
  2. Installs uv if missing
  3. Runs uv sync --all-extras
  4. Checks for .env and prompts if missing
  5. Runs uv run brainqub3 doctor

Use when onboarding new contributors or provisioning CI environments.


Step 2: Run Your First Experiment

Evaluator-First Workflow (Test-Driven Agent Design)

1. Validate evaluators before running agents (prevents expensive API calls with broken tests):

uv run pytest brainqub3/tasks/examples/hello_world/tests -q

Expected: 3 passed in 0.12s (evaluator logic validated against mock instances).

2. Run SAS baseline (single-agent reference performance):

uv run brainqub3 run sas \
--task hello_world \
--model claude-sonnet-4-5 \
--instances 3 \
--require-live

Flags explained:

  • --instances 3 — Run evaluator against 3 task instances
  • --require-live — Fail if ANTHROPIC_API_KEY missing (prevents accidental mock runs)
  • Output: data/runs/<run_id>/sas_baseline.json with success rate, latency, cost

3. Run MAS comparison (multi-agent with 3 workers):

uv run brainqub3 run mas \
--task hello_world \
--arch hybrid \
--model claude-sonnet-4-5 \
--n-agents 3 \
--instances 3 \
--require-live

Architectures available:

| Architecture | Workers | Orchestrator | Peer Exchange | Aggregation |
|---|---|---|---|---|
| independent | Parallel, no communication | None | None | Majority vote |
| centralised | Draft answers independently | Synthesizes all drafts | None | Orchestrator decides |
| decentralised | Propose initial answers | None | Multi-round refinement | Consensus vote |
| hybrid | Assigned subtasks | Synthesizes results | Refine via peer review | Orchestrator + peer feedback |

Key insight: Every MAS run auto-generates a paired SAS baseline if one doesn't exist for the same task/model/instances combination. This ensures fair comparison.

Tool access enforcement: Claude Agent SDK permission callbacks enforce tool availability at runtime. Tool count maps to fixed prefix ordering:

  • 6 tools = core set (Read, Write, Edit, Bash, Grep, Glob)
  • 8 tools = core + web search/fetch

4. Launch interactive dashboard:

uv run brainqub3 dashboard
# Opens http://127.0.0.1:8765

Navigate to:

  • Comparison tab → View SAS vs. MAS delta metrics
  • Runs tab → Inspect individual agent traces, messages, tool calls
  • Scaling Laws tab → Visualize elasticity calibration (see Step 3)

Task Authoring Scaffold

Create a new evaluation task:

uv run brainqub3 task init my_task

Generated structure:

brainqub3/tasks/my_task/
├── __init__.py
├── evaluator.py          # Pass/fail logic for agent outputs
├── instances.json        # Task test cases
├── prompts.yaml          # SAS/MAS prompts
└── tests/
    └── test_evaluator.py # Pytest cases for evaluator logic

Critical workflow:

  1. Write evaluator + instances first
  2. Test evaluator with pytest brainqub3/tasks/my_task/tests -q
  3. ONLY THEN run live agents (avoids wasted API calls)
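The evaluator-first idea is that pass/fail logic is pure Python you can pytest with mock outputs before any live run. A hypothetical evaluator.py in that spirit — the repo's real evaluator interface and instance fields may differ:

```python
# Hypothetical evaluator for `my_task`. The actual interface in
# brainqub3/tasks/ may differ; this illustrates the evaluator-first
# workflow: test pass/fail logic against mock outputs before live runs.

def evaluate(instance: dict, agent_output: str) -> bool:
    """Return True if the agent's output satisfies this task instance."""
    expected = instance["expected_substring"]  # assumed instance field
    return expected.lower() in agent_output.lower()

# Pytest cases (tests/test_evaluator.py) exercise it with mock outputs:
def test_evaluate_accepts_match():
    assert evaluate({"expected_substring": "hello"}, "Hello, world!")

def test_evaluate_rejects_mismatch():
    assert not evaluate({"expected_substring": "hello"}, "goodbye")
```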

Run data persistence: All runs saved to data/runs/<run_id>/:

  • sas_baseline.json — SAS performance
  • mas_<arch>_<n_agents>.json — MAS performance
  • traces/ — Full agent message logs, tool calls, reasoning chains
  • metadata.json — Run config, timestamp, content hash

Run immutability: Completed runs are finalized with SHA-256 content hashes and cannot be modified in place (prevents results tampering). Re-running with --force generates a new run_id and preserves the old run.


Step 3: Elasticity Calibration & Scaling Analysis

The Key Scaling Question

"At what agent count / tool count does coordination overhead overwhelm performance gains?"

The paper's mixed-effects model predicts:

  • Low task complexity (P_SA near 1.0) → SAS already solves it, MAS adds cost with no benefit
  • High agent count (10+ workers) → Coordination overhead can cause performance collapse
  • More tools + more agents → Compounding coordination cost (message explosion)

Running Elasticity Calibration

uv run brainqub3 run elasticity \
--task hello_world \
--arch hybrid \
--model claude-sonnet-4-5 \
--batch-id my-elasticity-batch \
--n-agents-grid 3,4 \
--tool-count-grid 6,8 \
--instances 3 \
--require-live

What this does:

  1. Sweeps agent count (3, 4) × tool count (6, 8) = 4 MAS runs
  2. Generates paired SAS baselines for each tool count
  3. Collects coordination metrics (overhead%, message density, redundancy)
  4. Fits mixed-effects regression model to predict scaling behavior
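The sweep and fit above can be sketched in a few lines. The repo fits a mixed-effects model; here a plain least-squares fit over the (n_agents, tool_count) grid stands in for it, and the per-config deltas are invented for illustration:

```python
# Simplified stand-in for the elasticity sweep + fit. The repo uses a
# mixed-effects model; ordinary least squares over the factorial grid
# is used here purely to illustrate the shape of the analysis.
from itertools import product

import numpy as np
from sklearn.linear_model import LinearRegression

n_agents_grid = [3, 4]
tool_count_grid = [6, 8]
grid = list(product(n_agents_grid, tool_count_grid))  # 2 x 2 = 4 MAS configs

# Hypothetical SAS-relative deltas for each (n_agents, tool_count) config
deltas = [0.10, 0.06, 0.04, -0.02]

X = np.array(grid, dtype=float)
y = np.array(deltas)
model = LinearRegression().fit(X, y)
# Negative coefficients suggest delta shrinks as agents/tools grow
print(dict(zip(["n_agents", "tool_count"], model.coef_)))
```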

Output: data/elasticity/<batch_id>/ with fitted model coefficients (JSON) and residual plots.

Interpreting Scaling Laws Dashboard

Navigate to Scaling Laws tab:

  1. Select elasticity batch (NOT comparison batch) — only batches with --n-agents-grid show here
  2. X-axis options:
    • Number of agents (3, 4, 5, ...)
    • Tool count (6, 8, 10, ...)
  3. Y-axis: Delta performance vs. SAS baseline (positive = MAS wins, negative = SAS wins)
  4. Adjustable parameters:
    • P_SA (SAS baseline success rate, proxy for task difficulty) — slider 0.0 to 1.0
    • Intelligence index (model capability, 0-100) — defaults from model tier
    • Tool count (if X-axis is agent count)

Key insights from paper validation:

| Scenario | Prediction | Dashboard Evidence |
|---|---|---|
| Easy task (P_SA = 0.95) | SAS sufficient, MAS overhead | MAS delta near 0 or negative |
| Hard task (P_SA = 0.40) | MAS gains until overhead dominates | MAS delta positive at low N, drops at high N |
| 3→5 agents | Coordination overhead grows | Overhead% increases, efficiency drops |
| 5→10 agents | Potential collapse | Error amplification >1.0, negative delta |

Architecture-specific patterns:

| Architecture | Overhead Characteristics | When to Use |
|---|---|---|
| independent | Lowest overhead (no messages) | High task parallelizability, no synthesis needed |
| centralised | Moderate overhead (N→1 messages) | Tasks needing expert orchestrator, low peer dependency |
| decentralised | High overhead (N² messages) | Consensus-critical, no clear leader |
| hybrid | Variable overhead (orchestrator + peers) | Complex tasks with subtask decomposition + synthesis |
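The overhead characterisations above imply rough per-round message counts. A back-of-envelope sketch — the hybrid formula in particular is an assumption, since its orchestrator/peer mix varies:

```python
# Back-of-envelope message counts per coordination round, following the
# overhead characterisations above. Illustrative formulas only; the
# hybrid estimate (orchestrator + sparse peer review) is an assumption.

def messages_per_round(arch: str, n_agents: int) -> int:
    if arch == "independent":
        return 0                           # workers never talk
    if arch == "centralised":
        return n_agents                    # N workers -> 1 orchestrator
    if arch == "decentralised":
        return n_agents * (n_agents - 1)   # every peer messages every other
    if arch == "hybrid":
        return n_agents + n_agents * (n_agents - 1) // 2  # assumed mix
    raise ValueError(f"unknown architecture: {arch}")

for arch in ("independent", "centralised", "decentralised"):
    print(arch, messages_per_round(arch, 5))
```

At 5 agents, decentralised already carries 20 messages per round versus 5 for centralised, which is the quadratic blow-up the table warns about.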

Scenario Analysis (Multi-Task Comparison)

uv run brainqub3 scenario run --scenario data/scenarios/my_scenario.yaml

Scenario YAML structure:

name: "Production Readiness Test"
tasks:
  - task: code_review
    instances: 5
  - task: bug_triage
    instances: 5
architectures: [sas, hybrid]
models: [claude-sonnet-4-5]
n_agents_grid: [3, 5]
tool_count_grid: [6, 8]
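Scenario files can also be loaded and sanity-checked programmatically with PyYAML (a listed dependency). The keys below mirror the example above; the repo's own loader may validate more strictly:

```python
# Load and sanity-check a scenario definition with PyYAML. Keys mirror
# the example scenario above; the repo's loader may be stricter.
import yaml

SCENARIO = """\
name: "Production Readiness Test"
tasks:
  - task: code_review
    instances: 5
architectures: [sas, hybrid]
models: [claude-sonnet-4-5]
n_agents_grid: [3, 5]
tool_count_grid: [6, 8]
"""

scenario = yaml.safe_load(SCENARIO)
assert {"name", "tasks", "architectures"} <= scenario.keys()
total_instances = sum(t["instances"] for t in scenario["tasks"])
print(scenario["name"], total_instances)
```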

Use cases:

  • Compare architectures across task diversity
  • Production cost estimation (aggregates API costs)
  • Regression testing (compare scenario results across code versions)

Gotchas & Production Considerations

Mock Mode Limitations

Running without ANTHROPIC_API_KEY (or with the --mock flag) produces deterministic mock responses:

uv run brainqub3 run sas --task hello_world --instances 3 --mock

Mock mode is ONLY useful for:

  • Testing evaluator logic against known outputs
  • CI/CD pipeline validation (no API costs)
  • Onboarding contributors without API keys

Mock mode is USELESS for:

  • Architecture comparison (deterministic = no coordination variance)
  • Scaling analysis (no real overhead)
  • Production readiness assessment

Rate Limit Management

Tier 1 Anthropic subscriptions (default for new accounts):

  • 50 requests/minute (RPM)
  • Parallel agent runs with N=5 agents, 3 instances = 15 concurrent requests
  • Will hit rate limits in <12 seconds

Mitigation strategies:

  1. Use --instances 1 for initial experiments
  2. Reduce --n-agents-grid (e.g., 3,4 instead of 3,5,7,10)
  3. Upgrade to Tier 2+ Anthropic subscription (500 RPM)
  4. Add --rate-limit-delay 1.2 (seconds between requests)
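A client-side throttle along the lines of --rate-limit-delay is straightforward to sketch. This is not the repo's actual implementation, just the spacing logic implied by the flag:

```python
# Minimal client-side throttle in the spirit of --rate-limit-delay:
# space requests so a 50 RPM tier is never exceeded. A sketch, not the
# repo's actual implementation.
import time

class Throttle:
    def __init__(self, requests_per_minute: int):
        self.min_interval = 60.0 / requests_per_minute  # 50 RPM -> 1.2 s
        self._last = 0.0

    def wait(self) -> None:
        """Block just long enough to honor the configured rate."""
        now = time.monotonic()
        sleep_for = self.min_interval - (now - self._last)
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()

throttle = Throttle(requests_per_minute=50)  # Tier 1 limit
# call throttle.wait() before each API request
```

Note that 60 / 50 RPM works out to 1.2 seconds, matching the suggested --rate-limit-delay 1.2.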

Tool Count Semantics

Tool count is not arbitrary — it maps to SDK permission callbacks with fixed ordering:

| Tool Count | Included Tools |
|---|---|
| 6 | Read, Write, Edit, Bash, Grep, Glob |
| 8 | Above + WebSearch, WebFetch |
| 10 | Above + DatabaseQuery, EmailSend |

Implication: --tool-count-grid 5,7,9 will fail — only counts with a defined tool set (6, 8, 10) are accepted.

Use pre-defined sets from brainqub3/core/tool_registry.py or extend with custom tool sets.
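The fixed prefix ordering can be mirrored in a few lines. This is an illustrative stand-in for the real definitions in brainqub3/core/tool_registry.py, not a copy of them:

```python
# Illustrative mirror of the fixed prefix ordering described above.
# The real definitions live in brainqub3/core/tool_registry.py.
TOOL_ORDER = [
    "Read", "Write", "Edit", "Bash", "Grep", "Glob",  # 6: core set
    "WebSearch", "WebFetch",                          # 8: + web
    "DatabaseQuery", "EmailSend",                     # 10: + integrations
]
VALID_COUNTS = {6, 8, 10}

def tools_for_count(n: int) -> list[str]:
    """Return the prefix tool set for a supported count, else raise."""
    if n not in VALID_COUNTS:
        raise ValueError(f"no defined tool set for count {n}")
    return TOOL_ORDER[:n]

print(tools_for_count(8))
```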

Run Immutability & Content Hashing

Completed runs are finalized with SHA-256 content hashes (covers instances, evaluator outputs, agent traces).

Prevents:

  • Accidental result modification
  • Results tampering in research contexts
  • Inconsistent re-runs with same run_id
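Content hashing of this kind boils down to hashing a canonical serialization of the run's artifacts. A sketch — the exact fields the repo includes in the hash are not documented here:

```python
# Sketch of run finalization via content hashing: hash canonical JSON of
# a run's artifacts so any later edit is detectable. The exact hashed
# fields in the repo are an assumption here.
import hashlib
import json

def content_hash(run_artifacts: dict) -> str:
    canonical = json.dumps(run_artifacts, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

run = {"task": "hello_world", "success_rate": 1.0}
digest = content_hash(run)
assert digest == content_hash(run)                           # deterministic
assert content_hash({**run, "success_rate": 0.9}) != digest  # tamper-evident
```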

To re-run with same config:

uv run brainqub3 run sas --task hello_world --instances 3 --force
# Generates NEW run_id, preserves old run

To delete a run:

rm -rf data/runs/<run_id>
# Dashboard auto-refreshes, removes from index

Dashboard Performance with Large Batches

Elasticity batches with --n-agents-grid 3,4,5,6,7,8,9,10 × --tool-count-grid 6,8,10 = 24 runs.

Dashboard renders all traces in-browser (Plotly.js). With 10 agents × 5 instances × 20 messages/agent = 1000 message objects per run.

Symptom: Browser tab freezes on Scaling Laws tab load.

Fix: Use dashboard filters:

uv run brainqub3 dashboard --max-runs 50 --max-traces-per-run 100

Or export to JSON for custom analysis:

uv run brainqub3 export --batch-id my-elasticity-batch --format json > results.json
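Once exported, the batch can be analyzed offline with the standard library. The JSON structure below is an assumption for illustration; inspect your own results.json first:

```python
# Offline analysis of an exported batch. The JSON shape here is assumed
# for illustration; check the structure of your actual results.json.
import json

raw = """{"runs": [
  {"arch": "hybrid", "n_agents": 3, "delta": 0.08},
  {"arch": "hybrid", "n_agents": 5, "delta": -0.03}
]}"""
batch = json.loads(raw)

# Pick the configuration with the best SAS-relative delta
best = max(batch["runs"], key=lambda r: r["delta"])
print(best["n_agents"], best["delta"])
```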

Next Steps

  1. Read the paper: arXiv:2512.08296 — understand theoretical grounding
  2. Explore example tasks: brainqub3/tasks/examples/ — code_review, data_analysis, creative_writing
  3. Author your own task: uv run brainqub3 task init <name> → follow evaluator-first workflow
  4. Run production scenario: Define YAML scenario with business-critical tasks, analyze cost/performance trade-offs
  5. Contribute calibration data: Submit elasticity batch results to community dataset (see CONTRIBUTING.md)

Repository: https://github.com/brainqub3/agent-labs
Paper: arXiv:2512.08296
License: MIT
Maintainer: Brainqub3 Research

Version: 1.0.0 | Last Updated: 2026-02-16