
I Built an Open-Source Rig That Measures Multi-Agent Architectures


The Problem with Multi-Agent Hype

There's a lot of hype right now around multi-agent systems — orchestrated agent teams, decentralized swarms of agents. And if you spend enough time in that world, it's easy to start believing they're always the right choice. But in production, more agents doesn't automatically mean better performance. Sometimes multi-agent architectures genuinely scale, and sometimes they collapse under coordination costs, turning into slower, noisier versions of a single-agent system. The real issue is that without a baseline and without measurement, most people can't tell which one they've built.


What We're Building

Before we go any further, let me be clear about what we're building here: an open-source tool that you'll want if you're planning to build production agents — the type that can actually scale. It's called Brain Cube Agent Labs.

I anchored this work on the paper Towards a Science of Scaling Agent Systems. They didn't just benchmark a few demos and call it a day. They built a predictive model using real coordination metrics like efficiency, overhead, error amplification, redundancy, and message patterns. And the key reason I trust it enough to build on: it holds up on unseen configurations with a cross-validated R-squared of 0.52.

If you're not into stats, here's what that means in plain English. The model can explain just over half of the performance differences it sees on new, held-out runs. Not perfect, but strong — especially for systems as messy as agent coordination. It's the difference between "I have a story" and "I have a signal." And of course, we keep the right attitude about models. George Box nailed it when he said: all models are wrong, but some are useful.


The Core Idea

Here's the core idea. We treat a single-agent system as the default baseline. Then we run multi-agent architectures on that exact same task, model, and tool setup, and we score everything as a delta versus that baseline. That tells you who's better in this setting.
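To make the paired-delta idea concrete, here's a minimal sketch in Python. The function names are hypothetical, not the repo's actual API; it just shows the bookkeeping described above.

```python
def performance_delta(mas_score: float, baseline_score: float) -> float:
    """Score one multi-agent run as a delta versus its paired single-agent
    baseline run on the same task, model, and tool configuration."""
    return mas_score - baseline_score

def mean_delta(paired_runs: list[tuple[float, float]]) -> float:
    """Average delta over a batch of (mas_score, baseline_score) pairs.
    Positive: the multi-agent architecture beat the baseline on average."""
    return sum(performance_delta(m, b) for m, b in paired_runs) / len(paired_runs)
```

So `mean_delta([(0.8, 0.6), (0.5, 0.7)])` is effectively zero: one win and an equal-sized loss cancel out, which is exactly the kind of "it feels stronger but isn't" result the baseline is there to catch.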

But the bigger-picture win is that we can start to anticipate behavior under scale, because once you're measuring the coordination predictors from real runs, you can reason about what's likely to happen as you dial up agent count and tool count — where a multi-agent system (MAS) should keep improving, where it should plateau, and where coordination dynamics are likely to overwhelm the gains and trigger collapse.

Right now, the implementation works out of the box with Claude Code primitives like Claude skills, so you can get moving immediately. But it's modular and can be migrated to other agent primitives without rewriting the whole system. I've also included my SWE-bench experiment and my finance agent tasks in the repo, along with a dashboard you can explore yourself.

So if you're building agents you want to actually ship and you want to know when multi-agent systems are worth the complexity, this is for you. Let's get into it.


System Architecture

Brain Cube Agent Labs is basically a measurement rig for agent architectures. At a high level, it has two halves that fit together. First, you run the arena to generate empirical data. Then you run the modeling layer on top of that to answer two questions.

Question One: Comparison

Is this multi-agent architecture actually better than a single agent on this task with this tool set? That's where the single-agent baseline comes in. We treat a single-agent system as the default anchor point, and we score everything else as a delta relative to that baseline. So you're not guessing whether a swarm "feels" stronger — you're measuring whether it actually outperforms the simplest alternative.

Question Two: Scaling Sensitivity

If I change the number of agents or the number of tools available, how do the coordination dynamics shift? This is where the scaling part becomes real, because it's not just who wins at one setting — it's how things move as you scale up.

The Scoring Model

Under the hood, the scoring model we start from is the paper's mixed-effects model. That model takes coordination predictors and task properties and produces a predicted performance score. What Brain Cube adds is an elasticity layer on top, so we can take the coordination predictors we measured at one setting and sensibly extrapolate how they would change as you vary agent count and tool count.

In other words: comparison tells you who is better at a point in time. Elasticities tell you how the metrics behave as you scale.
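A rough sketch of what an elasticity layer does, as I understand it. This is my own illustration of the constant-elasticity idea, not the repo's implementation:

```python
def extrapolate_metric(measured: float, n_measured: int, n_target: int,
                       elasticity: float) -> float:
    """Project a coordination metric (say, overhead) measured at one agent
    count out to another, under a constant-elasticity power law:

        metric(n) = metric(n0) * (n / n0) ** elasticity
    """
    return measured * (n_target / n_measured) ** elasticity
```

With unit elasticity, an overhead of 2.0 measured at 2 agents projects to 8.0 at 8 agents; with elasticity 0, it stays flat at 2.0. The elasticity is what the calibration batches estimate from real runs.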

Two Kinds of Batches

This is also why there are two kinds of batches you run. A compare-only batch is great for ranking architectures at a fixed configuration. But if you want to use the scaling laws tab and do sweeps across different agent counts and tool counts, you need an elasticity calibration batch, because the system needs controlled comparisons to estimate those scaling sensitivities.


How the Arena Works

Now, how does the arena actually run things? When you run a multi-agent architecture, the lab automatically runs a single-agent baseline first. Then it runs the multi-agent system with the exact same task, model, and tool configuration. That means every multi-agent result is paired to a baseline result, and you always have a clean apples-to-apples comparison. It also means each multi-agent configuration produces two saved runs: the baseline run and the multi-agent run linked to it.

Coordination Patterns

On the multi-agent side, Brain Cube supports a small set of canonical coordination patterns, because you want to explore different coordination dynamics — not just "more agents":

  • Independent — Agents work in parallel and you aggregate
  • Centralised — Workers draft and an orchestrator synthesizes
  • Decentralised — Agents exchange information across rounds and converge
  • Hybrid — Mixes assignment, peer rounds, and orchestration

Conceptually, these represent different points in the coordination trade-space, from minimal interaction to heavy interaction.

Tool Enforcement

Tools are handled in a way that keeps experiments honest. Tool access is enforced at runtime, not just described on paper, so an agent can't accidentally use tools it wasn't allowed to use. And every run records the tool set it was permitted to use, including a simple tool count value that becomes one of the scaling knobs later.

For scaling sweeps, the lab makes tool variation deterministic. Instead of manually cherry-picking different tool combinations, tool count maps to a fixed prefix of the default tool-belt ordering. So six tools always means the same core six, and eight tools always means that core set plus the extras. That consistency matters because otherwise you can't tell if you're seeing scaling behavior or just differences in tool choice.
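The fixed-prefix rule might look like this sketch. The tool names and ordering here are illustrative only; the repo defines its own default tool belt:

```python
DEFAULT_TOOL_ORDER = [
    # Illustrative ordering only; the repo's actual tool belt may differ.
    "read", "write", "edit", "glob", "grep", "search", "fetch", "web_search",
]

def tools_for_count(tool_count: int) -> list[str]:
    """Map a tool count to a fixed prefix of the default ordering, so
    'six tools' always means the same core six across every run."""
    if not 1 <= tool_count <= len(DEFAULT_TOOL_ORDER):
        raise ValueError(f"tool_count must be between 1 and {len(DEFAULT_TOOL_ORDER)}")
    return DEFAULT_TOOL_ORDER[:tool_count]
```

Because `tools_for_count(8)` extends `tools_for_count(6)` rather than replacing it, any difference you see between the two settings comes from scale, not from swapped tools.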

Default Experiment Shape

Finally, a quick intuition for defaults so you understand the shape of an experiment before you run it. A standard multi-agent run defaults to 3 agents. An elasticity run defaults to a small grid — 2 agent settings and 2 tool settings, which gives 4 multi-agent configurations per architecture. If you sweep all four architectures, that's 16 multi-agent configurations total. And because each one auto-runs a single-agent baseline, you end up with double the number of run records saved.
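The run-count arithmetic works out like this, a sketch of the default grid just described:

```python
# Default elasticity grid: 2 agent-count settings x 2 tool-count settings.
architectures = ["independent", "centralised", "decentralised", "hybrid"]
agent_settings = 2
tool_settings = 2

configs_per_architecture = agent_settings * tool_settings      # 4
mas_configs = len(architectures) * configs_per_architecture    # 16
total_saved_runs = mas_configs * 2  # each config auto-runs a paired baseline

print(configs_per_architecture, mas_configs, total_saved_runs)  # 4 16 32
```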

Summary

That's the conceptual system: run the arena to generate paired single-agent system and multi-agent system evidence, measure coordination metrics from real runs, then apply the paper's model plus the elasticity layer to compare architectures today and estimate how they behave as you scale.

Quick heads up before you run this: you'll need an Anthropic API key because the agent system templates are wrappers around the Claude Agent SDK — not Claude Code primitives. So when you run arenas and tests from the CLI, you're making direct API calls and you'll pay for the tokens you consume as you go.


Demo: Setting Up a Finance Agent Experiment

I want to show you just how easy it is to get up and running with an experiment in the Brain Cube Agent Lab. I have the lab set up to work with the Claude Code agent harness, so you can build your own experiments inside the lab just by asking Claude Code to build them.

The Finance Agent Benchmark

So I'm going to start off with this finance agent experiment, which is a benchmark that enables you to test how well an AI agent can work with financial problem sets that you might encounter in real life. It is an agentic task, and I think it's one of the tasks that was in the scaling laws paper that this agent lab is built around.

This benchmark links you to some of the publicly available data. To preserve the benchmark's efficacy, the authors don't release all of the training and test data publicly, but they do give you a public sample. So we'll work with that public sample, and I'll show you what it looks like. There are various questions — it's all available on GitHub, by the way. This is what the questions look like.

Task Construction

So then I go to task author, and the task author basically puts Claude Code in a mode to be able to start writing those tasks for you. So it wants us to provide a task name. I said: "I want you to construct a task from the Finance Agent benchmark from Vals AI. I have the data here. I'm providing the data, please select a subset of five questions."

The reason I'm limiting it to a subset of five questions is that some of these take a human expert 10 to 20 minutes to answer, so we'd be here all day with any more than that. I also said: "Please do not jump into the arena, just construct the task."

What Claude Code Does

In constructing this task, Claude Code will download that data and put it in a form our agents in the lab can use. It will also create the evaluation, plus the prompts for the multi-agent and single-agent systems we'll be using to benchmark multi-agent performance against single-agent performance. It does all of that from the information we've provided alone. And then we'll go ahead and run.

So as Claude works through this, you'll notice it's going off to GitHub and actually fetching that public data. It is then going to interpret that data and select a subset of those tasks — we asked for five questions, so it will just select five. And it will build the task dynamically. I'll pull over my GitHub desktop to show you the changes that the agent has made to build that task dynamically.

Custom Task Flexibility

But yeah, essentially I've set this Brain Cube lab up so you can create your own custom tasks when evaluating different agentic systems or architecture choices. The standard set from the paper is already in place — the single-agent system plus the multi-agent systems in all the flavors I mentioned earlier. The lab will evaluate those architectures, and from there you can gauge the change in performance as you scale from the single-agent architecture to the multi-agent ones.

Setting Up the Workstation

So yeah, this first step — constructing the task — comes after you've set up the workstation itself. To set up the workstation, you run a slash command and set up live runs. You need an Anthropic API key — that's one of the prerequisites. Without an API key, you won't be able to do live runs.

The repo is built on the Claude Agent SDK — Anthropic's agent SDK. When you clone it for yourself, you can change it around to use whatever SDK you want, but the initial repo is based on that, so you'll be using those primitives for any testing you do in the lab.

There is a dashboard as well on top of this, and that dashboard will provide you with a way to visualize the experiments you run and to see graphically how your multi-agent systems diverge from your single-agent systems.

Reviewing What Gets Built

Claude Code has finished building the task. You can see from the top what steps Claude Code actually took. I'm not going to go through all of the details of the code, but what I'm going to do is I'm going to show you high level what has been built.

First of all, it started by selecting five questions from the benchmark, and you can see it pulled from the Finance Agent GitHub repo from Vals AI. Out of that benchmark we got five questions, and it tried to select a range of expert answer times — presumably as a rough proxy for a range of difficulty.

Then there's how tasks are created in the Brain Cube lab. Claude Code has been set up with the Brain Cube lab scaffolding so that tasks are always created in a form the agent workflows can consume. So we have:

  • task.md — Specifies what the task is
  • instances.jsonl — A JSON Lines file that gives you the questions, the answers, and the scoring rubric
  • evaluator.py — Designed based on the task; this is how we're scoring the responses of the agent system against those answers
  • Prompts — What we're using to run the agentic systems. Claude comes up with the prompts itself for the single-agent system, orchestrators for the multi-agent systems that require orchestrators, and prompts for the worker agents that exist in those multi-agent systems too
  • Python scaffolding — Just to make sure that this is an executable task. That's it.

Inspecting the Output

And if I bring over my GitHub desktop, we can step into some of those things. Start with the prompts: you can see the prompt for the single-agent system. It's just a basic prompt, and it slots right into the existing scaffold for the single-agent system in the Brain Cube lab repo.

And then we have the task. Task is outlined clearly here with the source of the data, the goal, input format, output format — all of this is obviously written by Claude Code in the way that's most effective to get this thing to execute without any errors.

I'll also show you what the JSON Lines file looks like. It holds all of the questions and answers — you can see at the top the question, then the expected response. A JSON Lines file is just a stack of JSON objects, one per line, which is handy for tasks like this because you can step through the objects line by line.
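Stepping through a JSON Lines file really is that simple. Here's a small self-contained sketch; the field names are made up for illustration, not the exact schema the lab generates:

```python
import json
import os
import tempfile

# Hypothetical instances; the real file also carries the scoring rubric.
sample = [
    {"id": "q1", "question": "What was FY2023 revenue?", "expected": "$4.2B"},
    {"id": "q2", "question": "Compute the gross margin.", "expected": "61%"},
]

path = os.path.join(tempfile.mkdtemp(), "instances.jsonl")
with open(path, "w") as f:
    for obj in sample:
        f.write(json.dumps(obj) + "\n")  # one JSON object per line

def load_instances(p: str) -> list[dict]:
    """Read the file back, one JSON object per non-empty line."""
    with open(p) as f:
        return [json.loads(line) for line in f if line.strip()]

print(len(load_instances(path)))  # 2
```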

Bringing Your Own Data

That's how simple it is to get set up in the Brain Cube Agent Lab with your own tasks. You can bring your own data as well, but you'll need to communicate clearly to Claude Code what exactly is being evaluated and where the data lives; otherwise Claude Code will just create dummy data that may not align with your task set properly. So that's important to note: provide the data, or point to a source of data as I did with this finance agent benchmark, if you want this to run properly.


Running the Arena

Let's move on to the next step and see how this thing actually runs. We have our task in place; now we want to step into the arena. The arena is where we run the agent systems against the task, comparing multi-agent systems to the single-agent baseline and seeing how that comparison scales with the number of agents and tools.

This is a key detail: the arena is set up to test the agent systems across those different tasks, running them in different batches and under different scenarios.

Rate Limits Warning

One last detail: your Anthropic API tier does matter. If you're on tier one, I think you're going to hit a lot of rate limits doing this, so be mindful of that. I'm on tier three, and I've still had to add code to make sure I'm not hitting Anthropic's rate limits when running these evaluations, because Claude Code will happily run loads of them in parallel.

Launching the Arena

All right, let's jump into the arena. To run your evaluations, go to arena — just a simple slash command — and Claude will prompt us with the next steps. It's identified two possible tasks, and it knows we want to run finance bench, so it'll run a full arena protocol for finance bench. We've got our finance bench compare 2026 batch and our finance bench elasticity batch.

The elasticity batch is how we model things like overhead and efficiency and how they change with scale. We can't measure every point — these are live empirical measures, captured in real time as the systems execute. So what we do is observe how those measures change as we scale, and fit an elasticity model to them. That gives us a scale factor we can apply when modeling how these systems behave as you go from single agents to multi-agent setups with more and more agents.
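Fitting an elasticity to live measurements can be as simple as a least-squares slope in log-log space. This is my own minimal sketch of the idea, not the repo's model:

```python
import math

def fit_elasticity(points: list[tuple[float, float]]) -> float:
    """Fit a constant elasticity e from (scale, metric) measurements via
    least squares on log(metric) = a + e * log(scale)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(m) for _, m in points]
    k = len(points)
    mx, my = sum(xs) / k, sum(ys) / k
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Overhead of 1.0 at 2 agents and 4.0 at 4 agents: quadrupling when the
# agent count doubles implies an elasticity of 2.
print(round(fit_elasticity([(2, 1.0), (4, 4.0)]), 3))  # 2.0
```

Once you have the fitted elasticity, the scale factor `(n / n0) ** e` is what lets the dashboard project a metric measured at one configuration out to the agent counts you never ran.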

So this is exactly what the arena will do for us. It will start off with this base comparison level where we're just running a standard set, and then it will scale from there and we'll be able to fit models on top of that scaling.

The Modeling

There are more details about how the modeling works in the repo itself, so if you want to understand this in more detail, just simply ask Claude Code to explain the modeling to you. I have also linked the paper — the scaling paper — in the repo itself, so feel free to read that to get a better understanding of the modeling itself.

I'll just raise this again: all models are wrong, but some are useful, and this fits into that category. This is not intended to give you perfect predictions about how a multi-agent system is going to perform — that wouldn't really be possible. Instead, this is intended to give you the relative performance of a multi-agent system versus a single-agent system and whether that performance collapses as you start to scale.

And this is really key for enterprise, because in enterprise you often build something on a small pilot data set, it performs well, and then things fall over when you try to scale it. So hopefully this tool saves you time and money by helping you design architectures that will scale successfully, and by letting you spot the failure modes ahead of time — rather than building a solution on five tools, scaling it to 20, and watching it suddenly collapse. That's what this is meant to prevent.

Live Run

So you can see we've actually stepped into the single-agent system, so it's running live evaluations and collecting empirical data. This will take some time, so I'm going to come away from this and I'll cut back when this is finished.


Arena Results

Okay, so that did take a while, but we've finally run the arena for our finance agent task. And as I said, when I first set this up, we took a subset of five questions and we ran it over the different set of experiments to get our comparison batch — which I explained at the beginning of the video — and our elasticity batch.

What's really cool about the lab is that you'll get this final analysis telling you about the results of the experiment and giving you some key findings about how the different agent architectures scale. Not only that, you will also get a dashboard, and that dashboard is spun up automatically at the end of the arena and is available for you to inspect locally. And in that dashboard, there are various buttons that you can use to control the different scaling scenarios that you might want to visualize.

But yeah, the agent lab has been set up in such a way that you could in theory interact with any of the results delivered through the Claude Code interface, because all of the data is saved behind the scenes in the files. So the next thing I want to do is I just want to show you quickly how the agent lab files have been populated, then we'll talk through the results and I'll pull up the dashboard.

Exploring the Data Files

So the agent labs folder — if you start off in the project — within agent labs, you want to navigate to data/. This is where everything is updated, and then you want to navigate to your runs/.

Okay. So in the runs itself, this is where you can inspect at a granular level the data. And you can see what we've pulled in here for finance bench. This is probably a little small to see on screen, but you can see this folder says "finance bench, single agent system, Claude Haiku 4.5" and "finance bench, hybrid, Claude Haiku 4.5."

Let's step into the hybrid one, because there you'll start to see how the agent system actually derives its results. And because the agents run on the SDK, the work isn't consuming many tokens inside the Claude Code instance itself, which means you don't have to worry about context overflow and all of those usual concerns — the agent processing is offloaded to a remote server via the SDK.

Agent Traces

So within the folder, you have access to the agent traces themselves — obviously all for auditability. If we open this up, I'll pull one open in a text editor so you can see what that looks like. So if we pull up the agent trace for the orchestrator, we can see the types of messages the orchestrator was sending in this hybrid workflow. I won't go into details, but you can see it's basically communicating with all of the other agents.

This is pretty interesting for analyzing why these systems broke where they broke, if that's what you want to do. And remember, all of that context is right there for Claude Code, which makes it a genuinely useful tool for picking out why certain systems break down.

Evaluation Results

Let's step back up a level and look at the actual evaluation results. In the eval results you can see where we have passing and failing, and how these things are scored. We have our scores listed here:

  • A score of zero means it's completely wrong
  • Anything between zero and one is partially correct (labeled "partial match")
  • A score of one is completely correct

We have the error type — answer mismatch, partial match, or correct. Then there's the details, and an actual preview of the answer that was delivered.
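The score-to-label mapping is easy to mirror in code. A hedged sketch, where the exact label strings are my guesses based on what's visible in the results files:

```python
def classify(score: float) -> str:
    """Map an evaluator score in [0, 1] onto a result label."""
    if score <= 0.0:
        return "answer_mismatch"  # completely wrong
    if score >= 1.0:
        return "correct"          # completely correct
    return "partial_match"        # partially correct
```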

So you get the point. This is all going to be available to you. I'm just showing you this not to bog you down in details, but to show you that the agents were actually working behind the scenes to deliver that data. And it's structured in a way that is consistent, so you can work across many different experiments if you need to with this same scaffold.

Interpreting the Comparison Batch

Now, let's see how the results are presented. We've been given a comparison batch, which tells us, at a single point in the configuration space — same questions, same tools, with only the agent architecture differing — how well those agents did compared to baseline.

Don't take the probabilities too seriously by themselves — there's error in the predictive model. What we care about is the differences between the single-agent system and the multi-agent system, and that's where the dashboard comes in.

I find this analysis can be useful, but the main thing you want to do is pull up the dashboard and look at the analysis there, which makes the results much clearer to interpret. I wouldn't pay too much attention to the default key findings — if you ask the right questions, you can probably interrogate the data better than the default summary. Instead, jump into the dashboard.


Understanding the Dashboard

So this is the dashboard itself. The first thing to do is select the batch you want to work with — the elasticity batch, since elasticity is what lets us monitor scaling laws. So we'll select the elasticity batch for the finance agent test we ran.

Choosing Your Sweep Variable

By default, we put model intelligence on the X-axis, but I'm going to swap that out and instead of looking at the model intelligence as a sweep, let's look at the number of agents, because I find that is a good sweep to start understanding the patterns of scaling laws for multi-agent systems.

The Delta Performance Plot

This plot shows the delta in performance against the sweep variable at the bottom — in this case, number of agents. By delta in performance, we mean the performance difference using the single-agent system as a baseline. The theory: you start with the simplest system, a single agent, build a bunch of multi-agent systems, and compare their performance to that single-agent system as you scale in any direction.

Degrees of Freedom

We have a few directions we can scale in. Number of agents is down at the bottom, but we can also adjust tool count. Beyond that, there are several degrees of freedom we control and can scale:

  • Single-Agent Performance — A rough proxy for task difficulty: if a single agent performs well on a task, the task probably isn't that difficult; if it performs poorly, the task may be harder. It's only a proxy — it doesn't fit perfectly.
  • Intelligence Index — Tied to the models themselves. The index comes from the paper, which maps scores to the corresponding models — I think the source is provided in the source code itself.
  • Tool Count — Self-explanatory: the number of tools the agent has access to.
  • Number of Agents — How many agents participate in the multi-agent system.

Tool Count Nuance

There's a bit of nuance on the tool count parameter, especially on the default settings. We don't run tool count beyond six and eight tools because of the Claude primitives: unless you start adding custom tools, an agent needs a minimum of six — something like read, write, edit, glob, grep, and search. The other default tools are fetch and web search, basically.

So we run experiments to measure elasticities across six and eight tools. That means when you change the tool count here, you won't get very reliable results outside the six-to-eight range unless you add more tools yourself. Six, seven, or eight is the range you can use to understand the scaling laws on the defaults; if you add more tools than that, your scaling laws will cover a wider range.

So, you have your tool count there. Let's play around with it. We start at the minimum of six tools, looking at the performance of the multi-agent systems versus the single-agent baseline.


Key Findings from the Finance Agent Experiment

Collapse at Scale

As you scale the number of agents, performance deteriorates — it actually collapses. The flattening of the curve shows something close to total collapse of the system relative to the baseline, visible here in the delta in performance. And that matches an intuition taken from the paper.

Moving it up to eight tools, the pattern stays consistent. Eventually, as you scale the number of agents, you start paying for that tool complexity. You can carry a certain number of tools, but scale the agent count and the cost of that complexity catches up with you — and that's what we're seeing with this collapse at both six and eight tools.

Why that's significant: for this particular experiment, the additional two tools — web search, basically — let the agents use the internet to answer some of the questions, which is genuinely necessary for these finance questions; there's information you have to find online. That may explain why the collapse comes later, but it still comes as you scale the number of agents.

And you see that as we approach 10 agents, basically every multi-agent system ends up collapsing — deteriorating in performance versus the single-agent system.

The takeaway: if you have a system with a lot of tools and you start scaling your agents, you're going to pay for that orchestration complexity and eventually hit collapse. So the more tools you have, the more you may want to avoid running too many agents in parallel.

Again, this might be different for your particular task and the way it breaks down, but this is the intuition we're already getting from this task — this finance agent task that we ran.

Swapping the Axes

You can play around with a lot of these scenarios. You can swap the tool and agent axes around: instead of looking at the number of agents, put the number of tools on the X-axis and measure the delta that way. Let's recompute it like that.

This will obviously look a bit strange because we've only got two points for the tools. What I'd expect is that as we increase the number of agents, the collapse deepens. Let's see if it does. Yeah — as you increase the number of agents, the collapse deepens.

The reason you're seeing this recovery is that the larger tool set included internet search — an improved set of tools. But nonetheless, the collapse still occurs.

The Baseline Performance Insight

Something else interesting — let's swap this back, because I want to show you an insight from the paper. Put number of agents back on the X-axis, and let's adjust the single-agent performance to see how it changes things.

So notice: if we take the performance of the single agent down, weaker agents benefit from multi-agent systems even with the collapse dynamics and everything else we've been modeling. If the task is too complex for the agent, there does seem to be an uplift from multi-agent systems, and that seems to hold right across the different architectures.

And if you take the baseline performance back up, that uplift collapses back down toward the single-agent baseline. Push it right to the top and the collapse comes much sooner.

So ultimately, if your agent is already performing really well on a task, it doesn't make too much sense to add more agents. This is what this modeling would say — you get diminishing returns after that point, and it's probably going to be more expensive.

That was a really interesting insight — it originally came from reading the paper, but it's nice to see it replicated in my own experiments here too.


How the Modeling Works

How does this all work under the hood? I already talked about the coordination metrics. There's a lot of details in the repository itself under the modeling documentation about how that works. But essentially, what we're trying to do is measure the elasticity of those measured components that we talked about.

Measured vs. Established Inputs

So the measured components here are the ones that come from experimentation — these are the coordination components: overhead, message density, redundancy, efficiency, error amplification. These are measured from the experiment; they're measured from when we send the agent systems running.

Then there are the things we establish up front to kick everything off: the intelligence index, which comes from the model; the tool count, which we set; and the number of agents, which we set architecturally. So most of these are assumptions we choose, apart from the performance of the single agent — that's the baseline we measure.
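One way to picture the split between established and measured inputs. This is a hypothetical container for illustration, not a class from the repo:

```python
from dataclasses import dataclass

@dataclass
class ModelInputs:
    # Established up front, before any run:
    intelligence_index: float        # from the model being used
    tool_count: int                  # set by the experiment design
    num_agents: int                  # set architecturally
    # Measured from the paired baseline run:
    single_agent_performance: float
    # Measured live from the multi-agent run (coordination metrics):
    overhead: float
    message_density: float
    redundancy: float
    efficiency: float
    error_amplification: float

inputs = ModelInputs(0.62, 6, 3, 0.55, 1.4, 0.8, 0.3, 0.7, 1.1)
```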

The Mixed-Effects Model

So this is the mixed-effects model taken directly from the paper itself. I'm not going to go into details about the different model terms here, because I've done that in a separate video and am happy to point to that. Definitely watch that video if you want to get more detail about this. But all of the modeling assumptions are laid out right here within the dashboard itself.


Why This Matters for Enterprise

So yeah, this dashboard is obviously very useful if you want to build those multi-agent systems, or if you're building production agent systems and you're wondering if you can get a performance uplift from going from a single-agent system to a multi-agent system, or whether you should just go for the multi-agent system to start with.

This type of experimentation will save you a lot of time and money in the long run, especially if you're working on an enterprise budget and you're being held to deliverables and you cannot get past the delivery gate unless it meets a certain standard. And there's uncertainty as well, because you can only build on a subset of data, or you can only build on a subset of tools, and you know the real thing has to scale to multiple users.

This, I hope, will be a tool that can help many of you out there that are in that position.

Thank you, and I'll see you on the next one.