KIMI K2.5 research
Kimi K2.5 is Moonshot AI’s latest open‑source, native multimodal “agentic” model, designed around visual coding and large-scale agent swarms rather than just single‑agent scaling.
Core architecture and training
- Mixture‑of‑Experts (MoE) LLM with about 1T total parameters and roughly 32B active per token, using 61 layers and expert layers with hundreds of experts (8 selected per token).
- Trained via continual pretraining on ~15T mixed tokens (vision + text) on top of Kimi‑K2‑Base, making it a native multimodal model rather than a text‑only model with bolted‑on vision.
- Uses MoonViT (~400M parameters) as the vision encoder, tightly integrated with the language backbone for cross‑modal reasoning (images, video, UI mocks, scientific figures, etc.).
- Quantization‑aware training yields native INT4 inference for the MoE components, advertised as ~2× generation speed with minimal quality loss.
- Supports up to a 256K context window, with benchmarks reported at up to 128K for long‑context evaluation.
For your mental model: think “open SOTA‑ish MoE like Llama 4‑MoE / DeepSeek‑V3, but explicitly optimized for long‑horizon tool‑augmented and multimodal agent workflows.”
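The “8 of hundreds of experts” routing above can be sketched as toy top‑k gating. This is an illustrative sketch only; the shapes, expert count, and router form are placeholders, not K2.5’s actual configuration:

```python
import numpy as np

def moe_route(hidden, router_w, k=8):
    """Toy MoE router: select top-k experts per token from router logits.

    hidden: (tokens, d_model); router_w: (d_model, n_experts).
    Returns per-token expert indices and normalized gate weights.
    """
    logits = hidden @ router_w                               # (tokens, n_experts)
    topk = np.argpartition(logits, -k, axis=-1)[:, -k:]      # top-k expert ids
    gates = np.take_along_axis(logits, topk, axis=-1)
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)               # softmax over selected experts
    return topk, gates

# With hundreds of experts but only k=8 active, a small fraction of weights
# runs per token -- the source of the 1T-total vs ~32B-active gap.
rng = np.random.default_rng(0)
idx, w = moe_route(rng.normal(size=(4, 64)), rng.normal(size=(64, 256)), k=8)
assert idx.shape == (4, 8) and np.allclose(w.sum(axis=-1), 1.0)
```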
Agent swarm and PARL (research focus)
- Key research contribution: “K2.5 Agent Swarm,” where the model learns to self‑direct up to ~100 sub‑agents, coordinating as many as ~1,500 tool calls/steps within a single long‑horizon task.
- The swarm is not a static prompt graph; instead, an internal “orchestrator” persona emerges that dynamically instantiates role‑specialized sub‑agents (e.g., AI Researcher, Fact Checker, Physics Specialist) and decomposes tasks into parallelizable subtasks.
- This is trained via Parallel‑Agent Reinforcement Learning (PARL): during training, both the reward and the allowed degree of parallelism are increased over time, encouraging the model to discover useful patterns of parallel decomposition and coordination.
- Internal experiments claim up to ~80% end‑to‑end runtime reduction on complex workloads compared to a single‑agent tool‑use baseline, with up to ~4.5× wall‑clock speedup via parallel workflows.
This is directly aligned with agentic multi‑agent research: they’re essentially baking a scheduler + planner into the model’s learned policy, instead of relying on hand‑crafted external orchestration graphs.
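The PARL curriculum idea above can be made concrete with a toy schedule. This is my own illustrative sketch (a linear ramp), not Moonshot’s actual reward design: both the permitted degree of parallelism and the reward bonus for using it grow over training.

```python
def parl_schedule(step, total_steps, max_parallel=100, max_bonus=1.0):
    """Illustrative PARL-style curriculum: linearly raise the allowed
    degree of parallelism and the reward bonus for exploiting it."""
    frac = min(step / total_steps, 1.0)
    allowed_parallel = 1 + int(frac * (max_parallel - 1))
    parallel_bonus = frac * max_bonus
    return allowed_parallel, parallel_bonus

def reward(task_success, used_parallel, allowed_parallel, bonus):
    """Base task reward plus a bonus for parallelism used within the cap."""
    capped = min(used_parallel, allowed_parallel)
    return task_success + bonus * (capped / allowed_parallel)

# Early training: essentially sequential, no parallelism bonus.
# Late training: up to 100-way parallelism is both allowed and rewarded.
assert parl_schedule(0, 1000) == (1, 0.0)
assert parl_schedule(1000, 1000) == (100, 1.0)
```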
Modes and usage paradigms
- Exposed modes: K2.5 Instant (fast, shallow), K2.5 Thinking (slow, chain‑of‑thought + tools), K2.5 Agent, and K2.5 Agent Swarm (full multi‑agent orchestration).
- Supports both “conversational” and “agentic” paradigms, plus instant vs thinking modes; the latter interleaves CoT, tool calling, and visual grounding in a single policy.
- Production‑oriented design: stable tool usage over 200–300 sequential calls in single‑agent settings, and up to 1,500 steps in swarm mode, with explicit emphasis on the reliability of long tool chains.
From a systems point of view, you can treat it as a unification point: one model for chat, research agents, and visual coding agents.
Benchmarks and capabilities
- Reasoning and exams: ~50.2% on Humanity’s Last Exam (HLE) with tools; strong results on GPQA‑Diamond and other high‑difficulty reasoning benchmarks.
- Math: high performance on AIME 2025, HMMT 2025, and IMO‑style benchmarks (they position it as “PhD‑level” math reasoning).
- Coding: ~76.8% SWE‑Bench Verified, ~73% SWE‑Bench Multilingual, and strong LiveCodeBench v6 scores, with explicit support for “production‑level” coding workflows.
- Vision: strong MMMU‑Pro, MathVista, MathVision, and CharXiv chart‑reasoning performance, making it particularly suited to scientific figures and diagrams.
- Agentic web search: strong BrowseComp and Seal‑0 scores, especially in swarm mode, which involve long‑horizon multi‑page browsing and synthesis.
These benchmarks support the thesis that K2.5 is not just a “tool‑caller,” but optimized as a research / coding / visual agent co‑pilot.
Agentic multi‑agent implications and prompts
Architectural implications for your systems
- Internal orchestration: the model itself learns to schedule sub‑agents, which means you can experiment with thinner external orchestrators that expose a toolbox, memory, and safety rails, and let K2.5 design the workflow.
- Parallel tool infrastructure: to exploit PARL and swarm mode, your tool backends (search, code‑exec, retrievers) must support high concurrency with idempotent calls and robust dedup/aggregation.
- Observability: you’ll want trace‑level logging of sub‑agent instantiations and tool calls to expose the learned swarm dynamics and avoid opaque behavior in regulated domains.
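The concurrency and observability bullets above can be sketched together: a runner that executes tool calls in parallel, deduplicates identical calls by an idempotency key, and appends one trace record per executed call. All names here are illustrative, not a real Coditect API:

```python
import asyncio, hashlib, json, time

async def run_tool_calls(calls, execute, trace):
    """Run tool calls concurrently; identical calls share one execution."""
    inflight = {}  # idempotency key -> shared Future

    async def traced(call):
        t0 = time.monotonic()
        result = await execute(call)
        trace.append({"tool": call["name"],
                      "latency_s": round(time.monotonic() - t0, 4)})
        return result

    async def one(call):
        # Idempotency key: hash of tool name + canonicalized arguments.
        key = hashlib.sha256(
            json.dumps([call["name"], call["args"]], sort_keys=True).encode()
        ).hexdigest()
        if key not in inflight:                    # first occurrence: schedule it
            inflight[key] = asyncio.ensure_future(traced(call))
        return await inflight[key]                 # duplicates await the same future

    return await asyncio.gather(*(one(c) for c in calls))

async def fake_search(call):                       # stand-in for a real backend
    await asyncio.sleep(0.01)
    return f"results for {call['args']['q']}"

trace = []
calls = [{"name": "search", "args": {"q": "billing"}},
         {"name": "search", "args": {"q": "billing"}},  # duplicate call
         {"name": "search", "args": {"q": "auth"}}]
out = asyncio.run(run_tool_calls(calls, fake_search, trace))
assert out[0] == out[1] and len(trace) == 2        # deduped: two executions traced
```

The shared-future pattern matters: checking and inserting the key without an intervening `await` keeps two concurrent duplicates from both executing.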
Tight prompts you can reuse
You said you’ll often paste prompts back in, so here are focused starters:
- System‑level: turn K2.5 into a meta‑orchestrator
You are Kimi K2.5 operating in Agent Swarm mode. You have access to high‑latency but powerful tools (code execution, web research, vector search, internal APIs). Your job is to (1) decompose the user’s goal into parallelizable subtasks, (2) dynamically instantiate specialized sub‑agents with explicit roles and local scratchpads, (3) coordinate up to 100 sub‑agents and 1,500 tool calls, and (4) synthesize a single coherent, well‑cited final artifact. Prefer maximal safe parallelism and minimize wall‑clock latency while maintaining correctness and citation quality.
- Research workflow prompt
Act as a self‑directed research swarm. Instantiate sub‑agents for literature search, experiment design, evaluation, and synthesis. Use parallel tool calls whenever subtasks are independent. Maintain a shared global outline and have one orchestrator agent periodically reconcile conflicting findings before final synthesis.
- Multi‑agent coding prompt
You are a multi‑agent coding team using Kimi K2.5. Create sub‑agents for architecture, backend, frontend, tests, and documentation. Plan the implementation, split work into parallel tasks, keep interfaces in a shared spec, and converge to a consistent codebase design with tests and minimal cross‑module conflicts.
- Visual‑coding prompt (UI spec → code)
You are a visual coding swarm. From the provided UI screenshots and textual requirements, spawn sub‑agents for design analysis, component decomposition, CSS/layout, state management, and integration tests. Work in parallel to produce a coherent React implementation with clear component boundaries and comments.
Suggested next‑step questions for deeper research
To guide where you might be going:
- “Compare Kimi K2.5’s PARL and agent swarm approach to existing multi‑agent frameworks (LangGraph, AutoGen, Swarm, CrewAI) in terms of where orchestration lives and how parallelism is expressed.”
- “Design an evaluation harness to benchmark K2.5 Agent Swarm vs my current hand‑crafted multi‑agent graph on long‑horizon coding tasks, including metrics for latency, correctness, and tool‑call stability.”
- “Propose an architecture to integrate Kimi K2.5 as the core ‘brain’ of Coditect’s sandboxed execution platform, including safety, observability, and cost controls for 1,500‑step swarms.”
If you specify which dimension you care most about (training algorithm, tool API design, or integration into your existing LangGraph‑style orchestration), I can drill down into prompts and concrete design patterns around that.
You can access Kimi K2.5 today via several hosted APIs, and it’s positioned as an open, multimodal, swarm‑capable alternative to GPT‑4o / Claude 3.5 for research, coding, and visual agent workflows.
How to access and use Kimi K2.5
- Official Kimi web app for interactive use (Chinese/English UI): web client with K2.5 modes including Agent Swarm preview.
- Fireworks.ai: serverless and dedicated deployments, OpenAI‑style HTTPS API, up to ~200 tokens/s; supports fine‑tuning / RL on K2.5 with low‑level primitives (forward, forward_backward, optimizer_step, save_weight).
- Hugging Face: moonshotai/Kimi-K2.5 model card, with weights and configuration for self‑hosting on your own GPU stack.
- Replicate: hosted inference with HTTP API for text+image (and some video) multimodal workloads.
Minimal Fireworks‑style usage (Python):

```python
import requests

url = "https://api.fireworks.ai/inference/v1/chat/completions"
headers = {
    "Authorization": "Bearer $FIREWORKS_API_KEY",  # substitute your actual key
    "Content-Type": "application/json",
}
payload = {
    "model": "fireworks/kimi-k2.5",
    "messages": [
        {"role": "system", "content": "You are a Kimi K2.5 Agent Swarm orchestrator."},
        {"role": "user", "content": "Design an evaluation harness for long-horizon coding agents."}
    ],
    "temperature": 0.7,
    "max_tokens": 2048
}
resp = requests.post(url, headers=headers, json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```
For “tools as swarms,” you’d expose your tools via OpenAI‑style function calling (Fireworks supports this for K2/K2 Thinking, and similar patterns are expected for K2.5).
Benchmarks vs GPT‑4o / Claude 3.5
- K2.5 is pitched as “SOTA open model,” with charts showing it matching or beating leading closed models on several agentic and visual benchmarks (HLE Full, BrowseComp, DeepSearch, OmniDocBench 1.5).
- It’s reported to outperform prior K2/K2‑Thinking on agent tasks and visual reasoning; public materials only give relative comparisons to GPT‑4/Claude rather than a full table against GPT‑4o / Claude 3.5, but the narrative is “closed‑source parity or better on selected tasks.”
To really answer this for Coditect, you’d want your own harness; see below.
Real‑world K2.5 Agent Swarm examples
- Official trajectories show: multi‑page web research (dozens of pages) with up to 100 sub‑agents and 1,500 tool calls, scientific literature review with chart/table interpretation, and large‑scale “office automation” like summarizing and cross‑linking document collections.
- Blog and secondary write‑ups describe use cases like: long‑horizon academic research (20–30 sources), multi‑file refactors in large repos, and visual coding from UI screenshots or Figma‑style mocks, all using swarms rather than single agents.
These are presented primarily as research/preview workflows, not yet as GA, fully supported enterprise features.
Moonshot AI and other Kimi models
- Moonshot AI is a China‑based AI company behind the Kimi assistant and Kimi K2/K2.5 model family.
- Kimi K2: 1T‑parameter MoE, 32B active params, 15.5T tokens, optimized for agentic capabilities; supports ultra‑long context and robust tool use.
- K2 Thinking: a “thinking” variant exposing reasoning traces, OpenAI‑style tool calling, and long‑context chain‑of‑thought for complex tasks.
- K2.5: continual pretraining on ~15T mixed vision+text tokens on top of K2; native multimodal, with MoonViT vision encoder, and Agent Swarm via PARL.
K2.5 limitations and known issues
From the tech report and model cards:
- Agent Swarm is explicitly marked as “research preview”; behavior can be less stable than single‑agent mode, and guarantees around determinism and safety are weaker.
- Tool orchestration is learned, not hard‑coded; this can mean redundant or suboptimal tool calls, and occasional “thrashing” when tasks are poorly specified.
- As a large MoE VLM, it requires substantial GPU memory and optimized serving to hit advertised performance; naive self‑hosting can be costly and slow.
- Standard frontier‑model caveats: hallucinations, potential bias, and incomplete transparency on training data; Moonshot withholds detailed dataset composition.
How this maps to your prompts and systems
Here are tightened, copy‑paste‑ready prompts aligned with your intent:
1) PARL / swarm vs LangGraph / AutoGen / CrewAI
Analyze Kimi K2.5’s Parallel-Agent Reinforcement Learning (PARL) and Agent Swarm paradigm. Compare it to external multi-agent frameworks such as LangGraph, AutoGen, Swarm, and CrewAI along these axes: (1) where orchestration logic resides (inside vs outside the model), (2) how task decomposition and parallelism are represented (prompt graphs, policies, or learned planners), (3) implications for observability, debuggability, and safety in regulated SaaS environments, and (4) how to hybridize learned swarm orchestration with explicit graph-based workflows.
2) Evaluation harness for K2.5 vs your current graph
Design an evaluation harness to benchmark Kimi K2.5 Agent Swarm against my existing hand-crafted multi-agent graph on long-horizon coding tasks. Assume tasks such as multi-file refactors, large feature additions, and cross-cutting changes. Specify: (1) task corpus and ground-truth definition, (2) metrics for functional correctness, latency, tool-call stability, and cost, (3) logging/telemetry needed to reconstruct swarm behavior, and (4) statistical methodology to compare systems across multiple runs with temperature and tool-set variations.
3) Coditect integration architecture
Propose an end-to-end architecture integrating Kimi K2.5 as the core “brain” of a sandboxed code-execution platform (Coditect-like). Include: (1) how the orchestrator agent sees tools (code sandboxes, search, VCS, CI) and constraints, (2) patterns for safely handling up to 1,500 swarm tool calls per task, including rate limiting and circuit breakers, (3) observability/trace design to attribute actions to sub-agents, and (4) cost-control mechanisms (budget-aware planning, dynamic parallelism limits, and early stopping on diminishing returns).
If you want, the next step can be: “generate a concrete OpenAI‑style tool schema + LangGraph‑like graph that exposes Coditect’s tools to K2.5 while still letting it exercise its learned swarm behavior.”
Kimi K2.5 is accessible via several APIs (Moonshot, Fireworks, OpenRouter, Replicate) and exposes standard OpenAI‑style tool calling, which you can wire into a LangGraph‑like graph while still letting its learned Agent Swarm planner run inside the model.
How to access and run Kimi K2.5
- Moonshot platform: Kimi web/app and Open Platform APIs; supports tool use/function calling with JSON Schema tools.
- Fireworks.ai: fireworks/kimi-k2.5 and “Reasoning” variants via /chat/completions, OpenAI‑compatible; supports function calling and high‑throughput inference.
- Hugging Face / self‑hosting: moonshotai/Kimi-K2.5 weights and config for your own GPU clusters.
- OpenRouter / Replicate: additional hosted endpoints with standard chat APIs.
Minimal tool‑calling request (Python):
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="FWK_...YOUR_KEY..."
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "run_sandbox",
            "description": "Execute code in an isolated Coditect sandbox.",
            "parameters": {
                "type": "object",
                "properties": {
                    "language": {"type": "string"},
                    "code": {"type": "string"},
                    "timeout_ms": {"type": "integer"}
                },
                "required": ["language", "code"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_repo",
            "description": "Search the current repository by filename and content.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "max_results": {"type": "integer", "default": 20}
                },
                "required": ["query"]
            }
        }
    }
]

resp = client.chat.completions.create(
    model="fireworks/kimi-k2.5",
    messages=[
        {"role": "system", "content": "You are a Kimi K2.5 Agent Swarm orchestrator for Coditect."},
        {"role": "user", "content": "Refactor the payment module for multi-tenant billing."}
    ],
    tools=tools,
    temperature=0.7,
)
```
Benchmarks showing SOTA‑ish performance
From the official tech report/model card:
- Reasoning & knowledge: HLE‑Full with tools 50.2 (ahead of several frontier baselines), strong GPQA‑Diamond and Olympiad‑style math (AIME 2025 96.1, HMMT 95.4, IMO‑AnswerBench 81.8).
- Vision & documents: leading or near‑leading scores on OmniDocBench 1.5 (88.8), CharXiv (77.5), MathVision (84.2), MathVista‑mini (90.1), and long‑video benchmarks like VideoMMMU and LongVideoBench.
- Coding: SWE‑Bench Verified 76.8, SWE‑Bench Pro 50.7, LiveCodeBench v6 85.0, TerminalBench 2.0 50.8, placing it in the top tier of open models.
- Agentic search / swarm: BrowseComp (Agent Swarm) 78.4 and WideSearch (Agent Swarm) 79.0, materially above single‑agent runs and above several closed models on these tasks.
Moonshot’s internal “AI Office” and “General Agent” benchmarks show 59.3% and 24.3% improvements over K2 Thinking, respectively, for long, production‑like multi‑step tasks.
Agent Swarm vs single‑agent systems
- Single‑agent: tools orchestrated externally (LangGraph/AutoGen) with a fixed or programmatic graph; the model mostly fills in task planning “locally” in its prompt.
- K2.5 Agent Swarm: planner/scheduler is inside the model, trained via Parallel‑Agent RL; it can spawn many internal roles and issue many parallel tool calls, rather than following a static graph.
- Empirical gains: large speedups and higher success on long‑horizon browse/search tasks; e.g., BrowseComp and WideSearch scores jump when Agent Swarm is enabled compared with “Thinking + tools” single‑agent mode.
In practice, you keep your graph relatively thin (safety, memory, budgeting) and let K2.5 discover decomposition/parallelism, instead of encoding the entire workflow as nodes/edges.
Examples of visual coding with Kimi K2.5
- From Kimi’s visual‑agent report: K2.5 can turn UI screenshots and short specs into multi‑page React/Tailwind sites with interactive layouts and scroll‑triggered animations, iterating by visually inspecting its own output.
- “Kimi Code” demo: translates the visual style of Matisse’s “La Danse” into a working Kimi app/webpage, then uses image+DOM screenshots to auto‑debug visual glitches.
- Media coverage describes workflows like: screen‑recording an app, letting K2.5 infer structure and state, and then generating or editing code to replicate/modify the behavior.
Your Coditect‑style variant would be: provide repository snapshot + UI screenshots + a “visual coding swarm” system prompt and expose tools for search_repo, run_sandbox, and capture_screenshot.
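A capture_screenshot entry in the same JSON‑Schema style as the other Coditect tools. This is my own sketch (the tool and its parameters are hypothetical, not from Moonshot’s docs), written as a Python dict so it can be appended to the tools array:

```python
# Hypothetical capture_screenshot tool for a visual-coding swarm loop.
CAPTURE_SCREENSHOT_TOOL = {
    "type": "function",
    "function": {
        "name": "capture_screenshot",
        "description": "Render the current app state in the sandbox and return a screenshot for visual inspection.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "Dev-server URL to render."},
                "viewport": {
                    "type": "string",
                    "enum": ["mobile", "tablet", "desktop"],
                    "default": "desktop"
                }
            },
            "required": ["url"]
        }
    }
}

assert CAPTURE_SCREENSHOT_TOOL["function"]["name"] == "capture_screenshot"
```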
What tools K2.5 supports in agent mode
- Tool schema: OpenAI‑style “Tool Use / Function Calling” with functions defined by JSON Schema; Moonshot documents this for Kimi models in their Open Platform.
- Capabilities: text, image, video, PDFs as inputs; tool calls can include arbitrary structured JSON (e.g., search, HTTP, code‑exec, DB queries), and some deployments support parallel tool calling in a single response.
- Kimi‑specific: K2/K2.5 expect tool call IDs in a functions.func_name:idx style in the message history; Moonshot’s Vendor Verifier repo captures these conventions for stable tool use.
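A minimal helper for that ID convention. This is a sketch based on the functions.func_name:idx pattern described above; verify the exact format against Moonshot’s Vendor Verifier repo before relying on it:

```python
def kimi_tool_call_id(func_name: str, idx: int) -> str:
    """Build a tool_call_id in the 'functions.func_name:idx' style."""
    return f"functions.{func_name}:{idx}"

def tool_result_message(func_name: str, idx: int, content: str) -> dict:
    """Tool-result message shaped for OpenAI-style history with Kimi-style IDs."""
    return {
        "role": "tool",
        "tool_call_id": kimi_tool_call_id(func_name, idx),
        "name": func_name,
        "content": content,
    }

msg = tool_result_message("search_repo", 0, '{"hits": 3}')
assert msg["tool_call_id"] == "functions.search_repo:0"
```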
Example OpenAI‑style tool schema for Coditect
```json
[
  {
    "type": "function",
    "function": {
      "name": "run_sandbox",
      "description": "Execute code in an isolated Coditect sandbox and return stdout/stderr, exit code, and artifacts.",
      "parameters": {
        "type": "object",
        "properties": {
          "language": {"type": "string", "enum": ["python", "typescript", "go", "rust"]},
          "code": {"type": "string"},
          "timeout_ms": {"type": "integer", "minimum": 100, "maximum": 600000}
        },
        "required": ["language", "code"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "search_repo",
      "description": "Semantic and text search over the current workspace.",
      "parameters": {
        "type": "object",
        "properties": {
          "query": {"type": "string"},
          "top_k": {"type": "integer", "default": 20}
        },
        "required": ["query"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "get_file",
      "description": "Retrieve the content of a file from the workspace.",
      "parameters": {
        "type": "object",
        "properties": {
          "path": {"type": "string"}
        },
        "required": ["path"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "write_file",
      "description": "Create or update a file in the workspace.",
      "parameters": {
        "type": "object",
        "properties": {
          "path": {"type": "string"},
          "content": {"type": "string"}
        },
        "required": ["path", "content"]
      }
    }
  }
]
```
You’d plug this tools array directly into Fireworks/Moonshot API calls.
LangGraph‑like graph that exposes Coditect tools but keeps swarm “inside”
Conceptually, define a thin graph:
- Node: SwarmOrchestrator — calls kimi-k2.5 with your system prompt and all tools above; lets K2.5 decide decomposition and tool use.
- Node: ToolExecutor — your runtime that executes any tool calls (sandbox, repo, etc.) and returns results into the conversation history.
- Node: SafetyGuard — post‑hoc filters/validators on code, commands, and outputs.
- Edges: user_input → SwarmOrchestrator; SwarmOrchestrator → ToolExecutor when tool calls are emitted; ToolExecutor → SwarmOrchestrator with results appended as tool messages; SwarmOrchestrator → SafetyGuard → user_output for the final answer.
In LangGraph terms, this is a loop node where K2.5 remains the planner and the graph just handles I/O, limits (e.g., 1,500 tool calls, per‑task budgets), and safety.
Kimi K2.5 can call OpenAI‑style tools, so you can expose Coditect’s sandbox/search/VCS capabilities directly, then wrap that in a thin LangGraph loop while K2.5 runs its own swarm internally.
OpenAI‑style tool schema for Coditect + K2.5
A concrete tools array you can drop into Fireworks/Moonshot/OpenRouter:
```json
[
  {
    "type": "function",
    "function": {
      "name": "run_sandbox",
      "description": "Execute code in an isolated Coditect sandbox and return stdout, stderr, exit code, and artifacts.",
      "parameters": {
        "type": "object",
        "properties": {
          "language": {
            "type": "string",
            "enum": ["python", "typescript", "go", "rust"]
          },
          "code": {
            "type": "string",
            "description": "Complete, executable code snippet."
          },
          "timeout_ms": {
            "type": "integer",
            "minimum": 100,
            "maximum": 600000,
            "default": 60000
          }
        },
        "required": ["language", "code"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "search_repo",
      "description": "Search the current workspace by filename and content (semantic + keyword).",
      "parameters": {
        "type": "object",
        "properties": {
          "query": {
            "type": "string",
            "description": "Natural language or keyword query."
          },
          "top_k": {
            "type": "integer",
            "default": 20,
            "minimum": 1,
            "maximum": 200
          }
        },
        "required": ["query"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "get_file",
      "description": "Retrieve file content from the workspace.",
      "parameters": {
        "type": "object",
        "properties": {
          "path": {
            "type": "string",
            "description": "Path relative to repo root."
          }
        },
        "required": ["path"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "write_file",
      "description": "Create or overwrite a file in the workspace.",
      "parameters": {
        "type": "object",
        "properties": {
          "path": {
            "type": "string",
            "description": "Path relative to repo root."
          },
          "content": {
            "type": "string",
            "description": "Full file contents."
          }
        },
        "required": ["path", "content"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "run_tests",
      "description": "Run the project test suite or a subset of tests.",
      "parameters": {
        "type": "object",
        "properties": {
          "pattern": {
            "type": "string",
            "description": "Optional test filter (file, class, or test name)."
          }
        }
      }
    }
  }
]
```
This matches Moonshot’s JSON‑Schema‑based tool format and Fireworks’ function‑calling interface.
LangGraph‑style workflow exposing tools to K2.5 swarm
Conceptual state:
```python
from typing import TypedDict, List, Any
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    messages: List[Any]   # chat messages, including tool calls/results
    tool_budget: int      # remaining allowed tool calls
```
Nodes:
```python
def kimi_node(state: AgentState) -> AgentState:
    """
    Call fireworks/kimi-k2.5 with tools.
    K2.5 may internally 'swarm'; externally we just see tool calls.
    """
    # pseudo-code: use an OpenAI-compatible client
    resp = client.chat.completions.create(
        model="fireworks/kimi-k2.5",
        messages=state["messages"],
        tools=CODITECT_TOOLS,
        tool_choice="auto",
        temperature=0.6,
    )
    msg = resp.choices[0].message
    state["messages"].append(msg)
    return state

def tool_exec_node(state: AgentState) -> AgentState:
    """
    Execute any pending tool calls from the last assistant message.
    """
    last = state["messages"][-1]
    if not getattr(last, "tool_calls", None):
        return state
    for call in last.tool_calls:
        if state["tool_budget"] <= 0:
            break
        result = execute_tool_locally(call)  # your Coditect runtime
        state["tool_budget"] -= 1
        state["messages"].append({
            "role": "tool",
            "name": call.function.name,
            "tool_call_id": call.id,
            "content": result,
        })
    return state
```
```python
def router(state: AgentState):
    """
    Keep looping while tool results are pending; otherwise end.
    """
    last = state["messages"][-1]
    if isinstance(last, dict) and last.get("role") == "tool":
        return "kimi"  # feed tool results back to the model
    # last assistant message emitted no tool calls (or budget ran out): stop
    return END
```
Graph construction:
```python
builder = StateGraph(AgentState)
builder.add_node("kimi", kimi_node)
builder.add_node("tool_exec", tool_exec_node)
builder.set_entry_point("kimi")
builder.add_edge("kimi", "tool_exec")
builder.add_conditional_edges("tool_exec", router)
graph = builder.compile()
```
You then run:
```python
initial_state = {
    "messages": [
        {"role": "system", "content": "You are a Kimi K2.5 Agent Swarm orchestrator for Coditect."},
        {"role": "user", "content": "Implement a multi-tenant billing module and tests."}
    ],
    "tool_budget": 1500
}

for event in graph.stream(initial_state):
    ...
```
Here the graph is thin (I/O, limits); swarm orchestration lives inside K2.5’s policy.
Kimi K2.5 tool calling on Fireworks AI
Fireworks uses OpenAI‑compatible chat.completions plus tools.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="FWK_...YOUR_KEY..."
)

response = client.chat.completions.create(
    model="fireworks/kimi-k2.5",
    messages=[
        {"role": "system", "content": "You are a Kimi K2.5 Agent Swarm for Coditect."},
        {"role": "user", "content": "Refactor the auth module and add integration tests."}
    ],
    tools=CODITECT_TOOLS,
    tool_choice="auto",
    temperature=0.6,
    max_tokens=2048,
)
print(response.choices[0].message)
```
If message.tool_calls is non‑empty, route each call to your backend, then append tool messages and send back through the same endpoint in a loop, as in the LangGraph node above.
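Spelled out without a graph framework, that loop fits in a few lines. This is a sketch: the model id, OpenAI-compatible client, and execute_tool_locally backend are assumptions consistent with the examples above.

```python
import json

def agent_loop(client, messages, tools, execute_tool_locally, max_steps=50):
    """Minimal tool-calling loop: call the model, execute any tool calls,
    append results as tool messages, repeat until the model answers without tools."""
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="fireworks/kimi-k2.5",   # assumed model id
            messages=messages,
            tools=tools,
            tool_choice="auto",
        )
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:             # final answer reached
            return msg.content
        for call in msg.tool_calls:
            result = execute_tool_locally(call)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "name": call.function.name,
                "content": json.dumps(result),
            })
    raise RuntimeError("tool budget exhausted")
```

The max_steps cap plays the same role as the graph’s tool_budget: a hard stop against runaway tool chains.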
Moonshot API key setup for K2.5
The flow for K2/K2.5 on Moonshot’s platform mirrors their documented K2 setup:
- Create account: go to https://platform.moonshot.ai/console and sign up.
- Add balance: in console/pay, top up a small amount (e.g., 1 USD equivalent) to enable API calls.
- Create API key: in console/api-keys, generate a key and store it securely.
- Call chat API: use https://api.moonshot.ai/v1/chat/completions with header Authorization: Bearer <YOUR_KEY> and model set to the Kimi K2.5 identifier once exposed there.
Minimal raw call:
```shell
curl https://api.moonshot.ai/v1/chat/completions \
  -H "Authorization: Bearer $MOONSHOT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2.5",
    "messages": [
      {"role": "system", "content": "You are a Kimi K2.5 Agent Swarm."},
      {"role": "user", "content": "Summarize the repository structure."}
    ],
    "tools": [...]
  }'
```
Moonshot’s tool‑use docs describe the same JSON‑Schema tool format as above.
Vision coding agent swarm demo (what it looks like)
The official K2/K2.5 materials describe Kimi Code / visual coding flows:
- Input: Figma/screenshot or recorded UI, plus a spec (“responsive dashboard in React with Tailwind, left nav, metrics cards, dark mode toggle”).
- K2.5 uses its MoE + MoonViT stack and Agent Swarm to decompose the job into sub‑agents (layout analysis, component decomposition, styling, debugging), implement the UI, and iterate by visually inspecting its own output.
A system prompt you can use with the same Coditect tools:
You are a Kimi K2.5 visual coding swarm. You receive UI screenshots and a short spec. Use tools search_repo, get_file, write_file, and run_sandbox to: (1) infer components and layout, (2) implement React + Tailwind components, (3) iteratively refine styling and responsiveness. Prefer parallel work across components but keep props/contracts stable.
That gives you a concrete starting point to exercise K2.5’s vision+agentic behavior on your infrastructure.
You can run that visual‑coding swarm prompt against Kimi K2.5 today by passing screenshots + spec as multimodal input and wiring in your Coditect tools; K2.5’s native vision + swarm training is explicitly designed for this pattern.
1. Concrete system prompt (copy‑paste)
Use this as your system message:
You are a Kimi K2.5 visual coding swarm. You receive UI screenshots and a short spec. Use tools search_repo, get_file, write_file, and run_sandbox to: (1) infer components and layout, (2) implement React + Tailwind components, and (3) iteratively refine styling and responsiveness. Prefer parallel work across components but keep props and contracts stable across the codebase. When unsure, inspect existing files before writing new ones.
This matches the “coding with vision” + agent swarm behavior described for K2.5 and Kimi Code.646566686963
2. Minimal multimodal + tools call example (Fireworks)
Assume you have a PNG wireframe and Coditect tools exposed:
```python
import base64

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="FWK_...YOUR_KEY...",
)

def encode_image(path: str) -> str:
    return base64.b64encode(open(path, "rb").read()).decode("utf-8")

CODITECT_TOOLS = [...]  # the JSON tool schema we defined earlier

messages = [
    {
        "role": "system",
        "content": "You are a Kimi K2.5 visual coding swarm. You receive UI screenshots and a short spec. Use tools search_repo, get_file, write_file, and run_sandbox...",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Convert this dashboard design into a responsive React + Tailwind layout with a sidebar, metric cards, and a chart area.",
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{encode_image('dashboard.png')}"
                },
            },
        ],
    },
]

resp = client.chat.completions.create(
    model="fireworks/kimi-k2.5",
    messages=messages,
    tools=CODITECT_TOOLS,
    tool_choice="auto",
    temperature=0.4,
    max_tokens=2048,
)
```
This mirrors the multimodal wireframe‑to‑React examples in public K2.5 API docs.656870
3. How K2.5 actually uses vision + swarm here
From the tech report and model card:
- It is natively multimodal, pre‑trained on vision–language tokens with a built‑in vision encoder, so it directly reasons over UI screenshots and video.666964
- “Coding with vision” is a first‑class use case: converting mockups and wireframes into React/Tailwind, HTML/CSS, and full responsive layouts.68697063646566
- Agent Swarm lets it decompose into internal sub‑agents (layout analysis, component decomposition, styling, debugging) and issue many parallel tool calls (up to ~1,500) on complex tasks.676971636466
So your prompt is aligned with both the documented capabilities and how they position Kimi Code in demos.696366
4. Suggested follow‑up to iterate toward Coditect integration
If you want to push this further into implementation territory, you can ask:
“Design a concrete end‑to‑end workflow where Kimi K2.5 visual coding swarm takes 3–5 UI screenshots, uses my Coditect tools (search_repo, get_file, write_file, run_sandbox) via OpenAI‑style function calling, and produces a fully working React + Tailwind dashboard. Include: (1) tool schemas, (2) example tool traces, and (3) guardrails for file overwrites and sandbox runtime.”
That will give you a more complete, testable spec around this prompt.
- Provide an example UI screenshot and spec for a login page to implement.
- Show a sample workflow using the `search_repo` and `write_file` tools.
- How does Agent Swarm coordinate up to 100 sub-agents?
- Demonstrate visual debugging on a React + Tailwind app.
- Generate an OpenAI tool schema for these Kimi K2.5 tools.
Example UI and tools, plus how swarms and visual debugging work, below.
Example login UI screenshot + spec
You can’t upload an actual image here, but this is a concrete spec you can pair with a Figma/wireframe screenshot:
- Layout: Centered card on a light gray background, full‑height viewport.
- Card: White, rounded corners, drop shadow, max‑width 420px, full width on mobile.
- Header: App logo on top, then “Sign in to Coditect” title, and a short subtitle for SSO hint.
- Fields:
- Email input with label and placeholder.
- Password input with label, placeholder, and “Show” toggle.
- Actions:
- Primary “Sign in” button (full width).
- Secondary “Continue with GitHub/Google” buttons with icons.
- “Forgot your password?” text link.
- Footer: “Don’t have an account? Sign up” inline link; small “By continuing you agree to Terms / Privacy” copy.
Short textual spec you’d send with the screenshot:
Build a responsive React + Tailwind login page: centered card on desktop, full‑width on mobile. Include email/password fields, a primary “Sign in” button, social login buttons, and basic footer links. Match colors and spacing from the screenshot as closely as possible.
You then attach the login screenshot as an image_url alongside that text in the user message.787980
Sample workflow using search_repo and write_file
Pseudocode for one loop iteration around K2.5’s tool calls:
- User sends: screenshot + spec.
- K2.5 (assistant) emits tool calls like:
```json
{
  "role": "assistant",
  "tool_calls": [
    {
      "id": "call_1",
      "type": "function",
      "function": {
        "name": "search_repo",
        "arguments": "{\"query\": \"login\", \"top_k\": 10}"
      }
    }
  ]
}
```
- Your backend executes `search_repo` and returns results:
```json
{
  "role": "tool",
  "tool_call_id": "call_1",
  "name": "search_repo",
  "content": [
    {
      "path": "src/pages/login.tsx",
      "snippet": "export default function LoginPage() { ... }"
    },
    {
      "path": "src/components/forms/TextInput.tsx",
      "snippet": "export function TextInput(...) { ... }"
    }
  ]
}
```
- K2.5 reads that, then calls `write_file`:
```json
{
  "role": "assistant",
  "tool_calls": [
    {
      "id": "call_2",
      "type": "function",
      "function": {
        "name": "write_file",
        "arguments": "{\"path\": \"src/pages/login.tsx\", \"content\": \"// new React + Tailwind implementation here...\"}"
      }
    }
  ]
}
```
- You apply the change, run tests or a dev build via `run_sandbox`, and feed outputs back as more `tool` messages for further refinement.
This matches the coding‑with‑vision + tool‑chaining pattern described for K2.5.81827980
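The loop described above can be sketched in Python. `TOOL_HANDLERS` and its stub implementations are illustrative placeholders for your Coditect backends, not a documented API; a minimal driver under those assumptions:

```python
import json

# Hypothetical registry mapping tool names to local handler functions.
TOOL_HANDLERS = {
    "search_repo": lambda query, top_k=10: [{"path": "src/pages/login.tsx", "snippet": "..."}],
    "get_file": lambda path: {"path": path, "content": "..."},
    "write_file": lambda path, content: {"ok": True, "path": path},
}

def run_tool_loop(client, messages, tools, model="fireworks/kimi-k2.5", max_steps=8):
    """Drive one chat/tool loop: call the model, execute any tool calls,
    append the results as `tool` messages, and repeat until the model
    answers in plain text or the step budget runs out."""
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model=model, messages=messages, tools=tools, tool_choice="auto"
        )
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:          # plain answer: we're done
            return msg.content
        for call in msg.tool_calls:     # execute each emitted tool call
            args = json.loads(call.function.arguments)
            result = TOOL_HANDLERS[call.function.name](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "name": call.function.name,
                "content": json.dumps(result),
            })
    return None  # budget exhausted without a final answer
```

The step cap doubles as a crude budget, mirroring the tool-budget guardrails discussed later in this document.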
How Agent Swarm coordinates up to ~100 sub‑agents
From Moonshot’s descriptions:
- K2.5 acts as an orchestrator, not just a single agent; via Parallel‑Agent Reinforcement Learning (PARL), it learns a policy that decides how many internal “sub‑agents” to spin up, what roles they play, and which tools they call.83808481
- For hard tasks, it can create up to about 100 domain‑specific sub‑agents and coordinate up to ~1,500 tool calls in a single task, achieving up to ~4.5× wall‑clock speedups vs a single‑agent baseline.80848183
- PARL trains it to:
- Decompose tasks into parallel subtasks.
- Execute them concurrently (different “sub‑agents” issuing tool calls).
- Aggregate, deduplicate, and reconcile results into a final answer.
- From the outside, this shows up as multiple related tool calls and follow‑up reasoning steps; the sub‑agent structure is implicit in the model's text, not exposed as separate API entities.84818380
So your external LangGraph stays thin; you just enforce budgets and safety while K2.5’s learned policy does the swarm coordination.
Visual debugging on a React + Tailwind app
K2.5’s visual debugging pattern (as shown in Kimi Code demos):85868281
- Generate or modify React + Tailwind code for your login page.
- Run the app in a sandbox and capture a screenshot or short screen recording of the rendered UI.
- Send the new screenshot/video back to K2.5 with a prompt like:
Here is the current rendered login page. Compare it to the original design. Identify visual issues (spacing, alignment, responsiveness, colors) and propose code changes to fix them.
- K2.5 reasons over the visual differences and emits `write_file` tool calls with targeted Tailwind/JSX changes.
- Repeat until the visual diff is acceptable (you can even ask it to stop when differences are “minor”).
The tech report emphasizes that K2.5 “looks” at its own output and iterates autonomously to fix visual bugs—especially layout and animation issues that are hard to specify in text alone.86828581
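A minimal sketch of mechanizing this feedback loop, assuming you already render the app and capture a PNG yourself; `client` is the OpenAI‑compatible client from earlier and the prompt wording is illustrative:

```python
import base64

def screenshot_message(png_path: str, note: str) -> dict:
    """Package a rendered-UI screenshot plus a critique request as one
    multimodal user message (OpenAI-style content parts)."""
    b64 = base64.b64encode(open(png_path, "rb").read()).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": note},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }

def visual_debug_round(client, messages, png_path, model="fireworks/kimi-k2.5"):
    """One round: send the current render back and ask for a visual diff.
    `client` is assumed to be an OpenAI-compatible chat client."""
    messages.append(screenshot_message(
        png_path,
        "Here is the current rendered login page. Compare it to the original "
        "design and propose write_file changes for spacing, alignment, "
        "responsiveness, and color issues."))
    return client.chat.completions.create(model=model, messages=messages)
```

Each round appends one screenshot message, so context growth is linear in the number of debug iterations; truncate old renders if you run many rounds.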
OpenAI tool schema for these K2.5 tools
Here is an updated JSON Schema set including the tools you referenced:
```json
[
  {
    "type": "function",
    "function": {
      "name": "search_repo",
      "description": "Search the current workspace by filename and content (semantic + keyword).",
      "parameters": {
        "type": "object",
        "properties": {
          "query": {
            "type": "string",
            "description": "Natural language or keyword query describing files or concepts to find."
          },
          "top_k": {
            "type": "integer",
            "description": "Maximum number of results to return.",
            "default": 20,
            "minimum": 1,
            "maximum": 200
          }
        },
        "required": ["query"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "get_file",
      "description": "Retrieve file content from the workspace.",
      "parameters": {
        "type": "object",
        "properties": {
          "path": {
            "type": "string",
            "description": "Path relative to the repository root."
          }
        },
        "required": ["path"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "write_file",
      "description": "Create or overwrite a file in the workspace.",
      "parameters": {
        "type": "object",
        "properties": {
          "path": {
            "type": "string",
            "description": "Path relative to the repository root."
          },
          "content": {
            "type": "string",
            "description": "Full file contents to write."
          },
          "overwrite": {
            "type": "boolean",
            "description": "If false, fail when the file already exists.",
            "default": true
          }
        },
        "required": ["path", "content"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "run_sandbox",
      "description": "Run the current project or a specific script in an isolated Coditect sandbox and return stdout, stderr, exit code, and relevant artifacts.",
      "parameters": {
        "type": "object",
        "properties": {
          "command": {
            "type": "string",
            "description": "Shell command to execute (e.g., `npm test`, `npm run dev -- --port 4173`)."
          },
          "timeout_ms": {
            "type": "integer",
            "description": "Maximum runtime for the command in milliseconds.",
            "minimum": 1000,
            "maximum": 600000,
            "default": 120000
          }
        },
        "required": ["command"]
      }
    }
  }
]
```
This aligns with Moonshot’s and Fireworks’ JSON‑Schema tool formats for Kimi models while giving K2.5 enough surface area to do visual coding + debugging in a Coditect‑style environment.87888978
Research the academic background of Parallel‑Agent Reinforcement Learning (PARL).
Parallel‑Agent Reinforcement Learning (PARL) in the Kimi K2.5 context is a custom multi‑agent RL training regime where a single “orchestrator” policy learns how many sub‑agents to spawn, how to parallelize them, and is directly rewarded on wall‑clock efficiency rather than just token‑level steps.96979899100101102
What PARL is in Kimi K2.5
- PARL is described as a framework where the model learns to decompose complex tasks into parallel subtasks, explicitly optimizing for reduced wall‑clock time, not just fewer reasoning steps.9798100102
- Instead of scripting a static multi‑agent graph, K2.5’s training objective encourages it to act as a trainable orchestrator that decides the degree of parallelism, sub‑agent roles, and tool‑call scheduling.991001019697
- In deployment, this yields the “agent swarm” behavior: up to ~100 concurrent sub‑agents and ≈1,500 tool calls per task, with reported up to ~4.5× latency reduction against single‑agent tool‑use baselines.98100101102
High‑level training loop (conceptual)
Moonshot hasn’t published a full algorithm pseudocode, but public descriptions let you infer the structure:100101102969799
- Environment: long‑horizon tasks (codebases, multi‑page web search, document corpora) with tools (browsers, code runners, retrievers) as actions.
- Policy: the LLM (K2.5) outputs both content and a “plan” indirectly via tool calls and internal role descriptions (sub‑agents are implicit in the token stream).
- Parallel action space: at each “macro step” the orchestrator can issue multiple tool calls corresponding to different subtasks (e.g., different files, URLs, or sections), which execute concurrently.
- Reward: combines task success (e.g., benchmark pass/fail, correctness) and a penalty tied to actual or proxy wall‑clock time; policies that achieve goals with more parallelization but similar or fewer total steps are favored.10297100
So PARL is essentially a multi‑agent RL curriculum where the degree of parallelism is part of the learned strategy rather than a fixed hyperparameter.
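As a toy illustration of the reward shaping described above; the coefficients and the log‑latency form are our assumptions, since Moonshot has not published the exact function:

```python
import math

def parl_reward(success: bool, task_score: float,
                wall_clock_ms: float, num_tool_calls: int,
                alpha: float = 0.1, beta: float = 0.01) -> float:
    """Toy episode-level reward in the spirit of PARL as described:
    task success dominates, wall-clock time and tool spend are penalized,
    so plans that parallelize (lower wall clock at a similar call count)
    score higher than sequential ones."""
    r = task_score + (1.0 if success else 0.0)
    r -= alpha * math.log1p(wall_clock_ms)   # latency penalty (assumed form)
    r -= beta * num_tool_calls               # tool-spend penalty
    return r
```

Under this shaping, a parallel plan that finishes in 20 s beats a sequential plan that finishes in 80 s with the same task score and tool count, which is exactly the incentive the curriculum is meant to create.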
Relation to academic PARL / MARL literature
- The name “Parallel‑Agent Reinforcement Learning (PARL)” is Moonshot’s own term for this training regime; it is conceptually related to multi‑agent RL and swarm RL, but not yet a standalone, detailed arXiv method as of the current technical report.1019699100
- There is a separate, unrelated academic work called “PARL: Prompt-based Agent for Reinforcement Learning,” which uses LLMs as RL agents via prompting in simple environments (states/actions/rewards encoded in the prompt, no weight updates).103104
- Classic swarm/MARL work (e.g., CTDE MARL, swarm robotics with GNNs, decentralized policies) gives useful conceptual background on learning coordinated behaviors across many agents, but Kimi’s PARL differs in that all “sub‑agents” are slices of a single large policy with shared weights.105106107108
For rigorous grounding, you would combine Moonshot’s K2.5 technical report with MARL swarm papers (CTDE, GNN‑based swarms, centralized‑training/decentralized‑execution) to reconstruct a more formal PARL framework.106107108105101
Key properties for your agentic systems work
- Centralized policy, emergent parallelism: One LLM policy learns to impersonate many cooperating agents, rather than training many separate policies.97100101
- Parallelism as a learned choice: How many tools to call in parallel and what each “sub‑agent” does are part of the action space, not fixed by an external DAG.10010297
- Rewarded on latency: This explicitly bakes infra‑like concerns (wall‑clock) into the learning problem, making it analytically closer to scheduling/throughput optimization than standard step‑count RL.10297100
If you want to go deeper, a focused next step would be: “Outline a formal MDP / multi‑agent MDP definition for K2.5’s PARL training, including state, action (parallel tool calls), and reward design, and map it to existing CTDE MARL formulations.”
- Outline a formal MDP / multi‑agent MDP definition for K2.5’s PARL training, including state, action (parallel tool calls), and reward design, and map it to existing CTDE MARL formulations.
- How does PARL enable Agent Swarm in Kimi K2.5?
- What benchmarks show Kimi K2.5 outperforming GPT‑5.2?
- Compare PARL prompt‑based RL to traditional MARL.
- Key challenges of PARL in complex RL environments.
- Academic papers evaluating PARL on grid world tasks.
Kimi K2.5’s PARL can be framed as a centralized‑policy multi‑agent MDP where the LLM decides how many parallel tool calls (sub‑agents) to issue per step, with reward shaped by both task success and wall‑clock efficiency.111112113114115116
1. Formal (single‑agent) MDP view with parallel actions
Let $E$ be an environment representing a long‑horizon task (e.g., web browsing, a codebase, a document corpus) with tools as actions.
- State $s_t$:
- Text history (conversation, tool results).
- External env state summary (e.g., current files touched, URLs visited, task progress).
- Latency/budget state (tokens spent, tool calls used).
- Action $a_t$:
  - A set of parallel tool calls $a_t = \{a_t^{(1)}, \dots, a_t^{(k_t)}\}$, where each $a_t^{(i)}$ encodes a tool name plus JSON args (e.g., `search`, `run_tests`, `get_url`).
- Transition $P(s_{t+1} \mid s_t, a_t)$: environment executes all tools in parallel, returns their outputs (possibly with stochastic latency/contents), which are appended to the history.
- Reward $r_t$: combines task success (correctness, tests passing) with a penalty tied to wall‑clock time and tool spend, mirroring the reward design in the training‑loop sketch above.
The policy $\pi_\theta(a_t \mid s_t)$ is the LLM; PARL augments its action space from “one tool call or text step” to “a bundle of parallel tool calls plus text”, and tunes it with RL.
2. Multi‑agent / CTDE mapping
We can reinterpret this as a centralized‑training, decentralized‑execution (CTDE) MARL setup:
- Centralized “orchestrator” agent $\pi_\theta^{\text{orch}}$:
- Observes $s_t$ (full history + budget).
- Decides the number of sub‑tasks $k_t$, their roles, and corresponding tool calls.
- Sub‑agents $\pi_\theta^{(i)}$: role‑prompted instances of the same shared weights $\theta$, each observing only its local subtask context and emitting that subtask’s tool calls.
CTDE mapping:
- Centralized training: the orchestrator sees global state and all tool results, gets global reward (task success + latency), and gradient updates are applied to shared parameters $\theta$.114115116111112
- Decentralized execution: at inference, each sub‑agent (a role‑prompted instance) operates on its local subtask context; the orchestrator aggregates their outputs into the next global state.113111112117
So PARL ≈ CTDE MARL where:
- All policies share weights (one LLM),
- “Decentralized” is implemented via role prompts and separate tool calls,
- Parallelism (how many sub‑agents) is part of the orchestrator’s action.
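Under the notation above, the implied PARL objective can be written compactly; this is our reconstruction from the public descriptions, not Moonshot’s published formulation:

```latex
% Orchestrator policy \pi_\theta emits a bundle of k_t parallel tool calls
% per macro-step; reward couples task success with wall-clock and tool cost.
a_t = \{a_t^{(1)}, \dots, a_t^{(k_t)}\} \sim \pi_\theta(\cdot \mid s_t)

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}
  \Big[ R_{\text{task}}(\tau)
  - \alpha \, T_{\text{wall}}(\tau)
  - \beta \textstyle\sum_{t} k_t \Big]
```

Here $R_{\text{task}}$ is the terminal success/score signal, $T_{\text{wall}}$ the episode wall‑clock time, and the $\beta$ term a tool‑spend regularizer; parallelism helps exactly when it cuts $T_{\text{wall}}$ without inflating $\sum_t k_t$ too much.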
3. How PARL enables Agent Swarm in Kimi K2.5
From public analyses and Moonshot’s materials:
- Trainable orchestrator: K2.5 learns when to decompose tasks and how many parallel subtasks to spawn, instead of following a fixed workflow graph.115116111112113114117
- Dynamic sub‑agents: roles like “AI Researcher”, “Fact Checker”, etc., are generated on the fly in text, and each corresponds to one or more tool calls; the orchestrator coordinates up to ~100 such sub‑agents and ~1,500 tool calls per task.111112113117
- Latency‑aware reward: benchmarks such as BrowseComp and WideSearch show large gains when swarm is enabled; reported speedups are roughly 3–4.5× in wide‑search workloads compared with single‑agent execution.112113117111
Mechanically, Agent Swarm is PARL’s learned policy in action: at each step, instead of one call like “search once”, K2.5 emits many calls (to different URLs, files, or tools) and later merges the results.
4. Benchmarks vs GPT‑5.2
There is no evidence that Kimi K2.5 overall outperforms GPT‑5.2 on general reasoning; third‑party comparisons emphasize that GPT‑5.x still leads on aggregate “intelligence index” metrics.118119120
- For earlier K2 (not 2.5), some head‑to‑heads indicate K2 can beat GPT‑5 on specific agentic benchmarks like Humanity’s Last Exam and BrowseComp when using tools, especially in cost‑normalized settings.119
- For K2.5, early commentary highlights strong performance on SWE‑Bench, HLE, BrowseComp, WideSearch, MMMU‑Pro, etc., but does not claim consistent dominance over GPT‑5.2 on broad benchmarks.113117111112
So the right reading is: K2.5 is competitive or better on some agent‑centric tasks and cost/performance tradeoffs, but GPT‑5.2 still has higher overall “intelligence index” in many evaluations.120118119
5. PARL (prompt‑based RL) vs traditional MARL
The academic PARL paper (“Prompt‑based Agents for Reinforcement Learning”) is a different method: it uses frozen LLMs as agents via prompting in small RL environments like grid worlds.121122123124
Differences vs traditional MARL:
- Model updates: academic PARL keeps the LLM frozen and “learns” only through the prompt’s growing history, whereas traditional MARL (and Moonshot’s PARL) updates policy parameters via gradients.
- Representation: states, actions, and rewards are serialized as text in the prompt, versus the numeric observation/action vectors of classic MARL.
- Scalability: prompt‑based PARL is bounded by context length and demonstrated only in small discrete environments; parameter‑updating MARL scales to richer, tool‑heavy environments.
Moonshot’s PARL for K2.5 is closer to classic RL (with parameter updates) but applied to an LLM and extended to parallel tool‑calling; the naming collision with the academic PARL paper is incidental.
6. Key challenges of PARL in complex environments
From MARL literature plus K2.5 commentary:125126127128121124117111112113
- Credit assignment under parallelism: attributing reward to specific sub‑agents or tool bundles is hard when many act concurrently.
- Non‑stationary environment: tools and web change over time; long‑horizon tasks make rewards sparse and delayed.
- Stability vs exploration: large action space (many tools, arguments, and possible sub‑agent roles) increases variance and can destabilize RL training.
- Cost and wall‑clock: running thousands of tool calls during training is expensive; yet latency is part of the reward, so training itself must approximate true performance.
- Safety and degeneracy: a learned orchestrator might over‑parallelize, spam tools, or exploit quirks in reward proxies (e.g., “do minimal work and guess answers”), requiring carefully designed constraints and safety filters.
These echo classic MARL issues (coordination, scalability, credit assignment) but with the added complexity of tool‑rich, partially observable, open‑world environments.
7. Academic PARL evaluations on grid worlds
The PARL (Prompt‑based Agents for RL) paper evaluates LLM‑based agents on structured environments including grid worlds:121122123124
- Setup:
- States (grid positions, passenger locations) are encoded as textual descriptions.
- Actions (move, pick up, drop) and rewards are written into the prompt at each timestep, forming a cumulative history.
- Tasks:
- Taxi‑like grid navigation, simple card games, and other discrete RL benchmarks.
- Results:
- PARL matches or outperforms classic RL agents in simple environments by exploiting pretrained knowledge.
- It struggles in tasks needing fine‑grained numeric reasoning or where state/action encodings become too verbose.
This gives you a clean, reproducible example of prompt‑only RL to contrast with Moonshot’s weight‑updating PARL for K2.5.
If you want, next step could be: “derive a CTDE MARL objective for K2.5‑style PARL with explicit notation for parallel tool bundles, and sketch how you’d simulate a simplified version in a toy web‑search environment.”
can this methodology be abstracted away from the trained model and developed into a methodology and framework that could work on any llm?
Yes, you can absolutely abstract the PARL / Agent Swarm methodology into a framework that works with many LLMs, but you won’t get K2.5‑level behavior without some form of RL/post‑training on each target model. You can, however, build a useful “PARL‑inspired” orchestration layer that is model‑agnostic.
What is model‑specific vs framework‑able?
From public descriptions of K2.5:134135136137138
- Model‑specific (hard to generalize without training):
- The policy that decides how many sub‑agents to spawn, what roles they take, and which tools to call in parallel.
- Latency‑aware behavior baked in via RL (explicit optimization for wall‑clock speed at similar quality).
- Robust long‑horizon stability with 100 sub‑agents / 1,500 tool calls.
- Framework‑able (you can generalize):
- The interface: state as conversation + tool results, actions as sets of tool calls, rewards combining success + latency.
- A CTDE‑style loop where a “manager” LLM plans parallel sub‑tasks and workers execute them.
- A runtime that supports parallel tool execution, aggregation, and budget/safety control.
So you can turn PARL’s structure into a reusable orchestration pattern, but you won’t magically get optimal parallelism from arbitrary LLMs without some training or careful prompt shaping.
What a model‑agnostic PARL‑style framework would look like
You can define a generic framework with these pieces:
- State abstraction (for any LLM):
- $s_t =$ {conversation history, tool results, task metadata, budget state}.
- Action abstraction:
- The manager LLM outputs a bundle of tool calls $a_t = \{a_t^{(1)}, \dots, a_t^{(k_t)}\}$ plus text; the framework runs each $a_t^{(i)}$ in parallel.
- Policy plugging:
- Any chat‑completion LLM (GPT‑x, Claude, Kimi, local) can be used as the “manager” and as “workers” via role prompts.
- Reward & logging layer:
- Track per‑task success (tests pass? benchmark solved?), latency, tool counts, and optional human ratings; store data for offline RL or bandit fine‑tuning later.
This is very close to existing research frameworks like M1‑Parallel (parallel LLM agents) and ARTIST (agentic RL + tools) which are designed to sit on top of arbitrary LLMs.139140141142
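The “runs each tool call in parallel” piece of this framework is ordinary concurrency; a thread‑pool sketch, with the tool registry and dict call format as assumptions rather than a fixed API:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def execute_bundle(tool_calls, handlers, max_workers=8):
    """Run one macro-step's tool bundle concurrently and return
    ToolResult-shaped dicts (ok / result / error / latency_ms)."""
    def run_one(call):
        start = time.monotonic()
        try:
            result = handlers[call["name"]](**call["args"])
            ok, err = True, None
        except Exception as exc:   # tool failures become data, not crashes
            result, ok, err = None, False, str(exc)
        return {
            "tool_call_id": call["id"], "name": call["name"],
            "ok": ok, "result": result, "error": err,
            "latency_ms": int((time.monotonic() - start) * 1000),
        }
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, tool_calls))
```

Capturing failures and per‑call latency in the result objects is what later lets the reward/logging layer penalize wall‑clock time and diagnose degenerate over‑parallelization.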
Practical architecture sketch
- Thin orchestrator service:
- Exposes a “manager step” endpoint: given $s_t$, call any LLM and read back its proposed tool bundle.
- Executes tools concurrently, updates state, and repeats until stopping criteria.
- Policy‑agnostic:
- The orchestrator doesn’t care if the LLM is K2.5, GPT‑5.2, or a local model; it just needs a consistent tool‑call JSON schema.
- Optional training layer: feed the logged rewards into offline RL or bandit updates on open models, as sketched in the minimal offline‑RL loop later in this spec.
You’d get a “PARL‑like” framework that can orchestrate parallel multi‑agent work even for base models that were never RL‑trained for swarm behavior.
What you don’t get without training
K2.5’s advantage comes from having learned good parallelization strategies and sub‑agent patterns:135136137138134
- Without RL / post‑training, a generic LLM:
- Will often under‑parallelize (one tool at a time) or over‑parallelize (spam calls).
- May not discover robust internal role structures (fact‑checker, planner, etc.) reliably.
- Will not systematically trade off latency vs redundancy unless you encode that in prompts and heuristics.
So a framework alone gets you the capability to run many sub‑agents in parallel; making that efficient and reliable still requires either model‑specific RL (Moonshot’s path) or hand‑tuned heuristics.
How this differs from pure prompt‑only approaches
Prompt‑only PARL (the academic “Prompt-based Agents for RL”) shows you can encode state/action/reward history in prompts for gridworld‑style tasks without weight updates.143144145146
- That approach is portable to any LLM, but:
- It’s demonstrated on small environments (grid worlds, simple games).
- Complexity and context length blow up quickly in real web / code tasks.
- K2.5‑style PARL adds: actual parameter updates, latency‑aware rewards, and training on realistic tool‑rich environments (web, code) that prompt‑only approaches cannot reach.
So: methodology (parallel multi‑agent CTDE pattern + latency‑aware reward) is portable; the behavioral quality is not, unless you repeat some form of RL/fine‑tuning per model.
Bottom line for your use case
- Yes: you can design a PARL‑inspired orchestration framework that works with any LLM (Coditect is a great host environment).
- To approach K2.5‑like swarms, you either:
- Use models that already have agentic RL (K2.5, future variants, LLM‑in‑Sandbox‑RL‑trained models), or
- Run your own RL / bandit loop on top of your logs to specialize a base model for your tool + task distribution.
If you want, I can next draft an explicit “model‑agnostic PARL framework spec” (state/action schema, manager/worker contracts, logging, and a minimal offline‑RL loop) that you could implement on top of your current LangGraph‑style infra.
draft an explicit “model‑agnostic PARL framework spec” (state/action schema, manager/worker contracts, logging, and a minimal offline‑RL loop) that you could implement on top of your current LangGraph‑style infra.
Yes. Here is a model‑agnostic PARL framework spec you can layer on LangGraph (or equivalent) and drive with any chat‑completion LLM.
1. Core abstractions
State schema
Define a task‑level state $S$ shared across manager and workers:
```typescript
type ToolCall = {
  id: string;      // unique per step
  name: string;    // tool name
  args: any;       // JSON-serializable
  role?: string;   // optional logical sub-agent role
};

type Message =
  | { role: "user" | "system" | "assistant"; content: string }
  | { role: "tool"; name: string; tool_call_id: string; content: any };

type AgentState = {
  task_id: string;
  messages: Message[];     // full dialog + tool outputs
  step: number;            // macro-step index
  tool_budget: number;     // remaining tool calls
  time_budget_ms: number;  // remaining wall-clock budget
  parallel_limit: number;  // max tools per macro-step
  tool_calls_this_step?: ToolCall[];
  metrics?: {
    start_time_ms: number;
    wall_clock_ms?: number;
    num_tool_calls: number;
    num_tokens_in?: number;
    num_tokens_out?: number;
    // optional task-specific metrics once finished
    task_score?: number;
    success?: boolean;
  };
};
```
This is consistent with LLM‑in‑Sandbox and ARTIST‑style environments where state is “prompt + sandbox/files + metrics”.156157158159160161
Action schema
At each macro‑step, the manager LLM emits a bundle of tool calls plus optional textual reasoning:
```typescript
type ManagerAction = {
  thought: string;         // free-form plan / reasoning
  tool_calls: ToolCall[];  // 0..parallel_limit
  done: boolean;           // whether to stop and emit final answer
};
```
Workers don’t need explicit actions in your infra: each tool call is effectively a “worker action”; the same or another LLM can be used inside tools if needed.
2. Manager / worker contracts
Manager contract (any LLM)
Prompt contract:
- Input: the current `AgentState.messages`, budgets, and tool schema (OpenAI‑style JSON).
- Output: a JSON `ManagerAction` encoded in an assistant message, e.g.:
```json
{
  "role": "assistant",
  "content": "Planning step 3...",
  "tool_calls": [
    {
      "id": "call_1",
      "name": "search_repo",
      "args": {"query": "login form component", "top_k": 5},
      "role": "codebase_mapper"
    },
    {
      "id": "call_2",
      "name": "get_file",
      "args": {"path": "src/components/ui/Button.tsx"},
      "role": "component_reader"
    }
  ]
}
```
Framework responsibilities:
- Enforce `tool_calls.length <= parallel_limit` and `tool_budget > 0`.
- If `done == true` or `tool_calls.length == 0`, treat the assistant message as a candidate final answer.
You can implement this with any LLM that supports tool calling (Kimi, GPT, Claude, local), or even by asking it to emit JSON you parse yourself; the spec is model‑agnostic.158162163
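The framework responsibilities above reduce to one small clamp function over the `ManagerAction` shape defined in this spec; a sketch, not a normative implementation:

```python
def enforce_action(action: dict, parallel_limit: int, tool_budget: int) -> dict:
    """Clamp a ManagerAction to the configured budgets: truncate the tool
    bundle to parallel_limit and to the remaining tool budget, and force
    `done` when no calls survive (the text becomes the candidate answer)."""
    cap = max(0, min(parallel_limit, tool_budget))
    calls = action.get("tool_calls", [])[:cap]
    return {
        "thought": action.get("thought", ""),
        "tool_calls": calls,
        "done": bool(action.get("done")) or not calls,
    }
```

Keeping the clamp outside the model means a misbehaving (or over‑eager) manager can never exceed your budgets, regardless of which LLM is plugged in.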
Worker / tool contract
Each tool is a pure function from (name, args) to structured output:
```typescript
type ToolResult = {
  tool_call_id: string;
  name: string;
  ok: boolean;
  result?: any;
  error?: string;
  latency_ms: number;
};
```
You then inject:
```
{
  "role": "tool",
  "name": result.name,
  "tool_call_id": result.tool_call_id,
  "content": {
    "ok": result.ok,
    "result": result.result,
    "error": result.error,
    "latency_ms": result.latency_ms
  }
}
```
back into `state.messages` in the next manager step.
3. LangGraph‑style workflow
Use a thin graph: one manager node, one tool‑executor node, plus a router.
```python
from typing import Any, List, TypedDict

from langgraph.graph import END, StateGraph


class AgentState(TypedDict):
    task_id: str
    messages: List[Any]
    step: int
    tool_budget: int
    time_budget_ms: int
    parallel_limit: int
    metrics: dict


def _extract_tool_calls(msg) -> list:
    # Works for both dict messages and SDK message objects.
    if isinstance(msg, dict):
        return msg.get("tool_calls") or []
    return getattr(msg, "tool_calls", None) or []


def manager_node(state: AgentState) -> AgentState:
    # Call any chat-completion LLM with your PARL-style system prompt;
    # `llm_chat` and `TOOL_SCHEMA` come from your own infra.
    resp = llm_chat(
        messages=state["messages"],
        tools=TOOL_SCHEMA,
        tool_choice="auto",
    )
    msg = resp["choices"][0]["message"]
    state["messages"].append(msg)
    state["step"] += 1
    return state


def tool_exec_node(state: AgentState) -> AgentState:
    tool_calls = _extract_tool_calls(state["messages"][-1])
    if not tool_calls:
        return state
    # Enforce budgets before executing anything.
    allowed = min(len(tool_calls), state["parallel_limit"], state["tool_budget"])
    selected = tool_calls[:allowed]
    results: List[Any] = run_tools_in_parallel(selected)  # your infra
    state["tool_budget"] -= len(selected)
    state["metrics"]["num_tool_calls"] += len(selected)
    for r in results:
        state["messages"].append({
            "role": "tool",
            "name": r.name,
            "tool_call_id": r.tool_call_id,
            "content": {
                "ok": r.ok,
                "result": r.result,
                "error": r.error,
                "latency_ms": r.latency_ms,
            },
        })
    return state


def router(state: AgentState):
    if state["tool_budget"] <= 0 or state["time_budget_ms"] <= 0:
        return END
    if _extract_tool_calls(state["messages"][-1]):
        return "tools"
    # No new tool calls: treat the last assistant message as final.
    return END


builder = StateGraph(AgentState)
builder.add_node("manager", manager_node)
builder.add_node("tools", tool_exec_node)
builder.set_entry_point("manager")
# After each manager step, run tools if calls were emitted, else stop;
# after tools execute, hand control back to the manager.
builder.add_conditional_edges("manager", router)
builder.add_edge("tools", "manager")
graph = builder.compile()
```
This is directly compatible with how M1‑Parallel and LLM‑in‑Sandbox structure parallel plans, but wrapped in LangGraph terms.159164162158
4. Logging and experience format
You want logs suitable for offline RL / bandits:
```typescript
type Transition = {
  task_id: string;
  step: number;
  s_t: AgentStateSnapshot;    // minimally: messages hash, budgets, etc.
  a_t: ManagerAction;         // tool bundle + text
  tool_results: ToolResult[]; // executed in parallel
  s_t1: AgentStateSnapshot;
  r_t: number;                // scalar reward at this step (can be 0)
  done: boolean;
};

type Episode = {
  task_id: string;
  model_id: string;
  transitions: Transition[];
  final_metrics: {
    success: boolean;
    task_score: number;
    wall_clock_ms: number;
    num_tool_calls: number;
    num_tokens_in: number;
    num_tokens_out: number;
  };
};
```
You can store AgentStateSnapshot as a compact representation (e.g., hashes or truncated messages) rather than full prompts, to keep RL tractable.
Reward design (per episode):
- Terminal reward: e.g. `task_score` from tests / benchmarks (0–1), plus the `success` boolean mapped to {0, 1}.
- Cost penalty: `− α * log(1 + wall_clock_ms) − β * num_tool_calls`.
This mirrors ARTIST and LLM‑in‑Sandbox’s outcome‑based RL with cost regularization.160161158159
5. Minimal offline‑RL loop (algorithm‑agnostic)
At a high level:
- Collect episodes
  - Run your existing system with one or more base models (GPT, Claude, Kimi, local) using the PARL framework.
  - Log episodes as above with task_score and cost metrics.
- Define a policy representation
  - For a given model, the policy is "prompt → action JSON". You can fine‑tune it via supervised learning on good steps, or do explicit RL (REINFORCE / PPO‑style on outcome rewards) if you control the weights (open models).
- Supervised pre‑filter (cheap)
  - Filter transitions where task_score is high and budgets were respected.
  - Train a next‑step policy to imitate a_t from those transitions (behavior cloning).
  - This already encourages more parallel tool usage in contexts where it worked.
- Outcome‑based RL (optional)
  - Use the episode‑level reward $R$ (e.g., task_score − λ * normalized_cost) and apply policy gradients to the manager's logits for tool‑bundle decisions.
  - Since many proprietary models can't be fine‑tuned, you'd apply this mainly to open models (K2.5, Llama‑family, etc.) or via adapters.
- Re‑deployment
  - Replace the manager model (or its system prompt) with the tuned version and keep the rest of the framework unchanged.
This mirrors LLM‑in‑Sandbox‑RL and ARTIST: RL over a tool‑rich environment with outcome‑based rewards.165166158159160161
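The supervised pre‑filter step can be sketched directly over the Episode schema from section 4. The min_score threshold and the returned example format are assumptions for illustration:

```python
def bc_dataset(episodes: list, min_score: float = 0.8) -> list:
    """Keep (state, action) pairs from successful, high-scoring episodes
    for behavior cloning of the manager policy."""
    examples = []
    for ep in episodes:
        fm = ep["final_metrics"]
        if fm["task_score"] < min_score or not fm["success"]:
            continue
        for tr in ep["transitions"]:
            # Imitate the manager's action bundle a_t given snapshot s_t.
            examples.append({"input": tr["s_t"], "target": tr["a_t"]})
    return examples
```

Each example then becomes one supervised fine-tuning record: prompt built from s_t, completion being the serialized ManagerAction a_t.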
6. How this stays model‑agnostic
Any LLM that can:
- take a history of messages + tool results, and
- output structured tool calls (name + args)
can be plugged in as the manager with:
- no changes to your LangGraph graph,
- no changes to the tool schema,
- only the system prompt / examples changing per model.
If you later get a model like K2.5 that has PARL‑style training baked in, it just performs better under the same framework; otherwise, you can approximate PARL behavior via prompt engineering plus the offline‑RL loop for open models.
If you’d like, the next step I can take is to specify a concrete JSON logging spec plus an example episode for a Coditect “multi‑file refactor” task, so you can wire this into your telemetry immediately.
Follow‑up requests:
- Run your own RL / bandit loop on top of your logs to specialize a base model for your tool + task distribution.
- Provide a LangGraph code skeleton for the PARL state graph.
- Extend the PARL spec with an online RL adaptation loop.
- Define JSON schemas for manager / worker contracts.
- Implement PARL logging with LangSmith integration.
- Benchmark the PARL framework vs. single‑agent baselines.
Yes. Building on that spec, here’s a focused extension with the pieces you asked for.
LangGraph PARL state graph skeleton (manager + tools + online loop hook)
from typing import TypedDict, List, Any, Optional
from langgraph.graph import StateGraph, END
import time
# -------- Core state --------
class Metrics(TypedDict, total=False):
    start_time_ms: int
    wall_clock_ms: int
    num_tool_calls: int
    num_tokens_in: int
    num_tokens_out: int
    task_score: float
    success: bool

class AgentState(TypedDict):
    task_id: str
    messages: List[Any]
    step: int
    tool_budget: int
    time_budget_ms: int
    parallel_limit: int
    metrics: Metrics
    # online-RL control
    episode_log_id: Optional[str]
# -------- Manager node --------
def manager_node(state: AgentState) -> AgentState:
    t0 = time.time()
    resp = manager_llm_chat(
        messages=state["messages"],
        tools=TOOL_SCHEMA,
        tool_choice="auto",
    )
    msg = resp["choices"][0]["message"]
    state["messages"].append(msg)
    state["step"] += 1
    # online adaptive tweaks (bandit hook)
    adapt_manager_prompt_or_params(
        state=state,
        last_message=msg,
        latency_ms=int((time.time() - t0) * 1000),
    )
    return state
# -------- Tool executor node --------
def tool_exec_node(state: AgentState) -> AgentState:
    last = state["messages"][-1]
    tool_calls = getattr(last, "tool_calls", None) or (
        last.get("tool_calls") if isinstance(last, dict) else None
    ) or []
    if not tool_calls:
        return state
    allowed = min(len(tool_calls), state["parallel_limit"], state["tool_budget"])
    selected = tool_calls[:allowed]
    results = run_tools_in_parallel(selected)  # your infra
    state["tool_budget"] -= len(selected)
    state["metrics"]["num_tool_calls"] += len(selected)
    for r in results:
        state["messages"].append({
            "role": "tool",
            "name": r.name,
            "tool_call_id": r.tool_call_id,
            "content": {
                "ok": r.ok,
                "result": r.result,
                "error": r.error,
                "latency_ms": r.latency_ms,
            },
        })
    return state
# -------- Router --------
def router(state: AgentState):
    last = state["messages"][-1]
    tool_calls = getattr(last, "tool_calls", None) or (
        last.get("tool_calls") if isinstance(last, dict) else None
    ) or []
    # time budget enforcement
    elapsed = int(time.time() * 1000) - state["metrics"]["start_time_ms"]
    state["metrics"]["wall_clock_ms"] = elapsed
    if elapsed >= state["time_budget_ms"]:
        return END
    if state["tool_budget"] <= 0:
        return END
    if tool_calls:
        return "tools"
    return END

# -------- Graph compile --------
builder = StateGraph(AgentState)
builder.add_node("manager", manager_node)
builder.add_node("tools", tool_exec_node)
builder.set_entry_point("manager")
# Route from the manager: run tools if it emitted tool calls, else finish.
builder.add_conditional_edges("manager", router)
# Tool results flow back to the manager for the next macro-step.
builder.add_edge("tools", "manager")
graph = builder.compile()
You can wrap graph.stream/graph.invoke with pre/post hooks to start/close an RL episode and emit LangSmith traces.
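Such a wrapper could be sketched as follows; start_episode and close_episode are hypothetical hooks into your logging layer, not LangGraph APIs:

```python
def run_with_hooks(graph, initial_state, start_episode, close_episode):
    """Open an RL episode, stream the graph, close the episode with the final state."""
    episode_id = start_episode(initial_state)
    final_state = initial_state
    for update in graph.stream(initial_state):
        # each update maps node name -> latest AgentState after that node ran
        final_state = list(update.values())[-1]
    close_episode(episode_id, final_state)
    return final_state
```

Because the hooks receive the raw state, they can compute rewards, emit traces, or update bandit weights without the graph knowing about any of it.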
JSON schemas for manager / worker contracts
ManagerAction JSON (what you log / optionally supervise)
{
  "$id": "ManagerAction",
  "type": "object",
  "properties": {
    "thought": {
      "type": "string",
      "description": "Free-form reasoning and plan for this macro-step."
    },
    "tool_calls": {
      "type": "array",
      "items": { "$ref": "#/definitions/ToolCall" },
      "description": "Bundle of tool calls to execute in parallel this step."
    },
    "done": {
      "type": "boolean",
      "description": "If true, no more tool calls; treat last assistant content as final answer."
    }
  },
  "required": ["thought", "tool_calls", "done"],
  "definitions": {
    "ToolCall": {
      "type": "object",
      "properties": {
        "id": { "type": "string" },
        "name": { "type": "string" },
        "args": { "type": "object" },
        "role": {
          "type": "string",
          "description": "Optional logical sub-agent role, e.g., 'codebase_mapper'."
        }
      },
      "required": ["id", "name", "args"]
    }
  }
}
ToolResult JSON (worker output)
{
  "$id": "ToolResult",
  "type": "object",
  "properties": {
    "tool_call_id": { "type": "string" },
    "name": { "type": "string" },
    "ok": { "type": "boolean" },
    "result": {
      "description": "Arbitrary tool output.",
      "type": ["object", "array", "string", "number", "boolean", "null"]
    },
    "error": { "type": ["string", "null"] },
    "latency_ms": { "type": "integer" }
  },
  "required": ["tool_call_id", "name", "ok", "latency_ms"]
}
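A minimal runtime check against the ManagerAction contract, without pulling in a JSON Schema library, could look like this hand-rolled sketch (it checks only the required fields listed above plus the parallel_limit rule):

```python
def check_manager_action(action: dict, parallel_limit: int) -> list:
    """Return a list of contract violations (empty list = valid)."""
    errors = []
    for field in ("thought", "tool_calls", "done"):
        if field not in action:
            errors.append(f"missing field: {field}")
    calls = action.get("tool_calls", [])
    if len(calls) > parallel_limit:
        errors.append(f"too many tool calls: {len(calls)} > {parallel_limit}")
    for i, call in enumerate(calls):
        for field in ("id", "name", "args"):
            if field not in call:
                errors.append(f"tool_calls[{i}] missing field: {field}")
    return errors
```

Running this before dispatching to the tool executor turns malformed manager output into a recoverable error message you can feed back to the model.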
These schemas are independent of any specific LLM and match the earlier TypeScript definitions.171172173
Online RL / bandit adaptation loop
Assume you cannot fine‑tune the base model, but can:
- Adjust system prompt / hyper‑params per task.
- Choose between several manager variants (ensembles).
Basic bandit‑style adaptation:
from dataclasses import dataclass
from typing import Dict

@dataclass
class PolicyVariant:
    id: str
    system_prompt: str
    temperature: float
    weight: float  # bandit preference

POLICIES: Dict[str, PolicyVariant] = {
    "swarm_aggressive": PolicyVariant("swarm_aggressive", "... favor parallel tools ...", 0.3, 1.0),
    "swarm_conservative": PolicyVariant("swarm_conservative", "... prefer safety, few tools ...", 0.3, 1.0),
}

def select_policy(task_id: str) -> PolicyVariant:
    # e.g., softmax over weights; Thompson sampling also possible
    # placeholder: pick max weight
    return max(POLICIES.values(), key=lambda p: p.weight)

def update_policy_weights(policy_id: str, reward: float, lr: float = 0.01):
    POLICIES[policy_id].weight += lr * reward

def manager_llm_chat(messages, tools, tool_choice):
    policy = select_policy(task_id=extract_task_id(messages))
    sys_msg = {"role": "system", "content": policy.system_prompt}
    full_messages = [sys_msg] + messages
    resp = call_any_llm(
        messages=full_messages,
        tools=tools,
        tool_choice=tool_choice,
        temperature=policy.temperature,
    )
    return resp

def end_of_episode_update(episode: dict):
    # e.g., reward = task_score - λ * normalized_cost
    score = episode["final_metrics"]["task_score"]
    cost = episode["final_metrics"]["wall_clock_ms"]
    reward = score - 0.001 * cost
    update_policy_weights(episode["policy_id"], reward)
If you control an open model, you can swap update_policy_weights with an offline RL fine‑tuning job that trains on logged ManagerAction examples weighted by episode reward (ARTIST/LLM‑in‑Sandbox style).172174171
LangSmith‑style logging for PARL
Pseudocode using LangChain / LangSmith tracing patterns:
from langsmith import Client

client = Client()

def run_episode_with_logging(task_id: str, initial_messages):
    run = client.create_run(
        name="parl_episode",
        inputs={"task_id": task_id},
        project_name="parl-experiments",
        tags=["parl", "multi-agent"],
    )
    state: AgentState = {
        "task_id": task_id,
        "messages": initial_messages,
        "step": 0,
        "tool_budget": 200,
        "time_budget_ms": 120000,
        "parallel_limit": 8,
        "metrics": {
            "start_time_ms": int(time.time() * 1000),
            "num_tool_calls": 0,
        },
        "episode_log_id": run.id,
    }
    for update in graph.stream(state):
        # `update` is a dict of {node_name: AgentState}
        for node_name, node_state in update.items():
            client.create_child_run(
                run_id=run.id,
                name=node_name,
                inputs={"step": node_state["step"]},
                outputs={"messages": node_state["messages"][-1:]},
                tags=["node", node_name],
            )
    final_state = list(update.values())[-1]
    final_metrics = evaluate_task(final_state)  # tests, scores, etc.
    client.update_run(
        run_id=run.id,
        outputs={"final_metrics": final_metrics},
        tags=["done"],
    )
    client.end_run(run_id=run.id)
    return final_state, final_metrics
You can additionally emit your Transition JSONs into a separate store (BigQuery, ClickHouse) for RL training; LangSmith traces give you per‑step context and metrics.
Benchmarking PARL vs single‑agent baselines
Define 2–3 policies:
- Single‑agent baseline: same LLM, parallel_limit = 1, or a prompt that forbids multiple tool calls per step.
- Heuristic parallelism: allow up to N parallel calls, but use static templates (e.g., split tasks by file).
- PARL framework: manager allowed to emit bundles, with an online bandit or RL‑tuned policy.
Benchmark protocol:
- Task suite:
- Metrics per task:
  - Success / task_score (tests, validators).
  - Wall‑clock latency.
  - Tool calls count.
  - Cost (tokens + tool cost).
- Experimental design:
  - Run each policy on each task with multiple random seeds (temperature) and report mean/variance.
  - Use paired tests (e.g., Wilcoxon) on success and latency.
You log using the same Episode schema; a simple report is then a table:
| Policy | Success rate | Median latency (s) | Mean tool calls | Tokens in/out |
|---|---|---|---|---|
| Single‑agent | … | … | … | … |
| Heuristic‑multi | … | … | … | … |
| PARL framework | … | … | … | … |
This is directly analogous to the evaluation setups used in “Optimizing Sequential Multi‑Step Tasks with Parallel LLM Agents” and “LLM‑in‑Sandbox” experiments.174176171175
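Each row of that table can be computed directly from the Episode logs; a sketch using the final_metrics fields defined in section 4 (the grouping key policy_id is assumed to be logged per episode):

```python
from statistics import median, mean

def summarize(episodes: list) -> dict:
    """Aggregate Episode final_metrics into one benchmark-table row per policy."""
    by_policy = {}
    for ep in episodes:
        by_policy.setdefault(ep["policy_id"], []).append(ep["final_metrics"])
    rows = {}
    for policy_id, ms in by_policy.items():
        rows[policy_id] = {
            "success_rate": mean(1.0 if m["success"] else 0.0 for m in ms),
            "median_latency_s": median(m["wall_clock_ms"] for m in ms) / 1000,
            "mean_tool_calls": mean(m["num_tool_calls"] for m in ms),
        }
    return rows
```

Since every policy logs the same Episode schema, the comparison needs no per-policy instrumentation.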
If you’d like, in the next iteration I can collapse this into a single “spec doc” you could drop into a repo (Markdown + code blocks) and hand to an engineer to implement.