RECURSIVE LANGUAGE MODELS MIT paper
The paper you’re looking for is “Recursive Language Models” (often abbreviated RLMs), by a team at MIT CSAIL, released on arXiv with ID 2512.24601.
Core references
- ArXiv entry (abstract + PDF): https://arxiv.org/abs/2512.24601
- HTML version of the paper: https://arxiv.org/html/2512.24601v1
- Hugging Face “Papers” page (nice summary + discussion): https://huggingface.co/papers/2512.24601
One-paragraph technical gist
RLMs treat the entire “long prompt” as an external environment string (e.g., in a Python REPL) and have the LM write code to inspect, slice, search, and recursively call sub-LM instances over relevant snippets, instead of shoving all tokens into the Transformer context at once. This inference-time algorithm effectively scales usable context length to $10^7$+ tokens while mitigating context rot and often reducing cost compared to naive long-context or summarization/RAG scaffolds, especially on long-document QA and multi-step reasoning benchmarks. The paper frames this as an inference-time scaling paradigm: you keep the base LM fixed, but expose it to a programmable environment where it can recursively orchestrate its own calls over an unbounded context buffer.
Good follow-up prompts you can paste
- “Summarize the algorithmic architecture of MIT’s ‘Recursive Language Models’ (arXiv:2512.24601), focusing on the REPL environment design, recursion pattern, and how it interfaces with a base LLM.”
- “Derive a practical implementation plan for an RLM-style REPL around an existing LLM (e.g., GPT-4 class), including data structures, recursion control, and failure modes for ultra-long (10M+ token) contexts.”
- “Compare MIT RLMs (arXiv:2512.24601) to standard long-context approaches (extended context Transformers, retrieval + summarization, and tool-augmented agents) for regulated-industry workloads (compliance, audits, medical records). Present trade-offs in a table.”
- “Design an RLM-inspired multi-agent architecture for code understanding over a very large monorepo, including how agents share partial summaries, indexes, and recursion depth limits.”
- “Extract and explain the experimental setup and benchmarks used in the ‘Recursive Language Models’ paper (S-NIAH, BrowseComp-Plus, OOLONG, etc.), and relate them to practical enterprise use-cases.”
Main Contributions
The MIT CSAIL paper "Recursive Language Models" (arXiv:2512.24601) introduces RLMs, an inference-time paradigm that enables LLMs to process arbitrarily long prompts—up to two orders of magnitude beyond their native context windows—by treating the prompt as an external environment rather than feeding it directly into the Transformer.
Key contributions:
- Prompt-as-environment abstraction: The long prompt is stored as a string variable in a Python REPL; the LLM writes code to inspect, slice, and decompose it instead of ingesting all tokens at once.
- Recursive self-invocation: The LLM can programmatically spawn sub-LM calls on relevant snippets, enabling essentially unbounded reasoning chains.
- Task-agnostic scaffold: A single, fixed system prompt works across all benchmarks—no task-specific engineering required.
- Cost parity or reduction: Despite added complexity, median RLM query cost matches or beats base-model calls (e.g., $0.99 vs. $1.50–$2.75 for 6–11M tokens).
RLM Inference Algorithm (Step-by-Step)
- Initialize REPL – A Python REPL is launched and the full prompt $P$ is assigned to a variable.
- Expose metadata – The root LLM receives general context (e.g., `len(P)`) without seeing raw tokens.
- Code generation – The LLM writes code to probe $P$: regex searches, slicing, keyword filtering, or printing small samples.
- Execution & observation – The REPL executes the code; outputs (e.g., matched snippets) are returned to the LLM.
- Recursive sub-calls – When deeper reasoning is needed, the LLM constructs a sub-task string and invokes a sub-LM (or sub-RLM) on that snippet; results are stored in REPL variables.
- Iterate – Steps 3–5 repeat until the LLM decides it has enough information.
- Final answer – The LLM returns a string (or a REPL variable reference for very long outputs) wrapped in a `FINAL()`/`FINAL_VAR()` tag.
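The seven steps above can be condensed into a minimal root-loop sketch. This is a hedged illustration, not the paper's implementation: `root_lm` and `repl_exec` are hypothetical callables standing in for the model API and the sandboxed REPL, and only the literal `FINAL(...)` case is handled.

```python
import re

def run_rlm(P: str, root_lm, repl_exec, max_steps: int = 64) -> str:
    """Drive one RLM trajectory: metadata -> code -> execute -> observe -> FINAL."""
    history = [f"Environment ready: len(P) = {len(P)} chars."]  # step 2: metadata only
    for _ in range(max_steps):
        turn = root_lm("\n".join(history))         # step 3: model emits code or FINAL
        final = re.search(r"FINAL\((.*)\)", turn, re.S)
        if final:                                  # step 7: termination tag
            return final.group(1)
        history.append(repl_exec(turn))            # steps 4-6: execute, observe, iterate
    raise RuntimeError("no FINAL() within step budget")
```

A trajectory that probes `P` once and then finalizes would make two `root_lm` calls and one `repl_exec` call.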
Benchmark Comparison
| Benchmark | Task Length | Base GPT-5 | RLM (GPT-5) | Gain |
|---|---|---|---|---|
| CodeQA | 23K–4.2M | 24% | 62% | +38 pp |
| BrowseComp+ (1K docs) | 6–11M | 0%* | 91.3% | — |
| OOLONG | 131K | 44% | 56.5% | +12.5 pp |
| OOLONG-Pairs | 32K | 0.04 F1 | 58.0 F1 | — |
*Base model cannot fit the input.
RLMs also dramatically outperform CodeAct + BM25 and Summary-agent baselines on all tasks while maintaining comparable or lower cost.
Experimental Tasks & Datasets
| Task | Source | Complexity Scaling | What It Tests |
|---|---|---|---|
| S-NIAH | RULER | Constant | Single-needle retrieval in a large haystack |
| BrowseComp-Plus | chen2025 | Constant (multi-hop) | Deep-research QA over 1K docs (6–11M tokens) |
| OOLONG | bertsch2025 | Linear | Semantic aggregation over every line |
| OOLONG-Pairs | Custom | Quadratic | Pairwise semantic aggregation |
| LongBench-v2 CodeQA | bai2025 | Fixed | Multiple-choice repo understanding |
Computational Costs & Latency Trade-offs
| Aspect | Finding |
|---|---|
| Median cost | RLM median ≤ base-model median; up to 3× cheaper than the summary-agent because the LLM selectively views context. |
| Variance | High—trajectory length depends on task complexity; outlier runs can be significantly more expensive. |
| Latency | Sequential (blocking) sub-calls make current implementations slow; asynchronous sub-calls and sandboxed REPLs are noted as straightforward optimizations. |
| Token budget | RLMs avoid full-context ingestion; on BrowseComp-Plus the theoretical cost of GPT-5-mini ingesting 6–11M tokens is $1.50–$2.75, while RLM(GPT-5) averages $0.99. |
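The token-budget row is simple arithmetic: the $1.50–$2.75 range is consistent with an assumed input price of roughly $0.25 per million tokens. That rate is an assumption for illustration only, not a figure from the paper; check current provider pricing.

```python
# Assumed per-token pricing; reproduces the $1.50-$2.75 full-ingestion range above.
PRICE_PER_M_INPUT_USD = 0.25  # assumed input price per million tokens

def full_context_cost(tokens: int) -> float:
    """Cost of naively ingesting the whole context once."""
    return tokens / 1_000_000 * PRICE_PER_M_INPUT_USD

print(full_context_cost(6_000_000))   # 1.5
print(full_context_cost(11_000_000))  # 2.75
```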
Suggested Follow-up Prompts
- "Describe the REPL environment design in RLMs—what helper functions or modules are exposed, how is recursion depth controlled, and how does the model decide when to finalize?"
- "Compare RLM sub-calling strategies (chunking vs. keyword search vs. regex filtering) and their impact on cost and accuracy, citing trajectories from Appendix B."
- "Outline a production architecture for RLMs with asynchronous sub-calls and sandboxed execution, targeting SOC-2 / HIPAA workloads."
- "Analyze the negative results in Appendix A: why do smaller models fail as RLMs, and what coding capability thresholds appear necessary?"
- "Outline a production architecture for RLMs with asynchronous sub-calls and sandboxed execution, targeting SOC-2 / HIPAA workloads."
- "Describe the REPL environment design in RLMs—what helper functions or modules are exposed, how is recursion depth controlled, and how does the model decide when to finalize?"
RLMs in production for compliance workloads need two things the paper explicitly calls out as future work: async sub-calls and sandboxed execution around the REPL. Below is a concrete, implementation-level outline plus a summary of the REPL design from the paper.
REPL environment design
The paper instantiates RLMs by loading the entire input prompt $P$ into a Python REPL as a string variable, then letting the LM write and execute code against that environment.
Core structure
- A long prompt $P$ is set as a variable, e.g. `P = "<entire corpus or conversation>"`, in a persistent Python interpreter.
- The root LM is given metadata about the environment (e.g., the length of `P`, potentially document counts) plus instructions on how to interact with it via code; it does not see `P`'s tokens directly unless it chooses to print or slice.
- The REPL exposes a minimal module for calling a sub-LM from inside code, so model-written Python can do things like:
  - Slice `P` into `chunk = P[start:end]`.
  - Call `call_lm(chunk, task="summarize")` and store the result in variables.
Helper functions / modules (conceptual)
The paper describes a module “that allows it to query a sub-LM inside the environment” but does not standardize the full API. A practical design (aligned with their examples) is:
- Read-only prompt access: length, slicing, and regex search over `P`.
- Document / record utilities (if $P$ is structured):
  - `iter_docs()` or `get_doc(i)` for corpus-like inputs (BrowseComp+ style).
  - `split_by(delim: str) -> list[str]` to chunk by newline or delimiter.
- Sub-LM calls:
  - `lm_call(prompt: str, *, model: str = "sub", system_prompt: str | None = None, max_tokens: int = 4096) -> str` as the generic tool.
  - Optionally thin conveniences, e.g., `summarize(chunk)` or `extract_pairs(chunk)`, which still route through `lm_call`.
- Output management: RLM encourages using variables to accumulate intermediate and final answers, then returning them via special tags.
Recursion depth and control
The paper’s experiments explicitly cap recursion depth at one layer: sub-calls are to LMs, not further RLMs. In a production system you would:
- Track recursion context (depth, parent call ID) in the REPL or orchestrator.
- Hard-limit:
  - `MAX_RECURSION_DEPTH` (often 1–2).
  - `MAX_SUBCALLS_PER_ROOT` and `MAX_TOKENS_PER_SUBCALL`.
- Reject deeper calls at the helper level: `lm_call` checks the current depth and raises an exception or returns a “budget exhausted” message when limits are exceeded.
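That helper-level enforcement can be sketched as follows; the class and constant names are illustrative, not from the paper, and `backend` stands in for the actual provider call.

```python
class BudgetExhausted(Exception):
    """Raised when a sub-call would exceed session limits."""

MAX_RECURSION_DEPTH = 1      # paper-style: sub-calls are plain LMs, not further RLMs
MAX_SUBCALLS_PER_ROOT = 256

class SubcallGate:
    """Per-session gate that lm_call routes through before touching a model."""

    def __init__(self, depth: int = 0):
        self.depth = depth
        self.subcalls = 0

    def lm_call(self, prompt: str, backend) -> str:
        if self.depth >= MAX_RECURSION_DEPTH:
            raise BudgetExhausted("recursion depth limit reached")
        if self.subcalls >= MAX_SUBCALLS_PER_ROOT:
            # Soft failure the LM can observe and route around.
            return "budget_exhausted: no sub-calls left this session"
        self.subcalls += 1
        return backend(prompt)
```

Raising on depth (a hard invariant) while soft-failing on count (a budget the model can adapt to) is one reasonable split; either limit could use either style.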
Finalization protocol
The RLM distinguishes “thinking steps” from final answers by asking the LM to wrap answers in `FINAL()` or `FINAL_VAR()`.
- `FINAL(text)` – the final answer is the literal `text` string.
- `FINAL_VAR(var_name)` – the final answer is the value of a REPL variable (useful for very long outputs).
- The paper notes brittleness: models sometimes emit plans as `FINAL`, requiring guardrails.
- A production controller:
  - Parses each LM turn; if `FINAL`/`FINAL_VAR` appears and passes validation (e.g., `var_name` exists, length within bounds), the orchestrator ends the trajectory.
  - Otherwise, it treats the output as next-step code, executes it in the REPL, and returns stdout/stderr to the LM for the next step.
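A controller-side parse of that protocol might look like this; the tag names come from the paper, but the regex details and the "unknown variable means keep stepping" policy are assumptions.

```python
import re

def parse_finalization(turn: str, repl_vars: dict):
    """Return the final answer if the turn carries a valid FINAL tag, else None."""
    m = re.search(r"FINAL_VAR\(\s*([A-Za-z_]\w*)\s*\)", turn)
    if m:  # answer lives in a REPL variable (long outputs)
        return repl_vars.get(m.group(1))  # unknown variable -> None, keep stepping
    m = re.search(r"FINAL\((.*?)\)\s*$", turn, re.S)
    if m:  # answer is the literal argument
        return m.group(1)
    return None  # no tag: treat the turn as next-step code
```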
Production architecture with async sub-calls & sandboxing (SOC‑2 / HIPAA)
At a high level, you want an RLM “engine” service that wraps:
- A sandboxed REPL runtime per request.
- An async fan-out / gather layer for sub-LM calls.
- A policy and observability layer for security, compliance, and billing.
1. High-level component layout
- API Gateway / Frontend service
- Accepts RLM requests: `{root_model, prompt P, task_type, tenant_id, policies, budgets}`.
- Performs initial authn/z, per-tenant rate limits, and logs request metadata (no raw PII in centralized logs for HIPAA).
- RLM Orchestrator
- Creates an RLM session with:
  - A dedicated REPL instance (container / micro-VM).
  - A “root LM” channel (e.g., GPT‑5 as in the paper) and a “sub-LM pool” (e.g., GPT‑5-mini or an open model).
- Drives the loop:
  - Send “REPL state summary + last outputs” to the root LM.
  - Receive Python + `lm_call()` instructions or `FINAL(...)`.
  - Execute code inside the sandbox; when `lm_call()` is requested, enqueue sub-calls.
  - Collect results asynchronously, update REPL variables, repeat.
- Sub-LM Worker Pool (Async)
- Queue-based architecture (e.g., a `subcalls` topic).
- Workers call provider APIs (OpenAI, internal models, etc.), streaming results back to the orchestrator.
- Parallelizes independent sub-calls that, in the paper’s implementation, are synchronous and therefore slow.
- Secure Storage / Audit Layer
- Encrypted storage for:
  - Per-tenant configuration and keys.
  - RLM traces (REPL code, sub-call prompts, model outputs) needed for SOC‑2 audit and HIPAA traceability, but under strong access control and lifecycle policies.
- Redaction or structured separation of PHI so operational logs never store raw PHI where not strictly necessary.
2. Sandbox and isolation design
The paper suggests that sandboxed REPLs and asynchronous calls are straightforward improvements over their synchronous Python REPL implementation. For compliance:
- Runtime isolation
- Per-RLM session, launch:
  - A container / Firecracker micro-VM running a restricted Python environment.
  - No network access from inside the REPL except an internal RPC for `lm_call`, which is mediated by policy.
- Use seccomp/AppArmor and read-only mounted libraries; no filesystem writes except to a per-session encrypted temp volume.
- Code surface minimization
- Expose only:
  - An `rlm_env` module (with `len_P`, `get_slice`, `find`, corpus iterators, etc.).
  - Safe Python built-ins, `re`, `json`, and basic list/dict/string ops.
- Deny: `os`, `subprocess`, `socket`, and `open` (except where explicitly wrapped to a virtual, in-memory filesystem).
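As a toy illustration of that surface minimization—NOT a real sandbox; production isolation still needs the container/micro-VM layer above—model-written code can be executed against a whitelisted namespace:

```python
import re
import json

# Whitelisted surface only: no os, subprocess, socket, open, or __import__.
SAFE_GLOBALS = {
    "__builtins__": {"len": len, "range": range, "min": min,
                     "max": max, "sorted": sorted, "print": print},
    "re": re,
    "json": json,
}

def run_model_code(code: str, env: dict) -> dict:
    """Execute model code with only whitelisted names visible; return the scope."""
    scope = dict(SAFE_GLOBALS)
    scope.update(env)  # e.g., {"P": long_prompt}
    exec(code, scope)  # toy example only; do not rely on this for real isolation
    return scope
```

Because `__import__` is absent from the builtins dict, even `import os` fails inside the executed code; Python-level whitelisting like this is still escapable, which is why the OS-level sandbox remains mandatory.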
- PHI boundaries
- Treat `P` and any slices as PHI-containing; they never leave the sandbox except as:
  - Minimal snippets printed back to the LM (already inherent in RLM), subject to length caps and optional PHI detection.
  - Sub-LM calls that are strictly necessary to answer the user’s question; these must flow only to HIPAA-eligible model providers or your own hosted models.
- Data lifecycle
- Explicit TTL plus a deletion pipeline for:
  - REPL state (`P`, variables).
  - Raw traces.
- For HIPAA, ensure BAAs are in place with any third-party model providers that might receive PHI.
3. Asynchronous sub-calls in the RLM loop
The paper runs all LM calls sequentially and notes this as a source of slowness. A production system:
- Implements `lm_call()` as an async primitive:
  - In the REPL, `lm_call(prompt, ...)` enqueues a request ID and returns immediately with a “future-like” token.
  - The orchestrator:
    - Collects all outstanding sub-call tokens for the current step.
    - Fans them out to the sub-LM worker pool simultaneously.
    - When all are fulfilled or timed out, writes their outputs back into REPL variables and resumes the root LM.
- Provides structured APIs to keep code simple:
  - `future_id = lm_call_async(prompt, ...)` and `result = await_lm(future_id)`.
  - Or, in a more constrained pattern, a helper: `results = parallel_lm_map(chunks, task="summarize")`, internally implemented with multiple `lm_call` requests.
- Enforces budget and fairness:
  - `lm_call` tracks total tokens per session, per user, and per tenant, plus max concurrent sub-calls per session and per tenant.
  - If limits are exceeded, `lm_call` raises an exception or returns a structured “budget_exhausted” response that the LM must handle.
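The fan-out/gather step can be sketched with `asyncio`. Here `parallel_lm_map` is the hypothetical helper named above and `sub_lm` stands in for the worker-pool call; the semaphore implements the per-session concurrency cap.

```python
import asyncio

async def parallel_lm_map(chunks, sub_lm, max_concurrency: int = 8):
    """Fan independent sub-calls out concurrently; gather results in input order."""
    sem = asyncio.Semaphore(max_concurrency)  # per-session concurrency cap

    async def one(chunk):
        async with sem:
            return await sub_lm(chunk)

    # gather preserves input order even though calls complete out of order
    return await asyncio.gather(*(one(c) for c in chunks))
```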
4. SOC‑2 / HIPAA controls and observability
To meet SOC‑2 and HIPAA expectations around long-context RLM traces:
- Deterministic policies
- Per-tenant config for:
  - Allowed model families (e.g., only in-house / HIPAA-compliant models).
  - Max context-slice length visible to any sub-call.
  - Redaction rules (e.g., regexes for MRNs, SSNs, etc., to be masked before sub-calls).
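A redaction pass of that shape might look like the following; the SSN/MRN patterns are illustrative only, not production-grade PHI detection.

```python
import re

# Illustrative masking rules applied to text before it leaves the sandbox in a
# sub-call; real deployments should layer a vetted PHI-detection service on top.
REDACTION_RULES = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),    # US SSN shape
    (re.compile(r"\bMRN[-: ]?\d{6,10}\b"), "[MRN]"),    # toy medical-record-number shape
]

def redact(text: str) -> str:
    """Mask every rule match before the text reaches a sub-LM or a log line."""
    for pattern, mask in REDACTION_RULES:
        text = pattern.sub(mask, text)
    return text
```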
- Comprehensive tracing
- Log (to a secure, access-controlled store):
  - RLM session metadata (tenant, request ID, timestamps).
  - Every `lm_call` with: hash of the prompt, optional partial view, token counts, model used, latency, cost.
  - REPL events (code cells, exceptions, finalization).
- Use structured logs to support:
  - SOC‑2 control evidence: change management, incident investigations.
  - HIPAA: who saw what, when, under which authorization.
- Guardrails and validation
- For each `FINAL`/`FINAL_VAR`:
  - Validate type and size.
  - Optional domain validators (e.g., JSON schema, FHIR resource validation).
- For healthcare: prefer structured responses (FHIR JSON) over free text; the RLM can still use unstructured reasoning internally.
How the model “decides” to finalize (practically)
The paper’s mechanism is purely prompt-driven: the system prompt instructs the model to wrap its final answer in `FINAL()` or `FINAL_VAR()`, and the orchestrator interprets that as termination. In practice you likely want:
- State machine on the orchestrator side
  - States: `THINKING`, `CANDIDATE_FINAL`, `FINAL`.
  - On `FINAL(...)`:
    - Move to `CANDIDATE_FINAL`; run validators (size, shape, policy checks).
    - If all pass, transition to `FINAL` and stop.
    - If not, send a constrained follow-up prompt: “Your previous response violated rule X; provide a corrected FINAL(...) answer adhering to policy Y.”
- Heuristics for robustness
  - If the LM outputs both code and `FINAL()` in the same turn, apply a deterministic rule (e.g., ignore the code, or only honor a `FINAL` near the end).
  - Hard stop on max steps / tokens: if the step limit is reached with no valid `FINAL`, either:
    - Return a best-effort intermediate summary, clearly marked.
    - Or fail closed, depending on tenant policy.
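The validator leg of that state machine reduces to a small pure function; the validator shape (predicate plus rule name) and retry wording are assumptions.

```python
# Orchestrator-side finalization check: CANDIDATE_FINAL either promotes to FINAL
# or falls back to THINKING with a constrained retry prompt.
def handle_final(candidate: str, validators, max_len: int = 100_000):
    """Return (next_state, payload): the answer on FINAL, a retry prompt otherwise."""
    if len(candidate) > max_len:
        return "THINKING", "Answer too long; provide a corrected FINAL(...) answer."
    for check in validators:
        ok, rule = check(candidate)
        if not ok:
            return ("THINKING",
                    f"Your previous response violated rule {rule}; "
                    "provide a corrected FINAL(...) answer.")
    return "FINAL", candidate
```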
If you want, the next step can be a concrete YAML/TypeScript-level spec of the RLM Orchestrator API (including CreateSession, Step, SubcallResult streams) tailored to your existing Coditect multi-agent infra.
CREATE concrete MARKDOWN, JSON, YAML/TypeScript-level SDD TDD ARD specification of the RLM Orchestrator API (including CreateSession, Step, SubcallResult streams) tailored to your existing Coditect multi-agent infra.
Below is a concrete, copy‑pasteable spec for an RLM Orchestrator suitable for Coditect‑style multi‑agent infra, with Markdown, JSON, and YAML/TypeScript views.
1. High‑level API surface (Markdown SDD)
The RLM Orchestrator exposes three primary concepts:
- CreateSession: start an RLM session over a long prompt $P$ in a sandboxed REPL.
- Step: advance the RLM trajectory (execute model‑generated code, schedule sub‑calls, resume when sub‑results arrive).
- SubcallResult stream: asynchronous delivery of sub‑LM call completions back into the orchestrator and REPL.
1.1 Core invariants
- Each session has a dedicated sandboxed REPL (holding $P$), its own policy/budget state, and an audit trail.
- Recursion depth is capped (typically 1) and enforced at the API level.
- Finalization is signaled via `FINAL()` or `FINAL_VAR()` in the root LM output; the orchestrator validates and closes the session.
2. JSON API contract (TDD‑style)
2.1 CreateSession
HTTP
POST /v1/rlm/sessions
Request (JSON)
{
"tenant_id": "tenant-123",
"session_id": "optional-client-session-id-uuid",
"root_model": "gpt-5",
"sub_model_default": "gpt-5-mini",
"prompt": "LONG_INPUT_P",
"task_type": "generic_qa",
"policies": {
"max_recursion_depth": 1,
"max_steps": 64,
"max_subcalls_total": 256,
"max_tokens_per_subcall": 4096,
"max_visible_slice_chars": 32000,
"allow_network_tools": false
},
"compliance": {
"hipaa": true,
"soc2": true,
"data_region": "us-central1",
"phi_present": true
},
"metadata": {
"request_id": "ext-req-id",
"source": "coditect-pipeline",
"labels": {
"env": "prod"
}
}
}
Response
{
"session_id": "srv-rlm-7c3f4c01",
"state": "RUNNING",
"created_at": "2026-01-13T07:58:00Z",
"limits": {
"max_recursion_depth": 1,
"max_steps": 64,
"max_subcalls_total": 256,
"max_tokens_per_subcall": 4096
},
"repl_info": {
"language": "python",
"env_version": "rlm-repl-v1",
"sandbox_id": "sbx-0f12d8e3"
}
}
2.2 Step (pull‑driven)
Advances the RLM; can be used in either:
- Polling mode: the client calls `Step` until `status = FINAL`.
- Server‑push mode: `Step` returns incremental events, while subcall completions also arrive through a stream.
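Polling mode reduces to a simple client loop. The transport is injected so the sketch stays self-contained; `step_fn` stands in for `POST /v1/rlm/sessions/{session_id}/step` from the contract below.

```python
def poll_until_final(step_fn, max_steps: int = 64):
    """Call Step until the session reports FINAL; return the final payload."""
    for _ in range(max_steps):
        resp = step_fn({"mode": "AUTO"})   # one Step request in AUTO mode
        if resp["status"] == "FINAL":
            return resp["final"]
    raise TimeoutError("session did not finalize within max_steps")
```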
HTTP
POST /v1/rlm/sessions/{session_id}/step
Request
{
"client_step_id": "cstep-001",
"mode": "AUTO",
"max_root_tokens": 2048,
"timeout_ms": 120000
}
Response
{
"session_id": "srv-rlm-7c3f4c01",
"server_step_id": "sstep-012",
"status": "THINKING",
"phase": "ROOT_LM",
"events": [
{
"type": "ROOT_LM_OUTPUT",
"root_call_id": "root-010",
"content": {
"raw_text": "```python\nfrom rlm_env import len_P, get_slice, lm_call\n...\n```",
"parsed": {
"code_blocks": [
{
"language": "python",
"code": "from rlm_env import len_P, get_slice, lm_call\n..."
}
],
"final_call": null
}
},
"usage": {
"prompt_tokens": 400,
"completion_tokens": 260
}
},
{
"type": "REPL_EXECUTION",
"execution_id": "exec-048",
"status": "OK",
"stdout": "",
"stderr": "",
"new_variables": [
"snippet_1",
"candidate_answer"
],
"scheduled_subcalls": [
{
"subcall_id": "sub-123",
"model": "gpt-5-mini",
"prompt_preview": "Summarize this snippet ...",
"recursion_depth": 1
}
]
}
],
"pending_subcalls": [
"sub-123"
],
"metrics": {
"steps_used": 12,
"subcalls_used": 37,
"total_tokens_root": 7200,
"total_tokens_subcalls": 41200
}
}
If the root LM emitted a final answer:
{
"session_id": "srv-rlm-7c3f4c01",
"server_step_id": "sstep-020",
"status": "FINAL",
"phase": "DONE",
"final": {
"type": "FINAL_TEXT",
"text": "The answer is Maria Dalmacio.",
"validated": true
},
"metrics": {
"steps_used": 20,
"subcalls_used": 52,
"total_tokens_root": 10200,
"total_tokens_subcalls": 56000
}
}
2.3 SubcallResult stream
Sub‑LM calls are not invoked directly by clients; they are emitted from `REPL_EXECUTION` events and fulfilled by an internal worker pool.
Assume an internal gRPC/WS channel:
{
"type": "SUBCALL_RESULT",
"session_id": "srv-rlm-7c3f4c01",
"subcall_id": "sub-123",
"model": "gpt-5-mini",
"status": "COMPLETED",
"output": {
"text": "The stew is called pinakbet, and the pageant winner is Maria Dalmacio.",
"tool_calls": []
},
"usage": {
"prompt_tokens": 900,
"completion_tokens": 180
},
"error": null,
"received_at": "2026-01-13T07:58:22Z"
}
The orchestrator:
- Writes this result into the REPL (e.g., into a variable `sub_123_result`).
- On the next `Step`, exposes a short REPL state summary plus the last outputs to the root LM.
3. YAML / TypeScript SDD + ARD (interfaces & flows)
3.1 TypeScript domain types
// rlm-types.ts
export type RlmStatus = "RUNNING" | "FINAL" | "FAILED";
export interface ComplianceConfig {
hipaa: boolean;
soc2: boolean;
dataRegion: string;
phiPresent: boolean;
}
export interface PolicyConfig {
maxRecursionDepth: number; // e.g., 1
maxSteps: number; // e.g., 64
maxSubcallsTotal: number; // e.g., 256
maxTokensPerSubcall: number; // e.g., 4096
maxVisibleSliceChars: number; // safety on printing P
allowNetworkTools: boolean;
}
export interface CreateSessionRequest {
tenantId: string;
sessionId?: string;
rootModel: string;
subModelDefault: string;
prompt: string;
taskType: string;
policies: PolicyConfig;
compliance: ComplianceConfig;
metadata?: Record<string, unknown>;
}
export interface CreateSessionResponse {
sessionId: string;
  state: RlmStatus; // "RUNNING" is already a member of RlmStatus
createdAt: string;
limits: PolicyConfig;
replInfo: {
language: "python";
envVersion: string;
sandboxId: string;
};
}
export type StepMode = "AUTO" | "ROOT_ONLY" | "REPL_ONLY";
export interface StepRequest {
clientStepId?: string;
mode: StepMode;
maxRootTokens?: number;
timeoutMs?: number;
}
export type RlmEventType =
| "ROOT_LM_OUTPUT"
| "REPL_EXECUTION"
| "SUBCALL_SCHEDULED"
| "BUDGET_EXCEEDED"
| "ERROR";
export interface RootLmParsedContent {
codeBlocks: { language: string; code: string }[];
finalCall: null | {
kind: "FINAL_TEXT" | "FINAL_VAR";
argument: string;
};
}
export interface RlmEventRootLmOutput {
type: "ROOT_LM_OUTPUT";
rootCallId: string;
content: {
rawText: string;
parsed: RootLmParsedContent;
};
usage: {
promptTokens: number;
completionTokens: number;
};
}
export interface ScheduledSubcall {
subcallId: string;
model: string;
promptPreview: string;
recursionDepth: number;
}
export interface RlmEventReplExecution {
type: "REPL_EXECUTION";
executionId: string;
status: "OK" | "ERROR";
stdout: string;
stderr: string;
newVariables: string[];
scheduledSubcalls: ScheduledSubcall[];
}
export type RlmEvent =
| RlmEventRootLmOutput
| RlmEventReplExecution
| {
type: "BUDGET_EXCEEDED";
reason: string;
}
| {
type: "ERROR";
message: string;
};
export interface StepResponse {
sessionId: string;
serverStepId: string;
  status: RlmStatus; // "RUNNING" and "FAILED" are already members of RlmStatus
phase: "ROOT_LM" | "REPL" | "WAITING_SUBCALLS" | "DONE";
events: RlmEvent[];
pendingSubcalls: string[];
final?: {
type: "FINAL_TEXT" | "FINAL_VAR";
text?: string;
varName?: string;
validated: boolean;
};
metrics: {
stepsUsed: number;
subcallsUsed: number;
totalTokensRoot: number;
totalTokensSubcalls: number;
};
}
export interface SubcallResult {
sessionId: string;
subcallId: string;
model: string;
status: "COMPLETED" | "FAILED" | "CANCELLED";
output?: {
text: string;
toolCalls?: unknown[];
};
usage?: {
promptTokens: number;
completionTokens: number;
};
error?: {
code: string;
message: string;
};
receivedAt: string;
}
3.2 YAML ARD (components & flows)
components:
api_gateway:
responsibilities:
- AuthN/Z (OIDC/JWT, API keys)
- Tenant and rate limiting
- Request shaping -> RlmOrchestrator
rlm_orchestrator:
responsibilities:
- Manage RLM sessions and life cycle
- Maintain policy and budget state per session
- Drive root LM loop and REPL interaction
- Coordinate async sub-LM calls
repl_runtime:
implementation: "Python in sandbox (container/micro-VM)"
responsibilities:
- Hold P as read-only string
- Expose rlm_env module (len_P, get_slice, find, lm_call, etc.)
- Execute model-generated Python safely
sublm_worker_pool:
responsibilities:
- Consume Subcall jobs from queue
- Call configured sub-models (OpenAI/self-hosted)
- Emit SubcallResult events
audit_store:
responsibilities:
- Encrypted storage of traces and metrics
- SOC-2 / HIPAA access controls & retention policies
flows:
- name: create_session
steps:
- api_gateway validates tenant, checks policy
- rlm_orchestrator:
- allocates session_id and sandbox_id
- spins up isolated repl_runtime with P preloaded
- writes initial session record to audit_store
- returns CreateSessionResponse
- name: step_auto_mode
steps:
- client calls /step
- rlm_orchestrator:
- if pending_subcalls:
- wait until all SubcallResult received or timeout
- inject results into repl_runtime variables
- build root LM prompt: system (RLM prompt) + state summary
- call root_model
- parse for code blocks + FINAL()/FINAL_VAR()
- if FINAL:
- validate output (size, schema, PHI policy)
- mark session FINAL, persist trace, return final
- else:
- execute code in repl_runtime
- collect stdout/stderr, new vars, scheduled_subcalls
- enqueue subcalls to sublm_worker_pool
- update metrics and budgets
- emit StepResponse with events + pending_subcalls
- name: subcall_fulfillment
steps:
- sublm_worker_pool worker picks up subcall job
- calls actual LM provider
- emits SubcallResult onto internal bus
- rlm_orchestrator updates session state and repl_runtime
- next Step() sees updated state and proceeds
3.3 Minimal rlm_env REPL API (Python sketch)
# Exposed inside sandbox
P: str # full long prompt, read-only
def len_P() -> int: ...
def get_slice(start: int, end: int) -> str: ...
def find(pattern: str) -> list[int]: ...
def iter_docs() -> list[str]: ...
def get_doc(i: int) -> str: ...
# Async-abstracted helper; returns immediately and will be filled later
def lm_call(prompt: str,
model: str | None = None,
system_prompt: str | None = None,
tag: str | None = None) -> str:
"""
Schedules a sub-LM call, stores the future result in a variable,
and returns a placeholder or raises if budgets are exceeded.
"""
...
This spec should drop directly into Coditect as an RLM Orchestrator service fronted by your existing multi‑agent router, while matching the REPL and sub‑LM structure described in the MIT RLM paper.