RECURSIVE LANGUAGE MODELS MIT paper
The paper you’re looking for is “Recursive Language Models” (often abbreviated RLMs), by a team at MIT CSAIL, released on arXiv with ID 2512.24601.
Core references
- ArXiv entry (abstract + PDF): https://arxiv.org/abs/2512.24601
- HTML version of the paper: https://arxiv.org/html/2512.24601v1
- Hugging Face “Papers” page (nice summary + discussion): https://huggingface.co/papers/2512.24601
One-paragraph technical gist
RLMs treat the entire “long prompt” as an external environment string (e.g., in a Python REPL) and have the LM write code to inspect, slice, search, and recursively call sub-LM instances over relevant snippets, instead of shoving all tokens into the Transformer context at once. This inference-time algorithm effectively scales usable context length to $10^7$+ tokens while mitigating context rot and often reducing cost compared to naive long-context or summarization/RAG scaffolds, especially on long-document QA and multi-step reasoning benchmarks. The paper frames this as an inference-time scaling paradigm: you keep the base LM fixed, but expose it to a programmable environment where it can recursively orchestrate its own calls over an unbounded context buffer.
Good follow-up prompts you can paste
- “Summarize the algorithmic architecture of MIT’s ‘Recursive Language Models’ (arXiv:2512.24601), focusing on the REPL environment design, recursion pattern, and how it interfaces with a base LLM.”
- “Derive a practical implementation plan for an RLM-style REPL around an existing LLM (e.g., GPT-4 class), including data structures, recursion control, and failure modes for ultra-long (10M+ token) contexts.”
- “Compare MIT RLMs (arXiv:2512.24601) to standard long-context approaches (extended context Transformers, retrieval + summarization, and tool-augmented agents) for regulated-industry workloads (compliance, audits, medical records). Present trade-offs in a table.”
- “Design an RLM-inspired multi-agent architecture for code understanding over a very large monorepo, including how agents share partial summaries, indexes, and recursion depth limits.”
- “Extract and explain the experimental setup and benchmarks used in the ‘Recursive Language Models’ paper (S-NIAH, BrowseComp-Plus, OOLONG, etc.), and relate them to practical enterprise use-cases.”
Main Contributions
The MIT CSAIL paper "Recursive Language Models" (arXiv:2512.24601) introduces RLMs, an inference-time paradigm that enables LLMs to process arbitrarily long prompts—up to two orders of magnitude beyond their native context windows—by treating the prompt as an external environment rather than feeding it directly into the Transformer.
Key contributions:
- Prompt-as-environment abstraction: The long prompt is stored as a string variable in a Python REPL; the LLM writes code to inspect, slice, and decompose it instead of ingesting all tokens at once.
- Recursive self-invocation: The LLM can programmatically spawn sub-LM calls on relevant snippets, enabling essentially unbounded reasoning chains.
- Task-agnostic scaffold: A single, fixed system prompt works across all benchmarks—no task-specific engineering required.
- Cost parity or reduction: Despite added complexity, median RLM query cost matches or beats base-model calls (e.g., $0.99 vs. $1.50–$2.75 for 6–11M tokens).
RLM Inference Algorithm (Step-by-Step)
- Initialize REPL – A Python REPL is launched and the full prompt $P$ is assigned to a variable.
- Expose metadata – The root LLM receives general context (e.g., `len(P)`) without seeing raw tokens.
- Code generation – The LLM writes code to probe $P$: regex searches, slicing, keyword filtering, or printing small samples.
- Execution & observation – The REPL executes the code; outputs (e.g., matched snippets) are returned to the LLM.
- Recursive sub-calls – When deeper reasoning is needed, the LLM constructs a sub-task string and invokes a sub-LM (or sub-RLM) on that snippet; results are stored in REPL variables.
- Iterate – Steps 3–5 repeat until the LLM decides it has enough information.
- Final answer – The LLM returns a string (or a REPL variable reference for very long outputs) wrapped in a `FINAL()`/`FINAL_VAR()` tag.
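The seven steps above can be condensed into a minimal root-loop sketch. This is a hedged illustration, not the paper's implementation: `root_lm` and `repl_exec` are hypothetical callables standing in for the model API and the sandboxed REPL, and only the literal `FINAL(...)` case is handled.

```python
import re

def run_rlm(P: str, root_lm, repl_exec, max_steps: int = 64) -> str:
    """Drive one RLM trajectory: metadata -> code -> execute -> observe -> FINAL."""
    history = [f"Environment ready: len(P) = {len(P)} chars."]  # step 2: metadata only
    for _ in range(max_steps):
        turn = root_lm("\n".join(history))         # step 3: model emits code or FINAL
        final = re.search(r"FINAL\((.*)\)", turn, re.S)
        if final:                                  # step 7: termination tag
            return final.group(1)
        history.append(repl_exec(turn))            # steps 4-6: execute, observe, iterate
    raise RuntimeError("no FINAL() within step budget")
```

A trajectory that probes `P` once and then finalizes would make two `root_lm` calls and one `repl_exec` call.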
Benchmark Comparison
| Benchmark | Task Length | Base GPT-5 | RLM (GPT-5) | Gain |
|---|---|---|---|---|
| CodeQA | 23K–4.2M | 24% | 62% | +38 pp |
| BrowseComp+ (1K docs) | 6–11M | 0%* | 91.3% | — |
| OOLONG | 131K | 44% | 56.5% | +12.5 pp |
| OOLONG-Pairs | 32K | 0.04 F1 | 58.0 F1 | — |
*Base model cannot fit the input.
RLMs also dramatically outperform CodeAct + BM25 and Summary-agent baselines on all tasks while maintaining comparable or lower cost.
Experimental Tasks & Datasets
| Task | Source | Complexity Scaling | What It Tests |
|---|---|---|---|
| S-NIAH | RULER | Constant | Single-needle retrieval in a large haystack |
| BrowseComp-Plus | chen2025 | Constant (multi-hop) | Deep-research QA over 1K docs (6–11M tokens) |
| OOLONG | bertsch2025 | Linear | Semantic aggregation over every line |
| OOLONG-Pairs | Custom | Quadratic | Pairwise semantic aggregation |
| LongBench-v2 CodeQA | bai2025 | Fixed | Multiple-choice repo understanding |
Computational Costs & Latency Trade-offs
| Aspect | Finding |
|---|---|
| Median cost | RLM median ≤ base-model median; up to 3× cheaper than the summary-agent because the LLM selectively views context. |
| Variance | High—trajectory length depends on task complexity; outlier runs can be significantly more expensive. |
| Latency | Sequential (blocking) sub-calls make current implementations slow; asynchronous sub-calls and sandboxed REPLs are noted as straightforward optimizations. |
| Token budget | RLMs avoid full-context ingestion; on BrowseComp-Plus the theoretical cost of GPT-5-mini ingesting 6–11M tokens is $1.50–$2.75, while RLM(GPT-5) averages $0.99. |
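The token-budget row is simple arithmetic: the $1.50–$2.75 range is consistent with an assumed input price of roughly $0.25 per million tokens. That rate is an assumption for illustration only, not a figure from the paper; check current provider pricing.

```python
# Assumed per-token pricing; reproduces the $1.50-$2.75 full-ingestion range above.
PRICE_PER_M_INPUT_USD = 0.25  # assumed input price per million tokens

def full_context_cost(tokens: int) -> float:
    """Cost of naively ingesting the whole context once."""
    return tokens / 1_000_000 * PRICE_PER_M_INPUT_USD

print(full_context_cost(6_000_000))   # 1.5
print(full_context_cost(11_000_000))  # 2.75
```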
Suggested Follow-up Prompts
- "Describe the REPL environment design in RLMs—what helper functions or modules are exposed, how is recursion depth controlled, and how does the model decide when to finalize?"
- "Compare RLM sub-calling strategies (chunking vs. keyword search vs. regex filtering) and their impact on cost and accuracy, citing trajectories from Appendix B."
- "Outline a production architecture for RLMs with asynchronous sub-calls and sandboxed execution, targeting SOC-2 / HIPAA workloads."
- "Analyze the negative results in Appendix A: why do smaller models fail as RLMs, and what coding capability thresholds appear necessary?"
- "Outline a production architecture for RLMs with asynchronous sub-calls and sandboxed execution, targeting SOC-2 / HIPAA workloads."
- "Describe the REPL environment design in RLMs—what helper functions or modules are exposed, how is recursion depth controlled, and how does the model decide when to finalize?"
RLMs in production for compliance workloads need two things the paper explicitly calls out as future work: async sub-calls and sandboxed execution around the REPL. Below is a concrete, implementation-level outline plus a summary of the REPL design from the paper.
REPL environment design
The paper instantiates RLMs by loading the entire input prompt $P$ into a Python REPL as a string variable, then letting the LM write and execute code against that environment.
Core structure
- A long prompt $P$ is set as a variable, e.g. `P = "<entire corpus or conversation>"`, in a persistent Python interpreter.
- The root LM is given metadata about the environment (e.g., the length of `P`, potentially document counts) plus instructions on how to interact with it via code; it does not see `P`'s tokens directly unless it chooses to print or slice.
- The REPL exposes a minimal module for calling a sub-LM from inside code, so model-written Python can do things like:
  - Slice `P` into `chunk = P[start:end]`.
  - Call `call_lm(chunk, task="summarize")` and store the result in variables.
Helper functions / modules (conceptual)
The paper describes a module “that allows it to query a sub-LM inside the environment” but does not standardize the full API. A practical design (aligned with their examples) is:
- Read-only prompt access: length, slicing, and regex search over `P`.
- Document / record utilities (if $P$ is structured):
  - `iter_docs()` or `get_doc(i)` for corpus-like inputs (BrowseComp+ style).
  - `split_by(delim: str) -> list[str]` to chunk by newline or delimiter.
- Sub-LM calls:
  - `lm_call(prompt: str, *, model: str = "sub", system_prompt: str | None = None, max_tokens: int = 4096) -> str` as the generic tool.
  - Optionally thin conveniences, e.g., `summarize(chunk)` or `extract_pairs(chunk)`, which still route through `lm_call`.
- Output management: RLM encourages using variables to accumulate intermediate and final answers, then returning them via special tags.
Recursion depth and control
The paper’s experiments explicitly cap recursion depth at one layer: sub-calls are to LMs, not further RLMs. In a production system you would:
- Track recursion context (depth, parent call ID) in the REPL or orchestrator.
- Hard-limit:
  - `MAX_RECURSION_DEPTH` (often 1–2).
  - `MAX_SUBCALLS_PER_ROOT` and `MAX_TOKENS_PER_SUBCALL`.
- Reject deeper calls at the helper level: `lm_call` checks the current depth and raises an exception or returns a “budget exhausted” message when limits are exceeded.
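That helper-level enforcement can be sketched as follows; the class and constant names are illustrative, not from the paper, and `backend` stands in for the actual provider call.

```python
class BudgetExhausted(Exception):
    """Raised when a sub-call would exceed session limits."""

MAX_RECURSION_DEPTH = 1      # paper-style: sub-calls are plain LMs, not further RLMs
MAX_SUBCALLS_PER_ROOT = 256

class SubcallGate:
    """Per-session gate that lm_call routes through before touching a model."""

    def __init__(self, depth: int = 0):
        self.depth = depth
        self.subcalls = 0

    def lm_call(self, prompt: str, backend) -> str:
        if self.depth >= MAX_RECURSION_DEPTH:
            raise BudgetExhausted("recursion depth limit reached")
        if self.subcalls >= MAX_SUBCALLS_PER_ROOT:
            # Soft failure the LM can observe and route around.
            return "budget_exhausted: no sub-calls left this session"
        self.subcalls += 1
        return backend(prompt)
```

Raising on depth (a hard invariant) while soft-failing on count (a budget the model can adapt to) is one reasonable split; either limit could use either style.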
Finalization protocol
The RLM distinguishes “thinking steps” from final answers by asking the LM to wrap answers in `FINAL()` or `FINAL_VAR()`.
- `FINAL(text)` – the final answer is the literal `text` string.
- `FINAL_VAR(var_name)` – the final answer is the value of a REPL variable (useful for very long outputs).
- The paper notes brittleness: models sometimes emit plans as `FINAL`, requiring guardrails.
- A production controller:
  - Parses each LM turn; if `FINAL`/`FINAL_VAR` appears and passes validation (e.g., `var_name` exists, length within bounds), the orchestrator ends the trajectory.
  - Otherwise, it treats the output as next-step code, executes it in the REPL, and returns stdout/stderr to the LM for the next step.
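A controller-side parse of that protocol might look like this; the tag names come from the paper, but the regex details and the "unknown variable means keep stepping" policy are assumptions.

```python
import re

def parse_finalization(turn: str, repl_vars: dict):
    """Return the final answer if the turn carries a valid FINAL tag, else None."""
    m = re.search(r"FINAL_VAR\(\s*([A-Za-z_]\w*)\s*\)", turn)
    if m:  # answer lives in a REPL variable (long outputs)
        return repl_vars.get(m.group(1))  # unknown variable -> None, keep stepping
    m = re.search(r"FINAL\((.*?)\)\s*$", turn, re.S)
    if m:  # answer is the literal argument
        return m.group(1)
    return None  # no tag: treat the turn as next-step code
```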
Production architecture with async sub-calls & sandboxing (SOC‑2 / HIPAA)
At a high level, you want an RLM “engine” service that wraps:
- A sandboxed REPL runtime per request.
- An async fan-out / gather layer for sub-LM calls.
- A policy and observability layer for security, compliance, and billing.
1. High-level component layout
- API Gateway / Frontend service
- Accepts RLM requests: `{root_model, prompt P, task_type, tenant_id, policies, budgets}`.
- Performs initial authn/z, per-tenant rate limits, and logs request metadata (no raw PII in centralized logs for HIPAA).
- RLM Orchestrator
- Creates an RLM session with:
  - A dedicated REPL instance (container / micro-VM).
  - A “root LM” channel (e.g., GPT‑5 as in the paper) and a “sub-LM pool” (e.g., GPT‑5-mini or an open model).
- Drives the loop:
  - Send “REPL state summary + last outputs” to the root LM.
  - Receive Python + `lm_call()` instructions or `FINAL(...)`.
  - Execute code inside the sandbox; when `lm_call()` is requested, enqueue sub-calls.
  - Collect results asynchronously, update REPL variables, repeat.
- Sub-LM Worker Pool (Async)
- Queue-based architecture (e.g., a `subcalls` topic).
- Workers call provider APIs (OpenAI, internal models, etc.), streaming results back to the orchestrator.
- Parallelizes independent sub-calls that, in the paper’s implementation, are synchronous and therefore slow.
- Secure Storage / Audit Layer
- Encrypted storage for:
  - Per-tenant configuration and keys.
  - RLM traces (REPL code, sub-call prompts, model outputs) needed for SOC‑2 audit and HIPAA traceability, but under strong access control and lifecycle policies.
- Redaction or structured separation of PHI so operational logs never store raw PHI where not strictly necessary.
2. Sandbox and isolation design
The paper suggests that sandboxed REPLs and asynchronous calls are straightforward improvements over their synchronous Python REPL implementation. For compliance:
- Runtime isolation
- Per-RLM session, launch:
  - A container / Firecracker micro-VM running a restricted Python environment.
  - No network access from inside the REPL except an internal RPC for `lm_call`, which is mediated by policy.
- Use seccomp/AppArmor and read-only mounted libraries; no filesystem writes except to a per-session encrypted temp volume.
- Code surface minimization
- Expose only:
  - An `rlm_env` module (with `len_P`, `get_slice`, `find`, corpus iterators, etc.).
  - Safe Python built-ins, `re`, `json`, and basic list/dict/string ops.
- Deny: `os`, `subprocess`, `socket`, and `open` (except where explicitly wrapped to a virtual, in-memory filesystem).
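As a toy illustration of that surface minimization—NOT a real sandbox; production isolation still needs the container/micro-VM layer above—model-written code can be executed against a whitelisted namespace:

```python
import re
import json

# Whitelisted surface only: no os, subprocess, socket, open, or __import__.
SAFE_GLOBALS = {
    "__builtins__": {"len": len, "range": range, "min": min,
                     "max": max, "sorted": sorted, "print": print},
    "re": re,
    "json": json,
}

def run_model_code(code: str, env: dict) -> dict:
    """Execute model code with only whitelisted names visible; return the scope."""
    scope = dict(SAFE_GLOBALS)
    scope.update(env)  # e.g., {"P": long_prompt}
    exec(code, scope)  # toy example only; do not rely on this for real isolation
    return scope
```

Because `__import__` is absent from the builtins dict, even `import os` fails inside the executed code; Python-level whitelisting like this is still escapable, which is why the OS-level sandbox remains mandatory.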
- PHI boundaries
- Treat `P` and any slices as PHI-containing; they never leave the sandbox except as:
  - Minimal snippets printed back to the LM (already inherent in RLM), subject to length caps and optional PHI detection.
  - Sub-LM calls that are strictly necessary to answer the user’s question; these must flow only to HIPAA-eligible model providers or your own hosted models.
- Data lifecycle
- Explicit TTL plus a deletion pipeline for:
  - REPL state (`P`, variables).
  - Raw traces.
- For HIPAA, ensure BAAs are in place with any third-party model providers that might receive PHI.
3. Asynchronous sub-calls in the RLM loop
The paper runs all LM calls sequentially and notes this as a source of slowness. A production system:
- Implements `lm_call()` as an async primitive:
  - In the REPL, `lm_call(prompt, ...)` enqueues a request ID and returns immediately with a “future-like” token.
  - The orchestrator:
    - Collects all outstanding sub-call tokens for the current step.
    - Fans them out to the sub-LM worker pool simultaneously.
    - When all are fulfilled or timed out, writes their outputs back into REPL variables and resumes the root LM.
- Provides structured APIs to keep code simple:
  - `future_id = lm_call_async(prompt, ...)` and `result = await_lm(future_id)`.
  - Or, in a more constrained pattern, a helper: `results = parallel_lm_map(chunks, task="summarize")`, internally implemented with multiple `lm_call` requests.
- Enforces budget and fairness:
  - `lm_call` tracks total tokens per session, per user, and per tenant, plus max concurrent sub-calls per session and per tenant.
  - If limits are exceeded, `lm_call` raises an exception or returns a structured “budget_exhausted” response that the LM must handle.
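The fan-out/gather step can be sketched with `asyncio`. Here `parallel_lm_map` is the hypothetical helper named above and `sub_lm` stands in for the worker-pool call; the semaphore implements the per-session concurrency cap.

```python
import asyncio

async def parallel_lm_map(chunks, sub_lm, max_concurrency: int = 8):
    """Fan independent sub-calls out concurrently; gather results in input order."""
    sem = asyncio.Semaphore(max_concurrency)  # per-session concurrency cap

    async def one(chunk):
        async with sem:
            return await sub_lm(chunk)

    # gather preserves input order even though calls complete out of order
    return await asyncio.gather(*(one(c) for c in chunks))
```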
4. SOC‑2 / HIPAA controls and observability
To meet SOC‑2 and HIPAA expectations around long-context RLM traces:
- Deterministic policies
- Per-tenant config for:
  - Allowed model families (e.g., only in-house / HIPAA-compliant models).
  - Max context-slice length visible to any sub-call.
  - Redaction rules (e.g., regexes for MRNs, SSNs, etc., to be masked before sub-calls).
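A redaction pass of that shape might look like the following; the SSN/MRN patterns are illustrative only, not production-grade PHI detection.

```python
import re

# Illustrative masking rules applied to text before it leaves the sandbox in a
# sub-call; real deployments should layer a vetted PHI-detection service on top.
REDACTION_RULES = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),    # US SSN shape
    (re.compile(r"\bMRN[-: ]?\d{6,10}\b"), "[MRN]"),    # toy medical-record-number shape
]

def redact(text: str) -> str:
    """Mask every rule match before the text reaches a sub-LM or a log line."""
    for pattern, mask in REDACTION_RULES:
        text = pattern.sub(mask, text)
    return text
```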
- Comprehensive tracing
- Log (to a secure, access-controlled store):
  - RLM session metadata (tenant, request ID, timestamps).
  - Every `lm_call` with: hash of the prompt, optional partial view, token counts, model used, latency, cost.
  - REPL events (code cells, exceptions, finalization).
- Use structured logs to support:
  - SOC‑2 control evidence: change management, incident investigations.
  - HIPAA: who saw what, when, under which authorization.
- Guardrails and validation
- For each `FINAL`/`FINAL_VAR`:
  - Validate type and size.
  - Optional domain validators (e.g., JSON schema, FHIR resource validation).
- For healthcare: prefer structured responses (FHIR JSON) over free text; the RLM can still use unstructured reasoning internally.
How the model “decides” to finalize (practically)
The paper’s mechanism is purely prompt-driven: the system prompt instructs the model to wrap its final answer in `FINAL()` or `FINAL_VAR()`, and the orchestrator interprets that as termination. In practice you likely want:
- State machine on the orchestrator side
  - States: `THINKING`, `CANDIDATE_FINAL`, `FINAL`.
  - On `FINAL(...)`:
    - Move to `CANDIDATE_FINAL`; run validators (size, shape, policy checks).
    - If all pass, transition to `FINAL` and stop.
    - If not, send a constrained follow-up prompt: “Your previous response violated rule X; provide a corrected FINAL(...) answer adhering to policy Y.”
- Heuristics for robustness
  - If the LM outputs both code and `FINAL()` in the same turn, apply a deterministic rule (e.g., ignore the code, or only honor a `FINAL` near the end).
  - Hard stop on max steps / tokens: if the step limit is reached with no valid `FINAL`, either:
    - Return a best-effort intermediate summary, clearly marked.
    - Or fail closed, depending on tenant policy.
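The validator leg of that state machine reduces to a small pure function; the validator shape (predicate plus rule name) and retry wording are assumptions.

```python
# Orchestrator-side finalization check: CANDIDATE_FINAL either promotes to FINAL
# or falls back to THINKING with a constrained retry prompt.
def handle_final(candidate: str, validators, max_len: int = 100_000):
    """Return (next_state, payload): the answer on FINAL, a retry prompt otherwise."""
    if len(candidate) > max_len:
        return "THINKING", "Answer too long; provide a corrected FINAL(...) answer."
    for check in validators:
        ok, rule = check(candidate)
        if not ok:
            return ("THINKING",
                    f"Your previous response violated rule {rule}; "
                    "provide a corrected FINAL(...) answer.")
    return "FINAL", candidate
```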
If you want, the next step can be a concrete YAML/TypeScript-level spec of the RLM Orchestrator API (including CreateSession, Step, SubcallResult streams) tailored to your existing Coditect multi-agent infra.
CREATE concrete MARKDOWN, JSON, YAML/TypeScript-level SDD TDD ARD specification of the RLM Orchestrator API (including CreateSession, Step, SubcallResult streams) tailored to your existing Coditect multi-agent infra.
Below is a concrete, copy‑pasteable spec for an RLM Orchestrator suitable for Coditect‑style multi‑agent infra, with Markdown, JSON, and YAML/TypeScript views.
1. High‑level API surface (Markdown SDD)
The RLM Orchestrator exposes three primary concepts:
- CreateSession: start an RLM session over a long prompt $P$ in a sandboxed REPL.
- Step: advance the RLM trajectory (execute model‑generated code, schedule sub‑calls, resume when sub‑results arrive).
- SubcallResult stream: asynchronous delivery of sub‑LM call completions back into the orchestrator and REPL.
1.1 Core invariants
- Each session has a dedicated sandboxed REPL (holding $P$), its own policy/budget state, and an audit trail.
- Recursion depth is capped (typically 1) and enforced at the API level.
- Finalization is signaled via `FINAL()` or `FINAL_VAR()` in the root LM output; the orchestrator validates and closes the session.
2. JSON API contract (TDD‑style)
2.1 CreateSession
HTTP
POST /v1/rlm/sessions
Request (JSON)
{
"tenant_id": "tenant-123",
"session_id": "optional-client-session-id-uuid",
"root_model": "gpt-5",
"sub_model_default": "gpt-5-mini",
"prompt": "LONG_INPUT_P",
"task_type": "generic_qa",
"policies": {
"max_recursion_depth": 1,
"max_steps": 64,
"max_subcalls_total": 256,
"max_tokens_per_subcall": 4096,
"max_visible_slice_chars": 32000,
"allow_network_tools": false
},
"compliance": {
"hipaa": true,
"soc2": true,
"data_region": "us-central1",
"phi_present": true
},
"metadata": {
"request_id": "ext-req-id",
"source": "coditect-pipeline",
"labels": {
"env": "prod"
}
}
}
Response
{
"session_id": "srv-rlm-7c3f4c01",
"state": "RUNNING",
"created_at": "2026-01-13T07:58:00Z",
"limits": {
"max_recursion_depth": 1,
"max_steps": 64,
"max_subcalls_total": 256,
"max_tokens_per_subcall": 4096
},
"repl_info": {
"language": "python",
"env_version": "rlm-repl-v1",
"sandbox_id": "sbx-0f12d8e3"
}
}
2.2 Step (pull‑driven)
Advances the RLM; can be used in either:
- Polling mode: the client calls `Step` until `status = FINAL`.
- Server‑push mode: `Step` returns incremental events, while subcall completions also arrive through a stream.
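Polling mode reduces to a simple client loop. The transport is injected so the sketch stays self-contained; `step_fn` stands in for `POST /v1/rlm/sessions/{session_id}/step` from the contract below.

```python
def poll_until_final(step_fn, max_steps: int = 64):
    """Call Step until the session reports FINAL; return the final payload."""
    for _ in range(max_steps):
        resp = step_fn({"mode": "AUTO"})   # one Step request in AUTO mode
        if resp["status"] == "FINAL":
            return resp["final"]
    raise TimeoutError("session did not finalize within max_steps")
```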
HTTP
POST /v1/rlm/sessions/{session_id}/step
Request
{
"client_step_id": "cstep-001",
"mode": "AUTO",
"max_root_tokens": 2048,
"timeout_ms": 120000
}
Response
{
"session_id": "srv-rlm-7c3f4c01",
"server_step_id": "sstep-012",
"status": "THINKING",
"phase": "ROOT_LM",
"events": [
{
"type": "ROOT_LM_OUTPUT",
"root_call_id": "root-010",
"content": {
"raw_text": "```python\nfrom rlm_env import len_P, get_slice, lm_call\n...\n```",
"parsed": {
"code_blocks": [
{
"language": "python",
"code": "from rlm_env import len_P, get_slice, lm_call\n..."
}
],
"final_call": null
}
},
"usage": {
"prompt_tokens": 400,
"completion_tokens": 260
}
},
{
"type": "REPL_EXECUTION",
"execution_id": "exec-048",
"status": "OK",
"stdout": "",
"stderr": "",
"new_variables": [
"snippet_1",
"candidate_answer"
],
"scheduled_subcalls": [
{
"subcall_id": "sub-123",
"model": "gpt-5-mini",
"prompt_preview": "Summarize this snippet ...",
"recursion_depth": 1
}
]
}
],
"pending_subcalls": [
"sub-123"
],
"metrics": {
"steps_used": 12,
"subcalls_used": 37,
"total_tokens_root": 7200,
"total_tokens_subcalls": 41200
}
}
If the root LM emitted a final answer:
{
"session_id": "srv-rlm-7c3f4c01",
"server_step_id": "sstep-020",
"status": "FINAL",
"phase": "DONE",
"final": {
"type": "FINAL_TEXT",
"text": "The answer is Maria Dalmacio.",
"validated": true
},
"metrics": {
"steps_used": 20,
"subcalls_used": 52,
"total_tokens_root": 10200,
"total_tokens_subcalls": 56000
}
}
2.3 SubcallResult stream
Sub‑LM calls are not invoked directly by clients; they are emitted from `REPL_EXECUTION` events and fulfilled by an internal worker pool.
Assume an internal gRPC/WS channel:
{
"type": "SUBCALL_RESULT",
"session_id": "srv-rlm-7c3f4c01",
"subcall_id": "sub-123",
"model": "gpt-5-mini",
"status": "COMPLETED",
"output": {
"text": "The stew is called pinakbet, and the pageant winner is Maria Dalmacio.",
"tool_calls": []
},
"usage": {
"prompt_tokens": 900,
"completion_tokens": 180
},
"error": null,
"received_at": "2026-01-13T07:58:22Z"
}
The orchestrator:
- Writes this result into the REPL (e.g., into a variable `sub_123_result`).
- On the next `Step`, exposes a short REPL state summary plus the last outputs to the root LM.
3. YAML / TypeScript SDD + ARD (interfaces & flows)
3.1 TypeScript domain types
// rlm-types.ts
export type RlmStatus = "RUNNING" | "FINAL" | "FAILED";
export interface ComplianceConfig {
hipaa: boolean;
soc2: boolean;
dataRegion: string;
phiPresent: boolean;
}
export interface PolicyConfig {
maxRecursionDepth: number; // e.g., 1
maxSteps: number; // e.g., 64
maxSubcallsTotal: number; // e.g., 256
maxTokensPerSubcall: number; // e.g., 4096
maxVisibleSliceChars: number; // safety on printing P
allowNetworkTools: boolean;
}
export interface CreateSessionRequest {
tenantId: string;
sessionId?: string;
rootModel: string;
subModelDefault: string;
prompt: string;
taskType: string;
policies: PolicyConfig;
compliance: ComplianceConfig;
metadata?: Record<string, unknown>;
}
export interface CreateSessionResponse {
sessionId: string;
  state: RlmStatus; // "RUNNING" is already a member of RlmStatus
createdAt: string;
limits: PolicyConfig;
replInfo: {
language: "python";
envVersion: string;
sandboxId: string;
};
}
export type StepMode = "AUTO" | "ROOT_ONLY" | "REPL_ONLY";
export interface StepRequest {
clientStepId?: string;
mode: StepMode;
maxRootTokens?: number;
timeoutMs?: number;
}
export type RlmEventType =
| "ROOT_LM_OUTPUT"
| "REPL_EXECUTION"
| "SUBCALL_SCHEDULED"
| "BUDGET_EXCEEDED"
| "ERROR";
export interface RootLmParsedContent {
codeBlocks: { language: string; code: string }[];
finalCall: null | {
kind: "FINAL_TEXT" | "FINAL_VAR";
argument: string;
};
}
export interface RlmEventRootLmOutput {
type: "ROOT_LM_OUTPUT";
rootCallId: string;
content: {
rawText: string;
parsed: RootLmParsedContent;
};
usage: {
promptTokens: number;
completionTokens: number;
};
}
export interface ScheduledSubcall {
subcallId: string;
model: string;
promptPreview: string;
recursionDepth: number;
}
export interface RlmEventReplExecution {
type: "REPL_EXECUTION";
executionId: string;
status: "OK" | "ERROR";
stdout: string;
stderr: string;
newVariables: string[];
scheduledSubcalls: ScheduledSubcall[];
}
export type RlmEvent =
| RlmEventRootLmOutput
| RlmEventReplExecution
| {
type: "BUDGET_EXCEEDED";
reason: string;
}
| {
type: "ERROR";
message: string;
};
export interface StepResponse {
sessionId: string;
serverStepId: string;
  status: RlmStatus; // "RUNNING" and "FAILED" are already members of RlmStatus
phase: "ROOT_LM" | "REPL" | "WAITING_SUBCALLS" | "DONE";
events: RlmEvent[];
pendingSubcalls: string[];
final?: {
type: "FINAL_TEXT" | "FINAL_VAR";
text?: string;
varName?: string;
validated: boolean;
};
metrics: {
stepsUsed: number;
subcallsUsed: number;
totalTokensRoot: number;
totalTokensSubcalls: number;
};
}
export interface SubcallResult {
sessionId: string;
subcallId: string;
model: string;
status: "COMPLETED" | "FAILED" | "CANCELLED";
output?: {
text: string;
toolCalls?: unknown[];
};
usage?: {
promptTokens: number;
completionTokens: number;
};
error?: {
code: string;
message: string;
};
receivedAt: string;
}
3.2 YAML ARD (components & flows)
components:
api_gateway:
responsibilities:
- AuthN/Z (OIDC/JWT, API keys)
- Tenant and rate limiting
- Request shaping -> RlmOrchestrator
rlm_orchestrator:
responsibilities:
- Manage RLM sessions and life cycle
- Maintain policy and budget state per session
- Drive root LM loop and REPL interaction
- Coordinate async sub-LM calls
repl_runtime:
implementation: "Python in sandbox (container/micro-VM)"
responsibilities:
- Hold P as read-only string
- Expose rlm_env module (len_P, get_slice, find, lm_call, etc.)
- Execute model-generated Python safely
sublm_worker_pool:
responsibilities:
- Consume Subcall jobs from queue
- Call configured sub-models (OpenAI/self-hosted)
- Emit SubcallResult events
audit_store:
responsibilities:
- Encrypted storage of traces and metrics
- SOC-2 / HIPAA access controls & retention policies
flows:
- name: create_session
steps:
- api_gateway validates tenant, checks policy
- rlm_orchestrator:
- allocates session_id and sandbox_id
- spins up isolated repl_runtime with P preloaded
- writes initial session record to audit_store
- returns CreateSessionResponse
- name: step_auto_mode
steps:
- client calls /step
- rlm_orchestrator:
- if pending_subcalls:
- wait until all SubcallResult received or timeout
- inject results into repl_runtime variables
- build root LM prompt: system (RLM prompt) + state summary
- call root_model
- parse for code blocks + FINAL()/FINAL_VAR()
- if FINAL:
- validate output (size, schema, PHI policy)
- mark session FINAL, persist trace, return final
- else:
- execute code in repl_runtime
- collect stdout/stderr, new vars, scheduled_subcalls
- enqueue subcalls to sublm_worker_pool
- update metrics and budgets
- emit StepResponse with events + pending_subcalls
- name: subcall_fulfillment
steps:
- sublm_worker_pool worker picks up subcall job
- calls actual LM provider
- emits SubcallResult onto internal bus
- rlm_orchestrator updates session state and repl_runtime
- next Step() sees updated state and proceeds
3.3 Minimal rlm_env REPL API (Python sketch)
# Exposed inside sandbox
P: str # full long prompt, read-only
def len_P() -> int: ...
def get_slice(start: int, end: int) -> str: ...
def find(pattern: str) -> list[int]: ...
def iter_docs() -> list[str]: ...
def get_doc(i: int) -> str: ...
# Async-abstracted helper; returns immediately and will be filled later
def lm_call(prompt: str,
model: str | None = None,
system_prompt: str | None = None,
tag: str | None = None) -> str:
"""
Schedules a sub-LM call, stores the future result in a variable,
and returns a placeholder or raises if budgets are exceeded.
"""
...
This spec should drop directly into Coditect as an RLM Orchestrator service fronted by your existing multi‑agent router, while matching the REPL and sub‑LM structure described in the MIT RLM paper.