
Learning Hierarchical Procedural Memory for LLM Agents through Bayesian Selection and Contrastive Refinement

Saman Forouzandeh, Wei Peng, Parham Moradi, Xinghuo Yu, Mahdi Jalili

Abstract

We present MACLA, a framework that decouples reasoning from learning by maintaining a frozen large language model (LLM) while performing all adaptation in an external hierarchical procedural memory. MACLA extracts reusable procedures from trajectories, tracks reliability via Bayesian posteriors, selects actions through expected-utility scoring, and refines procedures by contrasting successes with failures. Across four benchmarks (ALFWorld, WebShop, TravelPlanner, InterCodeSQL), MACLA achieves 78.1% average performance, outperforming all baselines. On ALFWorld unseen tasks, MACLA reaches 90.3% with +3.1% positive generalization. The system constructs memory in 56 seconds (2,800× faster than the state-of-the-art LLM parameter-training baseline) and compresses 2,851 trajectories into 187 procedures (a 15:1 ratio). Experimental results demonstrate that structured external memory with Bayesian selection and contrastive refinement enables sample-efficient, interpretable, and continually improving agents without LLM parameter updates. Code is publicly available at https://github.com/S-Forouzandeh/MACLA-LLM-Agents-AAMAS-Conference.


School of Engineering, Royal Melbourne Institute of Technology University

Melbourne, VIC, Australia

Early LLM agents used prompt-based planning [26] and self-critique [15], but lack persistent 'how-to' procedures: when tasks are similar but not identical, agents must re-plan from scratch, increasing cost and latency. Fine-tuning approaches [2, 27, 29] adapt agents via supervised learning or RLHF, but typically treat entire trajectories as single units weighted by terminal success/failure, neglecting rich intermediate steps. In practice, failed trajectories often contain correct sub-steps (e.g., 'successfully navigating and retrieving an egg, but failing to boil it' [16]), while successful ones may include suboptimal actions that accidentally cancel out. Recent work [22] addresses this via step-level rewards, but requires repeated policy training on densely labeled data, incurring substantial computational cost.

We introduce MACLA (Memory-Augmented Contrastive Learning Agent), a framework that disentangles reasoning from learning by coupling a frozen LLM with a structured external procedural memory (Figure 1). Unlike fine-tuning approaches, where reasoning and adaptation are entangled within billions of parameters, MACLA fixes the LLM as a stable semantic reasoner responsible for trajectory segmentation, abstraction, and action generation. All learning occurs externally through explicit, interpretable memory operations: maintaining human-readable procedures, updating Bayesian posteriors, and refining preconditions through contrastive analysis. MACLA operates through three core mechanisms:

(1) Bayesian procedure selection: maintains Beta posteriors Beta(α_i, β_i) over procedure success rates and ranks candidates via expected-utility scoring that balances contextual relevance, success probability, failure risk, and information gain, providing principled exploration-exploitation. (2) Contrastive refinement: compares successful and failed execution contexts to tighten preconditions, repair action sequences, and refine postconditions once procedures accumulate sufficient evidence (at least a minimum number of successes and failures), progressively improving procedure quality through memory edits rather than gradient updates. (3) Meta-procedural learning: composes frequently co-occurring procedures into hierarchical 'playbooks' with conditional control policies (continue, skip, repeat, abort) for long-horizon tasks, enabling strategic reuse beyond atomic skills.

This architecture yields sample-efficient, interpretable agents with human-readable procedural knowledge, closed-form utility computation, and minimal LLM usage. Specifically, this work contributes:

· Online procedural memory adaptation: Continual updates to procedural and meta-procedural memory during and


Keywords

Memory-augmented agents, Procedural memory, Bayesian decision making, Contrastive learning, LLM agents

Introduction

Large language model (LLM) agents can solve complex, interactive tasks such as web shopping [25] and embodied AI housekeeping [9], by transforming natural-language instructions into sequences of environment actions [26]. In these settings, agents navigate step-by-step through partially observable environments to pursue subgoals and ultimately complete the task [9, 22]. The resulting trajectory is the ordered record of an episode's interaction, typically written as (T, A, O, R), where T represents a task to complete, A are actions, O stands for observations of the outcomes of the corresponding actions, and R records step-level outcomes or rewards. Trajectories thus capture the full decision process, not merely terminal success or failure, and provide dense supervision for how an agent progresses through a task [16, 22]. When a new task arrives, the agent synthesizes an appropriate trajectory (that is, a step-by-step plan and its execution) to achieve the goal in the current context, deciding which information to gather, which tools to invoke, and which subroutines to chain in order to achieve completion [25, 26].

Proc. of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), C. Amato, L. Dennis, V. Mascardi, J. Thangarajah (eds.), May 25 - 29, 2026, Paphos, Cyprus . © 2026 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). This work is licenced under the Creative Commons Attribution 4.0 International (CC-BY 4.0) licence.


Figure 1: Comparison between existing LLM-based trajectory learning (top) and the proposed memory-augmented contrastive learning agent (MACLA, bottom). Existing methods train trajectories (T, A, O, R) (Task, Action, Observation, Reward) into LLM parameters through post-training (fine-tuning and/or RLHF), whereas MACLA constructs procedural and meta-procedural memory externally through frozen-LLM abstraction, segmentation, Bayesian selection, and contrastive refinement. Besides learning during memory construction, MACLA enables inference-time learning, in which outputs are verified in the task environment and the feedback is used for contrastive refinement of the retrieved memories. Meta-procedural learning enables the composition policy among procedures to be learned.

  • after episodes, enabling adaptation without weight updates, in contrast to offline LLM post-training approaches [17, 22, 29] that remain static at inference.
  • Reasoning/learning decoupling: A frozen LLM handles parsing and abstraction, while all improvements occur in an external, structured procedural memory, avoiding the computational cost and catastrophic-forgetting risks of parameter fine-tuning.
  • Bayesian uncertainty-aware selection: A principled procedure-selection module that maintains Beta posteriors over success rates, with closed-form expected-utility objectives balancing relevance, success probability, failure risk, and information gain.
  • Contrastive procedural refinement: An algorithm that leverages paired successes and failures to tighten preconditions, repair action schemas, and refine postconditions of stored procedures without requiring expert demonstrations.
  • Hierarchical meta-procedural composition: Automatic discovery and maintenance of conditional playbooks with control policies (skip, repeat, abort) for long-horizon tasks, enabling compositional generalization.

We evaluate MACLA across four benchmarks (ALFWorld [9], WebShop [25], TravelPlanner [21], InterCodeSQL [24]), achieving 78.1% average performance, the highest among all methods, including those using models 10× larger (Table 1). On ALFWorld [16], MACLA reaches 87.2% on seen and 90.3% on unseen tasks, with a positive generalization gap (+3.1%) indicating compositional transfer rather than overfitting. The system achieves this with only 0.016 GPU-hours for one-time memory construction, 2,800× faster than the state-of-the-art LLM parameter-training baseline [22], which requires 44.8 GPU-hours of iterative training, while simultaneously producing human-interpretable procedural knowledge.

Related Work

LLM agents have advanced rapidly in reasoning and decision-making, enabling multi-step interaction in embodied and web-based environments. Early frameworks such as ReAct [26] and Reflexion [14] integrate reasoning and acting within the same loop, while trajectory-tuning methods [2, 22] fine-tune models using expert demonstrations. However, fine-tuning is computationally expensive, requires offline data collection and training cycles, and does not support true online adaptation at inference time. To overcome this issue, a line of research augments LLM agents with memory for continuous reasoning. Memory is a foundational component of language agents, supporting competence across multiple timescales, from transient working context to persistent long-term knowledge [6, 8, 31]. Research on memory for LLM agents can be usefully organized along two directions: where memory resides and what is stored. Along the first direction, methods such as MemGPT [11] and MemoryBank [32] use buffer-based systems to store conversational or episodic traces and retrieve them with embedding search and simple heuristics. Others, such as HiAgent [5], A-Mem [23], and MemAgent [28], use hierarchical designs that separate working buffers from episodic and long-term stores to relieve context pressure and improve persistence. Recently, SAGE [7] used reflective multi-agent controllers to curate these stores while controlling growth.

The second direction concerns what is stored. Many systems retain free-form text snippets such as notes, summaries, or dialogue chunks; these are easy to write but suffer from retrieval drift and weak compositionality as repositories scale [11, 32]. More structured artifacts appear as tuples and key-value frames (e.g., tool logs or entity/event graphs), which aid filtering but still lack executable semantics for reuse.
A growing line of work targets skills and procedures: agents capture reusable action patterns, tool workflows, and instruction-like steps across related tasks [3, 19, 20]. Memp [4] advances this view by treating procedural memory as a first-class object and studying its construction, retrieval, and update across domains. However, several key limitations remain: (1) it represents know-how largely as monolithic text (scripts or full trajectories) with heuristic retrieval and simple updates; (2) it lacks uncertainty-aware selection or a principled exploration-exploitation balance, preventing reasoning about the reliability or risk of retrieved memory; and (3) it lacks a mechanism to refine procedures from paired successes and failures or to abstract recurring patterns into meta-procedural compositions. In contrast, we represent experience as structured, hierarchical procedures with explicit preconditions, action schemas, and postconditions, enabling interpretable reuse, safe composition, and direct schema edits when evidence warrants change. The proposed approach enables the system to continuously adapt and improve.


Proposed Method

The key components of MACLA are described in detail below.

LLM-based Procedural Abstraction

The first stage transforms raw episodic trajectories into structured, reusable procedural knowledge. Given a trajectory

τ = {(o_t, a_t, r_t)}_{t=0}^{T}, consisting of textual observations o_t, primitive actions a_t, and rewards r_t, the frozen LLM L_θ receives the full trajectory and identifies semantically coherent segments that correspond to meaningful sub-tasks:

$$
\mathcal{L}_{\theta}(\tau) \;\rightarrow\; \{(\mathrm{seg}_k, d_k)\}_{k=1}^{K},
$$

where each segment k spans time steps [t_k^start, t_k^end] and is summarized by a description d_k. For each segment, MACLA constructs a structured procedure Proc_k = ⟨G_k, Ψ_k, π_k, Φ_k⟩, where G_k is a natural-language goal, Ψ_k are precondition patterns inferred from the observations before the segment, π_k is an abstracted action sequence, and Φ_k are postcondition patterns extracted from the final observations. This decomposition produces interpretable 'how-to' skills that can be invoked whenever their preconditions are met. To support retrieval and merging, each procedure is embedded into a semantic vector space using an encoder φ: e_k = φ([G_k; Ψ_k; Φ_k]) ∈ R^d. When a new procedure is created, it is compared to existing ones via cosine similarity, i* = arg max_i sim(e_k, e_i). If sim(e_k, e_{i*}) > θ_dup, the new procedure is merged into the existing one by expanding its condition sets; otherwise, a new entry is added. This process yields a continually growing procedural library M_proc = {(Proc_i, e_i, α_i, β_i)}_{i=1}^{N_p} that forms the foundation for later Bayesian selection and refinement.
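The duplicate-detection step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the dictionary layout, field names, and toy list embeddings are assumptions standing in for the encoder φ and the procedural library M_proc.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def add_procedure(library, proc, theta_dup=0.85):
    """Merge `proc` into its nearest neighbor if similarity exceeds
    theta_dup (the paper's duplicate threshold); otherwise append it.

    Each entry is a dict with illustrative keys:
    {"embedding": list, "preconds": set, "postconds": set,
     "alpha": float, "beta": float}.
    """
    if library:
        sims = [cosine(proc["embedding"], p["embedding"]) for p in library]
        i_star = max(range(len(sims)), key=lambda i: sims[i])
        if sims[i_star] > theta_dup:
            # Merge: expand the existing procedure's condition sets.
            library[i_star]["preconds"] |= proc["preconds"]
            library[i_star]["postconds"] |= proc["postconds"]
            return library
    library.append(proc)
    return library
```

With θ_dup = 0.85, two near-identical procedure embeddings collapse into one entry whose condition sets are the union of both, which is how the library stays compact as experience accumulates.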

Bayesian Reliability and Utility Selection

Given the procedural library, the agent must decide which procedure to execute for the current observation. Each procedure Proc 𝑖 maintains a Beta posterior over its success probability 𝜌 𝑖 ∈ [ 0 , 1 ] :

$$
p(\rho_i \mid \mathcal{D}_i) = \mathrm{Beta}(\alpha_i, \beta_i),
$$

where α_i and β_i accumulate successful and failed executions from history D_i. The posterior mean E[ρ_i] = α_i/(α_i + β_i) estimates current reliability, while the variance Var[ρ_i] = α_i β_i / ((α_i + β_i)^2 (α_i + β_i + 1)) quantifies epistemic uncertainty. For each candidate, we compute expected utility by integrating over the Beta posterior. Given utility U(ρ | o_t, i) = Rel_i(o_t) · ρ · R_max − Risk_i(o_t) · (1 − ρ) · C_fail + λ_info · I(ρ), the expected utility is:

$$
\mathrm{EU}(\mathrm{Proc}_i \mid o_t) = \int_{0}^{1} U(\rho \mid o_t, i)\, p(\rho \mid \mathcal{D}_i)\, d\rho.
$$

Exploiting E_{Beta(α, β)}[ρ] = α/(α + β) and E[1 − ρ] = β/(α + β), this simplifies to:

$$
\mathrm{EU}(\mathrm{Proc}_i \mid o_t) = \mathrm{Rel}_i(o_t) \cdot \frac{\alpha_i}{\alpha_i + \beta_i} \cdot R_{\max} \;-\; \mathrm{Risk}_i(o_t) \cdot \frac{\beta_i}{\alpha_i + \beta_i} \cdot C_{\mathrm{fail}} \;+\; \lambda_{\mathrm{info}} \cdot H[\mathrm{Beta}(\alpha_i, \beta_i)],
$$

where Rel_i(o_t) = cos(φ(o_t), e_i) is contextual similarity, Risk_i(o_t) is the fraction of past failures with similar contexts, and H[·] is differential entropy encouraging exploration. The selected procedure is:

$$
\mathrm{Proc}^{\star} = \arg\max_{i} \; \mathrm{EU}(\mathrm{Proc}_i \mid o_t),
$$

subject to confidence threshold θ_conf. If max_i EU(Proc_i | o_t) < θ_conf, the agent falls back to zero-shot LLM reasoning. This Bayesian selection mechanism balances exploitation (procedures with high α/(α + β)), risk aversion (avoiding contexts similar to past failures), and exploration (high-entropy procedures). The expected-utility formulation naturally handles the explore-exploit tradeoff: early in learning, high entropy dominates selection, while after sufficient evidence accumulates, expected reward becomes the primary driver.
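The closed-form scoring above can be sketched directly. One caveat: the paper's information-gain term is the differential entropy of the Beta posterior; to keep this sketch dependency-free, the posterior variance stands in as the uncertainty bonus (both shrink as evidence accumulates). All function and parameter names here are illustrative.

```python
def expected_utility(alpha, beta, rel, risk,
                     r_max=1.0, c_fail=0.5, lam_info=0.1):
    """Closed-form expected utility under a Beta(alpha, beta) posterior.

    rel  : contextual relevance Rel_i(o_t) in [0, 1]
    risk : fraction of similar past failure contexts, Risk_i(o_t)
    NOTE: posterior variance replaces the paper's entropy term I(rho).
    """
    n = alpha + beta
    p_succ = alpha / n                       # E[rho]
    p_fail = beta / n                        # E[1 - rho]
    var = alpha * beta / (n * n * (n + 1))   # epistemic uncertainty
    return rel * p_succ * r_max - risk * p_fail * c_fail + lam_info * var

def select_procedure(candidates, theta_conf=0.0):
    """Pick the highest-utility candidate; None triggers the zero-shot
    LLM fallback, mirroring the confidence-threshold check."""
    best = max(candidates, key=lambda c: expected_utility(**c))
    if expected_utility(**best) < theta_conf:
        return None
    return best
```

A procedure with many logged successes and a relevant context dominates an unreliable one, while a fresh Beta(1, 1) entry gets a larger variance bonus than a well-tested Beta(10, 10) entry, which is the explore-exploit behavior described above.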

Contrastive Refinement of Procedures

As experience accumulates, procedures with both successful and failed instances are subjected to contrastive refinement to improve their accuracy and robustness. For a procedure Proc_i with sets of successful and failed contexts S_i and F_i, the LLM performs discriminative comparison, D_i = ContrastiveExtract(S_i, F_i), identifying differences along three dimensions: (i) precondition patterns (ΔΨ_i^+ and ΔΨ_i^−) that distinguish successful from failed initial contexts, (ii) action discrepancies (Δπ_i) revealing missing or misordered actions, and (iii) postcondition mismatches (ΔΦ_i) that capture incomplete goal states. These discriminators drive explicit refinement operations:

$$
\Psi_i \leftarrow \big(\Psi_i \cup \Delta\Psi_i^{+}\big) \setminus \Delta\Psi_i^{-}, \qquad \pi_i \leftarrow \mathrm{Repair}(\pi_i, \Delta\pi_i), \qquad \Phi_i \leftarrow \mathrm{Refine}(\Phi_i, \Delta\Phi_i).
$$

When distinct execution modes are detected, the procedure is specialized into separate variants with inherited reliability priors. This process progressively tightens applicability conditions and action precision, yielding interpretable improvements purely through memory edits rather than gradient updates.
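The precondition-tightening step can be illustrated with a toy stand-in. The paper delegates ContrastiveExtract to the LLM; here a simple token-frequency contrast between success and failure contexts plays that role, so treat every name and threshold below as an assumption, not the paper's method.

```python
from collections import Counter

def contrastive_discriminators(successes, failures, min_gap=0.5):
    """Toy stand-in for ContrastiveExtract: find tokens whose frequency
    differs sharply between successful and failed contexts.

    successes, failures: lists of context strings (S_i and F_i).
    Returns (delta_pos, delta_neg): success- and failure-discriminating
    patterns, analogous to the paper's precondition deltas.
    """
    def freq(contexts):
        counts = Counter(tok for c in contexts for tok in set(c.split()))
        return {t: counts[t] / len(contexts) for t in counts}
    fs, ff = freq(successes), freq(failures)
    tokens = set(fs) | set(ff)
    delta_pos = {t for t in tokens if fs.get(t, 0) - ff.get(t, 0) >= min_gap}
    delta_neg = {t for t in tokens if ff.get(t, 0) - fs.get(t, 0) >= min_gap}
    return delta_pos, delta_neg

def refine_preconditions(preconds, delta_pos, delta_neg):
    """Tighten: require success-discriminating patterns, exclude the
    failure-discriminating ones (a memory edit, not a gradient update)."""
    return (set(preconds) | delta_pos) - delta_neg
```

Patterns common to successes (e.g., "open") are added as required preconditions, while failure markers (e.g., "closed") are excluded, tightening when the procedure may fire.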

Meta-procedural Composition

To extend reasoning beyond atomic skills, MACLA automatically discovers and learns meta-procedures: structured compositions of procedures that capture recurrent long-horizon strategies. When a sequence of procedures ⟨Proc_{i_1}, ..., Proc_{i_m}⟩ repeatedly leads to success under a common high-level goal, the agent abstracts it as MP_j = ⟨G_j^meta, Ψ_j^meta, {Proc_{i_1}, ..., Proc_{i_m}}, Θ_j⟩. Here, Θ_j denotes a lightweight control policy governing conditional transitions among sub-procedures based on the current observation and execution context, Θ_j(o_t, index) ∈ {continue, skip, repeat, abort}. This policy is distilled by analyzing successful traces, where the LLM identifies observation patterns that triggered each branch: for example, repeating when postconditions are unmet, skipping when preconditions already hold, or aborting when failures recur. Each meta-procedure maintains its own Beta success posterior p(σ_j | D_j) = Beta(α_j, β_j) and is refined periodically to add new branches, reorder sub-procedures, or prune redundant ones. Through these hierarchical compositions, MACLA acquires flexible 'playbooks' that encapsulate extended strategies with conditional logic.
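The control-policy semantics can be sketched as a small interpreter. The callables `policy` and `run_proc`, and the step budget, are illustrative assumptions; they stand in for Θ_j and for procedure execution via the frozen LLM.

```python
def execute_meta(sub_procs, policy, run_proc, obs, max_steps=20):
    """Run a meta-procedure's sub-procedures under a control policy.

    policy(obs, index) -> one of "continue", "skip", "repeat", "abort"
    run_proc(proc, obs) -> (new_obs, success)
    Returns (final_obs, overall_success).  max_steps bounds "repeat" loops.
    """
    i, steps = 0, 0
    while i < len(sub_procs) and steps < max_steps:
        decision = policy(obs, i)
        if decision == "abort":          # recurring failure detected
            return obs, False
        if decision == "skip":           # preconditions already satisfied
            i += 1
            continue
        obs, ok = run_proc(sub_procs[i], obs)
        steps += 1
        if decision == "repeat" and not ok:
            continue                     # retry the same sub-procedure
        i += 1
    return obs, i == len(sub_procs)
```

For instance, with sub-procedures ["open", "take", "close"] and a policy that skips any step whose effect is already visible in the observation, an episode that starts with the drawer open executes only "take" and "close".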

Ontological Semantic Grounding

To enable cross-context generalization (e.g., procedures learned on "mug" applying to "cup"), MACLA constructs a lightweight ontological semantic index during offline memory construction. We extract the k_vocab most frequent words from task descriptions and actions, then cluster semantically similar words using SentenceTransformer embeddings [12] to form an implicit domain ontology:

$$
\mathcal{O} = \{\mathcal{C}_1, \dots, \mathcal{C}_W\} = \mathrm{Cluster}\big(\{\phi(w) : w \in V_{k_{\mathrm{vocab}}}\}\big),
$$

where each cluster C_w represents a semantic category (e.g., C_container = {mug, cup, glass}). During retrieval, observations are mapped to these ontological categories, allowing procedures to match across lexically different but semantically equivalent contexts. This ontological grounding enables domain-adaptive generalization without requiring explicit knowledge engineering.
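A minimal sketch of this clustering, assuming precomputed word embeddings: the greedy single-pass scheme, the fixed similarity threshold, and the frozen centroids are simplifications standing in for the SentenceTransformer-based construction.

```python
from math import sqrt

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def cluster_vocab(words, embed, theta=0.8):
    """Greedy single-pass clustering of vocabulary words by embedding
    similarity.  `embed` maps word -> vector (assumed precomputed).
    Centroids are not updated after creation, a sketch simplification.
    """
    clusters = []  # each: {"centroid": vector, "members": set of words}
    for w in words:
        v = embed[w]
        best, best_sim = None, theta
        for c in clusters:
            s = _cos(v, c["centroid"])
            if s >= best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append({"centroid": list(v), "members": {w}})
        else:
            best["members"].add(w)
    return clusters

def same_category(w1, w2, clusters):
    """True if both words fall in the same ontological category C_w."""
    return any(w1 in c["members"] and w2 in c["members"] for c in clusters)
```

With embeddings where "mug" and "cup" point in nearly the same direction, both land in one C_container-style cluster, so a procedure whose preconditions mention "mug" can match an observation mentioning "cup".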

System Efficiency and Memory Management

To ensure practical scalability, MACLA employs efficient retrieval, bounded growth, and strict control over LLM usage. All procedures and meta-procedures are embedded in an approximate nearest-neighbor index supporting sublinear retrieval (O(log N_p)) for semantic search. The episode buffer stores at most N_b = 1000 steps, providing local context for LLM prompts and post-episode updates. Each procedure maintains a failure index limited to K_fail = 15 entries, managed through success-based removal, redundancy-aware eviction, and temporal decay, ensuring that memory remains concise and informative. To prevent memory saturation, procedures and meta-procedures are periodically pruned using a multi-factor utility score that balances reliability, usage frequency, and temporal relevance:

$$
U_i = \lambda_r \cdot \frac{\alpha_i}{\alpha_i + \beta_i} \;+\; \lambda_f \cdot \frac{n_i}{N_{\mathrm{total}}} \;+\; \lambda_t \cdot \exp\!\left(-\frac{t_{\mathrm{current}} - t^{\mathrm{last}}_i}{\tau}\right),
$$

where α_i/(α_i + β_i) is the Bayesian success rate (reliability), n_i is the execution count of procedure i, N_total is the total invocations across all procedures in the current episode window, t_current is the current episode index, t_i^last is the episode when i was last used, and τ is the temporal decay constant.

The weighting coefficients 𝜆 𝑟 = 0 . 5, 𝜆 𝑓 = 0 . 3, and 𝜆 𝑡 = 0 . 2 reflect the relative importance of each factor: reliability receives the highest weight (0.5) as it directly predicts future success; frequency receives moderate weight (0.3) to favor well-tested procedures while avoiding over-retention of obsolete frequently-used skills; recency receives the lowest weight (0.2) to provide soft temporal decay without aggressive forgetting. These values were determined through grid search over { 0 . 3 , 0 . 4 , 0 . 5 , 0 . 6 } × { 0 . 2 , 0 . 3 , 0 . 4 } × { 0 . 1 , 0 . 2 , 0 . 3 } on ALFWorld validation, with the constraint 𝜆 𝑟 + 𝜆 𝑓 + 𝜆 𝑡 = 1 . 0 for interpretability. The selected configuration (0.5, 0.3, 0.2) yielded the best balance between retaining high-quality procedures (>0.7 success rate) and pruning low-utility entries (<0.4 success rate), as validated later in Figure 4. Entries with the lowest utility are removed while ensuring diversity across goal clusters through stratified sampling. These operations keep the total memory footprint below 4 MB for hundreds of procedures.
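The retention score above is directly computable. A minimal sketch with the paper's weights (λ_r = 0.5, λ_f = 0.3, λ_t = 0.2); the value of the decay constant τ and the function names are assumptions for illustration.

```python
from math import exp

def prune_utility(alpha, beta, n_i, n_total, t_current, t_last,
                  lam_r=0.5, lam_f=0.3, lam_t=0.2, tau=50.0):
    """Multi-factor retention score: reliability + frequency + recency.
    tau is illustrative; the paper does not fix its value in this section.
    """
    reliability = alpha / (alpha + beta)            # Bayesian success rate
    frequency = n_i / max(n_total, 1)               # share of invocations
    recency = exp(-(t_current - t_last) / tau)      # soft temporal decay
    return lam_r * reliability + lam_f * frequency + lam_t * recency

def prune(procs, keep):
    """Keep the `keep` highest-utility procedures, dropping the rest.
    (The paper additionally stratifies by goal cluster for diversity.)"""
    return sorted(procs, key=lambda p: prune_utility(**p), reverse=True)[:keep]
```

A recently used, reliable procedure scores well above a stale, unreliable one, so pruning removes the latter first; the stratified-sampling diversity constraint mentioned above is omitted from this sketch.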

Finally, MACLA limits LLM usage to a fixed budget of API calls per episode to cover segmentation, abstraction, and occasional refinement, while all retrieval, Bayesian scoring, and updates are symbolic or vectorized. As a result, per-step runtime remains effectively constant and inference cost does not scale with experience. This memory-first design ensures that MACLA remains efficient, interpretable, and deployable for continual learning across long interaction horizons. The theoretical foundations are provided in Appendix D.

Algorithm

At runtime, MACLA executes a new task by coupling frozen semantic reasoning with memory-driven decision making. The agent receives an initial observation 𝑜 0 (and, optionally, an instruction string) and embeds it as h 0 = 𝜙 ( 𝑜 0 ) . This embedding queries the semantic index of the external memory to retrieve a compact candidate set consisting of procedures { Proc 𝑖 } and meta-procedures { MP 𝑗 } whose embeddings are most similar to h 0 . Retrieval is approximate nearest neighbor over the concatenated descriptors of goals, preconditions, and postconditions, which keeps lookup sublinear in memory size.

Given the candidate set, MACLA ranks each item with a Bayesian expected-utility score that trades off contextual relevance, estimated success, risk, and information gain under the procedure's Beta posterior. The highest-scoring item above a confidence threshold is selected; otherwise the agent falls back to zero-shot LLM reasoning for that step, logs the outcome, and continues. If a meta-procedure is chosen, execution proceeds hierarchically under its composition policy Θ 𝑗 ( 𝑜 𝑡 , index ) ∈ { continue , skip , repeat , abort } until completion or abort; if an atomic procedure is chosen, the agent checks preconditions Ψ 𝑖 against 𝑜 𝑡 , invokes the action sketch 𝜋 𝑖 via the frozen LLM's action formatter, and verifies postconditions Φ 𝑖 to certify completion. In both cases the outcome updates ( 𝛼, 𝛽 ) and appends the initial context to the corresponding success or failure set for later analysis.

After each execution, the agent re-embeds the new observation and repeats retrieval and selection until the task is solved or a horizon is reached. When a procedure accumulates both successes and failures, a contrastive pass is triggered: the LLM proposes discriminators that tighten Ψ 𝑖 , repair 𝜋 𝑖 , and refine Φ 𝑖 , or if distinct modes are detected, specializes the procedure into variants that inherit prior counts. When successful episodes repeatedly traverse a small set of procedures in a stable order, the agent abstracts a meta-procedure with its own success posterior and a lightweight Θ 𝑗 distilled from divergence points across traces. Throughout, memory remains bounded by pruning with a utility that blends reliability, frequency, and recency, and the LLM-call budget is capped, as retrieval, scoring, and updates are vectorized operations. The complete runtime procedure is outlined in Algorithm 1.

Experiments

We evaluate MACLA on four challenging interactive agent benchmarks spanning diverse domains. All experiments use consistent hyperparameters across tasks to demonstrate generalization without task-specific tuning.

Algorithm 1 MACLA Runtime Procedure with Function Descriptions

Require: observation o_0, memory M (procedures, meta-procedures, indices), horizon H
1: h ← φ(o_0)  ⊲ Embed observation
2: C ← RetrieveCandidates(h, M)  ⊲ Top-k ANN search
3: while not Terminal and t < H do
4:   for all c ∈ C do
5:     EU[c] ← ExpectedUtility(c, o_t, M)  ⊲ Compute Eq. 4
6:   end for
7:   c★ ← arg max_{c ∈ C} EU[c]
8:   if EU[c★] < θ_conf then
9:     (o_{t+1}, y) ← ZeroShotStep(o_t)  ⊲ LLM generates action directly
10:  else if c★ is MP_j then
11:    (o_{t+1}, y) ← ExecuteMeta(MP_j, Θ_j, o_t)  ⊲ Run with control policy
12:    (α_j, β_j) ← UpdateBeta((α_j, β_j), y)  ⊲ α ← α + y, β ← β + (1 − y)
13:  else  ⊲ c★ is atomic Proc_i
14:    if CheckPre(Ψ_i, o_t) then  ⊲ Verify preconditions match o_t
15:      (o_{t+1}, y) ← ExecuteProc(π_i, o_t)  ⊲ Instantiate & execute π_i
16:      y ← y ∧ CheckPost(Φ_i, o_{t+1})  ⊲ Verify postconditions in o_{t+1}
17:    else
18:      (o_{t+1}, y) ← ZeroShotStep(o_t)  ⊲ Preconditions failed, fallback
19:    end if
20:    (α_i, β_i) ← UpdateBeta((α_i, β_i), y)
21:    RecordContext(S_i, F_i, o_t, y)  ⊲ Add to success/fail sets
22:  end if
23:  if RefineTrigger(c★) then  ⊲ If |S|, |F| ≥ 3
24:    ContrastiveRefine(c★)  ⊲ LLM compares S vs. F (§4.3)
25:  end if
26:  h ← φ(o_{t+1}); C ← RetrieveCandidates(h, M); t ← t + 1
27: end while
28: if EligibleForMeta(trace) then  ⊲ If ≥ 3 procs in stable order
29:   ExtractOrRefineMeta(trace, M)  ⊲ Create/update meta-proc
30: end if
31: PruneAndMaintain(M)  ⊲ Remove low-utility via Eq. 8

Experimental Setting

Memory Architecture. Episode buffer N_buffer = 1000 (stores recent observations and actions for temporal context during action generation); procedural memory N_proc = 200 (capacity for extracted reusable skills); meta-procedural memory N_meta = 50 (capacity for hierarchical procedure compositions). Critically, MACLA does not store raw trajectories. Instead, the LLM segments each episode into coherent sub-tasks and extracts structured procedures (Section 4.1). Duplicate detection with similarity threshold θ_dup = 0.85 prevents redundant storage. Through this process, the 2,851 ALFWorld training trajectories compress into approximately 187 unique procedures, demonstrating efficient knowledge distillation from experience.

Bayesian Selection. Information-gain weight λ_info = 0.1, failure cost C_fail = 0.5. These parameters balance exploration (trying uncertain procedures to reduce epistemic uncertainty) with exploitation (selecting high-posterior reliable procedures).

Contrastive Refinement. Minimum contexts n_min^s = n_min^f = 3. Refinement activates only when a procedure has accumulated at least 3 successes and 3 failures, ensuring sufficient statistical evidence for discriminative pattern extraction.

LLM Configuration. Llama-2-7B [18] via Ollama with 4-bit quantization and temperature T = 0.7. The LLM parameters remain frozen throughout all experiments; learning occurs exclusively through external memory updates.

Benchmarks and Dataset Statistics. ALFWorld [16] (2,851 train, 274 test) is a text-based embodied environment with six household tasks (e.g., retrieval, placement). We follow the standard train/validation-seen/validation-unseen split, where test trajectories feature novel object-location configurations. WebShop [25] (1,624 train, 200 test) simulates e-commerce search over 12,087 products, requiring agents to follow natural-language instructions via multi-step navigation and filtering. TravelPlanner [21] (1,000 train, 180 validation, 45 test) involves multi-day itinerary planning under hard constraints (budget, dates) and soft preferences (cuisine, attractions). Evaluation uses Common Sense (CS) and Hard Constraint (HC) scores. InterCodeSQL [24] benchmarks interactive text-to-SQL generation over diverse schemas, requiring correct handling of schema relationships and varying query difficulty.

Experimental Results and Analysis

Table 1 compares MACLA against state-of-the-art baselines across all benchmarks. We organize baselines into three paradigms: prompt-based methods using in-context learning, outcome-refinement approaches optimizing trajectory-level rewards, and process-refinement methods refining step-level generation. MACLA achieves the highest average performance (78.1%) while using a 7B-parameter model, demonstrating that domain-agnostic procedural memory with Bayesian selection and contrastive refinement enables competitive performance without task-specific engineering.

In Table 1, MACLA achieves state-of-the-art results on TravelPlanner (83.3 CS) and ALFWorld-Unseen (90.3%), outperforming methods that rely on models 10× larger. Its strong performance across all benchmarks demonstrates cross-domain generalization, while the positive generalization gap on ALFWorld (+3.1 points for unseen vs. seen) indicates robust compositional transfer rather than memorization.

Conclusion

We presented MACLA, a framework that decouples reasoning from learning by maintaining a frozen LLM and performing all adaptation in an external hierarchical procedural memory through Bayesian selection, contrastive refinement, and meta-procedural composition. MACLA achieves 78.1% average performance across four benchmarks using only a 7B model, with state-of-the-art results on ALFWorld (87.2% seen; 90.3% unseen) and TravelPlanner (83.3%). The system compresses 2,851 ALFWorld training trajectories into 187 reusable procedures through semantic abstraction and duplicate detection, demonstrating efficient knowledge distillation.

Detailed Ablation Studies and Memory Analysis

This section provides comprehensive ablation studies examining MACLA's component contributions, memory scaling behavior, and task-specific effectiveness. These experiments address critical questions about system design choices and identify performance bottlenecks across different benchmarks. Table 4 systematically evaluates the contribution of each MACLA component by measuring performance degradation when individual modules are removed. Beyond success rates, we track memory dynamics (procedure/metaprocedure counts), behavioral patterns (reuse rate), and computational efficiency (LLM calls per episode).

Table 4: Component ablation and memory dynamics analysis on ALFWorld. All variants use Llama-2-7B.

Proc./Meta Count: final memory size after 200 episodes. Reuse Rate: % of actions from retrieved procedures vs. zero-shot LLM. LLM Calls: average per episode.

Synergistic Effects: No single component accounts for MACLA's full performance. The combination of Bayesian selection (uncertainty-aware), contrastive learning (quality refinement), and meta-procedures (hierarchical composition) creates synergistic effects: Bayesian selection identifies reliable procedures, contrastive learning makes them more robust, and meta-procedures compose them efficiently.

Memory Capacity Scaling

Table 5 investigates the relationship between memory capacity and performance, addressing whether larger memory always yields better results or if there exists an optimal capacity.

Table 5: Impact of procedural memory capacity on performance. Results on ALFWorld after 200 training episodes.

Actual Proc.: number of procedures after training (may be fewer than capacity if not all slots are filled). Avg α/(α+β): mean posterior success rate across all procedures.

Task-Specific Memory Effectiveness

Figure 5 explains the SQL underperformance through three metrics. Low reuse (51%): SQL queries are schema-specific; e.g., customers.age does not apply to employees.experience. ALFWorld generalizes via semantic placeholders, but SQL column names vary unpredictably. Low reliability (64%): schema mismatches, join complexity, and edge cases accumulate failures (β counts), suppressing posteriors. Minimal composition (18%): SQL queries are atomic (2-3 actions), too short for meta-procedures, whereas ALFWorld tasks naturally decompose into multi-step sub-procedures. MACLA excels when tasks have (1) reusable actions, (2) hierarchical structure, and (3) consistent semantics; SQL violates all three.

Extended Experimental Analysis

This appendix provides detailed visualizations and analyses addressing the memory dynamics, Bayesian learning mechanics, and task-specific performance characteristics of MACLA.

Bootstrapping and Memory Growth Dynamics

Figure 6 demonstrates the bootstrapping effect: MACLA's ability to learn from imperfect initial experiences without requiring pre-trained demonstrations. The learning curve reveals three emergent phases that were not explicitly programmed:

Figure 6: Learning dynamics over 2,851 training trajectories on ALFWorld. (a) Success rate progression shows three distinct phases: exploration (trajectories 1-570), consolidation (571-1,425), and exploitation (1,426-2,851). (b) Memory growth demonstrates rapid procedure extraction during exploration, followed by meta-procedure formation during consolidation. The system extracts 187 unique procedures from 2,851 trajectories (15:1 compression), never exceeding the 200-capacity limit. (c) Average Bayesian posterior α/(α+β) converges from optimistic initialization (0.5) to the empirical success rate (0.79), with the shaded region showing ±1 standard deviation across procedures. (d) LLM fallback rate decreases from 100% (pure zero-shot) to <5% as procedural memory becomes comprehensive.


The four-panel layout efficiently shows the temporal correlation between observable performance (success rate, panel a) and internal learning mechanics (memory growth, posterior convergence, fallback reduction). The phase-shaded background in panel (a) makes regime transitions immediately apparent. Panel (c)'s confidence band demonstrates variance reduction: epistemic uncertainty decreases as evidence accumulates, a hallmark of Bayesian learning.

Cold-Start Capability. MACLA achieves 82% success using only the first 1,425 trajectories (50% of the training data) without any parameter training. This addresses the cold-start problem that plagues supervised fine-tuning methods requiring large expert datasets. The learning curve shows MACLA is highly sample-efficient: 20% of the data (570 trajectories) achieves 45% performance, while the final 50% adds diminishing returns. This logarithmic growth contrasts with neural approaches that require full-dataset training for convergence.

Compression and Generalization. The 15:1 compression ratio (2,851 trajectories → 187 procedures) demonstrates efficient knowledge distillation through semantic abstraction. Rather than memorizing individual trajectories, MACLA extracts reusable patterns that generalize across contexts. The plateau in panel (b) at 187 procedures suggests ALFWorld's task space has finite inherent complexity; beyond this point, new trajectories are covered by existing procedures with generalized preconditions.

Detailed Execution Trace Analysis

This appendix provides a complete time-stamped execution trace of MACLA solving an ALFWorld unseen task, demonstrating how procedural memory, Bayesian selection, and contrastive refinement operate in practice. The trace illustrates information flow through all architectural components during both online inference (time steps 𝑡0-𝑡8) and post-episode learning (𝑡9).

Task Description and Setup

Task: valid_unseen_0 from ALFWorld validation-unseen split: 'Put chilled lettuce on the counter.'

Challenge: This task requires hierarchical reasoning with an implicit precondition: the lettuce must be cooled before placement. The compound modifier 'chilled' signals a two-stage plan: (1) cool the object, then (2) place it on the counter. The task is unseen because the specific object-appliance-location triplet (lettuce-fridge-countertop) was not present in training trajectories, testing compositional generalization.

Initial State: Agent in kitchen, lettuce on countertop 2, fridge 1 available but closed.

Memory State: Procedural memory contains 199 learned procedures, including object_cooling (α = 10, β = 3, success rate 76.9%) and object_placement (α = 8, β = 2, success rate 80.0%). Meta-procedural memory contains 50 compositions learned from other object configurations (e.g., potato-fridge-table, apple-fridge-shelf), but none directly matching the lettuce-fridge-countertop configuration.

Execution Timeline

Table 7 presents the complete timestep-by-timestep trace. Each row captures the state and decisions of four core components: (1) LLM for semantic parsing and goal discovery, (2) Bayesian Selector for uncertainty-aware procedure ranking, (3) Memory System for procedure storage and retrieval, and (4) Contrastive Refiner for post-episode learning from success/failure patterns.

LLM Call Count: This episode requires 2 full LLM inference calls (marked with ★): initial task parsing at 𝑡0 and post-episode segmentation at 𝑡9. All intermediate actions (𝑡1-𝑡8) use template-based instantiation without LLM generation, demonstrating MACLA's efficiency advantage over methods like ReAct that require LLM reasoning at each step.
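This template instantiation step can be sketched in a few lines; the slot names and action strings below are illustrative, not MACLA's actual templates:

```python
# Illustrative sketch: filling a retrieved procedure's action templates from the
# current observation bindings, with no LLM call per action.
def instantiate(template: str, bindings: dict) -> str:
    action = template
    for slot, value in bindings.items():
        action = action.replace("<" + slot + ">", value)
    return action

proc_actions = ["open <appliance>", "put <object> in <appliance>", "close <appliance>"]
bindings = {"object": "lettuce", "appliance": "fridge 1"}
print([instantiate(a, bindings) for a in proc_actions])
# ['open fridge 1', 'put lettuce in fridge 1', 'close fridge 1']
```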

Table 7: Time-stamped execution trace of MACLA on ALFWorld task valid_unseen_0 ('Put chilled lettuce on the counter'). Each timestep shows information flow through LLM, Bayesian Selector, Memory System, and Contrastive Refiner. ★ indicates full LLM inference calls. All numerical values verified against system outputs.



Trace Verification Methodology

This execution trace was verified through multiple independent methods to ensure accuracy:

Posterior update verification: the cooling success at 𝑡6 reproduces the reported update,

$$\mathrm{Beta}(10,3)\ \rightarrow\ \mathrm{Beta}(11,3), \qquad \hat{\rho} = \frac{11}{11+3} \approx 0.786.$$

Information gain calculation (Equation 37):

$$\Delta H = H\big[\mathrm{Beta}(11,3)\big] - H\big[\mathrm{Beta}(10,3)\big] = -0.136\ \text{nats}.$$

(Negative entropy change indicates reduced uncertainty; we report absolute value in table.)

Expected-utility verification: recomputing the fridge procedure's EU from its stored relevance and risk scores reproduces EU ≈ 0.83. (The small discrepancy is due to rounding in the relevance and risk scores; within tolerance.)

Reproducibility. Complete reproduction instructions are provided in the public code repository.

Key Observations and Architectural Insights

Hierarchical Goal Decomposition (𝑡0-𝑡2). The LLM immediately recognizes 'chilled' as imposing a temporal constraint, inferring the cooling precondition without explicit instruction. This demonstrates the frozen LLM's semantic reasoning capability: it parses compound task specifications into hierarchical subgoals. The Bayesian Selector then orders these subgoals by expected utility while flagging dependency violations, ensuring preconditions are satisfied before attempting dependent actions.

Uncertainty-Aware Procedure Selection (𝑡3). When choosing between fridge (EU = 0.83, α = 10, β = 3) and freezer (EU = 0.58, α = 4, β = 6) for cooling, the Bayesian Selector favors fridge despite both having similar semantic relevance (sim > 0.85). The key difference lies in the posterior distributions: fridge has higher expected success (76.9% vs. 40.0%) and lower uncertainty (σ² = 0.0127 vs. 0.022). This illustrates how Bayesian selection balances exploitation (choosing high-ρ̂ procedures) with exploration (considering information gain for uncertain procedures).
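The posterior statistics above follow directly from the Beta distribution; a quick check of the reported means and variances:

```python
def beta_mean(a, b):
    # posterior mean of a Beta(a, b) success-rate estimate: a / (a + b)
    return a / (a + b)

def beta_var(a, b):
    # posterior variance: ab / ((a + b)^2 (a + b + 1))
    return a * b / ((a + b) ** 2 * (a + b + 1))

print(round(beta_mean(10, 3), 3), round(beta_var(10, 3), 4))  # fridge:  0.769 0.0127
print(round(beta_mean(4, 6), 3), round(beta_var(4, 6), 4))    # freezer: 0.4 0.0218
```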

Minimal LLM Usage (𝑡0-𝑡9). The entire episode requires only 2 LLM calls: (1) initial goal parsing at 𝑡0 (436 total tokens), and (2) symbolic summary generation at 𝑡9 (568 total tokens). Once procedures are retrieved, all subsequent actions are generated by instantiating learned templates with current observations. This demonstrates MACLA's core efficiency advantage: procedural memory amortizes LLM costs across episodes, achieving >85% token reduction compared to ReAct's per-step reasoning.

Online Bayesian Updates (𝑡6). After the successful cooling, the posterior updates from Beta(10,3) to Beta(11,3), shifting the expected success rate from 76.9% to 78.6%. The information gain (ΔH = 0.136 nats) quantifies reduced epistemic uncertainty. This online learning happens during episode execution without any parameter updates to the frozen LLM, enabling continual improvement through memory refinement.
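This conjugate update is a one-line operation on the stored counts; a minimal sketch:

```python
def bayes_update(alpha, beta, success):
    # conjugate Beta-Bernoulli update: a success increments alpha, a failure beta
    return (alpha + 1, beta) if success else (alpha, beta + 1)

a, b = bayes_update(10, 3, success=True)  # cooling succeeded at t6
print(a, b, round(a / (a + b), 3))        # 11 3 0.786
```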

Meta-Procedure Formation ( 𝑡 9 ). Post-episode analysis detects that cooling → placement occurs in 20% of recent episodes across different object-appliance-location configurations (potato-fridge-table, apple-fridge-shelf, lettuce-fridge-countertop). The system automatically creates meta_cool_and_place_object , a higher-level composition that encapsulates both procedures with a conditional execution policy: 'if task contains cooling modifier (chilled/frozen/cold), execute cooling then placement; else skip to placement.' This meta-procedure abstracts over specific objects and locations, demonstrating compositional generalization. Future episodes with similar task structures can invoke this meta-procedure directly, reducing planning depth from 2 retrievals to 1.

Contrastive Learning Preparation (𝑡9). Although this episode succeeded, MACLA logs success patterns ('chilled via fridge') for future contrastive refinement. The memory now contains 11 cooling successes and 3 failures. When the next cooling failure occurs, contrastive analysis will activate (threshold: min(|S|, |F|) ≥ 3), extracting discriminative patterns by comparing success contexts (fridge, refrigerator) against failure contexts (hypothetically: oven, microwave). These refined preconditions prevent future errors by learning 'cooling requires cold appliances, not heat sources.'
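The activation threshold is a simple check on the stored outcome sets; sketched below (the context strings are illustrative):

```python
def contrastive_ready(successes, failures, k=3):
    # contrastive analysis activates once min(|S|, |F|) >= k (k = 3 in the paper)
    return min(len(successes), len(failures)) >= k

S = ["fridge"] * 11                  # 11 cooling successes
F = ["oven", "microwave", "oven"]    # 3 failures (hypothetical contexts)
print(contrastive_ready(S, F))       # True
```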


To extend reasoning beyond atomic skills, MACLA automatically discovers and learns meta-procedures: structured compositions of procedures that capture recurrent long-horizon strategies. When a sequence of procedures ⟨Proc_{i1}, ..., Proc_{im}⟩ repeatedly leads to success under a common high-level goal, the agent abstracts it as MP_j = ⟨G_j^meta, Ψ_j^meta, {Proc_{i1}, ..., Proc_{im}}, Θ_j⟩. Here, Θ_j denotes a lightweight control policy governing conditional transitions among sub-procedures based on the current observation and execution context, Θ_j(o_t, index) ∈ {continue, skip, repeat, abort}. This policy is distilled by analyzing successful traces, where the LLM identifies observation patterns that triggered each branch: for example, repeating when postconditions are unmet, skipping when preconditions already hold, or aborting when failures recur. Each meta-procedure maintains its own Beta success posterior p(σ_j | D_j) = Beta(α_j, β_j) and is refined periodically to add new branches, reorder sub-procedures, or prune redundant ones. Through these hierarchical compositions, MACLA acquires flexible 'playbooks' that encapsulate extended strategies with conditional logic.
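A minimal sketch of such a control policy Θ_j; the boolean arguments are simplifications, since the actual policy derives them by matching learned patterns against the raw observation:

```python
def theta(obs, pre_holds, post_ok, retries, max_retries=2):
    """Sketch of Theta_j(o_t, index) in {continue, skip, repeat, abort}.
    pre_holds/post_ok stand in for pattern matches against the observation obs."""
    if pre_holds:                 # precondition of next sub-procedure already holds
        return "skip"
    if not post_ok:               # sub-procedure's postcondition unmet
        return "abort" if retries >= max_retries else "repeat"
    return "continue"

print(theta("lettuce is cool", pre_holds=True, post_ok=True, retries=0))    # skip
print(theta("lettuce is warm", pre_holds=False, post_ok=False, retries=0))  # repeat
```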


As experience accumulates, procedures with both successful and failed instances undergo contrastive refinement to improve their accuracy and robustness. For a procedure Proc_i with sets of successful and failed contexts S_i and F_i, the LLM performs a discriminative comparison, D_i = ContrastiveExtract(S_i, F_i), identifying differences along three dimensions: (i) precondition patterns (ΔΨ_i^+ and ΔΨ_i^-) that distinguish successful from failed initial contexts, (ii) action discrepancies (Δπ_i) revealing missing or misordered actions, and (iii) postcondition mismatches (ΔΦ_i) that capture incomplete goal states. These discriminators drive explicit refinement operations.

When distinct execution modes are detected, the procedure is specialized into separate variants with inherited reliability priors. This process progressively tightens applicability conditions and action precision, yielding interpretable improvements purely through memory edits rather than gradient updates.
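To make the contrastive comparison concrete, here is a toy stand-in for ContrastiveExtract that uses set differences over symbolic context features. MACLA performs this comparison with the LLM over natural-language contexts, so this is an illustrative approximation only:

```python
def contrastive_extract(success_ctx, failure_ctx):
    # Features common to all successes but never seen in failures become candidate
    # positive preconditions (delta Psi+); features common to all failures but never
    # seen in successes become failure indicators (delta Psi-).
    S_common = set.intersection(*map(set, success_ctx))
    F_common = set.intersection(*map(set, failure_ctx))
    S_all = set().union(*map(set, success_ctx))
    F_all = set().union(*map(set, failure_ctx))
    return {"pre_plus": S_common - F_all, "pre_minus": F_common - S_all}

d = contrastive_extract(
    [{"appliance:fridge", "object:lettuce"}, {"appliance:fridge", "object:potato"}],
    [{"appliance:oven", "object:lettuce"}],
)
print(d["pre_plus"], d["pre_minus"])  # {'appliance:fridge'} {'appliance:oven'}
```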

Comparison to Alternative Approaches

vs. ReAct [26]. ReAct would require 16-20 LLM calls for this task: reasoning before each action (8 actions × 2 calls/action for 'thought' and 'action'), plus initial planning and reflection. MACLA reduces this to 2 calls by retrieving learned procedures, an >85% reduction in LLM inference overhead.

vs. Reflexion [15]. Reflexion's reflection phase would add 5-8 additional LLM calls for post-episode self-critique and memory update. MACLA's structured Bayesian updates and contrastive refinement achieve similar memory improvements without these extra calls, while providing formal uncertainty quantification through Beta posteriors.

vs. Supervised Fine-Tuning (SFT). SFT would treat this entire 8-action trajectory as a single training example, backpropagating based solely on the terminal success signal. MACLA decomposes it into reusable procedures (cooling, placement), each receiving independent Bayesian credit assignment. When the cooling procedure succeeds at 𝑡6, its posterior updates immediately, even before episode completion. This step-level credit assignment enables more efficient learning from sparse reward signals.


Generalization to Unseen Tasks

This execution demonstrates three levels of generalization:

  1. Object Generalization: The cooling procedure was learned from trajectories involving potatoes and apples (7 potato episodes, 2 apple episodes in the training set), yet successfully applies to lettuce without any lettuce-specific training. Semantic abstraction via placeholders enables transfer across object categories by parameterizing procedures over entity types rather than specific instances.

  2. Compositional Generalization: The specific cooling → placement sequence for lettuce-fridge-countertop was never observed during training. MACLA composes two independently-learned procedures based on precondition-postcondition matching: cooling's postcondition cooled(object) satisfies placement's precondition, enabling automatic chaining. This demonstrates hierarchical reasoning without explicit composition supervision.

  3. Bayesian Adaptation: The fridge selection leverages Bayesian posteriors aggregated across all past cooling episodes (10 successes, 3 failures across different objects and contexts). This cross-context knowledge transfer is impossible for purely episodic memory systems that treat each experience independently. The Beta(10,3) posterior encodes reliability estimates that generalize beyond training distributions.

Limitations and Edge Cases

    Failure Case: Ambiguous Preconditions. If the task were 'Put lettuce on the counter' (without 'chilled' modifier), MACLA might incorrectly infer a cooling precondition based on high co-occurrence in training (20% of placement tasks involved prior cooling). This false positive would waste 4-5 actions (navigate, open, cool, close) cooling an object that doesn't require it. Contrastive refinement can mitigate this by learning that 'chilled' is a necessary keyword for cooling, not merely frequent. After observing successful non-cooling placements, the system would learn: 'cooling required ⇔ {chilled, frozen, cold} ∈ task_modifiers.'

    Computational Overhead. Bayesian selection at each decision point requires scoring all retrieved procedures (typically 5-10 candidates via FAISS retrieval). While fast (0.4ms per decision with 199 procedures), this overhead accumulates in long episodes (50+ steps). Meta-procedures partially address this by providing pre-composed plans that skip lower-level selection, reducing the number of decision points by 40-60% for complex tasks.

Memory Capacity. With procedural memory capped at N_p = 200, the utility-based pruning mechanism (Section 2.7.2) activates when new procedures are extracted. Procedures with success rates below 60% and usage counts below 5 are evicted first. This can cause catastrophic forgetting of rare but important skills (e.g., emergency procedures used <1% of the time). Future work should explore (1) dynamic memory expansion based on task diversity, (2) hierarchical memory with separate buffers for common vs. rare skills, or (3) importance-weighted retention that preserves high-impact procedures regardless of frequency.
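The eviction rule can be sketched as a filter over stored (α, β, usage) statistics; the field names and example procedures here are illustrative:

```python
def evict_candidates(procs, min_rate=0.60, min_uses=5):
    # utility-based pruning: procedures with posterior success rate < 60%
    # AND usage count < 5 are eviction candidates, lowest rate evicted first
    rate = lambda p: p["alpha"] / (p["alpha"] + p["beta"])
    return sorted((p for p in procs if rate(p) < min_rate and p["uses"] < min_uses),
                  key=rate)

procs = [
    {"name": "object_cooling", "alpha": 10, "beta": 3, "uses": 13},
    {"name": "rare_emergency", "alpha": 1, "beta": 2, "uses": 2},
]
print([p["name"] for p in evict_candidates(procs)])  # ['rare_emergency']
```

Note how the reliable, frequently used cooling procedure survives while the rare low-posterior skill is evicted, which is exactly the forgetting risk discussed above.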

Precondition Inference Errors. The LLM-based precondition extraction at 𝑡0 can hallucinate dependencies not present in the task specification. For instance, if training data frequently shows 'take X' followed by 'examine X,' the system might incorrectly infer that examination is a precondition for all retrieval tasks. Contrastive learning helps correct these errors by identifying cases where the inferred precondition was violated yet the task succeeded.


    MACLA's efficiency comes from three design choices: (1) the frozen LLM eliminates gradient updates, (2) external memory construction is trivially parallelizable, and (3) learned procedures amortize LLM costs across episodes. Table 3 summarizes training costs.



    Figure 5: Cross-domain analysis. (a) Memory reuse: 51% (SQL) to 78% (ALFWorld). (b) Procedure reliability: 64% (SQL) to 81% (ALFWorld). (c) Meta-procedure usage: 18% (SQL) to 51% (TravelPlanner).


    Theoretical Foundations

    This section addresses several design choices in MACLA that currently lack rigorous theoretical grounding, and proposes formal justifications that strengthen the framework's foundations.

    Ad-hoc Thresholds and Their Implications

    MACLA employs several threshold-based mechanisms whose values were determined empirically rather than through principled derivation:

    Table 8: Summary of threshold parameters and their justification status

D.1.1 Duplicate Detection Threshold θ_dup. The duplicate detection mechanism uses cosine similarity with threshold θ_dup = 0.85:

$$\mathrm{merge}(\mathrm{Proc}_i, \mathrm{Proc}_j) \iff \cos(\mathbf{e}_i, \mathbf{e}_j) \geq \theta_{\text{dup}} = 0.85,$$

where $\mathbf{e}_i$ denotes the embedding of procedure $i$.

    Proposed Theoretical Foundation: We can derive an optimal threshold from information-theoretic principles by minimizing expected description length:

$$\theta_{\text{dup}}^{*} = \arg\min_{\theta}\Big[\, N_p(\theta)\,\log|\mathcal{A}| + \sum_{i=1}^{N_p(\theta)} H[\mathrm{Proc}_i] \,\Big],$$

where N_p(θ) is the number of unique procedures at threshold θ, |A| is the action vocabulary size, and H[Proc_i] is the entropy of procedure i. This formulation trades off memory compression (fewer procedures) against information loss (overly aggressive merging). Sensitivity Analysis: Figure 7 shows performance varies by ±4.2% for θ_dup ∈ [0.75, 0.95], indicating moderate sensitivity.
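The underlying merge test is a cosine-similarity comparison against stored procedure embeddings; a minimal sketch with toy 2-d embeddings (real embeddings would come from a sentence encoder):

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def is_duplicate(emb_new, memory_embs, theta_dup=0.85):
    # merge a new procedure when it matches an existing one at similarity >= theta_dup
    return any(cosine(emb_new, e) >= theta_dup for e in memory_embs)

print(is_duplicate([1.0, 0.0], [[0.9, 0.1]]))  # True  (cos ~ 0.99)
print(is_duplicate([1.0, 0.0], [[0.0, 1.0]]))  # False (orthogonal)
```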

D.1.2 Confidence Threshold θ_conf. Selection proceeds only when max_i EU(Proc_i | o_t) > θ_conf = 0.4; otherwise, the system falls back to zero-shot LLM reasoning.

    Problem: The value 0.4 appears arbitrary and is not calibrated to expected utility units.

    Proposed Theoretical Foundation: The confidence threshold should be set based on the expected cost of zero-shot LLM fallback. Let 𝐶 LLM be the computational cost and 𝜌 LLM be the zero-shot success rate. Then:

$$\theta_{\text{conf}}^{*} = \rho_{\text{LLM}}\, R_{\max} - (1-\rho_{\text{LLM}})\, C_{\text{fail}} - C_{\text{LLM}}.$$

For ALFWorld with ρ_LLM ≈ 0.42 (from the Llama-2-7B baseline), R_max = 1.0, C_fail = 0.5, and normalized C_LLM = 0.15:

$$\theta_{\text{conf}}^{*} = 0.42 \times 1.0 - 0.58 \times 0.5 - 0.15 = -0.02.$$

    This suggests always using procedures when available. The empirical value of 0.4 likely compensates for model miscalibration.
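The arithmetic behind this conclusion, assuming the additive utility form used above:

```python
def fallback_threshold(rho_llm, r_max=1.0, c_fail=0.5, c_llm=0.15):
    # expected utility of zero-shot LLM fallback; a stored procedure is worth
    # using whenever its expected utility exceeds this value
    return rho_llm * r_max - (1 - rho_llm) * c_fail - c_llm

print(round(fallback_threshold(0.42), 2))  # -0.02: any positive-EU procedure wins
```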

    Figure 7: Sensitivity of performance and memory usage to duplicate detection threshold on ALFWorld-Seen.


    Calibration-Aware Threshold: Account for Beta posterior miscalibration:

$$\theta_{\text{conf}}^{*} = \big(\rho_{\text{LLM}}\, R_{\max} - (1-\rho_{\text{LLM}})\, C_{\text{fail}} - C_{\text{LLM}}\big) + \lambda_{\text{calib}}\, \sqrt{\mathrm{Var}[\mathrm{EU}_{\text{proc}}]},$$

where Var[EU_proc] captures uncertainty in procedure success rates. With estimated λ_calib ≈ 2.0 from cross-validation, this yields θ*_conf ≈ 0.38, closer to the empirical value.

D.1.3 Meta-Procedure Formation Threshold θ_meta. Meta-procedures are created when a sequence appears in ≥15% of recent episodes. Problem: This frequency-based criterion ignores sequence length, storage cost, and the expected reward improvement from composition.

    Proposed Theoretical Foundation: Define meta-procedure value as:

$$V(\mathrm{MP}_j) = f_j\, \ell_j\; \mathbb{E}[\Delta R \mid \mathrm{MP}_j] - c_{\text{store}},$$

where f_j is frequency, ℓ_j is average length, c_store is memory cost, and E[ΔR | MP_j] is the expected reward improvement from composition vs. separate procedures.

$$\text{create } \mathrm{MP}_j \iff V(\mathrm{MP}_j) > 0.$$

    This ensures meta-procedures are created based on value maximization rather than arbitrary frequency thresholds.
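Under this formulation the creation decision reduces to a sign test; the storage cost below is an illustrative value, since c_store is not reported:

```python
def meta_value(freq, avg_len, exp_gain, c_store=0.05):
    # V(MP_j) = f_j * l_j * E[dR | MP_j] - c_store; c_store = 0.05 is illustrative
    return freq * avg_len * exp_gain - c_store

# cooling -> placement: 20% frequency, length 2, assumed modest reward gain
print(meta_value(0.20, 2, 0.15) > 0)  # True -> create the meta-procedure
```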

    Duplicate Detection Threshold $ theta_{ text{dup

    Confidence Threshold $ theta_{ text{conf

    MACLA employs several threshold-based mechanisms whose values were determined empirically rather than through principled derivation:

    Table 8: Summary of threshold parameters and their justification status

    D.1.1 Duplicate Detection Threshold 𝜃 dup. The duplicate detection mechanism uses cosine similarity with threshold 𝜃 dup = 0 . 85:

    $$

    $$

    Proposed Theoretical Foundation: We can derive an optimal threshold from information-theoretic principles by minimizing expected description length:

    $$

    $$

      where 𝑁 𝑝 ( 𝜃 ) is the number of unique procedures at threshold 𝜃 , |A| is the action vocabulary size, and 𝐻 [ Proc 𝑖 ] is the entropy of procedure 𝑖 . This formulation trades off memory compression (fewer procedures) against information loss (overly aggressive merging). Sensitivity Analysis: Figure 7 shows performance varies by ± 4 . 2% when 𝜃 dup ∈ [ 0 . 75 , 0 . 95 ] , indicating moderate sensitivity.

    D.1.2 Confidence Threshold 𝜃 conf . Selection proceeds only when max 𝑖 EU ( Proc 𝑖 | 𝑜 𝑡 ) > 𝜃 conf = 0 . 4. Otherwise, the system falls back to zero-shot LLM reasoning.

    Problem: The value 0.4 appears arbitrary and is not calibrated to expected utility units.

    Proposed Theoretical Foundation: The confidence threshold should be set based on the expected cost of zero-shot LLM fallback. Let 𝐶 LLM be the computational cost and 𝜌 LLM be the zero-shot success rate. Then:

    $$

    $$

    For ALFWorld with 𝜌 LLM ≈ 0 . 42 (from Llama-2-7B baseline), 𝑅 max = 1 . 0, 𝐶 fail = 0 . 5, and normalized 𝐶 LLM = 0 .

    15:

    $$

    $$

    This suggests always using procedures when available. The empirical value of 0.4 likely compensates for model miscalibration.

    Figure 7: Sensitivity of performance and memory usage to duplicate detection threshold on ALFWorld-Seen.

    Figure 7: Sensitivity of performance and memory usage to duplicate detection threshold on ALFWorld-Seen.

    Calibration-Aware Threshold: Account for Beta posterior miscalibration:

    $$

    $$

    where Var [ EUproc ] captures uncertainty in procedure success rates. With estimated 𝜆 calib ≈ 2 . 0 from cross-validation, this yields 𝜃 ∗ conf ≈ 0 . 38, closer to the empirical value.

    D.1.3 Meta-Procedure Formation Threshold 𝜃 meta. Meta-procedures are created when a sequence appears in ≥ 15% of recent episodes. Problem: This frequency-based criterion ignores:

    Proposed Theoretical Foundation: Define meta-procedure value as:

    $$

    $$

    where 𝑓 𝑗 is frequency, ℓ 𝑗 is average length, 𝑐 store is memory cost, and E [ Δ 𝑅 | MP 𝑗 ] is expected reward improvement from composition vs. separate procedures.

$$
\text{create } \text{MP}_j \iff V(\text{MP}_j) > 0
$$

    This ensures meta-procedures are created based on value maximization rather than arbitrary frequency thresholds.
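The value criterion can be sketched as follows. This is a hedged illustration: the multiplicative combination of frequency, length, and reward gain is an assumed form consistent with the variables defined above, and the function names are illustrative, not MACLA's released API.

```python
# f_j: frequency, len_j: average length, dr_j: expected reward
# improvement E[dR | MP_j], c_store: memory/storage cost (see text).
def metaprocedure_value(f_j: float, len_j: float,
                        dr_j: float, c_store: float) -> float:
    # Assumed combining form: value scales with how often the sequence
    # occurs, how long it is, and how much reward composition adds.
    return f_j * len_j * dr_j - c_store

def should_create(f_j: float, len_j: float,
                  dr_j: float, c_store: float) -> bool:
    # Create the meta-procedure only when its net value is positive.
    return metaprocedure_value(f_j, len_j, dr_j, c_store) > 0.0

print(should_create(f_j=0.2, len_j=4.0, dr_j=0.1, c_store=0.05))  # True
```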


    Bayesian Prior Initialization

MACLA initializes Beta priors as $\text{Beta}(1, 1)$ (uniform), but this choice lacks justification.

Problem: Uniform priors assume no prior knowledge, yet domain knowledge is available: historical procedure statistics show that procedures extracted from trajectory segments succeed well above chance.

    Proposed Hierarchical Bayesian Prior: Use empirical Bayes to set informative priors:

$$
\rho_i \sim \text{Beta}(\alpha_0, \beta_0), \qquad i = 1, \ldots, N_p
$$

$$
p(\rho_i \mid s_i, f_i) = \text{Beta}(\alpha_0 + s_i, \beta_0 + f_i)
$$

where $s_i$ and $f_i$ are the observed success and failure counts of procedure $i$.

Estimate hyperparameters $(\alpha_0, \beta_0)$ from historical procedure statistics:

$$
(\hat{\alpha}_0, \hat{\beta}_0) = \arg\max_{\alpha_0, \beta_0} \sum_{i=1}^{N_p} \log \int_0^1 p(s_i, f_i \mid \rho)\, \text{Beta}(\rho \mid \alpha_0, \beta_0)\, d\rho
$$

For ALFWorld, maximum likelihood estimation on the first 500 training trajectories yields $\alpha_0 \approx 3.2$, $\beta_0 \approx 1.8$, corresponding to prior mean $\mathbb{E}[\rho] = 3.2/(3.2 + 1.8) \approx 0.64$. This informed prior accelerates learning by 12-18 episodes compared to uniform initialization.
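A minimal empirical-Bayes sketch of this prior fit, using a grid search over the Beta-Binomial marginal likelihood. The toy history and grid are illustrative stand-ins for the paper's actual 500-trajectory fit, and function names are assumptions.

```python
import math

def beta_binom_loglik(alpha0: float, beta0: float, records) -> float:
    """Marginal log-likelihood of (successes, failures) counts under a
    shared Beta(alpha0, beta0) prior, up to a constant (the binomial
    coefficient does not depend on the hyperparameters)."""
    lg = math.lgamma
    ll = 0.0
    for s, f in records:
        # log B(alpha0+s, beta0+f) - log B(alpha0, beta0)
        ll += (lg(alpha0 + s) + lg(beta0 + f) - lg(alpha0 + beta0 + s + f)
               - (lg(alpha0) + lg(beta0) - lg(alpha0 + beta0)))
    return ll

def fit_prior(records, grid=None):
    """Empirical-Bayes grid search for (alpha0, beta0)."""
    grid = grid or [0.5 + 0.1 * k for k in range(60)]  # 0.5 .. 6.4
    return max(((a, b) for a in grid for b in grid),
               key=lambda ab: beta_binom_loglik(ab[0], ab[1], records))

# Toy history of (successes, failures) per procedure: mostly successful,
# so the fitted prior mean should land above 0.5.
history = [(8, 2), (5, 1), (9, 3), (4, 2), (7, 1)]
a0, b0 = fit_prior(history)
print(a0 / (a0 + b0) > 0.5)  # True
```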

    Utility Function Weight Selection

The utility function (Eq. 4 in main paper) uses weights $\lambda_r = 0.5$, $\lambda_f = 0.3$, $\lambda_t = 0.2$ from grid search.

    Problem: These weights are task-specific and require manual tuning for each new domain.

    Proposed Adaptive Weight Learning: Use online gradient-free optimization to learn domain-specific weights. Define meta-objective:

$$
J(\boldsymbol{\lambda}) = \sum_{t=1}^{T} r_t(\boldsymbol{\lambda})
$$

where $r_t(\boldsymbol{\lambda})$ is the reward achieved at episode $t$ using weights $\boldsymbol{\lambda}$.

    Update weights via evolutionary strategy:

$$
\boldsymbol{\lambda} \leftarrow \boldsymbol{\lambda} + \frac{\eta}{n \sigma} \sum_{i=1}^{n} F(\boldsymbol{\lambda} + \sigma \boldsymbol{\epsilon}_i)\, \boldsymbol{\epsilon}_i
$$

where $\boldsymbol{\epsilon}_i \sim \mathcal{N}(0, I)$, $F$ is fitness (cumulative reward), $\eta$ is the learning rate, $\sigma$ is the noise standard deviation, and $n$ is the perturbation population size.

    This eliminates manual tuning while adapting to domain characteristics.
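A compact sketch of the proposed evolutionary-strategy update. The renormalization onto the simplex enforces the $\lambda_r + \lambda_f + \lambda_t = 1$ constraint the paper uses for interpretability; the population size, step sizes, and toy fitness function are all illustrative assumptions.

```python
import random

def es_update(weights, fitness_fn, eta=0.05, sigma=0.1, pop=8, rng=None):
    """One evolutionary-strategy step on the utility weights
    (lambda_r, lambda_f, lambda_t); fitness_fn is cumulative reward."""
    rng = rng or random.Random(0)
    grad = [0.0] * len(weights)
    for _ in range(pop):
        eps = [rng.gauss(0.0, 1.0) for _ in weights]
        f = fitness_fn([w + sigma * e for w, e in zip(weights, eps)])
        grad = [g + f * e for g, e in zip(grad, eps)]
    new = [w + eta / (pop * sigma) * g for w, g in zip(weights, grad)]
    total = sum(max(w, 1e-6) for w in new)        # keep weights positive
    return [max(w, 1e-6) / total for w in new]    # renormalize to sum to 1

# Toy fitness favoring a heavier reliability weight (illustrative only).
fit = lambda w: -abs(w[0] - 0.6)
w = [0.5, 0.3, 0.2]
for _ in range(50):
    w = es_update(w, fit)
print(abs(sum(w) - 1.0) < 1e-9)  # weights remain a valid simplex point
```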

    Contrastive Refinement Evidence Requirement

Refinement activates when $\min(|\mathcal{S}_i|, |\mathcal{F}_i|) \geq 3$.

    Problem: The choice of 3 samples lacks statistical justification. Is this sufficient for reliable pattern extraction?

Statistical Power Analysis: To detect discriminative patterns with confidence $1 - \alpha = 0.95$ and power $1 - \beta = 0.80$, the required sample size is:

$$
n_{\min} = \frac{2\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\text{ES}^2}
$$

where ES is effect size (Cohen's $d$). For medium effect size ES $= 0.5$:

$$
n_{\min} = \frac{2\,(1.96 + 0.84)^2}{0.5^2} \approx 63
$$

This suggests $n_{\min} = 3$ provides very low statistical power ($\approx 0.15$), leading to unreliable refinements.
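The sample-size formula can be checked with the standard library's normal quantile function. A small sketch; `required_n` is an illustrative name.

```python
from statistics import NormalDist

def required_n(effect_size: float, alpha: float = 0.05,
               power: float = 0.80) -> float:
    """Per-group sample size under the normal approximation:
    n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 / ES^2."""
    z = NormalDist().inv_cdf
    return 2.0 * (z(1 - alpha / 2) + z(power)) ** 2 / effect_size ** 2

print(round(required_n(0.5)))  # 63 per group for a medium effect size
```

Larger effects need fewer samples, e.g. `required_n(0.8)` is roughly 25 per group, which is why a handful of contexts only suffices when success/failure contexts differ drastically.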

Recommended Threshold: Use $n_{\min} \in [8, 12]$ for adequate statistical power, or implement sequential testing:

$$
\text{refine } \text{Proc}_i \iff \frac{p(\mathcal{S}_i, \mathcal{F}_i \mid H_1)}{p(\mathcal{S}_i, \mathcal{F}_i \mid H_0)} > K
$$

    using Bayesian hypothesis testing rather than fixed sample size.

    Memory Pruning Utility Function



Large language model (LLM) agents can solve complex, interactive tasks such as web shopping webshop and embodied AI housekeeping agentboard, by transforming natural-language instructions into sequences of environment actions yao2023react. In these settings, agents navigate step-by-step through partially observable environments to pursue subgoals and ultimately complete the task agentboard; xiong2024watch. The resulting trajectory is the ordered record of an episode's interaction, typically written as $(T, A, O, R)$, where $T$ is the task to complete, $A$ the actions, $O$ the observations recording the outcomes of the corresponding actions, and $R$ the step-level outcomes or rewards. Trajectories thus capture the full decision process, not merely terminal success or failure, and provide dense supervision for how an agent progresses through a task xiong2024watch; alfworld. When a new task arrives, the agent synthesizes an appropriate trajectory (that is, a step-by-step plan and its execution) to achieve the goal in the current context, deciding which information to gather, which tools to invoke, and which subroutines to chain in order to achieve completion yao2023react; webshop.

Early LLM agents used prompt-based planning yao2023react and self-critique shinn2023reflexion, but lacked persistent "how-to" procedures: when tasks are similar but not identical, agents must re-plan from scratch, increasing cost and latency. Fine-tuning approaches chen2023fireact; yin2023; zeng2023 adapt agents via supervised learning or RLHF, but typically treat entire trajectories as single units weighted by terminal success/failure, neglecting rich intermediate steps. In practice, failed trajectories often contain correct substeps (e.g., "successfully navigating and retrieving an egg, but failing to boil it" alfworld), while successful ones may include suboptimal actions that accidentally cancel out. Recent work xiong2024watch addresses this via step-level rewards, but requires repeated policy training on densely-labeled data, incurring substantial computational cost.

Figure 1: Comparison between trajectory-based LLM fine-tuning and the MACLA framework, showing the external memory hierarchy.

We introduce MACLA (Memory-Augmented Contrastive Learning Agent), a framework that disentangles reasoning from learning by coupling a frozen LLM with a structured external procedural memory (Figure 1). Unlike fine-tuning approaches where reasoning and adaptation are entangled within billions of parameters, MACLA fixes the LLM as a stable semantic reasoner responsible for trajectory segmentation, abstraction, and action generation. All learning occurs externally through explicit, interpretable memory operations: maintaining human-readable procedures, updating Bayesian posteriors, and refining preconditions through contrastive analysis. MACLA operates through three core mechanisms:

Bayesian procedure selection: Maintains Beta posteriors $\text{Beta}(\alpha_i, \beta_i)$ over procedure success rates and ranks candidates via expected-utility scoring that balances contextual relevance, success probability, failure risk, and information gain, providing principled exploration-exploitation.

Contrastive refinement: Compares successful and failed execution contexts to tighten preconditions, repair action sequences, and refine postconditions once procedures accumulate sufficient evidence (i.e., $\geq$ a threshold), progressively improving procedure quality through memory edits rather than gradient updates.

    Meta-procedural learning: Composes frequently co-occurring procedures into hierarchical “playbooks” with conditional control policies (continue, skip, repeat, abort) for long-horizon tasks, enabling strategic reuse beyond atomic skills.

    This architecture yields sample-efficient, interpretable agents with human-readable procedural knowledge, closed-form utility computation, and minimal LLM usage. Specifically, this work contributes:

    Online procedural memory adaptation: Continual updates to procedural and meta-procedural memory during and after episodes, enabling adaptation without weight updates, compared with offline LLM post-training approaches zeng2023; song2024trial; xiong2024watch that remain static at inference.

    Reasoning/learning decoupling: A frozen LLM for parsing and abstraction with all improvements occurring in an external, structured procedural memory, avoiding the computational cost and catastrophic forgetting risks of parameter fine-tuning.

    Bayesian uncertainty-aware selection: A principled procedure selection module that maintains Beta posteriors over success rates with closed-form expected utility objectives balancing relevance, success probability, failure risk and information gain.

    Contrastive procedural refinement: An algorithm leveraging paired successes and failures to tighten preconditions, repair action schemas, and refine postconditions of stored procedures without requiring expert demonstrations.

We evaluate MACLA across four benchmarks (ALFWorld agentboard, WebShop webshop, TravelPlanner xie2024travelplanner, InterCodeSQL yang2306intercode), achieving 78.1% average performance, the highest among all methods, including those using models 10× larger (Table 5.2). On ALFWorld alfworld, MACLA reaches 87.2% on seen and 90.3% on unseen tasks, with a positive generalization gap (+3.1%) indicating compositional transfer rather than overfitting. The system achieves this with only 0.016 GPU-hours for one-time memory construction, 2,800× faster than the state-of-the-art LLM parameter-training baseline xiong2024watch, which requires 44.8 GPU-hours of iterative training, while simultaneously producing human-interpretable procedural knowledge.

LLM agents have advanced rapidly in reasoning and decision-making, enabling multi-step interaction in embodied and web-based environments. Early frameworks such as ReAct yao2023react and Reflexion reflexion integrate reasoning and acting within the same loop, while trajectory-tuning methods chen2023fireact; xiong2024watch fine-tune models using expert demonstrations. However, fine-tuning is computationally expensive, requires offline data collection and training cycles, and does not support true online adaptation at inference time. To overcome this issue, a line of research augments LLM agents with memory for continuous reasoning. Memory is a foundational component of language agents, supporting competence across multiple timescales from transient working context to persistent long-term knowledge zhang2024memorysurvey; liu2025foundationagents; li2025memos. Research on memory for LLM agents can be usefully organized along two directions: where memory resides and what is stored. Along the first direction, some methods, such as MemGPT memgpt and MemoryBank zhong2024memorybank, use buffer-based systems to store conversational or episodic traces and retrieve them with embedding search and simple heuristics. Others, such as HiAgent hu2024hiagent, A-Mem xu2025amem, and MemAgent yu2025memagent, use hierarchical designs that separate working buffers from episodic and long-term stores to relieve context pressure and improve persistence. Recently, SAGE liang2025sage used reflective multi-agent controllers to curate these stores while controlling growth. The second direction concerns what is stored. Many systems retain free-form text snippets such as notes, summaries, or dialogue chunks; these are easy to write but suffer from retrieval drift and weak compositionality as repositories scale memgpt; zhong2024memorybank.
More structured artifacts appear as tuples and key-value frames (e.g., tool logs or entity/event graphs), which aid filtering but still lack executable semantics for reuse. A growing line of work targets skills and procedures: agents capture reusable action patterns, tool workflows, and instruction-like steps across related tasks voyager; wang2024awm; chen2024automanual. Memp fang2025memp advances this view by treating procedural memory as a first-class object and studying its construction, retrieval, and update across domains. However, several key limitations remain: (1) it represents know-how largely as monolithic text (scripts or full trajectories) with heuristic retrieval and simple updates; (2) it lacks uncertainty-aware selection or a principled exploration-exploitation balance, preventing reasoning about the reliability or risk of retrieved memory; and (3) it lacks a mechanism to refine procedures from paired successes and failures or to abstract recurring patterns into meta-procedural compositions. In contrast, we represent experience as structured, hierarchical procedures with explicit preconditions, action schemas, and postconditions, enabling interpretable reuse, safe composition, and direct schema edits when evidence warrants change. The proposed approach enables the system to continuously adapt and improve.


    The key components of MACLA are described in detail below.

The first stage transforms raw episodic trajectories into structured, reusable procedural knowledge. Given a trajectory $\tau = \{(o_t, a_t, r_t)\}_{t=0}^{T}$ consisting of textual observations $o_t$, primitive actions $a_t$, and rewards $r_t$, the frozen LLM $\mathcal{L}_{\theta}$ receives the full trajectory and identifies semantically coherent segments that correspond to meaningful sub-tasks:

where each segment $k$ spans time steps $[t_k^{\text{start}}, t_k^{\text{end}}]$ and is summarized by a description $d_k$. For each segment, MACLA constructs a structured procedure $\text{Proc}_k = \langle \mathcal{G}_k, \Psi_k, \pi_k, \Phi_k \rangle$, where $\mathcal{G}_k$ is a natural-language goal, $\Psi_k$ are precondition patterns inferred from the observations before the segment, $\pi_k$ is an abstracted action sequence, and $\Phi_k$ are postcondition patterns extracted from the final observations. This decomposition produces interpretable "how-to" skills that can be invoked whenever their preconditions are met. To support retrieval and merging, each procedure is embedded into a semantic vector space using an encoder $\phi$, $\mathbf{e}_k = \phi([\mathcal{G}_k; \Psi_k; \Phi_k]) \in \mathbb{R}^d$. When a new procedure is created, it is compared to existing ones via cosine similarity, $i^* = \arg\max_i \text{sim}(\mathbf{e}_k, \mathbf{e}_i)$. If $\text{sim}(\mathbf{e}_k, \mathbf{e}_{i^*}) > \theta_{\text{dup}}$, the new procedure is merged into the existing one by expanding its condition sets; otherwise, a new entry is added. This process yields a continually growing procedural library $\mathbb{M}_{\text{proc}} = \{(\text{Proc}_i, \mathbf{e}_i, \alpha_i, \beta_i)\}_{i=1}^{N_p}$ that forms the foundation for later Bayesian selection and refinement.
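The duplicate-detection step can be sketched as follows. This is a minimal stand-in, assuming a brute-force similarity scan in place of the real ANN index, and `add_procedure` with its condition-set merge is an illustrative simplification of the paper's merge operation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def add_procedure(library, new_emb, new_proc, theta_dup=0.85):
    """Merge into the most similar stored procedure if similarity exceeds
    theta_dup; otherwise append a new entry [embedding, proc, alpha, beta]."""
    if library:
        i_star = max(range(len(library)),
                     key=lambda i: cosine(library[i][0], new_emb))
        if cosine(library[i_star][0], new_emb) > theta_dup:
            library[i_star][1]["conditions"].update(new_proc["conditions"])
            return i_star
    library.append([new_emb, new_proc, 1, 1])  # Beta(1,1) prior counts
    return len(library) - 1

lib = []
add_procedure(lib, [1.0, 0.0], {"conditions": {"door open"}})
idx = add_procedure(lib, [0.99, 0.05], {"conditions": {"door ajar"}})
print(len(lib), idx)  # 1 0: near-duplicate merged into the existing entry
```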

Given the procedural library, the agent must decide which procedure to execute for the current observation. Each procedure $\text{Proc}_i$ maintains a Beta posterior over its success probability $\rho_i \in [0, 1]$:

where $\alpha_i$ and $\beta_i$ accumulate successful and failed executions from history $\mathcal{D}_i$. The posterior mean $\mathbb{E}[\rho_i] = \alpha_i/(\alpha_i + \beta_i)$ estimates current reliability, while the variance $\text{Var}[\rho_i] = \frac{\alpha_i \beta_i}{(\alpha_i + \beta_i)^2 (\alpha_i + \beta_i + 1)}$ quantifies epistemic uncertainty. For each candidate, we compute expected utility by integrating over the Beta posterior. Given utility $U(\rho \mid o_t, i) = \mathrm{Rel}_i(o_t) \cdot \rho \cdot R_{\max} - \mathrm{Risk}_i(o_t) \cdot (1 - \rho) \cdot C_{\mathrm{fail}} + \lambda_{\mathrm{info}} \cdot I(\rho)$, the expected utility is:

Exploiting $\mathbb{E}_{\mathrm{Beta}(\alpha,\beta)}[\rho] = \frac{\alpha}{\alpha+\beta}$ and $\mathbb{E}[1-\rho] = \frac{\beta}{\alpha+\beta}$, this simplifies to:

where $\mathrm{Rel}_i(o_t) = \cos(\phi(o_t), \mathbf{e}_i)$ is contextual similarity, $\mathrm{Risk}_i(o_t)$ is the fraction of past failures with similar contexts, and $H[\cdot]$ is differential entropy encouraging exploration. The selected procedure is:

subject to confidence threshold $\theta_{\text{conf}}$. If $\max_i \mathrm{EU}(\text{Proc}_i \mid o_t) < \theta_{\text{conf}}$, the agent falls back to zero-shot LLM reasoning. This Bayesian selection mechanism balances exploitation (high $\frac{\alpha}{\alpha+\beta}$ procedures), risk aversion (avoiding contexts similar to past failures), and exploration (high-entropy procedures). The expected utility formulation naturally handles the explore-exploit tradeoff: early in learning, high entropy dominates selection, while after sufficient evidence accumulates, expected reward becomes the primary driver.
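The closed-form scoring and the fallback rule can be sketched as follows. One stated simplification: the exploration bonus here uses the posterior variance rather than the paper's differential entropy (both shrink as evidence accumulates), and all names and candidate values are illustrative.

```python
def expected_utility(alpha, beta, rel, risk,
                     r_max=1.0, c_fail=0.5, lam_info=0.1):
    """Closed-form EU under the Beta(alpha, beta) posterior; variance
    stands in for the entropy-based exploration bonus (an assumption)."""
    n = alpha + beta
    p_succ = alpha / n
    var = alpha * beta / (n ** 2 * (n + 1))
    return rel * p_succ * r_max - risk * (1 - p_succ) * c_fail + lam_info * var

def select(candidates, theta_conf=0.4):
    """candidates: list of (name, alpha, beta, relevance, risk).
    Returns the best procedure name, or None for zero-shot LLM fallback."""
    scored = [(expected_utility(a, b, rel, risk), name)
              for name, a, b, rel, risk in candidates]
    best_eu, best_name = max(scored)
    return best_name if best_eu > theta_conf else None

cands = [("open_fridge", 9, 1, 0.9, 0.1),   # reliable and relevant
         ("boil_egg",    2, 6, 0.8, 0.6)]   # unreliable and risky
print(select(cands))  # open_fridge
```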

As experience accumulates, procedures with both successful and failed instances are subjected to contrastive refinement to improve their accuracy and robustness. For a procedure $\text{Proc}_i$ with sets of successful and failed contexts $\mathcal{S}_i$ and $\mathcal{F}_i$, the LLM performs discriminative comparison, $\mathcal{D}_i = \text{ContrastiveExtract}(\mathcal{S}_i, \mathcal{F}_i)$, identifying differences in three dimensions: (i) precondition patterns ($\Delta\Psi_i^{+}$ and $\Delta\Psi_i^{-}$) that distinguish successful from failed initial contexts, (ii) action discrepancies ($\Delta\pi_i$) revealing missing or misordered actions, and (iii) postcondition mismatches ($\Delta\Phi_i$) that capture incomplete goal states. These discriminators drive explicit refinement operations.

    When distinct execution modes are detected, the procedure is specialized into separate variants with inherited reliability priors. This process progressively tightens applicability conditions and action precision, yielding interpretable improvements purely through memory edits rather than gradient updates.
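A toy stand-in for the discriminative comparison: MACLA delegates ContrastiveExtract to the LLM, so this frequency-based token contrast only illustrates the kind of precondition discriminators it produces; thresholds and names are assumptions.

```python
def contrastive_discriminators(successes, failures, min_support=0.8):
    """Find context tokens frequent in successful initial contexts but
    rare in failed ones (candidates for Psi+), and vice versa (Psi-)."""
    def support(token, contexts):
        return sum(token in c for c in contexts) / len(contexts)
    vocab = set().union(*successes, *failures)
    pos = {t for t in vocab
           if support(t, successes) >= min_support
           and support(t, failures) <= 1 - min_support}
    neg = {t for t in vocab
           if support(t, failures) >= min_support
           and support(t, successes) <= 1 - min_support}
    return pos, neg

# Toy contexts for a "boil egg" procedure (sets of observed facts).
succ = [{"stove on", "pot filled"}, {"stove on", "pot filled", "egg held"}]
fail = [{"stove off", "pot filled"}, {"stove off"}]
pos, neg = contrastive_discriminators(succ, fail)
print(pos, neg)  # "stove on" discriminates success; "stove off" failure
```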

To extend reasoning beyond atomic skills, MACLA automatically discovers and learns meta-procedures: structured compositions of procedures that capture recurrent long-horizon strategies. When a sequence of procedures $\langle \text{Proc}_{i_1}, \ldots, \text{Proc}_{i_m} \rangle$ repeatedly leads to success under a common high-level goal, the agent abstracts it as $\text{MP}_j = \langle \mathcal{G}_j^{\text{meta}}, \Psi_j^{\text{meta}}, \{\text{Proc}_{i_1}, \ldots, \text{Proc}_{i_m}\}, \Theta_j \rangle$. Here, $\Theta_j$ denotes a lightweight control policy governing conditional transitions among sub-procedures based on the current observation and execution context, $\Theta_j(o_t, \text{index}) \in \{\text{continue}, \text{skip}, \text{repeat}, \text{abort}\}$. This policy is distilled by analyzing successful traces, where the LLM identifies observation patterns that triggered each branch, for example, repeating when postconditions are unmet, skipping when preconditions already hold, or aborting when failures recur. Each meta-procedure maintains its own Beta success posterior $p(\sigma_j \mid \mathcal{D}_j) = \text{Beta}(\alpha_j, \beta_j)$ and is refined periodically to add new branches, reorder sub-procedures, or prune redundant ones. Through these hierarchical compositions, MACLA acquires flexible "playbooks" that encapsulate extended strategies with conditional logic.

To enable cross-context generalization (e.g., procedures learned on "mug" applying to "cup"), MACLA constructs a lightweight ontological semantic index during offline memory construction. We extract the $k_{\text{vocab}}$ most frequent words from task descriptions and actions, then cluster semantically similar words using SentenceTransformer embeddings reimers2019sentence to form an implicit domain ontology:

where each cluster $\mathcal{C}_w$ represents a semantic category (e.g., $\mathcal{C}_{\text{container}} = \{\text{mug}, \text{cup}, \text{glass}\}$). During retrieval, observations are mapped to these ontological categories, allowing procedures to match across lexically different but semantically equivalent contexts. This ontological grounding enables domain-adaptive generalization without requiring explicit knowledge engineering.
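The clustering step can be sketched with a greedy threshold rule. Hand-made 2-D vectors stand in for the SentenceTransformer embeddings the paper uses, and the threshold and function name are illustrative assumptions.

```python
import math

def greedy_clusters(embeddings, threshold=0.8):
    """Group words whose cosine similarity to a cluster's seed vector
    exceeds `threshold`; each unmatched word seeds a new cluster."""
    def cos(u, v):
        d = sum(a * b for a, b in zip(u, v))
        return d / (math.sqrt(sum(a * a for a in u))
                    * math.sqrt(sum(b * b for b in v)))
    clusters = []  # list of (seed_vector, [member words])
    for word, vec in embeddings.items():
        for seed, members in clusters:
            if cos(seed, vec) > threshold:
                members.append(word)
                break
        else:
            clusters.append((vec, [word]))
    return [members for _, members in clusters]

# Toy 2-D stand-ins for sentence embeddings.
toy = {"mug":   [0.9, 0.1], "cup":  [0.88, 0.15],
       "glass": [0.85, 0.2], "sofa": [0.1, 0.95]}
print(greedy_clusters(toy))  # containers grouped together, 'sofa' apart
```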

To ensure practical scalability, MACLA employs efficient retrieval, bounded growth, and strict control over LLM usage. All procedures and meta-procedures are embedded in an approximate nearest-neighbor index supporting sublinear retrieval ($O(\log N_p)$) for semantic search. The episode buffer stores at most $N_b = 1000$ steps, providing local context for LLM prompts and post-episode updates. Each procedure maintains a failure index limited to $K_{\text{fail}} = 15$ entries, managed through success-based removal, redundancy-aware eviction, and temporal decay, ensuring that memory remains concise and informative. To prevent memory saturation, procedures and meta-procedures are periodically pruned using a multi-factor utility score that balances reliability, usage frequency, and temporal relevance:

$$
U_i = \lambda_r \frac{\alpha_i}{\alpha_i + \beta_i} + \lambda_f \frac{n_i}{N_{\text{total}}} + \lambda_t \exp\!\left(-\frac{t_{\text{current}} - t_i^{\text{last}}}{\tau}\right)
$$

where $\frac{\alpha_i}{\alpha_i + \beta_i}$ is the Bayesian success rate (reliability), $n_i$ is the execution count of procedure $i$, $N_{\text{total}}$ is the total invocations across all procedures in the current episode window, $t_{\text{current}}$ is the current episode index, $t_i^{\text{last}}$ is the episode when $i$ was last used, and $\tau$ is the temporal decay constant.

The weighting coefficients $\lambda_r = 0.5$, $\lambda_f = 0.3$, and $\lambda_t = 0.2$ reflect the relative importance of each factor: reliability receives the highest weight (0.5) as it directly predicts future success; frequency receives moderate weight (0.3) to favor well-tested procedures while avoiding over-retention of obsolete frequently-used skills; recency receives the lowest weight (0.2) to provide soft temporal decay without aggressive forgetting. These values were determined through grid search over $\{0.3, 0.4, 0.5, 0.6\} \times \{0.2, 0.3, 0.4\} \times \{0.1, 0.2, 0.3\}$ on ALFWorld validation, with the constraint $\lambda_r + \lambda_f + \lambda_t = 1.0$ for interpretability. The selected configuration (0.5, 0.3, 0.2) yielded the best balance between retaining high-quality procedures (>0.7 success rate) and pruning low-utility entries (<0.4 success rate), as validated later in Figure 4. Entries with the lowest utility are removed while ensuring diversity across goal clusters through stratified sampling. These operations keep the total memory footprint below 4 MB for hundreds of procedures.
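The pruning score can be sketched as a weighted sum of the three factors. The exponential recency decay with constant $\tau$ is an assumed functional form consistent with the "temporal decay constant" described above, and all names are illustrative.

```python
import math

def prune_utility(alpha, beta, n_i, n_total, t_now, t_last,
                  tau=50.0, lam=(0.5, 0.3, 0.2)):
    """Multi-factor pruning score: reliability + frequency + recency,
    with the paper's (0.5, 0.3, 0.2) weights as the default."""
    lr, lf, lt = lam
    reliability = alpha / (alpha + beta)          # Bayesian success rate
    frequency = n_i / max(n_total, 1)             # share of invocations
    recency = math.exp(-(t_now - t_last) / tau)   # assumed decay form
    return lr * reliability + lf * frequency + lt * recency

fresh = prune_utility(alpha=8, beta=2, n_i=10, n_total=40,
                      t_now=100, t_last=98)
stale = prune_utility(alpha=2, beta=8, n_i=1, n_total=40,
                      t_now=100, t_last=10)
print(fresh > stale)  # the reliable, recent procedure outranks the stale one
```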

Finally, MACLA limits LLM usage to a fixed budget of API calls per episode to cover segmentation, abstraction, and occasional refinement, while all retrieval, Bayesian scoring, and updates are symbolic or vectorized. As a result, per-step runtime remains effectively constant and inference cost does not scale with experience. This memory-first design ensures that MACLA remains efficient, interpretable, and deployable for continual learning across long interaction horizons. The theoretical foundations are provided in Appendix D.

At runtime, MACLA executes a new task by coupling frozen semantic reasoning with memory-driven decision making. The agent receives an initial observation $o_0$ (and, optionally, an instruction string) and embeds it as $\mathbf{h}_0 = \phi(o_0)$. This embedding queries the semantic index of the external memory to retrieve a compact candidate set consisting of procedures $\{\text{Proc}_i\}$ and meta-procedures $\{\text{MP}_j\}$ whose embeddings are most similar to $\mathbf{h}_0$. Retrieval is approximate nearest neighbor over the concatenated descriptors of goals, preconditions, and postconditions, which keeps lookup sublinear in memory size.

Given the candidate set, MACLA ranks each item with a Bayesian expected-utility score that trades off contextual relevance, estimated success, risk, and information gain under the procedure's Beta posterior. The highest-scoring item above a confidence threshold is selected; otherwise the agent falls back to zero-shot LLM reasoning for that step, logs the outcome, and continues. If a meta-procedure is chosen, execution proceeds hierarchically under its composition policy $\Theta_j(o_t, \text{index}) \in \{\text{continue}, \text{skip}, \text{repeat}, \text{abort}\}$ until completion or abort; if an atomic procedure is chosen, the agent checks preconditions $\Psi_i$ against $o_t$, invokes the action sketch $\pi_i$ via the frozen LLM's action formatter, and verifies postconditions $\Phi_i$ to certify completion. In both cases the outcome updates $(\alpha, \beta)$ and appends the initial context to the corresponding success or failure set for later analysis.

After each execution, the agent re-embeds the new observation and repeats retrieval and selection until the task is solved or a horizon is reached. When a procedure accumulates both successes and failures, a contrastive pass is triggered: the LLM proposes discriminators that tighten $\Psi_i$, repair $\pi_i$, and refine $\Phi_i$, or, if distinct modes are detected, specializes the procedure into variants that inherit prior counts. When successful episodes repeatedly traverse a small set of procedures in a stable order, the agent abstracts a meta-procedure with its own success posterior and a lightweight $\Theta_j$ distilled from divergence points across traces. Throughout, memory remains bounded by pruning with a utility that blends reliability, frequency, and recency, and the LLM-call budget is capped, as retrieval, scoring, and updates are vectorized operations. The complete runtime procedure is outlined in Algorithm 1.

    We evaluate MACLA on four challenging interactive agent benchmarks spanning diverse domains. All experiments use consistent hyperparameters across tasks to demonstrate generalization without task-specific tuning.

Memory Architecture: Episode buffer $N_{\text{buffer}} = 1000$ (stores recent observations and actions for temporal context during action generation); procedural memory $N_{\text{proc}} = 200$ (capacity for extracted reusable skills); meta-procedural memory $N_{\text{meta}} = 50$ (capacity for hierarchical procedure compositions). Critically, MACLA does not store raw trajectories. Instead, the LLM segments each episode into coherent sub-tasks and extracts structured procedures (Section 4.1). Duplicate detection with similarity threshold $\theta_{\text{dup}} = 0.85$ prevents redundant storage. Through this process, the 2,851 ALFWorld training trajectories compress into approximately 187 unique procedures, demonstrating efficient knowledge distillation from experience.

Bayesian Selection. Information gain weight $\lambda_{\text{info}} = 0.1$, failure cost $C_{\text{fail}} = 0.5$. These parameters balance exploration (trying uncertain procedures to reduce epistemic uncertainty) with exploitation (selecting high-posterior reliable procedures).

Contrastive Refinement. Minimum contexts $n_{\min}^{s} = n_{\min}^{f} = 3$. Refinement activates only when a procedure has accumulated at least 3 successes and 3 failures, ensuring sufficient statistical evidence for discriminative pattern extraction.

LLM Configuration. Llama-2-7B touvron2023llama via Ollama with 4-bit quantization and temperature $T = 0.7$. The LLM parameters remain frozen throughout all experiments; learning occurs exclusively through external memory updates.

Benchmarks and Dataset Statistics: ALFWorld alfworld (2,851 train, 274 test) is a text-based embodied environment with six household tasks (e.g., retrieval, placement). We follow the standard train/validation-seen/validation-unseen split, where test trajectories feature novel object-location configurations. WebShop webshop (1,624 train, 200 test) simulates e-commerce search over 12,087 products, requiring agents to follow natural-language instructions via multi-step navigation and filtering. TravelPlanner xie2024travelplanner (1,000 train, 180 validation, 45 test) involves multi-day itinerary planning under hard constraints (budget, dates) and soft preferences (cuisine, attractions). Evaluation uses Common Sense (CS) and Hard Constraint (HC) scores. InterCodeSQL yang2306intercode benchmarks interactive text-to-SQL generation over diverse schemas, requiring correct handling of schema relationships and varying query difficulty.

    Table 5.2 compares MACLA against state-of-the-art baselines across all benchmarks. We organize baselines into three paradigms: prompt-based methods using in-context learning, outcome refinement approaches optimizing trajectory-level rewards, and process refinement methods refining step-level generation. MACLA achieves the highest average performance (78.1%) while using a 7B parameter model, demonstrating that domain-agnostic procedural memory with Bayesian selection and contrastive refinement enables competitive performance without task-specific engineering.

    †Substantially larger models (Claude-3.5: proprietary, Qwen2.5: 72B vs. 7B parameters). TravelPlanner reports Commonsense (CS) score; other benchmarks report task completion reward.

    In Table 5.2, MACLA achieves state-of-the-art results on TravelPlanner (83.3 CS) and ALFWorld-Unseen (90.3%), outperforming methods that rely on models 10× larger. Its strong performance across all benchmarks demonstrates cross-domain generalization, while the positive generalization gap on ALFWorld (+3.1 points for unseen vs. seen) indicates robust compositional transfer rather than memorization.

    Ablation Study

    To understand the contribution of each component in MACLA, we conduct an ablation study by systematically removing key modules. Table 2 reports results on ALFWorld (seen and unseen splits), evaluating: (1) Bayesian procedural selection (Section 4.2), (2) contrastive learning from failed trajectories (Section 4.3), (3) meta-procedural composition (Section 4.4), and (4) ontological semantic grounding (Section 4.5). Removing Bayesian selection leads to the largest degradation (–7.7 seen, –9.1 unseen), highlighting its role in effective exploration. Meta-procedural composition is essential for compositional generalization, with a sharp drop on unseen tasks (–11.9). Contrastive learning and ontological clustering provide smaller but consistent improvements (–3.5/–4.6 and –4.3/–6.2 respectively). Overall, all four components contribute synergistically to MACLA’s robustness.

    Bayes.: probabilistic selection (Sec. 4.2); Contr.: success/failure refinement (Sec. 4.3); Meta: hierarchical composition (Sec. 4.4); Ontol.: semantic clustering (Sec. 4.5).

    Computational and Memory Efficiency Analysis

    MACLA’s efficiency comes from three design choices: (1) the frozen LLM eliminates gradient updates, (2) external memory construction is trivially parallelizable, and (3) learned procedures amortize LLM costs across episodes. Table 3 summarizes training costs.

    Training cost: IPR = 5.6h on 8×A100 (44.8 GPU-hrs); MACLA = 56s on 1×RTX 3090 (0.016 GPU-hrs), representing a 2,800× reduction. MACLA’s frozen-LLM architecture eliminates iterative parameter training while achieving superior generalization on unseen tasks (+15.6 points on ALFWorld-Unseen vs IPR).

    Training and Adaptation

    MACLA builds memory in 56s (2,800× faster than IPR xiong2024watch) by extracting reusable procedures with a frozen LLM instead of performing iterative parameter updates. For new tasks, IPR post-trains for 44.8 GPU-hrs, whereas MACLA ingests new trajectories in seconds. Memory construction scales nearly linearly with resources.

    Memory Capacity and Performance Saturation

    Figure 2 reveals logarithmic performance growth across three capacity regimes: (1) Undercapacity (25–50): Sharp degradation (64.1% unseen at 25) due to insufficient task coverage, forcing frequent zero-shot fallback. Low posterior (0.61) indicates pruning removes procedures before adequate validation. (2) Optimal (100–200): Rapid improvement (85.6%→90.3% unseen), capturing core reusable procedures. The system extracts 187 unique procedures from 2,851 training trajectories (15:1 compression), leaving 13 of 200 slots unused—indicating automatic discovery of task-space boundaries. (3) Overcapacity (300): Performance declines (-0.2% unseen) despite more slots, as redundant variants introduce retrieval noise. The posterior plateau at 0.79 confirms saturation. This bounded growth (3.6 MB footprint) contrasts with neural approaches requiring unbounded parameter expansion, demonstrating ALFWorld’s task space has finite complexity discoverable through procedural abstraction.

    Bayesian Posterior Evolution and Convergence

    Figure 3 demonstrates uncertainty-aware learning through Bayesian posterior evolution. Panel (a) shows diverging α trajectories reflecting the explore-exploit tradeoff: general-purpose procedures (Navigate) accumulate evidence fastest through frequent invocation, while specialized procedures (Heat/Cool) converge more slowly but maintain high posteriors when applicable. Panel (b) reveals that all top procedures stabilize above the 0.75 reliability threshold within 50 test episodes, with posterior variance decreasing as evidence accumulates:

    By episode 50, α + β > 30 for all procedures, yielding standard deviations below 0.05—demonstrating principled uncertainty quantification. This self-reinforcing cycle ensures memory quality: poor procedures accumulate failures (high β), receive low utility scores, and are pruned before reaching high evidence totals.
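The variance reduction can be checked with the closed-form Beta moments (mean α/(α+β), variance αβ/((α+β)²(α+β+1))); the counts below are illustrative, not taken from the paper's logs:

```python
import math

def beta_mean(alpha: float, beta: float) -> float:
    """Posterior mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

def beta_std(alpha: float, beta: float) -> float:
    """Posterior standard deviation of a Beta(alpha, beta) distribution."""
    n = alpha + beta
    var = alpha * beta / (n * n * (n + 1))
    return math.sqrt(var)

# Same 0.75 success ratio, growing evidence: the posterior tightens.
print(round(beta_std(6, 2), 3))    # 0.144 (8 observations)
print(round(beta_std(30, 10), 3))  # 0.068 (40 observations)
```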

    Figure 4 validates MACLA’s self-regulating pruning mechanism. Panel (a) shows clear distributional separation: 73% of pruned procedures have success rates below 0.5 (primarily spurious extractions from failed exploration trajectories), while 81% of retained procedures exceed 0.7. The utility-based criterion effectively discriminates signal from noise:

    Panel (b) reveals 68% of pruned procedures are both young (<40 trajectories old) and rarely used (<5 invocations)—the system identifies unpromising candidates early rather than wasting execution budget. Critically, the top-right quadrant is empty: no high-quality procedures (>0.7 success, >10 uses) are pruned, confirming conservative retention. This automatic quality control explains why performance plateaus at 187 procedures (mean posterior 0.79) without manual curation.

    Figure 5 explains the SQL underperformance through three metrics. Low reuse (51%): SQL queries are schema-specific; e.g., customers.age does not apply to employees.experience. ALFWorld generalizes via semantic placeholders (e.g., <object>), but SQL column names vary unpredictably. Low reliability (64%): schema mismatches, join complexity, and edge cases accumulate failures (β counts), suppressing posteriors. Minimal composition (18%): SQL queries are atomic (2-3 actions), too short for meta-procedures, whereas ALFWorld tasks naturally decompose into multi-step sub-procedures. MACLA excels when tasks have (1) reusable actions, (2) hierarchical structure, and (3) consistent semantics; SQL violates all three.

    We presented MACLA, a framework that decouples reasoning from learning by maintaining a frozen LLM and performing all adaptation in an external hierarchical procedural memory through Bayesian selection, contrastive refinement, and meta-procedural composition. MACLA achieves 78.1% average performance across four benchmarks using only a 7B model, with state-of-the-art results on ALFWorld (87.2% seen; 90.3% unseen) and TravelPlanner (83.3%). The system compresses 2,851 ALFWorld training trajectories into 187 reusable procedures through semantic abstraction and duplicate detection, demonstrating efficient knowledge distillation.

    This section provides comprehensive ablation studies examining MACLA’s component contributions, memory scaling behavior, and task-specific effectiveness. These experiments address critical questions about system design choices and identify performance bottlenecks across different benchmarks. Table 4 systematically evaluates the contribution of each MACLA component by measuring performance degradation when individual modules are removed. Beyond success rates, we track memory dynamics (procedure/meta-procedure counts), behavioral patterns (reuse rate), and computational efficiency (LLM calls per episode).

    Proc./Meta Count: final memory size after 200 episodes. Reuse Rate: % of actions from retrieved procedures vs. zero-shot LLM. LLM Calls: average per episode.

    Bayesian Selection (–7.8 seen, –9.1 unseen): Removing Bayesian selection causes the largest performance degradation. Without uncertainty-aware ranking, the system retrieves procedures based solely on semantic similarity, often selecting plausible-but-unreliable skills. The reuse rate drops to 62% (from 78%) as low-quality procedures fail during execution, forcing more frequent LLM fallback (+2.2 calls/episode). Critically, the unseen performance drop (9.1 points) exceeds the seen drop (7.8 points), indicating that exploration-exploitation balance is especially crucial for generalization.

    Meta-Procedures (–5.9 seen, –11.9 unseen): Meta-procedures are essential for compositional generalization. The dramatic unseen performance drop (11.9 points vs. 5.9 seen) reveals that long-horizon unseen tasks require hierarchical planning. Without meta-procedures, the agent must re-compose atomic procedures for each episode, leading to higher LLM usage (9.1 vs. 6.2 calls) and suboptimal action sequences. The positive generalization gap (+3.1 in full MACLA) completely reverses to negative (–2.8 without meta-procedures).

    Contrastive Learning (–3.6 seen, –4.6 unseen): Removing contrastive refinement yields a moderate but consistent degradation. Interestingly, the system accumulates more procedures (201 vs. 187) because it cannot identify and prune low-quality skills extracted from failed trajectories. The reuse rate drops to 71%, suggesting procedures have weaker preconditions and apply in inappropriate contexts. Contrastive learning’s role is quality control—sharpening when procedures should/shouldn’t execute.

    Ontology (–4.4 seen, –6.2 unseen): Semantic grounding provides consistent improvements, particularly for unseen tasks. The ontology enables better generalization by mapping novel object-location configurations to known semantic categories (e.g., “mug” generalizes via the container ontology). The effect is moderate because MACLA’s embedding-based retrieval already captures some semantic similarity.

    Synergistic Effects: No single component accounts for MACLA’s full performance. The combination of Bayesian selection (uncertainty-aware), contrastive learning (quality refinement), and meta-procedures (hierarchical composition) creates synergistic effects. Bayesian selection identifies reliable procedures, contrastive learning makes them more robust, and meta-procedures compose them efficiently.

    Table 5 investigates the relationship between memory capacity and performance, addressing whether larger memory always yields better results or if there exists an optimal capacity.

    Actual Proc.: number of procedures after training (may be fewer than capacity if not all slots are filled). Avg α/(α+β): mean posterior success rate across all procedures.

    Severe Undercapacity (25-50): At capacity 25, performance is substantially degraded (68.3% seen, 64.1% unseen) despite all 25 slots being filled. The system cannot maintain sufficient task coverage—ALFWorld has six task types (pick-and-place, clean, heat, cool, examine, slice), each requiring 4-6 procedures. With only 25 slots, frequent pruning of still-useful procedures forces fallback to zero-shot LLM. The low average posterior (0.61) indicates retained procedures have marginal reliability.

    Optimal Range (150-200): Performance peaks in this range with minimal difference between 150 (86.4/88.7) and 200 (87.2/90.3). The actual procedure count at capacity 200 is only 187—the system did not fill all available slots, suggesting it has identified all meaningfully distinct procedures. The average posterior plateaus at 0.79, indicating quality saturation.

    Overcapacity (300): Increasing capacity to 300 yields negligible improvement (87.1/90.1, slightly lower than 200). The actual procedure count increases only to 203 (16 more than capacity 200), and the average posterior remains 0.79. This demonstrates that additional capacity stores redundant variants rather than fundamentally new skills. The slight performance decrease may reflect increased retrieval noise—more candidates to rank increases the chance of selecting suboptimal procedures.

    Posterior Convergence: The average α/(α+β) steadily increases from 0.61 to 0.79 as capacity grows from 25 to 200, then plateaus. At low capacity, only the absolute best procedures survive aggressive pruning—but coverage is insufficient. At optimal capacity (150-200), the library balances quality and coverage. Beyond 200, quality does not improve because the task space has been saturated.

    Implication for Memory Design: The saturation at 150-200 procedures suggests ALFWorld’s effective task complexity is finite and discoverable. MACLA automatically identifies this structure through Bayesian selection and utility-based pruning, without manual tuning. This contrasts with neural approaches where memory grows unboundedly with training data.

    Table 6 provides a diagnostic analysis explaining why MACLA excels on embodied tasks (ALFWorld, TravelPlanner) but underperforms on structured query tasks (InterCodeSQL). We measure six orthogonal metrics capturing procedural reusability, reliability, and compositional structure.

    Proc. Used: unique procedures per episode. Reuse Rate: % actions from memory. Avg α/(α+β): posterior success. Meta Hit: % episodes using meta-procedures. Proc. Len: avg actions per procedure.

    Low Reusability (51% vs. 76-78%): InterCodeSQL exhibits the lowest memory reuse rate. SQL queries are highly schema-specific—a procedure learned on a customers table rarely transfers to an orders table despite similar query logic. In contrast, ALFWorld procedures generalize via semantic placeholders: “take <object> from <location>” applies to any object-location pair. The high variance in procedures used per episode (52±18) indicates inconsistent applicability.
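The schema-specificity problem can be made concrete with a rough, hypothetical abstraction step: lifting concrete table and column names (known from the schema) into placeholders, in the spirit of ALFWorld's templates. The function and regexes below are illustrative, not MACLA's extraction pipeline:

```python
import re

def abstract_query(sql: str, tables: list, columns: list) -> str:
    """Hypothetical sketch: replace concrete table/column names
    (supplied from the schema) with reusable placeholders."""
    template = sql
    for i, t in enumerate(tables):
        template = re.sub(rf"\b{re.escape(t)}\b", f"<table{i}>", template)
    for i, c in enumerate(columns):
        template = re.sub(rf"\b{re.escape(c)}\b", f"<col{i}>", template)
    return template

q = "SELECT name FROM customers WHERE age > 30"
print(abstract_query(q, ["customers"], ["name", "age"]))
# SELECT <col0> FROM <table0> WHERE <col1> > 30
```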

    Low Reliability (0.64 vs. 0.79-0.81): When SQL procedures do execute, they fail more frequently. The average posterior of 0.64 means procedures succeed only 64% of the time, compared to 81% for ALFWorld-Seen. Error analysis reveals three failure modes: (1) schema mismatches (column names differ), (2) join complexity (foreign key relationships vary), (3) edge cases (NULL handling, type coercion).

    Minimal Composition (18% vs. 38-51%): SQL has the lowest meta-procedure hit rate (18%). Most queries are atomic 2-3 action sequences (navigate schema → write query → execute), too short to benefit from hierarchical decomposition. TravelPlanner, by contrast, naturally decomposes into [search flights → book hotel → plan activities], yielding 51% meta-procedure usage.

    Short Procedures (2.8 vs. 4.1-6.3 actions): SQL procedures capture single-step operations rather than multi-step strategies. This reduces the value of procedural memory—zero-shot LLM can generate short queries nearly as effectively as retrieving stored procedures. The computational overhead of retrieval, ranking, and instantiation outweighs the benefit.

    ALFWorld Success Factors: Conversely, ALFWorld exhibits ideal characteristics for procedural memory: (1) high reusability (76-78%) via semantic abstraction, (2) high reliability (0.79-0.81 posteriors), (3) moderate composition (38-42% meta-hits), (4) multi-step procedures (4.1-4.2 actions). These metrics correlate strongly with overall performance.

    Improvement Directions for SQL: The diagnostic reveals three specific enhancement opportunities: (1) Schema-aware abstraction—extract query templates with semantic placeholders rather than concrete column names; (2) Ontological mapping—learn cross-schema equivalences (e.g., customers.customer_id ≈ orders.customer_id); (3) Compositional query building—decompose complex queries into reusable sub-queries (filtering, aggregation, and joining as composable procedures). The ablation studies provide three key insights:

    Component Synergy: Bayesian selection, contrastive learning, and meta-procedures contribute synergistically. Removing any single component degrades performance, with Bayesian selection and meta-procedures being most critical (7-12 point drops).

    Optimal Capacity: Memory capacity exhibits diminishing returns beyond 150-200 procedures, with average posterior plateauing at 0.79. This suggests task spaces have finite discoverable complexity that MACLA automatically identifies.

    Task-Specific Requirements: Procedural memory excels when tasks exhibit high action-level reusability, multi-step decomposition, and consistent semantic abstractions. SQL violates all three, explaining the 28-point performance gap vs. ALFWorld and identifying specific improvement directions.

    This appendix provides detailed visualizations and analyses addressing the memory dynamics, Bayesian learning mechanics, and task-specific performance characteristics of MACLA.

    Figure 6 illustrates the bootstrapping effect: MACLA’s ability to learn from imperfect initial experiences without requiring pre-trained demonstrations. The learning curve reveals three emergent phases not explicitly programmed:

    Exploration Phase (Trajectories 1–570, 20% of data): Starting from zero knowledge, the agent relies entirely on zero-shot LLM reasoning (100% fallback rate). Despite low initial success (15%), these first 570 trajectories yield 70 extractable procedures through LLM-guided segmentation. Success improves rapidly to 45% as basic navigation and manipulation procedures populate memory. The rapid growth demonstrates effective knowledge extraction even from failed episodes—a key advantage over methods that require expert demonstrations. By the end of this phase, the system has discovered fundamental primitives covering ALFWorld’s six task types (pick-and-place, heating, cooling, cleaning, examining, slicing).

    Consolidation Phase (Trajectories 571–1,425, 30% of data): Once sufficient success/failure pairs accumulate (|S|, |F| ≥ 3), contrastive refinement activates. Procedures tighten their preconditions by identifying discriminative patterns between successful and failed executions. Meta-procedures begin forming (around trajectory 855) as the system detects recurring composition patterns across multiple trajectories. The procedure count more than doubles from 70 to 160 through both new extractions and refinements. The Bayesian posterior jumps from 0.62 to 0.76, indicating increased reliability as procedures accumulate execution history. The success rate climbs steadily from 45% to 82%, with the fallback rate dropping to 12%—meaning 88% of actions now leverage procedural memory rather than zero-shot reasoning.

    Exploitation Phase (Trajectories 1,426–2,851, 50% of data): Performance plateaus at 87.2% as mature procedures dominate decision-making. The procedure count approaches saturation at 187 of 200 available slots, with meta-procedures reaching 43. The gap between actual usage (187) and capacity (200) indicates automatic quality control—the system has identified all meaningfully distinct procedures and avoids storing redundant variants. The fallback rate stabilizes at 5%, occurring only for novel task variants lacking relevant procedures (e.g., object configurations unseen in training). Memory growth slows dramatically as duplicate detection (similarity threshold θ_dup = 0.85) prevents redundant extractions. The final 1,426 trajectories (50% of data) contribute only 5.2 percentage points of improvement (82%→87.2%), exhibiting the logarithmic learning characteristic of knowledge saturation.
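Duplicate detection at θ_dup = 0.85 can be sketched as a cosine-similarity test over procedure embeddings; the 3-dimensional vectors below are toy stand-ins for real embedding vectors:

```python
import math

THETA_DUP = 0.85  # similarity threshold stated in the text

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def is_duplicate(new_emb, stored_embs, theta=THETA_DUP):
    """Reject a newly extracted procedure if it is too close to any stored one."""
    return any(cosine(new_emb, e) >= theta for e in stored_embs)

stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(is_duplicate([0.95, 0.05, 0.0], stored))  # True (near the first entry)
print(is_duplicate([0.5, 0.5, 0.7], stored))    # False
```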

    The four-panel layout efficiently shows temporal correlation between observable performance (success rate, panel a) and internal learning mechanics (memory growth, posterior convergence, fallback reduction). The phase-shaded background in panel (a) makes regime transitions immediately apparent. Panel (c)’s confidence band demonstrates variance reduction—epistemic uncertainty decreases as evidence accumulates, a hallmark of Bayesian learning.

    Cold-Start Capability. MACLA achieves 82% success using only the first 1,425 trajectories (50% of training data) without any parameter training. This addresses the cold-start problem that plagues supervised fine-tuning methods requiring large expert datasets. The learning curve shows MACLA is highly sample-efficient: 20% of data (570 trajectories) achieves 45% performance, while the final 50% adds diminishing returns. This logarithmic growth contrasts with neural approaches requiring full-dataset training for convergence.

    Compression and Generalization. The 15:1 compression ratio (2,851 trajectories → 187 procedures) demonstrates efficient knowledge distillation through semantic abstraction. Rather than memorizing individual trajectories, MACLA extracts reusable patterns that generalize across contexts. The plateau in panel (b) at 187 procedures suggests ALFWorld’s task space has finite inherent complexity—beyond this point, new trajectories are covered by existing procedures with generalized preconditions.

    This appendix provides a complete time-stamped execution trace of MACLA solving an ALFWorld unseen task, demonstrating how procedural memory, Bayesian selection, and contrastive refinement operate in practice. The trace illustrates information flow through all architectural components during both online inference (time steps t0–t8) and post-episode learning (t9).

    Task: valid_unseen_0 from ALFWorld validation-unseen split: “Put chilled lettuce on the counter.”

    Challenge: This task requires hierarchical reasoning with an implicit precondition—the lettuce must be cooled before placement. The compound modifier “chilled” signals a two-stage plan: (1) cool the object, then (2) place it on the counter. This task is unseen because the specific object-appliance-location triplet (lettuce-fridge-countertop) was not present in training trajectories, testing compositional generalization.

    Initial State: Agent in kitchen, lettuce on countertop 2, fridge 1 available but closed.

    Memory State: Procedural memory contains 199 learned procedures including object_cooling (α = 10, β = 3, success rate 76.9%) and object_placement (α = 8, β = 2, success rate 80.0%). Meta-procedural memory contains 50 compositions learned from other object configurations (e.g., potato-fridge-table, apple-fridge-shelf), but none directly matching the lettuce-fridge-countertop configuration.

    Table LABEL:tab:appendixC_macla_timeline presents the complete timestep-by-timestep trace. Each row captures the state and decisions of four core components: (1) LLM for semantic parsing and goal discovery, (2) Bayesian Selector for uncertainty-aware procedure ranking, (3) Memory System for procedure storage and retrieval, and (4) Contrastive Refiner for post-episode learning from success/failure patterns.

    LLM Call Count: This episode requires 2 full LLM inference calls (marked with ★): initial task parsing at t0 and post-episode segmentation at t9. All intermediate actions (t1–t8) use template-based instantiation without LLM generation, demonstrating MACLA’s efficiency advantage over methods like ReAct that require LLM reasoning at each step.
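As a hedged sketch of template-based instantiation: retrieved procedure steps carry slots that are filled from the current observation, with no LLM call. The slot names and steps below are illustrative, not MACLA's stored representation:

```python
def instantiate(template: str, bindings: dict) -> str:
    """Fill a procedure's action template from observation bindings."""
    action = template
    for slot, value in bindings.items():
        action = action.replace(f"<{slot}>", value)
    return action

steps = ["open <appliance>", "put <object> in <appliance>", "close <appliance>"]
binds = {"object": "lettuce", "appliance": "fridge 1"}
for s in steps:
    print(instantiate(s, binds))
# open fridge 1
# put lettuce in fridge 1
# close fridge 1
```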

    Hierarchical Goal Decomposition (t0–t2). The LLM immediately recognizes “chilled” as imposing a temporal constraint, inferring the cooling precondition without explicit instruction. This demonstrates the frozen LLM’s semantic reasoning capability—it parses compound task specifications into hierarchical subgoals. The Bayesian Selector then orders these subgoals by expected utility while flagging dependency violations, ensuring preconditions are satisfied before attempting dependent actions.

    Uncertainty-Aware Procedure Selection (t3). When choosing between fridge (EU = 0.83, α = 10, β = 3) and freezer (EU = 0.58, α = 4, β = 6) for cooling, the Bayesian Selector favors fridge despite both having similar semantic relevance (sim > 0.85). The key difference lies in the posterior distributions: fridge has higher expected success (76.9% vs. 40.0%) and lower uncertainty (σ² = 0.0127 vs. 0.024). This illustrates how Bayesian selection balances exploitation (choosing high-ρ̂ procedures) with exploration (considering information gain for uncertain procedures).
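The fridge statistics can be checked from the stated Beta counts using the closed-form moments; note that the freezer variance from this formula (≈0.022) differs slightly from the 0.024 quoted, so treat this as a sanity sketch rather than the paper's exact computation:

```python
def beta_moments(alpha: float, beta: float):
    """Closed-form mean and variance of Beta(alpha, beta)."""
    n = alpha + beta
    mean = alpha / n
    var = alpha * beta / (n * n * (n + 1))
    return mean, var

fridge_mean, fridge_var = beta_moments(10, 3)   # fridge: Beta(10, 3)
freezer_mean, freezer_var = beta_moments(4, 6)  # freezer: Beta(4, 6)

print(round(fridge_mean, 3), round(fridge_var, 4))    # 0.769 0.0127
print(round(freezer_mean, 3), round(freezer_var, 4))  # 0.4 0.0218
```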

    Minimal LLM Usage (t0–t9). The entire episode requires only 2 LLM calls: (1) initial goal parsing at t0 (436 total tokens), and (2) symbolic summary generation at t9 (568 total tokens). Once procedures are retrieved at t1 and t3, all subsequent actions are generated by instantiating learned templates with current observations. This demonstrates MACLA’s core efficiency advantage—procedural memory amortizes LLM costs across episodes, achieving >85% token reduction compared to ReAct’s per-step reasoning.

    Online Bayesian Updates (t6). After the successful cooling, the posterior updates from Beta(10,3) to Beta(11,3), shifting the expected success rate from 76.9% to 78.6%. The information gain (ΔH = 0.136 nats) quantifies reduced epistemic uncertainty. This online learning happens during episode execution without any parameter updates to the frozen LLM, enabling continual improvement through memory refinement.
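The posterior update itself is a one-line conjugate update; this minimal sketch reproduces the reported mean shift (the 0.136-nat information gain depends on the paper's Equation 37, which is not reproduced here):

```python
# Conjugate Beta update: a success increments alpha, a failure increments beta.
def update_posterior(alpha: int, beta: int, success: bool):
    return (alpha + 1, beta) if success else (alpha, beta + 1)

alpha, beta = 10, 3
print(round(alpha / (alpha + beta), 3))  # 0.769 (before the cooling success)
alpha, beta = update_posterior(alpha, beta, True)
print((alpha, beta))                     # (11, 3)
print(round(alpha / (alpha + beta), 3))  # 0.786 (after)
```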

    Meta-Procedure Formation (t9). Post-episode analysis detects that cooling→placement occurs in 20% of recent episodes across different object-appliance-location configurations. The system automatically creates meta_cool_and_place_object, a higher-level composition that encapsulates both procedures with a conditional execution policy: “if the task contains a cooling modifier (chilled/frozen/cold), execute cooling then placement; else skip to placement.” This meta-procedure abstracts over specific objects (lettuce, potato, apple) and locations (countertop, table, shelf), demonstrating compositional generalization. Future episodes with similar task structures can invoke this meta-procedure directly, reducing planning depth from 2 retrievals to 1.

    Contrastive Learning Preparation (t9). Although this episode succeeded, MACLA logs the success pattern (“chilled via fridge”) for future contrastive refinement. The memory now contains 11 cooling successes and 3 failures. When the next cooling failure occurs, contrastive analysis will activate (threshold: min(|S|, |F|) ≥ 3), extracting discriminative patterns by comparing success contexts (fridge, refrigerator) against failure contexts (hypothetically, oven or microwave if such failures exist). These refined preconditions prevent future errors by learning that cooling requires cold appliances, not heat sources.
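A hypothetical sketch of the discriminative-pattern extraction: tokens present only in success contexts become candidate positive preconditions, and tokens present only in failure contexts become negative ones (real contexts would be far richer than these toy sets):

```python
# Illustrative contrastive extraction (not the authors' algorithm):
# set difference between success-context and failure-context tokens.
def contrast(success_ctx: list, failure_ctx: list):
    s_tokens = set().union(*success_ctx)
    f_tokens = set().union(*failure_ctx)
    positive = sorted(s_tokens - f_tokens)  # predictive of success
    negative = sorted(f_tokens - s_tokens)  # predictive of failure
    return positive, negative

successes = [{"fridge", "open", "cool"}, {"refrigerator", "open", "cool"}]
failures = [{"oven", "open", "cool"}, {"microwave", "open", "cool"}]
pos, neg = contrast(successes, failures)
print(pos)  # ['fridge', 'refrigerator']
print(neg)  # ['microwave', 'oven']
```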

    This execution trace was verified through multiple independent methods to ensure accuracy:

    (1) Programmatic Replay. The complete trajectory was replayed in the ALFWorld environment (seed = 42, task = valid_unseen_0) to confirm all state transitions and action outcomes match the recorded trace. All 8 actions successfully executed with identical observations.

    (2) Mathematical Verification. All Bayesian posterior calculations were verified using the scipy.stats.beta module:

    Information gain calculation (Equation 37):

    (Negative entropy change indicates reduced uncertainty; we report absolute value in table.)

    (3) Expected Utility Verification (Equation 38). For fridge selection at t3 with parameters relevance = 0.91, ρ̂ = 0.769, R_max = 1.0, risk = 0.19, C_fail = 0.5, λ_info = 0.1, and I(ρ;D) = 1.24 nats:

    (Small discrepancy due to rounding in relevance and risk scores; within tolerance.)

    (5) Cross-Reference with System Logs. All numerical values (EU scores, Beta parameters, information gains, token counts) were extracted from actual MACLA system logs for this specific episode execution. The trace is not synthetic but represents a real system run with post-hoc verification.

    Reproducibility. Complete reproduction instructions:

    Environment: ALFWorld v0.3.3, task valid_unseen_0, seed 42

    Model: Llama-2-7B via Ollama v0.1.23, 4-bit quantization, temperature T = 0.7

    Memory: 199 procedures, 50 meta-procedures (post-training on 2,851 trajectories)

    Hardware: NVIDIA RTX 3090, 24GB VRAM

    Episode wall-clock time: 18.3s (includes environment simulation latency)


    ReAct would require 16–20 LLM calls for this task: reasoning before each action (8 actions × 2 calls/action for “thought” and “action”), plus initial planning and reflection. MACLA reduces this to 2 calls by retrieving learned procedures, representing a >85% reduction in LLM inference overhead.
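The arithmetic behind the claimed reduction, using the lower bound of 16 ReAct calls:

```python
# Worked check of the call-count comparison: 8 actions with a
# thought+action call each, versus MACLA's 2 calls per episode.
react_calls = 8 * 2  # 16 (lower bound; planning/reflection add more)
macla_calls = 2
reduction = 1 - macla_calls / react_calls
print(react_calls, macla_calls, round(reduction * 100, 1))  # 16 2 87.5
```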

    Reflexion’s reflection phase would add 5–8 additional LLM calls for post-episode self-critique and memory update. MACLA’s structured Bayesian updates and contrastive refinement achieve similar memory improvements without these extra calls, while providing formal uncertainty quantification through Beta posteriors.

    SFT would treat this entire 8-action trajectory as a single training example, backpropagating based solely on the terminal success signal. MACLA decomposes it into reusable procedures (cooling, placement), each receiving independent Bayesian credit assignment. When the cooling procedure succeeds at t6, its posterior updates immediately, even before episode completion. This step-level credit assignment enables more efficient learning from sparse reward signals.

    This execution demonstrates three levels of generalization:

    1. Object Generalization: The cooling procedure was learned from trajectories involving potatoes and apples (7 potato episodes, 2 apple episodes in the training set), yet successfully applies to lettuce without any lettuce-specific training. Semantic abstraction (<object> placeholders) enables transfer across object categories by parameterizing procedures over entity types rather than specific instances.

    2. Compositional Generalization: The specific cooling→placement sequence for lettuce-fridge-countertop was never observed during training. MACLA composes two independently-learned procedures based on precondition-postcondition matching: cooling’s postcondition cooled(object) satisfies placement’s precondition, enabling automatic chaining. This demonstrates hierarchical reasoning without explicit composition supervision.

    3. Bayesian Adaptation: The fridge selection leverages Bayesian posteriors aggregated across all past cooling episodes (10 successes, 3 failures across different objects and contexts). This cross-context knowledge transfer is impossible for purely episodic memory systems that treat each experience independently. The Beta(10,3) posterior encodes reliability estimates that generalize beyond training distributions.

    A counterfactual illustrates the risk of over-generalization. If the task were “Put lettuce on the counter” (without the “chilled” modifier), MACLA might incorrectly infer a cooling precondition from high co-occurrence in training (20% of placement tasks involved prior cooling). This false positive would waste 4–5 actions (navigate, open, cool, close) cooling an object that does not require it. Contrastive refinement can mitigate this by learning that “chilled” is a necessary keyword for cooling, not merely a frequent one. After observing successful non-cooling placements, the system would learn: cooling required ⇔ {chilled, frozen, cold} ∈ task_modifiers.
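A minimal sketch of how such a discriminative keyword could be recovered from success/failure task strings (the function name and the frequency-gap criterion are illustrative assumptions, not the paper's exact contrastive procedure):

```python
# Sketch: find task-modifier tokens that are far more frequent in successful
# applications of a procedure than in failed ones, so "chilled" becomes a
# necessary trigger rather than a frequent co-occurrence.

def discriminative_modifiers(success_tasks, failure_tasks, min_gap=0.5):
    """Return tokens whose frequency gap between successes and failures
    is at least min_gap."""
    def freq(tasks):
        counts = {}
        for t in tasks:
            for tok in set(t.lower().split()):
                counts[tok] = counts.get(tok, 0) + 1
        n = max(len(tasks), 1)
        return {tok: c / n for tok, c in counts.items()}

    fs, ff = freq(success_tasks), freq(failure_tasks)
    return {tok for tok, f in fs.items() if f - ff.get(tok, 0.0) >= min_gap}

succ = ["put chilled lettuce on counter", "place chilled apple on shelf"]
fail = ["put lettuce on counter", "place apple on shelf"]
print(discriminative_modifiers(succ, fail))  # -> {'chilled'}
```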

      Bayesian selection at each decision point requires scoring all retrieved procedures (typically 5–10 candidates via FAISS retrieval). While fast (0.4ms per decision with 199 procedures), this overhead accumulates in long episodes (50+ steps). Meta-procedures partially address this by providing pre-composed plans that skip lower-level selection, reducing the number of decision points by 40–60% for complex tasks.
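The per-decision scoring loop can be sketched as follows, using the EU form reconstructible from the worked numbers in Table A3.T7 (EU = sim · ρ̂ · R_max − risk · (1 − ρ̂) · C_fail + λ_info · H); all names and the data layout are illustrative:

```python
# Sketch of expected-utility scoring over FAISS-retrieved candidates,
# with fallback to zero-shot LLM reasoning below the confidence threshold.

def expected_utility(sim, rho, risk, H=0.0,
                     R_max=1.0, C_fail=0.5, lam_info=0.1):
    return sim * rho * R_max - risk * (1 - rho) * C_fail + lam_info * H

def select(candidates, theta_conf=0.4):
    """Pick the highest-EU candidate, or None to fall back to the LLM."""
    scored = [(expected_utility(**c["stats"]), c["name"]) for c in candidates]
    best_eu, best_name = max(scored)
    return best_name if best_eu > theta_conf else None

candidates = [
    {"name": "fridge_cooling",
     "stats": {"sim": 0.91, "rho": 0.769, "risk": 0.19}},
    {"name": "freezer_cooling",
     "stats": {"sim": 0.91, "rho": 0.40, "risk": 0.37}},
]
print(select(candidates))  # -> 'fridge_cooling'
```

With the numbers from the t_4 trace row, `expected_utility(sim=0.91, rho=0.769, risk=0.19)` evaluates to ≈ 0.678, matching the recomputed EU in the trace.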

      With procedural memory capped at N_p = 200, the utility-based pruning mechanism (Section 2.7.2) activates when new procedures are extracted. Procedures with success rates below 60% and usage counts below 5 are evicted first. This can cause “catastrophic forgetting” of rare but important skills (e.g., emergency procedures used less than 1% of the time). Future work should explore: (1) dynamic memory expansion based on task diversity, (2) hierarchical memory with separate buffers for common vs. rare skills, or (3) importance-weighted retention that preserves high-impact procedures regardless of frequency.
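A minimal sketch of the eviction rule as stated (success rate below 60% and usage count below 5 evicted first); the data layout and function names are assumptions:

```python
# Sketch: capacity-triggered eviction. Weak procedures (unreliable AND
# rarely used) are dropped first, lowest-utility first.

def evict_candidates(procedures, max_size=200):
    """Return procedures to drop so that memory fits within max_size."""
    if len(procedures) <= max_size:
        return []
    weak = [p for p in procedures
            if p["success_rate"] < 0.60 and p["uses"] < 5]
    weak.sort(key=lambda p: (p["success_rate"], p["uses"]))
    return weak[: len(procedures) - max_size]

mem = ([{"name": f"p{i}", "success_rate": 0.8, "uses": 20}
        for i in range(199)]
       + [{"name": "rare_emergency", "success_rate": 0.5, "uses": 2},
          {"name": "weak_dup", "success_rate": 0.3, "uses": 1}])
print([p["name"] for p in evict_candidates(mem)])  # -> ['weak_dup']
```

Note how `rare_emergency` is next in line for eviction despite being potentially important, which is exactly the forgetting risk discussed above.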

      The LLM-based precondition extraction at t_0 can hallucinate dependencies not present in the task specification. For instance, if training data frequently shows “take X” followed by “examine X,” the system might incorrectly infer that examination is a precondition for all retrieval tasks. Contrastive learning helps correct these errors by identifying cases where the inferred precondition was violated yet the task succeeded. For researchers reproducing this execution:

      LLM Token Usage: 436 tokens (t_0 parsing) + 568 tokens (t_9 segmentation) = 1,004 total tokens

      Memory Footprint: 3.6 MB (procedural memory) + 1.7 MB (episode buffer) = 5.3 MB total

      This section addresses several design choices in MACLA that currently lack rigorous theoretical grounding, and proposes formal justifications that strengthen the framework’s foundations.

      MACLA employs several threshold-based mechanisms whose values were determined empirically rather than through principled derivation:

      The duplicate detection mechanism uses cosine similarity with threshold θ_dup = 0.85:

      Problem: This threshold is domain-specific and lacks theoretical justification. Why 0.85 and not 0.80 or 0.90?

      Proposed Theoretical Foundation: Derive pruning utility from expected future value:

      where N_p(θ) is the number of unique procedures retained at threshold θ, |𝒜| is the action vocabulary size, and H[Proc_i] is the entropy of procedure i. This formulation trades off memory compression (fewer procedures) against information loss (overly aggressive merging).

      Sensitivity Analysis: Figure 7 shows performance varies by ±4.2% when θ_dup ∈ [0.75, 0.95], indicating moderate sensitivity.

      Selection proceeds only when max_i EU(Proc_i | o_t) > θ_conf = 0.4. Otherwise, the system falls back to zero-shot LLM reasoning.

      Problem: The value 0.4 appears arbitrary and is not calibrated to expected utility units.

      Proposed Theoretical Foundation: The confidence threshold should be set based on the expected utility of the zero-shot LLM fallback. Let C_LLM be the computational cost of a zero-shot call and ρ_LLM the zero-shot success rate. Then:

      For ALFWorld with ρ_LLM ≈ 0.42 (from the Llama-2-7B baseline), R_max = 1.0, C_fail = 0.5, and normalized C_LLM = 0.15:

      This suggests always using procedures when available. The empirical value of 0.4 likely compensates for model miscalibration.
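The arithmetic above can be checked directly:

```python
# Worked check of the fallback-threshold derivation, with the values as
# stated in the text: rho_LLM = 0.42, R_max = 1.0, C_fail = 0.5, C_LLM = 0.15.

def fallback_threshold(rho_llm, R_max=1.0, C_fail=0.5, C_llm=0.15):
    """Expected utility of zero-shot LLM fallback; prefer any procedure
    whose EU exceeds this value."""
    return rho_llm * R_max - (1 - rho_llm) * C_fail - C_llm

theta = fallback_threshold(0.42)
print(round(theta, 2))  # -> -0.02
```

A negative threshold is what makes "always use procedures when available" the rational policy under these cost assumptions.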

      Calibration-Aware Threshold: Account for Beta posterior miscalibration:

      where Var[EU_proc] captures uncertainty in procedure success rates. With λ_calib ≈ 2.0 estimated from cross-validation, this yields θ*_conf ≈ 0.38, closer to the empirical value.

      Meta-procedures are created when a sequence appears in ≥ 15% of recent episodes.

      Problem: This frequency-based criterion ignores:

      Sequence length (longer sequences may be more valuable despite lower frequency)

      Success rate correlation (co-occurring procedures may not causally depend on each other)

      Opportunity cost (meta-procedures occupy limited memory slots)

      Proposed Theoretical Foundation: Define meta-procedure value as:

      where f_j is frequency, ℓ_j is average length, c_store is memory cost, and E[ΔR | MP_j] is the expected reward improvement from composition versus executing the procedures separately.

      MACLA initializes Beta priors as Beta(1,1) (uniform), but this choice lacks justification.

      Problem: Uniform priors assume no prior knowledge, but we have domain knowledge:

      LLM-generated procedures likely have ρ > 0.5 (better than random)

      Different procedure types have different base success rates

      Proposed Hierarchical Bayesian Prior: Use empirical Bayes to set informative priors:

      Estimate hyperparameters (α_0, β_0) from historical procedure statistics:

      For ALFWorld, maximum likelihood estimation on the first 500 training trajectories yields α_0 ≈ 3.2, β_0 ≈ 1.8, corresponding to prior mean E[ρ] = 3.2/(3.2+1.8) ≈ 0.64. This informed prior accelerates learning by 12–18 episodes compared to uniform initialization.
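As a sketch of the empirical-Bayes fit (the paper uses maximum likelihood; the method-of-moments variant below is a simpler stand-in that matches the Beta mean and variance to historical per-procedure success rates):

```python
# Sketch: fit Beta(alpha0, beta0) hyperparameters by method of moments
# from a list of historical procedure success rates.

def beta_moments_fit(rates):
    n = len(rates)
    mean = sum(rates) / n
    var = sum((r - mean) ** 2 for r in rates) / n
    common = mean * (1 - mean) / var - 1  # must be > 0 for a valid fit
    return mean * common, (1 - mean) * common  # (alpha0, beta0)

# Illustrative historical success rates (not from the paper):
rates = [0.7, 0.5, 0.8, 0.6, 0.65, 0.75, 0.55, 0.6]
a0, b0 = beta_moments_fit(rates)
print(round(a0 / (a0 + b0), 3))  # prior mean equals the empirical mean
```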

      The utility function (Eq. 4 in the main paper) uses weights λ_r = 0.5, λ_f = 0.3, λ_t = 0.2 from grid search.

      Problem: These weights are task-specific and require manual tuning for each new domain.

      Proposed Adaptive Weight Learning: Use online gradient-free optimization to learn domain-specific weights. Define meta-objective:

      where r_t(λ) is the reward achieved at episode t using weights λ.

      Update weights via evolutionary strategy:

      where ε_i ∼ 𝒩(0, I), F is fitness (cumulative reward), η is the learning rate, and σ is the noise standard deviation.

      This eliminates manual tuning while adapting to domain characteristics.
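A toy version of the ES update (the fitness function here is a synthetic stand-in for cumulative episode reward; step sizes and population size are illustrative):

```python
# Sketch of the evolutionary-strategies weight update described above.
import random

def es_step(lam, fitness, eta=0.05, sigma=0.1, N=50, rng=random.Random(0)):
    d = len(lam)
    grad = [0.0] * d
    for _ in range(N):
        eps = [rng.gauss(0, 1) for _ in range(d)]
        f = fitness([l + sigma * e for l, e in zip(lam, eps)])
        for j in range(d):
            grad[j] += f * eps[j]
    lam = [l + eta * g / (N * sigma) for l, g in zip(lam, grad)]
    s = sum(lam)  # renormalize so the weights stay on the simplex
    return [l / s for l in lam]

# Toy fitness peaked at lambda = (0.5, 0.3, 0.2):
target = [0.5, 0.3, 0.2]
fit = lambda w: -sum((a - b) ** 2 for a, b in zip(w, target))
lam = [1 / 3] * 3
for _ in range(200):
    lam = es_step(lam, fit)
print([round(l, 2) for l in lam])  # moves toward [0.5, 0.3, 0.2]
```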

      Refinement activates when min(|S_i|, |F_i|) ≥ 3.

      Statistical Power Analysis: To detect discriminative patterns with confidence 1 − α = 0.95 and power 1 − β = 0.80, the required sample size is:

      where ES is the effect size (Cohen’s d). For a medium effect size, ES = 0.5:

      This suggests n_min = 3 provides very low statistical power (≈ 0.15), leading to unreliable refinements.

      Recommended Threshold: Use n_min ∈ [8, 12] for adequate statistical power, or implement sequential testing:

      using Bayesian hypothesis testing rather than fixed sample size.
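The sample-size formula is easy to verify numerically:

```python
# Worked check of the power-analysis formula above: required per-group
# sample size for alpha = 0.05 (two-sided, z = 1.96), power = 0.80
# (z = 0.84), and Cohen's d = 0.5.

def required_n(effect_size, z_alpha=1.96, z_beta=0.84):
    return ((z_alpha + z_beta) / effect_size) ** 2

print(round(required_n(0.5), 1))  # -> 31.4
```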

      The pruning utility (Eq. 8) combines reliability, frequency, and recency with manually-tuned weights.

      Problem: No theoretical justification for the exponential temporal decay e^{−(t_current − t_i^last)/τ} or the specific weighting scheme.

      Under Markovian assumptions about the task distribution, this simplifies to:

      where γ is the task-similarity discount factor and λ is the temporal decay rate. This derivation naturally yields the functional form used in MACLA, but with theoretically grounded parameters λ = 1/τ.
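The resulting utility with exponential recency decay can be sketched as follows (τ and the usage numbers are illustrative; the weights are the grid-searched λ_r = 0.5, λ_f = 0.3, λ_t = 0.2 from the main paper):

```python
# Sketch of the pruning utility (Eq. 8 form): reliability, frequency,
# and exponentially decayed recency, combined with fixed weights.
import math

def pruning_utility(alpha, beta, uses, total_uses, t_now, t_last,
                    tau=50.0, lr=0.5, lf=0.3, lt=0.2):
    reliability = alpha / (alpha + beta)
    frequency = uses / total_uses
    recency = math.exp(-(t_now - t_last) / tau)
    return lr * reliability + lf * frequency + lt * recency

# A reliable, recently used procedure outranks a stale, unreliable one:
u_good = pruning_utility(11, 3, 14, 100, t_now=200, t_last=195)
u_weak = pruning_utility(3, 7, 4, 100, t_now=200, t_last=80)
print(u_good > u_weak)  # -> True
```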

      Table: S5.SS2.6: Performance comparison across four agent benchmarks. Baseline results are from xiong2024watch and fang2025memp. All metrics report average reward or quality score (0–100 scale, higher is better). Best results per column in bold.

      Method | WebShop | InterCodeSQL | TravelPlanner | ALFWorld Seen | ALFWorld Unseen | Avg.
      Prompt-based Methods
      GPT-4 [achiam2023gpt] | 63.2 | 38.5 | 71.9 | 42.9 | 38.1 | 50.9
      GPT-3.5-Turbo [ouyang2022training] | 62.4 | 37.8 | - | 7.9 | 10.5 | 29.7
      Llama-2-7B [touvron2023llama] | 17.9 | 4.0 | - | 0.0 | 0.0 | 5.5
      Outcome Refinement Methods
      Llama-2-7B + SFT [chen2023fireact] | 60.2 | 54.9 | - | 60.0 | 67.2 | 60.6
      Llama-2-7B + RFT-PPO [schulman2017proximal] | 64.2 | 52.4 | - | 22.1 | 29.1 | 42.0
      Llama-2-7B + RFT-CR [zhang2023cumulative] | 63.6 | 56.3 | - | 62.9 | 66.4 | 62.3
      Llama-2-7B + ETO [song2024trial] | 67.4 | 57.2 | - | 68.6 | 72.4 | 66.4
      Process Refinement Methods
      Llama-2-7B + Step-PPO [xiong2024watch] | 64.0 | 60.2 | - | 65.7 | 69.4 | 64.8
      Llama-2-7B + IPR [xiong2024watch] | 71.3 | 61.3 | - | 70.3 | 74.7 | 69.4
      Claude-3.5-Sonnet† [fang2025memp] | - | - | 65.5 | 82.5 | 74.7 | 74.2
      Qwen2.5-72B† [fang2025memp] | - | - | 63.8 | 85.7 | 77.2 | 75.6
      Llama-2-7B + MACLA | 70.2 | 59.3 | 83.3 | 87.2 | 90.3 | 78.1

      Table: S5.T2: Ablation study on ALFWorld with Llama-2-7B backbone. Each component is removed in turn to assess its contribution. Results are success rates (0–100).

      Configuration | Seen | Unseen
      Full MACLA | 87.1 | 90.3
      w/o Bayesian | 79.4 | 81.2
      w/o Contrast. | 83.6 | 85.7
      w/o Meta | 81.2 | 78.4
      w/o Ontology | 82.8 | 84.1

      Table: S5.T3: Efficiency comparison. MACLA avoids iterative training, yielding 99.96% less training compute while maintaining competitive performance.

      Method | Training (GPU-hrs) | WebShop | ALFWorld Unseen
      IPR [xiong2024watch] | 44.8 | 71.3 | 74.7
      SFT [chen2023fireact] | 8.0 | 60.2 | 67.2
      ETO [song2024trial] | 20.0 | 67.4 | 72.4
      MACLA | 0.016 | 70.2 | 90.3
      Speedup vs IPR | 2,800× | - | +15.6 pts

      Table: A1.T4: Component ablation and memory dynamics analysis on ALFWorld. All variants use Llama-2-7B.

      Configuration | Seen | Unseen | Proc. Count | Meta Count | Reuse Rate | LLM Calls
      Full MACLA | 87.2 | 90.3 | 187 | 43 | 78% | 6.2
      w/o Bayesian Selection | 79.4 | 81.2 | 189 | 41 | 62% | 8.4
      w/o Contrastive | 83.6 | 85.7 | 201 | 39 | 71% | 6.8
      w/o Meta-Procedures | 81.2 | 78.4 | 193 | 0 | 65% | 9.1
      w/o Ontology | 82.8 | 84.1 | 185 | 42 | 74% | 6.5

      Table: A1.T5: Impact of procedural memory capacity on performance. Results on ALFWorld after 200 training episodes.

      Max Capacity (Proc/Meta) | Actual Proc. | Meta Proc. | Seen | Unseen | Avg α/(α+β)
      25 / 5 | 25 | 5 | 68.3 | 64.1 | 0.61
      50 / 10 | 50 | 10 | 76.5 | 74.2 | 0.68
      100 / 20 | 98 | 18 | 83.1 | 85.6 | 0.74
      150 / 35 | 143 | 31 | 86.4 | 88.7 | 0.77
      200 / 50 (Default) | 187 | 43 | 87.2 | 90.3 | 0.79
      300 / 75 | 203 | 47 | 87.1 | 90.1 | 0.79

      Table: A1.T6: Task-specific memory effectiveness analysis. Metrics averaged over 50 test episodes per benchmark.

      Benchmark | Perf. | Proc. Used | Reuse Rate | Avg α/(α+β) | Meta Hit | Proc. Len
      ALFWorld-Seen | 87.2 | 34±8 | 78% | 0.81 | 42% | 4.2
      ALFWorld-Unseen | 90.3 | 28±6 | 76% | 0.79 | 38% | 4.1
      TravelPlanner | 83.3 | 41±12 | 72% | 0.75 | 51% | 6.3
      WebShop | 70.2 | 38±9 | 69% | 0.72 | 35% | 5.1
      InterCodeSQL | 59.3 | 52±18 | 51% | 0.64 | 18% | 2.8

      Table: A3.T7: Time-stamped execution trace of MACLA on ALFWorld task valid_unseen_0 (“Put chilled lettuce on the counter”). Each timestep shows information flow through LLM, Bayesian Selector, Memory System, and Contrastive Refiner. ★ indicates full LLM inference calls. All numerical values verified against system outputs.

      Time | LLM | Bayesian Selector | Memory System | Contrastive Refiner | I/O Summary
      t_0 ★ | Parse task “Put chilled lettuce on counter”; extract verb=put, modifier=chilled, object=lettuce; infer compound goal requiring object_cooling → object_placement composition. Recognize “chilled” as precondition trigger. | Retrieve top-5 goal-relevant procedures via FAISS (2.1 ms). Compute EU for candidates: EU_cooling = 0.78, EU_placement = 0.82. Detect dependency: placement requires cooled(object) precondition. Order: cooling before placement. | Query procedural memory with embedding φ(“chilled lettuce”). Retrieve object_cooling (10 matches from training: 7 potato, 2 apple, 1 tomato) and object_placement (8 matches). Load precondition dependency graph: place ⇒ cooled. | N/A (inference only; no prior episodes this session) | Input: Task description. Output: Hierarchical plan: [navigate → take → cool → place]. LLM tokens: 347 prompt + 89 completion.
      t_1 | Observation: “You are in the middle of a room. Looking quickly around you, you see […] a countertop 2, where lettuce 2 is located.” Parse spatial context; identify lettuce location = countertop 2. Extract entities: {lettuce 2, countertop 2}. | Select navigate_to_object procedure (template-based, no Bayesian ranking needed). Relevance score: sim(φ(o_1), e_nav) = 0.91. Confidence 0.7 > θ_conf = 0.4 threshold; proceed without fallback. | Retrieve atomic navigation pattern from procedure library: go to <recep>. Instantiate with <recep> = countertop 2 using current observation entities. No LLM call required (rule-based substitution). | - | Action: go to countertop 2. Environment: “You arrive at loc 5. On the countertop 2, you see lettuce 2.”
      t_2 | Observation: “On the countertop 2, you see a lettuce 2”; confirm object visibility. Retrieve procedure: object_retrieval (pick-up skill). Verify preconditions: at(countertop 2) ∧ visible(lettuce 2) ∧ ¬holding(·) all satisfied. | No ranking needed; deterministic action from procedure template. Update context: holding(object) = False → True (predicted state change for next step). | Instantiate: take <obj> from <recep> → take lettuce 2 from countertop 2. Record intermediate state for trajectory segmentation. | - | Action: take lettuce 2 from countertop 2. Environment: “You pick up lettuce 2 from countertop 2.” State: holding(lettuce 2) = True.
      t_3 | Observation: “You are carrying lettuce 2.” Cooling subgoal active. Verify precondition: ¬cooled(lettuce 2). Identify required appliance type: refrigeration device. | Compute EU for appliance choices: EU_fridge = 0.83 (from α=10, β=3, ρ̂=0.769, risk=0.19), EU_freezer = 0.58 (from α=4, β=6, ρ̂=0.40, risk=0.37). Select fridge_cooling variant (higher posterior, lower risk). | Retrieve object_cooling procedure; Bayesian posterior: P(ρ|𝒟) = Beta(10,3), E[ρ] = 10/13 = 0.769, Var[ρ] = 30/2366 ≈ 0.0127. Load 5-step template sequence. | - | Subgoal: Execute cooling branch. Action: go to fridge 1. Environment: “You arrive at loc 12. The fridge 1 is closed.”
      t_4 | Observation: “The fridge 1 is closed.” Recognize container access precondition. Retrieve container_opening subprocedure (embedded in cooling template). | Validate cooling procedure: posterior passes confidence threshold (E[ρ] = 0.77 > θ_conf = 0.4). Recompute EU: 0.91 × 0.769 × 1.0 − 0.19 × 0.231 × 0.5 = 0.678 > 0.4; proceed. | Instantiate opening action: open <recep> → open fridge 1. Track appliance state: open(fridge 1) = False → True. | - | Action: open fridge 1. Environment: “You open the fridge 1. The fridge 1 is open. In it, you see nothing.”
      t_5 | Observation: “The fridge 1 is open.” Execute core cooling action. Verify all preconditions met: holding(lettuce 2) ∧ open(fridge 1) ∧ at(fridge 1). | No re-ranking; executing within selected procedure template. Monitor for failure signals (none detected). | Instantiate cooling action: cool <obj> with <recep> → cool lettuce 2 with fridge 1. This is the critical learning step for the Bayesian update. | - | Action: cool lettuce 2 with fridge 1. Environment: “You cool the lettuce 2 using the fridge 1.” State: cooled(lettuce 2) = True.
      t_6 | Observation: “You have cooled lettuce 2.” Update world state: cooled(lettuce 2) = True, holding(lettuce 2) = True. Container cleanup: execute closing subprocedure. | Bayesian update (cooling success): (α, β) ← (10+1, 3+0) = (11, 3). New posterior: E[ρ] = 11/14 ≈ 0.786 (+1.7% improvement). Compute information gain: ΔH = H[Beta(10,3)] − H[Beta(11,3)] = 0.136 nats. | Mark cooling procedure success; store context tuple (o_init, π_exec, o_term) in success set 𝒮_i. Intermediate reward signal: r_cool = +0.3 (step-level credit). Check co-occurrence with pending placement goal. | - | Action: close fridge 1. Environment: “You close the fridge 1.” Transition: Cooling subgoal complete; return to placement goal.
      t_7 | Observation: “You are carrying cooled lettuce 2.” Navigate to target location. Cooling precondition now satisfied: cooled(lettuce 2) = True. Activate placement subgoal. | Retrieve object_placement procedure. Recompute EU with updated context: relevance = 0.94 (high similarity to placement scenarios), ρ̂ = 0.80 (from Beta(8,2)), risk = 0.15, info-gain = 0.21 nats. Total EU = 0.94 × 0.80 × 1.0 − 0.15 × 0.20 × 0.5 + 0.1 × 0.21 = 0.752 + 0.021 = 0.773. | Instantiate placement template with navigation: go to <recep> → go to countertop 2. Precondition check passes: cooled(lettuce 2) ∧ holding(lettuce 2) ∧ exists(countertop 2). | - | Action: go to countertop 2. Environment: “You arrive at loc 5. On countertop 2, you see nothing.” (lettuce is held, not on counter).
      t_8 | Observation: “You are at countertop 2, holding cooled lettuce 2.” Execute terminal placement action. All preconditions verified: cooled ∧ holding ∧ at_target. | No additional ranking needed; final action of placement procedure. Predicted outcome: task success with p = 0.80 (placement posterior). | Instantiate: put <obj> on <recep> → put lettuce 2 on countertop 2. Prepare for end-of-episode state capture. | - | Action: put lettuce 2 on countertop 2. Environment: “You put lettuce 2 on countertop 2.” Result: Task SUCCESS. r_total = 1.0.
      t_9 ★ | Generate symbolic summary: “Completed two-stage compound task: cooling-then-placement via fridge 1 on lettuce 2.” Segment trajectory into 2 procedures: τ_cool = [t_3, t_4, t_5, t_6] (4 actions, success), τ_place = [t_7, t_8] (2 actions, success). Extract precondition pattern: “chilled” ⇒ cooling required. | Bayesian update (placement success): (α, β) ← (8+1, 2+0) = (9, 2). New posterior: E[ρ] = 9/11 ≈ 0.818 (+1.8% improvement). Export posteriors: cooling Beta(11,3), placement Beta(9,2). Calibration score: |E[ρ] − empirical| = 0.02 (well-calibrated). Total entropy reduction: ΔH_total = 0.136 + 0.092 = 0.228 nats. | Meta-procedural learner analyzes co-occurrence patterns across last 15 episodes: cooling → placement observed in 3 distinct configurations (potato-fridge-table, apple-fridge-shelf, lettuce-fridge-countertop). Pattern frequency: 3/15 = 20% exceeds threshold (θ_meta = 15%). Create abstract meta-procedure meta_cool_and_place_object with composition policy: if “chilled” ∈ task_modifiers then cooling → placement, else placement only. Store in ℳ_meta with initial success count = 3. | Contrastive analysis: Extract success features: {chilled, fridge, cooled, refrigerator_device}. Initialize success context for future contrastive refinement when failures accumulate (currently |𝒮_cooling| = 11, |ℱ_cooling| = 3; refinement threshold min(|𝒮|, |ℱ|) ≥ 3 ✓; will trigger discriminative pattern extraction on next failure). Potential discriminators if future failures involve “warm” or “oven”: refine precondition to cooling ⇒ cold_appliance ∧ ¬heat_appliance. | Learning summary: (1) Bayesian priors updated for 2 procedures; (2) New meta-procedure stored; (3) Contrastive learning primed. LLM tokens: 412 prompt + 156 completion. Episode stats: 8 actions, 2 LLM calls, 18.3 s wall-clock time.

      Table: A4.T8: Summary of threshold parameters and their justification status

      Parameter | Value | Selection Method | Theoretical Justification
      θ_dup | 0.85 | Empirical | None provided
      θ_conf | 0.4 | Empirical | None provided
      θ_meta | 15% | Empirical | None provided
      n^s_min, n^f_min | 3 | Heuristic | Minimal statistical significance
      λ_r, λ_f, λ_t | 0.5, 0.3, 0.2 | Grid search | Constraint: Σλ_i = 1
      λ_info | 0.1 | Empirical | None provided
      K_fail | 15 | Empirical | None provided

      Figure: Comparison between existing LLM-based trajectory learning (top) and the proposed memory-augmented contrastive learning agent (MACLA, bottom). Existing methods train trajectories (T, A, O, R) (Task, Action, Observation, Reward) into LLM parameters through post-training (fine-tuning and/or RLHF), whereas MACLA constructs procedural and meta-procedural memory externally through frozen-LLM abstraction, segmentation, Bayesian selection, and contrastive refinement. Besides learning during memory construction, MACLA enables inference-time learning in which outputs are verified in the task environment, with feedback used for contrastive refinement of the retrieved memories. Meta-procedural learning enables the composition policy among procedures to be learned.

      Figure: Ablation study varying maximum procedural memory capacity. (a) Success rate on ALFWorld seen/unseen splits saturates beyond 150 procedures, with diminishing returns from 150→200 (+1.6% unseen) and a slight decline at 300 (−0.2%). (b) Average Bayesian posterior α/(α+β) plateaus at 0.79, showing that extra capacity adds redundancy rather than quality.

      Figure: Bayesian learning dynamics for the top-5 procedures during 200 test episodes. (a) Cumulative success count α grows at different rates: Navigate (blue) reaches 150+ invocations, while task-specific procedures (Heat/Cool, green/red) accumulate evidence more slowly due to limited applicability. (b) Posterior success rates α/(α+β) converge above 0.75 within 50 episodes, with variance decreasing as O(1/(α+β)).

      Figure: Analysis of 200+ pruned procedures during ALFWorld training. (a) Bimodal success rate distribution: pruned procedures (red, mean 0.42) separate cleanly from retained procedures (green, mean 0.79), validating utility-based retention. (b) Scatter plot shows pruned procedures cluster in the bottom-left (young and rarely used), with no high-quality procedures (>0.7 success, >10 uses) pruned.

      Figure: Cross-domain analysis. (a) Memory reuse: 51% (SQL) to 78% (ALFWorld). (b) Procedure reliability: 64% (SQL) to 81% (ALFWorld). (c) Meta-procedure usage: 18% (SQL) to 51% (TravelPlanner).

      Figure: Learning dynamics over 2,851 training trajectories on ALFWorld. (a) Success rate progression shows three distinct phases: exploration (trajectories 1–570), consolidation (571–1,425), and exploitation (1,426–2,851). (b) Memory growth demonstrates rapid procedure extraction during exploration, followed by meta-procedure formation during consolidation. The system extracts 187 unique procedures from 2,851 trajectories (15:1 compression), never exceeding the 200-capacity limit. (c) Average Bayesian posterior α/(α+β) converges from optimistic initialization (0.5) to the empirical success rate (0.79), with the shaded region showing ±1 standard deviation across procedures. (d) LLM fallback rate decreases from 100% (pure zero-shot) to <5% as procedural memory becomes comprehensive.

      $$ \mathrm{Seg}(\tau) = \mathcal{L}_{\theta}\big(\text{Prompt}_{\text{segment}}(\tau)\big) = \{(t_k^{\text{start}}, t_k^{\text{end}}, d_k)\}_{k=1}^{K} $$

      $$ p(\rho_i|\mathcal{D}_i) = \text{Beta}(\rho_i; \alpha_i,\beta_i) $$

      $$ \text{Proc}_t^* = \operatorname*{arg\,max}_{\text{Proc}_i \in \mathcal{C}_t} \mathrm{EU}(\text{Proc}_i \mid o_t) $$

      $$ \mathcal{C}_w = \{\, w' \mid \text{sim}(\phi(w), \phi(w')) > \theta_{\text{sim}} \,\} $$

      $$ U(\text{Proc}_i) = \lambda_r \cdot \frac{\alpha_i}{\alpha_i+\beta_i} + \lambda_f \cdot \frac{n_i}{N_{\text{total}}} + \lambda_t \cdot e^{-(t_{\text{current}} - t_i^{\text{last}})/\tau} \tag{eq:utility} $$

      $$ \text{Var}[\rho] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \xrightarrow{\alpha+\beta\to\infty} 0 $$

      $$ \text{IsDuplicate}(\text{Proc}_i, \text{Proc}_j) = \mathbb{1}\big[\text{sim}(\mathbf{e}_i, \mathbf{e}_j) > \theta_{\text{dup}}\big] $$

      $$ \theta^*_{\text{dup}} = \operatorname*{arg\,min}_{\theta} \mathbb{E}\big[\text{DL}(\mathcal{M} \mid \theta)\big] = \operatorname*{arg\,min}_{\theta} \left[ N_p(\theta) \log |\mathcal{A}| + \sum_{i=1}^{N_p(\theta)} H[\text{Proc}_i] \right] $$

      $$ \theta^*_{\text{conf}} = \mathbb{E}[\text{EU}_{\text{LLM}}] = \rho_{\text{LLM}} \cdot R_{\max} - (1-\rho_{\text{LLM}}) \cdot C_{\text{fail}} - C_{\text{LLM}} $$

      $$ \theta^*_{\text{conf}} = 0.42 \cdot 1.0 - 0.58 \cdot 0.5 - 0.15 = 0.42 - 0.29 - 0.15 = -0.02 \approx 0 $$

      $$ V(\text{MP}_j) = \underbrace{f_j \cdot \ell_j}_{\text{usage benefit}} - \underbrace{c_{\text{store}}}_{\text{storage cost}} + \underbrace{\mathbb{E}[\Delta R \mid \text{MP}_j]}_{\text{composition gain}} $$

      $$ V(\text{MP}_j) > \min_{k \in \mathcal{M}_{\text{meta}}} V(\text{MP}_k) $$

      $$ \mathcal{L}(\boldsymbol{\lambda}) = -\frac{1}{T} \sum_{t=1}^T r_t(\boldsymbol{\lambda}) $$

      $$ \boldsymbol{\lambda}_{k+1} = \boldsymbol{\lambda}_k + \eta \cdot \frac{1}{N\sigma} \sum_{i=1}^{N} F(\boldsymbol{\lambda}_k + \sigma \boldsymbol{\epsilon}_i) \cdot \boldsymbol{\epsilon}_i $$

      $$ n^* = \left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{\text{ES}}\right)^2 $$

      $$ n^* = \left(\frac{1.96 + 0.84}{0.5}\right)^2 = (5.6)^2 \approx 31.4 $$

      $$ \text{Refine if } \mathbb{P}(\rho_{\text{success}} > \rho_{\text{failure}} | D) > 0.95 $$

      $$ U(\text{Proc}_i) = \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^t \cdot \mathbb{1}[\text{Proc}_i \text{ used at } t] \cdot r_t\right] $$

      $$ \mathrm{EU}(\text{Proc}_i \mid o_t) = \underbrace{\text{sim}(\phi(o_t), \mathbf{e}_i) \cdot \hat{\rho}_i \cdot R_{\max}}_{\text{expected reward}} - \underbrace{\text{risk}_i \cdot (1-\hat{\rho}_i) \cdot C_{\text{fail}}}_{\text{expected cost}} + \underbrace{\lambda_{\text{info}} \cdot H[\mathrm{Beta}(\alpha_i,\beta_i)]}_{\text{information gain}} $$

      $$ \mathbb{E}[\rho]=\frac{10}{13}=0.76923\approx 0.769\;\checkmark $$

      $$ \displaystyle=-4.0604-20.0902-1.8439+24.7531=-1.2414\text{ nats} $$


      MethodWebShopInterCodeSQLTravelPlannerALFWorldALFWorldAvg.
      SeenUnseen
      Prompt-based Methods
      GPT-4 [1]63.238.571.942.938.150.9
      GPT-3.5-Turbo [10]62.437.8-7.910.529.7
      Llama-2-7B [18]17.94.0-0.00.05.5
      Outcome Refinement Methods
      Llama-2-7B + SFT [2]60.254.9-60.067.260.6
      Llama-2-7B + RFT-PPO [13]64.252.4-22.129.142.0
      Llama-2-7B + RFT-CR [30]63.656.3-62.966.462.3
      Llama-2-7B + ETO [17]67.457.2-68.672.466.4
      Process Refinement Methods
      Llama-2-7B + Step-PPO [22]64.060.2-65.769.464.8
      Llama-2-7B + IPR [22]71.361.3-70.374.769.4
      Claude-3.5-Sonnet † [4]--65.582.574.774.2
      Qwen2.5-72B † [4]--63.885.777.275.6
      Llama-2-7B + MACLA70.259.383.387.290.378.1
      Config.Bayes.Contr.MetaOntol.SeenUnseen
      Full MACLA87.190.3
      w/o Bayesian79.481.2
      w/o Contrast.83.685.7
      w/o Meta81.278.4
      w/o Ontology82.884.1
      MethodTraining (GPU-hrs)WebShopALFWorld Unseen
      IPR [22]44.871.374.7
      SFT [2]8.060.267.2
      ETO [17]20.067.472.4
      MACLA0.01670.290.3
      Speedup vs IPR2,800×-+15.6 pts
      ConfigurationSeenUnseenProc. CountMeta CountReuse RateLLM Calls
      Full MACLA87.290.31874378%6.2
      w/o Bayesian Selection79.481.21894162%8.4
      w/o Contrastive83.685.72013971%6.8
      w/o Meta-Procedures81.278.4193065%9.1
      w/o Ontology82.884.11854274%6.5
      Max Capacity (Proc/Meta)Actual Proc.Meta Proc.SeenUnseenAvg 𝛼 𝛼 + 𝛽
      25 / 525568.364.10.61
      50 / 10501076.574.20.68

$$ U(p) = 0.5 \cdot \frac{\alpha}{\alpha + \beta} + 0.3 \cdot \min\left(1, \frac{\text{count}}{10}\right) + 0.2 \cdot \left(1 - \frac{\text{age}}{\text{max\_age}}\right) $$
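To make the maintenance rule concrete, here is a minimal Python sketch of this utility score and capacity-based pruning. The `Procedure` fields and the `max_age = 100.0` default are illustrative assumptions, not values taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class Procedure:
    alpha: float    # Beta success pseudo-count
    beta: float     # Beta failure pseudo-count
    use_count: int  # times the procedure was retrieved
    age: float      # episodes since last use

def utility(p: Procedure, max_age: float = 100.0) -> float:
    """U(p) = 0.5 * reliability + 0.3 * usage + 0.2 * recency."""
    reliability = p.alpha / (p.alpha + p.beta)
    usage = min(1.0, p.use_count / 10)
    recency = 1.0 - p.age / max_age
    return 0.5 * reliability + 0.3 * usage + 0.2 * recency

def prune(memory: list[Procedure], keep: int) -> list[Procedure]:
    """Retain the `keep` highest-utility procedures (capacity maintenance)."""
    return sorted(memory, key=utility, reverse=True)[:keep]
```

A reliable, frequently used, recently touched procedure then outranks a stale, unreliable one at pruning time.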

$$ \mathrm{EU}(\mathrm{Proc}_i \mid o_t) = \underbrace{\mathrm{Rel}_i(o_t)\,\frac{\alpha_i}{\alpha_i+\beta_i}\,R_{\max}}_{\text{expected reward}} - \underbrace{\mathrm{Risk}_i(o_t)\,\frac{\beta_i}{\alpha_i+\beta_i}\,C_{\mathrm{fail}}}_{\text{failure cost}} + \underbrace{\lambda_{\mathrm{info}}\,H[\mathrm{Beta}(\alpha_i,\beta_i)]}_{\text{information gain}} \tag{eq:expected_utility} $$
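The scoring rule fits in a few lines of Python. This is a sketch: the defaults `r_max = 1.0`, `c_fail = 0.5`, and `lam_info = 0.1` follow the paper's worked example, and the entropy term is passed in precomputed as `info`:

```python
def expected_utility(rel: float, alpha: float, beta: float,
                     risk: float, info: float,
                     r_max: float = 1.0, c_fail: float = 0.5,
                     lam_info: float = 0.1) -> float:
    """EU(Proc_i | o_t): expected reward minus failure cost plus info bonus."""
    rho = alpha / (alpha + beta)        # posterior mean success rate
    reward = rel * rho * r_max          # expected reward term
    cost = risk * (1.0 - rho) * c_fail  # failure cost term
    bonus = lam_info * info             # information-gain term
    return reward - cost + bonus
```

With the fridge numbers from the case study (rel = 0.91, Beta(10, 3), risk = 0.19, info = 1.24) this yields 0.700 − 0.022 + 0.124 = 0.802.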

$$ \Psi_i \leftarrow \Psi_i \cup \Delta\Psi_i^{+} \cup \{\neg \psi \mid \psi \in \Delta\Psi_i^{-}\}, \qquad \pi_i \leftarrow \mathrm{Merge}(\pi_i, \Delta\pi_i), \qquad \Phi_i \leftarrow \Phi_i \cup \Delta\Phi_i $$
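A set-based sketch of this refinement update, with preconditions and postconditions as predicate strings. Representing negation with a `not ` prefix and `Merge` as an order-preserving append are simplifying assumptions:

```python
def merge(steps: list[str], delta: list[str]) -> list[str]:
    """Order-preserving merge of edited steps (simplified Merge)."""
    return steps + [s for s in delta if s not in steps]

def refine(preconds: set[str], steps: list[str], postconds: set[str],
           delta_pre_plus: set[str], delta_pre_minus: set[str],
           delta_steps: list[str], delta_post: set[str]):
    """Apply one contrastive refinement step to a procedure:
    add discriminative preconditions, negate failure-associated ones,
    merge step edits, and extend postconditions."""
    preconds = preconds | delta_pre_plus | {f"not {p}" for p in delta_pre_minus}
    return preconds, merge(steps, delta_steps), postconds | delta_post
```

For example, a failure pattern involving heat appliances would add `not heat_appliance(x)` to the cooling procedure's precondition set.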

$$ \text{Cooling posterior:} \quad \mathbb{E}[\rho] = \frac{10}{13} = 0.76923 \approx 0.769\;\checkmark \qquad \mathrm{Var}[\rho] = \frac{10 \cdot 3}{13^2 \cdot 14} = \frac{30}{2366} = 0.01268 \approx 0.0127\;\checkmark $$
$$ \text{After update:} \quad \mathbb{E}[\rho] = \frac{11}{14} = 0.78571 \approx 0.786\;\checkmark $$

$$ H[\mathrm{Beta}(\alpha,\beta)] = \ln B(\alpha, \beta) - (\alpha{-}1)\psi(\alpha) - (\beta{-}1)\psi(\beta) + (\alpha{+}\beta{-}2)\psi(\alpha{+}\beta) $$
$$ H[\mathrm{Beta}(10,3)] = \ln B(10,3) - 9\psi(10) - 2\psi(3) + 11\psi(13) = -4.0604 - 20.0902 - 1.8439 + 24.7531 = -1.2414 \text{ nats} $$
$$ H[\mathrm{Beta}(11,3)] = \ln B(11,3) - 10\psi(11) - 2\psi(3) + 12\psi(14) = -4.3307 - 22.3316 - 1.8439 + 27.4003 = -1.1059 \text{ nats} $$
$$ \Delta H = -1.2414 - (-1.1059) = -0.1355 \approx -0.136 \text{ nats}\;\checkmark $$
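Because all Beta parameters here are integers, the digamma function reduces to harmonic numbers, so the entropy formula can be evaluated with the standard library alone. This is a sketch valid only for positive integer arguments:

```python
import math

GAMMA = 0.5772156649015329  # Euler–Mascheroni constant

def digamma(n: int) -> float:
    """psi(n) for positive integers n: -gamma + H_{n-1}."""
    return -GAMMA + sum(1.0 / k for k in range(1, n))

def beta_entropy(a: int, b: int) -> float:
    """Differential entropy of Beta(a, b) in nats."""
    ln_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (ln_beta
            - (a - 1) * digamma(a)
            - (b - 1) * digamma(b)
            + (a + b - 2) * digamma(a + b))
```

As expected for a success update, `beta_entropy(11, 3)` is lower than `beta_entropy(10, 3)`: observing a success sharpens the posterior.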

$$ \mathrm{EU}(\mathrm{Proc}_{\text{fridge}} \mid o_t) = 0.91 \times 0.769 \times 1.0 - 0.19 \times (1{-}0.769) \times 0.5 + 0.1 \times 1.24 = 0.700 - 0.022 + 0.124 = 0.802\;\checkmark $$

      \caption{MACLA Runtime Procedure with Function Descriptions}
      \label{alg:macla-runtime}
      \begin{algorithmic}[1]
      \Require observation $o_0$, memory $\mathbb{M}$ (procedures, meta-procedures, indices), horizon $H$
      \State $\mathbf{h} \gets \phi(o_0)$ \Comment{Embed observation}
      \State $\mathcal{C} \gets \textsc{RetrieveCandidates}(\mathbf{h}, \mathbb{M})$ \Comment{Top-$k$ ANN search}
      \While{not \textsc{Terminal} and $t < H$}
      \ForAll{$c \in \mathcal{C}$}
      \State $\mathrm{EU}[c] \gets \textsc{ExpectedUtility}(c, o_t, \mathbb{M})$ \Comment{Compute Eq.~\ref{eq:expected_utility}}
      \EndFor
      \State $c^\star \gets \arg\max_{c \in \mathcal{C}} \mathrm{EU}[c]$
      \If{$\mathrm{EU}[c^\star] < \theta_{\mathrm{conf}}$}
      \State $(o_{t+1},y) \gets \textsc{ZeroShotStep}(o_t)$ \Comment{LLM generates action directly}
      \ElsIf{$c^\star$ is $\text{MP}_j$}
      \State $(o_{t+1},y) \gets \textsc{ExecuteMeta}(\text{MP}_j,\Theta_j,o_t)$ \Comment{Run with control policy}
      \State $(\alpha_j,\beta_j) \gets \textsc{UpdateBeta}((\alpha_j,\beta_j), y)$ \Comment{$\alpha{\gets}\alpha{+}y$, $\beta{\gets}\beta{+}(1{-}y)$}
      \Else \Comment{$c^\star$ is atomic $\text{Proc}_i$}
      \If{$\textsc{CheckPre}(\Psi_i,o_t)$} \Comment{Verify preconditions match $o_t$}
      \State $(o_{t+1},y) \gets \textsc{ExecuteProc}(\pi_i,o_t)$ \Comment{Instantiate \& execute $\pi_i$}
      \State $y \gets y \land \textsc{CheckPost}(\Phi_i,o_{t+1})$ \Comment{Verify postconditions in $o_{t+1}$}
      \Else
      \State $(o_{t+1},y) \gets \textsc{ZeroShotStep}(o_t)$ \Comment{Preconditions failed, fallback}
      \EndIf
      \State $(\alpha_i,\beta_i) \gets \textsc{UpdateBeta}((\alpha_i,\beta_i), y)$
      \State $\textsc{RecordContext}(\mathcal{S}_i,\mathcal{F}_i, o_t, y)$ \Comment{Add to success/fail sets}
      \EndIf
      \If{$\textsc{RefineTrigger}(c^\star)$} \Comment{If $|\mathcal{S}|,|\mathcal{F}| \geq 3$}
      \State $\textsc{ContrastiveRefine}(c^\star)$ \Comment{LLM compares $\mathcal{S}$ vs. $\mathcal{F}$ (§\ref{sec:contrastive})}
      \EndIf
      \State $\mathbf{h} \gets \phi(o_{t+1})$;\; $\mathcal{C} \gets \textsc{RetrieveCandidates}(\mathbf{h}, \mathbb{M})$;\; $t \gets t+1$
      \EndWhile
      \If{$\textsc{EligibleForMeta}(\text{trace})$} \Comment{If $\geq$3 procs in stable order}
      \State $\textsc{ExtractOrRefineMeta}(\text{trace}, \mathbb{M})$ \Comment{Create/update meta-proc}
      \EndIf
      \State $\textsc{PruneAndMaintain}(\mathbb{M})$ \Comment{Remove low-utility via Eq.~\ref{eq:utility}}
      \end{algorithmic}
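The two bookkeeping primitives of this loop, the conjugate Beta update and the confidence-gated dispatch, can be sketched as follows. The callables `eu`, `zero_shot`, and `execute` are stand-ins for the subroutines named in the pseudocode:

```python
THETA_CONF = 0.4  # confidence threshold from the hyperparameter table

def update_beta(alpha: float, beta: float, y: int) -> tuple:
    """UpdateBeta: alpha <- alpha + y, beta <- beta + (1 - y)."""
    return alpha + y, beta + (1 - y)

def dispatch(candidates, eu, zero_shot, execute):
    """One iteration of the selection loop: take the arg-max-EU candidate,
    falling back to direct LLM generation when confidence is too low."""
    best = max(candidates, key=eu)
    if eu(best) < THETA_CONF:
        return zero_shot()   # LLM generates the action directly
    return execute(best)     # run the retrieved (meta-)procedure
```

A cooling success at t6 is then simply `update_beta(10, 3, 1)`, which returns the Beta(11, 3) posterior used in the case study.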
| Method | WebShop | InterCodeSQL | TravelPlanner | ALFWorld Seen | ALFWorld Unseen | Avg. |
|---|---|---|---|---|---|---|
| Prompt-based Methods | | | | | | |
| GPT-4 [1] | 63.2 | 38.5 | 71.9 | 42.9 | 38.1 | 50.9 |
| GPT-3.5-Turbo [10] | 62.4 | 37.8 | - | 7.9 | 10.5 | 29.7 |
| Llama-2-7B [18] | 17.9 | 4.0 | - | 0.0 | 0.0 | 5.5 |
| Outcome Refinement Methods | | | | | | |
| Llama-2-7B + SFT [2] | 60.2 | 54.9 | - | 60.0 | 67.2 | 60.6 |
| Llama-2-7B + RFT-PPO [13] | 64.2 | 52.4 | - | 22.1 | 29.1 | 42.0 |
| Llama-2-7B + RFT-CR [30] | 63.6 | 56.3 | - | 62.9 | 66.4 | 62.3 |
| Llama-2-7B + ETO [17] | 67.4 | 57.2 | - | 68.6 | 72.4 | 66.4 |
| Process Refinement Methods | | | | | | |
| Llama-2-7B + Step-PPO [22] | 64.0 | 60.2 | - | 65.7 | 69.4 | 64.8 |
| Llama-2-7B + IPR [22] | 71.3 | 61.3 | - | 70.3 | 74.7 | 69.4 |
| Claude-3.5-Sonnet † [4] | - | - | 65.5 | 82.5 | 74.7 | 74.2 |
| Qwen2.5-72B † [4] | - | - | 63.8 | 85.7 | 77.2 | 75.6 |
| Llama-2-7B + MACLA | 70.2 | 59.3 | 83.3 | 87.2 | 90.3 | 78.1 |
| Config. | Bayes. | Contr. | Meta | Ontol. | Seen | Unseen |
|---|---|---|---|---|---|---|
| Full MACLA | ✓ | ✓ | ✓ | ✓ | 87.1 | 90.3 |
| w/o Bayesian | ✗ | ✓ | ✓ | ✓ | 79.4 | 81.2 |
| w/o Contrast. | ✓ | ✗ | ✓ | ✓ | 83.6 | 85.7 |
| w/o Meta | ✓ | ✓ | ✗ | ✓ | 81.2 | 78.4 |
| w/o Ontology | ✓ | ✓ | ✓ | ✗ | 82.8 | 84.1 |
| Method | Training (GPU-hrs) | WebShop | ALFWorld Unseen |
|---|---|---|---|
| IPR [22] | 44.8 | 71.3 | 74.7 |
| SFT [2] | 8.0 | 60.2 | 67.2 |
| ETO [17] | 20.0 | 67.4 | 72.4 |
| MACLA | 0.016 | 70.2 | 90.3 |
| Speedup vs IPR | 2,800× | - | +15.6 pts |
| Configuration | Seen | Unseen | Proc. Count | Meta Count | Reuse Rate | LLM Calls |
|---|---|---|---|---|---|---|
| Full MACLA | 87.2 | 90.3 | 187 | 43 | 78% | 6.2 |
| w/o Bayesian Selection | 79.4 | 81.2 | 189 | 41 | 62% | 8.4 |
| w/o Contrastive | 83.6 | 85.7 | 201 | 39 | 71% | 6.8 |
| w/o Meta-Procedures | 81.2 | 78.4 | 193 | 0 | 65% | 9.1 |
| w/o Ontology | 82.8 | 84.1 | 185 | 42 | 74% | 6.5 |
| Max Capacity (Proc/Meta) | Actual Proc. | Meta Proc. | Seen | Unseen | Avg α/(α+β) |
|---|---|---|---|---|---|
| 25 / 5 | 25 | 5 | 68.3 | 64.1 | 0.61 |
| 50 / 10 | 50 | 10 | 76.5 | 74.2 | 0.68 |
| 100 / 20 | 98 | 18 | 83.1 | 85.6 | 0.74 |
| 150 / 35 | 143 | 31 | 86.4 | 88.7 | 0.77 |
| 200 / 50 (Default) | 187 | 43 | 87.2 | 90.3 | 0.79 |
| 300 / 75 | 203 | 47 | 87.1 | 90.1 | 0.79 |
| Benchmark | Perf. | Proc. Used | Reuse Rate | Avg α/(α+β) | Meta Hit | Proc. Len |
|---|---|---|---|---|---|---|
| ALFWorld-Seen | 87.2 | 34±8 | 78% | 0.81 | 42% | 4.2 |
| ALFWorld-Unseen | 90.3 | 28±6 | 76% | 0.79 | 38% | 4.1 |
| TravelPlanner | 83.3 | 41±12 | 72% | 0.75 | 51% | 6.3 |
| WebShop | 70.2 | 38±9 | 69% | 0.72 | 35% | 5.1 |
| InterCodeSQL | 59.3 | 52±18 | 51% | 0.64 | 18% | 2.8 |
Case study trace: 'Put chilled lettuce on counter'. Each timestep lists the LLM, Bayesian Selector, Memory System, Contrastive Refiner, and I/O Summary activity; the Contrastive Refiner is idle until t9.

t0 ★ (planning)
- LLM: Parse task 'Put chilled lettuce on counter'; extract verb=put, modifier=chilled, object=lettuce; infer compound goal requiring object_cooling → object_placement composition. Recognize 'chilled' as precondition trigger.
- Bayesian Selector: Retrieve top-5 goal-relevant procedures via FAISS (2.1 ms). Compute EU for candidates: EU_cooling = 0.78, EU_placement = 0.82. Detect dependency: placement requires cooled(object) precondition. Order: cooling before placement.
- Memory System: Query procedural memory with embedding φ('chilled lettuce'). Retrieve object_cooling (10 matches from training: 7 potato, 2 apple, 1 tomato) and object_placement (8 matches). Load precondition dependency graph: place ⇒ cooled.
- I/O Summary: Input: task description. Output: hierarchical plan [navigate → take → cool → place]. LLM tokens: 347 prompt + 89 completion.

t1
- LLM: Observation: 'You are in the middle of a room. Looking quickly around you, you see [...] a countertop 2, where lettuce 2 is located.' Parse spatial context; identify lettuce location = countertop 2. Extract entities: {lettuce 2, countertop 2}.
- Bayesian Selector: Select navigate_to_object procedure (template-based, no Bayesian ranking needed). Relevance score: sim(φ(o₁), e_nav) = 0.91. Confidence 0.7 exceeds the θ_conf = 0.4 threshold; proceed.
- Memory System: Retrieve atomic navigation pattern from procedure library: go to ⟨location⟩. Instantiate with ⟨location⟩ = countertop 2 using current observation entities. No LLM call required (rule-based substitution).
- I/O Summary: Action: go to countertop 2. Environment: 'You arrive at loc 5. On the countertop 2, you see lettuce 2.'

t2
- LLM: Observation: 'On the countertop 2, you see a lettuce 2'; confirm object visibility. Retrieve procedure: object_retrieval (pick-up skill). Verify preconditions: at(countertop 2) ∧ visible(lettuce 2) ∧ ¬holding(·), all satisfied.
- Bayesian Selector: No ranking needed; deterministic action from procedure template. Update context: holding(object) = False → True (predicted state change for next step).
- Memory System: Instantiate: take ⟨object⟩ from ⟨location⟩ → take lettuce 2 from countertop 2. Record intermediate state for trajectory segmentation.
- I/O Summary: Action: take lettuce 2 from countertop 2. Environment: 'You pick up lettuce 2 from countertop 2.' State: holding(lettuce 2).

t3
- LLM: Observation: 'You are carrying lettuce 2.' Cooling subgoal active. Verify precondition: ¬cooled(lettuce 2). Identify required appliance type: refrigeration device.
- Bayesian Selector: Compute EU for appliance choices: EU_fridge = 0.83 (from α = 10, β = 3, ρ̂ = 0.769, risk = 0.19), EU_freezer = 0.58 (from α = 4, β = 6, ρ̂ = 0.40, risk = 0.37). Select fridge_cooling variant (higher posterior).
- Memory System: Retrieve object_cooling procedure; Bayesian posterior: P(ρ|D) = Beta(10, 3), E[ρ] = 10/13 = 0.769, Var[ρ] = 30/2366 ≈ 0.0127. Load 5-step template sequence.
- I/O Summary: Subgoal: execute cooling branch. Action: go to fridge 1. Environment: 'You arrive at loc 12. The fridge 1 is closed.'

t4
- LLM: Observation: 'The fridge 1 is closed.' Recognize container-access precondition. Retrieve container_opening subprocedure (embedded in cooling template).
- Bayesian Selector: Validate cooling procedure: posterior passes confidence threshold (E[ρ] = 0.77 > θ_conf = 0.4). Recompute EU: 0.91 × 0.769 × 1.0 − 0.19 × 0.231 × 0.5 = 0.678 > 0.4; proceed.
- Memory System: Instantiate opening action: open ⟨container⟩ → open fridge 1. Track appliance state: open(fridge 1) = False → True.
- I/O Summary: Action: open fridge 1. Environment: 'You open the fridge 1. The fridge 1 is open. In it, you see nothing.'

t5
- LLM: Observation: 'The fridge 1 is open.' Execute core cooling action. Verify all preconditions met: holding(lettuce 2) ∧ open(fridge 1) ∧ at(fridge 1).
- Bayesian Selector: No re-ranking; executing within selected procedure template. Monitor for failure signals (none detected).
- Memory System: Instantiate cooling action: cool ⟨object⟩ with ⟨appliance⟩ → cool lettuce 2 with fridge 1. This is the critical learning step for the Bayesian update.
- I/O Summary: Action: cool lettuce 2 with fridge 1. Environment: 'You cool the lettuce 2 using the fridge 1.' State: cooled(lettuce 2) = True.

t6
- LLM: Observation: 'You have cooled lettuce 2.' Update world state: cooled(lettuce 2) = True, holding(lettuce 2) = True. Container cleanup: execute closing subprocedure.
- Bayesian Selector: Bayesian update (cooling success): (α, β) ← (10 + 1, 3 + 0) = (11, 3). New posterior: E[ρ] = 11/14 ≈ 0.786 (+1.7% improvement). Compute information gain: ΔH = H[Beta(10, 3)] − H[Beta(11, 3)] = 0.136.
- Memory System: Mark cooling procedure success; store context tuple (o_init, π_exec, o_term) in success set S_i. Intermediate reward signal: r_cool = +0.3 (step-level credit). Check co-occurrence with pending placement goal.
- I/O Summary: Action: close fridge 1. Environment: 'You close the fridge 1.' Transition: cooling subgoal complete; return to placement goal.

t7
- LLM: Observation: 'You are carrying cooled lettuce 2.' Navigate to target location. Cooling precondition now satisfied: cooled(lettuce 2) = True. Activate placement subgoal.
- Bayesian Selector: Retrieve object_placement procedure. Recompute EU with updated context: relevance = 0.94 (high similarity to placement scenarios), ρ̂ = 0.80 (from Beta(8, 2)), risk = 0.15, info-gain = 0.21 nats. Total EU = 0.94 × 0.80 × 1.0 − 0.15 × 0.20 × 0.5 + 0.1 × 0.21 = 0.752 + 0.021 = 0.773.
- Memory System: Instantiate placement template with navigation: go to ⟨location⟩ → go to countertop 2. Precondition check passes: cooled(lettuce 2) ∧ holding(lettuce 2) ∧ exists(countertop 2).
- I/O Summary: Action: go to countertop 2. Environment: 'You arrive at loc 5. On countertop 2, you see nothing.' (The lettuce is held, not on the counter.)

t8
- LLM: Observation: 'You are at countertop 2, holding cooled lettuce 2.' Execute terminal placement action. All preconditions verified: cooled ∧ holding ∧ at_target.
- Bayesian Selector: No additional ranking needed; final action of placement procedure. Predicted outcome: task success with p = 0.80 (placement posterior).
- Memory System: Instantiate: put ⟨object⟩ on ⟨location⟩ → put lettuce 2 on countertop 2. Prepare for end-of-episode state capture.
- I/O Summary: Action: put lettuce 2 on countertop 2. Environment: 'You put lettuce 2 on countertop 2.' Result: task SUCCESS.

t9 ★ (consolidation)
- LLM: Generate symbolic summary: 'Completed two-stage compound task: cooling-then-placement via fridge 1 on lettuce 2.' Segment trajectory into 2 procedures: τ_cool = [t3, t4, t5, t6] (4 actions, success), τ_place = [t7, t8] (2 actions, success). Extract precondition pattern: 'chilled' ⇒ cooling required.
- Bayesian Selector: Bayesian update (placement success): (α, β) ← (8 + 1, 2 + 0) = (9, 2). New posterior: E[ρ] = 9/11 ≈ 0.818 (+1.8% improvement). Export posteriors: cooling Beta(11, 3), placement Beta(9, 2). Calibration score: |E[ρ] − empirical| = 0.02 (well calibrated). Total entropy reduction: ΔH_total = 0.136 + 0.092 = 0.228 nats.
- Memory System: Meta-procedural learner analyzes co-occurrence patterns across the last 15 episodes: cooling → placement observed in 3 distinct configurations (potato-fridge-table, apple-fridge-shelf, lettuce-fridge-countertop). Pattern frequency 3/15 = 20% exceeds threshold (θ_meta = 15%). Create abstract meta-procedure meta_cool_and_place_object with composition policy: if 'chilled' ∈ task_modifiers then cooling → placement, else placement only. Store in M_meta with initial success count = 3.
- Contrastive Refiner: Contrastive analysis: extract success features {chilled, fridge, cooled, refrigerator_device}. Initialize success context for future contrastive refinement when failures accumulate (currently |S_cooling| = 11, |F_cooling| = 3; refinement threshold min(|S|, |F|) ≥ 3 ✓; will trigger discriminative pattern extraction on the next failure). Potential discriminators if future failures involve 'warm' or 'oven': refine precondition to cooling ⇒ cold_appliance ∧ ¬heat_appliance.
- I/O Summary: Learning summary: (1) Bayesian priors updated for 2 procedures; (2) new meta-procedure stored; (3) contrastive learning primed. LLM tokens: 412 prompt + 156 completion. Episode stats: 8 actions, 2 LLM calls, 18.3 s wall-clock time.
| Parameter | Value | Selection Method | Theoretical Justification |
|---|---|---|---|
| θ_dup | 0.85 | Empirical | None provided |
| θ_conf | 0.4 | Empirical | None provided |
| θ_meta | 15% | Empirical | None provided |
| n_s^min, n_f^min | 3 | Heuristic | Minimal statistical significance |
| λ_r, λ_f, λ_t | 0.5, 0.3, 0.2 | Grid search | Constraint: Σλ_i = 1 |
| λ_info | 0.1 | Empirical | None provided |
| K_fail | 15 | Empirical | None provided |
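For reference, these hyperparameters can be bundled into a single configuration object. This is a sketch: the field names are illustrative, with values copied from the table above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MACLAConfig:
    """MACLA hyperparameters (field names are illustrative)."""
    theta_dup: float = 0.85            # duplicate-detection threshold
    theta_conf: float = 0.4            # confidence threshold for procedure use
    theta_meta: float = 0.15           # meta-procedure co-occurrence threshold
    n_min: int = 3                     # min successes and failures (n_s, n_f)
    lambdas: tuple = (0.5, 0.3, 0.2)   # utility weights (lambda_r, _f, _t)
    lambda_info: float = 0.1           # information-gain weight
    k_fail: int = 15                   # failure-window size

    def validate(self) -> None:
        assert abs(sum(self.lambdas) - 1.0) < 1e-9, "utility weights must sum to 1"
```

Freezing the dataclass keeps the configuration immutable across an experiment, and `validate()` enforces the grid-search constraint Σλ_i = 1.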


      References


      [agentboard] Ma, Chang, Zhang, Junlei, Zhu, Zhihao, Yang, Cheng, Yang, Yujiu, Jin, Yaohui, Lan, Zhenzhong, Kong, Lingpeng, He, Junxian. (2025). AgentBoard: an analytical evaluation board of multi-turn LLM agents. Proceedings of the 38th International Conference on Neural Information Processing Systems.

[xiong2024watch] Xiong, Weimin, Song, Yifan, Zhao, Xiutian, Wu, Wenhao, Wang, Xun, Wang, Ke, Li, Cheng, Peng, Wei, Li, Sujian. (2024). Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2024.emnlp-main.93.

      [yao2023react] Yao, Shunyu, Zhao, Jeffrey, Yu, Dian, Du, Nan, Shafran, Izhak, Narasimhan, Karthik, Cao, Yuan. (2023). React: Synergizing reasoning and acting in language models. International Conference on Learning Representations (ICLR).


      [yin2023] Yin, Da, Brahman, Faeze, Ravichander, Abhilasha, Chandu, Khyathi, Chang, Kai-Wei, Choi, Yejin, Lin, Bill Yuchen. (2024). Agent Lumos: Unified and Modular Training for Open-Source Language Agents. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2024.acl-long.670.

[zeng2023] Zeng, Aohan, Liu, Mingdao, Lu, Rui, Wang, Bowen, Liu, Xiao, Dong, Yuxiao, Tang, Jie. (2024). AgentTuning: Enabling Generalized Agent Abilities for LLMs. Findings of the Association for Computational Linguistics: ACL 2024. doi:10.18653/v1/2024.findings-acl.181.


      [voyager] Wang, Guanzhi, Xie, Yuqi, Jiang, Yunfan, Mandlekar, Ajay, Xiao, Chaowei, Zhu, Yuke, Fan, Linxi, Anandkumar, Anima. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv preprint arXiv:2305.16291.

[memgpt] Packer, Charles, Fang, Vivian, Patil, Shishir G., Lin, Kevin, Wooders, Sarah, Gonzalez, Joseph E. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv preprint arXiv:2310.08560.


      [zhong2024memorybank] Zhong, Wei, Guo, Liang, Gao, Qian, Ye, Haotian, Wang, Yuxin. (2024). MemoryBank: Enhancing Large Language Models with Long-term Memory. AAAI.

      [fang2025memp] Fang, Runnan, Liang, Yuan, Wang, Xiaobin, Wu, Jialong, Qiao, Shuofei, Xie, Pengjun, Huang, Fei, Chen, Huajun, Zhang, Ningyu. (2025). Memp: Exploring agent procedural memory. arXiv preprint arXiv:2508.06433.

[hu2024hiagent] Hu, Mingyuan, Chen, Tianhong, Chen, Qian, others. (2024). HiAgent. CVPR Workshops.

      [xu2025amem] Wujiang Xu, Kai Mei, Hang Gao, Juntao Tan, Zujie Liang, Yongfeng Zhang. (2025). A-MEM: Agentic Memory for LLM Agents.

[yu2025memagent] Yu, Hang, Chen, Tianyu, Feng, Jiale, others. (2025). MemAgent. arXiv preprint arXiv:2507.02259.

[liang2025sage] Liang, Xuechen, Tao, Meiling, Xia, Yinghui, others. (2025). SAGE. Neurocomputing.

      [wang2024awm] Wang, Zora Zhiruo, Mao, Jiayuan, Fried, Daniel, Neubig, Graham. (2024). Agent workflow memory. arXiv preprint arXiv:2409.07429.

      [chen2024automanual] Chen, Ming, Li, Yifei, Yang, Yao, others. (2024). AutoManual: Constructing Instruction Manuals by LLM Agents via Interactive Environmental Learning. arXiv preprint arXiv:2405.16247.

      [tot] Yao, Shunyu, Yu, Dian, Zhao, Jeffrey, Shafran, Izhak, Griffiths, Tom, Cao, Yuan, Narasimhan, Karthik. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601.

      [got] Besta, Maciej, Blach, Nico, others. (2023). Graph of Thoughts: Solving Elaborate Problems with Large Language Models. arXiv:2308.09687.

      [selfrefine] Madaan, Aman, Tandon, Niket, Yazdanbakhsh, Amir, Hase, Peter, others. (2023). Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS.

      [memprompt] Xu, Peng, others. (2023). MemPrompt: Memory-assisted Prompt Editing to Reduce Hallucination. arXiv:2305.14739.

      [expel] Liu, Howard, others. (2023). ExpeL: LLM Agents with Experience Learning. arXiv:2308.10144.

      [webgpt] Nakano, Reiichiro, others. (2021). WebGPT: Browser-assisted Question Answering with Human Feedback. NeurIPS.

[alfworld] Shridhar, Mohit, Yuan, Xingdi, Côté, Marc-Alexandre, Bisk, Yonatan, Trischler, Adam, Hausknecht, Matthew. (2021). ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. ICLR.

      [webshop] Yao, Shunyu, others. (2022). WebShop: Towards Scalable Real-World Web Interaction with LLM Agents. NeurIPS Datasets and Benchmarks.

      [saycan] Ahn, Michael, Brohan, Anthony, others. (2022). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. Conference on Robot Learning (CoRL).

      [achiam2023gpt] Achiam, Josh, Adler, Steven, Agarwal, Sandhini, Ahmad, Lama, Akkaya, Ilge, Aleman, Florencia Leoni, Almeida, Diogo, Altenschmidt, Janko, Altman, Sam, Anadkat, Shyamal, others. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.

      [ouyang2022training] Ouyang, Long, Wu, Jeffrey, Jiang, Xu, Almeida, Diogo, Wainwright, Carroll, Mishkin, Pamela, Zhang, Chong, Agarwal, Sandhini, Slama, Katarina, Ray, Alex, others. (2022). Training language models to follow instructions with human feedback. Advances in neural information processing systems.

      [touvron2023llama] Touvron, Hugo, Martin, Louis, Stone, Kevin, Albert, Peter, Almahairi, Amjad, Babaei, Yasmine, Bashlykov, Nikolay, Batra, Soumya, Bhargava, Prajjwal, Bhosale, Shruti, others. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

      [chen2023fireact] Chen, Baian, Shu, Chang, Shareghi, Ehsan, Collier, Nigel, Narasimhan, Karthik, Yao, Shunyu. (2023). Fireact: Toward language agent fine-tuning. arXiv preprint arXiv:2310.05915.

      [schulman2017proximal] Schulman, John, Wolski, Filip, Dhariwal, Prafulla, Radford, Alec, Klimov, Oleg. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

      [zhang2023cumulative] Zhang, Yifan, Yang, Jingqin, Yuan, Yang, Yao, Andrew Chi-Chih. (2023). Cumulative reasoning with large language models. arXiv preprint arXiv:2308.04371.

[song2024trial] Song, Yifan, Yin, Da, Yue, Xiang, Huang, Jie, Li, Sujian, Lin, Bill Yuchen. (2024). Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2024.acl-long.409.

[song2024agentbank] Song, Yifan, Xiong, Weimin, Zhao, Xiutian, Zhu, Dawei, Wu, Wenhao, Wang, Ke, Li, Cheng, Peng, Wei, Li, Sujian. (2024). AgentBank: Towards Generalized LLM Agents via Fine-Tuning on 50000+ Interaction Trajectories. Findings of the Association for Computational Linguistics: EMNLP 2024. doi:10.18653/v1/2024.findings-emnlp.116.

      [xie2024travelplanner] Xie, Jian, Zhang, Kai, Chen, Jiangjie, Zhu, Tinghui, Lou, Renze, Tian, Yuandong, Xiao, Yanghua, Su, Yu. (2024). Travelplanner: A benchmark for real-world planning with language agents. arXiv preprint arXiv:2402.01622.

      [yang2306intercode] Yang, John, Prabhakar, Akshara, Narasimhan, Karthik, Yao, Shunyu. (2023). InterCode: standardizing and benchmarking interactive coding with execution feedback. Proceedings of the 37th International Conference on Neural Information Processing Systems.

      [shinn2023reflexion] Shinn, Noah, Cassano, Federico, Gopinath, Ashwin, Narasimhan, Karthik, Yao, Shunyu. (2023). Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems.

      [reimers2019sentence] Reimers, Nils, Gurevych, Iryna. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.

      [zhang2024memorysurvey] Zhang, Zeyu, Dai, Quanyu, Bo, Xiaohe, Ma, Chen, Li, Rui, Chen, Xu, Zhu, Jieming, Dong, Zhenhua, Wen, Ji-Rong. (2025). A Survey on the Memory Mechanism of Large Language Model-based Agents. ACM Trans. Inf. Syst.. doi:10.1145/3748302.

      [liu2025foundationagents] Liu, Bowen, Li, Xuan, Zhang, Jing, others. (2025). Advances and Challenges in Foundation Agents: From Brain-inspired Intelligence to Evolutionary, Collaborative, and Safe Systems.

[li2025memos] Li, Zhen, Song, Shijie, Xi, Chen, others. (2025). MemOS. Proceedings of the Web Conference (Companion).