
Learning Hierarchical Procedural Memory for LLM Agents through Bayesian Selection and Contrastive Refinement

Saman Forouzandeh, Wei Peng, Parham Moradi, Xinghuo Yu, Mahdi Jalili

Abstract

We present MACLA, a framework that decouples reasoning from learning by maintaining a frozen large language model (LLM) while performing all adaptation in an external hierarchical procedural memory. MACLA extracts reusable procedures from trajectories, tracks reliability via Bayesian posteriors, selects actions through expected-utility scoring, and refines procedures by contrasting successes with failures. Across four benchmarks (ALFWorld, WebShop, TravelPlanner, InterCodeSQL), MACLA achieves 78.1% average performance, outperforming all baselines. On ALFWorld unseen tasks, MACLA reaches 90.3% with +3.1% positive generalization. The system constructs memory in 56 seconds (2,800× faster than the state-of-the-art LLM parameter-training baseline) and compresses 2,851 trajectories into 187 procedures (a 15:1 ratio). Experimental results demonstrate that structured external memory with Bayesian selection and contrastive refinement enables sample-efficient, interpretable, and continually improving agents without LLM parameter updates. Code is publicly available at https://github.com/S-Forouzandeh/MACLA-LLM-Agents-AAMAS-Conference.


School of Engineering, Royal Melbourne Institute of Technology University

Melbourne, VIC, Australia

Early LLM agents used prompt-based planning [26] and self-critique [15], but lack persistent 'how-to' procedures: when tasks are similar but not identical, agents must re-plan from scratch, increasing cost and latency. Fine-tuning approaches [2, 27, 29] adapt agents via supervised learning or RLHF, but typically treat entire trajectories as single units weighted by terminal success/failure, neglecting rich intermediate steps. In practice, failed trajectories often contain correct sub-steps (e.g., 'successfully navigating and retrieving an egg, but failing to boil it' [16]), while successful ones may include suboptimal actions that accidentally cancel out. Recent work [22] addresses this via step-level rewards, but requires repeated policy training on densely labeled data, incurring substantial computational cost.

We introduce MACLA (Memory-Augmented Contrastive Learning Agent), a framework that disentangles reasoning from learning by coupling a frozen LLM with a structured external procedural memory (Figure 1). Unlike fine-tuning approaches, where reasoning and adaptation are entangled within billions of parameters, MACLA fixes the LLM as a stable semantic reasoner responsible for trajectory segmentation, abstraction, and action generation. All learning occurs externally through explicit, interpretable memory operations: maintaining human-readable procedures, updating Bayesian posteriors, and refining preconditions through contrastive analysis. MACLA operates through three core mechanisms:

(1) Bayesian procedure selection: maintains Beta posteriors Beta(α_i, β_i) over procedure success rates and ranks candidates via expected-utility scoring that balances contextual relevance, success probability, failure risk, and information gain, providing principled exploration-exploitation. (2) Contrastive refinement: compares successful and failed execution contexts to tighten preconditions, repair action sequences, and refine postconditions once procedures accumulate sufficient evidence (at least a minimum number of successes and failures), progressively improving procedure quality through memory edits rather than gradient updates. (3) Meta-procedural learning: composes frequently co-occurring procedures into hierarchical 'playbooks' with conditional control policies (continue, skip, repeat, abort) for long-horizon tasks, enabling strategic reuse beyond atomic skills.

This architecture yields sample-efficient, interpretable agents with human-readable procedural knowledge, closed-form utility computation, and minimal LLM usage. Specifically, this work contributes:

· Online procedural memory adaptation: Continual updates to procedural and meta-procedural memory during and


Keywords

Memory-augmented agents, Procedural memory, Bayesian decision making, Contrastive learning, LLM agents

Introduction

Large language model (LLM) agents can solve complex, interactive tasks such as web shopping [25] and embodied AI housekeeping [9], by transforming natural-language instructions into sequences of environment actions [26]. In these settings, agents navigate step-by-step through partially observable environments to pursue subgoals and ultimately complete the task [9, 22]. The resulting trajectory is the ordered record of an episode's interaction, typically written as (T, A, O, R), where T represents a task to complete, A are actions, O stands for observations of the outcomes of the corresponding actions, and R records step-level outcomes or rewards. Trajectories thus capture the full decision process, not merely terminal success or failure, and provide dense supervision for how an agent progresses through a task [16, 22]. When a new task arrives, the agent synthesizes an appropriate trajectory (that is, a step-by-step plan and its execution) to achieve the goal in the current context, deciding which information to gather, which tools to invoke, and which subroutines to chain in order to achieve completion [25, 26].

Proc. of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), C. Amato, L. Dennis, V. Mascardi, J. Thangarajah (eds.), May 25 - 29, 2026, Paphos, Cyprus . © 2026 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). This work is licenced under the Creative Commons Attribution 4.0 International (CC-BY 4.0) licence.


Figure 1: Comparison between existing LLM-based trajectory learning (top) and the proposed memory-augmented contrastive learning agent (MACLA, bottom). Existing methods train trajectories (T, A, O, R) (Task, Action, Observation, Reward) into LLM parameters through post-training (fine-tuning and/or RLHF), whereas MACLA constructs procedural and meta-procedural memory externally through frozen-LLM abstraction, segmentation, Bayesian selection, and contrastive refinement. Besides learning during memory construction, MACLA enables inference-time learning, in which outputs are verified in the task environment and the feedback is used for contrastive refinement of the retrieved memories. Meta-procedural learning enables the composition policy among procedures to be learned.

  • after episodes, enabling adaptation without weight updates, in contrast to offline LLM post-training approaches [17, 22, 29] that remain static at inference.
  • Reasoning/learning decoupling: A frozen LLM handles parsing and abstraction, while all improvements occur in an external, structured procedural memory, avoiding the computational cost and catastrophic-forgetting risks of parameter fine-tuning.
  • Bayesian uncertainty-aware selection: A principled procedure-selection module that maintains Beta posteriors over success rates, with closed-form expected-utility objectives balancing relevance, success probability, failure risk, and information gain.
  • Contrastive procedural refinement: An algorithm that leverages paired successes and failures to tighten preconditions, repair action schemas, and refine postconditions of stored procedures without requiring expert demonstrations.
  • Hierarchical meta-procedural composition: Automatic discovery and maintenance of conditional playbooks with control policies (skip, repeat, abort) for long-horizon tasks, enabling compositional generalization.

We evaluate MACLA across four benchmarks (ALFWorld [9], WebShop [25], TravelPlanner [21], InterCodeSQL [24]), achieving 78.1% average performance, the highest among all methods, including those using models 10× larger (Table 1). On ALFWorld [16], MACLA reaches 87.2% on seen and 90.3% on unseen tasks, with a positive generalization gap (+3.1%) indicating compositional transfer rather than overfitting. The system achieves this with only 0.016 GPU-hours for one-time memory construction, 2,800× faster than the state-of-the-art LLM parameter-training baseline [22], which requires 44.8 GPU-hours of iterative training, while simultaneously producing human-interpretable procedural knowledge.

Related Work

LLM agents have advanced rapidly in reasoning and decision-making, enabling multi-step interaction in embodied and web-based environments. Early frameworks such as ReAct [26] and Reflexion [14] integrate reasoning and acting within the same loop, while trajectory-tuning methods [2, 22] fine-tune models using expert demonstrations. However, fine-tuning is computationally expensive, requires offline data collection and training cycles, and does not support true online adaptation at inference time. To overcome this issue, a line of research augments LLM agents with memory for continuous reasoning. Memory is a foundational component of language agents, supporting competence across multiple timescales, from transient working context to persistent long-term knowledge [6, 8, 31]. Research on memory for LLM agents can be usefully organized along two directions: where memory resides and what is stored. Along the first direction, methods such as MemGPT [11] and MemoryBank [32] use buffer-based systems to store conversational or episodic traces and retrieve them with embedding search and simple heuristics. Others, such as HiAgent [5], A-Mem [23], and MemAgent [28], use hierarchical designs that separate working buffers from episodic and long-term stores to relieve context pressure and improve persistence. Recently, SAGE [7] used reflective multi-agent controllers to curate these stores while controlling growth.

The second direction concerns what is stored. Many systems retain free-form text snippets such as notes, summaries, or dialogue chunks; these are easy to write but suffer from retrieval drift and weak compositionality as repositories scale [11, 32]. More structured artifacts appear as tuples and key-value frames (e.g., tool logs or entity/event graphs), which aid filtering but still lack executable semantics for reuse.
A growing line of work targets skills and procedures: agents capture reusable action patterns, tool workflows, and instruction-like steps across related tasks [3, 19, 20]. Memp [4] advances this view by treating procedural memory as a first-class object and studying its construction, retrieval, and update across domains. However, several key limitations remain: (1) it represents know-how largely as monolithic text (scripts or full trajectories) with heuristic retrieval and simple updates; (2) it lacks uncertainty-aware selection or a principled exploration-exploitation balance, preventing reasoning about the reliability or risk of retrieved memory; and (3) it lacks a mechanism to refine procedures from paired successes and failures or to abstract recurring patterns into meta-procedural compositions. In contrast, we represent experience as structured, hierarchical procedures with explicit preconditions, action schemas, and postconditions, enabling interpretable reuse, safe composition, and direct schema edits when evidence warrants change. The proposed approach enables the system to continuously adapt and improve.


Proposed Method

The key components of MACLA are described in detail below.

LLM-based Procedural Abstraction

The first stage transforms raw episodic trajectories into structured, reusable procedural knowledge. Given a trajectory

τ = {(o_t, a_t, r_t)}_{t=0}^{T}, consisting of textual observations o_t, primitive actions a_t, and rewards r_t, the frozen LLM L_θ receives the full trajectory and identifies semantically coherent segments that correspond to meaningful sub-tasks:

$$
\mathcal{L}_{\theta}(\tau) \;\rightarrow\; \{(\mathrm{seg}_k, d_k)\}_{k=1}^{K},
$$

where each segment k spans time steps [t_k^start, t_k^end] and is summarized by a description d_k. For each segment, MACLA constructs a structured procedure Proc_k = ⟨G_k, Ψ_k, π_k, Φ_k⟩, where G_k is a natural-language goal, Ψ_k are precondition patterns inferred from the observations before the segment, π_k is an abstracted action sequence, and Φ_k are postcondition patterns extracted from the final observations. This decomposition produces interpretable 'how-to' skills that can be invoked whenever their preconditions are met. To support retrieval and merging, each procedure is embedded into a semantic vector space using an encoder φ: e_k = φ([G_k; Ψ_k; Φ_k]) ∈ R^d. When a new procedure is created, it is compared to existing ones via cosine similarity, i* = arg max_i sim(e_k, e_i). If sim(e_k, e_{i*}) > θ_dup, the new procedure is merged into the existing one by expanding its condition sets; otherwise, a new entry is added. This process yields a continually growing procedural library M_proc = {(Proc_i, e_i, α_i, β_i)}_{i=1}^{N_p} that forms the foundation for later Bayesian selection and refinement.
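The duplicate-detection step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the dictionary layout, field names, and toy list embeddings are assumptions standing in for the encoder φ and the procedural library M_proc.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def add_procedure(library, proc, theta_dup=0.85):
    """Merge `proc` into its nearest neighbor if similarity exceeds
    theta_dup (the paper's duplicate threshold); otherwise append it.

    Each entry is a dict with illustrative keys:
    {"embedding": list, "preconds": set, "postconds": set,
     "alpha": float, "beta": float}.
    """
    if library:
        sims = [cosine(proc["embedding"], p["embedding"]) for p in library]
        i_star = max(range(len(sims)), key=lambda i: sims[i])
        if sims[i_star] > theta_dup:
            # Merge: expand the existing procedure's condition sets.
            library[i_star]["preconds"] |= proc["preconds"]
            library[i_star]["postconds"] |= proc["postconds"]
            return library
    library.append(proc)
    return library
```

With θ_dup = 0.85, two near-identical procedure embeddings collapse into one entry whose condition sets are the union of both, which is how the library stays compact as experience accumulates.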

Bayesian Reliability and Utility Selection

Given the procedural library, the agent must decide which procedure to execute for the current observation. Each procedure Proc 𝑖 maintains a Beta posterior over its success probability 𝜌 𝑖 ∈ [ 0 , 1 ] :

$$
p(\rho_i \mid \mathcal{D}_i) = \mathrm{Beta}(\alpha_i, \beta_i),
$$

where α_i and β_i accumulate successful and failed executions from history D_i. The posterior mean E[ρ_i] = α_i/(α_i + β_i) estimates current reliability, while the variance Var[ρ_i] = α_i β_i / ((α_i + β_i)^2 (α_i + β_i + 1)) quantifies epistemic uncertainty. For each candidate, we compute expected utility by integrating over the Beta posterior. Given utility U(ρ | o_t, i) = Rel_i(o_t) · ρ · R_max − Risk_i(o_t) · (1 − ρ) · C_fail + λ_info · I(ρ), the expected utility is:

$$
\mathrm{EU}(\mathrm{Proc}_i \mid o_t) = \int_{0}^{1} U(\rho \mid o_t, i)\, p(\rho \mid \mathcal{D}_i)\, d\rho.
$$

Exploiting E_{Beta(α, β)}[ρ] = α/(α + β) and E[1 − ρ] = β/(α + β), this simplifies to:

$$
\mathrm{EU}(\mathrm{Proc}_i \mid o_t) = \mathrm{Rel}_i(o_t) \cdot \frac{\alpha_i}{\alpha_i + \beta_i} \cdot R_{\max} \;-\; \mathrm{Risk}_i(o_t) \cdot \frac{\beta_i}{\alpha_i + \beta_i} \cdot C_{\mathrm{fail}} \;+\; \lambda_{\mathrm{info}} \cdot H[\mathrm{Beta}(\alpha_i, \beta_i)],
$$

where Rel_i(o_t) = cos(φ(o_t), e_i) is contextual similarity, Risk_i(o_t) is the fraction of past failures with similar contexts, and H[·] is differential entropy encouraging exploration. The selected procedure is:

$$
\mathrm{Proc}^{\star} = \arg\max_{i} \; \mathrm{EU}(\mathrm{Proc}_i \mid o_t),
$$

subject to confidence threshold θ_conf. If max_i EU(Proc_i | o_t) < θ_conf, the agent falls back to zero-shot LLM reasoning. This Bayesian selection mechanism balances exploitation (procedures with high α/(α + β)), risk aversion (avoiding contexts similar to past failures), and exploration (high-entropy procedures). The expected-utility formulation naturally handles the explore-exploit tradeoff: early in learning, high entropy dominates selection, while after sufficient evidence accumulates, expected reward becomes the primary driver.
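The closed-form scoring above can be sketched directly. One caveat: the paper's information-gain term is the differential entropy of the Beta posterior; to keep this sketch dependency-free, the posterior variance stands in as the uncertainty bonus (both shrink as evidence accumulates). All function and parameter names here are illustrative.

```python
def expected_utility(alpha, beta, rel, risk,
                     r_max=1.0, c_fail=0.5, lam_info=0.1):
    """Closed-form expected utility under a Beta(alpha, beta) posterior.

    rel  : contextual relevance Rel_i(o_t) in [0, 1]
    risk : fraction of similar past failure contexts, Risk_i(o_t)
    NOTE: posterior variance replaces the paper's entropy term I(rho).
    """
    n = alpha + beta
    p_succ = alpha / n                       # E[rho]
    p_fail = beta / n                        # E[1 - rho]
    var = alpha * beta / (n * n * (n + 1))   # epistemic uncertainty
    return rel * p_succ * r_max - risk * p_fail * c_fail + lam_info * var

def select_procedure(candidates, theta_conf=0.0):
    """Pick the highest-utility candidate; None triggers the zero-shot
    LLM fallback, mirroring the confidence-threshold check."""
    best = max(candidates, key=lambda c: expected_utility(**c))
    if expected_utility(**best) < theta_conf:
        return None
    return best
```

A procedure with many logged successes and a relevant context dominates an unreliable one, while a fresh Beta(1, 1) entry gets a larger variance bonus than a well-tested Beta(10, 10) entry, which is the explore-exploit behavior described above.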

Contrastive Refinement of Procedures

As experience accumulates, procedures with both successful and failed instances are subjected to contrastive refinement to improve their accuracy and robustness. For a procedure Proc_i with sets of successful and failed contexts S_i and F_i, the LLM performs discriminative comparison, D_i = ContrastiveExtract(S_i, F_i), identifying differences along three dimensions: (i) precondition patterns (ΔΨ_i^+ and ΔΨ_i^−) that distinguish successful from failed initial contexts, (ii) action discrepancies (Δπ_i) revealing missing or misordered actions, and (iii) postcondition mismatches (ΔΦ_i) that capture incomplete goal states. These discriminators drive explicit refinement operations:

$$
\Psi_i \leftarrow \big(\Psi_i \cup \Delta\Psi_i^{+}\big) \setminus \Delta\Psi_i^{-}, \qquad \pi_i \leftarrow \mathrm{Repair}(\pi_i, \Delta\pi_i), \qquad \Phi_i \leftarrow \mathrm{Refine}(\Phi_i, \Delta\Phi_i).
$$

When distinct execution modes are detected, the procedure is specialized into separate variants with inherited reliability priors. This process progressively tightens applicability conditions and action precision, yielding interpretable improvements purely through memory edits rather than gradient updates.
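The precondition-tightening step can be illustrated with a toy stand-in. The paper delegates ContrastiveExtract to the LLM; here a simple token-frequency contrast between success and failure contexts plays that role, so treat every name and threshold below as an assumption, not the paper's method.

```python
from collections import Counter

def contrastive_discriminators(successes, failures, min_gap=0.5):
    """Toy stand-in for ContrastiveExtract: find tokens whose frequency
    differs sharply between successful and failed contexts.

    successes, failures: lists of context strings (S_i and F_i).
    Returns (delta_pos, delta_neg): success- and failure-discriminating
    patterns, analogous to the paper's precondition deltas.
    """
    def freq(contexts):
        counts = Counter(tok for c in contexts for tok in set(c.split()))
        return {t: counts[t] / len(contexts) for t in counts}
    fs, ff = freq(successes), freq(failures)
    tokens = set(fs) | set(ff)
    delta_pos = {t for t in tokens if fs.get(t, 0) - ff.get(t, 0) >= min_gap}
    delta_neg = {t for t in tokens if ff.get(t, 0) - fs.get(t, 0) >= min_gap}
    return delta_pos, delta_neg

def refine_preconditions(preconds, delta_pos, delta_neg):
    """Tighten: require success-discriminating patterns, exclude the
    failure-discriminating ones (a memory edit, not a gradient update)."""
    return (set(preconds) | delta_pos) - delta_neg
```

Patterns common to successes (e.g., "open") are added as required preconditions, while failure markers (e.g., "closed") are excluded, tightening when the procedure may fire.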

Meta-procedural Composition

To extend reasoning beyond atomic skills, MACLA automatically discovers and learns meta-procedures: structured compositions of procedures that capture recurrent long-horizon strategies. When a sequence of procedures ⟨Proc_{i_1}, ..., Proc_{i_m}⟩ repeatedly leads to success under a common high-level goal, the agent abstracts it as MP_j = ⟨G_j^meta, Ψ_j^meta, {Proc_{i_1}, ..., Proc_{i_m}}, Θ_j⟩. Here, Θ_j denotes a lightweight control policy governing conditional transitions among sub-procedures based on the current observation and execution context, Θ_j(o_t, index) ∈ {continue, skip, repeat, abort}. This policy is distilled by analyzing successful traces, where the LLM identifies observation patterns that triggered each branch: for example, repeating when postconditions are unmet, skipping when preconditions already hold, or aborting when failures recur. Each meta-procedure maintains its own Beta success posterior p(σ_j | D_j) = Beta(α_j, β_j) and is refined periodically to add new branches, reorder sub-procedures, or prune redundant ones. Through these hierarchical compositions, MACLA acquires flexible 'playbooks' that encapsulate extended strategies with conditional logic.
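The control-policy semantics can be sketched as a small interpreter. The callables `policy` and `run_proc`, and the step budget, are illustrative assumptions; they stand in for Θ_j and for procedure execution via the frozen LLM.

```python
def execute_meta(sub_procs, policy, run_proc, obs, max_steps=20):
    """Run a meta-procedure's sub-procedures under a control policy.

    policy(obs, index) -> one of "continue", "skip", "repeat", "abort"
    run_proc(proc, obs) -> (new_obs, success)
    Returns (final_obs, overall_success).  max_steps bounds "repeat" loops.
    """
    i, steps = 0, 0
    while i < len(sub_procs) and steps < max_steps:
        decision = policy(obs, i)
        if decision == "abort":          # recurring failure detected
            return obs, False
        if decision == "skip":           # preconditions already satisfied
            i += 1
            continue
        obs, ok = run_proc(sub_procs[i], obs)
        steps += 1
        if decision == "repeat" and not ok:
            continue                     # retry the same sub-procedure
        i += 1
    return obs, i == len(sub_procs)
```

For instance, with sub-procedures ["open", "take", "close"] and a policy that skips any step whose effect is already visible in the observation, an episode that starts with the drawer open executes only "take" and "close".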

Ontological Semantic Grounding

To enable cross-context generalization (e.g., procedures learned on "mug" applying to "cup"), MACLA constructs a lightweight ontological semantic index during offline memory construction. We extract the k_vocab most frequent words from task descriptions and actions, then cluster semantically similar words using SentenceTransformer embeddings [12] to form an implicit domain ontology:

$$
\mathcal{O} = \{\mathcal{C}_1, \dots, \mathcal{C}_W\} = \mathrm{Cluster}\big(\{\phi(w) : w \in V_{k_{\mathrm{vocab}}}\}\big),
$$

where each cluster C_w represents a semantic category (e.g., C_container = {mug, cup, glass}). During retrieval, observations are mapped to these ontological categories, allowing procedures to match across lexically different but semantically equivalent contexts. This ontological grounding enables domain-adaptive generalization without requiring explicit knowledge engineering.
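A minimal sketch of this clustering, assuming precomputed word embeddings: the greedy single-pass scheme, the fixed similarity threshold, and the frozen centroids are simplifications standing in for the SentenceTransformer-based construction.

```python
from math import sqrt

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def cluster_vocab(words, embed, theta=0.8):
    """Greedy single-pass clustering of vocabulary words by embedding
    similarity.  `embed` maps word -> vector (assumed precomputed).
    Centroids are not updated after creation, a sketch simplification.
    """
    clusters = []  # each: {"centroid": vector, "members": set of words}
    for w in words:
        v = embed[w]
        best, best_sim = None, theta
        for c in clusters:
            s = _cos(v, c["centroid"])
            if s >= best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append({"centroid": list(v), "members": {w}})
        else:
            best["members"].add(w)
    return clusters

def same_category(w1, w2, clusters):
    """True if both words fall in the same ontological category C_w."""
    return any(w1 in c["members"] and w2 in c["members"] for c in clusters)
```

With embeddings where "mug" and "cup" point in nearly the same direction, both land in one C_container-style cluster, so a procedure whose preconditions mention "mug" can match an observation mentioning "cup".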

System Efficiency and Memory Management

To ensure practical scalability, MACLA employs efficient retrieval, bounded growth, and strict control over LLM usage. All procedures and meta-procedures are embedded in an approximate nearest-neighbor index supporting sublinear retrieval (O(log N_p)) for semantic search. The episode buffer stores at most N_b = 1000 steps, providing local context for LLM prompts and post-episode updates. Each procedure maintains a failure index limited to K_fail = 15 entries, managed through success-based removal, redundancy-aware eviction, and temporal decay, ensuring that memory remains concise and informative. To prevent memory saturation, procedures and meta-procedures are periodically pruned using a multi-factor utility score that balances reliability, usage frequency, and temporal relevance:

$$
U_i = \lambda_r \cdot \frac{\alpha_i}{\alpha_i + \beta_i} \;+\; \lambda_f \cdot \frac{n_i}{N_{\mathrm{total}}} \;+\; \lambda_t \cdot \exp\!\left(-\frac{t_{\mathrm{current}} - t^{\mathrm{last}}_i}{\tau}\right),
$$

where α_i/(α_i + β_i) is the Bayesian success rate (reliability), n_i is the execution count of procedure i, N_total is the total invocations across all procedures in the current episode window, t_current is the current episode index, t_i^last is the episode when i was last used, and τ is the temporal decay constant.

The weighting coefficients 𝜆 𝑟 = 0 . 5, 𝜆 𝑓 = 0 . 3, and 𝜆 𝑡 = 0 . 2 reflect the relative importance of each factor: reliability receives the highest weight (0.5) as it directly predicts future success; frequency receives moderate weight (0.3) to favor well-tested procedures while avoiding over-retention of obsolete frequently-used skills; recency receives the lowest weight (0.2) to provide soft temporal decay without aggressive forgetting. These values were determined through grid search over { 0 . 3 , 0 . 4 , 0 . 5 , 0 . 6 } × { 0 . 2 , 0 . 3 , 0 . 4 } × { 0 . 1 , 0 . 2 , 0 . 3 } on ALFWorld validation, with the constraint 𝜆 𝑟 + 𝜆 𝑓 + 𝜆 𝑡 = 1 . 0 for interpretability. The selected configuration (0.5, 0.3, 0.2) yielded the best balance between retaining high-quality procedures (>0.7 success rate) and pruning low-utility entries (<0.4 success rate), as validated later in Figure 4. Entries with the lowest utility are removed while ensuring diversity across goal clusters through stratified sampling. These operations keep the total memory footprint below 4 MB for hundreds of procedures.
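The retention score above is directly computable. A minimal sketch with the paper's weights (λ_r = 0.5, λ_f = 0.3, λ_t = 0.2); the value of the decay constant τ and the function names are assumptions for illustration.

```python
from math import exp

def prune_utility(alpha, beta, n_i, n_total, t_current, t_last,
                  lam_r=0.5, lam_f=0.3, lam_t=0.2, tau=50.0):
    """Multi-factor retention score: reliability + frequency + recency.
    tau is illustrative; the paper does not fix its value in this section.
    """
    reliability = alpha / (alpha + beta)            # Bayesian success rate
    frequency = n_i / max(n_total, 1)               # share of invocations
    recency = exp(-(t_current - t_last) / tau)      # soft temporal decay
    return lam_r * reliability + lam_f * frequency + lam_t * recency

def prune(procs, keep):
    """Keep the `keep` highest-utility procedures, dropping the rest.
    (The paper additionally stratifies by goal cluster for diversity.)"""
    return sorted(procs, key=lambda p: prune_utility(**p), reverse=True)[:keep]
```

A recently used, reliable procedure scores well above a stale, unreliable one, so pruning removes the latter first; the stratified-sampling diversity constraint mentioned above is omitted from this sketch.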

Finally, MACLA limits LLM usage to a fixed budget of API calls per episode to cover segmentation, abstraction, and occasional refinement, while all retrieval, Bayesian scoring, and updates are symbolic or vectorized. As a result, per-step runtime remains effectively constant and inference cost does not scale with experience. This memory-first design ensures that MACLA remains efficient, interpretable, and deployable for continual learning across long interaction horizons. The theoretical foundations are provided in Appendix D.

Algorithm

At runtime, MACLA executes a new task by coupling frozen semantic reasoning with memory-driven decision making. The agent receives an initial observation 𝑜 0 (and, optionally, an instruction string) and embeds it as h 0 = 𝜙 ( 𝑜 0 ) . This embedding queries the semantic index of the external memory to retrieve a compact candidate set consisting of procedures { Proc 𝑖 } and meta-procedures { MP 𝑗 } whose embeddings are most similar to h 0 . Retrieval is approximate nearest neighbor over the concatenated descriptors of goals, preconditions, and postconditions, which keeps lookup sublinear in memory size.

Given the candidate set, MACLA ranks each item with a Bayesian expected-utility score that trades off contextual relevance, estimated success, risk, and information gain under the procedure's Beta posterior. The highest-scoring item above a confidence threshold is selected; otherwise the agent falls back to zero-shot LLM reasoning for that step, logs the outcome, and continues. If a meta-procedure is chosen, execution proceeds hierarchically under its composition policy Θ 𝑗 ( 𝑜 𝑡 , index ) ∈ { continue , skip , repeat , abort } until completion or abort; if an atomic procedure is chosen, the agent checks preconditions Ψ 𝑖 against 𝑜 𝑡 , invokes the action sketch 𝜋 𝑖 via the frozen LLM's action formatter, and verifies postconditions Φ 𝑖 to certify completion. In both cases the outcome updates ( 𝛼, 𝛽 ) and appends the initial context to the corresponding success or failure set for later analysis.

After each execution, the agent re-embeds the new observation and repeats retrieval and selection until the task is solved or a horizon is reached. When a procedure accumulates both successes and failures, a contrastive pass is triggered: the LLM proposes discriminators that tighten Ψ 𝑖 , repair 𝜋 𝑖 , and refine Φ 𝑖 , or if distinct modes are detected, specializes the procedure into variants that inherit prior counts. When successful episodes repeatedly traverse a small set of procedures in a stable order, the agent abstracts a meta-procedure with its own success posterior and a lightweight Θ 𝑗 distilled from divergence points across traces. Throughout, memory remains bounded by pruning with a utility that blends reliability, frequency, and recency, and the LLM-call budget is capped, as retrieval, scoring, and updates are vectorized operations. The complete runtime procedure is outlined in Algorithm 1.

Experiments

We evaluate MACLA on four challenging interactive agent benchmarks spanning diverse domains. All experiments use consistent hyperparameters across tasks to demonstrate generalization without task-specific tuning.

Algorithm 1 MACLA Runtime Procedure with Function Descriptions

Require: observation o_0, memory M (procedures, meta-procedures, indices), horizon H
1: h ← φ(o_0)  ⊲ Embed observation
2: C ← RetrieveCandidates(h, M)  ⊲ Top-k ANN search
3: while not Terminal and t < H do
4:   for all c ∈ C do
5:     EU[c] ← ExpectedUtility(c, o_t, M)  ⊲ Compute Eq. 4
6:   end for
7:   c★ ← arg max_{c ∈ C} EU[c]
8:   if EU[c★] < θ_conf then
9:     (o_{t+1}, y) ← ZeroShotStep(o_t)  ⊲ LLM generates action directly
10:  else if c★ is MP_j then
11:    (o_{t+1}, y) ← ExecuteMeta(MP_j, Θ_j, o_t)  ⊲ Run with control policy
12:    (α_j, β_j) ← UpdateBeta((α_j, β_j), y)  ⊲ α ← α + y, β ← β + (1 − y)
13:  else  ⊲ c★ is atomic Proc_i
14:    if CheckPre(Ψ_i, o_t) then  ⊲ Verify preconditions match o_t
15:      (o_{t+1}, y) ← ExecuteProc(π_i, o_t)  ⊲ Instantiate & execute π_i
16:      y ← y ∧ CheckPost(Φ_i, o_{t+1})  ⊲ Verify postconditions in o_{t+1}
17:    else
18:      (o_{t+1}, y) ← ZeroShotStep(o_t)  ⊲ Preconditions failed, fallback
19:    end if
20:    (α_i, β_i) ← UpdateBeta((α_i, β_i), y)
21:    RecordContext(S_i, F_i, o_t, y)  ⊲ Add to success/fail sets
22:  end if
23:  if RefineTrigger(c★) then  ⊲ If |S|, |F| ≥ 3
24:    ContrastiveRefine(c★)  ⊲ LLM compares S vs. F (§4.3)
25:  end if
26:  h ← φ(o_{t+1}); C ← RetrieveCandidates(h, M); t ← t + 1
27: end while
28: if EligibleForMeta(trace) then  ⊲ If ≥ 3 procs in stable order
29:   ExtractOrRefineMeta(trace, M)  ⊲ Create/update meta-proc
30: end if
31: PruneAndMaintain(M)  ⊲ Remove low-utility via Eq. 8

Experimental Setting

Memory Architecture. Episode buffer N_buffer = 1000 (stores recent observations and actions for temporal context during action generation); procedural memory N_proc = 200 (capacity for extracted reusable skills); meta-procedural memory N_meta = 50 (capacity for hierarchical procedure compositions). Critically, MACLA does not store raw trajectories. Instead, the LLM segments each episode into coherent sub-tasks and extracts structured procedures (Section 4.1). Duplicate detection with similarity threshold θ_dup = 0.85 prevents redundant storage. Through this process, the 2,851 ALFWorld training trajectories compress into approximately 187 unique procedures, demonstrating efficient knowledge distillation from experience.

Bayesian Selection. Information-gain weight λ_info = 0.1, failure cost C_fail = 0.5. These parameters balance exploration (trying uncertain procedures to reduce epistemic uncertainty) with exploitation (selecting high-posterior reliable procedures).

Contrastive Refinement. Minimum contexts n_min^s = n_min^f = 3. Refinement activates only when a procedure has accumulated at least 3 successes and 3 failures, ensuring sufficient statistical evidence for discriminative pattern extraction.

LLM Configuration. Llama-2-7B [18] via Ollama with 4-bit quantization and temperature T = 0.7. The LLM parameters remain frozen throughout all experiments; learning occurs exclusively through external memory updates.

Benchmarks and Dataset Statistics. ALFWorld [16] (2,851 train, 274 test) is a text-based embodied environment with six household tasks (e.g., retrieval, placement). We follow the standard train/validation-seen/validation-unseen split, where test trajectories feature novel object-location configurations. WebShop [25] (1,624 train, 200 test) simulates e-commerce search over 12,087 products, requiring agents to follow natural-language instructions via multi-step navigation and filtering. TravelPlanner [21] (1,000 train, 180 validation, 45 test) involves multi-day itinerary planning under hard constraints (budget, dates) and soft preferences (cuisine, attractions). Evaluation uses Common Sense (CS) and Hard Constraint (HC) scores. InterCodeSQL [24] benchmarks interactive text-to-SQL generation over diverse schemas, requiring correct handling of schema relationships and varying query difficulty.

Experimental Results and Analysis

Table 1 compares MACLA against state-of-the-art baselines across all benchmarks. We organize baselines into three paradigms: prompt-based methods using in-context learning, outcome-refinement approaches optimizing trajectory-level rewards, and process-refinement methods refining step-level generation. MACLA achieves the highest average performance (78.1%) while using a 7B-parameter model, demonstrating that domain-agnostic procedural memory with Bayesian selection and contrastive refinement enables competitive performance without task-specific engineering.

In Table 1, MACLA achieves state-of-the-art results on TravelPlanner (83.3 CS) and ALFWorld-Unseen (90.3%), outperforming methods that rely on models 10× larger. Its strong performance across all benchmarks demonstrates cross-domain generalization, while the positive generalization gap on ALFWorld (+3.1 points for unseen vs. seen) indicates robust compositional transfer rather than memorization.

Conclusion

We presented MACLA, a framework that decouples reasoning from learning by maintaining a frozen LLM and performing all adaptation in an external hierarchical procedural memory through Bayesian selection, contrastive refinement, and meta-procedural composition. MACLA achieves 78.1% average performance across four benchmarks using only a 7B model, with state-of-the-art results on ALFWorld (87.2% seen; 90.3% unseen) and TravelPlanner (83.3%). The system compresses 2,851 ALFWorld training trajectories into 187 reusable procedures through semantic abstraction and duplicate detection, demonstrating efficient knowledge distillation.

Detailed Ablation Studies and Memory Analysis

This section provides comprehensive ablation studies examining MACLA's component contributions, memory scaling behavior, and task-specific effectiveness. These experiments address critical questions about system design choices and identify performance bottlenecks across different benchmarks. Table 4 systematically evaluates the contribution of each MACLA component by measuring performance degradation when individual modules are removed. Beyond success rates, we track memory dynamics (procedure/metaprocedure counts), behavioral patterns (reuse rate), and computational efficiency (LLM calls per episode).

Table 4: Component ablation and memory dynamics analysis on ALFWorld. All variants use Llama-2-7B.

Proc./Meta Count: final memory size after 200 episodes. Reuse Rate: % of actions from retrieved procedures vs. zero-shot LLM. LLM Calls: average per episode.

Synergistic Effects: No single component accounts for MACLA's full performance. The combination of Bayesian selection (uncertainty-aware), contrastive learning (quality refinement), and meta-procedures (hierarchical composition) creates synergistic effects: Bayesian selection identifies reliable procedures, contrastive learning makes them more robust, and meta-procedures compose them efficiently.

Memory Capacity Scaling

Table 5 investigates the relationship between memory capacity and performance, addressing whether larger memory always yields better results or if there exists an optimal capacity.

Table 5: Impact of procedural memory capacity on performance. Results on ALFWorld after 200 training episodes.

Actual Proc.: number of procedures after training (may be fewer than capacity if not all slots are filled). Avg α/(α+β): mean posterior success rate across all procedures.

Task-Specific Memory Effectiveness

Figure 5 explains the SQL underperformance through three metrics. Low reuse (51%): SQL queries are schema-specific; e.g., customers.age does not apply to employees.experience. ALFWorld generalizes via semantic placeholders, but SQL column names vary unpredictably. Low reliability (64%): schema mismatches, join complexity, and edge cases accumulate failures (β counts), suppressing posteriors. Minimal composition (18%): SQL queries are atomic (2-3 actions), too short for meta-procedures, whereas ALFWorld tasks naturally decompose into multi-step sub-procedures. MACLA excels when tasks have (1) reusable actions, (2) hierarchical structure, and (3) consistent semantics; SQL violates all three.

Extended Experimental Analysis

This appendix provides detailed visualizations and analyses addressing the memory dynamics, Bayesian learning mechanics, and task-specific performance characteristics of MACLA.

Bootstrapping and Memory Growth Dynamics

Figure 6 demonstrates the bootstrapping effect: MACLA's ability to learn from imperfect initial experiences without requiring pre-trained demonstrations. The learning curve reveals three emergent phases that were not explicitly programmed:

Figure 6: Learning dynamics over 2,851 training trajectories on ALFWorld. (a) Success rate progression shows three distinct phases: exploration (trajectories 1-570), consolidation (571-1,425), and exploitation (1,426-2,851). (b) Memory growth demonstrates rapid procedure extraction during exploration, followed by meta-procedure formation during consolidation. The system extracts 187 unique procedures from 2,851 trajectories (15:1 compression), never exceeding the 200-capacity limit. (c) Average Bayesian posterior α/(α+β) converges from optimistic initialization (0.5) to the empirical success rate (0.79), with the shaded region showing ±1 standard deviation across procedures. (d) LLM fallback rate decreases from 100% (pure zero-shot) to <5% as procedural memory becomes comprehensive.


The four-panel layout efficiently shows the temporal correlation between observable performance (success rate, panel a) and internal learning mechanics (memory growth, posterior convergence, fallback reduction). The phase-shaded background in panel (a) makes regime transitions immediately apparent. Panel (c)'s confidence band demonstrates variance reduction: epistemic uncertainty decreases as evidence accumulates, a hallmark of Bayesian learning.

Cold-Start Capability. MACLA achieves 82% success using only the first 1,425 trajectories (50% of the training data) without any parameter training. This addresses the cold-start problem that plagues supervised fine-tuning methods requiring large expert datasets. The learning curve shows MACLA is highly sample-efficient: 20% of the data (570 trajectories) achieves 45% performance, while the final 50% adds diminishing returns. This logarithmic growth contrasts with neural approaches that require full-dataset training for convergence.

Compression and Generalization. The 15:1 compression ratio (2,851 trajectories → 187 procedures) demonstrates efficient knowledge distillation through semantic abstraction. Rather than memorizing individual trajectories, MACLA extracts reusable patterns that generalize across contexts. The plateau in panel (b) at 187 procedures suggests ALFWorld's task space has finite inherent complexity; beyond this point, new trajectories are covered by existing procedures with generalized preconditions.

Detailed Execution Trace Analysis

This appendix provides a complete time-stamped execution trace of MACLA solving an ALFWorld unseen task, demonstrating how procedural memory, Bayesian selection, and contrastive refinement operate in practice. The trace illustrates information flow through all architectural components during both online inference (time steps 𝑡0-𝑡8) and post-episode learning (𝑡9).

Task Description and Setup

Task: valid_unseen_0 from ALFWorld validation-unseen split: 'Put chilled lettuce on the counter.'

Challenge: This task requires hierarchical reasoning with an implicit precondition: the lettuce must be cooled before placement. The compound modifier 'chilled' signals a two-stage plan: (1) cool the object, then (2) place it on the counter. The task is unseen because the specific object-appliance-location triplet (lettuce-fridge-countertop) was not present in training trajectories, testing compositional generalization.

Initial State: Agent in kitchen, lettuce on countertop 2, fridge 1 available but closed.

Memory State: Procedural memory contains 199 learned procedures, including object_cooling (α = 10, β = 3, success rate 76.9%) and object_placement (α = 8, β = 2, success rate 80.0%). Meta-procedural memory contains 50 compositions learned from other object configurations (e.g., potato-fridge-table, apple-fridge-shelf), but none directly matching the lettuce-fridge-countertop configuration.

Execution Timeline

Table 7 presents the complete timestep-by-timestep trace. Each row captures the state and decisions of four core components: (1) LLM for semantic parsing and goal discovery, (2) Bayesian Selector for uncertainty-aware procedure ranking, (3) Memory System for procedure storage and retrieval, and (4) Contrastive Refiner for post-episode learning from success/failure patterns.

LLM Call Count: This episode requires 2 full LLM inference calls (marked with ★): initial task parsing at 𝑡0 and post-episode segmentation at 𝑡9. All intermediate actions (𝑡1-𝑡8) use template-based instantiation without LLM generation, demonstrating MACLA's efficiency advantage over methods like ReAct that require LLM reasoning at each step.
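This template instantiation step can be sketched in a few lines; the slot names and action strings below are illustrative, not MACLA's actual templates:

```python
# Illustrative sketch: filling a retrieved procedure's action templates from the
# current observation bindings, with no LLM call per action.
def instantiate(template: str, bindings: dict) -> str:
    action = template
    for slot, value in bindings.items():
        action = action.replace("<" + slot + ">", value)
    return action

proc_actions = ["open <appliance>", "put <object> in <appliance>", "close <appliance>"]
bindings = {"object": "lettuce", "appliance": "fridge 1"}
print([instantiate(a, bindings) for a in proc_actions])
# ['open fridge 1', 'put lettuce in fridge 1', 'close fridge 1']
```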

Table 7: Time-stamped execution trace of MACLA on ALFWorld task valid_unseen_0 ('Put chilled lettuce on the counter'). Each timestep shows information flow through LLM, Bayesian Selector, Memory System, and Contrastive Refiner. ★ indicates full LLM inference calls. All numerical values verified against system outputs.



Trace Verification Methodology

This execution trace was verified through multiple independent methods to ensure accuracy:

Posterior update verification: the cooling success at 𝑡6 reproduces the reported update,

$$\mathrm{Beta}(10,3)\ \rightarrow\ \mathrm{Beta}(11,3), \qquad \hat{\rho} = \frac{11}{11+3} \approx 0.786.$$

Information gain calculation (Equation 37):

$$\Delta H = H\big[\mathrm{Beta}(11,3)\big] - H\big[\mathrm{Beta}(10,3)\big] = -0.136\ \text{nats}.$$

(Negative entropy change indicates reduced uncertainty; we report absolute value in table.)

Expected-utility verification: recomputing the fridge procedure's EU from its stored relevance and risk scores reproduces EU ≈ 0.83. (The small discrepancy is due to rounding in the relevance and risk scores; within tolerance.)

Reproducibility. Complete reproduction instructions are provided in the public code repository.

Key Observations and Architectural Insights

Hierarchical Goal Decomposition (𝑡0-𝑡2). The LLM immediately recognizes 'chilled' as imposing a temporal constraint, inferring the cooling precondition without explicit instruction. This demonstrates the frozen LLM's semantic reasoning capability: it parses compound task specifications into hierarchical subgoals. The Bayesian Selector then orders these subgoals by expected utility while flagging dependency violations, ensuring preconditions are satisfied before attempting dependent actions.

Uncertainty-Aware Procedure Selection (𝑡3). When choosing between fridge (EU = 0.83, α = 10, β = 3) and freezer (EU = 0.58, α = 4, β = 6) for cooling, the Bayesian Selector favors fridge despite both having similar semantic relevance (sim > 0.85). The key difference lies in the posterior distributions: fridge has higher expected success (76.9% vs. 40.0%) and lower uncertainty (σ² = 0.0127 vs. 0.022). This illustrates how Bayesian selection balances exploitation (choosing high-ρ̂ procedures) with exploration (considering information gain for uncertain procedures).
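The posterior statistics above follow directly from the Beta distribution; a quick check of the reported means and variances:

```python
def beta_mean(a, b):
    # posterior mean of a Beta(a, b) success-rate estimate: a / (a + b)
    return a / (a + b)

def beta_var(a, b):
    # posterior variance: ab / ((a + b)^2 (a + b + 1))
    return a * b / ((a + b) ** 2 * (a + b + 1))

print(round(beta_mean(10, 3), 3), round(beta_var(10, 3), 4))  # fridge:  0.769 0.0127
print(round(beta_mean(4, 6), 3), round(beta_var(4, 6), 4))    # freezer: 0.4 0.0218
```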

Minimal LLM Usage (𝑡0-𝑡9). The entire episode requires only 2 LLM calls: (1) initial goal parsing at 𝑡0 (436 total tokens), and (2) symbolic summary generation at 𝑡9 (568 total tokens). Once procedures are retrieved, all subsequent actions are generated by instantiating learned templates with current observations. This demonstrates MACLA's core efficiency advantage: procedural memory amortizes LLM costs across episodes, achieving >85% token reduction compared to ReAct's per-step reasoning.

Online Bayesian Updates (𝑡6). After the successful cooling, the posterior updates from Beta(10,3) to Beta(11,3), shifting the expected success rate from 76.9% to 78.6%. The information gain (ΔH = 0.136 nats) quantifies reduced epistemic uncertainty. This online learning happens during episode execution without any parameter updates to the frozen LLM, enabling continual improvement through memory refinement.
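This conjugate update is a one-line operation on the stored counts; a minimal sketch:

```python
def bayes_update(alpha, beta, success):
    # conjugate Beta-Bernoulli update: a success increments alpha, a failure beta
    return (alpha + 1, beta) if success else (alpha, beta + 1)

a, b = bayes_update(10, 3, success=True)  # cooling succeeded at t6
print(a, b, round(a / (a + b), 3))        # 11 3 0.786
```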

Meta-Procedure Formation ( 𝑡 9 ). Post-episode analysis detects that cooling → placement occurs in 20% of recent episodes across different object-appliance-location configurations (potato-fridge-table, apple-fridge-shelf, lettuce-fridge-countertop). The system automatically creates meta_cool_and_place_object , a higher-level composition that encapsulates both procedures with a conditional execution policy: 'if task contains cooling modifier (chilled/frozen/cold), execute cooling then placement; else skip to placement.' This meta-procedure abstracts over specific objects and locations, demonstrating compositional generalization. Future episodes with similar task structures can invoke this meta-procedure directly, reducing planning depth from 2 retrievals to 1.

Contrastive Learning Preparation (𝑡9). Although this episode succeeded, MACLA logs success patterns ('chilled via fridge') for future contrastive refinement. The memory now contains 11 cooling successes and 3 failures. When the next cooling failure occurs, contrastive analysis will activate (threshold: min(|S|, |F|) ≥ 3), extracting discriminative patterns by comparing success contexts (fridge, refrigerator) against failure contexts (hypothetically: oven, microwave). These refined preconditions prevent future errors by learning 'cooling requires cold appliances, not heat sources.'
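The activation threshold is a simple check on the stored outcome sets; sketched below (the context strings are illustrative):

```python
def contrastive_ready(successes, failures, k=3):
    # contrastive analysis activates once min(|S|, |F|) >= k (k = 3 in the paper)
    return min(len(successes), len(failures)) >= k

S = ["fridge"] * 11                  # 11 cooling successes
F = ["oven", "microwave", "oven"]    # 3 failures (hypothetical contexts)
print(contrastive_ready(S, F))       # True
```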


To extend reasoning beyond atomic skills, MACLA automatically discovers and learns meta-procedures: structured compositions of procedures that capture recurrent long-horizon strategies. When a sequence of procedures ⟨Proc_{i1}, ..., Proc_{im}⟩ repeatedly leads to success under a common high-level goal, the agent abstracts it as MP_j = ⟨G_j^meta, Ψ_j^meta, {Proc_{i1}, ..., Proc_{im}}, Θ_j⟩. Here, Θ_j denotes a lightweight control policy governing conditional transitions among sub-procedures based on the current observation and execution context, Θ_j(o_t, index) ∈ {continue, skip, repeat, abort}. This policy is distilled by analyzing successful traces, where the LLM identifies observation patterns that triggered each branch: for example, repeating when postconditions are unmet, skipping when preconditions already hold, or aborting when failures recur. Each meta-procedure maintains its own Beta success posterior p(σ_j | D_j) = Beta(α_j, β_j) and is refined periodically to add new branches, reorder sub-procedures, or prune redundant ones. Through these hierarchical compositions, MACLA acquires flexible 'playbooks' that encapsulate extended strategies with conditional logic.
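A minimal sketch of such a control policy Θ_j; the boolean arguments are simplifications, since the actual policy derives them by matching learned patterns against the raw observation:

```python
def theta(obs, pre_holds, post_ok, retries, max_retries=2):
    """Sketch of Theta_j(o_t, index) in {continue, skip, repeat, abort}.
    pre_holds/post_ok stand in for pattern matches against the observation obs."""
    if pre_holds:                 # precondition of next sub-procedure already holds
        return "skip"
    if not post_ok:               # sub-procedure's postcondition unmet
        return "abort" if retries >= max_retries else "repeat"
    return "continue"

print(theta("lettuce is cool", pre_holds=True, post_ok=True, retries=0))    # skip
print(theta("lettuce is warm", pre_holds=False, post_ok=False, retries=0))  # repeat
```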


As experience accumulates, procedures with both successful and failed instances undergo contrastive refinement to improve their accuracy and robustness. For a procedure Proc_i with sets of successful and failed contexts S_i and F_i, the LLM performs a discriminative comparison, D_i = ContrastiveExtract(S_i, F_i), identifying differences along three dimensions: (i) precondition patterns (ΔΨ_i^+ and ΔΨ_i^-) that distinguish successful from failed initial contexts, (ii) action discrepancies (Δπ_i) revealing missing or misordered actions, and (iii) postcondition mismatches (ΔΦ_i) that capture incomplete goal states. These discriminators drive explicit refinement operations.

When distinct execution modes are detected, the procedure is specialized into separate variants with inherited reliability priors. This process progressively tightens applicability conditions and action precision, yielding interpretable improvements purely through memory edits rather than gradient updates.
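To make the contrastive comparison concrete, here is a toy stand-in for ContrastiveExtract that uses set differences over symbolic context features. MACLA performs this comparison with the LLM over natural-language contexts, so this is an illustrative approximation only:

```python
def contrastive_extract(success_ctx, failure_ctx):
    # Features common to all successes but never seen in failures become candidate
    # positive preconditions (delta Psi+); features common to all failures but never
    # seen in successes become failure indicators (delta Psi-).
    S_common = set.intersection(*map(set, success_ctx))
    F_common = set.intersection(*map(set, failure_ctx))
    S_all = set().union(*map(set, success_ctx))
    F_all = set().union(*map(set, failure_ctx))
    return {"pre_plus": S_common - F_all, "pre_minus": F_common - S_all}

d = contrastive_extract(
    [{"appliance:fridge", "object:lettuce"}, {"appliance:fridge", "object:potato"}],
    [{"appliance:oven", "object:lettuce"}],
)
print(d["pre_plus"], d["pre_minus"])  # {'appliance:fridge'} {'appliance:oven'}
```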

Comparison to Alternative Approaches

vs. ReAct [26]. ReAct would require 16-20 LLM calls for this task: reasoning before each action (8 actions × 2 calls/action for 'thought' and 'action'), plus initial planning and reflection. MACLA reduces this to 2 calls by retrieving learned procedures, an >85% reduction in LLM inference overhead.

vs. Reflexion [15]. Reflexion's reflection phase would add 5-8 additional LLM calls for post-episode self-critique and memory update. MACLA's structured Bayesian updates and contrastive refinement achieve similar memory improvements without these extra calls, while providing formal uncertainty quantification through Beta posteriors.

vs. Supervised Fine-Tuning (SFT). SFT would treat this entire 8-action trajectory as a single training example, backpropagating based solely on the terminal success signal. MACLA decomposes it into reusable procedures (cooling, placement), each receiving independent Bayesian credit assignment. When the cooling procedure succeeds at 𝑡6, its posterior updates immediately, even before episode completion. This step-level credit assignment enables more efficient learning from sparse reward signals.


Generalization to Unseen Tasks

This execution demonstrates three levels of generalization:

  1. Object Generalization: The cooling procedure was learned from trajectories involving potatoes and apples (7 potato episodes, 2 apple episodes in the training set), yet successfully applies to lettuce without any lettuce-specific training. Semantic abstraction via placeholders enables transfer across object categories by parameterizing procedures over entity types rather than specific instances.

  2. Compositional Generalization: The specific cooling → placement sequence for lettuce-fridge-countertop was never observed during training. MACLA composes two independently-learned procedures based on precondition-postcondition matching: cooling's postcondition cooled(object) satisfies placement's precondition, enabling automatic chaining. This demonstrates hierarchical reasoning without explicit composition supervision.

  3. Bayesian Adaptation: The fridge selection leverages Bayesian posteriors aggregated across all past cooling episodes (10 successes, 3 failures across different objects and contexts). This cross-context knowledge transfer is impossible for purely episodic memory systems that treat each experience independently. The Beta(10,3) posterior encodes reliability estimates that generalize beyond training distributions.

Limitations and Edge Cases

    Failure Case: Ambiguous Preconditions. If the task were 'Put lettuce on the counter' (without 'chilled' modifier), MACLA might incorrectly infer a cooling precondition based on high co-occurrence in training (20% of placement tasks involved prior cooling). This false positive would waste 4-5 actions (navigate, open, cool, close) cooling an object that doesn't require it. Contrastive refinement can mitigate this by learning that 'chilled' is a necessary keyword for cooling, not merely frequent. After observing successful non-cooling placements, the system would learn: 'cooling required ⇔ {chilled, frozen, cold} ∈ task_modifiers.'

    Computational Overhead. Bayesian selection at each decision point requires scoring all retrieved procedures (typically 5-10 candidates via FAISS retrieval). While fast (0.4ms per decision with 199 procedures), this overhead accumulates in long episodes (50+ steps). Meta-procedures partially address this by providing pre-composed plans that skip lower-level selection, reducing the number of decision points by 40-60% for complex tasks.

Memory Capacity. With procedural memory capped at N_p = 200, the utility-based pruning mechanism (Section 2.7.2) activates when new procedures are extracted. Procedures with success rates below 60% and usage counts below 5 are evicted first. This can cause catastrophic forgetting of rare but important skills (e.g., emergency procedures used <1% of the time). Future work should explore (1) dynamic memory expansion based on task diversity, (2) hierarchical memory with separate buffers for common vs. rare skills, or (3) importance-weighted retention that preserves high-impact procedures regardless of frequency.
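The eviction rule can be sketched as a filter over stored (α, β, usage) statistics; the field names and example procedures here are illustrative:

```python
def evict_candidates(procs, min_rate=0.60, min_uses=5):
    # utility-based pruning: procedures with posterior success rate < 60%
    # AND usage count < 5 are eviction candidates, lowest rate evicted first
    rate = lambda p: p["alpha"] / (p["alpha"] + p["beta"])
    return sorted((p for p in procs if rate(p) < min_rate and p["uses"] < min_uses),
                  key=rate)

procs = [
    {"name": "object_cooling", "alpha": 10, "beta": 3, "uses": 13},
    {"name": "rare_emergency", "alpha": 1, "beta": 2, "uses": 2},
]
print([p["name"] for p in evict_candidates(procs)])  # ['rare_emergency']
```

Note how the reliable, frequently used cooling procedure survives while the rare low-posterior skill is evicted, which is exactly the forgetting risk discussed above.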

Precondition Inference Errors. The LLM-based precondition extraction at 𝑡0 can hallucinate dependencies not present in the task specification. For instance, if training data frequently shows 'take X' followed by 'examine X,' the system might incorrectly infer that examination is a precondition for all retrieval tasks. Contrastive learning helps correct these errors by identifying cases where the inferred precondition was violated yet the task succeeded.


    MACLA's efficiency comes from three design choices: (1) the frozen LLM eliminates gradient updates, (2) external memory construction is trivially parallelizable, and (3) learned procedures amortize LLM costs across episodes. Table 3 summarizes training costs.



    Figure 5: Cross-domain analysis. (a) Memory reuse: 51% (SQL) to 78% (ALFWorld). (b) Procedure reliability: 64% (SQL) to 81% (ALFWorld). (c) Meta-procedure usage: 18% (SQL) to 51% (TravelPlanner).


    Theoretical Foundations

    This section addresses several design choices in MACLA that currently lack rigorous theoretical grounding, and proposes formal justifications that strengthen the framework's foundations.

    Ad-hoc Thresholds and Their Implications

    MACLA employs several threshold-based mechanisms whose values were determined empirically rather than through principled derivation:

    Table 8: Summary of threshold parameters and their justification status

D.1.1 Duplicate Detection Threshold θ_dup. The duplicate detection mechanism uses cosine similarity with threshold θ_dup = 0.85:

$$\mathrm{merge}(\mathrm{Proc}_i, \mathrm{Proc}_j) \iff \cos(\mathbf{e}_i, \mathbf{e}_j) \geq \theta_{\text{dup}} = 0.85,$$

where $\mathbf{e}_i$ denotes the embedding of procedure $i$.

    Proposed Theoretical Foundation: We can derive an optimal threshold from information-theoretic principles by minimizing expected description length:

$$\theta_{\text{dup}}^{*} = \arg\min_{\theta}\Big[\, N_p(\theta)\,\log|\mathcal{A}| + \sum_{i=1}^{N_p(\theta)} H[\mathrm{Proc}_i] \,\Big],$$

where N_p(θ) is the number of unique procedures at threshold θ, |A| is the action vocabulary size, and H[Proc_i] is the entropy of procedure i. This formulation trades off memory compression (fewer procedures) against information loss (overly aggressive merging). Sensitivity Analysis: Figure 7 shows performance varies by ±4.2% for θ_dup ∈ [0.75, 0.95], indicating moderate sensitivity.
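The underlying merge test is a cosine-similarity comparison against stored procedure embeddings; a minimal sketch with toy 2-d embeddings (real embeddings would come from a sentence encoder):

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def is_duplicate(emb_new, memory_embs, theta_dup=0.85):
    # merge a new procedure when it matches an existing one at similarity >= theta_dup
    return any(cosine(emb_new, e) >= theta_dup for e in memory_embs)

print(is_duplicate([1.0, 0.0], [[0.9, 0.1]]))  # True  (cos ~ 0.99)
print(is_duplicate([1.0, 0.0], [[0.0, 1.0]]))  # False (orthogonal)
```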

D.1.2 Confidence Threshold θ_conf. Selection proceeds only when max_i EU(Proc_i | o_t) > θ_conf = 0.4; otherwise, the system falls back to zero-shot LLM reasoning.

    Problem: The value 0.4 appears arbitrary and is not calibrated to expected utility units.

    Proposed Theoretical Foundation: The confidence threshold should be set based on the expected cost of zero-shot LLM fallback. Let 𝐶 LLM be the computational cost and 𝜌 LLM be the zero-shot success rate. Then:

$$\theta_{\text{conf}}^{*} = \rho_{\text{LLM}}\, R_{\max} - (1-\rho_{\text{LLM}})\, C_{\text{fail}} - C_{\text{LLM}}.$$

For ALFWorld with ρ_LLM ≈ 0.42 (from the Llama-2-7B baseline), R_max = 1.0, C_fail = 0.5, and normalized C_LLM = 0.15:

$$\theta_{\text{conf}}^{*} = 0.42 \times 1.0 - 0.58 \times 0.5 - 0.15 = -0.02.$$

    This suggests always using procedures when available. The empirical value of 0.4 likely compensates for model miscalibration.
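The arithmetic behind this conclusion, assuming the additive utility form used above:

```python
def fallback_threshold(rho_llm, r_max=1.0, c_fail=0.5, c_llm=0.15):
    # expected utility of zero-shot LLM fallback; a stored procedure is worth
    # using whenever its expected utility exceeds this value
    return rho_llm * r_max - (1 - rho_llm) * c_fail - c_llm

print(round(fallback_threshold(0.42), 2))  # -0.02: any positive-EU procedure wins
```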

    Figure 7: Sensitivity of performance and memory usage to duplicate detection threshold on ALFWorld-Seen.


    Calibration-Aware Threshold: Account for Beta posterior miscalibration:

$$\theta_{\text{conf}}^{*} = \big(\rho_{\text{LLM}}\, R_{\max} - (1-\rho_{\text{LLM}})\, C_{\text{fail}} - C_{\text{LLM}}\big) + \lambda_{\text{calib}}\, \sqrt{\mathrm{Var}[\mathrm{EU}_{\text{proc}}]},$$

where Var[EU_proc] captures uncertainty in procedure success rates. With estimated λ_calib ≈ 2.0 from cross-validation, this yields θ*_conf ≈ 0.38, closer to the empirical value.

D.1.3 Meta-Procedure Formation Threshold θ_meta. Meta-procedures are created when a sequence appears in ≥15% of recent episodes. Problem: This frequency-based criterion ignores sequence length, storage cost, and the expected reward improvement from composition.

    Proposed Theoretical Foundation: Define meta-procedure value as:

$$V(\mathrm{MP}_j) = f_j\, \ell_j\; \mathbb{E}[\Delta R \mid \mathrm{MP}_j] - c_{\text{store}},$$

where f_j is frequency, ℓ_j is average length, c_store is memory cost, and E[ΔR | MP_j] is the expected reward improvement from composition vs. separate procedures.

$$\text{create } \mathrm{MP}_j \iff V(\mathrm{MP}_j) > 0.$$

    This ensures meta-procedures are created based on value maximization rather than arbitrary frequency thresholds.
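Under this formulation the creation decision reduces to a sign test; the storage cost below is an illustrative value, since c_store is not reported:

```python
def meta_value(freq, avg_len, exp_gain, c_store=0.05):
    # V(MP_j) = f_j * l_j * E[dR | MP_j] - c_store; c_store = 0.05 is illustrative
    return freq * avg_len * exp_gain - c_store

# cooling -> placement: 20% frequency, length 2, assumed modest reward gain
print(meta_value(0.20, 2, 0.15) > 0)  # True -> create the meta-procedure
```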

    Duplicate Detection Threshold $ theta_{ text{dup

    Confidence Threshold $ theta_{ text{conf

    MACLA employs several threshold-based mechanisms whose values were determined empirically rather than through principled derivation:

    Table 8: Summary of threshold parameters and their justification status

    D.1.1 Duplicate Detection Threshold 𝜃 dup. The duplicate detection mechanism uses cosine similarity with threshold 𝜃 dup = 0 . 85:

    $$

    $$

    Proposed Theoretical Foundation: We can derive an optimal threshold from information-theoretic principles by minimizing expected description length:

    $$

    $$

      where 𝑁 𝑝 ( 𝜃 ) is the number of unique procedures at threshold 𝜃 , |A| is the action vocabulary size, and 𝐻 [ Proc 𝑖 ] is the entropy of procedure 𝑖 . This formulation trades off memory compression (fewer procedures) against information loss (overly aggressive merging). Sensitivity Analysis: Figure 7 shows performance varies by ± 4 . 2% when 𝜃 dup ∈ [ 0 . 75 , 0 . 95 ] , indicating moderate sensitivity.

    D.1.2 Confidence Threshold 𝜃 conf . Selection proceeds only when max 𝑖 EU ( Proc 𝑖 | 𝑜 𝑡 ) > 𝜃 conf = 0 . 4. Otherwise, the system falls back to zero-shot LLM reasoning.

    Problem: The value 0.4 appears arbitrary and is not calibrated to expected utility units.

    Proposed Theoretical Foundation: The confidence threshold should be set based on the expected cost of zero-shot LLM fallback. Let 𝐶 LLM be the computational cost and 𝜌 LLM be the zero-shot success rate. Then:

    $$

    $$

    For ALFWorld with 𝜌 LLM ≈ 0 . 42 (from Llama-2-7B baseline), 𝑅 max = 1 . 0, 𝐶 fail = 0 . 5, and normalized 𝐶 LLM = 0 .

    15:

    $$

    $$

    This suggests always using procedures when available. The empirical value of 0.4 likely compensates for model miscalibration.

    Figure 7: Sensitivity of performance and memory usage to duplicate detection threshold on ALFWorld-Seen.

    Figure 7: Sensitivity of performance and memory usage to duplicate detection threshold on ALFWorld-Seen.

    Calibration-Aware Threshold: Account for Beta posterior miscalibration:

    $$

    $$

    where Var [ EUproc ] captures uncertainty in procedure success rates. With estimated 𝜆 calib ≈ 2 . 0 from cross-validation, this yields 𝜃 ∗ conf ≈ 0 . 38, closer to the empirical value.

    D.1.3 Meta-Procedure Formation Threshold 𝜃 meta. Meta-procedures are created when a sequence appears in ≥ 15% of recent episodes. Problem: This frequency-based criterion ignores:

    Proposed Theoretical Foundation: Define meta-procedure value as:

    $$

    $$

    where 𝑓 𝑗 is frequency, ℓ 𝑗 is average length, 𝑐 store is memory cost, and E [ Δ 𝑅 | MP 𝑗 ] is expected reward improvement from composition vs. separate procedures.

$$
\text{create } \text{MP}_j \iff V(\text{MP}_j) > 0
$$

    This ensures meta-procedures are created based on value maximization rather than arbitrary frequency thresholds.
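The value criterion can be sketched as follows. This is a hedged illustration: the multiplicative combination of frequency, length, and reward gain is an assumed form consistent with the variables defined above, and the function names are illustrative, not MACLA's released API.

```python
# f_j: frequency, len_j: average length, dr_j: expected reward
# improvement E[dR | MP_j], c_store: memory/storage cost (see text).
def metaprocedure_value(f_j: float, len_j: float,
                        dr_j: float, c_store: float) -> float:
    # Assumed combining form: value scales with how often the sequence
    # occurs, how long it is, and how much reward composition adds.
    return f_j * len_j * dr_j - c_store

def should_create(f_j: float, len_j: float,
                  dr_j: float, c_store: float) -> bool:
    # Create the meta-procedure only when its net value is positive.
    return metaprocedure_value(f_j, len_j, dr_j, c_store) > 0.0

print(should_create(f_j=0.2, len_j=4.0, dr_j=0.1, c_store=0.05))  # True
```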


    Bayesian Prior Initialization

MACLA initializes Beta priors as $\text{Beta}(1, 1)$ (uniform), but this choice lacks justification.

Problem: Uniform priors assume no prior knowledge, yet domain knowledge is available: historical procedure statistics show that procedures extracted from trajectory segments succeed well above chance.

    Proposed Hierarchical Bayesian Prior: Use empirical Bayes to set informative priors:

$$
\rho_i \sim \text{Beta}(\alpha_0, \beta_0), \qquad i = 1, \ldots, N_p
$$

$$
p(\rho_i \mid s_i, f_i) = \text{Beta}(\alpha_0 + s_i, \beta_0 + f_i)
$$

where $s_i$ and $f_i$ are the observed success and failure counts of procedure $i$.

Estimate hyperparameters $(\alpha_0, \beta_0)$ from historical procedure statistics:

$$
(\hat{\alpha}_0, \hat{\beta}_0) = \arg\max_{\alpha_0, \beta_0} \sum_{i=1}^{N_p} \log \int_0^1 p(s_i, f_i \mid \rho)\, \text{Beta}(\rho \mid \alpha_0, \beta_0)\, d\rho
$$

For ALFWorld, maximum likelihood estimation on the first 500 training trajectories yields $\alpha_0 \approx 3.2$, $\beta_0 \approx 1.8$, corresponding to prior mean $\mathbb{E}[\rho] = 3.2/(3.2 + 1.8) \approx 0.64$. This informed prior accelerates learning by 12-18 episodes compared to uniform initialization.
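A minimal empirical-Bayes sketch of this prior fit, using a grid search over the Beta-Binomial marginal likelihood. The toy history and grid are illustrative stand-ins for the paper's actual 500-trajectory fit, and function names are assumptions.

```python
import math

def beta_binom_loglik(alpha0: float, beta0: float, records) -> float:
    """Marginal log-likelihood of (successes, failures) counts under a
    shared Beta(alpha0, beta0) prior, up to a constant (the binomial
    coefficient does not depend on the hyperparameters)."""
    lg = math.lgamma
    ll = 0.0
    for s, f in records:
        # log B(alpha0+s, beta0+f) - log B(alpha0, beta0)
        ll += (lg(alpha0 + s) + lg(beta0 + f) - lg(alpha0 + beta0 + s + f)
               - (lg(alpha0) + lg(beta0) - lg(alpha0 + beta0)))
    return ll

def fit_prior(records, grid=None):
    """Empirical-Bayes grid search for (alpha0, beta0)."""
    grid = grid or [0.5 + 0.1 * k for k in range(60)]  # 0.5 .. 6.4
    return max(((a, b) for a in grid for b in grid),
               key=lambda ab: beta_binom_loglik(ab[0], ab[1], records))

# Toy history of (successes, failures) per procedure: mostly successful,
# so the fitted prior mean should land above 0.5.
history = [(8, 2), (5, 1), (9, 3), (4, 2), (7, 1)]
a0, b0 = fit_prior(history)
print(a0 / (a0 + b0) > 0.5)  # True
```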

    Utility Function Weight Selection

The utility function (Eq. 4 in main paper) uses weights $\lambda_r = 0.5$, $\lambda_f = 0.3$, $\lambda_t = 0.2$ from grid search.

    Problem: These weights are task-specific and require manual tuning for each new domain.

    Proposed Adaptive Weight Learning: Use online gradient-free optimization to learn domain-specific weights. Define meta-objective:

$$
J(\boldsymbol{\lambda}) = \sum_{t=1}^{T} r_t(\boldsymbol{\lambda})
$$

where $r_t(\boldsymbol{\lambda})$ is the reward achieved at episode $t$ using weights $\boldsymbol{\lambda}$.

    Update weights via evolutionary strategy:

$$
\boldsymbol{\lambda} \leftarrow \boldsymbol{\lambda} + \frac{\eta}{n \sigma} \sum_{i=1}^{n} F(\boldsymbol{\lambda} + \sigma \boldsymbol{\epsilon}_i)\, \boldsymbol{\epsilon}_i
$$

where $\boldsymbol{\epsilon}_i \sim \mathcal{N}(0, I)$, $F$ is fitness (cumulative reward), $\eta$ is the learning rate, $\sigma$ is the noise standard deviation, and $n$ is the perturbation population size.

    This eliminates manual tuning while adapting to domain characteristics.
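A compact sketch of the proposed evolutionary-strategy update. The renormalization onto the simplex enforces the $\lambda_r + \lambda_f + \lambda_t = 1$ constraint the paper uses for interpretability; the population size, step sizes, and toy fitness function are all illustrative assumptions.

```python
import random

def es_update(weights, fitness_fn, eta=0.05, sigma=0.1, pop=8, rng=None):
    """One evolutionary-strategy step on the utility weights
    (lambda_r, lambda_f, lambda_t); fitness_fn is cumulative reward."""
    rng = rng or random.Random(0)
    grad = [0.0] * len(weights)
    for _ in range(pop):
        eps = [rng.gauss(0.0, 1.0) for _ in weights]
        f = fitness_fn([w + sigma * e for w, e in zip(weights, eps)])
        grad = [g + f * e for g, e in zip(grad, eps)]
    new = [w + eta / (pop * sigma) * g for w, g in zip(weights, grad)]
    total = sum(max(w, 1e-6) for w in new)        # keep weights positive
    return [max(w, 1e-6) / total for w in new]    # renormalize to sum to 1

# Toy fitness favoring a heavier reliability weight (illustrative only).
fit = lambda w: -abs(w[0] - 0.6)
w = [0.5, 0.3, 0.2]
for _ in range(50):
    w = es_update(w, fit)
print(abs(sum(w) - 1.0) < 1e-9)  # weights remain a valid simplex point
```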

    Contrastive Refinement Evidence Requirement

Refinement activates when $\min(|\mathcal{S}_i|, |\mathcal{F}_i|) \geq 3$.

    Problem: The choice of 3 samples lacks statistical justification. Is this sufficient for reliable pattern extraction?

Statistical Power Analysis: To detect discriminative patterns with confidence $1 - \alpha = 0.95$ and power $1 - \beta = 0.80$, the required sample size is:

$$
n_{\min} = \frac{2\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\text{ES}^2}
$$

where ES is effect size (Cohen's $d$). For medium effect size ES $= 0.5$:

$$
n_{\min} = \frac{2\,(1.96 + 0.84)^2}{0.5^2} \approx 63
$$

This suggests $n_{\min} = 3$ provides very low statistical power ($\approx 0.15$), leading to unreliable refinements.
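The sample-size formula can be checked with the standard library's normal quantile function. A small sketch; `required_n` is an illustrative name.

```python
from statistics import NormalDist

def required_n(effect_size: float, alpha: float = 0.05,
               power: float = 0.80) -> float:
    """Per-group sample size under the normal approximation:
    n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 / ES^2."""
    z = NormalDist().inv_cdf
    return 2.0 * (z(1 - alpha / 2) + z(power)) ** 2 / effect_size ** 2

print(round(required_n(0.5)))  # 63 per group for a medium effect size
```

Larger effects need fewer samples, e.g. `required_n(0.8)` is roughly 25 per group, which is why a handful of contexts only suffices when success/failure contexts differ drastically.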

Recommended Threshold: Use $n_{\min} \in [8, 12]$ for adequate statistical power, or implement sequential testing:

$$
\text{refine } \text{Proc}_i \iff \frac{p(\mathcal{S}_i, \mathcal{F}_i \mid H_1)}{p(\mathcal{S}_i, \mathcal{F}_i \mid H_0)} > K
$$

    using Bayesian hypothesis testing rather than fixed sample size.

    Memory Pruning Utility Function



Large language model (LLM) agents can solve complex, interactive tasks such as web shopping webshop and embodied AI housekeeping agentboard, by transforming natural-language instructions into sequences of environment actions yao2023react. In these settings, agents navigate step-by-step through partially observable environments to pursue subgoals and ultimately complete the task agentboard; xiong2024watch. The resulting trajectory is the ordered record of an episode's interaction, typically written as $(T, A, O, R)$, where $T$ is the task to complete, $A$ the actions, $O$ the observations recording the outcomes of the corresponding actions, and $R$ the step-level outcomes or rewards. Trajectories thus capture the full decision process, not merely terminal success or failure, and provide dense supervision for how an agent progresses through a task xiong2024watch; alfworld. When a new task arrives, the agent synthesizes an appropriate trajectory (that is, a step-by-step plan and its execution) to achieve the goal in the current context, deciding which information to gather, which tools to invoke, and which subroutines to chain in order to achieve completion yao2023react; webshop.

Early LLM agents used prompt-based planning yao2023react and self-critique shinn2023reflexion, but lacked persistent "how-to" procedures: when tasks are similar but not identical, agents must re-plan from scratch, increasing cost and latency. Fine-tuning approaches chen2023fireact; yin2023; zeng2023 adapt agents via supervised learning or RLHF, but typically treat entire trajectories as single units weighted by terminal success/failure, neglecting rich intermediate steps. In practice, failed trajectories often contain correct substeps (e.g., "successfully navigating and retrieving an egg, but failing to boil it" alfworld), while successful ones may include suboptimal actions that accidentally cancel out. Recent work xiong2024watch addresses this via step-level rewards, but requires repeated policy training on densely-labeled data, incurring substantial computational cost.

Figure 1: Comparison between trajectory-based LLM fine-tuning and the MACLA framework, showing the external memory hierarchy.

We introduce MACLA (Memory-Augmented Contrastive Learning Agent), a framework that disentangles reasoning from learning by coupling a frozen LLM with a structured external procedural memory (Figure 1). Unlike fine-tuning approaches where reasoning and adaptation are entangled within billions of parameters, MACLA fixes the LLM as a stable semantic reasoner responsible for trajectory segmentation, abstraction, and action generation. All learning occurs externally through explicit, interpretable memory operations: maintaining human-readable procedures, updating Bayesian posteriors, and refining preconditions through contrastive analysis. MACLA operates through three core mechanisms:

Bayesian procedure selection: Maintains Beta posteriors $\text{Beta}(\alpha_i, \beta_i)$ over procedure success rates and ranks candidates via expected-utility scoring that balances contextual relevance, success probability, failure risk, and information gain, providing principled exploration-exploitation.

Contrastive refinement: Compares successful and failed execution contexts to tighten preconditions, repair action sequences, and refine postconditions once procedures accumulate sufficient evidence (i.e., $\geq$ a threshold), progressively improving procedure quality through memory edits rather than gradient updates.

    Meta-procedural learning: Composes frequently co-occurring procedures into hierarchical “playbooks” with conditional control policies (continue, skip, repeat, abort) for long-horizon tasks, enabling strategic reuse beyond atomic skills.

    This architecture yields sample-efficient, interpretable agents with human-readable procedural knowledge, closed-form utility computation, and minimal LLM usage. Specifically, this work contributes:

    Online procedural memory adaptation: Continual updates to procedural and meta-procedural memory during and after episodes, enabling adaptation without weight updates, compared with offline LLM post-training approaches zeng2023; song2024trial; xiong2024watch that remain static at inference.

    Reasoning/learning decoupling: A frozen LLM for parsing and abstraction with all improvements occurring in an external, structured procedural memory, avoiding the computational cost and catastrophic forgetting risks of parameter fine-tuning.

    Bayesian uncertainty-aware selection: A principled procedure selection module that maintains Beta posteriors over success rates with closed-form expected utility objectives balancing relevance, success probability, failure risk and information gain.

    Contrastive procedural refinement: An algorithm leveraging paired successes and failures to tighten preconditions, repair action schemas, and refine postconditions of stored procedures without requiring expert demonstrations.

We evaluate MACLA across four benchmarks (ALFWorld agentboard, WebShop webshop, TravelPlanner xie2024travelplanner, InterCodeSQL yang2306intercode), achieving 78.1% average performance, the highest among all methods, including those using models 10× larger (Table 5.2). On ALFWorld alfworld, MACLA reaches 87.2% on seen and 90.3% on unseen tasks, with a positive generalization gap (+3.1%) indicating compositional transfer rather than overfitting. The system achieves this with only 0.016 GPU-hours for one-time memory construction, 2,800× faster than the state-of-the-art LLM parameter-training baseline xiong2024watch, which requires 44.8 GPU-hours of iterative training, while simultaneously producing human-interpretable procedural knowledge.

LLM agents have advanced rapidly in reasoning and decision-making, enabling multi-step interaction in embodied and web-based environments. Early frameworks such as ReAct yao2023react and Reflexion reflexion integrate reasoning and acting within the same loop, while trajectory-tuning methods chen2023fireact; xiong2024watch fine-tune models using expert demonstrations. However, fine-tuning is computationally expensive, requires offline data collection and training cycles, and does not support true online adaptation at inference time. To overcome this issue, a line of research augments LLM agents with memory for continuous reasoning. Memory is a foundational component of language agents, supporting competence across multiple timescales from transient working context to persistent long-term knowledge zhang2024memorysurvey; liu2025foundationagents; li2025memos. Research on memory for LLM agents can be usefully organized along two directions: where memory resides and what is stored. Along the first direction, some methods, such as MemGPT memgpt and MemoryBank zhong2024memorybank, use buffer-based systems to store conversational or episodic traces and retrieve them with embedding search and simple heuristics. Others, such as HiAgent hu2024hiagent, A-Mem xu2025amem, and MemAgent yu2025memagent, use hierarchical designs that separate working buffers from episodic and long-term stores to relieve context pressure and improve persistence. Recently, SAGE liang2025sage used reflective multi-agent controllers to curate these stores while controlling growth. The second direction concerns what is stored. Many systems retain free-form text snippets such as notes, summaries, or dialogue chunks; these are easy to write but suffer from retrieval drift and weak compositionality as repositories scale memgpt; zhong2024memorybank.
More structured artifacts appear as tuples and key-value frames (e.g., tool logs or entity/event graphs), which aid filtering but still lack executable semantics for reuse. A growing line of work targets skills and procedures: agents capture reusable action patterns, tool workflows, and instruction-like steps across related tasks voyager; wang2024awm; chen2024automanual. Memp fang2025memp advances this view by treating procedural memory as a first-class object and studying its construction, retrieval, and update across domains. However, several key limitations remain: (1) it represents know-how largely as monolithic text (scripts or full trajectories) with heuristic retrieval and simple updates; (2) it lacks uncertainty-aware selection or a principled exploration-exploitation balance, preventing reasoning about the reliability or risk of retrieved memory; and (3) it lacks a mechanism to refine procedures from paired successes and failures or to abstract recurring patterns into meta-procedural compositions. In contrast, we represent experience as structured, hierarchical procedures with explicit preconditions, action schemas, and postconditions, enabling interpretable reuse, safe composition, and direct schema edits when evidence warrants change. The proposed approach enables the system to continuously adapt and improve.


    The key components of MACLA are described in detail below.

The first stage transforms raw episodic trajectories into structured, reusable procedural knowledge. Given a trajectory $\tau = \{(o_t, a_t, r_t)\}_{t=0}^{T}$ consisting of textual observations $o_t$, primitive actions $a_t$, and rewards $r_t$, the frozen LLM $\mathcal{L}_{\theta}$ receives the full trajectory and identifies semantically coherent segments that correspond to meaningful sub-tasks:

where each segment $k$ spans time steps $[t_k^{\text{start}}, t_k^{\text{end}}]$ and is summarized by a description $d_k$. For each segment, MACLA constructs a structured procedure $\text{Proc}_k = \langle \mathcal{G}_k, \Psi_k, \pi_k, \Phi_k \rangle$, where $\mathcal{G}_k$ is a natural-language goal, $\Psi_k$ are precondition patterns inferred from the observations before the segment, $\pi_k$ is an abstracted action sequence, and $\Phi_k$ are postcondition patterns extracted from the final observations. This decomposition produces interpretable "how-to" skills that can be invoked whenever their preconditions are met. To support retrieval and merging, each procedure is embedded into a semantic vector space using an encoder $\phi$, $\mathbf{e}_k = \phi([\mathcal{G}_k; \Psi_k; \Phi_k]) \in \mathbb{R}^d$. When a new procedure is created, it is compared to existing ones via cosine similarity, $i^* = \arg\max_i \text{sim}(\mathbf{e}_k, \mathbf{e}_i)$. If $\text{sim}(\mathbf{e}_k, \mathbf{e}_{i^*}) > \theta_{\text{dup}}$, the new procedure is merged into the existing one by expanding its condition sets; otherwise, a new entry is added. This process yields a continually growing procedural library $\mathbb{M}_{\text{proc}} = \{(\text{Proc}_i, \mathbf{e}_i, \alpha_i, \beta_i)\}_{i=1}^{N_p}$ that forms the foundation for later Bayesian selection and refinement.
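The duplicate-detection step can be sketched as follows. This is a minimal stand-in, assuming a brute-force similarity scan in place of the real ANN index, and `add_procedure` with its condition-set merge is an illustrative simplification of the paper's merge operation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def add_procedure(library, new_emb, new_proc, theta_dup=0.85):
    """Merge into the most similar stored procedure if similarity exceeds
    theta_dup; otherwise append a new entry [embedding, proc, alpha, beta]."""
    if library:
        i_star = max(range(len(library)),
                     key=lambda i: cosine(library[i][0], new_emb))
        if cosine(library[i_star][0], new_emb) > theta_dup:
            library[i_star][1]["conditions"].update(new_proc["conditions"])
            return i_star
    library.append([new_emb, new_proc, 1, 1])  # Beta(1,1) prior counts
    return len(library) - 1

lib = []
add_procedure(lib, [1.0, 0.0], {"conditions": {"door open"}})
idx = add_procedure(lib, [0.99, 0.05], {"conditions": {"door ajar"}})
print(len(lib), idx)  # 1 0: near-duplicate merged into the existing entry
```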

Given the procedural library, the agent must decide which procedure to execute for the current observation. Each procedure $\text{Proc}_i$ maintains a Beta posterior over its success probability $\rho_i \in [0, 1]$:

where $\alpha_i$ and $\beta_i$ accumulate successful and failed executions from history $\mathcal{D}_i$. The posterior mean $\mathbb{E}[\rho_i] = \alpha_i/(\alpha_i + \beta_i)$ estimates current reliability, while the variance $\text{Var}[\rho_i] = \frac{\alpha_i \beta_i}{(\alpha_i + \beta_i)^2 (\alpha_i + \beta_i + 1)}$ quantifies epistemic uncertainty. For each candidate, we compute expected utility by integrating over the Beta posterior. Given utility $U(\rho \mid o_t, i) = \mathrm{Rel}_i(o_t) \cdot \rho \cdot R_{\max} - \mathrm{Risk}_i(o_t) \cdot (1 - \rho) \cdot C_{\mathrm{fail}} + \lambda_{\mathrm{info}} \cdot I(\rho)$, the expected utility is:

Exploiting $\mathbb{E}_{\mathrm{Beta}(\alpha,\beta)}[\rho] = \frac{\alpha}{\alpha+\beta}$ and $\mathbb{E}[1-\rho] = \frac{\beta}{\alpha+\beta}$, this simplifies to:

where $\mathrm{Rel}_i(o_t) = \cos(\phi(o_t), \mathbf{e}_i)$ is contextual similarity, $\mathrm{Risk}_i(o_t)$ is the fraction of past failures with similar contexts, and $H[\cdot]$ is differential entropy encouraging exploration. The selected procedure is:

subject to confidence threshold $\theta_{\text{conf}}$. If $\max_i \mathrm{EU}(\text{Proc}_i \mid o_t) < \theta_{\text{conf}}$, the agent falls back to zero-shot LLM reasoning. This Bayesian selection mechanism balances exploitation (high $\frac{\alpha}{\alpha+\beta}$ procedures), risk aversion (avoiding contexts similar to past failures), and exploration (high-entropy procedures). The expected utility formulation naturally handles the explore-exploit tradeoff: early in learning, high entropy dominates selection, while after sufficient evidence accumulates, expected reward becomes the primary driver.
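The closed-form scoring and the fallback rule can be sketched as follows. One stated simplification: the exploration bonus here uses the posterior variance rather than the paper's differential entropy (both shrink as evidence accumulates), and all names and candidate values are illustrative.

```python
def expected_utility(alpha, beta, rel, risk,
                     r_max=1.0, c_fail=0.5, lam_info=0.1):
    """Closed-form EU under the Beta(alpha, beta) posterior; variance
    stands in for the entropy-based exploration bonus (an assumption)."""
    n = alpha + beta
    p_succ = alpha / n
    var = alpha * beta / (n ** 2 * (n + 1))
    return rel * p_succ * r_max - risk * (1 - p_succ) * c_fail + lam_info * var

def select(candidates, theta_conf=0.4):
    """candidates: list of (name, alpha, beta, relevance, risk).
    Returns the best procedure name, or None for zero-shot LLM fallback."""
    scored = [(expected_utility(a, b, rel, risk), name)
              for name, a, b, rel, risk in candidates]
    best_eu, best_name = max(scored)
    return best_name if best_eu > theta_conf else None

cands = [("open_fridge", 9, 1, 0.9, 0.1),   # reliable and relevant
         ("boil_egg",    2, 6, 0.8, 0.6)]   # unreliable and risky
print(select(cands))  # open_fridge
```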

As experience accumulates, procedures with both successful and failed instances are subjected to contrastive refinement to improve their accuracy and robustness. For a procedure $\text{Proc}_i$ with sets of successful and failed contexts $\mathcal{S}_i$ and $\mathcal{F}_i$, the LLM performs discriminative comparison, $\mathcal{D}_i = \text{ContrastiveExtract}(\mathcal{S}_i, \mathcal{F}_i)$, identifying differences in three dimensions: (i) precondition patterns ($\Delta\Psi_i^{+}$ and $\Delta\Psi_i^{-}$) that distinguish successful from failed initial contexts, (ii) action discrepancies ($\Delta\pi_i$) revealing missing or misordered actions, and (iii) postcondition mismatches ($\Delta\Phi_i$) that capture incomplete goal states. These discriminators drive explicit refinement operations.

    When distinct execution modes are detected, the procedure is specialized into separate variants with inherited reliability priors. This process progressively tightens applicability conditions and action precision, yielding interpretable improvements purely through memory edits rather than gradient updates.
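A toy stand-in for the discriminative comparison: MACLA delegates ContrastiveExtract to the LLM, so this frequency-based token contrast only illustrates the kind of precondition discriminators it produces; thresholds and names are assumptions.

```python
def contrastive_discriminators(successes, failures, min_support=0.8):
    """Find context tokens frequent in successful initial contexts but
    rare in failed ones (candidates for Psi+), and vice versa (Psi-)."""
    def support(token, contexts):
        return sum(token in c for c in contexts) / len(contexts)
    vocab = set().union(*successes, *failures)
    pos = {t for t in vocab
           if support(t, successes) >= min_support
           and support(t, failures) <= 1 - min_support}
    neg = {t for t in vocab
           if support(t, failures) >= min_support
           and support(t, successes) <= 1 - min_support}
    return pos, neg

# Toy contexts for a "boil egg" procedure (sets of observed facts).
succ = [{"stove on", "pot filled"}, {"stove on", "pot filled", "egg held"}]
fail = [{"stove off", "pot filled"}, {"stove off"}]
pos, neg = contrastive_discriminators(succ, fail)
print(pos, neg)  # "stove on" discriminates success; "stove off" failure
```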

To extend reasoning beyond atomic skills, MACLA automatically discovers and learns meta-procedures: structured compositions of procedures that capture recurrent long-horizon strategies. When a sequence of procedures $\langle \text{Proc}_{i_1}, \ldots, \text{Proc}_{i_m} \rangle$ repeatedly leads to success under a common high-level goal, the agent abstracts it as $\text{MP}_j = \langle \mathcal{G}_j^{\text{meta}}, \Psi_j^{\text{meta}}, \{\text{Proc}_{i_1}, \ldots, \text{Proc}_{i_m}\}, \Theta_j \rangle$. Here, $\Theta_j$ denotes a lightweight control policy governing conditional transitions among sub-procedures based on the current observation and execution context, $\Theta_j(o_t, \text{index}) \in \{\text{continue}, \text{skip}, \text{repeat}, \text{abort}\}$. This policy is distilled by analyzing successful traces, where the LLM identifies observation patterns that triggered each branch, for example, repeating when postconditions are unmet, skipping when preconditions already hold, or aborting when failures recur. Each meta-procedure maintains its own Beta success posterior $p(\sigma_j \mid \mathcal{D}_j) = \text{Beta}(\alpha_j, \beta_j)$ and is refined periodically to add new branches, reorder sub-procedures, or prune redundant ones. Through these hierarchical compositions, MACLA acquires flexible "playbooks" that encapsulate extended strategies with conditional logic.

To enable cross-context generalization (e.g., procedures learned on "mug" applying to "cup"), MACLA constructs a lightweight ontological semantic index during offline memory construction. We extract the $k_{\text{vocab}}$ most frequent words from task descriptions and actions, then cluster semantically similar words using SentenceTransformer embeddings reimers2019sentence to form an implicit domain ontology:

where each cluster $\mathcal{C}_w$ represents a semantic category (e.g., $\mathcal{C}_{\text{container}} = \{\text{mug}, \text{cup}, \text{glass}\}$). During retrieval, observations are mapped to these ontological categories, allowing procedures to match across lexically different but semantically equivalent contexts. This ontological grounding enables domain-adaptive generalization without requiring explicit knowledge engineering.
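The clustering step can be sketched with a greedy threshold rule. Hand-made 2-D vectors stand in for the SentenceTransformer embeddings the paper uses, and the threshold and function name are illustrative assumptions.

```python
import math

def greedy_clusters(embeddings, threshold=0.8):
    """Group words whose cosine similarity to a cluster's seed vector
    exceeds `threshold`; each unmatched word seeds a new cluster."""
    def cos(u, v):
        d = sum(a * b for a, b in zip(u, v))
        return d / (math.sqrt(sum(a * a for a in u))
                    * math.sqrt(sum(b * b for b in v)))
    clusters = []  # list of (seed_vector, [member words])
    for word, vec in embeddings.items():
        for seed, members in clusters:
            if cos(seed, vec) > threshold:
                members.append(word)
                break
        else:
            clusters.append((vec, [word]))
    return [members for _, members in clusters]

# Toy 2-D stand-ins for sentence embeddings.
toy = {"mug":   [0.9, 0.1], "cup":  [0.88, 0.15],
       "glass": [0.85, 0.2], "sofa": [0.1, 0.95]}
print(greedy_clusters(toy))  # containers grouped together, 'sofa' apart
```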

To ensure practical scalability, MACLA employs efficient retrieval, bounded growth, and strict control over LLM usage. All procedures and meta-procedures are embedded in an approximate nearest-neighbor index supporting sublinear retrieval ($O(\log N_p)$) for semantic search. The episode buffer stores at most $N_b = 1000$ steps, providing local context for LLM prompts and post-episode updates. Each procedure maintains a failure index limited to $K_{\text{fail}} = 15$ entries, managed through success-based removal, redundancy-aware eviction, and temporal decay, ensuring that memory remains concise and informative. To prevent memory saturation, procedures and meta-procedures are periodically pruned using a multi-factor utility score that balances reliability, usage frequency, and temporal relevance:

$$
U_i = \lambda_r \frac{\alpha_i}{\alpha_i + \beta_i} + \lambda_f \frac{n_i}{N_{\text{total}}} + \lambda_t \exp\!\left(-\frac{t_{\text{current}} - t_i^{\text{last}}}{\tau}\right)
$$

where $\frac{\alpha_i}{\alpha_i + \beta_i}$ is the Bayesian success rate (reliability), $n_i$ is the execution count of procedure $i$, $N_{\text{total}}$ is the total invocations across all procedures in the current episode window, $t_{\text{current}}$ is the current episode index, $t_i^{\text{last}}$ is the episode when $i$ was last used, and $\tau$ is the temporal decay constant.

The weighting coefficients $\lambda_r = 0.5$, $\lambda_f = 0.3$, and $\lambda_t = 0.2$ reflect the relative importance of each factor: reliability receives the highest weight (0.5) as it directly predicts future success; frequency receives moderate weight (0.3) to favor well-tested procedures while avoiding over-retention of obsolete frequently-used skills; recency receives the lowest weight (0.2) to provide soft temporal decay without aggressive forgetting. These values were determined through grid search over $\{0.3, 0.4, 0.5, 0.6\} \times \{0.2, 0.3, 0.4\} \times \{0.1, 0.2, 0.3\}$ on ALFWorld validation, with the constraint $\lambda_r + \lambda_f + \lambda_t = 1.0$ for interpretability. The selected configuration (0.5, 0.3, 0.2) yielded the best balance between retaining high-quality procedures (>0.7 success rate) and pruning low-utility entries (<0.4 success rate), as validated later in Figure 4. Entries with the lowest utility are removed while ensuring diversity across goal clusters through stratified sampling. These operations keep the total memory footprint below 4 MB for hundreds of procedures.
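The pruning score can be sketched as a weighted sum of the three factors. The exponential recency decay with constant $\tau$ is an assumed functional form consistent with the "temporal decay constant" described above, and all names are illustrative.

```python
import math

def prune_utility(alpha, beta, n_i, n_total, t_now, t_last,
                  tau=50.0, lam=(0.5, 0.3, 0.2)):
    """Multi-factor pruning score: reliability + frequency + recency,
    with the paper's (0.5, 0.3, 0.2) weights as the default."""
    lr, lf, lt = lam
    reliability = alpha / (alpha + beta)          # Bayesian success rate
    frequency = n_i / max(n_total, 1)             # share of invocations
    recency = math.exp(-(t_now - t_last) / tau)   # assumed decay form
    return lr * reliability + lf * frequency + lt * recency

fresh = prune_utility(alpha=8, beta=2, n_i=10, n_total=40,
                      t_now=100, t_last=98)
stale = prune_utility(alpha=2, beta=8, n_i=1, n_total=40,
                      t_now=100, t_last=10)
print(fresh > stale)  # the reliable, recent procedure outranks the stale one
```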

Finally, MACLA limits LLM usage to a fixed budget of API calls per episode to cover segmentation, abstraction, and occasional refinement, while all retrieval, Bayesian scoring, and updates are symbolic or vectorized. As a result, per-step runtime remains effectively constant and inference cost does not scale with experience. This memory-first design ensures that MACLA remains efficient, interpretable, and deployable for continual learning across long interaction horizons. The theoretical foundations are provided in Appendix D.

At runtime, MACLA executes a new task by coupling frozen semantic reasoning with memory-driven decision making. The agent receives an initial observation $o_0$ (and, optionally, an instruction string) and embeds it as $\mathbf{h}_0 = \phi(o_0)$. This embedding queries the semantic index of the external memory to retrieve a compact candidate set consisting of procedures $\{\text{Proc}_i\}$ and meta-procedures $\{\text{MP}_j\}$ whose embeddings are most similar to $\mathbf{h}_0$. Retrieval is approximate nearest neighbor over the concatenated descriptors of goals, preconditions, and postconditions, which keeps lookup sublinear in memory size.

Given the candidate set, MACLA ranks each item with a Bayesian expected-utility score that trades off contextual relevance, estimated success, risk, and information gain under the procedure's Beta posterior. The highest-scoring item above a confidence threshold is selected; otherwise the agent falls back to zero-shot LLM reasoning for that step, logs the outcome, and continues. If a meta-procedure is chosen, execution proceeds hierarchically under its composition policy $\Theta_j(o_t, \text{index}) \in \{\text{continue}, \text{skip}, \text{repeat}, \text{abort}\}$ until completion or abort; if an atomic procedure is chosen, the agent checks preconditions $\Psi_i$ against $o_t$, invokes the action sketch $\pi_i$ via the frozen LLM's action formatter, and verifies postconditions $\Phi_i$ to certify completion. In both cases the outcome updates $(\alpha, \beta)$ and appends the initial context to the corresponding success or failure set for later analysis.

After each execution, the agent re-embeds the new observation and repeats retrieval and selection until the task is solved or a horizon is reached. When a procedure accumulates both successes and failures, a contrastive pass is triggered: the LLM proposes discriminators that tighten $\Psi_i$, repair $\pi_i$, and refine $\Phi_i$, or, if distinct modes are detected, specializes the procedure into variants that inherit prior counts. When successful episodes repeatedly traverse a small set of procedures in a stable order, the agent abstracts a meta-procedure with its own success posterior and a lightweight $\Theta_j$ distilled from divergence points across traces. Throughout, memory remains bounded by pruning with a utility that blends reliability, frequency, and recency, and the LLM-call budget is capped, as retrieval, scoring, and updates are vectorized operations. The complete runtime procedure is outlined in Algorithm 1.

    We evaluate MACLA on four challenging interactive agent benchmarks spanning diverse domains. All experiments use consistent hyperparameters across tasks to demonstrate generalization without task-specific tuning.

Memory Architecture: Episode buffer $N_{\text{buffer}} = 1000$ (stores recent observations and actions for temporal context during action generation); procedural memory $N_{\text{proc}} = 200$ (capacity for extracted reusable skills); meta-procedural memory $N_{\text{meta}} = 50$ (capacity for hierarchical procedure compositions). Critically, MACLA does not store raw trajectories. Instead, the LLM segments each episode into coherent sub-tasks and extracts structured procedures (Section 4.1). Duplicate detection with similarity threshold $\theta_{\text{dup}} = 0.85$ prevents redundant storage. Through this process, the 2,851 ALFWorld training trajectories compress into approximately 187 unique procedures, demonstrating efficient knowledge distillation from experience.

Bayesian Selection. Information gain weight $\lambda_{\text{info}} = 0.1$, failure cost $C_{\text{fail}} = 0.5$. These parameters balance exploration (trying uncertain procedures to reduce epistemic uncertainty) with exploitation (selecting high-posterior reliable procedures).

Contrastive Refinement. Minimum contexts $n_{\min}^{s} = n_{\min}^{f} = 3$. Refinement activates only when a procedure has accumulated at least 3 successes and 3 failures, ensuring sufficient statistical evidence for discriminative pattern extraction.

LLM Configuration. Llama-2-7B touvron2023llama via Ollama with 4-bit quantization and temperature $T = 0.7$. The LLM parameters remain frozen throughout all experiments; learning occurs exclusively through external memory updates.

Benchmarks and Dataset Statistics: ALFWorld alfworld (2,851 train, 274 test) is a text-based embodied environment with six household tasks (e.g., retrieval, placement). We follow the standard train/validation-seen/validation-unseen split, where test trajectories feature novel object-location configurations. WebShop webshop (1,624 train, 200 test) simulates e-commerce search over 12,087 products, requiring agents to follow natural-language instructions via multi-step navigation and filtering. TravelPlanner xie2024travelplanner (1,000 train, 180 validation, 45 test) involves multi-day itinerary planning under hard constraints (budget, dates) and soft preferences (cuisine, attractions). Evaluation uses Common Sense (CS) and Hard Constraint (HC) scores. InterCodeSQL yang2306intercode benchmarks interactive text-to-SQL generation over diverse schemas, requiring correct handling of schema relationships and varying query difficulty.

    Table 5.2 compares MACLA against state-of-the-art baselines across all benchmarks. We organize baselines into three paradigms: prompt-based methods using in-context learning, outcome refinement approaches optimizing trajectory-level rewards, and process refinement methods refining step-level generation. MACLA achieves the highest average performance (78.1%) while using a 7B parameter model, demonstrating that domain-agnostic procedural memory with Bayesian selection and contrastive refinement enables competitive performance without task-specific engineering.

    †Substantially larger models (Claude-3.5: proprietary, Qwen2.5: 72B vs. 7B parameters). TravelPlanner reports Commonsense (CS) score; other benchmarks report task completion reward.

    In Table 5.2, MACLA achieves state-of-the-art results on TravelPlanner (83.3 CS) and ALFWorld-Unseen (90.3%), outperforming methods that rely on models 10× larger. Its strong performance across all benchmarks demonstrates cross-domain generalization, while the positive generalization gap on ALFWorld (+3.1 points for unseen vs. seen) indicates robust compositional transfer rather than memorization.

    Ablation Study

    To understand the contribution of each component in MACLA, we conduct an ablation study by systematically removing key modules. Table 2 reports results on ALFWorld (seen and unseen splits), evaluating: (1) Bayesian procedural selection (Section 4.2), (2) contrastive learning from failed trajectories (Section 4.3), (3) meta-procedural composition (Section 4.4), and (4) ontological semantic grounding (Section 4.5). Removing Bayesian selection leads to the largest degradation (–7.7 seen, –9.1 unseen), highlighting its role in effective exploration. Meta-procedural composition is essential for compositional generalization, with a sharp drop on unseen tasks (–11.9). Contrastive learning and ontological clustering provide smaller but consistent improvements (–3.5/–4.6 and –4.3/–6.2 respectively). Overall, all four components contribute synergistically to MACLA’s robustness.

    Bayes.: probabilistic selection (Sec. 4.2); Contr.: success/failure refinement (Sec. 4.3); Meta: hierarchical composition (Sec. 4.4); Ontol.: semantic clustering (Sec. 4.5).

    Computational and Memory Efficiency Analysis

    MACLA’s efficiency comes from three design choices: (1) the frozen LLM eliminates gradient updates, (2) external memory construction is trivially parallelizable, and (3) learned procedures amortize LLM costs across episodes. Table 3 summarizes training costs.

    Training cost: IPR = 5.6h on 8×A100 (44.8 GPU-hrs); MACLA = 56s on 1×RTX 3090 (0.016 GPU-hrs), representing a 2,800× reduction. MACLA’s frozen-LLM architecture eliminates iterative parameter training while achieving superior generalization on unseen tasks (+15.6 points on ALFWorld-Unseen vs IPR).

    Training and Adaptation

    MACLA builds memory in 56s (2,800× faster than IPR xiong2024watch) by extracting reusable procedures with a frozen LLM instead of performing iterative parameter updates. For new tasks, IPR post-trains for 44.8 GPU-hrs, whereas MACLA ingests new trajectories in seconds. Memory construction scales nearly linearly with resources.

    Memory Capacity and Performance Saturation

    Figure 2 reveals logarithmic performance growth across three capacity regimes: (1) Undercapacity (25–50): Sharp degradation (64.1% unseen at 25) due to insufficient task coverage, forcing frequent zero-shot fallback. Low posterior (0.61) indicates pruning removes procedures before adequate validation. (2) Optimal (100–200): Rapid improvement (85.6%→90.3% unseen), capturing core reusable procedures. The system extracts 187 unique procedures from 2,851 training trajectories (15:1 compression), leaving 13 of 200 slots unused—indicating automatic discovery of task-space boundaries. (3) Overcapacity (300): Performance declines (-0.2% unseen) despite more slots, as redundant variants introduce retrieval noise. The posterior plateau at 0.79 confirms saturation. This bounded growth (3.6 MB footprint) contrasts with neural approaches requiring unbounded parameter expansion, demonstrating ALFWorld’s task space has finite complexity discoverable through procedural abstraction.

    Bayesian Posterior Evolution and Convergence

    Figure 3 demonstrates uncertainty-aware learning through Bayesian posterior evolution. Panel (a) shows diverging α trajectories reflecting the explore-exploit tradeoff: general-purpose procedures (Navigate) accumulate evidence fastest through frequent invocation, while specialized procedures (Heat/Cool) converge more slowly but maintain high posteriors when applicable. Panel (b) reveals that all top procedures stabilize above the 0.75 reliability threshold within 50 test episodes, with posterior variance decreasing as evidence accumulates:

    By episode 50, α + β > 30 for all procedures, yielding standard deviations below 0.05—demonstrating principled uncertainty quantification. This self-reinforcing cycle ensures memory quality: poor procedures accumulate failures (high β), receive low utility scores, and are pruned before reaching high evidence totals.
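The variance reduction can be checked with the closed-form Beta moments (mean α/(α+β), variance αβ/((α+β)²(α+β+1))); the counts below are illustrative, not taken from the paper's logs:

```python
import math

def beta_mean(alpha: float, beta: float) -> float:
    """Posterior mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

def beta_std(alpha: float, beta: float) -> float:
    """Posterior standard deviation of a Beta(alpha, beta) distribution."""
    n = alpha + beta
    var = alpha * beta / (n * n * (n + 1))
    return math.sqrt(var)

# Same 0.75 success ratio, growing evidence: the posterior tightens.
print(round(beta_std(6, 2), 3))    # 0.144 (8 observations)
print(round(beta_std(30, 10), 3))  # 0.068 (40 observations)
```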

    Figure 4 validates MACLA’s self-regulating pruning mechanism. Panel (a) shows clear distributional separation: 73% of pruned procedures have success rates below 0.5 (primarily spurious extractions from failed exploration trajectories), while 81% of retained procedures exceed 0.7. The utility-based criterion effectively discriminates signal from noise:

    Panel (b) reveals 68% of pruned procedures are both young (<40 trajectories old) and rarely used (<5 invocations)—the system identifies unpromising candidates early rather than wasting execution budget. Critically, the top-right quadrant is empty: no high-quality procedures (>0.7 success, >10 uses) are pruned, confirming conservative retention. This automatic quality control explains why performance plateaus at 187 procedures (mean posterior 0.79) without manual curation.

    Figure 5 explains the SQL underperformance through three metrics. Low reuse (51%): SQL queries are schema-specific; e.g., customers.age does not apply to employees.experience. ALFWorld generalizes via semantic placeholders (e.g., <object>), but SQL column names vary unpredictably. Low reliability (64%): schema mismatches, join complexity, and edge cases accumulate failures (β counts), suppressing posteriors. Minimal composition (18%): SQL queries are atomic (2-3 actions), too short for meta-procedures, whereas ALFWorld tasks naturally decompose into multi-step sub-procedures. MACLA excels when tasks have (1) reusable actions, (2) hierarchical structure, and (3) consistent semantics; SQL violates all three.

    We presented MACLA, a framework that decouples reasoning from learning by maintaining a frozen LLM and performing all adaptation in an external hierarchical procedural memory through Bayesian selection, contrastive refinement, and meta-procedural composition. MACLA achieves 78.1% average performance across four benchmarks using only a 7B model, with state-of-the-art results on ALFWorld (87.2% seen; 90.3% unseen) and TravelPlanner (83.3%). The system compresses 2,851 ALFWorld training trajectories into 187 reusable procedures through semantic abstraction and duplicate detection, demonstrating efficient knowledge distillation.

    This section provides comprehensive ablation studies examining MACLA’s component contributions, memory scaling behavior, and task-specific effectiveness. These experiments address critical questions about system design choices and identify performance bottlenecks across different benchmarks. Table 4 systematically evaluates the contribution of each MACLA component by measuring performance degradation when individual modules are removed. Beyond success rates, we track memory dynamics (procedure/meta-procedure counts), behavioral patterns (reuse rate), and computational efficiency (LLM calls per episode).

    Proc./Meta Count: final memory size after 200 episodes. Reuse Rate: % of actions from retrieved procedures vs. zero-shot LLM. LLM Calls: average per episode.

    Bayesian Selection (–7.8 seen, –9.1 unseen): Removing Bayesian selection causes the largest performance degradation. Without uncertainty-aware ranking, the system retrieves procedures based solely on semantic similarity, often selecting plausible-but-unreliable skills. The reuse rate drops to 62% (from 78%) as low-quality procedures fail during execution, forcing more frequent LLM fallback (+2.2 calls/episode). Critically, the unseen performance drop (9.1 points) exceeds the seen drop (7.8 points), indicating that exploration-exploitation balance is especially crucial for generalization.

    Meta-Procedures (–5.9 seen, –11.9 unseen): Meta-procedures are essential for compositional generalization. The dramatic unseen performance drop (11.9 points vs. 5.9 seen) reveals that long-horizon unseen tasks require hierarchical planning. Without meta-procedures, the agent must re-compose atomic procedures for each episode, leading to higher LLM usage (9.1 vs. 6.2 calls) and suboptimal action sequences. The positive generalization gap (+3.1 in full MACLA) completely reverses to negative (–2.8 without meta-procedures).

    Contrastive Learning (–3.6 seen, –4.6 unseen): Removing contrastive refinement yields a moderate but consistent degradation. Interestingly, the system accumulates more procedures (201 vs. 187) because it cannot identify and prune low-quality skills extracted from failed trajectories. The reuse rate drops to 71%, suggesting procedures have weaker preconditions and apply in inappropriate contexts. Contrastive learning’s role is quality control—sharpening when procedures should/shouldn’t execute.

    Ontology (–4.4 seen, –6.2 unseen): Semantic grounding provides consistent improvements, particularly for unseen tasks. The ontology enables better generalization by mapping novel object-location configurations to known semantic categories (e.g., “mug” generalizes via the container ontology). The effect is moderate because MACLA’s embedding-based retrieval already captures some semantic similarity.

    Synergistic Effects: No single component accounts for MACLA’s full performance. The combination of Bayesian selection (uncertainty-aware), contrastive learning (quality refinement), and meta-procedures (hierarchical composition) creates synergistic effects. Bayesian selection identifies reliable procedures, contrastive learning makes them more robust, and meta-procedures compose them efficiently.

    Table 5 investigates the relationship between memory capacity and performance, addressing whether larger memory always yields better results or if there exists an optimal capacity.

    Actual Proc.: number of procedures after training (may be fewer than capacity if not all slots are filled). Avg α/(α+β): mean posterior success rate across all procedures.

    Severe Undercapacity (25-50): At capacity 25, performance is substantially degraded (68.3% seen, 64.1% unseen) despite all 25 slots being filled. The system cannot maintain sufficient task coverage—ALFWorld has six task types (pick-and-place, clean, heat, cool, examine, slice), each requiring 4-6 procedures. With only 25 slots, frequent pruning of still-useful procedures forces fallback to zero-shot LLM. The low average posterior (0.61) indicates retained procedures have marginal reliability.

    Optimal Range (150-200): Performance peaks in this range with minimal difference between 150 (86.4/88.7) and 200 (87.2/90.3). The actual procedure count at capacity 200 is only 187—the system did not fill all available slots, suggesting it has identified all meaningfully distinct procedures. The average posterior plateaus at 0.79, indicating quality saturation.

    Overcapacity (300): Increasing capacity to 300 yields negligible improvement (87.1/90.1, slightly lower than 200). The actual procedure count increases only to 203 (16 more than capacity 200), and the average posterior remains 0.79. This demonstrates that additional capacity stores redundant variants rather than fundamentally new skills. The slight performance decrease may reflect increased retrieval noise—more candidates to rank increases the chance of selecting suboptimal procedures.

    Posterior Convergence: The average α/(α+β) steadily increases from 0.61 to 0.79 as capacity grows from 25 to 200, then plateaus. At low capacity, only the absolute best procedures survive aggressive pruning—but coverage is insufficient. At optimal capacity (150-200), the library balances quality and coverage. Beyond 200, quality does not improve because the task space has been saturated.

    Implication for Memory Design: The saturation at 150-200 procedures suggests ALFWorld’s effective task complexity is finite and discoverable. MACLA automatically identifies this structure through Bayesian selection and utility-based pruning, without manual tuning. This contrasts with neural approaches where memory grows unboundedly with training data.

    Table 6 provides a diagnostic analysis explaining why MACLA excels on embodied tasks (ALFWorld, TravelPlanner) but underperforms on structured query tasks (InterCodeSQL). We measure six orthogonal metrics capturing procedural reusability, reliability, and compositional structure.

    Proc. Used: unique procedures per episode. Reuse Rate: % actions from memory. Avg α/(α+β): posterior success. Meta Hit: % episodes using meta-procedures. Proc. Len: avg actions per procedure.

    Low Reusability (51% vs. 76-78%): InterCodeSQL exhibits the lowest memory reuse rate. SQL queries are highly schema-specific—a procedure learned on a customers table rarely transfers to an orders table despite similar query logic. In contrast, ALFWorld procedures generalize via semantic placeholders: “take <object> from <location>” applies to any object-location pair. The high variance in procedures used per episode (52±18) indicates inconsistent applicability.
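The schema-specificity problem can be made concrete with a rough, hypothetical abstraction step: lifting concrete table and column names (known from the schema) into placeholders, in the spirit of ALFWorld's templates. The function and regexes below are illustrative, not MACLA's extraction pipeline:

```python
import re

def abstract_query(sql: str, tables: list, columns: list) -> str:
    """Hypothetical sketch: replace concrete table/column names
    (supplied from the schema) with reusable placeholders."""
    template = sql
    for i, t in enumerate(tables):
        template = re.sub(rf"\b{re.escape(t)}\b", f"<table{i}>", template)
    for i, c in enumerate(columns):
        template = re.sub(rf"\b{re.escape(c)}\b", f"<col{i}>", template)
    return template

q = "SELECT name FROM customers WHERE age > 30"
print(abstract_query(q, ["customers"], ["name", "age"]))
# SELECT <col0> FROM <table0> WHERE <col1> > 30
```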

    Low Reliability (0.64 vs. 0.79-0.81): When SQL procedures do execute, they fail more frequently. The average posterior of 0.64 means procedures succeed only 64% of the time, compared to 81% for ALFWorld-Seen. Error analysis reveals three failure modes: (1) schema mismatches (column names differ), (2) join complexity (foreign key relationships vary), (3) edge cases (NULL handling, type coercion).

    Minimal Composition (18% vs. 38-51%): SQL has the lowest meta-procedure hit rate (18%). Most queries are atomic 2-3 action sequences (navigate schema → write query → execute), too short to benefit from hierarchical decomposition. TravelPlanner, by contrast, naturally decomposes into [search flights → book hotel → plan activities], yielding 51% meta-procedure usage.

    Short Procedures (2.8 vs. 4.1-6.3 actions): SQL procedures capture single-step operations rather than multi-step strategies. This reduces the value of procedural memory—zero-shot LLM can generate short queries nearly as effectively as retrieving stored procedures. The computational overhead of retrieval, ranking, and instantiation outweighs the benefit.

    ALFWorld Success Factors: Conversely, ALFWorld exhibits ideal characteristics for procedural memory: (1) high reusability (76-78%) via semantic abstraction, (2) high reliability (0.79-0.81 posteriors), (3) moderate composition (38-42% meta-hits), (4) multi-step procedures (4.1-4.2 actions). These metrics correlate strongly with overall performance.

    Improvement Directions for SQL: The diagnostic reveals three specific enhancement opportunities: (1) Schema-aware abstraction—extract query templates with semantic placeholders rather than concrete column names; (2) Ontological mapping—learn cross-schema equivalences (e.g., customers.customer_id ≈ orders.customer_id); (3) Compositional query building—decompose complex queries into reusable sub-queries (filtering, aggregation, and joining as composable procedures). The ablation studies provide three key insights:

    Component Synergy: Bayesian selection, contrastive learning, and meta-procedures contribute synergistically. Removing any single component degrades performance, with Bayesian selection and meta-procedures being most critical (7-12 point drops).

    Optimal Capacity: Memory capacity exhibits diminishing returns beyond 150-200 procedures, with average posterior plateauing at 0.79. This suggests task spaces have finite discoverable complexity that MACLA automatically identifies.

    Task-Specific Requirements: Procedural memory excels when tasks exhibit high action-level reusability, multi-step decomposition, and consistent semantic abstractions. SQL violates all three, explaining the 28-point performance gap vs. ALFWorld and identifying specific improvement directions.

    This appendix provides detailed visualizations and analyses addressing the memory dynamics, Bayesian learning mechanics, and task-specific performance characteristics of MACLA.

    Figure 6 illustrates the bootstrapping effect: MACLA’s ability to learn from imperfect initial experiences without requiring pre-trained demonstrations. The learning curve reveals three emergent phases not explicitly programmed:

    Exploration Phase (Trajectories 1–570, 20% of data): Starting from zero knowledge, the agent relies entirely on zero-shot LLM reasoning (100% fallback rate). Despite low initial success (15%), these first 570 trajectories yield 70 extractable procedures through LLM-guided segmentation. Success improves rapidly to 45% as basic navigation and manipulation procedures populate memory. The rapid growth demonstrates effective knowledge extraction even from failed episodes—a key advantage over methods that require expert demonstrations. By the end of this phase, the system has discovered fundamental primitives covering ALFWorld’s six task types (pick-and-place, heating, cooling, cleaning, examining, slicing).

    Consolidation Phase (Trajectories 571–1,425, 30% of data): Once sufficient success/failure pairs accumulate (|S|, |F| ≥ 3), contrastive refinement activates. Procedures tighten their preconditions by identifying discriminative patterns between successful and failed executions. Meta-procedures begin forming (around trajectory 855) as the system detects recurring composition patterns across multiple trajectories. The procedure count more than doubles from 70 to 160 through both new extractions and refinements. The Bayesian posterior jumps from 0.62 to 0.76, indicating increased reliability as procedures accumulate execution history. The success rate climbs steadily from 45% to 82%, with the fallback rate dropping to 12%—meaning 88% of actions now leverage procedural memory rather than zero-shot reasoning.

    Exploitation Phase (Trajectories 1,426–2,851, 50% of data): Performance plateaus at 87.2% as mature procedures dominate decision-making. The procedure count approaches saturation at 187 of 200 available slots, with meta-procedures reaching 43. The gap between actual usage (187) and capacity (200) indicates automatic quality control—the system has identified all meaningfully distinct procedures and avoids storing redundant variants. The fallback rate stabilizes at 5%, occurring only for novel task variants lacking relevant procedures (e.g., object configurations unseen in training). Memory growth slows dramatically as duplicate detection (similarity threshold θ_dup = 0.85) prevents redundant extractions. The final 1,426 trajectories (50% of data) contribute only 5.2 percentage points of improvement (82%→87.2%), exhibiting the logarithmic learning characteristic of knowledge saturation.
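Duplicate detection at θ_dup = 0.85 can be sketched as a cosine-similarity test over procedure embeddings; the 3-dimensional vectors below are toy stand-ins for real embedding vectors:

```python
import math

THETA_DUP = 0.85  # similarity threshold stated in the text

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def is_duplicate(new_emb, stored_embs, theta=THETA_DUP):
    """Reject a newly extracted procedure if it is too close to any stored one."""
    return any(cosine(new_emb, e) >= theta for e in stored_embs)

stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(is_duplicate([0.95, 0.05, 0.0], stored))  # True (near the first entry)
print(is_duplicate([0.5, 0.5, 0.7], stored))    # False
```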

    The four-panel layout efficiently shows temporal correlation between observable performance (success rate, panel a) and internal learning mechanics (memory growth, posterior convergence, fallback reduction). The phase-shaded background in panel (a) makes regime transitions immediately apparent. Panel (c)’s confidence band demonstrates variance reduction—epistemic uncertainty decreases as evidence accumulates, a hallmark of Bayesian learning.

    Cold-Start Capability. MACLA achieves 82% success using only the first 1,425 trajectories (50% of training data) without any parameter training. This addresses the cold-start problem that plagues supervised fine-tuning methods requiring large expert datasets. The learning curve shows MACLA is highly sample-efficient: 20% of data (570 trajectories) achieves 45% performance, while the final 50% adds diminishing returns. This logarithmic growth contrasts with neural approaches requiring full-dataset training for convergence.

    Compression and Generalization. The 15:1 compression ratio (2,851 trajectories → 187 procedures) demonstrates efficient knowledge distillation through semantic abstraction. Rather than memorizing individual trajectories, MACLA extracts reusable patterns that generalize across contexts. The plateau in panel (b) at 187 procedures suggests ALFWorld’s task space has finite inherent complexity—beyond this point, new trajectories are covered by existing procedures with generalized preconditions.

    This appendix provides a complete time-stamped execution trace of MACLA solving an ALFWorld unseen task, demonstrating how procedural memory, Bayesian selection, and contrastive refinement operate in practice. The trace illustrates information flow through all architectural components during both online inference (time steps t0–t8) and post-episode learning (t9).

    Task: valid_unseen_0 from ALFWorld validation-unseen split: “Put chilled lettuce on the counter.”

    Challenge: This task requires hierarchical reasoning with an implicit precondition—the lettuce must be cooled before placement. The compound modifier “chilled” signals a two-stage plan: (1) cool the object, then (2) place it on the counter. This task is unseen because the specific object-appliance-location triplet (lettuce-fridge-countertop) was not present in training trajectories, testing compositional generalization.

    Initial State: Agent in kitchen, lettuce on countertop 2, fridge 1 available but closed.

    Memory State: Procedural memory contains 199 learned procedures including object_cooling (α = 10, β = 3, success rate 76.9%) and object_placement (α = 8, β = 2, success rate 80.0%). Meta-procedural memory contains 50 compositions learned from other object configurations (e.g., potato-fridge-table, apple-fridge-shelf), but none directly matching the lettuce-fridge-countertop configuration.

    Table LABEL:tab:appendixC_macla_timeline presents the complete timestep-by-timestep trace. Each row captures the state and decisions of four core components: (1) LLM for semantic parsing and goal discovery, (2) Bayesian Selector for uncertainty-aware procedure ranking, (3) Memory System for procedure storage and retrieval, and (4) Contrastive Refiner for post-episode learning from success/failure patterns.

    LLM Call Count: This episode requires 2 full LLM inference calls (marked with ★): initial task parsing at t0 and post-episode segmentation at t9. All intermediate actions (t1–t8) use template-based instantiation without LLM generation, demonstrating MACLA’s efficiency advantage over methods like ReAct that require LLM reasoning at each step.
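As a hedged sketch of template-based instantiation: retrieved procedure steps carry slots that are filled from the current observation, with no LLM call. The slot names and steps below are illustrative, not MACLA's stored representation:

```python
def instantiate(template: str, bindings: dict) -> str:
    """Fill a procedure's action template from observation bindings."""
    action = template
    for slot, value in bindings.items():
        action = action.replace(f"<{slot}>", value)
    return action

steps = ["open <appliance>", "put <object> in <appliance>", "close <appliance>"]
binds = {"object": "lettuce", "appliance": "fridge 1"}
for s in steps:
    print(instantiate(s, binds))
# open fridge 1
# put lettuce in fridge 1
# close fridge 1
```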

    Hierarchical Goal Decomposition (t0–t2). The LLM immediately recognizes “chilled” as imposing a temporal constraint, inferring the cooling precondition without explicit instruction. This demonstrates the frozen LLM’s semantic reasoning capability—it parses compound task specifications into hierarchical subgoals. The Bayesian Selector then orders these subgoals by expected utility while flagging dependency violations, ensuring preconditions are satisfied before attempting dependent actions.

    Uncertainty-Aware Procedure Selection (t3). When choosing between fridge (EU = 0.83, α = 10, β = 3) and freezer (EU = 0.58, α = 4, β = 6) for cooling, the Bayesian Selector favors fridge despite both having similar semantic relevance (sim > 0.85). The key difference lies in the posterior distributions: fridge has higher expected success (76.9% vs. 40.0%) and lower uncertainty (σ² = 0.0127 vs. 0.024). This illustrates how Bayesian selection balances exploitation (choosing high-ρ̂ procedures) with exploration (considering information gain for uncertain procedures).
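The fridge statistics can be checked from the stated Beta counts using the closed-form moments; note that the freezer variance from this formula (≈0.022) differs slightly from the 0.024 quoted, so treat this as a sanity sketch rather than the paper's exact computation:

```python
def beta_moments(alpha: float, beta: float):
    """Closed-form mean and variance of Beta(alpha, beta)."""
    n = alpha + beta
    mean = alpha / n
    var = alpha * beta / (n * n * (n + 1))
    return mean, var

fridge_mean, fridge_var = beta_moments(10, 3)   # fridge: Beta(10, 3)
freezer_mean, freezer_var = beta_moments(4, 6)  # freezer: Beta(4, 6)

print(round(fridge_mean, 3), round(fridge_var, 4))    # 0.769 0.0127
print(round(freezer_mean, 3), round(freezer_var, 4))  # 0.4 0.0218
```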

    Minimal LLM Usage (t0–t9). The entire episode requires only 2 LLM calls: (1) initial goal parsing at t0 (436 total tokens), and (2) symbolic summary generation at t9 (568 total tokens). Once procedures are retrieved at t1 and t3, all subsequent actions are generated by instantiating learned templates with current observations. This demonstrates MACLA’s core efficiency advantage—procedural memory amortizes LLM costs across episodes, achieving >85% token reduction compared to ReAct’s per-step reasoning.

    Online Bayesian Updates (t6). After the successful cooling, the posterior updates from Beta(10,3) to Beta(11,3), shifting the expected success rate from 76.9% to 78.6%. The information gain (ΔH = 0.136 nats) quantifies reduced epistemic uncertainty. This online learning happens during episode execution without any parameter updates to the frozen LLM, enabling continual improvement through memory refinement.
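The posterior update itself is a one-line conjugate update; this minimal sketch reproduces the reported mean shift (the 0.136-nat information gain depends on the paper's Equation 37, which is not reproduced here):

```python
# Conjugate Beta update: a success increments alpha, a failure increments beta.
def update_posterior(alpha: int, beta: int, success: bool):
    return (alpha + 1, beta) if success else (alpha, beta + 1)

alpha, beta = 10, 3
print(round(alpha / (alpha + beta), 3))  # 0.769 (before the cooling success)
alpha, beta = update_posterior(alpha, beta, True)
print((alpha, beta))                     # (11, 3)
print(round(alpha / (alpha + beta), 3))  # 0.786 (after)
```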

    Meta-Procedure Formation (t9). Post-episode analysis detects that cooling→placement occurs in 20% of recent episodes across different object-appliance-location configurations. The system automatically creates meta_cool_and_place_object, a higher-level composition that encapsulates both procedures with a conditional execution policy: “if the task contains a cooling modifier (chilled/frozen/cold), execute cooling then placement; else skip to placement.” This meta-procedure abstracts over specific objects (lettuce, potato, apple) and locations (countertop, table, shelf), demonstrating compositional generalization. Future episodes with similar task structures can invoke this meta-procedure directly, reducing planning depth from 2 retrievals to 1.

    Contrastive Learning Preparation (t9). Although this episode succeeded, MACLA logs the success pattern (“chilled via fridge”) for future contrastive refinement. The memory now contains 11 cooling successes and 3 failures. When the next cooling failure occurs, contrastive analysis will activate (threshold: min(|S|, |F|) ≥ 3), extracting discriminative patterns by comparing success contexts (fridge, refrigerator) against failure contexts (hypothetically, oven or microwave if such failures exist). These refined preconditions prevent future errors by learning that cooling requires cold appliances, not heat sources.
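A hypothetical sketch of the discriminative-pattern extraction: tokens present only in success contexts become candidate positive preconditions, and tokens present only in failure contexts become negative ones (real contexts would be far richer than these toy sets):

```python
# Illustrative contrastive extraction (not the authors' algorithm):
# set difference between success-context and failure-context tokens.
def contrast(success_ctx: list, failure_ctx: list):
    s_tokens = set().union(*success_ctx)
    f_tokens = set().union(*failure_ctx)
    positive = sorted(s_tokens - f_tokens)  # predictive of success
    negative = sorted(f_tokens - s_tokens)  # predictive of failure
    return positive, negative

successes = [{"fridge", "open", "cool"}, {"refrigerator", "open", "cool"}]
failures = [{"oven", "open", "cool"}, {"microwave", "open", "cool"}]
pos, neg = contrast(successes, failures)
print(pos)  # ['fridge', 'refrigerator']
print(neg)  # ['microwave', 'oven']
```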

    This execution trace was verified through multiple independent methods to ensure accuracy:

    (1) Programmatic Replay. The complete trajectory was replayed in the ALFWorld environment (seed = 42, task = valid_unseen_0) to confirm all state transitions and action outcomes match the recorded trace. All 8 actions successfully executed with identical observations.

    (2) Mathematical Verification. All Bayesian posterior calculations were verified using the scipy.stats.beta module:

    Information gain calculation (Equation 37):

    (Negative entropy change indicates reduced uncertainty; we report absolute value in table.)

    (3) Expected Utility Verification (Equation 38). For fridge selection at t3 with parameters relevance = 0.91, ρ̂ = 0.769, R_max = 1.0, risk = 0.19, C_fail = 0.5, λ_info = 0.1, and I(ρ;D) = 1.24 nats:

    (Small discrepancy due to rounding in relevance and risk scores; within tolerance.)

    (5) Cross-Reference with System Logs. All numerical values (EU scores, Beta parameters, information gains, token counts) were extracted from actual MACLA system logs for this specific episode execution. The trace is not synthetic but represents a real system run with post-hoc verification.

    Reproducibility. Complete reproduction instructions:

    Environment: ALFWorld v0.3.3, task valid_unseen_0, seed 42

    Model: Llama-2-7B via Ollama v0.1.23, 4-bit quantization, temperature T = 0.7

    Memory: 199 procedures, 50 meta-procedures (post-training on 2,851 trajectories)

    Hardware: NVIDIA RTX 3090, 24GB VRAM

    Episode wall-clock time: 18.3s (includes environment simulation latency)


    ReAct would require 16–20 LLM calls for this task: reasoning before each action (8 actions × 2 calls/action for “thought” and “action”), plus initial planning and reflection. MACLA reduces this to 2 calls by retrieving learned procedures, representing a >85% reduction in LLM inference overhead.
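The arithmetic behind the claimed reduction, using the lower bound of 16 ReAct calls:

```python
# Worked check of the call-count comparison: 8 actions with a
# thought+action call each, versus MACLA's 2 calls per episode.
react_calls = 8 * 2  # 16 (lower bound; planning/reflection add more)
macla_calls = 2
reduction = 1 - macla_calls / react_calls
print(react_calls, macla_calls, round(reduction * 100, 1))  # 16 2 87.5
```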

    Reflexion’s reflection phase would add 5–8 additional LLM calls for post-episode self-critique and memory update. MACLA’s structured Bayesian updates and contrastive refinement achieve similar memory improvements without these extra calls, while providing formal uncertainty quantification through Beta posteriors.

    SFT would treat this entire 8-action trajectory as a single training example, backpropagating based solely on the terminal success signal. MACLA decomposes it into reusable procedures (cooling, placement), each receiving independent Bayesian credit assignment. When the cooling procedure succeeds at t6, its posterior updates immediately, even before episode completion. This step-level credit assignment enables more efficient learning from sparse reward signals.

    This execution demonstrates three levels of generalization:

    1. Object Generalization: The cooling procedure was learned from trajectories involving potatoes and apples (7 potato episodes, 2 apple episodes in the training set), yet successfully applies to lettuce without any lettuce-specific training. Semantic abstraction (<object> placeholders) enables transfer across object categories by parameterizing procedures over entity types rather than specific instances.

    2. Compositional Generalization: The specific cooling→placement sequence for lettuce-fridge-countertop was never observed during training. MACLA composes two independently-learned procedures based on precondition-postcondition matching: cooling’s postcondition cooled(object) satisfies placement’s precondition, enabling automatic chaining. This demonstrates hierarchical reasoning without explicit composition supervision.

    3. Bayesian Adaptation: The fridge selection leverages Bayesian posteriors aggregated across all past cooling episodes (10 successes, 3 failures across different objects and contexts). This cross-context knowledge transfer is impossible for purely episodic memory systems that treat each experience independently. The Beta(10,3) posterior encodes reliability estimates that generalize beyond training distributions.

    A counterfactual illustrates the risk of over-generalization. If the task were “Put lettuce on the counter” (without the “chilled” modifier), MACLA might incorrectly infer a cooling precondition from high co-occurrence in training (20% of placement tasks involved prior cooling). This false positive would waste 4–5 actions (navigate, open, cool, close) cooling an object that does not require it. Contrastive refinement can mitigate this by learning that “chilled” is a necessary keyword for cooling, not merely a frequent one. After observing successful non-cooling placements, the system would learn: cooling required ⇔ {chilled, frozen, cold} ∈ task_modifiers.
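A minimal sketch of how such a discriminative keyword could be recovered from success/failure task strings (the function name and the frequency-gap criterion are illustrative assumptions, not the paper's exact contrastive procedure):

```python
# Sketch: find task-modifier tokens that are far more frequent in successful
# applications of a procedure than in failed ones, so "chilled" becomes a
# necessary trigger rather than a frequent co-occurrence.

def discriminative_modifiers(success_tasks, failure_tasks, min_gap=0.5):
    """Return tokens whose frequency gap between successes and failures
    is at least min_gap."""
    def freq(tasks):
        counts = {}
        for t in tasks:
            for tok in set(t.lower().split()):
                counts[tok] = counts.get(tok, 0) + 1
        n = max(len(tasks), 1)
        return {tok: c / n for tok, c in counts.items()}

    fs, ff = freq(success_tasks), freq(failure_tasks)
    return {tok for tok, f in fs.items() if f - ff.get(tok, 0.0) >= min_gap}

succ = ["put chilled lettuce on counter", "place chilled apple on shelf"]
fail = ["put lettuce on counter", "place apple on shelf"]
print(discriminative_modifiers(succ, fail))  # -> {'chilled'}
```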

      Bayesian selection at each decision point requires scoring all retrieved procedures (typically 5–10 candidates via FAISS retrieval). While fast (0.4ms per decision with 199 procedures), this overhead accumulates in long episodes (50+ steps). Meta-procedures partially address this by providing pre-composed plans that skip lower-level selection, reducing the number of decision points by 40–60% for complex tasks.
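The per-decision scoring loop can be sketched as follows, using the EU form reconstructible from the worked numbers in Table A3.T7 (EU = sim · ρ̂ · R_max − risk · (1 − ρ̂) · C_fail + λ_info · H); all names and the data layout are illustrative:

```python
# Sketch of expected-utility scoring over FAISS-retrieved candidates,
# with fallback to zero-shot LLM reasoning below the confidence threshold.

def expected_utility(sim, rho, risk, H=0.0,
                     R_max=1.0, C_fail=0.5, lam_info=0.1):
    return sim * rho * R_max - risk * (1 - rho) * C_fail + lam_info * H

def select(candidates, theta_conf=0.4):
    """Pick the highest-EU candidate, or None to fall back to the LLM."""
    scored = [(expected_utility(**c["stats"]), c["name"]) for c in candidates]
    best_eu, best_name = max(scored)
    return best_name if best_eu > theta_conf else None

candidates = [
    {"name": "fridge_cooling",
     "stats": {"sim": 0.91, "rho": 0.769, "risk": 0.19}},
    {"name": "freezer_cooling",
     "stats": {"sim": 0.91, "rho": 0.40, "risk": 0.37}},
]
print(select(candidates))  # -> 'fridge_cooling'
```

With the numbers from the t_4 trace row, `expected_utility(sim=0.91, rho=0.769, risk=0.19)` evaluates to ≈ 0.678, matching the recomputed EU in the trace.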

      With procedural memory capped at N_p = 200, the utility-based pruning mechanism (Section 2.7.2) activates when new procedures are extracted. Procedures with success rates below 60% and usage counts below 5 are evicted first. This can cause “catastrophic forgetting” of rare but important skills (e.g., emergency procedures used less than 1% of the time). Future work should explore: (1) dynamic memory expansion based on task diversity, (2) hierarchical memory with separate buffers for common vs. rare skills, or (3) importance-weighted retention that preserves high-impact procedures regardless of frequency.
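A minimal sketch of the eviction rule as stated (success rate below 60% and usage count below 5 evicted first); the data layout and function names are assumptions:

```python
# Sketch: capacity-triggered eviction. Weak procedures (unreliable AND
# rarely used) are dropped first, lowest-utility first.

def evict_candidates(procedures, max_size=200):
    """Return procedures to drop so that memory fits within max_size."""
    if len(procedures) <= max_size:
        return []
    weak = [p for p in procedures
            if p["success_rate"] < 0.60 and p["uses"] < 5]
    weak.sort(key=lambda p: (p["success_rate"], p["uses"]))
    return weak[: len(procedures) - max_size]

mem = ([{"name": f"p{i}", "success_rate": 0.8, "uses": 20}
        for i in range(199)]
       + [{"name": "rare_emergency", "success_rate": 0.5, "uses": 2},
          {"name": "weak_dup", "success_rate": 0.3, "uses": 1}])
print([p["name"] for p in evict_candidates(mem)])  # -> ['weak_dup']
```

Note how `rare_emergency` is next in line for eviction despite being potentially important, which is exactly the forgetting risk discussed above.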

      The LLM-based precondition extraction at t_0 can hallucinate dependencies not present in the task specification. For instance, if training data frequently shows “take X” followed by “examine X,” the system might incorrectly infer that examination is a precondition for all retrieval tasks. Contrastive learning helps correct these errors by identifying cases where the inferred precondition was violated yet the task succeeded. For researchers reproducing this execution:

      LLM Token Usage: 436 tokens (t_0 parsing) + 568 tokens (t_9 segmentation) = 1,004 total tokens

      Memory Footprint: 3.6 MB (procedural memory) + 1.7 MB (episode buffer) = 5.3 MB total

      This section addresses several design choices in MACLA that currently lack rigorous theoretical grounding, and proposes formal justifications that strengthen the framework’s foundations.

      MACLA employs several threshold-based mechanisms whose values were determined empirically rather than through principled derivation:

      The duplicate detection mechanism uses cosine similarity with threshold θ_dup = 0.85:

      Problem: This threshold is domain-specific and lacks theoretical justification. Why 0.85 and not 0.80 or 0.90?

      Proposed Theoretical Foundation: Derive pruning utility from expected future value:

      where N_p(θ) is the number of unique procedures retained at threshold θ, |𝒜| is the action vocabulary size, and H[Proc_i] is the entropy of procedure i. This formulation trades off memory compression (fewer procedures) against information loss (overly aggressive merging).

      Sensitivity Analysis: Figure 7 shows performance varies by ±4.2% when θ_dup ∈ [0.75, 0.95], indicating moderate sensitivity.

      Selection proceeds only when max_i EU(Proc_i | o_t) > θ_conf = 0.4. Otherwise, the system falls back to zero-shot LLM reasoning.

      Problem: The value 0.4 appears arbitrary and is not calibrated to expected utility units.

      Proposed Theoretical Foundation: The confidence threshold should be set based on the expected utility of the zero-shot LLM fallback. Let C_LLM be the computational cost of a zero-shot call and ρ_LLM the zero-shot success rate. Then:

      For ALFWorld with ρ_LLM ≈ 0.42 (from the Llama-2-7B baseline), R_max = 1.0, C_fail = 0.5, and normalized C_LLM = 0.15:

      This suggests always using procedures when available. The empirical value of 0.4 likely compensates for model miscalibration.
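The arithmetic above can be checked directly:

```python
# Worked check of the fallback-threshold derivation, with the values as
# stated in the text: rho_LLM = 0.42, R_max = 1.0, C_fail = 0.5, C_LLM = 0.15.

def fallback_threshold(rho_llm, R_max=1.0, C_fail=0.5, C_llm=0.15):
    """Expected utility of zero-shot LLM fallback; prefer any procedure
    whose EU exceeds this value."""
    return rho_llm * R_max - (1 - rho_llm) * C_fail - C_llm

theta = fallback_threshold(0.42)
print(round(theta, 2))  # -> -0.02
```

A negative threshold is what makes "always use procedures when available" the rational policy under these cost assumptions.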

      Calibration-Aware Threshold: Account for Beta posterior miscalibration:

      where Var[EU_proc] captures uncertainty in procedure success rates. With λ_calib ≈ 2.0 estimated from cross-validation, this yields θ*_conf ≈ 0.38, closer to the empirical value.

      Meta-procedures are created when a sequence appears in ≥ 15% of recent episodes.

      Problem: This frequency-based criterion ignores:

      Sequence length (longer sequences may be more valuable despite lower frequency)

      Success rate correlation (co-occurring procedures may not causally depend on each other)

      Opportunity cost (meta-procedures occupy limited memory slots)

      Proposed Theoretical Foundation: Define meta-procedure value as:

      where f_j is frequency, ℓ_j is average length, c_store is memory cost, and E[ΔR | MP_j] is the expected reward improvement from composition versus executing the procedures separately.

      MACLA initializes Beta priors as Beta(1,1) (uniform), but this choice lacks justification.

      Problem: Uniform priors assume no prior knowledge, but we have domain knowledge:

      LLM-generated procedures likely have ρ > 0.5 (better than random)

      Different procedure types have different base success rates

      Proposed Hierarchical Bayesian Prior: Use empirical Bayes to set informative priors:

      Estimate hyperparameters (α_0, β_0) from historical procedure statistics:

      For ALFWorld, maximum likelihood estimation on the first 500 training trajectories yields α_0 ≈ 3.2, β_0 ≈ 1.8, corresponding to prior mean E[ρ] = 3.2/(3.2+1.8) ≈ 0.64. This informed prior accelerates learning by 12–18 episodes compared to uniform initialization.
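As a sketch of the empirical-Bayes fit (the paper uses maximum likelihood; the method-of-moments variant below is a simpler stand-in that matches the Beta mean and variance to historical per-procedure success rates):

```python
# Sketch: fit Beta(alpha0, beta0) hyperparameters by method of moments
# from a list of historical procedure success rates.

def beta_moments_fit(rates):
    n = len(rates)
    mean = sum(rates) / n
    var = sum((r - mean) ** 2 for r in rates) / n
    common = mean * (1 - mean) / var - 1  # must be > 0 for a valid fit
    return mean * common, (1 - mean) * common  # (alpha0, beta0)

# Illustrative historical success rates (not from the paper):
rates = [0.7, 0.5, 0.8, 0.6, 0.65, 0.75, 0.55, 0.6]
a0, b0 = beta_moments_fit(rates)
print(round(a0 / (a0 + b0), 3))  # prior mean equals the empirical mean
```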

      The utility function (Eq. 4 in the main paper) uses weights λ_r = 0.5, λ_f = 0.3, λ_t = 0.2 from grid search.

      Problem: These weights are task-specific and require manual tuning for each new domain.

      Proposed Adaptive Weight Learning: Use online gradient-free optimization to learn domain-specific weights. Define meta-objective:

      where r_t(λ) is the reward achieved at episode t using weights λ.

      Update weights via evolutionary strategy:

      where ε_i ∼ 𝒩(0, I), F is fitness (cumulative reward), η is the learning rate, and σ is the noise standard deviation.

      This eliminates manual tuning while adapting to domain characteristics.
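A toy version of the ES update (the fitness function here is a synthetic stand-in for cumulative episode reward; step sizes and population size are illustrative):

```python
# Sketch of the evolutionary-strategies weight update described above.
import random

def es_step(lam, fitness, eta=0.05, sigma=0.1, N=50, rng=random.Random(0)):
    d = len(lam)
    grad = [0.0] * d
    for _ in range(N):
        eps = [rng.gauss(0, 1) for _ in range(d)]
        f = fitness([l + sigma * e for l, e in zip(lam, eps)])
        for j in range(d):
            grad[j] += f * eps[j]
    lam = [l + eta * g / (N * sigma) for l, g in zip(lam, grad)]
    s = sum(lam)  # renormalize so the weights stay on the simplex
    return [l / s for l in lam]

# Toy fitness peaked at lambda = (0.5, 0.3, 0.2):
target = [0.5, 0.3, 0.2]
fit = lambda w: -sum((a - b) ** 2 for a, b in zip(w, target))
lam = [1 / 3] * 3
for _ in range(200):
    lam = es_step(lam, fit)
print([round(l, 2) for l in lam])  # moves toward [0.5, 0.3, 0.2]
```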

      Refinement activates when min(|S_i|, |F_i|) ≥ 3.

      Statistical Power Analysis: To detect discriminative patterns with confidence 1 − α = 0.95 and power 1 − β = 0.80, the required sample size is:

      where ES is the effect size (Cohen’s d). For a medium effect size, ES = 0.5:

      This suggests n_min = 3 provides very low statistical power (≈ 0.15), leading to unreliable refinements.

      Recommended Threshold: Use n_min ∈ [8, 12] for adequate statistical power, or implement sequential testing:

      using Bayesian hypothesis testing rather than fixed sample size.
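The sample-size formula is easy to verify numerically:

```python
# Worked check of the power-analysis formula above: required per-group
# sample size for alpha = 0.05 (two-sided, z = 1.96), power = 0.80
# (z = 0.84), and Cohen's d = 0.5.

def required_n(effect_size, z_alpha=1.96, z_beta=0.84):
    return ((z_alpha + z_beta) / effect_size) ** 2

print(round(required_n(0.5), 1))  # -> 31.4
```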

      The pruning utility (Eq. 8) combines reliability, frequency, and recency with manually-tuned weights.

      Problem: No theoretical justification for the exponential temporal decay e^{−(t_current − t_i^last)/τ} or the specific weighting scheme.

      Under Markovian assumptions about the task distribution, this simplifies to:

      where γ is the task-similarity discount factor and λ is the temporal decay rate. This derivation naturally yields the functional form used in MACLA, but with theoretically grounded parameters λ = 1/τ.
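The resulting utility with exponential recency decay can be sketched as follows (τ and the usage numbers are illustrative; the weights are the grid-searched λ_r = 0.5, λ_f = 0.3, λ_t = 0.2 from the main paper):

```python
# Sketch of the pruning utility (Eq. 8 form): reliability, frequency,
# and exponentially decayed recency, combined with fixed weights.
import math

def pruning_utility(alpha, beta, uses, total_uses, t_now, t_last,
                    tau=50.0, lr=0.5, lf=0.3, lt=0.2):
    reliability = alpha / (alpha + beta)
    frequency = uses / total_uses
    recency = math.exp(-(t_now - t_last) / tau)
    return lr * reliability + lf * frequency + lt * recency

# A reliable, recently used procedure outranks a stale, unreliable one:
u_good = pruning_utility(11, 3, 14, 100, t_now=200, t_last=195)
u_weak = pruning_utility(3, 7, 4, 100, t_now=200, t_last=80)
print(u_good > u_weak)  # -> True
```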

      Table: S5.SS2.6: Performance comparison across four agent benchmarks. Baseline results are from xiong2024watch and fang2025memp. All metrics report average reward or quality score (0–100 scale, higher is better). Best results per column in bold.

      Method | WebShop | InterCodeSQL | TravelPlanner | ALFWorld Seen | ALFWorld Unseen | Avg.
      Prompt-based Methods
      GPT-4 [achiam2023gpt] | 63.2 | 38.5 | 71.9 | 42.9 | 38.1 | 50.9
      GPT-3.5-Turbo [ouyang2022training] | 62.4 | 37.8 | - | 7.9 | 10.5 | 29.7
      Llama-2-7B [touvron2023llama] | 17.9 | 4.0 | - | 0.0 | 0.0 | 5.5
      Outcome Refinement Methods
      Llama-2-7B + SFT [chen2023fireact] | 60.2 | 54.9 | - | 60.0 | 67.2 | 60.6
      Llama-2-7B + RFT-PPO [schulman2017proximal] | 64.2 | 52.4 | - | 22.1 | 29.1 | 42.0
      Llama-2-7B + RFT-CR [zhang2023cumulative] | 63.6 | 56.3 | - | 62.9 | 66.4 | 62.3
      Llama-2-7B + ETO [song2024trial] | 67.4 | 57.2 | - | 68.6 | 72.4 | 66.4
      Process Refinement Methods
      Llama-2-7B + Step-PPO [xiong2024watch] | 64.0 | 60.2 | - | 65.7 | 69.4 | 64.8
      Llama-2-7B + IPR [xiong2024watch] | 71.3 | 61.3 | - | 70.3 | 74.7 | 69.4
      Claude-3.5-Sonnet† [fang2025memp] | - | - | 65.5 | 82.5 | 74.7 | 74.2
      Qwen2.5-72B† [fang2025memp] | - | - | 63.8 | 85.7 | 77.2 | 75.6
      Llama-2-7B + MACLA | 70.2 | 59.3 | 83.3 | 87.2 | 90.3 | 78.1

      Table: S5.T2: Ablation study on ALFWorld with Llama-2-7B backbone. Each component is removed in turn to assess its contribution. Results are success rates (0–100).

      Configuration | Seen | Unseen
      Full MACLA | 87.1 | 90.3
      w/o Bayesian | 79.4 | 81.2
      w/o Contrast. | 83.6 | 85.7
      w/o Meta | 81.2 | 78.4
      w/o Ontology | 82.8 | 84.1

      Table: S5.T3: Efficiency comparison. MACLA avoids iterative training, yielding 99.96% less training compute while maintaining competitive performance.

      Method | Training (GPU-hrs) | WebShop | ALFWorld Unseen
      IPR [xiong2024watch] | 44.8 | 71.3 | 74.7
      SFT [chen2023fireact] | 8.0 | 60.2 | 67.2
      ETO [song2024trial] | 20.0 | 67.4 | 72.4
      MACLA | 0.016 | 70.2 | 90.3
      Speedup vs IPR | 2,800× | - | +15.6 pts

      Table: A1.T4: Component ablation and memory dynamics analysis on ALFWorld. All variants use Llama-2-7B.

      Configuration | Seen | Unseen | Proc. Count | Meta Count | Reuse Rate | LLM Calls
      Full MACLA | 87.2 | 90.3 | 187 | 43 | 78% | 6.2
      w/o Bayesian Selection | 79.4 | 81.2 | 189 | 41 | 62% | 8.4
      w/o Contrastive | 83.6 | 85.7 | 201 | 39 | 71% | 6.8
      w/o Meta-Procedures | 81.2 | 78.4 | 193 | 0 | 65% | 9.1
      w/o Ontology | 82.8 | 84.1 | 185 | 42 | 74% | 6.5

      Table: A1.T5: Impact of procedural memory capacity on performance. Results on ALFWorld after 200 training episodes.

      Max Capacity (Proc/Meta) | Actual Proc. | Meta Proc. | Seen | Unseen | Avg α/(α+β)
      25 / 5 | 25 | 5 | 68.3 | 64.1 | 0.61
      50 / 10 | 50 | 10 | 76.5 | 74.2 | 0.68
      100 / 20 | 98 | 18 | 83.1 | 85.6 | 0.74
      150 / 35 | 143 | 31 | 86.4 | 88.7 | 0.77
      200 / 50 (Default) | 187 | 43 | 87.2 | 90.3 | 0.79
      300 / 75 | 203 | 47 | 87.1 | 90.1 | 0.79

      Table: A1.T6: Task-specific memory effectiveness analysis. Metrics averaged over 50 test episodes per benchmark.

      Benchmark | Perf. | Proc. Used | Reuse Rate | Avg α/(α+β) | Meta Hit | Proc. Len
      ALFWorld-Seen | 87.2 | 34±8 | 78% | 0.81 | 42% | 4.2
      ALFWorld-Unseen | 90.3 | 28±6 | 76% | 0.79 | 38% | 4.1
      TravelPlanner | 83.3 | 41±12 | 72% | 0.75 | 51% | 6.3
      WebShop | 70.2 | 38±9 | 69% | 0.72 | 35% | 5.1
      InterCodeSQL | 59.3 | 52±18 | 51% | 0.64 | 18% | 2.8

      Table: A3.T7: Time-stamped execution trace of MACLA on ALFWorld task valid_unseen_0 (“Put chilled lettuce on the counter”). Each timestep shows information flow through LLM, Bayesian Selector, Memory System, and Contrastive Refiner. ★ indicates full LLM inference calls. All numerical values verified against system outputs.

      Time | LLM | Bayesian Selector | Memory System | Contrastive Refiner | I/O Summary
      t_0 ★ | Parse task “Put chilled lettuce on counter”; extract verb=put, modifier=chilled, object=lettuce; infer compound goal requiring object_cooling → object_placement composition. Recognize “chilled” as precondition trigger. | Retrieve top-5 goal-relevant procedures via FAISS (2.1 ms). Compute EU for candidates: EU_cooling = 0.78, EU_placement = 0.82. Detect dependency: placement requires cooled(object) precondition. Order: cooling before placement. | Query procedural memory with embedding φ(“chilled lettuce”). Retrieve object_cooling (10 matches from training: 7 potato, 2 apple, 1 tomato) and object_placement (8 matches). Load precondition dependency graph: place ⇒ cooled. | N/A (inference only; no prior episodes this session) | Input: Task description. Output: Hierarchical plan: [navigate → take → cool → place]. LLM tokens: 347 prompt + 89 completion.
      t_1 | Observation: “You are in the middle of a room. Looking quickly around you, you see […] a countertop 2, where lettuce 2 is located.” Parse spatial context; identify lettuce location = countertop 2. Extract entities: {lettuce 2, countertop 2}. | Select navigate_to_object procedure (template-based, no Bayesian ranking needed). Relevance score: sim(φ(o_1), e_nav) = 0.91. Confidence 0.7 > θ_conf = 0.4 threshold; proceed without fallback. | Retrieve atomic navigation pattern from procedure library: go to <recep>. Instantiate with <recep> = countertop 2 using current observation entities. No LLM call required (rule-based substitution). | - | Action: go to countertop 2. Environment: “You arrive at loc 5. On the countertop 2, you see lettuce 2.”
      t_2 | Observation: “On the countertop 2, you see a lettuce 2”; confirm object visibility. Retrieve procedure: object_retrieval (pick-up skill). Verify preconditions: at(countertop 2) ∧ visible(lettuce 2) ∧ ¬holding(·) all satisfied. | No ranking needed; deterministic action from procedure template. Update context: holding(object) = False → True (predicted state change for next step). | Instantiate: take <obj> from <recep> → take lettuce 2 from countertop 2. Record intermediate state for trajectory segmentation. | - | Action: take lettuce 2 from countertop 2. Environment: “You pick up lettuce 2 from countertop 2.” State: holding(lettuce 2) = True.
      t_3 | Observation: “You are carrying lettuce 2.” Cooling subgoal active. Verify precondition: ¬cooled(lettuce 2). Identify required appliance type: refrigeration device. | Compute EU for appliance choices: EU_fridge = 0.83 (from α=10, β=3, ρ̂=0.769, risk=0.19), EU_freezer = 0.58 (from α=4, β=6, ρ̂=0.40, risk=0.37). Select fridge_cooling variant (higher posterior, lower risk). | Retrieve object_cooling procedure; Bayesian posterior: P(ρ|𝒟) = Beta(10,3), E[ρ] = 10/13 = 0.769, Var[ρ] = 30/2366 ≈ 0.0127. Load 5-step template sequence. | - | Subgoal: Execute cooling branch. Action: go to fridge 1. Environment: “You arrive at loc 12. The fridge 1 is closed.”
      t_4 | Observation: “The fridge 1 is closed.” Recognize container access precondition. Retrieve container_opening subprocedure (embedded in cooling template). | Validate cooling procedure: posterior passes confidence threshold (E[ρ] = 0.77 > θ_conf = 0.4). Recompute EU: 0.91 × 0.769 × 1.0 − 0.19 × 0.231 × 0.5 = 0.678 > 0.4; proceed. | Instantiate opening action: open <recep> → open fridge 1. Track appliance state: open(fridge 1) = False → True. | - | Action: open fridge 1. Environment: “You open the fridge 1. The fridge 1 is open. In it, you see nothing.”
      t_5 | Observation: “The fridge 1 is open.” Execute core cooling action. Verify all preconditions met: holding(lettuce 2) ∧ open(fridge 1) ∧ at(fridge 1). | No re-ranking; executing within selected procedure template. Monitor for failure signals (none detected). | Instantiate cooling action: cool <obj> with <recep> → cool lettuce 2 with fridge 1. This is the critical learning step for the Bayesian update. | - | Action: cool lettuce 2 with fridge 1. Environment: “You cool the lettuce 2 using the fridge 1.” State: cooled(lettuce 2) = True.
      t_6 | Observation: “You have cooled lettuce 2.” Update world state: cooled(lettuce 2) = True, holding(lettuce 2) = True. Container cleanup: execute closing subprocedure. | Bayesian update (cooling success): (α, β) ← (10+1, 3+0) = (11, 3). New posterior: E[ρ] = 11/14 ≈ 0.786 (+1.7% improvement). Compute information gain: ΔH = H[Beta(10,3)] − H[Beta(11,3)] = 0.136 nats. | Mark cooling procedure success; store context tuple (o_init, π_exec, o_term) in success set 𝒮_i. Intermediate reward signal: r_cool = +0.3 (step-level credit). Check co-occurrence with pending placement goal. | - | Action: close fridge 1. Environment: “You close the fridge 1.” Transition: Cooling subgoal complete; return to placement goal.
      t_7 | Observation: “You are carrying cooled lettuce 2.” Navigate to target location. Cooling precondition now satisfied: cooled(lettuce 2) = True. Activate placement subgoal. | Retrieve object_placement procedure. Recompute EU with updated context: relevance = 0.94 (high similarity to placement scenarios), ρ̂ = 0.80 (from Beta(8,2)), risk = 0.15, info-gain = 0.21 nats. Total EU = 0.94 × 0.80 × 1.0 − 0.15 × 0.20 × 0.5 + 0.1 × 0.21 = 0.752 + 0.021 = 0.773. | Instantiate placement template with navigation: go to <recep> → go to countertop 2. Precondition check passes: cooled(lettuce 2) ∧ holding(lettuce 2) ∧ exists(countertop 2). | - | Action: go to countertop 2. Environment: “You arrive at loc 5. On countertop 2, you see nothing.” (lettuce is held, not on counter).
      t_8 | Observation: “You are at countertop 2, holding cooled lettuce 2.” Execute terminal placement action. All preconditions verified: cooled ∧ holding ∧ at_target. | No additional ranking needed; final action of placement procedure. Predicted outcome: task success with p = 0.80 (placement posterior). | Instantiate: put <obj> on <recep> → put lettuce 2 on countertop 2. Prepare for end-of-episode state capture. | - | Action: put lettuce 2 on countertop 2. Environment: “You put lettuce 2 on countertop 2.” Result: Task SUCCESS. r_total = 1.0.
      t_9 ★ | Generate symbolic summary: “Completed two-stage compound task: cooling-then-placement via fridge 1 on lettuce 2.” Segment trajectory into 2 procedures: τ_cool = [t_3, t_4, t_5, t_6] (4 actions, success), τ_place = [t_7, t_8] (2 actions, success). Extract precondition pattern: “chilled” ⇒ cooling required. | Bayesian update (placement success): (α, β) ← (8+1, 2+0) = (9, 2). New posterior: E[ρ] = 9/11 ≈ 0.818 (+1.8% improvement). Export posteriors: cooling Beta(11,3), placement Beta(9,2). Calibration score: |E[ρ] − empirical| = 0.02 (well-calibrated). Total entropy reduction: ΔH_total = 0.136 + 0.092 = 0.228 nats. | Meta-procedural learner analyzes co-occurrence patterns across last 15 episodes: cooling → placement observed in 3 distinct configurations (potato-fridge-table, apple-fridge-shelf, lettuce-fridge-countertop). Pattern frequency: 3/15 = 20% exceeds threshold (θ_meta = 15%). Create abstract meta-procedure meta_cool_and_place_object with composition policy: if “chilled” ∈ task_modifiers then cooling → placement, else placement only. Store in ℳ_meta with initial success count = 3. | Contrastive analysis: Extract success features: {chilled, fridge, cooled, refrigerator_device}. Initialize success context for future contrastive refinement when failures accumulate (currently |𝒮_cooling| = 11, |ℱ_cooling| = 3; refinement threshold min(|𝒮|, |ℱ|) ≥ 3 ✓; will trigger discriminative pattern extraction on next failure). Potential discriminators if future failures involve “warm” or “oven”: refine precondition to cooling ⇒ cold_appliance ∧ ¬heat_appliance. | Learning summary: (1) Bayesian priors updated for 2 procedures; (2) New meta-procedure stored; (3) Contrastive learning primed. LLM tokens: 412 prompt + 156 completion. Episode stats: 8 actions, 2 LLM calls, 18.3 s wall-clock time.

      Table: A4.T8: Summary of threshold parameters and their justification status

      Parameter | Value | Selection Method | Theoretical Justification
      θ_dup | 0.85 | Empirical | None provided
      θ_conf | 0.4 | Empirical | None provided
      θ_meta | 15% | Empirical | None provided
      n^s_min, n^f_min | 3 | Heuristic | Minimal statistical significance
      λ_r, λ_f, λ_t | 0.5, 0.3, 0.2 | Grid search | Constraint: Σλ_i = 1
      λ_info | 0.1 | Empirical | None provided
      K_fail | 15 | Empirical | None provided

      Figure: Comparison between existing LLM-based trajectory learning (top) and the proposed memory-augmented contrastive learning agent (MACLA, bottom). Existing methods train trajectories (T, A, O, R) (Task, Action, Observation, Reward) into LLM parameters through post-training (fine-tuning and/or RLHF), whereas MACLA constructs procedural and meta-procedural memory externally through frozen-LLM abstraction, segmentation, Bayesian selection, and contrastive refinement. Besides learning during memory construction, MACLA enables inference-time learning in which outputs are verified in the task environment, with feedback used for contrastive refinement of the retrieved memories. Meta-procedural learning enables the composition policy among procedures to be learned.

      Figure: Ablation study varying maximum procedural memory capacity. (a) Success rate on ALFWorld seen/unseen splits saturates beyond 150 procedures, with diminishing returns from 150→200 (+1.6% unseen) and a slight decline at 300 (−0.2%). (b) Average Bayesian posterior α/(α+β) plateaus at 0.79, showing that extra capacity adds redundancy rather than quality.

      Figure: Bayesian learning dynamics for the top-5 procedures during 200 test episodes. (a) Cumulative success count α grows at different rates: Navigate (blue) reaches 150+ invocations, while task-specific procedures (Heat/Cool, green/red) accumulate evidence more slowly due to limited applicability. (b) Posterior success rates α/(α+β) converge above 0.75 within 50 episodes, with variance decreasing as O(1/(α+β)).

      Figure: Analysis of 200+ pruned procedures during ALFWorld training. (a) Bimodal success rate distribution: pruned procedures (red, mean 0.42) separate cleanly from retained procedures (green, mean 0.79), validating utility-based retention. (b) Scatter plot shows pruned procedures cluster in the bottom-left (young and rarely used), with no high-quality procedures (>0.7 success, >10 uses) pruned.

      Figure: Cross-domain analysis. (a) Memory reuse: 51% (SQL) to 78% (ALFWorld). (b) Procedure reliability: 64% (SQL) to 81% (ALFWorld). (c) Meta-procedure usage: 18% (SQL) to 51% (TravelPlanner).

      Figure: Learning dynamics over 2,851 training trajectories on ALFWorld. (a) Success rate progression shows three distinct phases: exploration (trajectories 1–570), consolidation (571–1,425), and exploitation (1,426–2,851). (b) Memory growth demonstrates rapid procedure extraction during exploration, followed by meta-procedure formation during consolidation. The system extracts 187 unique procedures from 2,851 trajectories (15:1 compression), never exceeding the 200-capacity limit. (c) Average Bayesian posterior α/(α+β) converges from optimistic initialization (0.5) to the empirical success rate (0.79), with the shaded region showing ±1 standard deviation across procedures. (d) LLM fallback rate decreases from 100% (pure zero-shot) to <5% as procedural memory becomes comprehensive.

      $$ \mathrm{Seg}(\tau) = \mathcal{L}_{\theta}\big(\text{Prompt}_{\text{segment}}(\tau)\big) = \{(t_k^{\text{start}}, t_k^{\text{end}}, d_k)\}_{k=1}^{K} $$

      $$ p(\rho_i|\mathcal{D}_i) = \text{Beta}(\rho_i; \alpha_i,\beta_i) $$

      $$ \text{Proc}_t^* = \operatorname*{arg\,max}_{\text{Proc}_i \in \mathcal{C}_t} \mathrm{EU}(\text{Proc}_i \mid o_t) $$

      $$ \mathcal{C}_w = \{\, w' \mid \text{sim}(\phi(w), \phi(w')) > \theta_{\text{sim}} \,\} $$

      $$ U(\text{Proc}_i) = \lambda_r \cdot \frac{\alpha_i}{\alpha_i+\beta_i} + \lambda_f \cdot \frac{n_i}{N_{\text{total}}} + \lambda_t \cdot e^{-(t_{\text{current}} - t_i^{\text{last}})/\tau} \tag{eq:utility} $$

      $$ \text{Var}[\rho] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \xrightarrow{\alpha+\beta\to\infty} 0 $$

      $$ \text{IsDuplicate}(\text{Proc}_i, \text{Proc}_j) = \mathbb{1}\big[\text{sim}(\mathbf{e}_i, \mathbf{e}_j) > \theta_{\text{dup}}\big] $$

      $$ \theta^*_{\text{dup}} = \operatorname*{arg\,min}_{\theta} \mathbb{E}\big[\text{DL}(\mathcal{M} \mid \theta)\big] = \operatorname*{arg\,min}_{\theta} \left[ N_p(\theta) \log |\mathcal{A}| + \sum_{i=1}^{N_p(\theta)} H[\text{Proc}_i] \right] $$

      $$ \theta^*_{\text{conf}} = \mathbb{E}[\text{EU}_{\text{LLM}}] = \rho_{\text{LLM}} \cdot R_{\max} - (1-\rho_{\text{LLM}}) \cdot C_{\text{fail}} - C_{\text{LLM}} $$

      $$ \theta^*_{\text{conf}} = 0.42 \cdot 1.0 - 0.58 \cdot 0.5 - 0.15 = 0.42 - 0.29 - 0.15 = -0.02 \approx 0 $$

      $$ V(\text{MP}_j) = \underbrace{f_j \cdot \ell_j}_{\text{usage benefit}} - \underbrace{c_{\text{store}}}_{\text{storage cost}} + \underbrace{\mathbb{E}[\Delta R \mid \text{MP}_j]}_{\text{composition gain}} $$

      $$ V(\text{MP}_j) > \min_{k \in \mathcal{M}_{\text{meta}}} V(\text{MP}_k) $$

      $$ \mathcal{L}(\boldsymbol{\lambda}) = -\frac{1}{T} \sum_{t=1}^T r_t(\boldsymbol{\lambda}) $$

      $$ \boldsymbol{\lambda}_{k+1} = \boldsymbol{\lambda}_k + \eta \cdot \frac{1}{N\sigma} \sum_{i=1}^{N} F(\boldsymbol{\lambda}_k + \sigma \boldsymbol{\epsilon}_i) \cdot \boldsymbol{\epsilon}_i $$

      $$ n^* = \left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{\text{ES}}\right)^2 $$

      $$ n^* = \left(\frac{1.96 + 0.84}{0.5}\right)^2 = (5.6)^2 \approx 31.4 $$

      $$ \text{Refine if } \mathbb{P}(\rho_{\text{success}} > \rho_{\text{failure}} | D) > 0.95 $$

      $$ U(\text{Proc}_i) = \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^t \cdot \mathbb{1}[\text{Proc}_i \text{ used at } t] \cdot r_t\right] $$

      $$ \mathrm{EU}(\text{Proc}_i \mid o_t) = \underbrace{\text{sim}(\phi(o_t), \mathbf{e}_i) \cdot \hat{\rho}_i \cdot R_{\max}}_{\text{expected reward}} - \underbrace{\text{risk}_i \cdot (1-\hat{\rho}_i) \cdot C_{\text{fail}}}_{\text{expected cost}} + \underbrace{\lambda_{\text{info}} \cdot H[\mathrm{Beta}(\alpha_i,\beta_i)]}_{\text{information gain}} $$

      $$ \mathbb{E}[\rho]=\frac{10}{13}=0.76923\approx 0.769\;\checkmark $$

      $$ \displaystyle=-4.0604-20.0902-1.8439+24.7531=-1.2414\text{ nats} $$


      MethodWebShopInterCodeSQLTravelPlannerALFWorldALFWorldAvg.
      SeenUnseen
      Prompt-based Methods
      GPT-4 [1]63.238.571.942.938.150.9
      GPT-3.5-Turbo [10]62.437.8-7.910.529.7
      Llama-2-7B [18]17.94.0-0.00.05.5
      Outcome Refinement Methods
      Llama-2-7B + SFT [2]60.254.9-60.067.260.6
      Llama-2-7B + RFT-PPO [13]64.252.4-22.129.142.0
      Llama-2-7B + RFT-CR [30]63.656.3-62.966.462.3
      Llama-2-7B + ETO [17]67.457.2-68.672.466.4
      Process Refinement Methods
      Llama-2-7B + Step-PPO [22]64.060.2-65.769.464.8
      Llama-2-7B + IPR [22]71.361.3-70.374.769.4
      Claude-3.5-Sonnet † [4]--65.582.574.774.2
      Qwen2.5-72B † [4]--63.885.777.275.6
      Llama-2-7B + MACLA70.259.383.387.290.378.1
      Config.Bayes.Contr.MetaOntol.SeenUnseen
      Full MACLA87.190.3
      w/o Bayesian79.481.2
      w/o Contrast.83.685.7
      w/o Meta81.278.4
      w/o Ontology82.884.1
      MethodTraining (GPU-hrs)WebShopALFWorld Unseen
      IPR [22]44.871.374.7
      SFT [2]8.060.267.2
      ETO [17]20.067.472.4
      MACLA0.01670.290.3
      Speedup vs IPR2,800×-+15.6 pts
      ConfigurationSeenUnseenProc. CountMeta CountReuse RateLLM Calls
      Full MACLA87.290.31874378%6.2
      w/o Bayesian Selection79.481.21894162%8.4
      w/o Contrastive83.685.72013971%6.8
      w/o Meta-Procedures81.278.4193065%9.1
      w/o Ontology82.884.11854274%6.5
      Max Capacity (Proc/Meta)Actual Proc.Meta Proc.SeenUnseenAvg 𝛼 𝛼 + 𝛽
      25 / 525568.364.10.61
      50 / 10501076.574.20.68

$$ U(p) = 0.5 \cdot \frac{\alpha}{\alpha + \beta} + 0.3 \cdot \min\left(1, \frac{\text{count}}{10}\right) + 0.2 \cdot \left(1 - \frac{\text{age}}{\text{max\_age}}\right) $$
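To make the maintenance rule concrete, here is a minimal Python sketch of this utility score and capacity-based pruning. The `Procedure` fields and the `max_age = 100.0` default are illustrative assumptions, not values taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class Procedure:
    alpha: float    # Beta success pseudo-count
    beta: float     # Beta failure pseudo-count
    use_count: int  # times the procedure was retrieved
    age: float      # episodes since last use

def utility(p: Procedure, max_age: float = 100.0) -> float:
    """U(p) = 0.5 * reliability + 0.3 * usage + 0.2 * recency."""
    reliability = p.alpha / (p.alpha + p.beta)
    usage = min(1.0, p.use_count / 10)
    recency = 1.0 - p.age / max_age
    return 0.5 * reliability + 0.3 * usage + 0.2 * recency

def prune(memory: list[Procedure], keep: int) -> list[Procedure]:
    """Retain the `keep` highest-utility procedures (capacity maintenance)."""
    return sorted(memory, key=utility, reverse=True)[:keep]
```

A reliable, frequently used, recently touched procedure then outranks a stale, unreliable one at pruning time.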

$$ \mathrm{EU}(\mathrm{Proc}_i \mid o_t) = \underbrace{\mathrm{Rel}_i(o_t)\,\frac{\alpha_i}{\alpha_i+\beta_i}\,R_{\max}}_{\text{expected reward}} - \underbrace{\mathrm{Risk}_i(o_t)\,\frac{\beta_i}{\alpha_i+\beta_i}\,C_{\mathrm{fail}}}_{\text{failure cost}} + \underbrace{\lambda_{\mathrm{info}}\,H[\mathrm{Beta}(\alpha_i,\beta_i)]}_{\text{information gain}} \tag{eq:expected_utility} $$
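The scoring rule fits in a few lines of Python. This is a sketch: the defaults `r_max = 1.0`, `c_fail = 0.5`, and `lam_info = 0.1` follow the paper's worked example, and the entropy term is passed in precomputed as `info`:

```python
def expected_utility(rel: float, alpha: float, beta: float,
                     risk: float, info: float,
                     r_max: float = 1.0, c_fail: float = 0.5,
                     lam_info: float = 0.1) -> float:
    """EU(Proc_i | o_t): expected reward minus failure cost plus info bonus."""
    rho = alpha / (alpha + beta)        # posterior mean success rate
    reward = rel * rho * r_max          # expected reward term
    cost = risk * (1.0 - rho) * c_fail  # failure cost term
    bonus = lam_info * info             # information-gain term
    return reward - cost + bonus
```

With the fridge numbers from the case study (rel = 0.91, Beta(10, 3), risk = 0.19, info = 1.24) this yields 0.700 − 0.022 + 0.124 = 0.802.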

$$ \Psi_i \leftarrow \Psi_i \cup \Delta\Psi_i^{+} \cup \{\neg \psi \mid \psi \in \Delta\Psi_i^{-}\}, \qquad \pi_i \leftarrow \mathrm{Merge}(\pi_i, \Delta\pi_i), \qquad \Phi_i \leftarrow \Phi_i \cup \Delta\Phi_i $$
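A set-based sketch of this refinement update, with preconditions and postconditions as predicate strings. Representing negation with a `not ` prefix and `Merge` as an order-preserving append are simplifying assumptions:

```python
def merge(steps: list[str], delta: list[str]) -> list[str]:
    """Order-preserving merge of edited steps (simplified Merge)."""
    return steps + [s for s in delta if s not in steps]

def refine(preconds: set[str], steps: list[str], postconds: set[str],
           delta_pre_plus: set[str], delta_pre_minus: set[str],
           delta_steps: list[str], delta_post: set[str]):
    """Apply one contrastive refinement step to a procedure:
    add discriminative preconditions, negate failure-associated ones,
    merge step edits, and extend postconditions."""
    preconds = preconds | delta_pre_plus | {f"not {p}" for p in delta_pre_minus}
    return preconds, merge(steps, delta_steps), postconds | delta_post
```

For example, a failure pattern involving heat appliances would add `not heat_appliance(x)` to the cooling procedure's precondition set.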

$$ \text{Cooling posterior:} \quad \mathbb{E}[\rho] = \frac{10}{13} = 0.76923 \approx 0.769\;\checkmark \qquad \mathrm{Var}[\rho] = \frac{10 \cdot 3}{13^2 \cdot 14} = \frac{30}{2366} = 0.01268 \approx 0.0127\;\checkmark $$
$$ \text{After update:} \quad \mathbb{E}[\rho] = \frac{11}{14} = 0.78571 \approx 0.786\;\checkmark $$

$$ H[\mathrm{Beta}(\alpha,\beta)] = \ln B(\alpha, \beta) - (\alpha{-}1)\psi(\alpha) - (\beta{-}1)\psi(\beta) + (\alpha{+}\beta{-}2)\psi(\alpha{+}\beta) $$
$$ H[\mathrm{Beta}(10,3)] = \ln B(10,3) - 9\psi(10) - 2\psi(3) + 11\psi(13) = -4.0604 - 20.0902 - 1.8439 + 24.7531 = -1.2414 \text{ nats} $$
$$ H[\mathrm{Beta}(11,3)] = \ln B(11,3) - 10\psi(11) - 2\psi(3) + 12\psi(14) = -4.3307 - 22.3316 - 1.8439 + 27.4003 = -1.1059 \text{ nats} $$
$$ \Delta H = -1.2414 - (-1.1059) = -0.1355 \approx -0.136 \text{ nats}\;\checkmark $$
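Because all Beta parameters here are integers, the digamma function reduces to harmonic numbers, so the entropy formula can be evaluated with the standard library alone. This is a sketch valid only for positive integer arguments:

```python
import math

GAMMA = 0.5772156649015329  # Euler–Mascheroni constant

def digamma(n: int) -> float:
    """psi(n) for positive integers n: -gamma + H_{n-1}."""
    return -GAMMA + sum(1.0 / k for k in range(1, n))

def beta_entropy(a: int, b: int) -> float:
    """Differential entropy of Beta(a, b) in nats."""
    ln_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (ln_beta
            - (a - 1) * digamma(a)
            - (b - 1) * digamma(b)
            + (a + b - 2) * digamma(a + b))
```

As expected for a success update, `beta_entropy(11, 3)` is lower than `beta_entropy(10, 3)`: observing a success sharpens the posterior.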

$$ \mathrm{EU}(\mathrm{Proc}_{\text{fridge}} \mid o_t) = 0.91 \times 0.769 \times 1.0 - 0.19 \times (1{-}0.769) \times 0.5 + 0.1 \times 1.24 = 0.700 - 0.022 + 0.124 = 0.802\;\checkmark $$

      \caption{MACLA Runtime Procedure with Function Descriptions}
      \label{alg:macla-runtime}
      \begin{algorithmic}[1]
      \Require observation $o_0$, memory $\mathbb{M}$ (procedures, meta-procedures, indices), horizon $H$
      \State $\mathbf{h} \gets \phi(o_0)$ \Comment{Embed observation}
      \State $\mathcal{C} \gets \textsc{RetrieveCandidates}(\mathbf{h}, \mathbb{M})$ \Comment{Top-$k$ ANN search}
      \While{not \textsc{Terminal} and $t < H$}
      \ForAll{$c \in \mathcal{C}$}
      \State $\mathrm{EU}[c] \gets \textsc{ExpectedUtility}(c, o_t, \mathbb{M})$ \Comment{Compute Eq.~\ref{eq:expected_utility}}
      \EndFor
      \State $c^\star \gets \arg\max_{c \in \mathcal{C}} \mathrm{EU}[c]$
      \If{$\mathrm{EU}[c^\star] < \theta_{\mathrm{conf}}$}
      \State $(o_{t+1},y) \gets \textsc{ZeroShotStep}(o_t)$ \Comment{LLM generates action directly}
      \ElsIf{$c^\star$ is $\text{MP}_j$}
      \State $(o_{t+1},y) \gets \textsc{ExecuteMeta}(\text{MP}_j,\Theta_j,o_t)$ \Comment{Run with control policy}
      \State $(\alpha_j,\beta_j) \gets \textsc{UpdateBeta}((\alpha_j,\beta_j), y)$ \Comment{$\alpha{\gets}\alpha{+}y$, $\beta{\gets}\beta{+}(1{-}y)$}
      \Else \Comment{$c^\star$ is atomic $\text{Proc}_i$}
      \If{$\textsc{CheckPre}(\Psi_i,o_t)$} \Comment{Verify preconditions match $o_t$}
      \State $(o_{t+1},y) \gets \textsc{ExecuteProc}(\pi_i,o_t)$ \Comment{Instantiate \& execute $\pi_i$}
      \State $y \gets y \land \textsc{CheckPost}(\Phi_i,o_{t+1})$ \Comment{Verify postconditions in $o_{t+1}$}
      \Else
      \State $(o_{t+1},y) \gets \textsc{ZeroShotStep}(o_t)$ \Comment{Preconditions failed, fallback}
      \EndIf
      \State $(\alpha_i,\beta_i) \gets \textsc{UpdateBeta}((\alpha_i,\beta_i), y)$
      \State $\textsc{RecordContext}(\mathcal{S}_i,\mathcal{F}_i, o_t, y)$ \Comment{Add to success/fail sets}
      \EndIf
      \If{$\textsc{RefineTrigger}(c^\star)$} \Comment{If $|\mathcal{S}|,|\mathcal{F}| \geq 3$}
      \State $\textsc{ContrastiveRefine}(c^\star)$ \Comment{LLM compares $\mathcal{S}$ vs. $\mathcal{F}$ (§\ref{sec:contrastive})}
      \EndIf
      \State $\mathbf{h} \gets \phi(o_{t+1})$;\; $\mathcal{C} \gets \textsc{RetrieveCandidates}(\mathbf{h}, \mathbb{M})$;\; $t \gets t+1$
      \EndWhile
      \If{$\textsc{EligibleForMeta}(\text{trace})$} \Comment{If $\geq$3 procs in stable order}
      \State $\textsc{ExtractOrRefineMeta}(\text{trace}, \mathbb{M})$ \Comment{Create/update meta-proc}
      \EndIf
      \State $\textsc{PruneAndMaintain}(\mathbb{M})$ \Comment{Remove low-utility via Eq.~\ref{eq:utility}}
      \end{algorithmic}
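The two bookkeeping primitives of this loop, the conjugate Beta update and the confidence-gated dispatch, can be sketched as follows. The callables `eu`, `zero_shot`, and `execute` are stand-ins for the subroutines named in the pseudocode:

```python
THETA_CONF = 0.4  # confidence threshold from the hyperparameter table

def update_beta(alpha: float, beta: float, y: int) -> tuple:
    """UpdateBeta: alpha <- alpha + y, beta <- beta + (1 - y)."""
    return alpha + y, beta + (1 - y)

def dispatch(candidates, eu, zero_shot, execute):
    """One iteration of the selection loop: take the arg-max-EU candidate,
    falling back to direct LLM generation when confidence is too low."""
    best = max(candidates, key=eu)
    if eu(best) < THETA_CONF:
        return zero_shot()   # LLM generates the action directly
    return execute(best)     # run the retrieved (meta-)procedure
```

A cooling success at t6 is then simply `update_beta(10, 3, 1)`, which returns the Beta(11, 3) posterior used in the case study.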
| Method | WebShop | InterCodeSQL | TravelPlanner | ALFWorld Seen | ALFWorld Unseen | Avg. |
|---|---|---|---|---|---|---|
| Prompt-based Methods | | | | | | |
| GPT-4 [1] | 63.2 | 38.5 | 71.9 | 42.9 | 38.1 | 50.9 |
| GPT-3.5-Turbo [10] | 62.4 | 37.8 | - | 7.9 | 10.5 | 29.7 |
| Llama-2-7B [18] | 17.9 | 4.0 | - | 0.0 | 0.0 | 5.5 |
| Outcome Refinement Methods | | | | | | |
| Llama-2-7B + SFT [2] | 60.2 | 54.9 | - | 60.0 | 67.2 | 60.6 |
| Llama-2-7B + RFT-PPO [13] | 64.2 | 52.4 | - | 22.1 | 29.1 | 42.0 |
| Llama-2-7B + RFT-CR [30] | 63.6 | 56.3 | - | 62.9 | 66.4 | 62.3 |
| Llama-2-7B + ETO [17] | 67.4 | 57.2 | - | 68.6 | 72.4 | 66.4 |
| Process Refinement Methods | | | | | | |
| Llama-2-7B + Step-PPO [22] | 64.0 | 60.2 | - | 65.7 | 69.4 | 64.8 |
| Llama-2-7B + IPR [22] | 71.3 | 61.3 | - | 70.3 | 74.7 | 69.4 |
| Claude-3.5-Sonnet † [4] | - | - | 65.5 | 82.5 | 74.7 | 74.2 |
| Qwen2.5-72B † [4] | - | - | 63.8 | 85.7 | 77.2 | 75.6 |
| Llama-2-7B + MACLA | 70.2 | 59.3 | 83.3 | 87.2 | 90.3 | 78.1 |
| Config. | Bayes. | Contr. | Meta | Ontol. | Seen | Unseen |
|---|---|---|---|---|---|---|
| Full MACLA | ✓ | ✓ | ✓ | ✓ | 87.1 | 90.3 |
| w/o Bayesian | ✗ | ✓ | ✓ | ✓ | 79.4 | 81.2 |
| w/o Contrast. | ✓ | ✗ | ✓ | ✓ | 83.6 | 85.7 |
| w/o Meta | ✓ | ✓ | ✗ | ✓ | 81.2 | 78.4 |
| w/o Ontology | ✓ | ✓ | ✓ | ✗ | 82.8 | 84.1 |
| Method | Training (GPU-hrs) | WebShop | ALFWorld Unseen |
|---|---|---|---|
| IPR [22] | 44.8 | 71.3 | 74.7 |
| SFT [2] | 8.0 | 60.2 | 67.2 |
| ETO [17] | 20.0 | 67.4 | 72.4 |
| MACLA | 0.016 | 70.2 | 90.3 |
| Speedup vs IPR | 2,800× | - | +15.6 pts |
| Configuration | Seen | Unseen | Proc. Count | Meta Count | Reuse Rate | LLM Calls |
|---|---|---|---|---|---|---|
| Full MACLA | 87.2 | 90.3 | 187 | 43 | 78% | 6.2 |
| w/o Bayesian Selection | 79.4 | 81.2 | 189 | 41 | 62% | 8.4 |
| w/o Contrastive | 83.6 | 85.7 | 201 | 39 | 71% | 6.8 |
| w/o Meta-Procedures | 81.2 | 78.4 | 193 | 0 | 65% | 9.1 |
| w/o Ontology | 82.8 | 84.1 | 185 | 42 | 74% | 6.5 |
| Max Capacity (Proc/Meta) | Actual Proc. | Meta Proc. | Seen | Unseen | Avg α/(α+β) |
|---|---|---|---|---|---|
| 25 / 5 | 25 | 5 | 68.3 | 64.1 | 0.61 |
| 50 / 10 | 50 | 10 | 76.5 | 74.2 | 0.68 |
| 100 / 20 | 98 | 18 | 83.1 | 85.6 | 0.74 |
| 150 / 35 | 143 | 31 | 86.4 | 88.7 | 0.77 |
| 200 / 50 (Default) | 187 | 43 | 87.2 | 90.3 | 0.79 |
| 300 / 75 | 203 | 47 | 87.1 | 90.1 | 0.79 |
| Benchmark | Perf. | Proc. Used | Reuse Rate | Avg α/(α+β) | Meta Hit | Proc. Len |
|---|---|---|---|---|---|---|
| ALFWorld-Seen | 87.2 | 34±8 | 78% | 0.81 | 42% | 4.2 |
| ALFWorld-Unseen | 90.3 | 28±6 | 76% | 0.79 | 38% | 4.1 |
| TravelPlanner | 83.3 | 41±12 | 72% | 0.75 | 51% | 6.3 |
| WebShop | 70.2 | 38±9 | 69% | 0.72 | 35% | 5.1 |
| InterCodeSQL | 59.3 | 52±18 | 51% | 0.64 | 18% | 2.8 |
Case study trace: 'Put chilled lettuce on counter'. Each timestep lists the LLM, Bayesian Selector, Memory System, Contrastive Refiner, and I/O Summary activity; the Contrastive Refiner is idle until t9.

t0 ★ (planning)
- LLM: Parse task 'Put chilled lettuce on counter'; extract verb=put, modifier=chilled, object=lettuce; infer compound goal requiring object_cooling → object_placement composition. Recognize 'chilled' as precondition trigger.
- Bayesian Selector: Retrieve top-5 goal-relevant procedures via FAISS (2.1 ms). Compute EU for candidates: EU_cooling = 0.78, EU_placement = 0.82. Detect dependency: placement requires cooled(object) precondition. Order: cooling before placement.
- Memory System: Query procedural memory with embedding φ('chilled lettuce'). Retrieve object_cooling (10 matches from training: 7 potato, 2 apple, 1 tomato) and object_placement (8 matches). Load precondition dependency graph: place ⇒ cooled.
- I/O Summary: Input: task description. Output: hierarchical plan [navigate → take → cool → place]. LLM tokens: 347 prompt + 89 completion.

t1
- LLM: Observation: 'You are in the middle of a room. Looking quickly around you, you see [...] a countertop 2, where lettuce 2 is located.' Parse spatial context; identify lettuce location = countertop 2. Extract entities: {lettuce 2, countertop 2}.
- Bayesian Selector: Select navigate_to_object procedure (template-based, no Bayesian ranking needed). Relevance score: sim(φ(o₁), e_nav) = 0.91. Confidence 0.7 exceeds the θ_conf = 0.4 threshold; proceed.
- Memory System: Retrieve atomic navigation pattern from procedure library: go to ⟨location⟩. Instantiate with ⟨location⟩ = countertop 2 using current observation entities. No LLM call required (rule-based substitution).
- I/O Summary: Action: go to countertop 2. Environment: 'You arrive at loc 5. On the countertop 2, you see lettuce 2.'

t2
- LLM: Observation: 'On the countertop 2, you see a lettuce 2'; confirm object visibility. Retrieve procedure: object_retrieval (pick-up skill). Verify preconditions: at(countertop 2) ∧ visible(lettuce 2) ∧ ¬holding(·), all satisfied.
- Bayesian Selector: No ranking needed; deterministic action from procedure template. Update context: holding(object) = False → True (predicted state change for next step).
- Memory System: Instantiate: take ⟨object⟩ from ⟨location⟩ → take lettuce 2 from countertop 2. Record intermediate state for trajectory segmentation.
- I/O Summary: Action: take lettuce 2 from countertop 2. Environment: 'You pick up lettuce 2 from countertop 2.' State: holding(lettuce 2).

t3
- LLM: Observation: 'You are carrying lettuce 2.' Cooling subgoal active. Verify precondition: ¬cooled(lettuce 2). Identify required appliance type: refrigeration device.
- Bayesian Selector: Compute EU for appliance choices: EU_fridge = 0.83 (from α = 10, β = 3, ρ̂ = 0.769, risk = 0.19), EU_freezer = 0.58 (from α = 4, β = 6, ρ̂ = 0.40, risk = 0.37). Select fridge_cooling variant (higher posterior).
- Memory System: Retrieve object_cooling procedure; Bayesian posterior: P(ρ|D) = Beta(10, 3), E[ρ] = 10/13 = 0.769, Var[ρ] = 30/2366 ≈ 0.0127. Load 5-step template sequence.
- I/O Summary: Subgoal: execute cooling branch. Action: go to fridge 1. Environment: 'You arrive at loc 12. The fridge 1 is closed.'

t4
- LLM: Observation: 'The fridge 1 is closed.' Recognize container-access precondition. Retrieve container_opening subprocedure (embedded in cooling template).
- Bayesian Selector: Validate cooling procedure: posterior passes confidence threshold (E[ρ] = 0.77 > θ_conf = 0.4). Recompute EU: 0.91 × 0.769 × 1.0 − 0.19 × 0.231 × 0.5 = 0.678 > 0.4; proceed.
- Memory System: Instantiate opening action: open ⟨container⟩ → open fridge 1. Track appliance state: open(fridge 1) = False → True.
- I/O Summary: Action: open fridge 1. Environment: 'You open the fridge 1. The fridge 1 is open. In it, you see nothing.'

t5
- LLM: Observation: 'The fridge 1 is open.' Execute core cooling action. Verify all preconditions met: holding(lettuce 2) ∧ open(fridge 1) ∧ at(fridge 1).
- Bayesian Selector: No re-ranking; executing within selected procedure template. Monitor for failure signals (none detected).
- Memory System: Instantiate cooling action: cool ⟨object⟩ with ⟨appliance⟩ → cool lettuce 2 with fridge 1. This is the critical learning step for the Bayesian update.
- I/O Summary: Action: cool lettuce 2 with fridge 1. Environment: 'You cool the lettuce 2 using the fridge 1.' State: cooled(lettuce 2) = True.

t6
- LLM: Observation: 'You have cooled lettuce 2.' Update world state: cooled(lettuce 2) = True, holding(lettuce 2) = True. Container cleanup: execute closing subprocedure.
- Bayesian Selector: Bayesian update (cooling success): (α, β) ← (10 + 1, 3 + 0) = (11, 3). New posterior: E[ρ] = 11/14 ≈ 0.786 (+1.7% improvement). Compute information gain: ΔH = H[Beta(10, 3)] − H[Beta(11, 3)] = 0.136.
- Memory System: Mark cooling procedure success; store context tuple (o_init, π_exec, o_term) in success set S_i. Intermediate reward signal: r_cool = +0.3 (step-level credit). Check co-occurrence with pending placement goal.
- I/O Summary: Action: close fridge 1. Environment: 'You close the fridge 1.' Transition: cooling subgoal complete; return to placement goal.

t7
- LLM: Observation: 'You are carrying cooled lettuce 2.' Navigate to target location. Cooling precondition now satisfied: cooled(lettuce 2) = True. Activate placement subgoal.
- Bayesian Selector: Retrieve object_placement procedure. Recompute EU with updated context: relevance = 0.94 (high similarity to placement scenarios), ρ̂ = 0.80 (from Beta(8, 2)), risk = 0.15, info-gain = 0.21 nats. Total EU = 0.94 × 0.80 × 1.0 − 0.15 × 0.20 × 0.5 + 0.1 × 0.21 = 0.752 + 0.021 = 0.773.
- Memory System: Instantiate placement template with navigation: go to ⟨location⟩ → go to countertop 2. Precondition check passes: cooled(lettuce 2) ∧ holding(lettuce 2) ∧ exists(countertop 2).
- I/O Summary: Action: go to countertop 2. Environment: 'You arrive at loc 5. On countertop 2, you see nothing.' (The lettuce is held, not on the counter.)

t8
- LLM: Observation: 'You are at countertop 2, holding cooled lettuce 2.' Execute terminal placement action. All preconditions verified: cooled ∧ holding ∧ at_target.
- Bayesian Selector: No additional ranking needed; final action of placement procedure. Predicted outcome: task success with p = 0.80 (placement posterior).
- Memory System: Instantiate: put ⟨object⟩ on ⟨location⟩ → put lettuce 2 on countertop 2. Prepare for end-of-episode state capture.
- I/O Summary: Action: put lettuce 2 on countertop 2. Environment: 'You put lettuce 2 on countertop 2.' Result: task SUCCESS.

t9 ★ (consolidation)
- LLM: Generate symbolic summary: 'Completed two-stage compound task: cooling-then-placement via fridge 1 on lettuce 2.' Segment trajectory into 2 procedures: τ_cool = [t3, t4, t5, t6] (4 actions, success), τ_place = [t7, t8] (2 actions, success). Extract precondition pattern: 'chilled' ⇒ cooling required.
- Bayesian Selector: Bayesian update (placement success): (α, β) ← (8 + 1, 2 + 0) = (9, 2). New posterior: E[ρ] = 9/11 ≈ 0.818 (+1.8% improvement). Export posteriors: cooling Beta(11, 3), placement Beta(9, 2). Calibration score: |E[ρ] − empirical| = 0.02 (well calibrated). Total entropy reduction: ΔH_total = 0.136 + 0.092 = 0.228 nats.
- Memory System: Meta-procedural learner analyzes co-occurrence patterns across the last 15 episodes: cooling → placement observed in 3 distinct configurations (potato-fridge-table, apple-fridge-shelf, lettuce-fridge-countertop). Pattern frequency 3/15 = 20% exceeds threshold (θ_meta = 15%). Create abstract meta-procedure meta_cool_and_place_object with composition policy: if 'chilled' ∈ task_modifiers then cooling → placement, else placement only. Store in M_meta with initial success count = 3.
- Contrastive Refiner: Contrastive analysis: extract success features {chilled, fridge, cooled, refrigerator_device}. Initialize success context for future contrastive refinement when failures accumulate (currently |S_cooling| = 11, |F_cooling| = 3; refinement threshold min(|S|, |F|) ≥ 3 ✓; will trigger discriminative pattern extraction on the next failure). Potential discriminators if future failures involve 'warm' or 'oven': refine precondition to cooling ⇒ cold_appliance ∧ ¬heat_appliance.
- I/O Summary: Learning summary: (1) Bayesian priors updated for 2 procedures; (2) new meta-procedure stored; (3) contrastive learning primed. LLM tokens: 412 prompt + 156 completion. Episode stats: 8 actions, 2 LLM calls, 18.3 s wall-clock time.
| Parameter | Value | Selection Method | Theoretical Justification |
|---|---|---|---|
| θ_dup | 0.85 | Empirical | None provided |
| θ_conf | 0.4 | Empirical | None provided |
| θ_meta | 15% | Empirical | None provided |
| n_s^min, n_f^min | 3 | Heuristic | Minimal statistical significance |
| λ_r, λ_f, λ_t | 0.5, 0.3, 0.2 | Grid search | Constraint: Σλ_i = 1 |
| λ_info | 0.1 | Empirical | None provided |
| K_fail | 15 | Empirical | None provided |
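For reference, these hyperparameters can be bundled into a single configuration object. This is a sketch: the field names are illustrative, with values copied from the table above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MACLAConfig:
    """MACLA hyperparameters (field names are illustrative)."""
    theta_dup: float = 0.85            # duplicate-detection threshold
    theta_conf: float = 0.4            # confidence threshold for procedure use
    theta_meta: float = 0.15           # meta-procedure co-occurrence threshold
    n_min: int = 3                     # min successes and failures (n_s, n_f)
    lambdas: tuple = (0.5, 0.3, 0.2)   # utility weights (lambda_r, _f, _t)
    lambda_info: float = 0.1           # information-gain weight
    k_fail: int = 15                   # failure-window size

    def validate(self) -> None:
        assert abs(sum(self.lambdas) - 1.0) < 1e-9, "utility weights must sum to 1"
```

Freezing the dataclass keeps the configuration immutable across an experiment, and `validate()` enforces the grid-search constraint Σλ_i = 1.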


      References


      [agentboard] Ma, Chang, Zhang, Junlei, Zhu, Zhihao, Yang, Cheng, Yang, Yujiu, Jin, Yaohui, Lan, Zhenzhong, Kong, Lingpeng, He, Junxian. (2025). AgentBoard: an analytical evaluation board of multi-turn LLM agents. Proceedings of the 38th International Conference on Neural Information Processing Systems.

[xiong2024watch] Xiong, Weimin, Song, Yifan, Zhao, Xiutian, Wu, Wenhao, Wang, Xun, Wang, Ke, Li, Cheng, Peng, Wei, Li, Sujian. (2024). Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2024.emnlp-main.93.

      [yao2023react] Yao, Shunyu, Zhao, Jeffrey, Yu, Dian, Du, Nan, Shafran, Izhak, Narasimhan, Karthik, Cao, Yuan. (2023). React: Synergizing reasoning and acting in language models. International Conference on Learning Representations (ICLR).


      [yin2023] Yin, Da, Brahman, Faeze, Ravichander, Abhilasha, Chandu, Khyathi, Chang, Kai-Wei, Choi, Yejin, Lin, Bill Yuchen. (2024). Agent Lumos: Unified and Modular Training for Open-Source Language Agents. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2024.acl-long.670.

[zeng2023] Zeng, Aohan, Liu, Mingdao, Lu, Rui, Wang, Bowen, Liu, Xiao, Dong, Yuxiao, Tang, Jie. (2024). AgentTuning: Enabling Generalized Agent Abilities for LLMs. Findings of the Association for Computational Linguistics: ACL 2024. doi:10.18653/v1/2024.findings-acl.181.


      [voyager] Wang, Guanzhi, Xie, Yuqi, Jiang, Yunfan, Mandlekar, Ajay, Xiao, Chaowei, Zhu, Yuke, Fan, Linxi, Anandkumar, Anima. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv preprint arXiv:2305.16291.

[memgpt] Packer, Charles, Fang, Vivian, Patil, Shishir G., Lin, Kevin, Wooders, Sarah, Gonzalez, Joseph E. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv preprint arXiv:2310.08560.


      [zhong2024memorybank] Zhong, Wei, Guo, Liang, Gao, Qian, Ye, Haotian, Wang, Yuxin. (2024). MemoryBank: Enhancing Large Language Models with Long-term Memory. AAAI.

      [fang2025memp] Fang, Runnan, Liang, Yuan, Wang, Xiaobin, Wu, Jialong, Qiao, Shuofei, Xie, Pengjun, Huang, Fei, Chen, Huajun, Zhang, Ningyu. (2025). Memp: Exploring agent procedural memory. arXiv preprint arXiv:2508.06433.

[hu2024hiagent] Hu, Mingyuan, Chen, Tianhong, Chen, Qian, others. (2024). HiAgent. CVPR Workshops.

      [xu2025amem] Wujiang Xu, Kai Mei, Hang Gao, Juntao Tan, Zujie Liang, Yongfeng Zhang. (2025). A-MEM: Agentic Memory for LLM Agents.

[yu2025memagent] Yu, Hang, Chen, Tianyu, Feng, Jiale, others. (2025). MemAgent. arXiv preprint arXiv:2507.02259.

[liang2025sage] Liang, Xuechen, Tao, Meiling, Xia, Yinghui, others. (2025). SAGE. Neurocomputing.

      [wang2024awm] Wang, Zora Zhiruo, Mao, Jiayuan, Fried, Daniel, Neubig, Graham. (2024). Agent workflow memory. arXiv preprint arXiv:2409.07429.

      [chen2024automanual] Chen, Ming, Li, Yifei, Yang, Yao, others. (2024). AutoManual: Constructing Instruction Manuals by LLM Agents via Interactive Environmental Learning. arXiv preprint arXiv:2405.16247.

      [tot] Yao, Shunyu, Yu, Dian, Zhao, Jeffrey, Shafran, Izhak, Griffiths, Tom, Cao, Yuan, Narasimhan, Karthik. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601.

      [got] Besta, Maciej, Blach, Nico, others. (2023). Graph of Thoughts: Solving Elaborate Problems with Large Language Models. arXiv:2308.09687.

      [selfrefine] Madaan, Aman, Tandon, Niket, Yazdanbakhsh, Amir, Hase, Peter, others. (2023). Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS.

      [memprompt] Xu, Peng, others. (2023). MemPrompt: Memory-assisted Prompt Editing to Reduce Hallucination. arXiv:2305.14739.

      [expel] Liu, Howard, others. (2023). ExpeL: LLM Agents with Experience Learning. arXiv:2308.10144.

      [webgpt] Nakano, Reiichiro, others. (2021). WebGPT: Browser-assisted Question Answering with Human Feedback. NeurIPS.

[alfworld] Shridhar, Mohit, Yuan, Xingdi, Côté, Marc-Alexandre, Bisk, Yonatan, Trischler, Adam, Hausknecht, Matthew. (2021). ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. ICLR.

      [webshop] Yao, Shunyu, others. (2022). WebShop: Towards Scalable Real-World Web Interaction with LLM Agents. NeurIPS Datasets and Benchmarks.

      [saycan] Ahn, Michael, Brohan, Anthony, others. (2022). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. Conference on Robot Learning (CoRL).

      [achiam2023gpt] Achiam, Josh, Adler, Steven, Agarwal, Sandhini, Ahmad, Lama, Akkaya, Ilge, Aleman, Florencia Leoni, Almeida, Diogo, Altenschmidt, Janko, Altman, Sam, Anadkat, Shyamal, others. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.

      [ouyang2022training] Ouyang, Long, Wu, Jeffrey, Jiang, Xu, Almeida, Diogo, Wainwright, Carroll, Mishkin, Pamela, Zhang, Chong, Agarwal, Sandhini, Slama, Katarina, Ray, Alex, others. (2022). Training language models to follow instructions with human feedback. Advances in neural information processing systems.

      [touvron2023llama] Touvron, Hugo, Martin, Louis, Stone, Kevin, Albert, Peter, Almahairi, Amjad, Babaei, Yasmine, Bashlykov, Nikolay, Batra, Soumya, Bhargava, Prajjwal, Bhosale, Shruti, others. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

      [chen2023fireact] Chen, Baian, Shu, Chang, Shareghi, Ehsan, Collier, Nigel, Narasimhan, Karthik, Yao, Shunyu. (2023). Fireact: Toward language agent fine-tuning. arXiv preprint arXiv:2310.05915.

      [schulman2017proximal] Schulman, John, Wolski, Filip, Dhariwal, Prafulla, Radford, Alec, Klimov, Oleg. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

      [zhang2023cumulative] Zhang, Yifan, Yang, Jingqin, Yuan, Yang, Yao, Andrew Chi-Chih. (2023). Cumulative reasoning with large language models. arXiv preprint arXiv:2308.04371.

[song2024trial] Song, Yifan, Yin, Da, Yue, Xiang, Huang, Jie, Li, Sujian, Lin, Bill Yuchen. (2024). Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2024.acl-long.409.

[song2024agentbank] Song, Yifan, Xiong, Weimin, Zhao, Xiutian, Zhu, Dawei, Wu, Wenhao, Wang, Ke, Li, Cheng, Peng, Wei, Li, Sujian. (2024). AgentBank: Towards Generalized LLM Agents via Fine-Tuning on 50000+ Interaction Trajectories. Findings of the Association for Computational Linguistics: EMNLP 2024. doi:10.18653/v1/2024.findings-emnlp.116.

      [xie2024travelplanner] Xie, Jian, Zhang, Kai, Chen, Jiangjie, Zhu, Tinghui, Lou, Renze, Tian, Yuandong, Xiao, Yanghua, Su, Yu. (2024). Travelplanner: A benchmark for real-world planning with language agents. arXiv preprint arXiv:2402.01622.

      [yang2306intercode] Yang, John, Prabhakar, Akshara, Narasimhan, Karthik, Yao, Shunyu. (2023). InterCode: standardizing and benchmarking interactive coding with execution feedback. Proceedings of the 37th International Conference on Neural Information Processing Systems.

      [shinn2023reflexion] Shinn, Noah, Cassano, Federico, Gopinath, Ashwin, Narasimhan, Karthik, Yao, Shunyu. (2023). Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems.

      [reimers2019sentence] Reimers, Nils, Gurevych, Iryna. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.

      [zhang2024memorysurvey] Zhang, Zeyu, Dai, Quanyu, Bo, Xiaohe, Ma, Chen, Li, Rui, Chen, Xu, Zhu, Jieming, Dong, Zhenhua, Wen, Ji-Rong. (2025). A Survey on the Memory Mechanism of Large Language Model-based Agents. ACM Trans. Inf. Syst.. doi:10.1145/3748302.

      [liu2025foundationagents] Liu, Bowen, Li, Xuan, Zhang, Jing, others. (2025). Advances and Challenges in Foundation Agents: From Brain-inspired Intelligence to Evolutionary, Collaborative, and Safe Systems.

[li2025memos] Li, Zhen, Song, Shijie, Xi, Chen, others. (2025). MemOS. Proceedings of the Web Conference (Companion).