
LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures

Hai Huang (Atlassian, hhuang3@atlassian.com), Yann LeCun (NYU, yann.lecun@nyu.edu), Randall Balestriero (Brown University, rbalestr@brown.edu)

Abstract

Large Language Model (LLM) pretraining, finetuning, and evaluation rely on input-space reconstruction and generative capabilities. Yet, it has been observed in vision that embedding-space training objectives, e.g., with Joint Embedding Predictive Architectures (JEPAs), are far superior to their input-space counterparts. That mismatch in how training is achieved between language and vision opens up a natural question: can language training methods learn a few tricks from the vision ones? The lack of JEPA-style LLMs is a testament to the challenge of designing such objectives for language. In this work, we propose a first step in that direction by developing LLM-JEPA, a JEPA-based solution for LLMs applicable both to finetuning and pretraining. Thus far, LLM-JEPA is able to outperform the standard LLM training objectives by a significant margin across models, all while being robust to overfitting. Those findings are observed across numerous datasets (NL-RX, GSM8K, Spider, RottenTomatoes) and various models from the Llama3, OpenELM, Gemma2, and Olmo families. Code: https://github.com/rbalestr-lab/llm-jepa.


Figure 1: LLM-JEPA produces strong fine-tuned models across datasets and models.

Introduction

The research landscape around representation learning has been increasingly divided into two camps: (i) generative or reconstruction-based methods Brown et al. (2020b); Chowdhery et al. (2023); He et al. (2022); LeCun (2022), and (ii) reconstruction-free Joint Embedding Predictive Architectures (JEPAs) Assran et al. (2023); Baevski et al. (2022); Bardes et al. (2024). While the former is self-explanatory, the latter learns a representation by ensuring that different views, e.g., pictures of the same building at different times of day, can be predicted from each other, all while preventing a collapse of the embeddings. By moving away from input-space objectives, JEPA training benefits from fewer biases Littwin et al. (2024), at the cost of potential dimensional collapse of the representation Jing et al. (2021); Kenneweg et al. (2025). That divide has been well studied in vision, where it was found that JEPAs offer multiple provable benefits when it comes to knowledge discovery for perception tasks. In the realm of Natural Language Processing, however, reconstruction-based methods remain


Figure 2: Left: JEPA applied to NLP tasks with paired Text and Code, which are naturally two views of the same thing. Right (top): an illustration of the NL-RX-SYNTH dataset, where each sample consists of a natural-language description of a regular expression (Text) and the regular expression itself (Code). (bottom): the Spider dataset, where Text is the database ID plus a description of the SQL query and Code is the SQL query itself.

predominant. In fact, today's Large Language Models are mostly judged by their ability to generate samples and answers in input space, in text form, making it challenging to leverage JEPA objectives.

Yet, LLMs' tasks also involve perception and reasoning, where JEPA is known to be preferable. It thus seems crucial to adapt JEPA solutions to LLMs in the hope of showcasing the same benefits as witnessed in vision. That first step is exactly what we present in this study. We propose to improve the representation quality of LLMs by leveraging a novel objective combining the original reconstruction-based loss with an additional JEPA objective. To do so, we focus first on tasks and datasets that are inherently suited for JEPA objectives: the ones providing multiple views of the same underlying knowledge. One typical example is a git issue and the corresponding code diff (fig. 2) Jimenez et al. (2024). The two samples are two views, one in plain English and one in code, of the same underlying functionality. Let's use that particular example to highlight our core contribution:

Viewing the (text,code) pairs as views of the same underlying knowledge enables JEPA objectives to be utilized with LLMs, complementing the standard text → code generative task.

We strongly emphasize that being able to obtain non-trivial views, such as described above, is crucial to the success of JEPA objectives. While we restrict ourselves to datasets offering those non-trivial views, developing a mechanism akin to data augmentation in vision would enable JEPA objectives to be used on any dataset. Nonetheless, we believe that our proposed solution, coined LLM-JEPA, and empirical study will serve as a first step towards more JEPA-centric LLM pretraining and finetuning. We summarize our contributions below:

· Novel JEPA-based training objective: we present the first JEPA-based training objective for LLMs, operating in embedding space with different views, closely following vision-based JEPAs without sacrificing the generative capabilities of LLMs.
· Improved state of the art: we empirically validate our formulation in various finetuning settings, where we obtain improvements over standard LLM finetuning solutions. We also explore pretraining scenarios, showing encouraging results for LLM-JEPA.
· Extensive empirical validation: across model families (Llama, Gemma, apple/OpenELM, allenai/OLMo), datasets (NL-RX, GSM8K, Spider, RottenTomatoes), and sizes.

Background

Large Language Models. Contemporary LLMs are mostly built from the same core principles: stacking numerous layers of nonlinear operations and skip-connections, known as Transformers. While subtleties may differ, e.g., positional embeddings, initialization, normalization, the main driver of performance remains the availability of high-quality datasets during the pretraining stage. The training objective itself has also been standardized across methods: autoregressive token-space reconstruction. Let's first denote by L LLM the typical LLM objective used for the specific task and dataset at hand. In most cases, this is a cross-entropy loss between the predicted tokens and the ground-truth tokens to reconstruct. We note that our LLM-JEPA construction is agnostic to L LLM, hence making our method applicable to numerous scenarios.

$$
\mathcal{L}_{\mathrm{LLM}}(\text{Text}) = \sum_{L} \mathrm{CrossEntropy}\big(\mathrm{Classifier}(\text{Text}_{1:L-1}),\, \text{Text}_{L}\big) \qquad (1)
$$

where Classifier predicts the logits of the next token Text_L given the past tokens Text_{1:L-1}. Computation of eq. (1) is done at once over all positions L through causal autoregression. Different stages and tasks may vary the input and output of the loss.
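For concreteness, eq. (1) can be sketched in a few lines of PyTorch (a generic sketch, not the authors' code; `logits` is assumed to come from a causal LM and have shape [batch, seq, vocab]):

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Causal next-token cross-entropy: the prediction at position l is
    scored against the token at position l + 1.

    logits: [batch, seq_len, vocab]; tokens: [batch, seq_len] (int ids).
    """
    shift_logits = logits[:, :-1, :]   # predictions for positions 0..L-2
    shift_labels = tokens[:, 1:]       # targets are the next tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```

All positions are scored in a single forward pass, which is what "at once over all positions L through causal autoregression" refers to.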

Alternative training objectives. Next-token prediction is the prevalent pretraining solution for today's latest LLMs. There exist a few alternatives, e.g., SimCSE leverages a contrastive loss in the latent space by treating different dropout-induced views of the same sentence as positive pairs, resulting in state-of-the-art sentence embedding quality Gao et al. (2021). In a similar spirit, Wang et al. (2022) use weak supervision from text pairs to learn a joint embedding architecture. An alternative relying on pretrained models instead of supervised text pairs was explored in Ni et al. (2021). While those approaches are powerful, they are all concerned with producing text embeddings without generative capabilities, which greatly limits their applicability since numerous evaluations and use-cases require generation, which is core to our proposed method. Another solution employs BERT pretraining coupled with a latent-space semantic loss to ensure that representations of semantically similar sentences are nearby in embedding space. This additional term led to improved performance on semantic tasks compared to masked language modeling alone Reimers and Gurevych (2019); yet again, by building atop BERT, this solution prevents generative evaluation and use.

LLM-JEPA: Improving LLMs' Reasoning and Generative Capabilities

We propose the LLM-JEPA loss in section 3.1 along with extensive empirical validations in section 4 demonstrating clear finetuning and pretraining benefits.

The LLM-JEPA Objective

Throughout this section, we will use Text and Code as concrete examples of having different views of the same underlying knowledge. It should be clear to the reader that our proposed LLM-JEPA objective handles different types of views similarly.

The construction of our LLM-JEPA objective relies on two principles. First, we must preserve the generative capabilities of LLMs and we therefore start with the L LLM from eq. (1). Second, we aim to improve the abstraction capabilities of LLMs using the joint embedding prediction task. On top of L LLM , we then propose to add the well-established JEPA objective leading to the complete loss L defined as

$$
\mathcal{L}_{\text{LLM-JEPA}} = \mathcal{L}_{\mathrm{LLM}} + \lambda \cdot d\big(\mathrm{Pred}(\mathrm{Enc}(\text{Text})),\, \mathrm{Enc}(\text{Code})\big) \qquad (2)
$$

where λ ≥ 0 is a hyperparameter balancing the contribution of the two terms, Pred and Enc are the predictor and encoder networks respectively, and d is a metric of choice, e.g., the ℓ2 distance. Let's now precisely describe each of those components.
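A minimal sketch of the combined objective, assuming the two view embeddings have already been computed and taking d to be the cosine distance (variable names are ours, not the authors'):

```python
import torch
import torch.nn.functional as F

def llm_jepa_loss(loss_llm: torch.Tensor,
                  pred_text: torch.Tensor,
                  enc_code: torch.Tensor,
                  lam: float = 1.0) -> torch.Tensor:
    """L = L_LLM + lambda * d(Pred(Enc(Text)), Enc(Code)).

    pred_text, enc_code: [batch, hidden] embeddings of the two views.
    d is the cosine distance, 1 - cos, averaged over the batch.
    """
    d = 1.0 - F.cosine_similarity(pred_text, enc_code, dim=-1).mean()
    return loss_llm + lam * d
```

With λ = 0 the objective reduces exactly to standard next-token-prediction training.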

The encoder. We use the hidden state of the last token from the last layer as the embedding of an input sequence, as commonly done for LLM probing. We pack both Text and Code into a single context window, applying an attention mask to ensure they do not reference each other. Implementation-wise, most HuggingFace transformers support an additive attention mask, where setting entry (i, j) to -∞ prevents token i from attending to token j. Using this mechanism, we implement LLM-JEPA with only one additional forward pass. This introduces extra cost during training, but not during inference; see section 6 for further discussion.

The metric. When it comes to comparing embeddings, it is now widely accepted in vision to leverage the cosine similarity. We thus propose to do the same for LLM-JEPA.

The predictor. We leverage the auto-regressive nature of LLMs and their internal self-attention to define a tied-weights predictor. By introducing a special token [PRED] at the end of a given input, we allow for further nonlinear processing of the input, thereby producing Pred(·) at the final embedding of the last layer. By reusing the internal weights of the LLM for the prediction task, we greatly reduce the training overhead and architectural design choices. Practically, we append k ∈ {0, ..., K} predictor tokens to an input prompt and use the embedding of the last predictor token as Pred(Enc(·)). When k = 0, the predictor is trivial, i.e., Pred(x) = x.
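The encoder and tied-weights predictor can be sketched together, assuming a HuggingFace-style model that returns hidden_states; the [PRED] token id and the function name are illustrative, not taken from the authors' code:

```python
import torch

def view_embedding(model, input_ids: torch.Tensor,
                   pred_token_id: int = None, k: int = 0) -> torch.Tensor:
    """Embed one view as the last layer's last-token hidden state.

    With k > 0, k [PRED] tokens are appended so the LLM itself acts as a
    tied-weights predictor; k = 0 gives the plain encoder output.
    """
    if k > 0:
        preds = torch.full((input_ids.size(0), k), pred_token_id,
                           dtype=input_ids.dtype)
        input_ids = torch.cat([input_ids, preds], dim=1)
    out = model(input_ids, output_hidden_states=True)
    return out.hidden_states[-1][:, -1, :]   # [batch, hidden]
```

The same forward pass therefore yields Enc(·) when k = 0 and Pred(Enc(·)) when k ≥ 1, with no extra parameters.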

Implementation with Custom Attention Mask

The most important challenge in implementing our proposed LLM-JEPA objective lies in obtaining the embeddings of the different views, e.g., Text and Code in eq. (2). A priori, it is not possible to get them in one forward pass: although the two views are already concatenated for the next-token prediction part of the loss, self-attention, albeit causal, would make the representation of the second view depend on the first view. As a result, we propose the following custom self-attention mask that is causal per block, with the number of blocks set to 2:

def additive_mask(k: int):
    """Return a k-by-k additive causal mask (upper triangle set to -inf)."""
    mask = torch.zeros((k, k), dtype=torch.float32)
    mask[torch.triu(torch.ones(k, k), diagonal=1) == 1] = -torch.inf
    return mask

# Initialize all elements to -inf.
mask = torch.full((batch_size * 2, 1, seq_length, seq_length), -torch.inf).to(device)

# Text starts at t_start with size t_size; Code starts at c_start with size c_size.
# Set the two causal blocks for the i-th batch element.
mask[i, :, t_start:t_start + t_size, t_start:t_start + t_size] = additive_mask(t_size)
mask[i, :, c_start:c_start + c_size, c_start:c_start + c_size] = additive_mask(c_size)

By leveraging the above implementation, we are able to obtain the LLM-JEPA loss value in two forward passes instead of three. While this is still a substantial slowdown, we will explore later in the manuscript a dropout version where the JEPA term is only applied to a fraction of the mini-batches; the end result is that, at comparable FLOPs, the proposed LLM-JEPA is still able to outperform the baseline.
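As a sanity check of the block-causal construction, the per-sample mask can be wrapped in a small helper (our naming; spans in the example are illustrative): tokens attend causally within their own view, and every cross-view entry stays at -∞.

```python
import torch

def block_causal_mask(seq_length: int, spans) -> torch.Tensor:
    """Additive mask that is causal within each (start, size) span in `spans`
    and -inf everywhere else, so the views cannot attend to each other."""
    mask = torch.full((seq_length, seq_length), -torch.inf)
    for start, size in spans:
        tri = torch.zeros((size, size))
        tri[torch.triu(torch.ones(size, size), diagonal=1) == 1] = -torch.inf
        mask[start:start + size, start:start + size] = tri
    return mask
```

For example, with Text occupying positions 0-1 and Code positions 2-4, entry (3, 0) is -∞ (Code cannot see Text) while entry (1, 0) is 0 (causal attention within Text).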

Relation to Previous Work. Because loss functions such as L LLM (input-space reconstruction, since tokens are a lossless compression of the original prompts) have been shown to be sub-optimal in vision, a few LLM variations have started to employ embedding-space regularizers and training objectives Barrault et al. (2024); Wang and Sun (2025). Current solutions, however, rely on intricate structural constraints of the embedding space, e.g., hierarchical organization and clusters, and thus fall out of the JEPA scope. We also note that our interpretation of views when it comes to LLM datasets, e.g., (text issue, code diff), has been leveraged as part of LLM finetuning solutions, by learning to generate one from the other, without a JEPA-style loss. This includes natural language to regular expression translation (Locascio et al., 2016; Ye et al., 2020; Zhong et al., 2018), natural language to SQL parsing (Guo et al., 2019; Iyer et al., 2017; Li et al., 2023; Wang et al., 2019; Yu et al., 2018) and the more recent issue descriptions to code diffs (Cabrera Lozoya et al., 2021; Hoang et al., 2020; Tian et al., 2020; Zhou et al., 2023). More intricate examples involve text-based problem solving and the counterpart program induction (Amini et al., 2019; Cobbe et al., 2021; Hendrycks et al., 2021; Ling et al., 2017).

A Good Next Token Predictor is not a Good JEPA

Before validating the proposed LLM-JEPA on real tasks and models, we ask ourselves a simple question: is it really necessary to have an additional JEPA term, or is that term already implicitly minimized by the original next-token prediction objective?


Figure 3: Left: the top 100 singular values of Enc(Text) - Enc(Code). The curves of LLM-JEPA (ours) are a few orders of magnitude lower than those of the base model and regular fine-tuning, meaning the mapping from Text to Code is confined within a narrow subspace, fostering the structure we see in Figure 4. Right: losses when fine-tuning with the L LLM loss and the L LLM-JEPA loss (our method). We measure both the cross-entropy loss for next-token prediction (L LLM in the chart) and the JEPA prediction loss (d(·, ·), pred in the chart), although the latter does not contribute to the gradient in the baseline case. The accuracy is 51.95% for L LLM and 71.10% for L LLM-JEPA. Since both runs reach a similar L LLM loss, that loss cannot explain the accuracy gap. pred stays constant under L LLM while it is minimized under L LLM-JEPA, hence pred should be the main driver of the accuracy gap.

To answer that, we compare two controlled settings using Llama-3.2-1B-Instruct and NL-RX-SYNTH. In the first, we use the usual next-token prediction loss but monitor the JEPA objective, i.e., no gradient comes from the JEPA loss. In the second, we use the proposed LLM-JEPA for gradient computation. We monitor the prediction loss in both cases. We obtain the following finding in fig. 3: minimizing L LLM does not implicitly minimize L JEPA, indicating that it is required to add that term during training. This can be seen by comparing the red and green lines. A natural follow-up question is: are we trading off next-token prediction for JEPA? In the same fig. 3 we see that the next-token prediction capability is not hindered by the presence of the JEPA term (compare the blue and yellow lines, which overlap). This is a very important observation echoing Balestriero and LeCun (2024), where it was observed that autoencoders in image space could become much stronger classifiers without sacrificing the reconstruction and generative objective. All in all, we find that employing LLM-JEPA brings additional structure to the LLM latent space without altering its generative capabilities. As we will see in the following sections, we indeed obtain much better performance than the baseline in the generative evaluation setup.

Empirical Validation: LLM-JEPAs Outperform LLMs

In this section, we first validate the proposed LLM-JEPA across four model families and four datasets with natural two-view structures (section 4.1). The consistent improvements observed across the board motivate us to further examine the internal representations of LLM-JEPA to better understand the source of its strength (section 4.2). We conclude this section with a rigorous ablation study on key design choices (section 4.3).

We adopt a universal protocol across all experiments by using five fixed random seeds, {82, 23, 37, 84, 4}, for training. Each experiment is repeated five times, once with each seed. This setup enables us to assess the stability of LLM-JEPA and to conduct paired one-tailed t-tests for statistical significance.
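The significance test amounts to the standard paired t statistic over per-seed accuracies (a generic sketch, not the authors' evaluation code; the accuracy values below are made up for illustration). With n = 5 seeds, df = 4, the one-tailed 5% critical value is roughly 2.132.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(acc_jepa, acc_base) -> float:
    """t statistic of a paired, one-tailed test that acc_jepa > acc_base.

    Both inputs are per-seed accuracies in matching order.
    """
    diffs = [a - b for a, b in zip(acc_jepa, acc_base)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# One-tailed 5% critical value for df = 4 is about 2.132; a larger t
# rejects the null hypothesis that the two methods perform equally.
```

Pairing by seed removes the between-seed variance, which is why the same five seeds are reused for both methods.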

Fine-tuning and Pretraining Stronger Generative Models via JEPA

LLM-JEPA Improves Finetuning. We run experiments across multiple pretrained LLMs (Llama-3.2-1B-Instruct (Grattafiori et al., 2024), gemma-2-2b-it (Team et al., 2024), OpenELM-1_1B-Instruct (Mehta et al., 2024), and OLMo-2-0425-1B-Instruct (OLMo et al., 2024)) with various datasets (NL-RX-SYNTH, NL-RX-TURK (Locascio et al., 2016), GSM8K (Cobbe et al., 2021), Spider (Yu et al., 2018)).

To demonstrate that LLM-JEPA improves upon the strongest possible baseline, we first search for the best learning rate lr ∈ {1e-5, 2e-5, 4e-5, 8e-5} by selecting the value that yields the highest accuracy under L LLM after 4 epochs for a given (model, dataset) pair. Then we tune the hyperparameters specific to L LLM-JEPA, k and λ, on a two-dimensional grid defined by (k, λ) ∈ {0, 1, 2, 3, 4} × {0.5, 1, 2, 4} (fig. 7 and table 10 in the Appendix). For both NL-RX-SYNTH and NL-RX-TURK, accuracy is exact match of the generated regular expression; for GSM8K, accuracy is exact match of the final result; and for Spider, accuracy is exact match of the execution result of the generated query.

We provide results demonstrating that LLM-JEPA improves performance across:

· four model families, see fig. 1 left and table 12 in the Appendix;
· four datasets, see table 13 in the Appendix;
· 6 training epochs (fig. 1 right), where we also observe that LLM-JEPA resists overfitting whereas standard fine-tuning does not;
· four different sizes: 1B, 3B, 7B, and 8B, see table 15 in the Appendix.

Examples of inputs, targets, model predictions, and error analyses are provided in table 1 (more examples can be found in table 11 in the Appendix). For LoRA fine-tuning , the performance gains of LLM-JEPA hold consistently across different LoRA ranks (table 8 in the Appendix). We also observe the same resistance to overfitting in the LoRA setting (fig. 8 in the Appendix).

Table 1: Regular expressions generated by Llama-3.2-1B-Instruct after fine-tuning with L LLM loss and L LLM -JEPA loss (ours). Color code: wrong , extra , missing

LLM-JEPA Improves Pretraining. A natural extension of our fine-tuning results is to examine pretraining. We pretrain Llama-3.2-1B-Instruct from randomly initialized weights on the NL-RX-SYNTH dataset. Owing to the limited size of the dataset, the pretrained model fails to reliably learn how to terminate generation. To address this, we adjust the evaluation criterion, deeming a generated solution valid as long as it begins with the ground-truth sequence. We find that LLM-JEPA also improves the quality of the learned representation, as shown in table 2.

Table 2: Pretraining accuracy on dataset NL-RX-SYNTH by Next Token Prediction ( L LLM ) loss vs. L LLM -JEPA loss (our method). We inherit the best configuration from fine-tuning. Each case runs five times. Average accuracy and standard deviation are reported. We also report p -value of paired, single-tailed t -Test.

We then conduct a more advanced pretraining experiment on the cestwc/paraphrase dataset, containing groups of 5 paraphrases. We leverage the five paraphrases to construct the JEPA loss by having the


Figure 4: t-SNE plot of Text and Code representations for (a) the baseline fine-tuned with the NTP loss and (b) LLM-JEPA (ours) with k = 0. LLM-JEPA (ours) induces clear structure on the representations, while fine-tuning with the NTP loss disrupts the structure present in the base model. A full version of this figure is in section A.2.

i-th version of a paraphrase predict the (i+1)-th version. The goal is to encourage the JEPA loss to tie the embeddings of all versions into a compact subspace, providing a geometric foundation to align their representations. We pretrain the model for four epochs and then evaluate it on Rotten Tomatoes and Yelp after one epoch of fine-tuning. Although there is no direct link between the pretraining and evaluation datasets, we show that LLM-JEPA pretraining yields statistically significant improvements in downstream performance (see table 9 in the Appendix). Note that fine-tuning does not employ the JEPA loss, highlighting that the benefits arise specifically from the pretraining stage.
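The chained paraphrase objective can be sketched as follows (our naming; the input is assumed to be the stack of per-view encoder embeddings, with cosine distance as the metric):

```python
import torch
import torch.nn.functional as F

def paraphrase_chain_loss(embs: torch.Tensor) -> torch.Tensor:
    """JEPA term over one paraphrase group, embs: [n_views, dim].

    The i-th view predicts the (i+1)-th via cosine distance, so minimizing
    the chain ties all views' embeddings into a compact subspace.
    """
    total = 0.0
    for i in range(embs.size(0) - 1):
        total = total + 1.0 - F.cosine_similarity(embs[i], embs[i + 1], dim=-1)
    return total / (embs.size(0) - 1)
```

With 5 paraphrases per group this yields 4 chained prediction terms per sample.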

For Rotten Tomatoes and Yelp, we fine-tune the model to generate discrete sentiment labels: Good and Bad for Rotten Tomatoes, and Very Good , Good , Mediocre , Bad , and Very Bad for Yelp. A prediction is deemed correct if the generated output begins with the ground-truth label. This evaluation approach-mapping free-form generation to categorical labels and applying prefix matching-follows common practice in prior work on text classification with generative models (McCann et al., 2018; Raffel et al., 2020; Wei et al., 2022).

Lastly, we provide in table 7 (in the Appendix) generated samples demonstrating that JEPA pretraining does maintain the generative capabilities of the model when prompted with the first few tokens in the cestwc/paraphrase dataset.

Structured Representations Induced by LLM-JEPA

We also examine the representation space to better understand how LLM-JEPA regularizes learned features. Specifically, we plot t-SNE embeddings for both Text and Code across three settings: the base model, a model fine-tuned with L LLM, and a model fine-tuned with L LLM-JEPA. As shown in fig. 4, clear structure emerges after fine-tuning with L LLM-JEPA. We hypothesize that L LLM-JEPA enforces structure in the representation space by constraining the mapping from Enc(Text) to Enc(Code) within a narrow subspace. If this is the case, the singular value decomposition of Enc(Text) - Enc(Code) should yield significantly smaller singular values, which is confirmed in fig. 3. Furthermore, we hypothesize that the mapping is approximately linear. To test this, we compute the least-squares regression error, and table 14 in the Appendix supports this hypothesis. Together, these results suggest that LLM-JEPA promotes a near-linear transformation between Text and Code representations, which may underlie its accuracy improvements.
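Both diagnostics can be sketched in a few lines (illustrative code, not the paper's; the inputs are assumed to be stacked embeddings of shape [n_samples, hidden]):

```python
import torch

def subspace_stats(enc_text: torch.Tensor, enc_code: torch.Tensor):
    """Two diagnostics of the Text -> Code mapping.

    Returns (singular_values, residual): the singular values of
    Enc(Text) - Enc(Code), and the least-squares error of a linear map
    Enc(Text) @ W ~= Enc(Code). Small values indicate a narrow,
    near-linear mapping between the two views.
    """
    diff = enc_text - enc_code
    singular_values = torch.linalg.svdvals(diff)
    W = torch.linalg.lstsq(enc_text, enc_code).solution
    residual = torch.linalg.norm(enc_text @ W - enc_code)
    return singular_values, residual
```

Rapidly decaying singular values support the narrow-subspace hypothesis; a small residual supports the near-linearity hypothesis.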

Ablation Study on Design Choices

In this section, we examine several alternative design choices. Specifically, we compare the use of cosine similarity against ℓ 2 -norm and mean squared error (MSE); appending a [PRED] token to the end of Text versus prepending it; and using Text to predict Code versus the reverse direction ( Code → Text ). As shown in table 3, none of these alternatives perform as well as LLM-JEPA, although some outperform standard fine-tuning.

Table 3: Fine-tuning accuracy on NL-RX-SYNTH under different design choices. Reported are average accuracy and standard deviation across runs. Learning rate lr = 2e -5 , λ = 1 . 0 , and k = 1 .

Additionally, we replace our cosine similarity loss with the InfoNCE loss van den Oord et al. (2018), which results in lower accuracy compared to the baseline. Moreover, its standard deviation is substantially higher than that of other alternatives. We use the commonly adopted temperature of τ = 0 . 07 for the InfoNCE loss (Chen et al., 2020; Radford et al., 2021).

Towards Faster and More General LLM-JEPAs

The structured representations induced by LLM-JEPA have the potential to provide universal benefits across diverse LLM applications. In this section, we explore its limits by evaluating datasets without natural two-view structures and models with richer capabilities (section 5.1). The promising results further motivate us to investigate methods for accelerating LLM-JEPA (section 5.2), as computational overhead remains a key obstacle to broad adoption.

Beyond Code: Applying LLM-JEPA to Q&A Datasets and Reasoning Models

We evaluate Llama-3.2-1B-Instruct on two QA benchmarks: NQ-Open (Lee et al., 2019a) and HellaSwag (Zellers et al., 2019). Our results show that LLM-JEPA achieves statistically significant improvements on both datasets, demonstrating its capability beyond the canonical setup where Text and Code are treated as two complementary views of the same object.

For NQ-Open, we regard Text as the question and Code as the answer span, typically consisting of only a few tokens, which contrasts with the more balanced sequence lengths found in other datasets. HellaSwag, by contrast, is a context-completion multiple-choice task. Rather than defining Code as the answer label (a single token from A , B , C , D ), we instead let Text denote the context and Code represent the correct continuation. This formulation differs from prior setups in two important ways: (i) Both Text and Code are now integral components of the question, and (ii) The relationship between context and completion is more diverse than the near-equivalence seen in NL → Regex or NL → SQL mappings.

Despite these differences, LLM-JEPA consistently improves accuracy on both benchmarks (table 4).

For NQ-Open , generated answers may include additional syntactic or supporting tokens. Following prior work, we deem a prediction correct if any ground-truth answer appears as a substring of the generated output (Lee et al., 2019b; Guu et al., 2020; Izacard and Grave, 2021). For HellaSwag , we compute the relative probabilities of the four candidate options A , B , C , D and select the answer with the highest probability, consistent with standard practice in multiple-choice language understanding benchmarks (Zellers et al., 2019; Brown et al., 2020a; OpenAI, 2023).
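The NQ-Open scoring rule amounts to a simple substring check (an illustrative sketch; function and variable names are ours):

```python
def substring_match_accuracy(generations, answer_sets) -> float:
    """NQ-Open-style scoring: a generation is correct if any ground-truth
    answer appears as a substring of the generated output."""
    hits = sum(
        any(ans in gen for ans in answers)
        for gen, answers in zip(generations, answer_sets)
    )
    return hits / len(generations)
```

This tolerates the extra syntactic or supporting tokens that free-form generation tends to produce around the answer span.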

An additional observation is that performance continues to improve as we scale λ up to 1024, without encountering a plateau. While table 10 in the Appendix suggests that extreme values of λ can degrade accuracy, in certain cases it remains beneficial to push λ further, yielding extra gains.

We then evaluate two strong reasoning models, Qwen3-1.7B (Yang et al., 2025) and DeepSeek-R1-Distill-Qwen-1.5B (DeepSeek-AI et al., 2025), on GSM8K. As shown in table 5, both models achieve statistically significant improvements when augmented with LLM-JEPA. These results offer promising evidence that LLM-JEPA extends its benefits to large reasoning models (LRMs).

Faster LLM-JEPAs via Loss Dropout

To further reduce compute, we introduce random JEPA-loss dropout (LD). During training or fine-tuning, we randomly drop the JEPA loss at a specified LD rate, applied at the batch level.

Table 4: Fine-tuning accuracy of Llama-3.2-1B-Instruct with Next Token Prediction (L LLM) loss vs. L LLM-JEPA loss (our method). Each case runs five times. Average accuracy and standard deviation are reported. We also report the p-value of a paired, single-tailed t-test.

Figure 5: LLM-JEPA converges faster than regular fine-tuning at the same PFLOPs. Furthermore, random JEPA-loss dropout (LD) saves PFLOPs and boosts accuracy at the same amount of compute. LD = 0 is the regular LLM-JEPA. Learning rate lr = 2e-5 and k = 1; λ varies.

When the dropout is active for a batch, it eliminates the extra forward pass needed to compute Enc(Text) and Enc(Code), thereby saving compute. If LD = α, the per-epoch cost becomes 2 - α times that of standard fine-tuning, since each batch saves α forward passes on average. As shown in Figure 5 and Table 6, LLM-JEPA tolerates aggressive loss dropout rates (e.g., 0.5 or 0.75), which leads to higher accuracy under the same compute budget. Moreover, increasing λ in proportion to the dropout rate can further improve performance. Empirically, we observe that keeping λ × (1 - α) approximately constant provides a useful guideline for co-tuning λ and α to balance compute efficiency and accuracy. The loss dropout, coupled with our custom attention mask, offers a promising path towards scaling LLM-JEPA to full pretraining with minimal computational overhead.
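The dropout decision and its cost model can be sketched as follows (illustrative helper names, not the authors' code):

```python
import random

def jepa_term_active(ld_rate: float, rng=random) -> bool:
    """Random JEPA-loss dropout: with probability ld_rate, skip the JEPA
    term (and its extra forward pass) for the current batch."""
    return rng.random() >= ld_rate

def epoch_cost_multiplier(ld_rate: float) -> float:
    """Expected per-epoch cost relative to standard fine-tuning: 2 - LD,
    since a fraction LD of batches skips the second forward pass."""
    return 2.0 - ld_rate
```

For example, LD = 0.75 brings the expected cost down from 2x to 1.25x standard fine-tuning, while the λ × (1 - α) guideline suggests scaling λ up by 4x to compensate.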

Conclusion and Future Work

We introduced an alternative training objective for LLMs leveraging JEPAs. Our formulation is an exact replica of the JEPA objective extensively used in vision, which had not yet been adapted to language. Crucially, our proposed LLM-JEPA maintains the generative capabilities of LLMs while improving their abstract prompt representation, as empirically validated across datasets and models. While our experiments mostly focus on finetuning, preliminary pretraining experiments are promising, and we plan to scale and more thoroughly test them in future work. Regarding the limitations of LLM-JEPA, the primary bottleneck at present is the 2-fold increase in compute cost during training, which is mitigated by random loss dropout.

Table 6: Random JEPA-loss dropout (LD) helps save PFLOPs while boosting accuracy. LD = 0 is the regular LLM-JEPA. Reported are average accuracy and standard deviation across runs. Each row is 4.83 PFLOPs apart. Learning rate lr = 2e-5 and k = 1; λ varies.

Limitations. Despite its strong accuracy gains, LLM-JEPA introduces two additional hyperparameters. As shown in fig. 7, the optimal configuration may occur at any point in the (λ, k) grid, which imposes a significant cost for hyperparameter tuning. While we have not identified an efficient method to explore this space, we empirically observe that adjacent grid points often yield similar accuracy, suggesting the potential for a more efficient tuning algorithm.

References

Table 7: Samples generated by the model pretrained on the cestwc/paraphrase dataset; columns show the prompt and the generation. The pretrained model struggles to terminate sentences.

Table 8: Fine-tuning accuracy on dataset NL-RX-SYNTH, LoRA vs. full fine-tuning, both with the L LLM loss and the L LLM-JEPA loss (our method). Configuration is lr = 2e-5, λ = 1, k = 1. Each cell is run five times; average accuracy and standard deviation are reported. At every LoRA rank, L LLM-JEPA (ours) has better accuracy. At LoRA rank 512 (22.59% trainable parameters), L LLM-JEPA (ours) achieves the same accuracy as full fine-tuning, whereas L LLM still shows a significant gap from full fine-tuning.

Appendix

Faster LoRA Convergence

Table 8 demonstrates that LoRA fine-tuning with L LLM -JEPA loss not only achieves substantially higher accuracy than using L LLM alone, but also converges more quickly. Notably, at a LoRA rank of 512 , our method already reaches accuracy comparable to full fine-tuning, whereas LoRA with only L LLM still exhibits a clear performance gap.

Table 9: Pretraining + fine-tuning accuracy of Llama-3.2-1B-Instruct on pretraining dataset cestwc/paraphrase and fine-tuning datasets Rotten Tomatoes and Yelp, comparing the Next Token Prediction (L LLM) loss vs. the L LLM-JEPA loss (our method). Note that L LLM-JEPA is applied only at pretraining. We tune lr_pre and lr_ft for L LLM and reuse them for LLM-JEPA pretraining. We run pretraining 5 times, and for each pretrained model, we run fine-tuning 5 times. Average accuracy and standard deviation are reported. We also report the p-value of a paired, one-tailed t-test.

Figure 6: t-SNE plots of Text and Code representations for (a) the base model without fine-tuning, (b) a baseline fine-tuned with the NTP loss, (c) LLM-JEPA (ours) with k = 0, and (d) LLM-JEPA (ours) with k = 1. LLM-JEPA (ours) induces clear structure in the representations, whereas fine-tuning with the NTP loss disrupts the structure present in the base model.


LLM-JEPA Induces Structured Representation

We present additional t -SNE plots of Text and Code representations in fig. 6, which show that different values of k yield similar structural patterns. In contrast, standard fine-tuning appears to further disrupt the representation structure compared to the baseline model.

Table 10: Fine-tuning accuracy on dataset NL-RX-SYNTH with the L LLM-JEPA loss (ours) over various γ/λ. Configuration is lr = 2e-5, λ = 1, k = 0. We maintain max(γ, λ) = 1.0 to use a fixed lr. Each cell is run five times; average accuracy and standard deviation are reported. When γ = 0.0, the model generates only empty outputs.

Figure 7: In general, we did not find a pattern for where the best accuracy appears: it can be at the high or low end of either λ or k, and there can be dips and spikes at random locations. Nonetheless, adjacent cells usually have similar accuracy, and sweeping (k, λ) ∈ {0, 1, 2, 3, 4} × {0.5, 1, 2, 4} normally yields satisfactory results. Each cell is an average of five runs, epoch = 4.


Ablation Study on the Role of L LLM

One limitation of eq. (2) is that the contribution of L LLM cannot be effectively reduced to 0. To address this, we introduce an additional hyperparameter γ to explicitly control its relative strength:

$$
\mathcal{L} = \gamma\, \mathcal{L}_{\rm LLM} + \lambda\, d\big(\mathrm{Pred}(\mathrm{Enc}(\mathrm{Text})),\, \mathrm{Enc}(\mathrm{Code})\big)
$$

We vary the ratio γ/λ within [0 , 1] while enforcing max( γ, λ ) = 1 to maintain a constant learning rate. Table 10 shows that L LLM remains essential for generative performance: when γ = 0 , the fine-tuned model produces only empty outputs. This indicates that the JEPA component primarily serves as a regularization term, complementing the generative loss.
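The weighting scheme above can be sketched as follows; `normalized_weights` maps a target γ/λ ratio to weights with max(γ, λ) = 1, as used in Table 10 to keep the learning rate fixed. The function names are illustrative.

```python
def normalized_weights(ratio):
    """Map a target gamma/lambda ratio in (0, inf) to (gamma, lam) with
    max(gamma, lam) == 1, so the overall loss scale, and hence the
    effective learning rate, stays fixed across the sweep."""
    if ratio <= 1.0:
        return ratio, 1.0          # gamma = ratio, lambda = 1
    return 1.0, 1.0 / ratio        # gamma = 1, lambda = 1 / ratio

def combined_loss(loss_llm, loss_jepa, gamma, lam):
    """Weighted objective: gamma scales the generative next-token loss,
    lam scales the JEPA prediction term."""
    return gamma * loss_llm + lam * loss_jepa
```

For example, a ratio of 10.0 yields γ = 1.0, λ = 0.1, matching the corresponding row of Table 10.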

Additional Generation Examples

Table 11 presents additional examples generated by fine-tuning Llama-3.2-1B-Instruct on the NL-RX-SYNTH dataset using L LLM and L LLM-JEPA, respectively.

Table 11: More regular expressions generated by Llama-3.2-1B-Instruct after fine-tuning with L LLM loss and L LLM -JEPA loss (ours). Color code: wrong , extra , missing

Overfitting Behavior in LoRA Fine-Tuning

We also conducted experiments to examine whether LoRA fine-tuning with L LLM -JEPA exhibits similar resistance to overfitting. As shown in fig. 8, accuracy under L LLM -JEPA generally continues to improve with additional epochs, whereas fine-tuning with L LLM shows clear signs of overfitting. Notably, the standard deviation is much higher than in full fine-tuning, likely reflecting the lower capacity of LoRA fine-tuning. An interesting pattern emerges: for L LLM -JEPA , larger standard deviations often coincide with dips in accuracy, whereas for L LLM they tend to accompany accuracy spikes. This suggests that such fluctuations may be unreliable indicators of generalization quality.

Figure 8: LLM-JEPA resists overfitting in LoRA fine-tuning. When fine-tuning with the L LLM loss starts to overfit, L LLM-JEPA keeps improving. However, the trend is not as stable as in full fine-tuning, possibly due to the limited capacity of LoRA fine-tuning.


Structured Representations Induced by LLM-JEPA

We also examine the representation space to better understand how LLM-JEPA regularizes learned features. Specifically, we plot t-SNE embeddings for both Text and Code across three settings: the base model, a model fine-tuned with L LLM, and a model fine-tuned with L LLM-JEPA. As shown in fig. 4, clear structure emerges after fine-tuning with L LLM-JEPA. We hypothesize that L LLM-JEPA enforces structure in the representation space by constraining the mapping from Enc(Text) to Enc(Code) to a narrow subspace. If this is the case, the singular value decomposition of Enc(Text) - Enc(Code) should yield significantly smaller singular values, which is confirmed in fig. 3. Furthermore, we hypothesize that the mapping is approximately linear. To test this, we compute the least-squares

Table 12: Fine-tuning accuracy on dataset NL-RX-SYNTH by Next Token Prediction ( L LLM ) loss vs. L LLM -JEPA loss (our method). Each cell is the best possible accuracy over a set of configurations. Each configuration runs five times. Average accuracy and standard deviation are reported. We also report p -value of paired, single-tailed t -Test.

Table 13: Fine-tuning accuracy by model Llama-3.2-1B-Instruct, L LLM loss vs. L LLM -JEPA loss (our method). Each cell is the best possible accuracy over a set of configurations. Each configuration runs five times. Average accuracy and standard deviation are reported. We also report p -value of paired, single-tailed t -Test.

regression error, and table 14 supports this hypothesis. Together, these results suggest that LLM-JEPA promotes a near-linear transformation between Text and Code representations, which may underlie its accuracy improvements.
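The two diagnostics above, the least-squares fit and the singular values of the embedding difference, can be sketched as follows. This is an illustrative recomputation under assumed array inputs, not the paper's evaluation code.

```python
import numpy as np

def linearity_diagnostics(enc_text, enc_code, top=100):
    """Fit the least-squares map X with enc_text @ X ~ enc_code and
    report (a) the squared residual of the fit and (b) the average of
    the top singular values of enc_text - enc_code. A small residual
    suggests a near-linear map; small singular values suggest the
    difference lives in a narrow subspace (cf. Table 14).

    enc_text, enc_code: (n_samples, dim) arrays of embeddings."""
    X, *_ = np.linalg.lstsq(enc_text, enc_code, rcond=None)
    residual = float(np.linalg.norm(enc_text @ X - enc_code) ** 2)
    s = np.linalg.svd(enc_text - enc_code, compute_uv=False)
    return residual, float(s[:top].mean())
```

When the Code embeddings are an exactly linear function of the Text embeddings, the residual is numerically zero, which is the regime Table 14 reports for LLM-JEPA models.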

Performance Across Model Sizes

We also evaluate LLM-JEPA across different model sizes. As shown in table 15, we observe statistically significant improvements at all scales. Since there is no official 8B version of Llama-3.2, we instead use Llama-3.1-8B-Instruct, where performance collapsed due to the model's difficulty in properly terminating regular expressions. To address this, we additionally evaluate using a startswith criterion: a prediction is considered correct if the generated regular expression begins with the ground-truth expression, removing the need for exact termination. Under this metric, we again observe statistically significant accuracy improvements.

Table 14: LLM-JEPA induces an almost linear transformation from Enc(Text) to Enc(Code).

Table 15: Fine-tuning accuracy on NL-RX-SYNTH by Next Token Prediction (L LLM) loss vs. L LLM-JEPA loss (our method). Each case is run five times. Average accuracy and standard deviation are reported. We also report the p-value of a paired, one-tailed t-test. Note that Llama has no official 3.2-8B model, so we use 3.1-8B, which has lower accuracy; LLM-JEPA still sees a significant improvement. We also evaluate OLMo-2-7B.

Figure 9: Fine-tuning HellaSwag with Llama-3.2-1B allows λ to be scaled up to 1024, with performance continuing to improve.


Ground Truth | L LLM | L LLM-JEPA (ours)

lines not having the string 'dog' followed by a number, 3 or more times
((dog.[0-9].)3,) | ((dog.[0-9].)3,) | ((dog.[0-9].){3,})

lines containing ending with a vowel, zero or more times
.(.)(([AEIOUaeiou])) | .(.)(([AEIOUaeiou])) | *( .* ) (.) { ([AEIOUaeiou]) }

lines with a number or a character before a vowel
(([0-9])(.)).([AEIOUaeiou]). | (([0-9])(.)).([AEIOUaeiou]). | .*(([0-9])(.)).([AEIOUaeiou]).

lines with words with the string 'dog', a letter, and a number
((([0-9])&(dog))([A-Za-z]))* | ((([0-9])&(dog))([A-Za-z]))* | (( [0-9])&(dog))( ( [A-Za-z]) * )
Model | Method | Accuracy (%) ↑ | p-value ↓ | Config
Llama-3.2-1B-Instruct: L LLM 54.38 ± 1.70; L LLM-JEPA (ours) 60.59 ± 1.01; p-value 2.94e-4; config lr = 8e-5, λ = 2, k = 3, same lr.
Accuracy (%) ↑ by variant: Baseline 57.29 ± 5.32; LLM-JEPA 71.46 ± 1.34; ℓ2-norm 2.22 ± 0.07; MSE 70.64 ± 2.05; Prepend 68.07 ± 2.57; Code → Text 65.70 ± 2.63; InfoNCE loss 34.40 ± 6.10.
Dataset | Method | Accuracy (%) ↑ | p-value ↓ | Config
NQ-Open: L LLM 20.12 ± 0.41; L LLM-JEPA (ours) 21.59 ± 0.40; p-value 2.44e-3; config lr = 2e-5, λ = 1024, k = 0, same lr.
HellaSwag: L LLM 69.40 ± 0.99; L LLM-JEPA (ours) 70.51 ± 1.20; p-value 0.0136; config lr = 4e-5, λ = 1, k = 3, same lr.
Model | Method | Accuracy (%) ↑ | p-value ↓ | Config
Qwen3-1.7B: L LLM 44.32 ± 0.39; L LLM-JEPA (ours) 45.00 ± 0.40; p-value 0.0115; config lr = 4e-5, λ = 1, k = 0, same lr.
R1-Distill-Qwen-1.5B: L LLM 13.87 ± 1.01; L LLM-JEPA (ours) 15.04 ± 0.15; p-value 0.0396; config lr = 8e-5, λ = 0.5, k = 1, same lr.
Accuracy (%) ↑:
LD = 0, λ = 1 | LD = 0.5, λ = 1 | LD = 0.5, λ = 2 | LD = 0.75, λ = 1 | LD = 0.75, λ = 2 | LD = 0.75, λ = 4
19.85 ± 2.44 | 25.00 ± 3.73 | 24.50 ± 4.40 | 32.46 ± 3.32 | 32.10 ± 3.11 | 31.45 ± 3.34
40.70 ± 2.67 | 48.96 ± 6.03 | 50.71 ± 6.52 | 53.77 ± 5.53 | 57.03 ± 3.51 | 56.75 ± 4.63
55.60 ± 5.16 | 64.08 ± 2.75 | 63.79 ± 5.96 | 64.51 ± 7.28 | 67.03 ± 3.08 | 65.32 ± 3.54
60.43 ± 3.21 | 69.87 ± 2.48 | 70.11 ± 3.15 | 67.80 ± 4.94 | 66.80 ± 4.06 | 68.93 ± 4.59
63.57 ± 1.01 | 69.74 ± 2.99 | 71.20 ± 2.17 | 70.00 ± 4.74 | 70.77 ± 4.08 | 72.42 ± 1.28
63.96 ± 1.07 | 70.60 ± 3.05 | 72.11 ± 2.18 | 70.31 ± 4.64 | 70.92 ± 4.62 | 73.08 ± 1.28
Ground Truth vs. Generation

Ground Truth: A garden of flowers and a bench stating "City of London."
Generation: A garden of flowers and a vase with a flower in it.............

Ground Truth: A person that is riding on a horse in a grass field.
Generation: A person that is riding in a field.................

Ground Truth: A man is riding a horse in a field.
Generation: A man is riding a horse in a field................

Ground Truth: There are two birds standing on a rock.
Generation: There are two birds standing on top of a building.................

Ground Truth: Two hawks sit on top of a roof spire.
Generation: Two hawks sit on top of a wooden bench................

Ground Truth: A young woman serving herself at a cookout.
Generation: A young woman serving herself in a kitchen.................

Ground Truth: 2 bowls of fruit sit on a table.
Generation: 2 bowls of fruit sit on a table.................

Ground Truth: A wooden bench written 'CITY OF LONDON' at the park
Generation: A wooden bench written 'CITY and a tree.................
LoRA Rank | L LLM | L LLM-JEPA (ours)
32 | 6.09 ± 0.55 | 7.45 ± 1.87
64 | 21.09 ± 1.90 | 32.46 ± 1.26
128 | 34.21 ± 2.82 | 48.45 ± 3.66
256 | 45.57 ± 4.52 | 60.80 ± 2.31
512 | 50.18 ± 5.15 | 72.41 ± 2.94
Full | 57.29 ± 5.32 | 70.42 ± 2.36
FT Dataset | Method | Accuracy (%) ↑ | p-value ↓ | Config
Rotten Tomatoes: L LLM 56.57 ± 1.66; L LLM-JEPA (ours) 57.76 ± 1.33; p-value 7.38e-4; config lr_pre = 8e-5, lr_ft = 4e-5, λ = 0.5, k = 2, same lr_pre and lr_ft.
Yelp: L LLM 26.46 ± 0.92; L LLM-JEPA (ours) 27.15 ± 0.93; p-value 1.00e-3; config lr_pre = 8e-5, lr_ft = 8e-5, λ = 0.5, k = 2, same lr_pre and lr_ft.
γ/λ | Config | Accuracy (%) ↑
0.0 | γ = 0.0, λ = 1.0 | 0.00 ± 0.00
0.01 | γ = 0.01, λ = 1.0 | 1.38 ± 0.06
0.1 | γ = 0.1, λ = 1.0 | 45.80 ± 5.04
1.0 | γ = 1.0, λ = 1.0 | 70.42 ± 2.36
10.0 | γ = 1.0, λ = 0.1 | 67.52 ± 1.45
100.0 | γ = 1.0, λ = 0.01 | 66.83 ± 3.89
∞ | γ = 1.0, λ = 0.0 | 57.29 ± 5.32
Ground Truth | L LLM | L LLM-JEPA (ours)

lines ending with a vowel or starting with a character
([AEIOUaeiou].[A-Za-z].)+ | ([AEIOUaeiou].[A-Za-z].)+ | ([AEIOUaeiou].[A-Za-z].)+

lines containing either a lower-case letter, a vowel, or a letter
((.*)([AEIOUaeiou]))((.)(.*)) | (.*) ( ([AEIOUaeiou])((.)(.*)) ) | (.*) ( ([AEIOUaeiou])((.)(.*)) )

lines starting with the string 'dog' before a vowel
(([A-Za-z])7,).(dog). | (([A-Za-z])7,).(dog). | .*(([A-Za-z])7,).(dog).

lines not containing a letter and the string 'dog'
((([A-Z])+)([a-z]))(.*) | ((([A-Z])+)([a-z]))(.*) | +((([A-Z])+)([a-z]))(.*)

lines with a character before a vowel and the string 'dog', zero or more times
.(.)&([0-9])&(dog). | .(.)&([0-9])&(dog). | .*.(.)&([0-9])&(dog). ..

lines with a vowel at least once before not a character
(([A-Za-z])+).(~([0-9])). | (([A-Za-z])+).(~([0-9])). | .*(([A-Za-z])+).(~([0-9])).
Model | Method | Accuracy (%) ↑ | p-value ↓ | Config
Llama-3.2-1B-Instruct: L LLM 57.29 ± 5.32; L LLM-JEPA (ours) 71.46 ± 1.34; p-value 1.0e-3; config lr = 2e-5, λ = 1, k = 1, same lr.
gemma-2-2b-it: L LLM 33.65 ± 3.24; L LLM-JEPA (ours) 43.12 ± 2.61; p-value 5.5e-3; config lr = 1e-5, λ = 2, k = 4, same lr.
OpenELM-1_1B-Instruct: L LLM 12.07 ± 1.81; L LLM-JEPA (ours) 25.40 ± 2.40; p-value 5.1e-4; config lr = 8e-5, λ = 4, k = 3, same lr.
OLMo-2-0425-1B-Instruct: L LLM 87.09 ± 0.36; L LLM-JEPA (ours) 87.52 ± 0.29; p-value 2.5e-3; config lr = 8e-5, λ = 2, k = 0, same lr.
Dataset | Method | Accuracy (%) ↑ | p-value ↓ | Config
NL-RX-TURK: L LLM 22.49 ± 1.91; L LLM-JEPA (ours) 30.94 ± 1.13; p-value 2.4e-4; config lr = 2e-5, λ = 1, k = 1, same lr.
GSM8K: L LLM 32.36 ± 0.58; L LLM-JEPA (ours) 36.36 ± 0.20; p-value 9.6e-5; config lr = 2e-5, λ = 0.5, k = 4, same lr.
Spider: L LLM 47.52 ± 2.44; L LLM-JEPA (ours) 50.55 ± 2.08; p-value 4.0e-3; config lr = 4e-5, λ = 1, k = 3, same lr.
Model | min_X ||Enc(Text) · X - Enc(Code)||² | Avg. top-100 singular value
Base model | 3953.11 | 310.73
L LLM | 3035.01 | 341.80
LLM-JEPA (ours), k = 1 | 4.47 | 94.84
LLM-JEPA (ours), k = 0 | 4.04 | 16.82
Model | Method | Accuracy (%) ↑ | p-value ↓ | Config
Llama-3.2-1B-Instruct: L LLM 57.29 ± 5.32; L LLM-JEPA (ours) 71.46 ± 1.34; p-value 1.0e-3; config lr = 2e-5, λ = 1, k = 1, same lr.
Llama-3.2-3B-Instruct: L LLM 74.55 ± 3.58; L LLM-JEPA (ours) 77.16 ± 3.66; p-value 0.0352; config lr = 2e-5, λ = 2, k = 0, same lr.
Llama-3.1-8B-Instruct: L LLM 35.77 ± 6.60; L LLM-JEPA (ours) 63.57 ± 16.81; p-value 0.0131; config lr = 2e-5, λ = 2.0, k = 0, same lr.
OLMo-2-1124-7B-Instruct: L LLM 87.26 ± 0.27; L LLM-JEPA (ours) 87.75 ± 0.33; p-value 0.0345; config lr = 2e-5, λ = 20, k = 2, same lr.


The research landscape around representation learning has been increasingly divided into two camps: (i) generative or reconstruction-based methods brown2020language ; chowdhery2023palm ; he2022masked ; lecun2022path , and (ii) reconstruction-free Joint Embedding Predictive Architectures (JEPAs) assran2023self ; baevski2022data2vec ; bardes2024revisiting . While the former is self-explanatory, the latter learns a representation by ensuring that different views, e.g., pictures of the same building at different times of day, can be predicted from each other, all while preventing a collapse of the embeddings. By moving away from input-space objectives, JEPA training suffers from fewer biases littwin2024jepa , at the cost of potential dimensional collapse of the representation jing2021understanding ; kenneweg2025jepa . That divide has been well studied in vision, where it was found that JEPAs offer multiple provable benefits when it comes to knowledge discovery for perception tasks. In the realm of Natural Language Processing, however, reconstruction-based methods remain predominant. In fact, today's Large Language Models are mostly judged on their ability to generate samples and answers in input space, in text form, making it challenging to leverage JEPA objectives.

Yet, LLMs' tasks also involve perception and reasoning, where JEPA is known to be preferable. It thus seems crucial to adapt JEPA solutions to LLMs in the hope of showcasing the same benefits witnessed in vision. That first step is exactly what we present in this study. We propose to improve the representation quality of LLMs by leveraging a novel objective combining the original reconstruction-based loss with an additional JEPA objective. To do so, we first focus on tasks and datasets that are inherently suited for JEPA objectives: those providing multiple views of the same underlying knowledge. One typical example is a git issue and the corresponding code diff (fig. 2) jimenez2024swebench . The two samples are two views, one in plain English and one in code, of the same underlying functionality. Let's use that particular example to highlight our core contribution:

Viewing the (text, code) pairs as views of the same underlying knowledge enables JEPA objectives to be utilized with LLMs, complementing the standard text → code generative task.

We strongly emphasize that obtaining non-trivial views, such as those described above, is crucial to the success of JEPA objectives. While we restrict ourselves to datasets offering such non-trivial views, developing a mechanism akin to data augmentation in vision would enable JEPA objectives to be used on any dataset. Nonetheless, we believe that our proposed solution, coined LLM-JEPA, and our empirical study will serve as a first step towards more JEPA-centric LLM pretraining and finetuning. We summarize our contributions below:

Novel JEPA-based training objective: We present the first JEPA-based training objective for LLMs, operating in embedding space and with different views, closely following vision-based JEPAs without sacrificing the generative capabilities of LLMs.

Improved SOTA: We empirically validate our formulation in various finetuning settings, where we obtain improvements over standard LLM finetuning solutions. We also explore pretraining scenarios, showing encouraging results for LLM-JEPA.

Extensive empirical validation: across various model families (Llama, Gemma, Apple/OpenELM, AllenAI/OLMo), datasets (NL-RX, GSM8K, Spider, RottenTomatoes), and model sizes.

Natural Language to Regular Expression

Section 2.1 first provides minimal background on the next-token prediction LLM objective, used as part of the proposed LLM-JEPA loss (section 2.2). Empirical validation is then provided in section 2.3, demonstrating clear finetuning and pretraining benefits.

Contemporary LLMs are mostly built from the same core principles: stacking numerous layers of nonlinear operations and skip-connections, known as Transformers. While subtleties may differ, e.g., in positional embeddings, initialization, or normalization, the main driver of performance remains the availability of high-quality datasets during the pretraining stage. The training objective itself has also been standardized across methods: autoregressive token-space reconstruction. Let's first denote by L LLM the typical LLM objective used for the specific task and dataset at hand. In most cases, this is a cross-entropy loss between the predicted tokens and the ground-truth tokens to reconstruct. We note that our LLM-JEPA construction is agnostic to L LLM, hence making our method applicable to numerous scenarios.

$$
\mathcal{L}_{\rm LLM}(\mathrm{Text}) = \sum_{L} \mathrm{CrossEntropy}\big(\mathrm{Classifier}(\mathrm{Text}_{1:L-1}),\, \mathrm{Text}_{L}\big) \qquad (1)
$$

where Classifier predicts the logits of the next token Text_L given the past tokens Text_{1:L-1}. Computation of eq. (1) is done at once over L through causal autoregression. Different stages and tasks may vary the input and output of the loss.
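As a small self-contained illustration of the autoregressive cross-entropy objective, the sketch below scores each position's logits against the next ground-truth token. It uses plain lists rather than a real tokenizer or model, so all names are illustrative.

```python
import math

def next_token_loss(logits, tokens):
    """Average autoregressive cross-entropy.

    logits: list where logits[l] is the vocabulary-logit list produced
            from tokens[:l + 1], used to predict tokens[l + 1].
    tokens: list of ground-truth token ids.
    """
    total = 0.0
    for l in range(len(tokens) - 1):
        z = logits[l]
        # log-sum-exp normalizer minus the logit of the true next token
        log_norm = math.log(sum(math.exp(v) for v in z))
        total += log_norm - z[tokens[l + 1]]
    return total / (len(tokens) - 1)
```

With confident, correct logits the loss approaches 0; with uniform logits over a vocabulary of size V it equals log V, the usual sanity checks for this objective.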

Throughout this section, we will use Text and Code as concrete examples of two different views of the same underlying knowledge. It should be clear to the reader that our proposed LLM-JEPA objective handles other types of views similarly.

The construction of our LLM-JEPA objective relies on two principles. First, we must preserve the generative capabilities of LLMs, and we therefore start with L LLM from eq. (1). Second, we aim to improve the abstraction capabilities of LLMs using the joint embedding prediction task. On top of L LLM, we then propose to add the well-established JEPA objective, leading to the complete loss L defined as

$$
\mathcal{L} = \mathcal{L}_{\rm LLM} + \lambda\, d\big(\mathrm{Pred}(\mathrm{Enc}(\mathrm{Text})),\, \mathrm{Enc}(\mathrm{Code})\big) \qquad (2)
$$

where λ ≥ 0 is a hyperparameter balancing the contribution of the two terms, Pred and Enc are the predictor and encoder networks respectively, and d is a metric of choice, e.g., the ℓ2 distance. Let's now precisely describe each of those components.

The encoder. We use the hidden_state of the last token from the last layer as the embedding of an input sequence, as commonly done for LLM probing. Practically, we cannot produce Enc(Text) and Enc(Code) through a single forward pass. For example, passing the concatenation [Text, Code] would require meddling with the self-attention to avoid cross-view interaction, which would be efficient but specific to each LLM architecture. Instead, we propose to obtain the encodings through two additional forward passes: one for Text and one for Code. This incurs additional cost during training, but not during inference; see section 3 for further discussion.

The metric. When it comes to comparing embeddings, it is now widely accepted in vision to leverage the cosine similarity. We thus propose to do the same for LLM-JEPA.

The predictor. We leverage the autoregressive nature of LLMs and their internal self-attention to define a tied-weights predictor. By introducing a special token [PRED] at the end of a given input, we allow for further nonlinear processing of the input, thereby producing Pred(·) at the final embedding of the last layer. By reusing the internal weights of the LLM for the prediction task, we greatly reduce the training overhead and architectural design choices. Practically, we append k ∈ {0, …, K} predictor tokens to an input prompt and use the embedding of the last predictor token as Pred(Enc(·)). When k = 0, the predictor is trivial, i.e., Pred(x) = x.
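The encoder, predictor, and metric above can be wired together as in the following sketch. Here `model`, `pred_id`, and the list-based embeddings are placeholders standing in for a real LLM's forward pass and [PRED] token id, not an actual API.

```python
import math

def cosine_distance(u, v):
    """d(u, v) = 1 - cos(u, v): the embedding-space metric."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def jepa_term(model, text_ids, code_ids, pred_id, k=1):
    """Sketch of the JEPA prediction loss, assuming model(ids) returns
    the last-layer hidden state of the final token (the Enc(.) above).
    The tied-weights predictor is realized by appending k copies of the
    special [PRED] token to the Text view; k = 0 makes Pred trivial."""
    pred_of_text = model(text_ids + [pred_id] * k)  # Pred(Enc(Text))
    enc_code = model(code_ids)                      # Enc(Code)
    return cosine_distance(pred_of_text, enc_code)
```

Since the predictor reuses the LLM's own weights, the only architectural choice left is k, the number of appended [PRED] tokens.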

Relation to Previous Work. Because loss functions such as L LLM (input-space reconstruction, since tokens are a lossless compression of the original prompts) have been shown to be sub-optimal in vision, a few LLM variations have started to employ embedding-space regularizers and training objectives barrault2024large ; wang2025reversal . Current solutions, however, rely on intricate structural constraints on the embedding space, e.g., hierarchical organization and clusters, and thus fall outside the JEPA scope. We also note that our interpretation of views in LLM datasets, e.g., (text issue, code diff), has been leveraged in LLM finetuning solutions, by learning to generate one view from the other, without a JEPA-style loss. This includes natural language to regular expression translation (locascio2016neural, ; ye2020sketch, ; zhong2018semregex, ), natural language to SQL parsing (guo2019towards, ; iyer2017learning, ; li2023resdsql, ; wang2019rat, ; yu2018spider, ), and the more recent issue descriptions to code diffs (cabrera2021commit2vec, ; hoang2020cc2vec, ; tian2020evaluating, ; zhou2023ccbert, ). More intricate examples involve text-based problem solving and its program-induction counterpart (amini2019mathqa, ; cobbe2021training, ; hendrycks2021measuring, ; ling2017program, ).

The JEPA loss is not implicitly minimized by L LLM. The very first observation we want to make, provided in fig. 4, is that minimizing L LLM does not implicitly minimize L JEPA, indicating that this term must be added explicitly during training.

LLM-JEPA Improves Finetuning. We run experiments across multiple pretrained LLMs (Llama-3.2-1B-Instruct (llama3, ), gemma-2-2b-it (team2024gemma, ), OpenELM-1_1B-Instruct (mehta2024openelm, ), and OLMo-2-0425-1B-Instruct (olmo20242, )) with various datasets (NL-RX-SYNTH, NL-RX-TURK (locascio2016neural, ), GSM8K (cobbe2021training, ), Spider (yu2018spider, )). For a given (model, dataset) case, we search for the best learning rate lr ∈ {1e-5, 2e-5, 4e-5, 8e-5} based on the best possible accuracy of L LLM after 4 epochs. We then tune the hyperparameters specific to L LLM-JEPA, k and λ, in a two-dimensional grid (k, λ) ∈ {0, 1, 2, 3, 4} × {0.5, 1, 2, 4} (fig. 3 and table 5). For both NL-RX-SYNTH and NL-RX-TURK, accuracy is exact match of the generated regular expression; for GSM8K, accuracy is exact match of the final result; and for Spider, accuracy is exact match of the execution result of the generated query. We provide results demonstrating how LLM-JEPA improves performance (fig. 1, left) across models (table 8), datasets (table 9), training time (fig. 1, right, and fig. 5), and sizes (table 11). Examples of inputs and targets along with models' predictions and error analysis are provided in tables 6 and 7. The improved performance of LLM-JEPA holds across LoRA ranks, as shown in table 3. We also provide evidence that LLM-JEPA induces an approximately linear transformation from Enc(Text) to Enc(Code) (figs. 6, 7 and 10).

LLM-JEPA Improves Pretraining. We pretrain Llama-3.2-1B-Instruct from randomly initialized weights on the NL-RX-SYNTH dataset; here, a prediction is counted as valid as long as it starts with the ground truth. We observe that LLM-JEPA also improves the quality of the learned representation, as shown in table 1. We also conduct another pretraining experiment on cestwc/paraphrase, which contains groups of 5 paraphrases; we employ the paraphrases within a same group for the JEPA loss. Once the model is pretrained (4 epochs), we run a finetuning evaluation on rotten_tomatoes (1 epoch). We demonstrate how JEPA pretraining improves downstream performance post-finetuning in table 4. Note that finetuning does not employ the JEPA loss, hence showing the benefit of JEPA at the pretraining stage. Lastly, we provide in table 2 generated samples demonstrating that JEPA pretraining maintains the generative capabilities of the model when prompted with the first few tokens from the cestwc/paraphrase dataset.

We introduced an alternative training objective for LLMs leveraging JEPAs. Our formulation is an exact replica of the JEPA objective extensively used in vision, which had not yet been adapted to language. Crucially, our proposed LLM-JEPA maintains the generative capabilities of LLMs while improving their abstract prompt representations, as empirically validated across datasets and models. While our experiments mostly focus on finetuning, preliminary pretraining experiments are promising, and we plan to scale and more thoroughly test them in future work. Regarding the limitations of LLM-JEPA, the main current bottleneck is the 3-fold compute cost during training required to obtain the representations of the views. We plan to explore a mitigation that masks the self-attention matrix so that our LLM-JEPA loss can be evaluated within a single forward pass through the LLM.
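The attention-masking mitigation mentioned above can be sketched as follows: concatenate both views into one sequence and build a block-diagonal causal mask so each view attends only to itself, approximating two independent forward passes in one. This is a sketch of the idea, not the authors' implementation:

```python
import numpy as np

def two_view_attention_mask(len_text, len_code):
    """Boolean self-attention mask for a sequence made of two
    concatenated views. Attention is causal within each view and
    blocked across views (True = attention allowed)."""
    n = len_text + len_code
    mask = np.zeros((n, n), dtype=bool)
    # Causal (lower-triangular) attention inside the Text block.
    mask[:len_text, :len_text] = np.tril(np.ones((len_text, len_text), dtype=bool))
    # Causal attention inside the Code block; no cross-view attention.
    mask[len_text:, len_text:] = np.tril(np.ones((len_code, len_code), dtype=bool))
    return mask
```

Whether such a mask fully reproduces two separate passes also depends on positional encodings, which this sketch ignores.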

Table 3 demonstrates that LoRA fine-tuning with the $\mathcal{L}_{\rm LLM\text{-}JEPA}$ loss not only achieves substantially higher accuracy than using $\mathcal{L}_{\rm LLM}$ alone, but also converges more quickly. Notably, at a LoRA rank of 512, our method already reaches accuracy comparable to full fine-tuning, whereas LoRA with only $\mathcal{L}_{\rm LLM}$ still exhibits a clear performance gap.
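For intuition on why higher ranks close the gap to full fine-tuning, the trainable fraction grows roughly linearly with the rank. The helper below is a simplification for a single adapted $d_{in}\times d_{out}$ weight; the 22.59% figure reported in table 3 aggregates all adapted modules of the real model:

```python
def lora_trainable_fraction(d_in, d_out, rank):
    """Fraction of parameters that are trainable when one dense
    d_in x d_out weight is frozen and a rank-r LoRA adapter (two
    low-rank factors) is trained. Illustration only: real fractions
    depend on which modules receive adapters."""
    base = d_in * d_out            # frozen weight
    adapter = rank * (d_in + d_out)  # trainable A (d_in x r) and B (r x d_out)
    return adapter / (base + adapter)
```

For a square 4096-dimensional weight at rank 512, this gives a trainable fraction of 0.2.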

One limitation of eq. 2 is that the contribution of $\mathcal{L}_{\rm LLM}$ cannot be effectively reduced to 0. To address this, we introduce an additional hyperparameter $\gamma$ to explicitly control its relative strength:

We vary the ratio $\gamma/\lambda$ within $[0,1]$ while enforcing $\max(\gamma,\lambda)=1$ to maintain a constant effective learning rate. Table 5 shows that $\mathcal{L}_{\rm LLM}$ remains essential for generative performance: when $\gamma=0$, the fine-tuned model produces only empty outputs. This indicates that the JEPA component primarily serves as a regularization term, complementing the generative loss.
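The normalization described above can be written as a small helper that maps a target ratio to concrete weights; this is a sketch of one way to satisfy the stated constraint, not necessarily the authors' exact procedure:

```python
def gamma_lambda(ratio):
    """Map a target gamma/lambda ratio to (gamma, lam) weights with
    max(gamma, lam) = 1, keeping the overall loss scale (and hence
    the effective learning rate) roughly constant."""
    if ratio <= 1.0:
        return ratio, 1.0       # gamma <= lambda = 1
    return 1.0, 1.0 / ratio     # lambda < gamma = 1
```

For example, a ratio of 0.1 yields $(\gamma,\lambda)=(0.1, 1.0)$, while a ratio of 10 yields $(1.0, 0.1)$, matching the configurations listed in the sweep.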

Despite its strong accuracy gains, LLM-JEPA introduces two additional hyperparameters. As shown in fig. 3, the optimal configuration may occur at any point in the grid $(\lambda,k)\in\{0.5,1.0,2.0,4.0\}\times\{0,1,2,3,4\}$, which imposes a significant cost for hyperparameter tuning. While we have not identified an efficient method to explore this space, we empirically observe that adjacent grid points often yield similar accuracy, suggesting the potential for a more efficient tuning algorithm.

Table 7 presents additional examples generated by fine-tuning Llama-3.2-1B-Instruct on the NL-RX-SYNTH dataset using $\mathcal{L}_{\rm LLM}$ and $\mathcal{L}_{\rm LLM\text{-}JEPA}$, respectively.

We also conducted experiments to examine whether LoRA fine-tuning with $\mathcal{L}_{\rm LLM\text{-}JEPA}$ exhibits similar resistance to overfitting. As shown in fig. 5, accuracy under $\mathcal{L}_{\rm LLM\text{-}JEPA}$ generally continues to improve with additional epochs, whereas fine-tuning with $\mathcal{L}_{\rm LLM}$ shows clear signs of overfitting. Notably, the standard deviation is much higher than in full fine-tuning, likely reflecting the lower capacity of LoRA fine-tuning. An interesting pattern emerges: for $\mathcal{L}_{\rm LLM\text{-}JEPA}$, larger standard deviations often coincide with dips in accuracy, whereas for $\mathcal{L}_{\rm LLM}$ they tend to accompany accuracy spikes. This suggests that such fluctuations may be unreliable indicators of generalization quality.

We also examine the representation space to better understand how LLM-JEPA regularizes learned features. Specifically, we plot t-SNE embeddings of both Text and Code across three settings: the base model, a model fine-tuned with $\mathcal{L}_{\rm LLM}$, and a model fine-tuned with $\mathcal{L}_{\rm LLM\text{-}JEPA}$. As shown in fig. 6, clear structure emerges after fine-tuning with $\mathcal{L}_{\rm LLM\text{-}JEPA}$. We hypothesize that $\mathcal{L}_{\rm LLM\text{-}JEPA}$ enforces structure in the representation space by constraining the mapping from $\operatorname{Enc}({\rm Text})$ to $\operatorname{Enc}({\rm Code})$ to a narrow subspace. If this is the case, the SVD of $\operatorname{Enc}({\rm Text})-\operatorname{Enc}({\rm Code})$ should yield significantly smaller singular values, which is confirmed in fig. 7. Furthermore, we hypothesize that the mapping is approximately linear. To test this, we compute the least-squares regression error, and table 10 supports this hypothesis. Together, these results suggest that LLM-JEPA promotes a near-linear transformation between Text and Code representations, which may underlie its accuracy improvements.
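The two diagnostics above can be sketched directly with NumPy: fit the best linear map $X$ from the Text embeddings to the Code embeddings and report its residual, and compute the top singular values of the embedding difference. This assumes the embeddings are stacked into `[n_samples, dim]` matrices:

```python
import numpy as np

def linearity_diagnostics(enc_text, enc_code, top=100):
    """Return (least-squares residual, mean of top singular values).
    enc_text, enc_code: [n_samples, dim] matrices of view embeddings.
    A small residual indicates the Text -> Code mapping is nearly
    linear; small singular values of the difference indicate the
    mapping is confined to a narrow subspace."""
    X, *_ = np.linalg.lstsq(enc_text, enc_code, rcond=None)
    lsq_error = float(np.linalg.norm(enc_text @ X - enc_code))
    svals = np.linalg.svd(enc_text - enc_code, compute_uv=False)
    return lsq_error, float(np.mean(svals[:top]))
```

When the Code embeddings are an exactly linear function of the Text embeddings, the residual is numerically zero, matching the near-zero values reported for LLM-JEPA in table 10.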

Table: S2.T1: Pretraining accuracy on dataset NL-RX-SYNTH with the next-token prediction ($\mathcal{L}_{\rm LLM}$) loss vs. the $\mathcal{L}_{\rm LLM\text{-}JEPA}$ loss (our method). We inherit the best configuration from fine-tuning. Each case runs five times; average accuracy and standard deviation are reported. We also report the p-value of a paired, one-tailed t-test.

Model | Method | Accuracy (%) ↑ | p-value ↓ | Config
Llama-3.2-1B-Instruct | $\mathcal{L}_{\rm LLM}$ | 54.38 ± 1.70 | 2.94e-4 | lr = 8e-5
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 60.59 ± 1.01 | | λ = 2, k = 3, same lr

Table: A1.T2: Samples generated by the model pretrained on the paraphrase dataset, given the prompt and its continuation. The pretrained model is not good at terminating sentences.

Ground Truth vs. Generation
Ground Truth: A garden of flowers and a bench stating "City of London."
Generation: A garden of flowers and a vase with a flower in it………….
Ground Truth: A person that is riding on a horse in a grass field.
Generation: A person that is riding in a field……………..
Ground Truth: A man is riding a horse in a field.
Generation: A man is riding a horse in a field…………….
Ground Truth: There are two birds standing on top of a building
Generation: There are two birds standing on a rock……………..
Ground Truth: Two hawks sit on top of a roof spire.
Generation: Two hawks sit on top of a wooden bench…………….
Ground Truth: .A young woman serving herself at a cookout.
Generation: .A young woman serving herself in a kitchen……………..
Ground Truth: 2 bowls of fruit sit on a table.
Generation: 2 bowls of fruit sit on a table……………..
Ground Truth: A wooden bench written ’CITY OF LONDON’ at the park
Generation: A wooden bench written ’CITY and a tree……………..

Table: A1.T3: Fine-tuning accuracy on dataset NL-RX-SYNTH, LoRA vs. full fine-tuning, both with the $\mathcal{L}_{\rm LLM}$ loss and the $\mathcal{L}_{\rm LLM\text{-}JEPA}$ loss (our method). Configuration: lr = 2e-5, λ = 1, k = 1. Each cell runs five times; average accuracy and standard deviation are reported. At every LoRA rank, $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) has better accuracy. At LoRA rank 512 (22.59% trainable parameters), $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) achieves the same accuracy as full fine-tuning, while $\mathcal{L}_{\rm LLM}$ still shows a significant gap to full fine-tuning.

LoRA Rank | Method | Accuracy (%) ↑
32 | $\mathcal{L}_{\rm LLM}$ | 6.09 ± 0.55
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 7.45 ± 1.87
64 | $\mathcal{L}_{\rm LLM}$ | 21.09 ± 1.90
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 32.46 ± 1.26
128 | $\mathcal{L}_{\rm LLM}$ | 34.21 ± 2.82
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 48.45 ± 3.66
256 | $\mathcal{L}_{\rm LLM}$ | 45.57 ± 4.52
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 60.80 ± 2.31
512 | $\mathcal{L}_{\rm LLM}$ | 50.18 ± 5.15
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 72.41 ± 2.94
Full | $\mathcal{L}_{\rm LLM}$ | 57.29 ± 5.32
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 70.42 ± 2.36

Table: A1.T4: Pretraining + fine-tuning accuracy of Llama-3.2-1B-Instruct with pretraining dataset paraphrase and fine-tuning datasets rotten_tomatoes and yelp, next-token prediction ($\mathcal{L}_{\rm LLM}$) loss vs. $\mathcal{L}_{\rm LLM\text{-}JEPA}$ loss (our method). Note that $\mathcal{L}_{\rm LLM\text{-}JEPA}$ is applied only at pretraining. We tune $lr_{pre}$ and $lr_{ft}$ with $\mathcal{L}_{\rm LLM}$ and reuse them for LLM-JEPA pretraining. We run pretraining 5 times, and for each pretrained model, we run fine-tuning 5 times. Average accuracy and standard deviation are reported. We also report the p-value of a paired, one-tailed t-test.

FT Dataset | Method | Accuracy (%) ↑ | p-value ↓ | Config
rotten_tomatoes | $\mathcal{L}_{\rm LLM}$ | 56.57 ± 1.66 | 7.38e-4 | lr_pre = 8e-5, lr_ft = 4e-5
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 57.76 ± 1.33 | | λ = 0.5, k = 2, same lr_pre, lr_ft
yelp | $\mathcal{L}_{\rm LLM}$ | 26.46 ± 0.92 | 1.00e-3 | lr_pre = 8e-5, lr_ft = 8e-5
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 27.15 ± 0.93 | | λ = 0.5, k = 2, same lr_pre, lr_ft

Table: A1.T6: Regular expressions generated by Llama-3.2-1B-Instruct after fine-tuning with the $\mathcal{L}_{\rm LLM}$ loss and the $\mathcal{L}_{\rm LLM\text{-}JEPA}$ loss (ours). Color code (in the original rendering): wrong, extra, missing.

Ground TruthℒLLM\mathcal{L}_{\rm LLM}ℒLLM−JEPA\mathcal{L}_{\rm LLM-JEPA} (ours)
lines not having the string ’dog’ followed by a number, 3 or more times
((dog.[0-9].)3,)((dog.[0-9].)3,)((dog.[0-9].){3,})
lines containing ending with a vowel, zero or more times
.(.)(([AEIOUaeiou])).(.)(([AEIOUaeiou]))*(.)(.){([AEIOUaeiou])*}
lines with a number or a character before a vowel
(([0-9])|(.)).([AEIOUaeiou]).(([0-9])|(.)).([AEIOUaeiou])..*(([0-9])|(.)).([AEIOUaeiou]).
lines with words with the string ’dog’, a letter, and a number
((([0-9])&(dog))|([A-Za-z]))*((([0-9])&(dog))|([A-Za-z]))*(([0-9])&(dog))|(([A-Za-z])*)

Table: A1.T9: Fine-tuning accuracy of Llama-3.2-1B-Instruct, $\mathcal{L}_{\rm LLM}$ loss vs. $\mathcal{L}_{\rm LLM\text{-}JEPA}$ loss (our method). Each cell is the best possible accuracy over a set of configurations. Each configuration runs five times; average accuracy and standard deviation are reported. We also report the p-value of a paired, one-tailed t-test.

Dataset | Method | Accuracy (%) ↑ | p-value ↓ | Config
NL-RX-SYNTH | $\mathcal{L}_{\rm LLM}$ | 57.29 ± 5.32 | 1.0e-3 | lr = 2e-5
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 71.46 ± 1.34 | | λ = 1, k = 1, same lr
NL-RX-TURK | $\mathcal{L}_{\rm LLM}$ | 22.49 ± 1.91 | 2.4e-4 | lr = 2e-5
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 30.94 ± 1.13 | | λ = 1, k = 1, same lr
GSM8K | $\mathcal{L}_{\rm LLM}$ | 32.36 ± 0.58 | 9.6e-5 | lr = 2e-5
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 36.36 ± 0.20 | | λ = 0.5, k = 4, same lr
Spider | $\mathcal{L}_{\rm LLM}$ | 47.52 ± 2.44 | 4.0e-3 | lr = 4e-5
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 50.55 ± 2.08 | | λ = 1, k = 3, same lr

Table: A1.T10: LLM-JEPA induces an almost linear transformation from $\operatorname{Enc}({\rm Text})$ to $\operatorname{Enc}({\rm Code})$.

 | $\min_{X}\|\operatorname{Enc}({\rm Text})\cdot X-\operatorname{Enc}({\rm Code})\|_{2}$ | Avg. Top 100 Singular Values
Base model | 3953.11 | 310.73
$\mathcal{L}_{\rm LLM}$ | 3035.01 | 341.80
LLM-JEPA (ours), k = 1 | 4.47 | 94.84
LLM-JEPA (ours), k = 0 | 4.04 | 16.82

Table: A1.T11: Fine-tuning accuracy on NL-RX-SYNTH with the next-token prediction ($\mathcal{L}_{\rm LLM}$) loss vs. the $\mathcal{L}_{\rm LLM\text{-}JEPA}$ loss (our method). Each case runs five times; average accuracy and standard deviation are reported. We also report the p-value of a paired, one-tailed t-test. Note that Llama has no official 3.2-8B model, so we use 3.1-8B, which has lower accuracy; still, LLM-JEPA sees a significant improvement. We also evaluate on OLMo-2-7B.

Model | Method | Accuracy (%) ↑ | p-value ↓ | Config
Llama-3.2-1B-Instruct | $\mathcal{L}_{\rm LLM}$ | 57.29 ± 5.32 | 1.0e-3 | lr = 2e-5
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 71.46 ± 1.34 | | λ = 1, k = 1, same lr
Llama-3.2-3B-Instruct | $\mathcal{L}_{\rm LLM}$ | 74.55 ± 3.58 | 0.0352 | lr = 2e-5
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 77.16 ± 3.66 | | λ = 2, k = 0, same lr
Llama-3.1-8B-Instruct | $\mathcal{L}_{\rm LLM}$ | 35.77 ± 6.60 | 0.0131 | lr = 2e-5
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 63.57 ± 16.81 | | λ = 2.0, k = 0, same lr
OLMo-2-1124-7B-Instruct | $\mathcal{L}_{\rm LLM}$ | 87.26 ± 0.27 | 0.0345 | lr = 2e-5
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 87.75 ± 0.33 | | λ = 20, k = 2, same lr

Figure: LLM-JEPA produces strong fine-tuned models across datasets and models.

Figure: Left: JEPA applied to NLP tasks that have Text and Code, where Text and Code are naturally two views of the same underlying object. Right (top): An illustration of the NL-RX-SYNTH dataset, where each sample consists of a natural-language description of a regular expression (Text) and the regular expression itself (Code). Right (bottom): The Spider dataset, where Text is the database ID and the description of the SQL query, and Code is the SQL query itself.

Figure: (a) Llama on GSM8K, lr = 2e-5

Figure: Losses during fine-tuning with the $\mathcal{L}_{\rm LLM}$ loss and the $\mathcal{L}_{\rm LLM\text{-}JEPA}$ loss (our method). We measure both the cross-entropy loss for next-token prediction ($\mathcal{L}_{\rm LLM}$ in the chart) and the JEPA prediction loss ($d(\cdot,\cdot)$, pred in the chart), although the latter does not contribute to the gradient in the baseline case. Accuracy is 51.95% for $\mathcal{L}_{\rm LLM}$ and 71.10% for $\mathcal{L}_{\rm LLM\text{-}JEPA}$. Since both runs reach a similar $\mathcal{L}_{\rm LLM}$ loss, that loss cannot explain the accuracy gap; pred stays constant under $\mathcal{L}_{\rm LLM}$ while it is minimized under $\mathcal{L}_{\rm LLM\text{-}JEPA}$, so pred is likely the main driver of the accuracy gap.

Figure: LLM-JEPA resists overfitting in LoRA fine-tuning. When fine-tuning with the $\mathcal{L}_{\rm LLM}$ loss starts to overfit, $\mathcal{L}_{\rm LLM\text{-}JEPA}$ keeps improving. However, the trend is not as stable as in full fine-tuning, possibly due to the limited capacity of LoRA fine-tuning.

Figure: (a) Base model: no fine-tuning

Figure: (c) LLM-JEPA (ours), k = 0

Figure: The top 100 singular values of $\operatorname{Enc}({\rm Text})-\operatorname{Enc}({\rm Code})$. The curves for LLM-JEPA (ours) are a few orders of magnitude lower than those of the base model and regular fine-tuning, meaning the mapping from Text to Code is confined to a narrow subspace, fostering the structure seen in fig. 6.

$$\mathcal{L}_{\rm LLM}({\rm Text}_{1:L-1},{\rm Text}_{L})={\rm XEnt}\left({\rm Classifier}\left({\rm Enc}({\rm Text}_{1:L-1})\right),{\rm Text}_{L}\right) \tag{eq:L_llm}$$

$$\mathcal{L}_{\rm LLM\text{-}JEPA} = \underbrace{\gamma \times \sum_{\ell=2}^{L}\mathcal{L}_{\rm LLM}({\rm Text}_{1:\ell-1},{\rm Text}_{\ell})}_{\text{generative capabilities}} + \lambda \times \underbrace{d({\rm Pred}({\rm Enc}({\rm Text})),\,{\rm Enc}({\rm Code}))}_{\text{abstraction capabilities}} \tag{eq:L_jepa_gamma}$$

Table: Accuracy (%) ↑ on NL-RX-SYNTH across objective variants (columns).
Baseline | LLM-JEPA | ℓ2-norm | MSE | Prepend | Code → Text | InfoNCE loss
57.29 ± 5.32 | 71.46 ± 1.34 | 2.22 ± 0.07 | 70.64 ± 2.05 | 68.07 ± 2.57 | 65.70 ± 2.63 | 34.40 ± 6.10
Dataset | Method | Accuracy (%) ↑ | p-value ↓ | Config
NQ-Open | $\mathcal{L}_{\rm LLM}$ | 20.12 ± 0.41 | 2.44e-3 | lr = 2e-5
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 21.59 ± 0.40 | | λ = 1024, k = 0, same lr
HellaSwag | $\mathcal{L}_{\rm LLM}$ | 69.40 ± 0.99 | 0.0136 | lr = 4e-5
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 70.51 ± 1.20 | | λ = 1, k = 3, same lr
Model | Method | Accuracy (%) ↑ | p-value ↓ | Config
Qwen3-1.7B | $\mathcal{L}_{\rm LLM}$ | 44.32 ± 0.39 | 0.0115 | lr = 4e-5
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 45.00 ± 0.40 | | λ = 1, k = 0, same lr
R1-Distill-Qwen-1.5B | $\mathcal{L}_{\rm LLM}$ | 13.87 ± 1.01 | 0.0396 | lr = 8e-5
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 15.04 ± 0.15 | | λ = 0.5, k = 1, same lr
Table: Accuracy (%) ↑ per configuration (columns), one row per training stage.
LD = 0, λ = 1 | LD = 0.5, λ = 1 | LD = 0.5, λ = 2 | LD = 0.75, λ = 1 | LD = 0.75, λ = 2 | LD = 0.75, λ = 4
19.85 ± 2.44 | 25.00 ± 3.73 | 24.50 ± 4.40 | 32.46 ± 3.32 | 32.10 ± 3.11 | 31.45 ± 3.34
40.70 ± 2.67 | 48.96 ± 6.03 | 50.71 ± 6.52 | 53.77 ± 5.53 | 57.03 ± 3.51 | 56.75 ± 4.63
55.60 ± 5.16 | 64.08 ± 2.75 | 63.79 ± 5.96 | 64.51 ± 7.28 | 67.03 ± 3.08 | 65.32 ± 3.54
60.43 ± 3.21 | 69.87 ± 2.48 | 70.11 ± 3.15 | 67.80 ± 4.94 | 66.80 ± 4.06 | 68.93 ± 4.59
63.57 ± 1.01 | 69.74 ± 2.99 | 71.20 ± 2.17 | 70.00 ± 4.74 | 70.77 ± 4.08 | 72.42 ± 1.28
63.96 ± 1.07 | 70.60 ± 3.05 | 72.11 ± 2.18 | 70.31 ± 4.64 | 70.92 ± 4.62 | 73.08 ± 1.28
γ/λ | Config | Accuracy (%) ↑
0.0 | γ = 0.0, λ = 1.0 | 0.00 ± 0.00
0.01 | γ = 0.01, λ = 1.0 | 1.38 ± 0.06
0.1 | γ = 0.1, λ = 1.0 | 45.80 ± 5.04
1.0 | γ = 1.0, λ = 1.0 | 70.42 ± 2.36
10.0 | γ = 1.0, λ = 0.1 | 67.52 ± 1.45
100.0 | γ = 1.0, λ = 0.01 | 66.83 ± 3.89
 | γ = 1.0, λ = 0.0 | 57.29 ± 5.32
Ground Truth | $\mathcal{L}_{\rm LLM}$ | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours)
lines ending with a vowel or starting with a character
([AEIOUaeiou].[A-Za-z].)+([AEIOUaeiou].[A-Za-z].)+([AEIOUaeiou].[A-Za-z].)+
lines containing either a lower-case letter, a vowel, or a letter
((.*)([AEIOUaeiou]))((.)(.*))(.*) ( ([AEIOUaeiou])((.)(.*)) )(.*) ( ([AEIOUaeiou])((.)(.*)) )
lines starting with the string 'dog' before a vowel
(([A-Za-z])7,).(dog).(([A-Za-z])7,).(dog). .*(([A-Za-z])7,).(dog).
lines not containing a letter and the string 'dog'
((([A-Z])+)([a-z]))(.*)((([A-Z])+)([a-z]))(.*) +((([A-Z])+)([a-z]))(.*)
lines with a character before a vowel and the string 'dog', zero or more times
.(.)&([0-9])&(dog)..(.)&([0-9])&(dog). .*.(.)&([0-9])&(dog). ..
lines with a vowel at least once before not a character
(([A-Za-z])+).(~([0-9])).(([A-Za-z])+).(~([0-9])). .*(([A-Za-z])+).(~([0-9])).
Model | Method | Accuracy (%) ↑ | p-value ↓ | Config
Llama-3.2-1B-Instruct | $\mathcal{L}_{\rm LLM}$ | 57.29 ± 5.32 | 1.0e-3 | lr = 2e-5
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 71.46 ± 1.34 | | λ = 1, k = 1, same lr
gemma-2-2b-it | $\mathcal{L}_{\rm LLM}$ | 33.65 ± 3.24 | 5.5e-3 | lr = 1e-5
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 43.12 ± 2.61 | | λ = 2, k = 4, same lr
OpenELM-1_1B-Instruct | $\mathcal{L}_{\rm LLM}$ | 12.07 ± 1.81 | 5.1e-4 | lr = 8e-5
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 25.40 ± 2.40 | | λ = 4, k = 3, same lr
OLMo-2-0425-1B-Instruct | $\mathcal{L}_{\rm LLM}$ | 87.09 ± 0.36 | 2.5e-3 | lr = 8e-5
 | $\mathcal{L}_{\rm LLM\text{-}JEPA}$ (ours) | 87.52 ± 0.29 | | λ = 2, k = 0, same lr


References

[Bengio+chapter2007] Bengio, Yoshua, LeCun, Yann. (2007). Scaling Learning Algorithms Towards AI. Large Scale Kernel Machines.

[balestriero2024learning] Balestriero, Randall, LeCun, Yann. (2024). Learning by reconstruction produces uninformative features for perception. arXiv preprint arXiv:2402.11337.

[ni2021sentence] Ni, Jianmo, Abrego, Gustavo Hernandez, Constant, Noah, Ma, Ji, Hall, Keith B, Cer, Daniel, Yang, Yinfei. (2021). Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models. arXiv preprint arXiv:2108.08877.

[wang2022text] Wang, Liang, Yang, Nan, Huang, Xiaolong, Jiao, Binxing, Yang, Linjun, Jiang, Daxin, Majumder, Rangan, Wei, Furu. (2022). Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533.

[reimers2019sentence] Reimers, Nils, Gurevych, Iryna. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.

[li2022contrastive] Li, Xiang Lisa, Holtzman, Ari, Fried, Daniel, Liang, Percy, Eisner, Jason, Hashimoto, Tatsunori, Zettlemoyer, Luke, Lewis, Mike. (2022). Contrastive decoding: Open-ended text generation as optimization. arXiv preprint arXiv:2210.15097.

[gao2021simcse] Gao, Tianyu, Yao, Xingcheng, Chen, Danqi. (2021). Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.

[Hinton06] Hinton, Geoffrey E., Osindero, Simon, Teh, Yee Whye. (2006). A Fast Learning Algorithm for Deep Belief Nets. Neural Computation.

[goodfellow2016deep] Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron. (2016). Deep Learning. MIT Press.

[team2024gemma] Team, Gemma, Riviere, Morgane, Pathak, Shreya, Sessa, Pier Giuseppe, Hardin, Cassidy, Bhupatiraju, Surya, Hussenot, Léonard, others. (2024). Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.

[mehta2024openelm] Mehta, Sachin, Sekhavat, Mohammad Hossein, Cao, Qingqing, Horton, Maxwell, Jin, Yanzi, Sun, Chenfan, Mirzadeh, Iman, Najibi, Mahyar, Belenko, Dmitry, Zatloukal, Peter, others. (2024). Openelm: An efficient language model family with open training and inference framework. arXiv preprint arXiv:2404.14619.

[llama3] Grattafiori, Aaron, Dubey, Abhimanyu, Jauhri, Abhinav, Pandey, Abhinav, Kadian, Abhishek, Al-Dahle, Ahmad, Letman, Aiesha, Mathur, Akhil, Schelten, Alan, Vaughan, Alex, others. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[olmo20242] OLMo, Team, Walsh, Pete, Soldaini, Luca, Groeneveld, Dirk, Lo, Kyle, Arora, Shane, Bhagia, Akshita, Gu, Yuling, Huang, Shengyi, Jordan, Matt, others. (2024). 2 OLMo 2 Furious. arXiv preprint arXiv:2501.00656.

[locascio2016neural] Locascio, Nicholas, Narasimhan, Karthik, DeLeon, Eduardo, Kushman, Nate, Barzilay, Regina. (2016). Neural Generation of Regular Expressions from Natural Language with Minimal Domain Knowledge. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

[cobbe2021training] Cobbe, Karl, Kosaraju, Vineet, Bavarian, Mohammad, Chen, Mark, Jun, Heewoo, Kaiser, Lukasz, Schulman, John, Hilton, Jacob, Knight, Melanie, Weller, Adrian, Amodei, Dario, others. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

[yu2018spider] Yu, Tao, Zhang, Rui, Yang, Kai, Yasunaga, Michihiro, Wang, Dongxu, Li, Zifan, Ma, James, Li, Irene, Yao, Qingning, Roman, Shanelle, others. (2018). Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

[lecun2022path] LeCun, Yann. (2022). A Path Towards Autonomous Machine Intelligence (Version 0.9.2). OpenReview.

[brown2020language] Brown, Tom, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared D, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, others. (2020). Language models are few-shot learners. Advances in neural information processing systems.

[chowdhery2023palm] Chowdhery, Aakanksha, Narang, Sharan, Devlin, Jacob, Bosma, Maarten, Mishra, Gaurav, Roberts, Adam, Barham, Paul, Chung, Hyung Won, Sutton, Charles, Gehrmann, Sebastian, others. (2023). Palm: Scaling language modeling with pathways. Journal of Machine Learning Research.

[he2022masked] He, Kaiming, Chen, Xinlei, Xie, Saining, Li, Yanghao, Dollár, Piotr, Girshick, Ross. (2022). Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[assran2023self] Assran, Mahmoud, Duval, Quentin, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, LeCun, Yann, Ballas, Nicolas. (2023). Self-supervised learning from images with a joint-embedding predictive architecture. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[bardes2024revisiting] Bardes, Adrien, Garrido, Quentin, Ponce, Jean, Chen, Xinlei, Rabbat, Michael, LeCun, Yann, Assran, Mahmoud, Ballas, Nicolas. (2024). Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471.

[baevski2022data2vec] Baevski, Alexei, Hsu, Wei-Ning, Xu, Qiantong, Babu, Arun, Gu, Jiatao, Auli, Michael. (2022). Data2vec: A general framework for self-supervised learning in speech, vision and language. International conference on machine learning.

[littwin2024jepa] Littwin, Etai, Saremi, Omid, Advani, Madhu, Thilak, Vimal, Nakkiran, Preetum, Huang, Chen, Susskind, Joshua. (2024). How jepa avoids noisy features: The implicit bias of deep linear self distillation networks. Advances in Neural Information Processing Systems.

[jing2021understanding] Jing, Li, Vincent, Pascal, LeCun, Yann, Tian, Yuandong. (2021). Understanding dimensional collapse in contrastive self-supervised learning. arXiv preprint arXiv:2110.09348.

[kenneweg2025jepa] Kenneweg, Tristan, Kenneweg, Philip, Hammer, Barbara. (2025). JEPA for RL: Investigating Joint-Embedding Predictive Architectures for Reinforcement Learning. arXiv preprint arXiv:2504.16591.

[tian2020evaluating] Tian, Haoye, Liu, Kui, Kabor{'e. (2020). Evaluating representation learning of code changes for predicting patch correctness in program repair. Proceedings of the 35th IEEE/ACM international conference on automated software engineering.

[hoang2020cc2vec] Hoang, Thong, Kang, Hong Jin, Lo, David, Lawall, Julia. (2020). Cc2vec: Distributed representations of code changes. Proceedings of the ACM/IEEE 42nd international conference on software engineering.

[cabrera2021commit2vec] Cabrera Lozoya, Roc{'\i. (2021). Commit2vec: Learning distributed representations of code changes. SN Computer Science.

[zhou2023ccbert] Zhou, Xin, Xu, Bowen, Han, DongGyun, Yang, Zhou, He, Junda, Lo, David. (2023). Ccbert: Self-supervised code change representation learning. 2023 IEEE International Conference on Software Maintenance and Evolution (ICSME).

[jimenez2024swebench] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik R. Narasimhan. (2024). SWE-bench: Can Language Models Resolve Real-world GitHub Issues?. International Conference on Learning Representations (ICLR).

[barrault2024large] Barrault, Lo{. (2024). Large concept models: Language modeling in a sentence representation space. arXiv preprint arXiv:2412.08821.

[wang2025reversal] Wang, Boshi, Sun, Huan. (2025). Is the reversal curse a binding problem? uncovering limitations of transformers from a basic generalization failure. arXiv preprint arXiv:2504.01928.

[ye2020sketch] Ye, Xi, Chen, Qiaochu, Wang, Xinyu, Dillig, Isil, Durrett, Greg. (2020). Sketch-driven regular expression generation from natural language and examples. Transactions of the Association for Computational Linguistics.

[zhong2018semregex] Zhong, Zexuan, Guo, Jiaqi, Yang, Wei, Peng, Jian, Xie, Tao, Lou, Jian-Guang, Liu, Ting, Zhang, Dongmei. (2018). Semregex: A semantics-based approach for generating regular expressions from natural language specifications. Proceedings of the 2018 conference on empirical methods in natural language processing.

[iyer2017learning] Iyer, Srinivasan, Konstas, Ioannis, Cheung, Alvin, Krishnamurthy, Jayant, Zettlemoyer, Luke. (2017). Learning a neural semantic parser from user feedback. arXiv preprint arXiv:1704.08760.

[guo2019towards] Guo, Jiaqi, Zhan, Zecheng, Gao, Yan, Xiao, Yan, Lou, Jian-Guang, Liu, Ting, Zhang, Dongmei. (2019). Towards complex text-to-sql in cross-domain database with intermediate representation. arXiv preprint arXiv:1905.08205.

[wang2019rat] Wang, Bailin, Shin, Richard, Liu, Xiaodong, Polozov, Oleksandr, Richardson, Matthew. (2019). Rat-sql: Relation-aware schema encoding and linking for text-to-sql parsers. arXiv preprint arXiv:1911.04942.

[li2023resdsql] Li, Haoyang, Zhang, Jing, Li, Cuiping, Chen, Hong. (2023). Resdsql: Decoupling schema linking and skeleton parsing for text-to-sql. Proceedings of the AAAI Conference on Artificial Intelligence.

[hendrycks2021measuring] Hendrycks, Dan, Burns, Collin, Kadavath, Saurav, Arora, Akul, Basart, Steven, Tang, Eric, Song, Dawn, Steinhardt, Jacob. (2021). Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874.

[ling2017program] Ling, Wang, Yogatama, Dani, Dyer, Chris, Blunsom, Phil. (2017). Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146.

[amini2019mathqa] Amini, Aida, Gabriel, Saadia, Lin, Peter, Koncel-Kedziorski, Rik, Choi, Yejin, Hajishirzi, Hannaneh. (2019). Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319.

[lee-etal-2019-latent-no-open] Lee, Kenton, Chang, Ming-Wei, Toutanova, Kristina. (2019). Latent Retrieval for Weakly Supervised Open Domain Question Answering. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. doi:10.18653/v1/P19-1612.

[deepseekai2025deepseekr1incentivizingreasoningcapability] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. 
Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Zhen Zhang. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.

[yang2025qwen3technicalreport] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, Zihan Qiu. (2025). Qwen3 Technical Report.

[zellers2019hellaswag] Zellers, Rowan, Holtzman, Ari, Bisk, Yonatan, Farhadi, Ali, Choi, Yejin. (2019). HellaSwag: Can a Machine Really Finish Your Sentence?. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

[oord2018cpc] van den Oord, Aaron, Li, Yazhe, Vinyals, Oriol. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

[chen2020simclr] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. Proceedings of the 37th International Conference on Machine Learning.

[radford2021clip] Radford, Alec, Kim, Jong Wook, Hallacy, Chris, Ramesh, Aditya, Goh, Gabriel, Agarwal, Sandhini, Sastry, Girish, Askell, Amanda, Mishkin, Pamela, Clark, Jack, Krueger, Gretchen, Sutskever, Ilya. (2021). Learning transferable visual models from natural language supervision. Proceedings of the 38th International Conference on Machine Learning.

[lee2019latent] Lee, Kenton, Chang, Ming-Wei, Toutanova, Kristina. (2019). Latent retrieval for weakly supervised open domain question answering. Proceedings of ACL.

[guu2020retrieval] Guu, Kelvin, Lee, Kenton, Tung, Zora, Pasupat, Panupong, Chang, Ming-Wei. (2020). Retrieval augmented language model pre-training. Proceedings of ICML.

[izacard2021leveraging] Izacard, Gautier, Grave, Edouard. (2021). Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint arXiv:2007.01282.

[brown2020fewshot-learner] Brown, Tom, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, others. (2020). Language models are few-shot learners. Proceedings of NeurIPS.

[openai2023gpt4] OpenAI. (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.

[mccann2018natural] McCann, Bryan, Bradbury, James, Xiong, Caiming, Socher, Richard. (2018). The natural language decathlon: Multitask learning as question answering. Proceedings of ICLR.

[raffel2020exploring] Raffel, Colin, Shazeer, Noam, Roberts, Adam, Lee, Katherine, Narang, Sharan, Matena, Michael, Zhou, Yanqi, Li, Wei, Liu, Peter J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Proceedings of JMLR.

[wei2022finetuned] Wei, Jason, Bosma, Maarten, Zhao, Vincent, Guu, Kelvin, Yu, Adams Wei, Lester, Brian, Du, Nan, Dai, Andrew M, Le, Quoc V. (2022). Finetuned language models are zero-shot learners. Proceedings of ICLR.

[frazier1978parsing] Frazier, Lyn, Fodor, Janet Dean. (1978). The sausage machine: A new two-stage parsing model. Cognition.

[altmann1999incremental] Altmann, Gerry T. M., Kamide, Yuki. (1999). Incremental interpretation at verbs: Restricting the domain of subsequent reference. Cognition.

[pollard1994hpsg] Pollard, Carl, Sag, Ivan A.. (1994). Head-Driven Phrase Structure Grammar.

[langacker1987foundations] Langacker, Ronald W.. (1987). Foundations of Cognitive Grammar, Volume I: Theoretical Prerequisites.

[croft2004cognitive] Croft, William, Cruse, D. Alan. (2004). Cognitive Linguistics.

[banarescu2013amr] Banarescu, Laura, Bonial, Claire, Cai, Shu, Georgescu, Madalina, Griffitt, Kira, Hermjakob, Ulf, Knight, Kevin, Koehn, Philipp, Palmer, Martha, Schneider, Nathan. (2013). Abstract Meaning Representation for Sembanking. Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse.

[abend2013ucca] Abend, Omri, Rappoport, Ari. (2013). Universal Conceptual Cognitive Annotation (UCCA). Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics.

[hoffmann2022chinchilla] Hoffmann, Jordan, Borgeaud, Sebastian, Mensch, Arthur, Buchatskaya, Elena, Cai, Trevor, Rutherford, Eliza, de Las Casas, Diego, Hendricks, Lisa Anne, Welbl, Johannes, Clark, Aidan, Hennigan, Tom, Noland, Eric, Millican, Katie, van den Driessche, George, Damoc, Bogdan, Guy, Aurelia, Osindero, Simon, Simonyan, Karen, Elsen, Erich, Vinyals, Oriol, Rae, Jack W., Sifre, Laurent. (2022). Training compute-optimal large language models. Proceedings of the 36th International Conference on Neural Information Processing Systems.

[petroni-etal-2021-kilt] Petroni, Fabio, Piktus, Aleksandra, Fan, Angela, Lewis, Patrick, Yazdani, Majid, De Cao, Nicola, Thorne, James, Jernite, Yacine, Karpukhin, Vladimir, Maillard, Jean, Plachouras, Vassilis, Rockt{. (2021). {KILT. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. doi:10.18653/v1/2021.naacl-main.200.

[kwiatkowski-etal-2019-natural] Kwiatkowski, Tom, Palomaki, Jennimaria, Redfield, Olivia, Collins, Michael, Parikh, Ankur, Alberti, Chris, Epstein, Danielle, Polosukhin, Illia, Devlin, Jacob, Lee, Kenton, Toutanova, Kristina, Jones, Llion, Kelcey, Matthew, Chang, Ming-Wei, Dai, Andrew M., Uszkoreit, Jakob, Le, Quoc, Petrov, Slav. (2019). Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics. doi:10.1162/tacl_a_00276.

[lee-etal-2022-deduplicating] Lee, Katherine, Ippolito, Daphne, Nystrom, Andrew, Zhang, Chiyuan, Eck, Douglas, Callison-Burch, Chris, Carlini, Nicholas. (2022). Deduplicating Training Data Makes Language Models Better. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2022.acl-long.577.

[zhuang-etal-2021-robustly] Zhuang, Liu, Wayne, Lin, Ya, Shi, Jun, Zhao. (2021). A Robustly Optimized {BERT. Proceedings of the 20th Chinese National Conference on Computational Linguistics.

[shamir2016correlate] Shamir, Ohad. (2016). Without-Replacement Sampling for Stochastic Gradient Methods. Advances in Neural Information Processing Systems.

[wikidump] Wikimedia Foundation. Wikimedia Downloads.

[tenney2019bertpipeline] Ian Tenney, Dipanjan Das, Ellie Pavlick. (2019). {BERT. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). doi:10.18653/v1/P19-1452.

[hewitt2019structuralprobe] John Hewitt, Christopher D. Manning. (2019). A Structural Probe for Finding Syntax in Word Representations. Proceedings of NAACL-HLT. doi:10.18653/v1/N19-1419.

[shen2019onlstm] Yikang Shen, Shawn Tan, Alessandro Sordoni, Aaron Courville. (2019). Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks. International Conference on Learning Representations (ICLR).

[kim2019cpcfg] Yoon Kim, Chris Dyer, Alexander Rush. (2019). Compound Probabilistic Context{-. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). doi:10.18653/v1/P19-1228.

[drozdov2019diora] Andrew Drozdov, Patrick Verga, Mohit Yadav, Mohit Iyyer, Andrew McCallum. (2019). Unsupervised Latent Tree Induction with Deep Inside{-. Proceedings of NAACL-HLT. doi:10.18653/v1/N19-1116.

[shen2018prpn] Yikang Shen, Zhouhan Lin, Chin-Wei Huang, Aaron Courville. (2018). Neural Language Modeling by Jointly Learning Syntax and Lexicon. International Conference on Learning Representations (ICLR).

[socher2011recursive] Richard Socher, Cliff Chiung-Yu Lin, Andrew Y. Ng, Christopher D. Manning. (2011). Parsing Natural Scenes and Natural Language with Recursive Neural Networks. Proceedings of the 28th International Conference on Machine Learning (ICML).

[chomsky1957syntactic] Noam Chomsky. (1957). Syntactic Structures.

[bib1] Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319, 2019.

[bib2] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023.

[bib3] Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. Data2vec: A general framework for self-supervised learning in speech, vision and language. In International conference on machine learning, pages 1298–1312. PMLR, 2022.

[bib4] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024.

[bib5] Loïc Barrault, Paul-Ambroise Duquenne, Maha Elbayad, Artyom Kozhevnikov, Belen Alastruey, Pierre Andrews, Mariano Coria, Guillaume Couairon, Marta R Costa-jussà, David Dale, et al. Large concept models: Language modeling in a sentence representation space. arXiv preprint arXiv:2412.08821, 2024.

[bib6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

[bib7] Rocío Cabrera Lozoya, Arnaud Baumann, Antonino Sabetta, and Michele Bezzi. Commit2vec: Learning distributed representations of code changes. SN Computer Science, 2(3):150, 2021.

[bib8] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.

[bib9] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, John Schulman, Jacob Hilton, Melanie Knight, Adrian Weller, Dario Amodei, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

[bib10] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

[bib11] Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. Towards complex text-to-SQL in cross-domain database with intermediate representation. arXiv preprint arXiv:1905.08205, 2019.

[bib12] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.

[bib13] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.

[bib14] Thong Hoang, Hong Jin Kang, David Lo, and Julia Lawall. CC2Vec: Distributed representations of code changes. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pages 518–529, 2020.

[bib15] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. Learning a neural semantic parser from user feedback. arXiv preprint arXiv:1704.08760, 2017.

[bib16] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations (ICLR), 2024. Oral presentation.

[bib17] Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning. arXiv preprint arXiv:2110.09348, 2021.

[bib18] Tristan Kenneweg, Philip Kenneweg, and Barbara Hammer. JEPA for RL: Investigating joint-embedding predictive architectures for reinforcement learning. arXiv preprint arXiv:2504.16591, 2025.

[bib19] Yann LeCun. A path towards autonomous machine intelligence (version 0.9.2). OpenReview, June 27, 2022.

[bib20] Haoyang Li, Jing Zhang, Cuiping Li, and Hong Chen. RESDSQL: Decoupling schema linking and skeleton parsing for text-to-SQL. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 13067–13075, 2023.

[bib21] Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146, 2017.

[bib22] Etai Littwin, Omid Saremi, Madhu Advani, Vimal Thilak, Preetum Nakkiran, Chen Huang, and Joshua Susskind. How jepa avoids noisy features: The implicit bias of deep linear self distillation networks. Advances in Neural Information Processing Systems, 37:91300–91336, 2024.

[bib23] Nicholas Locascio, Karthik Narasimhan, Eduardo DeLeon, Nate Kushman, and Regina Barzilay. Neural generation of regular expressions from natural language with minimal domain knowledge. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1918–1923, 2016.

[bib24] Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, et al. OpenELM: An efficient language model family with open training and inference framework. arXiv preprint arXiv:2404.14619, 2024.

[bib25] Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 OLMo 2 Furious. arXiv preprint arXiv:2501.00656, 2024.

[bib26] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.

[bib27] Haoye Tian, Kui Liu, Abdoul Kader Kaboré, Anil Koyuncu, Li Li, Jacques Klein, and Tegawendé F Bissyandé. Evaluating representation learning of code changes for predicting patch correctness in program repair. In Proceedings of the 35th IEEE/ACM international conference on automated software engineering, pages 981–992, 2020.

[bib28] Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers. arXiv preprint arXiv:1911.04942, 2019.

[bib29] Boshi Wang and Huan Sun. Is the reversal curse a binding problem? Uncovering limitations of transformers from a basic generalization failure. arXiv preprint arXiv:2504.01928, 2025.

[bib30] Xi Ye, Qiaochu Chen, Xinyu Wang, Isil Dillig, and Greg Durrett. Sketch-driven regular expression generation from natural language and examples. Transactions of the Association for Computational Linguistics, 8:679–694, 2020.

[bib31] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018.

[bib32] Zexuan Zhong, Jiaqi Guo, Wei Yang, Jian Peng, Tao Xie, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. SemRegex: A semantics-based approach for generating regular expressions from natural language specifications. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018.

[bib33] Xin Zhou, Bowen Xu, DongGyun Han, Zhou Yang, Junda He, and David Lo. CCBERT: Self-supervised code change representation learning. In 2023 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 182–193. IEEE, 2023.