Learning to Compress: Unlocking the Potential of Large Language Models for Text Representation
Yeqin Zhang, Yizheng Zhao, Chen Hu, Binxing Jiao, Daxin Jiang, Ruihang Miao, Cam-Tu Nguyen
Abstract
Text representation plays a critical role in tasks like clustering, retrieval, and other downstream applications. With the emergence of large language models (LLMs), there is increasing interest in harnessing their capabilities for this purpose. However, most LLMs are inherently causal and optimized for next-token prediction, making them suboptimal for producing holistic representations. To address this, recent studies introduced pretext tasks to adapt LLMs for text representation. Most of these tasks, however, rely on token-level prediction objectives, such as the masked next-token prediction (MNTP) used in LLM2Vec. In this work, we explore the untapped potential of context compression as a pretext task for unsupervised adaptation of LLMs. During compression pre-training, the model learns to generate compact memory tokens, which substitute for the whole context in downstream sequence prediction. Experiments demonstrate that a well-designed compression objective can significantly enhance LLM-based text representations, outperforming models trained with token-level pretext tasks. Further improvements through contrastive learning produce a strong representation model (LLM2Comp) that outperforms contemporary LLM-based text encoders on a wide range of tasks while being more sample-efficient, requiring significantly less training data. Code is available at https://github.com/longtaizi13579/LLM2Comp.
Introduction
Text embedding models, which transform semantic content into vector representations, are fundamental to numerous tasks such as retrieval and recommendation. Early methods like TF-IDF and BM25 relied on simple statistics, thus falling short in capturing sequence semantics. The advent of deep neural networks led to the development of techniques such as Word2Vec (Mikolov et al. 2013). This paradigm shift paved the way for foundational models like BERT (Devlin et al. 2019) and T5 (Raffel et al. 2020).
Recently, there has been growing interest in leveraging the powerful capabilities of LLMs for text representation. However, most LLMs are inherently causal and optimized for next-token prediction, which makes them inherently suboptimal for generating holistic, coherent representations of
entire sequences. To address this limitation, recent studies have proposed various pretext tasks for unsupervised adaptation of LLMs, focusing primarily on token-level prediction objectives. For example, as shown in Figure 1a, LLM2Vec (BehnamGhader et al. 2024) first transforms the causal attention mechanism of LLMs into a bidirectional form, and then adopts masked next token prediction (MNTP) as a pretext task to align the training objectives of causal LLMs with those of bidirectional models such as BERT. In this setup, MNTP randomly masks tokens within a sentence and leverages contextual representations of preceding tokens to predict the masked ones. In contrast, as shown in Figure 1b, Llama2Vec (Li et al. 2024) employs two pretext tasks, Embedding-Based Auto-Encoding (EBAE) and Embedding-Based Auto-Regression (EBAR), which predict tokens within the original sequence or the continued sequence, respectively. Although such objectives can capture 'bag-of-tokens' information, they cannot fully preserve the coherent semantic integrity of the entire sequence. These tasks, therefore, remain fundamentally token-level rather than sequence-level prediction.
In this work, we explore context compression as a pretext task for the unsupervised adaptation of LLMs into text encoders. Specifically, during compression pre-training, the model learns to produce compact 'memory tokens' that replace the original context for downstream sequence prediction tasks; the conceptual difference between our objectives and previous pretext tasks is illustrated in Figure 1. We explore several compression objectives: the reconstruction task (Ge et al. 2024; Xu et al. 2024; Cheng et al. 2024) involves regenerating the original sentence, whereas the continuation task (Ge et al. 2024; Chevalier et al. 2023; Qin et al. 2024; Xu et al. 2024; Shao et al. 2024) focuses on generating the correct subsequent tokens. Our preliminary experiments show that the reconstruction task does not provide satisfactory results, whereas the continuation task trained with NLL loss (CT-NLL) suffers from unstable training. Inspired by (Mu, Li, and Goodman 2023; Wingate, Shoeybi, and Sorensen 2022), we then propose the Continuation Task with Knowledge Distillation (CTKD) as a pretext task, which trains the model to predict, from the compressed soft prompt, next-token distributions that align with those conditioned on the original extended sequence. Our experiments demonstrate that this well-designed objective (CTKD) can significantly enhance LLM-based text representations.

We then perform a comprehensive analysis and find that LLM embeddings trained with the compression pretext task still suffer from dimensional collapse (Jing et al. 2022), where the vectors lie in a low-dimensional subspace instead of using the full embedding space. However, compression pretraining with the CTKD objective is less affected by this issue than with CT-NLL, which likely explains the advantage of CTKD over the CT-NLL objective. To further address dimensional collapse, we apply contrastive post-training to the model pretrained with CTKD. This post-training consists of an unsupervised contrastive phase followed by supervised contrastive learning (SCL). In particular, SCL pulls the embedding vectors of negative samples apart, reducing dimensional collapse and leading to significant performance gains. Our experiments show that CTKD pretraining synergizes with contrastive learning. As a result, our model, LLM2Comp, outperforms other LLM-based models on many MTEB benchmark tasks. LLM2Comp is also more sample-efficient, requiring much less (supervised) data during the contrastive learning phases than competing methods.
Our key contributions are summarized as follows:
· We thoroughly investigate the untapped potential of compression pretraining for adapting LLMs to text representation tasks. Through empirical analysis, we provide insights into the crucial factors for the success of such a pretraining task, including the optimal training objective (CTKD) and the appropriate number of memory tokens.
· We delve into the reasons behind the advantage of CTKD over other compression objectives, demonstrating that it is less prone to the dimensional collapse issue and thus more suited for downstream text representation.
· We show that further enhancements through contrastive learning help alleviate the dimensional collapse issue. Note that the CTKD task provides a robust foundation for text representation, enabling our model, LLM2Comp, to significantly outperform contemporary models such as LLM2Vec and Llama2Vec with less training data.
Do We Need Unsupervised Contrastive Learning?
A good representation should satisfy two key properties (Wang and Isola 2020; Jing et al. 2022): (1) Alignment, which encourages the representations of semantically related texts to be close to each other; and (2) High Effective Dimensionality (Jing et al. 2022), i.e., embedding vectors occupy much of the embedding space. While compression-based objectives implicitly promote alignment, they also tend to suffer from dimensional collapse, as shown in the previous section. Prior work (Jing et al. 2022) has demonstrated that contrastive learning mitigates collapse by pushing representations of negative samples apart. In this paper, we investigate whether post-training with unsupervised contrastive learning (UCL) followed by supervised contrastive learning (SCL) can alleviate dimensional collapse, and study its impact on downstream representation.
Related Works
Recent methods (Nie et al. 2024) for adapting LLMs to text embedders can be broadly classified into training-free and training-based approaches.
Training-free Methods The simplest way to use LLMs as text encoders is to take the last token's hidden state from the final transformer layer as the representation, known as last token pooling. However, the last token's representation is optimized for next-token prediction rather than for aggregating a global sequence embedding. Recently, several prompting strategies such as PromptEOL (Jiang et al. 2024) and MetaEOL (Lei et al. 2024) have been proposed to enhance the representational ability of the last token. An alternative approach is weighted mean pooling (Muennighoff 2022), which aggregates information from all sequence tokens. To further address the limitations of causal attention, EE (Springer et al. 2024) duplicates the input text so that early tokens can attend to subsequent tokens. Although these training-free methods are simple, they still struggle to produce high-quality embeddings consistently. Moreover, approaches such as EE and MetaEOL increase the effective context length, which in turn increases the cost of extracting embeddings.
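To make the pooling mechanisms above concrete, the following is a minimal sketch of position-weighted mean pooling in the style of SGPT (Muennighoff 2022); the function and variable names are illustrative and not taken from any released code.

```python
import torch

def weighted_mean_pooling(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, dim), attention_mask: (batch, seq_len).
    # Later positions receive larger weights, since under causal attention they
    # have seen more of the sequence; padding positions are masked out.
    positions = torch.arange(1, hidden_states.size(1) + 1, device=hidden_states.device)
    weights = positions.unsqueeze(0) * attention_mask            # (batch, seq_len)
    weights = weights / weights.sum(dim=1, keepdim=True)         # normalize per sample
    return (hidden_states * weights.unsqueeze(-1)).sum(dim=1)    # (batch, dim)
```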
Training-required Methods To better adapt LLMs for representation learning, a variety of training strategies have been proposed, including supervised contrastive learning (Wang et al. 2024b; Ma et al. 2024; Muennighoff et al. 2024; Li et al. 2025; Wang et al. 2024a; Choi et al. 2024; Li et al. 2024; BehnamGhader et al. 2024; Su et al. 2025), instruction tuning (Su et al. 2023), and in-context learning (Li et al. 2025). Other research directions explore the use of synthetic data (Wang et al. 2024a; Choi et al. 2024), multi-task learning (e.g., combining generation and representation learning) (Muennighoff et al. 2024; Man et al. 2024), variants of training loss functions (Deng et al. 2025), or the use of pretext tasks (Li et al. 2024; BehnamGhader et al. 2024). Our work falls into the latter category, but focuses on the unique question of whether context compression can be utilized as an effective pretext task to enhance LLMs' ability to learn text representations.
LLM-based Text Representation
Training-free Methods
Compression Pretext Training Details We expanded the vocabulary by adding 8 special tokens, which increased the vocabulary size from 32,000 to 32,008. Accordingly, the embedding layer was reshaped to match the new dimensions. The number of memory tokens was selected based on the experimental analysis presented in Section 3.3. Following the approach of LLM2Vec, we employed LoRA (Low-Rank Adaptation) for efficient parameter fine-tuning. Specifically, the LoRA rank was set to 16, and the alpha parameter was set to 32 based on empirical evidence and consistent with LLM2Vec. The modules modified by LoRA include the query, value, and output projections within the attention layer, as well as the up, down, and gate projections within the feedforward network layers. Since the newly added special tokens were specifically introduced to compress and encode semantic information, we kept the embedding layer fully trainable to ensure that the model could effectively adapt to the representation learning space.
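To make these choices concrete, below is a minimal sketch of the adapter setup described above using Hugging Face transformers and peft. The hyperparameters follow the text; the memory-token strings and variable names are illustrative assumptions, not taken from the released code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add 8 memory tokens and resize the embedding matrix (32,000 -> 32,008).
# The token strings "<MEM_i>" are hypothetical placeholders.
memory_tokens = [f"<MEM_{i}>" for i in range(8)]
tokenizer.add_special_tokens({"additional_special_tokens": memory_tokens})
model.resize_token_embeddings(len(tokenizer))

lora_config = LoraConfig(
    r=16,                                                   # LoRA rank, as in the text
    lora_alpha=32,                                          # LoRA alpha, as in the text
    target_modules=["q_proj", "v_proj", "o_proj",           # attention projections
                    "up_proj", "down_proj", "gate_proj"],   # feed-forward projections
    modules_to_save=["embed_tokens"],  # keep the (expanded) embedding layer fully trainable
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```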
The model was trained for 8,000 steps with a batch size of 4, resulting in a total of 32,000 samples from English Wikipedia, consistent with LLM2Vec. We selected Wikipedia because it is presumably included in the pre-training corpus of the model used in our experiments. Thus, this adaptation step is not expected to provide new factual knowledge but rather to refine the model's ability to compress sentences and construct sequence representations, making the comparison with LLM2Vec appropriate. Specifically, we used the WikiText-103 dataset (Merity et al. 2017) for training. During training, we utilized bfloat16 precision to optimize memory usage. The base model was 'meta-llama/Llama-2-7b-chat-hf' 1 . We set the learning rate to 1e-4 and the weight decay to 1e-5, parameters chosen for stable training loss. We also applied DeepSpeed ZeRO-0 optimization training, along with a warm-up decay learning rate schedule, where the minimum learning rate during warm-up was set to 1e-5. To examine
1 https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
the impact of random seed selection on model performance, we report results for two representative cases: a seed (2026) that consistently led to stronger performance and a seed (42) that resulted in weaker performance. For all experiments, random seeds were fixed across PyTorch, NumPy, and Python's random library to ensure reproducibility within each setting.
Unsupervised Contrastive Learning Training Details Following SimCSE (Gao, Yao, and Chen 2021), we use dropout to create unsupervised positive samples, treat in-batch sentences as negatives, and apply the InfoNCE loss for unsupervised contrastive learning. The dropout rate was set to 0.2, and the batch size was 128 with gradient checkpointing. The dropout rate was chosen for stable training loss, and the batch size was selected to be consistent with LLM2Vec. We employed LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning: the LoRA rank was set to 16 and the alpha parameter to 32, based on empirical evidence and consistent with LLM2Vec. The modules modified by LoRA include the query, value, and output projections within the attention layers, as well as the up, down, and gate projections within the feed-forward layers.
Supervised Contrastive Learning Training Details Following LLM2Vec (BehnamGhader et al. 2024), we used the E5 dataset for training. This dataset consists of ELI5 (sample ratio 0.1) (Fan et al. 2019), HotpotQA (Yang et al. 2018), FEVER (Thorne et al. 2018), MIRACL (Zhang et al. 2023), MS-MARCO passage ranking (sample ratio 0.5) and document ranking (sample ratio 0.2) (Nguyen et al. 2016), NQ (Karpukhin et al. 2020), NLI (Gao, Yao, and Chen 2021), SQuAD (Rajpurkar et al. 2016), TriviaQA (Joshi et al. 2017), Quora Duplicate Questions (sample ratio 0.1), MrTyDi (Zhang et al. 2021), DuReader (He et al. 2018), and T2Ranking (sample ratio 0.5) (Xie et al. 2023). This public dataset is widely used by LLM2Vec (BehnamGhader et al. 2024), mE5 (Wang et al. 2024b), E5-Mistral (Wang et al. 2024a), GritLM (Muennighoff et al. 2024), and others. The fine-tuning instructions for each dataset are the same as those used in LLM2Vec and are provided in Table 3.
The model was trained for 200 steps with a batch size of 128 on 8 NVIDIA H800 GPUs, yielding an effective batch
size of 1,024 and processing approximately 128,000 training instances from the E5 dataset. As the E5 dataset is a widely adopted benchmark for supervised contrastive learning, using it enables a fair comparison with many existing models. Specifically, we used a subset of the E5 dataset for training. During training, we employed bfloat16 precision to reduce memory usage. The learning rate was set to 1e-4 with a weight decay of 3e-4. We also applied DeepSpeed ZeRO-0 optimization, together with a warm-up decay learning rate schedule, where the minimum learning rate during warmup was set to 1e-5. These hyperparameters were selected to ensure stable training loss.
Here, we provide brief introductions to the baseline models compared in our main experiments:
Instructor (Su et al. 2023): An embedding model that introduces instruction tuning, extending GTR (Ni et al. 2022) and leveraging a curated dataset spanning a wide range of tasks.
ULLME (Man et al. 2024): An approach based on Generation-Representation Learning (GRL), which jointly optimizes contrastive learning and generation objectives.
RepLlama (Ma et al. 2024): A fine-tuned LLaMA-2-7B model, optimized for multi-stage text retrieval tasks.
In the main body of the paper, the number of training samples for LLM2Vec was mistakenly reported as 1.16 million due to a typographical error. The correct total is 1.66 million, comprising 1.5 million samples from the supervised contrastive learning stage (8 GPUs, 64 micro-batches, 1000 steps, 3 epochs) and 0.16 million samples from the pretext and unsupervised contrastive learning stages. This correction does not affect the conclusions of this paper. We apologize for the oversight and any confusion it may have caused.
Compression Pretraining
Dimensional Collapse
Since compression pretraining aims to condense long contexts into compact memory tokens, it may lead to dimensional collapse, a phenomenon also observed in self-supervised learning (Shwartz-Ziv and LeCun 2024). Here, dimensional collapse occurs when embedding vectors occupy a subspace of significantly lower dimensionality than the original embedding space. In our problem, this manifests in two ways: (i) the pooled embeddings exhibit low rank, and (ii) the resulting memory tokens become highly similar to one another. To study this effect, we analyze embeddings produced by LLM2Comp KL and LLM2Comp NLL over a large corpus of 60K samples randomly drawn from the SciDocsRR dataset.
Method
For LLMs to capture global information from long contexts, we first convert them into bidirectional encoders that can process information in both directions. Building on this, we introduce context compression tasks as pretext tasks designed to further enhance their capacity to model the coherent semantics of entire contexts. Specifically, we consider two context compression tasks, reconstruction and continuation tasks (Ge et al. 2024), as described in detail below.
Reconstruction Task Let $g_\phi$ denote the original LLM, and $f_\theta$ the targeted LLM-based (bidirectional) encoder adapted from $g_\phi$. The encoder $f_\theta$ is responsible for producing the embeddings of $k$ special (memory) tokens $\tilde{m}_1, \dots, \tilde{m}_k$ given a long context $n_1, n_2, \dots, n_a$. We say the set of compressed tokens $\{\tilde{m}_i\}_{i=1}^{k}$ effectively captures the context if $g_\phi$ can reconstruct the context from them. This process is formalized as follows:
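A hedged sketch of the formalization referenced as Equations 1 and 2, assuming the memory embeddings are read from the final hidden layer of $f_\theta$ and the frozen $g_\phi$ reconstructs the context autoregressively from them:

$$\tilde{m}_1, \dots, \tilde{m}_k = f_\theta(n_1, \dots, n_a, m_1, \dots, m_k) \qquad \text{(1, sketch)}$$

$$\mathcal{L}_{\mathrm{RC}} = -\sum_{t=1}^{a} \log p_{g_\phi}\big(n_t \mid \tilde{m}_1, \dots, \tilde{m}_k, n_{<t}\big) \qquad \text{(2, sketch)}$$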
where the memory embedding $\tilde{m}_i$ for memory token $m_i$ is taken from the final hidden layer, just before the logit layer, as shown in Equation 1. Equation 2 describes how the original context is reconstructed using the memory embeddings. During training, we sample a context $n_1, \dots, n_a$ from a text collection, and train the encoder $f_\theta$ with LoRA (Hu et al. 2022) while keeping the LLM $g_\phi$ frozen. Here, we use the negative log-likelihood computed by the frozen LLM $g_\phi$ to compare the original context with the reconstructed context, which is generated from the compressed tokens.
Continuation Task This variant is trained to predict future tokens given a compressed prefix. Formally,
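A hedged sketch of the continuation objective referenced as Equations 3 and 4, under the same assumptions as above (the prefix is compressed into memory embeddings, and the frozen $g_\phi$ scores the ground-truth continuation):

$$\tilde{m}_1, \dots, \tilde{m}_k = f_\theta(n_1, \dots, n_a, m_1, \dots, m_k) \qquad \text{(3, sketch)}$$

$$\mathcal{L}_{\mathrm{CT\text{-}NLL}} = -\sum_{t=a+1}^{j} \log p_{g_\phi}\big(n_t \mid \tilde{m}_1, \dots, \tilde{m}_k, n_{a+1}, \dots, n_{t-1}\big) \qquad \text{(4, sketch)}$$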
where $(n_1, \dots, n_a, n_{a+1}, \dots, n_j)$ denotes a sentence from the dataset, split into two segments: a prefix $(n_1, \dots, n_a)$ and a continuation $(n_{a+1}, \dots, n_j)$. Similar to the reconstruction task, we adapt the encoder using LoRA and optimize it with the negative log-likelihood (NLL) computed by the frozen LLM $g_\phi$, which measures the discrepancy between the generated continuation and the ground-truth continuation.
Our evaluation data are presented in Table 4, which lists the subset of the full MTEB benchmark used in our experiments. To balance resource expenditure with evaluation coverage and to accelerate the evaluation process, we selected 14 datasets, consistent with those included in LLM2Vec. To ensure that our ablation studies and analyses are not biased toward a particular category or task, this subset was constructed to include tasks from each category in proportions that approximately match those in the full MTEB benchmark. During evaluation, all models were provided with the same instructions as in (Wang et al. 2024a) and (BehnamGhader et al. 2024). The evaluation metrics followed the MTEB standard (Muennighoff et al. 2023): Accuracy for classification tasks, V-measure for clustering, NDCG@10 for retrieval, MAP for reranking, and Spearman correlation for STS. We followed the instructions of LLM2Vec, shown in Table 5.

Continuation Task with Knowledge Distillation (CTKD)
To further enhance encoder training, inspired by (Mu, Li, and Goodman 2023; Wingate, Shoeybi, and Sorensen 2022), we propose a third pretext task that combines the continuation objective with knowledge distillation. In this variant, the encoder $f_\theta$ is trained not only to generate the correct continuation, but also to match the next-token prediction distribution of the frozen LLM $g_\phi$ when conditioned on the compressed context versus the original context. This encourages distributional alignment between the two representations. Specifically, we build on the process of Equations 3 and 4, and exploit the Kullback-Leibler divergence loss (KL loss) for training. The loss computation is formalized as follows:
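A hedged sketch of the CTKD loss; the direction of the KL divergence (teacher conditioned on the original context, student conditioned on the compressed context) is our assumption:

$$\mathcal{L}_{\mathrm{CTKD}} = \sum_{t=a+1}^{j} \mathrm{KL}\Big(p_{g_\phi}\big(\cdot \mid n_1, \dots, n_{t-1}\big) \,\Big\|\, p_{g_\phi}\big(\cdot \mid \tilde{m}_1, \dots, \tilde{m}_k, n_{a+1}, \dots, n_{t-1}\big)\Big)$$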
Mean Pooling Once the LLM has been adapted into the encoder $f_\theta$ using one of the proposed pretext tasks, the resulting encoder can be employed to generate text embeddings for a wide range of downstream tasks. In particular, a sentence embedding is obtained by applying mean pooling over the memory embeddings, as follows:
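A hedged sketch of the pooling step referenced as Equation 6, assuming a simple unweighted average of the $k$ memory embeddings:

$$e = \frac{1}{k} \sum_{i=1}^{k} \tilde{m}_i \qquad \text{(6, sketch)}$$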
Experiment Setup
Implementation Details The unsupervised adaptation of LLMs using the proposed compression-based pretext tasks is conducted on 32,000 samples from the English WikiText-103 dataset (Merity et al. 2017). This choice of dataset for unsupervised adaptation of LLMs to text embedders is consistent with the settings used in LLM2Vec and Llama2Vec. This dataset is chosen because Wikipedia is included in the pretraining data of the LLM models; thus, the adaptation process does not introduce new factual knowledge but rather focuses on teaching the model how to compress contextual information into soft tokens and construct sequence-level representations. For LLM2Comp, we set the default number of memory tokens to 8, unless stated otherwise. We provide additional details of our training setup and hyperparameters in Appendix 2.1.
Evaluation Datasets We evaluate the method across 14 diverse tasks in six categories: clustering, retrieval, semantic textual similarity (STS), classification, pair classification, and reranking. The STS tasks directly assess whether an embedding captures sentence-level semantics. These tasks cover various domains, such as biomedical text, scientific literature, software and programming, finance and banking, and customer support. Dataset details are provided in Table 4 in Appendix 2.2.
Baseline Methods We evaluate several established methods for sentence representation:
· LT and WMP (Muennighoff 2022) are training-free methods that obtain embeddings from the last token (LT) or weighted mean pooling (WMP).
· EE (Echo Embedding) (Springer et al. 2024): Sentence representations are created by duplicating the sentence and applying mean pooling to the latter sentence's tokens.
· PromptEOL (Jiang et al. 2024): A prompt, 'means in one word:', is appended to the end of the sentence to enhance its representational capacity.
· MetaEOL (Lei et al. 2024) designs eight meta-task prompts with ChatGPT-4 to guide LLMs to form sentence representations from multiple perspectives.
· Llama2Vec (Li et al. 2024) and LLM2Vec (BehnamGhader et al. 2024) adapt LLMs for text
representation with different pretext tasks, including EBAE, EBAR, and MNTP. The models are obtained by training LLama2 with pretext tasks using the same dataset as our method (LLM2Comp). Note that here, we consider LLM2Vec and Llama2Vec trained with pretext tasks, without subsequent contrastive learning.
· LLM2Comp : We compare three alternatives of our method, including LLM2Comp RC that is based on the reconstruction task, LLM2Comp NLL that is trained with continuation task and NLL loss (CT-NLL), and LLM2Comp KL which is trained with continuation task and knowledge distillation (CTKD).
Table 1 presents the performance of training-free methods as well as unsupervised adaptation methods based on different pretext tasks. The reconstruction objective offers only marginal gains: LLM2Comp RC performs only slightly better than simple last-token pooling (LT). In contrast, continuation-based objectives lead to larger improvements, enabling LLM2Comp NLL to match the performance of Llama2Vec. However, the CTKD objective proves more suitable than the CT-NLL objective for text representation, leading to the superior performance of LLM2Comp KL over LLM2Comp NLL and all other baselines.
Stability of Different Compression Tasks In our experiments, we observe that LLM2Comp KL exhibits more stable training behavior compared to LLM2Comp NLL and LLM2Comp RC . As shown in Figure 2, LLM2Comp KL achieves a standard deviation of 1.37, which is significantly lower than 2.65 for LLM2Comp RC and 5.32 for LLM2Comp NLL . Among the three, LLM2Comp NLL is the most unstable, as its performance can reach 51.85, comparable to LLM2Comp KL , but in unfavorable cases, its performance drops to 42.95.
The Impact of Token Length Figure 3 shows the impact of the memory token length on the performance of LLM2Comp KL . When the number of tokens is in the range of [1, 8], performance remains stable across most tasks, except for retrieval, which is more sensitive to this hyperparameter. Specifically, when the number of tokens increases to 16, we observe a clear performance drop on the retrieval task. This observation contrasts with findings in the context compression literature (Qin et al. 2024; Ge et al. 2024; Wingate, Shoeybi, and Sorensen 2022), where using a larger number of tokens (on the order of 100) facilitates downstream generation.
Causal vs Bidirectional LLMs
Effectiveness vs Efficiency of Embedding Models
The dimensional collapse problem observed in models trained with compression objectives also manifests as a high degree of correlation among the memory tokens. To examine this, we compute the correlation matrix and average it over $N_s$ samples from SciDocsRR as follows:
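A hedged sketch of the averaged correlation matrix, assuming its entries are pairwise (cosine) similarities between the memory-token embeddings of each sample:

$$C = \frac{1}{N_s} \sum_{i=1}^{N_s} \mathrm{corr}\big(\tilde{M}_i\big)$$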
Here, $\tilde{M}_i$ denotes the matrix containing the compression tokens $\tilde{m}_{i1}, \dots, \tilde{m}_{ik}$ for the $i$-th sample. The correlation matrices of a typical LLM2Comp NLL model trained with $N_s$ = 32K and $N_s$ = 128K samples are presented in Figure 5. For comparison, Figure 6 shows the corresponding correlation matrices for a representative LLM2Comp KL model trained with the same sample sizes.
Figure 5 shows that tokens from LLM2Comp NLL are highly similar with 32K training samples. In contrast, LLM2Comp KL with 32K samples suffers less from this issue. This observation is consistent with the analysis of effective dimensionality in the previous section. When the number of training samples increases to 128K, the token correlation problem in LLM2Comp NLL is partially alleviated. This improvement is also reflected in its performance, which rises from 42.95 to 48.38 as the sample size increases from 32K to 128K. These results suggest that the degree of token similarity also has a significant impact on the downstream task.
To further verify the above hypothesis, we investigate how reducing correlated tokens impacts downstream performance. We define a token cluster as a set of tokens with high pairwise similarity. Initially, each token is treated as its own cluster. Clusters are then merged if the minimum similarity between any pair of tokens across clusters exceeds 0.9. From each resulting cluster, we randomly select one representative token, and refer to these selected tokens as effective tokens. Sentence embeddings are then computed by applying mean pooling to the embeddings of these effective tokens. Using this procedure, the performance of LLM2Comp NLL trained on 128,000 samples increases from 48.38 to 51.09, as shown in Figure 7. We also include the performance of LLM2Comp KL using a single compression token, as shown in Figure 7. The results indicate that the single-token setting performs unsatisfactorily, likely due to excessive information loss. This highlights the need for a more effective learning strategy that better balances information preservation and redundancy.
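The following is a minimal Python sketch of the effective-token selection procedure described above (complete-linkage merging at a 0.9 cosine-similarity threshold); the function name and the choice of the first token as the cluster representative are our own illustrative assumptions (the paper samples the representative randomly).

```python
import numpy as np

def effective_token_embedding(memory_tokens: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """memory_tokens: (k, d) memory-token embeddings for one sentence."""
    normed = memory_tokens / np.linalg.norm(memory_tokens, axis=1, keepdims=True)
    sim = normed @ normed.T                     # (k, k) pairwise cosine similarities
    clusters = [[i] for i in range(len(memory_tokens))]
    merged = True
    while merged:                               # keep merging until no pair of clusters qualifies
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # merge only if *every* cross-cluster pair is highly similar
                if min(sim[i, j] for i in clusters[a] for j in clusters[b]) > threshold:
                    clusters[a].extend(clusters.pop(b))
                    merged = True
                    break
            if merged:
                break
    representatives = [c[0] for c in clusters]           # one representative token per cluster
    return memory_tokens[representatives].mean(axis=0)   # mean-pool the effective tokens
```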
In dimensional collapse, a small number of effective dimensions are sufficient to represent the data, while the remaining dimensions can be expressed as linear combinations of these effective ones. The degree of dimensional collapse can thus be quantified using Singular Value Decomposition (SVD) (Jing et al. 2022), where dimensions associated with near-zero singular values are considered ineffective. To this end, we first construct an embedding matrix from $N_s = 60{,}000$ samples, with each sample represented by a single embedding vector as defined in Equation 6. We then compute the covariance matrix of these embeddings, apply SVD, and sort the resulting singular values in descending order. The sorted order corresponds to the principal component index, where the first component corresponds to the largest singular value. We plot the resulting singular values against their principal component indices for LLM2Comp KL and LLM2Comp NLL in Figure 4. Figure 4 shows that the curve corresponding to LLM2Comp NLL drops to zero much faster than that of LLM2Comp KL . The numbers of effective dimensions for LLM2Comp NLL and LLM2Comp KL are on the order of 10 and 100, respectively. This result indicates that LLM2Comp NLL suffers more severe dimensional collapse in this case. Intuitively, the KL divergence acts as a regularizer that better preserves information from less frequent tokens, thereby mitigating dimensional collapse. Nevertheless, the effective dimensionality of LLM2Comp KL remains small relative to the total dimension (4096), indicating that there is still room for improvement.
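A minimal sketch of this effective-dimensionality measurement; the relative threshold on the normalized singular values is an illustrative choice of ours, since the text does not specify a cutoff.

```python
import numpy as np

def effective_dimension(embeddings: np.ndarray, threshold: float = 1e-3) -> int:
    """embeddings: (N_s, d) matrix of pooled sentence embeddings (Equation 6)."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = np.cov(centered, rowvar=False)                      # (d, d) covariance matrix
    singular_values = np.linalg.svd(cov, compute_uv=False)    # sorted in descending order
    normalized = singular_values / singular_values.max()
    return int((normalized > threshold).sum())                # count non-negligible directions
```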
Training Data In the UCL stage, we utilize a Wikipedia sentence subset (128,000 samples) (Gao, Yao, and Chen 2021) identical to LLM2Vec's second-stage training data to ensure comparable experimental conditions. In the SCL stage, we utilize 1,024,000 samples from the public portions of datasets employed in LLM2Vec.
Compared Methods In the following, unless otherwise specified, LLM2Comp refers to the model built upon LLM2Comp KL as the foundation model, further enhanced through contrastive post-training. In the UCL stage, we compare LLM2Comp with LLM2Vec, which is also first adapted using a pretext task (MNTP) and then trained with UCL. To maintain appropriate data variance, we apply a dropout rate of 0.2 for creating positive samples, which helps prevent excessive augmentation that could distort the original dataset distribution (Jing et al. 2022).
In the SCL stage, we further train LLM2Comp initialized from the UCL stage. Our baselines include LLM2Vec, Llama2Vec, and RepLLaMA, all of which share the same LLM backbone and are trained with SCL. RepLLaMA (Ma et al. 2024) directly applies last-token pooling with SCL without using any pretext tasks. Llama2Vec is trained with SCL directly after training with pretext tasks, as reported in the original paper (Li et al. 2024). We also report the performance of several contemporary LLM-based encoders, including ULLME, BGE-ICL (in a zero-shot setting), and Instructor. These models adopt different base models and different training strategies, such as finetuning with in-context data in BGE-ICL (Li et al. 2025) and multi-task learning in ULLME (Man et al. 2024). Consequently, comparisons with these models should be viewed only as reference points, since differences in backbones and training recipes prevent direct scientific conclusions. Details of the compared methods are given in Appendix 2.1.
Table 2: Performance comparison of different models across different post-training stages. Here, the backbone models include: Llama-2 (7B), Mistral-v0.1 (7B), GTR-XL (1.5B), Phi-1.5 (1.3B), Mistral-v0.2 (7B), and Llama-3 (8B).
It is worth noting that several of these models, including Instructor, BGE-ICL, and LLM2Vec, have reported results on MTEB, which includes our evaluation datasets. For these models, we directly use the scores reported in their respective papers. For others, such as Llama2Vec and ULLME, which only provide partial MTEB results, we perform our evaluations using their published models.
Further Analysis
Contrastive Learning Alleviates Dimensional Collapse As shown in Figure 9, the model trained with LLM2Comp KL + UCL + SCL exhibits a higher effective dimension than the model trained with LLM2Comp KL + UCL. Furthermore, LLM2Comp KL + UCL achieves a higher effective dimension than LLM2Comp KL alone. The reduction of dimensional collapse also correlates with the enhancement of LLM2Comp over different training stages as shown in the previous section.
Convergence Analysis Models trained with the compression objective exhibit higher data efficiency, achieving peak performance with only 0.36M training samples, compared to the 1.66M samples required by LLM2Vec, as shown in Table 2. For a more detailed analysis, Figure 10 shows how the performance of LLM2Comp evolves over training steps in the SCL stage. We observe that our model converges rapidly, reaching optimal performance within 200 steps and remaining stable for most tasks. However, for retrieval tasks, performance begins to decline when contrastive learning continues beyond this point. We hypothesize that LLM2Comp KL already achieves good alignment (Wang and Isola 2020; Jing et al. 2022), and the subsequent contrastive learning stage rapidly balances effective dimensionality and alignment, leading to faster convergence. Beyond this point, the additional CL adopted in this paper, which uses InfoNCE with fixed negative sampling, becomes less effective. This phenomenon represents an interesting direction for future research.
Conclusion
Our study demonstrates the potential of context compression as a pretext task for the unsupervised adaptation of large language models (LLMs). We identify CTKD as the optimal training objective and determine the appropriate number of memory tokens needed for downstream representations. A deeper analysis shows that CTKD effectively mitigates dimensional collapse, resulting in stronger text representations than other pretext tasks. Building on this, additional contrastive learning yields a robust embedding model, LLM2Comp, which outperforms contemporary baselines (LLM2Vec and Llama2Vec) trained with similar recipes but requires much less training data. Furthermore, we provide insights into the effective dimensionality, task-aware performance, and sample efficiency, highlighting promising directions for future research.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. W2532049).
Continued Training Improves In-Domain Performance but Degrades Out-of-Domain Performance
Reproducibility Checklist
Appendix
Limitations and Future Work
Different attention architecture
The Computing Infrastructure
All training and evaluation experiments were conducted on NVIDIA H800 GPUs running Ubuntu 22.04 (x86_64) with a system memory capacity of 666 GB. The compression pretext task and unsupervised contrastive learning were performed on a single NVIDIA H800 GPU, whereas supervised contrastive learning was conducted across 8 NVIDIA H800 GPUs.
BehnamGhader, P.; Adlakha, V.; Mosbach, M.; Bahdanau, D.; Chapados, N.; and Reddy, S. 2024. LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders. CoRR , abs/2404.05961.
Cheng, X.; Wang, X.; Zhang, X.; Ge, T.; et al. 2024. xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token. In Annual Conference on Neural Information Processing Systems .
Chevalier, A.; Wettig, A.; Ajith, A.; and Chen, D. 2023. Adapting Language Models to Compress Contexts. In Conference on Empirical Methods in Natural Language Processing .
Choi, C.; Kim, J.; Lee, S.; et al. 2024. Linq-Embed-Mistral Technical Report. CoRR , abs/2412.03223.
Ge, T.; Hu, J.; Wang, L.; Wang, X.; Chen, S.; and Wei, F. 2024. In-context Autoencoder for Context Compression in a Large Language Model. In The Twelfth International Conference on Learning Representations .
He, W.; Liu, K.; Liu, J.; et al. 2018. DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications. In Proceedings of the Workshop on Machine Reading for Question Answering .
Joshi, M.; Choi, E.; Weld, D.; and Zettlemoyer, L. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Barzilay, R.; and Kan, M.-Y., eds., Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , 1601-1611. Vancouver, Canada: Association for Computational Linguistics.
Lei, Y.; Wu, D.; Zhou, T.; Shen, T.; Cao, Y.; Tao, C.; and Yates, A. 2024. Meta-Task Prompting Elicits Embeddings from Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics .
Ma, X.; Wang, L.; Yang, N.; Wei, F.; and Lin, J. 2024. FineTuning LLaMA for Multi-Stage Text Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval .
Muennighoff, N. 2022. SGPT: GPT Sentence Embeddings for Semantic Search. CoRR , abs/2202.08904.
Muennighoff, N.; Tazi, N.; Magne, L.; and Reimers, N. 2023. MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics . Association for Computational Linguistics.
Nguyen, T.; Rosenberg, M.; Song, X.; Gao, J.; et al. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems .
Ni, J.; Qu, C.; Lu, J.; Dai, Z.; Ábrego, G. H.; Ma, J.; Zhao, V. Y.; Luan, Y.; Hall, K. B.; Chang, M.; and Yang, Y. 2022. Large Dual Encoders Are Generalizable Retrievers. In Goldberg, Y.; Kozareva, Z.; and Zhang, Y., eds., Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, 9844-9855. Association for Computational Linguistics.
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. , 21: 140:1-140:67.
Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Su, J.; Duh, K.; and Carreras, X., eds., Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , 2383-2392. Austin, Texas: Association for Computational Linguistics.
Shwartz-Ziv, R.; and LeCun, Y. 2024. To Compress or Not to Compress - Self-Supervised Learning and Information Theory: A Review. Entropy , 26(3): 252.
Wang, T.; and Isola, P. 2020. Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. In Proceedings of the 37th International Conference on Machine Learning .
Wingate, D.; Shoeybi, M.; and Sorensen, T. 2022. Prompt Compression and Contrastive Conditioning for Controllability and Toxicity Reduction in Language Models. In Findings of the Association for Computational Linguistics: EMNLP .
Xie, X.; Dong, Q.; Wang, B.; Lv, F.; Yao, T.; Gan, W.; Wu, Z.; Li, X.; Li, H.; Liu, Y.; and Ma, J. 2023. T2Ranking: A Large-scale Chinese Benchmark for Passage Ranking. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval.
Zhang, X.; Thakur, N.; Ogundepo, O.; Kamalloo, E.; AlfonsoHermelo, D.; Li, X.; Liu, Q.; Rezagholizadeh, M.; and Lin, J. 2023. MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages. Transactions of the Association for Computational Linguistics , 11: 1114-1131.
Following SimCSE (Gao, Yao, and Chen 2021), we construct positive samples of a given sentence through dropout, and treat the other in-batch sentences as negatives. Formally, the objective is to train the encoder $f_\theta$ by minimizing the InfoNCE loss:
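A hedged sketch of the referenced InfoNCE objective (Equation 8), in the standard SimCSE form where $z_i^{+}$ is the dropout-augmented positive of $z_i$:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\!\big(\mathrm{sim}(z_i, z_i^{+}) / \tau\big)}{\sum_{j=1}^{B} \exp\!\big(\mathrm{sim}(z_i, z_j^{+}) / \tau\big)}$$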
where $z_i$ and $z_j$ are the representations of sentences $x_i$ and $x_j$, obtained by mean pooling the memory token embeddings produced by $f_\theta$ (see Equations 1 and 6). Additionally, $\mathrm{sim}(\cdot,\cdot)$ denotes the cosine similarity, $\tau$ is the temperature hyperparameter, and $B$ is the batch size. For UCL, in-batch negative sampling is exploited, i.e., the positive embedding of one sample in the batch is treated as a negative for the other samples in the batch.
Following UCL, we perform SCL on supervised data, where relevant pairs are manually annotated. Negative samples are chosen via in-batch negative sampling and hard-negative sampling; the hard negatives are provided with the E5 dataset and were pre-mined with a cross-encoder model (Wang et al. 2024b). The optimization objective is similar to that of UCL (Equation 8), except that the positive samples are manually labeled.
Figure 8 shows the performance of LLM2Comp across different training stages and tasks. The results verify the contribution of UCL and SCL to performance gains beyond pre-training with pretext tasks. In addition, it is observable that the CL stages play a more important role in improving retrieval and clustering tasks.
Compared to other baselines, the experimental results in Table 2 show that LLM2Comp achieves the best performance on most datasets as well as on average. In the UCL stage, LLM2Comp surpasses LLM2Vec, confirming that the benefits of our compression-based pretext task carry over to the subsequent training. A similar pattern is observed in the SCL stage after UCL, where LLM2Comp is superior to LLM2Vec, and other contemporary models. Notably, LLM2Comp achieves this using a much smaller amount of supervised data compared to LLM2Vec as shown in Table 2. This suggests that compression-based pretraining provides a stronger foundation, enabling more efficient post-training. Since training cost scales with the amount of supervised data, this also highlights the practical value of LLM2Comp.
BGE-ICL (Li et al. 2025): An embedding model that leverages the in-context learning (ICL) capabilities of large language models to enhance embedding quality.
We examine the impact of adapting causal attention to bidirectional attention on overall performance. To this end, we train LLM2Comp KL using causal LLMs and, for comparison, with the same training recipe using the bidirectional architecture described in Section 3. As shown in Figure 11, the bidirectional variant outperforms its causal counterpart on average, consistent with prior findings (Muennighoff et al. 2024). However, the advantage of bidirectional attention is most pronounced in retrieval tasks, whereas the causal variant is more beneficial for STS and classification tasks. This observation highlights the importance of selecting the appropriate architecture based on the target downstream application.
We sample 10
In this figure, red markers represent training-free methods, while the color gradient (from light to dark) and marker size (from small to large) indicate the increasing amount of training data used by the models. Notably, except for Echo Embedding, the time overhead for all other methods does not differ substantially. This suggests that additional training data and the use of a LoRA architecture do not substantially affect the inference time for most models. In contrast, our method achieves the best performance while requiring fewer training samples than both LLM2Vec and Llama2Vec. This demonstrates that our approach strikes a better balance between performance and data efficiency.
Table: S3.T1: Performance comparison of different models across three stages of training: self-supervised compression pretraining, evaluated on various tasks from the MTEB benchmark. Each model is assessed on a range of datasets, with results showing the impact of different training approaches on task performance.
| Model | Training Samples | Backbone | Bior. | Medr. | Twen. | SciF. | NFCo. | Argu. | STS17 | SICK-R | STSB. | Bank. | Emot. | Spri. | Stac. | SciD. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| #Test samples | – | – | 75000 | 37500 | 59545 | 5483 | 3956 | 10080 | 5692 | 19854 | 2758 | 3696 | 2096 | 8931 | 82798 | 89131 | – |
| Training-free Methods & Models trained with Pretext Tasks | | | | | | | | | | | | | | | | | |
| LT. | 0 | Llama-2 | 15.99 | 17.42 | 15.96 | 2.17 | 1.31 | 14.24 | 57.8 | 55.63 | 45.72 | 68.65 | 29.85 | 47.01 | 32.07 | 58.83 | 33.05 |
| WMP. | 0 | Llama-2 | 19.73 | 19.47 | 14.54 | 38.89 | 6.13 | 33.59 | 63.91 | 57.52 | 58.01 | 66.42 | 30.97 | 58.48 | 37.74 | 61.05 | 40.46 |
| EE. | 0 | Llama-2 | 22.94 | 23.15 | 25.74 | 25.61 | 9.97 | 25.24 | 80.51 | 70.18 | 71.94 | 81.79 | 45.00 | 68.48 | 40.79 | 60.15 | 46.54 |
| PrompEOL | 0 | Llama-2 | 22.49 | 21.14 | 31.47 | 27.16 | 13.59 | 11.65 | 79.67 | 73.82 | 75.32 | 76.37 | 47.13 | 26.08 | 37.65 | 66.22 | 43.55 |
| MetaEOL | 0 | Llama-2 | 30.95 | 26.56 | 40.03 | 40.59 | 16.41 | 21.75 | 82.29 | 76.88 | 76.87 | 82.26 | 51.05 | 48.24 | 39.87 | 77.91 | 50.83 |
| Llama2Vec | 32k | Llama-2 | 22.42 | 22.25 | 29.84 | 16.50 | 5.22 | 32.16 | 75.72 | 58.00 | 64.18 | 75.83 | 38.64 | 84.47 | 28.75 | 55.51 | 43.54 |
| LLM2Vec | 32k | Llama-2 | 26.44 | 25.14 | 25.76 | 44.51 | 4.34 | 31.02 | 73.45 | 67.65 | 65.82 | 79.77 | 39.28 | 70.07 | 41.48 | 61.48 | 46.87 |
| LLM2Comp-RC | 32k | Llama-2 | 6.65 | 13.56 | 8.94 | 17.41 | 1.56 | 14.58 | 64.66 | 54.37 | 41.20 | 73.95 | 36.06 | 76.89 | 36.50 | 54.72 | 35.79 |
| LLM2Comp-NLL | 32k | Llama-2 | 30.24 | 27.34 | 37.25 | 11.93 | 3.55 | 24.69 | 70.65 | 64.57 | 63.05 | 80.12 | 39.40 | 72.02 | 43.36 | 78.94 | 46.22 |
| LLM2Comp-KL | 32k | Llama-2 | 27.79 | 26.00 | 31.19 | 42.57 | 9.24 | 30.92 | 81.56 | 68.28 | 70.87 | 84.33 | 46.85 | 88.81 | 48.20 | 78.18 | 52.49 |
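The sketch below shows how a decoder-only backbone with mean pooling could be scored on a couple of the MTEB tasks listed above. It is a minimal illustration, not the released LLM2Comp evaluation code: the `mteb` package's `MTEB(tasks=...).run(model)` interface, the wrapper class, the pooling choice, and the Llama-2 checkpoint name are all assumptions; any object exposing `encode(sentences, **kwargs) -> np.ndarray` can be plugged in.

```python
# Minimal sketch (not the authors' released code): score a mean-pooled
# decoder-only LLM on a subset of MTEB tasks. Package interfaces and the
# checkpoint name are assumptions for illustration only.
import numpy as np
import torch
from mteb import MTEB
from transformers import AutoModel, AutoTokenizer


class MeanPooledLLMEncoder:
    """Wraps a causal LM so MTEB can call .encode() on raw sentences."""

    def __init__(self, name: str = "meta-llama/Llama-2-7b-hf", device: str = "cuda"):
        self.tokenizer = AutoTokenizer.from_pretrained(name)
        self.tokenizer.pad_token = self.tokenizer.eos_token  # Llama-2 has no pad token
        self.model = AutoModel.from_pretrained(name, torch_dtype=torch.float16).to(device)
        self.model.eval()
        self.device = device

    @torch.no_grad()
    def encode(self, sentences, batch_size: int = 16, **kwargs) -> np.ndarray:
        embeddings = []
        for start in range(0, len(sentences), batch_size):
            batch = self.tokenizer(
                list(sentences[start:start + batch_size]),
                padding=True, truncation=True, max_length=512, return_tensors="pt",
            ).to(self.device)
            hidden = self.model(**batch).last_hidden_state         # (B, T, H)
            mask = batch["attention_mask"].unsqueeze(-1)            # (B, T, 1)
            pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean over real tokens
            embeddings.append(pooled.float().cpu().numpy())
        return np.concatenate(embeddings, axis=0)


if __name__ == "__main__":
    encoder = MeanPooledLLMEncoder()
    evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
    evaluation.run(encoder, output_folder="results/llama2_mean_pool")
```

Other pooling strategies (e.g., last-token pooling) can be compared by swapping the single pooling line in `encode`.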
Table: A1.T3: Instructions used for the E5 fine-tuning datasets; a short sketch of how they are attached to queries follows the table.
| Dataset | Instruction(s) |
|---|---|
| DuReader | Given a Chinese search query, retrieve web passages that answer the question |
| ELI5 | Provided a user question, retrieve the highest voted answers on Reddit ELI5 forum |
| FEVER | Given a claim, retrieve documents that support or refute the claim |
| HotpotQA | Given a multi-hop question, retrieve documents that can help answer the question |
| MIRACL | Given a question, retrieve Wikipedia passages that answer the question |
| MrTyDi | Given a question, retrieve Wikipedia passages that answer the question |
| MSMARCO Document | Given a web search query, retrieve relevant documents that answer the query |
| MSMARCO Passage | Given a web search query, retrieve relevant passages that answer the query |
| NLI | Given a premise, retrieve a hypothesis that is entailed by the premise |
| | Retrieve semantically similar text |
| NQ | Given a question, retrieve Wikipedia passages that answer the question |
| QuoraDuplicates | Given a question, retrieve questions that are semantically equivalent to the given question |
| | Find questions that have the same meaning as the input question |
| SQuAD | Retrieve Wikipedia passages that answer the question |
| T2Ranking | Given a Chinese search query, retrieve web passages that answer the question |
| TriviaQA | Retrieve Wikipedia passages that answer the question |
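For reference, the snippet below illustrates how such per-dataset instructions are typically prepended to queries before encoding, following the common "Instruct: ... \nQuery: ..." template of instruction-tuned E5-style embedders. The exact template used for LLM2Comp fine-tuning is an assumption here; documents are usually encoded without any instruction.

```python
# Illustrative sketch only: attach the Table A1.T3 instruction for a dataset
# to a raw query. The "Instruct:/Query:" template is an assumed convention,
# not necessarily the exact format used by the authors.
E5_INSTRUCTIONS = {
    "FEVER": "Given a claim, retrieve documents that support or refute the claim",
    "HotpotQA": "Given a multi-hop question, retrieve documents that can help answer the question",
    "NLI": "Given a premise, retrieve a hypothesis that is entailed by the premise",
}


def build_query(dataset: str, query: str) -> str:
    """Prepend the dataset-specific instruction to the raw query text."""
    return f"Instruct: {E5_INSTRUCTIONS[dataset]}\nQuery: {query}"


print(build_query("FEVER", "The Eiffel Tower is located in Berlin."))
```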
Table: A2.T4: Statistics of evaluation datasets
| Category | Dataset | #Samples |
|---|---|---|
| Clustering (3) | BiorxivClusteringS2S | 75000 |
| | MedrxivClusteringS2S | 37500 |
| | TwentyNewsgroupsClustering | 59545 |
| Retrieval (3) | SciFact | 5483 |
| | NFCorpus | 3956 |
| | ArguAna | 10080 |
| STS (3) | STS17 | 5692 |
| | SICK-R | 19854 |
| | STSBenchmark | 2758 |
| (Pair) Classification (3) | Banking77Classification | 3696 |
| | EmotionClassification | 2096 |
| | SprintDuplicateQuestions | 8931 |
| Reranking (2) | StackOverflowDupQuestions | 82798 |
| | SciDocsRR | 89131 |
| Overall | 14 datasets | 406520 |
Table: A2.T5: Instructions used for our evaluation datasets.
| Task Name | Instruction |
|---|---|
| ArguAna | Given a claim, find documents that refute the claim |
| Banking77Classification | Given an online banking query, find the corresponding intents |
| BiorxivClusteringS2S | Identify the main category of Biorxiv papers based on the titles |
| EmotionClassification | Classify the emotion expressed in the given Twitter message into one of the six emotions: anger, fear, joy, love, sadness, and surprise |
| MedrxivClusteringS2S | Identify the main category of Medrxiv papers based on the titles |
| NFCorpus | Given a question, retrieve relevant documents that best answer the question |
| SciDocsRR | Given a title of a scientific paper, retrieve the titles of other relevant papers |
| SciFact | Given a scientific claim, retrieve documents that support or refute the claim |
| StackOverflowDupQuestions | Retrieve duplicate questions from StackOverflow forum |
| SICK-R | Retrieve semantically similar text. |
| SprintDuplicateQuestions | Retrieve duplicate questions from Sprint forum |
| STS17 | Retrieve semantically similar text. |
| STSBenchmark | Retrieve semantically similar text. |
| TwentyNewsgroupsClustering | Identify the topic or theme of the given news articles |



References
[1] Tomáš Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. (2013). Efficient Estimation of Word Representations in Vector Space. 1st International Conference on Learning Representations.
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
[3] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res..
[4] OpenAI. (2023). GPT-4 Technical Report.
[5] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, and others. (2023). PaLM: Scaling Language Modeling with Pathways. J. Mach. Learn. Res.
[6] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, and others. (2023). LLaMA: Open and Efficient Foundation Language Models. CoRR. doi:10.48550/ARXIV.2302.13971.
[7] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, and others. (2023). Mistral 7B. CoRR. doi:10.48550/ARXIV.2310.06825.
[8] Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, Douwe Kiela. (2024). Generative Representational Instruction Tuning. CoRR. doi:10.48550/ARXIV.2402.09906.
[9] Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, Siva Reddy. (2024). LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders. CoRR. doi:10.48550/ARXIV.2404.05961.
[10] Niklas Muennighoff. (2022). SGPT: GPT Sentence Embeddings for Semantic Search. CoRR.
[11] Jacob Mitchell Springer, Suhas Kotha, Daniel Fried, Graham Neubig, Aditi Raghunathan. (2024). Repetition Improves Language Model Embeddings. CoRR. doi:10.48550/ARXIV.2402.15449.
[12] Yibin Lei, Di Wu, Tianyi Zhou, Tao Shen, Yu Cao, Chongyang Tao, Andrew Yates. (2024). Meta-Task Prompting Elicits Embeddings from Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics.
[13] Ting Jiang, Shaohan Huang, Zhongzhi Luan, Deqing Wang, Fuzhen Zhuang. (2024). Scaling Sentence Embeddings with Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2024.
[14] Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin, Guido Zuccon. (2024). PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP).
[15] Bowen Zhang, Kehua Chang, Chunping Li. (2024). Simple Techniques for Enhancing Sentence Embeddings in Generative Language Models. Advanced Intelligent Computing Technology and Applications - 20th International Conference, ICIC 2024. doi:10.1007/978-981-97-5669-8_5.
[16] Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu. (2024). M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. Findings of the Association for Computational Linguistics: ACL 2024. doi:10.18653/V1/2024.FINDINGS-ACL.137.
[17] Tao Ge, Jing Hu, Lei Wang, Xun Wang, Si-Qing Chen, Furu Wei. (2024). In-context Autoencoder for Context Compression in a Large Language Model. The Twelfth International Conference on Learning Representations.
[18] Alexis Chevalier, Alexander Wettig, Anirudh Ajith, Danqi Chen. (2023). Adapting Language Models to Compress Contexts. Conference on Empirical Methods in Natural Language Processing.
[19] Guanghui Qin, Corby Rosset, Ethan C. Chau, Nikhil Rao, Benjamin Van Durme. (2024). Dodo: Dynamic Contextual Compression for Decoder-only LMs. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics.
[20] Ninglu Shao, Shitao Xiao, Zheng Liu, Peitian Zhang. (2024). Flexibly Scaling Large Language Models Contexts Through Extensible Tokenization. CoRR. doi:10.48550/ARXIV.2401.07793.
[21] Jesse Mu, Xiang Li, Noah D. Goodman. (2023). Learning to Compress Prompts with Gist Tokens. Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems.
[22] Tongzhou Wang, Phillip Isola. (2020). Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. Proceedings of the 37th International Conference on Machine Learning.
[23] Li Jing, Pascal Vincent, Yann LeCun, Yuandong Tian. (2022). Understanding Dimensional Collapse in Contrastive Self-supervised Learning. The Tenth International Conference on Learning Representations.
[24] Kaihang Pan, Zhaoyu Fan, Juncheng Li, Qifan Yu, Hao Fei, Siliang Tang, Richang Hong, Hanwang Zhang, Qianru Sun. (2024). Towards Unified Multimodal Editing with Enhanced Knowledge Collaboration. CoRR. doi:10.48550/ARXIV.2409.19872.
[25] Tianyu Gao, Xingcheng Yao, Danqi Chen. (2021). SimCSE: Simple Contrastive Learning of Sentence Embeddings. Conference on Empirical Methods in Natural Language Processing.
[26] Aäron van den Oord, Yazhe Li, Oriol Vinyals. (2018). Representation Learning with Contrastive Predictive Coding. CoRR.
[27] Wilhelm Ågren. (2022). The NT-Xent loss upper bound. CoRR. doi:10.48550/ARXIV.2205.03169.
[28] Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei. (2024). Improving Text Embeddings with Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[29] Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, Michael Auli. (2019). ELI5: Long Form Question Answering. Conference of the Association for Computational Linguistics.
[30] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, Christopher D. Manning. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/V1/D18-1259.
[31] Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, Jimmy Lin. (2023). MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages. Trans. Assoc. Comput. Linguistics.
[32] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, and others. (2016). MS MARCO: A Human Generated Machine Reading Comprehension Dataset. Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches, co-located with the 30th Annual Conference on Neural Information Processing Systems.
[33] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, and others. (2019). Natural Questions: A Benchmark for Question Answering Research. Trans. Assoc. Comput. Linguistics. doi:10.1162/TACL_A_00276.
[34] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. Conference on Empirical Methods in Natural Language Processing.
[35] Mandar Joshi, Eunsol Choi, Daniel S. Weld, Luke Zettlemoyer. (2017). TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.
[36] Yifu Qiu, Hongyu Li, Yingqi Qu, Ying Chen, Qiaoqiao She, Jing Liu, Hua Wu, Haifeng Wang. (2022). DuReader-Retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). doi:10.18653/V1/2022.EMNLP-MAIN.357.
[37] Ravid Shwartz-Ziv, Yann LeCun. (2024). To Compress or Not to Compress - Self-Supervised Learning and Information Theory: A Review. Entropy. doi:10.3390/E26030252.
[38] Suriya Gunasekar, Blake E. Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, Nati Srebro. (2017). Implicit Regularization in Matrix Factorization. Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA.
[39] Sanjeev Arora, Nadav Cohen, Wei Hu, Yuping Luo. (2019). Implicit Regularization in Deep Matrix Factorization. Annual Conference on Neural Information Processing Systems.
[40] David G. T. Barrett, Benoit Dherin. (2020). Implicit Gradient Regularization. CoRR.
[41] Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher. (2017). Pointer Sentinel Mixture Models. 5th International Conference on Learning Representations.
[42] Yang Xu, Yunlong Feng, Honglin Mu, Yutai Hou, others. (2024). Concise and Precise Context Compression for Tool-Using Language Models. Findings of the Association for Computational Linguistics.
[43] Euna Jung, Jaeill Kim, Jungmin Ko, Jinwoo Park, Wonjong Rhee. (2024). Unveiling Key Aspects of Fine-Tuning in Sentence Embeddings: A Representation Rank Analysis. IEEE Access. doi:10.1109/ACCESS.2024.3485705.
[44] Olivier Roy, Martin Vetterli. (2007). The effective rank: A measure of effective dimensionality. 15th European Signal Processing Conference (EUSIPCO).
[45] Jianlin Su, Jiarun Cao, Weijie Liu, Yangyiwen Ou. (2021). Whitening Sentence Representations for Better Semantics and Faster Retrieval. CoRR.
[46] Xiangfeng Wang, Zaiyi Chen, Tong Xu, Zheyong Xie, Yongyi He, Enhong Chen. (2024). In-Context Former: Lightning-fast Compressing Context for Large Language Model. Findings of the Association for Computational Linguistics: EMNLP 2024.
[47] David Wingate, Mohammad Shoeybi, Taylor Sorensen. (2022). Prompt Compression and Contrastive Conditioning for Controllability and Toxicity Reduction in Language Models. Findings of the Association for Computational Linguistics: EMNLP 2022.
[48] Xin Cheng, Xun Wang, Xingxing Zhang, Tao Ge, others. (2024). xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token. Annual Conference on Neural Information Processing Systems.
[49] Georgiana Dinu, Corey D. Barrett, Yi Xiang, Miguel Romero Calvo, Anna Currey, Xing Niu. (2025). Effective post-training embedding compression via temperature control in contrastive training. The Thirteenth International Conference on Learning Representations (ICLR).
[50] Chaofan Li, Zheng Liu, Shitao Xiao, Yingxia Shao, Defu Lian. (2024). Llama2Vec: Unsupervised Adaptation of Large Language Models for Dense Retrieval. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics.
[51] Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, others. (2023). One Embedder, Any Task: Instruction-Finetuned Text Embeddings. Findings of the Association for Computational Linguistics.
[52] Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, and others. (2022). Large Dual Encoders Are Generalizable Retrievers. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). doi:10.18653/V1/2022.EMNLP-MAIN.669.
[53] Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, Daniel S. Weld. (2020). S2ORC: The Semantic Scholar Open Research Corpus. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL). doi:10.18653/V1/2020.ACL-MAIN.447.
[54] Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei. (2024). Multilingual E5 Text Embeddings: A Technical Report. CoRR.
[55] Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Thien Huu Nguyen. (2024). ULLME: A Unified Framework for Large Language Model Embeddings with Generation-Augmented Learning. CoRR. doi:10.48550/ARXIV.2408.03402.
[56] Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, Jimmy Lin. (2024). Fine-Tuning LLaMA for Multi-Stage Text Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[57] Chaofan Li, Minghao Qin, Shitao Xiao, Jianlyu Chen, others. (2025). Making Text Embedders Few-Shot Learners. The Thirteenth International Conference on Learning Representations.
[58] Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, Semih Yavuz. (2024). SFR-Embedding-Mistral: Enhance Text Retrieval with Transfer Learning.
[59] Chanyeol Choi, Junseong Kim, Seolhwa Lee, others. (2024). Linq-Embed-Mistral Technical Report. CoRR. doi:10.48550/ARXIV.2412.03223.
[60] Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping. (2025). NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. The Thirteenth International Conference on Learning Representations (ICLR).
[61] Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang. (2023). Towards General Text Embeddings with Multi-stage Contrastive Learning. CoRR. doi:10.48550/ARXIV.2308.03281.
[62] Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, Jingren Zhou. (2025). Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. CoRR. doi:10.48550/ARXIV.2506.05176.
[63] Xianghong Fang, Jian Li, Qiang Sun, Benyou Wang. (2024). Rethinking the Uniformity Metric in Self-Supervised Learning. The Twelfth International Conference on Learning Representations (ICLR).
[64] Zhijie Nie, Zhangchi Feng, Mingxin Li, Cunwang Zhang, Yanzhao Zhang, Dingkun Long, Richong Zhang. (2024). When Text Embedding Meets Large Language Model: A Comprehensive Survey. CoRR. doi:10.48550/ARXIV.2412.09165.
[65] Edward J. Hu, Yelong Shen, Phillip Wallis, others. (2022). LoRA: Low-Rank Adaptation of Large Language Models. The Tenth International Conference on Learning Representations.
[66] Feng Wang, Huaping Liu. (2021). Understanding the Behaviour of Contrastive Loss. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/CVPR46437.2021.00252.
[67] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, and others. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
[68] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Arpit Mittal. (2018). FEVER: a Large-scale Dataset for Fact Extraction and VERification. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
[69] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/D16-1264.
[70] Mandar Joshi, Eunsol Choi, Daniel Weld, Luke Zettlemoyer. (2017). TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/P17-1147.
[71] Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, Jimmy Lin. (2023). MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages. Transactions of the Association for Computational Linguistics. doi:10.1162/tacl_a_00595.
[72] Vladimir Karpukhin, Barlas Oguz, Sewon Min, and others. (2020). Dense Passage Retrieval for Open-Domain Question Answering. Conference on Empirical Methods in Natural Language Processing.
[73] Xinyu Zhang, Xueguang Ma, Peng Shi, Jimmy Lin. (2021). Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval. Proceedings of the 1st Workshop on Multilingual Representation Learning.
[74] Xiaohui Xie, Qian Dong, Bingning Wang, Feiyang Lv, Ting Yao, Weinan Gan, Zhijing Wu, Xiangsheng Li, Haitao Li, Yiqun Liu, Jin Ma. (2023). T2Ranking: A Large-scale Chinese Benchmark for Passage Ranking. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[75] Wei He, Kai Liu, Jing Liu, and others. (2018). DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications. Proceedings of the Workshop on Machine Reading for Question Answering.
[76] Niklas Muennighoff, Nouamane Tazi, Loïc Magne, Nils Reimers. (2023). MTEB: Massive Text Embedding Benchmark. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics.
[77] Andrew Rosenberg, Julia Hirschberg. (2007). V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
[78] Kalervo Järvelin, Jaana Kekäläinen. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems. doi:10.1145/582415.582418.
[79] Ricardo A. Baeza-Yates, Berthier Ribeiro-Neto. (1999). Modern Information Retrieval.
[80] Burges, Chris, Shaked, Tal, Renshaw, Erin, Lazier, Ari, Deeds, Matt, Hamilton, Nicole, Hullender, Greg. (2005). Learning to rank using gradient descent. Proceedings of the 22nd international conference on Machine learning.
[81] Chang Su, Dengliang Shi, Siyuan Huang, and others. (2025). Training LLMs to be Better Text Embedders through Bidirectional Reconstruction. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
[82] Jingcheng Deng, Zhongtao Jiang, Liang Pang, and others. (2025). Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.