
SkyLadder: Better and Faster Pretraining via Context Window Scheduling

Tongyao Zhu¹ ², Qian Liu² ∗, Haonan Wang¹, Shiqi Chen³, Xiangming Gu¹, Tianyu Pang², Min-Yen Kan¹

¹National University of Singapore ²Sea AI Lab ³City University of Hong Kong
tongyao.zhu@u.nus.edu; liuqian.sea@gmail.com

Abstract

Recent advancements in LLM pretraining have featured ever-expanding context windows to process longer sequences. However, our controlled study reveals that models pretrained with shorter context windows consistently outperform their long-context counterparts under a fixed token budget. This finding motivates us to explore an optimal context window scheduling strategy to better balance long-context capability with pretraining efficiency. To this end, we propose SkyLadder, a simple yet effective approach that implements a short-to-long context window transition. SkyLadder preserves strong standard benchmark performance while matching or exceeding baseline results on long-context tasks. Through extensive experiments, we pretrain 1B-parameter models (up to 32K context) and 3B-parameter models (8K context) on 100B tokens, demonstrating that SkyLadder yields consistent gains of up to 3.7% on common benchmarks, while achieving up to 22% faster training speeds compared to baselines. Project code is at https://github.com/sail-sg/SkyLadder.

Introduction

The evolution of language models has been marked by a consistent expansion in context window sizes (Figure 1 left). While early models like GPT [39] and BERT [8] were limited to context windows of 512 tokens, subsequent models have pushed significantly beyond these bounds. GPT-2 [40] doubled this capacity to 1024 tokens, and with Large Language Models (LLMs) exceeding 1B parameters, this trend has continued: Llama [51] has a 2048-token window, followed by Llama-2 [52] (4096 tokens), and Llama-3 [13] (8192 tokens). The need for models to handle longer sequences during inference has fueled the rush to expand the context window. As models pretrained with longer context windows reduce document truncation and preserve coherence [9], there is a widespread belief that such models should perform comparably to, or even surpass, their shorter-context counterparts.

We question the common belief that larger context windows actually improve performance. Close inspection of previous work reveals that there has yet to be a fair experimental setup for comparing models across different context windows under a fixed token budget. Using tightly controlled experiments, we test how changing only the context window size during pretraining impacts model performance. As shown in Figure 1 (right), our results indicate that models pretrained with shorter contexts consistently outperform long-context models when assessed by their average performance across popular benchmarks. In addition, we verify that the performance gap is not eliminated by advanced document packing strategies [13, 9, 44].

For a model to ultimately process long sequences, it still needs to be exposed to long sequences during pretraining. However, given the finding that shorter context windows enhance performance on

∗ Corresponding author.


Figure 1: Left: Pretraining context window of LLMs grows over the recent years. Right: Average performance (in %) across nine downstream tasks for 1B-parameter models with different pretrained context window sizes (color-coded). Increasing the context window degrades the overall performance.


downstream tasks, we face a trade-off between long-context capability and pretraining effectiveness. We propose SkyLadder, a simple yet effective context window scheduling strategy designed to balance both objectives. SkyLadder does this by progressively expanding the size of the context window during pretraining, beginning pretraining with a minimal short context window (e.g., 8 tokens) and progressively expanding it to the long target context window (e.g., 32,768 tokens).

Empirical results on 1B-parameter models (up to 32K context window) and 3B-parameter models (up to 8K context window) trained on 100B tokens demonstrate that SkyLadder outperforms naive long-context pretraining baselines on both short- and long-context evaluation tasks. For example, models trained with SkyLadder demonstrate significantly higher accuracy on standard benchmarks (e.g., HellaSwag) and reading comprehension tasks (e.g., HotpotQA), while still maintaining competitive performance on long-context evaluations like RULER. We further investigate the mechanisms behind the superior performance by observing the training dynamics, and discover that SkyLadder exhibits more concentrated and effective attention patterns.

Overall, we suggest that the length of the context window is an important dimension in pretraining and should be scheduled over the course of training. We recommend a progressive approach that begins with a small context of 8 tokens and gradually increases according to a linear function of training steps. Given a target context window (e.g., 32K), we suggest that allocating approximately 60% of the total training tokens to this expansion phase leads to stronger downstream performance compared to baselines. This scheduling strategy optimally enhances both training efficiency and model capability, offering a practical recipe for improving pretraining in language models.
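As a back-of-envelope sketch of this recipe, the expansion rate α can be chosen so that the window reaches its target after roughly 60% of training. The function name and numbers below are illustrative, not the paper's exact configuration:

```python
# Choose the per-step expansion rate alpha so that the context window
# reaches w_e after `frac` of the total training steps (illustrative sketch).
def expansion_rate(w_s: int, w_e: int, total_steps: int, frac: float = 0.6) -> float:
    expansion_steps = frac * total_steps
    return (w_e - w_s) / expansion_steps

# e.g., growing from 8 to 32,768 tokens over 60% of a 100k-step run
alpha = expansion_rate(w_s=8, w_e=32768, total_steps=100_000)
# the window then reaches 32,768 around step 60,000 and stays there
```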

Context Window Scheduling. Early work explored gradually increasing the context window in smaller models like BERT and GPT-2 to improve training stability and efficiency [35, 28, 21]. Notably, Li et al. [28] proposed length warmup for more stable training but did not show clear performance gains, while Jin et al. [21] focused on training acceleration in 400M models. We extend these findings by demonstrating, for the first time, that context window scheduling significantly boosts both efficiency and performance at much larger scales (up to 3B parameters). A parallel approach from Pouransari et al. [38] segments training documents by length, but Fu et al. [10] caution that such segmentation can introduce domain biases, as longer texts often cluster in specific domains such as books. Recent developments in continual pretraining with long context windows [37, 55, 12] can also be viewed through the lens of context window scheduling with different strategies (illustrated in Figure 2). Our work represents the first demonstration of both the effectiveness and efficiency of context window scheduling, providing empirical evidence of its benefits on both standard and long-context benchmarks.

Figure 2: Schematic comparison of training-time context window scheduling.

Long-Context Language Models. Long-context language models have received a lot of attention due to their ability to capture extended dependencies across large textual windows. Most existing approaches follow a continual pretraining paradigm [10, 57], which extends a pretrained backbone model to longer contexts through specialized fine-tuning or additional training. Several works propose to intervene in the positional embeddings to accommodate longer sequences [1, 31, 37, 4, 22], while others perform extended pretraining on longer-sequence corpora [12, 55, 32, 63]. Our approach differs from previous methods as we train native long-context models from scratch, rather than modifying a pretrained model in post-training. Compared with a naive long-context pretraining baseline with a constant schedule, our approach delivers substantial gains on multiple long-context tasks, underscoring the benefits of training from scratch. These findings show that our method can be a promising direction for future research on building language models with longer context windows.

Figure 3: An illustration of the workflow for pretraining data preparation highlights several critical decisions. Key considerations include the method of data packing, the type of attention mask to employ (causal or intra-doc mask), and determining the appropriate context window length L.

Context Window Scheduling.

We now present SkyLadder for progressively expanding the context window during pretraining.


How Context Window Affects Pretraining

How does the context window affect pretraining? To investigate this in a fair and comparable manner, we pretrain language models from scratch with context windows ranging from 512 to 16,384 tokens under a fixed total token budget, evaluating via perplexity and downstream task benchmarks. We examine how the context window size impacts model performance, and analyze how data packing and masking strategies interact with window size.

Packing, Masking and Context Window

Most modern LLMs are based on a decoder-only transformer architecture [54] with a fixed context window size denoted by L. In contrast, the pretraining corpus D = {d_1, d_2, d_3, ..., d_n} consists of documents whose lengths differ from L. Therefore, a key step before pretraining is to pack the documents into sequences of length L. Formally, a packed sequence C_i is constructed as C_i = Trunc(d_{i,1}) ⊕ d_{i,2} ⊕ ··· ⊕ d_{i,n-1} ⊕ Trunc(d_{i,n}), where ⊕ denotes concatenation and Trunc(·) denotes truncation of documents to ensure len(C_i) = L. Following previous works [44, 64], document boundaries within C_i are explicitly marked using end-of-sequence ([EOS]) tokens.
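As a concrete illustration, the packing step can be sketched in a few lines of Python. The `EOS_ID` value and function name here are hypothetical, and real pipelines operate on tokenized shards rather than Python lists:

```python
# Minimal sketch of fixed-length sequence packing: concatenate tokenized
# documents, mark boundaries with EOS, and emit chunks of exactly L tokens.
from typing import Iterable, Iterator

EOS_ID = 0  # assumed end-of-sequence token id

def pack_documents(docs: Iterable[list[int]], L: int) -> Iterator[list[int]]:
    buffer: list[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(EOS_ID)          # document boundary marker
        while len(buffer) >= L:
            yield buffer[:L]           # a packed sequence C_i of length L
            buffer = buffer[L:]        # tail of the truncated document
    # any remainder shorter than L is simply dropped in this sketch

docs = [[1, 2, 3, 4], [5, 6], [7, 8, 9, 10, 11]]
chunks = list(pack_documents(docs, L=4))
# -> [[1, 2, 3, 4], [0, 5, 6, 0], [7, 8, 9, 10]]
```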

After the sequences are packed, the inputs are passed into transformer layers for next-token prediction training. A crucial component of these layers is the attention mechanism, which can be formulated as A_{ij} = q_i^⊤ k_j, followed by Attn(X) = Softmax(A + M). In decoder-only models, a mask M is applied to introduce constraints. A common approach is the causal mask, which ensures that each position can only attend to previous tokens by masking out (setting to -∞) attention scores for future positions: M_{ij} = -∞ for j > i and M_{ij} = 0 otherwise. A recently proposed masking scheme, known as the intra-doc mask [64, 13], imposes the further constraint that tokens attend to each other only if they belong to the same document. Letting each document d have start index s_d and end index e_d, the masking can be denoted as M^intra_{ij} = 0 when there exists a d such that s_d ≤ i, j ≤ e_d and j ≤ i, and M^intra_{ij} = -∞ otherwise. The model is trained with the standard cross-entropy loss on the packed sequences of length L. The workflow for pretraining data processing is illustrated in Figure 3.
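The causal and intra-document masks described above can be sketched as follows. The `doc_ids` array, which assigns each position in the packed sequence to its source document, is an assumption of this sketch (in practice it would be derived from the [EOS] positions):

```python
# Sketch of causal vs intra-document attention masks, using plain Python
# lists with 0.0 for "attend" and -inf for "masked".
NEG_INF = float("-inf")

def causal_mask(n: int) -> list[list[float]]:
    # M[i][j] = 0 for j <= i, -inf for future positions j > i
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def intra_doc_mask(doc_ids: list[int]) -> list[list[float]]:
    # additionally require positions i and j to belong to the same document
    n = len(doc_ids)
    return [[0.0 if (j <= i and doc_ids[i] == doc_ids[j]) else NEG_INF
             for j in range(n)] for i in range(n)]

doc_ids = [0, 0, 0, 1, 1]      # two documents packed into one sequence
M = intra_doc_mask(doc_ids)
# position 3 (first token of doc 1) cannot attend to doc 0's tokens:
# M[3][0..2] are -inf, while M[3][3] == 0
```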

Comparing Pretraining Context Length



Figure 9: Validation perplexity (evaluated on a sliding window of 512) on models with different context lengths.

Figure 10: Left: Evaluation perplexity of models with different packing or masking strategies. Right: Downstream performance over 9 tasks of different models.

Figure 11: Validation perplexity vs training tokens with different context windows and base of RoPE, θ . Evaluation is done on a sliding window of varying length (x-axis) on the validation documents.

In Figure 9, we plot the validation perplexity of models with different context windows under the Random, IntraDoc, and BM25 settings. We observe a consistent trend: under all settings, a shorter-context model has lower evaluation perplexity on shorter sequences.

In Figure 10, we plot the evaluation perplexity and downstream performance of models with different packing or masking strategies. Overall, IntraDoc achieves the best performance, with consistently lower PPL and higher downstream accuracy. We attribute this partially to the shorter effective context windows that the IntraDoc model is trained on.

Preliminary Study on Context Window Size

As per Section 1, we initiate our study by investigating the impact of context window size on model performance through a controlled experiment. Specifically, we pretrain language models with varying context window sizes while preserving all other experimental settings. This enables a pure analysis of the context window's influence on model performance. Through this analysis, we aim to understand whether longer context windows inherently lead to better or worse model performance.

Figure 4: Ablation studies of different factors on different context window sizes. Note that the validation PPL is obtained on the validation documents with a sliding window size of 512 tokens. The packing strategy in (a) is Random, and the model sizes in (b) and (c) are 1B and 120M, respectively. Note that the context window in (d) means the number of available preceding tokens when making next-token prediction (calculation details in Section A.6).


Key Variables. The context window size determines the number of tokens included in the context for each packed sequence. However, as discussed earlier, several additional factors influence the content within the context window: (1) Packing methods determine which documents constitute the context window, and different packing strategies can significantly alter the composition of token sequences; (2) Masking methods decide whether cross-document attention is enabled within the same context window. The choice of masking affects how the information from different documents interacts during training.

Packing and Masking. To study the impact of packing, we employ two strategies: random packing and semantic packing. For random packing, documents are randomly concatenated without a specific ordering. For semantic packing, inspired by Shi et al. [44], we retrieve and concatenate semantically relevant documents from the corpus, aiming to keep them within the same context window. After experimenting with both a dense retriever [20] and the lexical retriever BM25, we found that BM25 gives stronger performance and chose it as our focus. For masking, the baseline approach is causal masking, where each token can attend to all preceding tokens within the same context window, regardless of document boundaries. Conversely, recent studies [64, 9] show that disabling cross-document attention, thereby enforcing intra-document attention, improves performance. For clarity in subsequent discussions, we denote random packing with causal masking as Random, BM25 packing with causal masking as BM25, and random packing with intra-document masking as IntraDoc.

Training. We pretrain models from scratch using the TinyLlama codebase [61], and study models with 120M, 360M and 1B parameters. Given the substantial computational cost associated with retrieval in semantic packing, we randomly select around 30B tokens from the CommonCrawl (CC) subset of the SlimPajama dataset [46] as the pretraining corpus. All models undergo training for up to 100B tokens ( ∼ 3.3 epochs). To ensure consistency across experiments, we strictly control all other settings, retaining the same batch size and learning rate schedule for all context windows. All models also incorporate Rotary Positional Encoding (RoPE) [47] to encode positional information. Appendix A.3 and A.4 give further model architecture details and training settings.

Evaluation. For all model sizes, we use perplexity (PPL) on validation documents from the original dataset as a key metric, in line with established practices [10, 24, 17]. Note that when comparing models across different context windows (e.g., a 2K-context model and an 8K-context model), we must ensure the evaluation sequence fits within the shorter model's context window to maintain a fair comparison. We also evaluate 1B models on downstream standard benchmarks: HellaSwag [60], ARC-Easy and ARC-Challenge [6], Winogrande [42], CommonsenseQA [48], OpenBookQA [34], PIQA [2], Social-QA [43], and MMLU [16]. We employ the OLMES suite [15] for the evaluation, as it has been shown to provide reliable and stable results with curated 5-shot demonstrations [12].
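The sliding-window perplexity evaluation can be sketched as below. Here `nll` is a hypothetical stand-in for a model's per-token negative log-likelihood, and the sketch uses a stride of one token, which may differ from the paper's exact protocol:

```python
# Sliding-window perplexity: score each token with at most `window`
# preceding tokens of context, so models with different context windows
# can be compared fairly on the same sequences.
import math
from typing import Callable

def sliding_window_ppl(tokens: list[int],
                       nll: Callable[[list[int], int], float],
                       window: int = 512) -> float:
    total = 0.0
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - window):i]   # cap context at `window` tokens
        total += nll(context, tokens[i])
    return math.exp(total / (len(tokens) - 1))

# Sanity check: a uniform model over a 4-token vocabulary has PPL exactly 4.
uniform_nll = lambda ctx, tok: math.log(4.0)
ppl = sliding_window_ppl([1, 2, 3, 1, 2], uniform_nll)
# -> 4.0
```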

Masking Method

Inspired by learning rate scheduling, we explore whether dynamically scheduling the context window from short to long during pretraining could lead to performance improvements. This method can be implemented by applying multiple local 'mini' causal masks to a long, packed sequence. We illustrate this masking strategy in Figure 5.

Formally, we define a local window length w. The associated mask M^w is defined as follows: M_{ij} = 0 when ⌊i/w⌋·w ≤ j ≤ i, and M_{ij} = -∞ otherwise, where ⌊i/w⌋·w is the largest multiple of w less than or equal to i, effectively defining a block-wise attention mask for the query token at position i. We linearly adjust the window size upwards by a constant factor per training step t: w(t) = min(w_e, w_s + ⌊αt⌋), where w_s and w_e represent the starting and ending context window sizes, respectively. Here, α denotes the rate of expansion, and t corresponds to the training step. As training progresses, once the dynamic context window size w(t) reaches the desired (long) context window size L = w_e, it remains fixed at that value. At this point, the attention mask is equivalent to a full causal mask. Notably, this method modifies the effective context window through masking, independent of how the sequences are packed. As such, the mask M^w can be integrated with M^intra, which maintains the attention boundaries between documents; it can be seamlessly combined with most packing and masking strategies.

Figure 5: An illustration of SkyLadder with Random and IntraDoc. The example shows a packed sequence (length L) consisting of two documents. For SkyLadder, the context window w starts from a small value and dynamically adjusts during training, eventually converging to the masking patterns of Random or IntraDoc.

Table 1: Performance (accuracy in %) of different 1B models pretrained on 100B CC tokens on standard benchmarks. ∗ denotes statistically significant improvements over the baseline (described in §A.7.3).
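A minimal sketch of the block-wise mask M^w and the linear schedule w(t) is shown below; the function names are illustrative and not taken from the released SkyLadder code:

```python
# Block-wise local attention mask and linear window schedule
# w(t) = min(w_e, w_s + floor(alpha * t)).
import math

NEG_INF = float("-inf")

def window_size(t: int, w_s: int = 8, w_e: int = 32768, alpha: float = 0.125) -> int:
    return min(w_e, w_s + math.floor(alpha * t))

def skyladder_mask(n: int, w: int) -> list[list[float]]:
    # M[i][j] = 0 when floor(i/w)*w <= j <= i, else -inf: each query token
    # attends only within its own block of size w ("mini" causal masks).
    return [[0.0 if (i // w) * w <= j <= i else NEG_INF
             for j in range(n)] for i in range(n)]

M = skyladder_mask(n=6, w=2)
# token 2 starts a new block of size 2, so it cannot attend to tokens 0 and 1
```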


We now examine the impact of hyperparameters in SkyLadder scheduling. To manage computational costs, we adopt a default setup of pretraining 120M models with 8K context on 100B CC tokens.

Expansion Rate. We investigate the impact of the expansion rate α in Figure 6 (left). We choose different α ranging from slowest ( 1 / 12 ) to fastest (1). Our findings reveal that, for short contexts, performance generally improves as the expansion rate slows down. However, selecting an excessively slow rate (e.g., 1 / 12 ) can negatively affect long-context performance due to insufficient training on longer contexts. Therefore, we recommend setting α to 1 / 8 for a good balance.

Initial Context Window. As the final context window length w_e is fixed to L, the sole remaining hyperparameter is w_s. Intuitively, setting w_s to an excessively large value (e.g., close to L) leaves little room for scheduling, resulting in sub-optimal performance. In Figure 6 (right), we demonstrate that when w_s is set to a relatively small value (e.g., 8), strong performance can be achieved for both short and long contexts. This suggests that there is still potential for further improvement over our default setup. Therefore, we recommend starting with a small context window, such as 8 tokens.

Scheduling Type. The default scheduling method in SkyLadder is linear scheduling. We evaluate different context window scheduling types (more details in Table 20 and Figure 12 in Appendix A.7.4): (1) Stepwise Linear rounds the window size w(t) to multiples of 1K, resulting in a step function; (2) Sinusoidal increases quickly at the early stage and then slows down; (3) Exponential starts slow but accelerates sharply; (4) the Continual pretraining setup trains with a 4K context window for ∼97B tokens, then switches to a 32K context for the final 3B tokens. Table 7 shows that linear and sinusoidal schedules outperform the exponential variant on long-context tasks, likely because the exponential schedule, with extended short-context pretraining at the beginning, fails to adequately train on long contexts. Lastly, the widely used continual pretraining setup performs poorly overall, suggesting that abrupt context changes harm both short- and long-context performance. These findings suggest that context window scheduling is superior to both constant long-context pretraining and continual pretraining.
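To make the schedule shapes concrete, here is an illustrative comparison; apart from the linear case, the exact functional forms are assumptions, since the text describes the sinusoidal and exponential variants only qualitatively:

```python
# Three candidate window schedules as a function of p, the fraction of the
# expansion phase completed (p in [0, 1]).
import math

def linear(p: float, w_s: int, w_e: int) -> int:
    return round(w_s + (w_e - w_s) * p)

def sinusoidal(p: float, w_s: int, w_e: int) -> int:
    # fast early growth that slows as it approaches the target
    return round(w_s + (w_e - w_s) * math.sin(p * math.pi / 2))

def exponential(p: float, w_s: int, w_e: int) -> int:
    # slow start, sharp acceleration near the end
    return round(w_s * (w_e / w_s) ** p)

# Halfway through the expansion phase the three schedules diverge widely:
w_s, w_e = 8, 32768
mid = [f(0.5, w_s, w_e) for f in (linear, sinusoidal, exponential)]
# -> [16388, 23173, 512]
```

The exponential schedule spends far longer at tiny windows, which matches the observation above that it under-trains on long contexts.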

Overall, we conclude that the schedule should start from a small w s and the expansion should be gradual. We leave it to future work to study more advanced schedules and discover optimal configurations. For instance, it is possible that the schedule needs to be adjusted for various model sizes. More ablations for combination with BM25, hybrid attention, cyclic schedules and scheduling under a compute budget can be found in Appendix A.7.4.


Results

Figure 1 presents the main experimental result, obtained using the Random setting with 1B-parameter models. The results indicate that context window size significantly influences the performance of LLMs, with shorter contexts generally leading to better performance. To further investigate the factors contributing to this observation, we perform a comprehensive analysis of potential variables that may affect the conclusion. Figure 4 shows our results, from which we derive four key findings:

Findings: (1) The advantage of training on shorter contexts is consistent across model sizes; (2) This advantage is independent of the packing and masking methods employed; (3) It is also unrelated to the use of positional encoding; (4) The best packing and masking strategy is IntraDoc, which outperforms others probably because it introduces a larger number of short contexts during pretraining.

Findings (1) and (2). As shown in Figure 4, regardless of the model size in (a) or the packing and masking methods in (b), a shorter context window for pretraining generally results in higher average performance on benchmarks. The finding on benchmarks is consistent with the trend of validation PPL, where shorter context windows always yield lower PPL.

Finding (3). When using shorter context windows, one might hypothesize that the model learns positional encoding patterns for nearer positions more frequently, leading to better performance on standard benchmarks. To test the hypothesis, we systematically ablate RoPE by completely excluding it during pretraining, following prior work [25]. In Figure 4(c), models trained with short-context windows still outperform their long-context counterparts, even in the absence of positional encoding. This suggests that the advantages of shorter contexts are independent of positional encoding.

Finding (4). From Figure 4(b), we observe that IntraDoc achieves the best validation PPL across all context window sizes compared to Random and BM25, alongside consistently higher performance on standard benchmarks (cf. Appendix A.7.1). This raises the question: why does IntraDoc excel? We attribute the advantage to the context window size distribution of IntraDoc, which implicitly increases the prevalence of shorter contexts. As illustrated in Figure 4(d), despite the sequence length of 8K, fewer than 1% of context windows actually reach this limit. While prior work links the success of IntraDoc to reduced contextual noise [64], we identify reduced average context window size as a complementary factor behind its strong performance. That is, we hypothesize that the effectiveness of IntraDoc is also closely tied to short context windows.



Experimental Results

Figure 1 presents the main experimental result, obtained using the Random setting with 1B-parameter models. The results indicate that context window size significantly influences the performance of LLMs, with shorter contexts generally leading to better performance . To further investigate the

factors contributing to the observation, we perform a comprehensive analysis to examine potential variables that may affect the conclusion. Figure 4 shows our results, and we derive four key findings:

Findings: (1) The advantage of training on shorter contexts is consistent across model sizes; (2) This advantage is independent of the packing and masking methods employed; (3) It is also unrelated to the use of positional encoding; (4) The best packing and masking strategy is IntraDoc, which outperforms others probably because it introduces a larger number of short contexts during pretraining.

Findings (1) and (2). As shown in Figure 4, regardless of the model size in (a) or the packing and masking methods in (b), a shorter context window for pretraining generally results in higher average performance on benchmarks. The finding on benchmarks is consistent with the trend of validation PPL, where shorter context windows always yield lower PPL.

Finding (3). When using shorter context windows, one might hypothesize that the model learns positional encoding patterns for nearer positions more frequently, leading to better performance on standard benchmarks. To test the hypothesis, we systematically ablate RoPE by completely excluding it during pretraining, following prior work [25]. In Figure 4(c), models trained with short-context windows still outperform their long-context counterparts, even in the absence of positional encoding. This suggests that the advantages of shorter contexts are independent of positional encoding.

Finding (4). From Figure 4(b), we observe that IntraDoc achieves the best validation PPL across all context window sizes compared to Random and BM25, alongside consistently higher performance on standard benchmarks (cf. Appendix A.7.1). This raises the question: why does IntraDoc excel? We attribute the advantage to the context window size distribution of IntraDoc, which implicitly increases the prevalence of shorter contexts. As illustrated in Figure 4(d), despite the sequence length of 8K, fewer than 1% of context windows actually reach this limit. While prior work links the success of IntraDoc to reduced contextual noise [64], we identify the reduced average context window size as a complementary factor behind its strong performance. That is, we hypothesize that the effectiveness of IntraDoc may also be closely tied to short context windows.

Figure 4: (a) Models pretrained on shorter context windows perform better on both benchmarks and PPL. (b) Packing and masking reduce the performance gap between context lengths. (c) The advantage of short contexts is not related to positional encoding. (d) Intra-document masking improves performance and yields much shorter effective contexts.

Ablations

We now examine the impact of hyperparameters in SkyLadder scheduling. To manage computational costs, we adopt a default setup of pretraining 120M models with 8K context on 100B CC tokens.

Expansion Rate. We investigate the impact of the expansion rate α in Figure 6 (left), choosing values of α ranging from slowest (1/12) to fastest (1). Our findings reveal that, for short contexts, performance generally improves as the expansion rate slows down. However, an excessively slow rate (e.g., 1/12) can hurt long-context performance due to insufficient training on longer contexts. We therefore recommend setting α to 1/8 for a good balance.

Initial Context Window. As the final context window length w_e is fixed to L, the sole remaining hyperparameter is w_s. Intuitively, setting w_s to an excessively large value (e.g., close to L) leaves little room for scheduling, resulting in sub-optimal performance. In Figure 6 (right), we show that when w_s is set to a relatively small value (e.g., 8), strong performance is achieved for both short and long contexts. This suggests that there is still room for improvement over our default setup. We therefore recommend starting with a small context window, such as 8 tokens.

Scheduling Type. The default scheduling method in SkyLadder is linear scheduling. We evaluate different context window scheduling types (more details in Table 20 and Figure 12 in Appendix A.7.4): (1) Stepwise Linear rounds the window size w(t) down to multiples of 1K, resulting in a step function; (2) Sinusoidal increases quickly at the early stage and then slows down; (3) Exponential starts slow but accelerates sharply; (4) the Continual pretraining setup trains with a 4K context window for ∼97B tokens, then switches to a 32K context for the final 3B tokens. Table 7 shows that the linear and sinusoidal schedules outperform the exponential variant on long tasks, likely because the exponential schedule, with extended short-context pretraining at the beginning, fails to adequately train on long contexts. Finally, the widely used continual pretraining setup performs poorly overall, suggesting that abrupt context changes harm both short- and long-context performance. These findings suggest that context window scheduling is superior to both constant long-context pretraining and continual pretraining.
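The four schedule types can be sketched as simple functions of the training step t. The exact functional forms below (quarter-sine for the sinusoidal variant, geometric interpolation for the exponential one) are our illustrative assumptions; the precise definitions are in Appendix A.7.4:

```python
import math

T = 64_000           # steps over which the window expands (illustrative)
w_s, w_e = 32, 8192  # starting and ending window sizes

def linear(t):
    return min(w_e, w_s + (w_e - w_s) * t / T)

def stepwise_linear(t, step=1024):
    # round the linear schedule down to multiples of 1K
    return max(w_s, int(linear(t) // step) * step)

def sinusoidal(t):
    # grows quickly early, then slows down (quarter sine wave)
    return w_s + (w_e - w_s) * math.sin(0.5 * math.pi * min(t / T, 1.0))

def exponential(t):
    # starts slow, accelerates sharply (geometric interpolation)
    return w_s * (w_e / w_s) ** min(t / T, 1.0)
```

At the halfway point the exponential schedule is still far below the linear one, while the sinusoidal schedule is ahead of it, matching the behavior described above.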

Overall, we conclude that the schedule should start from a small w_s and the expansion should be gradual. We leave the study of more advanced schedules and the discovery of optimal configurations to future work; for instance, it is possible that the schedule needs to be adjusted for different model sizes. More ablations on combining SkyLadder with BM25, hybrid attention, cyclic schedules, and scheduling under a compute budget can be found in Appendix A.7.4.

Short-to-Long Pretraining


SkyLadder: Context Window Scheduling

We now present SkyLadder for progressively expanding the context window during pretraining.

Method

Inspired by learning rate scheduling, we explore whether dynamically scheduling the context window from short to long during pretraining could lead to performance improvements. This method can be implemented by applying multiple local 'mini' causal masks to a long, packed sequence. We illustrate this masking strategy in Figure 5.

Formally, we define a local window length w. The associated mask M_w is defined as M_ij = 0 when ⌊i/w⌋·w ≤ j ≤ i, and M_ij = −∞ otherwise, where ⌊i/w⌋·w is the largest multiple of w that is less than or equal to i, effectively defining a block-wise attention


Figure 5: An illustration of SkyLadder with Random and IntraDoc. The example shows a packed sequence (length L) consisting of two documents. For SkyLadder, the context window w starts from a small value and dynamically expands during training, eventually converging to the masking patterns of Random or IntraDoc.

mask for the query token at position i. We linearly increase the window size by a constant factor per training step t: w(t) = min(w_e, w_s + ⌊αt⌋), where w_e and w_s represent the ending and starting context window sizes, respectively, and α denotes the rate of expansion. As training progresses, once the dynamic context window size w(t) reaches the desired (long) context window size L = w_e, it remains fixed at that value; at this point, the attention mask is equivalent to a full causal mask. Notably, this method modifies

Table 1: Performance (accuracy in %) of different 1B models pretrained on 100B CC tokens on standard benchmarks. ∗ denotes statistically significant improvements over the baseline (described in §A.7.3).

the effective context window through masking, independent of how the sequences are packed. As such, the mask M_w can be integrated with M_intra, which maintains the attention boundaries between documents; it can be seamlessly combined with most packing and masking strategies.
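As a concrete sketch, the block-wise mask M_w and the linear expansion rule can be written in a few lines of NumPy. This is an illustrative reference implementation only; the actual training code realizes the same pattern through Flash Attention rather than a dense mask:

```python
import numpy as np

def block_causal_mask(L, w):
    """Mask M_w: query i may attend to key j iff floor(i/w)*w <= j <= i,
    i.e. a causal mask that restarts at every multiple of w."""
    i = np.arange(L)[:, None]
    j = np.arange(L)[None, :]
    allowed = ((i // w) * w <= j) & (j <= i)
    return np.where(allowed, 0.0, -np.inf)

def window_size(t, w_s, w_e, alpha):
    """Linear expansion w(t) = min(w_e, w_s + floor(alpha * t))."""
    return min(w_e, w_s + int(alpha * t))
```

At step t, the mask block_causal_mask(L, window_size(t, 32, L, 1/8)) is added to the attention scores; once w(t) = L it coincides with the full causal mask.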

Experimental Setup

We follow the same setup as in Section 3.2 to pretrain language models with an 8K context on 100B tokens. We set w_s = 32 and α = 1/8 by default, which means a model needs roughly 64K steps (around 64B tokens) to reach the final desired context window of L = 8192. All baseline and SkyLadder models are implemented with Flash Attention 2 [7] (pseudocode in Appendix A.5). We fix all other hyperparameters, such as the learning rate schedule and batch size, for a fair comparison. Due to resource constraints, we do not perform an extensive hyperparameter search over w(t), α, and w_s. In our ablation study, we show that these hyperparameters have a negligible impact on performance, as long as they are within a reasonable range.
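As a quick sanity check on the stated schedule length, assuming the linear rule w(t) = min(w_e, w_s + ⌊αt⌋):

```python
import math

# Smallest step t at which w(t) = w_s + floor(alpha * t) reaches w_e
w_s, w_e, alpha = 32, 8192, 1 / 8
steps_to_full = math.ceil((w_e - w_s) / alpha)  # 65,280 steps, roughly 64K
```

Note that equating ~64K steps with ~64B tokens additionally implies on the order of one million tokens per optimizer step, which depends on the batch configuration in Table 12.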

For evaluation, we use the same suite mentioned in Section 3.2 for standard benchmarks. To evaluate long-context question answering within an 8K length, we use the 30-document setting from the Multi-Document QA (MDQA) benchmark [30], a widely adopted benchmark shown to be reliable for models at the 1B scale [38, 64], with an average length of approximately 6K tokens. We also select synthetic tasks within RULER [18], as defined by Yen et al. [59], choosing the setup that fills the model's target context window L.


Scalability Experiments

We examine whether SkyLadder's improvements persist as we scale up the model parameters and extend the context window size. We use the largest model and context size that our compute permits.

Model Size. We conduct experiments across three model sizes, 120M, 360M, and 3B parameters, on the FineWeb-Pro dataset. Table 5 demonstrates that models using SkyLadder consistently achieve better standard benchmark performance at all model sizes. For long-context tasks, our method does not benefit 120M models, possibly due to their limited capacity for processing long sequences. However, the gain on 3B models is prominent. We observe a positive scaling trend: as the model size grows, the performance improvement also increases, indicating the potential of applying our method to even larger models beyond our current scale. We leave exploring larger models to future work, as it requires significantly more compute.

Context Window Size. To examine whether SkyLadder can effectively scale to longer context windows, we train 1B models with a 32K context window on 100B FineWeb-Pro tokens. We adjust α to 1/2 to ensure that the context window expands to the full 32K before the end of pretraining. As shown in Table 6, our model demonstrates strong performance on both standard and long benchmarks. In addition, the performance gap of SkyLadder between the 8K and 32K models (0.9%) is much smaller than that of the baseline approach (1.8%), which alleviates

Table 6: Performance (%) of 1B models trained on 100B FineWeb-Pro tokens with a 32K context window.

Figure 6: Validation PPL on 512 and 8K contexts of models with different expansion rate α (left) and initial window length w s (right).


Table 8: Comparison of relative training time and compute efficiency for 1B Models with different context window sizes L . FLOPs calculation follows Zhang et al. [61]. A larger context window leads to more efficiency gains.

the performance degradation described in our earlier study. Notably, compared to the baseline Random approach, SkyLadder trains the model on progressively shorter contexts during earlier stages. This reveals a counterintuitive insight: naively training a model with a long context window is not always optimal, even if the model is evaluated on long contexts. In contrast, strategic scheduling of the context window during pretraining can yield better results.

Dataset Statistics

In this section, we provide detailed statistics of the datasets used in our study. These include the document length distributions of the pretraining corpora, the characteristics of the evaluation datasets, and the input length statistics of standard reasoning benchmarks.

Table 26 reports the document length statistics for the two pretraining corpora, CommonCrawl and FineWeb-Pro . Both distributions are strongly right-skewed, indicating that long documents are rare. Compared to FineWeb-Pro, CommonCrawl generally contains longer documents, while FineWeb-Pro has been more carefully cleaned and filtered.
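For reference, the statistics reported in Table 26 can be computed from a list of per-document token counts as follows. This is a minimal sketch using moment-based skewness and excess kurtosis; the paper does not specify which estimator it uses:

```python
import numpy as np

def length_stats(lengths):
    """Summary statistics of per-document token counts (as in Table 26)."""
    x = np.asarray(lengths, dtype=float)
    mu, sigma = x.mean(), x.std()
    z = (x - mu) / sigma
    return {
        "mean": mu, "median": float(np.median(x)), "std": sigma,
        "p25": float(np.percentile(x, 25)), "p75": float(np.percentile(x, 75)),
        "skewness": float((z ** 3).mean()),        # > 0 means right-skewed
        "kurtosis": float((z ** 4).mean() - 3.0),  # excess kurtosis
    }
```

On a corpus dominated by short documents with a few very long ones, the skewness comes out positive, matching the right-skewed distributions described above.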

Table 26: Document length statistics of the pretraining corpora, measured in tokens per document. Mean, median, and standard deviation describe the central tendency and variation. P25 and P75 indicate the 25th and 75th percentiles, while skewness and kurtosis capture distribution asymmetry and tail heaviness.

Table 27 shows the input length characteristics of common reasoning and knowledge benchmarks, including ARC, CSQA, HellaSwag, OBQA, PIQA, SocialIQA, Winogrande, and MMLU. While these benchmarks consist of relatively short contexts, they remain standard for assessing a model's factual consistency and reasoning ability. Importantly, a long-context model should maintain stable behavior even when the user provides a short query.

Finally, Table 28 summarizes the characteristics of the evaluation datasets used in the reading comprehension and long-context evaluation. These include QA benchmarks such as MDQA, RULER, SQuAD, HotpotQA, NQ, TriviaQA, and RACE. The datasets differ substantially in input length, reflecting the diversity of reasoning depth and context complexity.

Table 28: Length statistics of reading comprehension and QA evaluation datasets. These benchmarks capture varying levels of input complexity, from short factual QA to multi-hop reasoning tasks.

Overall, the datasets used in this work span a wide range of input lengths and domains, from large-scale pretraining corpora to short- and long-context evaluation benchmarks, ensuring that our analysis is both comprehensive and representative.

Model Size

In Table 11, we list the architecture choices of the models trained, including the 120M, 360M, and 1B models based on the TinyLlama architecture [61]. The 3B model is based on Llama3.2 architecture [13].


Context Window Size


Figure 9: Validation perplexity (evaluated on a sliding window of 512) on models with different context lengths.

Figure 10: Left: Evaluation perplexity of models with different packing or masking strategies. Right: Downstream performance over 9 tasks of different models.

Figure 11: Validation perplexity vs training tokens with different context windows and base of RoPE, θ . Evaluation is done on a sliding window of varying length (x-axis) on the validation documents.

In Figure 9, we plot the validation perplexity of models with different context windows under the Random, IntraDoc, and BM25 settings. We observe a consistent trend: a shorter-context model has lower evaluation perplexity on shorter sequences under all settings.

In Figure 10, we plot the evaluation perplexity and downstream performance of models with different packing or masking strategies. We conclude that overall, IntraDoc achieves the best performance, with a consistently lower PPL and a higher downstream accuracy. We think that this is partially due to the shorter context window that the IntraDoc model is trained on.


Conclusion

We conduct a comprehensive controlled study of the impact of the context window on pretraining, revealing that a shorter context window is more beneficial to the model's performance on standard benchmarks. This challenges the trend of pretraining with ever-longer context windows. We therefore propose SkyLadder, which schedules the context window from short to long during pretraining and yields substantial improvements in downstream performance and computational efficiency. We conclude that context window scheduling is an important dimension of pretraining and deserves more consideration. In the future, we plan to explore more dynamic and performant scheduling strategies that adapt to the model size or pretraining data distribution.




Most modern LLMs are based on a decoder-only transformer architecture [54] with a fixed context window size denoted by L. In contrast, the pretraining corpus D = {d_1, d_2, d_3, ..., d_n} consists of documents of varying lengths. Therefore, a key step before pretraining is to pack the documents into sequences of length L. Formally, a packed sequence C_i is constructed as C_i = Trunc(d_{i,1}) ⊕ d_{i,2} ⊕ ··· ⊕ d_{i,n−1} ⊕ Trunc(d_{i,n}), where ⊕ denotes concatenation and Trunc(·) denotes truncation of documents to ensure len(C_i) = L. Following previous works [44, 64], document boundaries within C_i are explicitly marked with end-of-sequence ([EOS]) tokens.

After the sequences are packed, the inputs are passed into transformer layers for next-token prediction training. A crucial component of these layers is the attention mechanism, with scores A_ij = q_i^⊤ k_j and Attn(X) = Softmax(A + M)V. In decoder-only models, a mask M is applied to introduce constraints. A common approach is the causal mask, which ensures that each position can only attend to previous tokens by masking out (setting to −∞) the attention scores for future positions: M_ij = −∞ for j > i, and M_ij = 0 otherwise. A recently proposed masking scheme, known as the intra-doc mask [64, 13], imposes the constraint that tokens may attend to each other only if they belong to the same document. Letting each document d have start index s_d and end index e_d, the mask is M^intra_ij = 0 when ∃d such that s_d ≤ i, j ≤ e_d and j ≤ i, and M^intra_ij = −∞ otherwise. The model is trained with the standard cross-entropy loss on the packed sequences of length L. The workflow for pretraining data processing is illustrated in Figure 3.
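A minimal sketch of constructing M_intra from a packed token sequence, assuming documents are delimited by a hypothetical [EOS] token id of 0:

```python
import numpy as np

EOS = 0  # hypothetical [EOS] token id marking document boundaries

def intra_doc_mask(tokens):
    """M_intra: query i attends to key j iff j <= i and both positions
    belong to the same document of the packed sequence."""
    doc_id, doc = [], 0
    for tok in tokens:           # an [EOS] is the last token of its document
        doc_id.append(doc)
        if tok == EOS:
            doc += 1
    d = np.asarray(doc_id)
    i = np.arange(len(tokens))[:, None]
    j = np.arange(len(tokens))[None, :]
    allowed = (d[:, None] == d[None, :]) & (j <= i)
    return np.where(allowed, 0.0, -np.inf)
```

The resulting mask is block-diagonal with a causal pattern inside each document block, so no token attends across a document boundary.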


Analysis and Discussion

Training Efficiency. We observe a significant boost in training efficiency when employing SkyLadder, as shown in Table 8. On 8K models, SkyLadder reduces training time by 13% due to the shorter context window used in attention computation. With a 32K context window, the efficiency gain becomes even more pronounced: our method saves 22% of training time while achieving better performance. The FLOPs saving is larger than the wall-clock time saving because the reduction comes primarily from the attention computation.
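To see why attention dominates the saving, note that block-local attention of width w(t) over a length-w_e sequence computes roughly w_e · w(t) score pairs instead of w_e². A rough estimate of the average attention-FLOPs ratio under the linear schedule (our illustrative approximation: it ignores non-attention FLOPs and kernel overheads, which is part of why measured time savings are smaller than FLOPs savings):

```python
def attn_flops_ratio(w_s, w_e, alpha, total_steps):
    """Average per-step attention cost under the linear schedule,
    relative to always using the full window w_e (block-local attention
    on a length-w_e sequence costs ~ w_e * w(t) instead of w_e**2)."""
    total = 0.0
    for t in range(total_steps):
        w = min(w_e, w_s + int(alpha * t))
        total += w / w_e
    return total / total_steps
```

For the default 8K setup (w_s = 32, α = 1/8) over 100K steps, the estimate lands around two-thirds of the full-attention cost.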

Attention Pattern. We next investigate why SkyLadder, despite being trained on shorter contexts overall, consistently outperforms the baseline. As language models rely on attention to encode contextual information, we study how attention patterns change. Specifically, during pretraining we monitor the dynamics of (i) attention entropy (solid lines in Figure 7), where lower entropy is associated with better downstream performance [62]; and (ii) attention sink [56], where the initial token in the context receives disproportionately high attention; we use the metric of Gu et al. [14] to quantify the sink amplitude. As shown in Figure 7 (dashed lines), compared with the baseline Random, SkyLadder exhibits lower attention entropy, suggesting a more concentrated attention pattern, while simultaneously showing a slower emergence and lower amplitude of the attention sink. This suggests that SkyLadder's attention concentrates on the key information in the context rather than the initial token, which accounts for the performance gain.
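The two quantities can be approximated per attention head as follows. This is a simplified sketch: the paper uses the sink metric of Gu et al. [14], and the threshold-based proxy below is our assumption:

```python
import numpy as np

def attention_entropy(attn):
    """Mean Shannon entropy of the attention rows of one head
    (attn: [T, T], rows softmax-normalised); lower = more concentrated."""
    p = np.clip(attn, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())

def sink_amplitude(attn, eps=0.3):
    """Fraction of query positions placing more than `eps` of their
    attention mass on the first token (a simple sink proxy)."""
    return float((attn[:, 0] > eps).mean())
```

A uniform attention map maximises the entropy with zero sink amplitude, while a map that collapses onto the first token gives near-zero entropy with sink amplitude 1, the two extremes the figure contrasts.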

Training Stability. To further understand the reasons behind SkyLadder's better performance, we analyze the impact of pretraining context length on training dynamics. We pretrain 120M-parameter models with different context lengths and first monitor the maximum attention logit (S_max = max_{i,j} q_i · k_j) throughout pretraining, following the methodology of K2 [50]. A large attention logit indicates that an attention head is malfunctioning and may cause numerical instability. In Figure 8, we observe that pretraining with a long context of 16K tokens leads to exploding max attention logits, while a shorter window keeps the attention logits lower.

Next, we study the loss and gradient behavior by computing four stability metrics over the first N = 30K steps of pretraining, where L t denotes the training loss and G t is the gradient norm before clipping:

Figure 7: Dynamics of attention sink and entropy during pretraining 1B models (8K context). SkyLadder delays the emergence of attention sink while lowering the overall entropy, indicating a more effective attention pattern.


Figure 8: Max attention logits during training of models of different context lengths (in different colors).


· Loss Volatility: measures local fluctuations of the loss over a sliding window (w = 10), computed as (1/N) Σ_{t=1}^{N} Std(L_{t−w+1}, ..., L_t). Lower values indicate more stable training.
· Loss Smoothness: the average loss change between consecutive steps, (1/(N−1)) Σ_{t=2}^{N} |L_t − L_{t−1}|. Smaller values mean smoother convergence.
· Mean Loss Ratio [28]: measures temporary increases in loss relative to the best loss so far, (1/(N−1)) Σ_{t=2}^{N} L_t / min(L_1, ..., L_{t−1}), where smaller values indicate fewer loss spikes.

Table 9: Training stability metrics during pretraining of 120M models with different context lengths. All metrics are averaged over the first 30 billion tokens. ↓ indicates that smaller values are better.

· Average Gradient Norm: (1/N) Σ_{t=1}^{N} min(G_t, 1), where larger values indicate more aggressive gradient updates.
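A direct implementation of the four metrics, given per-step traces of the loss L_t and unclipped gradient norm G_t, might look like:

```python
import numpy as np

def stability_metrics(loss, grad_norm, w=10):
    """The four stability metrics above, computed over a training trace
    (arrays are 0-indexed; windows are truncated at the start)."""
    L = np.asarray(loss, dtype=float)
    G = np.asarray(grad_norm, dtype=float)
    N = len(L)
    volatility = float(np.mean([L[max(0, t - w + 1): t + 1].std()
                                for t in range(N)]))
    smoothness = float(np.abs(np.diff(L)).mean())
    best_so_far = np.minimum.accumulate(L)[:-1]     # min(L_1, ..., L_{t-1})
    mean_loss_ratio = float((L[1:] / best_so_far).mean())
    avg_grad_norm = float(np.minimum(G, 1.0).mean())
    return volatility, smoothness, mean_loss_ratio, avg_grad_norm
```

A perfectly flat loss curve gives zero volatility and smoothness and a mean loss ratio of exactly 1, the idealised stable baseline against which the runs in Table 9 can be compared.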

In Table 9, longer-context models show higher volatility, less smooth loss curves, more frequent upward spikes, and larger gradient norms, all indicating less stable optimization. In contrast, short-context models converge more smoothly, with smaller fluctuations and more controlled gradient updates. Together, these results reveal that short-context pretraining is inherently more stable, both in attention behavior and in optimization dynamics. The reduced numerical instability and smoother convergence likely provide more consistent gradient signals and better overall convergence, explaining the superior downstream performance.

Comparison with Related Work. We compare our method with another approach for improving pretraining in Table 10. As discussed in Section 2, Pouransari et al. [38] proposed Dataset Decomposition (DD), which segments a document into sequences of varying lengths and applies a curriculum during pretraining. However, this approach inevitably introduces domain bias, as document lengths differ across domains [10]. This explains why DD with only one short-to-long cycle fails to outperform the IntraDoc baseline. To mitigate this, the authors suggested iterating through multiple cycles of long and short data, which does improve performance substantially. In contrast, our method achieves better performance while avoiding such biases, as it does not alter the data order based on length. In Appendix A.7.4, we experimented with various cyclic schedules but did not observe any improvements; in fact, we noticed loss spikes between cycles (Figure 14), indicating potential issues with domain shifts. This further supports that our method is safer, since it does not disrupt the natural ordering and distribution of the data. More discussion of other related works [28, 21] is in Section A.8, where we demonstrate that our work provides the novel insight that scheduling the context window over the entire training run improves both efficiency and performance.

Table 10: Comparison between SkyLadder and Dataset Decomposition (DD) on 1B models trained with 100B FineWeb-Pro tokens. Numbers are average performance in %.


We include details of the training configurations in Table 12. All models, irrespective of size or context window length, are trained with this same set of hyperparameters. For most hyperparameter values we follow the TinyLlama project [61]; our results are therefore highly reproducible.

Analysis

Training Efficiency. We observe a significant boost in training efficiency when employing SkyLadder in Table 8. On 8K models, SkyLadder accelerates training time by 13% due to the reduced context window in calculating attention. With a 32K context window, the efficiency gain becomes even more pronounced: our method saves 22% of training time while achieving better performance. The FLOPs saving is larger than the actual time because of reduced attention calculation.

Attention Pattern. We next investigate why SkyLadder, despite being trained on short contexts overall, consistently outperforms the baseline. As language models rely on attention mechanisms to encode context information, we study how attention patterns change. Specifically, during pretraining, we monitor the dynamics of (i) attention entropy (solid lines in Figure 7), where a lower entropy is associated with better downstream performance [62]; (ii) attention sink [56], where the initial token in the context receives disproportionately high attention. We utilize the metric in Gu et al. [14] to quantitatively measure the amplitude of attention sink. As shown in Figure 7 (dashed lines), compared with the baseline Random, SkyLadder demonstrates reduced attention entropy, suggesting a more concentrated attention pattern. However, a slower emergence and lower amplitude of attention sink are simultaneously observed. This suggests that SkyLadder's attention is concentrated on the key information in the context instead of the initial token, which accounts for the performance gain.
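As a rough illustration of these two diagnostics, both can be computed from a matrix of post-softmax attention weights. The array shape, the 0.3 threshold, and the exact sink criterion below are illustrative assumptions, not the precise metric of Gu et al. [14]:

```python
import numpy as np

def attention_entropy(attn, eps=1e-9):
    # attn: [num_heads, seq_len, seq_len], each row sums to 1 (post-softmax)
    # entropy per query position, averaged over heads and positions
    ent = -(attn * np.log(attn + eps)).sum(axis=-1)
    return float(ent.mean())

def sink_amplitude(attn, threshold=0.3):
    # fraction of heads whose average attention mass on the first token
    # exceeds a threshold (hypothetical stand-in for the sink metric)
    first_token_mass = attn[:, :, 0].mean(axis=-1)  # [num_heads]
    return float((first_token_mass > threshold).mean())

# Uniform attention over 4 positions: maximal entropy log(4), no sink
uniform = np.full((2, 4, 4), 0.25)
```

Lower entropy indicates attention concentrated on a few positions; a sink shows up as most of a row's mass landing on token 0.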

Training Stability. To further understand the reasons behind SkyLadder's better performance, we analyze the impact of pretraining context length on training dynamics. We pretrain 120M-parameter models with different context lengths. We first monitor the maximum attention logit ($S_{\max} = \max_{i,j} q_i \cdot k_j$) throughout pretraining, following the methodology of K2 [50]. A large attention logit indicates that an attention head is malfunctioning and may cause numerical instability. In Figure 8, we observe that pretraining with a long context of 16K tokens leads to exploding max attention logits, while a shorter window leads to lower attention logits.

Next, we study the loss and gradient behavior by computing four stability metrics over the first $N = 30\mathrm{K}$ steps of pretraining, where $L_t$ denotes the training loss and $G_t$ is the gradient norm before clipping:

Figure 7: Dynamics of attention sink and entropy during pretraining 1B models (8K context). SkyLadder delays the emergence of attention sink while lowering the overall entropy, indicating a more effective attention pattern.


Figure 8: Max attention logits during training of models of different context lengths (in different colors).


· Loss Volatility: measures local fluctuations of the loss over a sliding window ($w = 10$), computed as $\frac{1}{N}\sum_{t=1}^{N} \mathrm{Std}(L_{t-w+1}, \ldots, L_t)$. Lower values indicate more stable training.

· Loss Smoothness: the average loss change between consecutive steps, $\frac{1}{N-1}\sum_{t=2}^{N} |L_t - L_{t-1}|$. Smaller values mean smoother convergence.

· Mean Loss Ratio [28]: measures temporary increases in loss relative to the best loss so far, $\frac{1}{N-1}\sum_{t=2}^{N} \frac{L_t}{\min(L_1, \ldots, L_{t-1})}$, where smaller values indicate fewer loss spikes.

Table 9: Training stability metrics during pretraining of 120M models with different context lengths. All metrics are averaged over the first 30 billion tokens. ↓ indicates that smaller values are better.

· Average Gradient Norm: $\frac{1}{N}\sum_{t=1}^{N} \min(G_t, 1)$, where larger values indicate more aggressive gradient updates.
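As a sanity check, these four metrics can be computed directly from logged training curves. The arrays below are hypothetical, and truncating early windows to the available history is our implementation choice, not specified in the definitions above:

```python
import numpy as np

def stability_metrics(loss, grad_norm, w=10):
    L = np.asarray(loss, dtype=float)
    G = np.asarray(grad_norm, dtype=float)
    N = len(L)
    # Loss Volatility: mean std over a trailing window of size w
    volatility = float(np.mean(
        [L[max(0, t - w + 1): t + 1].std() for t in range(N)]))
    # Loss Smoothness: mean absolute change between consecutive steps
    smoothness = float(np.abs(np.diff(L)).mean())
    # Mean Loss Ratio: loss relative to the best (minimum) loss so far
    running_min = np.minimum.accumulate(L)[:-1]
    mean_loss_ratio = float((L[1:] / running_min).mean())
    # Average Gradient Norm, with each step's norm capped at 1
    avg_grad_norm = float(np.minimum(G, 1.0).mean())
    return volatility, smoothness, mean_loss_ratio, avg_grad_norm
```

A perfectly flat run gives zero volatility and smoothness and a loss ratio of exactly 1.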

In Table 9, longer-context models show higher volatility, less smooth loss curves, more frequent upward spikes, and larger gradient norms, all indicating less stable optimization. In contrast, shortcontext models converge more smoothly with smaller fluctuations and more controlled gradient updates. Together, these results reveal that short-context pretraining is inherently more stable, both in attention behavior and optimization dynamics. The reduced numerical instability and smoother convergence likely enable more consistent gradient signals and better overall convergence, explaining their superior downstream performance.


We acknowledge that several prior works have discovered a similar pattern of short-to-long pretraining. For instance, Li et al. [28] find that using a sequence-length warmup for the initial steps of pretraining improves model stability. However, they mostly focus on stability of the training loss and do not show a clear performance gain across multiple evaluations and larger scales. Moreover, we demonstrate that the benefits of scheduling a model's context window go beyond the warmup stage alone. In Table 21's first row, simply warming up the model over 8B tokens results in suboptimal performance compared to a slower expansion rate. This validates that the context window should be

Table 25: Performance (%) of 1B models with different masking schemes. All models are trained on the same 100B FineWeb-Pro tokens with a final context length of 8K. Both implementations of SkyLadder outperform the baseline, and the sliding window approach excels at long tasks with a slight performance drop on standard benchmarks.

considered as a factor to schedule over the entire training course, which also differentiates us from Li et al. [28] that only consider the warmup stage.

Another related work is Jin et al. [21] where the authors use progressive sequence lengths to accelerate training. However, their method leads to worse performance under the same token budget, while our SkyLadder shows both time saving and performance improvement with the same number of tokens. We suspect that this might be because of the suboptimal schedule they used. Moreover, their study is limited to observing the training loss of small models (up to 410M parameters), while we comprehensively show performance gain across multiple corpora, model sizes, context sizes, and a wide variety of tasks. Overall, we systematically conduct controlled experiments on the impact of context window scheduling in pretraining, providing insights to explain these previous studies.



Conclusion

We conduct a comprehensive controlled study of the impact of the context window on pretraining, revealing that a shorter context window is more beneficial to the model's performance on standard benchmarks. This challenges the prevailing trend of pretraining with ever-longer context windows. We therefore propose SkyLadder, which schedules the context window from short to long during pretraining, yielding substantial improvements in downstream performance and computational efficiency. We conclude that context window scheduling is an important dimension of pretraining and deserves more consideration. In the future, we plan to explore more dynamic and performant scheduling strategies that adapt to model size or pretraining data distribution.

Impact Statements



Appendix


Model Architecture

In Table 11, we list the architecture choices of the models trained, including the 120M, 360M, and 1B models based on the TinyLlama architecture [61]. The 3B model is based on Llama3.2 architecture [13].

Training Configurations

We include details of the training configurations in Table 12. All models, irrespective of size or context window length, are trained with the same set of hyperparameters. For most hyperparameter values, we follow the TinyLlama [61] project, making our results highly reproducible.

Hardware and Compute

Implementation

We provide the pseudocode for implementing SkyLadder with Flash Attention 2 [7]. The only change is to apply local causal masking with size w , and combine them with the original document boundaries under the IntraDoc scenario. It can easily be integrated into any model before calculating attention. The rest of the training pipeline remains unchanged.

Table 12: Hyperparameters setup for pretraining the language models. All pretrained models follow the same structure.

Definition of Per-token Context Window

In Figure 4(d), we show the context window distribution difference between IntraDoc and Random. To clarify, the context window size refers to the number of preceding tokens available in the context window when making the next token prediction. This is different from (a) and (b), where the context length L is the model's pretrained context window.

Formally, consider a token at index $i$ and an attention mask matrix $M$, where an entry $M_{i,j} = 0$ indicates that token $i$ can attend to token $j$, and $-\infty$ otherwise. The context window size $C_i$ for the $i$-th token is defined as $C_i = \sum_{j=1}^{i} \mathbb{1}\{M_{i,j} = 0\}$, where $\mathbb{1}\{\cdot\}$ is the indicator function that returns 1 when $M_{i,j} = 0$ and 0 otherwise. In essence, $C_i$ is the number of tokens available as context for the $i$-th token, and the distribution of $C_i$ over all pretraining tokens is shown in Figure 4(d).

For Random, the causal mask is triangular: the $i$-th token has a context window size equal to $i$ (i.e., $C_1 = 1$, $C_2 = 2$, etc.). Thus, the distribution of $C_i$ is uniform. In contrast, IntraDoc effectively shortens the context length by limiting cross-document attention.
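The definition of C_i can be checked with a small sketch; the function name and toy mask below are illustrative:

```python
import numpy as np

def context_window_sizes(mask):
    # mask[i, j] = 0 if token i can attend to token j, -inf otherwise;
    # C_i counts attendable positions j <= i (the token itself included)
    n = mask.shape[0]
    return np.array([(mask[i, : i + 1] == 0).sum() for i in range(n)])

# A plain causal (Random) mask: everything on/below the diagonal is visible,
# so C_i equals the 1-indexed position, giving a uniform distribution.
causal = np.triu(np.full((4, 4), -np.inf), k=1)
```

An IntraDoc mask would additionally block entries across document boundaries, shifting this distribution toward smaller windows.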

Additional Results

Context Window Study


Figure 9: Validation perplexity (evaluated on a sliding window of 512) on models with different context lengths.

Figure 10: Left: Evaluation perplexity of models with different packing or masking strategies. Right: Downstream performance over 9 tasks of different models.

Figure 11: Validation perplexity vs training tokens with different context windows and base of RoPE, θ . Evaluation is done on a sliding window of varying length (x-axis) on the validation documents.

In Figure 9, we plot the validation perplexity of models with different context windows under the Random, IntraDoc, and BM25 settings. We observe a consistent trend: a shorter-context model has lower evaluation perplexity on shorter sequences under all settings.

In Figure 10, we plot the evaluation perplexity and downstream performance of models with different packing or masking strategies. We conclude that overall, IntraDoc achieves the best performance, with a consistently lower PPL and a higher downstream accuracy. We think that this is partially due to the shorter context window that the IntraDoc model is trained on.

Ablations for Context Window Study

Base of RoPE. It has been shown that the base of RoPE may have a significant impact on the model's long-context performance, and a longer context requires a larger base [33]. Therefore, we increase the RoPE base to 100,000, which is sufficiently large according to Men et al. [33]. In Figure 11, we observe an improvement for long-context models on long-context evaluation. However, the large gap between shorter- and longer-context models remains, rejecting the hypothesis that the RoPE base is the key contributing factor to the superior performance of short-context models.
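The role of the base can be seen from RoPE's per-dimension wavelengths: dimension pair i rotates at frequency base^(-2i/d), so a larger base stretches the longest wavelengths and keeps distant positions distinguishable. The head dimension below is an illustrative assumption:

```python
import numpy as np

def rope_wavelengths(base, dim=64):
    # wavelength (in positions) of each rotary dimension pair:
    # frequency is base ** (-2i / dim), wavelength is 2*pi / frequency
    i = np.arange(dim // 2)
    freqs = base ** (-2.0 * i / dim)
    return 2 * np.pi / freqs
```

Raising the base from 10,000 to 100,000 stretches the slowest-rotating pair by roughly an order of magnitude, which is why longer contexts call for larger bases.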

Table 13: Performance of 3B models on long tasks of retrieval-augmented generation (evaluated by exact-match scores) and reading comprehension benchmarks (accuracy in %).

Table 14: Many-shot ICL performance (accuracy) on text classification benchmarks. Numbers in parentheses denote the number of labels for each task.


SkyLadder Evaluation

Statistical Test We test the statistical significance of the performance difference between our models and baselines in Table 1. We use a McNemar test as the two models are evaluated on the same set of questions. The original OLMES suite samples 1000 examples from each benchmark's full evaluation suite. In contrast, when conducting the McNemar test, we evaluate models on the full set to obtain more statistically meaningful results. We note that OpenBookQA only has 500 questions, making it harder to obtain statistical significance.
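For two models scored on the same questions, the test depends only on the discordant counts (questions exactly one model answers correctly). A minimal exact (binomial) variant of the test, with illustrative counts:

```python
import math

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value.
    b: baseline correct / ours wrong; c: baseline wrong / ours correct."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # two-sided tail of Binomial(n, 0.5), clamped at 1
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With 1 vs. 5 discordant questions the p-value is 0.219, far from significance, matching the intuition that small benchmarks make significance hard to reach.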

Reading Comprehension For reading comprehension, we evaluate the following benchmarks: Hotpot QA (2-shot) [58], SQuAD (4-shot) [41], NaturalQuestions (NQ) (2-shot) [26], TriviaQA (2-shot) [23], and RACE-high (0-shot) [27]. We follow the setup by Zhao et al. [64], where NQ and TriviaQA use retrieved documents as contexts. For RACE, we use lm-evaluation-harness [11] to compare the PPL between options.

Long-context Evaluation We provide additional long-context evaluation on our largest 3B model with an 8K context. This is to mitigate the performance instability of using synthetic benchmarks on small models. We first follow [38] to evaluate model accuracy on reading comprehension benchmarks TOEFL [5, 53] and QuALITY [36]. Next, we evaluate the model's performance on Retrieval Augmented Generation (RAG), where the model is provided with many relevant but potentially noisy contexts and needs to locate the correct information. As shown in Table 13, SkyLadder consistently performs better than the baseline across all evaluated RAG and reading comprehension datasets, highlighting its ability to locate correct answers within a lengthy context. In addition, we test the in-context learning ability of the models on text classification benchmarks [64, 44]. Results in Table 14 suggest that SkyLadder shows a significant gain for tasks with many labels, such as DBpedia, while achieving comparable high performance on binary tasks.

Closed-book QA We additionally evaluate the closed-book QA performance of our models without access to any document. We use the evaluation protocol of Zhao et al. [64] to measure the exact match. In Table 15, we notice a significant improvement in our methods compared to the baselines for answering closed-book questions. This is consistent with the results that our models show improvements on standard benchmarks that contain commonsense knowledge.

Dataset Statistics

In this section, we provide detailed statistics of the datasets used in our study. These include the document length distributions of the pretraining corpora, the characteristics of the evaluation datasets, and the input length statistics of standard reasoning benchmarks.

Table 26 reports the document length statistics for the two pretraining corpora, CommonCrawl and FineWeb-Pro. Both distributions are strongly right-skewed, indicating that long documents are rare. Compared to FineWeb-Pro, CommonCrawl generally contains longer documents, while FineWeb-Pro has been more carefully cleaned and filtered.

Table 26: Document length statistics of the pretraining corpora, measured in tokens per document. Mean, median, and standard deviation describe the central tendency and variation. P25 and P75 indicate the 25th and 75th percentiles, while skewness and kurtosis capture distribution asymmetry and tail heaviness.

Table 27 shows the input length characteristics of common reasoning and knowledge benchmarks, including ARC, CSQA, HellaSwag, OBQA, PIQA, SocialIQA, Winogrande, and MMLU. While these benchmarks consist of relatively short contexts, they remain standard for assessing a model's factual consistency and reasoning ability. Importantly, a long-context model should maintain stable behavior even when the user provides a short query.

Finally, Table 28 summarizes the characteristics of the evaluation datasets used in the reading comprehension and long-context evaluation. These include QA benchmarks such as MDQA, RULER, SQuAD, HotpotQA, NQ, TriviaQA, and RACE. The datasets differ substantially in input length, reflecting the diversity of reasoning depth and context complexity.

Table 28: Length statistics of reading comprehension and QA evaluation datasets. These benchmarks capture varying levels of input complexity, from short factual QA to multi-hop reasoning tasks.

Overall, the datasets used in this work span a wide range of input lengths and domains, from large-scale pretraining corpora to short- and long-context evaluation benchmarks, ensuring that our analysis is both comprehensive and representative.



SkyLadder Ablations

We now examine the impact of hyperparameters in SkyLadder scheduling. To manage computational costs, we adopt a default setup of pretraining 120M models with 8K context on 100B CC tokens.

Expansion Rate. We investigate the impact of the expansion rate $\alpha$ in Figure 6 (left). We choose values of $\alpha$ ranging from slowest ($1/12$) to fastest ($1$). Our findings reveal that, for short contexts, performance generally improves as the expansion rate slows down. However, selecting an excessively slow rate (e.g., $1/12$) can negatively affect long-context performance due to insufficient training on longer contexts. Therefore, we recommend setting $\alpha$ to $1/8$ for a good balance.

Initial Context Window. As the final context window length $w_e$ is fixed to $L$, the sole remaining hyperparameter is $w_s$. Intuitively, setting $w_s$ to an excessively large value (e.g., close to $L$) leaves little room for scheduling, resulting in sub-optimal performance. In Figure 6 (right), we show that when $w_s$ is set to a relatively small value (e.g., 8), strong performance can be achieved for both short and long contexts. This suggests that there is still potential for further improvement over our default setup. Therefore, we recommend starting with a small context window, such as 8 tokens.

Scheduling Type. The default scheduling method in SkyLadder is linear scheduling. We evaluate different context window scheduling types (more details in Table 20 and Figure 12 in Appendix A.7.4): (1) Stepwise Linear rounds the window size $w(t)$ to multiples of 1K, resulting in a step function; (2) Sinusoidal increases quickly at the early stage and then slows down; (3) Exponential starts slow but accelerates sharply; (4) the Continual Pretraining setup trains with a 4K context window for ∼97B tokens, then switches to a 32K context for the final 3B tokens. Table 7 shows that linear and sinusoidal schedules outperform the exponential variant on long tasks, likely because the exponential schedule, with extended short-context pretraining at the beginning, fails to adequately train on long contexts. Lastly, the most commonly used continual pretraining setup performs poorly overall, suggesting that abrupt context changes harm both short- and long-context performance. These findings suggest that context window scheduling is superior to both constant long-context pretraining and continual pretraining.
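The four shapes can be sketched as window-size functions of training progress p ∈ [0, 1]; the exact functional forms and the 8-to-8192 range below are illustrative, not the paper's parameterization:

```python
import math

def window_size(p, ws=8, L=8192, kind="linear"):
    """Context window at training progress p in [0, 1]."""
    if kind == "linear":
        frac = p
    elif kind == "sinusoidal":     # grows quickly early, then slows down
        frac = math.sin(0.5 * math.pi * p)
    elif kind == "exponential":    # starts slow, accelerates sharply
        frac = (math.exp(p) - 1) / (math.exp(1) - 1)
    elif kind == "stepwise":       # linear, rounded down to 1K multiples
        return min(L, max(ws, int(ws + (L - ws) * p) // 1024 * 1024))
    else:
        raise ValueError(kind)
    return min(L, max(ws, round(ws + (L - ws) * frac)))
```

At mid-training the sinusoidal schedule has already opened most of the window while the exponential one is still mostly short-context, consistent with the long-task gap observed in Table 7.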

Overall, we conclude that the schedule should start from a small w s and the expansion should be gradual. We leave it to future work to study more advanced schedules and discover optimal configurations. For instance, it is possible that the schedule needs to be adjusted for various model sizes. More ablations for combination with BM25, hybrid attention, cyclic schedules and scheduling under a compute budget can be found in Appendix A.7.4.


# q, k, v: RoPE-encoded query, key, value tensors
# doc_boundaries: EOS token positions per document
# is_intradoc: intra-document attention flag
# training_step: current global step
# L: maximum context window length

# get current window size
w = min(L, get_current_mask_length(training_step))

# breakpoints every w tokens (and at document boundaries if using IntraDoc masking)
mask_boundaries = np.arange(w, L, w)
if is_intradoc:
    mask_boundaries = np.union1d(mask_boundaries, doc_boundaries)

# compute max segment length & cumulative lengths for flash attention
max_seqlen = get_max_seqlen(mask_boundaries, L)
cu_seqlens = get_cu_seqlens(mask_boundaries, L)

attn = flash_attn_varlen_func(q, k, v, cu_seqlens, max_seqlen, causal=True)
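The helpers `get_cu_seqlens` and `get_max_seqlen` are left abstract in the pseudocode; one plausible implementation, assuming `mask_boundaries` holds sorted absolute split positions inside a packed sequence of length `L`:

```python
import numpy as np

def get_cu_seqlens(mask_boundaries, L):
    # cumulative segment offsets [0, b1, ..., L]: flash_attn_varlen_func
    # treats each [offset_k, offset_k+1) slice as an independent sequence
    return np.concatenate(([0], np.asarray(mask_boundaries), [L])).astype(np.int32)

def get_max_seqlen(mask_boundaries, L):
    # length of the longest segment between consecutive boundaries
    return int(np.diff(get_cu_seqlens(mask_boundaries, L)).max())
```

For a current window w = 4 and L = 10, boundaries [4, 8] yield segments of lengths 4, 4, and 2.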


How does context window affect pretraining? To investigate this in a fair and comparable manner, we pretrain language models from scratch with context windows ranging from 512 to 16,384 tokens under a fixed total number of tokens, evaluating via perplexity and downstream task benchmarks. We examine how the context window size impacts model performance, analyzing how data packing and masking strategies interact with window size.



Figure 9: Validation perplexity (evaluated on a sliding window of 512) on models with different context lengths.

Figure 10: Left: Evaluation perplexity of models with different packing or masking strategies. Right: Downstream performance over 9 tasks of different models.

Figure 11: Validation perplexity vs training tokens with different context windows and base of RoPE, θ . Evaluation is done on a sliding window of varying length (x-axis) on the validation documents.


Compute Budget

We conducted all of our experiments for models with ≤1B parameters on an internal cluster of NVIDIA A100 nodes with 40GB memory. Experiments with 3B models were conducted on H100 nodes. There are additional preliminary experiments that we did not include in the paper, which account for a fraction of the total compute. The detailed compute cost for each experiment is as follows: for the preliminary study on context windows, pretraining a 1B model on 100B tokens (with 8K context) takes around 200 hours on a node of 8 A100s. Models of different sizes scale accordingly. For instance, plotting Figure 4(a) and (b) requires a total of 159 days of pretraining on a single node. For SkyLadder experiments, the baseline pretraining using various corpora takes the same time, and SkyLadder speeds up training by 13% to 22% depending on the context length.


How does context window affect pretraining? To investigate this in a fair and comparable manner, we pretrain language models from scratch with context windows ranging from 512 to 16,384 tokens under a fixed total number of tokens, evaluating via perplexity and downstream task benchmarks. We examine how the context window size impacts model performance, analyzing how data packing and masking strategies interact with window size.

Recent advancements in LLM pretraining have featured ever-expanding context windows to process longer sequences. However, our pilot study reveals that models pretrained with shorter context windows consistently outperform their long-context counterparts under a fixed token budget. This finding motivates us to explore an optimal context window scheduling strategy to better balance long-context capability with pretraining efficiency. To this end, we propose SkyLadder, a simple yet effective approach that implements a short-to-long context window transition. SkyLadder preserves strong standard benchmark performance, while matching or exceeding baseline results on long-context tasks. Through extensive experiments, we pre-train 1B-parameter models (up to 32K context) and 3B-parameter models (8K context) on 100B tokens, demonstrating that SkyLadder yields consistent gains of up to 3.7% on common benchmarks, while achieving up to 22% faster training speeds compared to baselines. The code is available at https://github.com/sail-sg/SkyLadder.

The evolution of language models has been marked by a consistent expansion in context window sizes (Figure 1 left). While early models like GPT (Radford, 2018) and BERT (Kenton & Toutanova, 2019) were limited to context windows of 512 tokens, subsequent models have pushed these boundaries significantly. GPT-2 (Radford et al., 2019) doubled this capacity to 1024 tokens, and with the advent of Large Language Models (LLMs) exceeding 1B parameters, the progression continued: Llama (Touvron et al., 2023a) implemented a 2048-token window, Llama-2 (Touvron et al., 2023b) extended it to 4096, and Llama-3 (Dubey et al., 2024) further expanded to 8192 tokens. The push to expand the context window is motivated by the need for models to handle longer sequences during inference. The development is also driven by a widespread belief that models pretrained with longer context windows should perform comparably to, or even surpass, their shorter context counterparts, as extended windows reduce document truncation and preserve coherence (Ding et al., 2024).

We question the common belief that larger context windows actually improve performance. Close inspection of previous work reveals that there has yet to be a fair experimental setup for comparing models across different context windows while adhering to a fixed token budget. Using tightly controlled experiments, we test how changing only the context window size during pretraining impacts model performance. As shown in Figure 1 (right), our results indicate that models pretrained using shorter contexts consistently outperform long-context models when assessed by their average performance over popular benchmarks. In addition, we verify that the performance gap is not eliminated by using advanced document packing strategies (Dubey et al., 2024; Ding et al., 2024; Shi et al., 2024).

For the model to ultimately process long sequences, it does need to be exposed to long sequences during training. However, given the finding that shorter context windows enhance performance on downstream tasks, we face a trade-off between long-context capability and pretraining effectiveness. We propose SkyLadder, a simple yet effective context window scheduling strategy designed to balance both objectives. SkyLadder begins pretraining with a minimal context window (e.g., 8 tokens) and progressively expands it to the long target context window (e.g., 32,768 tokens).

Empirical results on 1B-parameter models (up to 32K context window) and 3B-parameter models (up to 8K context window) on 100B tokens demonstrate that SkyLadder outperforms naive long-context pretraining baselines, in both short- and long-context evaluation tasks. For example, models trained with SkyLadder demonstrate significantly higher accuracy on standard benchmarks (e.g., HellaSwag), and reading comprehension tasks (e.g., HotpotQA), while still maintaining competitive performance on long-context evaluations like RULER. We further investigate the mechanisms behind the superior performance by observing the training dynamics, and discover that SkyLadder exhibits more concentrated and effective attention patterns.

Overall, we suggest that the length of the context window is an important dimension in pretraining and should be scheduled over the course of training. We recommend a progressive approach that begins with a small context of 8 tokens and gradually increases according to a linear function of training steps. Given a target context window (e.g., 32K), we suggest that allocating approximately 60% of the total training tokens to this expansion phase leads to stronger downstream performance compared to baselines. This scheduling strategy optimally enhances both training efficiency and model capability, offering a practical recipe for improving pretraining in language models.

Early work explored gradually increasing the context window, in smaller models like BERT and GPT-2, to improve training stability and efficiency (Nagatsuka et al., 2021; Li et al., 2022; Jin et al., 2023). Notably, Li et al. (2022) proposed length warmup for more stable training but did not show clear performance gains, and Jin et al. (2023) focused on training acceleration in 400M-parameter models. We extend these findings by demonstrating, for the first time, that context window scheduling significantly boosts both efficiency and performance at much larger scales (up to 3B parameters). A parallel approach from Pouransari et al. (2024) segments training documents by length, but Fu et al. (2024) caution that such segmentation can introduce domain biases, particularly since longer texts often cluster in specific domains, such as books. Recent developments in continual pretraining with long context windows (Peng et al., 2024; Wang et al., 2024; Gao et al., 2024b), can also be viewed through the lens of context window scheduling with different strategies (illustrated in Figure 2). Our work represents the first demonstration of both the effectiveness and the efficiency of context window scheduling, providing empirical evidence of its benefits in both standard and long-context benchmarks.

Long-context language models have received a lot of attention due to their ability to capture extended dependencies across large textual windows. Most existing approaches follow a continual pretraining paradigm (Fu et al., 2024; Xiong et al., 2023), which extends a pretrained backbone model to longer contexts through specialized fine-tuning or additional training. Several works propose to intervene in the positional embeddings to accommodate longer sequences (An et al., 2024; LocalLLaMA, 2023; Peng et al., 2024; Chen et al., 2023; Jin et al., 2024), while others perform extended pretraining on longer-sequence corpora (Gao et al., 2024b; Wang et al., 2024; Lu et al., 2024; Zhao et al., 2024a).

Our approach differs from previous methods as we train native long-context models from scratch, rather than modifying a pretrained model in post-training. Compared with a naive long-context pretraining baseline with a constant schedule, our approach delivers substantial gains on multiple long-sequence tasks, underscoring the benefits of training from scratch. These findings suggest that our method can be a promising direction for future research on building language models with longer context windows.

Most modern LLMs are based on a decoder-only transformer architecture (Vaswani et al., 2017) with a fixed context window size denoted by $L$. In contrast, the pretraining corpus, $D=\{d_{1},d_{2},d_{3},\dots,d_{n}\}$, consists of documents with varying lengths different from $L$. Therefore, a key step before pretraining is to pack the documents into sequences of length $L$. Formally, a packed sequence $C_{i}$ is constructed as $C_{i}=\textrm{Trunc}(d_{i,1})\oplus d_{i,2}\oplus\dots\oplus d_{i,n-1}\oplus\textrm{Trunc}(d_{i,n})$, where $\oplus$ represents concatenation, and $\textrm{Trunc}(\cdot)$ denotes truncation of documents to ensure $\textrm{len}(C_{i})=L$. Following previous works (Shi et al., 2024; Zhao et al., 2024b), document boundaries within $C_{i}$ are explicitly marked using end-of-sequence ([EOS]) tokens.
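The packing step above can be sketched in a few lines (a minimal illustration; `EOS_ID` and the greedy fill-then-split strategy are our assumptions, not the paper's exact pipeline):

```python
# Sketch of packing tokenized documents into fixed-length sequences of
# length L, with [EOS] marking document boundaries. EOS_ID and the
# greedy strategy are illustrative assumptions.
EOS_ID = 0

def pack_documents(docs, L):
    """Concatenate documents (with EOS separators) into a flat token
    stream, then split it into chunks of exactly L tokens; the tail
    documents of each chunk are implicitly truncated at the boundary."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(EOS_ID)  # mark the document boundary
    return [stream[i:i + L] for i in range(0, len(stream) - L + 1, L)]

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_documents(docs, L=4)
# every packed sequence has exactly L = 4 tokens
```

Each chunk may thus end mid-document, which is exactly the truncation that intra-document masking later has to contend with.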

After the sequences are packed, the inputs are passed into transformer layers for next-token prediction training. A crucial component of these layers is the attention mechanism, which can be formulated as $A_{i,j}=q_{i}^{\top}k_{j}$, and then $\textrm{Attn}(X)=\textrm{Softmax}(A+M)$. In decoder-only models, a mask $M$ is applied to introduce constraints during training. A common approach is to use a causal mask, which ensures that each position can only attend to previous tokens by masking out (setting to $-\infty$) attention scores corresponding to future positions:

$$M_{i,j}=\begin{cases}0, & j\le i\\ -\infty, & j>i\end{cases}$$

A recently proposed masking scheme, known as intra-doc mask (Zhao et al., 2024b; Dubey et al., 2024), introduces extra constraints among documents. Let each document $d$ have start index $s_{d}$ and end index $e_{d}$; the masking can be denoted as:

$$M^{\textrm{Intra}}_{i,j}=\begin{cases}0, & s_{d}\le j\le i\le e_{d}\ \text{for some document } d\\ -\infty, & \text{otherwise}\end{cases}$$
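As an illustration, the causal and intra-document masks described above can be materialized explicitly (a dense-matrix sketch for clarity; real implementations use fused attention kernels rather than dense masks):

```python
import numpy as np

NEG_INF = float("-inf")

def causal_mask(L):
    """M[i, j] = 0 if j <= i, else -inf (no attention to future tokens)."""
    M = np.full((L, L), NEG_INF)
    M[np.tril_indices(L)] = 0.0
    return M

def intradoc_mask(L, doc_spans):
    """Causal attention additionally restricted to tokens of the same
    document; doc_spans is a list of (start, end) pairs, end exclusive."""
    M = np.full((L, L), NEG_INF)
    for s, e in doc_spans:
        M[s:e, s:e] = causal_mask(e - s)  # causal block per document
    return M

# two documents packed into one sequence of length 6
M = intradoc_mask(6, [(0, 3), (3, 6)])
# token 4 (second doc) may attend to token 3, but not to token 1
```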

The models are then trained with the standard cross-entropy loss on the packed sequences of length $L$. The overall data packing workflow is illustrated in Figure 3.

As per Section 1, we initiate our study by investigating the impact of context window size on model performance through a controlled experiment. Specifically, we pretrain language models with varying context window sizes, while preserving all other experimental settings. This isolates the effects of context window size, enabling our analysis of its influence on model performance. Through this analysis, we aim to understand whether longer context windows inherently lead to better or worse model performance.

The context window size determines the number of tokens included in the context for each packed sequence. However, as discussed earlier, several additional factors influence the content within the context window: (1) Packing methods determine which documents constitute the context window, and different packing strategies can significantly alter the composition of the sequences used for pretraining; (2) Masking methods decide whether cross-document attention is enabled within the same context window. The choice of masking strategy affects how the information from different documents interacts during training.

To study the impact of packing, we employ two widely adopted strategies: random packing and semantic packing. For random packing, documents are randomly concatenated without a specific ordering. For semantic packing, inspired by Shi et al. (2024), we retrieve and concatenate semantically relevant documents from the corpus, aiming to keep them within the same context window. After experimenting with both the dense retriever (Izacard et al., 2021) used in that paper and the classic lexical retriever BM25, we found that BM25 gives stronger performance and chose it as our primary focus. For masking, the baseline approach is causal masking, where each token can attend to all preceding tokens within the same context window, regardless of document boundaries. Conversely, recent studies (Zhao et al., 2024b; Ding et al., 2024) have demonstrated that disabling cross-document attention, thereby enabling intra-document attention, can lead to improved performance. For clarity in subsequent discussions, we denote random packing with causal masking as Random, BM25 packing with causal masking as BM25, and random packing with intra-document masking as IntraDoc.

We pretrain models from scratch using the TinyLlama codebase (Zhang et al., 2024a), and experiment with various model sizes including 120M, 360M, and 1B parameters. Given the substantial computational cost associated with retrieval in semantic packing, we randomly select around 30B tokens from the CommonCrawl (CC) subset of the SlimPajama dataset (Zhang et al., 2020) as the pretraining corpus. All models undergo training for up to 100B tokens ($\sim$3.3 epochs). To ensure consistency across experiments, we maintain strict control over all other experimental settings, retaining the same batch size and learning rate schedule for all context windows. All models also incorporate Rotary Positional Encoding (RoPE; Su et al., 2024) to encode positional information. Appendices A.1 and A.2 give further details on model architecture and training settings.

For all model sizes, we use perplexity (PPL) on validation documents from the original dataset as a key metric, in line with established practices (Fu et al., 2024; Kaplan et al., 2020; Hoffmann et al., 2022). Note that when comparing models across different context windows (e.g., a 2K-context model and an 8K-context model), we must ensure the evaluation sequence fits within the shorter model’s context window to maintain a fair comparison. We also evaluate our 1B models on their performance on downstream tasks using standard benchmarks: HellaSwag (Zellers et al., 2019), ARC-Easy and ARC-Challenge (Clark et al., 2018), Winogrande (Sakaguchi et al., 2021), CommonsenseQA (Talmor et al., 2019), OpenBookQA (Mihaylov et al., 2018), PIQA (Bisk et al., 2020), Social-QA (Sap et al., 2019), and MMLU (Hendrycks et al., 2021). We employ the OLMES suite (Gu et al., 2024b) for the evaluation, as it has been shown to provide reliable and stable results with its provided 5-shot demonstrations (Gao et al., 2024b).
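The fair-comparison evaluation described above can be sketched as scoring fixed-size slices of the validation documents, so that models with different context windows see identical inputs (a minimal illustration; `nll_of_segment` is a hypothetical scoring function returning the summed token negative log-likelihoods of a segment):

```python
import math

def perplexity(docs, nll_of_segment, window=512):
    """Corpus perplexity computed over fixed-size document slices.
    Capping every slice at `window` tokens keeps the comparison fair
    across models with different context windows."""
    total_nll, total_tokens = 0.0, 0
    for doc in docs:
        for i in range(0, len(doc), window):
            seg = doc[i:i + window]
            total_nll += nll_of_segment(seg)  # sum of token NLLs
            total_tokens += len(seg)
    return math.exp(total_nll / total_tokens)
```

For example, a model assigning every token probability 1/2 would yield a perplexity of exactly 2 under this scheme, regardless of the slice size.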

Figure 1 presents the main experimental results, obtained using the Random setting with 1B-parameter models. The results indicate that context window size significantly influences the performance of LLMs, with shorter contexts generally leading to better performance.

To further investigate the factors contributing to this observation, we conduct a comprehensive analysis, examining potential variables that may affect the conclusion. The results are shown in Figure 4, from which we derive four key findings:

As shown in Figure 4, regardless of the model size (a) or packing and masking methods (b), a shorter context window for pretraining generally results in higher average performance on benchmarks. The finding on benchmarks is consistent with the trend of validation PPL, where shorter context windows always yield lower PPL.

When using shorter context windows, one might hypothesize that the model learns positional encoding patterns for nearer positions more frequently, leading to better performance on standard benchmarks. To test the hypothesis, we systematically ablate RoPE by completely excluding it during pretraining, following prior work (Kazemnejad et al., 2023). As shown in Figure 4 (c), models trained with short-context windows exhibit superior performance compared to their long-context counterparts, even in the absence of positional encoding. This suggests that the advantages of shorter context windows are independent of positional encoding.

From Figure 4 (b), we observe that IntraDoc achieves the best validation PPL across all context window sizes compared to Random and BM25, alongside consistently higher performance on standard benchmarks (cf. Appendix A.4.1). This raises the question: why does IntraDoc excel? We attribute the advantage to the context window size distribution of IntraDoc, which implicitly increases the prevalence of shorter contexts. As illustrated in Figure 4 (d), despite the sequence length of 8K, fewer than 1% of context windows actually reach this limit. While prior work links the success of IntraDoc to reduced contextual noise (Zhao et al., 2024b), we identify a complementary factor — reduced average context window size — as a key factor in its strong performance. That is, we hypothesize that the effectiveness of IntraDoc may also be closely tied to short context windows.

We now present SkyLadder for progressively expanding the context window during pretraining.

Inspired by learning rate scheduling, we explore whether dynamically scheduling the context window from short to long during pretraining could lead to performance improvements. This method can be implemented by applying multiple local “mini” causal masks to a long, packed sequence. We illustrate this masking strategy in Figure 5. Formally, we define a local window length $w$. The associated mask $M_{w}$ is defined as follows:

$$M_{w}[i,j]=\begin{cases}0, & \lfloor\tfrac{i}{w}\rfloor w\le j\le i\\ -\infty, & \text{otherwise}\end{cases}$$

where $\lfloor\frac{i}{w}\rfloor w$ calculates the largest multiple of $w$ that is less than or equal to $i$, effectively defining a block-wise attention mask for the query token at position $i$. We linearly adjust the window size upwards by a constant factor per training step:

$$w(t)=\min(w_{s}+\alpha t,\ w_{e})$$

where $w_{e}$ and $w_{s}$ represent the ending and starting context window sizes, respectively. Here, $\alpha$ denotes the rate of expansion, and $t$ corresponds to the training step. As training progresses, once the dynamic context window size $w(t)$ reaches the desired (long) context window size $L=w_{e}$, it remains fixed at that value. At this point, the attention mask is equivalent to a full causal mask.
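The linear schedule can be written compactly (a sketch; the function name and the integer rounding are illustrative choices, not the paper's exact implementation):

```python
def window_size(t, w_s=32, w_e=8192, alpha=1 / 8):
    """Current context window at training step t: grow linearly from
    w_s by alpha tokens per step, then stay fixed at w_e."""
    return min(w_e, int(w_s + alpha * t))

# the window starts tiny and reaches the 8K target after
# (8192 - 32) / (1/8) = 65,280 steps, roughly the 64K steps
# (~64B tokens) quoted for the default setup
assert window_size(0) == 32
assert window_size(65280) == 8192
```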

Notably, this method modifies the effective context window through masking, independent of how the sequences are packed. As such, the mask $M_{w}$ can be integrated with $M^{\textrm{Intra}}$, which maintains the attention boundaries between documents; it can be seamlessly combined with most packing and masking strategies.

We follow the same setup in Section 3.2 to pretrain language models with 8K context on 100B tokens. We set $w_{s}=32$ and $\alpha=1/8$ by default, which means that a model roughly needs 64K steps (around 64B tokens) to reach the final desired context window of $L=8192$. We fix all other hyperparameters, such as the learning rate schedule, batch size, etc., for fair comparison. Due to resource constraints, we do not perform extensive hyper-parameter search to obtain the best combinations of $w(t)$, $\alpha$, and $w_{s}$. In our ablation study, we demonstrate that the selection of these hyper-parameters has a negligible impact on performance, provided that they fall within a reasonable range.

For evaluation, we use the same evaluation suite mentioned in Section 3.2 with standard benchmarks. To evaluate long-context question answering within an 8K token length, we utilize the 30-document setting from the Multi-Document QA (MDQA) benchmark (Liu et al., 2023). This setting contains tasks with an average length of approximately 6K tokens and is widely adopted in prior work (Pouransari et al., 2024; Zhao et al., 2024b). We also select synthetic tasks within RULER (Hsieh et al., 2024), as defined by Yen et al. (2024). We choose the setup of each task that fills up the model’s target context window $L$.

Tables 1 and 2 present the main results, highlighting significant improvements achieved by SkyLadder across both standard benchmarks and reading comprehension tasks. For instance, compared to the Random baseline, integrating SkyLadder yields notable performance gains on standard tasks such as MMLU (+2.5%), ARC-E (+7.4%), and HellaSwag (+4%). This suggests that models with SkyLadder excel at learning common knowledge during pretraining. Additionally, our method further improves the performance of the strong baseline IntraDoc across all evaluated benchmarks. Meanwhile, for realistic long-context benchmarks like MDQA, our approach consistently matches or exceeds baseline performance.

To address potential concerns that the benefits observed in short contexts might stem from the high level of noise in the CommonCrawl corpus, we conducted additional experiments using the FineWeb-Pro dataset (Zhou et al., 2024), a heavily cleaned and carefully curated high-quality dataset containing 100B tokens. As shown in Table 4, the improved data quality indeed leads to substantial performance gains across most benchmarks. However, our key findings remain consistent: the IntraDoc approach continues to outperform the Random approach, and SkyLadder consistently delivers significant improvements over both baselines. This demonstrates that our method generalizes well across different types of corpora, regardless of their quality.

We conduct experiments across three model sizes: 120M, 360M, and 3B parameters on the FineWeb-Pro dataset. Table 4 demonstrates that models utilizing SkyLadder consistently achieve better standard benchmark performance at all model sizes. For long-context tasks, our method does not benefit 120M models, possibly due to their limited capacity for processing long sequences. However, the performance gain on 3B models remains prominent. We observe a positive scaling trend: as the model size grows, the performance improvement also increases. This reveals the potential of applying our method to even larger models.

To examine whether SkyLadder can effectively scale to longer context windows, we trained 1B models with a 32K context window on the FineWeb-Pro dataset, which contains 100B tokens. As the target context window $L$ increases, we adjusted $\alpha$ to $1/2$ to ensure that the final context window expands to 32K before the end of pretraining. As shown in Table 6, our model demonstrates strong performance on both standard and long benchmarks. In addition, the performance difference of SkyLadder (0.9%) between the 8K and 32K models is largely reduced compared with the baseline approach (1.8%), which alleviates the performance degradation described in our earlier study. Notably, compared to the baseline Random approach, SkyLadder trains the model on progressively shorter contexts during earlier stages. This reveals a counterintuitive insight: naively training a model with a long context window is not always optimal, and strategic scheduling of the context window during pretraining can yield better results.

To examine whether SkyLadder is generalizable to different types of pretraining beyond natural language tasks, following Ding et al. (2024), we pretrain 1B code models from scratch on 100B Python code tokens tokenized by the Starcoder tokenizer (Li et al., 2023). We observe a lower training loss ($\sim$0.9) for code pretraining compared to natural language pretraining ($\sim$2.1), suggesting that the structure in code makes it easier for the model to learn. However, as shown in Table 5, we still observe significant improvement when applying SkyLadder under both greedy decoding and sampling setups, especially when the target context length is 32K. This demonstrates the potential of applying SkyLadder to coding and reasoning tasks beyond natural language modelling.

We now examine the impact of different hyperparameters in the scheduling of SkyLadder. To manage computational costs, we adopt a default setup where a 120M model is pretrained on 100B tokens from CommonCrawl with an 8K context window.

We investigate the impact of the expansion rate $\alpha$ in Figure 6 (left). We choose different values of $\alpha$ ranging from slowest ($1/12$) to fastest ($1$). Our findings reveal that, for short contexts, performance generally improves as the expansion rate slows down. However, selecting an excessively slow expansion rate (e.g., $1/12$) can negatively affect long-context performance, as the model receives insufficient exposure to longer contexts during pretraining. Therefore, we recommend setting $\alpha$ to $1/8$ for a good balance.

As the final context window length $w_{e}$ is fixed to $L$, the sole remaining hyper-parameter is $w_{s}$. Intuitively, setting $w_{s}$ to an excessively large value (e.g., close to $L$) leaves little room for scheduling, resulting in sub-optimal performance. In Figure 6 (right), we demonstrate that when $w_{s}$ is set to a relatively small value (e.g., 8), strong performance can be achieved for both short and long contexts. This suggests that there is still potential for further improvement over our default setup. Therefore, we recommend starting with a small context window, such as 8 tokens.

The default scheduling method in SkyLadder is linear scheduling. We evaluate different context window scheduling types (more details in Table 15 and Figure 11 in Appendix A.4.4): (1) Stepwise Linear rounds the window size $w(t)$ to multiples of 1K, resulting in a step function; (2) Sinusoidal increases quickly early and then slows down; (3) Exponential starts slow but accelerates sharply; (4) the Continual pretraining setup trains with 4K context windows for $\sim$97B tokens, then switches to 32K context for the final 3B tokens. Results in Table 8 show that linear and sinusoidal scheduling outperform the exponential variant on long benchmarks, likely because exponential scheduling, with an extended period of short contexts at the beginning of training, fails to adequately train on long contexts. Lastly, the most commonly used continual pretraining setup performs poorly overall, suggesting that abrupt context changes harm both short- and long-task performance. These findings suggest that context window scheduling is superior to both constant long-context pretraining and continual pretraining.
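The schedule variants compared above can be sketched as functions of training progress (illustrative functional forms under our own assumptions; the paper's exact parameterizations may differ):

```python
import math

def scheduled_window(progress, w_s, w_e, kind="linear"):
    """Window size as a function of progress in [0, 1] over the
    expansion phase. Sinusoidal grows fast early and slows down;
    exponential starts slow and accelerates sharply."""
    if kind == "linear":
        frac = progress
    elif kind == "sinusoidal":
        frac = math.sin(progress * math.pi / 2)          # fast early, slow late
    elif kind == "exponential":
        frac = (math.exp(progress) - 1) / (math.e - 1)   # slow early, fast late
    elif kind == "stepwise":
        # round the linear schedule down to multiples of 1K tokens
        return max(w_s, int((w_s + progress * (w_e - w_s)) // 1024) * 1024)
    else:
        raise ValueError(kind)
    return int(w_s + frac * (w_e - w_s))

# all variants start near w_s and end exactly at w_e; at the halfway
# point, sinusoidal > linear > exponential in window size
```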

Overall, we conclude that the schedule should start from a small $w_{s}$ and the expansion should be gradual. We leave it to future work to study more advanced schedules and discover optimal configurations. For instance, it is possible that the schedule needs to be adjusted for various model sizes. More ablations for long-to-short scheduling, combination with BM25, cyclic schedules, and scheduling under a compute budget can be found in Appendix A.4.4.

We observe a significant reduction in training time when employing our method. Table 8 illustrates the relative training time efficiency for models with different context window sizes. On 8K models, SkyLadder accelerates training by 13% due to the reduced context window in calculating attention. When we increase the context window size to 32K, the efficiency gain becomes even more pronounced: our method saves 22% of training time while achieving better performance.

We next investigate why SkyLadder, despite being trained on short contexts overall, consistently outperforms the baseline. As language models rely on attention mechanisms to encode context information, we first study how attention patterns change as SkyLadder adjusts its context window. Specifically, during pretraining, we monitor the dynamics of (i) attention entropy (solid lines in Figure 7), where a lower entropy is associated with better downstream performance (Zhang et al., 2024b); (ii) attention sink (Xiao et al., 2024), where the initial token in the context receives disproportionately high attention. We also utilize the metric in Gu et al. (2024a) to quantitatively measure the amplitude of attention sink. As shown in Figure 7 (dashed lines), compared with the baseline Random, SkyLadder demonstrates reduced attention entropy, suggesting a more concentrated attention pattern. However, a slower emergence and lower amplitude of attention sink are simultaneously observed. This suggests that SkyLadder’s attention is concentrated on the key information in the context instead of the initial token, which explains the performance gain.

We compare our method with another effective approach for improving pretraining in Table 9. As discussed in Section 2, Pouransari et al. (2024) proposed Dataset Decomposition (DD), which segments a document into sequences of varying lengths and applies a curriculum during pretraining. However, this approach inevitably introduces data bias, as document lengths differ across domains (Fu et al., 2024). This explains why DD with only one cycle fails to outperform the IntraDoc baseline. To mitigate this, the authors suggested iterating through multiple cycles of long and short data, which does improve performance substantially. In contrast, our method avoids such biases, as it does not alter the order of data based on length, and achieves better performance. In Appendix A.4.4, we experimented with various cyclic schedules but did not observe any improvements. In fact, we noticed loss spikes between cycles, indicating potential issues with domain shifts. This further supports that our method is safer, since it does not disrupt the natural ordering and distribution of the data.

We conduct a comprehensive controlled study of the impact of context window on pretraining, revealing that a shorter context window is more beneficial to the model’s performance on standard benchmarks. We therefore propose SkyLadder to schedule the context window over the course of training, which gives substantial improvement in downstream performance and compute efficiency. We conclude that context window scheduling is an important dimension for pretraining. In the future, we plan to explore more dynamic and performant scheduling strategies that adapt to model size or data distribution.

In Table 10, we list the architecture choices of the models trained, including the 120M, 360M, and 1B models based on the TinyLlama architecture (Zhang et al., 2024a). The 3B model is based on Llama3.2 architecture (Dubey et al., 2024).

We include details of the training configurations in Table 11. All models, irrespective of size or context window length, are trained with this same set of hyperparameters. For most hyperparameter values, we follow the TinyLlama (Zhang et al., 2024a) project, so our results are highly reproducible.

We provide the pseudocode for implementing SkyLadder with Flash Attention 2 (Dao, 2024). The only change is to apply local causal masks of size $w$, and to combine them with the original document boundaries under the IntraDoc scenario. It can easily be integrated into any model before calculating attention. The rest of the training pipeline remains unchanged.

In Figure 8, we plot the validation perplexity of models with different context windows under the Random, IntraDoc, and BM25 settings. We observe a consistent trend across all settings: a shorter-context model achieves lower evaluation perplexity on shorter sequences.

In Figure 10, we plot the evaluation perplexity and downstream performance of models with different packing or masking strategies. Overall, IntraDoc achieves the best performance, with consistently lower perplexity and higher downstream accuracy. We attribute this partially to the shorter effective context window that the IntraDoc model is trained on.

It has been shown that the RoPE base may have a significant impact on a model's long-context performance, and that a longer context requires a larger base (Men et al., 2024). Therefore, we increase the RoPE base to 100,000, which is sufficiently large according to Men et al. (2024). In Figure 10, we observe an improvement for long-context models on long-context evaluation. However, the large gap between shorter- and longer-context models remains, thereby rejecting the hypothesis that the RoPE base is the key factor behind the superior performance of short-context models.

For reading comprehension, we evaluate the following benchmarks: Hotpot QA (2-shot) (Yang et al., 2018), SQuAD (4-shot) (Rajpurkar et al., 2016), NaturalQuestions (NQ) (2-shot) (Kwiatkowski et al., 2019), TriviaQA (2-shot) (Joshi et al., 2017), and RACE-high (0-shot) (Lai et al., 2017). We follow the setup by Zhao et al. (2024b), where NQ and TriviaQA use retrieved documents as contexts. For RACE, we use lm-evaluation-harness (Gao et al., 2024a) to compare the PPL between options.

We additionally evaluate the closed-book QA performance of our models without access to any document. In Table 12, we observe a significant improvement of our method over the baselines on closed-book questions. This is consistent with the improvements our models show on standard benchmarks that test commonsense knowledge.

One possible reason SkyLadder outperforms the baseline on standard benchmarks, which are typically short, is that the training data mix contains more short-context data after the mask is applied. To isolate the effect of data distribution, we conduct an ablation that reverses the original short-to-long schedule, which we call the long-to-short schedule. This schedule spends the same number of tokens (64B) in the changing phase, followed by constant training at L=8K for the remaining 36B tokens. In Table 14, we show that the long-to-short schedule does not help the model's performance on either short or long evaluation tasks. This highlights that the context window needs to be scheduled from short to long, rather than simply having a data mixture of long and short contexts.

As SkyLadder only changes the context length via masking, without altering the underlying data, it is orthogonal to advanced data packing methods such as Shi et al. (2024) and Ding et al. (2024). In Table 14, we combine SkyLadder with the BM25 packing method. The resulting model achieves even better performance on both short- and long-context evaluation than BM25 without scheduling, which in turn outperforms the Random baseline. This shows that our method can be combined with more advanced packing techniques to further boost pretraining performance.

We explore various types of short-to-long scheduling following different functions, as mentioned in Section 4.5. Table 15 details each schedule as a function of t, and Figure 11 illustrates the different schedule types. In Table 8, we show that a smoother increase following the sinusoidal schedule works best for long-context evaluation, while also achieving strong performance on standard benchmarks.

We illustrate the effect of the expansion rate α in Figure 13. As the evaluation is done on 8K contexts, models with a lower rate (and thus a shorter context window for longer) have higher loss while the evaluation length remains out-of-distribution. Eventually, however, every model's loss converges to a low level once its schedule reaches 8K. Detailed validation losses after pretraining can be found in Table 16. Following previous work (Fu et al., 2024; Hoffmann et al., 2022; Kaplan et al., 2020), we consider a loss difference larger than 0.01 as significant. We conclude that a moderate rate of 1/8 balances both short- and long-context loss.

Inspired by cyclic learning rate schedules (Smith, 2017), we also ask whether cycles help the context window schedule. In Figure 13, we show two cyclic schedules. In the "Jump" schedule, w(t) drops back to w_s immediately after reaching L. The "Gradual" schedule, on the other hand, traces an "M" shape, alternating between w_e and w_s. Notably, with the discontinuous Jump schedule we observe a significant increase in long-context perplexity while training on only short contexts for an extended period. However, once w increases back to L, the performance recovers.

In Table 17, we show that these schedules have no major impact on the final performance. This highlights that our method does not introduce additional bias in data selection: unlike existing methods such as Pouransari et al. (2024), which train on short data first and long data later, we assume no such curriculum over the data. We argue that the context window size should be independent of document lengths, to avoid biasing training toward certain domains of data.

We show the effect of varying w_s, the initial window length at the start of training. In Table 18, the optimal starting length is 8 tokens, and the trend holds for both α=1/4 and α=1/8. This suggests that the starting length should be sufficiently small, irrespective of the expansion rate. It also suggests that prior studies, such as Jin et al. (2023) and Pouransari et al. (2024), which start with an initial length of 256, could be suboptimal.

We show that our method still improves language model performance when the total number of tokens is limited. In Table 19, we choose 12.5B, 25B, and 50B total tokens as the compute budget, and vary the expansion rate so that w reaches L at the same relative point during training. Under all token budgets, the trend is consistent: gradually expanding the context window gives better performance than a rapid increase.

A possible alternative to SkyLadder's local causal masks is sliding window attention with a window size w(t) that changes over training time. Formally, the mask becomes

$$ M_{i,j}=\begin{cases}0&\text{if }i-w\leq j\leq i\\ -\infty&\text{otherwise,}\end{cases} $$

so that each token attends to a fixed preceding context of size w. When w(t) reaches L, the mask becomes equivalent to a causal mask. We compare the two in Table 20 and observe that the sliding window approach performs slightly better on long tasks and slightly worse on standard benchmarks. This is likely because, overall, more tokens see long preceding contexts under the sliding window approach. In both cases, SkyLadder outperforms the Random baseline. Future work could further investigate the differences between the causal and sliding window implementations of SkyLadder, such as the formation of attention sinks (Gu et al., 2024a).
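As an illustrative sketch of this alternative (again a dense mask for clarity; an actual implementation would use a windowed attention kernel rather than materializing the matrix):

```python
import numpy as np

def sliding_window_mask(seq_len: int, w: int) -> np.ndarray:
    """Additive mask for sliding window attention: query i may attend
    key j iff i - w <= j <= i. Returns 0 where attention is allowed
    and -inf where it is blocked."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    allowed = (j <= i) & (j >= i - w)
    return np.where(allowed, 0.0, -np.inf)

mask = sliding_window_mask(seq_len=5, w=2)
```

Once w reaches seq_len - 1, every past position falls inside the window, so the mask coincides with a plain causal mask, matching the convergence behavior described above.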

Table: S4.T1: Performance (accuracy in %) of 1B models pretrained on CC with different methods on standard and long benchmarks.

| Method | Std. Avg. | ARC-E | ARC-C | CSQA | HS | OBQA | PIQA | SIQA | WG | MMLU | Long Avg. | MDQA | RULER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 46.3 | 58.0 | 32.7 | 49.6 | 43.0 | 40.2 | 64.8 | 46.4 | 51.9 | 29.9 | 15.3 | 17.7 | 12.8 |
| + SkyLadder | 50.0 (+3.7) | 65.4 | 35.6 | 56.8 | 47.0 | 42.8 | 64.8 | 48.9 | 56.0 | 32.4 | 14.3 | 18.3 | 10.3 |
| IntraDoc | 47.4 | 61.8 | 33.4 | 52.7 | 45.6 | 38.0 | 64.3 | 45.7 | 54.8 | 30.5 | 13.0 | 15.3 | 10.6 |
| + SkyLadder | 49.3 (+1.9) | 64.8 | 33.8 | 55.4 | 47.9 | 39.4 | 66.1 | 48.0 | 56.4 | 31.8 | 13.2 | 15.6 | 10.7 |

Table: S4.T5: Performance (in %) of 1B models pretrained on 100B Python code data. We follow the protocol of Huang et al. (2024) to evaluate on HumanEval (Chen et al., 2021) and BigCodeBench (Zhuo et al., 2024). t is the sampling temperature. SkyLadder shows consistent improvement, especially for 32K-context models.

| L | Method | HumanEval Pass@1 (greedy) | HumanEval Pass@10 (t=0.8) | HumanEval Pass@100 (t=0.8) | BigCodeBench Pass@1 (greedy) | BigCodeBench Pass@10 (t=0.8) | BigCodeBench Pass@20 (t=0.8) |
|---|---|---|---|---|---|---|---|
| 32K | Random | 17.7 | 32.4 | 51.8 | 9.0 | 16.1 | 19.7 |
| 32K | + SkyLadder | 21.3 | 37.7 | 59.8 | 9.4 | 20.6 | 24.3 |
| 8K | Random | 22.0 | 37.2 | 61.0 | 9.9 | 19.3 | 23.6 |
| 8K | + SkyLadder | 23.2 | 38.2 | 63.4 | 11.3 | 20.0 | 24.1 |

Table: S4.T8: Comparison of 1B models trained with a 32K context window using different scheduling methods. Numbers are average accuracy (in %).

| Method | Long | Standard |
|---|---|---|
| Constant Long (32K) | 9.7 | 50.7 |
| Linear (32 → 32K, default) | 13.5 | 54.3 |
| Stepwise Linear (32 → 32K) | 13.3 | 55.3 |
| Sinusoidal (32 → 32K) | 14.2 | 54.2 |
| Exponential (32 → 32K) | 11.5 | 54.7 |
| Cont. Pretraining (4K → 32K) | 2.7 | 52.9 |

Table: S4.T9: Comparison between SkyLadder and Dataset Decomposition (DD) on 1B models trained with 100B FineWeb-Pro tokens.

| Model | Standard Avg. | Long Avg. |
|---|---|---|
| IntraDoc | 54.3 | 12.7 |
| + SkyLadder | 54.8 (+0.5) | 13.9 (+1.2) |
| + DD (1 cycle) | 53.9 (-0.4) | 12.3 (-0.4) |
| + DD (8 cycles) | 54.5 (+0.2) | 13.5 (+0.8) |

Table: A1.T10: Model Configuration

| Model | TinyLlama 1B | TinyLlama 120M | TinyLlama 360M | Llama 3.2 3B |
|---|---|---|---|---|
| Vocab Size | 32000 | 32000 | 32000 | 32000 |
| Layers | 22 | 12 | 18 | 28 |
| Heads | 32 | 12 | 16 | 24 |
| Embedding Dim | 2048 | 768 | 1024 | 3072 |
| Intermediate Size | 5632 | 2048 | 4096 | 8192 |
| Normalization | RMSNorm | RMSNorm | RMSNorm | RMSNorm |
| Normalization ε | 1e-5 | 1e-5 | 1e-5 | 1e-5 |
| Query Groups | 4 | 1 | 16 | 8 |
| Bias | No | No | No | No |
| RoPE θ | 10000 (L=8K); 1000000 (L=32K) | 10000 | 10000 | 100000 |

Table: A1.T12: 1B model (trained on CC) performance on closed-book QA tasks. Performance is reported in exact match (%) using the evaluation script of Zhao et al. (2024b).

| Model | NQ | TriviaQA | Average |
|---|---|---|---|
| Random | 6.1 | 11.9 | 9.0 |
| + SkyLadder | 9.0 | 17.5 | 13.2 |
| IntraDoc | 7.8 | 14.7 | 11.3 |
| + SkyLadder | 8.2 | 17.4 | 12.8 |

Table: A1.T14: Performance (%) of 1B models with different schedule types. All models are trained on the same 100B FineWeb-Pro tokens with a final context length of 8K. Short-to-long scheduling is consistently better than long-to-short scheduling.

| Schedule | Standard Avg. | Long Avg. |
|---|---|---|
| No Scheduling | 52.5 | 11.1 |
| Short-to-Long | 55.2 (+2.7) | 12.3 (+1.2) |
| Long-to-Short | 52.6 (+0.1) | 10.7 (-0.4) |

Table: A1.T15: Functions for different context window schedule types. We set w_s = 32 and w_e = 32768 in our experiments. The rounding granularity r is set to 1024.

| Schedule | Function |
|---|---|
| Constant | $w_e$ |
| Linear | $w_s + (w_e - w_s)\frac{\alpha x}{w_e - w_s}$ |
| Stepwise | $\max\left(w_s,\; r \times \left\lfloor L(x)/r \right\rfloor\right)$ |
| Sinusoidal | $w_s + (w_e - w_s)\sin\left(\frac{\alpha\pi x}{2(w_e - w_s)}\right)$ |
| Exponential | $w_s \times \left(\frac{w_e}{w_s}\right)^{\frac{\alpha x}{w_e - w_s}}$ |

Table: A1.T16: Validation loss with different expansion rates. A cell is colored red if it is significantly worse (difference > 0.01) than the best in its column. L_e is the evaluation context length. All models are of size 120M and trained on 100B tokens.

| Rate (1/α) | Tokens to Reach 8K (B) | L_e = 512 | L_e = 4K | L_e = 8K |
|---|---|---|---|---|
| 1 | 8 | 2.751 | 2.563 | 2.522 |
| 2 | 16 | 2.741 | 2.551 | 2.514 |
| 4 | 32 | 2.740 | 2.551 | 2.515 |
| 8 | 64 | 2.732 | 2.553 | 2.519 |
| 9 | 72 | 2.731 | 2.553 | 2.519 |
| 10 | 80 | 2.732 | 2.555 | 2.522 |
| 11 | 88 | 2.730 | 2.554 | 2.521 |
| 12 | 96 | 2.729 | 2.557 | 2.526 |
| Baseline (Constant) | – | 2.780 | 2.590 | 2.549 |

Table: A1.T18: Final validation loss after training 120M models on 100B tokens with different w_s when α = 1/4 and α = 1/8. L_e represents the context length of evaluation. A cell is colored red if its loss differs by more than 0.01 from the column's best. w_s = 8192 equals no scheduling.

| w_s | L_e = 512 | L_e = 4K | L_e = 8K |
|---|---|---|---|
| α = 1/4 | | | |
| 4 | 2.731 | 2.546 | 2.510 |
| 8 | 2.730 | 2.545 | 2.508 |
| 16 | 2.733 | 2.551 | 2.513 |
| 32 | 2.740 | 2.551 | 2.515 |
| 64 | 2.742 | 2.557 | 2.520 |
| 128 | 2.748 | 2.564 | 2.528 |
| 256 | 2.750 | 2.566 | 2.527 |
| α = 1/8 | | | |
| 4 | 2.727 | 2.549 | 2.515 |
| 8 | 2.725 | 2.545 | 2.510 |
| 16 | 2.729 | 2.550 | 2.516 |
| 32 | 2.732 | 2.553 | 2.519 |
| 64 | 2.735 | 2.553 | 2.519 |
| 128 | 2.743 | 2.564 | 2.530 |
| 256 | 2.748 | 2.567 | 2.531 |
| 8192 (no scheduling) | 2.780 | 2.590 | 2.549 |

Table: A1.T19: Final validation loss under different training token budgets and expansion rates α with 120M models. L_e represents the context length used for evaluation. "% of Token Budget" means how many tokens are spent in the expansion phase with w(t) increasing. Under all token budgets, we observe a consistent improvement when around 64% of tokens are spent in expansion and 36% in the stable phase.

| α | Tokens to L (B) | % of Token Budget | L_e = 512 | L_e = 4096 | L_e = 8192 |
|---|---|---|---|---|---|
| Token Budget = 12.5B | | | | | |
| 1 | 8 | 64% | 2.912 | 2.732 | 2.698 |
| 2 | 4 | 32% | 2.933 | 2.746 | 2.709 |
| 4 | 2 | 16% | 2.958 | 2.767 | 2.729 |
| 8 | 1 | 8% | 2.976 | 2.782 | 2.743 |
| Baseline | – | – | 3.008 | 2.823 | 2.790 |
| Token Budget = 25B | | | | | |
| 1/2 | 16 | 64% | 2.829 | 2.650 | 2.617 |
| 1 | 8 | 32% | 2.841 | 2.656 | 2.619 |
| 2 | 4 | 16% | 2.851 | 2.665 | 2.626 |
| 4 | 2 | 8% | 2.873 | 2.683 | 2.645 |
| Baseline | – | – | 2.918 | 2.734 | 2.700 |
| Token Budget = 50B | | | | | |
| 1/4 | 32 | 64% | 2.771 | 2.590 | 2.556 |
| 1/2 | 16 | 32% | 2.781 | 2.596 | 2.560 |
| 1 | 8 | 16% | 2.789 | 2.603 | 2.564 |
| 2 | 4 | 8% | 2.795 | 2.607 | 2.567 |
| Baseline | – | – | 2.839 | 2.652 | 2.616 |

Figure: Left: Pretraining context window of LLMs grows over time. Right: Average performance across nine downstream tasks for 1B-parameter models with varying pretrained context window sizes. Increasing the context window size degrades the overall performance.

Figure: Schematic comparison of training-time context window scheduling.

Figure: An illustration of the workflow for pretraining data preparation, highlighting several critical decisions: the method of data packing, the type of attention mask to employ (causal mask or intra-doc mask), and the choice of context window length L.

Figure: Ablation studies of different factors across context window sizes. Note that the validation PPL is obtained on the validation documents with a sliding window of 512 tokens.

Figure: An illustration of Random and IntraDoc along with SkyLadder. The example shows a packed sequence (length L) consisting of two documents. For SkyLadder, the context window w starts from a small value and dynamically adjusts during training, eventually converging to the masking patterns of Random and IntraDoc, respectively.

Figure: Validation PPL on 512 and 8K contexts for models with different expansion rates α (left) and initial window lengths w_s (right).

Figure: Dynamics of attention sink and entropy during pretraining of 1B models with an 8K context. SkyLadder delays the emergence of the attention sink while lowering the overall entropy, indicating a more effective attention pattern.

Figure: Left: Evaluation perplexity of models with different packing or masking strategies. Right: Downstream performance over 9 tasks for different models.

Figure: Plot of various scheduling types.

Figure: An illustration of the effect of different α. Dashed lines represent the current context window w at each step, and solid lines are the loss evaluated at 8K length.

$$ M_{i,j}=\begin{cases}0&\text{if }i\geq j\\ -\infty&\text{otherwise}\end{cases} $$ \tag{S3.Ex1}

$$ w(t)=\min(w_{e},w_{s}+\lfloor\alpha t\rfloor) $$ \tag{S4.Ex4}
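In code, this linear expansion amounts to the following minimal sketch, assuming t counts training steps and α is the per-step window increment:

```python
import math

def window_size(t: int, w_s: int = 32, w_e: int = 32768, alpha: float = 0.125) -> int:
    """w(t) = min(w_e, w_s + floor(alpha * t)): grow the context window
    by alpha tokens per step, starting from w_s and capping at w_e."""
    return min(w_e, w_s + math.floor(alpha * t))
```

With α = 1/8, the window grows by one token every eight steps until it reaches the full context length w_e, after which training continues at the full window.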

We acknowledge several prior works that discover a similar short-to-long pretraining pattern. For instance, Li et al. [28] find that a sequence-length warmup over the initial steps of pretraining improves model stability. However, they focus mostly on stability of the training loss and do not show a clear performance gain across multiple evaluations and larger scales. Moreover, we demonstrate that the benefits of scheduling a model's context window go beyond the warmup stage: in Table 21's first row, simply warming up the model with 8B tokens results in suboptimal performance compared to a slower expansion rate. This validates that the context window should be considered as a factor to schedule over the entire training course, which also differentiates us from Li et al. [28], who consider only the warmup stage.

Table 25: Performance (%) of 1B models with different masking schemes. All models are trained on the same 100B FineWeb-Pro tokens with a final context length of 8K. Both implementations of SkyLadder outperform the baseline, and the sliding window approach excels at long tasks with a slight performance drop on standard benchmarks.

Another related work is Jin et al. [21], where the authors use progressively longer sequence lengths to accelerate training. However, their method leads to worse performance under the same token budget, while SkyLadder offers both time savings and performance improvements with the same number of tokens. We suspect this is due to the suboptimal schedule they used. Moreover, their study is limited to observing the training loss of small models (up to 410M parameters), whereas we comprehensively show performance gains across multiple corpora, model sizes, context sizes, and a wide variety of tasks. Overall, we systematically conduct controlled experiments on the impact of context window scheduling in pretraining, providing insights that help explain these previous studies.

Compute Information

We conducted all experiments for models of size ≤ 1B on an internal cluster of NVIDIA A100 nodes with 40GB memory. Experiments with 3B models were conducted on H100 nodes. Additional preliminary experiments not included in the paper account for a fraction of the total compute. The detailed compute for each experiment is as follows. For the preliminary study on context windows, pretraining a 1B model on 100B tokens (with 8K context) takes around 200 hours on a node of 8 A100s; models of other sizes scale accordingly. For instance, producing Figure 4(a) and (b) requires a total of 159 days of pretraining on a single node. For the SkyLadder experiments, baseline pretraining on the various corpora takes the same time, and SkyLadder speeds up training by 13% to 22% depending on the context length.

Dataset Statistics

In this section, we provide detailed statistics of the datasets used in our study. These include the document length distributions of the pretraining corpora, the characteristics of the evaluation datasets, and the input length statistics of standard reasoning benchmarks.

Table 26 reports the document length statistics for the two pretraining corpora, CommonCrawl and FineWeb-Pro. Both distributions are strongly right-skewed, indicating that long documents are rare. Compared to FineWeb-Pro, CommonCrawl generally contains longer documents, while FineWeb-Pro has been more carefully cleaned and filtered.

Table 26: Document length statistics of the pretraining corpora, measured in tokens per document. Mean, median, and standard deviation describe the central tendency and variation. P25 and P75 indicate the 25th and 75th percentiles, while skewness and kurtosis capture distribution asymmetry and tail heaviness.

Table 27 shows the input length characteristics of common reasoning and knowledge benchmarks, including ARC, CSQA, HellaSwag, OBQA, PIQA, SocialIQA, Winogrande, and MMLU. While these benchmarks consist of relatively short contexts, they remain standard for assessing a model's factual consistency and reasoning ability. Importantly, a long-context model should maintain stable behavior even when the user provides a short query.

Finally, Table 28 summarizes the characteristics of the evaluation datasets used in the reading comprehension and long-context evaluation. These include QA benchmarks such as MDQA, RULER, SQuAD, HotpotQA, NQ, TriviaQA, and RACE. The datasets differ substantially in input length, reflecting the diversity of reasoning depth and context complexity.

Table 28: Length statistics of reading comprehension and QA evaluation datasets. These benchmarks capture varying levels of input complexity, from short factual QA to multi-hop reasoning tasks.

Overall, the datasets used in this work span a wide range of input lengths and domains, from large-scale pretraining corpora to short- and long-context evaluation benchmarks, ensuring that our analysis is both comprehensive and representative.

Licenses of Assets


We mainly use the following public datasets or codebases in this paper: SlimPajama [46] following the CommonCrawl Foundation Terms of Use 3 , FineWeb-Pro [65] with an ODC-By 1.0 license, and TinyLlama [61] with an Apache 2.0 License.

3 https://commoncrawl.org/terms-of-use

| Method | RC Avg. | HotpotQA | SQuAD | NQ | TriviaQA | RACE-h | Long Avg. | MDQA | RULER |
|---|---|---|---|---|---|---|---|---|---|
| Random | 25.5 | 6.5 | 37.0 | 15.8 | 37.7 | 30.7 | 15.3 | 17.7 | 12.8 |
| + SkyLadder | 30.2 (+4.7) | 12.4 | 40.2 | 20.4 | 43.0 | 35.0 | 14.3 | 18.3 | 10.3 |
| IntraDoc | 28.7 | 11.4 | 39.0 | 18.2 | 42.3 | 32.3 | 13.0 | 15.3 | 10.6 |
| + SkyLadder | 29.1 (+0.4) | 11.0 | 38.5 | 20.4 | 41.5 | 34.3 | 13.2 | 15.6 | 10.7 |
| Method | Standard | Long |
|---|---|---|
| Random | 52.5 | 11.1 |
| + SkyLadder | 55.2 (+2.7) | 12.3 (+1.2) |
| IntraDoc | 54.3 | 12.7 |
| + SkyLadder | 54.8 (+0.5) | 13.9 (+1.2) |
| Size | Method | Standard | Long |
|---|---|---|---|
| 120M | Random | 40.1 | 5.8 |
| 120M | + SkyLadder | 41.2 (+1.1) | 5.1 (-0.7) |
| 360M | Random | 47.2 | 8.9 |
| 360M | + SkyLadder | 49.6 (+2.4) | 8.9 |
| 3B | Random | 57.0 | 15.8 |
| 3B | + SkyLadder | 60.5 (+3.5) | 19.3 (+3.5) |
| Method | Standard | Long |
|---|---|---|
| Random | 50.7 | 9.7 |
| + SkyLadder | 54.3 (+3.6) | 13.5 (+3.8) |
| IntraDoc | 54.0 | 13.0 |
| + SkyLadder | 54.9 (+0.9) | 14.4 (+1.4) |
| Method | Time (%) | FLOPs (10^20) |
|---|---|---|
| Random (8K) | 100.0% | 11.6 |
| + SkyLadder | 86.9% (-13.1%) | 9.9 (-14.7%) |
| Random (32K) | 100.0% | 25.5 |
| + SkyLadder | 77.8% (-22.2%) | 18.8 (-26.3%) |
| Context | Volatility ↓ (w=10) | Smoothness ↓ | Mean Loss Ratio ↓ | Avg Grad Norm ↓ |
|---|---|---|---|---|
| 1K | 0.023 | 0.019 | 1.014 | 0.335 |
| 2K | 0.026 | 0.023 | 1.017 | 0.338 |
| 4K | 0.030 | 0.029 | 1.020 | 0.340 |
| 8K | 0.036 | 0.036 | 1.025 | 0.347 |
| 16K | 0.041 | 0.042 | 1.036 | 0.416 |
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| AdamW β1 | 0.9 |
| AdamW β2 | 0.95 |
| Learning Rate Schedule | Cosine |
| Peak Learning Rate | 4e-4 |
| Minimum Learning Rate | 4e-5 |
| Warmup Steps | 2000 |
| Gradient Norm Clipping | 1 |
| Total Steps | 100,000 |
| Global Batch Size | 1,048,576 (2^20) tokens |
| Weight Decay | 0.1 |
| Model | RAG Avg. | NQ | TriviaQA | HotpotQA | PopQA | RC Avg. | TOEFL | QuALITY |
|---|---|---|---|---|---|---|---|---|
| Random | 30.3 | 24.3 | 45.2 | 29.3 | 22.5 | 37.1 | 43.5 | 30.6 |
| + SkyLadder | 35.5 | 27.8 | 52.7 | 32.3 | 29.3 | 39.4 | 48.0 | 30.9 |
| Model | Avg. | DBpedia (14) | AGNews (4) | Amazon (2) | Yelp (2) | SST2 (2) |
|---|---|---|---|---|---|---|
| Random | 73.9 | 17.4 | 68.6 | 94.3 | 94.7 | 94.5 |
| + SkyLadder | 76.5 | 25.5 | 75.8 | 94.1 | 95.0 | 92.2 |
| Method | Standard Avg. | Long Avg. |
|---|---|---|
| Random | 46.3 | 15.3 |
| BM25 | 47.5 (+1.2) | 16.4 (+1.1) |
| + SkyLadder | 49.8 (+3.5) | 17.0 (+1.7) |
| Model | L_e = 512 | L_e = 4K | L_e = 8K |
|---|---|---|---|
| Random - 120M | 15.9 | 13.4 | 13.0 |
| + SkyLadder | 15.5 | 12.9 | 12.4 |
| Random - 360M | 12.1 | 10.2 | 9.8 |
| + SkyLadder | 11.6 | 9.7 | 9.4 |

| Model | L_e = 1K | L_e = 4K | L_e = 8K |
|---|---|---|---|
| Random | 14.8 | 13.1 | 12.5 |
| + SkyLadder | 14.3 | 12.7 | 12.1 |
| Type | Number of Cycles | Tokens per Cycle (B) | L_e = 512 | L_e = 8K |
|---|---|---|---|---|
| Random | – | – | 2.780 | 2.549 |
| + SkyLadder | – | – | 2.732 | 2.519 |
| Gradual | 4.5 | 16 | 2.743 | 2.530 |
| Jump | 9 | 8 | 2.744 | 2.532 |
| Gradual | 2.5 | 32 | 2.732 | 2.521 |
| Jump | 5 | 16 | 2.733 | 2.521 |
| Gradual | 1.5 | 64 | 2.728 | 2.524 |
| Jump | 3 | 32 | 2.727 | 2.522 |
| Model | Standard Avg. | Long Avg. |
|---|---|---|
| Random | 52.5 | 11.1 |
| + SkyLadder w/ local causal | 55.2 (+2.7) | 12.3 (+1.2) |
| + SkyLadder w/ sliding window | 54.4 (+1.9) | 12.8 (+1.7) |
| Dataset | Mean | Median | StdDev | Min | Max | P25 | P75 | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|---|
| CommonCrawl | 1973 | 1067 | 4567 | 45 | 594,272 | 651 | 1867 | 21 | 820 |
| FineWeb-Pro | 1364 | 849 | 2295 | 12 | 30,949 | 507 | 1481 | 15 | 533 |
| Metric | ARC-C | ARC-E | CSQA | HellaSwag | OBQA | PIQA | SIQA | WinoG. | MMLU |
|---|---|---|---|---|---|---|---|---|---|
| Mean | 222 | 216 | 146 | 508 | 129 | 224 | 226 | 149 | 540 |
| Std | 20 | 16 | 7 | 32 | 9 | 29 | 6 | 4 | 512 |
| Min | 191 | 194 | 134 | 435 | 116 | 191 | 210 | 140 | 155 |
| Max | 401 | 336 | 203 | 576 | 192 | 440 | 262 | 170 | 3144 |
| Metric | MDQA | RULER | SQuAD | HotpotQA | NQ | TriviaQA | RACE |
|---|---|---|---|---|---|---|---|
| Mean | 5150 | 7259 | 1048 | 5010 | 583 | 566 | 492 |
| Std | 287 | 745 | 81 | 993 | 21 | 28 | 121 |
| Min | 4172 | 6209 | 923 | 3587 | 536 | 529 | 122 |
| Max | 6755 | 8061 | 1174 | 7842 | 633 | 643 | 1323 |
MethodAvg.ARC-EARC-CCSQAHSOBQAPIQASIQAWGMMLU
Random46.358.032.749.643.040.264.846.451.929.9
+ SkyLadder50.0 (+3.7)65.4 ∗35.6 ∗56.8 ∗47.0 ∗42.864.848.9 ∗56.0 ∗32.4 ∗
IntraDoc47.461.833.452.745.63864.345.754.830.5
+ SkyLadder49.3 (+1.9)64.8 ∗33.855.4 ∗47.9 ∗39.466.1 ∗48.0 ∗56.431.8 ∗
MethodReading Comprehension BenchmarksReading Comprehension BenchmarksReading Comprehension BenchmarksReading Comprehension BenchmarksReading Comprehension BenchmarksReading Comprehension BenchmarksLong BenchmarksLong BenchmarksLong Benchmarks
Avg.HotpotQASQuADNQTriviaQARACE-hAvg.MDQARULER
Random25.56.537.015.837.730.715.317.712.8
+ SkyLadder30.2 (+4.7)12.440.220.443.035.014.318.310.3
IntraDoc28.711.439.018.242.332.313.015.310.6
+ SkyLadder29.1 (+0.4)11.038.520.441.534.313.215.610.7
HumanEvalHumanEvalHumanEvalBigCodeBenchBigCodeBenchBigCodeBench
GreedySampling ( t = 0 . 8 )Sampling ( t = 0 . 8 )GreedySampling ( t = 0 . 8 )Sampling ( t = 0 . 8 )
LMethodPass@1Pass@10Pass@100Pass@1Pass@10Pass@20
32KRandom17.732.451.89.016.119.7
+ SkyLadder21.337.759.89.420.624.3
8KRandom22.037.261.09.919.323.6
+ SkyLadder23.238.263.411.320.024.1
MethodStandardLong
Random52.511.1
+ SkyLadder55.2 (+2.7)12.3 (+1.2)
IntraDoc54.312.7
+ SkyLadder54.8 (+0.5)13.9 (+1.2)
SizeMethodStandardLong
120MRandom + SkyLadder40.1 41.2 (+1.1)5.8 5.1 (-0.7)
360MRandom + SkyLadder47.2 49.6 (+2.4)8.9 8.9
3BRandom + SkyLadder57.0 60.5 (+3.5)15.8 19.3 (+3.5)
MethodStandardLong
Random50.79.7
+ SkyLadder54.3 (+3.6)13.5 (+3.8)
IntraDoc54.013.0
+ SkyLadder54.9 (+0.9)14.4 (+1.4)
MethodLongStandard
Constant Long (32K)9.750.7
Linear (32 → 32K, default )13.554.3
Stepwise Linear (32 → 32K)13.355.3
Sinusoidal (32 → 32K)14.254.2
Exponential (32 → 32K)11.554.7
Cont. Pretrain (4K → 32K)1052.9
MethodTime (%)FLOPs ( 10 20 )
Random (8K) + SkyLadder100.0% 86.9% (-13.1%)11.6
Random (32K)100.0%9.9 (-14.7%)
+ SkyLadder25.5
77.8% (-22.2%)18.8 (-26.3%)
ContextVolatility ↓ ( w =10 )Smoothness ↓Mean Loss Ratio ↓Avg Grad Norm ↓
1K0.0230.0191.0140.335
2K0.0260.0231.0170.338
4K0.030.0291.020.34
8K0.0360.0361.0250.347
16K0.0410.0421.0360.416
ModelStandardLong
IntraDoc54.312.7
+ SkyLadder54.8 (+0.5)13.9 (+1.2)
+ DD (1 cycle)53.9 (-0.4)12.3 (-0.4)
+ DD (8 cycles)54.5 (+0.2)13.5 (+0.8)
ModelTinyllama 1BTinyllama 120MTinyllama 360MLlama3.2 3B
Vocab Size32000320003200032000
Layers22121828
Heads32121624
Embedding Dim204876810243072
Intermediate Size5632204840968192
NormalizationRMSNormRMSNormRMSNormRMSNorm
Normalization ϵ1 × 10 - 51 × 10 - 51 × 10 - 51 × 10 - 5
Query Groups41168
BiasNoNoNoNo
RoPE θ10000 if L = 8 K 1000000 if L = 32 K1000010000100000
ParameterValue
Optimizer AdamW- β 1 AdamW- β 2 Learning Rate Schedule Peak Learning Rate Minimum Learning Rate Warmup Steps Gradient Norm Clipping Total Steps Global Batch Size Weight DecayAdamW 0.9 0.95 Cosine 4e-4 4e-5 2000 1 100,000 1,048,576 ( 2 20 ) tokens 0.1
Retrieval Augmented GenerationRetrieval Augmented GenerationRetrieval Augmented GenerationRetrieval Augmented GenerationRetrieval Augmented GenerationReading ComprehensionReading ComprehensionReading Comprehension
ModelAvg.NQTriviaQAHotpotQAPopQAAvg.TOEFLQuALITY
Random30.324.345.229.322.537.143.530.6
+ SkyLadder35.527.852.732.329.339.448.030.9
ModelAvg.DBpedia (14)AGNews (4)Amazon (2)Yelp (2)SST2 (2)
Random73.917.468.694.394.794.5
+ SkyLadder76.525.575.894.19592.2
Closed-book QAClosed-book QAClosed-book QA
ModelNQTriviaQAAverage
Random6.111.99.0
+ SkyLadder9.017.513.2
IntraDoc7.814.711.3
+ SkyLadder8.217.412.8
Standard Avg.Long Avg.
Random46.315.3
BM2547.5 (+1.2)16.4 (+1.1)
+ SkyLadder49.8 (+3.5)17.0 (+1.7)
Standard Avg.Long Avg.
No Scheduling52.511.1
Short-to-Long55.2 (+2.7)12.3 (+1.2)
Long-to-Short52.6 (+0.1)10.7 (-0.4)
ModelL e = 512L e = 4 KL e = 8 K
Random - 120M15.913.413
+ SkyLadder15.512.912.4
Random - 360M12.110.29.8
+ SkyLadder11.69.79.4
ModelL e = 1 KL e = 4 KL e = 8 K
Random14.813.112.5
+ SkyLadder14.312.712.1
ScheduleFunction
Constant Linear Stepwise Sinusoidal Exponentialw e w s +( w e - w s ) αx w e - w s max( w s , r × ⌊ L ( x ) r ⌋ ) w s +( w e - w s ) sin ( απx 2( w e - w s ) ) w s × ( w e w s ) αx we - ws
Rate ( 1 /α )Tokens to Reach 8K (B)L e = 512L e = 4 KL e = 8 K
182.7512.5632.522
2162.7412.5512.514
4322.742.5512.515
8642.7322.5532.519
9722.7312.5532.519
10802.7322.5552.522
11882.732.5542.521
12962.7292.5572.526
Baseline (Constant)2.782.592.549
TypeNumber of CyclesTokens per Cycle (B)L e = 512L e = 8 K
Random2.782.549
+ SkyLadder2.7322.519
Gradual4.5162.7432.53
Jump982.7442.532
Gradual2.5322.7322.521
Jump5162.7332.521
Gradual1.5642.7282.524
Jump3322.7272.522
w sL e = 512L e = 4 KL e = 8 K
α = 1 / 4α = 1 / 4α = 1 / 4α = 1 / 4
42.7312.5462.510
82.7302.5452.508
162.7332.5512.513
322.7402.5512.515
642.7422.5572.520
1282.7482.5642.528
2562.7502.5662.527
α = 1 / 8α = 1 / 8α = 1 / 8α = 1 / 8
42.7272.5492.515
82.7252.5452.510
162.7292.5502.516
322.7322.5532.519
642.7352.5532.519
1282.7432.5642.530
2562.7482.5672.531
81922.7802.5902.549
| α | Tokens to Reach 8K (B) | % of Token Budget | L_e = 512 | L_e = 4096 | L_e = 8192 |
|---|---|---|---|---|---|
| **Token Budget = 12.5B** | | | | | |
| 1 | 8 | 64% | 2.912 | 2.732 | 2.698 |
| 2 | 4 | 32% | 2.933 | 2.746 | 2.709 |
| 4 | 2 | 16% | 2.958 | 2.767 | 2.729 |
| 8 | 1 | 8% | 2.976 | 2.782 | 2.743 |
| Baseline | — | — | 3.008 | 2.823 | 2.790 |
| **Token Budget = 25B** | | | | | |
| 1/2 | 16 | 64% | 2.829 | 2.650 | 2.617 |
| 1 | 8 | 32% | 2.841 | 2.656 | 2.619 |
| 2 | 4 | 16% | 2.851 | 2.665 | 2.626 |
| 4 | 2 | 8% | 2.873 | 2.683 | 2.645 |
| Baseline | — | — | 2.918 | 2.734 | 2.700 |
| **Token Budget = 50B** | | | | | |
| 1/4 | 32 | 64% | 2.771 | 2.590 | 2.556 |
| 1/2 | 16 | 32% | 2.781 | 2.596 | 2.560 |
| 1 | 8 | 16% | 2.789 | 2.603 | 2.564 |
| 2 | 4 | 8% | 2.795 | 2.607 | 2.567 |
| Baseline | — | — | 2.839 | 2.652 | 2.616 |
| Model | Standard Avg. | Long Avg. |
|---|---|---|
| Random | 52.5 | 11.1 |
| + SkyLadder w/ local causal | 55.2 (+2.7) | 12.3 (+1.2) |
| + SkyLadder w/ sliding window | 54.4 (+1.9) | 12.8 (+1.7) |
| Dataset | Mean | Median | Std Dev | Min | Max | P25 | P75 | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|---|
| CommonCrawl | 1973 | 1067 | 4567 | 45 | 594,272 | 651 | 1867 | 21 | 820 |
| FineWeb-Pro | 1364 | 849 | 2295 | 12 | 30,949 | 507 | 1481 | 15 | 533 |
| Metric | ARC-C | ARC-E | CSQA | HellaSwag | OBQA | PIQA | SIQA | WinoG. | MMLU |
|---|---|---|---|---|---|---|---|---|---|
| Mean | 222 | 216 | 146 | 508 | 129 | 224 | 226 | 149 | 540 |
| Std | 20 | 16 | 7 | 32 | 9 | 29 | 6 | 4 | 512 |
| Min | 191 | 194 | 134 | 435 | 116 | 191 | 210 | 140 | 155 |
| Max | 401 | 336 | 203 | 576 | 192 | 440 | 262 | 170 | 3144 |
| Metric | MDQA | RULER | SQuAD | HotpotQA | NQ | TriviaQA | RACE |
|---|---|---|---|---|---|---|---|
| Mean | 5150 | 7259 | 1048 | 5010 | 583 | 566 | 492 |
| Std | 287 | 745 | 81 | 993 | 21 | 28 | 121 |
| Min | 4172 | 6209 | 923 | 3587 | 536 | 529 | 122 |
| Max | 6755 | 8061 | 1174 | 7842 | 633 | 643 | 1323 |


References

[zhuo2024bigcodebench] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Binyuan Hui, Niklas Muennighoff, David Lo, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro Von Werra. (2024). BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions.

[li2023starcoder] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries. (2023). StarCoder: may the source be with you!.

[chen2021codex] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba. (2021). Evaluating Large Language Models Trained on Code.

[OpenCodeEval2024] richardodliu. (2024). OpenCodeEval: A Framework for Evaluating Code Generation Models.

[huang2024opencoderopencookbooktoptier] Siming Huang, Tianhao Cheng, J. K. Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J. Yang, J. H. Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Yuan, Zhaoxiang Zhang, Jie Fu, Qian Liu, Ge Zhang, Zili Wang, Yuan Qi, Yinghui Xu, Wei Chu. (2024). OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models.

[eval-harness] Gao, Leo, Tow, Jonathan, Abbasi, Baber, Biderman, Stella, Black, Sid, DiPofi, Anthony, Foster, Charles, Golding, Laurence, Hsu, Jeffrey, Le Noac'h, Alain, Li, Haonan, McDonell, Kyle, Muennighoff, Niklas, Ociepa, Chris, Phang, Jason, Reynolds, Laria, Schoelkopf, Hailey, Skowron, Aviya, Sutawika, Lintang, Tang, Eric, Thite, Anish, Wang, Ben, Wang, Kevin, Zou, Andy. A framework for few-shot language model evaluation. doi:10.5281/zenodo.12608602.

[lai-etal-2017-race] Lai, Guokun, Xie, Qizhe, Liu, Hanxiao, Yang, Yiming, Hovy, Eduard. (2017). RACE: Large-scale ReAding Comprehension Dataset from Examinations. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/D17-1082.

[joshi-etal-2017-triviaqa] Joshi, Mandar, Choi, Eunsol, Weld, Daniel, Zettlemoyer, Luke. (2017). TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/P17-1147.

[rajpurkar-etal-2016-squad] Rajpurkar, Pranav, Zhang, Jian, Lopyrev, Konstantin, Liang, Percy. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/D16-1264.

[yang-etal-2018-hotpotqa] Yang, Zhilin, Qi, Peng, Zhang, Saizheng, Bengio, Yoshua, Cohen, William, Salakhutdinov, Ruslan, Manning, Christopher D.. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/D18-1259.

[an2024doeseffectivecontextlength] Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, Lingpeng Kong. (2024). Why Does the Effective Context Length of LLMs Fall Short?.

[yen2024helmetevaluatelongcontextlanguage] Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, Danqi Chen. (2025). HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly. International Conference on Learning Representations (ICLR).

[hoffmann2022training] Hoffmann, Jordan, Borgeaud, Sebastian, Mensch, Arthur, Buchatskaya, Elena, Cai, Trevor, Rutherford, Eliza, de Las Casas, Diego, Hendricks, Lisa Anne, Welbl, Johannes, Clark, Aidan, Hennigan, Tom, Noland, Eric, Millican, Katie, van den Driessche, George, Damoc, Bogdan, Guy, Aurelia, Osindero, Simon, Simonyan, Karen, Elsen, Erich, Vinyals, Oriol, Rae, Jack W., Sifre, Laurent. (2022). Training compute-optimal large language models. Proceedings of the 36th International Conference on Neural Information Processing Systems.

[kaplan2020scaling] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei. (2020). Scaling Laws for Neural Language Models.

[dao2023flashattention2] Dao, Tri. (2024). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. International Conference on Learning Representations (ICLR).

[kwiatkowski-etal-2019-natural] Kwiatkowski, Tom, Palomaki, Jennimaria, Redfield, Olivia, Collins, Michael, Parikh, Ankur, Alberti, Chris, Epstein, Danielle, Polosukhin, Illia, Devlin, Jacob, Lee, Kenton, Toutanova, Kristina, Jones, Llion, Kelcey, Matthew, Chang, Ming-Wei, Dai, Andrew M., Uszkoreit, Jakob, Le, Quoc, Petrov, Slav. (2019). Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics. doi:10.1162/tacl_a_00276.

[lee-etal-2019-latent] Lee, Kenton, Chang, Ming-Wei, Toutanova, Kristina. (2019). Latent Retrieval for Weakly Supervised Open Domain Question Answering. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. doi:10.18653/v1/P19-1612.

[zhang2024attentionentropykeyfactor] Zhisong Zhang, Yan Wang, Xinting Huang, Tianqing Fang, Hongming Zhang, Chenlong Deng, Shuaiyi Li, Dong Yu. (2024). Attention Entropy is a Key Factor: An Analysis of Parallel Context Encoding with Full-attention-based Pre-trained Language Models.

[gu2024attentionsinkemergeslanguage] Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, Min Lin. (2024). When Attention Sink Emerges in Language Models: An Empirical View.

[smith2017cyclical] Smith, Leslie N. (2017). Cyclical learning rates for training neural networks. 2017 IEEE winter conference on applications of computer vision (WACV).

[kazemnejad2023impactpositionalencodinglength] Kazemnejad, Amirhossein, Padhi, Inkit, Ramamurthy, Karthikeyan Natesan, Das, Payel, Reddy, Siva. (2023). The impact of positional encoding on length generalization in transformers. Proceedings of the 37th International Conference on Neural Information Processing Systems.

[men2024baseropeboundscontext] Xin Men, Mingyu Xu, Bingning Wang, Qingyu Zhang, Hongyu Lin, Xianpei Han, Weipeng Chen. (2024). Base of RoPE Bounds Context Length.

[fu2024dataengineeringscalinglanguage] Fu, Yao, Panda, Rameswar, Niu, Xinyao, Yue, Xiang, Hajishirzi, Hannaneh, Kim, Yoon, Peng, Hao. (2024). Data engineering for scaling language models to 128K context. Proceedings of the 41st International Conference on Machine Learning.

[li2022stabilityefficiencydilemmainvestigatingsequence] Li, Conglong, Zhang, Minjia, He, Yuxiong. (2022). The stability-efficiency dilemma: investigating sequence length warmup for training GPT models. Proceedings of the 36th International Conference on Neural Information Processing Systems.

[hsieh2024ruler] Hsieh, Cheng-Ping, Sun, Simeng, Kriman, Samuel, Acharya, Shantanu, Rekesh, Dima, Jia, Fei, Ginsburg, Boris. (2024). RULER: What's the Real Context Size of Your Long-Context Language Models?. First Conference on Language Modeling.

[izacard2021contriever] Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, Edouard Grave. (2021). Unsupervised Dense Information Retrieval with Contrastive Learning. doi:10.48550/ARXIV.2112.09118.

[ding2024fewertruncationsimprovelanguage] Ding, Hantian, Wang, Zijian, Paolini, Giovanni, Kumar, Varun, Deoras, Anoop, Roth, Dan, Soatto, Stefano. (2024). Fewer truncations improve language modeling. Proceedings of the 41st International Conference on Machine Learning.

[wang2024precision] Haonan Wang, Qian Liu, Chao Du, Tongyao Zhu, Cunxiao Du, Kenji Kawaguchi, Tianyu Pang. (2024). When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training.

[gao2024prolong] Tianyu Gao, Alexander Wettig, Howard Yen, Danqi Chen. (2025). How to Train Long-Context Language Models (Effectively).

[OLMES] Gu, Yuling, Tafjord, Oyvind, Kuehl, Bailey, Haddad, Dany, Dodge, Jesse, Hajishirzi, Hannaneh. (2025). OLMES: A Standard for Language Model Evaluations. Findings of the Association for Computational Linguistics: NAACL 2025.

[sap-etal-2019-social] Sap, Maarten, Rashkin, Hannah, Chen, Derek, Le Bras, Ronan, Choi, Yejin. (2019). Social IQa: Commonsense Reasoning about Social Interactions. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). doi:10.18653/v1/D19-1454.

[mmlu] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt. (2021). Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR).

[OpenBookQA2018] Mihaylov, Todor, Clark, Peter, Khot, Tushar, Sabharwal, Ashish. (2018). Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/D18-1260.

[PIQA] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi. (2020). PIQA: Reasoning about Physical Commonsense in Natural Language. Thirty-Fourth AAAI Conference on Artificial Intelligence.

[talmor-etal-2019-commonsenseqa] Talmor, Alon, Herzig, Jonathan, Lourie, Nicholas, Berant, Jonathan. (2019). CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. doi:10.18653/v1/N19-1421.

[jin2023growlengthacceleratingllmspretraining] Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Chia-Yuan Chang, Xia Hu. (2023). GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length.

[pouransari2024dataset] Pouransari, Hadi, Li, Chun-Liang, Chang, Jen-Hao, Anasosalu Vasu, Pavan Kumar, Koc, Cem, Shankar, Vaishaal, Tuzel, Oncel. (2024). Dataset decomposition: Faster llm training with variable sequence length curriculum. Advances in Neural Information Processing Systems.

[zhou2024programming] Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, Pengfei Liu. (2024). Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale.

[liu-etal:2023:arxiv] Liu, Nelson F., Lin, Kevin, Hewitt, John, Paranjape, Ashwin, Bevilacqua, Michele, Petroni, Fabio, Liang, Percy. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. doi:10.1162/tacl_a_00638.

[lin2024rho] Lin, Zhenghao, Gou, Zhibin, Gong, Yeyun, Liu, Xiao, Shen, Yelong, Xu, Ruochen, Lin, Chen, Yang, Yujiu, Jiao, Jian, Duan, Nan, others. (2024). Rho-1: Not all tokens are what you need. arXiv preprint arXiv:2404.07965.

[shicontext] Shi, Weijia, Min, Sewon, Lomeli, Maria, Zhou, Chunting, Li, Margaret, Lin, Xi Victoria, Smith, Noah A, Zettlemoyer, Luke, Yih, Wen-tau, Lewis, Mike. (2024). In-Context Pretraining: Language Modeling Beyond Document Boundaries. The Twelfth International Conference on Learning Representations.

[jinllm] Jin, Hongye, Han, Xiaotian, Yang, Jingfeng, Jiang, Zhimeng, Liu, Zirui, Chang, Chia-Yuan, Chen, Huiyuan, Hu, Xia. (2024). LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning. Forty-first International Conference on Machine Learning.

[pang-etal-2022-quality] Pang, Richard Yuanzhe, Parrish, Alicia, Joshi, Nitish, Nangia, Nikita, Phang, Jason, Chen, Angelica, Padmakumar, Vishakh, Ma, Johnny, Thompson, Jana, He, He, Bowman, Samuel. (2022). QuALITY: Question Answering with Long Input Texts, Yes!. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. doi:10.18653/v1/2022.naacl-main.391.

[kimiteam2025kimik2openagentic] Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, et al. (2025). Kimi K2: Open Agentic Intelligence.

[chung2018supervisedunsupervisedtransferlearning] Yu-An Chung, Hung-Yi Lee, James Glass. (2018). Supervised and Unsupervised Transfer Learning for Question Answering.

[tseng2016machinecomprehensionspokencontent] Bo-Hsiang Tseng, Sheng-Syun Shen, Hung-Yi Lee, Lin-Shan Lee. (2016). Towards Machine Comprehension of Spoken Content: Initial TOEFL Listening Comprehension Test by Machine.

[gemmateam2025gemma3technicalreport] Team, Gemma, Kamath, Aishwarya, Ferret, Johan, Pathak, Shreya, Vieillard, Nino, Merhej, Ramona, Perrin, Sarah, Matejovicova, Tatiana, Ramé, Alexandre, et al. (2025). Gemma 3 Technical Report.

[chen2023extending] Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian. (2023). Extending Context Window of Large Language Models via Positional Interpolation.

[zhang2024tinyllamaopensourcesmalllanguage] Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, Wei Lu. (2024). TinyLlama: An Open-Source Small Language Model.

[cerebras2023slimpajama] Soboleva, Daria, Al-Khateeb, Faisal, Myers, Robert, Steeves, Jacob R, Hestness, Joel, Dey, Nolan. (2023). SlimPajama: A 627B token cleaned and deduplicated version of RedPajama.

[zhao2024analysing] Zhao, Yu, Qu, Yuanbin, Staniszewski, Konrad, Tworkowski, Szymon, Liu, Wei, Miłoś, Piotr, et al. (2024). Analysing The Impact of Sequence Composition on Language Model Pre-Training. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2024.acl-long.427.

[zellers2019hellaswag] Zellers, Rowan, Holtzman, Ari, Bisk, Yonatan, Farhadi, Ali, Choi, Yejin. (2019). HellaSwag: Can a Machine Really Finish Your Sentence?. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

[hendrycksmeasuring] Hendrycks, Dan, Burns, Collin, Basart, Steven, Zou, Andy, Mazeika, Mantas, Song, Dawn, Steinhardt, Jacob. Measuring Massive Multitask Language Understanding. International Conference on Learning Representations.

[radford2019language] Radford, Alec, Wu, Jeffrey, Child, Rewon, Luan, David, Amodei, Dario, Sutskever, Ilya, others. (2019). Language models are unsupervised multitask learners. OpenAI blog.

[radford2018improving] Radford, Alec. (2018). Improving language understanding by generative pre-training. OpenAI blog.

[kenton2019bert] Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, Toutanova, Kristina. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. doi:10.18653/v1/N19-1423.

[touvron2023llama] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. (2023). LLaMA: Open and Efficient Foundation Language Models.

[su2024roformer] Su, Jianlin, Ahmed, Murtadha, Lu, Yu, Pan, Shengfeng, Bo, Wen, Liu, Yunfeng. (2024). Roformer: Enhanced transformer with rotary position embedding. Neurocomputing.

[allenai:arc] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord. (2018). Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge.

[sakaguchi2021winogrande] Sakaguchi, Keisuke, Bras, Ronan Le, Bhagavatula, Chandra, Choi, Yejin. (2021). Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM.

[dubey2024llama] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, others. (2024). The Llama 3 Herd of Models.

[xiao2024efficient] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis. (2024). Efficient Streaming Language Models with Attention Sinks. The Twelfth International Conference on Learning Representations.

[touvronllama2] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. CoRR. doi:10.48550/ARXIV.2307.09288.

[transformer2017] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz, Polosukhin, Illia. (2017). Attention is All you Need. Advances in Neural Information Processing Systems.

[peng2024yarn] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole. (2024). YaRN: Efficient Context Window Extension of Large Language Models. The Twelfth International Conference on Learning Representations, ICLR 2024.

[ge2023model] Ge, Suyu, Zhang, Yunan, Liu, Liyuan, Zhang, Minjia, Han, Jiawei, Gao, Jianfeng. (2023). Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801.

[ye2024differential] Ye, Tianzhu, Dong, Li, Xia, Yuqing, Sun, Yutao, Zhu, Yi, Huang, Gao, Wei, Furu. (2024). Differential transformer. arXiv preprint arXiv:2410.05258.

[ntk] LocalLLaMA. (2023). NTK-Aware Scaled RoPE Allows LLaMA Models to Have Extended (8k+) Context Size Without Any Fine-Tuning.

[Xiong2023EffectiveLS] Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oğuz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, Hao Ma. (2023). Effective Long-Context Scaling of Foundation Models. North American Chapter of the Association for Computational Linguistics.

[Varis_2021] Yi Lu, Jing Nathan Yan, Songlin Yang, Justin T. Chiu, Siyu Ren, Fei Yuan, Wenting Zhao, Zhiyong Wu, Alexander M. Rush. (2024). A Controlled Study on Long Context Extension and Generalization in LLMs. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

[zhao2024longskywork] Liang Zhao, Tianwen Wei, Liang Zeng, Cheng Cheng, Liu Yang, Peng Cheng, Lijie Wang, Chenxia Li, Xuejie Wu, Bo Zhu, Yimeng Gan, Rui Hu, Shuicheng Yan, Han Fang, Yahui Zhou. (2024). LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models.

[nagatsuka2021pre] Nagatsuka, Koichi, Broni-Bediako, Clifford, Atsumi, Masayasu. (2021). Pre-training a BERT with Curriculum Learning by Increasing Block-Size of Input Text. Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021).

[bib1] An et al. (2024) Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, and Lingpeng Kong. Why does the effective context length of llms fall short?, 2024. URL https://arxiv.org/abs/2410.18745.

[bib2] Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.

[bib3] Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021. URL https://arxiv.org/abs/2107.03374.

[bib4] Chen et al. (2023) Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.

[bib5] Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge. arXiv:1803.05457v1, 2018.

[bib6] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024.

[bib7] Ding et al. (2024) Hantian Ding, Zijian Wang, Giovanni Paolini, Varun Kumar, Anoop Deoras, Dan Roth, and Stefano Soatto. Fewer truncations improve language modeling, 2024. URL https://arxiv.org/abs/2404.10830.

[bib8] Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, pp. arXiv–2407, 2024.

[bib9] Fu et al. (2024) Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao Peng. Data engineering for scaling language models to 128k context, 2024. URL https://arxiv.org/abs/2402.10171.

[bib10] Gao et al. (2024a) Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 07 2024a. URL https://zenodo.org/records/12608602.

[bib11] Gao et al. (2024b) Tianyu Gao, Alexander Wettig, Howard Yen, and Danqi Chen. How to train long-context language models (effectively). arXiv preprint arXiv:2410.02660, 2024b.

[bib12] Gu et al. (2024a) Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view, 2024a. URL https://arxiv.org/abs/2410.10781.

[bib13] Gu et al. (2024b) Yuling Gu, Oyvind Tafjord, Bailey Kuehl, Dany Haddad, Jesse Dodge, and Hanna Hajishirzi. OLMES: A standard for language model evaluations. ArXiv, abs/2406.08446, 2024b. URL https://api.semanticscholar.org/CorpusID:270391754.

[bib14] Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.

[bib15] Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

[bib16] Hsieh et al. (2024) Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.

[bib17] Huang et al. (2024) Siming Huang, Tianhao Cheng, J. K. Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J. Yang, J. H. Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Yuan, Zhaoxiang Zhang, Jie Fu, Qian Liu, Ge Zhang, Zili Wang, Yuan Qi, Yinghui Xu, and Wei Chu. Opencoder: The open cookbook for top-tier code large language models, 2024. URL https://arxiv.org/abs/2411.04905.

[bib18] Izacard et al. (2021) Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning, 2021. URL https://arxiv.org/abs/2112.09118.

[bib19] Jin et al. (2023) Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Chia-Yuan Chang, and Xia Hu. Growlength: Accelerating LLMs pretraining by progressively growing training length, 2023. URL https://arxiv.org/abs/2310.00576.

[bib20] Jin et al. (2024) Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Zirui Liu, Chia-Yuan Chang, Huiyuan Chen, and Xia Hu. LLM maybe longlm: Selfextend LLM context window without tuning. In Forty-first International Conference on Machine Learning, 2024.

[bib21] Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Regina Barzilay and Min-Yen Kan (eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147/.

[bib22] Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

[bib23] Kazemnejad et al. (2023) Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers, 2023. URL https://arxiv.org/abs/2305.19466.

[bib24] Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of naacL-HLT, volume 1, pp. 2. Minneapolis, Minnesota, 2019.

[bib25] Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019. doi: 10.1162/tacl˙a˙00276. URL https://aclanthology.org/Q19-1026/.

[bib26] Lai et al. (2017) Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale ReAding comprehension dataset from examinations. In Martha Palmer, Rebecca Hwa, and Sebastian Riedel (eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 785–794, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1082. URL https://aclanthology.org/D17-1082/.

[bib27] Li et al. (2022) Conglong Li, Minjia Zhang, and Yuxiong He. The stability-efficiency dilemma: Investigating sequence length warmup for training gpt models, 2022. URL https://arxiv.org/abs/2108.06084.

[bib28] Li et al. (2023) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.

[bib29] Liu et al. (2023) Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts, 2023. arXiv:2307.03172.

[bib30] LocalLLaMA. NTK-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degration, 2023. URL {https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/}.

[bib31] Lu et al. (2024) Yi Lu, Jing Nathan Yan, Songlin Yang, Justin T Chiu, Siyu Ren, Fei Yuan, Wenting Zhao, Zhiyong Wu, and Alexander M Rush. A controlled study on long context extension and generalization in llms. arXiv preprint arXiv:2409.12181, 2024.

[bib32] Men et al. (2024) Xin Men, Mingyu Xu, Bingning Wang, Qingyu Zhang, Hongyu Lin, Xianpei Han, and Weipeng Chen. Base of RoPE bounds context length, 2024. URL https://arxiv.org/abs/2405.14591.

[bib33] Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018.

[bib34] Nagatsuka et al. (2021) Koichi Nagatsuka, Clifford Broni-Bediako, and Masayasu Atsumi. Pre-training a BERT with curriculum learning by increasing block-size of input text. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pp. 989–996, 2021.

[bib35] Peng et al. (2024) Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=wHBfxhZu1u.

[bib36] Pouransari et al. (2024) Hadi Pouransari, Chun-Liang Li, Jen-Hao Rick Chang, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, and Oncel Tuzel. Dataset decomposition: Faster llm training with variable sequence length curriculum. arXiv preprint arXiv:2405.13226, 2024. URL https://arxiv.org/abs/2405.13226.

[bib37] Radford (2018) Alec Radford. Improving language understanding by generative pre-training. OpenAI blog, 2018.

[bib38] Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019.

[bib39] Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Jian Su, Kevin Duh, and Xavier Carreras (eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264. URL https://aclanthology.org/D16-1264/.

[bib40] Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.

[bib41] Sap et al. (2019) Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Commonsense reasoning about social interactions. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4463–4473, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1454. URL https://aclanthology.org/D19-1454/.

[bib42] Shi et al. (2024) Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Xi Victoria Lin, Noah A Smith, Luke Zettlemoyer, Wen-tau Yih, and Mike Lewis. In-context pretraining: Language modeling beyond document boundaries. In The Twelfth International Conference on Learning Representations, 2024.

[bib43] Smith (2017) Leslie N Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464–472. IEEE, 2017.

[bib44] Su et al. (2024) Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

[bib45] Talmor et al. (2019) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1421. URL https://aclanthology.org/N19-1421.

[bib46] Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.

[bib47] Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023b. doi: 10.48550/ARXIV.2307.09288. URL https://doi.org/10.48550/arXiv.2307.09288.

[bib48] Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

[bib49] Wang et al. (2024) Haonan Wang, Qian Liu, Chao Du, Tongyao Zhu, Cunxiao Du, Kenji Kawaguchi, and Tianyu Pang. When precision meets position: BFloat16 breaks down RoPE in long-context training. arXiv preprint arXiv:2411.13476, 2024.

[bib50] Xiao et al. (2024) Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NG7sS51zVF.

[bib51] Xiong et al. (2023) Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oğuz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, and Hao Ma. Effective long-context scaling of foundation models. In North American Chapter of the Association for Computational Linguistics, 2023.

[bib52] Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1259. URL https://aclanthology.org/D18-1259/.

[bib53] Yen et al. (2024) Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, and Danqi Chen. Helmet: How to evaluate long-context language models effectively and thoroughly, 2024. URL https://arxiv.org/abs/2410.02694.

[bib54] Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4791–4800, 2019.

[bib55] Zhang et al. (2020) Hao Zhang, Jae Ro, and Richard William Sproat. Semi-supervised url segmentation with recurrent neural networks pre-trained on knowledge graph entities. In The 28th International Conference on Computational Linguistics (COLING 2020), 2020.

[bib56] Zhang et al. (2024a) Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. Tinyllama: An open-source small language model, 2024a. URL https://arxiv.org/abs/2401.02385.

[bib57] Zhang et al. (2024b) Zhisong Zhang, Yan Wang, Xinting Huang, Tianqing Fang, Hongming Zhang, Chenlong Deng, Shuaiyi Li, and Dong Yu. Attention entropy is a key factor: An analysis of parallel context encoding with full-attention-based pre-trained language models, 2024b. URL https://arxiv.org/abs/2412.16545.

[bib58] Zhao et al. (2024a) Liang Zhao, Tianwen Wei, Liang Zeng, Cheng Cheng, Liu Yang, Peng Cheng, Lijie Wang, Chenxia Li, Xuejie Wu, Bo Zhu, et al. Longskywork: A training recipe for efficiently extending context length in large language models. arXiv preprint arXiv:2406.00605, 2024a.

[bib59] Zhao et al. (2024b) Yu Zhao, Yuanbin Qu, Konrad Staniszewski, Szymon Tworkowski, Wei Liu, Piotr Miłoś, Yuxiang Wu, and Pasquale Minervini. Analysing the impact of sequence composition on language model pre-training. arXiv preprint arXiv:2402.13991, 2024b.

[bib60] Zhou et al. (2024) Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, and Pengfei Liu. Programming every example: Lifting pre-training data quality like experts at scale. arXiv preprint arXiv:2409.17115, 2024.

[bib61] Zhuo et al. (2024) Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877, 2024.