
Continual Learning of Large Language Models: A Comprehensive Survey

Haizhou Shi

Abstract

The recent success of large language models (LLMs) trained on static, pre-collected, general datasets has sparked numerous research directions and applications. One such direction addresses the non-trivial challenge of integrating pre-trained LLMs into dynamic data distributions, task structures, and user preferences. The primary challenge of this problem lies in balancing model adaptation and knowledge preservation. Pre-trained LLMs, when tailored for specific needs, often experience significant performance degradation in previous knowledge domains – a phenomenon known as “catastrophic forgetting”. While extensively studied in the continual learning (CL) community, it presents new manifestations in the realm of LLMs. In this survey, we provide a comprehensive overview and detailed discussion of the current research progress on large language models within the context of continual learning. After introducing the preliminary knowledge, this survey is structured into four main sections: we first present an overview of continually learning LLMs, consisting of two directions of continuity: vertical continuity (or vertical continual learning), i.e., continual adaptation from general to specific capabilities, and horizontal continuity (or horizontal continual learning), i.e., continual adaptation across time and domains (Section 3). Following vertical continuity, we summarize three stages of learning LLMs in the context of modern CL: Continual Pre-Training (CPT), Domain-Adaptive Pre-training (DAP), and Continual Fine-Tuning (CFT) (Section 4). Then we provide an overview of evaluation protocols for continual learning with LLMs, along with the currently available data sources (Section 5). Finally, we discuss intriguing questions pertaining to continual learning for LLMs (Section 6).
This survey sheds light on the relatively understudied domain of continually pre-training, adapting, and fine-tuning large language models, suggesting the necessity for greater attention from the community. Key areas requiring immediate focus include the development of practical and accessible evaluation benchmarks, along with methodologies specifically designed to counter forgetting and enable knowledge transfer within the evolving landscape of LLM learning paradigms. The full list of papers examined in this survey is available at https://github.com/Wang-ML-Lab/llm-continual-learning-survey.

Introduction

Recent advances in large language models (LLMs) have demonstrated considerable potential for achieving artificial general intelligence (AGI) [1, 6, 22, 40, 173, 186, 230, 231]. Researchers have observed that complex abilities such as multi-step reasoning, few-shot in-context learning, and instruction following improve as parameter scale increases [159, 250, 252, 253, 277]. The development of LLMs is impactful and revolutionary, prompting machine learning practitioners to reconsider traditional computational paradigms for once-challenging human-level tasks. However, LLMs are typically trained on static, pre-collected datasets encompassing general domains, leading to gradual performance degradation over time [5, 52, 95, 96, 100, 137] and across different content domains [35, 44, 69, 71, 100, 104, 183, 184, 221]. Additionally, a single pre-trained large model cannot meet every user need and requires further fine-tuning [10, 16, 37, 106, 182, 254, 255, 281, 299]. While one potential solution is re-collecting pre-training data and re-training models with additional specific needs, this approach is prohibitively expensive and impractical in real-world scenarios.

To efficiently adapt LLMs to downstream tasks while minimizing performance degradation on previous knowledge domains, researchers employ the methodology of Continual Learning (CL), also known as lifelong learning or incremental

∗ Correspondence to: Haizhou Shi haizhou.shi@rutgers.edu and Hao Wang hw488@cs.rutgers.edu.

† Work done as visiting students at Rutgers Machine Learning Lab.

learning [38, 178, 232, 237]. Inspired by the incremental learning pattern observed in human brains [101, 153, 154, 175], CL trains machine learning models sequentially on a series of tasks with the expectation of maintaining performance across all tasks [57, 58, 113, 124]. Throughout training, models have limited or no access to previous data, posing a challenge in retaining past knowledge as optimization constraints from unseen previous data are absent during current-task learning [124, 135, 213]. This challenge, known as catastrophic forgetting [155], has been a central focus in continual learning research since its inception. Over the years, researchers have explored various techniques to mitigate forgetting. These include replay-based methods [30, 207, 213], parameter regularization [4, 113, 196], and model architecture expansion [191, 236]. Together, these techniques have significantly advanced the goal of achieving zero forgetting in continual learning across diverse tasks, model architectures, and learning paradigms.

In the context of training and adapting LLMs sequentially, the significance of CL is itself undergoing a semantic shift. To highlight this ongoing shift, in this paper we provide a comprehensive overview and detailed discussion of the current research progress on continual LLMs. For the general picture of continual LLMs, we divide it, for the first time, into two directions of continuity that need to be addressed by practitioners (details in Section 3):

· Vertical continuity (or vertical continual learning), which refers to the ongoing adaptation of LLMs as they transition from large-scale general domains to smaller-scale specific domains, involving shifts in learning objectives and entities of execution. For example, healthcare institutions may develop LLMs tailored to the medical domain while retaining their general reasoning and question-answering capabilities for users.

· Horizontal continuity (or horizontal continual learning), which refers to continual adaptation across time and domains and often entails multiple training stages and increased vulnerability to forgetting. For example, social media platforms continuously update LLMs to reflect recent trends, ensuring accurate targeting of downstream services like advertising and recommendation without compromising the experience of existing users.

Importantly, separating vertical and horizontal CL transcends mere modification of existing paradigms, like domain-incremental learning, which aligns with horizontal continuity. This distinction offers a robust framework for analyzing complex CL paradigms in language models. For instance, Recyclable Tuning preserves both vertical and horizontal continuity simultaneously [183], and future designs might include zigzagging between horizontal and vertical CL.

In Fig. 1, following vertical continuity, we delineate three key stages of LLM learning within modern CL: Continual Pre-Training (CPT), Domain-Adaptive Pre-training (DAP), and Continual Fine-Tuning (CFT) (details in Section 4). In CPT, existing research primarily investigates three types of distributional shifts: temporal, content-level, and language-level. Each presents distinct focuses and challenges. In DAP, CL evaluation and techniques are frequently utilized. However, there is a noticeable lack of diversity in these techniques, considering the maturity of the conventional CL community. In CFT, our focus is on the emerging field of learning LLMs, covering topics such as Continual Instruction Tuning (CIT), Continual Model Refinement (CMR), Continual Model Alignment (CMA), and Continual Multimodal LLMs (CMLLMs). Next, we present a compilation of publicly available evaluation protocols and benchmarks (details in Section 5). We conclude our survey with a discussion covering emergent properties of continual LLMs, changes in the roles of conventional CL types and memory constraints within the context of continual LLMs, and prospective research directions for this subject (details in Section 6).

In summary, this survey provides a comprehensive review of existing continual learning studies for LLMs, which significantly distinguishes itself from existing literature on related topics [17, 105, 237, 261, 276]. Our survey highlights the underexplored research area of continually developing LLMs, especially in the fields of CPT and DAP. We emphasize the need for increased attention from the community, including the development of practical, accessible, and widely acknowledged evaluation benchmarks. Additionally, methodologies need to be tailored to address forgetting in emerging LLM learning paradigms. We hope this survey can provide a systematic and novel perspective on continual learning in the rapidly changing field of LLMs and help the continual learning community contribute to the challenging goal of developing LLMs in a more efficient, reliable, and sustainable manner [8, 25, 95, 219, 268].

Large Language Models

Primarily built on the transformer architecture, pre-trained language models (PLMs) have established a universal hidden embedding space through extensive pre-training on large-scale unlabeled text corpora [51, 133, 189]. By scaling parameters to billions or even hundreds of billions and training on massive text datasets [84, 102], PLMs not only demonstrate superior language understanding and generation capabilities but also manifest emergent abilities such as in-context learning, instruction following, and multi-step reasoning [159, 250, 252, 253, 277]. These larger models are commonly referred to as Large Language Models (LLMs). For a more detailed introduction, please refer to Appendix A.1.

2.1.1 Pre-training of LLMs. There are two popular pre-training paradigms for LLMs. (1) Decoder-only models typically employ auto-regressive language modeling (LM) tasks during pre-training, including the GPT family [1, 22, 173, 186], Gemini family [194, 225], and the open-source Llama family [230, 231]. Specifically, given a sequence of tokens $\boldsymbol{x} = [x_1, x_2, \cdots, x_N]$, LM predicts the next token $x_t$ autoregressively based on all preceding tokens $\boldsymbol{x}_{<t} = [x_1, x_2, \cdots, x_{t-1}]$, and trains the entire network by minimizing the negative log-likelihood $-\sum_{t=1}^{N} \log P(x_t \mid \boldsymbol{x}_{<t})$, where $P(x_1 \mid \boldsymbol{x}_{<1}) \triangleq P(x_1)$ is the unconditional probability estimate of the first token. (2) Encoder-only models, e.g., BERT [51, 133], use masked language modeling (MLM) as a common pre-training objective. In MLM, for the input sequence $\boldsymbol{x}$, a subset of input tokens $m(\boldsymbol{x})$ is masked and replaced with the special [MASK] token. The pre-training goal is to utilize the unmasked part $\boldsymbol{x} \setminus m(\boldsymbol{x})$ to predict the masked portion $m(\boldsymbol{x})$. In summary, the overarching goal of MLM is to minimize the negative log-likelihood $-\sum_{\hat{x} \in m(\boldsymbol{x})} \log P(\hat{x} \mid \boldsymbol{x} \setminus m(\boldsymbol{x}))$.
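As a concrete illustration of the autoregressive objective above, the sketch below computes the negative log-likelihood $-\sum_{t=1}^{N} \log P(x_t \mid \boldsymbol{x}_{<t})$ for a toy three-token sequence. The `next_token_prob` table is a hypothetical stand-in for a model's softmax output, chosen purely so the computation is self-contained.

```python
import math

def next_token_prob(prefix, token):
    # Toy conditional distribution P(token | prefix); in a real LLM these
    # probabilities come from a softmax over the vocabulary.
    table = {
        (): {"the": 0.6, "a": 0.4},
        ("the",): {"cat": 0.7, "dog": 0.3},
        ("the", "cat"): {"sat": 0.9, "ran": 0.1},
    }
    return table[tuple(prefix)][token]

def lm_nll(tokens):
    """Autoregressive LM loss: -sum_t log P(x_t | x_<t)."""
    return -sum(
        math.log(next_token_prob(tokens[:t], tokens[t]))
        for t in range(len(tokens))
    )

loss = lm_nll(["the", "cat", "sat"])
```

The first term of the sum uses the empty prefix, matching the convention $P(x_1 \mid \boldsymbol{x}_{<1}) \triangleq P(x_1)$ above.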

2.1.2 Adaptation of LLMs. LLMs are primarily trained to generate linguistically coherent text. However, this training may not align with human values, preferences, or practical needs. Furthermore, the pre-training data can be outdated, leading to knowledge cutoffs or inaccuracies. To address these issues, various computational paradigms such as Instruction Tuning (IT) [288], Model Refinement (MR) [47], and Model Alignment (MA) [174, 187] have been proposed. These approaches adapt LLMs to better meet diverse downstream tasks and user requirements.

Numerous studies show that Instruction Tuning (IT) can notably improve LLMs' ability to follow textual instructions [98, 174, 203, 250, 288], leveraging the pre-existing knowledge within LLMs to bridge the gap between general and task-specific performance [251]. Recent works like WizardLM [269] and CodecLM [246] further tailor synthetic data to steer LLMs' behavior through IT. Additionally, IT enhances the interaction between humans and LLMs, providing a more natural interface and aligning LLM outputs more closely with human expectations and preferences [145]. However, LLMs still make mistakes, such as producing inaccurate translations or outdated information [47]. Directly fine-tuning the model to correct these mistakes may disrupt its performance on previously learned tasks. To overcome these challenges, Model Refinement (MR) is proposed to rectify the model's errors while preserving its performance on other inputs, using only moderate computing resources [47, 74, 76, 92, 163, 164, 215]. Model Alignment (MA) ensures that AI systems' actions and outputs align with human values, ethics, and preferences [174, 187]. MA can be broadly categorized into two types: Reinforcement Learning-based (RL-based) and Supervised Learning-based (SL-based). RL-based approaches [174, 205] are trained to make decisions reinforced by human feedback, using a reward system to guide them towards desirable outcomes. In contrast, SL-based approaches [81, 97, 187] directly train models on datasets of human preferences, aligning their outputs with demonstrated human values.

Pre-training of LLMs

4.1.1 CPT: Effectiveness and Efficiency. Before delving into the details of continual pre-training (CPT), it is important to address two fundamental questions. The first concerns effectiveness: can CPT enhance performance on downstream tasks beyond that of the initial training on a wide range of data domains? Extensive studies have not only demonstrated the necessity of CPT for improved downstream performance [35, 71, 95, 96, 100, 184], but also shown that when distributional shifts are gradual [95, 278] or somewhat correlated [71], CPT can effectively help the model generalize to unseen data. The second question is about efficiency: given the large size of an LLM's parameters and data, both old and new, can we achieve adaptation and knowledge retention in a computationally efficient way? Concerning efficiency, most studies focus on techniques for efficient knowledge retention [95, 96, 100, 118], which significantly overlap with the CL literature addressing catastrophic forgetting [4, 24, 191, 193, 195, 196, 201, 207, 213, 236]. In contrast to prior approaches that fully utilize emergent data, some studies recognize the impracticality of this approach in real production environments. Instead, they concentrate on further improving the efficiency of adaptation. For instance, ELLE [184] employs a function-preserved model expansion to facilitate efficient knowledge growth; [5] and [268] sub-sample training data based on novelty and diversity to enhance training efficiency, achieving superior performance compared to full-data training. Though currently underexplored, efficient adaptation in continual pre-training is poised to become significant, given recent findings emphasizing data quality over quantity for LLM generalization [216, 267].
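The novelty-based sub-sampling idea mentioned above can be sketched as follows. This mirrors the approach of [5, 268] only in spirit: `novelty_score` is a user-supplied callable standing in for whatever criterion those works use (e.g., the current model's per-token loss on a document), and the real criteria also account for diversity.

```python
def subsample_by_novelty(docs, novelty_score, keep_ratio=0.25):
    """Keep only the most novel fraction of an incoming data stream.

    Higher novelty_score means more novel; the most novel keep_ratio
    fraction of documents is retained for continual pre-training.
    """
    ranked = sorted(docs, key=novelty_score, reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    return ranked[:k]
```

The appeal of such filtering is that it reduces training cost on largely redundant emergent data while concentrating updates on genuinely new content.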

4.1.2 General Observations on CPT. Table 1 summarizes the existing studies on continual pre-training (CPT), and here are some key observations we make about CPT.

· OBS-1: The development of advanced techniques tailored specifically for CPT is at an early stage and warrants further exploration. Only about half of the examined papers propose novel techniques for CPT [5, 35, 44, 52, 71, 104, 183, 184, 221], while the remaining half either focus solely on the effects of pure adaptation without considering CL techniques [63, 69, 137] or conduct empirical studies on the straightforward application of existing CL techniques [95, 96, 100, 118].

· OBS-2: The diversity of CL techniques incorporated in CPT remains limited. Most practical implementations of CL techniques for CPT primarily focus on architecture expansion of LLMs [5, 35, 44, 52, 71, 183], with only a few explicitly utilizing replay [35, 183] and parameter regularization [5, 35].

· OBS-3: There is an apparent gap between the existing studies and the real production environment of CPT. Except for the recent study [278], which conducts CPT over 159 domains, the longest sequence of

Table 1. Summary of existing studies on Continual Pre-training of LLMs. The papers are organized based on their relation to CL: (i) no CL techniques are studied, (ii) CL techniques are studied solely as baselines, and (iii) new CL approaches are proposed. In the table, Dist. Shift denotes the type(s) of distributional shift a particular study considers and aims to solve. Under Continual Learning Tech., we categorize three types of continual learning techniques studied in the papers: rehearsal (Rehearsal), parameter regularization (Param. Reg.), and architecture expansion (Arch. Exp.). We use '✓', '✗', and '♣' to denote 'deployed in the proposed method', 'not studied in the paper', and 'studied as a baseline method', respectively. Note that we do not include naive sequential fine-tuning in this table, as it is universally studied as an important baseline in all of the listed papers. Papers with only '♣' [95, 96, 100] study only existing CL techniques without proposing new ones, and papers with only '✗' [63, 69] study special aspects of fine-tuning without using CL techniques.

pre-training stages explored is 8 [71, 100]. However, this falls short of real-world scenarios, where continual pre-training occurs more frequently and persists for months or years. The efficacy of CPT methods in such prolonged settings remains uncertain. Additionally, investigating CPT in a task-boundary-free data stream setting remains an important avenue for future research.

4.1.3 Distributional Shifts in CPT. This survey categorizes distributional shifts of CPT into three main types: (i) Language Shift: LLMs sequentially learn different language corpora, e.g., English → Chinese [63, 118]. (ii) Content Shift: LLMs sequentially learn corpora from different fields, e.g., chemistry → biology [35, 44, 69, 71, 100, 183]. (iii) Temporal Shift: distributional shifts occur over time, e.g., news in 2021 → news in 2022, with a major focus on timestamp-sensitive knowledge retention and update [5, 52, 95, 96, 100].

Language Shift. [63] focuses on assessing LLMs' natural ability to learn new languages sequentially. With no explicit CL techniques employed, the study observes consistent positive forward transfer of the knowledge, facilitating new language acquisition regardless of the learning order. Forgetting, on the other hand, emerges as a significant challenge that cannot be mitigated by the increasing size of LLMs. In [118], the degree of forgetting of previously learned language when adapting LLMs to a new language is investigated. Various CL techniques, including parameter freezing, LoRA [86], and (IA) 3 [132], are evaluated across multiple dimensions. Preliminary experimental results highlight the non-trivial nature of addressing horizontal forgetting for CPT under the language shift as well.

Content Shift. [278] explores large-scale CPT over 159 content domains and shows that CPT on various domains can effectively improve models' adaptation ability compared to DAP on a single domain. Similarly, [69] continues the pre-training phase of Pythia [16] with no complex CL techniques and discovers that learning rate re-warming consistently improves models trained from scratch. Building on this simple observation, [94] further shows that a proper combination of learning-rate re-warming, re-decay, and replay of previous data is sufficient to achieve performance comparable to full re-training. LLPT [100] establishes a comprehensive training and evaluation protocol for a series of content-level distributional shifts. They assess multiple CL methods and, similar to [63], find consistent forward knowledge transfer, yet horizontal forgetting remains significant. Moreover, contrary to the common understanding that experience replay [30] is the most efficient approach to preventing forgetting, the authors find it ineffective in the case of CPT due to a potential overfitting issue. Recyclable Tuning [183] shows that if the upstream supplier continually pre-trains LLMs, with or without replay, consumer-side efficiency can be boosted by recycling previously learned update components when proper CL techniques are applied.
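The re-warming and re-decay recipe discussed above can be sketched as a learning-rate schedule for a new CPT stage: a short linear re-warm back to the peak rate, followed by cosine re-decay. The shape follows the general recipe; the specific function name and constants here are illustrative, not taken from the cited papers.

```python
import math

def rewarm_cosine_lr(step, warmup_steps, total_steps, max_lr, min_lr):
    """LR for one continual pre-training stage: linear re-warming to
    max_lr over warmup_steps, then cosine re-decay down to min_lr."""
    if step < warmup_steps:
        # Linear re-warming from ~0 back up to the peak learning rate.
        return max_lr * (step + 1) / warmup_steps
    # Cosine re-decay over the remainder of the stage.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Resetting this schedule at the start of each new corpus, rather than continuing a long-decayed rate, is the key difference from naive sequential fine-tuning.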

DEMix [71] incrementally trains and integrates new experts (DEMix layer) for new domains during CPT. To ensure reasonable inference performance during testing when no domain information is available, it proposes a parameter-free probabilistic approach to dynamically estimate a weighted mixture of domains. DEMix's modularization has been shown to facilitate efficient domain-adaptive pre-training, promote relevant knowledge during inference, and allow for removable components. Lifelong-MoE [35], similar to DEMix [71], incrementally trains domain experts for new domains. However, Lifelong-MoE differs from DEMix in utilizing a token-level gating function to activate multiple experts for intermediate embedding calculation. During training, previous experts' parameters and gating functions remain frozen, and knowledge distillation loss is employed to regulate parameter updates, which thereby makes Lifelong-MoE robust against the issue of horizontal forgetting.
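The token-level gating described for Lifelong-MoE can be sketched as follows: a gating function scores each expert for the current token embedding, and the output is the gate-weighted sum of expert outputs. This is a plain-Python illustration of the mixing mechanism only; the names and shapes are assumptions, and in the actual method old experts and gates are frozen while a distillation loss regularizes updates.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def moe_token_output(token_emb, experts, gate_weights):
    """Token-level mixture of experts: gate logits are dot products of
    the token embedding with per-expert gate vectors; the output is the
    softmax-weighted combination of all expert outputs."""
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, token_emb))
              for w in gate_weights]
    gates = softmax(logits)
    outputs = [expert(token_emb) for expert in experts]
    dim = len(outputs[0])
    return [sum(g * out[d] for g, out in zip(gates, outputs))
            for d in range(dim)]
```

Because every token mixes all experts, this differs from DEMix's domain-level routing, where one expert is promoted per domain.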

It is noteworthy that some papers draw almost opposite conclusions regarding the significance of CPT for content shifts. For instance, [44] continually pre-trains BERT-based models [51, 133] on five scientific domains and evaluates performance on downstream sentiment analysis. They observe that even the trivial sequential pre-training does not exhibit severe forgetting, prompting reasonable questions about the necessity of CPT.

Temporal Shift. In the context of CPT amid content shifts, Multi-Task Learning (MTL) is often regarded as the upper bound achievable [178, 213, 237]. However, this belief does not fully hold when considering CL under temporal shifts [52, 95, 96], as temporal shifts can introduce conflicting information, posing challenges for LLMs. For instance, the statement 'Lionel Messi plays for team Barcelona' remains accurate from 2004 to 2021 but becomes false by 2024, as 'Lionel Messi plays for team Inter Miami' becomes the correct statement.

Hence, as advocated by CKL [96] and TemporalWiki [95], LLMs undergoing continual adaptation to temporal shifts must simultaneously achieve three objectives: (i) retention of old knowledge, (ii) acquisition of new knowledge, and (iii) update of outdated knowledge. Both works evaluate the same set of continual learning baseline methods [34, 79, 87, 239], each highlighting distinct aspects of their impact. CKL [96] observes that parameter expansion consistently exhibits robust performance across all experimental conditions. In contrast, replay-based methods struggle to efficiently acquire new knowledge and update outdated knowledge, leading to rapid forgetting of newly learned information during training. TemporalWiki [95] constructs a series of temporal corpora and their differential sets from sequential snapshots of Wikipedia, revealing that updating LLMs on these differential sets substantially enhances new knowledge acquisition and updates while requiring significantly fewer computational resources; various CL techniques also prove effective in mitigating horizontal forgetting during this process. LLPT [100] introduces temporal generalization evaluation for LLMs pre-trained on sequential corpora. Through experiments on a large-scale, chronologically ordered Tweet stream, the authors demonstrate the superiority of CPT combined with CL techniques over task-specific LMs, in terms of both knowledge acquisition and temporal generalization. Nonetheless, these preliminary experiments do not conclusively determine which specific CL method is preferable to the others.
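The differential-set construction used by TemporalWiki can be sketched at the article level: given two snapshots of a corpus, keep only articles that are new or whose text changed between snapshots. This is a simplification for illustration; the actual pipeline operates at a finer granularity than whole articles.

```python
def differential_set(old_snapshot, new_snapshot):
    """Diff corpus between two snapshots (title -> text mappings):
    retains articles that are new or whose content was edited."""
    return {
        title: text
        for title, text in new_snapshot.items()
        if title not in old_snapshot or old_snapshot[title] != text
    }
```

Training only on this diff is what makes the update substantially cheaper than re-training on the full new snapshot.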

Another line of work, Temporal Language Models (TLMs), takes a different approach to address knowledge retention, acquisition, and update under temporal shifts by integrating temporal information into the model [52, 198, 219]. During training, they inject temporal information into training examples as prefixes of prompts, using special tokens [198], explicit year information [52], or syntax-guided structural information [219]. In sequential training experiments conducted by TempoT5 [52], comparison between continually and jointly pre-trained LMs demonstrates that CPT better balances adaptation and forgetting when the replay rate of past data is appropriately set.
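The prefix-injection idea behind TLMs can be illustrated with a minimal helper that prepends explicit year information to a training example. The exact prefix format varies across the cited papers (special tokens, year strings, or syntax-guided structure); the template below is purely an assumed example.

```python
def add_time_prefix(example_text, year):
    """Prepend explicit temporal context to a training example,
    in the style of time-aware pre-training with year prefixes."""
    return f"year: {year} text: {example_text}"
```

At training time the model thus learns time-conditioned facts, so that the same query can yield different answers under different year prefixes.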

Others. CPT, as a technique for progressively acquiring novel knowledge, can also be used to refine LLMs' behavior. CEM [294] collects examples where the model's response is incorrect and continually trains the model on these examples, along with a supplemental dataset. RHO-1 [130] proposes Selective Language Modeling (SLM), which employs a reference model to evaluate the perplexity of each token in the training corpus and continually pre-trains the model on high-perplexity tokens. Similarly, IR-DRO [36] re-trains the model on re-weighted examples from the original pre-training dataset, focusing more on higher-loss sequences.
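The token-selection step of selective language modeling can be sketched as follows: score each token by a reference model's loss (high loss roughly meaning more informative) and keep only the top fraction for the training objective. The thresholding details are simplified relative to RHO-1 itself.

```python
def select_training_tokens(token_losses, keep_ratio=0.5):
    """Return indices of the top keep_ratio fraction of tokens by
    reference-model loss; only these contribute to the LM objective."""
    k = max(1, int(len(token_losses) * keep_ratio))
    ranked = sorted(range(len(token_losses)),
                    key=lambda i: token_losses[i], reverse=True)
    return sorted(ranked[:k])
```

The per-token masking this induces is what distinguishes SLM from ordinary causal LM training, which weights every token equally.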

The significance of addressing temporal shifts through CPT is underscored by several industrial studies. For instance, [5] employs a dynamic vocabulary expansion algorithm and an efficient sub-sampling procedure to conduct CPT on large-scale emerging tweet data. Conversely, [137] adopts CPT without explicit measures to constrain model updates, releasing a series of BERT-based LMs incrementally trained on new tweet data every three months. Preliminary experimental results demonstrate substantial improvements of continually pre-trained LMs over the base BERT model across downstream tasks. While some studies question the necessity of continually adapting LLMs along the temporal axis for environmental reasons, such as reducing CO 2 emissions [8], the community commonly embraces CPT as a more efficient learning paradigm compared to the traditional 'combine-and-retrain' approach.

Adaptation of LLMs

Continual Learning

Humans can accumulate knowledge and skills across tasks without significant performance decline on previous tasks [101, 153, 154, 175]. In contrast, machine learning models, which are typically data-centric, often experience performance degradation on old tasks when trained on new ones, a phenomenon known as 'catastrophic forgetting.' The challenge of adapting models to a sequence of tasks without forgetting, especially when little to no past data can be preserved, is extensively studied in the continual learning community [38, 178, 232, 237]. For formal definitions and a detailed introduction to the three CL scenarios and techniques, please refer to Appendix A.2.

2.2.1 Types of Continual Learning. To lay the groundwork for subsequent discussions (as illustrated in Table 3 and Section 6.2), we follow the conceptual framework proposed by [112, 232, 237]. There are three primary types of continual learning scenarios: (i) Task-Incremental Learning (TIL), where task indices are available to the model during inference [113, 124]; (ii) Domain-Incremental Learning (DIL), where the model learns a sequence of tasks with the same formulation but without task indices during inference [213]; and (iii) Class-Incremental Learning (CIL), where the model learns new classes of data during training [112, 193].

2.2.2 Techniques of Continual Learning. Existing CL techniques can be roughly categorized into five groups [237]: (i) replay-based, (ii) regularization-based, (iii) architecture-based, (iv) optimization-based, and (v) representation-based. Here, we provide a concise yet comprehensive introduction to the first three categories of continual learning techniques, as they are extensively applied in continual LLMs.

Replay-based methods adopt a relaxed memory constraint, keeping a small buffer of observed data and retraining the model on it when learning new tasks. Although replay-based methods may theoretically lead to loose generalization bounds [213], they are valued for their simplicity, stability, and high performance, even with a small episodic memory [24, 30, 193, 195]. Regularization-based methods adopt a regularization term $\lambda \lVert \boldsymbol{\theta} - \boldsymbol{\theta}_{t-1} \rVert_{\boldsymbol{\Sigma}}$ that penalizes large deviations from the historical model in parameter space, where $\lVert \boldsymbol{v} \rVert_{\boldsymbol{\Sigma}} = \boldsymbol{v}^{\top} \boldsymbol{\Sigma} \boldsymbol{v}$ is the vector norm induced by a positive semi-definite matrix $\boldsymbol{\Sigma}$, and $\lambda$ is the regularization coefficient, a hyper-parameter balancing past-knowledge retention against current-knowledge learning. The matrix $\boldsymbol{\Sigma}$ measures the importance of each parameter, and the correlations among parameters, in retaining past knowledge. In practice, to reduce computational overhead, $\boldsymbol{\Sigma}$ is often designed to be diagonal, encoding only the importance of individual parameters [4, 113, 197]. Architecture-based methods, especially those that expand the network architecture dynamically to assimilate new knowledge, are considered the most efficient form of CL [248, 249]. They primarily tackle adaptation challenges and can achieve zero forgetting when task IDs are available during inference or can be correctly inferred [71, 256]. However, due to the difficulty of task-ID inference, architecture expansion is predominantly utilized in TIL but scarcely explored in DIL or CIL. In conjunction with pre-trained backbone models like ViT [54], CoLoR [256] trains separate low-rank adaptation (LoRA) [86] modules for different tasks. It estimates and stores a prototype for each task and, during testing, exploits the natural clustering ability of the pre-trained model to infer task IDs, selecting the corresponding LoRA component for prediction.
In the domain of continual LLMs, architecture expansion has resurged in popularity following the rise of parameter-efficient fine-tuning (PEFT) [50, 86, 211], a topic we will delve into shortly [96, 100, 118, 177, 240, 257, 272, 273].
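The regularization term with a diagonal $\boldsymbol{\Sigma}$ described in Section 2.2.2 can be sketched in a few lines. Here each entry of `importance` plays the role of a diagonal element of $\boldsymbol{\Sigma}$ (e.g., a Fisher-information estimate, as in EWC); this plain-Python version operates on flat parameter lists, whereas real implementations work on model tensors.

```python
def ewc_penalty(theta, theta_prev, importance, lam=1.0):
    """Penalty lambda * ||theta - theta_prev||_Sigma with diagonal
    Sigma: sum of importance-weighted squared parameter deviations."""
    return lam * sum(
        f * (t - tp) ** 2
        for f, t, tp in zip(importance, theta, theta_prev)
    )
```

Adding this penalty to the new-task loss discourages movement along directions that the importance estimates flag as critical for past tasks.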

2.2.3 Evaluation Metrics of Continual Learning. There are four evaluation protocols primarily designed for continual learning. Overall Performance (OP) [106, 286, 291] calculates the average performance up to the current training stage, measuring a model's overall ability to balance the performance of each task. As noted in [213], OP corresponds to the primary optimization objective of continual learning and hence receives the most attention. Forgetting (F) represents the largest performance drop observed on each task throughout the training process, averaged over all training stages. It quantifies the negative impact that learning new tasks has on previously acquired knowledge. Ideally, a robust continual learning framework should achieve Backward Transfer (BWT), where learning new tasks enhances performance on prior tasks. BWT is measured by negating forgetting, so negative forgetting indicates an improvement in performance on earlier tasks. Forward Transfer (FWT) measures the generalization ability of a continual learning algorithm to unseen tasks. It is defined as the difference between the current model's performance on future tasks and that of a randomly initialized model. Refer to Appendix B.1 for more details.
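The first two metrics can be computed from the standard task-accuracy matrix. In the sketch below, `acc[i][j]` is performance on task j after training on task i (a common convention; exact definitions vary slightly across papers, so this is one reasonable instantiation rather than the survey's formal definition).

```python
def cl_metrics(acc):
    """Overall Performance and Forgetting from an accuracy matrix.

    OP: average accuracy over all tasks after the final training stage.
    F: for each earlier task, the drop from its best past accuracy to
       its final accuracy, averaged over those tasks.
    """
    T = len(acc)
    overall = sum(acc[T - 1]) / T
    forgetting = sum(
        max(acc[i][j] for i in range(T - 1)) - acc[T - 1][j]
        for j in range(T - 1)
    ) / (T - 1)
    return overall, forgetting
```

BWT then corresponds to the negation of the forgetting value returned here, so a negative F signals backward transfer.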

Types of Continual Learning

Humans can accumulate knowledge and skills across tasks without significant performance decline on previous tasks [101, 153, 154, 175]. In contrast, machine learning models, which are typically data-centric, often experience performance degradation on old tasks when trained on new ones, a phenomenon known as 'catastrophic forgetting.' The challenge of adapting models to a sequence of tasks without forgetting, especially when little to no past data can be preserved, is extensively studied in the continual learning community [38, 178, 232, 237]. For formal definitions, a detailed introduction to the three CL scenarios and techniques, please refer to Appendix A.2.

2.2.1 Types of Continual Learning. To lay the groundwork for subsequent discussions (as illustrated in Table 3 and Section 6.2), we follow the conceptual framework proposed by [112, 232, 237]. There are three primary types of continual learning scenarios: (i) Task-Incremental Learning (TIL), where task indices are available to the model during inference [113, 124]; (ii) Domain-Incremental Learning (DIL), where the model learns a sequence of tasks with the same formulation but without task indices during inference [213]; and (iii) Class-Incremental Learning (CIL), where the model learns new classes of data during training [112, 193].


Techniques of Continual Learning

2.2.2 Techniques of Continual Learning. Existing CL techniques can be roughly categorized into five groups [237]: (i) replay-based, (ii) regularization-based, (iii) architecture-based, (iv) optimization-based, and (v) representation-based. Here, we provide a concise yet comprehensive introduction to the first three categories of continual learning techniques, as they are extensively applied in continual LLMs.

Replay-based methods relax the memory constraint by keeping a small buffer of observed data and retraining the model on it when learning new tasks. Although replay-based methods may theoretically lead to loose generalization bounds [213], they are valued for their simplicity, stability, and high performance, even with a small episodic memory [24, 30, 193, 195]. Regularization-based methods adopt a regularization term λ‖θ − θ_{t−1}‖_Σ that penalizes large deviations from the historical model in parameter space, where ‖v‖_Σ = v⊤Σv is the vector norm evaluated with a positive semi-definite matrix Σ, and λ is the regularization coefficient, a hyper-parameter that balances retention of past knowledge against learning of current knowledge. The matrix Σ measures the importance of each parameter, and the correlations among parameters, in retaining past knowledge. In practice, to reduce computational overhead, Σ is often restricted to a diagonal matrix encoding only the importance of each individual parameter [4, 113, 197]. Architecture-based methods, especially those that dynamically expand the network architecture to assimilate new knowledge, are considered the most efficient form of CL [248, 249]. This approach primarily tackles adaptation challenges and can achieve zero forgetting when task IDs are available during inference or can be correctly inferred [71, 256]. However, due to the difficulty of task ID inference, architecture expansion is predominantly utilized in TIL but is scarcely explored in DIL or CIL. In conjunction with pre-trained backbone models like ViT [54], CoLoR [256] trains separate low-rank adaptation (LoRA) [86] modules for different tasks. It estimates and stores prototypes for each task and, during testing, utilizes the natural clustering ability of the pre-trained model to infer task IDs, selecting the corresponding LoRA component for prediction generation.
In the domain of continual LLMs, architecture expansion has resurged in popularity following the rise of parameter-efficient fine-tuning (PEFT) [50, 86, 211], a topic we will delve into shortly [96, 100, 118, 177, 240, 257, 272, 273].
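To make the regularization term concrete, here is a minimal numpy sketch of the diagonal-Σ penalty; the function name, toy parameter vectors, and importance weights are ours, purely for illustration:

```python
import numpy as np

def reg_penalty(theta, theta_prev, importance, lam=0.1):
    """lam * ||theta - theta_prev||_Sigma with Sigma = diag(importance):
    parameters deemed important for past tasks are penalized more for
    drifting away from the historical model."""
    diff = theta - theta_prev
    return lam * float(np.sum(importance * diff * diff))

# Toy values: the first two parameters matter for old knowledge, the third does not.
theta_prev = np.array([1.0, -2.0, 0.5])
theta      = np.array([1.5, -2.0, 1.5])
sigma      = np.array([10.0, 10.0, 0.1])
penalty = reg_penalty(theta, theta_prev, sigma)
```

Parameters with large importance weights are strongly anchored to their old values, while unimportant ones are left free to adapt to the new task.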

Evaluation Metrics of Continual Learning

In the realm of conventional continual learning, where task streams take the form of classification, many metrics rely on the concept of the Accuracy Matrix [136, 213]. Extending this notion to the context of continually learning LLMs, we introduce the Performance Matrix P ∈ ℝ^{T×T}, where T represents the total number of training stages. Each entry of P corresponds to a performance metric evaluated on the models, such as perplexity on pre-training data [35, 69, 100], zero-shot/few-shot evaluation metrics on downstream data without fine-tuning [9, 42, 48, 172, 199, 258], fine-tuned accuracies on downstream tasks [5, 35, 96, 183], and probing accuracies from fine-tuning add-on components evaluated on downstream tasks [144, 223, 299]. In P, the entry P_{i,j} denotes the model's performance after training stage i when evaluated on task j. With this Performance Matrix definition, we introduce the primary evaluation protocols widely adopted.

Overall Performance (OP). The Overall Performance (OP) [106, 286, 291] is a natural extension of the concept of Average Accuracy [136, 213]. The OP measured at training stage t is the average performance, over the tasks seen so far, of the model obtained right after stage t. Denoting it as OP_t, we have:

$$
\mathrm{OP}_t = \frac{1}{t} \sum_{i=1}^{t} P_{t,i}
$$

As noted in [213], the OP corresponds to the primary optimization objective defined in Definitions A.4, A.5, and A.6. In much of the continual learning literature, once all T tasks are completed, the final OP (OP_T) is reported, with the subscript T often omitted for brevity. In some works, OP is weighted by the importance of tasks, i.e., a weighted OP defined as ∑_{i=1}^{T} w_i P_{T,i}, where w_i = N_i / ∑_{j=1}^{T} N_j is the fraction of training data belonging to task i. In some literature, this weighted OP is referred to as 'example accuracy' [37], 'whole accuracy' [217], or 'edit success rate' in CMR [74].
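As an illustration, OP_t can be read directly off a performance matrix; the 3×3 matrix below is a toy example of ours, not taken from any cited benchmark:

```python
import numpy as np

# P[i, j]: performance after training stage i, evaluated on task j (toy values).
P = np.array([
    [0.90, 0.20, 0.10],
    [0.80, 0.85, 0.15],
    [0.70, 0.75, 0.88],
])

def overall_performance(P, t):
    """OP_t: average performance of the stage-t model on tasks 1..t
    (t is 1-indexed, matching the notation in the text)."""
    return float(P[t - 1, :t].mean())

op3 = overall_performance(P, 3)  # average of the last row
```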

Forgetting (F). Define F_t as the forgetting up to stage t: for each previously learned task, the largest performance drop observed throughout the training process, averaged over the earlier tasks:

$$
F_t = \frac{1}{t-1} \sum_{i=1}^{t-1} \left( \max_{j \in \{i, \dots, t-1\}} P_{j,i} - P_{t,i} \right)
$$

Typically, researchers report the average forgetting F = F_T at the end of the entire training process. Forgetting quantifies the impact of learning new tasks on previously acquired knowledge. Ideally, a robust continual learning framework should achieve Backward Transfer (BWT), where learning new tasks enhances performance on prior tasks. BWT is typically measured as the negation of forgetting, so negative forgetting indicates an improvement in performance on earlier tasks. The concepts of Forgetting and Backward Transfer underpin various evaluation metrics, such as knowledge retention [100], performance on unchanged knowledge [95], average increased perplexity (AP+) [184], and the test and edit retention rates in CMR [74].
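Continuing the same style of toy performance matrix, forgetting and backward transfer can be sketched as follows (the indexing convention follows the textual definition; the numbers are illustrative):

```python
import numpy as np

# P[i, j]: performance after training stage i, evaluated on task j (toy values).
P = np.array([
    [0.90, 0.20, 0.10],
    [0.80, 0.85, 0.15],
    [0.70, 0.75, 0.88],
])

def forgetting(P, t):
    """F_t: for each earlier task, its peak past performance minus its
    performance after stage t, averaged over the t-1 earlier tasks."""
    drops = [P[i:t - 1, i].max() - P[t - 1, i] for i in range(t - 1)]
    return float(sum(drops) / (t - 1))

def backward_transfer(P, t):
    """BWT is the negation of forgetting: positive BWT means learning
    new tasks improved performance on earlier ones."""
    return -forgetting(P, t)
```

Here task 1 peaked at 0.90 and ends at 0.70, task 2 peaked at 0.85 and ends at 0.75, so F_3 averages the two drops.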

Forward Transfer (FWT). Forward Transfer measures the generalization ability of continual learning algorithms to unseen tasks. Formally, the forward transfer FWT_t at training stage t is defined as

$$
\mathrm{FWT}_t = \frac{1}{T - t} \sum_{i=t+1}^{T} \left( P_{t,i} - b_i \right)
$$

where b_i is the baseline performance of the model evaluated on task i before undergoing continual learning. Strictly speaking, this definition of b_i differs from that of previous work [136, 213], where it denotes the performance of a randomly initialized model. Additionally, we extend the notion of forward transfer in the vertical direction to represent the performance improvement on downstream tasks resulting from domain-adaptive pre-training (see Table 2). Forward Transfer is alternatively referred to as temporal generalization [100] or knowledge transfer [116] in some literature. In this section, we introduce the evaluation protocols and datasets for continual LLMs.
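A matching sketch for forward transfer, with an assumed per-task baseline vector b (toy values of ours):

```python
import numpy as np

# P[i, j]: performance after training stage i, evaluated on task j (toy values).
P = np.array([
    [0.90, 0.20, 0.10],
    [0.80, 0.85, 0.15],
    [0.70, 0.75, 0.88],
])
b = np.array([0.15, 0.15, 0.12])   # per-task baseline before continual learning

def forward_transfer(P, b, t):
    """FWT_t: average gain over the baseline on tasks not yet trained on."""
    return float((P[t - 1, t:] - b[t:]).mean())

fwt1 = forward_transfer(P, b, 1)   # zero-shot gain on tasks 2 and 3 after stage 1
```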

(Continual) Domain Adaptation

4.1.1 CPT: Effectiveness and Efficiency. Before delving into the details of continual pre-training (CPT), it is important to address two fundamental questions. The first concerns effectiveness: can CPT enhance performance on downstream tasks beyond that of the initial training on a wide range of data domains? Extensive studies have not only demonstrated the necessity of CPT for improved downstream performance [35, 71, 95, 96, 100, 184], but also shown that when distributional shifts are gradual [95, 278] or somewhat correlated [71], CPT can effectively help the model generalize to unseen data. The second question is about efficiency: given the large size of an LLM's parameters and data, both old and new, can we achieve adaptation and knowledge retention in a computationally efficient way? Concerning efficiency, most studies focus on techniques for efficient knowledge retention [95, 96, 100, 118], which significantly overlap with the CL literature addressing catastrophic forgetting [4, 24, 191, 193, 195, 196, 201, 207, 213, 236]. In contrast to prior approaches that fully utilize the newly emerging data, some studies recognize the impracticality of this approach in real production environments and instead concentrate on further improving the efficiency of adaptation. For instance, ELLE [184] employs a function-preserved model expansion to facilitate efficient knowledge growth; [5] and [268] sub-sample training data based on novelty and diversity to enhance training efficiency, achieving superior performance compared to full-data training. Though currently underexplored, efficient adaptation in continual pre-training is poised to become significant, given recent findings emphasizing data quality over quantity for LLM generalization [216, 267].

4.1.2 General Observations on CPT. Table 1 summarizes the existing studies on continual pre-training (CPT), and here are some key observations we make about CPT.

· OBS-1: The development of advanced techniques tailored specifically for CPT is at the starting stage and warrants further exploration. Only about half of the examined papers propose novel techniques for CPT [5, 35, 44, 52, 71, 104, 183, 184, 221], while the remaining half either focus solely on the effects of pure adaptation without considering CL techniques [63, 69, 137], or conduct empirical studies on the straightforward application of existing CL techniques [95, 96, 100, 118].
· OBS-2: The diversity of CL techniques incorporated in CPT remains limited. Most practical implementations of CL techniques for CPT primarily focus on architecture expansion of LLMs [5, 35, 44, 52, 71, 183], with only a few explicitly utilizing replay [35, 183] and parameter regularization [5, 35].
· OBS-3: There is an apparent gap between the existing studies and the real production environment of CPT. Except for the recent study [278] which conducts CPT over 159 domains, the longest sequence of

Table 1. Summary of existing studies on Continual Pre-training of LLMs. The papers are organized based on their relation to CL: (i) no CL techniques are studied, (ii) CL techniques are studied solely as baselines, and (iii) new CL approaches are proposed. In the table, Dist. Shift denotes the type(s) of distributional shifts a particular study considers and aims to solve. Under Continual Learning Tech., we categorize three types of continual learning techniques studied in the papers: rehearsal (Rehearsal), parameter regularization (Param. Reg.), and architecture expansion (Arch. Exp.). We use '✓', '✗', and '♣' to denote 'deployed in the proposed method', 'not studied in the paper', and 'studied as a baseline method', respectively. Note that we do not include naive sequential fine-tuning in this table, as it is universally studied as an important baseline in all of the listed papers. Papers marked only with '♣' [95, 96, 100] study existing CL techniques without proposing new ones, and papers marked only with '✗' [63, 69] study special aspects of fine-tuning without using CL techniques.

pre-training stages explored is 8 [71, 100]. However, this falls short of real-world scenarios where continual pre-training occurs more frequently and persists for months or years. The efficacy of CPT methods in such prolonged scenarios remains uncertain. Additionally, investigating CPT in a task-boundary-free data stream setting is an important avenue for research to be explored in the future as well.

4.1.3 Distributional Shifts in CPT. This survey categorizes distributional shifts of CPT into three main types: (i) Language Shift: LLMs sequentially learn different language corpora, e.g., English → Chinese [63, 118]. (ii) Content Shift: LLMs sequentially learn corpora from different fields, e.g., chemistry → biology [35, 44, 69, 71, 100, 183]. (iii) Temporal Shift: distributional shifts occur over time, e.g., news in 2021 → news in 2022, with a major focus on timestamp-sensitive knowledge retention and update [5, 52, 95, 96, 100].

Language Shift. [63] focuses on assessing LLMs' natural ability to learn new languages sequentially. With no explicit CL techniques employed, the study observes consistent positive forward transfer of knowledge, facilitating new language acquisition regardless of the learning order. Forgetting, on the other hand, emerges as a significant challenge that cannot be mitigated by increasing the size of LLMs. In [118], the degree of forgetting of previously learned languages when adapting LLMs to a new language is investigated. Various CL techniques, including parameter freezing, LoRA [86], and (IA)³ [132], are evaluated across multiple dimensions. Preliminary experimental results highlight the non-trivial nature of addressing horizontal forgetting for CPT under the language shift as well.
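For readers unfamiliar with LoRA, a minimal numpy sketch of its forward pass follows; the dimensions, scaling, and initialization below are toy illustrations (real implementations attach such updates to individual attention/MLP weight matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                       # hidden size and LoRA rank (toy values)
W = rng.normal(size=(d, d))       # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01
B = np.zeros((d, r))              # B initialized to zero: the update starts as a no-op
alpha = 16.0                      # scaling hyper-parameter

def lora_forward(x):
    """h = x W^T + (alpha / r) * x (B A)^T: frozen weight plus a trainable
    low-rank correction; only A and B are updated for a given task."""
    return x @ W.T + (alpha / r) * (x @ (B @ A).T)

x = rng.normal(size=(1, d))
assert np.allclose(lora_forward(x), x @ W.T)   # zero-init B leaves the model unchanged
```

Because each task only stores the small A and B matrices, per-task adapters are cheap to keep and swap, which is what makes LoRA attractive for the CL comparisons above.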

Content Shift. [278] explores large-scale CPT over 159 content domains, and shows that CPT on various domains can effectively improve models' adaptation ability compared to DAP on a single domain. Similarly, [69] continues the pre-training phase of Pythia [16] with no complex CL techniques and discovers that learning rate re-warming consistently improves the continually pre-trained model relative to models trained from scratch. Built upon this simple observation, [94] further shows that a proper combination of learning rate re-warming and re-decaying, together with replay of previous data, is sufficient to achieve performance comparable to full re-training. LLPT [100] establishes a comprehensive training and evaluation protocol for a series of content-level distributional shifts. They assess multiple CL methods and, similar to [63], find consistent forward knowledge transfer, yet horizontal forgetting remains significant. Besides, contrary to the common understanding that experience replay [30] is the most efficient approach to preventing forgetting, the authors find it ineffective in the case of CPT, due to a potential overfitting issue. Recyclable Tuning [183] shows that if the upstream supplier continually pre-trains LLMs, with or without replay, consumer-side efficiency can be boosted by recycling previously learned update components when proper CL techniques are applied.

DEMix [71] incrementally trains and integrates new experts (DEMix layer) for new domains during CPT. To ensure reasonable inference performance during testing when no domain information is available, it proposes a parameter-free probabilistic approach to dynamically estimate a weighted mixture of domains. DEMix's modularization has been shown to facilitate efficient domain-adaptive pre-training, promote relevant knowledge during inference, and allow for removable components. Lifelong-MoE [35], similar to DEMix [71], incrementally trains domain experts for new domains. However, Lifelong-MoE differs from DEMix in utilizing a token-level gating function to activate multiple experts for intermediate embedding calculation. During training, previous experts' parameters and gating functions remain frozen, and knowledge distillation loss is employed to regulate parameter updates, which thereby makes Lifelong-MoE robust against the issue of horizontal forgetting.
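The parameter-free weighted-mixture inference used by DEMix can be caricatured in a few lines; scoring domains by a softmax over log-evidence and the toy numbers are our simplification of the general idea, not DEMix's exact procedure:

```python
import numpy as np

def mixture_predict(expert_probs, domain_log_evidence):
    """Combine per-domain experts' next-token distributions, weighting each
    expert by a softmax over how well its domain explains the context."""
    w = np.exp(domain_log_evidence - np.max(domain_log_evidence))
    w = w / w.sum()                       # normalized domain posterior
    return (w[:, None] * expert_probs).sum(axis=0)

# Two toy experts over a 3-token vocabulary; the context strongly favors expert 0.
experts = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.2, 0.7]])
mix = mixture_predict(experts, np.array([0.0, -5.0]))
```

Because the weights are inferred from the context at test time, no task ID is needed, and adding a new domain only adds a new expert row.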

It is noteworthy that some papers draw almost opposite conclusions regarding the significance of CPT for content shifts. For instance, [44] continually pre-trains BERT-based models [51, 133] on five scientific domains and evaluates performance on downstream sentiment analysis. They observe that even the trivial sequential pre-training does not exhibit severe forgetting, prompting reasonable questions about the necessity of CPT.

Temporal Shift. In the context of CPT amid content shifts, Multi-Task Learning (MTL) is often regarded as the upper bound achievable [178, 213, 237]. However, this belief does not fully hold when considering CL under temporal shifts [52, 95, 96], as temporal shifts can introduce conflicting information, posing challenges for LLMs. For instance, the statement 'Lionel Messi plays for team Barcelona' remains accurate from 2004 to 2021 but becomes false by 2024, as 'Lionel Messi plays for team Inter Miami' becomes the correct statement.

Hence, as advocated by CKL [96] and TemporalWiki [95], LLMs undergoing continual adaptation to temporal shifts must simultaneously achieve three objectives: (i) retention of old knowledge, (ii) acquisition of new knowledge, and (iii) update of outdated knowledge. They evaluate the same set of continual learning baseline methods [34, 79, 87, 239], each highlighting distinct aspects of their impact. CKL [96] observes that parameter expansion consistently exhibits robust performance across all experimental conditions. In contrast, replay-based methods struggle to efficiently adapt to new knowledge acquisition and outdated knowledge updates, leading to rapid forgetting of newly learned information during training. TemporalWiki [95] constructs a series of temporal corpora and their differential sets from sequential snapshots of Wikipedia, revealing that updating LLMs on these differential sets substantially enhances new knowledge acquisition and updates while requiring significantly fewer computational resources, and that various CL techniques prove effective in mitigating horizontal forgetting during this process. LLPT [100] introduces temporal generalization evaluation for LLMs pre-trained on sequential corpora. Through experiments on a large-scale, chronologically ordered Tweet stream, the authors demonstrate the superiority of CPT combined with CL techniques over task-specific LMs, in terms of both knowledge acquisition and temporal generalization. Nonetheless, these preliminary experiments do not conclusively determine which specific CL method is preferable to the others.

Another line of work, Temporal Language Models (TLMs), takes a different approach to address knowledge retention, acquisition, and update under temporal shifts by integrating temporal information into the model [52, 198, 219]. During training, they inject temporal information into training examples as prefixes of prompts, using special tokens [198], explicit year information [52], or syntax-guided structural information [219]. In sequential training experiments conducted by TempoT5 [52], comparison between continually and jointly pre-trained LMs demonstrates that CPT better balances adaptation and forgetting when the replay rate of past data is appropriately set.
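The prefix-injection idea behind TLMs can be illustrated with a toy helper; the two prefix formats below are illustrative stand-ins for the cited designs, not their exact templates:

```python
def add_temporal_prefix(text, year, style="token"):
    """Prepend temporal context to a training example so the model can
    condition its predictions on when a statement was true."""
    if style == "token":
        return f"<{year}> {text}"     # special-token style prefix
    return f"In {year}: {text}"       # explicit-year style prefix

example = add_temporal_prefix("Lionel Messi plays for team Barcelona.", 2019)
```

Training on time-prefixed examples lets a single model store temporally conflicting facts side by side, rather than overwriting one with the other.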

Others. CPT, as a technique to progressively attain novel knowledge, can also be used to refine LLMs' behavior. CEM [294] collects examples where the model's response is incorrect and continually trains the model on these examples, along with a supplemental dataset. RHO-1 [130] proposes Selective Language Modeling (SLM), which employs a reference model to evaluate the perplexity of each token in the training corpus, and continually pre-trains the model on high-perplexity tokens. Similarly, IR-DRO [36] re-trains the model on re-weighted examples from the original pre-training dataset, focusing more on higher-loss sequences.
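A rough sketch of reference-model-based token selection in the spirit of RHO-1's SLM follows; the excess-loss scoring rule and keep ratio are our guesses at the general idea, not the paper's exact criterion:

```python
import numpy as np

def select_tokens(train_losses, ref_losses, keep_ratio=0.5):
    """Keep the tokens whose training loss most exceeds the reference
    model's loss; the rest are masked out of the training objective."""
    excess = np.asarray(train_losses) - np.asarray(ref_losses)
    k = max(1, int(len(excess) * keep_ratio))
    mask = np.zeros(len(excess), dtype=bool)
    mask[np.argsort(-excess)[:k]] = True   # indices of highest-excess tokens
    return mask

mask = select_tokens([3.0, 0.5, 2.0, 0.1], [1.0, 0.6, 0.5, 0.2])
```

Tokens the model already handles as well as the reference (low excess loss) contribute nothing new and are skipped, concentrating compute on informative tokens.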

The significance of addressing temporal shifts through CPT is underscored by several industrial studies. For instance, [5] employs a dynamic vocabulary expansion algorithm and an efficient sub-sampling procedure to conduct CPT on large-scale emerging tweet data. Conversely, [137] adopts CPT without explicit measures to constrain model updates, releasing a series of BERT-based LMs incrementally trained on new tweet data every three months. Preliminary experimental results demonstrate substantial improvements of the continually pre-trained LMs over the base BERT model across downstream tasks. While some studies question the necessity of continually adapting LLMs along the temporal axis for environmental reasons, such as reducing CO₂ emissions [8], the community commonly embraces CPT as a more efficient learning paradigm compared to the traditional 'combine-and-retrain' approach.

Parameter-Efficient Fine-Tuning

Background of Continual Fine-Tuning (CFT). Continual Fine-Tuning (CFT) lies at the bottom layer of the vertical continuity, where models are trained on successive homogeneous tasks drawn from an evolving data distribution. As the service-oriented layer of LLMs, it does not require consideration of further adaptation to other downstream tasks, which simplifies the optimization objectives to a great extent: better adaptation and less forgetting 2 . In the era of LLMs, new computational paradigms in CFT have emerged and attracted significant attention within the research community. These topics include (i) Continual Instruction Tuning (CIT) [292], (ii) Continual Model Refinement (CMR) [74], (iii) Continual Model Alignment (CMA) [128, 287], and (iv) Continual Learning for Multimodal Language Models (CMLLMs) [77, 171]. We summarize existing studies on CFT in Table 3, categorizing them into the sub-categories listed above. The table includes details on incremental learning types (X-IL), LLM architecture, and the employed CL techniques and evaluation metrics. After discussing general observations on CFT in Section 4.3.1, we will delve into each sub-category in detail.

4.3.1 General Observations on CFT. Examining the landscape of continual learning in the context of LLMs, and combined with the results shown in Table 3, we make several key observations about CFT.

· OBS-1: There has been a noticeable transition in focus from CIL to TIL and DIL. It has long been a consensus in the CL community that CIL, since it requires the model to predict the context label and the within-context label at the same time [112, 232, 237], is the most challenging CL scenario and hence receives most of the community's attention. However, among all 35 papers presented in Table 3, only 3 study CFT in the CIL setting. This shift of research focus demonstrates the importance of TIL and DIL in real-world applications of continual LLMs. A more detailed discussion of this transition is included in Section 6.2.
· OBS-2: In CFT, CL techniques enjoy broader adoption and more explicit exploration compared to CPT and DAP. In Table 3, all 35 papers explicitly deploy CL techniques, 50% of which develop new techniques that cannot be easily interpreted as trivial combinations of existing classic CL techniques, e.g., the shared attentive learning framework in SAPT [297], the external memory deployed in Larimar [45], and the adaptive model averaging method to achieve Pareto optimality in AMA [128]. This underscores the recognition of continual learning as a pivotal component in the development of resilient and adaptive LLMs.

4.3.2 General Continual Fine-Tuning (General CFT). Researchers have long investigated the phenomenon of forgetting resilience in pre-trained LLMs when fine-tuned for downstream tasks [106, 144, 156, 223, 299], although some report the opposite [144]. Although the pre-trained weights initially position the model in a flat-loss basin, aiding adaptation

2 We direct interested readers to additional survey literature on the topic of general CFT [17, 105].

Table 3. Summary of the existing studies on Continual Fine-Tuning of LLMs, where the papers are organized in five main categories based on what downstream tasks they are designed to tackle, including (i) General Continual Fine-Tuning (CFT); (ii) Continual Instruction Tuning (CIT); (iii) Continual Model Refinement (CMR); (iv) Continual Model Alignment (CMA); (v) Continual Multimodal LLMs (CMLLMs), as shown in the CFT Type column. The X-IL column shows which continual learning paradigm the study includes [232]: TIL represents task-incremental learning, meaning task ID/information is provided during inference; DIL represents domain-incremental learning, meaning the tasks share the same format and no task ID/information is available during inference; CIL represents class-incremental learning, meaning the task ID needs to be further inferred at test time.

to future tasks without severely impacting previous ones [156], zero or near-zero forgetting is only observed at the representation level. This implies that while the model retains its ability to distinguish between task-specific representations, it may still forget specific task details [144, 223, 260, 299]. Therefore, additional measures are necessary when deploying these models in real-world applications [10, 37, 106, 182, 254, 281].

Many studies advance beyond naive sequential fine-tuning, leveraging the inherent anti-forgetting nature of LLMs while avoiding the adoption of overly complex CL techniques [255, 299]. For instance, LR ADJUST [255] proposes a straightforward yet effective method of dynamically adjusting the learning rate to mitigate the overwriting of knowledge from new languages onto old ones. Building on the innate anti-forgetting ability of large language models like Pythia [16], SEQ ∗ [299] introduces several strategies for fine-tuning LLMs on a sequence of downstream classification tasks, such as freezing the LLM and old classifier's parameters after warm-up, and pre-allocating future classifiers, etc.

Given the minimal forgetting observed at the representation level in CL, some studies aim to tackle the misalignment between the representation space and the decision-making layers by introducing representation-level constraints during CFT. NeiAttn [10] exemplifies this approach by formulating classification tasks as masked language modeling and proposing a neighboring attention mechanism to counteract negative representation drift.

Another line of approaches refines the input/output format and network architectures of pre-trained LLMs to be better suited for CFT. For instance, CTR [106] incorporates two CL-plugin modules, i.e., a task-specific module (TSM) for acquiring task-specific knowledge and a knowledge-sharing module (KSM) for selectively transferring previously learned similar knowledge. CIRCLE [281] manually designs diverse prompt templates for various types of buggy code, unifying them as the cloze task and employs difficulty-based replay to enhance continual program repair. LFPT5 [182] addresses lifelong few-shot language learning by consolidating sequence labeling, text classification, and text generation into a text-to-text generation task. It undergoes prompt tuning on generated pseudo-examples from previous domains when adapting to new tasks. In [291], the authors propose a method for adaptively adding compositional adapters during continual sequence generation tasks. Before training on new domains, a decision stage determines which trained module can be reused. During training, this module also regenerates examples of the past for replay. C3 [37] merges PEFT and in-context learning (ICL) in a teacher-student framework. The teacher model undergoes in-context tuning focused solely on the current domain, while the student model, together with tunable prompts, minimizes the KL-divergence between the output distribution and the ground truth and teacher model simultaneously.
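C3's student objective can be caricatured as cross-entropy on the label plus a distillation term toward the teacher; the direction of the KL term and the mixing weight below are our assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))      # shift for numerical stability
    return e / e.sum()

def student_loss(student_logits, teacher_logits, true_idx, beta=0.5):
    """Cross-entropy on the ground-truth label plus beta * KL(student || teacher),
    so the student tracks both the label and the frozen in-context-tuned teacher."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    ce = -np.log(p_s[true_idx])
    kl = np.sum(p_s * (np.log(p_s) - np.log(p_t)))
    return float(ce + beta * kl)

loss = student_loss(np.array([2.0, 0.5, -1.0]), np.array([1.5, 1.0, -0.5]), 0)
```

When the student matches the teacher exactly, the KL term vanishes and only the supervised cross-entropy remains.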

4.3.3 Continual Instruction Tuning (CIT). When instruction tuning data arrives as a stream, forgetting of previously learned instructions must be addressed. CT0 [208] represents the inaugural study on Continual Instruction Tuning (CIT) of LLMs, applying the replay method on the base T0 model throughout the process. Many subsequent studies focus on enhancing the replay method used during CIT. For instance, [80] improves replay efficiency by computing Key-Part Information Gain (KPIG) on masked parts to dynamically select replay data, addressing the 'half-listening' issue in instruction following. Similarly, SSR [89] uses the LLM itself to generate synthetic instances for replay, achieving superior or comparable performance to traditional methods at a lower cost.
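The self-synthesized replay loop of SSR can be sketched as follows (a hypothetical interface: the `generate` callable stands in for the LLM producing a pseudo instance for a previous task's instruction; SSR's actual in-context generation and filtering are more involved):

```python
def self_synthesized_replay(generate, prev_task_instructions, n_per_task=2):
    """Build a replay buffer without storing real old data: ask the current
    LLM (via the `generate` callable) to synthesize pseudo examples for the
    instructions of earlier tasks, then mix them into the new training set."""
    replay = []
    for instruction in prev_task_instructions:
        for i in range(n_per_task):
            replay.append({"instruction": instruction,
                           "pseudo_example": generate(instruction, i)})
    return replay
```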

Other approaches introduce multiple CL techniques during CIT. DynaInst [165] merges parameter regularization with dynamic replay, selectively storing and replaying instances and tasks to enhance outcomes. InstructionSpeak [279] employs negative training and replays instructions to improve both forward and backward transfer. Some methods incorporate PEFT. Orthogonal Low-Rank Adaptation (O-LoRA) [240] learns new tasks within an orthogonal subspace while preserving the LoRA parameters of previous tasks, minimizing interference among different tasks. The Shared Attention Framework (SAPT) [297] combines a PET block with a selection module via a Shared Attentive Learning & Selection module, tackling catastrophic forgetting and knowledge transfer concurrently. While regularization-based and architecture-based methods require additional parameter storage and GPU memory, they, together with replay-based methods, remain popular for CIT due to their simplicity and effectiveness [243].
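O-LoRA's orthogonality constraint can be sketched as follows (a simplified sketch: LoRA matrices are represented as nested lists of rows, and the penalty is the sum of squared inner products between the new task's rows and the frozen rows of previous tasks; the actual method adds this as a regularization term to the training loss):

```python
def olora_orthogonality_loss(A_new, A_prev_list):
    """Penalty that is zero iff the new LoRA subspace (rows of A_new) is
    orthogonal to every previously learned, frozen subspace in A_prev_list."""
    loss = 0.0
    for A_prev in A_prev_list:
        for u in A_new:
            for v in A_prev:
                dot = sum(ui * vi for ui, vi in zip(u, v))
                loss += dot * dot
    return loss
```

Keeping each new task's update in a subspace orthogonal to the old ones is what lets the method minimize cross-task interference without storing old data.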

4.3.4 Continual Model Refinement (CMR). The concept of model editing was initially explored in [215], which introduced a 'reliability-locality-efficiency' principle and proposed a gradient descent editor to address it efficiently. Subsequent research, such as [47] and [163], extended this principle to edit factual knowledge in BERT-based language models and larger models like GPT-J-6B [235] and T5-XXL [189], respectively, using gradient decomposition. These approaches typically update a subset of model parameters to alter the labels of specific inputs. Additionally, memory-based models, as discussed in [164] and [74], incorporate editing through retrieval mechanisms.

Continual Model Refinement (CMR) extends model refinement horizontally, presenting updated sample tuples {(x_e, y_e, ŷ_e)}_{e=1}^{N} sequentially as a stream. [125] initially introduces this idea, evaluating various CL methods with a dynamic sampling algorithm. Many CMR methods employ a retrieval mechanism. For instance, [74] uses hidden activations of the language model as a 'key' to activate updated parameters only when the input x resembles previously updated sample pairs; [280] improves this approach's efficiency by integrating LoRA [86]; [45] augments the LLM with an external episodic memory, modeling CMR as an ongoing memory refresh. Meanwhile, some methods focus solely on updating a subset of model parameters. For example, [85] addresses the issue of 'toxicity buildup and flash' in single-editing methods like ROME [157], adapting it to the CL context with a knowledge-aware layer selection algorithm. WISE [238] addresses the 'impossible triangle' of reliability, locality, and generalization in existing lifelong model refinement methods. It introduces a side memory system that enables knowledge sharding and merging, successfully achieving all three objectives simultaneously.
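The key-based activation of [74] can be sketched as follows (a simplified sketch: cosine similarity between the hidden activation and the stored edit keys decides whether an edit's updated parameters are used; the similarity threshold is our assumption):

```python
import math

def route_to_editor(h, edit_keys, threshold=0.8):
    """Return the index of the stored edit whose key best matches the hidden
    activation h (activating its updated parameters), or None to fall back
    to the unmodified base model."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb + 1e-12)
    best, best_sim = None, threshold
    for i, key in enumerate(edit_keys):
        sim = cosine(h, key)
        if sim >= best_sim:
            best, best_sim = i, sim
    return best
```

Routing only near-duplicate inputs to the edited parameters is what preserves locality: unrelated queries never touch the patched weights.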

While all these works pioneer research in CMR, the exploration of CMR for LLMs remains open. [75] highlights a potential problem: the location where a fact is stored may not coincide with the best place for editing it. This challenges the classical 'locate and edit' paradigm used by several existing methods [157, 158] and could become a significant concern for CMR [85]. Other questions, including whether such a problem setting suits LLMs and whether more memory- and computationally-efficient CMR methods can be developed for LLMs, are yet to be answered.

4.3.5 Continual Model Alignment (CMA). When LLMs undergo the phase of MA, vertical forgetting of previous knowledge usually occurs. In [128], the authors refer to this phenomenon of catastrophic forgetting induced by MA as the 'Alignment Tax.' Notably, even a single stage of MA can diminish the model's capabilities, as it restricts the model's responses to a narrower subset of the training distribution.

Continual Model Alignment (CMA) aims to continuously refine LLMs to align with evolving human values, ethics, and data. The static nature of LLM training on historical datasets can lead to discrepancies between the models' outputs and current factual accuracy, societal norms, and standards, making CMA a crucial process for maintaining their adaptability and alignment with contemporary contexts. There are two types of CMA frameworks: RL-based and SL-based. In the realm of RL-based CMA, two significant contributions have been noted. [128] identifies the conflicts between existing CL techniques and RLHF, and proposes Adaptive Model Averaging (AMA), which adaptively finds appropriate ratios for combining model layers to gain maximal reward with minimal tax; Continual Proximal Policy Optimization (CPPO) [287] proposes a weighting strategy that decides, for each example, whether it is used for policy enhancement or knowledge retention, mitigating the alignment tax over time. For SL-based CMA, Continual Optimal Policy Fitting (COPF) [286] presents a solution adapted from Direct Preference Optimization (DPO) [188], addressing its potential risks of sub-optimal policy fitting and over-optimization in the context of CMA.
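The layer-wise averaging at the heart of AMA can be sketched as follows (a minimal sketch: each 'layer' is a flat list of parameters, and the per-layer ratios are assumed to be given, whereas AMA [128] searches for them adaptively to maximize reward while minimizing the tax):

```python
def average_layers(pretrained_layers, aligned_layers, ratios):
    """Interpolate each layer between the pre-alignment model and the
    aligned model; ratios[i] is the weight placed on aligned layer i."""
    assert len(pretrained_layers) == len(aligned_layers) == len(ratios)
    return [
        [(1 - r) * p + r * a for p, a in zip(layer_p, layer_a)]
        for layer_p, layer_a, r in zip(pretrained_layers, aligned_layers, ratios)
    ]
```

Because the interpolation is per layer, the method can keep alignment-critical layers close to the aligned model while pulling knowledge-critical layers back toward the pre-alignment weights.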

4.3.6 Continual Multimodal Large Language Models (CMLLMs). Continually training multi-modal models like CLIP [185] has long been studied [171, 300], while the problem of continually training MLLMs remains underexplored. Several existing studies have investigated the causes of catastrophic forgetting when continually training MLLMs. [298] performs singular value decomposition on input embeddings, revealing a significant disparity among different input embeddings. This discrepancy causes the model to learn information irrelevant to previously trained tasks, resulting in catastrophic forgetting and negative forward transfer. [284] observes that minority collapse may lead to catastrophic forgetting when the imbalance ratio between majority and minority classes approaches infinity during fine-tuning. It further identifies hallucination as a contributing factor to performance degradation in MLLMs.

Continual Fine-Tuning MLLMs. In contrast to traditional continual learning methods that involve full-model fine-tuning for new tasks, continual fine-tuning for MLLMs focuses on refining specific layers when adapting to new tasks [32, 77, 284, 298, 304]. Given the strong capabilities of pre-trained models, training specific layers suffices and simultaneously reduces computational demands. [296] additionally considers a continual learning scenario, Continual Missing Modality Learning (CMML), where different modalities emerge throughout the incremental learning stages. All the aforementioned studies collectively indicate that MLLMs still suffer from catastrophic forgetting, which manifests in two ways: along the direction of vertical continuity, a performance decline on pre-trained tasks following fine-tuning for downstream tasks; and along the axis of horizontal continuity, performance degradation on previously fine-tuned tasks after fine-tuning for new tasks. [298] also observes negative forward transfer, where the performance on unseen tasks degrades when learning new tasks, indicating a decline in the model's generalization capability.

While traditional CL methods are applicable, some may not yield optimal results, as evidenced by various experiments [77, 298]. For instance, [77] observes consistent efficacy of replay-based and model-expansion strategies across diverse scenarios of continually fine-tuning MLLMs, whereas regularization-based methods only perform well on models that have been jointly instruction-tuned on multiple tasks. Other works seek to develop ad-hoc solutions for continual learning of MLLMs. [77] proposes EProj, which expands the projection layer in MLLMs for each new task and utilizes task-similarity-informed regularization (TIR) to enhance performance. [298] introduces Fwd-Prompt, a prompt-tuning method that projects the prompt gradient onto the residual space to minimize interference between tasks and onto the pre-trained subspace to reuse pre-trained knowledge, fostering positive forward transfer without relying on previous samples. [304] focuses on the forgetting of pre-trained MLLMs after they are fine-tuned on specific tasks and proposes Model Tailor, which compensates a selected subset of parameters critical for target-task performance. [296] presents a novel method named Reconstruct before Query (RebQ), leveraging the multi-modal knowledge of a pre-trained model to reconstruct the absent information of the missing modality. Recently, the Mixture-of-Experts (MoE) framework, which resembles architecture-based CL methods, has gained attention: it equips the model with the ability to learn different intentions through distinct experts. For example, [32] first introduces MoELoRA to fine-tune LLaVA, and results on the CoIN benchmark demonstrate its effectiveness in mitigating the catastrophic forgetting of MLLMs.

Continual Learning Meets Large Language Models: An Overview

Large language models (LLMs) are extensive in various dimensions, including the size of model parameters, pre-training datasets, computational resources, project teams, and development cycles [1, 6, 22, 40, 173, 186, 230, 231]. The substantial scale of LLMs presents notable challenges for development teams, particularly in keeping them updated amidst rapid environmental changes [5, 52, 95, 96, 100]. To illustrate, in 2023 the average daily influx of new tweets exceeded 500 million¹, and training on even a subset of this large volume of data is unaffordable. Recyclable Tuning [183] is the first work to explicitly outline the supplier-consumer structure in the modern LLM production pipeline. On the supplier side, the model is continually pre-trained over a sequence of large-scale unlabeled datasets. After every release of the pre-trained model, the consumer utilizes the stronger and more up-to-date upstream model for downstream tasks. Compared to the upstream supplier, downstream users often lack the capacity to collect and store large-scale data, maintain large-scale hardware systems, and train LLMs themselves. In this survey, we extend this framework and present a comprehensive modern production pipeline encompassing various studies on continual LLM pre-training, adaptation, and deployment (Fig. 1). What sets our framework apart from existing studies [261] is the incorporation of two directions of continuity: Vertical Continuity and Horizontal Continuity.

Vertical Continuity (Vertical Continual Learning)

Definition. Vertical continuity (or vertical continual learning) has long been studied, either implicitly or explicitly, in existing literature. Vertical continuity is characterized by a hierarchical structure encompassing data inclusiveness, task scope, and computational resources. Specifically, the training task transitions gradually from general pre-training to downstream tasks, typically undertaken by distinct entities within the production pipeline [68, 71, 183, 197, 268, 272]. Fig. 1 shows a typical pipeline for vertical continuity in LLMs, i.e., 'pre-training' → 'domain-adaptive training' → 'downstream fine-tuning' [42, 48, 68, 72, 73, 91, 121, 146, 148, 197, 257, 258, 272, 303]:

· Pre-training. During the pre-training stage, a substantial amount of data from diverse domains is required to develop a general-purpose LLM. This phase demands a sizable research and development team dedicated to training and benchmarking the model, along with considerable computational resources.

· Domain-Adaptive Pre-training. Subsequently, downstream institutions may opt for domain-adaptive pre-training to tailor the model for specific tasks using domain-specific data unavailable to the upstream supplier.

1 Source: https://www.omnicoreagency.com/twitter-statistics

Fig. 1. A high-level overview of the modern pipeline for continually pre-training and fine-tuning LLMs, where two dimensions of continuity are described. Vertical Continuity (or Vertical Continual Learning): LLM training can be vertically divided into three stages: (i) Continual Pre-Training (CPT), (ii) Domain-Adaptive Pre-training (DAP), and (iii) Continual Fine-Tuning (CFT). The main focus is the retention of the LLM's general knowledge (prevention of vertical forgetting). Horizontal Continuity (or Horizontal Continual Learning): After the LLMs are deployed, the models are continually updated when a new set of data becomes available. The primary goal is to prevent horizontal forgetting in a long sequence of tasks.


· Fine-tuning. Finally, the LLM undergoes fine-tuning on annotated data for downstream tasks before deployment.

Throughout the process, the unlabeled domain-specific dataset is smaller in scale than the upstream pre-training phase but larger than the final downstream task fine-tuning phase. This pattern extends to computational resources, team size, and other factors. It is important to note that vertical continuity can involve more than three stages [91, 129, 172, 199]. In real-world applications, during domain-adaptive pre-training, additional 'layers' can be added to accommodate multiple entities, such as various departments with distinct objectives but operating within the same domain.

Vertical Forgetting. We term the performance degradation (in terms of general knowledge) due to vertical continual learning 'vertical forgetting' . As shown in Fig. 2, for vertical continual learning, the data distribution of upstream tasks partially covers the downstream, meaning the model might start off at a decent initialization for the subsequent stage of training. Two significant challenges must be addressed to prevent vertical forgetting:

· Task Heterogeneity. Stemming from the inherent disparity between the formulation of upstream and downstream tasks, task heterogeneity can lead to differences in model structures and training schemes, which has long been recognized as a major hurdle [112, 124, 170, 193, 262]. To mitigate this issue, practitioners often employ methodologies such as freezing shared parameters during downstream phases or reformulating downstream tasks to match the structure of pre-training tasks [118, 177, 240, 257, 272, 273].

· Inaccessible Upstream Data. This challenge arises primarily from varying levels of confidentiality across the entities undertaking vertical continual learning. Data collected and curated under different protocols may not be accessible to some downstream entities. This scenario is even more challenging than the strict memory constraint presented in conventional CL (Definition A.3), as algorithms for the latter case rely on access to previous data at specific points for parameter importance measurement [4, 113] or for replay [24, 30, 195, 213]. To address the challenge of inaccessible upstream data, existing methods either use public datasets or generate pseudo-examples to create proxy pre-training datasets [182].

Fig. 2. A diagram showing two different directions of continual learning of LLMs. (a) Vertical Continual Learning of LLMs: in this case, the upstream data distribution usually partially covers the subsequent tasks' data distribution. (b) Horizontal Continual Learning of LLMs: no constraints on the data distributions are present in horizontal continual learning. Continual LLMs need to handle the challenge of abrupt distributional shifts and a longer sequence of training.

Horizontal Continuity (Horizontal Continual Learning)

Definition. Horizontal continuity (or horizontal continual learning) refers to continual adaptation across time and domains, a topic extensively explored within the continual learning community. The primary rationale for preserving horizontal continuity lies in the dynamic nature of data distribution over time. To stay updated with these content shifts, an LLM must incrementally learn newly emerged data; otherwise, repeated re-training becomes prohibitively expensive and impractical [5, 29, 219, 268]. Empirical evidence has consistently shown that despite their impressive capabilities, LLMs struggle to generalize effectively to future unseen data, particularly in the face of temporal or domain shifts [5, 52, 95, 96]. Additionally, they struggle to retain complete knowledge of past experiences when adapting to new temporal domains, although they do demonstrate a higher level of robustness against catastrophic forgetting [144, 156, 223, 299]. Whether complex CL algorithms are necessary to address these challenges in LLMs remains an open question. For instance, during large-scale continual pre-training, large institutions can typically afford the storage costs of retaining all historical data, rendering memory constraints largely irrelevant. Several studies have demonstrated that with full access to historical data, simple sparse replay techniques can effectively mitigate forgetting [62, 181, 208, 223]. In contrast, numerous continual learning studies have showcased superior performance compared to naive solutions, suggesting the importance of continual learning techniques in LLM training [35, 95, 100, 184].
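The simple sparse-replay recipe mentioned above can be sketched as follows (illustrative: the replay ratio and uniform sampling are our assumptions; practical systems tune both):

```python
import random

def build_mixed_batch(new_data, history, batch_size, replay_ratio=0.05, seed=0):
    """Assemble a training batch that is mostly new-domain data plus a small
    fraction of examples sampled from the stored historical data."""
    rng = random.Random(seed)
    n_replay = min(int(batch_size * replay_ratio), len(history)) if history else 0
    batch = rng.sample(history, n_replay)
    batch += rng.sample(new_data, batch_size - n_replay)
    return batch
```

Even a replay ratio of a few percent per batch can noticeably slow forgetting when the institution retains full access to historical data.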

Horizontal Forgetting. We informally define 'horizontal forgetting' as the performance degradation on previous tasks while the model undergoes horizontal continual learning. As illustrated in Fig. 2, horizontal continual learning typically involves training stages of similar scales, with potential distributional overlap among their data. In summary, two main challenges need to be addressed for horizontal continual learning of LLMs:

· Long Task Sequences. Horizontal continual learning ideally involves numerous incremental phases, particularly to accommodate temporal shifts in data distribution. A longer task sequence entails more update steps of the model, leading to inevitable forgetting of previously learned tasks. To address this challenge, researchers employ established continual learning techniques with stronger constraints, such as continual model ensembles [191].

· Abrupt Distributional Shift. In contrast to vertical continuity, where distributional shifts are often predictable, horizontal continual learning does not impose constraints on task properties. Evidence suggests that abrupt changes in task distributions can result in significant horizontal forgetting of the model [204].

Interpreting LLM Topics through the Lens of Vertical and Horizontal Continuity

Modern Stages of Learning Large Language Models Continually

Fig. 1 provides an overview of continually learning LLMs. Along the axis of vertical continuity, three main 'layers' of modern continual learning emerge. The top layer, Continual Pre-Training (CPT), involves continuous pre-training of LLMs by the supplier on newly-collected data alongside existing data (Section 4.1). The middle layer, Domain-Adaptive Pre-training (DAP), prepares LLMs for domain-specific applications through additional pre-training on domain-specific unlabeled data (Section 4.2). The bottom layer, Continual Fine-Tuning (CFT), targets models for final downstream tasks on the consumer side (Section 4.3), where the model needs to be updated after deployment for the specified task.


Continual Pre-Training~(CPT)

4.1.1 CPT: Effectiveness and Efficiency. Before delving into the details of continual pre-training (CPT), it is important to address two fundamental questions. The first concerns effectiveness: can CPT enhance performance on downstream tasks beyond that of the initial training on a wide range of data domains? Extensive studies have not only demonstrated the necessity of CPT for improved downstream performance [35, 71, 95, 96, 100, 184], but also shown that when distributional shifts are gradual [95, 278] or somewhat correlated [71], CPT can effectively help the model generalize to unseen data. The second question concerns efficiency: given the large size of an LLM's parameters and data, both old and new, can we achieve adaptation and knowledge retention in a computationally efficient way? Here, most studies focus on techniques for efficient knowledge retention [95, 96, 100, 118], which significantly overlap with the CL literature addressing catastrophic forgetting [4, 24, 191, 193, 195, 196, 201, 207, 213, 236]. In contrast to prior approaches that fully utilize emergent data, some studies recognize the impracticality of this approach in real production environments and instead concentrate on further improving the efficiency of adaptation. For instance, ELLE [184] employs function-preserved model expansion to facilitate efficient knowledge growth; [5] and [268] sub-sample training data based on novelty and diversity to enhance training efficiency, achieving superior performance compared to full-data training. Though currently underexplored, efficient adaptation in continual pre-training is poised to become significant, given recent findings emphasizing data quality over quantity for LLM generalization [216, 267].

4.1.2 General Observations on CPT. Table 1 summarizes the existing studies on continual pre-training (CPT), and here are some key observations we make about CPT.

· OBS-1: The development of advanced techniques tailored specifically for CPT is at the starting stage and warrants further exploration. Only about half of the examined papers propose novel techniques for CPT [5, 35, 44, 52, 71, 104, 183, 184, 221], while the remaining half either focus solely on the effects of pure adaptation without considering CL techniques [63, 69, 137], or conduct empirical studies on the straightforward application of existing CL techniques [95, 96, 100, 118].

· OBS-2: The diversity of CL techniques incorporated in CPT remains limited. Most practical implementations of CL techniques for CPT primarily focus on architecture expansion of LLMs [5, 35, 44, 52, 71, 183], with only a few explicitly utilizing replay [35, 183] and parameter regularization [5, 35].

· OBS-3: There is an apparent gap between the existing studies and the real production environment of CPT. Except for the recent study [278], which conducts CPT over 159 domains, the longest sequence of pre-training stages explored is 8 [71, 100]. However, this falls short of real-world scenarios where continual pre-training occurs more frequently and persists for months or years. The efficacy of CPT methods in such prolonged scenarios remains uncertain. Additionally, investigating CPT in a task-boundary-free data stream setting is an important avenue for future research.

Table 1. Summary of existing studies on Continual Pre-training of LLMs. The papers are organized based on their relation to CL: (i) no CL techniques are studied, (ii) CL techniques are studied solely as baselines, and (iii) new CL approaches are proposed. In the table, Dist. Shift denotes the type(s) of distributional shifts a particular study considers and is dedicated to solving. Under Continual Learning Tech., we categorize three types of continual learning techniques studied in the papers: rehearsal (Rehearsal), parameter regularization (Param. Reg.), and architecture expansion (Arch. Exp.). We use '✓', '✗', and '♣' to denote 'deployed in the proposed method', 'not studied in the paper', and 'studied as a baseline method', respectively. Note that we do not include naive sequential fine-tuning in this table, as it is universally studied as an important baseline in all of the listed papers. Papers with only '♣' [95, 96, 100] study only existing CL techniques without proposing new ones, and papers with only '✗' [63, 69] study special aspects of fine-tuning without using CL techniques.

4.1.3 Distributional Shifts in CPT. This survey categorizes distributional shifts in CPT into three main types: (i) Language Shift: LLMs sequentially learn different language corpora, e.g., English → Chinese [63, 118]. (ii) Content Shift: LLMs sequentially learn corpora from different fields, e.g., chemistry → biology [35, 44, 69, 71, 100, 183]. (iii) Temporal Shift: distributional shifts occur over time, e.g., news in 2021 → news in 2022, with a major focus on timestamp-sensitive knowledge retention and update [5, 52, 95, 96, 100].

Language Shift. [63] focuses on assessing LLMs' natural ability to learn new languages sequentially. With no explicit CL techniques employed, the study observes consistent positive forward transfer of knowledge, facilitating new language acquisition regardless of the learning order. Forgetting, on the other hand, emerges as a significant challenge that cannot be mitigated by increasing the size of LLMs. [118] investigates the degree of forgetting of previously learned languages when adapting LLMs to a new language. Various CL techniques, including parameter freezing, LoRA [86], and (IA)³ [132], are evaluated across multiple dimensions. Preliminary experimental results highlight the non-trivial nature of addressing horizontal forgetting for CPT under language shift as well.

Content Shift. [278] explores large-scale CPT over 159 content domains, showing that CPT on various domains can effectively improve a model's adaptation ability compared to DAP on a single domain. Similarly, [69] continues the pre-training phase of Pythia [16] with no complex CL techniques and discovers that learning-rate re-warming consistently improves over models trained from scratch. Building on this simple observation, [94] further shows that a proper combination of learning-rate re-warming, re-decay, and replay of previous data is sufficient to achieve performance comparable to full re-training. LLPT [100] establishes a comprehensive training and evaluation protocol for a series of content-level distributional shifts. The authors assess multiple CL methods and, similar to [63], find consistent forward knowledge transfer, yet horizontal forgetting remains significant. Moreover, contrary to the common understanding that experience replay [30] is the most efficient approach to preventing forgetting, they find it ineffective in the case of CPT due to a potential overfitting issue. Recyclable Tuning [183] shows that if the upstream supplier continually pre-trains LLMs, with or without replay, consumer-side efficiency can be boosted by recycling previously learned update components when proper CL techniques are applied.
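The learning-rate re-warming and re-decay discussed above can be sketched as follows (illustrative: linear warm-up followed by cosine decay is a common concrete choice, not necessarily the exact schedules used in [69, 94]):

```python
import math

def rewarmed_lr(step, warmup_steps, total_steps, peak_lr, min_lr=0.0):
    """Schedule restarted at the beginning of a new CPT stage: linearly
    re-warm to peak_lr, then cosine re-decay toward min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * min(progress, 1.0)))
```

Restarting the schedule for each new corpus lets the model adapt quickly early in the stage while the decay limits late-stage drift away from earlier knowledge.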

DEMix [71] incrementally trains and integrates new experts (DEMix layer) for new domains during CPT. To ensure reasonable inference performance during testing when no domain information is available, it proposes a parameter-free probabilistic approach to dynamically estimate a weighted mixture of domains. DEMix's modularization has been shown to facilitate efficient domain-adaptive pre-training, promote relevant knowledge during inference, and allow for removable components. Lifelong-MoE [35], similar to DEMix [71], incrementally trains domain experts for new domains. However, Lifelong-MoE differs from DEMix in utilizing a token-level gating function to activate multiple experts for intermediate embedding calculation. During training, previous experts' parameters and gating functions remain frozen, and knowledge distillation loss is employed to regulate parameter updates, which thereby makes Lifelong-MoE robust against the issue of horizontal forgetting.
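DEMix's parameter-free routing can be sketched as follows (a simplified sketch: each domain expert scores the current context with a log-likelihood, and the mixture weights are the softmax posterior over domains given an optional prior; the paper estimates these quantities online during inference):

```python
import math

def domain_mixture_weights(expert_log_likelihoods, prior=None):
    """Turn per-expert log-likelihoods of the current context into posterior
    mixture weights over domains, without any learned routing parameters."""
    n = len(expert_log_likelihoods)
    prior = prior or [1.0 / n] * n
    scores = [ll + math.log(p) for ll, p in zip(expert_log_likelihoods, prior)]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Because routing is computed from the experts' own likelihoods rather than a trained gate, new experts can be added (or removed) without retraining any routing parameters.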

It is noteworthy that some papers draw almost opposite conclusions regarding the significance of CPT for content shifts. For instance, [44] continually pre-trains BERT-based models [51, 133] on five scientific domains and evaluates performance on downstream sentiment analysis. They observe that even the trivial sequential pre-training does not exhibit severe forgetting, prompting reasonable questions about the necessity of CPT.

Temporal Shift. In the context of CPT amid content shifts, Multi-Task Learning (MTL) is often regarded as the upper bound achievable [178, 213, 237]. However, this belief does not fully hold when considering CL under temporal shifts [52, 95, 96], as temporal shifts can introduce conflicting information, posing challenges for LLMs. For instance, the statement 'Lionel Messi plays for team Barcelona' remains accurate from 2004 to 2021 but becomes false by 2024, as 'Lionel Messi plays for team Inter Miami' becomes the correct statement.

Hence, as advocated by CKL [96] and TemporalWiki [95], LLMs undergoing continual adaptation to temporal shifts must simultaneously achieve three objectives: (i) retention of old knowledge, (ii) acquisition of new knowledge, and (iii) update of outdated knowledge. The two works evaluate the same set of continual learning baseline methods [34, 79, 87, 239], each highlighting distinct aspects of their impact. CKL [96] observes that parameter expansion consistently exhibits robust performance across all experimental conditions. In contrast, replay-based methods struggle to efficiently acquire new knowledge and update outdated knowledge, leading to rapid forgetting of newly learned information during training. TemporalWiki [95] constructs a series of temporal corpora and their differential sets from sequential snapshots of Wikipedia, revealing that updating LLMs on these differential sets substantially enhances new knowledge acquisition and updates while requiring significantly less computational resources, and various CL techniques prove effective in mitigating horizontal forgetting during this process. LLPT [100] introduces temporal generalization evaluation for LLMs pre-trained on sequential corpora. Through experiments on a large-scale chronologically-ordered Tweet stream, the authors demonstrate the superiority of CPT combined with CL techniques over task-specific LMs, in terms of both knowledge acquisition and temporal generalization. Nonetheless, these preliminary experiments do not conclusively determine which specific CL method is preferable to the others.

Another line of work, Temporal Language Models (TLMs), takes a different approach to address knowledge retention, acquisition, and update under temporal shifts by integrating temporal information into the model [52, 198, 219]. During training, they inject temporal information into training examples as prefixes of prompts, using special tokens [198], explicit year information [52], or syntax-guided structural information [219]. In sequential training experiments conducted by TempoT5 [52], comparison between continually and jointly pre-trained LMs demonstrates that CPT better balances adaptation and forgetting when the replay rate of past data is appropriately set.

Others. CPT as a technique to progressively attain novel knowledge, can be used to refine LLMs' behavior. CEM [294] collects examples where the model's response is incorrect and continually trains the model on these examples, along with a supplemental dataset. RHO-1 [130] proposes Selective Language Modeling (SLM), which employs a reference model to evaluate the perplexity of each token in the training corpus, and continually pre-trains the model on high-perplexity tokens. Similarly, IR-DRO [36] re-trains the model on re-weighted examples from the original pre-training dataset, focusing more on higher-loss sequences.

The significance of addressing temporal shifts through CPT is underscored by several industrial studies. For instance, [5] employs a dynamic vocabulary expansion algorithm and an efficient sub-sampling procedure to conduct CPT on large-scale emerging tweet data. Conversely, [137] adopts CPT without explicit measures to constrain model updates, releasing a series of BERT-based LMs incrementally trained on new tweet data every three months. Preliminary experimental results demonstrate substantial improvements of continually pre-trained LMs over the base BERT model across downstream tasks. While some studies question the necessity of continually adapting LLMs along the temporal axis for environmental reasons, such as reducing CO 2 emissions [8], the community commonly embraces CPT as a more efficient learning paradigm compared to the traditional 'combine-and-retrain' approach.

4.1.1 CPT: Effectiveness and Efficiency. Before delving into the details of continual pre-training (CPT), it is important to address two fundamental questions. The first concerns effectiveness: can CPT enhance downstream performance beyond what the initial training on a wide range of data domains provides? Extensive studies have not only demonstrated the necessity of CPT for improved downstream performance [35, 71, 95, 96, 100, 184], but also shown that when distributional shifts are gradual [95, 278] or somewhat correlated [71], CPT can effectively help the model generalize to unseen data. The second concerns efficiency: given the large size of an LLM's parameters and of the data, both old and new, can adaptation and knowledge retention be achieved in a computationally efficient way? Here, most studies focus on techniques for efficient knowledge retention [95, 96, 100, 118], which significantly overlap with the CL literature on catastrophic forgetting [4, 24, 191, 193, 195, 196, 201, 207, 213, 236]. In contrast to approaches that fully utilize the emerging data, some studies recognize the impracticality of doing so in real production environments and instead concentrate on further improving the efficiency of adaptation. For instance, ELLE [184] employs function-preserving model expansion to facilitate efficient knowledge growth, while [5] and [268] sub-sample training data based on novelty and diversity, achieving superior performance compared to full-data training. Though currently underexplored, efficient adaptation in continual pre-training is poised to become significant, given recent findings emphasizing data quality over quantity for LLM generalization [216, 267].
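The novelty-based sub-sampling of [5, 268] can be illustrated with a minimal sketch: score each incoming document by its distance to the nearest already-seen document in embedding space, and keep only the most novel fraction. The scoring rule, embedding inputs, and `keep_ratio` below are illustrative assumptions, not those papers' actual criteria.

```python
import numpy as np

def novelty_subsample(new_embs, seen_embs, keep_ratio=0.5):
    """Keep the most 'novel' fraction of incoming documents, scored by
    cosine distance to their nearest neighbour among already-seen data.
    Illustrative sketch; [5] and [268] use their own selection criteria."""
    # Normalise rows so dot products equal cosine similarities.
    new_n = new_embs / np.linalg.norm(new_embs, axis=1, keepdims=True)
    seen_n = seen_embs / np.linalg.norm(seen_embs, axis=1, keepdims=True)
    # Novelty = 1 - max cosine similarity to any seen document.
    novelty = 1.0 - (new_n @ seen_n.T).max(axis=1)
    k = max(1, int(keep_ratio * len(new_embs)))
    return np.argsort(-novelty)[:k]  # indices of the k most novel documents
```

Exact duplicates of seen data score zero novelty and are dropped first, which is the intended behavior for high-redundancy streams such as tweets.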

4.1.2 General Observations on CPT. Table 1 summarizes the existing studies on continual pre-training (CPT), and here are some key observations we make about CPT.

· OBS-1: The development of advanced techniques tailored specifically for CPT is still at an early stage and warrants further exploration. Only about half of the examined papers propose novel techniques for CPT [5, 35, 44, 52, 71, 104, 183, 184, 221]; the remaining half either focus solely on the effects of pure adaptation without considering CL techniques [63, 69, 137], or conduct empirical studies on the straightforward application of existing CL techniques [95, 96, 100, 118].

· OBS-2: The diversity of CL techniques incorporated in CPT remains limited. Most practical implementations of CL techniques for CPT focus on architecture expansion of LLMs [5, 35, 44, 52, 71, 183], with only a few explicitly utilizing replay [35, 183] or parameter regularization [5, 35].

· OBS-3: There is an apparent gap between existing studies and real production environments for CPT. Except for the recent study [278], which conducts CPT over 159 domains, the longest sequence of pre-training stages explored is 8 [71, 100]. This falls short of real-world scenarios, where continual pre-training occurs more frequently and persists for months or years, and the efficacy of CPT methods in such prolonged settings remains uncertain. Additionally, investigating CPT in a task-boundary-free data stream setting is an important avenue for future research.

Table 1. Summary of existing studies on continual pre-training of LLMs. The papers are organized based on their relation to CL: (i) no CL techniques are studied, (ii) CL techniques are studied solely as baselines, and (iii) new CL approaches are proposed. In the table, Dist. Shift denotes the type(s) of distributional shifts a study considers and is dedicated to solving. Under Continual Learning Tech., we categorize three types of continual learning techniques studied in each paper: rehearsal (Rehearsal), parameter regularization (Param. Reg.), and architecture expansion (Arch. Exp.). We use '✓', '✗', and '♣' to denote 'deployed in the proposed method', 'not studied in the paper', and 'studied as a baseline method', respectively. Note that we do not include naive sequential fine-tuning in this table, as it is universally studied as an important baseline in all of the listed papers. Papers with only '♣' [95, 96, 100] study existing CL techniques without proposing new ones, and papers with only '✗' [63, 69] study special aspects of fine-tuning without using CL techniques.

4.1.3 Distributional Shifts in CPT. This survey categorizes the distributional shifts of CPT into three main types: (i) Language Shift: LLMs sequentially learn different language corpora, e.g., English → Chinese [63, 118]. (ii) Content Shift: LLMs sequentially learn corpora from different fields, e.g., chemistry → biology [35, 44, 69, 71, 100, 183]. (iii) Temporal Shift: distributional shifts occur over time, e.g., news in 2021 → news in 2022, with a major focus on timestamp-sensitive knowledge retention and update [5, 52, 95, 96, 100].

Language Shift. [63] focuses on assessing LLMs' natural ability to learn new languages sequentially. With no explicit CL techniques employed, the study observes consistent positive forward transfer of knowledge, facilitating new language acquisition regardless of the learning order. Forgetting, on the other hand, emerges as a significant challenge that cannot be mitigated by increasing the size of LLMs. [118] investigates the degree to which previously learned languages are forgotten when adapting LLMs to a new language. Various CL techniques, including parameter freezing, LoRA [86], and (IA)³ [132], are evaluated across multiple dimensions. Preliminary experimental results highlight the non-trivial nature of addressing horizontal forgetting for CPT under language shift as well.
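As a concrete illustration of the parameter-efficient techniques evaluated in [118], the sketch below shows a LoRA-style [86] forward pass: the pre-trained weight stays frozen and only a low-rank update is trainable, which bounds how far the model can drift from its previously learned languages. The shapes, `alpha`, and zero-initialization of `A` are illustrative assumptions, not the evaluated configuration.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """LoRA-style forward pass: the frozen pre-trained weight W (out, in)
    is augmented with a trainable low-rank update B @ A of rank r.
    Only A and B would receive gradients during adaptation."""
    return x @ (W + alpha * B @ A).T

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))   # frozen pre-trained weight
A = np.zeros((2, 8))              # rank-2 factor; zero init means the
B = rng.standard_normal((4, 2))   # update is a no-op before training
x = rng.standard_normal((3, 8))
```

With `A` initialized to zero, the adapted model exactly reproduces the base model at the start of training, a common choice so adaptation begins from the pre-trained behavior.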

Content Shift. [278] explores large-scale CPT over 159 content domains and shows that CPT across various domains can effectively improve models' adaptation ability compared to DAP on a single domain. Similarly, [69] continues the pre-training phase of Pythia [16] with no complex CL techniques and discovers that learning-rate re-warming consistently yields improvements over models trained from scratch. Building on this observation, [94] further shows that a proper combination of learning-rate re-warming, re-decay, and replay of previous data is sufficient to achieve performance comparable to full re-training. LLPT [100] establishes a comprehensive training and evaluation protocol for a series of content-level distributional shifts. The authors assess multiple CL methods and, similar to [63], find consistent forward knowledge transfer, yet horizontal forgetting remains significant. Moreover, contrary to the common understanding that experience replay [30] is the most effective approach to preventing forgetting, they find it ineffective in the case of CPT due to potential overfitting. Recyclable Tuning [183] shows that if the upstream supplier continually pre-trains LLMs, with or without replay, consumer-side efficiency can be boosted by recycling previously learned update components when proper CL techniques are applied.
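The re-warming and re-decay recipe studied by [69, 94] can be sketched as a one-cycle schedule applied per CPT stage: the learning rate is re-warmed linearly from a floor back to a peak, then cosine-decayed again. The peak/minimum rates and warmup fraction below are placeholder values, not the papers' settings.

```python
import math

def rewarm_redecay_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5, warmup=0.01):
    """Per-stage schedule in the spirit of [69, 94]: linear re-warming
    from min_lr to peak_lr, then cosine re-decay back to min_lr.
    Hyper-parameter values are illustrative placeholders."""
    warm_steps = max(1, int(warmup * total_steps))
    if step < warm_steps:                      # linear re-warming phase
        return min_lr + (peak_lr - min_lr) * step / warm_steps
    # Cosine re-decay over the remaining steps of this stage.
    t = (step - warm_steps) / max(1, total_steps - warm_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
```

In the combined recipe of [94], this schedule would be applied to each new corpus while a fraction of previous-stage data is replayed in every batch.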

DEMix [71] incrementally trains and integrates new experts (DEMix layers) for new domains during CPT. To ensure reasonable inference performance when no domain information is available at test time, it proposes a parameter-free probabilistic approach that dynamically estimates a weighted mixture of domains. DEMix's modularization has been shown to facilitate efficient domain-adaptive pre-training, promote relevant knowledge during inference, and allow for removable components. Lifelong-MoE [35], similar to DEMix [71], incrementally trains domain experts for new domains, but differs in utilizing a token-level gating function that activates multiple experts for intermediate embedding calculation. During training, previous experts' parameters and gating functions remain frozen, and a knowledge distillation loss regulates parameter updates, thereby making Lifelong-MoE robust against horizontal forgetting.
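DEMix's parameter-free mixture estimation can be illustrated as a simple Bayesian update over experts: treat the unknown domain as a latent variable and re-weight a prior by each expert's log-likelihood of the observed tokens. This sketch assumes such per-expert log-likelihoods are given and simplifies DEMix's actual procedure.

```python
import numpy as np

def update_domain_posterior(prior, expert_logprobs):
    """Parameter-free domain-mixture estimation in the spirit of
    DEMix [71]: a Bayesian update of the prior over experts using each
    expert's log-likelihood of the observed context. Illustrative only."""
    log_post = np.log(prior) + np.asarray(expert_logprobs)
    log_post -= log_post.max()        # subtract max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()          # normalised mixture weights
```

The resulting weights can then combine the experts' next-token distributions, so the expert whose domain best explains the context dominates without any trained gating parameters.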

It is noteworthy that some papers draw almost opposite conclusions regarding the significance of CPT for content shifts. For instance, [44] continually pre-trains BERT-based models [51, 133] on five scientific domains and evaluates performance on downstream sentiment analysis. They observe that even the trivial sequential pre-training does not exhibit severe forgetting, prompting reasonable questions about the necessity of CPT.

Temporal Shift. In the context of CPT amid content shifts, Multi-Task Learning (MTL) is often regarded as the achievable upper bound [178, 213, 237]. However, this belief does not fully hold for CL under temporal shifts [52, 95, 96], as temporal shifts can introduce conflicting information, posing challenges for LLMs. For instance, the statement 'Lionel Messi plays for team Barcelona' remains accurate from 2004 to 2021 but is false by 2024, when 'Lionel Messi plays for team Inter Miami' becomes the correct statement.

Hence, as advocated by CKL [96] and TemporalWiki [95], LLMs undergoing continual adaptation to temporal shifts must simultaneously achieve three objectives: (i) retention of old knowledge, (ii) acquisition of new knowledge, and (iii) update of outdated knowledge. Both works evaluate the same set of continual learning baselines [34, 79, 87, 239], each highlighting distinct aspects of their impact. CKL [96] observes that parameter expansion consistently exhibits robust performance across all experimental conditions, whereas replay-based methods struggle to efficiently acquire new knowledge and update outdated knowledge, leading to rapid forgetting of newly learned information during training. TemporalWiki [95] constructs a series of temporal corpora and their differential sets from sequential snapshots of Wikipedia, revealing that updating LLMs on these differential sets substantially enhances new knowledge acquisition and update while requiring significantly fewer computational resources, and that various CL techniques effectively mitigate horizontal forgetting during this process. LLPT [100] introduces temporal generalization evaluation for LLMs pre-trained on sequential corpora. Through experiments on a large-scale, chronologically ordered Tweet stream, the authors demonstrate the superiority of CPT combined with CL techniques over task-specific LMs, in terms of both knowledge acquisition and temporal generalization. Nonetheless, these preliminary experiments do not conclusively determine which specific CL method is preferable to the others.
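The idea behind TemporalWiki's differential sets can be illustrated by diffing two snapshots: keep only articles that are new or whose text changed, so CPT spends compute on changed knowledge. This toy sketch operates on `{title: text}` dictionaries, whereas the paper processes full Wikipedia dumps.

```python
def differential_set(old_snapshot, new_snapshot):
    """TemporalWiki-style [95] differential corpus: retain only pages
    that were created or edited between two snapshots. Minimal sketch
    over {title: text} dicts."""
    return {
        title: text
        for title, text in new_snapshot.items()
        if old_snapshot.get(title) != text   # new page or edited page
    }

old = {"Messi": "plays for Barcelona", "Physics": "unchanged article"}
new = {"Messi": "plays for Inter Miami", "Physics": "unchanged article",
       "NewPage": "created after the old snapshot"}
```

Unchanged articles are excluded entirely, which is what makes training on the differential set much cheaper than re-training on the full new snapshot.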

Another line of work, Temporal Language Models (TLMs), takes a different approach to knowledge retention, acquisition, and update under temporal shifts by integrating temporal information into the model [52, 198, 219]. During training, they inject temporal information into training examples as prompt prefixes, using special tokens [198], explicit year information [52], or syntax-guided structural information [219]. In the sequential training experiments of TempoT5 [52], a comparison between continually and jointly pre-trained LMs demonstrates that CPT better balances adaptation and forgetting when the replay rate of past data is appropriately set.
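The prefix-injection mechanism of TLMs can be sketched in a few lines; the exact prefix formats below are illustrative stand-ins for the special-token [198] and explicit-year [52] variants, not the papers' actual templates.

```python
def add_time_prefix(example_text, year, style="token"):
    """Prepend temporal context to a training example, as TLMs do.
    'token' mimics a special-token prefix in the spirit of [198];
    'year' mimics explicit year text as in TempoT5 [52].
    Both formats here are illustrative assumptions."""
    if style == "token":
        return f"<{year}> {example_text}"
    return f"year: {year} text: {example_text}"
```

At inference time, conditioning on different prefixes lets a single model answer the same question for different points in time, e.g. which club a player belonged to in a given year.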

Others. CPT, as a technique for progressively acquiring novel knowledge, can also be used to refine LLMs' behavior. CEM [294] collects examples on which the model's response is incorrect and continually trains the model on these examples along with a supplemental dataset. RHO-1 [130] proposes Selective Language Modeling (SLM), which employs a reference model to evaluate the perplexity of each token in the training corpus and continually pre-trains the model on high-perplexity tokens. Similarly, IR-DRO [36] re-trains the model on re-weighted examples from the original pre-training dataset, focusing more on higher-loss sequences.
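RHO-1's token selection can be sketched as masking the training loss to the tokens with the highest excess loss relative to a reference model; the scoring rule and `keep_ratio` here are illustrative simplifications of the paper's procedure.

```python
import numpy as np

def slm_token_mask(model_losses, reference_losses, keep_ratio=0.6):
    """Selective Language Modeling in the spirit of RHO-1 [130]: score
    each token by its excess loss under the training model relative to
    a reference model, and keep only the top fraction for the loss.
    Illustrative; RHO-1's scoring and ratios differ in detail."""
    excess = np.asarray(model_losses) - np.asarray(reference_losses)
    k = max(1, int(keep_ratio * len(excess)))
    keep = np.argsort(-excess)[:k]        # indices of highest-excess tokens
    mask = np.zeros(len(excess), dtype=bool)
    mask[keep] = True
    return mask                            # True = token contributes to loss
```

Tokens the reference model already explains well (low excess loss) are masked out, concentrating updates on tokens that carry information the model has yet to learn.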

The significance of addressing temporal shifts through CPT is underscored by several industrial studies. For instance, [5] employs a dynamic vocabulary expansion algorithm and an efficient sub-sampling procedure to conduct CPT on large-scale emerging tweet data. In contrast, [137] adopts CPT without explicit measures to constrain model updates, releasing a series of BERT-based LMs incrementally trained on new tweet data every three months; preliminary results demonstrate substantial improvements of the continually pre-trained LMs over the base BERT model across downstream tasks. While some studies question the necessity of continually adapting LLMs along the temporal axis for environmental reasons, such as reducing CO₂ emissions [8], the community commonly embraces CPT as a more efficient learning paradigm than the traditional 'combine-and-retrain' approach.

Domain-Adaptive Pre-training (DAP)

Background of DAP. Institutions, regardless of size, often possess significant amounts of unlabeled, domain-specific data. Such data bridges the gap between general-purpose LLMs trained on diverse corpora and fine-tuned LLMs designed for specific downstream tasks, and leveraging it as a preparatory stage can facilitate effective adaptation of LLMs to downstream tasks. This process, variously called 'continued/continual/continuous pre-training' [9, 42, 68, 73, 91, 138, 148, 212, 264, 266, 268, 272, 282, 285], 'further pre-training' [3, 48, 129, 200, 218], 'domain tuning' [197], 'knowledge enhancement pre-training' [138], and 'knowledge injection training' [258], is unified and termed 'Domain-Adaptive Pre-training (DAP)' [72] for clarity and consistency throughout this survey. In the pioneering work on domain-adaptive pre-training (DAPT) [72], the authors continue pre-training language models on a larger domain-specific dataset before fine-tuning them on downstream tasks, resulting in universally improved performance across various tasks. As this observation has been validated on multiple domains in parallel, including BioMed, CS, News, and Reviews [72], practitioners commonly accept that employing DAP on additional unlabeled domain-specific data benefits downstream tasks. Consequently, the technique has become widely deployed in many modern LLMs.

Summary of LLMs with DAP. We provide a summary of the existing 41 studies utilizing DAP for LLMs in Table 2. Each entry is characterized by three main features: (i) training process specifications, encompassing the vertical domain for which LLMs are trained, the training pipeline preceding release, and the LLM architecture employed; (ii) adopted continual learning techniques, including rehearsal, parameter regularization, and architecture expansion; and (iii) evaluation metrics for CL, such as backward transfer (forgetting) and forward transfer (adaptation to downstream data).

4.2.1 General Observation on DAP. Several key observations emerge regarding the research landscape of DAP (Table 2).

· OBS-1: DAP predominantly occurs in a single stage. Continual DAP, which involves more than one stage, is seldom explored: among all papers listed in Table 2, only one employs two stages of DAP ('PT → DAP → DAP → FT' in Code Llama [199]). It is arguably reasonable to categorize studies that conduct only one stage of DAP and nothing more [9, 39, 67, 138, 168, 177, 218, 226, 268, 271] under CPT rather than DAP; nevertheless, since they aim to adapt a general-purpose LLM to a specific domain, we include them in this section.

· OBS-2: The notion of interpreting DAP through the lens of CL, whether intentional or not, is widely embraced. As shown in Table 2, except for the first section (white, 13/41), where papers overlook any potential side effects of DAP leading to vertical forgetting, the remaining sections (all gray, 28/41) either evaluate the potential negative impacts of DAP or proactively employ CL techniques to mitigate the risk of vertical forgetting.

· OBS-3: Further research into more sophisticated CL techniques, not just for DAP but for vertical continual learning in general, is much needed. This is supported by the widespread adoption of CL techniques (22/41) for training domain-specific LLMs. However, the diversity of these techniques is limited, with only replay [9, 33, 39, 42, 91, 148, 197, 258, 274, 289] and parameter expansion (LoRA [177, 257, 271, 272] or layer/block expansion [257, 272]) utilized. In fact, practitioners may not explicitly recognize that DAP should be viewed from the perspective of vertical continuity: studies deploying replay often term the technique 'data combination' [258] or 'data mixing/mixture' [9, 39, 148, 274], without recognizing it as a typical CL solution to vertical continual learning.

4.2.2 Different Domains of DAP. We include work aimed at establishing vertical LLMs across various domains, including legal, medical, financial, scientific, and code. Additionally, we cover other domains such as language and e-commerce.

Legal Domain. In Lawyer Llama [91], the authors gathered publicly available legal texts from China Courts websites, totaling approximately 10 billion tokens as noted in a GitHub issue. In SaulLM [42], the authors collected the DAP corpus from various jurisdictions in different countries, resulting in a corpus of 30 billion tokens covering diverse aspects of legal texts. When combined with previously available datasets, the total number of tokens used for legal-domain DAP reaches 94 billion. The substantial volume of DAP data, while offering valuable insights into specific domains, increases the risk of vertical forgetting of general knowledge due to the large number of update steps involved. To mitigate this issue, SaulLM incorporates general data from Wikipedia, StackExchange, and GitHub into the DAP data, constituting about 2% of the final dataset [42]. Similarly, Lawyer Llama replays general-domain data during DAP, but the replay rate is not disclosed [91]. [222] also replays non-latest business documents during DAP when building a Japanese business-specific LLM.
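The fixed-rate replay described above can be sketched as follows. This is a minimal illustration, not any paper's exact pipeline: the function name and per-document sampling scheme are our own assumptions, with the rate defaulting to the ~2% used by SaulLM.

```python
import random

def mix_with_replay(domain_stream, general_corpus, replay_rate=0.02, seed=0):
    """Interleave a small fraction of general-domain documents into a
    domain-adaptive pre-training (DAP) stream to mitigate vertical forgetting.

    Illustrative sketch: each domain document is emitted in order, and with
    probability `replay_rate` one general-domain document is injected after it.
    """
    rng = random.Random(seed)
    mixed = []
    for doc in domain_stream:
        mixed.append(doc)
        # with probability `replay_rate`, inject one general-domain document
        if rng.random() < replay_rate:
            mixed.append(rng.choice(general_corpus))
    return mixed
```

In practice, the replay rate trades off domain adaptation speed against retention of general ability; the surveyed works choose rates from 2% (SaulLM) up to 25% (MeLlama).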

Medical Domain. Efforts have been made to develop medical specialists by either training an LLM from scratch [66, 143] or fine-tuning publicly available LLMs to meet specific medical needs [33, 146, 258]. Among these approaches, DAP techniques have been extensively utilized to preserve the communication and instruction-following abilities of a general LLM, preparing it for subsequent medical applications [33, 146, 258]. BioMedGPT [146] is a multi-modal biomedical language model that integrates representations of human language and the language of life (molecules, proteins, cells, genes, etc.). Prior to the final multi-modal supervised fine-tuning, the authors initialize the model from Llama2-Chat [231] and conduct DAP using extensive biomedical documents from S2ORC [134], without considering any CL techniques or evaluations. In [68], DAP is performed using Chinese medical encyclopedias and online expert articles, with next-token prediction as the training objective. During DAP, performance gradually deteriorates on general-domain datasets as the number of training steps increases, but improves on the downstream medical examination tasks [82]. PMC-LLaMA [258] gathers biomedical papers from S2ORC [134] and medical textbooks for 'knowledge injection training.' During this phase, a general language corpus from RedPajama-Data [43] is replayed at a 5% rate within each training batch. However, the paper does not analyze the effectiveness of mixing general-domain data into DAP.

To mitigate vertical forgetting, AF Adapter [272] proposes an adapter structure that extends the width of attention layers and FFNs to acquire domain knowledge, and only the adapters are tuned during DAP. Similarly, Hippocrates [2] deploys LoRA during DAP to inject medical-specific knowledge while preserving general ability. MeLlama [265] mixes in about 25% general-domain data for DAP on clinical notes and biomedical articles, even achieving positive backward transfer on MMLU [82]. HuatuoGPT-II [33] proposes to fuse DAP into the final SFT, unifying the two stages into a single process. The challenge of such a process mainly comes from the data heterogeneity of DAP's unlabeled corpus. The authors address this challenge by reformulating paragraphs of data into (instruction, output) format using existing large language models. They further employ a priority sampling strategy to avoid compromising downstream ability, a pitfall observed in the fixed-rate data-mixing strategy [231]. This paper empirically demonstrates the superiority of unified one-stage SFT over two-stage training, questioning the rationale of the current DAP practice. On medical-domain data, [197] finds that LMs constrained by CL techniques on source domains exhibit greater robustness to future domain shifts. Specifically, they identify that parameter regularization techniques like EWC [113], despite a slightly higher cost, can facilitate positive forward and backward transfer.
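The EWC regularizer mentioned above penalizes drift in parameters that were important for previous tasks, weighting each parameter's deviation by its estimated Fisher information. A minimal scalar sketch (real implementations operate on parameter tensors and a Fisher diagonal estimated from task data):

```python
def ewc_penalty(params, old_params, fisher, lam=1.0):
    """Elastic Weight Consolidation (EWC) regularizer:
    0.5 * lam * sum_i F_i * (theta_i - theta*_i)^2,
    where theta* are the parameters after the previous task and F_i is the
    Fisher information of parameter i. Added to the new task's loss.
    """
    return 0.5 * lam * sum(
        f * (p - p0) ** 2 for p, p0, f in zip(params, old_params, fisher)
    )
```

Parameters with high Fisher values (important for old tasks) are anchored strongly, while unimportant ones remain free to adapt to the new domain.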

Financial Domain. A gap persists between general-purpose LLMs and existing domain-specific smaller-scale LLMs [7, 259], underscoring the urgent need for more powerful financial-domain experts through the integration of LLMs. Notably, DAP techniques have emerged as crucial tools for tailoring LLMs to the intricacies of the financial domain while mitigating the negative effects of abrupt domain shifts from general to finance [121, 138, 268, 271, 289].

BBT-Fin [138] collects a Chinese financial DAP dataset comprising 80 billion tokens sourced from corporate reports, analyst reports, social media, and financial news. In addition to the conventional masked language modeling (MLM) training objective, BBT-Fin further incorporates triplet masking and span masking techniques during DAP. CFGPT [121] creates CFData, a financial dataset for DAP and SFT comprising 141 billion tokens. CFGPT does not employ CL techniques during DAP, but utilizes QLoRA [50] during SFT to prevent overfitting to downstream data and to balance general response ability against domain-specific ability. These two methods are typical domain-specific LLMs focusing solely on adaptation to target domains, without explicit CL measures or evaluation of vertical forgetting.

In [268], the authors aim to enhance the data efficiency of DAP. When the downstream tasks' data distribution T is known, based on the generalization bound [14, 61, 213], the authors propose to sample the subset of DAP data

Table 2. Summary of the existing studies that leverage Domain-Adaptive Pre-training of LLMs, where the papers are organized into four main categories based on whether they (i) adopt continual learning techniques and (ii) perform the evaluation for backward transfer (forgetting). In the column Train Proc. (Training Process), we omit the phase of general Pre-Training. DAP represents Domain-Adaptive Pre-Training; SFT represents Supervised Fine-Tuning; IT represents Instruction Tuning. The prefixes G- and D- represent General and Domain-Specific training processes [91, 129], and the prefix U- represents them unified [33, 257]. The prefixes MM- and LC- represent Multi-Modal and Long-Context training phases [146, 199, 301]. In the column Continual Learning Eval., we consider two criteria: (i) Backward Transfer, i.e., performance degradation on the previous tasks, also known as catastrophic forgetting; (ii) Forward Transfer, i.e., the performance gained by DAP when transferring the LLMs to the downstream tasks. We use L and Perp. to denote Loss and Perplexity, FT to denote Fine-Tuning, ZS and FS to denote Zero-Shot and Few-Shot Accuracy, and HE and LLM to denote Human Evaluation and LLM Evaluation for generative tasks.

whose distribution D is similar to the downstream task's data, i.e., d_{HΔH}(D, T) is low. When the downstream data distribution is unknown, the authors suggest ensuring novelty and diversity in the sampled corpus for DAP. This approach significantly enhances DAP efficiency: it utilizes only 10% of the originally collected data yet outperforms models trained on the entire DAP dataset, underscoring the importance of data quality over quantity. WeaverBird [271] introduces an intelligent finance dialogue system, where the encoder is trained on Chinese and English financial documents, alongside expert-annotated financial query-response pairs, using LoRA [87]. Xuanyuan 2.0 [289], akin to HuatuoGPT-II [33], proposes the technique of hybrid-tuning, which fuses the stages of DAP and SFT into one, and general-domain and financial-domain data into one. Notably, the data distribution in hybrid-tuning is unconventional: financial DAP data comprises only a small portion (13%). This prompts a pertinent question, in line with the investigation of efficient DAP in [268]: is a large DAP dataset necessary for developing a domain-specific LLM?
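The similarity-driven subset selection of [268] can be approximated in toy form. Note the caveat: the Jaccard token-overlap score below is a crude stand-in we use purely for illustration, not the paper's HΔH-divergence estimate, and the function name and keep ratio are our own assumptions (10% echoes the paper's reported data budget).

```python
def select_dap_subset(candidates, target_docs, keep_ratio=0.1):
    """Keep the fraction of candidate DAP documents whose token distribution
    is closest to the downstream (target) documents.

    Sketch only: similarity is measured by Jaccard overlap of token sets,
    a simple proxy for distributional distance.
    """
    target_vocab = set()
    for doc in target_docs:
        target_vocab.update(doc.split())

    def score(doc):
        toks = set(doc.split())
        return len(toks & target_vocab) / max(len(toks | target_vocab), 1)

    ranked = sorted(candidates, key=score, reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    return ranked[:k]
```

A real implementation would replace the Jaccard proxy with a learned domain discriminator or n-gram language-model scores over much larger corpora.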

Scientific Domain. Vertical scientific LLMs span many subjects [9, 15, 129, 142, 168, 285, 301]. However, among all the studies listed above, only a small fraction adopt the technique of DAP. OceanGPT [15] is the first LLM tailored specifically for the ocean domain. It performs DAP on a raw corpus of ocean science literature, prioritizing recent research and historically significant works. K2 [48] pioneers the development of a foundational language model tailored specifically for geoscience. It aggregates geoscience open-access literature and Earth-science-related Wikipedia pages for DAP. Following this, it undergoes multi-task instruction tuning utilizing LoRA [87] on both a general instruction tuning dataset and the GeoSignal benchmark introduced within the K2 framework. AstroLlama [168] gathers abstracts solely from astronomy papers on arXiv and continues pre-training on them. It observes improved perplexity on the domain of scholarly astronomy, without providing further quantitative evaluation. MarineGPT [301] is a multi-modal LLM designed specifically for the marine domain. During DAP, MarineGPT incorporates 5 million marine image-text pairs to inject domain knowledge. This involves training a Q-Former [122] between the frozen visual encoder and text decoder [54, 230].

Another branch of methods proactively integrates replay of general-domain data to mitigate vertical forgetting. GeoGalactica [129] introduces a series of LLMs tailored for geoscience. In the DAP phase, besides the 52-billion-token geoscience corpus, arXiv papers and Codedata are incorporated, with a mixing ratio of 8:1:1. The authors believe that including the Codedata during the model's pre-training can significantly boost the reasoning ability of the LLMs. Although GeoGalactica pinpoints challenges of DAP, including overfitting, catastrophic forgetting, training stability, and convergence speed, it provides neither empirical evidence supporting the inclusion of the Codedata nor specific measures to address the challenges above. Llemma [9] focuses on mathematics, is initialized from Code Llama [199], and undergoes DAP on a blend of a 55-billion-token mathematical pre-training dataset and general-domain data at a ratio of 19:1. In contrast, PLlama [274], designed for plant science, mixes domain-specific and general-domain data at a ratio of 9:1.

Code Domain. The development of LLMs for automatic code filling, debugging, and generation holds significant practical importance [166, 220]. These advancements cover various frameworks, including encoder-only [166], encoder-decoder [242, 245], and decoder-only [67, 172, 227]. There is a growing trend towards decoder-only architectures [220], leveraging models pre-trained on general natural language like Llama [230, 231]. Consequently, there is a shift in the training objective from utilizing code structures to simpler tasks like next-token prediction and infilling.

From the perspective of CL, the code domain presents unique advantages and challenges for DAP compared to other domains. On one hand, its hierarchical structure (general-domain corpus → multi-language code → specific programming language) provides an ideal training pipeline for DAP [199], offering potential for more efficient training strategies. On the other hand, programming languages adhere to strict grammars, unlike fuzzy, context-dependent natural language. Consequently, language models should ideally leverage these structures through tailored designs, and adopting the same training objectives as for natural languages may yield sub-optimal results. Therefore, many existing studies omit DAP [147, 242, 245]. In the following, we introduce existing code LLMs that employ DAP before the final downstream tasks, discussing both their common attributes and unique characteristics.

Representing a series of notable works that focus solely on adaptation to target domains, CodeGen [172] comprises a suite of LLMs designed for natural language (CodeGen-NL), multi-lingual programming languages (CodeGen-Multi), and mono-lingual programming languages (CodeGen-Mono). These models are trained sequentially, with each subsequent model initialized from the previous one, which was trained on more general-domain data. Comment-Aug [218] addresses the challenge of aligning programming languages with natural languages (PL-NL alignment) by performing DAP on code augmented with generated additional comments. StarCoder [226] introduces two models: StarCoderBase and StarCoder. StarCoderBase is initially trained on a mixed dataset comprising various programming languages without significant reweighting of the data. Subsequently, StarCoderBase undergoes further fine-tuning on an additional 35 billion tokens of Python code, resulting in StarCoder. DeepSeek-Coder-v1.5 [67] originates from DeepSeek-LLM [224] and undergoes pre-training on 2 trillion tokens, comprising 87% source code, 10% English code-related natural language, and 3% Chinese natural language corpus. Initialization from a general-domain LLM results in improved performance across various tasks, including natural language and mathematical reasoning, with minimal performance degradation on coding tasks, which underscores the efficacy of DAP.

As the only work that utilizes general-data replay to mitigate vertical forgetting in the code domain, Code Llama [199] introduces a sophisticated training framework tailored for various coding tasks and model sizes. Initialized from Llama 2 weights, these models undergo DAP on a dataset composed of deduplicated public code, discussions about code, and a subset of natural language data. This mix of natural language data serves as a form of pseudo-replay to maintain the models' proficiency in understanding natural language. Besides replay, architecture expansion has proven effective in acquiring robust coding abilities while preventing vertical forgetting. IRCoder [177] utilizes compiler intermediate representations to enhance the multilingual transferability of code LLMs. By conducting DAP on code grounded in intermediate representations with LoRA [86], IRCoder achieves superior multilingual programming instruction following, enhanced multilingual code understanding, and increased robustness to prompt perturbations. Llama Pro [257] undergoes DAP on a combination of code and math data. It expands the original Llama2 architecture by dynamically adding multiple identity copies of the transformer blocks. These added blocks initially preserve the original functionality and are then tuned during DAP. The proposed expansion method is shown to be more resilient to vertical forgetting than parameter-efficient tuning methods such as LoRA.
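The block-expansion idea in Llama Pro hinges on the new blocks initially computing the identity, so the expanded model reproduces the original's outputs before any DAP tuning. A toy sketch with blocks modeled as plain functions (the function names and insertion schedule are our own; the real method zero-initializes the new transformer blocks' output projections):

```python
def expand_with_identity_blocks(blocks, insert_every=2):
    """Interleave new blocks that initially act as the identity function,
    so the expanded stack computes the same mapping as the original.
    During DAP, only the inserted blocks would be tuned."""
    identity = lambda x: x
    expanded = []
    for i, blk in enumerate(blocks, 1):
        expanded.append(blk)
        if i % insert_every == 0:
            expanded.append(identity)
    return expanded

def forward(blocks, x):
    """Apply a stack of blocks sequentially."""
    for blk in blocks:
        x = blk(x)
    return x
```

Because the inserted blocks start as identities, the expanded model's output matches the original model's output exactly at initialization, which is what makes this expansion resilient to vertical forgetting.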

The three aforementioned studies highlight the importance of DAP for code LLMs. However, it is crucial to note that the problem definition and conventional architectures of existing code LLMs may present compatibility challenges for deploying DAP, which need to be addressed in the future.

Other Domains. ECONET [73] enhances the model's ability to reason about event temporal relations through a dedicated DAP phase. Temporal and event indicators are masked out, and a contrastive loss is applied to the recovered masked tokens. Results demonstrate that incorporating this DAP stage significantly improves performance on final tasks compared to direct fine-tuning. Concept-Aware Language Model (CALM) [303] introduces a data-efficient DAP approach for enhancing the concept-centric commonsense reasoning ability of LLMs. It incorporates both generative and discriminative commonsense reasoning tasks specifically tailored for concept-centric reasoning. Consequently, even a small number of DAP examples can lead to notable improvements on downstream tasks.

Aurora-M [167] and Swallow [60] adopt a simple replay strategy that mixes a small portion of general data into DAP to preserve their multi-lingual ability. Furthermore, Sailor [55] studies the optimal data-mixing strategy for DAP, balancing general knowledge against capability in different languages. EcomGPT-CT [148] employs a data-mixing strategy for DAP that transforms semi-structured e-commerce data into a set of nodes and edges, samples a cluster of nodes, and then extracts and concatenates them into a training example. It combines the general-domain corpus with e-commerce data at a ratio of 2:1, which is significantly lower than the common setting adopted by other works.

Notably, some papers study other effective ways of performing DAP. AdaptLLM [39] transforms raw corpora into (raw text, question, answer) format, creating intrinsic reading-comprehension tasks. AdaptLLM demonstrates superior domain-specific knowledge adaptation and minimal vertical forgetting, thereby challenging the data efficiency of conventional DAP. Tag-LLM [212] re-purposes a general-domain LLM into a domain-specific one via multi-stage training of domain tags and function tags, without modifying the base LLM's weights, thereby mitigating forgetting.


Continual LLMs' Evaluation Protocols. LAnguage Model Analysis (LAMA) is an evaluation framework designed to probe the world knowledge embedded in language models [179]. LAMA converts each world fact into a cloze statement, which is then input into the language model to predict the correct answer. It has been extensively utilized in work on CPT under temporal shifts [95, 96]. FUAR (Forgotten / (Updated + Acquired) Ratio) is proposed for CPT to address OP's drawback of being unable to accurately reflect the model's behavior. A FUAR value of 1 represents an equal trade-off between knowledge forgetting and knowledge learning, while a FUAR less than 1 suggests high learning efficacy. In TRACE [241], the authors propose a set of 'X-Delta' metrics for continual instruction tuning, quantifying the forward transfer on specific abilities of LLMs, which is a straightforward extension of FWT. Specifically, the authors construct three sets of evaluation tasks to benchmark the abilities of LLMs, including general ability, instruction following, and safety. For a more detailed introduction to these evaluation protocols, please refer to Appendix B.2.
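FUAR is straightforward to compute from counts of facts the model forgot, updated, and newly acquired; the zero-denominator handling below is our own assumption, not part of the metric's original definition.

```python
def fuar(forgotten, updated, acquired):
    """FUAR = Forgotten / (Updated + Acquired).

    Ratio of knowledge lost to knowledge gained during continual
    pre-training: FUAR = 1 means an equal trade-off, FUAR < 1 means
    learning outweighs forgetting (high learning efficacy).
    """
    gained = updated + acquired
    if gained == 0:
        # assumption: nothing gained -> metric undefined unless
        # nothing was forgotten either
        return float("inf") if forgotten > 0 else 0.0
    return forgotten / gained
```

For example, a model that forgets 2 facts while updating 5 and acquiring 5 scores FUAR = 0.2, indicating efficient learning.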

Datasets. In this section, we provide a comprehensive review of the datasets available for benchmarking continual LLMs, as illustrated in Table 4. We provide information about these datasets' types, the distributional shifts and semantic domains they include, and their sources and applications. We intentionally exclude datasets used for domain-adaptive pre-training of LLMs in vertical domains such as legal, medical, and financial, unless they are specifically designed for continual domain-adaptive pre-training. Furthermore, we omit datasets used in general continual fine-tuning, as they have already been extensively studied in existing works [17, 105]. For details, please refer to Appendix B.3.


Continual Fine-Tuning (CFT)

Background of Continual Fine-Tuning (CFT). Continual Fine-Tuning (CFT) lies at the bottom layer of vertical continuity, where models are trained on successive homogeneous tasks drawn from an evolving data distribution. As the service-oriented layer of LLMs, it does not require consideration of further adaptation to other downstream tasks, which simplifies the optimization objective to a great extent: better adaptation and less forgetting 2 . In the era of LLMs, new computational paradigms in CFT have emerged and attracted significant attention within the research community. These topics include (i) Continual Instruction Tuning (CIT) [292], (ii) Continual Model Refinement (CMR) [74], (iii) Continual Model Alignment (CMA) [128, 287], and (iv) Continual Learning for Multimodal Language Models (CMLLMs) [77, 171]. We summarize existing studies on CFT in Table 3, categorizing them into the sub-categories listed above. The table includes details on incremental learning types (X-IL), LLM architecture, and employed CL techniques and evaluation metrics. After discussing general observations on CFT in Section 4.3.1, we delve into each sub-category in detail.

4.3.1 General Observations on CFT. Examining the landscape of continual learning in the context of LLMs, combined with the results shown in Table 3, we make several key observations about CFT.

· OBS-1: There has been a noticeable transition in focus from CIL to TIL and DIL. It has long been a consensus in the CL community that CIL, which requires the model to predict the context label and the within-context label at the same time [112, 232, 237], is the most challenging CL scenario and hence receives most of the community's attention. However, among all 35 papers presented in Table 3, only 3 study CFT in the CIL setting. This transition of research focus demonstrates the importance of TIL and DIL in real-world applications of continual LLMs. A more detailed discussion of this transition is included in Section 6.2.

· OBS-2: In CFT, CL techniques enjoy broader adoption and explicit exploration compared to CPT and DAP. In Table 3, all 35 papers explicitly deploy CL techniques, 50% of which develop new techniques that cannot be easily interpreted as a trivial combination of existing classic CL techniques, e.g., the shared attentive learning framework in SAPT [297], the external memory deployed in Larimar [45], and the adaptive model averaging method to achieve Pareto-optimality in AMA [128]. This underscores the recognition of continual learning as a pivotal component in the development of resilient and adaptive LLMs.

4.3.2 General Continual Fine-Tuning (General CFT). Researchers have long investigated the phenomenon of forgetting resilience in pre-trained LLMs when fine-tuned for downstream tasks [106, 144, 156, 223, 299], though some studies report the opposite [144]. Although the pre-trained weights initially position the model in a flat loss basin, aiding adaptation

2 We direct interested readers to additional survey literature on the topic of general CFT [17, 105].

Table 3. Summary of the existing studies on Continual Fine-Tuning of LLMs, where the papers are organized into five main categories based on the downstream tasks they are designed to tackle, including (i) General Continual Fine-Tuning (CFT); (ii) Continual Instruction Tuning (CIT); (iii) Continual Model Refinement (CMR); (iv) Continual Model Alignment (CMA); (v) Continual Multimodal LLMs (CMLLMs), as shown in the column CFT Type. The column X-IL shows which continual learning paradigm the study includes [232]: TIL represents task-incremental learning, meaning the task ID/information is provided during inference; DIL represents domain-incremental learning, meaning the tasks are defined in the same format and no task ID/information is available during inference; CIL represents class-incremental learning, meaning the task ID needs to be further inferred at test time.

to future tasks without severely impacting previous ones [156], zero or near-zero forgetting is only observed at the representation level. This implies that while the model retains its ability to distinguish between task-specific representations, it may still forget specific task details [144, 223, 260, 299]. Therefore, additional measures are necessary when deploying these models in real-world applications [10, 37, 106, 182, 254, 281].

Many studies advance beyond naive sequential fine-tuning, leveraging the inherent anti-forgetting nature of LLMs while avoiding overly complex CL techniques [255, 299]. For instance, LR ADJUST [255] proposes a straightforward yet effective method of dynamically adjusting the learning rate to mitigate the overwriting of old languages' knowledge by new ones. Building on the innate anti-forgetting ability of large language models like Pythia [16], SEQ ∗ [299] introduces several strategies for fine-tuning LLMs on a sequence of downstream classification tasks, such as freezing the LLM and old classifiers' parameters after warm-up and pre-allocating future classifiers.
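The intuition behind LR ADJUST, shrinking the learning rate as new languages arrive so later updates overwrite less prior knowledge, can be sketched as follows; the geometric decay and the floor value are illustrative assumptions, not the paper's exact schedule.

```python
def adjusted_lr(base_lr, stage, decay=0.5, floor=1e-6):
    """Sketch of a stage-wise learning-rate schedule for sequential
    fine-tuning: each new language/stage uses a smaller learning rate,
    bounded below by `floor` so training never stalls completely.
    """
    return max(base_lr * (decay ** stage), floor)
```

With `base_lr=1e-3` and `decay=0.5`, the first new language trains at 5e-4, the second at 2.5e-4, and so on, progressively protecting earlier knowledge.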

Given the minimal forgetting observed at the representation level in CL, some studies aim to tackle the misalignment between the representation space and the decision-making layers by introducing representation-level constraints during CFT. NeiAttn [10] exemplifies this approach by formulating classification tasks as masked language modeling and proposing a neighboring attention mechanism to counteract negative representation drift.

Another line of approaches refines the input/output format and network architectures of pre-trained LLMs to better suit CFT. For instance, CTR [106] incorporates two CL-plugin modules, i.e., a task-specific module (TSM) for acquiring task-specific knowledge and a knowledge-sharing module (KSM) for selectively transferring previously learned similar knowledge. CIRCLE [281] manually designs diverse prompt templates for various types of buggy code, unifying them as a cloze task, and employs difficulty-based replay to enhance continual program repair. LFPT5 [182] addresses lifelong few-shot language learning by consolidating sequence labeling, text classification, and text generation into a single text-to-text generation task. It undergoes prompt tuning on pseudo-examples generated from previous domains when adapting to new tasks. In [291], the authors propose a method for adaptively adding compositional adapters during continual sequence generation tasks. Before training on new domains, a decision stage determines which trained module can be reused. During training, this module also regenerates examples from the past for replay. C3 [37] merges PEFT and in-context learning (ICL) in a teacher-student framework. The teacher model undergoes in-context tuning focused solely on the current domain, while the student model, together with tunable prompts, simultaneously minimizes the KL-divergence between its output distribution and both the ground truth and the teacher model's outputs.
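A student objective of the C3 flavor, combining cross-entropy against the ground truth with a KL term toward the teacher's distribution, might look like the following minimal form. The function name, the `alpha` weighting, and the equal-weight default are our assumptions for illustration.

```python
import math

def distill_loss(student_probs, teacher_probs, target_idx, alpha=0.5):
    """Combine cross-entropy to the ground-truth label with KL-divergence
    KL(teacher || student) over the output distribution.

    `student_probs` / `teacher_probs`: probability vectors over classes;
    `target_idx`: index of the ground-truth class.
    """
    ce = -math.log(student_probs[target_idx])          # supervised term
    kl = sum(                                          # distillation term
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )
    return alpha * ce + (1 - alpha) * kl
```

When the student matches the teacher exactly, the KL term vanishes and only the supervised cross-entropy remains.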

4.3.3 Continual Instruction Tuning (CIT). When instruction tuning data arrives as a stream, forgetting of previously learned instructions should be addressed. CT0 [208] represents the inaugural study on Continual Instruction Tuning (CIT) of LLMs, applying the replay method to the base T0 model throughout the process. Many subsequent studies focus on enhancing the replay method used during CIT. For instance, [80] improves replay efficiency by computing Key-Part Information Gain (KPIG) on masked parts to dynamically select replay data, addressing the 'half-listening' issue in instruction following. Similarly, SSR [89] uses the LLM itself to generate synthetic instances for replay, achieving superior or comparable performance to traditional methods at a lower cost.

Other approaches introduce multiple CL techniques during CIT. DynaInst [165] merges parameter regularization with dynamic replay, selectively storing and replaying instances and tasks to enhance outcomes. InstructionSpeak [279] employs negative training and replay instructions to improve both forward and backward transfer. Some methods incorporate PEFT. Orthogonal Low-Rank Adaptation (O-LoRA) learns new tasks within an orthogonal subspace while preserving LoRA parameters for previous tasks [240], minimizing interference among different tasks. Shared Attention Framework (SAPT) combines a PET block with a selection module via a Shared Attentive Learning & Selection module, tackling catastrophic forgetting and knowledge transfer concurrently [297]. Although regularization-based and architecture-based methods require additional parameter storage and GPU memory, they, together with replay-based methods, remain popular for CIT due to their simplicity and effectiveness [243].
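O-LoRA's orthogonality constraint can be illustrated with a penalty on squared inner products between the (frozen) low-rank directions of previous tasks and those of the new task: the penalty is zero exactly when the subspaces are orthogonal. The pure-Python formulation below is a stand-in for the actual tensor regularizer on LoRA A-matrices.

```python
def orthogonality_penalty(old_dirs, new_dirs):
    """Sum of squared inner products between row vectors of previous
    tasks' LoRA matrices (`old_dirs`, frozen) and the new task's
    LoRA matrix (`new_dirs`, trainable).

    Minimizing this pushes the new task's update into a subspace
    orthogonal to earlier tasks, reducing interference.
    """
    penalty = 0.0
    for u in old_dirs:
        for v in new_dirs:
            dot = sum(a * b for a, b in zip(u, v))
            penalty += dot * dot  # -> 0 when all pairs are orthogonal
    return penalty
```

In training, this term would be added to the task loss with a tunable coefficient, leaving the frozen old-task adapters untouched.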

4.3.4 Continual Model Refinement (CMR). The concept of model editing was initially explored in [215], which introduced a 'reliability-locality-efficiency' principle and proposed a gradient descent editor to address it efficiently. Subsequent research, such as [47] and [163], extended this principle to edit factual knowledge in BERT-based language models and larger models like GPT-J-6B [235] and T5-XXL [189], respectively, using gradient decomposition. These approaches typically update a subset of model parameters to alter the labels of specific inputs. Additionally, memory-based models, as discussed in [164] and [74], incorporate editing through retrieval mechanisms.

Continual Model Refinement (CMR) extends model refinement horizontally, presenting updated sample pairs {(x_e, y_e, ŷ_e)}_{e=1}^N sequentially as a stream. [125] initially introduces this idea, evaluating various CL methods with a dynamic sampling algorithm. Many CMR methods employ a retrieval mechanism. For instance, [74] uses hidden activations of the language model as a 'key' to activate updated parameters only when the input resembles previously updated samples; [280] improves this approach's efficiency by integrating LoRA [86]; [45] augments the LLM with an external episodic memory, modeling CMR as an ongoing memory refresh. Meanwhile, some methods focus solely on updating a subset of model parameters. For example, [85] addresses the issue of 'toxicity buildup and flash' in single-editing methods like ROME [157], adapting it to the CL context with a knowledge-aware layer selection algorithm. WISE [238] addresses the 'impossible triangle' of reliability, locality, and generalization in existing lifelong model refinement methods. It introduces a side memory system that enables knowledge sharding and merging, successfully achieving all three objectives simultaneously.

While all these works pioneer research in CMR, the exploration of CMR for LLMs remains open. [75] highlights a potential problem: the location where a fact is stored may not coincide with the best place for editing it. This challenges the classical 'locate and edit' paradigm used by several existing methods [157, 158], and could become a significant concern for CMR [85]. Other questions, including whether such a problem setting fits LLMs and whether more memory- and computation-efficient CMR methods could be developed for LLMs, are yet to be answered.

4.3.5 Continual Model Alignment (CMA). When LLMs undergo the phase of MA, vertical forgetting of previous knowledge usually occurs. In [128], the authors refer to this phenomenon of catastrophic forgetting induced by MA as the 'Alignment Tax.' Notably, even a single stage of MA can diminish the model's capabilities, as it restricts the model's responses to a narrower subset of the training distribution.

Continual Model Alignment (CMA) aims to continuously refine LLMs to align with evolving human values, ethics, and data. The static nature of LLM training on historical datasets can lead to discrepancies between the models' outputs and current factual accuracies, societal norms, and standards, making CMA a crucial process for maintaining their adaptability and alignment with contemporary contexts. There are two types of CMA frameworks: RL-based and SL-based. In the realm of RL-based CMA, two significant contributions have been noted. [128] identifies the conflicts between existing CL techniques and RLHF, and proposes Adaptive Model Averaging (AMA), which adaptively finds appropriate ratios for combining model layers to gain maximal reward with minimal tax; Continual Proximal Policy Optimization (CPPO) [287] proposes a weighting strategy that decides, for each example, whether it is used for policy enhancement or knowledge retention, mitigating the alignment tax over time. For SL-based CMA, Continual Optimal Policy Fitting (COPF) [286] presents a solution adapted from Direct Preference Optimization (DPO) [188], addressing its potential risks of sub-optimal policy fitting and over-optimization in the context of CMA.
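A layer-wise model-averaging step in the spirit of AMA might look like the following sketch, where per-layer ratios interpolate between the pre-trained model's weights and their RLHF-aligned counterparts. In AMA the ratios are found adaptively to trade reward against alignment tax; here they are simply given, and layers are flat lists of floats rather than weight tensors.

```python
def layerwise_average(base_layers, aligned_layers, ratios):
    """Interpolate each layer between its pre-trained weights (`base`)
    and its aligned weights (`aligned`) with a per-layer ratio r:
    merged = (1 - r) * base + r * aligned.

    r = 0 keeps the pre-trained layer (no tax, no reward);
    r = 1 keeps the aligned layer (full reward, full tax).
    """
    merged = []
    for base, aligned, r in zip(base_layers, aligned_layers, ratios):
        merged.append([(1 - r) * b + r * a for b, a in zip(base, aligned)])
    return merged
```

Choosing the ratios per layer, rather than one global ratio, is what allows the averaging to preserve general-knowledge layers while retaining alignment where it matters.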

4.3.6 Continual Multimodal Large Language Models (CMLLMs). Continually training multi-modal models like CLIP [185] has been long studied [171, 300], while the problem of continually training MLLMs still remains underexplored. Several existing studies have investigated the causes of catastrophic forgetting when continually training MLLMs. [298] performs singular value decomposition on input embeddings, revealing a significant disparity among different input embeddings. This discrepancy causes the model to learn irrelevant information for previously trained tasks, resulting in catastrophic forgetting and negative forward transfer. [284] observes that minority collapse may lead to catastrophic forgetting, when the imbalance ratio between majority and minority classes approaches infinity during fine-tuning. It further identifies hallucination as a contributing factor to performance degradation in MLLMs.

Continual Fine-Tuning MLLMs. In contrast to traditional continual learning methods that involve full-model fine-tuning for new tasks, continual fine-tuning for MLLMs focuses on refining specific layers when adapting to new tasks [32, 77, 284, 298, 304]. Given the strong capabilities of pre-trained models, training specific layers suffices, and can simultaneously reduce computational demands. [296] additionally considers an continual learning scenario, Continual Missing Modality Learning (CMML), where different modalities are emerging throughout the incremental learning stages. All the aforementioned studies collectively indicate that MLLMs still suffer from catastrophic forgetting, which manifests in two ways: along the direction of vertical continuity , a performance decline on pre-trained tasks following fine-tuning for downstream tasks; and along the axis of horizontal continuity , a performance degrade on previously fine-tuned tasks after fine-tuning for new tasks. [298] also observes negative forward transfer, where the performance of unseen tasks degrades when learning new tasks, indicating a decline in model generalization capability.

While traditional CL methods are applicable, some may not yield optimal results, as evidenced by various experiments [77, 298]. For instance, [77] observes a consistent efficacy of replay-based and model expansion strategies across diverse scenarios of continual fine-tuning MLLMs, but regularization-based methods only perform well on models that have been jointly instruction-tuned on multiple tasks. Other works seek to develop ad-hoc solutions for continual learning MLLMs. [77] proposes EProj to expand the projection layer in MLLMs for each new task and utilizes task-similarity-informed regularization (TIR) to enhance performance. [298] introduces Fwd-Prompt, a prompt tuning method that projects prompt gradient to both the residual space and the pre-trained subspace to minimize the interference between tasks and reuse pre-trained knowledge respectively, fostering positive forward transfer without relying on previous samples. [304] focuses on the forgetting of the pre-trained MLLMs after fine-tuned on specific tasks and proposes model tailor to compensate the selected subset that are critical for enhancing target task performance. [296] presents a novel method named Reconstruct before Query (RebQ), leveraging the multi-modal knowledge from a pre-trained model to reconstruct the absent information for the missing modality. Recently, MoE (Mixture-of-Experts) framework has gained attention which resembles the architecture-based methods in CL. It provides the model with the ability to learn different intentions from distinct experts, e.g., [32] first introduces MoELoRA to fine-tune LLaVA, effectively mitigate the catastrophic forgetting of MLLMs in CoIN and the results demonstrate the effectiveness.

General Observations on CFT

General Continual Fine-Tuning~(General CFT)

Background of Continual Fine-Tuning (CFT). Continual Fine-Tuning (CFT) lies at the bottom layer of vertical continuity, where models are trained on successive homogeneous tasks drawn from an evolving data distribution. As the service-oriented layer of LLMs, it does not need to account for further adaptation to other downstream tasks, which greatly simplifies the optimization objectives: better adaptation and less forgetting². In the era of LLMs, new computational paradigms in CFT have emerged and attracted significant attention within the research community. These topics include (i) Continual Instruction Tuning (CIT) [292], (ii) Continual Model Refinement (CMR) [74], (iii) Continual Model Alignment (CMA) [128, 287], and (iv) Continual Learning for Multimodal Large Language Models (CMLLMs) [77, 171]. We summarize existing studies on CFT in Table 3, categorizing them into the sub-categories listed above. The table includes details on incremental learning types (X-IL), LLM architectures, and the CL techniques and evaluation metrics employed. After discussing general observations on CFT in Section 4.3.1, we delve into each sub-category in detail.

4.3.1 General Observations on CFT. Examining the landscape of continual learning in the context of LLMs, together with the results shown in Table 3, we make several key observations about CFT.

· OBS-1: There has been a noticeable transition in focus from CIL to TIL and DIL. It has long been a consensus in the CL community that CIL, which requires the model to predict the context label and the within-context label at the same time [112, 232, 237], is the most challenging CL scenario, and it has therefore received most of the community's attention. However, among the 35 papers presented in Table 3, only 3 study CFT under CIL. This shift of research focus demonstrates the importance of TIL and DIL in real-world applications of continual LLMs. A more detailed discussion of this transition is included in Section 6.2.

· OBS-2: In CFT, CL techniques enjoy broader adoption and more explicit exploration than in CPT and DAP. In Table 3, all 35 papers explicitly deploy CL techniques, 50% of which develop new techniques that cannot easily be interpreted as trivial combinations of existing classic CL techniques, e.g., the shared attentive learning framework in SAPT [297], the external memory deployed in Larimar [45], and the adaptive model averaging method for achieving Pareto-optimality in AMA [128]. This underscores the recognition of continual learning as a pivotal component in the development of resilient and adaptive LLMs.

4.3.2 General Continual Fine-Tuning (General CFT). Researchers have long investigated the forgetting resilience of pre-trained LLMs when fine-tuned for downstream tasks [106, 144, 156, 223, 299], though some report the opposite [144]. Although the pre-trained weights initially position the model in a flat loss basin, aiding adaptation to future tasks without severely impacting previous ones [156], zero or near-zero forgetting is observed only at the representation level. This implies that while the model retains its ability to distinguish between task-specific representations, it may still forget specific task details [144, 223, 260, 299]. Therefore, additional measures are necessary when deploying these models in real-world applications [10, 37, 106, 182, 254, 281].

² We direct interested readers to additional survey literature on the topic of general CFT [17, 105].

Table 3. Summary of the existing studies on Continual Fine-Tuning LLMs, where the papers are organized into five main categories based on the downstream tasks they are designed to tackle: (i) General Continual Fine-Tuning (CFT); (ii) Continual Instruction Tuning (CIT); (iii) Continual Model Refinement (CMR); (iv) Continual Model Alignment (CMA); and (v) Continual Multimodal LLMs (CMLLMs), shown in the CFT Type column. The X-IL column shows which continual learning paradigm the study involves [232]: TIL denotes task-incremental learning, where the task ID/information is provided during inference; DIL denotes domain-incremental learning, where tasks share the same format and no task ID/information is available during inference; CIL denotes class-incremental learning, where the task ID must additionally be inferred at test time.

Many studies advance beyond naive sequential fine-tuning, leveraging the inherent anti-forgetting nature of LLMs while avoiding overly complex CL techniques [255, 299]. For instance, LR ADJUST [255] proposes a straightforward yet effective method that dynamically adjusts the learning rate to mitigate the overwriting of knowledge from old languages by new ones. Building on the innate anti-forgetting ability of large language models like Pythia [16], SEQ ∗ [299] introduces several strategies for fine-tuning LLMs on a sequence of downstream classification tasks, such as freezing the LLM and the old classifiers' parameters after warm-up and pre-allocating future classifiers.
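To make the learning-rate idea concrete, here is a minimal sketch of a decayed per-task peak learning rate; the decay rule and constants are illustrative assumptions, not the exact schedule of LR ADJUST [255]:

```python
def adjusted_lr(base_lr: float, task_index: int,
                decay: float = 0.5, floor: float = 1e-6) -> float:
    """Peak learning rate for the `task_index`-th task (0-based).

    Each new task trains with a smaller peak learning rate, so updates
    for later tasks overwrite less of the earlier knowledge.
    """
    return max(base_lr * (decay ** task_index), floor)

# The schedule is non-increasing across the task sequence.
lrs = [adjusted_lr(2e-5, t) for t in range(5)]
assert all(a >= b for a, b in zip(lrs, lrs[1:]))
```

In practice the per-task peak would be coupled with the optimizer's warm-up and per-step schedule; this only captures the cross-task adjustment.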

Given the minimal forgetting observed at the representation level in CL, some studies aim to tackle the misalignment between the representation space and the decision-making layers by introducing representation-level constraints during CFT. NeiAttn [10] exemplifies this approach by formulating classification tasks as masked language modeling and proposing a neighboring attention mechanism to counteract negative representation drift.

Another line of approaches refines the input/output formats and network architectures of pre-trained LLMs to better suit CFT. For instance, CTR [106] incorporates two CL-plugin modules: a task-specific module (TSM) for acquiring task-specific knowledge and a knowledge-sharing module (KSM) for selectively transferring previously learned similar knowledge. CIRCLE [281] manually designs diverse prompt templates for various types of buggy code, unifies them as cloze tasks, and employs difficulty-based replay to enhance continual program repair. LFPT5 [182] addresses lifelong few-shot language learning by consolidating sequence labeling, text classification, and text generation into a single text-to-text generation task; it undergoes prompt tuning on pseudo-examples generated from previous domains when adapting to new tasks. In [291], the authors propose adaptively adding compositional adapters for continual sequence generation: before training on a new domain, a decision stage determines whether a previously trained module can be reused, and during training this module also regenerates past examples for replay. C3 [37] merges PEFT and in-context learning (ICL) in a teacher-student framework: the teacher model undergoes in-context tuning focused solely on the current domain, while the student model, together with tunable prompts, simultaneously minimizes the KL-divergence between its output distribution and both the ground truth and the teacher's outputs.
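As a rough illustration of a C3-style student objective, the sketch below mixes a cross-entropy term on the ground-truth label with a KL term toward the teacher's distribution; the function names and the single mixing weight `alpha` are our simplifications, not the paper's exact formulation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def student_loss(student_logits, teacher_logits, label, alpha=0.5):
    """Mix cross-entropy to the gold label with KL to the teacher."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    ce = -np.log(p_s[label])                         # fit the ground truth
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)))   # match the teacher
    return (1 - alpha) * ce + alpha * kl
```

When the student already matches the teacher, the KL term vanishes and only the supervised term drives the update.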

4.3.3 Continual Instruction Tuning (CIT). When instruction-tuning data arrives as a stream, forgetting of previously learned instructions must be addressed. CT0 [208] represents the inaugural study on Continual Instruction Tuning (CIT) of LLMs, applying a replay method to the base T0 model throughout the process. Many subsequent studies focus on enhancing the replay method used during CIT. For instance, [80] improves replay efficiency by computing Key-Part Information Gain (KPIG) on masked parts to dynamically select replay data, addressing the 'half-listening' issue in instruction following. Similarly, SSR [89] uses the LLM itself to generate synthetic instances for replay, achieving superior or comparable performance to traditional methods at a lower cost.
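A generic replay-mixing step underlying such methods can be sketched as follows; the sampling rule is a plain illustration and implements neither KPIG selection nor SSR's synthetic generation:

```python
import random

def mixed_batch(new_data, replay_buffer, batch_size=8,
                replay_frac=0.25, seed=0):
    """Build one training batch that mixes new instruction data with a
    fixed fraction of examples replayed from earlier tasks."""
    rng = random.Random(seed)
    n_replay = min(int(batch_size * replay_frac), len(replay_buffer))
    batch = rng.sample(replay_buffer, n_replay)        # old-task examples
    batch += rng.sample(new_data, batch_size - n_replay)  # new-task examples
    rng.shuffle(batch)
    return batch
```

Replay-based CIT methods mainly differ in how `replay_buffer` is populated (stored real data vs. model-generated pseudo-data) and how the replayed examples are prioritized.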

Other approaches combine multiple CL techniques during CIT. DynaInst [165] merges parameter regularization with dynamic replay, selectively storing and replaying instances and tasks to enhance outcomes. InstructionSpeak [279] employs negative training and replays instructions to improve both forward and backward transfer. Some methods incorporate PEFT: Orthogonal Low-Rank Adaptation (O-LoRA) [240] learns new tasks within an orthogonal subspace while preserving the LoRA parameters of previous tasks to minimize interference among tasks, and Shared Attention Framework (SAPT) [297] combines a PET block with a selection module via a Shared Attentive Learning & Selection module, tackling catastrophic forgetting and knowledge transfer concurrently. Although regularization-based and architecture-based methods require additional parameter storage and GPU memory, they, together with replay-based methods, remain popular for CIT due to their simplicity and effectiveness [243].
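The orthogonality constraint behind O-LoRA can be sketched as a penalty on the overlap between the new task's LoRA subspace and those of earlier tasks; the exact loss in [240] may differ from this toy version:

```python
import numpy as np

def orthogonality_penalty(A_new, A_prev_list):
    """Sum of squared entries of each cross-product A_old @ A_new^T.

    The entries are dot products between rows of the old and new LoRA
    "A" matrices, so the penalty is zero exactly when the new task's
    low-rank subspace is orthogonal to every earlier task's subspace.
    """
    return sum(float(np.sum((A_old @ A_new.T) ** 2))
               for A_old in A_prev_list)

A_task1 = np.eye(8)[:2]   # rows span one subspace
A_task2 = np.eye(8)[2:4]  # rows span an orthogonal subspace
assert orthogonality_penalty(A_task2, [A_task1]) == 0.0
```

During training, this penalty would be added to the task loss while the earlier tasks' LoRA matrices stay frozen.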

4.3.4 Continual Model Refinement (CMR). The concept of model editing was initially explored in [215], which introduced a 'reliability-locality-efficiency' principle and proposed a gradient descent editor to address it efficiently. Subsequent research, such as [47] and [163], extended this principle to edit factual knowledge in BERT-based language models and larger models like GPT-J-6B [235] and T5-XXL [189], respectively, using gradient decomposition. These approaches typically update a subset of model parameters to alter the labels of specific inputs. Additionally, memory-based models, as discussed in [164] and [74], incorporate editing through retrieval mechanisms.

Continual Model Refinement (CMR) extends model refinement horizontally, presenting updated sample pairs $\{(x_e, y_e, \hat{y}_e)\}_{e=1}^{N}$ sequentially as a stream. [125] initially introduces this idea, evaluating various CL methods with a dynamic sampling algorithm. Many CMR methods employ a retrieval mechanism. For instance, [74] uses the hidden activations of the language model as a 'key', activating the updated parameters only when the input $x_0$ resembles the updated sample pairs; [280] improves the efficiency of this approach by integrating LoRA [86]; and [45] augments the LLM with an external episodic memory, modeling CMR as an ongoing memory refresh. Meanwhile, some methods focus solely on updating a subset of model parameters. For example, [85] addresses the issue of 'toxicity buildup and flash' in single-editing methods like ROME [157], adapting it to the CL context with a knowledge-aware layer selection algorithm. WISE [238] addresses the 'impossible triangle' of reliability, locality, and generalization in existing lifelong model refinement methods, introducing a side memory system that enables knowledge sharding and merging and achieving all three objectives simultaneously.
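A key-based retrieval editor in the spirit of [74] can be sketched as follows; the class, the Euclidean-distance rule, and the `radius` threshold are hypothetical simplifications:

```python
import numpy as np

class RetrievalEditor:
    """Cache (activation key, corrected output) pairs; an edit fires
    only when the query activation is close enough to a stored key,
    otherwise the base model's answer is returned."""

    def __init__(self, base_model, radius=0.5):
        self.base_model = base_model
        self.radius = radius
        self.keys, self.values = [], []

    def edit(self, key_activation, corrected_output):
        self.keys.append(np.asarray(key_activation, dtype=float))
        self.values.append(corrected_output)

    def __call__(self, activation, raw_input):
        activation = np.asarray(activation, dtype=float)
        if self.keys:
            dists = [np.linalg.norm(activation - k) for k in self.keys]
            i = int(np.argmin(dists))
            if dists[i] <= self.radius:
                return self.values[i]       # scoped edit fires
        return self.base_model(raw_input)   # fall back to the base model
```

Because edits are stored rather than written into the base weights, new refinements accumulate in the store without overwriting earlier ones, which is what makes this family attractive for the streaming CMR setting.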

While all these works pioneer research in CMR, the exploration of CMR for LLMs remains open. [75] highlights a potential problem: the location where a fact is stored may not coincide with the best place for editing it. This challenges the classical 'locate and edit' paradigm used by several existing methods [157, 158] and could become a significant concern for CMR [85]. Other questions, including whether such a problem setting fits LLMs and whether more memory- and computation-efficient CMR methods can be developed for LLMs, are yet to be answered.

4.3.5 Continual Model Alignment (CMA). When LLMs undergo the model alignment (MA) phase, vertical forgetting of previous knowledge usually occurs. In [128], the authors refer to this MA-induced catastrophic forgetting as the 'Alignment Tax.' Notably, even a single stage of MA can diminish the model's capabilities, as it restricts the model's responses to a narrower subset of the training distribution.

Continual Model Alignment (CMA) aims to continuously refine LLMs to align with evolving human values, ethics, and data. The static nature of LLM training on historical datasets can lead to discrepancies between the models' outputs and current factual accuracy, societal norms, and standards, making CMA a crucial process for maintaining adaptability and alignment with contemporary contexts. There are two types of CMA frameworks: RL-based and SL-based. In the realm of RL-based CMA, two significant contributions stand out. [128] identifies conflicts between existing CL techniques and RLHF and proposes Adaptive Model Averaging (AMA), which adaptively finds appropriate ratios for combining model layers to gain maximal reward with minimal tax; Continual Proximal Policy Optimization (CPPO) [287] proposes a weighting strategy that decides, for each example, whether it is used for policy enhancement or knowledge retention, mitigating the alignment tax over time. For SL-based CMA, Continual Optimal Policy Fitting (COPF) [286] presents a solution adapted from Direct Preference Optimization (DPO) [188], addressing its potential risks of sub-optimal policy fitting and over-optimization in the context of CMA.
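The model-averaging step at the heart of approaches like AMA can be sketched as a per-layer interpolation between pre- and post-alignment weights; how the ratios are chosen adaptively is the core of AMA [128] and is omitted here:

```python
import numpy as np

def average_models(pre_weights, aligned_weights, ratios):
    """Interpolate each named layer between the pre-alignment and
    post-alignment checkpoints with a per-layer mixing ratio.

    ratio = 0 keeps the pre-alignment layer (no tax, no alignment);
    ratio = 1 keeps the aligned layer (full alignment, full tax).
    """
    assert set(pre_weights) == set(aligned_weights) == set(ratios)
    return {name: (1 - ratios[name]) * pre_weights[name]
                  + ratios[name] * aligned_weights[name]
            for name in pre_weights}
```

Intermediate ratios trade reward for retained capability layer by layer, which is why a search over them can land on a better reward/tax trade-off than either endpoint.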

4.3.6 Continual Multimodal Large Language Models (CMLLMs). Continually training multi-modal models like CLIP [185] has long been studied [171, 300], while continually training MLLMs remains underexplored. Several existing studies have investigated the causes of catastrophic forgetting when continually training MLLMs. [298] performs singular value decomposition on input embeddings, revealing a significant disparity among different input embeddings; this discrepancy causes the model to learn information irrelevant to previously trained tasks, resulting in catastrophic forgetting and negative forward transfer. [284] observes that minority collapse may lead to catastrophic forgetting when the imbalance ratio between majority and minority classes approaches infinity during fine-tuning, and further identifies hallucination as a contributing factor to performance degradation in MLLMs.

Continual Fine-Tuning MLLMs. In contrast to traditional continual learning methods that involve full-model fine-tuning for new tasks, continual fine-tuning of MLLMs focuses on refining specific layers when adapting to new tasks [32, 77, 284, 298, 304]. Given the strong capabilities of pre-trained models, training specific layers suffices and simultaneously reduces computational demands. [296] additionally considers a continual learning scenario, Continual Missing Modality Learning (CMML), where different modalities emerge throughout the incremental learning stages. All the aforementioned studies collectively indicate that MLLMs still suffer from catastrophic forgetting, which manifests in two ways: along the direction of vertical continuity, a performance decline on pre-trained tasks after fine-tuning on downstream tasks; and along the axis of horizontal continuity, performance degradation on previously fine-tuned tasks after fine-tuning on new tasks. [298] also observes negative forward transfer, where performance on unseen tasks degrades while learning new tasks, indicating a decline in the model's generalization capability.

While traditional CL methods are applicable, some may not yield optimal results, as evidenced by various experiments [77, 298]. For instance, [77] observes consistent efficacy of replay-based and model-expansion strategies across diverse scenarios of continually fine-tuning MLLMs, whereas regularization-based methods only perform well on models that have been jointly instruction-tuned on multiple tasks. Other works develop ad-hoc solutions for continually learning MLLMs. [77] proposes EProj, which expands the projection layer in MLLMs for each new task and utilizes task-similarity-informed regularization (TIR) to enhance performance. [298] introduces Fwd-Prompt, a prompt-tuning method that projects the prompt gradient onto the residual space to minimize interference between tasks and onto the pre-trained subspace to reuse pre-trained knowledge, fostering positive forward transfer without relying on previous samples. [304] focuses on the forgetting of pre-trained MLLMs after fine-tuning on specific tasks and proposes Model Tailor, which compensates for the selected subset of parameters critical for target-task performance. [296] presents Reconstruct before Query (RebQ), which leverages the multi-modal knowledge of a pre-trained model to reconstruct absent information for the missing modality. Recently, the Mixture-of-Experts (MoE) framework, which resembles architecture-based methods in CL, has gained attention; it enables the model to learn different intentions with distinct experts. For example, [32] first introduces MoELoRA to fine-tune LLaVA, effectively mitigating catastrophic forgetting of MLLMs on the CoIN benchmark.
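A schematic of MoE-style routing over low-rank adapters, hypothetical rather than the exact MoELoRA of [32], might look like the following: a softmax gate mixes the outputs of several LoRA experts on top of a frozen linear layer.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def moe_lora_forward(x, W_frozen, experts, W_gate):
    """One forward pass through a frozen linear layer plus gated LoRA
    experts.

    experts: list of (A, B) low-rank pairs, each contributing B @ A @ x.
    W_gate:  routing weights producing one mixing coefficient per expert.
    """
    gate = softmax(W_gate @ x)                   # one weight per expert
    delta = sum(g * (B @ (A @ x))
                for g, (A, B) in zip(gate, experts))
    return W_frozen @ x + delta
```

Because the frozen layer is untouched and each expert is a small low-rank delta, new experts can be added for new tasks while the gate learns which expert to trust for a given input.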


Continual Fine-Tuning MLLMs. In contrast to traditional continual learning methods that involve full-model fine-tuning for new tasks, continual fine-tuning for MLLMs focuses on refining specific layers when adapting to new tasks [32, 77, 284, 298, 304]. Given the strong capabilities of pre-trained models, training specific layers suffices, and can simultaneously reduce computational demands. [296] additionally considers an continual learning scenario, Continual Missing Modality Learning (CMML), where different modalities are emerging throughout the incremental learning stages. All the aforementioned studies collectively indicate that MLLMs still suffer from catastrophic forgetting, which manifests in two ways: along the direction of vertical continuity , a performance decline on pre-trained tasks following fine-tuning for downstream tasks; and along the axis of horizontal continuity , a performance degrade on previously fine-tuned tasks after fine-tuning for new tasks. [298] also observes negative forward transfer, where the performance of unseen tasks degrades when learning new tasks, indicating a decline in model generalization capability.

While traditional CL methods are applicable, some may not yield optimal results, as evidenced by various experiments [77, 298]. For instance, [77] observes a consistent efficacy of replay-based and model expansion strategies across diverse scenarios of continual fine-tuning MLLMs, but regularization-based methods only perform well on models that have been jointly instruction-tuned on multiple tasks. Other works seek to develop ad-hoc solutions for continual learning MLLMs. [77] proposes EProj to expand the projection layer in MLLMs for each new task and utilizes task-similarity-informed regularization (TIR) to enhance performance. [298] introduces Fwd-Prompt, a prompt tuning method that projects prompt gradient to both the residual space and the pre-trained subspace to minimize the interference between tasks and reuse pre-trained knowledge respectively, fostering positive forward transfer without relying on previous samples. [304] focuses on the forgetting of the pre-trained MLLMs after fine-tuned on specific tasks and proposes model tailor to compensate the selected subset that are critical for enhancing target task performance. [296] presents a novel method named Reconstruct before Query (RebQ), leveraging the multi-modal knowledge from a pre-trained model to reconstruct the absent information for the missing modality. Recently, MoE (Mixture-of-Experts) framework has gained attention which resembles the architecture-based methods in CL. It provides the model with the ability to learn different intentions from distinct experts, e.g., [32] first introduces MoELoRA to fine-tune LLaVA, effectively mitigate the catastrophic forgetting of MLLMs in CoIN and the results demonstrate the effectiveness.

Continual Model Refinement~(CMR)

Background of Continual Fine-Tuning (CFT). Continual Fine-Tuning (CFT) lies at the bottom layer of vertical continuity, where models are trained on successive homogeneous tasks drawn from an evolving data distribution. As the service-oriented layer of LLMs, it does not require consideration of further adaptation to other downstream tasks, which greatly simplifies the optimization objectives: better adaptation and less forgetting². In the era of LLMs, new computational paradigms in CFT have emerged and attracted significant attention within the research community. These topics include (i) Continual Instruction Tuning (CIT) [292], (ii) Continual Model Refinement (CMR) [74], (iii) Continual Model Alignment (CMA) [128, 287], and (iv) Continual Learning for Multimodal Language Models (CMLLMs) [77, 171]. We summarize existing studies on CFT in Table 3, categorizing them into the sub-categories listed above. The table includes details on incremental learning types (X-IL), LLM architecture, and the employed CL techniques and evaluation metrics. After discussing general observations on CFT in Section 4.3.1, we delve into each sub-category in detail.

4.3.1 General Observations on CFT. Examining the landscape of continual learning in the context of LLMs, and combined with the results shown in Table 3, we make several key observations about CFT.

· OBS-1: There has been a noticeable transition in focus from CIL to TIL and DIL. It has long been the consensus in the CL community that CIL, which requires the model to predict the context label and the within-context label at the same time [112, 232, 237], is the most challenging CL scenario and hence receives most of the community's attention. However, among the 35 papers presented in Table 3, only 3 study CFT under CIL. This shift in research focus demonstrates the importance of TIL and DIL in real-world applications of continual LLMs. A more detailed discussion of this transition is included in Section 6.2.

· OBS-2: In CFT, CL techniques enjoy broader adoption and more explicit exploration than in CPT and DAP. In Table 3, all 35 papers explicitly deploy CL techniques, 50% of which develop new techniques that cannot be easily interpreted as trivial combinations of existing classic CL techniques, e.g., the shared attentive learning framework in SAPT [297], the external memory deployed in Larimar [45], and the adaptive model averaging method for achieving Pareto-optimality in AMA [128]. This underscores the recognition of continual learning as a pivotal component in the development of resilient and adaptive LLMs.

4.3.2 General Continual Fine-Tuning (General CFT). Researchers have long investigated the forgetting resilience of pre-trained LLMs when fine-tuned for downstream tasks [106, 144, 156, 223, 299], though some report the opposite [144]. Although the pre-trained weights initially position the model in a flat loss basin, aiding adaptation

2 We direct interested readers to additional survey literature on the topic of general CFT [17, 105].

Table 3. Summary of the existing studies on Continual Fine-Tuning LLMs, where the papers are organized in five main categories based on the downstream tasks they are designed to tackle, including (i) General Continual Fine-Tuning (CFT); (ii) Continual Instruction Tuning (CIT); (iii) Continual Model Refinement (CMR); (iv) Continual Model Alignment (CMA); (v) Continual Multimodal LLMs (CMLLMs), as shown in the column CFT Type. The column X-IL shows which continual learning paradigm each study adopts [232]: TIL represents task-incremental learning, meaning the task ID/information is provided during inference; DIL represents domain-incremental learning, meaning the tasks share the same format and no task ID/information is available during inference; CIL represents class-incremental learning, meaning the task ID must additionally be inferred at test time.

to future tasks without severely impacting previous ones [156], zero or near-zero forgetting is only observed at the representation level. This implies that while the model retains its ability to distinguish between task-specific representations, it may still forget specific task details [144, 223, 260, 299]. Therefore, additional measures are necessary when deploying these models in real-world applications [10, 37, 106, 182, 254, 281].

Many studies advance beyond naive sequential fine-tuning, leveraging the inherent anti-forgetting nature of LLMs while avoiding overly complex CL techniques [255, 299]. For instance, LR ADJUST [255] proposes a straightforward yet effective method of dynamically adjusting the learning rate to mitigate the overwriting of knowledge from new languages onto old ones. Building on the innate anti-forgetting ability of large language models like Pythia [16], SEQ ∗ [299] introduces several strategies for fine-tuning LLMs on a sequence of downstream classification tasks, such as freezing the LLM and the old classifiers' parameters after warm-up and pre-allocating future classifiers.
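The exact schedule in LR ADJUST is tuned to its multilingual setting; as a minimal, hypothetical sketch of the underlying idea — shrinking the learning rate as new tasks arrive so that later updates overwrite less of the earlier knowledge — one might write:

```python
def adjusted_lr(base_lr, task_idx, decay=0.5, min_lr=1e-6):
    """Illustrative LR-ADJUST-style schedule (not the paper's exact rule):
    geometrically decay the learning rate with each new task in the stream,
    flooring it at min_lr so later tasks can still be learned."""
    return max(base_lr * (decay ** task_idx), min_lr)

# First task trains at the full rate; later tasks perturb the model less.
lrs = [adjusted_lr(1e-4, t) for t in range(4)]
```

The function names and the geometric decay here are our own assumptions; the key design point shared with LR ADJUST is that the schedule depends on the task index rather than only on the step count within a task.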

Given the minimal forgetting observed at the representation level in CL, some studies aim to tackle the misalignment between the representation space and the decision-making layers by introducing representation-level constraints during CFT. NeiAttn [10] exemplifies this approach by formulating classification tasks as masked language modeling and proposing a neighboring attention mechanism to counteract negative representation drift.

Another line of approaches refines the input/output format and network architectures of pre-trained LLMs to be better suited for CFT. For instance, CTR [106] incorporates two CL-plugin modules, i.e., a task-specific module (TSM) for acquiring task-specific knowledge and a knowledge-sharing module (KSM) for selectively transferring previously learned similar knowledge. CIRCLE [281] manually designs diverse prompt templates for various types of buggy code, unifying them into a cloze task, and employs difficulty-based replay to enhance continual program repair. LFPT5 [182] addresses lifelong few-shot language learning by consolidating sequence labeling, text classification, and text generation into a text-to-text generation task. It undergoes prompt tuning on generated pseudo-examples from previous domains when adapting to new tasks. In [291], the authors propose a method for adaptively adding compositional adapters during continual sequence generation tasks. Before training on new domains, a decision stage determines which trained module can be reused. During training, this module also regenerates examples of the past for replay. C3 [37] merges PEFT and in-context learning (ICL) in a teacher-student framework. The teacher model undergoes in-context tuning focused solely on the current domain, while the student model, together with tunable prompts, simultaneously minimizes the KL-divergence between its output distribution and those of the ground truth and the teacher model.
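As an illustration of a C3-style student objective (our own simplified sketch, not the paper's implementation), the student can be trained against both the ground-truth label and the teacher's output distribution, with a mixing weight alpha that we introduce here for exposition:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def c3_style_loss(student_logits, teacher_logits, target_idx, alpha=0.5):
    """Sketch of a distillation-style objective: cross-entropy with the
    ground-truth label plus KL-divergence toward the teacher distribution."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    ce = -math.log(p_s[target_idx])                     # fit the label
    kl = sum(pt * math.log(pt / ps)                     # stay near teacher
             for pt, ps in zip(p_t, p_s))
    return (1 - alpha) * ce + alpha * kl
```

When the student already matches a confident, correct teacher, both terms vanish; the KL term is what transfers the teacher's current-domain behavior to the student.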

4.3.3 Continual Instruction Tuning (CIT). When the instruction tuning data comes in as a stream, forgetting of the previously learned instructions should be addressed. CT0 [208] represents the inaugural study on Continual Instruction Tuning (CIT) of LLMs, applying the replay method on the base T0 model throughout the process. Many subsequent studies focus on enhancing the replay method used during CIT. For instance, [80] improves replay efficiency by computing Key-Part Information Gain (KPIG) on masked parts to dynamically select replay data, addressing the 'half-listening' issue in instruction following. Similarly, SSR [89] uses the LLM to generate synthetic instances for replay, achieving superior or comparable performance to traditional methods at a lower cost.
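The batch-construction side of such rehearsal schemes can be sketched as follows (a hypothetical helper, assuming the synthetic pool has already been generated by the LLM as in SSR):

```python
import random

def build_training_batch(new_examples, synthetic_pool, replay_ratio=0.2, seed=0):
    """Sketch of SSR-style rehearsal: mix a fraction of self-generated
    synthetic instances from earlier tasks into each batch of new
    instruction data, instead of storing real past examples."""
    rng = random.Random(seed)
    n_replay = int(len(new_examples) * replay_ratio)
    replay = rng.sample(synthetic_pool, min(n_replay, len(synthetic_pool)))
    batch = list(new_examples) + replay
    rng.shuffle(batch)  # interleave old and new so gradients are mixed
    return batch
```

The `replay_ratio` knob is our own simplification; in practice the mixing proportion and the quality filtering of synthetic instances are the main tuning surfaces of this family of methods.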

Other approaches introduce multiple CL techniques during CIT. DynaInst [165] merges parameter regularization with dynamic replay, selectively storing and replaying instances and tasks to enhance outcomes. InstructionSpeak [279] employs negative training and replayed instructions to improve both forward and backward transfer. Some methods incorporate PEFT. Orthogonal Low-Rank Adaptation (O-LoRA) [240] learns new tasks within an orthogonal subspace while preserving LoRA parameters for previous tasks, minimizing interference among tasks. Shared Attention Framework (SAPT) [297] combines a PET block with a selection module via a Shared Attentive Learning & Selection module, tackling catastrophic forgetting and knowledge transfer concurrently. Although regularization-based and architecture-based methods require additional parameter storage and GPU memory, they, together with replay-based methods, remain popular for CIT due to their simplicity and effectiveness [243].
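The orthogonality constraint at the heart of O-LoRA can be illustrated with a small penalty term (a sketch under our own assumptions, with NumPy standing in for the training framework): the rows of each task's LoRA down-projection matrix A span that task's update subspace, and overlap with earlier tasks' subspaces is penalized.

```python
import numpy as np

def olora_orthogonality_penalty(A_new, previous_As):
    """Sketch of an O-LoRA-style regularizer: penalize overlap between
    the new task's LoRA subspace (rows of A_new) and the subspaces of
    earlier tasks, pushing the new update into orthogonal directions."""
    penalty = 0.0
    for A_old in previous_As:
        overlap = A_new @ A_old.T               # (r_new, r_old) cross-terms
        penalty += float(np.sum(overlap ** 2))  # squared Frobenius norm
    return penalty
```

Adding this penalty to the task loss drives `A_new @ A_old.T` toward zero, so gradients for the new task barely move the function realized by earlier tasks' adapters.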

4.3.4 Continual Model Refinement (CMR). The concept of model editing was initially explored in [215], which introduced a 'reliability-locality-efficiency' principle and proposed a gradient descent editor to address it efficiently. Subsequent research, such as [47] and [163], extended this principle to edit factual knowledge in BERT-based language models and larger models like GPT-J-6B [235] and T5-XXL [189], respectively, using gradient decomposition. These approaches typically update a subset of model parameters to alter the labels of specific inputs. Additionally, memory-based models, as discussed in [164] and [74], incorporate editing through retrieval mechanisms.

Continual Model Refinement (CMR) extends model refinement horizontally, presenting updated sample pairs $\{(x_e, y_e, \hat{y}_e)\}_{e=1}^{N}$ sequentially as a stream. [125] initially introduces this idea, evaluating various CL methods with a dynamic sampling algorithm. Many CMR methods employ a retrieval mechanism. For instance, [74] uses hidden activations of the language model as a 'key' to activate updated parameters only when the input $x_0$ resembles updated sample pairs; [280] improves this approach's efficiency by integrating LoRA [86]; [45] augments the LLM with an external episodic memory, modeling CMR as an ongoing memory refresh. Meanwhile, some methods focus solely on updating a subset of model parameters. For example, [85] addresses the issue of 'toxicity buildup and flash' in single-editing methods like ROME [157], adapting it to the CL context with a knowledge-aware layer selection algorithm. WISE [238] addresses the 'impossible triangle' of reliability, locality, and generalization in existing lifelong model refinement methods. It introduces a side memory system that enables knowledge sharding and merging, achieving all three objectives simultaneously.
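The key-based retrieval gate of [74] can be sketched as follows (a hypothetical simplification: real systems operate on high-dimensional layer activations and learn the similarity threshold): an edit fires only when the query's hidden activation is close to a stored key, leaving the base model untouched otherwise.

```python
import numpy as np

def retrieve_edit(hidden, keys, edits, threshold=0.8):
    """Sketch of key-based edit retrieval: return the stored edit whose
    key is most cosine-similar to the hidden activation, provided the
    similarity clears the threshold; otherwise return None (no edit)."""
    h = hidden / np.linalg.norm(hidden)
    best_idx, best_sim = None, threshold
    for i, k in enumerate(keys):
        sim = float(h @ (k / np.linalg.norm(k)))
        if sim > best_sim:
            best_idx, best_sim = i, sim
    return edits[best_idx] if best_idx is not None else None
```

This locality property — edits activating only near their keys — is what lets the number of edits grow over the stream without globally perturbing the model.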

While all these works pioneer research in CMR, the exploration of CMR for LLMs remains open. [75] highlights a potential problem: the location where a fact is stored may not coincide with the best place for editing it. This challenges the classical 'locate and edit' paradigm used by several existing methods [157, 158] and could become a significant concern for CMR [85]. Other questions, including whether this problem setting suits LLMs and whether more memory- and computation-efficient CMR methods can be developed for LLMs, are yet to be answered.

4.3.5 Continual Model Alignment (CMA). When LLMs undergo the phase of MA, vertical forgetting of previous knowledge usually occurs. In [128], the authors refer to this phenomenon of catastrophic forgetting induced by MA as the 'Alignment Tax.' Notably, even a single stage of MA can diminish the model's capabilities, as it restricts the model's responses to a narrower subset of the training distribution.

Continual Model Alignment (CMA) aims to continuously refine LLMs to align with evolving human values, ethics, and data. The static nature of LLM training on historical datasets can lead to discrepancies between the models' outputs and current factual accuracies, societal norms, and standards, making CMA a crucial process for maintaining their adaptability and alignment with contemporary contexts. Likewise, there are two types of CMA frameworks: RL-based and SL-based. For RL-based CMA, two significant contributions have been noted: [128] identifies the conflicts between existing CL techniques and RLHF and proposes Adaptive Model Averaging (AMA), which adaptively finds appropriate ratios for combining model layers to gain maximal reward with minimal tax; Continual Proximal Policy Optimization (CPPO) [287] proposes a weighting strategy that decides, per example, whether it is used for policy enhancement or knowledge retention, mitigating the alignment tax over time. For SL-based CMA, Continual Optimal Policy Fitting (COPF) [286] presents a solution adapted from Direct Preference Optimization (DPO) [188], addressing its potential risks of sub-optimal policy fitting and over-optimization in the context of CMA.
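The layer-wise interpolation behind AMA can be sketched as follows (a hypothetical simplification: AMA searches for the per-layer ratios adaptively, whereas here they are simply given):

```python
def average_layers(pre_weights, post_weights, ratios):
    """Sketch of AMA-style model averaging: each named layer is
    interpolated between the pre-alignment and post-alignment weights
    with its own ratio, trading reward against alignment tax per layer."""
    return {
        name: [(1 - ratios[name]) * a + ratios[name] * b
               for a, b in zip(pre_weights[name], post_weights[name])]
        for name in pre_weights
    }

# Ratio 0 keeps the pre-alignment layer; ratio 1 keeps the aligned one.
merged = average_layers({"l0": [0.0, 2.0]}, {"l0": [1.0, 0.0]}, {"l0": 0.25})
```

The design intuition is that some layers carry most of the alignment gain while others carry general capability, so a single global interpolation coefficient is too coarse.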

4.3.6 Continual Multimodal Large Language Models (CMLLMs). Continually training multi-modal models like CLIP [185] has long been studied [171, 300], while continually training MLLMs remains underexplored. Several existing studies have investigated the causes of catastrophic forgetting when continually training MLLMs. [298] performs singular value decomposition on input embeddings, revealing a significant disparity among different input embeddings. This discrepancy causes the model to learn information irrelevant to previously trained tasks, resulting in catastrophic forgetting and negative forward transfer. [284] observes that minority collapse may lead to catastrophic forgetting when the imbalance ratio between majority and minority classes approaches infinity during fine-tuning. It further identifies hallucination as a contributing factor to performance degradation in MLLMs.

Continual Fine-Tuning MLLMs. In contrast to traditional continual learning methods that involve full-model fine-tuning for new tasks, continual fine-tuning for MLLMs focuses on refining specific layers when adapting to new tasks [32, 77, 284, 298, 304]. Given the strong capabilities of pre-trained models, training specific layers suffices and simultaneously reduces computational demands. [296] additionally considers a continual learning scenario, Continual Missing Modality Learning (CMML), where different modalities emerge throughout the incremental learning stages. The aforementioned studies collectively indicate that MLLMs still suffer from catastrophic forgetting, which manifests in two ways: along the direction of vertical continuity, a performance decline on pre-trained tasks after fine-tuning for downstream tasks; and along the axis of horizontal continuity, performance degradation on previously fine-tuned tasks after fine-tuning for new tasks. [298] also observes negative forward transfer, where the performance on unseen tasks degrades when learning new tasks, indicating a decline in model generalization capability.

While traditional CL methods are applicable, some may not yield optimal results, as evidenced by various experiments [77, 298]. For instance, [77] observes consistent efficacy of replay-based and model-expansion strategies across diverse scenarios of continually fine-tuning MLLMs, but regularization-based methods only perform well on models that have been jointly instruction-tuned on multiple tasks. Other works develop ad-hoc solutions for continually learning MLLMs. [77] proposes EProj, which expands the projection layer in MLLMs for each new task and utilizes task-similarity-informed regularization (TIR) to enhance performance. [298] introduces Fwd-Prompt, a prompt-tuning method that projects the prompt gradient onto the residual space to minimize interference between tasks and onto the pre-trained subspace to reuse pre-trained knowledge, fostering positive forward transfer without relying on previous samples. [304] focuses on the forgetting of pre-trained MLLMs after fine-tuning on specific tasks and proposes Model Tailor, which compensates a selected subset of parameters critical for target-task performance. [296] presents Reconstruct before Query (RebQ), leveraging the multi-modal knowledge of a pre-trained model to reconstruct the absent information of the missing modality. Recently, the Mixture-of-Experts (MoE) framework, which resembles architecture-based methods in CL, has gained attention: it enables the model to learn different intentions with distinct experts. For example, [32] first introduces MoELoRA to fine-tune LLaVA, effectively mitigating the catastrophic forgetting of MLLMs on the CoIN benchmark.
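The structural idea behind EProj-style expansion can be sketched with a toy class (a hypothetical API of our own; the actual method expands the vision-language projection inside an MLLM and adds task-similarity regularization on top):

```python
import numpy as np

class ExpandableProjection:
    """Sketch of per-task projection expansion: a frozen shared backbone
    feeds task-specific projection matrices, one added per new task, so
    earlier tasks' projections are never overwritten by later training."""

    def __init__(self, dim):
        self.dim = dim
        self.projections = {}  # task_id -> projection matrix

    def add_task(self, task_id, seed=0):
        rng = np.random.default_rng(seed)
        # Small random init; only this matrix would be trained for the task.
        self.projections[task_id] = rng.standard_normal((self.dim, self.dim)) * 0.02

    def forward(self, features, task_id):
        # Route backbone features through the task's own projection.
        return features @ self.projections[task_id]
```

Parameter count grows linearly with the number of tasks, which is the usual cost of expansion-based methods; TIR in the actual paper mitigates this by sharing across similar tasks.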

OLD: Continual Model Refinement~(CMR)

Background of Continual Fine-Tuning (CFT). Continual Fine-Tuning (CFT) lies at the bottom layer of the vertical continuity, where models are trained on successive homogeneous tasks drawn from an evolving data distribution. As the service-oriented layer of LLM, it does not require consideration of further adaptation to another downstream tasks, simplifying optimization objectives to a great extent: better adaptation and less forgetting 2 . In the era of LLMs, new computational paradigms in CFT have emerged and attracted significant attention within the research community. These topics include (i) Continual Instruction Tuning (CIT) [292], (ii) Continual Model Refinement (CMR) [74], (iii) Continual Model Alignment (CMA) [128, 287], and (iv) Continual Learning for Multimodal Language Models (CMLLMs) [77, 171]. We summarize existing studies on CFT in Table 3, categorizing studies into sub-categories as listed above. The table includes details on incremental learning types (X-IL), LLM architecture, and employed CL techniques and evaluation metrics. After discussing general observations on CFT in Section 4.3.1, we will delve into each sub-category in detail.

4.3.1 General Observations on CFT. Examining the landscape of continual learning in the context of LLMs, and combined with the results shown in Table 3, we make several key observations about CFT.

· OBS-1: There has been a noticeable transition in focus from CIL to TIL and DIL. It has been a longstanding common sense in the CL community that CIL, as it requires the model to predict the context label and withincontext label at the same time [112, 232, 237], is the most challenging CL scenario and hence receives most of the attention from the community. However, among all 35 papers presented in Table 3, only 3 papers study CFT of CIL. The transition of the research focus demonstrates the importance of TIL and DIL in the real-world applications of continual LLMs. More detailed discussion of this transition is included in Section 6.2. · OBS-2: In CFT, CL techniques enjoy broader adoption and explicit exploration compared to CPT and DAP. In Table 3, all 35 papers explicitly deploy the CL techniques, 50% of which develop new techniques that cannot be easily interpreted as trivial combination of existing classic CL techniques, e.g., shared attentive learning framework in SAPT [297], external memory deployed in Larimar [45], and adaptive model averaging method to achieve Pareto-optimal in AMA [128], etc. This underscores the recognition of continual learning as a pivotal component in the development of resilient and adaptive LLMs.

4.3.2 General Continual Fine-Tuning (General CFT). Researchers have long investigated the phenomenon of forgetting resilience in pre-trained LLMs when fine-tuned for downstream tasks [106, 144, 156, 223, 299], despite some discover the opposite [144]. Although the pre-trained weights initially position the model in a flat-loss basin, aiding adaptation

2 We direct interested readers to additional survey literature on the topic of general CFT [17, 105].

Table 3. Summaryoftheexisting studies on Continual Fine-Tuning LLMs, where the papers are organized in five main categories based on what downstream tasks they are designed to tackle, including (i) General Continual Fine-Tuning (CFT); (ii) Continual Instruction Tuning (CIT); (iii) Continual Model Refinement (CMR); (iv) Continual Model Alignment (CMA); (v) Continual Multimodal LLMs (CMLLMs), which is shown in the column of CFT Type . The column of X-IL shows what continual learning paradigm the study includes [232], where TIL represents task-incremental learning, meaning task ID/information is provided during inference; DIL represents domain-incremental learning, meaning the tasks are defined in the same format, and no task ID/information is available during inference; CIL represents class-incremental learning, meaning the task ID needs to be further inferred when testing.

to future tasks without severely impacting previous ones [156], zero or near-zero forgetting is only observed at the representation level. This implies that while the model retains its ability to distinguish between task-specific representations, it may still forget specific task details [144, 223, 260, 299]. Therefore, additional measures are necessary when deploying these models in real-world applications [10, 37, 106, 182, 254, 281].

Many studies advance beyond naive sequential fine-tuning, leveraging the inherent anti-forgetting nature of LLMs while avoiding the adoption of overly complex CL techniques [255, 299]. For instance, LR ADJUST [255] proposes a straightforward yet effective method of dynamically adjusting the learning rate to mitigate the overwriting of knowledge from new languages onto old ones. Building on the innate anti-forgetting ability of large language models like Pythia [16], SEQ ∗ [299] introduces several strategies for fine-tuning LLMs on a sequence of downstream classification tasks, such as freezing the LLM and old classifier's parameters after warm-up, and pre-allocating future classifiers, etc.

Given the minimal forgetting observed at the representation level in CL, some studies aim to tackle the misalignment between the representation space and the decision-making layers by introducing representation-level constraints during CFT. NeiAttn [10] exemplifies this approach by formulating classification tasks as masked language modeling and proposing a neighboring attention mechanism to counteract negative representation drift.

Another line of approaches refines the input/output format and network architectures of pre-trained LLMs to be better suited for CFT. For instance, CTR [106] incorporates two CL-plugin modules, i.e., a task-specific module (TSM) for acquiring task-specific knowledge and a knowledge-sharing module (KSM) for selectively transferring previously learned similar knowledge. CIRCLE [281] manually designs diverse prompt templates for various types of buggy code, unifying them as the cloze task and employs difficulty-based replay to enhance continual program repair. LFPT5 [182] addresses lifelong few-shot language learning by consolidating sequence labeling, text classification, and text generation into a text-to-text generation task. It undergoes prompt tuning on generated pseudo-examples from previous domains when adapting to new tasks. In [291], the authors propose a method for adaptively adding compositional adapters during continual sequence generation tasks. Before training on new domains, a decision stage determines which trained module can be reused. During training, this module also regenerates examples of the past for replay. C3 [37] merges PEFT and in-context learning (ICL) in a teacher-student framework. The teacher model undergoes in-context tuning focused solely on the current domain, while the student model, together with tunable prompts, minimizes the KL-divergence between the output distribution and the ground truth and teacher model simultaneously.

4.3.3 Continual Instruction Tuning (CIT). When the instruction tuning data comes in as a stream, forgetting of the previously learned instructions should be addressed. CT0 [208] represents the inaugural study on Continual Instruction Tuning (CIT) of LLMs, applying the replay method on the base T0 model throughout the process. Many subsequent studies focus on enhancing the replay method used during CIT. For instance, [80] improve replay efficiency by computing Key-Part Information Gain (KPIG) on masked parts to dynamically select replay data, addressing the 'half-listening' issue in instruction following. Similarly, SSR [89] uses the LLM to generate synthetic instances for replay, achieving superior or comparable performance to traditional methods at a lower cost.

Other approaches introduce multiple CL techniques during CIT. DynaInst [165] merges parameter regularization with dynamic replay, selectively storing and replaying instances and tasks to enhance outcomes. InstructionSpeak [279] employs negative training and replay instructions to improve both forward transfer and backward transfer. Some methods incorporate PEFT. Orthogonal Low-Rank Adaptation (O-LoRA) learns new tasks within an orthogonal subspace while preserving LoRA parameters for previous tasks [240] to minimize the interference among different tasks. Shared Attention Framework (SAPT) combines a PET block with a selection module via a Shared Attentive Learning & Selection module, tackling catastrophic forgetting and knowledge transfer concurrently [297]. While regularization-based and architectural-based methods require additional parameter storage and GPU memory, together with replay-based methods they remain for CIT due to the simplicity and effectiveness [243].

4.3.4 Continual Model Refinement (CMR). The concept of model editing was initially explored in [215], which introduced a 'reliability-locality-efficiency' principle and proposed a gradient descent editor to address it efficiently. Subsequent research, such as [47] and [163], extended this principle to edit factual knowledge in BERT-based language models and larger models like GPT-J-6B [235] and T5-XXL [189], respectively, using gradient decomposition. These approaches typically update a subset of model parameters to alter the labels of specific inputs. Additionally, memory-based models, as discussed in [164] and [74], incorporate editing through retrieval mechanisms.

Continual Model Refinement (CMR) extends model refinement horizontally, presenting updated sample pairs ( 𝒙 𝑒 , 𝑦 𝑒 , b 𝑦 𝑒 ) 𝑒 = 1 𝑁 sequentially as a stream. [125] initially introduces this idea, evaluating various CL methods with a dynamic sampling algorithm. Many CMR methods employ a retrieval mechanism. For instance, [74] uses hidden activations of the language model as a 'key' to activate updated parameters only when input 𝑥 0 resembles updated sample pairs; [280] improves this approach's efficiency by integrating LoRA [86]; [45] augments the LLM with an external episodic memory, modeling CMR as an ongoing memory refresh. Meanwhile, some methods focus solely on updating a subset of model parameters. For example, [85] addresses the issue of 'toxicity buildup and flash' in single-editing methods like ROME [157], adapting it to the CL context with a knowledge-aware layer selection algorithm. WISE [238] addresses the 'impossible triangle' of reliability, locality, and generalization in existing lifelong model refinement methods. It introduces a side memory system that enables knowledge sharding and merging, successfully achieving all three objectives simultaneously.

While all these works pioneer research in CMR, the exploration of CMR for LLMs remains open. [75] highlights a potential problem: the location where a fact is stored may not coincide with the best place for editing it. This challenges the classical 'locate and edit' paradigm used by several existing methods [157, 158] and could become a significant concern for CMR [85]. Other questions, including whether such a problem setting suits LLMs and whether more memory- and computationally efficient CMR methods could be developed for LLMs, are yet to be answered.

4.3.5 Continual Model Alignment (CMA). When LLMs undergo the phase of MA, vertical forgetting of previous knowledge usually occurs. In [128], the authors refer to this phenomenon of catastrophic forgetting induced by MA as the 'Alignment Tax.' Notably, even a single stage of MA can diminish the model's capabilities, as it restricts the model's responses to a narrower subset of the training distribution.

Continual Model Alignment (CMA) aims to continuously refine LLMs to align with evolving human values, ethics, and data. The static nature of LLM training on historical datasets can lead to discrepancies between the models' outputs and current factual accuracy, societal norms, and standards, making CMA a crucial process for maintaining their adaptability and alignment with contemporary contexts. Likewise, there are two types of CMA frameworks: RL-based and SL-based. In the realm of RL-based CMA, two significant contributions have been noted. [128] identifies the conflicts between existing CL techniques and RLHF, and proposes Adaptive Model Averaging (AMA), which adaptively finds appropriate ratios for the combination of model layers to gain maximal reward with minimal tax; Continual Proximal Policy Optimization (CPPO) [287] proposes a weighting strategy that decides, for each example, whether it is used for policy enhancement or knowledge retention, mitigating the alignment tax over time. For SL-based CMA, Continual Optimal Policy Fitting (COPF) [286] presents a solution adapted from Direct Preference Optimization (DPO) [188], addressing its potential risks of sub-optimal policy fitting and over-optimization in the context of CMA.

4.3.6 Continual Multimodal Large Language Models (CMLLMs). Continually training multi-modal models like CLIP [185] has long been studied [171, 300], while the problem of continually training MLLMs remains underexplored. Several existing studies have investigated the causes of catastrophic forgetting when continually training MLLMs. [298] performs singular value decomposition on input embeddings, revealing a significant disparity among different input embeddings. This discrepancy causes the model to learn irrelevant information for previously trained tasks, resulting in catastrophic forgetting and negative forward transfer. [284] observes that minority collapse may lead to catastrophic forgetting when the imbalance ratio between majority and minority classes approaches infinity during fine-tuning. It further identifies hallucination as a contributing factor to performance degradation in MLLMs.

Continual Fine-Tuning MLLMs. In contrast to traditional continual learning methods that involve full-model fine-tuning for new tasks, continual fine-tuning for MLLMs focuses on refining specific layers when adapting to new tasks [32, 77, 284, 298, 304]. Given the strong capabilities of pre-trained models, training specific layers suffices and simultaneously reduces computational demands. [296] additionally considers a continual learning scenario, Continual Missing Modality Learning (CMML), where different modalities emerge throughout the incremental learning stages. All the aforementioned studies collectively indicate that MLLMs still suffer from catastrophic forgetting, which manifests in two ways: along the direction of vertical continuity, a performance decline on pre-trained tasks following fine-tuning for downstream tasks; and along the axis of horizontal continuity, performance degradation on previously fine-tuned tasks after fine-tuning for new tasks. [298] also observes negative forward transfer, where the performance on unseen tasks degrades when learning new tasks, indicating a decline in model generalization capability.

While traditional CL methods are applicable, some may not yield optimal results, as evidenced by various experiments [77, 298]. For instance, [77] observes a consistent efficacy of replay-based and model-expansion strategies across diverse scenarios of continually fine-tuning MLLMs, but regularization-based methods only perform well on models that have been jointly instruction-tuned on multiple tasks. Other works seek to develop ad-hoc solutions for continual learning of MLLMs. [77] proposes EProj to expand the projection layer in MLLMs for each new task and utilizes task-similarity-informed regularization (TIR) to enhance performance. [298] introduces Fwd-Prompt, a prompt-tuning method that projects the prompt gradient onto the residual space to minimize interference between tasks and onto the pre-trained subspace to reuse pre-trained knowledge, fostering positive forward transfer without relying on previous samples. [304] focuses on the forgetting of pre-trained MLLMs after fine-tuning on specific tasks and proposes Model Tailor, which compensates a selected subset of parameters critical for enhancing target-task performance. [296] presents a novel method named Reconstruct before Query (RebQ), leveraging the multi-modal knowledge from a pre-trained model to reconstruct the absent information of the missing modality. Recently, the Mixture-of-Experts (MoE) framework, which resembles architecture-based methods in CL, has gained attention. It provides the model with the ability to learn different intentions from distinct experts; e.g., [32] first introduces MoELoRA to fine-tune LLaVA, effectively mitigating the catastrophic forgetting of MLLMs on the CoIN benchmark.


Humans can accumulate knowledge and skills across tasks without significant performance decline on previous tasks [101, 153, 154, 175]. In contrast, machine learning models, which are typically data-centric, often experience performance degradation on old tasks when trained on new ones, a phenomenon known as 'catastrophic forgetting.' The challenge of adapting models to a sequence of tasks without forgetting, especially when little to no past data can be preserved, is extensively studied in the continual learning community [38, 178, 232, 237]. For formal definitions and a detailed introduction to the three CL scenarios and techniques, please refer to Appendix A.2.

2.2.1 Types of Continual Learning. To lay the groundwork for subsequent discussions (as illustrated in Table 3 and Section 6.2), we follow the conceptual framework proposed by [112, 232, 237]. There are three primary types of continual learning scenarios: (i) Task-Incremental Learning (TIL), where task indices are available to the model during inference [113, 124]; (ii) Domain-Incremental Learning (DIL), where the model learns a sequence of tasks with the same formulation but without task indices during inference [213]; and (iii) Class-Incremental Learning (CIL), where the model learns new classes of data during training [112, 193].

2.2.2 Techniques of Continual Learning. Existing CL techniques can be roughly categorized into five groups [237]: (i) replay-based, (ii) regularization-based, (iii) architecture-based, (iv) optimization-based, and (v) representation-based. Here, we provide a concise yet comprehensive introduction to the first three categories of continual learning techniques, as they are extensively applied in continual LLMs.

Replay-based methods relax the memory constraint by keeping a small buffer of observed data and retraining the model on it when learning new tasks. Although replay-based methods may theoretically lead to loose generalization bounds [213], they are valued for their simplicity, stability, and high performance, even with a small episodic memory [24, 30, 193, 195]. Regularization-based methods adopt a regularization term λ‖θ − θ_{t−1}‖_Σ that penalizes large deviations from the history model in the parameter space, where ‖v‖_Σ = v^⊤ Σ v is the vector norm evaluated on a positive semi-definite matrix Σ, and λ is the regularization coefficient, a hyper-parameter introduced to balance past knowledge retention and current knowledge learning. The matrix Σ measures the importance of each parameter, and the correlations among parameters, in retaining past knowledge. In practice, to reduce computational overhead, diagonal matrices are often designed to encode only the importance of each parameter [4, 113, 197]. Architecture-based methods, especially those expanding the network architecture dynamically to assimilate new knowledge, are considered the most efficient form of CL [248, 249]. These methods primarily tackle adaptation challenges and can achieve zero forgetting when task IDs are available during inference or can be correctly inferred [71, 256]. However, due to the difficulty of task ID inference, architecture expansion is predominantly utilized in TIL but is scarcely explored in DIL or CIL. In conjunction with pre-trained backbone large models like ViT [54], CoLoR [256] trains various low-rank adaptation (LoRA) [86] modules for different tasks. It estimates and stores prototypes for each task and utilizes the natural clustering ability of the pre-trained model during testing to infer task IDs, selecting the corresponding LoRA component for prediction generation.
In the domain of continual LLMs, architecture expansion has resurged in popularity following the rise of parameter-efficient fine-tuning (PEFT) [50, 86, 211], a topic we will delve into shortly [96, 100, 118, 177, 240, 257, 272, 273].
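As a minimal sketch of the regularization-based family in the diagonal-Σ case described above, the penalty reduces to an importance-weighted sum of squared parameter drifts. The function name and toy values are our own illustrative choices:

```python
import numpy as np

def regularization_penalty(theta, theta_prev, importance, lam=1.0):
    """Diagonal-Sigma regularization penalty lam * ||theta - theta_prev||_Sigma:
    each parameter's squared drift from the history model is weighted by its
    estimated importance for past tasks."""
    d = theta - theta_prev
    return lam * float(np.sum(importance * d * d))

theta_prev = np.array([1.0, -2.0, 0.5])   # parameters after the previous task
importance = np.array([10.0, 0.1, 1.0])   # first parameter matters most
theta = np.array([1.1, -1.0, 0.5])        # drifts on parameters 0 and 1

# The small drift on the important parameter costs as much as the large
# drift on the unimportant one: 10 * 0.1^2 + 0.1 * 1.0^2 = 0.2 in total.
print(regularization_penalty(theta, theta_prev, importance))
```

During training, this scalar is added to the task loss, so gradient descent trades off new-task fit against drift on parameters deemed important for old tasks.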

2.2.3 Evaluation Metrics of Continual Learning. There are four evaluation protocols primarily designed for continual learning. Overall Performance (OP) [106, 286, 291] calculates the average performance up until the current training stage, measuring the overall ability of a model to balance the performance of each task. As noted in [213], OP corresponds to the primary optimization objective of continual learning and hence receives the most attention. Forgetting (F) represents the largest performance drop observed for each task throughout the training process, averaged over all training stages. It quantifies the negative impact that learning new tasks brings to previously acquired knowledge. Ideally, a robust continual learning framework should achieve Backward Transfer (BWT), where learning new tasks enhances performance on prior tasks. BWT is measured by negating the forgetting, and hence a negative forgetting indicates an improvement in performance on earlier tasks. Forward Transfer (FWT) measures the generalization ability of continual learning algorithms to unseen tasks. It is defined as the difference between the current model's performance on future tasks and that of a randomly initialized model. Refer to Appendix B.1 for more details.
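All four protocols can be computed from a single matrix of per-stage, per-task scores. The following is one common instantiation (exact definitions vary slightly across papers, e.g., in how forgetting is averaged), with illustrative names and toy numbers:

```python
import numpy as np

def cl_metrics(R, baseline=None):
    """One common instantiation of the CL metrics (a sketch; definitions
    vary across papers). R[t, i] is the performance on task i after
    finishing training stage t; assumes one new task per stage (square R)."""
    T = R.shape[0]
    op = R[-1, :].mean()                      # OP: average over tasks at the end
    # F: per earlier task, peak past performance minus final performance.
    f = np.mean([R[:-1, i].max() - R[-1, i] for i in range(T - 1)])
    bwt = -f                                  # BWT is the negated forgetting
    fwt = None
    if baseline is not None:                  # FWT: pre-exposure performance on
        fwt = np.mean([R[i - 1, i] - baseline[i] for i in range(1, T)])
    return {"OP": op, "F": f, "BWT": bwt, "FWT": fwt}

# Toy run with 3 tasks: task 0 degrades from 0.9 to 0.7 as later tasks arrive.
R = np.array([[0.9, 0.2, 0.1],
              [0.8, 0.9, 0.3],
              [0.7, 0.8, 0.9]])
m = cl_metrics(R, baseline=np.array([0.1, 0.1, 0.1]))
print(m["OP"], m["F"], m["BWT"])  # approx. 0.80, 0.15, -0.15
```

The sign conventions match the text: positive F is forgetting, and a negative F (equivalently, positive BWT) would mean old tasks improved as new ones were learned.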



HAIZHOU SHI ∗ , ZIHAO XU, HENGYI WANG, WEIYI QIN, WENYUAN WANG † , and YIBIN WANG † , Rutgers University, USA

ZIFENG WANG and SAYNA EBRAHIMI, Google Cloud AI Research, USA

HAO WANG ∗ , Rutgers University, USA


CCS Concepts: • Computing methodologies → Lifelong machine learning; Natural language processing; Neural networks. Additional Key Words and Phrases: Large Language Models, Continual Learning.

Evaluation Protocols and Datasets

Continual LLMs' Evaluation Protocols. LAnguage Model Analysis (LAMA) is an evaluation framework designed to probe the world knowledge embedded in language models [179]. LAMA converts each world fact into a cloze statement, which is then input into the language models to predict the correct answer. It has been extensively utilized in work on CPT under temporal shifts [95, 96]. FUAR (Forgotten / (Updated + Acquired) Ratio) is proposed for CPT to address OP's drawback of being unable to accurately reflect the model's behavior. A FUAR value of 1 represents an equal trade-off between knowledge forgetting and knowledge learning, while a FUAR less than 1 suggests high learning efficacy. In TRACE [241], the authors propose a set of 'X-Delta' metrics for continual instruction tuning, quantifying the forward transfer on specific abilities of LLMs, which is a straightforward extension of FWT. Specifically, the authors construct three sets of evaluation tasks to benchmark the abilities of LLMs, including general ability, instruction following, and safety. For a more detailed introduction to these evaluation protocols, please refer to Appendix B.2.

Datasets. In this section, we provide a comprehensive review of the datasets available for benchmarking continual LLMs, as illustrated in Table 4. We provide information about these datasets' types, the distributional shifts and semantic domains they include, and their sources and applications. We intentionally exclude datasets used for domain-adaptive pre-training LLMs in vertical domains such as legal, medical, and financial, unless they are specifically designed for continual domain-adaptive pre-training. Furthermore, we omit datasets used in general continual fine-tuning, as they have already been extensively studied in existing works [17, 105]. For details, please refer to Appendix B.3.

Continual LLMs' Evaluation Protocols

LAnguage Model Analysis (LAMA). LAnguage Model Analysis (LAMA) is an evaluation framework designed to probe the world knowledge embedded in language models [179]. It converts each world fact into a cloze statement, which is then input into the language models to predict the correct answer. LAMA has been extended for continual pre-training, particularly under temporal shifts [95, 96]. In CKL, three LAMA benchmarks are constructed for different dimensions: InvariantLAMA assesses knowledge retention on time-invariant facts, UpdatedLAMA focuses on knowledge updates, and NewLAMA evaluates knowledge acquisition [96].

Forgotten / (Updated + Acquired) Ratio (FUAR). As the performance of a pre-trained LLM is decomposed into a fine-grained set in CKL [96], OP becomes too general a metric and cannot accurately reflect the balance and trade-offs of the model's behavior. To address this issue, CKL proposes a joint evaluation metric, FUAR (Forgotten / (Updated + Acquired) Ratio), for continual pre-training. A FUAR value of 1 represents an equal trade-off between knowledge forgetting and knowledge learning: for each piece of updated or acquired knowledge, one piece of time-invariant knowledge is forgotten on average. A FUAR less than 1 suggests high learning efficacy, where more than one piece of knowledge is acquired at the expense of forgetting one piece of time-invariant knowledge.
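The metric itself is a single ratio; a minimal sketch, with illustrative counts:

```python
def fuar(n_forgotten, n_updated, n_acquired):
    """FUAR = Forgotten / (Updated + Acquired): pieces of time-invariant
    knowledge lost per piece of knowledge updated or newly acquired.
    Values below 1 indicate efficient learning. A sketch of the metric
    described in CKL [96]; counts here are hypothetical."""
    return n_forgotten / (n_updated + n_acquired)

# E.g., 5 time-invariant facts forgotten while 10 facts were updated
# and 15 newly acquired during continual pre-training:
print(fuar(5, 10, 15))  # 0.2, well below 1: high learning efficacy
```

In practice the three counts are obtained by probing the model (e.g., with the LAMA-style benchmarks above) before and after each continual pre-training stage.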

X-Delta. In TRACE [241], the authors propose a set of 'X-Delta' metrics for continual instruction tuning, quantifying the forward transfer on specific abilities of LLMs. Denote a set of M datasets {X_1, X_2, ..., X_M} for ability X, and the baseline performances of the pre-trained LLM evaluated on these datasets as {b_1^X, ..., b_M^X}. The model undergoes continual fine-tuning on a different set of tasks, distinct from those used for evaluation. Throughout the sequential training process, the performance of the model after learning task t on evaluation dataset X_i is R_{t,i}^X. The X-Delta ΔR_t^X after learning task t is defined as:

$$
\Delta R_t^X = \frac{1}{M} \sum_{i=1}^{M} \left( R_{t,i}^X - b_i^X \right).
$$

In the public TRACE benchmark, the authors construct three sets of evaluation tasks to benchmark the ability of LLMs, including general ability , instruction following , and safety [241].
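X-Delta thus measures the average gap between the continually trained model's scores and the pre-trained baselines on the M held-out ability datasets. A minimal sketch with hypothetical scores:

```python
import numpy as np

def x_delta(R_t, b):
    """X-Delta after training stage t: the mean gap between the current
    model's scores R_t on the M held-out ability datasets and the
    pre-trained baselines b. A sketch of the TRACE metric [241];
    the scores below are hypothetical."""
    R_t, b = np.asarray(R_t, dtype=float), np.asarray(b, dtype=float)
    return float((R_t - b).mean())

b = [0.60, 0.40, 0.80]    # pre-trained LLM's baseline scores on M = 3 datasets
R_t = [0.55, 0.35, 0.75]  # scores after continual fine-tuning on other tasks

print(x_delta(R_t, b))  # approx. -0.05: the ability degraded on average
```

A negative X-Delta on, say, the safety evaluation set would indicate that continual instruction tuning eroded that ability relative to the pre-trained model.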

NLG Score. In continual model alignment, three prominent metrics are used to evaluate different aspects of natural language generation (NLG): BLEU-4 [176], METEOR [12], and ROUGE-L [126]. BLEU-4 [176], designed for machine translation (MT), evaluates the precision of n-grams between the machine-generated and reference texts, focusing especially on four-word sequences to gauge fluency and adequacy. METEOR [12] also targets MT but aims to improve correlation with human judgment by considering synonyms and stemming, thus providing a more nuanced assessment of translation quality. On the other hand, ROUGE-L [126] is commonly applied in summarization tasks, assessing the longest common subsequence between the generated summary and a set of reference summaries, effectively measuring the recall of essential content. Each metric has its strengths and is tailored to specific kinds of language processing tasks, reflecting different dimensions of text generation quality.
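Of the three, ROUGE-L is simple enough to sketch directly: it reduces to a longest-common-subsequence (LCS) computation over tokens. The version below uses a balanced F-measure for simplicity, whereas the original ROUGE-L weights recall more heavily via a beta parameter:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists,
    via standard dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F-measure over whitespace-tokenized strings
    (beta = 1 here; the original metric is recall-weighted)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

# LCS is "the cat on the mat" (5 tokens out of 6 on each side).
print(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"))  # approx. 0.833
```

Production evaluations typically rely on reference implementations (with stemming, multi-reference support, and the recall-weighted F); this sketch only illustrates the core LCS idea.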

Discussion

Intriguing Properties Emergent in Continual LLMs

Beyond the well-established resilience of pre-trained large language models (LLMs) against catastrophic forgetting compared to downstream-specific models [106, 144, 156, 223, 299], there is a notable lack of exploration into other intriguing properties of LLMs when trained continually. In [275], it is observed that when fine-tuned sequentially and cyclically on a series of documents, large models exhibit a phenomenon known as 'anticipatory recovery': the LLMs' ability to recover forgotten information about documents even before encountering them again. This suggests that LLMs may possess the capability of sequential memorization, which could pave the way for research into more complex structured learning environments as model parameters scale up.

Conventional Types of Incremental Learning

As mentioned in Section 2.2.1, three types of incremental learning are prevalent [232]. Among them, class-incremental learning (CIL) has historically attracted significant attention from the community [193, 262]. However, in the context of continually pre-training and adapting large language models (LLMs), we observe a decreased interest in CIL but an increased focus on task-incremental learning (TIL) and domain-incremental learning (DIL). Given that language models are inherently designed for content generation and are pre-trained with the pretext generative task of next-word prediction, it is natural to emphasize the patterns of generative tasks and integrate the traditional CIL paradigm into the broader framework of language modeling, discarding the incremental classification head [26, 210]. However, the declining attention to CIL does not suggest that it is not impactful in the field of continual learning for LLMs. Techniques such as vocabulary expansion [5, 44] and learning routing function in the MoE system [35] can be seen as an extension of expanding the classification head in CIL, and previously validated techniques of CIL can be directly applied.

The importance of DIL is self-evident, given the shared task definition and input-output format in continual pre-training (CPT) and domain-adaptive pre-training (DAP). On the other hand, TIL attracts significant interest as it plays a crucial role in instruction tuning, where instructions can be seen as natural-language-encoded task indices [80, 89, 165, 208, 240, 243, 279, 297]. It is worth noting that the boundary between TIL and DIL becomes somewhat blurred in continual instruction tuning. Language models demonstrate the capability to infer domain information for unseen instructions, suggesting a convergence of TIL and DIL in certain contexts.

Roles of Memory in Continual LLMs

Previous continual learning research, drawing inspiration from human learning patterns, primarily emphasizes the storage efficiency of past data. However, this focus may no longer hold true in the context of continual LLMs. In the direction of relaxing memory constraints, institutions with access to training data may opt to retain full access without restricting memory size, given that the cost of memory storage is more than affordable. In such scenarios, as highlighted in [233], the challenge shifts from storage efficiency to computational efficiency. To achieve continual learning goals, models must efficiently adapt to new data (efficient adaptation) and select key experiences for replay (efficient replay) [99, 268]. Therefore, it is essential to reassess the existing memory constraint and prioritize optimizing computational efficiency for continual learning of LLMs by restricting the number of updates and FLOPs [180, 247].

Table 4. Summary of the existing benchmarks publicly available for continual learning of LLMs. In the Name column, we use the superscript '∗' to denote the lack of a dataset name; the name shown is that of the original paper. In this table, we deliberately omit the datasets used for domain-adaptive pre-training of vertical LLMs, as their main focus of development is not on continual learning. We also omit the datasets used for general continual fine-tuning, as they are extensively discussed in other existing surveys [17, 105].

On the other end of the spectrum, studies with tightened memory constraints remain vital in modern continual learning of LLMs. As shown in Fig. 1, upstream suppliers of LLMs typically do not provide training data with the released model weights. Consequently, consumers must adapt these models to downstream data without access to the actual replay data. Various rehearsal-free continual strategies are applied in this scenario, such as collecting data examples from alternate sources [9, 42, 199, 258], leveraging the generative capabilities of LLMs to produce pseudo-examples for replay [182], and implementing regularization techniques in the parameter space [107, 197]. Continual learning under the strict memory constraint is also driven by data privacy concerns, where preserving data on the server side is prohibited. In these scenarios, researchers must rely on online continual learning methods [150, 181], where data examples are only utilized for training as they arrive in a stream, and numerous efforts are already underway to develop LLMs capable of operating under these constraints [20].

Prospective Directions

Theories of Continual LLMs. It is widely recognized that the continual learning community tends to prioritize empirical research over theoretical exploration. Nevertheless, there are efforts to establish theoretical foundations for CL. In [237], the authors utilize second-order Taylor expansions around optimal parameters to derive an inter-task generalization error bound based on the maximum eigenvalue and the l2-norm of parameter differences. Another line of approaches leverages task/domain discrepancies to construct a multi-task generalization bound. For instance, Unified Domain Incremental Learning (UDIL) [213] proposes upper bounds for intra-domain and cross-domain distillation losses, unifying various replay-based DIL techniques under a single adaptive generalization bound. However, applying these existing theories directly to continual LLMs can be imprudent, given their pre-trained, large-scale nature. Consequently, there is a notable gap in research on continually learning LLMs with robust theoretical guarantees and on understanding the forgetting behaviors of LLMs from a theoretical perspective.

Efficient Replay for Knowledge Retention in Continual LLMs. While the storage budget can theoretically be infinite (Section 6.3), replaying past experiences without specific design can lead to inefficient updates in current-domain learning, resulting in slow convergence. Beyond sparse replay solutions that control data mixture ratios [129, 199, 274], there is ongoing exploration of efficient replay for continual LLMs. For example, KPIG [80] enhances replay efficiency by calculating Key-Part Information Gain (KPIG) on masked segments, enabling the dynamic selection of replay data. [99] introduces a forgetting forecasting mechanism based on output changes during adaptation, later used for selective replay in continual model refinement (CMR). More sophisticated and accurate data-mixing strategies and efficient replay-sample selection mechanisms are still needed; hence we mark this as a significant direction for future research.

Continual LLMs with Controllable Memory. The long-term memory inherent in the whole set of parameters of LLMs often lacks interpretability and explicit manipulability, which is crucial in certain application areas such as machine unlearning [21], where the continually pre-trained models need to constantly roll back to a previous version predating the inclusion of the revoked data and retrain the model from that point onward. This example illustrates the benefits of equipping LLMs with an external, controllable memory. As part of continual model refinement (CMR), memory systems for continual learning have been explored in several studies. Larimar [45] suggests integrating the Kanerva Machine [263] as an episodic memory for multi-fact model editing. This memory system supports basic operations like writing, reading, and generating , as well as advanced operations such as sequential writing and forgetting . It enables one-shot knowledge updates without costly retraining or fine-tuning. Other memory systems like Hopfield Networks [192] hold promise for future investigation as well.

Continual LLMs with Custom Preferences. In service-oriented contexts, users often require different trade-offs among domain expertise, ethics, values, and tone of expression. Efficiently building customized LLMs for individual users and offering flexible adjustment options is a challenging task. Early attempts in this direction include Imprecise Bayesian Continual Learning (IBCL), which, under certain assumptions, guarantees the generation of Pareto-optimal models based on user preferences by combining two model posteriors in the parameter space [139]. While its empirical validation is limited in scale, this approach paves the way for future research in this area.

Conclusion

In this work, we offer a comprehensive survey on continual LLMs, summarizing recent advancements in their training and deployment from a continual learning standpoint. We categorize the problems and tasks based on their positions within our proposed broader framework of modern stratified continual learning of LLMs. While there is a widespread and growing interest in this area across the community, we also note several missing cornerstones, including algorithmic diversity and a fundamental understanding of large models' behaviors such as knowledge forgetting, transfer, and acquisition. With a holistic yet detailed approach, we aim for this survey to inspire more practitioners to explore continual learning techniques, ultimately contributing to the development of robust and self-evolving AI systems.

Supplementary Material

Preliminaries

In this section, we provide an overview of the fundamental concepts of large language models (LLMs) and continual learning (CL). We begin by introducing the notation used in this paper. Subsequently, we discuss the pre-training and downstream adaptation of LLMs, as well as mainstream LLM families (Appendix A.1), followed by an introduction to basic continual learning techniques studied by the community (Appendix A.2).

Notation. We denote scalars with lowercase letters, vectors with lowercase boldface letters, and matrices with uppercase boldface letters. The $\ell_2$-norm of a vector and the Frobenius norm of a matrix are both represented by $\|\cdot\|_2$. For a vector $\boldsymbol{v} = [v_1, v_2, \cdots, v_n]^\top$, $\|\boldsymbol{v}\|_2 = (\sum_{i=1}^{n} v_i^2)^{1/2}$; for a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, $\|\boldsymbol{A}\|_2 = (\sum_{ij} A_{ij}^2)^{1/2}$. We use $\epsilon_{\mathcal{D}}$ and $\mathcal{L}_{\mathcal{D}}$ to denote the error function and the loss function deployed for training, respectively, where the subscript denotes the error/loss measured in expectation over the data distribution $\mathcal{D}$. We further use $\widehat{\mathcal{L}}_S$ to represent the empirical evaluation of the loss function $\mathcal{L}$ over a set of examples $S$. Probability and expectation are denoted by $P$ and $\mathbb{E}$, respectively. We use $[m]$ to denote the set of positive integers up to $m$, i.e., $\{1, \cdots, m\}$.
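As a quick concrete check of the two norm definitions above (plain Python, for illustration only):

```python
import math

def l2_norm(v):
    # ||v||_2 = (sum_i v_i^2)^(1/2)
    return math.sqrt(sum(x * x for x in v))

def frobenius_norm(A):
    # ||A||_2 = (sum_ij A_ij^2)^(1/2); the survey uses ||.||_2 for both
    return math.sqrt(sum(x * x for row in A for x in row))

print(l2_norm([3.0, 4.0]))                       # 5.0
print(frobenius_norm([[1.0, 2.0], [2.0, 4.0]]))  # sqrt(25) = 5.0
```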

Large Language Models

Primarily built on the transformer architecture, pre-trained language models (PLMs) have established a universal hidden embedding space through extensive pre-training on large-scale unlabeled text corpora [51, 133, 189]. By scaling parameters to billions or even hundreds of billions and training on massive text datasets [84, 102], PLMs not only demonstrate superior language understanding and generation capabilities but also manifest emergent abilities such as in-context learning, instruction following, and multi-step reasoning [159, 250, 252, 253, 277]. These larger models are commonly referred to as Large Language Models (LLMs). For more detailed introduction, please refer to Appendix A.1.

2.1.1 Pre-training of LLMs. There are two popular pre-training paradigms for LLMs. (1) Decoder-only models typically employ auto-regressive language modeling (LM) tasks during pre-training, including the GPT family [1, 22, 173, 186], Gemini family [194, 225], and the open-source Llama family [230, 231]. Specifically, given a sequence of tokens $\boldsymbol{x} = [x_1, x_2, \cdots, x_N]$, LM predicts the next token $x_t$ autoregressively based on all preceding tokens $\boldsymbol{x}_{<t} = [x_1, x_2, \cdots, x_{t-1}]$, and trains the entire network by minimizing the negative log-likelihood $-\sum_{t=1}^{N} \log P(x_t \mid \boldsymbol{x}_{<t})$, where $P(x_1 \mid \boldsymbol{x}_{<1}) \triangleq P(x_1)$ is the unconditional probability estimate of the first token. (2) Encoder-only models, e.g., BERT [51, 133], use masked language modeling (MLM) as a common pre-training objective. In MLM, for the input sequence $\boldsymbol{x}$, a subset of input tokens $m(\boldsymbol{x})$ is masked and replaced with the special [MASK] token. The pre-training goal is to utilize the unmasked parts $\boldsymbol{x} \setminus m(\boldsymbol{x})$ to predict the masked portions $m(\boldsymbol{x})$. In summary, the overarching goal of MLM is to minimize the negative log-likelihood $-\sum_{\hat{x} \in m(\boldsymbol{x})} \log P(\hat{x} \mid \boldsymbol{x} \setminus m(\boldsymbol{x}))$.
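The two objectives can be illustrated numerically. The sketch below computes the negative log-likelihoods from hypothetical model-assigned probabilities of the true tokens; the function names and values are ours, for illustration only:

```python
import math

def lm_nll(token_probs):
    """Auto-regressive LM loss: -sum_t log P(x_t | x_<t).
    token_probs[t] is the model's probability of the true token x_t
    given the prefix x_<t (for t = 1, the unconditional P(x_1))."""
    return -sum(math.log(p) for p in token_probs)

def mlm_nll(masked_probs):
    """MLM loss: the negative log-likelihood summed over masked
    positions, given the unmasked context."""
    return -sum(math.log(p) for p in masked_probs)

# Hypothetical per-token probabilities for a 4-token sequence:
print(lm_nll([0.5, 0.25, 0.8, 0.9]))
# MLM with two masked positions:
print(mlm_nll([0.6, 0.7]))
```

Minimizing either quantity pushes the model to assign higher probability to the observed tokens; the only difference is the conditioning set (left context for LM, the unmasked remainder for MLM).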

2.1.2 Adaptation of LLMs. LLMs are primarily trained to generate linguistically coherent text. However, this training may not align with human values, preferences, or practical needs. Furthermore, the pre-training data can be outdated, leading to knowledge cutoffs or inaccuracies. To address these issues, various computational paradigms such as Instruction Tuning (IT) [288], Model Refinement (MR) [47], and Model Alignment (MA) [174, 187] have been proposed. These approaches adapt LLMs to better meet diverse downstream tasks and user requirements.

Numerous studies show that Instruction Tuning (IT) can notably improve LLMs' ability to follow textual instructions [98, 174, 203, 250, 288], leveraging the pre-existing knowledge within LLMs to bridge the gap between general and task-specific performance [251]. Recent works like WizardLM [269] and CodecLM [246] further tailor synthetic data to steer LLMs' behavior through IT. Additionally, IT enhances the interaction between humans and LLMs, providing a more natural interface and aligning LLM outputs more closely with human expectations and preferences [145]. LLMs make mistakes, such as inaccurate translations or outdated information [47]. Directly fine-tuning the model to correct these mistakes may disrupt its performance on previously learned tasks. To overcome these challenges, Model Refinement (MR) is proposed to rectify the model's errors while preserving its performance on other inputs, with only moderate computing resources [47, 74, 76, 92, 163, 164, 215]. Model Alignment (MA) ensures AI systems' actions and outputs align with human values, ethics, and preferences [174, 187]. MA can be broadly categorized into two types: Reinforcement Learning-based (RL-based) and Supervised Learning-based (SL-based). RL-based approaches [174, 205] are trained to make decisions reinforced by human feedback, using a reward system to guide them towards desirable outcomes. In contrast, SL-based approaches [81, 97, 187] directly train models on datasets of human preferences, aligning their output with demonstrated human values.

Pre-Training of LLMs

4.1.1 CPT: Effectiveness and Efficiency. Before delving into the details of continual pre-training (CPT), it is important to address two fundamental questions. Firstly, regarding effectiveness: can CPT enhance performance on downstream tasks beyond that of the initial training on a wide range of data domains? Extensive studies have not only demonstrated the necessity of CPT for improved downstream performance [35, 71, 95, 96, 100, 184], but also shown that when distributional shifts are gradual [95, 278] or somewhat correlated [71], CPT can effectively help the model generalize to unseen data. The second question is about efficiency: given the large size of an LLM's parameters and data, both old and new, can we achieve adaptation and knowledge retention in a computationally efficient way? Concerning efficiency, most studies focus on techniques for efficient knowledge retention [95, 96, 100, 118], which significantly overlap with the CL literature addressing catastrophic forgetting [4, 24, 191, 193, 195, 196, 201, 207, 213, 236]. In contrast to prior approaches that fully utilize emergent data, some studies recognize the impracticality of this approach in real production environments. Instead, they concentrate on further improving the efficiency of adaptation. For instance, ELLE [184] employs a function-preserved model expansion to facilitate efficient knowledge growth; [5] and [268] sub-sample training data based on novelty and diversity to enhance training efficiency, achieving superior performance compared to full-data training. Though currently underexplored, efficient adaptation in continual pre-training is poised to become significant, given recent findings emphasizing data quality over quantity for LLM generalization [216, 267].
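Novelty/diversity-based sub-sampling can be approximated with a greedy farthest-point selection over example embeddings. This is a generic illustration of the idea, not the exact procedure of [5] or [268]:

```python
def diversity_subsample(embeddings, k):
    """Greedy farthest-point selection: pick k examples that are
    mutually dissimilar, a simple proxy for novelty/diversity-based
    sub-sampling of emerging training data (illustrative sketch)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    selected = [0]  # seed with the first example
    while len(selected) < k:
        # pick the remaining point farthest from the selected set
        best = max(
            (i for i in range(len(embeddings)) if i not in selected),
            key=lambda i: min(dist(embeddings[i], embeddings[j])
                              for j in selected),
        )
        selected.append(best)
    return selected

# Five hypothetical 2-D example embeddings; two tight clusters:
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.1], [5.0, 0.0]]
print(diversity_subsample(points, 3))  # spreads across the clusters
```

Real systems would run such selection over learned sentence or document embeddings and combine it with a novelty score against the already-seen corpus.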

4.1.2 General Observations on CPT. Table 1 summarizes the existing studies on continual pre-training (CPT), and here are some key observations we make about CPT.

· OBS-1: The development of advanced techniques tailored specifically for CPT is at an early stage and warrants further exploration. Only about half of the examined papers propose novel techniques for CPT [5, 35, 44, 52, 71, 104, 183, 184, 221], while the remaining half either focus solely on the effects of pure adaptation without considering CL techniques [63, 69, 137], or conduct empirical studies on the straightforward application of existing CL techniques [95, 96, 100, 118].

· OBS-2: The diversity of CL techniques incorporated in CPT remains limited. Most practical implementations of CL techniques for CPT primarily focus on architecture expansion of LLMs [5, 35, 44, 52, 71, 183], with only a few explicitly utilizing replay [35, 183] and parameter regularization [5, 35].

· OBS-3: There is an apparent gap between the existing studies and the real production environment of CPT. Except for the recent study [278], which conducts CPT over 159 domains, the longest sequence of

Table 1. Summary of existing studies on Continual Pre-training of LLMs. The papers are organized based on their relation to CL: (i) no CL techniques are studied, (ii) CL techniques are studied solely as baselines, and (iii) new CL approaches are proposed. In the table, Dist. Shift denotes the type(s) of distributional shift a particular study considers and is dedicated to solving. Under Continual Learning Tech., we categorize three types of continual learning techniques studied in the papers: rehearsal (Rehearsal), parameter regularization (Param. Reg.), and architecture expansion (Arch. Exp.). We use '✓', '✗', and '♣' to denote 'deployed in the proposed method', 'not studied in the paper', and 'studied as a baseline method', respectively. Note that we do not include naive sequential fine-tuning in this table, as it is universally studied as an important baseline in all of the listed papers. Papers with only '♣' [95, 96, 100] study existing CL techniques without proposing new ones, and papers with only '✗' [63, 69] study special aspects of fine-tuning without using CL techniques.

pre-training stages explored is 8 [71, 100]. However, this falls short of real-world scenarios where continual pre-training occurs more frequently and persists for months or years. The efficacy of CPT methods in such prolonged scenarios remains uncertain. Additionally, investigating CPT in a task-boundary-free data stream setting remains an important avenue for future research.

4.1.3 Distributional Shifts in CPT. This survey categorizes distributional shifts of CPT into three main types: (i) Language Shift: LLMs sequentially learn corpora in different languages, e.g., English → Chinese [63, 118]. (ii) Content Shift: LLMs sequentially learn corpora from different fields, e.g., chemistry → biology [35, 44, 69, 71, 100, 183]. (iii) Temporal Shift: distributional shifts occur over time, e.g., news in 2021 → news in 2022, with a major focus on timestamp-sensitive knowledge retention and update [5, 52, 95, 96, 100].

Language Shift. [63] focuses on assessing LLMs' natural ability to learn new languages sequentially. With no explicit CL techniques employed, the study observes consistent positive forward transfer of knowledge, facilitating new language acquisition regardless of the learning order. Forgetting, on the other hand, emerges as a significant challenge that cannot be mitigated by increasing the size of LLMs. In [118], the degree of forgetting of previously learned languages when adapting LLMs to a new language is investigated. Various CL techniques, including parameter freezing, LoRA [86], and (IA)³ [132], are evaluated across multiple dimensions. Preliminary experimental results highlight the non-trivial nature of addressing horizontal forgetting for CPT under the language shift as well.

Content Shift. [278] explores large-scale CPT over 159 content domains, and shows that CPT on various domains can effectively improve models' adaptation ability compared to DAP on a single domain. Similarly, [69] continues the pre-training phase of Pythia [16] with no complex CL techniques and discovers that learning rate re-warming consistently improves upon models trained from scratch. Building on this observation, [94] further shows that a proper combination of learning rate re-warming, re-decay, and replay of previous data is sufficient to achieve performance comparable to full re-training. LLPT [100] establishes a comprehensive training and evaluation protocol for a series of content-level distributional shifts. They assess multiple CL methods and, similar to [63], find consistent forward knowledge transfer, yet horizontal forgetting remains significant. Moreover, contrary to the common understanding that experience replay [30] is the most efficient approach to preventing forgetting, the authors find it ineffective in the case of CPT due to a potential overfitting issue. Recyclable Tuning [183] shows that if the upstream supplier continually pre-trains LLMs, with or without replay, consumer-side efficiency can be boosted by recycling previously learned update components when proper CL techniques are applied.

DEMix [71] incrementally trains and integrates new experts (DEMix layer) for new domains during CPT. To ensure reasonable inference performance during testing when no domain information is available, it proposes a parameter-free probabilistic approach to dynamically estimate a weighted mixture of domains. DEMix's modularization has been shown to facilitate efficient domain-adaptive pre-training, promote relevant knowledge during inference, and allow for removable components. Lifelong-MoE [35], similar to DEMix [71], incrementally trains domain experts for new domains. However, Lifelong-MoE differs from DEMix in utilizing a token-level gating function to activate multiple experts for intermediate embedding calculation. During training, previous experts' parameters and gating functions remain frozen, and knowledge distillation loss is employed to regulate parameter updates, which thereby makes Lifelong-MoE robust against the issue of horizontal forgetting.
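DEMix's parameter-free mixing at test time can be sketched as a Bayesian posterior over domains: each domain expert scores the observed prefix, and the resulting posterior weights the experts' next-token distributions. The following is a simplified sketch of this idea, not the paper's exact implementation:

```python
import math

def domain_posterior(log_likelihoods, log_prior=None):
    """Posterior over domains given each expert's log-likelihood of the
    observed prefix: p(d | x) ∝ p(x | d) p(d). Uniform prior by default."""
    n = len(log_likelihoods)
    if log_prior is None:
        log_prior = [math.log(1.0 / n)] * n
    scores = [ll + lp for ll, lp in zip(log_likelihoods, log_prior)]
    m = max(scores)  # log-sum-exp for numerical stability
    z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [math.exp(s - z) for s in scores]

def mixture_prediction(expert_probs, weights):
    """Weighted mixture of the experts' next-token distributions."""
    vocab = len(expert_probs[0])
    return [sum(w * p[i] for w, p in zip(weights, expert_probs))
            for i in range(vocab)]

# Two hypothetical experts; expert 0 fits the observed prefix much better:
w = domain_posterior([-10.0, -14.0])
mix = mixture_prediction([[0.7, 0.3], [0.2, 0.8]], w)
print(w)    # heavily favors expert 0
print(mix)  # mixture close to expert 0's distribution
```

Because the mixing weights are computed from likelihoods at inference time, no domain labels and no extra parameters are needed, which is what makes components removable.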

It is noteworthy that some papers draw almost opposite conclusions regarding the significance of CPT for content shifts. For instance, [44] continually pre-trains BERT-based models [51, 133] on five scientific domains and evaluates performance on downstream sentiment analysis. They observe that even the trivial sequential pre-training does not exhibit severe forgetting, prompting reasonable questions about the necessity of CPT.

Temporal Shift. In the context of CPT amid content shifts, Multi-Task Learning (MTL) is often regarded as the upper bound achievable [178, 213, 237]. However, this belief does not fully hold when considering CL under temporal shifts [52, 95, 96], as temporal shifts can introduce conflicting information, posing challenges for LLMs. For instance, the statement 'Lionel Messi plays for team Barcelona' remains accurate from 2004 to 2021 but becomes false by 2024, as 'Lionel Messi plays for team Inter Miami' becomes the correct statement.

Hence, as advocated by CKL [96] and TemporalWiki [95], LLMs undergoing continual adaptation to temporal shifts must simultaneously achieve three objectives: (i) retention of old knowledge, (ii) acquisition of new knowledge, and

(iii) update of the outdated knowledge. They evaluate the same set of continual learning baseline methods [34, 79, 87, 239], each highlighting distinct aspects of their impact. CKL [96] observes that parameter expansion consistently exhibits robust performance across all experimental conditions. In contrast, replay-based methods struggle to efficiently acquire new knowledge and update outdated knowledge, leading to rapid forgetting of newly learned information during training. TemporalWiki [95] constructs a series of temporal corpora and their differential sets from sequential snapshots of Wikipedia, revealing that updating LLMs on these differential sets substantially enhances new knowledge acquisition and updates while requiring significantly fewer computational resources; various CL techniques also prove effective in mitigating horizontal forgetting during this process. LLPT [100] introduces temporal generalization evaluation for LLMs pre-trained on sequential corpora. Through experiments on a large-scale chronologically-ordered Tweet Stream, the authors demonstrate the superiority of CPT combined with CL techniques over task-specific LMs, in terms of both knowledge acquisition and temporal generalization. Nonetheless, these preliminary experiments do not conclusively determine which specific CL method is preferable to the others.

Another line of work, Temporal Language Models (TLMs), takes a different approach to address knowledge retention, acquisition, and update under temporal shifts by integrating temporal information into the model [52, 198, 219]. During training, they inject temporal information into training examples as prefixes of prompts, using special tokens [198], explicit year information [52], or syntax-guided structural information [219]. In sequential training experiments conducted by TempoT5 [52], comparison between continually and jointly pre-trained LMs demonstrates that CPT better balances adaptation and forgetting when the replay rate of past data is appropriately set.
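The prompt-prefixing step used by these TLMs is straightforward to illustrate. The exact prefix strings below are placeholders of ours that mimic the cited designs (explicit years as in TempoT5-style conditioning, or special time tokens), not the papers' actual token vocabularies:

```python
def add_temporal_prefix(text, year, style="year"):
    """Prepend temporal context to a training example.
    'year'  -> explicit year prefix (TempoT5-style conditioning);
    'token' -> a special time token (placeholder string, not a real
               tokenizer symbol)."""
    if style == "year":
        return f"year: {year} text: {text}"
    if style == "token":
        return f"<time_{year}> {text}"
    raise ValueError(f"unknown style: {style}")

print(add_temporal_prefix("Lionel Messi plays for team Barcelona.", 2019))
print(add_temporal_prefix("Lionel Messi plays for team Inter Miami.", 2024,
                          style="token"))
```

Conditioning on the timestamp lets one model hold temporally conflicting facts side by side, instead of forcing the newest fact to overwrite the old one.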

Others. CPT, as a technique to progressively acquire novel knowledge, can also be used to refine LLMs' behavior. CEM [294] collects examples where the model's response is incorrect and continually trains the model on these examples, along with a supplemental dataset. RHO-1 [130] proposes Selective Language Modeling (SLM), which employs a reference model to evaluate the perplexity of each token in the training corpus, and continually pre-trains the model on high-perplexity tokens. Similarly, IR-DRO [36] re-trains the model on re-weighted examples from the original pre-training dataset, focusing more on higher-loss sequences.
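The token-selection idea behind SLM can be sketched as keeping only the tokens on which the training model's loss most exceeds a reference model's loss. This is a simplified illustration of the principle; RHO-1's actual scoring and thresholds differ in detail:

```python
def select_tokens(train_losses, ref_losses, keep_ratio=0.5):
    """Score each token by excess loss (training model minus reference
    model) and keep the top keep_ratio fraction; only the selected
    tokens contribute to the continual pre-training loss."""
    excess = [t - r for t, r in zip(train_losses, ref_losses)]
    k = max(1, int(len(excess) * keep_ratio))
    threshold = sorted(excess, reverse=True)[k - 1]
    return [i for i, e in enumerate(excess) if e >= threshold][:k]

# Hypothetical per-token losses for a 6-token sequence:
train = [2.0, 0.5, 3.1, 0.4, 1.8, 0.9]
ref   = [1.9, 0.6, 1.0, 0.5, 0.6, 1.0]
print(select_tokens(train, ref, keep_ratio=0.5))  # indices to keep
```

The intuition is that tokens the reference model already predicts well carry little new signal, so excluding them concentrates the update on genuinely informative tokens.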

The significance of addressing temporal shifts through CPT is underscored by several industrial studies. For instance, [5] employs a dynamic vocabulary expansion algorithm and an efficient sub-sampling procedure to conduct CPT on large-scale emerging tweet data. Conversely, [137] adopts CPT without explicit measures to constrain model updates, releasing a series of BERT-based LMs incrementally trained on new tweet data every three months. Preliminary experimental results demonstrate substantial improvements of continually pre-trained LMs over the base BERT model across downstream tasks. While some studies question the necessity of continually adapting LLMs along the temporal axis for environmental reasons, such as reducing CO₂ emissions [8], the community commonly embraces CPT as a more efficient learning paradigm compared to the traditional 'combine-and-retrain' approach.


Continual Learning

Humans can accumulate knowledge and skills across tasks without significant performance decline on previous tasks [101, 153, 154, 175]. In contrast, machine learning models, which are typically data-centric, often experience performance degradation on old tasks when trained on new ones, a phenomenon known as 'catastrophic forgetting.' The challenge of adapting models to a sequence of tasks without forgetting, especially when little to no past data can be preserved, is extensively studied in the continual learning community [38, 178, 232, 237]. For formal definitions and a detailed introduction to the three CL scenarios and techniques, please refer to Appendix A.2.

2.2.1 Types of Continual Learning. To lay the groundwork for subsequent discussions (as illustrated in Table 3 and Section 6.2), we follow the conceptual framework proposed by [112, 232, 237]. There are three primary types of continual learning scenarios: (i) Task-Incremental Learning (TIL), where task indices are available to the model during inference [113, 124]; (ii) Domain-Incremental Learning (DIL), where the model learns a sequence of tasks with the same formulation but without task indices during inference [213]; and (iii) Class-Incremental Learning (CIL), where the model learns new classes of data during training [112, 193].

2.2.2 Techniques of Continual Learning. Existing CL techniques can be roughly categorized into five groups [237]: (i) replay-based, (ii) regularization-based, (iii) architecture-based, (iv) optimization-based, and (v) representation-based. Here, we provide a concise yet comprehensive introduction to the first three categories of continual learning techniques, as they are extensively applied in continual LLMs.

Replay-based methods relax the memory constraint by keeping a small buffer of observed data and retraining the model on it when learning new tasks. Although replay-based methods may theoretically lead to loose generalization bounds [213], they are valued for their simplicity, stability, and high performance, even with a small episodic memory [24, 30, 193, 195]. Regularization-based methods adopt a regularization term $\lambda \|\boldsymbol{\theta} - \boldsymbol{\theta}_{t-1}\|_{\boldsymbol{\Sigma}}$ that penalizes large deviations from the historical model in the parameter space, where $\|\boldsymbol{v}\|_{\boldsymbol{\Sigma}} = \boldsymbol{v}^\top \boldsymbol{\Sigma} \boldsymbol{v}$ is the vector norm evaluated with a positive-semi-definite matrix $\boldsymbol{\Sigma}$, and $\lambda$ is the regularization coefficient, a hyper-parameter balancing past knowledge retention against current knowledge learning. The matrix $\boldsymbol{\Sigma}$ measures the importance of each parameter, and the correlations among parameters, in retaining past knowledge. In practice, to reduce computational overhead, $\boldsymbol{\Sigma}$ is often designed to be diagonal, encoding only the per-parameter importance [4, 113, 197]. Architecture-based methods, especially those expanding the network architecture dynamically to assimilate new knowledge, are considered the most efficient form of CL [248, 249]. These methods primarily tackle adaptation challenges and can achieve zero forgetting when task IDs are available during inference or can be correctly inferred [71, 256]. However, due to the difficulty of task ID inference, architecture expansion is predominantly utilized in TIL but scarcely explored in DIL or CIL. In conjunction with pre-trained backbone models like ViT [54], CoLoR [256] trains separate low-rank adaptation (LoRA) [86] modules for different tasks. It estimates and stores prototypes for each task and, during testing, utilizes the natural clustering ability of the pre-trained model to infer task IDs, selecting the corresponding LoRA module for prediction generation.
In the domain of continual LLMs, architecture expansion has resurged in popularity following the rise of parameter-efficient fine-tuning (PEFT) [50, 86, 211], a topic we will delve into shortly [96, 100, 118, 177, 240, 257, 272, 273].
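With a diagonal Σ, the regularization term above reduces to a per-parameter weighted squared distance from the previous model, as in EWC-style methods. A minimal sketch, where the importance values standing in for the diagonal of Σ are hypothetical:

```python
def reg_penalty(theta, theta_prev, importance, lam=1.0):
    """Regularization term lam * ||theta - theta_prev||_Sigma with a
    diagonal Sigma: each entry of `importance` is one diagonal element,
    encoding how much that parameter matters for past tasks."""
    return lam * sum(s * (t - p) ** 2
                     for t, p, s in zip(theta, theta_prev, importance))

theta_prev = [1.0, -2.0, 0.5]   # parameters after the previous task
theta      = [1.5, -2.0, 0.0]   # current parameters
importance = [4.0, 1.0, 0.1]    # hypothetical diagonal of Sigma
print(reg_penalty(theta, theta_prev, importance, lam=0.5))
```

During training this penalty is simply added to the task loss, so parameters with high importance are held close to their previous values while unimportant ones remain free to adapt.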

2.2.3 Evaluation Metrics of Continual Learning. There are four evaluation protocols primarily designed for continual learning. Overall Performance (OP) [106, 286, 291] calculates the average performance up until the current training stage, measuring the model's overall ability to balance the performance of all tasks. As noted in [213], OP corresponds to the primary optimization objective of continual learning, and hence receives the most attention. Forgetting (F) represents the largest performance drop observed for each task throughout the training process, averaged over all training stages. It quantifies the negative impact that learning new tasks brings to previously acquired knowledge. Ideally, a robust continual learning framework should achieve Backward Transfer (BWT), where learning new tasks enhances performance on prior tasks. BWT is measured by negating the forgetting, and hence negative forgetting indicates an improvement in performance on earlier tasks. Forward Transfer (FWT) measures the generalization ability of continual learning algorithms to unseen tasks. It is defined as the difference between the current model's performance on future tasks and that of a randomly initialized model. Refer to Appendix B.1 for more details.
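Given a results matrix R, where R[i][j] is the performance on task j after training on task i, the four metrics can be computed as below. This is one common formulation; exact definitions vary slightly across papers, and the random-model baseline for FWT must be supplied separately:

```python
def overall_performance(R):
    """Average performance over all tasks after the final training stage."""
    T = len(R)
    return sum(R[T - 1]) / T

def forgetting(R):
    """For each earlier task, the drop from its best past performance to
    its final performance, averaged over the first T-1 tasks."""
    T = len(R)
    drops = [max(R[i][j] for i in range(j, T - 1)) - R[T - 1][j]
             for j in range(T - 1)]
    return sum(drops) / len(drops)

def backward_transfer(R):
    """BWT negates forgetting: positive BWT means learning later tasks
    improved performance on earlier ones."""
    return -forgetting(R)

def forward_transfer(R, random_baseline):
    """Average performance on each task *before* training on it, relative
    to a randomly initialized model's performance on that task."""
    T = len(R)
    gains = [R[j - 1][j] - random_baseline[j] for j in range(1, T)]
    return sum(gains) / len(gains)

# Hypothetical 3-task results matrix:
R = [[0.8, 0.1, 0.0],
     [0.7, 0.9, 0.2],
     [0.6, 0.8, 0.9]]
print(overall_performance(R))                    # mean of the final row
print(forgetting(R))                             # drop on tasks 0 and 1
print(forward_transfer(R, [0.05, 0.05, 0.05]))   # zero-shot gains
```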


Techniques of Continual Learning

Humans can accumulate knowledge and skills across tasks without significant performance decline on previous tasks [101, 153, 154, 175]. In contrast, machine learning models, which are typically data-centric, often experience performance degradation on old tasks when trained on new ones, a phenomenon known as 'catastrophic forgetting.' The challenge of adapting models to a sequence of tasks without forgetting, especially when little to no past data can be preserved, is extensively studied in the continual learning community [38, 178, 232, 237]. For formal definitions and a detailed introduction to the three CL scenarios and techniques, please refer to Appendix A.2.

2.2.1 Types of Continual Learning. To lay the groundwork for subsequent discussions (as illustrated in Table 3 and Section 6.2), we follow the conceptual framework proposed by [112, 232, 237]. There are three primary types of continual learning scenarios: (i) Task-Incremental Learning (TIL), where task indices are available to the model during inference [113, 124]; (ii) Domain-Incremental Learning (DIL), where the model learns a sequence of tasks with the same formulation but without task indices during inference [213]; and (iii) Class-Incremental Learning (CIL), where the model learns new classes of data during training [112, 193].

2.2.2 Techniques of Continual Learning. Existing CL techniques can be roughly categorized into five groups [237]: (i) replay-based, (ii) regularization-based, (iii) architecture-based, (iv) optimization-based, and (v) representation-based. Here, we provide a concise yet comprehensive introduction to the first three categories of continual learning techniques, as they are extensively applied in continual LLMs.

Replay-based methods adopt the relaxed memory constraint by keeping a small buffer of observed data and retraining the model on it when learning new tasks. Although replay-based methods may theoretically lead to loose generalization bounds [213], they are valued for their simplicity, stability, and high performance, even with a small episodic memory [24, 30, 193, 195]. Regularization-based methods add a regularization term $\lambda \| \boldsymbol{\theta} - \boldsymbol{\theta}_{t-1} \|_{\boldsymbol{\Sigma}}$ that penalizes large deviations from the historical model in parameter space, where $\| \boldsymbol{v} \|_{\boldsymbol{\Sigma}} = \boldsymbol{v}^\top \boldsymbol{\Sigma} \boldsymbol{v}$ is the vector norm induced by a positive-semi-definite matrix $\boldsymbol{\Sigma}$, and $\lambda$ is the regularization coefficient, a hyper-parameter that balances retaining past knowledge against learning current knowledge. The matrix $\boldsymbol{\Sigma}$ measures the importance of each parameter, and the correlations among parameters, for retaining past knowledge. In practice, to reduce computational overhead, $\boldsymbol{\Sigma}$ is often designed to be diagonal, encoding only the per-parameter importance [4, 113, 197]. Architecture-based methods, especially those expanding the network architecture dynamically to assimilate new knowledge, are considered the most efficient form of CL [248, 249]. These methods primarily tackle adaptation challenges and can achieve zero forgetting when task IDs are available during inference or can be correctly inferred [71, 256]. However, due to the difficulty of task ID inference, architecture expansion is predominantly utilized in TIL and is scarcely explored in DIL or CIL. In conjunction with pre-trained backbone models like ViT [54], CoLoR [256] trains separate low-rank adaptation (LoRA) [86] modules for different tasks. It estimates and stores prototypes for each task and, at test time, exploits the natural clustering ability of the pre-trained model to infer task IDs, selecting the corresponding LoRA module for prediction.
In the domain of continual LLMs, architecture expansion has resurged in popularity following the rise of parameter-efficient fine-tuning (PEFT) [50, 86, 211], a topic we will delve into shortly [96, 100, 118, 177, 240, 257, 272, 273].
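To make the first two families concrete, the sketch below implements the diagonal form of the regularization penalty $\lambda \| \boldsymbol{\theta} - \boldsymbol{\theta}_{t-1} \|_{\boldsymbol{\Sigma}}$ and a small replay buffer maintained with reservoir sampling. This is a minimal, framework-free illustration; the names `ewc_penalty` and `ReplayBuffer` are ours, and real implementations operate on model weight tensors rather than Python lists.

```python
import random

def ewc_penalty(theta, theta_prev, importance, lam):
    """Diagonal regularization penalty: lam * sum_i importance_i * (theta_i - theta_prev_i)^2,
    i.e., lam * ||theta - theta_prev||_Sigma with a diagonal Sigma of per-parameter weights."""
    return lam * sum(w * (t - tp) ** 2
                     for t, tp, w in zip(theta, theta_prev, importance))

class ReplayBuffer:
    """Small episodic memory maintained with reservoir sampling, so that
    every example seen so far is retained with equal probability."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0

    def add(self, example):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            # Replace a stored example with probability capacity / seen.
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = example

    def sample(self, k):
        """Draw a mini-batch of past examples to mix into the current task's batch."""
        return random.sample(self.buffer, min(k, len(self.buffer)))
```

In a training loop, the replayed mini-batch would be interleaved with current-task data, and the penalty would be added to the current task's loss.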

2.2.3 Evaluation Metrics of Continual Learning. There are four evaluation protocols primarily designed for continual learning. Overall Performance (OP) [106, 286, 291] calculates the average performance up until the current training stage, measuring the model's overall ability to balance performance across tasks. As noted in [213], OP corresponds to the primary optimization objective of continual learning and hence receives the most attention. Forgetting (F) represents the largest performance drop observed for each task throughout the training process, averaged over all training stages. It quantifies the negative impact that learning new tasks has on previously acquired knowledge. Ideally, a robust continual learning framework should achieve Backward Transfer (BWT), where learning new tasks enhances performance on prior tasks. BWT is measured by negating the forgetting, and hence a negative forgetting indicates an improvement in performance on earlier tasks. Forward Transfer (FWT) measures the generalization ability of continual learning algorithms to unseen tasks. It is defined as the difference between the current model's performance on future tasks and that of a randomly initialized model. Refer to Appendix B.1 for more details.

Evaluation Protocols and Datasets

Continual LLMs' Evaluation Protocols. LAnguage Model Analysis (LAMA) is an evaluation framework designed to probe the world knowledge embedded in language models [179]. LAMA converts each world fact into a cloze statement, which is then input into the language model to predict the correct answer. It has been extensively utilized in work on CPT under temporal shifts [95, 96]. FUAR (Forgotten / (Updated + Acquired) Ratio) is proposed for CPT to address OP's drawback of being unable to accurately reflect the model's behavior. A FUAR value of 1 represents an equal trade-off between knowledge forgetting and knowledge learning, while a FUAR less than 1 suggests high learning efficacy. In TRACE [241], the authors propose a set of 'X-Delta' metrics for continual instruction tuning, quantifying the forward transfer on specific abilities of LLMs, which is a straightforward extension of FWT. Specifically, the authors construct three sets of evaluation tasks to benchmark the abilities of LLMs: general ability, instruction following, and safety. For a more detailed introduction to these evaluation protocols, please refer to Appendix B.2.

Datasets. In this section, we provide a comprehensive review of the datasets available for benchmarking continual LLMs, as illustrated in Table 4. We provide information about these datasets' types, the distributional shifts and semantic domains they include, and their sources and applications. We intentionally exclude datasets used for domain-adaptive pre-training of LLMs in vertical domains such as legal, medical, and financial, unless they are specifically designed for continual domain-adaptive pre-training. Furthermore, we omit datasets used in general continual fine-tuning, as they have already been extensively studied in existing works [17, 105]. For details, please refer to Appendix B.3.

Evaluation Metrics of Continual Learning

In the realm of conventional continual learning, where task streams take the form of classification, many metrics rely on the concept of the Accuracy Matrix [136, 213]. Extending this notion to the context of continually learning LLMs, we introduce the Performance Matrix $\boldsymbol{P} \in \mathbb{R}^{T \times T}$, where $T$ represents the total number of training stages. Each entry of $\boldsymbol{P}$ corresponds to a performance metric evaluated on the model, such as perplexity on pre-training data [35, 69, 100], zero-shot/few-shot evaluation metrics on downstream data without fine-tuning [9, 42, 48, 172, 199, 258], fine-tuned accuracies on downstream tasks [5, 35, 96, 183], and probing accuracies from fine-tuning add-on components evaluated on downstream tasks [144, 223, 299]. In $\boldsymbol{P}$, the entry $P_{i,j}$ denotes the model's performance evaluated on task $j$ after training on task $i$. With this Performance Matrix definition, we introduce the primary evaluation protocols widely adopted.

Overall Performance (OP). The Overall Performance (OP) [106, 286, 291] is a natural extension of the concept of Average Accuracy [136, 213]. The OP measured up until training stage $t$ is the average performance, across the first $t$ tasks, of the model obtained right after stage $t$. Denoting it as $\operatorname{OP}_t$, we have:

$$
\operatorname{OP}_t \triangleq \frac{1}{t} \sum_{i=1}^{t} P_{t,i}.
$$

As noted in [213], the OP corresponds to the primary optimization objective defined in Definitions A.4, A.5, and A.6. In much of the continual learning literature, once all $T$ tasks are completed, the final OP ($\operatorname{OP}_T$) is reported, with the subscript $T$ often omitted for brevity. In some works, OP is weighted by the importance of tasks, $\widetilde{\operatorname{OP}} \triangleq \frac{1}{T} \sum_{i=1}^{T} w_i P_{T,i}$, where $w_i = N_i / \sum_{j=1}^{T} N_j$ is the ratio of data belonging to task $i$. In some literature, $\widetilde{\operatorname{OP}}$ is referred to as 'example accuracy' [37], 'whole accuracy' [217], or 'edit success rate' in CMR [74].

Forgetting (F). Define $F_t$ as the forgetting up to task $t$, which captures, for each previously learned task, the largest performance drop observed throughout the training process, averaged over the tasks seen before stage $t$:

$$
F_t \triangleq \frac{1}{t-1} \sum_{i=1}^{t-1} \max_{j \in \{i, \dots, t-1\}} \left( P_{j,i} - P_{t,i} \right).
$$

Typically, researchers report the average forgetting $F = F_T$ at the end of the entire training process. Forgetting quantifies the impact of learning new tasks on previously acquired knowledge. Ideally, a robust continual learning framework should achieve Backward Transfer (BWT), where learning new tasks enhances performance on prior tasks. This enhancement is typically measured by negating the forgetting, so a negative forgetting value indicates an improvement in performance on earlier tasks. The concepts of Forgetting and Backward Transfer underpin various evaluation metrics, such as knowledge retention [100], performance on unchanged knowledge [95], average increased perplexity (AP+) [184], and the test and edit retention rates in CMR [74].

Forward Transfer (FWT). Forward Transfer measures the generalization ability of continual learning algorithms. Formally, the forward transfer $\operatorname{FWT}_t$ up to training stage $t$ is defined as

$$
\operatorname{FWT}_t \triangleq \frac{1}{T-t} \sum_{i=t+1}^{T} \left( P_{t,i} - b_i \right),
$$

where $b_i$ is the baseline performance of the model evaluated on task $i$ before undergoing continual learning. Strictly speaking, this definition of $b_i$ differs from that in previous work [136, 213], where it denotes the performance of a randomly initialized model. Additionally, we extend the notion of forward transfer in the vertical direction to represent the performance improvement on downstream tasks resulting from domain-adaptive pre-training (see Table 2). Forward Transfer is alternatively referred to as temporal generalization [100] or knowledge transfer [116] in some literature.

In the following, we introduce the evaluation protocols and datasets for continual LLMs.
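Given a fully populated performance matrix, the metrics above reduce to a few lines of arithmetic. Below is a minimal sketch in plain Python, assuming $\boldsymbol{P}$ is stored as a $T \times T$ nested list with `P[i][j]` the performance on task $j{+}1$ after stage $i{+}1$, and that higher values mean better performance; the function names are ours.

```python
def overall_performance(P, t):
    """OP_t: average performance over the first t tasks, measured after training stage t."""
    return sum(P[t - 1][i] for i in range(t)) / t

def forgetting(P, t):
    """F_t: for each earlier task, the largest drop from its best past performance
    to its performance after stage t, averaged over the t-1 previously seen tasks."""
    drops = [max(P[j][i] for j in range(i, t - 1)) - P[t - 1][i]
             for i in range(t - 1)]
    return sum(drops) / (t - 1)

def forward_transfer(P, t, b):
    """FWT_t: mean improvement over baseline performance b[i] on tasks
    the model has not been trained on yet (i > t)."""
    T = len(P)
    return sum(P[t - 1][i] - b[i] for i in range(t, T)) / (T - t)
```

Backward transfer is then simply the negated forgetting, so no separate function is needed.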

Continual LLMs' Evaluation Protocols

LAnguage Model Analysis (LAMA). LAnguage Model Analysis (LAMA) is an evaluation framework designed to probe the world knowledge embedded in language models [179]. It converts each world fact into a cloze statement, which is then input into the language model to predict the correct answer. LAMA has been extended for continual pre-training, particularly under temporal shifts [95, 96]. In CKL, three LAMA benchmarks are constructed for different dimensions: InvariantLAMA assesses knowledge retention on time-invariant facts, UpdatedLAMA focuses on knowledge updates, and NewLAMA evaluates knowledge acquisition [96].
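As an illustration of LAMA-style probing, a world fact can be turned into a cloze statement with a simple relation template, and a masked language model is then scored on how highly it ranks the gold object for the masked position. The templates below are illustrative stand-ins, not LAMA's actual template set.

```python
def to_cloze(subject, template, mask="[MASK]"):
    """Turn a (subject, relation-template) pair into a LAMA-style cloze
    statement; the model is asked to fill in the masked object slot."""
    return template.format(subject=subject, object=mask)

# Hypothetical relation templates in the spirit of LAMA.
templates = {
    "place_of_birth": "{subject} was born in {object}.",
    "capital_of": "{object} is the capital of {subject}.",
}

statement = to_cloze("Dante", templates["place_of_birth"])
# A masked language model would then be scored on ranking the gold
# object ("Florence") highly for the [MASK] position.
```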

Forgotten / (Updated + Acquired) Ratio (FUAR). As the performance of a pre-trained LLM is decomposed into a fine-grained set of behaviors in CKL [96], OP becomes too coarse a metric to accurately reflect the balance and trade-offs in the model's behavior. To address this issue, CKL proposes a joint evaluation metric, FUAR (Forgotten / (Updated + Acquired) Ratio), for continual pre-training. A FUAR value of 1 represents an equal trade-off between knowledge forgetting and knowledge learning: for each piece of updated or acquired knowledge, one piece of time-invariant knowledge is forgotten on average. A FUAR less than 1 suggests high learning efficacy, where more than one piece of knowledge is acquired at the expense of forgetting one piece of time-invariant knowledge.
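The ratio itself is straightforward to compute once the counts of forgotten, updated, and acquired knowledge pieces are measured; a minimal sketch with hypothetical counts (the function name and edge-case handling are ours):

```python
def fuar(num_forgotten, num_updated, num_acquired):
    """FUAR = Forgotten / (Updated + Acquired); values below 1 indicate that
    more knowledge is gained than time-invariant knowledge is lost."""
    gained = num_updated + num_acquired
    if gained == 0:
        # Degenerate case: nothing learned; any forgetting is infinitely bad.
        return float("inf") if num_forgotten > 0 else 0.0
    return num_forgotten / gained
```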

X-Delta. In TRACE [241], the authors propose a set of 'X-Delta' metrics for continual instruction tuning, quantifying the forward transfer on specific abilities of LLMs. Denote a set of $M$ datasets $\{X_1, X_2, \dots, X_M\}$ for ability X. The baseline performances of the pre-trained LLM evaluated on these datasets are denoted as $\{b_{X_1}, \dots, b_{X_M}\}$. The model undergoes continual fine-tuning on a different set of tasks, distinct from those used for evaluation. Throughout the sequential training process, the performance of the model after learning task $t$, evaluated on dataset $X_i$, is $R^X_{t,i}$. The X-Delta $\Delta R^X_t$ after learning task $t$ is defined as:

$$
\Delta R^X_t \triangleq \frac{1}{M} \sum_{i=1}^{M} \left( R^X_{t,i} - b_{X_i} \right).
$$

In the public TRACE benchmark, the authors construct three sets of evaluation tasks to benchmark the abilities of LLMs: general ability, instruction following, and safety [241].
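Under this definition, X-Delta is simply the mean gain over the pre-trained baselines across the $M$ held-out ability datasets; a minimal sketch (the function name is ours):

```python
def x_delta(R_t, baselines):
    """X-Delta after stage t: mean gain of the current model over the
    pre-trained baseline across the M held-out ability datasets."""
    M = len(R_t)
    return sum(r - b for r, b in zip(R_t, baselines)) / M
```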

NLG Score. In continual model alignment, three prominent metrics are used to evaluate different aspects of natural language generation (NLG): BLEU-4 [176], METEOR [12], and ROUGE-L [126]. BLEU-4 [176], designed for machine translation (MT), evaluates the precision of n-grams between the machine-generated and reference texts, focusing especially on four-word sequences to gauge fluency and adequacy. METEOR [12] also targets MT but aims to improve correlation with human judgment by considering synonyms and stemming, thus providing a more nuanced assessment of translation quality. ROUGE-L [126], on the other hand, is commonly applied in summarization tasks, measuring the longest common subsequence between the generated summary and a set of reference summaries, effectively capturing the recall of essential content. Each metric has its strengths and is tailored to specific kinds of language processing tasks, reflecting different dimensions of text generation quality.
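As a concrete reference point for ROUGE-L, a simplified version can be sketched from its definition: compute the longest common subsequence (LCS) between candidate and reference, then combine LCS-based precision and recall into an F-measure. This sketch uses naive whitespace tokenization and a single reference, and omits the multi-reference and stemming machinery of full implementations.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence between two token lists
    (standard dynamic programming, O(len(a) * len(b)))."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f(candidate, reference, beta=1.2):
    """Simplified ROUGE-L F-measure over whitespace tokens; beta > 1 weights
    recall more heavily, as is common in summarization settings."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)
```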

Datasets

The standard language modeling objective, minimized over a token sequence $\mathbf{x}$ of length $N$, is:

$$
\mathcal{L}_{\mathrm{LM}}(\mathbf{x}) \triangleq -\sum_{t=1}^{N} \log P( x_t \mid \mathbf{x}_{<t} ).
$$

At stage $t$ of continual learning over a sequence of domains, the overall loss decomposes into contributions from past and current domains:

$$
\mathcal{L}(h) \triangleq \underbrace{\sum_{i=1}^{t-1} \mathcal{L}_{\mathcal{D}_i}(h)}_{\text{past domains}} + \underbrace{\mathcal{L}_{\mathcal{D}_t}(h)}_{\text{current domain}}.
$$

Definition (Instruction Tuning, IT). Let $h(\mathbf{x})$ be a language model that takes as input data $\mathbf{x}$, typically consisting of natural language instructions or queries. Instruction Tuning (IT) is a specialized training approach designed to enhance the model's ability to accurately and effectively respond to specific instructions. The objective of IT is to refine $h$ by adjusting its parameters using a designated set of training examples $\mathcal{I} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$ drawn from the IT data distribution $\mathcal{D}_{\mathcal{I}}$, where $\mathbf{y}_i$ represents the desired output for $\mathbf{x}_i$. This set is curated to target specific tasks or functionalities that require improved performance. Formally, IT seeks an optimal refined hypothesis $h^*$ satisfying:

$$
h^* \triangleq \arg\min_{h'} \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \mathcal{D}_{\mathcal{I}}} \left[ -\log P(\mathbf{y} \mid \mathbf{x}, h') \right] \approx \arg\min_{h'} \sum_{i=1}^{N} -\log P(\mathbf{y}_i \mid \mathbf{x}_i, h').
$$

Definition (Model Refinement, MR). Suppose we have a model $h(\mathbf{x})$ taking data $\mathbf{x}$ (e.g., natural language queries) as inputs. Consider a size-$N$ editing set $\mathcal{E} = \{(\mathbf{x}_e, \mathbf{y}_e, \hat{\mathbf{y}}_e)\}_{e=1}^{N}$, where $\hat{\mathbf{y}}_e$ denotes the true label of $\mathbf{x}_e$, but the model incorrectly outputs $\mathbf{y}_e$ for $\mathbf{x}_e$. Model Refinement (MR) aims to efficiently update the model from $h$ to $h'$ such that it correctly predicts on the editing set $\mathcal{E}$, while preserving the original outputs outside $\mathcal{E}$. Formally, we aim to find $h'$ satisfying:

$$
h'(\mathbf{x}_0) = \begin{cases} \hat{\mathbf{y}}_0 & \text{if } (\mathbf{x}_0, \hat{\mathbf{y}}_0) \in \mathcal{E}, \\ h(\mathbf{x}_0) & \text{otherwise.} \end{cases}
$$

This is the basic problem setting of model editing.

Definition (Model Alignment, MA). Consider a model $g(\mathbf{x}; \phi)$, parameterized by $\phi$, designed to process inputs $\mathbf{x}$ in decision-making scenarios. Define an alignment dataset of size $M$ as $\{(\mathbf{x}_a, \mathbf{y}_a, \hat{\mathbf{y}}_a)\}_{a=1}^{M}$, where $\mathbf{y}_a$ represents the model's original decision for input $\mathbf{x}_a$, and $\hat{\mathbf{y}}_a$ denotes the aligned decision that adheres to specified ethical guidelines or desired outcomes. The objective is to modify $g$ into $g'$ such that for any $\mathbf{x}_a$ in the alignment dataset, $g'(\mathbf{x}_a; \phi')$ yields $\hat{\mathbf{y}}_a$, aligning the model's decisions with the alignment criteria. For all other inputs $\mathbf{x}_0 \notin \{\mathbf{x}_a\}$, the goal is to preserve the original behavior, ensuring $g(\mathbf{x}_0; \phi) = g'(\mathbf{x}_0; \phi')$.

Definition (Memory Constraint of Continual Learning). Suppose $T$ sets of observations $\{S_t \sim \mathcal{T}_t\}_{t=1}^{T}$ come in as a sequence, where $\{\mathcal{T}_t\}_{t=1}^{T}$ denotes the $T$ task distributions. At learning stage $t > 1$, the sets of observations $\{S_i\}_{i=1}^{t-1}$ are not accessible (strong constraint) or only partially accessible (relaxed constraint).

Definition (Task-Incremental Learning, TIL). Suppose $T$ task distributions $\{\mathcal{T}_t\}_{t=1}^{T}$ come in as a sequence, where $\mathcal{T}_t$ denotes the joint distribution over the $t$-th task's input and label spaces $(\mathcal{X}_t, \mathcal{Y}_t)$. Denote $\mathcal{X} \triangleq \bigcup_{t=1}^{T} \mathcal{X}_t$ and $\mathcal{Y} \triangleq \bigcup_{t=1}^{T} \mathcal{Y}_t$ as the unions of the input and label spaces, respectively. Under the memory constraint defined above, Task-Incremental Learning (TIL) aims to find the optimal hypothesis $h^*: \mathcal{X} \times [T] \rightarrow \mathcal{Y}$ satisfying:

$$
h^* = \arg\min_{h} \sum_{t=1}^{T} \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{T}_t} \left[ \mathbb{1}_{h(\mathbf{x}, t) \neq y} \right].
$$

Definition (Domain-Incremental Learning, DIL). Suppose $T$ domain distributions $\{\mathcal{D}_t\}_{t=1}^{T}$ come in as a sequence, where $\mathcal{D}_t$ denotes the $t$-th joint distribution over the shared input and label spaces $(\mathcal{X}, \mathcal{Y})$. Under the memory constraint defined above, Domain-Incremental Learning (DIL) aims to find the optimal hypothesis $h^*: \mathcal{X} \rightarrow \mathcal{Y}$ satisfying:

$$
h^* = \arg\min_{h} \sum_{t=1}^{T} \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}_t} \left[ \mathbb{1}_{h(\mathbf{x}) \neq y} \right].
$$

Remark. The task of Model Alignment (MA) is usually formulated under the same problem definition as IT, with an alignment dataset of size $M$ as $\mathcal{A} = \{(\mathbf{x}_a, \mathbf{y}_a, \hat{\mathbf{y}}_a)\}_{a=1}^{M}$, where $\mathbf{y}_a$ represents the model's original decision for input $\mathbf{x}_a$, and $\hat{\mathbf{y}}_a$ denotes the aligned decision that adheres to specified ethical guidelines or desired outcomes.

Remark. It remains an open problem to incorporate, into the optimization objectives of IT and MA, a constraint that prevents catastrophic forgetting of general knowledge and reduces the Alignment Tax [lin2024mitigating]. A simple extension of the model refinement constraint, $h'(\mathbf{x}_0) = h(\mathbf{x}_0), \forall (\mathbf{x}_0, \mathbf{y}_0) \notin \mathcal{A}$, might be too strong in this case, as we certainly want the preference represented by $\mathcal{A}$ to generalize to similar, though not identical, inputs.

Remark. In the early stages of CL, works mostly focused on the strong memory constraint [kirkpatrick2017overcoming, li2017learning, aljundi2018memory, lomonaco2020rehearsalfree]; as the research field progressed, more focus was put on relaxing the memory constraint to a small buffer for replay [rebuffi2017icarl, chaudhry2019tiny, buzzega2020dark, shi2024unified]; some modern CL works completely discard the memory constraint and focus instead on the computational budget [prabhu2023online, verwimp2024continual].

Remark. In TIL, it is common to have a shared input space $\mathcal{X} = \mathcal{X}_t, \forall t \in [T]$, while the label spaces $\mathcal{Y}_t$ can be distinct ($\mathcal{Y}_i \cap \mathcal{Y}_j = \emptyset, \forall i \neq j$), partially shared ($\mathcal{Y}_i \cap \mathcal{Y}_j \neq \emptyset, \exists i \neq j$), or fully shared across tasks ($\mathcal{Y} = \mathcal{Y}_t, \forall t \in [T]$). In DIL, the tasks are defined in the same format, i.e., the same input space $\mathcal{X}$ and the same output space $\mathcal{Y}$. During inference, no task IDs are provided to the hypothesis, so the continual learning model needs to capture the pattern between domain-invariant features and the labels. DIL is commonly perceived as more difficult than TIL, and CIL is commonly viewed as the most challenging continual learning scenario, as the model needs to infer the label and the task ID at the same time. Another possible formulation of CIL is to represent it as DIL with disjoint output label spaces, $\mathcal{Y}_i \cap \mathcal{Y}_j = \emptyset, \forall i \neq j$.

In this section, we provide a comprehensive review of the datasets available for benchmarking continual LLMs, as illustrated in Table 4. We intentionally exclude datasets used for domain-adaptive pre-training LLMs in vertical domains such as legal, medical, and financial, unless they are specifically designed for continual domain-adaptive pre-training. Furthermore, we omit datasets used in general continual fine-tuning, as they have already been extensively studied in existing works [17, 105].

Datasets for Continual Pre-Training (CPT) and Domain-Adaptive Pre-Training (DAP). Current research lacks a widely recognized benchmark for evaluating continual pre-training of LLMs under temporal shifts. TimeLMs utilizes a series of Twitter corpora collected until 2022, sequentially pre-training RoBERTa models quarterly [137]. CC-RecentNews, adopted as unlabeled pre-training data for LMs in CKL [96], consists of recent news and serves as a single-stage dataset. Additionally, CKL introduces InvariantLAMA, NewLAMA, and UpdatedLAMA to assess the principles of continual knowledge learning. TWiki, a dataset derived from Wikipedia articles between August and December 2021, is curated and cleaned in TemporalWiki [95]. This dataset facilitates the exploration of incremental learning by providing the Diffsets between neighboring snapshots.

For works that study content-level distributional shifts in CPT and DAP, researchers often resort to a similar set of publicly available datasets [134, 169, 270] to construct their own test beds for continual learning algorithms. The ∗DAPT dataset, developed by [72], comprises four domains: BioMed and Computer Science from S2ORC [134], News from [283], and Reviews from [78]. In ∗DAPT's original study, each domain undergoes an individual domain-adaptive pre-training stage to demonstrate the universality of DAP's effectiveness. Subsequent works, such as ELLE [184] and Recyclable Tuning [183], follow suit by employing these domains for multi-stage CPT. DEMix [71] presents another large-scale dataset, featuring eight semantic domains with over 73.8 billion tokens. Alongside the training set, it includes eight additional datasets for validating the generalization ability of LLMs. On a smaller scale, the ∗CPT [104] and ∗DAS [107] datasets consist of four and eight domains, respectively, with approximately 3.12 million examples and a size of 4.16GB each. These datasets are constructed similarly to the aforementioned ones.

Datasets for Continual Instruction Tuning. Measuring the effectiveness of CIT is crucial, particularly because traditional evaluation metrics may not be suitable for LLMs: many are overly simplistic and fail to comprehensively assess a model's ability to learn continually. New benchmarks and metrics are required to evaluate both the retention of old knowledge and the integration of new instructions. TRACE [241] stands as a continual learning benchmark designed specifically for LLMs, encompassing diverse tasks such as multilingual capabilities, code generation, and mathematical reasoning. CITB [292] represents another benchmark for CIT, incorporating both learning and evaluation protocols; it additionally demonstrates that replay generally yields the best performance among all compared methods. CoIN [32] extends the benchmark to MLLMs, incorporating a balanced and diverse set of instructions from vision-language datasets.

Datasets for Continual Model Refinement. Most datasets for continual model refinement fall into two categories [152]: fact checking and question answering. For fact checking, models are asked to verify the truthfulness of certain claims, typically modeled as a classification task. Key datasets include FEVER [228] (used by [47, 76]) and VitaminC [206] (used by [164]), both sourced from Wikipedia. For question answering, models are tasked with providing specific answers instead of choices. Zero-shot Relation Extraction (zsRE) [117] is the most widely employed dataset for this purpose [45, 74-76, 157, 158], alongside Natural Questions (NQ) [114] and T-REx [59]. [157] adapted zsRE with additional counterfactuals to create the more challenging CounterFact dataset, used by [45, 85, 280]. Beyond these two categories, SCOTUS [28] is also utilized [74] to assess continual model refinement via a document classification task that sorts U.S. Supreme Court cases into 11 topics.

Datasets for Continual Model Alignment. In the domain of reinforcement learning from human feedback (RLHF), several datasets are commonly employed across different studies to evaluate the adaptation and effectiveness of models under varying scenarios and continuous learning conditions. The IMDB [149] and HH-RLHF [11] datasets, as introduced in [286] within their study on continual learning through optimal policy fitting, leverage data gathered from interactive RL scenarios to model human preferences dynamically. Similarly, the Reddit TL;DR dataset [234], used by [286, 287], is focused on text summarization, providing a robust platform for testing the longevity and adaptability of learning algorithms under evolving conditions. Lastly, Common Sense QA [18, 41, 115], Reading Comprehension [56, 190], and Translation [19], utilized in [128], are selected to assess the challenges of aligning RL agents with human expectations without incurring significant performance penalties. Each of these datasets is pivotal in advancing the understanding of continual learning and the interplay between human feedback and machine learning adaptation.

Datasets for Continual Multimodal Large Language Models. Following LLaVA [131], many MLLMs adopt the instruction-tuning paradigm, which makes it possible to assess alignment with human intention and the preservation of knowledge for reasoning. Traditional tasks like image classification can thus be transformed into VQA tasks to evaluate abilities of MLLMs that are otherwise challenging to assess using conventional methods. Several benchmarks have been proposed to evaluate CL methods for MLLMs. MCIT [77] proposes the first continual instruction tuning benchmarks, Benchmark1 and Benchmark2. The difference between them is that Benchmark2 includes Multi-task Joint Instruction Tuning, which aims to explore whether multi-task joint instruction tuning improves the model's continual learning ability. [284] proposes EMT, the first classification evaluation framework to investigate catastrophic forgetting in MLLMs. [32] presents CoIN, a comprehensive benchmark spanning 8 task categories that evaluates MLLMs from two perspectives, Instruction Following and General Knowledge, which assess the alignment with human intention and the knowledge preserved for reasoning, respectively. [296] constructs two datasets, UPMC-Food101-CMML and MM-IMDb-CMML, to benchmark the novel CMML task, in which the data of certain modalities is missing during continual fine-tuning. UPMC-Food101-CMML contains 101 food categories with 61,142 training, 6,846 validation, and 22,716 test image-text pairs. MM-IMDb-CMML is a multi-label classification dataset across 27 distinct movie genres, consisting of 15,552 training, 2,608 validation, and 7,799 test image-text pairs.

MethodScenarioScenarioContinual Learning Tech.Continual Learning Tech.Continual Learning Tech.LLM Arch.EvaluationEvaluation
MethodDist. Shift#DomainsRehearsalParam. Reg.Arch. Exp.LLM Arch.Pre-TrainingDownstream
TimeLMs [137]Temporal8RoBERTa
[278]Content159RoBERTa GPT-2
[69]Content1Pythia
[63]Language3GPT
RHO-1 [130]Other1TinyLlama Mistral
[118]Language1P-Freeze ♣Adapter ♣ LoRA ♣Llama2
CKL [96]Temporal1Mix-Review ♣P-Freeze ♣ RecAdam ♣LoRA ♣ K-Adapter ♣T5
LLPT [100]Temporal4ER ♣ Logit-KD ♣ Rep-KD ♣ Contrast-KD ♣oEWC ♣Adapter ♣ Layer Exp. ♣RoBERTa
TemporalWiki [95]Temporal5SEED-KD Mix-Review ♣P-Freeze ♣ RecAdam ♣LoRA ♣ K-Adapter ♣GPT-2
CPT ∗ [104]Content4DER++ ♣ KD ♣CPT ✓ EWC ♣ HAT ♣Adapter ♣ DEMix ♣RoBERTa
ERNIE 2.0 [221]Content4ER ✓♣ERNIE
[5]Temporal7P-Freeze ✓Vocab. Exp. ✓BERT
[44]Content5Vocab. Exp. ✓BERT RoBERTa
DEMix [71]Content8MoE ✓GPT-3
TempoT5 [52]Temporal1Vocab. Exp. ✓ Prompt ✓T5
RecTuning [183]Content4ER ✓ KD ✓Adapter ✓RoBERTa
Lifelong-MoE [35]Content3ER ♣ KD ✓P-Freeze ✓ L2 ♣MoE ✓GLaM
ELLE [184]Content5ER ✓♣ KD ♣P-Freeze ✓Prompt Layer Exp. ✓ Adapter ♣BERT GPT
[94]Content Language2ER ✓GPT-NeoX
CEM [294]Other1ER ✓CuteGPT ChatGLM Qwen-Chat
IR-DRO [36]Other1ER ✓OPT
DomainMethodTrain Proc.LLM Arch.Continual Learning Tech.Continual Learning Tech.Continual Learning Tech.Continual Learning Eval.Continual Learning Eval.
DomainMethodTrain Proc.LLM Arch.RehearsalParam. Reg.Arch. Exp.Backward TransferForward Transfer
MedicalBioMedGPT [146]DAP → MM-SFTLlama2FT
FinancialBBT-Fin [138]DAPT5FT
FinancialCFGPT [121]DAP → SFTInternLMQ-LoRA (SFT)HE 1
ScientificAstroLlama [168]DAPLlaVaPerp.
ScientificOceanGPT [15]DAP → ITVicuna Llama2-chat ChatGLM2LoRA (IT)HE
ScientificK2 [48]DAP → SFTLlamaLoRA (SFT)Perp.ZSLLM
ScientificMarineGPT [301]MM-DAP → MM-ITLlamaHE
CodeCodeGen [172]DAP → DAPCodeGenPerp.ZS
CodeComment-Aug [218]IT → DAPLlama2 Code Llama InternLM2ZS
EventTemporalEcoNet [73] 1DAP → FTBERT RoBERTaFT
CommonSenseCALM [303]DAP → FTT5FT
Multi-DomainBLADE [120]DAP → ITBLOOMZZS
ScientificClimateGPT [229]DAP → IT → RAGLlama2FSRet.
Medical[68]DAP → FTLlama2FSFTFSFT
Financial[268]DAPPythiaLFSLFS
ScientificGeoGalactica [129]DAP → G-SFT → D-SFTGALZSPerp.ZSLLM
CodeStarCoder [226]DAPStarCoderPerp.ZSFSPerp.ZSFS
CodeDeepSeek-Coder [67]DAPDeepSeek-LLMZSFSZS
Multi-DomainDAPT [72]DAP → FTRoBERTaLossLFT
FinancialWeaverBird [271]DAPGLM2LoRAHE
CodeIRCoder [177]DAPStarCoder DeepSeek-Coder Code LlamaLoRAZS
CodeCode Llama [199]DAP → LC-FT → IT DAP → DAP → LC-FTLlama2ReplayPerp.ZS
LegalSaulLM [42]DAP → U-ITMistralReplayPerp.ZS
MedicalPMC-Llama [258]DAP → ITLlamaReplayZSFT
ScientificLlema [9]DAPCode LlamaReplayPerp.FS
Multi-DomainDAS [107][DAP] 𝑛RoBERTaDER++ ♣EWC HAT ♣ Soft-MaskingAdapter ♣ DEMix ♣FT
MedicalHippocrates [2]DAP → IT → MALlama2 MistralLoRAZSFS
LanguageSailor [55]DAPQwen1.5ReplayZS
Code & MathLlama Pro [257]DAP → U-SFTLlama2Block Exp. LoRA ♣ZSFSPerp.ZSFS
MedicalAF Adapter [272]DAP → FTRoBERTaLayer Exp. LoRA ♣Acc.LFT
Medical[197]DAP → FTBERT RoBERTa DistilBERTReplay ♣ GEM ♣L2 Reg. ♣ EWC ♣LFTLFT
MedicalHuatuoGPT-II [33]DAP + U-SFTBaichuan2ReplayZSZSHE
FinancialXuanYuan 2.0 [289]DAP + SFTBLOOMReplayHEHE
ScientificPLlama [274]DAP → ITGALReplayLLZS
E-CommerceEcomGPT-CT [148]DAP → SFTBLOOMReplayZSFSZSFS
LegalLayer Llama [91]DAP → G-IT → D-ITLlamaReplayZSZS
Multi-DomainAdaptLLM [39]DAPLlamaReplayZSZSFT
LanguageSwallow [60]DAPLlama2ReplayFSFS
Financial[222]DAPLlama2ReplayLossZSLossZSFSRAG
MedicalMe-Llama [265]DAP → ITLlama2ReplayZSFSZSFSFT
LanguageAurora-M [167]DAP → ITStarCoderReplayZSZSFSHE
Continual LearningContinual LearningContinual LearningContinual LearningContinual Learning Eval.Continual Learning Eval.Continual Learning Eval.
CFT TypeMethodX-ILLLM Arch.RehearsalParam. Reg.Arch. Exp.OthersAvg. Acc.Bwd. Trans.Fwd. Trans.
GeneralCTR [106]DILCILBERTAdapter
General[223]TILBERTS-Replay
GeneralCIRCLE [281]DILT5ReplayEWCPrompt
GeneralConPET [217]DILLlamaReplayLoRA
General[10]DILCILBERTG-Prompt
General[144]TILDistilBERT ALBERTRoBERTaERDERLwF
GeneralSEQ ∗ [299]TILCILPythiaBERTGPT2P-FreezeTricks for Classifiers
GeneralLFPT5 [182]DILT5P-Replay
General[254]DILRoBERTaGPT2ReplayEWCSIRWalk
GeneralLR ADJUST [255]DILXLM-RLR Scheduling
CITC3 [37]TILT5KDPrompt Tuning
CT0 [208]TILT0S-Replay
RCL [241]TILLLaMA VicunaBaichuanReplay
DynaInst [165]TILBARTReplay
CITB [292]TILT5ReplayAGEML2EWCAdapterCL
SSR [89]TILLLaMAAlpacaRandSelKMeansSel
KPIG [80]DILTILLLaMABaichuanDynaInstPCLLDCLL2 EWCDARE LM-CocktailKPIG
ConTinTin [279]TILBARTReplayInstructionSpeak
O-LoRA [240]TILLLaMAAlpacaO-LoRA
SAPT [297]TILT5LLaMASAPT
InsCL [243]TILLLaMAReplayInsCL
CMR [125]DILBARTERMIRMLRL2EWC
GRACE [74]DILT5BERTGPT2Adapter
WilKE [85]DILGPT2GPT-JAdapter
CMRLarimar [45]DILBERTGPT-JKanerva Memory
MELO [280]DILBERTGPT2T5LoRA
CME [123]DILBERTReplayInner-Prod. Reg.
WISE [238]DILGPT-JLlama2MistralSide Memory
COPF [286]TILDILLlamaReplayFunction Reg.Prompt
CMAAMA [128] [287]DILOpenLLaMA MistralReplayL1/L2LoRAAdaptive Model Avg.
CPPOTILGPT2WeightingPrompt
EProj [77]TILInstructBLIPTSIRProjector Exp.
Fwd-PromptTILInstructBLIPBLIP2Projector Exp.
CMLLMs[298]TILLLaVAMoELoRA
CMLLMsCoIN [32]TILInstructBLIP
CMLLMsModel Tailor [304]TILLLaVAModel Tailor
Table: Summary of the existing datasets and benchmarks for continual learning of LLMs, listing for each benchmark its name, CL type (CPT, DAP, CIT, CMA, or CMR), distribution shift (temporal or content), domain, number of stages, scale, data sources, and downstream applications. The benchmarks covered are TimeLMs [137], CC-RecentNews [96], TWiki [95], DAPT [72], CPT [104], DEMix [71], DAS [107], SuperNI [244], CITB [292], CoIN [32], TRACE [241], Natural-Instructions, IMDB [149], HH-RLHF [11], Reddit TL;DR [234], Common Sense QA [128], Reading Comprehension [128], Translation [128], FEVER [228], VitaminC [206], zsRE [117], T-rex [59], NQ [114], CounterFact [157], and SCOTUS [28].

Table: S4.T1: Summary of the existing studies on Horizontal Continual Pre-Training of LLMs. The papers are organized into three types: (i) no continual learning techniques are studied; (ii) continual learning techniques are studied only as baselines; and (iii) new approaches are proposed that incorporate some of the continual learning techniques. In the table, Dist. Shift denotes the type(s) of distributional shift a study considers and is dedicated to solving. Under Continual Learning Tech., we categorize three types of continual learning techniques studied in each paper: rehearsal (Rehearsal), parameter regularization (Param. Reg.), and architecture expansion (Arch. Exp.). We use "✓", "✗", and "♣" to denote "deployed in the proposed method", "not studied in the paper", and "studied as a baseline method", respectively, and "✓∗" to represent vocabulary expansion and replacement. Note that we do not include naive sequential fine-tuning in this table, as it is universally studied as an important baseline in all of the listed papers. Papers with only "♣" [127, 119, 120] study merely existing CL techniques, and papers with only "✗" [89, 82] study no CL techniques but special aspects of fine-tuning, e.g., model (re)warming via learning rate scheduling [89].
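Among the three technique families in the table, parameter regularization admits the most compact statement: EWC, one of the most frequently studied baselines above, adds a quadratic penalty that anchors parameters important to earlier tasks. Below is a minimal NumPy sketch of the penalty; the function and variable names are ours, not from any surveyed paper.

```python
import numpy as np

def ewc_penalty(theta, theta_old, fisher, lam=1.0):
    """EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2.

    theta:     current (flattened) model parameters
    theta_old: parameters snapshotted after the previous task
    fisher:    diagonal Fisher information estimate (importance per parameter)
    lam:       regularization strength
    """
    return 0.5 * lam * np.sum(fisher * (theta - theta_old) ** 2)

def regularized_loss(task_loss, theta, theta_old, fisher, lam=1.0):
    # Total objective on the new task: task loss plus the anchoring penalty.
    return task_loss + ewc_penalty(theta, theta_old, fisher, lam)
```

The penalty vanishes at the old parameters and grows quadratically with deviation, scaled by each parameter's estimated importance.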

MethodScenarioContinual Learning Tech.LLM Arch.Evaluation
Dist. Shift#DomainsRehearsalParam. Reg.Arch. Exp.Pre-TrainingDownstream
TimeLMs [175]Temporal8RoBERTa
[89]Content1Pythia
[82]Language3GPT
[149]Language1P-Freeze ♣Adapter ♣ LoRA ♣Llama2
CKL [120]Temporal1Mix-Review ♣P-Freeze ♣ RecAdam ♣LoRA ♣ K-Adapter ♣T5
LLPT [127]Temporal Content4 8ER ♣ Logit-KD ♣ Rep-KD ♣ Contrast-KD ♣ SEED-KD ♣oEWC ♣Adapter ♣ Layer Exp. ♣RoBERTa
TemporalWiki [119]Temporal5Mix-Review ♣P-Freeze ♣ RecAdam ♣LoRA ♣ K-Adapter ♣GPT-2
CPT [131]Content4DER++ ♣ KD ♣CPT ✓ EWC ♣ HAT ♣Adapter ♣ DEMix ♣RoBERTa
ERNIE 2.0 [273]Content4ER ✓♣ERNIE
[6]Temporal7P-Freeze ✓Vocab. Exp. ✓BERT
[55]Content5Vocab. Exp. ✓BERT RoBERTa
DEMix [91]Content8MoE ✓GPT-3
TempoT5 [68]Temporal1Vocab. Exp. ✓ Prompt ✓T5
RecTuning [231]Content4ER ✓ KD ✓Adapter ✓RoBERTa
Lifelong-MoE [46]Content3ER ♣ KD ✓P-Freeze ✓ L2 ♣MoE ✓GLaM
ELLE [232]Content5ER ✓♣ KD ♣P-Freeze ✓Prompt ✓ Layer Exp. ✓ Adapter ♣BERT GPT

Table: S4.T2: Summary of the existing studies that leverage Domain-Adaptive Pre-Training of LLMs. The papers are organized into four main categories based on whether they (i) adopt continual learning techniques and (ii) evaluate backward transfer (forgetting). In the column Train Proc. (Training Process), we omit the phase of general Pre-Training. DAP represents Domain-Adaptive Pre-Training; SFT represents Supervised Fine-Tuning; IT represents Instruction Tuning. The prefixes G- and D- represent General and Domain-Specific training processes [166, 115], and the prefix U- represents the two unified [310, 42]. The prefixes MM- and LC- represent Multi-Modal and Long-Context training phases [185, 367, 245]. In the column Continual Learning Eval., we consider two criteria: (i) Backward Transfer, i.e., performance degradation on previous tasks, also known as catastrophic forgetting; (ii) Forward Transfer, i.e., the performance gained through DAP when transferring the LLMs to downstream tasks. We use L and Perp. to denote Loss and Perplexity, FT to denote Fine-Tuning, ZS and FS to denote Zero-Shot and Few-Shot Accuracy, and HE and LLM to denote Human Evaluation and LLM Evaluation for generative tasks. Among the 33 papers in this table that adopt DAP during development, nearly 65% (22/33) explicitly study the influence of DAP from a continual learning perspective: they either evaluate the degree of forgetting or adopt continual learning techniques to prevent forgetting of general knowledge. However, there is a significant lack of diversity in the continual learning techniques adopted in these works (only Replay and LoRA), which calls for further study of the efficacy of vertical continual learning in the realm of LLMs.
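As the caption observes, LoRA is one of only two continual learning techniques widely adopted in DAP work. Its core mechanism is small enough to sketch: the pre-trained weight is frozen and a low-rank additive update, scaled by alpha/r, is learned on top. The NumPy class below is an illustrative sketch, not an actual library API.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update: y = Wx + (alpha/r) * B A x."""

    def __init__(self, W, r=4, alpha=8, seed=0):
        d_out, d_in = W.shape
        rng = np.random.default_rng(seed)
        self.W = W                               # frozen pre-trained weight
        self.A = rng.normal(0, 0.01, (r, d_in))  # trainable down-projection
        self.B = np.zeros((d_out, r))            # trainable up-projection, zero-init
        self.scale = alpha / r

    def __call__(self, x):
        # At initialization B = 0, so the output equals the base model's output.
        return self.W @ x + self.scale * (self.B @ (self.A @ x))
```

Because B is zero-initialized, the layer reproduces the base model exactly at the start of adaptation, and only the small A and B matrices need to be stored per domain.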

DomainMethodTrain Proc.LLM Arch.Continual Learning Tech.Continual Learning Eval.
RehearsalParam. Reg.Arch. Exp.Backward TransferForward Transfer
MedicalBioMedGPT [185]DAP → MM-SFTLlama2FT
FinancialBBT-Fin [177]DAPT5FT
FinancialCFGPT [152]DAP → SFTInternLMQ-LoRA (SFT)HE 1
ScientificAstroLlama [209]DAPLlaVaPerp.
ScientificOceanGPT [20]DAP → ITVicuna Llama2-chat ChatGLM2LoRA (IT)HE
ScientificK2 [63]DAP → SFTLlamaLoRA (SFT)Perp. | ZS | LLM
ScientificMarineGPT [367]MM-DAP → MM-ITLlamaHE
CodeCodeGen [215]DAP → DAPCodeGenPerp. | ZS
CodeComment-Aug [269]IT → DAPLlama2 Code Llama InternLM2ZS
EventTemporalEcoNet [93] 1DAP → FTBERT RoBERTaFT
CommonSenseCALM [369]DAP → FTT5FT
Medical[88]DAP → FTLlama2FS | FTFS | FT
Financial[323]DAPPythiaL | FSL | FS
ScientificGeoGalactica [166]DAP → G-SFT → D-SFTGALZSPerp. | ZS | LLM
CodeStarCoder [158]DAPStarCoderPerp. | ZS | FSPerp. | ZS | FS
CodeDeepSeek-Coder [87]DAPDS-LLMZS | FSZS
Multi-DomainDAPT [92]DAP → FTRoBERTaLossL | FT
FinancialWeaverBird [326]DAPGLM2LoRAHE
CodeIRCoder [221]DAPStarCoder DS-Coder Code LlamaLoRAZS
CodeCode Llama [245]DAP → LC-FT → IT DAP → DAP → LC-FTLlama2ReplayPerp. | ZS
LegalSaulLM [52]DAP → U-ITMistralReplayPerp. | ZS
MedicalPMC-Llama [311]DAP → ITLlamaReplayZS | FT
ScientificLlema [10]DAPCode LlamaReplayPerp. | FS
Multi-DomainDAS [134][DAP]^nRoBERTaDER++ ♣EWC ♣ HAT ♣ Soft-MaskingAdapter ♣ DEMix ♣FT
Code & MathLlama Pro [310]DAP → U-SFTLlama2Block Exp. LoRA ♣ZS | FSPerp. | ZS | FS
MedicalAF Adapter [327]DAP → FTRoBERTaLayer Exp. LoRA ♣Acc.L | FT
Medical[243]DAP → FTBERT RoBERTa DistilBERTReplay ♣ GEM ♣L2 Reg. ♣ EWC ♣L | FTL | FT
MedicalHuatuoGPT-II [42]DAP + U-SFTBaichuan2ReplayZSZS | HE
FinancialXuanYuan 2.0 [355]DAP + SFTBLOOMReplayHEHE
ScientificPLlama [331]DAP → ITGALReplayLL | ZS
E-CommerceEcomGPT-CT [187]DAP → SFTBLOOMReplayZS | FSZS | FS
LegalLayer Llama [115]DAP → G-IT → D-ITLlamaReplayZSZS
Multi-DomainAdaptLLM [49]DAPLlamaReplayZSZS | FT

Table: S4.T3: Summary of the existing studies on Continual Fine-Tuning of LLMs. The papers are organized into five main categories based on the downstream tasks they are designed to tackle, shown in the column CFT Type: (i) General Continual Fine-Tuning (CFT); (ii) Continual Instruction Tuning (CIT); (iii) Continual Model Refinement (CMR); (iv) Continual Model Alignment (CMA); and (v) Continual Multimodal LLMs (CMLLMs). The column X-IL shows which continual learning paradigm the study covers [282]: TIL represents task-incremental learning, where the task ID/information is provided during inference; DIL represents domain-incremental learning, where the tasks share the same format and no task ID/information is available during inference; CIL represents class-incremental learning, where the task ID needs to be further inferred at test time. Among the 34 papers shown in the table, 100% (34/34) explicitly deploy continual learning techniques to address the challenges of CFT. Furthermore, 30% (10/34) develop their own new techniques that cannot be easily categorized into the three mainstream families of continual learning algorithms.
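The three evaluation columns (Avg. Acc., Bwd. Trans., Fwd. Trans.) are conventionally computed from an accuracy matrix R, where R[i, j] is the accuracy on task j after training on task i. The sketch below follows the standard GEM-style formulas; it illustrates the common convention rather than any single surveyed paper's exact protocol.

```python
import numpy as np

def cl_metrics(R, b=None):
    """Continual learning metrics from an accuracy matrix.

    R: (T, T) array, R[i, j] = accuracy on task j after training on task i.
    b: optional length-T array of accuracies of a randomly initialized model,
       needed for forward transfer.
    """
    T = R.shape[0]
    avg_acc = float(R[-1].mean())  # average accuracy after the final task
    # Backward transfer: change on earlier tasks after all training is done.
    bwt = float(np.mean([R[-1, j] - R[j, j] for j in range(T - 1)]))
    fwt = None
    if b is not None:
        # Forward transfer: accuracy on a task just before training on it,
        # relative to an untrained model.
        fwt = float(np.mean([R[j - 1, j] - b[j] for j in range(1, T)]))
    return avg_acc, bwt, fwt
```

Negative backward transfer quantifies catastrophic forgetting; positive forward transfer indicates that earlier training already helps later tasks before they are seen.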

CFT TypeMethodX-ILLLM Arch.Continual Learning Tech.Continual Learning Eval.
RehearsalParam. Reg.Arch. Exp.OthersAvg. Acc.Bwd. Trans.Fwd. Trans.
General Continual Fine-TuningCTR [133]DIL CILBERTAdapter
[274]TILBERTS-Replay
CIRCLE [342]DILT5ReplayEWCPrompt
ConPET [268]DILLlamaReplayLoRA
[12]DIL CILBERTG-Prompt
[183]TILDistilBERT ALBERT RoBERTaER DER LwF
SEQ∗ [365]TIL CILBERT Pythia GPT2P-FreezeTricks for Classifiers
LFPT5 [230]DILT5P-Replay
[306]DILRoBERTa GPT2ReplayEWC SI RWalk
LR ADJUST [307]DILXLM-RLR Scheduling
C3 [47]TILT5KDPrompt Tuning
CT0 [255]TILT0S-Replay
RCL [292]TILLLaMA Vicuna BaichuanReplay
DynaInst [205]TILBARTReplay
CITB [358]TILT5Replay AGEML2 EWCAdapterCL
SSR [113]TILLLaMA AlpacaRandSel KMeansSel
KPIG [103]DIL TILLLaMA BaichuanDynaInst PCLL DCLL2 EWCDARE LM-CocktailKPIG
ConTinTin [337]TILBARTReplayInstructionSpeak
O-LoRA [291]TILLLaMA AlpacaO-LoRA
Continual Instruction TuningSAPT [363]TILT5 LLaMASAPT
InsCL [294]TILLLaMAReplayInsCL
CMR [162]DILBARTER MIR MLRL2 EWC
GRACE [96]DILT5 BERT GPT2Adapter
WilKE [108]DILGPT2 GPT-JAdapter
Larimar [58]DILBERT GPT-JKanerva Memory
MELO [339]DILBERT GPT2 T5LoRA
Continual Model RefinementCME [156]DILBERTReplayInner-Prod. Reg.
COPF [351]TIL DILLLaMAReplayFunction Reg.Prompt
AMA [165]DILOpenLLaMA MistralReplayL1/L2LoRAAdaptive Model Avg.
Continual Model AlignmentCPPO [352]TILGPT2Weighting StrategyPrompt
EProj [100]TILInstructBLIPTSIRProjector Exp.
Fwd-Prompt [364]TILInstructBLIP BLIP2Projector Exp.
CoIN [41]TILLLaVAMoE LoRA
Model Tailor [371]TILInstructBLIP LLaVA-1.5Model Tailor
Continually Fine-tuning Multimodal LLMsRebQ [362]TILViLTPrompt Tuning

$$ p(x_{t}|{\bm{x}}_{<t})=\sum_{j=1}^{n}p(x_{t}|{\bm{x}}_{<t},D_{t}=j)\cdot\left[\tfrac{p({\bm{x}}_{<t}|D_{t}=j)\cdot p(D_{t}=j)}{\sum_{j^{\prime}=1}^{n}p({\bm{x}}_{<t}|D_{t}=j^{\prime})\cdot p(D_{t}=j^{\prime})}\right] $$
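This decomposition, used by domain-expert architectures in the DEMix style, weights each expert's next-token distribution by the posterior over domains given the prefix. A minimal NumPy sketch (function name and array layout are ours):

```python
import numpy as np

def mixture_next_token(expert_probs, prefix_likelihoods, prior):
    """p(x_t | x_<t) = sum_j p(x_t | x_<t, D_t = j) * p(D_t = j | x_<t).

    expert_probs:       (n, V) next-token distributions, one row per domain expert
    prefix_likelihoods: (n,)   p(x_<t | D_t = j) for each expert
    prior:              (n,)   p(D_t = j)
    """
    w = prefix_likelihoods * prior
    posterior = w / w.sum()          # p(D_t = j | x_<t) via Bayes' rule
    return posterior @ expert_probs  # mixture distribution over the vocabulary
```

With a uniform prior and equal prefix likelihoods this reduces to a plain average of the experts; otherwise the prefix acts as a soft domain router.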

Definition 2.1 (Instruction Tuning, IT). Let $h({\bm{x}})$ be a language model that takes as input data ${\bm{x}}$, typically consisting of natural language instructions or queries. Instruction Tuning (IT) is a specialized training approach designed to enhance the model's ability to accurately and effectively respond to specific instructions. The objective of IT is to refine $h$ by adjusting its parameters using a designated set of training examples ${\mathcal{I}}=\{({\bm{x}}_{i},\widehat{{\bm{y}}}_{i})\}_{i=1}^{N}$, where $\widehat{{\bm{y}}}_{i}$ represents the desired output for ${\bm{x}}_{i}$. This set is curated to target specific tasks or functionalities that require improved performance. Formally, the updated model $h^{\prime}$ is defined as follows:
$$ h^{\prime}({\bm{x}}_{0})=\widehat{{\bm{y}}}_{0},\quad\forall({\bm{x}}_{0},\widehat{{\bm{y}}}_{0})\in{\mathcal{I}}. \quad (3) $$

Definition 2.2 (Model Refinement, MR). Suppose we have a model $h({\bm{x}})$ taking data ${\bm{x}}$ (e.g., natural language queries) as inputs. Consider a size-$N$ editing set ${\mathcal{E}}=\{({\bm{x}}_{e},{\bm{y}}_{e},\widehat{{\bm{y}}}_{e})\}_{e=1}^{N}$, where $\widehat{{\bm{y}}}_{e}$ denotes the true label of ${\bm{x}}_{e}$, but the model incorrectly outputs ${\bm{y}}_{e}$ for ${\bm{x}}_{e}$. Model Refinement (MR) aims to efficiently update the model from $h$ to $h^{\prime}$ such that it correctly predicts on the editing set ${\mathcal{E}}$, while preserving the original outputs outside ${\mathcal{E}}$. Formally, we aim to find $h^{\prime}$ satisfying
$$ h^{\prime}({\bm{x}}_{0})=\begin{cases}\widehat{{\bm{y}}}_{0}&\text{if }({\bm{x}}_{0},\widehat{{\bm{y}}}_{0})\in{\mathcal{E}},\\ h({\bm{x}}_{0})&\text{otherwise.}\end{cases} \quad (4) $$
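Eqn. 4 can be read as a lookup wrapper around the base model: answer from the edit memory when the input is in the editing set, otherwise defer to $h$. The sketch below illustrates only this functional specification; real refinement methods update parameters or learned memories rather than a literal dictionary, and all names here are ours.

```python
class RefinedModel:
    """Functional reading of Eqn. 4: return the corrected label for inputs in
    the editing set, and defer to the base model everywhere else."""

    def __init__(self, base_model, edits):
        self.base = base_model
        # edits: iterable of (x_e, y_e, y_hat_e); only x_e -> y_hat_e is needed.
        self.memory = {x_e: y_hat_e for x_e, _, y_hat_e in edits}

    def __call__(self, x):
        return self.memory[x] if x in self.memory else self.base(x)
```

Practical refinement methods implement this behavior with learned components so that edits also generalize to paraphrases of the edited inputs, which a literal dictionary cannot do.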

Remark. It remains an open problem to include, in the optimization objective of MA, the constraint of preventing catastrophic forgetting of general knowledge for IT and reducing the Alignment Tax [165]. A simple extension of the model refinement constraint in Eqn. 4, $h^{\prime}({\bm{x}}_{0})=h({\bm{x}}_{0}),\forall({\bm{x}}_{0},\widehat{{\bm{y}}}_{0})\notin{\mathcal{A}}$, might be too strong in this case, as we certainly want the preference represented by ${\mathcal{A}}$ to generalize to similar, though not identical, inputs.

Definition 2.4 (Memory Constraint of Continual Learning). Suppose $T$ sets of observations $\{S_{t}\sim{\mathcal{T}}_{t}\}_{t=1}^{T}$ come in as a sequence, where $\{{\mathcal{T}}_{t}\}_{t=1}^{T}$ denotes the $T$ task distributions. At the learning stage of $t>1$, the previous sets of observations $\{S_{i}\}_{i=1}^{t-1}$ are either not accessible (strong) or partially accessible (relaxed).

Remark. In the early stages of continual learning research, works mostly focused on the strong memory constraint [140, 161, 4, 173]; as the field progressed, more focus was put on relaxing the memory constraint to a small buffer for replay [239, 38, 29, 260]; some modern continual learning works consider the scenario where this constraint is completely discarded but a constraint on the computational budget is present [31, 228, 283].
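The relaxed constraint is usually realized as a small fixed-capacity replay buffer; reservoir sampling is a common way to maintain a uniform random sample of the stream without storing it all. A minimal sketch (class name is ours):

```python
import random

class ReservoirBuffer:
    """Fixed-capacity buffer holding a uniform random sample of a data stream."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.data = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(item)
        else:
            # Replace a stored item with probability capacity / seen,
            # keeping the sample uniform over everything seen so far.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = item

    def sample(self, k):
        return self.rng.sample(self.data, min(k, len(self.data)))
```

During training on task t, mini-batches from the current task are typically mixed with batches drawn via `sample` to approximate the joint distribution over past tasks.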

Definition 2.5 (Task-Incremental Learning, TIL). Suppose $T$ task distributions $\{{\mathcal{T}}_{t}\}_{t=1}^{T}$ come in as a sequence, where ${\mathcal{T}}_{t}$ denotes the joint distribution over the $t$-th task's input space and label space $({\mathcal{X}}_{t},{\mathcal{Y}}_{t})$. Denote ${\mathcal{X}}\triangleq\bigcup_{t=1}^{T}{\mathcal{X}}_{t}$ and ${\mathcal{Y}}\triangleq\bigcup_{t=1}^{T}{\mathcal{Y}}_{t}$ as the unions of the input and label spaces, respectively. Under the memory constraint defined in Definition 2.4, Task-Incremental Learning (TIL) aims to find the optimal hypothesis $h^{*}:{\mathcal{X}}\times[T]\rightarrow{\mathcal{Y}}$ that satisfies:
$$ h^{*}=\arg\min_{h}\sum_{t=1}^{T}\mathbb{E}_{({\bm{x}},y)\sim{\mathcal{T}}_{t}}\left[\mathbbm{1}_{h({\bm{x}},t)\neq y}\right]. \quad (6) $$

Remark. In TIL, it is common to have a shared input space ${\mathcal{X}}={\mathcal{X}}_{t},\forall t\in[T]$, but the label spaces ${\mathcal{Y}}_{t}$ can be distinct (${\mathcal{Y}}_{i}\cap{\mathcal{Y}}_{j}=\emptyset,\forall i\neq j$), partially shared (${\mathcal{Y}}_{i}\cap{\mathcal{Y}}_{j}\neq\emptyset,\exists i\neq j$), or fully shared across tasks (${\mathcal{Y}}={\mathcal{Y}}_{t},\forall t\in[T]$). In DIL, the tasks are defined in the same format, i.e., the same input space ${\mathcal{X}}$ and the same output space ${\mathcal{Y}}$. During inference, no task IDs are provided to the hypothesis, which means the continual learning model needs to capture the pattern between domain-invariant features and the labels. DIL is commonly perceived as more difficult than TIL. CIL is commonly viewed as the most challenging continual learning scenario, as the model needs to infer the label and the task ID at the same time. Another possible formulation of CIL is to represent it as DIL where the output label spaces are disjoint, ${\mathcal{Y}}_{i}\cap{\mathcal{Y}}_{j}=\emptyset,\forall i\neq j$.
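The practical difference between the scenarios shows up at inference time: TIL can route by the provided task ID, whereas DIL (and CIL) must answer with a single shared interface. A toy sketch of the two interfaces (all names are illustrative):

```python
class TILModel:
    """Task-incremental inference: the provided task ID t selects a
    task-specific head on top of a shared encoder."""

    def __init__(self, encoder, heads):
        self.encoder = encoder  # shared feature extractor
        self.heads = heads      # list of per-task classifiers

    def predict(self, x, t):
        return self.heads[t](self.encoder(x))

class DILModel:
    """Domain-incremental inference: one shared head, no task ID available,
    so the model must rely on domain-invariant features."""

    def __init__(self, encoder, head):
        self.encoder = encoder
        self.head = head

    def predict(self, x):
        return self.head(self.encoder(x))
```

CIL can be seen as the DIL interface with the harder requirement that the shared head also implicitly identify which task's label space the input belongs to.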

References

[AMSTrans = "American Mathematical Society Translations" } @String{AMSTrans = "Amer. Math. Soc. Transl." } @String{BullAMS = "Bulletin of the American Mathematical Society" } @String{BullAMS = "Bull. Amer. Math. Soc." } @String{ProcAMS = "Proceedings of the American Mathematical Society" } @String{ProcAMS = "Proc. Amer. Math. Soc." } @String{TransAMS = "Transactions of the American Mathematical Society" } @String{TransAMS = "Trans. Amer. Math. Soc." }

%ACM @String{CACM = "Communications of the {ACM}" } @String{CACM = "Commun. {ACM}" } @String{CompServ = "Comput. Surveys" } @String{JACM = "J. ACM" } @String{ACMMathSoft = "{ACM} Transactions on Mathematical Software" } @String{ACMMathSoft = "{ACM} Trans. Math. Software" } @String{SIGNUM = "{ACM} {SIGNUM} Newsletter" } @String{SIGNUM = "{ACM} {SIGNUM} Newslett." }

@String{AmerSocio = "American Journal of Sociology" } @String{AmerStatAssoc = "Journal of the American Statistical Association" } @String{AmerStatAssoc = "J. Amer. Statist. Assoc." } @String{ApplMathComp = "Applied Mathematics and Computation" } @String{ApplMathComp = "Appl. Math. Comput." } @String{AmerMathMonthly = "American Mathematical Monthly" } @String{AmerMathMonthly = "Amer. Math. Monthly" } @String{BIT = "{BIT}" } @String{BritStatPsych = "British Journal of Mathematical and Statistical Psychology" } @String{BritStatPsych = "Brit. J. Math. Statist. Psych." } @String{CanMathBull = "Canadian Mathematical Bulletin" } @String{CanMathBull = "Canad. Math. Bull." } @String{CompApplMath = "Journal of Computational and Applied Mathematics" } @String{CompApplMath = "J. Comput. Appl. Math." } @String{CompPhys = "Journal of Computational Physics" } @String{CompPhys = "J. Comput. Phys." } @String{CompStruct = "Computers and Structures" } @String{CompStruct = "Comput. & Structures" } @String{CompJour = "The Computer Journal" } @String{CompJour = "Comput. J." } @String{CompSysSci = "Journal of Computer and System Sciences" } @String{CompSysSci = "J. Comput. System Sci." } @String{Computing = "Computing" } @String{ContempMath = "Contemporary Mathematics" } @String{ContempMath = "Contemp. Math." } @String{Crelle = "Crelle's Journal" } @String{GiornaleMath = "Giornale di Mathematiche" } @String{GiornaleMath = "Giorn. Mat." } % didn't find in AMS MR.] Zheda Mai, Ruiwen Li, Jihwan Jeong, David Quispe, Hyunwoo Kim, Scott Sanner. (2022). Online continual learning in image classification: An empirical survey. Neurocomputing. doi:https://doi.org/10.1016/j.neucom.2021.10.021.

[de2021continual] De Lange, Matthias, Aljundi, Rahaf, Masana, Marc, Parisot, Sarah, Jia, Xu, Leonardis, Ale{\v{s. (2021). A continual learning survey: Defying forgetting in classification tasks. IEEE transactions on pattern analysis and machine intelligence.

[kandel2000principles] Kandel, Eric R, Schwartz, James H, Jessell, Thomas M, Siegelbaum, Steven, Hudspeth, A James, Mack, Sarah, others. (2000). Principles of neural science.

[chen2018lifelong] Chen, Zhiyuan, Liu, Bing. Lifelong machine learning.

[wu2024continual] Tongtong Wu, Linhao Luo, Yuan-Fang Li, Shirui Pan, Thuy-Trang Vu, Gholamreza Haffari. (2024). Continual Learning for Large Language Models: A Survey.

[xu2024survey] Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, Tianyi Zhou. (2024). A Survey on Knowledge Distillation of Large Language Models.

[wang2024comprehensive] Wang, Liyuan, Zhang, Xingxing, Su, Hang, Zhu, Jun. (2024). A Comprehensive Survey of Continual Learning: Theory, Method and Application. IEEE Transactions on Pattern Analysis and Machine Intelligence. doi:10.1109/TPAMI.2024.3367329.

[pentina2016theoretical] Pentina, Anastasia. (2016). Theoretical foundations of multi-task lifelong learning.

[biesialska2020continual] Biesialska, Magdalena, Biesialska, Katarzyna, Costa-juss{`a. (2020). Continual Lifelong Learning in Natural Language Processing: A Survey. Proceedings of the 28th International Conference on Computational Linguistics. doi:10.18653/v1/2020.coling-main.574.

[ke2023continual] Zixuan Ke, Bing Liu. (2023). Continual Learning of Natural Language Processing Tasks: A Survey.

[van2022three] Van de Ven, Gido M, Tuytelaars, Tinne, Tolias, Andreas S. (2022). Three types of incremental learning. Nature Machine Intelligence.

[kim2022theoretical] Kim, Gyuhak, Xiao, Changnan, Konishi, Tatsuya, Ke, Zixuan, Liu, Bing. (2022). A Theoretical Study on Solving Continual Learning. Advances in Neural Information Processing Systems.

[mcclelland1995there] McClelland, James L, McNaughton, Bruce L, O'Reilly, Randall C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory.. Psychological review.

[yang2009stably] Yang, Guang, Pan, Feng, Gan, Wen-Biao. (2009). Stably maintained dendritic spines are associated with lifelong memories. Nature.

[pallier2003brain] Pallier, Christophe, Dehaene, Stanislas, Poline, J-B, LeBihan, Denis, Argenti, A-M, Dupoux, Emmanuel, Mehler, Jacques. (2003). Brain imaging of language plasticity in adopted adults: Can a second language replace the first?. Cerebral cortex.

[olafsdottir2018role] {'O. (2018). The role of hippocampal replay in memory and planning. Current Biology.

[liu2019human] Liu, Yunzhe, Dolan, Raymond J, Kurth-Nelson, Zeb, Behrens, Timothy EJ. (2019). Human replay spontaneously reorganizes experience. Cell.

[constantinescu2016organizing] Constantinescu, Alexandra O, O’Reilly, Jill X, Behrens, Timothy EJ. (2016). Organizing conceptual knowledge in humans with a gridlike code. Science.

[mccaffary2021towards] McCaffary, David. (2021). Towards continual task learning in artificial neural networks: current approaches and insights from neuroscience. arXiv preprint arXiv:2112.14146.

[radford2021learning] Radford, Alec, Kim, Jong Wook, Hallacy, Chris, Ramesh, Aditya, Goh, Gabriel, Agarwal, Sandhini, Sastry, Girish, Askell, Amanda, Mishkin, Pamela, Clark, Jack, others. (2021). Learning transferable visual models from natural language supervision. International conference on machine learning.

[zheng2023preventing] Zheng, Zangwei, Ma, Mingyuan, Wang, Kai, Qin, Ziheng, Yue, Xiangyu, You, Yang. (2023). Preventing zero-shot transfer degradation in continual learning of vision-language models. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[wu2024building] Wu, Zuxuan, Weng, Zejia, Peng, Wujian, Yang, Xitong, Li, Ang, Davis, Larry S, Jiang, Yu-Gang. (2024). Building an open-vocabulary video CLIP model with better architectures, optimization and data. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[khan2021personalizing] Khan, Mina, Srivatsa, P, Rane, Advait, Chenniappa, Shriram, Hazariwala, Asadali, Maes, Pattie. (2021). Personalizing pre-trained models. arXiv preprint arXiv:2106.01499.

[liu2023class] Liu, Xialei, Cao, Xusheng, Lu, Haori, Xiao, Jia-wen, Bagdanov, Andrew D, Cheng, Ming-Ming. (2023). Class Incremental Learning with Pre-trained Vision-Language Models. arXiv preprint arXiv:2310.20348.

[jha2024clap4clip] Jha, Saurav, Gong, Dong, Yao, Lina. (2024). CLAP4CLIP: Continual Learning with Probabilistic Finetuning for Vision-Language Models. arXiv preprint arXiv:2403.19137.

[li2024coleclip] Li, Yukun, Pang, Guansong, Suo, Wei, Jing, Chenchen, Xi, Yuling, Liu, Lingqiao, Chen, Hao, Liang, Guoqiang, Wang, Peng. (2024). CoLeCLIP: Open-Domain Continual Learning via Joint Task Prompt and Vocabulary Learning. arXiv preprint arXiv:2403.10245.

[ebrahimi2019uncertainty] Ebrahimi, Sayna, Elhoseiny, Mohamed, Darrell, Trevor, Rohrbach, Marcus. (2019). Uncertainty-guided continual learning with bayesian neural networks. arXiv preprint arXiv:1906.02425.

[ebrahimi2020adversarial] Ebrahimi, Sayna, Meier, Franziska, Calandra, Roberto, Darrell, Trevor, Rohrbach, Marcus. (2020). Adversarial continual learning. Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XI 16.

[wang2022dualprompt] Wang, Zifeng, Zhang, Zizhao, Ebrahimi, Sayna, Sun, Ruoxi, Zhang, Han, Lee, Chen-Yu, Ren, Xiaoqi, Su, Guolong, Perot, Vincent, Dy, Jennifer, others. (2022). DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning. European Conference on Computer Vision.

[wang2022learning] Wang, Zifeng, Zhang, Zizhao, Lee, Chen-Yu, Zhang, Han, Sun, Ruoxi, Ren, Xiaoqi, Su, Guolong, Perot, Vincent, Dy, Jennifer, Pfister, Tomas. (2022). Learning to prompt for continual learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[jiang2024empowering] Yushan Jiang, Zijie Pan, Xikun Zhang, Sahil Garg, Anderson Schneider, Yuriy Nevmyvaka, Dongjin Song. (2024). Empowering Time Series Analysis with Large Language Models: A Survey.

[garg2023in] Garg, Sahil, Dutta, Sanghamitra, Dalirrooyfard, Mina, Schneider, Anderson, Nevmyvaka, Yuriy. (2023). In- or out-of-distribution detection via dual divergence estimation. Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence.

[bengio2009curriculum] Bengio, Yoshua, Louradour, J'{e. (2009). Curriculum learning. Proceedings of the 26th Annual International Conference on Machine Learning. doi:10.1145/1553374.1553380.

[prabhu2023online] Ameya Prabhu, Zhipeng Cai, Puneet Dokania, Philip Torr, Vladlen Koltun, Ozan Sener. (2023). Online Continual Learning Without the Storage Constraint.

[garg2024tic] Garg, Saurabh, Farajtabar, Mehrdad, Pouransari, Hadi, Vemulapalli, Raviteja, Mehta, Sachin, Tuzel, Oncel, Shankar, Vaishaal, Faghri, Fartash. (2024). TiC-CLIP: Continual Training of CLIP Models. The Twelfth International Conference on Learning Representations (ICLR).

[thengane2022clip] Vishal Thengane, Salman Khan, Munawar Hayat, Fahad Khan. (2022). CLIP model is an Efficient Continual Learner.

[wu2021pretrained] Wu, Tongtong, Caccia, Massimo, Li, Zhuang, Li, Yuan-Fang, Qi, Guilin, Haffari, Gholamreza. (2021). Pretrained language model in continual learning: A comparative study. International conference on learning representations.

[mirzadeh2022wide] Mirzadeh, Seyed Iman, Chaudhry, Arslan, Yin, Dong, Hu, Huiyi, Pascanu, Razvan, Gorur, Dilan, Farajtabar, Mehrdad. (2022). Wide Neural Networks Forget Less Catastrophically. Proceedings of the 39th International Conference on Machine Learning.

[neyshabur2020being] Neyshabur, Behnam, Sedghi, Hanie, Zhang, Chiyuan. (2020). What is being transferred in transfer learning?. Advances in neural information processing systems.

[hao2019visualizing] Hao, Yaru, Dong, Li, Wei, Furu, Xu, Ke. (2019). Visualizing and Understanding the Effectiveness of {BERT. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). doi:10.18653/v1/D19-1424.

[mehta2023empirical] Sanket Vaibhav Mehta, Darshan Patil, Sarath Chandar, Emma Strubell. (2023). An Empirical Investigation of the Role of Pre-training in Lifelong Learning. Journal of Machine Learning Research.

[li2023blip2] Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models.

[dosovitskiy2020image] Dosovitskiy, Alexey, Beyer, Lucas, Kolesnikov, Alexander, Weissenborn, Dirk, Zhai, Xiaohua, Unterthiner, Thomas, Dehghani, Mostafa, Minderer, Matthias, Heigold, Georg, Gelly, Sylvain, others. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

[yuan2020revisiting] Yuan, Li, Tay, Francis EH, Li, Guilin, Wang, Tao, Feng, Jiashi. (2020). Revisiting Knowledge Distillation via Label Smoothing Regularization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[chen2020recall] Chen, Sanyuan, Hou, Yutai, Cui, Yiming, Che, Wanxiang, Liu, Ting, Yu, Xiangzhan. (2020). Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). doi:10.18653/v1/2020.emnlp-main.634.

[he2021analyzing] He, Tianxing, Liu, Jun, Cho, Kyunghyun, Ott, Myle, Liu, Bing, Glass, James, Peng, Fuchun. (2021). Analyzing the Forgetting Problem in Pretrain-Finetuning of Open-domain Dialogue Response Models. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. doi:10.18653/v1/2021.eacl-main.95.

[hu2022lora] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen. (2022). Lo{RA. International Conference on Learning Representations.

[wang2021kadapter] Wang, Ruize, Tang, Duyu, Duan, Nan, Wei, Zhongyu, Huang, Xuanjing, Ji, Jianshu, Cao, Guihong, Jiang, Daxin, Zhou, Ming. (2021). {K-Adapter. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. doi:10.18653/v1/2021.findings-acl.121.

[ben2010theory] Ben-David, Shai, Blitzer, John, Crammer, Koby, Kulesza, Alex, Pereira, Fernando, Vaughan, Jennifer Wortman. (2010). A theory of learning from different domains. Machine learning.

[ganin2016domain] Ganin, Yaroslav, Ustinova, Evgeniya, Ajakan, Hana, Germain, Pascal, Larochelle, Hugo, Laviolette, Fran{\c{c. (2016). Domain-adversarial training of neural networks. The journal of machine learning research.

[zewdu2022part] Zewdu, Alebachew, Yitagesu, Betselot. (2022). Part of speech tagging: a systematic review of deep learning and machine learning approaches. Journal of Big Data. doi:10.1186/s40537-022-00561-y.

[li2021prefix] Li, Xiang Lisa, Liang, Percy. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). doi:10.18653/v1/2021.acl-long.353.

[shazeer2017outrageously] Shazeer, Noam, Mirhoseini, Azalia, Maziarz, Krzysztof, Davis, Andy, Le, Quoc, Hinton, Geoffrey, Dean, Jeff. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

[lester2021power] Lester, Brian, Al-Rfou, Rami, Constant, Noah. (2021). The Power of Scale for Parameter-Efficient Prompt Tuning. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2021.emnlp-main.243.

[gpt-j] Wang, Ben, Komatsuzaki, Aran. {GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model.

[du2022glam] Du, Nan, Huang, Yanping, Dai, Andrew M, Tong, Simon, Lepikhin, Dmitry, Xu, Yuanzhong, Krikun, Maxim, Zhou, Yanqi, Yu, Adams Wei, Firat, Orhan, Zoph, Barret, Fedus, Liam, Bosma, Maarten P, Zhou, Zongwei, Wang, Tao, Wang, Emma, Webster, Kellie, Pellat, Marie, Robinson, Kevin, Meier-Hellstern, Kathleen, Duke, Toju, Dixon, Lucas, Zhang, Kun, Le, Quoc, Wu, Yonghui, Chen, Zhifeng, Cui, Claire. (2022). {GL. Proceedings of the 39th International Conference on Machine Learning.

[soldaini2024dolma] Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo. (2024). Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.

[xie2024data] Xie, Sang Michael, Santurkar, Shibani, Ma, Tengyu, Liang, Percy S. (2024). Data selection for language models via importance resampling. Advances in Neural Information Processing Systems.

[li2023quality] Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, Jing Xiao. (2023). From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning. ArXiv.

[min2022rethinking] Min, Sewon, Lyu, Xinxi, Holtzman, Ari, Artetxe, Mikel, Lewis, Mike, Hajishirzi, Hannaneh, Zettlemoyer, Luke. (2022). Rethinking the role of demonstrations: What makes in-context learning work?. arXiv preprint arXiv:2202.12837.

[wei2021finetuned] Wei, Jason, Bosma, Maarten, Zhao, Vincent Y, Guu, Kelvin, Yu, Adams Wei, Lester, Brian, Du, Nan, Dai, Andrew M, Le, Quoc V. (2021). Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.

[yao2024tree] Yao, Shunyu, Yu, Dian, Zhao, Jeffrey, Shafran, Izhak, Griffiths, Tom, Cao, Yuan, Narasimhan, Karthik. (2024). Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems.

[wei2022emergent] Wei, Jason, Tay, Yi, Bommasani, Rishi, Raffel, Colin, Zoph, Barret, Borgeaud, Sebastian, Yogatama, Dani, Bosma, Maarten, Zhou, Denny, Metzler, Donald, others. (2022). Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.

[wei2022chain] Wei, Jason, Wang, Xuezhi, Schuurmans, Dale, Bosma, Maarten, Xia, Fei, Chi, Ed, Le, Quoc V, Zhou, Denny, others. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems.

[deepseekai2024deepseek] DeepSeek-AI Team. (2024). DeepSeek LLM: Scaling Open-Source Language Models with Longtermism.

[biderman2023pythia] Biderman, Stella, Schoelkopf, Hailey, Anthony, Quentin Gregory, Bradley, Herbie, O’Brien, Kyle, Hallahan, Eric, Khan, Mohammad Aflah, Purohit, Shivanshu, Prashanth, USVSN Sai, Raff, Edward, others. (2023). Pythia: A suite for analyzing large language models across training and scaling. International Conference on Machine Learning.

[gao2020pile] Gao, Leo, Biderman, Stella, Black, Sid, Golding, Laurence, Hoppe, Travis, Foster, Charles, Phang, Jason, He, Horace, Thite, Anish, Nabeshima, Noa, Presser, Shawn, Leahy, Connor. (2020). The {P. arXiv preprint arXiv:2101.00027.

[cerebras2023slimpajama] Soboleva, Daria, Al-Khateeb, Faisal, Myers, Robert, Steeves, Jacob R, Hestness, Joel, Dey, Nolan. (2023). {SlimPajama: A 627B token cleaned and deduplicated version of RedPajama.

[lopez2017gradient] Lopez-Paz, David, Ranzato, Marc'Aurelio. (2017). Gradient episodic memory for continual learning. Advances in neural information processing systems.

[wu2019large] Wu, Yue, Chen, Yinpeng, Wang, Lijuan, Ye, Yuancheng, Liu, Zicheng, Guo, Yandong, Fu, Yun. (2019). Large scale incremental learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[wistuba2023] Martin Wistuba, Prabhu Teja Sivaprasad, Lukas Balles, Giovanni Zappella. (2023). Continual learning with low rank adaptation. NeurIPS 2023 Workshop on Distribution Shifts (DistShifts).

[aljundi2017expert] Aljundi, Rahaf, Chakravarty, Punarjay, Tuytelaars, Tinne. (2017). Expert gate: Lifelong learning with a network of experts. Proceedings of the IEEE conference on computer vision and pattern recognition.

[cai2021online] Cai, Zhipeng, Sener, Ozan, Koltun, Vladlen. (2021). Online continual learning with natural distribution shifts: An empirical study with visual data. Proceedings of the IEEE/CVF international conference on computer vision.

[verwimp2024continual] Eli Verwimp, Rahaf Aljundi, Shai Ben-David, Matthias Bethge, Andrea Cossu, Alexander Gepperth, Tyler L. Hayes, Eyke Hüllermeier, Christopher Kanan, Dhireesha Kudithipudi, Christoph H. Lampert, Martin Mundt, Razvan Pascanu, Adrian Popescu, Andreas S. Tolias, Joost van de Weijer, Bing Liu, Vincenzo Lomonaco, Tinne Tuytelaars, Gido M. van de Ven. (2024). Continual Learning: Applications and the Road Forward.

[li2017learning] Li, Zhizhong, Hoiem, Derek. (2017). Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence.

[smith2023closer] James Seale Smith, Junjiao Tian, Shaunak Halbe, Yen-Chang Hsu, Zsolt Kira. (2023). A Closer Look at Rehearsal-Free Continual Learning.

[hayes2020lifelong] Tyler L. Hayes, Christopher Kanan. (2020). Lifelong Machine Learning with Deep Streaming Linear Discriminant Analysis.

[lomonaco2020rehearsalfree] Vincenzo Lomonaco, Davide Maltoni, Lorenzo Pellegrini. (2020). Rehearsal-Free Continual Learning over Small Non-I.I.D. Batches.

[chaudhry2019tiny] Chaudhry, Arslan, Rohrbach, Marcus, Elhoseiny, Mohamed, Ajanthan, Thalaiyasingam, Dokania, Puneet K, Torr, Philip HS, Ranzato, Marc'Aurelio. (2019). On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486.

[schwarz2018progress] Schwarz, Jonathan, Czarnecki, Wojciech, Luketina, Jelena, Grabska-Barwinska, Agnieszka, Teh, Yee Whye, Pascanu, Razvan, Hadsell, Raia. (2018). Progress & compress: A scalable framework for continual learning. International conference on machine learning.

[sarfraz2023error] Sarfraz, Fahad, Arani, Elahe, Zonooz, Bahram. (2023). Error Sensitivity Modulation based Experience Replay: Mitigating Abrupt Representation Drift in Continual Learning. arXiv preprint arXiv:2302.11344.

[riemer2018learning] Riemer, Matthew, Cases, Ignacio, Ajemian, Robert, Liu, Miao, Rish, Irina, Tu, Yuhai, Tesauro, Gerald. (2018). Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv preprint arXiv:1810.11910.

[buzzega2020dark] Buzzega, Pietro, Boschini, Matteo, Porrello, Angelo, Abati, Davide, Calderara, Simone. (2020). Dark experience for general continual learning: a strong, simple baseline. Advances in neural information processing systems.

[shi2024unified] Shi, Haizhou, Wang, Hao. (2024). A Unified Approach to Domain Incremental Learning with Memory: Theory and Algorithm. Advances in Neural Information Processing Systems.

[bang2021rainbow] Bang, Jihwan, Kim, Heesu, Yoo, YoungJoon, Ha, Jung-Woo, Choi, Jonghyun. (2021). Rainbow Memory: Continual Learning With a Memory of Diverse Samples. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[zhao2022memory] Rebuffi, Sylvestre-Alvise, Kolesnikov, Alexander, Sperl, Georg, Lampert, Christoph H. (2017). icarl: Incremental classifier and representation learning. IEEE Transactions on Neural Networks and Learning Systems. doi:10.1109/TNNLS.2021.3072041.

[ritter2018online] Ritter, Hippolyt, Botev, Aleksandar, Barber, David. (2018). Online structured laplace approximations for overcoming catastrophic forgetting. Advances in Neural Information Processing Systems.

[aljundi2018memory] Aljundi, Rahaf, Babiloni, Francesca, Elhoseiny, Mohamed, Rohrbach, Marcus, Tuytelaars, Tinne. (2018). Memory aware synapses: Learning what (not) to forget. Proceedings of the European conference on computer vision (ECCV).

[chaudhry2019efficient] Chaudhry, Arslan, Ranzato, Marc’Aurelio, Rohrbach, Marcus, Elhoseiny, Mohamed. (2019). Efficient Lifelong Learning with A-GEM. ICLR.

[sprechmann2018memory] Sprechmann, Pablo, Jayakumar, Siddhant M, Rae, Jack W, Pritzel, Alexander, Badia, Adria Puigdomenech, Uria, Benigno, Vinyals, Oriol, Hassabis, Demis, Pascanu, Razvan, Blundell, Charles. (2018). Memory-based Parameter Adaptation. International Conference on Learning Representations.

[caccia2021new] Caccia, Lucas, Aljundi, Rahaf, Asadi, Nader, Tuytelaars, Tinne, Pineau, Joelle, Belilovsky, Eugene. (2021). New insights on reducing abrupt representation change in online continual learning. arXiv preprint arXiv:2104.05025.

[ramesh2021model] Ramesh, Rahul, Chaudhari, Pratik. (2021). Model Zoo: A Growing. arXiv preprint arXiv:2106.03027.

[wang2022coscl] Wang, Liyuan, Zhang, Xingxing, Li, Qian, Zhu, Jun, Zhong, Yi. (2022). CoSCL: Cooperation of Small Continual Learners is Stronger Than a Big One. Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part XXVI.

[he2016ups] He, Ruining, McAuley, Julian. (2016). Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering. Proceedings of the 25th International Conference on World Wide Web. doi:10.1145/2872427.2883037.

[ni2019justifying] Ni, Jianmo, Li, Jiacheng, McAuley, Julian. (2019). Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). doi:10.18653/v1/D19-1018.

[baumgartner2020pushshift] Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, Jeremy Blackburn. (2020). The Pushshift Reddit Dataset.

[zellers2019defending] Zellers, Rowan, Holtzman, Ari, Rashkin, Hannah, Bisk, Yonatan, Farhadi, Ali, Roesner, Franziska, Choi, Yejin. (2019). Defending against neural fake news. Advances in neural information processing systems.

[gokaslan2019OpenWeb] Aaron Gokaslan, Vanya Cohen. (2019). OpenWebText Corpus.

[caselaw2018] {Caselaw Access Project. (2018). Caselaw Access Project.

[chelba2014billion] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, Tony Robinson. (2014). One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling.

[zhang2015character] Zhang, Xiang, Zhao, Junbo, LeCun, Yann. (2015). Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems.

[xu2019bert] Hu Xu, Bing Liu, Lei Shu, Philip S. Yu. (2019). BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis.

[deng2023survey] Deng, Yang, Lei, Wenqiang, Lam, Wai, Chua, Tat-Seng. (2023). A survey on proactive dialogue systems: Problems, methods, and prospects. arXiv preprint arXiv:2305.02750.

[kwok2001scaling] Kwok, Cody C. T., Etzioni, Oren, Weld, Daniel S.. (2001). Scaling question answering to the Web. Proceedings of the 10th International Conference on World Wide Web. doi:10.1145/371920.371973.

[bahdanau2014neural] Bahdanau, Dzmitry, Cho, Kyunghyun, Bengio, Yoshua. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[ni2021revisiting] Ni, Zixuan, Shi, Haizhou, Tang, Siliang, Wei, Longhui, Tian, Qi, Zhuang, Yueting. (2021). Revisiting catastrophic forgetting in class incremental learning. arXiv preprint arXiv:2107.12308.

[petroni2019language] Petroni, Fabio, Rockt{. (2019). Language Models as Knowledge Bases?. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). doi:10.18653/v1/D19-1250.

[sun2020ernie] Sun, Yu, Wang, Shuohuan, Li, Yukun, Feng, Shikun, Tian, Hao, Wu, Hua, Wang, Haifeng. (2020). ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding. Proceedings of the AAAI Conference on Artificial Intelligence. doi:10.1609/aaai.v34i05.6428.

[amba2021dynamic] Amba Hombaiah, Spurthi, Chen, Tao, Zhang, Mingyang, Bendersky, Michael, Najork, Marc. (2021). Dynamic language models for continuously evolving content. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining.

[qin2023recyclable] Qin, Yujia, Qian, Cheng, Han, Xu, Lin, Yankai, Wang, Huadong, Xie, Ruobing, Liu, Zhiyuan, Sun, Maosong, Zhou, Jie. (2023). Recyclable Tuning for Continual Pre-training. arXiv preprint arXiv:2305.08702.

[qin2022elle] Qin, Yujia, Zhang, Jiajie, Lin, Yankai, Liu, Zhiyuan, Li, Peng, Sun, Maosong, Zhou, Jie. (2022). {ELLE. Findings of the Association for Computational Linguistics: ACL 2022. doi:10.18653/v1/2022.findings-acl.220.

[gadde2021towards] Gadde, Ravi Teja, Bulyko, Ivan. (2021). Towards Continual Entity Learning in Language Models for Conversational Agents. arXiv preprint arXiv:2108.00082.

[jin2022lifelong] Jin, Xisen, Zhang, Dejiao, Zhu, Henghui, Xiao, Wei, Li, Shang-Wen, Wei, Xiaokai, Arnold, Andrew, Ren, Xiang. (2022). Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora. Proceedings of BigScience Episode {#. doi:10.18653/v1/2022.bigscience-1.1.

[cossu2022continual] Andrea Cossu, Tinne Tuytelaars, Antonio Carta, Lucia Passaro, Vincenzo Lomonaco, Davide Bacciu. (2022). Continual Pre-Training Mitigates Forgetting in Language and Vision.

[sun2023improving] Sun, Michael, Kumar, Ananya, Madaan, Divyam, Liang, Percy. (2023). Improving Representational Continuity via Continued Pretraining. arXiv preprint arXiv:2302.13289.

[ke2022continual-pre] Ke, Zixuan, Shao, Yijia, Lin, Haowei, Konishi, Tatsuya, Kim, Gyuhak, Liu, Bing. (2022). Continual Pre-training of Language Models. The Eleventh International Conference on Learning Representations.

[chen2023lifelong] Chen, Wuyang, Zhou, Yanqi, Du, Nan, Huang, Yanping, Laudon, James, Chen, Zhifeng, Cui, Claire. (2023). Lifelong language pretraining with distribution-specialized experts. International Conference on Machine Learning.

[gupta2023continual] Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L. Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, Timothée Lesort. (2023). Continual Pre-Training of Large Language Models: How to (re)warm your model?.

[gururangan2022demix] Gururangan, Suchin, Lewis, Mike, Holtzman, Ari, Smith, Noah A., Zettlemoyer, Luke. (2022). {DEM. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. doi:10.18653/v1/2022.naacl-main.407.

[gogoulou2024continual] Evangelia Gogoulou, Timothée Lesort, Magnus Boman, Joakim Nivre. (2024). Continual Learning Under Language Shift.

[wu2024llama] Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ping Luo, Ying Shan. (2024). LLaMA Pro: Progressive LLaMA with Block Expansion.

[li2024examining] Chen-An Li, Hung-Yi Lee. (2024). Examining Forgetting in Continual Pre-training of Aligned Large Language Models.

[lazaridou2021mind] Lazaridou, Angeliki, Kuncoro, Adhi, Gribovskaya, Elena, Agrawal, Devang, Liska, Adam, Terzi, Tayfun, Gimenez, Mai, de Masson d'Autume, Cyprien, Kocisky, Tomas, Ruder, Sebastian, others. (2021). Mind the gap: Assessing temporal generalization in neural language models. Advances in Neural Information Processing Systems.

[dhingra2022time] Dhingra, Bhuwan, Cole, Jeremy R, Eisenschlos, Julian Martin, Gillick, Daniel, Eisenstein, Jacob, Cohen, William W. (2022). Time-aware language models as temporal knowledge bases. Transactions of the Association for Computational Linguistics.

[jang2022towards] Jang, Joel, Ye, Seonghyeon, Yang, Sohee, Shin, Joongbo, Han, Janghoon, Kim, Gyeonghun, Choi, Stanley Jungkyu, Seo, Minjoon. (2022). Towards Continual Knowledge Learning of Language Models. ICLR.

[jang2022temporalwiki] Jang, Joel, Ye, Seonghyeon, Lee, Changho, Yang, Sohee, Shin, Joongbo, Han, Janghoon, Kim, Gyeonghun, Seo, Minjoon. (2022). TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models. EMNLP 2022.

[su2023efficient] Su, Zhaochen, Li, Juntao, Zhang, Zikang, Zhou, Zihan, Zhang, Min. (2023). Efficient Continue Training of Temporal Language Model with Structural Information. Findings of the Association for Computational Linguistics: EMNLP 2023. doi:10.18653/v1/2023.findings-emnlp.418.

[tan2023towards] Tan, Qingyu, Ng, Hwee Tou, Bing, Lidong. (2023). Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2023.acl-long.828.

[rosin2022time] Rosin, Guy D., Guy, Ido, Radinsky, Kira. (2022). Time Masking for Temporal Language Models. Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. doi:10.1145/3488560.3498529.

[loureiro2022timelms] Loureiro, Daniel, Barbieri, Francesco, Neves, Leonardo, Espinosa Anke, Luis, Camacho-collados, Jose. (2022). {T. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. doi:10.18653/v1/2022.acl-demo.25.

[attanasio2023worth] Giuseppe Attanasio, Debora Nozza, Federico Bianchi, Dirk Hovy. (2023). Is It Worth the (Environmental) Cost? Limited Evidence for Temporal Adaptation via Continuous Training.

[rongali2021continual] Subendhu Rongali, Abhyuday Jagannatha, Bhanu Pratap Singh Rawat, Hong Yu. (2021). Continual Domain-Tuning for Pretrained Language Models.

[gu2022ppt] Gu, Yuxian, Han, Xu, Liu, Zhiyuan, Huang, Minlie. (2022). PPT: Pre-trained Prompt Tuning for Few-shot Learning. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2022.acl-long.576.

[ke2022continual-train] Ke, Zixuan, Lin, Haowei, Shao, Yijia, Xu, Hu, Shu, Lei, Liu, Bing. (2022). Continual Training of Language Models for Few-Shot Learning. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2022.emnlp-main.695.

[guo2023continuous] Zhen Guo, Yining Hua. (2023). Continuous Training and Fine-tuning for Domain-Specific Language Models in Medical Question Answering.

[gururangan2020dont] Gururangan, Suchin, Marasović, Ana, Swayamdipta, Swabha, Lo, Kyle, Beltagy, Iz, Downey, Doug, Smith, Noah A.. (2020). Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.740.

[ma2023ecomgptct] Shirong Ma, Shen Huang, Shulin Huang, Xiaobin Wang, Yangning Li, Hai-Tao Zheng, Pengjun Xie, Fei Huang, Yong Jiang. (2023). EcomGPT-CT: Continual Pre-training of E-commerce Large Language Models with Semi-structured Data.

[han2021econet] Han, Rujun, Ren, Xiang, Peng, Nanyun. (2021). ECONET: Effective Continual Pretraining of Language Models for Event Temporal Reasoning. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2021.emnlp-main.436.

[xie2023efficient] Yong Xie, Karan Aggarwal, Aitzaz Ahmad. (2023). Efficient Continual Pre-training for Building Domain Specific Large Language Models.

[zhou2020pre] Zhou, Wangchunshu, Lee, Dong-Ho, Selvam, Ravi Kiran, Lee, Seyeon, Lin, Bill Yuchen, Ren, Xiang. (2021). Pre-training text-to-text transformers for concept-centric common sense. International Conference on Learning Representations.

[xie2023quert] Xie, Jian, Liang, Yidan, Liu, Jingping, Xiao, Yanghua, Wu, Baohua, Ni, Shenghua. (2023). QUERT: Continual Pre-training of Language Model for Query Understanding in Travel Domain Search. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. doi:10.1145/3580305.3599891.

[zhang2023revisit] Zhang, Haode, Liang, Haowen, Zhan, Li-Ming, Wu, Xiao-Ming, Lam, Albert Y.S.. (2023). Revisit Few-shot Intent Classification with PLMs: Direct Fine-tuning vs. Continual Pre-training. Findings of the Association for Computational Linguistics: ACL 2023. doi:10.18653/v1/2023.findings-acl.706.

[savelka2023explaining] Jaromir Savelka, Kevin D. Ashley, Morgan A. Gray, Hannes Westermann, Huihui Xu. (2023). Explaining Legal Concepts with Augmented Large Language Models (GPT-4).

[lai2023large] Lai, Jinqi, Gan, Wensheng, Wu, Jiayang, Qi, Zhenlian, Yu, Philip S. (2023). Large Language Models in Law: A Survey. arXiv preprint arXiv:2312.03718.

[yue2023disc] Yue, Shengbin, Chen, Wei, Wang, Siyuan, Li, Bingxuan, Shen, Chenchen, Liu, Shujun, Zhou, Yuxuan, Xiao, Yao, Yun, Song, Lin, Wei, others. (2023). DISC-LawLLM: Fine-tuning Large Language Models for Intelligent Legal Services. arXiv preprint arXiv:2309.11325.

[xiao2021lawformer] Chaojun Xiao, Xueyu Hu, Zhiyuan Liu, Cunchao Tu, Maosong Sun. (2021). Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents.

[lawyerllam-git] Huang, Quzhe, Tao, Mingxu, Zhang, Chen, An, Zhenwei, Jiang, Cong, Chen, Zhibin, Wu, Zirui, Feng, Yansong. (2023). Lawyer Llama. GitHub repository.

[koehn2005europarl] Koehn, Philipp. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. Proceedings of Machine Translation Summit X: Papers.

[huang2023lawyer] Huang, Quzhe, Tao, Mingxu, An, Zhenwei, Zhang, Chen, Jiang, Cong, Chen, Zhibin, Wu, Zirui, Feng, Yansong. (2023). Lawyer LLaMA Technical Report. arXiv preprint arXiv:2305.15062.

[zhong2020jec] Zhong, Haoxi, Xiao, Chaojun, Tu, Cunchao, Zhang, Tianyang, Liu, Zhiyuan, Sun, Maosong. (2020). JEC-QA: a legal-domain question answering dataset. Proceedings of the AAAI Conference on Artificial Intelligence.

[colombo2024saullm7b] Pierre Colombo, Telmo Pessoa Pires, Malik Boudiaf, Dominic Culver, Rui Melo, Caio Corro, Andre F. T. Martins, Fabrizio Esposito, Vera Lúcia Raposo, Sofia Morgado, Michael Desa. (2024). SaulLM-7B: A pioneering Large Language Model for Law.

[martin2024better] Lauren Martin, Nick Whitehouse, Stephanie Yiu, Lizzie Catterson, Rivindu Perera. (2024). Better Call GPT, Comparing Large Language Models Against Lawyers.

[gu2021domain] Gu, Yu, Tinn, Robert, Cheng, Hao, Lucas, Michael, Usuyama, Naoto, Liu, Xiaodong, Naumann, Tristan, Gao, Jianfeng, Poon, Hoifung. (2021). Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Transactions on Computing for Healthcare. doi:10.1145/3458754.

[luo2022biogpt] Luo, Renqian, Sun, Liai, Xia, Yingce, Qin, Tao, Zhang, Sheng, Poon, Hoifung, Liu, Tie-Yan. (2022). BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics. doi:10.1093/bib/bbac409.

[singhal2023large] Singhal, Karan, Azizi, Shekoofeh, Tu, Tao, Mahdavi, S Sara, Wei, Jason, Chung, Hyung Won, Scales, Nathan, Tanwani, Ajay, Cole-Lewis, Heather, Pfohl, Stephen, others. (2023). Large language models encode clinical knowledge. Nature.

[Singhal2023MedPaLM2] Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, others. (2023). Towards Expert-Level Medical Question Answering with Large Language Models. CoRR. doi:10.48550/ARXIV.2305.09617.

[bao2023discmedllm] Zhijie Bao, Wei Chen, Shengze Xiao, Kuang Ren, Jiaao Wu, Cheng Zhong, Jiajie Peng, Xuanjing Huang, Zhongyu Wei. (2023). DISC-MedLLM: Bridging General Large Language Models and Real-World Medical Consultation.

[jeblick2022chatgpt] Katharina Jeblick, Balthasar Schachtner, Jakob Dexl, Andreas Mittermeier, Anna Theresa Stüber, Johanna Topalis, Tobias Weber, Philipp Wesp, Bastian Sabel, Jens Ricke, Michael Ingrisch. (2022). ChatGPT Makes Medicine Easy to Swallow: An Exploratory Case Study on Simplified Radiology Reports.

[lo2020s2orc] Lo, Kyle, Wang, Lucy Lu, Neumann, Mark, Kinney, Rodney, Weld, Daniel. (2020). S2ORC: The Semantic Scholar Open Research Corpus. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.447.

[hendryckstest2021] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt. (2021). Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR).

[li2023cmmlu] Li, Haonan, Zhang, Yixuan, Koto, Fajri, Yang, Yifei, Zhao, Hai, Gong, Yeyun, Duan, Nan, Baldwin, Timothy. (2023). CMMLU: Measuring Massive Multitask Language Understanding in Chinese. arXiv preprint arXiv:2306.09212.

[du2023chinesellama2] Zefeng Du, Minghao Wu, Longyue Wang. (2023). Chinese-Llama-2. GitHub repository.

[liu2023benchmarking] Junling Liu, Peilin Zhou, Yining Hua, Dading Chong, Zhongyu Tian, Andrew Liu, Helin Wang, Chenyu You, Zhenhua Guo, Lei Zhu, Michael Lingzhi Li. (2023). Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset.

[wang2023cmb] Wang, Xidong, Chen, Guiming Hardy, Song, Dingjie, Zhang, Zhiyi, Chen, Zhihong, Xiao, Qingying, Jiang, Feng, Li, Jianquan, Wan, Xiang, Wang, Benyou, others. (2023). CMB: A Comprehensive Medical Benchmark in Chinese. arXiv preprint arXiv:2308.08833.

[sha2021wudao] Yuan, Sha, Zhao, Hanyu, Du, Zhengxiao, Ding, Ming, Liu, Xiao, Cen, Yukuo, Zou, Xu, Yang, Zhilin, Tang, Jie. (2021). WuDaoCorpora: A Super Large-scale Chinese Corpora for Pre-training Language Models. AI Open. doi:10.1016/j.aiopen.2021.06.001.

[zhu2023promptcblue] Zhu, Wei, Wang, Xiaoling, Zheng, Huanran, Chen, Mosha, Tang, Buzhou. (2023). PromptCBLUE: A Chinese Prompt Tuning Benchmark for the Medical Domain. arXiv preprint arXiv:2310.14151.

[wu2023pmc] Wu, Chaoyi, Lin, Weixiong, Zhang, Xiaoman, Zhang, Ya, Wang, Yanfeng, Xie, Weidi. (2023). PMC-LLaMA: Towards Building Open-source Language Models for Medicine. arXiv preprint arXiv:2305.10415.

[together2023redpajama] Together Computer. (2023). RedPajama: an Open Dataset for Training Large Language Models.

[luo2023biomedgpt] Luo, Yizhen, Zhang, Jiahuan, Fan, Siqi, Yang, Kai, Wu, Yushuai, Qiao, Mu, Nie, Zaiqing. (2023). BiomedGPT: Open Multimodal Generative Pre-trained Transformer for Biomedicine. arXiv preprint arXiv:2308.09442.

[zhang2023huatuogpt] Zhang, Hongbo, Chen, Junying, Jiang, Feng, Yu, Fei, Chen, Zhihong, Chen, Guiming, Li, Jianquan, Wu, Xiangbo, Zhiyi, Zhang, Xiao, Qingying, Wan, Xiang, Wang, Benyou, Li, Haizhou. (2023). HuatuoGPT, Towards Taming Language Model to Be a Doctor. Findings of the Association for Computational Linguistics: EMNLP 2023. doi:10.18653/v1/2023.findings-emnlp.725.

[xiong2023doctorglm] Xiong, Honglin, Wang, Sheng, Zhu, Yitao, Zhao, Zihao, Liu, Yuxiao, Wang, Qian, Shen, Dinggang. (2023). DoctorGLM: Fine-tuning Your Chinese Doctor is Not a Herculean Task. arXiv preprint arXiv:2304.01097.

[zhang2023alpacare] Zhang, Xinlu, Tian, Chenxin, Yang, Xianjun, Chen, Lichang, Li, Zekun, Petzold, Linda Ruth. (2023). AlpaCare: Instruction-tuned Large Language Models for Medical Application. arXiv preprint arXiv:2310.14558.

[wang2023huatuo] Wang, Haochun, Liu, Chi, Xi, Nuwa, Qiang, Zewen, Zhao, Sendong, Qin, Bing, Liu, Ting. (2023). HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge. arXiv preprint arXiv:2304.06975.

[li2023chatdoctor] Li, Yunxiang, Li, Zihan, Zhang, Kai, Dan, Ruilong, Jiang, Steve, Zhang, You. (2023). ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. Cureus.

[Han2023MedAlpaca] Tianyu Han, Lisa C. Adams, Jens-Michalis Papaioannou, others. (2023). MedAlpaca -- An Open-Source Collection of Medical Conversational AI Models and Training Data. CoRR. doi:10.48550/ARXIV.2304.08247.

[Chen2023HuatuoGPTII] Junying Chen, Xidong Wang, Anningzhe Gao, Feng Jiang, Shunian Chen, Hongbo Zhang, Dingjie Song, Wenya Xie, Chuyi Kong, Jianquan Li, Xiang Wan, Haizhou Li, Benyou Wang. (2023). HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs. CoRR. doi:10.48550/ARXIV.2311.09774.

[shah2023zero] Agam Shah, Sudheer Chava. (2023). Zero is Not Hero Yet: Benchmarking Zero-Shot Performance of LLMs for Financial Tasks.

[Xue2023WeaverBird] Siqiao Xue, Fan Zhou, Yi Xu, Hongyu Zhao, Shuo Xie, Qingyang Dai, Caigao Jiang, James Zhang, Jun Zhou, Dacheng Xiu, Hongyuan Mei. (2023). WeaverBird: Empowering Financial Decision-Making with Large Language Model, Knowledge Base, and Search Engine. CoRR. doi:10.48550/ARXIV.2308.05361.

[dettmers2023qlora] Dettmers, Tim, Pagnoni, Artidoro, Holtzman, Ari, Zettlemoyer, Luke. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv preprint arXiv:2305.14314.

[araci2019finbert] Dogu Araci. (2019). FinBERT: Financial Sentiment Analysis with Pre-trained Language Models.

[Wu2023BloombergGPT] Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David S. Rosenberg, Gideon Mann. (2023). BloombergGPT: A Large Language Model for Finance. CoRR. doi:10.48550/ARXIV.2303.17564.

[chen2023disc] Chen, Wei, Wang, Qiushi, Long, Zefei, Zhang, Xianyin, Lu, Zhongtian, Li, Bingxuan, Wang, Siyuan, Xu, Jiarong, Bai, Xiang, Huang, Xuanjing, others. (2023). DISC-FinLLM: A Chinese Financial Large Language Model Based on Multiple Experts Fine-Tuning. arXiv preprint arXiv:2310.15205.

[Lu2023BBTFin] Dakuan Lu, Hengkui Wu, Jiaqing Liang, Yipei Xu, Qianyu He, Yipeng Geng, Mengkun Han, Yingsi Xin, Yanghua Xiao. (2023). BBT-Fin: Comprehensive Construction of Chinese Financial Domain Pre-trained Language Model, Corpus and Benchmark. CoRR. doi:10.48550/ARXIV.2302.09432.

[Yang2023InvestLM] Yi Yang, Yixuan Tang, Kar Yan Tam. (2023). InvestLM: A Large Language Model for Investment using Financial Domain Instruction Tuning. CoRR. doi:10.48550/ARXIV.2309.13064.

[Xie2023PIXIU] Qianqian Xie, Weiguang Han, Xiao Zhang, Yanzhao Lai, Min Peng, Alejandro Lopez-Lira, Jimin Huang. (2023). PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance. CoRR. doi:10.48550/ARXIV.2306.05443.

[Wang2023FinGPT] Neng Wang, Hongyang Yang, Christina Dan Wang. (2023). FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets. CoRR. doi:10.48550/ARXIV.2310.04793.

[Zhang2023xuanyuan] Zhang, Xuanyu, Yang, Qing. (2023). XuanYuan 2.0: A Large Chinese Financial Chat Model with Hundreds of Billions Parameters. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. doi:10.1145/3583780.3615285.

[li2023cfgpt] Jiangtong Li, Yuxuan Bian, Guoxuan Wang, Yang Lei, Dawei Cheng, Zhijun Ding, Changjun Jiang. (2023). CFGPT: Chinese Financial Assistant with Large Language Model.

[Taylor2022Galactica] Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, Robert Stojnic. (2022). Galactica: A Large Language Model for Science. CoRR. doi:10.48550/ARXIV.2211.09085.

[Yang2023PLLaMa] Xianjun Yang, Junfeng Gao, Wenxin Xue, Erik Alexandersson. (2024). PLLaMa: An Open-source Large Language Model for Plant Science. CoRR. doi:10.48550/ARXIV.2401.01600.

[Yin2023FORGE] Junqi Yin, Sajal Dash, Feiyi Wang, Mallikarjun Shankar. (2023). FORGE: Pre-Training Open Foundation Models for Science. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '23). doi:10.1145/3581784.3613215.

[Xie2023DARWIN] Tong Xie, Yuwei Wan, Wei Huang, Zhenyu Yin, Yixuan Liu, Shaozhou Wang, Qingyuan Linghu, Chunyu Kit, Clara Grazian, Wenjie Zhang, Imran Razzak, Bram Hoex. (2023). DARWIN Series: Domain Specific Large Language Models for Natural Science. CoRR. doi:10.48550/ARXIV.2308.13565.

[Zhang2024SciGLM] Dan Zhang, Ziniu Hu, Sining Zhoubian, Zhengxiao Du, Kaiyu Yang, Zihan Wang, Yisong Yue, Yuxiao Dong, Jie Tang. (2024). SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning. CoRR. doi:10.48550/ARXIV.2401.07950.

[Azerbayev2023LLEMMA] Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, Sean Welleck. (2023). Llemma: An Open Language Model For Mathematics. CoRR. doi:10.48550/ARXIV.2310.10631.

[Yu2023Outcome] Fei Yu, Anningzhe Gao, Benyou Wang. (2023). Outcome-supervised Verifiers for Planning in Mathematical Reasoning. CoRR. doi:10.48550/ARXIV.2311.09724.

[luo2023wizardmath] Luo, Haipeng, Sun, Qingfeng, Xu, Can, Zhao, Pu, Lou, Jianguang, Tao, Chongyang, Geng, Xiubo, Lin, Qingwei, Chen, Shifeng, Zhang, Dongmei. (2023). WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. arXiv preprint arXiv:2308.09583.

[yue2023mammoth] Yue, Xiang, Qu, Xingwei, Zhang, Ge, Fu, Yao, Huang, Wenhao, Sun, Huan, Su, Yu, Chen, Wenhu. (2023). MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning. arXiv preprint arXiv:2309.05653.

[Gou2023ToRA] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, Weizhu Chen. (2024). ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving. The Twelfth International Conference on Learning Representations.

[Gao2023GLLAVA] Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, Lingpeng Kong. (2023). G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model. CoRR. doi:10.48550/ARXIV.2312.11370.

[Nguyen2023AstroLLaMA] Tuan Dung Nguyen, Yuan-Sen Ting, others. (2023). AstroLLaMA: Towards Specialized Foundation Models in Astronomy. CoRR. doi:10.48550/ARXIV.2309.06126.

[Perkowski2024AstroLLaMAChat] Ernest Perkowski, Rui Pan, Tuan Dung Nguyen, Yuan-Sen Ting, others. (2024). AstroLLaMA-Chat: Scaling AstroLLaMA with Conversational and Diverse Datasets. CoRR. doi:10.48550/ARXIV.2401.01916.

[Zhao2023GIMLET] Haiteng Zhao, Shengchao Liu, Chang Ma, Hannan Xu, Jie Fu, Zhi-Hong Deng, Lingpeng Kong, Qi Liu. (2023). GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning. Thirty-seventh Conference on Neural Information Processing Systems.

[Rubungo2023LLM-Prop] Andre Niyongabo Rubungo, Craig Arnold, Barry P. Rand, Adji Bousso Dieng. (2023). LLM-Prop: Predicting Physical And Electronic Properties Of Crystalline Solids From Their Text Descriptions. CoRR. doi:10.48550/ARXIV.2310.14029.

[Cao2023InstructMol] He Cao, Zijing Liu, Xingyu Lu, Yuan Yao, Yu Li. (2023). InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery. CoRR. doi:10.48550/ARXIV.2311.16208.

[Abdine2023Prot2Text] Hadi Abdine, Michail Chatzianastasis, Costas Bouyioukos, Michalis Vazirgiannis. (2023). Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers. Deep Generative Models for Health Workshop, NeurIPS 2023.

[xTrimoPGLM2024Chen] Bo Chen, Xingyi Cheng, Pan Li, Yangli-ao Geng, others. (2024). xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein. CoRR. doi:10.48550/ARXIV.2401.06199.

[roberts2023gpt4geo] Jonathan Roberts, Timo Lüddecke, Sowmen Das, Kai Han, Samuel Albanie. (2023). GPT4GEO: How a Language Model Sees the World's Geography.

[lin2023geogalactica] Zhouhan Lin, Cheng Deng, Le Zhou, Tianhang Zhang, Yi Xu, Yutong Xu, Zhongmou He, Yuanyuan Shi, Beiya Dai, Yunchong Song, Boyi Zeng, Qiyuan Chen, Tao Shi, Tianyu Huang, Yiwei Xu, Shu Wang, Luoyi Fu, Weinan Zhang, Junxian He, Chao Ma, Yunqiang Zhu, Xinbing Wang, Chenghu Zhou. (2023). GeoGalactica: A Scientific Large Language Model in Geoscience.

[wang2023nearrealtime] Chenguang Wang, Davis Engler, Xuechun Li, James Hou, David J. Wald, Kishor Jaiswal, Susu Xu. (2023). Near-real-time Earthquake-induced Fatality Estimation using Crowdsourced Data and Large-Language Models.

[Bi2023OCEANGPT] Zhen Bi, Ningyu Zhang, Yida Xue, Yixin Ou, Daxiong Ji, Guozhou Zheng, Huajun Chen. (2023). OceanGPT: A Large Language Model for Ocean Science Tasks. CoRR. doi:10.48550/ARXIV.2310.02031.

[Zheng2023MarineGPT] Ziqiang Zheng, Jipeng Zhang, Tuan-Anh Vu, others. (2023). MarineGPT: Unlocking Secrets of Ocean to the Public. CoRR. doi:10.48550/ARXIV.2310.13596.

[liu2024radiologygpt] Zhengliang Liu, Aoxiao Zhong, Yiwei Li, Longtao Yang, Chao Ju, Zihao Wu, Chong Ma, Peng Shu, Cheng Chen, Sekeun Kim, Haixing Dai, Lin Zhao, Lichao Sun, Dajiang Zhu, Jun Liu, Wei Liu, Dinggang Shen, Xiang Li, Quanzheng Li, Tianming Liu. (2024). Radiology-GPT: A Large Language Model for Radiology.

[deng2023learning] Cheng Deng, Tianhang Zhang, Zhongmou He, Yi Xu, Qiyuan Chen, Yuanyuan Shi, Luoyi Fu, Weinan Zhang, Xinbing Wang, Chenghu Zhou, Zhouhan Lin, Junxian He. (2023). K2: A Foundation Language Model for Geoscience Knowledge Understanding and Utilization.

[sun2024survey] Qiushi Sun, Zhirui Chen, Fangzhi Xu, Kanzhi Cheng, Chang Ma, Zhangyue Yin, Jianing Wang, Chengcheng Han, Renyu Zhu, Shuai Yuan, Qipeng Guo, Xipeng Qiu, Pengcheng Yin, Xiaoli Li, Fei Yuan, Lingpeng Kong, Xiang Li, Zhiyong Wu. (2024). A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond.

[chen2021evaluating] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba. (2021). Evaluating Large Language Models Trained on Code.

[chai2023erniecode] Chai, Yekun, Wang, Shuohuan, Pang, Chao, Sun, Yu, Tian, Hao, Wu, Hua. (2023). ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages. Findings of the Association for Computational Linguistics: ACL 2023. doi:10.18653/v1/2023.findings-acl.676.

[kanade2020learning] Kanade, Aditya, Maniatis, Petros, Balakrishnan, Gogul, Shi, Kensen. (2020). Learning and evaluating contextual embedding of source code. International conference on machine learning.

[feng2020codebert] Feng, Zhangyin, Guo, Daya, Tang, Duyu, Duan, Nan, Feng, Xiaocheng, Gong, Ming, Shou, Linjun, Qin, Bing, Liu, Ting, Jiang, Daxin, others. (2020). CodeBERT: A Pre-Trained Model for Programming and Natural Languages. arXiv preprint arXiv:2002.08155.

[moradidakhel2023github] Arghavan Moradi Dakhel, Vahid Majdinasab, Amin Nikanjam, Foutse Khomh, others. (2023). GitHub Copilot AI Pair Programmer: Asset or Liability?. Journal of Systems and Software. doi:10.1016/j.jss.2023.111734.

[nijkamp2022codegen] Nijkamp, Erik, Pang, Bo, Hayashi, Hiroaki, Tu, Lifu, Wang, Huan, Zhou, Yingbo, Savarese, Silvio, Xiong, Caiming. (2023). CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. ICLR.

[nijkamp2023codegen2] Nijkamp, Erik, Hayashi, Hiroaki, Xiong, Caiming, Savarese, Silvio, Zhou, Yingbo. (2023). CodeGen2: Lessons for Training LLMs on Programming and Natural Languages. ICLR.

[rozière2024code] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve. (2024). Code Llama: Open Foundation Models for Code.

[li2023starcoder] Raymond Li, Loubna Ben Allal, Yangtian Zi, others. (2023). StarCoder: may the source be with you!.

[chen2023teaching] Xinyun Chen, Maxwell Lin, Nathanael Schärli, Denny Zhou. (2023). Teaching Large Language Models to Self-Debug.

[luo2023wizardcoder] Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, Daxin Jiang. (2023). WizardCoder: Empowering Code Large Language Models with Evol-Instruct.

[wang2021codet5] Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H. Hoi. (2021). CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. EMNLP.

[wang2023codet5plus] Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D. Q. Bui, Junnan Li, Steven C. H. Hoi. (2023). CodeT5+: Open Code Large Language Models for Code Understanding and Generation.

[guo2024deepseekcoder] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, Wenfeng Liang. (2024). DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence.

[gunasekar2023textbooks] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, Yuanzhi Li. (2023). Textbooks Are All You Need.

[chen2022codet] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, Weizhu Chen. (2022). CodeT: Code Generation with Generated Tests.

[muennighoff2024octopack] Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, Shayne Longpre. (2024). OctoPack: Instruction Tuning Code Large Language Models.

[jiang2023selfplanning] Xue Jiang, Yihong Dong, Lecheng Wang, Zheng Fang, Qiwei Shang, Ge Li, Zhi Jin, Wenpin Jiao. (2023). Self-planning Code Generation with Large Language Models.

[shen2023pangucoder2] Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu, Jichuan Ji, Jingyang Zhao, Yuenan Guo, Qianxiang Wang. (2023). PanGu-Coder2: Boosting Large Language Models for Code with Ranking Feedback.

[shypula2023learning] Alexander Shypula, Aman Madaan, Yimeng Zeng, Uri Alon, Jacob Gardner, Milad Hashemi, Graham Neubig, Parthasarathy Ranganathan, Osbert Bastani, Amir Yazdanbakhsh. (2023). Learning Performance-Improving Code Edits.

[jiang2023selfevolve] Shuyang Jiang, Yuhao Wang, Yu Wang. (2023). SelfEvolve: A Code Evolution Framework via Large Language Models.

[wei2023magicoder] Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, Lingming Zhang. (2023). Magicoder: Source Code Is All You Need.

[tipirneni2024structcoder] Tipirneni, Sindhu, Zhu, Ming, Reddy, Chandan K.. (2024). StructCoder: Structure-Aware Transformer for Code Generation. ACM Trans. Knowl. Discov. Data. doi:10.1145/3636430.

[zhuo2024astraios] Terry Yue Zhuo, Armel Zebaze, Nitchakarn Suppattarachai, Leandro von Werra, Harm de Vries, Qian Liu, Niklas Muennighoff. (2024). Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models.

[weyssow2024exploring] Martin Weyssow, Xin Zhou, Kisub Kim, David Lo, Houari Sahraoui. (2024). Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation with Large Language Models.

[zheng2024survey] Zibin Zheng, Kaiwen Ning, Yanlin Wang, Jingwen Zhang, Dewu Zheng, Mingxi Ye, Jiachi Chen. (2024). A Survey of Large Language Models for Code: Evolution, Benchmarking, and Future Trends.

[nijkamp2023xgen7b] Erik Nijkamp, Tian Xie, Hiroaki Hayashi, Bo Pang, Congying Xia, Chen Xing, Jesse Vig, Semih Yavuz, Philippe Laban, Ben Krause, Senthil Purushwalkam, Tong Niu, Wojciech Kryściński, Lidiya Murakhovs'ka, Prafulla Kumar Choubey, Alex Fabbri, Ye Liu, Rui Meng, Lifu Tu, Meghana Bhat, Chien-Sheng Wu, Silvio Savarese, Yingbo Zhou, Shafiq Joty, Caiming Xiong. (2023). XGen-7B Technical Report.

[di2023codefuse] Di, Peng, Li, Jianguo, Yu, Hang, Jiang, Wei, Cai, Wenting, Cao, Yang, Chen, Chaoyu, Chen, Dajun, Chen, Hongwei, Chen, Liang, others. (2023). CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model. arXiv preprint arXiv:2310.06266.

[li2024instructcoder] Kaixin Li, Qisheng Hu, Xu Zhao, Hui Chen, Yuxi Xie, Tiedong Liu, Qizhe Xie, Junxian He. (2024). InstructCoder: Instruction Tuning Large Language Models for Code Editing.

[yadav2023exploring] Prateek Yadav, Qing Sun, Hantian Ding, Xiaopeng Li, Dejiao Zhang, Ming Tan, Xiaofei Ma, Parminder Bhatia, Ramesh Nallapati, Murali Krishna Ramanathan, Mohit Bansal, Bing Xiang. (2023). Exploring Continual Learning for Code Generation Models.

[lozhkov2024starcoder] Anton Lozhkov, Raymond Li, Loubna Ben Allal, others. (2024). StarCoder 2 and The Stack v2: The Next Generation.

[paul2024ircoder] Indraneil Paul, Jun Luo, Goran Glavaš, Iryna Gurevych. (2024). IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators.

[ma2024llamoco] Zeyuan Ma, Hongshu Guo, Jiacheng Chen, Guojun Peng, Zhiguang Cao, Yining Ma, Yue-Jiao Gong. (2024). LLaMoCo: Instruction Tuning of Large Language Models for Optimization Code Generation.

[agarwal2024structured] Mayank Agarwal, Yikang Shen, Bailin Wang, Yoon Kim, Jie Chen. (2024). Structured Code Representations Enable Data-Efficient Adaptation of Code Language Models.

[song2024code] Demin Song, Honglin Guo, Yunhua Zhou, Shuhao Xing, Yudong Wang, Zifan Song, Wenwei Zhang, Qipeng Guo, Hang Yan, Xipeng Qiu, Dahua Lin. (2024). Code Needs Comments: Enhancing Code LLMs with Comment Augmentation.

[he2024instruction] Jingxuan He, Mark Vero, Gabriela Krasnopolska, Martin Vechev. (2024). Instruction Tuning for Secure Code Generation.

[gong2024astt5] Linyuan Gong, Mostafa Elhoushi, Alvin Cheung. (2024). AST-T5: Structure-Aware Pretraining for Code Generation and Understanding.

[huang2024knowledgeaware] Tao Huang, Zhihong Sun, Zhi Jin, Ge Li, Chen Lyu. (2024). Knowledge-Aware Code Generation with Large Language Models.

[weyssow2023codell] Martin Weyssow, Claudio Di Sipio, Davide Di Ruscio, Houari Sahraoui. (2023). CodeLL: A Lifelong Learning Dataset to Support the Co-Evolution of Data and Language Models of Code.

[chai2023ernie] Chai, Yekun, Wang, Shuohuan, Pang, Chao, Sun, Yu, Tian, Hao, Wu, Hua. (2023). ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages. Findings of the Association for Computational Linguistics: ACL 2023. doi:10.18653/v1/2023.findings-acl.676.

[allal2023santacoder] Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra. (2023). SantaCoder: don't reach for the stars!.

[zheng2023codegeex] Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, Jie Tang. (2023). CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X.

[cheng2024adapting] Daixuan Cheng, Shaohan Huang, Furu Wei. (2024). Adapting Large Language Models via Reading Comprehension.

[vandekar2022dont] van de Kar, Mozes, Xia, Mengzhou, Chen, Danqi, Artetxe, Mikel. (2022). Don't Prompt, Search! Mining-based Zero-Shot Learning with Language Models. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2022.emnlp-main.509.

[liu2020exploring] Zihan Liu, Genta Indra Winata, Andrea Madotto, Pascale Fung. (2020). Exploring Fine-tuning Techniques for Pre-trained Cross-lingual Models via Continual Learning.

[ke2021achieve] Ke, Zixuan, Liu, Bing, Ma, Nianzu, Xu, Hu, Lei, Shu. (2021). Achieving Forgetting Prevention and Knowledge Transfer in Continual Learning. NeurIPS.

[ke2021adapting] Ke, Zixuan, Xu, Hu, Liu, Bing. (2021). Adapting BERT for Continual Learning of a Sequence of Aspect Sentiment Classification Tasks. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. doi:10.18653/v1/2021.naacl-main.378.

[ni2023continual] Ni, Zixuan, Wei, Longhui, Tang, Siliang, Zhuang, Yueting, Tian, Qi. (2023). Continual vision-language representation learning with off-diagonal information. Proceedings of the 40th International Conference on Machine Learning.

[zheng2024antiforgetting] Junhao Zheng, Qianli Ma, Zhen Liu, Binquan Wu, Huawen Feng. (2024). Beyond Anti-Forgetting: Multimodal Continual Instruction Tuning with Positive Forward Transfer.

[tao2022can] Tao, Mingxu, Feng, Yansong, Zhao, Dongyan. (2022). Can bert refrain from forgetting on sequential tasks? a probing study. The Eleventh International Conference on Learning Representations.

[wei2022circle] Yuan, Wei, Zhang, Quanjun, He, Tieke, Fang, Chunrong, Hung, Nguyen Quoc Viet, Hao, Xiaodong, Yin, Hongzhi. (2022). CIRCLE: continual repair across programming languages. Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. doi:10.1145/3533767.3534219.

[song2023conpet] Chenyang Song, Xu Han, Zheni Zeng, Kuai Li, Chen Chen, Zhiyuan Liu, Maosong Sun, Tao Yang. (2023). ConPET: Continual Parameter-Efficient Tuning for Large Language Models.

[kane2022continual] Aditya Kane, V Manushree, Sahil Khose. (2022). Continual VQA for Disaster Response Systems.

[raghavan2023engineering] Guruprasad Raghavan, Bahey Tharwat, Surya Narayanan Hari, Dhruvil Satani, Matt Thomson. (2023). Engineering flexible machine learning systems by traversing functionally-invariant paths.

[bai2023enhancing] Xueying Bai, Jinghuan Shang, Yifan Sun, Niranjan Balasubramanian. (2023). Enhancing Continual Learning with Global Prototypes: Counteracting Negative Representation Drift.

[scialom2022fine] Scialom, Thomas, Chakrabarty, Tuhin, Muresan, Smaranda. (2022). Fine-tuned Language Models are Continual Learners. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2022.emnlp-main.410.

[huang2022fpt] Huang, Yufei, Qin, Yujia, Wang, Huadong, Yin, Yichun, Sun, Maosong, Liu, Zhiyuan, Liu, Qun. (2022). FPT: Improving Prompt Tuning Efficiency via Progressive Training. Findings of the Association for Computational Linguistics: EMNLP 2022. doi:10.18653/v1/2022.findings-emnlp.511.

[luo2023investigating] Yun Luo, Zhen Yang, Xuefeng Bai, Fandong Meng, Jie Zhou, Yue Zhang. (2023). Investigating Forgetting in Pre-Trained Representations Through Continual Learning.

[shiri2023l3] Aidin Shiri, Kaushik Roy, Amit Sheth, Manas Gaur. (2023). L3 Ensembles: Lifelong Learning Approach for Ensemble of Foundational Language Models.

[zheng2023learn] Junhao Zheng, Shengjie Qiu, Qianli Ma. (2023). Learn or Recall? Revisiting Incremental Learning with Pre-trained Language Models.

[qin2021lfpt5] Qin, Chengwei, Joty, Shafiq. (2021). LFPT5: A Unified Framework for Lifelong Few-shot Language Learning Based on Prompt Tuning of T5. International Conference on Learning Representations.

[lin2024mitigating] Yong Lin, Hangyu Lin, Wei Xiong, Shizhe Diao, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang, Wenbin Hu, Hanning Zhang, Hanze Dong, Renjie Pi, Han Zhao, Nan Jiang, Heng Ji, Yuan Yao, Tong Zhang. (2024). Mitigating the Alignment Tax of RLHF.

[clark2018think] Clark, Peter, Cowhey, Isaac, Etzioni, Oren, Khot, Tushar, Sabharwal, Ashish, Schoenick, Carissa, Tafjord, Oyvind. (2018). Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.

[dua2019drop] Dua, Dheeru, Wang, Yizhong, Dasigi, Pradeep, Stanovsky, Gabriel, Singh, Sameer, Gardner, Matt. (2019). DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161.

[bojar2014findings] Bojar, Ondřej, others. (2014). Findings of the 2014 workshop on statistical machine translation. Proceedings of the ninth workshop on statistical machine translation.

[rajpurkar2018know] Rajpurkar, Pranav, Jia, Robin, Liang, Percy. (2018). Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.

[bisk2020piqa] Bisk, Yonatan, Zellers, Rowan, Gao, Jianfeng, Choi, Yejin, others. (2020). Piqa: Reasoning about physical commonsense in natural language. Proceedings of the AAAI conference on artificial intelligence.

[lai2017race] Lai, Guokun, Xie, Qizhe, Liu, Hanxiao, Yang, Yiming, Hovy, Eduard. (2017). Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.

[volske2017tl] Völske, Michael, Potthast, Martin, Syed, Shahbaz, Stein, Benno. (2017). TL;DR: Mining reddit to learn automatic summarization. Proceedings of the Workshop on New Frontiers in Summarization.

[weyssow2023usage] Weyssow, Martin, Zhou, Xin, Kim, Kisub, Lo, David, Sahraoui, Houari. (2023). On the Usage of Continual Learning for Out-of-Distribution Generalization in Pre-trained Language Models of Code. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. doi:10.1145/3611643.3616244.

[winata2023overcoming] Winata, Genta, Xie, Lingjue, Radhakrishnan, Karthik, Wu, Shijie, Jin, Xisen, Cheng, Pengxiang, Kulkarni, Mayank, Preotiuc-Pietro, Daniel. (2023). Overcoming Catastrophic Forgetting in Massively Multilingual Continual Learning. Findings of the Association for Computational Linguistics: ACL 2023. doi:10.18653/v1/2023.findings-acl.48.

[chen2024parameterizing] Chen, Yongrui, Zhang, Shenyu, Qi, Guilin, Guo, Xinnan. (2024). Parameterizing Context: Unleashing the Power of Parameter-Efficient Fine-Tuning and In-Context Tuning for Continual Table Semantic Parsing. Advances in Neural Information Processing Systems.

[zhong2022proqa] Zhong, Wanjun, Gao, Yifan, Ding, Ning, Qin, Yujia, Liu, Zhiyuan, Zhou, Ming, Wang, Jiahai, Yin, Jian, Duan, Nan. (2022). ProQA: Structural Prompt-based Pre-training for Unified Question Answering. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. doi:10.18653/v1/2022.naacl-main.313.

[panigrahi2023task] Panigrahi, Abhishek, Saunshi, Nikunj, Zhao, Haoyu, Arora, Sanjeev. (2023). Task-specific skill localization in fine-tuned language models. Proceedings of the 40th International Conference on Machine Learning.

[lin2004rouge] Lin, Chin-Yew. (2004). Rouge: A package for automatic evaluation of summaries. Text summarization branches out.

[papineni2002bleu] Papineni, Kishore, Roukos, Salim, Ward, Todd, Zhu, Wei-Jing. (2002). Bleu: a method for automatic evaluation of machine translation. Proceedings of the 40th annual meeting of the Association for Computational Linguistics.

[banerjee2005meteor] Banerjee, Satanjeev, Lavie, Alon. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization.

[hendrycks2023aligning] Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt. (2023). Aligning AI With Shared Human Values.

[schulman2017proximal] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov. (2017). Proximal Policy Optimization Algorithms.

[ji2024ai] Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang, Fanzhi Zeng, Kwan Yee Ng, Juntao Dai, Xuehai Pan, Aidan O'Gara, Yingshan Lei, Hua Xu, Brian Tse, Jie Fu, Stephen McAleer, Yaodong Yang, Yizhou Wang, Song-Chun Zhu, Yike Guo, Wen Gao. (2024). AI Alignment: A Comprehensive Survey.

[wu2022survey] Wu, Xingjiao, Xiao, Luwei, Sun, Yixuan, Zhang, Junhang, Ma, Tianlong, He, Liang. (2022). A survey of human-in-the-loop for machine learning. Future Generation Computer Systems.

[taori2023alpaca] Taori, Rohan, Gulrajani, Ishaan, Zhang, Tianyi, Dubois, Yann, Li, Xuechen, Guestrin, Carlos, Liang, Percy, Hashimoto, Tatsunori B. (2023). Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html.

[wang2022selfinstruct] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi. (2022). Self-Instruct: Aligning Language Models with Self-Generated Instructions.

[ouyang2022rlhf] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe. (2022). Training language models to follow instructions with human feedback.

[rafailov2024dpo] Rafailov, Rafael, Sharma, Archit, Mitchell, Eric, Manning, Christopher D, Ermon, Stefano, Finn, Chelsea. (2024). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems.

[maas2011learning] Maas, Andrew, Daly, Raymond E, Pham, Peter T, Huang, Dan, Ng, Andrew Y, Potts, Christopher. (2011). Learning word vectors for sentiment analysis. Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies.

[bai2022training] Bai, Yuntao, Jones, Andy, Ndousse, Kamal, Askell, Amanda, Chen, Anna, DasSarma, Nova, Drain, Dawn, Fort, Stanislav, Ganguli, Deep, Henighan, Tom, others. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.

[stiennon2022learning] Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano. (2022). Learning to summarize from human feedback.

[zhang2023copf] Zhang, Han, Gui, Lin, Zhai, Yuanzhao, Wang, Hui, Lei, Yu, Xu, Ruifeng. (2023). Copf: Continual learning human preference through optimal policy fitting. arXiv preprint arXiv:2310.15694.

[zhangcppo] Zhang, Han, Lei, Yu, Gui, Lin, Yang, Min, He, Yulan, Wang, Hui, Xu, Ruifeng. (2024). CPPO: Continual Learning for Reinforcement Learning with Human Feedback. International Conference on Learning Representations.

[puthumanaillam2024moral] Puthumanaillam, Gokul, Vora, Manav, Thangeda, Pranay, Ornik, Melkior. (2024). A Moral Imperative: The Need for Continual Superalignment of Large Language Models. arXiv preprint arXiv:2403.14683.

[houlsby2019parameter] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan. (2019). Parameter-efficient transfer learning for NLP. Proceedings of the 36th International Conference on Machine Learning.

[kirkpatrick2017overcoming] Kirkpatrick, James, Pascanu, Razvan, Rabinowitz, Neil, Veness, Joel, Desjardins, Guillaume, Rusu, Andrei A, Milan, Kieran, Quan, John, Ramalho, Tiago, Grabska-Barwinska, Agnieszka, others. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences.

[rusu2016progressive] Rusu, Andrei A, Rabinowitz, Neil C, Desjardins, Guillaume, Soyer, Hubert, Kirkpatrick, James, Kavukcuoglu, Koray, Pascanu, Razvan, Hadsell, Raia. (2016). Progressive neural networks. arXiv preprint arXiv:1606.04671.

[robins1995catastrophic] Robins, Anthony. (1995). Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science.

[liu2022few] Liu, Haokun, Tam, Derek, Muqeeth, Mohammed, Mohta, Jay, Huang, Tenghao, Bansal, Mohit, Raffel, Colin A. (2022). Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems.

[hu2021lora] Hu, Edward J, Shen, Yelong, Wallis, Phillip, Allen-Zhu, Zeyuan, Li, Yuanzhi, Wang, Shean, Wang, Lu, Chen, Weizhu. (2021). Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

[PARAREL] Elazar, Yanai, Kassner, Nora, Ravfogel, Shauli, Ravichander, Abhilasha, Hovy, Eduard, Schütze, Hinrich, Goldberg, Yoav. (2021). Measuring and improving consistency in pretrained language models. Transactions of the Association for Computational Linguistics.

[Wikireading] Hewlett, Daniel, Lacoste, Alexandre, Jones, Llion, Polosukhin, Illia, Fandrianto, Andrew, Han, Jay, Kelcey, Matthew, Berthelot, David. (2016). Wikireading: A novel large-scale language understanding task over wikipedia. arXiv preprint arXiv:1608.03542.

[Dbpedia] Brümmer, Martin, Dojchinovski, Milan, Hellmann, Sebastian. (2016). Dbpedia abstracts: A large-scale, open, multilingual NLP training corpus. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16).

[fever] Thorne, James, Vlachos, Andreas, Christodoulopoulos, Christos, Mittal, Arpit. (2018). FEVER: a large-scale dataset for fact extraction and VERification. arXiv preprint arXiv:1803.05355.

[zsRE] Levy, Omer, Seo, Minjoon, Choi, Eunsol, Zettlemoyer, Luke. (2017). Zero-shot relation extraction via reading comprehension. arXiv preprint arXiv:1706.04115.

[scotus] Chalkidis, Ilias, Pasini, Tommaso, Zhang, Sheng, Tomada, Letizia, Schwemer, Sebastian Felix, Søgaard, Anders. (2022). Fairlex: A multilingual benchmark for evaluating fairness in legal text processing. arXiv preprint arXiv:2203.07228.

[vitaminC] Schuster, Tal, Fisch, Adam, Barzilay, Regina. (2021). Get your vitamin C! robust fact verification with contrastive evidence. arXiv preprint arXiv:2103.08541.

[nq] Kwiatkowski, Tom, Palomaki, Jennimaria, Redfield, Olivia, Collins, Michael, Parikh, Ankur, Alberti, Chris, Epstein, Danielle, Polosukhin, Illia, Devlin, Jacob, Lee, Kenton, others. (2019). Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics.

[T-rex] Elsahar, Hady, Vougiouklis, Pavlos, Remaci, Arslen, Gravier, Christophe, Hare, Jonathon, Laforest, Frederique, Simperl, Elena. (2018). T-rex: A large scale alignment of natural language with knowledge base triples. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

[meng2022locating] Meng, Kevin, Bau, David, Andonian, Alex, Belinkov, Yonatan. (2022). Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems.

[li2022large] Li, Daliang, Rawat, Ankit Singh, Zaheer, Manzil, Wang, Xin, Lukasik, Michal, Veit, Andreas, Yu, Felix, Kumar, Sanjiv. (2022). Large language models with controllable working memory. arXiv preprint arXiv:2211.05110.

[mazzia2023survey] Mazzia, Vittorio, Pedrani, Alessandro, Caciolai, Andrea, Rottmann, Kay, Bernardi, Davide. (2023). A Survey on Knowledge Editing of Neural Networks. arXiv preprint arXiv:2310.19704.

[dong2022calibrating] Dong, Qingxiu, Dai, Damai, Song, Yifan, Xu, Jingjing, Sui, Zhifang, Li, Lei. (2022). Calibrating factual knowledge in pretrained language models. arXiv preprint arXiv:2210.03329.

[sinitsin2020editable] Sinitsin, Anton, Plokhotnyuk, Vsevolod, Pyrkin, Dmitriy, Popov, Sergei, Babenko, Artem. (2020). Editable neural networks. arXiv preprint arXiv:2004.00345.

[de2021editing] De Cao, Nicola, Aziz, Wilker, Titov, Ivan. (2021). Editing factual knowledge in language models. arXiv preprint arXiv:2104.08164.

[fast_edit] Mitchell, Eric, Lin, Charles, Bosselut, Antoine, Finn, Chelsea, Manning, Christopher D. (2021). Fast model editing at scale. arXiv preprint arXiv:2110.11309.

[hase2021language] Hase, Peter, Diab, Mona, Celikyilmaz, Asli, Li, Xian, Kozareva, Zornitsa, Stoyanov, Veselin, Bansal, Mohit, Iyer, Srinivasan. (2021). Do language models have beliefs? methods for detecting, updating, and visualizing model beliefs. arXiv preprint arXiv:2111.13654.

[mitchell2022memory] Mitchell, Eric, Lin, Charles, Bosselut, Antoine, Manning, Christopher D, Finn, Chelsea. (2022). Memory-based model editing at scale. International Conference on Machine Learning.

[hase2023does] Hase, Peter, Bansal, Mohit, Kim, Been, Ghandeharioun, Asma. (2023). Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models.

[meng2022mass] Meng, Kevin, Sharma, Arnab Sen, Andonian, Alex, Belinkov, Yonatan, Bau, David. (2022). Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229.

[huang2023transformer] Huang, Zeyu, Shen, Yikang, Zhang, Xiaofeng, Zhou, Jie, Rong, Wenge, Xiong, Zhang. (2023). Transformer-patcher: One mistake worth one neuron. arXiv preprint arXiv:2301.09785.

[hartvigsen2023aging] Hartvigsen, Thomas, Sankaranarayanan, Swami, Palangi, Hamid, Kim, Yoon, Ghassemi, Marzyeh. (2023). Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors. Advances in Neural Information Processing Systems.

[lin2022continual] Lin, Bill Yuchen, Wang, Sida, Lin, Xi, Jia, Robin, Xiao, Lin, Ren, Xiang, Yih, Scott. (2022). On Continual Model Refinement in Out-of-Distribution Data Streams. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2022.acl-long.223.

[hu2024wilke] Chenhui Hu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao. (2024). WilKE: Wise-Layer Knowledge Editor for Lifelong Knowledge Editing.

[das2024larimar] Das, Payel, Chaudhury, Subhajit, Nelson, Elliot, Melnyk, Igor, Swaminathan, Sarath, Dai, Sihui, Lozano, Aurélie, others. (2024). Larimar: Large Language Models with Episodic Memory Control. arXiv preprint arXiv:2403.11901.

[yu2023melo] Yu, Lang, Chen, Qin, Zhou, Jie, He, Liang. (2023). MELO: Enhancing Model Editing with Neuron-Indexed Dynamic LoRA. arXiv preprint arXiv:2312.11795.

[li2023continual] Linyang Li, Xipeng Qiu. (2023). Continual.

[yang2024moral] Shu Yang, Muhammad Asif Ali, Cheng-Long Wang, Lijie Hu, Di Wang. (2024). MoRAL: MoE Augmented LoRA for LLMs' Lifelong Learning.

[jang2023exploring] Jang, Joel, Kim, Seungone, Ye, Seonghyeon, Kim, Doyoung, Logeswaran, Lajanugen, Lee, Moontae, Lee, Kyungjae, Seo, Minjoon. (2023). Exploring the benefits of training expert language models over instruction tuning. Proceedings of the 40th International Conference on Machine Learning.

[wang2023trace] Xiao Wang, Yuansen Zhang, Tianze Chen, Songyang Gao, Senjie Jin, Xianjun Yang, Zhiheng Xi, Rui Zheng, Yicheng Zou, Tao Gui, Qi Zhang, Xuanjing Huang. (2023). TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models.

[mok2023large] Mok, Jisoo, Do, Jaeyoung, Lee, Sungjin, Taghavi, Tara, Yu, Seunghak, Yoon, Sungroh. (2023). Large-scale Lifelong Learning of In-context Instructions and How to Tackle It. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2023.acl-long.703.

[zhang2023citb] Zhang, Zihan, Fang, Meng, Chen, Ling, Namazi-Rad, Mohammad-Reza. (2023). CITB: A Benchmark for Continual Instruction Tuning. Findings of the Association for Computational Linguistics: EMNLP 2023. doi:10.18653/v1/2023.findings-emnlp.633.

[chen2024coin] Cheng Chen, Junchen Zhu, Xu Luo, Hengtao Shen, Lianli Gao, Jingkuan Song. (2024). CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language Model.

[huang2024mitigating] Jianheng Huang, Leyang Cui, Ante Wang, Chengyi Yang, Xinting Liao, Linfeng Song, Junfeng Yao, Jinsong Su. (2024). Mitigating Catastrophic Forgetting in Large Language Models with Self-Synthesized Rehearsal.

[he2024dont] Yongquan He, Xuancheng Huang, Minghao Tang, Lingxun Meng, Xiang Li, Wei Lin, Wenyuan Zhang, Yifu Gao. (2024). Don't Half-listen: Capturing Key-part Information in Continual Instruction Tuning.

[yin2022contintin] Yin, Wenpeng, Li, Jia, Xiong, Caiming. (2022). ConTinTin: Continual Learning from Task Instructions. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2022.acl-long.218.

[wang2023orthogonal] Wang, Xiao, Chen, Tianze, Ge, Qiming, Xia, Han, Bao, Rong, Zheng, Rui, Zhang, Qi, Gui, Tao, Huang, Xuanjing. (2023). Orthogonal Subspace Learning for Language Model Continual Learning. Findings of the Association for Computational Linguistics: EMNLP 2023. doi:10.18653/v1/2023.findings-emnlp.715.

[zhao2024sapt] Weixiang Zhao, Shilong Wang, Yulin Hu, Yanyan Zhao, Bing Qin, Xuanyu Zhang, Qing Yang, Dongliang Xu, Wanxiang Che. (2024). SAPT: A Shared Attention Framework for Parameter-Efficient Continual Learning of Large Language Models.

[wei2022finetuned] Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le. (2022). Finetuned Language Models Are Zero-Shot Learners.

[wang2024inscl] Yifan Wang, Yafei Liu, Chufan Shi, Haoling Li, Chen Chen, Haonan Lu, Yujiu Yang. (2024). InsCL: A Data-efficient Continual Learning Paradigm for Fine-tuning Large Language Models with Instructions.

[zhu2023minigpt4] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny. (2023). MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models.

[liu2023visual] Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee. (2023). Visual Instruction Tuning.

[qi2024interactive] Biqing Qi, Xingquan Chen, Junqi Gao, Dong Li, Jianxing Liu, Ligang Wu, Bowen Zhou. (2024). Interactive Continual Learning: Fast and Slow Thinking.

[he2023continual] Jinghan He, Haiyun Guo, Ming Tang, Jinqiao Wang. (2023). Continual Instruction Tuning for Large Multimodal Models.

[zhu2024model] Didi Zhu, Zhongyi Sun, Zexi Li, Tao Shen, Ke Yan, Shouhong Ding, Kun Kuang, Chao Wu. (2024). Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models.

[zhao2024reconstruct] Shu Zhao, Xiaohan Zou, Tan Yu, Huijuan Xu. (2024). Reconstruct before Query: Continual Missing Modality Learning with Decomposed Prompt Collaboration.

[zhai2023investigating] Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma. (2023). Investigating the Catastrophic Forgetting in Multimodal Large Language Models.

[peng2023kosmos2] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei. (2023). Kosmos-2: Grounding Multimodal Large Language Models to the World.

[li2023otter] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, Ziwei Liu. (2023). Otter: A Multi-Modal Model with In-Context Instruction Tuning.

[dai2023instructblip] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. (2023). InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning.

[li2024videochat] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, Yu Qiao. (2024). VideoChat: Chat-Centric Video Understanding.

[kaplan2020scaling] Kaplan, Jared, McCandlish, Sam, Henighan, Tom, Brown, Tom B, Chess, Benjamin, Child, Rewon, Gray, Scott, Radford, Alec, Wu, Jeffrey, Amodei, Dario. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

[hoffmann2022training] Hoffmann, Jordan, Borgeaud, Sebastian, Mensch, Arthur, Buchatskaya, Elena, Cai, Trevor, Rutherford, Eliza, Casas, Diego de Las, Hendricks, Lisa Anne, Welbl, Johannes, Clark, Aidan, others. (2022). Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

[devlin2018bert] Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, Toutanova, Kristina. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[liu2019roberta] Liu, Yinhan, Ott, Myle, Goyal, Naman, Du, Jingfei, Joshi, Mandar, Chen, Danqi, Levy, Omer, Lewis, Mike, Zettlemoyer, Luke, Stoyanov, Veselin. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

[chowdhery2023palm] Chowdhery, Aakanksha, Narang, Sharan, Devlin, Jacob, Bosma, Maarten, Mishra, Gaurav, Roberts, Adam, Barham, Paul, Chung, Hyung Won, Sutton, Charles, Gehrmann, Sebastian, others. (2023). Palm: Scaling language modeling with pathways. Journal of Machine Learning Research.

[anil2023palm] Anil, Rohan, Dai, Andrew M, Firat, Orhan, Johnson, Melvin, Lepikhin, Dmitry, Passos, Alexandre, Shakeri, Siamak, Taropa, Emanuel, Bailey, Paige, Chen, Zhifeng, others. (2023). Palm 2 technical report. arXiv preprint arXiv:2305.10403.

[touvron2023llama] Touvron, Hugo, Lavril, Thibaut, Izacard, Gautier, Martinet, Xavier, Lachaux, Marie-Anne, Lacroix, Timothée, others. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

[touvron2023llama2] Touvron, Hugo, Martin, Louis, Stone, Kevin, Albert, Peter, Almahairi, Amjad, Babaei, Yasmine, Bashlykov, Nikolay, Batra, Soumya, Bhargava, Prajjwal, Bhosale, Shruti, others. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[radford2019language] Radford, Alec, Wu, Jeffrey, Child, Rewon, Luan, David, Amodei, Dario, Sutskever, Ilya, others. (2019). Language models are unsupervised multitask learners. OpenAI blog.

[brown2020language] Brown, Tom, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared D, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, others. (2020). Language models are few-shot learners. Advances in neural information processing systems.

[achiam2023gpt] Achiam, Josh, Adler, Steven, Agarwal, Sandhini, Ahmad, Lama, Akkaya, Ilge, Aleman, Florencia Leoni, Almeida, Diogo, Altenschmidt, Janko, Altman, Sam, Anadkat, Shyamal, others. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.

[achiam2022chatgpt] OpenAI. (2022). Introducing ChatGPT. [Online]. Available: https://openai.com/blog/chatgpt.

[raffel2020exploring] Raffel, Colin, Shazeer, Noam, Roberts, Adam, Lee, Katherine, Narang, Sharan, Matena, Michael, Zhou, Yanqi, Li, Wei, Liu, Peter J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research.

[luo2023empirical] Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, Yue Zhang. (2023). An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning.

[ghosh2024closer] Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Ramaneswaran S, Deepali Aneja, Zeyu Jin, Ramani Duraiswami, Dinesh Manocha. (2024). A Closer Look at the Limitations of Instruction Tuning.

[zenke2017continual] Friedemann Zenke, Ben Poole, Surya Ganguli. (2017). Continual Learning Through Synaptic Intelligence.

[madotto2020continual] Andrea Madotto, Zhaojiang Lin, Zhenpeng Zhou, Seungwhan Moon, Paul Crook, Bing Liu, Zhou Yu, Eunjoon Cho, Zhiguang Wang. (2020). Continual Learning in Task-Oriented Dialogue Systems.

[shin2017continual] Hanul Shin, Jung Kwon Lee, Jaehong Kim, Jiwon Kim. (2017). Continual Learning with Deep Generative Replay.

[christiano2017deep] Christiano, Paul F, Leike, Jan, Brown, Tom, Martic, Miljan, Legg, Shane, Amodei, Dario. (2017). Deep reinforcement learning from human preferences. Advances in neural information processing systems.

[zhang2022continual] Zhang, Yanzhe, Wang, Xuezhi, Yang, Diyi. (2022). Continual Sequence Generation with Adaptive Compositional Modules. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2022.acl-long.255.

[ying2021mitigating] Yin, Haiyan, Yang, Peng, Li, Ping. (2021). Mitigating Forgetting in Online Continual Learning with Neuron Calibration. Advances in Neural Information Processing Systems.

[yang2022continual] Yang, Peng, Li, Dingcheng, Li, Ping. (2022). Continual Learning for Natural Language Generations with Transformer Calibration. Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL). doi:10.18653/v1/2022.conll-1.4.

[wang2022online] Wang, Zhen, Liu, Liu, Kong, Yajing, Guo, Jiaxian, Tao, Dacheng. (2022). Online continual learning with contrastive vision transformer. European Conference on Computer Vision.

[bornschein2024transformers] Jorg Bornschein, Yazhe Li, Amal Rannen-Triki. (2024). Transformers for Supervised Online Continual Learning.

[lu2023ibcl] Pengyuan Lu, Michele Caprio, Eric Eaton, Insup Lee. (2023). IBCL: Zero-shot Model Generation for Task Trade-offs in Continual Learning.

[nguyen2022survey] Thanh Tam Nguyen, Thanh Trung Huynh, Phi Le Nguyen, Alan Wee-Chung Liew, Hongzhi Yin, Quoc Viet Hung Nguyen. (2022). A Survey of Machine Unlearning.

[bourtoule2020machine] Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, Nicolas Papernot. (2020). Machine Unlearning.

[pourcel2022online] Pourcel, Julien, Vu, Ngoc-Son, French, Robert M.. (2022). Online Task-free Continual Learning with Dynamic Sparse Distributed Memory. Computer Vision -- ECCV 2022.

[ramsauer2021hopfield] Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Thomas Adler, Lukas Gruber, Markus Holzleitner, Milena Pavlović, Geir Kjetil Sandve, Victor Greiff, David Kreil, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter. (2021). Hopfield Networks is All You Need.

[wu2018kanerva] Wu, Yan, Wayne, Greg, Graves, Alex, Lillicrap, Timothy. (2018). The kanerva machine: A generative distributed memory. arXiv preprint arXiv:1804.01756.

[wang2022sparcl] Wang, Zifeng, Zhan, Zheng, Gong, Yifan, Yuan, Geng, Niu, Wei, Jian, Tong, Ren, Bin, Ioannidis, Stratis, Wang, Yanzhi, Dy, Jennifer. (2022). Sparcl: Sparse continual learning on the edge. Advances in Neural Information Processing Systems.

[prabhu2023computationally] Prabhu, Ameya, Al Kader Hammoud, Hasan Abed, Dokania, Puneet K, Torr, Philip HS, Lim, Ser-Nam, Ghanem, Bernard, Bibi, Adel. (2023). Computationally budgeted continual learning: What does matter?. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[jin2024model] Xisen Jin, Xiang Ren. (2024). What Will My Model Forget? Forecasting Forgotten Examples in Language Model Refinement.

[wistuba2023continual] Martin Wistuba, Prabhu Teja Sivaprasad, Lukas Balles, Giovanni Zappella. (2023). Continual Learning with Low Rank Adaptation.

[huang2023lorahub] Huang, Chengsong, Liu, Qian, Lin, Bill Yuchen, Pang, Tianyu, Du, Chao, Lin, Min. (2023). Lorahub: Efficient cross-task generalization via dynamic lora composition. arXiv preprint arXiv:2307.13269.

[shao2023class] Shao, Yijia, Guo, Yiduo, Zhao, Dongyan, Liu, Bing. (2023). Class-Incremental Learning based on Label Generation. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). doi:10.18653/v1/2023.acl-short.109.

[dalessandro2023multimodal] D'Alessandro, Marco, Alonso, Alberto, Calabr'es, Enrique, Galar, Mikel. (2023). Multimodal Parameter-Efficient Few-Shot Class Incremental Learning. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops.

[cao2024generative] Cao, Xusheng, Lu, Haori, Huang, Linlan, Liu, Xialei, Cheng, Ming-Ming. (2024). Generative Multi-modal Models are Good Class Incremental Learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[yang2024reawakening] Yanlai Yang, Matt Jones, Michael C. Mozer, Mengye Ren. (2024). Reawakening knowledge: Anticipatory recovery from catastrophic interference via structured training.

[wang2022supernaturalinstructions] Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Gary Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Maitreya Patel, Kuntal Kumar Pal, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Shailaja Keyur Sampat, Savan Doshi, Siddhartha Mishra, Sujan Reddy, Sumanta Patro, Tanay Dixit, Xudong Shen, Chitta Baral, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi, Daniel Khashabi. (2022). Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks.

[mishra2021natural] Mishra, Swaroop, Khashabi, Daniel, Baral, Chitta, Hajishirzi, Hannaneh. (2021). Natural Instructions: Benchmarking Generalization to New Tasks from Natural Language Instructions. arXiv preprint arXiv:2104.08773.

[zhang2024instruction] Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, Guoyin Wang. (2024). Instruction Tuning for Large Language Models: A Survey.

[jiang2024instructiontuned] Zhengbao Jiang, Zhiqing Sun, Weijia Shi, Pedro Rodriguez, Chunting Zhou, Graham Neubig, Xi Victoria Lin, Wen-tau Yih, Srinivasan Iyer. (2024). Instruction-tuned Language Models are Better Knowledge Learners.

[sanh2022multitask] Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Tali Bers, Stella Biderman, Leo Gao, Thomas Wolf, Alexander M. Rush. (2022). Multitask Prompted Training Enables Zero-Shot Task Generalization.

[goyal2017making] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, Devi Parikh. (2017). Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering.

[gurari2018vizwiz] Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, Jeffrey P. Bigham. (2018). VizWiz Grand Challenge: Answering Visual Questions from Blind People.

[lu2022learn] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, Ashwin Kalyan. (2022). Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering.

[singh2019vqa] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, Marcus Rohrbach. (2019). Towards VQA Models That Can Read.

[hudson2019gqa] Drew A. Hudson, Christopher D. Manning. (2019). GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering.

[mishraICDAR19] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, Anirban Chakraborty. (2019). OCR-VQA: Visual Question Answering by Reading Text in Images. ICDAR.

[imagenet_cvpr09] Kazemzadeh, Sahar, Ordonez, Vicente, Matten, Mark, Berg, Tamara. (2014). ReferItGame: Referring to Objects in Photographs of Natural Scenes. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). doi:10.3115/v1/D14-1086.

[mao2016generation] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, Kevin Murphy. (2016). Generation and Comprehension of Unambiguous Object Descriptions.

[shah2023trillion] Agam Shah, Suvan Paturi, Sudheer Chava. (2023). Trillion Dollar Words: A New Financial Dataset, Task & Market Analysis.

[hu2023meetingbank] Yebowen Hu, Tim Ganter, Hanieh Deilamsalehy, Franck Dernoncourt, Hassan Foroosh, Fei Liu. (2023). MeetingBank: A Benchmark Dataset for Meeting Summarization.

[zhao-etal-2023-c] Zhao, Chenye, Li, Yingjie, Caragea, Cornelia. (2023). C-STANCE: A Large Dataset for Chinese Zero-Shot Stance Detection. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2023.acl-long.747.

[kew-etal-2023-20] Kew, Tannon, Kostrzewa, Marek, Ebling, Sarah. (2023). 20 Minuten: A Multi-task News Summarisation Dataset for German. Proceedings of the 8th edition of the Swiss Text Analytics Conference.

[lu2021codexglue] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shujie Liu. (2021). CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation.

[mishra2022numglue] Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, Peter Clark, Chitta Baral, Ashwin Kalyan. (2022). NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks.

[huang2019cosmos] Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi. (2019). Cosmos QA: Machine Reading Comprehension with Contextual Commonsense Reasoning.

[khashabi-etal-2017-learning] Khashabi, Daniel, Khot, Tushar, Sabharwal, Ashish, Roth, Dan. (2017). Learning What is Essential in Questions. Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). doi:10.18653/v1/K17-1010.

[zhou2019goingvacationtakeslonger] Ben Zhou, Daniel Khashabi, Qiang Ning, Dan Roth. (2019). "Going on a vacation" takes longer than "Going for a walk": A Study of Temporal Commonsense Understanding.

[khashabi-etal-2018-looking] Khashabi, Daniel, Chaturvedi, Snigdha, Roth, Michael, Upadhyay, Shyam, Roth, Dan. (2018). Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).

[khot2020qasc] Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, Ashish Sabharwal. (2020). QASC: A Dataset for Question Answering via Sentence Composition.

[dasigi-etal-2019-quoref] Dasigi, Pradeep, Liu, Nelson F., Marasović, Ana, Smith, Noah A., Gardner, Matt. (2019). Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). doi:10.18653/v1/D19-1606.

[lin2019reasoning] Kevin Lin, Oyvind Tafjord, Peter Clark, Matt Gardner. (2019). Reasoning Over Paragraph Effects in Situations.

[sakaguchi2019winogrande] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi. (2019). WinoGrande: An Adversarial Winograd Schema Challenge at Scale.

[zhao2024large] Zhao, Haokun, Han, Haixia, Shi, Jie, Du, Chengyu, Liang, Jiaqing, Xiao, Yanghua. (2024). Large Language Model Can Continue Evolving From Mistakes. arXiv preprint arXiv:2404.08707.

[lin2024rho] Lin, Zhenghao, Gou, Zhibin, Gong, Yeyun, Liu, Xiao, Shen, Yelong, Xu, Ruochen, Lin, Chen, Yang, Yujiu, Jiao, Jian, Duan, Nan, others. (2024). Rho-1: Not All Tokens Are What You Need. arXiv preprint arXiv:2404.07965.

[ibrahim2024simple] Ibrahim, Adam, Thérien, Benjamin, Gupta, Kshitij, Richter, Mats L., Anthony, Quentin, Lesort, Timothée, Belilovsky, Eugene, Rish, Irina. (2024). Simple and scalable strategies to continually pre-train large language models. arXiv preprint arXiv:2403.08763.

[yildiz2024investigating] Yıldız, Çağatay, Ravichandran, Nishaanth Kanna, Punia, Prishruit, Bethge, Matthias, Ermis, Beyza. (2024). Investigating Continual Pretraining in Large Language Models: Insights and Implications. arXiv preprint arXiv:2402.17400.

[chen2024take] Chen, Xuxi, Wang, Zhendong, Sow, Daouda, Yang, Junjie, Chen, Tianlong, Liang, Yingbin, Zhou, Mingyuan, Wang, Zhangyang. (2024). Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization. arXiv preprint arXiv:2402.14270.

[li2024blade] Li, Haitao, Ai, Qingyao, Chen, Jia, Dong, Qian, Wu, Zhijing, Liu, Yiqun, Chen, Chong, Tian, Qi. (2024). BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models. arXiv preprint arXiv:2403.18365.

[shen2024tag] Shen, Junhong, Tenenholtz, Neil, Hall, James Brian, Alvarez-Melis, David, Fusi, Nicolo. (2024). Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains. arXiv preprint arXiv:2402.05140.

[acikgoz2024hippocrates] Acikgoz, Emre Can, İnce, Osman Batur, others. (2024). Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare. arXiv preprint arXiv:2404.16621.

[he2024foundation] He, Yuting, Huang, Fuxiang, Jiang, Xinrui, Nie, Yuxiang, Wang, Minghao, Wang, Jiguang, Chen, Hao. (2024). Foundation Model for Advancing Healthcare: Challenges, Opportunities, and Future Directions. arXiv preprint arXiv:2404.03264.

[xie2024me] Xie, Qianqian, Chen, Qingyu, Chen, Aokun, Peng, Cheng, Hu, Yan, Lin, Fongci, Peng, Xueqing, Huang, Jimin, Zhang, Jeffrey, Keloth, Vipina, others. (2024). Me LLaMA: Foundation Large Language Models for Medical Applications. arXiv preprint arXiv:2402.12749.

[hirano2024construction] Hirano, Masanori, Imajo, Kentaro. (2024). Construction of Domain-specified Japanese Large Language Model for Finance through Continual Pre-training. arXiv preprint arXiv:2404.10555.

[takahashi2024pretraining] Takahashi, Kosuke, Omi, Takahiro, Arima, Kosuke, Ishigaki, Tatsuya. (2024). Pretraining and updating language-and domain-specific large language model: A case study in japanese business domain. arXiv preprint arXiv:2404.08262.

[thulke2024climategpt] Thulke, David, Gao, Yingbo, Pelser, Petrus, Brune, Rein, Jalota, Rricha, Fok, Floris, Ramos, Michael, van Wyk, Ian, Nasir, Abdallah, Goldstein, Hayden, others. (2024). ClimateGPT: Towards AI Synthesizing Interdisciplinary Research on Climate Change. arXiv preprint arXiv:2401.09646.

[fujii2024continual] Fujii, Kazuki, Nakamura, Taishi, Loem, Mengsay, Iida, Hiroki, Ohi, Masanari, Hattori, Kakeru, Shota, Hirai, Mizuki, Sakae, Yokota, Rio, Okazaki, Naoaki. (2024). Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities. arXiv preprint arXiv:2404.17790.

[dou2024sailor] Dou, Longxu, Liu, Qian, Zeng, Guangtao, Guo, Jia, Zhou, Jiahui, Lu, Wei, Lin, Min. (2024). Sailor: Open Language Models for South-East Asia. arXiv preprint arXiv:2404.03608.

[nakamura2024aurora] Nakamura, Taishi, Mishra, Mayank, Tedeschi, Simone, Chai, Yekun, Stillerman, Jason T, Friedrich, Felix, Yadav, Prateek, Laud, Tanmay, Chien, Vu Minh, Zhuo, Terry Yue, others. (2024). Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the US Executive Order. arXiv preprint arXiv:2404.00399.

[vo2024vi] Vo, James. (2024). Vi-Mistral-X: Building a Vietnamese Language Model with Advanced Continual Pre-training. arXiv preprint arXiv:2403.15470.

[wang2024wise] Wang, Peng, Li, Zexi, Zhang, Ningyu, Xu, Ziwen, Yao, Yunzhi, Jiang, Yong, Xie, Pengjun, Huang, Fei, Chen, Huajun. (2024). WISE: Rethinking the Knowledge Memory for Lifelong Model Editing of Large Language Models. arXiv preprint arXiv:2405.14768.

[yu2024boosting] Yu, Jiazuo, Zhuge, Yunzhi, Zhang, Lu, Wang, Dong, Lu, Huchuan, He, You. (2024). Boosting continual learning of vision-language models via mixture-of-experts adapters. arXiv preprint arXiv:2403.11549.

[yu2024select] Yu, Yu-Chu, Huang, Chi-Pin, Chen, Jr-Jen, Chang, Kai-Po, Lai, Yung-Hsuan, Yang, Fu-En, Wang, Yu-Chiang Frank. (2024). Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models. arXiv preprint arXiv:2403.09296.

[ye2024data] Ye, Jiasheng, Liu, Peiju, Sun, Tianxiang, Zhou, Yunhua, Zhan, Jun, Qiu, Xipeng. (2024). Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance. arXiv preprint arXiv:2403.16952.

[fleshman2024adapterswap] Fleshman, William, Khan, Aleem, Marone, Marc, Van Durme, Benjamin. (2024). AdapterSwap: Continuous Training of LLMs with Data Removal and Access-Control Guarantees. arXiv preprint arXiv:2404.08417.

[malla2024copal] Malla, Srikanth, Choi, Joon Hee, Choi, Chiho. (2024). COPAL: Continual Pruning in Large Language Generative Models. arXiv preprint arXiv:2405.02347.

[gutierrez2024hipporag] Gutiérrez, Bernal Jiménez, Shu, Yiheng, Gu, Yu, Yasunaga, Michihiro, Su, Yu. (2024). HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. arXiv preprint arXiv:2405.14831.

[rafailov2024direct] Rafailov, Rafael, Sharma, Archit, Mitchell, Eric, Manning, Christopher D, Ermon, Stefano, Finn, Chelsea. (2024). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems.

[yang2024recent] Yang, Yutao, Zhou, Jie, Ding, Xuanwen, Huai, Tianyu, Liu, Shunyu, Chen, Qin, He, Liang, Xie, Yuan. (2024). Recent Advances of Foundation Language Models-based Continual Learning: A Survey. arXiv preprint arXiv:2405.18653.

[xu2023wizardlm] Xu, Can, Sun, Qingfeng, Zheng, Kai, Geng, Xiubo, Zhao, Pu, Feng, Jiazhan, Tao, Chongyang, Jiang, Daxin. (2023). Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.

[wang2024codeclm] Wang, Zifeng, Li, Chun-Liang, Perot, Vincent, Le, Long T, Miao, Jin, Zhang, Zizhao, Lee, Chen-Yu, Pfister, Tomas. (2024). CodecLM: Aligning Language Models with Tailored Synthetic Data. arXiv preprint arXiv:2404.05875.

[mccloskey1989catastrophic] Michael McCloskey, Neal J. Cohen. (1989). Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. Psychology of Learning and Motivation. doi:10.1016/S0079-7421(08)60536-8.

[team2023gemini] Team, Gemini, Anil, Rohan, Borgeaud, Sebastian, Wu, Yonghui, Alayrac, Jean-Baptiste, Yu, Jiahui, Soricut, Radu, Schalkwyk, Johan, Dai, Andrew M, Hauth, Anja, others. (2023). Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

[reid2024gemini] Reid, Machel, Savinov, Nikolay, Teplyashin, Denis, Lepikhin, Dmitry, Lillicrap, Timothy, Alayrac, Jean-baptiste, Soricut, Radu, Lazaridou, Angeliki, Firat, Orhan, Schrittwieser, Julian, others. (2024). Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.

[kim2023learnability] Kim, Gyuhak, Xiao, Changnan, Konishi, Tatsuya, Liu, Bing. (2023). Learnability and Algorithm for Continual Learning. arXiv preprint arXiv:2306.12646.

[bib1] H. Abdine, M. Chatzianastasis, C. Bouyioukos, and M. Vazirgiannis. Prot2text: Multimodal protein’s function generation with GNNs and transformers. In Deep Generative Models for Health Workshop NeurIPS 2023, 2023.

[bib2] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[bib3] M. Agarwal, Y. Shen, B. Wang, Y. Kim, and J. Chen. Structured code representations enable data-efficient adaptation of code language models, 2024.

[bib4] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018.

[bib5] R. Aljundi, P. Chakravarty, and T. Tuytelaars. Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3366–3375, 2017.

[bib6] S. Amba Hombaiah, T. Chen, M. Zhang, M. Bendersky, and M. Najork. Dynamic language models for continuously evolving content. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 2514–2524, 2021.

[bib7] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.

[bib8] D. Araci. Finbert: Financial sentiment analysis with pre-trained language models, 2019.

[bib9] G. Attanasio, D. Nozza, F. Bianchi, and D. Hovy. Is it worth the (environmental) cost? limited evidence for temporal adaptation via continuous training, 2023.

[bib10] Z. Azerbayev, H. Schoelkopf, K. Paster, M. D. Santos, S. McAleer, A. Q. Jiang, J. Deng, S. Biderman, and S. Welleck. Llemma: An open language model for mathematics. CoRR, abs/2310.10631, 2023.

[bib11] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[bib12] X. Bai, J. Shang, Y. Sun, and N. Balasubramanian. Enhancing continual learning with global prototypes: Counteracting negative representation drift, 2023.

[bib13] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.

[bib14] S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005.

[bib15] J. Bang, H. Kim, Y. Yoo, J.-W. Ha, and J. Choi. Rainbow memory: Continual learning with a memory of diverse samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8218–8227, June 2021.

[bib16] Z. Bao, W. Chen, S. Xiao, K. Ren, J. Wu, C. Zhong, J. Peng, X. Huang, and Z. Wei. Disc-medllm: Bridging general large language models and real-world medical consultation, 2023.

[bib17] J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, and J. Blackburn. The pushshift reddit dataset, 2020.

[bib18] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine learning, 79:151–175, 2010.

[bib19] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, page 41–48, New York, NY, USA, 2009. Association for Computing Machinery.

[bib20] Z. Bi, N. Zhang, Y. Xue, Y. Ou, D. Ji, G. Zheng, and H. Chen. Oceangpt: A large language model for ocean science tasks. CoRR, abs/2310.02031, 2023.

[bib21] S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023.

[bib22] M. Biesialska, K. Biesialska, and M. R. Costa-jussà. Continual lifelong learning in natural language processing: A survey. In D. Scott, N. Bel, and C. Zong, editors, Proceedings of the 28th International Conference on Computational Linguistics, pages 6523–6541, Barcelona, Spain (Online), Dec. 2020. International Committee on Computational Linguistics.

[bib23] Y. Bisk, R. Zellers, J. Gao, Y. Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020.

[bib24] O. Bojar, C. Buck, C. Federmann, B. Haddow, P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post, H. Saint-Amand, et al. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the ninth workshop on statistical machine translation, pages 12–58, 2014.

[bib25] J. Bornschein, Y. Li, and A. Rannen-Triki. Transformers for supervised online continual learning, 2024.

[bib26] L. Bourtoule, V. Chandrasekaran, C. A. Choquette-Choo, H. Jia, A. Travers, B. Zhang, D. Lie, and N. Papernot. Machine unlearning, 2020.

[bib27] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

[bib28] M. Brümmer, M. Dojchinovski, and S. Hellmann. Dbpedia abstracts: A large-scale, open, multilingual nlp training corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 3339–3343, 2016.

[bib29] P. Buzzega, M. Boschini, A. Porrello, D. Abati, and S. Calderara. Dark experience for general continual learning: a strong, simple baseline. Advances in neural information processing systems, 33:15920–15930, 2020.

[bib30] L. Caccia, R. Aljundi, N. Asadi, T. Tuytelaars, J. Pineau, and E. Belilovsky. New insights on reducing abrupt representation change in online continual learning. arXiv preprint arXiv:2104.05025, 2021.

[bib31] Z. Cai, O. Sener, and V. Koltun. Online continual learning with natural distribution shifts: An empirical study with visual data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8281–8290, 2021.

[bib32] H. Cao, Z. Liu, X. Lu, Y. Yao, and Y. Li. Instructmol: Multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. CoRR, abs/2311.16208, 2023.

[bib33] X. Cao, H. Lu, L. Huang, X. Liu, and M.-M. Cheng. Generative multi-modal models are good class incremental learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[bib34] Caselaw Access Project. Caselaw access project, 2018.

[bib35] Y. Chai, S. Wang, C. Pang, Y. Sun, H. Tian, and H. Wu. ERNIE-code: Beyond English-centric cross-lingual pretraining for programming languages. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 10628–10650, Toronto, Canada, July 2023. Association for Computational Linguistics.

[bib36] I. Chalkidis, T. Pasini, S. Zhang, L. Tomada, S. F. Schwemer, and A. Søgaard. Fairlex: A multilingual benchmark for evaluating fairness in legal text processing. arXiv preprint arXiv:2203.07228, 2022.

[bib37] A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny. Efficient lifelong learning with a-gem. In ICLR, 2019.

[bib38] A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. K. Dokania, P. H. Torr, and M. Ranzato. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486, 2019.

[bib39] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One billion word benchmark for measuring progress in statistical language modeling, 2014.

[bib40] B. Chen, X. Cheng, P. Li, Y. Geng, J. Gong, S. Li, Z. Bei, X. Tan, B. Wang, X. Zeng, C. Liu, A. Zeng, Y. Dong, J. Tang, and L. Song. xtrimopglm: Unified 100b-scale pre-trained transformer for deciphering the language of protein. CoRR, abs/2401.06199, 2024.

[bib41] C. Chen, J. Zhu, X. Luo, H. Shen, L. Gao, and J. Song. Coin: A benchmark of continual instruction tuning for multimodel large language model, 2024.

[bib42] J. Chen, X. Wang, A. Gao, F. Jiang, S. Chen, H. Zhang, D. Song, W. Xie, C. Kong, J. Li, X. Wan, H. Li, and B. Wang. Huatuogpt-ii, one-stage training for medical adaption of llms. CoRR, abs/2311.09774, 2023.

[bib43] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba. Evaluating large language models trained on code, 2021.

[bib44] S. Chen, Y. Hou, Y. Cui, W. Che, T. Liu, and X. Yu. Recall and learn: Fine-tuning deep pretrained language models with less forgetting. In B. Webber, T. Cohn, Y. He, and Y. Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7870–7881, Online, Nov. 2020. Association for Computational Linguistics.

[bib45] S. Chen, B. H. Kann, M. B. Foote, H. J. Aerts, G. K. Savova, R. H. Mak, and D. S. Bitterman. The utility of chatgpt for cancer treatment information. medRxiv, 2023.

[bib46] W. Chen, Y. Zhou, N. Du, Y. Huang, J. Laudon, Z. Chen, and C. Cui. Lifelong language pretraining with distribution-specialized experts. In International Conference on Machine Learning, pages 5383–5395. PMLR, 2023.

[bib47] Y. Chen, S. Zhang, G. Qi, and X. Guo. Parameterizing context: Unleashing the power of parameter-efficient fine-tuning and in-context tuning for continual table semantic parsing. Advances in Neural Information Processing Systems, 36, 2024.

[bib48] Z. Chen and B. Liu. Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Springer, 2018.

[bib49] D. Cheng, S. Huang, and F. Wei. Adapting large language models via reading comprehension, 2024.

[bib50] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.

[bib51] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

[bib52] P. Colombo, T. P. Pires, M. Boudiaf, D. Culver, R. Melo, C. Corro, A. F. T. Martins, F. Esposito, V. L. Raposo, S. Morgado, and M. Desa. Saullm-7b: A pioneering large language model for law, 2024.

[bib53] Together Computer. RedPajama: an open dataset for training large language models, 2023.

[bib54] A. O. Constantinescu, J. X. O’Reilly, and T. E. Behrens. Organizing conceptual knowledge in humans with a gridlike code. Science, 352(6292):1464–1468, 2016.

[bib55] A. Cossu, T. Tuytelaars, A. Carta, L. Passaro, V. Lomonaco, and D. Bacciu. Continual pre-training mitigates forgetting in language and vision, 2022.

[bib56] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.

[bib57] M. D’Alessandro, A. Alonso, E. Calabrés, and M. Galar. Multimodal parameter-efficient few-shot class incremental learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 3393–3403, October 2023.

[bib58] P. Das, S. Chaudhury, E. Nelson, I. Melnyk, S. Swaminathan, S. Dai, A. Lozano, G. Kollias, V. Chenthamarakshan, S. Dan, et al. Larimar: Large language models with episodic memory control. arXiv preprint arXiv:2403.11901, 2024.

[bib59] P. Dasigi, N. F. Liu, A. Marasović, N. A. Smith, and M. Gardner. Quoref: A reading comprehension dataset with questions requiring coreferential reasoning. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5925–5932, Hong Kong, China, Nov. 2019. Association for Computational Linguistics.

[bib60] N. De Cao, W. Aziz, and I. Titov. Editing factual knowledge in language models. arXiv preprint arXiv:2104.08164, 2021.

[bib61] M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE transactions on pattern analysis and machine intelligence, 44(7):3366–3385, 2021.

[bib62] DeepSeek-AI, X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, H. Gao, K. Gao, W. Gao, R. Ge, K. Guan, D. Guo, J. Guo, G. Hao, Z. Hao, Y. He, W. Hu, P. Huang, E. Li, G. Li, J. Li, Y. Li, Y. K. Li, W. Liang, F. Lin, A. X. Liu, B. Liu, W. Liu, X. Liu, X. Liu, Y. Liu, H. Lu, S. Lu, F. Luo, S. Ma, X. Nie, T. Pei, Y. Piao, J. Qiu, H. Qu, T. Ren, Z. Ren, C. Ruan, Z. Sha, Z. Shao, J. Song, X. Su, J. Sun, Y. Sun, M. Tang, B. Wang, P. Wang, S. Wang, Y. Wang, Y. Wang, T. Wu, Y. Wu, X. Xie, Z. Xie, Z. Xie, Y. Xiong, H. Xu, R. X. Xu, Y. Xu, D. Yang, Y. You, S. Yu, X. Yu, B. Zhang, H. Zhang, L. Zhang, L. Zhang, M. Zhang, M. Zhang, W. Zhang, Y. Zhang, C. Zhao, Y. Zhao, S. Zhou, S. Zhou, Q. Zhu, and Y. Zou. Deepseek llm: Scaling open-source language models with longtermism, 2024.

[bib63] C. Deng, T. Zhang, Z. He, Y. Xu, Q. Chen, Y. Shi, L. Fu, W. Zhang, X. Wang, C. Zhou, Z. Lin, and J. He. K2: A foundation language model for geoscience knowledge understanding and utilization, 2023.

[bib64] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.

[bib65] Y. Deng, W. Lei, W. Lam, and T.-S. Chua. A survey on proactive dialogue systems: Problems, methods, and prospects. arXiv preprint arXiv:2305.02750, 2023.

[bib66] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.

[bib67] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[bib68] B. Dhingra, J. R. Cole, J. M. Eisenschlos, D. Gillick, J. Eisenstein, and W. W. Cohen. Time-aware language models as temporal knowledge bases. Transactions of the Association for Computational Linguistics, 10:257–273, 2022.

[bib69] P. Di, J. Li, H. Yu, W. Jiang, W. Cai, Y. Cao, C. Chen, D. Chen, H. Chen, L. Chen, et al. Codefuse-13b: A pretrained multi-lingual code large language model. arXiv preprint arXiv:2310.06266, 2023.

[bib70] Q. Dong, D. Dai, Y. Song, J. Xu, Z. Sui, and L. Li. Calibrating factual knowledge in pretrained language models. arXiv preprint arXiv:2210.03329, 2022.

[bib71] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[bib72] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat, B. Zoph, L. Fedus, M. P. Bosma, Z. Zhou, T. Wang, E. Wang, K. Webster, M. Pellat, K. Robinson, K. Meier-Hellstern, T. Duke, L. Dixon, K. Zhang, Q. Le, Y. Wu, Z. Chen, and C. Cui. GLaM: Efficient scaling of language models with mixture-of-experts. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 5547–5569. PMLR, 17–23 Jul 2022.

[bib73] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161, 2019.

[bib74] S. Ebrahimi, M. Elhoseiny, T. Darrell, and M. Rohrbach. Uncertainty-guided continual learning with bayesian neural networks. arXiv preprint arXiv:1906.02425, 2019.

[bib75] S. Ebrahimi, F. Meier, R. Calandra, T. Darrell, and M. Rohrbach. Adversarial continual learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pages 386–402. Springer, 2020.

[bib76] H. Elsahar, P. Vougiouklis, A. Remaci, C. Gravier, J. Hare, F. Laforest, and E. Simperl. T-rex: A large scale alignment of natural language with knowledge base triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.

[bib77] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The journal of machine learning research, 17(1):2096–2030, 2016.

[bib78] J. Gao, R. Pi, J. Zhang, J. Ye, W. Zhong, Y. Wang, L. Hong, J. Han, H. Xu, Z. Li, and L. Kong. G-llava: Solving geometric problem with multi-modal large language model. CoRR, abs/2312.11370, 2023.

[bib79] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, S. Presser, and C. Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.

[bib80] S. Garg, S. Dutta, M. Dalirrooyfard, A. Schneider, and Y. Nevmyvaka. In- or out-of-distribution detection via dual divergence estimation. In R. J. Evans and I. Shpitser, editors, Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, volume 216 of Proceedings of Machine Learning Research, pages 635–646. PMLR, 31 Jul–04 Aug 2023.

[bib81] S. Garg, M. Farajtabar, H. Pouransari, R. Vemulapalli, S. Mehta, O. Tuzel, V. Shankar, and F. Faghri. Tic-clip: Continual training of clip models. In The Twelfth International Conference on Learning Representations (ICLR), 2024.

[bib82] E. Gogoulou, T. Lesort, M. Boman, and J. Nivre. Continual learning under language shift, 2024.

[bib83] A. Gokaslan and V. Cohen. Openwebtext corpus, 2019.

[bib84] Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, M. Huang, N. Duan, and W. Chen. ToRA: A tool-integrated reasoning agent for mathematical problem solving. In The Twelfth International Conference on Learning Representations, 2024.

[bib85] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering, 2017.

[bib86] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare, 3(1):1–23, Oct. 2021.

[bib87] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang. Deepseek-coder: When the large language model meets programming – the rise of code intelligence, 2024.

[bib88] Z. Guo and Y. Hua. Continuous training and fine-tuning for domain-specific language models in medical question answering, 2023.

[bib89] K. Gupta, B. Thérien, A. Ibrahim, M. L. Richter, Q. Anthony, E. Belilovsky, I. Rish, and T. Lesort. Continual pre-training of large language models: How to (re)warm your model?, 2023.

[bib90] D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham. Vizwiz grand challenge: Answering visual questions from blind people, 2018.

[bib91] S. Gururangan, M. Lewis, A. Holtzman, N. A. Smith, and L. Zettlemoyer. DEMix layers: Disentangling domains for modular language modeling. In M. Carpuat, M.-C. de Marneffe, and I. V. Meza Ruiz, editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5557–5576, Seattle, United States, July 2022. Association for Computational Linguistics.

[bib92] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith. Don’t stop pretraining: Adapt language models to domains and tasks. In D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online, July 2020. Association for Computational Linguistics.

[bib93] R. Han, X. Ren, and N. Peng. ECONET: Effective continual pretraining of language models for event temporal reasoning. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5367–5380, Online and Punta Cana, Dominican Republic, Nov. 2021. Association for Computational Linguistics.

[bib94] T. Han, L. C. Adams, J. Papaioannou, P. Grundmann, T. Oberhauser, A. Löser, D. Truhn, and K. K. Bressem. Medalpaca - an open-source collection of medical conversational AI models and training data. CoRR, abs/2304.08247, 2023.

[bib95] Y. Hao, L. Dong, F. Wei, and K. Xu. Visualizing and understanding the effectiveness of BERT. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4143–4152, Hong Kong, China, Nov. 2019. Association for Computational Linguistics.

[bib96] T. Hartvigsen, S. Sankaranarayanan, H. Palangi, Y. Kim, and M. Ghassemi. Aging with grace: Lifelong model editing with discrete key-value adaptors. In Advances in Neural Information Processing Systems, 2023.

[bib97] P. Hase, M. Bansal, B. Kim, and A. Ghandeharioun. Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models, 2023.

[bib98] P. Hase, M. Diab, A. Celikyilmaz, X. Li, Z. Kozareva, V. Stoyanov, M. Bansal, and S. Iyer. Do language models have beliefs? methods for detecting, updating, and visualizing model beliefs. arXiv preprint arXiv:2111.13654, 2021.

[bib99] T. L. Hayes and C. Kanan. Lifelong machine learning with deep streaming linear discriminant analysis, 2020.

[bib100] J. He, H. Guo, M. Tang, and J. Wang. Continual instruction tuning for large multimodal models, 2023.

[bib101] R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW ’16, page 507–517, Republic and Canton of Geneva, CHE, 2016. International World Wide Web Conferences Steering Committee.

[bib102] T. He, J. Liu, K. Cho, M. Ott, B. Liu, J. Glass, and F. Peng. Analyzing the forgetting problem in pretrain-finetuning of open-domain dialogue response models. In P. Merlo, J. Tiedemann, and R. Tsarfaty, editors, Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1121–1133, Online, Apr. 2021. Association for Computational Linguistics.

[bib103] Y. He, X. Huang, M. Tang, L. Meng, X. Li, W. Lin, W. Zhang, and Y. Gao. Don’t half-listen: Capturing key-part information in continual instruction tuning, 2024.

[bib104] D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt. Aligning AI with shared human values, 2023.

[bib105] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.

[bib106] D. Hewlett, A. Lacoste, L. Jones, I. Polosukhin, A. Fandrianto, J. Han, M. Kelcey, and D. Berthelot. Wikireading: A novel large-scale language understanding task over wikipedia. arXiv preprint arXiv:1608.03542, 2016.

[bib107] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

[bib108] C. Hu, P. Cao, Y. Chen, K. Liu, and J. Zhao. Wilke: Wise-layer knowledge editor for lifelong knowledge editing, 2024.

[bib109] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

[bib110] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

[bib111] Y. Hu, T. Ganter, H. Deilamsalehy, F. Dernoncourt, H. Foroosh, and F. Liu. Meetingbank: A benchmark dataset for meeting summarization, 2023.

[bib112] C. Huang, Q. Liu, B. Y. Lin, T. Pang, C. Du, and M. Lin. Lorahub: Efficient cross-task generalization via dynamic lora composition. arXiv preprint arXiv:2307.13269, 2023.

[bib113] J. Huang, L. Cui, A. Wang, C. Yang, X. Liao, L. Song, J. Yao, and J. Su. Mitigating catastrophic forgetting in large language models with self-synthesized rehearsal, 2024.

[bib114] L. Huang, R. L. Bras, C. Bhagavatula, and Y. Choi. Cosmos qa: Machine reading comprehension with contextual commonsense reasoning, 2019.

[bib115] Q. Huang, M. Tao, Z. An, C. Zhang, C. Jiang, Z. Chen, Z. Wu, and Y. Feng. Lawyer llama technical report. arXiv preprint arXiv:2305.15062, 2023.

[bib116] Q. Huang, M. Tao, C. Zhang, Z. An, C. Jiang, Z. Chen, Z. Wu, and Y. Feng. Lawyer llama. https://github.com/AndrewZhe/lawyer-llama, 2023.

[bib117] Z. Huang, Y. Shen, X. Zhang, J. Zhou, W. Rong, and Z. Xiong. Transformer-patcher: One mistake worth one neuron. arXiv preprint arXiv:2301.09785, 2023.

[bib118] D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering, 2019.

[bib119] J. Jang, S. Ye, C. Lee, S. Yang, J. Shin, J. Han, G. Kim, and M. Seo. Temporalwiki: A lifelong benchmark for training and evaluating ever-evolving language models, 2022.

[bib120] J. Jang, S. Ye, S. Yang, J. Shin, J. Han, G. Kim, S. J. Choi, and M. Seo. Towards continual knowledge learning of language models. In ICLR, 2022.

[bib121] K. Jeblick, B. Schachtner, J. Dexl, A. Mittermeier, A. T. Stüber, J. Topalis, T. Weber, P. Wesp, B. Sabel, J. Ricke, and M. Ingrisch. Chatgpt makes medicine easy to swallow: An exploratory case study on simplified radiology reports, 2022.

[bib122] J. Ji, T. Qiu, B. Chen, B. Zhang, H. Lou, K. Wang, Y. Duan, Z. He, J. Zhou, Z. Zhang, F. Zeng, K. Y. Ng, J. Dai, X. Pan, A. O’Gara, Y. Lei, H. Xu, B. Tse, J. Fu, S. McAleer, Y. Yang, Y. Wang, S.-C. Zhu, Y. Guo, and W. Gao. AI alignment: A comprehensive survey, 2024.

[bib123] S. Jiang, Y. Wang, and Y. Wang. Selfevolve: A code evolution framework via large language models, 2023.

[bib124] Y. Jiang, Z. Pan, X. Zhang, S. Garg, A. Schneider, Y. Nevmyvaka, and D. Song. Empowering time series analysis with large language models: A survey, 2024.

[bib125] Z. Jiang, Z. Sun, W. Shi, P. Rodriguez, C. Zhou, G. Neubig, X. V. Lin, W.-t. Yih, and S. Iyer. Instruction-tuned language models are better knowledge learners, 2024.

[bib126] X. Jin and X. Ren. What will my model forget? forecasting forgotten examples in language model refinement, 2024.

[bib127] X. Jin, D. Zhang, H. Zhu, W. Xiao, S.-W. Li, X. Wei, A. Arnold, and X. Ren. Lifelong pretraining: Continually adapting language models to emerging corpora. In A. Fan, S. Ilic, T. Wolf, and M. Gallé, editors, Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models, pages 1–16, virtual+Dublin, May 2022. Association for Computational Linguistics.

[bib128] E. R. Kandel, J. H. Schwartz, T. M. Jessell, S. Siegelbaum, A. J. Hudspeth, S. Mack, et al. Principles of neural science, volume 4. McGraw-Hill, New York, 2000.

[bib129] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

[bib130] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In A. Moschitti, B. Pang, and W. Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, Doha, Qatar, Oct. 2014. Association for Computational Linguistics.

[bib131] Z. Ke, H. Lin, Y. Shao, H. Xu, L. Shu, and B. Liu. Continual training of language models for few-shot learning. In Y. Goldberg, Z. Kozareva, and Y. Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10205–10216, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics.

[bib132] Z. Ke and B. Liu. Continual learning of natural language processing tasks: A survey, 2023.

[bib133] Z. Ke, B. Liu, N. Ma, H. Xu, and S. Lei. Achieving forgetting prevention and knowledge transfer in continual learning. In NeurIPS, 2021.

[bib134] Z. Ke, Y. Shao, H. Lin, T. Konishi, G. Kim, and B. Liu. Continual pre-training of language models. In The Eleventh International Conference on Learning Representations, 2022.

[bib135] T. Kew, M. Kostrzewa, and S. Ebling. 20 minuten: A multi-task news summarisation dataset for German. In H. Ghorbel, M. Sokhn, M. Cieliebak, M. Hürlimann, E. de Salis, and J. Guerne, editors, Proceedings of the 8th edition of the Swiss Text Analytics Conference, pages 1–13, Neuchatel, Switzerland, June 2023. Association for Computational Linguistics.

[bib136] D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In M. Walker, H. Ji, and A. Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.

[bib137] D. Khashabi, T. Khot, A. Sabharwal, and D. Roth. Learning what is essential in questions. In R. Levy and L. Specia, editors, Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 80–89, Vancouver, Canada, Aug. 2017. Association for Computational Linguistics.

[bib138] T. Khot, P. Clark, M. Guerquin, P. Jansen, and A. Sabharwal. Qasc: A dataset for question answering via sentence composition, 2020.

[bib139] G. Kim, C. Xiao, T. Konishi, Z. Ke, and B. Liu. A theoretical study on solving continual learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 5065–5079. Curran Associates, Inc., 2022.

[bib140] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.

[bib141] P. Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79–86, Phuket, Thailand, Sept. 13–15, 2005.

[bib142] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.

[bib143] C. C. T. Kwok, O. Etzioni, and D. S. Weld. Scaling question answering to the web. In Proceedings of the 10th International Conference on World Wide Web, WWW ’01, page 150–161, New York, NY, USA, 2001. Association for Computing Machinery.

[bib144] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.

[bib145] A. Lazaridou, A. Kuncoro, E. Gribovskaya, D. Agrawal, A. Liska, T. Terzi, M. Gimenez, C. de Masson d’Autume, T. Kocisky, S. Ruder, et al. Mind the gap: Assessing temporal generalization in neural language models. Advances in Neural Information Processing Systems, 34:29348–29363, 2021.

[bib146] B. Lester, R. Al-Rfou, and N. Constant. The power of scale for parameter-efficient prompt tuning. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic, Nov. 2021. Association for Computational Linguistics.

[bib147] O. Levy, M. Seo, E. Choi, and L. Zettlemoyer. Zero-shot relation extraction via reading comprehension. arXiv preprint arXiv:1706.04115, 2017.

[bib148] B. Li, Y. Zhang, L. Chen, J. Wang, J. Yang, and Z. Liu. Otter: A multi-modal model with in-context instruction tuning, 2023.

[bib149] C.-A. Li and H.-Y. Lee. Examining forgetting in continual pre-training of aligned large language models, 2024.

[bib150] D. Li, A. S. Rawat, M. Zaheer, X. Wang, M. Lukasik, A. Veit, F. Yu, and S. Kumar. Large language models with controllable working memory. arXiv preprint arXiv:2211.05110, 2022.

[bib151] H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. Cmmlu: Measuring massive multitask language understanding in chinese. arXiv preprint arXiv:2306.09212, 2023.

[bib152] J. Li, Y. Bian, G. Wang, Y. Lei, D. Cheng, Z. Ding, and C. Jiang. Cfgpt: Chinese financial assistant with large language model, 2023.

[bib153] J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023.

[bib154] K. Li, Y. He, Y. Wang, Y. Li, W. Wang, P. Luo, Y. Wang, L. Wang, and Y. Qiao. Videochat: Chat-centric video understanding, 2024.

[bib155] K. Li, Q. Hu, X. Zhao, H. Chen, Y. Xie, T. Liu, Q. Xie, and J. He. Instructcoder: Instruction tuning large language models for code editing, 2024.

[bib156] L. Li and X. Qiu. Continual model evolvement with inner-product restriction, 2023.

[bib157] M. Li, Y. Zhang, Z. Li, J. Chen, L. Chen, N. Cheng, J. Wang, T. Zhou, and J. Xiao. From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning. ArXiv, abs/2308.12032, 2023.

[bib158] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, Q. Liu, E. Zheltonozhskii, T. Y. Zhuo, T. Wang, O. Dehaene, M. Davaadorj, J. Lamy-Poirier, J. Monteiro, O. Shliazhko, N. Gontier, N. Meade, A. Zebaze, M.-H. Yee, L. K. Umapathi, J. Zhu, B. Lipkin, M. Oblokulov, Z. Wang, R. Murthy, J. Stillerman, S. S. Patel, D. Abulkhanov, M. Zocca, M. Dey, Z. Zhang, N. Fahmy, U. Bhattacharyya, W. Yu, S. Singh, S. Luccioni, P. Villegas, M. Kunakov, F. Zhdanov, M. Romero, T. Lee, N. Timor, J. Ding, C. Schlesinger, H. Schoelkopf, J. Ebert, T. Dao, M. Mishra, A. Gu, J. Robinson, C. J. Anderson, B. Dolan-Gavitt, D. Contractor, S. Reddy, D. Fried, D. Bahdanau, Y. Jernite, C. M. Ferrandis, S. Hughes, T. Wolf, A. Guha, L. von Werra, and H. de Vries. Starcoder: may the source be with you!, 2023.

[bib159] X. L. Li and P. Liang. Prefix-tuning: Optimizing continuous prompts for generation. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online, Aug. 2021. Association for Computational Linguistics.

[bib160] Y. Li, Z. Li, K. Zhang, R. Dan, S. Jiang, and Y. Zhang. Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge. Cureus, 15(6), 2023.

[bib161] Z. Li and D. Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017.

[bib162] B. Y. Lin, S. Wang, X. Lin, R. Jia, L. Xiao, X. Ren, and S. Yih. On continual model refinement in out-of-distribution data streams. In S. Muresan, P. Nakov, and A. Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3128–3139, Dublin, Ireland, May 2022. Association for Computational Linguistics.

[bib163] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.

[bib164] K. Lin, O. Tafjord, P. Clark, and M. Gardner. Reasoning over paragraph effects in situations, 2019.

[bib165] Y. Lin, H. Lin, W. Xiong, S. Diao, J. Liu, J. Zhang, R. Pan, H. Wang, W. Hu, H. Zhang, H. Dong, R. Pi, H. Zhao, N. Jiang, H. Ji, Y. Yao, and T. Zhang. Mitigating the alignment tax of rlhf, 2024.

[bib166] Z. Lin, C. Deng, L. Zhou, T. Zhang, Y. Xu, Y. Xu, Z. He, Y. Shi, B. Dai, Y. Song, B. Zeng, Q. Chen, T. Shi, T. Huang, Y. Xu, S. Wang, L. Fu, W. Zhang, J. He, C. Ma, Y. Zhu, X. Wang, and C. Zhou. Geogalactica: A scientific large language model in geoscience, 2023.

[bib167] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning, 2023.

[bib168] H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, and C. A. Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965, 2022.

[bib169] J. Liu, P. Zhou, Y. Hua, D. Chong, Z. Tian, A. Liu, H. Wang, C. You, Z. Guo, L. Zhu, and M. L. Li. Benchmarking large language models on cmexam – a comprehensive chinese medical exam dataset, 2023.

[bib170] Y. Liu, R. J. Dolan, Z. Kurth-Nelson, and T. E. Behrens. Human replay spontaneously reorganizes experience. Cell, 178(3):640–652, 2019.

[bib171] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

[bib172] K. Lo, L. L. Wang, M. Neumann, R. Kinney, and D. Weld. S2ORC: The semantic scholar open research corpus. In D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969–4983, Online, July 2020. Association for Computational Linguistics.

[bib173] V. Lomonaco, D. Maltoni, and L. Pellegrini. Rehearsal-free continual learning over small non-i.i.d. batches, 2020.

[bib174] D. Lopez-Paz and M. Ranzato. Gradient episodic memory for continual learning. Advances in neural information processing systems, 30, 2017.

[bib175] D. Loureiro, F. Barbieri, L. Neves, L. Espinosa Anke, and J. Camacho-collados. TimeLMs: Diachronic language models from Twitter. In V. Basile, Z. Kozareva, and S. Stajner, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 251–260, Dublin, Ireland, May 2022. Association for Computational Linguistics.

[bib176] A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y. Wei, T. Liu, M. Tian, D. Kocetkov, A. Zucker, Y. Belkada, Z. Wang, Q. Liu, D. Abulkhanov, I. Paul, Z. Li, W.-D. Li, M. Risdal, J. Li, J. Zhu, T. Y. Zhuo, E. Zheltonozhskii, N. O. O. Dade, W. Yu, L. Krauß, N. Jain, Y. Su, X. He, M. Dey, E. Abati, Y. Chai, N. Muennighoff, X. Tang, M. Oblokulov, C. Akiki, M. Marone, C. Mou, M. Mishra, A. Gu, B. Hui, T. Dao, A. Zebaze, O. Dehaene, N. Patry, C. Xu, J. McAuley, H. Hu, T. Scholak, S. Paquet, J. Robinson, C. J. Anderson, N. Chapados, M. Patwary, N. Tajbakhsh, Y. Jernite, C. M. Ferrandis, L. Zhang, S. Hughes, T. Wolf, A. Guha, L. von Werra, and H. de Vries. Starcoder 2 and the stack v2: The next generation, 2024.

[bib177] D. Lu, H. Wu, J. Liang, Y. Xu, Q. He, Y. Geng, M. Han, Y. Xin, and Y. Xiao. Bbt-fin: Comprehensive construction of chinese financial domain pre-trained language model, corpus and benchmark. CoRR, abs/2302.09432, 2023.

[bib178] P. Lu, M. Caprio, E. Eaton, and I. Lee. Ibcl: Zero-shot model generation for task trade-offs in continual learning, 2023.

[bib179] P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering, 2022.

[bib180] S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. Clement, D. Drain, D. Jiang, D. Tang, G. Li, L. Zhou, L. Shou, L. Zhou, M. Tufano, M. Gong, M. Zhou, N. Duan, N. Sundaresan, S. K. Deng, S. Fu, and S. Liu. Codexglue: A machine learning benchmark dataset for code understanding and generation, 2021.

[bib181] H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng, Q. Lin, S. Chen, and D. Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023.

[bib182] R. Luo, L. Sun, Y. Xia, T. Qin, S. Zhang, H. Poon, and T.-Y. Liu. Biogpt: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 23(6), Sept. 2022.

[bib183] Y. Luo, Z. Yang, X. Bai, F. Meng, J. Zhou, and Y. Zhang. Investigating forgetting in pre-trained representations through continual learning, 2023.

[bib184] Y. Luo, Z. Yang, F. Meng, Y. Li, J. Zhou, and Y. Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2023.

[bib185] Y. Luo, J. Zhang, S. Fan, K. Yang, Y. Wu, M. Qiao, and Z. Nie. Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine. arXiv preprint arXiv:2308.09442, 2023.

[bib186] Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang. Wizardcoder: Empowering code large language models with evol-instruct, 2023.

[bib187] S. Ma, S. Huang, S. Huang, X. Wang, Y. Li, H.-T. Zheng, P. Xie, F. Huang, and Y. Jiang. Ecomgpt-ct: Continual pre-training of e-commerce large language models with semi-structured data, 2023.

[bib188] A. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 142–150, 2011.

[bib189] Z. Mai, R. Li, J. Jeong, D. Quispe, H. Kim, and S. Sanner. Online continual learning in image classification: An empirical survey. Neurocomputing, 469:28–51, 2022.

[bib190] J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions, 2016.

[bib191] L. Martin, N. Whitehouse, S. Yiu, L. Catterson, and R. Perera. Better call gpt, comparing large language models against lawyers, 2024.

[bib192] V. Mazzia, A. Pedrani, A. Caciolai, K. Rottmann, and D. Bernardi. A survey on knowledge editing of neural networks. arXiv preprint arXiv:2310.19704, 2023.

[bib193] D. McCaffary. Towards continual task learning in artificial neural networks: current approaches and insights from neuroscience. arXiv preprint arXiv:2112.14146, 2021.

[bib194] J. L. McClelland, B. L. McNaughton, and R. C. O’Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102(3):419, 1995.

[bib195] S. V. Mehta, D. Patil, S. Chandar, and E. Strubell. An empirical investigation of the role of pre-training in lifelong learning. Journal of Machine Learning Research, 24(214):1–50, 2023.

[bib196] K. Meng, D. Bau, A. Andonian, and Y. Belinkov. Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 35:17359–17372, 2022.

[bib197] K. Meng, A. S. Sharma, A. Andonian, Y. Belinkov, and D. Bau. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229, 2022.

[bib198] S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837, 2022.

[bib199] S. I. Mirzadeh, A. Chaudhry, D. Yin, H. Hu, R. Pascanu, D. Gorur, and M. Farajtabar. Wide neural networks forget less catastrophically. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 15699–15717. PMLR, 17–23 Jul 2022.

[bib200] A. Mishra, S. Shekhar, A. K. Singh, and A. Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019.

[bib201] S. Mishra, D. Khashabi, C. Baral, and H. Hajishirzi. Natural instructions: Benchmarking generalization to new tasks from natural language instructions. arXiv preprint arXiv:2104.08773, 2021.

[bib202] S. Mishra, A. Mitra, N. Varshney, B. Sachdeva, P. Clark, C. Baral, and A. Kalyan. Numglue: A suite of fundamental yet challenging mathematical reasoning tasks, 2022.

[bib203] E. Mitchell, C. Lin, A. Bosselut, C. Finn, and C. D. Manning. Fast model editing at scale. arXiv preprint arXiv:2110.11309, 2021.

[bib204] E. Mitchell, C. Lin, A. Bosselut, C. D. Manning, and C. Finn. Memory-based model editing at scale. In International Conference on Machine Learning, pages 15817–15831. PMLR, 2022.

[bib205] J. Mok, J. Do, S. Lee, T. Taghavi, S. Yu, and S. Yoon. Large-scale lifelong learning of in-context instructions and how to tackle it. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12573–12589, Toronto, Canada, July 2023. Association for Computational Linguistics.

[bib206] A. Moradi Dakhel, V. Majdinasab, A. Nikanjam, F. Khomh, M. C. Desmarais, and Z. M. J. Jiang. GitHub Copilot AI pair programmer: Asset or liability? Journal of Systems and Software, 203:111734, 2023.

[bib207] N. Muennighoff, Q. Liu, A. Zebaze, Q. Zheng, B. Hui, T. Y. Zhuo, S. Singh, X. Tang, L. von Werra, and S. Longpre. Octopack: Instruction tuning code large language models, 2024.

[bib208] B. Neyshabur, H. Sedghi, and C. Zhang. What is being transferred in transfer learning? Advances in neural information processing systems, 33:512–523, 2020.

[bib209] T. D. Nguyen, Y. Ting, I. Ciuca, C. O’Neill, Z. Sun, M. Jablonska, S. Kruk, E. Perkowski, J. W. Miller, J. Li, J. Peek, K. Iyer, T. Rózanski, P. Khetarpal, S. Zaman, D. Brodrick, S. J. R. Méndez, T. Bui, A. Goodman, A. Accomazzi, J. P. Naiman, J. Cranney, K. Schawinski, and UniverseTBD. Astrollama: Towards specialized foundation models in astronomy. CoRR, abs/2309.06126, 2023.

[bib210] T. T. Nguyen, T. T. Huynh, P. L. Nguyen, A. W.-C. Liew, H. Yin, and Q. V. H. Nguyen. A survey of machine unlearning, 2022.

[bib211] J. Ni, J. Li, and J. McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 188–197, Hong Kong, China, Nov. 2019. Association for Computational Linguistics.

[bib212] Z. Ni, H. Shi, S. Tang, L. Wei, Q. Tian, and Y. Zhuang. Revisiting catastrophic forgetting in class incremental learning. arXiv preprint arXiv:2107.12308, 2021.

[bib213] Z. Ni, L. Wei, S. Tang, Y. Zhuang, and Q. Tian. Continual vision-language representation learning with off-diagonal information. In Proceedings of the 40th International Conference on Machine Learning, pages 26129–26149, 2023.

[bib214] E. Nijkamp, H. Hayashi, C. Xiong, S. Savarese, and Y. Zhou. Codegen2: Lessons for training LLMs on programming and natural languages. ICLR, 2023.

[bib215] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong. Codegen: An open large language model for code with multi-turn program synthesis. ICLR, 2023.

[bib216] H. F. Ólafsdóttir, D. Bush, and C. Barry. The role of hippocampal replay in memory and planning. Current Biology, 28(1):R37–R50, 2018.

[bib217] OpenAI. Introducing ChatGPT. https://openai.com/blog/chatgpt, 2022.

[bib218] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback, 2022.

[bib219] C. Pallier, S. Dehaene, J.-B. Poline, D. LeBihan, A.-M. Argenti, E. Dupoux, and J. Mehler. Brain imaging of language plasticity in adopted adults: Can a second language replace the first? Cerebral cortex, 13(2):155–161, 2003.

[bib220] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.

[bib221] I. Paul, J. Luo, G. Glavaš, and I. Gurevych. Ircoder: Intermediate representations make language models robust multilingual code generators, 2024.

[bib222] Z. Peng, W. Wang, L. Dong, Y. Hao, S. Huang, S. Ma, and F. Wei. Kosmos-2: Grounding multimodal large language models to the world, 2023.

[bib223] A. Pentina. Theoretical foundations of multi-task lifelong learning. PhD thesis, 2016.

[bib224] E. Perkowski, R. Pan, T. D. Nguyen, Y. Ting, S. Kruk, T. Zhang, C. O’Neill, M. Jablonska, Z. Sun, M. J. Smith, H. Liu, K. Schawinski, K. Iyer, I. Ciuca, and UniverseTBD. Astrollama-chat: Scaling astrollama with conversational and diverse datasets. CoRR, abs/2401.01916, 2024.

[bib225] F. Petroni, T. Rocktäschel, S. Riedel, P. Lewis, A. Bakhtin, Y. Wu, and A. Miller. Language models as knowledge bases? In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China, Nov. 2019. Association for Computational Linguistics.

[bib226] J. Pourcel, N.-S. Vu, and R. M. French. Online task-free continual learning with dynamic sparse distributed memory. In S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, editors, Computer Vision – ECCV 2022, pages 739–756, Cham, 2022. Springer Nature Switzerland.

[bib227] A. Prabhu, H. A. Al Kader Hammoud, P. K. Dokania, P. H. Torr, S.-N. Lim, B. Ghanem, and A. Bibi. Computationally budgeted continual learning: What does matter? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3698–3707, 2023.

[bib228] A. Prabhu, Z. Cai, P. Dokania, P. Torr, V. Koltun, and O. Sener. Online continual learning without the storage constraint, 2023.

[bib229] G. Puthumanaillam, M. Vora, P. Thangeda, and M. Ornik. A moral imperative: The need for continual superalignment of large language models. arXiv preprint arXiv:2403.14683, 2024.

[bib230] C. Qin and S. Joty. Lfpt5: A unified framework for lifelong few-shot language learning based on prompt tuning of t5. In International Conference on Learning Representations, 2021.

[bib231] Y. Qin, C. Qian, X. Han, Y. Lin, H. Wang, R. Xie, Z. Liu, M. Sun, and J. Zhou. Recyclable tuning for continual pre-training. arXiv preprint arXiv:2305.08702, 2023.

[bib232] Y. Qin, J. Zhang, Y. Lin, Z. Liu, P. Li, M. Sun, and J. Zhou. ELLE: Efficient lifelong pre-training for emerging data. In S. Muresan, P. Nakov, and A. Villavicencio, editors, Findings of the Association for Computational Linguistics: ACL 2022, pages 2789–2810, Dublin, Ireland, May 2022. Association for Computational Linguistics.

[bib233] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

[bib234] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.

[bib235] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.

[bib236] P. Rajpurkar, R. Jia, and P. Liang. Know what you don’t know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822, 2018.

[bib237] R. Ramesh and P. Chaudhari. Model zoo: A growing "brain" that learns continually. arXiv preprint arXiv:2106.03027, 2021.

[bib238] H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl, M. Widrich, T. Adler, L. Gruber, M. Holzleitner, M. Pavlović, G. K. Sandve, V. Greiff, D. Kreil, M. Kopp, G. Klambauer, J. Brandstetter, and S. Hochreiter. Hopfield networks is all you need, 2021.

[bib239] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.

[bib240] M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, and G. Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv preprint arXiv:1810.11910, 2018.

[bib241] H. Ritter, A. Botev, and D. Barber. Online structured laplace approximations for overcoming catastrophic forgetting. Advances in Neural Information Processing Systems, 31, 2018.

[bib242] J. Roberts, T. Lüddecke, S. Das, K. Han, and S. Albanie. Gpt4geo: How a language model sees the world’s geography, 2023.

[bib243] S. Rongali, A. Jagannatha, B. P. S. Rawat, and H. Yu. Continual domain-tuning for pretrained language models, 2021.

[bib244] G. D. Rosin, I. Guy, and K. Radinsky. Time masking for temporal language models. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, WSDM ’22, page 833–841, New York, NY, USA, 2022. Association for Computing Machinery.

[bib245] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. C. Ferrer, A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve. Code llama: Open foundation models for code, 2024.

[bib246] A. N. Rubungo, C. Arnold, B. P. Rand, and A. B. Dieng. Llm-prop: Predicting physical and electronic properties of crystalline solids from their text descriptions. CoRR, abs/2310.14029, 2023.

[bib247] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

[bib248] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale, 2019.

[bib249] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S. Sharma, E. Szczechla, T. Kim, G. Chhablani, N. Nayak, D. Datta, J. Chang, M. T.-J. Jiang, H. Wang, M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Bawden, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli, T. Fevry, J. A. Fries, R. Teehan, T. Bers, S. Biderman, L. Gao, T. Wolf, and A. M. Rush. Multitask prompted training enables zero-shot task generalization, 2022.

[bib250] F. Sarfraz, E. Arani, and B. Zonooz. Error sensitivity modulation based experience replay: Mitigating abrupt representation drift in continual learning. arXiv preprint arXiv:2302.11344, 2023.

[bib251] J. Savelka, K. D. Ashley, M. A. Gray, H. Westermann, and H. Xu. Explaining legal concepts with augmented large language models (gpt-4), 2023.

[bib252] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms, 2017.

[bib253] T. Schuster, A. Fisch, and R. Barzilay. Get your vitamin c! robust fact verification with contrastive evidence. arXiv preprint arXiv:2103.08541, 2021.

[bib254] J. Schwarz, W. Czarnecki, J. Luketina, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell. Progress & compress: A scalable framework for continual learning. In International conference on machine learning, pages 4528–4537. PMLR, 2018.

[bib255] T. Scialom, T. Chakrabarty, and S. Muresan. Fine-tuned language models are continual learners. In Y. Goldberg, Z. Kozareva, and Y. Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6107–6122, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics.

[bib256] A. Shah and S. Chava. Zero is not hero yet: Benchmarking zero-shot performance of llms for financial tasks, 2023.

[bib257] A. Shah, S. Paturi, and S. Chava. Trillion dollar words: A new financial dataset, task & market analysis, 2023.

[bib258] Y. Shao, Y. Guo, D. Zhao, and B. Liu. Class-incremental learning based on label generation. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1263–1276, Toronto, Canada, July 2023. Association for Computational Linguistics.

[bib259] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

[bib260] H. Shi and H. Wang. A unified approach to domain incremental learning with memory: Theory and algorithm. Advances in Neural Information Processing Systems, 36, 2024.

[bib261] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach. Towards vqa models that can read, 2019.

[bib262] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.

[bib263] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis, D. Neal, M. Schaekermann, A. Wang, M. Amin, S. Lachgar, P. A. Mansfield, S. Prakash, B. Green, E. Dominowska, B. A. y Arcas, N. Tomasev, Y. Liu, R. Wong, C. Semturs, S. S. Mahdavi, J. K. Barral, D. R. Webster, G. S. Corrado, Y. Matias, S. Azizi, A. Karthikesalingam, and V. Natarajan. Towards expert-level medical question answering with large language models. CoRR, abs/2305.09617, 2023.

[bib264] A. Sinitsin, V. Plokhotnyuk, D. Pyrkin, S. Popov, and A. Babenko. Editable neural networks. arXiv preprint arXiv:2004.00345, 2020.

[bib265] J. S. Smith, J. Tian, S. Halbe, Y.-C. Hsu, and Z. Kira. A closer look at rehearsal-free continual learning, 2023.

[bib266] D. Soboleva, F. Al-Khateeb, R. Myers, J. R. Steeves, J. Hestness, and N. Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 2023.

[bib267] L. Soldaini, R. Kinney, A. Bhagia, D. Schwenk, D. Atkinson, R. Authur, B. Bogin, K. Chandu, J. Dumas, Y. Elazar, V. Hofmann, A. H. Jha, S. Kumar, L. Lucy, X. Lyu, N. Lambert, I. Magnusson, J. Morrison, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, A. Ravichander, K. Richardson, Z. Shen, E. Strubell, N. Subramani, O. Tafjord, P. Walsh, L. Zettlemoyer, N. A. Smith, H. Hajishirzi, I. Beltagy, D. Groeneveld, J. Dodge, and K. Lo. Dolma: an open corpus of three trillion tokens for language model pretraining research, 2024.

[bib268] C. Song, X. Han, Z. Zeng, K. Li, C. Chen, Z. Liu, M. Sun, and T. Yang. Conpet: Continual parameter-efficient tuning for large language models, 2023.

[bib269] D. Song, H. Guo, Y. Zhou, S. Xing, Y. Wang, Z. Song, W. Zhang, Q. Guo, H. Yan, X. Qiu, and D. Lin. Code needs comments: Enhancing code llms with comment augmentation, 2024.

[bib270] P. Sprechmann, S. M. Jayakumar, J. W. Rae, A. Pritzel, A. P. Badia, B. Uria, O. Vinyals, D. Hassabis, R. Pascanu, and C. Blundell. Memory-based parameter adaptation. In International Conference on Learning Representations, 2018.

[bib271] Z. Su, J. Li, Z. Zhang, Z. Zhou, and M. Zhang. Efficient continue training of temporal language model with structural information. In H. Bouamor, J. Pino, and K. Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6315–6329, Singapore, Dec. 2023. Association for Computational Linguistics.

[bib272] Q. Sun, Z. Chen, F. Xu, K. Cheng, C. Ma, Z. Yin, J. Wang, C. Han, R. Zhu, S. Yuan, Q. Guo, X. Qiu, P. Yin, X. Li, F. Yuan, L. Kong, X. Li, and Z. Wu. A survey of neural code intelligence: Paradigms, advances and beyond, 2024.

[bib273] Y. Sun, S. Wang, Y. Li, S. Feng, H. Tian, H. Wu, and H. Wang. Ernie 2.0: A continual pre-training framework for language understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8968–8975, Apr. 2020.

[bib274] M. Tao, Y. Feng, and D. Zhao. Can bert refrain from forgetting on sequential tasks? a probing study. In The Eleventh International Conference on Learning Representations, 2022.

[bib275] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 2023.

[bib276] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and R. Stojnic. Galactica: A large language model for science. CoRR, abs/2211.09085, 2022.

[bib277] V. Thengane, S. Khan, M. Hayat, and F. Khan. Clip model is an efficient continual learner, 2022.

[bib278] J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal. Fever: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355, 2018.

[bib279] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[bib280] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[bib281] M. van de Kar, M. Xia, D. Chen, and M. Artetxe. Don’t prompt, search! mining-based zero-shot learning with language models. In Y. Goldberg, Z. Kozareva, and Y. Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7508–7520, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics.

[bib282] G. M. Van de Ven, T. Tuytelaars, and A. S. Tolias. Three types of incremental learning. Nature Machine Intelligence, 4(12):1185–1197, 2022.

[bib283] E. Verwimp, R. Aljundi, S. Ben-David, M. Bethge, A. Cossu, A. Gepperth, T. L. Hayes, E. Hüllermeier, C. Kanan, D. Kudithipudi, C. H. Lampert, M. Mundt, R. Pascanu, A. Popescu, A. S. Tolias, J. van de Weijer, B. Liu, V. Lomonaco, T. Tuytelaars, and G. M. van de Ven. Continual learning: Applications and the road forward, 2024.

[bib284] M. Völske, M. Potthast, S. Syed, and B. Stein. TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 59–63, 2017.

[bib285] B. Wang and A. Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.

[bib286] C. Wang, D. Engler, X. Li, J. Hou, D. J. Wald, K. Jaiswal, and S. Xu. Near-real-time earthquake-induced fatality estimation using crowdsourced data and large-language models, 2023.

[bib287] L. Wang, X. Zhang, Q. Li, J. Zhu, and Y. Zhong. Coscl: Cooperation of small continual learners is stronger than a big one. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVI, pages 254–271. Springer, 2022.

[bib288] L. Wang, X. Zhang, H. Su, and J. Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–20, 2024.

[bib289] N. Wang, H. Yang, and C. D. Wang. Fingpt: Instruction tuning benchmark for open-source large language models in financial datasets. CoRR, abs/2310.04793, 2023.

[bib290] R. Wang, D. Tang, N. Duan, Z. Wei, X. Huang, J. Ji, G. Cao, D. Jiang, and M. Zhou. K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1405–1418, Online, Aug. 2021. Association for Computational Linguistics.

[bib291] X. Wang, T. Chen, Q. Ge, H. Xia, R. Bao, R. Zheng, Q. Zhang, T. Gui, and X. Huang. Orthogonal subspace learning for language model continual learning. In H. Bouamor, J. Pino, and K. Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10658–10671, Singapore, Dec. 2023. Association for Computational Linguistics.

[bib292] X. Wang, Y. Zhang, T. Chen, S. Gao, S. Jin, X. Yang, Z. Xi, R. Zheng, Y. Zou, T. Gui, Q. Zhang, and X. Huang. Trace: A comprehensive benchmark for continual learning in large language models, 2023.

[bib293] Y. Wang, H. Le, A. D. Gotmare, N. D. Q. Bui, J. Li, and S. C. H. Hoi. Codet5+: Open code large language models for code understanding and generation, 2023.

[bib294] Y. Wang, Y. Liu, C. Shi, H. Li, C. Chen, H. Lu, and Y. Yang. Inscl: A data-efficient continual learning paradigm for fine-tuning large language models with instructions, 2024.

[bib295] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Arunkumar, A. Ashok, A. S. Dhanasekaran, A. Naik, D. Stap, E. Pathak, G. Karamanolakis, H. G. Lai, I. Purohit, I. Mondal, J. Anderson, K. Kuznia, K. Doshi, M. Patel, K. K. Pal, M. Moradshahi, M. Parmar, M. Purohit, N. Varshney, P. R. Kaza, P. Verma, R. S. Puri, R. Karia, S. K. Sampat, S. Doshi, S. Mishra, S. Reddy, S. Patro, T. Dixit, X. Shen, C. Baral, Y. Choi, N. A. Smith, H. Hajishirzi, and D. Khashabi. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks, 2022.

[bib296] Y. Wang, W. Wang, S. Joty, and S. C. Hoi. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In EMNLP, 2021.

[bib297] Z. Wang, L. Liu, Y. Kong, J. Guo, and D. Tao. Online continual learning with contrastive vision transformer. In European Conference on Computer Vision, pages 631–650. Springer, 2022.

[bib298] Z. Wang, Z. Zhan, Y. Gong, G. Yuan, W. Niu, T. Jian, B. Ren, S. Ioannidis, Y. Wang, and J. Dy. Sparcl: Sparse continual learning on the edge. Advances in Neural Information Processing Systems, 35:20366–20380, 2022.

[bib299] Z. Wang, Z. Zhang, S. Ebrahimi, R. Sun, H. Zhang, C.-Y. Lee, X. Ren, G. Su, V. Perot, J. Dy, et al. Dualprompt: Complementary prompting for rehearsal-free continual learning. European Conference on Computer Vision, 2022.

[bib300] Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149, 2022.

[bib301] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.

[bib302] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. Finetuned language models are zero-shot learners, 2022.

[bib303] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.

[bib304] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.

[bib305] Y. Wei, Z. Wang, J. Liu, Y. Ding, and L. Zhang. Magicoder: Source code is all you need, 2023.

[bib306] M. Weyssow, X. Zhou, K. Kim, D. Lo, and H. Sahraoui. On the usage of continual learning for out-of-distribution generalization in pre-trained language models of code. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, page 1470–1482, New York, NY, USA, 2023. Association for Computing Machinery.

[bib307] G. Winata, L. Xie, K. Radhakrishnan, S. Wu, X. Jin, P. Cheng, M. Kulkarni, and D. Preotiuc-Pietro. Overcoming catastrophic forgetting in massively multilingual continual learning. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 768–777, Toronto, Canada, July 2023. Association for Computational Linguistics.

[bib308] M. Wistuba, P. T. Sivaprasad, L. Balles, and G. Zappella. Continual learning with low rank adaptation. In NeurIPS 2023 Workshop on Distribution Shifts (DistShifts), 2023.

[bib309] M. Wistuba, P. T. Sivaprasad, L. Balles, and G. Zappella. Continual learning with low rank adaptation, 2023.

[bib310] C. Wu, Y. Gan, Y. Ge, Z. Lu, J. Wang, Y. Feng, P. Luo, and Y. Shan. Llama pro: Progressive llama with block expansion, 2024.

[bib311] C. Wu, W. Lin, X. Zhang, Y. Zhang, Y. Wang, and W. Xie. Pmc-llama: Towards building open-source language models for medicine. arXiv preprint arXiv:2305.10415, 2023.

[bib312] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. S. Rosenberg, and G. Mann. Bloomberggpt: A large language model for finance. CoRR, abs/2303.17564, 2023.

[bib313] T. Wu, M. Caccia, Z. Li, Y.-F. Li, G. Qi, and G. Haffari. Pretrained language model in continual learning: A comparative study. In International conference on learning representations, 2021.

[bib314] T. Wu, L. Luo, Y.-F. Li, S. Pan, T.-T. Vu, and G. Haffari. Continual learning for large language models: A survey, 2024.

[bib315] X. Wu, L. Xiao, Y. Sun, J. Zhang, T. Ma, and L. He. A survey of human-in-the-loop for machine learning. Future Generation Computer Systems, 135:364–381, 2022.

[bib316] Y. Wu, Y. Chen, L. Wang, Y. Ye, Z. Liu, Y. Guo, and Y. Fu. Large scale incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 374–382, 2019.

[bib317] Y. Wu, G. Wayne, A. Graves, and T. Lillicrap. The kanerva machine: A generative distributed memory. arXiv preprint arXiv:1804.01756, 2018.

[bib318] C. Xiao, X. Hu, Z. Liu, C. Tu, and M. Sun. Lawformer: A pre-trained language model for chinese legal long documents, 2021.

[bib319] J. Xie, Y. Liang, J. Liu, Y. Xiao, B. Wu, and S. Ni. Quert: Continual pre-training of language model for query understanding in travel domain search. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’23, page 5282–5291, New York, NY, USA, 2023. Association for Computing Machinery.

[bib320] Q. Xie, W. Han, X. Zhang, Y. Lai, M. Peng, A. Lopez-Lira, and J. Huang. PIXIU: A large language model, instruction data and evaluation benchmark for finance. CoRR, abs/2306.05443, 2023.

[bib321] S. M. Xie, S. Santurkar, T. Ma, and P. S. Liang. Data selection for language models via importance resampling. Advances in Neural Information Processing Systems, 36, 2024.

[bib322] T. Xie, Y. Wan, W. Huang, Z. Yin, Y. Liu, S. Wang, Q. Linghu, C. Kit, C. Grazian, W. Zhang, I. Razzak, and B. Hoex. DARWIN series: Domain specific large language models for natural science. CoRR, abs/2308.13565, 2023.

[bib323] Y. Xie, K. Aggarwal, and A. Ahmad. Efficient continual pre-training for building domain specific large language models, 2023.

[bib324] H. Xiong, S. Wang, Y. Zhu, Z. Zhao, Y. Liu, Q. Wang, and D. Shen. Doctorglm: Fine-tuning your chinese doctor is not a herculean task. arXiv preprint arXiv:2304.01097, 2023.

[bib325] H. Xu, B. Liu, L. Shu, and P. Yu. BERT post-training for review reading comprehension and aspect-based sentiment analysis. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2324–2335, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

[bib326] S. Xue, F. Zhou, Y. Xu, H. Zhao, S. Xie, Q. Dai, C. Jiang, J. Zhang, J. Zhou, D. Xiu, and H. Mei. Weaverbird: Empowering financial decision-making with large language model, knowledge base, and search engine. CoRR, abs/2308.05361, 2023.

[bib327] Y. Yan, K. Xue, X. Shi, Q. Ye, J. Liu, and T. Ruan. AF Adapter: Continual pretraining for building Chinese biomedical language model. In 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 953–957, Los Alamitos, CA, USA, Dec. 2023. IEEE Computer Society.

[bib328] G. Yang, F. Pan, and W.-B. Gan. Stably maintained dendritic spines are associated with lifelong memories. Nature, 462(7275):920–924, 2009.

[bib329] P. Yang, D. Li, and P. Li. Continual learning for natural language generations with transformer calibration. In A. Fokkens and V. Srikumar, editors, Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL), pages 40–49, Abu Dhabi, United Arab Emirates (Hybrid), Dec. 2022. Association for Computational Linguistics.

[bib330] S. Yang, M. A. Ali, C.-L. Wang, L. Hu, and D. Wang. Moral: Moe augmented lora for llms’ lifelong learning, 2024.

[bib331] X. Yang, J. Gao, W. Xue, and E. Alexandersson. Pllama: An open-source large language model for plant science. CoRR, abs/2401.01600, 2024.

[bib332] Y. Yang, M. Jones, M. C. Mozer, and M. Ren. Reawakening knowledge: Anticipatory recovery from catastrophic interference via structured training, 2024.

[bib333] Y. Yang, Y. Tang, and K. Y. Tam. Investlm: A large language model for investment using financial domain instruction tuning. CoRR, abs/2309.13064, 2023.

[bib334] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.

[bib335] H. Yin, P. Yang, and P. Li. Mitigating forgetting in online continual learning with neuron calibration. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 10260–10272. Curran Associates, Inc., 2021.

[bib336] J. Yin, S. Dash, F. Wang, and M. Shankar. FORGE: pre-training open foundation models for science. In D. Arnold, R. M. Badia, and K. M. Mohror, editors, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2023, Denver, CO, USA, November 12-17, 2023, pages 81:1–81:13. ACM, 2023.

[bib337] W. Yin, J. Li, and C. Xiong. ConTinTin: Continual learning from task instructions. In S. Muresan, P. Nakov, and A. Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3062–3072, Dublin, Ireland, May 2022. Association for Computational Linguistics.

[bib338] F. Yu, A. Gao, and B. Wang. Outcome-supervised verifiers for planning in mathematical reasoning. CoRR, abs/2311.09724, 2023.

[bib339] L. Yu, Q. Chen, J. Zhou, and L. He. Melo: Enhancing model editing with neuron-indexed dynamic lora. arXiv preprint arXiv:2312.11795, 2023.

[bib340] L. Yuan, F. E. Tay, G. Li, T. Wang, and J. Feng. Revisiting knowledge distillation via label smoothing regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3903–3911, 2020.

[bib341] S. Yuan, H. Zhao, Z. Du, M. Ding, X. Liu, Y. Cen, X. Zou, Z. Yang, and J. Tang. WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models. AI Open, 2, 2021.

[bib342] W. Yuan, Q. Zhang, T. He, C. Fang, N. Q. V. Hung, X. Hao, and H. Yin. Circle: continual repair across programming languages. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2022, page 678–690, New York, NY, USA, 2022. Association for Computing Machinery.

[bib343] S. Yue, W. Chen, S. Wang, B. Li, C. Shen, S. Liu, Y. Zhou, Y. Xiao, S. Yun, W. Lin, et al. Disc-lawllm: Fine-tuning large language models for intelligent legal services. arXiv preprint arXiv:2309.11325, 2023.

[bib344] X. Yue, X. Qu, G. Zhang, Y. Fu, W. Huang, H. Sun, Y. Su, and W. Chen. Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023.

[bib345] Z. Du, M. Wu, and L. Wang. Chinese-Llama-2. https://github.com/longyuewangdcu/Chinese-Llama-2, 2023.

[bib346] R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner, and Y. Choi. Defending against neural fake news. Advances in neural information processing systems, 32, 2019.

[bib347] F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence, 2017.

[bib348] A. Zewdu and B. Yitagesu. Part of speech tagging: a systematic review of deep learning and machine learning approaches. Journal of Big Data, 9, 2022.

[bib349] Y. Zhai, S. Tong, X. Li, M. Cai, Q. Qu, Y. J. Lee, and Y. Ma. Investigating the catastrophic forgetting in multimodal large language models, 2023.

[bib350] D. Zhang, Z. Hu, S. Zhoubian, Z. Du, K. Yang, Z. Wang, Y. Yue, Y. Dong, and J. Tang. Sciglm: Training scientific language models with self-reflective instruction annotation and tuning. CoRR, abs/2401.07950, 2024.

[bib351] H. Zhang, L. Gui, Y. Zhai, H. Wang, Y. Lei, and R. Xu. Copf: Continual learning human preference through optimal policy fitting. arXiv preprint arXiv:2310.15694, 2023.

[bib352] H. Zhang, Y. Lei, L. Gui, M. Yang, Y. He, H. Wang, and R. Xu. Cppo: Continual learning for reinforcement learning with human feedback.

[bib353] S. Zhang, L. Dong, X. Li, S. Zhang, X. Sun, S. Wang, J. Li, R. Hu, T. Zhang, F. Wu, and G. Wang. Instruction tuning for large language models: A survey, 2024.

[bib354] X. Zhang, C. Tian, X. Yang, L. Chen, Z. Li, and L. R. Petzold. Alpacare: Instruction-tuned large language models for medical application. arXiv preprint arXiv:2310.14558, 2023.

[bib355] X. Zhang and Q. Yang. Xuanyuan 2.0: A large chinese financial chat model with hundreds of billions parameters. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM ’23, page 4435–4439, New York, NY, USA, 2023. Association for Computing Machinery.

[bib356] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.

[bib357] Y. Zhang, X. Wang, and D. Yang. Continual sequence generation with adaptive compositional modules. In S. Muresan, P. Nakov, and A. Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3653–3667, Dublin, Ireland, May 2022. Association for Computational Linguistics.

[bib358] Z. Zhang, M. Fang, L. Chen, and M.-R. Namazi-Rad. CITB: A benchmark for continual instruction tuning. In H. Bouamor, J. Pino, and K. Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9443–9455, Singapore, Dec. 2023. Association for Computational Linguistics.

[bib359] C. Zhao, Y. Li, and C. Caragea. C-STANCE: A large dataset for Chinese zero-shot stance detection. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13369–13385, Toronto, Canada, July 2023. Association for Computational Linguistics.

[bib360] H. Zhao, S. Liu, C. Ma, H. Xu, J. Fu, Z.-H. Deng, L. Kong, and Q. Liu. GIMLET: A unified graph-text model for instruction-based molecule zero-shot learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

[bib361] H. Zhao, H. Wang, Y. Fu, F. Wu, and X. Li. Memory-efficient class-incremental learning for image classification. IEEE Transactions on Neural Networks and Learning Systems, 33(10):5966–5977, 2022.

[bib362] S. Zhao, X. Zou, T. Yu, and H. Xu. Reconstruct before query: Continual missing modality learning with decomposed prompt collaboration, 2024.

[bib363] W. Zhao, S. Wang, Y. Hu, Y. Zhao, B. Qin, X. Zhang, Q. Yang, D. Xu, and W. Che. Sapt: A shared attention framework for parameter-efficient continual learning of large language models, 2024.

[bib364] J. Zheng, Q. Ma, Z. Liu, B. Wu, and H. Feng. Beyond anti-forgetting: Multimodal continual instruction tuning with positive forward transfer, 2024.

[bib365] J. Zheng, S. Qiu, and Q. Ma. Learn or recall? revisiting incremental learning with pre-trained language models, 2023.

[bib366] Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, Z. Wang, L. Shen, A. Wang, Y. Li, T. Su, Z. Yang, and J. Tang. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x, 2023.

[bib367] Z. Zheng, J. Zhang, T. Vu, S. Diao, Y. H. W. Tim, and S. Yeung. Marinegpt: Unlocking secrets of ocean to the public. CoRR, abs/2310.13596, 2023.

[bib368] B. Zhou, D. Khashabi, Q. Ning, and D. Roth. “going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3363–3369, Hong Kong, China, Nov. 2019. Association for Computational Linguistics.

[bib369] W. Zhou, D.-H. Lee, R. K. Selvam, S. Lee, B. Y. Lin, and X. Ren. Pre-training text-to-text transformers for concept-centric common sense. 2021.

[bib370] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023.

[bib371] D. Zhu, Z. Sun, Z. Li, T. Shen, K. Yan, S. Ding, K. Kuang, and C. Wu. Model tailor: Mitigating catastrophic forgetting in multi-modal large language models, 2024.

[bib372] T. Y. Zhuo, A. Zebaze, N. Suppattarachai, L. von Werra, H. de Vries, Q. Liu, and N. Muennighoff. Astraios: Parameter-efficient instruction tuning code large language models, 2024.