Model-Dowser: Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models

Hyeontaek Hwang, Nguyen Dinh Son, Daeyoung Kim

Abstract

Fine-tuning Multimodal Large Language Models (MLLMs) on task-specific data is an effective way to improve performance on downstream applications. However, such adaptation often leads to a degradation in generalization on pretrained tasks, a phenomenon known as Catastrophic Forgetting. Existing methods that aim to mitigate this issue either become ineffective when fine-tuning deeper layers of the language decoder or scale poorly with increasing model size. To address these limitations, we propose Model-Dowser, a novel sparse fine-tuning approach for MLLMs. Model-Dowser measures a principled importance score for each model parameter with respect to pretrained generalization (prior to downstream adaptation) by jointly considering weight magnitudes, input activations, and output sensitivities. During fine-tuning, Model-Dowser selectively preserves high-importance parameters and updates the remaining ones. Comprehensive experiments on two representative MLLMs, LLaVA and NVILA, demonstrate that Model-Dowser effectively mitigates catastrophic forgetting and consistently outperforms prior methods, while remaining resource-efficient and scalable to multi-billion-parameter models.

Introduction

Recently, Multimodal Large Language Models (MLLMs) (Liu et al., 2023; 2025; Wang et al., 2024a; Chen et al., 2024c) have emerged as a central focus in the field of multimodal learning. These models typically consist of a pretrained large language decoder (often a transformer-based architecture (Vaswani et al., 2017)) integrated with vision encoders through a connection module. The success of MLLMs can be attributed to their ability to generalize across various tasks, which is driven by large-scale visual instruction tuning and their substantial model size, often reaching billions of parameters.

  • Equal contribution 1 School of Computing, KAIST, Daejeon, Republic of Korea. Correspondence to: Daeyoung Kim <kimd@kaist.ac.kr>.

Figure 1. Performance comparison of catastrophic forgetting mitigation methods on LLaVA-1.5 (7B) and NVILA-Lite (2B) fine-tuned on ImageNet-R. (a)-(b) Radar charts illustrating the balance between downstream adaptation and upstream capabilities. (c)-(d) H-score 1 stability across varying fine-tuning depths. Model-Dowser (red line) consistently achieves robust performance compared to previous works.

Despite their strong zero-shot performance, these models frequently exhibit suboptimal results on domain-specific downstream tasks (Zhai et al., 2024; Zhu et al., 2024; Huang et al., 2025). As a result, further fine-tuning is often necessary to better align them with task-specific instructions and domain distributions.

A common approach to adapting MLLMs for specific tasks involves creating an instruction-following dataset tailored for the target task. During fine-tuning, the large language decoder is typically optimized while other components are kept frozen, enabling efficient alignment with downstream instructions (Han et al., 2024; Zhou et al., 2024a; Zhai et al., 2024). Although this fully supervised fine-tuning strategy is straightforward, it often leads to significant generalization loss, a phenomenon known as Catastrophic Forgetting (Goodfellow et al., 2013; Zhai et al., 2024; Dong et al., 2021). This issue arises because downstream instruction data is usually limited in size and narrowly focused. As a result, fine-tuning MLLMs on such data can overwrite their pretrained representations.

1 Harmonic mean between accuracies on pretrained and downstream tasks. A detailed definition is given in Section 4.

Recent efforts to mitigate catastrophic forgetting in MLLMs mainly fall into two categories: model post-merging methods (Zhu et al., 2024; Yu et al., 2024; Panigrahi et al., 2023) and sparse fine-tuning methods (Huang et al., 2025; Hui et al., 2025). Post-merging approaches aim to preserve pretrained knowledge by fusing a pretrained model with its fine-tuned counterpart, whereas sparse fine-tuning methods restrict parameter updates to a subset of model weights to limit disruption to pretrained representations. Despite their effectiveness, existing approaches are predominantly evaluated under shallow fine-tuning settings, in which only a small subset of the language decoder's final layers is updated during downstream adaptation. For example, ModelTailor (Zhu et al., 2024) fine-tunes only the last 12 layers of LLaVA (Liu et al., 2023), while the state-of-the-art sparse method Specialization via Importance Discrepancy Evaluation for Refinement (SPIDER) (Huang et al., 2025) reports results for fewer than the last 5 layers of LLaVA. However, recent studies (Chen et al., 2024b; Zhang et al., 2025) indicate that earlier layers of the language decoder play a critical role in multimodal understanding, suggesting that shallow fine-tuning may not fully exploit the model's adaptation capacity. Motivated by this observation, we examine catastrophic forgetting under deeper fine-tuning regimes, as illustrated in Figure 1. We find that post-merging methods degrade rapidly once fine-tuning extends to earlier decoder layers, likely because extensive parameter updates disrupt the pretrained latent space in ways that cannot be recovered through post-hoc fusion. Sparse fine-tuning methods exhibit more stable behavior in this setting, but their robustness comes at the expense of substantial memory and computational overhead, limiting scalability for MLLMs.

Driven by these limitations, we propose Model-Dowser, a novel sparse fine-tuning method for MLLMs that balances downstream task performance with preservation of pretrained generalization, while maintaining the same memory complexity as standard fine-tuning. Our approach is motivated by a simple question: Which parameter perturbations most strongly affect the model's outputs? We hypothesize that preserving generalization can be achieved by minimizing output shifts induced by downstream fine-tuning. To this end, we provide a theoretical analysis showing that output shifts induced by parameter updates can be effectively characterized by jointly considering weight magnitudes, input activations, and output sensitivities. Based on this insight, Model-Dowser assigns an importance score to each parameter prior to downstream adaptation and selectively freezes high-importance parameters during fine-tuning, allowing the model to adapt to target tasks without overwriting pretrained knowledge.

Extensive experiments on two representative MLLM architectures, LLaVA (Liu et al., 2024a) and NVILA (Liu et al., 2025), across diverse downstream tasks demonstrate that Model-Dowser consistently achieves state-of-the-art performance in mitigating catastrophic forgetting. Furthermore, our method is data-free and computationally efficient, making it highly scalable to large models. Our contributions are summarized as follows:

· Forgetting diagnosis across fine-tuning depth. Our analysis reveals that MLLMs experience severe catastrophic forgetting as fine-tuning extends to deeper portions of the language decoder, and that existing approaches are either ineffective under this regime or exhibit inconsistent behavior across different fine-tuning settings.

· A scalable sparse fine-tuning method. We propose Model-Dowser, a novel sparse fine-tuning method that introduces a data-free importance score derived from input activations and output sensitivity, and selectively freezes critical parameters prior to adaptation to enable effective downstream learning without loss of pretrained knowledge.

· Theoretical justification. We provide a theoretical analysis showing that our importance score captures the sensitivity of model outputs to individual parameter perturbations, explaining why preserving high-score parameters helps retain pretrained generalization.

· State-of-the-art results with practical efficiency. Our experiments on different MLLMs and downstream tasks show that Model-Dowser consistently outperforms prior methods, while remaining highly resource-efficient and scalable.

Catastrophic Forgetting

In the context of large foundation models, catastrophic forgetting refers to the phenomenon where a model loses its ability to generalize to previously learned or unseen tasks after being adapted to downstream tasks (Wang et al., 2024b).

In Large Language Models (LLMs), existing forgetting-mitigation methods can be broadly categorized into three groups. (1) Additive methods (Hu et al., 2022; Li & Liang, 2021; Lester et al., 2021; Zhang et al., 2021; Sung et al., 2022) introduce a small number of additional parameters to learn task-specific knowledge while keeping the pretrained weights fixed. Although these approaches are training-efficient, the resulting architectural modifications complicate deployment and pose challenges for continual fine-tuning. (2) Post-merging methods (Yu et al., 2024; Panigrahi et al., 2023; Li et al., 2022) aim to fuse pretrained and fine-tuned weights using heuristic selection or importance criteria. However, such methods often struggle to balance generic knowledge and domain-specific adaptation, making them sensitive to hyperparameter choices and limiting their robustness. (3) Sparse fine-tuning methods, such as (Hui et al., 2025; Lu et al., 2024; Xu & Zhang, 2024a), update a subset of model parameters during downstream adaptation and have shown potential in mitigating catastrophic forgetting. Nevertheless, due to the lack of a principled parameter importance criterion, these methods may inadvertently modify parameters critical to generalization.

In Multimodal Large Language Models (MLLMs), early studies mainly investigate catastrophic forgetting as a side effect of continual learning (Guo et al., 2025a; Chen et al., 2025; 2024a; Guo et al., 2025b), which differs from our focus on downstream task adaptation. ModelTailor (Zhu et al., 2024) is among the first methods specifically designed for MLLMs, adopting a post-merging strategy. While it improves generalization compared to LLM-based baselines, it often sacrifices target-task performance to preserve pretrained representations. More recently, SPIDER (Huang et al., 2025) proposes a sparse fine-tuning approach that actively measures parameter importance during training based on gradient information and weight magnitudes, achieving strong empirical results in both downstream adaptation and generalization retention. However, existing MLLM-specific methods are typically evaluated only when fine-tuning the final layers of the language decoder, leaving their effectiveness under deeper fine-tuning largely unexplored. As we demonstrate later, these methods exhibit weak and unstable performance when fine-tuning extends to earlier layers.

We provide additional discussion of other anti-forgetting methods for smaller models in Appendix A and of continual learning in Appendix H.

Parameter Importance

Measuring parameter contributions is essential for mitigating catastrophic forgetting, yet early second-order methods such as Optimal Brain Surgeon (Hassibi & Stork, 1992) are computationally prohibitive for foundation models. As a result, magnitude-based methods (Mallya & Lazebnik, 2018; Tanaka et al., 2020) have emerged; however, they rely on assumptions of homogeneous activations, which do not hold in modern MLLMs. More recent approaches, including Pruning by Weights and Activations (Wanda) (Sun et al., 2024b), employ empirical criteria based on weight, activation, or gradient statistics, yet offer limited theoretical analysis of their impact on output-level functional stability. Moreover, these criteria can be unreliable under massive activations (Sun et al., 2024a). SPIDER (Huang et al., 2025) employs dynamic updates during training based on weight magnitude and gradient norm, achieving balanced performance. However, SPIDER may incur substantial memory overhead because it maintains the accumulated per-parameter gradient history and a soft mask matrix. To address these limitations, we propose Model-Dowser, which directly quantifies output-level functional impact, providing a principled and memory-efficient mechanism for preserving functionally critical parameters via a simple binary mask.

Methodology

In this section, we introduce Model-Dowser, which mitigates catastrophic forgetting by identifying and preserving functionally critical parameters through the three-stage pipeline illustrated in Figure 2. The process consists of data-free sensitivity probing and functional importance scoring, which together guide sparse fine-tuning and ensure stable representational adaptation.

Functional Importance Scoring

Modern MLLMs predominantly employ non-homogeneous activation functions, such as GELU (Hendrycks & Gimpel, 2016), SiLU (Elfwing et al., 2018), and GLU variants (Shazeer, 2020). In such architectures, the magnitude of a parameter no longer reliably reflects its functional contribution, as the non-linear curvature of these activations decouples weight scale from representational impact. To address this, we propose a sensitivity-based functional importance measure, formalized in Theorem 3.1, that quantifies importance via the estimated output shift ∥∆f∥₂ induced by parameter perturbations. The corresponding proof is deferred to Appendix B.

Theorem 3.1 (Functional shift for single-weight perturbation). Consider a layer l in an MLLM f. Under the first-order Taylor approximation, the L2 norm of the output shift ∆f induced by perturbing a weight W^(l)_ij is given by:

$$
\|\Delta f\|_2 \;\approx\; \big\|J^{(l)}_i\big\|_2 \,\big|\Delta W^{(l)}_{ij}\big|\,\big|h^{(l-1)}_j\big|
$$

where J^(l)_i = ∂f/∂z^(l)_i denotes the i-th column of the Jacobian matrix of the network output with respect to the pre-activation vector z^(l), and h^(l-1) is the input activation of the l-th layer (the output of layer l-1).
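As a sanity check, the first-order prediction of Theorem 3.1 can be reproduced numerically on a toy two-layer network. This is a minimal sketch: the SiLU activation and the tiny dimensions are illustrative choices, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def silu(z):            # a non-homogeneous activation, as in modern MLLMs
    return z / (1.0 + np.exp(-z))

def silu_grad(z):       # d/dz [z * sigmoid(z)] = s(z) * (1 + z * (1 - s(z)))
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 + z * (1.0 - s))

# toy two-layer network f(x) = W2 @ silu(W1 @ x)
d_in, d_hidden, d_out = 6, 8, 4
W1 = rng.normal(size=(d_hidden, d_in))
W2 = rng.normal(size=(d_out, d_hidden))
x = rng.normal(size=d_in)              # plays the role of h^{(l-1)}

def forward(W1_):
    return W2 @ silu(W1_ @ x)

# perturb a single weight W1[i, j] by a small delta and measure the shift
i, j, delta = 3, 2, 1e-4
W1_pert = W1.copy()
W1_pert[i, j] += delta
shift_exact = np.linalg.norm(forward(W1_pert) - forward(W1))

# first-order prediction from Theorem 3.1:
# ||Δf||_2 ≈ ||J_i||_2 * |ΔW_ij| * |h_j|, with J_i = W2[:, i] * silu'(z_i)
z = W1 @ x
J_i = W2[:, i] * silu_grad(z[i])
shift_pred = np.linalg.norm(J_i) * abs(delta) * abs(x[j])

print(shift_exact, shift_pred)         # the two values agree to first order
```

For a small enough perturbation, the discrepancy between the exact and predicted shift is second order in delta, which is why the two printed values match closely.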

Figure 2. Overall Architecture of Model-Dowser. The proposed method consists of three main steps. 1. Probing (Section 3.2): sample the Jacobian matrix and input activations with synthetic data samples at every layer (l). 2. Compute Score (Section 3.2): generate a parameter-wise importance score from the Jacobian matrix, weight magnitudes, and activations. 3. Sparse Finetune (Section 3.3): update the least important ρ% of parameters (highlighted in yellow), based on their importance scores, for the target downstream task.

Corollary 3.2 (Functional shift for multi-weight perturbation). Let ∆W = {∆W^(l)}_{l=1}^{L} be the set of perturbation matrices for all layers in the model f. Under the first-order Taylor approximation, the total output shift ∥∆f∥₂ is bounded by the global aggregate of individual parameter sensitivities as follows:

$$
\|\Delta f\|_2 \;\le\; \sum_{l=1}^{L} \sum_{i,j} \big\|J^{(l)}_i\big\|_2 \,\big|\Delta W^{(l)}_{ij}\big|\,\big|h^{(l-1)}_j\big|
$$

The corresponding proof for this extension is deferred to Appendix C.

Based on the linear approximation provided in Theorem 3.1, we define the importance of each parameter. While the theorem quantifies the shift ∥∆f∥₂ for an arbitrary perturbation ∆W, we are specifically interested in the functional contribution of the current weight magnitude to the model's output stability. Following Theorem 3.1, we define a sensitivity-based importance score S^(l)_ij by substituting the potential perturbation with the current weight magnitude |W^(l)_ij|:

$$
S^{(l)}_{ij} = \big\|J^{(l)}_i\big\|_2 \,\big|W^{(l)}_{ij}\big|\,\big|h^{(l-1)}_j\big|
$$

We interpret S^(l)_ij as a first-order proxy for the functional impact induced by the maximal local perturbation of W^(l)_ij. Intuitively, the score S^(l)_ij quantifies the functional importance of a weight by considering the entire path of information flow along three dimensions:

Output Sensitivity (∥J_i∥₂) reflects the downstream impact, measuring the gradient-based sensitivity of the output to the i-th node.

Connection Strength (|W_ij|) represents the inherent scale of the parameter. As we interpret it as a proxy for ∆W_ij, it captures the potential magnitude of the perturbation.

Input Activity (|h_j|) measures the magnitude of the incoming signal from the preceding layer. It quantifies the connection's potential impact from the input side.

This formulation effectively integrates the gradient-based sensitivity of structural flow methods with the activation-awareness of local pruning, while providing robustness against the non-linear scaling of modern MLLM architectures.
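The three-factor score can be computed layer-wise with a few array operations. The sketch below assumes the Jacobian columns and the input activation for the layer are already available; the array names and shapes are our own illustrative choices.

```python
import numpy as np

def importance_scores(W, h, J):
    """S_ij = ||J_i||_2 * |W_ij| * |h_j| for one linear layer.

    W : (d_out, d_in)    weight matrix W^{(l)}
    h : (d_in,)          input activation h^{(l-1)}
    J : (d_final, d_out) Jacobian of the model output w.r.t. pre-activations z^{(l)}
    """
    jnorm = np.linalg.norm(J, axis=0)                  # ||J_i||_2 per output node i
    return jnorm[:, None] * np.abs(W) * np.abs(h)[None, :]

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))
h = rng.normal(size=3)
J = rng.normal(size=(5, 4))
S = importance_scores(W, h, J)
print(S.shape)  # (4, 3): one non-negative score per parameter
```

Note that the score matrix has exactly the shape of the weight matrix, so ranking and masking can later be done element-wise.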

Data-free Importance Estimation via Synthetic Probing

To compute the importance defined in Equation (3), we estimate each parameter's functional sensitivity by probing the model's response with synthetically generated inputs. This strategy avoids reliance on original pretraining data and enables scalable evaluation for large MLLMs.

Stochastic Sensitivity Estimation To capture the functional importance of each neuron, we utilize the L2 norm of the Jacobian J of model f , which measures how much a perturbation at a specific node propagates to the final representation. However, computing the full Jacobian matrix of an MLLM is computationally expensive.

To calculate this efficiently in high-dimensional MLLMs, we utilize the Hutchinson Trace Estimator (Hutchinson, 1990). By projecting the output with a random Rademacher vector ξ ∈ {±1}^{d_final}, the backward gain is stochastically estimated via the squared gradient:

$$
\big\|J^{(l)}_{i}\big\|_2^{2} = \mathbb{E}_{\xi}\!\left[\left(\frac{\partial\,(\xi^{\top} f)}{\partial z^{(l)}_{i}}\right)^{\!2}\right], \qquad \xi \in \{\pm 1\}^{d_{\text{final}}}
$$

This allows us to obtain the node-wise output sensitivity of all parameters with a minimal number of backward passes, bypassing the need for explicit Jacobian construction. A discussion of numerical stability in practice is provided in Appendix D.
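The estimator can be illustrated with a small stand-in matrix playing the role of the Jacobian (which the real method never materializes); each Rademacher probe corresponds to one backward pass on the actual network. The dimensions and the large probe count are illustrative, chosen only to make the demonstration tight.

```python
import numpy as np

rng = np.random.default_rng(2)
d_final, d_hidden = 64, 16
J = rng.normal(size=(d_final, d_hidden))   # stand-in for ∂f/∂z^{(l)}

# ground truth that explicit Jacobian construction would give: ||J_i||_2^2
exact = np.sum(J ** 2, axis=0)

# Hutchinson estimate: for Rademacher probes ξ ∈ {±1}^{d_final}, the gradient
# of ξᵀf w.r.t. z is Jᵀξ; averaging its square recovers the column norms
# in expectation, since E[ξξᵀ] = I
R = 20000                                  # many probes only to tighten the demo
xi = rng.choice([-1.0, 1.0], size=(R, d_final))
g = xi @ J                                 # row n = gradient for probe n
est = (g ** 2).mean(axis=0)

print(np.max(np.abs(est - exact) / exact))  # small relative error
```

In practice a handful of probes (R = 8 in the paper's experiments) already suffices, because the estimates are further averaged over synthetic samples, as described next.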

Synthetic Probing and Monte Carlo Estimation Finally, we address the challenge of data unavailability by performing the estimation with synthetic probes. To enable data-free probing, we leverage the MLLM's generative capability to synthesize N prompts from random seeds, which serve as model-generated inputs to probe its functional response. The final importance score S̄^(l)_ij is computed as a Monte Carlo (MC) estimator, averaging the product of forward and backward gains over these N stochastic trials:

$$
\bar{S}^{(l)}_{ij} = \frac{1}{N}\sum_{n=1}^{N} \big\|J^{(l)}_{i,n}\big\|_2 \,\big|W^{(l)}_{ij}\big|\,\big|h^{(l-1)}_{j,n}\big|
$$

where J^(l)_{i,n} and h^(l-1)_{j,n} are the Jacobian column and input activation measured during the n-th trial. This averaging process serves as a variance reduction mechanism, ensuring that the identified functionally important parameters are robust across diverse activation patterns.
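The Monte Carlo average might be sketched as follows, with `probe_trial` standing in (hypothetically) for one forward/backward probe pass on the real model; the random arrays and dimensions are placeholders for measured quantities.

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_out, d_final = 5, 6, 7
W = rng.normal(size=(d_out, d_in))            # weights W^{(l)} under study

def probe_trial():
    """One synthetic probe: returns (||J_i||_2 per node, |h_j| per input).
    Random arrays stand in for quantities measured on a real MLLM pass."""
    h = rng.normal(size=d_in)                 # input activation h^{(l-1)} for this trial
    J = rng.normal(size=(d_final, d_out))     # Jacobian w.r.t. pre-activations z^{(l)}
    return np.linalg.norm(J, axis=0), np.abs(h)

# Monte Carlo estimate: S̄_ij = (1/N) Σ_n ||J_{i,n}||_2 * |W_ij| * |h_{j,n}|
N = 64
S_bar = np.zeros_like(W)
for _ in range(N):
    jnorm, habs = probe_trial()
    S_bar += jnorm[:, None] * np.abs(W) * habs[None, :]
S_bar /= N
print(S_bar.shape)  # one averaged score per weight
```

Averaging over trials smooths out scores that are large only under one atypical activation pattern, which is the variance-reduction effect the text refers to.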

Synthetic Prompt Generation for Data-Free Probing In practice, pretraining data are often unavailable because they are kept in-house. To estimate parameter importance without access to such datasets, we leverage MLLMs' generative capabilities for data-free probing. Concretely, we sample random tokens from the model's tokenizer vocabulary and use them as seeds to synthesize N prompts {x̂_n}_{n=1}^{N} as:

$$
\hat{x}_n = f\!\left(\epsilon_n;\, \theta_{\text{pretrain}}\right), \qquad \epsilon_n \sim \mathrm{Uniform}(\mathcal{V}), \quad n = 1, \dots, N
$$

where ϵ_n is the sampled token vector and θ_pretrain denotes the pretrained weights of the MLLM. By propagating these synthetic probes through the model, we induce a wide range of activation patterns that reflect the model's learned functional structure, without relying on any task-specific or domain-specific data. This procedure enables us to identify structurally and functionally important parameters in a data-free manner. More detailed discussion and validation are provided in Appendix G.
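A minimal sketch of the seeding procedure follows, with a placeholder vocabulary and a stub `generate` in place of the MLLM's actual autoregressive decoding; all names, sizes, and the stub's behavior are hypothetical.

```python
import random

random.seed(0)

# hypothetical stand-ins: a real implementation would use the MLLM's own
# tokenizer vocabulary and its autoregressive generation routine
VOCAB = [f"tok{t}" for t in range(1000)]        # tokenizer vocabulary V

def generate(seed_tokens, max_new=8):
    # placeholder for f(ε; θ_pretrain): the pretrained model continues the seed
    return seed_tokens + random.choices(VOCAB, k=max_new)

def synthesize_probes(n_prompts=64, seed_len=4):
    """Sample random seed tokens ε_n from the vocabulary and let the
    (stubbed) pretrained model expand them into N synthetic probe prompts."""
    probes = []
    for _ in range(n_prompts):
        eps = random.choices(VOCAB, k=seed_len)   # random token seed ε_n
        probes.append(generate(eps))              # x̂_n = f(ε_n; θ_pretrain)
    return probes

probes = synthesize_probes()
print(len(probes))  # 64
```

The key design point is that the seeds carry no task information; the model's own continuation of random tokens is what exercises its learned activation patterns.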

Computational Complexity The total procedure requires O(N · R) forward and backward passes, where N is the number of MC synthetic samples and R is the number of Rademacher vectors used for Hutchinson estimation per sample. Given that N and R are typically small (e.g., N, R ≪ d_final), our method is significantly more efficient than explicit Jacobian computation, which would require d_final backward passes per sample.
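For concreteness, the budget can be worked out with the paper's settings N = 64 and R = 8; the output dimensionality d_final below is our illustrative assumption, not a value reported in the paper.

```python
# backward-pass budget of the probing stage
N, R = 64, 8                      # paper's settings for MC samples and probes
d_final = 4096                    # illustrative output dimensionality (assumed)

hutchinson_total = N * R          # O(N·R) passes with the trace estimator
explicit_total = N * d_final      # explicit Jacobian: d_final passes per sample

print(hutchinson_total, explicit_total)  # 512 vs 262144
```

Under these assumptions the stochastic estimator needs roughly 500× fewer backward passes than explicitly constructing the Jacobian for every probe.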


Sparse Finetuning

Since catastrophic forgetting in MLLMs primarily manifests as functional drift of pretrained representations, preserving parameters with high functional sensitivity naturally mitigates such degradation. Once the importance scores S̄^(l)_ij are estimated for all parameters, we perform sparse fine-tuning to adapt the MLLM to new tasks while mitigating catastrophic forgetting. This process involves two main steps: (1) identifying a set of functionally important parameters through binary masking, and (2) performing gradient updates only on the non-essential parameters. To identify the parameters most influential for the pretrained weights, we rank all weights within each layer by their importance scores S̄^(l)_ij. Given an update ratio ρ ∈ [0, 1], we generate a binary mask M^(l)_ij as follows:

$$
M^{(l)}_{ij} =
\begin{cases}
1, & \text{if } \bar{S}^{(l)}_{ij} \text{ is among the lowest } \rho \text{ fraction of scores in layer } l,\\
0, & \text{otherwise.}
\end{cases}
$$

During adaptation to a new task, we preserve the knowledge encoded in functionally important weights by freezing them. Gradient updates are restricted to the remaining parameters (where M^(l)_ij = 1), allowing the model to learn task-specific features without shifting the established representation. The sparsely updated parameter θ* during fine-tuning is formulated as:

$$
\theta^{*} = \theta - \lambda \left( M \odot \nabla_{\theta} \mathcal{L} \right)
$$

where θ denotes the weights of the MLLM, λ is the learning rate, L is the training objective, and ⊙ represents the Hadamard product.
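Putting the mask construction and the masked update together for a single layer gives a short sketch; the sizes are illustrative, and tie-handling and optimizer details are omitted.

```python
import numpy as np

def sparse_update(theta, scores, grad, rho=0.1, lr=2e-5):
    """Update only the least-important ρ fraction of weights in one layer.

    theta, scores, grad : same-shape arrays (W^{(l)}, S̄^{(l)}, ∇L)
    """
    k = int(rho * theta.size)                    # number of trainable weights
    flat = scores.ravel()
    thresh = np.partition(flat, k - 1)[k - 1]    # k-th smallest score
    M = (scores <= thresh).astype(theta.dtype)   # M_ij = 1 → trainable
    return theta - lr * (M * grad)               # θ* = θ − λ (M ⊙ ∇L)

rng = np.random.default_rng(4)
theta = rng.normal(size=(10, 10))
scores = rng.random((10, 10))
grad = rng.normal(size=(10, 10))
theta_new = sparse_update(theta, scores, grad, rho=0.1)
changed = np.sum(theta_new != theta)
print(changed)  # 10 of the 100 weights were updated
```

Because the mask is binary and fixed before training, it adds no per-parameter state beyond the mask itself, which is the memory argument made against gradient-accumulating alternatives.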

Justification By freezing parameters with high S̄^(l)_ij, we effectively suppress the dominant contributors to output perturbation, as approximated by the first-order sensitivity model in Theorem 3.1. Since our scoring mechanism accounts for the nonlinearity of non-homogeneous activations via the Hutchinson-estimated Jacobian, this sparse update strategy ensures that the functional anchors of the MLLM remain intact, providing greater stability than simple magnitude-based freezing methods.


Experiments

Experiment Settings

Datasets and Architectures. To evaluate the effectiveness of our method, we fine-tune MLLMs on a diverse set of downstream tasks, including image captioning, image classification, and visual question answering (VQA). Specifically, we use COCO-Caption (Lin et al., 2014) and Flickr30k (Young et al., 2014) for image captioning, ImageNet-R (Hendrycks et al., 2021) for image classification, and IconQA (Lu et al., 2021) for VQA. Experiments are conducted on two representative MLLM architectures, the widely used LLaVA (Liu et al., 2024a) and the recently proposed NVILA models (Liu et al., 2025). To assess generalization and forgetting, we follow prior benchmark protocols (Zhu et al., 2024; Huang et al., 2025) and evaluate zero-shot performance on upstream (pretrained) tasks using TextVQA (Singh et al., 2019), OKVQA (Marino et al., 2019), OCRVQA (Mishra et al., 2019), and GQA (Hudson & Manning, 2019). In addition, we include MMBench in both English (MMB) and Chinese (MMB (CN)) (Liu et al., 2024b) to measure multilingual zero-shot visual question answering capabilities. Please refer to Appendix E.1 for more details on each dataset.

Table 1. Performance comparison on COCO-Caption, ImageNet-R, Flickr30k, and IconQA using NVILA-Lite-2B, where only the last 20 layers of the language decoder are fine-tuned with an update ratio of ρ = 0.1. Bold and underlined entries represent the best and second-best results.

Baselines. We compare Model-Dowser against several strong baselines designed to mitigate catastrophic forgetting. For methods specialized for MLLMs, we include ModelTailor (or Tailor) (Zhu et al., 2024) and SPIDER (Huang et al., 2025). ModelTailor is a post-merging approach that fuses pretrained weights with weights fine-tuned on downstream tasks, whereas SPIDER achieves state-of-the-art results in preventing forgetting for MLLMs by selectively updating model parameters based on accumulated gradient information during training. Given the architectural similarity between Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), we additionally consider representative forgetting-mitigation methods originally proposed for LLMs, including Model Grafting (Grafting) (Panigrahi et al., 2023) and Drop & Rescale (DARE) (Yu et al., 2024). Grafting and DARE are post-merging methods similar to ModelTailor. Finally, we report results from full fine-tuning (Full-FT) as a reference baseline. We provide more details on the selected baselines in Appendix E.2.

Training Settings. We follow the resource settings from (Zhou et al., 2024b) and randomly sample 10k training instances from each downstream dataset. For all experiments, we use a learning rate of 2 × 10^-5 and train for 5 epochs. Unless otherwise specified, all remaining hyperparameters follow the official implementations of LLaVA 2 and NVILA 3. Due to computational constraints, we evaluate LLaVA using the LLaVA-1.5-7B model and NVILA using NVILA-Lite-2B. During fine-tuning, we update the last L layers of the language decoder while freezing all other components. To investigate the effect of fine-tuning depth, L is varied in increments of 4, ranging from 4 to 32 for LLaVA and from 4 to 28 for NVILA. All experiments are conducted on 8 NVIDIA A100 GPUs (40 GB) with a total batch size of 128. For memory-intensive baselines such as Grafting and SPIDER, we instead use NVIDIA H200 GPUs (143 GB). SPIDER cannot be trained on LLaVA-1.5-7B when L > 20 due to its memory complexity. To calculate the weight importance score in Model-Dowser, we set the number of synthetic text samples N = 64 and the number of Rademacher vectors R = 8 for all experiments; more details on these selections are in Appendix G.

2 https://github.com/haotian-liu/LLaVA

Evaluation Metrics. For upstream (pretrained) tasks, we strictly follow the evaluation protocols of (Luo et al., 2024) and report the average accuracy across all upstream benchmarks, denoted A_up, to measure the model's retained generalization ability. For downstream tasks, we adopt the CIDEr metric (Vedantam et al., 2015) for the image captioning datasets (COCO-Caption and Flickr30k) and Exact Match (EM) accuracy for ImageNet-R and IconQA, denoted A_down. We evaluate the effectiveness of forgetting mitigation using both the arithmetic mean (Avg) and the harmonic mean (H-score) of A_up and A_down:

$$
\mathrm{Avg} = \frac{A_{\mathrm{up}} + A_{\mathrm{down}}}{2}, \qquad \text{H-score} = \frac{2\, A_{\mathrm{up}}\, A_{\mathrm{down}}}{A_{\mathrm{up}} + A_{\mathrm{down}}}
$$

Higher scores indicate better overall performance in mitigating catastrophic forgetting.
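The two metrics are straightforward to compute; the second call below illustrates, with made-up accuracy values, how the harmonic mean penalizes an imbalanced upstream/downstream pair much more than the arithmetic mean does.

```python
def avg_and_hscore(a_up, a_down):
    """Arithmetic mean and harmonic mean (H-score) of upstream and
    downstream accuracies."""
    avg = (a_up + a_down) / 2
    h = 2 * a_up * a_down / (a_up + a_down)
    return avg, h

print(avg_and_hscore(80.0, 80.0))   # balanced: (80.0, 80.0)
print(avg_and_hscore(20.0, 90.0))   # forgetting-heavy: (55.0, ~32.7)
```

This is why a method that adapts well downstream but forgets upstream knowledge can still score a respectable Avg while its H-score collapses.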


Model Grafting (Panigrahi et al., 2023) optimizes the model's loss on the downstream task with respect to a merging mask, which is later used to fuse the pretrained model and the fine-tuned model. While the additional mask training can maintain the merged model's performance on the target task, it cannot guarantee performance on pretrained tasks. Moreover, this training requires 2× the memory of standard fine-tuning, limiting its applicability to large models.

DARE (Yu et al., 2024) aims to reduce the delta weights without additional training. It first randomly drops a proportion of delta weights, then rescales the remaining ones to eliminate redundant parameters. The rescaled delta weights are later added back to the pretrained models. Despite its simplicity, DARE shows suboptimal results and inconsistent performance when applied to MLLMs.

ModelTailor (Zhu et al., 2024) is the first work that explores the catastrophic forgetting problem in MLLMs. ModelTailor proposed a post-training fusion strategy based on salience and sensitivity analysis. Although it is specifically designed for MLLMs, ModelTailor still suffers from severe forgetting of pretrained knowledge when fine-tuning a deeper proportion of language layers, similar to other post-merging methods.

SPIDER (Huang et al., 2025) is the state-of-the-art method for MLLMs, adopting sparse fine-tuning instead of post-merging strategies. By actively monitoring accumulated parameter gradients and magnitudes, SPIDER ranks parameter importance and selectively updates a subset of parameters. While this approach demonstrates better results than previous methods, it introduces substantial computational overhead and memory usage (3× that of standard fine-tuning).

Table 5. Performance comparison on Flickr30k and IconQA downstream tasks using LLaVA-1.5-7B, where only the last 20 layers of the language decoder are fine-tuned. The left block shows results for Flickr30k, and the right block shows results for IconQA. Results show the downstream performance (Avg, H-score) with an update ratio of ρ = 0.1. Bold and underlined entries represent the best and second-best results in each category.


Datasets and Architectures. To evaluate the effectiveness of our method, we fine-tune MLLMs on a diverse set of downstream tasks, including image captioning, image classification, and visual question answering (VQA). Specifically, we use COCO-Caption (Lin et al., 2014) and Flickr30k (Young et al., 2014) for image captioning, ImageNet-R (Hendrycks et al., 2021) for image classification, and IconQA (Lu et al., 2021) for VQA. Experiments are conducted on two representative MLLM architectures,

Table 1. Performance comparison on COCO-Caption, ImageNet-R, Flickr30k, and IconQA using NVILA-Lite 2B , where only the last 20 layers of the language decoder are fine-tuned with an update ratio of ρ = 0 . 1 . Bold and underlined entries represent the best and second-best results.

e

the widely used LLaVA (Liu et al., 2024a) and the recently proposed NVILA models (Liu et al., 2025). To assess generalization and forgetting, we follow prior benchmark protocols (Zhu et al., 2024; Huang et al., 2025) and evaluate zero-shot performance on upstream (pretrained) tasks using TextVQA (Singh et al., 2019), OKVQA (Marino et al., 2019), OCRVQA (Mishra et al., 2019), and GQA (Hudson & Manning, 2019). In addition, we include MMBench in both English (MMB) and Chinese (MMB (CN)) (Liu et al., 2024b) to measure multilingual zero-shot visual question answering capabilities. Please refer to Appendix E.1 for more details on each dataset.

Baselines. We compare Model-Dowser against several strong baselines designed to mitigate catastrophic forgetting. For methods specialized for MLLMs, we include ModelTailor (or Tailor ) (Zhu et al., 2024) and SPIDER (Huang et al., 2025). ModelTailor is a post-merging approach that fuses pretrained weights with fine-tuned weights on downstream tasks, whereas SPIDER achieves state-of-the-art results in preventing forgetting for MLLMs by selectively updating model parameters based on accumulated gradient information during training. Given the architectural similarity between Large Language Models (LLMs) and Multimodal

Large Language Models (MLLMs), we additionally consider representative forgetting-mitigation methods originally proposed for LLMs, including Model Grafting ( Grafting ) (Panigrahi et al., 2023) and Drop & Rescale ( DARE ) (Yu et al., 2024). Grafting and DARE are post-merging methods similar to ModelTailor. Finally, we report results from full fine-tuning ( Full-FT ) as a reference baseline. We provide more details on selected baselines in Appendix E.2.

Training Settings. We follow the resource settings of (Zhou et al., 2024b) and randomly sample 10k training instances from each downstream dataset. For all experiments, we use a learning rate of 2 × 10⁻⁵ and train for 5 epochs. Unless otherwise specified, all remaining hyperparameters follow the official implementations of LLaVA 2 and NVILA 3 . Due to computational constraints, we evaluate LLaVA using the LLaVA-1.5-7B model and NVILA using NVILA-Lite-2B. During fine-tuning, we update the last L layers of the language decoder while freezing all other components. To investigate the effect of fine-tuning depth, L is varied in increments of 4, ranging from 4 to 32 for LLaVA and from 4 to 28 for NVILA. All experiments are conducted on 8 NVIDIA A100 GPUs (40 GB) with a total batch size of 128. For memory-intensive baselines such as Grafting and SPIDER, we instead use NVIDIA H200 GPUs (143 GB). SPIDER cannot be trained on LLaVA-1.5-7B when L > 20 due to its memory complexity. To calculate the weight importance score in Model-Dowser, we set the number of synthetic text samples to N = 64 and the number of Rademacher vectors to R = 8 for all experiments; further details on these choices are given in Appendix G.

2 https://github.com/haotian-liu/LLaVA

Evaluation Metrics. For upstream (pretrained) tasks, we strictly follow the evaluation protocols of (Luo et al., 2024) and report the average accuracy across all upstream benchmarks, denoted A up , to measure the model's retained generalization ability. For downstream tasks, we adopt the CIDEr metric (Vedantam et al., 2015) for the image-captioning datasets (COCO-Caption and Flickr30k), and Exact Match (EM) accuracy for ImageNet-R and IconQA, denoted A down . We evaluate the effectiveness of forgetting mitigation using both the arithmetic mean (Avg) and harmonic mean (H-score) of A up and A down :

$$
\mathrm{Avg} = \frac{A_{\mathrm{up}} + A_{\mathrm{down}}}{2}, \qquad
\text{H-score} = \frac{2\, A_{\mathrm{up}}\, A_{\mathrm{down}}}{A_{\mathrm{up}} + A_{\mathrm{down}}}
$$

Higher scores indicate better overall performance in mitigating catastrophic forgetting.
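The two aggregate metrics above are straightforward to compute. The sketch below (hypothetical helper names, scores on a common 0-100 scale) illustrates why the harmonic mean is the stricter of the two: it penalizes imbalance between retention and adaptation, which the arithmetic mean hides.

```python
def avg_score(a_up: float, a_down: float) -> float:
    """Arithmetic mean of upstream and downstream scores."""
    return (a_up + a_down) / 2.0

def h_score(a_up: float, a_down: float) -> float:
    """Harmonic mean; low when either retention or adaptation collapses."""
    if a_up + a_down == 0:
        return 0.0
    return 2.0 * a_up * a_down / (a_up + a_down)

# A balanced model and a forgetful model with the same arithmetic mean:
balanced = (70.0, 70.0)    # (A_up, A_down)
forgetful = (40.0, 100.0)
```

Both toy models share Avg = 70, but the forgetful one reaches only about 57 on the H-score, which is why the harmonic mean is used as the primary comparison metric.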


Findings

Robust Balance between Upstream Knowledge and Downstream Adaptation. Across all benchmarks, we observe a general trend: while different fine-tuning strategies often achieve comparable downstream performance, their ability to preserve upstream zero-shot knowledge varies widely. As shown in Tables 1 and 2, Model-Dowser achieves the highest H-scores among all evaluated methods by preserving pretrained knowledge during sparse fine-tuning. This trend holds across architectures: with NVILA-Lite-2B, Model-Dowser achieves H-scores of 85.7 and 71.2 on COCO-Caption and ImageNet-R, respectively, and with LLaVA-1.5-7B it maintains strong upstream performance while achieving H-scores of 79.9 and 69.7 on the same two benchmarks. Importantly, these results suggest that catastrophic forgetting is less a consequence of insufficient downstream adaptation and more a result of failing to preserve functionally sensitive parameters. By selectively protecting parameters with high functional impact and updating those with minimal effect on the output function, Model-Dowser preserves the model's core representational capacity while retaining sufficient plasticity for downstream alignment, resulting in robust and balanced performance across models and tasks. We provide the results of LLaVA-1.5-7B on Flickr30k and IconQA in Appendix F.

Figure 3. Performance comparison across fine-tuning depths on COCO and ImageNet-R. Results show the average accuracy across all tasks for an update ratio of ρ = 0 . 1 and various merging methods. The x-axis denotes the number of layers fine-tuned, counted incrementally from the final output layer toward the initial input layer.


Increased Vulnerability of Early Decoder Layers to Catastrophic Forgetting. Layer-wise analysis shows that catastrophic forgetting is most severe when fine-tuning extends to the early layers, as shown in Figure 3. Grafting, DARE, and Tailor remain stable across the last 4-16 layers but collapse when updates reach the early layers, because these post-merging methods are inherently reactive: patching weights to recover the pretrained tasks is insufficient once fine-tuning has disrupted them. In contrast, SPIDER preserves pretrained knowledge across more layers by dynamically selecting parameters based on gradient history. However, even when all 32 layers are fine-tuned, SPIDER underperforms Model-Dowser, indicating that it struggles to identify the parameters in early layers that are important for preserving pretrained knowledge. Conversely, Model-Dowser preserves sensitive functional anchors across layers and updates only the others, sustaining stability throughout the tuning process.

Table 3. Memory complexity for fine-tuning and parameter update ratios across different methods. | P | denotes the number of learnable parameters. '# params' denotes the number of parameters on NVILA/LLaVA when updating the last 20 layers, respectively.

Figure 4. Performance comparison across various mask ratios ( ρ ) on COCO-Caption and ImageNet-R using (a-b) NVILA-Lite-2B and (c-d) LLaVA-1.5-7B . Results show the upstream and downstream performance ( A up , Avg, H-score).

Memory Efficiency and Scalability. Model-Dowser provides a highly memory-efficient alternative to existing catastrophic-forgetting mitigation strategies, maintaining the same complexity as standard fine-tuning. As shown in Table 3, while SPIDER and Grafting incur significant memory overheads of O(3|P|) and O(2|P|), respectively, Model-Dowser operates with a minimal memory complexity of O(|P|). This efficiency stems from its static binary mask: importance scores are computed once before fine-tuning to produce a fixed mask, so, unlike SPIDER, which must store and update accumulated gradient histories during training, no additional state is maintained while the model trains. Because the cost of storing a binary mask is negligible, Model-Dowser scales readily to large foundation models.
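As a rough illustration of why the static mask adds no training-time overhead, the sketch below (NumPy, hypothetical function names, not the authors' implementation) builds a fixed binary mask once from precomputed importance scores and then applies it at every gradient step; the highest-scoring parameters are never touched. It assumes the update ratio ρ selects the lowest-scoring fraction of weights.

```python
import numpy as np

def build_update_mask(scores: np.ndarray, rho: float) -> np.ndarray:
    """One-time mask: 1 marks the rho fraction of weights with the LOWEST
    importance scores (free to update); 0 protects the high-importance rest.
    Ties at the threshold may admit slightly more than rho of the weights."""
    k = int(rho * scores.size)
    if k == 0:
        return np.zeros_like(scores)
    thresh = np.partition(scores.ravel(), k - 1)[k - 1]  # k-th smallest score
    return (scores <= thresh).astype(float)

def masked_sgd_step(w: np.ndarray, grad: np.ndarray,
                    mask: np.ndarray, lr: float = 2e-5) -> np.ndarray:
    """Standard SGD step, zeroed wherever the mask protects a weight."""
    return w - lr * mask * grad

scores = np.array([[0.1, 0.9], [0.5, 0.3]])   # toy importance scores
mask = build_update_mask(scores, rho=0.5)      # fixed before fine-tuning
```

Because the mask is computed once and costs one bit per parameter, the training-time footprint stays at O(|P|) plus a negligible constant, in contrast to methods that track per-parameter gradient histories.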

Robustness to Update Ratios and Stability. Model-Dowser demonstrates a wide operational window with respect to the update ratio ρ, effectively preserving stability even as the number of trainable parameters increases. In Figure 4, the average upstream performance remains stable for mask ratios up to ρ = 0.25, maintaining the upstream knowledge of the pretrained model. While upstream accuracy degrades gradually as ρ increases, Model-Dowser consistently outperforms Full-FT (ρ = 1.0) across all evaluated settings. This empirical result suggests that the functional importance identified by our sensitivity scoring is highly concentrated: as long as the most critical parameters are protected, the model remains robust to substantial task-specific updates in less sensitive regions. These results confirm that Model-Dowser provides a reliable mechanism for mitigating catastrophic forgetting without requiring exhaustive hyperparameter tuning for the mask ratio ρ.

Comparing with Random Selection. As shown in Table 4, Model-Dowser consistently outperforms random selection (Xu & Zhang, 2024b; Hui et al., 2025). The performance gap is most substantial at ρ = 0.5, where Model-Dowser achieves an Avg score of 69.8 and an H-score of 62.7, compared with 65.7 ± 0.4 and 55.0 ± 0.1 for the random selection baseline. Both gains are statistically significant (p = 0.003 for Avg and p = 0.004 for H-score under two-tailed t-tests).

Table 4. Results on ImageNet-R with different update ratios ( ρ ) across 28 layers on NVILA-Lite-2B , comparing random selection with our method. Numbers after ± indicate standard deviation across three random seeds.

The high variance observed under random selection reflects the instability of random updates, in which functionally sensitive parameters could be unintentionally modified. In contrast, by identifying these sensitive parameters, ModelDowser ensures stability even when a substantial portion of the model is being updated. These results demonstrate that as the number of updated parameters increases, the reliability provided by importance-based parameter selection becomes essential for effectively mitigating forgetting. Further details are provided in Appendices F and G.


In the context of large foundation models, catastrophic forgetting refers to the phenomenon where a model loses its ability to generalize to previously learned or unseen tasks after being adapted to downstream tasks (Wang et al., 2024b).

In Large Language Models (LLMs) , existing forgetting-mitigation methods can be broadly categorized into three groups. (1) Additive methods (Hu et al., 2022; Li & Liang, 2021; Lester et al., 2021; Zhang et al., 2021; Sung et al., 2022) introduce a small number of additional parameters to learn task-specific knowledge while keeping the pretrained weights fixed. Although these approaches are training-efficient, the resulting architectural modifications complicate deployment and pose challenges for continual fine-tuning. (2) Post-merging methods (Yu et al., 2024; Panigrahi et al., 2023; Li et al., 2022) aim to fuse pretrained and fine-tuned weights using heuristic selection or importance criteria. However, such methods often struggle to balance generic knowledge and domain-specific adaptation, making them sensitive to hyperparameter choices and limiting their robustness. (3) Sparse fine-tuning methods , such as (Hui et al., 2025; Lu et al., 2024; Xu & Zhang, 2024a), update a subset of model parameters during downstream adaptation and have shown potential in mitigating catastrophic forgetting. Nevertheless, lacking a principled parameter-importance criterion, these methods may inadvertently modify parameters critical to generalization.

In Multimodal Large Language Models (MLLMs) , early studies mainly investigate catastrophic forgetting as a side effect of continual learning (Guo et al., 2025a; Chen et al., 2025; 2024a; Guo et al., 2025b), which differs from our focus on downstream task adaptation. ModelTailor (Zhu et al., 2024) is among the first methods specifically designed for MLLMs, adopting a post-merging strategy. While it improves generalization compared to LLM-based baselines, it often sacrifices target-task performance to preserve pretrained representations. More recently, SPIDER (Huang et al., 2025) proposes a sparse fine-tuning approach that actively measures parameter importance during training based on gradient information and weight magnitudes, achieving strong empirical results in both downstream adaptation and generalization retention. However, existing MLLM-specific methods are typically evaluated only when fine-tuning the final layers of the language decoder, leaving their effectiveness under deeper fine-tuning largely unexplored. As we demonstrate later, these methods exhibit weak and unstable performance when fine-tuning extends to earlier layers.

We provide additional discussion of other anti-forgetting methods for smaller models in Appendix A and of continual learning in Appendix H.


Conclusion

In this paper, we investigate catastrophic forgetting in MLLMs under varying fine-tuning depths. Our analysis reveals that fine-tuning earlier layers of the language decoder severely degrades pretrained generalization, and existing methods often fail and exhibit unstable performance. To address this challenge, we propose Model-Dowser, which identifies the parameters most impactful on the model's outputs prior to downstream adaptation and selectively preserves them during fine-tuning. As a result, Model-Dowser effectively mitigates catastrophic forgetting while remaining data-free and resource-efficient, requiring no additional memory beyond standard fine-tuning and thus scaling well to large MLLMs.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Catastrophic forgetting is a long-standing challenge in deep neural networks. Early studies primarily investigated this phenomenon in unimodal models under continual learning settings, where a model is sequentially trained on new tasks, and its performance on previously learned tasks degrades (Goodfellow et al., 2013; Masana et al., 2022; Yang et al., 2023; Kirkpatrick et al., 2017; Xuhong et al., 2018; Aljundi et al., 2018; Zhang et al., 2024b). These works typically assume a single-source task constraint, in which successive tasks are closely related, e.g., image classification, and the source-task training data remains accessible during adaptation.

In contrast, the emergence of foundation models (Radford et al., 2021; Kirillov et al., 2023; Zhai et al., 2023) introduces a fundamentally different form of forgetting. In this setting, the original pretraining data distribution is often unknown or unavailable, and catastrophic forgetting manifests as a loss of generalization after fine-tuning on a downstream task. Recent efforts to mitigate forgetting in foundation models have largely focused on gradient-based or full-model fine-tuning methods (Zhang et al., 2024a; Zheng et al., 2023; Xiang et al., 2023); however, such approaches are often impractical for large-scale pretrained models due to their computational and memory overhead.

As Multimodal Large Language Models (MLLMs) (Liu et al., 2024a; 2025; Wang et al., 2024a) continue to demonstrate strong adaptability across diverse downstream tasks, understanding and mitigating catastrophic forgetting in these models has become an increasingly important research direction.

Functional shift for single-weight perturbation (Proof of Theorem 3.1)

Theorem B.1 (Theorem 3.1) . Consider a layer $l$ in an MLLM model $f$. Under the first-order Taylor approximation, the $L_2$ norm of the output shift $\Delta f$ when perturbing a weight $W^{(l)}_{ij}$ is given by

$$
\|\Delta f\|_2 = \|J^{(l)}_i\|_2 \cdot |\Delta W^{(l)}_{ij}| \cdot |h^{(l-1)}_j|,
$$

where $J^{(l)}_i = \partial f/\partial z^{(l)}_i$ denotes the $i$-th column of the Jacobian matrix of the network output with respect to the pre-activation vector $z^{(l)}$, and $h^{(l-1)}$ is the input activation of the $l$-th layer (the output of layer $l-1$).

Proof. In this section, we provide the detailed derivation of the parameter importance score $S^{(l)}_{ij}$ based on a first-order Taylor approximation of the model's output shift.

Setup and Notation Consider a specific linear layer $l$ within an MLLM $f$. For a given input stimulus, let $h^{(l-1)} \in \mathbb{R}^{d_{\text{in}}}$ denote the input activation (the output of the preceding layer) and $W^{(l)} \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ the weight matrix of layer $l$. The pre-activation vector is $z^{(l)} = W^{(l)} h^{(l-1)}$, whose $i$-th component is $z^{(l)}_i = \sum_j W^{(l)}_{ij} h^{(l-1)}_j$. The network output $f$ can be viewed as a function of the pre-activation $z^{(l)}$, written $f = F(z^{(l)})$, where $F$ encompasses all subsequent operations in the network. We define the output sensitivity vector (Jacobian column) $J^{(l)}_i$ as

$$
J^{(l)}_i = \frac{\partial F}{\partial z^{(l)}_i}.
$$

Derivation for Single Weight Perturbation We analyze the effect of perturbing a single weight element $W^{(l)}_{ij}$ by an amount $\Delta W^{(l)}_{ij}$.

Step 1: Since only the weight element $W^{(l)}_{ij}$ is perturbed, only the $i$-th component of the pre-activation vector $z^{(l)}$ is affected:

$$
\Delta z^{(l)}_k = \delta_{ki}\, h^{(l-1)}_j\, \Delta W^{(l)}_{ij},
$$

where $\delta$ is the Kronecker delta. Thus, the change in the $i$-th pre-activation is $\Delta z^{(l)}_i = h^{(l-1)}_j \Delta W^{(l)}_{ij}$, while $\Delta z^{(l)}_k = 0$ for all $k \neq i$.

Step 2: To quantify the functional impact of a parameter perturbation, we treat the final network output $f$ as a differentiable function of the pre-activation $z^{(l)}$. When the pre-activation vector is perturbed from $z^{(l)}$ to $z^{(l)} + \Delta z^{(l)}$, we can approximate the new output using a first-order Taylor expansion around $z^{(l)}$:

$$
F(z^{(l)} + \Delta z^{(l)}) = F(z^{(l)}) + \frac{\partial F}{\partial z^{(l)}}\, \Delta z^{(l)} + O(\|\Delta z^{(l)}\|^2).
$$

By neglecting the higher-order terms $O(\|\Delta z^{(l)}\|^2)$ and rearranging the equation to isolate the difference between the perturbed and original outputs, we define the output shift $\Delta f$ as follows:

$$
\Delta f = \frac{\partial F}{\partial z^{(l)}}\, \Delta z^{(l)} = \sum_{k} J^{(l)}_k\, \Delta z^{(l)}_k,
$$

where $\frac{\partial F}{\partial z^{(l)}} = [J^{(l)}_1, J^{(l)}_2, \dots, J^{(l)}_{d_{\text{out}}}]$ represents the Jacobian matrix of the network output with respect to the pre-activations. As established in Step 1, since the perturbation is restricted to a single weight $W^{(l)}_{ij}$, the change vector $\Delta z^{(l)}$ is sparse, containing a non-zero value only at the $i$-th index ($\Delta z^{(l)}_i = \Delta W^{(l)}_{ij} \cdot h^{(l-1)}_j$). Consequently, the summation collapses to a single term:

$$
\Delta f = J^{(l)}_i \cdot \Delta W^{(l)}_{ij}\, h^{(l-1)}_j.
$$

Step 3: Taking the $L_2$ norm of both sides, we obtain the magnitude of the functional shift:

$$
\|\Delta f\|_2 = \|J^{(l)}_i\|_2 \cdot |\Delta W^{(l)}_{ij}| \cdot |h^{(l-1)}_j|. \tag{16}
$$

By substituting the potential perturbation $\Delta W^{(l)}_{ij}$ with the existing weight magnitude $|W^{(l)}_{ij}|$, we arrive at the importance score $S^{(l)}_{ij} = \|J^{(l)}_i\|_2 \cdot |W^{(l)}_{ij}| \cdot |h^{(l-1)}_j|$. We emphasize that this substitution does not aim to predict the exact output shift, but serves as a conservative first-order surrogate for ranking parameters by their relative functional sensitivity.

The proof of Theorem 3.1 is finished.
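The first-order identity above is easy to sanity-check numerically. The sketch below (a toy two-layer network, not the authors' code) perturbs a single weight of a tanh layer followed by a linear readout and compares the true output shift against the predicted $\|J^{(l)}_i\|_2 \cdot |\Delta W^{(l)}_{ij}| \cdot |h^{(l-1)}_j|$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 4, 5, 3
W = rng.normal(size=(d_hid, d_in))   # weights of the probed layer l
V = rng.normal(size=(d_out, d_hid))  # "rest of the network" F after layer l
h = rng.normal(size=d_in)            # input activation h^{(l-1)}

def f(W_mat):
    """Toy network stand-in: F(z) = V tanh(z) with z = W h."""
    return V @ np.tanh(W_mat @ h)

i, j = 2, 1                          # perturb the single weight W_ij
z = W @ h
# Jacobian column J_i = dF/dz_i for F(z) = V tanh(z)
J_i = V[:, i] * (1.0 - np.tanh(z[i]) ** 2)

dW = 1e-4                            # small perturbation magnitude
W_pert = W.copy()
W_pert[i, j] += dW

actual = np.linalg.norm(f(W_pert) - f(W))
predicted = np.linalg.norm(J_i) * abs(dW) * abs(h[j])
# 'actual' and 'predicted' agree up to O(dW^2) higher-order terms
```

For perturbations this small, the first-order prediction matches the measured shift to within a fraction of a percent, as the neglected terms scale quadratically in dW.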


Extension to Multi-layer and Multi-weight Pruning (Proof of Corollary 3.2)

Corollary C.1 (Corollary 3.2) . Let $\Delta W = \{\Delta W^{(l)}\}_{l=1}^{L}$ be the set of perturbation matrices for all layers in the model $f$. Under the first-order Taylor approximation, the total output shift $\|\Delta f\|_2$ is bounded by the global aggregate of individual parameter sensitivities as follows:

$$
\|\Delta f\|_2 \;\le\; \sum_{l=1}^{L} \sum_{i,j} \|J^{(l)}_i\|_2 \cdot |\Delta W^{(l)}_{ij}| \cdot |h^{(l-1)}_j|.
$$

Proof. We derive this using the concept of the total differential, treating the neural network function $f$ as differentiable with respect to the full set of parameters $\{W^{(l)}_{ij}\}_{l,i,j}$.

According to the definition of the total differential, the variation in the output $\Delta f$ induced by simultaneous perturbations of all weights can be approximated by summing, over every weight parameter, the partial derivative multiplied by the corresponding perturbation:

$$
\Delta f = \sum_{l=1}^{L} \sum_{i,j} \frac{\partial f}{\partial W^{(l)}_{ij}}\, \Delta W^{(l)}_{ij}. \tag{18}
$$

From the derivation in Theorem 3.1 (Theorem B.1), we established that the gradient of $f$ with respect to a specific weight $W^{(l)}_{ij}$ factorizes via the chain rule into the downstream sensitivity and the input activation:

$$
\frac{\partial f}{\partial W^{(l)}_{ij}} = J^{(l)}_i\, h^{(l-1)}_j.
$$

Substituting this result directly into Equation (18) yields:

$$
\Delta f = \sum_{l=1}^{L} \sum_{i,j} J^{(l)}_i\, \Delta W^{(l)}_{ij}\, h^{(l-1)}_j.
$$

Finally, to bound the magnitude of the total shift, we apply the triangle inequality ($\|\sum_k x_k\| \le \sum_k \|x_k\|$) together with the absolute homogeneity of the norm:

$$
\|\Delta f\|_2 \;\le\; \sum_{l=1}^{L} \sum_{i,j} \|J^{(l)}_i\|_2 \cdot |\Delta W^{(l)}_{ij}| \cdot |h^{(l-1)}_j|.
$$

This inequality demonstrates that the sum of our proposed importance scores $S^{(l)}_{ij}$ serves as a theoretical upper bound on the total functional shift of the MLLM under the first-order approximation. Consequently, minimizing this aggregate score via global parameter selection effectively preserves the pretrained model's established functional response under perturbations.

The proof of Corollary 3.2 is finished.
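Continuing the toy setup from the single-weight case, the aggregate bound can be checked numerically by perturbing every weight of a layer at once and comparing the true shift against the summed per-weight sensitivities (again an illustrative sketch, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hid, d_out = 4, 5, 3
W = rng.normal(size=(d_hid, d_in))   # probed layer
V = rng.normal(size=(d_out, d_hid))  # downstream network F
h = rng.normal(size=d_in)            # input activation

def f(W_mat):
    return V @ np.tanh(W_mat @ h)

z = W @ h
# ||J_i||_2 for each output coordinate i of the probed layer
J_norms = np.array([np.linalg.norm(V[:, i] * (1.0 - np.tanh(z[i]) ** 2))
                    for i in range(d_hid)])

dW = 1e-4 * rng.normal(size=W.shape)          # perturb ALL weights at once
# aggregate bound: sum_ij ||J_i||_2 * |dW_ij| * |h_j|
bound = float(np.sum(J_norms[:, None] * np.abs(dW) * np.abs(h)[None, :]))
actual = float(np.linalg.norm(f(W + dW) - f(W)))
# the triangle-inequality bound dominates the true shift
```

With random signs in the perturbation, the true shift sits well below the bound, since the sum of absolute values forgoes all cancellation between terms.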

$L_1$-norm Surrogate for Scalable Parameter Importance Estimation

Theorem D.1. Consider an MLLM $f \in \mathbb{R}^{d_{\text{final}}}$ and a Rademacher random vector $\xi \in \{-1, 1\}^{d_{\text{final}}}$. Under the sensitivity measure defined in Theorem 3.1, the following inequality holds:

$$
\frac{1}{\sqrt{2}}\,\|J_i\|_2\,|h_j|\,|\Delta W_{ij}| \;\le\; \mathbb{E}_\xi\big[|\xi^\top J_i|\big]\,|h_j|\,|\Delta W_{ij}| \;\le\; \|J_i\|_2\,|h_j|\,|\Delta W_{ij}|,
$$

where $J_i = \partial f/\partial z_i$ denotes the $i$-th column of the Jacobian matrix, $h$ is the activation, and $W$ is the weight matrix of the model $f$.

Proof. First, we recall that the $L_2$-norm of a Jacobian column can be estimated using a Hutchinson-style estimator. For a given column $J$, the relationship is established as

$$
\mathbb{E}_\xi\big[(\xi^\top J)^2\big] = \|J\|_2^2.
$$

To relate the expectation of the absolute value to the $L_2$-norm, we first apply Jensen's inequality for the concave function $\sqrt{\cdot}$, which yields

$$
\mathbb{E}_\xi\big[|\xi^\top J_i|\big] = \mathbb{E}_\xi\Big[\sqrt{(\xi^\top J_i)^2}\Big] \;\le\; \sqrt{\mathbb{E}_\xi\big[(\xi^\top J_i)^2\big]} = \|J_i\|_2.
$$

This provides the upper bound $\mathbb{E}_\xi[|\xi^\top J_i|] \le \|J_i\|_2$.

Furthermore, to establish the lower bound, we utilize the Khintchine inequality (Khintchine, 1923; Haagerup, 1981). For the specific case of $p = 1$ with Rademacher variables, the inequality states:

$$
\frac{1}{\sqrt{2}}\,\|J_i\|_2 \;\le\; \mathbb{E}_\xi\big[|\xi^\top J_i|\big].
$$

This relationship demonstrates that the expectation $\mathbb{E}_\xi[|\xi^\top J_i|]$ is equivalent to the $L_2$-norm of the Jacobian column up to a constant factor, i.e., $\mathbb{E}_\xi[|\xi^\top J_i|] \asymp \|J_i\|_2$. Given that $\|J_i\|_2 \le \|J_i\|_1$, this estimator serves as a robust proxy for the sensitivity of the model outputs.

Finally, by multiplying all terms by the activation magnitude $|h_j|$ and the weight perturbation $|\Delta W_{ij}|$, we obtain

$$
\frac{1}{\sqrt{2}}\,\|J_i\|_2\,|h_j|\,|\Delta W_{ij}| \;\le\; \mathbb{E}_\xi\big[|\xi^\top J_i|\big]\,|h_j|\,|\Delta W_{ij}| \;\le\; \|J_i\|_2\,|h_j|\,|\Delta W_{ij}|.
$$

This completes the proof of Theorem D.1.
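The two-sided bound can be verified by Monte-Carlo sampling. The sketch below uses far more probes than the R = 8 employed in practice so that the empirical mean $\mathbb{E}_\xi[|\xi^\top J_i|]$ is tight; with standard-normal entries for the Jacobian column, the estimate lands near $\sqrt{2/\pi}\,\|J_i\|_2 \approx 0.80\,\|J_i\|_2$, comfortably inside the $[1/\sqrt{2},\, 1]$ window.

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.normal(size=64)                 # a single Jacobian column J_i
l2 = float(np.linalg.norm(J))

R = 20_000                              # many Rademacher probes for accuracy
xi = rng.choice([-1.0, 1.0], size=(R, J.size))
est = float(np.mean(np.abs(xi @ J)))    # Monte-Carlo E_xi[|xi^T J|]

ratio = est / l2                        # should lie in [1/sqrt(2), 1]
```

The gap between the Monte-Carlo estimate and the true norm is a fixed multiplicative constant, which is exactly why the estimator preserves the relative ranking of sensitivities.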

Numerical Stability near Zero. A challenge in using the $L_2$-norm with low-precision formats (e.g., FP16 or BF16) is the risk of underflow. When the elements of $J_i$ are small, the squaring operation $J_i^2$ can push values below the minimum representable threshold, causing them to be flushed to zero. This results in a loss of the relative-importance ranking among parameters.

Our $L_1$-based estimator $\mathbb{E}_\xi[|\xi^\top J_i|]$ avoids this by preserving the original magnitude of the gradients, ensuring that even subtle sensitivity differences are captured without numerical vanishing. The proof above also shows that, in practice, the proposed estimator and the $L_2$ norm have equivalent growth rates: since the estimator scales linearly with $\|J_i\|_2$, both quantities can serve as effective surrogates for the parameter importance score without losing the relative ranking of sensitivities.
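The underflow effect is easy to reproduce. In the snippet below (illustrative, using NumPy half-precision scalars), a Jacobian element of magnitude 1e-5 is representable in float16, but its square (1e-10) falls below the smallest float16 subnormal (about 6e-8) and is flushed to zero, whereas the absolute value used by the $L_1$-style estimator survives:

```python
import numpy as np

g = np.float16(1e-5)      # a small but representable half-precision gradient

squared = g * g           # L2-style accumulation: 1e-10 underflows in float16
kept = np.abs(g)          # L1-style accumulation: magnitude preserved

# squaring destroys the relative ranking of small sensitivities,
# while the absolute value keeps it intact
```

Any two distinct small gradients collapse to the same squared value (zero) in half precision, so their ordering is unrecoverable; the absolute values remain distinct.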


Experiment Setting Details

Details of Datasets

TextVQA (Singh et al., 2019) is a visual question answering benchmark that requires models to read and reason about text embedded in images to answer natural language questions. It evaluates a model's ability to jointly perform scene text understanding and visual-language reasoning, highlighting limitations of standard VQA models that lack text-reading capability.

OKVQA (Marino et al., 2019) requires models to answer questions by combining visual understanding with external, commonsense, or factual knowledge beyond what is directly observable in the image. It is designed to evaluate a model's ability to integrate vision with knowledge-based reasoning.

OCR-VQA (Mishra et al., 2019) focuses on answering questions by reading and understanding text present in images. It evaluates a model's ability to perform optical character recognition and associate recognized text with visual context to support accurate question answering.

MMBench (Liu et al., 2024b) is a comprehensive multimodal benchmark designed to evaluate the general capabilities of multimodal models across diverse vision-language skills, including perception, reasoning, and knowledge understanding. It provides standardized multiple-choice questions split between English and Chinese, enabling systematic evaluation of zero-shot and multilingual performance in multimodal large language models.

GQA (Hudson & Manning, 2019) is a visual question answering benchmark designed to evaluate real-world visual reasoning and compositional understanding. It features structured questions that require multi-step reasoning over objects, attributes, and relations in images, enabling fine-grained analysis of a model's visual reasoning capabilities.

COCO-Caption (Lin et al., 2014) is a large-scale vision dataset containing images of complex everyday scenes with multiple objects annotated for object detection, segmentation, and image captioning. It is widely used to evaluate visual understanding and language generation capabilities, particularly in image captioning and multimodal learning tasks.

Flickr30k (Young et al., 2014) is an image-caption dataset consisting of 30,000 everyday images, each annotated with multiple human-written captions. It is commonly used to evaluate image captioning and vision-language understanding by assessing a model's ability to generate descriptive, semantically accurate natural-language captions.

IconQA (Lu et al., 2021) is a visual question answering benchmark designed for abstract diagram understanding and visual-language reasoning. In this paper, we focus on the text-based multiple-choice split.

ImageNet-R (Hendrycks et al., 2021) is a robustness benchmark derived from ImageNet that contains images rendered in diverse artistic styles, such as paintings, cartoons, and sketches. It is designed to evaluate a model's out-of-distribution generalization by testing recognition performance under significant appearance shifts from natural images. In this paper, we adopt the visual question answering format of this dataset from (Guo et al., 2025a).

Details of Baselines

Model Grafting (Panigrahi et al., 2023) optimizes the model's loss on the downstream task with respect to a merging mask, which is later used to fuse the pretrained model and the fine-tuned model. While additional mask training can maintain the merged model's performance on the target task, it cannot guarantee performance on pretrained tasks. Moreover, this training requires 2 × the memory compared to standard fine-tuning, thus limiting its applicability to large models.

DARE (Yu et al., 2024) aims to reduce the delta weights without additional training. It first randomly drops a proportion of delta weights, then rescales the remaining ones to eliminate redundant parameters. The rescaled delta weights are later added back to the pretrained models. Despite its simplicity, DARE shows suboptimal results and inconsistent performance when applied to MLLMs.

ModelTailor (Zhu et al., 2024) is the first work that explores the catastrophic forgetting problem in MLLMs. ModelTailor proposed a post-training fusion strategy based on salience and sensitivity analysis. Although it is specifically designed for MLLMs, ModelTailor still suffers from severe forgetting of pretrained knowledge when fine-tuning a deeper proportion of language layers, similar to other post-merging methods.

SPIDER (Huang et al., 2025) is the state-of-the-art method for MLLMs, which adopts sparse fine-tuning instead of post-merging strategies. By actively monitoring the accumulated parameter gradients and magnitudes, SPIDER ranks the parameter importance and selectively updates a subset of parameters. While this approach demonstrates better results compared to previous methods, it introduces substantial computational overhead and memory usage (3× compared to standard fine-tuning).

Table 5. Performance comparison on Flickr30k and IconQA downstream tasks using LLaVA-1.5-7B , where only the last 20 layers of the language decoder are fine-tuned. The left block shows results for Flickr30k, and the right block shows results for IconQA. Results show the downstream performance ( Avg , H-score ) with an update ratio of ρ = 0 . 1 . Bold and underlined entries represent the best and second-best results in each category.

Experiments on Further Datasets

In this section, we present additional validation of Model-Dowser on two further datasets, Flickr30k and IconQA. We also demonstrate the effectiveness of our method by comparing it with random selection on ImageNet-R, Flickr30k, and IconQA.

Experiments on Diverse Downstream Tasks To verify the robustness of Model-Dowser across different downstream tasks, we extend our evaluation to Flickr30k (image captioning) and IconQA (VQA) on LLaVA-1.5-7B, as shown in Table 5. Model-Dowser achieves H-scores of 69.6 (Flickr30k) and 51.3 (IconQA), outperforming prior methods such as SPIDER and Tailor. These prior methods either achieve high performance on the downstream task at the cost of generalization to upstream tasks, or preserve upstream knowledge but fail to adapt to the downstream task. In contrast, our method maintains upstream performance comparable to the zero-shot baseline while still performing well on downstream tasks.

As shown in Figure 6, Model-Dowser differs from previous approaches by remaining stable even when fine-tuning is applied across the full model, while the baselines show a performance drop. This aligns with the analysis in Section 4, suggesting that our method precisely targets the parameters crucial for balancing knowledge retention and task adaptation. Additionally, Figure 7 highlights the method's resilience to different mask ratios ( ρ ); in particular, the consistent A up scores across all settings attest to the efficacy of our importance score in preserving upstream capabilities.

Figure 5. Radar chart on diverse benchmarks on LLaVA-1.5-7B and NVILA-Lite when finetuning all layers.


Figure 6. Performance comparison across fine-tuning depths on Flickr30k and IconQA. Results show the average accuracy across all tasks for an update ratio of ρ = 0.1 and various merging methods. The x-axis denotes the number of layers fine-tuned, counted incrementally from the final output layer toward the initial input layer.

Figure 7. Performance comparison across various mask ratios ( ρ ) on Flickr30k and IconQA using LLaVA-1.5-7B. Results show the upstream and downstream performance.

Comparison with Random Selection To further validate the effectiveness of our sensitivity-based importance scoring, we compare Model-Dowser with a random-selection baseline across varying update ratios ( ρ ). Tables 6 and 7 summarize the results for ImageNet-R and COCO-Caption, respectively.

Model-Dowser consistently outperforms the random-selection baseline across datasets and ratios. This performance gap becomes particularly evident as the update budget increases. While the difference is small at a low update ratio ( ρ = 0.1 ), the random baseline fails to preserve upstream knowledge as the ratio increases to ρ = 0.5. For instance, on ImageNet-R with ρ = 0.5, Model-Dowser achieves an H-score of 42.6 with LLaVA, significantly higher than the 33.2 obtained by random selection.
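For concreteness, the Avg and H-score aggregates used throughout these tables are simply the arithmetic and harmonic means of A_up and A_down; a minimal sketch follows (the numeric example plugs in the A_up/A_down pair reported for our method on LLaVA at ρ = 0.5 in Table 6):

```python
def avg_and_h_score(a_up: float, a_down: float) -> tuple[float, float]:
    """Arithmetic mean (Avg) and harmonic mean (H-score) of upstream/downstream scores."""
    avg = (a_up + a_down) / 2
    h = 2 * a_up * a_down / (a_up + a_down)
    return avg, h

# A_up = 27.9, A_down = 89.7 (ours, LLaVA, rho = 0.5, ImageNet-R)
avg, h = avg_and_h_score(27.9, 89.7)  # -> (58.8, 42.6) after rounding
```

The harmonic mean penalizes an imbalance between upstream retention and downstream adaptation far more strongly than the arithmetic mean, which is why H-score separates the methods so clearly.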

Importantly, these performance improvements are statistically significant. We conducted one-sided t-tests under the ρ = 0.5 condition. For NVILA-Lite-2B, the analysis produced p-values of 0.003 for the H-score on COCO-Caption and 0.002 for the H-score on ImageNet-R. Likewise, LLaVA-1.5-7B also shows statistically significant gains, with p-values of 0.04 and 0.0003 on COCO-Caption and ImageNet-R, respectively. Together, these findings validate the effectiveness of the proposed importance score.
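A one-sided two-sample test of this kind can be sketched as follows. The snippet computes Welch's t statistic in pure Python on placeholder per-seed H-scores (the two arrays are illustrative stand-ins, not our measured runs); in practice a call such as `scipy.stats.ttest_ind(h_ours, h_random, equal_var=False, alternative="greater")` would return this statistic together with the one-sided p-value:

```python
import math
from statistics import mean, stdev

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples; a large positive
    value supports the one-sided hypothesis mean(a) > mean(b)."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = stdev(sample_a) ** 2, stdev(sample_b) ** 2
    return (mean(sample_a) - mean(sample_b)) / math.sqrt(va / na + vb / nb)

# Illustrative placeholder H-scores over three seeds (not the paper's runs).
h_ours = [62.5, 62.9, 62.7]
h_random = [55.1, 54.9, 55.0]
t = welch_t(h_ours, h_random)
```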

Table 6. Experimental results on ImageNet-R with different update ratios ( ρ ) across 28 (NVILA) and 32 (LLaVA) layers, comparing random selection with ours. Numbers after ± indicate the standard deviation across three random seeds.

Stability of the Data-Free Estimation of the Importance Score

We provide empirical evidence that our data-free importance estimation yields a stable, robust ranking of parameter sensitivity. To evaluate the alignment between our data-free estimation and the score computed with real samples, we employ two metrics. First, Hamming distance (Hamming, 1950) measures the discrepancy between the binary masks of the important parameters. Specifically, consistent with our update ratio, we define the importance mask by selecting the top 10% of parameters; a lower Hamming distance indicates a higher overlap in the identified subset of critical weights. Second, Spearman's rank correlation (Spearman, 1904) assesses the monotonic relationship across all parameters by comparing the ranks of their importance scores; a higher coefficient indicates that our method preserves the global sensitivity ranking found in the real-data setting.
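The two metrics can be sketched as follows (a simplified, self-contained version: the top-10% mask mirrors the update-ratio convention above, and the Spearman routine ranks the scores and takes a Pearson correlation, ignoring ties):

```python
import numpy as np

def mask_hamming(scores_a, scores_b, top_frac=0.10):
    """Fraction of positions where the two top-`top_frac` binary masks disagree."""
    k = max(1, int(len(scores_a) * top_frac))
    mask_a = np.zeros(len(scores_a), dtype=bool)
    mask_b = np.zeros(len(scores_b), dtype=bool)
    mask_a[np.argsort(scores_a)[-k:]] = True   # top-k indices of score vector a
    mask_b[np.argsort(scores_b)[-k:]] = True   # top-k indices of score vector b
    return float(np.mean(mask_a != mask_b))

def spearman(scores_a, scores_b):
    """Spearman rank correlation: rank both vectors, then Pearson on the ranks."""
    ra = np.argsort(np.argsort(scores_a)).astype(float)
    rb = np.argsort(np.argsort(scores_b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])
```

Identical score vectors give a Hamming distance of 0 and a Spearman correlation of 1; a monotone rescaling of the scores leaves both metrics unchanged, which is exactly the invariance a ranking-based importance criterion requires.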

Figure 8 demonstrates the stability and robustness of our approach compared to random selection. There is a significant disparity between the two methods; as the number of probing samples ( N ) increases, our estimated importance score closely aligns with the score rankings calculated from real data samples. This indicates that our data-free proxy effectively captures the model's true structural sensitivity.

Furthermore, quantitative validation is provided in Table 8. A two-tailed paired t-test yields statistically significant p-values below $10^{-7}$ for both Hamming distance and Spearman correlation. These results confirm that the alignment achieved by Model-Dowser is robust and statistically distinct from random variation.

Both the Hamming distance and the Spearman correlation saturate as the numbers of Rademacher probes ( R ) and Monte Carlo samples ( N ) increase. We therefore choose R = 8 and N = 64 for the experiments in the main paper.

Figure 8. Hamming distance and Spearman correlation relative to the importance score computed with real data samples. (a) Hamming distance (lower is better). (b) Spearman correlation (higher is better).

Table 8. Statistical comparison of parameter importance: Hamming distance ( ↓ ) and Spearman rank correlation ( ↑ ) for selected sample counts ( N ∈ { 4, 16, 64 } ) relative to the score computed with real data samples. Results are presented as 'mean ± std' over 5 independent runs. The p-values result from a two-tailed t-test between the random baseline and ours. N and R denote the number of Monte Carlo samples and Rademacher probes, respectively.

Empirical Results in the Continual Learning Setting

Setup

In continual learning (CL) settings (Chen et al., 2024a; Zhao et al., 2025; Guo et al., 2025a; Chen et al., 2025; Guo et al., 2025b), catastrophic forgetting is typically studied as a byproduct of sequential task learning. In these setups, models are initialized from a pretrained backbone (often before visual instruction tuning) and then trained sequentially on a series of tasks. Although continual learning and catastrophic forgetting are often closely related, the objective of continual learning fundamentally differs from ours. CL aims to enable models to learn tasks in a strictly sequential manner, retaining knowledge from previous tasks while optimizing performance on the current task. In contrast, our work focuses on mitigating catastrophic forgetting during downstream adaptation, aiming to preserve generalization and zero-shot capabilities on unseen tasks without compromising performance on a single target downstream task. We do not assume explicit task sequences or continual task arrival; instead, we study forgetting as a consequence of fine-tuning depth and parameter updates.

To further contextualize our approach, we additionally evaluate Model-Dowser under the CL benchmark settings of (Zhao et al., 2025), but starting from a visual instruction-tuned model, LLaVA-1.5-7B (Liu et al., 2024a). The benchmark consists of five downstream tasks from different domains: Remote Sensing (RS), Medical (Med), Autonomous Driving (AD), Science (Sci), and Financial (Fin). For baselines, we include MoeLORA (Chen et al., 2024a), a representative continual learning method, and ModelTailor (Zhu et al., 2024) as a reference post-merging approach.

Metrics Let $T$ be the total number of tasks and $A_{t,i}$ denote the test accuracy on task $i$ after the model has finished training on task $t$. We employ the following four standard metrics to evaluate continual learning performance:

$$ \text{MFT} = \frac{1}{T} \sum_{i=1}^{T} A_{i,i}. $$

A higher MFT indicates that the model successfully adapts to the downstream distribution of the current task.

$$ \text{MFN} = \frac{1}{T} \sum_{i=1}^{T} A_{T,i}. $$

Unlike MFT, MFN accounts for subsequent performance degradation. A high MFN requires the model not only to learn new tasks well but also to retain performance on previous ones.

$$ \text{MAA} = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{t} \sum_{i=1}^{t} A_{t,i}. $$

MAA captures the historical trajectory of performance, rewarding models that maintain consistently high accuracy across all known tasks at every stage of training.

$$ \text{BWT} = \frac{1}{T-1} \sum_{i=1}^{T-1} \left( A_{T,i} - A_{i,i} \right). $$

A negative BWT value indicates forgetting (performance degradation on past tasks), while a value close to zero implies strong stability (retention of knowledge). Our goal is to maximize BWT (i.e., minimize the magnitude of the negative value).
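Under the standard definitions above, all four metrics can be computed from the accuracy matrix in a few lines. The sketch below assumes a NumPy array `A` with `A[t, i]` holding $A_{t,i}$ (0-indexed); the two-task matrix is a hypothetical example, not one of our runs:

```python
import numpy as np

def cl_metrics(A):
    """MFT, MFN, MAA, BWT from A[t, i]: accuracy on task i after training on task t."""
    T = A.shape[0]
    mft = np.mean([A[i, i] for i in range(T)])                    # adaptation to each task
    mfn = np.mean(A[T - 1])                                       # final accuracy on all tasks
    maa = np.mean([np.mean(A[t, : t + 1]) for t in range(T)])     # averaged over the trajectory
    bwt = np.mean([A[T - 1, i] - A[i, i] for i in range(T - 1)])  # forgetting of earlier tasks
    return mft, mfn, maa, bwt

A = np.array([[80.0,  0.0],    # after task 1: 80 on task 1
              [70.0, 90.0]])   # after task 2: 70 on task 1 (forgetting), 90 on task 2
mft, mfn, maa, bwt = cl_metrics(A)  # -> 85.0, 80.0, 80.0, -10.0
```

Note that MFN equals MFT plus the (negative) average backward transfer contribution, which is why a method can post a high MFT yet a low MFN when BWT is strongly negative.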

Table 9. Performance comparison in the Continual Learning (CL) setting using LLaVA-1.5-7B. Upstream tasks include TextVQA, POPE, VQAv2, MM-VET, ScienceQA-Img, MMBench (EN&CN), GQA, and SEED-Bench. The results represent the final performance after completing sequential training on 5 CL tasks.

Experimental Results

$$ \|\Delta f\|_2 \approx \|J^{(l)}_i\|_2 \cdot |\Delta W^{(l)}_{ij}| \cdot |h^{(l-1)}_j|, $$

$$ \mathbb{E}_{\xi}\left[ \left( \frac{\partial(\xi^\top f)}{\partial z_i}\right)^2 \right] = \|J_i\|_2^2. \tag{eq:hutchinson_final} $$

$$ \hat{x}_n = f(\epsilon; \theta_{\text{pre}}), $$

$$ M^{(l)}_{ij} = \begin{cases} 1, & \text{if } \bar{S}^{(l)}_{ij} \text{ is in the bottom } \rho \text{ percentile,} \\ 0, & \text{otherwise.} \end{cases} \tag{eq:masking} $$

$$ \theta^* = \theta - \lambda \cdot \left( M \odot \frac{\partial\mathcal{L}}{\partial\theta} \right), \tag{eq:sparse_update} $$

$$ \text{Avg} = \frac{A_{up} + A_{down}}{2}; \qquad \text{H-score} = \frac{2 \cdot A_{up} \cdot A_{down}}{A_{up} + A_{down}}. $$

$$ F(z^{(l)} + \Delta z^{(l)}) = F(z^{(l)}) + \frac{\partial F}{\partial z^{(l)}} \Delta z^{(l)} + \mathcal{O}(\|\Delta z^{(l)}\|^2), $$

$$ \frac{\partial f}{\partial W^{(l)}_{ij}} = \frac{\partial f}{\partial z^{(l)}_i} \cdot \frac{\partial z^{(l)}_i}{\partial W^{(l)}_{ij}} = J^{(l)}_i \cdot h^{(l-1)}_j. $$

$$ \frac{1}{\sqrt{2}}\,\|J_i\|_2 \cdot |\Delta W_{ij}| \cdot |h_j| \;\le\; \mathbb{E}_{\xi}\big[|\xi^\top J_i|\big] \cdot |\Delta W_{ij}| \cdot |h_j| \;\le\; \|J_i\|_2 \cdot |\Delta W_{ij}| \cdot |h_j|, $$

$$ \text{MFT} = \frac{1}{T} \sum_{i=1}^{T} A_{i,i}. $$
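To make the masking (eq:masking) and sparse update (eq:sparse_update) concrete, the following NumPy sketch scores a single weight matrix, masks the bottom-ρ fraction, and applies a masked gradient step. The Jacobian column norms and the gradient are random stand-ins for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 8, 16
W = rng.normal(size=(d_out, d_in))   # layer weight matrix
h = rng.normal(size=d_in)            # input activation h^{(l-1)}
J_norm = rng.random(d_out)           # stand-in for ||J_i||_2 per output unit

# Importance score S_ij = ||J_i||_2 * |W_ij| * |h_j|.
S = J_norm[:, None] * np.abs(W) * np.abs(h)[None, :]

# Mask: update only the bottom-rho fraction of scores (M = 1), freeze the rest.
rho = 0.10
threshold = np.quantile(S, rho)
M = (S <= threshold).astype(W.dtype)

# Sparse update: theta* = theta - lambda * (M ⊙ dL/dtheta).
grad = rng.normal(size=W.shape)      # placeholder gradient dL/dW
lr = 1e-2
W_new = W - lr * (M * grad)          # high-importance entries stay frozen
```

The frozen entries (M = 0) are exactly those whose perturbation would move the pretrained function the most, so the update budget is spent where the first-order output shift is smallest.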

Theorem. [Functional shift for single-weight perturbation] Consider a layer $l$ in an MLLM $f$. Under the first-order Taylor approximation, the $L_2$ norm of the output shift $\Delta f$ induced by perturbing a weight $W^{(l)}_{ij}$ is given by
$$ \|\Delta f\|_2 \approx \|J^{(l)}_i\|_2 \cdot |\Delta W^{(l)}_{ij}| \cdot |h^{(l-1)}_j|, $$
where $J^{(l)}_i = \partial f / \partial z^{(l)}_i$ denotes the $i$-th column of the Jacobian matrix of the network output with respect to the pre-activation vector $z^{(l)}$, and $h^{(l-1)}$ is the input activation of layer $l$ (the output of layer $l-1$).

Theorem. Consider an MLLM $f \in \mathbb{R}^{d_{\text{final}}}$ and a Rademacher random variable $\xi \in \{-1, 1\}^{d_{\text{final}}}$. Under the sensitivity measure defined in theorem:score, the following inequality holds:
$$ \frac{1}{\sqrt{2}}\,\|J_i\|_2 \cdot |\Delta W_{ij}| \cdot |h_j| \;\le\; \mathbb{E}_{\xi}\big[|\xi^\top J_i|\big] \cdot |\Delta W_{ij}| \cdot |h_j| \;\le\; \|J_i\|_2 \cdot |\Delta W_{ij}| \cdot |h_j|, $$
where $J_i = \partial f / \partial z_i$ denotes the $i$-th column vector of the Jacobian matrix, $h$ is the activation, and $W$ is the weight matrix of the model $f$.

Corollary. [Functional shift for multi-weight perturbation] Let $\Delta W = \{\Delta W^{(l)}\}_{l=1}^{L}$ be the set of perturbation matrices for all layers of the model $f$. Under the first-order Taylor approximation, the total output shift $\|\Delta f\|_2$ is bounded by the global aggregate of individual parameter sensitivities as follows:
$$ \|\Delta f\|_2 \lessapprox \sum_{l,i,j} \|J^{(l)}_i\|_2 \, |\Delta W^{(l)}_{ij}| \, |h^{(l-1)}_j|. $$

Proof. In this section, we provide the detailed derivation of the parameter importance score $S^{(l)}_{ij}$ based on the first-order Taylor approximation of the model's output shift.

Setup and Notation. Consider a specific linear layer $l$ within an MLLM $f$. For a given input stimulus, let $h^{(l-1)} \in \mathbb{R}^{d_{\text{in}}}$ denote the input activation (the output of the preceding layer) and $W^{(l)} \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ the weight matrix of layer $l$. The pre-activation vector is $z^{(l)} = W^{(l)} h^{(l-1)}$, whose $i$-th component is $z^{(l)}_i = \sum_j W^{(l)}_{ij} h^{(l-1)}_j$. The network output $f$ can be viewed as a function of the pre-activation, $f = F(z^{(l)})$, where $F$ encompasses all subsequent operations in the network. We define the output sensitivity vector (Jacobian column) as
$$ J^{(l)}_i = \frac{\partial f}{\partial z^{(l)}_i} \in \mathbb{R}^{d_{\text{final}}}. $$

Derivation for a Single-Weight Perturbation. We analyze the effect of perturbing a single weight element $W^{(l)}_{ij}$ by an amount $\Delta W^{(l)}_{ij}$.

Step 1: Since only the weight element $W^{(l)}_{ij}$ is perturbed, only the $i$-th component of the pre-activation vector $z^{(l)}$ is affected:
$$ z'^{(l)}_i = \sum_{k} \left( W^{(l)}_{ik} + \delta_{kj} \Delta W^{(l)}_{ij} \right) h^{(l-1)}_k = z^{(l)}_i + \Delta W^{(l)}_{ij} h^{(l-1)}_j, $$
where $\delta$ is the Kronecker delta. Thus the change in the $i$-th pre-activation is $\Delta z^{(l)}_i = \Delta W^{(l)}_{ij} h^{(l-1)}_j$, while $\Delta z^{(l)}_k = 0$ for all $k \neq i$.

Step 2: To quantify the functional impact of the perturbation, we treat the final network output $f$ as a differentiable function of the pre-activation $z^{(l)}$. When the pre-activation vector is perturbed from $z^{(l)}$ to $z^{(l)} + \Delta z^{(l)}$, a first-order Taylor expansion around $z^{(l)}$ gives
$$ F(z^{(l)} + \Delta z^{(l)}) = F(z^{(l)}) + \frac{\partial F}{\partial z^{(l)}} \Delta z^{(l)} + \mathcal{O}(\|\Delta z^{(l)}\|^2). $$
Neglecting the higher-order terms and rearranging to isolate the difference between the perturbed and original outputs, we define the output shift $\Delta f$ as
$$ \Delta f = F(z^{(l)} + \Delta z^{(l)}) - F(z^{(l)}) \approx \frac{\partial F}{\partial z^{(l)}} \Delta z^{(l)} = \sum_{k=1}^{d_{\text{out}}} J^{(l)}_k \Delta z^{(l)}_k, $$
where $\frac{\partial F}{\partial z^{(l)}} = [J^{(l)}_1, J^{(l)}_2, \dots, J^{(l)}_{d_{\text{out}}}]$ is the Jacobian matrix of the network output with respect to the pre-activations. As established in Step 1, the change vector $\Delta z^{(l)}$ is sparse, containing a non-zero value only at the $i$-th index ($\Delta z^{(l)}_i = \Delta W^{(l)}_{ij} \cdot h^{(l-1)}_j$). Consequently, the summation collapses to a single term:
$$ \Delta f \approx J^{(l)}_i \Delta z^{(l)}_i = J^{(l)}_i \cdot \left( \Delta W^{(l)}_{ij} \cdot h^{(l-1)}_j \right). $$

Step 3: Taking the $L_2$ norm of both sides, we obtain the magnitude of the functional shift:
$$ \|\Delta f\|_2 \approx \left\| J^{(l)}_i \cdot \left( \Delta W^{(l)}_{ij} \cdot h^{(l-1)}_j \right) \right\|_2 = \|J^{(l)}_i\|_2 \cdot |\Delta W^{(l)}_{ij}| \cdot |h^{(l-1)}_j|. $$
By substituting the potential perturbation $|\Delta W^{(l)}_{ij}|$ with the existing weight magnitude $|W^{(l)}_{ij}|$, we arrive at the importance score $S^{(l)}_{ij} = \|J^{(l)}_i\|_2 \cdot |W^{(l)}_{ij}| \cdot |h^{(l-1)}_j|$. We emphasize that this substitution does not aim to predict the exact output shift, but serves as a conservative first-order surrogate for ranking parameters by their relative functional sensitivity. This completes the proof of theorem:score.
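The single-weight result can be checked numerically on a toy two-layer map $F(z) = V\tanh(z)$ with $z = Wh$ (an illustrative stand-in for an MLLM, not our model): perturbing one entry of $W$ and comparing $\|\Delta f\|_2$ against $\|J_i\|_2\,|\Delta W_{ij}|\,|h_j|$ shows near-exact agreement for a small perturbation:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, d_final = 6, 5, 4
W = rng.normal(size=(d_out, d_in))
V = rng.normal(size=(d_final, d_out))
h = rng.normal(size=d_in)

z = W @ h
f = V @ np.tanh(z)                        # F(z) = V tanh(z)

i, j, delta = 2, 3, 1e-4                  # perturb the single weight W_ij
W_pert = W.copy()
W_pert[i, j] += delta
f_pert = V @ np.tanh(W_pert @ h)

# Analytic Jacobian column: J_i = dF/dz_i = V[:, i] * (1 - tanh(z_i)^2)
J_i = V[:, i] * (1.0 - np.tanh(z[i]) ** 2)

lhs = np.linalg.norm(f_pert - f)                      # ||Δf||_2, measured
rhs = np.linalg.norm(J_i) * abs(delta) * abs(h[j])    # ||J_i||_2 |ΔW_ij| |h_j|, predicted
```

The two quantities match up to the $\mathcal{O}(\|\Delta z\|^2)$ remainder of the Taylor expansion, which shrinks linearly as `delta` decreases.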

Proof. We derive this using the total differential, treating the neural network function $f$ as differentiable with respect to the set of parameters $W = \{W^{(l)}_{ij}\}_{l,i,j}$. By the definition of the total differential, the variation in the output $\Delta f$ induced by simultaneous perturbations of all weights can be approximated by the sum of partial derivatives with respect to every weight parameter:
$$ \Delta f \approx \sum_{l=1}^{L} \sum_{i,j} \frac{\partial f}{\partial W^{(l)}_{ij}} \Delta W^{(l)}_{ij}. \tag{eq:total_diff} $$
From the derivation of theorem:score (theorem:score_app), we established that the gradient of $f$ with respect to a specific weight $W^{(l)}_{ij}$ factorizes via the chain rule into the downstream sensitivity and the input activation:
$$ \frac{\partial f}{\partial W^{(l)}_{ij}} = \frac{\partial f}{\partial z^{(l)}_i} \cdot \frac{\partial z^{(l)}_i}{\partial W^{(l)}_{ij}} = J^{(l)}_i \cdot h^{(l-1)}_j. $$
Substituting this result directly into eq:total_diff yields
$$ \Delta f \approx \sum_{l=1}^{L} \sum_{i,j} \left( J^{(l)}_i \cdot h^{(l-1)}_j \right) \Delta W^{(l)}_{ij}. $$
Finally, to bound the magnitude of the total shift, we apply the triangle inequality ($\|\sum x\| \le \sum \|x\|$) together with the absolute homogeneity of the norm:
$$ \|\Delta f\|_2 \approx \left\| \sum_{l,i,j} J^{(l)}_i h^{(l-1)}_j \Delta W^{(l)}_{ij} \right\|_2 \le \sum_{l,i,j} \|J^{(l)}_i\|_2 \cdot |h^{(l-1)}_j| \cdot |\Delta W^{(l)}_{ij}|. $$
$$ \therefore \quad \|\Delta f\|_2 \lessapprox \sum_{l,i,j} \|J^{(l)}_i\|_2 \cdot |\Delta W^{(l)}_{ij}| \cdot |h^{(l-1)}_j|. $$
This inequality demonstrates that the sum of our proposed importance scores $S^{(l)}_{ij}$ serves as a theoretical upper bound on the total functional shift of the MLLM under the first-order approximation. Consequently, minimizing this aggregate score via global parameter selection effectively preserves the pretrained model's established functional response under perturbations. This completes the proof of cor:multi.

Proof. First, we recall that the $L_2$-norm of a Jacobian column can be estimated using the Hutchinson trace estimator. For a given column $J_i$, the relationship is
$$ \|J_i\|_2^2 = \mathbb{E}_{\xi}\left[ \left( \frac{\partial(\xi^\top f)}{\partial z_i} \right)^2 \right] = \mathbb{E}_{\xi}\big[ (\xi^\top J_i)^2 \big]. $$
To relate the expectation of the absolute value to the $L_2$-norm, we first apply Jensen's inequality for the concave function $\sqrt{\cdot}$, which yields
$$ \mathbb{E}_{\xi}\big[ |\xi^\top J_i| \big] \le \sqrt{\mathbb{E}_{\xi}\big[ (\xi^\top J_i)^2 \big]} = \|J_i\|_2. $$
This provides the upper bound $\mathbb{E}_{\xi}[|\xi^\top J_i|] \le \|J_i\|_2$. Furthermore, to establish the lower bound, we utilize the Khintchine inequality (Khintchine, 1923; Haagerup, 1981). For the specific case of $p = 1$ and Rademacher variables, the inequality states
$$ \frac{1}{\sqrt{2}}\,\|J_i\|_2 \;\le\; \mathbb{E}_{\xi}\big[ |\xi^\top J_i| \big] \;\le\; \|J_i\|_2. $$
This relationship demonstrates that the expectation $\mathbb{E}_{\xi}[|\xi^\top J_i|]$ is equivalent to the $L_2$-norm of the Jacobian column up to a constant factor, i.e., $\mathbb{E}_{\xi}[|\xi^\top J_i|] \asymp \|J_i\|_2$. Given that $\|J_i\|_2 \le \|J_i\|_1$, this estimator serves as a robust proxy for the sensitivity of the model outputs. Finally, multiplying all terms by the activation magnitude $|h_j|$ and the weight perturbation $|\Delta W_{ij}|$, we obtain
$$ \frac{1}{\sqrt{2}}\,\|J_i\|_2 \cdot |\Delta W_{ij}| \cdot |h_j| \;\le\; \mathbb{E}_{\xi}\big[ |\xi^\top J_i| \big] \cdot |\Delta W_{ij}| \cdot |h_j| \;\le\; \|J_i\|_2 \cdot |\Delta W_{ij}| \cdot |h_j|. $$
This completes the proof of theorem:L1ext.
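The Khintchine bounds can also be verified empirically. The sketch below draws Rademacher probes against a random stand-in Jacobian column (illustrative only) and checks that the Monte Carlo estimate of $\mathbb{E}_{\xi}[|\xi^\top J_i|]$ falls between $\|J_i\|_2/\sqrt{2}$ and $\|J_i\|_2$:

```python
import numpy as np

rng = np.random.default_rng(0)
d_final = 32
J_i = rng.normal(size=d_final)             # stand-in Jacobian column
R = 10_000                                 # number of Rademacher probes

xi = rng.choice([-1.0, 1.0], size=(R, d_final))   # Rademacher probes xi
est = np.mean(np.abs(xi @ J_i))            # Monte Carlo estimate of E|xi^T J_i|
l2 = np.linalg.norm(J_i)                   # exact ||J_i||_2

# Khintchine (p = 1): ||J_i||_2 / sqrt(2) <= E|xi^T J_i| <= ||J_i||_2
```

In practice the estimate concentrates well inside the interval (for many independent coordinates it approaches roughly $\sqrt{2/\pi}\,\|J_i\|_2$ by a central-limit argument), which is why a small number of probes suffices.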

As demonstrated in Table 9, Full-FT and Tailor exhibit significant catastrophic forgetting in the continual learning setting. They show severe forgetting of both the original pretrained knowledge (the upstream average drops to 40.2) and earlier downstream tasks, as evidenced by large negative Backward Transfer (BWT) values of -15.6 and -13.2, respectively. This indicates that without explicit protection, the model overwrites previous knowledge as it adapts to a new task.

In contrast, Model-Dowser achieves remarkable performance in the continual setting. It maintains an upstream performance average of 61.2, matching MoeLORA and remaining close to the Zero-shot baseline. Most notably, Model-Dowser achieves a BWT of -2.5, substantially better than MoeLORA (-7.8). This implies that our importance-score-based masking effectively isolates task-specific updates, enabling the model to learn new skills without forgetting.

Moreover, Model-Dowser does not harm learning capability. Our method achieves the highest MFN of 66.6 and an H-score of 63.8, outperforming the strongest baseline, MoeLORA (MFN: 60.0, H-score: 60.6). While Full-FT shows a slightly higher Mean Finetune Accuracy (MFT) of 70.2, its final performance collapses due to forgetting. Model-Dowser maintains a competitive MFT of 69.1 while ensuring that these gains are retained throughout the sequential training process. This confirms that our sparse fine-tuning strategy can be applied to a continual learning setting.

COCO-Caption:

| Method | TextVQA | OKVQA | OCRVQA | MMB | MMB(CN) | GQA | A_down | Avg ↑ | H-Score ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Zeroshot | 70.8 | 51.2 | 69.4 | 78.1 | 42.2 | 62.3 | 26.1 | 44.2 | 36.8 |
| Full-FT | 8.1 | 11.5 | 65.1 | 25.1 | 15.5 | 24.0 | 98.5 | 61.7 | 39.7 |
| Grafting | 19.8 | 19.1 | 66.1 | 26.8 | 17.0 | 38.7 | 115.7 | 73.5 | 49.2 |
| DARE | 8.0 | 11.3 | 64.1 | 24.7 | 14.2 | 24.9 | 96.8 | 60.6 | 39.1 |
| Tailor | 7.7 | 12.5 | 64.2 | 40.7 | 25.9 | 18.9 | 105.6 | 67.0 | 44.7 |
| SPIDER | 65.3 | 42.3 | 67.8 | 72.0 | 48.4 | 59.6 | 115.4 | 87.3 | 78.3 |
| Dowser | 69.3 | 48.6 | 68.8 | 77.7 | 50.8 | 60.7 | 135.5 | 99.1 | 85.7 |

ImageNet-R:

| Method | TextVQA | OKVQA | OCRVQA | MMB | MMB(CN) | GQA | A_down | Avg ↑ | H-Score ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Zeroshot | 70.8 | 51.2 | 69.4 | 78.1 | 42.2 | 62.3 | 22.2 | 42.2 | 32.7 |
| Full-FT | 16.0 | 10.8 | 13.3 | 43.0 | 33.1 | 14.2 | 92.3 | 57.0 | 35.2 |
| Grafting | 15.6 | 10.4 | 12.1 | 43.3 | 34.4 | 13.3 | 90.0 | 58.0 | 40.4 |
| DARE | 15.6 | 10.4 | 12.1 | 43.3 | 34.4 | 13.3 | 92.4 | 56.9 | 34.9 |
| Tailor | 28.8 | 18.7 | 23.9 | 60.2 | 42.5 | 26.2 | 80.3 | 66.9 | 47.2 |
| SPIDER | 42.5 | 26.1 | 54.3 | 68.0 | 38.9 | 40.9 | 91.9 | 68.5 | 60.5 |
| Dowser | 64.4 | 45.7 | 68.6 | 75.7 | 38.8 | 59.4 | 90.3 | 74.5 | 71.2 |
Flickr30k:

| Method | TextVQA | OKVQA | OCRVQA | MMB | MMB(CN) | GQA | A_down | Avg ↑ | H-Score ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Zeroshot | 70.8 | 51.2 | 69.4 | 78.1 | 42.2 | 62.3 | 27.3 | 44.8 | 38.0 |
| Full-FT | 48.6 | 31.0 | 65.2 | 54.2 | 31.8 | 51.7 | 64.3 | 55.7 | 54.3 |
| Grafting | 60.3 | 37.9 | 64.9 | 61.9 | 39.3 | 55.8 | 76.8 | 65.1 | 63.0 |
| DARE | 48.5 | 30.6 | 64.9 | 52.7 | 30.2 | 50.9 | 63.9 | 55.1 | 53.7 |
| Tailor | 49.4 | 33.4 | 66.6 | 66.2 | 40.0 | 52.6 | 74.1 | 62.7 | 60.7 |
| SPIDER | 68.0 | 44.5 | 67.4 | 75.0 | 50.7 | 59.4 | 78.5 | 69.7 | 68.5 |
| Dowser | 70.1 | 48.6 | 68.0 | 77.9 | 51.0 | 60.3 | 96.0 | 79.3 | 75.8 |

IconQA:

| Method | TextVQA | OKVQA | OCRVQA | MMB | MMB(CN) | GQA | A_down | Avg ↑ | H-Score ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Zeroshot | 70.8 | 51.2 | 69.4 | 78.1 | 42.2 | 62.3 | 26.1 | 44.2 | 36.8 |
| Full-FT | 16.0 | 10.8 | 13.3 | 43.0 | 33.1 | 14.2 | 98.5 | 61.7 | 39.7 |
| Grafting | 19.8 | 19.1 | 66.1 | 26.8 | 17.0 | 38.7 | 115.7 | 73.5 | 49.2 |
| DARE | 8.0 | 11.3 | 64.1 | 24.7 | 14.2 | 24.9 | 96.8 | 60.6 | 39.1 |
| Tailor | 7.7 | 12.5 | 64.2 | 40.7 | 25.9 | 18.9 | 105.6 | 67.0 | 44.7 |
| SPIDER | 65.3 | 42.3 | 67.8 | 72.0 | 48.4 | 59.6 | 115.4 | 87.3 | 78.3 |
| Dowser | 69.3 | 48.6 | 68.8 | 77.7 | 50.8 | 60.7 | 135.5 | 99.1 | 85.7 |
COCO-Caption:

| Method | TextVQA | OKVQA | OCRVQA | MMB | MMB(CN) | GQA | A_down | Avg ↑ | H-Score ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Zeroshot | 58.3 | 58.0 | 65.1 | 64.6 | 58.2 | 62.0 | 40.3 | 50.7 | 48.5 |
| Full-FT | 52.7 | 48.8 | 63.0 | 63.9 | 58.0 | 57.5 | 100.0 | 78.7 | 72.9 |
| Grafting | 57.2 | 55.9 | 64.1 | 64.8 | 58.5 | 59.8 | 81.6 | 70.8 | 69.2 |
| DARE | 52.2 | 48.1 | 62.4 | 63.7 | 58.1 | 57.1 | 99.1 | 78.0 | 72.3 |
| Tailor | 55.0 | 50.9 | 64.5 | 64.8 | 58.4 | 58.8 | 106.7 | 82.7 | 75.8 |
| SPIDER | 56.1 | 53.2 | 64.7 | 63.9 | 57.5 | 59.8 | 103.1 | 81.1 | 75.2 |
| Dowser | 57.0 | 54.3 | 65.0 | 62.9 | 55.0 | 60.3 | 123.6 | 91.3 | 79.9 |

ImageNet-R:

| Method | TextVQA | OKVQA | OCRVQA | MMB | MMB(CN) | GQA | A_down | Avg ↑ | H-Score ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Zeroshot | 58.3 | 58.0 | 65.1 | 64.6 | 58.2 | 62.0 | 16.3 | 38.7 | 25.8 |
| Full-FT | 24.3 | 10.2 | 26.2 | 48.6 | 45.0 | 20.6 | 89.7 | 59.4 | 44.0 |
| Grafting | 57.7 | 55.8 | 65.7 | 64.6 | 58.8 | 61.4 | 67.3 | 64.0 | 63.8 |
| DARE | 23.8 | 9.9 | 25.6 | 47.3 | 44.8 | 16.7 | 89.8 | 58.9 | 42.7 |
| Tailor | 38.1 | 18.2 | 40.1 | 61.4 | 54.8 | 36.8 | 84.3 | 62.9 | 55.7 |
| SPIDER | 46.8 | 27.4 | 55.6 | 63.0 | 54.8 | 43.0 | 89.3 | 68.9 | 62.8 |
| Dowser | 54.6 | 48.4 | 64.8 | 64.4 | 55.8 | 56.5 | 88.8 | 73.1 | 69.7 |
| Method | Memory Complexity | ρ | # param |
|---|---|---|---|
| Full FT | O(P) | 100% | 1.4B / 4.5B |
| Grafting | O(2P) | 100% | 143M / 438M |
| DARE | O(P) | 10% | 143M / 438M |
| Tailor | O(P) | 10% | 143M / 438M |
| SPIDER | O(3P) | 50% | 714M / 2.3B |
| Dowser (Ours) | O(P) | 10% | 143M / 438M |
| Method | Ratio ( ρ ) | A_up | A_down | Avg ↑ | H-score ↑ |
|---|---|---|---|---|---|
| Full FT | 1 | 13.8 | 92.3 | 53.1 | 24.0 |
| Random | 0.1 | 54.5 ± 1.0 | 90.7 ± 0.2 | 72.6 ± 0.6 | 68.0 ± 0.4 |
| Ours | 0.1 | 56.6 | 90.5 | 73.6 | 69.7 |
| Random | 0.25 | 49.0 ± 0.7 | 91.5 ± 0.2 | 70.2 ± 0.3 | 63.8 ± 0.1 |
| Ours | 0.25 | 53.0 | 91.7 | 72.4 | 67.2 |
| Random | 0.5 | 39.2 ± 0.9 | 92.2 ± 0.2 | 65.7 ± 0.4 | 55.0 ± 0.1 |
| Ours | 0.5 | 47.5 | 92.2 | 69.8 | 62.7 |
| Random | 0.75 | 25.6 ± 4.4 | 92.6 ± 0.2 | 59.1 ± 2.3 | 40.1 ± 1.2 |
| Ours | 0.75 | 35.5 | 92.5 | 64.0 | 51.3 |
| Random | 0.9 | 20.4 ± 1.4 | 92.6 ± 0.2 | 56.5 ± 0.8 | 33.4 ± 0.8 |
| Ours | 0.9 | 24.9 | 92.6 | 58.7 | 39.2 |
Flickr30k:

| Method | TextVQA | OKVQA | OCRVQA | MMB | MMB(CN) | GQA | A_down | Avg ↑ | H-Score ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Zeroshot | 58.3 | 58.0 | 65.1 | 64.6 | 58.2 | 62.0 | 25.3 | 43.1 | 35.7 |
| Full-FT | 50.6 | 51.5 | 62.3 | 64.3 | 54.5 | 57.6 | 67.3 | 62.1 | 61.6 |
| Grafting | 56.9 | 55.9 | 64.1 | 64.9 | 58.2 | 60.4 | 64.0 | 62.0 | 62.0 |
| DARE | 55.3 | 51.4 | 62.8 | 64.5 | 54.6 | 58.3 | 66.3 | 62.1 | 61.8 |
| Tailor | 56.1 | 52.1 | 63.4 | 65.0 | 56.4 | 59.1 | 78.0 | 68.3 | 67.0 |
| SPIDER | 56.6 | 54.2 | 64.3 | 63.9 | 55.2 | 59.4 | 71.0 | 65.0 | 64.4 |
| Dowser | 56.5 | 55.2 | 63.3 | 63.8 | 55.1 | 59.6 | 85.0 | 72.0 | 69.6 |

IconQA:

| Method | TextVQA | OKVQA | OCRVQA | MMB | MMB(CN) | GQA | A_down | Avg ↑ | H-Score ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Zeroshot | 58.3 | 58.0 | 65.1 | 64.6 | 58.2 | 62.0 | 25.3 | 43.1 | 35.7 |
| Full-FT | 50.6 | 53.9 | 56.6 | 59.1 | 50.5 | 57.6 | 44.4 | 49.5 | 49.0 |
| Grafting | 58.1 | 58.2 | 65.1 | 63.4 | 56.3 | 61.8 | 22.4 | 41.4 | 32.7 |
| DARE | 50.4 | 53.8 | 56.2 | 59.1 | 50.5 | 57.0 | 44.5 | 49.5 | 49.0 |
| Tailor | 54.7 | 56.2 | 65.1 | 58.3 | 49.5 | 61.1 | 29.6 | 43.5 | 39.1 |
| SPIDER | 54.8 | 55.0 | 61.2 | 61.7 | 52.1 | 60.0 | 44.8 | 51.1 | 50.4 |
| Dowser | 57.7 | 58.5 | 65.3 | 63.0 | 53.9 | 61.8 | 44.9 | 52.4 | 51.3 |
NVILA-Lite-2B:

| Ratio ( ρ ) | Method | TextVQA | OKVQA | OCR | MMB | MMB (CN) | GQA | A_up | ImageNet-R | Avg ↑ | H-score ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.0 | Full FT | 10.5 | 6.3 | 2.3 | 17.8 | 17.4 | 2.3 | 9.4 | 89.8 | 49.6 | 17.1 |
| 0.1 | Random | 53.7 ± 2.8 | 41.0 ± 0.8 | 66.0 ± 0.4 | 72.8 ± 1.2 | 39.2 ± 2.0 | 54.1 ± 0.8 | 54.5 ± 1.0 | 90.7 ± 0.2 | 72.6 ± 0.6 | 68.0 ± 0.8 |
| 0.1 | Ours | 57.9 | 43.4 | 67.7 | 75.0 | 39.4 | 56.5 | 56.6 | 90.5 | 73.6 | 69.7 |
| 0.25 | Random | 43.2 ± 1.7 | 30.2 ± 1.2 | 57.6 ± 0.7 | 71.0 ± 0.8 | 46.4 ± 0.8 | 45.4 ± 0.9 | 49.0 ± 0.7 | 91.5 ± 0.2 | 70.2 ± 0.3 | 63.8 ± 0.5 |
| 0.25 | Ours | 50.1 | 33.1 | 63.5 | 75.0 | 47.3 | 49.3 | 53.0 | 91.7 | 72.4 | 67.2 |
| 0.5 | Random | 30.6 ± 2.5 | 19.4 ± 0.6 | 41.1 ± 1.5 | 64.3 ± 2.7 | 47.6 ± 1.2 | 32.3 ± 0.7 | 39.2 ± 0.9 | 92.2 ± 0.2 | 65.7 ± 0.4 | 55.0 ± 0.8 |
| 0.5 | Ours | 42.3 | 25.2 | 52.6 | 71.7 | 50.7 | 42.4 | 47.5 | 92.2 | 69.8 | 62.7 |
| 0.75 | Random | 22.8 ± 3.9 | 14.9 ± 0.7 | 16.0 ± 4.6 | 46.1 ± 11.3 | 35.7 ± 6.0 | 18.0 ± 3.5 | 25.6 ± 4.4 | 92.6 ± 0.2 | 59.1 ± 2.3 | 40.1 ± 5.5 |
| 0.75 | Ours | 30.9 | 17.1 | 41.1 | 53.1 | 38.9 | 32.0 | 35.5 | 92.5 | 64.0 | 51.3 |
| 0.9 | Random | 19.9 ± 1.7 | 12.5 ± 0.2 | 9.1 ± 3.5 | 40.4 ± 2.6 | 27.8 ± 1.8 | 12.6 ± 1.3 | 20.4 ± 1.4 | 92.6 ± 0.2 | 56.5 ± 0.8 | 33.4 ± 1.9 |
| 0.9 | Ours | 22.0 | 13.2 | 24.9 | 41.6 | 27.5 | 20.0 | 24.9 | 92.6 | 58.7 | 39.2 |

LLaVA-1.5-7B:

| Ratio ( ρ ) | Method | TextVQA | OKVQA | OCR | MMB | MMB (CN) | GQA | A_up | ImageNet-R | Avg ↑ | H-score ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.0 | Full FT | 12.2 | 9.3 | 2.3 | 36.5 | 17.8 | 4.9 | 13.8 | 92.3 | 53.1 | 24.0 |
| 0.1 | Random | 54.2 ± 0.1 | 44.3 ± 0.1 | 59.3 ± 0.2 | 63.0 ± 0.1 | 54.9 ± 0.2 | 52.3 ± 0.1 | 54.6 ± 0.1 | 88.5 ± 0.1 | 71.6 ± 0.0 | 67.6 ± 0.1 |
| 0.1 | Ours | 54.9 | 46.1 | 62.0 | 63.4 | 55.1 | 53.6 | 55.8 | 88.6 | 72.2 | 68.5 |
| 0.25 | Random | 45.2 ± 0.1 | 28.5 ± 0.2 | 41.5 ± 0.7 | 53.1 ± 0.3 | 49.8 ± 0.1 | 41.7 ± 0.1 | 43.3 ± 0.2 | 89.3 ± 0.1 | 66.3 ± 0.1 | 58.3 ± 0.2 |
| 0.25 | Ours | 49.2 | 33.9 | 47.6 | 59.5 | 53.8 | 45.2 | 48.2 | 89.5 | 68.9 | 62.7 |
| 0.5 | Random | 21.4 ± 0.5 | 12.3 ± 0.0 | 14.7 ± 0.3 | 30.3 ± 0.4 | 26.7 ± 0.2 | 16.8 ± 0.8 | 20.4 ± 0.3 | 90.0 ± 0.1 | 55.2 ± 0.1 | 33.2 ± 0.5 |
| 0.5 | Ours | 29.5 | 15.3 | 20.3 | 38.5 | 36.6 | 27.3 | 27.9 | 89.7 | 58.8 | 42.6 |
| 0.75 | Random | 13.3 ± 0.1 | 7.2 ± 0.1 | 7.6 ± 0.2 | 21.8 ± 0.1 | 18.8 ± 0.7 | 3.3 ± 0.3 | 12.0 ± 0.2 | 89.8 ± 0.2 | 50.9 ± 0.1 | 21.2 ± 0.3 |
| 0.75 | Ours | 14.6 | 7.8 | 10.1 | 24.3 | 20.4 | 4.3 | 13.6 | 89.7 | 51.6 | 23.6 |
| 0.9 | Random | 11.0 ± 0.3 | 6.5 ± 0.1 | 3.1 ± 0.3 | 17.9 ± 0.8 | 17.9 ± 0.2 | 2.4 ± 0.1 | 9.8 ± 0.2 | 89.7 ± 0.1 | 49.8 ± 0.1 | 17.6 ± 0.2 |
| 0.9 | Ours | 11.5 | 6.5 | 4.3 | 18.4 | 18.0 | 2.4 | 10.2 | 89.9 | 50.1 | 18.3 |
NVILA-Lite-2B:

| Ratio ( ρ ) | Method | TextVQA | OKVQA | OCR | MMB | MMB (CN) | GQA | A_up | COCO-C | Avg ↑ | H-score ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.0 | Full FT | 0.2 | 0.1 | 22.5 | 25.4 | 29.3 | 0.2 | 12.9 | 98.2 | 55.6 | 22.9 |
| 0.1 | Random | 67.5 ± 0.2 | 47.1 ± 0.2 | 68.7 ± 0.1 | 76.9 ± 0.1 | 52.9 ± 0.4 | 60.5 ± 0.1 | 62.3 ± 0.1 | 134.8 ± 0.2 | 98.6 ± 0.1 | 85.2 ± 0.1 |
| 0.1 | Ours | 68.0 | 47.6 | 68.8 | 77.8 | 53.8 | 60.4 | 62.7 | 135.1 | 98.9 | 85.7 |
| 0.25 | Random | 64.4 ± 1.0 | 43.3 ± 0.7 | 68.1 ± 0.2 | 73.1 ± 0.5 | 52.1 ± 0.2 | 58.3 ± 0.5 | 59.9 ± 0.4 | 124.3 ± 0.5 | 92.1 ± 0.2 | 80.8 ± 0.3 |
| 0.25 | Ours | 66.6 | 45.0 | 68.7 | 75.5 | 53.2 | 59.3 | 61.4 | 123.8 | 92.6 | 82.1 |
| 0.5 | Random | 30.8 ± 0.6 | 26.1 ± 1.4 | 65.7 ± 0.2 | 63.9 ± 2.7 | 47.4 ± 0.8 | 35.0 ± 3.4 | 44.8 ± 1.4 | 105.2 ± 0.4 | 75.0 ± 0.7 | 62.9 ± 1.4 |
| 0.5 | Ours | 55.0 | 35.0 | 67.2 | 71.7 | 51.5 | 51.0 | 55.2 | 106.4 | 80.8 | 72.7 |
| 0.75 | Random | 2.3 ± 0.4 | 4.7 ± 3.1 | 53.7 ± 5.5 | 52.1 ± 3.5 | 41.6 ± 0.2 | 8.1 ± 4.4 | 27.1 ± 2.7 | 98.6 ± 0.2 | 62.8 ± 1.3 | 42.5 ± 3.4 |
| 0.75 | Ours | 10.4 | 14.2 | 63.9 | 62.4 | 49.1 | 20.3 | 36.7 | 98.5 | 67.6 | 53.5 |
| 0.9 | Random | 0.4 ± 0.2 | 1.3 ± 0.5 | 32.5 ± 1.4 | 42.8 ± 3.6 | 38.7 ± 3.2 | 2.2 ± 0.9 | 19.7 ± 0.7 | 97.8 ± 0.1 | 58.7 ± 0.3 | 32.7 ± 0.9 |
| 0.9 | Ours | 0.9 | 2.0 | 38.9 | 49.1 | 43.6 | 3.2 | 22.9 | 97.7 | 60.3 | 37.1 |

LLaVA-1.5-7B:

| Ratio ( ρ ) | Method | TextVQA | OKVQA | OCR | MMB | MMB (CN) | GQA | A_up | COCO-C | Avg ↑ | H-score ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.0 | Full FT | 0.1 | 0.0 | 2.5 | 8.6 | 15.5 | 0.0 | 4.5 | 101.5 | 53.0 | 8.5 |
| 0.1 | Random | 56.5 ± 0.1 | 53.5 ± 0.2 | 64.7 ± 0.7 | 62.7 ± 0.2 | 55.5 ± 0.2 | 60.1 ± 0.4 | 58.8 ± 0.3 | 120.8 ± 0.3 | 89.8 ± 0.0 | 79.1 ± 0.2 |
| 0.1 | Ours | 56.7 | 54.1 | 64.3 | 63.1 | 55.2 | 60.4 | 59.0 | 120.8 | 89.9 | 79.3 |
| 0.25 | Random | 49.0 ± 0.2 | 47.2 ± 0.0 | 63.6 ± 0.5 | 61.5 ± 0.1 | 56.5 ± 0.0 | 57.8 ± 0.5 | 55.9 ± 0.2 | 105.2 ± 0.5 | 80.6 ± 0.3 | 73.0 ± 0.3 |
| 0.25 | Ours | 52.3 | 49.7 | 63.7 | 62.8 | 56.4 | 58.7 | 57.3 | 104.5 | 80.9 | 74.0 |
| 0.5 | Random | 11.9 ± 2.8 | 18.8 ± 2.1 | 32.9 ± 9.5 | 52.1 ± 1.4 | 53.5 ± 0.3 | 29.4 ± 3.4 | 33.1 ± 3.0 | 100.5 ± 0.1 | 66.8 ± 1.5 | 49.8 ± 3.5 |
| 0.5 | Ours | 20.0 | 26.2 | 36.6 | 54.4 | 56.4 | 41.9 | 39.2 | 100.8 | 70.0 | 56.5 |
| 0.75 | Random | 0.1 ± 0.0 | 1.2 ± 0.4 | 4.2 ± 0.6 | 22.4 ± 6.9 | 37.8 ± 7.9 | 0.5 ± 0.3 | 11.0 ± 2.5 | 100.5 ± 0.3 | 55.8 ± 1.3 | 19.9 ± 4.2 |
| 0.75 | Ours | 0.2 | 2.2 | 6.4 | 27.2 | 43.8 | 0.9 | 13.4 | 100.9 | 57.2 | 23.7 |
| 0.9 | Random | 0.1 ± 0.0 | 0.2 ± 0.1 | 3.0 ± 0.2 | 11.4 ± 3.1 | 23.3 ± 6.0 | 0.0 ± 0.0 | 6.3 ± 1.5 | 100.5 ± 0.7 | 53.4 ± 0.7 | 11.9 ± 2.7 |
| 0.9 | Ours | 0.1 | 0.4 | 3.2 | 12.5 | 24.0 | 0.0 | 6.7 | 101.3 | 54.0 | 12.5 |
R NHamming Distance ( ↓ )Hamming Distance ( ↓ )Hamming Distance ( ↓ )Spearman Correlation ( ↑ )Spearman Correlation ( ↑ )Spearman Correlation ( ↑ )
Random BaselineModel-Dowser (Ours)p -valueRandom BaselineModel-Dowser (Ours)p -value
40.226 ± 0.0000.064 ± 0.0041 . 1 × 10 - 70.000 ± 0.0000.861 ± 0.0082 . 9 × 10 - 9
2 160.226 ± 0.0000.056 ± 0.0032 . 2 × 10 - 80.000 ± 0.0000.879 ± 0.0042 . 6 × 10 - 10
640.226 ± 0.0000.052 ± 0.0028 . 6 × 10 - 90.000 ± 0.0000.886 ± 0.0041 . 1 × 10 - 10
40.226 ± 0.0000.063 ± 0.0041 . 7 × 10 - 70.000 ± 0.0000.862 ± 0.0094 . 1 × 10 - 9
4 160.226 ± 0.0000.056 ± 0.0031 . 8 × 10 - 80.000 ± 0.0000.879 ± 0.0053 . 1 × 10 - 10
640.226 ± 0.0000.053 ± 0.0021 . 1 × 10 - 80.000 ± 0.0000.885 ± 0.0041 . 9 × 10 - 10
40.226 ± 0.0000.058 ± 0.0035 . 2 × 10 - 80.000 ± 0.0000.875 ± 0.0071 . 6 × 10 - 9
8 160.226 ± 0.0000.052 ± 0.0028 . 8 × 10 - 90.000 ± 0.0000.887 ± 0.0042 . 4 × 10 - 10
640.226 ± 0.0000.050 ± 0.0023 . 0 × 10 - 90.000 ± 0.0000.891 ± 0.0037 . 8 × 10 - 11
Upstream A upDownstreamDownstreamDownstreamDownstreamDownstreamDownstreamMetricsMetricsMetricsMetricsMetrics
MethodRSMedADSciFinA downMFTMFNMAABWTH-score
Zero-shot63.932.335.515.642.362.537.6-----
Full-FT40.260.050.120.651.091.654.770.254.766.2-15.646.3
Tailor40.666.050.021.251.191.355.969.155.966.6-13.247.0
MoeLORA61.273.251.535.049.690.960.067.960.064.8-7.860.6
Dowser61.278.861.248.053.591.466.669.166.669.6-2.563.8
COCO-Caption:

| Method | TextVQA | OKVQA | OCRVQA | MMB | MMB (CN) | GQA | A_down | Avg ↑ | H-Score ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 70.8 | 51.2 | 69.4 | 78.1 | 42.2 | 62.3 | 26.1 | 44.2 | 36.8 |
| Full-FT | 8.1 | 11.5 | 65.1 | 25.1 | 15.5 | 24.0 | 98.5 | 61.7 | 39.7 |
| Grafting | 19.8 | 19.1 | 66.1 | 26.8 | 17.0 | 38.7 | 115.7 | 73.5 | 49.2 |
| DARE | 8.0 | 11.3 | 64.1 | 24.7 | 14.2 | 24.9 | 96.8 | 60.6 | 39.1 |
| Tailor | 7.7 | 12.5 | 64.2 | 40.7 | 25.9 | 18.9 | 105.6 | 67.0 | 44.7 |
| SPIDER | 65.3 | 42.3 | 67.8 | 72.0 | 48.4 | 59.6 | 115.4 | 87.3 | 78.3 |
| Dowser | 69.3 | 48.6 | 68.8 | 77.7 | 50.8 | 60.7 | 135.5 | 99.1 | 85.7 |

ImageNet-R:

| Method | TextVQA | OKVQA | OCRVQA | MMB | MMB (CN) | GQA | A_down | Avg ↑ | H-Score ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 70.8 | 51.2 | 69.4 | 78.1 | 42.2 | 62.3 | 22.2 | 42.2 | 32.7 |
| Full-FT | 16.0 | 10.8 | 13.3 | 43.0 | 33.1 | 14.2 | 92.3 | 57.0 | 35.2 |
| Grafting | 15.6 | 10.4 | 12.1 | 43.3 | 34.4 | 13.3 | 90.0 | 58.0 | 40.4 |
| DARE | 15.6 | 10.4 | 12.1 | 43.3 | 34.4 | 13.3 | 92.4 | 56.9 | 34.9 |
| Tailor | 28.8 | 18.7 | 23.9 | 60.2 | 42.5 | 26.2 | 80.3 | 66.9 | 47.2 |
| SPIDER | 42.5 | 26.1 | 54.3 | 68.0 | 38.9 | 40.9 | 91.9 | 68.5 | 60.5 |
| Dowser | 64.4 | 45.7 | 68.6 | 75.7 | 38.8 | 59.4 | 90.3 | 74.5 | 71.2 |
Flickr30k:

| Method | TextVQA | OKVQA | OCRVQA | MMB | MMB (CN) | GQA | A_down | Avg ↑ | H-Score ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 70.8 | 51.2 | 69.4 | 78.1 | 42.2 | 62.3 | 27.3 | 44.8 | 38.0 |
| Full-FT | 48.6 | 31.0 | 65.2 | 54.2 | 31.8 | 51.7 | 64.3 | 55.7 | 54.3 |
| Grafting | 60.3 | 37.9 | 64.9 | 61.9 | 39.3 | 55.8 | 76.8 | 65.1 | 63.0 |
| DARE | 48.5 | 30.6 | 64.9 | 52.7 | 30.2 | 50.9 | 63.9 | 55.1 | 53.7 |
| Tailor | 49.4 | 33.4 | 66.6 | 66.2 | 40.0 | 52.6 | 74.1 | 62.7 | 60.7 |
| SPIDER | 68.0 | 44.5 | 67.4 | 75.0 | 50.7 | 59.4 | 78.5 | 69.7 | 68.5 |
| Dowser | 70.1 | 48.6 | 68.0 | 77.9 | 51.0 | 60.3 | 96.0 | 79.3 | 75.8 |
COCO-Caption:

| Method | TextVQA | OKVQA | OCRVQA | MMB | MMB (CN) | GQA | A_down | Avg ↑ | H-Score ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 58.3 | 58.0 | 65.1 | 64.6 | 58.2 | 62.0 | 40.3 | 50.7 | 48.5 |
| Full-FT | 52.7 | 48.8 | 63.0 | 63.9 | 58.0 | 57.5 | 100.0 | 78.7 | 72.9 |
| Grafting | 57.2 | 55.9 | 64.1 | 64.8 | 58.5 | 59.8 | 81.6 | 70.8 | 69.2 |
| DARE | 52.2 | 48.1 | 62.4 | 63.7 | 58.1 | 57.1 | 99.1 | 78.0 | 72.3 |
| Tailor | 55.0 | 50.9 | 64.5 | 64.8 | 58.4 | 58.8 | 106.7 | 82.7 | 75.8 |
| SPIDER | 56.1 | 53.2 | 64.7 | 63.9 | 57.5 | 59.8 | 103.1 | 81.1 | 75.2 |
| Dowser | 57.0 | 54.3 | 65.0 | 62.9 | 55.0 | 60.3 | 123.6 | 91.3 | 79.9 |

ImageNet-R:

| Method | TextVQA | OKVQA | OCRVQA | MMB | MMB (CN) | GQA | A_down | Avg ↑ | H-Score ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 58.3 | 58.0 | 65.1 | 64.6 | 58.2 | 62.0 | 16.3 | 38.7 | 25.8 |
| Full-FT | 24.3 | 10.2 | 26.2 | 48.6 | 45.0 | 20.6 | 89.7 | 59.4 | 44.0 |
| Grafting | 57.7 | 55.8 | 65.7 | 64.6 | 58.8 | 61.4 | 67.3 | 64.0 | 63.8 |
| DARE | 23.8 | 9.9 | 25.6 | 47.3 | 44.8 | 16.7 | 89.8 | 58.9 | 42.7 |
| Tailor | 38.1 | 18.2 | 40.1 | 61.4 | 54.8 | 36.8 | 84.3 | 62.9 | 55.7 |
| SPIDER | 46.8 | 27.4 | 55.6 | 63.0 | 54.8 | 43.0 | 89.3 | 68.9 | 62.8 |
| Dowser | 54.6 | 48.4 | 64.8 | 64.4 | 55.8 | 56.5 | 88.8 | 73.1 | 69.7 |
| Method | Memory Complexity | ρ | # param |
|---|---|---|---|
| Full FT | O(P) | 100% | 1.4B / 4.5B |
| Grafting | O(2P) | 100% | 143M / 438M |
| DARE | O(P) | 10% | 143M / 438M |
| Tailor | O(P) | 10% | 143M / 438M |
| SPIDER | O(3P) | 50% | 714M / 2.3B |
| Dowser (Ours) | O(P) | 10% | 143M / 438M |
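The O(P) entries above correspond to keeping only a single binary mask alongside the parameters: preserved (high-importance) entries are frozen, and everything else is updated as usual. A minimal sketch of such a masked update step (function and argument names are ours, not from the paper):

```python
def masked_sgd_step(params, grads, preserve_mask, lr=0.01):
    """One SGD step under a fixed sparsity mask.

    preserve_mask[i] == 1 marks a high-importance parameter that stays
    frozen; only the remaining parameters are updated. The only extra
    state beyond the parameters themselves is the O(P) binary mask.
    """
    return [p - lr * g * (1 - m) for p, g, m in zip(params, grads, preserve_mask)]

# The first parameter is preserved; the second moves by lr * grad.
updated = masked_sgd_step([1.0, 2.0], [0.5, 0.5], [1, 0], lr=0.1)
# updated → [1.0, 1.95]
```

In a framework like PyTorch the same effect is usually obtained by multiplying each gradient tensor by the mask's complement before the optimizer step, so no per-parameter copies are needed.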
| Ratio (ρ) | Method | A_up | A_down | Avg ↑ | H-score ↑ |
|---|---|---|---|---|---|
| 1.0 | Full FT | 13.8 | 92.3 | 53.1 | 24.0 |
| 0.1 | Random | 54.5 ± 1.0 | 90.7 ± 0.2 | 72.6 ± 0.6 | 68.0 ± 0.4 |
| | Ours | 56.6 | 90.5 | 73.6 | 69.7 |
| 0.25 | Random | 49.0 ± 0.7 | 91.5 ± 0.2 | 70.2 ± 0.3 | 63.8 ± 0.1 |
| | Ours | 53.0 | 91.7 | 72.4 | 67.2 |
| 0.5 | Random | 39.2 ± 0.9 | 92.2 ± 0.2 | 65.7 ± 0.4 | 55.0 ± 0.1 |
| | Ours | 47.5 | 92.2 | 69.8 | 62.7 |
| 0.75 | Random | 25.6 ± 4.4 | 92.6 ± 0.2 | 59.1 ± 2.3 | 40.1 ± 1.2 |
| | Ours | 35.5 | 92.5 | 64.0 | 51.3 |
| 0.9 | Random | 20.4 ± 1.4 | 92.6 ± 0.2 | 56.5 ± 0.8 | 33.4 ± 0.8 |
| | Ours | 24.9 | 92.6 | 58.7 | 39.2 |
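As a reading aid: the Avg and H-score columns in these tables are consistent with the arithmetic and harmonic means of A_up and A_down. For the Full FT row above (A_up = 13.8, A_down = 92.3) this gives 53.05 ≈ 53.1 and 24.01 ≈ 24.0, matching the table. A quick check (our interpretation of the metrics, not code from the paper):

```python
def avg_score(a_up, a_down):
    """Arithmetic mean of upstream (pretrained) and downstream scores."""
    return (a_up + a_down) / 2

def h_score(a_up, a_down):
    """Harmonic mean: collapses toward 0 when either side is forgotten."""
    return 2 * a_up * a_down / (a_up + a_down)

# Full FT row above: A_up = 13.8, A_down = 92.3
avg = avg_score(13.8, 92.3)  # 53.05, reported as 53.1
h = h_score(13.8, 92.3)      # ≈ 24.01, reported as 24.0
```

The harmonic mean is the stricter summary: a method that forgets upstream ability (low A_up) scores poorly even with a strong downstream result, which is why Full FT's H-score is so low despite A_down = 92.3.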
Flickr30k:

| Method | TextVQA | OKVQA | OCRVQA | MMB | MMB (CN) | GQA | A_down | Avg ↑ | H-Score ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 58.3 | 58.0 | 65.1 | 64.6 | 58.2 | 62.0 | 25.3 | 43.1 | 35.7 |
| Full-FT | 50.6 | 51.5 | 62.3 | 64.3 | 54.5 | 57.6 | 67.3 | 62.1 | 61.6 |
| Grafting | 56.9 | 55.9 | 64.1 | 64.9 | 58.2 | 60.4 | 64.0 | 62.0 | 62.0 |
| DARE | 55.3 | 51.4 | 62.8 | 64.5 | 54.6 | 58.3 | 66.3 | 62.1 | 61.8 |
| Tailor | 56.1 | 52.1 | 63.4 | 65.0 | 56.4 | 59.1 | 78.0 | 68.3 | 67.0 |
| SPIDER | 56.6 | 54.2 | 64.3 | 63.9 | 55.2 | 59.4 | 71.0 | 65.0 | 64.4 |
| Dowser | 56.5 | 55.2 | 63.3 | 63.8 | 55.1 | 59.6 | 85.0 | 72.0 | 69.6 |

IconQA:

| Method | TextVQA | OKVQA | OCRVQA | MMB | MMB (CN) | GQA | A_down | Avg ↑ | H-Score ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 58.3 | 58.0 | 65.1 | 64.6 | 58.2 | 62.0 | 25.3 | 43.1 | 35.7 |
| Full-FT | 50.6 | 53.9 | 56.6 | 59.1 | 50.5 | 57.6 | 44.4 | 49.5 | 49.0 |
| Grafting | 58.1 | 58.2 | 65.1 | 63.4 | 56.3 | 61.8 | 22.4 | 41.4 | 32.7 |
| DARE | 50.4 | 53.8 | 56.2 | 59.1 | 50.5 | 57.0 | 44.5 | 49.5 | 49.0 |
| Tailor | 54.7 | 56.2 | 65.1 | 58.3 | 49.5 | 61.1 | 29.6 | 43.5 | 39.1 |
| SPIDER | 54.8 | 55.0 | 61.2 | 61.7 | 52.1 | 60.0 | 44.8 | 51.1 | 50.4 |
| Dowser | 57.7 | 58.5 | 65.3 | 63.0 | 53.9 | 61.8 | 44.9 | 52.4 | 51.3 |
NVILA-Lite-2B:

| Ratio (ρ) | Method | TextVQA | OKVQA | OCR | MMB | MMB (CN) | GQA | A_up | ImageNet-R | Avg ↑ | H-score ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.0 | Full FT | 10.5 | 6.3 | 2.3 | 17.8 | 17.4 | 2.3 | 9.4 | 89.8 | 49.6 | 17.1 |
| 0.1 | Random | 53.7 ± 2.8 | 41.0 ± 0.8 | 66.0 ± 0.4 | 72.8 ± 1.2 | 39.2 ± 2.0 | 54.1 ± 0.8 | 54.5 ± 1.0 | 90.7 ± 0.2 | 72.6 ± 0.6 | 68.0 ± 0.8 |
| | Ours | 57.9 | 43.4 | 67.7 | 75.0 | 39.4 | 56.5 | 56.6 | 90.5 | 73.6 | 69.7 |
| 0.25 | Random | 43.2 ± 1.7 | 30.2 ± 1.2 | 57.6 ± 0.7 | 71.0 ± 0.8 | 46.4 ± 0.8 | 45.4 ± 0.9 | 49.0 ± 0.7 | 91.5 ± 0.2 | 70.2 ± 0.3 | 63.8 ± 0.5 |
| | Ours | 50.1 | 33.1 | 63.5 | 75.0 | 47.3 | 49.3 | 53.0 | 91.7 | 72.4 | 67.2 |
| 0.5 | Random | 30.6 ± 2.5 | 19.4 ± 0.6 | 41.1 ± 1.5 | 64.3 ± 2.7 | 47.6 ± 1.2 | 32.3 ± 0.7 | 39.2 ± 0.9 | 92.2 ± 0.2 | 65.7 ± 0.4 | 55.0 ± 0.8 |
| | Ours | 42.3 | 25.2 | 52.6 | 71.7 | 50.7 | 42.4 | 47.5 | 92.2 | 69.8 | 62.7 |
| 0.75 | Random | 22.8 ± 3.9 | 14.9 ± 0.7 | 16.0 ± 4.6 | 46.1 ± 11.3 | 35.7 ± 6.0 | 18.0 ± 3.5 | 25.6 ± 4.4 | 92.6 ± 0.2 | 59.1 ± 2.3 | 40.1 ± 5.5 |
| | Ours | 30.9 | 17.1 | 41.1 | 53.1 | 38.9 | 32.0 | 35.5 | 92.5 | 64.0 | 51.3 |
| 0.9 | Random | 19.9 ± 1.7 | 12.5 ± 0.2 | 9.1 ± 3.5 | 40.4 ± 2.6 | 27.8 ± 1.8 | 12.6 ± 1.3 | 20.4 ± 1.4 | 92.6 ± 0.2 | 56.5 ± 0.8 | 33.4 ± 1.9 |
| | Ours | 22.0 | 13.2 | 24.9 | 41.6 | 27.5 | 20.0 | 24.9 | 92.6 | 58.7 | 39.2 |

LLaVA-1.5-7B:

| Ratio (ρ) | Method | TextVQA | OKVQA | OCR | MMB | MMB (CN) | GQA | A_up | ImageNet-R | Avg ↑ | H-score ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.0 | Full FT | 12.2 | 9.3 | 2.3 | 36.5 | 17.8 | 4.9 | 13.8 | 92.3 | 53.1 | 24.0 |
| 0.1 | Random | 54.2 ± 0.1 | 44.3 ± 0.1 | 59.3 ± 0.2 | 63.0 ± 0.1 | 54.9 ± 0.2 | 52.3 ± 0.1 | 54.6 ± 0.1 | 88.5 ± 0.1 | 71.6 ± 0.0 | 67.6 ± 0.1 |
| | Ours | 54.9 | 46.1 | 62.0 | 63.4 | 55.1 | 53.6 | 55.8 | 88.6 | 72.2 | 68.5 |
| 0.25 | Random | 45.2 ± 0.1 | 28.5 ± 0.2 | 41.5 ± 0.7 | 53.1 ± 0.3 | 49.8 ± 0.1 | 41.7 ± 0.1 | 43.3 ± 0.2 | 89.3 ± 0.1 | 66.3 ± 0.1 | 58.3 ± 0.2 |
| | Ours | 49.2 | 33.9 | 47.6 | 59.5 | 53.8 | 45.2 | 48.2 | 89.5 | 68.9 | 62.7 |
| 0.5 | Random | 21.4 ± 0.5 | 12.3 ± 0.0 | 14.7 ± 0.3 | 30.3 ± 0.4 | 26.7 ± 0.2 | 16.8 ± 0.8 | 20.4 ± 0.3 | 90.0 ± 0.1 | 55.2 ± 0.1 | 33.2 ± 0.5 |
| | Ours | 29.5 | 15.3 | 20.3 | 38.5 | 36.6 | 27.3 | 27.9 | 89.7 | 58.8 | 42.6 |
| 0.75 | Random | 13.3 ± 0.1 | 7.2 ± 0.1 | 7.6 ± 0.2 | 21.8 ± 0.1 | 18.8 ± 0.7 | 3.3 ± 0.3 | 12.0 ± 0.2 | 89.8 ± 0.2 | 50.9 ± 0.1 | 21.2 ± 0.3 |
| | Ours | 14.6 | 7.8 | 10.1 | 24.3 | 20.4 | 4.3 | 13.6 | 89.7 | 51.6 | 23.6 |
| 0.9 | Random | 11.0 ± 0.3 | 6.5 ± 0.1 | 3.1 ± 0.3 | 17.9 ± 0.8 | 17.9 ± 0.2 | 2.4 ± 0.1 | 9.8 ± 0.2 | 89.7 ± 0.1 | 49.8 ± 0.1 | 17.6 ± 0.2 |
| | Ours | 11.5 | 6.5 | 4.3 | 18.4 | 18.0 | 2.4 | 10.2 | 89.9 | 50.1 | 18.3 |
NVILA-Lite-2B:

| Ratio (ρ) | Method | TextVQA | OKVQA | OCR | MMB | MMB (CN) | GQA | A_up | COCO-C | Avg ↑ | H-score ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.0 | Full FT | 0.2 | 0.1 | 22.5 | 25.4 | 29.3 | 0.2 | 12.9 | 98.2 | 55.6 | 22.9 |
| 0.1 | Random | 67.5 ± 0.2 | 47.1 ± 0.2 | 68.7 ± 0.1 | 76.9 ± 0.1 | 52.9 ± 0.4 | 60.5 ± 0.1 | 62.3 ± 0.1 | 134.8 ± 0.2 | 98.6 ± 0.1 | 85.2 ± 0.1 |
| | Ours | 68.0 | 47.6 | 68.8 | 77.8 | 53.8 | 60.4 | 62.7 | 135.1 | 98.9 | 85.7 |
| 0.25 | Random | 64.4 ± 1.0 | 43.3 ± 0.7 | 68.1 ± 0.2 | 73.1 ± 0.5 | 52.1 ± 0.2 | 58.3 ± 0.5 | 59.9 ± 0.4 | 124.3 ± 0.5 | 92.1 ± 0.2 | 80.8 ± 0.3 |
| | Ours | 66.6 | 45.0 | 68.7 | 75.5 | 53.2 | 59.3 | 61.4 | 123.8 | 92.6 | 82.1 |
| 0.5 | Random | 30.8 ± 0.6 | 26.1 ± 1.4 | 65.7 ± 0.2 | 63.9 ± 2.7 | 47.4 ± 0.8 | 35.0 ± 3.4 | 44.8 ± 1.4 | 105.2 ± 0.4 | 75.0 ± 0.7 | 62.9 ± 1.4 |
| | Ours | 55.0 | 35.0 | 67.2 | 71.7 | 51.5 | 51.0 | 55.2 | 106.4 | 80.8 | 72.7 |
| 0.75 | Random | 2.3 ± 0.4 | 4.7 ± 3.1 | 53.7 ± 5.5 | 52.1 ± 3.5 | 41.6 ± 0.2 | 8.1 ± 4.4 | 27.1 ± 2.7 | 98.6 ± 0.2 | 62.8 ± 1.3 | 42.5 ± 3.4 |
| | Ours | 10.4 | 14.2 | 63.9 | 62.4 | 49.1 | 20.3 | 36.7 | 98.5 | 67.6 | 53.5 |
| 0.9 | Random | 0.4 ± 0.2 | 1.3 ± 0.5 | 32.5 ± 1.4 | 42.8 ± 3.6 | 38.7 ± 3.2 | 2.2 ± 0.9 | 19.7 ± 0.7 | 97.8 ± 0.1 | 58.7 ± 0.3 | 32.7 ± 0.9 |
| | Ours | 0.9 | 2.0 | 38.9 | 49.1 | 43.6 | 3.2 | 22.9 | 97.7 | 60.3 | 37.1 |

LLaVA-1.5-7B:

| Ratio (ρ) | Method | TextVQA | OKVQA | OCR | MMB | MMB (CN) | GQA | A_up | COCO-C | Avg ↑ | H-score ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.0 | Full FT | 0.1 | 0.0 | 2.5 | 8.6 | 15.5 | 0.0 | 4.5 | 101.5 | 53.0 | 8.5 |
| 0.1 | Random | 56.5 ± 0.1 | 53.5 ± 0.2 | 64.7 ± 0.7 | 62.7 ± 0.2 | 55.5 ± 0.2 | 60.1 ± 0.4 | 58.8 ± 0.3 | 120.8 ± 0.3 | 89.8 ± 0.0 | 79.1 ± 0.2 |
| | Ours | 56.7 | 54.1 | 64.3 | 63.1 | 55.2 | 60.4 | 59.0 | 120.8 | 89.9 | 79.3 |
| 0.25 | Random | 49.0 ± 0.2 | 47.2 ± 0.0 | 63.6 ± 0.5 | 61.5 ± 0.1 | 56.5 ± 0.0 | 57.8 ± 0.5 | 55.9 ± 0.2 | 105.2 ± 0.5 | 80.6 ± 0.3 | 73.0 ± 0.3 |
| | Ours | 52.3 | 49.7 | 63.7 | 62.8 | 56.4 | 58.7 | 57.3 | 104.5 | 80.9 | 74.0 |
| 0.5 | Random | 11.9 ± 2.8 | 18.8 ± 2.1 | 32.9 ± 9.5 | 52.1 ± 1.4 | 53.5 ± 0.3 | 29.4 ± 3.4 | 33.1 ± 3.0 | 100.5 ± 0.1 | 66.8 ± 1.5 | 49.8 ± 3.5 |
| | Ours | 20.0 | 26.2 | 36.6 | 54.4 | 56.4 | 41.9 | 39.2 | 100.8 | 70.0 | 56.5 |
| 0.75 | Random | 0.1 ± 0.0 | 1.2 ± 0.4 | 4.2 ± 0.6 | 22.4 ± 6.9 | 37.8 ± 7.9 | 0.5 ± 0.3 | 11.0 ± 2.5 | 100.5 ± 0.3 | 55.8 ± 1.3 | 19.9 ± 4.2 |
| | Ours | 0.2 | 2.2 | 6.4 | 27.2 | 43.8 | 0.9 | 13.4 | 100.9 | 57.2 | 23.7 |
| 0.9 | Random | 0.1 ± 0.0 | 0.2 ± 0.1 | 3.0 ± 0.2 | 11.4 ± 3.1 | 23.3 ± 6.0 | 0.0 ± 0.0 | 6.3 ± 1.5 | 100.5 ± 0.7 | 53.4 ± 0.7 | 11.9 ± 2.7 |
| | Ours | 0.1 | 0.4 | 3.2 | 12.5 | 24.0 | 0.0 | 6.7 | 101.3 | 54.0 | 12.5 |
| R | N | Random Baseline (Hamming ↓) | Model-Dowser, Ours (Hamming ↓) | p-value | Random Baseline (Spearman ↑) | Model-Dowser, Ours (Spearman ↑) | p-value |
|---|---|---|---|---|---|---|---|
| 2 | 4 | 0.226 ± 0.000 | 0.064 ± 0.004 | 1.1 × 10⁻⁷ | 0.000 ± 0.000 | 0.861 ± 0.008 | 2.9 × 10⁻⁹ |
| 2 | 16 | 0.226 ± 0.000 | 0.056 ± 0.003 | 2.2 × 10⁻⁸ | 0.000 ± 0.000 | 0.879 ± 0.004 | 2.6 × 10⁻¹⁰ |
| 2 | 64 | 0.226 ± 0.000 | 0.052 ± 0.002 | 8.6 × 10⁻⁹ | 0.000 ± 0.000 | 0.886 ± 0.004 | 1.1 × 10⁻¹⁰ |
| 4 | 4 | 0.226 ± 0.000 | 0.063 ± 0.004 | 1.7 × 10⁻⁷ | 0.000 ± 0.000 | 0.862 ± 0.009 | 4.1 × 10⁻⁹ |
| 4 | 16 | 0.226 ± 0.000 | 0.056 ± 0.003 | 1.8 × 10⁻⁸ | 0.000 ± 0.000 | 0.879 ± 0.005 | 3.1 × 10⁻¹⁰ |
| 4 | 64 | 0.226 ± 0.000 | 0.053 ± 0.002 | 1.1 × 10⁻⁸ | 0.000 ± 0.000 | 0.885 ± 0.004 | 1.9 × 10⁻¹⁰ |
| 8 | 4 | 0.226 ± 0.000 | 0.058 ± 0.003 | 5.2 × 10⁻⁸ | 0.000 ± 0.000 | 0.875 ± 0.007 | 1.6 × 10⁻⁹ |
| 8 | 16 | 0.226 ± 0.000 | 0.052 ± 0.002 | 8.8 × 10⁻⁹ | 0.000 ± 0.000 | 0.887 ± 0.004 | 2.4 × 10⁻¹⁰ |
| 8 | 64 | 0.226 ± 0.000 | 0.050 ± 0.002 | 3.0 × 10⁻⁹ | 0.000 ± 0.000 | 0.891 ± 0.003 | 7.8 × 10⁻¹¹ |
| Method | A_up | RS | Med | AD | Sci | Fin | A_down | MFT | MFN | MAA | BWT | H-score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 63.9 | 32.3 | 35.5 | 15.6 | 42.3 | 62.5 | 37.6 | – | – | – | – | – |
| Full-FT | 40.2 | 60.0 | 50.1 | 20.6 | 51.0 | 91.6 | 54.7 | 70.2 | 54.7 | 66.2 | −15.6 | 46.3 |
| Tailor | 40.6 | 66.0 | 50.0 | 21.2 | 51.1 | 91.3 | 55.9 | 69.1 | 55.9 | 66.6 | −13.2 | 47.0 |
| MoELoRA | 61.2 | 73.2 | 51.5 | 35.0 | 49.6 | 90.9 | 60.0 | 67.9 | 60.0 | 64.8 | −7.8 | 60.6 |
| Dowser | 61.2 | 78.8 | 61.2 | 48.0 | 53.5 | 91.4 | 66.6 | 69.1 | 66.6 | 69.6 | −2.5 | 63.8 |

$$ z'^{(l)}_i = \sum_{k} \left( W^{(l)}_{ik} + \delta_{kj}\, \Delta W^{(l)}_{ij} \right) h^{(l-1)}_k = z^{(l)}_i + \Delta W^{(l)}_{ij}\, h^{(l-1)}_j, $$
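Numerically, the identity says that perturbing the single weight W^{(l)}_{ij} shifts only the i-th pre-activation, and by exactly ΔW_{ij} · h_j. A quick check in plain Python (toy numbers and variable names are illustrative):

```python
def matvec(W, h):
    """Pre-activations z_i = sum_k W_ik h_k of a single linear layer."""
    return [sum(W[i][k] * h[k] for k in range(len(h))) for i in range(len(W))]

W = [[0.5, -1.0, 2.0],
     [1.5, 0.25, -0.75]]
h = [1.0, 2.0, -1.0]

i, j, dW = 1, 2, 0.1            # perturb the single entry W[1][2]
W_pert = [row[:] for row in W]
W_pert[i][j] += dW

z, z_pert = matvec(W, h), matvec(W_pert, h)

assert abs(z_pert[i] - (z[i] + dW * h[j])) < 1e-12  # row i shifts by dW * h_j
assert z_pert[1 - i] == z[1 - i]                     # other rows are unchanged
```

This locality is what makes the output-sensitivity term of the importance score cheap to evaluate: the effect of each candidate weight on the layer output factorizes into the weight change times the corresponding input activation.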

References

[hendrycks2016gelu] Hendrycks, Dan, Gimpel, Kevin. (2016). Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415.

[shazeer2020glu] Shazeer, Noam. (2020). GLU Variants Improve Transformer. arXiv preprint arXiv:2002.05202.

[tanaka2020pruning] Tanaka, Hidenori, Kunin, Daniel, Yamins, Daniel L, Ganguli, Surya. (2020). Pruning neural networks without any data by iteratively conserving synaptic flow. Advances in neural information processing systems.

[mallya2018packnet] Mallya, Arun, Lazebnik, Svetlana. (2018). Packnet: Adding multiple tasks to a single network by iterative pruning. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition.

[sun2024a] Mingjie Sun, Zhuang Liu, Anna Bair, J Zico Kolter. (2024). A Simple and Effective Pruning Approach for Large Language Models. The Twelfth International Conference on Learning Representations.

[Hutchinson] M.F. Hutchinson. (1990). A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics - Simulation and Computation. doi:10.1080/03610919008812866.

[ELFWING20183] Stefan Elfwing, Eiji Uchibe, Kenji Doya. (2018). Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks. doi:10.1016/j.neunet.2017.12.012.

[sun2024massive] Mingjie Sun, Xinlei Chen, J Zico Kolter, Zhuang Liu. (2024). Massive Activations in Large Language Models. First Conference on Language Modeling.

[hassibi1992second] Hassibi, Babak, Stork, David. (1992). Second order derivatives for network pruning: Optimal brain surgeon. Advances in neural information processing systems.

[zhai2023investigating] Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma. (2023). Investigating the Catastrophic Forgetting in Multimodal Large Language Models. NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.

[zhu2024model] Zhu, Didi, Sun, Zhongyisun, Li, Zexi, Shen, Tao, Yan, Ke, Ding, Shouhong, Wu, Chao, Kuang, Kun. (2024). Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models. Proceedings of the 41st International Conference on Machine Learning.

[yu2024language] Yu, Le, Yu, Bowen, Yu, Haiyang, Huang, Fei, Li, Yongbin. (2024). Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch. International Conference on Machine Learning.

[huang2025learn] Wenke Huang, Jian Liang, Zekun Shi, Didi Zhu, Guancheng Wan, He Li, Bo Du, Dacheng Tao, Mang Ye. (2025). Learn from Downstream and Be Yourself in Multimodal Large Language Models Fine-Tuning. Forty-second International Conference on Machine Learning.

[huang2024emr] Huang, Chenyu, Ye, Peng, Chen, Tao, He, Tong, Yue, Xiangyu, Ouyang, Wanli. (2024). Emr-merging: Tuning-free high-performance model merging. Advances in Neural Information Processing Systems.

[panigrahi2023task] Panigrahi, Abhishek, Saunshi, Nikunj, Zhao, Haoyu, Arora, Sanjeev. (2023). Task-specific skill localization in fine-tuned language models. International Conference on Machine Learning.

[khintchine1923] Khintchine, Aleksandr. (1923). Über dyadische Brüche. Mathematische Zeitschrift.

[luo2025Empirical] Luo, Yun, Yang, Zhen, Meng, Fandong, Li, Yafu, Zhou, Jie, Zhang, Yue. (2025). An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning. IEEE/ACM Transactions on Audio, Speech, and Language Processing. doi:10.1109/TASLPRO.2025.3606231.

[haagerup1981best] Haagerup, Uffe. (1981). The best constants in the Khintchine inequality. Studia Mathematica.

[hui2025hft] Hui, Tingfeng, Zhang, Zhenyu, Wang, Shuohuan, Xu, Weiran, Sun, Yu, Wu, Hua. (2025). Hft: Half fine-tuning for large language models. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[liu2023visual] Liu, Haotian, Li, Chunyuan, Wu, Qingyang, Lee, Yong Jae. (2023). Visual instruction tuning. Advances in neural information processing systems.

[liu2025nvila] Liu, Zhijian, Zhu, Ligeng, Shi, Baifeng, Zhang, Zhuoyang, Lou, Yuming, Yang, Shang, Xi, Haocheng, Cao, Shiyi, Gu, Yuxian, Li, Dacheng, others. (2025). Nvila: Efficient frontier visual language models. Proceedings of the Computer Vision and Pattern Recognition Conference.

[wang2024qwen2] Wang, Peng, Bai, Shuai, Tan, Sinan, Wang, Shijie, Fan, Zhihao, Bai, Jinze, Chen, Keqin, Liu, Xuejing, Wang, Jialin, Ge, Wenbin, others. (2024). Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191.

[goodfellow2013empirical] Goodfellow, Ian J, Mirza, Mehdi, Xiao, Da, Courville, Aaron, Bengio, Yoshua. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211.

[masana2022class] Masana, Marc, Liu, Xialei, Twardowski, Bartłomiej, others. (2022). Class-incremental learning: survey and performance evaluation on image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[han2024parameter] Han, Zeyu, Gao, Chao, Liu, Jinyang, Zhang, Jeff, Zhang, Sai Qian. (2024). Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608.

[zhou-etal-2024-empirical] Zhou, Xiongtao, He, Jie, Ke, Yuhua, Zhu, Guangyao, Gutierrez Basulto, Victor, Pan, Jeff. (2024). An Empirical Study on Parameter-Efficient Fine-Tuning for Multimodal Large Language Models. Findings of the Association for Computational Linguistics: ACL 2024. doi:10.18653/v1/2024.findings-acl.598.

[zhai2024investigating] Zhai, Yuexiang, Tong, Shengbang, Li, Xiao, Cai, Mu, Qu, Qing, Lee, Yong Jae, Ma, Yi. (2024). Investigating the catastrophic forgetting in multimodal large language model fine-tuning. Conference on Parsimony and Learning.

[chen2024expanding] Chen, Zhe, Wang, Weiyun, Cao, Yue, Liu, Yangzhou, Gao, Zhangwei, Cui, Erfei, Zhu, Jinguo, Ye, Shenglong, Tian, Hao, Liu, Zhaoyang, others. (2024). Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271.

[dong2021should] Dong, Xinshuai, Luu, Anh Tuan, Lin, Min, Yan, Shuicheng, Zhang, Hanwang. (2021). How should pre-trained language models be fine-tuned towards adversarial robustness?. Advances in Neural Information Processing Systems.

[korbak2022controlling] Korbak, Tomasz, Elsahar, Hady, Kruszewski, German, Dymetman, Marc. (2022). Controlling conditional language models without catastrophic forgetting. International Conference on Machine Learning.

[yang2023neural] Yang, Yibo, Yuan, Haobo, Li, Xiangtai, Lin, Zhouchen, Torr, Philip, Tao, Dacheng. (2023). Neural collapse inspired feature-classifier alignment for few-shot class incremental learning. arXiv preprint arXiv:2302.03004.

[kirkpatrick2017overcoming] Kirkpatrick, James, Pascanu, Razvan, Rabinowitz, Neil, Veness, Joel, Desjardins, Guillaume, Rusu, Andrei A, Milan, Kieran, Quan, John, Ramalho, Tiago, Grabska-Barwinska, Agnieszka, others. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences.

[zhang2024overcoming] Zhang, Wenxuan, Janson, Paul, Aljundi, Rahaf, Elhoseiny, Mohamed. (2024). Overcoming generic knowledge loss with selective parameter update. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[zhang2024gradient] Zhang, Zhi, Zhang, Qizhe, Gao, Zijun, Zhang, Renrui, Shutova, Ekaterina, Zhou, Shiji, Zhang, Shanghang. (2024). Gradient-based parameter selection for efficient fine-tuning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[chen2025mofo] Yupeng Chen, Senmiao Wang, Yushun Zhang, Zhihang Lin, Haozhe Zhang, Weijian Sun, Tian Ding, Ruoyu Sun. (2025). MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning. Transactions on Machine Learning Research.

[liu2024improved] Liu, Haotian, Li, Chunyuan, Li, Yuheng, Lee, Yong Jae. (2024). Improved baselines with visual instruction tuning. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[lin2014microsoft] Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollár, Piotr, Zitnick, C. Lawrence. (2014). Microsoft COCO: Common objects in context. European conference on computer vision.

[young-etal-2014-image] Young, Peter, Lai, Alice, Hodosh, Micah, Hockenmaier, Julia. (2014). From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics. doi:10.1162/tacl_a_00166.

[guo-etal-2025-hide] Guo, Haiyang, Zeng, Fanhu, Xiang, Ziwei, Zhu, Fei, Wang, Da-Han, Zhang, Xu-Yao, Liu, Cheng-Lin. (2025). HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2025.acl-long.666.

[hendrycks2021many] Hendrycks, Dan, Basart, Steven, Mu, Norman, Kadavath, Saurav, Wang, Frank, Dorundo, Evan, Desai, Rahul, Zhu, Tyler, Parajuli, Samyak, Guo, Mike, others. (2021). The many faces of robustness: A critical analysis of out-of-distribution generalization. Proceedings of the IEEE/CVF international conference on computer vision.

[lu2021iconqa] Lu, Pan, Qiu, Liang, Chen, Jiaqi, Xia, Tony, Zhao, Yizhou, Zhang, Wei, Yu, Zhou, Liang, Xiaodan, Zhu, Song-Chun. (2021). IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning. The 35th Conference on Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks.

[singh2019towards] Singh, Amanpreet, Natarajan, Vivek, Shah, Meet, Jiang, Yu, Chen, Xinlei, Batra, Dhruv, Parikh, Devi, Rohrbach, Marcus. (2019). Towards vqa models that can read. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[marino2019ok] Marino, Kenneth, Rastegari, Mohammad, Farhadi, Ali, Mottaghi, Roozbeh. (2019). Ok-vqa: A visual question answering benchmark requiring external knowledge. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition.

[mishra2019ocr] Mishra, Anand, Shekhar, Shashank, Singh, Ajeet Kumar, Chakraborty, Anirban. (2019). Ocr-vqa: Visual question answering by reading text in images. 2019 international conference on document analysis and recognition (ICDAR).

[hudson2019gqa] Hudson, Drew A, Manning, Christopher D. (2019). Gqa: A new dataset for real-world visual reasoning and compositional question answering. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[liu2024mmbench] Liu, Yuan, Duan, Haodong, Zhang, Yuanhan, Li, Bo, Zhang, Songyang, Zhao, Wangbo, Yuan, Yike, Wang, Jiaqi, He, Conghui, Liu, Ziwei, others. (2024). Mmbench: Is your multi-modal model an all-around player?. European conference on computer vision.

[luo2024feast] Luo, Gen, Zhou, Yiyi, Zhang, Yuxin, Zheng, Xiawu, Sun, Xiaoshuai, Ji, Rongrong. (2024). Feast your eyes: Mixture-of-resolution adaptation for multimodal large language models. arXiv preprint arXiv:2403.03003.

[vedantam2015cider] Vedantam, Ramakrishna, Lawrence Zitnick, C, Parikh, Devi. (2015). Cider: Consensus-based image description evaluation. Proceedings of the IEEE conference on computer vision and pattern recognition.

[spearman1904proof] Spearman, C. (1904). The Proof and Measurement of Association between Two Things. The American Journal of Psychology.

[Hamming1950error] Hamming, R. W. (1950). Error Detecting and Error Correcting Codes. The Bell System Technical Journal. doi:10.1002/j.1538-7305.1950.tb00463.x.

[hu2022lora] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen. (2022). LoRA: Low-Rank Adaptation of Large Language Models. International Conference on Learning Representations.

[li-liang-2021-prefix] Li, Xiang Lisa, Liang, Percy. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). doi:10.18653/v1/2021.acl-long.353.

[lester-etal-2021-power] Lester, Brian, Al-Rfou, Rami, Constant, Noah. (2021). The Power of Scale for Parameter-Efficient Prompt Tuning. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2021.emnlp-main.243.

[pmlr-v267-chen25n] Chen, Jinpeng, Cong, Runmin, Zhao, Yuzhi, Yang, Hongzheng, Hu, Guangneng, Ip, Horace, Kwong, Sam. (2025). SEFE: Superficial and Essential Forgetting Eliminator for Multimodal Continual Instruction Tuning. Proceedings of the 42nd International Conference on Machine Learning.

[chen2024coin] Chen, Cheng, Zhu, Junchen, Luo, Xu, Shen, Heng T, Song, Jingkuan, Gao, Lianli. (2024). Coin: A benchmark of continual instruction tuning for multimodel large language models. Advances in Neural Information Processing Systems.

[zhang2021tip] Zhang, Renrui, Fang, Rongyao, Zhang, Wei, Gao, Peng, Li, Kunchang, Dai, Jifeng, Qiao, Yu, Li, Hongsheng. (2021). Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930.

[guo2025federated] Guo, Haiyang, Zeng, Fanhu, Zhu, Fei, Liu, Wenzhuo, Wang, Da-Han, Xu, Jian, Zhang, Xu-Yao, Liu, Cheng-Lin. (2025). Federated continual instruction tuning. arXiv preprint arXiv:2503.12897.

[wang2024comprehensive] Wang, Zhenyi, Yang, Enneng, Shen, Li, Huang, Heng. (2024). A comprehensive survey of forgetting in deep learning beyond continual learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[sung2022lst] Sung, Yi-Lin, Cho, Jaemin, Bansal, Mohit. (2022). Lst: Ladder side-tuning for parameter and memory efficient transfer learning. Advances in Neural Information Processing Systems.

[li2022parameter] Li, Yuchao, Luo, Fuli, Tan, Chuanqi, Wang, Mengdi, Huang, Songfang, Li, Shen, Bai, Junjie. (2022). Parameter-efficient sparsity for large language models fine-tuning. arXiv preprint arXiv:2205.11005.

[pmlr-v235-lu24p] Lu, Xudong, Zhou, Aojun, Xu, Yuhui, Zhang, Renrui, Gao, Peng, Li, Hongsheng. (2024). SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models. Proceedings of the 41st International Conference on Machine Learning.

[pmlr-v235-xu24ag] Xu, Jing, Zhang, Jingzhao. (2024). Random Masking Finds Winning Tickets for Parameter Efficient Fine-tuning. Proceedings of the 41st International Conference on Machine Learning.

[zhang2025llava] Zhang, Shaolei, Fang, Qingkai, Yang, Zhe, Feng, Yang. (2025). Llava-mini: Efficient image and video large multimodal models with one vision token. arXiv preprint arXiv:2501.03895.

[chen2024image] Chen, Liang, Zhao, Haozhe, Liu, Tianyu, Bai, Shuai, Lin, Junyang, Zhou, Chang, Chang, Baobao. (2024). An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. European Conference on Computer Vision.

[zhao2025mllm] Zhao, Hongbo, Zhu, Fei, Guo, Haiyang, Wang, Meng, Wang, Rundong, Meng, Gaofeng, Zhang, Zhaoxiang. (2025). Mllm-cl: Continual learning for multimodal large language models. arXiv preprint arXiv:2506.05453.

[vaswani2017attention] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz, Polosukhin, Illia. (2017). Attention is all you need. Advances in neural information processing systems.

[tao2020few] Tao, Xiaoyu, Hong, Xiaopeng, Chang, Xinyuan, Dong, Songlin, Wei, Xing, Gong, Yihong. (2020). Few-shot class-incremental learning. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[xuhong2018explicit] Xuhong, LI, Grandvalet, Yves, Davoine, Franck. (2018). Explicit inductive bias for transfer learning with convolutional networks. International conference on machine learning.

[aljundi2018memory] Aljundi, Rahaf, Babiloni, Francesca, Elhoseiny, Mohamed, Rohrbach, Marcus, Tuytelaars, Tinne. (2018). Memory aware synapses: Learning what (not) to forget. Proceedings of the European conference on computer vision (ECCV).

[radford2021learning] Radford, Alec, Kim, Jong Wook, Hallacy, Chris, Ramesh, Aditya, Goh, Gabriel, Agarwal, Sandhini, Sastry, Girish, Askell, Amanda, Mishkin, Pamela, Clark, Jack, others. (2021). Learning transferable visual models from natural language supervision. International conference on machine learning.

[kirillov2023segment] Kirillov, Alexander, Mintun, Eric, Ravi, Nikhila, Mao, Hanzi, Rolland, Chloe, Gustafson, Laura, Xiao, Tete, Whitehead, Spencer, Berg, Alexander C, Lo, Wan-Yen, others. (2023). Segment anything. Proceedings of the IEEE/CVF international conference on computer vision.

[zhai2023sigmoid] Zhai, Xiaohua, Mustafa, Basil, Kolesnikov, Alexander, Beyer, Lucas. (2023). Sigmoid loss for language image pre-training. Proceedings of the IEEE/CVF international conference on computer vision.

[zheng2023preventing] Zheng, Zangwei, Ma, Mingyuan, Wang, Kai, Qin, Ziheng, Yue, Xiangyu, You, Yang. (2023). Preventing zero-shot transfer degradation in continual learning of vision-language models. Proceedings of the IEEE/CVF international conference on computer vision.

[xiang2023language] Xiang, Jiannan, Tao, Tianhua, Gu, Yi, Shu, Tianmin, Wang, Zirui, Yang, Zichao, Hu, Zhiting. (2023). Language models meet world models: Embodied experiences enhance language models. Advances in neural information processing systems.