
Catastrophic Forgetting in Kolmogorov-Arnold Networks


Abstract

Catastrophic forgetting is a longstanding challenge in continual learning, where models lose knowledge from earlier tasks when learning new ones. While various mitigation strategies have been proposed for Multi-Layer Perceptrons (MLPs), recent architectural advances like Kolmogorov-Arnold Networks (KANs) have been suggested to offer intrinsic resistance to forgetting by leveraging localized spline-based activations. However, the practical behavior of KANs under continual learning remains unclear, and their limitations are not well understood. To address this, we present a comprehensive study of catastrophic forgetting in KANs and develop a theoretical framework that links forgetting to activation support overlap and intrinsic data dimension. We validate these analyses through systematic experiments on synthetic and vision tasks, measuring forgetting dynamics under varying model configurations and data complexity. Further, we introduce KAN-LoRA, a novel adapter design for parameter-efficient continual fine-tuning of language models, and evaluate its effectiveness in knowledge editing tasks. Our findings reveal that while KANs exhibit promising retention in low-dimensional algorithmic settings, they remain vulnerable to forgetting in high-dimensional domains such as image classification and language modeling. These results advance the understanding of KANs’ strengths and limitations, offering practical insights for continual learning system design.


Mohammad Marufur Rahman 1 , Guanchu Wang 2 , Kaixiong Zhou 3 , Minghan Chen 1 , Fan Yang 1

1 Department of Computer Science, Wake Forest University

3 Department of Electrical and Computer Engineering, North Carolina State University

rahmm224@wfu.edu, gwang16@charlotte.edu, kzhou22@ncsu.edu, chenm@wfu.edu, yangfan@wfu.edu


Code: https://github.com/marufur-cs/AAAI26

Introduction

Catastrophic forgetting, also known as catastrophic interference (McCloskey and Cohen 1989), a fundamental challenge in machine learning, occurs when a neural network loses previously acquired information while learning from new data. This phenomenon is central to the field of continual learning, where models are trained incrementally on nonstationary data distributions (Ven, Soures, and Kudithipudi 2024; Kemker et al. 2017). Moreover, it is prevalent in a wide range of research fields such as meta-learning (Spigler 2020), domain adaptation (Xu et al. 2020), foundation models (Luo et al. 2025), and reinforcement learning (Zhang et al. 2023), where the retention of prior knowledge is critical for generalization and stability.

Copyright © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Multi-Layer Perceptrons (MLPs) are inherently prone to catastrophic forgetting (Zenke, Poole, and Ganguli 2017). Several techniques have been proposed to overcome catastrophic forgetting in MLPs (Wang et al. 2025; De Lange et al. 2022). Regularization-based techniques (Kirkpatrick et al. 2017; Kong et al. 2024) impose restrictions on the network's weight adjustments, hence reducing the likelihood of interference with previously acquired knowledge. Architecture-based methods (Yoon et al. 2018; Mirzadeh et al. 2022) mitigate forgetting by modifying the network's architecture to accommodate new information. Rehearsal-based methods (Buzzega et al. 2020; Riemer et al. 2019) aim to preserve prior information by including data samples from earlier learning sessions during the current session. Although catastrophic forgetting has been extensively studied in MLPs, it remains relatively underexplored in emerging fundamental neural architectures such as Kolmogorov-Arnold Networks (KANs) (Liu et al. 2025).

KANs, inspired by the Kolmogorov-Arnold representation theorem (Kolmogorov 1961), have emerged as a promising alternative neural network architecture to traditional MLPs. KANs were introduced to address several fundamental limitations of MLPs. Unlike MLPs, which rely on fixed activation functions, KANs utilize learnable one-dimensional activation functions (splines) along the edges of the network. Splines can be easily adjusted locally and are accurate for low-dimensional functions, giving KANs the potential to avoid forgetting. As spline bases are local, a data sample affects only a few related spline coefficients, leaving other coefficients unaltered. This unique architecture enables KANs to learn non-linear relations more effectively and to be more robust against catastrophic forgetting in continual learning scenarios (Lee et al. 2025). KANs have been successfully applied in various domains (Yang and Wang 2025; Abd Elaziz, Ahmed Fares, and Aseeri 2024), yet studies around their effectiveness in mitigating catastrophic forgetting in continual learning are still quite limited.

Only a few pioneering works have studied the catastrophic forgetting phenomenon in KANs under continual learning settings. Lee et al. recently proposed a simple and heuristic strategy, WiseKAN, which allocates distinct parameter subspaces to different tasks to mitigate catastrophic forgetting in KANs. Liu et al. demonstrated the robustness of KANs against catastrophic forgetting using synthetic data on regression tasks. Furthermore, some studies proposed modified KANs to achieve robust retention in specific domains, such as classification (Hu et al. 2025) and face forgery detection (Zhang et al. 2025) tasks. Despite these initial efforts, a comprehensive understanding of forgetting in KANs remains elusive, particularly in terms of theoretical characterization and empirical evaluation on practical real-world tasks.

To bridge the gap, we first develop a theoretical framework for understanding catastrophic forgetting in KANs by formulating several key factors such as activation support overlap and intrinsic data dimension. Our analysis reveals that forgetting in KANs scales linearly with activation support overlap and grows exponentially with the intrinsic dimensionality of the task manifold, offering a principled explanation for KANs' robustness in simple tasks and vulnerability in complex domains. Building on these insights, we then conduct extensive empirical experiments comparing KANs with MLPs across a spectrum of tasks, including low-dimensional synthetic addition and high-dimensional image classification. Furthermore, we design a novel LoRA (Hu et al. 2022) adapter based on KAN, termed KAN-LoRA, to enable continual fine-tuning of language models (LMs) for sequential knowledge editing. Across all experimental settings, our results consistently corroborate the theoretical analysis, illustrating that while KANs achieve strong retention in structured and low-dimensional tasks, they remain susceptible to forgetting in high-dimensional domains, thereby highlighting both the strengths and limitations of KANs in practical continual learning scenarios. Our main contributions are summarized as follows:

· We develop a theoretical framework for catastrophic forgetting in KANs, deriving formal retention bounds based on activation support overlap and intrinsic data dimension, and characterizing how forgetting evolves;

· We validate the theoretical analysis through empirical experiments on synthetic and image data, demonstrating strong alignment between the support overlap, task complexity, and the observed forgetting behavior;

· We introduce KAN-LoRA, a novel KAN-based adapter for continual fine-tuning of LMs, and evaluate its performance in sequential knowledge editing, highlighting both the strength and limitations of KANs in practice.

Preliminary

Catastrophic Interference

Neural networks learn the non-linear mapping between input and output spaces by finding a region in the parameter space where the network achieves expected behavior (Bishop 1994). When the neural network is trained on new data, the network's parameter space shifts accordingly to capture the mapping between the new input and output spaces. As a result, performance degrades on prior data. This phenomenon was termed catastrophic interference by McCloskey and Cohen. It has been observed in many machine learning models such as support vector machines (Ayad 2014), but is particularly pronounced in connectionist models (e.g., MLPs) due to their dense and globally updated parameterizations (French 1999). Standard neural training algorithms typically lack the capacity to progressively learn new tasks without overwriting previous knowledge (Aleixo et al. 2023), making them especially vulnerable to catastrophic interference. Such limitations have motivated continual learning studies (De Lange et al. 2022) to develop algorithms and architectures that enable models to acquire new knowledge incrementally while preserving performance on learned tasks.

Kolmogorov-Arnold Networks

KANs are inspired by the Kolmogorov-Arnold representation theorem, which states that any multivariate continuous function f(x) on a specified bounded domain can be represented by a finite sum of continuous univariate functions combined through addition (Kolmogorov 1961). Based on the theorem, f(x) can be represented as

$$
f(\mathbf{x}) = f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} \Psi_q \left( \sum_{p=1}^{n} \psi_{p,q}(x_p) \right),
$$

where n is the number of input variables, ψ_{p,q} : [0,1] → R, and Ψ_q : R → R. This equation indicates that a two-layer network with n inputs and (2n+1) univariate branches is sufficient to represent f(x) by sums of univariate functions. However, the 1-D functions ψ can be fractal and non-smooth, making them unlearnable (Girosi and Poggio 1989) in practice. KANs solve this issue by generalizing the theorem to multiple layers of arbitrary width. Formally, a KAN consisting of L layers can be written as

$$
\mathrm{KAN}(\mathbf{x}) = \left( \Phi_L \circ \Phi_{L-1} \circ \cdots \circ \Phi_1 \right)(\mathbf{x}),
$$

$$
\Phi_\ell = \left\{ \phi_{\ell,p,q} \right\}, \quad p = 1, \ldots, d_\ell, \quad q = 1, \ldots, N_\ell,
$$

where ◦ denotes function composition, Φ_ℓ is the function matrix corresponding to the ℓ-th layer, and d_ℓ and N_ℓ are the numbers of input coordinates and univariate branches, respectively. The univariate function ϕ is defined as the weighted sum of a base function and a spline function (Liu et al. 2025).
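The edge-wise spline parameterization above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it uses order-2 (piecewise-linear "hat") B-spline bases on a uniform grid and omits the base-function term, so `hat_basis`, `KANLayer`, and all hyperparameters are illustrative choices.

```python
import random

def hat_basis(z, grid):
    """Values of piecewise-linear ("hat") B-spline bases at z.
    Each basis peaks at one knot and is zero beyond its two neighbors,
    so any input activates at most two bases (local support)."""
    h = grid[1] - grid[0]
    return [max(0.0, 1.0 - abs(z - g) / h) for g in grid]

class KANLayer:
    """One KAN layer: every edge (input p -> output q) carries its own
    learnable 1-D spline, and each output sums its edge activations."""
    def __init__(self, d_in, d_out, grid_size=5, lo=-1.0, hi=1.0, seed=0):
        rng = random.Random(seed)
        step = (hi - lo) / grid_size
        self.grid = [lo + m * step for m in range(grid_size + 1)]
        # coef[q][p][m]: m-th spline coefficient of edge p -> q
        self.coef = [[[rng.uniform(-0.1, 0.1) for _ in self.grid]
                      for _ in range(d_in)] for _ in range(d_out)]

    def forward(self, x):
        return [sum(c * b
                    for p, xp in enumerate(x)
                    for c, b in zip(self.coef[q][p], hat_basis(xp, self.grid)))
                for q in range(len(self.coef))]

layer = KANLayer(d_in=3, d_out=2)
y = layer.forward([0.1, -0.5, 0.9])   # two outputs, each a sum of 1-D splines
```

The locality claimed in the text is visible here: any single input value touches at most two basis functions per edge, so a gradient step moves only those coefficients.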

Forgetting Analysis

We first introduce a formal measure of forgetting and the notation needed to analyze how KAN's local activations give rise to both perfect retention and task interference. Let f^(t) denote the KAN obtained after sequentially training on tasks 1, 2, . . . , t, and define

$$
F_i^{(t)} = \mathcal{L}\left( f^{(t)}, \mathcal{D}_i \right) - \mathcal{L}\left( f^{(i)}, \mathcal{D}_i \right),
$$

as the forgetting on task i, where L(f, D) is the expected loss under data distribution D. We index layers by ℓ ∈ {1, . . . , L}, and within each layer we number the input coordinates (pre-activations) by p ∈ {1, . . . , d_ℓ} and the individual univariate branches by q ∈ {1, . . . , N_ℓ}.
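A minimal sketch of this forgetting measure, assuming task losses are logged in a matrix whose row t holds the loss on every task after training through task t; the numbers below are made up for illustration, not results from the paper:

```python
def forgetting(loss, i, t):
    """F_i^(t) = L(f^(t), D_i) - L(f^(i), D_i): how much the loss on task i's
    data grew between finishing task i and finishing task t (tasks 1-indexed)."""
    return loss[t - 1][i - 1] - loss[i - 1][i - 1]

# loss[t][i]: loss on task i+1 after training through task t+1 (made-up numbers)
loss = [[0.02, 0.90, 0.95],
        [0.05, 0.03, 0.91],
        [0.35, 0.10, 0.04]]
f13 = forgetting(loss, i=1, t=3)   # task 1's loss rose from 0.02 to 0.35
```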

To capture where each unit actually 'turns on', we define the activation support of branch ϕ_{ℓ,p,q} for task i as

$$
S_{\ell,p,q}^{(i)} = \left\{ z \in \mathbb{R} \,:\, \phi_{\ell,p,q}(z) \neq 0 \right\},
$$

the subset of real inputs on which that branch contributes non-zero output. We measure the size of these one-dimensional sets by the Lebesgue measure µ(·) 1 . With this setup, we can represent the maximum one-dimensional overlap between any single activation for tasks i and j as

$$
\Delta_{i,j} = \max_{\ell,p,q} \, \mu\left( S_{\ell,p,q}^{(i)} \cap S_{\ell,p,q}^{(j)} \right),
$$

which will serve as the key link between KAN's architectural locality and the bounds on catastrophic forgetting. 2
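These support and overlap quantities can be estimated numerically. The sketch below approximates the Lebesgue measure of S^(i) ∩ S^(j) for a single branch by counting fine grid cells where both activations are non-zero; `support_measure` and the hat-shaped test activations are illustrative constructions, not the paper's estimator:

```python
def support_measure(phi_a, phi_b, lo=-1.0, hi=1.0, n=10_000, tol=1e-8):
    """Grid estimate of mu(S_a ∩ S_b): fraction of [lo, hi] where both
    branch functions are non-zero, scaled by the interval length."""
    step = (hi - lo) / n
    count = sum(1 for k in range(n)
                if abs(phi_a(lo + (k + 0.5) * step)) > tol
                and abs(phi_b(lo + (k + 0.5) * step)) > tol)
    return count * step

# two hat-shaped activations: supports (-0.5, 0.5) and (0.0, 1.0) overlap on (0.0, 0.5)
phi_i = lambda z: max(0.0, 1 - abs(z / 0.5))
phi_j = lambda z: max(0.0, 1 - abs((z - 0.5) / 0.5))
delta = support_measure(phi_i, phi_j)   # close to 0.5, the true overlap length
```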

Bounded Retention

We now precisely characterize when KAN achieves perfect retention and how any residual overlap translates into bounded forgetting. Overall, we demonstrate that KAN's local-support activations act as task-specific feature detectors: if their 'on' regions never coincide across tasks, earlier knowledge remains untouched, and when they do overlap, forgetting grows in direct proportion to that overlap.

Lemma 1 (Zero-Overlap Retention). Suppose for an earlier task i and every later task j > i the maximal support overlap satisfies Δ_{i,j} = 0. Then

$$
F_i^{(j)} = 0 \quad \text{for all } j > i.
$$

(Proof in Appendix A.)

When no branch ever activates on both tasks, gradient updates for new tasks cannot affect the parameters responsible for task i, guaranteeing exact retention. This lemma formalizes the intuition that truly disjoint representations cannot interfere.
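The mechanism behind Lemma 1 can be seen in a toy experiment with a local (hat) basis: fitting data confined to one region leaves coefficients supported elsewhere bit-for-bit unchanged. This is a hypothetical demonstration under simplified assumptions, not the paper's experimental setup:

```python
def hat(z, g, h):
    """Hat basis centered at knot g with half-width h."""
    return max(0.0, 1.0 - abs(z - g) / h)

H = 0.1
KNOTS = [k / 10 for k in range(11)]        # knots 0.0, 0.1, ..., 1.0
coef = [0.0] * len(KNOTS)

def fit(samples, lr=0.5, epochs=200):
    """SGD on squared error; only locally active coefficients receive gradient."""
    for _ in range(epochs):
        for x, y in samples:
            basis = [hat(x, g, H) for g in KNOTS]
            err = sum(c * b for c, b in zip(coef, basis)) - y
            for m, b in enumerate(basis):
                if b > 0.0:
                    coef[m] -= lr * err * b

task1 = [(x / 100, 1.0) for x in range(5, 30)]     # data on [0.05, 0.29]
task2 = [(x / 100, -1.0) for x in range(70, 95)]   # data on [0.70, 0.94]
fit(task1)
snapshot = list(coef)
fit(task2)   # supports are disjoint, so task 1's coefficients never move
```

Because the two tasks' activation supports are disjoint, the coefficients that encode task 1 (knots at and below 0.3) are numerically identical before and after training on task 2: zero forgetting, exactly as the lemma states.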

Theorem 1 (Retention Bound via Overlap). Under the additional assumptions that each branch ϕ_{ℓ,p,q} is L_ℓ-Lipschitz 3 and the loss is bounded by C, for any j > i we have

$$
F_i^{(j)} \;\le\; C \, \Delta_{i,j} \sum_{\ell=1}^{L} L_\ell \, d_\ell \, N_\ell .
$$

(Proof in Appendix B.)

This bound reveals that any forgetting in KANs scales linearly with the one-dimensional overlap Δ_{i,j} and the network's size parameters. Importantly, when Δ_{i,j} = 0, it collapses to F_i^{(j)} ≤ 0, recovering exact retention as a special case and showing that small overlaps incur proportionally small forgetting.

1 The Lebesgue measure generalizes length to a broader class of sets. Here, it corresponds to the total length of the activation region.

2 Detailed derivations for all theorems are in the full version.

3 ϕ is L-Lipschitz if |ϕ(z_1) − ϕ(z_2)| ≤ L|z_1 − z_2| for all z_1, z_2 ∈ R. Here, L_ℓ quantifies the spline smoothness in layer ℓ.

Cumulative Forgetting

While Theorem 1 guarantees zero or bounded forgetting on a per-task basis, real continual learning involves sequences of overlapping tasks whose supports may intersect in complex ways. To capture the deeper dynamics of forgetting in KANs, we further analyze at the branch level and consider cumulative contributions and effects.

Theorem 2 (Branch-wise Cumulative Forgetting). Under the Lipschitz and bounded-loss assumptions of Theorem 1, the forgetting on task i after training on all subsequent tasks i+1, . . . , T can be decomposed as

$$
F_i^{(T)} \;\le\; C \sum_{\ell=1}^{L} L_\ell \sum_{p=1}^{d_\ell} \sum_{q=1}^{N_\ell} \sum_{j=i+1}^{T} \mu\left( S_{\ell,p,q}^{(i)} \cap S_{\ell,p,q}^{(j)} \right).
$$

(Proof in Appendix C.)

Forgetting in KANs is driven not only by the largest single overlap but also by the total overlap each branch experiences across tasks. Branches with frequent cross-task activation contribute disproportionately to forgetting, suggesting that sparsifying or diversifying supports could mitigate interference.

Corollary 1 (Expected Forgetting under Random Supports). If each branch's support for task j is independently drawn as a length-s_j interval in [0,1], then in expectation

$$
\mathbb{E}\left[ F_i^{(T)} \right] \;\le\; C \sum_{\ell=1}^{L} L_\ell \sum_{p=1}^{d_\ell} \sum_{q=1}^{N_\ell} \sum_{j=i+1}^{T} s_i \, s_j .
$$

(Proof in Appendix D.)

Forgetting in KANs grows with the pairwise products of support sizes: a difficult task (with large s_j) can retroactively erode performance on earlier tasks, and longer task sequences amplify this effect.
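The s_i s_j scaling can be checked by simulation: for two intervals of lengths s_i and s_j placed uniformly at random in [0, 1], the mean overlap length is close to the product s_i s_j when the intervals are short. A quick Monte Carlo sketch, with a sampling scheme assumed for illustration rather than taken from the paper:

```python
import random

def expected_overlap(s_i, s_j, trials=100_000, seed=0):
    """Average overlap length of two random intervals of lengths s_i, s_j in [0, 1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = rng.uniform(0, 1 - s_i)   # left endpoints; intervals stay inside [0, 1]
        b = rng.uniform(0, 1 - s_j)
        total += max(0.0, min(a + s_i, b + s_j) - max(a, b))
    return total / trials

# for short intervals the mean overlap is close to the product s_i * s_j
est = expected_overlap(0.1, 0.2)
```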

Corollary 2 (Saturation via Union-Bound). Let

$$
U_{\ell,p,q}^{(i)} = \bigcup_{j=i+1}^{T} \left( S_{\ell,p,q}^{(i)} \cap S_{\ell,p,q}^{(j)} \right)
$$

be the union of all overlaps for branch (ℓ, p, q). Then

$$
F_i^{(T)} \;\le\; C \sum_{\ell,p,q} L_\ell \, \mu\left( U_{\ell,p,q}^{(i)} \right),
$$

$$
\mu\left( U_{\ell,p,q}^{(i)} \right) \;\le\; \mu\left( S_{\ell,p,q}^{(i)} \right).
$$

(Proof in Appendix E.)

Forgetting in KANs will saturate once a branch's full activation support is covered by overlaps. After enough highly overlapping tasks, further tasks cannot increase forgetting beyond that support size.

Complexity-Induced Forgetting

Beyond mere pairwise overlap, we further conduct theoretical analysis by examining how intrinsic task complexity drives forgetting in KANs. In particular, we show that when tasks live on data manifolds of differing intrinsic dimensions, the degree of forgetting can change dramatically. This complements our earlier results by linking forgetting directly to geometric measures of task difficulty.

Theorem 3 (Intrinsic-Dimension Forgetting Rate). Suppose task t generates data concentrated on a compact submanifold M_t ⊂ [0,1]^n of intrinsic dimension d_t, and each univariate branch's activation support can be enclosed within an r-ball in the pre-activation domain. Then, for any earlier task i and later task j, the expected support overlap satisfies

$$
\mathbb{E}\left[ \Delta_{i,j} \right] = O\!\left( r \, e^{\, d_i + d_j} \right),
$$

and hence the forgetting on task i obeys

$$
\mathbb{E}\left[ F_i^{(j)} \right] = O\!\left( C \, \bar{L} \, N_{\mathrm{tot}} \, r \, e^{\, d_i + d_j} \right),
$$

where N_tot = ∑_ℓ N_ℓ counts the total number of univariate branches and L̄ is an average Lipschitz constant.

(Proof in Appendix F.)

Tasks with higher intrinsic dimension produce exponentially smaller 'gaps' in their activation partitions, so even a modest support radius r can incur large overlaps and thus substantial forgetting. Conversely, low-dimensional tasks enjoy near-zero overlap and stable retention in continual learning.

Corollary 3 (Retention for Low-Dimensional Tasks). If every subsequent task j has intrinsic dimension d_j ≤ D, then

$$
\mathbb{E}\left[ F_i^{(T)} \right] = O\!\left( C \, \bar{L} \, N_{\mathrm{tot}} \, (T - i) \, r \, e^{\, d_i + D} \right),
$$

which becomes negligible when d_i + D is sufficiently small.


When both the original task and all new tasks inhabit low-dimensional manifolds, their activation overlaps remain exponentially small, protecting against forgetting even over long task sequences.

Corollary 4 (Fragmentation Mitigates Complexity). If each branch's support for task t is split into k_t disjoint intervals (effective radius r/k_t), then Theorem 3's rate improves to

$$
\mathbb{E}\left[ \Delta_{i,j} \right] = O\!\left( \frac{r}{k_i \, k_j} \, e^{\, d_i + d_j} \right).
$$

Figure 1: MSE loss in logarithmic scale for five different binary addition tasks during training on one's addition task.



KANs can sharply reduce forgetting on high-dimensional tasks by increasing support fragmentation, which effectively refines each branch's receptive field and trades off coarser representation granularity for higher retention fidelity.

Overall, Theorem 3 and its corollaries illuminate how KAN's forgetting depends on the deeper geometric complexity of task data and the combinatorial structure of activation supports. This perspective provides actionable guidance for designing KAN architectures and pruning strategies.

Experiments

We conduct a series of experiments to empirically validate our theoretical findings and assess KANs' forgetting behavior across diverse settings. Starting with low-dimensional synthetic tasks, we analyze retention under binary and decimal addition. We then evaluate KANs on high-dimensional image classification benchmarks and finally test KAN-LoRA for continual knowledge editing in LMs. These experiments effectively illustrate how model architecture and task complexity shape forgetting in KANs. 4

Binary and Decimal Addition

Experimental Setup We construct five synthetic tasks under a continual setting. Each task is defined by fixing one of the operands in a two-digit addition problem. Specifically, Task 1 involves one's addition, where the digit 1 is added to every digit from 1 to 9. Task 2 is two's addition, and so forth up to five's addition in Task 5. We apply this construction for both binary and decimal representations of digits. This setup enables us to systematically evaluate forgetting across increasingly overlapping arithmetic patterns.

Model Configuration Our KAN model is configured with three input nodes, two hidden neurons, and two output neurons to perform addition of two 4-bit binary numbers. At each step, the model receives two input bits (one from each number) along with a carry bit, and outputs the corresponding sum bit and the carry bit for the next step. The univariate functions in KANs are modeled using B-splines (Prautzsch, Boehm, and Paluszny 2002), where the grid size determines the number of intervals in the spline. A larger grid size provides greater flexibility, allowing the splines to capture more complex functions (Liu et al. 2025). For binary addition tasks, we use a grid size of 5. For decimal addition tasks, we evaluate KANs with grid sizes of 5, 10, 15, and 20. As a baseline, we compare the KAN's performance on binary addition to a specialized MLP architecture (Ruiz-Garcia 2022) designed to learn binary addition rules in a continual learning setting without catastrophic forgetting.

4 More details on the experiments are in the full version.

Figure 2: MSE loss during sequential training on five different decimal addition tasks, from one's to five's addition facts.
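The step-level task format described above (two input bits plus a carry in, producing a sum bit and a carry out) can be reconstructed as follows; the exact sample generation in the paper may differ, so `addition_task` is an illustrative reading of the setup:

```python
def full_adder(a, b, cin):
    """One step of binary addition: two input bits plus carry -> (sum bit, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def addition_task(fixed_operand, n_bits=4):
    """Step-level samples for one continual task: add `fixed_operand` to 1..9.
    Each sample is ((bit_a, bit_b, carry_in), (sum_bit, carry_out))."""
    samples = []
    for other in range(1, 10):
        carry = 0
        for k in range(n_bits):
            a_bit = (fixed_operand >> k) & 1
            b_bit = (other >> k) & 1
            s, carry_out = full_adder(a_bit, b_bit, carry)
            samples.append(((a_bit, b_bit, carry), (s, carry_out)))
            carry = carry_out
    return samples

tasks = [addition_task(d) for d in range(1, 6)]   # one's ... five's addition
```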

Evaluation Results. The KAN model is sequentially trained on five binary addition tasks. Figure 1 shows the Mean Squared Error (MSE) loss (log scale) for all five tasks during training on the one's addition task. Notably, even during training on the first task, the losses for subsequent tasks also decrease significantly, indicating strong positive transfer. As training progresses, the model maintains stable performance on earlier tasks, with overall forgetting remaining below 1 × 10^-6 after all five sessions, showing strong resilience to catastrophic forgetting. The KAN outperforms the specialized MLP model designed for binary addition. While the MLP requires sequential training on both one's and two's addition tasks to succeed, the KAN model generalizes effectively after learning just the one's addition task.

Similarly, for the decimal addition, the KAN is trained sequentially over five tasks. Unlike the binary setting, the model does not fully learn subsequent tasks during training on the first one. As training progresses, clear signs of catastrophic forgetting emerge. Figure 2 shows that learning a new task leads to a noticeable decline in performance on previous tasks. However, the severity of forgetting decreases as the grid size increases, suggesting that finer spline resolution improves retention. After completing all five tasks, a clear forgetting pattern appears: performance deteriorates more significantly for tasks that are farther in time from the most recent training, indicating that earlier tasks suffer more from forgetting. These observations empirically support our analysis in Corollary 1. On one hand, increasing the grid size reduces each spline's support length, thereby decreasing pairwise overlaps and mitigating the forgetting. On the other hand, later tasks, which have larger effective support sizes due to increased digit variability, lead to greater cumulative interference, consistent with the s_i s_j dependence.

Tables 1 and 2 further present empirical evidence supporting Theorems 1 and 2. In Table 1, for each pair of tasks selected from the five decimal addition tasks, the ratio F_i/Δ_{i,j} remains approximately constant, suggesting that the forgetting F_i scales linearly with the support overlap Δ_{i,j} between tasks i and j. Similarly, Table 2 shows that the ratio between the observed forgetting and the cumulative support overlap 5 ∑_{j=i+1}^{T} µ(S^{(i)} ∩ S^{(j)}) is also nearly constant, indicating a linear dependence. Additionally, this ratio becomes more stable (i.e., exhibits lower variance) as the grid size of KANs increases, revealing that finer-grained spline meshes promote more consistent forgetting behavior.

Table 1: Retention bounds across KANs and tasks.

Image Classification

Experimental Setup To evaluate KANs in real-world settings, we assess their forgetting behavior under continual learning using the CIFAR-10, Tiny-ImageNet, and MNIST datasets. CIFAR-10 consists of (32 × 32)-pixel images of 10 evenly distributed classes. To simulate a class-incremental continual-learning scenario, the dataset is divided into five sequential tasks, each containing images from two different classes. A similar five-task setup is constructed for the Tiny-ImageNet dataset by selecting 10 classes from its 200 classes of (64 × 64)-pixel images. MNIST, with (28 × 28)-pixel images, is likewise divided into five sequential tasks. These three datasets vary in intrinsic dimensionality, with MNIST the lowest and Tiny-ImageNet the highest.
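A class-incremental split of this kind reduces to partitioning the label set into consecutive class pairs; the helper below is a generic sketch (the paper's exact class ordering and selection for Tiny-ImageNet are not specified here):

```python
def class_incremental_split(labels, classes_per_task=2):
    """Partition a labeled dataset into sequential tasks of `classes_per_task`
    classes each, returning, per task, the indices of samples in its classes."""
    classes = sorted(set(labels))
    tasks = []
    for start in range(0, len(classes), classes_per_task):
        task_classes = set(classes[start:start + classes_per_task])
        tasks.append([idx for idx, y in enumerate(labels) if y in task_classes])
    return tasks

# toy stand-in for CIFAR-10 labels: 10 classes -> five two-class tasks
labels = [i % 10 for i in range(100)]
tasks = class_incremental_split(labels)
```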

Model Configuration We adopt a Transformer-based architecture for image classification, in which all MLP layers are replaced with KAN layers, resulting in the KAN-Transformer model (Yang and Wang 2025). This modification intends to utilize the adaptive capacity of KANs within the Transformer framework for continual learning scenarios. To provide a fair and competitive baseline, we also design an MLP-based transformer model augmented with the EWC (Kirkpatrick et al. 2017) regularization technique.

5 Simplified to ∑ µ_ij in Table 2's notation.

Figure 3: Task 1 accuracy after sequential training on tasks 1 and 2 from CIFAR-10, comparing (a) KAN-Transformer and (b) MLP-Transformer. Model configuration is labeled as (#classification layers - #encoder blocks - #attention heads).

Evaluation Results. Figure 3 illustrates the accuracy on task 1 after sequential training of the KAN (with grid size 10) and the MLP model on tasks 1 and 2 from CIFAR-10, evaluated across various model configurations and increasing sample sizes per task. Both architectures retain high accuracy in shallow settings with a single encoder block, attention head, and classification layer. Notably, the KAN model demonstrates superior retention, maintaining 100% accuracy on task 1 up to 8 samples per task, whereas the MLP model drops to around 80%. As the number of encoder blocks and classification layers increases, performance declines sharply, particularly in MLPs, which suggests deeper networks are more susceptible to catastrophic forgetting.

Figure 4 summarizes the impact of varying the number of samples per task during continual learning, evaluated across different task counts (ranging from 2 to 5) for both CIFAR-10 and Tiny-ImageNet. All models use a single encoder block, attention head, and classification layer. On CIFAR-10, KAN models exhibit better retention than their MLP counterparts, particularly when trained on a smaller number of tasks. In contrast, MLP models outperform KANs on the more challenging Tiny-ImageNet dataset. These results underscore the increasing difficulty of continual learning in KANs as both the number of tasks and the underlying data complexity grow. Moreover, the performance curves in Figure 4a suggest a clear saturation effect: after a certain number of highly overlapping tasks, additional training yields diminishing increases in forgetting, consistent with the bounded cumulative interference described in Corollary 2, where support unions eventually stabilize.

Figure 4: Average accuracy on previously learned tasks after training on 2 to 5 tasks with varying sample sizes from CIFAR-10 and Tiny-ImageNet datasets. Sub-figures show results for (a) KAN-Transformer and (b) MLP-Transformer.

Table 3: Forgetting rate for varied intrinsic dimensions.

Table 3 further presents empirical evidence supporting Theorem 3. Forgetting F_i is measured on task 1 after sequential training on all five tasks from MNIST, CIFAR-10, and Tiny-ImageNet. To vary the intrinsic dimension d_i, the images are quantized using different label sets (Q) and resized to different spatial resolutions (S), where d_i = log_2(Q × S). Across datasets and configurations, the ratio log(F_i)/d_i remains approximately constant, providing strong support for the exponential relationship between forgetting and task complexity as captured by intrinsic dimension. This behavior reflects a geometric constraint where increasing intrinsic dimension entangles KANs' localized supports, highlighting the need for dimensionality-aware tuning or support fragmentation (as in Corollary 4) to sustain better retention.
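The intrinsic-dimension proxy d_i = log_2(Q × S) is a direct computation; the (Q, S) configuration below is a hypothetical example, not one of the paper's reported settings:

```python
import math

def intrinsic_dimension(n_quantization_levels, n_pixels):
    """Intrinsic-dimension proxy from the text: d_i = log2(Q * S),
    where Q is the quantization level count and S the spatial resolution."""
    return math.log2(n_quantization_levels * n_pixels)

# hypothetical configuration: images quantized to 4 levels, resized to 14x14
d = intrinsic_dimension(4, 14 * 14)
```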

Knowledge Editing for LMs

Experimental Setup LMs require continual knowledge editing to replace outdated information and integrate new facts. To evaluate the forgetting behavior of KANs and MLPs in such high-dimensional editing scenarios, five consecutive tasks are curated from the CounterFact (Meng et al. 2023) and ZsRE (Levy et al. 2017) benchmarks.

Model Configuration LoRA (Hu et al. 2022) is a parameter-efficient fine-tuning technique that adapts LMs by freezing pre-trained weights and training lightweight adapters, substantially reducing memory usage and computational cost compared to full fine-tuning (Biderman et al. 2024). To explore the use of KAN as a LoRA adapter for continual fine-tuning, we design a modified adapter architecture. In the standard LoRA setup, the frozen weight matrix W_0 ∈ R^{a×b} is augmented by a trainable low-rank residual matrix ΔW ∈ R^{a×b}, factorized as ΔW = BA with B ∈ R^{a×c} and A ∈ R^{c×b}, where rank c ≪ min(a, b). The adapter's output is h = W_0 x + BAx. In our design, both A and B are parameterized using KANs. This KAN-based variant, referred to as KAN-LoRA, is integrated into the final two layers of Llama2-7B and Llama2-13B (Touvron et al. 2023). We apply EWC regularization during continual fine-tuning, using the preceding task as memory. For a fair comparison, we develop an MLP-LoRA adapter with identical EWC settings. For all KAN-LoRA experiments, we use a grid size of 5 to balance capacity and efficiency.
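The adapter's forward pass h = W_0 x + B(A(x)) can be sketched as below. The KAN projections A and B are stood in by plain callables here, so `kan_a`, `kan_b`, and the toy weights are placeholders rather than the actual KAN-LoRA implementation:

```python
class KANLoRA:
    """h = W0 x + B(A(x)): frozen base weight W0 plus a trainable low-rank
    residual whose down/up projections A, B are KAN layers in the real design."""
    def __init__(self, W0, kan_a, kan_b):
        self.W0 = W0            # frozen pre-trained weight, shape (a x b)
        self.kan_a = kan_a      # down-projection R^b -> R^c (rank c << min(a, b))
        self.kan_b = kan_b      # up-projection R^c -> R^a

    def forward(self, x):
        base = [sum(w * xi for w, xi in zip(row, x)) for row in self.W0]
        delta = self.kan_b(self.kan_a(x))   # only this path is trained
        return [h0 + d for h0, d in zip(base, delta)]

# toy stand-ins for the KAN projections (rank c = 1)
kan_a = lambda x: [x[0] + x[1]]
kan_b = lambda z: [z[0], -z[0]]
adapter = KANLoRA(W0=[[1.0, 0.0], [0.0, 1.0]], kan_a=kan_a, kan_b=kan_b)
h = adapter.forward([2.0, 3.0])
```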

(a) KAN-LoRA and MLP-LoRA adapters with rank 8.


Table 4: Mean accuracy (%) on previously edited tasks during continual fine-tuning of Llama2-7B and Llama2-13B models equipped with KAN-LoRA and MLP-LoRA adapters. Performance is reported across five consecutive tasks for each dataset.

Table 5: Comparison of trainable parameters, training, and inference time for KAN-LoRA and MLP-LoRA adapters.

Evaluation Results The modified Llama models equipped with KAN-LoRA and MLP-LoRA adapters are continually fine-tuned across multiple tasks. Tables 4a and 4b report the mean accuracy on previously edited tasks after sequential edits of varying lengths, for adapter ranks 8 and 16 respectively, highlighting the extent of forgetting during the continual fine-tuning process. Increasing the adapter rank leads to greater forgetting in both KAN and MLP variants. However, KAN adapters consistently outperform their MLP counterparts at rank 16 and in low-sample (per task) regimes. Notably, the KAN adapter shows reduced forgetting in Llama2-13B, while the MLP adapter displays the opposite trend. In small-sample settings, KAN achieves consistently higher retention in Llama2-13B compared to MLP. These results suggest that KAN adapters are more resilient to forgetting in large-scale LMs, especially at higher ranks and under limited task supervision.

Table 5 further compares the computational and parameter efficiency of KAN-LoRA and MLP-LoRA adapters, using a grid size of 5 and an adapter rank of 8. KAN introduces significantly more trainable parameters than MLP, approximately 10× more for both Llama2-7B and Llama2-13B models. Training and inference times are measured on the CounterFact dataset with 5 samples per task. While the KAN adapter incurs higher computational cost than the MLP variant, the overhead remains moderate relative to the observed gain in model capacity and retention performance.
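The rough parameter gap can be reproduced from the spline parameterization: if every KAN edge carries (grid_size + spline_order) B-spline coefficients plus one base weight, a KAN-LoRA pair has roughly that many times the parameters of an MLP-LoRA pair of the same rank. The per-edge count below is an assumption consistent with the reported ~10× gap, not a figure from the paper:

```python
def mlp_lora_params(a, b, rank):
    """Trainable parameters of a standard LoRA pair: B (a x c) plus A (c x b)."""
    return a * rank + rank * b

def kan_lora_params(a, b, rank, grid_size, spline_order=3):
    """Rough count when A and B are KAN layers: every edge assumed to carry
    (grid_size + spline_order) spline coefficients plus one base weight."""
    per_edge = grid_size + spline_order + 1
    return (a * rank + rank * b) * per_edge

a = b = 4096          # Llama2-7B hidden size
mlp = mlp_lora_params(a, b, rank=8)
kan = kan_lora_params(a, b, rank=8, grid_size=5)   # grid 5, order 3 -> 9x the edges
```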

Conclusion & Discussion

In this work, we present the first comprehensive study of catastrophic forgetting in KANs under continual learning settings. We develop a theoretical framework that connects forgetting dynamics to the architectural locality of spline activations and the intrinsic dimensionality of task data. Our analysis yields formal retention bounds and characterizes the cumulative and geometry-driven nature of forgetting in KANs. To validate these insights, we conduct systematic experiments across synthetic arithmetic tasks and real-world image classification benchmarks. Empirical results strongly corroborate our analysis, revealing a clear linear relationship between forgetting and activation overlap, and an exponential increase in forgetting as task dimensionality rises. We further introduce KAN-LoRA, a novel adapter design for continual fine-tuning of LMs in model editing tasks, and demonstrate its retention superiority compared to MLP-based alternatives. Our findings establish both the strengths and limitations of KANs for continual learning.

Stepping further, we believe this work opens up several unconventional directions for advancing KANs in continual learning. First, the evolution of spline activations across tasks suggests a dynamic view of learning, where forgetting reflects adaptation pressure on local function regions. Designing KANs with mechanisms that support controlled specialization, or even lifecycle-based pruning of splines, may enhance long-term retention. Second, our analysis of support overlap points to the possibility of distributed memory encoding. Rather than eliminating interference, future models could intentionally overlap supports to store multiple tasks in a compressed fashion that allows task-specific retrieval through decoding strategies. Third, forgetting itself may function as a beneficial inductive bias: selective decay in high-dimensional regions could suppress redundant or unstable features, reduce overfitting, and improve generalization. These visions reframe forgetting not as a flaw to be eliminated, but as a property to be shaped, positioning KANs as a flexible and interpretable foundation for memory-aware continual learning systems.


Acknowledgements

The authors thank the anonymous reviewers for their comments. This research was supported by NSF grant IIS2451480.

References

Abd Elaziz, M.; Ahmed Fares, I.; and Aseeri, A. O. 2024. CKAN: Convolutional Kolmogorov-Arnold Networks Model for Intrusion Detection in IoT Environment. IEEE Access , 12: 134837-134851.

Aleixo, E. L.; Colonna, J. G.; Cristo, M.; and Fernandes, E. 2023. Catastrophic Forgetting in Deep Learning: A Comprehensive Taxonomy.

Ayad, O. 2014. Learning under Concept Drift with Support Vector Machines. In Artificial Neural Networks and Machine Learning - ICANN 2014 , 587-594. Cham: Springer International Publishing.

Biderman, D.; Portes, J.; Ortiz, J. J. G.; Paul, M.; Greengard, P.; et al. 2024. LoRA Learns Less and Forgets Less.

Bishop, C. M. 1994. Neural networks and their applications. Review of scientific instruments , 65(6): 1803-1832.

Buzzega, P.; Boschini, M.; Porrello, A.; Abati, D.; and Calderara, S. 2020. Dark Experience for General Continual Learning: a Strong, Simple Baseline.

De Lange, M.; Aljundi, R.; Masana, M.; Parisot, S.; et al. 2022. A Continual Learning Survey: Defying Forgetting in Classification Tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence , 44(7): 3366-3385.

Girosi, F.; and Poggio, T. 1989. Representation Properties of Networks: Kolmogorov's Theorem Is Irrelevant. Neural Computation , 1(4): 465-469.

Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W.; et al. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In ICLR.

Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; Hassabis, D.; et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences , 114(13): 3521-3526.

Kolmogorov, A. N. 1961. On the Representation of Continuous Functions of Several Variables by Superpositions of Continuous Functions of a Smaller Number of Variables. American Mathematical Society .

Kong, Y.; Liu, L.; Chen, H.; Kacprzyk, J.; and Tao, D. 2024. Overcoming Catastrophic Forgetting in Continual Learning by Exploring Eigenvalues of Hessian Matrix. IEEE Transactions on Neural Networks and Learning Systems , 35(11).

Lee, A.; Gomes, H. M.; Zhang, Y.; and Kleijn, W. B. 2025. Kolmogorov-Arnold Networks Still Catastrophically Forget but Differently from MLP. Proceedings of the AAAI Conference on Artificial Intelligence , 39(17): 18053-18061.

Levy, O.; Seo, M.; Choi, E.; and Zettlemoyer, L. 2017. ZeroShot Relation Extraction via Reading Comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning , 333-342. Association for Computational Linguistics.

Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T. Y.; and Tegmark, M. 2025. KAN: Kolmogorov-Arnold Networks.

Luo, Y.; Yang, Z.; Meng, F.; Li, Y.; Zhou, J.; and Zhang, Y. 2025. An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning.

McCloskey, M.; and Cohen, N. J. 1989. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. In Psychology of Learning and Motivation , volume 24, 109-165. Academic Press.

Meng, K.; Bau, D.; Andonian, A.; and Belinkov, Y. 2023. Locating and Editing Factual Associations in GPT.

Prautzsch, H.; Boehm, W.; and Paluszny, M. 2002. Bézier and B-Spline Techniques. Springer Science & Business Media.

Ruiz-Garcia, M. 2022. Model architecture can transform catastrophic forgetting into positive transfer.

Wang, Z.; Yang, E.; Shen, L.; and Huang, H. 2025. A Comprehensive Survey of Forgetting in Deep Learning Beyond Continual Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence , 47(3): 1464-1483.

Yoon, J.; Yang, E.; Lee, J.; and Hwang, S. J. 2018. Lifelong Learning with Dynamically Expandable Networks.

Zhang, T.; Peng, S.; Gao, L.; et al. 2025. Unifying Locality of KANs and Feature Drift Compensation Projection for Data-free Replay based Continual Face Forgery Detection.

Zhang, T.; Wang, X.; Liang, B.; and Yuan, B. 2023. Catastrophic Interference in Reinforcement Learning: A Solution Based on Context Division and Knowledge Distillation. IEEE Transactions on Neural Networks and Learning Systems , 34(12): 9925-9939.

A. Lemma 1 Proof

Fix an earlier task $i$ and assume that for every later task $j > i$

$$
\Delta_{i,j} = \max_{\ell,p,q}\, \mu\left( S^{(i)}_{\ell,p,q} \cap S^{(j)}_{\ell,p,q} \right) = 0.
$$

Let

$$
\Theta^{(i)} = \left\{ \theta_{\ell,p,q} \,:\, \mu\left( S^{(i)}_{\ell,p,q} \right) > 0 \right\}
$$

denote the parameters of all branches that are active on task $i$.

For $(x, y) \sim \mathcal{D}_j$ with $j > i$ and any $\theta_{\ell,p,q} \in \Theta^{(i)}$, the computation path cannot traverse branch $\phi_{\ell,p,q}$ and thus

$$
\nabla_{\theta_{\ell,p,q}} \mathcal{L}\left( f(x), y \right) = 0,
$$

where $\mathcal{L}(f(x), y)$ denotes the per-example loss. With update rule $\theta^{(t+1)} = \theta^{(t)} - \eta_t g^{(t)}$ and $g^{(t)} = 0$ on $\Theta^{(i)}$, we have

$$
\theta^{(T)}_{\ell,p,q} = \theta^{(i)}_{\ell,p,q} \quad \text{for every } \theta_{\ell,p,q} \in \Theta^{(i)}.
$$

For $x \sim \mathcal{D}_i$, all branches in $\Theta^{(i)}$ are frozen, while the rest contribute the same value as before. Thus, $f^{(T)}(x) = f^{(i)}(x)$. Taking expectation over $\mathcal{D}_i$ yields $L(f^{(T)}, \mathcal{D}_i) = L(f^{(i)}, \mathcal{D}_i)$, and therefore

$$
F_i(T) = L\left( f^{(T)}, \mathcal{D}_i \right) - L\left( f^{(i)}, \mathcal{D}_i \right) = 0.
$$

B. Theorem 1 Proof

Fix an earlier task $i$ and a later task $j > i$. During optimization on $\mathcal{D}_j$, the parameters of a branch $\phi_{\ell,p,q}$ are updated only if $S^{(i)}_{\ell,p,q} \cap S^{(j)}_{\ell,p,q} \neq \emptyset$. Set

$$
I_{\ell,p,q} := S^{(i)}_{\ell,p,q} \cap S^{(j)}_{\ell,p,q}, \qquad \gamma_{\ell,p,q} := \mu\left( I_{\ell,p,q} \right) \le \Delta_{i,j}.
$$

Pointwise change of one branch. Because both the pre- and post-update splines are $L_\ell$-Lipschitz and coincide at the endpoints of $I_{\ell,p,q}$, for every $z \in I_{\ell,p,q}$ we have

$$
\left| \phi^{(j)}_{\ell,p,q}(z) - \phi^{(i)}_{\ell,p,q}(z) \right| \le L_\ell\, \gamma_{\ell,p,q} \le L_\ell\, \Delta_{i,j}. \tag{B-1}
$$

They coincide outside $I_{\ell,p,q}$, so (B-1) bounds their difference on all of $\mathbb{R}$.

Layer-wise change. At most $N_\ell$ branches in layer $\ell$ overlap with task $j$, whence for any input $x \sim \mathcal{D}_i$ we further have

$$
\left\| \Phi^{(j)}_{\ell}(x) - \Phi^{(i)}_{\ell}(x) \right\|_{1} \le N_\ell L_\ell\, \Delta_{i,j}. \tag{B-2}
$$

Network-level change. Subsequent layers are unchanged, so propagating (B-2) forward through the network yields

$$
\left| f^{(j)}(x) - f^{(i)}(x) \right| \le \Delta_{i,j} \sum_{\ell=1}^{L} N_\ell L_\ell. \tag{B-3}
$$

Effect on the expected loss. Because the loss is $C$-Lipschitz in its first argument,

$$
\left| \mathcal{L}\left( f^{(j)}(x), y \right) - \mathcal{L}\left( f^{(i)}(x), y \right) \right| \le C \left| f^{(j)}(x) - f^{(i)}(x) \right|.
$$

Combining with (B-3) and taking expectation over $(x, y) \sim \mathcal{D}_i$ gives

$$
F_i \le C\, \Delta_{i,j} \sum_{\ell=1}^{L} N_\ell L_\ell,
$$

establishing the claimed bound.

C. Theorem 2 Proof

Define the incremental loss increase for each later task $t = i+1, \dots, T$ by

$$
\delta_t := L\left( f^{(t)}, \mathcal{D}_i \right) - L\left( f^{(t-1)}, \mathcal{D}_i \right).
$$

Because these increments telescope, we can derive

$$
F_i(T) = L\left( f^{(T)}, \mathcal{D}_i \right) - L\left( f^{(i)}, \mathcal{D}_i \right) = \sum_{t=i+1}^{T} \delta_t. \tag{C-1}
$$

Bounding a single increment. Fix $t > i$. A branch $\phi_{\ell,p,q}$ is updated only when $S^{(i)}_{\ell,p,q} \cap S^{(t)}_{\ell,p,q} \neq \emptyset$. Let

$$
\gamma^{(t)}_{\ell,p,q} := \mu\left( S^{(i)}_{\ell,p,q} \cap S^{(t)}_{\ell,p,q} \right) \le \Delta_{i,t}.
$$

From the earlier derivation in Appendix B, updating an $L_\ell$-Lipschitz spline on an interval of length $\gamma^{(t)}_{\ell,p,q}$ perturbs its value by at most $L_\ell\, \gamma^{(t)}_{\ell,p,q}$:

$$
\left| \phi^{(t)}_{\ell,p,q}(z) - \phi^{(t-1)}_{\ell,p,q}(z) \right| \le L_\ell\, \gamma^{(t)}_{\ell,p,q}. \tag{C-2}
$$

Summing (C-2) over the $N_\ell$ branches of layer $\ell$ yields

$$
\left\| \Phi^{(t)}_{\ell}(x) - \Phi^{(t-1)}_{\ell}(x) \right\|_{1} \le N_\ell L_\ell\, \Delta_{i,t}. \tag{C-3}
$$

Propagating (C-3) through the network and applying the $C$-Lipschitz loss, exactly as in Appendix B, bounds the single increment:

$$
\delta_t \le C\, \Delta_{i,t} \sum_{\ell=1}^{L} N_\ell L_\ell. \tag{C-4}
$$

Accumulating all later tasks. Insert (C-4) into the telescope (C-1) and interchange sums:

$$
F_i(T) \le C \sum_{t=i+1}^{T} \Delta_{i,t} \sum_{\ell=1}^{L} N_\ell L_\ell,
$$

which is exactly the claimed cumulative forgetting bound.

D. Corollary 1 Proof

Consider a 1-dimensional torus $\mathbb{T}^1 = [0, 1)$ with wrap-around arithmetic, where every branch support is an interval of fixed length whose starting point is chosen uniformly in $[0, 1)$. From Theorem 2, taking expectation over the independent draws of the supports and using linearity, we have

$$
\mathbb{E}\left[ F_i(T) \right] \le C \sum_{j=i+1}^{T} \sum_{\ell=1}^{L} N_\ell L_\ell\; \mathbb{E}\left[ \mu\left( S^{(i)}_{\ell,p,q} \cap S^{(j)}_{\ell,p,q} \right) \right]. \tag{D-1}
$$

Fix a branch $(\ell, p, q)$. For each task index $k$, the support $S^{(k)}_{\ell,p,q} \subset \mathbb{T}^1$ is an interval $[U_k, U_k + s_k)$ where $U_k \sim \mathrm{Uniform}[0, 1)$ and the $U_k$ are independent. For a fixed point $z \in \mathbb{T}^1$,

$$
\Pr\left[ z \in S^{(k)}_{\ell,p,q} \right] = s_k.
$$

Because the two supports for tasks $i$ and $j$ are independent,

$$
\Pr\left[ z \in S^{(i)}_{\ell,p,q} \cap S^{(j)}_{\ell,p,q} \right] = s_i\, s_j.
$$

Using Fubini's theorem,

$$
\mathbb{E}\left[ \mu\left( S^{(i)}_{\ell,p,q} \cap S^{(j)}_{\ell,p,q} \right) \right] = \int_{\mathbb{T}^1} \Pr\left[ z \in S^{(i)}_{\ell,p,q} \cap S^{(j)}_{\ell,p,q} \right] dz = s_i\, s_j. \tag{D-2}
$$

Next, substitute (D-2) into (D-1) and simplify. Since the expectation in (D-2) is the same for every branch, we have

$$
\mathbb{E}\left[ F_i(T) \right] \le C\, s_i \sum_{j=i+1}^{T} s_j \sum_{\ell=1}^{L} N_\ell L_\ell,
$$

which is the bound claimed in Corollary 1.
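The identity $\mathbb{E}[\mu(S^{(i)} \cap S^{(j)})] = s_i s_j$ derived above is easy to check numerically. A small Monte Carlo sketch (ours, not from the paper) on a discretized unit torus:

```python
import numpy as np

def torus_overlap(u1, s1, u2, s2, grid=4096):
    """Length of the intersection of arcs [u1, u1+s1) and [u2, u2+s2) on
    the unit torus, estimated on a fine grid (wrap-around via modulo)."""
    z = (np.arange(grid) + 0.5) / grid
    in1 = ((z - u1) % 1.0) < s1
    in2 = ((z - u2) % 1.0) < s2
    return float(np.mean(in1 & in2))   # fraction of the circle in both arcs

rng = np.random.default_rng(0)
s_i, s_j = 0.3, 0.2
mean_overlap = np.mean([
    torus_overlap(rng.random(), s_i, rng.random(), s_j)
    for _ in range(2000)
])
print(mean_overlap)   # close to s_i * s_j = 0.06
```

With $s_i = 0.3$ and $s_j = 0.2$ the empirical mean settles near $0.06$, matching the product of the support lengths.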

E. Corollary 2 Proof

From Theorem 2, we replace the inner sum by the union measure $U^{(i)}_{\ell,p,q} := \bigcup_{j=i+1}^{T} \left( S^{(i)}_{\ell,p,q} \cap S^{(j)}_{\ell,p,q} \right)$. By the sub-additivity of the Lebesgue measure,

$$
\mu\left( U^{(i)}_{\ell,p,q} \right) \le \sum_{j=i+1}^{T} \mu\left( S^{(i)}_{\ell,p,q} \cap S^{(j)}_{\ell,p,q} \right) \le \sum_{j=i+1}^{T} \Delta_{i,j}. \tag{E-1}
$$

Substituting (E-1) into Theorem 2 yields the 'saturation' bound

$$
F_i(T) \le C \sum_{\ell=1}^{L} N_\ell L_\ell\; \mu\left( U^{(i)}_{\ell,p,q} \right), \tag{E-2}
$$

matching the first inequality in the statement.

We then further bound the union length. Two simple facts give an upper bound on $\mu(U^{(i)}_{\ell,p,q})$: (a) by sub-additivity as in (E-1), $\mu(U^{(i)}_{\ell,p,q}) \le \sum_{j=i+1}^{T} \Delta_{i,j}$; and (b) every overlap is a subset of $S^{(i)}_{\ell,p,q}$, so $\mu(U^{(i)}_{\ell,p,q}) \le \mu(S^{(i)}_{\ell,p,q})$.

Combining (a) and (b), we obtain

$$
\mu\left( U^{(i)}_{\ell,p,q} \right) \le \min\left( \sum_{j=i+1}^{T} \Delta_{i,j},\; \mu\left( S^{(i)}_{\ell,p,q} \right) \right), \tag{E-3}
$$

which is exactly the auxiliary bound claimed.

Inequalities (E-2) and (E-3) together complete the proof.

F. Theorem 3 Proof

Fix a radius r > 0 that bounds the diameter of every branch support. The key geometric fact we need is that a compact d t -dimensional manifold can be covered by O ( r -d t ) r -balls. We spell this out first, then translate it into an overlap probability and finally into a forgetting bound.

Covering number of $\mathcal{M}_t$. Endow $\mathcal{M}_t$ with its intrinsic geodesic metric $\mathrm{dist}_{\mathcal{M}_t}$. Choose a maximal set of points $\{x_k\} \subset \mathcal{M}_t$ such that the geodesic $(r/2)$-balls $B_t(x_k, r/2) := \{ y \in \mathcal{M}_t : \mathrm{dist}_{\mathcal{M}_t}(x_k, y) \le r/2 \}$ are pairwise disjoint. For a sufficiently small $r$, the volume of each such ball satisfies

$$
\mathrm{vol}\left( B_t(x_k, r/2) \right) \ge c_{d_t} \left( r/2 \right)^{d_t},
$$

where $c_{d_t} > 0$ depends only on dimension (it is the Euclidean unit-ball volume, up to a curvature factor bounded away from zero on the compact domain). Because the disjoint balls lie inside $\mathcal{M}_t$, we have

$$
\left| \{x_k\} \right| \cdot c_{d_t} \left( r/2 \right)^{d_t} \le \mathrm{vol}\left( \mathcal{M}_t \right).
$$

Maximality implies that the concentric $r$-balls $B_t(x_k, r)$ cover $\mathcal{M}_t$; hence the covering number obeys

$$
N_t(r) \le \frac{\mathrm{vol}\left( \mathcal{M}_t \right)}{c_{d_t} \left( r/2 \right)^{d_t}} = O\left( r^{-d_t} \right).
$$

Probability that a branch fires on task $t$. For each branch $(\ell, p, q)$, the support on task $t$ is assumed to sit inside one of the $N_t(r)$ covering balls, chosen uniformly at random. A fixed pre-activation coordinate $z$ is therefore contained in that support with probability

$$
\Pr\left[ z \in S^{(t)}_{\ell,p,q} \right] = O\left( \frac{1}{N_t(r)} \right) = O\left( r^{d_t} \right).
$$

Expected overlap for tasks $i$ and $j$. Independent sampling for the two tasks gives

$$
\mathbb{E}\left[ \mu\left( S^{(i)}_{\ell,p,q} \cap S^{(j)}_{\ell,p,q} \right) \right] = O\left( r^{\,d_i + d_j} \right). \tag{F-1}
$$

Insert the overlap into the cumulative bound. Taking expectation of the branch-wise cumulative inequality in Theorem 2 and substituting (F-1), we obtain

$$
\mathbb{E}\left[ F_i(T) \right] \le C \sum_{j=i+1}^{T} \sum_{\ell=1}^{L} N_\ell L_\ell\; O\left( r^{\,d_i + d_j} \right). \tag{F-2}
$$

With $N_{\mathrm{tot}} := \sum_\ell N_\ell$ and $\bar{L} := N_{\mathrm{tot}}^{-1} \sum_\ell N_\ell L_\ell$, equation (F-2) becomes

$$
\mathbb{E}\left[ F_i(T) \right] \le C\, \bar{L}\, N_{\mathrm{tot}} \sum_{j=i+1}^{T} O\left( r^{\,d_i + d_j} \right),
$$

which is the intrinsic-dimension forgetting rate stated.

G. Corollary 3 Proof

In Theorem 3, with $d_j \le D$ for every $j > i$, each term in the sum is at most $r^{\,d_i + D}$. Since there are $T - i \le T$ such terms, we thus have

$$
\sum_{j=i+1}^{T} r^{\,d_i + d_j} \le T\, r^{\,d_i + D}. \tag{G-1}
$$

Substituting (G-1) into Theorem 3 yields

$$
\mathbb{E}\left[ F_i(T) \right] \le C\, \bar{L}\, N_{\mathrm{tot}}\, T\, r^{\,d_i + D},
$$

proving the corollary. When $d_i + D$ is small, the factor $r^{\,d_i + D}$ keeps the bound close to a constant, slowly growing rate in $T$, explaining the robust retention observed on low-dimensional tasks.
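To make the rate in Corollary 3 concrete, a back-of-envelope evaluation of the bound (the numeric values are ours, chosen only for illustration):

```latex
% Corollary 3 rate, evaluated at support radius r = 0.1:
%   low-dimensional tasks  (d_i = 1,\; D = 1):\quad r^{d_i + D} = 10^{-2}
%   high-dimensional tasks (d_i = 5,\; D = 5):\quad r^{d_i + D} = 10^{-10}
\mathbb{E}\left[ F_i(T) \right] \;\le\; C\, \bar{L}\, N_{\mathrm{tot}}\, T\, r^{\,d_i + D}
```

Raising the joint intrinsic dimension from 2 to 10 thus shrinks the per-task bound by eight orders of magnitude at the same support radius, matching the qualitative gap between the arithmetic and vision experiments.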

H. Corollary 4 Proof

Assume that, for every task $t$, the support of each branch is partitioned into $k_t$ disjoint intervals of radius $r / k_t$. Repeating the covering argument with this reduced radius replaces each factor $r^{d_t}$ in Theorem 3 by $(r / k_t)^{d_t}$. Applying the substitution for tasks $i$ and $j$ gives

$$
\mathbb{E}\left[ \mu\left( S^{(i)}_{\ell,p,q} \cap S^{(j)}_{\ell,p,q} \right) \right] = O\left( \left( \frac{r}{k_i} \right)^{d_i} \left( \frac{r}{k_j} \right)^{d_j} \right),
$$

which matches the improved forgetting rate. Consequently, the overlap term decays by the factor $k_i^{-d_i} k_j^{-d_j}$, indicating that splitting every branch of task $t$ into $k_t$ pieces diminishes that task's overlap contribution by the factor $k_t^{-d_t}$.

I. Binary and Decimal Addition

This section provides further details of our experiments on binary and decimal addition, including the KAN model architectures and the benchmark/baseline used for evaluation.

Experimental Setup

In both the binary and decimal addition experiments, we define a sequence of five tasks. Task 1, termed one's addition, involves adding the digit 1 to each digit from 1 to 9 in both operand orders (e.g., 1 + 1 = 2 through 1 + 9 = 10, and likewise 1 + 1 = 2 through 9 + 1 = 10). The subsequent tasks, two's, three's, four's, and five's, are similarly constructed by adding 2, 3, 4, and 5, respectively, to each digit in the same range. In the binary addition experiments, all input digits are represented as 4-bit binary numbers. Tables 6 and 7 summarize the synthetic datasets used for the binary and decimal addition tasks, respectively.
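The task construction described above can be written down directly; the helper below is our illustrative sketch (function name and layout are ours), shown for the decimal case:

```python
def make_addition_task(k):
    """Task 'k's addition' as described in the text: the operand k is added
    to every digit d in 1..9, in both operand orders (k + d and d + k)."""
    pairs = [(k, d) for d in range(1, 10)] + [(d, k) for d in range(1, 10)]
    return [((a, b), a + b) for a, b in pairs]

# Tasks 1-5: one's addition through five's addition
tasks = [make_addition_task(k) for k in range(1, 6)]
print(tasks[0][:2])   # [((1, 1), 2), ((1, 2), 3)]
```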

Model Configuration

The KAN model used for the binary addition tasks takes three inputs at each time step: one bit from each of the two 4-bit binary numbers, and a carry-in bit from the previous step. The initial carry-in bit for the least significant bit addition is set to zero. These inputs are passed into the KAN, which outputs the current sum bit and a carry-out to be used in the next step. This carry-out is recurrently fed back into the model, enabling sequential processing of the bit pairs from the least to the most significant positions. In contrast, the KAN model used for decimal addition takes two decimal digits as input and produces the corresponding sum digit and carry-out. Tables 8 and 9 show the hyperparameters for the model architectures used in both synthetic tasks.

Table 6: Binary addition tasks.
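The recurrent carry mechanism described above amounts to a ripple-carry scan. A minimal sketch (ours), with a ground-truth full-adder cell standing in for the trained KAN:

```python
def ripple_add(bits_a, bits_b, cell):
    """Recurrent bitwise addition as described for the KAN model: at each
    step the cell maps (bit_a, bit_b, carry_in) -> (sum_bit, carry_out),
    scanning from least to most significant bit."""
    carry, out = 0, []
    for a, b in zip(bits_a, bits_b):        # LSB first
        s, carry = cell(a, b, carry)
        out.append(s)
    return out + [carry]                    # final carry becomes the top bit

# A ground-truth full-adder cell that the trained KAN is meant to imitate:
full_adder = lambda a, b, c: ((a + b + c) % 2, (a + b + c) // 2)

# 7 + 5 with bits given LSB-first: 0111 + 0101 = 01100
print(ripple_add([1, 1, 1, 0], [1, 0, 1, 0], full_adder))  # [0, 0, 1, 1, 0] = 12
```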

The baseline (Ruiz-Garcia 2022) used for comparison with the KAN on continual binary addition tasks is based on the hypothesis that, since addition is an algorithmic process, a suitably designed model architecture can avoid catastrophic forgetting. This work proposes an MLP network that is algorithmically aligned with the binary addition procedure, enabling it to learn the addition rules for binary numbers without forgetting. The network builds on the concept of traditional convolution layers with key modifications. It uses conditional operations based on input values, similar to if-else logic, passes carry information sequentially between steps, and handles non-binary inputs by blending outcomes in a differentiable manner. This architectural design allows the model to learn the correct binary addition rules through gradient descent while preserving previous knowledge. Table 10 presents the hyperparameters used in the baseline.

Table 10: Hyperparameters of the MLP baseline.

J. Image Classification

Additional details on the experimental setups, training procedures, and specific model configurations used in the image classification experiments are provided in this section.

Experimental Setup

In this study, we utilize three widely used image classification benchmarks of increasing complexity: MNIST, CIFAR-10, and Tiny-ImageNet. From each dataset, we generate five sequential tasks under a class-incremental continual learning setting, where each task contains two unique and mutually exclusive classes selected without repetition. Figure 5 illustrates representative samples from the five sequential tasks for MNIST, CIFAR-10, and Tiny-ImageNet. The corresponding class assignments for each task across all three datasets are summarized in Table 11.

Figure 5: Five sequential tasks constructed for the MNIST (a-e), CIFAR-10 (f-j), and Tiny-ImageNet (k-o) datasets, each task containing two distinct classes.

Model Configuration

The KAN-Transformer modifies the traditional Transformer architecture by replacing the MLP layers with KAN layers. For the image classification tasks, the KAN layers use a grid size of 10 spanning the range $[-1, 1]$, with cubic B-splines as the univariate basis functions and SiLU as the base activation function. To contrast the forgetting behavior of the KAN-Transformer, we implement an identical architecture using MLP layers with an EWC regularizer, referred to as the MLP-Transformer baseline. For EWC, the regularization coefficient $\lambda$ is set to 0.1, and the (Fisher) memory buffer holds samples from previously trained tasks. All models are trained for 10 epochs using a learning rate of $1 \times 10^{-3}$. Each experiment is independently repeated 5 times, and the average performance is reported. Image classification experiments are performed on an NVIDIA Tesla V100 GPU with 32 GB of memory. Tables 12 and 13 list the hyperparameters used for the KAN-Transformer and MLP-Transformer models on the image classification tasks, respectively.

Table 11: Sample classes in five sequential tasks generated from MNIST, CIFAR-10, and Tiny-ImageNet datasets.

Table 12: Hyperparameters used for training the KAN-Transformer model on image classification tasks.

K. Knowledge Editing for LMs

In this section, we provide a detailed description of the experimental setup, model configurations, and computing infrastructure. Additionally, we present extended experimental results and analysis for the knowledge editing tasks.


Code — https://github.com/marufur-cs/AAAI26

Catastrophic forgetting, also known as catastrophic interference (McCloskey and Cohen 1989), a fundamental challenge in machine learning, occurs when a neural network loses previously acquired information while learning from new data. This phenomenon is central to the field of continual learning, where models are trained incrementally on non-stationary data distributions (Ven, Soures, and Kudithipudi 2024; Kemker et al. 2017). Moreover, it is prevalent in a wide range of research fields such as meta-learning (Spigler 2020), domain adaptation (Xu et al. 2020), foundation models (Luo et al. 2025), and reinforcement learning (Zhang et al. 2023), where the retention of prior knowledge is critical for generalization and stability.

Multi-Layer Perceptrons (MLPs) are inherently prone to catastrophic forgetting (Zenke, Poole, and Ganguli 2017). Several techniques have been proposed to overcome catastrophic forgetting in MLPs (Wang et al. 2025; De Lange et al. 2022). Regularization-based techniques (Kirkpatrick et al. 2017; Kong et al. 2024) impose restrictions on the network’s weight adjustments, hence reducing the likelihood of interference with previously acquired knowledge. Architecture-based methods (Yoon et al. 2018; Mirzadeh et al. 2022) mitigate forgetting by modifying the network’s architecture to accommodate new information. Rehearsal-based methods (Buzzega et al. 2020; Riemer et al. 2019) aim to preserve prior information by including data samples from earlier learning sessions during the current session. Although catastrophic forgetting has been extensively studied in MLPs, it remains relatively underexplored in emerging fundamental neural architectures such as Kolmogorov-Arnold Networks (KANs) (Liu et al. 2025).
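For concreteness, the regularization-based family can be illustrated with the EWC penalty (Kirkpatrick et al. 2017), which is also the regularizer used in our experiments. The sketch below is a generic NumPy rendering, not the exact experimental code:

```python
import numpy as np

def ewc_penalty(params, params_old, fisher, lam=0.1):
    """Elastic Weight Consolidation (EWC) penalty: a quadratic term that
    anchors each parameter to its value after the previous task, weighted
    by the diagonal Fisher information, i.e. how important that parameter
    was for the earlier task."""
    return 0.5 * lam * sum(
        float(np.sum(F * (p - p0) ** 2))
        for p, p0, F in zip(params, params_old, fisher)
    )

# One layer's weights before/after drifting on a new task:
theta_old = [np.array([0.0, 2.0])]
theta_new = [np.array([1.0, 2.0])]
fisher = [np.array([2.0, 1.0])]     # first coordinate mattered more
print(ewc_penalty(theta_new, theta_old, fisher))   # 0.5 * 0.1 * 2 = 0.1
```

Adding this term to the new-task loss penalizes movement along important directions, which is the sense in which regularization-based methods "impose restrictions on the network's weight adjustments".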

KANs, inspired by the Kolmogorov-Arnold representation theorem (Kolmogorov 1961), have emerged as a promising alternative neural network architecture to traditional MLPs. KANs were introduced to address several fundamental limitations of MLPs. Unlike MLPs, which rely on fixed activation functions, KANs utilize learnable one-dimensional activation functions (splines) along the edges of the network. Splines can be easily adjusted locally and are accurate for low-dimensional functions, giving KANs the potential to avoid forgetting. Because spline bases are local, a data sample affects only a few related spline coefficients, leaving the other coefficients unaltered. This unique architecture enables KANs to learn non-linear relations more effectively and to be more robust against catastrophic forgetting in continual learning scenarios (Lee et al. 2025). KANs have been successfully applied in various domains (Yang and Wang 2025; Abd Elaziz, Ahmed Fares, and Aseeri 2024), yet studies of their effectiveness in mitigating catastrophic forgetting in continual learning remain quite limited.

Only a few pioneering works have studied the catastrophic forgetting phenomenon in KANs under continual learning settings. Lee et al. recently proposed a simple and heuristic strategy, WiseKAN, which allocates distinct parameter subspaces to different tasks to mitigate catastrophic forgetting in KANs. Liu et al. demonstrated the robustness of KANs against catastrophic forgetting using synthetic data on regression tasks. Furthermore, some studies proposed modified KANs to achieve robust retention in specific domains, such as classification (Hu et al. 2025) and face forgery detection (Zhang et al. 2025). Despite these initial efforts, a comprehensive understanding of forgetting in KANs remains elusive, particularly in terms of theoretical characterization and empirical evaluation on practical real-world tasks.

To bridge the gap, we first develop a theoretical framework for understanding catastrophic forgetting in KANs by formulating several key factors such as activation support overlap and intrinsic data dimension. Our analysis reveals that forgetting in KANs scales linearly with activation support overlap and grows exponentially with the intrinsic dimensionality of the task manifold, offering a principled explanation for KANs’ robustness in simple tasks and vulnerability in complex domains. Building on these insights, we then conduct extensive empirical experiments comparing KANs with MLPs across a spectrum of tasks, including the low-dimensional synthetic addition and the high-dimensional image classification. Furthermore, we design a novel LoRA (Hu et al. 2022) adapter based on KAN, termed KAN-LoRA, to enable continual fine-tuning of language models (LMs) for sequential knowledge editing. Across all experimental settings, our results consistently corroborate the theoretical analysis, illustrating that while KANs achieve strong retention in structured and low-dimensional tasks, they remain susceptible to forgetting in high-dimensional domains, thereby highlighting both the strengths and limitations of KANs in practical continual learning scenarios. Our main contributions are summarized as below:

We develop a theoretical framework for catastrophic forgetting in KANs, deriving formal retention bounds based on activation support overlap and intrinsic data dimension, and characterizing how forgetting evolves;

We validate the theoretical analysis through empirical experiments on synthetic and image data, demonstrating strong alignment between the support overlap, task complexity, and the observed forgetting behavior;

We introduce KAN-LoRA, a novel KAN-based adapter for continual fine-tuning of LMs, and evaluate its performance in sequential knowledge editing, highlighting both the strength and limitations of KANs in practice.

Neural networks learn the non-linear mapping between input and output spaces by finding a region in the parameter space where the network achieves expected behavior (Bishop 1994). When the neural network is trained on new data, the network’s parameter space shifts accordingly to capture the mapping between new input and output space. As a result, performance degrades on prior data. This phenomenon was termed as catastrophic interference by McCloskey and Cohen. It was observed in many machine learning models such as support vector machine (Ayad 2014), but is particularly pronounced in connectionist models (e.g., MLPs) due to their dense and globally updated parameterizations (French 1999). Standard neural training algorithms typically lack the capacity to progressively learn new tasks without overwriting previous knowledge (Aleixo et al. 2023), making them especially vulnerable to catastrophic interference. Such limitation has motivated continual learning studies (De Lange et al. 2022) to develop algorithms and architectures that enable models to acquire new knowledge incrementally while preserving performance on learned tasks.

KANs are inspired by the Kolmogorov-Arnold representation theorem, which states that a finite sum of continuous univariate functions and the binary addition operation can represent any multivariate continuous function $f(\mathbf{x})$ on a specified bounded domain (Kolmogorov 1961). Based on the theorem, the function $f(\mathbf{x})$ can be represented as

$$
f(\mathbf{x}) = \sum_{q=1}^{2n+1} \Psi_q\left( \sum_{p=1}^{n} \psi_{p,q}(x_p) \right),
$$

where $n$ is the number of input variables, $\psi_{p,q}: [0,1] \to \mathbb{R}$, and $\Psi_q: \mathbb{R} \to \mathbb{R}$. This equation indicates that a 2-layer network with $n$ inputs and $(2n+1)$ outputs is sufficient to represent $f(\mathbf{x})$ by sums of univariate functions. However, the 1-D functions $\psi$ can be fractal and non-smooth, making them unlearnable in practice (Girosi and Poggio 1989). KANs solve this issue by generalizing the theorem to multiple layers with arbitrary width. Formally, a KAN consisting of $L$ layers can be written as

$$
f(\mathbf{x}) = \left( \Phi_L \circ \Phi_{L-1} \circ \cdots \circ \Phi_1 \right)(\mathbf{x}),
$$

where $\circ$ denotes composition of the layer transformations, $\Phi_\ell$ is the function matrix corresponding to the $\ell$-th layer, and $d_\ell$ and $N_\ell$ are the numbers of input coordinates and univariate branches, respectively. Each univariate function $\phi$ is defined as a weighted sum of a base function and a spline function (Liu et al. 2025).

We first introduce a formal measure of forgetting and the notation needed to analyze how KAN's local activations give rise to both perfect retention and task interference. Let $f^{(t)}$ denote the KAN obtained after sequentially training on tasks $1, 2, \dots, t$, and define

$$
F_i(t) := L\left( f^{(t)}, \mathcal{D}_i \right) - L\left( f^{(i)}, \mathcal{D}_i \right)
$$

as the forgetting on task $i$, where $L(f, \mathcal{D})$ is the expected loss under data distribution $\mathcal{D}$. We index layers by $\ell \in \{1, \dots, L\}$, and within each layer we number the input coordinates (pre-activations) by $p \in \{1, \dots, d_\ell\}$ and the individual univariate branches by $q \in \{1, \dots, N_\ell\}$.

To capture where each unit actually "turns on", we define the activation support of branch $\phi_{\ell,p,q}$ for task $i$ as

$$
S^{(i)}_{\ell,p,q} := \left\{ z \in \mathbb{R} \,:\, z \text{ is a pre-activation of } \phi_{\ell,p,q} \text{ under } \mathcal{D}_i \text{ and } \phi_{\ell,p,q}(z) \neq 0 \right\},
$$

the subset of real inputs on which that branch contributes non-zero output. We measure the size of these one-dimensional sets by the Lebesgue measure $\mu(\cdot)$, which generalizes length to a broader class of sets; here, it corresponds to the total length of the activation region. With these setups, we can represent the maximum one-dimensional overlap between any single activation for tasks $i$ and $j$ as

$$
\Delta_{i,j} := \max_{\ell,p,q}\; \mu\left( S^{(i)}_{\ell,p,q} \cap S^{(j)}_{\ell,p,q} \right),
$$

which will serve as the key link between KAN's architectural locality and the bounds on catastrophic forgetting. (Detailed derivations for all theorems are in the full version.)

We now precisely characterize when KAN achieves perfect retention and how any residual overlap translates into bounded forgetting. Overall, we demonstrate that KAN’s local‐support activations act as task‐specific feature detectors: if their “on” regions never coincide across tasks, earlier knowledge remains untouched, and when they do overlap, forgetting grows in direct proportion to that overlap.

Suppose for an earlier task $i$ and every later task $j > i$ the maximal support overlap satisfies $\Delta_{i,j} = 0$. Then

$$
F_i(T) = 0.
$$

Under the additional assumptions that each branch $\phi_{\ell,p,q}$ is $L_\ell$-Lipschitz ($\phi$ is $L$-Lipschitz if $|\phi(z_1) - \phi(z_2)| \le L |z_1 - z_2|$ for all $z_1, z_2 \in \mathbb{R}$; here, $L_\ell$ quantifies the spline smoothness in layer $\ell$) and the loss is $C$-Lipschitz in the network output, for any $j > i$ we have

$$
F_i \le C\, \Delta_{i,j} \sum_{\ell=1}^{L} N_\ell L_\ell.
$$

While Theorem 1 guarantees zero or bounded forgetting on a per-task basis, real continual learning involves sequences of overlapping tasks whose supports may intersect in complex ways. To capture the deeper dynamics of forgetting in KANs, we further analyze at the branch level and consider cumulative contributions and effects.

Under the Lipschitz and bounded-loss assumptions of Theorem 1, the forgetting on task $i$ after training on all subsequent tasks $i+1, \dots, T$ can be bounded as

$$
F_i(T) \le C \sum_{t=i+1}^{T} \Delta_{i,t} \sum_{\ell=1}^{L} N_\ell L_\ell.
$$

If each branch's supports for task $j$ are independently drawn as length-$s_j$ intervals in $[0, 1]$, then in expectation

$$
\mathbb{E}\left[ F_i(T) \right] \le C\, s_i \sum_{j=i+1}^{T} s_j \sum_{\ell=1}^{L} N_\ell L_\ell.
$$

Let $U^{(i)}_{\ell,p,q} := \bigcup_{j=i+1}^{T} \left( S^{(i)}_{\ell,p,q} \cap S^{(j)}_{\ell,p,q} \right)$ be the union of all overlaps for branch $(\ell, p, q)$. Then

$$
F_i(T) \le C \sum_{\ell=1}^{L} N_\ell L_\ell\; \mu\left( U^{(i)}_{\ell,p,q} \right),
$$

with $\mu\left( U^{(i)}_{\ell,p,q} \right) \le \min\left( \sum_{j=i+1}^{T} \Delta_{i,j},\; \mu\left( S^{(i)}_{\ell,p,q} \right) \right)$.

Beyond mere pairwise overlap, we further conduct theoretical analysis by examining how intrinsic task complexity drives forgetting in KANs. In particular, we show that when tasks live on data manifolds of differing intrinsic dimensions, the degree of forgetting can change dramatically. This complements our earlier results by linking forgetting directly to geometric measures of task difficulty.

Suppose task $t$ generates data concentrated on a compact submanifold $\mathcal{M}_t \subset [0,1]^n$ of intrinsic dimension $d_t$, and each univariate branch's activation support can be enclosed within an $r$-ball in the pre-activation domain. Then, for any earlier task $i$ and later task $j$, the expected support overlap satisfies

$$
\mathbb{E}\left[ \mu\left( S^{(i)}_{\ell,p,q} \cap S^{(j)}_{\ell,p,q} \right) \right] = O\left( r^{\,d_i + d_j} \right),
$$

and hence the forgetting on task $i$ obeys

$$
\mathbb{E}\left[ F_i(T) \right] \le C\, \bar{L}\, N_{\mathrm{tot}} \sum_{j=i+1}^{T} O\left( r^{\,d_i + d_j} \right),
$$

where $N_{\mathrm{tot}} = \sum_\ell N_\ell$ counts the total number of univariate branches and $\bar{L}$ is an average Lipschitz constant.

If every subsequent task $j$ has intrinsic dimension $d_j \le D$, then

$$
\mathbb{E}\left[ F_i(T) \right] \le C\, \bar{L}\, N_{\mathrm{tot}}\, T\, r^{\,d_i + D},
$$

which becomes negligible when $d_i + D$ is sufficiently small.

Overall, Theorem 3 and its corollaries illuminate how KAN’s forgetting depends on the deeper geometric complexity of task data and the combinatorial structure of activation supports. This perspective provides actionable guidance for designing KAN architectures and pruning strategies.

We conduct a series of experiments to empirically validate our theoretical findings and assess KANs' forgetting behavior across diverse settings. Starting with low-dimensional synthetic tasks, we analyze retention under binary and decimal addition. We then evaluate KANs on high-dimensional image classification benchmarks and finally test KAN-LoRA for continual knowledge editing in LMs. These experiments effectively illustrate how model architecture and task complexity shape forgetting in KANs. More details on the experiments are provided in the full version.

We construct five synthetic tasks under a continual setting. Each task is defined by fixing one of the operands in a two-digit addition problem. Specifically, Task 1 involves one's addition, where the digit 1 is added to every digit from 1 to 9. Task 2 is two's addition, and so forth up to five's addition in Task 5. We apply this construction for both binary and decimal representations of digits. This setup enables us to systematically evaluate forgetting across increasingly overlapping arithmetic patterns.

Our KAN model is configured with three input nodes, two hidden neurons, and two output neurons to perform addition of two 4-bit binary numbers. At each step, the model receives two input bits (one from each number) along with a carry bit, and outputs the corresponding sum bit and the carry bit for the next step. The univariate functions in KANs are modeled using B-splines (Prautzsch, Boehm, and Paluszny 2002), where the grid size determines the number of intervals in the spline. A larger grid size provides greater flexibility, allowing the splines to capture more complex functions (Liu et al. 2025). For binary addition tasks, we use a grid size of 5. For decimal addition tasks, we evaluate KANs with grid sizes of 5, 10, 15, and 20. As a baseline, we compare the KAN’s performance on binary addition to a specialized MLP architecture (Ruiz-Garcia 2022) designed to learn binary addition rules in a continual learning setting without catastrophic forgetting.
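For concreteness, the recurrent data flow can be sketched as follows. The ground-truth full-adder mapping below is what the KAN is trained to approximate at each step; the function names are illustrative rather than taken from the paper's code.

```python
def full_adder_step(a_bit, b_bit, carry_in):
    """One recurrent step: (bit from A, bit from B, carry-in) -> (sum bit, carry-out)."""
    total = a_bit + b_bit + carry_in
    return total % 2, total // 2

def add_4bit(a, b):
    """Add two 4-bit numbers LSB-first, mirroring the recurrent carry feedback."""
    carry, bits = 0, []
    for i in range(4):
        s, carry = full_adder_step((a >> i) & 1, (b >> i) & 1, carry)
        bits.append(s)
    bits.append(carry)  # the final carry-out becomes the fifth sum bit
    return sum(bit << i for i, bit in enumerate(bits))

def make_task(k):
    """Task k ("k's addition"): pairs k + d and d + k for every digit d in 1..9."""
    return [(k, d) for d in range(1, 10)] + [(d, k) for d in range(1, 10)]
```

Feeding the carry-out back as the next step's carry-in is exactly the recurrence the three-input KAN realizes, which is why a single learned step generalizes across bit positions.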

The KAN model is sequentially trained on five binary addition tasks. Figure 1 shows the Mean Squared Error (MSE) loss (log scale) for all five tasks during training on the one's addition task. Notably, even during training on the first task, the losses for subsequent tasks also decrease significantly, indicating strong positive transfer across tasks. As training progresses, the model maintains stable performance on earlier tasks, with overall forgetting remaining below $1\times10^{-6}$ after all five sessions, showing strong resilience to catastrophic forgetting. The KAN outperforms the specialized MLP model designed for binary addition: while the MLP requires sequential training on both one's and two's addition tasks to succeed, the KAN model generalizes effectively after learning just the one's addition task.

Similarly, for the decimal addition, the KAN is trained sequentially over five tasks. Unlike the binary setting, the model does not fully learn subsequent tasks during training on the first one. As training progresses, clear signs of catastrophic forgetting emerge. Figure 2 shows that learning a new task leads to a noticeable decline in performance on previous tasks. However, the severity of forgetting decreases as the grid size increases, suggesting that finer spline resolution improves retention. After completing all five tasks, a clear forgetting pattern appears: performance deteriorates more significantly for tasks that are farther in time from the most recent training, indicating that earlier tasks suffer more from forgetting. These observations empirically support our analysis in Corollary 1. On one hand, increasing the grid size reduces each spline's support length, thereby decreasing pairwise overlaps and mitigating the forgetting. On the other hand, later tasks, which have larger effective support sizes due to increased digit variability, lead to greater cumulative interference, consistent with the $s_i s_j$ dependence.
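The forgetting numbers discussed in this section can be computed from a matrix of per-task losses recorded after each training session. The sketch below uses the common "final loss minus loss just after learning" definition; we assume this matches the paper's metric, which may differ in detail.

```python
def forgetting(loss_matrix):
    """Per-task forgetting after sequential training.

    loss_matrix[t][i] holds the loss on task i measured right after the model
    finished training session t (0-indexed, recorded for t >= i). Forgetting on
    task i is the loss increase between "just after learning task i" and
    "after the final task".
    """
    T = len(loss_matrix)
    return [loss_matrix[T - 1][i] - loss_matrix[i][i] for i in range(T)]
```

Under this definition the most recent task always has zero forgetting, and the earlier-task degradation described above shows up as larger entries for small task indices.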

Tables 1 and 2 further present empirical evidence supporting Theorems 1 and 2. In Table 1, for each pair of tasks selected from the five decimal addition tasks, the ratio $F_i/\Delta_{i,j}$ remains approximately constant, suggesting that the forgetting $F_i$ scales linearly with the support overlap $\Delta_{i,j}$ between tasks $i$ and $j$. Similarly, Table 2 shows that the ratio between the observed forgetting and the cumulative support overlap $\sum_{j=i+1}^{T}\mu(S^{(i)}\cap S^{(j)})$ (abbreviated $\sum\mu^{ij}$ in the notation of Table 2) is also nearly constant, indicating a linear dependence. Additionally, this ratio becomes more stable (i.e., exhibits lower variance) as the grid size of KANs increases, revealing that finer-grained spline meshes promote more consistent forgetting behavior.

To evaluate KANs in real-world settings, we assess their forgetting behavior in continual learning using the CIFAR-10, Tiny-ImageNet, and MNIST datasets. CIFAR-10 consists of $32\times32$-pixel images from 10 evenly distributed classes. To simulate a class-incremental continual-learning scenario, the dataset is divided into five sequential tasks, each containing images from two different classes. A similar five-task setup is constructed for Tiny-ImageNet by selecting 10 of its 200 classes of $64\times64$-pixel images. MNIST ($28\times28$ pixels) is likewise divided into five sequential tasks. These three datasets vary in intrinsic dimensionality: MNIST has the lowest and Tiny-ImageNet the highest.
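The class-incremental split described above amounts to partitioning the label set into disjoint class pairs; a minimal sketch (not the paper's data pipeline):

```python
def split_class_incremental(labels, classes_per_task=2):
    """Partition integer class labels into sequential tasks of disjoint classes.

    Returns (task_classes, task_indices): the classes assigned to each task and
    the indices of the samples belonging to each task.
    """
    classes = sorted(set(labels))
    task_classes = [classes[i:i + classes_per_task]
                    for i in range(0, len(classes), classes_per_task)]
    task_indices = [[idx for idx, y in enumerate(labels) if y in set(tc)]
                    for tc in task_classes]
    return task_classes, task_indices
```

Applied to the 10 selected classes of any of the three datasets, this yields the five two-class tasks used throughout this section.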

We adopt a Transformer-based architecture for image classification, in which all MLP layers are replaced with KAN layers, resulting in the KAN-Transformer model (Yang and Wang 2025). This modification intends to utilize the adaptive capacity of KANs within the Transformer framework for continual learning scenarios. To provide a fair and competitive baseline, we also design an MLP-based transformer model augmented with the EWC (Kirkpatrick et al. 2017) regularization technique.

Figure 3 illustrates the accuracy on task 1 after sequential training of the KAN (with grid size 10) and the MLP model on tasks 1 and 2 from CIFAR-10, evaluated across various model configurations and increasing sample sizes per task. Both architectures retain high accuracy in shallow settings with a single encoder block, attention head, and classification layer. Notably, the KAN model demonstrates superior retention, maintaining 100% accuracy on task 1 up to 8 samples per task, whereas the MLP model drops to around 80%. As the number of encoder blocks and classification layers increases, performance declines sharply, particularly in MLPs, which suggests deeper networks are more susceptible to catastrophic forgetting.

Figure 4 summarizes the impact of varying the number of samples per task during continual learning, evaluated across different task counts (ranging from 2 to 5) for both CIFAR-10 and Tiny-ImageNet. All models use one single encoder block, attention head, and classification layer. On CIFAR-10, KAN models exhibit better retention than their MLP counterparts, particularly when trained on a smaller number of tasks. In contrast, MLP models outperform KANs on the more challenging Tiny-ImageNet dataset. These results underscore the increasing difficulty of continual learning in KANs as both the number of tasks and the underlying data complexity grow. Moreover, the performance curves in Figure 4a suggest a clear saturation effect: after a certain number of highly overlapping tasks, additional training yields diminishing increases in forgetting, consistent with the bounded cumulative interference described in Corollary 2, where support unions eventually stabilize.

Table 3 further presents empirical evidence supporting Theorem 3. Forgetting $F_i$ is measured on task 1 after sequential training on all five tasks from MNIST, CIFAR-10, and Tiny-ImageNet. To vary the intrinsic dimension $d_i$, the images are quantized using different label sets ($Q$) and resized to different spatial resolutions ($S$), where $d_i=\log_2(Q\times S)$. Across datasets and configurations, the ratio $\log(F_i)/d_i$ remains approximately constant, providing strong support for the exponential relationship between forgetting and task complexity as captured by intrinsic dimension. This behavior reflects a geometric constraint where increasing intrinsic dimension entangles KANs' localized supports, highlighting the need for dimensionality-aware tuning or support fragmentation (as in Corollary 4) to sustain better retention.
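Under the stated definition $d_i=\log_2(Q\times S)$, the intrinsic dimension of each configuration is a one-line computation. Here we read $S$ as the pixel count $H\times W$, which is our assumption about the shape column of Table 3:

```python
import math

def intrinsic_dim(q_levels, height, width):
    """d_i = log2(Q * S), with S interpreted as the pixel count H * W (assumed)."""
    return math.log2(q_levels * height * width)

def implied_forgetting(ratio, q_levels, height, width):
    """Invert the reported ratio log(F_i)/d_i to recover F_i = exp(ratio * d_i)."""
    return math.exp(ratio * intrinsic_dim(q_levels, height, width))
```

A roughly constant ratio, as in Table 3, therefore means forgetting grows exponentially in $d_i$: doubling either the quantization level or the pixel count adds one unit to $d_i$ and multiplies the implied $F_i$ by a fixed factor.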

LMs require continual knowledge editing to replace outdated information and integrate new facts. To evaluate the forgetting behavior of KANs and MLPs in such high-dimensional editing scenarios, five consecutive tasks are curated from the CounterFact (Meng et al. 2023) and ZsRE (Levy et al. 2017) benchmarks.

LoRA (Hu et al. 2022) is a parameter-efficient fine-tuning technique that adapts LMs by freezing pre-trained weights and training lightweight adapters, substantially reducing memory usage and computational cost compared to full fine-tuning (Biderman et al. 2024). To explore the use of KAN as a LoRA adapter for continual fine-tuning, we design a modified adapter architecture. In the standard LoRA setup, the frozen weight matrix $W_0\in\mathbb{R}^{a\times b}$ is augmented by a trainable low-rank residual matrix $\Delta W\in\mathbb{R}^{a\times b}$, factorized as $\Delta W=BA$ with $B\in\mathbb{R}^{a\times c}$ and $A\in\mathbb{R}^{c\times b}$, where the rank $c\ll\min(a,b)$. The adapter's output is $h=W_0x+BAx$. In our design, both $A$ and $B$ are parameterized using KANs. This KAN-based variant, referred to as KAN-LoRA, is integrated into the final two layers of Llama2-7B and Llama2-13B (Touvron et al. 2023). We apply EWC regularization during continual fine-tuning, using the preceding task as memory. For a fair comparison, we develop an MLP-LoRA adapter with identical EWC settings. For all KAN-LoRA experiments, we use a grid size of 5 to balance capacity and efficiency.
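A minimal sketch of this adapter forward pass, with a piecewise-linear toy stand-in for the cubic B-spline branches (the class name `KANSplineLayer` and its parameterization are illustrative, not the paper's implementation):

```python
import numpy as np

class KANSplineLayer:
    """Toy KAN layer: one learnable univariate function per (input, output) edge.

    Uses piecewise-linear interpolation over a uniform grid on [-1, 1] as a
    stand-in for the cubic B-splines used in the paper.
    """
    def __init__(self, d_in, d_out, grid=5, seed=0):
        rng = np.random.default_rng(seed)
        self.knots = np.linspace(-1.0, 1.0, grid + 1)
        # one vector of spline values per edge (d_in x d_out edges)
        self.values = rng.normal(0.0, 0.1, size=(d_in, d_out, grid + 1))

    def __call__(self, x):
        out = np.zeros(self.values.shape[1])
        for i, xi in enumerate(x):
            for o in range(self.values.shape[1]):
                out[o] += np.interp(xi, self.knots, self.values[i, o])
        return out

def kan_lora_forward(w0, kan_a, kan_b, x):
    """h = W0 x + B(A(x)), with the low-rank linear factors replaced by KAN layers."""
    return w0 @ x + kan_b(kan_a(x))
```

For a frozen $W_0\in\mathbb{R}^{a\times b}$ and rank $c$, `kan_a` maps $b\to c$ and `kan_b` maps $c\to a$, so the residual path keeps the same bottleneck shape as standard LoRA while each "weight" becomes a learnable univariate function.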

The modified Llama models equipped with KAN-LoRA and MLP-LoRA adapters are continually fine-tuned across multiple tasks. Tables 4(a) and 4(b) report the mean accuracy on previously edited tasks after sequential edits of varying lengths, for adapter ranks 8 and 16 respectively, highlighting the extent of forgetting during the continual fine-tuning process. Increasing the adapter rank leads to greater forgetting in both KAN and MLP variants. However, KAN adapters consistently outperform their MLP counterparts at rank 16 and in low-sample (per task) regimes. Notably, the KAN adapter shows reduced forgetting in Llama2-13B, while the MLP adapter displays the opposite trend. In small-sample settings, KAN achieves consistently higher retention in Llama2-13B compared to MLP. These results suggest that KAN adapters are more resilient to forgetting in large-scale LMs, especially at higher ranks and under limited task supervision.

Table 5 further compares the computational and parameter efficiency of KAN-LoRA and MLP-LoRA adapters, using a grid size of 5 and an adapter rank of 8. KAN introduces significantly more trainable parameters than MLP, approximately $10\times$ more for both Llama2-7B and Llama2-13B. Training and inference times are measured on the CounterFact dataset with 5 samples per task. While the KAN adapter incurs higher computational cost than the MLP variant, the overhead remains moderate relative to the observed gain in model capacity and retention performance.
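The roughly $10\times$ gap is consistent with a back-of-the-envelope count: a linear LoRA pair stores $c(a+b)$ weights, while each KAN edge stores a small vector of spline coefficients. The per-edge count below (grid + spline order + 1) is an assumed parameterization for illustration; exact totals depend on the KAN implementation.

```python
def mlp_lora_params(a, b, rank):
    """Linear adapters: B is a x rank and A is rank x b."""
    return rank * (a + b)

def kan_lora_params(a, b, rank, grid=5, spline_order=3):
    """KAN adapters: assume every edge stores (grid + spline_order) spline
    coefficients plus one base weight (illustrative parameterization)."""
    per_edge = grid + spline_order + 1
    return rank * (a + b) * per_edge
```

With $a=b=4096$ and rank 8, the four adapted projections (query and value in the last two layers) give roughly $4\times65{,}536\approx0.26$M MLP parameters and about $9\times$ that for KAN, in the same ballpark as the 0.28M and 2.6M reported in Table 5.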

In this work, we present the first comprehensive study of catastrophic forgetting in KANs under continual learning settings. We develop a theoretical framework that connects forgetting dynamics to the architectural locality of spline activations and the intrinsic dimensionality of task data. Our analysis yields formal retention bounds and characterizes the cumulative and geometry-driven nature of forgetting in KANs. To validate these insights, we conduct systematic experiments across synthetic arithmetic tasks and real-world image classification benchmarks. Empirical results strongly corroborate our analysis, revealing a clear linear relationship between forgetting and activation overlap, and an exponential increase in forgetting as task dimensionality rises. We further introduce KAN-LoRA, a novel adapter design for continual fine-tuning of LMs in model editing tasks, and demonstrate its retention superiority compared to MLP-based alternatives. Our findings establish both the strengths and limitations of KANs for continual learning.

Stepping further, we believe this work opens up several unconventional directions for advancing KANs in continual learning. First, the evolution of spline activations across tasks suggests a new dynamic view of learning, where forgetting reflects adaptation pressures on local function regions. Designing KANs with mechanisms that support controlled specialization or even lifecycle-based pruning of splines may enhance long-term retention. Second, our analysis of support overlap points to the possibility of distributed memory encoding. Rather than eliminating interference, future models could intentionally overlap supports to store multiple tasks in a compressed fashion that allows task-specific retrieval through decoding strategies. Third, forgetting itself may also function as a beneficial inductive bias. Selective decay in high-dimensional regions could suppress redundant or unstable features, reduce overfitting, and improve generalization. These visions can largely reframe forgetting not as a flaw to be eliminated, but as a property to be shaped, positioning KANs as a flexible and interpretable foundation for memory-aware continual learning systems.

The authors thank the anonymous reviewers for their comments. This research was supported by NSF grant IIS-2451480.

For $(x,y)\sim\mathcal{D}_j$ with $j>i$ and any $\theta_{\ell,p,q}\in\Theta^{(i)}$, the computation path cannot traverse branch $\phi_{\ell,p,q}$, and thus

where $\mathcal{L}\bigl(f(x),y\bigr)$ denotes the per-example loss.

With update rule $\theta^{(t+1)}=\theta^{(t)}-\eta_t g^{(t)}$ and $g^{(t)}=0$ on $\Theta^{(i)}$, we have

For $x\sim\mathcal{D}_i$, all branches in $\Theta^{(i)}$ are frozen, while the rest contribute the same values as before. Thus, $f^{(T)}(x)=f^{(i)}(x)$. Taking expectation over $\mathcal{D}_i$ yields

and therefore

Fix an earlier task $i$ and a later task $j>i$. During optimization on $\mathcal{D}_j$, the parameters of a branch $\phi_{\ell,p,q}$ are updated only if $S^{(i)}_{\ell,p,q}\cap S^{(j)}_{\ell,p,q}\neq\varnothing$. Set

Pointwise change of one branch. Because both the pre- and post-update splines are $L_\ell$-Lipschitz, for every $z\in I_{\ell,p,q}$ we have

They coincide outside $I_{\ell,p,q}$, so (B-1) bounds their difference on all of $\mathbb{R}$.

Layer-wise change. At most $N_\ell$ branches in layer $\ell$ overlap with task $j$; hence, for any input $x\sim\mathcal{D}_i$, we further have

Network-level change. Subsequent layers are unchanged, so propagating (B-2) forward yields

Effect on the expected loss. Because the loss is $C$-Lipschitz in its first argument,

Combining with (B-3) and taking expectation over $(x,y)\sim\mathcal{D}_i$ gives

establishing the claimed bound.

Define the incremental loss increase for each later task $t=i+1,\dots,T$ by

Because these increments telescope, we can derive

From the earlier derivation in Appendix B, updating an $L_\ell$-Lipschitz spline on an interval of length $\gamma^{(t)}_{\ell,p,q}$ perturbs its value by at most $L_\ell\gamma^{(t)}_{\ell,p,q}$:

Summing (C-2) over the $N_\ell$ branches of layer $\ell$ yields

Accumulating all later tasks. Insert (C-4) into the telescope (C-1) and interchange sums:

Consider a 1-dimensional torus $\mathbb{T}^1=[0,1)$ with wrap-around arithmetic, where every branch support is an interval of fixed length whose starting point is chosen uniformly in $[0,1)$. From Theorem 2, taking expectation over the independent draws of the supports and using linearity, we have

Fix a branch $(\ell,p,q)$. For each task index $k$, the support $S^{(k)}_{\ell,p,q}\subset\mathbb{T}^1$ is an interval $[U_k,\,U_k+s_k)$, where $U_k\sim\operatorname{Uniform}[0,1)$ and the $U_k$ are mutually independent. For a fixed point $z\in\mathbb{T}^1$,

Because the two supports for tasks $i$ and $j$ are independent,

Using Fubini’s theorem,

Next, substitute (D-2) into (D-1) and simplify. Since the expectation in (D-2) is the same for every branch, we have
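As an illustrative sanity check (not from the paper), a quick Monte Carlo estimate confirms the computation above: two independently, uniformly placed wrap-around intervals of lengths $s_i$ and $s_j$ overlap on a set of expected measure $s_is_j$.

```python
import random

def torus_overlap(u1, s1, u2, s2, grid=1000):
    """Overlap measure of [u1, u1+s1) and [u2, u2+s2) on the unit torus,
    approximated by counting grid points lying in both wrap-around intervals."""
    def contains(u, s, z):
        return (z - u) % 1.0 < s
    hits = sum(1 for k in range(grid)
               if contains(u1, s1, k / grid) and contains(u2, s2, k / grid))
    return hits / grid

def expected_overlap(s1, s2, trials=2000, seed=0):
    """Monte Carlo estimate of E[mu(S_i ∩ S_j)]; should approach s1 * s2."""
    rng = random.Random(seed)
    return sum(torus_overlap(rng.random(), s1, rng.random(), s2)
               for _ in range(trials)) / trials
```

For instance, intervals of lengths 0.3 and 0.2 overlap on an expected measure of 0.06, matching the product form derived via Fubini's theorem.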

From Theorem 2, we replace the inner sum by the union measure $\mu(U^{(i)}_{\ell,p,q})$. By the sub-additivity of the Lebesgue measure, we have

Substituting (E-1) into Theorem 2 yields the “saturation” bound as

matching the first inequality in the statement.

Each individual overlap length obeys $\mu\bigl(S^{(i)}_{\ell,p,q}\cap S^{(j)}_{\ell,p,q}\bigr)\le\Delta_{i,j}$; summing over $j>i$ gives $\mu(U^{(i)}_{\ell,p,q})\le\sum_{j=i+1}^{T}\Delta_{i,j}$.

Combining (a) and (b), we obtain

which is exactly the auxiliary bound claimed.

Inequalities (E-2) and (E-3) together complete the proof.

Fix a radius $r>0$ that bounds the diameter of every branch support. The key geometric fact we need is that a compact $d_t$-dimensional manifold can be covered by $O(r^{-d_t})$ $r$-balls. We spell this out first, then translate it into an overlap probability and finally into a forgetting bound.

Covering number of $\mathcal{M}_t$. Endow $\mathcal{M}_t$ with its intrinsic geodesic metric $\operatorname{dist}_{\mathcal{M}_t}$. Choose a maximal set of points $\{x_k\}\subset\mathcal{M}_t$ such that the geodesic $(r/2)$-balls $\mathcal{B}_t(x_k,r/2):=\{y\in\mathcal{M}_t:\operatorname{dist}_{\mathcal{M}_t}(x_k,y)\le r/2\}$ are pairwise disjoint. For sufficiently small $r$, the volume of each such ball satisfies

where $c_{d_t}>0$ depends only on the dimension (it is the Euclidean unit-ball volume, up to a curvature factor bounded away from zero on the compact domain). Because the disjoint balls lie inside $\mathcal{M}_t$, we have

Maximality implies that the concentric $r$-balls $\mathcal{B}_t(x_k,r)$ cover $\mathcal{M}_t$; hence the covering number obeys

Probability that a branch fires on task $t$. For each branch $(\ell,p,q)$, the support on task $t$ is assumed to sit inside one of the $N_t(r)$ covering balls, chosen uniformly at random. A fixed pre-activation coordinate $z$ is therefore contained in that support with probability

Insert the overlap into the cumulative bound. Taking expectation on the branch-wise cumulative inequality in Theorem 2 and substituting (F-1), we obtain

With $N_{\mathrm{tot}}:=\sum_\ell N_\ell$ and $\bar{L}:=N_{\mathrm{tot}}^{-1}\sum_\ell N_\ell L_\ell$, equation (F-2) becomes

which is the intrinsic-dimension forgetting rate stated.
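Chaining the displays above, the counting argument can be summarized as follows; this is a reconstruction sketch with curvature constants suppressed, so the paper's exact displays may differ:

```latex
N_t(r)\;\le\;\frac{\operatorname{vol}(\mathcal{M}_t)}{c_{d_t}\,(r/2)^{d_t}}\;=\;O\!\left(r^{-d_t}\right),
\qquad
\Pr\!\left[z\in S^{(t)}_{\ell,p,q}\right]\;=\;\frac{1}{N_t(r)}\;=\;\Theta\!\left(r^{d_t}\right),
```

and, by independence of the draws for tasks $i$ and $j$,

```latex
\mathbb{E}\!\left[\mu\!\left(S^{(i)}_{\ell,p,q}\cap S^{(j)}_{\ell,p,q}\right)\right]
= O\!\left(r^{\,d_i+d_j}\right),
\qquad
\mathbb{E}[F_i]\;\le\;C\,\bar{L}\,N_{\mathrm{tot}}\sum_{j=i+1}^{T}O\!\left(r^{\,d_i+d_j}\right).
```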

In Theorem 3, with $d_j\le D$ for every $j>i$, each term in the sum is at most $r^{\,d_i+D}$. Since there are $T-i\le T$ such terms, we thus have

proving the corollary. Because $d_i+D$ is sufficiently small, the factor $r^{\,d_i+D}$ can drive the bound arbitrarily close to a constant rate, demonstrating the robust retention performance for low-dimensional tasks.

Assume that, for every task $t$, the support of each branch is partitioned into $k_t$ disjoint intervals of radius $r/k_t$. Repeating the covering argument with this reduced radius replaces each factor $r^{d_t}$ in Theorem 3 by $(r/k_t)^{d_t}$. Applying the substitution for tasks $i$ and $j$ gives

which matches the improved forgetting rate. Consequently, forgetting decays as $(k_ik_j)^{-(d_i+d_j)}$, indicating that splitting every branch of task $t$ into $k_t$ pieces diminishes that task's overlap contribution by the factor $k_t^{-d_t}$.

Additional Setups

In both the binary and decimal addition experiments, we define a sequence of five tasks. Task 1, termed one's addition, involves adding the digit 1 to each digit from 1 to 9 in both operand orders (e.g., 1 + 1 = 2 through 1 + 9 = 10, and 1 + 1 = 2 through 9 + 1 = 10). The subsequent tasks, two's, three's, four's, and five's, are similarly constructed by adding 2, 3, 4, and 5, respectively, to each digit in the same range. In the binary addition experiments, all input digits are represented as 4-bit binary numbers. Tables 6 and 7 summarize the synthetic datasets used for the binary and decimal addition tasks, respectively.

The KAN model used for the binary addition tasks takes three inputs at each time step, one bit from each of the two 4-bit binary numbers, and a carry-in bit from the previous step. The initial carry-in bit for the least significant bit addition is set to zero. These inputs are passed into the KAN, which outputs the current sum bit and a carry-out to be used in the next step. This carry-out is recurrently fed back into the model, enabling sequential processing of the bit pairs from the least to the most significant positions. In contrast, the KAN model used for decimal addition takes two decimal digits as input and produces the corresponding sum digit and carry-out. Tables 8 and 9 show the hyperparameters for the model architectures used in both synthetic tasks.

The baseline (Ruiz-Garcia 2022) used for comparison with the KAN on continual binary addition tasks is based on the hypothesis that, since addition is an algorithmic process, a suitably designed model architecture can avoid catastrophic forgetting. This work proposes an MLP network that is algorithmically aligned with the binary addition procedure, enabling it to learn the addition rules for binary numbers without forgetting. The network builds on the concept of traditional convolution layers with key modifications. It uses conditional operations based on input values, similar to if-else logic, passes carry information sequentially between steps, and handles non-binary inputs by blending outcomes in a differentiable manner. This architectural design allows the model to learn the correct binary addition rules through gradient descent while preserving previous knowledge. Table 10 presents the hyperparameters used in the baseline.

In this study, we utilize three widely used image classification benchmarks of increasing complexity: MNIST, CIFAR-10, and Tiny-ImageNet. From each dataset, we generate five sequential tasks under a class-incremental continual learning setting, where each task contains two unique and mutually exclusive classes selected without repetition. Figure 5 illustrates representative samples from the five sequential tasks for MNIST, CIFAR-10, and Tiny-ImageNet. The corresponding class assignments for each task across all three datasets are comprehensively summarized in Table 11.

The KAN-Transformer modifies the traditional Transformer architecture by replacing the MLP layers with KAN layers. For the image classification tasks, the KAN layers use a grid size of 10 spanning the range $[-1,1]$, with cubic B-splines as the univariate basis functions and SiLU as the base activation function. To contrast the forgetting behavior of the KAN-Transformer, we implement an identical architecture using MLP layers with an EWC regularizer, referred to as the MLP-Transformer baseline. For EWC, the regularization coefficient $\lambda$ is set to 0.1, and the (Fisher) memory buffer holds samples from previously trained tasks. All models are trained for 10 epochs using a learning rate of $1\times10^{-3}$. Each experiment is independently repeated 5 times, and the average performance is reported. Image classification experiments are performed on an NVIDIA Tesla V100 GPU with 32 GB of memory. Tables 12 and 13 list the hyperparameters used for the KAN-Transformer and MLP-Transformer models for the image classification tasks, respectively.

We use two benchmark datasets for continual knowledge editing: CounterFact and ZsRE. The CounterFact dataset is designed for factual knowledge editing and consists of counterfactual statements that initially receive low likelihood scores compared to the correct facts. The ZsRE dataset is a question answering benchmark constructed for zero-shot relation extraction, where each sample includes a natural-language question, its factual answer, and a new answer for the edit. From each dataset, we construct four sequential task sets by varying the number of samples per task, ranging from 2 to 5, to assess retention under different data regimes and task granularities. Each task set contains five tasks, curated by randomly sampling from the original dataset to ensure diversity, domain variability, and non-overlapping content. For each sample, we generate prompts based on the provided instruction template and use the corresponding modified facts to perform targeted knowledge edits on the LM across tasks.

We apply KAN-LoRA and MLP-LoRA adapters to the last two layers of the Llama models, specifically targeting the attention layers, i.e., the query and value projection matrices. For both adapter types, we experiment with ranks 8 and 16, using corresponding LoRA scaling factors ($\alpha$) of 16 and 32, respectively. In the KAN-LoRA adapter, the KAN layers are constructed using a grid size of 5, uniformly spanning the interval $[-1,1]$, and employ cubic B-spline basis functions for interpolation. During fine-tuning, we apply EWC regularization with a coefficient of $\lambda=0.1$, along with a memory buffer containing samples from the previous task to mitigate forgetting. We use a learning rate of $2\times10^{-3}$, and all experiments are conducted for 60 epochs with fixed random seeds. Results are reported as the average performance over five independent runs for each experimental configuration to ensure statistical robustness and reproducibility. All experiments are performed on a single NVIDIA A100 GPU with 80 GB of global memory. Table 14 lists the hyperparameters for both KAN and MLP LoRA adapters.

We further extend our experiments by continually fine-tuning the modified KAN-LoRA and MLP-LoRA adapters using samples from the previous two tasks as the (Fisher) memory for the EWC regularizer. Tables 15(a) and 15(b) present the mean accuracy on prior tasks after sequential edits, for adapter ranks 8 and 16, respectively. Each experiment is run for 5 rounds, and we report the average for evaluation. The results from this extended setting remain consistent with our earlier findings in Table 4, where the tendency toward forgetting for both KAN and MLP adapters increases with higher adapter ranks. In line with previous observations, KAN adapters again outperform their MLP counterparts in low-sample regimes. However, this experiment also reveals notable differences from our initial findings. In the earlier experiments, KAN adapters showed better performance than MLP at rank 16 only in the larger Llama2-13B model. In contrast, this trend does not hold under the extended (Fisher) memory setting. Instead, KAN adapters consistently demonstrate superior retention across both Llama model sizes and adapter rank configurations when trained with limited sample sizes. These results indicate that increasing the depth of (Fisher) memory in EWC regularization enhances the retention capability of KAN adapters, particularly in smaller models and lower-rank settings, by improving stability and narrowing the performance gap relative to their larger model variants.

Beyond that, Tables 15(a) and 15(b) reveal additional fine-grained patterns that further illuminate the retention characteristics of KAN-LoRA adapters. First, across both datasets and adapter ranks, KAN consistently achieves near-perfect retention on the initial task, with 100% accuracy sustained even after training on four subsequent edits. This sharply contrasts with MLP-LoRA, which begins to show degradation by the third task, especially on ZsRE under the rank-16 setting. Second, the benefit of KAN becomes more pronounced in the middle range of task indices, particularly Tasks 3 and 4, where forgetting is most likely to accumulate. For example, in Llama2-7B with rank 16 on CounterFact, KAN achieves 92% accuracy on Task 4, compared to only 60% for MLP. Similar margins are observed in Llama2-13B on ZsRE, where KAN retains 85 to 90% accuracy on Tasks 3 and 4, while MLP drops to the low 70% range. Third, while both adapters eventually converge toward similar performance levels on the final task, the stability of earlier tasks in KAN-LoRA indicates stronger compartmentalization and less representational drift. Interestingly, the rank-8 KAN-LoRA adapter occasionally matches or even exceeds the performance of its rank-16 counterpart, particularly in Llama2-13B, suggesting that overparameterization may lead to unnecessary overlap or capacity saturation under limited data. This observation aligns with our theoretical insight that smaller, localized function supports reduce interference and help preserve prior knowledge. These extended results reinforce the practical value of KAN-based adapters in LoRA.

Table 1: Retention bounds across KANs and tasks.

| Task ($i$) | Task ($j$) | Grid 10: $F_i$ | Grid 10: $F_i/\Delta_{i,j}$ | Grid 15: $F_i$ | Grid 15: $F_i/\Delta_{i,j}$ | Grid 20: $F_i$ | Grid 20: $F_i/\Delta_{i,j}$ |
|---|---|---|---|---|---|---|---|
| 1 | 2 | 0.46 | 0.74 | 0.45 | 0.74 | 0.32 | 0.61 |
| 2 | 3 | 0.45 | 0.73 | 0.40 | 0.67 | 0.34 | 0.64 |
| 3 | 4 | 0.52 | 0.77 | 0.46 | 0.74 | 0.32 | 0.63 |
| 4 | 5 | 0.44 | 0.72 | 0.42 | 0.68 | 0.32 | 0.64 |

Table 3: Forgetting rate for varied intrinsic dimensions.

| Dataset | Quantize label ($Q$) | Shape ($S$) | $\log(F_i)/d_i$ |
|---|---|---|---|
| MNIST | 2 | 8×8 | 0.074 |
| MNIST | 2 | 16×16 | 0.071 |
| MNIST | 2 | 28×28 | 0.074 |
| MNIST | 4 | 28×28 | 0.071 |
| MNIST | 8 | 28×28 | 0.075 |
| MNIST | 16 | 28×28 | 0.073 |
| MNIST | 32 | 28×28 | 0.075 |
| CIFAR-10 | 8 | 8×8 | 0.046 |
| CIFAR-10 | 8 | 16×16 | 0.053 |
| CIFAR-10 | 8 | 32×32 | 0.046 |
| CIFAR-10 | 16 | 32×32 | 0.053 |
| CIFAR-10 | 32 | 32×32 | 0.050 |
| CIFAR-10 | 64 | 32×32 | 0.048 |
| CIFAR-10 | 128 | 32×32 | 0.047 |
| Tiny-ImageNet | 8 | 16×16 | 0.052 |
| Tiny-ImageNet | 8 | 32×32 | 0.054 |
| Tiny-ImageNet | 8 | 64×64 | 0.052 |
| Tiny-ImageNet | 16 | 64×64 | 0.051 |
| Tiny-ImageNet | 32 | 64×64 | 0.049 |
| Tiny-ImageNet | 64 | 64×64 | 0.048 |
| Tiny-ImageNet | 128 | 64×64 | 0.047 |

Table 4(a): KAN-LoRA and MLP-LoRA adapters with rank 8. Each entry is the mean accuracy (%) on previously edited tasks after the given number of trained tasks.

| Model | Dataset | Samples per task | KAN LoRA: 2 | 3 | 4 | 5 | MLP LoRA: 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama2-7B | CounterFact | 2 | 100 | 65 | 50 | 45 | 100 | 85 | 57 | 60 |
| Llama2-7B | CounterFact | 3 | 100 | 93 | 80 | 48 | 100 | 90 | 67 | 57 |
| Llama2-7B | CounterFact | 4 | 100 | 88 | 78 | 53 | 100 | 90 | 65 | 42 |
| Llama2-7B | CounterFact | 5 | 100 | 88 | 80 | 57 | 100 | 98 | 77 | 66 |
| Llama2-7B | ZsRE | 2 | 100 | 80 | 67 | 60 | 100 | 95 | 97 | 87 |
| Llama2-7B | ZsRE | 3 | 100 | 87 | 71 | 58 | 100 | 100 | 91 | 78 |
| Llama2-7B | ZsRE | 4 | 100 | 85 | 70 | 55 | 100 | 82 | 63 | 46 |
| Llama2-7B | ZsRE | 5 | 100 | 76 | 64 | 60 | 100 | 86 | 73 | 57 |
| Llama2-13B | CounterFact | 2 | 100 | 75 | 60 | 50 | 100 | 70 | 50 | 43 |
| Llama2-13B | CounterFact | 3 | 100 | 93 | 71 | 60 | 100 | 93 | 62 | 53 |
| Llama2-13B | CounterFact | 4 | 100 | 97 | 60 | 44 | 100 | 97 | 77 | 56 |
| Llama2-13B | CounterFact | 5 | 100 | 76 | 73 | 57 | 100 | 84 | 83 | 63 |
| Llama2-13B | ZsRE | 2 | 100 | 100 | 83 | 72 | 100 | 80 | 53 | 52 |
| Llama2-13B | ZsRE | 3 | 100 | 97 | 78 | 75 | 100 | 89 | 62 | 60 |
| Llama2-13B | ZsRE | 4 | 100 | 75 | 66 | 58 | 100 | 92 | 81 | 55 |
| Llama2-13B | ZsRE | 5 | 100 | 85 | 80 | 67 | 100 | 83 | 73 | 59 |

Table 5: Comparison of trainable parameters, training, and inference time for KAN-LoRA and MLP-LoRA adapters.

| Model | Adapter | Trainable parameters | Training time (s/epoch) | Inference time (s/sample) |
|---|---|---|---|---|
| Llama2-7B | KAN LoRA | 2.6M | 0.57 | 0.13 |
| Llama2-7B | MLP LoRA | 0.28M | 0.54 | 0.12 |
| Llama2-13B | KAN LoRA | 3.2M | 1.05 | 0.23 |
| Llama2-13B | MLP LoRA | 0.35M | 1.01 | 0.21 |

Table: A9.T6: Binary addition tasks.

Task 1         Task 2         Task 3         Task 4         Task 5
0001 + 0001    0010 + 0001    0011 + 0001    0100 + 0001    0101 + 0001
0001 + 0010    0010 + 0010    0011 + 0010    0100 + 0010    0101 + 0010
0001 + 0011    0010 + 0011    0011 + 0011    0100 + 0011    0101 + 0011
0001 + 0100    0010 + 0100    0011 + 0100    0100 + 0100    0101 + 0100
0001 + 0101    0010 + 0101    0011 + 0101    0100 + 0101    0101 + 0101
0001 + 0110    0010 + 0110    0011 + 0110    0100 + 0110    0101 + 0110
0001 + 0111    0010 + 0111    0011 + 0111    0100 + 0111    0101 + 0111
0001 + 1000    0010 + 1000    0011 + 1000    0100 + 1000    0101 + 1000
0001 + 1001    0010 + 1001    0011 + 1001    0100 + 1001    0101 + 1001
0001 + 0001    0001 + 0010    0001 + 0011    0001 + 0100    0001 + 0101
0010 + 0001    0010 + 0010    0010 + 0011    0010 + 0100    0010 + 0101
0011 + 0001    0011 + 0010    0011 + 0011    0011 + 0100    0011 + 0101
0100 + 0001    0100 + 0010    0100 + 0011    0100 + 0100    0100 + 0101
0101 + 0001    0101 + 0010    0101 + 0011    0101 + 0100    0101 + 0101
0110 + 0001    0110 + 0010    0110 + 0011    0110 + 0100    0110 + 0101
0111 + 0001    0111 + 0010    0111 + 0011    0111 + 0100    0111 + 0101
1000 + 0001    1000 + 0010    1000 + 0011    1000 + 0100    1000 + 0101
1001 + 0001    1001 + 0010    1001 + 0011    1001 + 0100    1001 + 0101

Table: A9.T9: Hyperparameters of the KAN in decimal addition.

Category                  Hyperparameter             Value
Model Architecture        Hidden Layer Dimensions    [2, 3, 2]
                          Grid Size                  5
                          Spline Order               3
                          Base Activation            SiLU
                          Grid Range                 [-1, 1]
Initialization / Scaling  Base Weight Scale          1.0
                          Spline Weight Scale        1.0
                          Noise Scale                0.1
                          Enable Spline Scaler       True
Training                  Optimizer                  AdamW
                          Learning Rate              1e-3
                          Weight Decay               1e-4
                          Epochs per Task            100
Loss and Evaluation       Loss Function              MSE Loss
                          Output Format              [Sum (mod 10), Carry Bit]
Support Overlap           Activation Threshold (t)   1e-2
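Each KAN branch in this configuration is a sum of cubic B-spline basis functions over a uniform grid. A minimal sketch of that basis with the table's settings (grid size 5, spline order 3, range [-1, 1]); the Cox–de Boor recursion is standard, but the knot-extension convention here is a common choice and may differ in detail from the paper's implementation:

```python
# Sketch: the cubic B-spline basis underlying each KAN branch, using the
# table's settings (grid size 5, spline order 3, grid range [-1, 1]).

def bspline_basis(z, i, k, knots):
    """Cox-de Boor recursion for the i-th degree-k B-spline basis function."""
    if k == 0:
        return 1.0 if knots[i] <= z < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + k] != knots[i]:
        left = (z - knots[i]) / (knots[i + k] - knots[i]) \
               * bspline_basis(z, i, k - 1, knots)
    if knots[i + k + 1] != knots[i + 1]:
        right = (knots[i + k + 1] - z) / (knots[i + k + 1] - knots[i + 1]) \
                * bspline_basis(z, i + 1, k - 1, knots)
    return left + right

grid, order, lo, hi = 5, 3, -1.0, 1.0
h = (hi - lo) / grid
# extend the uniform grid by `order` extra knots on each side
knots = [lo + h * (j - order) for j in range(grid + 2 * order + 1)]
n_basis = grid + order                       # basis functions per branch
vals = [bspline_basis(0.0, i, order, knots) for i in range(n_basis)]
# Only order+1 = 4 basis functions are nonzero at any interior point: this
# locality of support is what limits cross-task interference in the paper's
# overlap analysis.
```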

Table: A10.T11: Sample classes in five sequential tasks generated from MNIST, CIFAR-10, and Tiny-ImageNet datasets.

Dataset        Task    Sample classes
MNIST          Task 1  One, Two
               Task 2  Three, Four
               Task 3  Five, Six
               Task 4  Seven, Eight
               Task 5  Nine, Zero
CIFAR-10       Task 1  Automobile, Bird
               Task 2  Cat, Deer
               Task 3  Dog, Frog
               Task 4  Horse, Ship
               Task 5  Truck, Airplane
Tiny-ImageNet  Task 1  Goldfish, Fire salamander
               Task 2  Bull frog, Tailed frog
               Task 3  American alligator, Boa constrictor
               Task 4  Trilobite, Scorpion
               Task 5  Garden spider, Tarantula

Figure: MSE loss in logarithmic scale for five different binary addition tasks during training on the one's addition task.

Figure: Task 1 accuracy after sequential training on tasks 1 and 2 from CIFAR-10, comparing (a) KAN-Transformer and (b) MLP-Transformer. Model configuration is labeled as (#classification layers – #encoder blocks – #attention heads).

Figure: Average accuracy on previously learned tasks after training on 2 to 5 tasks with varying sample sizes from CIFAR-10 and Tiny-ImageNet datasets. Sub-figures show results for (a) KAN-Transformer and (b) MLP-Transformer.


$$ f(\mathbf{x}) = f(x_1, x_2, \ldots, x_n) = \sum_{q=1}^{2n+1} \Psi_q\Bigl(\sum_{p=1}^{n}\psi_{p,q}(x_p)\Bigr). \tag{eq:kan1} $$

$$ f(x) = (\Phi_{L-1} \circ \Phi_{L-2} \circ \cdots \circ \Phi_1 \circ \Phi_0)(x), \nonumber $$

$$ \Phi_{\ell}=\begin{bmatrix}\phi_{\ell,1,1}&\phi_{\ell,2,1}&\cdots&\phi_{\ell,d_{\ell},1}\\ \phi_{\ell,1,2}&\phi_{\ell,2,2}&\cdots&\phi_{\ell,d_{\ell},2}\\ \vdots&\vdots&\ddots&\vdots\\ \phi_{\ell,1,N_{\ell}}&\phi_{\ell,2,N_{\ell}}&\cdots&\phi_{\ell,d_{\ell},N_{\ell}}\end{bmatrix}, \tag{Sx2.Ex3} $$

$$ F_i = L\bigl(f^{(T)},\mathcal D_i\bigr) - L\bigl(f^{(i)},\mathcal D_i\bigr) $$
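The forgetting measure compares the loss on task $i$'s data under the final model against the loss right after task $i$ was learned. A minimal sketch computing it from a matrix of post-training losses (the loss values below are illustrative, not the paper's results):

```python
# Sketch: computing F_i = L(f^(T), D_i) - L(f^(i), D_i) from a loss matrix,
# where loss[t][i] is the loss on task i's data D_i measured right after
# training on task t (0-indexed).

def forgetting(loss):
    """loss: T x T nested list; returns [F_1, ..., F_{T-1}]."""
    T = len(loss)
    final = loss[T - 1]                          # losses under the final model f^(T)
    return [final[i] - loss[i][i] for i in range(T - 1)]

loss = [
    [0.10, 0.90, 0.95],   # after task 1
    [0.40, 0.12, 0.90],   # after task 2
    [0.55, 0.35, 0.11],   # after task 3
]
F = forgetting(loss)       # F_1 = 0.55 - 0.10 = 0.45, F_2 = 0.35 - 0.12 = 0.23
```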

$$ S^{(i)}_{\ell,p,q} = \{\,z\in\mathbb{R} : \phi_{\ell,p,q}(z)\neq 0\,\}, $$

$$ F_i \;\le\; C \sum_{\ell=1}^{L} N_\ell\, L_\ell\, \Delta_{i,j}. $$

$$ \Delta_{i,j} = \max_{\ell,p,q}\;\mu\bigl(S^{(i)}_{\ell,p,q}\cap S^{(j)}_{\ell,p,q}\bigr), $$
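In practice a branch's support can be estimated by thresholding its activation magnitude, consistent with the activation threshold t = 1e-2 listed in the hyperparameter table. A grid-based sketch of the per-branch overlap measure (the stand-in activations below are illustrative, not trained splines):

```python
# Sketch: estimating the support overlap mu(S_i ∩ S_j) for one branch by
# thresholding |phi(z)| at t = 1e-2 on a 1-D grid over the grid range [-1, 1].

def support(phi, grid, t=1e-2):
    return {k for k, z in enumerate(grid) if abs(phi(z)) > t}

def overlap_measure(phi_i, phi_j, lo=-1.0, hi=1.0, n=2000, t=1e-2):
    step = (hi - lo) / n
    grid = [lo + step * k for k in range(n)]
    common = support(phi_i, grid, t) & support(phi_j, grid, t)
    return len(common) * step   # Lebesgue measure of the overlap, up to grid error

# Two bump-like branches active on [-0.5, 0] and [-0.25, 0.25]: overlap ~= 0.25
def bump(a, b):
    return lambda z: 1.0 if a <= z <= b else 0.0

mu = overlap_measure(bump(-0.5, 0.0), bump(-0.25, 0.25))
```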

$$ \frac{\partial}{\partial\theta_{\ell,p,q}} \mathcal{L}\bigl(f(x),y\bigr)=0, $$

$$ \theta^{(T)}=\theta^{(i)} \quad \text{for every }\theta\in\Theta^{(i)}. $$

$$ \bigl|\widetilde{\phi}_{\ell,p,q}(z)-\phi_{\ell,p,q}(z)\bigr| \le L_\ell\,\mu(I_{\ell,p,q}) \le L_\ell\,\Delta_{i,j}. \tag{B-1} $$

$$ \bigl|h^{(\ell)}_{\text{new}}(x)-h^{(\ell)}_{\text{old}}(x)\bigr| \le L_\ell\sum_{q=1}^{N_\ell}\gamma_{\ell,p,q}^{(t)} \quad(x\sim\mathcal{D}_i). \tag{C-3} $$

$$ \Pr\bigl[z\in S^{(k)}_{\ell,p,q}\bigr]=s_k \quad\text{(length of the interval).} $$

$$ \operatorname{Vol}_{d_t}\bigl(\mathcal{B}_t(x_k,r/2)\bigr) \;\ge\; c_{d_t}\,(r/2)^{d_t}, $$

$$ \#\{x_k\} \;\le\; \frac{\operatorname{Vol}_{d_t}(\mathcal{M}_t)}{c_{d_t}\,(r/2)^{d_t}} \;=\; O\bigl(r^{-d_t}\bigr). $$

$$ F_i \;=\; O\Bigl(\sum_{j=i+1}^{T} N_{\mathrm{tot}}\,\bar{L}\; r^{\,d_i + d_j}\Bigr), $$


Lemma (Zero-Overlap Retention). Suppose for an earlier task $i$ and every later task $j>i$ the maximal support overlap satisfies $\Delta_{i,j}=0$. Then
$$ F_i = L\bigl(f^{(T)},\mathcal{D}_i\bigr) - L\bigl(f^{(i)},\mathcal{D}_i\bigr) = 0. $$

Theorem 1 (Retention Bound via Overlap). Assume additionally that each branch $\phi_{\ell,p,q}$ is $L_\ell$-Lipschitz ($\phi$ is $L$-Lipschitz if $|\phi(z_1)-\phi(z_2)|\le L|z_1-z_2|$ for all $z_1,z_2\in\mathbb{R}$; here $L_\ell$ quantifies the spline smoothness in layer $\ell$) and that the loss is bounded by $C$. Then for any $j>i$,
$$ F_i \le C\sum_{\ell=1}^{L} N_\ell\, L_\ell\, \Delta_{i,j}. $$

Theorem (Branch-wise Cumulative Forgetting). Under the Lipschitz and bounded-loss assumptions of Theorem 1, the forgetting on task $i$ after training on all subsequent tasks $i+1,\dots,T$ can be decomposed as
$$ F_i \le C \sum_{\ell=1}^{L} \sum_{p=1}^{d_\ell} \sum_{q=1}^{N_\ell} L_\ell \Bigl[\sum_{j=i+1}^{T} \mu\bigl(S^{(i)}_{\ell,p,q}\cap S^{(j)}_{\ell,p,q}\bigr)\Bigr]. $$

Corollary 1 (Expected Forgetting under Random Supports). If each branch's supports for task $j$ are independently drawn as length-$s_j$ intervals in $[0,1]$, then in expectation
$$ \mathbb{E}[F_i] \le C\sum_{\ell=1}^{L} N_\ell L_\ell \sum_{j=i+1}^{T} s_i\, s_j. $$
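Corollary 1's per-branch ingredient, that two independently placed random supports of lengths $s_i$ and $s_j$ overlap by $s_i s_j$ in expectation, can be sanity-checked numerically. The sketch below treats $[0,1]$ cyclically so each point is covered with probability exactly $s$ (a simplifying assumption matching the corollary's pointwise independence, not necessarily the paper's exact sampling model):

```python
import random

# Monte Carlo check of E[mu(S_i ∩ S_j)] = s_i * s_j for random supports.
# Intervals [a, a+s) are taken modulo 1 so coverage probability is exactly s.

def overlap(a, s, b, t, n=1000):
    step = 1.0 / n
    count = 0
    for k in range(n):
        z = k * step
        in_i = (z - a) % 1.0 < s
        in_j = (z - b) % 1.0 < t
        count += in_i and in_j
    return count * step   # grid-based measure of the overlap

random.seed(0)
s_i, s_j = 0.3, 0.2
trials = 1000
mean = sum(overlap(random.random(), s_i, random.random(), s_j)
           for _ in range(trials)) / trials   # should approach s_i * s_j = 0.06
```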

Corollary 2 (Saturation via Union Bound). Let $U^{(i)}_{\ell,p,q}=\bigcup_{j=i+1}^{T}\bigl(S^{(i)}_{\ell,p,q}\cap S^{(j)}_{\ell,p,q}\bigr)$ be the union of all overlaps for branch $(\ell,p,q)$. Then
$$ F_i \le C\sum_{\ell=1}^{L}\sum_{p=1}^{d_\ell}\sum_{q=1}^{N_\ell} L_\ell\, \mu\bigl(U^{(i)}_{\ell,p,q}\bigr), $$
with $\mu(U^{(i)}_{\ell,p,q}) \le \min\bigl(\sum_{j=i+1}^{T}\Delta_{i,j},\ \mu(S^{(i)}_{\ell,p,q})\bigr)$.

Theorem (Intrinsic-Dimension Forgetting Rate). Suppose task $t$ generates data concentrated on a compact submanifold $\mathcal{M}_t\subset[0,1]^n$ of intrinsic dimension $d_t$, and each univariate branch's activation support can be enclosed within an $r$-ball in the pre-activation domain. Then, for any earlier task $i$ and later task $j$, the expected support overlap satisfies
$$ \mathbb{E}\bigl[\mu(S^{(i)}_{\ell,p,q}\cap S^{(j)}_{\ell,p,q})\bigr] = O\bigl(r^{\,d_i+d_j}\bigr), $$
and hence the forgetting on task $i$ obeys
$$ F_i = O\Bigl(\sum_{j=i+1}^{T} N_{\mathrm{tot}}\,\bar{L}\; r^{\,d_i+d_j}\Bigr), $$
where $N_{\mathrm{tot}}=\sum_\ell N_\ell$ counts the total number of univariate branches and $\bar{L}$ is an average Lipschitz constant.

Corollary 3 (Retention for Low-Dimensional Tasks). If every subsequent task $j$ has intrinsic dimension $d_j\le D$, then
$$ F_i = O\bigl(T\, N_{\mathrm{tot}}\,\bar{L}\; r^{\,d_i+D}\bigr), $$
which becomes negligible when $d_i+D$ is sufficiently small.

Corollary (Fragmentation Mitigates Complexity). If each branch's support for task $t$ is split into $k_t$ disjoint intervals (effective radius $r/k_t$), then the rate in the Intrinsic-Dimension Forgetting Rate theorem improves to
$$ F_i = O\Bigl(\sum_{j=i+1}^{T} N_{\mathrm{tot}}\,\bar{L}\,\bigl(r/k_i\bigr)^{d_i}\bigl(r/k_j\bigr)^{d_j}\Bigr). $$

$$ \Phi_\ell = \begin{bmatrix} \phi_{\ell,1,1} & \phi_{\ell,2,1} & \cdots & \phi_{\ell,d_{\ell},1} \\ \phi_{\ell,1,2} & \phi_{\ell,2,2} & \cdots & \phi_{\ell,d_{\ell},2} \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{\ell,1,N_{\ell}} & \phi_{\ell,2,N_{\ell}} & \cdots & \phi_{\ell,d_{\ell},N_{\ell}} \end{bmatrix}, \nonumber $$

$$ F_i = O\Bigl(\sum_{j=i+1}^{T} N_{\mathrm{tot}}\,\bar{L}\,\bigl(r/k_i\bigr)^{d_i}\bigl(r/k_j\bigr)^{d_j}\Bigr). $$

$$ \begin{aligned} F_i &\le C \sum_{t=i+1}^{T} \sum_{\ell=1}^{L} \sum_{p=1}^{d_\ell} L_\ell \sum_{q=1}^{N_\ell} \gamma_{\ell,p,q}^{(t)} \\ &= C \sum_{\ell=1}^{L} \sum_{p=1}^{d_\ell} \sum_{q=1}^{N_\ell} L_\ell \Bigl[ \sum_{j=i+1}^{T} \mu\bigl( S^{(i)}_{\ell,p,q}\cap S^{(j)}_{\ell,p,q} \bigr) \Bigr]. \end{aligned} $$

$$ \mathbb{E}\bigl[\mu\bigl(S^{(i)}_{\ell,p,q}\cap S^{(j)}_{\ell,p,q}\bigr)\bigr] = \int_{0}^{1} O\bigl(r^{d_i}\bigr)\, O\bigl(r^{d_j}\bigr)\,dz = O\bigl(r^{d_i+d_j}\bigr). \tag{F-1} $$


Proof (Sketch). Since $\Delta_{i,j}=0$, the activation support of every branch $\phi_{\ell,p,q}$ for task $i$ is disjoint from that for task $j$. Consequently, the gradient of the loss on $\mathcal{D}_j$ vanishes on all parameters used by task $i$, and no weight update for task $j$ can affect performance on $\mathcal{D}_i$. The proof details are shown in Appendix A.

Proof (Sketch). Only those branches whose supports overlap ($\mu(S^{(i)}_{\ell,p,q}\cap S^{(j)}_{\ell,p,q})>0$) receive nonzero gradient updates from task $j$. Each such update can change the output on $\mathcal{D}_i$ by at most $L_\ell\,\Delta_{i,j}$, and summing over all $N_\ell$ branches in each layer yields the stated bound. The proof details are shown in Appendix B.

Proof (Sketch). We apply the one-step overlap bound of Theorem 1 separately to each branch $(\ell,p,q)$ and each task $j$, then sum over all branches and tasks. This isolates how each branch's overlaps accumulate over the full sequence. We show the full proof in Appendix C.

Proof (Sketch). For random intervals of lengths $s_i$ and $s_j$, the expected one-dimensional overlap is $s_i\,s_j$. Substituting $\mathbb{E}[\mu(S^{(i)}_{\ell,p,q}\cap S^{(j)}_{\ell,p,q})]=s_i\,s_j$ into the branch-wise sum and pulling constants outside yields the stated bound. We show the full proof in Appendix D.

Proof (Sketch). The union bound replaces the sum over individual overlaps by the measure of their union. Since each branch's parameter updates cannot exceed the total region on which it ever overlaps, summing those union measures yields a tighter global bound. We show the full proof in Appendix E.

Proof (Sketch). By covering the manifold $\mathcal{M}_t$ with $O(r^{-d_t})$ balls of radius $r$, one shows that each branch's support intersects at most an $O(r^{d_t})$ fraction of the pre-activation axis. Independent placement on $\mathcal{M}_i$ and $\mathcal{M}_j$ then yields an overlap of order $r^{d_i+d_j}$, and summing over branches recovers the stated rate. The details are shown in Appendix F.

Proof (Sketch). Substituting the uniform bound $d_j\le D$ into the Intrinsic-Dimension Forgetting Rate theorem gives $\sum_j r^{d_i+d_j}\le T\,r^{d_i+D}$. The details are shown in Appendix G.

Proof (Sketch). Fragmenting a support of radius $r$ into $k$ pieces reduces each piece's covering radius to $r/k$, and the overlap analysis in the Intrinsic-Dimension Forgetting Rate theorem scales accordingly. The details are shown in Appendix H.

Experimental Setup

We use two benchmark datasets for continual knowledge editing: CounterFact and ZsRE. The CounterFact dataset is designed for factual knowledge editing and consists of counterfactual statements that initially receive low likelihood scores compared to the correct facts. The ZsRE dataset is a question answering benchmark constructed for zero-shot relation extraction, where each sample includes a natural-language question, its factual answer, and a new answer for the edit. From each dataset, we construct four sequential task sets by varying the number of samples per task, ranging from 2 to 5, to assess retention under different data regimes and task granularities. Each task set contains five tasks, curated by randomly sampling from the original dataset to ensure diversity, domain variability, and non-overlapping content. For each sample, we generate prompts based on the provided instruction template and use the corresponding modified facts to perform targeted knowledge edits on the LM across tasks.
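The task-set construction above (five tasks, non-overlapping samples drawn at random) can be sketched as follows; the record fields are hypothetical placeholders, not the datasets' actual schema:

```python
import random

# Sketch: building non-overlapping sequential task sets by random sampling,
# with 5 tasks and a configurable number of samples per task (2-5 in the paper).

def make_task_sets(dataset, n_tasks=5, samples_per_task=2, seed=0):
    rng = random.Random(seed)
    # sampling without replacement guarantees the tasks share no content
    picked = rng.sample(range(len(dataset)), n_tasks * samples_per_task)
    return [[dataset[i]
             for i in picked[t * samples_per_task:(t + 1) * samples_per_task]]
            for t in range(n_tasks)]

# hypothetical records standing in for CounterFact/ZsRE entries
dataset = [{"prompt": f"fact {i}", "edit": f"new {i}"} for i in range(100)]
tasks = make_task_sets(dataset, samples_per_task=3)
```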

Table 13: Hyperparameters used for MLP-Transformer with EWC on image classification tasks.

Counterfact Dataset Examples

Example 1:
Prompt: Autonomous University of Madrid, which is located in
Ground Truth: Spain
Edited Fact: Sweden

Example 2:
Prompt: The original language of The Icelandic Dream was
Ground Truth: Icelandic
Edited Fact: Tamil

ZsRE Dataset Examples

Example 1:
Prompt: What is the native language of Christiane Cohendy?
Original Answer: French
Edited Fact: German

Example 2:
Prompt: What is the final year of Atlanta Flames?
Original Answer: 1980
Edited Fact:

(b) KAN-LoRA and MLP-LoRA adapters with rank 16.

Table 15: Mean accuracy (%) on previously edited tasks during continual fine-tuning of Llama 2-7B and Llama 2-13B models equipped with KAN-LoRA and MLP-LoRA adapters with preceding two tasks as memory for EWC regularizer. Performance is reported across five consecutive tasks for each dataset, which is the average of 5 independent experiments.

Model Configuration

We apply KAN-LoRA and MLP-LoRA adapters to the last two layers of the Llama models, specifically targeting the attention layers, i.e., the query and value projection matrices. For both adapter types, we experiment with ranks 8 and 16, using corresponding LoRA scaling factors (α) of 16 and 32, respectively. In the KAN-LoRA adapter, the KAN layers are constructed using a grid size of 5, uniformly spanning the interval [-1, 1], and employ cubic B-spline basis functions for interpolation. During fine-tuning, we apply EWC regularization with a coefficient of λ = 0.1, along with a memory buffer containing samples from the previous task to mitigate forgetting. We use a learning rate of 2 × 10⁻³, and all experiments are conducted for 60 epochs with fixed random seeds. Results are reported as the average over five independent runs for each experimental configuration to ensure statistical robustness and reproducibility. All experiments are performed on a single NVIDIA A100 GPU with 80 GB of memory. Table 14 lists the hyperparameters for both the KAN and MLP LoRA adapters.
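The EWC regularizer used here is the standard quadratic penalty, λ/2 · Σ_k F_k (θ_k − θ*_k)², added to the task loss. A minimal sketch with the paper's coefficient λ = 0.1 (the parameter and Fisher values below are illustrative):

```python
# Sketch: the EWC penalty term added to the fine-tuning loss, with lambda = 0.1.
# fisher holds diagonal Fisher-information estimates computed from the memory
# buffer; theta_star is the parameter snapshot after the previous task.

def ewc_penalty(theta, theta_star, fisher, lam=0.1):
    return 0.5 * lam * sum(f * (t - ts) ** 2
                           for t, ts, f in zip(theta, theta_star, fisher))

theta      = [0.9, -0.2, 0.5]   # current adapter parameters
theta_star = [1.0, -0.1, 0.5]   # snapshot after the previous task
fisher     = [2.0,  0.5, 1.0]   # per-parameter importance weights
pen = ewc_penalty(theta, theta_star, fisher)
# parameters with large Fisher values are anchored more strongly to theta_star
```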

Experimental Setup

We further extend our experiments by continually fine-tuning the modified KAN-LoRA and MLP-LoRA adapters using samples from the previous two tasks as the (Fisher) memory for the EWC regularizer. Tables 15a and 15b present the mean accuracy on prior tasks after sequential edits, for adapter ranks 8 and 16, respectively. Each experiment is repeated five times, and we report the average. The results from this extended setting remain consistent with our earlier findings in Table 4: the tendency toward forgetting increases with higher adapter ranks for both KAN and MLP adapters. In line with previous observations, KAN adapters again outperform their MLP counterparts in low-sample regimes. However, this experiment also reveals notable differences from our initial findings. In the earlier experiments, KAN adapters showed better performance than MLP at rank 16 only in the larger Llama 2-13B model. In contrast, this trend does not hold under the extended (Fisher) memory setting. Instead, KAN adapters consistently demonstrate superior retention across both Llama model sizes and adapter rank configurations when trained with limited sample sizes. These results indicate that increasing the depth of the (Fisher) memory in EWC regularization enhances the retention capability of KAN adapters, particularly in smaller models and lower-rank settings, by improving stability and narrowing the performance gap relative to their larger model variants.

Beyond that, Tables 15a and 15b reveal additional fine-grained patterns that further illuminate the retention characteristics of KAN-LoRA adapters. First, across both datasets and adapter ranks, KAN consistently achieves near-perfect retention on the initial task, with 100% accuracy sustained even after training on four subsequent edits. This sharply contrasts with MLP-LoRA, which begins to show degradation by the third task, especially on ZsRE under the rank-16 setting. Second, the benefit of KAN becomes more pronounced in the middle range of task indices, particularly Tasks 3 and 4, where forgetting is most likely to accumulate. For example, in Llama 2-7B with rank 16 on CounterFact, KAN achieves 92% accuracy on Task 4, compared to only 60% for MLP. Similar margins are observed in Llama 2-13B on ZsRE, where KAN retains 85 to 90% accuracy on Tasks 3 and 4, while MLP drops to the low 70% range. Third, while both adapters eventually converge toward similar performance levels on the final task, the stability of earlier tasks in KAN-LoRA indicates stronger compartmentalization and less representational drift. Interestingly, the rank-8 KAN-LoRA adapter occasionally matches or even exceeds the performance of its rank-16 counterpart, particularly in Llama 2-13B, suggesting that overparameterization may lead to unnecessary overlap or capacity saturation under limited data. This observation aligns with our theoretical insight that smaller, localized function supports reduce interference and help preserve prior knowledge. These extended results reinforce the practical value of KAN-based adapters in LoRA.

Task (i)  Task (j)   Grid 10          Grid 15          Grid 20
                     F_i    Δ_i,j     F_i    Δ_i,j     F_i    Δ_i,j
1         2          0.46   0.74      0.45   0.74      0.32   0.61
2         3          0.45   0.73      0.40   0.67      0.34   0.64
3         4          0.52   0.77      0.46   0.74      0.32   0.63
4         5          0.44   0.72      0.42   0.68      0.32   0.64

Task (i)  Tasks (j)  Grid 10          Grid 15          Grid 20
                     F_i    Σ µ_ij    F_i    Σ µ_ij    F_i    Σ µ_ij
1         2,3,4,5    0.68   0.15      0.62   0.15      0.57   0.16
2         3,4,5      0.67   0.16      0.51   0.15      0.44   0.16
3         4,5        0.39   0.16      0.39   0.16      0.29   0.16
4         5          0.25   0.18      0.19   0.17      0.16   0.17
Task 1   Task 2   Task 3   Task 4   Task 5
1 + 1    2 + 1    3 + 1    4 + 1    5 + 1
1 + 2    2 + 2    3 + 2    4 + 2    5 + 2
1 + 3    2 + 3    3 + 3    4 + 3    5 + 3
1 + 4    2 + 4    3 + 4    4 + 4    5 + 4
1 + 5    2 + 5    3 + 5    4 + 5    5 + 5
1 + 6    2 + 6    3 + 6    4 + 6    5 + 6
1 + 7    2 + 7    3 + 7    4 + 7    5 + 7
1 + 8    2 + 8    3 + 8    4 + 8    5 + 8
1 + 9    2 + 9    3 + 9    4 + 9    5 + 9
1 + 1    1 + 2    1 + 3    1 + 4    1 + 5
2 + 1    2 + 2    2 + 3    2 + 4    2 + 5
3 + 1    3 + 2    3 + 3    3 + 4    3 + 5
4 + 1    4 + 2    4 + 3    4 + 4    4 + 5
5 + 1    5 + 2    5 + 3    5 + 4    5 + 5
6 + 1    6 + 2    6 + 3    6 + 4    6 + 5
7 + 1    7 + 2    7 + 3    7 + 4    7 + 5
8 + 1    8 + 2    8 + 3    8 + 4    8 + 5
9 + 1    9 + 2    9 + 3    9 + 4    9 + 5
Category                  Hyperparameter          Value
Model Architecture        Hidden Layers           [3, 2, 2]
                          Grid Size               5
                          Spline Order            3
                          Base Activation         SiLU
                          Grid Range              [-1, 1]
                          Grid Epsilon            0.02
Initialization / Scaling  Base Weight Scale       1.0
                          Spline Weight Scale     1.0
                          Spline Noise Scale      0.1
                          Enable Spline Scaler    True
Optimizer & Training      Optimizer               AdamW
                          Learning Rate           1e-3
                          Weight Decay            1e-4
Loss                      Loss Function           MSE Loss
                          Prediction Threshold    0.5
Training Loop             Epochs per Task         50
Category        Hyperparameter            Value / Description
Model           Operator                  Û_ijkl = σ(a_ijkl)
                Number of Parameters      16
                Nonlinearity              Sigmoid: σ(x) = 1 / (1 + e^-x)
Input / Output  Input Format              Two binary numbers (N1, N2)
                Intermediate Carry (N3)   Passed sequentially (1 bit per step)
                Output (N4)               Sum of N1 + N2
Training        Optimizer                 Gradient Descent
                Learning Rate             1.0
                Training Steps per Task   2000 steps
                Loss Function             Mean Squared Error
Initialization  Parameter Range (a_ijkl)  Random in [-1, 1]
                Sigmoid Output Range      ∼ [0.27, 0.73] at initialization
Category           Hyperparameter       Value
Dataset and Input  Input Shape          28 × 28 (MNIST), 32 × 32 (CIFAR-10), 64 × 64 (Tiny-ImageNet)
KAN Linear Layer   Grid Size            10
                   Spline Order         3
                   Base Activation      SiLU
                   Base Weight Scale    1.0
                   Spline Weight Scale  1.0
                   Noise Scale          0.1
Training           Optimizer            AdamW
                   Learning Rate        1e-3
                   Weight Decay         1e-4
                   Loss Function        Cross Entropy Loss
                   Epochs per Task      10
Category            Hyperparameter            Value
Dataset and Input   Input Shape               28 × 28 (MNIST), 32 × 32 (CIFAR-10), 64 × 64 (Tiny-ImageNet)
Training            Optimizer                 AdamW
                    Learning Rate             1e-3
                    Weight Decay              1e-4
                    Epochs per Task           10
EWC Regularization  λ EWC                     0.1
                    (Fisher) Memory Size      All prior batches
                    (Fisher) Update Strategy  After each task
Hyperparameter                     Value
Grid Size (KAN-LoRA)               5
Grid Range (KAN-LoRA)              [-1, 1]
Spline Order (KAN-LoRA)            3
Base Activation (KAN-LoRA)         SiLU
Base Weight Scale (KAN-LoRA)       1.0
Spline Weight Scale (KAN-LoRA)     1.0
Noise Scale (KAN-LoRA)             0.1
Number of last layers to update    2
LoRA injection target              query proj, value proj
LoRA rank                          8, 16
LoRA scaling factor                16, 32
Number of training epochs          60
Learning rate                      2e-3
EWC regularization                 True
EWC regularization strength (λ)    0.1
Model  Dataset  Samples    KAN LoRA (# trained tasks)    MLP LoRA (# trained tasks)
                per Task     2     3     4     5           2     3     4     5
                2          100   100    97    80         100    95    87    85
                3          100   100    89    82         100   100    84    77
                4          100   100    98    93         100   100    97    86
                5          100   100    96    86         100   100    97    87
2 3 4100 100100 10097 9683 85100 100100 100 10090 8977 83
5100 100100 10097 92831009698 9180 85
2 3 4100 1001008676 70100 100100 1008365
100919190
86100
59410095
21001008810010090
1009697779495 97
3100 1001009788 80100958785 87
410096100 1001009778
9892861009888
100
5100100969010010091
Model  Dataset  Samples    KAN LoRA (# trained tasks)    MLP LoRA (# trained tasks)
                per Task     2     3     4     5           2     3     4     5
2 3100 100 100100 100 10090 96 9283 78 78 71100 100 100 100100 100 10087 91 9375 75 83
2 3100 100100 10095 8683 70100 100100 10090 8272 60
410010010092
59181
                4          100   100    88    75         100   100    90    82
                5          100   100    85    74         100   100    89    86
                2          100   100    97    78         100   100    80    75
                3          100   100    87    76         100   100    82    73
                4          100   100    95    79         100   100    95    73
                5          100   100    95    75         100   100    87    78
                2          100   100    90    73         100    90    83    70
                3          100   100    84    77         100   100    82    77
                4          100   100    85    80         100   100    92    84
                5          100   100    90    80         100   100    93    82
Task ( i )Task ( j )Grid 10Grid 10Grid 15Grid 15Grid 20Grid 20
Task ( i )Task ( j )F iF i ∆ i,jF iF i ∆ i,jF iF i ∆ i,j
120.460.740.450.740.320.61
230.450.730.400.670.340.64
340.520.770.460.740.320.63
450.440.720.420.680.320.64
Task ( i )Task ( j )Grid 10Grid 10Grid 15Grid 15Grid 20Grid 20
Task ( i )Task ( j )F iF i ∑ µ ijF iF i ∑ µ ijF iF i ∑ µ ij
12, 3, 4, 50.680.150.620.150.570.16
23, 4, 50.670.160.510.150.440.16
34, 50.390.160.390.160.290.16
450.250.180.190.170.160.17
MNISTMNISTMNISTCIFAR-10CIFAR-10CIFAR-10Tiny-ImageNetTiny-ImageNetTiny-ImageNet
Quantize label ( Q )Shape ( S )log( F i ) d iQuantize label ( Q )Shape ( S )log( F i ) d iQuantize label ( Q )Shape ( Slog( F i ) d i
28 × 80.07488 × 80.046816 × 160.052
216 × 160.071816 × 160.053832 × 320.054
228 × 280.074832 × 320.046864 × 640.052
428 × 280.0711632 × 320.0531664 × 640.051
828 × 280.0753232 × 320.0503264 × 640.049
1628 × 280.0736432 × 320.0486464 × 640.048
3228 × 280.07512832 × 320.04712864 × 640.047
ModelDatasetSamples per TaskKAN LoRA # Trained tasksKAN LoRA # Trained tasksKAN LoRA # Trained tasksKAN LoRA # Trained tasksMLP LoRA # Trained tasksMLP LoRA # Trained tasksMLP LoRA # Trained tasksMLP LoRA # Trained tasks
ModelDatasetSamples per Task23452345
2 3100 100 10065 93 8850 8045 48 53 57100 100 100 10085 90 9057 67 6560 57 42
210080 876760100 10095 1009787
41008878987766
580
310071589178
4100857055100826346
5100766460100867357
2100756050100705043
3100937160100936253
4100976044100977756
5100767357100848363
21001008372100805352
3100977875100896260
4100756658100928155
5100858067100837359
ModelAdapterTrainable parametersTraining time (s/epoch)Inference time (s/sample)
Llama 2-7BKAN LoRA2.6M0.570.13
Llama 2-7BMLP LoRA0.28M0.540.12
Llama 2-13BKAN LoRA3.2M1.050.23
Llama 2-13BMLP LoRA0.35M1.010.21
Task 1Task 2Task 3Task 4Task 5
0001 + 0001 0001 + 0010 0001 + 0011 0001 + 0100 0001 + 0101 0001 + 0110 0001 + 0111 0001 + 1000 0001 + 1001 0001 + 0001 0010 + 0001 0011 + 0001 0100 + 0001 0101 + 0001 0110 + 0001 0111 + 0001 1000 + 0001 1001 + 00010010 + 0001 0010 + 0010 0010 + 0011 0010 + 0100 0010 + 0101 0010 + 0110 0010 + 0111 0010 + 1000 0010 + 1001 0001 + 0010 0010 + 0010 0011 + 0010 0100 + 0010 0101 + 0010 0110 + 0010 0111 + 0010 1000 + 0010 1001 + 00100011 + 0001 0011 + 0010 0011 + 0011 0011 + 0100 0011 + 0101 0011 + 0110 0011 + 0111 0011 + 1000 0011 + 1001 0001 + 0011 0010 + 0011 0011 + 0011 0100 + 0011 0101 + 0011 0110 + 0011 0111 + 0011 1000 + 0011 1001 + 00110100 + 0001 0100 + 0010 0100 + 0011 0100 + 0100 0100 + 0101 0100 + 0110 0100 + 0111 0100 + 1000 0100 + 1001 0001 + 0100 0010 + 0100 0011 + 0100 0100 + 0100 0101 + 0100 0110 + 0100 0111 + 0100 1000 + 0100 1001 + 01000101 + 0001 0101 + 0010 0101 + 0011 0101 + 0100 0101 + 0101 0101 + 0110 0101 + 0111 0101 + 1000 0101 + 1001 0001 + 0101 0010 + 0101 0011 + 0101 0100 + 0101 0101 + 0101 0110 + 0101 0111 + 0101 1000 + 0101 1001 + 0101
Task 1Task 2Task 3Task 4Task 5
1 + 12 + 13 + 14 + 15 + 1
1 + 22 + 23 + 24 + 25 + 2
1 + 32 + 33 + 34 + 35 + 3
1 + 42 + 43 + 44 + 45 + 4
1 + 52 + 53 + 54 + 55 + 5
1 + 62 + 63 + 64 + 65 + 6
1 + 72 + 73 + 74 + 75 + 7
1 + 82 + 83 + 84 + 85 + 8
1 + 92 + 93 + 94 + 95 + 9
1 + 11 + 21 + 31 + 41 + 5
2 + 12 + 22 + 32 + 42 + 5
3 + 13 + 23 + 33 + 43 + 5
4 + 14 + 24 + 34 + 44 + 5
5 + 15 + 25 + 35 + 45 + 5
6 + 16 + 26 + 36 + 46 + 5
7 + 17 + 27 + 37 + 47 + 5
8 + 18 + 28 + 38 + 48 + 5
9 + 19 + 29 + 39 + 49 + 5
CategoryHyperparameterValue
Model ArchitectureHidden Layers Grid Size Spline Order Base Activation Grid Range Grid Epsilon[3, 2, 2] 5 3 SiLU [-1, 1] 0.02
Initialization / ScalingBase Weight Scale Spline Weight Scale Spline Noise Scale Enable Spline Scaler1.0 1.0 0.1 True
Optimizer &TrainingOptimizer Learning Rate Weight DecayAdamW 1e-3 1e-4
LossLoss Function Prediction ThresholdMSE Loss 0.5
Training LoopEpochs per Task50
CategoryHyperparameterValue
Model ArchitectureHidden Layer Dimensions Grid Size Spline Order Base Activation Grid Range[2, 3, 2] 5 3 SiLU [-1, 1]
Initialization / ScalingBase Weight Scale Spline Weight Scale Noise Scale Enable Spline Scaler1.0 1.0 0.1 True
TrainingOptimizer Learning Rate Weight Decay Epochs per TaskAdamW 1e-3 1e-4 100
Loss and EvaluationLoss Function Output FormatMSE Loss [Sum (mod 10), Carry Bit]
Support OverlapActivation Threshold ( t )1e-2
CategoryHyperparameterValue / Description
ModelOperator Number of Parameters Nonlinearityˆ U ijkl = σ ( a ijkl ) 16 Sigmoid: σ ( x ) = 1 1+ e - x
Input / OutputInput Format Intermediate Carry ( N 3 ) Output ( N 4 )Two binary numbers ( N 1 , N 2 ) Passed sequentially (1-bit per step) Sum of N 1 + N 2
TrainingOptimizer Learning Rate Training Steps per Task Loss FunctionGradient Descent 1.0 2000 steps Mean Squared Error
InitializationParameter Range ( a ijkl ) Sigmoid Output RangeRandom in [ - 1 , 1] At initialization ∼ [0.27, 0.73]
DatasetTaskSample classes
MNISTTask 1 Task 2 Task 3 Task 4 Task 5One, Two Three , Four Five, Six Seven, Eight Nine, Zero
CIFAR-10Task 1 Task 2 Task 3 Task 4 Task 5Automobile, Bird Cat, Deer Dog, Frog Horse, Ship Truck, Airplane
Tiny-ImageNetTask 1 Task 2 Task 3 Task 4 Task 5Goldfish, Fire salamander Bull frog, Tailed frog American alligator, Boa constrictor Trilobite, Scorpion Garden Spider, Tarantula
Dataset and Input — Input Shape: 28 × 28 (MNIST), 32 × 32 (CIFAR-10), 64 × 64 (Tiny-ImageNet)
KAN Linear Layer — Grid Size: 10; Spline Order: 3; Base Activation: SiLU; Base Weight Scale: 1.0; Spline Weight Scale: 1.0; Noise Scale: 0.1
Training — Optimizer: AdamW; Learning Rate: 1e-3; Weight Decay: 1e-4; Loss Function: Cross-Entropy Loss; Epochs per Task: 10
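Forgetting under these settings is typically read off an accuracy matrix logged after each task. The helper below computes the standard per-task forgetting measure; the matrix values are made up for illustration and are not results from the paper.

```python
import numpy as np

def forgetting(acc):
    """Per-task forgetting from a T x T accuracy matrix where
    acc[t, i] is accuracy on task i after training through task t:
    F_i = max_{t >= i} acc[t, i] - acc[T-1, i]."""
    T = acc.shape[0]
    return np.array([acc[i:, i].max() - acc[-1, i] for i in range(T - 1)])

# toy accuracy matrix for 3 sequential tasks (illustrative numbers)
acc = np.array([[0.95, 0.00, 0.00],
                [0.60, 0.93, 0.00],
                [0.40, 0.55, 0.94]])
print(forgetting(acc))  # → [0.55 0.38]
```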
Dataset and Input — Input Shape: 28 × 28 (MNIST), 32 × 32 (CIFAR-10), 64 × 64 (Tiny-ImageNet)
Training — Optimizer: AdamW; Learning Rate: 1e-3; Weight Decay: 1e-4; Epochs per Task: 10
EWC Regularization — λ_EWC (Fisher): 0.1; Memory Size (Fisher): all prior batches; Update Strategy: after each task
Grid Size (KAN-LoRA): 5
Grid Range (KAN-LoRA): [-1, 1]
Spline Order (KAN-LoRA): 3
Base Activation (KAN-LoRA): SiLU
Base Weight Scale (KAN-LoRA): 1.0
Spline Weight Scale (KAN-LoRA): 1.0
Noise Scale (KAN-LoRA): 0.1
Number of last layers to update: 2
LoRA injection target: query proj, value proj
LoRA rank: 8, 16
LoRA scaling factor: 16, 32
Number of training epochs: 60
Learning rate: 2e-3
EWC regularization: True
EWC regularization strength (λ): 0.1
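The table alone does not fully specify the KAN-LoRA adapter; as one plausible reading, the sketch below applies a rank-r update with a SiLU base activation in the low-rank bottleneck to a frozen weight, using the usual α/r scaling. The class name and structure are illustrative assumptions, not the authors' implementation (a real KAN layer would add learnable spline terms on the listed grid).

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

class KANLoRAAdapter:
    """Hypothetical sketch: rank-r adapter whose bottleneck applies a
    nonlinearity (SiLU, the base activation from the table) before the
    up-projection. Injected on query/value projections in practice."""
    def __init__(self, d, r=8, alpha=16, rng=None):
        rng = rng or np.random.default_rng(0)
        self.A = rng.normal(0, 0.1, (r, d))  # down-projection
        self.B = np.zeros((d, r))            # up-projection, zero-initialized
        self.scale = alpha / r               # LoRA scaling factor

    def __call__(self, W, x):
        # frozen weight W plus the scaled, adapted low-rank path
        return W @ x + self.scale * (self.B @ silu(self.A @ x))

d = 4
W = np.eye(d)
adapter = KANLoRAAdapter(d, r=8, alpha=16)
x = np.ones(d)
print(np.allclose(adapter(W, x), x))  # → True: zero-init B leaves W intact
```

Zero-initializing B means the adapter starts as an exact identity on the frozen model, the standard LoRA property that makes continual fine-tuning start from the pre-trained behavior.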
[Table: knowledge-editing accuracy (%) of KAN-LoRA versus MLP-LoRA, broken down by dataset, samples per task (2–5), and number of trained tasks (2–5); "# Tainted tasks" in the extracted header should read "# Trained tasks". The individual cell values were scrambled during extraction and are omitted.]
[Table: second results table with the same layout — accuracy (%) of KAN-LoRA versus MLP-LoRA by dataset, samples per task (2–5), and number of trained tasks (2–5). The individual cell values were scrambled during extraction and are omitted.]

$$ \bigl| \mathcal{L}(f^{(T)}(x),y) - \mathcal{L}(f^{(i)}(x),y)\bigr| \le C\,\bigl|f^{(T)}(x)-f^{(i)}(x)\bigr|. $$
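This is the usual Lipschitz argument for the loss. For squared error on outputs bounded by |f| ≤ B and |y| ≤ B, the identity |(a−y)² − (b−y)²| = |a+b−2y|·|a−b| gives C = 4B, which a quick numerical check confirms (the bound B and the sampling are our illustrative choices):

```python
import numpy as np

# squared-error loss on values bounded by B: C = 4B is a valid
# Lipschitz constant, since |a + b - 2y| <= 4B on the sampled domain
rng = np.random.default_rng(0)
B = 1.0
a, b, y = (rng.uniform(-B, B, 10000) for _ in range(3))
lhs = np.abs((a - y) ** 2 - (b - y) ** 2)  # loss difference
rhs = 4 * B * np.abs(a - b)                # C * output difference
print(bool(np.all(lhs <= rhs)))  # → True
```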

$$ \begin{aligned} \mathbb{E}[F_i] &\le C \sum_{\ell=1}^{L} \sum_{p=1}^{d_\ell} \sum_{q=1}^{N_\ell} L_\ell \sum_{j=i+1}^{T} s_i\,s_j \\[4pt] &= C \sum_{\ell=1}^{L} d_\ell\,N_\ell\,L_\ell \sum_{j=i+1}^{T} s_i\,s_j, \end{aligned} $$

$$ N_t(r)=O\!\bigl(r^{-d_t}\bigr). $$
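The covering-number law $N_t(r)=O(r^{-d_t})$ is what makes box-counting estimates of the intrinsic dimension $d_t$ possible: $d_t$ is the slope of $\log N(r)$ against $\log(1/r)$. An illustrative sketch (the two scales and the test curve are our choices) recovers $d \approx 1$ for a one-dimensional curve embedded in the plane:

```python
import numpy as np

def covering_count(points, r):
    """Number of axis-aligned grid cells of side r that the points occupy."""
    cells = np.unique(np.floor(points / r), axis=0)
    return len(cells)

def intrinsic_dim(points, r1, r2):
    """Slope of log N(r) against log(1/r) between two covering scales."""
    n1, n2 = covering_count(points, r1), covering_count(points, r2)
    return (np.log(n2) - np.log(n1)) / (np.log(1 / r2) - np.log(1 / r1))

# a 1-D manifold (a sine curve) embedded in R^2
t = np.linspace(0, 1, 20000)
curve = np.stack([t, np.sin(2 * np.pi * t)], axis=1)
d_est = intrinsic_dim(curve, r1=0.1, r2=0.01)
print(d_est)  # close to 1, the curve's intrinsic dimension
```

For high-dimensional data such as images, $d_t$ is large, the number of occupied spline cells explodes, and support overlap between tasks becomes unavoidable — the regime where KANs forget.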

References

[hu2021loralowrankadaptationlarge] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen. (2021). LoRA: Low-Rank Adaptation of Large Language Models.

[biderman2024loralearnsforgets] Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, others. (2024). LoRA Learns Less and Forgets Less.

[ke2022continualtraininglanguagemodels] Zixuan Ke, Haowei Lin, Yijia Shao, Hu Xu, Lei Shu, Bing Liu. (2022). Continual Training of Language Models for Few-Shot Learning.

[zhang2024comprehensivestudyknowledgeediting] Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, Jintian Zhang, Yuansheng Ni, Siyuan Cheng, Ziwen Xu, Xin Xu, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Lei Liang, Zhiqiang Zhang, Xiaowei Zhu, Jun Zhou, Huajun Chen. (2024). A Comprehensive Study of Knowledge Editing for Large Language Models.

[wu2024continuallearninglargelanguage] Tongtong Wu, Linhao Luo, Yuan-Fang Li, Shirui Pan, Thuy-Trang Vu, Gholamreza Haffari. (2024). Continual Learning for Large Language Models: A Survey.

[coleman2025parameterefficientcontinualfinetuningsurvey] Eric Nuertey Coleman, Luigi Quarantiello, Ziyue Liu, Qinwen Yang, Samrat Mukherjee, Julio Hurtado, Vincenzo Lomonaco. (2025). Parameter-Efficient Continual Fine-Tuning: A Survey.

[gururangan2020dontstoppretrainingadapt] Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, Noah A. Smith. (2020). Don't Stop Pretraining: Adapt Language Models to Domains and Tasks.

[diao2023mixtureofdomainadaptersdecouplinginjectingdomain] Shizhe Diao, Tianyang Xu, Ruijia Xu, Jiawei Wang, Tong Zhang. (2023). Mixture-of-Domain-Adapters: Decoupling and Injecting Domain Knowledge to Pre-trained Language Models Memories.

[vandeven2024continuallearningcatastrophicforgetting] Gido M. Van De Ven, Nicholas Soures, Dhireesha Kudithipudi. (2024). Continual Learning and Catastrophic Forgetting.

[MCCLOSKEY1989109] Michael McCloskey, Neal J. Cohen. (1989). Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. Psychology of Learning and Motivation.

[BORHANIFARD2025101737] Zeinab Borhanifard, Heshaam Faili, Yadollah Yaghoobzadeh. (2025). Combining replay and LoRA for continual learning in natural language understanding. Computer Speech & Language. doi:https://doi.org/10.1016/j.csl.2024.101737.

[Kirkpatrick_2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, et al. (2017). Overcoming Catastrophic Forgetting in Neural Networks. Proceedings of the National Academy of Sciences. doi:10.1073/pnas.1611835114.

[yoon2018lifelonglearningdynamicallyexpandable] Jaehong Yoon, Eunho Yang, Jeongtae Lee, Sung Ju Hwang. (2018). Lifelong Learning with Dynamically Expandable Networks.

[mirzadeh2022architecturematterscontinuallearning] Seyed Iman Mirzadeh, Arslan Chaudhry, Dong Yin, others. (2022). Architecture Matters in Continual Learning.

[buzzega2020darkexperiencegeneralcontinual] Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, Simone Calderara. (2020). Dark Experience for General Continual Learning: a Strong, Simple Baseline.

[riemer2019learninglearnforgettingmaximizing] Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, Gerald Tesauro. (2019). Learning to Learn without Forgetting by Maximizing Transfer and Minimizing Interference.

[10190202] Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y. Hou, Max Tegmark. (2025). KAN: Kolmogorov-Arnold Networks. IEEE Transactions on Neural Networks and Learning Systems. doi:10.1109/TNNLS.2023.3292359.

[binaryaddition] Ruiz-Garcia, M.. (2022). Model architecture can transform catastrophic forgetting into positive transfer.

[luo2025empiricalstudycatastrophicforgetting] Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, Yue Zhang. (2025). An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning.

[li2025elderenhancinglifelongmodel] Jiaang Li, Quan Wang, Zhongnan Wang, Yongdong Zhang, Zhendong Mao. (2025). ELDER: Enhancing Lifelong Model Editing with Mixture-of-LoRA.

[cfinsvm] Ayad, Omar. (2014). Learning under Concept Drift with Support Vector Machines. Artificial Neural Networks and Machine Learning -- ICANN 2014.

[FRENCH1999128] Robert M. French. (1999). Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences. doi:https://doi.org/10.1016/S1364-6613(99)01294-2.

[app142210173] Ibrahum, Ahmed Dawod Mohammed, Shang, Zhengyu, Hong, Jang-Eui. (2024). How Resilient Are Kolmogorov–Arnold Networks in Classification Tasks? A Robustness Investigation. Applied Sciences. doi:10.3390/app142210173.

[kanfails] Girosi, Federico, Poggio, Tomaso. (1989). Representation Properties of Networks: Kolmogorov's Theorem Is Irrelevant. Neural Computation. doi:10.1162/neco.1989.1.4.465.

[Lee_Gomes_Zhang_Kleijn_2025] A. Lee, H. M. Gomes, Y. Zhang, W. B. Kleijn. (2025). Kolmogorov-Arnold Networks Still Catastrophically Forget but Differently from MLP. Proceedings of the AAAI Conference on Artificial Intelligence. doi:10.1609/aaai.v39i17.33986.

[park2024cfkankolmogorovarnoldnetworkbasedcollaborative] Jin-Duk Park, Kyung-Min Kim, Won-Yong Shin. (2024). CF-KAN: Kolmogorov-Arnold Network-based Collaborative Filtering to Mitigate Catastrophic Forgetting in Recommender Systems.

[yang2024kolmogorovarnoldtransformer] Yang, Xingyi, Wang, Xinchao. (2025). Kolmogorov-Arnold Transformer. The Thirteenth International Conference on Learning Representations.

[10681070] Chaoxi Jiang, Yueyang Li, Haichi Luo, Caidi Zhang, Hongqun Du. (2025). KansNet: Kolmogorov–Arnold Networks and multi slice partition channel priority attention in convolutional neural network for lung nodule detection. Biomedical Signal Processing and Control. doi:https://doi.org/10.1016/j.bspc.2024.107358.

[10752992] Ronald Kemker, Marc McClure, Angelina Abitino, others. (2017). Measuring Catastrophic Forgetting in Neural Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. doi:10.1109/TPAMI.2021.3057446.

[spigler2020metalearntpriorsslowcatastrophic] Giacomo Spigler. (2020). Meta-learnt priors slow down catastrophic forgetting in neural networks.

[9760159] Y. Xu, X. Zhong, A. J. J. Yepes, J. H. Lau. (2020). Forget Me Not: Reducing Catastrophic Forgetting for Domain Adaptation in Reading Comprehension. IEEE Transactions on Neural Networks and Learning Systems. doi:10.1109/TNNLS.2022.3162241.

[Turbulence] Muhammad R. Alhafiz, Kemas Zakaria, Duong Viet Dung, Pramudita S. Palar, Yohanes Bimo Dwianto, Lavi R. Zuhal. Kolmogorov-Arnold Networks for Data-Driven Turbulence Modeling. AIAA SCITECH 2025 Forum. doi:10.2514/6.2025-2047.

[meng2023locatingeditingfactualassociations] Kevin Meng, David Bau, Alex Andonian, Yonatan Belinkov. (2023). Locating and Editing Factual Associations in GPT.

[levy-etal-2017-zero] Levy, Omer, Seo, Minjoon, Choi, Eunsol, Zettlemoyer, Luke. (2017). Zero-Shot Relation Extraction via Reading Comprehension. Proceedings of the 21st Conference on Computational Natural Language Learning.

[aleixo2023catastrophicforgettingdeeplearning] Everton L. Aleixo, Juan G. Colonna, Marco Cristo, Everlandio Fernandes. (2023). Catastrophic Forgetting in Deep Learning: A Comprehensive Taxonomy.

[hu2022lora] Hu, Edward J, Shen, Yelong, Wallis, Phillip, Allen-Zhu, Zeyuan, Li, Yuanzhi, Wang, Shean, Wang, Lu, Chen, Weizhu, others. (2022). Lora: Low-rank adaptation of large language models.. ICLR.

[bishop1994neural] Bishop, Chris M. (1994). Neural networks and their applications. Review of scientific instruments.

[touvron2023llamaopenefficientfoundation] Hugo Touvron, Thibaut Lavril, Gautier Izacard, others. (2023). LLaMA: Open and Efficient Foundation Language Models.

[prautzsch2002bezier] Prautzsch, Hartmut, Boehm, Wolfgang, Paluszny, Marco. (2002). Bézier and B-Spline Techniques. Springer.

[hu2025kackolmogorovarnoldclassifiercontinual] Yusong Hu, Zichen Liang, Fei Yang, Qibin Hou, others. (2025). KAC: Kolmogorov-Arnold Classifier for Continual Learning.

[zhang2025unifyinglocalitykansfeature] Tianshuo Zhang, Siran Peng, Li Gao, others. (2025). Unifying Locality of KANs and Feature Drift Compensation Projection for Data-free Replay based Continual Face Forgery Detection.

[bib1] Abd Elaziz, M.; Ahmed Fares, I.; and Aseeri, A. O. 2024. CKAN: Convolutional Kolmogorov–Arnold Networks Model for Intrusion Detection in IoT Environment. IEEE Access, 12: 134837–134851.

[bib7] De Lange, M.; Aljundi, R.; Masana, M.; Parisot, S.; et al. 2022. A Continual Learning Survey: Defying Forgetting in Classification Tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7): 3366–3385.

[bib14] Kolmogorov, A. N. 1961. On the Representation of Continuous Functions of Several Variables by Superpositions of Continuous Functions of a Smaller Number of Variables. American Mathematical Society.

[bib15] Kong, Y.; Liu, L.; Chen, H.; Kacprzyk, J.; and Tao, D. 2024. Overcoming Catastrophic Forgetting in Continual Learning by Exploring Eigenvalues of Hessian Matrix. IEEE Transactions on Neural Networks and Learning Systems, 35(11).

[bib29] Wang, Z.; Yang, E.; Shen, L.; and Huang, H. 2025. A Comprehensive Survey of Forgetting in Deep Learning Beyond Continual Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3): 1464–1483.

[bib33] Zenke, F.; Poole, B.; and Ganguli, S. 2017. Continual Learning Through Synaptic Intelligence. Proceedings of Machine Learning Research, 70: 3987–3995.

[bib35] Zhang, T.; Wang, X.; Liang, B.; and Yuan, B. 2023. Catastrophic Interference in Reinforcement Learning: A Solution Based on Context Division and Knowledge Distillation. IEEE Transactions on Neural Networks and Learning Systems, 34(12): 9925–9939.