Self-Supervised Learning with Lie Symmetries for Partial Differential Equations
Grégoire Mialon$^{\dagger}$, Meta, FAIR, Quentin Garrido$^{\dagger}$, Meta, FAIR, Univ Gustave Eiffel, CNRS, LIGM, Hannah Lawrence, Meta, FAIR, MIT, Danyal Rehman, MIT, Yann LeCun, Meta, FAIR, NYU, Bobak T. Kiani, MIT
Abstract
Machine learning for differential equations paves the way for computationally efficient alternatives to numerical solvers, with potentially broad impacts in science and engineering. Though current algorithms typically require simulated training data tailored to a given setting, one may instead wish to learn useful information from heterogeneous sources, or from real dynamical systems observations that are messy or incomplete. In this work, we learn general-purpose representations of PDEs from heterogeneous data by implementing joint embedding methods for self-supervised learning (SSL), a framework for unsupervised representation learning that has had notable success in computer vision. Our representation outperforms baseline approaches to invariant tasks, such as regressing the coefficients of a PDE, while also improving the time-stepping performance of neural solvers. We hope that our proposed methodology will prove useful in the eventual development of general-purpose foundation models for PDEs. Code available at: https://github.com/facebookresearch/SSLForPDEs.
Introduction
Dynamical systems governed by differential equations are ubiquitous in fluid dynamics, chemistry, astrophysics, and beyond. Accurately analyzing and predicting the evolution of such systems is of paramount importance, inspiring decades of innovation in algorithms for numerical methods. However, high-accuracy solvers are often computationally expensive. Machine learning has recently arisen as an alternative method for analyzing differential equations at a fraction of the cost [1, 2, 3]. Typically, the neural network for a given equation is trained on simulations of that same equation, generated by numerical solvers that are high-accuracy but comparatively slow [4]. What if we instead wish to learn from heterogeneous data, e.g., data with missing information, or gathered from actual observation of varied physical systems rather than clean simulations?
For example, we may have access to a dataset of instances of time-evolution, stemming from a family of partial differential equations (PDEs) for which important characteristics of the problem, such as viscosity or initial conditions, vary or are unknown. In this case, representations learned from such a large, 'unlabeled' dataset could still prove useful in learning to identify unknown characteristics, given only a small dataset 'labeled' with viscosities or reaction constants. Alternatively, the 'unlabeled' dataset may contain evolutions over very short periods of time, or with missing time intervals; possible
∗ Correspondence to: gmialon@meta.com, garridoq@meta.com, and bkiani@mit.edu, † Equal contribution

Figure 1: A high-level overview of the self-supervised learning pipeline, in the conventional setting of image data (top row) as well as our proposed setting of a PDE (bottom row). Given a large pool of unlabeled data, self-supervised learning uses augmentations (e.g. color-shifting for images, or Lie symmetries for PDEs) to train a network f θ to produce useful representations from input images. Given a smaller set of labeled data, these representations can then be used as inputs to a supervised learning pipeline, performing tasks such as predicting class labels (images) or regressing the kinematic viscosity ν (Burgers' equation). Trainable steps are shown with red arrows; importantly, the representation function learned via SSL is not altered during application to downstream tasks.
goals are then to learn representations that could be useful in filling in these gaps, or regressing other quantities of interest.
To tackle these broader challenges, we take inspiration from the recent success of self-supervised learning (SSL) as a tool for learning rich representations from large, unlabeled datasets of text and images [5, 6]. Building such representations from and for scientific data is a natural next step in the development of machine learning for science [7]. In the context of PDEs, this corresponds to learning representations from a large dataset of PDE realizations 'unlabeled' with key information (such as kinematic viscosity for Burgers' equation), before applying these representations to solve downstream tasks with a limited amount of data (such as kinematic viscosity regression), as illustrated in Figure 1.
To do so, we leverage the joint embedding framework [8] for self-supervised learning, a popular paradigm for learning visual representations from unlabeled data [9, 10]. It consists of training an encoder to enforce similarity between embeddings of two augmented versions of a given sample, so as to form useful representations. This is guided by the principle that representations suited to downstream tasks (such as image classification) should preserve the common information between the two augmented views. For example, changing the color of an image of a dog still preserves its semantic meaning, and we thus want similar embeddings under this augmentation. Hence, the choice of augmentations is crucial. For visual data, SSL relies on human intuition to build hand-crafted augmentations (e.g. recoloring and cropping), whereas PDEs are endowed with a group of symmetries preserving the governing equations of the PDE [11, 12]. These symmetry groups are important because creating embeddings that are invariant under them would allow us to capture the underlying dynamics of the PDE. For example, solutions to certain PDEs with periodic boundary conditions remain valid solutions after translations in time and space. There exist more elaborate equation-specific transformations as well, such as Galilean boosts and dilations (see Appendix E). Symmetry groups are well-studied for common PDE families, and can be derived systematically or calculated from computer algebra systems via tools from Lie theory [11, 13, 14].
Contributions: We present a general framework for performing SSL for PDEs using their corresponding symmetry groups. In particular, we show that by exploiting the analytic group transformations from one PDE solution to another, we can use joint embedding methods to generate useful representations from large, heterogeneous PDE datasets. We demonstrate the broad utility of these representations on downstream tasks, including regressing key parameters and time-stepping, on

Figure 2: Pretraining and evaluation frameworks, illustrated on Burgers' equation. (Left) Self-supervised pretraining. We generate augmented solutions x and x ′ using Lie symmetries parametrized by g and g ′ before passing them through an encoder f θ , yielding representations y . The representations are then input to a projection head h θ , yielding embeddings z , on which the SSL loss is applied. (Right) Evaluation protocols for our pretrained representations y . On new data, we use the computed representations to either predict characteristics of interest, or to condition a neural network or operator to improve time-stepping performance.
simulated physically-motivated datasets. Our approach is applicable to any family of PDEs, harnesses the well-understood mathematical structure of the equations governing PDE data - a luxury not typically available in non-scientific domains - and demonstrates more broadly the promise of adapting self-supervision to the physical sciences. We hope this work will serve as a starting point for developing foundation models on more complex dynamical systems using our framework.
Methodology
We now describe our general framework for learning representations from and for diverse sources of PDE data, which can subsequently be used for a wide range of tasks, from regressing characteristics of interest of a PDE sample to improving neural solvers. To this end, we adapt a popular paradigm for representation learning without labels: joint-embedding self-supervised learning.
Self-Supervised Learning (SSL)
Background: In the joint-embedding framework, input data is transformed into two separate 'views', using augmentations that preserve the underlying information in the data. The augmented views are then fed through a learnable encoder, f θ , producing representations that can be used for downstream tasks. The SSL loss function comprises a similarity loss L sim between projections (through a projector h θ , which helps generalization [15]) of the pairs of views, to make their representations invariant to augmentations, and a regularization loss L reg, to avoid trivial solutions (such as mapping all inputs to the same representation). The regularization term can consist of a repulsive force between points, or of regularization on the covariance matrix of the embeddings; both function similarly, as shown in [16]. This pretraining procedure is illustrated in Fig. 2 (left) in the context of Burgers' equation.
In this work, we choose variance-invariance-covariance regularization (VICReg) as our self-supervised loss function [9]. Concretely, let Z , Z ′ ∈ R N × D contain the D -dimensional representations of two batches of N inputs, with D × D centered covariance matrices Cov( Z ) and Cov( Z ′ ) . Rows Z i, : and Z ′ i, : are two views of a shared input. The loss over this batch includes a term to enforce similarity ( L sim) and a term to avoid collapse and regularize representations ( L reg) by

Figure 3: One parameter Lie point symmetries for the Kuramoto-Sivashinsky (KS) PDE. The transformations (left to right) include the un-modified solution ( u ) , temporal shifts ( g 1 ) , spatial shifts ( g 2 ) , and Galilean boosts ( g 3 ) with their corresponding infinitesimal transformations in the Lie algebra placed inside the figure. The shaded red square denotes the original ( x, t ) , while the dotted line represents the same points after the augmentation is applied.
pushing elements of the encodings to be statistically identical:
$$
\mathcal{L}(Z, Z') = \lambda_{\mathrm{inv}} \, \mathcal{L}_{\mathrm{sim}}(Z, Z') + \lambda_{\mathrm{reg}} \left[ \mathcal{L}_{\mathrm{reg}}(Z) + \mathcal{L}_{\mathrm{reg}}(Z') \right], \quad \text{with} \quad \mathcal{L}_{\mathrm{sim}}(Z, Z') = \frac{1}{N} \sum_{i=1}^{N} \lVert Z_{i,:} - Z'_{i,:} \rVert_2^2 , \quad \mathcal{L}_{\mathrm{reg}}(Z) = \left\lVert \operatorname{Cov}(Z) - I_D \right\rVert_F^2 ,
$$
where ∥ · ∥ F denotes the matrix Frobenius norm and λ inv , λ reg ∈ R + are hyperparameters weighting the two terms. In practice, VICReg separates the regularization L reg ( Z ) into two components to handle the diagonal and off-diagonal entries of Cov( Z ) separately. For full details, see Appendix C.
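As a concrete illustration, the objective above can be sketched in a few lines of NumPy. This is a minimal sketch of a VICReg-style loss, not the authors' implementation: the split of L reg into a variance hinge and an off-diagonal covariance penalty follows the original VICReg paper [9], and the default weights (25, 25, 1) are assumptions taken from that paper.

```python
import numpy as np

def vicreg_loss(Z, Zp, lam_inv=25.0, lam_var=25.0, lam_cov=1.0):
    """VICReg-style loss over two batches of embeddings Z, Zp of shape (N, D)."""
    N, D = Z.shape
    # Similarity term: paired rows are two views of the same input.
    sim = np.mean(np.sum((Z - Zp) ** 2, axis=1))

    def regularize(E):
        E = E - E.mean(axis=0)                     # center each dimension
        # Variance term: hinge pushing each dimension's std toward 1.
        std = np.sqrt(E.var(axis=0) + 1e-4)
        var = np.mean(np.maximum(0.0, 1.0 - std))
        # Covariance term: penalize off-diagonal entries of Cov(E).
        cov = (E.T @ E) / (N - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return var, np.sum(off_diag ** 2) / D

    var1, cov1 = regularize(Z)
    var2, cov2 = regularize(Zp)
    return lam_inv * sim + lam_var * (var1 + var2) + lam_cov * (cov1 + cov2)
```

Note how a batch of identical (collapsed) embeddings incurs a large variance penalty even though the similarity term vanishes, which is exactly the trivial solution the regularizer guards against.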
Adapting VICReg to learn from PDE data: Numerical PDE solutions typically come in the form of a tensor of values, along with corresponding spatial and temporal grids. By treating the spatial and temporal information as supplementary channels, we can use existing methods developed for learning image representations. As an illustration, a numerical solution to Burgers' equation consists of a velocity tensor of shape ( t, x ) : a vector of t time values and a vector of x spatial values. We therefore process the sample to obtain a (3 , t, x ) tensor whose last two channels encode the spatial and temporal discretization, which can be naturally fed to neural networks tailored for images, such as ResNets [17]; such convolutional architectures have become ubiquitous in the PDE literature [18, 12]. From these, we extract the representation before the classification layer (which is unused here). While the default VICReg hyper-parameters did not require substantial tuning, probing the quality of our learned representations was crucial for monitoring the pre-training step: SSL loss values are generally not predictive of representation quality, and must therefore be complemented by an evaluation task. In computer vision, this is done by freezing the encoder and using the features to train a linear classifier on ImageNet. In our framework, we pick regression of a PDE coefficient, or regression of the initial conditions when there is no coefficient in the equation. The latter, commonly referred to as the inverse problem, has the advantage of being applicable to any PDE, and is often challenging in the numerical methods community given its ill-posed nature [19]. Our approach for a particular task, kinematic viscosity regression, is schematically illustrated in Fig. 2 (top right). More details on evaluation tasks are provided in Section 4.
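The channel-stacking step described above can be sketched as follows; the channel ordering (solution, space, time) is an illustrative assumption.

```python
import numpy as np

def to_image_tensor(u, x, t):
    """Stack a (t, x) PDE solution with its space/time grids into a (3, t, x) array.

    u: solution values of shape (t, x); x, t: 1-D coordinate vectors.
    """
    # Broadcast the 1-D grids to the full (t, x) shape: T[i, j] = t[i], X[i, j] = x[j].
    T, X = np.meshgrid(t, x, indexing="ij")
    return np.stack([u, X, T], axis=0)

# Example: a 140-step, 256-point solution becomes an image-like (3, 140, 256) sample.
u = np.zeros((140, 256))
sample = to_image_tensor(u, np.linspace(0.0, 16.0, 256), np.linspace(0.0, 140.0, 140))
assert sample.shape == (3, 140, 256)
```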
Augmentations and PDE Symmetry Groups
Background: PDEs formally define a system of equations which depends on derivatives of input variables. Given input space Ω and output space U , a PDE ∆ is a system of equations in independent variables x ∈ Ω , dependent variables u : Ω →U , and derivatives ( u x , u xx , . . . ) of u with respect to x . For example, the Kuramoto-Sivashinsky equation is given by
$$
u_t + u u_x + u_{xx} + u_{xxxx} = 0 .
$$
Informally, a symmetry group G 2 of a PDE ∆ acts on the total space via smooth maps g : Ω ×U → Ω ×U taking solutions of ∆ to other solutions of ∆ . More explicitly, G is contained in the symmetry group of ∆ if outputs of group operations acting on solutions are still solutions of the PDE:
$$
u \text{ solves } \Delta \;\Longrightarrow\; g \cdot u \text{ solves } \Delta, \qquad \forall g \in G .
$$
For PDEs, these symmetry groups can be analytically derived [11] (see also Appendix A for more formal details). The types of symmetries we consider are so-called Lie point symmetries g : Ω ×U → Ω ×U , which act smoothly at any given point in the total space Ω ×U . For the Kuramoto-Sivashinsky PDE, these symmetries take the form depicted in Fig. 3:
$$
g_1(\epsilon): (x, t, u) \mapsto (x, t + \epsilon, u), \qquad
g_2(\epsilon): (x, t, u) \mapsto (x + \epsilon, t, u), \qquad
g_3(\epsilon): (x, t, u) \mapsto (x + \epsilon t, t, u + \epsilon).
$$
As in this example, every Lie point transformation can be written as a one-parameter transform of ϵ ∈ R , where the transformation at ϵ = 0 recovers the identity map and the magnitude of ϵ corresponds to the 'strength' of the corresponding augmentation. 3 Taking the derivative of the transformation at ϵ = 0 with respect to the set of all group transformations recovers the Lie algebra of the group (see Appendix A). Lie algebras are vector spaces with elegant properties (e.g., smooth transformations can be uniquely and exhaustively implemented), so we parameterize augmentations in the Lie algebra and implement the corresponding group operation via the exponential map from the algebra to the group. Details are contained in Appendix B.
PDE symmetry groups as SSL augmentations, and associated challenges: Symmetry groups of PDEs offer a technically sound basis for the implementation of augmentations; nevertheless, without proper considerations and careful tuning, SSL can fail to work successfully [20]. Although we find the marriage of these PDE symmetries with SSL quite natural, there are several subtleties that make this task challenging. Consistent with the image setting, we find that, among the possible augmentations, crops are typically the most effective in building useful representations [21]. Selecting a sensible subset of PDE symmetries requires some care; for example, if one has a particular invariant task in mind (such as regressing viscosity), the Lie symmetries used should neither depend on viscosity nor change the viscosity of the output solution. Moreover, there is no guarantee as to which Lie symmetries are the most 'natural', i.e. most likely to produce solutions that are close to the original data distribution; this is also likely a confounding factor when evaluating their performance. Finally, precise derivations of Lie point symmetries require knowing the governing equation, though a subset of symmetries can usually be derived without knowing its exact form, as certain families of PDEs share Lie point symmetries and many symmetries arise from physical principles and conservation laws.
Sampling symmetries: We parameterize and sample Lie point symmetries in the Lie algebra of the group, to ensure smoothness and universality of the resulting maps in a small region around the identity. To apply the group operation corresponding to an element of the Lie algebra, we use Trotter approximations of the exponential map, which can be efficiently tuned to small error (see Appendix B) [22, 23]. In our experiments, we find that Lie point augmentations applied at relatively small strengths perform best (see Appendix E), as they create sufficiently informative distortions of the input when combined. Finally, boundary conditions further complicate the simplified picture of PDE symmetries, and from a practical perspective, many of the symmetry groups (such as the Galilean boost in Fig. 3) require a careful rediscretization back to a regular grid of sampled points.
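As a minimal sketch of one such augmentation, the Galilean boost g 3 from Fig. 3 can be applied to a discretized 1-D periodic solution and re-interpolated back onto the original grid. This is a simplified direct coordinate transform assuming a uniform periodic grid; the actual pipeline composes generators in the Lie algebra via Trotterized exponential maps (Appendix B), so treat this as illustrative.

```python
import numpy as np

def galilean_boost(u, x, t, eps):
    """Apply g3: (x, t, u) -> (x + eps*t, t, u + eps) to a (t, x) solution array,
    re-interpolating each time slice back onto the original periodic, uniform grid."""
    L = x[-1] - x[0] + (x[1] - x[0])          # spatial period of the grid
    out = np.empty_like(u)
    for i, ti in enumerate(t):
        # The boosted field at grid point x equals u(x - eps*ti, ti) + eps.
        out[i] = np.interp(x - eps * ti, x, u[i], period=L) + eps
    return out
```

The `period` argument of `np.interp` handles the periodic wrap-around at the domain boundary, which is the rediscretization step mentioned above.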
Related Work
In this section, we provide a concise summary of research related to our work, reserving Appendix D for more details. Our study derives inspiration from applications of Self-Supervised Learning (SSL) in building pre-trained foundational models [24]. For physical data, models pre-trained with SSL
2 A group G is a set closed under an associative binary operation containing an identity element e and inverses ( i.e. , e ∈ G and ∀ g ∈ G : g −1 ∈ G ). G acts on a space X if ∀ x ∈ X , ∀ g, h ∈ G : ex = x and ( gh ) x = g ( hx ) .
3 Technically, ϵ is the magnitude and direction of the transformation vector for the basis element of the corresponding generator in the Lie algebra.
have been implemented in areas such as weather and climate prediction [7] and protein tasks [25, 26], but none have previously used the Lie symmetries of the underlying system. The SSL techniques we use are inspired by similar techniques used in image and video analysis [9, 20], with the hopes of learning rich representations that can be used for diverse downstream tasks.
Symmetry groups of PDEs have a rich history of study [11, 13]. Most related to our work, [12] used Lie point symmetries of PDEs as a tool for augmenting PDE datasets in supervised tasks. For some PDEs, previous works have explicitly enforced symmetries or conservation laws, for example by constructing networks equivariant to symmetries of the Navier-Stokes equation [27], parameterizing networks to satisfy a continuity equation [28], or enforcing physical constraints in dynamic mode decomposition [29]. For Hamiltonian systems, various works have designed algorithms that respect the symplectic structure or conservation laws of the Hamiltonian [30, 31].
Experiments
Equations considered: We focus on flow-related equations as a testing ground for our methodology. In our experiments, we consider the four equations below; all are 1D evolution equations except the Navier-Stokes equation, which we consider in its 2D spatial form. For the 1D flow-related equations, we impose periodic boundary conditions with Ω = [0 , L ] × [0 , T ] . For Navier-Stokes, boundary conditions are Dirichlet ( v = 0 ) as in [18]. Symmetries for all equations are listed in Appendix E.
- The viscous Burgers' equation , written in its 'standard' form, is a nonlinear model of dissipative flow given by
$$
\frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} = \nu \frac{\partial^2 u}{\partial x^2} ,
$$
where u ( x, t ) is the velocity and ν ∈ R + is the kinematic viscosity.
- The Korteweg-de Vries (KdV) equation models waves on shallow water surfaces as
$$
\frac{\partial u}{\partial t} + 6 u \frac{\partial u}{\partial x} + \frac{\partial^3 u}{\partial x^3} = 0 .
$$
- The Kuramoto-Sivashinsky (KS) equation is a fourth-order nonlinear PDE given by
$$
\frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} + \frac{\partial^2 u}{\partial x^2} + \frac{\partial^4 u}{\partial x^4} = 0 ,
$$
where u ( x, t ) is the dependent variable. The equation often shows up in reaction-diffusion systems, as well as flame propagation problems.
- The incompressible Navier-Stokes equation in two spatial dimensions is given by
$$
\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla) \mathbf{u} = -\frac{1}{\rho} \nabla p + \nu \nabla^2 \mathbf{u} + \mathbf{f}, \qquad \nabla \cdot \mathbf{u} = 0 ,
$$
where u ( x , t ) is the velocity vector, p ( x , t ) is the pressure, ρ is the fluid density, ν is the kinematic viscosity, and f is an external added force (buoyancy force) that we aim to regress in our experiments.
Solution realizations are generated from analytical solutions in the case of Burgers' equation, or via the pseudo-spectral methods used to generate PDE learning benchmarking data (see Appendix F) [12, 18, 32]. Solutions for Burgers', KdV, and KS are generated following the process of [12], while for Navier-Stokes we use the conditioning dataset from [18]. The respective characteristics of our datasets can be found in Table 1.
Pretraining: For each equation, we pretrain a ResNet18 with our SSL framework for 100 epochs using AdamW [33], a batch size of 32 (64 for Navier-Stokes), and a learning rate of 3e-4. We then freeze its weights. To evaluate the resulting representation, we (i) train a linear head on top of our features on a new set of labeled realizations, and (ii) condition neural networks for time-stepping on our representation. Note that our encoder learns from heterogeneous data in the sense that, for a given equation, we group time evolutions with different parameters and initial conditions.
Table 1: Downstream evaluation of our learned representations for four classical PDEs (averaged over three runs; lower is better ( ↓ )). The normalized mean squared error (NMSE) over a batch of N outputs $\hat{u}_k$ and targets $u_k$ is NMSE $= \frac{1}{N} \sum_{k=1}^{N} \lVert \hat{u}_k - u_k \rVert_2^2 / \lVert \hat{u}_k \rVert_2^2$. Relative error is similarly defined as RE $= \frac{1}{N} \sum_{k=1}^{N} \lVert \hat{u}_k - u_k \rVert_1 / \lVert \hat{u}_k \rVert_1$. For regression tasks, the reported errors with supervised methods are the best performance across runs with Lie symmetry augmentations applied. For time-stepping, we report NMSE for KdV, KS, and Burgers as in [12], and MSE for Navier-Stokes for comparison with [18].
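The two metrics in the caption can be written out directly in NumPy; note that, as in the caption, the normalization is by the prediction rather than the target.

```python
import numpy as np

def nmse(pred, target):
    """NMSE = (1/N) * sum_k ||pred_k - target_k||_2^2 / ||pred_k||_2^2."""
    pred, target = pred.reshape(len(pred), -1), target.reshape(len(target), -1)
    return np.mean(np.sum((pred - target) ** 2, axis=1) / np.sum(pred ** 2, axis=1))

def relative_error(pred, target):
    """RE = (1/N) * sum_k ||pred_k - target_k||_1 / ||pred_k||_1."""
    pred, target = pred.reshape(len(pred), -1), target.reshape(len(target), -1)
    return np.mean(np.sum(np.abs(pred - target), axis=1)
                   / np.sum(np.abs(pred), axis=1))
```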
Equation parameter regression
We consider the task of regressing equation-related coefficients in Burgers' equation and the Navier-Stokes equation from solutions to those PDEs. For KS and KdV, we consider the inverse problem of regressing initial conditions. We train a linear model on top of the pretrained representation for the downstream regression task. For the baseline supervised model, we train the same architecture, i.e. a ResNet18, using the MSE loss on downstream labels. Unless stated otherwise, we train the linear model for 30 epochs using Adam. Further details are in Appendix F.
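Since the encoder is frozen, the downstream fit reduces to a linear model on fixed features. A closed-form ridge regression, used here as a hypothetical stand-in for the 30 epochs of Adam described above, makes the protocol explicit:

```python
import numpy as np

def linear_probe(Y_train, labels, Y_test, ridge=1e-6):
    """Fit a linear head on frozen representations Y_train (N, D) -> labels (N,),
    then predict on Y_test. The small ridge term keeps the normal equations
    well-posed when D is large relative to N."""
    def with_bias(Y):
        return np.hstack([Y, np.ones((len(Y), 1))])
    Yb = with_bias(Y_train)
    W = np.linalg.solve(Yb.T @ Yb + ridge * np.eye(Yb.shape[1]), Yb.T @ labels)
    return with_bias(Y_test) @ W
```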
Kinematic viscosity regression (Burgers): We pretrain a ResNet18 on 10,000 unlabeled realizations of Burgers' equation, and use the resulting features to train a linear model on a smaller, labeled dataset of only 2,000 samples. We compare to the same supervised model (encoder and linear head) trained on the same labeled dataset. The viscosities used range between 0.001 and 0.007 and are sampled uniformly. We can see in Table 1 that we are able to improve over the supervised baseline by leveraging our learned representations. This remains true even when also using Lie point symmetries for the supervised baseline, or when using comparable dataset sizes, as in Figure 4. We also clearly see the ability of our self-supervised approach to leverage larger dataset sizes, whereas we did not see any gain from bigger datasets in the supervised setting.
Initial condition regression (inverse problem): For the KS and KdV PDEs, we aim to solve the inverse problem by regressing initial condition parameters from a snapshot of future time evolutions of the solution. Following [34, 12], for a domain Ω = [0 , L ] , a truncated Fourier series, parameterized by A k , ω k , ϕ k , is used to generate initial conditions:
$$
u_0(x) = \sum_{k=1}^{N} A_k \sin\!\left( \frac{2 \pi \omega_k x}{L} + \phi_k \right) .
$$
Our task is to regress the set of 2 N coefficients { A k , ω k : k ∈ { 1 , . . . , N }} from a snapshot of the solution starting at t = 20 and ending at t = T . This way, the initial conditions and first time steps are never seen during training, making the problem non-trivial. For all conducted tests, N = 10 , A k ∼ U ( − 0 . 5 , 0 . 5) , and ω k ∼ U ( − 0 . 4 , 0 . 4) . By neglecting phase shifts ϕ k , the inverse problem is invariant to Galilean boosts and spatial translations, which we use as augmentations for training our SSL method (see Appendix E). The datasets used for KdV and KS contain 10,000 training samples and 2,500 test samples. As shown in Table 1, the SSL-trained network reduces NMSE by a factor of almost three compared to the supervised baseline. This demonstrates how pre-training via SSL can help to extract the underlying dynamics from a snapshot of a solution.
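The sampling of initial conditions can be sketched as below; the coefficient ranges follow the description above, while the phase range and the exact normalization of the series are assumptions, so treat this as illustrative.

```python
import numpy as np

def sample_initial_condition(x, L, N=10, rng=None):
    """Draw A_k ~ U(-0.5, 0.5), omega_k ~ U(-0.4, 0.4), phi_k ~ U(0, 2*pi)
    and evaluate the truncated Fourier series on the grid x.

    Returns the field u0 and the 2N regression targets (A, omega)."""
    rng = np.random.default_rng() if rng is None else rng
    A = rng.uniform(-0.5, 0.5, N)
    omega = rng.uniform(-0.4, 0.4, N)
    phi = rng.uniform(0.0, 2 * np.pi, N)
    u0 = sum(A[k] * np.sin(2 * np.pi * omega[k] * x / L + phi[k]) for k in range(N))
    return u0, A, omega
```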
Buoyancy magnitude regression: Following [18], our dataset consists of solutions of Navier-Stokes (Equation (8)) where the external buoyancy force, f = ( c x , c y ) ⊤ , is constant in the two spatial directions over the course of a given evolution, and our aim is to regress the magnitude of this force, $\sqrt{c_x^2 + c_y^2}$, given a solution to the PDE. We reuse the dataset generated in [18], where c x = 0 and c y ∼ U (0 . 2 , 0 . 5) . In practice, this gives us 26,624 training samples that we use as our 'unlabeled' dataset, 3,328 samples to train the downstream task on, and 6,592 to evaluate the models. As observed in Table 1, the self-supervised approach significantly outperforms the supervised baseline. Even when looking at the best supervised performance (over 60 runs), or in similar data regimes, as illustrated in Fig. 4, the self-supervised approach consistently performs better and improves further when given larger unlabeled datasets.
Time-stepping
To explore whether learned representations improve time-stepping, we study neural networks that use a sequence of time steps (the 'history') of a PDE to predict a future sequence of steps. For each equation, we consider different conditioning schemes to fit the data modality and to remain comparable to previous work.
Burgers, Korteweg-de Vries, and Kuramoto-Sivashinsky: We time-step on 2,000 unseen samples for each PDE. To do so, we compute a representation of the first 20 input time steps using our frozen encoder and add it as a new channel. The resulting input is fed to a CNN as in [12] to predict the next 20 time steps (illustrated in Fig. 4, bottom right, in the context of Burgers' equation). As shown in Table 1, conditioning the neural network or operator on pre-trained representations reduces the error. Such conditioning noticeably improves performance for KdV and KS, while the results are mixed for Burgers'. A potential explanation is that KdV and KS feature more chaotic behavior than Burgers', leaving more room for improvement.
Navier-Stokes equation: As pointed out in [18], conditioning a neural network or neural operator on the buoyancy helps generalization across different values of this parameter. This is done by embedding the buoyancy and mixing the resulting vector into the model either via addition to the neural operator's hidden activations (denoted 'Addition' in [18]) or, for UNets, via affine transformations of the group normalization layers (denoted 'AdaGN' and originally proposed in [35]). For our main experiment, we use the same modified UNet with 64 channels as in [18] as our neural operator, since it yields the best performance on the Navier-Stokes dataset. To condition the UNet, we compute our representation on the first 16 frames (which are therefore excluded from training) and pass it through a two-layer MLP with a bottleneck of size 1, in order to exploit the ability of our representation to recover the buoyancy with only one linear layer. The resulting output is then added to the conditioning embedding as in [18]. Finally, we choose AdaGN as our conditioning method, since it provides the best results in [18]. We follow a training and evaluation protocol similar to [18], except that we perform 20 epochs with a cosine annealing schedule on 1,664 trajectories instead of 50 epochs; we did not observe a significant difference in results, and this allowed us to explore other architectures and conditioning methods. Additional details are provided in Appendix F. As a baseline, we use the same model without buoyancy conditioning. Both models are conditioned on time. We report the one-step validation MSE on the same time horizons as [18]. Conditioning on our representation outperforms the baseline without conditioning.
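The conditioning path described above can be sketched as follows. This is a minimal numpy stand-in with random placeholder weights: the representation size and embedding width are assumptions, and the real model is a trained MLP inside the UNet of [18].

```python
import numpy as np

def condition_embedding(repr_vec, emb_dim=64, rng=None):
    """Sketch of the conditioning path described above: a frozen SSL
    representation is passed through a two-layer MLP whose hidden width is
    1 (the bottleneck), and the result is added to the model's conditioning
    embedding. Weights are random placeholders, not trained parameters."""
    rng = rng or np.random.default_rng(0)
    d = repr_vec.shape[-1]
    W1 = rng.normal(size=(d, 1)) / np.sqrt(d)  # project to the scalar bottleneck
    W2 = rng.normal(size=(1, emb_dim))         # expand back to the embedding size
    h = np.maximum(repr_vec @ W1, 0.0)         # ReLU at the bottleneck
    return h @ W2                              # added to the time embedding
```

The size-1 bottleneck is the design choice motivated in the text: since the buoyancy is recoverable from the representation with a single linear layer, a scalar hidden unit suffices to carry it.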
We also report results for different architectures and conditioning methods for Navier-Stokes in Table 2 and for Burgers' in Table 8 (Appendix F.1), validating the potential of conditioning on SSL representations for different models. FNO [36] does not perform as well as the other models, partly due to the relatively low number of samples used and the low-resolution nature of the benchmarks. For Navier-Stokes, we also report results obtained when conditioning on both time and the ground-truth buoyancy, which serve as an upper bound on the performance of our method. We conjecture these results can be improved by further increasing the quality of the learned representation, e.g., by training on more samples or through further augmentation tuning. Indeed, the MSE on buoyancy regression obtained with SSL features, albeit significantly lower than that of the supervised baseline, is often still too imprecise to distinguish consecutive buoyancy values in our data.
Analysis
Self-supervised learning outperforms supervised learning for PDEs: While the superiority of self-supervised over supervised representation learning is still an open question in computer vision [37, 38], the former outperforms the latter in the PDE domains we consider. A possible explanation is that
Table 2: One-step validation MSE (scaled by $10^3$) for time-stepping on Navier-Stokes with varying buoyancies, for different combinations of architectures and conditioning methods. Architectures are taken from [18] with the same choice of hyper-parameters. Results with ground-truth buoyancies are an upper bound on the performance of a representation containing information on the buoyancy.

Figure 4: Influence of dataset size on regression tasks. (Left) Kinematic viscosity regression on Burgers' equation. When using Lie point symmetries (LPS) during pretraining, we improve performance over the supervised baselines, even when using an unlabeled dataset that is half the size of the labeled one. As we increase the amount of unlabeled data, performance improves further, reinforcing the usefulness of self-supervised representations. (Right) Buoyancy regression on the Navier-Stokes equation. We notice a similar trend as for Burgers', but found the supervised approach less stable than the self-supervised one. As such, SSL here brings both better performance and more stability.
enforcing similar representations for two different views of the same solution forces the network to learn the underlying dynamics, while supervised objectives (such as regressing the buoyancy) may not provide as informative a signal to the network. Moreover, Fig. 4 illustrates how more pretraining data benefits our SSL setup, whereas in our experiments it did not help the supervised baselines.
Cropping: Cropping is a natural, effective, and popular augmentation in computer vision [21, 39, 40]. In the context of PDE samples, unless specified otherwise, we crop in both the temporal and spatial domains, finding such a procedure necessary for the encoder to learn from the PDE data. Cropping also offers a weaker means of enforcing space and time translation invariance. The exact size of the crops is generally domain dependent and requires tuning. We quantify its effect in Fig. 5 in the context of Navier-Stokes; here, crops must contain as much information as possible while ensuring that pairs of crops overlap as little as possible (to discourage the network from relying on spurious correlations). This explains the two modes appearing in Fig. 5. We make a similar observation for Burgers', while KdV and KS are less sensitive. Finally, crops help bias the network to learn features that are invariant to whether the input was taken near a boundary, thus alleviating the issue of boundary-condition preservation during augmentations.
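The pair-sampling step described above can be sketched as follows. The window sizes are hypothetical and, as noted, need per-equation tuning; the real pipeline would feed both views to the joint-embedding loss.

```python
import numpy as np

def sample_crop_pair(u, ct, cx, rng=None):
    """Sketch of spatiotemporal cropping: draw two random (ct x cx) windows
    from a solution array u[t, x]; the SSL objective treats them as two
    views of the same trajectory."""
    rng = rng or np.random.default_rng()
    T, X = u.shape
    views = []
    for _ in range(2):
        i = rng.integers(0, T - ct + 1)  # random temporal offset
        j = rng.integers(0, X - cx + 1)  # random spatial offset
        views.append(u[i:i + ct, j:j + cx])
    return views
```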
Selecting Lie point augmentations: Whereas cropping alone yields satisfactory representations, Lie point augmentations can enhance performance, but require careful tuning. In order to choose which symmetries to include in our SSL pipeline, and at what strengths to apply them, we study the effectiveness of each Lie augmentation separately. More precisely, given an equation and each possible Lie point augmentation, we train an SSL representation using only that augmentation together with cropping. Then, we compose all Lie augmentations that improve the representation over simply using crops. In order for this composition to stay within the stability/convergence radius of the Lie symmetries, we reduce each augmentation's individually optimal strength by an order of magnitude. Fig. 5 illustrates this process in the context of Navier-Stokes.
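The selection procedure above can be summarized in pseudocode-like Python. The helper names are hypothetical; `evaluate` stands in for a full SSL pretraining and downstream-probe run.

```python
def select_augmentations(candidates, crop_only_score, evaluate):
    """Hypothetical sketch of the selection loop described above.
    `candidates` maps augmentation name -> individually tuned strength;
    `evaluate` pretrains with that single augmentation (plus crops) and
    returns a downstream score where higher is better."""
    kept = {}
    for name, strength in candidates.items():
        if evaluate({name: strength}) > crop_only_score:
            # keep it, but at a tenth of its optimal strength, so the
            # composition stays within the stability radius
            kept[name] = strength / 10.0
    return kept
```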

Figure 5: (Left) Isolating effective augmentations for Navier-Stokes. Note that we do not study $g_3$, $g_7$, and $g_9$, which are respectively the counterparts of $g_2$, $g_6$, and $g_8$ applied in $y$ instead of $x$. (Right) Influence of crop size on performance. Performance is maximized when crops are as large as possible while pairs of crops overlap as little as possible.
Discussion
This work leverages Lie point symmetries for self-supervised representation learning from PDE data. Our preliminary experiments with the Burgers', KdV, KS, and Navier-Stokes equations demonstrate the usefulness of the resulting representation for sample or compute efficient estimation of characteristics and time-stepping. Nevertheless, a number of limitations are present in this work, which we hope can be addressed in the future. The methodology and experiments in this study were confined to a particular set of PDEs, but we believe they can be expanded beyond our setting.
Learning equivariant representations: Another interesting direction is to expand our SSL framework to learning explicitly equivariant features [41, 42]. Learning equivariant representations with SSL could be helpful for time-stepping, perhaps directly in the learned representation space.
Preserving boundary conditions and leveraging other symmetries: Theoretical insights can also help improve the results presented here. Symmetries are generally derived for systems with infinite domains or periodic boundaries. Since boundary conditions violate such symmetries, we observed in our work that we are only able to apply group operations with small strengths. Finding ways to preserve boundary conditions during augmentation, even approximately, would help expand the scope of symmetries available for learning tasks. Moreover, the available symmetry group operations of a given PDE are not solely comprised of Lie point symmetries. Other types of symmetries, such as nonlocal symmetries or approximate symmetries like Lie-Bäcklund symmetries, may also be implemented as potential augmentations [13].
Towards foundation models for PDEs: A natural next step for our framework is to train a common representation on a mixture of data from different PDEs, such as Burgers', KdV, and KS, which are all models of chaotic flow sharing many Lie point symmetries. Our preliminary experiments are encouraging, yet suggest that work beyond the scope of this paper is needed to deal with the differing time and length scales between PDEs.
Extension to other scientific data: In our study, utilizing the structure of PDE solutions as 'exact' SSL augmentations for representation learning has shown significant efficacy over supervised methods. This approach's potential extends beyond the PDEs we study as many problems in mathematics, physics, and chemistry present inherent symmetries that can be harnessed for SSL. Future directions could include implementations of SSL for learning stochastic PDEs, or Hamiltonian systems. In the latter, the rich study of Noether's symmetries in relation to Poisson brackets could be a useful setting to study [11]. Real-world data, as opposed to simulated data, may offer a nice application to the SSL setting we study. Here, the exact form of the equation may not be known and symmetries of
the equations would have to be garnered from basic physical principles (e.g., flow equations have translational symmetries), derived from conservation laws, or potentially learned from data.
Background: PDEs formally define a system of equations that depends on derivatives of input variables. Given an input space $\Omega$ and output space $\mathcal{U}$, a PDE $\Delta$ is a system of equations in independent variables $x \in \Omega$, dependent variables $u : \Omega \to \mathcal{U}$, and derivatives $(u_x, u_{xx}, \dots)$ of $u$ with respect to $x$. For example, the Kuramoto-Sivashinsky equation is given by
$$
u_t + u u_x + u_{xx} + u_{xxxx} = 0.
$$
Informally, a symmetry group $G$ of a PDE acts on the total space via smooth maps $g : \Omega \times \mathcal{U} \to \Omega \times \mathcal{U}$ taking solutions of $\Delta$ to other solutions of $\Delta$. More explicitly, $g$ is contained in the symmetry group of $\Delta$ if outputs of group operations acting on solutions are still solutions of the PDE:
$$
u \text{ solves } \Delta \implies g \cdot u \text{ solves } \Delta, \qquad \forall g \in G.
$$
For PDEs, these symmetry groups can be analytically derived [11] (see also Appendix A for more formal details). The types of symmetries we consider are so-called Lie point symmetries g : Ω ×U → Ω ×U , which act smoothly at any given point in the total space Ω ×U . For the Kuramoto-Sivashinsky PDE, these symmetries take the form depicted in Fig. 3:
$$
g_1(\epsilon) : (x, t, u) \mapsto (x + \epsilon,\, t,\, u), \qquad
g_2(\epsilon) : (x, t, u) \mapsto (x,\, t + \epsilon,\, u), \qquad
g_3(\epsilon) : (x, t, u) \mapsto (x + \epsilon t,\, t,\, u + \epsilon).
$$
As in this example, every Lie point transformation can be written as a one-parameter transform parameterized by $\epsilon \in \mathbb{R}$, where the transformation at $\epsilon = 0$ recovers the identity map and the magnitude of $\epsilon$ corresponds to the 'strength' of the corresponding augmentation. Taking the derivative of the one-parameter transforms at $\epsilon = 0$ recovers the Lie algebra of the group (see Appendix A). Lie algebras are vector spaces with elegant properties (e.g., smooth transformations can be uniquely and exhaustively implemented), so we parameterize augmentations in the Lie algebra and implement the corresponding group operation via the exponential map from the algebra to the group. Details are contained in Appendix B.
PDE symmetry groups as SSL augmentations, and associated challenges: Symmetry groups of PDEs offer a technically sound basis for the implementation of augmentations; nevertheless, without proper considerations and careful tuning, SSL can fail to work successfully [20]. Although we find the marriage of these PDE symmetries with SSL quite natural, several subtleties make this task challenging. Consistent with the image setting, we find that, among the possible augmentations, crops are typically the most effective in building useful representations [21]. Selecting a sensible subset of PDE symmetries requires some care; for example, if one has a particular invariant task in mind (such as regressing the viscosity), the Lie symmetries used should neither depend on the viscosity nor change the viscosity of the output solution. Moreover, there is no guarantee as to which Lie symmetries are the most 'natural', i.e., most likely to produce solutions close to the original data distribution; this is likely a confounding factor when evaluating their performance. Finally, precise derivations of Lie point symmetries require knowing the governing equation, though a subset of symmetries can usually be derived without its exact form, as certain families of PDEs share Lie point symmetries and many symmetries arise from physical principles and conservation laws.
Sampling symmetries: We parameterize and sample from Lie point symmetries in the Lie algebra of the group, to ensure smoothness and universality of resulting maps in some small region around the identity. We use Trotter approximations of the exponential map, which are efficiently tunable to small errors, to apply the corresponding group operation to an element in the Lie algebra (see Appendix B) [22, 23]. In our experiments, we find that Lie point augmentations applied at relatively small strengths perform the best (see Appendix E), as they are enough to create informative distortions of the input when combined. Finally, boundary conditions further complicate the simplified picture of PDE symmetries, and from a practical perspective, many of the symmetry groups (such as the Galilean Boost in Fig. 3) require a careful rediscretization back to a regular grid of sampled points.
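As an example of the rediscretization step, a small-strength Galilean boost of the form $(x, t, u) \mapsto (x + \epsilon t,\, t,\, u + \epsilon)$ can be applied to a gridded solution as follows. This is a sketch assuming a periodic spatial domain and linear re-interpolation, not the paper's exact implementation.

```python
import numpy as np

def galilean_boost(u, x, t, eps):
    """Sketch: apply a Galilean boost (x, t, u) -> (x + eps*t, t, u + eps)
    to a gridded solution u[i, j] = u(t_i, x_j), assuming a periodic
    spatial domain, re-interpolating each time slice back onto the
    regular grid (the rediscretization step mentioned above)."""
    L = x[-1] - x[0] + (x[1] - x[0])  # periodic domain length
    out = np.empty_like(u)
    for i, ti in enumerate(t):
        # a boosted solution satisfies u'(x, t) = u(x - eps*t, t) + eps
        out[i] = np.interp((x - eps * ti) % L, x, u[i], period=L) + eps
    return out
```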
For arbitrary Lie groups, computing the exact exponential map is often not feasible due to the complex nature of the group and its associated Lie algebra. Hence, it is necessary to approximate the exponential map to obtain useful results. Two common methods for approximating the exponential map are the truncation of Taylor series and Lie-Trotter approximations.
Taylor series approximation Given a vector field v in the Lie algebra of the group, the exponential map can be approximated by truncating the Taylor series expansion of exp( v ) . The Taylor series expansion of the exponential map is given by:
$$
\exp(v) = \sum_{i=0}^{\infty} \frac{v^i}{i!} = 1 + v + \frac{v^2}{2} + \frac{v^3}{6} + \cdots .
$$
To approximate the exponential map, we retain a finite number of terms in the series:
$$
\exp(v) \approx \sum_{i=0}^{k} \frac{v^i}{i!},
$$
where k is the order of the truncation. The accuracy of the approximation depends on the number of terms retained in the truncated series and the operator norm ∥ v ∥ . For matrix Lie groups, where v is also a matrix, this operator norm is equivalent to the largest magnitude of the eigenvalues of the matrix [45]. The error associated with truncating the Taylor series after k terms thus decays exponentially with the order of the approximation.
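A minimal numerical illustration of the truncation, for a matrix Lie algebra where the exponential map has a closed form to check against; the rotation generator used here is an illustrative choice, not one of the paper's PDE symmetries.

```python
import numpy as np

def taylor_expm(v, k):
    """Truncated Taylor series for the exponential map of a matrix Lie
    algebra element: exp(v) ~ sum_{i=0}^{k} v^i / i! (illustrative only)."""
    out = np.eye(v.shape[0])
    term = np.eye(v.shape[0])
    for i in range(1, k + 1):
        term = term @ v / i  # v^i / i!, built incrementally
        out = out + term
    return out
```

For the rotation generator $v = \theta \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$, the truncation converges rapidly to the exact rotation matrix as $k$ grows, while low orders leave a visible residual.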
Two drawbacks exist when using the Taylor approximation. First, for a given vector field v , applying v · f to a given function f requires algebraic computation of derivatives. Alternatively, derivatives can also be approximated through finite difference schemes, but this would add an additional source of error. Second, when using the Taylor series to apply a symmetry transformation of a PDE to a starting solution of that PDE, the Taylor series truncation will result in a new function, which is not necessarily a solution of the PDE anymore (although it can be made arbitrarily close to a solution by increasing the truncation order). Lie-Trotter approximations, which we study next, approximate the exponential map by a composition of symmetry operations, thus avoiding these two drawbacks.
Lie-Trotter series approximations The Lie-Trotter approximation is an alternative method for approximating the exponential map, particularly useful when one has direct access to the group elements (i.e., the closed-form output of the exponential map on each Lie algebra generator), but the group is non-commutative. To motivate this method, consider two elements $X$ and $Y$ of the Lie algebra. The Lie-Trotter formula (or Lie product formula) approximates the exponential of their sum [22, 46]:
$$
\exp(X + Y) = \lim_{k \to \infty} \left( \exp\!\left( \frac{X}{k} \right) \exp\!\left( \frac{Y}{k} \right) \right)^{k} \approx \left( \exp\!\left( \frac{X}{k} \right) \exp\!\left( \frac{Y}{k} \right) \right)^{k},
$$
where k is a positive integer controlling the level of approximation.
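The product formula can be checked numerically for a pair of non-commuting matrices. This self-contained sketch uses a high-order Taylor sum as the reference exponential; the specific matrices are illustrative.

```python
import numpy as np

def expm_series(A, terms=40):
    """High-order Taylor sum used as a numpy-only reference for exp(A)."""
    out, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for i in range(1, terms):
        term = term @ A / i
        out = out + term
    return out

def trotter(X, Y, k):
    """First-order Lie-Trotter approximation: (exp(X/k) exp(Y/k))^k."""
    step = expm_series(X / k) @ expm_series(Y / k)
    return np.linalg.matrix_power(step, k)
```

For non-commuting $X$ and $Y$, the error of the product formula shrinks as the number of slices $k$ grows.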
The first-order approximation above can be extended to higher orders, referred to as Lie-Trotter-Suzuki approximations. Though various such approximations exist, we use the following recursive scheme [47, 23] for a given Lie algebra element $v = \sum_{i=1}^{p} v_i$.
$$
T_2(v) = \prod_{i=1}^{p} e^{v_i / 2} \prod_{i=p}^{1} e^{v_i / 2}, \qquad
T_p(v) = \left[ T_{p-2}(u_p v) \right]^2 \, T_{p-2}\big( (1 - 4 u_p) v \big) \, \left[ T_{p-2}(u_p v) \right]^2, \quad u_p = \left( 4 - 4^{1/(p-1)} \right)^{-1},
$$
To apply the above formula, we tune the order parameter $p$ and split the evolution into $r$ segments, applying the approximation $\exp(v) \approx \prod_{i=1}^{r} T_p(v/r)$. For the $p$-th order, the number of stages in the Suzuki formula above is $2 \cdot 5^{p/2 - 1}$, so the total number of stages applied is $2 r \cdot 5^{p/2 - 1}$.
These methods are especially useful in the context of PDEs, as they allow for the approximation of the exponential map while preserving the structure of the Lie algebra and group. Similar techniques are used in the design of splitting methods for numerically solving PDEs [48, 49]. Crucially, these approximations will always provide valid solutions to the PDEs, since each individual group operation in the composition above is itself a symmetry of the PDE. This is in contrast with approximations via Taylor series truncation, which only provide approximate solutions.
As with the Taylor series approximation, the $p$-th order approximation above is accurate to $o(\|v\|^p)$ with suitably selected values of $r$ and $p$ [23]. As a cautionary note, the approximations here may fail to converge when applied to unbounded operators [50, 51]. In practice, we tested a range of bounds on the augmentations and tuned them accordingly (see Appendix E).
Acknowledgements
The authors thank Aaron Lou, Johannes Brandstetter, and Daniel Worrall for helpful feedback and discussions. HL is supported by the Fannie and John Hertz Foundation and the NSF Graduate Fellowship under Grant No. 1745302.
PDE Symmetry Groups and Deriving Generators
Symmetry augmentations encourage invariance of the representations to known symmetry groups of the data. The guiding principle is that inputs that can be obtained from one another via transformations of the symmetry group should share a common representation. In images, such symmetries are known a priori and correspond to flips, resizing, or rotations of the input. In PDEs, these symmetry groups can be derived as Lie groups, commonly denoted as Lie point symmetries, and have been categorized for many common PDEs [11]. An example of the form of such augmentations is given in Figure 6 for a simple PDE that rotates a point in 2-D space. In this example, the PDE exhibits both rotational symmetry and scaling symmetry of the radius of rotation. For arbitrary PDEs, such symmetries can be derived, as explained in more detail below.

Figure 6: Illustration of the PDE symmetry group and invariances of a simple PDE, which rotates a point in 2-D space. The PDE symmetry group here corresponds to scalings of the radius of the rotation and fixed rotations of all the points over time. A sample invariant quantity is the rate of rotation (related to the parameter α in the PDE), which is fixed for any solution to this PDE.
The Lie point symmetry groups of differential equations form a Lie group structure, where elements of the groups are smooth and differentiable transformations. It is typically easier to derive the symmetries of a system of differential equations via the infinitesimal generators of the symmetries, ( i.e., at the level of the derivatives of the one parameter transforms). By using these infinitesimal generators, one can replace nonlinear conditions for the invariance of a function under the group transformation, with an equivalent linear condition of infinitesimal invariance under the respective generator of the group action [11].
In what follows, we give an informal overview to the derivation of Lie point symmetries. Full details and formal rigor can be obtained in Olver [11], Ibragimov [13], among others.
In the setting we consider, a differential equation has a set of $p$ independent variables $x = (x^1, x^2, \dots, x^p) \in \mathbb{R}^p$ and $q$ dependent variables $u = (u^1, u^2, \dots, u^q) \in \mathbb{R}^q$. The solutions take the form $u = f(x)$, where $u^\alpha = f^\alpha(x)$ for $\alpha \in \{1, \dots, q\}$. Solutions form a graph over a domain $\Omega \subset \mathbb{R}^p$:
$$
\Gamma_f = \left\{ (x, f(x)) : x \in \Omega \right\} \subset \Omega \times \mathcal{U}.
$$
In other words, a given solution Γ f forms a p -dimensional submanifold of the space R p × R q .
The $n$-th prolongation of a given smooth function $f$ expands or 'prolongs' the graph of the solution into a larger space that includes derivatives up to $n$-th order. More precisely, if $\mathcal{U} = \mathbb{R}^q$ is the solution space of a given function and $f : \mathbb{R}^p \to \mathcal{U}$, then we introduce the Cartesian product space of the prolongation:
$$
\mathcal{U}^{(n)} = \mathcal{U} \times \mathcal{U}_1 \times \cdots \times \mathcal{U}_n,
$$
where $\mathcal{U}_k = \mathbb{R}^{\dim(k)}$ and $\dim(k) = \binom{p + k - 1}{k}$ is the dimension of the so-called jet space consisting of all $k$-th order derivatives. Given any solution $f : \mathbb{R}^p \to \mathcal{U}$, the prolongation can be calculated by simply computing the corresponding derivatives up to order $n$ (e.g., via a Taylor expansion at each point). For a given function $u = f(x)$, the $n$-th prolongation is denoted $u^{(n)} = \mathrm{pr}^{(n)} f(x)$. As a simple example, for the case of $p = 2$ with independent variables $x$ and $y$, and $q = 1$ with a single dependent variable $f$, the second prolongation is
$$
u^{(2)} = \left( u;\; u_x, u_y;\; u_{xx}, u_{xy}, u_{yy} \right),
$$
which is evaluated at a given point ( x, y ) in the domain. The complete space R p ×U ( n ) is often called the n -th order jet space [11].
A system of differential equations is a set of l differential equations ∆ : R p ×U ( n ) → R l of the independent and dependent variables with dependence on the derivatives up to a maximum order of n :
$$
\Delta_\nu \left( x, u^{(n)} \right) = 0, \qquad \nu = 1, \dots, l.
$$
A smooth solution is thus a function f such that for all points in the domain of x :
$$
\Delta_\nu \left( x, \mathrm{pr}^{(n)} f(x) \right) = 0, \qquad \nu = 1, \dots, l.
$$
In geometric terms, the system of differential equations states where the given map ∆ vanishes on the jet space, and forms a subvariety
$$
Z_\Delta = \left\{ \left( x, u^{(n)} \right) : \Delta \left( x, u^{(n)} \right) = 0 \right\} \subset \mathbb{R}^p \times \mathcal{U}^{(n)}.
$$
Therefore to check if a solution is valid, one can check if the prolongation of the solution falls within the subvariety Z ∆ . As an example, consider the one dimensional heat equation
$$
\Delta \left( x, t, u^{(2)} \right) = u_t - u_{xx} = 0,
$$
$$
Z_\Delta = \left\{ \left( x, t, u^{(2)} \right) : u_t - u_{xx} = 0 \right\}.
$$
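Membership in $Z_\Delta$ can be checked numerically for the heat equation by evaluating the finite-difference residual $u_t - u_{xx}$ of a candidate solution on a grid. This is an illustrative sketch, not part of the paper's pipeline; the grid choices are arbitrary.

```python
import numpy as np

def heat_residual(u, x, t):
    """Central finite-difference residual u_t - u_xx on the interior grid,
    for u given as an array u[i, j] = u(t_i, x_j). A candidate solution
    (approximately) lies in the subvariety where the heat-equation map
    vanishes when this residual is close to zero."""
    dt, dx = t[1] - t[0], x[1] - x[0]
    u_t = (u[2:, 1:-1] - u[:-2, 1:-1]) / (2.0 * dt)
    u_xx = (u[1:-1, 2:] - 2.0 * u[1:-1, 1:-1] + u[1:-1, :-2]) / dx**2
    return u_t - u_xx
```

For example, the exact solution $u(x, t) = e^{-t} \sin(x)$ yields a residual at the level of the discretization error, whereas a non-solution such as $u = x^2$ yields a residual of order one.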
Symmetry Groups and Infinitesimal Invariance
A symmetry group G for a system of differential equations is a set of local transformations to the function which transform one solution of the system of differential equations to another. The group takes the form of a Lie group, where group operations can be expressed as a composition of one-parameter transforms. More rigorously, given the graph of a solution Γ f as defined in Eq. (10), a group operation g ∈ G maps this graph to a new graph
$$
g \cdot \Gamma_f = \left\{ (\tilde{x}, \tilde{u}) = g \cdot (x, u) : (x, u) \in \Gamma_f \right\},
$$
where $(\tilde{x}, \tilde{u})$ label the new coordinates of the solution in the set $g \cdot \Gamma_f$. For example, if $x = (x, t)$, $u = u(x, t)$, and $g$ acts on $(x, u)$ via
$$
g_\epsilon \cdot (x, t, u) = (x + \epsilon t,\, t,\, u + \epsilon),
$$
$$
(g_\epsilon \cdot f)(x, t) = f(x - \epsilon t,\, t) + \epsilon.
$$
Note that the set $g \cdot \Gamma_f$ is not necessarily the graph of a new single-valued function; however, since all transformations are local and smooth, one can ensure transformations are valid in some region near the identity of the group.
As an example, consider the following transformations, which are members of the symmetry group of the differential equation $u_{xx} = 0$. Here, $g_1(t)$ translates the spatial coordinate $x$ by an amount $t$, and $g_2(r)$ scales the output coordinate $u$ by an amount $e^r$:
$$
g_1(t) \cdot (x, u) = (x + t,\, u), \qquad g_2(r) \cdot (x, u) = (x,\, e^r u).
$$
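These two operations on $u_{xx} = 0$ can also be spot-checked numerically: every solution is affine, $u = a x + b$, and both transformed functions below remain affine, hence solutions. The parameter values are illustrative.

```python
import numpy as np

def second_derivative(f, x, h=1e-3):
    """Central-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

# an arbitrary solution u = a*x + b of u_xx = 0, with illustrative constants
a, b, eps, r = 1.7, -0.4, 0.9, 0.5
f = lambda x: a * x + b
g1_f = lambda x: f(x - eps)        # graph translated by eps in x
g2_f = lambda x: np.exp(r) * f(x)  # output scaled by e^r
```

Both `g1_f` and `g2_f` have vanishing second derivative, so they remain solutions of $u_{xx} = 0$.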
It is easy to verify that both of these operations are local and smooth around a region of the identity, as sending r, t → 0 recovers the identity operation. Lie theory allows one to equivalently describe
the potentially nonlinear group operations above with corresponding infinitesimal generators of the group action, corresponding to the Lie algebra of the group. Infinitesimal generators form a vector field over the total space Ω × U , and the group operations correspond to integral flows over that vector field. To map from a single parameter Lie group operation to its corresponding infinitesimal generator, we take the derivative of the single parameter operation at the identity:
$$
\mathbf{v} \big|_{(x, u)} = \frac{d}{d\epsilon} \Big|_{\epsilon = 0}\, g(\epsilon) \cdot (x, u),
$$
where g (0) · ( x, u ) = ( x, u ) .
To map from the infinitesimal generator back to the corresponding group operation, one can apply the exponential map
$$
g(\epsilon) \cdot (x, u) = \exp(\epsilon \mathbf{v}) \cdot (x, u),
$$
where exp : g → G . Here, exp( · ) maps from the Lie algebra, g , to the corresponding Lie group, G . This exponential map can be evaluated using various methods, as detailed in Appendix B and Appendix E.
Returning to the example earlier from Equation (19), the corresponding Lie algebra elements are
$$
\mathbf{v}_1 = \partial_x, \qquad \mathbf{v}_2 = u\, \partial_u.
$$
Informally, Lie algebras simplify notions of invariance, since they allow one to check whether functions or differential equations are invariant to a group at the level of the derivative of that group. In other words, for any vector field corresponding to a Lie algebra element, a given function is invariant to that vector field if the action of the vector field on the function evaluates to zero everywhere. Thus, given a symmetry group, one can determine a set of invariants using the vector fields corresponding to the infinitesimal generators of the group. To determine whether a differential equation is in such a set of invariants, we extend the definition of a prolongation to act on vector fields as
$$
\mathrm{pr}^{(n)} \mathbf{v} \big|_{(x, u^{(n)})} = \frac{d}{d\epsilon} \Big|_{\epsilon = 0}\, \mathrm{pr}^{(n)} \left[ \exp(\epsilon \mathbf{v}) \right] \left( x, u^{(n)} \right).
$$
A given vector field $\mathbf{v}$ is therefore an infinitesimal generator of a symmetry group $G$ of a system of differential equations $\Delta_\nu$, indexed by $\nu \in \{1, \dots, l\}$, if the prolonged vector field annihilates the system on its solution set:
$$
\mathrm{pr}^{(n)} \mathbf{v} \left[ \Delta_\nu \left( x, u^{(n)} \right) \right] = 0, \quad \nu = 1, \dots, l, \quad \text{whenever} \quad \Delta \left( x, u^{(n)} \right) = 0.
$$
For sake of convenience and brevity, we leave out many of the formal definitions behind these concepts and refer the reader to [11] for complete details.
Deriving Generators of the Symmetry Group of a PDE
Since symmetries of differential equations correspond to smooth maps, it is typically easier to derive the particular symmetries of a differential equation via their infinitesimal generators. To derive such generators, we first show how to perform the prolongation of a vector field. As before, assume we have $p$ independent variables $x^1, \dots, x^p$ and $q$ dependent variables $u^1, \dots, u^q$, which are functions of the independent variables. Note that we use superscripts to denote a particular variable. Derivatives with respect to a given variable are denoted via subscripts corresponding to the indices. For example, $u^1_{112}$ denotes the third-order derivative of $u^1$ taken twice with respect to $x^1$ and once with respect to $x^2$. As stated earlier, the prolongation of a vector field is defined as the operation
$$
\mathrm{pr}^{(n)} \mathbf{v} \big|_{(x, u^{(n)})} = \frac{d}{d\epsilon} \Big|_{\epsilon = 0}\, \mathrm{pr}^{(n)} \left[ \exp(\epsilon \mathbf{v}) \right] \left( x, u^{(n)} \right).
$$
To calculate the above, we can evaluate the formula on a vector field written in a generalized form; i.e., any vector field corresponding to the infinitesimal generator of a symmetry takes the general form
$$
\mathbf{v} = \sum_{i=1}^{p} \xi^i(x, u) \frac{\partial}{\partial x^i} + \sum_{\alpha=1}^{q} \phi_\alpha(x, u) \frac{\partial}{\partial u^\alpha}.
$$
Throughout, we will use Greek letter indices for dependent variables and standard letter indices for independent variables. Then, we have that
$$
\mathrm{pr}^{(n)} \mathbf{v} = \mathbf{v} + \sum_{\alpha=1}^{q} \sum_{J} \phi_\alpha^J \left( x, u^{(n)} \right) \frac{\partial}{\partial u^\alpha_J},
$$
where $J$ is a multi-index over the independent variables indicating which derivative appears in $\partial / \partial u^\alpha_J$. Each $\phi_\alpha^J(x, u^{(n)})$ is calculated as
$$
\phi_\alpha^J \left( x, u^{(n)} \right) = D_J \left( \phi_\alpha - \sum_{i=1}^{p} \xi^i u^\alpha_i \right) + \sum_{i=1}^{p} \xi^i u^\alpha_{J, i},
$$
where $u^\alpha_{J, i} = \partial u^\alpha_J / \partial x^i$ and $D_i$ is the total derivative operator with respect to variable $i$, defined as
$$
D_i = \frac{\partial}{\partial x^i} + \sum_{\alpha=1}^{q} \sum_{J} u^\alpha_{J, i} \frac{\partial}{\partial u^\alpha_J}.
$$
After evaluating the coefficients $\phi^J_\alpha(x, u^{(n)})$, we can substitute these values into the definition of the vector field's prolongation in Equation (27). This fully describes the infinitesimal generator of the given PDE, which can be used to evaluate the necessary symmetries of the system of differential equations. An example for Burgers' equation, a canonical PDE, is presented in the following.
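For concreteness, applying Equation (28) with the single multi-index $J = x$ to a vector field $\mathbf{v} = \xi \partial_x + \tau \partial_t + \phi \partial_u$ (one dependent variable $u(x, t)$) gives the standard first-order coefficient; this worked instance uses only the definitions above and follows [11]:

$$
\phi^x = D_x \left( \phi - \xi u_x - \tau u_t \right) + \xi u_{xx} + \tau u_{xt}
       = \phi_x + \left( \phi_u - \xi_x \right) u_x - \tau_x u_t - \xi_u u_x^2 - \tau_u u_x u_t .
$$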
Example: Burgers' Equation
Burgers' equation is a PDE used to describe convection-diffusion phenomena commonly observed in fluid mechanics, traffic flow, and acoustics [43]. The PDE can be written in either its 'potential' form or its 'viscous' form. The potential form is
$$
u_t = u_x^2 + u_{xx}.
$$
Cautionary note: We derive here the symmetries of Burgers' equation in its potential form since this form is more convenient and simpler to study for the sake of an example. The equation we consider in our experiments is the more commonly studied Burgers' equation in its standard form which does not have the same Lie symmetry group (see Table 4). Similar derivations for Burgers' equation in its standard form can be found in example 6.1 of [44].
Following the notation from the previous section, p = 2 and q = 1 . Consequently, the symmetry group of Burgers' equation will be generated by vector fields of the following form
$$
\mathbf{v} = \xi(x, t, u) \frac{\partial}{\partial x} + \tau(x, t, u) \frac{\partial}{\partial t} + \phi(x, t, u) \frac{\partial}{\partial u},
$$
where we wish to determine all possible coefficient functions $\xi(x, t, u)$, $\tau(x, t, u)$, and $\phi(x, t, u)$ such that the resulting one-parameter subgroup $\exp(\varepsilon \mathbf{v})$ is a symmetry group of Burgers' equation.
To evaluate these coefficients, we need to prolong the vector field up to second order, given that the highest-order derivative present in the governing PDE is of order 2. The second prolongation of the vector field can be expressed as
$$
\mathrm{pr}^{(2)} \mathbf{v} = \mathbf{v} + \phi^x \frac{\partial}{\partial u_x} + \phi^t \frac{\partial}{\partial u_t} + \phi^{xx} \frac{\partial}{\partial u_{xx}} + \phi^{xt} \frac{\partial}{\partial u_{xt}} + \phi^{tt} \frac{\partial}{\partial u_{tt}}.
$$
Applying this prolonged vector field to the differential equation in Equation (30), we get the infinitesimal symmetry criteria that
$$
\phi^t = \phi^{xx} + 2 u_x \phi^x.
$$
To evaluate the individual coefficients, we apply Equation (28). Next, we substitute every instance of $u_t$ with $u_x^2 + u_{xx}$, and equate the coefficients of each monomial in the first- and second-order
Table 3: Monomial coefficients in vector field prolongation for Burgers' equation.
derivatives of u to find the pertinent symmetry groups. Table 3 below lists the relevant monomials as well as their respective coefficients.
$$
\xi = k_1 + k_4 x + 2 k_5 t + 4 k_6 x t,
$$
$$
\tau = k_2 + 2 k_4 t + 4 k_6 t^2,
$$
$$
\phi = k_3 - k_5 x - k_6 \left( x^2 + 2 t \right) + \gamma(x, t)\, e^{-u},
$$
where $k_1, \dots, k_6 \in \mathbb{R}$ and $\gamma(x, t)$ is an arbitrary solution of the heat equation. These coefficient functions can be used to generate the infinitesimal symmetries, which are spanned by the six vector fields below:
$$
\begin{aligned}
\mathbf{v}_1 &= \partial_x, \\
\mathbf{v}_2 &= \partial_t, \\
\mathbf{v}_3 &= \partial_u, \\
\mathbf{v}_4 &= x \partial_x + 2 t \partial_t, \\
\mathbf{v}_5 &= 2 t \partial_x - x \partial_u, \\
\mathbf{v}_6 &= 4 x t \partial_x + 4 t^2 \partial_t - \left( x^2 + 2 t \right) \partial_u,
\end{aligned}
$$
as well as the infinite-dimensional subalgebra $\mathbf{v}_\gamma = \gamma(x, t) e^{-u} \partial_u$, where $\gamma(x, t)$ is any arbitrary solution of the heat equation. This reflects the relationship between the heat equation and Burgers' equation: replacing $u$ with $w = e^u$ recovers the Cole-Hopf transformation.
Exponential map and its approximations
As observed in the previous section, symmetry groups are generally derived in the Lie algebra of the group. The exponential map can then be applied, taking elements of this Lie algebra to the corresponding group operations. Working within the Lie algebra of a group provides several benefits. First, a Lie algebra is a vector space, so elements of the Lie algebra can be added and subtracted to yield new elements of the Lie algebra (and the group, via the exponential map). Second, when the generators of the Lie algebra are closed under the Lie bracket (i.e., the generators form a basis for the structure constants of the Lie algebra), any arbitrary Lie point symmetry can be obtained via an element of the Lie algebra (i.e., the exponential map is surjective onto the connected component of the identity) [11]. In contrast, composing group operations in an arbitrary, fixed sequence is not guaranteed to generate every element of the group. Lastly, although not extensively detailed here, the 'strength', or magnitude, of Lie algebra elements can be measured using an appropriately selected norm; for instance, the operator norm of a matrix could be used for matrix Lie algebras.
In certain cases, especially when the element $\mathbf{v}$ of the Lie algebra consists of a single basis element, the exponential map $\exp(\mathbf{v})$ can be calculated explicitly. Applying the group operation to a tuple of independent and dependent variables results in the so-called Lie point transformation, since it is applied at a given point: $\exp(\epsilon \mathbf{v}) \cdot (x, f(x)) \mapsto (x', f(x)')$. Consider the concrete example below from Burgers' equation.
Example B.1 (Exponential map on a symmetry generator of Burgers' equation). Burgers' equation admits the Lie point symmetry $\mathbf{v}_\gamma = \gamma(x, t) e^{-u} \partial_u$, with corresponding group transformation $\exp(\epsilon \mathbf{v}_\gamma) \cdot (x, t, u) = (x, t, \log(e^u + \epsilon \gamma))$.
Proof. This transformation only changes the u component. Here, we have
$$
\exp(\epsilon \mathbf{v}_\gamma)\, u = \sum_{n=0}^{\infty} \frac{\epsilon^n}{n!}\, \mathbf{v}_\gamma^n (u) = u + \epsilon \gamma e^{-u} - \frac{\epsilon^2}{2} \gamma^2 e^{-2u} + \frac{\epsilon^3}{3} \gamma^3 e^{-3u} - \cdots
$$
Applying the series expansion $\log(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots$, we get
$$
\exp(\epsilon \mathbf{v}_\gamma)\, u = u + \log\left( 1 + \epsilon \gamma e^{-u} \right) = \log\left( e^u + \epsilon \gamma \right).
$$
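The closed form in Example B.1 can be sanity-checked numerically against the truncated logarithm series; the values of u, γ, and ε below are arbitrary illustrative choices:

```python
import math

# hypothetical sample values for u, gamma, and the group parameter eps
u, gamma, eps = 0.3, 0.7, 0.05

exact = math.log(math.exp(u) + eps * gamma)  # closed-form transformed value of u

# u + log(1 + x) expanded with x = eps * gamma * e^{-u}, truncated at n = 7
x = eps * gamma * math.exp(-u)
series = u + sum((-1) ** (n + 1) * x ** n / n for n in range(1, 8))

assert abs(exact - series) < 1e-10
```

The agreement follows from the identity $\log(e^u + \epsilon\gamma) = u + \log(1 + \epsilon\gamma e^{-u})$ used in the proof.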
In general, the output of the exponential map cannot be easily calculated as we did above, especially if the vector field v is a weighted sum of various generators. In these cases, we can still apply the exponential map to a desired accuracy using efficient approximation methods, which we discuss next.
Experiments on Burgers' Equation
Realizations of solutions to Burgers' equation were generated using the analytical solution [32] obtained from the heat equation and the Cole-Hopf transform. During generation, the kinematic viscosity ν and the initial conditions were varied.
Representation pretraining We pretrain a representation on subsets of our full dataset containing 10,000 1D time evolutions of Burgers' equation, with kinematic viscosities ν sampled uniformly in the range [0.001, 0.007] and varying initial conditions, following a procedure similar to [12]. We generate solutions of size 224 × 448 in the spatial and temporal dimensions respectively, using the default parameters from [12]. We train a ResNet18 [17] encoder using the VICReg [9] approach to joint embedding SSL, with a smaller projector (width 512) since we use a smaller ResNet than in the original paper. We keep the same variance, invariance and covariance parameters as in [9]. We use the following augmentations and strengths:
Table 7: Generators of the Lie point symmetry group of the incompressible Navier Stokes equation. Here, u, v correspond to the velocity of the fluid in the x and y directions respectively, and p corresponds to the pressure. The last three augmentations correspond to infinite-dimensional Lie subgroups with choices of functions E_x(t), E_y(t), q(t) that depend on t only. For invariant tasks, we only used settings where E_x(t) = E_y(t) = t (linear) or E_x(t) = E_y(t) = t^2 (quadratic), to ensure either invariance of the downstream task or predictable changes in its outputs. These augmentations are listed as numbers 6 to 9.
1 case of g_{E_x} or g_{E_y} where E_x(t) = E_y(t) = t (linear function of t)
We pretrain for 100 epochs using AdamW [33] and a batch size of 32 . Crucially, we assess the quality of the learned representation via linear probing for kinematic viscosity regression, which we detail below.
Kinematic viscosity regression We evaluate the learned representation as follows: the ResNet18 is frozen and used as an encoder to produce features from the training dataset. The features are passed through a linear layer, followed by a sigmoid to constrain the output within [ν_min, ν_max]. The learned model is evaluated on our validation dataset, which comprises 2,000 samples.
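The probe's output head can be sketched as follows; the function name, the feature dimension, and the random stand-ins for frozen encoder features are hypothetical, not from the released code:

```python
import numpy as np

nu_min, nu_max = 0.001, 0.007  # viscosity range of the pretraining data

def predict_viscosity(features, W, b):
    """Linear probe on frozen encoder features, squashed into [nu_min, nu_max]."""
    logits = features @ W + b
    return nu_min + (nu_max - nu_min) / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 16))   # stand-in for frozen ResNet18 features
W, b = rng.normal(size=16), 0.0    # probe parameters (learned in practice)
preds = predict_viscosity(feats, W, b)

assert preds.shape == (5,)
assert np.all((preds > nu_min) & (preds < nu_max))
```

Because the sigmoid maps to the open interval (0, 1), predictions always stay strictly inside the admissible viscosity range.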
Time-stepping We use a 1D CNN solver from [12] as our baseline. This neural solver takes the T_p previous time steps as input to predict the next T_f future ones. Each channel (or spatial axis, if we view the input as a 2D image with one channel) is composed of the realization values u at T_p times, with spatial step size dx and time step size dt. The dimension of the input is therefore (T_p + 2, 224), where the two extra dimensions simply capture the scalars dx and dt. We augment this input with our representation. More precisely, we select the encoder that allows for the most accurate linear regression of ν on our validation dataset, feed it the CNN operator's input, and reduce the resulting representation to dimension d with a learned projection before adding it as supplementary channels to the input, which becomes (T_p + 2 + d, 224).
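The channel layout described above can be sketched as follows; `build_input` and the dummy values are hypothetical, and broadcasting the d-dimensional projected representation across the spatial axis is one plausible reading of "supplementary channels":

```python
import numpy as np

T_p, nx, d = 20, 224, 8  # past steps, spatial resolution, projected repr. dim

def build_input(u_hist, dx, dt, repr_vec):
    """Assemble a (T_p + 2 + d, nx) input: past states, dx/dt rows, repr channels."""
    rows = [u_hist,                                   # (T_p, nx) past realizations
            np.full((1, nx), dx),                     # spatial step as a channel
            np.full((1, nx), dt)]                     # temporal step as a channel
    rows.append(np.tile(repr_vec[:, None], (1, nx)))  # (d, nx) broadcast representation
    return np.concatenate(rows, axis=0)

x = build_input(np.zeros((T_p, nx)), 16.0 / nx, 0.01, np.ones(d))
assert x.shape == (T_p + 2 + d, nx)
```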
We set T_p = 20, T_f = 20, and n_samples = 2,000. We train both models for 20 epochs following the setup from [12]. In addition, we use AdamW with a decaying learning rate and different configurations of 3 runs each:

Table 8: One-step validation NMSE for time-stepping on Burgers for different architectures.
Approximations to the exponential map
For arbitrary Lie groups, computing the exact exponential map is often not feasible due to the complex nature of the group and its associated Lie algebra. Hence, it is necessary to approximate the exponential map to obtain useful results. Two common methods for approximating the exponential map are the truncation of Taylor series and Lie-Trotter approximations.
Taylor series approximation Given a vector field v in the Lie algebra of the group, the exponential map can be approximated by truncating the Taylor series expansion of exp( v ) . The Taylor series expansion of the exponential map is given by:
$$
\exp(\mathbf{v}) = \sum_{i=0}^{\infty} \frac{\mathbf{v}^i}{i!} = 1 + \mathbf{v} + \frac{\mathbf{v}^2}{2!} + \frac{\mathbf{v}^3}{3!} + \cdots
$$
To approximate the exponential map, we retain a finite number of terms in the series:
$$
\exp(\mathbf{v}) \approx \sum_{i=0}^{k} \frac{\mathbf{v}^i}{i!},
$$
where k is the order of the truncation. The accuracy of the approximation depends on the number of terms retained in the truncated series and on the operator norm ∥v∥. For matrix Lie groups, where v is also a matrix, this operator norm is equivalent to the largest magnitude of the eigenvalues of the matrix [45]. The error associated with truncating the Taylor series after k terms thus decays exponentially with the order of the approximation.
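A minimal scalar sketch of this decay (matrices behave analogously under the operator norm); the sample value of v is an arbitrary illustrative choice:

```python
import math

def taylor_exp(v, k):
    # truncated exponential: sum_{i=0}^{k} v^i / i!
    return sum(v ** i / math.factorial(i) for i in range(k + 1))

v = 0.8
errors = [abs(taylor_exp(v, k) - math.exp(v)) for k in (1, 2, 4, 8)]

assert errors == sorted(errors, reverse=True)  # error shrinks as the order grows
assert errors[-1] < 1e-6                       # order 8 is already very accurate here
```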
Two drawbacks exist when using the Taylor approximation. First, for a given vector field v , applying v · f to a given function f requires algebraic computation of derivatives. Alternatively, derivatives can also be approximated through finite difference schemes, but this would add an additional source of error. Second, when using the Taylor series to apply a symmetry transformation of a PDE to a starting solution of that PDE, the Taylor series truncation will result in a new function, which is not necessarily a solution of the PDE anymore (although it can be made arbitrarily close to a solution by increasing the truncation order). Lie-Trotter approximations, which we study next, approximate the exponential map by a composition of symmetry operations, thus avoiding these two drawbacks.
Lie-Trotter series approximations The Lie-Trotter approximation is an alternative method for approximating the exponential map, particularly useful when one has access to group elements directly (i.e., the closed-form output of the exponential map on each Lie algebra generator), but they do not commute. To motivate this method, consider two elements X and Y in the Lie algebra. The Lie-Trotter formula (or Lie product formula) approximates the exponential of their sum [22, 46]:
$$
\exp(X + Y) = \lim_{k \to \infty} \left[ \exp\!\left( \frac{X}{k} \right) \exp\!\left( \frac{Y}{k} \right) \right]^{k} \approx \left[ \exp\!\left( \frac{X}{k} \right) \exp\!\left( \frac{Y}{k} \right) \right]^{k},
$$
where k is a positive integer controlling the level of approximation.
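The product formula can be checked on a pair of non-commuting nilpotent 2×2 matrices, chosen so that the factors exp(X/k) and exp(Y/k) are exact (a sketch for illustration, not the paper's implementation):

```python
import math

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trotter(k):
    # X = [[0,1],[0,0]], Y = [[0,0],[1,0]] are nilpotent, so e^{X/k} = I + X/k exactly
    eX = [[1.0, 1.0 / k], [0.0, 1.0]]
    eY = [[1.0, 0.0], [1.0 / k, 1.0]]
    step = mat_mul(eX, eY)
    out = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(k):
        out = mat_mul(out, step)
    return out

# exact exponential of X + Y = [[0,1],[1,0]]
exact = [[math.cosh(1.0), math.sinh(1.0)], [math.sinh(1.0), math.cosh(1.0)]]

def err(k):
    T = trotter(k)
    return max(abs(T[i][j] - exact[i][j]) for i in range(2) for j in range(2))

assert err(16) < err(2) / 2  # error shrinks as k grows (first order in 1/k)
```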
The first-order approximation above can be extended to higher orders, referred to as Lie-Trotter-Suzuki approximations. Though various such approximations exist, we use the following recursive approximation scheme [47, 23] for a given Lie algebra element $\mathbf{v} = \sum_{i=1}^{p} \mathbf{v}_i$:
$$
\mathcal{T}_2(\mathbf{v}) = \prod_{i=1}^{p} \exp\!\left( \frac{\mathbf{v}_i}{2} \right) \prod_{i=p}^{1} \exp\!\left( \frac{\mathbf{v}_i}{2} \right),
$$

$$
\mathcal{T}_{p}(\mathbf{v}) = \left[ \mathcal{T}_{p-2}\!\left( u_p \mathbf{v} \right) \right]^{2} \mathcal{T}_{p-2}\!\left( (1 - 4 u_p) \mathbf{v} \right) \left[ \mathcal{T}_{p-2}\!\left( u_p \mathbf{v} \right) \right]^{2}, \qquad u_p = \left( 4 - 4^{1/(p-1)} \right)^{-1}.
$$
To apply the above formula, we tune the order parameter p and split the time evolution into r segments, applying the approximation $\exp(\mathbf{v}) \approx \prod_{i=1}^{r} \mathcal{T}_p(\mathbf{v}/r)$. For the p-th order, the number of stages in the Suzuki formula above is $2 \cdot 5^{p/2 - 1}$, so the total number of stages applied is $2r \cdot 5^{p/2 - 1}$.
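The recursion and its stage count can be sketched by tracking only the per-stage time-step coefficients (the generators themselves are omitted); the coefficient $u_p = (4 - 4^{1/(p-1)})^{-1}$ is the standard Suzuki choice and should be treated as an assumption here:

```python
def suzuki_stages(p):
    """Per-stage coefficients of the order-p Suzuki recursion (p even)."""
    if p == 2:
        return [0.5, 0.5]  # one symmetric forward/backward sweep pair
    u = 1.0 / (4.0 - 4.0 ** (1.0 / (p - 1)))
    inner = suzuki_stages(p - 2)
    return ([u * c for c in inner] * 2
            + [(1.0 - 4.0 * u) * c for c in inner]
            + [u * c for c in inner] * 2)

for p in (2, 4, 6):
    stages = suzuki_stages(p)
    assert len(stages) == 2 * 5 ** (p // 2 - 1)  # stage count quoted in the text
    assert abs(sum(stages) - 1.0) < 1e-12        # coefficients telescope to one
```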
These methods are especially useful in the context of PDEs, as they allow for the approximation of the exponential map while preserving the structure of the Lie algebra and group. Similar techniques are used in the design of splitting methods for numerically solving PDEs [48, 49]. Crucially, these approximations will always provide valid solutions to the PDEs, since each individual group operation in the composition above is itself a symmetry of the PDE. This is in contrast with approximations via Taylor series truncation, which only provide approximate solutions.
As with the Taylor series approximation, the p-th order approximation above is accurate to $o(\|\mathbf{v}\|^p)$ with suitably selected values of r and p [23]. As a cautionary note, the approximations here may fail to converge when applied to unbounded operators [50, 51]. In practice, we tested a range of bounds for the augmentations and tuned them accordingly (see Appendix E).
VICReg Loss
In our implementations, we use the VICReg loss as our choice of SSL loss [9]. This loss contains three different terms: a variance term that ensures representations do not collapse to a single point, a covariance term that ensures different dimensions of the representation encode different information, and an invariance term that enforces similarity between the representations of pairs of inputs related by an augmentation. We go through each term in more detail below. Given a distribution T from which to draw augmentations and a set of inputs x_i, the precise algorithm to calculate the VICReg loss for a batch of data is given in Algorithm 1.
Formally, define our embedding matrices as $Z, Z' \in \mathbb{R}^{N \times D}$. Next, we define the similarity criterion $\mathcal{L}_{\text{sim}}$ as
$$
\mathcal{L}_{\text{sim}}(Z, Z') = \frac{1}{N} \sum_{i=1}^{N} \left\| Z_{i,:} - Z'_{i,:} \right\|_2^2,
$$
which we use to match our embeddings and make them invariant to the transformations. To avoid collapse of the representations, we use the original variance and covariance criteria to define our regularization loss $\mathcal{L}_{\text{reg}}$ as
$$
\mathcal{L}_{\text{reg}}(Z) = \mathcal{L}_{\text{var}}(Z) + \mathcal{L}_{\text{cov}}(Z), \qquad \mathcal{L}_{\text{var}}(Z) = \frac{1}{D} \sum_{d=1}^{D} \max\left( 0,\, 1 - \sqrt{\mathrm{Cov}(Z)_{d,d} + \varepsilon} \right),
$$

$$
\mathcal{L}_{\text{cov}}(Z) = \frac{1}{D} \sum_{d \neq d'} \mathrm{Cov}(Z)_{d,d'}^{2}.
$$
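The three VICReg terms described in this section can be sketched in NumPy as follows; this is an illustrative re-implementation, not the authors' PyTorch code, and the 25/25/1 weights are common VICReg defaults used here only as an assumption:

```python
import numpy as np

def vicreg_terms(Z, eps=1e-4):
    """Variance and covariance terms for one batch of embeddings Z of shape (N, D)."""
    N, D = Z.shape
    Zc = Z - Z.mean(axis=0)
    cov = (Zc.T @ Zc) / (N - 1)
    # variance term: hinge on the per-dimension standard deviation
    var_term = np.mean(np.maximum(0.0, 1.0 - np.sqrt(np.diag(cov) + eps)))
    # covariance term: squared off-diagonal entries, normalized by D
    off = cov - np.diag(np.diag(cov))
    cov_term = np.sum(off ** 2) / D
    return var_term, cov_term

def vicreg_loss(Z, Zp, lam_inv=25.0, lam_var=25.0, lam_cov=1.0):
    sim = np.mean((Z - Zp) ** 2)  # invariance term (MSE between paired views)
    reg = sum(lam_var * v_ + lam_cov * c_
              for v_, c_ in (vicreg_terms(M) for M in (Z, Zp)))
    return lam_inv * sim + reg

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 4))
v, c = vicreg_terms(Z)
assert v >= 0.0 and c >= 0.0
# identical views: the invariance term vanishes, leaving only the regularization
assert abs(vicreg_loss(Z, Z) - 2 * (25.0 * v + c)) < 1e-9
```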
Expanded related work
Machine Learning for PDEs Recent work on machine learning for PDEs has considered both invariant prediction tasks [52] and time-series modelling [53, 54]. In the fluid mechanics setting, models learn dynamic viscosities, fluid densities, and/or pressure fields from both simulation and real-world experimental data [55, 56, 57]. For time-dependent PDEs, prior work has investigated the efficacy of convolutional neural networks (CNNs), recurrent neural networks (RNNs), graph neural networks (GNNs), and transformers in learning to evolve the PDE forward in time [34, 58, 59, 60]. This has invoked interest in the development of reduced order models and learned representations for time integration that decrease computational expense, while attempting to maintain solution accuracy. Learning representations of the governing PDE can enable time-stepping in a latent space, where the computational expense is substantially reduced [61]. Recently, for example, Lusch et al. have studied learning the infinite-dimensional Koopman operator to globally linearize latent space dynamics [62]. Kim et al. have employed the Sparse Identification of Nonlinear Dynamics (SINDy) framework to parameterize latent space trajectories and combine them with classical ODE solvers to integrate latent space coordinates to arbitrary points in time [53]. Nguyen et al. have looked at the development of foundation models for climate sciences using transformers pre-trained on well-established climate
datasets [7]. Other methods like dynamic mode decomposition (DMD) are entirely data-driven, and find the best operator to estimate temporal dynamics [63]. Recent extensions of this work have also considered learning equivalent operators, where physical constraints like energy conservation or the periodicity of the boundary conditions are enforced [29].
Self-supervised learning All joint embedding self-supervised learning methods have a similar objective: forming representations across a given domain of inputs that are invariant to a certain set of transformations. Both contrastive and non-contrastive methods are used. Contrastive methods [21, 64, 65, 66, 67] push away unrelated pairs of augmented datapoints, and frequently rely on the InfoNCE criterion [68], although in some cases squared similarities between the embeddings have been employed [69]. Clustering-based methods have also recently emerged [70, 71, 6], where samples are contrasted with cluster centroids rather than with each other. Non-contrastive methods [10, 40, 9, 72, 73, 74, 39] aim to bring together embeddings of positive samples. The primary difference between contrastive and non-contrastive methods lies in how they prevent representational collapse. In the former, contrasting pairs of examples are explicitly pushed apart to avoid collapse. In the latter, the criterion considers the set of embeddings as a whole, encouraging information content maximization to avoid collapse, for example by regularizing the empirical covariance matrix of the embeddings. While there can be differences in practice, both families have been shown to lead to very similar representations [16, 75]. An intriguing feature of many SSL frameworks is the use of a projector neural network after the encoder, on top of which the SSL loss is applied; the projector was introduced in [21]. While the projector is not necessary for these methods to learn a satisfactory representation, it is responsible for an important performance increase. Its exact role is an object of study [76, 15].
We should note that there exist myriad other techniques for building feature spaces and performing unsupervised learning, including metric learning, kernel design, and autoencoders [77, 78, 79, 80, 81]. Many of these works share a similar goal to ours; we opted for SSL due to its proven efficacy in fields like computer vision and the direct analogy offered by data augmentations. One methodology that deserves particular mention is multi-fidelity modeling, which can reduce the dependency on extensive training data for learning physical tasks [82, 83, 84]. The goals of multi-fidelity modeling include training with data of different fidelities [82] and enhancing model accuracy by incorporating high-quality data [85]. In contrast, SSL aims to harness salient features from diverse data sources without being tailored to specific applications. The techniques we employ capitalize on the inherent structure of a dataset, especially through augmentations and invariances.
Equivariant networks and geometric deep learning In the past several years, an extensive literature has explored questions in the so-called realm of geometric deep learning, tying together aspects of group theory, geometry, and deep learning [86]. In one line of work, networks are designed to explicitly encode symmetries via equivariant layers or explicitly symmetric parameterizations [87, 88, 89, 90]. These techniques have found particular application in chemistry- and biology-related problems [91, 92, 93], as well as in learning on graphs [94]. Another line of work considers optimization over layers or networks that are parameterized over a Lie group [95, 96, 97, 98, 99]. Our work does not explicitly encode invariances or structurally parameterize Lie groups into architectures as in many of these works, but instead tries to learn representations that are approximately symmetric and invariant to these group structures via SSL. As mentioned in the main text, techniques for learning equivariant features and maps [41, 42] are perhaps more relevant for future work.
Self-supervised learning
Background: In the joint-embedding framework, input data is transformed into two separate "views", using augmentations that preserve the underlying information in the data. The augmented views are then fed through a learnable encoder, f_θ, producing representations that can be used for downstream tasks. The SSL loss function comprises a similarity loss L_sim between projections (through a projector h_θ, which helps generalization [15]) of the pairs of views, making their representations invariant to augmentations, and a regularization loss L_reg to avoid trivial solutions (such as mapping all inputs to the same representation). The regularization term can consist of a repulsive force between points, or of regularization of the covariance matrix of the embeddings; both function similarly, as shown in [16]. This pretraining procedure is illustrated in Fig. 2 (left) in the context of Burgers' equation.
In this work, we choose variance-invariance-covariance regularization (VICReg) as our self-supervised loss function [9]. Concretely, let $Z, Z' \in \mathbb{R}^{N \times D}$ contain the D-dimensional representations of two batches of N inputs, with D × D centered covariance matrices Cov(Z) and Cov(Z'). Rows $Z_{i,:}$ and $Z'_{i,:}$ are two views of a shared input. The loss over this batch includes a term to enforce similarity ($\mathcal{L}_{\text{sim}}$) and a term to avoid collapse and regularize representations ($\mathcal{L}_{\text{reg}}$) by
Figure 3: One-parameter Lie point symmetries for the Kuramoto-Sivashinsky (KS) PDE. The transformations (left to right) include the unmodified solution (u), temporal shifts (g_1), spatial shifts (g_2), and Galilean boosts (g_3), with their corresponding infinitesimal transformations in the Lie algebra placed inside the figure. The shaded red square denotes the original (x, t) domain, while the dotted line represents the same points after the augmentation is applied.
pushing elements of the encodings to be statistically identical:
$$
\mathcal{L}(Z, Z') = \lambda_{\text{inv}}\, \mathcal{L}_{\text{sim}}(Z, Z') + \lambda_{\text{reg}} \left( \mathcal{L}_{\text{reg}}(Z) + \mathcal{L}_{\text{reg}}(Z') \right),
$$
where ∥ · ∥ F denotes the matrix Frobenius norm and λ inv , λ reg ∈ R + are hyperparameters to weight the two terms. In practice, VICReg separates the regularization L reg ( Z ) into two components to handle diagonal and non-diagonal entries Cov( Z ) separately. For full details, see Appendix C.
Adapting VICReg to learn from PDE data: Numerical PDE solutions typically come in the form of a tensor of values, along with corresponding spatial and temporal grids. By treating the spatial and temporal information as supplementary channels, we can use existing methods developed for learning image representations. As an illustration, a numerical solution to Burgers' equation consists of a velocity tensor with shape (t, x): a vector of t time values and a vector of x spatial values. We therefore process the sample to obtain a (3, t, x) tensor, with the last two channels encoding the spatial and temporal discretization, which can be naturally fed to neural networks tailored for images such as ResNets [17]. From these, we extract the representation before the classification layer (which is unused here). It is worth noting that convolutional neural networks have become ubiquitous in the literature [18, 12]. While the VICReg default hyper-parameters did not require substantial tuning, tuning was crucial for probing the quality of our learned representations to monitor the pre-training step. Indeed, SSL loss values are generally not predictive of the quality of the representation, and must therefore be complemented by an evaluation task. In computer vision, this is done by freezing the encoder and using the features to train a linear classifier on ImageNet. In our framework, we pick regression of a PDE coefficient, or regression of the initial conditions when the equation has no coefficient. The latter, commonly referred to as the inverse problem, has the advantage of being applicable to any PDE, and is often challenging in the numerical methods community given its ill-posed nature [19]. Our approach for a particular task, kinematic viscosity regression, is schematically illustrated in Fig. 2 (top right). More details on evaluation tasks are provided in Section 4.
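The (3, t, x) packing can be sketched with NumPy as follows (the function name is hypothetical; the channel order, with coordinates after the solution values, is one natural convention):

```python
import numpy as np

def to_image_tensor(u, t_coords, x_coords):
    """Pack a (t, x) solution into a (3, t, x) tensor with coordinate channels."""
    tt, xx = np.meshgrid(t_coords, x_coords, indexing="ij")  # both of shape (t, x)
    return np.stack([u, xx, tt], axis=0)

u = np.random.rand(448, 224)  # e.g. 448 time steps, 224 spatial points
sample = to_image_tensor(u, np.linspace(0, 1, 448), np.linspace(0, 16, 224))

assert sample.shape == (3, 448, 224)
assert np.allclose(sample[0], u)
```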
Details on Augmentations
The generators of the Lie point symmetries of the various equations we study are listed below. For symmetry augmentations which distort the periodic grid in space and time, we provide inputs x and t to the network which contain the new spatial and time coordinates after augmentation.
Burgers' equation
As a reminder, the Burgers' equation takes the form
$$
u_t + u u_x - \nu u_{xx} = 0.
$$
Lie point symmetries of the Burgers' equation are listed in Table 4. There are five generators. As we will see, the first three generators corresponding to translations and Galilean boosts are consistent with the other equations we study (KS, KdV, and Navier Stokes) as these are all flow equations.
Comments regarding error in [12] As a cautionary note, the symmetry group given in Table 1 of [12] is incorrect for Burgers' equation in its standard form. Instead, those augmentations are the symmetries of Burgers' equation in its potential form, which is given as:
$$
u_t + \tfrac{1}{2}\, u_x^2 - \nu u_{xx} = 0.
$$
Burgers' equation in its standard form is $v_t + v v_x - \nu v_{xx} = 0$, which can be obtained from the transformation $v = u_x$. The Lie point symmetry group of the equation in its potential form contains more generators than that of the standard form. To apply these generators to the standard form of Burgers' equation, one can convert them via the Cole-Hopf transformation, but this conversion loses the smoothness and locality of some of these transformations (i.e., some are no longer Lie point transformations, although they still describe valid transformations between solutions of the equation's corresponding form).
Note that this discrepancy does not affect their experiments: [12] only consider input data given as solutions to the heat equation, which they subsequently transform into solutions of Burgers' equation via a Cole-Hopf transform. Therefore, in their code, they apply augmentations using the heat equation, for which they have the correct symmetry group. We opted to work only with solutions of Burgers' equation itself, for a slightly fairer comparison to real-world settings, where a convenient transform to a linear PDE such as the Cole-Hopf transform is generally not available.
KdV
Lie point symmetries of the KdV equation are listed in Table 5. Though all the operations listed are valid generators of the symmetry group, only g_1 and g_3 leave the downstream task of the inverse problem invariant (notably, these are independent of any spatial shift). Consequently, during SSL pre-training for the inverse problem, only g_1 and g_3 were used for learning representations. In contrast, for time-stepping, all listed symmetry groups were used.
Table 5: Generators of the Lie point symmetry group of the KdV equation. The only symmetries used in the inverse task of predicting initial conditions are g 1 and g 3 since the other two are not invariant to the downstream task.
KS
Lie point symmetries of the KS equation are listed in Table 6. All of these symmetry generators are shared with the KdV equation (Table 5). As with KdV, only g 1 and g 3 are invariant to the downstream regression task of predicting the initial conditions. For time-stepping, all symmetry groups were used in learning representations.
Table 6: Generators of the Lie point symmetry group of the KS equation. The only symmetries used in the inverse task of predicting initial conditions are g 1 and g 3 since g 2 is not invariant to the downstream task.
Navier-Stokes
Lie point symmetries of the incompressible Navier-Stokes equation are listed in Table 7 [101]. As pressure is not given as an input to any of our networks, the symmetry g q was not included in our implementations. For augmentations g E x and g E y , we restricted attention to linear E x ( t ) = E y ( t ) = t or quadratic E x ( t ) = E y ( t ) = t 2 functions. This restriction maintains invariance to the downstream task of buoyancy force prediction in the linear case, or perturbs the buoyancy magnitude by an easily calculable amount 2 ϵ in the quadratic case. Finally, we fix both the order and steps parameters of our Lie-Trotter approximation to 2 for computational efficiency.
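As a rough sketch of the Lie-Trotter idea used here, the flow of a sum of generators is approximated by interleaving the individual one-parameter flows, with splitting error controlled by the number of steps. The toy composition below illustrates the mechanics only; the generators, `eps`, and the R² state are stand-ins, not our actual PDE augmentations.

```python
import numpy as np

def lie_trotter(flows, eps, steps=2):
    """Approximate exp(eps * (A_1 + ... + A_m)) by interleaving the individual
    one-parameter flows (first-order Lie-Trotter splitting). `flows` is a list
    of callables f(state, s), each applying exp(s * A_i) to the state."""
    def apply(state):
        for _ in range(steps):
            for f in flows:
                state = f(state, eps / steps)
        return state
    return apply

# Toy generators on R^2: a rotation flow and a scaling flow. These happen to
# commute, so the splitting is exact here; in general the error is O(eps^2/steps).
rot = lambda p, s: np.array([[np.cos(s), -np.sin(s)], [np.sin(s), np.cos(s)]]) @ p
scale = lambda p, s: np.exp(s) * p
aug = lie_trotter([rot, scale], eps=0.2, steps=2)
p_new = aug(np.array([1.0, 0.0]))
```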
Experimental details
Whereas we implemented our own pretraining and evaluation (kinematic viscosity, initial conditions and buoyancy) pipelines, we used the data generation and time-stepping code provided on GitHub by [12] for Burgers', KS and KdV, and by [18] for Navier-Stokes (MIT License), with slight modifications to condition the neural operators on our representation. All our code relies on PyTorch. Note that the time-stepping code for Navier-Stokes uses PyTorch Lightning. We report the training cost and hyperparameters for pretraining and time-stepping in Table 9 and Table 10, respectively.
Experiments on Burgers' Equation
Realizations of Burgers' equation were generated using the analytical solution [32] obtained from the Heat equation and the Cole-Hopf transform. During generation, the kinematic viscosity, ν, and the initial conditions were varied.
Representation pretraining We pretrain a representation on subsets of our full dataset containing 10,000 1D time evolutions of Burgers' equation with various kinematic viscosities ν, sampled uniformly in the range [0.001, 0.007], and initial conditions sampled with a procedure similar to [12]. We generate solutions of size 224 × 448 in the spatial and temporal dimensions respectively, using the default parameters from [12]. We train a ResNet18 [17] encoder using the VICReg [9] approach to joint embedding SSL, with a smaller projector (width 512) since we use a smaller ResNet than in the original paper. We keep the same variance, invariance and covariance parameters as in [9]. We use the following augmentations and strengths:
Table 7: Generators of the Lie point symmetry group of the incompressible Navier Stokes equation. Here, u, v correspond to the velocity of the fluid in the x, y direction respectively and p corresponds to the pressure. The last three augmentations correspond to infinite dimensional Lie subgroups with choice of functions E x ( t ) , E y ( t ) , q ( t ) that depend on t only. For invariant tasks, we only used settings where E x ( t ) , E y ( t ) = t (linear) or E x ( t ) , E y ( t ) = t 2 (quadratic) to ensure invariance to the downstream task or predictable changes in the outputs of the downstream task. These augmentations are listed as numbers 6 to 9 .
1 case of g E x or g E y where E x ( t ) = E y ( t ) = t (a linear function of t)
We pretrain for 100 epochs using AdamW [33] and a batch size of 32 . Crucially, we assess the quality of the learned representation via linear probing for kinematic viscosity regression, which we detail below.
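For reference, the VICReg objective we optimize combines an invariance term (MSE between the two views' embeddings), a variance hinge on each embedding dimension, and a covariance penalty. The following is a minimal sketch of that loss; the function name and the default weights are illustrative (see [9] for the exact formulation, and note that we adjust the covariance weight per equation).

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z1, z2, lam_inv=25.0, lam_var=25.0, lam_cov=1.0, eps=1e-4):
    """Minimal VICReg objective on two batches of embeddings (n, d)."""
    n, d = z1.shape
    inv = F.mse_loss(z1, z2)                      # invariance between the two views
    var, cov = 0.0, 0.0
    for z in (z1, z2):
        z = z - z.mean(dim=0)
        std = torch.sqrt(z.var(dim=0) + eps)
        var = var + F.relu(1.0 - std).mean()      # hinge keeps per-dim std near 1
        c = (z.T @ z) / (n - 1)                   # d x d covariance matrix
        off_diag = c - torch.diag(torch.diag(c))
        cov = cov + off_diag.pow(2).sum() / d     # decorrelate embedding dims
    return lam_inv * inv + lam_var * var + lam_cov * cov

loss = vicreg_loss(torch.randn(32, 512), torch.randn(32, 512))
```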
Kinematic viscosity regression We evaluate the learned representation as follows: the ResNet18 is frozen and used as an encoder to produce features from the training dataset. The features are passed through a linear layer, followed by a sigmoid to constrain the output within [ν min , ν max ]. The learned model is evaluated against our validation dataset, which comprises 2,000 samples.
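A minimal sketch of this probing head (the class name is ours, not from our codebase): the sigmoid output is rescaled so predictions are guaranteed to lie in the physically valid range.

```python
import torch
import torch.nn as nn

class BoundedLinearProbe(nn.Module):
    """Linear head on frozen features; a sigmoid rescales the output
    into [lo, hi] (e.g. [nu_min, nu_max] for viscosity regression)."""
    def __init__(self, dim, lo, hi):
        super().__init__()
        self.fc = nn.Linear(dim, 1)
        self.lo, self.hi = lo, hi

    def forward(self, feats):
        return self.lo + (self.hi - self.lo) * torch.sigmoid(self.fc(feats))

probe = BoundedLinearProbe(512, 0.001, 0.007)
nu_hat = probe(torch.randn(4, 512))  # predictions lie in [0.001, 0.007]
```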
Time-stepping We use a 1D CNN solver from [12] as our baseline. This neural solver takes T p previous time steps as input to predict the next T f future ones. Each channel (or spatial axis, if we view the input as a 2D image with one channel) is composed of the realization values, u, at T p times, with spatial step size dx and time step size dt. The dimension of the input is therefore (T p + 2, 224), where the extra two dimensions simply capture the scalars dx and dt. We augment this input with our representation. More precisely, we select the encoder that allows for the most accurate linear regression of ν on our validation dataset, feed it the CNN operator's input, reduce the resulting representation to dimension d with a learned projection, and add it as supplementary channels to the input, which becomes (T p + 2 + d, 224).
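Schematically, the conditioned input can be assembled as follows. The helper names and the choice d = 8 are illustrative assumptions; only the channel layout (T p + 2 + d, 224) follows the text.

```python
import torch
import torch.nn as nn

def condition_input(u_hist, dx, dt, repr_vec, proj):
    """Build the solver input: the (T_p, 224) history, one row each for the
    scalars dx and dt, and the SSL representation projected to d extra
    channels, giving a (T_p + 2 + d, 224) tensor."""
    T_p, nx = u_hist.shape
    dx_row = torch.full((1, nx), dx)
    dt_row = torch.full((1, nx), dt)
    z = proj(repr_vec)                      # (d,) learned projection
    z_rows = z.unsqueeze(1).expand(-1, nx)  # broadcast each component along x
    return torch.cat([u_hist, dx_row, dt_row, z_rows], dim=0)

proj = nn.Linear(512, 8)  # illustrative projection to d = 8 channels
inp = condition_input(torch.randn(20, 224), 0.01, 0.005, torch.randn(512), proj)
```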
We set T p = 20, T f = 20, and n samples = 2,000. We train both models for 20 epochs following the setup from [12]. In addition, we use AdamW with a decaying learning rate and different configurations of 3 runs each:
Table 8: One-step validation NMSE for time-stepping on Burgers for different architectures.
Downstream tasks
We consider the task of regressing equation-related coefficients in Burgers' equation and the Navier-Stokes equation from solutions to those PDEs. For KS and KdV, we consider the inverse problem of regressing initial conditions. We train a linear model on top of the pretrained representation for the downstream regression task. For the baseline supervised model, we train the same architecture, i.e., a ResNet18, using the MSE loss on downstream labels. Unless stated otherwise, we train the linear model for 30 epochs using Adam. Further details are in Appendix F.
Kinematic viscosity regression (Burgers): We pretrain a ResNet18 on 10,000 unlabeled realizations of Burgers' equation, and use the resulting features to train a linear model on a smaller, labeled dataset of only 2,000 samples. We compare to the same supervised model (encoder and linear head) trained on the same labeled dataset. The viscosities are sampled uniformly between 0.001 and 0.007. We can see in Table 1 that we improve over the supervised baseline by leveraging our learned representations. This remains true even when also using Lie point symmetries for the supervised baselines, or when using comparable dataset sizes, as in Figure 4. We also clearly see the ability of our self-supervised approach to leverage larger dataset sizes, whereas we did not see any gain from bigger datasets in the supervised setting.
Initial condition regression (inverse problem): For the KS and KdV PDEs, we aim to solve the inverse problem of regressing initial condition parameters from a snapshot of the solution's subsequent time evolution. Following [34, 12], for a domain Ω = [0, L], a truncated Fourier series, parameterized by A k , ω k , ϕ k , is used to generate initial conditions:
$$
u_0(x) = \sum_{k=1}^{N} A_k \sin\!\left(\frac{2\pi \omega_k x}{L} + \phi_k\right).
$$
Our task is to regress the set of 2N coefficients { A k , ω k : k ∈ { 1, . . . , N }} from a snapshot of the solution spanning t = 20 to t = T. This way, the initial conditions and first time steps are never seen during training, making the problem non-trivial. For all conducted tests, N = 10, A k ∼ U ( -0.5, 0.5 ), and ω k ∼ U ( -0.4, 0.4 ). By neglecting phase shifts, ϕ k , the inverse problem is invariant to Galilean boosts and spatial translations, which we use as augmentations for training our SSL method (see Appendix E). The datasets used for KdV and KS contain 10,000 training samples and 2,500 test samples. As shown in Table 1, the SSL-trained network reduces NMSE by a factor of almost three compared to the supervised baseline. This demonstrates how pre-training via SSL can help to extract the underlying dynamics from a snapshot of a solution.
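A sketch of this initial-condition sampler under the stated distributions; the phase distribution U(0, 2π) is an assumption consistent with the sigmoid bounds used for the regression head, and the function name is ours.

```python
import numpy as np

def sample_initial_condition(x, L, N=10, seed=0):
    """Draw u0(x) = sum_k A_k * sin(2*pi*omega_k*x / L + phi_k) with
    A_k ~ U(-0.5, 0.5), omega_k ~ U(-0.4, 0.4), phi_k ~ U(0, 2*pi)."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-0.5, 0.5, N)
    omega = rng.uniform(-0.4, 0.4, N)
    phi = rng.uniform(0.0, 2 * np.pi, N)
    return sum(A[k] * np.sin(2 * np.pi * omega[k] * x / L + phi[k]) for k in range(N))

L = 128.0
x = np.linspace(0.0, L, 256, endpoint=False)
u0 = sample_initial_condition(x, L)  # the inverse task recovers the A_k, omega_k
```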
Buoyancy magnitude regression: Following [18], our dataset consists of solutions of Navier-Stokes (Equation (8)) where the external buoyancy force, f = ( c x , c y ) ⊤ , is constant in the two spatial directions over the course of a given evolution, and our aim is to regress the magnitude of this force, $\sqrt{c_x^2 + c_y^2}$, given a solution to the PDE. We reuse the dataset generated in [18], where c x = 0 and c y ∼ U (0.2, 0.5). In practice this gives us 26,624 training samples that we use as our 'unlabeled' dataset, 3,328 samples to train the downstream task on, and 6,592 to evaluate the models. As observed in Table 1, the self-supervised approach significantly outperforms the supervised baseline. Even when looking at the best supervised performance (over 60 runs), or in similar data regimes as the supervised baseline as illustrated in Fig. 4, the self-supervised approach consistently performs better and improves further when given larger unlabeled datasets.
Time-stepping
To explore whether learned representations improve time-stepping, we study neural networks that use a sequence of time steps (the 'history') of a PDE to predict a future sequence of steps. For each equation we consider different conditioning schemes, to fit within the data modality and be comparable to previous work.
Burgers, Korteweg-de Vries, and Kuramoto-Sivashinsky: We time-step on 2,000 unseen samples for each PDE. To do so, we compute a representation of the first 20 input time steps using our frozen encoder, and add it as a new channel. The resulting input is fed to a CNN as in [12] to predict the next 20 time steps (illustrated in Fig. 4, bottom right, in the context of Burgers' equation). As shown in Table 1, conditioning the neural network or operator on pre-trained representations slightly reduces the error. Such conditioning noticeably improves performance for KdV and KS, while the results are mixed for Burgers'. A potential explanation is that KdV and KS feature more chaotic behavior than Burgers, leaving more room for improvement.
Navier-Stokes equation: As pointed out in [18], conditioning a neural network or neural operator on the buoyancy helps generalization across different values of this parameter. This is done by embedding the buoyancy, then mixing the resulting vector either via addition to the neural operator's hidden activations (denoted in [18] as 'Addition'), or, for UNets, via affine transformation of group normalization layers (denoted as 'AdaGN' and originally proposed in [35]). For our main experiment, we use the same modified UNet with 64 channels as in [18] for our neural operator, since it yields the best performance on the Navier-Stokes dataset. To condition the UNet, we compute our representation on the first 16 frames (which are therefore excluded from the training), and pass the representation through a two-layer MLP with a bottleneck of size 1, in order to exploit the ability of our representation to recover the buoyancy with only one linear layer. The resulting output is then added to the conditioning embedding as in [18]. Finally, we choose AdaGN as our conditioning method, since it provides the best results in [18]. We follow a similar training and evaluation protocol to [18], except that we perform 20 epochs with a cosine annealing schedule on 1,664 trajectories instead of 50 epochs; we did not observe significant differences in results, and this allowed us to explore other architectures and conditioning methods. Additional details are provided in Appendix F. As a baseline, we use the same model without buoyancy conditioning. Both models are conditioned on time. We report the one-step validation MSE on the same time horizons as [18]. Conditioning on our representation outperforms the baseline without conditioning.
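The bottleneck conditioning described above can be sketched as follows. The layer sizes and the GELU nonlinearity are illustrative assumptions; only the size-1 bottleneck and the addition to the conditioning embedding follow the text.

```python
import torch
import torch.nn as nn

class ReprConditioner(nn.Module):
    """Two-layer MLP with a size-1 bottleneck. Since the buoyancy is (nearly)
    linearly decodable from the SSL representation, squeezing through a single
    unit encourages the conditioning signal to carry roughly that one scalar."""
    def __init__(self, repr_dim, emb_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(repr_dim, 1),   # bottleneck of size 1
            nn.GELU(),
            nn.Linear(1, emb_dim),    # expand back to the embedding width
        )

    def forward(self, z, time_emb):
        # Added to the time-conditioning embedding, then consumed by AdaGN layers.
        return time_emb + self.net(z)

cond = ReprConditioner(512, 256)
e = cond(torch.randn(8, 512), torch.randn(8, 256))
```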
We also report results for different architectures and conditioning methods for Navier-Stokes in Table 2 and Burgers in Table 8 (Appendix F.1), validating the potential of conditioning on SSL representations for different models. FNO [36] does not perform as well as other models, partly due to the relatively low number of samples used and the low-resolution nature of the benchmarks. For Navier-Stokes, we also report results obtained when conditioning on both time and ground-truth buoyancy, which serves as an upper bound on the performance of our method. We conjecture these results can be improved by further increasing the quality of the learned representation, e.g., by training on more samples or through further augmentation tuning. Indeed, the MSE on buoyancy regression obtained by SSL features, albeit significantly lower than that of the supervised baseline, is often still too imprecise to distinguish consecutive buoyancy values in our data.
Experiments on KdV and KS
To obtain realizations of both the KdV and KS PDEs, we apply the method of lines, and compute spatial derivatives using a pseudo-spectral method, in line with the approach taken by [12].
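As an illustration of this method-of-lines setup, the spatial right-hand side of KdV (taken here in the form u_t = -u u_x - u_xxx; the exact scaling in our data generator follows [12]) can be evaluated pseudo-spectrally on a periodic grid, with time integration left to a standard ODE scheme such as RK4.

```python
import numpy as np

def kdv_rhs(u, k):
    """Spatial RHS of KdV, u_t = -u*u_x - u_xxx, with derivatives computed
    pseudo-spectrally on a periodic grid (method of lines)."""
    u_hat = np.fft.fft(u)
    u_x = np.real(np.fft.ifft(1j * k * u_hat))        # first derivative
    u_xxx = np.real(np.fft.ifft(-1j * k**3 * u_hat))  # (1j*k)^3 = -1j*k^3
    return -u * u_x - u_xxx

# Periodic grid and wavenumbers; time integration (e.g. RK4 on kdv_rhs) omitted.
L, n = 50.0, 256
x = np.linspace(0.0, L, n, endpoint=False)
k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)
dudt = kdv_rhs(np.cos(2 * np.pi * x / L), k)
```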
Representation pretraining To train on realizations of KdV, we use the following VICReg parameters: λ var = 25, λ inv = 25, and λ cov = 4. For the KS PDE, λ var and λ inv remain unchanged, with λ cov = 6. The pretraining is performed on a dataset comprised of 10,000 1D time evolutions of each PDE, each generated from the initial conditions described in the main text. Generated solutions were of size 128 × 256 in the spatial and temporal dimensions, respectively. As for Burgers' equation, a ResNet18 encoder in conjunction with a projector of width 512 was used for SSL pretraining. The following augmentations and strengths were applied:
Initial condition regression The quality of the learned representations is evaluated by freezing the ResNet18 encoder, training a separate regression head to predict the values of A k and ω k , and comparing the NMSE to a supervised baseline. The regression head is a fully-connected network whose output dimension matches the number of initial condition coefficients. In addition, a range-constrained sigmoid bounds the output within [ -0.5, 2 π ], where the bounds are informed by the minimum and maximum of the sampled initial condition parameters. Lastly, as for Burgers' equation, the validation dataset comprises 2,000 labeled samples.
Time-stepping The same 1D CNN solver used for Burgers' equation serves as the baseline for time-stepping the KdV and KS PDEs. We select the ResNet18 encoder that provides the most accurate predictions of the initial conditions on our validation set. Here, the input dimension is (T p + 2, 128) to agree with the size of the generated input data. As for Burgers' equation, T p = 20, T f = 20, and n samples = 2,000. Lastly, AdamW with the same learning rate and batch size configurations as for Burgers' equation was used across 3 time-stepping runs each.
A sample visualization with predicted instances of the KdV PDE is provided in Fig. 7 below:

Figure 7: Illustration of the 20 predicted time steps for the KdV PDE. ( Left ) Ground truth data from PDE solver; ( Middle ) Predicted u ( x, t ) using learned representations; ( Right ) Predicted output from using the CNN baseline.
Table 9: List of model hyperparameters and training details for the invariant tasks. Training time includes periodic evaluations during the pretraining.
Experiments on Navier-Stokes
We use the Conditioning dataset for Navier Stokes-2D proposed in [18], consisting of 26,624 2D time evolutions with 56 time steps and various buoyancies ranging approximately uniformly from 0 . 2 to 0 . 5 .
Representation pretraining We train a ResNet18 for 100 epochs with AdamW, a batch size of 64 and a learning rate of 3e-4. We use the same VICReg hyperparameters as for Burgers' Equation. We use the following augmentations and strengths (augmentations whose strength is not specified here are not used):
Buoyancy regression We evaluate the learned representation as follows: the ResNet18 is frozen and used as an encoder to produce features from the training dataset. The features are passed through a linear layer, followed by a sigmoid to constrain the output within [Buoyancy min , Buoyancy max ]. Both the fully supervised baseline (ResNet18 + linear head) and our model (frozen ResNet18 + linear head) are trained on 3,328 unseen samples and evaluated against 6,592 unseen samples.
Time-stepping We mainly depart from [18] by training for 20 epochs on 1,664 trajectories, as we observe the results to be similar; this allows us to explore more combinations of architectures and conditioning methods.
Time-stepping results In addition to results on 1,664 trajectories, we also perform experiments with a bigger training dataset (6,656 trajectories) as in [18], using 20 epochs instead of 50 for computational reasons. We also report results for the two conditioning methods described in [18], Addition and AdaGN. The results can be found in Table 11. As in [18], AdaGN outperforms Addition. Note that AdaGN is needed for our representation conditioning to significantly improve over no conditioning. Finally, we found a very small bottleneck in the MLP that processes the representation to also be crucial for performance, with a size of 1 giving the best results.
Table 10: List of model hyperparameters and training details for the time-stepping tasks.
Table 11: One-step validation MSE × 10⁻³ (↓) for Navier-Stokes for different baselines and conditioning methods, with UNet mod 64 [18] as the base model.
| Equation | KdV | KS | Burgers | Navier-Stokes |
|---|---|---|---|---|
| SSL dataset size | 10,000 | 10,000 | 10,000 | 26,624 |
| Sample format (t, x, (y)) | 256 × 128 | 256 × 128 | 448 × 224 | 56 × 128 × 128 |
| Characteristic of interest | Init. coeffs | Init. coeffs | Kinematic viscosity | Buoyancy |
| Regression metric | NMSE (↓) | NMSE (↓) | Relative error % (↓) | MSE (↓) |
| Supervised | 0.102 ± 0.007 | 0.117 ± 0.009 | 1.18 ± 0.07 | 0.0078 ± 0.0018 |
| SSL repr. + linear head | 0.033 ± 0.004 | 0.042 ± 0.002 | 0.97 ± 0.04 | 0.0038 ± 0.0001 |
| Time-stepping metric | NMSE (↓) | NMSE (↓) | NMSE (↓) | MSE × 10⁻³ (↓) |
| Baseline | 0.508 ± 0.102 | 0.549 ± 0.095 | 0.110 ± 0.008 | 2.37 ± 0.01 |
| + SSL repr. conditioning | 0.330 ± 0.081 | 0.381 ± 0.097 | 0.108 ± 0.011 | 2.35 ± 0.03 |
| Architecture | UNet mod 64 | UNet mod 64 | FNO 128 modes 16 | UF1Net modes 16 |
|---|---|---|---|---|
| Conditioning method | Addition [18] | AdaGN [35] | Spatial-Spectral [18] | Addition [18] |
| Time conditioning only | 2.60 ± 0.05 | 2.37 ± 0.01 | 13.4 ± 0.5 | 3.31 ± 0.06 |
| Time + SSL repr. cond. | 2.47 ± 0.02 | 2.35 ± 0.03 | 13.0 ± 1.0 | 2.37 ± 0.05 |
| Time + true buoyancy cond. | 2.08 ± 0.02 | 2.01 ± 0.02 | 11.4 ± 0.8 | 2.87 ± 0.03 |
| Augmentation | Best strength | Buoyancy MSE |
|---|---|---|
| Crop | N/A | 0.0051 ± 0.0001 |
| Single Lie transform: | | |
| + t translate g1 | 0.1 | 0.0052 ± 0.0001 |
| + x translate g2 | 10.0 | 0.0041 ± 0.0002 |
| + scaling g4 | 1.0 | 0.0050 ± 0.0003 |
| + rotation g5 | 1.0 | 0.0049 ± 0.0001 |
| + boost g6 * | 0.1 | 0.0047 ± 0.0002 |
| + boost g8 ** | 0.1 | 0.0046 ± 0.0001 |
| Combined: + {g2, g5, g6, g8} | best / 10 | 0.0038 ± 0.0001 |
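As a sketch, two augmented "views" of a Navier-Stokes evolution can be produced by composing randomly sampled symmetry operations. Here we illustrate only the periodic time and space translations (g1-style and g2-style), which reduce to array rolls on a periodic grid; the maximum shifts are illustrative, not the tuned strengths from the table above:

```python
import numpy as np

def t_translate(u, shift):
    # periodic shift along the time axis (axis 0)
    return np.roll(u, shift, axis=0)

def x_translate(u, shift):
    # periodic shift along the x axis (axis 1)
    return np.roll(u, shift, axis=1)

def augment(u, rng, max_t=2, max_x=32):
    """Compose randomly sampled translations; each call yields one 'view'."""
    u = t_translate(u, rng.integers(-max_t, max_t + 1))
    u = x_translate(u, rng.integers(-max_x, max_x + 1))
    return u

rng = np.random.default_rng(0)
u = rng.standard_normal((56, 128, 128))  # one evolution, shape (t, x, y)
view1, view2 = augment(u, rng), augment(u, rng)
```

Continuous transforms such as rotations, scalings, and boosts additionally require re-interpolation back onto the regular grid, which is why small strengths are preferred in practice.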
- A PDE Symmetry Groups and Deriving Generators
  - A.1 Symmetry Groups and Infinitesimal Invariance
  - A.2 Deriving Generators of the Symmetry Group of a PDE
  - A.3 Example: Burgers' Equation
- B Exponential map and its approximations
  - B.1 Approximations to the exponential map
- C VICReg Loss
- D Expanded related work
- E Details on Augmentations
  - E.1 Burgers' equation
  - E.2 KdV
  - E.3 KS
  - E.4 Navier-Stokes
- F Experimental details
  - F.1 Experiments on Burgers' Equation
  - F.2 Experiments on KdV and KS
  - F.3 Experiments on Navier-Stokes
| Monomial | Coefficient equation |
|---|---|
| 1 | ϕ_t = ϕ_xx |
| u_x | 2ϕ_x + 2(ϕ_xu − ξ_xx) = −ξ_t − ξ_x |
| u_x² | −τ_xx + (ϕ_uu − 2ξ_xu) = ϕ_u − τ_t − 2τ_x − 2ξ_u |
| u_x³ | −2τ_xu − ξ_uu = −ξ_u − 2τ_u |
| u_x⁴ | −τ_uu = −τ_u |
| u_xx | −τ_xx + (ϕ_u − 2ξ_x) = ϕ_u − τ_t − 2τ_x |
| u_x u_xx | −2τ_xu − 3ξ_u = −ξ_u − 2τ_u |
| u_x² u_xx | −τ_uu − τ_u = −2τ_u |
| u_xx² | −τ_u = −τ_u |
| u_xt | −2τ_x = 0 |
| u_x u_xt | −2τ_u = 0 |
| | Lie algebra generator | Group operation (x, t, u) ↦ |
|---|---|---|
| g1 (space translation) | ϵ∂x | (x + ϵ, t, u) |
| g2 (time translation) | ϵ∂t | (x, t + ϵ, u) |
| g3 (Galilean boost) | ϵ(t∂x + ∂u) | (x + ϵt, t, u + ϵ) |
| g4 (scaling) | ϵ(x∂x + 2t∂t − u∂u) | (e^ϵ x, e^{2ϵ} t, e^{−ϵ} u) |
| g5 (projective) | ϵ(xt∂x + t²∂t + (x − tu)∂u) | (x/(1 − ϵt), t/(1 − ϵt), u + ϵ(x − tu)) |
| | Lie algebra generator | Group operation (x, t, u) ↦ |
|---|---|---|
| g1 (space translation) | ϵ∂x | (x + ϵ, t, u) |
| g2 (time translation) | ϵ∂t | (x, t + ϵ, u) |
| g3 (Galilean boost) | ϵ(t∂x + ∂u) | (x + ϵt, t, u + ϵ) |
| g4 (scaling) | ϵ(x∂x + 3t∂t − 2u∂u) | (e^ϵ x, e^{3ϵ} t, e^{−2ϵ} u) |
| Lie algebra generator | Group operation ( x, t,u ) ↦→ | |
|---|---|---|
| g 1 (space translation) | ϵ∂ x | ( x + ϵ , t,u ) |
| g 2 (time translation) | ϵ∂ t | ( x, t + ϵ ,u ) |
| g 3 (Galilean boost) | ϵ ( t∂ x + ∂ u ) | ( x + ϵt , t, u + ϵ ) |
| Lie algebra generator | Group operation ( x, y, t, u, v, p ) ↦→ | |
|---|---|---|
| g 1 (time translation) | ϵ∂ t | ( x, y, t + ϵ , u, v, p ) |
| g 2 ( x translation) | ϵ∂ x | ( x + ϵ , y, t, u, v, p ) |
| g 3 ( y translation) | ϵ∂ y | ( x, y + ϵ , t, u, v,p ) |
| g 4 (scaling) | ϵ (2 t∂ t + x∂ x + y∂ y - u∂ u - v∂ v - 2 p∂ p ) | ( e ϵ x , e ϵ y , e 2 ϵ t , e - ϵ u , e - ϵ v , e - 2 ϵ p ) |
| g 5 (rotation) | ϵ ( x∂ y - y∂ x + u∂ v - v∂ u ) | ( x cos ϵ - y sin ϵ , x sin ϵ + y cos ϵ , t, u cos ϵ - v sin ϵ , u sin ϵ + v cos ϵ , p ) |
| g 6 ( x linear boost) 1 | ϵ ( t∂ x + ∂ u ) | ( x + ϵt , y, t, u + ϵ , v,p ) |
| g 7 ( y linear boost) 1 | ϵ ( t∂ y + ∂ v ) | ( x, y + ϵt , t, u, v + ϵ , p ) |
| g 8 ( x quadratic boost) 2 | ϵ ( t² ∂ x + 2 t∂ u − 2 x∂ p ) | ( x + ϵt² , y, t, u + 2 ϵt , v, p − 2 ϵx ) |
| g 9 ( y quadratic boost) 2 | ϵ ( t² ∂ y + 2 t∂ v − 2 y∂ p ) | ( x, y + ϵt² , t, u, v + 2 ϵt , p − 2 ϵy ) |
| g E x ( x general boost) 3 | ϵ ( E x ( t ) ∂ x + E ′ x ( t ) ∂ u − xE ′′ x ( t ) ∂ p ) | ( x + ϵE x ( t ) , y, t, u + ϵE ′ x ( t ) , v, p − ϵE ′′ x ( t ) x ) |
| g E y ( y general boost) 3 | ϵ ( E y ( t ) ∂ y + E ′ y ( t ) ∂ v − yE ′′ y ( t ) ∂ p ) | ( x, y + ϵE y ( t ) , t, u, v + ϵE ′ y ( t ) , p − ϵE ′′ y ( t ) y ) |
| g q (additive pressure) 3 | ϵq ( t ) ∂ p | ( x, y, t, u, v, p + ϵq ( t ) ) |
| Architecture | ResNet1d | FNO1d |
|---|---|---|
| Baseline (no conditioning) | 0.110 ± 0.008 | 0.184 ± 0.002 |
| Representation conditioning | 0.108 ± 0.011 | 0.173 ± 0.002 |
| Equation | Burgers' | KdV | KS | Navier-Stokes |
|---|---|---|---|---|
| Network: | ||||
| Model | ResNet18 | ResNet18 | ResNet18 | ResNet18 |
| Embedding Dim. | 512 | 512 | 512 | 512 |
| Optimization: | ||||
| Optimizer | LARS [102] | AdamW | AdamW | AdamW |
| Learning Rate | 0.6 | 0.3 | 0.3 | 3e-4 |
| Batch Size | 32 | 64 | 64 | 64 |
| Epochs | 100 | 100 | 100 | 100 |
| Number of experiments | ∼ 300 | ∼ 30 | ∼ 30 | ∼ 300 |
| Hardware: | ||||
| GPU used | Nvidia V100 | Nvidia M4000 | Nvidia M4000 | Nvidia V100 |
| Training time | ∼ 5 h | ∼ 11 h | ∼ 12 h | ∼ 48 h |
| Equation | Burgers' | KdV | KS | Navier-Stokes |
|---|---|---|---|---|
| Neural Operator: Model | CNN [12] | CNN [12] | CNN [12] | Modified U-Net-64 [18] |
| Optimization: Optimizer | AdamW | AdamW | AdamW | Adam |
| Learning Rate | 1e-4 | 1e-4 | 1e-4 | 2e-4 |
| Batch Size | 16 | 16 | 16 | 32 |
| Epochs | 20 | 20 | 20 | 20 |
| Hardware: GPU used | Nvidia V100 | Nvidia M4000 | Nvidia M4000 | Nvidia V100 (16) |
| Training time | ∼ 1 d | ∼ 2 d | ∼ 2 d | ∼ 1 . 5 d |
| Dataset size | 1,664 | 6,656 |
|---|---|---|
| Methods without ground truth buoyancy: | | |
| Time conditioned, Addition | 2.60 ± 0.05 | 1.18 ± 0.03 |
| Time + Rep. conditioned, Addition (ours) | 2.47 ± 0.02 | 1.17 ± 0.04 |
| Time conditioned, AdaGN | 2.37 ± 0.01 | 1.12 ± 0.02 |
| Time + Rep. conditioned, AdaGN (ours) | 2.35 ± 0.03 | 1.11 ± 0.01 |
| Methods with ground truth buoyancy: | | |
| Time + Buoyancy conditioned, Addition | 2.08 ± 0.02 | 1.10 ± 0.01 |
| Time + Buoyancy conditioned, AdaGN | 2.01 ± 0.02 | 1.06 ± 0.04 |
We consider the task of regressing equation-related coefficients in Burgers' equation and the Navier-Stokes equations from solutions to those PDEs. For KS and KdV we consider the inverse problem of regressing initial conditions. We train a linear model on top of the pretrained representation for the downstream regression task. For the supervised baseline, we train the same architecture, i.e. a ResNet18, using the MSE loss on downstream labels. Unless stated otherwise, we train the linear model for 30 epochs using Adam. Further details are in Appendix F.
Kinematic viscosity regression (Burgers): We pretrain a ResNet18 on 10,000 unlabeled realizations of Burgers' equation, and use the resulting features to train a linear model on a smaller, labeled dataset of only 2,000 samples. We compare to the same supervised model (encoder and linear head) trained on the same labeled dataset. The viscosities are sampled uniformly between 0.001 and 0.007. As shown in Table 1, we improve over the supervised baseline by leveraging our learned representations. This remains true even when the supervised baselines also use Lie point symmetries, or when using comparable dataset sizes, as in Figure 4. Our self-supervised approach also clearly benefits from larger dataset sizes, whereas we saw no gain from larger datasets in the supervised setting.
Initial condition regression (inverse problem): For the KS and KdV PDEs, we aim to solve the inverse problem by regressing initial condition parameters from a snapshot of the later time evolution of the solution. Following [34, 12], for a domain Ω = [0, L], a truncated Fourier series, parameterized by A_k, ω_k, ϕ_k, is used to generate initial conditions:
$$
u_0(x) = \sum_{k=1}^{N} A_k \sin\!\left( \frac{2\pi \omega_k x}{L} + \phi_k \right).
$$
Our task is to regress the set of 2N coefficients {A_k, ω_k : k ∈ {1, ..., N}} from a snapshot of the solution from t = 20 to t = T. This way, the initial conditions and first time steps are never seen during training, making the problem non-trivial. For all tests, N = 10, A_k ∼ U(−0.5, 0.5), and ω_k ∼ U(−0.4, 0.4). By neglecting the phase shifts ϕ_k, the inverse problem is invariant to Galilean boosts and spatial translations, which we use as augmentations for training our SSL method (see Appendix E). The datasets used for KdV and KS contain 10,000 training samples and 2,500 test samples. As shown in Table 1, the SSL-trained network reduces NMSE by a factor of almost three compared to the supervised baseline. This demonstrates how pretraining via SSL can help extract the underlying dynamics from a snapshot of a solution.
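A sketch of how such initial conditions can be generated (the 2π/L normalization inside the sine is an assumption; match it to the benchmark's data-generation code when reproducing):

```python
import numpy as np

def initial_condition(A, omega, phi, L=64.0, nx=256):
    """Truncated Fourier series u0(x) = sum_k A_k sin(2*pi*omega_k*x/L + phi_k).

    The 2*pi/L normalization and the domain length L are assumptions made
    for illustration, not taken from the paper.
    """
    x = np.linspace(0.0, L, nx, endpoint=False)
    u0 = np.zeros_like(x)
    for a, w, p in zip(A, omega, phi):
        u0 += a * np.sin(2.0 * np.pi * w * x / L + p)
    return u0

rng = np.random.default_rng(0)
N = 10
A = rng.uniform(-0.5, 0.5, N)       # amplitudes, as in the text
omega = rng.uniform(-0.4, 0.4, N)   # frequencies, as in the text
phi = rng.uniform(0.0, 2.0 * np.pi, N)
u0 = initial_condition(A, omega, phi)
```

The regression target is then the concatenation of A and omega (the phases phi are discarded, which is what makes the task boost- and translation-invariant).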
Buoyancy magnitude regression: Following [18], our dataset consists of solutions of Navier-Stokes (Equation (8)) where the external buoyancy force f = (c_x, c_y)⊤ is constant in the two spatial directions over the course of a given evolution, and our aim is to regress the magnitude of this force, √(c_x² + c_y²), given a solution to the PDE. We reuse the dataset generated in [18], where c_x = 0 and c_y ∼ U(0.2, 0.5). In practice this gives us 26,624 training samples used as our "unlabeled" dataset, 3,328 samples to train the downstream task, and 6,592 samples to evaluate the models. As shown in Table 1, the self-supervised approach significantly outperforms the supervised baseline. Even against the best supervised performance (over 60 runs), or in similar data regimes as illustrated in Fig. 4, the self-supervised approach consistently performs better and improves further when given larger unlabeled datasets.
Time-stepping
To explore whether learned representations improve time-stepping, we study neural networks that use a sequence of time steps (the "history") of a PDE to predict a future sequence of steps. For each equation, we consider different conditioning schemes chosen to fit the data modality and to be comparable to previous work.
Burgers, Korteweg-de Vries, and Kuramoto-Sivashinsky: We time-step on 2,000 unseen samples for each PDE. To do so, we compute a representation of the first 20 input time steps using our frozen encoder, and add it as a new channel. The resulting input is fed to a CNN as in [12] to predict the next 20 time steps (illustrated in Fig. 4 (bottom right) in the context of Burgers' equation). As shown in Table 1, conditioning the neural network or operator with pretrained representations reduces the error. The conditioning noticeably improves performance for KdV and KS, while results are mixed for Burgers'. A potential explanation is that KdV and KS feature more chaotic behavior than Burgers', leaving more room for improvement.
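A sketch of this channel-based conditioning (the projection of the representation to a single scalar per sample is illustrative; the paper's exact broadcasting may differ):

```python
import torch
import torch.nn as nn

class RepConditioner(nn.Module):
    """Appends a projection of the frozen SSL representation as an extra
    channel to the time-step history. This is a sketch under assumed
    shapes: history (B, T, X) with T past steps as channels, rep (B, D)."""
    def __init__(self, rep_dim: int):
        super().__init__()
        self.proj = nn.Linear(rep_dim, 1)  # illustrative scalar projection

    def forward(self, history: torch.Tensor, rep: torch.Tensor) -> torch.Tensor:
        scalar = self.proj(rep)                                   # (B, 1)
        channel = scalar[:, :, None].expand(-1, 1, history.shape[2])
        return torch.cat([history, channel], dim=1)               # (B, T+1, X)

cond = RepConditioner(rep_dim=512)
history = torch.randn(4, 20, 224)   # 20 past time steps on a 224-point grid
rep = torch.randn(4, 512)           # frozen-encoder representation
out = cond(history, rep)            # shape (4, 21, 224)
```

The CNN time-stepper then consumes the (T+1)-channel input exactly as it would the plain history.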
Navier-Stokes equations: As pointed out in [18], conditioning a neural network or neural operator on the buoyancy helps generalization across different values of this parameter. This is done by embedding the buoyancy before mixing the resulting vector either via addition to the neural operator's hidden activations (denoted "Addition" in [18]), or, for UNets, by affine transformation of group normalization layers (denoted "AdaGN", originally proposed in [35]). For our main experiment, we use the same modified UNet with 64 channels as in [18] for our neural operator, since it yields the best performance on the Navier-Stokes dataset. To condition the UNet, we compute our representation on the first 16 frames (which are therefore excluded from training), and pass it through a two-layer MLP with a bottleneck of size 1, exploiting the ability of our representation to recover the buoyancy with a single linear layer. The resulting output is then added to the conditioning embedding as in [18]. Finally, we choose AdaGN as our conditioning method, since it provides the best results in [18]. We follow a training and evaluation protocol similar to [18], except that we train for 20 epochs with a cosine annealing schedule on 1,664 trajectories instead of 50 epochs, as we did not observe a significant difference in results, and this allowed us to explore other architectures and conditioning methods. Additional details are provided in Appendix F. As a baseline, we use the same model without buoyancy conditioning; both models are conditioned on time. We report the one-step validation MSE on the same time horizons as [18]. Conditioning on our representation outperforms the baseline without conditioning.
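The representation-to-conditioning pathway can be sketched as follows (dimensions are hypothetical; the resulting vector is added to the time embedding that then modulates the group-norm layers, AdaGN-style):

```python
import torch
import torch.nn as nn

class RepToConditioning(nn.Module):
    """Two-layer MLP with a bottleneck of size 1 mapping the SSL
    representation into the UNet's conditioning embedding. Dimensions
    (512, 256) are assumptions for illustration."""
    def __init__(self, rep_dim: int = 512, emb_dim: int = 256):
        super().__init__()
        self.bottleneck = nn.Linear(rep_dim, 1)  # size-1 bottleneck
        self.expand = nn.Linear(1, emb_dim)

    def forward(self, rep: torch.Tensor, time_emb: torch.Tensor) -> torch.Tensor:
        # The bottleneck encourages the MLP to distill a single scalar
        # (ideally the buoyancy) before expanding to the embedding size.
        return time_emb + self.expand(torch.relu(self.bottleneck(rep)))

model = RepToConditioning()
rep = torch.randn(2, 512)       # representation of the first 16 frames
time_emb = torch.randn(2, 256)  # existing time-conditioning embedding
cond = model(rep, time_emb)     # shape (2, 256)
```

The size-1 bottleneck mirrors the observation above that the representation recovers buoyancy with a single linear layer, and was found crucial for performance.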
We also report results for different architectures and conditioning methods for Navier-Stokes in Table 2 and for Burgers in Table 8 (Appendix F.1), validating the potential of conditioning on SSL representations across models. FNO [36] does not perform as well as the other models, partly due to the relatively low number of samples used and the low resolution of the benchmarks. For Navier-Stokes, we also report results obtained when conditioning on both time and ground-truth buoyancy, which serves as an upper bound on the performance of our method. We conjecture that these results can be improved by further increasing the quality of the learned representation, e.g., by training on more samples or through further augmentation tuning. Indeed, the MSE on buoyancy regression obtained by SSL features, though significantly lower than that of the supervised baseline, is often still too imprecise to distinguish consecutive buoyancy values in our data.
Dynamical systems governed by differential equations are ubiquitous in fluid dynamics, chemistry, astrophysics, and beyond. Accurately analyzing and predicting the evolution of such systems is of paramount importance, inspiring decades of innovation in algorithms for numerical methods. However, high-accuracy solvers are often computationally expensive. Machine learning has recently arisen as an alternative method for analyzing differential equations at a fraction of the cost [1, 2, 3]. Typically, the neural network for a given equation is trained on simulations of that same equation, generated by numerical solvers that are high-accuracy but comparatively slow [4]. What if we instead wish to learn from heterogeneous data, e.g., data with missing information, or gathered from actual observation of varied physical systems rather than clean simulations?
For example, we may have access to a dataset of instances of time evolution, stemming from a family of partial differential equations (PDEs) for which important characteristics of the problem, such as viscosity or initial conditions, vary or are unknown. In this case, representations learned from such a large, "unlabeled" dataset could still prove useful in learning to identify unknown characteristics, given only a small dataset "labeled" with viscosities or reaction constants. Alternatively, the "unlabeled" dataset may contain evolutions over very short periods of time, or with missing time intervals; possible goals are then to learn representations that could be useful in filling in these gaps, or in regressing other quantities of interest.
To tackle these broader challenges, we take inspiration from the recent success of self-supervised learning (SSL) as a tool for learning rich representations from large, unlabeled datasets of text and images [5, 6]. Building such representations from and for scientific data is a natural next step in the development of machine learning for science [7]. In the context of PDEs, this corresponds to learning representations from a large dataset of PDE realizations “unlabeled” with key information (such as kinematic viscosity for Burgers’ equation), before applying these representations to solve downstream tasks with a limited amount of data (such as kinematic viscosity regression), as illustrated in Figure 1.
To do so, we leverage the joint embedding framework [8] for self-supervised learning, a popular paradigm for learning visual representations from unlabeled data [9, 10]. It consists of training an encoder to enforce similarity between embeddings of two augmented versions of a given sample, so as to form useful representations. This is guided by the principle that representations suited to downstream tasks (such as image classification) should preserve the common information between the two augmented views. For example, changing the color of an image of a dog preserves its semantic meaning, so we want similar embeddings under this augmentation. Hence, the choice of augmentations is crucial. For visual data, SSL relies on human intuition to build hand-crafted augmentations (e.g., recoloring and cropping), whereas PDEs are endowed with groups of symmetries that preserve their governing equations [11, 12]. These symmetry groups matter because embeddings that are invariant under them can capture the underlying dynamics of the PDE. For example, solutions to certain PDEs with periodic boundary conditions remain valid solutions after translations in time and space. More elaborate equation-specific transformations exist as well, such as Galilean boosts and dilations (see Appendix E). Symmetry groups are well studied for common PDE families, and can be derived systematically or calculated with computer algebra systems via tools from Lie theory [11, 13, 14].
We present a general framework for performing SSL for PDEs using their corresponding symmetry groups. In particular, we show that by exploiting the analytic group transformations from one PDE solution to another, we can use joint embedding methods to generate useful representations from large, heterogeneous PDE datasets. We demonstrate the broad utility of these representations on downstream tasks, including regressing key parameters and time-stepping, on simulated physically-motivated datasets. Our approach is applicable to any family of PDEs, harnesses the well-understood mathematical structure of the equations governing PDE data — a luxury not typically available in non-scientific domains — and demonstrates more broadly the promise of adapting self-supervision to the physical sciences. We hope this work will serve as a starting point for developing foundation models on more complex dynamical systems using our framework.
We now describe our general framework for learning representations from and for diverse sources of PDE data, which can subsequently be used for a wide range of tasks, from regressing characteristics of interest of a PDE sample to improving neural solvers. To this end, we adapt a popular paradigm for representation learning without labels: joint-embedding self-supervised learning.
In the joint-embedding framework, input data is transformed into two separate "views", using augmentations that preserve the underlying information in the data. The augmented views are then fed through a learnable encoder, $f_\theta$, producing representations that can be used for downstream tasks. The SSL loss function comprises a similarity loss $\mathcal{L}_{\text{sim}}$ between projections (through a projector $h_\theta$, which helps generalization [15]) of the pairs of views, to make their representations invariant to augmentations, and a regularization loss $\mathcal{L}_{\text{reg}}$, to avoid trivial solutions (such as mapping all inputs to the same representation). The regularization term can consist of a repulsive force between points, or of regularization of the covariance matrix of the embeddings; both function similarly, as shown in [16]. This pretraining procedure is illustrated in Fig. 2 (left) in the context of Burgers' equation.
In this work, we choose variance-invariance-covariance regularization (VICReg) as our self-supervised loss function [9]. Concretely, let ${\bm{Z}}, {\bm{Z}}' \in \mathbb{R}^{N \times D}$ contain the $D$-dimensional representations of two batches of $N$ inputs, with $D \times D$ centered covariance matrices $\mathrm{Cov}({\bm{Z}})$ and $\mathrm{Cov}({\bm{Z}}')$. Rows ${\bm{Z}}_{i,:}$ and ${\bm{Z}}'_{i,:}$ are two views of a shared input. The loss over this batch includes a term to enforce similarity ($\mathcal{L}_{\text{sim}}$) and a term to avoid collapse and regularize representations ($\mathcal{L}_{\text{reg}}$) by pushing elements of the encodings to be statistically identical:
$$
\mathcal{L}({\bm{Z}}, {\bm{Z}}') = \frac{\lambda_{inv}}{N} \sum_{i=1}^{N} \left\| {\bm{Z}}_{i,:} - {\bm{Z}}'_{i,:} \right\|_2^2 + \lambda_{reg} \left( \left\| \mathrm{Cov}({\bm{Z}}) - \bm{I}_D \right\|_F^2 + \left\| \mathrm{Cov}({\bm{Z}}') - \bm{I}_D \right\|_F^2 \right),
$$
where $\|\cdot\|_F$ denotes the matrix Frobenius norm and $\lambda_{inv}, \lambda_{reg} \in \mathbb{R}^+$ are hyperparameters weighting the two terms. In practice, VICReg separates the regularization $\mathcal{L}_{reg}({\bm{Z}})$ into two components, handling the diagonal and off-diagonal entries of $\mathrm{Cov}({\bm{Z}})$ separately. For full details, see Appendix C.
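A minimal PyTorch sketch of this VICReg-style objective (the weights are placeholders, and the full method splits the regularizer into separate variance and covariance terms [9]):

```python
import torch

def vicreg_loss(z1, z2, lam_inv=25.0, lam_reg=1.0):
    """Simplified VICReg-style loss: an invariance term between paired
    views plus a regularizer pushing each batch covariance toward the
    identity. Weights lam_inv and lam_reg are illustrative."""
    n, d = z1.shape
    sim = ((z1 - z2) ** 2).sum(dim=1).mean()           # invariance term
    reg = 0.0
    for z in (z1, z2):
        zc = z - z.mean(dim=0)                         # center the batch
        cov = (zc.T @ zc) / (n - 1)                    # (D, D) covariance
        reg = reg + ((cov - torch.eye(d)) ** 2).sum()  # ||Cov - I||_F^2
    return lam_inv * sim + lam_reg * reg

z1 = torch.randn(64, 32)           # embeddings of view 1
z2 = z1 + 0.1 * torch.randn(64, 32)  # embeddings of view 2
loss = vicreg_loss(z1, z2)
```

Identical views make the invariance term vanish, while the covariance term alone prevents all inputs from collapsing onto a single point.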
Numerical PDE solutions typically come in the form of a tensor of values, along with corresponding spatial and temporal grids. By treating the spatial and temporal information as supplementary channels, we can use existing methods developed for learning image representations. As an illustration, a numerical solution to Burgers' equation consists of a velocity tensor of shape $(t, x)$: a vector of $t$ time values and a vector of $x$ spatial values. We therefore process the sample to obtain a $(3, t, x)$ tensor whose last two channels encode the spatial and temporal discretization, which can naturally be fed to neural networks tailored for images, such as ResNets [17]. From these, we extract the representation before the classification layer (which is unused here). It is worth noting that convolutional neural networks have become ubiquitous in this literature [18, 12].

While the VICReg default hyperparameters did not require substantial tuning, tuning was crucial for probing the quality of our learned representations and monitoring the pretraining step. Indeed, SSL loss values are generally not predictive of representation quality, and must be complemented by an evaluation task. In computer vision, this is done by freezing the encoder and using the features to train a linear classifier on ImageNet. In our framework, we instead regress a PDE coefficient, or the initial conditions when the equation has no coefficient. The latter, commonly referred to as the inverse problem, has the advantage of being applicable to any PDE, and is often challenging in the numerical methods community given its ill-posed nature [19]. Our approach for a particular task, kinematic viscosity regression, is schematically illustrated in Fig. 2 (top right). More details on evaluation tasks are provided in Section 4.
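Concretely, the preprocessing can be sketched as follows (grid extents are illustrative):

```python
import numpy as np

def to_image_like(u, t, x):
    """Stack a (T, X) Burgers solution with its time and space grids into a
    (3, T, X) tensor, so image encoders (e.g. a ResNet) can consume it."""
    tt, xx = np.meshgrid(t, x, indexing="ij")  # both of shape (T, X)
    return np.stack([u, tt, xx], axis=0)       # (3, T, X)

u = np.random.randn(448, 224)            # velocity field u(t, x)
t = np.linspace(0.0, 1.0, 448)           # temporal grid
x = np.linspace(0.0, 16.0, 224)          # spatial grid
sample = to_image_like(u, t, x)          # shape (3, 448, 224)
```

Carrying the grids as channels means the encoder can remain aware of the discretization even after augmentations that resample the solution.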
PDEs formally define a system of equations which depend on derivatives of input variables. Given input space $\Omega$ and output space $\mathcal{U}$, a PDE $\Delta$ is a system of equations in independent variables ${\bm{x}} \in \Omega$, dependent variables ${\bm{u}}: \Omega \to \mathcal{U}$, and derivatives $({\bm{u}}_{{\bm{x}}}, {\bm{u}}_{{\bm{x}}{\bm{x}}}, \dots)$ of ${\bm{u}}$ with respect to ${\bm{x}}$. For example, the Kuramoto-Sivashinsky equation is given by
$$
u_t + u u_x + u_{xx} + u_{xxxx} = 0.
$$
Informally, a symmetry group $G$ of a PDE¹ acts on the total space via smooth maps $G: \Omega \times \mathcal{U} \to \Omega \times \mathcal{U}$ taking solutions of $\Delta$ to other solutions of $\Delta$. More explicitly, $G$ is contained in the symmetry group of $\Delta$ if, for every $g \in G$, the output of $g$ acting on a solution is still a solution of the PDE.

¹A group $G$ is a set closed under an associative binary operation containing an identity element $e$ and inverses (i.e., $e \in G$ and $\forall g \in G: g^{-1} \in G$). $G$ acts on a space $\mathcal{X}$ if $\forall x \in \mathcal{X}, \forall g, h \in G: ex = x$ and $(gh)x = g(hx)$.
For PDEs, these symmetry groups can be analytically derived [11] (see Appendix A for more formal details). The types of symmetries we consider are so-called Lie point symmetries $g: \Omega \times \mathcal{U} \to \Omega \times \mathcal{U}$, which act smoothly at any given point in the total space $\Omega \times \mathcal{U}$. For the Kuramoto-Sivashinsky PDE, these symmetries take the form depicted in Fig. 3: space translations, time translations, and Galilean boosts.
As in this example, every Lie point transformation can be written as a one-parameter transform of $\epsilon \in \mathbb{R}$, where the transformation at $\epsilon = 0$ recovers the identity map and the magnitude of $\epsilon$ corresponds to the "strength" of the corresponding augmentation.² Taking the derivative of the set of all group transformations at $\epsilon = 0$ recovers the Lie algebra of the group (see Appendix A). Lie algebras are vector spaces with elegant properties (e.g., smooth transformations can be uniquely and exhaustively implemented), so we parameterize augmentations in the Lie algebra and implement the corresponding group operation via the exponential map from the algebra to the group. Details are contained in Appendix B.

²Technically, $\epsilon$ is the magnitude and direction of the transformation vector for the basis element of the corresponding generator in the Lie algebra.
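For instance, the Galilean boost $g_3: (x, t, u) \mapsto (x + \epsilon t, t, u + \epsilon)$ can be applied to a discretized 1D solution by resampling each time slice on the periodic grid. This is a sketch of a single exactly-known group operation; for general generators the paper uses Trotter approximations of the exponential map (Appendix B):

```python
import numpy as np

def galilean_boost(u, x, t, eps):
    """Apply g3: (x, t, u) -> (x + eps*t, t, u + eps) to a (T, X) solution
    on a periodic spatial grid, re-interpolating back onto the original
    x grid. A sketch assuming periodic boundary conditions."""
    L = x[-1] - x[0] + (x[1] - x[0])  # period of the spatial domain
    out = np.empty_like(u)
    for i, ti in enumerate(t):
        # the boosted solution at x equals the original solution at x - eps*t
        shifted = np.mod(x - eps * ti - x[0], L) + x[0]
        out[i] = np.interp(shifted, x, u[i], period=L) + eps
    return out

x = np.linspace(0.0, 64.0, 128, endpoint=False)
t = np.linspace(0.0, 10.0, 56)
u = np.sin(2 * np.pi * x / 64.0)[None, :] * np.ones((56, 1))
u_boosted = galilean_boost(u, x, t, eps=0.0)  # eps = 0 is the identity
```

The rediscretization step (here linear interpolation) is exactly the practical subtlety discussed below: boosted samples no longer live on the original grid and must be resampled.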
PDE symmetry groups as SSL augmentations, and associated challenges: Symmetry groups of PDEs offer a technically sound basis for implementing augmentations; nevertheless, without proper consideration and careful tuning, SSL can fail to work successfully [20]. Although we find the marriage of PDE symmetries with SSL quite natural, several subtleties make this task challenging. Consistent with the image setting, we find that, among the possible augmentations, crops are typically the most effective for building useful representations [21]. Selecting a sensible subset of PDE symmetries requires some care; for example, if one has a particular invariant task in mind (such as regressing viscosity), the Lie symmetries used should neither depend on viscosity nor change the viscosity of the output solution. Moreover, there is no guarantee as to which Lie symmetries are the most "natural", i.e., most likely to produce solutions close to the original data distribution; this is also likely a confounding factor when evaluating their performance. Finally, precise derivations of Lie point symmetries require knowing the governing equation, though a subset of symmetries can usually be derived without its exact form, as certain families of PDEs share Lie point symmetries and many symmetries arise from physical principles and conservation laws.
Sampling symmetries: We parameterize and sample from Lie point symmetries in the Lie algebra of the group, to ensure smoothness and universality of resulting maps in some small region around the identity. We use Trotter approximations of the exponential map, which are efficiently tunable to small errors, to apply the corresponding group operation to an element in the Lie algebra (see Appendix B) [22, 23]. In our experiments, we find that Lie point augmentations applied at relatively small strengths perform the best (see Appendix E), as they are enough to create informative distortions of the input when combined. Finally, boundary conditions further complicate the simplified picture of PDE symmetries, and from a practical perspective, many of the symmetry groups (such as the Galilean Boost in Fig. 3) require a careful rediscretization back to a regular grid of sampled points.
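As a concrete illustration of the Lie-Trotter approximation mentioned above, the sketch below applies a first-order product formula to two non-commuting matrix generators. The generators are chosen purely for illustration; the actual augmentations act on PDE solution fields rather than on 2-D vectors.

```python
import numpy as np

def mat_exp(A, terms=20):
    """Matrix exponential via a truncated Taylor series (adequate for
    the small, well-conditioned generator matrices used here)."""
    out, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

def trotter_exp(A, B, n=64):
    """First-order Lie-Trotter approximation of exp(A + B) as
    (exp(A/n) exp(B/n))^n.  The error is O(1/n) when [A, B] != 0."""
    step = mat_exp(A / n) @ mat_exp(B / n)
    return np.linalg.matrix_power(step, n)

# Illustrative generators: an infinitesimal rotation and a shear
# (deliberately non-commuting, so the Trotter error is visible).
rot = np.array([[0.0, -1.0], [1.0, 0.0]])
shear = np.array([[0.0, 1.0], [0.0, 0.0]])
approx = trotter_exp(0.3 * rot, 0.1 * shear)
exact = mat_exp(0.3 * rot + 0.1 * shear)
print(np.abs(approx - exact).max())  # shrinks as n grows
```

Increasing `n` tightens the approximation, which is what makes the error efficiently tunable.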
In this section, we provide a concise summary of research related to our work, reserving Appendix D for more details. Our study derives inspiration from applications of Self-Supervised Learning (SSL) in building pre-trained foundational models [24]. For physical data, models pre-trained with SSL have been implemented in areas such as weather and climate prediction [7] and protein tasks [25, 26], but none have previously used the Lie symmetries of the underlying system. The SSL techniques we use are inspired by similar techniques used in image and video analysis [9, 20], with the hopes of learning rich representations that can be used for diverse downstream tasks.
Symmetry groups of PDEs have a rich history of study [11, 13]. Most related to our work, [12] used Lie point symmetries of PDEs as a tool for augmenting PDE datasets in supervised tasks. For some PDEs, previous works have explicitly enforced symmetries or conservation laws by for example constructing networks equivariant to symmetries of the Navier Stokes equation [27], parameterizing networks to satisfy a continuity equation [28], or enforcing physical constraints in dynamic mode decomposition [29]. For Hamiltonian systems, various works have designed algorithms that respect the symplectic structure or conservation laws of the Hamiltonian [30, 31].
We focus on flow-related equations here as a testing ground for our methodology. In our experiments, we consider the four equations below, all of which are 1D evolution equations except the Navier-Stokes equation, which we consider in its 2D spatial form. For the 1D flow-related equations, we impose periodic boundary conditions with $\Omega=[0,L]\times[0,T]$. For Navier-Stokes, boundary conditions are Dirichlet ($v=0$) as in [18]. Symmetries for all equations are listed in Appendix E.
The viscous Burgers’ Equation, written in its “standard" form, is a nonlinear model of dissipative flow given by
where $u(x,t)$ is the velocity and $\nu\in\mathbb{R}^{+}$ is the kinematic viscosity.
The Korteweg-de Vries (KdV) equation models waves on shallow water surfaces as
where $u(x,t)$ is the dependent variable. The related Kuramoto-Sivashinsky (KS) equation, a fourth-order model of chaotic flow, often shows up in reaction-diffusion systems, as well as flame propagation problems.
The incompressible Navier-Stokes equation in two spatial dimensions is given by
where $\bm{u}(\bm{x},t)$ is the velocity vector, $p(\bm{x},t)$ is the pressure, $\rho$ is the fluid density, $\nu$ is the kinematic viscosity, and $\bm{f}$ is an external added force (buoyancy force) that we aim to regress in our experiments.
Solution realizations are generated from analytical solutions in the case of Burgers' equation, or with the pseudo-spectral methods used to generate PDE learning benchmarking data (see Appendix F) [12, 18, 32]. Burgers', KdV, and KS solutions are generated following the process of [12], while for Navier-Stokes we use the conditioning dataset from [18]. The respective characteristics of our datasets can be found in Table 1.
For each equation, we pretrain a ResNet18 with our SSL framework for 100 epochs using AdamW [33], a batch size of 32 (64 for Navier-Stokes) and a learning rate of 3e-4. We then freeze its weights. To evaluate the resulting representation, we (i) train a linear head on top of our features and on a new set of labeled realizations, and (ii) condition neural networks for time-stepping on our representation. Note that our encoder learns from heterogeneous data in the sense that for a given equation, we grouped time evolutions with different parameters and initial conditions.
We consider the task of regressing equation-related coefficients in Burgers' equation and the Navier-Stokes equation from solutions to those PDEs. For KS and KdV we consider the inverse problem of regressing initial conditions. We train a linear model on top of the pretrained representation for the downstream regression task. For the baseline supervised model, we train the same architecture, i.e. a ResNet18, using the MSE loss on downstream labels. Unless stated otherwise, we train the linear model for 30 epochs using Adam. Further details are in Appendix F.
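The linear-probe protocol can be sketched as follows. A fixed random projection stands in for the frozen pretrained ResNet18, the data is synthetic, and a closed-form ridge fit replaces the Adam-trained head for brevity; all names and shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_encoder(x, W):
    """Stand-in for the frozen pretrained encoder (the real model is
    a ResNet18): a fixed random projection with a nonlinearity."""
    return np.tanh(x @ W)

def fit_linear_probe(feats, y, reg=1e-3):
    """Closed-form ridge regression for the downstream linear head."""
    d = feats.shape[1]
    return np.linalg.solve(feats.T @ feats + reg * np.eye(d), feats.T @ y)

# Hypothetical data: flattened PDE solutions and scalar targets
# (e.g., kinematic viscosity).
X = rng.normal(size=(2000, 128))
W = rng.normal(size=(128, 64)) / np.sqrt(128)
y = rng.normal(size=2000)

feats = frozen_encoder(X, W)   # encoder weights stay frozen
w = fit_linear_probe(feats, y)
preds = feats @ w
print(preds.shape)
```

Only the linear head sees the labeled data; the encoder itself is never updated during evaluation.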
Kinematic viscosity regression (Burgers): We pretrain a ResNet18 on 10,000 unlabeled realizations of Burgers' equation, and use the resulting features to train a linear model on a smaller, labeled dataset of only 2,000 samples. We compare to the same supervised model (encoder and linear head) trained on the same labeled dataset. The viscosities used range between 0.001 and 0.007 and are sampled uniformly. We can see in Table 1 that we are able to improve over the supervised baseline by leveraging our learned representations. This remains true even when also using Lie point symmetries for the supervised baselines or when using comparable dataset sizes, as in Figure 4. We also clearly see the ability of our self-supervised approach to leverage larger dataset sizes, whereas we did not see any gain when going to larger datasets in the supervised setting.
Initial condition regression (inverse problem): For the KS and KdV PDEs, we aim to solve the inverse problem by regressing initial condition parameters from a snapshot of future time evolutions of the solution. Following [34, 12], for a domain $\Omega=[0,L]$, a truncated Fourier series, parameterized by $A_k, \omega_k, \phi_k$, is used to generate initial conditions:
Our task is to regress the set of $2N$ coefficients $\{A_k,\omega_k : k\in\{1,\dots,N\}\}$ from a snapshot of the solution starting at $t=20$ up to $t=T$. This way, the initial conditions and first time steps are never seen during training, making the problem non-trivial. For all conducted tests, $N=10$, $A_k\sim\mathcal{U}(-0.5,0.5)$, and $\omega_k\sim\mathcal{U}(-0.4,0.4)$. By neglecting phase shifts $\phi_k$, the inverse problem is invariant to Galilean boosts and spatial translations, which we use as augmentations for training our SSL method (see Appendix E). The datasets used for KdV and KS contain 10,000 training samples and 2,500 test samples. As shown in Table 1, the SSL-trained network reduces NMSE by a factor of almost three compared to the supervised baseline. This demonstrates how pre-training via SSL can help to extract the underlying dynamics from a snapshot of a solution.
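A minimal sketch of sampling such initial-condition parameters, assuming a parameterization of the form $u_0(x)=\sum_k A_k\sin(\omega_k x+\phi_k)$ with the coefficient ranges quoted above; the exact frequency convention in the cited works may differ in detail.

```python
import numpy as np

def sample_initial_condition(L=64.0, N=10, n_points=256, seed=0):
    """Sample a truncated Fourier series u0(x) = sum_k A_k sin(w_k x + phi_k)
    together with the 2N coefficients (A_k, w_k) used as regression targets.
    Domain length, grid size, and seed are illustrative choices."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, L, n_points, endpoint=False)
    A = rng.uniform(-0.5, 0.5, size=N)
    omega = rng.uniform(-0.4, 0.4, size=N)
    phi = rng.uniform(0.0, 2 * np.pi, size=N)   # phases are not regressed
    u0 = (A[:, None] * np.sin(omega[:, None] * x[None, :] + phi[:, None])).sum(axis=0)
    return x, u0, np.concatenate([A, omega])    # 2N downstream targets

x, u0, targets = sample_initial_condition()
print(u0.shape, targets.shape)
```

Dropping the phases $\phi_k$ from the target vector is what makes the task invariant to the boost and translation augmentations.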
Buoyancy magnitude regression: Following [18], our dataset consists of solutions of Navier-Stokes (Equation 8) where the external buoyancy force, $\bm{f}=(c_x,c_y)^{\top}$, is constant in the two spatial directions over the course of a given evolution, and our aim is to regress the magnitude of this force $\sqrt{c_x^2+c_y^2}$ given a solution to the PDE. We reuse the dataset generated in [18], where $c_x=0$ and $c_y\sim\mathcal{U}(0.2,0.5)$. In practice this gives us 26,624 training samples that we used as our "unlabeled" dataset, 3,328 to train the downstream task on, and 6,592 to evaluate the models. As observed in Table 1, the self-supervised approach is able to significantly outperform the supervised baseline. Even when looking at the best supervised performance (over 60 runs), or in similar data regimes as the supervised baseline illustrated in Fig. 4, the self-supervised baseline consistently performs better and improves further when given larger unlabeled datasets.
To explore whether learned representations improve time-stepping, we study neural networks that use a sequence of time steps (the 'history') of a PDE to predict a future sequence of steps. For each equation we consider different conditioning schemes, to fit within the data modality and be comparable to previous work.
Burgers, Korteweg-de Vries, and Kuramoto-Sivashinsky: We time-step on 2000 unseen samples for each PDE. To do so, we compute a representation of 20 first input time steps using our frozen encoder, and add it as a new channel. The resulting input is fed to a CNN as in [12] to predict the next 20 time steps (illustrated in Fig. 4 (bottom right) in the context of Burgers' equation). As shown in Table 1, conditioning the neural network or operator with pre-trained representations slightly reduces the error. Such conditioning noticeably improves performance for KdV and KS, while the results are mixed for Burgers'. A potential explanation is that KdV and KS feature more chaotic behavior than Burgers, leaving room for improvement.
Navier-Stokes equation: As pointed out in [18], conditioning a neural network or neural operator on the buoyancy helps generalization across different values of this parameter. This is done by embedding the buoyancy before mixing the resulting vector either via addition to the neural operator's hidden activations (denoted in [18] as 'Addition'), or alternatively for UNets by affine transformation of group normalization layers (denoted as 'AdaGN' and originally proposed in [35]). For our main experiment, we use the same modified UNet with 64 channels as in [18] for our neural operator, since it yields the best performance on the Navier-Stokes dataset. To condition the UNet, we compute our representation on the 16 first frames (which are therefore excluded from the training), and pass the representation through a two-layer MLP with a bottleneck of size 1, in order to exploit the ability of our representation to recover the buoyancy with only one linear layer. The resulting output is then added to the conditioning embedding as in [18]. Finally, we choose AdaGN as our conditioning method, since it provides the best results in [18]. We follow a similar training and evaluation protocol to [18], except that we perform 20 epochs with a cosine annealing schedule on 1,664 trajectories instead of 50 epochs, as we did not observe a significant difference in results, and this allowed us to explore other architectures and conditioning methods. Additional details are provided in Appendix F. As a baseline, we use the same model without buoyancy conditioning. Both models are conditioned on time. We report the one-step validation MSE on the same time horizons as [18]. Conditioning on our representation outperforms the baseline without conditioning.
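The AdaGN-style conditioning can be sketched as below: a conditioning embedding (here a random stand-in for the MLP-projected SSL representation) produces a per-channel scale and shift that modulate group-normalized activations. All shapes, names, and weights are illustrative, not the paper's exact implementation.

```python
import numpy as np

def group_norm(h, n_groups=4, eps=1e-5):
    """Normalize (N, C, H, W) activations within channel groups."""
    N, C, H, W = h.shape
    g = h.reshape(N, n_groups, C // n_groups, H, W)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mu) / np.sqrt(var + eps)).reshape(N, C, H, W)

def adagn(h, cond_emb, W_scale, W_shift):
    """AdaGN-style conditioning: the embedding is mapped to a per-channel
    scale and shift applied after group normalization."""
    scale = cond_emb @ W_scale                      # (N, C)
    shift = cond_emb @ W_shift                      # (N, C)
    hn = group_norm(h)
    return hn * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

rng = np.random.default_rng(0)
h = rng.normal(size=(2, 8, 16, 16))   # hidden UNet activations (toy sizes)
z = rng.normal(size=(2, 4))           # embedding of the SSL representation
out = adagn(h, z, rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
print(out.shape)
```

The `1 + scale` convention keeps the layer close to plain group normalization when the conditioning signal is small.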
We also report results for different architectures and conditioning methods for Navier-Stokes in Table 2 and Burgers in Table 8 (Appendix F.1), validating the potential of conditioning on SSL representations for different models. FNO [36] does not perform as well as other models, partly due to the relatively low number of samples used and the low-resolution nature of the benchmarks. For Navier-Stokes, we also report results obtained when conditioning on both time and ground truth buoyancy, which serves as an upper bound on the performance of our method. We conjecture these results can be improved by further increasing the quality of the learned representation, e.g., by training on more samples or through further augmentation tuning. Indeed, the MSE on buoyancy regression obtained by SSL features, albeit significantly lower than the supervised baseline, is often still too imprecise to distinguish consecutive buoyancy values in our data.
Self-supervised learning outperforms supervised learning for PDEs: While the superiority of self-supervised over supervised representation learning is still an open question in computer vision [37, 38], the former outperforms the latter in the PDE domain we consider. A possible explanation is that enforcing similar representations for two different views of the same solution forces the network to learn the underlying dynamics, while supervised objectives (such as regressing the buoyancy) may not provide as informative a signal to the network. Moreover, Fig. 4 illustrates how more pretraining data benefits our SSL setup, whereas in our experiments it did not help the supervised baselines.
Cropping: Cropping is a natural, effective, and popular augmentation in computer vision [21, 39, 40]. In the context of PDE samples, unless specified otherwise, we crop in both the temporal and spatial domains; we find such a procedure is necessary for the encoder to learn from the PDE data. Cropping also offers a typically weaker means of enforcing analogous space and time translation invariance. The exact size of the crops is generally domain dependent and requires tuning. We quantify its effect in Fig. 5 in the context of Navier-Stokes; here, crops must contain as much information as possible while ensuring that pairs of crops have as little overlap as possible (to discourage the network from relying on spurious correlations). This explains the two modes appearing in Fig. 5. We make a similar observation for Burgers, while KdV and KS are less sensitive. Finally, crops help bias the network to learn features that are invariant to whether the input was taken near a boundary or not, thus alleviating the issue of boundary condition preservation during augmentations.
Selecting Lie point augmentations: Whereas cropping alone yields satisfactory representations, Lie point augmentations can enhance performance but require careful tuning. In order to choose which symmetries to include in our SSL pipeline, and at what strengths to apply them, we study the effectiveness of each Lie augmentation separately. More precisely, given an equation and each possible Lie point augmentation, we train an SSL representation using only this augmentation and cropping. Then, we combine all Lie augmentations that improve the representation over using crops alone. In order for this composition to stay within the stability/convergence radius of the Lie symmetries, we reduce each augmentation's optimal strength by an order of magnitude. Fig. 5 illustrates this process in the context of Navier-Stokes.
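The selection procedure above can be summarized in pseudocode; `train_ssl` and `evaluate` are hypothetical stand-ins for the full pretraining and linear-probing pipeline, and the toy implementations at the bottom exist only so the sketch runs.

```python
def select_augmentations(symmetries, strengths, train_ssl, evaluate):
    """Keep each Lie point augmentation that, paired with cropping,
    improves on the crop-only baseline; then compose the keepers at a
    tenth of their tuned strength to stay near the identity."""
    baseline = evaluate(train_ssl(augmentations=["crop"]))
    selected = []
    for sym in symmetries:
        score = evaluate(train_ssl(augmentations=["crop", sym]))
        if score > baseline:
            selected.append(sym)
    final_strengths = {s: strengths[s] * 0.1 for s in selected}
    return selected, final_strengths

# Toy stand-ins for the real pipeline, for illustration only.
train_ssl = lambda augmentations: set(augmentations)
evaluate = lambda model: 2.0 if "galilean_boost" in model else 1.0

selected, final = select_augmentations(
    ["galilean_boost", "scaling"],
    {"galilean_boost": 0.5, "scaling": 0.2},
    train_ssl, evaluate)
print(selected, final)
```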
This work leverages Lie point symmetries for self-supervised representation learning from PDE data. Our preliminary experiments with the Burgers’, KdV, KS, and Navier-Stokes equations demonstrate the usefulness of the resulting representation for sample or compute efficient estimation of characteristics and time-stepping. Nevertheless, a number of limitations are present in this work, which we hope can be addressed in the future. The methodology and experiments in this study were confined to a particular set of PDEs, but we believe they can be expanded beyond our setting.
Another interesting direction is to expand our SSL framework to learning explicitly equivariant features [41, 42]. Learning equivariant representations with SSL could be helpful for time-stepping, perhaps directly in the learned representation space.
Theoretical insights can also help improve the results contained here. Symmetries are generally derived with respect to systems with infinite domain or periodic boundaries. Since boundary conditions violate such symmetries, we observed in our work that we are only able to implement group operations at small strengths. Finding ways to preserve boundary conditions during augmentation, even approximately, would help expand the scope of symmetries available for learning tasks. Moreover, the available symmetry group operations of a given PDE are not solely comprised of Lie point symmetries. Other types of symmetries, such as nonlocal symmetries or approximate symmetries like Lie–Bäcklund symmetries, may also be implemented as potential augmentations [13].
A natural next step for our framework is to train a common representation on a mixture of data from different PDEs, such as Burgers, KdV and KS, that are all models of chaotic flow sharing many Lie point symmetries. Our preliminary experiments are encouraging yet suggest that work beyond the scope of this paper is needed to deal with the different time and length scales between PDEs.
In our study, utilizing the structure of PDE solutions as “exact” SSL augmentations for representation learning has shown significant efficacy over supervised methods. This approach’s potential extends beyond the PDEs we study as many problems in mathematics, physics, and chemistry present inherent symmetries that can be harnessed for SSL. Future directions could include implementations of SSL for learning stochastic PDEs, or Hamiltonian systems. In the latter, the rich study of Noether’s symmetries in relation to Poisson brackets could be a useful setting to study [11]. Real-world data, as opposed to simulated data, may offer a nice application to the SSL setting we study. Here, the exact form of the equation may not be known and symmetries of the equations would have to be garnered from basic physical principles (e.g., flow equations have translational symmetries), derived from conservation laws, or potentially learned from data.
The authors thank Aaron Lou, Johannes Brandstetter, and Daniel Worrall for helpful feedback and discussions. HL is supported by the Fannie and John Hertz Foundation and the NSF Graduate Fellowship under Grant No. 1745302.
Symmetry augmentations encourage invariance of the representations to known symmetry groups of the data. The guiding principle is that inputs that can be obtained from one another via transformations of the symmetry group should share a common representation. In images, such symmetries are known a priori and correspond to flips, resizing, or rotations of the input. In PDEs, these symmetry groups can be derived as Lie groups, commonly denoted as Lie point symmetries, and have been categorized for many common PDEs [11]. An example of the form of such augmentations is given in Figure 6 for a simple PDE that rotates a point in 2-D space. In this example, the PDE exhibits both rotational symmetry and scaling symmetry of the radius of rotation. For arbitrary PDEs, such symmetries can be derived, as explained in more detail below.
The Lie point symmetry groups of differential equations form a Lie group structure, where elements of the groups are smooth and differentiable transformations. It is typically easier to derive the symmetries of a system of differential equations via the infinitesimal generators of the symmetries, (i.e., at the level of the derivatives of the one parameter transforms). By using these infinitesimal generators, one can replace nonlinear conditions for the invariance of a function under the group transformation, with an equivalent linear condition of infinitesimal invariance under the respective generator of the group action [11].
In what follows, we give an informal overview of the derivation of Lie point symmetries. Full details and formal rigor can be found in Olver [11] and Ibragimov [13], among others.
In the setting we consider, a differential equation has a set of $p$ independent variables $\bm{x}=(x^1,x^2,\dots,x^p)\in\mathbb{R}^p$ and $q$ dependent variables $\bm{u}=(u^1,u^2,\dots,u^q)\in\mathbb{R}^q$. The solutions take the form $\bm{u}=f(\bm{x})$, where $u^\alpha=f^\alpha(\bm{x})$ for $\alpha\in\{1,\dots,q\}$. Solutions form a graph over a domain $\Omega\subset\mathbb{R}^p$:
In other words, a given solution $\Gamma_f$ forms a $p$-dimensional submanifold of the space $\mathbb{R}^p\times\mathbb{R}^q$.
The $n$-th prolongation of a given smooth function $f$ expands or "prolongs" the graph of the solution into a larger space to include derivatives up to the $n$-th order. More precisely, if $\mathcal{U}=\mathbb{R}^q$ is the solution space of a given function and $f:\mathbb{R}^p\to\mathcal{U}$, then we introduce the Cartesian product space of the prolongation:
where $\mathcal{U}_k=\mathbb{R}^{\text{dim}(k)}$ and $\text{dim}(k)=\binom{p+k-1}{k}$ is the dimension of the so-called jet space consisting of all $k$-th order derivatives. Given any solution $f:\mathbb{R}^p\to\mathcal{U}$, the prolongation can be calculated by simply computing the corresponding derivatives up to order $n$ (e.g., via a Taylor expansion at each point). For a given function $\bm{u}=f(\bm{x})$, the $n$-th prolongation is denoted as $\bm{u}^{(n)}=\operatorname{pr}^{(n)}f(\bm{x})$. As a simple example, for the case of $p=2$ with independent variables $x$ and $y$ and $q=1$ with a single dependent variable $f$, the second prolongation is
which is evaluated at a given point $(x,y)$ in the domain. The complete space $\mathbb{R}^p\times\mathcal{U}^{(n)}$ is often called the $n$-th order jet space [11].
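To make the jet-space picture concrete, the sketch below numerically assembles the second prolongation of a scalar function of two variables via central finite differences. This is an illustration of the definition only, not part of the paper's pipeline.

```python
import numpy as np

def prolongation_2(f, x, y, h=1e-4):
    """Second prolongation pr^(2) f at (x, y): the jet-space tuple
    (u; u_x, u_y; u_xx, u_xy, u_yy), estimated by central differences."""
    u = f(x, y)
    ux = (f(x + h, y) - f(x - h, y)) / (2 * h)
    uy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    uxx = (f(x + h, y) - 2 * u + f(x - h, y)) / h**2
    uyy = (f(x, y + h) - 2 * u + f(x, y - h)) / h**2
    uxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return np.array([u, ux, uy, uxx, uxy, uyy])

# Example: f(x, y) = x^2 y.  At (1, 2) the analytic jet is
# u = 2, u_x = 4, u_y = 1, u_xx = 4, u_xy = 2, u_yy = 0.
jet = prolongation_2(lambda x, y: x**2 * y, 1.0, 2.0)
print(np.round(jet, 3))
```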
A system of differential equations is a set of $l$ differential equations $\Delta:\mathbb{R}^p\times\mathcal{U}^{(n)}\to\mathbb{R}^l$ of the independent and dependent variables, with dependence on the derivatives up to a maximum order of $n$:
A smooth solution is thus a function $f$ such that, for all points in the domain of $\bm{x}$:
In geometric terms, the system of differential equations states where the given map $\Delta$ vanishes on the jet space, and forms a subvariety
Therefore, to check whether a solution is valid, one can check whether the prolongation of the solution falls within the subvariety $Z_\Delta$. As an example, consider the one-dimensional heat equation
A symmetry group $G$ for a system of differential equations is a set of local transformations which map one solution of the system of differential equations to another. The group takes the form of a Lie group, where group operations can be expressed as a composition of one-parameter transforms. More rigorously, given the graph of a solution $\Gamma_f$ as defined in Eq. 10, a group operation $g\in G$ maps this graph to a new graph
where $(\tilde{\bm{x}},\tilde{\bm{u}})$ label the new coordinates of the solution in the set $g\cdot\Gamma_f$. For example, if $\bm{x}=(x,t)$, $u=u(x,t)$, and $g$ acts on $(\bm{x},u)$ via
then $\tilde{u}(\tilde{x},\tilde{t})=u(x,t)+\epsilon=u(\tilde{x}-\epsilon\tilde{t},\tilde{t})+\epsilon$, where $(\tilde{x},\tilde{t})=(x+\epsilon t,t)$.
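On sampled data, applying such a transformation must be followed by interpolation back onto the original grid (the rediscretization mentioned earlier). A sketch for the boost above, assuming periodic boundary conditions and a regularly sampled field; the toy field here is not an actual PDE solution.

```python
import numpy as np

def galilean_boost(u, x, t, eps):
    """Apply (x, t, u) -> (x + eps*t, t, u + eps) to a sampled field
    u[i, j] = u(x_j, t_i), interpolating back onto the original spatial
    grid under periodic boundary conditions."""
    L = (x[1] - x[0]) * len(x)                 # spatial period of the domain
    out = np.empty_like(u)
    for i, ti in enumerate(t):
        # The transformed solution at x is the old solution at x - eps*t,
        # shifted in value by eps.
        out[i] = np.interp(x - eps * ti, x, u[i], period=L) + eps
    return out

x = np.linspace(0.0, 2 * np.pi, 128, endpoint=False)
t = np.linspace(0.0, 1.0, 16)
u = np.sin(x)[None, :] * np.exp(-t)[:, None]   # toy field for illustration
print(galilean_boost(u, x, t, eps=0.5).shape)
```

Setting `eps=0` recovers the input field exactly, consistent with the identity at $\epsilon=0$.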
Note that the set $g\cdot\Gamma_f$ may not necessarily be the graph of a new single-valued function; however, since all transformations are local and smooth, one can ensure transformations are valid in some region near the identity of the group.
As an example, consider the following transformations, which are members of the symmetry group of the differential equation $u_{xx}=0$: $g_1(t)$ translates the spatial coordinate $x$ by an amount $t$, and $g_2$ scales the output coordinate $u$ by an amount $e^r$:
It is easy to verify that both of these operations are local and smooth around a region of the identity, as sending $r,t\to 0$ recovers the identity operation. Lie theory allows one to equivalently describe the potentially nonlinear group operations above with corresponding infinitesimal generators of the group action, corresponding to the Lie algebra of the group. Infinitesimal generators form a vector field over the total space $\Omega\times\mathcal{U}$, and the group operations correspond to integral flows over that vector field. To map from a single-parameter Lie group operation to its corresponding infinitesimal generator, we take the derivative of the single-parameter operation at the identity:
where $g(0)\cdot(x,u)=(x,u)$.
To map from the infinitesimal generator back to the corresponding group operation, one can apply the exponential map
where $\exp:\mathfrak{g}\to G$ maps from the Lie algebra, $\mathfrak{g}$, to the corresponding Lie group, $G$. This exponential map can be evaluated using various methods, as detailed in Appendix B and Appendix E.
Returning to the example earlier from Equation 19, the corresponding Lie algebra elements are
Informally, Lie algebras simplify notions of invariance, since they allow one to check whether functions or differential equations are invariant to a group by checking only at the level of the derivative of that group. In other words, for any vector field corresponding to a Lie algebra element, a given function is invariant to that vector field if the action of the vector field on the given function evaluates to zero everywhere. Thus, given a symmetry group, one can determine a set of invariants using the vector fields corresponding to the infinitesimal generators of the group. To determine whether a differential equation is in such a set of invariants, we extend the definition of a prolongation to act on vector fields as
A given vector field $\bm{v}$ is therefore an infinitesimal generator of a symmetry group $G$ of a system of differential equations $\Delta_\nu$, indexed by $\nu\in\{1,\dots,l\}$, if the prolonged vector field applied to the system vanishes on every solution:
For the sake of convenience and brevity, we leave out many of the formal definitions behind these concepts and refer the reader to [11] for complete details.
Since symmetries of differential equations correspond to smooth maps, it is typically easier to derive the particular symmetries of a differential equation via their infinitesimal generators. To derive such generators, we first show how to perform the prolongation of a vector field. As before, assume we have $p$ independent variables $x^1,\dots,x^p$ and $q$ dependent variables $u^1,\dots,u^q$, which are functions of the independent variables. Note that we use superscripts to denote a particular variable. Derivatives with respect to a given variable are denoted via subscripts corresponding to the indices. For example, the variable $u^1_{112}$ denotes the third-order derivative of $u^1$ taken twice with respect to the variable $x^1$ and once with respect to $x^2$. As stated earlier, the prolongation of a vector field is defined as the operation
To calculate the above, we can evaluate the formula on a vector field written in a generalized form. I.e., any vector field corresponding to the infinitesimal generator of a symmetry takes the general form
Throughout, we will use Greek letter indices for dependent variables and standard letter indices for independent variables. Then, we have that
where $\bm{J}$ is a multi-index over the independent variables indicating which derivatives are taken in $\frac{\partial}{\partial u^\alpha_{\bm{J}}}$. Each $\phi_\alpha^{\bm{J}}(\bm{x},\bm{u}^{(n)})$ is calculated as
where $u^\alpha_{\bm{J},i}=\partial u^\alpha_{\bm{J}}/\partial x^i$ and $\bm{D}_i$ is the total derivative operator with respect to variable $i$, defined as
After evaluating the coefficients $\phi_\alpha^{\bm{J}}(\bm{x},\bm{u}^{(n)})$, we can substitute these values into the definition of the vector field's prolongation in Equation 27. This fully describes the infinitesimal generator of the given PDE, which can be used to evaluate the necessary symmetries of the system of differential equations. An example for Burgers' equation, a canonical PDE, is presented in the following.
Burgers' equation is a PDE used to describe convection-diffusion phenomena commonly observed in fluid mechanics, traffic flow, and acoustics [43]. The PDE can be written in either its "potential" form or its "viscous" form. The potential form is
Cautionary note: We derive here the symmetries of Burgers’ equation in its potential form since this form is more convenient and simpler to study for the sake of an example. The equation we consider in our experiments is the more commonly studied Burgers’ equation in its standard form which does not have the same Lie symmetry group (see Table 4). Similar derivations for Burgers’ equation in its standard form can be found in example 6.1 of [44].
Following the notation from the previous section, $p=2$ and $q=1$. Consequently, the symmetry group of Burgers' equation will be generated by vector fields of the following form
where we wish to determine all possible coefficient functions, $\xi(x,t,u)$, $\tau(x,t,u)$, and $\phi(x,t,u)$, such that the resulting one-parameter subgroup $\exp(\varepsilon\bm{v})$ is a symmetry group of Burgers' equation. To evaluate these coefficients, we need to prolong the vector field up to second order, given that the highest-order derivative present in the governing PDE is of order 2. The second prolongation of the vector field can be expressed as
Applying this prolonged vector field to the differential equation in Equation 30, we get the infinitesimal symmetry criteria that
To evaluate the individual coefficients, we apply Equation 28. Next, we substitute every instance of $u_t$ with $u_x^2+u_{xx}$, and equate the coefficients of each monomial in the first- and second-order derivatives of $u$ to find the pertinent symmetry groups. Table 3 below lists the relevant monomials as well as their respective coefficients.
where $k_1,\dots,k_6\in\mathbb{R}$ and $\gamma(x,t)$ is an arbitrary solution to the heat equation. These coefficient functions can be used to generate the infinitesimal symmetries, which are spanned by the six vector fields below:
as well as the infinite-dimensional subalgebra $\bm{v}_\gamma=\gamma(x,t)e^{-u}\partial_u$, where $\gamma(x,t)$ is any arbitrary solution to the heat equation. This exposes the relationship between the heat equation and Burgers' equation: replacing $u$ by $w=e^u$ recovers the Cole-Hopf transformation.
As observed in the previous section, symmetry groups are generally derived in the Lie algebra of the group. The exponential map can then be applied, taking elements of this Lie algebra to the corresponding group operations. Working within the Lie algebra of a group provides several benefits. First, a Lie algebra is a vector space, so elements of the Lie algebra can be added and subtracted to yield new elements of the Lie algebra (and the group, via the exponential map). Second, when generators of the Lie algebra are closed under the Lie bracket of the Lie algebra (i.e., the generators form a basis for the structure constants of the Lie algebra), any arbitrary Lie point symmetry can be obtained via an element of the Lie algebra (i.e. the exponential map is surjective onto the connected component of the identity) [11]. In contrast, composing group operations in an arbitrary, fixed sequence is not guaranteed to be able to generate any element of the group. Lastly, although not extensively detailed here, the "strength," or magnitude, of Lie algebra elements can be measured using an appropriately selected norm. For instance, the operator norm of a matrix could be used for matrix Lie algebras.
In certain cases, especially when the element ${\bm{v}}$ in the Lie algebra consists of a single basis element, the exponential map $\exp({\bm{v}})$ applied to that element of the Lie algebra can be calculated explicitly. Here, applying the group operation to a tuple of independent and dependent variables results in the so-called Lie point transformation, since it is applied at a given point: $\exp(\epsilon{\bm{v}})\cdot(x, f(x)) \mapsto (x', f(x)')$. Consider the concrete example below from Burgers' equation.
Burgers' equation contains the Lie point symmetry ${\bm{v}}_{\gamma} = \gamma(x,t)e^{-u}\partial_{u}$ with corresponding group transformation $\exp(\epsilon{\bm{v}}_{\gamma})\cdot(x,t,u) = (x, t, \log\left(e^{u} + \epsilon\gamma\right))$.
This transformation only changes the u𝑢u component. Here, we have
Applying the series expansion $\log(1+x) = x - \frac{x^{2}}{2} + \frac{x^{3}}{3} - \cdots$, we get
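This first-order behaviour is easy to check numerically. The sketch below (with arbitrary smooth fields standing in for $u$ and $\gamma$; all names are hypothetical) applies the group operation and verifies that $\log(e^{u} + \epsilon\gamma) = u + \epsilon\gamma e^{-u} + O(\epsilon^{2})$:

```python
import numpy as np

def lie_point_transform(u, gamma, eps):
    """Apply the group operation u -> log(e^u + eps * gamma)."""
    return np.log(np.exp(u) + eps * gamma)

# Arbitrary smooth fields standing in for a solution u and a function gamma.
x = np.linspace(0, 1, 64)
u = np.sin(2 * np.pi * x)
gamma = np.cos(2 * np.pi * x)

for eps in [1e-2, 1e-3]:
    transformed = lie_point_transform(u, gamma, eps)
    first_order = u + eps * gamma * np.exp(-u)
    # The residual against the first-order expansion shrinks like eps^2.
    print(eps, np.max(np.abs(transformed - first_order)))
```

The printed residuals drop by roughly two orders of magnitude when $\epsilon$ drops by one, consistent with the quadratic remainder of the series.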
In general, the output of the exponential map cannot be easily calculated as we did above, especially if the vector field 𝒗𝒗{\bm{v}} is a weighted sum of various generators. In these cases, we can still apply the exponential map to a desired accuracy using efficient approximation methods, which we discuss next.
For arbitrary Lie groups, computing the exact exponential map is often not feasible due to the complex nature of the group and its associated Lie algebra. Hence, it is necessary to approximate the exponential map to obtain useful results. Two common methods for approximating the exponential map are the truncation of Taylor series and Lie-Trotter approximations.
Given a vector field 𝒗𝒗{\bm{v}} in the Lie algebra of the group, the exponential map can be approximated by truncating the Taylor series expansion of exp(𝒗)𝒗\exp({\bm{v}}). The Taylor series expansion of the exponential map is given by:
To approximate the exponential map, we retain a finite number of terms in the series:
where $k$ is the order of the truncation. The accuracy of the approximation depends on the number of terms retained in the truncated series and the operator norm $\|{\bm{v}}\|$. For matrix Lie groups, where ${\bm{v}}$ is also a matrix, this operator norm is equivalent to the largest magnitude of the eigenvalues of the matrix [45]. The error associated with truncating the Taylor series after $k$ terms thus decays exponentially with the order of the approximation.
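As an illustration of this decay, the sketch below truncates the series for a small random matrix (a stand-in for ${\bm{v}}$ in a matrix Lie algebra) and compares against a high-order reference:

```python
import numpy as np

def taylor_exp(A, k):
    """Truncated Taylor series of the matrix exponential: sum_{n=0}^{k} A^n / n!."""
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for n in range(1, k + 1):
        term = term @ A / n
        result = result + term
    return result

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) * 0.2          # small norm so the series converges fast
reference = taylor_exp(A, 60)              # effectively exact at this norm

errors = [np.linalg.norm(taylor_exp(A, k) - reference) for k in (2, 4, 8)]
print(errors)  # decreasing with the truncation order k
```

The errors drop rapidly with $k$, consistent with the factorial decay of the series remainder.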
Two drawbacks exist when using the Taylor approximation. First, for a given vector field 𝒗𝒗{\bm{v}}, applying 𝒗⋅f⋅𝒗𝑓{\bm{v}}\cdot f to a given function f𝑓f requires algebraic computation of derivatives. Alternatively, derivatives can also be approximated through finite difference schemes, but this would add an additional source of error. Second, when using the Taylor series to apply a symmetry transformation of a PDE to a starting solution of that PDE, the Taylor series truncation will result in a new function, which is not necessarily a solution of the PDE anymore (although it can be made arbitrarily close to a solution by increasing the truncation order). Lie-Trotter approximations, which we study next, approximate the exponential map by a composition of symmetry operations, thus avoiding these two drawbacks.
The Lie-Trotter approximation is an alternative method for approximating the exponential map, particularly useful when one has direct access to group elements (i.e., the closed-form output of the exponential map on each Lie algebra generator), but the generators do not commute. To motivate this method, consider two elements $\bm{X}$ and $\bm{Y}$ in the Lie algebra. The Lie-Trotter formula (or Lie product formula) approximates the exponential of their sum [22, 46] as $\exp(\bm{X} + \bm{Y}) \approx \left(\exp(\bm{X}/k)\exp(\bm{Y}/k)\right)^{k}$, where $k$ is a positive integer controlling the level of approximation.
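A minimal numerical check of the product formula, using two non-commuting nilpotent matrices as hypothetical stand-ins for symmetry generators:

```python
import numpy as np

def taylor_expm(A, terms=60):
    """High-order Taylor reference for the matrix exponential."""
    result, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for n in range(1, terms + 1):
        term = term @ A / n
        result = result + term
    return result

X = np.array([[0.0, 1.0], [0.0, 0.0]])
Y = np.array([[0.0, 0.0], [1.0, 0.0]])  # X and Y do not commute

target = taylor_expm(X + Y)
for k in (1, 10, 100):
    trotter = np.linalg.matrix_power(taylor_expm(X / k) @ taylor_expm(Y / k), k)
    print(k, np.linalg.norm(trotter - target))  # error shrinks roughly like 1/k
```

The error decays roughly like $1/k$, as expected for the first-order formula.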
The first-order approximation above can be extended to higher orders, referred to as Lie-Trotter-Suzuki approximations. Though various such approximations exist, we particularly use the following recursive approximation scheme [47, 23] for a given Lie algebra element ${\bm{v}} = \sum_{i=1}^{p} {\bm{v}}_{i}$.
To apply the above formula, we tune the order parameter $p$ and split the time evolution into $r$ segments to apply the approximation $\exp({\bm{v}}) \approx \prod_{i=1}^{r} \mathcal{T}_{p}({\bm{v}}/r)$. For the $p$-th order, the number of stages in the Suzuki formula above is equal to $2 \cdot 5^{p/2-1}$, so the total number of stages applied is equal to $2r \cdot 5^{p/2-1}$.
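In the common convention, the base case $\mathcal{T}_2$ of the recursion is the symmetric (Strang) splitting. The sketch below (matrix stand-ins again, not actual PDE symmetry operations) shows its error decaying quadratically in the number of segments $r$:

```python
import numpy as np

def taylor_expm(A, terms=60):
    """High-order Taylor reference for the matrix exponential."""
    result, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for n in range(1, terms + 1):
        term = term @ A / n
        result = result + term
    return result

def strang_step(X, Y):
    """Second-order symmetric splitting: exp(X/2) exp(Y) exp(X/2)."""
    half = taylor_expm(X / 2)
    return half @ taylor_expm(Y) @ half

X = np.array([[0.0, 1.0], [0.0, 0.0]])
Y = np.array([[0.0, 0.0], [1.0, 0.0]])
target = taylor_expm(X + Y)

for r in (1, 4, 16):
    approx = np.linalg.matrix_power(strang_step(X / r, Y / r), r)
    print(r, np.linalg.norm(approx - target))  # error shrinks roughly like 1/r^2
```

Doubling $r$ cuts the error by roughly a factor of four, the signature of a second-order scheme.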
These methods are especially useful in the context of PDEs, as they allow for the approximation of the exponential map while preserving the structure of the Lie algebra and group. Similar techniques are used in the design of splitting methods for numerically solving PDEs [48, 49]. Crucially, these approximations will always provide valid solutions to the PDEs, since each individual group operation in the composition above is itself a symmetry of the PDE. This is in contrast with approximations via Taylor series truncation, which only provide approximate solutions.
As with the Taylor series approximation, the $p$-th order approximation above is accurate to $o(\|{\bm{v}}\|^{p})$ with suitably selected values of $r$ and $p$ [23]. As a cautionary note, the approximations here may fail to converge when applied to unbounded operators [50, 51]. In practice, we tested a range of bounds to the augmentations and tuned the augmentations accordingly (see Appendix E).
In our implementations, we use the VICReg loss as our choice of SSL loss [9]. This loss contains three terms: a variance term that ensures representations do not collapse to a single point, a covariance term that ensures different dimensions of the representation encode different data, and an invariance term that enforces similarity of the representations for pairs of inputs related by an augmentation. We go through each term in more detail below. Given a distribution $\mathcal{T}$ from which to draw augmentations and a set of inputs ${\bm{x}}_{i}$, the precise algorithm to calculate the VICReg loss for a batch of data is given in Algorithm 1.
Formally, define our embedding matrices as ${\bm{Z}}, {\bm{Z}}' \in \mathbb{R}^{N \times D}$. Next, we define the similarity criterion, $\mathcal{L}_{\text{sim}}$, as
which we use to match our embeddings and make them invariant to the transformations. To avoid a collapse of the representations, we use the original variance and covariance criteria to define our regularization loss, $\mathcal{L}_{\text{reg}}$, as
The variance criterion, $V({\bm{Z}})$, ensures that all dimensions in the representations are used, while also serving as a normalization of the dimensions. The goal of the covariance criterion is to decorrelate the different dimensions and thus spread information across the embeddings. The final criterion is
Hyperparameters $\lambda_{\text{var}}, \lambda_{\text{cov}}, \lambda_{\text{inv}}, \gamma \in \mathbb{R}$ weight the contributions of the different terms in the loss. For all studies conducted in this work, we use the default values $\lambda_{\text{var}} = \lambda_{\text{inv}} = 25$ and $\lambda_{\text{cov}} = 1$, unless specified otherwise. In our experience, these default settings perform generally well.
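Putting the three terms together, the criterion can be sketched in a few lines of numpy (a simplified stand-in for the actual PyTorch implementation of [9]; the small constant inside the standard deviation and the exact normalizations follow common VICReg conventions and are assumptions here):

```python
import numpy as np

def vicreg_loss(Z, Zp, lam_var=25.0, lam_inv=25.0, lam_cov=1.0, gamma=1.0):
    """Sketch of the VICReg criterion for two embedding matrices of shape (N, D)."""
    N, D = Z.shape
    # Invariance: mean squared distance between the two views' embeddings.
    inv = np.mean((Z - Zp) ** 2)

    def var_term(E):
        # Hinge on the per-dimension standard deviation: keep it above gamma.
        std = np.sqrt(E.var(axis=0) + 1e-4)
        return np.mean(np.maximum(0.0, gamma - std))

    def cov_term(E):
        # Penalize off-diagonal entries of the empirical covariance matrix.
        Ec = E - E.mean(axis=0)
        C = (Ec.T @ Ec) / (N - 1)
        off_diag = C - np.diag(np.diag(C))
        return np.sum(off_diag ** 2) / D

    var = var_term(Z) + var_term(Zp)
    cov = cov_term(Z) + cov_term(Zp)
    return lam_inv * inv + lam_var * var + lam_cov * cov

Z = np.random.default_rng(0).normal(size=(32, 16))
loss_same = vicreg_loss(Z, Z.copy())  # invariance term vanishes for identical views
```

Since the variance and covariance terms are invariant to a constant shift of the embeddings, shifting one view by 1 raises the loss by exactly $\lambda_{\text{inv}} \cdot 1$.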
Recent work on machine learning for PDEs has considered both invariant prediction tasks [52] and time-series modelling [53, 54]. In the fluid mechanics setting, models learn dynamic viscosities, fluid densities, and/or pressure fields from both simulation and real-world experimental data [55, 56, 57]. For time-dependent PDEs, prior work has investigated the efficacy of convolutional neural networks (CNNs), recurrent neural networks (RNNs), graph neural networks (GNNs), and transformers in learning to evolve the PDE forward in time [34, 58, 59, 60]. This has sparked interest in the development of reduced-order models and learned representations for time integration that decrease computational expense while attempting to maintain solution accuracy. Learning representations of the governing PDE can enable time-stepping in a latent space, where the computational expense is substantially reduced [61]. Recently, for example, Lusch et al. have studied learning the infinite-dimensional Koopman operator to globally linearize latent space dynamics [62]. Kim et al. have employed the Sparse Identification of Nonlinear Dynamics (SINDy) framework to parameterize latent space trajectories and combine them with classical ODE solvers to integrate latent space coordinates to arbitrary points in time [53]. Nguyen et al. have looked at the development of foundation models for climate science using transformers pre-trained on well-established climate datasets [7]. Other methods like dynamic mode decomposition (DMD) are entirely data-driven and find the best operator to estimate temporal dynamics [63]. Recent extensions of this work have also considered learning equivalent operators, where physical constraints like energy conservation or the periodicity of the boundary conditions are enforced [29].
All joint embedding self-supervised learning methods have a similar objective: forming representations across a given domain of inputs that are invariant to a certain set of transformations. Contrastive and non-contrastive methods are both used. Contrastive methods [21, 64, 65, 66, 67] push away unrelated pairs of augmented datapoints, and frequently rely on the InfoNCE criterion [68], although in some cases, squared similarities between the embeddings have been employed [69]. Clustering-based methods have also recently emerged [70, 71, 6], where instead of contrasting pairs of samples, samples are contrasted with cluster centroids. Non-contrastive methods [10, 40, 9, 72, 73, 74, 39] aim to bring together embeddings of positive samples. However, the primary difference between contrastive and non-contrastive methods lies in how they prevent representational collapse. In the former, contrasting pairs of examples are explicitly pushed away to avoid collapse. In the latter, the criterion considers the set of embeddings as a whole, encouraging information content maximization to avoid collapse. For example, this can be achieved by regularizing the empirical covariance matrix of the embeddings. While there can be differences in practice, both families have been shown to lead to very similar representations [16, 75]. An intriguing feature in many SSL frameworks is the use of a projector neural network after the encoder, on top of which the SSL loss is applied. The projector was introduced in [21]. Whereas the projector is not necessary for these methods to learn a satisfactory representation, it is responsible for an important performance increase. Its exact role is an object of study [76, 15].
We should note that there exists a myriad of techniques, including metric learning, kernel design, autoencoders, and others [77, 78, 79, 80, 81] to build feature spaces and perform unsupervised learning. Many of these works share a similar goal to ours, and we opted for SSL due to its proven efficacy in fields like computer vision and the direct analogy offered by data augmentations. One particular methodology that deserves mention is that of multi-fidelity modeling, which can reduce dependency on extensive training data for learning physical tasks [82, 83, 84]. The goals of multi-fidelity modeling include training with data of different fidelity [82] or enhancing the accuracy of models by incorporating high quality data into models [85]. In contrast, SSL aims to harness salient features from diverse data sources without being tailored to specific applications. The techniques we employ capitalize on the inherent structure in a dataset, especially through augmentations and invariances.
In the past several years, an extensive literature has explored questions in the so-called realm of geometric deep learning, tying together aspects of group theory, geometry, and deep learning [86]. In one line of work, networks have been designed to explicitly encode symmetries via equivariant layers or explicitly symmetric parameterizations [87, 88, 89, 90]. These techniques have notably found application in chemistry- and biology-related problems [91, 92, 93] as well as learning on graphs [94]. Another line of work considers optimization over layers or networks that are parameterized over a Lie group [95, 96, 97, 98, 99]. Our work does not explicitly encode invariances or structurally parameterize Lie groups into architectures as in many of these works, but instead tries to learn representations that are approximately symmetric and invariant to these group structures via SSL. As mentioned in the main text, perhaps more relevant for future work are techniques for learning equivariant features and maps [41, 42].
The generators of the Lie point symmetries of the various equations we study are listed below. For symmetry augmentations which distort the periodic grid in space and time, we provide inputs $x$ and $t$ to the network which contain the new spatial and time coordinates after augmentation.
As a reminder, the Burgers’ equation takes the form
Lie point symmetries of the Burgers’ equation are listed in Table 4. There are five generators. As we will see, the first three generators corresponding to translations and Galilean boosts are consistent with the other equations we study (KS, KdV, and Navier Stokes) as these are all flow equations.
As a cautionary note, the symmetry group given in Table 1 of [12] for Burgers’ equation is incorrectly labeled for Burgers’ equation in its standard form. Instead, these augmentations are those for Burgers’ equation in its potential form, which is given as:
Burgers' equation in its standard form is $v_t + v v_x - \nu v_{xx} = 0$, which can be obtained from the transformation $v = u_x$. The Lie point symmetry group of the equation in its potential form contains more generators than that of the standard form. To apply these generators to the standard form of Burgers' equation, one can convert them via the Cole-Hopf transformation, but this conversion loses the smoothness and locality of some of these transformations (i.e., some are no longer Lie point transformations, although they do still describe valid transformations between solutions of the equation's corresponding form).
Note that this discrepancy does not carry through in their experiments: [12] only consider input data given as solutions to the heat equation, which they subsequently transform into solutions of Burgers' equation via a Cole-Hopf transform. Therefore, in their code, they apply augmentations using the heat equation, for which they have the correct symmetry group. We opted to work only with solutions of Burgers' equation itself for a slightly fairer comparison to real-world settings, where a convenient transform to a linear PDE such as the Cole-Hopf transform is generally not available.
Lie point symmetries of the KdV equation are listed in Table 5. Though all the operations listed are valid generators of the symmetry group, only $g_1$ and $g_3$ leave the downstream task of the inverse problem invariant (notably, the regressed parameters are independent of any spatial shift). Consequently, during SSL pre-training for the inverse problem, only $g_1$ and $g_3$ were used for learning representations. In contrast, for time-stepping, all listed symmetry groups were used.
Lie point symmetries of the KS equation are listed in Table 6. All of these symmetry generators are shared with the KdV equation, listed in Table 5. As with KdV, only $g_1$ and $g_3$ are invariant to the downstream regression task of predicting the initial conditions. For time-stepping, all symmetry groups were used in learning meaningful representations.
Lie point symmetries of the incompressible Navier-Stokes equation are listed in Table 7 [101]. As pressure is not given as an input to any of our networks, the symmetry $g_q$ was not included in our implementations. For augmentations $g_{E_x}$ and $g_{E_y}$, we restricted attention to linear $E_x(t) = E_y(t) = t$ or quadratic $E_x(t) = E_y(t) = t^2$ functions. This restriction maintains invariance to the downstream task of buoyancy force prediction in the linear case, or perturbs the buoyancy magnitude by an easily calculable amount $2\epsilon$ in the quadratic case. Finally, we fix both the order and steps parameters in our Lie-Trotter approximation implementation to 2 for computational efficiency.
Whereas we implemented our own pretraining and evaluation (kinematic viscosity, initial conditions, and buoyancy) pipelines, we used the data generation and time-stepping code provided on GitHub by [12] for Burgers', KS, and KdV, and by [18] for Navier-Stokes (MIT License), with slight modifications to condition the neural operators on our representation. All our code relies on PyTorch. Note that the time-stepping code for Navier-Stokes uses PyTorch Lightning. We report the details of the training cost and hyperparameters for pretraining and timestepping in Table 9 and Table 10, respectively.
Realizations of Burgers' equation were generated using the analytical solution [32] obtained from the heat equation and the Cole-Hopf transform. During generation, kinematic viscosities, $\nu$, and initial conditions were varied.
We pretrain a representation on subsets of our full dataset containing 10,000 1D time evolutions of Burgers' equation with various kinematic viscosities, $\nu$, sampled uniformly in the range $[0.001, 0.007]$, and initial conditions generated using a similar procedure to [12]. We generate solutions of size $224 \times 448$ in the spatial and temporal dimensions respectively, using the default parameters from [12]. We train a ResNet18 [17] encoder using the VICReg [9] approach to joint embedding SSL, with a smaller projector (width 512) since we use a smaller ResNet than in the original paper. We keep the same variance, invariance and covariance parameters as in [9]. We use the following augmentations and strengths:
Crop of size $(128, 256)$, respectively, in the spatial and temporal dimensions.
Uniform sampling in $[-2, 2]$ for the coefficient associated with $g_1$.
We pretrain for 100 epochs using AdamW [33] and a batch size of 32. Crucially, we assess the quality of the learned representation via linear probing for kinematic viscosity regression, which we detail below.
We evaluate the learned representation as follows: the ResNet18 is frozen and used as an encoder to produce features from the training dataset. The features are passed through a linear layer, followed by a sigmoid to constrain the output within $[\nu_{\text{min}}, \nu_{\text{max}}]$. The learned model is evaluated against our validation dataset, which is comprised of 2,000 samples.
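The range-constrained head amounts to a linear layer followed by a sigmoid rescaled to $[\nu_{\text{min}}, \nu_{\text{max}}]$. A minimal sketch with hypothetical (untrained) probe weights, in numpy rather than the actual PyTorch layer:

```python
import numpy as np

def constrained_probe(features, W, b, nu_min=0.001, nu_max=0.007):
    """Linear layer + sigmoid, rescaled so predictions stay in [nu_min, nu_max]."""
    logits = features @ W + b
    return nu_min + (nu_max - nu_min) / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 512))        # stand-in for frozen ResNet18 features
W, b = rng.normal(size=512) * 0.01, 0.0  # hypothetical probe parameters
preds = constrained_probe(feats, W, b)   # every prediction lies in (nu_min, nu_max)
```

Because the sigmoid maps to $(0, 1)$, predictions can never leave the physically plausible viscosity range, regardless of the probe weights.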
We use a 1D CNN solver from [12] as our baseline. This neural solver takes $T_p$ previous time steps as input to predict the next $T_f$ future ones. Each channel (or spatial axis, if we view the input as a 2D image with one channel) is composed of the realization values, $u$, at $T_p$ times, with spatial step size $dx$ and time step size $dt$. The dimension of the input is therefore $(T_p + 2, 224)$, where the extra two dimensions simply capture the scalars $dx$ and $dt$. We augment this input with our representation. More precisely, we select the encoder that allows for the most accurate linear regression of $\nu$ on our validation dataset, feed it the CNN operator input, and reduce the resulting representation dimension to $d$ with a learned projection before adding it as supplementary channels to the input, which is now $(T_p + 2 + d, 224)$. We set $T_p = 20$, $T_f = 20$, and $n_{\text{samples}} = 2{,}000$. We train both models for 20 epochs following the setup from [12]. In addition, we use AdamW with a decaying learning rate and different configurations of 3 runs each:
Batch size $\in \{16, 64\}$.
Learning rate $\in \{0.0001, 0.00005\}$.
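The conditioning described above reduces to projecting the frozen representation to $d$ values and concatenating them as extra channels. A shape-level sketch (all names and the random initialization are hypothetical):

```python
import numpy as np

def condition_input(u_window, dx_dt_channels, repr_vec, P):
    """Stack the (T_p, X) solution window, the two dx/dt channels,
    and d projected representation channels along the channel axis."""
    d_channels = repr_vec @ P                               # project to d scalars
    X = u_window.shape[1]
    repr_maps = np.repeat(d_channels[:, None], X, axis=1)   # broadcast over space
    return np.concatenate([u_window, dx_dt_channels, repr_maps], axis=0)

rng = np.random.default_rng(0)
Tp, X, D, d = 20, 224, 512, 4
u = rng.normal(size=(Tp, X))           # T_p previous time steps
dxdt = rng.normal(size=(2, X))         # scalars dx and dt, tiled over space
z = rng.normal(size=D)                 # frozen-encoder representation
P = rng.normal(size=(D, d)) * 0.01     # learned projection (random init here)
inp = condition_input(u, dxdt, z, P)
print(inp.shape)  # (Tp + 2 + d, X) = (26, 224)
```

The solver then consumes the conditioned input exactly as it would the unconditioned $(T_p + 2, 224)$ tensor, just with $d$ extra channels.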
To obtain realizations of both the KdV and KS PDEs, we apply the method of lines, and compute spatial derivatives using a pseudo-spectral method, in line with the approach taken by [12].
To train on realizations of KdV, we use the following VICReg parameters: $\lambda_{\text{var}} = 25$, $\lambda_{\text{inv}} = 25$, and $\lambda_{\text{cov}} = 4$. For the KS PDE, $\lambda_{\text{var}}$ and $\lambda_{\text{inv}}$ remain unchanged, with $\lambda_{\text{cov}} = 6$. The pre-training is performed on a dataset comprised of 10,000 1D time evolutions of each PDE, each generated from initial conditions described in the main text. Generated solutions were of size $128 \times 256$ in the spatial and temporal dimensions, respectively. Similar to Burgers' equation, a ResNet18 encoder in conjunction with a projector of width 512 was used for SSL pre-training. The following augmentations and strengths were applied:
The quality of the learned representations is evaluated by freezing the ResNet18 encoder, training a separate regression head to predict the values of $A_k$ and $\omega_k$, and comparing the NMSE to a supervised baseline. The regression head is a fully-connected network whose output dimension matches the number of initial condition parameters. In addition, a range-constrained sigmoid bounds the output within $[-0.5, 2\pi]$, where the bounds are informed by the minimum and maximum of the sampled initial conditions. Lastly, as for Burgers' equation, the validation dataset comprises 2,000 labeled samples.
The same 1D CNN solver used for Burgers' equation serves as the baseline for time-stepping the KdV and KS PDEs. We select the ResNet18 encoder that provides the most accurate predictions of the initial conditions on our validation set. Here, the input dimension is $(T_p + 2, 128)$ to match the size of the generated input data. As for Burgers' equation, $T_p = 20$, $T_f = 20$, and $n_{\text{samples}} = 2{,}000$. Lastly, AdamW with the same learning rate and batch size configurations as for Burgers' equation was used across 3 time-stepping runs each. A sample visualization with predicted instances of the KdV PDE is provided in Fig. 7 below:
We use the Conditioning dataset for Navier-Stokes 2D proposed in [18], consisting of 26,624 2D time evolutions with 56 time steps and buoyancies ranging approximately uniformly from 0.2 to 0.5.
We train a ResNet18 for 100 epochs with AdamW, a batch size of 64 and a learning rate of 3e-4. We use the same VICReg hyperparameters as for Burgers’ Equation. We use the following augmentations and strengths (augmentations whose strength is not specified here are not used):
We mainly depart from [18] by using 20 epochs to learn from 1,664 trajectories, as we observe the results to be similar while allowing us to explore more combinations of architectures and conditioning methods.
In addition to results on 1,664 trajectories, we also perform experiments with a bigger training dataset (6,656 trajectories) as in [18], using 20 epochs instead of 50 for computational reasons. We also report results for the two different conditioning methods described in [18], Addition and AdaGN. The results can be found in Table 11. As in [18], AdaGN outperforms Addition. Note that AdaGN is needed for our representation conditioning to significantly improve over no conditioning. Finally, we found a very small bottleneck in the MLP that processes the representation to be crucial for performance, with a size of 1 giving the best results.
Table: S4.T1: Downstream evaluation of our learned representations for four classical PDEs (averaged over three runs; lower is better (↓)). The normalized mean squared error (NMSE) over a batch of $N$ outputs $\widehat{{\bm{u}}}_k$ and targets ${\bm{u}}_k$ is $\operatorname{NMSE} = \frac{1}{N}\sum_{k=1}^{N} \|\widehat{{\bm{u}}}_k - {\bm{u}}_k\|_2^2 / \|\widehat{{\bm{u}}}_k\|_2^2$. Relative error is similarly defined as $\operatorname{RE} = \frac{1}{N}\sum_{k=1}^{N} \|\widehat{{\bm{u}}}_k - {\bm{u}}_k\|_1 / \|\widehat{{\bm{u}}}_k\|_1$. For regression tasks, the reported errors with supervised methods are the best performance across runs with Lie symmetry augmentations applied. For timestepping, we report NMSE for KdV, KS and Burgers as in [12], and MSE for Navier-Stokes for comparison with [18].
| Equation | KdV | KS | Burgers | Navier-Stokes |
|---|---|---|---|---|
| SSL dataset size | 10,000 | 10,000 | 10,000 | 26,624 |
| Sample format ($t, x, (y)$) | 256×128 | 256×128 | 448×224 | 56×128×128 |
| Characteristic of interest | Init. coeffs | Init. coeffs | Kinematic viscosity | Buoyancy |
| Regression metric | NMSE (↓) | NMSE (↓) | Relative error % (↓) | MSE (↓) |
| Supervised | 0.102 ± 0.007 | 0.117 ± 0.009 | 1.18 ± 0.07 | 0.0078 ± 0.0018 |
| SSL repr. + linear head | 0.033 ± 0.004 | 0.042 ± 0.002 | 0.97 ± 0.04 | 0.0038 ± 0.0001 |
| Timestepping metric | NMSE (↓) | NMSE (↓) | NMSE (↓) | MSE ×10⁻³ (↓) |
| Baseline | 0.508 ± 0.102 | 0.549 ± 0.095 | 0.110 ± 0.008 | 2.37 ± 0.01 |
| + SSL repr. conditioning | 0.330 ± 0.081 | 0.381 ± 0.097 | 0.108 ± 0.011 | 2.35 ± 0.03 |
Table: S4.T2: One-step validation MSE (rescaled by $10^3$) for time-stepping on Navier-Stokes with varying buoyancies, for different combinations of architectures and conditioning methods. Architectures are taken from [18] with the same choice of hyperparameters. Results with ground-truth buoyancies are an upper bound on the performance of a representation containing information on the buoyancy.
| Conditioning method | Addition [18] | AdaGN [35] | Spatial-Spectral [18] | Addition [18] |
|---|---|---|---|---|
| Time conditioning only | 2.60 ± 0.05 | 2.37 ± 0.01 | 13.4 ± 0.5 | 3.31 ± 0.06 |
| Time + SSL repr. cond. | 2.47 ± 0.02 | 2.35 ± 0.03 | 13.0 ± 1.0 | 2.37 ± 0.05 |
| Time + true buoyancy cond. | 2.08 ± 0.02 | 2.01 ± 0.02 | 11.4 ± 0.8 | 2.87 ± 0.03 |
Table: A1.T3: Monomial coefficients in vector field prolongation for Burgers’ equation.
| Monomial | Coefficient |
|---|---|
| $1$ | $\phi_t = \phi_{xx}$ |
| $u_x$ | $2\phi_x + 2(\phi_{xu} - \xi_{xx}) = -\xi_t$ |
| $u_x^2$ | $2(\phi_u - \xi_x) - \tau_{xx} + (\phi_{uu} - 2\xi_{xu}) = \phi_u - \tau_t$ |
| $u_x^3$ | $-2\tau_x - 2\xi_u - 2\tau_{xu} - \xi_{uu} = -\xi_u$ |
| $u_x^4$ | $-2\tau_u - \tau_{uu} = -\tau_u$ |
| $u_{xx}$ | $-\tau_{xx} + (\phi_u - 2\xi_x) = \phi_u - \tau_t$ |
| $u_x u_{xx}$ | $-2\tau_x - 2\tau_{xu} - 3\xi_u = -\xi_u$ |
| $u_x^2 u_{xx}$ | $-2\tau_u - \tau_{uu} - \tau_u = -2\tau_u$ |
| $u_{xx}^2$ | $-\tau_u = -\tau_u$ |
| $u_{xt}$ | $-2\tau_x = 0$ |
| $u_x u_{xt}$ | $-2\tau_u = 0$ |
Table: A5.T4: Generators of the Lie point symmetry group of the Burgers’ equation in its standard form [44, 100].
| | Lie algebra generator | Group operation $(x,t,u) \mapsto$ |
|---|---|---|
| $g_1$ (space translation) | $\epsilon\partial_x$ | $(x+\epsilon,\ t,\ u)$ |
| $g_2$ (time translation) | $\epsilon\partial_t$ | $(x,\ t+\epsilon,\ u)$ |
| $g_3$ (Galilean boost) | $\epsilon(t\partial_x + \partial_u)$ | $(x+\epsilon t,\ t,\ u+\epsilon)$ |
| $g_4$ (scaling) | $\epsilon(x\partial_x + 2t\partial_t - u\partial_u)$ | $(e^{\epsilon}x,\ e^{2\epsilon}t,\ e^{-\epsilon}u)$ |
| $g_5$ (projective) | $\epsilon(xt\partial_x + t^2\partial_t + (x-tu)\partial_u)$ | $\left(\frac{x}{1-\epsilon t},\ \frac{t}{1-\epsilon t},\ u+\epsilon(x-tu)\right)$ |
Table: A5.T5: Generators of the Lie point symmetry group of the KdV equation. The only symmetries used in the inverse task of predicting initial conditions are $g_1$ and $g_3$, since the other two are not invariant to the downstream task.
| | Lie algebra generator | Group operation $(x,t,u) \mapsto$ |
|---|---|---|
| $g_1$ (space translation) | $\epsilon\partial_x$ | $(x+\epsilon,\ t,\ u)$ |
| $g_2$ (time translation) | $\epsilon\partial_t$ | $(x,\ t+\epsilon,\ u)$ |
| $g_3$ (Galilean boost) | $\epsilon(t\partial_x + \partial_u)$ | $(x+\epsilon t,\ t,\ u+\epsilon)$ |
| $g_4$ (scaling) | $\epsilon(x\partial_x + 3t\partial_t - 2u\partial_u)$ | $(e^{\epsilon}x,\ e^{3\epsilon}t,\ e^{-2\epsilon}u)$ |
Table: A5.T7: Generators of the Lie point symmetry group of the incompressible Navier-Stokes equation. Here, $u,v$ correspond to the velocity of the fluid in the $x,y$ direction respectively, and $p$ corresponds to the pressure. The last three augmentations correspond to infinite-dimensional Lie subgroups with a choice of functions $E_x(t),E_y(t),q(t)$ that depend on $t$ only. For invariant tasks, we only used settings where $E_x(t),E_y(t)=t$ (linear) or $E_x(t),E_y(t)=t^2$ (quadratic) to ensure invariance to the downstream task or predictable changes in the outputs of the downstream task. These augmentations are listed as numbers 6 to 9.
| | Lie algebra generator | Group operation $(x,y,t,u,v,p)\mapsto$ |
|---|---|---|
| $g_1$ (time translation) | $\epsilon\partial_t$ | $(x,y,t+\epsilon,u,v,p)$ |
| $g_2$ ($x$ translation) | $\epsilon\partial_x$ | $(x+\epsilon,y,t,u,v,p)$ |
| $g_3$ ($y$ translation) | $\epsilon\partial_y$ | $(x,y+\epsilon,t,u,v,p)$ |
| $g_4$ (scaling) | $\epsilon(2t\partial_t+x\partial_x+y\partial_y-u\partial_u-v\partial_v-2p\partial_p)$ | $(e^{\epsilon}x,\,e^{\epsilon}y,\,e^{2\epsilon}t,\,e^{-\epsilon}u,\,e^{-\epsilon}v,\,e^{-2\epsilon}p)$ |
| $g_5$ (rotation) | $\epsilon(x\partial_y-y\partial_x+u\partial_v-v\partial_u)$ | $(x\cos\epsilon-y\sin\epsilon,\;x\sin\epsilon+y\cos\epsilon,\;t,\;u\cos\epsilon-v\sin\epsilon,\;u\sin\epsilon+v\cos\epsilon,\;p)$ |
| $g_6$ ($x$ linear boost)$^1$ | $\epsilon(t\partial_x+\partial_u)$ | $(x+\epsilon t,y,t,u+\epsilon,v,p)$ |
| $g_7$ ($y$ linear boost)$^1$ | $\epsilon(t\partial_y+\partial_v)$ | $(x,y+\epsilon t,t,u,v+\epsilon,p)$ |
| $g_8$ ($x$ quadratic boost)$^2$ | $\epsilon(t^2\partial_x+2t\partial_u-2x\partial_p)$ | $(x+\epsilon t^2,y,t,u+2\epsilon t,v,p-2\epsilon x)$ |
| $g_9$ ($y$ quadratic boost)$^2$ | $\epsilon(t^2\partial_y+2t\partial_v-2y\partial_p)$ | $(x,y+\epsilon t^2,t,u,v+2\epsilon t,p-2\epsilon y)$ |
| $g_{E_x}$ ($x$ general boost)$^3$ | $\epsilon(E_x(t)\partial_x+E'_x(t)\partial_u-xE''_x(t)\partial_p)$ | $(x+\epsilon E_x(t),y,t,u+\epsilon E'_x(t),v,p-\epsilon E''_x(t)\,x)$ |
| $g_{E_y}$ ($y$ general boost)$^3$ | $\epsilon(E_y(t)\partial_y+E'_y(t)\partial_v-yE''_y(t)\partial_p)$ | $(x,y+\epsilon E_y(t),t,u,v+\epsilon E'_y(t),p-\epsilon E''_y(t)\,y)$ |
| $g_q$ (additive pressure)$^3$ | $\epsilon q(t)\partial_p$ | $(x,y,t,u,v,p+\epsilon q(t))$ |
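The generators above act on discretized solutions as simple grid operations. Below is a minimal NumPy sketch of two such augmentations, a spatial translation and a Galilean boost, applied to a scalar field $u(x,t)$ on a periodic grid; the nearest-cell rolling interpolation and all names are illustrative choices, not the authors' implementation.

```python
import numpy as np

def spatial_translate(u, dx, eps):
    """Spatial shift (x, t, u) -> (x + eps, t, u) on a periodic grid.

    u: array of shape (nt, nx), sampled on a uniform spatial grid of step dx.
    A shift by eps moves the field by eps/dx cells; we round to the nearest
    integer shift for simplicity (illustrative only).
    """
    shift = int(round(eps / dx))
    return np.roll(u, shift, axis=1)

def galilean_boost(u, dx, dt, eps):
    """Galilean boost (x, t, u) -> (x + eps*t, t, u + eps).

    Each time slice t is shifted in x by eps*t, and eps is added to u.
    """
    nt, nx = u.shape
    out = np.empty_like(u)
    for i in range(nt):
        shift = int(round(eps * (i * dt) / dx))
        out[i] = np.roll(u[i], shift) + eps
    return out

# toy example: a static bump on a periodic domain, repeated over time
nx, nt, dx, dt = 128, 64, 1.0, 0.1
x = np.arange(nx) * dx
u = np.exp(-0.01 * (x[None, :] - 40.0) ** 2).repeat(nt, axis=0)
u_aug = galilean_boost(spatial_translate(u, dx, 5.0), dx, dt, 0.5)
assert u_aug.shape == u.shape
```

A finer-grained implementation would interpolate sub-cell shifts (e.g., via a Fourier phase shift), but the integer roll already preserves periodicity exactly.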
Table: A6.T8: One-step validation NMSE for time-stepping on Burgers for different architectures.
| Architecture | ResNet1d | FNO1d |
|---|---|---|
| Baseline (no conditioning) | 0.110 ± 0.008 | 0.184 ± 0.002 |
| Representation conditioning | 0.108 ± 0.011 | 0.173 ± 0.002 |
Table: A6.T9: List of model hyperparameters and training details for the invariant tasks. Training time includes periodic evaluations during pretraining.
| Equation | Burgers' | KdV | KS | Navier-Stokes |
|---|---|---|---|---|
| Network: | | | | |
| Model | ResNet18 | ResNet18 | ResNet18 | ResNet18 |
| Embedding Dim. | 512 | 512 | 512 | 512 |
| Optimization: | | | | |
| Optimizer | LARS [102] | AdamW | AdamW | AdamW |
| Learning Rate | 0.6 | 0.3 | 0.3 | 3e-4 |
| Batch Size | 32 | 64 | 64 | 64 |
| Epochs | 100 | 100 | 100 | 100 |
| Nb of exps | ~300 | ~30 | ~30 | ~300 |
| Hardware: | | | | |
| GPU used | Nvidia V100 | Nvidia M4000 | Nvidia M4000 | Nvidia V100 |
| Training time | ~5h | ~11h | ~12h | ~48h |
Table: A6.T11: One-step validation MSE $\times 10^{-3}$ (↓) for Navier-Stokes for different baselines and conditioning methods, with $\text{UNet}_{\text{mod}64}$ [18] as base model.
| Dataset size | 1,664 | 6,656 |
|---|---|---|
| Methods without ground truth buoyancy: | | |
| Time conditioned, Addition | 2.60 ± 0.05 | 1.18 ± 0.03 |
| Time + Rep. conditioned, Addition (ours) | 2.47 ± 0.02 | 1.17 ± 0.04 |
| Time conditioned, AdaGN | 2.37 ± 0.01 | 1.12 ± 0.02 |
| Time + Rep. conditioned, AdaGN (ours) | 2.35 ± 0.03 | 1.11 ± 0.01 |
| Methods with ground truth buoyancy: | | |
| Time + Buoyancy conditioned, Addition | 2.08 ± 0.02 | 1.10 ± 0.01 |
| Time + Buoyancy conditioned, AdaGN | 2.01 ± 0.02 | 1.06 ± 0.04 |
A high-level overview of the self-supervised learning pipeline, in the conventional setting of image data (top row) as well as our proposed setting of a PDE (bottom row). Given a large pool of unlabeled data, self-supervised learning uses augmentations (e.g., color-shifting for images, or Lie symmetries for PDEs) to train a network $f_\theta$ to produce useful representations from input images. Given a smaller set of labeled data, these representations can then be used as inputs to a supervised learning pipeline, performing tasks such as predicting class labels (images) or regressing the kinematic viscosity $\nu$ (Burgers' equation). Trainable steps are shown with red arrows; importantly, the representation function learned via SSL is not altered during application to downstream tasks.
Pretraining and evaluation frameworks, illustrated on Burgers' equation. (Left) Self-supervised pretraining. We generate augmented solutions ${\bm{x}}$ and ${\bm{x}}'$ using Lie symmetries parametrized by $g$ and $g'$ before passing them through an encoder $f_\theta$, yielding representations ${\bm{y}}$. The representations are then input to a projection head $h_\theta$, yielding embeddings ${\bm{z}}$, on which the SSL loss is applied. (Right) Evaluation protocols for our pretrained representations ${\bm{y}}$. On new data, we use the computed representations either to predict characteristics of interest, or to condition a neural network or operator to improve time-stepping performance.
One-parameter Lie point symmetries for the Kuramoto-Sivashinsky (KS) PDE. The transformations (left to right) include the unmodified solution $(u)$, temporal shifts $(g_1)$, spatial shifts $(g_2)$, and Galilean boosts $(g_3)$, with their corresponding infinitesimal transformations in the Lie algebra placed inside the figure. The shaded red square denotes the original $(x,t)$, while the dotted line represents the same points after the augmentation is applied.
Influence of dataset size on regression tasks. (Left) Kinematic viscosity regression on Burgers' equation. When using Lie point symmetries (LPS) during pretraining, we are able to improve performance over the supervised baselines, even when using an unlabeled dataset that is half the size of the labeled one. As we increase the amount of unlabeled data, performance improves, further reinforcing the usefulness of self-supervised representations. (Right) Buoyancy regression on the Navier-Stokes equation. We notice a similar trend as in Burgers', but found that the supervised approach was less stable than the self-supervised one. As such, SSL brings better performance as well as more stability here.
(Left) Isolating effective augmentations for Navier-Stokes. Note that we do not study $g_3$, $g_7$, and $g_9$, which are respectively counterparts of $g_2$, $g_6$, and $g_8$ applied in $y$ instead of $x$. (Right) Influence of the crop size on performance. Performance is maximized when the crops are as large as possible, with as little overlap as possible when generating pairs of them.
Illustration of the PDE symmetry group and invariances of a simple PDE, which rotates a point in 2-D space. The PDE symmetry group here corresponds to scalings of the radius of the rotation and fixed rotations of all the points over time. A sample invariant quantity is the rate of rotation (related to the parameter $\alpha$ in the PDE), which is fixed for any solution to this PDE.
Illustration of the 20 predicted time steps for the KdV PDE. (Left) Ground truth data from the PDE solver; (Middle) Predicted $u(x,t)$ using learned representations; (Right) Predicted output from the CNN baseline.
$$ \mathcal{L}(\mZ,\mZ') \approx \frac{\lambda_{\text{inv}}}{N}\underbrace{\sum_{i=1}^N \|\mZ_{i,:} - \mZ'_{i,:}\|_2^2}_{\mathcal{L}_{\text{sim}}(\mZ, \mZ')} + \frac{\lambda_{\text{reg}}}{D}\underbrace{\left( \| \operatorname{Cov}(\mZ) - \mI \|_F^2 + \| \operatorname{Cov}(\mZ') - \mI \|_F^2\right)}_{\mathcal{L}_{\text{reg}}(\mZ) + \mathcal{L}_{\text{reg}}(\mZ')} , $$
$$ \Delta (x,t, u) = u_t + u u_x + u_{xx} + u_{xxxx} = 0. $$
$$ \Delta(\vx, \vu)=0 \implies \Delta\left[ g \cdot (\vx, \vu)\right] = 0, \;\;\; \forall g \in G. $$
$$ \begin{split}\text{Temporal Shift: }\;\;\; g_{1}(\epsilon):&\;(x,t,u)\mapsto(x,t+\epsilon,u)\\ \text{Spatial Shift: }\;\;\; g_{2}(\epsilon):&\;(x,t,u)\mapsto(x+\epsilon,t,u)\\ \text{Galilean Boost: }\;\;\; g_{3}(\epsilon):&\;(x,t,u)\mapsto(x+\epsilon t,t,u+\epsilon)\end{split} $$ \tag{S2.E4}
$$ \vu_t = -\vu \cdot \nabla \vu -\frac{1}{\rho}\nabla p + \nu\nabla^2\vu + \vf, \;\;\;\; \nabla \cdot \vu = 0, $$ \tag{eq:navier_stokes_form}
$$ u_0(x) = \sum_{k=1}^N A_k \sin\left(\frac{2 \pi \omega_k x}{L} + \phi_k\right). $$
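Initial conditions of this form can be sampled in a few lines. The snippet below is a sketch in which the sampling ranges for $A_k$, $\omega_k$, and $\phi_k$ are assumed for illustration rather than taken from the paper's exact distributions.

```python
import numpy as np

def sample_initial_condition(nx=256, L=64.0, n_modes=10, rng=None):
    """Sample u0(x) = sum_k A_k sin(2*pi*omega_k*x/L + phi_k) on [0, L).

    Integer wavenumbers omega_k make u0 exactly L-periodic; the amplitude
    and frequency ranges here are illustrative assumptions.
    """
    rng = np.random.default_rng(rng)
    x = np.linspace(0.0, L, nx, endpoint=False)
    A = rng.uniform(-0.5, 0.5, n_modes)          # amplitudes (assumed range)
    omega = rng.integers(1, 9, n_modes)          # integer wavenumbers
    phi = rng.uniform(0.0, 2 * np.pi, n_modes)   # phases
    u0 = (A[:, None] * np.sin(2 * np.pi * omega[:, None] * x[None, :] / L
                              + phi[:, None])).sum(axis=0)
    return x, u0

x, u0 = sample_initial_condition(rng=0)
```

Excluding the endpoint of the grid avoids duplicating the periodic boundary point, which keeps spectral solvers and FFT-based diagnostics consistent.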
$$ \Gamma_f = \{ (\vx,f(\vx)):\vx \in \Omega \} \subset \mathbb{R}^p \times \mathbb{R}^q. $$ \tag{eq:graph_of_solution}
$$ \mathcal{U}^{(n)} = \mathcal{U} \times \mathcal{U}_1 \times \mathcal{U}_2 \times \cdots \times \mathcal{U}_n, $$
$$ \begin{split}{\bm{u}}^{(2)}=\operatorname{pr}^{(2)}f(x,y)&=(u;u_{x},u_{y};u_{xx},u_{xy},u_{yy})\\ &=\left(f;\frac{\partial f}{\partial x},\frac{\partial f}{\partial y};\frac{\partial^{2}f}{\partial x^{2}},\frac{\partial^{2}f}{\partial x\partial y},\frac{\partial^{2}f}{\partial y^{2}}\right)\in\mathbb{R}^{1}\times\mathbb{R}^{2}\times\mathbb{R}^{3},\end{split} $$ \tag{A1.E12}
$$ \Delta_\nu(\vx, \vu^{(n)}) = 0, \;\;\;\; \nu = 1, \dots, l. $$
$$ \begin{split} \operatorname{pr}^{(2)} f(x,t) &= \left( \sin (x) e^{-ct}; \cos (x) e^{-ct}, -c\sin (x) e^{-ct}; -\sin (x) e^{-ct}, -c\cos (x) e^{-ct}, c^2 \sin (x) e^{-ct} \right), \\ \Delta(x,t,\vu^{(n)}) &= -c\sin (x) e^{-ct} + c \sin (x) e^{-ct} = 0. \end{split} $$
$$ (x,t,u)\mapsto(x+\epsilon t,t,u+\epsilon), $$ \tag{A1.Ex1}
$$ \begin{split} g_1(t) \cdot (x, u) &= (x + t, u),\\ g_2(r) \cdot (x, u) &= (x, e^r \cdot u). \end{split} $$ \tag{eq:example_group_ops}
$$ \bm{v}_g \big|_{(x,u)} = \frac{d}{dt}\, g(t) \cdot (x,u) \biggr|_{t=0} , $$
$$ \operatorname{pr}^{(n)} \vv \bigr|_{(\vx, \vu^{(n)})}= \frac{d}{d \epsilon} \biggl|_{\epsilon = 0} \operatorname{pr}^{(n)} \left[ \exp (\epsilon \vv) \right] (\vx, \vu^{(n)}). $$
$$ \bm v = \sum_{i=1}^p \xi^i(\vx,\vu) \frac{\partial}{\partial x^i} + \sum_{\alpha=1}^q \phi_\alpha(\vx,\vu) \frac{\partial}{\partial u^\alpha}. \label{eq:vectorfield} $$ \tag{eq:vectorfield}
$$ \phi_\alpha^{\bm J} (\vx, \vu^{(n)}) = \prod_{i \in \bm J}\bm D_{i}\left( \phi_\alpha - \sum_{i=1}^p \xi^i u_i^\alpha \right) + \sum_{i=1}^p \xi^i u^\alpha_{\bm J, i}, $$ \tag{eq:coefficientTerms}
$$ \operatorname{pr}^{(2)} \vv = \vv + \phi^x \frac{\partial}{\partial u_x} + \phi^t \frac{\partial}{\partial u_t} + \phi^{xx} \frac{\partial}{\partial u_{xx}} +\phi^{xt} \frac{\partial}{\partial u_{xt}} +\phi^{tt} \frac{\partial}{\partial u_{tt}}. $$
$$ \xi({t, x}) = k_1 + k_4 x + 2k_5 t + 4k_6 xt $$
$$ \bm v_1 = \partial_x $$
$$ \bm v_6 = 4xt\partial_x + 4t^2\partial_t - (x^2 + 2t)\partial_u $$
$$ \begin{split} \exp\left(\epsilon \gamma e^{-u} \partial_u\right) u &= u + \sum_{k=1}^\infty \frac{1}{k!}\left(\epsilon \gamma e^{-u} \partial_u\right)^k \cdot u \\ &= u + \epsilon \gamma e^{-u} - \frac{1}{2}\epsilon^2 \gamma^2 e^{-2u} + \frac{1}{3}\epsilon^3 \gamma^3 e^{-3u} - \cdots \end{split} $$
$$ \exp(\vv) = \operatorname{Id} + \vv + \frac{1}{2} \vv \cdot \vv + \cdots = \sum_{n=0}^{\infty} \frac{\vv^n}{n!}. $$
$$ \exp(\bm X + \bm Y) = \lim_{n \to \infty} \left[\exp\left(\frac{\bm X}{n}\right) \exp\left(\frac{\bm Y}{n}\right)\right]^n \approx \left[\exp\left(\frac{\bm X}{k}\right) \exp\left(\frac{\bm Y}{k}\right)\right]^k, $$
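The Lie product formula above can be sanity-checked numerically for matrix generators. The sketch below uses a truncated Taylor series for the matrix exponential and two non-commuting $2\times 2$ matrices; it is a toy verification, not part of the paper's pipeline.

```python
import numpy as np

def expm(M, terms=30):
    """Matrix exponential via truncated Taylor series (fine for small norms)."""
    out = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

X = np.array([[0.0, 1.0], [0.0, 0.0]])
Y = np.array([[0.0, 0.0], [1.0, 0.0]])  # X and Y do not commute

exact = expm(X + Y)
for n in [1, 4, 16, 64]:
    approx = np.linalg.matrix_power(expm(X / n) @ expm(Y / n), n)
    err = np.abs(approx - exact).max()  # error decays roughly like O(1/n)
```

For production use one would call `scipy.linalg.expm` instead of a Taylor series, but the truncation suffices here because the matrices have small norm.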
$$ \begin{split}\mathcal{T}_{2}({\bm{v}})&=\exp\left(\frac{{\bm{v}}_{1}}{2}\right)\cdot\exp\left(\frac{{\bm{v}}_{2}}{2}\right)\cdots\exp\left(\frac{{\bm{v}}_{p}}{2}\right)\exp\left(\frac{{\bm{v}}_{p}}{2}\right)\cdot\exp\left(\frac{{\bm{v}}_{p-1}}{2}\right)\cdots\exp\left(\frac{{\bm{v}}_{1}}{2}\right),\\ \mathcal{T}_{2k}({\bm{v}})&=\mathcal{T}_{2k-2}(u_{k}{\bm{v}})^{2}\cdot\mathcal{T}_{2k-2}((1-4u_{k}){\bm{v}})\cdot\mathcal{T}_{2k-2}(u_{k}{\bm{v}})^{2},\\ u_{k}&=\frac{1}{4-4^{1/(2k-1)}}.\end{split} $$ \tag{A2.E48}
$$ \mathrm{Cov}(\mZ) = \frac{1}{N-1} \sum_{i=1}^N \left(\mZ_{i,:} - \overline{\mZ}_{i,:} \right) \left(\mZ_{i,:} - \overline{\mZ}_{i,:} \right)^\top, \;\;\;\;\; \overline{\mZ}_{i,:} = \frac{1}{N} \sum_{i=1}^N \mZ_{i,:} $$
$$ \mathcal{L}_{\text{sim}}(\vu,\vv) = \| \vu - \vv \|_2^2 , $$
$$ \mathcal{L}_{var}({\bm{Z}},{\bm{Z}}^{\prime})=\frac{1}{D}\sum_{i=1}^{N}\max(0,\gamma-\sqrt{\mathrm{Cov}({\bm{Z}})_{ii}})+\max(0,\gamma-\sqrt{\mathrm{Cov}({\bm{Z}}^{\prime})_{ii}}), $$
$$ \displaystyle\mathcal{L}_{\text{reg}}({\bm{Z}}) $$
$$ g \cdot \Gamma_f = \{ (\Tilde{\vx}, \Tilde{\vu}) = g \cdot (\vx,\vu): (\vx,\vu) \in \Gamma_f \}, $$
$$ u_t = u_{xx} + u_x^2. \label{eq:burgers_potential_form_delta} $$ \tag{eq:burgers_potential_form_delta}
$$ \operatorname{pr}^{(2)} \vv [\Delta(x,t,\vu^{(2)})] = \phi^t - \phi^{xx} + 2 u_x \phi^x = 0. $$
$$ \begin{split} \exp\left(\epsilon \gamma e^{-u} \partial_u\right) u &= u + \log\left( 1 + \epsilon \gamma e^{-u} \right) \ &= \log\left(e^{u}\right) + \log\left( 1 + \epsilon \gamma e^{-u} \right) \ &= \log\left( e^u + \epsilon \gamma \right). \end{split} $$
$$ \mZ_{i,:} = h_\theta\left( f_\theta \left( t \cdot \vx_i \right) \right) \text{ and } \mZ'_{i,:} = h_\theta\left( f_\theta \left( t' \cdot \vx_i \right) \right) $$
$$ \mathcal{L}_{\text{VICReg}} (\mZ,\mZ') = \lambda_{\text{inv}} \frac{1}{N} \sum_{i=1}^N \mathcal{L}_{\text{sim}}(\mZ_{i,:},\mZ'_{i,:}) + \mathcal{L}_{\text{reg}}(\mZ') + \mathcal{L}_{\text{reg}}(\mZ). $$
$$ u_t + \frac{1}{2}u_x^2 - \nu u_{xx} = 0. $$
$$ \begin{aligned} &\mathcal{L}_{var}(\mZ, \mZ') = \frac{1}{D} \sum_{i=1}^N \max( 0, \gamma -\sqrt{\mathrm{Cov}(\mZ)_{ii}}) + \max( 0, \gamma -\sqrt{\mathrm{Cov}(\mZ')_{ii}}), \\ &\mathcal{L}_{cov}(\mZ, \mZ') = \frac{1}{D} \sum_{i,j =1, i \neq j}^N [\mathrm{Cov}(\mZ)_{ij}]^2 + [\mathrm{Cov}(\mZ')_{ij}]^2, \\ &\mathcal{L}_{inv}(\mZ, \mZ') = \frac{1}{N} \sum_{i=1}^N \left\| \mZ_{i,:} - \mZ'_{i,:} \right\|^2 \end{aligned} $$
$$ \begin{split} \mathcal{L}_{\text{reg}}(\mZ) &= \lambda_{cov}\, C(\mZ) + \lambda_{var}\, V(\mZ), \quad\text{with}\\ C(\mZ) &= \frac{1}{D}\sum_{i\neq j} \mathrm{Cov}(\mZ)_{i,j}^2 \quad\text{and}\\ V(\mZ) &=\frac{1}{D} \sum_{j = 1}^D \max\left( 0, 1- \sqrt{\mathrm{Var}(\mZ_{:,j})}\right) . \end{split} $$
Example. [Exponential map on symmetry generator of Burgers' equation] Burgers' equation contains the Lie point symmetry $\vv_\gamma = \gamma(x, t)e^{-u}\partial_u$ with corresponding group transformation $\exp(\epsilon \vv_\gamma) \cdot (x,t,u) = (x,t,\log\left( e^u + \epsilon \gamma \right))$.
Proof. This transformation only changes the $u$ component. Here, we have
$$ \exp\left(\epsilon \gamma e^{-u} \partial_u\right) u = u + \sum_{k=1}^\infty \frac{1}{k!}\left(\epsilon \gamma e^{-u} \partial_u\right)^k \cdot u = u + \epsilon \gamma e^{-u} - \frac{1}{2}\epsilon^2 \gamma^2 e^{-2u} + \frac{1}{3}\epsilon^3 \gamma^3 e^{-3u} - \cdots $$
Applying the series expansion $\log(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots$, we get
$$ \exp\left(\epsilon \gamma e^{-u} \partial_u\right) u = u + \log\left( 1 + \epsilon \gamma e^{-u} \right) = \log\left(e^{u}\right) + \log\left( 1 + \epsilon \gamma e^{-u} \right) = \log\left( e^u + \epsilon \gamma \right). $$
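The closed form in the proof can be verified numerically: for small $\epsilon$, $\log(e^u + \epsilon\gamma)$ should agree with the first few series terms to high order. A quick sanity check of our own, not from the paper:

```python
import math

u, gamma, eps = 0.7, 1.3, 1e-3

# closed-form group transformation acting on u
closed = math.log(math.exp(u) + eps * gamma)

# first three terms of the series u + z - z^2/2 + z^3/3 with z = eps*gamma*e^{-u}
z = eps * gamma * math.exp(-u)
series = u + z - z**2 / 2 + z**3 / 3

# remainder is O(z^4), far below this tolerance for eps = 1e-3
assert abs(closed - series) < 1e-12
```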
Algorithm: algorithm
\caption{VICReg Loss Evaluation}
\label{alg:VICReg}
\begin{algorithmic}[0] %
\State \textbf{Hyperparameters:} $\lambda_{var}, \lambda_{cov}, \lambda_{inv}, \gamma \in \mathbb{R}$
\State \textbf{Input:} $N$ inputs in a batch $\{\vx_i \in \mathbb{R}^{D_{in}} , i = 1, \dots, N\}$
\State \textbf{VICRegLoss($N$, $\vx_i$, $\lambda_{var}$, $\lambda_{cov}$, $\lambda_{inv}$, $\gamma$):}
\end{algorithmic}
\begin{algorithmic}[1] %
\State Apply augmentations $t,t' \sim \mathcal{T}$ to form embedding matrices $\mZ, \mZ' \in \mathbb{R}^{N \times D}$: \begin{equation*}
\mZ_{i,:} = h_\theta\left( f_\theta \left( t \cdot \vx_i \right) \right) \text{ and } \mZ'_{i,:} = h_\theta\left( f_\theta \left( t' \cdot \vx_i \right) \right)
\end{equation*}
\State Form covariance matrices $\mathrm{Cov}(\mZ), \mathrm{Cov}(\mZ') \in \mathbb{R}^{D \times D}$:
\begin{equation*}
\mathrm{Cov}(\mZ) = \frac{1}{N-1} \sum_{i=1}^N \left(\mZ_{i,:} - \overline{\mZ}_{i,:} \right) \left(\mZ_{i,:} - \overline{\mZ}_{i,:} \right)^\top, \; \; \; \; \; \overline{\mZ}_{i,:} = \frac{1}{N} \sum_{i=1}^N \mZ_{i,:}
\end{equation*}
\State Evaluate loss: $\mathcal{L}(\mZ, \mZ') = \lambda_{var}\mathcal{L}_{var}(\mZ, \mZ') + \lambda_{cov}\mathcal{L}_{cov}(\mZ, \mZ') + \lambda_{inv}\mathcal{L}_{inv}(\mZ, \mZ')$
\begin{align*}
&\mathcal{L}_{var}(\mZ, \mZ') = \frac{1}{D} \sum_{i=1}^N \max( 0, \gamma -\sqrt{\mathrm{Cov}(\mZ)_{ii}}) + \max( 0, \gamma -\sqrt{\mathrm{Cov}(\mZ')_{ii}}), \\
&\mathcal{L}_{cov}(\mZ, \mZ') = \frac{1}{D} \sum_{i,j =1, i \neq j}^N [\mathrm{Cov}(\mZ)_{ij}]^2 + [\mathrm{Cov}(\mZ')_{ij}]^2, \\
&\mathcal{L}_{inv}(\mZ, \mZ') = \frac{1}{N} \sum_{i=1}^N \left\| \mZ_{i,:} - \mZ'_{i,:} \right\|^2
\end{align*}
\State \textbf{Return:} $\mathcal{L}(\mZ, \mZ')$
\end{algorithmic}
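The algorithm above translates directly into NumPy. The following is a minimal sketch of the loss computation alone, with the encoder $f_\theta$ and projector $h_\theta$ left abstract and the loss weights chosen for illustration.

```python
import numpy as np

def vicreg_loss(Z, Zp, lam_var=25.0, lam_cov=1.0, lam_inv=25.0, gamma=1.0):
    """VICReg loss on two embedding matrices Z, Zp of shape (N, D)."""
    N, D = Z.shape

    def cov(M):
        Mc = M - M.mean(axis=0, keepdims=True)
        return (Mc.T @ Mc) / (N - 1)

    C, Cp = cov(Z), cov(Zp)

    # variance term: hinge on the std of each embedding dimension
    l_var = (np.maximum(0.0, gamma - np.sqrt(np.diag(C))).sum()
             + np.maximum(0.0, gamma - np.sqrt(np.diag(Cp))).sum()) / D
    # covariance term: squared off-diagonal covariance entries
    off = ~np.eye(D, dtype=bool)
    l_cov = ((C[off] ** 2).sum() + (Cp[off] ** 2).sum()) / D
    # invariance term: mean squared distance between paired embeddings
    l_inv = ((Z - Zp) ** 2).sum(axis=1).mean()

    return lam_var * l_var + lam_cov * l_cov + lam_inv * l_inv

rng = np.random.default_rng(0)
Z = rng.normal(size=(32, 8))
loss = vicreg_loss(Z, Z + 0.01 * rng.normal(size=Z.shape))
assert loss >= 0.0
```

In the actual pipeline $\mZ$ and $\mZ'$ would come from two Lie-symmetry augmentations of the same batch passed through $h_\theta \circ f_\theta$; the weight values above follow common VICReg defaults and are not tuned.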
| Equation | KdV | KS | Burgers' | Navier-Stokes |
|---|---|---|---|---|
| SSL dataset size | 10,000 | 10,000 | 10,000 | 26,624 |
| Sample format $(t, x, (y))$ | 256 × 128 | 256 × 128 | 448 × 224 | 56 × 128 × 128 |
| Characteristic of interest | Init. coeffs | Init. coeffs | Kinematic viscosity | Buoyancy |
| Regression metric | NMSE (↓) | NMSE (↓) | Relative error % (↓) | MSE (↓) |
| Supervised | 0.102 ± 0.007 | 0.117 ± 0.009 | 1.18 ± 0.07 | 0.0078 ± 0.0018 |
| SSL repr. + linear head | 0.033 ± 0.004 | 0.042 ± 0.002 | 0.97 ± 0.04 | 0.0038 ± 0.0001 |
| Timestepping metric | NMSE (↓) | NMSE (↓) | NMSE (↓) | MSE $\times 10^{-3}$ (↓) |
| Baseline | 0.508 ± 0.102 | 0.549 ± 0.095 | 0.110 ± 0.008 | 2.37 ± 0.01 |
| + SSL repr. conditioning | 0.330 ± 0.081 | 0.381 ± 0.097 | 0.108 ± 0.011 | 2.35 ± 0.03 |
| Architecture | UNet mod 64 | UNet mod 64 | FNO 128 modes 16 | UF1Net modes 16 |
|---|---|---|---|---|
| Conditioning method | Addition [18] | AdaGN [35] | Spatial-Spectral [18] | Addition [18] |
| Time conditioning only | 2.60 ± 0.05 | 2.37 ± 0.01 | 13.4 ± 0.5 | 3.31 ± 0.06 |
| Time + SSL repr. cond. | 2.47 ± 0.02 | 2.35 ± 0.03 | 13.0 ± 1.0 | 2.37 ± 0.05 |
| Time + true buoyancy cond. | 2.08 ± 0.02 | 2.01 ± 0.02 | 11.4 ± 0.8 | 2.87 ± 0.03 |
| Augmentation | Best strength | Buoyancy MSE |
|---|---|---|
| Crop | N/A | 0.0051 ± 0.0001 |
| single Lie transform + $t$ translate $g_1$ | 0.1 | 0.0052 ± 0.0001 |
| + $x$ translate $g_2$ | 10.0 | 0.0041 ± 0.0002 |
| + scaling $g_4$ | 1.0 | 0.0050 ± 0.0003 |
| + rotation $g_5$ | 1.0 | 0.0049 ± 0.0001 |
| + boost $g_6$* | 0.1 | 0.0047 ± 0.0002 |
| + boost $g_8$** | 0.1 | 0.0046 ± 0.0001 |
| combined + $\{g_2, g_5, g_6, g_8\}$ | best/10 | 0.0038 ± 0.0001 |
| Section | Title |
|---|---|
| A | PDE Symmetry Groups and Deriving Generators |
| A.1 | Symmetry Groups and Infinitesimal Invariance |
| A.2 | Deriving Generators of the Symmetry Group of a PDE |
| A.3 | Example: Burgers' Equation |
| B | Exponential map and its approximations |
| B.1 | Approximations to the exponential map |
| C | VICReg Loss |
| D | Expanded related work |
| E | Details on Augmentations |
| E.1 | Burgers' equation |
| E.2 | KdV |
| E.3 | KS |
| E.4 | Navier Stokes |
| F | Experimental details |
| F.1 | Experiments on Burgers' Equation |
| F.2 | Experiments on KdV and KS |
| F.3 | Experiments on Navier-Stokes |
| Monomial | Coefficient |
|---|---|
| $1$ | $\phi_t = \phi_{xx}$ |
| $u_x$ | $2\phi_x + (2\phi_{xu} - \xi_{xx}) = -\xi_t$ |
| $u_x^2$ | $2(\phi_u - \xi_x) - \tau_{xx} + (\phi_{uu} - 2\xi_{xu}) = \phi_u - \tau_t$ |
| $u_x^3$ | $-2\tau_x - 2\xi_u - 2\tau_{xu} - \xi_{uu} = -\xi_u$ |
| $u_x^4$ | $-2\tau_u - \tau_{uu} = -\tau_u$ |
| $u_{xx}$ | $-\tau_{xx} + (\phi_u - 2\xi_x) = \phi_u - \tau_t$ |
| $u_x u_{xx}$ | $-2\tau_x - 2\tau_{xu} - 3\xi_u = -\xi_u$ |
| $u_x^2 u_{xx}$ | $-2\tau_u - \tau_{uu} - \tau_u = -2\tau_u$ |
| $u_{xx}^2$ | $-\tau_u = -\tau_u$ |
| $u_{xt}$ | $-2\tau_x = 0$ |
| $u_x u_{xt}$ | $-2\tau_u = 0$ |
| Equation | Burgers' | KdV | KS | Navier-Stokes |
|---|---|---|---|---|
| Neural Operator: Model | CNN [12] | CNN [12] | CNN [12] | Modified U-Net-64 [18] |
| Optimization: Optimizer | AdamW | AdamW | AdamW | Adam |
| Learning Rate | 1e-4 | 1e-4 | 1e-4 | 2e-4 |
| Batch Size | 16 | 16 | 16 | 32 |
| Epochs | 20 | 20 | 20 | 20 |
| Hardware: GPU used | Nvidia V100 | Nvidia M4000 | Nvidia M4000 | Nvidia V100 (16) |
| Training time | ~1 d | ~2 d | ~2 d | ~1.5 d |
| Dataset size | 1,664 | 6,656 |
|---|---|---|
| Methods without ground truth buoyancy: | | |
| Time conditioned, Addition | 2.60 ± 0.05 | 1.18 ± 0.03 |
| Time + Rep. conditioned, Addition (ours) | 2.47 ± 0.02 | 1.17 ± 0.04 |
| Time conditioned, AdaGN | 2.37 ± 0.01 | 1.12 ± 0.02 |
| Time + Rep. conditioned, AdaGN (ours) | 2.35 ± 0.03 | 1.11 ± 0.01 |
| Methods with ground truth buoyancy: | | |
| Time + Buoyancy conditioned, Addition | 2.08 ± 0.02 | 1.10 ± 0.01 |
| Time + Buoyancy conditioned, AdaGN | 2.01 ± 0.02 | 1.06 ± 0.04 |
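The "AdaGN" rows above refer to adaptive group normalization [35]: the conditioning vector (time, or time plus the learned representation) is mapped to per-channel scale and shift parameters that modulate a group-normalized feature map. The following numpy sketch illustrates the idea only; the function name, layer shapes, and the linear maps `W_scale`/`W_shift` are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ada_group_norm(h, cond, W_scale, W_shift, groups=8, eps=1e-5):
    """Adaptive GroupNorm: normalize h per channel group, then modulate
    with a scale and shift predicted from the conditioning vector.

    h:    (C, H, W) feature map
    cond: (d,) conditioning vector (e.g. time embedding + representation)
    W_scale, W_shift: (C, d) linear maps producing per-channel parameters
    """
    C, H, W = h.shape
    g = h.reshape(groups, C // groups, H, W)
    mean = g.mean(axis=(1, 2, 3), keepdims=True)
    var = g.var(axis=(1, 2, 3), keepdims=True)
    norm = ((g - mean) / np.sqrt(var + eps)).reshape(C, H, W)
    scale = 1.0 + W_scale @ cond   # initialized near the identity
    shift = W_shift @ cond
    return scale[:, None, None] * norm + shift[:, None, None]
```

With zero weight matrices this reduces to plain group normalization, so the conditioning can only improve on the unconditioned baseline if the network learns to use it.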
$$\begin{split}\exp\left(\epsilon\gamma e^{-u}\partial_{u}\right)u&=u+\log\left(1+\epsilon\gamma e^{-u}\right)\\ &=\log\left(e^{u}\right)+\log\left(1+\epsilon\gamma e^{-u}\right)\\ &=\log\left(e^{u}+\epsilon\gamma\right).\end{split}\tag{A2.E44}$$
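The closed form above can be sanity-checked numerically: exponentiating the generator γe^{−u}∂_u is the same as integrating the flow du/ds = γe^{−u} from s = 0 to s = ϵ, whose exact solution is u(ϵ) = log(e^u + ϵγ). A minimal check using forward Euler (the step count and tolerance are arbitrary choices of ours):

```python
import math

def flow_u(u0, gamma, eps, steps=10_000):
    """Integrate du/ds = gamma * exp(-u) from s = 0 to s = eps (forward Euler)."""
    u, ds = u0, eps / steps
    for _ in range(steps):
        u += ds * gamma * math.exp(-u)
    return u

# Closed form from the derivation: u -> log(e^u + eps * gamma).
u0, gamma, eps = 0.3, 1.7, 0.5
exact = math.log(math.exp(u0) + eps * gamma)
```

The Euler result converges to the closed form as the step count grows, confirming the exponentiation.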
References
[RAISSI2019686] Maziar Raissi, Paris Perdikaris, and George E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019. ISSN 0021-9991. URL https://doi.org/10.1016/j.jcp.2018.10.045.
[Karniadakis2021Nature] George E. Karniadakis, Ioannis G. Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-informed machine learning. Nature Reviews Physics, 3(6):422–440, 2021. URL https://doi.org/10.1038/s42254-021-00314-5.
[rudy2017data] Samuel H. Rudy, Steven L. Brunton, Joshua L. Proctor, and J. Nathan Kutz. Data-driven discovery of partial differential equations. Science Advances, 3(4):e1602614, 2017.
[brunton2016discovering] Steven L. Brunton, Joshua L. Proctor, and J. Nathan Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15):3932–3937, 2016. URL https://doi.org/10.1073/pnas.1517384113.
[radford2021learning] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[caron2021emerging] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
[nguyen2023climax] Tung Nguyen, Johannes Brandstetter, Ashish Kapoor, Jayesh K. Gupta, and Aditya Grover. ClimaX: A foundation model for weather and climate. arXiv preprint arXiv:2301.10343, 2023.
[bromley1993signature] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. Advances in Neural Information Processing Systems, 6, 1993.
[bardes2021vicreg] Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021.
[grill2020bootstrap] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: a new approach to self-supervised learning. NeurIPS, 2020.
[Olver1979SymmetryGA] Peter J. Olver. Symmetry groups and group invariant solutions of partial differential equations. Journal of Differential Geometry, 14:497–542, 1979a.
[brandstetter2022lie] Johannes Brandstetter, Max Welling, and Daniel E. Worrall. Lie point symmetry data augmentation for neural PDE solvers. arXiv preprint arXiv:2202.07643, 2022.
[ibragimov1995crc] Nail H. Ibragimov. CRC Handbook of Lie Group Analysis of Differential Equations, volume 3. CRC Press, 1995.
[baumann2000symmetry] Gerd Baumann. Symmetry Analysis of Differential Equations with Mathematica®. Springer Science & Business Media, 2000.
[bordes2022guillotine] Florian Bordes, Randall Balestriero, Quentin Garrido, Adrien Bardes, and Pascal Vincent. Guillotine regularization: Improving deep networks generalization by removing their head. arXiv preprint arXiv:2206.13378, 2022.
[garrido2022duality] Quentin Garrido, Yubei Chen, Adrien Bardes, Laurent Najman, and Yann LeCun. On the duality between contrastive and non-contrastive self-supervised learning. arXiv preprint arXiv:2206.02574, 2022.
[he2016deep] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[gupta2022towards] Jayesh K. Gupta and Johannes Brandstetter. Towards multi-spatiotemporal-scale generalized PDE modeling. TMLR, 2022.
[isakov2006inverse] Victor Isakov. Inverse Problems for Partial Differential Equations, volume 127. Springer, 2006.
[balestriero2023cookbook] Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Grégoire Mialon, Yuandong Tian, et al. A cookbook of self-supervised learning. arXiv preprint arXiv:2304.12210, 2023.
[chen2020simple] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020a.
[trotter1959product] Hale F. Trotter. On the product of semi-groups of operators. Proceedings of the American Mathematical Society, 10(4):545–551, 1959.
[childs2021theory] Andrew M. Childs, Yuan Su, Minh C. Tran, Nathan Wiebe, and Shuchen Zhu. Theory of Trotter error with commutator scaling. Physical Review X, 11(1):011020, 2021.
[bommasani2021opportunities] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[brandes2022proteinbert] Nadav Brandes, Dan Ofer, Yam Peleg, Nadav Rappoport, and Michal Linial. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics, 38(8):2102–2110, 2022.
[yang2020improved] Jianyi Yang, Ivan Anishchenko, Hahnbeom Park, Zhenling Peng, Sergey Ovchinnikov, and David Baker. Improved protein structure prediction using predicted interresidue orientations. Proceedings of the National Academy of Sciences, 117(3):1496–1503, 2020.
[wang2020incorporating] Rui Wang, Robin Walters, and Rose Yu. Incorporating symmetry into deep dynamics models for improved generalization. arXiv preprint arXiv:2002.03061, 2020.
[richter2022neural] Jack Richter-Powell, Yaron Lipman, and Ricky T. Q. Chen. Neural conservation laws: A divergence-free perspective. arXiv preprint arXiv:2210.01741, 2022.
[https://doi.org/10.48550/arxiv.2112.04307] Peter J. Baddoo, Benjamin Herrmann, Beverley J. McKeon, J. Nathan Kutz, and Steven L. Brunton. Physics-informed dynamic mode decomposition (PiDMD), 2021. URL https://arxiv.org/abs/2112.04307.
[finzi2020simplifying] Marc Finzi, Ke Alexander Wang, and Andrew G. Wilson. Simplifying Hamiltonian and Lagrangian neural networks via explicit constraints. Advances in Neural Information Processing Systems, 33:13880–13889, 2020.
[chen2021neural] Yuhan Chen, Takashi Matsubara, and Takaharu Yaguchi. Neural symplectic form: learning Hamiltonian equations on general coordinate systems. Advances in Neural Information Processing Systems, 34:16659–16670, 2021a.
[takamoto2022pdebench] Makoto Takamoto, Timothy Praditia, Raphael Leiteritz, Daniel MacKinlay, Francesco Alesiani, Dirk Pflüger, and Mathias Niepert. PDEBench: An extensive benchmark for scientific machine learning. Advances in Neural Information Processing Systems, 35:1596–1611, 2022.
[loshchilov2017decoupled] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[bar2019learning] Yohai Bar-Sinai, Stephan Hoyer, Jason Hickey, and Michael P. Brenner. Learning data-driven discretizations for partial differential equations. Proceedings of the National Academy of Sciences, 116(31):15344–15349, 2019. URL https://doi.org/10.1073/pnas.1814058116.
[nichol2021improved] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8162–8171. PMLR, 18–24 Jul 2021.
[li2021fourier] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations, 2021.
[sariyildiz2023improving] Mert Bulent Sariyildiz, Yannis Kalantidis, Karteek Alahari, and Diane Larlus. No reason for no supervision: Improved generalization in supervised models. In International Conference on Learning Representations, 2023.
[oquab2023dinov2] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[bardes2022vicregl] Adrien Bardes, Jean Ponce, and Yann LeCun. VICRegL: Self-supervised learning of local visual features. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=ePZsWeGJXyp.
[chen2020simsiam] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In CVPR, 2020.
[winter2022unsupervised] Robin Winter, Marco Bertolini, Tuan Le, Frank Noe, and Djork-Arné Clevert. Unsupervised learning of group invariant and equivariant representations. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 31942–31956. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/cf3d7d8e79703fe947deffb587a83639-Paper-Conference.pdf.
[garrido2023self] Quentin Garrido, Laurent Najman, and Yann LeCun. Self-supervised learning of split invariant equivariant representations. arXiv preprint arXiv:2302.10283, 2023.
[nee1998limit] Janpou Nee and Jinqiao Duan. Limit set of trajectories of the coupled viscous Burgers' equations. Applied Mathematics Letters, 11(1):57–61, 1998.
[olver1979symmetry] Peter J. Olver. Symmetry groups and group invariant solutions of partial differential equations. Journal of Differential Geometry, 14(4):497–542, 1979b.
[baker2003matrix] Andrew Baker. Matrix Groups: An Introduction to Lie Group Theory. Springer Science & Business Media, 2003.
[dollard1979product] John D. Dollard, Charles N. Friedman, and Pesi Rustom Masani. Product Integration with Applications to Differential Equations, volume 10. Westview Press, 1979.
[suzuki1991general] Masuo Suzuki. General theory of fractal path integrals with applications to many-body theories and statistical physics. Journal of Mathematical Physics, 32(2):400–407, 1991.
[mclachlan2002splitting] Robert I. McLachlan and G. Reinout W. Quispel. Splitting methods. Acta Numerica, 11:341–434, 2002.
[descombes2010exact] Stéphane Descombes and Mechthild Thalhammer. An exact local error representation of exponential operator splitting methods for evolutionary problems and applications to linear Schrödinger equations in the semi-classical regime. BIT Numerical Mathematics, 50(4):729–749, 2010.
[engel2000one] Klaus-Jochen Engel, Rainer Nagel, and Simon Brendle. One-Parameter Semigroups for Linear Evolution Equations, volume 194. Springer, 2000.
[canzi2012simple] Claudia Canzi and Graziano Guerra. A simple counterexample related to the Lie–Trotter product formula. In Semigroup Forum, volume 84, pages 499–504. Springer, 2012.
[ramezanizadeh2019review] Mahdi Ramezanizadeh, Mohammad Hossein Ahmadi, Mohammad Alhuyi Nazari, Milad Sadeghzadeh, and Lingen Chen. A review on the utilized machine learning approaches for modeling the dynamic viscosity of nanofluids. Renewable and Sustainable Energy Reviews, 114:109345, 2019.
[fries2022lasdi] William D. Fries, Xiaolong He, and Youngsoo Choi. LaSDI: Parametric latent space dynamics identification. Computer Methods in Applied Mechanics and Engineering, 399:115436, 2022.
[he2022glasdi] Xiaolong He, Youngsoo Choi, William D. Fries, Jon Belof, and Jiun-Shyan Chen. gLaSDI: Parametric physics-informed greedy latent space dynamics identification. arXiv preprint arXiv:2204.12005, 2022.
[syah2021implementation] Rahmad Syah, Naeim Ahmadian, Marischa Elveny, S. M. Alizadeh, Meysam Hosseini, and Afrasyab Khan. Implementation of artificial intelligence and support vector machine learning to estimate the drilling fluid density in high-pressure high-temperature wells. Energy Reports, 7:4106–4113, 2021.
[vinuesa2022enhancing] Ricardo Vinuesa and Steven L. Brunton. Enhancing computational fluid dynamics with machine learning. Nature Computational Science, 2(6):358–366, 2022.
[raissi2020hidden] Maziar Raissi, Alireza Yazdani, and George Em Karniadakis. Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations. Science, 367(6481):1026–1030, 2020.
[adeli2022advanced] Ehsan Adeli, Luning Sun, Jianxun Wang, and Alexandros A. Taflanidis. An advanced spatio-temporal convolutional recurrent neural network for storm surge predictions. arXiv preprint arXiv:2204.09501, 2022.
[wu2022non] Pin Wu, Feng Qiu, Weibing Feng, Fangxing Fang, and Christopher Pain. A non-intrusive reduced order model with transformer neural network and its application. Physics of Fluids, 34(11):115130, 2022.
[https://doi.org/10.48550/arxiv.2302.03580] Léonard Equer, T. Konstantin Rusch, and Siddhartha Mishra. Multi-scale message passing neural PDE solvers, 2023. URL https://arxiv.org/abs/2302.03580.
[kim2019deep] Byungsoo Kim, Vinicius C. Azevedo, Nils Thuerey, Theodore Kim, Markus Gross, and Barbara Solenthaler. Deep fluids: A generative network for parameterized fluid simulations. In Computer Graphics Forum, volume 38(2), pages 59–70. Wiley Online Library, 2019.
[lusch2018deep] Bethany Lusch, J. Nathan Kutz, and Steven L. Brunton. Deep learning for universal linear embeddings of nonlinear dynamics. Nature Communications, 9(1):4950, 2018.
[schmid2010dynamic] Peter J. Schmid. Dynamic mode decomposition of numerical and experimental data. Journal of Fluid Mechanics, 656:5–28, 2010.
[he2020moco] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
[chen2020mocov2] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020b.
[chen2021mocov3] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In ICCV, 2021b.
[yeh2021decoupled] Chun-Hsiao Yeh, Cheng-Yao Hong, Yen-Chi Hsu, Tyng-Luh Liu, Yubei Chen, and Yann LeCun. Decoupled contrastive learning. arXiv preprint arXiv:2110.06848, 2021.
[oord2018infonce] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[haochen2021provable] Jeff Z. HaoChen, Colin Wei, Adrien Gaidon, and Tengyu Ma. Provable guarantees for self-supervised deep learning with spectral contrastive loss. NeurIPS, 34, 2021.
[caron2018clustering] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning. In ECCV, 2018.
[caron2020swav] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020.
[zbontar2021barlow] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In ICML, pages 12310–12320. PMLR, 2021.
[ermolov2021whitening] Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, and Nicu Sebe. Whitening for self-supervised representation learning, 2021.
[li2022neural] Zengyi Li, Yubei Chen, Yann LeCun, and Friedrich T. Sommer. Neural manifold clustering and embedding. arXiv preprint arXiv:2201.10000, 2022.
[cabannes2023ssl] Vivien Cabannes, Bobak T. Kiani, Randall Balestriero, Yann LeCun, and Alberto Bietti. The SSL interplay: Augmentations, inductive bias, and generalization. arXiv preprint arXiv:2302.02774, 2023.
[mialon2022variance] Grégoire Mialon, Randall Balestriero, and Yann LeCun. Variance-covariance regularization enforces pairwise independence in self-supervised representations. arXiv preprint arXiv:2209.14905, 2022.
[schmarje2021survey] Lars Schmarje, Monty Santarossa, Simon-Martin Schröder, and Reinhard Koch. A survey on semi-, self- and unsupervised learning for image classification. IEEE Access, 9:82146–82168, 2021.
[cerri2019variational] Olmo Cerri, Thong Q. Nguyen, Maurizio Pierini, Maria Spiropulu, and Jean-Roch Vlimant. Variational autoencoders for new physics mining at the Large Hadron Collider. Journal of High Energy Physics, 2019(5):1–29, 2019.
[rasmussen2010gaussian] Carl Edward Rasmussen and Hannes Nickisch. Gaussian processes for machine learning (GPML) toolbox. The Journal of Machine Learning Research, 11:3011–3015, 2010.
[kaya2019deep] Mahmut Kaya and Hasan Şakir Bilge. Deep metric learning: A survey. Symmetry, 11(9):1066, 2019.
[long2018pde] Zichao Long, Yiping Lu, Xianzhong Ma, and Bin Dong. PDE-Net: Learning PDEs from data. In International Conference on Machine Learning, pages 3208–3216. PMLR, 2018.
[fernandez2016review] M. Giselle Fernández-Godino, Chanyoung Park, Nam-Ho Kim, and Raphael T. Haftka. Review of multi-fidelity models. arXiv preprint arXiv:1609.07196, 2016.
[forrester2007multi] Alexander I. J. Forrester, András Sóbester, and Andy J. Keane. Multi-fidelity optimization via surrogate modelling. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 463(2088):3251–3269, 2007.
[ng2012multifidelity] Leo Wai-Tsun Ng and Michael Eldred. Multifidelity uncertainty quantification using non-intrusive polynomial chaos and stochastic collocation. In 53rd AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics and Materials Conference, 20th AIAA/ASME/AHS Adaptive Structures Conference, 14th AIAA, page 1852, 2012.
[perdikaris2017nonlinear] Paris Perdikaris, Maziar Raissi, Andreas Damianou, Neil D. Lawrence, and George Em Karniadakis. Nonlinear information fusion algorithms for data-efficient multi-fidelity modelling. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 473(2198):20160751, 2017.
[bronstein2021geometric] Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021.
[zaheer2017deep] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R. Salakhutdinov, and Alexander J. Smola. Deep sets. Advances in Neural Information Processing Systems, 30, 2017.
[kondor2018generalization] Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. In International Conference on Machine Learning, pages 2747–2755. PMLR, 2018.
[cohen2016group] Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pages 2990–2999. PMLR, 2016.
[satorras2021n] Víctor Garcia Satorras, Emiel Hoogeboom, and Max Welling. E(n) equivariant graph neural networks. In International Conference on Machine Learning, pages 9323–9332. PMLR, 2021.
[jumper2021highly] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
[thomas2018tensor] Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds. arXiv preprint arXiv:1802.08219, 2018.
[kirkpatrick2021pushing] James Kirkpatrick, Brendan McMorrow, David H. P. Turban, Alexander L. Gaunt, James S. Spencer, Alexander G. D. G. Matthews, Annette Obika, Louis Thiry, Meire Fortunato, David Pfau, et al. Pushing the frontiers of density functionals by solving the fractional electron problem. Science, 374(6573):1385–1389, 2021.
[zhou2020graph] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI Open, 1:57–81, 2020.
[masci2015geodesic] Jonathan Masci, Davide Boscaini, Michael Bronstein, and Pierre Vandergheynst. Geodesic convolutional neural networks on Riemannian manifolds. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 37–45, 2015.
[li2020efficient] Jun Li, Li Fuxin, and Sinisa Todorovic. Efficient Riemannian optimization on the Stiefel manifold via the Cayley transform. arXiv preprint arXiv:2002.01113, 2020.
[kiani2022projunn] Bobak Kiani, Randall Balestriero, Yann LeCun, and Seth Lloyd. projUNN: efficient method for training deep networks with unitary matrices. arXiv preprint arXiv:2203.05483, 2022.
[chen2019symplectic] Zhengdao Chen, Jianyu Zhang, Martin Arjovsky, and Léon Bottou. Symplectic recurrent neural networks. arXiv preprint arXiv:1909.13334, 2019.
[arjovsky2016unitary] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128. PMLR, 2016.
[ozics2017similarity] Turgut Öziş and İsmail Aslan. Similarity solutions to Burgers' equation in terms of special functions of mathematical physics. Acta Physica Polonica B, 2017.
[lloyd1981infinitesimal] S. P. Lloyd. The infinitesimal group of the Navier-Stokes equations. Acta Mechanica, 38(1-2):85–98, 1981.
[you2017large] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.
[bib1] Raissi et al. [2019] Mazier Raissi, Paris Perdikaris, and George E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019. ISSN 0021-9991. URL https://doi.org/10.1016/j.jcp.2018.10.045.
[bib2] Karniadakis et al. [2021] George E. Karniadakis, Ioannis G. Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-informed machine learning. Nature Reviews Physics, 3(6):422–440, 2021. URL https://doi.org/10.1038/s42254-021-00314-5.
[bib3] Rudy et al. [2017] Samuel H Rudy, Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. Data-driven discovery of partial differential equations. Science advances, 3(4):e1602614, 2017.
[bib4] Brunton et al. [2016] Steven L. Brunton, Joshua L. Proctor, and J. Nathan Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15):3932–3937, 2016. URL https://doi.org/10.1073/pnas.1517384113.
[bib5] Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In arXiv preprint arXiv:2103.00020, 2021.
[bib6] Caron et al. [2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
[bib7] Nguyen et al. [2023] Tung Nguyen, Johannes Brandstetter, Ashish Kapoor, Jayesh K Gupta, and Aditya Grover. ClimaX: A foundation model for weather and climate. arXiv preprint arXiv:2301.10343, 2023.
[bib8] Bromley et al. [1993] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a" siamese" time delay neural network. Advances in neural information processing systems, 6, 1993.
[bib9] Bardes et al. [2021] Adrien Bardes, Jean Ponce, and Yann LeCun. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021.
[bib10] Grill et al. [2020] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. NeurIPS, 2020.
[bib11] Peter J. Olver. Symmetry groups and group invariant solutions of partial differential equations. Journal of Differential Geometry, 14:497–542, 1979a.
[bib12] Brandstetter et al. [2022] Johannes Brandstetter, Max Welling, and Daniel E Worrall. Lie point symmetry data augmentation for neural pde solvers. arXiv preprint arXiv:2202.07643, 2022.
[bib13] Nail H Ibragimov. CRC handbook of Lie group analysis of differential equations, volume 3. CRC press, 1995.
[bib14] Gerd Baumann. Symmetry analysis of differential equations with Mathematica®. Springer Science & Business Media, 2000.
[bib15] Bordes et al. [2022] Florian Bordes, Randall Balestriero, Quentin Garrido, Adrien Bardes, and Pascal Vincent. Guillotine regularization: Improving deep networks generalization by removing their head. arXiv preprint arXiv:2206.13378, 2022.
[bib16] Garrido et al. [2022] Quentin Garrido, Yubei Chen, Adrien Bardes, Laurent Najman, and Yann Lecun. On the duality between contrastive and non-contrastive self-supervised learning. arXiv preprint arXiv:2206.02574, 2022.
[bib17] He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[bib18] Jayesh K Gupta and Johannes Brandstetter. Towards multi-spatiotemporal-scale generalized pde modeling. TMLR, 2022.
[bib19] Victor Isakov. Inverse problems for partial differential equations, volume 127. Springer, 2006.
[bib20] Balestriero et al. [2023] Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Tian, et al. A cookbook of self-supervised learning. arXiv preprint arXiv:2304.12210, 2023.
[bib21] Chen et al. [2020a] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020a.
[bib22] Hale F Trotter. On the product of semi-groups of operators. Proceedings of the American Mathematical Society, 10(4):545–551, 1959.
[bib23] Andrew M Childs, Yuan Su, Minh C Tran, Nathan Wiebe, and Shuchen Zhu. Theory of Trotter error with commutator scaling. Physical Review X, 11(1):011020, 2021.
[bib24] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[bib25] Nadav Brandes, Dan Ofer, Yam Peleg, Nadav Rappoport, and Michal Linial. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics, 38(8):2102–2110, 2022.
[bib26] Jianyi Yang, Ivan Anishchenko, Hahnbeom Park, Zhenling Peng, Sergey Ovchinnikov, and David Baker. Improved protein structure prediction using predicted interresidue orientations. Proceedings of the National Academy of Sciences, 117(3):1496–1503, 2020.
[bib27] Rui Wang, Robin Walters, and Rose Yu. Incorporating symmetry into deep dynamics models for improved generalization. arXiv preprint arXiv:2002.03061, 2020.
[bib28] Jack Richter-Powell, Yaron Lipman, and Ricky TQ Chen. Neural conservation laws: A divergence-free perspective. arXiv preprint arXiv:2210.01741, 2022.
[bib29] Peter J. Baddoo, Benjamin Herrmann, Beverley J. McKeon, J. Nathan Kutz, and Steven L. Brunton. Physics-informed dynamic mode decomposition (piDMD), 2021. URL https://arxiv.org/abs/2112.04307.
[bib30] Marc Finzi, Ke Alexander Wang, and Andrew G Wilson. Simplifying Hamiltonian and Lagrangian neural networks via explicit constraints. Advances in Neural Information Processing Systems, 33:13880–13889, 2020.
[bib31] Yuhan Chen, Takashi Matsubara, and Takaharu Yaguchi. Neural symplectic form: learning Hamiltonian equations on general coordinate systems. Advances in Neural Information Processing Systems, 34:16659–16670, 2021a.
[bib32] Makoto Takamoto, Timothy Praditia, Raphael Leiteritz, Daniel MacKinlay, Francesco Alesiani, Dirk Pflüger, and Mathias Niepert. PDEBench: An extensive benchmark for scientific machine learning. Advances in Neural Information Processing Systems, 35:1596–1611, 2022.
[bib33] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[bib34] Yohai Bar-Sinai, Stephan Hoyer, Jason Hickey, and Michael P. Brenner. Learning data-driven discretizations for partial differential equations. Proceedings of the National Academy of Sciences, 116(31):15344–15349, 2019. URL https://doi.org/10.1073/pnas.1814058116.
[bib35] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8162–8171. PMLR, 18–24 Jul 2021.
[bib36] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations, 2021.
[bib37] Mert Bulent Sariyildiz, Yannis Kalantidis, Karteek Alahari, and Diane Larlus. No reason for no supervision: Improved generalization in supervised models. In International Conference on Learning Representations, 2023.
[bib38] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[bib39] Adrien Bardes, Jean Ponce, and Yann LeCun. VICRegL: Self-supervised learning of local visual features. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=ePZsWeGJXyp.
[bib40] Xinlei Chen and Kaiming He. Exploring simple Siamese representation learning. In CVPR, 2020.
[bib41] Robin Winter, Marco Bertolini, Tuan Le, Frank Noe, and Djork-Arné Clevert. Unsupervised learning of group invariant and equivariant representations. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 31942–31956. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/cf3d7d8e79703fe947deffb587a83639-Paper-Conference.pdf.
[bib42] Quentin Garrido, Laurent Najman, and Yann Lecun. Self-supervised learning of split invariant equivariant representations. arXiv preprint arXiv:2302.10283, 2023.
[bib43] Janpou Nee and Jinqiao Duan. Limit set of trajectories of the coupled viscous Burgers' equations. Applied Mathematics Letters, 11(1):57–61, 1998.
[bib44] Peter J Olver. Symmetry groups and group invariant solutions of partial differential equations. Journal of Differential Geometry, 14(4):497–542, 1979b.
[bib45] Andrew Baker. Matrix groups: An introduction to Lie group theory. Springer Science & Business Media, 2003.
[bib46] John D Dollard, Charles N Friedman, and Pesi Rustom Masani. Product integration with applications to differential equations, volume 10. Westview Press, 1979.
[bib47] Masuo Suzuki. General theory of fractal path integrals with applications to many-body theories and statistical physics. Journal of Mathematical Physics, 32(2):400–407, 1991.
[bib48] Robert I McLachlan and G Reinout W Quispel. Splitting methods. Acta Numerica, 11:341–434, 2002.
[bib49] Stéphane Descombes and Mechthild Thalhammer. An exact local error representation of exponential operator splitting methods for evolutionary problems and applications to linear Schrödinger equations in the semi-classical regime. BIT Numerical Mathematics, 50(4):729–749, 2010.
[bib50] Klaus-Jochen Engel, Rainer Nagel, and Simon Brendle. One-parameter semigroups for linear evolution equations, volume 194. Springer, 2000.
[bib51] Claudia Canzi and Graziano Guerra. A simple counterexample related to the Lie–Trotter product formula. In Semigroup Forum, volume 84, pages 499–504. Springer, 2012.
[bib52] Mahdi Ramezanizadeh, Mohammad Hossein Ahmadi, Mohammad Alhuyi Nazari, Milad Sadeghzadeh, and Lingen Chen. A review on the utilized machine learning approaches for modeling the dynamic viscosity of nanofluids. Renewable and Sustainable Energy Reviews, 114:109345, 2019.
[bib53] William D Fries, Xiaolong He, and Youngsoo Choi. LaSDI: Parametric latent space dynamics identification. Computer Methods in Applied Mechanics and Engineering, 399:115436, 2022.
[bib54] Xiaolong He, Youngsoo Choi, William D Fries, Jon Belof, and Jiun-Shyan Chen. gLaSDI: Parametric physics-informed greedy latent space dynamics identification. arXiv preprint arXiv:2204.12005, 2022.
[bib55] Rahmad Syah, Naeim Ahmadian, Marischa Elveny, SM Alizadeh, Meysam Hosseini, and Afrasyab Khan. Implementation of artificial intelligence and support vector machine learning to estimate the drilling fluid density in high-pressure high-temperature wells. Energy Reports, 7:4106–4113, 2021.
[bib56] Ricardo Vinuesa and Steven L Brunton. Enhancing computational fluid dynamics with machine learning. Nature Computational Science, 2(6):358–366, 2022.
[bib57] Maziar Raissi, Alireza Yazdani, and George Em Karniadakis. Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations. Science, 367(6481):1026–1030, 2020.
[bib58] Ehsan Adeli, Luning Sun, Jianxun Wang, and Alexandros A Taflanidis. An advanced spatio-temporal convolutional recurrent neural network for storm surge predictions. arXiv preprint arXiv:2204.09501, 2022.
[bib59] Pin Wu, Feng Qiu, Weibing Feng, Fangxing Fang, and Christopher Pain. A non-intrusive reduced order model with transformer neural network and its application. Physics of Fluids, 34(11):115130, 2022.
[bib60] Léonard Equer, T. Konstantin Rusch, and Siddhartha Mishra. Multi-scale message passing neural PDE solvers, 2023. URL https://arxiv.org/abs/2302.03580.
[bib61] Byungsoo Kim, Vinicius C Azevedo, Nils Thuerey, Theodore Kim, Markus Gross, and Barbara Solenthaler. Deep fluids: A generative network for parameterized fluid simulations. In Computer Graphics Forum, volume 38(2), pages 59–70. Wiley Online Library, 2019.
[bib62] Bethany Lusch, J Nathan Kutz, and Steven L Brunton. Deep learning for universal linear embeddings of nonlinear dynamics. Nature Communications, 9(1):4950, 2018.
[bib63] Peter J Schmid. Dynamic mode decomposition of numerical and experimental data. Journal of Fluid Mechanics, 656:5–28, 2010.
[bib64] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
[bib65] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020b.
[bib66] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In ICCV, 2021b.
[bib67] Chun-Hsiao Yeh, Cheng-Yao Hong, Yen-Chi Hsu, Tyng-Luh Liu, Yubei Chen, and Yann LeCun. Decoupled contrastive learning. arXiv preprint arXiv:2110.06848, 2021.
[bib68] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[bib69] Jeff Z HaoChen, Colin Wei, Adrien Gaidon, and Tengyu Ma. Provable guarantees for self-supervised deep learning with spectral contrastive loss. NeurIPS, 34, 2021.
[bib70] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning. In ECCV, 2018.
[bib71] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020.
[bib72] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In ICML, pages 12310–12320. PMLR, 2021.
[bib73] Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, and Nicu Sebe. Whitening for self-supervised representation learning, 2021.
[bib74] Zengyi Li, Yubei Chen, Yann LeCun, and Friedrich T Sommer. Neural manifold clustering and embedding. arXiv preprint arXiv:2201.10000, 2022.
[bib75] Vivien Cabannes, Bobak T Kiani, Randall Balestriero, Yann LeCun, and Alberto Bietti. The SSL interplay: Augmentations, inductive bias, and generalization. arXiv preprint arXiv:2302.02774, 2023.
[bib76] Grégoire Mialon, Randall Balestriero, and Yann Lecun. Variance-covariance regularization enforces pairwise independence in self-supervised representations. arXiv preprint arXiv:2209.14905, 2022.
[bib77] Lars Schmarje, Monty Santarossa, Simon-Martin Schröder, and Reinhard Koch. A survey on semi-, self- and unsupervised learning for image classification. IEEE Access, 9:82146–82168, 2021.
[bib78] Olmo Cerri, Thong Q Nguyen, Maurizio Pierini, Maria Spiropulu, and Jean-Roch Vlimant. Variational autoencoders for new physics mining at the Large Hadron Collider. Journal of High Energy Physics, 2019(5):1–29, 2019.
[bib79] Carl Edward Rasmussen and Hannes Nickisch. Gaussian processes for machine learning (GPML) toolbox. The Journal of Machine Learning Research, 11:3011–3015, 2010.
[bib80] Mahmut Kaya and Hasan Şakir Bilge. Deep metric learning: A survey. Symmetry, 11(9):1066, 2019.
[bib81] Zichao Long, Yiping Lu, Xianzhong Ma, and Bin Dong. PDE-Net: Learning PDEs from data. In International Conference on Machine Learning, pages 3208–3216. PMLR, 2018.
[bib82] M Giselle Fernández-Godino, Chanyoung Park, Nam-Ho Kim, and Raphael T Haftka. Review of multi-fidelity models. arXiv preprint arXiv:1609.07196, 2016.
[bib83] Alexander IJ Forrester, András Sóbester, and Andy J Keane. Multi-fidelity optimization via surrogate modelling. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 463(2088):3251–3269, 2007.
[bib84] Leo Wai-Tsun Ng and Michael Eldred. Multifidelity uncertainty quantification using non-intrusive polynomial chaos and stochastic collocation. In 53rd AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics and Materials Conference, 20th AIAA/ASME/AHS Adaptive Structures Conference, 14th AIAA, page 1852, 2012.
[bib85] Paris Perdikaris, Maziar Raissi, Andreas Damianou, Neil D Lawrence, and George Em Karniadakis. Nonlinear information fusion algorithms for data-efficient multi-fidelity modelling. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 473(2198):20160751, 2017.
[bib86] Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021.
[bib87] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. Advances in Neural Information Processing Systems, 30, 2017.
[bib88] Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. In International Conference on Machine Learning, pages 2747–2755. PMLR, 2018.
[bib89] Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pages 2990–2999. PMLR, 2016.
[bib90] Víctor Garcia Satorras, Emiel Hoogeboom, and Max Welling. E(n) equivariant graph neural networks. In International Conference on Machine Learning, pages 9323–9332. PMLR, 2021.
[bib91] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
[bib92] Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds. arXiv preprint arXiv:1802.08219, 2018.
[bib93] James Kirkpatrick, Brendan McMorrow, David HP Turban, Alexander L Gaunt, James S Spencer, Alexander GDG Matthews, Annette Obika, Louis Thiry, Meire Fortunato, David Pfau, et al. Pushing the frontiers of density functionals by solving the fractional electron problem. Science, 374(6573):1385–1389, 2021.
[bib94] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI Open, 1:57–81, 2020.
[bib95] Jonathan Masci, Davide Boscaini, Michael Bronstein, and Pierre Vandergheynst. Geodesic convolutional neural networks on Riemannian manifolds. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 37–45, 2015.
[bib96] Jun Li, Li Fuxin, and Sinisa Todorovic. Efficient Riemannian optimization on the Stiefel manifold via the Cayley transform. arXiv preprint arXiv:2002.01113, 2020.
[bib97] Bobak Kiani, Randall Balestriero, Yann Lecun, and Seth Lloyd. projUNN: efficient method for training deep networks with unitary matrices. arXiv preprint arXiv:2203.05483, 2022.
[bib98] Zhengdao Chen, Jianyu Zhang, Martin Arjovsky, and Léon Bottou. Symplectic recurrent neural networks. arXiv preprint arXiv:1909.13334, 2019.
[bib99] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128. PMLR, 2016.
[bib100] Turgut Öziş and İsmail Aslan. Similarity solutions to Burgers' equation in terms of special functions of mathematical physics. Acta Physica Polonica B, 2017.
[bib101] SP Lloyd. The infinitesimal group of the Navier–Stokes equations. Acta Mechanica, 38(1-2):85–98, 1981.
[bib102] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.