
Transformers without Normalization


Abstract

Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation DyT(x) = tanh(αx), as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, S-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.


Jiachen Zhu 1 , 2 , Xinlei Chen 1 , Kaiming He 3 , Yann LeCun 1 , 2 , Zhuang Liu 1 , 4 , †

1 FAIR, Meta, 2 New York University, 3 MIT, 4 Princeton University

† Project lead


Date:

June 17, 2025

jiachenzhu.github.io/DyT

Correspondence:

jiachen.zhu@nyu.edu , zhuangl@princeton.edu

Introduction

Over the past decade, normalization layers have solidified their positions as one of the most fundamental components of modern neural networks. It all traces back to the invention of batch normalization in 2015 (Ioffe and Szegedy, 2015), which enabled drastically faster and better convergence in visual recognition models and quickly gained momentum in the following years. Since then, many variants of normalization layers have been proposed for different network architectures or domains (Ba et al., 2016; Ulyanov et al., 2016; Wu and He, 2018; Zhang and Sennrich, 2019). Today, virtually all modern networks use normalization layers, with layer normalization (Layer Norm, or LN) (Ba et al., 2016) being one of the most popular, particularly in the dominant Transformer architecture (Vaswani et al., 2017; Dosovitskiy et al., 2020).

The widespread adoption of normalization layers is largely driven by their empirical benefits in optimization (Santurkar et al., 2018; Bjorck et al., 2018). In addition to achieving better results, they help accelerate and stabilize convergence. As neural networks become wider and deeper, this necessity becomes ever more critical (Brock et al., 2021a; Huang et al., 2023). Consequently, normalization layers are widely regarded as crucial, if not indispensable, for the effective training of deep networks. This belief is subtly evidenced by the fact that, in recent years, novel architectures often seek to replace attention or convolution layers (Tolstikhin et al., 2021; Gu and Dao, 2023; Sun et al., 2024; Feng et al., 2024), but almost always retain the normalization layers.

This paper challenges this belief by introducing a simple alternative to normalization layers in Transformers. Our exploration starts with the observation that LN layers map their inputs to outputs with tanh-like, S -shaped curves, scaling the input activations while squashing the extreme values. Inspired by this insight, we propose an element-wise operation termed Dynamic Tanh (DyT), defined as: DyT( x ) = tanh( α x ) , where α is a learnable parameter. This operation aims to emulate the behavior of LN by learning an appropriate scaling factor through α and squashing extreme values via the bounded tanh function. Notably, unlike normalization layers, it achieves both effects without the need to compute activation statistics.

Employing DyT is straightforward, as shown in Figure 1: we directly replace existing normalization layers with DyT in architectures such as vision and language Transformers. We empirically demonstrate that models with DyT can train stably and achieve high final performance across a wide range of settings. It often does not require tuning the training hyperparameters of the original architecture.

Figure 1 Left: original Transformer block. Right: block with our proposed Dynamic Tanh (DyT) layer. DyT is a straightforward replacement for the commonly used Layer Norm (Ba et al., 2016) (in some cases RMSNorm (Zhang and Sennrich, 2019)) layers. Transformers with DyT match or exceed the performance of their normalized counterparts.

Our work challenges the notion that normalization layers are indispensable for training modern neural networks and provides empirical insights into the properties of normalization layers.

Background: Normalization Layers

We begin by reviewing normalization layers. Most normalization layers share a common formulation. Given an input x with shape (B, T, C), where B is the batch size, T is the number of tokens, and C is the embedding dimension per token, the output is generally computed as:

$$
\mathrm{normalization}(\bm{x}) = \bm{\gamma} * \frac{\bm{x} - \bm{\mu}}{\sqrt{\bm{\sigma}^{2} + \epsilon}} + \bm{\beta}
$$

where ϵ is a small constant, and γ and β are learnable vector parameters of shape (C,). They are 'scaling' and 'shifting' affine parameters that allow the output to lie in any range. The terms µ and σ² denote the mean and variance of the input. Different methods mainly differ in how these two statistics are computed, which results in µ and σ² having different dimensions, with broadcasting applied during computation.

Batch normalization (BN) (Ioffe and Szegedy, 2015) is the first modern normalization layer, and it has been primarily used in ConvNet models (Szegedy et al., 2016; He et al., 2016; Xie et al., 2017). Its introduction represents a major milestone in deep learning architecture design. BN computes the mean and variance across both the batch and token dimensions, specifically $\mu_k = \frac{1}{BT}\sum_{i,j} x_{ijk}$ and $\sigma^2_k = \frac{1}{BT}\sum_{i,j} (x_{ijk} - \mu_k)^2$. Other normalization layers popular in ConvNets, such as group normalization (Wu and He, 2018) and instance normalization (Ulyanov et al., 2016), were initially proposed for specialized tasks such as object detection and image stylization. They share the same overall formulation but differ in the axes and ranges over which the statistics are computed.
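The BN statistics above can be sketched in NumPy; the toy tensor shape below is ours for illustration:

```python
import numpy as np

# Toy activations with shape (B, T, C): batch, tokens, channels.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(8, 16, 4))

# BN keeps one mean/variance per channel k, pooled over both the
# batch and token dimensions (axes 0 and 1).
mu = x.mean(axis=(0, 1))                 # shape (C,)
var = x.var(axis=(0, 1))                 # shape (C,)

x_bn = (x - mu) / np.sqrt(var + 1e-5)

# Each channel is now (approximately) zero-mean and unit-variance.
assert np.allclose(x_bn.mean(axis=(0, 1)), 0.0, atol=1e-7)
assert np.allclose(x_bn.var(axis=(0, 1)), 1.0, atol=1e-3)
```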

Layer normalization (LN) (Ba et al., 2016) and root mean square normalization (RMSNorm) (Zhang and Sennrich, 2019) are the two major types of normalization layers used in Transformer architectures. LN computes these statistics independently for each token in each sample, with $\mu_{ij} = \frac{1}{C}\sum_{k} x_{ijk}$ and $\sigma^2_{ij} = \frac{1}{C}\sum_{k} (x_{ijk} - \mu_{ij})^2$. RMSNorm (Zhang and Sennrich, 2019) simplifies LN by removing the mean-centering step, normalizing the input with $\mu_{ij} = 0$ and $\sigma^2_{ij} = \frac{1}{C}\sum_{k} x_{ijk}^2$. Today, most modern neural networks use LN due to its simplicity and universality. Recently, RMSNorm has gained popularity, particularly in language models like T5 (Raffel et al., 2020), LLaMA (Touvron et al., 2023a,b; Dubey et al., 2024), Mistral (Jiang et al., 2023), Qwen (Bai et al., 2023; Yang et al., 2024), InternLM (Zhang et al., 2024; Cai et al., 2024) and DeepSeek (Liu et al., 2024; Guo et al., 2025). The Transformers we examine in this work all use LN, except that LLaMA uses RMSNorm.
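The per-token statistics of LN and RMSNorm can likewise be sketched in NumPy (toy shapes are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(2, 3, 8))           # (B, T, C)

# LN: one mean/variance per token (i, j), computed over the channel axis.
mu = x.mean(axis=-1, keepdims=True)      # shape (B, T, 1)
var = x.var(axis=-1, keepdims=True)
x_ln = (x - mu) / np.sqrt(var + 1e-5)

# RMSNorm: drop the mean-centering and divide by the root mean square.
rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + 1e-5)
x_rms = x / rms

# Both preserve the input tensor's shape.
assert x_ln.shape == x.shape and x_rms.shape == x.shape
```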

Figure 2 Output vs. input of selected layer normalization (LN) layers in Vision Transformer (ViT) (Dosovitskiy et al., 2020), wav2vec 2.0 (a Transformer model for speech) (Baevski et al., 2020), and Diffusion Transformer (DiT) (Peebles and Xie, 2023). We sample a mini-batch of samples and plot the input/output values of four LN layers in each model. The outputs are before the affine transformation in LN. The S-shaped curves highly resemble that of a tanh function (see Figure 3). The more linear shapes in earlier layers can also be captured by the center part of a tanh curve. This motivates us to propose Dynamic Tanh (DyT) as a replacement, with a learnable scalar α to account for different scales on the x axis.

What Do Normalization Layers Do?

Analysis setup. We first empirically study the behaviors of normalization layers in trained networks. For this analysis, we take a Vision Transformer model (ViT-B) (Dosovitskiy et al., 2020) trained on ImageNet-1K (Deng et al., 2009), a wav2vec 2.0 Large Transformer model (Baevski et al., 2020) trained on LibriSpeech (Panayotov et al., 2015), and a Diffusion Transformer (DiT-XL) (Peebles and Xie, 2023) trained on ImageNet-1K. In all cases, LN is applied in every Transformer block and before the final linear projection.

For all three trained networks, we sample a mini-batch of samples and do a forward pass through the network. We then measure the input and output for the normalization layers, i.e., tensors immediately before and after the normalization operation, before the learnable affine transformation. Since LN preserves the dimensions of the input tensor, we can establish a one-to-one correspondence between the input and output tensor elements, allowing for a direct visualization of their relationship. We plot the resulting mappings in Figure 2.
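Input/output pairs like these can be collected with forward hooks; the toy model below is a stand-in for the actual trained networks (ViT-B, wav2vec 2.0, DiT-XL), and the exact capture code is our sketch:

```python
import torch
import torch.nn as nn

# Stand-in for a trained Transformer; affine is disabled so the hook's
# output is exactly the normalized tensor, before any gamma/beta.
model = nn.Sequential(
    nn.Linear(16, 16),
    nn.LayerNorm(16, elementwise_affine=False),
    nn.Linear(16, 16),
)

captured = {}

def hook(module, inputs, output):
    # Record the tensors immediately before and after normalization.
    captured["in"] = inputs[0].detach()
    captured["out"] = output.detach()

model[1].register_forward_hook(hook)

with torch.no_grad():
    model(torch.randn(4, 8, 16))     # a mini-batch of (B, T, C) inputs

# LN preserves the tensor shape, so inputs and outputs correspond
# element-by-element and can be scatter-plotted directly.
assert captured["in"].shape == captured["out"].shape
```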

Tanh-like mappings with layer normalization. For all three models, in earlier LN layers (1st column of Figure 2), we find this input-output relationship to be mostly linear, resembling a straight line in an x-y plot. The deeper LN layers, however, are where we make more intriguing observations.

A striking observation from these deeper layers is that most of these curves' shapes highly resemble full or partial S-shaped curves represented by a tanh function (see Figure 3). One might expect LN layers to linearly transform the input tensor, as subtracting the mean and dividing by the standard deviation are linear operations. LN normalizes in a per-token manner, only linearly transforming each token's activations. As tokens have different mean and standard deviation values, the linearity does not hold collectively over all activations of the input tensor. Nonetheless, it is still surprising to us that the actual non-linear transformation is highly similar to a scaled tanh function.

Figure 3 tanh(αx) with three different α values.

Figure 4 Output vs. input of two LN layers, with tensor elements colored to indicate different channel and token dimensions. The input tensor has a shape of (samples, tokens, channels), with elements visualized by assigning consistent colors to the same tokens (left two panels) and channels (right two panels). Left two panels: points representing the same token (same color) form straight lines across different channels, as LN operates linearly across channels for each token. Interestingly, when plotted collectively, these lines form a non-linear tanh-shaped curve. Right two panels: each channel's input spans different ranges on the x-axis, contributing distinct segments to the overall tanh-shaped curve. Certain channels (e.g., red, green, and pink) exhibit more extreme x values, which are squashed by LN.

For such an S-shaped curve, we note that the central part, represented by points with x values close to zero, is still mainly linear. Most points (∼99%) fall in this linear range. However, many points clearly fall outside this range and are considered to have 'extreme' values, e.g., those with x larger than 50 or smaller than -50 in the ViT model. The main effect of normalization layers on these values is to squash them into less extreme values, more in line with the majority of points. This is where normalization layers cannot be approximated by a simple affine transformation. We hypothesize that this non-linear and disproportionate squashing of extreme values is what makes normalization layers important and indispensable.
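This squashing behavior is easy to see numerically: a scaled tanh is nearly the identity for small inputs but saturates for large ones. The α value below is an arbitrary illustration, not a learned one:

```python
import math

alpha = 0.02   # illustrative scale only

# Near zero, tanh(alpha * x) is almost exactly alpha * x (linear regime).
assert abs(math.tanh(alpha * 1.0) - alpha * 1.0) < 1e-4

# An 'extreme' input (x = 500) gets squashed: the linear map would give
# alpha * x = 10, but tanh caps the output just below 1.
assert math.tanh(alpha * 500.0) < 1.0
assert math.tanh(alpha * 500.0) > 0.999
```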

Recent findings by Ni et al. (2024) similarly highlight the strong non-linearities introduced by LN layers, demonstrating how the non-linearity enhances a model's representational capacity. Moreover, this squashing behavior mirrors the saturation properties of biological neurons for large inputs, a phenomenon first observed about a century ago (Adrian, 1926; Adrian and Zotterman, 1926a,b).

Normalization by tokens and channels. How does an LN layer perform a linear transformation for each token but also squash the extreme values in such a non-linear fashion? To understand this, we visualize the points grouped by tokens and channels, respectively. This is plotted in Figure 4 by taking the second and third subplots for ViT from Figure 2, but with a sampled subset of points for more clarity. When we select the channels to plot, we make sure to include the channels with extreme values.

On the left two panels of Figure 4, we visualize each token's activations using the same color. We observe that all points from any single token do form a straight line. However, since each token has a different variance, the slopes are different. Tokens with smaller input x ranges tend to have smaller variance, and the normalization layer will divide their activations using a smaller standard deviation, hence producing a larger slope in the straight line. Collectively, they form an S -shaped curve that resembles a tanh function. In the two panels on the right, we color each channel's activations using the same color. We find that different channels tend to have drastically different input ranges, with only a few channels (e.g., red, green, and pink) exhibiting large extreme values. These are the channels that get squashed the most by the normalization layer.
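The per-token linearity can be checked directly: normalizing one token's channel vector is an affine map with slope 1/σ, so its (input, output) points fall on a single straight line. A NumPy sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
token = rng.normal(loc=3.0, scale=5.0, size=256)   # one token's channels

mu, sd = token.mean(), token.std()
out = (token - mu) / sd

# Within a single token, LN is exactly affine: out = x/sd - mu/sd.
# A degree-1 fit therefore recovers slope 1/sd and intercept -mu/sd.
slope, intercept = np.polyfit(token, out, 1)
assert np.isclose(slope, 1.0 / sd)
assert np.isclose(intercept, -mu / sd)
```

Tokens with smaller standard deviation get a larger slope, which is exactly why the per-token lines fan out into the collective S-shaped curve of Figure 4.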


In this section, we begin with ablations on the effects of the tanh function and the learnable scalar α . We then analyze the values of α throughout and after training. Lastly, we present comparisons with previous methods that aim to remove normalization layers.


Dynamic Tanh (DyT)

Inspired by the similarity between the shapes of normalization layers and a scaled tanh function, we propose Dynamic Tanh (DyT) as a drop-in replacement for normalization layers. Given an input tensor x , a DyT layer is defined as follows:

$$
\mathrm{DyT}(\bm{x}) = \bm{\gamma} * \tanh(\alpha \bm{x}) + \bm{\beta}
$$

where α is a learnable scalar parameter that allows scaling the input differently based on its range, accounting for varying x scales (Figure 2). This is also why we name the whole operation 'Dynamic' Tanh. γ and β are learnable, per-channel vector parameters, the same as those used in all normalization layers; they allow the output to be scaled back to any range. This is sometimes considered a separate affine layer; for our purposes, we consider them part of the DyT layer, just as normalization layers also include them. See Algorithm 1 for an implementation of DyT in PyTorch-like pseudocode.
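A minimal PyTorch sketch consistent with this definition (the paper's Algorithm 1 may differ in details such as initialization):

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: gamma * tanh(alpha * x) + beta, a drop-in
    replacement for a normalization layer over the last dimension."""

    def __init__(self, num_features: int, alpha_init: float = 0.5):
        super().__init__()
        # alpha is a single learnable scalar; gamma/beta are per-channel,
        # mirroring the affine part of a normalization layer.
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Purely element-wise: no activation statistics are computed.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

layer = DyT(8)
x = torch.randn(2, 4, 8)
assert layer(x).shape == x.shape       # shape-preserving, like LN
```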

Integrating DyT layers into an existing architecture is straightforward: one DyT layer replaces one normalization layer (see Figure 1). This applies to normalization layers within attention blocks, FFN blocks, and the final normalization layer. Although DyT may look like or be considered an activation function, this study only uses it to replace normalization layers without altering any parts of the activation functions in the original architectures, such as GELU or ReLU. Readers interested in the use of harmonic or hyperbolic functions as activation functions can refer to Hashemi et al. (2024). We also observe that there is little need to tune the hyperparameters used by the original architectures for DyT to perform well.
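Such a one-for-one replacement can be automated with a small helper; the function below is our illustration, not code from the paper's release:

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    # Minimal DyT: gamma * tanh(alpha * x) + beta, as defined above.
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

def replace_ln_with_dyt(module: nn.Module) -> nn.Module:
    # Recursively swap every nn.LayerNorm for a DyT layer of the
    # same feature dimension; activation functions are left untouched.
    for name, child in module.named_children():
        if isinstance(child, nn.LayerNorm):
            setattr(module, name, DyT(child.normalized_shape[-1]))
        else:
            replace_ln_with_dyt(child)
    return module

encoder = nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True)
encoder = replace_ln_with_dyt(encoder)
assert not any(isinstance(m, nn.LayerNorm) for m in encoder.modules())
```

The swapped model keeps the original block structure, so it can be trained with the original recipe.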

On scaling parameters.

We present additional experiments to evaluate the impact of hyperparameter tuning, specifically focusing on the learning rate and initialization of α for all non-LLM models.

Tuning learning rate. Table 12 summarizes performance comparisons between models trained with original versus tuned learning rates. Results indicate that tuning the learning rate provides only modest performance improvements for DyT models. This suggests that the original hyperparameters, initially optimized for LN models, are already well-suited for DyT models. This observation underscores the inherent similarity between the DyT and LN models.

Table 12 Performance comparison between original and tuned learning rates for LN and DyT models. Results show that tuning learning rates provides only modest performance improvements for DyT models, suggesting that the default hyperparameters optimized for LN models are already well-suited for DyT models. Entries marked with '-' indicate no performance gain over the original learning rate. The values in parentheses represent the learning rate used.

Tuning initial value of α. We also investigate the effects of optimizing α₀ for DyT models, as presented in Table 13. Findings show only minor performance enhancements for select models when α₀ is tuned, indicating that the default initial value (α₀ = 0.5) generally achieves near-optimal performance.

Table 13 Impact of tuning α₀ in DyT models. Optimizing α₀ from the default value (α₀ = 0.5) yields only minor performance gains for select DyT models, implying the default initialization already achieves near-optimal performance. Entries marked with '-' indicate no improvement over the default α₀.

Remarks.

Mechanisms of Normalization layers. There has been a rich line of work investigating normalization layers' role in enhancing model performance through various mechanisms. These include stabilizing gradient flow during training (Balduzzi et al., 2017; Daneshmand et al., 2020; Lubana et al., 2021), reducing sensitivity to weight initialization (Zhang et al., 2019; De and Smith, 2020; Shao et al., 2020), moderating outlier eigenvalues (Bjorck et al., 2018; Karakida et al., 2019), auto-tuning learning rates (Arora et al., 2018; Tanaka and Kunin, 2021), and smoothing the loss landscape for more stable optimization (Santurkar et al., 2018). These earlier works focused on studying batch normalization. Recent studies (Lyu et al., 2022; Dai et al., 2024; Mueller et al., 2024) further highlight the connection between normalization layers and sharpness reduction, which contributes to better generalization.

Table 11 Optimal α₀ (attention / other) across model widths and depths in LLaMA training. Model width significantly impacts the choice of α₀, with wider networks requiring smaller values. In contrast, model depth has negligible influence.

Normalization in Transformers. With the rise of Transformer (Vaswani et al., 2017), research has increasingly focused on layer normalization (Ba et al., 2016), which has proven particularly effective for sequential data in natural language tasks (Nguyen and Salazar, 2019; Xu et al., 2019; Xiong et al., 2020). Recent work (Ni et al., 2024) reveals that layer normalization introduces strong non-linearity, enhancing the model's representational capacity. Additionally, studies (Loshchilov et al., 2024; Li et al., 2024) demonstrate that modifying the location of normalization layers within Transformers can improve convergence properties.

Removing normalization. Many studies have explored how to train deep models without normalization layers. Klambauer et al. (2017) introduce an alternative activation function that enables self-normalizing behavior, eliminating the need for explicit normalization. Other works (Zhang et al., 2019; De and Smith, 2020; Bachlechner et al., 2021) propose specialized initialization schemes to stabilize training in the absence of normalization. The pioneering work by Brock et al. (2021a,b) shows that high-performing ResNets can be trained without normalization (Smith et al., 2023) through a combination of initialization techniques (De and Smith, 2020), weight normalization (Salimans and Kingma, 2016; Huang et al., 2017; Qiao et al., 2019), and adaptive gradient clipping (Brock et al., 2021b). Additionally, their training strategy incorporates extensive data augmentation (Cubuk et al., 2020) and regularization (Srivastava et al., 2014; Huang et al., 2016). The studies above are based on various ConvNet models.

In Transformer architectures, He and Hofmann (2023) explore modifications to Transformer blocks that reduce reliance on normalization layers and skip connections. Jha and Reagen (2024) introduce AERO, a Softmax-only LLM that improves inference efficiency and privacy with minimal performance loss. Alternatively, Heimersheim (2024) propose a method to gradually remove LN from pretrained networks by fine-tuning the model after removing each normalization layer. Unlike previous approaches, DyT requires minimal modifications to both the architecture and the training recipe. Despite its simplicity, DyT achieves stable training and comparable performance.

Experiments

To demonstrate the effectiveness of DyT, we experiment with Transformers and a few other modern architectures across a diverse range of tasks and domains. In each experiment, we replace the LN or RMSNorm in the original architectures with DyT layers and follow the official open-source protocols to train and test both versions of the models. Detailed instructions for reproducing our results are provided in Appendix A. Notably, to highlight the simplicity of adapting DyT, we use hyperparameters identical to those utilized by the normalized counterparts. For completeness, additional experimental results regarding tuning of learning rates and initial values of α are provided in Appendix B.

Supervised learning in vision. We train Vision Transformer (ViT) (Dosovitskiy et al., 2020) and ConvNeXt (Liu et al., 2022) of 'Base' and 'Large' sizes on the ImageNet-1K classification task (Deng et al., 2009). These models are selected due to their popularity and distinct operations: attention in ViT and convolution in ConvNeXt. Table 1 reports the top-1 classification accuracies. DyT performs slightly better than LN across both architectures and model sizes. We further plot the training loss for ViT-B and ConvNeXt-B in Figure 5. The curves show that the convergence behaviors of DyT and LN-based models are highly aligned.

Figure 5 Training loss curves for ViT-B and ConvNeXt-B models. The loss curves for both model types exhibit similar patterns between LN and DyT, suggesting that LN and DyT may share similar learning dynamics.

Table 1 Supervised classification accuracy on ImageNet-1K. DyT achieves better or similar performance than LN across both architectures and model sizes.

Self-supervised learning in vision. We benchmark with two popular visual self-supervised learning methods: masked autoencoders (MAE) (He et al., 2022) and DINO (Caron et al., 2021). Both by default use Vision Transformers as the backbones, but have different training objectives: MAE is trained with a reconstruction loss, and DINO uses a joint-embedding loss (LeCun, 2022). Following the standard self-supervised learning protocol, we first pretrain models on ImageNet-1K without using any labels and then test the pretrained models by attaching a classification layer and fine-tuning them with labels. The fine-tuning results are presented in Table 2. DyT consistently performs on par with LN in self-supervised learning tasks.

Diffusion models. We train three Diffusion Transformer (DiT) models (Peebles and Xie, 2023) of sizes B, L and XL on ImageNet-1K (Deng et al., 2009). The patch size is 4, 4, and 2, respectively. Note that in DiT, the LN layers' affine parameters are used for class conditioning, and we keep them that way in our DyT experiments, only replacing the normalizing transformation with the tanh(αx) function. After training, we evaluate the Fréchet Inception Distance (FID) scores using the standard ImageNet 'reference batch', as presented in Table 3. DyT achieves comparable or improved FID over LN.

Figure 6 LLaMA pretraining loss. The loss curves of DyT and RMSNorm models are closely aligned across model sizes.

Large Language Models. We pretrain LLaMA 7B, 13B, 34B, and 70B models (Touvron et al., 2023a,b; Dubey et al., 2024) to assess DyT performance relative to RMSNorm (Zhang and Sennrich, 2019), the default normalization layer used in LLaMA. The models are trained on The Pile dataset (Gao et al., 2020) with 200B tokens, following the original recipe outlined in LLaMA (Touvron et al., 2023b). On LLaMA with DyT, we add a learnable scalar parameter after the initial embedding layer, and adjust the initial value of α , as detailed in Section 7. We report the loss value after training and also follow OpenLLaMA (Geng and Liu, 2023) to benchmark the models on 15 zero-shot tasks from lm-eval (Gao et al.). As shown in Table 4, DyT performs on par with RMSNorm across all four model sizes. Figure 6 illustrates the loss curves, demonstrating similar trends across all model sizes, with training losses closely aligned throughout training.

Table 4 Language models' training loss and average performance on 15 zero-shot lm-eval tasks. DyT achieves a comparable zero-shot performance and training loss to RMSNorm.

Self-supervised learning in speech. We pretrain two wav2vec 2.0 Transformer models (Baevski et al., 2020) on the LibriSpeech dataset (Panayotov et al., 2015). We report the final validation loss in Table 5. We observe that DyT performs comparably to LN in both model sizes.

DNA sequence modeling. On the long-range DNA sequence modeling task, we pretrain the HyenaDNA model (Nguyen et al., 2024) and the Caduceus model (Schiff et al., 2024). The pretraining uses the human reference genome data (GRCh38, 2013), and the evaluation is on GenomicBenchmarks (Grešová et al., 2023). The results are presented in Table 6. DyT maintains performance comparable to LN for this task.


Analysis

In this section, we begin with ablations on the effects of the tanh function and the learnable scalar α . We then analyze the values of α throughout and after training. Lastly, we present comparisons with previous methods that aim to remove normalization layers.

Efficiency of DyT

We benchmark the LLaMA 7B model with RMSNorm or DyT by measuring the total time required for 100 forward passes (inference) and 100 forward-backward passes (training) on a single sequence of 4096 tokens. We first follow the officially recommended LLaMA setup and load the model from Hugging Face without applying any performance optimizations. Table 14 reports the time taken for RMSNorm and DyT layers, as well as for the entire model, when running on an Nvidia H100 GPU with BF16 precision. DyT layers reduce computation time compared to RMSNorm layers.

Table 14 Inference and training latency (BF16 precision) for LLaMA 7B with RMSNorm or DyT. DyT achieves a substantial reduction in both inference and training time. Results are measured without any extra performance optimizations.

We also benchmark both models using torch.compile. Interestingly, compiling the entire LLaMA model increases latency for the Hugging Face implementation, while compiling only the DyT or RMSNorm layers yields more efficient execution. Table 15 shows that, after compilation, the latencies of the RMSNorm and DyT layers become nearly identical.

An important distinction of DyT is that it is an element-wise operation and does not require a reduction operation within itself, compared to normalization layers. This could make it faster on hardware where reduction is a bottleneck. Additionally, even on conventional GPUs, DyT could offer opportunities for further optimization, e.g., fusing it with the preceding matrix multiplication layer from the last residual block.
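The locality of DyT, versus the reduction inside a normalization layer, can be checked directly: perturbing one input element changes that token's entire RMSNorm output but only the single corresponding DyT output. An illustrative numpy sketch (the RMSNorm here omits the learnable scale):

```python
import numpy as np

rng = np.random.default_rng(0)

def rmsnorm(x, eps=1e-6):
    # Requires a reduction: root mean square over the channel dimension.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms

def dyt(x, alpha=0.5):
    # Purely element-wise: each output depends only on its own input element.
    return np.tanh(alpha * x)

x = rng.normal(size=(4, 128))
x2 = x.copy()
x2[0, 0] += 10.0   # perturb a single element of one token

# The perturbation changes every RMSNorm output in that token...
print(np.allclose(rmsnorm(x)[0, 1:], rmsnorm(x2)[0, 1:]))  # False
# ...but leaves all other DyT outputs untouched.
print(np.allclose(dyt(x)[0, 1:], dyt(x2)[0, 1:]))          # True
```

This independence is what removes the cross-element data dependency, and hence the reduction, from the layer's compute.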

Ablations of tanh and α

To further investigate the role of tanh and α in DyT, we conduct experiments to evaluate the model's performance when these components are altered or removed.

Replacing and removing tanh. We replace tanh in DyT layers with alternative squashing functions, specifically hardtanh and sigmoid (Figure 7), while keeping the learnable scalar α intact. Furthermore, we assess the impact of completely removing tanh by replacing it with the identity function while still retaining α . As shown in Table 7, the squashing function is essential for stable training. Using the identity function leads to unstable training and divergence, whereas squashing functions enable stable training. Among the squashing functions, tanh performs the best. This is possibly due to its smoothness and zero-centered properties.
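The four variants compared here are easy to state precisely; the following numpy definitions are illustrative (the ablation swaps only the squashing function inside DyT, keeping α):

```python
import numpy as np

# The three bounded squashing functions from the ablation, plus identity
# (which removes squashing entirely and leads to divergence during training).
def tanh_fn(x):  return np.tanh(x)                # bounded, zero-centered, smooth
def hardtanh(x): return np.clip(x, -1.0, 1.0)     # bounded, zero-centered, not smooth
def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))  # bounded but not zero-centered
def identity(x): return x                         # unbounded: extremes pass through

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
for f in (tanh_fn, hardtanh, sigmoid, identity):
    print(f.__name__, f(x))
```

Only the first three keep extreme inputs bounded; tanh is additionally both smooth and zero-centered, which is the property the text credits for its best performance.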

Table 7 ImageNet-1K classification accuracy with different squashing functions. All experiments follow the same training recipe as the original LN-based models. Squashing functions play a crucial role in preventing divergence, with tanh achieving the highest performance among the three functions. ' → failed' indicates that training diverged after some progress, with the preceding number representing the highest accuracy reached before divergence.

Removing α . Next, we evaluate the impact of removing the learnable α while retaining the squashing functions (tanh, hardtanh, and sigmoid). As shown in Table 8, removing α results in performance degradation across all squashing functions, highlighting the critical role of α in overall model performance.

Values of α

During training. Our analysis reveals that α closely tracks the 1 / std of activations throughout training. As illustrated in the left panel of Figure 8, α first decreases and then increases during training, always fluctuating consistently with the standard deviation of the input activations. This supports the important role of α in maintaining activations within a suitable range, which leads to stable and effective training.

After training. Our further analysis of the final values of α in trained networks reveals a strong correlation with the 1 / std of the input activations. As shown on the right panel of Figure 8, higher 1 / std values generally correspond to larger α values, and vice versa. Additionally, we observe that deeper layers tend to have activations with larger standard deviations. This trend aligns with characteristics of deep residual networks, as shown in Brock et al. (2021a) for ConvNets, and Sun et al. (2025) for Transformers.

Both analyses suggest that α functions partially as a normalization mechanism by learning values approximating 1 / std of the input activations. Unlike LN, which normalizes the activations per token, α normalizes the entire input activations collectively. Consequently, α alone cannot suppress extreme values in a non-linear fashion.
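The distinction can be made concrete with toy activations: per-token LN forces every token to unit variance, whereas a single global scale α ≈ 1 / std only normalizes the tensor in aggregate. An illustrative numpy sketch with synthetic Gaussian tokens of varying scale:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic tokens with very different scales, as observed in deeper layers.
x = rng.normal(size=(64, 256)) * rng.uniform(0.5, 5.0, size=(64, 1))

# LN normalizes each token separately: every token ends up with unit std.
ln_out = (x - x.mean(-1, keepdims=True)) / x.std(-1, keepdims=True)

# A single scalar alpha ~ 1/std scales all activations collectively.
alpha = 1.0 / x.std()
scaled = alpha * x

print(ln_out.std(-1)[:4])   # ~1.0 for every token
print(scaled.std())         # ~1.0, but only in aggregate
print(scaled.std(-1)[:4])   # per-token stds still differ widely
```

The global scale leaves the relative spread between tokens intact, which is why α alone cannot squash extreme values and the tanh non-linearity is still needed.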

Figure 7 Curves of three squashing functions: tanh, hardtanh, and sigmoid. All three functions squash inputs into a bounded range, but tanh( x ) achieves the best performance when used in DyT layers. We suspect it is due to its smoothness and zero-centered properties.


Comparison with Other Methods

To further assess DyT's effectiveness, we compare it with other methods that also enable training Transformers without normalization layers. These methods can be broadly categorized into initialization-based and weight-normalization-based methods. We consider two popular initialization-based methods, Fixup (Zhang et al., 2019; Huang et al., 2020) and SkipInit (De and Smith, 2020; Bachlechner et al., 2021). Both methods aim to mitigate training instabilities by adjusting the initial parameter values to prevent large gradients and activations at the start of training, thereby enabling stable learning without normalization layers. In contrast, weight-normalization-based methods impose constraints on network weights throughout training to maintain stable learning dynamics in the absence of normalization layers. We include one such method, σ Reparam (Zhai et al., 2023), which controls the spectral norm of the weights to promote stable learning.

Table 9 Classification accuracy on ImageNet-1K. DyT consistently achieves superior performance over other methods.

Table 9 summarizes the results of two ViT-based tasks. We closely follow the original protocols outlined in their respective papers. However, we find that both initialization-based methods, Fixup and SkipInit, require significantly lower learning rates to prevent training divergence. To ensure a fair comparison, we conduct a simple learning rate search for all methods, including DyT. This produces results that differ from those reported in Section 5, where no hyperparameter is tuned. Overall, the results show that DyT consistently outperforms all other tested methods across different configurations.

Initialization of α

We find that tuning the initialization of α (denoted α 0 ) rarely leads to significant performance improvements. The only exception is LLM training, where careful tuning of α 0 yields noticeable performance gains. In this section, we detail our findings on the impact of α initialization.

Initialization of α for Non-LLM Models

Non-LLM models are relatively insensitive to α 0 . Figure 9 shows the effect of varying α 0 on validation performance across different tasks. All experiments follow the original setup and hyperparameters of their respective recipes. We observe that performance remains stable across a wide range of α 0 values, with values between 0.5 and 1.2 generally yielding good results. Adjusting α 0 typically affects only the early stages of the training curves. The main exception is the supervised ViT-L experiments, where training becomes unstable and diverges when α 0 exceeds 0.6. In such cases, reducing the learning rate restores stability, as detailed below.

Figure 8 Left: For two selected DyT layers from the ViT-B model, we track α and the inverse of the standard deviation ( 1 / std ) of activations at the end of each epoch, observing that they evolve together during training. Right: We plot the final α values of two trained models, ViT-B and ConvNeXt-B, against the 1 / std of the input activations, demonstrating a strong correlation between the two values.

Figure 9 Performance of different tasks across different α 0 values. We benchmark the performance of all non-LLM tasks used in Section 5 with different initial values of α . Performance remains stable across a wide range of α 0 values. The only exception is that supervised ViT-L models (top right panel) will diverge for α 0 values larger than 0.6.

Smaller α 0 results in more stable training. Building on the previous observations, we further analyze the factors contributing to training instability. Our findings suggest that increasing either the model size or the learning rate requires lowering α 0 to ensure stable training. Conversely, a higher α 0 requires a lower learning rate to mitigate training instability. Figure 10 shows an ablation of the training stability of supervised ViT on the ImageNet-1K dataset, varying learning rates, model sizes, and α 0 values. Larger models are more prone to training failure, requiring smaller α 0 values or learning rates for stable training. A similar instability pattern is also observed in LN-based models under comparable conditions, and setting α 0 = 0.5 results in a stability pattern similar to that of LN.

Setting α 0 = 0.5 as the default. Based on our findings, we set α 0 = 0.5 as the default value for all non-LLM models. This setting provides training stability comparable to LN while maintaining strong performance.


Initialization of α for LLMs

Tuning α 0 enhances LLM performance. As discussed earlier, the default setting of α 0 = 0.5 generally performs well across most tasks. However, we find that tuning α 0 can substantially improve LLM performance. We tune α 0 across LLaMA models by pretraining each on 30B tokens and comparing their training losses. Table 10 summarizes the tuned α 0 values for each model. Two key findings emerge:

  1. Larger models require smaller α 0 values. Once the optimal α 0 is determined for smaller models, the search space for larger models can be reduced accordingly.

Figure 10 Stability across varying α 0 values, learning rates, and model sizes. We train supervised ViT models on the ImageNet-1K dataset and observe that larger models are more prone to instability for both LN and DyT models. Lowering the learning rate or reducing α 0 enhances stability. LN shows similar stability to DyT with α 0 = 0.5.

  2. Higher α 0 values for attention blocks improve performance. We find that initializing α with higher values for DyT layers in attention blocks and lower values for DyT layers in other locations (i.e., within FFN blocks or before the final linear projection) improves performance.

Table 10 Optimal α 0 for different LLaMA models. Larger models require smaller α 0 values. We find it is important to initialize α differently in (1) attention blocks ('attention') versus (2) the FFN blocks and the final DyT layer before outputs ('other'). α 0 in attention blocks requires larger values.

To further illustrate the impact of α 0 tuning, Figure 11 presents heatmaps of loss values of two LLaMA models. Both models benefit from higher α 0 in attention blocks, leading to reduced training loss.

Figure 11 Loss-value heatmaps for two LLaMA models across different α 0 settings for attention and other blocks.

Model width primarily determines α 0 selection. We also investigate the influence of model width and depth on the optimal α 0 . We find that model width is critical in determining the optimal α 0 , while model depth has minimal influence. Table 11 shows the optimal α 0 values across different widths and depths: wider networks benefit from smaller α 0 values for optimal performance, whereas model depth has negligible impact on the choice of α 0 .

As can be seen in Table 11, the wider the network, the more uneven initialization for 'attention' and 'other' is needed. We hypothesize that the sensitivity of LLM's α initialization is related to their excessively large widths compared to other models.
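These two trends can be summarized as a width-dependent initialization rule. The sketch below is purely hypothetical: the functional form and constants are illustrative placeholders, not the tuned values reported in Table 10 or Table 11.

```python
# Hypothetical sketch of width-dependent alpha_0 assignment for LLaMA-style
# models: attention DyT layers start with a larger alpha than "other" DyT
# layers (FFN blocks and the final DyT layer), and wider models start smaller.
# The constants below are illustrative, not the paper's tuned values.
def alpha0_for(width, is_attention):
    base = 8.0 / width ** 0.5                    # shrink alpha_0 as width grows
    return base if is_attention else base / 4.0  # "other" layers start smaller

for width in (2048, 4096, 8192):
    print(width, alpha0_for(width, True), alpha0_for(width, False))
```

Any such rule would still need validation against short pretraining runs, as done here with 30B-token sweeps.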


Mechanisms of Normalization layers. There has been a rich line of work investigating normalization layers' role in enhancing model performance through various mechanisms. These include stabilizing gradient flow during training (Balduzzi et al., 2017; Daneshmand et al., 2020; Lubana et al., 2021), reducing sensitivity to weight initialization (Zhang et al., 2019; De and Smith, 2020; Shao et al., 2020), moderating outlier eigenvalues (Bjorck et al., 2018; Karakida et al., 2019), auto-tuning learning rates (Arora et al., 2018; Tanaka and Kunin, 2021), and smoothing the loss landscape for more stable optimization (Santurkar et al., 2018). These earlier works focused on studying batch normalization. Recent studies (Lyu et al., 2022; Dai et al., 2024; Mueller et al., 2024) further highlight the connection between normalization layers and sharpness reduction, which contributes to better generalization.

Table 11 Optimal α 0 (attention / other) across model widths and depths in LLaMA training. Model width significantly impacts the choice of α 0 , with wider networks requiring smaller values. In contrast, model depth has negligible influence.

Normalization in Transformers. With the rise of Transformer (Vaswani et al., 2017), research has increasingly focused on layer normalization (Ba et al., 2016), which has proven particularly effective for sequential data in natural language tasks (Nguyen and Salazar, 2019; Xu et al., 2019; Xiong et al., 2020). Recent work (Ni et al., 2024) reveals that layer normalization introduces strong non-linearity, enhancing the model's representational capacity. Additionally, studies (Loshchilov et al., 2024; Li et al., 2024) demonstrate that modifying the location of normalization layers within Transformers can improve convergence properties.

Analysis setup. We first empirically study the behaviors of normalization layers in trained networks. For this analysis, we take a Vision Transformer model (ViT-B) (Dosovitskiy et al., 2020) trained on ImageNet-1K (Deng et al., 2009), a wav2vec 2.0 Large Transformer model (Baevski et al., 2020) trained on LibriSpeech (Panayotov et al., 2015), and a Diffusion Transformer (DiT-XL) (Peebles and Xie, 2023) trained on ImageNet-1K. In all cases, LN is applied in every Transformer block and before the final linear projection.

For all three trained networks, we sample a mini-batch of samples and do a forward pass through the network. We then measure the input and output for the normalization layers, i.e., tensors immediately before and after the normalization operation, before the learnable affine transformation. Since LN preserves the dimensions of the input tensor, we can establish a one-to-one correspondence between the input and output tensor elements, allowing for a direct visualization of their relationship. We plot the resulting mappings in Figure 2.

Tanh-like mappings with layer normalization. For all three models, in earlier LN layers (1st column of Figure 2), we find this input-output relationship to be mostly linear, resembling a straight line in an x-y plot. However, the deeper LN layers are where we make more intriguing observations.

A striking observation from these deeper layers is that most of these curves' shapes highly resemble full or partial S -shaped curves represented by a tanh function (see Figure 3). One might expect LN layers to linearly transform the input tensor, as subtracting the mean and dividing by standard deviation are

Figure 3 tanh(αx) with three different α values.

Figure 4 Output vs. input of two LN layers, with tensor elements colored to indicate different channel and token dimensions. The input tensor has a shape of (samples, tokens, and channels), with elements visualized by assigning consistent colors to the same tokens (left two panels) and channels (right two panels). Left two panels : points representing the same token (same color) form straight lines across different channels, as LN operates linearly across channels for each token. Interestingly, when plotted collectively, these lines form a non-linear tanh-shaped curve. Right two panels : each channel's input spans different ranges on the x -axis, contributing distinct segments to the overall tanh-shaped curve. Certain channels (e.g., red, green, and pink) exhibit more extreme x values, which are squashed by LN.

linear operations. LN normalizes in a per-token manner, only linearly transforming each token's activations. As tokens have different mean and standard deviation values, the linearity does not hold collectively on all activations of the input tensor. Nonetheless, it is still surprising to us that the actual non-linear transformation is highly similar to a scaled tanh function.

For such an S-shaped curve, we note that the central part, represented by points with x values close to zero, is still mainly linear. Most points (∼99%) fall in this linear range. However, many points clearly fall outside this range and are considered to have 'extreme' values, e.g., those with x larger than 50 or smaller than -50 in the ViT model. Normalization layers' main effect on these values is to squash them into less extreme values, more in line with the majority of points. This is where normalization layers cannot be approximated by a simple affine transformation. We hypothesize that this non-linear and disproportionate squashing effect on extreme values is what makes normalization layers important and indispensable.
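The disproportionate squashing is easy to see numerically: with a small α, tanh(αx)/α is nearly the identity for moderate inputs but caps extreme ones near 1/α. The α value below is illustrative, not one fitted to any of the trained models.

```python
import numpy as np

alpha = 0.02                    # illustrative; |x| well below 1/alpha stays ~linear
x = np.array([0.5, 5.0, 50.0, 500.0])
y = np.tanh(alpha * x) / alpha  # rescaled so the linear region maps x -> ~x

for xi, yi in zip(x, y):
    # Moderate inputs pass through nearly unchanged; extremes saturate
    # toward 1/alpha = 50, far below their original magnitude.
    print(f"{xi:7.1f} -> {yi:7.3f}")
```

A linear (affine) map could match either the central slope or the behavior at the extremes, but never both, which is the sense in which the squashing is non-linear and disproportionate.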

Recent findings by Ni et al. (2024) similarly highlight the strong non-linearities introduced by LN layers, demonstrating how the non-linearity enhances a model's representational capacity. Moreover, this squashing behavior mirrors the saturation properties of biological neurons for large inputs, a phenomenon first observed about a century ago (Adrian, 1926; Adrian and Zotterman, 1926a,b).

Normalization by tokens and channels. How does an LN layer perform a linear transformation for each token but also squash the extreme values in such a non-linear fashion? To understand this, we visualize the points grouped by tokens and channels, respectively. This is plotted in Figure 4 by taking the second and third subplots for ViT from Figure 2, but with a sampled subset of points for more clarity. When we select the channels to plot, we make sure to include the channels with extreme values.

On the left two panels of Figure 4, we visualize each token's activations using the same color. We observe that all points from any single token do form a straight line. However, since each token has a different variance, the slopes are different. Tokens with smaller input x ranges tend to have smaller variance, and the normalization layer will divide their activations using a smaller standard deviation, hence producing a larger slope in the straight line. Collectively, they form an S -shaped curve that resembles a tanh function. In the two panels on the right, we color each channel's activations using the same color. We find that different channels tend to have drastically different input ranges, with only a few channels (e.g., red, green, and pink) exhibiting large extreme values. These are the channels that get squashed the most by the normalization layer.
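This token-wise picture is easy to reproduce with toy data: per-token normalization is exactly affine within each token, with slope 1/std, so low-variance tokens get steeper lines. An illustrative numpy sketch with two synthetic Gaussian tokens (no learnable affine):

```python
import numpy as np

rng = np.random.default_rng(1)
small = rng.normal(scale=0.5, size=512)    # low-variance token
large = rng.normal(scale=10.0, size=512)   # high-variance token

def ln(t):
    # Per-token layer normalization, without the learnable affine transform.
    return (t - t.mean()) / t.std()

# Each token's (input, output) pairs lie on a straight line of slope 1/std.
slope_small = np.polyfit(small, ln(small), 1)[0]
slope_large = np.polyfit(large, ln(large), 1)[0]
print(slope_small, 1.0 / small.std())   # the fitted slope matches 1/std
print(slope_large, 1.0 / large.std())   # a much shallower line
```

Overlaying many such lines of varying slope, one per token, is what produces the collective tanh-like curve seen in Figure 2.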

Normalization in Transformers.

Analysis setup. We first empirically study the behaviors of normalization layers in trained networks. For this analysis, we take a Vision Transformer model (ViT-B) (Dosovitskiy et al., 2020) trained on ImageNet-1K (Deng et al., 2009), a wav2vec 2.0 Large Transformer model (Baevski et al., 2020) trained on LibriSpeech (Panayotov et al., 2015), and a Diffusion Transformer (DiT-XL) (Peebles and Xie, 2023) trained on ImageNet-1K. In all cases, LN is applied in every Transformer block and before the final linear projection.

For all three trained networks, we sample a mini-batch of samples and do a forward pass through the network. We then measure the input and output for the normalization layers, i.e., tensors immediately before and after the normalization operation, before the learnable affine transformation. Since LN preserves the dimensions of the input tensor, we can establish a one-to-one correspondence between the input and output tensor elements, allowing for a direct visualization of their relationship. We plot the resulting mappings in Figure 2.

Tanh-likemappingswithlayernormalization. For all three models, in earlier LN layers (1st column of Figure 2), we find this inputoutput relationship to be mostly linear, resembling a straight line in an x -y plot. However, the deeper LN layers are places where we make more intriguing observations.

A striking observation from these deeper layers is that most of these curves' shapes highly resemble full or partial S -shaped curves represented by a tanh function (see Figure 3). One might expect LN layers to linearly transform the input tensor, as subtracting the mean and dividing by standard deviation are

Figure3 tanh( αx ) with three different α values.

Figure3 tanh( αx ) with three different α values.

Figure 4 Output vs. input of two LN layers, with tensor elements colored to indicate different channel and token dimensions. The input tensor has a shape of (samples, tokens, and channels), with elements visualized by assigning consistent colors to the same tokens (left two panels) and channels (right two panels). Left two panels : points representing the same token (same color) form straight lines across different channels, as LN operates linearly across channels for each token. Interestingly, when plotted collectively, these lines form a non-linear tanh-shaped curve. Right two panels : each channel's input spans different ranges on the x -axis, contributing distinct segments to the overall tanh-shaped curve. Certain channels (e.g., red, green, and pink) exhibit more extreme x values, which are squashed by LN.

Figure 4 Output vs. input of two LN layers, with tensor elements colored to indicate different channel and token dimensions. The input tensor has a shape of (samples, tokens, and channels), with elements visualized by assigning consistent colors to the same tokens (left two panels) and channels (right two panels). Left two panels : points representing the same token (same color) form straight lines across different channels, as LN operates linearly across channels for each token. Interestingly, when plotted collectively, these lines form a non-linear tanh-shaped curve. Right two panels : each channel's input spans different ranges on the x -axis, contributing distinct segments to the overall tanh-shaped curve. Certain channels (e.g., red, green, and pink) exhibit more extreme x values, which are squashed by LN.


Removing normalization.

We begin by reviewing the normalization layers. Most normalization layers share a common formulation. Given an input x with shape (B, T, C), where B is the batch size, T is the number of tokens, and C is the embedding dimension per token, the output is generally computed as:

$$
\mathrm{normalization}(\bm{x}) = \bm{\gamma} \odot \frac{\bm{x} - \bm{\mu}}{\sqrt{\bm{\sigma}^{2} + \epsilon}} + \bm{\beta}
$$

where ϵ is a small constant, and γ and β are learnable vector parameters of shape (C,). They are 'scaling' and 'shifting' affine parameters that allow the output to be in any range. The terms μ and σ² denote the mean and variance of the input. Different methods mainly differ in how these two statistics are computed. This results in μ and σ² having different dimensions, each with broadcasting applied during computation.

Batch normalization (BN) (Ioffe and Szegedy, 2015) is the first modern normalization layer, and it has been primarily used in ConvNet models (Szegedy et al., 2016; He et al., 2016; Xie et al., 2017). Its introduction represents a major milestone in deep learning architecture designs. BN computes the mean and variance across both the batch and token dimensions, specifically: μ_k = (1/BT) ∑_{i,j} x_{ijk} and σ²_k = (1/BT) ∑_{i,j} (x_{ijk} − μ_k)². Other normalization layers popular in ConvNets, such as group normalization (Wu and He, 2018) and instance normalization (Ulyanov et al., 2016), were initially proposed for specialized tasks such as object detection and image stylization. They share the same overall formulation but differ in the axes and ranges over which the statistics are computed.
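To make the reduction axes concrete, here is a small NumPy sketch of the BN statistics described above (the shapes are illustrative, not from any experiment in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, C = 4, 8, 16                    # batch, tokens, channels (illustrative)
x = rng.normal(size=(B, T, C))
eps = 1e-5

# BN: one mean/variance per channel k, pooled over batch and token dims.
mu = x.mean(axis=(0, 1))              # shape (C,)
var = x.var(axis=(0, 1))              # shape (C,)
x_bn = (x - mu) / np.sqrt(var + eps)  # broadcasts over (B, T)
```

After this step, each channel is normalized across all B·T pooled values; the learnable γ and β would then be applied per channel.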

Layer normalization (LN) (Ba et al., 2016) and root mean square normalization (RMSNorm) (Zhang and Sennrich, 2019) are the two major types of normalization layers used in Transformer architectures. LN computes these statistics independently for each token in each sample, where μ_{ij} = (1/C) ∑_k x_{ijk} and σ²_{ij} = (1/C) ∑_k (x_{ijk} − μ_{ij})². RMSNorm (Zhang and Sennrich, 2019) simplifies LN by removing the mean-centering step and normalizing the input with μ_{ij} = 0 and σ²_{ij} = (1/C) ∑_k x²_{ijk}. Today, most modern neural networks use LN due to its simplicity and universality. Recently, RMSNorm has gained popularity, particularly in language models like T5 (Raffel et al., 2020), LLaMA (Touvron et al., 2023a,b; Dubey et al., 2024), Mistral (Jiang et al., 2023), Qwen (Bai et al., 2023; Yang et al., 2024), InternLM (Zhang et al., 2024; Cai et al., 2024) and DeepSeek (Liu et al., 2024; Guo et al., 2025). The Transformers we examine in this work all use LN, except that LLaMA uses RMSNorm.
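The per-token statistics can be sketched the same way: LN reduces over only the channel axis, and RMSNorm drops the centering step (a NumPy sketch with illustrative shapes):

```python
import numpy as np

rng = np.random.default_rng(1)
B, T, C = 2, 4, 8                          # illustrative sizes
x = rng.normal(size=(B, T, C))
eps = 1e-5

# LN: one mean/variance per token, reduced over the channel axis only.
mu = x.mean(axis=-1, keepdims=True)        # shape (B, T, 1)
var = x.var(axis=-1, keepdims=True)
x_ln = (x - mu) / np.sqrt(var + eps)

# RMSNorm: no centering; divide each token by its root mean square.
rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
x_rms = x / rms
```

The only difference from the BN case is which axes the statistics are pooled over, matching the common formulation above.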

Figure 2 Output vs. input of selected layer normalization (LN) layers in Vision Transformer (ViT) (Dosovitskiy et al., 2020), wav2vec 2.0 (a Transformer model for speech) (Baevski et al., 2020), and Diffusion Transformer (DiT) (Peebles and Xie, 2023). We sample a mini-batch of samples and plot the input / output values of four LN layers in each model. The outputs are before the affine transformation in LN. The S -shaped curves highly resemble that of a tanh function (see Figure 3). The more linear shapes in earlier layers can also be captured by the center part of a tanh curve. This motivates us to propose Dynamic Tanh (DyT) as a replacement, with a learnable scaler α to account for different scales on the x axis.


Limitations

Our experiments focus on networks using LN or RMSNorm because of their popularity in Transformers and other modern architectures. Preliminary experiments (see Appendix D) indicate that DyT struggles to replace BN directly in classic networks like ResNets. It remains to be studied in more depth whether and how DyT can adapt to models with other types of normalization layers.

Furthermore, although DyT is conceptually and computationally simpler, we find that DyT offers no speedup over models with normalization layers when properly compiled/optimized (see Appendix C). Its computational benefits across different hardware platforms or deployment environments remain uncertain.

Conclusion

In this work, we demonstrate that modern neural networks, in particular Transformers, can be trained without normalization layers. This is done through Dynamic Tanh (DyT), a simple replacement for traditional normalization layers. It adjusts the input activation range via a learnable scaling factor α and then squashes the extreme values through an S-shaped tanh function. Although it is a simpler function, it effectively captures the behavior of normalization layers. Under various settings, models with DyT match or exceed the performance of their normalized counterparts. The findings challenge the conventional understanding of the necessity of normalization layers in training modern neural networks. Our study also contributes to understanding the mechanisms of normalization layers, one of the most fundamental building blocks in deep neural networks.

Experiments

Experimental Settings

Supervised image classification. For all supervised classification experiments on ImageNet-1K, we follow the training recipes from ConvNeXt (Meta Research, a). For ConvNeXt-B and ConvNeXt-L, we use the original hyperparameters without modification. ViT-B and ViT-L models use the same hyperparameters as ConvNeXt-B, except that for ViT-L, the beta parameters for AdamW are set to (0.9, 0.95), and the stochastic depth rates are set to 0.1 for ViT-B and 0.4 for ViT-L.

Diffusion models. We use the official implementation (Meta Research, c) for training all DiT models. We find that the default learning rate is suboptimal for the models considered in this paper. To address this, we conduct a simple learning rate search with the LN models and apply the tuned learning rates directly to the DyT models. We also observe that the zero initialization negatively affects the performance of DyT models. Therefore, we retain the zero initialization for LN models but remove the zero initialization for DyT models.

Large Language Models. In our implementation of LLaMA models (Touvron et al., 2023a,b; Dubey et al., 2024) with DyT, we introduce an additional learnable scalar parameter immediately after the embedding layer, before any Transformer blocks. We initialize it to the square root of the model embedding dimension √ d . Without this scaling scalar, we find that the magnitudes of model activations at the beginning of training are too small, and the training struggles to progress. The issue is mitigated by incorporating a learnable scalar, and the model can converge normally. This addition of a scalar is similar to the original Transformer (Vaswani et al., 2017) design, which uses a fixed scalar of the same value at the same position.
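The extra scalar described above can be sketched as follows; the function name and plain-list interface are illustrative, not the paper's code, and only the initialization to √d is taken from the text:

```python
import math

d_model = 4096                             # illustrative embedding dimension
embed_scale = math.sqrt(d_model)           # initial value of the learnable scalar

def scale_embeddings(token_embeddings, scale=embed_scale):
    """Applied once, right after the embedding lookup, before any block."""
    return [[scale * v for v in tok] for tok in token_embeddings]
```

In training, `scale` would be a learnable parameter rather than a constant; initializing it to √d boosts the otherwise too-small activation magnitudes at the start of training.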

We train all our LLaMA models on the Pile dataset (Gao et al., 2020). We use the codebase from FMS-FSDP (Foundation Model Stack), which provides a default training recipe for the 7B model that closely follows the LLaMA 2 paper (Touvron et al., 2023b). We maintain the learning rate at the default 3e-4 for 7B and 13B and 1.5e-4 for 34B and 70B, in line with LLaMA 2. The batch size is set to 4M tokens and each model is trained on a total of 200B tokens.

For evaluation, we test the pretrained models on 15 zero-shot commonsense reasoning tasks from lm-eval (Gao et al.): anli_r1 , anli_r2 , anli_r3 , arc_challenge , arc_easy , boolq , hellaswag , openbookqa , piqa , record , rte , truthfulqa_mc1 , truthfulqa_mc2 , wic , and winogrande . The selection closely follows that of OpenLLaMA (Geng and Liu, 2023). We report the average performance across all tasks.

Self-supervised learning in speech. For both wav2vec 2.0 models, we retain the first group normalization layer from the original architecture, as it functions primarily as data normalization to handle the unnormalized input data. We use the official implementation (Meta Research, e) without modifying hyperparameters for both the Base and Large models. We report the final validation loss.

Other tasks. For all other tasks, MAE (He et al., 2022), DINO (Caron et al., 2021), HyenaDNA (Nguyen et al., 2024) and Caduceus (Schiff et al., 2024), we directly use the publicly released code (Meta Research, d,b; HazyResearch; Kuleshov Group), without hyperparameter tuning, for both models with LN and DyT.


Hyperparameters

We present additional experiments to evaluate the impact of hyperparameter tuning, specifically focusing on the learning rate and initialization of α for all non-LLM models.

Tuning learning rate. Table 12 summarizes performance comparisons between models trained with original versus tuned learning rates. Results indicate that tuning the learning rate provides only modest performance improvements for DyT models. This suggests that the original hyperparameters, initially optimized for LN models, are already well-suited for DyT models. This observation underscores the inherent similarity between the DyT and LN models.

Table 12 Performance comparison between original and tuned learning rates for LN and DyT models. Results show that tuning learning rates provides only modest performance improvements for DyT models, suggesting that the default hyperparameters optimized for LN models are already well-suited for DyT models. Entries marked with '-' indicate no performance gain over the original learning rate. The values in parentheses represent the learning rate used.

Tuning initial value of α. We also investigate the effects of optimizing α₀ for DyT models, as presented in Table 13. Findings show only minor performance enhancements for select models when α₀ is tuned, indicating that the default initial value (α₀ = 0.5) generally achieves near-optimal performance.

Table 13 Impact of tuning α₀ in DyT models. Optimizing α₀ away from the default value (α₀ = 0.5) yields only minor performance gains for select DyT models, implying the default initialization already achieves near-optimal performance. Entries marked with '-' indicate no improvement over the default α₀.


We find that tuning the initialization of α (denoted α₀) rarely leads to significant performance improvements. The only exception is LLM training, where careful tuning of α₀ yields noticeable performance gains. In this section, we detail our findings on the impact of α initialization.

Replacing Batch Normalization with DyT



Project page and code: jiachenzhu.github.io/DyT

Over the past decade, normalization layers have solidified their positions as one of the most fundamental components of modern neural networks. It all traces back to the invention of batch normalization in 2015 (Ioffe and Szegedy, 2015), which enabled drastically faster and better convergence in visual recognition models and quickly gained momentum in the following years. Since then, many variants of normalization layers have been proposed for different network architectures or domains (Ba et al., 2016; Ulyanov et al., 2016; Wu and He, 2018; Zhang and Sennrich, 2019). Today, virtually all modern networks use normalization layers, with layer normalization (Layer Norm, or LN) (Ba et al., 2016) being one of the most popular, particularly in the dominant Transformer architecture (Vaswani et al., 2017; Dosovitskiy et al., 2020).

The widespread adoption of normalization layers is largely driven by their empirical benefits in optimization (Santurkar et al., 2018; Bjorck et al., 2018). In addition to achieving better results, they help accelerate and stabilize convergence. As neural networks become wider and deeper, this necessity becomes ever more critical (Brock et al., 2021a; Huang et al., 2023). Consequently, normalization layers are widely regarded as crucial, if not indispensable, for the effective training of deep networks. This belief is subtly evidenced by the fact that, in recent years, novel architectures often seek to replace attention or convolution layers (Tolstikhin et al., 2021; Gu and Dao, 2023; Sun et al., 2024; Feng et al., 2024), but almost always retain the normalization layers.

This paper challenges this belief by introducing a simple alternative to normalization layers in Transformers. Our exploration starts with the observation that LN layers map their inputs to outputs with tanh-like, S-shaped curves, scaling the input activations while squashing the extreme values. Inspired by this insight, we propose an element-wise operation termed Dynamic Tanh (DyT), defined as DyT(x) = tanh(αx), where α is a learnable parameter. This operation aims to emulate the behavior of LN by learning an appropriate scaling factor through α and squashing extreme values via the bounded tanh function. Notably, unlike normalization layers, it achieves both effects without the need to compute activation statistics.

Employing DyT is straightforward, as shown in Figure 1: we directly replace existing normalization layers with DyT in architectures such as vision and language Transformers. We empirically demonstrate that models with DyT can train stably and achieve high final performance across a wide range of settings, often without tuning the training hyperparameters used by the original architecture. Our work challenges the notion that normalization layers are indispensable for training modern neural networks and provides empirical insights into the properties of normalization layers. Moreover, preliminary measurements suggest that DyT improves training and inference speed, making it a candidate for efficiency-oriented network design.


We first empirically study the behaviors of normalization layers in trained networks. For this analysis, we take a Vision Transformer model (ViT-B) (Dosovitskiy et al., 2020) trained on ImageNet-1K (Deng et al., 2009), a wav2vec 2.0 Large Transformer model (Baevski et al., 2020) trained on LibriSpeech (Panayotov et al., 2015), and a Diffusion Transformer (DiT-XL) (Peebles and Xie, 2023) trained on ImageNet-1K. In all cases, LN is applied in every Transformer block and before the final linear projection.

For all three trained networks, we sample a mini-batch of samples and do a forward pass through the network. We then measure the input and output for the normalization layers, i.e., tensors immediately before and after the normalization operation, before the learnable affine transformation. Since LN preserves the dimensions of the input tensor, we can establish a one-to-one correspondence between the input and output tensor elements, allowing for a direct visualization of their relationship. We plot the resulting mappings in Figure 2.

For all three models, in earlier LN layers (1st column of Figure 2), we find this input-output relationship to be mostly linear, resembling a straight line in an x-y plot. However, the deeper LN layers are where we make more intriguing observations.

A striking observation from these deeper layers is that most of these curves' shapes highly resemble full or partial S-shaped curves represented by a tanh function (see Figure 3). One might expect LN layers to linearly transform the input tensor, as subtracting the mean and dividing by the standard deviation are linear operations. LN normalizes in a per-token manner, only linearly transforming each token's activations. As tokens have different mean and standard deviation values, the linearity does not hold collectively on all activations of the input tensor. Nonetheless, it is still surprising to us that the actual non-linear transformation is highly similar to a scaled tanh function.

For such an S-shaped curve, we note that the central part, represented by points with x values close to zero, is still mainly linear. Most points (∼99%) fall in this linear range. However, there are still many points that clearly fall outside this range, which are considered to have 'extreme' values, e.g., those with x larger than 50 or smaller than -50 in the ViT model. Normalization layers' main effect on these values is to squash them into less extreme values, more in line with the majority of points. This is where normalization layers cannot be approximated by a simple affine transformation layer. We hypothesize that this non-linear and disproportionate squashing effect on extreme values is what makes normalization layers important and indispensable.

Recent findings by Ni et al. (2024) similarly highlight the strong non-linearities introduced by LN layers, demonstrating how the non-linearity enhances a model’s representational capacity. Moreover, this squashing behavior mirrors the saturation properties of biological neurons for large inputs, a phenomenon first observed about a century ago (Adrian, 1926; Adrian and Zotterman, 1926a, b).

Normalization by tokens and channels. How does an LN layer perform a linear transformation for each token, yet squash the extreme values in such a non-linear fashion? To understand this, we visualize the points grouped by tokens and channels, respectively. This is plotted in Figure 4 by taking the second and third subplots for ViT from Figure 2, but with a sampled subset of points for clarity. When we select the channels to plot, we make sure to include the channels with extreme values.

On the left two panels of Figure 4, we visualize each token's activations using the same color. We observe that all points from any single token do form a straight line. However, since each token has a different variance, the slopes are different. Tokens with smaller input x ranges tend to have smaller variance, and the normalization layer will divide their activations by a smaller standard deviation, hence producing a larger slope in the straight line. Collectively, they form an S-shaped curve that resembles a tanh function. In the two panels on the right, we color each channel's activations using the same color. We find that different channels tend to have drastically different input ranges, with only a few channels (e.g., red, green, and pink) exhibiting large extreme values. These are the channels that get squashed the most by the normalization layer.
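The per-token linearity described above can be checked numerically: for a fixed token, LN's output (before the affine transform) is exactly an affine function of the input with slope 1/σ, so smaller-spread tokens yield steeper lines. A toy NumPy check, not the paper's actual data:

```python
import numpy as np

rng = np.random.default_rng(2)
C = 64
token = rng.normal(scale=3.0, size=C)      # one token's activations
mu, sigma = token.mean(), token.std()
out = (token - mu) / sigma                 # LN output, before the affine

# For a fixed token the mapping is exactly affine: slope 1/sigma.
slope, intercept = 1.0 / sigma, -mu / sigma
assert np.allclose(out, slope * token + intercept)

# A token with smaller spread gets a larger slope; stacking many such
# lines of different slopes bends the collective cloud into an S shape.
narrow = rng.normal(scale=0.5, size=C)
print(1.0 / narrow.std() > slope)          # smaller std -> steeper line
```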

Inspired by the similarity between the shapes of normalization layers and a scaled tanh function, we propose Dynamic Tanh (DyT) as a drop-in replacement for normalization layers. Given an input tensor x, a DyT layer is defined as follows:

$$
\mathrm{DyT}(\bm{x}) = \bm{\gamma} \odot \tanh(\alpha \bm{x}) + \bm{\beta}
$$

where α is a learnable scalar parameter that allows scaling the input differently based on its range, accounting for varying x scales (Figure 2). This is also why we name the whole operation 'Dynamic' Tanh. γ and β are learnable, per-channel vector parameters, the same as those used in all normalization layers; they allow the output to scale back to any range. This is sometimes considered a separate affine layer; for our purposes, we consider them to be part of the DyT layer, just as normalization layers also include them. See Algorithm 1 for an implementation of DyT in PyTorch-like pseudocode.
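Algorithm 1 is not reproduced here; the following is a minimal, framework-free sketch of the same element-wise computation (the function name and list-based interface are illustrative, and γ, β default to the all-ones / all-zeros initialization described below):

```python
import math

def dyt(x, alpha, gamma=None, beta=None):
    """Dynamic Tanh over one token vector x of length C (a sketch).

    alpha is a scalar; gamma and beta are per-channel vectors. Unlike a
    normalization layer, no mean or variance is computed anywhere."""
    C = len(x)
    gamma = gamma if gamma is not None else [1.0] * C
    beta = beta if beta is not None else [0.0] * C
    return [g * math.tanh(alpha * v) + b for v, g, b in zip(x, gamma, beta)]
```

In a real model this would be a module with alpha, gamma, and beta registered as learnable parameters and applied element-wise over the whole (B, T, C) tensor.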

Integrating DyT layers into an existing architecture is straightforward: one DyT layer replaces one normalization layer (see Figure 1). This applies to normalization layers within attention blocks, FFN blocks, and the final normalization layer. Although DyT may look like or be considered an activation function, this study only uses it to replace normalization layers without altering any parts of the activation functions in the original architectures, such as GELU or ReLU. Other parts of the networks also remain intact. We also observe that there is little need to tune the hyperparameters used by the original architectures for DyT to perform well.

We always simply initialize γ to an all-one vector and β to an all-zero vector, following normalization layers. For the scalar parameter α, a default initialization of 0.5 is generally sufficient, except for LLM training. A detailed analysis of α initialization is provided in Section 7. Unless explicitly stated otherwise, α is initialized to 0.5 in our subsequent experiments.

DyT is not a new type of normalization layer, as it operates on each input element from a tensor independently during a forward pass without computing statistics or other types of aggregations. It does, however, preserve the effect of normalization layers in squashing the extreme values in a non-linear fashion while almost linearly transforming the very central parts of the input.

To demonstrate the effectiveness of DyT, we experiment with Transformers and a few other modern architectures across a diverse range of tasks and domains. In each experiment, we replace the LN or RMSNorm in the original architectures with DyT layers and follow the official open-source protocols to train and test both versions of the models. Detailed instructions for reproducing our results are provided in Appendix A. Notably, to highlight the simplicity of adapting DyT, we use hyperparameters identical to those utilized by the normalized counterparts. For completeness, additional experimental results regarding tuning of learning rates and initial values of α are provided in Appendix B.

We train Vision Transformer (ViT) (Dosovitskiy et al., 2020) and ConvNeXt (Liu et al., 2022) of “Base” and “Large” sizes on the ImageNet-1K classification task (Deng et al., 2009). These models are selected due to their popularity and distinct operations: attention in ViT and convolution in ConvNeXt. Table 1 reports the top-1 classification accuracies. DyT performs slightly better than LN across both architectures and model sizes. We further plot the training loss for ViT-B and ConvNeXt-B in Figure 5. The curves show that the convergence behaviors of DyT and LN-based models are highly aligned.

We benchmark with two popular visual self-supervised learning methods: masked autoencoders (MAE) (He et al., 2022) and DINO (Caron et al., 2021). Both by default use Vision Transformers as the backbones, but have different training objectives: MAE is trained with a reconstruction loss, and DINO uses a joint-embedding loss (LeCun, 2022). Following the standard self-supervised learning protocol, we first pretrain models on ImageNet-1K without using any labels and then test the pretrained models by attaching a classification layer and fine-tuning them with labels. The fine-tuning results are presented in Table 2. DyT consistently performs on par with LN in self-supervised learning tasks.

We train three Diffusion Transformer (DiT) models (Peebles and Xie, 2023) of sizes B, L, and XL on ImageNet-1K (Deng et al., 2009). The patch size is 4, 4, and 2, respectively. Note that in DiT, the LN layers' affine parameters are used for class conditioning, and we keep them that way in our DyT experiments, only replacing the normalizing transformation with the tanh(αx) function. After training, we evaluate the Fréchet Inception Distance (FID) scores using the standard ImageNet 'reference batch', as presented in Table 3. DyT achieves comparable or improved FID scores relative to LN.

We pretrain LLaMA 7B, 13B, 34B, and 70B models (Touvron et al., 2023a,b; Dubey et al., 2024) to assess DyT's performance relative to RMSNorm (Zhang and Sennrich, 2019), the default normalization layer used in LLaMA. The models are trained on The Pile dataset (Gao et al., 2020) with 200B tokens, following the original recipe outlined in LLaMA (Touvron et al., 2023b). On LLaMA with DyT, we add a learnable scalar parameter after the initial embedding layer, and adjust the initial value of α, as detailed in Section 7. We report the loss value after training and also follow OpenLLaMA (Geng and Liu, 2023) to benchmark the models on 15 zero-shot tasks from lm-eval (Gao et al.). As shown in Table 4, DyT performs on par with RMSNorm across all four model sizes. Figure 6 illustrates the loss curves, demonstrating similar trends across all model sizes, with training losses closely aligned throughout training.

We pretrain two wav2vec 2.0 Transformer models (Baevski et al., 2020) on the LibriSpeech dataset (Panayotov et al., 2015). We report the final validation loss in Table 5. We observe that DyT performs comparably to LN in both model sizes.

On the long-range DNA sequence modeling task, we pretrain the HyenaDNA model (Nguyen et al., 2024) and the Caduceus model (Schiff et al., 2024). The pretraining uses the human reference genome data from (GRCh38, 2013), and the evaluation is on GenomicBenchmarks (Grešová et al., 2023). The results are presented in Table 6. DyT maintains performance comparable to LN for this task.

We conduct several analyses on important properties of DyT. We begin by evaluating its computational efficiency, followed by two studies examining the roles of the tanh function and the learnable scale α. Finally, we present comparisons with previous methods that aim to remove normalization layers.

We benchmark the LLaMA 7B model with RMSNorm or DyT by measuring the total time taken for 100 forward passes (inference) and 100 forward-backward passes (training) using a single sequence of 4096 tokens. Table 7 reports the time required for all RMSNorm or DyT layers and the entire model when running on an Nvidia H100 GPU with BF16 precision. DyT layers significantly reduce computation time compared to RMSNorm layers, with a similar trend observed under FP32 precision. DyT may be a promising choice for efficiency-oriented network design.
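As a rough illustration of why DyT is cheaper, the following numpy sketch (a CPU stand-in, not the H100 benchmark above; `rmsnorm` and `dyt` are hypothetical helper names) contrasts RMSNorm's per-token reduction with DyT's purely elementwise computation:

```python
# Hypothetical CPU micro-benchmark sketch (numpy stand-in for the H100
# measurement in Table 7). RMSNorm needs a reduction over the channel
# dimension; DyT is purely elementwise.
import time
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    # Normalize each token by its root-mean-square over channels.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def dyt(x, alpha, gamma, beta):
    # DyT: tanh(alpha * x), scaled and shifted, with no reductions.
    return gamma * np.tanh(alpha * x) + beta

x = np.random.randn(4096, 1024).astype(np.float32)  # (tokens, channels)
w = np.ones(1024, dtype=np.float32)

t0 = time.perf_counter(); _ = rmsnorm(x, w); t_rms = time.perf_counter() - t0
t0 = time.perf_counter(); _ = dyt(x, 0.5, w, 0.0); t_dyt = time.perf_counter() - t0
print(f"rmsnorm: {t_rms * 1e3:.2f} ms, dyt: {t_dyt * 1e3:.2f} ms")
```

Timings from this sketch are not comparable to the GPU numbers in Table 7; the point is only that DyT avoids the cross-channel reduction.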

To further investigate the role of tanh and α in DyT, we conduct experiments to evaluate the model's performance when these components are altered or removed.

We replace tanh in DyT layers with alternative squashing functions, specifically hardtanh and sigmoid (Figure 8), while keeping the learnable scaler α intact. Furthermore, we assess the impact of completely removing tanh by replacing it with the identity function while still retaining α. As shown in Table 8, the squashing function is essential for stable training. Using the identity function leads to unstable training and divergence, whereas squashing functions enable stable training. Among them, tanh performs the best, possibly due to its smoothness and zero-centered properties.
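The ablation above can be mimicked with a small sketch of the three squashing variants (simple numpy definitions; `dyt_core` is an illustrative name, not the paper's code):

```python
# Sketch of the squashing-function ablation (Table 8): each variant keeps
# the scaler alpha and swaps only the bounded nonlinearity. The identity
# variant corresponds to the unstable, divergent ablation.
import numpy as np

def hardtanh(v):
    return np.clip(v, -1.0, 1.0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def dyt_core(x, alpha, squash):
    return squash(alpha * x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(dyt_core(x, 0.5, np.tanh))    # bounded in (-1, 1), zero-centered
print(dyt_core(x, 0.5, hardtanh))   # bounded in [-1, 1]
print(dyt_core(x, 0.5, sigmoid))    # bounded in (0, 1), not zero-centered
```

Note that sigmoid maps 0 to 0.5 rather than 0, which is one concrete sense in which tanh is zero-centered and sigmoid is not.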

Next, we evaluate the impact of removing the learnable α while retaining the squashing functions (tanh, hardtanh, and sigmoid). As shown in Table 9, removing α results in performance degradation across all squashing functions, highlighting its critical role in overall model performance.

Our analysis reveals that α closely tracks 1/std of the activations throughout training. As illustrated in the left panel of Figure 8, α first decreases and then increases during training, but always fluctuates consistently with the standard deviation of the input activations. This supports the important role of α in maintaining activations within a suitable range, leading to stable and effective training.
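The α ≈ 1/std relationship can be checked with a toy numerical experiment (not the paper's measurement): setting α to the inverse standard deviation of the input brings the pre-tanh activations to unit scale regardless of how the inputs are scaled.

```python
# Toy numerical check of the alpha ~ 1/std observation: with alpha set to
# the inverse standard deviation of the input, the pre-tanh activations
# have unit scale whatever the input scale is.
import numpy as np

rng = np.random.default_rng(0)
for scale in (0.1, 1.0, 10.0):
    x = rng.normal(0.0, scale, size=100_000)
    alpha = 1.0 / x.std()      # the value alpha is observed to track
    pre = alpha * x            # pre-tanh activations
    print(scale, round(float(pre.std()), 3))  # ~1.0 in every case
```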

Our further analysis of the final values of α in trained networks reveals a strong correlation with the 1/std of the input activations. As shown in the right panel of Figure 8, higher 1/std values generally correspond to larger α values, and vice versa. Additionally, we observe that deeper layers tend to have activations with larger standard deviations. This trend aligns with characteristics of deep residual networks, as shown in Brock et al. (2021a) for ConvNets and Sun et al. (2025) for Transformers.

Both analyses suggest that α functions partially as a normalization mechanism by learning values approximating 1/std of the input activations. Unlike LN, which normalizes the activations per token, α normalizes the entire input activations collectively. Consequently, α alone cannot suppress extreme values in a non-linear fashion.
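A toy example of this last point: a single global scale preserves the relative magnitude of an outlier, while tanh squashes it non-linearly.

```python
# Toy example: a global scale alpha keeps an outlier 500x larger than the
# smallest value, while tanh saturates it to a bounded range.
import numpy as np

x = np.array([0.1, 0.2, 0.3, 50.0])   # one extreme activation
alpha = 1.0 / x.std()

scaled = alpha * x                    # linear rescaling: ratios preserved
squashed = np.tanh(alpha * x)         # DyT-style squashing: outlier saturates

print(scaled.max() / scaled.min())    # ~500, still extreme
print(squashed.max())                 # < 1, bounded
```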

To further assess DyT's effectiveness, we compare it with other methods that also enable training Transformers without normalization layers. These methods can be broadly categorized into initialization-based and weight-normalization-based methods. We consider two popular initialization-based methods, Fixup (Zhang et al., 2019; Huang et al., 2020) and SkipInit (De and Smith, 2020; Bachlechner et al., 2021). Both methods aim to mitigate training instabilities by adjusting the initial parameter values to prevent large gradients and activations at the start of training, thereby enabling stable learning without normalization layers. In contrast, weight-normalization-based methods impose constraints on network weights throughout training to maintain stable learning dynamics in the absence of normalization layers. We include one such method, σReparam (Zhai et al., 2023), which controls the spectral norm of the weights to promote stable learning.

Table 10 summarizes the results of two ViT-based tasks. We closely follow the original protocols outlined in the respective papers. However, we find that both initialization-based methods, Fixup and SkipInit, require significantly lower learning rates to prevent training divergence. To ensure a fair comparison, we conduct a simple learning rate search for all methods, including DyT. This produces results that differ from those reported in Section 5, where no hyperparameters are tuned. Overall, the results show that DyT consistently outperforms all other tested methods across different configurations.

We find that tuning the initialization of α (denoted α₀) rarely leads to significant performance improvements. The only exception is LLM training, where careful tuning of α₀ yields noticeable performance gains. In this section, we detail our findings on the impact of α initialization.

Figure 9 shows the effect of varying α₀ on validation performance across different tasks. All experiments follow the original setup and hyperparameters of their respective recipes. Performance remains stable across a wide range of α₀ values, with values between 0.5 and 1.2 generally yielding good results. Adjusting α₀ typically affects only the early stages of the training curves. The main exception is the supervised ViT-L experiment, where training becomes unstable and diverges when α₀ exceeds 0.6. In such cases, reducing the learning rate restores stability, as detailed below.

Building on these observations, we further analyze the factors contributing to training instability. Our findings suggest that increasing either the model size or the learning rate requires lowering α₀ to ensure stable training. Conversely, a higher α₀ requires a lower learning rate to mitigate training instability. Figure 10 shows an ablation of training stability for supervised ViT on the ImageNet-1K dataset, varying learning rates, model sizes, and α₀ values. Larger models are more prone to training failure, requiring smaller α₀ values or learning rates for stable training. A similar instability pattern is also observed in LN-based models under comparable conditions, and setting α₀ = 0.5 results in a stability pattern similar to that of LN.

Based on our findings, we set α₀ = 0.5 as the default value for all non-LLM models. This setting provides training stability comparable to LN while maintaining strong performance.

As discussed earlier, the default setting of α₀ = 0.5 generally performs well across most tasks. However, we find that tuning α₀ can substantially improve LLM performance. We tune α₀ across LLaMA models by pretraining each on 30B tokens and comparing their training losses. Table 11 summarizes the tuned α₀ values for each model. Two key findings emerge:

Larger models require smaller α₀ values. Once the optimal α₀ is determined for smaller models, the search space for larger models can be reduced accordingly.

Higher α₀ values for attention blocks improve performance. We find that initializing α with higher values for DyT layers in attention blocks and lower values for DyT layers in other locations (i.e., within FFN blocks or before the final linear projection) improves performance.

To further illustrate the impact of α₀ tuning, Figure 11 presents heatmaps of loss values for two LLaMA models. Both models benefit from higher α₀ in attention blocks, leading to reduced training loss.

We also investigate the influence of model width and depth on the optimal α₀. As Table 12 shows, model width is critical in determining the optimal α₀, with wider networks requiring smaller values, while model depth has negligible influence.

As can be seen in Table 12, the wider the network, the more uneven the initialization for "attention" and "other" needs to be. We hypothesize that the sensitivity of LLMs' α initialization is related to their excessively large widths compared to other models.
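The per-location initialization rule from Table 11 can be sketched as a small helper (a hypothetical function; the released training code may organize this differently):

```python
# Hypothetical helper encoding the tuned per-location initialization from
# Table 11: DyT layers in attention blocks get a larger alpha_0 than those
# in FFN blocks or before the final projection ("other").
ALPHA0 = {
    "LLaMA 7B":  (0.8, 0.2),
    "LLaMA 13B": (0.6, 0.15),
    "LLaMA 34B": (0.2, 0.05),
    "LLaMA 70B": (0.2, 0.05),
}

def init_alpha(model, in_attention_block):
    attn, other = ALPHA0[model]
    return attn if in_attention_block else other

print(init_alpha("LLaMA 7B", True), init_alpha("LLaMA 7B", False))  # 0.8 0.2
```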

There has been a rich line of work investigating normalization layers’ role in enhancing model performance through various mechanisms. These include stabilizing gradient flow during training (Balduzzi et al., 2017; Daneshmand et al., 2020; Lubana et al., 2021), reducing sensitivity to weight initialization (Zhang et al., 2019; De and Smith, 2020; Shao et al., 2020), moderating outlier eigenvalues (Bjorck et al., 2018; Karakida et al., 2019), auto-tuning learning rates (Arora et al., 2018; Tanaka and Kunin, 2021), and smoothing the loss landscape for more stable optimization (Santurkar et al., 2018). These earlier works focused on studying batch normalization. Recent studies (Lyu et al., 2022; Dai et al., 2024; Mueller et al., 2024) further highlight the connection between normalization layers and sharpness reduction, which contributes to better generalization.

With the rise of Transformer (Vaswani et al., 2017), research has increasingly focused on layer normalization (Ba et al., 2016), which has proven particularly effective for sequential data in natural language tasks (Nguyen and Salazar, 2019; Xu et al., 2019; Xiong et al., 2020). Recent work (Ni et al., 2024) reveals that layer normalization introduces strong non-linearity, enhancing the model’s representational capacity. Additionally, studies (Loshchilov et al., 2024; Li et al., 2024) demonstrate that modifying the location of normalization layers within Transformers can improve convergence properties.

Many studies have explored how to train deep models without normalization layers. Several works (Zhang et al., 2019; De and Smith, 2020; Bachlechner et al., 2021) explore alternative weight initialization schemes to stabilize training. The pioneering work by Brock et al. (2021a, b) shows that high-performing ResNets can be trained without normalization (Smith et al., 2023) through a combination of initialization techniques (De and Smith, 2020), weight normalization (Salimans and Kingma, 2016; Huang et al., 2017; Qiao et al., 2019), and adaptive gradient clipping (Brock et al., 2021b). Additionally, their training strategy incorporates extensive data augmentation (Cubuk et al., 2020) and regularization (Srivastava et al., 2014; Huang et al., 2016). The studies above are based on various ConvNet models.

In Transformer architectures, He and Hofmann (2023) explore modifications to Transformer blocks that reduce reliance on normalization layers and skip connections. Alternatively, Heimersheim (2024) proposes a method to gradually remove LN from pretrained networks by fine-tuning the model after removing each normalization layer. Unlike previous approaches, DyT requires minimal modifications to both the architecture and the training recipe. Despite its simplicity, DyT achieves stable training and comparable performance.

We conduct experiments on networks using either LN or RMSNorm because of their popularity in Transformers and other modern architectures. Preliminary experiments (see Appendix C) indicate that DyT struggles to replace BN directly in classic networks like ResNets. It remains to be studied in more depth whether and how DyT can adapt to models with other types of normalization layers.

In this work, we demonstrate that modern neural networks, in particular Transformers, can be trained without normalization layers. This is done through Dynamic Tanh (DyT), a simple replacement for traditional normalization layers. It adjusts the input activation range via a learnable scaling factor α and then squashes the extreme values through an S-shaped tanh function. Although a simpler function, it effectively captures the behavior of normalization layers. Under various settings, models with DyT match or exceed the performance of their normalized counterparts. These findings challenge the conventional understanding of the necessity of normalization layers in training modern neural networks. Our study also contributes to understanding the mechanisms of normalization layers, one of the most fundamental building blocks in deep neural networks.

For all supervised classification experiments on ImageNet-1K, we follow the training recipes from ConvNeXt (Meta Research, a). For ConvNeXt-B and ConvNeXt-L, we use the original hyperparameters without modification. ViT-B and ViT-L models use the same hyperparameters as ConvNeXt-B, except that for ViT-L, the beta parameters for AdamW are set to (0.9, 0.95), and the stochastic depth rates are set to 0.1 for ViT-B and 0.4 for ViT-L.

We use the official implementation (Meta Research, c) for training all DiT models. We find that the default learning rate is suboptimal for the models considered in this paper. To address this, we conduct a simple learning rate search with the LN models and apply the tuned learning rates directly to the DyT models. We also observe that zero initialization negatively affects the performance of DyT models; we therefore retain it for LN models but remove it for DyT models.

In our implementation of LLaMA models (Touvron et al., 2023a, b; Dubey et al., 2024) with DyT, we introduce an additional learnable scalar parameter immediately after the embedding layer, before any Transformer blocks. We initialize it to √d, the square root of the model embedding dimension. Without this scaling scalar, we find that the magnitudes of model activations at the beginning of training are too small, and training struggles to progress. Incorporating the learnable scalar mitigates this issue, and the model converges normally. This addition is similar to the original Transformer (Vaswani et al., 2017) design, which uses a fixed scalar of the same value at the same position.
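A minimal sketch of this scalar (the class name is illustrative; in the real model the scale is a learnable parameter rather than a fixed float):

```python
# Minimal sketch of the extra scalar placed after the embedding layer.
# In the actual model this is a learnable parameter initialized to sqrt(d);
# here it is a plain float to keep the sketch dependency-free.
import math

class EmbeddingScale:
    def __init__(self, d_model):
        self.scale = math.sqrt(d_model)   # initialized to sqrt(embedding dim)

    def __call__(self, embedding_values):
        return [self.scale * v for v in embedding_values]

s = EmbeddingScale(4096)                  # LLaMA 7B width from Table 11
print(s.scale)                            # 64.0
```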

We train all our LLaMA models on the Pile dataset (Gao et al., 2020). We use the codebase from FMS-FSDP (Foundation Model Stack, ), which provides a default training recipe for the 7B model that closely follows the LLaMA 2 paper (Touvron et al., 2023b). We maintain the learning rate at the default 3e-4 for 7B and 13B and 1.5e-4 for 34B and 70B, in line with LLaMA 2. The batch size is set to 4M tokens and each model is trained on a total of 200B tokens.

For evaluation, we test the pretrained models on 15 zero-shot commonsense reasoning tasks from lm-eval (Gao et al.): anli_r1, anli_r2, anli_r3, arc_challenge, arc_easy, boolq, hellaswag, openbookqa, piqa, record, rte, truthfulqa_mc1, truthfulqa_mc2, wic, and winogrande. The selection closely follows that of OpenLLaMA (Geng and Liu, 2023). We report the average performance across all tasks.

For both wav2vec 2.0 models, we retain the first group normalization layer from the original architecture, as it functions primarily as data normalization to handle the unnormalized input data. We use the official implementation (Meta Research, e) without modifying hyperparameters for both the Base and Large models. We report the final validation loss.

For all other tasks, MAE (He et al., 2022), DINO (Caron et al., 2021), HyenaDNA (Nguyen et al., 2024) and Caduceus (Schiff et al., 2024), we directly use the publicly released code (Meta Research, d, b; HazyResearch, ; Kuleshov Group, ), without hyperparameter tuning, for both models with LN and DyT.

We present additional experiments to evaluate the impact of hyperparameter tuning, specifically focusing on the learning rate and the initialization of α, for all non-LLM models.

Table 13 summarizes performance comparisons between models trained with original versus tuned learning rates. Results indicate that tuning the learning rate provides only modest performance improvements for DyT models. This suggests that the original hyperparameters, initially optimized for LN models, are already well-suited for DyT models. This observation underscores the inherent similarity between the DyT and LN models.

We also investigate the effects of tuning α₀ for DyT models, as presented in Table 14. Findings show only minor performance enhancements for select models when α₀ is tuned, indicating that the default initial value (α₀ = 0.5) generally achieves near-optimal performance.

We investigate the potential of replacing BN with DyT in classic ConvNets such as ResNet-50 (He et al., 2016) and VGG19 (Simonyan and Zisserman, 2014). Both models are trained on the ImageNet-1K dataset (Deng et al., 2009) using the training recipes provided by torchvision. The DyT models are trained using the same hyperparameters as their BN counterparts.

The results are summarized in Table 16. Replacing BN with DyT leads to a noticeable drop in classification accuracy for both models. These findings indicate that DyT struggles to fully replace BN in these classic ConvNets. We hypothesize this could be related to BN layers being more frequent in these ConvNets: they appear once with every weight layer, whereas LN appears only once per several weight layers in Transformers.

Table: S5.T1: Supervised classification accuracy on ImageNet-1K. DyT achieves performance better than or similar to LN across both architectures and model sizes.

| model | LN | DyT | change |
|---|---|---|---|
| ViT-B | 82.3% | 82.5% | ↑ 0.2% |
| ViT-L | 83.1% | 83.6% | ↑ 0.5% |
| ConvNeXt-B | 83.7% | 83.7% | - |
| ConvNeXt-L | 84.3% | 84.4% | ↑ 0.1% |

Table: S5.T4: Language models' training loss and average performance on 15 zero-shot lm-eval tasks. DyT achieves zero-shot performance and training loss comparable to RMSNorm.

| score / loss | RMSNorm | DyT | change |
|---|---|---|---|
| LLaMA 7B | 0.513 / 1.59 | 0.513 / 1.60 | - / ↑ 0.01 |
| LLaMA 13B | 0.529 / 1.53 | 0.529 / 1.54 | - / ↑ 0.01 |
| LLaMA 34B | 0.536 / 1.50 | 0.536 / 1.50 | - / - |
| LLaMA 70B | 0.549 / 1.45 | 0.549 / 1.45 | - / - |

Table: S6.T7: Inference and training latency (BF16 precision) for LLaMA 7B with RMSNorm or DyT. DyT achieves a substantial reduction in both inference and training time.

| LLaMA 7B | inference (layer) | inference (model) | training (layer) | training (model) |
|---|---|---|---|---|
| RMSNorm | 2.1s | 14.1s | 8.3s | 42.6s |
| DyT | 1.0s | 13.0s | 4.8s | 39.1s |
| reduction | ↓ 52.4% | ↓ 7.8% | ↓ 42.2% | ↓ 8.2% |

Table: S6.T8: ImageNet-1K classification accuracy with different squashing functions. All experiments follow the same training recipe as the original LN-based models. Squashing functions play a crucial role in preventing divergence, with tanh achieving the highest performance among the three functions. "→ failed" indicates that training diverged after some progress, with the preceding number representing the highest accuracy reached before divergence.

| model | identity | tanh | hardtanh | sigmoid |
|---|---|---|---|---|
| ViT-S | 58.5% → failed | 80.3% | 79.9% | 79.6% |
| ViT-B | 61.0% → failed | 82.5% | 82.2% | 81.6% |

Table: S7.T11: Optimal α₀ for different LLaMA models. Larger models require smaller α₀ values. We find it is important to initialize α differently in (1) attention blocks ("attention") versus (2) the FFN blocks and the final DyT layer before outputs ("other"). α₀ in attention blocks requires larger values.

| model | width | depth | optimal α₀ (attention / other) |
|---|---|---|---|
| LLaMA 7B | 4096 | 32 | 0.8 / 0.2 |
| LLaMA 13B | 5120 | 40 | 0.6 / 0.15 |
| LLaMA 34B | 8192 | 48 | 0.2 / 0.05 |
| LLaMA 70B | 8192 | 80 | 0.2 / 0.05 |

Table: S7.T12: Optimal α₀ (attention / other) across model widths and depths in LLaMA training. Model width significantly impacts the choice of α₀, with wider networks requiring smaller values. In contrast, model depth has negligible influence.

| width \ depth | 8 | 16 | 32 | 64 |
|---|---|---|---|---|
| 1024 | 1.0 / 1.0 | 1.0 / 1.0 | 1.0 / 1.0 | 1.0 / 1.0 |
| 2048 | 1.0 / 0.5 | 1.0 / 0.5 | 1.0 / 0.5 | 1.0 / 0.5 |
| 4096 | 0.8 / 0.2 | 0.8 / 0.2 | 0.8 / 0.2 | 0.8 / 0.2 |
| 8192 | 0.2 / 0.05 | 0.2 / 0.05 | 0.2 / 0.05 | 0.2 / 0.05 |

Table: A2.T13: Performance comparison between original and tuned learning rates for LN and DyT models. Results show that tuning learning rates provides only modest performance improvements for DyT models, suggesting that the default hyperparameters optimized for LN models are already well-suited for DyT models. Entries marked with "-" indicate no performance gain over the original learning rate. The values in parentheses represent the learning rate used.

| model | LN (original) | DyT (original) | LN (tuned) | DyT (tuned) |
|---|---|---|---|---|
| ViT-B | 82.3% (4e-3) | 82.5% (4e-3) | - | 82.8% (6e-3) |
| ViT-L | 83.1% (4e-3) | 83.6% (4e-3) | - | - |
| ConvNeXt-B | 83.7% (4e-3) | 83.7% (4e-3) | - | - |
| ConvNeXt-L | 84.3% (4e-3) | 84.4% (4e-3) | - | - |
| MAE ViT-B | 83.2% (2.4e-3) | 83.2% (2.4e-3) | - | 83.7% (3.2e-3) |
| MAE ViT-L | 85.5% (2.4e-3) | 85.4% (2.4e-3) | - | 85.8% (3.2e-3) |
| DINO ViT-B (patch size 16) | 83.2% (7.5e-4) | 83.4% (7.5e-4) | 83.3% (1e-3) | - |
| DINO ViT-B (patch size 8) | 84.1% (5e-4) | 84.5% (5e-4) | - | - |
| DiT-B | 64.9 (4e-4) | 63.9 (4e-4) | - | - |
| DiT-L | 45.9 (4e-4) | 45.7 (4e-4) | - | - |
| DiT-XL | 19.9 (4e-4) | 20.8 (4e-4) | - | - |
| wav2vec 2.0 Base | 1.95 (5e-4) | 1.95 (5e-4) | - | 1.94 (6e-4) |
| wav2vec 2.0 Large | 1.92 (3e-4) | 1.91 (3e-4) | - | - |
| HyenaDNA | 85.2% (6e-4) | 85.2% (6e-4) | - | - |
| Caduceus | 86.9% (8e-3) | 86.9% (8e-3) | - | - |

Table: A2.T14: Impact of tuning α₀ in DyT models. Optimizing α₀ from the default value (α₀ = 0.5) yields only minor performance gains for select DyT models, implying the default initialization already achieves near-optimal performance. Entries marked with "-" indicate no improvement over the default α₀.

| Model | LN | DyT (α₀ = 0.5) | DyT (tuned) |
|---|---|---|---|
| ViT-B | 82.3% | 82.5% | 82.6% (α₀ = 1.0) |
| ViT-L | 83.1% | 83.6% | - |
| ConvNeXt-B | 83.7% | 83.7% | - |
| ConvNeXt-L | 84.3% | 84.4% | - |
| MAE ViT-B | 83.2% | 83.2% | 83.4% (α₀ = 1.0) |
| MAE ViT-L | 85.5% | 85.4% | - |
| DINO ViT-B (patch 16) | 83.2% | 83.4% | - |
| DINO ViT-B (patch 8) | 84.1% | 84.5% | - |
| DiT-B | 64.9 | 63.9 | - |
| DiT-L | 45.9 | 45.7 | - |
| DiT-XL | 19.9 | 20.8 | - |
| wav2vec 2.0 Base | 1.95 | 1.95 | - |
| wav2vec 2.0 Large | 1.92 | 1.91 | 1.90 (α₀ = 1.0) |
| HyenaDNA | 85.2% | 85.2% | - |
| Caduceus | 86.9% | 86.9% | - |

Figure: Left: original Transformer block. Right: block with our proposed Dynamic Tanh (DyT) layer. DyT is a straightforward replacement for commonly used Layer Norm (Ba et al., 2016) (in some cases RMSNorm (Zhang and Sennrich, 2019)) layers. Transformers with DyT match or exceed the performance of their normalized counterparts.

Figure: Output vs. input of selected layer normalization (LN) layers in Vision Transformer (ViT) (Dosovitskiy et al., 2020), wav2vec 2.0 (a Transformer model for speech) (Baevski et al., 2020), and Diffusion Transformer (DiT) (Peebles and Xie, 2023). We sample a mini-batch of samples and plot the input / output values of four LN layers in each model. The outputs are before the affine transformation in LN. The S-shaped curves highly resemble that of a tanh function (see Figure 3). The more linear shapes in earlier layers can also be captured by the center part of a tanh curve. This motivates us to propose Dynamic Tanh (DyT) as a replacement, with a learnable scaler α to account for different scales on the x axis.

Figure: tanh(αx) with three different α values.

Figure: Output vs. input of two LN layers, with tensor elements colored to indicate different channel and token dimensions. The input tensor has a shape of (samples, tokens, channels), with elements visualized by assigning consistent colors to the same tokens (left two panels) and channels (right two panels). Left two panels: points representing the same token (same color) form straight lines across different channels, as LN operates linearly across channels for each token. Interestingly, when plotted collectively, these lines form a non-linear tanh-shaped curve. Right two panels: each channel's input spans different ranges on the x-axis, contributing distinct segments to the overall tanh-shaped curve. Certain channels (e.g., red, green, and pink) exhibit more extreme x values, which are squashed by LN.

Figure: LLaMA pretraining loss. The loss curves of DyT and RMSNorm models are closely aligned across model sizes.

Figure: Curves of three squashing functions: tanh, hardtanh, and sigmoid. All three functions squash inputs into a bounded range, but tanh(x) achieves the best performance when used in DyT layers. We suspect this is due to its smoothness and zero-centered properties.

Figure: Left: For two selected DyT layers from the ViT-B model, we track α and the inverse of the standard deviation (1/std) of activations at the end of each epoch, observing that they evolve together during training. Right: We plot the final α values of two trained models, ViT-B and ConvNeXt-B, against the 1/std of the input activations, demonstrating a strong correlation between the two values.

Figure: Performance of different tasks across different α₀ values. We benchmark the performance of all non-LLM tasks used in Section 5 with different initial values of α. Performance remains stable across a wide range of α₀ values. The only exception is that supervised ViT-L models (top right panel) diverge for α₀ values larger than 0.6.

Figure: Stability across varying α₀ values, learning rates, and model sizes. We train supervised ViT models on the ImageNet-1K dataset and observe that larger models are more prone to instability for both LN and DyT models. Lowering the learning rate or reducing α₀ enhances stability. LN shows similar stability to DyT with α₀ = 0.5.

Figure: Heatmaps of loss values at 30B tokens for different α₀ settings. Both LLaMA models benefit from increased α₀ in attention blocks.

$$ \mathrm{normalization}(\bm{x}) = \bm{\gamma} * \left( \frac{\bm{x} - \bm{\mu}}{\sqrt{\bm{\sigma}^2 + \epsilon}} \right) + \bm{\beta} $$
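A minimal numpy rendering of this equation as per-token layer normalization over the channel dimension (the function name is illustrative):

```python
# Reference numpy rendering of the normalization equation: per-token
# statistics over the channel dimension, followed by the affine transform.
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.random.randn(2, 3, 8)              # (samples, tokens, channels)
y = layer_norm(x, np.ones(8), np.zeros(8))
print(np.abs(y.mean(axis=-1)).max())      # ~0: each token is zero-mean
```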

Table 16: ImageNet-1K classification accuracy with BN and DyT. Replacing BN with DyT in ResNet-50 and VGG19 results in a performance drop, indicating that DyT cannot fully substitute BN in these architectures.

Table: Self-supervised learning accuracy on ImageNet-1K with MAE and DINO pretraining.

| model | LN | DyT | change |
|---|---|---|---|
| MAE ViT-B | 83.2% | 83.2% | - |
| MAE ViT-L | 85.5% | 85.4% | ↓ 0.1% |
| DINO ViT-B (patch size 16) | 83.2% | 83.4% | ↑ 0.2% |
| DINO ViT-B (patch size 8) | 84.1% | 84.5% | ↑ 0.4% |
Table: FID scores of DiT models on ImageNet (lower is better).

| model | LN | DyT | change |
|---|---|---|---|
| DiT-B | 64.9 | 63.9 | ↓ 1.0 |
| DiT-L | 45.9 | 45.7 | ↓ 0.2 |
| DiT-XL | 19.9 | 20.8 | ↑ 0.9 |
Table: wav2vec 2.0 validation loss on LibriSpeech.

| model | LN | DyT | change |
|---|---|---|---|
| wav2vec 2.0 Base | 1.95 | 1.95 | - |
| wav2vec 2.0 Large | 1.92 | 1.91 | ↓ 0.01 |
Table: DNA classification accuracy on GenomicBenchmarks.

| model | LN | DyT | change |
|---|---|---|---|
| HyenaDNA (Nguyen et al., 2024) | 85.2% | 85.2% | - |
| Caduceus (Schiff et al., 2024) | 86.9% | 86.9% | - |
Table: ImageNet-1K classification accuracy with and without the learnable α.

|  | tanh | hardtanh | sigmoid |
|---|---|---|---|
| without α | 81.1% | 80.7% | 80.7% |
| with α | 82.5% | 82.2% | 81.6% |
Table: Comparison with other normalization-free methods (classification accuracy on ImageNet-1K).

| model | LN | Fixup | SkipInit | σReparam | DyT |
|---|---|---|---|---|---|
| ViT-B | 82.3% | 77.2% | 74.1% | 82.5% | 82.8% |
| ViT-L | 83.1% | 78.1% | 75.6% | 83.0% | 83.6% |
| MAE ViT-B | 83.2% | 73.7% | 73.1% | 83.2% | 83.7% |
| MAE ViT-L | 85.5% | 74.1% | 74.0% | 85.4% | 85.8% |
| LLaMA 7B | inference (layer) | inference (model) | training (layer) | training (model) |
|---|---|---|---|---|
| RMSNorm | 0.3s | 12.3s | 3.9s | 38.9s |
| DyT | 0.3s | 12.3s | 3.9s | 38.9s |
| model | BN | DyT |
|---|---|---|
| ResNet-50 | 76.2% | 68.9% |
| VGG19 | 72.7% | 71.0% |

$$ \mathrm{DyT}(\bm{x}) = \bm{\gamma} * \tanh(\alpha \bm{x}) + \bm{\beta} $$

Algorithm: Pseudocode of DyT layer.

```
# input x has the shape of [B, T, C]
# B: batch size, T: tokens, C: dimension

class DyT(Module):
    def __init__(self, C, init_alpha):
        super().__init__()
        self.alpha = Parameter(ones(1) * init_alpha)
        self.gamma = Parameter(ones(C))
        self.beta = Parameter(zeros(C))

    def forward(self, x):
        x = tanh(self.alpha * x)
        return self.gamma * x + self.beta
```
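For reference, a runnable numpy version of the DyT pseudocode (a sketch; the actual layer uses framework modules with learnable parameters):

```python
# Runnable numpy sketch of the DyT layer. In practice alpha, gamma, and
# beta are learnable parameters; here they are plain arrays.
import numpy as np

class DyT:
    def __init__(self, C, init_alpha=0.5):
        self.alpha = np.ones(1) * init_alpha   # learnable scalar in practice
        self.gamma = np.ones(C)                # per-channel scale
        self.beta = np.zeros(C)                # per-channel shift

    def forward(self, x):
        x = np.tanh(self.alpha * x)
        return self.gamma * x + self.beta

layer = DyT(C=8)
x = np.random.randn(2, 4, 8)                   # [B, T, C]
y = layer.forward(x)
print(y.shape)                                 # (2, 4, 8)
```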
modelLNDyTchange
ViT-B82.3%82.5%↑ 0.2%
ViT-L83.1%83.6%↑ 0.5%
ConvNeXt-B83.7%83.7%-
ConvNeXt-L84.3%84.4%↑ 0.1%
modelLNDyTchange
MAE ViT-B83.2%83.2%-
MAE ViT-L85.5%85.4%↓ 0.1%
DINO ViT-B (patch size 16)83.2%83.4%↑ 0.2%
DINO ViT-B (patch size 8)84.1%84.5%↑ 0.4%
modelLNDyTchange
DiT-B64.963.9↓ 1.0
DiT-L45.945.7↓ 0.2
DiT-XL19.920.8↑ 0.9
score / lossRMSNormDyTchange
LLaMA 7B0.513 / 1.590.513 / 1.60- / ↑ 0.01
LLaMA 13B0.529 / 1.530.529 / 1.54- / ↑ 0.01
LLaMA 34B0.536 / 1.500.536 / 1.50- / -
LLaMA 70B0.549 / 1.450.549 / 1.45- / -
modelLNDyTchange
wav2vec 2.0 Base1.951.95-
wav2vec 2.0 Large1.921.91↓ 0.01
modelLNDyTchange
HyenaDNA (Nguyen et al., 2024)85.2%85.2%-
Caduceus (Schiff et al., 2024)86.9%86.9%-
modelidentitytanhhardtanhsigmoid
ViT-S58.5% → failed80.3%79.9%79.6%
ViT-B61.0% → failed82.5%82.2%81.6%
modeltanhhardtanhsigmoid
without α81.1%80.7%80.7%
with α82.5%82.2%81.6%
modelLNFixupSkipInitσ ReparamDyT
ViT-B82.3%77.2%74.1%82.5%82.8%
ViT-L83.1%78.1%75.6%83.0%83.6%
MAE ViT-B83.2%73.7%73.1%83.2%83.7%
MAE ViT-L85.5%74.1%74.0%85.4%85.8%
modelwidthdepthoptimal α 0 (attention/other)
LLaMA 7B4096320.8/0.2
LLaMA 13B5120400.6/0.15
LLaMA 34B8196480.2/0.05
LLaMA 70B8196800.2/0.05
width / depth8163264
10241.0/1.01.0/1.01.0/1.01.0/1.0
20481.0/0.51.0/0.51.0/0.51.0/0.5
40960.8/0.20.8/0.20.8/0.20.8/0.2
81920.2/0.050.2/0.050.2/0.050.2/0.05
modelLN (original)DyT (original)LN (tuned)DyT (tuned)
ViT-B82.3% (4e-3)82.5% (4e-3)-82.8% (6e-3)
ViT-L83.1% (4e-3)83.6% (4e-3)--
ConvNeXt-B83.7% (4e-3)83.7% (4e-3)--
ConvNeXt-L84.3% (4e-3)84.4% (4e-3)--
MAE ViT-B83.2% (2.4e-3)83.2% (2.4e-3)-83.7% (3.2e-3)
MAE ViT-L85.5% (2.4e-3)85.4% (2.4e-3)-85.8% (3.2e-3)
DINO ViT-B (patch size 16)83.2% (7.5e-4)83.4% (7.5e-4)83.3% (1e-3)-
DINO ViT-B (patch size 8)84.1% (5e-4)84.5% (5e-4)--
DiT-B64.9 (4e-4)63.9 (4e-4)--
DiT-L45.9 (4e-4)45.7 (4e-4)--
DiT-XL19.9 (4e-4)20.8 (4e-4)--
wav2vec 2.0 Base1.95 (5e-4)1.95 (5e-4)-1.94 (6e-4)
wav2vec 2.0 Large1.92 (3e-4)1.91 (3e-4)--
HyenaDNA85.2% (6e-4)85.2% (6e-4)--
Caduceus86.9% (8e-3)86.9% (8e-3)--
ModelLNDyT ( α 0 = 0 . 5 )DyT (tuned)
ViT-B82.3%82.5%82.6% ( α 0 = 1 . 0 )
ViT-L83.1%83.6%-
ConvNeXt-B83.7%83.7%-
ConvNeXt-L84.3%84.4%-
MAE ViT-B83.2%83.2%83.4% ( α 0 = 1 . 0 )
MAE ViT-L85.5%85.4%-
DINO ViT-B (patch 16)83.2%83.4%-
DINO ViT-B (patch 8)84.1%84.5%-
DiT-B64.963.9-
DiT-L45.945.7-
DiT-XL19.920.8-
wav2vec 2.0 Base1.951.95-
wav2vec 2.0 Large1.921.911.90 ( α 0 = 1 . 0 )
HyenaDNA85.2%85.2%-
Caduceus86.9%86.9%-
inferenceinferencetrainingtraining
LLaMA 7Blayermodellayermodel
RMSNorm2.1s14.1s8.3s42.6s
DyT1.0s13.0s4.8s39.1s
reduction↓ 52.4%↓ 7.8%↓ 42.2%↓ 8.2%
| LLaMA 7B | inference (layer) | inference (model) | training (layer) | training (model) |
| --- | --- | --- | --- | --- |
| RMSNorm | 0.3s | 12.3s | 3.9s | 38.9s |
| DyT | 0.3s | 12.3s | 3.9s | 38.9s |
| model | BN | DyT |
| --- | --- | --- |
| ResNet-50 | 76.2% | 68.9% |
| VGG19 | 72.7% | 71.0% |



References

[ioffe2015batch] Ioffe, Sergey, Szegedy, Christian. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML.

[krizhevsky2012imagenet] Krizhevsky, Alex, Sutskever, Ilya, Hinton, Geoffrey E. (2012). Imagenet classification with deep convolutional neural networks. NeurIPS.

[szegedy2015going] Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, Rabinovich, Andrew. (2015). Going deeper with convolutions. CVPR.

[he2016deep] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Deep residual learning for image recognition. CVPR.

[he2016identity] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Identity mappings in deep residual networks. ECCV.

[xie2017aggregated] Xie, Saining, Girshick, Ross, Dollár, Piotr, Tu, Zhuowen, He, Kaiming. (2017). Aggregated residual transformations for deep neural networks. CVPR.

[huang2017densely] Huang, Gao, Liu, Zhuang, Van Der Maaten, Laurens, Weinberger, Kilian Q. (2017). Densely connected convolutional networks. CVPR.

[tan2019efficientnet] Tan, Mingxing, Le, Quoc. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. ICML.

[tan2021efficientnetv2] Tan, Mingxing, Le, Quoc. (2021). Efficientnetv2: Smaller models and faster training. ICML.

[dosovitskiy2020image] Dosovitskiy, Alexey, Beyer, Lucas, Kolesnikov, Alexander, Weissenborn, Dirk, Zhai, Xiaohua, Unterthiner, Thomas, Dehghani, Mostafa, Minderer, Matthias, Heigold, Georg, Gelly, Sylvain, others. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

[liu2021swin] Liu, Ze, Lin, Yutong, Cao, Yue, Hu, Han, Wei, Yixuan, Zhang, Zheng, Lin, Stephen, Guo, Baining. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. ICCV.

[liu2022swin] Liu, Ze, Hu, Han, Lin, Yutong, Yao, Zhuliang, Xie, Zhenda, Wei, Yixuan, Ning, Jia, Cao, Yue, Zhang, Zheng, Dong, Li, others. (2022). Swin transformer v2: Scaling up capacity and resolution. CVPR.

[tolstikhin2021mlp] Tolstikhin, Ilya O, Houlsby, Neil, Kolesnikov, Alexander, Beyer, Lucas, Zhai, Xiaohua, Unterthiner, Thomas, Yung, Jessica, Steiner, Andreas, Keysers, Daniel, Uszkoreit, Jakob, others. (2021). Mlp-mixer: An all-mlp architecture for vision. NeurIPS.

[liu2022convnet] Liu, Zhuang, Mao, Hanzi, Wu, Chao-Yuan, Feichtenhofer, Christoph, Darrell, Trevor, Xie, Saining. (2022). A convnet for the 2020s. CVPR.

[woo2023convnext] Woo, Sanghyun, Debnath, Shoubhik, Hu, Ronghang, Chen, Xinlei, Liu, Zhuang, Kweon, In So, Xie, Saining. (2023). Convnext v2: Co-designing and scaling convnets with masked autoencoders. CVPR.

[vaswani2017attention] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz, Polosukhin, Illia. (2017). Attention is all you need. NeurIPS.

[wu2018group] Wu, Yuxin, He, Kaiming. (2018). Group normalization. ECCV.

[simonyan2014very] Simonyan, Karen, Zisserman, Andrew. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

[hochreiter1997long] Hochreiter, Sepp, Schmidhuber, Jürgen. (1997). Long short-term memory. Neural Computation.

[cho2014properties] Cho, Kyunghyun, Van Merriënboer, Bart, Bahdanau, Dzmitry, Bengio, Yoshua. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

[zhang2019root] Zhang, Biao, Sennrich, Rico. (2019). Root mean square layer normalization. NeurIPS.

[carandini2012normalization] Carandini, Matteo, Heeger, David J. (2012). Normalization as a canonical neural computation. Nature Reviews Neuroscience.

[lecun1998efficient] LeCun, Yann, Bottou, Léon, Orr, Genevieve B, Müller, Klaus-Robert. (1998). Efficient BackProp. Neural Networks: Tricks of the Trade.

[jarrett2009best] Jarrett, Kevin, Kavukcuoglu, Koray, Ranzato, Marc'Aurelio, LeCun, Yann. (2009). What is the best multi-stage architecture for object recognition?. ICCV.

[xiong2020layer] Xiong, Ruibin, Yang, Yunchang, He, Di, Zheng, Kai, Zheng, Shuxin, Xing, Chen, Zhang, Huishuai, Lan, Yanyan, Wang, Liwei, Liu, Tieyan. (2020). On layer normalization in the transformer architecture. ICML.

[huang2020improving] Huang, Xiao Shi, Perez, Felipe, Ba, Jimmy, Volkovs, Maksims. (2020). Improving transformer optimization through better initialization. ICML.

[santurkar2018does] Santurkar, Shibani, Tsipras, Dimitris, Ilyas, Andrew, Madry, Aleksander. (2018). How does batch normalization help optimization?. NeurIPS.

[bjorck2018understanding] Bjorck, Nils, Gomes, Carla P, Selman, Bart, Weinberger, Kilian Q. (2018). Understanding batch normalization. NeurIPS.

[brock2021characterizing] Brock, Andrew, De, Soham, Smith, Samuel L. (2021). Characterizing signal propagation to close the performance gap in unnormalized resnets. arXiv preprint arXiv:2101.08692.

[daneshmand2020batch] Daneshmand, Hadi, Kohler, Jonas, Bach, Francis, Hofmann, Thomas, Lucchi, Aurelien. (2020). Batch normalization provably avoids ranks collapse for randomly initialised deep networks. NeurIPS.

[balduzzi2017shattered] Balduzzi, David, Frean, Marcus, Leary, Lennox, Lewis, JP, Ma, Kurt Wan-Duo, McWilliams, Brian. (2017). The shattered gradients problem: If resnets are the answer, then what is the question?. ICML.

[karakida2019normalization] Karakida, Ryo, Akaho, Shotaro, Amari, Shun-ichi. (2019). The normalization method for alleviating pathological sharpness in wide neural networks. NeurIPS.

[yong2020gradient] Yong, Hongwei, Huang, Jianqiang, Hua, Xiansheng, Zhang, Lei. (2020). Gradient centralization: A new optimization technique for deep neural networks. ECCV.

[brock2021high] Brock, Andrew, De, Soham, Smith, Samuel L, Simonyan, Karen. (2021). High-performance large-scale image recognition without normalization. ICML.

[he2023simplifying] He, Bobby, Hofmann, Thomas. (2023). Simplifying transformer blocks. arXiv preprint arXiv:2311.01906.

[zhang2019fixup] Zhang, Hongyi, Dauphin, Yann N, Ma, Tengyu. (2019). Fixup initialization: Residual learning without normalization. arXiv preprint arXiv:1901.09321.

[de2020batch] De, Soham, Smith, Sam. (2020). Batch normalization biases residual blocks towards the identity function in deep networks. NeurIPS.

[bachlechner2021rezero] Bachlechner, Thomas, Majumder, Bodhisattwa Prasad, Mao, Henry, Cottrell, Gary, McAuley, Julian. (2021). Rezero is all you need: Fast convergence at large depth. UAI.

[ulyanov2016instance] Ulyanov, Dmitry, Vedaldi, Andrea, Lempitsky, Victor. (2016). Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.

[ba2016layer] Ba, Jimmy Lei, Kiros, Jamie Ryan, Hinton, Geoffrey E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.

[lyu2008nonlinear] Lyu, Siwei, Simoncelli, Eero P. (2008). Nonlinear image representation using divisive normalization. CVPR.

[zeiler2014visualizing] Zeiler, Matthew D, Fergus, Rob. (2014). Visualizing and understanding convolutional networks. ECCV.

[sermanet2013overfeat] Sermanet, Pierre, Eigen, David, Zhang, Xiang, Mathieu, Michaël, Fergus, Rob, LeCun, Yann. (2013). Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.

[chowdhery2023palm] Chowdhery, Aakanksha, Narang, Sharan, Devlin, Jacob, Bosma, Maarten, Mishra, Gaurav, Roberts, Adam, Barham, Paul, Chung, Hyung Won, Sutton, Charles, Gehrmann, Sebastian, others. (2023). Palm: Scaling language modeling with pathways. JMLR.

[touvron2023llama] Touvron, Hugo, Lavril, Thibaut, Izacard, Gautier, Martinet, Xavier, Lachaux, Marie-Anne, Lacroix, Timothée, others. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

[touvron2023llama2] Touvron, Hugo, Martin, Louis, Stone, Kevin, Albert, Peter, Almahairi, Amjad, Babaei, Yasmine, Bashlykov, Nikolay, Batra, Soumya, Bhargava, Prajjwal, Bhosale, Shruti, others. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[shao2020normalization] Shao, Jie, Hu, Kai, Wang, Changhu, Xue, Xiangyang, Raj, Bhiksha. (2020). Is normalization indispensable for training deep neural network?. NeurIPS.

[arora2018theoretical] Arora, Sanjeev, Li, Zhiyuan, Lyu, Kaifeng. (2018). Theoretical analysis of auto rate-tuning by batch normalization. arXiv preprint arXiv:1812.03981.

[tanaka2021noether] Tanaka, Hidenori, Kunin, Daniel. (2021). Noether’s learning dynamics: Role of symmetry breaking in neural networks. NeurIPS.

[nguyen2019transformers] Nguyen, Toan Q, Salazar, Julian. (2019). Transformers without tears: Improving the normalization of self-attention. arXiv preprint arXiv:1910.05895.

[xu2019understanding] Xu, Jingjing, Sun, Xu, Zhang, Zhiyuan, Zhao, Guangxiang, Lin, Junyang. (2019). Understanding and improving layer normalization. NeurIPS.

[huang2017centered] Huang, Lei, Liu, Xianglong, Liu, Yang, Lang, Bo, Tao, Dacheng. (2017). Centered weight normalization in accelerating training of deep neural networks. ICCV.

[qiao2019micro] Qiao, Siyuan, Wang, Huiyu, Liu, Chenxi, Shen, Wei, Yuille, Alan. (2019). Micro-batch training with batch-channel normalization and weight standardization. arXiv preprint arXiv:1903.10520.

[srivastava2014dropout] Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, Salakhutdinov, Ruslan. (2014). Dropout: a simple way to prevent neural networks from overfitting. JMLR.

[huang2016deep] Huang, Gao, Sun, Yu, Liu, Zhuang, Sedra, Daniel, Weinberger, Kilian Q. (2016). Deep networks with stochastic depth. ECCV.

[cubuk2020randaugment] Cubuk, Ekin D, Zoph, Barret, Shlens, Jonathon, Le, Quoc V. (2020). Randaugment: Practical automated data augmentation with a reduced search space. CVPR Workshops.

[he2023deep] He, Bobby, Martens, James, Zhang, Guodong, Botev, Aleksandar, Brock, Andrew, Smith, Samuel L, Teh, Yee Whye. (2023). Deep transformers without shortcuts: Modifying self-attention for faithful signal propagation. arXiv preprint arXiv:2302.10322.

[gulati2020conformer] Gulati, Anmol, Qin, James, Chiu, Chung-Cheng, Parmar, Niki, Zhang, Yu, Yu, Jiahui, Han, Wei, Wang, Shibo, Zhang, Zhengdong, Wu, Yonghui, others. (2020). Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100.

[smith2023convnets] Smith, Samuel L, Brock, Andrew, Berrada, Leonard, De, Soham. (2023). Convnets match vision transformers at scale. arXiv preprint arXiv:2310.16764.

[silver2017mastering] Silver, David, Schrittwieser, Julian, Simonyan, Karen, Antonoglou, Ioannis, Huang, Aja, Guez, Arthur, Hubert, Thomas, Baker, Lucas, Lai, Matthew, Bolton, Adrian, others. (2017). Mastering the game of go without human knowledge. Nature.

[luo2018towards] Luo, Ping, Wang, Xinjiang, Shao, Wenqi, Peng, Zhanglin. (2018). Towards understanding regularization in batch normalization. arXiv preprint arXiv:1809.00846.

[liu2021rethinking] Liu, Fenglin, Ren, Xuancheng, Zhang, Zhiyuan, Sun, Xu, Zou, Yuexian. (2021). Rethinking skip connection with layer normalization in transformers and resnets. arXiv preprint arXiv:2105.07205.

[takase2022b2t] Takase, Sho, Kiyono, Shun, Kobayashi, Sosuke, Suzuki, Jun. (2022). B2t connection: Serving stability and performance in deep transformers. arXiv preprint arXiv:2206.00330.

[brody2023expressivity] Brody, Shaked, Alon, Uri, Yahav, Eran. (2023). On the Expressivity Role of LayerNorm in Transformers' Attention. arXiv preprint arXiv:2305.02582.

[klambauer2017self] Klambauer, Günter, Unterthiner, Thomas, Mayr, Andreas, Hochreiter, Sepp. (2017). Self-normalizing neural networks. NeurIPS.

[wang2019learning] Wang, Qiang, Li, Bei, Xiao, Tong, Zhu, Jingbo, Li, Changliang, Wong, Derek F, Chao, Lidia S. (2019). Learning deep transformer models for machine translation. arXiv preprint arXiv:1906.01787.

[beck2024xlstm] Beck, Maximilian, Pöppel, Korbinian, others. (2024). xLSTM: Extended Long Short-Term Memory. arXiv preprint arXiv:2405.04517.

[gu2023mamba] Gu, Albert, Dao, Tri. (2023). Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.

[peng2023rwkv] Peng, Bo, Alcaide, Eric, Anthony, Quentin, Albalak, Alon, Arcadinho, Samuel, Cao, Huanqi, Cheng, Xin, Chung, Michael, Grella, Matteo, GV, Kranthi Kiran, others. (2023). Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048.

[cho2014learning] Cho, Kyunghyun, Van Merriënboer, Bart, Gulcehre, Caglar, Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, Bengio, Yoshua. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

[bahdanau2014neural] Bahdanau, Dzmitry, Cho, Kyunghyun, Bengio, Yoshua. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[nair2010rectified] Nair, Vinod, Hinton, Geoffrey E. (2010). Rectified linear units improve restricted boltzmann machines. ICML.

[hendrycks2016gaussian] Hendrycks, Dan, Gimpel, Kevin. (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.

[deng2009imagenet] Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, Fei-Fei, Li. (2009). Imagenet: A large-scale hierarchical image database. CVPR.

[peebles2023scalable] Peebles, William, Xie, Saining. (2023). Scalable diffusion models with transformers. ICCV.

[he2022masked] He, Kaiming, Chen, Xinlei, Xie, Saining, Li, Yanghao, Dollár, Piotr, Girshick, Ross. (2022). Masked autoencoders are scalable vision learners. CVPR.

[caron2021emerging] Caron, Mathilde, Touvron, Hugo, Misra, Ishan, Jégou, Hervé, Mairal, Julien, Bojanowski, Piotr, Joulin, Armand. (2021). Emerging properties in self-supervised vision transformers. ICCV.

[baevski2020wav2vec] Baevski, Alexei, Zhou, Yuhao, Mohamed, Abdelrahman, Auli, Michael. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. NeurIPS.

[panayotov2015librispeech] Panayotov, Vassil, Chen, Guoguo, Povey, Daniel, Khudanpur, Sanjeev. (2015). Librispeech: an asr corpus based on public domain audio books. ICASSP.

[nguyen2024hyenadna] Nguyen, Eric, Poli, Michael, Faizi, Marjan, Thomas, Armin, Wornow, Michael, Birch-Sykes, Callum, Massaroli, Stefano, Patel, Aman, Rabideau, Clayton, Bengio, Yoshua, others. (2024). Hyenadna: Long-range genomic sequence modeling at single nucleotide resolution. NeurIPS.

[grch382013p13] GRCh38, Ensembl. (2013). p13 (Genome Reference Consortium Human Build 38), INSDC Assembly.

[grevsova2023genomic] Grešová, Katarína, Martinek, Vlastimil, Čechák, David, Šimeček, Petr, Alexiou, Panagiotis. (2023). Genomic benchmarks: a collection of datasets for genomic sequence classification. BMC Genomic Data.

[nicolae2018plu] Nicolae, Andrei. (2018). PLU: The piecewise linear unit activation function. arXiv preprint arXiv:1809.09534.

[pile] Gao, Leo, Biderman, Stella, Black, Sid, Golding, Laurence, Hoppe, Travis, Foster, Charles, Phang, Jason, He, Horace, Thite, Anish, Nabeshima, Noa, Presser, Shawn, Leahy, Connor. (2020). The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

[brown2020language] Brown, Tom, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared D, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, others. (2020). Language models are few-shot learners. NeurIPS.

[eval-harness] Gao, Leo, Tow, Jonathan, Abbasi, Baber, Biderman, Stella, Black, Sid, DiPofi, Anthony, Foster, Charles, Golding, Laurence, Hsu, Jeffrey, Le Noac'h, Alain, Li, Haonan, McDonell, Kyle, Muennighoff, Niklas, Ociepa, Chris, Phang, Jason, Reynolds, Laria, Schoelkopf, Hailey, Skowron, Aviya, Sutawika, Lintang, Tang, Eric, Thite, Anish, Wang, Ben, Wang, Kevin, Zou, Andy. A framework for few-shot language model evaluation. doi:10.5281/zenodo.10256836.

[touvron2021training] Touvron, Hugo, Cord, Matthieu, Douze, Matthijs, Massa, Francisco, Sablayrolles, Alexandre, Jégou, Hervé. (2021). Training data-efficient image transformers & distillation through attention. ICML.

[adrian1926impulses] Adrian, Edgar D. (1926). The impulses produced by sensory nerve endings: Part 1. The Journal of Physiology.

[adrian1926impulses2] Adrian, Edgar D, Zotterman, Yngve. (1926). The impulses produced by sensory nerve-endings: Part 2. The response of a Single End-Organ. The Journal of Physiology.

[adrian1926impulses3] Adrian, Edgar D, Zotterman, Yngve. (1926). The impulses produced by sensory nerve endings: Part 3. Impulses set up by Touch and Pressure. The Journal of Physiology.

[sun2024massive] Sun, Mingjie, Chen, Xinlei, Kolter, J Zico, Liu, Zhuang. (2024). Massive Activations in Large Language Models. arXiv preprint arXiv:2402.17762.

[raffel2020exploring] Raffel, Colin, Shazeer, Noam, Roberts, Adam, Lee, Katherine, Narang, Sharan, Matena, Michael, Zhou, Yanqi, Li, Wei, Liu, Peter J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR.

[he2015delving] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. ICCV.

[glorot2010understanding] Glorot, Xavier, Bengio, Yoshua. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS.

[heeger1996computational] Heeger, David J, Simoncelli, Eero P, Movshon, J Anthony. (1996). Computational models of cortical visual processing. PNAS.

[heeger1992normalization] Heeger, David J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience.

[ni2024nonlinearity] Ni, Yunhao, Guo, Yuxin, Jia, Junlong, Huang, Lei. (2024). On the Nonlinearity of Layer Normalization. arXiv preprint arXiv:2406.01255.

[lecun2022path] LeCun, Yann. (2022). A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review.

[openlm2023openllama] Geng, Xinyang, Liu, Hao. OpenLLaMA: An Open Reproduction of LLaMA.

[dubey2024llama] Dubey, Abhimanyu, Jauhri, Abhinav, Pandey, Abhinav, Kadian, Abhishek, Al-Dahle, Ahmad, Letman, Aiesha, Mathur, Akhil, Schelten, Alan, Yang, Amy, Fan, Angela, others. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[jiang2023mistral] Jiang, Albert Q, Sablayrolles, Alexandre, Mensch, Arthur, Bamford, Chris, Chaplot, Devendra Singh, Casas, Diego de las, Bressand, Florian, Lengyel, Gianna, Lample, Guillaume, Saulnier, Lucile, others. (2023). Mistral 7B. arXiv preprint arXiv:2310.06825.

[bai2023qwen] Bai, Jinze, Bai, Shuai, Chu, Yunfei, Cui, Zeyu, Dang, Kai, Deng, Xiaodong, Fan, Yang, Ge, Wenbin, Han, Yu, Huang, Fei, others. (2023). Qwen technical report. arXiv preprint arXiv:2309.16609.

[yang2024qwen2] Yang, An, Yang, Baosong, Hui, Binyuan, Zheng, Bo, Yu, Bowen, Zhou, Chang, Li, Chengpeng, Li, Chengyuan, Liu, Dayiheng, Huang, Fei, others. (2024). Qwen2 technical report. arXiv preprint arXiv:2407.10671.

[zhang2024internlm] Zhang, Pan, Dong, Xiaoyi, Zang, Yuhang, Cao, Yuhang, Qian, Rui, Chen, Lin, Guo, Qipeng, Duan, Haodong, Wang, Bin, Ouyang, Linke, others. (2024). Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output. arXiv preprint arXiv:2407.03320.

[cai2024internlm2] Cai, Zheng, Cao, Maosong, Chen, Haojiong, Chen, Kai, Chen, Keyu, Chen, Xin, Chen, Xun, Chen, Zehui, Chen, Zhi, Chu, Pei, others. (2024). Internlm2 technical report. arXiv preprint arXiv:2403.17297.

[wang2022understanding] Wang, Jiaxi, Wu, Ji, Huang, Lei. (2022). Understanding the failure of batch normalization for transformers in nlp. NeurIPS.

[sun2024learning] Sun, Yu, Li, Xinhao, Dalal, Karan, Xu, Jiarui, Vikram, Arjun, Zhang, Genghan, Dubois, Yann, Chen, Xinlei, Wang, Xiaolong, Koyejo, Sanmi, others. (2024). Learning to (learn at test time): Rnns with expressive hidden states. arXiv preprint arXiv:2407.04620.

[feng2024were] Feng, Leo, Tung, Frederick, Ahmed, Mohamed Osama, Bengio, Yoshua, Hajimirsadegh, Hossein. (2024). Were RNNs All We Needed?. arXiv preprint arXiv:2410.01201.

[wolf2019huggingface] Wolf, T. (2019). Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

[yao2021leveraging] Yao, Zhuliang, Cao, Yue, Lin, Yutong, Liu, Ze, Zhang, Zheng, Hu, Han. (2021). Leveraging batch normalization for vision transformers. ICCV.

[loshchilov2024ngpt] Loshchilov, Ilya, Hsieh, Cheng-Ping, Sun, Simeng, Ginsburg, Boris. (2024). ngpt: Normalized transformer with representation learning on the hypersphere. arXiv preprint arXiv:2410.01131.

[lubana2021beyond] Lubana, Ekdeep S, Dick, Robert, Tanaka, Hidenori. (2021). Beyond batchnorm: Towards a unified understanding of normalization in deep learning. NeurIPS.

[lyu2022understanding] Lyu, Kaifeng, Li, Zhiyuan, Arora, Sanjeev. (2022). Understanding the generalization benefit of normalization layers: Sharpness reduction. NeurIPS.

[torchvision2016] TorchVision maintainers, contributors. TorchVision: PyTorch's Computer Vision library. GitHub repository.

[FMS-FSDP] Foundation Model Stack. GitHub: FMS-FSDP. https://github.com/foundation-model-stack/fms-fsdp.

[convnext] Meta Research. GitHub: ConvNeXt.

[MAE] Meta Research. GitHub: MAE.

[DINO] Meta Research. GitHub: DINO.

[DiT] Meta Research. GitHub: DiT.

[wav2vec2] Meta Research. GitHub: wav2vec 2.0.

[caduceus] Kuleshov Group. GitHub: Caduceus.

[hyena] HazyResearch. GitHub: HyenaDNA. https://github.com/HazyResearch/hyena-dna.

[huang2023normalization] Huang, Lei, Qin, Jie, Zhou, Yi, Zhu, Fan, Liu, Li, Shao, Ling. (2023). Normalization techniques in training dnns: Methodology, analysis and application. TPAMI.

[bi2024deepseek] Bi, Xiao, Chen, Deli, Chen, Guanting, Chen, Shanhuang, Dai, Damai, Deng, Chengqi, Ding, Honghui, Dong, Kai, Du, Qiushi, Fu, Zhe, others. (2024). Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954.

[liu2024deepseek] Liu, Aixin, Feng, Bei, Wang, Bin, Wang, Bingxuan, Liu, Bo, Zhao, Chenggang, Dengr, Chengqi, Ruan, Chong, Dai, Damai, Guo, Daya, others. (2024). Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434.

[deepseekai2024deepseekv3technicalreport] DeepSeek-AI, Liu, Aixin, Feng, Bei, Xue, Bing, Wang, Bingxuan, Wu, Bochao, others. (2024). DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437.

[ramesh2021zero] Ramesh, Aditya, Pavlov, Mikhail, Goh, Gabriel, Gray, Scott, Voss, Chelsea, Radford, Alec, Chen, Mark, Sutskever, Ilya. (2021). Zero-shot text-to-image generation. ICML.

[heimersheim2024you] Stefan Heimersheim. (2024). You can remove GPT2's LayerNorm by fine-tuning. arXiv preprint arXiv:2409.13710.

[li2024mix] Li, Pengxiang, Yin, Lu, Liu, Shiwei. (2024). Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN. arXiv preprint arXiv:2412.13795.

[schiff2024caduceus] Schiff, Yair, Kao, Chia-Hsiang, Gokaslan, Aaron, Dao, Tri, Gu, Albert, Kuleshov, Volodymyr. (2024). Caduceus: Bi-directional equivariant long-range dna sequence modeling. arXiv preprint arXiv:2403.03234.

[zhai2023stabilizing] Zhai, Shuangfei, Likhomanenko, Tatiana, Littwin, Etai, Busbridge, Dan, Ramapuram, Jason, Zhang, Yizhe, Gu, Jiatao, Susskind, Joshua M. (2023). Stabilizing transformer training by preventing attention entropy collapse. ICML.

[salimans2016weight] Salimans, Tim, Kingma, Durk P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. NeurIPS.

[mueller2024normalization] Mueller, Maximilian, Vlaar, Tiffany, Rolnick, David, Hein, Matthias. (2024). Normalization layers are all that sharpness-aware minimization needs. NeurIPS.

[dai2024crucial] Dai, Yan, Ahn, Kwangjun, Sra, Suvrit. (2024). The crucial role of normalization in sharpness-aware minimization. NeurIPS.

[sun2025cursedepthlargelanguage] Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, Shiwei Liu. (2025). The Curse of Depth in Large Language Models. arXiv preprint arXiv:2502.05795.

[guo2025deepseek] Guo, Daya, Yang, Dejian, Zhang, Haowei, Song, Junxiao, Zhang, Ruoyu, Xu, Runxin, Zhu, Qihao, Ma, Shirong, Wang, Peiyi, Bi, Xiao, others. (2025). Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

[szegedy2016rethinking] Szegedy, Christian, Vanhoucke, Vincent, Ioffe, Sergey, Shlens, Jon, Wojna, Zbigniew. (2016). Rethinking the inception architecture for computer vision. CVPR.

[paszke2019pytorch] Paszke, Adam, Gross, Sam, Massa, Francisco, Lerer, Adam, Bradbury, James, Chanan, Gregory, Killeen, Trevor, Lin, Zeming, Gimelshein, Natalia, Antiga, Luca, others. (2019). Pytorch: An imperative style, high-performance deep learning library. NeurIPS.

[chandra2021novel] Chandra, Mahesh. (2021). A novel method for scalable VLSI implementation of hyperbolic tangent function. IEEE Design & Test.

[hashemi2024can] Hashemi, Baran, Corominas, Roderic G, Giacchetto, Alessandro. (2024). Can Transformers Do Enumerative Geometry?. arXiv preprint arXiv:2408.14915.

[huggingface] Hugging Face. Hugging Face: LLaMA 2.

[jha2024aero] Jha, Nandan Kumar, Reagen, Brandon. (2024). AERO: Softmax-Only LLMs for Efficient Private Inference. arXiv preprint arXiv:2410.13060.

[kernelFusion2024] Lambda Labs. (2024). Deep Dive into Kernel Fusion: Accelerating Inference in Llama V2.

[bib1] Edgar D Adrian. The impulses produced by sensory nerve endings: Part 1. The Journal of Physiology, 1926.

[bib2] Edgar D Adrian and Yngve Zotterman. The impulses produced by sensory nerve-endings: Part 2. the response of a single end-organ. The Journal of Physiology, 1926a.

[bib3] Edgar D Adrian and Yngve Zotterman. The impulses produced by sensory nerve endings: Part 3. impulses set up by touch and pressure. The Journal of Physiology, 1926b.

[bib4] Arora et al. (2018) Sanjeev Arora, Zhiyuan Li, and Kaifeng Lyu. Theoretical analysis of auto rate-tuning by batch normalization. arXiv preprint arXiv:1812.03981, 2018.

[bib5] Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[bib6] Bachlechner et al. (2021) Thomas Bachlechner, Bodhisattwa Prasad Majumder, Henry Mao, Gary Cottrell, and Julian McAuley. Rezero is all you need: Fast convergence at large depth. In UAI, 2021.

[bib7] Baevski et al. (2020) Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. NeurIPS, 2020.

[bib8] Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

[bib9] Balduzzi et al. (2017) David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? In ICML, 2017.

[bib10] Bjorck et al. (2018) Nils Bjorck, Carla P Gomes, Bart Selman, and Kilian Q Weinberger. Understanding batch normalization. NeurIPS, 2018.

[bib11] Brock et al. (2021a) Andrew Brock, Soham De, and Samuel L Smith. Characterizing signal propagation to close the performance gap in unnormalized resnets. arXiv preprint arXiv:2101.08692, 2021a.

[bib12] Brock et al. (2021b) Andrew Brock, Soham De, Samuel L Smith, and Karen Simonyan. High-performance large-scale image recognition without normalization. In ICML, 2021b.

[bib13] Cai et al. (2024) Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024.

[bib14] Caron et al. (2021) Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.

[bib15] Cubuk et al. (2020) Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In CVPR Workshops, 2020.

[bib16] Dai et al. (2024) Yan Dai, Kwangjun Ahn, and Suvrit Sra. The crucial role of normalization in sharpness-aware minimization. NeurIPS, 2024.

[bib17] Daneshmand et al. (2020) Hadi Daneshmand, Jonas Kohler, Francis Bach, Thomas Hofmann, and Aurelien Lucchi. Batch normalization provably avoids ranks collapse for randomly initialised deep networks. NeurIPS, 2020.

[bib18] Soham De and Sam Smith. Batch normalization biases residual blocks towards the identity function in deep networks. NeurIPS, 2020.

[bib19] Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

[bib20] Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[bib21] Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

[bib22] Feng et al. (2024) Leo Feng, Frederick Tung, Mohamed Osama Ahmed, Yoshua Bengio, and Hossein Hajimirsadegh. Were rnns all we needed? arXiv preprint arXiv:2410.01201, 2024.

[bib23] Foundation Model Stack. Github: FMS FSDP. {https://github.com/foundation-model-stack/fms-fsdp}. Accessed: 2025-01-23.

[bib24] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation. https://zenodo.org/records/10256836.

[bib25] Gao et al. (2020) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.

[bib26] Geng and Liu (2023) Xinyang Geng and Hao Liu. Openllama: An open reproduction of llama. https://github.com/openlm-research/open_llama, 2023.

[bib27] Ensembl. GRCh38.p13 (Genome Reference Consortium Human Build 38), INSDC assembly, 2013.

[bib28] Grešová et al. (2023) Katarína Grešová, Vlastimil Martinek, David Čechák, Petr Šimeček, and Panagiotis Alexiou. Genomic benchmarks: a collection of datasets for genomic sequence classification. BMC Genomic Data, 2023.

[bib29] Gu and Dao (2023) Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

[bib30] Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[bib31] HazyResearch. Github: Hyenadna. https://github.com/HazyResearch/hyena-dna.git. Accessed: 2025-01-23.

[bib32] He and Hofmann (2023) Bobby He and Thomas Hofmann. Simplifying transformer blocks. arXiv preprint arXiv:2311.01906, 2023.

[bib33] He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[bib34] He et al. (2022) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.

[bib35] Heimersheim (2024) Stefan Heimersheim. You can remove gpt2’s layernorm by fine-tuning. arXiv preprint arXiv:2409.13710, 2024.

[bib36] Huang et al. (2016) Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, 2016.

[bib37] Huang et al. (2017) Lei Huang, Xianglong Liu, Yang Liu, Bo Lang, and Dacheng Tao. Centered weight normalization in accelerating training of deep neural networks. In ICCV, 2017.

[bib38] Huang et al. (2023) Lei Huang, Jie Qin, Yi Zhou, Fan Zhu, Li Liu, and Ling Shao. Normalization techniques in training dnns: Methodology, analysis and application. TPAMI, 2023.

[bib39] Huang et al. (2020) Xiao Shi Huang, Felipe Perez, Jimmy Ba, and Maksims Volkovs. Improving transformer optimization through better initialization. In ICML, 2020.

[bib40] Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[bib41] Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.

[bib42] Karakida et al. (2019) Ryo Karakida, Shotaro Akaho, and Shun-ichi Amari. The normalization method for alleviating pathological sharpness in wide neural networks. NeurIPS, 2019.

[bib43] Kuleshov Group. Github: Caduceus. https://github.com/kuleshov-group/caduceus.git. Accessed: 2025-01-23.

[bib44] LeCun (2022) Yann LeCun. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 2022.

[bib45] Li et al. (2024) Pengxiang Li, Lu Yin, and Shiwei Liu. Mix-ln: Unleashing the power of deeper layers by combining pre-ln and post-ln. arXiv preprint arXiv:2412.13795, 2024.

[bib46] Liu et al. (2024) Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024.

[bib47] Liu et al. (2022) Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In CVPR, 2022.

[bib48] Loshchilov et al. (2024) Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, and Boris Ginsburg. ngpt: Normalized transformer with representation learning on the hypersphere. arXiv preprint arXiv:2410.01131, 2024.

[bib49] Lubana et al. (2021) Ekdeep S Lubana, Robert Dick, and Hidenori Tanaka. Beyond batchnorm: Towards a unified understanding of normalization in deep learning. NeurIPS, 2021.

[bib50] Lyu et al. (2022) Kaifeng Lyu, Zhiyuan Li, and Sanjeev Arora. Understanding the generalization benefit of normalization layers: Sharpness reduction. NeurIPS, 2022.

[bib51] Meta Research. Github: ConvNeXt. https://github.com/facebookresearch/ConvNeXt. Accessed: 2025-01-23.

[bib52] Meta Research. Github: DINO. https://github.com/facebookresearch/dino. Accessed: 2025-01-23.

[bib53] Meta Research. Github: DiT. https://github.com/facebookresearch/DiT. Accessed: 2025-01-23.

[bib54] Meta Research. Github: MAE. https://github.com/facebookresearch/mae. Accessed: 2025-01-23.

[bib55] Meta Research. Github: wav2vec 2.0. https://github.com/facebookresearch/fairseq. Accessed: 2025-01-23.

[bib56] Mueller et al. (2024) Maximilian Mueller, Tiffany Vlaar, David Rolnick, and Matthias Hein. Normalization layers are all that sharpness-aware minimization needs. NeurIPS, 2024.

[bib57] Nguyen et al. (2024) Eric Nguyen, Michael Poli, Marjan Faizi, Armin Thomas, Michael Wornow, Callum Birch-Sykes, Stefano Massaroli, Aman Patel, Clayton Rabideau, Yoshua Bengio, et al. Hyenadna: Long-range genomic sequence modeling at single nucleotide resolution. NeurIPS, 2024.

[bib58] Nguyen and Salazar (2019) Toan Q Nguyen and Julian Salazar. Transformers without tears: Improving the normalization of self-attention. arXiv preprint arXiv:1910.05895, 2019.

[bib59] Ni et al. (2024) Yunhao Ni, Yuxin Guo, Junlong Jia, and Lei Huang. On the nonlinearity of layer normalization. arXiv preprint arXiv:2406.01255, 2024.

[bib60] Panayotov et al. (2015) Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In ICASSP, 2015.

[bib61] Peebles and Xie (2023) William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.

[bib62] Qiao et al. (2019) Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, and Alan Yuille. Micro-batch training with batch-channel normalization and weight standardization. arXiv preprint arXiv:1903.10520, 2019.

[bib63] Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.

[bib64] Salimans and Kingma (2016) Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. NeurIPS, 2016.

[bib65] Santurkar et al. (2018) Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? NeurIPS, 2018.

[bib66] Schiff et al. (2024) Yair Schiff, Chia-Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, and Volodymyr Kuleshov. Caduceus: Bi-directional equivariant long-range dna sequence modeling. arXiv preprint arXiv:2403.03234, 2024.

[bib67] Shao et al. (2020) Jie Shao, Kai Hu, Changhu Wang, Xiangyang Xue, and Bhiksha Raj. Is normalization indispensable for training deep neural network? NeurIPS, 2020.

[bib68] Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[bib69] Smith et al. (2023) Samuel L Smith, Andrew Brock, Leonard Berrada, and Soham De. Convnets match vision transformers at scale. arXiv preprint arXiv:2310.16764, 2023.

[bib70] Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.

[bib71] Sun et al. (2025) Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, and Shiwei Liu. The curse of depth in large language models. arXiv preprint arXiv:2502.05795, 2025.

[bib72] Sun et al. (2024) Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states. arXiv preprint arXiv:2407.04620, 2024.

[bib73] Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.

[bib74] Tanaka and Kunin (2021) Hidenori Tanaka and Daniel Kunin. Noether’s learning dynamics: Role of symmetry breaking in neural networks. NeurIPS, 2021.

[bib75] Tolstikhin et al. (2021) Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture for vision. NeurIPS, 2021.

[bib76] Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.

[bib77] Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.

[bib78] Ulyanov et al. (2016) Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.

[bib79] Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.

[bib80] Wu and He (2018) Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.

[bib81] Xie et al. (2017) Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.

[bib82] Xiong et al. (2020) Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In ICML, 2020.

[bib83] Xu et al. (2019) Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, and Junyang Lin. Understanding and improving layer normalization. NeurIPS, 2019.

[bib84] Yang et al. (2024) An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.

[bib85] Zhai et al. (2023) Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, and Joshua M Susskind. Stabilizing transformer training by preventing attention entropy collapse. In ICML, 2023.

[bib86] Zhang and Sennrich (2019) Biao Zhang and Rico Sennrich. Root mean square layer normalization. NeurIPS, 2019.

[bib87] Zhang et al. (2019) Hongyi Zhang, Yann N Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without normalization. arXiv preprint arXiv:1901.09321, 2019.

[bib88] Zhang et al. (2024) Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, et al. Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output. arXiv preprint arXiv:2407.03320, 2024.