
Transformers without Normalization


Abstract

Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation DyT(x) = tanh(αx), as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, S-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.


Jiachen Zhu 1 , 2 , Xinlei Chen 1 , Kaiming He 3 , Yann LeCun 1 , 2 , Zhuang Liu 1 , 4 , †

1 FAIR, Meta, 2 New York University, 3 MIT, 4 Princeton University

† Project lead


Date:

June 17, 2025

jiachenzhu.github.io/DyT

Correspondence:

jiachen.zhu@nyu.edu , zhuangl@princeton.edu

Introduction

Over the past decade, normalization layers have solidified their positions as one of the most fundamental components of modern neural networks. It all traces back to the invention of batch normalization in 2015 (Ioffe and Szegedy, 2015), which enabled drastically faster and better convergence in visual recognition models and quickly gained momentum in the following years. Since then, many variants of normalization layers have been proposed for different network architectures or domains (Ba et al., 2016; Ulyanov et al., 2016; Wu and He, 2018; Zhang and Sennrich, 2019). Today, virtually all modern networks use normalization layers, with layer normalization (Layer Norm, or LN) (Ba et al., 2016) being one of the most popular, particularly in the dominant Transformer architecture (Vaswani et al., 2017; Dosovitskiy et al., 2020).

The widespread adoption of normalization layers is largely driven by their empirical benefits in optimization (Santurkar et al., 2018; Bjorck et al., 2018). In addition to achieving better results, they help accelerate and stabilize convergence. As neural networks become wider and deeper, this necessity becomes ever more critical (Brock et al., 2021a; Huang et al., 2023). Consequently, normalization layers are widely regarded as crucial, if not indispensable, for the effective training of deep networks. This belief is subtly evidenced by the fact that, in recent years, novel architectures often seek to replace attention or convolution layers (Tolstikhin et al., 2021; Gu and Dao, 2023; Sun et al., 2024; Feng et al., 2024), but almost always retain the normalization layers.

This paper challenges this belief by introducing a simple alternative to normalization layers in Transformers. Our exploration starts with the observation that LN layers map their inputs to outputs with tanh-like, S -shaped curves, scaling the input activations while squashing the extreme values. Inspired by this insight, we propose an element-wise operation termed Dynamic Tanh (DyT), defined as: DyT( x ) = tanh( α x ) , where α is a learnable parameter. This operation aims to emulate the behavior of LN by learning an appropriate scaling factor through α and squashing extreme values via the bounded tanh function. Notably, unlike normalization layers, it achieves both effects without the need to compute activation statistics.

Employing DyT is straightforward, as shown in Figure 1: we directly replace existing normalization layers with DyT in architectures such as vision and language Transformers. We empirically demonstrate that models with DyT can train stably and achieve high final performance across a wide range of settings. It often does not require tuning the training hyperparameters of the original architecture.

Figure 1 Left: original Transformer block. Right: block with our proposed Dynamic Tanh (DyT) layer. DyT is a straightforward replacement for the commonly used Layer Norm (Ba et al., 2016) (in some cases RMSNorm (Zhang and Sennrich, 2019)) layers. Transformers with DyT match or exceed the performance of their normalized counterparts.

Our work challenges the notion that normalization layers are indispensable for training modern neural networks and provides empirical insights into the properties of normalization layers.

Background: Normalization Layers

We begin by reviewing normalization layers. Most normalization layers share a common formulation. Given an input x with shape (B, T, C), where B is the batch size, T is the number of tokens, and C is the embedding dimension per token, the output is generally computed as:

$$
\mathrm{normalization}(\bm{x}) = \bm{\gamma} * \frac{\bm{x} - \bm{\mu}}{\sqrt{\bm{\sigma}^{2} + \epsilon}} + \bm{\beta}
$$

where ϵ is a small constant, and γ and β are learnable vector parameters of shape (C,). They are 'scaling' and 'shifting' affine parameters that allow the output to lie in any range. The terms µ and σ² denote the mean and variance of the input. Different methods mainly differ in how these two statistics are computed, which results in µ and σ² having different dimensions, with broadcasting applied during computation.

Batch normalization (BN) (Ioffe and Szegedy, 2015) is the first modern normalization layer, and it has been primarily used in ConvNet models (Szegedy et al., 2016; He et al., 2016; Xie et al., 2017). Its introduction represents a major milestone in deep learning architecture design. BN computes the mean and variance across both the batch and token dimensions, specifically $\mu_k = \frac{1}{BT}\sum_{i,j} x_{ijk}$ and $\sigma^2_k = \frac{1}{BT}\sum_{i,j} (x_{ijk} - \mu_k)^2$. Other normalization layers popular in ConvNets, such as group normalization (Wu and He, 2018) and instance normalization (Ulyanov et al., 2016), were initially proposed for specialized tasks such as object detection and image stylization. They share the same overall formulation but differ in the axes and ranges over which the statistics are computed.
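The BN statistics above can be sketched in NumPy; the toy tensor shape below is ours for illustration:

```python
import numpy as np

# Toy activations with shape (B, T, C): batch, tokens, channels.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(8, 16, 4))

# BN keeps one mean/variance per channel k, pooled over both the
# batch and token dimensions (axes 0 and 1).
mu = x.mean(axis=(0, 1))                 # shape (C,)
var = x.var(axis=(0, 1))                 # shape (C,)

x_bn = (x - mu) / np.sqrt(var + 1e-5)

# Each channel is now (approximately) zero-mean and unit-variance.
assert np.allclose(x_bn.mean(axis=(0, 1)), 0.0, atol=1e-7)
assert np.allclose(x_bn.var(axis=(0, 1)), 1.0, atol=1e-3)
```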

Layer normalization (LN) (Ba et al., 2016) and root mean square normalization (RMSNorm) (Zhang and Sennrich, 2019) are the two major types of normalization layers used in Transformer architectures. LN computes these statistics independently for each token in each sample, with $\mu_{ij} = \frac{1}{C}\sum_{k} x_{ijk}$ and $\sigma^2_{ij} = \frac{1}{C}\sum_{k} (x_{ijk} - \mu_{ij})^2$. RMSNorm (Zhang and Sennrich, 2019) simplifies LN by removing the mean-centering step, normalizing the input with $\mu_{ij} = 0$ and $\sigma^2_{ij} = \frac{1}{C}\sum_{k} x_{ijk}^2$. Today, most modern neural networks use LN due to its simplicity and universality. Recently, RMSNorm has gained popularity, particularly in language models like T5 (Raffel et al., 2020), LLaMA (Touvron et al., 2023a,b; Dubey et al., 2024), Mistral (Jiang et al., 2023), Qwen (Bai et al., 2023; Yang et al., 2024), InternLM (Zhang et al., 2024; Cai et al., 2024) and DeepSeek (Liu et al., 2024; Guo et al., 2025). The Transformers we examine in this work all use LN, except that LLaMA uses RMSNorm.
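The per-token statistics of LN and RMSNorm can likewise be sketched in NumPy (toy shapes are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(2, 3, 8))           # (B, T, C)

# LN: one mean/variance per token (i, j), computed over the channel axis.
mu = x.mean(axis=-1, keepdims=True)      # shape (B, T, 1)
var = x.var(axis=-1, keepdims=True)
x_ln = (x - mu) / np.sqrt(var + 1e-5)

# RMSNorm: drop the mean-centering and divide by the root mean square.
rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + 1e-5)
x_rms = x / rms

# Both preserve the input tensor's shape.
assert x_ln.shape == x.shape and x_rms.shape == x.shape
```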

Figure 2 Output vs. input of selected layer normalization (LN) layers in Vision Transformer (ViT) (Dosovitskiy et al., 2020), wav2vec 2.0 (a Transformer model for speech) (Baevski et al., 2020), and Diffusion Transformer (DiT) (Peebles and Xie, 2023). We sample a mini-batch of samples and plot the input/output values of four LN layers in each model. The outputs are before the affine transformation in LN. The S-shaped curves highly resemble that of a tanh function (see Figure 3). The more linear shapes in earlier layers can also be captured by the center part of a tanh curve. This motivates us to propose Dynamic Tanh (DyT) as a replacement, with a learnable scalar α to account for different scales on the x axis.

What Do Normalization Layers Do?

Analysis setup. We first empirically study the behaviors of normalization layers in trained networks. For this analysis, we take a Vision Transformer model (ViT-B) (Dosovitskiy et al., 2020) trained on ImageNet-1K (Deng et al., 2009), a wav2vec 2.0 Large Transformer model (Baevski et al., 2020) trained on LibriSpeech (Panayotov et al., 2015), and a Diffusion Transformer (DiT-XL) (Peebles and Xie, 2023) trained on ImageNet-1K. In all cases, LN is applied in every Transformer block and before the final linear projection.

For all three trained networks, we sample a mini-batch of samples and do a forward pass through the network. We then measure the input and output for the normalization layers, i.e., tensors immediately before and after the normalization operation, before the learnable affine transformation. Since LN preserves the dimensions of the input tensor, we can establish a one-to-one correspondence between the input and output tensor elements, allowing for a direct visualization of their relationship. We plot the resulting mappings in Figure 2.
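Input/output pairs like these can be collected with forward hooks; the toy model below is a stand-in for the actual trained networks (ViT-B, wav2vec 2.0, DiT-XL), and the exact capture code is our sketch:

```python
import torch
import torch.nn as nn

# Stand-in for a trained Transformer; affine is disabled so the hook's
# output is exactly the normalized tensor, before any gamma/beta.
model = nn.Sequential(
    nn.Linear(16, 16),
    nn.LayerNorm(16, elementwise_affine=False),
    nn.Linear(16, 16),
)

captured = {}

def hook(module, inputs, output):
    # Record the tensors immediately before and after normalization.
    captured["in"] = inputs[0].detach()
    captured["out"] = output.detach()

model[1].register_forward_hook(hook)

with torch.no_grad():
    model(torch.randn(4, 8, 16))     # a mini-batch of (B, T, C) inputs

# LN preserves the tensor shape, so inputs and outputs correspond
# element-by-element and can be scatter-plotted directly.
assert captured["in"].shape == captured["out"].shape
```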

Tanh-like mappings with layer normalization. For all three models, in earlier LN layers (1st column of Figure 2), we find this input-output relationship to be mostly linear, resembling a straight line in an x-y plot. The deeper LN layers, however, are where we make more intriguing observations.

A striking observation from these deeper layers is that most of these curves' shapes highly resemble full or partial S-shaped curves represented by a tanh function (see Figure 3). One might expect LN layers to linearly transform the input tensor, as subtracting the mean and dividing by the standard deviation are linear operations. LN normalizes in a per-token manner, only linearly transforming each token's activations. As tokens have different mean and standard deviation values, the linearity does not hold collectively over all activations of the input tensor. Nonetheless, it is still surprising to us that the actual non-linear transformation is highly similar to a scaled tanh function.

Figure 3 tanh(αx) with three different α values.

Figure 4 Output vs. input of two LN layers, with tensor elements colored to indicate different channel and token dimensions. The input tensor has a shape of (samples, tokens, channels), with elements visualized by assigning consistent colors to the same tokens (left two panels) and channels (right two panels). Left two panels: points representing the same token (same color) form straight lines across different channels, as LN operates linearly across channels for each token. Interestingly, when plotted collectively, these lines form a non-linear tanh-shaped curve. Right two panels: each channel's input spans different ranges on the x-axis, contributing distinct segments to the overall tanh-shaped curve. Certain channels (e.g., red, green, and pink) exhibit more extreme x values, which are squashed by LN.

For such an S-shaped curve, we note that the central part, represented by points with x values close to zero, is still mainly linear. Most points (∼99%) fall in this linear range. However, many points clearly fall outside this range and are considered to have 'extreme' values, e.g., those with x larger than 50 or smaller than -50 in the ViT model. The main effect of normalization layers on these values is to squash them into less extreme values, more in line with the majority of points. This is where normalization layers cannot be approximated by a simple affine transformation. We hypothesize that this non-linear and disproportionate squashing of extreme values is what makes normalization layers important and indispensable.
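This squashing behavior is easy to see numerically: a scaled tanh is nearly the identity for small inputs but saturates for large ones. The α value below is an arbitrary illustration, not a learned one:

```python
import math

alpha = 0.02   # illustrative scale only

# Near zero, tanh(alpha * x) is almost exactly alpha * x (linear regime).
assert abs(math.tanh(alpha * 1.0) - alpha * 1.0) < 1e-4

# An 'extreme' input (x = 500) gets squashed: the linear map would give
# alpha * x = 10, but tanh caps the output just below 1.
assert math.tanh(alpha * 500.0) < 1.0
assert math.tanh(alpha * 500.0) > 0.999
```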

Recent findings by Ni et al. (2024) similarly highlight the strong non-linearities introduced by LN layers, demonstrating how the non-linearity enhances a model's representational capacity. Moreover, this squashing behavior mirrors the saturation properties of biological neurons for large inputs, a phenomenon first observed about a century ago (Adrian, 1926; Adrian and Zotterman, 1926a,b).

Normalization by tokens and channels. How does an LN layer perform a linear transformation for each token but also squash the extreme values in such a non-linear fashion? To understand this, we visualize the points grouped by tokens and channels, respectively. This is plotted in Figure 4 by taking the second and third subplots for ViT from Figure 2, but with a sampled subset of points for more clarity. When we select the channels to plot, we make sure to include the channels with extreme values.

On the left two panels of Figure 4, we visualize each token's activations using the same color. We observe that all points from any single token do form a straight line. However, since each token has a different variance, the slopes are different. Tokens with smaller input x ranges tend to have smaller variance, and the normalization layer will divide their activations using a smaller standard deviation, hence producing a larger slope in the straight line. Collectively, they form an S -shaped curve that resembles a tanh function. In the two panels on the right, we color each channel's activations using the same color. We find that different channels tend to have drastically different input ranges, with only a few channels (e.g., red, green, and pink) exhibiting large extreme values. These are the channels that get squashed the most by the normalization layer.
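The per-token linearity can be checked directly: normalizing one token's channel vector is an affine map with slope 1/σ, so its (input, output) points fall on a single straight line. A NumPy sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
token = rng.normal(loc=3.0, scale=5.0, size=256)   # one token's channels

mu, sd = token.mean(), token.std()
out = (token - mu) / sd

# Within a single token, LN is exactly affine: out = x/sd - mu/sd.
# A degree-1 fit therefore recovers slope 1/sd and intercept -mu/sd.
slope, intercept = np.polyfit(token, out, 1)
assert np.isclose(slope, 1.0 / sd)
assert np.isclose(intercept, -mu / sd)
```

Tokens with smaller standard deviation get a larger slope, which is exactly why the per-token lines fan out into the collective S-shaped curve of Figure 4.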


In this section, we begin with ablations on the effects of the tanh function and the learnable scalar α . We then analyze the values of α throughout and after training. Lastly, we present comparisons with previous methods that aim to remove normalization layers.


Dynamic Tanh (DyT)

Inspired by the similarity between the shapes of normalization layers and a scaled tanh function, we propose Dynamic Tanh (DyT) as a drop-in replacement for normalization layers. Given an input tensor x , a DyT layer is defined as follows:

$$
\mathrm{DyT}(\bm{x}) = \bm{\gamma} * \tanh(\alpha \bm{x}) + \bm{\beta}
$$

where α is a learnable scalar parameter that allows scaling the input differently based on its range, accounting for varying x scales (Figure 2). This is also why we name the whole operation 'Dynamic' Tanh. γ and β are learnable, per-channel vector parameters, the same as those used in all normalization layers; they allow the output to be scaled back to any range. This is sometimes considered a separate affine layer; for our purposes, we consider them part of the DyT layer, just as normalization layers also include them. See Algorithm 1 for an implementation of DyT in PyTorch-like pseudocode.
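A minimal PyTorch sketch consistent with this definition (the paper's Algorithm 1 may differ in details such as initialization):

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: gamma * tanh(alpha * x) + beta, a drop-in
    replacement for a normalization layer over the last dimension."""

    def __init__(self, num_features: int, alpha_init: float = 0.5):
        super().__init__()
        # alpha is a single learnable scalar; gamma/beta are per-channel,
        # mirroring the affine part of a normalization layer.
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Purely element-wise: no activation statistics are computed.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

layer = DyT(8)
x = torch.randn(2, 4, 8)
assert layer(x).shape == x.shape       # shape-preserving, like LN
```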

Integrating DyT layers into an existing architecture is straightforward: one DyT layer replaces one normalization layer (see Figure 1). This applies to normalization layers within attention blocks, FFN blocks, and the final normalization layer. Although DyT may look like or be considered an activation function, this study only uses it to replace normalization layers without altering any parts of the activation functions in the original architectures, such as GELU or ReLU. Readers interested in the use of harmonic or hyperbolic functions as activation functions can refer to Hashemi et al. (2024). We also observe that there is little need to tune the hyperparameters used by the original architectures for DyT to perform well.
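Such a one-for-one replacement can be automated with a small helper; the function below is our illustration, not code from the paper's release:

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    # Minimal DyT: gamma * tanh(alpha * x) + beta, as defined above.
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

def replace_ln_with_dyt(module: nn.Module) -> nn.Module:
    # Recursively swap every nn.LayerNorm for a DyT layer of the
    # same feature dimension; activation functions are left untouched.
    for name, child in module.named_children():
        if isinstance(child, nn.LayerNorm):
            setattr(module, name, DyT(child.normalized_shape[-1]))
        else:
            replace_ln_with_dyt(child)
    return module

encoder = nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True)
encoder = replace_ln_with_dyt(encoder)
assert not any(isinstance(m, nn.LayerNorm) for m in encoder.modules())
```

The swapped model keeps the original block structure, so it can be trained with the original recipe.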

On scaling parameters.

We present additional experiments to evaluate the impact of hyperparameter tuning, specifically focusing on the learning rate and initialization of α for all non-LLM models.

Tuning learning rate. Table 12 summarizes performance comparisons between models trained with original versus tuned learning rates. Results indicate that tuning the learning rate provides only modest performance improvements for DyT models. This suggests that the original hyperparameters, initially optimized for LN models, are already well-suited for DyT models. This observation underscores the inherent similarity between the DyT and LN models.

Table 12 Performance comparison between original and tuned learning rates for LN and DyT models. Results show that tuning learning rates provides only modest performance improvements for DyT models, suggesting that the default hyperparameters optimized for LN models are already well-suited for DyT models. Entries marked with '-' indicate no performance gain over the original learning rate. The values in parentheses represent the learning rate used.

Tuning initial value of α. We also investigate the effects of optimizing α₀ for DyT models, as presented in Table 13. Findings show only minor performance enhancements for select models when α₀ is tuned, indicating that the default initial value (α₀ = 0.5) generally achieves near-optimal performance.

Table 13 Impact of tuning α₀ in DyT models. Optimizing α₀ from the default value (α₀ = 0.5) yields only minor performance gains for select DyT models, implying the default initialization already achieves near-optimal performance. Entries marked with '-' indicate no improvement over the default α₀.

Remarks.

Mechanisms of Normalization layers. There has been a rich line of work investigating normalization layers' role in enhancing model performance through various mechanisms. These include stabilizing gradient flow during training (Balduzzi et al., 2017; Daneshmand et al., 2020; Lubana et al., 2021), reducing sensitivity to weight initialization (Zhang et al., 2019; De and Smith, 2020; Shao et al., 2020), moderating outlier eigenvalues (Bjorck et al., 2018; Karakida et al., 2019), auto-tuning learning rates (Arora et al., 2018; Tanaka and Kunin, 2021), and smoothing the loss landscape for more stable optimization (Santurkar et al., 2018). These earlier works focused on studying batch normalization. Recent studies (Lyu et al., 2022; Dai et al., 2024; Mueller et al., 2024) further highlight the connection between normalization layers and sharpness reduction, which contributes to better generalization.

Table 11 Optimal α₀ (attention / other) across model widths and depths in LLaMA training. Model width significantly impacts the choice of α₀, with wider networks requiring smaller values. In contrast, model depth has negligible influence.

Normalization in Transformers. With the rise of Transformer (Vaswani et al., 2017), research has increasingly focused on layer normalization (Ba et al., 2016), which has proven particularly effective for sequential data in natural language tasks (Nguyen and Salazar, 2019; Xu et al., 2019; Xiong et al., 2020). Recent work (Ni et al., 2024) reveals that layer normalization introduces strong non-linearity, enhancing the model's representational capacity. Additionally, studies (Loshchilov et al., 2024; Li et al., 2024) demonstrate that modifying the location of normalization layers within Transformers can improve convergence properties.

Removing normalization. Many studies have explored how to train deep models without normalization layers. Klambauer et al. (2017) introduce an alternative activation function that enables self-normalizing behavior, eliminating the need for explicit normalization. Other works (Zhang et al., 2019; De and Smith, 2020; Bachlechner et al., 2021) propose specialized initialization schemes to stabilize training in the absence of normalization. The pioneering work by Brock et al. (2021a,b) shows that high-performing ResNets can be trained without normalization (Smith et al., 2023) through a combination of initialization techniques (De and Smith, 2020), weight normalization (Salimans and Kingma, 2016; Huang et al., 2017; Qiao et al., 2019), and adaptive gradient clipping (Brock et al., 2021b). Additionally, their training strategy incorporates extensive data augmentation (Cubuk et al., 2020) and regularization (Srivastava et al., 2014; Huang et al., 2016). The studies above are based on various ConvNet models.

In Transformer architectures, He and Hofmann (2023) explore modifications to Transformer blocks that reduce reliance on normalization layers and skip connections. Jha and Reagen (2024) introduce AERO, a Softmax-only LLM that improves inference efficiency and privacy with minimal performance loss. Alternatively, Heimersheim (2024) propose a method to gradually remove LN from pretrained networks by fine-tuning the model after removing each normalization layer. Unlike previous approaches, DyT requires minimal modifications to both the architecture and the training recipe. Despite its simplicity, DyT achieves stable training and comparable performance.

Experiments

To demonstrate the effectiveness of DyT, we experiment with Transformers and a few other modern architectures across a diverse range of tasks and domains. In each experiment, we replace the LN or RMSNorm in the original architectures with DyT layers and follow the official open-source protocols to train and test both versions of the models. Detailed instructions for reproducing our results are provided in Appendix A. Notably, to highlight the simplicity of adapting DyT, we use hyperparameters identical to those utilized by the normalized counterparts. For completeness, additional experimental results regarding tuning of learning rates and initial values of α are provided in Appendix B.

Supervised learning in vision. We train Vision Transformer (ViT) (Dosovitskiy et al., 2020) and ConvNeXt (Liu et al., 2022) of 'Base' and 'Large' sizes on the ImageNet-1K classification task (Deng et al., 2009). These models are selected due to their popularity and distinct operations: attention in ViT and convolution in ConvNeXt. Table 1 reports the top-1 classification accuracies. DyT performs slightly better than LN across both architectures and model sizes. We further plot the training loss for ViT-B and ConvNeXt-B in Figure 5. The curves show that the convergence behaviors of DyT and LN-based models are highly aligned.

Figure 5 Training loss curves for ViT-B and ConvNeXt-B models. The loss curves for both model types exhibit similar patterns between LN and DyT, suggesting that LN and DyT may share similar learning dynamics.

Table 1 Supervised classification accuracy on ImageNet-1K. DyT achieves better or similar performance than LN across both architectures and model sizes.

Self-supervised learning in vision. We benchmark with two popular visual self-supervised learning methods: masked autoencoders (MAE) (He et al., 2022) and DINO (Caron et al., 2021). Both by default use Vision Transformers as the backbones, but have different training objectives: MAE is trained with a reconstruction loss, and DINO uses a joint-embedding loss (LeCun, 2022). Following the standard self-supervised learning protocol, we first pretrain models on ImageNet-1K without using any labels and then test the pretrained models by attaching a classification layer and fine-tuning them with labels. The fine-tuning results are presented in Table 2. DyT consistently performs on par with LN in self-supervised learning tasks.

Diffusion models. We train three Diffusion Transformer (DiT) models (Peebles and Xie, 2023) of sizes B, L and XL on ImageNet-1K (Deng et al., 2009). The patch size is 4, 4, and 2, respectively. Note that in DiT, the LN layers' affine parameters are used for class conditioning, and we keep them that way in our DyT experiments, only replacing the normalizing transformation with the tanh(αx) function. After training, we evaluate the Fréchet Inception Distance (FID) scores using the standard ImageNet 'reference batch', as presented in Table 3. DyT achieves comparable or improved FID over LN.

Figure 6 LLaMA pretraining loss. The loss curves of DyT and RMSNorm models are closely aligned across model sizes.

Large Language Models. We pretrain LLaMA 7B, 13B, 34B, and 70B models (Touvron et al., 2023a,b; Dubey et al., 2024) to assess DyT performance relative to RMSNorm (Zhang and Sennrich, 2019), the default normalization layer used in LLaMA. The models are trained on The Pile dataset (Gao et al., 2020) with 200B tokens, following the original recipe outlined in LLaMA (Touvron et al., 2023b). On LLaMA with DyT, we add a learnable scalar parameter after the initial embedding layer, and adjust the initial value of α , as detailed in Section 7. We report the loss value after training and also follow OpenLLaMA (Geng and Liu, 2023) to benchmark the models on 15 zero-shot tasks from lm-eval (Gao et al.). As shown in Table 4, DyT performs on par with RMSNorm across all four model sizes. Figure 6 illustrates the loss curves, demonstrating similar trends across all model sizes, with training losses closely aligned throughout training.

Table 4 Language models' training loss and average performance on 15 zero-shot lm-eval tasks. DyT achieves a comparable zero-shot performance and training loss to RMSNorm.

Self-supervised learning in speech. We pretrain two wav2vec 2.0 Transformer models (Baevski et al., 2020) on the LibriSpeech dataset (Panayotov et al., 2015). We report the final validation loss in Table 5. We observe that DyT performs comparably to LN in both model sizes.

DNA sequence modeling. On the long-range DNA sequence modeling task, we pretrain the HyenaDNA model (Nguyen et al., 2024) and the Caduceus model (Schiff et al., 2024). The pretraining uses the human reference genome data (GRCh38, 2013), and the evaluation is on GenomicBenchmarks (Grešová et al., 2023). The results are presented in Table 6. DyT maintains performance comparable to LN for this task.


Analysis

In this section, we begin with ablations on the effects of the tanh function and the learnable scalar α . We then analyze the values of α throughout and after training. Lastly, we present comparisons with previous methods that aim to remove normalization layers.

Efficiency of DyT

We benchmark the LLaMA 7B model with RMSNorm or DyT by measuring the total time required for 100 forward passes (inference) and 100 forward-backward passes (training) on a single sequence of 4096 tokens. We first follow the officially recommended LLaMA setup and load the model from Hugging Face without applying any performance optimizations. Table 14 reports the time taken for RMSNorm and DyT layers, as well as for the entire model, when running on an Nvidia H100 GPU with BF16 precision. DyT layers reduce computation time compared to RMSNorm layers.

Table 14 Inference and training latency (BF16 precision) for LLaMA 7B with RMSNorm or DyT. DyT achieves a substantial reduction in both inference and training time. Results are measured without any extra performance optimizations.

We also benchmark both models using torch.compile. Interestingly, compiling the entire LLaMA model increases latency for the Hugging Face implementation, while compiling only the DyT or RMSNorm layers yields more efficient execution. Table 15 shows that, after compilation, the latencies of the RMSNorm and DyT layers become nearly identical.

An important distinction of DyT is that it is an element-wise operation and does not require a reduction operation within itself, compared to normalization layers. This could make it faster on hardware where reduction is a bottleneck. Additionally, even on conventional GPUs, DyT could offer opportunities for further optimization, e.g., fusing it with the preceding matrix multiplication layer from the last residual block.
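The locality of DyT, versus the reduction inside a normalization layer, can be checked directly: perturbing one input element changes that token's entire RMSNorm output but only the single corresponding DyT output. An illustrative numpy sketch (the RMSNorm here omits the learnable scale):

```python
import numpy as np

rng = np.random.default_rng(0)

def rmsnorm(x, eps=1e-6):
    # Requires a reduction: root mean square over the channel dimension.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms

def dyt(x, alpha=0.5):
    # Purely element-wise: each output depends only on its own input element.
    return np.tanh(alpha * x)

x = rng.normal(size=(4, 128))
x2 = x.copy()
x2[0, 0] += 10.0   # perturb a single element of one token

# The perturbation changes every RMSNorm output in that token...
print(np.allclose(rmsnorm(x)[0, 1:], rmsnorm(x2)[0, 1:]))  # False
# ...but leaves all other DyT outputs untouched.
print(np.allclose(dyt(x)[0, 1:], dyt(x2)[0, 1:]))          # True
```

This independence is what removes the cross-element data dependency, and hence the reduction, from the layer's compute.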

Ablations of tanh and α

To further investigate the role of tanh and α in DyT, we conduct experiments to evaluate the model's performance when these components are altered or removed.

Replacing and removing tanh. We replace tanh in DyT layers with alternative squashing functions, specifically hardtanh and sigmoid (Figure 7), while keeping the learnable scalar α intact. Furthermore, we assess the impact of completely removing tanh by replacing it with the identity function while still retaining α . As shown in Table 7, the squashing function is essential for stable training. Using the identity function leads to unstable training and divergence, whereas squashing functions enable stable training. Among the squashing functions, tanh performs the best. This is possibly due to its smoothness and zero-centered properties.
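The four variants compared here are easy to state precisely; the following numpy definitions are illustrative (the ablation swaps only the squashing function inside DyT, keeping α):

```python
import numpy as np

# The three bounded squashing functions from the ablation, plus identity
# (which removes squashing entirely and leads to divergence during training).
def tanh_fn(x):  return np.tanh(x)                # bounded, zero-centered, smooth
def hardtanh(x): return np.clip(x, -1.0, 1.0)     # bounded, zero-centered, not smooth
def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))  # bounded but not zero-centered
def identity(x): return x                         # unbounded: extremes pass through

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
for f in (tanh_fn, hardtanh, sigmoid, identity):
    print(f.__name__, f(x))
```

Only the first three keep extreme inputs bounded; tanh is additionally both smooth and zero-centered, which is the property the text credits for its best performance.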

Table 7 ImageNet-1K classification accuracy with different squashing functions. All experiments follow the same training recipe as the original LN-based models. Squashing functions play a crucial role in preventing divergence, with tanh achieving the highest performance among the three functions. ' → failed' indicates that training diverged after some progress, with the preceding number representing the highest accuracy reached before divergence.

Removing α . Next, we evaluate the impact of removing the learnable α while retaining the squashing functions (tanh, hardtanh, and sigmoid). As shown in Table 8, removing α results in performance degradation across all squashing functions, highlighting the critical role of α in overall model performance.

Values of α

During training. Our analysis reveals that α closely tracks the 1 / std of activations throughout training. As illustrated in the left panel of Figure 8, α first decreases and then increases during training, always fluctuating consistently with the standard deviation of the input activations. This supports the important role of α in maintaining activations within a suitable range, which leads to stable and effective training.

After training. Our further analysis of the final values of α in trained networks reveals a strong correlation with the 1 / std of the input activations. As shown on the right panel of Figure 8, higher 1 / std values generally correspond to larger α values, and vice versa. Additionally, we observe that deeper layers tend to have activations with larger standard deviations. This trend aligns with characteristics of deep residual networks, as shown in Brock et al. (2021a) for ConvNets, and Sun et al. (2025) for Transformers.

Both analyses suggest that α functions partially as a normalization mechanism by learning values approximating 1 / std of the input activations. Unlike LN, which normalizes the activations per token, α normalizes the entire input activations collectively. Consequently, α alone cannot suppress extreme values in a non-linear fashion.
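The distinction can be made concrete with toy activations: per-token LN forces every token to unit variance, whereas a single global scale α ≈ 1 / std only normalizes the tensor in aggregate. An illustrative numpy sketch with synthetic Gaussian tokens of varying scale:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic tokens with very different scales, as observed in deeper layers.
x = rng.normal(size=(64, 256)) * rng.uniform(0.5, 5.0, size=(64, 1))

# LN normalizes each token separately: every token ends up with unit std.
ln_out = (x - x.mean(-1, keepdims=True)) / x.std(-1, keepdims=True)

# A single scalar alpha ~ 1/std scales all activations collectively.
alpha = 1.0 / x.std()
scaled = alpha * x

print(ln_out.std(-1)[:4])   # ~1.0 for every token
print(scaled.std())         # ~1.0, but only in aggregate
print(scaled.std(-1)[:4])   # per-token stds still differ widely
```

The global scale leaves the relative spread between tokens intact, which is why α alone cannot squash extreme values and the tanh non-linearity is still needed.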

Figure 7 Curves of three squashing functions: tanh, hardtanh, and sigmoid. All three functions squash inputs into a bounded range, but tanh( x ) achieves the best performance when used in DyT layers. We suspect it is due to its smoothness and zero-centered properties.


Comparison with Other Methods

To further assess DyT's effectiveness, we compare it with other methods that also enable training Transformers without normalization layers. These methods can be broadly categorized into initialization-based and weight-normalization-based methods. We consider two popular initialization-based methods, Fixup (Zhang et al., 2019; Huang et al., 2020) and SkipInit (De and Smith, 2020; Bachlechner et al., 2021). Both methods aim to mitigate training instabilities by adjusting the initial parameter values to prevent large gradients and activations at the start of training, thereby enabling stable learning without normalization layers. In contrast, weight-normalization-based methods impose constraints on network weights throughout training to maintain stable learning dynamics in the absence of normalization layers. We include one such method, σ Reparam (Zhai et al., 2023), which controls the spectral norm of the weights to promote stable learning.

Table 9 Classification accuracy on ImageNet-1K. DyT consistently achieves superior performance over other methods.

Table 9 summarizes the results of two ViT-based tasks. We closely follow the original protocols outlined in their respective papers. However, we find that both initialization-based methods, Fixup and SkipInit, require significantly lower learning rates to prevent training divergence. To ensure a fair comparison, we conduct a simple learning rate search for all methods, including DyT. This produces results that differ from those reported in Section 5, where no hyperparameter is tuned. Overall, the results show that DyT consistently outperforms all other tested methods across different configurations.

Initialization of α

We find that tuning the initialization of α (denoted α 0 ) rarely leads to significant performance improvements. The only exception is LLM training, where careful tuning of α 0 yields noticeable performance gains. In this section, we detail our findings on the impact of α initialization.

Initialization of α for Non-LLM Models

Non-LLM models are relatively insensitive to α 0 . Figure 9 shows the effect of varying α 0 on validation performance across different tasks. All experiments follow the original setup and hyperparameters of their respective recipes. We observe that performance remains stable across a wide range of α 0 values, with values between 0.5 and 1.2 generally yielding good results. Adjusting α 0 typically affects only the early stages of the training curves. The main exception is the supervised ViT-L experiments, where training becomes unstable and diverges when α 0 exceeds 0.6. In such cases, reducing the learning rate restores stability, as detailed below.

Figure 8 Left: For two selected DyT layers from the ViT-B model, we track α and the inverse of the standard deviation ( 1 / std ) of activations at the end of each epoch, observing that they evolve together during training. Right: We plot the final α values of two trained models, ViT-B and ConvNeXt-B, against the 1 / std of the input activations, demonstrating a strong correlation between the two values.

Figure 9 Performance of different tasks across different α 0 values. We benchmark the performance of all non-LLM tasks used in Section 5 with different initial values of α . Performance remains stable across a wide range of α 0 values. The only exception is that supervised ViT-L models (top right panel) will diverge for α 0 values larger than 0.6.

Smaller α 0 results in more stable training. Building on the previous observations, we further analyze the factors contributing to training instability. Our findings suggest that increasing either the model size or the learning rate requires lowering α 0 to ensure stable training. Conversely, a higher α 0 requires a lower learning rate to mitigate training instability. Figure 10 shows an ablation of the training stability of supervised ViT on the ImageNet-1K dataset, varying learning rates, model sizes, and α 0 values. Larger models are more prone to training failure, requiring smaller α 0 values or learning rates for stable training. A similar instability pattern is also observed in LN-based models under comparable conditions, and setting α 0 = 0.5 results in a stability pattern similar to that of LN.

Setting α 0 = 0.5 as the default. Based on our findings, we set α 0 = 0.5 as the default value for all non-LLM models. This setting provides training stability comparable to LN while maintaining strong performance.


Initialization of α for LLMs

Tuning α 0 enhances LLM performance. As discussed earlier, the default setting of α 0 = 0.5 generally performs well across most tasks. However, we find that tuning α 0 can substantially improve LLM performance. We tune α 0 across LLaMA models by pretraining each on 30B tokens and comparing their training losses. Table 10 summarizes the tuned α 0 values for each model. Two key findings emerge:

  1. Larger models require smaller α 0 values. Once the optimal α 0 is determined for smaller models, the search space for larger models can be reduced accordingly.

Figure 10 Stability across varying α 0 values, learning rates, and model sizes. We train supervised ViT models on the ImageNet-1K dataset and observe that larger models are more prone to instability for both LN and DyT models. Lowering the learning rate or reducing α 0 enhances stability. LN shows similar stability to DyT with α 0 = 0.5.

  2. Higher α 0 values for attention blocks improve performance. We find that initializing α with higher values for DyT layers in attention blocks and lower values for DyT layers in other locations (i.e., within FFN blocks or before the final linear projection) improves performance.

Table 10 Optimal α 0 for different LLaMA models. Larger models require smaller α 0 values. We find it is important to initialize α differently in (1) attention blocks ('attention') versus (2) the FFN blocks and the final DyT layer before outputs ('other'). α 0 in attention blocks requires larger values.

To further illustrate the impact of α 0 tuning, Figure 11 presents heatmaps of loss values of two LLaMA models. Both models benefit from higher α 0 in attention blocks, leading to reduced training loss.

Figure 11 Loss-value heatmaps for two LLaMA models across different α 0 settings for attention and other blocks.

Model width primarily determines α 0 selection. We also investigate the influence of model width and depth on the optimal α 0 . We find that model width is critical in determining the optimal α 0 , while model depth has minimal influence. Table 11 shows the optimal α 0 values across different widths and depths: wider networks benefit from smaller α 0 values for optimal performance, whereas model depth has negligible impact on the choice of α 0 .

As can be seen in Table 11, the wider the network, the more uneven initialization for 'attention' and 'other' is needed. We hypothesize that the sensitivity of LLM's α initialization is related to their excessively large widths compared to other models.
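These two trends can be summarized as a width-dependent initialization rule. The sketch below is purely hypothetical: the functional form and constants are illustrative placeholders, not the tuned values reported in Table 10 or Table 11.

```python
# Hypothetical sketch of width-dependent alpha_0 assignment for LLaMA-style
# models: attention DyT layers start with a larger alpha than "other" DyT
# layers (FFN blocks and the final DyT layer), and wider models start smaller.
# The constants below are illustrative, not the paper's tuned values.
def alpha0_for(width, is_attention):
    base = 8.0 / width ** 0.5                    # shrink alpha_0 as width grows
    return base if is_attention else base / 4.0  # "other" layers start smaller

for width in (2048, 4096, 8192):
    print(width, alpha0_for(width, True), alpha0_for(width, False))
```

Any such rule would still need validation against short pretraining runs, as done here with 30B-token sweeps.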


Mechanisms of Normalization layers. There has been a rich line of work investigating normalization layers' role in enhancing model performance through various mechanisms. These include stabilizing gradient flow during training (Balduzzi et al., 2017; Daneshmand et al., 2020; Lubana et al., 2021), reducing sensitivity to weight initialization (Zhang et al., 2019; De and Smith, 2020; Shao et al., 2020), moderating outlier eigenvalues (Bjorck et al., 2018; Karakida et al., 2019), auto-tuning learning rates (Arora et al., 2018; Tanaka and Kunin, 2021), and smoothing the loss landscape for more stable optimization (Santurkar et al., 2018). These earlier works focused on studying batch normalization. Recent studies (Lyu et al., 2022; Dai et al., 2024; Mueller et al., 2024) further highlight the connection between normalization layers and sharpness reduction, which contributes to better generalization.

Table 11 Optimal α 0 (attention / other) across model widths and depths in LLaMA training. Model width significantly impacts the choice of α 0 , with wider networks requiring smaller values. In contrast, model depth has negligible influence.

Normalization in Transformers. With the rise of Transformer (Vaswani et al., 2017), research has increasingly focused on layer normalization (Ba et al., 2016), which has proven particularly effective for sequential data in natural language tasks (Nguyen and Salazar, 2019; Xu et al., 2019; Xiong et al., 2020). Recent work (Ni et al., 2024) reveals that layer normalization introduces strong non-linearity, enhancing the model's representational capacity. Additionally, studies (Loshchilov et al., 2024; Li et al., 2024) demonstrate that modifying the location of normalization layers within Transformers can improve convergence properties.

Analysis setup. We first empirically study the behaviors of normalization layers in trained networks. For this analysis, we take a Vision Transformer model (ViT-B) (Dosovitskiy et al., 2020) trained on ImageNet-1K (Deng et al., 2009), a wav2vec 2.0 Large Transformer model (Baevski et al., 2020) trained on LibriSpeech (Panayotov et al., 2015), and a Diffusion Transformer (DiT-XL) (Peebles and Xie, 2023) trained on ImageNet-1K. In all cases, LN is applied in every Transformer block and before the final linear projection.

For all three trained networks, we sample a mini-batch of samples and do a forward pass through the network. We then measure the input and output for the normalization layers, i.e., tensors immediately before and after the normalization operation, before the learnable affine transformation. Since LN preserves the dimensions of the input tensor, we can establish a one-to-one correspondence between the input and output tensor elements, allowing for a direct visualization of their relationship. We plot the resulting mappings in Figure 2.

Tanh-like mappings with layer normalization. For all three models, in earlier LN layers (1st column of Figure 2), we find this input-output relationship to be mostly linear, resembling a straight line in an x-y plot. However, the deeper LN layers are where we make more intriguing observations.

A striking observation from these deeper layers is that most of these curves' shapes highly resemble full or partial S -shaped curves represented by a tanh function (see Figure 3). One might expect LN layers to linearly transform the input tensor, as subtracting the mean and dividing by standard deviation are

Figure 3 tanh(αx) with three different α values.

Figure 4 Output vs. input of two LN layers, with tensor elements colored to indicate different channel and token dimensions. The input tensor has a shape of (samples, tokens, and channels), with elements visualized by assigning consistent colors to the same tokens (left two panels) and channels (right two panels). Left two panels : points representing the same token (same color) form straight lines across different channels, as LN operates linearly across channels for each token. Interestingly, when plotted collectively, these lines form a non-linear tanh-shaped curve. Right two panels : each channel's input spans different ranges on the x -axis, contributing distinct segments to the overall tanh-shaped curve. Certain channels (e.g., red, green, and pink) exhibit more extreme x values, which are squashed by LN.

linear operations. LN normalizes in a per-token manner, only linearly transforming each token's activations. As tokens have different mean and standard deviation values, the linearity does not hold collectively on all activations of the input tensor. Nonetheless, it is still surprising to us that the actual non-linear transformation is highly similar to a scaled tanh function.

For such an S-shaped curve, we note that the central part, represented by points with x values close to zero, is still mainly linear. Most points (∼99%) fall in this linear range. However, many points clearly fall outside this range and are considered to have 'extreme' values, e.g., those with x larger than 50 or smaller than -50 in the ViT model. Normalization layers' main effect on these values is to squash them into less extreme values, more in line with the majority of points. This is where normalization layers cannot be approximated by a simple affine transformation. We hypothesize that this non-linear and disproportionate squashing effect on extreme values is what makes normalization layers important and indispensable.
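The disproportionate squashing is easy to see numerically: with a small α, tanh(αx)/α is nearly the identity for moderate inputs but caps extreme ones near 1/α. The α value below is illustrative, not one fitted to any of the trained models.

```python
import numpy as np

alpha = 0.02                    # illustrative; |x| well below 1/alpha stays ~linear
x = np.array([0.5, 5.0, 50.0, 500.0])
y = np.tanh(alpha * x) / alpha  # rescaled so the linear region maps x -> ~x

for xi, yi in zip(x, y):
    # Moderate inputs pass through nearly unchanged; extremes saturate
    # toward 1/alpha = 50, far below their original magnitude.
    print(f"{xi:7.1f} -> {yi:7.3f}")
```

A linear (affine) map could match either the central slope or the behavior at the extremes, but never both, which is the sense in which the squashing is non-linear and disproportionate.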

Recent findings by Ni et al. (2024) similarly highlight the strong non-linearities introduced by LN layers, demonstrating how the non-linearity enhances a model's representational capacity. Moreover, this squashing behavior mirrors the saturation properties of biological neurons for large inputs, a phenomenon first observed about a century ago (Adrian, 1926; Adrian and Zotterman, 1926a,b).

Normalization by tokens and channels. How does an LN layer perform a linear transformation for each token but also squash the extreme values in such a non-linear fashion? To understand this, we visualize the points grouped by tokens and channels, respectively. This is plotted in Figure 4 by taking the second and third subplots for ViT from Figure 2, but with a sampled subset of points for more clarity. When we select the channels to plot, we make sure to include the channels with extreme values.

On the left two panels of Figure 4, we visualize each token's activations using the same color. We observe that all points from any single token do form a straight line. However, since each token has a different variance, the slopes are different. Tokens with smaller input x ranges tend to have smaller variance, and the normalization layer will divide their activations using a smaller standard deviation, hence producing a larger slope in the straight line. Collectively, they form an S -shaped curve that resembles a tanh function. In the two panels on the right, we color each channel's activations using the same color. We find that different channels tend to have drastically different input ranges, with only a few channels (e.g., red, green, and pink) exhibiting large extreme values. These are the channels that get squashed the most by the normalization layer.
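This token-wise picture is easy to reproduce with toy data: per-token normalization is exactly affine within each token, with slope 1/std, so low-variance tokens get steeper lines. An illustrative numpy sketch with two synthetic Gaussian tokens (no learnable affine):

```python
import numpy as np

rng = np.random.default_rng(1)
small = rng.normal(scale=0.5, size=512)    # low-variance token
large = rng.normal(scale=10.0, size=512)   # high-variance token

def ln(t):
    # Per-token layer normalization, without the learnable affine transform.
    return (t - t.mean()) / t.std()

# Each token's (input, output) pairs lie on a straight line of slope 1/std.
slope_small = np.polyfit(small, ln(small), 1)[0]
slope_large = np.polyfit(large, ln(large), 1)[0]
print(slope_small, 1.0 / small.std())   # the fitted slope matches 1/std
print(slope_large, 1.0 / large.std())   # a much shallower line
```

Overlaying many such lines of varying slope, one per token, is what produces the collective tanh-like curve seen in Figure 2.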

Normalization in Transformers.

Analysis setup. We first empirically study the behaviors of normalization layers in trained networks. For this analysis, we take a Vision Transformer model (ViT-B) (Dosovitskiy et al., 2020) trained on ImageNet-1K (Deng et al., 2009), a wav2vec 2.0 Large Transformer model (Baevski et al., 2020) trained on LibriSpeech (Panayotov et al., 2015), and a Diffusion Transformer (DiT-XL) (Peebles and Xie, 2023) trained on ImageNet-1K. In all cases, LN is applied in every Transformer block and before the final linear projection.

For all three trained networks, we sample a mini-batch of samples and do a forward pass through the network. We then measure the input and output for the normalization layers, i.e., tensors immediately before and after the normalization operation, before the learnable affine transformation. Since LN preserves the dimensions of the input tensor, we can establish a one-to-one correspondence between the input and output tensor elements, allowing for a direct visualization of their relationship. We plot the resulting mappings in Figure 2.

Tanh-likemappingswithlayernormalization. For all three models, in earlier LN layers (1st column of Figure 2), we find this inputoutput relationship to be mostly linear, resembling a straight line in an x -y plot. However, the deeper LN layers are places where we make more intriguing observations.

A striking observation from these deeper layers is that most of these curves' shapes highly resemble full or partial S -shaped curves represented by a tanh function (see Figure 3). One might expect LN layers to linearly transform the input tensor, as subtracting the mean and dividing by standard deviation are

Figure3 tanh( αx ) with three different α values.

Figure3 tanh( αx ) with three different α values.

Figure 4 Output vs. input of two LN layers, with tensor elements colored to indicate different channel and token dimensions. The input tensor has a shape of (samples, tokens, and channels), with elements visualized by assigning consistent colors to the same tokens (left two panels) and channels (right two panels). Left two panels : points representing the same token (same color) form straight lines across different channels, as LN operates linearly across channels for each token. Interestingly, when plotted collectively, these lines form a non-linear tanh-shaped curve. Right two panels : each channel's input spans different ranges on the x -axis, contributing distinct segments to the overall tanh-shaped curve. Certain channels (e.g., red, green, and pink) exhibit more extreme x values, which are squashed by LN.

Figure 4 Output vs. input of two LN layers, with tensor elements colored to indicate different channel and token dimensions. The input tensor has a shape of (samples, tokens, and channels), with elements visualized by assigning consistent colors to the same tokens (left two panels) and channels (right two panels). Left two panels : points representing the same token (same color) form straight lines across different channels, as LN operates linearly across channels for each token. Interestingly, when plotted collectively, these lines form a non-linear tanh-shaped curve. Right two panels : each channel's input spans different ranges on the x -axis, contributing distinct segments to the overall tanh-shaped curve. Certain channels (e.g., red, green, and pink) exhibit more extreme x values, which are squashed by LN.


Removing normalization.

We begin by reviewing the normalization layers. Most normalization layers share a common formulation. Given an input x with shape (B, T, C), where B is the batch size, T is the number of tokens, and C is the embedding dimension per token, the output is generally computed as:

$$
\mathrm{normalization}(\bm{x}) = \bm{\gamma} \odot \frac{\bm{x} - \bm{\mu}}{\sqrt{\bm{\sigma}^{2} + \epsilon}} + \bm{\beta}
$$

where ϵ is a small constant, and γ and β are learnable vector parameters of shape (C,). They are 'scaling' and 'shifting' affine parameters that allow the output to be in any range. The terms μ and σ² denote the mean and variance of the input. Different methods mainly differ in how these two statistics are computed. This results in μ and σ² having different dimensions, each with broadcasting applied during computation.

Batch normalization (BN) (Ioffe and Szegedy, 2015) is the first modern normalization layer, and it has been primarily used in ConvNet models (Szegedy et al., 2016; He et al., 2016; Xie et al., 2017). Its introduction represents a major milestone in deep learning architecture designs. BN computes the mean and variance across both the batch and token dimensions, specifically: μ_k = (1/BT) ∑_{i,j} x_{ijk} and σ²_k = (1/BT) ∑_{i,j} (x_{ijk} − μ_k)². Other normalization layers popular in ConvNets, such as group normalization (Wu and He, 2018) and instance normalization (Ulyanov et al., 2016), were initially proposed for specialized tasks such as object detection and image stylization. They share the same overall formulation but differ in the axes and ranges over which the statistics are computed.
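To make the reduction axes concrete, here is a small NumPy sketch of the BN statistics described above (the shapes are illustrative, not from any experiment in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, C = 4, 8, 16                    # batch, tokens, channels (illustrative)
x = rng.normal(size=(B, T, C))
eps = 1e-5

# BN: one mean/variance per channel k, pooled over batch and token dims.
mu = x.mean(axis=(0, 1))              # shape (C,)
var = x.var(axis=(0, 1))              # shape (C,)
x_bn = (x - mu) / np.sqrt(var + eps)  # broadcasts over (B, T)
```

After this step, each channel is normalized across all B·T pooled values; the learnable γ and β would then be applied per channel.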

Layer normalization (LN) (Ba et al., 2016) and root mean square normalization (RMSNorm) (Zhang and Sennrich, 2019) are the two major types of normalization layers used in Transformer architectures. LN computes these statistics independently for each token in each sample, where μ_{ij} = (1/C) ∑_k x_{ijk} and σ²_{ij} = (1/C) ∑_k (x_{ijk} − μ_{ij})². RMSNorm (Zhang and Sennrich, 2019) simplifies LN by removing the mean-centering step and normalizing the input with μ_{ij} = 0 and σ²_{ij} = (1/C) ∑_k x²_{ijk}. Today, most modern neural networks use LN due to its simplicity and universality. Recently, RMSNorm has gained popularity, particularly in language models like T5 (Raffel et al., 2020), LLaMA (Touvron et al., 2023a,b; Dubey et al., 2024), Mistral (Jiang et al., 2023), Qwen (Bai et al., 2023; Yang et al., 2024), InternLM (Zhang et al., 2024; Cai et al., 2024) and DeepSeek (Liu et al., 2024; Guo et al., 2025). The Transformers we examine in this work all use LN, except that LLaMA uses RMSNorm.
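The per-token statistics can be sketched the same way: LN reduces over only the channel axis, and RMSNorm drops the centering step (a NumPy sketch with illustrative shapes):

```python
import numpy as np

rng = np.random.default_rng(1)
B, T, C = 2, 4, 8                          # illustrative sizes
x = rng.normal(size=(B, T, C))
eps = 1e-5

# LN: one mean/variance per token, reduced over the channel axis only.
mu = x.mean(axis=-1, keepdims=True)        # shape (B, T, 1)
var = x.var(axis=-1, keepdims=True)
x_ln = (x - mu) / np.sqrt(var + eps)

# RMSNorm: no centering; divide each token by its root mean square.
rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
x_rms = x / rms
```

The only difference from the BN case is which axes the statistics are pooled over, matching the common formulation above.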

Figure 2 Output vs. input of selected layer normalization (LN) layers in Vision Transformer (ViT) (Dosovitskiy et al., 2020), wav2vec 2.0 (a Transformer model for speech) (Baevski et al., 2020), and Diffusion Transformer (DiT) (Peebles and Xie, 2023). We sample a mini-batch of samples and plot the input / output values of four LN layers in each model. The outputs are before the affine transformation in LN. The S -shaped curves highly resemble that of a tanh function (see Figure 3). The more linear shapes in earlier layers can also be captured by the center part of a tanh curve. This motivates us to propose Dynamic Tanh (DyT) as a replacement, with a learnable scaler α to account for different scales on the x axis.


Limitations

Our experiments focus on networks using LN or RMSNorm because of their popularity in Transformers and other modern architectures. Preliminary experiments (see Appendix D) indicate that DyT struggles to replace BN directly in classic networks like ResNets. It remains to be studied in more depth whether and how DyT can adapt to models with other types of normalization layers.

Furthermore, although DyT is conceptually and computationally simpler, we find that DyT offers no speedup over models with normalization layers when properly compiled/optimized (see Appendix C). Its computational benefits across different hardware platforms or deployment environments remain uncertain.

Conclusion

In this work, we demonstrate that modern neural networks, in particular Transformers, can be trained without normalization layers. This is done through Dynamic Tanh (DyT), a simple replacement for traditional normalization layers. It adjusts the input activation range via a learnable scaling factor α and then squashes the extreme values through an S-shaped tanh function. Although it is a simpler function, it effectively captures the behavior of normalization layers. Under various settings, models with DyT match or exceed the performance of their normalized counterparts. The findings challenge the conventional understanding of the necessity of normalization layers in training modern neural networks. Our study also contributes to understanding the mechanisms of normalization layers, one of the most fundamental building blocks in deep neural networks.

Experiments

Experimental Settings

Supervised image classification. For all supervised classification experiments on ImageNet-1K, we follow the training recipes from ConvNeXt (Meta Research, a). For ConvNeXt-B and ConvNeXt-L, we use the original hyperparameters without modification. ViT-B and ViT-L models use the same hyperparameters as ConvNeXt-B, except that for ViT-L, the beta parameters for AdamW are set to (0.9, 0.95), and the stochastic depth rates are set to 0.1 for ViT-B and 0.4 for ViT-L.

Diffusion models. We use the official implementation (Meta Research, c) for training all DiT models. We find that the default learning rate is suboptimal for the models considered in this paper. To address this, we conduct a simple learning rate search with the LN models and apply the tuned learning rates directly to the DyT models. We also observe that the zero initialization negatively affects the performance of DyT models. Therefore, we retain the zero initialization for LN models but remove the zero initialization for DyT models.

Large Language Models. In our implementation of LLaMA models (Touvron et al., 2023a,b; Dubey et al., 2024) with DyT, we introduce an additional learnable scalar parameter immediately after the embedding layer, before any Transformer blocks. We initialize it to the square root of the model embedding dimension √ d . Without this scaling scalar, we find that the magnitudes of model activations at the beginning of training are too small, and the training struggles to progress. The issue is mitigated by incorporating a learnable scalar, and the model can converge normally. This addition of a scalar is similar to the original Transformer (Vaswani et al., 2017) design, which uses a fixed scalar of the same value at the same position.
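The extra scalar described above can be sketched as follows; the function name and plain-list interface are illustrative, not the paper's code, and only the initialization to √d is taken from the text:

```python
import math

d_model = 4096                             # illustrative embedding dimension
embed_scale = math.sqrt(d_model)           # initial value of the learnable scalar

def scale_embeddings(token_embeddings, scale=embed_scale):
    """Applied once, right after the embedding lookup, before any block."""
    return [[scale * v for v in tok] for tok in token_embeddings]
```

In training, `scale` would be a learnable parameter rather than a constant; initializing it to √d boosts the otherwise too-small activation magnitudes at the start of training.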

We train all our LLaMA models on the Pile dataset (Gao et al., 2020). We use the codebase from FMS-FSDP (Foundation Model Stack), which provides a default training recipe for the 7B model that closely follows the LLaMA 2 paper (Touvron et al., 2023b). We maintain the learning rate at the default 3e-4 for 7B and 13B and 1.5e-4 for 34B and 70B, in line with LLaMA 2. The batch size is set to 4M tokens and each model is trained on a total of 200B tokens.

For evaluation, we test the pretrained models on 15 zero-shot commonsense reasoning tasks from lm-eval (Gao et al.): anli_r1 , anli_r2 , anli_r3 , arc_challenge , arc_easy , boolq , hellaswag , openbookqa , piqa , record , rte , truthfulqa_mc1 , truthfulqa_mc2 , wic , and winogrande . The selection closely follows that of OpenLLaMA (Geng and Liu, 2023). We report the average performance across all tasks.

Self-supervised learning in speech. For both wav2vec 2.0 models, we retain the first group normalization layer from the original architecture, as it functions primarily as data normalization to handle the unnormalized input data. We use the official implementation (Meta Research, e) without modifying hyperparameters for both the Base and Large models. We report the final validation loss.

Other tasks. For all other tasks, MAE (He et al., 2022), DINO (Caron et al., 2021), HyenaDNA (Nguyen et al., 2024) and Caduceus (Schiff et al., 2024), we directly use the publicly released code (Meta Research, d,b; HazyResearch; Kuleshov Group), without hyperparameter tuning, for both models with LN and DyT.


Hyperparameters

We present additional experiments to evaluate the impact of hyperparameter tuning, specifically focusing on the learning rate and initialization of α for all non-LLM models.

Tuning learning rate. Table 12 summarizes performance comparisons between models trained with original versus tuned learning rates. Results indicate that tuning the learning rate provides only modest performance improvements for DyT models. This suggests that the original hyperparameters, initially optimized for LN models, are already well-suited for DyT models. This observation underscores the inherent similarity between the DyT and LN models.

Table 12 Performance comparison between original and tuned learning rates for LN and DyT models. Results show that tuning learning rates provides only modest performance improvements for DyT models, suggesting that the default hyperparameters optimized for LN models are already well-suited for DyT models. Entries marked with '-' indicate no performance gain over the original learning rate. The values in parentheses represent the learning rate used.

Tuning initial value of α. We also investigate the effects of optimizing α₀ for DyT models, as presented in Table 13. Findings show only minor performance enhancements for select models when α₀ is tuned, indicating that the default initial value (α₀ = 0.5) generally achieves near-optimal performance.

Table 13 Impact of tuning α₀ in DyT models. Optimizing α₀ away from the default value (α₀ = 0.5) yields only minor performance gains for select DyT models, implying the default initialization already achieves near-optimal performance. Entries marked with '-' indicate no improvement over the default α₀.


We find that tuning the initialization of α (denoted α₀) rarely leads to significant performance improvements. The only exception is LLM training, where careful tuning of α₀ yields noticeable performance gains. In this section, we detail our findings on the impact of α initialization.

Replacing Batch Normalization with DyT



Project page and code: jiachenzhu.github.io/DyT

Over the past decade, normalization layers have solidified their positions as one of the most fundamental components of modern neural networks. It all traces back to the invention of batch normalization in 2015 (Ioffe and Szegedy, 2015), which enabled drastically faster and better convergence in visual recognition models and quickly gained momentum in the following years. Since then, many variants of normalization layers have been proposed for different network architectures or domains (Ba et al., 2016; Ulyanov et al., 2016; Wu and He, 2018; Zhang and Sennrich, 2019). Today, virtually all modern networks use normalization layers, with layer normalization (Layer Norm, or LN) (Ba et al., 2016) being one of the most popular, particularly in the dominant Transformer architecture (Vaswani et al., 2017; Dosovitskiy et al., 2020).

The widespread adoption of normalization layers is largely driven by their empirical benefits in optimization (Santurkar et al., 2018; Bjorck et al., 2018). In addition to achieving better results, they help accelerate and stabilize convergence. As neural networks become wider and deeper, this necessity becomes ever more critical (Brock et al., 2021a; Huang et al., 2023). Consequently, normalization layers are widely regarded as crucial, if not indispensable, for the effective training of deep networks. This belief is subtly evidenced by the fact that, in recent years, novel architectures often seek to replace attention or convolution layers (Tolstikhin et al., 2021; Gu and Dao, 2023; Sun et al., 2024; Feng et al., 2024), but almost always retain the normalization layers.

This paper challenges this belief by introducing a simple alternative to normalization layers in Transformers. Our exploration starts with the observation that LN layers map their inputs to outputs with tanh-like, S-shaped curves, scaling the input activations while squashing the extreme values. Inspired by this insight, we propose an element-wise operation termed Dynamic Tanh (DyT), defined as DyT(x) = tanh(αx), where α is a learnable parameter. This operation aims to emulate the behavior of LN by learning an appropriate scaling factor through α and squashing extreme values via the bounded tanh function. Notably, unlike normalization layers, it achieves both effects without the need to compute activation statistics.

Employing DyT is straightforward, as shown in Figure 1: we directly replace existing normalization layers with DyT in architectures such as vision and language Transformers. We empirically demonstrate that models with DyT can train stably and achieve high final performance across a wide range of settings, often without tuning the training hyperparameters used by the original architecture. Our work challenges the notion that normalization layers are indispensable for training modern neural networks and provides empirical insights into the properties of normalization layers. Moreover, preliminary measurements suggest that DyT improves training and inference speed, making it a candidate for efficiency-oriented network design.


We first empirically study the behaviors of normalization layers in trained networks. For this analysis, we take a Vision Transformer model (ViT-B) (Dosovitskiy et al., 2020) trained on ImageNet-1K (Deng et al., 2009), a wav2vec 2.0 Large Transformer model (Baevski et al., 2020) trained on LibriSpeech (Panayotov et al., 2015), and a Diffusion Transformer (DiT-XL) (Peebles and Xie, 2023) trained on ImageNet-1K. In all cases, LN is applied in every Transformer block and before the final linear projection.

For all three trained networks, we sample a mini-batch of samples and do a forward pass through the network. We then measure the input and output for the normalization layers, i.e., tensors immediately before and after the normalization operation, before the learnable affine transformation. Since LN preserves the dimensions of the input tensor, we can establish a one-to-one correspondence between the input and output tensor elements, allowing for a direct visualization of their relationship. We plot the resulting mappings in Figure 2.

For all three models, in earlier LN layers (1st column of Figure 2), we find this input-output relationship to be mostly linear, resembling a straight line in an x-y plot. However, the deeper LN layers are where we make more intriguing observations.

A striking observation from these deeper layers is that most of these curves' shapes highly resemble full or partial S-shaped curves represented by a tanh function (see Figure 3). One might expect LN layers to linearly transform the input tensor, as subtracting the mean and dividing by the standard deviation are linear operations. LN normalizes in a per-token manner, only linearly transforming each token's activations. As tokens have different mean and standard deviation values, the linearity does not hold collectively on all activations of the input tensor. Nonetheless, it is still surprising to us that the actual non-linear transformation is highly similar to a scaled tanh function.

For such an S-shaped curve, we note that the central part, represented by points with x values close to zero, is still mainly linear. Most points (∼99%) fall in this linear range. However, there are still many points that clearly fall outside this range, which are considered to have 'extreme' values, e.g., those with x larger than 50 or smaller than -50 in the ViT model. Normalization layers' main effect on these values is to squash them into less extreme values, more in line with the majority of points. This is where normalization layers cannot be approximated by a simple affine transformation layer. We hypothesize that this non-linear and disproportionate squashing effect on extreme values is what makes normalization layers important and indispensable.

Recent findings by Ni et al. (2024) similarly highlight the strong non-linearities introduced by LN layers, demonstrating how the non-linearity enhances a model’s representational capacity. Moreover, this squashing behavior mirrors the saturation properties of biological neurons for large inputs, a phenomenon first observed about a century ago (Adrian, 1926; Adrian and Zotterman, 1926a, b).

Normalization by tokens and channels. How does an LN layer perform a linear transformation for each token, yet squash the extreme values in such a non-linear fashion? To understand this, we visualize the points grouped by tokens and channels, respectively. This is plotted in Figure 4 by taking the second and third subplots for ViT from Figure 2, but with a sampled subset of points for clarity. When we select the channels to plot, we make sure to include the channels with extreme values.

On the left two panels of Figure 4, we visualize each token's activations using the same color. We observe that all points from any single token do form a straight line. However, since each token has a different variance, the slopes are different. Tokens with smaller input x ranges tend to have smaller variance, and the normalization layer will divide their activations by a smaller standard deviation, hence producing a larger slope in the straight line. Collectively, they form an S-shaped curve that resembles a tanh function. In the two panels on the right, we color each channel's activations using the same color. We find that different channels tend to have drastically different input ranges, with only a few channels (e.g., red, green, and pink) exhibiting large extreme values. These are the channels that get squashed the most by the normalization layer.
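The per-token linearity described above can be checked numerically: for a fixed token, LN's output (before the affine transform) is exactly an affine function of the input with slope 1/σ, so smaller-spread tokens yield steeper lines. A toy NumPy check, not the paper's actual data:

```python
import numpy as np

rng = np.random.default_rng(2)
C = 64
token = rng.normal(scale=3.0, size=C)      # one token's activations
mu, sigma = token.mean(), token.std()
out = (token - mu) / sigma                 # LN output, before the affine

# For a fixed token the mapping is exactly affine: slope 1/sigma.
slope, intercept = 1.0 / sigma, -mu / sigma
assert np.allclose(out, slope * token + intercept)

# A token with smaller spread gets a larger slope; stacking many such
# lines of different slopes bends the collective cloud into an S shape.
narrow = rng.normal(scale=0.5, size=C)
print(1.0 / narrow.std() > slope)          # smaller std -> steeper line
```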

Inspired by the similarity between the shapes of normalization layers and a scaled tanh function, we propose Dynamic Tanh (DyT) as a drop-in replacement for normalization layers. Given an input tensor x, a DyT layer is defined as follows:

$$
\mathrm{DyT}(\bm{x}) = \bm{\gamma} \odot \tanh(\alpha \bm{x}) + \bm{\beta}
$$

where α is a learnable scalar parameter that allows scaling the input differently based on its range, accounting for varying x scales (Figure 2). This is also why we name the whole operation 'Dynamic' Tanh. γ and β are learnable, per-channel vector parameters, the same as those used in all normalization layers; they allow the output to scale back to any range. This is sometimes considered a separate affine layer; for our purposes, we consider them to be part of the DyT layer, just as normalization layers also include them. See Algorithm 1 for an implementation of DyT in PyTorch-like pseudocode.
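Algorithm 1 is not reproduced here; the following is a minimal, framework-free sketch of the same element-wise computation (the function name and list-based interface are illustrative, and γ, β default to the all-ones / all-zeros initialization described below):

```python
import math

def dyt(x, alpha, gamma=None, beta=None):
    """Dynamic Tanh over one token vector x of length C (a sketch).

    alpha is a scalar; gamma and beta are per-channel vectors. Unlike a
    normalization layer, no mean or variance is computed anywhere."""
    C = len(x)
    gamma = gamma if gamma is not None else [1.0] * C
    beta = beta if beta is not None else [0.0] * C
    return [g * math.tanh(alpha * v) + b for v, g, b in zip(x, gamma, beta)]
```

In a real model this would be a module with alpha, gamma, and beta registered as learnable parameters and applied element-wise over the whole (B, T, C) tensor.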

Integrating DyT layers into an existing architecture is straightforward: one DyT layer replaces one normalization layer (see Figure 1). This applies to normalization layers within attention blocks, FFN blocks, and the final normalization layer. Although DyT may look like or be considered an activation function, this study only uses it to replace normalization layers without altering any parts of the activation functions in the original architectures, such as GELU or ReLU. Other parts of the networks also remain intact. We also observe that there is little need to tune the hyperparameters used by the original architectures for DyT to perform well.

We always simply initialize γ to an all-one vector and β to an all-zero vector, following normalization layers. For the scalar parameter α, a default initialization of 0.5 is generally sufficient, except for LLM training. A detailed analysis of α initialization is provided in Section 7. Unless explicitly stated otherwise, α is initialized to 0.5 in our subsequent experiments.

DyT is not a new type of normalization layer, as it operates on each input element from a tensor independently during a forward pass without computing statistics or other types of aggregations. It does, however, preserve the effect of normalization layers in squashing the extreme values in a non-linear fashion while almost linearly transforming the very central parts of the input.

To demonstrate the effectiveness of DyT, we experiment with Transformers and a few other modern architectures across a diverse range of tasks and domains. In each experiment, we replace the LN or RMSNorm in the original architectures with DyT layers and follow the official open-source protocols to train and test both versions of the models. Detailed instructions for reproducing our results are provided in Appendix A. Notably, to highlight the simplicity of adapting DyT, we use hyperparameters identical to those utilized by the normalized counterparts. For completeness, additional experimental results regarding tuning of learning rates and initial values of α are provided in Appendix B.

We train Vision Transformer (ViT) (Dosovitskiy et al., 2020) and ConvNeXt (Liu et al., 2022) of “Base” and “Large” sizes on the ImageNet-1K classification task (Deng et al., 2009). These models are selected due to their popularity and distinct operations: attention in ViT and convolution in ConvNeXt. Table 1 reports the top-1 classification accuracies. DyT performs slightly better than LN across both architectures and model sizes. We further plot the training loss for ViT-B and ConvNeXt-B in Figure 5. The curves show that the convergence behaviors of DyT and LN-based models are highly aligned.

We benchmark with two popular visual self-supervised learning methods: masked autoencoders (MAE) (He et al., 2022) and DINO (Caron et al., 2021). Both by default use Vision Transformers as the backbones, but have different training objectives: MAE is trained with a reconstruction loss, and DINO uses a joint-embedding loss (LeCun, 2022). Following the standard self-supervised learning protocol, we first pretrain models on ImageNet-1K without using any labels and then test the pretrained models by attaching a classification layer and fine-tuning them with labels. The fine-tuning results are presented in Table 2. DyT consistently performs on par with LN in self-supervised learning tasks.

We train three Diffusion Transformer (DiT) models (Peebles and Xie, 2023) of sizes B, L, and XL on ImageNet-1K (Deng et al., 2009). The patch size is 4, 4, and 2, respectively. Note that in DiT, the LN layers' affine parameters are used for class conditioning, and we keep them that way in our DyT experiments, only replacing the normalizing transformation with the tanh(αx) function. After training, we evaluate the Fréchet Inception Distance (FID) scores using the standard ImageNet 'reference batch', as presented in Table 3. DyT achieves comparable or improved FID scores relative to LN.

We pretrain LLaMA 7B, 13B, 34B, and 70B models (Touvron et al., 2023a,b; Dubey et al., 2024) to assess DyT's performance relative to RMSNorm (Zhang and Sennrich, 2019), the default normalization layer used in LLaMA. The models are trained on The Pile dataset (Gao et al., 2020) with 200B tokens, following the original recipe outlined in LLaMA (Touvron et al., 2023b). On LLaMA with DyT, we add a learnable scalar parameter after the initial embedding layer, and adjust the initial value of α, as detailed in Section 7. We report the loss value after training and also follow OpenLLaMA (Geng and Liu, 2023) to benchmark the models on 15 zero-shot tasks from lm-eval (Gao et al.). As shown in Table 4, DyT performs on par with RMSNorm across all four model sizes. Figure 6 illustrates the loss curves, demonstrating similar trends across all model sizes, with training losses closely aligned throughout training.

We pretrain two wav2vec 2.0 Transformer models (Baevski et al., 2020) on the LibriSpeech dataset (Panayotov et al., 2015). We report the final validation loss in Table 5. We observe that DyT performs comparably to LN in both model sizes.

On the long-range DNA sequence modeling task, we pretrain the HyenaDNA model (Nguyen et al., 2024) and the Caduceus model (Schiff et al., 2024). The pretraining uses the human reference genome data from (GRCh38, 2013), and the evaluation is on GenomicBenchmarks (Grešová et al., 2023). The results are presented in Table 6. DyT maintains performance comparable to LN for this task.

We conduct several analyses on important properties of DyT. We begin by evaluating its computational efficiency, followed by two studies examining the roles of the tanh function and the learnable scale α. Finally, we present comparisons with previous methods that aim to remove normalization layers.

We benchmark the LLaMA 7B model with RMSNorm or DyT by measuring the total time taken for 100 forward passes (inference) and 100 forward-backward passes (training) using a single sequence of 4096 tokens. Table 7 reports the time required for all RMSNorm or DyT layers and the entire model when running on an Nvidia H100 GPU with BF16 precision. DyT layers significantly reduce computation time compared to RMSNorm layers, with a similar trend observed under FP32 precision. DyT may be a promising choice for efficiency-oriented network design.
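As a rough illustration of why DyT is cheaper, the following numpy sketch (a CPU stand-in, not the H100 benchmark above; `rmsnorm` and `dyt` are hypothetical helper names) contrasts RMSNorm's per-token reduction with DyT's purely elementwise computation:

```python
# Hypothetical CPU micro-benchmark sketch (numpy stand-in for the H100
# measurement in Table 7). RMSNorm needs a reduction over the channel
# dimension; DyT is purely elementwise.
import time
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    # Normalize each token by its root-mean-square over channels.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def dyt(x, alpha, gamma, beta):
    # DyT: tanh(alpha * x), scaled and shifted, with no reductions.
    return gamma * np.tanh(alpha * x) + beta

x = np.random.randn(4096, 1024).astype(np.float32)  # (tokens, channels)
w = np.ones(1024, dtype=np.float32)

t0 = time.perf_counter(); _ = rmsnorm(x, w); t_rms = time.perf_counter() - t0
t0 = time.perf_counter(); _ = dyt(x, 0.5, w, 0.0); t_dyt = time.perf_counter() - t0
print(f"rmsnorm: {t_rms * 1e3:.2f} ms, dyt: {t_dyt * 1e3:.2f} ms")
```

Timings from this sketch are not comparable to the GPU numbers in Table 7; the point is only that DyT avoids the cross-channel reduction.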

To further investigate the role of tanh and α in DyT, we conduct experiments to evaluate the model's performance when these components are altered or removed.

We replace tanh in DyT layers with alternative squashing functions, specifically hardtanh and sigmoid (Figure 8), while keeping the learnable scaler α intact. Furthermore, we assess the impact of completely removing tanh by replacing it with the identity function while still retaining α. As shown in Table 8, the squashing function is essential for stable training. Using the identity function leads to unstable training and divergence, whereas squashing functions enable stable training. Among them, tanh performs the best, possibly due to its smoothness and zero-centered properties.
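The ablation above can be mimicked with a small sketch of the three squashing variants (simple numpy definitions; `dyt_core` is an illustrative name, not the paper's code):

```python
# Sketch of the squashing-function ablation (Table 8): each variant keeps
# the scaler alpha and swaps only the bounded nonlinearity. The identity
# variant corresponds to the unstable, divergent ablation.
import numpy as np

def hardtanh(v):
    return np.clip(v, -1.0, 1.0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def dyt_core(x, alpha, squash):
    return squash(alpha * x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(dyt_core(x, 0.5, np.tanh))    # bounded in (-1, 1), zero-centered
print(dyt_core(x, 0.5, hardtanh))   # bounded in [-1, 1]
print(dyt_core(x, 0.5, sigmoid))    # bounded in (0, 1), not zero-centered
```

Note that sigmoid maps 0 to 0.5 rather than 0, which is one concrete sense in which tanh is zero-centered and sigmoid is not.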

Next, we evaluate the impact of removing the learnable α while retaining the squashing functions (tanh, hardtanh, and sigmoid). As shown in Table 9, removing α results in performance degradation across all squashing functions, highlighting its critical role in overall model performance.

Our analysis reveals that α closely tracks 1/std of the activations throughout training. As illustrated in the left panel of Figure 8, α first decreases and then increases during training, but always fluctuates consistently with the standard deviation of the input activations. This supports the important role of α in maintaining activations within a suitable range, leading to stable and effective training.
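The α ≈ 1/std relationship can be checked with a toy numerical experiment (not the paper's measurement): setting α to the inverse standard deviation of the input brings the pre-tanh activations to unit scale regardless of how the inputs are scaled.

```python
# Toy numerical check of the alpha ~ 1/std observation: with alpha set to
# the inverse standard deviation of the input, the pre-tanh activations
# have unit scale whatever the input scale is.
import numpy as np

rng = np.random.default_rng(0)
for scale in (0.1, 1.0, 10.0):
    x = rng.normal(0.0, scale, size=100_000)
    alpha = 1.0 / x.std()      # the value alpha is observed to track
    pre = alpha * x            # pre-tanh activations
    print(scale, round(float(pre.std()), 3))  # ~1.0 in every case
```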

Our further analysis of the final values of α in trained networks reveals a strong correlation with the 1/std of the input activations. As shown in the right panel of Figure 8, higher 1/std values generally correspond to larger α values, and vice versa. Additionally, we observe that deeper layers tend to have activations with larger standard deviations. This trend aligns with characteristics of deep residual networks, as shown in Brock et al. (2021a) for ConvNets and Sun et al. (2025) for Transformers.

Both analyses suggest that α functions partially as a normalization mechanism by learning values approximating 1/std of the input activations. Unlike LN, which normalizes the activations per token, α normalizes the entire input activations collectively. Consequently, α alone cannot suppress extreme values in a non-linear fashion.
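A toy example of this last point: a single global scale preserves the relative magnitude of an outlier, while tanh squashes it non-linearly.

```python
# Toy example: a global scale alpha keeps an outlier 500x larger than the
# smallest value, while tanh saturates it to a bounded range.
import numpy as np

x = np.array([0.1, 0.2, 0.3, 50.0])   # one extreme activation
alpha = 1.0 / x.std()

scaled = alpha * x                    # linear rescaling: ratios preserved
squashed = np.tanh(alpha * x)         # DyT-style squashing: outlier saturates

print(scaled.max() / scaled.min())    # ~500, still extreme
print(squashed.max())                 # < 1, bounded
```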

To further assess DyT's effectiveness, we compare it with other methods that also enable training Transformers without normalization layers. These methods can be broadly categorized into initialization-based and weight-normalization-based methods. We consider two popular initialization-based methods, Fixup (Zhang et al., 2019; Huang et al., 2020) and SkipInit (De and Smith, 2020; Bachlechner et al., 2021). Both methods aim to mitigate training instabilities by adjusting the initial parameter values to prevent large gradients and activations at the start of training, thereby enabling stable learning without normalization layers. In contrast, weight-normalization-based methods impose constraints on network weights throughout training to maintain stable learning dynamics in the absence of normalization layers. We include one such method, σReparam (Zhai et al., 2023), which controls the spectral norm of the weights to promote stable learning.

Table 10 summarizes the results of two ViT-based tasks. We closely follow the original protocols outlined in the respective papers. However, we find that both initialization-based methods, Fixup and SkipInit, require significantly lower learning rates to prevent training divergence. To ensure a fair comparison, we conduct a simple learning rate search for all methods, including DyT. This produces results that differ from those reported in Section 5, where no hyperparameters are tuned. Overall, the results show that DyT consistently outperforms all other tested methods across different configurations.

We find that tuning the initialization of α (denoted α₀) rarely leads to significant performance improvements. The only exception is LLM training, where careful tuning of α₀ yields noticeable performance gains. In this section, we detail our findings on the impact of α initialization.

Figure 9 shows the effect of varying α₀ on validation performance across different tasks. All experiments follow the original setup and hyperparameters of their respective recipes. Performance remains stable across a wide range of α₀ values, with values between 0.5 and 1.2 generally yielding good results. Adjusting α₀ typically affects only the early stages of the training curves. The main exception is the supervised ViT-L experiment, where training becomes unstable and diverges when α₀ exceeds 0.6. In such cases, reducing the learning rate restores stability, as detailed below.

Building on these observations, we further analyze the factors contributing to training instability. Our findings suggest that increasing either the model size or the learning rate requires lowering α₀ to ensure stable training. Conversely, a higher α₀ requires a lower learning rate to mitigate training instability. Figure 10 shows an ablation of training stability for supervised ViT on the ImageNet-1K dataset, varying learning rates, model sizes, and α₀ values. Larger models are more prone to training failure, requiring smaller α₀ values or learning rates for stable training. A similar instability pattern is also observed in LN-based models under comparable conditions, and setting α₀ = 0.5 results in a stability pattern similar to that of LN.

Based on our findings, we set α₀ = 0.5 as the default value for all non-LLM models. This setting provides training stability comparable to LN while maintaining strong performance.

As discussed earlier, the default setting of α₀ = 0.5 generally performs well across most tasks. However, we find that tuning α₀ can substantially improve LLM performance. We tune α₀ across LLaMA models by pretraining each on 30B tokens and comparing their training losses. Table 11 summarizes the tuned α₀ values for each model. Two key findings emerge:

Larger models require smaller α₀ values. Once the optimal α₀ is determined for smaller models, the search space for larger models can be reduced accordingly.

Higher α₀ values for attention blocks improve performance. We find that initializing α with higher values for DyT layers in attention blocks and lower values for DyT layers in other locations (i.e., within FFN blocks or before the final linear projection) improves performance.

To further illustrate the impact of α₀ tuning, Figure 11 presents heatmaps of loss values for two LLaMA models. Both models benefit from higher α₀ in attention blocks, leading to reduced training loss.

We also investigate the influence of model width and depth on the optimal α₀. As Table 12 shows, model width is critical in determining the optimal α₀, with wider networks requiring smaller values, while model depth has negligible influence.

As can be seen in Table 12, the wider the network, the more uneven the initialization for "attention" and "other" needs to be. We hypothesize that the sensitivity of LLMs' α initialization is related to their excessively large widths compared to other models.
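The per-location initialization rule from Table 11 can be sketched as a small helper (a hypothetical function; the released training code may organize this differently):

```python
# Hypothetical helper encoding the tuned per-location initialization from
# Table 11: DyT layers in attention blocks get a larger alpha_0 than those
# in FFN blocks or before the final projection ("other").
ALPHA0 = {
    "LLaMA 7B":  (0.8, 0.2),
    "LLaMA 13B": (0.6, 0.15),
    "LLaMA 34B": (0.2, 0.05),
    "LLaMA 70B": (0.2, 0.05),
}

def init_alpha(model, in_attention_block):
    attn, other = ALPHA0[model]
    return attn if in_attention_block else other

print(init_alpha("LLaMA 7B", True), init_alpha("LLaMA 7B", False))  # 0.8 0.2
```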

There has been a rich line of work investigating normalization layers’ role in enhancing model performance through various mechanisms. These include stabilizing gradient flow during training (Balduzzi et al., 2017; Daneshmand et al., 2020; Lubana et al., 2021), reducing sensitivity to weight initialization (Zhang et al., 2019; De and Smith, 2020; Shao et al., 2020), moderating outlier eigenvalues (Bjorck et al., 2018; Karakida et al., 2019), auto-tuning learning rates (Arora et al., 2018; Tanaka and Kunin, 2021), and smoothing the loss landscape for more stable optimization (Santurkar et al., 2018). These earlier works focused on studying batch normalization. Recent studies (Lyu et al., 2022; Dai et al., 2024; Mueller et al., 2024) further highlight the connection between normalization layers and sharpness reduction, which contributes to better generalization.

With the rise of Transformer (Vaswani et al., 2017), research has increasingly focused on layer normalization (Ba et al., 2016), which has proven particularly effective for sequential data in natural language tasks (Nguyen and Salazar, 2019; Xu et al., 2019; Xiong et al., 2020). Recent work (Ni et al., 2024) reveals that layer normalization introduces strong non-linearity, enhancing the model’s representational capacity. Additionally, studies (Loshchilov et al., 2024; Li et al., 2024) demonstrate that modifying the location of normalization layers within Transformers can improve convergence properties.

Many studies have explored how to train deep models without normalization layers. Several works (Zhang et al., 2019; De and Smith, 2020; Bachlechner et al., 2021) explore alternative weight initialization schemes to stabilize training. The pioneering work by Brock et al. (2021a, b) shows that high-performing ResNets can be trained without normalization (Smith et al., 2023) through a combination of initialization techniques (De and Smith, 2020), weight normalization (Salimans and Kingma, 2016; Huang et al., 2017; Qiao et al., 2019), and adaptive gradient clipping (Brock et al., 2021b). Additionally, their training strategy incorporates extensive data augmentation (Cubuk et al., 2020) and regularization (Srivastava et al., 2014; Huang et al., 2016). The studies above are based on various ConvNet models.

In Transformer architectures, He and Hofmann (2023) explore modifications to Transformer blocks that reduce reliance on normalization layers and skip connections. Alternatively, Heimersheim (2024) proposes a method to gradually remove LN from pretrained networks by fine-tuning the model after removing each normalization layer. Unlike previous approaches, DyT requires minimal modifications to both the architecture and the training recipe. Despite its simplicity, DyT achieves stable training and comparable performance.

We conduct experiments on networks using either LN or RMSNorm because of their popularity in Transformers and other modern architectures. Preliminary experiments (see Appendix C) indicate that DyT struggles to replace BN directly in classic networks like ResNets. It remains to be studied in more depth whether and how DyT can adapt to models with other types of normalization layers.

In this work, we demonstrate that modern neural networks, in particular Transformers, can be trained without normalization layers. This is done through Dynamic Tanh (DyT), a simple replacement for traditional normalization layers. It adjusts the input activation range via a learnable scaling factor α and then squashes the extreme values through an S-shaped tanh function. Although a simpler function, it effectively captures the behavior of normalization layers. Under various settings, models with DyT match or exceed the performance of their normalized counterparts. These findings challenge the conventional understanding of the necessity of normalization layers in training modern neural networks. Our study also contributes to understanding the mechanisms of normalization layers, one of the most fundamental building blocks in deep neural networks.

For all supervised classification experiments on ImageNet-1K, we follow the training recipes from ConvNeXt (Meta Research, a). For ConvNeXt-B and ConvNeXt-L, we use the original hyperparameters without modification. ViT-B and ViT-L models use the same hyperparameters as ConvNeXt-B, except that for ViT-L, the beta parameters for AdamW are set to (0.9, 0.95), and the stochastic depth rates are set to 0.1 for ViT-B and 0.4 for ViT-L.

We use the official implementation (Meta Research, c) for training all DiT models. We find that the default learning rate is suboptimal for the models considered in this paper. To address this, we conduct a simple learning rate search with the LN models and apply the tuned learning rates directly to the DyT models. We also observe that zero initialization negatively affects the performance of DyT models; we therefore retain it for LN models but remove it for DyT models.

In our implementation of LLaMA models (Touvron et al., 2023a, b; Dubey et al., 2024) with DyT, we introduce an additional learnable scalar parameter immediately after the embedding layer, before any Transformer blocks. We initialize it to √d, the square root of the model embedding dimension. Without this scaling scalar, we find that the magnitudes of model activations at the beginning of training are too small, and training struggles to progress. Incorporating the learnable scalar mitigates this issue, and the model converges normally. This addition is similar to the original Transformer (Vaswani et al., 2017) design, which uses a fixed scalar of the same value at the same position.
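A minimal sketch of this scalar (the class name is illustrative; in the real model the scale is a learnable parameter rather than a fixed float):

```python
# Minimal sketch of the extra scalar placed after the embedding layer.
# In the actual model this is a learnable parameter initialized to sqrt(d);
# here it is a plain float to keep the sketch dependency-free.
import math

class EmbeddingScale:
    def __init__(self, d_model):
        self.scale = math.sqrt(d_model)   # initialized to sqrt(embedding dim)

    def __call__(self, embedding_values):
        return [self.scale * v for v in embedding_values]

s = EmbeddingScale(4096)                  # LLaMA 7B width from Table 11
print(s.scale)                            # 64.0
```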

We train all our LLaMA models on the Pile dataset (Gao et al., 2020). We use the codebase from FMS-FSDP (Foundation Model Stack, ), which provides a default training recipe for the 7B model that closely follows the LLaMA 2 paper (Touvron et al., 2023b). We maintain the learning rate at the default 3e-4 for 7B and 13B and 1.5e-4 for 34B and 70B, in line with LLaMA 2. The batch size is set to 4M tokens and each model is trained on a total of 200B tokens.

For evaluation, we test the pretrained models on 15 zero-shot commonsense reasoning tasks from lm-eval (Gao et al.): anli_r1, anli_r2, anli_r3, arc_challenge, arc_easy, boolq, hellaswag, openbookqa, piqa, record, rte, truthfulqa_mc1, truthfulqa_mc2, wic, and winogrande. The selection closely follows that of OpenLLaMA (Geng and Liu, 2023). We report the average performance across all tasks.

For both wav2vec 2.0 models, we retain the first group normalization layer from the original architecture, as it functions primarily as data normalization to handle the unnormalized input data. We use the official implementation (Meta Research, e) without modifying hyperparameters for both the Base and Large models. We report the final validation loss.

For all other tasks, MAE (He et al., 2022), DINO (Caron et al., 2021), HyenaDNA (Nguyen et al., 2024) and Caduceus (Schiff et al., 2024), we directly use the publicly released code (Meta Research, d, b; HazyResearch, ; Kuleshov Group, ), without hyperparameter tuning, for both models with LN and DyT.

We present additional experiments to evaluate the impact of hyperparameter tuning, specifically focusing on the learning rate and the initialization of α, for all non-LLM models.

Table 13 summarizes performance comparisons between models trained with original versus tuned learning rates. Results indicate that tuning the learning rate provides only modest performance improvements for DyT models. This suggests that the original hyperparameters, initially optimized for LN models, are already well-suited for DyT models. This observation underscores the inherent similarity between the DyT and LN models.

We also investigate the effects of tuning α₀ for DyT models, as presented in Table 14. Findings show only minor performance enhancements for select models when α₀ is tuned, indicating that the default initial value (α₀ = 0.5) generally achieves near-optimal performance.

We investigate the potential of replacing BN with DyT in classic ConvNets such as ResNet-50 (He et al., 2016) and VGG19 (Simonyan and Zisserman, 2014). Both models are trained on the ImageNet-1K dataset (Deng et al., 2009) using the training recipes provided by torchvision. The DyT models are trained using the same hyperparameters as their BN counterparts.

The results are summarized in Table 16. Replacing BN with DyT leads to a noticeable drop in classification accuracy for both models. These findings indicate that DyT struggles to fully replace BN in these classic ConvNets. We hypothesize this could be related to BN layers being more frequent in these ConvNets: they appear once with every weight layer, whereas LN appears only once per several weight layers in Transformers.

Table: S5.T1: Supervised classification accuracy on ImageNet-1K. DyT achieves performance better than or similar to LN across both architectures and model sizes.

| model | LN | DyT | change |
|---|---|---|---|
| ViT-B | 82.3% | 82.5% | ↑ 0.2% |
| ViT-L | 83.1% | 83.6% | ↑ 0.5% |
| ConvNeXt-B | 83.7% | 83.7% | - |
| ConvNeXt-L | 84.3% | 84.4% | ↑ 0.1% |

Table: S5.T4: Language models' training loss and average performance on 15 zero-shot lm-eval tasks. DyT achieves zero-shot performance and training loss comparable to RMSNorm.

| score / loss | RMSNorm | DyT | change |
|---|---|---|---|
| LLaMA 7B | 0.513 / 1.59 | 0.513 / 1.60 | - / ↑ 0.01 |
| LLaMA 13B | 0.529 / 1.53 | 0.529 / 1.54 | - / ↑ 0.01 |
| LLaMA 34B | 0.536 / 1.50 | 0.536 / 1.50 | - / - |
| LLaMA 70B | 0.549 / 1.45 | 0.549 / 1.45 | - / - |

Table: S6.T7: Inference and training latency (BF16 precision) for LLaMA 7B with RMSNorm or DyT. DyT achieves a substantial reduction in both inference and training time.

| LLaMA 7B | inference (layer) | inference (model) | training (layer) | training (model) |
|---|---|---|---|---|
| RMSNorm | 2.1s | 14.1s | 8.3s | 42.6s |
| DyT | 1.0s | 13.0s | 4.8s | 39.1s |
| reduction | ↓ 52.4% | ↓ 7.8% | ↓ 42.2% | ↓ 8.2% |

Table: S6.T8: ImageNet-1K classification accuracy with different squashing functions. All experiments follow the same training recipe as the original LN-based models. Squashing functions play a crucial role in preventing divergence, with tanh achieving the highest performance among the three functions. "→ failed" indicates that training diverged after some progress, with the preceding number representing the highest accuracy reached before divergence.

| model | identity | tanh | hardtanh | sigmoid |
|---|---|---|---|---|
| ViT-S | 58.5% → failed | 80.3% | 79.9% | 79.6% |
| ViT-B | 61.0% → failed | 82.5% | 82.2% | 81.6% |

Table: S7.T11: Optimal α₀ for different LLaMA models. Larger models require smaller α₀ values. We find it is important to initialize α differently in (1) attention blocks ("attention") versus (2) the FFN blocks and the final DyT layer before outputs ("other"). α₀ in attention blocks requires larger values.

| model | width | depth | optimal α₀ (attention / other) |
|---|---|---|---|
| LLaMA 7B | 4096 | 32 | 0.8 / 0.2 |
| LLaMA 13B | 5120 | 40 | 0.6 / 0.15 |
| LLaMA 34B | 8192 | 48 | 0.2 / 0.05 |
| LLaMA 70B | 8192 | 80 | 0.2 / 0.05 |

Table: S7.T12: Optimal α₀ (attention / other) across model widths and depths in LLaMA training. Model width significantly impacts the choice of α₀, with wider networks requiring smaller values. In contrast, model depth has negligible influence.

| width \ depth | 8 | 16 | 32 | 64 |
|---|---|---|---|---|
| 1024 | 1.0 / 1.0 | 1.0 / 1.0 | 1.0 / 1.0 | 1.0 / 1.0 |
| 2048 | 1.0 / 0.5 | 1.0 / 0.5 | 1.0 / 0.5 | 1.0 / 0.5 |
| 4096 | 0.8 / 0.2 | 0.8 / 0.2 | 0.8 / 0.2 | 0.8 / 0.2 |
| 8192 | 0.2 / 0.05 | 0.2 / 0.05 | 0.2 / 0.05 | 0.2 / 0.05 |

Table: A2.T13: Performance comparison between original and tuned learning rates for LN and DyT models. Results show that tuning learning rates provides only modest performance improvements for DyT models, suggesting that the default hyperparameters optimized for LN models are already well-suited for DyT models. Entries marked with "-" indicate no performance gain over the original learning rate. The values in parentheses represent the learning rate used.

| model | LN (original) | DyT (original) | LN (tuned) | DyT (tuned) |
|---|---|---|---|---|
| ViT-B | 82.3% (4e-3) | 82.5% (4e-3) | - | 82.8% (6e-3) |
| ViT-L | 83.1% (4e-3) | 83.6% (4e-3) | - | - |
| ConvNeXt-B | 83.7% (4e-3) | 83.7% (4e-3) | - | - |
| ConvNeXt-L | 84.3% (4e-3) | 84.4% (4e-3) | - | - |
| MAE ViT-B | 83.2% (2.4e-3) | 83.2% (2.4e-3) | - | 83.7% (3.2e-3) |
| MAE ViT-L | 85.5% (2.4e-3) | 85.4% (2.4e-3) | - | 85.8% (3.2e-3) |
| DINO ViT-B (patch size 16) | 83.2% (7.5e-4) | 83.4% (7.5e-4) | 83.3% (1e-3) | - |
| DINO ViT-B (patch size 8) | 84.1% (5e-4) | 84.5% (5e-4) | - | - |
| DiT-B | 64.9 (4e-4) | 63.9 (4e-4) | - | - |
| DiT-L | 45.9 (4e-4) | 45.7 (4e-4) | - | - |
| DiT-XL | 19.9 (4e-4) | 20.8 (4e-4) | - | - |
| wav2vec 2.0 Base | 1.95 (5e-4) | 1.95 (5e-4) | - | 1.94 (6e-4) |
| wav2vec 2.0 Large | 1.92 (3e-4) | 1.91 (3e-4) | - | - |
| HyenaDNA | 85.2% (6e-4) | 85.2% (6e-4) | - | - |
| Caduceus | 86.9% (8e-3) | 86.9% (8e-3) | - | - |

Table: A2.T14: Impact of tuning α₀ in DyT models. Optimizing α₀ from the default value (α₀ = 0.5) yields only minor performance gains for select DyT models, implying the default initialization already achieves near-optimal performance. Entries marked with "-" indicate no improvement over the default α₀.

| Model | LN | DyT (α₀ = 0.5) | DyT (tuned) |
|---|---|---|---|
| ViT-B | 82.3% | 82.5% | 82.6% (α₀ = 1.0) |
| ViT-L | 83.1% | 83.6% | - |
| ConvNeXt-B | 83.7% | 83.7% | - |
| ConvNeXt-L | 84.3% | 84.4% | - |
| MAE ViT-B | 83.2% | 83.2% | 83.4% (α₀ = 1.0) |
| MAE ViT-L | 85.5% | 85.4% | - |
| DINO ViT-B (patch 16) | 83.2% | 83.4% | - |
| DINO ViT-B (patch 8) | 84.1% | 84.5% | - |
| DiT-B | 64.9 | 63.9 | - |
| DiT-L | 45.9 | 45.7 | - |
| DiT-XL | 19.9 | 20.8 | - |
| wav2vec 2.0 Base | 1.95 | 1.95 | - |
| wav2vec 2.0 Large | 1.92 | 1.91 | 1.90 (α₀ = 1.0) |
| HyenaDNA | 85.2% | 85.2% | - |
| Caduceus | 86.9% | 86.9% | - |

Figure: Left: original Transformer block. Right: block with our proposed Dynamic Tanh (DyT) layer. DyT is a straightforward replacement for commonly used Layer Norm (Ba et al., 2016) (in some cases RMSNorm (Zhang and Sennrich, 2019)) layers. Transformers with DyT match or exceed the performance of their normalized counterparts.

Figure: Output vs. input of selected layer normalization (LN) layers in Vision Transformer (ViT) (Dosovitskiy et al., 2020), wav2vec 2.0 (a Transformer model for speech) (Baevski et al., 2020), and Diffusion Transformer (DiT) (Peebles and Xie, 2023). We sample a mini-batch of samples and plot the input / output values of four LN layers in each model. The outputs are before the affine transformation in LN. The S-shaped curves highly resemble that of a tanh function (see Figure 3). The more linear shapes in earlier layers can also be captured by the center part of a tanh curve. This motivates us to propose Dynamic Tanh (DyT) as a replacement, with a learnable scaler α to account for different scales on the x axis.

Figure: tanh(αx) with three different α values.

Figure: Output vs. input of two LN layers, with tensor elements colored to indicate different channel and token dimensions. The input tensor has a shape of (samples, tokens, channels), with elements visualized by assigning consistent colors to the same tokens (left two panels) and channels (right two panels). Left two panels: points representing the same token (same color) form straight lines across different channels, as LN operates linearly across channels for each token. Interestingly, when plotted collectively, these lines form a non-linear tanh-shaped curve. Right two panels: each channel's input spans different ranges on the x-axis, contributing distinct segments to the overall tanh-shaped curve. Certain channels (e.g., red, green, and pink) exhibit more extreme x values, which are squashed by LN.

Figure: LLaMA pretraining loss. The loss curves of DyT and RMSNorm models are closely aligned across model sizes.

Figure: Curves of three squashing functions: tanh, hardtanh, and sigmoid. All three functions squash inputs into a bounded range, but tanh(x) achieves the best performance when used in DyT layers. We suspect this is due to its smoothness and zero-centered properties.

Figure: Left: For two selected DyT layers from the ViT-B model, we track α and the inverse of the standard deviation (1/std) of activations at the end of each epoch, observing that they evolve together during training. Right: We plot the final α values of two trained models, ViT-B and ConvNeXt-B, against the 1/std of the input activations, demonstrating a strong correlation between the two values.

Figure: Performance of different tasks across different α₀ values. We benchmark the performance of all non-LLM tasks used in Section 5 with different initial values of α. Performance remains stable across a wide range of α₀ values. The only exception is that supervised ViT-L models (top right panel) diverge for α₀ values larger than 0.6.

Figure: Stability across varying α₀ values, learning rates, and model sizes. We train supervised ViT models on the ImageNet-1K dataset and observe that larger models are more prone to instability for both LN and DyT models. Lowering the learning rate or reducing α₀ enhances stability. LN shows similar stability to DyT with α₀ = 0.5.

Figure: Heatmaps of loss values at 30B tokens for different α₀ settings. Both LLaMA models benefit from increased α₀ in attention blocks.

$$ \mathrm{normalization}(\bm{x}) = \bm{\gamma} * \left( \frac{\bm{x} - \bm{\mu}}{\sqrt{\bm{\sigma}^2 + \epsilon}} \right) + \bm{\beta} $$
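A minimal numpy rendering of this equation as per-token layer normalization over the channel dimension (the function name is illustrative):

```python
# Reference numpy rendering of the normalization equation: per-token
# statistics over the channel dimension, followed by the affine transform.
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.random.randn(2, 3, 8)              # (samples, tokens, channels)
y = layer_norm(x, np.ones(8), np.zeros(8))
print(np.abs(y.mean(axis=-1)).max())      # ~0: each token is zero-mean
```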

Table 16: ImageNet-1K classification accuracy with BN and DyT. Replacing BN with DyT in ResNet-50 and VGG19 results in a performance drop, indicating that DyT cannot fully substitute BN in these architectures.

Table: Self-supervised learning accuracy on ImageNet-1K with MAE and DINO pretraining.

| model | LN | DyT | change |
|---|---|---|---|
| MAE ViT-B | 83.2% | 83.2% | - |
| MAE ViT-L | 85.5% | 85.4% | ↓ 0.1% |
| DINO ViT-B (patch size 16) | 83.2% | 83.4% | ↑ 0.2% |
| DINO ViT-B (patch size 8) | 84.1% | 84.5% | ↑ 0.4% |
Table: FID scores of DiT models on ImageNet (lower is better).

| model | LN | DyT | change |
|---|---|---|---|
| DiT-B | 64.9 | 63.9 | ↓ 1.0 |
| DiT-L | 45.9 | 45.7 | ↓ 0.2 |
| DiT-XL | 19.9 | 20.8 | ↑ 0.9 |
Table: wav2vec 2.0 validation loss on LibriSpeech.

| model | LN | DyT | change |
|---|---|---|---|
| wav2vec 2.0 Base | 1.95 | 1.95 | - |
| wav2vec 2.0 Large | 1.92 | 1.91 | ↓ 0.01 |
Table: DNA classification accuracy on GenomicBenchmarks.

| model | LN | DyT | change |
|---|---|---|---|
| HyenaDNA (Nguyen et al., 2024) | 85.2% | 85.2% | - |
| Caduceus (Schiff et al., 2024) | 86.9% | 86.9% | - |
Table: ImageNet-1K classification accuracy with and without the learnable α.

|  | tanh | hardtanh | sigmoid |
|---|---|---|---|
| without α | 81.1% | 80.7% | 80.7% |
| with α | 82.5% | 82.2% | 81.6% |
Table: Comparison with other normalization-free methods (classification accuracy on ImageNet-1K).

| model | LN | Fixup | SkipInit | σReparam | DyT |
|---|---|---|---|---|---|
| ViT-B | 82.3% | 77.2% | 74.1% | 82.5% | 82.8% |
| ViT-L | 83.1% | 78.1% | 75.6% | 83.0% | 83.6% |
| MAE ViT-B | 83.2% | 73.7% | 73.1% | 83.2% | 83.7% |
| MAE ViT-L | 85.5% | 74.1% | 74.0% | 85.4% | 85.8% |
| LLaMA 7B | inference (layer) | inference (model) | training (layer) | training (model) |
|---|---|---|---|---|
| RMSNorm | 0.3s | 12.3s | 3.9s | 38.9s |
| DyT | 0.3s | 12.3s | 3.9s | 38.9s |
| model | BN | DyT |
|---|---|---|
| ResNet-50 | 76.2% | 68.9% |
| VGG19 | 72.7% | 71.0% |

$$ \mathrm{DyT}(\bm{x}) = \bm{\gamma} * \tanh(\alpha \bm{x}) + \bm{\beta} $$

Algorithm: Pseudocode of DyT layer.

```
# input x has the shape of [B, T, C]
# B: batch size, T: tokens, C: dimension

class DyT(Module):
    def __init__(self, C, init_alpha):
        super().__init__()
        self.alpha = Parameter(ones(1) * init_alpha)
        self.gamma = Parameter(ones(C))
        self.beta = Parameter(zeros(C))

    def forward(self, x):
        x = tanh(self.alpha * x)
        return self.gamma * x + self.beta
```
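For reference, a runnable numpy version of the DyT pseudocode (a sketch; the actual layer uses framework modules with learnable parameters):

```python
# Runnable numpy sketch of the DyT layer. In practice alpha, gamma, and
# beta are learnable parameters; here they are plain arrays.
import numpy as np

class DyT:
    def __init__(self, C, init_alpha=0.5):
        self.alpha = np.ones(1) * init_alpha   # learnable scalar in practice
        self.gamma = np.ones(C)                # per-channel scale
        self.beta = np.zeros(C)                # per-channel shift

    def forward(self, x):
        x = np.tanh(self.alpha * x)
        return self.gamma * x + self.beta

layer = DyT(C=8)
x = np.random.randn(2, 4, 8)                   # [B, T, C]
y = layer.forward(x)
print(y.shape)                                 # (2, 4, 8)
```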
modelLNDyTchange
ViT-B82.3%82.5%↑ 0.2%
ViT-L83.1%83.6%↑ 0.5%
ConvNeXt-B83.7%83.7%-
ConvNeXt-L84.3%84.4%↑ 0.1%
modelLNDyTchange
MAE ViT-B83.2%83.2%-
MAE ViT-L85.5%85.4%↓ 0.1%
DINO ViT-B (patch size 16)83.2%83.4%↑ 0.2%
DINO ViT-B (patch size 8)84.1%84.5%↑ 0.4%
modelLNDyTchange
DiT-B64.963.9↓ 1.0
DiT-L45.945.7↓ 0.2
DiT-XL19.920.8↑ 0.9
score / lossRMSNormDyTchange
LLaMA 7B0.513 / 1.590.513 / 1.60- / ↑ 0.01
LLaMA 13B0.529 / 1.530.529 / 1.54- / ↑ 0.01
LLaMA 34B0.536 / 1.500.536 / 1.50- / -
LLaMA 70B0.549 / 1.450.549 / 1.45- / -
modelLNDyTchange
wav2vec 2.0 Base1.951.95-
wav2vec 2.0 Large1.921.91↓ 0.01
modelLNDyTchange
HyenaDNA (Nguyen et al., 2024)85.2%85.2%-
Caduceus (Schiff et al., 2024)86.9%86.9%-
modelidentitytanhhardtanhsigmoid
ViT-S58.5% → failed80.3%79.9%79.6%
ViT-B61.0% → failed82.5%82.2%81.6%
modeltanhhardtanhsigmoid
without α81.1%80.7%80.7%
with α82.5%82.2%81.6%
modelLNFixupSkipInitσ ReparamDyT
ViT-B82.3%77.2%74.1%82.5%82.8%
ViT-L83.1%78.1%75.6%83.0%83.6%
MAE ViT-B83.2%73.7%73.1%83.2%83.7%
MAE ViT-L85.5%74.1%74.0%85.4%85.8%
modelwidthdepthoptimal α 0 (attention/other)
LLaMA 7B4096320.8/0.2
LLaMA 13B5120400.6/0.15
LLaMA 34B8196480.2/0.05
LLaMA 70B8196800.2/0.05
width / depth8163264
10241.0/1.01.0/1.01.0/1.01.0/1.0
20481.0/0.51.0/0.51.0/0.51.0/0.5
40960.8/0.20.8/0.20.8/0.20.8/0.2
81920.2/0.050.2/0.050.2/0.050.2/0.05
modelLN (original)DyT (original)LN (tuned)DyT (tuned)
ViT-B82.3% (4e-3)82.5% (4e-3)-82.8% (6e-3)
ViT-L83.1% (4e-3)83.6% (4e-3)--
ConvNeXt-B83.7% (4e-3)83.7% (4e-3)--
ConvNeXt-L84.3% (4e-3)84.4% (4e-3)--
MAE ViT-B83.2% (2.4e-3)83.2% (2.4e-3)-83.7% (3.2e-3)
MAE ViT-L85.5% (2.4e-3)85.4% (2.4e-3)-85.8% (3.2e-3)
DINO ViT-B (patch size 16)83.2% (7.5e-4)83.4% (7.5e-4)83.3% (1e-3)-
DINO ViT-B (patch size 8)84.1% (5e-4)84.5% (5e-4)--
DiT-B64.9 (4e-4)63.9 (4e-4)--
DiT-L45.9 (4e-4)45.7 (4e-4)--
DiT-XL19.9 (4e-4)20.8 (4e-4)--
wav2vec 2.0 Base1.95 (5e-4)1.95 (5e-4)-1.94 (6e-4)
wav2vec 2.0 Large1.92 (3e-4)1.91 (3e-4)--
HyenaDNA85.2% (6e-4)85.2% (6e-4)--
Caduceus86.9% (8e-3)86.9% (8e-3)--
ModelLNDyT ( α 0 = 0 . 5 )DyT (tuned)
ViT-B82.3%82.5%82.6% ( α 0 = 1 . 0 )
ViT-L83.1%83.6%-
ConvNeXt-B83.7%83.7%-
ConvNeXt-L84.3%84.4%-
MAE ViT-B83.2%83.2%83.4% ( α 0 = 1 . 0 )
MAE ViT-L85.5%85.4%-
DINO ViT-B (patch 16)83.2%83.4%-
DINO ViT-B (patch 8)84.1%84.5%-
DiT-B64.963.9-
DiT-L45.945.7-
DiT-XL19.920.8-
wav2vec 2.0 Base1.951.95-
wav2vec 2.0 Large1.921.911.90 ( α 0 = 1 . 0 )
HyenaDNA85.2%85.2%-
Caduceus86.9%86.9%-
inferenceinferencetrainingtraining
LLaMA 7Blayermodellayermodel
RMSNorm2.1s14.1s8.3s42.6s
DyT1.0s13.0s4.8s39.1s
reduction↓ 52.4%↓ 7.8%↓ 42.2%↓ 8.2%
| LLaMA 7B | inference (layer) | inference (model) | training (layer) | training (model) |
| --- | --- | --- | --- | --- |
| RMSNorm | 0.3s | 12.3s | 3.9s | 38.9s |
| DyT | 0.3s | 12.3s | 3.9s | 38.9s |
| model | BN | DyT |
| --- | --- | --- |
| ResNet-50 | 76.2% | 68.9% |
| VGG19 | 72.7% | 71.0% |



References

[ioffe2015batch] Ioffe, Sergey, Szegedy, Christian. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML.

[krizhevsky2012imagenet] Krizhevsky, Alex, Sutskever, Ilya, Hinton, Geoffrey E. (2012). Imagenet classification with deep convolutional neural networks. NeurIPS.

[szegedy2015going] Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, Rabinovich, Andrew. (2015). Going deeper with convolutions. CVPR.

[he2016deep] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Deep residual learning for image recognition. CVPR.

[he2016identity] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Identity mappings in deep residual networks. ECCV.

[xie2017aggregated] Xie, Saining, Girshick, Ross, Dollár, Piotr, Tu, Zhuowen, He, Kaiming. (2017). Aggregated residual transformations for deep neural networks. CVPR.

[huang2017densely] Huang, Gao, Liu, Zhuang, Van Der Maaten, Laurens, Weinberger, Kilian Q. (2017). Densely connected convolutional networks. CVPR.

[tan2019efficientnet] Tan, Mingxing, Le, Quoc. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. ICML.

[tan2021efficientnetv2] Tan, Mingxing, Le, Quoc. (2021). Efficientnetv2: Smaller models and faster training. ICML.

[dosovitskiy2020image] Dosovitskiy, Alexey, Beyer, Lucas, Kolesnikov, Alexander, Weissenborn, Dirk, Zhai, Xiaohua, Unterthiner, Thomas, Dehghani, Mostafa, Minderer, Matthias, Heigold, Georg, Gelly, Sylvain, others. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

[liu2021swin] Liu, Ze, Lin, Yutong, Cao, Yue, Hu, Han, Wei, Yixuan, Zhang, Zheng, Lin, Stephen, Guo, Baining. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. ICCV.

[liu2022swin] Liu, Ze, Hu, Han, Lin, Yutong, Yao, Zhuliang, Xie, Zhenda, Wei, Yixuan, Ning, Jia, Cao, Yue, Zhang, Zheng, Dong, Li, others. (2022). Swin transformer v2: Scaling up capacity and resolution. CVPR.

[tolstikhin2021mlp] Tolstikhin, Ilya O, Houlsby, Neil, Kolesnikov, Alexander, Beyer, Lucas, Zhai, Xiaohua, Unterthiner, Thomas, Yung, Jessica, Steiner, Andreas, Keysers, Daniel, Uszkoreit, Jakob, others. (2021). Mlp-mixer: An all-mlp architecture for vision. NeurIPS.

[liu2022convnet] Liu, Zhuang, Mao, Hanzi, Wu, Chao-Yuan, Feichtenhofer, Christoph, Darrell, Trevor, Xie, Saining. (2022). A convnet for the 2020s. CVPR.

[woo2023convnext] Woo, Sanghyun, Debnath, Shoubhik, Hu, Ronghang, Chen, Xinlei, Liu, Zhuang, Kweon, In So, Xie, Saining. (2023). Convnext v2: Co-designing and scaling convnets with masked autoencoders. CVPR.

[vaswani2017attention] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz, Polosukhin, Illia. (2017). Attention is all you need. NeurIPS.

[wu2018group] Wu, Yuxin, He, Kaiming. (2018). Group normalization. ECCV.

[simonyan2014very] Simonyan, Karen, Zisserman, Andrew. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

[hochreiter1997long] Hochreiter, Sepp, Schmidhuber, Jürgen. (1997). Long short-term memory. Neural Computation.

[cho2014properties] Cho, Kyunghyun, Van Merriënboer, Bart, Bahdanau, Dzmitry, Bengio, Yoshua. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

[zhang2019root] Zhang, Biao, Sennrich, Rico. (2019). Root mean square layer normalization. NeurIPS.

[carandini2012normalization] Carandini, Matteo, Heeger, David J. (2012). Normalization as a canonical neural computation. Nature Reviews Neuroscience.

[lecun1998efficient] LeCun, Yann, Bottou, Léon, Orr, Genevieve B, Müller, Klaus-Robert. (1998). Efficient BackProp. Neural Networks: Tricks of the Trade.

[jarrett2009best] Jarrett, Kevin, Kavukcuoglu, Koray, Ranzato, Marc'Aurelio, LeCun, Yann. (2009). What is the best multi-stage architecture for object recognition?. ICCV.

[xiong2020layer] Xiong, Ruibin, Yang, Yunchang, He, Di, Zheng, Kai, Zheng, Shuxin, Xing, Chen, Zhang, Huishuai, Lan, Yanyan, Wang, Liwei, Liu, Tieyan. (2020). On layer normalization in the transformer architecture. ICML.

[huang2020improving] Huang, Xiao Shi, Perez, Felipe, Ba, Jimmy, Volkovs, Maksims. (2020). Improving transformer optimization through better initialization. ICML.

[santurkar2018does] Santurkar, Shibani, Tsipras, Dimitris, Ilyas, Andrew, Madry, Aleksander. (2018). How does batch normalization help optimization?. NeurIPS.

[bjorck2018understanding] Bjorck, Nils, Gomes, Carla P, Selman, Bart, Weinberger, Kilian Q. (2018). Understanding batch normalization. NeurIPS.

[brock2021characterizing] Brock, Andrew, De, Soham, Smith, Samuel L. (2021). Characterizing signal propagation to close the performance gap in unnormalized resnets. arXiv preprint arXiv:2101.08692.

[daneshmand2020batch] Daneshmand, Hadi, Kohler, Jonas, Bach, Francis, Hofmann, Thomas, Lucchi, Aurelien. (2020). Batch normalization provably avoids ranks collapse for randomly initialised deep networks. NeurIPS.

[balduzzi2017shattered] Balduzzi, David, Frean, Marcus, Leary, Lennox, Lewis, JP, Ma, Kurt Wan-Duo, McWilliams, Brian. (2017). The shattered gradients problem: If resnets are the answer, then what is the question?. ICML.

[karakida2019normalization] Karakida, Ryo, Akaho, Shotaro, Amari, Shun-ichi. (2019). The normalization method for alleviating pathological sharpness in wide neural networks. NeurIPS.

[yong2020gradient] Yong, Hongwei, Huang, Jianqiang, Hua, Xiansheng, Zhang, Lei. (2020). Gradient centralization: A new optimization technique for deep neural networks. ECCV.

[brock2021high] Brock, Andrew, De, Soham, Smith, Samuel L, Simonyan, Karen. (2021). High-performance large-scale image recognition without normalization. ICML.

[he2023simplifying] He, Bobby, Hofmann, Thomas. (2023). Simplifying transformer blocks. arXiv preprint arXiv:2311.01906.

[zhang2019fixup] Zhang, Hongyi, Dauphin, Yann N, Ma, Tengyu. (2019). Fixup initialization: Residual learning without normalization. arXiv preprint arXiv:1901.09321.

[de2020batch] De, Soham, Smith, Sam. (2020). Batch normalization biases residual blocks towards the identity function in deep networks. NeurIPS.

[bachlechner2021rezero] Bachlechner, Thomas, Majumder, Bodhisattwa Prasad, Mao, Henry, Cottrell, Gary, McAuley, Julian. (2021). Rezero is all you need: Fast convergence at large depth. UAI.

[ulyanov2016instance] Ulyanov, Dmitry, Vedaldi, Andrea, Lempitsky, Victor. (2016). Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.

[ba2016layer] Ba, Jimmy Lei, Kiros, Jamie Ryan, Hinton, Geoffrey E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.

[lyu2008nonlinear] Lyu, Siwei, Simoncelli, Eero P. (2008). Nonlinear image representation using divisive normalization. CVPR.

[zeiler2014visualizing] Zeiler, Matthew D, Fergus, Rob. (2014). Visualizing and understanding convolutional networks. ECCV.

[sermanet2013overfeat] Sermanet, Pierre, Eigen, David, Zhang, Xiang, Mathieu, Michaël, Fergus, Rob, LeCun, Yann. (2013). Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.

[chowdhery2023palm] Chowdhery, Aakanksha, Narang, Sharan, Devlin, Jacob, Bosma, Maarten, Mishra, Gaurav, Roberts, Adam, Barham, Paul, Chung, Hyung Won, Sutton, Charles, Gehrmann, Sebastian, others. (2023). Palm: Scaling language modeling with pathways. JMLR.

[touvron2023llama] Touvron, Hugo, Lavril, Thibaut, Izacard, Gautier, Martinet, Xavier, Lachaux, Marie-Anne, Lacroix, Timothée, others. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

[touvron2023llama2] Touvron, Hugo, Martin, Louis, Stone, Kevin, Albert, Peter, Almahairi, Amjad, Babaei, Yasmine, Bashlykov, Nikolay, Batra, Soumya, Bhargava, Prajjwal, Bhosale, Shruti, others. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[shao2020normalization] Shao, Jie, Hu, Kai, Wang, Changhu, Xue, Xiangyang, Raj, Bhiksha. (2020). Is normalization indispensable for training deep neural network?. NeurIPS.

[arora2018theoretical] Arora, Sanjeev, Li, Zhiyuan, Lyu, Kaifeng. (2018). Theoretical analysis of auto rate-tuning by batch normalization. arXiv preprint arXiv:1812.03981.

[tanaka2021noether] Tanaka, Hidenori, Kunin, Daniel. (2021). Noether’s learning dynamics: Role of symmetry breaking in neural networks. NeurIPS.

[nguyen2019transformers] Nguyen, Toan Q, Salazar, Julian. (2019). Transformers without tears: Improving the normalization of self-attention. arXiv preprint arXiv:1910.05895.

[xu2019understanding] Xu, Jingjing, Sun, Xu, Zhang, Zhiyuan, Zhao, Guangxiang, Lin, Junyang. (2019). Understanding and improving layer normalization. NeurIPS.

[huang2017centered] Huang, Lei, Liu, Xianglong, Liu, Yang, Lang, Bo, Tao, Dacheng. (2017). Centered weight normalization in accelerating training of deep neural networks. ICCV.

[qiao2019micro] Qiao, Siyuan, Wang, Huiyu, Liu, Chenxi, Shen, Wei, Yuille, Alan. (2019). Micro-batch training with batch-channel normalization and weight standardization. arXiv preprint arXiv:1903.10520.

[srivastava2014dropout] Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, Salakhutdinov, Ruslan. (2014). Dropout: a simple way to prevent neural networks from overfitting. JMLR.

[huang2016deep] Huang, Gao, Sun, Yu, Liu, Zhuang, Sedra, Daniel, Weinberger, Kilian Q. (2016). Deep networks with stochastic depth. ECCV.

[cubuk2020randaugment] Cubuk, Ekin D, Zoph, Barret, Shlens, Jonathon, Le, Quoc V. (2020). Randaugment: Practical automated data augmentation with a reduced search space. CVPR Workshops.

[he2023deep] He, Bobby, Martens, James, Zhang, Guodong, Botev, Aleksandar, Brock, Andrew, Smith, Samuel L, Teh, Yee Whye. (2023). Deep transformers without shortcuts: Modifying self-attention for faithful signal propagation. arXiv preprint arXiv:2302.10322.

[gulati2020conformer] Gulati, Anmol, Qin, James, Chiu, Chung-Cheng, Parmar, Niki, Zhang, Yu, Yu, Jiahui, Han, Wei, Wang, Shibo, Zhang, Zhengdong, Wu, Yonghui, others. (2020). Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100.

[smith2023convnets] Smith, Samuel L, Brock, Andrew, Berrada, Leonard, De, Soham. (2023). Convnets match vision transformers at scale. arXiv preprint arXiv:2310.16764.

[silver2017mastering] Silver, David, Schrittwieser, Julian, Simonyan, Karen, Antonoglou, Ioannis, Huang, Aja, Guez, Arthur, Hubert, Thomas, Baker, Lucas, Lai, Matthew, Bolton, Adrian, others. (2017). Mastering the game of go without human knowledge. Nature.

[luo2018towards] Luo, Ping, Wang, Xinjiang, Shao, Wenqi, Peng, Zhanglin. (2018). Towards understanding regularization in batch normalization. arXiv preprint arXiv:1809.00846.

[liu2021rethinking] Liu, Fenglin, Ren, Xuancheng, Zhang, Zhiyuan, Sun, Xu, Zou, Yuexian. (2021). Rethinking skip connection with layer normalization in transformers and resnets. arXiv preprint arXiv:2105.07205.

[takase2022b2t] Takase, Sho, Kiyono, Shun, Kobayashi, Sosuke, Suzuki, Jun. (2022). B2t connection: Serving stability and performance in deep transformers. arXiv preprint arXiv:2206.00330.

[brody2023expressivity] Brody, Shaked, Alon, Uri, Yahav, Eran. (2023). On the Expressivity Role of LayerNorm in Transformers' Attention. arXiv preprint arXiv:2305.02582.

[klambauer2017self] Klambauer, Günter, Unterthiner, Thomas, Mayr, Andreas, Hochreiter, Sepp. (2017). Self-normalizing neural networks. NeurIPS.

[wang2019learning] Wang, Qiang, Li, Bei, Xiao, Tong, Zhu, Jingbo, Li, Changliang, Wong, Derek F, Chao, Lidia S. (2019). Learning deep transformer models for machine translation. arXiv preprint arXiv:1906.01787.

[beck2024xlstm] Beck, Maximilian, Pöppel, Korbinian, others. (2024). xLSTM: Extended Long Short-Term Memory. arXiv preprint arXiv:2405.04517.

[gu2023mamba] Gu, Albert, Dao, Tri. (2023). Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.

[peng2023rwkv] Peng, Bo, Alcaide, Eric, Anthony, Quentin, Albalak, Alon, Arcadinho, Samuel, Cao, Huanqi, Cheng, Xin, Chung, Michael, Grella, Matteo, GV, Kranthi Kiran, others. (2023). Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048.

[cho2014learning] Cho, Kyunghyun, Van Merriënboer, Bart, Gulcehre, Caglar, Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, Bengio, Yoshua. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

[bahdanau2014neural] Bahdanau, Dzmitry, Cho, Kyunghyun, Bengio, Yoshua. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[nair2010rectified] Nair, Vinod, Hinton, Geoffrey E. (2010). Rectified linear units improve restricted boltzmann machines. ICML.

[hendrycks2016gaussian] Hendrycks, Dan, Gimpel, Kevin. (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.

[deng2009imagenet] Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, Fei-Fei, Li. (2009). Imagenet: A large-scale hierarchical image database. CVPR.

[peebles2023scalable] Peebles, William, Xie, Saining. (2023). Scalable diffusion models with transformers. ICCV.

[he2022masked] He, Kaiming, Chen, Xinlei, Xie, Saining, Li, Yanghao, Dollár, Piotr, Girshick, Ross. (2022). Masked autoencoders are scalable vision learners. CVPR.

[caron2021emerging] Caron, Mathilde, Touvron, Hugo, Misra, Ishan, Jégou, Hervé, Mairal, Julien, Bojanowski, Piotr, Joulin, Armand. (2021). Emerging properties in self-supervised vision transformers. ICCV.

[baevski2020wav2vec] Baevski, Alexei, Zhou, Yuhao, Mohamed, Abdelrahman, Auli, Michael. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. NeurIPS.

[panayotov2015librispeech] Panayotov, Vassil, Chen, Guoguo, Povey, Daniel, Khudanpur, Sanjeev. (2015). Librispeech: an asr corpus based on public domain audio books. ICASSP.

[nguyen2024hyenadna] Nguyen, Eric, Poli, Michael, Faizi, Marjan, Thomas, Armin, Wornow, Michael, Birch-Sykes, Callum, Massaroli, Stefano, Patel, Aman, Rabideau, Clayton, Bengio, Yoshua, others. (2024). Hyenadna: Long-range genomic sequence modeling at single nucleotide resolution. NeurIPS.

[grch382013p13] GRCh38, Ensembl. (2013). p13 (Genome Reference Consortium Human Build 38), INSDC Assembly.

[grevsova2023genomic] Grešová, Katarína, Martinek, Vlastimil, Čechák, David, Šimeček, Petr, Alexiou, Panagiotis. (2023). Genomic benchmarks: a collection of datasets for genomic sequence classification. BMC Genomic Data.

[nicolae2018plu] Nicolae, Andrei. (2018). PLU: The piecewise linear unit activation function. arXiv preprint arXiv:1809.09534.

[pile] Gao, Leo, Biderman, Stella, Black, Sid, Golding, Laurence, Hoppe, Travis, Foster, Charles, Phang, Jason, He, Horace, Thite, Anish, Nabeshima, Noa, Presser, Shawn, Leahy, Connor. (2020). The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

[brown2020language] Brown, Tom, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared D, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, others. (2020). Language models are few-shot learners. NeurIPS.

[eval-harness] Gao, Leo, Tow, Jonathan, Abbasi, Baber, Biderman, Stella, Black, Sid, DiPofi, Anthony, Foster, Charles, Golding, Laurence, Hsu, Jeffrey, Le Noac'h, Alain, Li, Haonan, McDonell, Kyle, Muennighoff, Niklas, Ociepa, Chris, Phang, Jason, Reynolds, Laria, Schoelkopf, Hailey, Skowron, Aviya, Sutawika, Lintang, Tang, Eric, Thite, Anish, Wang, Ben, Wang, Kevin, Zou, Andy. A framework for few-shot language model evaluation. doi:10.5281/zenodo.10256836.

[touvron2021training] Touvron, Hugo, Cord, Matthieu, Douze, Matthijs, Massa, Francisco, Sablayrolles, Alexandre, Jégou, Hervé. (2021). Training data-efficient image transformers & distillation through attention. ICML.

[adrian1926impulses] Adrian, Edgar D. (1926). The impulses produced by sensory nerve endings: Part 1. The Journal of Physiology.

[adrian1926impulses2] Adrian, Edgar D, Zotterman, Yngve. (1926). The impulses produced by sensory nerve-endings: Part 2. The response of a Single End-Organ. The Journal of Physiology.

[adrian1926impulses3] Adrian, Edgar D, Zotterman, Yngve. (1926). The impulses produced by sensory nerve endings: Part 3. Impulses set up by Touch and Pressure. The Journal of Physiology.

[sun2024massive] Sun, Mingjie, Chen, Xinlei, Kolter, J Zico, Liu, Zhuang. (2024). Massive Activations in Large Language Models. arXiv preprint arXiv:2402.17762.

[raffel2020exploring] Raffel, Colin, Shazeer, Noam, Roberts, Adam, Lee, Katherine, Narang, Sharan, Matena, Michael, Zhou, Yanqi, Li, Wei, Liu, Peter J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR.

[he2015delving] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. ICCV.

[glorot2010understanding] Glorot, Xavier, Bengio, Yoshua. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS.

[heeger1996computational] Heeger, David J, Simoncelli, Eero P, Movshon, J Anthony. (1996). Computational models of cortical visual processing. PNAS.

[heeger1992normalization] Heeger, David J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience.

[ni2024nonlinearity] Ni, Yunhao, Guo, Yuxin, Jia, Junlong, Huang, Lei. (2024). On the Nonlinearity of Layer Normalization. arXiv preprint arXiv:2406.01255.

[lecun2022path] LeCun, Yann. (2022). A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review.

[openlm2023openllama] Geng, Xinyang, Liu, Hao. OpenLLaMA: An Open Reproduction of LLaMA.

[dubey2024llama] Dubey, Abhimanyu, Jauhri, Abhinav, Pandey, Abhinav, Kadian, Abhishek, Al-Dahle, Ahmad, Letman, Aiesha, Mathur, Akhil, Schelten, Alan, Yang, Amy, Fan, Angela, others. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[jiang2023mistral] Jiang, Albert Q, Sablayrolles, Alexandre, Mensch, Arthur, Bamford, Chris, Chaplot, Devendra Singh, Casas, Diego de las, Bressand, Florian, Lengyel, Gianna, Lample, Guillaume, Saulnier, Lucile, others. (2023). Mistral 7B. arXiv preprint arXiv:2310.06825.

[bai2023qwen] Bai, Jinze, Bai, Shuai, Chu, Yunfei, Cui, Zeyu, Dang, Kai, Deng, Xiaodong, Fan, Yang, Ge, Wenbin, Han, Yu, Huang, Fei, others. (2023). Qwen technical report. arXiv preprint arXiv:2309.16609.

[yang2024qwen2] Yang, An, Yang, Baosong, Hui, Binyuan, Zheng, Bo, Yu, Bowen, Zhou, Chang, Li, Chengpeng, Li, Chengyuan, Liu, Dayiheng, Huang, Fei, others. (2024). Qwen2 technical report. arXiv preprint arXiv:2407.10671.

[zhang2024internlm] Zhang, Pan, Dong, Xiaoyi, Zang, Yuhang, Cao, Yuhang, Qian, Rui, Chen, Lin, Guo, Qipeng, Duan, Haodong, Wang, Bin, Ouyang, Linke, others. (2024). Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output. arXiv preprint arXiv:2407.03320.

[cai2024internlm2] Cai, Zheng, Cao, Maosong, Chen, Haojiong, Chen, Kai, Chen, Keyu, Chen, Xin, Chen, Xun, Chen, Zehui, Chen, Zhi, Chu, Pei, others. (2024). Internlm2 technical report. arXiv preprint arXiv:2403.17297.

[wang2022understanding] Wang, Jiaxi, Wu, Ji, Huang, Lei. (2022). Understanding the failure of batch normalization for transformers in nlp. NeurIPS.

[sun2024learning] Sun, Yu, Li, Xinhao, Dalal, Karan, Xu, Jiarui, Vikram, Arjun, Zhang, Genghan, Dubois, Yann, Chen, Xinlei, Wang, Xiaolong, Koyejo, Sanmi, others. (2024). Learning to (learn at test time): Rnns with expressive hidden states. arXiv preprint arXiv:2407.04620.

[feng2024were] Feng, Leo, Tung, Frederick, Ahmed, Mohamed Osama, Bengio, Yoshua, Hajimirsadegh, Hossein. (2024). Were RNNs All We Needed?. arXiv preprint arXiv:2410.01201.

[wolf2019huggingface] Wolf, T. (2019). Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

[yao2021leveraging] Yao, Zhuliang, Cao, Yue, Lin, Yutong, Liu, Ze, Zhang, Zheng, Hu, Han. (2021). Leveraging batch normalization for vision transformers. ICCV.

[loshchilov2024ngpt] Loshchilov, Ilya, Hsieh, Cheng-Ping, Sun, Simeng, Ginsburg, Boris. (2024). ngpt: Normalized transformer with representation learning on the hypersphere. arXiv preprint arXiv:2410.01131.

[lubana2021beyond] Lubana, Ekdeep S, Dick, Robert, Tanaka, Hidenori. (2021). Beyond batchnorm: Towards a unified understanding of normalization in deep learning. NeurIPS.

[lyu2022understanding] Lyu, Kaifeng, Li, Zhiyuan, Arora, Sanjeev. (2022). Understanding the generalization benefit of normalization layers: Sharpness reduction. NeurIPS.

[torchvision2016] TorchVision maintainers, contributors. TorchVision: PyTorch's Computer Vision library. GitHub repository.

[FMS-FSDP] Foundation Model Stack. GitHub: FMS-FSDP. https://github.com/foundation-model-stack/fms-fsdp.

[convnext] Meta Research. GitHub: ConvNeXt.

[MAE] Meta Research. GitHub: MAE.

[DINO] Meta Research. GitHub: DINO.

[DiT] Meta Research. GitHub: DiT.

[wav2vec2] Meta Research. GitHub: wav2vec 2.0.

[caduceus] Kuleshov Group. GitHub: Caduceus.

[hyena] HazyResearch. GitHub: HyenaDNA. https://github.com/HazyResearch/hyena-dna.

[huang2023normalization] Huang, Lei, Qin, Jie, Zhou, Yi, Zhu, Fan, Liu, Li, Shao, Ling. (2023). Normalization techniques in training dnns: Methodology, analysis and application. TPAMI.

[bi2024deepseek] Bi, Xiao, Chen, Deli, Chen, Guanting, Chen, Shanhuang, Dai, Damai, Deng, Chengqi, Ding, Honghui, Dong, Kai, Du, Qiushi, Fu, Zhe, others. (2024). Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954.

[liu2024deepseek] Liu, Aixin, Feng, Bei, Wang, Bin, Wang, Bingxuan, Liu, Bo, Zhao, Chenggang, Dengr, Chengqi, Ruan, Chong, Dai, Damai, Guo, Daya, others. (2024). Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434.

[deepseekai2024deepseekv3technicalreport] DeepSeek-AI, Liu, Aixin, Feng, Bei, Xue, Bing, Wang, Bingxuan, Wu, Bochao, others. (2024). DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437.

[ramesh2021zero] Ramesh, Aditya, Pavlov, Mikhail, Goh, Gabriel, Gray, Scott, Voss, Chelsea, Radford, Alec, Chen, Mark, Sutskever, Ilya. (2021). Zero-shot text-to-image generation. ICML.

[heimersheim2024you] Stefan Heimersheim. (2024). You can remove GPT2's LayerNorm by fine-tuning. arXiv preprint arXiv:2409.13710.

[li2024mix] Li, Pengxiang, Yin, Lu, Liu, Shiwei. (2024). Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN. arXiv preprint arXiv:2412.13795.

[schiff2024caduceus] Schiff, Yair, Kao, Chia-Hsiang, Gokaslan, Aaron, Dao, Tri, Gu, Albert, Kuleshov, Volodymyr. (2024). Caduceus: Bi-directional equivariant long-range dna sequence modeling. arXiv preprint arXiv:2403.03234.

[zhai2023stabilizing] Zhai, Shuangfei, Likhomanenko, Tatiana, Littwin, Etai, Busbridge, Dan, Ramapuram, Jason, Zhang, Yizhe, Gu, Jiatao, Susskind, Joshua M. (2023). Stabilizing transformer training by preventing attention entropy collapse. ICML.

[salimans2016weight] Salimans, Tim, Kingma, Durk P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. NeurIPS.

[mueller2024normalization] Mueller, Maximilian, Vlaar, Tiffany, Rolnick, David, Hein, Matthias. (2024). Normalization layers are all that sharpness-aware minimization needs. NeurIPS.

[dai2024crucial] Dai, Yan, Ahn, Kwangjun, Sra, Suvrit. (2024). The crucial role of normalization in sharpness-aware minimization. NeurIPS.

[sun2025cursedepthlargelanguage] Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, Shiwei Liu. (2025). The Curse of Depth in Large Language Models. arXiv preprint arXiv:2502.05795.

[guo2025deepseek] Guo, Daya, Yang, Dejian, Zhang, Haowei, Song, Junxiao, Zhang, Ruoyu, Xu, Runxin, Zhu, Qihao, Ma, Shirong, Wang, Peiyi, Bi, Xiao, others. (2025). Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

[szegedy2016rethinking] Szegedy, Christian, Vanhoucke, Vincent, Ioffe, Sergey, Shlens, Jon, Wojna, Zbigniew. (2016). Rethinking the inception architecture for computer vision. CVPR.

[paszke2019pytorch] Paszke, Adam, Gross, Sam, Massa, Francisco, Lerer, Adam, Bradbury, James, Chanan, Gregory, Killeen, Trevor, Lin, Zeming, Gimelshein, Natalia, Antiga, Luca, others. (2019). Pytorch: An imperative style, high-performance deep learning library. NeurIPS.

[chandra2021novel] Chandra, Mahesh. (2021). A novel method for scalable VLSI implementation of hyperbolic tangent function. IEEE Design & Test.

[hashemi2024can] Hashemi, Baran, Corominas, Roderic G, Giacchetto, Alessandro. (2024). Can Transformers Do Enumerative Geometry?. arXiv preprint arXiv:2408.14915.

[huggingface] Hugging Face. Hugging Face: LLaMA 2.

[jha2024aero] Jha, Nandan Kumar, Reagen, Brandon. (2024). AERO: Softmax-Only LLMs for Efficient Private Inference. arXiv preprint arXiv:2410.13060.

[kernelFusion2024] Lambda Labs. (2024). Deep Dive into Kernel Fusion: Accelerating Inference in Llama V2.

[bib1] Edgar D Adrian. The impulses produced by sensory nerve endings: Part 1. The Journal of Physiology, 1926.

[bib2] Edgar D Adrian and Yngve Zotterman. The impulses produced by sensory nerve-endings: Part 2. the response of a single end-organ. The Journal of Physiology, 1926a.

[bib3] Edgar D Adrian and Yngve Zotterman. The impulses produced by sensory nerve endings: Part 3. impulses set up by touch and pressure. The Journal of Physiology, 1926b.

[bib4] Arora et al. (2018) Sanjeev Arora, Zhiyuan Li, and Kaifeng Lyu. Theoretical analysis of auto rate-tuning by batch normalization. arXiv preprint arXiv:1812.03981, 2018.

[bib5] Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[bib6] Bachlechner et al. (2021) Thomas Bachlechner, Bodhisattwa Prasad Majumder, Henry Mao, Gary Cottrell, and Julian McAuley. Rezero is all you need: Fast convergence at large depth. In UAI, 2021.

[bib7] Baevski et al. (2020) Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. NeurIPS, 2020.

[bib8] Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

[bib9] Balduzzi et al. (2017) David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? In ICML, 2017.

[bib10] Bjorck et al. (2018) Nils Bjorck, Carla P Gomes, Bart Selman, and Kilian Q Weinberger. Understanding batch normalization. NeurIPS, 2018.

[bib11] Brock et al. (2021a) Andrew Brock, Soham De, and Samuel L Smith. Characterizing signal propagation to close the performance gap in unnormalized resnets. arXiv preprint arXiv:2101.08692, 2021a.

[bib12] Brock et al. (2021b) Andrew Brock, Soham De, Samuel L Smith, and Karen Simonyan. High-performance large-scale image recognition without normalization. In ICML, 2021b.

[bib13] Cai et al. (2024) Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024.

[bib14] Caron et al. (2021) Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.

[bib15] Cubuk et al. (2020) Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In CVPR Workshops, 2020.

[bib16] Dai et al. (2024) Yan Dai, Kwangjun Ahn, and Suvrit Sra. The crucial role of normalization in sharpness-aware minimization. NeurIPS, 2024.

[bib17] Daneshmand et al. (2020) Hadi Daneshmand, Jonas Kohler, Francis Bach, Thomas Hofmann, and Aurelien Lucchi. Batch normalization provably avoids ranks collapse for randomly initialised deep networks. NeurIPS, 2020.

[bib18] Soham De and Sam Smith. Batch normalization biases residual blocks towards the identity function in deep networks. NeurIPS, 2020.

[bib19] Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

[bib20] Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[bib21] Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

[bib22] Feng et al. (2024) Leo Feng, Frederick Tung, Mohamed Osama Ahmed, Yoshua Bengio, and Hossein Hajimirsadegh. Were rnns all we needed? arXiv preprint arXiv:2410.01201, 2024.

[bib23] Foundation Model Stack. Github: FMS FSDP. {https://github.com/foundation-model-stack/fms-fsdp}. Accessed: 2025-01-23.

[bib24] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation. https://zenodo.org/records/10256836.

[bib25] Gao et al. (2020) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.

[bib26] Geng and Liu (2023) Xinyang Geng and Hao Liu. Openllama: An open reproduction of llama. https://github.com/openlm-research/open_llama, 2023.

[bib27] Ensembl. GRCh38.p13 (Genome Reference Consortium Human Build 38), INSDC assembly, 2013.

[bib28] Grešová et al. (2023) Katarína Grešová, Vlastimil Martinek, David Čechák, Petr Šimeček, and Panagiotis Alexiou. Genomic benchmarks: a collection of datasets for genomic sequence classification. BMC Genomic Data, 2023.

[bib29] Gu and Dao (2023) Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

[bib30] Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[bib31] HazyResearch. Github: Hyenadna. https://github.com/HazyResearch/hyena-dna.git. Accessed: 2025-01-23.

[bib32] He and Hofmann (2023) Bobby He and Thomas Hofmann. Simplifying transformer blocks. arXiv preprint arXiv:2311.01906, 2023.

[bib33] He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[bib34] He et al. (2022) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.

[bib35] Heimersheim (2024) Stefan Heimersheim. You can remove gpt2’s layernorm by fine-tuning. arXiv preprint arXiv:2409.13710, 2024.

[bib36] Huang et al. (2016) Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, 2016.

[bib37] Huang et al. (2017) Lei Huang, Xianglong Liu, Yang Liu, Bo Lang, and Dacheng Tao. Centered weight normalization in accelerating training of deep neural networks. In ICCV, 2017.

[bib38] Huang et al. (2023) Lei Huang, Jie Qin, Yi Zhou, Fan Zhu, Li Liu, and Ling Shao. Normalization techniques in training dnns: Methodology, analysis and application. TPAMI, 2023.

[bib39] Huang et al. (2020) Xiao Shi Huang, Felipe Perez, Jimmy Ba, and Maksims Volkovs. Improving transformer optimization through better initialization. In ICML, 2020.

[bib40] Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[bib41] Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.

[bib42] Karakida et al. (2019) Ryo Karakida, Shotaro Akaho, and Shun-ichi Amari. The normalization method for alleviating pathological sharpness in wide neural networks. NeurIPS, 2019.

[bib43] Kuleshov Group. Github: Caduceus. https://github.com/kuleshov-group/caduceus.git. Accessed: 2025-01-23.

[bib44] LeCun (2022) Yann LeCun. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 2022.

[bib45] Li et al. (2024) Pengxiang Li, Lu Yin, and Shiwei Liu. Mix-ln: Unleashing the power of deeper layers by combining pre-ln and post-ln. arXiv preprint arXiv:2412.13795, 2024.

[bib46] Liu et al. (2024) Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024.

[bib47] Liu et al. (2022) Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In CVPR, 2022.

[bib48] Loshchilov et al. (2024) Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, and Boris Ginsburg. ngpt: Normalized transformer with representation learning on the hypersphere. arXiv preprint arXiv:2410.01131, 2024.

[bib49] Lubana et al. (2021) Ekdeep S Lubana, Robert Dick, and Hidenori Tanaka. Beyond batchnorm: Towards a unified understanding of normalization in deep learning. NeurIPS, 2021.

[bib50] Lyu et al. (2022) Kaifeng Lyu, Zhiyuan Li, and Sanjeev Arora. Understanding the generalization benefit of normalization layers: Sharpness reduction. NeurIPS, 2022.

[bib51] Meta Research. Github: ConvNeXt. https://github.com/facebookresearch/ConvNeXt. Accessed: 2025-01-23.

[bib52] Meta Research. Github: DINO. https://github.com/facebookresearch/dino. Accessed: 2025-01-23.

[bib53] Meta Research. Github: DiT. https://github.com/facebookresearch/DiT. Accessed: 2025-01-23.

[bib54] Meta Research. Github: MAE. https://github.com/facebookresearch/mae. Accessed: 2025-01-23.

[bib55] Meta Research. Github: wav2vec 2.0. https://github.com/facebookresearch/fairseq. Accessed: 2025-01-23.

[bib56] Mueller et al. (2024) Maximilian Mueller, Tiffany Vlaar, David Rolnick, and Matthias Hein. Normalization layers are all that sharpness-aware minimization needs. NeurIPS, 2024.

[bib57] Nguyen et al. (2024) Eric Nguyen, Michael Poli, Marjan Faizi, Armin Thomas, Michael Wornow, Callum Birch-Sykes, Stefano Massaroli, Aman Patel, Clayton Rabideau, Yoshua Bengio, et al. Hyenadna: Long-range genomic sequence modeling at single nucleotide resolution. NeurIPS, 2024.

[bib58] Nguyen and Salazar (2019) Toan Q Nguyen and Julian Salazar. Transformers without tears: Improving the normalization of self-attention. arXiv preprint arXiv:1910.05895, 2019.

[bib59] Ni et al. (2024) Yunhao Ni, Yuxin Guo, Junlong Jia, and Lei Huang. On the nonlinearity of layer normalization. arXiv preprint arXiv:2406.01255, 2024.

[bib60] Panayotov et al. (2015) Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In ICASSP, 2015.

[bib61] Peebles and Xie (2023) William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.

[bib62] Qiao et al. (2019) Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, and Alan Yuille. Micro-batch training with batch-channel normalization and weight standardization. arXiv preprint arXiv:1903.10520, 2019.

[bib63] Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.

[bib64] Salimans and Kingma (2016) Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. NeurIPS, 2016.

[bib65] Santurkar et al. (2018) Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? NeurIPS, 2018.

[bib66] Schiff et al. (2024) Yair Schiff, Chia-Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, and Volodymyr Kuleshov. Caduceus: Bi-directional equivariant long-range dna sequence modeling. arXiv preprint arXiv:2403.03234, 2024.

[bib67] Shao et al. (2020) Jie Shao, Kai Hu, Changhu Wang, Xiangyang Xue, and Bhiksha Raj. Is normalization indispensable for training deep neural network? NeurIPS, 2020.

[bib68] Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[bib69] Smith et al. (2023) Samuel L Smith, Andrew Brock, Leonard Berrada, and Soham De. Convnets match vision transformers at scale. arXiv preprint arXiv:2310.16764, 2023.

[bib70] Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.

[bib71] Sun et al. (2025) Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, and Shiwei Liu. The curse of depth in large language models. arXiv preprint arXiv:2502.05795, 2025.

[bib72] Sun et al. (2024) Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states. arXiv preprint arXiv:2407.04620, 2024.

[bib73] Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.

[bib74] Tanaka and Kunin (2021) Hidenori Tanaka and Daniel Kunin. Noether’s learning dynamics: Role of symmetry breaking in neural networks. NeurIPS, 2021.

[bib75] Tolstikhin et al. (2021) Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture for vision. NeurIPS, 2021.

[bib76] Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.

[bib77] Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.

[bib78] Ulyanov et al. (2016) Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.

[bib79] Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.

[bib80] Wu and He (2018) Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.

[bib81] Xie et al. (2017) Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.

[bib82] Xiong et al. (2020) Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In ICML, 2020.

[bib83] Xu et al. (2019) Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, and Junyang Lin. Understanding and improving layer normalization. NeurIPS, 2019.

[bib84] Yang et al. (2024) An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.

[bib85] Zhai et al. (2023) Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, and Joshua M Susskind. Stabilizing transformer training by preventing attention entropy collapse. In ICML, 2023.

[bib86] Zhang and Sennrich (2019) Biao Zhang and Rico Sennrich. Root mean square layer normalization. NeurIPS, 2019.

[bib87] Zhang et al. (2019) Hongyi Zhang, Yann N Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without normalization. arXiv preprint arXiv:1901.09321, 2019.

[bib88] Zhang et al. (2024) Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, et al. Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output. arXiv preprint arXiv:2407.03320, 2024.