
Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, Saining Xie

Abstract

Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.

Keywords: VAE, representation, representation autoencoder.

Shengbang Tong*, Boyang Zheng*, Ziteng Wang*, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, Saining Xie (New York University)




Introduction

Diffusion-based generative modeling [19, 38, 47] has made rapid progress, giving rise to state-of-the-art systems across

* Core contributor.


Figure 1. RAE converges faster than VAE in text-to-image pretraining. We train Qwen-2.5 1.5B + DiT 2.4B models from scratch on both RAE (SigLIP-2) and VAE (FLUX) latent spaces for up to 60k iterations. RAE converges significantly faster than VAE on both GenEval (4.0×) and DPG-Bench (4.6×).


visual generative domains such as text-to-image generation [1, 46, 88]. A key factor in this success is the adoption of latent diffusion [65], where generation occurs in a compact latent space encoded by a variational autoencoder (VAE) [43], rather than directly in pixel space.

In parallel with advances in generative modeling, visual representation learning has progressed through self-supervised learning (SSL) [6, 12, 34, 35], language supervision [62, 98], and their combinations [54, 83]. These models produce semantically structured, high-dimensional representations that generalize well across visual understanding tasks. Unlike VAE encoders, which compress images into low-dimensional latents, the representation encoders operate on high-dimensional latents that can capture much more semantically rich features.

Such high-dimensional latents were previously considered too 'abstract' for effective generative modeling [72, 92], or outright intractable [13, 48]. However, a recent approach, Representation Autoencoder (RAE) [100], has paved a path forward by training decoders on frozen representation encoders. RAE pairs a powerful frozen representation encoder with a lightweight trained decoder to reconstruct pixels from high-dimensional embeddings, enabling diffusion directly in this semantic latent space. In the

Figure 2. RAE decoders trained on more data (web, synthetic & text) generalize across domains. Decoders trained only on ImageNet reconstruct natural images well but struggle with text-rendering scenes (see second column). Adding web and text data greatly improves text reconstruction while maintaining natural-image quality. We also observe that both the language-supervised model and the SSL model learn representations suitable for reconstructing diverse images, including natural-language text. Compared to proprietary VAEs, our RAE models achieve competitive overall fidelity.


highly controlled class-conditional ImageNet [18] setting, RAE demonstrates that diffusion in such frozen representation spaces can achieve more efficient and effective training than conventional VAE-based diffusion.

However, ImageNet represents a best-case scenario: fixed resolution, curated content, and class-conditional generation. A critical question remains unanswered: can RAE truly scale to the complexities of freeform text-to-image generation? This setting involves broader visual diversity, open-ended compositions, and substantially larger models and compute, challenges for which high-dimensional latent diffusion remains unproven.

In this work, we investigate whether RAEs can succeed at scale by training diffusion models for large-scale text-to-image (T2I) generation. We adopt SigLIP-2 [84] as the frozen representation encoder and use the MetaQuery framework [56] to train a unified T2I model, leveraging a powerful pretrained large language model (LLM) [61].

As a first step, we study decoder training beyond ImageNet supervision (Sec. 2). Expanding from ImageNet to web-scale and synthetic aesthetic data yields only small gains on ImageNet itself, but provides moderate improvements on more diverse natural images such as YFCC [80], showing that broader distributions enhance generalization.

However, we find that text reconstruction requires targeted supervision: without text-specific data, the decoder fails to reproduce fine glyph details. Adding text-rendering data leads to substantial improvements, highlighting that data composition, not just scale, is crucial.

Next, we analyze design choices in the RAE framework [100] and evaluate their importance under large-scale T2I training (Sec. 3). We find that scale acts as a simplifier. Dimension-aware noise scheduling remains essential: removing the shift leads to substantially worse performance. The wide DDT head (DiT DH) provides clear benefits for smaller backbones, but its advantage fades as Diffusion Transformers (DiT) scale to billions of parameters. Finally, the effect of noise-augmented decoding is modest at T2I scale, with gains saturating quickly.

We then systematically compare RAEs with SOTA VAEs under matched training conditions (Sec. 4). We train DiTs from scratch following the conventional two-stage T2I setup [15, 60]: (i) large-scale pretraining with randomly initialized DiTs, and (ii) finetuning on smaller high-quality datasets. During pretraining, RAE-based models converge significantly faster and achieve higher performance on both GenEval and DPG-Bench. As shown in Fig. 1, training a 1.5B LLM + 2.4B DiT with RAE (SigLIP-2) achieves

Table 1. Data matters for RAE's reconstruction fidelity. We train RAE (SigLIP-2) on different data sources. Compared with ImageNet-only training, using web-scale images consistently improves reconstruction quality across all domains.

a 4.0× speedup on GenEval and a 4.6× speedup on DPG-Bench compared to its VAE counterpart. This advantage is consistent across both language backbones (Qwen-2.5 [2] 1.5B-7B) and diffusion scales (DiT 0.5B-9.8B). In finetuning, RAE models continue to outperform their VAE counterparts and are less prone to overfitting.

Finally, we examine unified models in which RAE enables understanding and generation to operate in the same high-dimensional semantic space (Sec. 5). We find that adding generative training does not degrade understanding performance, and the choice of RAE vs. VAE in the generative path has little effect because both rely on the same frozen understanding encoder. Moreover, the shared latent space allows the LLM to process generated latents directly, without decoding back to pixels. We take a first exploratory step toward leveraging this property through latent-space test-time scaling, which proves both feasible and effective.

Ultimately, we aim to convey one primary message: Representation Autoencoders provide a simpler and stronger foundation than VAEs for training large-scale text-to-image diffusion models. They offer a simple yet effective path to scaling generation within semantic representation spaces. We will release all code, data, and model checkpoints related to this work to foster open and reproducible research in multimodal generation.

Scaling Decoder Training Beyond ImageNet

To adapt the RAE framework for open-world T2I generation, we first train a RAE decoder on a larger and more diverse dataset than ImageNet [18]. Throughout this section, we choose SigLIP-2 So400M (patch size 14) [84] as the frozen encoder, and train a ViT-based [21] decoder to reconstruct images from these tokens at 224 × 224 resolution. We present the architectural details in Sec. A. Given an input image x ∈ ℝ^(3×224×224), the encoder produces N = 16 × 16 = 256 tokens with channel dimension d = 1152.
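To make the token layout concrete, here is a shape-only NumPy sketch of the setup above; the actual encoder is the frozen pretrained SigLIP-2 ViT, and the arrays here are placeholders:

```python
import numpy as np

# Shape-only sketch of the tokenization above: a 224x224 image is split
# into 14x14 patches, giving a 16x16 grid of N=256 tokens, which the
# frozen SigLIP-2 So400M encoder maps to d=1152-dim latents.
image = np.zeros((3, 224, 224))
p = 14
grid = 224 // p                                  # 16 patches per side
patches = image.reshape(3, grid, p, grid, p)
patches = patches.transpose(1, 3, 0, 2, 4).reshape(grid * grid, 3 * p * p)
print(patches.shape)                             # (256, 588)

d = 1152
latents = np.zeros((grid * grid, d))             # shape the encoder emits
print(latents.shape)                             # (256, 1152)
```

The RAE decoder's job is then to invert this (256, 1152) tensor back to pixels.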

Training objective. Following RAE, we adopt ℓ1, LPIPS [99], and adversarial losses [33, 68]. Additionally, we integrate Gram loss [29], which has been found beneficial for reconstruction [52]. The training objective is set as

Table 2. Comparison of reconstruction performance. After expanding the training data, RAE outperforms SDXL-VAE across all three domains, though it still falls short of FLUX-VAE. Within RAE variants, WebSSL reconstructs better than SigLIP-2.

L(x, x̂) = ℓ1(x, x̂) + ω_L · LPIPS(x, x̂) + ω_G · Gram(x, x̂) + ω_A · Adv(x, x̂), where x̂ = RAE(x). We include the weights and training details in Sec. A.
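As an illustration, a minimal NumPy sketch of this combined objective. The LPIPS and adversarial terms are stubbed out (they require pretrained networks), the Gram term is computed on raw pixel arrays for simplicity (in practice it is typically computed on perceptual features), and the weights shown are placeholders, not the paper's values:

```python
import numpy as np

def gram_matrix(feats):
    """Gram matrix of a (C, H, W) feature map, normalized by C*H*W."""
    c, h, w = feats.shape
    f = feats.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def rae_decoder_loss(x, x_hat, lpips_fn, adv_fn, w_l=1.0, w_g=1.0, w_a=0.1):
    """Combined reconstruction objective from the equation above:
    l1 + w_l*LPIPS + w_g*Gram + w_a*Adv. lpips_fn / adv_fn stand in
    for a pretrained LPIPS network and a GAN discriminator, which
    are not reproduced in this sketch; weights are placeholders."""
    l1 = np.abs(x - x_hat).mean()
    gram = np.abs(gram_matrix(x) - gram_matrix(x_hat)).mean()
    return l1 + w_l * lpips_fn(x, x_hat) + w_g * gram + w_a * adv_fn(x_hat)

# Toy usage with stubbed perceptual/adversarial terms:
x = np.random.rand(3, 32, 32)
loss = rae_decoder_loss(x, x * 0.9,
                        lpips_fn=lambda a, b: 0.0,
                        adv_fn=lambda a: 0.0)
print(float(loss) > 0)
```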

Training data. We use a dataset combining roughly 73M images from three sources: web images from FuseDiT [77], synthetic images generated by FLUX.1-schnell [46], and RenderedText [87], which focuses on text-rendering scenes. Details are provided in Sec. 4.

Evaluation. We evaluate rFID-50k [36] of reconstructed images in three representative domains: (i) ImageNet-1k [67] for classic object-centric evaluation, (ii) YFCC [80] for diverse web-scale imagery, and (iii) a RenderedText [87] held-out set for text-rendering and typography-specific evaluation. We evaluate rFID on 50k samples from each data source and present our results in Tabs. 1 and 2.

Web-scale training of RAE decoders. As shown in Tab. 1, expanding decoder training beyond ImageNet to include web-scale and synthetic data yields only marginal gains on ImageNet itself, but provides moderate improvements on more diverse images (YFCC). This indicates that exposure to a broader distribution enhances the decoder's generalizability. However, generic web data is insufficient for text reconstruction. Training on Web + Synthetic data yields little improvement over ImageNet-only training. In contrast, performance improves substantially once text-specific data is included, highlighting that reconstruction quality is very sensitive to the composition of the training data. As shown in Fig. 2, training the RAE decoder with additional text data is essential for accurate text reconstruction. Overall, RAE reconstruction improves with scale, but the composition of data, not just its size, matters: each domain benefits most from domain-matched coverage.

Different encoders. We also experiment with training RAE using different pretrained encoders. In particular, we replace SigLIP-2 with WebSSL-DINO [26], a large-scale self-supervised model. As shown in Tab. 2, WebSSL-DINO achieves stronger reconstruction performance than SigLIP-2 across all domains, including text reconstruction. Both SigLIP-2 and WebSSL-L consistently outperform SDXL VAE [60], though they still fall short of FLUX VAE [46].

Figure 3. Overview of training pipeline . Left : RAE decoder training stage. We train a decoder on the representations (yellow tokens) produced by the frozen RAE encoder. Right : End-to-end unified training of the autoregressive model, diffusion transformer, and learnable query tokens (gray tokens) using cross-entropy (CE) loss for text prediction and a flow-matching objective for image prediction.



Convergence. We first compare the convergence behavior. We train a Qwen2.5-1.5B LLM with a 2.4B DiT backbone. As shown in Fig. 1, the RAE-based model converges significantly faster than its VAE counterpart, achieving a 4.0× speedup on GenEval and a 4.6× speedup on DPG-Bench.

Scaling DiT models. We use Qwen-2.5 1.5B as the language backbone, and train DiT variants of 0.5B, 2.4B, 5.5B, and 9.8B parameters. The architectures of these DiT variants are designed following recent advances in large-scale vision models [24, 26, 85, 88], and detailed model specifications are provided in Sec. B. In this experiment, we train all the models for 30k iterations with a batch size of 2048.

In Fig. 5a, we find that RAE-based models consistently outperform their VAE counterparts at all scales. Even for the smallest 0.5B DiT, where the network width only slightly exceeds the RAE latent dimension, the RAE-based model still shows clear advantages over the VAE baseline.

We also observe diminishing returns when scaling DiT models beyond 6B parameters. The performance trend appears to plateau, suggesting that simply increasing model size without proportionally improving data quality and diversity may lead to underutilized capacity. This observation aligns with discussions in large-scale visual SSL literature [26], which highlight the need for high-quality data scaling to fully exploit model capacity.

Scaling LLM backbones. We study how scaling the LLM backbone influences text-to-image performance when

Table 4. SSL encoders are effective RAE backbones for T2I. A WebSSL-based RAE performs slightly worse than SigLIP-2 but remains stronger than FLUX VAE in T2I.

paired with diffusion models of different sizes. To this end, we train both RAE- and VAE-based models using LLMs of {1.5B, 7B} parameters combined with DiTs of {2.4B, 5.5B, 9.8B}, and present the results in Fig. 5b.

We observe performance gains from scaling the LLM to 7B, particularly when paired with RAE. We note that prior studies, such as MetaQuery [56], reported limited benefits from LLM scaling. Our results diverge from this conclusion, likely due to two key factors: (1) we evaluate on significantly larger diffusion backbones (up to 9.8B) which can better exploit rich text representations, and (2) we finetune the LLM backbone, allowing it to adapt its latent space for generative tasks more effectively than frozen approaches.

Generalizing to other vision encoders. We also experiment with training RAE with WebSSL ViT-L [26]. Under the same 1.5B LLM and 2.4B DiT setup, the WebSSL RAE performs slightly below the SigLIP-2 version but still exceeds the FLUX VAE baseline (Tab. 4). This finding is notable because WebSSL is not explicitly aligned with text; it suggests that the RAE framework in T2I training is robust to the choice of encoder.




RAE is Simpler in T2I

The original RAE framework [100] introduced a suite of specialized design choices, including dimension-dependent noise scheduling, noise-augmented decoding, and a modified backbone (DiT DH), to enable diffusion on high-dimensional latents. While these modifications proved effective for class-conditional ImageNet generation, it remains unclear which are fundamental requirements for high-dimensional diffusion and which are adaptations for lower-capacity regimes.

In this section, we systematically stress-test these components to determine their necessity under large-scale T2I settings. Our analysis reveals that adapting the noise schedule to the latent dimension is critical for convergence, whereas the architectural modifications proposed in the original work, such as wide diffusion heads and noise augmentation, become redundant at scale.

Experiment Setup

Model architecture. We adopt the MetaQuery architecture [56] for text-to-image (T2I) generation and unified modeling. The model is initialized from a pretrained language model (LLM) and prepends a sequence of learnable query tokens to the text prompt. The number of query tokens is set to 256, matching the number of visual tokens (16 × 16) produced by the representation encoder. The LLM jointly processes the text and queries, producing query-token representations that serve as the conditioning signal. A 2-layer MLP connector then projects these representations from the LLM's hidden space into the DiT model [58].
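A shape-level sketch of this conditioning path may help; all dimensions except the 256 query tokens are illustrative, and the LLM forward pass is replaced by an identity stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shape-level sketch of the MetaQuery-style conditioning path described
# above. Dimensions are illustrative except num_queries=256, which
# matches the 16x16 visual token grid.
num_queries, d_llm, d_dit = 256, 1536, 2048

text_emb = rng.normal(size=(12, d_llm))            # embedded prompt tokens
queries = rng.normal(size=(num_queries, d_llm))    # learnable query tokens

# The LLM jointly processes [queries; text]; here an identity stand-in.
seq = np.concatenate([queries, text_emb], axis=0)
llm_out = seq                                       # placeholder LLM forward
query_states = llm_out[:num_queries]                # query-token outputs

# 2-layer MLP connector projecting LLM space -> DiT conditioning space.
w1 = rng.normal(size=(d_llm, d_llm))
w2 = rng.normal(size=(d_llm, d_dit))
cond = np.maximum(query_states @ w1, 0.0) @ w2      # (256, 2048) conditioning
print(cond.shape)
```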

For this DiT model, we adopt the design based on LightningDiT [92] and train it using the flow matching objective [47]. Critically, our model does not operate in a compressed VAE space. Instead, the DiT learns to model the distribution of high-dimensional, semantic representations generated by the frozen representation encoder. During inference, the DiT generates a set of features conditioned on the query tokens, which are then passed to our trained RAE decoder for rendering into pixel space.

We also train visual instruction tuning [49, 50] for image understanding. For this, we use a separate 2-layer MLP projector that maps visual tokens into the LLM's embedding space. Importantly, these visual tokens come from the same frozen representation encoder whose features the diffusion model is trained to generate.

Unless otherwise specified, we use SigLIP-2 So400M (patch size 14) [84] as our representation encoder and Qwen-2.5 1.5B [61] as the LLM in our experiments. We fix the number of visual tokens to 256, resulting in 224-resolution images for RAE and 256-resolution images for VAE.

Flow matching. Following standard practice, we adopt the flow matching objective [47, 51] with linear interpolation x_t = (1 − t)·x + t·ε, where x ∼ p(x) and ε ∼ N(0, I), and train the model to predict the velocity v(x_t, t). Unless otherwise noted, we employ a 50-step Euler sampler for generation, consistent with RAE [100].
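The objective and sampler can be sketched as follows; the trained DiT is replaced by an oracle velocity field so the 50-step Euler loop can be checked end to end:

```python
import numpy as np

rng = np.random.default_rng(0)

def interpolate(x, eps, t):
    """Linear flow-matching interpolation x_t = (1-t)*x + t*eps."""
    return (1.0 - t) * x + t * eps

x = rng.normal(size=(256, 1152))     # clean RAE latents (toy values)
eps = rng.normal(size=x.shape)       # Gaussian noise
t = 0.3
x_t = interpolate(x, eps, t)
v_target = eps - x                   # training target: d(x_t)/dt

# 50-step Euler sampler. A trained DiT would replace `velocity`; here
# we use the exact field purely for illustration.
def velocity(z_t, t_cur):
    return eps - x                   # oracle stand-in, not a real model

steps = 50
z = eps.copy()                       # start from pure noise at t=1
for i in range(steps):
    t_cur = 1.0 - i / steps
    z = z - (1.0 / steps) * velocity(z, t_cur)

print(np.allclose(z, x, atol=1e-6))  # exact field -> Euler recovers x
```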

Evaluation. We evaluate using two widely adopted metrics: the GenEval score [32] and the DPG-Bench score [39].


Noise scheduling remains crucial for T2I

The RAE work [100] argues that conventional noise schedules become suboptimal when applied to high-dimensional latent spaces. The paper proposes a dimension-dependent noise schedule shift [25] that rescales the diffusion timestep according to the effective data dimension m = N × d (number of tokens × token dimension). Formally, given a base schedule t n ∈ [0 , 1] defined for a reference dimension n , the shifted timestep is computed as

$$
t_m = \frac{\alpha\, t_n}{1 + (\alpha - 1)\, t_n}, \qquad \alpha = \sqrt{\frac{m}{n}}.
$$

We follow the RAE setting and use n =4096 as the base dimension for computing the scaling factor α . We experiment with and without applying the dimension-dependent shift when training text-to-image diffusion models on RAE latents, as shown below.

Consistent with Zheng et al. [100], applying the noise shift dramatically improves both GenEval and DPG-Bench scores, demonstrating that adjusting the schedule to the effective latent dimension is critical for T2I.
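For reference, a small sketch of the dimension-dependent shift as described above, assuming it takes the standard rescaling form from [25] with α = sqrt(m/n):

```python
import math

def shift_timestep(t, m, n=4096):
    """Dimension-dependent timestep shift: rescale a base schedule t
    defined for reference dimension n to effective dimension m = N*d,
    assuming the shift form t_m = a*t / (1 + (a-1)*t) with a = sqrt(m/n)
    used by SD3-style schedules and RAE."""
    a = math.sqrt(m / n)
    return a * t / (1.0 + (a - 1.0) * t)

# SigLIP-2 latents: N = 256 tokens, d = 1152 -> m = 294912.
m = 256 * 1152
# The shift leaves the endpoints fixed but pushes mid-schedule
# timesteps toward 1, i.e. toward higher noise levels.
print(shift_timestep(0.0, m), shift_timestep(0.5, m), shift_timestep(1.0, m))
```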

Design Choices that Saturate at Scale

While dimension-aware noise scheduling proves essential, we find that other design choices in RAE, which was originally developed for smaller-scale ImageNet models, provide diminishing returns at T2I scale.

Noise-augmented decoding. RAE originally proposed training decoders on perturbed latents to bridge the gap between training and inference distributions. Formally, it trains the RAE decoder on smoothed inputs z′ = z + n, where n ∼ N(0, σ²I) and σ is sampled from |N(0, τ²)|. We set τ = 0.2, as we find that an overly large τ makes decoder training hard to converge.
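A minimal NumPy sketch of this augmentation:

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_augment(z, tau=0.2, rng=rng):
    """Noise-augmented decoder inputs as described above:
    z' = z + n with n ~ N(0, sigma^2 I), where sigma itself is drawn
    from the half-normal |N(0, tau^2)|; tau = 0.2 as in our setup."""
    sigma = abs(rng.normal(0.0, tau))
    return z + rng.normal(0.0, sigma, size=z.shape), sigma

z = np.zeros((256, 1152))            # toy latents
z_noisy, sigma = noise_augment(z)
print(sigma >= 0, z_noisy.shape)
```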

We visualize the effect of noise-augmented decoding at different training stages in Fig. 4a. The gains are noticeable early in training (before ∼ 15k steps), when the model is still far from convergence, but become negligible at later stages. This suggests that noise-augmented decoding acts as a form of regularization that matters most when the model has not yet learned a robust latent manifold.

Wide DDT head. The DiT DH architecture augments a standard DiT with a shallow but wide DDT head, increasing denoising width without widening the entire backbone. In standard ImageNet-scale DiTs, the backbone width ( d ≈ 1024 ) is often narrower than the high-dimensional RAE latent targets ( d = 1152 ). DiT DH circumvents this by appending a wide, shallow denoising head ( d = 2688 ) without incurring the cost of widening the full backbone.

However, the T2I setting operates in a different regime. Modern large-scale T2I DiTs [46, 88] (≥ 2B parameters) possess hidden dimensions (d ≥ 2048) that inherently exceed the latent dimension. We hypothesize that this natural width eliminates the bottleneck DiT DH was designed to fix.

To verify this, we train DiT variants across three scales (0.5B, 2.4B, and 3.1B), comparing standard architectures against counterparts augmented with the +0.28B-parameter DiT DH head. As shown in Fig. 4b, the results confirm our hypothesis: at 0.5B, where the backbone is narrow, DiT DH

Figure 4. Design choices that saturate at T2I scale. Left: Noise-augmented decoding provides substantial gains early in training but becomes negligible by 120k steps. Right: DiT DH yields large gains at 0.5B (+11.2 GenEval), but the advantage diminishes at > 2.4B, where backbone capacity dominates.


provides a critical +11.2 GenEval boost. Yet as the model scales to 2.4B and beyond, this advantage saturates greatly.

This finding clarifies that DiT DH is a patch for capacity-constrained models, not a fundamental requirement for RAE. For scalable T2I training, standard DiT architectures are already sufficient.

Summary. Our experiments reveal clear design principles for scaling RAE-based diffusion models: dimension-aware noise scheduling remains non-negotiable, as it directly addresses the mathematical properties of high-dimensional latent spaces. In contrast, architectural refinements (DiT DH ) and training augmentations (noise-augmented decoding) that help at small scales provide diminishing returns as models grow: backbone capacity increasingly dominates performance. From here on, we adopt standard DiT architectures with proper noise scheduling and no noise-augmented decoding.
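To make the non-negotiable piece concrete, the sketch below shows an SD3-style timestep shift, one common form of dimension-dependent noise scheduling; the shift formula and the `alpha_from_dim` heuristic are illustrative assumptions, not the paper's exact rule:

```python
import math

def shift_timestep(t, alpha):
    """SD3-style timestep shift: for alpha > 1, pushes timesteps
    toward the high-noise end so that high-dimensional latents are
    sufficiently destroyed at intermediate t."""
    return alpha * t / (1.0 + (alpha - 1.0) * t)

def alpha_from_dim(d, d_base=16 * 16 * 4):
    """Illustrative heuristic: grow the shift with the square root of
    the dimension ratio against a small VAE-like baseline."""
    return math.sqrt(d / d_base)

# 256 tokens x 1152 dims dwarf a typical VAE latent, so the shift is large.
alpha = alpha_from_dim(256 * 1152)
t_mid = shift_timestep(0.5, alpha)
```

The endpoints t = 0 and t = 1 are fixed; only the interior of the schedule is remapped, which is why the adjustment depends solely on the latent dimensionality rather than the model architecture.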


Training Diffusion Model with RAE vs. VAE

In this section, we compare text-to-image diffusion training using the RAE (SigLIP-2) encoder versus a standard VAE (FLUX-VAE). For the VAE baseline, we adopt the state-of-the-art model from FLUX [46]. All experiments follow the same setup described in Sec. 3.1, with identical training configurations; the only difference lies in whether diffusion is performed in the RAE or VAE latent space. We defer implementation details to Sec. A.

Table 3. Data composition matters more than scale. Synthetic data substantially outperforms web data, and their combination (49.5 GenEval) surpasses even doubled synthetic data (48.0), demonstrating synergistic benefits from complementary data sources rather than volume alone.

Experimental Protocol. We organize our comparison into two stages: pretraining and finetuning . We train the Diffusion Transformer from scratch in both settings to ensure a fair, apples-to-apples comparison of convergence speed and data efficiency: the only component that differs is the latent space and its decoder (SigLIP-2 RAE vs. FLUX VAE). For the VAE baseline, we employ the FLUX VAE for generation while retaining the SigLIP encoder for understanding, as VAE latents are insufficient for perception [101]. This design choice effectively forms a two-tower architecture, mirroring the design of recent unified models like Bagel [17] and UniFluid [27].

Pretrain Data. We follow the data mixture developed in FuseDiT [77] and adopt the recaptioned texts and remixing ratios released by BLIP-3o [10]. The mixture combines mostly web data such as CC12M [7], SA-1B [44], and JourneyDB [73], totaling approximately 39.3M images. In addition, we use FLUX.1-schnell [46] to generate 24.7M synthetic images. We also train on Cambrian-7M [81] to develop the model's visual understanding capabilities.

We experiment with a Qwen-2.5 1.5B LLM and a 2.4B DiT to study how different pretraining corpora influence text-to-image performance. We train three variants: (i) Web-39M + Cambrian-7M, (ii) FLUX-generated synthetic data + Cambrian-7M, and (iii) their union. As shown in Tab. 3, the mixed dataset yields the best performance.

To ensure the gains are not simply due to more data, we also double the size of each individual source (Web ×2, Synthetic ×2). These runs yield much smaller improvements, indicating that the benefits arise from the complementary nature of the two data types rather than data volume alone.

We also find that synthetic data results in lower training loss and faster convergence, suggesting that FLUX images provide more stylistically consistent signals. Web-scale data, by contrast, is harder to fit but provides more diverse signals. When combined, the model inherits visual style from synthetic data and rich semantics from web data, leading to clear and robust improvements in generation quality.

Figure 5. RAE outperforms VAE across LLM and DiT scales. Top : With a 1.5B LLM, RAE-based models outperform VAE-based ones at all DiT sizes (0.5B, 2.4B, 5.5B, 9.8B). Bottom : Using a larger 7B LLM, RAE continues to maintain its advantage.



Model architecture. We adopt the MetaQuery architecture [56] for text-to-image (T2I) generation and unified modeling. The model initializes from a pretrained language model (LLM) and prepends a sequence of learnable query tokens to the text prompt. The number of query tokens is set to 256, matching the number of visual tokens ( 16 × 16 ) produced by the representation encoder. The LLM jointly processes the text and queries, producing query-token representations that serve as the conditioning signal. A 2-layer MLP connector then projects these representations from the LLM's hidden space into the input space of the DiT model [58].
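The conditioning path above can be sketched as follows; the dimensions, the identity LLM, and the linear connector are toy stand-ins for illustration only:

```python
import numpy as np

N_QUERY, D_LLM, D_DIT = 256, 1536, 2048   # dimensions are illustrative

rng = np.random.default_rng(0)
queries = 0.02 * rng.standard_normal((N_QUERY, D_LLM))  # learnable queries

def condition_from_prompt(text_embeds, llm_forward, connector):
    """MetaQuery-style conditioning: prepend the learnable queries to
    the prompt embeddings, run the LLM over the joint sequence, keep
    only the query positions, and project them for the DiT."""
    seq = np.concatenate([queries, text_embeds], axis=0)
    hidden = llm_forward(seq)              # stand-in for the LLM
    return connector(hidden[:N_QUERY])     # query-token representations

# Toy stand-ins: identity LLM, random linear connector.
W = 0.01 * rng.standard_normal((D_LLM, D_DIT))
cond = condition_from_prompt(rng.standard_normal((77, D_LLM)),
                             lambda s: s, lambda h: h @ W)
```

Only the query positions carry conditioning forward, so the DiT sees a fixed-length signal of 256 vectors regardless of prompt length.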

For this DiT model, we adopt the design based on LightningDiT [92] and train it using the flow matching objective [47]. Critically, our model does not operate in a compressed VAE space. Instead, the DiT learns to model the distribution of high-dimensional, semantic representations generated by the frozen representation encoder. During inference, the DiT generates a set of features conditioned on the query tokens, which are then passed to our trained RAE decoder for rendering into pixel space.

We also train visual instruction tuning [49, 50] for image understanding. For this, we use a separate 2-layer MLP projector that maps visual tokens into the LLM's embedding space. Importantly, these visual tokens come from the same frozen representation encoder whose features the diffusion model is trained to generate.

Unless otherwise specified, we use SigLIP-2 So400M (patch size 14) [84] as our representation encoder and Qwen-2.5 1.5B [61] as the LLM in our experiments. We fix the number of visual tokens to 256, resulting in 224-resolution images for RAE and 256-resolution images for VAE.

Flow matching. Following standard practice, we adopt the flow matching objective [47, 51] with linear interpolation x_t = (1 − t) x + t ε, where x ∼ p(x) and ε ∼ N(0, I), and train the model to predict the velocity v(x_t, t). Unless otherwise noted, we employ a 50-step Euler sampler for generation, consistent with RAE [100].
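A minimal NumPy sketch of this objective and sampler; with the linear path, the true velocity is the constant ε − x, so an oracle predictor lets a plain Euler sampler recover x exactly:

```python
import numpy as np

def fm_loss(v_pred, x, eps, t):
    """Flow-matching loss with the linear path x_t = (1 - t) x + t eps;
    the regression target is the path's velocity d x_t / d t = eps - x."""
    x_t = (1.0 - t) * x + t * eps
    return float(np.mean((v_pred(x_t, t) - (eps - x)) ** 2))

def euler_sample(v_pred, eps, steps=50):
    """50-step Euler sampler: integrate from pure noise (t = 1) down
    to data (t = 0) along the predicted velocity field."""
    x_t, dt = eps.copy(), 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        x_t = x_t - dt * v_pred(x_t, t)
    return x_t

rng = np.random.default_rng(0)
x, eps = rng.standard_normal(16), rng.standard_normal(16)
oracle = lambda x_t, t: eps - x        # the true (constant) velocity
recon = euler_sample(oracle, eps)
```

In practice `v_pred` is the DiT conditioned on the query tokens, and the same 50-step loop is applied per latent token.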

Evaluation. We evaluate using two widely adopted metrics: the GenEval score [32] and the DPG-Bench score [39].



Pretraining

Convergence. We first compare the convergence behavior. We train a Qwen2.5-1.5B LLM with a 2.4B DiT backbone. As shown in Fig. 1, the RAE-based model converges significantly faster than its VAE counterpart, achieving a 4.0× speedup on GenEval and a 4.6× speedup on DPG-Bench.

Scaling DiT models. We use Qwen-2.5 1.5B as the language backbone, and train DiT variants of 0.5B, 2.4B, 5.5B, and 9.8B parameters. The architectures of these DiT variants are designed following recent advances in large-scale vision models [24, 26, 85, 88], and detailed model specifications are provided in Sec. B. In this experiment, we train all the models for 30k iterations with a batch size of 2048.

In Fig. 5a, we find that RAE-based models consistently outperform their VAE counterparts at all scales. Even for the smallest 0.5B DiT, where the network width only slightly exceeds the RAE latent dimension, the RAE-based model still shows clear advantages over the VAE baseline.

We also observe diminishing returns when scaling DiT models beyond 6B parameters. The performance trend appears to plateau, suggesting that simply increasing model size without proportionally improving data quality and diversity may lead to underutilized capacity. This observation aligns with discussions in large-scale visual SSL literature [26], which highlight the need for high-quality data scaling to fully exploit model capacity.

Scaling LLM backbones. We study how scaling the LLM backbone influences text-to-image performance when

Table 4. SSL encoders are effective RAE backbones for T2I. A WebSSL-based RAE performs slightly worse than SigLIP-2 but remains stronger than FLUX VAE in T2I.

paired with diffusion models of different sizes. To this end, we train both RAE- and VAE-based models using LLMs of { 1.5B, 7B } parameters combined with DiTs of { 2.4B, 5.5B, 9.8B } , and present the results in Fig. 5b.

We observe performance gains from scaling the LLM to 7B, particularly when paired with RAE. We note that prior studies, such as MetaQuery [56], reported limited benefits from LLM scaling. Our results diverge from this conclusion, likely due to two key factors: (1) we evaluate on significantly larger diffusion backbones (up to 9.8B) which can better exploit rich text representations, and (2) we finetune the LLM backbone, allowing it to adapt its latent space for generative tasks more effectively than frozen approaches.

Generalizing to other vision encoders. We also experiment with training RAE with WebSSL ViT-L [26]. Under the same 1.5B LLM and 2.4B DiT setup, the WebSSL RAE performs slightly below the SigLIP-2 version but still exceeds the FLUX VAE baseline (Tab. 4). This finding is notable because WebSSL is not explicitly aligned with text; it suggests that the RAE framework in T2I training is robust to the choice of encoder.


LLM model and unified model configs. We use pretrained Qwen2.5 [61] language models at the 1.5B and 7B scales in our experiments. Following prior work [49, 81], we use a 2-layer MLP to project visual features from the representation encoder into the LLM embedding space, and a separate linear layer to map the LLM's query-token outputs into the input space of the diffusion model.

DiT Model configs. We design our diffusion architecture following LightningDiT [92]. Motivated by recent findings in scaling vision backbones [16, 26, 85], we prioritize increasing model width rather than depth when scaling DiT models. Consistent with insights from the RAE paper [100], we also ensure that the DiT hidden dimension remains strictly larger than the target latent dimension (e.g., 1152 for the SigLIP-2 So400M model), including at small scales such as DiT-0.5B. The detailed model specifications are provided in Tab. 10.
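As an illustration of the width rule, the configs below are hypothetical (Tab. 10 has the real specifications); the point is only that every width strictly exceeds the 1152-dim latent target and that widths grow faster than depths:

```python
# Hypothetical DiT configs for illustration (see Tab. 10 for the real ones).
LATENT_DIM = 1152  # SigLIP-2 So400M token dimension

DIT_CONFIGS = {
    "DiT-0.5B": {"width": 1280, "depth": 22},
    "DiT-2.4B": {"width": 2048, "depth": 24},
    "DiT-5.5B": {"width": 2816, "depth": 26},
    "DiT-9.8B": {"width": 3584, "depth": 28},
}

def width_ok(cfg, latent_dim=LATENT_DIM):
    """Check the RAE constraint: hidden width strictly above latent dim."""
    return cfg["width"] > latent_dim
```

Keeping the width above the latent dimension is what lets the standard DiT absorb the role the wide DDT head played at ImageNet scale.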



Finetuning

Following standard practice in T2I training [8, 15, 60], models are finetuned on a smaller high-quality dataset after large-scale pretraining. We evaluate this finetuning stage for both RAE- and VAE-based models under identical settings. Unless otherwise noted, we use the BLIP-3o 60k dataset [10] and start from the 1.5B LLM + 2.4B DiT checkpoint trained for 30k steps in Sec. 4.1. We update both the LLM and the DiT; additional details are provided in Sec. A.

RAE-based models consistently outperform VAE-based models. We finetune both families of models for { 4, 16, 64, 128, 256 } epochs and compare performance on GenEval and DPG-Bench in Fig. 6. Across all epoch counts and settings, the RAE-based model maintains an advantage on both benchmarks.

RAE-based models are less prone to overfitting. As shown in Fig. 6, VAE-based models degrade significantly after 64 epochs. Training loss analysis (Appendix Fig. 9) reveals that the VAE loss collapses rapidly to near zero, suggesting the model is memorizing individual training samples rather than learning the underlying distribution. In contrast, RAE-based models remain stable and show only a mild decline. We hypothesize that the higher-dimensional and semantically structured latent space of the RAE 1 may provide an implicit regularization effect, helping mitigate overfitting during finetuning.

Figure 6. RAE-based models outperform VAE-based models and are less prone to overfitting. We train both models for 256 epochs and observe that (1) RAE-based models consistently achieve higher performance, and (2) VAE-based models begin to overfit rapidly after 64 epochs.

Figure 7. RAE-based models outperform VAEs across different settings. Left : When fine-tuning only the DiT versus the full LLM+DiT system, RAE models consistently achieve higher GenEval scores. Right : RAE models maintain their advantage over VAE across all DiT model scales (0.5B-9.8B parameters), with the performance gap widening as model size increases.

RAE's advantage generalizes across settings. To verify whether RAE's advantage over VAE extends beyond our main setup, we conduct two additional experiments: 1) finetuning only the DiT while freezing the LLM (following recent works [10, 56]), and 2) scaling DiT models to different sizes (0.5B-9.8B parameters). Figure 7 shows that RAE consistently outperforms VAE in both settings. The left panel shows that both selective fine-tuning (DiT-only) and joint fine-tuning (LLM+DiT) favor RAE over VAE; notably, the top-performing VAE configuration reaches 78.2, while even the weakest RAE approach achieves 79.4. The right panel shows continued RAE gains across the scaling range, with larger models exhibiting greater improvements.

1 SigLIP-2 produces 1152-dim. tokens, vs. fewer than 100 dimensions in typical VAEs.

Figure 8. Test-time scaling in latent space. Our framework allows the LLM to directly evaluate and select generation results within the latent space, bypassing the decode-re-encode process.



Implications for Unified Models

A key advantage of the RAE framework is that it unifies visual understanding and generation within a single, shared, high-dimensional latent space . This contrasts with the conventional two-tower design (used in our Section 4 baseline), in which the generation head operates in a latent space alien to the LLM's understanding encoder, effectively leaving the unified model 'blind' to its own output distribution without a VAE decoder. In contrast, RAE forces generation to occur in the same representation space as the visual encoder: the model generates the exact same high-dimensional features it uses to see.

Test-time scaling in latent space. A direct benefit of this shared representation is that the LLM can interpret the latents produced by the diffusion model without needing to decode them into pixels and re-encode them, keeping the representation and pixel spaces fully decoupled. We leverage this property to implement Latent Test-Time Scaling (TTS) , where the LLM acts as a verifier for its own generations directly in the feature space (Fig. 8).

We explore two verifier metrics that leverage the LLM's understanding capabilities to score generated latents: (1) Prompt Confidence : We re-inject the generated latents and the original text prompt back into the LLM and measure the aggregate token-level confidence of the prompt, following Kang et al. [42]. (2) Answer Logits : We explicitly query the LLM with: 'Does this generated image ⟨ image ⟩ align with the ⟨ prompt ⟩ ?' and use the logit probability of the 'Yes' token as the score.

With the verifier defined, we adopt the standard test-time scaling protocol [53, 90] using a best-of-N selection strategy. As shown in Tab. 5, both verification metrics yield consistent improvements on GenEval, demonstrating that latent-space TTS is not only feasible but also an effective way to enhance generation quality. Crucially, this improvement is achieved entirely within the semantic latent space, demonstrating that the model can verify the quality of its own generations without ever needing to render pixels.
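The best-of-N protocol can be sketched as follows; `generate` and `score` are toy stand-ins for the diffusion model and the LLM verifier described above:

```python
import numpy as np

def best_of_n(generate, score, prompt, n=8, k=4):
    """Best-of-N selection in latent space: draw n latents, score each
    with the LLM verifier (e.g. prompt confidence or the 'Yes'-token
    probability), and keep the top k, never decoding to pixels."""
    latents = [generate(prompt) for _ in range(n)]
    scores = np.array([score(prompt, z) for z in latents])
    keep = np.argsort(scores)[::-1][:k]    # indices of the k best
    return [latents[i] for i in keep]

# Toy stand-ins: latents are scalars and the "verifier" prefers larger ones.
rng = np.random.default_rng(0)
picked = best_of_n(lambda p: float(rng.uniform()),
                   lambda p, z: z, "a red cube", n=8, k=4)
```

With k = 4 and n = 8 this corresponds to the '4/8' setting reported in Tab. 5.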

Table 5. TTS results across LLM-DiT configurations. Substantial performance improvements observed with both verifier metrics on GenEval. 4/8 refers to 'selecting best 4 out of 8'.

Table 6. Generative training leaves understanding intact; RAE and VAE perform similarly. Across VL benchmarks, both latent choices produce comparable understanding performance.

Visual understanding. Finally, we conduct a comparative study of how the choice of visual generation backbone (VAE versus RAE) affects multimodal understanding performance. We evaluate the trained models on standard benchmarks: MME [28], TextVQA [71], AI2D [37], SeedBench [30], MMMU [95], and MMMU-Pro [96]. We emphasize that the goal of this work is not to build a SOTA VQA model; achieving that would require additional components such as any-resolution inputs, multimodal continual pretraining, and very high-quality data.

Similar to prior findings [17, 27, 82], we observe in Tab. 6 that adding generative modeling does not degrade visual understanding performance. The choice of RAE vs. VAE in the generative path has little impact, likely because both variants share the same frozen understanding encoder.


VAE, representation and representation autoencoder. A common line of work uses VAEs [43] as autoencoders to compress images into low-dimensional latent spaces, typically with channel dimensions below 64 [25, 46]. Many studies [9, 93] have pursued aggressive compression, while others [55, 63] reduce dimensionality further by quantizing continuous latents into discrete codes. However, both directions unavoidably result in information loss.

Representation Autoencoders (RAE) [100] take a different route: use a frozen, pretrained representation encoder and train only the decoder to reconstruct images from high-dimensional semantic features. In ImageNet experiments, training diffusion transformers [58] in this latent space yields faster convergence and better performance than VAEs. In this work, we extend RAE to text-to-image generation and show that its reconstruction and generative advantages transfer naturally to the multimodal setting.

Recently, several works have explored leveraging representation encoders for reconstruction. SVG [69, 70] employs a residual encoder to refine visual details during reconstruction, while VTP [91] incorporates a reconstruction loss into the pretraining of representation encoders. VQRAE [23] further applies quantization on top of representation encoders to construct discrete representations for generation. In a related direction, another line of work [45, 57, 59] integrates representation encoders with VAEs to improve generation fidelity.

VAE in Text-to-image models. VAE has also been widely used in text-to-image models. Stable Diffusion [64] uses an off-the-shelf VAE and a text-conditioned U-Net [66] for T2I training. Subsequent work [25, 46, 60] improves the VAE through higher-quality and larger-scale training data.

Recently, Stable Diffusion 3 [25] shows that widening the VAE channels boosts reconstruction fidelity and enhances the scalability of the downstream diffusion model, while Hunyuan-Image-3 [5] further incorporates representation alignment [94] into VAE training.

This work takes the representation route a step further: instead of modifying VAEs, we train T2I models directly on high-dimensional representation spaces with RAE. This approach yields clear advantages over VAE in both convergence speed and final generation quality.

Unified Multimodal Models. Recently, many works focus on unifying multimodal understanding and generation within one modeling paradigm. One stream of work discretizes visual input and trains with next-token prediction [14, 41, 79, 86, 89]. Another stream incorporates diffusion models into LLMs [10, 15, 17, 20, 31, 56, 75, 82, 101]. However, it has been widely held that understanding and generation require different visual representations (high-dimensional CLIP features for understanding, low-dimensional VAE latents for generation), leading most unified models to adopt a two-tower design.

An emerging direction in unified multimodal modeling is to unify understanding and generation into a shared latent space. To work around this mismatch, recent approaches [13, 40, 52, 78, 97] adopt continuous representation spaces but introduce substantial downsampling for generation. For example, Chen et al. [13] use high-dimensional, uncompressed features for understanding but fall back to compressed, lower-dimensional latents for generation. Jiao et al. [41] and Yue et al. [97] employ compressed embeddings for both understanding and generation, limiting the model's perception ability. BLIP-3o [10] experiments with using a Qwen2.5-VL encoder [3] for understanding and EVA-CLIP [74, 76] for generation; however, because the model does not apply noise-schedule shifting and its DiT width is smaller than the EVA-CLIP embedding dimension, it relies on a strong diffusion decoder [60] to map these features back to pixels.

Our work takes a step forward by using a single high-dimensional encoder for both understanding and generation. Leveraging RAE designs, the model enjoys a simpler architecture that understands and generates directly in this semantic space, surpassing VAE-based designs in T2I.


Shengbang Tong*, Boyang Zheng*, Ziteng Wang*, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, Saining Xie (New York University)




Conclusion

In this work, we investigate scaling Representation Autoencoders (RAEs) to text-to-image generation. Our study begins by scaling the decoder, where we find that while larger data scales improve general fidelity, specific domains such as text require targeted data composition. We then examine the RAE framework itself, revealing that scaling simplifies the design: dimension-dependent noise scheduling remains essential, but architectural modifications like DiT DH yield diminishing returns as model capacity increases. Building on this streamlined recipe, we show that RAE-based diffusion models consistently outperform state-of-the-art VAE baselines in convergence speed and generation quality, while being less prone to overfitting during finetuning. Collectively, these results establish RAE as a simple and effective foundation for large-scale generation. Moreover, by enabling understanding and generation to operate in a shared representation space, RAEs open new possibilities for unified models, such as the latent-space test-time scaling demonstrated in this work. We believe RAEs serve as a strong foundation for future research in both scalable generation and unified multimodal modeling.

Acknowledgements

The authors would like to thank Xichen Pan, Shusheng Yang, David Fan, John Nguyen for insightful discussions and feedback on the manuscript. This work was mainly supported by the Google TPU Research Cloud (TRC) program and the Open Path AI Foundation. ST is supported by Meta AI Mentorship Program. SX also acknowledges support from the MSIT IITP grant (RS-2024-00457882) and the NSF award IIS-2443404.


Implementation

Our experiments are conducted on TPU v4, v5p, and v6e with TorchXLA.

Decoder training. We largely follow RAE for the decoder architecture and adopt ViT-XL [22] as the default decoder. The decoder contains 28 blocks with a hidden size of 1152 and 16 heads. Decoder training details are included in Tab. 7. We find the GAN training recipe provided in [100] is not stable on web-scale images; to address this, we tune the recipe as shown in Tab. 7. On web-scale images, we find that DINO-S/16 already suffices as a strong discriminator, whereas DINO-S/8 as in [100] makes convergence difficult. We therefore use DINO-S/16 as the default discriminator. All inputs are interpolated to 224 × 224 resolution before being fed into the discriminator. We use an epoch-based training scheme and set the number of samples in each virtual epoch to match ImageNet (1.28M). For loss coefficients, we set ω_G = 100.0, ω_L = 1.0, ω_A = 10.0.

T2I & unified model pretraining. For pretraining experiments in Sec. 4.1, we primarily train on TPU-v5p-128 and TPU-v6e-64. Detailed training configurations are provided in Tab. 8. We find that finetuning a pretrained LLM while training the DiT from scratch benefits from using separate optimizers, and properly decoupling their optimizer settings substantially improves training stability. We use SPMD sharding [4] together with TorchXLA to train the LLM, adapters, and DiT models.
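As an illustration, the decoupled optimizer settings described above (and listed in Tab. 8) can be expressed as two separate configurations. This is a sketch of the configuration values only; the dictionary layout is illustrative, not the actual training framework:

```python
def make_optimizer_configs():
    """Per-component AdamW settings for joint LLM + DiT training (Tab. 8).

    The pretrained LLM is finetuned with a small learning rate, while the
    from-scratch DiT uses a 10x larger rate and a lower beta2.
    """
    return {
        "llm": {  # finetuned from a pretrained checkpoint
            "optimizer": "AdamW",
            "max_lr": 5e-5,
            "betas": (0.9, 0.999),
            "schedule": "cosine",
            "warmup_ratio": 0.0134,
        },
        "dit": {  # trained from scratch
            "optimizer": "AdamW",
            "max_lr": 5e-4,
            "betas": (0.9, 0.95),
            "schedule": "cosine",
            "warmup_ratio": 0.0134,
        },
    }
```

Keeping the two configurations in separate optimizers, rather than parameter groups of one optimizer, is what allows the betas and schedules to be decoupled cleanly.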

T2I & unified model finetuning. We finetune pre-trained models on the BLIP3o-60k dataset [11] using TPU-v4-128 and TPU-v5p-64. To ensure a fair comparison, we apply identical training configurations to both RAE and VAE models across 4, 16, 64, and 256 epochs. We utilize the same codebase and training infrastructure as the pretraining stage. We use a global batch size of 1024 and the optimization settings detailed in Tab. 9.

Synthetic data generation. For synthetic image generation, we compile prompts from publicly available prompt datasets². Using these prompts, we generate 24.7M synthetic images with FLUX.1-schnell [46], which form part of our decoder training and T2I training corpus. We perform large-scale generation using TPU-v6e and will open-source the inference pipeline to facilitate future research.

² https://huggingface.co/datasets/Geonmo/midjourney-prompts-only, https://huggingface.co/datasets/FredZhang7/stable-diffusion-prompts-2.47M, https://huggingface.co/datasets/neuralworm/stable-diffusion-discord-prompts, https://huggingface.co/datasets/isidentical/random-stable-diffusion-prompts, https://huggingface.co/datasets/CaptionEmporium/coyo-hd-11m-llavanext


Convergence. We first compare the convergence behavior. We train a Qwen2.5-1.5B LLM with a 2.4B DiT backbone. As shown in Fig. 1, the RAE-based model converges significantly faster than its VAE counterpart, achieving a 4.0× speedup on GenEval and a 4.6× speedup on DPG-Bench.

Scaling DiT models. We use Qwen-2.5 1.5B as the language backbone, and train DiT variants of 0.5B, 2.4B, 5.5B, and 9.8B parameters. The architectures of these DiT variants are designed following recent advances in large-scale vision models [24, 26, 85, 88], and detailed model specifications are provided in Sec. B. In this experiment, we train all the models for 30k iterations with a batch size of 2048.

In Fig. 5a, we find that RAE-based models consistently outperform their VAE counterparts at all scales. Even for the smallest 0.5B DiT, where the network width only slightly exceeds the RAE latent dimension, the RAE-based model still shows clear advantages over the VAE baseline.

We also observe diminishing returns when scaling DiT models beyond 6B parameters. The performance trend appears to plateau, suggesting that simply increasing model size without proportionally improving data quality and diversity may lead to underutilized capacity. This observation aligns with discussions in large-scale visual SSL literature [26], which highlight the need for high-quality data scaling to fully exploit model capacity.

Scaling LLM backbones. We study how scaling the LLM backbone influences text-to-image performance when paired with diffusion models of different sizes. To this end, we train both RAE- and VAE-based models using LLMs of {1.5B, 7B} parameters combined with DiTs of {2.4B, 5.5B, 9.8B}, and present the results in Fig. 5b.

Table 4. SSL encoders are effective RAE backbones for T2I. A WebSSL-based RAE performs slightly worse than SigLIP-2 but remains stronger than FLUX VAE in T2I.

We observe performance gains from scaling the LLM to 7B, particularly when paired with RAE. We note that prior studies, such as MetaQuery [56], reported limited benefits from LLM scaling. Our results diverge from this conclusion, likely due to two key factors: (1) we evaluate on significantly larger diffusion backbones (up to 9.8B) which can better exploit rich text representations, and (2) we finetune the LLM backbone, allowing it to adapt its latent space for generative tasks more effectively than frozen approaches.

Generalizing to other vision encoders. We also experiment with training an RAE on WebSSL ViT-L [26]. Under the same 1.5B LLM and 2.4B DiT setup, the WebSSL RAE performs slightly below the SigLIP-2 version but still exceeds the FLUX VAE baseline (Tab. 4). This finding is notable because WebSSL is not explicitly aligned with text; it suggests that the RAE framework for T2I training is robust to the choice of encoder.


A key advantage of the RAE framework is that it unifies visual understanding and generation within a single, shared, high-dimensional latent space. This contrasts with the conventional two-tower design (used in our Section 4 baseline). In two-tower models, the generation head operates in a latent space alien to the LLM's understanding encoder, which effectively makes the unified model 'blind' to its own output distribution without a VAE decoder. In contrast, RAE forces generation to occur in the same representation space as the visual encoder, meaning the model generates exactly the high-dimensional features it uses to see.

Test-time scaling in latent space. A direct benefit of this shared representation is that the LLM can interpret the latents produced by the diffusion model without needing to decode them into pixels and re-encode them, leaving the representation and pixel spaces fully decoupled. We leverage this property to implement Latent Test-Time Scaling (TTS), where the LLM acts as a verifier for its own generations directly and only in the feature space (Fig. 8).

We explore two verifier metrics that leverage the LLM's understanding capabilities to score generated latents: (1) Prompt Confidence: We re-inject the generated latents and the original text prompt back into the LLM and measure the aggregate token-level confidence of the prompt, following Kang et al. [42]. (2) Answer Logits: We explicitly query the LLM with 'Does this generated image ⟨image⟩ align with the ⟨prompt⟩?' and use the logit probability of the 'Yes' token as the score.
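As an illustration, the Answer Logits metric reduces to reading the softmax probability of the 'Yes' token from the LLM's next-token distribution. A minimal numpy sketch, with a toy vocabulary and hypothetical logits standing in for a real LLM forward pass:

```python
import numpy as np

def answer_logits_score(logits, vocab, yes_token="Yes"):
    """Score a generated latent by the softmax probability the LLM
    assigns to 'Yes' when asked whether the image matches the prompt."""
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    return float(probs[vocab.index(yes_token)])

# Toy example: a 4-token vocabulary and hypothetical next-token logits.
vocab = ["Yes", "No", "maybe", "</s>"]
logits = np.array([2.0, 0.5, -1.0, -3.0])
score = answer_logits_score(logits, vocab)  # in (0, 1); higher = better match
```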

With the verifier defined, we adopt the standard test-time scaling protocol [53, 90] using a best-of-N selection strategy. As shown in Tab. 5, both verification metrics yield consistent improvements on GenEval, demonstrating that latent-space TTS is not only feasible but also an effective way to enhance generation quality. Crucially, this improvement is achieved entirely within the semantic latent space, demonstrating that the model can verify the quality of its own generations without ever needing to render pixels.
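The selection step itself is straightforward; a minimal sketch of best-of-N over verifier scores (the candidate names and score values below are purely illustrative):

```python
def best_of_n(latents, scores, k=1):
    """Best-of-N selection: keep the k candidates with the highest
    verifier scores (prompt confidence or answer logits)."""
    order = sorted(range(len(latents)), key=lambda i: scores[i], reverse=True)
    return [latents[i] for i in order[:k]]

# 'Selecting best 4 out of 8' as in Tab. 5: score 8 candidate latents,
# keep the top 4, and pass only those on to the decoder.
candidates = [f"latent_{i}" for i in range(8)]
scores = [0.31, 0.78, 0.55, 0.92, 0.12, 0.64, 0.40, 0.87]
top4 = best_of_n(candidates, scores, k=4)
```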

Table 5. TTS results across LLM-DiT configurations. Substantial performance improvements observed with both verifier metrics on GenEval. 4/8 refers to 'selecting best 4 out of 8'.

Table 6. Generative training leaves understanding intact; RAE and VAE perform similarly. Across VL benchmarks, both latent choices produce comparable understanding performance.




Models

LLM model and unified model configs. We use pretrained Qwen2.5 [61] language models at the 1.5B and 7B scales in our experiments. Following prior work [49, 81], we use a 2-layer MLP to project visual features from the representation encoder into the LLM embedding space, and a separate linear layer to map the LLM's query-token outputs into the input space of the diffusion model.
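The two adapters above can be sketched with plain numpy. The LLM/DiT widths (2048) and the ReLU activation are placeholder assumptions for illustration, not the paper's actual settings; only the overall shape flow (encoder dim 1152 → LLM space → diffusion input space) follows the text:

```python
import numpy as np

rng = np.random.default_rng(0)
d_enc, d_llm, d_dit = 1152, 2048, 2048  # d_llm / d_dit widths are assumed

# 2-layer MLP: representation-encoder features -> LLM embedding space.
W1 = rng.standard_normal((d_enc, d_llm)) * 0.02
W2 = rng.standard_normal((d_llm, d_llm)) * 0.02

def project_visual(tokens):
    h = np.maximum(tokens @ W1, 0.0)  # hidden layer (ReLU assumed)
    return h @ W2

# Separate linear layer: LLM query-token outputs -> diffusion model input.
W_out = rng.standard_normal((d_llm, d_dit)) * 0.02

def project_queries(q):
    return q @ W_out

vis = rng.standard_normal((256, d_enc))  # 16x16 SigLIP-2 tokens, d = 1152
llm_in = project_visual(vis)             # (256, d_llm)
```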

DiT Model configs. We design our diffusion architecture following LightningDiT [92]. Motivated by recent findings in scaling vision backbones [16, 26, 85], we prioritize increasing model width rather than depth when scaling DiT models. Consistent with insights from the RAE paper [100], we also ensure that the DiT hidden dimension remains strictly larger than the target latent dimension (e.g. 1152 for the SigLIP2 ViT-So model), including at small scales such as DiT-0.5B. The detailed model specifications are provided in Tab. 10.
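As a sanity check, the width constraint above can be encoded directly over the Tab. 10 specifications; a minimal sketch:

```python
# DiT variants from Tab. 10. Per the RAE paper's guidance, every hidden
# size must strictly exceed the SigLIP-2 latent dimension (1152).
LATENT_DIM = 1152

DIT_VARIANTS = {
    "DiT-0.5B": {"hidden": 1280, "heads": 32, "depth": 16},
    "DiT-2.4B": {"hidden": 2048, "heads": 32, "depth": 32},
    "DiT-3.3B": {"hidden": 2304, "heads": 32, "depth": 32},
    "DiT-5.5B": {"hidden": 3072, "heads": 32, "depth": 32},
    "DiT-9.8B": {"hidden": 4096, "heads": 32, "depth": 32},
}

def check_widths(variants, latent_dim=LATENT_DIM):
    """Verify that every DiT hidden size strictly exceeds the latent dim."""
    return all(v["hidden"] > latent_dim for v in variants.values())
```

Note that scaling proceeds almost entirely through width: depth is fixed at 32 beyond the 0.5B variant, consistent with the width-first strategy described above.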



Additional Results

Training losses. To complement the results in Sec. 4.2, we additionally compare the training loss curves of RAE and VAE models during finetuning in Fig. 9. We observe that the VAE model's loss decreases rapidly to a very low value, which correlates with the performance degradation observed in Fig. 6, a clear sign of overfitting. In contrast, the RAE model's loss decreases more gradually and stabilizes at a higher value, maintaining robust generation performance throughout training. This suggests that the high-dimensional semantic space of RAE provides a form of implicit regularization that prevents the model from memorizing the small finetuning dataset.

Figure 9. Diffusion loss during finetuning (256 epochs). RAE overfits less and later than VAE: the VAE loss plunges early to very low values, while the RAE loss decreases more gradually and plateaus at higher values, indicating reduced overfitting.


Extending finetuning to 512 epochs. We extend the finetuning experiment from Sec. 4.2 to 512 epochs. As shown in Fig. 6, VAE-based models already suffer substantial performance drops by 256 epochs, so we do not continue training them further. In contrast, the RAE-based model remains stable: even after 512 epochs (Fig. 10), it shows only a small decline in performance. This further supports the robustness of RAE-based methods under long-horizon finetuning.

Table 7. Training configuration for decoder and discriminator. Left: Configuration used for SigLIP2-So. Right: Configuration used for WebSSL ViT-L. Different encoders require slightly different training recipes for achieving strong decoder performance.

Table 8. Optimization hyperparameters for the LLM backbone and the DiT diffusion head in the unified T2I model.

Table 9. Finetuning hyperparameters.

Table 10. Architectural specifications of DiT variants.

Figure 10. Extended finetuning to 512 epochs. RAE maintains robust performance even with 512 epochs of training, while VAE suffers catastrophic overfitting after 64 epochs.


| Data Sources | #Data | ImageNet ↓ | YFCC ↓ | Text ↓ |
| --- | --- | --- | --- | --- |
| ImageNet | 1.28M | 0.462 | 0.97 | 2.64 |
| Web | 39.3M | 0.529 | 0.629 | 2.325 |
| Web + Synthetic | 64.0M | 0.437 | 0.683 | 2.406 |
| Web + Synthetic + Text | 73.0M | 0.435 | 0.702 | 1.621 |

| Family | Model | ImageNet ↓ | YFCC ↓ | Text ↓ |
| --- | --- | --- | --- | --- |
| VAE | SDXL | 0.93 | 1.168 | 2.057 |
| VAE | FLUX | 0.288 | 0.41 | 0.638 |
| RAE | WebSSL ViT-L | 0.388 | 0.558 | 1.372 |
| RAE | SigLIP-2 ViT-So | 0.435 | 0.702 | 1.621 |

| Setting | GenEval ↑ | DPG-Bench ↑ |
| --- | --- | --- |
| w/o shift | 23.6 | 54.8 |
| w/ shift | 49.6 | 76.8 |

| Training Data | GenEval ↑ | DPG-Bench ↑ |
| --- | --- | --- |
| Synthetic | 45.1 | 73.8 |
| Synthetic × 2 | 48 | 75.2 |
| Web | 25.9 | 69.5 |
| Web × 2 | 26.3 | 70.6 |
| Synthetic + Web | 49.5 | 76.9 |

| Model Variant | GenEval ↑ | DPG-Bench ↑ |
| --- | --- | --- |
| VAE-based models | | |
| FLUX VAE | 39.6 | 70.5 |
| RAE-based models | | |
| WebSSL ViT-L | 46.0 | 72.8 |
| SigLIP-2 ViT-So | 49.5 | 76.9 |

| Best-of-N | Prompt Confidence | Answer Logits |
| --- | --- | --- |
| 1.5B LLM + 5.5B DiT (GenEval = 53.2) | | |
| 4/8 | 56.7 | 59.6 |
| 4/16 | 57.5 | 62.5 |
| 4/32 | 60.0 | 64.3 |
| 7.0B LLM + 5.5B DiT (GenEval = 55.5) | | |
| 4/8 | 58.3 | 62.5 |
| 4/16 | 59.6 | 65.8 |
| 4/32 | 60.1 | 67.8 |

| Model | MME-P | TextVQA | AI2D | SeedBench | MMMU | MMMU-Pro |
| --- | --- | --- | --- | --- | --- | --- |
| Und.-only | 1374.8 | 44.7 | 63.9 | 67.1 | 40.2 | 20.5 |
| RAE-based | 1468.7 | 39.6 | 66.7 | 69.8 | 41.1 | 19.8 |
| VAE-based | 1481.7 | 39.3 | 66.7 | 69.7 | 37.2 | 18.7 |

| Component | SigLIP2-So Decoder | SigLIP2-So Discriminator | WebSSL ViT-L Decoder | WebSSL ViT-L Discriminator |
| --- | --- | --- | --- | --- |
| optimizer | AdamW | AdamW | AdamW | AdamW |
| max learning rate | 4 × 10⁻⁴ | 5 × 10⁻⁵ | 2 × 10⁻⁴ | 2 × 10⁻⁵ |
| min learning rate | 4 × 10⁻⁵ | 5 × 10⁻⁶ | 2 × 10⁻⁵ | 2 × 10⁻⁶ |
| learning rate schedule | cosine decay | cosine decay | cosine decay | cosine decay |
| optimizer betas | (0.9, 0.95) | (0.9, 0.95) | (0.9, 0.95) | (0.9, 0.95) |
| weight decay | 0.0 | 0.0 | 0.0 | 0.0 |
| batch size | 512 | 512 | 512 | 512 |
| warmup | 2 epoch | 1 epoch | 2 epoch | 1 epoch |
| loss | ℓ1 + LPIPS + GAN + Gram | adv. | ℓ1 + LPIPS + GAN + Gram | adv. |
| model | ViT-XL | DINO-S/16 (frozen) | ViT-XL | DINO-S/16 (frozen) |
| LPIPS start epoch | 0 | - | 0 | - |
| disc. start epoch | - | 7 | - | 10 |
| adv. loss start epoch | 8 | - | 11 | - |
| training epochs | 80 | 73 | 80 | 70 |

| Component | LLM | DiT |
| --- | --- | --- |
| optimizer | AdamW | AdamW |
| learning rate schedule | cosine w/ warmup ratio 0.0134 | cosine w/ warmup ratio 0.0134 |
| global batch size | 2048 | 2048 |
| max learning rate | 5 × 10⁻⁵ | 5 × 10⁻⁴ |
| optimizer betas | (0.9, 0.999) | (0.9, 0.95) |
| loss | autoregressive LM | diffusion loss |
| model | Qwen2.5 (1.5B / 7B) | DiT (0.5B–9.8B) |

| Component | LLM | DiT |
| --- | --- | --- |
| optimizer | AdamW | AdamW |
| learning rate schedule | cosine w/ warmup ratio 0.03 | cosine w/ warmup ratio 0.03 |
| global batch size | 1024 | 1024 |
| training epochs | 4, 16, 64, 256 | 4, 16, 64, 256 |
| max learning rate | 5.66 × 10⁻⁵ | 5.66 × 10⁻⁴ |
| optimizer betas | (0.9, 0.999) | (0.9, 0.95) |
| loss | autoregressive LM | diffusion loss |
| model | Qwen2.5 (1.5B / 7B) | DiT (0.5B–9.8B) |

| Model | Hidden size | Heads | Depth |
| --- | --- | --- | --- |
| DiT-0.5B | 1280 | 32 | 16 |
| DiT-2.4B | 2048 | 32 | 32 |
| DiT-3.3B | 2304 | 32 | 32 |
| DiT-5.5B | 3072 | 32 | 32 |
| DiT-9.8B | 4096 | 32 | 32 |

Diffusion-based generative modeling [19, 38, 47] has made rapid progress, giving rise to state-of-the-art systems across visual generative domains such as text-to-image generation [1, 46, 88]. A key factor in this success is the adoption of latent diffusion [65], where generation occurs in a compact latent space encoded by a variational autoencoder (VAE) [43], rather than directly in pixel space.

In parallel with advances in generative modeling, visual representation learning has progressed through self-supervised learning (SSL) [12, 35, 34, 6], language supervision [62, 98], and their combinations [54, 83]. These models produce semantically structured, high-dimensional representations that generalize well across visual understanding tasks. Unlike VAE encoders, which compress images into low-dimensional latents, the representation encoders operate on high-dimensional latents that can capture much more semantically rich features.

Such high-dimensional latents were previously considered too “abstract” for effective generative modeling [72, 92], or outright intractable [48, 13]. However, a recent approach, Representation Autoencoder (RAE) [100], has paved a path forward by training decoders on frozen representation encoders. RAE pairs a powerful frozen representation encoder with a lightweight trained decoder to reconstruct pixels from high-dimensional embeddings, enabling diffusion directly in this semantic latent space. In the highly controlled class-conditional ImageNet [18] setting, RAE demonstrates that diffusion in such frozen representation spaces can achieve more efficient and effective training than conventional VAE-based diffusion.

However, ImageNet represents a best-case scenario: fixed resolution, curated content, and class-conditional generation. A critical question remains unanswered: can RAE truly scale to the complexities of freeform text-to-image generation? This setting involves broader visual diversity, open-ended compositions, and substantially larger models and compute—challenges for which high-dimensional latent diffusion remains unproven.

In this work, we investigate whether RAEs can succeed at scale by training diffusion models for large-scale text-to-image (T2I) generation. We adopt SigLIP-2 [84] as the frozen representation encoder and use the MetaQuery framework [56] to train a unified T2I model, leveraging a powerful pretrained large language model (LLM) [61].

As a first step, we study decoder training beyond ImageNet supervision (Sec. 2). Expanding from ImageNet to web-scale and synthetic aesthetic data yields only small gains on ImageNet itself, but provides moderate improvements on more diverse natural images such as YFCC [80], showing that broader distributions enhance generalization. However, we find that text reconstruction requires targeted supervision: without text-specific data, the decoder fails to reproduce fine glyph details. Adding text-rendering data leads to substantial improvements, highlighting that data composition, not just scale, is crucial.

Next, we analyze design choices in the RAE framework [100] and evaluate their importance under large-scale T2I training (Sec. 3). We find that scale acts as a simplifier. Dimension-aware noise scheduling remains essential: removing the shift leads to substantially worse performance. The wide DDT head (DiT DH) provides clear benefits for smaller backbones, but its advantage fades as Diffusion Transformers (DiT) scale to billions of parameters. Finally, the effect of noise-augmented decoding is modest at T2I scale, with gains saturating quickly.

We then systematically compare RAEs with SOTA VAEs under matched training conditions (Sec. 4). We train DiTs from scratch following the conventional two-stage T2I setup [15, 60]: (i) large-scale pretraining with randomly initialized DiTs, and (ii) finetuning on smaller high-quality datasets. During pretraining, RAE-based models converge significantly faster and achieve higher performance on both GenEval and DPG-Bench. As shown in Fig. 1, training a 1.5B LLM + 2.4B DiT with RAE (SigLIP-2) achieves a 4.0× speedup on GenEval and a 4.6× speedup on DPG-Bench compared to its VAE counterpart. This advantage is consistent across both language backbones (Qwen-2.5 [2] 1.5B–7B) and diffusion scales (DiT 0.5B–9.8B). In finetuning, RAE models continue to outperform their VAE counterparts and are less prone to overfitting.

Finally, we examine unified models in which RAE enables understanding and generation to operate in the same high-dimensional semantic space (Sec. 5). We find that adding generative training does not degrade understanding performance, and the choice of RAE vs. VAE in the generative path has little effect because both rely on the same frozen understanding encoder. Moreover, the shared latent space allows the LLM to process generated latents directly, without decoding back to pixels. We take a first exploratory step toward leveraging this property through latent-space test-time scaling, which proves both feasible and effective.

Ultimately, we aim to convey one primary message: Representation Autoencoders provide a simpler and stronger foundation than VAEs for training large-scale text-to-image diffusion models. They offer a simple yet effective path to scaling generation within semantic representation spaces. We will release all code, data, and model checkpoints related to this work to foster open and reproducible research in multimodal generation.

To adapt the RAE framework for open-world T2I generation, we first train an RAE decoder on a larger and more diverse dataset than ImageNet [18]. Throughout this section, we choose SigLIP-2 So400M (patch size 14) [84] as the frozen encoder, and train a ViT-based [21] decoder to reconstruct images from these tokens at $224\times 224$ resolution. We present the architectural details in Appendix A. Given an input image $x\in\mathbb{R}^{3\times 224\times 224}$, the encoder produces $N=16\times 16$ tokens with channel dimension $d=1152$.
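The token geometry above can be sketched with a few lines of arithmetic; the shapes follow the paper's description, and no model weights are involved:

```python
# Token geometry of the frozen SigLIP-2 So400M encoder at 224x224,
# as described in the text (a sketch of shapes only).
image_size = 224      # input resolution
patch_size = 14       # SigLIP-2 So400M patch size
embed_dim = 1152      # channel dimension d of each token

grid = image_size // patch_size         # 16 patches per side
num_tokens = grid * grid                # N = 16 x 16 = 256 tokens
latent_shape = (num_tokens, embed_dim)  # (256, 1152) latent per image

print(grid, num_tokens, latent_shape)
```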

Following RAE, we adopt $\ell_1$, LPIPS [99], and adversarial losses [68, 33]. Additionally, we integrate a Gram loss [29], which has been found beneficial for reconstruction [52]. The training objective is
$$L(x,\hat{x})=\ell_{1}(x,\hat{x})+\omega_{L}\,\text{LPIPS}(x,\hat{x})+\omega_{G}\,\text{Gram}(x,\hat{x})+\omega_{A}\,\text{Adv}(x,\hat{x}),\qquad \hat{x}=\text{RAE}(x).$$
We include the weights and training details in Appendix A.
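A minimal sketch of this objective is below. `lpips_fn`, `gram_fn`, and `adv_fn` are hypothetical stand-ins for the LPIPS, Gram, and adversarial terms; the default weights follow Appendix A of the paper.

```python
import numpy as np

def l1_loss(x, x_hat):
    # Pixel-wise l1 reconstruction term.
    return np.abs(x - x_hat).mean()

def total_loss(x, x_hat, lpips_fn, gram_fn, adv_fn,
               w_l=1.0, w_g=100.0, w_a=10.0):
    # L = l1 + w_L * LPIPS + w_G * Gram + w_A * Adv
    return (l1_loss(x, x_hat)
            + w_l * lpips_fn(x, x_hat)
            + w_g * gram_fn(x, x_hat)
            + w_a * adv_fn(x, x_hat))

# Toy check: with all perceptual terms zeroed, the objective is plain l1.
x = np.zeros((3, 8, 8)); x_hat = np.ones((3, 8, 8))
zero = lambda a, b: 0.0
print(total_loss(x, x_hat, zero, zero, zero))  # 1.0
```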

We use a dataset combining roughly 73M images from three sources: web images from FuseDiT [77], synthetic images generated by FLUX.1-schnell [46], and RenderedText [87], which focuses on text-rendering scenes. Details are provided in Sec. 4.

We evaluate the rFID-50k [36] of reconstructed images in three representative domains: (i) ImageNet-1k [67] for classic object-centric evaluation, (ii) YFCC [80] for diverse web-scale imagery, and (iii) a held-out set of RenderedText [87] for text-rendering and typography-specific evaluation. We evaluate rFID on 50k samples from each data source and present our results in Tabs. 1 and 2.

As shown in Tab. 1, expanding decoder training beyond ImageNet to include web-scale and synthetic data yields only marginal gains on ImageNet itself, but provides moderate improvements on more diverse images (YFCC). This indicates that exposure to a broader distribution enhances the decoder’s generalizability. However, generic web data is insufficient for text reconstruction. Training on Web + Synthetic data yields little improvement over ImageNet-only training. In contrast, performance improves substantially once text-specific data is included, highlighting that reconstruction quality is very sensitive to the composition of the training data. As shown in Fig. 2, training the RAE decoder with additional text data is essential for accurate text reconstruction. Overall, RAE reconstruction improves with scale, but the composition of data—not just its size—matters: each domain benefits most from domain-matched coverage.

We also experiment with training RAE using different pretrained encoders. In particular, we replace SigLIP-2 with WebSSL-DINO [26], a large-scale self-supervised model. As shown in Tab. 2, WebSSL-DINO achieves stronger reconstruction performance than SigLIP-2 across all domains, including text reconstruction. Both SigLIP-2 and WebSSL-L consistently outperform the SDXL VAE [60], though they still fall short of the FLUX VAE [46].

The original RAE framework [100] introduced a suite of specialized design choices, including dimension-dependent noise scheduling, noise-augmented decoding, and a modified backbone ($\text{DiT}^{\text{DH}}$), to enable diffusion on high-dimensional latents. While these modifications proved effective for class-conditional ImageNet generation, it remains unclear which are fundamental requirements for high-dimensional diffusion and which are adaptations for lower-capacity regimes.

In this section, we systematically stress-test these components under large-scale T2I settings to determine which are necessary. Our analysis reveals that adapting the noise schedule to the latent dimension is critical for convergence, whereas the architectural modifications proposed in the original work, such as wide diffusion heads and noise augmentation, become redundant at scale.

We adopt the MetaQuery architecture [56] for text-to-image (T2I) generation and unified modeling. The model is initialized with a pretrained language model (LLM), and a sequence of learnable query tokens is prepended to the text prompt. The number of query tokens is set to 256, matching the number of visual tokens ($16\times 16$) produced by the representation encoder. The LLM jointly processes the text and queries, producing query-token representations that serve as the conditioning signal. A 2-layer MLP connector then projects these representations from the LLM's hidden space into the input space of the DiT model [58].
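The conditioning path can be sketched in terms of shapes. The hidden sizes below (1536 for the LLM, 2048 for the DiT) are illustrative assumptions, as are the random projection matrices standing in for the trained 2-layer MLP connector:

```python
import numpy as np

num_queries = 256  # matches the 16 x 16 visual tokens
llm_dim = 1536     # assumed LLM hidden size (e.g. a ~1.5B model)
dit_dim = 2048     # assumed DiT hidden size

rng = np.random.default_rng(0)
text_tokens = rng.standard_normal((32, llm_dim))       # embedded prompt
queries = rng.standard_normal((num_queries, llm_dim))  # learnable queries
llm_input = np.concatenate([queries, text_tokens], 0)  # queries prepended

# After the LLM, the query positions are projected by a 2-layer MLP
# connector into the DiT's conditioning space.
W1 = rng.standard_normal((llm_dim, dit_dim))
W2 = rng.standard_normal((dit_dim, dit_dim))
cond = np.maximum(llm_input[:num_queries] @ W1, 0) @ W2

print(llm_input.shape, cond.shape)
```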

For this DiT model, we adopt the design based on LightningDiT [92] and train it using the flow matching objective [47]. Critically, our model does not operate in a compressed VAE space. Instead, the DiT learns to model the distribution of high-dimensional, semantic representations generated by the frozen representation encoder. During inference, the DiT generates a set of features conditioned on the query tokens, which are then passed to our trained RAE decoder for rendering into pixel space.

We also perform visual instruction tuning [49, 50] for image understanding. For this, we use a separate 2-layer MLP projector that maps visual tokens into the LLM’s embedding space. Importantly, these visual tokens come from the same frozen representation encoder whose features the diffusion model is trained to generate.

Unless otherwise specified, we use SigLIP-2 So400M (patch size 14) [84] as our representation encoder and Qwen-2.5 1.5B [61] as the LLM in our experiments. We fix the number of visual tokens to 256, resulting in 224-resolution images for RAE and 256-resolution for VAE.

Following standard practice, we adopt the flow matching objective [47, 51] with linear interpolation $\mathbf{x}_t=(1-t)\mathbf{x}+t\boldsymbol{\varepsilon}$, where $\mathbf{x}\sim p(\mathbf{x})$ and $\boldsymbol{\varepsilon}\sim\mathcal{N}(0,\mathbf{I})$, and train the model to predict the velocity $v(\mathbf{x}_t,t)$. Unless otherwise noted, we employ a 50-step Euler sampler for generation, consistent with RAE [100].
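The interpolation and velocity target can be sketched as follows. We take $v = \varepsilon - \mathbf{x}$ as the velocity target (this sign convention is an assumption; the text only states that the model predicts $v(\mathbf{x}_t, t)$):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 1152))    # clean RAE latents
eps = rng.standard_normal((256, 1152))  # Gaussian noise
t = 0.3

x_t = (1 - t) * x + t * eps             # linear interpolation
v_target = eps - x                      # d x_t / d t under the interpolation

# Sanity check: following v from time t to time 1 lands exactly on the noise.
assert np.allclose(x_t + (1 - t) * v_target, eps)
print(x_t.shape)
```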

We evaluate using two widely adopted metrics: the GenEval score [32] and the DPG-Bench score [39].

The RAE work [100] argues that conventional noise schedules become suboptimal when applied to high-dimensional latent spaces. The paper proposes a dimension-dependent noise schedule shift [25] that rescales the diffusion timestep according to the effective data dimension $m=N\times d$ (number of tokens $\times$ token dimension). Formally, given a base schedule $t_n\in[0,1]$ defined for a reference dimension $n$, the shifted timestep is computed as
$$t_m=\frac{\alpha\, t_n}{1+(\alpha-1)\,t_n},\qquad \alpha=\sqrt{m/n}.$$

We follow the RAE setting and use $n=4096$ as the base dimension for computing the scaling factor $\alpha$. We experiment with and without applying the dimension-dependent shift when training text-to-image diffusion models on RAE latents, as shown below.
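The shift is a one-line transform. The closed form below is our reading of the shift in [25, 100] (an assumption, not quoted from this paper): $t_m = \alpha t_n / (1 + (\alpha-1) t_n)$ with $\alpha = \sqrt{m/n}$.

```python
import math

def shift_timestep(t_n, m, n=4096):
    # Rescale a base-schedule timestep t_n for effective dimension m.
    alpha = math.sqrt(m / n)
    return alpha * t_n / (1 + (alpha - 1) * t_n)

# SigLIP-2 latents: N = 256 tokens with d = 1152 channels.
m = 256 * 1152
print(shift_timestep(0.5, m))  # > 0.5: mid timesteps are pushed toward noise
```

The endpoints are preserved (0 maps to 0, 1 maps to 1); only interior timesteps are pushed toward the high-noise end as the latent dimension grows.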

Consistent with Zheng et al. [100], applying the noise shift dramatically improves both GenEval and DPG-Bench scores, demonstrating that adjusting the schedule to the effective latent dimension is critical for T2I.

While dimension-aware noise scheduling proves essential, we find that other design choices in RAE, which was originally developed for smaller-scale ImageNet models, provide diminishing returns at T2I scale.

RAE originally proposed training decoders on perturbed latents to bridge the gap between training and inference distributions. Formally, it trains the RAE decoder on smoothed inputs $z' = z + n$, where $n\sim\mathcal{N}(0,\sigma^{2}I)$ and $\sigma$ is sampled from $|\mathcal{N}(0,\tau^{2})|$. We set $\tau=0.2$, as we find that too high a $\tau$ makes decoder training hard to converge.
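The perturbation can be sketched as follows, with $\sigma$ drawn from the half-normal $|\mathcal{N}(0,\tau^2)|$ and $\tau=0.2$ as in the text (the latent shape is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 0.2

def perturb(z):
    sigma = abs(rng.normal(0.0, tau))           # sigma ~ |N(0, tau^2)|
    return z + rng.normal(0.0, sigma, z.shape)  # z' = z + n, n ~ N(0, sigma^2 I)

z = np.zeros((256, 1152))  # placeholder for encoder latents
z_noisy = perturb(z)
print(z_noisy.shape)
```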

We visualize the effect of noise-augmented decoding at different training stages in Fig. 4(a). The gains are noticeable early in training (before ∼\sim15k steps), when the model is still far from convergence, but become negligible at later stages. This suggests that noise-augmented decoding acts as a form of regularization that matters most when the model has not yet learned a robust latent manifold.

The $\text{DiT}^{\text{DH}}$ architecture augments a standard DiT with a shallow but wide DDT head, increasing the denoising width without widening the entire backbone. In standard ImageNet-scale DiTs, the backbone width ($d\approx 1024$) is often narrower than the high-dimensional RAE latent targets ($d=1152$). $\text{DiT}^{\text{DH}}$ circumvents this by appending a wide, shallow denoising head ($d=2688$) without incurring the cost of widening the full backbone.

However, the T2I setting operates in a different regime. Modern large-scale T2I DiTs [46, 88] ($\geq$ 2B parameters) possess hidden dimensions ($d\geq 2048$) that inherently exceed the latent dimension. We hypothesize that this natural width eliminates the bottleneck $\text{DiT}^{\text{DH}}$ was designed to fix.

To verify this, we train DiT variants across three scales (0.5B, 2.4B, and 3.1B), comparing standard architectures against counterparts augmented with the +0.28B-parameter $\text{DiT}^{\text{DH}}$ head. As shown in Fig. 4(b), the results confirm our hypothesis: at 0.5B, where the backbone is narrow, $\text{DiT}^{\text{DH}}$ provides a critical +11.2 GenEval boost. Yet as the model scales to 2.4B and beyond, this advantage largely disappears.

This finding clarifies that $\text{DiT}^{\text{DH}}$ is a patch for capacity-constrained models, not a fundamental requirement for RAE. For scalable T2I training, standard DiT architectures are already sufficient.

Our experiments reveal clear design principles for scaling RAE-based diffusion models: dimension-aware noise scheduling remains non-negotiable, as it directly addresses the mathematical properties of high-dimensional latent spaces. In contrast, architectural refinements ($\text{DiT}^{\text{DH}}$) and training augmentations (noise-augmented decoding) that help at small scales provide diminishing returns as models grow, since backbone capacity increasingly dominates performance. From here on, we adopt standard DiT architectures with proper noise scheduling and no noise-augmented decoding.

In this section, we compare text-to-image diffusion training using the RAE (SigLIP-2) encoder versus a standard VAE (FLUX-VAE). For the VAE baseline, we adopt the state-of-the-art model from FLUX [46]. All experiments follow the same setup described in Sec. 3.1, with identical training configurations; the only difference lies in whether diffusion is performed in the RAE or VAE latent space. We defer implementation details to Appendix A.

We organize our comparison into two stages: pretraining and finetuning. We train the Diffusion Transformer from scratch in both settings to ensure a fair comparison of convergence speed and data efficiency. To keep the comparison apples-to-apples, the only component that differs is the latent space and its decoder (SigLIP-2 RAE vs. FLUX VAE). For the VAE baseline, we employ the FLUX VAE for generation while retaining the SigLIP encoder for understanding, as VAE latents are insufficient for perception [101]. This design effectively forms a two-tower architecture, mirroring recent unified models such as Bagel [17] and UniFluid [27].

We follow the data mixture developed in FuseDiT [77] and adopt the recaptioned texts and remixing ratios released by BLIP-3o [9]. The mixture combines mostly web data, such as CC12M [7], SA-1B [44], and JourneyDB [73], totaling approximately 39.3M images. In addition, we use FLUX.1-schnell [46] to generate 24.7M synthetic images. We also train on Cambrian-7M [81] to develop the model’s visual understanding capabilities.

We experiment with a Qwen-2.5 1.5B LLM and a 2.4B DiT to study how different pretraining corpora influence text-to-image performance. We train three variants: (i) Web-39M + Cambrian-7M, (ii) FLUX-generated synthetic data + Cambrian-7M, and (iii) their union. As shown in Tab. 3, the mixed dataset yields the best performance.

To ensure the gains are not simply due to more data, we also double the size of each individual source (Web ×2, Synthetic ×2). These runs yield much smaller improvements, indicating that the benefits arise from the complementary nature of the two data types rather than data volume alone.

We also find that synthetic data results in lower training loss and faster convergence, suggesting that FLUX images provide more stylistically consistent signals. Web-scale data, by contrast, is harder to fit but provides more diverse signals. When combined, the model inherits visual style from synthetic data and rich semantics from web data, leading to clear and robust improvements in generation quality.

We first compare the convergence behavior. We train a Qwen2.5-1.5B LLM with a 2.4B DiT backbone. As shown in Fig. 1, the RAE-based model converges significantly faster than its VAE counterpart, achieving a 4.0× speedup on GenEval and a 4.6× speedup on DPG-Bench.

We use Qwen-2.5 1.5B as the language backbone, and train DiT variants of 0.5B, 2.4B, 5.5B, and 9.8B parameters. The architectures of these DiT variants are designed following recent advances in large-scale vision models [24, 26, 85, 88], and detailed model specifications are provided in Appendix B. In this experiment, we train all the models for 30k iterations with a batch size of 2048.

In Fig. 5(a), we find that RAE-based models consistently outperform their VAE counterparts at all scales. Even for the smallest 0.5B DiT, where the network width only slightly exceeds the RAE latent dimension, the RAE-based model still shows clear advantages over the VAE baseline.

We also observe diminishing returns when scaling DiT models beyond 6B parameters. The performance trend appears to plateau, suggesting that simply increasing model size without proportionally improving data quality and diversity may lead to underutilized capacity. This observation aligns with discussions in large-scale visual SSL literature [26], which highlight the need for high-quality data scaling to fully exploit model capacity.

We study how scaling the LLM backbone influences text-to-image performance when paired with diffusion models of different sizes. To this end, we train both RAE- and VAE-based models using LLMs of {1.5B, 7B} parameters combined with DiTs of {2.4B, 5.5B, 9.8B}, and present the results in Fig. 5(b).

We observe performance gains from scaling the LLM to 7B, particularly when paired with RAE. We note that prior studies, such as MetaQuery [56], reported limited benefits from LLM scaling. Our results diverge from this conclusion, likely due to two key factors: (1) we evaluate on significantly larger diffusion backbones (up to 9.8B) which can better exploit rich text representations, and (2) we finetune the LLM backbone, allowing it to adapt its latent space for generative tasks more effectively than frozen approaches.

We also experiment with training RAE with WebSSL ViT-L [26]. Under the same 1.5B LLM and 2.4B DiT setup, the WebSSL RAE performs slightly below the SigLIP-2 version but still exceeds the FLUX VAE baseline (Tab. 4). This finding is notable because WebSSL is not explicitly aligned with text; it suggests that the RAE framework in T2I training is robust to the choice of encoder.

Following standard practice in T2I training [15, 10, 60], models are finetuned on a smaller high-quality dataset after large-scale pretraining. We evaluate this finetuning stage for both RAE- and VAE-based models under identical settings. Unless otherwise noted, we use the BLIP-3o 60k dataset [9] and start from the 1.5B LLM + 2.4B DiT checkpoint trained for 30k steps in Sec. 4.1. We update both the LLM and the DiT; additional details are provided in Appendix A.

We finetune both families of models for {4, 16, 64, 128, 256} epochs and compare performance on GenEval and DPG-Bench in Fig. 6. Across all settings, the RAE-based model maintains an advantage on both benchmarks.

As shown in Fig. 6, VAE-based models degrade significantly after 64 epochs. Training loss analysis (Appendix Fig. 9) reveals that the VAE loss collapses rapidly to near-zero, suggesting the model is memorizing individual training samples rather than learning the underlying distribution. In contrast, RAE-based models remain stable and show only a mild decline. We hypothesize that the higher-dimensional and semantically structured latent space of the RAE (SigLIP-2 produces 1152-dimensional tokens, versus fewer than 100 dimensions in typical VAEs) may provide an implicit regularization effect, helping mitigate overfitting during finetuning.

To verify whether RAE’s advantage over VAE extends beyond our main setup, we conduct two additional experiments: (1) finetuning only the DiT while freezing the LLM (following recent works [56, 9]), and (2) scaling to DiT models of different sizes (0.5B–9.8B parameters). Figure 7 shows that RAE consistently outperforms VAE in both settings. The left panel shows that both selective finetuning (DiT-only) and joint finetuning (LLM+DiT) favor RAE over VAE; notably, the top-performing VAE configuration reaches 78.2, while the weakest RAE configuration achieves 79.4. The right panel shows continued RAE gains across the scaling range, with larger models exhibiting greater improvements.

A key advantage of the RAE framework is that it unifies visual understanding and generation within a single, shared, high-dimensional latent space. This contrasts with the conventional two-tower design (used in our Section 4 baseline), where the generation head operates in a latent space alien to the LLM’s understanding encoder, effectively making the unified model ‘blind’ to its own output distribution without a VAE decoder. In contrast, RAE forces generation to occur in the same representation space as the visual encoder: the model generates exactly the high-dimensional features it uses to see.

A direct benefit of this shared representation is that the LLM can interpret the latents produced by the diffusion model without decoding them into pixels and re-encoding them, keeping the representation and pixel spaces fully decoupled. We leverage this property to implement Latent Test-Time Scaling (TTS), where the LLM acts as a verifier for its own generations entirely within the feature space (Fig. 8).

We explore two verifier metrics that leverage the LLM’s understanding capabilities to score generated latents: (1) Prompt Confidence: we re-inject the generated latents and the original text prompt back into the LLM and measure the aggregate token-level confidence of the prompt, following Kang et al. [42]. (2) Answer Logits: we explicitly query the LLM with “Does this generated image $\langle$image$\rangle$ align with the $\langle$prompt$\rangle$?” and use the logit probability of the “Yes” token as the score.
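The two scores can be sketched as follows. `StubLLM` is a hypothetical stand-in: in the real system, both scores come from forward passes of the unified LLM over the generated latents and the prompt.

```python
def prompt_confidence(llm, latents, prompt):
    # Aggregate token-level log-probability of the prompt given the latents.
    logprobs = llm.prompt_logprobs(latents, prompt)
    return sum(logprobs) / len(logprobs)

def answer_logit_score(llm, latents, prompt):
    # Log-probability of "Yes" to: does this image align with the prompt?
    return llm.yes_logprob(latents, prompt)

class StubLLM:  # toy stand-in so the sketch runs end-to-end
    def prompt_logprobs(self, latents, prompt):
        return [-0.1] * len(prompt.split())
    def yes_logprob(self, latents, prompt):
        return -0.05

llm = StubLLM()
print(prompt_confidence(llm, None, "a red cube"))
```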

With the verifier defined, we adopt the standard test-time scaling protocol [53, 90] using a best-of-$N$ selection strategy. As shown in Tab. 5, both verification metrics yield consistent improvements on GenEval, demonstrating that latent-space TTS is not only feasible but also an effective way to enhance generation quality. Crucially, this improvement is achieved entirely within the semantic latent space: the model verifies the quality of its own generations without ever rendering pixels.
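The selection protocol itself reduces to a few lines. `generate` and `score` are hypothetical stand-ins for the DiT sampler and the LLM verifier; only the selected latents would ever be decoded to pixels.

```python
import itertools

def best_of_n(generate, score, prompt, n=8, k=4):
    # Sample n candidate latents, rank by verifier score, keep the top k.
    candidates = [generate(prompt) for _ in range(n)]
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[:k]

# Toy check: candidates are 0..7 and the score is the candidate itself,
# so best-of-8 keeps the four largest.
counter = itertools.count()
picked = best_of_n(lambda p: next(counter), lambda c: c, "a red cube", n=8, k=4)
print(picked)  # [7, 6, 5, 4]
```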

Finally, we conduct a comparative study of how the choice of visual generation backbone (VAE versus RAE) affects multimodal understanding performance. We evaluate the trained models on standard benchmarks: MME [28], TextVQA [71], AI2D [37], SeedBench [30], MMMU [95], and MMMU-Pro [96]. We emphasize that the goal of this work is not to build a SOTA VQA model; achieving that would require additional components such as any-resolution inputs, multimodal continual pretraining, and very high-quality data.

Similar to prior findings [82, 27, 17], we observe in Tab. 6 that adding generative modeling does not degrade visual understanding performance. The choice of RAE vs. VAE in the generative path has little impact, likely because both variants share the same frozen understanding encoder.

A common line of work uses VAEs [43] as autoencoders to compress images into low-dimensional latent spaces, typically with channel dimensions below 64 [46, 25]. Many studies [11, 93] have pursued aggressive compression, while others [55, 63] reduce dimensionality further by quantizing continuous latents into discrete codes. However, both directions unavoidably result in information loss.

Representation Autoencoders (RAE) [100] take a different route: use a frozen, pretrained representation encoder and train only the decoder to reconstruct images from high-dimensional semantic features. In ImageNet experiments, training diffusion transformers [58] in this latent space yields faster convergence and better performance than VAEs. In this work, we extend RAE to text-to-image generation and show that its reconstruction and generative advantages transfer naturally to the multimodal setting.

Recently, several works have explored leveraging representation encoders for reconstruction. SVG [70, 69] employs a residual encoder to refine visual details during reconstruction, while VTP [91] incorporates a reconstruction loss into the pretraining of representation encoders. VQRAE [23] further applies quantization on top of representation encoders to construct discrete representations for generation. In a related direction, another line of work [59, 45, 57] integrates representation encoders with VAEs to improve generation fidelity.

VAEs have also been widely used in text-to-image models. Stable Diffusion [64] uses an off-the-shelf VAE and a text-conditioned U-Net [66] for T2I training. Subsequent work [60, 25, 46] improves the VAE through higher-quality and larger-scale training data.

Recently, Stable Diffusion 3 [25] shows that widening the VAE channels boosts reconstruction fidelity and enhances the scalability of the downstream diffusion model, while Hunyuan-Image-3 [5] further incorporates representation alignment [94] into VAE training.

This work takes the representation route a step further: instead of modifying VAEs, we train T2I models directly on high-dimensional representation spaces with RAE. This approach yields clear advantages over VAE in both convergence speed and final generation quality.

Recently, many works have focused on unifying multimodal understanding and generation into one modeling paradigm. One stream of work discretizes visual input and trains next-token prediction models [79, 86, 14, 89, 41]. Another stream incorporates diffusion models into LLMs [15, 20, 74, 31, 82, 101, 56, 9, 17]. However, the prevailing view has been that understanding and generation require different visual representations (high-dimensional CLIP features for understanding, low-dimensional VAE latents for generation), leading most unified models to adopt a two-tower design.

An emerging direction in unified multimodal modeling is to unify understanding and generation into a shared latent space. To work around this representation mismatch, recent approaches [78, 13, 97, 52, 40] adopt continuous representation spaces but introduce substantial downsampling for generation. For example, Chen et al. [13] use high-dimensional, uncompressed features for understanding but fall back to compressed, lower-dimensional latents for generation. Jiao et al. [41] and Yue et al. [97] employ compressed embeddings for both understanding and generation, limiting the model’s perception ability. BLIP-3o [9] experiments with a Qwen2.5-VL encoder [3] for understanding and EVA-CLIP [75, 76] for generation; however, because the model does not apply noise-schedule shifting and its DiT width is smaller than the EVA-CLIP embedding dimension, it relies on a strong diffusion decoder [60] to map these features back to pixels.

Our work takes a step forward by using a single high-dimensional encoder for both understanding and generation. Leveraging RAE designs, the model enjoys a simpler architecture that understands and generates directly in this semantic space, surpassing VAE-based designs in T2I.

In this work, we investigate scaling Representation Autoencoders (RAEs) to text-to-image generation. Our study begins by scaling the decoder, where we find that while larger data scales improve general fidelity, specific domains such as text require targeted data composition. We then examine the RAE framework itself, revealing that scaling simplifies the design: dimension-dependent noise scheduling remains essential, but architectural modifications like $\text{DiT}^{\text{DH}}$ yield diminishing returns as model capacity increases. Building on this streamlined recipe, we show that RAE-based diffusion models consistently outperform state-of-the-art VAE baselines in convergence speed and generation quality, while being less prone to overfitting during finetuning. Collectively, these results establish RAE as a simple and effective foundation for large-scale generation. Moreover, by enabling understanding and generation to operate in a shared representation space, RAEs open new possibilities for unified models, such as the latent-space test-time scaling demonstrated in this work. We believe RAEs serve as a strong foundation for future research in both scalable generation and unified multimodal modeling.

The authors would like to thank Xichen Pan, Shusheng Yang, David Fan, John Nguyen for insightful discussions and feedback on the manuscript. This work was mainly supported by the Google TPU Research Cloud (TRC) program and the Open Path AI Foundation. ST is supported by Meta AI Mentorship Program. SX also acknowledges support from the MSIT IITP grant (RS-2024-00457882) and the NSF award IIS-2443404.

Our experiments are conducted on TPU v4, v5p, and v6e with TorchXLA.

We largely follow RAE for the decoder architecture and adopt ViT-XL [22] as the default decoder. The decoder contains 28 blocks with a hidden size of 1152 and 16 heads. Decoder training details are included in Tab. 7. We find that the GAN training recipe provided in [100] is not stable on web-scale images; to address this, we tune the recipe as shown in Tab. 7. On web-scale images, we find that DINO-S/16 already suffices as a strong discriminator, while DINO-S/8, as used in [100], makes training hard to converge. We therefore use DINO-S/16 as the default discriminator. All inputs are interpolated to $224\times 224$ resolution before being fed into the discriminator. We use an epoch-based training scheme and set the number of samples per virtual epoch to match ImageNet (1.28M). For loss coefficients, we set $\omega_{G}=100.0$, $\omega_{L}=1.0$, $\omega_{A}=10.0$.

For pretraining experiments in Sec. 4.1, we primarily train on TPU-v5p-128 and TPU-v6e-64. Detailed training configurations are provided in Tab. 8. We find that finetuning a pretrained LLM while training the DiT from scratch benefits from using separate optimizers, and properly decoupling their optimizer settings substantially improves training stability. We use SPMD sharding [4] together with TorchXLA to train the LLM, adapters, and DiT models.

We finetune pre-trained models on the BLIP3o-60k dataset [8] using TPU-v4-128 and TPU-v5p-64. To ensure a fair comparison, we apply identical training configurations to both RAE and VAE models across 4, 16, 64, and 256 epochs. We utilize the same codebase and training infrastructure as the pretraining stage. We use a global batch size of 1024 and the optimization settings detailed in Tab. 9.

For synthetic image generation, we compile prompts from publicly available prompt datasets: https://huggingface.co/datasets/Geonmo/midjourney-prompts-only, https://huggingface.co/datasets/FredZhang7/stable-diffusion-prompts-2.47M, https://huggingface.co/datasets/neuralworm/stable-diffusion-discord-prompts, https://huggingface.co/datasets/isidentical/random-stable-diffusion-prompts, and https://huggingface.co/datasets/CaptionEmporium/coyo-hd-11m-llavanext. Using these prompts, we generate 24.7M synthetic images with FLUX.1-schnell [46], which form part of our decoder training and T2I training corpus. We perform large-scale generation using TPU-v6e and will open-source the inference pipeline to facilitate future research.

We use pretrained Qwen2.5 [61] language models at the 1.5B and 7B scales in our experiments. Following prior work [50, 81], we use a 2-layer MLP to project visual features from the representation encoder into the LLM embedding space, and a separate linear layer to map the LLM’s query-token outputs into the input space of the diffusion model.

We design our diffusion architecture following LightningDiT [92]. Motivated by recent findings in scaling vision backbones [26, 85, 16], we prioritize increasing model width rather than depth when scaling DiT models. Consistent with insights from the RAE paper [100], we also ensure that the DiT hidden dimension remains strictly larger than the target latent dimension (e.g. 1152 for the SigLIP2 ViT-So model), including at small scales such as DiT-0.5B. The detailed model specifications are provided in Tab. 10.

To complement the results in Sec. 4.2, we additionally compare the training loss curves of RAE and VAE models during finetuning in Fig. 9. We observe that the VAE model’s loss decreases rapidly to a very low value, which correlates with the performance degradation observed in Fig. 6, a clear sign of overfitting. In contrast, the RAE model’s loss decreases more gradually and stabilizes at a higher value, maintaining robust generation performance throughout the training process. This suggests that the high-dimensional semantic space of RAE provides a form of implicit regularization that prevents the model from memorizing the small finetuning dataset.

We extend the finetuning experiment from Sec. 4.2 to 512 epochs. As shown in Fig. 6, VAE-based models already suffer substantial performance drops by 256 epochs, so we do not continue training them further. In contrast, the RAE-based model remains stable: even after 512 epochs (Fig. 10), it shows only a small decline in performance. This further supports the robustness of RAE-based methods under long-horizon finetuning.

Table: S1.T1: Data matters for RAE’s reconstruction fidelity. We train RAE (SigLIP-2) on different data sources. Compared with ImageNet-only training, using web-scale images consistently improves reconstruction quality across all domains.

| Data Sources | #Data | ImageNet ↓ | YFCC ↓ | Text ↓ |
|---|---|---|---|---|
| ImageNet | 1.28M | 0.462 | 0.970 | 2.640 |
| Web | 39.3M | 0.529 | 0.629 | 2.325 |
| Web + Synthetic | 64.0M | 0.437 | 0.683 | 2.406 |
| Web + Synthetic + Text | 73.0M | 0.435 | 0.702 | 1.621 |

Table: S2.T2: Comparison of reconstruction performance. After expanding the training data, RAE outperforms SDXL-VAE across all three domains, though it still falls short of FLUX-VAE. Within RAE variants, WebSSL reconstructs better than SigLIP-2.

| Family | Model | ImageNet ↓ | YFCC ↓ | Text ↓ |
|---|---|---|---|---|
| VAE | SDXL | 0.930 | 1.168 | 2.057 |
| VAE | FLUX | 0.288 | 0.410 | 0.638 |
| RAE | WebSSL ViT-L | 0.388 | 0.558 | 1.372 |
| RAE | SigLIP-2 ViT-So | 0.435 | 0.702 | 1.621 |

Table: S4.T3: Data composition matters more than scale. Synthetic data substantially outperforms web data, and their combination (49.5 GenEval) surpasses even doubled synthetic data (48.0), demonstrating synergistic benefits from complementary data sources rather than volume alone.

| Training Data | GenEval ↑ | DPG-Bench ↑ |
|---|---|---|
| Synthetic | 45.1 | 73.8 |
| Synthetic ×2 | 48.0 | 75.2 |
| Web | 25.9 | 69.5 |
| Web ×2 | 26.3 | 70.6 |
| Synthetic + Web | 49.5 | 76.9 |

Table: S4.T4: SSL encoders are effective RAE backbones for T2I. A WebSSL-based RAE performs slightly worse than SigLIP-2 but remains stronger than FLUX VAE in T2I.

| Model Variant | GenEval ↑ | DPG-Bench ↑ |
|---|---|---|
| VAE-based models | | |
| FLUX VAE | 39.6 | 70.5 |
| RAE-based models | | |
| WebSSL ViT-L | 46.0 | 72.8 |
| SigLIP-2 ViT-So | 49.5 | 76.9 |

Table: S5.T5: Test-time scaling (TTS) results across LLM–DiT configurations. Both verifier metrics yield substantial improvements on GenEval. “4/8” denotes selecting the best 4 out of 8 samples.

| Best-of-N | Prompt Confidence | Answer Logits |
|---|---|---|
| 1.5B LLM + 5.5B DiT (GenEval = 53.2) | | |
| 4/8 | 56.7 | 59.6 |
| 4/16 | 57.5 | 62.5 |
| 4/32 | 60.0 | 64.3 |
| 7.0B LLM + 5.5B DiT (GenEval = 55.5) | | |
| 4/8 | 58.3 | 62.5 |
| 4/16 | 59.6 | 65.8 |
| 4/32 | 60.1 | 67.8 |
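The best-of-N selection behind these results can be sketched as follows. This is a hypothetical illustration: `generate` and `score` are stand-ins for the unified model's latent sampler and its verifier (e.g. prompt confidence or answer logits); neither name comes from the paper.

```python
def best_of_n(prompt, generate, score, n=8, k=4):
    """Sample n image latents for a prompt and keep the k that score
    highest under a verifier. Because the verifier operates directly
    on latents, no decode-re-encode round trip is needed."""
    latents = [generate(prompt) for _ in range(n)]
    # Rank by verifier score, highest first, and keep the top k.
    return sorted(latents, key=score, reverse=True)[:k]
```

With `n=8, k=4` this reproduces the “4/8” setting in the table; larger `n` at fixed `k` corresponds to the 4/16 and 4/32 rows.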

Table: S5.T6: Generative training leaves understanding intact; RAE and VAE perform similarly. Across VL benchmarks, both latent choices produce comparable understanding performance.

| Model | MME-P | TVQA | AI2D | Seed | MMMU | MMMU-P |
|---|---|---|---|---|---|---|
| Und.-only | 1374.8 | 44.7 | 63.9 | 67.1 | 40.2 | 20.5 |
| RAE-based | 1468.7 | 39.6 | 66.7 | 69.8 | 41.1 | 19.8 |
| VAE-based | 1481.7 | 39.3 | 66.7 | 69.7 | 37.2 | 18.7 |

Table: A0.T7: Training configuration for decoder and discriminator. Top: configuration used for SigLIP2-So. Bottom: configuration used for WebSSL ViT-L. Different encoders require slightly different training recipes to achieve strong decoder performance.

SigLIP2-So:

| Component | Decoder | Discriminator |
|---|---|---|
| optimizer | AdamW | AdamW |
| max learning rate | 4×10⁻⁴ | 5×10⁻⁵ |
| min learning rate | 4×10⁻⁵ | 5×10⁻⁶ |
| learning rate schedule | cosine decay | cosine decay |
| optimizer betas | (0.9, 0.95) | (0.9, 0.95) |
| weight decay | 0.0 | 0.0 |
| batch size | 512 | 512 |
| warmup | 2 epochs | 1 epoch |
| loss | ℓ₁ + LPIPS + GAN + Gram | adv. |
| model | ViT-XL | DINO-S/16 (frozen) |
| LPIPS start epoch | 0 | – |
| disc. start epoch | – | 7 |
| adv. loss start epoch | 8 | – |
| training epochs | 80 | 73 |

WebSSL ViT-L:

| Component | Decoder | Discriminator |
|---|---|---|
| optimizer | AdamW | AdamW |
| max learning rate | 2×10⁻⁴ | 2×10⁻⁵ |
| min learning rate | 2×10⁻⁵ | 2×10⁻⁶ |
| learning rate schedule | cosine decay | cosine decay |
| optimizer betas | (0.9, 0.95) | (0.9, 0.95) |
| weight decay | 0.0 | 0.0 |
| batch size | 512 | 512 |
| warmup | 2 epochs | 1 epoch |
| loss | ℓ₁ + LPIPS + GAN + Gram | adv. |
| model | ViT-XL | DINO-S/16 (frozen) |
| LPIPS start epoch | 0 | – |
| disc. start epoch | – | 10 |
| adv. loss start epoch | 11 | – |
| training epochs | 80 | 70 |

Table: A1.T8: Optimization hyperparameters for the LLM backbone and the DiT diffusion head in the unified T2I model.

| Component | LLM | DiT |
|---|---|---|
| optimizer | AdamW | AdamW |
| learning rate schedule | cosine w/ warmup ratio 0.0134 | cosine w/ warmup ratio 0.0134 |
| global batch size | 2048 | 2048 |
| max learning rate | 5×10⁻⁵ | 5×10⁻⁴ |
| optimizer betas | (0.9, 0.999) | (0.9, 0.95) |
| loss | autoregressive LM | diffusion loss |
| model | Qwen2.5 (1.5B / 7B) | DiT (0.5B–9.8B) |
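The cosine-with-warmup schedule in the table can be sketched in pure Python. This is an illustrative sketch, not the paper's implementation: the warmup ratio (0.0134) and peak rates (5e-5 for the LLM, 5e-4 for the DiT) follow the table, while the linear warmup shape and the floor `min_lr=0.0` are assumptions, since the table specifies neither.

```python
import math

def lr_at(step, total_steps, max_lr, warmup_ratio=0.0134, min_lr=0.0):
    """Learning rate at a given step: linear warmup for the first
    warmup_ratio fraction of training, then cosine decay to min_lr."""
    warmup_steps = max(int(warmup_ratio * total_steps), 1)
    if step < warmup_steps:
        # Linear ramp from 0 to max_lr over the warmup window.
        return max_lr * step / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remainder.
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For the DiT branch, `lr_at(step, total_steps, 5e-4)` peaks at 5×10⁻⁴ right after warmup and decays to zero by the final step.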

Table: A1.T9: Finetuning hyperparameters.

| Component | LLM | DiT |
|---|---|---|
| optimizer | AdamW | AdamW |
| learning rate schedule | cosine w/ warmup ratio 0.03 | cosine w/ warmup ratio 0.03 |
| global batch size | 1024 | 1024 |
| training epochs | 4, 16, 64, 256 | 4, 16, 64, 256 |
| max learning rate | 5.66×10⁻⁵ | 5.66×10⁻⁴ |
| optimizer betas | (0.9, 0.999) | (0.9, 0.95) |
| loss | autoregressive LM | diffusion loss |
| model | Qwen2.5 (1.5B / 7B) | DiT (0.5B–9.8B) |

Table: A2.T10: Architectural specifications of DiT variants.

| Model | Hidden size | Heads | Depth |
|---|---|---|---|
| DiT-0.5B | 1280 | 32 | 16 |
| DiT-2.4B | 2048 | 32 | 32 |
| DiT-3.3B | 2304 | 32 | 32 |
| DiT-5.5B | 3072 | 32 | 32 |
| DiT-9.8B | 4096 | 32 | 32 |

Figure: RAE converges faster than VAE in text-to-image pretraining. We train Qwen-2.5 1.5B + DiT 2.4B models from scratch on both RAE (SigLIP-2) and VAE (FLUX) latent spaces for up to 60k iterations. RAE converges significantly faster than VAE on both GenEval (4.0×) and DPG-Bench (4.6×).

Figure: RAE decoders trained on more data (web, synthetic & text) generalize across domains. Decoders trained only on ImageNet reconstruct natural images well but struggle with text-rendering scenes (see second column). Adding web and text data greatly improves text reconstruction while maintaining natural-image quality. We also observe that both the language-supervised model and the SSL model learn representations suitable for reconstructing diverse images, including scenes with rendered text. Compared to proprietary VAEs, our RAE models achieve competitive overall fidelity.

Figure: Overview of the training pipeline. Left: RAE decoder training stage. We train a decoder on the representations (yellow tokens) produced by the frozen RAE encoder. Right: End-to-end unified training of the autoregressive model, diffusion transformer, and learnable query tokens (gray tokens) using cross-entropy (CE) loss for text prediction and a flow-matching objective for image prediction.

Figure: (a) Noise-augmented decoding gains diminish with training; (b) the DiT^DH advantage saturates as the DiT scales.

Figure: (a) Scaling DiT models with a fixed LLM (Qwen2.5 1.5B); (b) scaling the LLM and DiT jointly.

Figure: RAE-based models outperform VAE-based models and are less prone to overfitting. We train both models for 256 epochs and observe that (1) RAE-based models consistently achieve higher performance, and (2) VAE-based models begin to overfit rapidly after 64 epochs.

Figure: RAE-based models outperform VAEs across different settings. Left: When fine-tuning only the DiT versus the full LLM+DiT system, RAE models consistently achieve higher GenEval scores. Right: RAE models maintain their advantage over VAE across all DiT model scales (0.5B–9.8B parameters), with the performance gap widening as model size increases.

Figure: Test-time scaling in latent space. Our framework allows the LLM to directly evaluate and select generation results within the latent space, bypassing the decode-re-encode process.

Figure: Diffusion loss during finetuning (256 epochs). RAE overfits less and later than VAE: the VAE loss plunges early to very low values, while the RAE loss decreases more gradually and plateaus at higher values, indicating reduced overfitting.

Figure: Extended finetuning to 512 epochs. RAE maintains robust performance even with 512 epochs of training, while VAE suffers catastrophic overfitting after 64 epochs.

The dimension-dependent timestep shift, used to transfer a noise schedule tuned for dimension $n$ to dimension $m$:

$$ t_m = \frac{\alpha t_n}{1 + (\alpha - 1)t_n}, \quad \text{where} \quad \alpha = \sqrt{\frac{m}{n}}. $$
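A minimal sketch of this shift (function name is illustrative, not from the paper):

```python
import math

def shift_timestep(t_n, m, n):
    """Remap a timestep t_n, tuned for dimension n, to the equivalent
    timestep for dimension m, with alpha = sqrt(m / n)."""
    alpha = math.sqrt(m / n)
    return alpha * t_n / (1 + (alpha - 1) * t_n)
```

Note that the endpoints are preserved: `shift_timestep(0, m, n) == 0` and `shift_timestep(1, m, n) == 1`; only intermediate timesteps are moved toward higher noise when `m > n`.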

Table: Effect of the dimension-dependent timestep shift on T2I performance.

| Setting | GenEval ↑ | DPG-Bench ↑ |
|---|---|---|
| w/o shift | 23.6 | 54.8 |
| w/ shift | 49.6 | 76.8 |


References

[Authors14] FirstName LastName. The frobnicatable foo filter.

[Authors14b] FirstName LastName. Frobnication tutorial.

[Alpher02] FirstName Alpher. Frobnication.

[Alpher03] FirstName Alpher, FirstName Fotheringham-Smythe. Frobnication revisited. Journal of Foo.

[Alpher04] FirstName Alpher, FirstName Fotheringham-Smythe, FirstName Gamow. Can a machine frobnicate?. Journal of Foo.

[Alpher05] FirstName Alpher, FirstName Gamow. Can a computer frobnicate?.

[barham2022pathways] Barham, Paul, Chowdhery, Aakanksha, Dean, Jeff, Ghemawat, Sanjay, Hand, Steven, Hurt, Daniel, Isard, Michael, Lim, Hyeontaek, Pang, Ruoming, Roy, Sudip, others. (2022). Pathways: Asynchronous distributed dataflow for ml. Proceedings of Machine Learning and Systems.

[chen2025blip3] Chen, Jiuhai, Xu, Zhiyang, Pan, Xichen, Hu, Yushi, Qin, Can, Goldstein, Tom, Huang, Lifu, Zhou, Tianyi, Xie, Saining, Savarese, Silvio, others. (2025). Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568.

[liu2023visual] Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee. (2023). Visual Instruction Tuning. NeurIPS.

[deng2009imagenet] Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, Fei-Fei, Li. (2009). Imagenet: A large-scale hierarchical image database. CVPR.

[xu2023demystifying] Xu, Hu, Xie, Saining, Tan, Xiaoqing Ellen, Huang, Po-Yao, Howes, Russell, Sharma, Vasu, Li, Shang-Wen, Ghosh, Gargi, Zettlemoyer, Luke, Feichtenhofer, Christoph. (2024). Demystifying clip data. ICLR.

[fan2023motion] Fan, David, Wang, Jue, Liao, Shuai, Zhu, Yi, Bhat, Vimal, Santos-Villalobos, Hector, MV, Rohith, Li, Xinyu. (2023). Motion-guided masking for spatiotemporal representation learning. ICCV.

[fang2022data] Fang, Alex, Ilharco, Gabriel, Wortsman, Mitchell, Wan, Yuhao, Shankar, Vaishaal, Dave, Achal, Schmidt, Ludwig. (2022). Data determines distributional robustness in contrastive language image pre-training (clip). ICML.

[sun2017revisiting] Sun, Chen, Shrivastava, Abhinav, Singh, Saurabh, Gupta, Abhinav. (2017). Revisiting unreasonable effectiveness of data in deep learning era. ICCV.

[sutton2019bitter] Sutton, Richard. (2019). The bitter lesson. Incomplete Ideas (blog).

[jose2024dinov2] Jose, Cijo, Moutakanni, Th{'e. (2024). DINOv2 Meets Text: A Unified Framework for Image-and Pixel-Level Vision-Language Alignment. arXiv preprint arXiv:2412.16334.

[zhai2022lit] Zhai, Xiaohua, Wang, Xiao, Mustafa, Basil, Steiner, Andreas, Keysers, Daniel, Kolesnikov, Alexander, Beyer, Lucas. (2022). Lit: Zero-shot transfer with locked-image text tuning. CVPR.

[langley00] P. Langley. (2000). Crafting Papers on Machine Learning. Proceedings of the 17th International Conference on Machine Learning (ICML 2000).

[mitchell80] T. M. Mitchell. (1980). The Need for Biases in Learning Generalizations.

[kearns89] M. J. Kearns. (1989). Computational Complexity of Machine Learning.

[MachineLearningI] . Machine Learning: An Artificial Intelligence Approach, Vol. I. (1983).

[DudaHart2nd] R. O. Duda, P. E. Hart, D. G. Stork. (2000). Pattern Classification.

[anonymous] Author, N. N.. (2021). Suppressed for Anonymity.

[Newell81] A. Newell, P. S. Rosenbloom. (1981). Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition.

[Samuel59] A. L. Samuel. (1959). Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development.

[power2022grokking] Power, Alethea, Burda, Yuri, Edwards, Harri, Babuschkin, Igor, Misra, Vedant. (2022). Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177.

[wang2023see] Wang, Junke, Meng, Lingchen, Weng, Zejia, He, Bo, Wu, Zuxuan, Jiang, Yu-Gang. (2023). To see is to believe: Prompting gpt-4v for better visual instruction tuning. arXiv preprint arXiv:2311.07574.

[zhang2023llavar] Zhang, Yanzhe, Zhang, Ruiyi, Gu, Jiuxiang, Zhou, Yufan, Lipka, Nedim, Yang, Diyi, Sun, Tong. (2023). Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107.

[liu2022convnet] Liu, Zhuang, Mao, Hanzi, Wu, Chao-Yuan, Feichtenhofer, Christoph, Darrell, Trevor, Xie, Saining. (2022). A convnet for the 2020s. CVPR.

[masry2022chartqa] Masry, Ahmed, Long, Do Xuan, Tan, Jia Qing, Joty, Shafiq, Hoque, Enamul. (2022). Chartqa: A benchmark for question answering about charts with visual and logical reasoning. ACL.

[mathew2021docvqa] Mathew, Minesh, Karatzas, Dimosthenis, Jawahar, CV. (2021). Docvqa: A dataset for vqa on document images. WACV.

[kafle2018dvqa] Kafle, Kushal, Price, Brian, Cohen, Scott, Kanan, Christopher. (2018). Dvqa: Understanding data visualizations via question answering. CVPR.

[acharya2019tallyqa] Acharya, Manoj, Kafle, Kushal, Kanan, Christopher. (2019). TallyQA: Answering complex counting questions. AAAI.

[johnson2017clevr] Johnson, Justin, Hariharan, Bharath, Van Der Maaten, Laurens, Fei-Fei, Li, Lawrence Zitnick, C, Girshick, Ross. (2017). Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. CVPR.

[tu2023many] Tu, Haoqin, Cui, Chenhang, Wang, Zijun, Zhou, Yiyang, Zhao, Bingchen, Han, Junlin, Zhou, Wangchunshu, Yao, Huaxiu, Xie, Cihang. (2023). How many unicorns are in this image? a safety evaluation benchmark for vision llms. arXiv preprint arXiv:2311.16101.

[gurari2018vizwiz] Gurari, Danna, Li, Qing, Stangl, Abigale J, Guo, Anhong, Lin, Chi, Grauman, Kristen, Luo, Jiebo, Bigham, Jeffrey P. (2018). Vizwiz grand challenge: Answering visual questions from blind people. CVPR.

[zhang2023pre] Zhang, Yuhui, McKinzie, Brandon, Gan, Zhe, Shankar, Vaishaal, Toshev, Alexander. (2023). Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation. EMNLP.

[chen2024allava] Chen, Guiming Hardy, Chen, Shunian, Zhang, Ruifei, Chen, Junying, Wu, Xiangbo, Zhang, Zhiyi, Chen, Zhihong, Li, Jianquan, Wan, Xiang, Wang, Benyou. (2024). ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model. arXiv preprint arXiv:2402.11684.

[lambon2010coherent] Lambon Ralph, Matthew A, Sage, Karen, Jones, Roy W, Mayberry, Emily J. (2010). Coherent concepts are computed in the anterior temporal lobes. Proceedings of the National Academy of Sciences.

[chen2020simple] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. ICML.

[he2019momentum] He, Kaiming, Fan, Haoqi, Wu, Yuxin, Xie, Saining, Girshick, Ross. (2019). Momentum Contrast for Unsupervised Visual Representation Learning. arXiv e-prints, art. CVPR.

[bardes2024revisiting] Bardes, Adrien, Garrido, Quentin, Ponce, Jean, Chen, Xinlei, Rabbat, Michael, LeCun, Yann, Assran, Mahmoud, Ballas, Nicolas. (2024). Revisiting feature prediction for learning visual representations from video. TMLR.

[hendrycks2016gaussian] Hendrycks, Dan, Gimpel, Kevin. (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.

[zhang2024direct] Zhang, Ruohong, Gui, Liangke, Sun, Zhiqing, Feng, Yihao, Xu, Keyang, Zhang, Yuanhan, Fu, Di, Li, Chunyuan, Hauptmann, Alexander, Bisk, Yonatan, others. (2024). Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward. arXiv preprint arXiv:2404.01258.

[loshchilov2017decoupled] Loshchilov, I. (2019). Decoupled weight decay regularization. ICLR.

[ba2016layer] Ba, Jimmy Lei, Kiros, Jamie, Geoffrey E. Hinton. (2016). Layer normalization. NeurIPS.

[preechakul2022diffusion] Preechakul, Konpat, Chatthee, Nattanat, Wizadwongsa, Suttisak, Suwajanakorn, Supasorn. (2022). Diffusion autoencoders: Toward a meaningful and decodable representation. CVPR.

[pan2023kosmos] Pan, Xichen, Dong, Li, Huang, Shaohan, Peng, Zhiliang, Chen, Wenhu, Wei, Furu. (2024). Kosmos-g: Generating images in context with multimodal large language models. ICLR.

[koh2024generating] Koh, Jing Yu, Fried, Daniel, Salakhutdinov, Russ R. (2024). Generating images with multimodal language models. NeurIPS.

[rajbhandari2020zero] Rajbhandari, Samyam, Rasley, Jeff, Ruwase, Olatunji, He, Yuxiong. (2020). Zero: Memory optimizations toward training trillion parameter models. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.

[heusel2017gans] Heusel, Martin, Ramsauer, Hubert, Unterthiner, Thomas, Nessler, Bernhard, Hochreiter, Sepp. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. NeurIPS.

[yue2024mmmu] Yue, Xiang, Zheng, Tianyu, Ni, Yuansheng, Wang, Yubo, Zhang, Kai, Tong, Shengbang, Sun, Yuxuan, Yin, Ming, Yu, Botao, Zhang, Ge, others. (2024). Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2409.02813.

[pan2024autonomous] Pan, Jiayi, Zhang, Yichi, Tomlin, Nicholas, Zhou, Yifei, Levine, Sergey, Suhr, Alane. (2024). Autonomous evaluation and refinement of digital agents. COLM.

[cha2024visually] Cha, Sungguk, Lee, Jusung, Lee, Younghyun, Yang, Cheoljong. (2024). Visually Dehallucinative Instruction Generation: Know What You Don't Know. arXiv preprint arXiv:2402.09717.

[si2024design2code] Si, Chenglei, Zhang, Yanzhe, Yang, Zhengyuan, Liu, Ruibo, Yang, Diyi. (2024). Design2Code: How Far Are We From Automating Front-End Engineering?. arXiv preprint arXiv:2403.03163.

[li2024multimodal] Li, Lei, Wang, Yuqi, Xu, Runxin, Wang, Peiyi, Feng, Xiachong, Kong, Lingpeng, Liu, Qi. (2024). Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models. arXiv preprint arXiv:2403.00231.

[wang2024measuring] Wang, Ke, Pan, Junting, Shi, Weikang, Lu, Zimu, Zhan, Mingjie, Li, Hongsheng. (2024). Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset. arXiv preprint arXiv:2402.14804.

[wu2023q] Wu, Haoning, Zhang, Zicheng, Zhang, Erli, Chen, Chaofeng, Liao, Liang, Wang, Annan, Xu, Kaixin, Li, Chunyi, Hou, Jingwen, Zhai, Guangtao, others. (2023). Q-instruct: Improving low-level visual abilities for multi-modality foundation models. arXiv preprint arXiv:2311.06783.

[kembhavi2016diagram] Kembhavi, Aniruddha, Salvato, Mike, Kolve, Eric, Seo, Minjoon, Hajishirzi, Hannaneh, Farhadi, Ali. (2016). A diagram is worth a dozen images. ECCV.

[laiongpt4v] LAION. (2023). laion/gpt4v-dataset.

[hsiao2022screenqa] Hsiao, Yu-Chung, Zubach, Fedir, Wang, Maria, others. (2022). Screenqa: Large-scale question-answer pairs over mobile app screenshots. arXiv preprint arXiv:2209.08199.

[lu2022learn] Lu, Pan, Mishra, Swaroop, Xia, Tanglin, Qiu, Liang, Chang, Kai-Wei, Zhu, Song-Chun, Tafjord, Oyvind, Clark, Peter, Kalyan, Ashwin. (2022). Learn to explain: Multimodal reasoning via thought chains for science question answering. NeurIPS.

[gao2023g] Gao, Jiahui, Pi, Renjie, Zhang, Jipeng, Ye, Jiacheng, Zhong, Wanjun, Wang, Yufei, Hong, Lanqing, Han, Jianhua, Xu, Hang, Li, Zhenguo, others. (2023). G-llava: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370.

[kim2021donut] Kim, Geewook, Hong, Teakgyu, Yim, Moonbin, Park, Jinyoung, Yim, Jinyeong, Hwang, Wonseok, Yun, Sangdoo, Han, Dongyoon, Park, Seunghyun. (2022). Donut: Document understanding transformer without ocr. ECCV.

[laurenccon2024unlocking] Lauren{\c{c. (2024). Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset. arXiv preprint arXiv:2403.09029.

[belouadi2023automatikz] Belouadi, Jonas, Lauscher, Anne, Eger, Steffen. (2024). Automatikz: Text-guided synthesis of scientific vector graphics with tikz. ICLR.

[alawwad2024enhancing] Alawwad, Hessa Abdulrahman, Alhothali, Areej, Naseem, Usman, Alkhathlan, Ali, Jamal, Amani. (2024). Enhancing Textbook Question Answering Task with Large Language Models and Retrieval Augmented Generation. arXiv preprint arXiv:2402.05128.

[lu2021inter] Lu, Pan, Gong, Ran, Jiang, Shibiao, Qiu, Liang, Huang, Siyuan, Liang, Xiaodan, Zhu, Song-Chun. (2021). Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. ACL.

[zhang2019raven] Zhang, Chi, Gao, Feng, Jia, Baoxiong, Zhu, Yixin, Zhu, Song-Chun. (2019). Raven: A dataset for relational and analogical visual reasoning. CVPR.

[lu2021iconqa] Lu, Pan, Qiu, Liang, Chen, Jiaqi, Xia, Tony, Zhao, Yizhou, Zhang, Wei, Yu, Zhou, Liang, Xiaodan, Zhu, Song-Chun. (2021). Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. NeurIPS.

[kazemi2023geomverse] Kazemi, Mehran, Alvari, Hamidreza, Anand, Ankit, Wu, Jialin, Chen, Xi, Soricut, Radu. (2023). Geomverse: A systematic evaluation of large models for geometric reasoning. arXiv preprint arXiv:2312.12241.

[pasupat2015compositional] Pasupat, Panupong, Liang, Percy. (2015). Compositional semantic parsing on semi-structured tables. ACL.

[zhong2017seq2sql] Zhong, Victor, Xiong, Caiming, Socher, Richard. (2017). Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.

[chen2021finqa] Chen, Zhiyu, Chen, Wenhu, Smiley, Charese, Shah, Sameena, Borova, Iana, Langdon, Dylan, Moussa, Reema, Beane, Matt, Huang, Ting-Hao, Routledge, Bryan, others. (2021). Finqa: A dataset of numerical reasoning over financial data. EMNLP.

[cheng2021hitab] Cheng, Zhoujun, Dong, Haoyu, Wang, Zhiruo, Jia, Ran, Guo, Jiaqi, Gao, Yan, Han, Shi, Lou, Jian-Guang, Zhang, Dongmei. (2022). HiTab: A hierarchical table dataset for question answering and natural language generation. ACL.

[zhu2021tat] Zhu, Fengbin, Lei, Wenqiang, Huang, Youcheng, Wang, Chao, Zhang, Shuo, Lv, Jiancheng, Feng, Fuli, Chua, Tat-Seng. (2021). TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. ACL.

[lu2022dynamic] Lu, Pan, Qiu, Liang, Chang, Kai-Wei, Wu, Ying Nian, Zhu, Song-Chun, Rajpurohit, Tanmay, Clark, Peter, Kalyan, Ashwin. (2023). Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. ICLR.

[kantharaj2022chart] Kantharaj, Shankar, Leong, Rixie Tiffany Ko, Lin, Xiang, Masry, Ahmed, Thakkar, Megh, Hoque, Enamul, Joty, Shafiq. (2022). Chart-to-text: A large-scale benchmark for chart summarization. ACL.

[tang2023vistext] Tang, Benny J, Boggust, Angie, Satyanarayan, Arvind. (2023). Vistext: A benchmark for semantically rich chart captioning. arXiv preprint arXiv:2307.05356.

[biten2022latr] Biten, Ali Furkan, Litman, Ron, Xie, Yusheng, Appalaraju, Srikar, Manmatha, R. (2022). Latr: Layout-aware transformer for scene-text vqa. CVPR.

[biten2019scene] Biten, Ali Furkan, Tito, Ruben, Mafla, Andres, Gomez, Lluis, Rusinol, Mar{\c{c. (2019). Scene text visual question answering. ICCV.

[kiela2020hateful] Kiela, Douwe, Firooz, Hamed, Mohan, Aravind, Goswami, Vedanuj, Singh, Amanpreet, Ringshia, Pratik, Testuggine, Davide. (2020). The hateful memes challenge: Detecting hate speech in multimodal memes. NeurIPS.

[RenderedText] Chris Wendler. (2023). wendlerc/RenderedText.

[zhu2016visual7w] Zhu, Yuke, Groth, Oliver, Bernstein, Michael, Fei-Fei, Li. (2016). Visual7w: Grounded question answering in images. CVPR.

[tanaka2021visualmrc] Tanaka, Ryota, Nishida, Kyosuke, Yoshida, Sen. (2021). VisualMRC: Machine Reading Comprehension on Document Images. AAAI.

[shridhar2020alfworld] Shridhar, Mohit, Yuan, Xingdi, C{^{o. (2021). ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. ICLR.

[pont-tuset2019localizednarratives] Pont{-. (2020). Connecting Vision and Language with Localized Narratives. ECCV.

[he2020pathvqa] He, Xuehai, Zhang, Yichen, Mou, Luntian, Xing, Eric P., Xie, Pengtao. (2020). PathVQA: 30000+ Questions for Medical Visual Question Answering. CoRR.

[chen2023sharegpt4v] Chen, Lin, Li, Jisong, Dong, Xiaoyi, Zhang, Pan, He, Conghui, Wang, Jiaqi, Zhao, Feng, Lin, Dahua. (2023). Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793.

[hudson2019gqa] Drew A. Hudson, Christopher D. Manning. (2019). GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. CVPR.

[marino2019okvqa] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, Roozbeh Mottaghi. (2019). OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. CVPR.

[vishniakov2023convnet] Vishniakov, Kirill, Shen, Zhiqiang, Liu, Zhuang. (2024). ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy. ICML.

[schwenk2022aokvqa] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, Roozbeh Mottaghi. (2022). A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge. ECCV.

[mishra2019OCR] . OCR-VQA: Visual Question Answering by Reading Text in Images. (2019).

[sidorov2020textcaps] Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, Amanpreet Singh. (2020). TextCaps: a Dataset for Image Captioning with Reading Comprehension.

[yu2016modeling] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, Tamara L. Berg. (2016). Modeling Context in Referring Expressions.

[team2024chameleon] Team, Chameleon. (2024). Chameleon: Mixed-Modal Early-Fusion Foundation Models. arXiv preprint arXiv:2405.09818.

[yu2023rlhf] Yu, Tianyu, Yao, Yuan, Zhang, Haoye, He, Taiwen, Han, Yifeng, Cui, Ganqu, Hu, Jinyi, Liu, Zhiyuan, Zheng, Hai-Tao, Sun, Maosong, others. (2023). Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. arXiv preprint arXiv:2312.00849.

[li2024return] Li, Tianhong, Katabi, Dina, He, Kaiming. (2024). Return of unconditional generation: A self-supervised representation generation method. NeurIPS.

[rafailov2024direct] Rafailov, Rafael, Sharma, Archit, Mitchell, Eric, Manning, Christopher D, Ermon, Stefano, Finn, Chelsea. (2024). Direct preference optimization: Your language model is secretly a reward model. NeurIPS.

[zhu2023starling] Zhu, Banghua, Frick, Evan, Wu, Tianhao, Zhu, Hanlin, Jiao, Jiantao. (2023). Starling-7b: Improving llm helpfulness & harmlessness with rlaif.

[he2017mask] He, Kaiming, Gkioxari, Georgia, Doll{'a. (2017). Mask r-cnn. ICCV.

[ouyang2022training] Ouyang, Long, Wu, Jeffrey, Jiang, Xu, Almeida, Diogo, Wainwright, Carroll, Mishkin, Pamela, Zhang, Chong, Agarwal, Sandhini, Slama, Katarina, Ray, Alex, others. (2022). Training language models to follow instructions with human feedback. NeurIPS.

[dong2024rlhf] Dong, Hanze, Xiong, Wei, Pang, Bo, Wang, Haoxiang, Zhao, Han, Zhou, Yingbo, Jiang, Nan, Sahoo, Doyen, Xiong, Caiming, Zhang, Tong. (2024). Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863.

[liu2024decade] Liu, Zhuang, He, Kaiming. (2025). A Decade's Battle on Dataset Bias: Are We There Yet?. ICLR.

[woo2023convnext] Woo, Sanghyun, Debnath, Shoubhik, Hu, Ronghang, Chen, Xinlei, Liu, Zhuang, Kweon, In So, Xie, Saining. (2023). Convnext v2: Co-designing and scaling convnets with masked autoencoders. CVPR.

[yuksekgonul2022and] Yuksekgonul, Mert, Bianchi, Federico, Kalluri, Pratyusha, Jurafsky, Dan, Zou, James. (2022). When and why vision-language models behave like bags-of-words, and what to do about it?. ICLR.

[chen2024far] Chen, Zhe, Wang, Weiyun, Tian, Hao, Ye, Shenglong, Gao, Zhangwei, Cui, Erfei, Tong, Wenwen, Hu, Kongzhi, Luo, Jiapeng, Ma, Zheng, others. (2024). How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821.

[tong2024mass] Tong, Shengbang, Jones, Erik, Steinhardt, Jacob. (2024). Mass-producing failures of multimodal systems with language models. NeurIPS.

[krishna2016visual] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, Fei-Fei Li. (2016). Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. IJCV.

[tong2024eyes] Tong, Shengbang, Liu, Zhuang, Zhai, Yuexiang, Ma, Yi, LeCun, Yann, Xie, Saining. (2024). Eyes wide shut? exploring the visual shortcomings of multimodal llms. CVPR.

[liu2023improved] Liu, Haotian, Li, Chunyuan, Li, Yuheng, Lee, Yong Jae. (2024). Improved baselines with visual instruction tuning. CVPR.

[mckinzie2024mm1] McKinzie, Brandon, Gan, Zhe, Fauconnier, Jean-Philippe, Dodge, Sam, Zhang, Bowen, Dufter, Philipp, Shah, Dhruti, Du, Xianzhi, Peng, Futang, Weers, Floris, others. (2024). Mm1: Methods, analysis & insights from multimodal llm pre-training. arXiv preprint arXiv:2403.09611.

[fang2023data] Fang, Alex, Jose, Albin Madappally, Jain, Amit, Schmidt, Ludwig, Toshev, Alexander, Shankar, Vaishaal. (2024). Data filtering networks. ICLR.

[gao2024sphinx] Gao, Peng, Zhang, Renrui, Liu, Chris, Qiu, Longtian, Huang, Siyuan, Lin, Weifeng, Zhao, Shitian, Geng, Shijie, Lin, Ziyi, Jin, Peng, others. (2024). SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models. arXiv preprint arXiv:2402.05935.

[DatabricksBlog2023DollyV2] Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, Reynold Xin. (2023). Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM.

[yue2023mammoth] Yue, Xiang, Qu, Xingwei, Zhang, Ge, Fu, Yao, Huang, Wenhao, Sun, Huan, Su, Yu, Chen, Wenhu. (2024). Mammoth: Building math generalist models through hybrid instruction tuning. ICLR.

[luo2023wizardcoder] Luo, Ziyang, Xu, Can, Zhao, Pu, Sun, Qingfeng, Geng, Xiubo, Hu, Wenxiang, Tao, Chongyang, Ma, Jing, Lin, Qingwei, Jiang, Daxin. (2024). Wizardcoder: Empowering code large language models with evol-instruct. ICLR.

[mitra2024orcamath] Arindam Mitra, Hamed Khanpour, Corby Rosset, Ahmed Awadallah. (2024). Orca-Math: Unlocking the potential of SLMs in Grade School Math.

[zheng2024opencodeinterpreter] Zheng, Tianyu, Zhang, Ge, Shen, Tianhao, Liu, Xueling, Lin, Bill Yuchen, Fu, Jie, Chen, Wenhu, Yue, Xiang. (2024). OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement. arXiv preprint arXiv:2402.14658.

[OpenOrca] Wing Lian, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong. (2023). OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces. HuggingFace repository.

[tang2025tulip] Radford, Alec, Kim, Jong Wook, Hallacy, Chris, Ramesh, Aditya, Goh, Gabriel, Agarwal, Sandhini, Sastry, Girish, Askell, Amanda, Mishkin, Pamela, Clark, Jack, others. (2021). Learning transferable visual models from natural language supervision. ICML.

[schuhmann2022laion] Schuhmann, Christoph, Beaumont, Romain, Vencu, Richard, Gordon, Cade, Wightman, Ross, Cherti, Mehdi, Coombes, Theo, Katta, Aarush, Mullis, Clayton, Wortsman, Mitchell, others. (2022). Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS.

[zheng2024judging] Zheng, Lianmin, Chiang, Wei-Lin, Sheng, Ying, Zhuang, Siyuan, Wu, Zhanghao, Zhuang, Yonghao, Lin, Zi, Li, Zhuohan, Li, Dacheng, Xing, Eric, others. (2024). Judging llm-as-a-judge with mt-bench and chatbot arena. NeurIPS.

[chiang2024chatbot] Chiang, Wei-Lin, Zheng, Lianmin, Sheng, Ying, Angelopoulos, Anastasios Nikolas, Li, Tianle, Li, Dacheng, Zhang, Hao, Zhu, Banghua, Jordan, Michael, Gonzalez, Joseph E, others. (2024). Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132.

[zhai2023sigmoid] Zhai, Xiaohua, Mustafa, Basil, Kolesnikov, Alexander, Beyer, Lucas. (2023). Sigmoid loss for language image pre-training. ICCV.

[sun2023eva] Sun, Quan, Fang, Yuxin, Wu, Ledell, Wang, Xinlong, Cao, Yue. (2023). Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389.

[cherti2023reproducible] Cherti, Mehdi, Beaumont, Romain, Wightman, Ross, Wortsman, Mitchell, Ilharco, Gabriel, Gordon, Cade, Schuhmann, Christoph, Schmidt, Ludwig, Jitsev, Jenia. (2023). Reproducible scaling laws for contrastive language-image learning. CVPR.

[he2022masked] He, Kaiming, Chen, Xinlei, Xie, Saining, Li, Yanghao, Dollár, Piotr, Girshick, Ross. (2022). Masked autoencoders are scalable vision learners. CVPR.

[chen2021empirical] Chen, Xinlei, Xie, Saining, He, Kaiming. (2021). An empirical study of training self-supervised vision transformers. ICCV.

[oquab2023dinov2] Oquab, Maxime, Darcet, Timothée, Moutakanni, Théo, others. (2023). Dinov2: Learning robust visual features without supervision. TMLR.

[cunningham2023sparse] Cunningham, Hoagy, Ewart, Aidan, Riggs, Logan, Huben, Robert, Sharkey, Lee. (2024). Sparse autoencoders find highly interpretable features in language models. ICLR.

[assran2023self] Assran, Mahmoud, Duval, Quentin, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, LeCun, Yann, Ballas, Nicolas. (2023). Self-supervised learning from images with a joint-embedding predictive architecture. CVPR.

[barstochastic] Bar, Amir, Bordes, Florian, Shocher, Assaf, Assran, Mido, Vincent, Pascal, Ballas, Nicolas, Darrell, Trevor, Globerson, Amir, LeCun, Yann. (2024). Stochastic positional embeddings improve masked image modeling. ICML.

[dosovitskiy2020image] Dosovitskiy, Alexey, Beyer, Lucas, Kolesnikov, Alexander, Weissenborn, Dirk, Zhai, Xiaohua, Unterthiner, Thomas, Dehghani, Mostafa, Minderer, Matthias, Heigold, Georg, Gelly, Sylvain, others. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. ICLR.

[jouppi2023tpu] Jouppi, Norm, Kurian, George, Li, Sheng, Ma, Peter, Nagarajan, Rahul, Nai, Lifeng, Patil, Nishant, Subramanian, Suvinay, Swing, Andy, Towles, Brian, others. (2023). Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. Proceedings of the 50th Annual International Symposium on Computer Architecture.

[zhao2023pytorch] Zhao, Yanli, Gu, Andrew, Varma, Rohan, Luo, Liang, Huang, Chien-Chin, Xu, Min, Wright, Less, Shojanazeri, Hamid, Ott, Myle, Shleifer, Sam, others. (2023). Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277.

[zhou2023don] Zhou, Kun, Zhu, Yutao, Chen, Zhipeng, Chen, Wentong, Zhao, Wayne Xin, Chen, Xu, Lin, Yankai, Wen, Ji-Rong, Han, Jiawei. (2023). Don't Make Your LLM an Evaluation Benchmark Cheater. arXiv preprint arXiv:2311.01964.

[kirillov2023segment] Kirillov, Alexander, Mintun, Eric, Ravi, Nikhila, Mao, Hanzi, Rolland, Chloe, Gustafson, Laura, Xiao, Tete, Whitehead, Spencer, Berg, Alexander C, Lo, Wan-Yen, others. (2023). Segment anything. ICCV.

[birkl2023midas] Birkl, Reiner, Wofk, Diana, Müller, Matthias. (2023). MiDaS v3.1 -- A model zoo for robust monocular relative depth estimation. arXiv preprint arXiv:2307.14460.

[lasinger2019towards] Lasinger, Katrin, Ranftl, René, Schindler, Konrad, Koltun, Vladlen. (2019). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. arXiv preprint arXiv:1907.01341.

[Rombach_2022_CVPR] Rombach, Robin, Blattmann, Andreas, Lorenz, Dominik, Esser, Patrick, Ommer, Björn. (2022). High-Resolution Image Synthesis With Latent Diffusion Models. CVPR.

[karamcheti2024prismatic] Karamcheti, Siddharth, Nair, Suraj, Balakrishna, Ashwin, Liang, Percy, Kollar, Thomas, Sadigh, Dorsa. (2024). Prismatic vlms: Investigating the design space of visually-conditioned language models. arXiv preprint arXiv:2402.07865.

[zhai2023investigating] Zhai, Yuexiang, Tong, Shengbang, Li, Xiao, Cai, Mu, Qu, Qing, Lee, Yong Jae, Ma, Yi. (2024). Investigating the catastrophic forgetting in multimodal large language models. CPAL.

[li2023internet] Li, Alexander C, Brown, Ellis, Efros, Alexei A, Pathak, Deepak. (2023). Internet Explorer: Targeted Representation Learning on the Open Web. ICML.

[liu2024llavanext] Liu, Haotian, Li, Chunyuan, Li, Yuheng, Li, Bo, Zhang, Yuanhan, Shen, Sheng, Lee, Yong Jae. (2024). LLaVA-NeXT: Improved reasoning, OCR, and world knowledge.

[lu2024deepseek] Lu, Haoyu, Liu, Wen, Zhang, Bo, Wang, Bingxuan, Dong, Kai, Liu, Bo, Sun, Jingxiang, Ren, Tongzheng, Li, Zhuoshu, Sun, Yaofeng, others. (2024). DeepSeek-VL: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525.

[li2023your] Li, Alexander C, Prabhudesai, Mihir, Duggal, Shivam, Brown, Ellis, Pathak, Deepak. (2023). Your diffusion model is secretly a zero-shot classifier. ICCV.

[chen2022pali] Chen, Xi, Wang, Xiao, Changpinyo, Soravit, Piergiovanni, AJ, Padlewski, Piotr, Salz, Daniel, Goodman, Sebastian, Grycner, Adam, Mustafa, Basil, Beyer, Lucas, others. (2023). Pali: A jointly-scaled multilingual language-image model. ICLR.

[murtagh2014ward] Murtagh, Fionn, Legendre, Pierre. (2014). Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion?. Journal of classification.

[llama3modelcard] AI@Meta. (2024). Llama 3 Model Card.

[Gemini] Google. (2023). Gemini.

[qwen] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, Tianhang Zhu. (2023). Qwen Technical Report. arXiv preprint arXiv:2309.16609.

[bai2023qwen] Bai, Jinze, Bai, Shuai, Yang, Shusheng, Wang, Shijie, Tan, Sinan, Wang, Peng, Lin, Junyang, Zhou, Chang, Zhou, Jingren. (2023). Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966.

[dai2024instructblip] Dai, Wenliang, Li, Junnan, Li, Dongxu, Tiong, Anthony Meng Huat, Zhao, Junqi, Wang, Weisheng, Li, Boyang, Fung, Pascale N, Hoi, Steven. (2024). Instructblip: Towards general-purpose vision-language models with instruction tuning. NeurIPS.

[liu2023hidden] Liu, Yuliang, Li, Zhang, Li, Hongliang, Yu, Wenwen, Huang, Mingxin, Peng, Dezhi, Liu, Mingyu, Chen, Mingrui, Li, Chunyuan, Jin, Lianwen, others. (2023). On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895.

[ge2023planting] Ge, Yuying, Ge, Yixiao, Zeng, Ziyun, Wang, Xintao, Shan, Ying. (2023). Planting a seed of vision in large language model. arXiv preprint arXiv:2307.08041.

[wu2023vstar] Wu, Penghao, Xie, Saining. (2024). V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs. CVPR.

[jaegle2021perceiver] Jaegle, Andrew, Gimeno, Felix, Brock, Andy, Vinyals, Oriol, Zisserman, Andrew, Carreira, Joao. (2021). Perceiver: General perception with iterative attention. ICML.

[young2024yi] Young, Alex, Chen, Bei, Li, Chao, Huang, Chengen, Zhang, Ge, Zhang, Guanwei, Li, Heng, Zhu, Jiangcheng, Chen, Jianqun, Chang, Jing, others. (2024). Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652.

[zhai2024fine] Zhai, Yuexiang, Bai, Hao, Lin, Zipeng, Pan, Jiayi, Tong, Shengbang, Zhou, Yifei, Suhr, Alane, Xie, Saining, LeCun, Yann, Ma, Yi, others. (2024). Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning. NeurIPS.

[lu2023mathvista] Lu, Pan, Bansal, Hritik, Xia, Tony, Liu, Jiacheng, Li, Chunyuan, Hajishirzi, Hannaneh, Cheng, Hao, Chang, Kai-Wei, Galley, Michel, Gao, Jianfeng. (2023). Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. ICLR.

[liu2023mmbench] Liu, Yuan, Duan, Haodong, Zhang, Yuanhan, Li, Bo, Zhang, Songyang, Zhao, Wangbo, Yuan, Yike, Wang, Jiaqi, He, Conghui, Liu, Ziwei, others. (2024). Mmbench: Is your multi-modal model an all-around player?. ECCV.

[alayrac2022flamingo] Alayrac, Jean-Baptiste, Donahue, Jeff, Luc, Pauline, Miech, Antoine, Barr, Iain, Hasson, Yana, Lenc, Karel, Mensch, Arthur, Millican, Katherine, Reynolds, Malcolm, others. (2022). Flamingo: a visual language model for few-shot learning. NeurIPS.

[li2023oxfordtvg] Li, Runjia, Sun, Shuyang, Elhoseiny, Mohamed, Torr, Philip. (2023). OxfordTVG-HIC: Can Machine Make Humorous Captions from Images?. ICCV.

[gadre2024datacomp] Gadre, Samir Yitzhak, Ilharco, Gabriel, Fang, Alex, Hayase, Jonathan, Smyrnis, Georgios, Nguyen, Thao, Marten, Ryan, Wortsman, Mitchell, Ghosh, Dhruba, Zhang, Jieyu, others. (2024). Datacomp: In search of the next generation of multimodal datasets. NeurIPS.

[banani2024probing] Banani, Mohamed El, Raj, Amit, Maninis, Kevis-Kokitsi, Kar, Abhishek, Li, Yuanzhen, Rubinstein, Michael, Sun, Deqing, Guibas, Leonidas, Johnson, Justin, Jampani, Varun. (2024). Probing the 3D Awareness of Visual Foundation Models. arXiv preprint arXiv:2404.08636.

[OpenAI2022ChatGPT] OpenAI. (2022). ChatGPT.

[StabilityAI2024SD35] Stability AI. (2024). Stable Diffusion 3.5.

[roberts2019exploring] Roberts, Adam, Raffel, Colin, Lee, Katherine, Matena, Michael, Shazeer, Noam, Liu, Peter J, Narang, Sharan, Li, Wei, Zhou, Yanqi. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR.

[Taori2023Alpaca] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, Tatsunori B. Hashimoto. (2023). Alpaca: A Strong, Replicable Instruction-Following Model.

[zhou2024lima] Zhou, Chunting, Liu, Pengfei, Xu, Puxin, Iyer, Srinivasan, Sun, Jiao, Mao, Yuning, Ma, Xuezhe, Efrat, Avia, Yu, Ping, Yu, Lili, others. (2024). Lima: Less is more for alignment. NeurIPS.

[Sanseviero2024LLM] Omar Sanseviero. (2022). LLM Evals and Benchmarking.

[rajamanoharan2024improving] Rajamanoharan, Senthooran, Conmy, Arthur, Smith, Lewis, Lieberum, Tom, Varma, Vikrant, Kramár, János, Shah, Rohin, Nanda, Neel. (2024). Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014.

[grok] xAI. (2024). Grok.

[singh2019towards] Singh, Amanpreet, Natarajan, Vivek, Shah, Meet, Jiang, Yu, Chen, Xinlei, Batra, Dhruv, Parikh, Devi, Rohrbach, Marcus. (2019). Towards vqa models that can read. CVPR.

[chang2024survey] Chang, Yupeng, Wang, Xu, Wang, Jindong, Wu, Yuan, Yang, Linyi, Zhu, Kaijie, Chen, Hao, Yi, Xiaoyuan, Wang, Cunxiang, Wang, Yidong, others. (2024). A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology.

[sun2024generative] Sun, Quan, Cui, Yufeng, Zhang, Xiaosong, Zhang, Fan, Yu, Qiying, Wang, Yueze, Rao, Yongming, Liu, Jingjing, Huang, Tiejun, Wang, Xinlong. (2024). Generative multimodal models are in-context learners. CVPR.

[sun2023generative] Sun, Quan, Yu, Qiying, Cui, Yufeng, Zhang, Fan, Zhang, Xiaosong, Wang, Yueze, Gao, Hongcheng, Liu, Jingjing, Huang, Tiejun, Wang, Xinlong. (2024). Generative pretraining in multimodality. ICLR.

[dong2023dreamllm] Dong, Runpei, Han, Chunrui, Peng, Yuang, Qi, Zekun, Ge, Zheng, Yang, Jinrong, Zhao, Liang, Sun, Jianjian, Zhou, Hongyu, Wei, Haoran, others. (2024). Dreamllm: Synergistic multimodal comprehension and creation. ICLR.

[shao2024visual] Shao, Hao, Qian, Shengju, Xiao, Han, Song, Guanglu, Zong, Zhuofan, Wang, Letian, Liu, Yu, Li, Hongsheng. (2024). Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. NeurIPS.

[miech2019howto100m] Miech, Antoine, Zhukov, Dimitri, Alayrac, Jean-Baptiste, Tapaswi, Makarand, Laptev, Ivan, Sivic, Josef. (2019). Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. ICCV.

[wang2024emu3] Wang, Xinlong, Zhang, Xiaosong, Luo, Zhengxiong, Sun, Quan, Cui, Yufeng, Wang, Jinsheng, Zhang, Fan, Wang, Yueze, Li, Zhen, Yu, Qiying, others. (2024). Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869.

[kar2025brave] Kar, Oğuzhan Fatih, Tonioni, Alessio, Poklukar, Petra, Kulshrestha, Achin, Zamir, Amir, Tombari, Federico. (2024). BRAVE: Broadening the visual encoding of vision-language models. ECCV.

[laurenccon2024obelics] Laurençon, Hugo, Saulnier, Lucile, Tronchon, Léo, others. (2024). Obelics: An open web-scale filtered dataset of interleaved image-text documents. NeurIPS.

[li2024mvbench] Li, Kunchang, Wang, Yali, He, Yinan, Li, Yizhuo, Wang, Yi, Liu, Yi, Wang, Zun, Xu, Jilan, Chen, Guo, Luo, Ping, others. (2024). Mvbench: A comprehensive multi-modal video understanding benchmark. CVPR.

[goyal2017something] Goyal, Raghav, Ebrahimi Kahou, Samira, Michalski, Vincent, Materzynska, Joanna, Westphal, Susanne, Kim, Heuna, Haenel, Valentin, Fruend, Ingo, Yianilos, Peter, Mueller-Freitag, Moritz, others. (2017). The "something something" video database for learning and evaluating visual common sense. ICCV.

[zohar2024videostar] Zohar, Orr, Wang, Xiaohan, Bitton, Yonatan, Szpektor, Idan, Yeung-levy, Serena. (2024). Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision. arXiv preprint arXiv:2407.06189.

[OpenAI2024gpt4o] OpenAI. (2024). GPT-4o.

[Anthropic2024Claude] Anthropic. (2024). Claude.

[touvron2023llama] Touvron, Hugo, Lavril, Thibaut, Izacard, Gautier, Martinet, Xavier, Lachaux, Marie-Anne, Lacroix, Timothée, others. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.

[touvron2023llama2] Touvron, Hugo, Martin, Louis, Stone, Kevin, Albert, Peter, Almahairi, Amjad, Babaei, Yasmine, Bashlykov, Nikolay, Batra, Soumya, Bhargava, Prajjwal, Bhosale, Shruti, others. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288.

[li2024llavanext-strong] Li, Bo, Zhang, Kaichen, Zhang, Hao, Guo, Dong, Zhang, Renrui, Li, Feng, Zhang, Yuanhan, Liu, Ziwei, Li, Chunyuan. (2024). LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild.

[yue2023mmmu] Yue, Xiang, Ni, Yuansheng, Zhang, Kai, Zheng, Tianyu, Liu, Ruoqi, Zhang, Ge, Stevens, Samuel, Jiang, Dongfu, Ren, Weiming, Sun, Yuxuan, others. (2024). Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. CVPR.

[hiippala2021ai2d] Hiippala, Tuomo, Alikhani, Malihe, Haverinen, Jonas, Kalliokoski, Timo, Logacheva, Evanfiya, Orekhova, Serafina, Tuomainen, Aino, Stone, Matthew, Bateman, John A. (2021). AI2D-RST: A multimodal corpus of 1000 primary school science diagrams. Language Resources and Evaluation.

[brazil2023omni3d] Brazil, Garrick, Kumar, Abhinav, Straub, Julian, Ravi, Nikhila, Johnson, Justin, Gkioxari, Georgia. (2023). Omni3d: A large benchmark and model for 3d object detection in the wild. CVPR.

[zhou2019semantic] Zhou, Bolei, Zhao, Hang, Puig, Xavier, Xiao, Tete, Fidler, Sanja, Barriuso, Adela, Torralba, Antonio. (2019). Semantic understanding of scenes through the ade20k dataset. IJCV.

[lin2014microsoft] Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollár, Piotr, Zitnick, C Lawrence. (2014). Microsoft coco: Common objects in context. ECCV.

[fu2024blink] Fu, Xingyu, Hu, Yushi, Li, Bangzheng, Feng, Yu, Wang, Haoyu, Lin, Xudong, Roth, Dan, Smith, Noah A, Ma, Wei-Chiu, Krishna, Ranjay. (2024). BLINK: Multimodal Large Language Models Can See but Not Perceive. arXiv preprint arXiv:2404.12390.

[russakovsky2015imagenet] Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, others. (2015). Imagenet large scale visual recognition challenge. IJCV.

[aquinas] Thomas Aquinas. Quaestiones Disputatae de Veritate.

[aristotle-metaphysics-350BCE] Aristotle. Metaphysics.

[parker2003blink] Parker, Andrew. (2003). In the blink of an eye: how vision sparked the big bang of evolution.

[chalmers2023does] David J. Chalmers. (2023). Does Thought Require Sensory Grounding? From Pure Thinkers to Large Language Models. Proceedings and Addresses of the American Philosophical Association.

[piaget1952origins] Piaget, Jean, Cook, Margaret, others. (1952). The origins of intelligence in children.

[hoffmann2022training] Hoffmann, Jordan, Borgeaud, Sebastian, Mensch, Arthur, Buchatskaya, Elena, Cai, Trevor, Rutherford, Eliza, Casas, Diego de Las, Hendricks, Lisa Anne, Welbl, Johannes, Clark, Aidan, others. (2023). Training compute-optimal large language models. NeurIPS.

[brown2020language] Brown, Tom, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared D, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, others. (2020). Language models are few-shot learners. NeurIPS.

[laurenccon2024matters] Laurençon, Hugo, Tronchon, Léo, Cord, Matthieu, Sanh, Victor. (2024). What matters when building vision-language models?. arXiv preprint arXiv:2405.02246.

[girshick2014rich] Girshick, Ross, Donahue, Jeff, Darrell, Trevor, Malik, Jitendra. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR.

[mathew2022infographicvqa] Mathew, Minesh, Bagal, Viraj, Tito, Rubèn, Karatzas, Dimosthenis, Valveny, Ernest, Jawahar, CV. (2022). Infographicvqa. WACV.

[chen2024we] Chen, Lin, Li, Jinsong, Dong, Xiaoyi, Zhang, Pan, Zang, Yuhang, Chen, Zehui, Duan, Haodong, Wang, Jiaqi, Qiao, Yu, Lin, Dahua, others. (2024). Are We on the Right Way for Evaluating Large Vision-Language Models?. arXiv preprint arXiv:2403.20330.

[wu2024janus] Wu, Chengyue, Chen, Xiaokang, Wu, Zhiyu, Ma, Yiyang, Liu, Xingchao, Pan, Zizheng, Liu, Wen, Xie, Zhenda, Yu, Xingkai, Ruan, Chong, others. (2024). Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848.

[huh2024platonic] Huh, Minyoung, Cheung, Brian, Wang, Tongzhou, Isola, Phillip. (2024). The platonic representation hypothesis. ICML.

[yu2024representation] Yu, Sihyun, Kwak, Sangkyung, Jang, Huiwon, Jeong, Jongheon, Huang, Jonathan, Shin, Jinwoo, Xie, Saining. (2024). Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think. arXiv preprint arXiv:2410.06940.

[agrawal2024pixtral] Agrawal, Pravesh, Antoniak, Szymon, Hanna, Emma Bou, Chaplot, Devendra, Chudnovsky, Jessica, Garg, Saurabh, Gervet, Theophile, Ghosh, Soham, Héliou, Amélie, others. (2024). Pixtral 12B. arXiv preprint arXiv:2410.07073.

[lu2024unified] Lu, Jiasen, Clark, Christopher, Lee, Sangho, Zhang, Zichen, Khosla, Savya, Marten, Ryan, Hoiem, Derek, Kembhavi, Aniruddha. (2024). Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action. CVPR.

[aghajanyan2022cm3] Aghajanyan, Armen, Huang, Bernie, Ross, Candace, Karpukhin, Vladimir, Xu, Hu, Goyal, Naman, Okhonko, Dmytro, Joshi, Mandar, Ghosh, Gargi, Lewis, Mike, others. (2022). Cm3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520.

[lu2022unified] Lu, Jiasen, Clark, Christopher, Zellers, Rowan, Mottaghi, Roozbeh, Kembhavi, Aniruddha. (2022). Unified-io: A unified model for vision, language, and multi-modal tasks. ICLR.

[agrawal2018don] Agrawal, Aishwarya, Batra, Dhruv, Parikh, Devi, Kembhavi, Aniruddha. (2018). Don't just assume; look and answer: Overcoming priors for visual question answering. CVPR.

[chen2024sharegpt4video] Chen, Lin, Wei, Xilin, Li, Jinsong, Dong, Xiaoyi, Zhang, Pan, Zang, Yuhang, Chen, Zehui, Duan, Haodong, Lin, Bin, Tang, Zhenyu, others. (2024). Sharegpt4video: Improving video understanding and generation with better captions. NeurIPS.

[krojer2024learning] Krojer, Benno, Vattikonda, Dheeraj, Lara, Luis, Jampani, Varun, Portelance, Eva, Pal, Christopher, Reddy, Siva. (2024). Learning Action and Reasoning-Centric Image Editing from Videos and Simulations. NeurIPS.

[hessel2021clipscore] Hessel, Jack, Holtzman, Ari, Forbes, Maxwell, Bras, Ronan Le, Choi, Yejin. (2021). Clipscore: A reference-free evaluation metric for image captioning. EMNLP.

[brooks2023instructpix2pix] Brooks, Tim, Holynski, Aleksander, Efros, Alexei A. (2023). Instructpix2pix: Learning to follow image editing instructions. CVPR.

[goyal2017making] Goyal, Yash, Khot, Tejas, Summers-Stay, Douglas, Batra, Dhruv, Parikh, Devi. (2017). Making the v in vqa matter: Elevating the role of image understanding in visual question answering. CVPR.

[AllenZhu-icml2024-tutorial] Allen-Zhu, Zeyuan. (2024). ICML 2024 Tutorial: Physics of Language Models.

[YXLA2024-gsm1] Ye, Tian, Xu, Zicheng, Li, Yuanzhi, Allen-Zhu, Zeyuan. (2024). Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process. arXiv e-prints.

[majumdar2024openeqa] Majumdar, Arjun, Ajay, Anurag, Zhang, Xiaohan, Putta, Pranav, Yenamandra, Sriram, Henaff, Mikael, Silwal, Sneha, Mcvay, Paul, Maksymets, Oleksandr, Arnaud, Sergio, others. (2024). OpenEQA: Embodied Question Answering in the Era of Foundation Models. 2nd Workshop on Mobile Manipulation and Embodied Intelligence at ICRA 2024.

[minigemini] Li, Yanwei, Zhang, Yuechen, Wang, Chengyao, Zhong, Zhisheng, Chen, Yixin, Chu, Ruihang, Liu, Shaoteng, Jia, Jiaya. (2024). Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814.

[geirhos2020shortcut] Geirhos, Robert, Jacobsen, Jörn-Henrik, Michaelis, Claudio, Zemel, Richard, Brendel, Wieland, Bethge, Matthias, Wichmann, Felix A. (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence.

[wei2022chain] Wei, Jason, Wang, Xuezhi, Schuurmans, Dale, Bosma, Maarten, Xia, Fei, Chi, Ed, Le, Quoc V, Zhou, Denny, others. (2022). Chain-of-thought prompting elicits reasoning in large language models. NeurIPS.

[chen2025vugen] Chen, Xiangyi, Vallaeys, Théophane, others. (2025). VUGEN: Visual Understanding priors for GENeration. arXiv preprint arXiv:2510.06529.

[Geiger2012CVPR] Andreas Geiger, Philip Lenz, Raquel Urtasun. (2012). Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. CVPR.

[caesar2020nuscenes] Caesar, Holger, Bankiti, Varun, Lang, Alex H, Vora, Sourabh, Liong, Venice Erin, Xu, Qiang, Krishnan, Anush, Pan, Yu, Baldan, Giancarlo, Beijbom, Oscar. (2020). nuscenes: A multimodal dataset for autonomous driving. CVPR.

[song2015sun] Song, Shuran, Lichtenberg, Samuel P, Xiao, Jianxiong. (2015). Sun rgb-d: A rgb-d scene understanding benchmark suite. CVPR.

[dehghan2021arkitscenes] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, Elad Shulman. (2021). ARKitScenes: A Diverse Real-World Dataset for 3D Indoor Scene Understanding Using Mobile RGB-D Data. NeurIPS.

[hypersim] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, Joshua M. Susskind. (2021). Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding. ICCV.

[objectron2021] Ahmadyan, Adel, Zhang, Liangkai, Ablavatski, Artsiom, Wei, Jianing, Grundmann, Matthias. (2021). Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations. CVPR.

[wang2024qwen2] Wang, Peng, Bai, Shuai, Tan, Sinan, Wang, Shijie, Fan, Zhihao, Bai, Jinze, Chen, Keqin, Liu, Xuejing, Wang, Jialin, Ge, Wenbin, others. (2024). Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191.

[li2024llava] Li, Bo, Zhang, Yuanhan, Guo, Dong, Zhang, Renrui, Li, Feng, Zhang, Hao, Zhang, Kaichen, Li, Yanwei, Liu, Ziwei, Li, Chunyuan. (2024). Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326.

[tong2024cambrian] Tong, Shengbang, Brown, Ellis, Wu, Penghao, Woo, Sanghyun, Middepogu, Manoj, Akula, Sai Charitha, Yang, Jihan, Yang, Shusheng, Iyer, Adithya, Pan, Xichen, others. (2024). Cambrian-1: A fully open, vision-centric exploration of multimodal llms. NeurIPS.

[li2023blip] Li, Junnan, Li, Dongxu, Savarese, Silvio, Hoi, Steven. (2023). Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ICML.

[zhou2024transfusion] Zhou, Chunting, Yu, Lili, Babu, Arun, Tirumala, Kushal, Yasunaga, Michihiro, Shamis, Leonid, Kahn, Jacob, Ma, Xuezhe, Zettlemoyer, Luke, Levy, Omer. (2025). Transfusion: Predict the next token and diffuse images with one multi-modal model. ICLR.

[wu2024vila] Wu, Yecheng, Zhang, Zhuoyang, Chen, Junyu, Tang, Haotian, Li, Dacheng, Fang, Yunhao, Zhu, Ligeng, Xie, Enze, Yin, Hongxu, Yi, Li, others. (2024). Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429.

[xie2024show] Xie, Jinheng, Mao, Weijia, Bai, Zechen, Zhang, David Junhao, Wang, Weihao, Lin, Kevin Qinghong, Gu, Yuchao, Chen, Zhijie, Yang, Zhenheng, Shou, Mike Zheng. (2024). Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528.

[baddeley1992working] Baddeley, Alan. (1992). Working memory. Science.

[amit2017asymmetrical] Amit, Elinor, Hoeflin, Caitlyn, Hamzah, Nada, Fedorenko, Evelina. (2017). An asymmetrical relationship between verbal and visual thinking: Converging evidence from behavior and fMRI. NeuroImage.

[paivio1990mental] Paivio, Allan. (1990). Mental representations: A dual coding approach.

[ganis2004brain] Ganis, Giorgio, Thompson, William L, Kosslyn, Stephen M. (2004). Brain areas underlying visual mental imagery and visual perception: an fMRI study. Cognitive Brain Research.

[lecun2022path] LeCun, Yann. (2022). A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. Open Review.

[amit2009distance] Amit, Elinor, Algom, Daniel, Trope, Yaacov. (2009). Distance-dependent processing of pictures and words.. Journal of Experimental Psychology: General.

[amit2013use] Amit, Elinor, Wakslak, Cheryl, Trope, Yaacov. (2013). The use of visual and verbal means of communication across psychological distance. Personality and Social Psychology Bulletin.

[ormazabal2024reka] Ormazabal, Aitor, Zheng, Che, d'Autume, Cyprien de Masson, Yogatama, Dani, Fu, Deyu, Ong, Donovan, Chen, Eric, Lamprecht, Eugenie, Pham, Hai, Ong, Isaac, others. (2024). Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models. arXiv preprint arXiv:2404.12387.

[chowdhery2022palm] Chowdhery, Aakanksha, Narang, Sharan, Devlin, Jacob, Bosma, Maarten, Mishra, Gaurav, Roberts, Adam, Barham, Paul, Chung, Hyung Won, Sutton, Charles, Gehrmann, Sebastian, others. (2022). Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.

[liu2024world] Liu, Hao, Yan, Wilson, Zaharia, Matei, Abbeel, Pieter. (2024). World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268.

[tschannen2024image] Tschannen, Michael, Kumar, Manoj, Steiner, Andreas, Zhai, Xiaohua, Houlsby, Neil, Beyer, Lucas. (2024). Image captioners are scalable vision learners too. NeurIPS.

[fini2024multimodal] Fini, Enrico, Shukor, Mustafa, Li, Xiujun, Dufter, Philipp, Klein, Michal, Haldimann, David, Aitharaju, Sai, da Costa, Victor Guilherme Turrisi, Béthune, Louis, others. (2024). Multimodal autoregressive pre-training of large vision encoders. arXiv preprint arXiv:2411.14402.

[lecun1998mnist] LeCun, Yann. (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.

[wang2025scaling] Wang, Xiao, Alabdulmohsin, Ibrahim, Salz, Daniel, Li, Zhe, Rong, Keran, Zhai, Xiaohua. (2025). Scaling Pre-training to One Hundred Billion Data for Vision Language Models. arXiv preprint arXiv:2502.07617.

[doersch2015unsupervised] Doersch, Carl, Gupta, Abhinav, Efros, Alexei A. (2015). Unsupervised visual representation learning by context prediction. ICCV.

[misra2019self] Misra, Ishan, Van Der Maaten, Laurens. (2020). Self-supervised learning of pretext-invariant representations.. CVPR.

[garrido2022duality] Garrido, Quentin, Chen, Yubei, Bardes, Adrien, Najman, Laurent, Lecun, Yann. (2023). On the duality between contrastive and non-contrastive self-supervised learning. ICLR.

[chen2022bag] Chen, Yubei, Bardes, Adrien, Li, Zengyi, LeCun, Yann. (2022). Bag of image patch embedding behind the success of self-supervised learning. arXiv preprint arXiv:2206.08954.

[zhou2021ibot] Zhou, Jinghao, Wei, Chen, Wang, Huiyu, Shen, Wei, Xie, Cihang, Yuille, Alan, Kong, Tao. (2021). ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832.

[carreira2024scaling] Carreira, João, others. (2024). Scaling 4D Representations. arXiv preprint arXiv:2412.15212.

[wei2022masked] Wei, Chen, Fan, Haoqi, Xie, Saining, Wu, Chao-Yuan, Yuille, Alan, Feichtenhofer, Christoph. (2022). Masked feature prediction for self-supervised visual pre-training. CVPR.

[hendrycks2019natural] Hendrycks, Dan, Zhao, Kevin, Basart, Steven, Steinhardt, Jacob, Song, Dawn. (2021). Natural adversarial examples. CVPR.

[hendrycks2020many] Hendrycks, Dan, Basart, Steven, Mu, Norman, Kadavath, Saurav, Wang, Frank, Dorundo, Evan, Desai, Rahul, Zhu, Tyler, Parajuli, Samyak, Guo, Mike, others. (2021). The many faces of robustness: A critical analysis of out-of-distribution generalization. ICCV.

[bossard2014food] Bossard, Lukas, Guillaumin, Matthieu, Van Gool, Luc. (2014). Food-101--mining discriminative components with random forests. ECCV.

[wang2015unsupervised] Wang, Xiaolong, Gupta, Abhinav. (2015). Unsupervised learning of visual representations using videos. ICCV.

[cordts2016cityscapes] Cordts, Marius, Omran, Mohamed, Ramos, Sebastian, Rehfeld, Timo, Enzweiler, Markus, Benenson, Rodrigo, Franke, Uwe, Roth, Stefan, Schiele, Bernt. (2016). The cityscapes dataset for semantic urban scene understanding. CVPR.

[everingham2010pascal] Everingham, Mark, Van Gool, Luc, Williams, Christopher KI, Winn, John, Zisserman, Andrew. (2010). The pascal visual object classes (voc) challenge. IJCV.

[shi2024we] Shi, Baifeng, Wu, Ziyang, Mao, Maolin, Wang, Xin, Darrell, Trevor. (2024). When do we not need larger vision models?. ECCV.

[geiger2013vision] Geiger, Andreas, Lenz, Philip, Stiller, Christoph, Urtasun, Raquel. (2013). Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research.

[mo2025connecting] Mo, Shentong, Tong, Peter. (2024). Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning. NeurIPS.

[chen2020improved] Chen, Xinlei, Fan, Haoqi, Girshick, Ross, He, Kaiming. (2020). Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.

[tao2024what] Muzi Tao, Saining Xie. (2024). What Does a Visual Formal Analysis of the World's 500 Most Famous Paintings Tell Us About Multimodal LLMs. The Second Tiny Papers Track at ICLR 2024.

[shi2024eagle] Shi, Min, Liu, Fuxiao, Wang, Shihao, Liao, Shijia, Radhakrishnan, Subhashree, Huang, De-An, Yin, Hongxu, Sapra, Karan, Yacoob, Yaser, Shi, Humphrey, others. (2024). Eagle: Exploring the design space for multimodal llms with mixture of encoders. arXiv preprint arXiv:2408.15998.

[soomro2012ucf101] Soomro, Khurram, Zamir, Amir Roshan, Shah, Mubarak. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.

[xu2024altogether] Xu, Hu, Huang, Po-Yao, Tan, Xiaoqing Ellen, Yeh, Ching-Feng, Kahn, Jacob, Jou, Christine, Ghosh, Gargi, Levy, Omer, Zettlemoyer, Luke, Yih, Wen-tau, others. (2024). Altogether: Image Captioning via Re-aligning Alt-text. arXiv preprint arXiv:2410.17251.

[beyer2024paligemma] Beyer, Lucas, Steiner, Andreas, Pinto, André Susano, others. (2024). PaliGemma: A versatile 3B VLM for transfer. arXiv preprint arXiv:2407.07726.

[tong2024metamorph] Tong, Shengbang, Fan, David, Zhu, Jiachen, Xiong, Yunyang, Chen, Xinlei, Sinha, Koustuv, Rabbat, Michael, LeCun, Yann, Xie, Saining, Liu, Zhuang. (2024). Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164.

[wang2024information] Wang, Chenyu, Gupta, Sharut, Zhang, Xinyi, Tonekaboni, Sana, Jegelka, Stefanie, Jaakkola, Tommi, Uhler, Caroline. (2024). An Information Criterion for Controlled Disentanglement of Multimodal Data. arXiv preprint arXiv:2410.23996.

[srinivasan2021wit] Srinivasan, Krishna, Raman, Karthik, Chen, Jiecao, Bendersky, Michael, Najork, Marc. (2021). Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning. Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval.

[noroozi2016unsupervised] Noroozi, Mehdi, Favaro, Paolo. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. ECCV.

[zhang2016colorful] Zhang, Richard, Isola, Phillip, Efros, Alexei A. (2016). Colorful image colorization. ECCV.

[gidaris2018unsupervised] Gidaris, Spyros, Singh, Praveer, Komodakis, Nikos. (2018). Unsupervised representation learning by predicting image rotations. ICLR.

[caron2021emerging] Caron, Mathilde, Touvron, Hugo, Misra, Ishan, Jégou, Hervé, Mairal, Julien, Bojanowski, Piotr, Joulin, Armand. (2021). Emerging properties in self-supervised vision transformers. ICCV.

[he2016resnet] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Deep Residual Learning for Image Recognition. CVPR.

[xie2016resnext] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He. (2016). Aggregated Residual Transformations for Deep Neural Networks. arXiv preprint arXiv:1611.05431.

[krizhevsky2009learning] Krizhevsky, Alex, Hinton, Geoffrey, others. (2009). Learning multiple layers of features from tiny images.

[goyal2019scaling] Goyal, Priya, Mahajan, Dhruv, Gupta, Abhinav, Misra, Ishan. (2019). Scaling and benchmarking self-supervised visual representation learning. ICCV.

[thomee2016yfcc100m] Thomee, Bart, Shamma, David A, Friedland, Gerald, Elizalde, Benjamin, Ni, Karl, Poland, Douglas, Borth, Damian, Li, Li-Jia. (2016). Yfcc100m: The new data in multimedia research. Communications of the ACM.

[zhai2022scaling] Zhai, Xiaohua, Kolesnikov, Alexander, Houlsby, Neil, Beyer, Lucas. (2022). Scaling vision transformers. CVPR.

[allal2025smollm2] Allal, Loubna Ben, Lozhkov, Anton, Bakouch, Elie, Blázquez, Gabriel Martín, others. (2025). SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model. arXiv preprint arXiv:2502.02737.

[kaplan2020scaling] Kaplan, Jared, McCandlish, Sam, Henighan, Tom, Brown, Tom B, Chess, Benjamin, Child, Rewon, Gray, Scott, Radford, Alec, Wu, Jeffrey, Amodei, Dario. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

[dehghani2023scaling] Dehghani, Mostafa, Djolonga, Josip, Mustafa, Basil, Padlewski, Piotr, Heek, Jonathan, Gilmer, Justin, Steiner, Andreas Peter, Caron, Mathilde, Geirhos, Robert, Alabdulmohsin, Ibrahim, others. (2023). Scaling vision transformers to 22 billion parameters. ICML.

[grill2020bootstrap] Grill, Jean-Bastien, Strub, Florian, Altché, Florent, others. (2020). Bootstrap your own latent: A new approach to self-supervised learning. NeurIPS.

[chen2021exploring] Chen, Xinlei, He, Kaiming. (2021). Exploring simple siamese representation learning. CVPR.

[naeem2024silc] Naeem, Muhammad Ferjad, Xian, Yongqin, Zhai, Xiaohua, Hoyer, Lukas, Van Gool, Luc, Tombari, Federico. (2024). Silc: Improving vision language pretraining with self-distillation. ECCV.

[singh2023effectiveness] Singh, Mannat, Duval, Quentin, Alwala, Kalyan Vasudev, Fan, Haoqi, Aggarwal, Vaibhav, Adcock, Aaron, Joulin, Armand, Dollár, Piotr, others. (2023). The effectiveness of MAE pre-pretraining for billion-scale pretraining. ICCV.

[silberman2012indoor] Silberman, Nathan, Hoiem, Derek, Kohli, Pushmeet, Fergus, Rob. (2012). Indoor segmentation and support inference from rgbd images. ECCV.

[sun2024eva] Sun, Quan, Wang, Jinsheng, Yu, Qiying, Cui, Yufeng, Zhang, Fan, Zhang, Xiaosong, Wang, Xinlong. (2024). Eva-clip-18b: Scaling clip to 18 billion parameters. arXiv preprint arXiv:2402.04252.

[chen2024internvl] Chen, Zhe, Wu, Jiannan, Wang, Wenhai, Su, Weijie, Chen, Guo, Xing, Sen, Zhong, Muyan, Zhang, Qinglong, Zhu, Xizhou, Lu, Lewei, others. (2024). Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. CVPR.

[fu2023mme] Fu, Chaoyou, Chen, Peixian, Shen, Yunhang, Qin, Yulei, Zhang, Mengdan, Lin, Xu, Qiu, Zhenyu, Lin, Wei, Yang, Jinrui, Zheng, Xiawu, others. (2023). MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394.

[bai2024sequential] Bai, Yutong, Geng, Xinyang, Mangalam, Karttikeya, Bar, Amir, Yuille, Alan L, Darrell, Trevor, Malik, Jitendra, Efros, Alexei A. (2024). Sequential modeling enables scalable learning for large vision models. CVPR.

[wei2021finetuned] Wei, Jason, Bosma, Maarten, Zhao, Vincent Y, Guu, Kelvin, Yu, Adams Wei, Lester, Brian, Du, Nan, Dai, Andrew M, Le, Quoc V. (2022). Finetuned language models are zero-shot learners. ICLR.

[bordes2022high] Florian Bordes, Randall Balestriero, Pascal Vincent. (2022). High Fidelity Visualization of What Your Self-Supervised Representation Knows About. TMLR.

[Wadekar2024-bs] Wadekar, Shakti N, Chaurasia, Abhishek, Chadha, Aman, Culurciello, Eugenio. (2024). The evolution of multimodal model architectures. arXiv preprint.

[zheng2025diffusion] Zheng, Boyang, Ma, Nanye, Tong, Shengbang, Xie, Saining. (2025). Diffusion Transformers with Representation Autoencoders. arXiv preprint arXiv:2510.11690.

[tschannen2025siglip] Tschannen, Michael, Gritsenko, Alexey, Wang, Xiao, Naeem, Muhammad Ferjad, Alabdulmohsin, Ibrahim, Parthasarathy, Nikhil, Evans, Talfan, Beyer, Lucas, Xia, Ye, Mustafa, Basil, others. (2025). Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786.

[StyleGAN-T] Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, Timo Aila. (2023). StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis. ICML.

[DINO] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin. (2021). Emerging Properties in Self-Supervised Vision Transformers. ICCV.

[flux] Black Forest Labs. (2024). FLUX.

[VQGAN] Patrick Esser, Robin Rombach, Björn Ommer. (2021). Taming Transformers for High-Resolution Image Synthesis. CVPR.

[DiffAug] Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, Song Han. (2020). Differentiable Augmentation for Data-Efficient GAN Training. NeurIPS.

[SD3] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, Robin Rombach. (2024). Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. ICML.

[LDM] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR.

[MassiveActivations] Mingjie Sun, Xinlei Chen, J. Zico Kolter, Zhuang Liu. (2024). Massive Activations in Large Language Models.

[Tang_2025_CVPR] Tang, Bingda, Zheng, Boyang, Paul, Sayak, Xie, Saining. (2025). Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis. CVPR.

[sun2023journeydb] Sun, Keqiang, Pan, Junting, Ge, Yuying, Li, Hao, Duan, Haodong, Wu, Xiaoshi, Zhang, Renrui, Zhou, Aojun, Qin, Zipeng, Wang, Yi, others. (2023). Journeydb: A benchmark for generative image understanding. NeurIPS.

[deng2025emerging] Deng, Chaorui, Zhu, Deyao, Li, Kunchang, Gou, Chenhui, Li, Feng, Wang, Zeyu, Zhong, Shu, Yu, Weihao, Nie, Xiaonan, Song, Ziang, others. (2025). Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683.

[dai2023emu] Xiaoliang Dai, et al. (2023). Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack. arXiv preprint.

[mu2021slip] Mu, Norman, Kirillov, Alexander, Wagner, David A, Xie, Saining. (2021). SLIP: Self-supervision meets language-image pre-training. arXiv preprint arXiv:2112.12750.

[bai2025qwen2] Bai, Shuai, Chen, Keqin, Liu, Xuejing, Wang, Jialin, Ge, Wenbin, Song, Sibo, Dang, Kai, Wang, Peng, Wang, Shijie, Tang, Jun, others. (2025). Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.

[qwen2024qwen2] Qwen Team, Yang, An, Yang, Baosong, Zhang, Beichen, Hui, Binyuan, Zheng, Bo, Yu, Bowen, Li, Chengpeng, Liu, Dayiheng, Huang, Fei, Wei, Haoran, others. (2024). Qwen2.5 technical report. arXiv preprint.

[huang2025ming] Huang, Ziyuan, Zheng, DanDan, Zou, Cheng, Liu, Rui, Wang, Xiaolong, Ji, Kaixiang, Chai, Weilong, Sun, Jianxin, Wang, Libin, Lv, Yongjie, others. (2025). Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer. arXiv preprint arXiv:2510.06590.

[cao2025hunyuanimage] Cao, Siyu, Chen, Hangting, Chen, Peng, Cheng, Yiji, Cui, Yutao, Deng, Xinchi, Dong, Ying, Gong, Kipper, Gu, Tianpeng, Gu, Xiusen, others. (2025). Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951.

[podell2023sdxl] Dustin Podell, et al. (2023). SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis.

[esser2024scaling] Esser, Patrick, Kulal, Sumith, Blattmann, Andreas, Entezari, Rahim, M{. (2024). Scaling rectified flow transformers for high-resolution image synthesis. ICML.

[li2024playground] Li, Daiqing, Kamko, Aleks, Akhgari, Ehsan, Sabet, Ali, Xu, Linmiao, Doshi, Suhail. (2024). Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245.

[wendlerc2024renderedtext] Wendler, C.. (2024). RenderedText.

[wan2025wan] Wan, Team, Wang, Ang, Ai, Baole, Wen, Bin, Mao, Chaojie, Xie, Chen-Wei, Chen, Di, Yu, Feiwu, Zhao, Haiming, Yang, Jianxiao, others. (2025). Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314.

[wu2025qwen] Wu, Chenfei, Li, Jiahao, Zhou, Jingren, Lin, Junyang, Gao, Kaiyuan, Yan, Kun, Yin, Sheng-ming, Bai, Shuai, Xu, Xiao, Chen, Yilei, others. (2025). Qwen-image technical report. arXiv preprint arXiv:2508.02324.

[fan2025unified] Fan, Lijie, Tang, Luming, Qin, Siyang, Li, Tianhong, Yang, Xuan, Qiao, Siyuan, Steiner, Andreas, Sun, Chen, Li, Yuanzhen, Zhu, Tao, others. (2025). Unified autoregressive visual generation and understanding with continuous tokens. arXiv preprint arXiv:2503.13436.

[hu2024ella] Hu, Xiwei, Wang, Rui, Fang, Yixiao, Fu, Bin, Cheng, Pei, Yu, Gang. (2024). Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135.

[Egan_Dalle3_1_Million_2024] Egan, Ben, Redden, Alex, XWAVE. (2024). Dalle3 1 Million+ High Quality Captions.

[ghosh2023geneval] Ghosh, Dhruba, Hajishirzi, Hannaneh, Schmidt, Ludwig. (2023). Geneval: An object-focused framework for evaluating text-to-image alignment. NeurIPS.

[lgt] Jingfeng Yao, Bin Yang, Xinggang Wang. (2025). Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models. CVPR.

[changpinyo2021conceptual] Changpinyo, Soravit, Sharma, Piyush, Ding, Nan, Soricut, Radu. (2021). Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. CVPR.

[GEmbed] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, Ren Ng. (2020). Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains.

[CFG] Jonathan Ho, Tim Salimans. (2022). Classifier-Free Diffusion Guidance.

[AG] Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, Samuli Laine. (2025). Guiding a Diffusion Model with a Bad Version of Itself. NeurIPS.

[CFGinterval] Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, Jaakko Lehtinen. (2024). Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models. NeurIPS.

[maskgit] Chang, Huiwen, Zhang, Han, Jiang, Lu, Liu, Ce, Freeman, William T. (2022). Maskgit: Masked generative image transformer. CVPR.

[llama] Sun, Peize, Jiang, Yi, Chen, Shoufa, Zhang, Shilong, Peng, Bingyue, Luo, Ping, Yuan, Zehuan. (2024). Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525.

[magvit] Yu, Lijun, Lezama, José, others. (2024). Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation. ICLR.

[mar] Li, Tianhong, Tian, Yonglong, Li, He, Deng, Mingyang, He, Kaiming. (2024). Autoregressive image generation without vector quantization. NeurIPS.

[MDTv2] Gao, Shanghua, Zhou, Pan, Cheng, Ming-Ming, Yan, Shuicheng. (2023). Mdtv2: Masked diffusion transformer is a strong image synthesizer. arXiv preprint arXiv:2303.14389.

[MDT] Gao, Shanghua, Zhou, Pan, Cheng, Ming-Ming, Yan, Shuicheng. (2023). Masked diffusion transformer is a strong image synthesizer. ICCV.

[repa-e] Leng, Xingjian, Singh, Jaskirat, Hou, Yunzhong, Xing, Zhenchang, Xie, Saining, Zheng, Liang. (2025). REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers. ICCV.

[fasterdit] Yao, Jingfeng, Wang, Cheng, Liu, Wenyu, Wang, Xinggang. (2024). Fasterdit: Towards faster diffusion transformers training without architecture modification. NeurIPS.

[sit] Ma, Nanye, Goldstein, Mark, Albergo, Michael S, Boffi, Nicholas M, Vanden-Eijnden, Eric, Xie, Saining. (2024). Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. ECCV.

[maskdit] Zheng, Hongkai, Nie, Weili, Vahdat, Arash, Anandkumar, Anima. (2023). Fast training of diffusion models with masked transformers. TMLR.

[repa] Yu, Sihyun, Kwak, Sangkyung, Jang, Huiwon, Jeong, Jongheon, Huang, Jonathan, Shin, Jinwoo, Xie, Saining. (2025). Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think. ICLR.

[VAR] Tian, Keyu, Jiang, Yi, Yuan, Zehuan, Peng, Bingyue, Wang, Liwei. (2024). Visual autoregressive modeling: Scalable image generation via next-scale prediction. NeurIPS.

[dit] Peebles, William, Xie, Saining. (2023). Scalable diffusion models with transformers. ICCV.

[magvitv2] Yu, Lijun, Lezama, José, others. (2024). Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation. ICLR.

[siglip2] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai. (2025). SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features.

[MAE] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. (2021). Masked Autoencoders Are Scalable Vision Learners. CVPR.

[Dinov2] Oquab, Maxime, Darcet, Timothée, others. (2023). DINOv2: Learning robust visual features without supervision. TMLR.

[Dinov2wReg] Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski. (2025). Vision Transformers Need Registers. ICLR.

[ddt] Shuai Wang, Zhi Tian, Weilin Huang, Limin Wang. (2025). DDT: Decoupled Diffusion Transformer.

[ImprovDiffus] Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, Aliaksandr Siarohin. (2025). Improving the Diffusability of Autoencoders. ICML.

[ViTok] Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, Xinlei Chen. (2025). Learnings from Scaling Visual Tokenizers for Reconstruction and Generation.

[sCM] Cheng Lu, Yang Song. (2025). Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models.

[flashattn] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.

[vaeprobe] Wu, Size, Zhang, Wenwei, Xu, Lumin, Jin, Sheng, Wu, Zhonghua, Tao, Qingyi, Liu, Wentao, Li, Wei, Loy, Chen Change. (2025). Harmonizing visual representations for unified multimodal understanding and generation. arXiv preprint arXiv:2503.21979.

[eqvae] Kouzelis, Theodoros, Kakogeorgiou, Ioannis, Gidaris, Spyros, Komodakis, Nikos. (2025). Eq-vae: Equivariance regularized latent space for improved generative image modeling. ICML.

[imagenet] Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, others. (2015). Imagenet large scale visual recognition challenge. International journal of computer vision.

[score_sde] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. ICLR.

[fid] Heusel, Martin, Ramsauer, Hubert, Unterthiner, Thomas, Nessler, Bernhard, Hochreiter, Sepp. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. NeurIPS.

[sfid] Nash, Charlie, Menick, Jacob, Dieleman, Sander, Battaglia, Peter W. (2021). Generating Images with Sparse Representations. ICML.

[is] Salimans, Tim, Goodfellow, Ian, Zaremba, Wojciech, Cheung, Vicki, Radford, Alec, Chen, Xi. (2016). Improved Techniques for Training GANs. NeurIPS.

[prec_and_rec] Kynkäänniemi, Tuomas, Karras, Tero, Laine, Samuli, Lehtinen, Jaakko, Aila, Timo. (2019). Improved Precision and Recall Metric for Assessing Generative Models. NeurIPS.

[inceptionv3] Szegedy, Christian, Vanhoucke, Vincent, Ioffe, Sergey, Shlens, Jon, Wojna, Zbigniew. (2016). Rethinking the Inception architecture for computer vision. CVPR.

[adm] Dhariwal, Prafulla, Nichol, Alexander. (2021). Diffusion models beat GANs on image synthesis. NeurIPS.

[vdm++] Kingma, Diederik, Gao, Ruiqi. (2024). Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation.

[simplediffusion] Hoogeboom, Emiel, Heek, Jonathan, Salimans, Tim. (2023). Simple Diffusion: End-to-End Diffusion for High Resolution Images. ICML.

[cdm] Ho, Jonathan, Saharia, Chitwan, Chan, William, Fleet, David J, Norouzi, Mohammad, Salimans, Tim. (2022). Cascaded Diffusion Models for high fidelity image generation. Journal of Machine Learning Research.

[uvit] Bao, Fan, Nie, Shen, Xue, Kaiwen, Cao, Yue, Li, Chongxuan, Su, Hang, Zhu, Jun. (2023). All are Worth Words: A ViT Backbone for Diffusion Models. CVPR.

[diffit] Hatamizadeh, Ali, Song, Jiaming, Liu, Guilin, Kautz, Jan, Vahdat, Arash. (2024). DiffiT: Diffusion Vision Transformers for Image Generation. ECCV.

[sddit] Zhu, Rui, Pan, Yingwei, Li, Yehao, Yao, Ting, Sun, Zhenglong, Mei, Tao, Chen, Chang Wen. (2024). SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer. CVPR.

[progressivegan] Karras, Tero, Aila, Timo, Laine, Samuli, Lehtinen, Jaakko. (2018). Progressive Growing of GANs for Improved Quality, Stability, and Variation. ICLR.

[maetok] Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, Bhiksha Raj. (2025). Masked Autoencoders Are Effective Tokenizers for Diffusion Models. ICML.

[dcae1p5] Junyu Chen, Dongyun Zou, Wenkun He, Junsong Chen, Enze Xie, Song Han, Han Cai. (2025). DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space.

[ldetok] Jiawei Yang, Tianhong Li, Lijie Fan, Yonglong Tian, Yue Wang. (2025). Latent Denoising Makes Good Visual Tokenizers.

[titok] Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen. (2024). An Image is Worth 32 Tokens for Reconstruction and Generation. NeurIPS.

[dcae] Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, Song Han. (2025). Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models. ICLR.

[pgv3] Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Chase Lambert, Joao Souza, Suhail Doshi, Daiqing Li. (2024). Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models.

[moco] He, Kaiming, Fan, Haoqi, Wu, Yuxin, Xie, Saining, Girshick, Ross. (2019). Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722.

[dinov1] Caron, Mathilde, Touvron, Hugo, Misra, Ishan, Jégou, Hervé, Mairal, Julien, Bojanowski, Piotr, Joulin, Armand. (2021). Emerging properties in self-supervised vision transformers. ICCV.

[simclr] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. ICML.

[reg] Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, Ming-Ming Cheng, Xiang Li. (2025). Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think.

[redi] Theodoros Kouzelis, Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis. (2025). Boosting Generative Image Modeling via Joint Image-Feature Synthesis. NeurIPS.

[VAE] Kingma, Diederik P, Welling, Max. (2014). Auto-encoding variational bayes. ICLR.

[DAE] Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, Manzagol, Pierre-Antoine. (2008). Extracting and composing robust features with denoising autoencoders. ICML.

[unilip] Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, Liwei Wang. (2025). UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing.

[stochastic] Albergo, Michael S, Boffi, Nicholas M, Vanden-Eijnden, Eric. (2023). Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797.

[edm] Karras, Tero, Aittala, Miika, Aila, Timo, Laine, Samuli. (2022). Elucidating the design space of diffusion-based generative models. NeurIPS.

[ddpm] Ho, Jonathan, Jain, Ajay, Abbeel, Pieter. (2020). Denoising diffusion probabilistic models. NeurIPS.

[fm] Lipman, Yaron, Chen, Ricky TQ, Ben-Hamu, Heli, Nickel, Maximilian, Le, Matt. (2023). Flow matching for generative modeling. ICLR.

[rf] Liu, Xingchao, Gong, Chengyue, Liu, Qiang. (2023). Flow straight and fast: Learning to generate and transfer data with rectified flow. ICLR.

[sbdm] Song, Yang, Sohl-Dickstein, Jascha, Kingma, Diederik P, Kumar, Abhishek, Ermon, Stefano, Poole, Ben. (2020). Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.

[InternViT] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenwen Qu, Peng Sun, Penglong Jiao, Han Lv, Lijun Wu, Kaipeng Zhang, Huipeng Deng, Jiaye Ge, Kai Chen, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang. (2025). InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models.

[blip3o] Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu. (2025). BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset.

[metaquery] Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, Ji Hou, Saining Xie. (2025). Transfer between Modalities with MetaQueries.

[metamorph] Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, Zhuang Liu. (2025). MetaMorph: Multimodal Understanding and Generation via Instruction Tuning. ICCV.

[emu2] Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, Xinlong Wang. (2024). Generative Multimodal Models are In-Context Learners. CVPR.

[vqvae2] Ali Razavi, Aaron van den Oord, Oriol Vinyals. (2019). Generating Diverse High-Fidelity Images with VQ-VAE-2. NeurIPS.

[iddpm] Alex Nichol, Prafulla Dhariwal. (2021). Improved Denoising Diffusion Probabilistic Models. ICML.

[vqdiffusion] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, Baining Guo. (2022). Vector Quantized Diffusion Model for Text-to-Image Synthesis.

[abuduweili2024enhancing] Abuduweili, Abulikemu, Yuan, Chenyang, Liu, Changliu, Permenter, Frank. (2024). Enhancing Sample Generation of Diffusion Models using Noise Level Correction. TMLR.

[dinh2016density] Dinh, Laurent, Sohl-Dickstein, Jascha, Bengio, Samy. (2017). Density estimation using real nvp. ICLR.

[ho2019flow++] Ho, Jonathan, Chen, Xi, Srinivas, Aravind, Duan, Yan, Abbeel, Pieter. (2019). Flow++: Improving flow-based generative models with variational dequantization and architecture design. ICML.

[zhai2024normalizing] Zhai, Shuangfei, Zhang, Ruixiang, Nakkiran, Preetum, Berthelot, David, Gu, Jiatao, Zheng, Huangjie, Chen, Tianrong, Bautista, Miguel Angel, Jaitly, Navdeep, Susskind, Josh. (2025). Normalizing flows are capable generative models. ICML.

[teng2023relay] Teng, Jiayan, Zheng, Wendi, Ding, Ming, Hong, Wenyi, Wangni, Jianqiao, Yang, Zhuoyi, Tang, Jie. (2023). Relay diffusion: Unifying diffusion process across resolutions for image synthesis. ICLR.

[chen2023importance] Chen, Ting. (2023). On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972.

[rin] Jabri, Allan, Fleet, David, Chen, Ting. (2023). Scalable adaptive computation for iterative generation. ICML.

[sid2] Hoogeboom, Emiel, Mensink, Thomas, Heek, Jonathan, Lamerigts, Kay, Gao, Ruiqi, Salimans, Tim. (2025). Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion. CVPR.

[pixelflow] Chen, Shoufa, Ge, Chongjian, Zhang, Shilong, Sun, Peize, Luo, Ping. (2025). PixelFlow: Pixel-Space Generative Models with Flow. arXiv preprint arXiv:2504.07963.

[pixnerd] Wang, Shuai, Gao, Ziteng, Zhu, Chenhui, Huang, Weilin, Wang, Limin. (2025). PixNerd: Pixel Neural Field Diffusion. arXiv preprint arXiv:2507.23268.

[vit] Dosovitskiy, Alexey, Beyer, Lucas, Kolesnikov, Alexander, Weissenborn, Dirk, Zhai, Xiaohua, Unterthiner, Thomas, Dehghani, Mostafa, Minderer, Matthias, Heigold, Georg, Gelly, Sylvain, others. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. ICLR.

[edm2] Karras, Tero, Aittala, Miika, Lehtinen, Jaakko, Hellsten, Janne, Aila, Timo, Laine, Samuli. (2024). Analyzing and improving the training dynamics of diffusion models. CVPR.

[flowdcn] Wang, Shuai, Li, Zexian, Song, Tianhui, Li, Xubin, Ge, Tiezheng, Zheng, Bo, Wang, Limin. (2024). Flowdcn: Exploring dcn-like architectures for fast image generation with arbitrary resolution. NeurIPS.

[lpips] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, Oliver Wang. (2018). The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. CVPR.

[biggan] Andrew Brock, Jeff Donahue, Karen Simonyan. (2019). Large Scale GAN Training for High Fidelity Natural Image Synthesis. ICLR.

[autoencoder] Hinton, Geoffrey E, Salakhutdinov, Ruslan R. (2006). Reducing the dimensionality of data with neural networks. Science.

[vqvae] Oord, Aaron van den, Vinyals, Oriol, Kavukcuoglu, Koray. (2017). Neural discrete representation learning. NeurIPS.

[GAN] Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, Bengio, Yoshua. (2014). Generative adversarial nets. NeurIPS.

[ViT-VQGAN] Yu, Jiahui, Li, Xin, Koh, Jing Yu, Zhang, Han, Pang, Ruoming, Qin, James, Ku, Alexander, Xu, Yuanzhong, Baldridge, Jason, Wu, Yonghui. (2022). Vector-quantized image modeling with improved vqgan. ICLR.

[Efficient-VQGAN] Cao, Shiyue, Yin, Yueqin, Huang, Lianghua, Liu, Yu, Zhao, Xin, Zhao, Deli, Huang, Kaiqi. (2023). Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers. ICCV.

[rqvae] Lee, Doyup, Kim, Chiheon, Kim, Saehoon, Cho, Minsu, Han, Wook-Shin. (2022). Autoregressive image generation using residual quantization. CVPR.

[movq] Zheng, Chuanxia, Vuong, Tung-Long, Cai, Jianfei, Phung, Dinh. (2022). Movq: Modulating quantized vectors for high-fidelity image generation. NeurIPS.

[fsq] Fabian Mentzer, David Minnen, Eirikur Agustsson, Michael Tschannen. (2024). Finite Scalar Quantization: VQ-VAE Made Simple. ICLR.

[textok] Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, Liang-Chieh Chen. (2025). Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens.

[fluid] Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, Yonglong Tian. (2025). Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens. ICLR.

[ar1] Ramesh, Aditya, Dhariwal, Prafulla, Nichol, Alex, Chu, Casey, Chen, Mark. (2022). Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.

[ar2] Yu, Jiahui, Xu, Yuanzhong, Koh, Jing Yu, Luong, Thang, Baid, Gunjan, Wang, Zirui, Vasudevan, Vijay, Ku, Alexander, Yang, Yinfei, Ayan, Burcu Karagol, others. (2022). Scaling autoregressive models for content-rich text-to-image generation. TMLR.

[image-transformer] Parmar, Niki, Vaswani, Ashish, Uszkoreit, Jakob, Kaiser, Lukasz, Shazeer, Noam, Ku, Alexander, Tran, Dustin. (2018). Image transformer. ICML.

[gpt-pixel] Chen, Mark, Radford, Alec, Child, Rewon, Wu, Jeffrey, Jun, Heewoo, Luan, David, Sutskever, Ilya. (2020). Generative pretraining from pixels. ICML.

[mage] Li, Tianhong, Chang, Huiwen, Mishra, Shlok, Zhang, Han, Katabi, Dina, Krishnan, Dilip. (2023). Mage: Masked generative encoder to unify representation learning and image synthesis. CVPR.

[maskbit] Weber, Mark, Yu, Lijun, Yu, Qihang, Deng, Xueqing, Shen, Xiaohui, Cremers, Daniel, Chen, Liang-Chieh. (2024). MaskBit: Embedding-free Image Generation via Bit Tokens. arXiv preprint arXiv:2409.16211.

[bsq] Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl. (2025). Image and Video Tokenization with Binary Spherical Quantization. ICLR.

[gigatok] Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, Xihui Liu. (2025). GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation.

[CRT] Vivek Ramanujan, Kushal Tirumala, Armen Aghajanyan, Luke Zettlemoyer, Ali Farhadi. (2025). When Worse is Better: Navigating the compression-generation tradeoff in visual tokenization. NeurIPS.

[LARP] Hanyu Wang, Saksham Suri, Yixuan Ren, Hao Chen, Abhinav Shrivastava. (2025). LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior. ICLR.

[eps-vae] Long Zhao, Sanghyun Woo, Ziyu Wan, Yandong Li, Han Zhang, Boqing Gong, Hartwig Adam, Xuhui Jia, Ting Liu. (2025). Epsilon-VAE: Denoising as Visual Decoding.

[atoken] Jiasen Lu, Liangchen Song, Mingze Xu, Byeongjoo Ahn, Yanjun Wang, Chen Chen, Afshin Dehghan, Yinfei Yang. (2025). AToken: A Unified Tokenizer for Vision.

[luo2024task] Luo, Grace, Darrell, Trevor, Bar, Amir. (2024). Task Vectors are Cross-Modal. arXiv preprint arXiv:2410.22330.

[wan2024locca] Wan, Bo, Tschannen, Michael, Xian, Yongqin, Pavetic, Filip, Alabdulmohsin, Ibrahim, Wang, Xiao, Pinto, André. (2024). LocCa: Visual Pretraining with Location-aware Captioners. arXiv preprint arXiv:2403.19596.

[fan2025scaling] Fan, David, Tong, Shengbang, Zhu, Jiachen, Sinha, Koustuv, Liu, Zhuang, Chen, Xinlei, Rabbat, Michael, Ballas, Nicolas, LeCun, Yann, Bar, Amir, others. (2025). Scaling language-free visual representation learning. ICCV.

[thasarathan2025universal] Thasarathan, Harrish, Forsyth, Julian, Fel, Thomas, Kowal, Matthew, Derpanis, Konstantinos. (2025). Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment. arXiv preprint arXiv:2502.03714.

[ridnik2021imagenet] Ridnik, Tal, Ben-Baruch, Emanuel, Noy, Asaf, Zelnik-Manor, Lihi. (2021). Imagenet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972.

[balestriero2023cookbook] Balestriero, Randall, Ibrahim, Mark, Sobal, Vlad, Morcos, Ari, Shekhar, Shashank, Goldstein, Tom, Bordes, Florian, Bardes, Adrien, Mialon, Gregoire, Tian, Yuandong, others. (2023). A cookbook of self-supervised learning. arXiv preprint arXiv:2304.12210.

[Pathak_2016_CVPR] Pathak, Deepak, Krahenbuhl, Philipp, Donahue, Jeff, Darrell, Trevor, Efros, Alexei A.. (2016). Context Encoders: Feature Learning by Inpainting. CVPR.

[maninis2024tips] Maninis, Kevis-Kokitsi, Chen, Kaifeng, Ghosh, Soham, Karpur, Arjun, Chen, Koert, Xia, Ye, Cao, Bingyi, Salz, Daniel, Han, Guangxing, Dlabal, Jan, others. (2025). TIPS: Text-image pretraining with spatial awareness. ICLR.

[dito] Long Zhao, Sanghyun Woo, Ziyu Wan, Yandong Li, Han Zhang, Boqing Gong, Hartwig Adam, Xuhui Jia, Ting Liu. (2025). Epsilon-VAE: Denoising as Visual Decoding.

[quantize-dino] Yongxin Zhu, Bocheng Li, Hang Zhang, Xin Li, Linli Xu, Lidong Bing. (2024). Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective. NeurIPS.

[VFMTok] Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, Xiaojuan Qi. (2025). Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation.

[scalingvae] Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, Xinlei Chen. (2025). Learnings from Scaling Visual Tokenizers for Reconstruction and Generation. ICML.

[robustok] Kai Qiu, Xiang Li, Hao Chen, Jason Kuen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, Marios Savvides. (2025). Image Tokenizer Needs Post-Training.

[styleganxl] Axel Sauer, Katja Schwarz, Andreas Geiger. (2022). StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets. SIGGRAPH.

[ren2025beyond] Ren, Sucheng, Yu, Qihang, He, Ju, Shen, Xiaohui, Yuille, Alan, Chen, Liang-Chieh. (2025). Beyond next-token: Next-x prediction for autoregressive visual generation. arXiv preprint arXiv:2502.20388.

[rcg] Tianhong Li, Dina Katabi, Kaiming He. (2024). Return of Unconditional Generation: A Self-supervised Representation Generation Method. NeurIPS.

[vahdat2021score] Vahdat, Arash, Kreis, Karsten, Kautz, Jan. (2021). Score-based generative modeling in latent space. NeurIPS.

[song2025selective] Song, Kiwhan, Kim, Jaeyeon, Chen, Sitan, Du, Yilun, Kakade, Sham, Sitzmann, Vincent. (2025). Selective Underfitting in Diffusion Models. arXiv preprint arXiv:2510.01378.

[pope2021intrinsic] Pope, Phillip, Zhu, Chen, Abdelkader, Ahmed, Goldblum, Micah, Goldstein, Tom. (2021). The intrinsic dimension of images and its impact on learning. ICLR.

[xar] Ren, Sucheng, Yu, Qihang, He, Ju, Shen, Xiaohui, Yuille, Alan, Chen, Liang-Chieh. (2025). Beyond next-token: Next-x prediction for autoregressive visual generation. arXiv preprint arXiv:2502.20388.

[unet] Olaf Ronneberger, Philipp Fischer, Thomas Brox. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation.

[pixartalpha] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li. (2024). PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis. ICLR.

[lmfusion] Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, Lili Yu. (2025). LMFusion: Adapting Pretrained Language Models for Multimodal Generation.

[tang2025exploring] Tang, Bingda, Zheng, Boyang, Paul, Sayak, Xie, Saining. (2025). Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis. CVPR.

[jin2024unified] Jin, Yang, Xu, Kun, Chen, Liwei, Liao, Chao, Tan, Jianchao, Huang, Quzhe, Chen, Bin, Song, Chengru, Meng, Dai, Zhang, Di, others. (2024). Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization. ICLR.

[chen2025janus] Chen, Xiaokang, Wu, Zhiyu, Liu, Xingchao, Pan, Zizheng, Liu, Wen, Xie, Zhenda, Yu, Xingkai, Ruan, Chong. (2025). Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811.

[ge2024seed] Ge, Yuying, Zhao, Sijie, Zhu, Jinguo, Ge, Yixiao, Yi, Kun, Song, Lin, Li, Chen, Ding, Xiaohan, Shan, Ying. (2024). Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396.

[jiao2025unitoken] Jiao, Yang, Qiu, Haibo, Jie, Zequn, Chen, Shaoxiang, Chen, Jingjing, Ma, Lin, Jiang, Yu-Gang. (2025). Unitoken: Harmonizing multimodal understanding and generation through unified visual encoding. CVPR.

[chen2025blip3o] Chen, Jiuhai, Xue, Le, Xu, Zhiyang, Pan, Xichen, Yang, Shusheng, Qin, Can, Yan, An, Zhou, Honglu, Chen, Zeyuan, Huang, Lifu, others. (2025). BLIP3o-NEXT: Next Frontier of Native Image Generation. arXiv preprint arXiv:2510.15857.

[uniflow] Zhengrong Yue, Haiyu Zhang, Xiangyu Zeng, Boyu Chen, Chenting Wang, Shaobin Zhuang, Lu Dong, KunPeng Du, Yi Wang, Limin Wang, Yali Wang. (2025). UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation.

[vugen] Xiangyi Chen, Théophane Vallaeys, Maha Elbayad, John Nguyen, Jakob Verbeek. (2025). VUGEN: Visual Understanding priors for GENeration.

[ma2025inference] Ma, Nanye, Tong, Shangyuan, Jia, Haolin, Hu, Hexiang, Su, Yu-Chuan, Zhang, Mingda, Yang, Xuan, Li, Yandong, Jaakkola, Tommi, Jia, Xuhui, others. (2025). Inference-time scaling for diffusion models beyond scaling denoising steps. arXiv preprint arXiv:2501.09732.

[xie2025sana] Xie, Enze, Chen, Junsong, Zhao, Yuyang, Yu, Jincheng, Zhu, Ligeng, Wu, Chengyue, Lin, Yujun, Zhang, Zhekai, Li, Muyang, Chen, Junyu, others. (2025). Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. arXiv preprint arXiv:2501.18427.

[kang2025scalable] Kang, Zhewei, Zhao, Xuandong, Song, Dawn. (2025). Scalable best-of-n selection for large language models via self-certainty. arXiv preprint arXiv:2502.18581.

[gramloss] Leon A. Gatys, Alexander S. Ecker, Matthias Bethge. (2015). A Neural Algorithm of Artistic Style.

[svg-t2i] Minglei Shi, Haolin Wang, Borui Zhang, Wenzhao Zheng, Bohan Zeng, Ziyang Yuan, Xiaoshi Wu, Yuanxing Zhang, Huan Yang, Xintao Wang, Pengfei Wan, Kun Gai, Jie Zhou, Jiwen Lu. (2025). SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder.

[svg] Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, Jiwen Lu. (2025). Latent Diffusion Model without Variational Autoencoder.

[semvae] Yueming Pan, Ruoyu Feng, Qi Dai, Yuqi Wang, Wenfeng Lin, Mingyu Guo, Chong Luo, Nanning Zheng. (2025). Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion.

[vqrae] Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, Chun Yuan. (2025). VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction.

[VTP] Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang. (2025). Towards Scalable Pre-training of Visual Tokenizers for Generation.

[reglue] Giorgos Petsangourakis, Christos Sgouropoulos, Bill Psomas, Theodoros Giannakopoulos, Giorgos Sfikas, Ioannis Kakogeorgiou. (2025). REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion.

[bib170] Stability AI (2024) Stable diffusion 3.5. Cited by: §1.

[bib154] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren, C. Tan, S. Tan, J. Tu, P. Wang, S. Wang, W. Wang, S. Wu, B. Xu, J. Xu, A. Yang, H. Yang, J. Yang, S. Yang, Y. Yao, B. Yu, H. Yuan, Z. Yuan, J. Zhang, X. Zhang, Y. Zhang, Z. Zhang, C. Zhou, J. Zhou, X. Zhou, and T. Zhu (2023) Qwen technical report. arXiv preprint arXiv:2309.16609. Cited by: §1.

[bib325] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al. (2025) Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923. Cited by: §6.

[bib7] P. Barham, A. Chowdhery, J. Dean, S. Ghemawat, S. Hand, D. Hurt, M. Isard, H. Lim, R. Pang, S. Roy, et al. (2022) Pathways: asynchronous distributed dataflow for ml. Proceedings of Machine Learning and Systems 4, pp. 430–449. Cited by: Appendix A.

[bib328] S. Cao, H. Chen, P. Chen, Y. Cheng, Y. Cui, X. Deng, Y. Dong, K. Gong, T. Gu, X. Gu, et al. (2025) Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951. Cited by: §6.

[bib313] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin (2021) Emerging properties in self-supervised vision transformers. In ICCV, Cited by: §1.

[bib340] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut (2021) Conceptual 12m: pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, Cited by: §4.

[bib8] J. Chen, Z. Xu, X. Pan, Y. Hu, C. Qin, T. Goldstein, L. Huang, T. Zhou, S. Xie, S. Savarese, et al. (2025) Blip3-o: a family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568. Cited by: Appendix A.

[bib406] J. Chen, Z. Xu, X. Pan, Y. Hu, C. Qin, T. Goldstein, L. Huang, T. Zhou, S. Xie, S. Savarese, L. Xue, C. Xiong, and R. Xu (2025) BLIP3-o: a family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568. Cited by: §4, §4.2, §4.2, §6, §6.

[bib471] J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y. Wu, Z. Wang, J. Kwok, P. Luo, H. Lu, and Z. Li (2024) PixArt-α: fast training of diffusion transformer for photorealistic text-to-image synthesis. In ICLR, Cited by: §4.2.

[bib389] J. Chen, H. Cai, J. Chen, E. Xie, S. Yang, H. Tang, M. Li, Y. Lu, and S. Han (2025) Deep compression autoencoder for efficient high-resolution diffusion models. In ICLR, Cited by: §6.

[bib393] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In ICML, Cited by: §1.

[bib480] X. Chen, T. Vallaeys, M. Elbayad, J. Nguyen, and J. Verbeek (2025) VUGEN: visual understanding priors for generation. arXiv preprint arXiv:2510.06529. Cited by: §1, §6.

[bib475] X. Chen, Z. Wu, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, and C. Ruan (2025) Janus-pro: unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811. Cited by: §6.

[bib323] X. Dai et al. (2023) Emu: enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807. Cited by: §1, §4.2, §6.

[bib297] M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, J. Gilmer, A. P. Steiner, M. Caron, R. Geirhos, I. Alabdulmohsin, et al. (2023) Scaling vision transformers to 22 billion parameters. In ICML, Cited by: Appendix B.

[bib322] C. Deng, D. Zhu, K. Li, C. Gou, F. Li, Z. Wang, S. Zhong, W. Yu, X. Nie, Z. Song, et al. (2025) Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683. Cited by: §4, §5, §6.

[bib10] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §1, §2.

[bib377] P. Dhariwal and A. Nichol (2021) Diffusion models beat GANs on image synthesis. In NeurIPS, Cited by: §1.

[bib181] R. Dong, C. Han, Y. Peng, Z. Qi, Z. Ge, J. Yang, L. Zhao, J. Sun, H. Zhou, H. Wei, et al. (2024) Dreamllm: synergistic multimodal comprehension and creation. In ICLR, Cited by: §6.

[bib136] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2021) An image is worth 16x16 words: transformers for image recognition at scale. In ICLR, Cited by: §2.

[bib423] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2021) An image is worth 16x16 words: transformers for image recognition at scale. In ICLR, Cited by: Appendix A.

[bib488] S. Du, J. Guo, B. Li, S. Cui, Z. Xu, Y. Luo, Y. Wei, K. Gai, X. Wang, K. Wu, and C. Yuan (2025) VQRAE: representation quantization autoencoders for multimodal understanding, generation and reconstruction. arXiv preprint arXiv:2511.23386. Cited by: §6.

[bib330] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, et al. (2024) Scaling rectified flow transformers for high-resolution image synthesis. In ICML, Cited by: §4.1.

[bib317] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, K. Lacey, A. Goodwin, Y. Marek, and R. Rombach (2024) Scaling rectified flow transformers for high-resolution image synthesis. In ICML, Cited by: §3.2, §6, §6, §6.

[bib452] D. Fan, S. Tong, J. Zhu, K. Sinha, Z. Liu, X. Chen, M. Rabbat, N. Ballas, Y. LeCun, A. Bar, et al. (2025) Scaling language-free visual representation learning. In ICCV, Cited by: Appendix B, §2, §4.1, §4.1, §4.1.

[bib335] L. Fan, L. Tang, S. Qin, T. Li, X. Yang, S. Qiao, A. Steiner, C. Sun, Y. Li, T. Zhu, et al. (2025) Unified autoregressive visual generation and understanding with continuous tokens. arXiv preprint arXiv:2503.13436. Cited by: §4, §5.

[bib305] C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, Z. Qiu, W. Lin, J. Yang, X. Zheng, et al. (2023) MME: a comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394. Cited by: §5.

[bib484] L. A. Gatys, A. S. Ecker, and M. Bethge (2015) A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576. Cited by: §2.

[bib158] Y. Ge, Y. Ge, Z. Zeng, X. Wang, and Y. Shan (2023) Planting a seed of vision in large language model. arXiv preprint arXiv:2307.08041. Cited by: §5.

[bib476] Y. Ge, S. Zhao, J. Zhu, Y. Ge, K. Yi, L. Song, C. Li, X. Ding, and Y. Shan (2024) Seed-x: multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396. Cited by: §6.

[bib338] D. Ghosh, H. Hajishirzi, and L. Schmidt (2023) Geneval: an object-focused framework for evaluating text-to-image alignment. In NeurIPS, Cited by: §3.1.

[bib430] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NeurIPS, Cited by: §2.

[bib360] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick (2021) Masked autoencoders are scalable vision learners. In CVPR, Cited by: §1.

[bib391] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §1.

[bib372] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, Cited by: §2.

[bib196] T. Hiippala, M. Alikhani, J. Haverinen, T. Kalliokoski, E. Logacheva, S. Orekhova, A. Tuomainen, M. Stone, and J. A. Bateman (2021) AI2D-rst: a multimodal corpus of 1000 primary school science diagrams. Language Resources and Evaluation 55, pp. 661–688. Cited by: §5.

[bib401] J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. In NeurIPS, Cited by: §1.

[bib336] X. Hu, R. Wang, Y. Fang, B. Fu, P. Cheng, and G. Yu (2024) Ella: equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135. Cited by: §3.1.

[bib327] Z. Huang, D. Zheng, C. Zou, R. Liu, X. Wang, K. Ji, W. Chai, J. Sun, L. Wang, Y. Lv, et al. (2025) Ming-univision: joint image understanding and generation with a unified continuous tokenizer. arXiv preprint arXiv:2510.06590. Cited by: §6.

[bib477] Y. Jiao, H. Qiu, Z. Jie, S. Chen, J. Chen, L. Ma, and Y. Jiang (2025) Unitoken: harmonizing multimodal understanding and generation through unified visual encoding. In CVPR, Cited by: §6, §6.

[bib483] Z. Kang, X. Zhao, and D. Song (2025) Scalable best-of-n selection for large language models via self-certainty. arXiv preprint arXiv:2502.18581. Cited by: §5.

[bib396] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In ICLR, Cited by: §1, §6.

[bib140] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W. Lo, et al. (2023) Segment anything. In ICCV, Cited by: §4.

[bib395] T. Kouzelis, E. Karypidis, I. Kakogeorgiou, S. Gidaris, and N. Komodakis (2025) Boosting generative image modeling via joint image-feature synthesis. In NeurIPS, Cited by: §6.

[bib314] Black Forest Labs (2024) FLUX. Note: https://github.com/black-forest-labs/flux Cited by: Appendix A, §1, §2, §2, §3.3, §4, §4, §6, §6.

[bib402] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le (2023) Flow matching for generative modeling. In ICLR, Cited by: §1, §3.1, §3.1.

[bib390] B. Liu, E. Akhgari, A. Visheratin, A. Kamko, L. Xu, S. Shrirao, C. Lambert, J. Souza, S. Doshi, and D. Li (2024) Playground v3: improving text-to-image alignment with deep-fusion large language models. arXiv preprint arXiv:2409.10695. Cited by: §1.

[bib112] H. Liu, C. Li, Y. Li, and Y. J. Lee (2024) Improved baselines with visual instruction tuning. In CVPR, Cited by: §3.1.

[bib9] H. Liu, C. Li, Q. Wu, and Y. J. Lee (2023) Visual instruction tuning. In NeurIPS, Cited by: Appendix B, §3.1.

[bib403] X. Liu, C. Gong, and Q. Liu (2023) Flow straight and fast: learning to generate and transfer data with rectified flow. In ICLR, Cited by: §3.1.

[bib449] J. Lu, L. Song, M. Xu, B. Ahn, Y. Wang, C. Chen, A. Dehghan, and Y. Yang (2025) AToken: a unified tokenizer for vision. arXiv preprint arXiv:2509.14476. Cited by: §2, §6.

[bib481] N. Ma, S. Tong, H. Jia, H. Hu, Y. Su, M. Zhang, X. Yang, Y. Li, T. Jaakkola, X. Jia, et al. (2025) Inference-time scaling for diffusion models beyond scaling denoising steps. arXiv preprint arXiv:2501.09732. Cited by: §5.

[bib324] N. Mu, A. Kirillov, D. A. Wagner, and S. Xie (2021) SLIP: self-supervision meets language-image pre-training. arXiv preprint arXiv:2112.12750. Cited by: §1.

[bib429] A. van den Oord, O. Vinyals, and K. Kavukcuoglu (2017) Neural discrete representation learning. In NeurIPS, Cited by: §6.

[bib407] X. Pan, S. N. Shukla, A. Singh, Z. Zhao, S. K. Mishra, J. Wang, Z. Xu, J. Chen, K. Li, F. Juefei-Xu, J. Hou, and S. Xie (2025) Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256. Cited by: §1, §3.1, §4.1, §4.2, §6.

[bib487] Y. Pan, R. Feng, Q. Dai, Y. Wang, W. Lin, M. Guo, C. Luo, and N. Zheng (2025) Semantics lead the way: harmonizing semantic and texture modeling with asynchronous latent diffusion. arXiv preprint arXiv:2512.04926. Cited by: §6.

[bib357] W. Peebles and S. Xie (2023) Scalable diffusion models with transformers. In ICCV, Cited by: §3.1, §6.

[bib490] G. Petsangourakis, C. Sgouropoulos, B. Psomas, T. Giannakopoulos, G. Sfikas, and I. Kakogeorgiou (2025) REGLUE your latents with global and local semantics for entangled diffusion. arXiv preprint arXiv:2512.16636. Cited by: §6.

[bib329] D. Podell et al. (2023) SDXL: improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952. Cited by: §1, §2, §4.2, §6, §6.

[bib326] Qwen Team, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. (2024) Qwen2.5 technical report. arXiv preprint. Cited by: Appendix B, §1, §3.1.

[bib123] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. In ICML, Cited by: §1.

[bib410] A. Razavi, A. van den Oord, and O. Vinyals (2019) Generating diverse high-fidelity images with vq-vae-2. In NeurIPS, Cited by: §6.

[bib318] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer (2022) High-resolution image synthesis with latent diffusion models. In CVPR, Cited by: §6.

[bib143] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer (2022) High-resolution image synthesis with latent diffusion models. In CVPR, Cited by: §1.

[bib470] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. arXiv preprint arXiv:1505.04597. Cited by: §6.

[bib370] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §2.

[bib463] A. Sauer, K. Schwarz, and A. Geiger (2022) StyleGAN-xl: scaling stylegan to large diverse datasets. In SIGGRAPH, Cited by: §2.

[bib485] M. Shi, H. Wang, B. Zhang, W. Zheng, B. Zeng, Z. Yuan, X. Wu, Y. Zhang, H. Yang, X. Wang, P. Wan, K. Gai, J. Zhou, and J. Lu (2025) SVG-t2i: scaling up text-to-image latent diffusion model without variational autoencoder. arXiv preprint arXiv:2512.11749. Cited by: §6.

[bib486] M. Shi, H. Wang, W. Zheng, Z. Yuan, X. Wu, X. Wang, P. Wan, J. Zhou, and J. Lu (2025) Latent diffusion model without variational autoencoder. arXiv preprint arXiv:2510.15301. Cited by: §6.

[bib177] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach (2019) Towards vqa models that can read. In CVPR, Cited by: §5.

[bib364] I. Skorokhodov, S. Girish, B. Hu, W. Menapace, Y. Li, R. Abdal, S. Tulyakov, and A. Siarohin (2025) Improving the diffusability of autoencoders. In ICML, Cited by: §1.

[bib321] K. Sun, J. Pan, Y. Ge, H. Li, H. Duan, X. Wu, R. Zhang, A. Zhou, Z. Qin, Y. Wang, et al. (2023) Journeydb: a benchmark for generative image understanding. Cited by: §4.

[bib409] Q. Sun, Y. Cui, X. Zhang, F. Zhang, Q. Yu, Z. Luo, Y. Wang, Y. Rao, J. Liu, T. Huang, and X. Wang (2024) Generative multimodal models are in-context learners. In CVPR, Cited by: §6.

[bib128] Q. Sun, Y. Fang, L. Wu, X. Wang, and Y. Cao (2023) Eva-clip: improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389. Cited by: §6.

[bib303] Q. Sun, J. Wang, Q. Yu, Y. Cui, F. Zhang, X. Zhang, and X. Wang (2024) Eva-clip-18b: scaling clip to 18 billion parameters. arXiv preprint arXiv:2402.04252. Cited by: §6.

[bib320] B. Tang, B. Zheng, S. Paul, and S. Xie (2025) Exploring the deep fusion of large language models and diffusion transformers for text-to-image synthesis. In CVPR, Cited by: §2, §4.

[bib398] H. Tang, C. Xie, X. Bao, T. Weng, P. Li, Y. Zheng, and L. Wang (2025) UniLiP: adapting clip for unified multimodal understanding, generation and editing. arXiv preprint arXiv:2507.23278. Cited by: §6.

[bib97] Chameleon Team (2024) Chameleon: mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818. Cited by: §6.

[bib293] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L. Li (2016) Yfcc100m: the new data in multimedia research. Communications of the ACM 59 (2), pp. 64–73. Cited by: §1, §2.

[bib241] S. Tong, E. Brown, P. Wu, S. Woo, M. Middepogu, S. C. Akula, J. Yang, S. Yang, A. Iyer, X. Pan, et al. (2024) Cambrian-1: a fully open, vision-centric exploration of multimodal llms. In NeurIPS, Cited by: Appendix B, §4.

[bib408] S. Tong, D. Fan, J. Zhu, Y. Xiong, X. Chen, K. Sinha, M. Rabbat, Y. LeCun, S. Xie, and Z. Liu (2025) MetaMorph: multimodal understanding and generation via instruction tuning. In ICCV, Cited by: §5, §6.

[bib359] M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y. Xia, B. Mustafa, O. Hénaff, J. Harmsen, A. Steiner, and X. Zhai (2025) SigLIP 2: multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786. Cited by: §1.

[bib311] M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y. Xia, B. Mustafa, et al. (2025) Siglip 2: multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786. Cited by: §1, §2, §3.1.

[bib333] T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. (2025) Wan: open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314. Cited by: Appendix B, §4.1.

[bib184] X. Wang, X. Zhang, Z. Luo, Q. Sun, Y. Cui, J. Wang, F. Zhang, Y. Wang, Z. Li, Q. Yu, et al. (2024) Emu3: next-token prediction is all you need. arXiv preprint arXiv:2409.18869. Cited by: §6.

[bib332] C. Wendler (2024) RenderedText. Hugging Face. Note: https://huggingface.co/datasets/wendlerc/RenderedText Cited by: §2, §2.

[bib334] C. Wu, J. Li, J. Zhou, J. Lin, K. Gao, K. Yan, S. Yin, S. Bai, X. Xu, Y. Chen, et al. (2025) Qwen-image technical report. arXiv preprint arXiv:2508.02324. Cited by: §1, §3.3, §4.1.

[bib244] Y. Wu, Z. Zhang, J. Chen, H. Tang, D. Li, Y. Fang, L. Zhu, E. Xie, H. Yin, L. Yi, et al. (2024) Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429. Cited by: §6.

[bib482] E. Xie, J. Chen, Y. Zhao, J. Yu, L. Zhu, C. Wu, Y. Lin, Z. Zhang, M. Li, J. Chen, et al. (2025) Sana 1.5: efficient scaling of training-time and inference-time compute in linear diffusion transformer. arXiv preprint arXiv:2501.18427. Cited by: §5.

[bib489] J. Yao, Y. Song, Y. Zhou, and X. Wang (2025) Towards scalable pre-training of visual tokenizers for generation. arXiv preprint arXiv:2512.13687. Cited by: §6.

[bib339] J. Yao, B. Yang, and X. Wang (2025) Reconstruction vs. generation: taming optimization dilemma in latent diffusion models. In CVPR, Cited by: Appendix B, §1, §3.1.

[bib388] Q. Yu, M. Weber, X. Deng, X. Shen, D. Cremers, and L. Chen (2024) An image is worth 32 tokens for reconstruction and generation. In NeurIPS, Cited by: §6.

[bib355] S. Yu, S. Kwak, H. Jang, J. Jeong, J. Huang, J. Shin, and S. Xie (2025) Representation alignment for generation: training diffusion transformers is easier than you think. In ICLR, Cited by: §6.

[bib195] X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, et al. (2024) Mmmu: a massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In CVPR, Cited by: §5.

[bib52] X. Yue, T. Zheng, Y. Ni, Y. Wang, K. Zhang, S. Tong, Y. Sun, M. Yin, B. Yu, G. Zhang, et al. (2024) Mmmu-pro: a more robust multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2409.02813. Cited by: §5.

[bib479] Z. Yue, H. Zhang, X. Zeng, B. Chen, C. Wang, S. Zhuang, L. Dong, K. Du, Y. Wang, L. Wang, and Y. Wang (2025) UniFlow: a unified pixel flow tokenizer for visual understanding and generation. arXiv preprint arXiv:2510.10575. Cited by: §6.

[bib127] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer (2023) Sigmoid loss for language image pre-training. In ICCV, Cited by: §1.

[bib426] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, Cited by: §2.

[bib310] B. Zheng, N. Ma, S. Tong, and S. Xie (2025) Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690. Cited by: Appendix A, Appendix B, §1, §1, §3.1, §3.2, §3.2, §3, §6.

[bib243] C. Zhou, L. Yu, A. Babu, K. Tirumala, M. Yasunaga, L. Shamis, J. Kahn, X. Ma, L. Zettlemoyer, and O. Levy (2025) Transfusion: predict the next token and diffuse images with one multi-modal model. In ICLR, Cited by: §4, §6.