MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
Shengbang Tong 1,2,∗,†, David Fan 1, Jiachen Zhu 1,2,∗, Yunyang Xiong 3, Xinlei Chen 1, Koustuv Sinha 1, Michael Rabbat 1, Yann LeCun 1,2, Saining Xie 2, Zhuang Liu 1,†
1 FAIR, Meta, 2 New York University, 3 Meta Reality Labs
∗ Work done at Meta, † Corresponding authors
Abstract
In this work, we propose Visual-Predictive Instruction Tuning (VPiT), a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into a unified autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data curated in an instruction-following format. Our empirical investigation reveals several intriguing properties of VPiT: (1) visual generation ability emerges as a natural byproduct of improved visual understanding, and can be unlocked efficiently with a small amount of generation data; (2) while we find understanding and generation to be mutually beneficial, understanding data contributes to both capabilities more effectively than generation data. Building upon these findings, we train our MetaMorph model and achieve competitive performance on both visual understanding and generation. In visual generation, MetaMorph can leverage the world knowledge and reasoning abilities gained from LLM pretraining, and overcome common failure modes exhibited by other generation models. Our results suggest that LLMs may have strong "prior" vision capabilities that can be efficiently adapted to both visual understanding and generation with a relatively simple instruction tuning process.
Date:
December 19, 2024
Correspondence:
st5087@nyu.edu , zhuangl@meta.com
Project Page:
tsb0601.github.io/metamorph
Introduction
Multimodal Large Language Models (MLLMs) have advanced considerably in visual understanding, progressing from basic image captioning to complex visual inference (Alayrac et al., 2022; Liu et al., 2023; Dai et al., 2024). These models process multimodal inputs, primarily images and language, and generate text tokens. Multimodal LLMs often leverage a pretrained vision encoder (Dosovitskiy et al., 2021; Radford et al., 2021) and a pretrained language model (Touvron et al., 2023; AI@Meta, 2024), and align these modalities through connectors such as MLPs (Liu et al., 2023, 2024a) or cross-attention modules (Alayrac et al., 2022; Dai et al., 2024). Among MLLM training methods, visual instruction tuning (Liu et al., 2023) has become widely used (Wang et al., 2024a; Agrawal et al., 2024). It treats the output embeddings of pretrained vision encoders as continuous-valued 'visual tokens' and feeds them directly as inputs to pretrained LLMs.
One benefit of visual instruction tuning is that it is data and compute efficient. A pretrained LLM can be repurposed as a Multimodal LLM by instruction tuning with modest compute and data on the order of millions of image-text question-answer pairs (Tong et al., 2024a; Li et al., 2024a). The effectiveness of visual instruction tuning indicates that LLMs already possess a considerable amount of inherent visual knowledge which allows them to efficiently learn and develop visual understanding during the instruction tuning process (Zhou et al., 2024a). Inspired by this, we investigate whether LLMs can also be finetuned to generate visual information with comparable efficiency and effectiveness.
Current attempts toward 'unified' models, i.e., models capable of both multimodal understanding and generation, often treat visual generation as a capability orthogonal to visual understanding. They tend to require substantial changes to the original MLLM architecture and significant multimodal pretraining and/or finetuning. Designing such methods is challenging, and past research takes different approaches, including tokenizing visual inputs into discrete tokens (Wu et al., 2024b; Team, 2024; Liu et al., 2024c), incorporating diffusion objectives (Xie et al., 2024; Zhou et al., 2024b), and decoupling vision into separate understanding and generation modes (Wu et al., 2024a). For example, approaches like LWM (Liu et al., 2024c), Show-o (Xie et al., 2024), and Chameleon (Team, 2024) require billions of image-text pairs (Schuhmann et al., 2022; Gadre et al., 2024) for extensive pretraining and finetuning.

Figure 1 VPiT Training, Inference, and Examples of MetaMorph. Left: In Visual-Predictive Instruction Tuning (VPiT), we finetune a pretrained LLM to generate both text and visual tokens using separate text and vision heads. Middle: During inference, the model accepts an arbitrary input sequence of image(s) and text and outputs discrete text tokens and continuous visual tokens. These visual tokens can be visualized via a separately finetuned diffusion model, which is trained to condition on the pretrained vision encoder's output. Right: An example conversation from MetaMorph trained with VPiT. Here, the model implicitly solves a visual puzzle in order to generate the visual tokens of a butterfly. The conversation continues with new user questions as the model continues to autoregressively process vision and text tokens, independent of the diffusion-based visualization.
In this work, we propose Visual-Predictive Instruction Tuning (VPiT), a simple extension to visual instruction tuning that builds upon the existing paradigm of passing continuous visual tokens as input to the LLM. VPiT trains an LLM to output both continuous visual tokens and discrete text tokens in the finetuning stage. The model takes pretrained vision encoder embeddings as well as text tokens as input, and outputs a combination of text tokens and continuous visual tokens. To visualize the generated visual tokens, we finetune a diffusion model to map the embeddings back into pixel space (see Figure 1 for an example). This framework allows us to study the synergy between visual understanding, visual generation, and pretrained LLMs, which leads to several intriguing findings outlined below.
First, we show that the ability to predict visual tokens emerges from understanding visual inputs and requires minimal additional training. Similar to visual instruction tuning, VPiT efficiently and effectively morphs an LLM into a 'unified' model that understands and generates multimodal tokens. When trained jointly with sufficient visual understanding data, this process requires as few as 200k additional visual generation samples.
We further establish that the abilities to understand and generate visual tokens are intrinsically linked and asymmetrical. Specifically, increasing understanding data improves visual understanding (measured by higher VQA scores) and generation performance (measured by lower FID scores). Conversely, increasing generation data enhances generation quality and also contributes to stronger visual understanding, but to a lesser degree. Importantly, our findings highlight an asymmetry in how training each ability impacts the model's overall vision performance: understanding-centric training substantially outperforms generation-centric training in improving both visual understanding and generation.
Building upon these findings, we train a unified model called MetaMorph to predict multimodal tokens with VPiT. We leverage diverse data sources ranging from common visual question answering datasets to pure image and video data without text annotations. MetaMorph achieves competitive performance on both visual understanding and visual generation benchmarks. Furthermore, we show this unified modeling approach allows models to leverage the power of LLMs. For instance, MetaMorph can extract knowledge from the pretrained LLM when generating visual tokens. More surprisingly, we observe that MetaMorph can implicitly perform reasoning steps before generating visual tokens-e.g. when prompted with 'the animal resulting from a monarch caterpillar's metamorphosis' , MetaMorph successfully generates an image of a butterfly (Figure 1).
Our results suggest that 1) training a unified model with instruction tuning is feasible, and 2) LLMs have strong pre-existing visual capabilities which can be activated using significantly fewer samples compared
to extensive pretraining. These insights shed light on the development of mixed-modality models. As the community continues to improve visual understanding in Multimodal LLMs (Tong et al., 2024a; Wang et al., 2024a; Li et al., 2024a) by advancing base LLMs, instruction tuning techniques, and data, we highlight that these efforts may also implicitly lead to models that are better at visual generation.
Visual-Predictive Instruction Tuning
Visual instruction tuning as introduced by LLaVA (Liu et al., 2023) demonstrates that LLMs can be taught to understand visual inputs. This is achieved by finetuning on million-scale data. The success of late-fusion instruction tuning suggests that LLMs may already possess innate visual understanding ability. This ability simply needs to be unlocked through lightweight finetuning. Analogously, we hypothesize that LLMs already possess a degree of innate visual generation ability which just needs to be unlocked with lightweight finetuning.
Motivated by this, we present Visual-Predictive Instruction Tuning (VPiT, Figure 1)-a simple design which extends existing instruction tuning methods to additionally generate visual tokens rather than text alone. We use the same architecture and next-token prediction paradigm to unlock visual generation capabilities without bells and whistles. We take a pretrained LLM and finetune it to predict both discrete text tokens and continuous visual tokens. The visual tokens can be visualized with an adapted diffusion model.
From Unimodal to Multimodal Next-Token Prediction
The standard instruction tuning setup consists of an input sequence of conversation rounds (Wei et al., 2022a; Taori et al., 2023): (P_1, R_1), …, (P_N, R_N), where P_i and R_i represent the prompt and response for the i-th round of conversation, respectively. The model is trained to generate responses based on the prompts. VPiT adds the following mechanisms to a standard instruction tuning setup to unlock visual understanding and generation.
Tokenizing multimodal data. We extend P i and R i to include both text and images. To integrate visual data into a pretrained LLM, we process data closely following visual instruction tuning (Liu et al., 2023):
· Text Data: Text is tokenized into discrete tokens with a standard tokenizer used by the LLM.
· Visual Data: Images are encoded with a pretrained vision encoder such as SigLIP (Zhai et al., 2023). The output is a sequence of continuous visual tokens, which are then interpolated to m = 64 tokens. To pass the visual tokens as input to the LLM, we apply a trainable projection layer to align their dimension with that of the LLM. A minimal sketch of this pipeline follows.
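As a concrete illustration of this tokenization path, below is a minimal PyTorch sketch. The class names, the dummy encoder, and the placeholder dimensions (e.g., a 1152-dim encoder width and a 4096-dim LLM width) are our own assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualTokenizer(nn.Module):
    """Encode an image into m continuous visual tokens aligned with the LLM width."""

    def __init__(self, vision_encoder: nn.Module, enc_dim: int, llm_dim: int, m: int = 64):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g., a SigLIP ViT returning (B, P, enc_dim)
        self.m = m
        self.proj = nn.Linear(enc_dim, llm_dim)  # trainable projection into the LLM space

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.vision_encoder(images)                      # (B, P, enc_dim)
        # Interpolate the patch-token sequence down to a fixed m = 64 tokens.
        feats = F.interpolate(feats.permute(0, 2, 1), size=self.m, mode="linear")
        feats = feats.permute(0, 2, 1)                           # (B, m, enc_dim)
        return self.proj(feats)                                  # (B, m, llm_dim)


class DummyEncoder(nn.Module):
    """Stand-in for a pretrained encoder, used only to make the sketch runnable."""

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # Placeholder patch count and feature width; a real setup would use SigLIP features.
        return torch.randn(images.shape[0], 576, 1152)


visual_tokens = VisualTokenizer(DummyEncoder(), enc_dim=1152, llm_dim=4096)(
    torch.randn(2, 3, 384, 384)
)  # -> shape (2, 64, 4096), ready to be interleaved with text embeddings
```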
Model architecture. We take a pretrained LLM and finetune it to process arbitrary sequences of text and visual tokens (detailed next in Section 2.2). We keep the original LLM head for text prediction, and attach a separate vision head to the LLM for predicting visual tokens, i.e., the output tokens generated by the vision encoder when processing images. The vision head is a projection layer that projects from the LLM's dimension to the vision encoder's dimension. All response tokens can then be trained and predicted autoregressively, with prompt tokens as context.
Unlike conventional visual instruction tuning, in VPiT, visual tokens are also outputs of the LLM-not just inputs. To make the LLM aware of the presence of visual tokens, we introduce special tokens ⟨ image_start ⟩ and ⟨ image_end ⟩ to indicate the boundaries of visual token sequences and when to use the vision head.
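In practice, such boundary markers can be registered as new special tokens of the underlying tokenizer. The hedged sketch below uses the Hugging Face interface; the checkpoint name is a placeholder, and the exact token strings used by MetaMorph may differ.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder base LLM
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Register the image-boundary markers as special tokens so the LLM can learn
# when to switch between its text head and the new vision head.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<image_start>", "<image_end>"]}
)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding table accordingly
```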
Loss functions. The language head outputs a probability distribution over the vocabulary and is trained with cross-entropy loss for next-token prediction. Visual prediction uses cosine similarity loss between the LLM's predicted visual tokens and those from the vision encoder. Consistent with instruction tuning practices, the model only makes predictions and incurs loss on response tokens.
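Below is a minimal sketch of how the two losses could be combined. The masking convention, the equal loss weighting, and the tensor layout are assumptions made for illustration rather than the paper's exact training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def vpit_loss(
    hidden_states: torch.Tensor,   # (B, T, llm_dim) final-layer LLM states
    text_head: nn.Linear,          # llm_dim -> vocab_size (the original LM head)
    vision_head: nn.Linear,        # llm_dim -> enc_dim (the added vision head)
    text_targets: torch.Tensor,    # (B, T) next-token ids, -100 outside text responses
    visual_targets: torch.Tensor,  # (B, T, enc_dim) vision-encoder tokens, zeros elsewhere
    visual_mask: torch.Tensor,     # (B, T) bool, True where a visual response token is predicted
    vision_loss_weight: float = 1.0,  # assumed weighting, not from the paper
) -> torch.Tensor:
    # Discrete branch: standard next-token cross-entropy over response text tokens.
    logits = text_head(hidden_states)                                  # (B, T, V)
    text_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        text_targets.reshape(-1),
        ignore_index=-100,
    )

    # Continuous branch: cosine similarity loss on predicted visual response tokens.
    pred_visual = vision_head(hidden_states)                           # (B, T, enc_dim)
    cos = F.cosine_similarity(pred_visual, visual_targets, dim=-1)     # (B, T)
    if visual_mask.any():
        vision_loss = (1.0 - cos)[visual_mask].mean()
    else:
        vision_loss = hidden_states.new_zeros(())                      # no image in this batch

    return text_loss + vision_loss_weight * vision_loss
```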
In Figure 10, we present examples where the model generates images in response to puzzle prompts such as 'The national flag of the country where Yellowstone National Park is located'. For each puzzle, we directly use the prompt 'Generate an image of {puzzle}', without including any Chain-of-Thought (CoT) (Wei et al., 2022b) hints in the prompt. MetaMorph generates the correct image from prompts that require multi-step reasoning.
For example, when answering the question 'A musical instrument, this instrument is often played by the scientist who formulated the theory of special relativity', the model needs to implicitly complete three reasoning steps: it identifies Albert Einstein as the scientist who formulated the theory of special relativity, recognizes that his preferred instrument is the violin, and then directly generates the correct visual tokens (a violin) without explicitly separating these steps during the generation process. This result implies that MetaMorph implicitly solves the puzzle and generates correct visual tokens immediately following the prompt. These results align with the findings in Physics of LLMs (Ye et al., 2024; Allen-Zhu, 2024), where the authors suggest that LLMs precompute reasoning graphs before autoregressively generating subsequent tokens. Here, we demonstrate that this capability transfers to the unified multimodal model setting even when decoding visual tokens.

Figure 10 Examples of MetaMorph solving reasoning problems in visual generation. We design puzzles that require multi-step reasoning. We include reference logic chains needed to solve the puzzles, and reference solution examples. When prompting each model, we directly feed in the puzzle without any CoT hints or logic chains. MetaMorph has the ability to implicitly solve these puzzles and generate the correct image without explicitly creating or processing a logic chain. It demonstrates that the implicit reasoning skills in text-only LLMs can transfer to unified multimodal models.
Using Broad Types of Data
Because VPiT enables the model to predict both text and visual tokens in its responses, it allows the use of a broader range of training data. Traditional visual instruction tuning, on the other hand, primarily relies on question-and-answer pairs. The majority of our dataset is publicly available, and we categorize it into three major categories below. This categorization enables us to systematically study the model, as detailed in
Section 3 and Section 4. All data types are formatted as instruction tuning style prompt & response pairs. See further details in Appendix C.2.
- Visual Understanding Data: This category includes data that takes image(s) or video as input and outputs text responses. See Figure 1 for an example. We use:
· ImageQA: Cambrian-7M (Tong et al., 2024a). The model answers questions based on input image(s).
P_i ∈ {⟨visual tokens⟩, ⟨text prompt⟩}, R_i ∈ {⟨text response⟩}
· VideoQA: VideoStar (Zohar et al., 2024) and ShareVideo (Zhang et al., 2024). The model answers questions based on the input video. For videos in VideoQA, we process frames at 1 FPS.
P_i ∈ {⟨visual tokens⟩, …, ⟨visual tokens⟩, ⟨text prompt⟩}, R_i ∈ {⟨text response⟩}
- Visual Generation Data: MetaCLIP (Xu et al., 2024). The model predicts visual tokens based on an image description. We use at most 5 million pairs, curated into a question-answering format (see the sketch after this list).
P_i ∈ {⟨text prompt⟩}, R_i ∈ {⟨text response⟩, ⟨visual tokens⟩}
We prompt the model to generate visual tokens with instructions like 'Generate an image of...'. The text responses are of the form 'Here is an image based on your request...'. See Figure 1 for an example.
- Other Visual Data: This category includes data that requires the model to predict visual tokens given interleaved input visual tokens and text tokens. We use:
· Video Data: SomethingSomethingV2 (Goyal et al., 2017b) and HowTo100M (Miech et al., 2019). The model predicts frames in a sequential order. We design different question-answer pairs to probe into the video, such as asking about future frames, past frames, and reordering frames.
P_i ∈ {⟨visual tokens⟩, …, ⟨visual tokens⟩, ⟨text prompt⟩}, R_i ∈ {⟨visual tokens⟩, …, ⟨visual tokens⟩}
· Visual Thinking Data: Visualization-of-Thought (Shao et al., 2024) and VStar (Wu and Xie, 2024). The model predicts multimodal tokens in its response before addressing problems. For instance, it predicts a zoomed-in view of an image before generating textual responses. In the response, the model will output 'I will think about it visually', followed by visual tokens representing a zoomed-in segment of the image, and then proceed to answer the question.
· Image-to-Image Data: InstructPix2Pix (Brooks et al., 2023) and Aurora (Krojer et al., 2024). The model generates a transformed image conditioned on a text description and an input image.
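The sketch below illustrates how such heterogeneous sources can be curated into the prompt/response format used throughout this section. The field names and the '<image>' placeholder convention are hypothetical choices made for illustration, not the paper's data pipeline.

```python
def make_generation_example(caption: str, image_path: str) -> dict:
    """Curate one (caption, image) pair into an instruction-style example."""
    return {
        "prompt": f"Generate an image of {caption.rstrip('.')}.",
        # "<image>" marks where <image_start>, the 64 continuous visual tokens of
        # the target image, and <image_end> are spliced in during preprocessing.
        "response": "Here is an image based on your request: <image>",
        "target_image": image_path,
    }


def make_future_frame_example(frame_paths: list, k: int) -> dict:
    """Curate a video clip into a 'predict the next k frames' example (k >= 1)."""
    context, future = frame_paths[:-k], frame_paths[-k:]
    return {
        "prompt": "<image>" * len(context) + f" What are the next {k} frames?",
        "response": "<image>" * len(future),
        "context_images": context,
        "target_images": future,
    }


# Example usage with made-up inputs:
ex = make_generation_example("a monarch butterfly resting on a flower", "images/000123.jpg")
```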
Mapping Tokens to Images through Diffusion
Because models trained with VPiT learn to predict continuous visual tokens, we need to map the predicted tokens back into pixel space. We leverage the concept of a 'Diffusion Autoencoder' (Bordes et al., 2022; Preechakul et al., 2022; Pan et al., 2024b; Koh et al., 2024; Li et al., 2024c) in which the diffusion model can be adapted to condition on image embeddings rather than text embeddings. Specifically, we finetune an existing diffusion model to condition on outputs from the vision encoder using held-out training data.
At inference time, if the tag token ⟨ image_start ⟩ is generated, the model begins outputting visual tokens until ⟨ image_end ⟩ . We then plug the generated visual tokens into the diffusion model to visualize the prediction in pixel space. We use standard latent diffusion model training procedures. Details on the hyperparameters and training setup are provided in Appendix A.2.
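A minimal sketch of this inference-time visualization step is shown below. It assumes the generated sequence is a Python list mixing text tokens (strings) and continuous visual tokens (tensors), and that `diffusion_decoder` wraps the finetuned diffusion model; both are illustrative assumptions.

```python
import torch


def visualize_outputs(sequence, diffusion_decoder):
    """Collect visual tokens between <image_start>/<image_end> and render each span."""
    images, buffer, inside = [], [], False
    for tok in sequence:
        if isinstance(tok, str) and tok == "<image_start>":
            inside, buffer = True, []
        elif isinstance(tok, str) and tok == "<image_end>":
            inside = False
            # The finetuned diffusion model conditions on these encoder-space tokens
            # (a "diffusion autoencoder") and maps them back to pixel space.
            images.append(diffusion_decoder(torch.stack(buffer)))
        elif inside:
            buffer.append(tok)
    return images
```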
Findings on Unlocking Visual Generation
We study the following questions about the effects and synergy of visual understanding and generation, under our VPiT framework:
- §3.1 Can visual generation be unlocked through lightweight tuning, or does it require extensive data?
- §3.2 Are visual understanding and generation mutually beneficial or orthogonal?
- §3.3 How much does more visual understanding or generation data contribute to understanding and generation quality?
- §3.4 Which visual understanding tasks correlate the most with generation performance?
Evaluation settings. We use 9 ImageQA benchmarks (MMBench, Seed, VStar, MMVP, MMMU, ChartQA, TextVQA, ScienceQA, RealWorldQA) to evaluate different aspects of the model. For image generation, we use the finetuned diffusion model to visualize generated visual tokens and measure FID score (lower is better) and CLIP score (higher is better) on the COCO-30K dataset. Unless otherwise specified, we use LLaMA-3 8B (AI@Meta, 2024) / SigLIP ViT-SO400M-14@384 (Zhai et al., 2023) as the pretrained LLM / vision encoder. We also study the effect of different LLMs in Section 3.2. We use instruction tuned versions of the LLMs. We pretrain the adapter between the vision encoder and the LLM following visual instruction tuning (Liu et al., 2023, 2024a). For experiments in this section, we provide training details in Appendix A and include the full results in Appendix B.
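For reference, generation metrics of this kind can be computed with off-the-shelf tools; the snippet below is a hedged sketch using torchmetrics, one possible toolkit and not necessarily the one behind the reported numbers.

```python
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

# FID compares Inception feature statistics of real vs. generated images (lower is
# better); CLIP score measures image-caption alignment (higher is better).
fid = FrechetInceptionDistance(feature=2048)
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")


def update_metrics(real_images, generated_images, captions):
    # Both image batches are expected as uint8 tensors of shape (B, 3, H, W).
    fid.update(real_images, real=True)
    fid.update(generated_images, real=False)
    clip_score.update(generated_images, captions)


# After iterating over the COCO-30K prompts:
#   print(fid.compute(), clip_score.compute())
```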

Table 3 Comparison of different loss functions. Training with cosine similarity loss enables the model to effectively utilize non-VQA data, which in turn enhances its visual understanding.
Visual Generation Can Be Unlocked Efficiently by Joint Training with Visual Understanding
We start by investigating the number of image-text samples required to teach a language model to generate high-quality visual tokens. To this end, we randomly sample {1k, 5k, 10k, 50k, 200k, 1M, 3M, 5M} image-text pairs from our generation data (the MetaCLIP dataset (Xu et al., 2024)). We explore two settings: (1) finetuning the LLM using only visual generation data, and (2) jointly training visual generation with visual understanding and the other data types described in Section 2.2.
In Figure 2, we see that training solely on visual generation performs significantly worse than joint training with all other data. With over 3 million image-text pairs, the model struggles to generate high-quality visual images ( ∼ 40 FID score), and performance remains inferior to joint training with 5 million pairs. This suggests that training solely on visual generation data is significantly less sample efficient. This finding aligns with a prior study (Zhang et al., 2023) which also suggests that LLMs cannot be easily tuned to generate visual tokens when trained with only generation data. In contrast, joint training with other datasets substantially improves generation performance. The model generates effective visual tokens with just 5k generation data, and performance stabilizes around 200k samples. This indicates that visual generation is not an orthogonal capability but rather an ability that benefits from other tasks and emerges more effectively with joint training.

Figure 2 Generation-only training vs. Joint training with other data. Training solely on generation data results in inferior performance. Joint training with additional data enables visual generation with only 5k generation data and yields high-quality outputs with 200k generation data.
Figure 3 Impact of different data types on visual generation. The baseline of training on only visual generation data is red; joint training with other data is yellow; joint training with visual understanding data is green; and all data is blue. Joint training with additional data improves the baseline, with visual understanding tasks contributing the most to enhancing visual generation.

Figure 4 VQA Performance vs. Generation Performance with generation data controlled at 200k. Increasing understanding data improves VQA and generation performance.
To better understand how each type of data contributes to visual generation, we conduct a controlled experiment using 200k visual generation data, jointly training it with each data type defined in Section 2.2 individually. We also compare these settings with training on all the data together. We show results in Figure 3. While all data types enhance the model's visual generation, the degree of improvement varies. Visual understanding data, such as ImageQA and VideoQA, significantly boosts the model's visual generation capabilities, even when the amount of generation data is kept constant at 200k. This indicates a strong link between the ability to understand visual content and the ability to generate visual tokens. Additionally, combining all data types in training further improves performance, suggesting that the benefits from different data types can be additive.
Finding 1: The ability to generate visual tokens can be unlocked with significantly less generation data when the model is jointly trained with visual understanding data, in contrast to training only on generation data.
Visual Understanding and Generation are Mutually Beneficial
More understanding data leads to better understanding and generation. Building upon findings from the previous subsection, we perform a controlled experiment to investigate how visual understanding ability correlates with visual generation ability. We ablate our model using a fixed set of 200k generation data while varying VQA data from 1M to 7M samples from Cambrian-7M to develop different levels of visual understanding. The results presented in Figure 4 indicate that stronger VQA ability correlates with better generation performance.
More generation data leads to better understanding and generation. Here, we investigate the reverse direction: does enhancing the model's visual generation capability also relate to higher VQA performance? To explore this, we conduct a controlled experiment using 1M fixed VQA samples as the baseline for understanding. We then vary the amount of generation data ({200k, 500k, 1M, 2M, 3M, 4M}) to adjust generation capacity while joint training with the fixed 1M VQA data. We present results in Figure 5. Within the 1M VQA setting, stronger generation ability is correlated with improved VQA performance. This implies that increasing the amount of generation data not only enhances generation but also positively impacts VQA performance.
This synergy scales across different LLMs. We examine whether the findings transfer across various LLM backbones. Using a data composition of 7M VQA samples and 1M generation data, we train VPiT on LLaMA-3 8B, LLaMA-3.1 8B, and LLaMA-3 70B. Figure 6 shows the scaling behavior across different LLMs.
Finding 2: Visual understanding and generation are synergistic. Increasing data for either capability enhances both simultaneously.

Figure 6 Comparison between different language backbones. We jointly train 7M VQA and 1M generation data on different language backbones (LLaMA-3 8B, LLaMA-3.1 8B, LLaMA-3 70B). We observe that the synergy between understanding and generation transfers across LLMs.
Table 6 Full results of joint training on varying amounts of VQA data (1M, 4M, 7M) and generation data (200k, 500k, 1M, 2M, 3M, 4M). These results correspond to Figure 4, Figure 5, Figure 7, and Figure 8, which analyze how different combinations of understanding and generation data impact the model's visual understanding and generation performance.

In Table 6, we present the numerical results of joint training with varying scales of understanding data (1M, 4M, 7M) and generation data (200k, 500k, 1M, 2M, 3M, 4M). These findings demonstrate that increasing the amount of understanding data yields more substantial improvements in both understanding tasks (e.g., VQA performance) and generation tasks (e.g., FID scores and CLIP scores) compared to increasing the amount of generation data. These results, consistent with our analysis in Section 3.2 and Section 3.3, highlight that understanding data plays a more pivotal role in enhancing performance across both task types.

Table 7 Full results of training on different LLMs. We train 7M VQA data and 1M generation data on different LLM backbones (LLaMA-3 8B, LLaMA-3.1 8B, and LLaMA-3 70B) and measure understanding and generation performance.
We compare MetaMorph with other unified models and summarize results in Table 1. Since these models are trained on different datasets and base LLMs (or pretrained from scratch), an apples-to-apples comparison is difficult. Nevertheless, MetaMorph demonstrates competitive performance and outperforms other unified models on most benchmarks-even when prior models may have been trained on more data. Compared to models trained from scratch, such as EMU-3 (Wang et al., 2024b) and Chameleon (Team, 2024), MetaMorph leverages the strengths of the latest pretrained LLMs and achieves competitive understanding and generation performance. MetaMorph highlights that unified models can be developed effectively from pretrained LLMs.

Figure 9 Examples of MetaMorph leveraging LLMs to generate visual tokens. Left : MetaMorph can leverage knowledge from the LLM to generate visual tokens for professional terms that need domain-specific understanding. Right : MetaMorph also avoids common mistakes seen in T2I models that condition on text embeddings (e.g., Stable Diffusion-3.5 8B).
We present the results of training with 7M VQA data and 1M generation data across various LLM backbones, including LLaMA-3 8B, LLaMA-3.1 8B, and LLaMA-3 70B. As shown in Table 7, which corresponds to the results in Figure 6, we observe that stronger LLM backbones lead to improvements in both visual understanding and visual generation. These findings further support the conclusion that visual understanding and generation are reciprocal processes, where advancements in one drive enhancements in the other.
Understanding Data Contributes More
We investigate whether understanding and generation data contribute equally. Here, we jointly train on different scales of VQA data {1M, 4M, 7M} and generation data {200k, 500k, 1M, 2M, 3M, 4M}. Figure 7 summarizes these findings, with the x-axis representing VQA data and the y-axis representing generation data. Results are visualized as heatmaps, with darker colors indicating better performance.
The results indicate that increasing VQA data yields the most significant improvements in all three metrics. When VQA data is relatively low (1M), increases in generation data lead to noticeable improvements, as reflected by the gradual darkening in the plot. However, as the VQA data scales up (from 1M to 4M to 7M), the impact of VQA data becomes more pronounced, demonstrated by a sharp color transition in the heatmap. Ultimately, with 7M VQA data, increases in generation data contribute minimally. These results demonstrate the critical role of understanding data in enhancing both understanding and generation performance.
Finding 3: While increasing data improves performance overall, the impact of visual understanding data is significantly higher than the impact of visual generation data.
Multimodal Large Language Models (MLLMs) have advanced considerably in visual understanding, progressing from basic image captioning to complex visual inferences (Alayrac et al., 2022; Liu et al., 2023; Dai et al., 2024). These models process multimodal inputs-primarily images and language-and generate text tokens. Multimodal LLMs often leverage a pretrained vision encoder (Dosovitskiy et al., 2021; Radford et al., 2021), a pretrained language model (Touvron et al., 2023; AI@Meta, 2024), and align these modalities through connectors such as MLP (Liu et al., 2023, 2024a) or cross-attention modules (Alayrac et al., 2022; Dai et al., 2024). Among MLLM training methods, visual instruction tuning (Liu et al., 2023) has become widely used (Wang et al., 2024a; Agrawal et al., 2024). It treats output embeddings of pretrained vision encoders as continuous-valued 'visual tokens' and directly feeds them as inputs to pretrained LLMs.
One benefit of visual instruction tuning is that it is data and compute efficient. A pretrained LLM can be repurposed as a Multimodal LLM by instruction tuning with modest compute and data on the order of millions of image-text question-answer pairs (Tong et al., 2024a; Li et al., 2024a). The effectiveness of visual instruction tuning indicates that LLMs already possess a considerable amount of inherent visual knowledge which allows them to efficiently learn and develop visual understanding during the instruction tuning process (Zhou et al., 2024a). Inspired by this, we investigate whether LLMs can also be finetuned to generate visual information with comparable efficiency and effectiveness.
Current attempts toward 'unified' models-models capable of both multimodal understanding and generationoften treat visual generation as an orthogonal capability to visual understanding. They tend to require substantial changes to the original MLLM architecture and significant multimodal pretraining and/or finetuning. Designing such methods is challenging, and past research takes different approaches including tokenizing

Figure 1 VPiT Training, Inference, and Examples of MetaMorph. Left : In Visual-Predictive Instruction Tuning (VPiT), we finetune a pretrained LLM to generate both text and visual tokens using separate text and vision heads. Middle : During inference, the model accepts an arbitrary input sequence of image(s) and text and outputs discrete text tokens and continuous visual tokens. These visual tokens can be visualized via a separately finetuned diffusion model, which is trained to condition on the pretrained vision encoder's output. Right : An example conversation from MetaMorph trained with VPiT. Here, the model implicitly solves a visual puzzle in order to generate the visual tokens of a butterfly. The conversation continues with new user questions as the model continues to autoregressively process vision and text tokens, independent of the diffusion-based visualization.
visual inputs into discrete tokens (Wu et al., 2024b; Team, 2024; Liu et al., 2024c), incorporating diffusion objectives (Xie et al., 2024; Zhou et al., 2024b), and decoupling vision into separate understanding and generation modes (Wu et al., 2024a). For example, approaches like LWM (Liu et al., 2024c), Show-o (Xie et al., 2024), and Chameleon (Team, 2024) require billions of image-text pairs (Schuhmann et al., 2022; Gadre et al., 2024) for extensive pretraining and finetuning.
In this work, we propose Visual-Predictive Instruction Tuning (VPiT)-a simple extension to visual instruction tuning which builds upon the existing paradigm of passing continuous visual tokens as input to the LLM. VPiT trains an LLM to output both continuous visual tokens and discrete text tokens in the finetuning stage. The model takes pretrained vision encoder embeddings as well as text tokens as input, and outputs a combination of text tokens and continuous visual tokens. To visualize the generated visual tokens, we finetune a diffusion model to map the embeddings back into pixel space (see Figure 1 for an example). This framework allows us to study the synergy between visual understanding, visual generation, and pretrained LLMs, which leads to several intriguing findings outlined below.
First, we show that the ability to predict visual tokens emerges from understanding visual inputs and requires minimal additional training. Similar to visual instruction tuning, VPiT efficiently and effectively morphs an LLM into an 'unified' model that understands and generates multimodal tokens. When trained jointly with sufficient visual understanding data, this process requires as little as 200k additional visual generation data.
We further establish that the abilities to understand and generate visual tokens are intrinsically linked and asymmetrical . Specifically, increasing understanding data improves visual understanding (measured by higher VQA scores) and generation performance (measured by lower FID scores). Conversely, increasing generation data enhances generation quality and also contributes to stronger visual understanding-but to a lesser degree. Importantly, our findings highlight an asymmetry in how training each ability impacts the model's overall vision performance: understanding-centric training substantially outperforms generation-centric training in improving both visual understanding and generation.
Building upon these findings, we train a unified model called MetaMorph to predict multimodal tokens with VPiT. We leverage diverse data sources ranging from common visual question answering datasets to pure image and video data without text annotations. MetaMorph achieves competitive performance on both visual understanding and visual generation benchmarks. Furthermore, we show this unified modeling approach allows models to leverage the power of LLMs. For instance, MetaMorph can extract knowledge from the pretrained LLM when generating visual tokens. More surprisingly, we observe that MetaMorph can implicitly perform reasoning steps before generating visual tokens-e.g. when prompted with 'the animal resulting from a monarch caterpillar's metamorphosis' , MetaMorph successfully generates an image of a butterfly (Figure 1).
Our results suggest that 1) training a unified model with instruction tuning is feasible, and 2) LLMs have strong pre-existing visual capabilities which can be activated using significantly fewer samples compared to extensive pretraining. These insights shed light on the development of mixed-modality models. As the community continues to improve visual understanding in Multimodal LLMs (Tong et al., 2024a; Wang et al., 2024a; Li et al., 2024a) by advancing base LLMs, instruction tuning techniques, and data, we highlight that these efforts may also implicitly lead to models that are better at visual generation.
Certain Understanding Tasks Correlate More with Generation Performance
Given the diverse nature of understanding tasks such as OCR, Vision-Centric tasks, and Knowledge-based tasks, we investigate which tasks most strongly correlate with generation ability. Inspired by Cambrian-1, we categorize VQA tasks into five groups: General, Text&Chart, High-Resolution, Knowledge, and Vision-Centric VQA. Using the results from our earlier experiments, which jointly train various VQA data scales with different amounts of generation data, we plot each benchmark's VQA performance against generation performance in Figure 8. We also calculate the Pearson correlation ( ρ ) between VQA scores and FID/CLIP Scores.

Figure 8 Correlation analysis between generation and various understanding benchmarks. Results are collected by joint training different amounts of VQA data combined with varying quantities of generation data. Each subplot shows the correlation ( ρ ) with a fitted regression line. Stars represent data points. We analyze General VQA, Vision-Centric VQA, Text&Chart VQA, High-Resolution VQA, and Knowledge VQA. For most tasks, generation performance and VQA performance are strongly correlated: higher VQA performance indicates better generation and vice versa. Only knowledge-intensive and high-resolution VQA tasks exhibit weaker correlations with generation performance.

Figure 7 Heatmap visualization of Average VQA Score, FID Score, and CLIP Score across varying amounts of VQA data and generation data. Darker colors indicate better performance. Increasing VQA data is more effective for improving both understanding and generation capabilities.
Figure 8 shows that General, Vision-Centric, and Text&Chart VQA tasks strongly correlate with generation performance, each with a Pearson correlation coefficient ( ρ ) above 0.85. High-Resolution VQA exhibits moderate correlation, with ρ around 0.7. In contrast, Knowledge VQA tasks, such as MMMU, show weak correlation with generation performance. These findings suggest that generation ability aligns more closely with the model's vision capabilities rather than knowledge-specific tasks.
Finding 4: General, vision-centric, and text understanding VQA tasks exhibit strong correlations with visual generation, whereas knowledge-based VQA tasks do not.
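For reference, the correlation statistic reported here can be computed with a few lines of SciPy; the numbers below are purely illustrative stand-ins for per-run benchmark results, not values from the paper.

```python
import numpy as np
from scipy.stats import pearsonr

# Illustrative stand-in numbers, one entry per training run with a different
# understanding/generation data mix (not values from the paper).
vqa_score = np.array([55.2, 58.9, 61.3, 64.0, 66.7])
clip_score = np.array([28.1, 29.0, 29.6, 30.2, 30.7])
fid = np.array([23.1, 19.8, 17.5, 15.9, 14.2])

rho_clip, _ = pearsonr(vqa_score, clip_score)  # positive when the two improve together
rho_fid, _ = pearsonr(vqa_score, fid)          # negative, since lower FID is better
print(f"rho(VQA, CLIP) = {rho_clip:.2f}, rho(VQA, FID) = {rho_fid:.2f}")
```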
4 MetaMorph Model
Based on the insights in Section 3, we train our unified model, MetaMorph, based on LLaMA-3.1 8B (AI@Meta, 2024), using VPiT with the data curated in Section 2.2. We present our experimental results in three parts: quantitative performance (Section 4.1), evidence of MetaMorph leveraging LLM knowledge in visual generation (Section 4.2), and implicit reasoning skills in multimodal contexts (Section 4.3).
Table 1 Comparison of MetaMorph with other unified models. MetaMorph offers competitive performance compared to other leading unified models. Models in gray are understanding-only or generation-only. Unified models without a base LLM are trained from scratch. ∗ We use numbers reported in original papers. † We obtain results using official open-sourced model weights.

| Method | Base LLM | MMBench-EN | SEED | RealworldQA | MMVP | SQA | MMMU | VStar | ChartQA | TextVQA | MV-Bench | COCO FID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Visual Understanding Only | | | | | | | | | | | | |
| GPT-4V∗ | - | 75.8 | 69.1 | 61.4 | 50.0 | 75.7 | 56.8 | 55.0 | 78.5 | 78.0 | 43.5 | - |
| Visual Generation Only | | | | | | | | | | | | |
| Stable Diffusion 1.5∗ | - | - | - | - | - | - | - | - | - | - | - | 9.6 |
| Dalle 2∗ | - | - | - | - | - | - | - | - | - | - | - | 10.4 |
| Imagen∗ | - | - | - | - | - | - | - | - | - | - | - | 7.3 |
| Unified Models | | | | | | | | | | | | |
| EMU-3∗ | - | 58.5 | 68.2 | 57.4 | 36.6† | 89.2 | 31.6 | 51.8† | 68.6 | 64.7 | - | 12.8 |
| Janus∗ | DeepSeek 1.3B | 69.4 | 63.7 | - | - | - | 30.5 | - | - | - | - | 8.5 |
| VILA-U-256† | LLaMA-2 7B | 66.6 | 57.1 | 46.6 | 22.0 | 67.1 | 32.2 | 38.7 | 11.4 | 48.3∗ | 40.8 | 19.6 |
| Transfusion∗ | - | - | - | - | - | - | - | - | - | - | - | 6.7 |
| Chameleon-7B† | - | 35.7 | 27.2 | 19.6 | 0.0 | 50.3 | 28.4 | 37.1 | 0.0 | 0.0 | - | 26.7∗ |
| MetaMorph (ours) | LLaMA-3.1 8B | 75.2 | 71.8 | 58.3 | 48.3 | 83.2 | 41.8 | 44.0 | 37.1 | 60.5 | 48.8 | 11.8 |
Competitive Performance in Understanding and Generation
We compare MetaMorph with other unified models and summarize results in Table 1. Since these models are trained on different datasets and base LLMs (or pretrained from scratch), an apples-to-apples comparison is difficult. Nevertheless, MetaMorph demonstrates competitive performance and outperforms other unified models on most benchmarks-even when prior models may have been trained on more data. Compared to models trained from scratch, such as EMU-3 (Wang et al., 2024b) and Chameleon (Team, 2024), MetaMorph leverages the strengths of the latest pretrained LLMs and achieves competitive understanding and generation performance. MetaMorph highlights that unified models can be developed effectively from pretrained LLMs.

Figure 9 Examples of MetaMorph leveraging LLMs to generate visual tokens. Left : MetaMorph can leverage knowledge from the LLM to generate visual tokens for professional terms that need domain-specific understanding. Right : MetaMorph also avoids common mistakes seen in T2I models that condition on text embeddings (e.g., Stable Diffusion-3.5 8B).
Leveraging LLM Knowledge in Visual Generation
MetaMorph effectively leverages the world knowledge embedded in pre-trained LLMs. We show examples on the left side of Figure 9. We prompt the model to generate concepts requiring non-trivial and specialized knowledge. Examples include 'Chhogori' (the world's second-highest mountain), 'Oncilla' (a small wildcat from South America), and 'Chizarira' (an isolated wilderness area in Zimbabwe).
MetaMorph successfully translates domain-specific knowledge into accurate visual tokens, thereby displaying the ability to leverage world knowledge from LLMs. In contrast, the latest Text-to-Image (T2I) model, Stable Diffusion-3.5 8B, struggles to generate the correct concept despite producing high-quality images. This issue may stem from the text embedding models it uses, CLIP (Radford et al., 2021) and T5 (Roberts et al., 2019), which fail to properly encode these specialized terms (Yuksekgonul et al., 2022).
On the right side of Figure 9, we demonstrate how MetaMorph handles common semantic challenges more effectively than text embedding models such as CLIP and T5. These challenges include negation and subjectivity, using prompts with common failure patterns identified in Multimon (Tong et al., 2024b). MetaMorph differentiates semantic nuances such as 'slightly' versus 'very', 'few' versus 'many', and 'without' versus 'with', which are common failures in existing text-to-image systems.
Reasoning in Multimodal Generation
In Figure 10, we present examples where the model generates images in response to puzzle prompts such as 'The national flag of the country where Yellowstone National Park is located'. For each puzzle, we directly use the prompt 'Generate an image of {puzzle}', without any Chain-of-Thought (CoT) (Wei et al., 2022b) hints in the prompt. MetaMorph generates the correct image from prompts that require multi-step reasoning.
For example, when answering the question 'A musical instrument, this instrument is often played by the scientist who formulated the theory of special relativity', the model needs to implicitly complete three reasoning steps: it identifies Albert Einstein as the scientist who formulated the theory of special relativity, recognizes that his preferred instrument is the violin, and then directly generates correct visual tokens-a violin-without explicitly separating these steps during the generation process. This result implies that MetaMorph implicitly solves the puzzle and generates correct visual tokens immediately following the prompt. These results align with the findings in Physics of LLMs (Ye et al., 2024; Allen-Zhu, 2024), where the authors suggest that LLMs precompute reasoning graphs before autoregressively generating subsequent tokens. Here, we demonstrate that this capability transfers to the unified multimodal model setting even when decoding visual tokens.

Figure 10 Examples of MetaMorph solving reasoning problems in visual generation. We design puzzles that require multi-step reasoning. We include reference logic chains needed to solve the puzzles, and reference solution examples. When prompting each model, we directly feed in the puzzle without any CoT hints or logic chains. MetaMorph has the ability to implicitly solve these puzzles and generate the correct image without explicitly creating or processing a logic chain. It demonstrates that the implicit reasoning skills in text-only LLMs can transfer to unified multimodal models.
Related Work
Instruction tuning and visual instruction tuning. Instruction tuning (Wei et al., 2022a; Taori et al., 2023) finetunes a pretrained LLM to learn the format and style of interaction. This process helps the model to effectively convey the knowledge and capabilities acquired during pretraining (Zhou et al., 2024a). LLaVA (Liu et al., 2023) extends instruction tuning into the multimodal domain. Since then, different lines of work focus on improving data curation (Chen et al., 2023; Laurençon et al., 2024a,b), visual representation (Tong et al., 2024a; Kar et al., 2025; Chen et al., 2024b), and instruction tuning strategies (Gao et al., 2024; Liu et al., 2024b). Using only a few million multimodal instruction tuning data, this line of research (Liu et al., 2024b; Tong et al., 2024a; Li et al., 2024a) has enabled open-source MLLMs to reach performance levels comparable to those of proprietary models (OpenAI, 2024; Anthropic, 2024) on a number of benchmarks (Liu et al., 2024d; Yue et al., 2024a,b) and applications (Zhai et al., 2024; Pan et al., 2024a).
From Multimodal LLMs to unified models. Recent efforts to construct unified models have primarily relied on either extensive pretraining or heavy fine-tuning on billion-scale datasets. Some studies also use continuous embeddings for predicting visual tokens, integrating visual regression losses (Sun et al., 2024b,a) or leveraging diffusion-based methods (Dong et al., 2024). Other approaches (Lu et al., 2022a; Aghajanyan et al., 2022; Team, 2024; Wu et al., 2024b; Liu et al., 2024c; Wang et al., 2024b; Lu et al., 2024) tokenize multimodal data into discrete tokens, which are then trained using autoregressive transformers. Recent research has also explored hybrid strategies that combine autoregressive and diffusion objectives (Zhou et al., 2024b; Xie et al., 2024). Different from previous studies, we demonstrate that unified models can be effectively trained in low-data regimes during instruction tuning, while also providing insights into the reciprocal relationship between visual understanding and visual generation.
Discussion
In this work, we propose VPiT-a simple yet effective extension to visual instruction tuning-that enables LLMs to predict multimodal tokens. VPiT unlocks the use of a more diverse range of instruction tuning data than just visual question answering, such as text-to-image and pure image and video data. Through controlled experiments, we find that visual generation ability emerges as a natural byproduct of improved visual understanding and requires modest additional generation data. In addition, we find that while visual understanding and generation are mutually beneficial, adding more visual understanding data disproportionately improves overall performance compared to adding more generation data.
Leveraging these insights, we train MetaMorph by finetuning LLaMA-3.1 8B with VPiT. With a simple training process, MetaMorph achieves competitive performance in both visual understanding and generation. Qualitative evaluation of our model shows that MetaMorph can leverage world knowledge and reasoning abilities of the base LLM during visual generation. For example, it can perform multimodal tasks that typically require multiple steps of reasoning, such as generating images of specialized proper nouns ( 'Chhogori' ) or solving visual puzzles ( 'generate an image of the animal resulting from a monarch caterpillar's metamorphosis' ). This indicates that LLMs already possess a degree of 'prior' visual knowledge which can be activated with only minimal instruction tuning with VPiT. Overall, LLMs may have a similar representation space as unified and multi-functional models (Huh et al., 2024). We hope the insights from this work inspire more exploration toward developing LLMs for general intelligence.
Training Details and Hyperparameters
We follow the training recipe outlined in prior studies (Tong et al., 2024a; McKinzie et al., 2024), using a two-stage training approach. First, we pretrain a two-layer MLP with a GELU activation (Hendrycks and Gimpel, 2016) as the adapter between the visual tokens and the LLM. We train this adapter on Cambrian adapter data while excluding all data points sourced from LAION (Schuhmann et al., 2022). Next, we finetune the entire model, excluding the vision backbone, using the instruction tuning data described in Section 2.2 and detailed in Appendix C.
We use DeepSpeed (Rajbhandari et al., 2020) ZeRO-3 to train our model on H100 GPUs. Detailed training hyperparameters for all experiments are provided in Table 2. All experiments are trained for one epoch.
Table 2 Implementation details and hyperparameters for all experiments. ∗ We exclude data points in LAION (Schuhmann et al., 2022) from Cambrian adapter data.
Diffusion Visualizer Training
We leverage pretrained diffusion models such as Stable Diffusion 1.5 (Rombach et al., 2022). We use a 2-layer MLP projector to align the SigLIP embedding dimension with the cross-attention dimension in the pretrained diffusion model. The first layer applies a linear transformation to map the input dimension to 2048, followed by layer normalization (Ba et al., 2016) and a ReLU activation. The second layer reduces the 2048-dimensional features to the output dimension through a linear transformation, followed by a final layernorm.
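A sketch of this projector is below. Only the 2048 hidden width and the layer ordering are stated above; the 1152-dimensional SigLIP ViT-SO400M input and the 768-dimensional Stable Diffusion 1.5 cross-attention output are assumptions used for illustration.

```python
import torch.nn as nn


class DiffusionProjector(nn.Module):
    """Two-layer MLP aligning SigLIP tokens with the diffusion model's cross-attention input."""

    def __init__(self, in_dim=1152, hidden_dim=2048, out_dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),   # linear transformation to 2048
            nn.LayerNorm(hidden_dim),        # layer normalization
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),  # reduce to the output dimension
            nn.LayerNorm(out_dim),           # final layernorm
        )

    def forward(self, visual_tokens):        # (B, m, in_dim) -> (B, m, out_dim)
        return self.net(visual_tokens)
```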
We set the batch size to 2112. The learning rate schedule begins with a logarithmic warm-up over the first 2000 steps, gradually increasing from zero to a peak value of 1.1e-5. After this warm-up phase, the learning rate decreases linearly over the next 12000 steps until reaching zero. We use the AdamW (Loshchilov, 2019) optimizer to train our model, with β parameters (0.9, 0.999). We apply a weight decay of 0.01.
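The schedule and optimizer can be expressed with a standard PyTorch LambdaLR; the exact shape of the logarithmic warm-up is not specified above, so log1p is used here as one plausible choice.

```python
import math
import torch


def lr_factor(step, warmup_steps=2000, decay_steps=12000):
    """Multiplier on the peak LR (1.1e-5): logarithmic warm-up, then linear decay to zero."""
    if step < warmup_steps:
        return math.log1p(step) / math.log1p(warmup_steps)
    if step < warmup_steps + decay_steps:
        return 1.0 - (step - warmup_steps) / decay_steps
    return 0.0


params = torch.nn.Linear(8, 8).parameters()  # stand-in for the projector + U-Net parameters
optimizer = torch.optim.AdamW(params, lr=1.1e-5, betas=(0.9, 0.999), weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)
```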
During diffusion training, we freeze the VAE encoder and SigLIP encoder, only training the projector and the diffusion U-Net. The CFG level is set to 0.7. This is because we start with a pretrained diffusion model and aim to transform the conditioning from CLIP text to SigLIP image embeddings. A higher CFG level ensures the model maintains high image quality while gradually adapting to the new conditioning in the remaining fraction. Empirically, this approach achieves the best balance between adaptation and image quality. For the training datasets, since we finetune the diffusion model to condition on SigLIP image embeddings, training this model does not require text descriptions for conditioning. Instead, we use images curated in MetaCLIP (Xu et al., 2024) and train this diffusion model to visualize the visual tokens generated by MetaMorph.
Evaluation Benchmarks
For evaluation, we use nine ImageQA benchmarks (MMBench, SEED, RealWorldQA, MMVP, ScienceQA, MMMU, VStar, ChartQA, TextVQA), one VideoQA benchmark (MV-Bench), and two generation benchmarks (FID and CLIP score on COCO-30K):

Table 3 Comparison of different loss functions. Training with cosine similarity loss enables the model to effectively utilize non-VQA data, which in turn enhances its visual understanding.
Ablation Studies on Visual Prediction Objective
We compare our approach to the commonly used L1 regression loss, which has been widely adopted in contrastive self-supervised learning methods (LeCun, 2022; Bardes et al., 2024). For this comparison, we train MetaMorph, based on LLaMA-3 8B, using datasets described in Section 2.2. We highlight that cosine similarity and L1 loss influence the embedding outputs differently: cosine similarity enforces normalization, while L1 loss does not. This discrepancy in output normalization prevents a direct and fair comparison in terms of generation performance. Consequently, our analysis focuses exclusively on VQA performance.
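A small snippet makes the normalization argument concrete; shapes and values are illustrative only.

```python
import torch
import torch.nn.functional as F

pred = torch.randn(8, 64, 1152)    # predicted visual tokens (illustrative shapes)
target = torch.randn(8, 64, 1152)  # vision-encoder targets

cosine_loss = (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()
l1_loss = F.l1_loss(pred, target)
# Cosine similarity constrains only the direction of each predicted token, while
# L1 also constrains its magnitude, so the two objectives produce outputs with
# systematically different norms.
```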
In Table 3, we compare models trained using L1 loss and cosine similarity loss. Our analysis reveals that training with cosine similarity results in better average performance and outperforms L1 loss on most benchmarks. Notably, these vision loss functions affect only tasks requiring visual predictions and do not directly influence VQA tasks, as the VQA training data does not include image token responses. This improvement is potentially because training with cosine similarity enhances visual generation, which in turn contributes to better visual understanding.
To further investigate, we compare our method, which incorporates a broader range of non-VQA data alongside Cambrian-7M, with a baseline trained exclusively on Cambrian-7M. The results show that combining the broader dataset with cosine similarity loss leads to better performance across multiple benchmarks. This finding reinforces our earlier observations in Section 3: enhancing visual generation capabilities contributes to improved visual understanding, highlighting the benefits of leveraging non-VQA data.

Figure 11 Data composition. Left: The inner circle shows the distribution of MetaMorph data. Right: All the data sources and categories in the MetaMorph data.
Data
Data Composition
We summarize the categorization of data and the number of samples for each source in Figure 11. This diverse dataset is curated to showcase that an LLM can be finetuned across a variety of tasks, where each task contributes to and enhances the performance of others, as discussed in Section 3.1.
Data Preprocessing
As discussed in Section 2.2, we use a wide range of data, spanning from visual question answering tasks to unlabeled video data. Here, we detail the preprocessing steps applied to each data source to convert them into instruction-tuning-style QA conversations.
ImageQA. We use Cambrian-7M (Tong et al., 2024a), a dataset already curated in instruction tuning format. An example entry looks like the following:
Example from ImageQA Prompt: <image_start><image_end> What is the animal in the image? Response: It is a burmilla cat.
VideoQA. We use VideoStar (Zohar et al., 2024) and ShareVideo (Chen et al., 2024a), both curated in an instruction tuning format. For each video, we extract frames at a rate of one frame per second and input these frames into the LLM. An example QA entry for an 8-second video is structured as follows:
Example from VideoQA Prompt: <image_start><image_end> <image_start><image_end> <image_start><image_end> <image_start><image_end> <image_start><image_end> <image_start><image_end> <image_start><image_end> <image_start><image_end> What's the color of the dog in this video? (a) white (b) yellow (c) black Please only answer a single letter and nothing else Response: b
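Frame sampling at one frame per second can be done with any video decoder; the OpenCV-based sketch below is illustrative and not the pipeline used in the paper.

```python
import cv2


def sample_frames(video_path, target_fps=1.0):
    """Return RGB frames sampled at roughly `target_fps` frames per second."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0      # fall back if fps is unknown
    step = max(int(round(native_fps / target_fps)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames
```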
Generation data. We use image-text pairs in MetaCLIP (Xu et al., 2024). The original data consists of images paired with corresponding text descriptions. We add system prompts and define answering formats, transforming the image-text pairs into question-answer formats suitable for instruction tuning.
Prompt: Generate an image of a puppy. Response: Here is an image based on your request: <image_start><image_end> .
Unlike in ImageQA and VideoQA, we require the model to predict the visual tokens in the response.
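A hedged sketch of this conversion is below; the prompt and response templates are made up for illustration (the actual templates are not spelled out here), and the paired image is later encoded into visual tokens placed between the image tags on the response side.

```python
import random

PROMPT_TEMPLATES = ["Generate an image of {}.", "Please create an image showing {}."]
RESPONSE_TEMPLATE = "Here is an image based on your request: <image_start><image_end>."


def caption_to_instruction(caption):
    """Turn a MetaCLIP-style caption into a generation-style QA pair (illustrative)."""
    return {
        "prompt": random.choice(PROMPT_TEMPLATES).format(caption),
        "response": RESPONSE_TEMPLATE,  # visual tokens of the paired image go between the tags
    }


print(caption_to_instruction("a puppy"))
```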
Visual Thinking Data. We explore incorporating vision as part of the model's reasoning process to enhance its answers. As a preliminary step, we experiment with the Visualization-of-Thought (Shao et al., 2024) and VStar (Wu and Xie, 2024) datasets. Originally, these datasets were designed to teach models how to utilize external tools and APIs, such as segmentation or zoom-in cropping. In this work, we aim to integrate these visual skills directly into the model's inference steps. We use system prompts, such as 'think visually before you answer the question', to activate this visual reasoning mode.
Image-to-image data.
Prompt: <image_start><image_end> Make this a rainy day. Response: <image_start><image_end>.
Here, the model is expected to predict the visual tokens of the transformed image as per the specified prompt.
Pure video data. We explore commonly used open-source video datasets in instruction tuning: SomethingSomethingV2 (Goyal et al., 2017a) and HowTo100M (Miech et al., 2019). We design the following tasks from the pure video:
- Forward Frame Prediction. In this task, the model is presented with the initial frame of a video sequence and must predict the subsequent frames at fixed time intervals.
Potential Image Leakage in Testing Data
When selecting data sources, we carefully choose those that do not overlap with the testing sets of our evaluation data, such as COCO (Lin et al., 2014). However, given that the data used in Section 2.2 is composed of numerous sources, some degree of data leakage may be inevitable. As discussed and analyzed in a prior work (Tong et al., 2024a), even when image overlap occurs, it does not necessarily imply that the exact image-question pairs have been encountered during training. Unlike traditional unimodal computer vision research, where an image alone constitutes a data point, the multimodal paradigm treats each image-text (question-answer) pair as a distinct and unique data point.
Table 4 Results of training solely on generation data vs. joint training with additional data. These results correspond to Figure 2. Joint training with additional data significantly improves generation performance. At 5,000 samples, the model begins to generate reasonably accurate visual tokens, indicating that visual generation is an ability unlocked through the learning of other tasks.
Table 5 Impact of joint training 200k generation data with different data types. These results correspond to Figure 3. Among the data types analyzed, joint training with visual understanding data has the most significant impact on enhancing visual generation performance.
Generating Visual Tokens
Here, we include the quantitative results of all the experiments in Section 3.
Results of Samples Needed to Unlock Visual Generation
Table 4 presents the quantitative results corresponding to Figure 2, which examines generation performance under two conditions: training exclusively on generation data and joint training with all other data described in Section 2.2. The results demonstrate that the model can develop the ability for visual generation with a relatively modest amount of data when trained jointly with understanding tasks. In contrast, teaching this skill in isolation requires a substantially larger dataset.
In Table 5, we present the quantitative results corresponding to Figure 3, which investigates the impact of joint training on generation data in combination with various types of data outlined in Section 2.2. The results show that joint training with visual understanding data--specifically ImageQA and VideoQA--provides the most significant improvement in visual generation performance.
Results of Jointly Training Different Understanding and Generation Data
In Table 6, we present the numerical results of joint training with varying scales of understanding data (1M, 4M, 7M) and generation data (200k, 500k, 1M, 2M, 3M, 4M). These findings demonstrate that increasing the amount of understanding data yields more substantial improvements in both understanding tasks (e.g., VQA performance) and generation tasks (e.g., FID and CLIP scores) compared to increasing the amount of generation data.

Table 6 Full results of joint training on varying amounts of VQA data (1M, 4M, 7M) and generation data (200k, 500k, 1M, 2M, 3M, 4M). These results correspond to Figure 4, Figure 5, Figure 7, and Figure 8, which analyze how different combinations of understanding and generation data impact the model's visual understanding and generation performance.
Table 7 Full results of training on different LLMs. We train 7M VQA data and 1M generation data on different LLM backbones (LLaMA-3 8B, LLaMA-3.1 8B, and LLaMA-3 70B) and measure understanding and generation performance.
These results, consistent with our analysis in Section 3.2 and Section 3.3, highlight that understanding data plays a more pivotal role in enhancing performance across both task types.
Results of Training on Different LLMs
We present the results of training with 7M VQA data and 1M generation data across various LLM backbones, including LLaMA-3 8B, LLaMA-3.1 8B, and LLaMA-3 70B. As shown in Table 7, which corresponds to the results in Figure 6, we observe that stronger LLM backbones lead to improvements in both visual understanding and visual generation. These findings further support the conclusion that visual understanding and generation are reciprocal processes, where advancements in one drive enhancements in the other.
Visual-Predictive Instruction Tuning

Visual instruction tuning as introduced by LLaVA (Liu et al., 2023) demonstrates that LLMs can be taught to understand visual inputs. This is achieved by finetuning on million-scale data. The success of late-fusion instruction tuning suggests that LLMs may already possess innate visual understanding ability. This ability simply needs to be unlocked through lightweight finetuning. Analogously, we hypothesize that LLMs already possess a degree of innate visual generation ability which just needs to be unlocked with lightweight finetuning.
Motivated by this, we present Visual-Predictive Instruction Tuning (VPiT, Figure 1)-a simple design which extends existing instruction tuning methods to additionally generate visual tokens rather than text alone. We use the same architecture and next-token prediction paradigm to unlock visual generation capabilities without bells and whistles. We take a pretrained LLM and finetune it to predict both discrete text tokens and continuous visual tokens. The visual tokens can be visualized with an adapted diffusion model.
The standard instruction tuning setup consists of an input sequence of conversation rounds (Wei et al., 2022a; Taori et al., 2023): (P_i, R_i), i = 1, …, N, where P_i and R_i represent the prompt and response for the i-th round of conversation, respectively. The model is trained to generate responses based on the prompt. VPiT adds the following mechanisms to a standard instruction tuning setup to unlock visual understanding and generation.
Tokenizing multimodal data. We extend P_i and R_i to include both text and images. To integrate visual data into a pretrained LLM, we process data closely following visual instruction tuning (Liu et al., 2023):
- Text Data: Text is tokenized into discrete tokens with a standard tokenizer used by the LLM.
- Visual Data: Images are encoded with a pretrained vision encoder such as SigLIP (Zhai et al., 2023). The output is continuous visual tokens which are then interpolated to m = 64 tokens (see the sketch below). To pass the visual tokens as input to the LLM, we apply a trainable projection layer to align the dimensions with the LLM.
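To make the tokenization concrete, the sketch below encodes an image into a fixed number of continuous visual tokens and projects them into the LLM's embedding space. It is a minimal illustration rather than the released implementation: the encoder interface, the 1152/4096 dimensions, and the use of linear interpolation over the token axis are all assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F


class VisualTokenizer(nn.Module):
    """Encode an image into m continuous visual tokens aligned with the LLM width."""

    def __init__(self, vision_encoder, vision_dim=1152, llm_dim=4096, m=64):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a SigLIP ViT returning patch embeddings
        self.m = m
        self.proj = nn.Linear(vision_dim, llm_dim)  # trainable adapter into the LLM space

    def forward(self, images):
        patches = self.vision_encoder(images)            # (B, N, vision_dim)
        # Resample the patch sequence to a fixed m = 64 tokens along the sequence axis.
        tokens = F.interpolate(
            patches.transpose(1, 2), size=self.m, mode="linear", align_corners=False
        ).transpose(1, 2)                                # (B, m, vision_dim)
        return self.proj(tokens)                         # (B, m, llm_dim)
```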
Model architecture. We take a pretrained LLM and finetune it to process arbitrary sequences of text and visual tokens (detailed next in Section 2.2). We keep the original LLM head for text prediction, and attach a separate vision head to the LLM for predicting visual tokens, i.e., the output tokens generated by the vision encoder when processing images. The vision head is a projection layer that projects from the LLM's dimension to the vision encoder's dimension. All response tokens can then be trained and predicted autoregressively, with prompt tokens as context.
Unlike conventional visual instruction tuning, in VPiT, visual tokens are also outputs of the LLM, not just inputs. To make the LLM aware of the presence of visual tokens, we introduce special tokens ⟨image_start⟩ and ⟨image_end⟩ to indicate the boundaries of visual token sequences and when to use the vision head.
Loss functions. The language head outputs a probability distribution over the vocabulary and is trained with cross-entropy loss for next-token prediction. Visual prediction uses cosine similarity loss between the LLM's predicted visual tokens and those from the vision encoder. Consistent with instruction tuning practices, the model only makes predictions and incurs loss on response tokens.
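The objective can be written down compactly. The sketch below applies cross-entropy at text positions and a cosine-similarity loss at visual-token positions, masking out prompt tokens; the equal weighting of the two terms and the mask conventions are assumptions, not values from the paper.

```python
import torch.nn.functional as F


def vpit_loss(text_logits, text_targets, pred_visual, target_visual,
              text_response_mask, visual_response_mask):
    """Sketch of the two-part VPiT objective.

    text_logits:          (B, T, vocab) language-head outputs
    text_targets:         (B, T) next-token ids
    pred_visual:          (B, T, D) vision-head outputs
    target_visual:        (B, T, D) vision-encoder embeddings of the target images
    text_response_mask:   (B, T) floats, 1.0 at text positions inside responses
    visual_response_mask: (B, T) floats, 1.0 at visual-token positions inside responses
    Prompt positions carry no loss; the equal weighting of the two terms is an assumption.
    """
    ce = F.cross_entropy(
        text_logits.flatten(0, 1), text_targets.flatten(), reduction="none"
    ).view_as(text_response_mask)
    text_loss = (ce * text_response_mask).sum() / text_response_mask.sum().clamp(min=1.0)

    cos = F.cosine_similarity(pred_visual, target_visual, dim=-1)   # (B, T)
    visual_loss = ((1.0 - cos) * visual_response_mask).sum() / visual_response_mask.sum().clamp(min=1.0)

    return text_loss + visual_loss
```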
Because VPiT enables the model to predict both text and visual tokens in its responses, it allows the use of a broader range of training data. Traditional visual instruction tuning, on the other hand, primarily relies on question-and-answer pairs. The majority of our dataset is publicly available, and we categorize it into three major categories below. This categorization enables us to systematically study the model, as detailed in Section 3 and Section 4. All data types are formatted as instruction tuning style prompt & response pairs. See further details in Section 9.2.
Visual Understanding Data: This category includes data that takes image(s) or video as input and outputs text responses. See Figure 1 for an example. We use:
ImageQA: Cambrian-7M (Tong et al., 2024a). The model answers questions based on input image(s). P_i ∈ {⟨visual tokens⟩, ⟨text prompt⟩}, R_i ∈ {⟨text response⟩}.
Visual Generation Data: MetaCLIP (Xu et al., 2024). The model predicts visual tokens based on an image description. We use at most 5 million pairs. We curate the data into question-answering formats.
We prompt the model to generate visual tokens with instructions like “Generate an image of…”. The text responses are “Here is an image based on your request…”. See Figure 1 for an example.
Video Data: SomethingSomethingV2 (Goyal et al., 2017b) and HowTo100M (Miech et al., 2019). The model predicts frames in a sequential order. We design different question-answer pairs to probe into the video, such as asking about future frames, past frames, and reordering frames. P_i ∈ {⟨visual tokens⟩, …, ⟨visual tokens⟩, ⟨text prompt⟩}, R_i ∈ {⟨visual tokens⟩, …, ⟨visual tokens⟩}.
Visual Thinking Data: Visualization-of-Thought (Shao et al., 2024) and VStar (Wu and Xie, 2024). The model predicts multimodal tokens in its response before addressing problems. For instance, it predicts a zoomed-in view of an image before generating textual responses.
P_i ∈ {⟨visual tokens⟩, ⟨text prompt⟩}, R_i ∈ {⟨text response⟩, ⟨visual tokens⟩, ⟨text response⟩}.
In the response, the model will output “I will think about it visually”, followed by visual tokens representing a zoomed-in segment of the image, and then proceed to answer the question.
Image-to-Image Data: InstructPix2Pix (Brooks et al., 2023) and Aurora (Krojer et al., 2024). The model generates a transformed image conditioned on a text description and an input image.
Because models trained with VPiT learn to predict continuous visual tokens, we need to map the predicted tokens back into pixel space. We leverage the concept of a “Diffusion Autoencoder” (Bordes et al., 2022; Preechakul et al., 2022; Pan et al., 2024b; Koh et al., 2024; Li et al., 2024c) in which the diffusion model can be adapted to condition on image embeddings rather than text embeddings. Specifically, we finetune an existing diffusion model to condition on outputs from the vision encoder using held-out training data.
At inference time, if the tag token ⟨image_start⟩ is generated, the model begins outputting visual tokens until ⟨image_end⟩. We then plug the generated visual tokens into the diffusion model to visualize the prediction in pixel space. We use standard latent diffusion model training procedures. Details on the hyperparameters and training setup are provided in Section 7.2.
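A decoding loop consistent with this description might look like the sketch below. The `model.step` and `visualizer` interfaces, and the choice to emit a fixed block of 64 visual tokens between the two tag tokens, are hypothetical simplifications rather than the released API.

```python
import torch


@torch.no_grad()
def generate_and_visualize(model, prompt, image_start_id, image_end_id,
                           visualizer, num_visual_tokens=64, max_steps=1024):
    """Emit text ids normally; after <image_start>, switch to the vision head,
    collect a fixed-length block of continuous visual tokens, close the block
    with <image_end>, and hand it to the diffusion visualizer."""
    sequence, images = list(prompt), []
    for _ in range(max_steps):
        text_logits, visual_token = model.step(sequence)  # one decoding step
        next_id = int(text_logits.argmax(-1))
        if next_id == image_start_id:
            sequence.append(image_start_id)
            block = []
            for _ in range(num_visual_tokens):
                _, visual_token = model.step(sequence)
                sequence.append(visual_token)             # feed continuous token back in
                block.append(visual_token)
            sequence.append(image_end_id)
            images.append(visualizer(torch.stack(block)))
        else:
            sequence.append(next_id)
            if next_id == getattr(model, "eos_token_id", None):
                break
    return sequence, images
```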
We study several questions about the effects and synergy of visual understanding and generation under our VPiT framework.
We use 9 ImageQA benchmarks (MMBench, Seed, VStar, MMVP, MMMU, ChartQA, TextVQA, ScienceQA, RealWorldQA) to evaluate different aspects of the model. For image generation, we use the finetuned diffusion model to visualize generated visual tokens and measure FID score (lower is better) and CLIP score (higher is better) on the COCO-30K dataset. Unless otherwise specified, we use LLaMA-3 8B (AI@Meta, 2024) / SigLIP ViT-SO400M-14@384 (Zhai et al., 2023) as the pretrained LLM / vision encoder. We also study the effect of different LLMs in Section 3.2. We use instruction tuned versions of the LLMs. We pretrain the adapter between the vision encoder and the LLM following visual instruction tuning (Liu et al., 2023, 2024a). For experiments in this section, we provide training details in Section 7 and include the full results in Section 8.
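As an illustration of how such an evaluation loop could be set up with off-the-shelf metrics (torchmetrics assumed; the paper's exact FID and CLIP-score implementations may differ):

```python
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

fid = FrechetInceptionDistance(feature=2048)
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")


def update_metrics(real_images, generated_images, captions):
    """Accumulate FID and CLIP score over a batch of uint8 (B, 3, H, W) images."""
    fid.update(real_images, real=True)
    fid.update(generated_images, real=False)
    clip_score.update(generated_images, captions)


# After iterating over the COCO-30K evaluation set:
# print(f"FID: {fid.compute():.2f}  CLIP score: {clip_score.compute():.2f}")
```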
We start by investigating the number of image-text samples required to teach a language model to generate high-quality visual tokens. To this end, we randomly sample {1k, 5k, 10k, 50k, 200k, 1M, 3M, 5M} image-text pairs from our generation data (MetaCLIP dataset (Xu et al., 2024)). We explore two settings: (1) finetuning the LLM using only visual generation data, and (2) joint training visual generation with visual understanding and the rest of data types described in Section 2.2.
In Figure 2, we see that training solely on visual generation performs significantly worse than joint training with all other data. With over 3 million image-text pairs, the model struggles to generate high-quality visual images (~40 FID score), and performance remains inferior to joint training with 5 million pairs. This suggests that training solely on visual generation data is significantly less sample efficient. This finding aligns with a prior study (Zhang et al., 2023) which also suggests that LLMs cannot be easily tuned to generate visual tokens when trained with only generation data. In contrast, joint training with other datasets substantially improves generation performance. The model generates effective visual tokens with just 5k generation data, and performance stabilizes around 200k samples. This indicates that visual generation is not an orthogonal capability but rather an ability that benefits from other tasks and emerges more effectively with joint training.
To better understand how each type of data contributes to visual generation, we conduct a controlled experiment using 200k visual generation data, joint training individually with each data type defined in Section 2.2. We also compare them with training all the data together. We show results in Figure 3. While all data types enhance the model's visual generation, the degree of improvement varies. Visual understanding data, such as ImageQA and VideoQA, significantly boost the model's visual generation capabilities, even when the amount of generation data is kept constant at 200k. This indicates a strong link between the ability to understand visual content and generate visual tokens. Additionally, combining all data types in training further improves performance, suggesting that the benefits from different data types can be additive.
Building upon findings from the previous subsection, we perform a controlled experiment to investigate how visual understanding ability correlates with visual generation ability. We ablate our model using a fixed set of 200k generation data while varying VQA data from 1M to 7M samples from Cambrian-7M to develop different levels of visual understanding. The results presented in Figure 5 indicate that stronger VQA ability correlates with better generation performance.
Here, we investigate the reverse direction: does enhancing the model's visual generation capability also relate to higher VQA performance? To explore this, we conduct a controlled experiment using 1M fixed VQA samples as the baseline for understanding. We then vary the amount of generation data ({200k, 500k, 1M, 2M, 3M, 4M}) to adjust generation capacity while joint training with the fixed 1M VQA data. We present results in Figure 5. Within the 1M VQA setting, stronger generation ability is correlated with improved VQA performance. This implies that increasing the amount of generation data not only enhances generation but also positively impacts VQA performance.
We examine whether the findings transfer across various LLM backbones. Using a data composition of 7M VQA samples and 1M generation data, we train VPiT on LLaMA-3 8B, LLaMA-3.1 8B, and LLaMA-3 70B. Figure 6 shows the scaling behavior across different LLMs.
We investigate whether understanding and generation data contribute equally. Here, we jointly train different scales of VQA data {1M, 4M, 7M} and generation data {200k, 500k, 1M, 2M, 3M, 4M}. Figure 7 summarizes these findings, with the x-axis representing VQA data, and the y-axis representing generation data. Results are visualized on heatmaps using darker colors for better performance.
The results indicate that increasing VQA data yields the most significant improvements in all three metrics. When VQA data is relatively low (1M), increases in generation data lead to noticeable improvements, as reflected by the gradual darkening in the plot. However, as the VQA data scales up (from 1M to 4M to 7M), the impact of VQA data becomes more pronounced, demonstrated by a sharp color transition in the heatmap. Ultimately, with 7M VQA data, increases in generation data contribute minimally. These results demonstrate the critical role of understanding data in enhancing both understanding and generation performance.
Given the diverse nature of understanding tasks such as OCR, Vision-Centric tasks, and Knowledge-based tasks, we investigate which tasks most strongly correlate with generation ability. Inspired by Cambrian-1, we categorize VQA tasks into five groups: General, Text&Chart, High-Resolution, Knowledge, and Vision-Centric VQA. Using the results from our earlier experiments, which jointly train various VQA data scales with different amounts of generation data, we plot each benchmark’s VQA performance against generation performance in Figure 8. We also calculate the Pearson correlation (ρ) between VQA scores and FID/CLIP scores.
Figure 8 shows that General, Vision-Centric, and Text&Chart VQA tasks strongly correlate with generation performance, each with a Pearson correlation coefficient (ρ) above 0.85. High-Resolution VQA exhibits moderate correlation, with ρ around 0.7. In contrast, Knowledge VQA tasks, such as MMMU, show weak correlation with generation performance. These findings suggest that generation ability aligns more closely with the model’s vision capabilities than with knowledge-specific tasks.
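To make this analysis concrete, the sketch below computes a Pearson ρ between average VQA scores and CLIP scores using a handful of values from Table 6; it only illustrates the procedure and is not a reproduction of the per-category analysis in Figure 8.

```python
import numpy as np
from scipy import stats

# Average VQA scores and CLIP scores taken from a few rows of Table 6,
# used here only to illustrate how the Pearson correlation is computed.
avg_vqa    = np.array([46.4, 48.2, 49.1, 53.8, 54.2, 55.8])
clip_score = np.array([15.2, 15.9, 16.5, 20.5, 24.8, 26.6])

rho, p_value = stats.pearsonr(avg_vqa, clip_score)
print(f"Pearson rho = {rho:.2f} (p = {p_value:.3g})")
```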
Based on the insights in Section 3, we train our unified model, MetaMorph, based on LLaMA-3.1 8B (AI@Meta, 2024), using VPiT with the data curated in Section 2.2. We present our experimental results in three parts: quantitative performance (Section 4.1), evidence of MetaMorph leveraging LLM knowledge in visual generation (Section 4.2), and implicit reasoning skills in multimodal contexts (Section 4.3).
Table 1: Comparison of MetaMorph with visual understanding, visual generation, and unified models.

| Method | Base LLM | MMBench^EN | SEED | RealworldQA | MMVP | SQA | MMMU | VStar | ChartQA | TextVQA | MV-Bench | COCO (FID) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Visual Understanding Only | | | | | | | | | | | | |
| GPT-4V* | - | 75.8 | 69.1 | 61.4 | 50.0 | 75.7 | 56.8 | 55.0 | 78.5 | 78.0 | 43.5 | - |
| Visual Generation Only | | | | | | | | | | | | |
| Stable Diffusion 1.5* | - | - | - | - | - | - | - | - | - | - | - | 9.6 |
| Dalle 2* | - | - | - | - | - | - | - | - | - | - | - | 10.4 |
| Imagen* | - | - | - | - | - | - | - | - | - | - | - | 7.3 |
| Unified Models | | | | | | | | | | | | |
| EMU-3* | - | 58.5 | 68.2 | 57.4 | 36.6† | 89.2 | 31.6 | 51.8† | 68.6 | 64.7 | - | 12.8 |
| Janus* | DeepSeek 1.3B | 69.4 | 63.7 | - | - | - | 30.5 | - | - | - | - | 8.5 |
| VILA-U_256† | LLaMA-2 7B | 66.6 | 57.1 | 46.6 | 22.0 | 67.1 | 32.2 | 38.7 | 11.4 | 48.3* | 40.8 | 19.6 |
| Transfusion* | - | - | - | - | - | - | - | - | - | - | - | 6.7 |
| Chameleon-7B† | - | 35.7 | 27.2 | 19.6 | 0.0 | 50.3 | 28.4 | 37.1 | 0.0 | 0.0 | - | 26.7* |
| MetaMorph (ours) | LLaMA-3.1 8B | 75.2 | 71.8 | 58.3 | 48.3 | 83.2 | 41.8 | 44.0 | 37.1 | 60.5 | 48.8 | 11.8 |
We compare MetaMorph with other unified models and summarize results in Table 1. Since these models are trained on different datasets and base LLMs (or pretrained from scratch), an apples-to-apples comparison is difficult. Nevertheless, MetaMorph demonstrates competitive performance and outperforms other unified models on most benchmarks—even when prior models may have been trained on more data. Compared to models trained from scratch, such as EMU-3 (Wang et al., 2024b) and Chameleon (Team, 2024), MetaMorph leverages the strengths of the latest pretrained LLMs and achieves competitive understanding and generation performance. MetaMorph highlights that unified models can be developed effectively from pretrained LLMs.
MetaMorph effectively leverages the world knowledge embedded in pretrained LLMs. We show examples on the left side of Figure 9. We prompt the model to generate concepts requiring non-trivial and specialized knowledge. Examples include “Chhogori” (the world’s second-highest mountain), “Oncilla” (a small wildcat from South America), and “Chizarira” (an isolated wilderness area in Zimbabwe).
MetaMorph successfully translates domain-specific knowledge into accurate visual tokens, thereby displaying the ability to leverage world knowledge from LLMs. In contrast, the latest Text-to-Image (T2I) model, Stable Diffusion-3.5 8B, struggles to generate the correct concept despite producing high-quality images. This issue may stem from the text embedding models it uses, CLIP (Radford et al., 2021) and T5 (Roberts et al., 2019), which fail to properly encode these specialized terms (Yuksekgonul et al., 2022).
On the right side of Figure 9, we demonstrate how MetaMorph handles common semantic challenges more effectively than text embedding models such as CLIP and T5. These challenges include negation and subjectivity, using prompts with common failure patterns identified in Multimon (Tong et al., 2024b). MetaMorph differentiates semantic nuances such as “slightly” versus “very”, “few” versus “many”, and “without” versus “with”, which are common failures in existing text-to-image systems.
In Figure 10, we present examples where the model generates images in response to puzzle prompts such as “The national flag of the country where Yellowstone National Park is located”. For each puzzle, we directly use the prompt “Generate an image of {puzzle}”, without any Chain-of-Thought (CoT) (Wei et al., 2022b) prompting. MetaMorph generates the correct image from prompts that require multi-step reasoning.
For example, when answering the question “A musical instrument, this instrument is often played by the scientist who formulated the theory of special relativity”, the model needs to implicitly complete three reasoning steps: it identifies Albert Einstein as the scientist who formulated the theory of special relativity, recognizes that his preferred instrument is the violin, and then directly generates the correct visual tokens for a violin, without explicitly separating these steps during the generation process. This result implies that MetaMorph implicitly solves the puzzle and generates the correct visual tokens immediately following the prompt. These results align with the findings in Physics of LLMs (Ye et al., 2024; Allen-Zhu, 2024), where the authors suggest that LLMs precompute reasoning graphs before autoregressively generating subsequent tokens. Here, we demonstrate that this capability transfers to the unified multimodal model setting even when decoding visual tokens.
Instruction tuning (Wei et al., 2022a; Taori et al., 2023) finetunes a pretrained LLM to learn the format and style of interaction. This process helps the model effectively convey the knowledge and capabilities acquired during pretraining (Zhou et al., 2024a). LLaVA (Liu et al., 2023) extends instruction tuning into the multimodal domain. Since then, different lines of work have focused on improving data curation (Chen et al., 2023; Laurençon et al., 2024a, b), visual representation (Tong et al., 2024a; Kar et al., 2025; Chen et al., 2024b), and instruction tuning strategies (Gao et al., 2024; Liu et al., 2024b). Using only a few million multimodal instruction tuning samples, this line of research (Liu et al., 2024b; Tong et al., 2024a; Li et al., 2024a) has enabled open-source MLLMs to reach performance levels comparable to those of proprietary models (OpenAI, 2024; Anthropic, 2024) on a number of benchmarks (Liu et al., 2024d; Yue et al., 2024a, b) and applications (Zhai et al., 2024; Pan et al., 2024a).
Recent efforts to construct unified models have primarily relied on either extensive pretraining or heavy fine-tuning on billion-scale datasets. Some studies also use continuous embeddings for predicting visual tokens, integrating visual regression losses (Sun et al., 2024b, a) or leveraging diffusion-based methods (Dong et al., 2024). Other approaches (Lu et al., 2022a; Aghajanyan et al., 2022; Team, 2024; Wu et al., 2024b; Liu et al., 2024c; Wang et al., 2024b; Lu et al., 2024) tokenize multimodal data into discrete tokens, which are then trained using autoregressive transformers. Recent research has also explored hybrid strategies that combine autoregressive and diffusion objectives (Zhou et al., 2024b; Xie et al., 2024). Different from previous studies, we demonstrate that unified models can be effectively trained in low-data regimes during instruction tuning, while also providing insights into the reciprocal relationship between visual understanding and visual generation.
In this work, we propose VPiT—a simple yet effective extension to visual instruction tuning—that enables LLMs to predict multimodal tokens. VPiT unlocks the use of a more diverse range of instruction tuning data than visual question answering alone, such as text-to-image data and pure image and video data. Through controlled experiments, we find that visual generation ability emerges as a natural byproduct of improved visual understanding and requires only a modest amount of additional generation data. In addition, we find that while visual understanding and generation are mutually beneficial, adding more visual understanding data disproportionately improves overall performance compared to adding more generation data.
Leveraging these insights, we train MetaMorph by finetuning LLaMA-3.1 8B with VPiT. With a simple training process, MetaMorph achieves competitive performance in both visual understanding and generation. Qualitative evaluation shows that MetaMorph can leverage the world knowledge and reasoning abilities of the base LLM during visual generation. For example, it can perform multimodal tasks that typically require multiple steps of reasoning, such as generating images of specialized proper nouns (“Chhogori”) or solving visual puzzles (“generate an image of the animal resulting from a monarch caterpillar’s metamorphosis”). This indicates that LLMs already possess a degree of “prior” visual knowledge that can be activated with only minimal instruction tuning via VPiT. Overall, LLMs may share a representation space similar to that of unified, multi-functional models (Huh et al., 2024). We hope the insights from this work inspire further exploration toward developing LLMs for general intelligence.
We follow the training recipe outlined in prior studies (Tong et al., 2024a; McKinzie et al., 2024), using a two-stage training approach. First, we pretrain a two-layer MLP with a GELU activation (Hendrycks and Gimpel, 2016) as the adapter between the visual tokens and the LLM. We train this adapter on Cambrian adapter data while excluding all data points sourced from LAION (Schuhmann et al., 2022). Next, we finetune the entire model, excluding the vision backbone, using the instruction tuning data described in Section 2.2 and detailed in Section 9.
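A minimal PyTorch sketch of such an adapter is shown below. The SigLIP width (1152) and LLM hidden size (4096) are assumptions used for illustration; the exact dimensions are not specified in this section.

```python
import torch
import torch.nn as nn

class VisualTokenAdapter(nn.Module):
    """Two-layer MLP with GELU that maps vision-encoder tokens into the LLM
    embedding space (dimensions below are illustrative assumptions)."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, vision_dim) -> (batch, num_tokens, llm_dim)
        return self.proj(visual_tokens)
```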
We use DeepSpeed (Rajbhandari et al., 2020) ZeRO-3 to train our model on H100 GPUs. Detailed training hyperparameters for all experiments are provided in Table 2. All experiments are trained for one epoch.
Table 2: Training hyperparameters for the adapter and instruction tuning stages across experiments.

| Experiment | LLM | Adapter Data | Instruction Tuning Data | Adapter lr | Adapter wd | Adapter bs | Instruction Tuning lr | Instruction Tuning wd | Instruction Tuning bs |
|---|---|---|---|---|---|---|---|---|---|
| Section 3 (LLaMA-3 8B) | LLaMA-3 8B | Cambrian Adapter Data∗ | Section 3 Experiment Setting | 4.90e-5 | 0.0 | 768 | 6.93e-5 | 0 | 1536 |
| Section 3 (LLaMA-3.1 8B) | LLaMA-3.1 8B | Cambrian Adapter Data∗ | Section 3 Experiment Setting | 4.90e-5 | 0.0 | 768 | 6.93e-5 | 0 | 1536 |
| Section 3 (LLaMA-3 70B) | LLaMA-3 70B | Cambrian Adapter Data∗ | Section 3 Experiment Setting | 4.90e-5 | 0.0 | 768 | 4.90e-5 | 0 | 768 |
| MetaMorph | LLaMA-3.1 8B | Cambrian Adapter Data∗ | All Data from Section 2.2 | 4.90e-5 | 0.0 | 768 | 6.93e-5 | 0 | 1536 |
We leverage pretrained diffusion models such as Stable Diffusion 1.5 (Rombach et al., 2022). We use a two-layer MLP projector to align the SigLIP embedding dimension with the cross-attention dimension of the pretrained diffusion model. The first layer applies a linear transformation mapping the input dimension to 2048, followed by layer normalization (Ba et al., 2016) and a ReLU activation. The second layer reduces the 2048-dimensional features to the output dimension through a linear transformation, followed by a final layer normalization.
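The projector described above can be sketched as follows; the SigLIP input width and the 768-dimensional cross-attention conditioning of Stable Diffusion 1.5 are assumed values for illustration.

```python
import torch
import torch.nn as nn

class SigLIPToDiffusionProjector(nn.Module):
    """Linear -> LayerNorm -> ReLU -> Linear -> LayerNorm, as described above.
    Input/output widths are illustrative assumptions."""

    def __init__(self, siglip_dim: int = 1152, hidden_dim: int = 2048, cond_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(siglip_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, cond_dim),
            nn.LayerNorm(cond_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, num_tokens, siglip_dim) -> (batch, num_tokens, cond_dim)
        return self.net(visual_tokens)
```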
We set the batch size to 2112. The learning rate schedule begins with a logarithmic warm-up over the first 2000 steps, gradually increasing from zero to a peak value of 1.1e-5. After this warm-up phase, the learning rate decreases linearly over the next 12000 steps until reaching zero. We use the AdamW (Loshchilov, 2019) optimizer with β parameters (0.9, 0.999) and a weight decay of 0.01.
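A sketch of this optimizer and schedule is given below using a LambdaLR multiplier. The exact shape of the logarithmic warm-up is not specified, so the log(step+1)/log(warmup+1) curve here is one plausible reading; only the warm-up length, peak, and linear decay length come from the description above.

```python
import math
import torch

WARMUP_STEPS, DECAY_STEPS, PEAK_LR = 2000, 12000, 1.1e-5

def lr_multiplier(step: int) -> float:
    """Log-shaped warm-up to the peak, then linear decay to zero
    (the warm-up shape is an assumption; only its endpoints are specified)."""
    if step < WARMUP_STEPS:
        return math.log(step + 1) / math.log(WARMUP_STEPS + 1)
    decay_progress = (step - WARMUP_STEPS) / DECAY_STEPS
    return max(0.0, 1.0 - decay_progress)

model = torch.nn.Linear(2048, 768)  # placeholder for the trainable projector / U-Net parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR,
                              betas=(0.9, 0.999), weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_multiplier)
```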
During diffusion training, we freeze the VAE encoder and the SigLIP encoder, training only the projector and the diffusion U-Net. The CFG level is set to 0.7. This is because we start with a pretrained diffusion model and aim to transform the conditioning from CLIP text embeddings to SigLIP image embeddings. A higher CFG level ensures the model maintains high image quality while gradually adapting to the new conditioning in the remaining fraction of training samples. Empirically, this approach achieves the best balance between adaptation and image quality. For the training data, since we finetune the diffusion model to condition on SigLIP image embeddings, training does not require text descriptions for conditioning. Instead, we use images curated in MetaCLIP (Xu et al., 2024) and train this diffusion model to visualize the visual tokens generated by MetaMorph.
For evaluation, we use nine ImageQA benchmarks, one VideoQA benchmark, and two generation benchmarks:
MMBench (Liu et al., 2024d): A comprehensive benchmark spanning 20 multimodal ability dimensions.
VStar (Wu and Xie, 2024): A VQA benchmark designed to test detail understanding in high-resolution images.
MMMU (Yue et al., 2024a): A benchmark designed to evaluate multimodal models on extensive multi-discipline tasks requiring college-level subject knowledge and deliberate reasoning.
ScienceQA (Lu et al., 2022b): A multimodal benchmark for answering science-related questions requiring integration of visual and textual data.
FID Score (Heusel et al., 2017): A metric for evaluating the quality of generated images by comparing their feature distributions with real images.
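For reference, FID is the Fréchet distance between Gaussian fits of the Inception feature distributions of real and generated images (lower is better): $\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$, where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the feature means and covariances of real and generated images, respectively.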
We compare our approach to the commonly used L1 regression loss, which has been widely adopted in self-supervised visual representation learning methods (LeCun, 2022; Bardes et al., 2024). For this comparison, we train MetaMorph, based on LLaMA-3 8B, using the datasets described in Section 2.2. We highlight that cosine similarity and L1 loss influence the embedding outputs differently: cosine similarity enforces normalization, while L1 loss does not. This discrepancy in output normalization prevents a direct and fair comparison of generation performance. Consequently, our analysis focuses exclusively on VQA performance.
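A minimal sketch of the two regression objectives on predicted continuous visual tokens is shown below; it illustrates the normalization difference rather than reproducing the exact training code, and the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def visual_token_loss(pred: torch.Tensor, target: torch.Tensor, kind: str = "cosine") -> torch.Tensor:
    """Regression loss between predicted and ground-truth continuous visual tokens.
    pred, target: (batch, num_tokens, dim). 'cosine' compares directions only,
    implicitly normalizing the outputs, while 'l1' also penalizes magnitude."""
    if kind == "cosine":
        return (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()
    if kind == "l1":
        return F.l1_loss(pred, target)
    raise ValueError(f"unknown loss kind: {kind}")
```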
In Table 3, we compare models trained using L1 loss and cosine similarity loss. Our analysis reveals that training with cosine similarity results in better average performance and outperforms L1 loss on most benchmarks. Notably, these vision loss functions affect only tasks requiring visual predictions and do not directly influence VQA tasks, as the VQA training data does not include image token responses. This improvement is potentially because training with cosine similarity enhances visual generation, which in turn contributes to better visual understanding.
To further investigate, we compare our method, which incorporates a broader range of non-VQA data alongside Cambrian-7M, with a baseline trained exclusively on Cambrian-7M. The results show that combining the broader dataset with the cosine similarity loss leads to better performance across multiple benchmarks. This finding reinforces our earlier observations in Section 3: enhancing visual generation capabilities contributes to improved visual understanding, highlighting the benefits of leveraging non-VQA data.
We summarize the categorization of data and the number of samples for each source in Figure 11. This diverse dataset is curated to showcase that an LLM can be finetuned across a variety of tasks, where each task contributes to and enhances the performance of others, as discussed in Section 3.1.
As discussed in Section 2.2, we use a wide range of data, spanning from visual question answering tasks to unlabeled video data. Here, we detail the preprocessing steps applied to each data source to convert them into instruction-tuning-style QA conversations.
We use Cambrian-7M (Tong et al., 2024a), a dataset already curated in instruction tuning format. An example entry is shown below:
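A hypothetical entry of this form might look like the sketch below; the field names, file path, and question-answer content are illustrative rather than actual Cambrian-7M records.

```python
# Hypothetical ImageQA entry in instruction-tuning format (field names, path,
# and QA content are illustrative, not an actual Cambrian-7M record).
imageqa_entry = {
    "image": "images/000000123456.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat color is the bus in the picture?"},
        {"from": "gpt", "value": "The bus is red."},
    ],
}
```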
We use VideoStar (Zohar et al., 2024) and ShareVideo (Chen et al., 2024a), both curated in an instruction tuning format. For each video, we extract frames at a rate of one frame per second and input these frames into the LLM. An example QA entry for an 8-second video is structured as follows:
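A hypothetical entry for such a clip might look like the following; the frame paths and conversation content are illustrative.

```python
# Hypothetical VideoQA entry for an 8-second clip sampled at 1 frame per second;
# one <image> placeholder per extracted frame (paths and content are illustrative).
videoqa_entry = {
    "frames": [f"video_0001/frame_{t:02d}s.jpg" for t in range(8)],
    "conversations": [
        {"from": "human", "value": "<image>" * 8 + "\nWhat is the person doing in this video?"},
        {"from": "gpt", "value": "The person is slicing vegetables on a cutting board."},
    ],
}
```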
We use image-text pairs in MetaCLIP (Xu et al., 2024). The original data consists of images paired with corresponding text descriptions. We add system prompts and define answering formats, transforming the image-text pairs into question-answer formats suitable for instruction tuning.
Unlike in ImageQA and VideoQA, we require the model to predict the visual tokens in the response.
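A hypothetical generation entry might look like the sketch below, where the `<image>` placeholder in the response stands for the continuous visual tokens the model must predict; the prompt wording, answering format, and path are illustrative.

```python
# Hypothetical text-to-image entry: the assistant response contains visual tokens
# to be predicted, denoted by an <image> placeholder (wording and path illustrative).
generation_entry = {
    "image": "metaclip/shard_042/img_981.jpg",  # target image supplying ground-truth visual tokens
    "conversations": [
        {"from": "human", "value": "Generate an image of a snow-covered mountain at sunrise."},
        {"from": "gpt", "value": "Here is an image of a snow-covered mountain at sunrise: <image>"},
    ],
}
```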
We explore incorporating vision as part of the model’s reasoning process to enhance its answers. As a preliminary step, we experiment with the Visualization-of-Thought (Shao et al., 2024) and VStar (Wu and Xie, 2024) datasets. Originally, these datasets were designed to teach models how to utilize external tools and APIs, such as segmentation or zoom-in cropping. In this work, we aim to integrate these visual skills directly into the model’s inference steps. We use system prompts, such as “think visually before you answer the question”, to activate this visual reasoning mode. Here is an example:
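A hypothetical entry of this kind might look like the following; the system prompt follows the wording above, while the path, question, and intermediate visual step are illustrative.

```python
# Hypothetical "visual thinking" entry: the response first contains intermediate
# visual tokens (e.g., a zoomed-in crop) and then the final answer (illustrative).
visual_thinking_entry = {
    "image": "vstar/sa_22871.jpg",
    "conversations": [
        {"from": "system", "value": "Think visually before you answer the question."},
        {"from": "human", "value": "<image>\nWhat is written on the small sign near the door?"},
        {"from": "gpt", "value": "<image>\nThe sign says 'Staff Only'."},
    ],
}
```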
The model is required to predict a transformed image based on a given text description. We use Aurora (Krojer et al., 2024) and InstructPix2Pix (Brooks et al., 2023) datasets to train this capability. An example is presented below:
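A hypothetical image-to-image entry might look like the sketch below; the paths and edit instruction are illustrative.

```python
# Hypothetical image-to-image entry (InstructPix2Pix/Aurora style): the response
# is the transformed image, again predicted as visual tokens (paths illustrative).
image_to_image_entry = {
    "source_image": "edits/inputs/000321.jpg",
    "target_image": "edits/outputs/000321.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nMake this photo look like it was taken at night."},
        {"from": "gpt", "value": "<image>"},
    ],
}
```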
We explore commonly used open-source video datasets in instruction tuning: SomethingSomethingV2 (Goyal et al., 2017a) and HowTo100M (Miech et al., 2019). We design the following tasks from the pure video:
- Forward Frame Prediction. In this task, the model is presented with the initial frame of a video sequence and must predict the subsequent frames at fixed time intervals. An example is presented below:
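A hypothetical forward-frame-prediction entry might look like the following; the frame paths, time offsets, and prompt wording are illustrative.

```python
# Hypothetical forward-frame-prediction entry: given the first frame, the model
# predicts frames at fixed time offsets as visual tokens (paths/offsets illustrative).
frame_prediction_entry = {
    "frames": ["clip_77/frame_00s.jpg", "clip_77/frame_02s.jpg", "clip_77/frame_04s.jpg"],
    "conversations": [
        {"from": "human", "value": "<image>\nPredict how this scene looks 2 seconds and 4 seconds later."},
        {"from": "gpt", "value": "<image><image>"},
    ],
}
```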
Each task is designed to train the model’s temporal understanding and visual reasoning capabilities.
When selecting data sources, we carefully choose those that do not overlap with the testing sets of our evaluation data, such as COCO (Lin et al., 2014). However, given that the data used in Section 2.2 is composed of numerous sources, some degree of data leakage may be inevitable. As discussed and analyzed in a prior work (Tong et al., 2024a), even when image overlap occurs, it does not necessarily imply that the exact image-question pairs have been encountered during training. Unlike traditional unimodal computer vision research, where an image alone constitutes a data point, the multimodal paradigm treats each image-text (question-answer) pair as a distinct and unique data point.
Here, we include the quantitative results of all the experiments in Section 3.
Table 4 presents the quantitative results corresponding to Figure 3, which examines generation performance under two conditions: training exclusively on generation data and joint training with all other data described in Section 2.2. The results demonstrate that the model can develop the ability for visual generation with a relatively modest amount of data when trained jointly with understanding tasks. In contrast, teaching this skill in isolation requires a substantially larger dataset.
In Table 5, we present the quantitative results corresponding to Figure 3, which investigates the impact of joint training on generation data in combination with the various types of data outlined in Section 2.2. The results show that joint training with visual understanding data, specifically ImageQA and VideoQA, provides the most significant improvement in visual generation performance.
In Table 6, we present the numerical results of joint training with varying scales of understanding data (1M, 4M, 7M) and generation data (200k, 500k, 1M, 2M, 3M, 4M). These findings demonstrate that increasing the amount of understanding data yields more substantial improvements in both understanding tasks (e.g., VQA performance) and generation tasks (e.g., FID and CLIP scores) than increasing the amount of generation data. These results, consistent with our analysis in Section 3.2 and Section 3.3, highlight that understanding data plays a more pivotal role in enhancing performance across both task types.
We present the results of training with 7M VQA data and 1M generation data across various LLM backbones, including LLaMA-3 8B, LLaMA-3.1 8B, and LLaMA-3 70B. As shown in Table 7, which corresponds to the results in Figure 7, we observe that stronger LLM backbones lead to improvements in both visual understanding and visual generation. These findings further support the conclusion that visual understanding and generation are reciprocal processes, where advancements in one drive enhancements in the other.
We provide additional examples of MetaMorph in Figure 12 and Figure 13. These examples illustrate how MetaMorph extends beyond the capabilities of typical MLLMs by leveraging learned skills to perform novel tasks such as visual reasoning and visual transformation.
Table 3: Comparison of different loss functions. Training with cosine similarity loss enables the model to effectively utilize non-VQA data, which in turn enhances its visual understanding.
| Loss | AVG | MMBench^EN | SEED | RealworldQA | MMVP | SQA | MMMU | VStar | ChartQA | TextVQA |
|---|---|---|---|---|---|---|---|---|---|---|
| None (VQA Only) | 55.50 | 73.11 | 69.96 | 55.69 | 41.33 | 80.39 | 37.29 | 46.60 | 35.16 | 59.96 |
| L1 Loss | 53.83 | 72.17 | 69.28 | 57.25 | 34.67 | 79.00 | 34.00 | 45.55 | 32.40 | 60.17 |
| Cosine Sim | 55.93 | 73.78 | 71.36 | 55.03 | 44.00 | 79.83 | 35.29 | 47.64 | 36.60 | 59.79 |
Table 4: Results of training solely on generation data vs. joint training with additional data. These results correspond to Figure 3. Joint training with additional data significantly improves generation performance. At 5,000 samples, the model begins to generate reasonably accurate visual tokens, indicating that visual generation is an ability unlocked through the learning of other tasks.
| Joint train With Other Data | # of Generation Data | FID Score |
|---|---|---|
| Yes | 1k | 68.5 |
| No | 1k | 115.0 |
| Yes | 5k | 19.2 |
| No | 5k | 116.4 |
| Yes | 10k | 18.7 |
| No | 10k | 111.0 |
| Yes | 50k | 17.1 |
| No | 50k | 111.8 |
| Yes | 200k | 15.2 |
| No | 200k | 110.7 |
| Yes | 200k | 14.7 |
| No | 200k | 93.7 |
| Yes | 1M | 14.4 |
| No | 1M | 52.8 |
| Yes | 3M | 15.1 |
| No | 3M | 39.2 |
| Yes | 5M | 14.3 |
| No | 5M | 27.7 |
Table 5: Impact of jointly training 200k generation data with different data types. These results correspond to Figure 3. Among the data types analyzed, joint training with visual understanding data has the most significant impact on enhancing visual generation performance.
| Joint training Data | Data Type | FID Score | CLIP Score |
|---|---|---|---|
| None | - | 110.5 | 5.7 |
| Image-to-Image | Other Visual Data | 97.5 | 6.4 |
| Visual Thinking | Other Visual Data | 93.5 | 6.5 |
| Pure Video | Other Visual Data | 84.7 | 8.1 |
| VideoQA | Visual Understanding Data | 26.5 | 16.1 |
| ImageQA | Visual Understanding Data | 18.9 | 22.0 |
Table 6: Full results of joint training on varying amounts of VQA data (1M, 4M, 7M) and generation data (200k, 500k, 1M, 2M, 3M, 4M). These results correspond to Figure 5, Figure 7, and Figure 8, which analyze how different combinations of understanding and generation data impact the model’s visual understanding and generation performance.
| # of VQA Data | # of Generation Data | Average | MMBench^EN | SEED | RealworldQA | MMVP | SQA | MMMU | VStar | ChartQA | TextVQA | FID Score | CLIP Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1M | 200k | 46.4 | 60.0 | 62.2 | 50.3 | 24.0 | 80.0 | 38.4 | 37.4 | 16.4 | 48.8 | 28.3 | 15.2 |
| 1M | 500k | 48.2 | 66.4 | 63.2 | 50.8 | 24.3 | 80.4 | 39.9 | 38.7 | 18.2 | 51.6 | 28.1 | 15.9 |
| 1M | 1M | 49.1 | 70.1 | 65.2 | 52.2 | 21.3 | 80.0 | 39.5 | 38.7 | 20.4 | 54.6 | 27.3 | 16.5 |
| 1M | 2M | 49.9 | 67.8 | 66.0 | 50.2 | 30.3 | 80.2 | 38.9 | 39.0 | 21.8 | 54.8 | 23.1 | 17.8 |
| 1M | 3M | 51.1 | 71.3 | 67.1 | 55.4 | 33.0 | 79.5 | 38.8 | 37.4 | 22.7 | 55.0 | 21.1 | 21.1 |
| 1M | 4M | 51.4 | 71.1 | 66.9 | 52.4 | 31.0 | 80.5 | 39.8 | 41.1 | 24.0 | 56.0 | 18.4 | 22.3 |
| 4M | 200k | 53.8 | 73.1 | 68.8 | 55.0 | 34.7 | 81.2 | 38.5 | 44.0 | 29.5 | 59.2 | 21.4 | 20.5 |
| 4M | 500k | 53.3 | 73.0 | 69.9 | 55.3 | 32.7 | 80.6 | 40.2 | 39.3 | 29.6 | 58.9 | 16.0 | 24.8 |
| 4M | 1M | 54.2 | 73.8 | 69.6 | 54.9 | 33.3 | 82.1 | 36.6 | 45.6 | 32.4 | 59.9 | 16.0 | 24.8 |
| 4M | 2M | 53.8 | 72.8 | 70.3 | 55.2 | 37.3 | 80.8 | 36.8 | 44.0 | 31.2 | 56.2 | 15.6 | 24.7 |
| 4M | 3M | 54.3 | 71.8 | 70.1 | 57.7 | 36.0 | 81.0 | 38.0 | 42.9 | 32.6 | 59.0 | 16.1 | 24.8 |
| 4M | 4M | 54.4 | 75.2 | 69.9 | 56.0 | 37.3 | 81.4 | 38.1 | 40.8 | 31.6 | 59.3 | 15.3 | 25.5 |
| 7M | 200k | 55.8 | 73.1 | 70.3 | 55.6 | 42.0 | 81.0 | 40.8 | 44.0 | 35.2 | 60.6 | 18.2 | 22.3 |
| 7M | 500k | 55.6 | 74.4 | 70.6 | 56.2 | 38.7 | 81.9 | 37.9 | 44.0 | 36.0 | 60.5 | 15.2 | 25.5 |
| 7M | 1M | 55.8 | 74.3 | 70.3 | 56.3 | 42.7 | 81.3 | 36.6 | 44.5 | 35.8 | 60.6 | 14.5 | 26.6 |
| 7M | 2M | 55.4 | 73.9 | 71.1 | 56.9 | 40.0 | 81.6 | 35.9 | 42.4 | 35.4 | 61.6 | 14.8 | 27.1 |
| 7M | 3M | 55.6 | 74.2 | 71.0 | 57.3 | 38.0 | 81.1 | 40.1 | 43.5 | 35.0 | 60.2 | 14.2 | 27.5 |
| 7M | 4M | 56.2 | 75.4 | 70.4 | 55.4 | 44.0 | 80.4 | 39.6 | 45.0 | 35.2 | 60.2 | 14.9 | 26.3 |
Table 7: Full results of training on different LLMs. We train with 7M VQA data and 1M generation data on different LLM backbones (LLaMA-3 8B, LLaMA-3.1 8B, and LLaMA-3 70B) and measure understanding and generation performance.
| LLM | Average | MMBench^EN | SEED | RealworldQA | MMVP | SQA | MMMU | VStar | ChartQA | TextVQA | FID Score | CLIP Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-3 8B | 55.8 | 74.3 | 70.3 | 56.3 | 42.7 | 81.3 | 36.6 | 44.5 | 35.8 | 60.6 | 14.5 | 26.6 |
| LLaMA-3.1 8B | 56.7 | 75.8 | 70.2 | 56.2 | 44.7 | 81.9 | 41.2 | 43.4 | 36.0 | 61.3 | 13.2 | 27.1 |
| LLaMA-3 70B | 60.7 | 80.7 | 72.6 | 58.3 | 48.7 | 87.8 | 48.9 | 47.1 | 37.4 | 65.0 | 13.8 | 26.8 |
VPiT Training, Inference, and Examples of MetaMorph. Left: In Visual-Predictive Instruction Tuning (VPiT), we finetune a pretrained LLM to generate both text and visual tokens using separate text and vision heads. Middle: During inference, the model accepts an arbitrary input sequence of image(s) and text and outputs discrete text tokens and continuous visual tokens. These visual tokens can be visualized via a separately finetuned diffusion model, which is trained to condition on the pretrained vision encoder’s output. Right: An example conversation from MetaMorph trained with VPiT. Here, the model implicitly solves a visual puzzle in order to generate the visual tokens of a butterfly. The conversation continues with new user questions as the model continues to autoregressively process vision and text tokens, independent of the diffusion-based visualization.
Generation-only training vs. Joint training with other data. Training solely on generation data results in inferior performance. Joint training with additional data enables visual generation with only 5k generation data and yields high-quality outputs with 200k generation data.
Impact of different data types on visual generation. The baseline trained only on visual generation data is shown in red; joint training with other data in yellow; joint training with visual understanding data in green; and joint training with all data in blue. Joint training with additional data improves the baseline, with visual understanding tasks contributing the most to enhancing visual generation.
VQA Performance vs. Generation Performance with generation data controlled at 200k. Increasing understanding data improves VQA and generation performance.
Comparison between different language backbones. We jointly train 7M VQA and 1M generation data on different language backbones (LLaMA-3 8B, LLaMA-3.1 8B, LLaMA-3 70B). We observe that the synergy between understanding and generation transfers across LLMs.
Heatmap visualization of Average VQA Score, FID Score, and CLIP Score across varying amounts of VQA data and generation data. Darker colors indicate better performance. Increasing VQA data is more effective for improving both understanding and generation capabilities.
Correlation analysis between generation and various understanding benchmarks. Results are collected by joint training different amounts of VQA data combined with varying quantities of generation data. Each subplot shows the correlation (ρ\rho) with a fitted regression line. Stars represent data points. We analyze General VQA, Vision-Centric VQA, Text&Chart VQA, High-Resolution VQA, and Knowledge VQA. For most tasks, generation performance and VQA performance are strongly correlated: higher VQA performance indicates better generation and vice versa. Only knowledge-intensive and high-resolution VQA tasks exhibit weaker correlations with generation performance.
Examples of MetaMorph leveraging LLMs to generate visual tokens. Left: MetaMorph can leverage knowledge from the LLM to generate visual tokens for professional terms that need domain-specific understanding. Right: MetaMorph also avoids common mistakes seen in T2I models that condition on text embeddings (e.g., Stable Diffusion-3.5 8B).
Examples of MetaMorph solving reasoning problems in visual generation. We design puzzles that require multi-step reasoning. We include reference logic chains needed to solve the puzzles, and reference solution examples. When prompting each model, we directly feed in the puzzle without any CoT hints or logic chains. MetaMorph has the ability to implicitly solve these puzzles and generate the correct image without explicitly creating or processing a logic chain. It demonstrates that the implicit reasoning skills in text-only LLMs can transfer to unified multimodal models.
Data composition. Left: The inner circle shows the distribution of MetaMorph data. Right: All the data sources and categories in the MetaMorph data.
Examples of MetaMorph (I). We showcase examples of MetaMorph’s capabilities: transforming images based on prompts (top-left), answering challenging questions (top-right), integrating visual tokens into reasoning processes (bottom-left), implicitly solving puzzles (bottom-right), and answering tricky video QA questions (bottom).
Examples of MetaMorph (II). We showcase more examples of MetaMorph’s capabilities: answering questions and transforming images in one conversation (left), generating images (top-right), and leveraging knowledge in LLMs to generate rare concepts (bottom-right).
In Figure 12, when prompted with the question “What is the type of hat?”, MetaMorph first generates visual tokens related to hats and then answers correctly with “top hat”. The model also demonstrates the ability to perform image transformations, such as creating a cartoon version of an image or altering it to appear as daytime. Additionally, we showcase examples of MetaMorph solving implicit puzzles, such as interpreting “a rearrangement of the letters in the word ‘tca’” before generating the corresponding visual tokens of cats.

References
[power2022grokking] Power, Alethea, Burda, Yuri, Edwards, Harri, Babuschkin, Igor, Misra, Vedant. (2022). Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177.
[wang2023see] Wang, Junke, Meng, Lingchen, Weng, Zejia, He, Bo, Wu, Zuxuan, Jiang, Yu-Gang. (2023). To see is to believe: Prompting gpt-4v for better visual instruction tuning. arXiv preprint arXiv:2311.07574.
[zhang2023llavar] Zhang, Yanzhe, Zhang, Ruiyi, Gu, Jiuxiang, Zhou, Yufan, Lipka, Nedim, Yang, Diyi, Sun, Tong. (2023). Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107.
[liu2022convnet] Liu, Zhuang, Mao, Hanzi, Wu, Chao-Yuan, Feichtenhofer, Christoph, Darrell, Trevor, Xie, Saining. (2022). A convnet for the 2020s. CVPR.
[masry2022chartqa] Masry, Ahmed, Long, Do Xuan, Tan, Jia Qing, Joty, Shafiq, Hoque, Enamul. (2022). Chartqa: A benchmark for question answering about charts with visual and logical reasoning. ACL.
[mathew2021docvqa] Mathew, Minesh, Karatzas, Dimosthenis, Jawahar, CV. (2021). Docvqa: A dataset for vqa on document images. WACV.
[kafle2018dvqa] Kafle, Kushal, Price, Brian, Cohen, Scott, Kanan, Christopher. (2018). Dvqa: Understanding data visualizations via question answering. CVPR.
[acharya2019tallyqa] Acharya, Manoj, Kafle, Kushal, Kanan, Christopher. (2019). TallyQA: Answering complex counting questions. AAAI.
[johnson2017clevr] Johnson, Justin, Hariharan, Bharath, Van Der Maaten, Laurens, Fei-Fei, Li, Lawrence Zitnick, C, Girshick, Ross. (2017). Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. CVPR.
[tu2023many] Tu, Haoqin, Cui, Chenhang, Wang, Zijun, Zhou, Yiyang, Zhao, Bingchen, Han, Junlin, Zhou, Wangchunshu, Yao, Huaxiu, Xie, Cihang. (2023). How many unicorns are in this image? a safety evaluation benchmark for vision llms. arXiv preprint arXiv:2311.16101.
[gurari2018vizwiz] Gurari, Danna, Li, Qing, Stangl, Abigale J, Guo, Anhong, Lin, Chi, Grauman, Kristen, Luo, Jiebo, Bigham, Jeffrey P. (2018). Vizwiz grand challenge: Answering visual questions from blind people. CVPR.
[zhang2023pre] Zhang, Yuhui, McKinzie, Brandon, Gan, Zhe, Shankar, Vaishaal, Toshev, Alexander. (2023). Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation. EMNLP.
[chen2024allava] Chen, Guiming Hardy, Chen, Shunian, Zhang, Ruifei, Chen, Junying, Wu, Xiangbo, Zhang, Zhiyi, Chen, Zhihong, Li, Jianquan, Wan, Xiang, Wang, Benyou. (2024). ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model. arXiv preprint arXiv:2402.11684.
[lambon2010coherent] Lambon Ralph, Matthew A, Sage, Karen, Jones, Roy W, Mayberry, Emily J. (2010). Coherent concepts are computed in the anterior temporal lobes. Proceedings of the National Academy of Sciences.
[chen2020simple] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. International conference on machine learning.
[he2019momentum] He, Kaiming, Fan, Haoqi, Wu, Yuxin, Xie, Saining, Girshick, Ross. (2019). Momentum Contrast for Unsupervised Visual Representation Learning. arXiv e-prints, art. arXiv preprint arXiv:1911.05722.
[bardes2024revisiting] Bardes, Adrien, Garrido, Quentin, Ponce, Jean, Chen, Xinlei, Rabbat, Michael, LeCun, Yann, Assran, Mahmoud, Ballas, Nicolas. (2024). Revisiting feature prediction for learning visual representations from video. TMLR.
[hendrycks2016gaussian] Hendrycks, Dan, Gimpel, Kevin. (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.
[zhang2024direct] Zhang, Ruohong, Gui, Liangke, Sun, Zhiqing, Feng, Yihao, Xu, Keyang, Zhang, Yuanhan, Fu, Di, Li, Chunyuan, Hauptmann, Alexander, Bisk, Yonatan, others. (2024). Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward. arXiv preprint arXiv:2404.01258.
[loshchilov2017decoupled] Loshchilov, Ilya, Hutter, Frank. (2019). Decoupled weight decay regularization. ICLR.
[ba2016layer] Ba, Jimmy Lei, Kiros, Jamie Ryan, Hinton, Geoffrey E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
[preechakul2022diffusion] Preechakul, Konpat, Chatthee, Nattanat, Wizadwongsa, Suttisak, Suwajanakorn, Supasorn. (2022). Diffusion autoencoders: Toward a meaningful and decodable representation. CVPR.
[pan2023kosmos] Pan, Xichen, Dong, Li, Huang, Shaohan, Peng, Zhiliang, Chen, Wenhu, Wei, Furu. (2024). Kosmos-g: Generating images in context with multimodal large language models. ICLR.
[koh2024generating] Koh, Jing Yu, Fried, Daniel, Salakhutdinov, Russ R. (2024). Generating images with multimodal language models. NeurIPS.
[rajbhandari2020zero] Rajbhandari, Samyam, Rasley, Jeff, Ruwase, Olatunji, He, Yuxiong. (2020). Zero: Memory optimizations toward training trillion parameter models. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.
[heusel2017gans] Heusel, Martin, Ramsauer, Hubert, Unterthiner, Thomas, Nessler, Bernhard, Hochreiter, Sepp. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. NeurIPS.
[yue2024mmmu] Yue, Xiang, Zheng, Tianyu, Ni, Yuansheng, Wang, Yubo, Zhang, Kai, Tong, Shengbang, Sun, Yuxuan, Yin, Ming, Yu, Botao, Zhang, Ge, others. (2024). Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2409.02813.
[pan2024autonomous] Pan, Jiayi, Zhang, Yichi, Tomlin, Nicholas, Zhou, Yifei, Levine, Sergey, Suhr, Alane. (2024). Autonomous evaluation and refinement of digital agents. COLM.
[cha2024visually] Cha, Sungguk, Lee, Jusung, Lee, Younghyun, Yang, Cheoljong. (2024). Visually Dehallucinative Instruction Generation: Know What You Don't Know. arXiv preprint arXiv:2402.09717.
[si2024design2code] Si, Chenglei, Zhang, Yanzhe, Yang, Zhengyuan, Liu, Ruibo, Yang, Diyi. (2024). Design2Code: How Far Are We From Automating Front-End Engineering?. arXiv preprint arXiv:2403.03163.
[li2024multimodal] Li, Lei, Wang, Yuqi, Xu, Runxin, Wang, Peiyi, Feng, Xiachong, Kong, Lingpeng, Liu, Qi. (2024). Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models. arXiv preprint arXiv:2403.00231.
[wang2024measuring] Wang, Ke, Pan, Junting, Shi, Weikang, Lu, Zimu, Zhan, Mingjie, Li, Hongsheng. (2024). Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset. arXiv preprint arXiv:2402.14804.
[wu2023q] Wu, Haoning, Zhang, Zicheng, Zhang, Erli, Chen, Chaofeng, Liao, Liang, Wang, Annan, Xu, Kaixin, Li, Chunyi, Hou, Jingwen, Zhai, Guangtao, others. (2023). Q-instruct: Improving low-level visual abilities for multi-modality foundation models. arXiv preprint arXiv:2311.06783.
[kembhavi2016diagram] Kembhavi, Aniruddha, Salvato, Mike, Kolve, Eric, Seo, Minjoon, Hajishirzi, Hannaneh, Farhadi, Ali. (2016). A diagram is worth a dozen images. ECCV.
[laiongpt4v] LAION. (2023). laion/gpt4v-dataset.
[hsiao2022screenqa] Hsiao, Yu-Chung, Zubach, Fedir, Wang, Maria, others. (2022). Screenqa: Large-scale question-answer pairs over mobile app screenshots. arXiv preprint arXiv:2209.08199.
[lu2022learn] Lu, Pan, Mishra, Swaroop, Xia, Tanglin, Qiu, Liang, Chang, Kai-Wei, Zhu, Song-Chun, Tafjord, Oyvind, Clark, Peter, Kalyan, Ashwin. (2022). Learn to explain: Multimodal reasoning via thought chains for science question answering. NeurIPS.
[gao2023g] Gao, Jiahui, Pi, Renjie, Zhang, Jipeng, Ye, Jiacheng, Zhong, Wanjun, Wang, Yufei, Hong, Lanqing, Han, Jianhua, Xu, Hang, Li, Zhenguo, others. (2023). G-llava: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370.
[kim2021donut] Kim, Geewook, Hong, Teakgyu, Yim, Moonbin, Park, Jinyoung, Yim, Jinyeong, Hwang, Wonseok, Yun, Sangdoo, Han, Dongyoon, Park, Seunghyun. (2022). Donut: Document understanding transformer without ocr. ECCV.
[laurenccon2024unlocking] Laurençon, Hugo, Tronchon, Léo, Sanh, Victor. (2024). Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset. arXiv preprint arXiv:2403.09029.
[belouadi2023automatikz] Belouadi, Jonas, Lauscher, Anne, Eger, Steffen. (2024). Automatikz: Text-guided synthesis of scientific vector graphics with tikz. ICLR.
[alawwad2024enhancing] Alawwad, Hessa Abdulrahman, Alhothali, Areej, Naseem, Usman, Alkhathlan, Ali, Jamal, Amani. (2024). Enhancing Textbook Question Answering Task with Large Language Models and Retrieval Augmented Generation. arXiv preprint arXiv:2402.05128.
[lu2021inter] Lu, Pan, Gong, Ran, Jiang, Shibiao, Qiu, Liang, Huang, Siyuan, Liang, Xiaodan, Zhu, Song-Chun. (2021). Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. ACL.
[zhang2019raven] Zhang, Chi, Gao, Feng, Jia, Baoxiong, Zhu, Yixin, Zhu, Song-Chun. (2019). Raven: A dataset for relational and analogical visual reasoning. CVPR.
[lu2021iconqa] Lu, Pan, Qiu, Liang, Chen, Jiaqi, Xia, Tony, Zhao, Yizhou, Zhang, Wei, Yu, Zhou, Liang, Xiaodan, Zhu, Song-Chun. (2021). Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. NeurIPS.
[kazemi2023geomverse] Kazemi, Mehran, Alvari, Hamidreza, Anand, Ankit, Wu, Jialin, Chen, Xi, Soricut, Radu. (2023). Geomverse: A systematic evaluation of large models for geometric reasoning. arXiv preprint arXiv:2312.12241.
[pasupat2015compositional] Pasupat, Panupong, Liang, Percy. (2015). Compositional semantic parsing on semi-structured tables. ACL.
[zhong2017seq2sql] Zhong, Victor, Xiong, Caiming, Socher, Richard. (2017). Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.
[chen2021finqa] Chen, Zhiyu, Chen, Wenhu, Smiley, Charese, Shah, Sameena, Borova, Iana, Langdon, Dylan, Moussa, Reema, Beane, Matt, Huang, Ting-Hao, Routledge, Bryan, others. (2021). Finqa: A dataset of numerical reasoning over financial data. EMNLP.
[cheng2021hitab] Cheng, Zhoujun, Dong, Haoyu, Wang, Zhiruo, Jia, Ran, Guo, Jiaqi, Gao, Yan, Han, Shi, Lou, Jian-Guang, Zhang, Dongmei. (2022). HiTab: A hierarchical table dataset for question answering and natural language generation. ACL.
[zhu2021tat] Zhu, Fengbin, Lei, Wenqiang, Huang, Youcheng, Wang, Chao, Zhang, Shuo, Lv, Jiancheng, Feng, Fuli, Chua, Tat-Seng. (2021). TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. ACL.
[lu2022dynamic] Lu, Pan, Qiu, Liang, Chang, Kai-Wei, Wu, Ying Nian, Zhu, Song-Chun, Rajpurohit, Tanmay, Clark, Peter, Kalyan, Ashwin. (2023). Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. ICLR.
[kantharaj2022chart] Kantharaj, Shankar, Leong, Rixie Tiffany Ko, Lin, Xiang, Masry, Ahmed, Thakkar, Megh, Hoque, Enamul, Joty, Shafiq. (2022). Chart-to-text: A large-scale benchmark for chart summarization. ACL.
[tang2023vistext] Tang, Benny J, Boggust, Angie, Satyanarayan, Arvind. (2023). Vistext: A benchmark for semantically rich chart captioning. arXiv preprint arXiv:2307.05356.
[biten2022latr] Biten, Ali Furkan, Litman, Ron, Xie, Yusheng, Appalaraju, Srikar, Manmatha, R. (2022). Latr: Layout-aware transformer for scene-text vqa. CVPR.
[biten2019scene] Biten, Ali Furkan, Tito, Ruben, Mafla, Andres, Gomez, Lluis, Rusinol, Marçal, others. (2019). Scene text visual question answering. ICCV.
[kiela2020hateful] Kiela, Douwe, Firooz, Hamed, Mohan, Aravind, Goswami, Vedanuj, Singh, Amanpreet, Ringshia, Pratik, Testuggine, Davide. (2020). The hateful memes challenge: Detecting hate speech in multimodal memes. NeurIPS.
[RenderedText] Chris Wendler. (2023). wendlerc/RenderedText.
[zhu2016visual7w] Zhu, Yuke, Groth, Oliver, Bernstein, Michael, Fei-Fei, Li. (2016). Visual7w: Grounded question answering in images. CVPR.
[tanaka2021visualmrc] Tanaka, Ryota, Nishida, Kyosuke, Yoshida, Sen. (2021). VisualMRC: Machine Reading Comprehension on Document Images. AAAI.
[shridhar2020alfworld] Shridhar, Mohit, Yuan, Xingdi, Côté, Marc-Alexandre, others. (2021). ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. ICLR.
[pont-tuset2019localizednarratives] Pont-Tuset, Jordi, others. (2020). Connecting Vision and Language with Localized Narratives. ECCV.
[he2020pathvqa] He, Xuehai, Zhang, Yichen, Mou, Luntian, Xing, Eric P., Xie, Pengtao. (2020). PathVQA: 30000+ Questions for Medical Visual Question Answering. CoRR.
[liu2023visual] Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee. (2023). Visual Instruction Tuning. NeurIPS.
[chen2023sharegpt4v] Chen, Lin, Li, Jisong, Dong, Xiaoyi, Zhang, Pan, He, Conghui, Wang, Jiaqi, Zhao, Feng, Lin, Dahua. (2023). Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793.
[hudson2019gqa] Drew A. Hudson, Christopher D. Manning. (2019). GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. CVPR.
[marino2019okvqa] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, Roozbeh Mottaghi. (2019). OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. CVPR.
[vishniakov2023convnet] Vishniakov, Kirill, Shen, Zhiqiang, Liu, Zhuang. (2024). ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy. ICML.
[schwenk2022aokvqa] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, Roozbeh Mottaghi. (2022). A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge. ECCV.
[mishra2019OCR] Mishra, Anand, others. (2019). OCR-VQA: Visual Question Answering by Reading Text in Images. ICDAR.
[sidorov2020textcaps] Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, Amanpreet Singh. (2020). TextCaps: a Dataset for Image Captioning with Reading Comprehension.
[yu2016modeling] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, Tamara L. Berg. (2016). Modeling Context in Referring Expressions.
[team2024chameleon] Team, Chameleon. (2024). Chameleon: Mixed-Modal Early-Fusion Foundation Models. arXiv preprint arXiv:2405.09818.
[yu2023rlhf] Yu, Tianyu, Yao, Yuan, Zhang, Haoye, He, Taiwen, Han, Yifeng, Cui, Ganqu, Hu, Jinyi, Liu, Zhiyuan, Zheng, Hai-Tao, Sun, Maosong, others. (2023). Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. arXiv preprint arXiv:2312.00849.
[li2024return] Li, Tianhong, Katabi, Dina, He, Kaiming. (2024). Return of unconditional generation: A self-supervised representation generation method. NeurIPS.
[rafailov2024direct] Rafailov, Rafael, Sharma, Archit, Mitchell, Eric, Manning, Christopher D, Ermon, Stefano, Finn, Chelsea. (2024). Direct preference optimization: Your language model is secretly a reward model. NeurIPS.
[zhu2023starling] Zhu, Banghua, Frick, Evan, Wu, Tianhao, Zhu, Hanlin, Jiao, Jiantao. (2023). Starling-7b: Improving llm helpfulness & harmlessness with rlaif.
[ouyang2022training] Ouyang, Long, Wu, Jeffrey, Jiang, Xu, Almeida, Diogo, Wainwright, Carroll, Mishkin, Pamela, Zhang, Chong, Agarwal, Sandhini, Slama, Katarina, Ray, Alex, others. (2022). Training language models to follow instructions with human feedback. NeurIPS.
[dong2024rlhf] Dong, Hanze, Xiong, Wei, Pang, Bo, Wang, Haoxiang, Zhao, Han, Zhou, Yingbo, Jiang, Nan, Sahoo, Doyen, Xiong, Caiming, Zhang, Tong. (2024). Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863.
[liu2024decade] Liu, Zhuang, He, Kaiming. (2024). A Decade's Battle on Dataset Bias: Are We There Yet?. arXiv preprint arXiv:2403.08632.
[yuksekgonul2022and] Yuksekgonul, Mert, Bianchi, Federico, Kalluri, Pratyusha, Jurafsky, Dan, Zou, James. (2022). When and why vision-language models behave like bags-of-words, and what to do about it?. ICLR.
[chen2024far] Chen, Zhe, Wang, Weiyun, Tian, Hao, Ye, Shenglong, Gao, Zhangwei, Cui, Erfei, Tong, Wenwen, Hu, Kongzhi, Luo, Jiapeng, Ma, Zheng, others. (2024). How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821.
[tong2024mass] Tong, Shengbang, Jones, Erik, Steinhardt, Jacob. (2024). Mass-producing failures of multimodal systems with language models. NeurIPS.
[krishna2016visual] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, Fei-Fei Li. (2016). Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. IJCV.
[tong2024eyes] Tong, Shengbang, Liu, Zhuang, Zhai, Yuexiang, Ma, Yi, LeCun, Yann, Xie, Saining. (2024). Eyes wide shut? exploring the visual shortcomings of multimodal llms. CVPR.
[liu2023improved] Liu, Haotian, Li, Chunyuan, Li, Yuheng, Lee, Yong Jae. (2024). Improved baselines with visual instruction tuning. CVPR.
[mckinzie2024mm1] McKinzie, Brandon, Gan, Zhe, Fauconnier, Jean-Philippe, Dodge, Sam, Zhang, Bowen, Dufter, Philipp, Shah, Dhruti, Du, Xianzhi, Peng, Futang, Weers, Floris, others. (2024). Mm1: Methods, analysis & insights from multimodal llm pre-training. arXiv preprint arXiv:2403.09611.
[xu2023demystifying] Xu, Hu, Xie, Saining, Tan, Xiaoqing Ellen, Huang, Po-Yao, Howes, Russell, Sharma, Vasu, Li, Shang-Wen, Ghosh, Gargi, Zettlemoyer, Luke, Feichtenhofer, Christoph. (2024). Demystifying clip data. ICLR.
[fang2023data] Fang, Alex, Jose, Albin Madappally, Jain, Amit, Schmidt, Ludwig, Toshev, Alexander, Shankar, Vaishaal. (2024). Data filtering networks. ICLR.
[gao2024sphinx] Gao, Peng, Zhang, Renrui, Liu, Chris, Qiu, Longtian, Huang, Siyuan, Lin, Weifeng, Zhao, Shitian, Geng, Shijie, Lin, Ziyi, Jin, Peng, others. (2024). SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models. arXiv preprint arXiv:2402.05935.
[DatabricksBlog2023DollyV2] Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, Reynold Xin. (2023). Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM.
[yue2023mammoth] Yue, Xiang, Qu, Xingwei, Zhang, Ge, Fu, Yao, Huang, Wenhao, Sun, Huan, Su, Yu, Chen, Wenhu. (2024). Mammoth: Building math generalist models through hybrid instruction tuning. ICLR.
[luo2023wizardcoder] Luo, Ziyang, Xu, Can, Zhao, Pu, Sun, Qingfeng, Geng, Xiubo, Hu, Wenxiang, Tao, Chongyang, Ma, Jing, Lin, Qingwei, Jiang, Daxin. (2024). Wizardcoder: Empowering code large language models with evol-instruct. ICLR.
[mitra2024orcamath] Arindam Mitra, Hamed Khanpour, Corby Rosset, Ahmed Awadallah. (2024). Orca-Math: Unlocking the potential of SLMs in Grade School Math.
[zheng2024opencodeinterpreter] Zheng, Tianyu, Zhang, Ge, Shen, Tianhao, Liu, Xueling, Lin, Bill Yuchen, Fu, Jie, Chen, Wenhu, Yue, Xiang. (2024). OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement. arXiv preprint arXiv:2402.14658.
[OpenOrca] Wing Lian, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong and. (2023). OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces. HuggingFace repository.
[radford2021learning] Radford, Alec, Kim, Jong Wook, Hallacy, Chris, Ramesh, Aditya, Goh, Gabriel, Agarwal, Sandhini, Sastry, Girish, Askell, Amanda, Mishkin, Pamela, Clark, Jack, others. (2021). Learning transferable visual models from natural language supervision. ICML.
[schuhmann2022laion] Schuhmann, Christoph, Beaumont, Romain, Vencu, Richard, Gordon, Cade, Wightman, Ross, Cherti, Mehdi, Coombes, Theo, Katta, Aarush, Mullis, Clayton, Wortsman, Mitchell, others. (2022). Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS.
[zheng2024judging] Zheng, Lianmin, Chiang, Wei-Lin, Sheng, Ying, Zhuang, Siyuan, Wu, Zhanghao, Zhuang, Yonghao, Lin, Zi, Li, Zhuohan, Li, Dacheng, Xing, Eric, others. (2024). Judging llm-as-a-judge with mt-bench and chatbot arena. NeurIPS.
[chiang2024chatbot] Chiang, Wei-Lin, Zheng, Lianmin, Sheng, Ying, Angelopoulos, Anastasios Nikolas, Li, Tianle, Li, Dacheng, Zhang, Hao, Zhu, Banghua, Jordan, Michael, Gonzalez, Joseph E, others. (2024). Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132.
[zhai2023sigmoid] Zhai, Xiaohua, Mustafa, Basil, Kolesnikov, Alexander, Beyer, Lucas. (2023). Sigmoid loss for language image pre-training. ICCV.
[sun2023eva] Sun, Quan, Fang, Yuxin, Wu, Ledell, Wang, Xinlong, Cao, Yue. (2023). Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389.
[cherti2023reproducible] Cherti, Mehdi, Beaumont, Romain, Wightman, Ross, Wortsman, Mitchell, Ilharco, Gabriel, Gordon, Cade, Schuhmann, Christoph, Schmidt, Ludwig, Jitsev, Jenia. (2023). Reproducible scaling laws for contrastive language-image learning. CVPR.
[he2022masked] He, Kaiming, Chen, Xinlei, Xie, Saining, Li, Yanghao, Dollár, Piotr, Girshick, Ross. (2022). Masked autoencoders are scalable vision learners. CVPR.
[chen2021empirical] Chen, Xinlei, Xie, Saining, He, Kaiming. (2021). An empirical study of training self-supervised vision transformers. ICCV.
[oquab2023dinov2] Oquab, Maxime, Darcet, Timothée, others. (2023). DINOv2: Learning robust visual features without supervision. TMLR.
[cunningham2023sparse] Cunningham, Hoagy, Ewart, Aidan, Riggs, Logan, Huben, Robert, Sharkey, Lee. (2024). Sparse autoencoders find highly interpretable features in language models. ICLR.
[assran2023self] Assran, Mahmoud, Duval, Quentin, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, LeCun, Yann, Ballas, Nicolas. (2023). Self-supervised learning from images with a joint-embedding predictive architecture. CVPR.
[dosovitskiy2020image] Dosovitskiy, Alexey, Beyer, Lucas, Kolesnikov, Alexander, Weissenborn, Dirk, Zhai, Xiaohua, Unterthiner, Thomas, Dehghani, Mostafa, Minderer, Matthias, Heigold, Georg, Gelly, Sylvain, others. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. ICLR.
[jouppi2023tpu] Jouppi, Norm, Kurian, George, Li, Sheng, Ma, Peter, Nagarajan, Rahul, Nai, Lifeng, Patil, Nishant, Subramanian, Suvinay, Swing, Andy, Towles, Brian, others. (2023). Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. Proceedings of the 50th Annual International Symposium on Computer Architecture.
[zhao2023pytorch] Zhao, Yanli, Gu, Andrew, Varma, Rohan, Luo, Liang, Huang, Chien-Chin, Xu, Min, Wright, Less, Shojanazeri, Hamid, Ott, Myle, Shleifer, Sam, others. (2023). Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277.
[zhou2023don] Zhou, Kun, Zhu, Yutao, Chen, Zhipeng, Chen, Wentong, Zhao, Wayne Xin, Chen, Xu, Lin, Yankai, Wen, Ji-Rong, Han, Jiawei. (2023). Don't Make Your LLM an Evaluation Benchmark Cheater. arXiv preprint arXiv:2311.01964.
[kirillov2023segment] Kirillov, Alexander, Mintun, Eric, Ravi, Nikhila, Mao, Hanzi, Rolland, Chloe, Gustafson, Laura, Xiao, Tete, Whitehead, Spencer, Berg, Alexander C, Lo, Wan-Yen, others. (2023). Segment anything. ICCV.
[birkl2023midas] Birkl, Reiner, Wofk, Diana, Müller, Matthias. (2023). MiDaS v3.1: A model zoo for robust monocular relative depth estimation. arXiv preprint arXiv:2307.14460.
[lasinger2019towards] Lasinger, Katrin, Ranftl, René, others. (2019). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. arXiv preprint arXiv:1907.01341.
[Rombach_2022_CVPR] Rombach, Robin, Blattmann, Andreas, Lorenz, Dominik, Esser, Patrick, Ommer, Björn. (2022). High-Resolution Image Synthesis With Latent Diffusion Models. CVPR.
[karamcheti2024prismatic] Karamcheti, Siddharth, Nair, Suraj, Balakrishna, Ashwin, Liang, Percy, Kollar, Thomas, Sadigh, Dorsa. (2024). Prismatic vlms: Investigating the design space of visually-conditioned language models. arXiv preprint arXiv:2402.07865.
[zhai2023investigating] Zhai, Yuexiang, Tong, Shengbang, Li, Xiao, Cai, Mu, Qu, Qing, Lee, Yong Jae, Ma, Yi. (2024). Investigating the catastrophic forgetting in multimodal large language models. CPAL.
[li2023internet] Li, Alexander C, Brown, Ellis, Efros, Alexei A, Pathak, Deepak. (2023). Internet Explorer: Targeted Representation Learning on the Open Web. ICML.
[liu2024llavanext] Liu, Haotian, Li, Chunyuan, Li, Yuheng, Li, Bo, Zhang, Yuanhan, Shen, Sheng, Lee, Yong Jae. (2024). LLaVA-NeXT: Improved reasoning, OCR, and world knowledge.
[lu2024deepseek] Lu, Haoyu, Liu, Wen, Zhang, Bo, Wang, Bingxuan, Dong, Kai, Liu, Bo, Sun, Jingxiang, Ren, Tongzheng, Li, Zhuoshu, Sun, Yaofeng, others. (2024). DeepSeek-VL: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525.
[li2023your] Li, Alexander C, Prabhudesai, Mihir, Duggal, Shivam, Brown, Ellis, Pathak, Deepak. (2023). Your diffusion model is secretly a zero-shot classifier. ICCV.
[chen2022pali] Chen, Xi, Wang, Xiao, Changpinyo, Soravit, Piergiovanni, AJ, Padlewski, Piotr, Salz, Daniel, Goodman, Sebastian, Grycner, Adam, Mustafa, Basil, Beyer, Lucas, others. (2023). Pali: A jointly-scaled multilingual language-image model. ICLR.
[murtagh2014ward] Murtagh, Fionn, Legendre, Pierre. (2014). Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion?. Journal of classification.
[llama3modelcard] AI@Meta. (2024). Llama 3 Model Card.
[Gemini] Google. (2023). Gemini.
[qwen] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, Tianhang Zhu. (2023). Qwen Technical Report. arXiv preprint arXiv:2309.16609.
[bai2023qwen] Bai, Jinze, Bai, Shuai, Yang, Shusheng, Wang, Shijie, Tan, Sinan, Wang, Peng, Lin, Junyang, Zhou, Chang, Zhou, Jingren. (2023). Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.
[dai2024instructblip] Dai, Wenliang, Li, Junnan, Li, Dongxu, Tiong, Anthony Meng Huat, Zhao, Junqi, Wang, Weisheng, Li, Boyang, Fung, Pascale N, Hoi, Steven. (2024). Instructblip: Towards general-purpose vision-language models with instruction tuning. NeurIPS.
[liu2023hidden] Liu, Yuliang, Li, Zhang, Li, Hongliang, Yu, Wenwen, Huang, Mingxin, Peng, Dezhi, Liu, Mingyu, Chen, Mingrui, Li, Chunyuan, Jin, Lianwen, others. (2023). On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895.
[ge2023planting] Ge, Yuying, Ge, Yixiao, Zeng, Ziyun, Wang, Xintao, Shan, Ying. (2023). Planting a seed of vision in large language model. arXiv preprint arXiv:2307.08041.
[wu2023vstar] Wu, Penghao, Xie, Saining. (2024). V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs. CVPR.
[jaegle2021perceiver] Jaegle, Andrew, Gimeno, Felix, Brock, Andy, Vinyals, Oriol, Zisserman, Andrew, Carreira, Joao. (2021). Perceiver: General perception with iterative attention. ICML.
[young2024yi] Young, Alex, Chen, Bei, Li, Chao, Huang, Chengen, Zhang, Ge, Zhang, Guanwei, Li, Heng, Zhu, Jiangcheng, Chen, Jianqun, Chang, Jing, others. (2024). Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652.
[zhai2024fine] Zhai, Yuexiang, Bai, Hao, Lin, Zipeng, Pan, Jiayi, Tong, Shengbang, Zhou, Yifei, Suhr, Alane, Xie, Saining, LeCun, Yann, Ma, Yi, others. (2024). Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning. NeurIPS.
[lu2023mathvista] Lu, Pan, Bansal, Hritik, Xia, Tony, Liu, Jiacheng, Li, Chunyuan, Hajishirzi, Hannaneh, Cheng, Hao, Chang, Kai-Wei, Galley, Michel, Gao, Jianfeng. (2023). Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. ICLR.
[liu2023mmbench] Liu, Yuan, Duan, Haodong, Zhang, Yuanhan, Li, Bo, Zhang, Songyang, Zhao, Wangbo, Yuan, Yike, Wang, Jiaqi, He, Conghui, Liu, Ziwei, others. (2024). Mmbench: Is your multi-modal model an all-around player?. ECCV.
[alayrac2022flamingo] Alayrac, Jean-Baptiste, Donahue, Jeff, Luc, Pauline, Miech, Antoine, Barr, Iain, Hasson, Yana, Lenc, Karel, Mensch, Arthur, Millican, Katherine, Reynolds, Malcolm, others. (2022). Flamingo: a visual language model for few-shot learning. NeurIPS.
[li2023oxfordtvg] Li, Runjia, Sun, Shuyang, Elhoseiny, Mohamed, Torr, Philip. (2023). OxfordTVG-HIC: Can Machine Make Humorous Captions from Images?. ICCV.
[gadre2024datacomp] Gadre, Samir Yitzhak, Ilharco, Gabriel, Fang, Alex, Hayase, Jonathan, Smyrnis, Georgios, Nguyen, Thao, Marten, Ryan, Wortsman, Mitchell, Ghosh, Dhruba, Zhang, Jieyu, others. (2024). Datacomp: In search of the next generation of multimodal datasets. NeurIPS.
[banani2024probing] Banani, Mohamed El, Raj, Amit, Maninis, Kevis-Kokitsi, Kar, Abhishek, Li, Yuanzhen, Rubinstein, Michael, Sun, Deqing, Guibas, Leonidas, Johnson, Justin, Jampani, Varun. (2024). Probing the 3D Awareness of Visual Foundation Models. arXiv preprint arXiv:2404.08636.
[OpenAI2022ChatGPT] OpenAI. (2022). ChatGPT.
[StabilityAI2024SD35] Stability AI. (2024). Stable Diffusion 3.5.
[roberts2019exploring] Roberts, Adam, Raffel, Colin, Lee, Katherine, Matena, Michael, Shazeer, Noam, Liu, Peter J, Narang, Sharan, Li, Wei, Zhou, Yanqi. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR.
[Taori2023Alpaca] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, Tatsunori B. Hashimoto. (2023). Alpaca: A Strong, Replicable Instruction-Following Model.
[zhou2024lima] Zhou, Chunting, Liu, Pengfei, Xu, Puxin, Iyer, Srinivasan, Sun, Jiao, Mao, Yuning, Ma, Xuezhe, Efrat, Avia, Yu, Ping, Yu, Lili, others. (2024). Lima: Less is more for alignment. NeurIPS.
[Sanseviero2024LLM] Omar Sanseviero. (2022). LLM Evals and Benchmarking.
[rajamanoharan2024improving] Rajamanoharan, Senthooran, Conmy, Arthur, Smith, Lewis, Lieberum, Tom, Varma, Vikrant, Kramár, János, others. (2024). Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014.
[grok] xAI. (2024). grok.
[singh2019towards] Singh, Amanpreet, Natarajan, Vivek, Shah, Meet, Jiang, Yu, Chen, Xinlei, Batra, Dhruv, Parikh, Devi, Rohrbach, Marcus. (2019). Towards vqa models that can read. CVPR.
[chang2024survey] Chang, Yupeng, Wang, Xu, Wang, Jindong, Wu, Yuan, Yang, Linyi, Zhu, Kaijie, Chen, Hao, Yi, Xiaoyuan, Wang, Cunxiang, Wang, Yidong, others. (2024). A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology.
[sun2024generative] Sun, Quan, Cui, Yufeng, Zhang, Xiaosong, Zhang, Fan, Yu, Qiying, Wang, Yueze, Rao, Yongming, Liu, Jingjing, Huang, Tiejun, Wang, Xinlong. (2024). Generative multimodal models are in-context learners. CVPR.
[sun2023generative] Sun, Quan, Yu, Qiying, Cui, Yufeng, Zhang, Fan, Zhang, Xiaosong, Wang, Yueze, Gao, Hongcheng, Liu, Jingjing, Huang, Tiejun, Wang, Xinlong. (2024). Generative pretraining in multimodality. ICLR.
[dong2023dreamllm] Dong, Runpei, Han, Chunrui, Peng, Yuang, Qi, Zekun, Ge, Zheng, Yang, Jinrong, Zhao, Liang, Sun, Jianjian, Zhou, Hongyu, Wei, Haoran, others. (2024). Dreamllm: Synergistic multimodal comprehension and creation. ICLR.
[shao2024visual] Shao, Hao, Qian, Shengju, Xiao, Han, Song, Guanglu, Zong, Zhuofan, Wang, Letian, Liu, Yu, Li, Hongsheng. (2024). Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. NeurIPS.
[miech2019howto100m] Miech, Antoine, Zhukov, Dimitri, Alayrac, Jean-Baptiste, Tapaswi, Makarand, Laptev, Ivan, Sivic, Josef. (2019). Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. ICCV.
[wang2024emu3] Wang, Xinlong, Zhang, Xiaosong, Luo, Zhengxiong, Sun, Quan, Cui, Yufeng, Wang, Jinsheng, Zhang, Fan, Wang, Yueze, Li, Zhen, Yu, Qiying, others. (2024). Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869.
[kar2025brave] Kar, Oğuzhan Fatih, Tonioni, Alessio, Poklukar, Petra, Kulshrestha, Achin, Zamir, Amir, Tombari, Federico. (2025). BRAVE: Broadening the visual encoding of vision-language models. ECCV.
[laurenccon2024obelics] Laurençon, Hugo, Saulnier, Lucile, Tronchon, Léo, Bekman, Stas, Singh, Amanpreet, Lozhkov, Anton, Wang, Thomas, Karamcheti, Siddharth, Rush, Alexander, Kiela, Douwe, others. (2024). Obelics: An open web-scale filtered dataset of interleaved image-text documents. NeurIPS.
[li2024mvbench] Li, Kunchang, Wang, Yali, He, Yinan, Li, Yizhuo, Wang, Yi, Liu, Yi, Wang, Zun, Xu, Jilan, Chen, Guo, Luo, Ping, others. (2024). Mvbench: A comprehensive multi-modal video understanding benchmark. CVPR.
[goyal2017something] Goyal, Raghav, Ebrahimi Kahou, Samira, Michalski, Vincent, Materzynska, Joanna, Westphal, Susanne, Kim, Heuna, Haenel, Valentin, Fruend, Ingo, Yianilos, Peter, Mueller-Freitag, Moritz, others. (2017). The "something something" video database for learning and evaluating visual common sense. ICCV.
[zohar2024videostar] Zohar, Orr, Wang, Xiaohan, Bitton, Yonatan, Szpektor, Idan, Yeung-levy, Serena. (2024). Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision. arXiv preprint arXiv:2407.06189.
[OpenAI2024gpt4o] OpenAI. (2024). gpt4o.
[Anthropic2024Claude] Anthropic. (2024). Claude.
[touvron2023llama] Touvron, Hugo, Lavril, Thibaut, Izacard, Gautier, Martinet, Xavier, Lachaux, Marie-Anne, Lacroix, Timothée, others. (2023). LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
[touvron2023llama2] Touvron, Hugo, Martin, Louis, Stone, Kevin, Albert, Peter, Almahairi, Amjad, Babaei, Yasmine, Bashlykov, Nikolay, Batra, Soumya, Bhargava, Prajjwal, Bhosale, Shruti, others. (2023). LLaMA 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
[li2024llavanext-strong] Li, Bo, Zhang, Kaichen, Zhang, Hao, Guo, Dong, Zhang, Renrui, Li, Feng, Zhang, Yuanhan, Liu, Ziwei, Li, Chunyuan. (2024). LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild.
[yue2023mmmu] Yue, Xiang, Ni, Yuansheng, Zhang, Kai, Zheng, Tianyu, Liu, Ruoqi, Zhang, Ge, Stevens, Samuel, Jiang, Dongfu, Ren, Weiming, Sun, Yuxuan, others. (2024). Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. CVPR.
[hiippala2021ai2d] Hiippala, Tuomo, Alikhani, Malihe, Haverinen, Jonas, Kalliokoski, Timo, Logacheva, Evanfiya, Orekhova, Serafina, Tuomainen, Aino, Stone, Matthew, Bateman, John A. (2021). AI2D-RST: A multimodal corpus of 1000 primary school science diagrams. Language Resources and Evaluation.
[brazil2023omni3d] Brazil, Garrick, Kumar, Abhinav, Straub, Julian, Ravi, Nikhila, Johnson, Justin, Gkioxari, Georgia. (2023). Omni3d: A large benchmark and model for 3d object detection in the wild. CVPR.
[zhou2019semantic] Zhou, Bolei, Zhao, Hang, Puig, Xavier, Xiao, Tete, Fidler, Sanja, Barriuso, Adela, Torralba, Antonio. (2019). Semantic understanding of scenes through the ade20k dataset. IJCV.
[lin2014microsoft] Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollár, Piotr, Zitnick, C Lawrence. (2014). Microsoft COCO: Common objects in context. ECCV.
[fu2024blink] Fu, Xingyu, Hu, Yushi, Li, Bangzheng, Feng, Yu, Wang, Haoyu, Lin, Xudong, Roth, Dan, Smith, Noah A, Ma, Wei-Chiu, Krishna, Ranjay. (2024). BLINK: Multimodal Large Language Models Can See but Not Perceive. arXiv preprint arXiv:2404.12390.
[russakovsky2015imagenet] Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, others. (2015). Imagenet large scale visual recognition challenge. IJCV.
[aquinas] Thomas Aquinas. Quaestiones Disputatae de Veritate.
[aristotle-metaphysics-350BCE] Aristotle. Metaphysics.
[parker2003blink] Parker, Andrew. (2003). In the blink of an eye: how vision sparked the big bang of evolution.
[chalmers2023does] David J. Chalmers. (2023). Does Thought Require Sensory Grounding? From Pure Thinkers to Large Language Models. Proceedings and Addresses of the American Philosophical Association.
[piaget1952origins] Piaget, Jean, Cook, Margaret, others. (1952). The origins of intelligence in children.
[hoffmann2022training] Hoffmann, Jordan, Borgeaud, Sebastian, Mensch, Arthur, Buchatskaya, Elena, Cai, Trevor, Rutherford, Eliza, Casas, Diego de Las, Hendricks, Lisa Anne, Welbl, Johannes, Clark, Aidan, others. (2023). Training compute-optimal large language models. NeurIPS.
[brown2020language] Brown, Tom, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared D, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, others. (2020). Language models are few-shot learners. NeurIPS.
[laurenccon2024matters] Laurençon, Hugo, Tronchon, Léo, Cord, Matthieu, Sanh, Victor. (2024). What matters when building vision-language models?. arXiv preprint arXiv:2405.02246.
[girshick2014rich] Girshick, Ross, Donahue, Jeff, Darrell, Trevor, Malik, Jitendra. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR.
[mathew2022infographicvqa] Mathew, Minesh, Bagal, Viraj, Tito, Rubèn, others. (2022). InfographicVQA. WACV.
[chen2024we] Chen, Lin, Li, Jinsong, Dong, Xiaoyi, Zhang, Pan, Zang, Yuhang, Chen, Zehui, Duan, Haodong, Wang, Jiaqi, Qiao, Yu, Lin, Dahua, others. (2024). Are We on the Right Way for Evaluating Large Vision-Language Models?. arXiv preprint arXiv:2403.20330.
[wu2024janus] Wu, Chengyue, Chen, Xiaokang, Wu, Zhiyu, Ma, Yiyang, Liu, Xingchao, Pan, Zizheng, Liu, Wen, Xie, Zhenda, Yu, Xingkai, Ruan, Chong, others. (2024). Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848.
[huh2024platonic] Huh, Minyoung, Cheung, Brian, Wang, Tongzhou, Isola, Phillip. (2024). The platonic representation hypothesis. ICML.
[yu2024representation] Yu, Sihyun, Kwak, Sangkyung, Jang, Huiwon, Jeong, Jongheon, Huang, Jonathan, Shin, Jinwoo, Xie, Saining. (2024). Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think. arXiv preprint arXiv:2410.06940.
[agrawal2024pixtral] Agrawal, Pravesh, Antoniak, Szymon, Hanna, Emma Bou, Chaplot, Devendra, Chudnovsky, Jessica, Garg, Saurabh, Gervet, Theophile, Ghosh, Soham, Héliou, Amélie, Jacob, Paul, others. (2024). Pixtral 12B. arXiv preprint arXiv:2410.07073.
[lu2024unified] Lu, Jiasen, Clark, Christopher, Lee, Sangho, Zhang, Zichen, Khosla, Savya, Marten, Ryan, Hoiem, Derek, Kembhavi, Aniruddha. (2024). Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action. CVPR.
[aghajanyan2022cm3] Aghajanyan, Armen, Huang, Bernie, Ross, Candace, Karpukhin, Vladimir, Xu, Hu, Goyal, Naman, Okhonko, Dmytro, Joshi, Mandar, Ghosh, Gargi, Lewis, Mike, others. (2022). Cm3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520.
[lu2022unified] Lu, Jiasen, Clark, Christopher, Zellers, Rowan, Mottaghi, Roozbeh, Kembhavi, Aniruddha. (2022). Unified-io: A unified model for vision, language, and multi-modal tasks. ICLR.
[agrawal2018don] Agrawal, Aishwarya, Batra, Dhruv, Parikh, Devi, Kembhavi, Aniruddha. (2018). Don't just assume; look and answer: Overcoming priors for visual question answering. CVPR.
[chen2024sharegpt4video] Chen, Lin, Wei, Xilin, Li, Jinsong, Dong, Xiaoyi, Zhang, Pan, Zang, Yuhang, Chen, Zehui, Duan, Haodong, Lin, Bin, Tang, Zhenyu, others. (2024). Sharegpt4video: Improving video understanding and generation with better captions. NeurIPS.
[krojer2024learning] Krojer, Benno, Vattikonda, Dheeraj, Lara, Luis, Jampani, Varun, Portelance, Eva, Pal, Christopher, Reddy, Siva. (2024). Learning Action and Reasoning-Centric Image Editing from Videos and Simulations. NeurIPS.
[hessel2021clipscore] Hessel, Jack, Holtzman, Ari, Forbes, Maxwell, Bras, Ronan Le, Choi, Yejin. (2021). Clipscore: A reference-free evaluation metric for image captioning. EMNLP.
[brooks2023instructpix2pix] Brooks, Tim, Holynski, Aleksander, Efros, Alexei A. (2023). Instructpix2pix: Learning to follow image editing instructions. CVPR.
[goyal2017making] Goyal, Yash, Khot, Tejas, Summers-Stay, Douglas, Batra, Dhruv, Parikh, Devi. (2017). Making the v in vqa matter: Elevating the role of image understanding in visual question answering. CVPR.
[AllenZhu-icml2024-tutorial] Allen-Zhu, Zeyuan. (2024). ICML 2024 Tutorial: Physics of Language Models. Project page: https://physics.allen-zhu.com/.
[YXLA2024-gsm1] Ye, Tian, Xu, Zicheng, Li, Yuanzhi, Allen-Zhu, Zeyuan. (2024). Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process. arXiv preprint arXiv:2407.20311.
[majumdar2024openeqa] Majumdar, Arjun, Ajay, Anurag, Zhang, Xiaohan, Putta, Pranav, Yenamandra, Sriram, Henaff, Mikael, Silwal, Sneha, Mcvay, Paul, Maksymets, Oleksandr, Arnaud, Sergio, others. (2024). OpenEQA: Embodied Question Answering in the Era of Foundation Models. 2nd Workshop on Mobile Manipulation and Embodied Intelligence at ICRA 2024.
[minigemini] Li, Yanwei, Zhang, Yuechen, Wang, Chengyao, Zhong, Zhisheng, Chen, Yixin, Chu, Ruihang, Liu, Shaoteng, Jia, Jiaya. (2024). Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814.
[geirhos2020shortcut] Geirhos, Robert, Jacobsen, Jörn-Henrik, others. (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence.
[wei2022chain] Wei, Jason, Wang, Xuezhi, Schuurmans, Dale, Bosma, Maarten, Xia, Fei, Chi, Ed, Le, Quoc V, Zhou, Denny, others. (2022). Chain-of-thought prompting elicits reasoning in large language models. NeurIPS.
[Geiger2012CVPR] Andreas Geiger, Philip Lenz, Raquel Urtasun. (2012). Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. CVPR.
[caesar2020nuscenes] Caesar, Holger, Bankiti, Varun, Lang, Alex H, Vora, Sourabh, Liong, Venice Erin, Xu, Qiang, Krishnan, Anush, Pan, Yu, Baldan, Giancarlo, Beijbom, Oscar. (2020). nuscenes: A multimodal dataset for autonomous driving. CVPR.
[song2015sun] Song, Shuran, Lichtenberg, Samuel P, Xiao, Jianxiong. (2015). Sun rgb-d: A rgb-d scene understanding benchmark suite. CVPR.
[dehghan2021arkitscenes] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, Elad Shulman. (2021). ARKitScenes: A Diverse Real-World Dataset for 3D Indoor Scene Understanding Using Mobile RGB-D Data. NeurIPS Datasets and Benchmarks Track (Round 1).
[hypersim] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, Joshua M. Susskind. (2021). Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding. ICCV.
[objectron2021] Ahmadyan, Adel, Zhang, Liangkai, Ablavatski, Artsiom, Wei, Jianing, Grundmann, Matthias. (2021). Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations. CVPR.
[wang2024qwen2] Wang, Peng, Bai, Shuai, Tan, Sinan, Wang, Shijie, Fan, Zhihao, Bai, Jinze, Chen, Keqin, Liu, Xuejing, Wang, Jialin, Ge, Wenbin, others. (2024). Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191.
[li2024llava] Li, Bo, Zhang, Yuanhan, Guo, Dong, Zhang, Renrui, Li, Feng, Zhang, Hao, Zhang, Kaichen, Li, Yanwei, Liu, Ziwei, Li, Chunyuan. (2024). Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326.
[tong2024cambrian] Tong, Shengbang, Brown, Ellis, Wu, Penghao, Woo, Sanghyun, Middepogu, Manoj, Akula, Sai Charitha, Yang, Jihan, Yang, Shusheng, Iyer, Adithya, Pan, Xichen, others. (2024). Cambrian-1: A fully open, vision-centric exploration of multimodal llms. NeurIPS.
[li2023blip] Li, Junnan, Li, Dongxu, Savarese, Silvio, Hoi, Steven. (2023). Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ICML.
[zhou2024transfusion] Zhou, Chunting, Yu, Lili, Babu, Arun, Tirumala, Kushal, Yasunaga, Michihiro, Shamis, Leonid, Kahn, Jacob, Ma, Xuezhe, Zettlemoyer, Luke, Levy, Omer. (2024). Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039.
[wu2024vila] Wu, Yecheng, Zhang, Zhuoyang, Chen, Junyu, Tang, Haotian, Li, Dacheng, Fang, Yunhao, Zhu, Ligeng, Xie, Enze, Yin, Hongxu, Yi, Li, others. (2024). Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429.
[xie2024show] Xie, Jinheng, Mao, Weijia, Bai, Zechen, Zhang, David Junhao, Wang, Weihao, Lin, Kevin Qinghong, Gu, Yuchao, Chen, Zhijie, Yang, Zhenheng, Shou, Mike Zheng. (2024). Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528.
[baddeley1992working] Baddeley, Alan. (1992). Working memory. Science.
[amit2017asymmetrical] Amit, Elinor, Hoeflin, Caitlyn, Hamzah, Nada, Fedorenko, Evelina. (2017). An asymmetrical relationship between verbal and visual thinking: Converging evidence from behavior and fMRI. NeuroImage.
[paivio1990mental] Paivio, Allan. (1990). Mental representations: A dual coding approach.
[ganis2004brain] Ganis, Giorgio, Thompson, William L, Kosslyn, Stephen M. (2004). Brain areas underlying visual mental imagery and visual perception: an fMRI study. Cognitive Brain Research.
[lecun2022path] LeCun, Yann. (2022). A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. Open Review.
[amit2009distance] Amit, Elinor, Algom, Daniel, Trope, Yaacov. (2009). Distance-dependent processing of pictures and words.. Journal of Experimental Psychology: General.
[amit2013use] Amit, Elinor, Wakslak, Cheryl, Trope, Yaacov. (2013). The use of visual and verbal means of communication across psychological distance. Personality and Social Psychology Bulletin.
[ormazabal2024reka] Ormazabal, Aitor, Zheng, Che, d'Autume, Cyprien de Masson, Yogatama, Dani, Fu, Deyu, Ong, Donovan, Chen, Eric, Lamprecht, Eugenie, Pham, Hai, Ong, Isaac, others. (2024). Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models. arXiv preprint arXiv:2404.12387.
[liu2024world] Liu, Hao, Yan, Wilson, Zaharia, Matei, Abbeel, Pieter. (2024). World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268.
[wei2021finetuned] Wei, Jason, Bosma, Maarten, Zhao, Vincent Y, Guu, Kelvin, Yu, Adams Wei, Lester, Brian, Du, Nan, Dai, Andrew M, Le, Quoc V. (2022). Finetuned language models are zero-shot learners. ICLR.
[bordes2022high] Florian Bordes, Randall Balestriero, Pascal Vincent. (2022). High Fidelity Visualization of What Your Self-Supervised Representation Knows About. TMLR.
[Wadekar2024-bs] Wadekar, Shakti N, Chaurasia, Abhishek, Chadha, Aman, Culurciello, Eugenio. (2024). The evolution of multimodal model architectures. arXiv preprint.
[luo2024task] Luo, Grace, Darrell, Trevor, Bar, Amir. (2024). Task Vectors are Cross-Modal. arXiv preprint arXiv:2410.22330.
[bib1] Aghajanyan et al. (2022) Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, et al. Cm3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520, 2022.
[bib2] Agrawal et al. (2024) Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Devendra Chaplot, Jessica Chudnovsky, Saurabh Garg, Theophile Gervet, Soham Ghosh, Amélie Héliou, Paul Jacob, et al. Pixtral 12b. arXiv preprint arXiv:2410.07073, 2024.
[bib3] AI@Meta. Llama 3 model card. 2024.
[bib4] Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
[bib5] Zeyuan Allen-Zhu. ICML 2024 Tutorial: Physics of Language Models, 2024. Project page: https://physics.allen-zhu.com/.
[bib6] Anthropic. Claude, 2024.
[bib7] Ba et al. (2016) Jimmy Lei Ba, Jamie Kiros, and Geoffrey E. Hinton. Layer normalization. In NeurIPS, 2016.
[bib8] Bardes et al. (2024) Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. In TMLR, 2024.
[bib9] Bordes et al. (2022) Florian Bordes, Randall Balestriero, and Pascal Vincent. High fidelity visualization of what your self-supervised representation knows about. In TMLR, 2022.
[bib10] Brooks et al. (2023) Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In CVPR, 2023.
[bib11] Chen et al. (2023) Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023.
[bib12] Chen et al. (2024a) Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understanding and generation with better captions. In NeurIPS, 2024a.
[bib13] Chen et al. (2024b) Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024b.
[bib14] Dai et al. (2024) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2024.
[bib15] Dong et al. (2024) Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. In ICLR, 2024.
[bib16] Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[bib17] Gadre et al. (2024) Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. In NeurIPS, 2024.
[bib18] Gao et al. (2024) Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, et al. Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935, 2024.
[bib19] Ge et al. (2023) Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, and Ying Shan. Planting a seed of vision in large language model. arXiv preprint arXiv:2307.08041, 2023.
[bib20] Goyal et al. (2017a) Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017a.
[bib21] Goyal et al. (2017b) Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017b.
[bib22] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
[bib23] Hessel et al. (2021) Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In EMNLP, 2021.
[bib24] Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
[bib25] Huh et al. (2024) Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. In ICML, 2024.
[bib26] Kar et al. (2025) Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, and Federico Tombari. Brave: Broadening the visual encoding of vision-language models. In ECCV, 2025.
[bib27] Koh et al. (2024) Jing Yu Koh, Daniel Fried, and Russ R Salakhutdinov. Generating images with multimodal language models. In NeurIPS, 2024.
[bib28] Krojer et al. (2024) Benno Krojer, Dheeraj Vattikonda, Luis Lara, Varun Jampani, Eva Portelance, Christopher Pal, and Siva Reddy. Learning action and reasoning-centric image editing from videos and simulations. In NeurIPS, 2024.
[bib29] Laurençon et al. (2024a) Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems, 36, 2024a.
[bib30] Laurençon et al. (2024b) Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? arXiv preprint arXiv:2405.02246, 2024b.
[bib31] Yann LeCun. A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. Open Review, 62(1):1–62, 2022.
[bib32] Li et al. (2024a) Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024a.
[bib33] Li et al. (2024b) Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In CVPR, 2024b.
[bib34] Li et al. (2024c) Tianhong Li, Dina Katabi, and Kaiming He. Return of unconditional generation: A self-supervised representation generation method. In NeurIPS, 2024c.
[bib35] Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[bib36] Liu et al. (2023) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
[bib37] Liu et al. (2024a) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024a.
[bib38] Liu et al. (2024b) Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024b.
[bib39] Liu et al. (2024c) Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024c.
[bib40] Liu et al. (2024d) Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In ECCV, 2024d.
[bib41] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[bib42] Lu et al. (2022a) Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks. In ICLR, 2022a.
[bib43] Lu et al. (2024) Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. In CVPR, 2024.
[bib44] Lu et al. (2022b) Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022b.
[bib45] Masry et al. (2022) Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In ACL, 2022.
[bib46] McKinzie et al. (2024) Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, et al. Mm1: Methods, analysis & insights from multimodal llm pre-training. arXiv preprint arXiv:2403.09611, 2024.
[bib47] Miech et al. (2019) Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019.
[bib48] OpenAI. gpt4o, 2024.
[bib49] Pan et al. (2024a) Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Autonomous evaluation and refinement of digital agents. In COLM, 2024a.
[bib50] Pan et al. (2024b) Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, and Furu Wei. Kosmos-g: Generating images in context with multimodal large language models. In ICLR, 2024b.
[bib51] Preechakul et al. (2022) Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. In CVPR, 2022.
[bib52] Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[bib53] Rajbhandari et al. (2020) Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.
[bib54] Roberts et al. (2019) Adam Roberts, Colin Raffel, Katherine Lee, Michael Matena, Noam Shazeer, Peter J Liu, Sharan Narang, Wei Li, and Yanqi Zhou. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2019.
[bib55] Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[bib56] Schuhmann et al. (2022) Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
[bib57] Shao et al. (2024) Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. In NeurIPS, 2024.
[bib58] Sidorov et al. (2020) Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension, 2020.
[bib59] Sun et al. (2024a) Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. In CVPR, 2024a.
[bib60] Sun et al. (2024b) Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. In ICLR, 2024b.
[bib61] Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpaca: A strong, replicable instruction-following model, 2023.
[bib62] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.
[bib63] Tong et al. (2024a) Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. In NeurIPS, 2024a.
[bib64] Tong et al. (2024b) Shengbang Tong, Erik Jones, and Jacob Steinhardt. Mass-producing failures of multimodal systems with language models. In NeurIPS, 2024b.
[bib65] Tong et al. (2024c) Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In CVPR, 2024c.
[bib66] Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. LLaMA 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[bib67] Wang et al. (2024a) Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024a.
[bib68] Wang et al. (2024b) Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024b.
[bib69] Wei et al. (2022a) Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In ICLR, 2022a.
[bib70] Wei et al. (2022b) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022b.
[bib71] Wu et al. (2024a) Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848, 2024a.
[bib72] Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. In CVPR, 2024.
[bib73] Wu et al. (2024b) Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024b.
[bib74] xAI. grok, 2024.
[bib75] Xie et al. (2024) Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024.
[bib76] Xu et al. (2024) Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data. In ICLR, 2024.
[bib77] Ye et al. (2024) Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process. ArXiv e-prints, abs/2407.20311, 2024. Full version available at http://arxiv.org/abs/2407.20311.
[bib78] Yue et al. (2024a) Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In CVPR, 2024a.
[bib79] Yue et al. (2024b) Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Ming Yin, Botao Yu, Ge Zhang, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2409.02813, 2024b.
[bib80] Yuksekgonul et al. (2022) Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? In ICLR, 2022.
[bib81] Zhai et al. (2023) Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023.
[bib82] Zhai et al. (2024) Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, et al. Fine-tuning large vision-language models as decision-making agents via reinforcement learning. In NeurIPS, 2024.
[bib83] Zhang et al. (2024) Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander Hauptmann, Yonatan Bisk, et al. Direct preference optimization of video large multimodal models from language model reward. arXiv preprint arXiv:2404.01258, 2024.
[bib84] Zhang et al. (2023) Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, and Alexander Toshev. Pre-trained language models do not help auto-regressive text-to-image generation. In EMNLP, 2023.
[bib85] Zhou et al. (2024a) Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. In NeurIPS, 2024a.
[bib86] Zhou et al. (2024b) Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024b.
[bib87] Zohar et al. (2024) Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, and Serena Yeung-levy. Video-star: Self-training enables video instruction tuning with any supervision. In arXiv preprint arXiv:2407.06189, 2024.