Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Abstract
Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the visual component typically depends only on the instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities in recent MultiModal LLMs (MLLMs) still exhibit systematic shortcomings. To understand the roots of these errors, we explore the gap between the visual embedding space of CLIP and vision-only self-supervised learning. We identify "CLIP-blind pairs": images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across nine basic visual patterns, often providing incorrect answers and hallucinated explanations. We further evaluate various CLIP-based vision-and-language models and find a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs. As an initial effort to address these issues, we propose a Mixture of Features (MoF) approach, demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities. Together, our research suggests that visual representation learning remains an open challenge, and accurate visual grounding is crucial for future successful multimodal systems.
Shengbang Tong 1, Zhuang Liu 2, Yuexiang Zhai 3, Yi Ma 3, Yann LeCun 1, Saining Xie 1
1 New York University
2 FAIR, Meta
3 UC Berkeley
![Figure 1](2401.06209-figure_000.png)

Figure 1. Instances are systematically identified where the visual question answering (VQA) capabilities of GPT-4V [41] fall short (date accessed: Nov 04, 2023). Our research highlights scenarios in which advanced systems like GPT-4V struggle with seemingly simple questions due to inaccurate visual grounding. Text in red signifies an incorrect response, while text in green represents hallucinated explanations for the incorrect answer. All the images referenced are sourced from the ImageNet-1K and LAION-Aesthetics datasets.
Introduction
Multimodal Large Language Models (MLLMs) [8, 13, 31, 40] have been rapidly developing in recent times. MLLMs integrate images into large language models (LLMs) and leverage the powerful abilities of LLMs [41, 59, 69], showcasing remarkable proficiency in tasks such as image understanding, visual question answering, and instruction following. In particular, the recently released GPT-4V(ision) [40] has pushed performance to an unprecedented level [41, 63].
Beneath the advancements of these models, we find there exists a notable weakness: they still exhibit visual shortcomings, some of which are surprisingly elementary and evident (see Figure 1). We ask: Where do these problems originate? Is it a deficiency in visual modality, language understanding, or their alignment? In this work, we suggest that these shortcomings observed in MLLMs might stem from a problem related to the visual representations .
At their core, most MLLMs [8, 31, 71] are built on pretrained vision [43, 54] and language [59, 68, 69] models. These models are connected using various types of adapters [2, 26, 31] to integrate the different modalities. A natural hypothesis is that any limitation in the pretrained vision models can cascade into the downstream MLLMs that adopt them. Studies have explored a similar issue for language. For example, Tong et al. [57] and Yuksekgonul et al. [65] demonstrate that failure patterns in a pretrained text encoder [43, 44] lead to downstream failures in text-guided generative models [22, 46].
On the vision side, most open-source MLLMs [2, 26, 31] adopt the pretrained Contrastive Language-Image Pre-Training (CLIP) model [43] as the visual encoder. We begin by identifying failure examples that CLIP struggles to encode properly (Section 2). Inspired by Tong et al. [57], we exploit erroneous agreements in the embedding space. If two visually different images are encoded similarly by CLIP, then at least one of the images is likely ambiguously encoded. We call such a pair of images a CLIP-blind pair. To measure the visual similarity between images, we use a vision-only self-supervised encoder such as DINOv2 [42]. In this context, CLIP-blind pairs are images with similar CLIP embeddings but different DINOv2 embeddings.
We discover that these CLIP-blind pairs indeed lead to errors in downstream MLLMs. With these pairs, we introduce the Multimodal Visual Patterns (MMVP) benchmark. This benchmark is specifically designed to inquire about differences in CLIP-blind pairs and to evaluate the visual abilities of state-of-the-art MLLMs with straightforward questions. We evaluate a variety of open-source [8, 30, 31, 71] and closed-source models [13, 41] including GPT-4V [40], and conduct a user study to measure human performance. The results show that MLLMs struggle with straightforward visual questions. Most of these models perform below the level of random guessing, with GPT-4V being the exception. Yet even GPT-4V exhibits a considerable performance gap, exceeding 50%, compared to human performance.
Having identified a large number of individual failure instances in MLLMs, we continue to study the systematic visual patterns in MMVP with which CLIP models struggle (Section 3). We summarize nine prevalent patterns of the CLIP-blind pairs in MMVP, such as 'orientation', 'counting', and 'viewpoint', which pose significant challenges for the CLIP vision encoder. Notice that there has been significant and ongoing progress in scaling up both training data and model size for CLIP [10, 43, 54, 62, 66]. We categorize examples from MMVP into visual patterns to systematically assess whether scaling alone can mitigate these challenges. Our findings suggest that 7 out of the 9 identified visual patterns are not resolved by any large-scale CLIP-based model, indicating that model/data scaling alone is not sufficient. Moreover, we identify a strong correlation between the visual patterns that challenge CLIP models and the performance of MLLMs. If CLIP struggles with a particular visual pattern, such as 'orientation', MLLMs will likely also fall short. This shows that CLIP vision encoders can become a bottleneck in such systems.
Finally, we take a step towards improving the visual grounding of MLLMs. Since the visual shortcomings of MLLMs stem from their reliance on the CLIP model, we investigate the impact of integrating vision-centric representations into MLLMs (Section 4). Specifically, we explore ways to incorporate a vision-only self-supervised model, such as DINOv2 [42], to enhance the visual grounding capabilities of MLLMs. We refer to these techniques as Mixture-of-Features (MoF). First, we linearly mix CLIP and DINOv2 features in different ratios, which we refer to as Additive-MoF (A-MoF). This process reveals that DINOv2 features are more effective for visual grounding, though they come at the cost of diminished instruction-following ability. To address this, we introduce Interleaved-MoF (I-MoF), which spatially mixes visual tokens from both CLIP and DINOv2 models. We find that this practice significantly enhances visual grounding while maintaining the instruction-following capabilities.
The Multimodal Visual Patterns (MMVP) Benchmark
Currently, the majority of open-source MLLMs [8, 31, 71] use off-the-shelf CLIP vision encoders to process images. In this section, we begin by identifying CLIP-blind pairs in the CLIP model (Section 2.1). Subsequently, we construct the Multimodal Visual Patterns (MMVP) benchmark using these CLIP-blind pairs (Section 2.2). We evaluate SOTA MLLMs, including GPT-4V, on the benchmark (Section 2.3) and find that all the tested models struggle with simple questions about visual details. A visualization of this process is provided in Figure 2.

Figure 2. Constructing the MMVP benchmark via CLIP-blind pairs. Left: We start by finding CLIP-blind pairs that have similar CLIP embeddings but different DINOv2 embeddings. Center: We manually inspect the differences between pair-wise images and formulate questions based on the differences in the images. Right: We ask MLLMs the question alongside the CLIP-blind pair. The model receives a score only when both questions for the CLIP-blind pair are answered correctly.
Finding CLIP-blind Pairs
It is challenging to directly find instances (images) that the CLIP vision encoder struggles to encode properly. To circumvent this issue, we extend the idea proposed in Tong et al. [57] to automatically find blind pairs in vision models. The underlying principle is simple: if two images, despite having stark visual differences, are encoded similarly by the CLIP vision encoder, then one of them is likely encoded ambiguously (see the left of Figure 2 for an example). To measure the visual difference between two images, we examine the images' representations within a reference model: a vision-only self-supervised model trained without any language guidance, e.g., DINOv2 [42]. These models are shown to capture more visual details and information [42, 53].
We take ImageNet [47] and LAION-Aesthetics [48] as corpus datasets to collect these CLIP-blind pairs.
For each pair of images, we compute CLIP embeddings using the CLIP-ViT-L-14 [9, 43] model and DINOv2 embeddings using the DINOv2-ViT-L-14 [9, 42] model. We return pairs whose cosine similarity exceeds 0.95 in the CLIP embedding space but falls below 0.6 in the DINOv2 embedding space.
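As a concrete illustration, this pair-selection rule can be sketched in plain NumPy. The function name and the assumption that embeddings arrive as row matrices are ours; a real pipeline would use features extracted from the actual CLIP and DINOv2 checkpoints.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between corresponding rows of a and b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return (a * b).sum(axis=-1)

def find_clip_blind_pairs(clip_emb, dino_emb, pairs,
                          clip_thresh=0.95, dino_thresh=0.6):
    """Return candidate pairs (i, j) that CLIP sees as near-identical
    (cosine > clip_thresh) but DINOv2 separates (cosine < dino_thresh)."""
    i, j = np.asarray(pairs).T
    clip_sim = cosine_sim(clip_emb[i], clip_emb[j])
    dino_sim = cosine_sim(dino_emb[i], dino_emb[j])
    keep = (clip_sim > clip_thresh) & (dino_sim < dino_thresh)
    return [tuple(p) for p, m in zip(pairs, keep) if m]
```

In practice one would search over all nearest-neighbor candidates rather than a fixed pair list, but the thresholding logic is the same.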
Designing Benchmark from CLIP-blind Pairs
We introduce the Multimodal Visual Patterns (MMVP) benchmark, a Visual Question Answering (VQA) benchmark. Utilizing the collected CLIP-blind pairs, we carefully design 150 pairs with 300 questions. For each CLIP-blind pair of images, we manually pinpoint the visual details that the CLIP vision encoder overlooks (see the middle of Figure 2) and craft questions that probe these visual details, for example, 'Is the dog facing left or right?' (see the right of Figure 2 and more examples in Figure 3). The primary goal is to determine whether MLLMs fail when posed with these seemingly basic questions and overlook critical visual details. Hence, the questions are intentionally straightforward and unambiguous.
Benchmark Results
We assess the questions on SOTA open-source models (LLaVA-1.5 [31], InstructBLIP [8], Mini-GPT4 [71]) and closed-source models (GPT-4V [40], Gemini [14], Bard [13]). We leave the details of how we access the models to Appendix B.1. In our evaluation, each question is queried independently, eliminating any biases from chat histories. We also evaluate human performance through a user study in which users are presented with the 300 questions in a randomized sequence. A pair of images is considered correctly answered only if both questions associated with the pair are answered accurately.
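The pair-level scoring rule can be sketched as follows; the assumption that questions are ordered so that consecutive entries belong to the same CLIP-blind pair is ours.

```python
def mmvp_pair_accuracy(correct):
    """Pair-level MMVP accuracy. `correct` holds one boolean per
    question; questions 2k and 2k+1 probe the same CLIP-blind pair.
    A pair scores only if BOTH of its questions are answered correctly."""
    pairs = [correct[i] and correct[i + 1] for i in range(0, len(correct), 2)]
    return sum(pairs) / len(pairs)
```

This stricter criterion is why pair accuracy can fall below the per-question random-guess rate.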
Human study confirms questions are straightforward. As shown in Figure 4, human participants accurately answer an average of 95.7% of the questions. This high accuracy rate underscores the ease of the questions. More details can be found in Appendix B.4.
Current MLLMs struggle with visual details. As shown in Figure 4, there is a significant performance gap between humans and MLLMs, despite the latter often demonstrating impressive results on other benchmarks [6, 27]. All models except GPT-4V and Gemini score below the random-guess level (25%). Even the most advanced models, GPT-4V and Gemini, face challenges in addressing basic visual grounding questions. Figures 1 and 3 provide examples of errors made by the models. These outcomes suggest that, irrespective of model size or training data, MLLMs struggle with visual details.

Figure 3. Examples of questions in the MMVP benchmark. Incorrect answers are shaded in red. A model is considered correct only if it answers both questions in a pair correctly. Both leading closed-source models (GPT-4V, Gemini) and open-source models (LLaVA-1.5, InstructBLIP) fail these simple visual questions. (See Appendix B.2 for all the questions in the MMVP benchmark.)

Figure 4. Benchmark results of current SOTA MLLMs and humans. We evaluate benchmark questions for current SOTA MLLMs and measure human performance through user studies.
We have also conducted ablation studies, such as swapping options and changing notations in the question formulation (see Appendix B.3 for more details), to further confirm that this poor performance stems from visual incapability, not hallucination in the language model.
Systematic Failures in CLIP
In the previous section, we identify CLIP-blind pairs and use them to find failures in MLLMs. Here, we delve deeper into these pairs to investigate (i) systematic visual patterns that emerge from CLIP-blind pairs (Section 3.1), (ii) whether these visual patterns pose challenges for CLIP-based models even with massive scaling up (Section 3.2), and (iii) the correlation between failure patterns in CLIP models and those in MLLMs (Section 3.3).

![Figure 5](2401.06209-figure_004.png)

Figure 5. Examples from MMVP-VLM. MMVP-VLM consists of image pairs across nine visual patterns. The examples in the figure are from the EVA01 ViT-g-14 model [54], one of the largest CLIP models, which also fails to choose the right image given the text description.
Visual Patterns in CLIP-blind Pairs
Having identified the CLIP-blind pairs, we summarize systematic visual patterns that the CLIP vision encoders might consistently misinterpret. Directly capturing systematic visual patterns from the CLIP-blind pairs alone is too abstract. Therefore, we turn to the questions and options in the MMVP benchmark. With these questions, we transform abstract visual patterns in images into clearer, language-based descriptors that are easier to categorize.
In this work, we use GPT-4 [41] to categorize general patterns from the MMVP questions and options.
The MMVP-VLM Benchmark
CLIP-based models have developed rapidly since their introduction [43]. We want to test whether these visual patterns still pose challenges for more recent CLIP models [10, 54, 62, 66], which have significantly scaled up in terms of training data and model size. To do so, we introduce a new benchmark, MMVP-VLM, to systematically study whether CLIP models handle these visual patterns well.
We distill a subset of questions from the MMVP benchmark into simpler language descriptions and categorize them into visual patterns. To maintain a balanced number of questions for each visual pattern, we add a few questions, if needed, to ensure that each visual pattern is represented by 15 text-image pairs. Examples of pairs are shown in Figure 5. A pair is deemed correctly answered if the model can accurately match both image-text combinations.
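This matching criterion can be sketched as follows, assuming L2-normalized CLIP image and text embeddings for one pair; the function name is illustrative, not from the paper's code.

```python
import numpy as np

def vlm_pair_correct(img_emb, txt_emb):
    """img_emb, txt_emb: (2, d) normalized embeddings for one MMVP-VLM
    pair, where text k describes image k. The pair counts as correct
    only if each text is most similar to its own image."""
    sims = txt_emb @ img_emb.T               # (2, 2) cosine similarities
    return bool((sims.argmax(axis=1) == np.arange(2)).all())
```

Chance performance under this both-directions criterion is 25%, matching the pair-level scoring used for MLLMs.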
We evaluate MMVP-VLM on a variety of CLIP models [10, 43, 54, 62, 66]. These models vary in size, training data, and methodology. As evidenced in Table 1, increasing network size and training data aids with only two visual patterns: 'color and appearance' and 'state and condition'. The remaining visual patterns continue to challenge all CLIP-based models. We also find that ImageNet-1k zero-shot accuracy is not a definitive indicator of a model's performance on these visual patterns. This underscores the necessity for additional evaluation metrics, such as MMVP-VLM, to accurately assess a model's capabilities in areas beyond image classification.
How CLIP's Errors Affect MLLMs
After analyzing the visual patterns that CLIP models struggle with, we pose the following question: Is there a correlation between the underperformance of CLIP and MLLMs' visual incapability? To explore this, we categorize questions from MMVP into the visual patterns summarized above and calculate each MLLM's performance on these patterns.

Table 1. Performance of various CLIP-based models on different visual patterns in the MMVP-VLM benchmark. Models scaled up in resolution show minimal improvement, whereas a slight advantage is observed when scaling up the network. For each visual pattern, ImageNet-1k zero-shot accuracy, and the MMVP average, we use light gray to highlight the best performance. For most of the visual patterns, all CLIP-based methods struggle, as evident from the scores. Due to space limits, the table uses symbols for the visual patterns: Orientation and Direction, Presence of Specific Features, State and Condition, Quantity and Count, Positional and Relational Context, Color and Appearance, Structural and Physical Characteristics, Texts, and Viewpoint and Perspective.

Figure 6. CLIP's and MLLMs' performance on visual patterns. If CLIP performs poorly on a visual pattern such as 'orientation', MLLMs also underperform on that visual pattern.
In Figure 6, we plot CLIP's performance and MLLMs' performance for each visual pattern. When the CLIP vision encoder underperforms on a certain visual pattern, the MLLM tends to exhibit similar shortcomings. Open-source models such as LLaVA-1.5 [30] and InstructBLIP [8], which explicitly use the CLIP vision encoder, display a strong correlation in performance.
Further, we calculate the Pearson correlation coefficient between the CLIP model's and each MLLM's performance on the visual patterns. LLaVA-1.5 and InstructBLIP both have coefficients greater than 0.7. This high score indicates that weaknesses in visual pattern recognition in the CLIP model are transferred to MLLMs. More details on the Pearson correlation coefficient can be found in Appendix C.
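For reference, the Pearson correlation coefficient over per-pattern accuracy vectors can be computed as below (equivalent to `scipy.stats.pearsonr`); the input names are illustrative.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two score vectors,
    e.g., the per-pattern accuracies of CLIP and of one MLLM."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))
```

A value above 0.7, as reported for LLaVA-1.5 and InstructBLIP, indicates a strong positive linear relationship between the two models' per-pattern scores.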
Mixture-of-Features (MoF) for MLLM
Based on our exploration in earlier sections, a natural question arises: If open-source MLLMs' visual shortcomings come from the CLIP vision encoder, how do we build a more competent visual encoder? In this section, we take initial steps to answer this question by studying Mixture-of-Features (MoF). We start with Additive MoF, which mixes CLIP features and vision-only SSL model features. Results show that each encoder presents unique advantages and limitations when employed as the pretrained model in an MLLM (Section 4.2). We subsequently propose Interleaved MoF, which integrates features from both CLIP and SSL models into the MLLM to enhance visual grounding without compromising the model's ability to follow instructions (Section 4.3).
Experiment Setting
We adopt LLaVA [30, 31] as the framework for studying visual encoders in MLLMs. LLaVA uses a pretrained CLIP encoder and trains an adapter to align visual tokens with language tokens in the LLM (see the left side of Figure 7). We use DINOv2 [42] as the vision-only SSL model in our work because it is currently the most scalable vision-only model. Our exploration uses two visual encoders: CLIP-ViT-L-14 [43] and DINOv2-ViT-L-14 [42]. To ensure consistent and fair comparisons, we train and finetune our model with the same experimental settings as LLaVA. We include additional experimental details in Appendix A.
Additive MoF
We add a pretrained DINOv2 encoder into the MLLM and mix its features with those of the pretrained CLIP encoder. We use a coefficient α to control the proportion of CLIP features and 1 − α to control the proportion of DINOv2 features, and linearly add them together: F = α · F_CLIP + (1 − α) · F_DINOv2 (see the middle part of Figure 7 for a visualization).

Figure 7. Different Mixture-of-Features (MoF) strategies in MLLMs. Left: standard MLLM that uses CLIP as the off-the-shelf pretrained vision encoder; Middle: Additive-MoF (A-MoF) MLLM, which linearly mixes CLIP and DINOv2 features before the adapter; Right: Interleaved-MoF (I-MoF) MLLM, which spatially interleaves CLIP visual tokens and DINOv2 visual tokens after the adapter.
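A minimal sketch of this mixing step, assuming the two encoders' features have already been projected to a common token count and width (the function name is ours):

```python
import numpy as np

def additive_mof(clip_feats, dino_feats, alpha=0.5):
    """Additive-MoF: linearly mix CLIP and DINOv2 visual features
    before the adapter, F = alpha * F_CLIP + (1 - alpha) * F_DINOv2.
    Both inputs: (num_tokens, dim) arrays of matching shape."""
    return alpha * clip_feats + (1.0 - alpha) * dino_feats
```

Sweeping alpha over {0.00, 0.25, 0.50, 0.75, 1.00} reproduces the trade-off described in the text: more DINOv2 weight improves grounding but hurts instruction following.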
We evaluate the model's visual grounding ability using the MMVP benchmark proposed in Section 2 and the model's instruction-following capability using the LLaVA benchmark introduced in Liu et al. [31]. Initially, we conduct five experiments in which we linearly transition from using 100% CLIP features to 100% DINOv2 features. In these tests, the DINOv2 feature proportions are set at {0.00, 0.25, 0.50, 0.75, 1.00}. To further verify the observed trends, we introduce two additional experiments with DINOv2 proportions of {0.625, 0.875}. Our findings, presented in Table 2, reveal two insights:
- As the proportion of DINOv2 features increases, MLLM exhibits a decline in its instruction-following capability. Notably, there is a sharp decrease when the DINOv2 proportion reaches 87.5%.
- A higher proportion of DINOv2 features enhances the model's visual grounding capability, but this advantage diminishes when the DINOv2 proportion surpasses 0.75, at which point instruction-following is notably impaired.
Hence, if we were to add DINOv2 features or completely replace CLIP with DINOv2, it would result in a trade-off between visual grounding and instruction-following. A higher proportion of DINOv2 features improves the model's visual perception at the expense of its ability to follow linguistic instructions, while CLIP features enhance language comprehension but reduce visual grounding.
Interleaved MoF
We propose Interleaved MoF to leverage the advantages of both CLIP and DINOv2 embeddings to enhance image representation. An image is concurrently passed into the CLIP and DINOv2 encoders, and the resulting embeddings are individually processed by adapters. We take the processed features from CLIP and DINOv2 and interleave them while maintaining their original spatial order. We then feed the interleaved features to the LLM (see the right part of Figure 7).
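A minimal sketch of the interleaving step, assuming both adapters emit the same number of tokens at the same width (the function name is ours):

```python
import numpy as np

def interleaved_mof(clip_tokens, dino_tokens):
    """Interleaved-MoF: spatially interleave adapter-processed CLIP and
    DINOv2 visual tokens, preserving each stream's original order:
    [c0, d0, c1, d1, ...]. Both inputs: (num_tokens, dim) arrays."""
    n, d = clip_tokens.shape
    out = np.empty((2 * n, d), dtype=clip_tokens.dtype)
    out[0::2] = clip_tokens
    out[1::2] = dino_tokens
    return out
```

The output sequence is twice as long, which is why the text contrasts this with merely increasing image resolution (and thus token count) without interleaving.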
Table 2. Empirical results of Additive MoF. We use DINOv2 as the image SSL model in our work. With more DINOv2 features added, visual grounding improves while instruction-following ability declines.
We summarize the results in Table 3. Under the LLaVA setting, Interleaved MoF significantly enhances visual grounding, with a 10.7% increase observed on MMVP, without compromising the model's ability to follow instructions. This experiment is replicated with the LLaVA-1.5 setting and under various image resolution settings, yielding similar enhancements in performance. We also evaluate on POPE [27], which is designed to test hallucination in visual grounding; Interleaved MoF again shows consistent improvement over the original LLaVA models. Merely increasing the image resolution, and consequently the number of tokens, does not boost visual grounding capabilities. Instead, it is the interleaving of features from vision-only SSL models and VLM models that leads to improved performance in visual grounding tasks. We conduct more experiments using MAE or MoCoV3 as the vision-only SSL model in I-MoF and show similar improvements in visual grounding tasks in Appendix E.1. We also evaluate Interleaved MoF on additional benchmarks such as MMBench [32] and GQA [21], finding that it achieves similar performance on these benchmarks. Please refer to Appendix E.2 for more results on these benchmarks.
Related Works
Multimodal LLMs. We study the limitations of Multimodal LLMs [8, 13, 30, 31, 40] and explore possible ways to improve these models. Multimodal LLMs are built from pretrained Large Language Models [3, 41, 58, 59, 69] and a CLIP vision encoder [43, 54]. These systems then use an adapter, such as MLPs [30, 31], Q-Former [8, 26], or gated attention [2, 25], to integrate the pretrained CLIP vision encoder into LLMs. More recently, InstructBLIP [8] and LLaVA-1.5 [30] highlight the importance of high-quality training data. Yet there is a scarcity of research focusing on the impact of visual encoders, an important gap our work aims to address through a systematic study.
Evaluating Multimodal LLMs. MMVP assesses MLLMs using a set of simple yet critical Visual Question Answering (VQA) questions constructed from CLIP-blind pairs. Previous benchmarks such as TextVQA [52], VQAv2 [15], and GQA [21] have centered on traditional VQA queries. Recently, benchmarks like MM-Vet [64], POPE [27], and MM-Bench [32] have been designed to specifically evaluate multimodal LLMs on hallucination, reasoning, and robustness. These benchmarks and evaluations have shown that multimodal LLMs can suffer from hallucination [28, 29], catastrophic forgetting [67], and a lack of robustness [11]. In taking a step back to the fundamentals, our work uncovers that even the most advanced multimodal LLMs, such as GPT-4V [40], Gemini [14], Bard [13], and LLaVA-1.5 [30], are not immune to stumbling over elementary visual questions. We also identify the incapable visual encoder as part of the problem.
Visual Encoders. MMVP-VLM provides a detailed analysis of the visual capabilities of various CLIP variants [43, 54, 62, 66]. These models mostly follow the method proposed in Radford et al. [43], which uses a contrastive loss to train on large volumes of image-text pairs. They differ in training data [62], training recipes [54], and objective functions [66]. Nonetheless, our studies show that all of these CLIP variants struggle with simple visual patterns such as 'orientation', 'count', 'presence of specific features', etc. Another line of research focuses on vision-only self-supervised learning (SSL). This category includes contrastive SSL [5, 7, 16, 17] and mask-based SSL [4, 18, 70]. SLIP [39] explores the synergy between CLIP and contrastive SSL, focusing primarily on standard classification tasks. In fact, a common practice for evaluating the quality of these vision models is linear probing or finetuning on ImageNet [45, 47]. Although current evaluation methods provide a basic level of assessment of representation quality, our findings indicate a growing detachment from the needs of recent use cases. As demonstrated in the MoF experiments in Section 4, the CLIP vision model and the vision-only SSL models learn complementary features. However, linear probing accuracy on ImageNet alone provides a limited understanding of feature utility in MLLMs. This observation suggests the need for more diverse evaluations [61] in visual representation learning, to better align with current and emerging applications.
Ambiguities in Embedding Models. Our work exploits CLIP-blind pairs within the CLIP vision embedding space to generate examples of failures in CLIP models and subsequently MLLMs. This concept has ties to previous research focused on documenting failure modes in text embedding models [12, 36, 55]. More recently, Thrush et al. [56], Yuksekgonul et al. [65] and Hsieh et al. [19] study the binding problems CLIP faces in processing text queries, noting that CLIP models treat text input as a bag of words. Tong et al. [57] examines the implications for downstream text-guided generative models. Tschannen et al. [60] suggests image captioners as promising alternatives to CLIP for improving attribute binding. Our work focuses on the visual patterns.
Discussion
Circling back to the very first question we ask: is vision good enough for language? Perhaps not yet, as our study shows that vision models might become a bottleneck in multimodal systems. MLLMs fail on simple questions because their pretrained CLIP vision encoders overlook crucial visual details in images and systematically fail on important visual patterns. Yet CLIP-type models remain the most scalable and widely used vision models today. Contrary to the popular belief that data and model scaling is a panacea, our research demonstrates that scaling alone does not rectify the inherent deficiencies in CLIP models.
Our study reveals that popular visual representation learning models, namely vision-and-language models and vision-only self-supervised learning models, excel in different aspects. The distinction in their capabilities goes beyond conventional benchmarks such as linear probing or zero-shot accuracy on ImageNet. Although a carefully designed Mixture-of-Features approach can alleviate visual limitations and utilize the strengths of these two learning paradigms, it is necessary to develop new evaluation metrics to facilitate the development of new visual representation learning algorithms. We hope our work can motivate further innovation in vision models.
Acknowledgements. We thank Penghao Wu, Muzi Tao, Erik Jones, Michael Psenka, Daniel Yeh, Druv Pai, Chen Sun for helpful discussions and feedback. This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise. This research is also supported by Intel, Google TRC program, the Google Cloud Research Credits program with the award GCP19980904, and an Amazon Research Award Fall 2023. The authors thank hyperbolic labs for supporting part of the experiments. All experiments and data processing were performed at NYU.
Experiment Details
Hyperparameters. In this work, we adopt the same set of hyperparameters as LLaVA [31] and LLaVA-1.5 [30]. We use Vicuna-13b-v1.3 [69] in LLaVA experiments and Vicuna-13b-v1.5 [69] in LLaVA-1.5 experiments. We show the training hyperparameters for the LLaVA and LLaVA-1.5 experiments in Table 4. All experiments are conducted using a maximum of 8 Nvidia A100 GPUs.
Table 4. Hyperparameters for MoF training on LLaVA and LLaVA-1.5.
Pretrain Datasets. We use the same datasets for both LLaVA and LLaVA-1.5 experiments. For LLaVA experiments, stage 1 uses CC595k [50] and stage 2 uses the LLaVA 158k [31] instruction data. For LLaVA-1.5 experiments, stage 1 uses CC595k [50] and stage 2 uses the DataMix 665k [1, 15, 21, 23, 24, 31, 34, 35, 38, 49, 51] proposed in Liu et al. [30].
More Benchmark Results
We provide more details on the MMVP benchmark.
Details of evaluating SOTA models
We access GPT-4V through ChatGPT in October and November 2023. We also evaluate Gemini-Pro through the Vertex AI API in December 2023. We use the official checkpoints for InstructBLIP [8]. We access mini-GPT4 [71],¹ LLaVA, and LLaVA-1.5 [31] through their playgrounds. We test Bard [13] using the official website in September and October 2023. Moreover, we test new-Bing [37] through the new-Bing chat creative mode and GPT-4V [40] in September 2023.
Questions in MMVP
We present more examples in MMVP at the end in Figures 10, 11, 12.
Ablation Studies
To further verify that MLLMs make mistakes in MMVP because of incapable visual grounding rather than hallucination in the language model [20], we conduct additional ablation experiments on the format and notations of the VQA questions and options in MMVP. We choose GPT-4V for these experiments, as it is currently the best model.
¹ To circumvent response hallucination in mini-GPT4, we prefix our questions with 'Please only choose an option to answer the question below without explanation: '
Table 5. Pearson correlation between the CLIP model and MLLMs. Open-source models that explicitly use CLIP-based models are highlighted in gray.
Swapping options. The first experiment swaps the two options in the MMVP benchmark. For example, we change the question from 'Are the butterfly's wings closer to being open or closed? (a) Open (b) Closed' to 'Are the butterfly's wings closer to being open or closed? (a) Closed (b) Open'.
Empirically, we find that GPT-4V obtains a 40.3% accuracy on the option swapping in our study, as opposed to the original 38.7%. We observe that a few questions are answered differently, while the majority remain the same. This further suggests that the visual incapabilities are in the vision encoder rather than in alignment or the LLMs.
Changing notations in the options. We conduct an ablation study to assess the impact of altering notations. For example, we change '(a) Closed (b) Open' to '(1) Closed (2) Open'. The results are comparable to the original findings, achieving a performance of 37.3%, closely matching the original 38.7%. This study further suggests that the core challenge in MLLMs is their inherent visual incapability, rather than hallucinations in the language model.
Human Study Details
In this study, four participants volunteered to take part. An example user interface for labeling is shown in Figure 8. We collect their responses and report the average score as the human-level performance.
CLIP-MLLM Failure Correlation
Correlation between CLIP and MLLM models. We compute the Pearson Correlation between the CLIP model and MLLMs and show results in Table 5. Notably, both open-source models - LLaVA and InstructBLIP - exhibit remarkably high Pearson Correlation, exceeding 0.7. This finding indicates a strong correlation between the errors made by the CLIP model and those made by MLLMs. Bard also displays a very high correlation. This suggests that some of the most advanced closed-source models are also affected by the visual limitations in the CLIP models.
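For reference, the correlation in Table 5 can be computed from the per-visual-pattern accuracies of a CLIP model and an MLLM. A minimal sketch, assuming nine accuracy values per model (the function name and the toy scores below are illustrative, not from the paper):

```python
import numpy as np

def pearson_correlation(clip_scores, mllm_scores):
    """Pearson correlation between per-visual-pattern accuracies of a
    CLIP model and an MLLM (one entry per visual pattern, nine here)."""
    x = np.asarray(clip_scores, dtype=float)
    y = np.asarray(mllm_scores, dtype=float)
    # Center both series, then take the cosine of the centered vectors.
    x = x - x.mean()
    y = y - y.mean()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
```

A coefficient close to 1 indicates that the visual patterns on which CLIP underperforms are largely the same ones on which the MLLM underperforms.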
Correlation between ImageNet-1k and MMVP performance. We plot the ImageNet-1k zero-shot accuracy against the MMVP-VLM average performance in Figure 9. For models with ImageNet-1k zero-shot accuracy below 80, higher zero-shot accuracy tends to indicate improved MMVP performance. However, for models with superior ImageNet-1k zero-shot performance, this trend does not necessarily hold for MMVP-VLM accuracy. This distinction accentuates the value of MMVP-VLM as an evaluation metric, which probes visual patterns such as orientation – aspects that are pivotal for downstream tasks and go beyond what is captured by ImageNet accuracy alone.

Figure 8. Example of the user study interface. The questions in the user study are randomly shuffled to avoid any potential bias. Users choose answers for the VQA questions as well as potential concerns about the VQA question.

Figure 9. Correlation between ImageNet-1k zero-shot and MMVP-VLM average. The area of each bubble corresponds to the model's number of parameters. A higher ImageNet-1k zero-shot performance does not necessarily imply superior performance on MMVP-VLM.
Visual Patterns for CLIP
Here, we provide the full description of visual patterns that pose challenges to all CLIP-based models.
Multimodal Large Language Models (MLLMs) [40, 13, 31, 8] have been rapidly developing in recent times. MLLMs integrate images into large language models (LLMs) and leverage the powerful abilities of LLMs [41, 59, 69], showcasing remarkable proficiency in tasks such as image understanding, visual question answering, and instruction following. In particular, the recently released GPT-4V(ision) [40] has pushed performance to an unprecedented level [41, 63].
Beneath the advancements of these models, we find there exists a notable weakness: they still exhibit visual shortcomings, some of which are surprisingly elementary and evident (see Figure 1). We ask: Where do these problems originate? Is it a deficiency in visual modality, language understanding, or their alignment? In this work, we suggest that these shortcomings observed in MLLMs might stem from a problem related to the visual representations.
At their core, most MLLMs [8, 31, 71] are built on pretrained vision [43, 54] and language [68, 59, 69] models. These models are connected using various types of adapters [2, 26, 31] to integrate the different modalities. A natural hypothesis is that any limitation in the pretrained vision models can cascade into the downstream MLLMs that adopt them. Studies have explored a similar issue for language. For example, Yuksekgonul et al. [65], Tong et al. [57] demonstrate that failure patterns in the pretrained text encoder [43, 44] will lead to downstream failures in text-guided generative models [46, 22].
On the vision side, most open-source MLLMs [2, 26, 31] adopt the pretrained Contrastive Language-Image Pre-Training (CLIP) model [43] as the visual encoder. We begin by identifying failure examples that CLIP struggles to encode properly (Section 2). Inspired by Tong et al. [57], we exploit the erroneous agreements in the embedding space. If two visually different images are encoded similarly by CLIP, then at least one of the images is likely ambiguously encoded. We call such a pair of images a CLIP-blind pair. To measure the visual similarity between images, we use a vision-only self-supervised encoder such as DINOv2 [42]. In this context, CLIP-blind pairs are images with similar CLIP embeddings but different DINOv2 embeddings.
We discover that these CLIP-blind pairs indeed lead to errors in downstream MLLMs. With these pairs, we introduce the MultiModal Visual Patterns (MMVP) benchmark. This benchmark is specifically designed to inquire about differences in CLIP-blind pairs and evaluate the visual abilities of state-of-the-art MLLMs with straightforward questions. We evaluate a variety of open-source [30, 31, 8, 71] and closed-source models [41, 13] including GPT-4V [40], and conduct a user study to measure human performance. The results show that MLLMs struggle with straightforward visual questions. Most of these models perform below the level of random guessing, with GPT-4V being the exception. Yet, even GPT-4V exhibits a considerable disparity in performance – exceeding 50% – compared to human performance.
Having identified a large number of individual failure instances in MLLMs, we continue to study the systematic visual patterns in MMVP with which CLIP models struggle (Section 3). We summarize nine prevalent patterns of the CLIP-blind pairs in MMVP, such as “orientation”, “counting”, and “viewpoint”, which pose significant challenges for the CLIP vision encoder. Notice that there has been significant and ongoing progress in scaling up both training data and model size for CLIP [43, 54, 10, 62, 66]. We categorize examples from MMVP into visual patterns to systematically assess whether scaling alone can mitigate these challenges. Our findings suggest that 7 out of the 9 identified visual patterns cannot be resolved by any large-scale CLIP-based model, indicating that model/data scaling alone is not sufficient. Moreover, we identify a strong correlation between the visual patterns that challenge CLIP models and the performance of MLLMs. If CLIP struggles with a particular visual pattern, such as “orientation”, MLLMs will likely also fall short. This shows that the CLIP vision encoders could become a bottleneck in such systems.
Finally, we take a step towards improving the visual grounding of MLLMs. Since the visual shortcomings of MLLMs stem from their reliance on the CLIP model, we investigate the impact of integrating vision-centric representations into MLLMs (Section 4). Specifically, we explore ways to incorporate a vision-only self-supervised model, such as DINOv2 [42], to enhance the visual grounding capabilities of MLLMs. We refer to these techniques as Mixture-of-Features (MoF). First, we linearly mix CLIP and DINOv2 features in different ratios, which we refer to as Additive-MoF (A-MoF). This process reveals that DINOv2 features are more effective in visual grounding, though they come at the cost of diminished instruction-following ability. To address this, we introduce Interleaved-MoF (I-MoF) that spatially mixes visual tokens from both CLIP and DINOv2 models. We find that this practice significantly enhances visual grounding while maintaining the instruction-following capabilities.
Currently, the majority of open-source MLLMs [31, 71, 8] use the off-the-shelf CLIP vision encoders to process images. In this section, we begin by identifying CLIP-blind pairs in the CLIP model (Section 2.1). Subsequently, we construct the Multimodal Visual Patterns-MLLM (MMVP-MLLM) benchmark using these CLIP-blind pairs (Section 2.2). We evaluate SOTA MLLMs including GPT-4V on the benchmark (Section 2.3) and find that all the tested models struggle with simple questions on visual details. A visualization of this process is provided in Figure 2.
It is challenging to directly find instances (images) that the CLIP vision encoder struggles to encode “properly”. To circumvent this issue, we extend the idea proposed in Tong et al. [57] to automatically find blind pairs in vision models. The underlying principle is simple: if two images, despite having stark visual differences, are encoded similarly by the CLIP vision encoder, then one of them is likely encoded ambiguously (See Figure 2 left for example). To measure the visual difference between two images, we examine the images’ representations within a reference model: a vision-only self-supervised model trained without any language guidance, e.g., DINOv2 [42]. These models are shown to capture more visual details and information [42, 53].
We use ImageNet [47] and LAION-Aesthetics [48] as corpus datasets to collect these CLIP-blind pairs.
For each pair, we compute their CLIP embeddings using the CLIP-ViT-L-14 [9, 43] model and their DINOv2 embeddings using the DINOv2-ViT-L-14 [9, 42] model. We return pairs whose cosine similarity exceeds 0.95 for CLIP embeddings and is below 0.6 for DINOv2 embeddings.
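The selection rule above can be sketched as follows. This is an illustrative simplification, assuming precomputed row-wise image embeddings; the function names are ours, and a real run over ImageNet/LAION-Aesthetics would replace the brute-force O(n²) scan with approximate nearest-neighbor search:

```python
import numpy as np

def cosine_sim_matrix(emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of an embedding matrix."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return normed @ normed.T

def find_clip_blind_pairs(clip_emb, dino_emb, clip_thresh=0.95, dino_thresh=0.6):
    """Return (i, j) index pairs that CLIP encodes as near-identical
    (cosine > clip_thresh) but DINOv2 separates (cosine < dino_thresh)."""
    clip_sim = cosine_sim_matrix(np.asarray(clip_emb, dtype=float))
    dino_sim = cosine_sim_matrix(np.asarray(dino_emb, dtype=float))
    n = clip_sim.shape[0]
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            if clip_sim[i, j] > clip_thresh and dino_sim[i, j] < dino_thresh:
                pairs.append((i, j))
    return pairs
```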
We introduce the Multimodal Visual Patterns (MMVP) benchmark, a Visual Question Answering (VQA) benchmark. Utilizing the collected CLIP-blind pairs, we carefully design 150 pairs with 300 questions. For each CLIP-blind pair of images, we manually pinpoint the visual details that the CLIP vision encoder overlooks (see the middle of Figure 2) and craft questions that probe these visual details, for example, “Is the dog facing left or right?” (see the right of Figure 2 and more examples in Figure 3). The primary goal is to determine whether MLLMs fail and overlook critical visual details when posed with these seemingly basic questions. Hence, the questions are intentionally straightforward and unambiguous.
We assess the questions on SOTA open-source models (LLaVA-1.5 [31], InstructBLIP [8], Mini-GPT4 [71]) and closed-source models (GPT-4V [40], Gemini [14], Bard [13]). We leave the details of how we access these models to Appendix B.1. In our evaluation, each question is queried independently, eliminating any biases from chat histories. We also evaluate human performance through a user study in which users are presented with the 300 questions in a randomized sequence. For any given pair of images, we consider the pair correctly answered only if both questions associated with it are answered accurately.
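The pair-level scoring rule above (a pair counts only if both of its questions are answered correctly) can be sketched as follows; the function name is ours:

```python
def pair_accuracy(results):
    """results: list of (correct_q1, correct_q2) booleans, one tuple per
    CLIP-blind pair. A pair scores only when BOTH of its questions are
    answered correctly; returns the accuracy as a percentage."""
    if not results:
        return 0.0
    correct = sum(1 for q1, q2 in results if q1 and q2)
    return 100.0 * correct / len(results)
```

Note that under this metric, random guessing on two independent binary-choice questions succeeds only 25% of the time, which is the chance level referred to below.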
As shown in Figure 4, human participants accurately answer an average of 95.7% of the questions. This high accuracy rate underscores the ease of the questions. More details can be found in Appendix B.4.
As shown in Figure 4, there is a significant performance gap between humans and MLLM models, despite the latter often demonstrating impressive results [6, 27]. All models except GPT-4V and Gemini scored below the random-guess level (25%). Even the most advanced models, GPT-4V and Gemini, face challenges in addressing basic visual grounding questions. Figures 1 and 3 provide examples of errors made by the models. The outcomes suggest that, irrespective of model size or training data, MLLMs struggle with visual details.
We have also conducted an ablation study, such as swapping options and changing notations in the question formulation (see Appendix B.3 for more details), to further confirm that this poor performance stems from visual incapability, not hallucination in the language models.
In the previous section, we identify CLIP-blind pairs and use them to find failures in MLLMs. Here, we delve deeper into these pairs to investigate (i) the systematic visual patterns that emerge from CLIP-blind pairs (Section 3.1), (ii) whether these visual patterns still pose challenges for CLIP-based models after massive scaling up (Section 3.2), and (iii) the correlation between failure patterns in CLIP models and those in MLLMs (Section 3.3).
Having identified the CLIP-blind pairs, we summarize systematic visual patterns that the CLIP vision encoders might consistently misinterpret. Directly capturing systematic visual patterns from the CLIP-blind images themselves is too abstract a task. Therefore, we turn to the questions and options from the MMVP benchmark. With these questions, we transform abstract visual patterns in images into clearer, language-based descriptors that are easier to categorize.
In this work, we use GPT-4 [41] to categorize general patterns by prompting it with the following:
We identify 9 visual patterns:
These visual patterns suggest that CLIP vision encoders overly focus on high-level semantic understanding, overlooking intricate details of the visual world. Full descriptions of the visual patterns can be found in Appendix D.
CLIP-based models have developed rapidly since their introduction [43]. We want to test whether these visual patterns still pose challenges to more recent CLIP models [10, 54, 62, 66], which significantly scale up training data and model size. To do so, we introduce a new benchmark, MMVP-VLM, to systematically study whether CLIP models handle these visual patterns well.
We distill a subset of questions from the MMVP benchmark into simpler language descriptions and categorize them into visual patterns. To maintain a balanced number of questions for each visual pattern, we add a few questions, if needed, to ensure that each visual pattern is represented by 15 text-image pairs. Examples of pairs are shown in Figure 5. A pair is deemed correctly answered if the model can accurately match both image-text combinations.
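The matching criterion above can be sketched as follows, assuming a 2×2 matrix of image-text similarity scores (e.g., cosine similarities from a CLIP model) for one MMVP-VLM pair; the function name is illustrative:

```python
import numpy as np

def vlm_pair_correct(sim):
    """sim: 2x2 matrix of image-text similarity scores for one MMVP-VLM
    pair, with sim[i, j] = similarity(image_i, text_j). The pair counts
    as correct only if each text is most similar to its own image,
    i.e., both image-text combinations are matched correctly."""
    sim = np.asarray(sim, dtype=float)
    return bool(sim[0, 0] > sim[1, 0] and sim[1, 1] > sim[0, 1])
```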
We evaluate MMVP-VLM on a variety of CLIP models [43, 54, 10, 62, 66]. These models vary in aspects like size, training data, and methodology. As evidenced in Table 1, increasing network size and training data only aids in identifying two visual patterns – “color and appearance” and “state and condition”. The rest of the visual patterns continue to challenge all CLIP-based models. We also find that the ImageNet-1k zero-shot accuracy is not a definitive indicator of a model’s performance regarding visual patterns. This underscores the necessity for additional evaluation metrics, such as MMVP-VLM, to accurately assess the model’s capabilities in areas beyond image classification.
After analyzing the visual patterns that CLIP models struggle with, we pose the following question: Is there a correlation between the underperformance of CLIP and MLLMs’ visual incapability? To explore this, we categorize questions from MMVP into these visual patterns summarized and calculate each MLLM’s performance on these patterns.
In Figure 6, we plot CLIP’s performance and MLLMs’ performance for each visual pattern. When the CLIP vision encoder underperforms on a certain visual pattern, the MLLM tends to exhibit similar shortcomings. Open-source models such as LLaVA 1.5 [30] and InstructBLIP [8] that explicitly use the CLIP vision encoder display a strong correlation in performance.
Further, we calculate the Pearson Correlation Coefficient between the CLIP model and MLLM’s performance on each visual pattern. Results show that LLaVA 1.5 and InstructBLIP all possess a coefficient score greater than 0.7. This high score indicates a strong correlation that weaknesses in visual pattern recognition in the CLIP model are transferred to MLLMs. More details on the Pearson Correlation Coefficient can be found in Appendix C.
Based on our exploration in earlier sections, a natural question arises: if open-source MLLMs' visual shortcomings come from the CLIP vision encoder, how do we build a more competent visual encoder? In this section, we take initial steps to answer this question by studying Mixture-of-Features (MoF). We start with Additive MoF, which mixes CLIP features and vision-only SSL model features. Results show that each encoder presents unique advantages and limitations when employed as the pretrained model in an MLLM (Section 4.2). We subsequently propose Interleaved MoF, which integrates features from both CLIP and SSL encoders into the MLLM to enhance visual grounding without compromising the model's ability to follow instructions (Section 4.3).
We adopt LLaVA [31, 30] as the framework to study visual encoders in MLLMs. LLaVA uses a pretrained CLIP encoder and trains an adapter to align visual tokens with language tokens in the LLM (see the left side of Figure 7). We use DINOv2 [42] as the vision-only SSL model in our work because it is currently the most scalable vision-only model. Our exploration includes the use of two visual encoders: CLIP-ViT-L-14 [43] and DINOv2-ViT-L-14 [42]. To ensure consistent and fair comparisons, we train and finetune our model with the same experimental settings as LLaVA. We include additional experimental details in Appendix A.
We add a pretrained DINOv2 encoder into the MLLM and mix its features with those of the pretrained CLIP encoder. We use a coefficient α to control the portion of CLIP features and 1 − α to control the amount of DINOv2 features, linearly adding them together (see the middle part of Figure 7 for a visualization).
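A minimal sketch of the additive mixing step, assuming the two encoders' adapter-processed token grids share the same shape (the function and variable names are ours):

```python
import numpy as np

def additive_mof(clip_tokens, dino_tokens, alpha):
    """Additive MoF: linearly mix CLIP and DINOv2 visual tokens as
    alpha * CLIP + (1 - alpha) * DINOv2. Assumes both encoders yield
    token arrays of identical shape, e.g. (num_tokens, dim)."""
    assert clip_tokens.shape == dino_tokens.shape
    return alpha * clip_tokens + (1 - alpha) * dino_tokens
```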
We evaluate the model’s visual grounding ability with the MMVP benchmark proposed earlier in Section 2, and the model’s instruction-following capability with the LLaVA benchmark introduced in Liu et al. [31]. Initially, we conduct five experiments in which we linearly transition from using 100% CLIP features to 100% DINOv2 features, with DINOv2 feature proportions set at {0.00, 0.25, 0.50, 0.75, 1.00}. To further verify the observed trends, we introduce two additional experiments with DINOv2 proportions of {0.625, 0.875}. Our findings, presented in Table 2, reveal two insights:
As the proportion of DINOv2 features increases, MLLM exhibits a decline in its instruction-following capability. Notably, there is a sharp decrease when the DINOv2 proportion reaches 87.5%.
A higher proportion of DINOv2 features enhances the model’s visual grounding capability, but this advantage diminishes when the DINOv2 proportion surpasses 0.75, at which point instruction-following is notably impaired.
Hence, if we were to add DINOv2 features or completely replace CLIP with DINOv2, it would result in a trade-off between visual grounding and instruction-following. A higher proportion of DINOv2 features improves the model’s visual perception at the expense of its ability to follow linguistic instructions, while CLIP features enhance language comprehension but reduce visual grounding.
We propose interleaved MoF to leverage advantages from both CLIP and DINOv2 embeddings to enhance image representation. An image concurrently passes into CLIP and DINOv2 encoders, and the resulting embeddings are individually processed by adapters. We take the processed features from CLIP and DINOv2 and interleave them while maintaining their original spatial order. We then feed the interleaved features to LLM (See right part of Figure 7).
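A minimal sketch of the interleaving step described above. Token-wise alternation of the two adapter outputs is our assumption about the granularity; the names are illustrative:

```python
import numpy as np

def interleaved_mof(clip_tokens, dino_tokens):
    """Interleaved MoF: spatially interleave adapter-processed CLIP and
    DINOv2 visual tokens while preserving each encoder's original spatial
    order. The output sequence [clip_0, dino_0, clip_1, dino_1, ...] is
    twice the length of either input and is then fed to the LLM."""
    assert clip_tokens.shape == dino_tokens.shape  # (num_tokens, dim)
    n, d = clip_tokens.shape
    out = np.empty((2 * n, d), dtype=clip_tokens.dtype)
    out[0::2] = clip_tokens   # even positions: CLIP tokens, in order
    out[1::2] = dino_tokens   # odd positions: DINOv2 tokens, in order
    return out
```

Unlike additive mixing, this keeps both feature streams intact at the cost of doubling the visual token count.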
We summarize the results in Table 3. Under the LLaVA setting, Interleaved MoF significantly enhances visual grounding, with a 10.7% increase observed on MMVP, without compromising the model’s ability to follow instructions. This experiment is replicated in the LLaVA-1.5 setting and under various image resolution settings, yielding similar performance gains. We also evaluate on POPE [27], which is designed to test hallucination in visual grounding; Interleaved MoF again shows consistent improvement over the original LLaVA models. Merely increasing the image resolution, and consequently the number of tokens, does not boost visual grounding capabilities; it is the interleaving of MoF that leads to improved performance in visual grounding tasks. We also evaluate Interleaved MoF on additional benchmarks such as MM-Bench [32] and GQA [21], finding that it achieves similar performance on these benchmarks. Please refer to Appendix E for more results.
Multimodal LLMs. We study the limitations of Multimodal LLMs [40, 13, 30, 31, 8] and explore possible ways to improve these models. Multimodal LLMs build on pretrained Large Language Models [41, 3, 58, 59, 69] and the CLIP vision encoder [43, 54]. These systems then use an adapter, such as MLPs [30, 31], Q-Former [26, 8], or gated attention [2, 25], to integrate the pretrained CLIP vision encoder into LLMs. More recently, InstructBLIP [8] and LLaVA-1.5 [30] highlight the importance of high-quality training data. Yet, there is a scarcity of research focusing on the impact of visual encoders, an important gap our work aims to address through a systematic study.
Evaluating Multimodal LLMs. MMVP assesses MLLMs using a set of simple yet critical Visual Question Answering (VQA) questions constructed from CLIP-blind pairs. Previous benchmarks such as TextVQA [52], VQAv2 [15], and GQA [21] have centered on traditional VQA queries. More recently, benchmarks like MM-Vet [64], POPE [27], and MM-Bench [32] have been designed to specifically evaluate multimodal LLMs on hallucination, reasoning, and robustness. These benchmarks and evaluations have shown that Multimodal LLMs can suffer from hallucination [29, 28], catastrophic forgetting [67], and a lack of robustness [11]. In taking a step back to the fundamentals, our work uncovers that even the most advanced multimodal LLMs, such as GPT-4V [40], Gemini [14], Bard [13], and LLaVA-1.5 [30], are not immune to stumbling over elementary visual questions. We also identify the inadequate visual encoder as part of the problem.
Visual Encoders. MMVP-VLM provides a detailed analysis of the visual capabilities of various CLIP variants [43, 54, 62, 66]. These models mostly follow the method proposed in Radford et al. [43], which uses a contrastive loss to train on large volumes of image-text pairs. They differ in training data [62], training recipes [54], and objective functions [66]. Nonetheless, our studies show that all of these CLIP variants struggle with simple visual patterns such as “orientation”, “count”, and “presence of specific features”. Another line of research focuses on vision-only self-supervised learning (SSL). This category includes contrastive SSL [7, 16, 5, 17] and mask-based SSL [70, 18, 4]. SLIP [39] explores the synergy between CLIP and contrastive SSL, but focuses primarily on standard classification tasks. In fact, a common practice to evaluate the quality of these vision models is through linear probing or fine-tuning on ImageNet [47, 45]. Although current evaluation methods provide a basic level of assessment of representation quality, our findings indicate a growing detachment from the needs of recent use cases. As demonstrated in the MoF experiments in Section 4, the CLIP vision model and the vision-only SSL models learn complementary features. However, linear probing accuracy on ImageNet alone provides a limited understanding of feature utility in MLLMs. This observation suggests the need for more diverse evaluations [61] in visual representation learning, to better align with current and emerging applications.
Ambiguities in Embedding Models. Our work exploits CLIP-blind pairs within the CLIP vision embedding space to generate examples of failures in CLIP models and, subsequently, in MLLMs. This concept has ties to previous research documenting failure modes in text embedding models [12, 36, 55]. More recently, Thrush et al. [56], Yuksekgonul et al. [65], and Hsieh et al. [19] study the binding problems CLIP faces in processing text queries, noting that CLIP models treat text input as a bag of words. Tong et al. [57] examines the implications for downstream text-guided generative models. Tschannen et al. [60] suggests image captioners as promising alternatives to CLIP for improving attribute binding. Our work, in contrast, focuses on failures rooted in visual patterns.
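Mining CLIP-blind pairs reduces to a similarity search over two embedding spaces: keep image pairs that CLIP considers near-duplicates but a vision-only SSL model (DINOv2) clearly separates. A minimal sketch, assuming row-wise image embeddings are precomputed; the 0.95/0.6 thresholds are illustrative assumptions, not necessarily the exact values used in our pipeline.

```python
import numpy as np

def find_clip_blind_pairs(clip_emb, dino_emb, clip_thresh=0.95, dino_thresh=0.6):
    # clip_emb, dino_emb: (n_images, dim) arrays of per-image embeddings.
    # Returns index pairs whose CLIP embeddings nearly coincide
    # (cosine > clip_thresh) while their DINOv2 embeddings disagree
    # (cosine < dino_thresh) -- i.e. candidate CLIP-blind pairs.
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    clip_sim = normalize(clip_emb) @ normalize(clip_emb).T
    dino_sim = normalize(dino_emb) @ normalize(dino_emb).T
    pairs = []
    n = len(clip_emb)
    for i in range(n):
        for j in range(i + 1, n):
            if clip_sim[i, j] > clip_thresh and dino_sim[i, j] < dino_thresh:
                pairs.append((i, j))
    return pairs
```

Candidate pairs found this way still require manual inspection to formulate questions about the visual difference the CLIP encoder misses.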
Circling back to the very first question we asked: is vision good enough for language? Perhaps not yet, as our study shows that vision models may become a bottleneck in multimodal systems. MLLMs fail on simple questions because their pre-trained CLIP vision encoders overlook crucial visual details in images and systematically fail to capture important visual patterns. Yet CLIP-type models remain the most scalable and widely used vision models today. Contrary to the popular belief that scaling data and models is a panacea, our research demonstrates that scaling alone does not rectify the inherent deficiencies of CLIP models.
Our study reveals that popular visual representation learning models – vision-and-language models and vision-only self-supervised learning models – excel in different aspects. The distinction in their capabilities goes beyond conventional benchmarks such as linear probing or zero-shot accuracy on ImageNet. Although a carefully designed Mixture-of-Features approach can alleviate visual limitations and exploit the strengths of the two learning paradigms, new evaluation metrics are needed to facilitate the development of new visual representation learning algorithms. We hope our work motivates further innovation in vision models.
Acknowledgements. We thank Penghao Wu, Muzi Tao, Erik Jones, Michael Psenka, Daniel Yeh, Druv Pai for helpful discussions and feedback. We also thank Google Cloud and the TRC program for their support.
In this work, we adopt the same set of hyperparameters as LLaVA [31] and LLaVA-1.5 [30]. We use Vicuna-13b-v1.3 [69] in LLaVA experiments and Vicuna-13b-v1.5 [69] in LLaVA-1.5 experiments. We show the training hyperparameters for LLaVA and LLaVA-1.5 experiments in Table 4. All experiments are conducted using a maximum of 8 Nvidia A100 GPUs.
We use the same datasets for both LLaVA and LLaVA-1.5 experiments. For LLaVA experiments, stage 1 uses CC595k [50] and stage 2 uses the LLaVA 158k [31] instruction data; for LLaVA-1.5 experiments, stage 1 uses CC595k [50] and stage 2 uses the DataMix 665k [31, 1, 15, 21, 35, 38, 49, 51, 34, 23, 24] proposed in Liu et al. [30].
We provide more details on the MMVP benchmark.
We access GPT-4V through ChatGPT in October and November 2023. We also evaluate Gemini-Pro through the Vertex AI API in December 2023. We use the official checkpoints for InstructBLIP [8]. We access mini-GPT4 [71] (to circumvent response hallucination in mini-GPT4, we prefix our questions with “Please only choose an option to answer the question below without explanation: ”), as well as LLaVA and LLaVA-1.5 [31], through their playgrounds. We test Bard [13] using the official website in September and October 2023. Moreover, we test new-Bing [37] through the new-Bing chat creative mode and GPT-4V [40] in September 2023.
We present more examples in MMVP at the end in Figures 10, 11, 12. We also share the entire benchmark in the supplementary material.
To further verify that MLLMs make mistakes in MMVP due to weak visual grounding rather than hallucination in the language model [20], we conduct additional ablation experiments on the format and notations of the VQA questions and options in MMVP. We choose GPT-4V for these experiments, as it is currently the best model.
The first experiment swaps the two options in the MMVP benchmark. For example, we change the question from “Are the butterfly’s wings closer to being open or closed? (a) Open (b) Closed” to “Are the butterfly’s wings closer to being open or closed? (a) Closed (b) Open”.
Empirically, we find that GPT-4V obtains 40.3% accuracy with the options swapped, compared to the original 38.7%. A few questions are answered differently, while the majority remain the same. This further suggests that the visual shortcomings lie in the vision encoder rather than in the alignment or the LLM.
We conducted an ablation study to assess the impact of altering notations. For example, we changed “(a) Closed (b) Open” to “(1) Closed (2) Open”. The results are comparable to the original findings: 37.3% versus the original 38.7%. This study further suggests that the core challenge in MLLMs is their inherent visual incapability, rather than hallucination in the language model.
We recruit four volunteer participants for our user study. An example of the labeling user interface is shown in Figure 8. We collect their responses and report the average score as human-level performance.
We compute the Pearson Correlation between the CLIP model and MLLMs and show results in Table 5. Notably, both open-source models – LLaVA and InstructBLIP – exhibit remarkably high Pearson Correlation, exceeding 0.7. This finding indicates a strong correlation between the errors made by the CLIP model and those made by MLLMs. Bard also displays a very high correlation. This suggests that some of the most advanced closed-source models are also affected by the visual limitations in the CLIP models.
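The correlation reported above can be reproduced mechanically: for each of the nine visual patterns, collect the CLIP model's accuracy and the MLLM's accuracy, then correlate the two nine-dimensional vectors. A minimal sketch (plain Pearson correlation, no significance test):

```python
import numpy as np

def pearson(x, y):
    # Pearson correlation between per-pattern accuracies of a CLIP
    # encoder (x) and of an MLLM built on top of it (y).
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))
```

Applied to the nine per-pattern accuracy vectors of a CLIP model and an MLLM, this yields the correlation values reported in Table 5.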
We plot the ImageNet-1k Zero-shot accuracy against MMVP-VLM average performance in Figure 9. For models with ImageNet-1k Zero-shot accuracy below 80, a higher Zero-shot accuracy tends to indicate improved MMVP performance. However, in models with superior ImageNet-1k Zero-shot performance, this trend does not necessarily hold for MMVP-VLM accuracy. This distinction accentuates the value of MMVP-VLM as an evaluation metric, which probes into visual patterns such as orientation – aspects that are pivotal for downstream tasks and go beyond what is captured by ImageNet accuracy alone.
Here, we provide the full description of visual patterns that pose challenges to all CLIP-based models.
Orientation and Direction: Questions about the direction something is facing or moving, such as the direction the dog or duck is facing, or the orientation of the school bus.
Presence of Specific Features: Questions that focus on the existence or non-existence of certain elements or features in the image.
Positional and Relational Context: This aspect refers to the model’s ability to understand the position and relationship of objects or elements within an image in relation to each other and their surroundings.
Color and Appearance: Questions regarding the color of certain objects or elements.
We conduct additional experiments on Interleaved-MoF that further scale the resolution up to 336 and evaluate on more benchmarks. The results summarized in Table 6 reveal that Interleaved-MoF achieves comparable performance on most benchmarks while improving on benchmarks focused on visual grounding. We also observe that MMVP is more sensitive to the model’s visual capabilities, underscoring the significance of our benchmark in assessing visual proficiency.
Table: S3.T1: Performance of various CLIP based models on different visual patterns in MMVP-VLM benchmark. Models scaled up in resolution show minimal improvement, whereas a slight advantage is observed when scaling up the network. For each visual pattern, ImageNet-1k Zero-shot accuracy and MMVP average, we use light gray to highlight the best performance. For most of the visual patterns, all CLIP-based methods struggle, as evident from the scores. We use symbols for visual patterns due to space limits: \faCompass: Orientation and Direction, \faSearch: Presence of Specific Features, \faSync: State and Condition, \faSortNumericUp: Quantity and Count, \faMapPin: Positional and Relational Context, \faPalette: Color and Appearance, \faCogs: Structural and Physical Characteristics, \faFont: Texts, \faCamera: Viewpoint and Perspective.
| Image Size | Params (M) | IN-1k ZeroShot | \faCompass | \faSearch | \faSync | \faSortNumericUp | \faMapPin | \faPalette | \faCogs | \faFont | \faCamera | MMVP Average | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenAI ViT-L-14 [43] | 2242 | 427.6 | 75.5 | 13.3 | 13.3 | 20.0 | 20.0 | 13.3 | 53.3 | 20.0 | 6.7 | 13.3 | 19.3 |
| OpenAI ViT-L-14 [43] | 3362 | 427.9 | 76.6 | 0.0 | 20.0 | 40.0 | 20.0 | 6.7 | 20.0 | 33.3 | 6.7 | 33.3 | 20.0 |
| SigLIP ViT-SO-14 [66] | 2242 | 877.4 | 82.0 | 26.7 | 20.0 | 53.3 | 40.0 | 20.0 | 66.7 | 40.0 | 20.0 | 53.3 | 37.8 |
| SigLIP ViT-SO-14 [66] | 3842 | 878.0 | 83.1 | 20.0 | 26.7 | 60.0 | 33.3 | 13.3 | 66.7 | 33.3 | 26.7 | 53.3 | 37.0 |
| DFN ViT-H-14 [10] | 2242 | 986.1 | 83.4 | 20.0 | 26.7 | 73.3 | 26.7 | 26.7 | 66.7 | 46.7 | 13.3 | 53.3 | 39.3 |
| DFN ViT-H-14 [10] | 3782 | 986.7 | 84.4 | 13.3 | 20.0 | 53.3 | 33.3 | 26.7 | 66.7 | 40.0 | 20.0 | 40.0 | 34.8 |
| MetaCLIP ViT-L-14 [62] | 2242 | 427.6 | 79.2 | 13.3 | 6.7 | 66.7 | 6.7 | 33.3 | 46.7 | 20.0 | 6.7 | 13.3 | 23.7 |
| MetaCLIP ViT-H-14 [62] | 2242 | 986.1 | 80.6 | 6.7 | 13.3 | 60.0 | 13.3 | 6.7 | 53.3 | 26.7 | 13.3 | 33.3 | 25.2 |
| EVA01 ViT-g-14 [54] | 2242 | 1136.4 | 78.5 | 6.7 | 26.7 | 40.0 | 6.7 | 13.3 | 66.7 | 13.3 | 13.3 | 20.0 | 23.0 |
| EVA02 ViT-bigE-14+ [54] | 2242 | 5044.9 | 82.0 | 13.3 | 20.0 | 66.7 | 26.7 | 26.7 | 66.7 | 26.7 | 20.0 | 33.3 | 33.3 |
Table: S4.T2: Empirical Results of Additive MoF. We use DINOv2 as the image SSL model in our work. With more DINOv2 features added, there is an improvement in visual grounding, while a decline in instruction following ability.
| method | SSL ratio | MMVP | LLaVA |
|---|---|---|---|
| LLaVA | 0.0 | 5.5 | 81.8 |
| LLaVA + A-MoF | 0.25 | 7.9 (+2.4) | 79.4 (-2.4) |
| LLaVA + A-MoF | 0.5 | 12.0 (+6.5) | 78.6 (-3.2) |
| LLaVA + A-MoF | 0.625 | 15.0 (+9.5) | 76.4 (-5.4) |
| LLaVA + A-MoF | 0.75 | 18.7 (+13.2) | 75.8 (-6.0) |
| LLaVA + A-MoF | 0.875 | 16.5 (+11.0) | 69.3 (-12.5) |
| LLaVA + A-MoF | 1.0 | 13.4 (+7.9) | 68.5 (-13.3) |
Table: A1.T4: Hyperparameters for MoF training on LLaVA and LLaVA-1.5.
| Hyperparameter | LLaVA Stage 1 | LLaVA Stage 2 | LLaVA-1.5 Stage 1 | LLaVA-1.5 Stage 2 |
|---|---|---|---|---|
| batch size | 128 | 128 | 256 | 128 |
| lr | 1e-3 | 2e-5 | 2e-3 | 2e-5 |
| lr schedule decay | cosine | cosine | cosine | cosine |
| lr warmup ratio | 0.03 | 0.03 | 0.03 | 0.03 |
| weight decay | 0 | 0 | 0 | 0 |
| epoch | 1 | 3 | 1 | 1 |
| optimizer | AdamW [33] | AdamW [33] | AdamW [33] | AdamW [33] |
| DeepSpeed stage | 2 | 3 | 2 | 3 |
Table: A3.T5: Pearson Correlation between the CLIP model and MLLMs. Open-source models that explicitly use CLIP-based models are highlighted in gray.
| LLaVA-1.5 | InstructBLIP | Bard | Gemini | GPT-4 | |
|---|---|---|---|---|---|
| Correlation | 0.87 | 0.71 | 0.79 | 0.72 | 0.31 |
Table: A3.T6: Comparison with LLaVA-1.5 on 6 more benchmarks. Interleaved-MoF LLaVA-1.5 obtains performance on par with the original method while showing improvements on benchmarks evaluating visual grounding. Benchmark names are abbreviated due to space limits. LLV^B: LLaVA Benchmark [31]; LLV^W: LLaVA-In-the-Wild [30]; MMB: MMBench [32]; VQA^T: TextVQA [52]; POPE: POPE [27]; VQA^V2: VQA-v2 [15]; MM-V: MM-Vet [64].
| method | res | #tokens | MMVP | LLV^B | LLV^W | MMB | VQA^T | POPE | VQA^V2 | MM-V |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA1.5 | 3362 | 576 | 24.7 | 84.7 | 70.7 | 67.7 | 61.3 | 85.9 | 80.0 | 35.4 |
| LLaVA1.5 + I-MoF | 2242 | 512 | 28.0 | 82.7 | 73.3 | 61.6 | 55.3 | 86.3 | 77.3 | 33.5 |
| LLaVA1.5 + I-MoF | 3362 | 1152 | 31.3 | 81.8 | 73.3 | 65.4 | 58.7 | 86.7 | 79.3 | 34.6 |
Constructing MMVP benchmark via CLIP-blind pairs. Left: We start with finding CLIP-blind pairs that have similar CLIP embedding but different DINOv2 embedding. Center: We manually inspect the differences between pair-wise images and formulate questions based on the differences in the images. Right: We ask MLLMs the question alongside the CLIP-blind pair. The model receives a score only when both questions for the CLIP-blind pair are answered correctly.
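The pair-level scoring rule can be stated precisely in code. A minimal sketch, assuming questions are stored as a flat list in which consecutive items form one CLIP-blind pair:

```python
def mmvp_score(answers, predictions):
    # MMVP credits a model only when BOTH questions of a CLIP-blind
    # pair are answered correctly. `answers` and `predictions` are
    # flat lists; items (2k, 2k + 1) form one pair. Returns accuracy
    # as a percentage.
    assert len(answers) == len(predictions) and len(answers) % 2 == 0
    n_pairs = len(answers) // 2
    correct = sum(
        answers[2 * k] == predictions[2 * k]
        and answers[2 * k + 1] == predictions[2 * k + 1]
        for k in range(n_pairs)
    )
    return 100.0 * correct / n_pairs
```

Under this rule, a model that answers exactly one question of every pair correctly scores 0, which is why random-looking per-question behavior can still produce very low MMVP accuracy.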
Examples of Questions in the MMVP benchmark. Incorrect answers are shaded in red. A model is considered correct only if it answers both questions in a pair correctly. Both leading closed-source models (GPT-4V, Gemini) and open-source models (LLaVA-1.5, InstructBLIP) fail these simple visual questions. (See Appendix B.2 for all the questions in MMVP benchmark.)
Benchmark results of current SOTA MLLM models and humans. We evaluate current SOTA MLLMs on the benchmark questions and measure human performance through user studies.
Examples from MMVP-VLM. MMVP-VLM consists of image pairs across nine visual patterns. The examples in the figure are from EVA01 ViT-g-14 model [54], one of the largest CLIP models that also fails to choose the right image given the text description.
CLIP and MLLM’s performance on visual patterns. If CLIP performs poorly on a visual pattern such as “ \faCompass orientation”, MLLMs also underperform on the visual pattern.
Different Mixture-of-Feature (MoF) Strategies in MLLM. Left: Standard MLLM that uses CLIP as off-the-shelf pretrained vision encoder; Middle: Additive-MoF (A-MoF) MLLM: Linearly mixing CLIP and DINOv2 features before the adapter; Right: Interleaved-MoF (I-MoF MLLM) Spatially interleaving CLIP visual tokens and DINOv2 visual tokens after the adapter.
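The two MoF variants differ only in where and how the CLIP and DINOv2 features are combined. A minimal numpy sketch on token arrays of shape (num_tokens, dim); the strict token-by-token alternation used for Interleaved-MoF here is an illustrative assumption about the spatial interleaving:

```python
import numpy as np

def additive_mof(clip_tokens, dino_tokens, alpha=0.5):
    # A-MoF: linearly mix CLIP and DINOv2 features token-wise before
    # the adapter; alpha is the SSL mixing ratio (fraction of DINOv2).
    return (1 - alpha) * clip_tokens + alpha * dino_tokens

def interleaved_mof(clip_tokens, dino_tokens):
    # I-MoF: spatially interleave CLIP and DINOv2 visual tokens after
    # their adapters, doubling the token count fed to the LLM.
    n, d = clip_tokens.shape
    out = np.empty((2 * n, d), dtype=clip_tokens.dtype)
    out[0::2] = clip_tokens
    out[1::2] = dino_tokens
    return out
```

The sketch makes the trade-off visible: A-MoF keeps the token count fixed but dilutes the CLIP features as alpha grows, while I-MoF preserves both feature sets at the cost of twice as many visual tokens.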
Example of user study interface. The questions in the user study are randomly shuffled to avoid any potential bias. Users choose answers to the VQA questions and flag potential concerns about each question.
Correlation between ImageNet-1k Zero-shot and MMVP-VLM average. The area of each bubble corresponds to the model’s number of parameters. A higher ImageNet-1k zero-shot performance does not necessarily imply superior performance in MMVP-VLM.
More examples of questions in the MMVP benchmark (Part I).
Table: Empirical results of Interleaved-MoF on LLaVA. Interleaved-MoF improves visual grounding (MMVP) while maintaining instruction-following performance.
| method | res | #tokens | MMVP | LLaVA | POPE |
|---|---|---|---|---|---|
| LLaVA | 2242 | 256 | 5.5 | 81.8 | 50.0 |
| LLaVA | 3362 | 576 | 6.0 | 81.4 | 50.1 |
| LLaVA + I-MoF | 2242 | 512 | 16.7 (+10.7) | 82.8 | 51.0 |
| LLaVA-1.5 | 3362 | 576 | 24.7 | 84.7 | 85.9 |
| LLaVA-1.5 + I-MoF | 2242 | 512 | 28.0 (+3.3) | 82.7 | 86.3 |
Table: Ablation on the image SSL model used in Interleaved-MoF. DINOv2 features yield the largest gains on MMVP.
| method | SSL Model | res | #tokens | MMVP | POPE |
|---|---|---|---|---|---|
| LLaVA-1.5 | None | 3362 | 576 | 24.7 | 85.9 |
| LLaVA-1.5 + I-MoF | MoCov3 | 2242 | 512 | 26.7 (+2.0) | 86.1 |
| LLaVA-1.5 + I-MoF | MAE | 2242 | 512 | 27.3 (+2.6) | 86.1 |
| LLaVA-1.5 + I-MoF | DINOv2 | 2242 | 512 | 28.0 (+3.3) | 86.3 |
References
[liu2023visual] Liu, Haotian, Li, Chunyuan, Wu, Qingyang, Lee, Yong Jae. (2023). Visual instruction tuning. NeurIPS.
[zhu2023minigpt] Zhu, Deyao, Chen, Jun, Shen, Xiaoqian, Li, Xiang, Elhoseiny, Mohamed. (2023). MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.
[openai2023gpt4] OpenAI. (2023). GPT-4 Technical Report.
[tong2023mass] Tong, Shengbang, Jones, Erik, Steinhardt, Jacob. (2023). Mass-Producing Failures of Multimodal Systems with Language Models. NeurIPS.
[oquab2023dinov2] Oquab, Maxime, Darcet, Timothée, others. (2023). DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.
[yang2022we] Yang, Yibo, Xie, Liang, Chen, Shixiang, Li, Xiangtai, Lin, Zhouchen, Tao, Dacheng. (2022). Do we really need a learnable classifier at the end of deep neural network?. arXiv preprint arXiv:2203.09081.
[vishniakov2024convnet] Kirill Vishniakov, Zhiqiang Shen, Zhuang Liu. (2024). ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy.
[fang2021exploring] Fang, Cong, He, Hangfeng, Long, Qi, Su, Weijie J. (2021). Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training. Proceedings of the National Academy of Sciences.
[cai2023leveraging] Cai, Mu, Huang, Zeyi, Li, Yuheng, Wang, Haohan, Lee, Yong Jae. (2023). Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding. arXiv preprint arXiv:2306.06094.
[thrampoulidis2022imbalance] Thrampoulidis, Christos, Kini, Ganesh Ramachandra, Vakilian, Vala, Behnia, Tina. (2022). Imbalance trouble: Revisiting neural-collapse geometry. Advances in Neural Information Processing Systems.
[bommasani2021opportunities] Bommasani, Rishi, Hudson, Drew A, Adeli, Ehsan, Altman, Russ, Arora, Simran, von Arx, Sydney, Bernstein, Michael S, Bohg, Jeannette, Bosselut, Antoine, Brunskill, Emma, others. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
[devlin2018bert] Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, Toutanova, Kristina. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[radford2021learning] Radford, Alec, Kim, Jong Wook, Hallacy, Chris, Ramesh, Aditya, Goh, Gabriel, Agarwal, Sandhini, Sastry, Girish, Askell, Amanda, Mishkin, Pamela, Clark, Jack, others. (2021). Learning transferable visual models from natural language supervision. ICML.
[kingma2014adam] Kingma, Diederik P, Ba, Jimmy. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[gudibande2023false] Gudibande, Arnav, Wallace, Eric, Snell, Charlie, Geng, Xinyang, Liu, Hao, Abbeel, Pieter, Levine, Sergey, Song, Dawn. (2023). The false promise of imitating proprietary llms. arXiv preprint arXiv:2305.15717.
[papyan2020prevalence] Papyan, Vardan, Han, XY, Donoho, David L. (2020). Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences.
[ilharco_gabriel_open_clip] Ilharco, Gabriel, Wortsman, Mitchell, Wightman, Ross, Gordon, Cade, Carlini, Nicholas, Taori, Rohan, Dave, Achal, Shankar, Vaishaal, Namkoong, Hongseok, Miller, John, Hajishirzi, Hannaneh, Farhadi, Ali, Schmidt, Ludwig. OpenCLIP. doi:10.5281/zenodo.5143773.
[yang2023neural] Yibo Yang, Haobo Yuan, Xiangtai Li, Zhouchen Lin, Philip Torr, Dacheng Tao. (2023). Neural Collapse Inspired Feature-Classifier Alignment for Few-Shot Class-Incremental Learning. The Eleventh International Conference on Learning Representations.
[yang2022inducing] Yang, Yibo, Chen, Shixiang, Li, Xiangtai, Xie, Liang, Lin, Zhouchen, Tao, Dacheng. (2022). Inducing Neural Collapse in Imbalanced Learning: Do We Really Need a Learnable Classifier at the End of Deep Neural Network?. Advances in Neural Information Processing Systems.
[zhu2021geometric] Zhu, Zhihui, Ding, Tianyu, Zhou, Jinxin, Li, Xiao, You, Chong, Sulam, Jeremias, Qu, Qing. (2021). A geometric analysis of neural collapse with unconstrained features. Advances in Neural Information Processing Systems.
[he2016deep] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition.
[simonyan2014very] Simonyan, Karen, Zisserman, Andrew. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[dosovitskiy2021an] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICML.
[liu2021swin] Liu, Ze, Lin, Yutong, Cao, Yue, Hu, Han, Wei, Yixuan, Zhang, Zheng, Lin, Stephen, Guo, Baining. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF international conference on computer vision.
[berrios2023towards] Berrios, William, Mittal, Gautam, Thrush, Tristan, Kiela, Douwe, Singh, Amanpreet. (2023). Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language. arXiv preprint arXiv:2306.16410.
[li2023otter] Li, Bo, Zhang, Yuanhan, Chen, Liangyu, Wang, Jinghao, Yang, Jingkang, Liu, Ziwei. (2023). Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726.
[instructblip] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. (2023). InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning.
[chen2020simple] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. ICML.
[liu2023generalizing] Weiyang Liu, Longhui Yu, Adrian Weller, Bernhard Schölkopf. (2023). Generalizing and Decoupling Neural Collapse via Hyperspherical Uniformity Gap. The Eleventh International Conference on Learning Representations.
[nakamoto2023cal] Nakamoto, Mitsuhiko, Zhai, Yuexiang, Singh, Anikait, Mark, Max Sobol, Ma, Yi, Finn, Chelsea, Kumar, Aviral, Levine, Sergey. (2023). Cal-ql: Calibrated offline rl pre-training for efficient online fine-tuning. arXiv preprint arXiv:2303.05479.
[mirzadeh2022wide] Mirzadeh, Seyed Iman, Chaudhry, Arslan, Yin, Dong, Hu, Huiyi, Pascanu, Razvan, Gorur, Dilan, Farajtabar, Mehrdad. (2022). Wide neural networks forget less catastrophically. International Conference on Machine Learning.
[ji2023survey] Ji, Ziwei, Lee, Nayeon, Frieske, Rita, Yu, Tiezheng, Su, Dan, Xu, Yan, Ishii, Etsuko, Bang, Ye Jin, Madotto, Andrea, Fung, Pascale. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys.
[sharma2018conceptual] Sharma, Piyush, Ding, Nan, Goodman, Sebastian, Soricut, Radu. (2018). Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. ACL.
[lin2014microsoft] Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollár, Piotr, Zitnick, C. Lawrence. (2014). Microsoft COCO: Common objects in context. ECCV.
[lu2022learn] Lu, Pan, Mishra, Swaroop, Xia, Tanglin, Qiu, Liang, Chang, Kai-Wei, Zhu, Song-Chun, Tafjord, Oyvind, Clark, Peter, Kalyan, Ashwin. (2022). Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems.
[anas_awadalla_open_flamingo] Awadalla, Anas, Gao, Irena, Gardner, Joshua, Hessel, Jack, Hanafy, Yusuf, Zhu, Wanrong, Marathe, Kalyani, Bitton, Yonatan, Gadre, Samir, Jitsev, Jenia, Kornblith, Simon, Koh, Pang Wei, Ilharco, Gabriel, Wortsman, Mitchell, Schmidt, Ludwig. OpenFlamingo. doi:10.5281/zenodo.7733589.
[behnia2022on] Tina Behnia, Ganesh Ramachandra Kini, Vala Vakilian, Christos Thrampoulidis. (2022). On the Implicit Geometry of Cross-Entropy Parameterizations for Label-Imbalanced Data. OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop).
[chung2022scaling] Chung, Hyung Won, Hou, Le, Longpre, Shayne, Zoph, Barret, Tay, Yi, Fedus, William, Li, Eric, Wang, Xuezhi, Dehghani, Mostafa, Brahma, Siddhartha, others. (2022). Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
[hu2021lora] Hu, Edward J, Shen, Yelong, Wallis, Phillip, Allen-Zhu, Zeyuan, Li, Yuanzhi, Wang, Shean, Wang, Lu, Chen, Weizhu. (2021). Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
[OpenAI2022ChatGPT] OpenAI. (2022). ChatGPT.
[yuksekgonul2022and] Yuksekgonul, Mert, Bianchi, Federico, Kalluri, Pratyusha, Jurafsky, Dan, Zou, James. (2022). When and Why Vision-Language Models Behave like Bags-Of-Words, and What to Do About It?. ICLR.
[russakovsky2015imagenet] Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, others. (2015). Imagenet large scale visual recognition challenge. IJCV.
[schuhmann2022laion] Schuhmann, Christoph, Beaumont, Romain, Vencu, Richard, Gordon, Cade, Wightman, Ross, Cherti, Mehdi, Coombes, Theo, Katta, Aarush, Mullis, Clayton, Wortsman, Mitchell, others. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models. NeurIPS.
[Bard] Google. (2023). Bard.
[Gemini] Google. (2023). Gemini.
[mu2022slip] Mu, Norman, Kirillov, Alexander, Wagner, David, Xie, Saining. (2022). Slip: Self-supervision meets language-image pre-training. ECCV.
[bardes2021vicreg] Bardes, Adrien, Ponce, Jean, LeCun, Yann. (2022). VICReg: Variance-invariance-covariance regularization for self-supervised learning. ICLR.
[he2020momentum] He, Kaiming, Fan, Haoqi, Wu, Yuxin, Xie, Saining, Girshick, Ross. (2020). Momentum contrast for unsupervised visual representation learning. CVPR.
[grill2020bootstrap] Grill, Jean-Bastien, Strub, Florian, Altché, Florent, others. (2020). Bootstrap your own latent: A new approach to self-supervised learning. NeurIPS.
[he2022masked] He, Kaiming, Chen, Xinlei, Xie, Saining, Li, Yanghao, Dollár, Piotr, Girshick, Ross. (2022). Masked autoencoders are scalable vision learners. CVPR.
[zhou2021ibot] Zhou, Jinghao, Wei, Chen, Wang, Huiyu, Shen, Wei, Xie, Cihang, Yuille, Alan, Kong, Tao. (2021). iBOT: Image BERT pre-training with online tokenizer. ICLR.
[assran2023self] Assran, Mahmoud, Duval, Quentin, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, LeCun, Yann, Ballas, Nicolas. (2023). Self-supervised learning from images with a joint-embedding predictive architecture. CVPR.
[alayrac2022flamingo] Alayrac, Jean-Baptiste, Donahue, Jeff, Luc, Pauline, Miech, Antoine, Barr, Iain, Hasson, Yana, Lenc, Karel, Mensch, Arthur, Millican, Katherine, Reynolds, Malcolm, others. (2022). Flamingo: a visual language model for few-shot learning. NeurIPS.
[hu2023prompt] Hu, Jennifer, Levy, Roger. (2023). Prompt-based methods may underestimate large language models' linguistic generalizations. EMNLP.
[newbing] Microsoft. (2023). newbing.
[sharegpt] ShareGPT. (2023).
[mishra2019ocr] Mishra, Anand, Shekhar, Shashank, Singh, Ajeet Kumar, Chakraborty, Anirban. (2019). OCR-VQA: Visual question answering by reading text in images. ICDAR.
[schwenk2022okvqa] Schwenk, Dustin, Khandelwal, Apoorv, Clark, Christopher, Marino, Kenneth, Mottaghi, Roozbeh. (2022). A-OKVQA: A benchmark for visual question answering using world knowledge. ECCV.
[sidorov2020textcaps] Sidorov, Oleksii, Hu, Ronghang, Rohrbach, Marcus, Singh, Amanpreet. (2020). Textcaps: a dataset for image captioning with reading comprehension. ECCV.
[mao2016generation] Mao, Junhua, Huang, Jonathan, Toshev, Alexander, Camburu, Oana, Yuille, Alan L, Murphy, Kevin. (2016). Generation and comprehension of unambiguous object descriptions. CVPR.
[kazemzadeh2014referitgame] Kazemzadeh, Sahar, Ordonez, Vicente, Matten, Mark, Berg, Tamara. (2014). Referitgame: Referring to objects in photographs of natural scenes. EMNLP.
[krishna2017visual] Krishna, Ranjay, Zhu, Yuke, Groth, Oliver, Johnson, Justin, Hata, Kenji, Kravitz, Joshua, Chen, Stephanie, Kalantidis, Yannis, Li, Li-Jia, Shamma, David A, others. (2017). Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV.
[marino2019ok] Marino, Kenneth, Rastegari, Mohammad, Farhadi, Ali, Mottaghi, Roozbeh. (2019). OK-VQA: A visual question answering benchmark requiring external knowledge. CVPR.
[raffel2020exploring] Raffel, Colin, Shazeer, Noam, Roberts, Adam, Lee, Katherine, Narang, Sharan, Matena, Michael, Zhou, Yanqi, Li, Wei, Liu, Peter J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR.
[rombach2022high] Rombach, Robin, Blattmann, Andreas, Lorenz, Dominik, Esser, Patrick, Ommer, Björn. (2022). High-resolution image synthesis with latent diffusion models. CVPR.
[ridnik2021imagenet] Ridnik, Tal, Ben-Baruch, Emanuel, Noy, Asaf, Zelnik-Manor, Lihi. (2021). Imagenet-21k pretraining for the masses. NeurIPS.
[jun2023shap] Jun, Heewoo, Nichol, Alex. (2023). Shap-E: Generating conditional 3D implicit functions. arXiv preprint arXiv:2305.02463.
[singh2019towards] Singh, Amanpreet, Natarajan, Vivek, Shah, Meet, Jiang, Yu, Chen, Xinlei, Batra, Dhruv, Parikh, Devi, Rohrbach, Marcus. (2019). Towards VQA models that can read. CVPR.
[yu2023mm] Yu, Weihao, Yang, Zhengyuan, Li, Linjie, Wang, Jianfeng, Lin, Kevin, Liu, Zicheng, Wang, Xinchao, Wang, Lijuan. (2023). MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490.
[goyal2017making] Goyal, Yash, Khot, Tejas, Summers-Stay, Douglas, Batra, Dhruv, Parikh, Devi. (2017). Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. CVPR.
[hudson2019gqa] Hudson, Drew A, Manning, Christopher D. (2019). GQA: A new dataset for real-world visual reasoning and compositional question answering. CVPR.
[thrush2022winoground] Thrush, Tristan, Jiang, Ryan, Bartolo, Max, Singh, Amanpreet, Williams, Adina, Kiela, Douwe, Ross, Candace. (2022). Winoground: Probing vision and language models for visio-linguistic compositionality. CVPR.
[gpt4v] OpenAI. (2023). GPT-4V(ision) System Card.
[bubeck2023sparks] Bubeck, Sébastien, others. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.
[yang2023dawn] Yang, Zhengyuan, Li, Linjie, Lin, Kevin, Wang, Jianfeng, Lin, Chung-Ching, Liu, Zicheng, Wang, Lijuan. (2023). The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421.
[may2019measuring] May, Chandler, Wang, Alex, Bordia, Shikha, Bowman, Samuel R, Rudinger, Rachel. (2019). On measuring social biases in sentence encoders. NAACL.
[fu2023mme] Fu, Chaoyou, Chen, Peixian, Shen, Yunhang, Qin, Yulei, Zhang, Mengdan, Lin, Xu, Qiu, Zhenyu, Lin, Wei, Yang, Jinrui, Zheng, Xiawu, others. (2023). MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394.
[loshchilov2017decoupled] Loshchilov, Ilya, Hutter, Frank. (2017). Decoupled weight decay regularization. ICLR.
[fang2023data] Fang, Alex, Jose, Albin Madappally, Jain, Amit, Schmidt, Ludwig, Toshev, Alexander, Shankar, Vaishaal. (2023). Data Filtering Networks. arXiv preprint arXiv:2309.17425.
[sun2019mitigating] Sun, Tony, Gaut, Andrew, Tang, Shirlyn, Huang, Yuxin, ElSherief, Mai, Zhao, Jieyu, Mirza, Diba, Belding, Elizabeth, Chang, Kai-Wei, Wang, William Yang. (2019). Mitigating gender bias in natural language processing: Literature review. ACL.
[gonen2019lipstick] Gonen, Hila, Goldberg, Yoav. (2019). Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. NAACL.
[li2023evaluating] Li, Yifan, Du, Yifan, Zhou, Kun, Wang, Jinpeng, Zhao, Wayne Xin, Wen, Ji-Rong. (2023). Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355.
[liu2023mmbench] Liu, Yuan, Duan, Haodong, Zhang, Yuanhan, Li, Bo, Zhang, Songyang, Zhao, Wangbo, Yuan, Yike, Wang, Jiaqi, He, Conghui, Liu, Ziwei, others. (2023). MMBench: Is Your Multi-modal Model an All-around Player?. arXiv preprint arXiv:2307.06281.
[chen2023models] Chen, Yanda, Zhong, Ruiqi, Ri, Narutatsu, Zhao, Chen, He, He, Steinhardt, Jacob, Yu, Zhou, McKeown, Kathleen. (2023). Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations. arXiv preprint arXiv:2307.08678.
[singh2023effectiveness] Singh, Mannat, Duval, Quentin, Alwala, Kalyan Vasudev, Fan, Haoqi, Aggarwal, Vaibhav, Adcock, Aaron, Joulin, Armand, Dollár, Piotr, Feichtenhofer, Christoph, Girshick, Ross, others. (2023). The effectiveness of MAE pre-pretraining for billion-scale pretraining. ICCV.
[laurenccon2023obelisc] Laurençon, Hugo, Saulnier, Lucile, Tronchon, Léo, Bekman, Stas, Singh, Amanpreet, Lozhkov, Anton, Wang, Thomas, Karamcheti, Siddharth, Rush, Alexander M, Kiela, Douwe, others. (2023). Obelisc: An open web-scale filtered dataset of interleaved image-text documents. arXiv preprint arXiv:2306.16527.
[song2020adversarial] Song, Congzheng, Rush, Alexander M, Shmatikov, Vitaly. (2020). Adversarial semantic collisions. arXiv preprint arXiv:2011.04743.
[touvron2023llama] Touvron, Hugo, Lavril, Thibaut, Izacard, Gautier, Martinet, Xavier, Lachaux, Marie-Anne, Lacroix, Timothée, Rozière, Baptiste, Goyal, Naman, Hambro, Eric, Azhar, Faisal, others. (2023). LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
[chowdhery2022palm] Chowdhery, Aakanksha, Narang, Sharan, Devlin, Jacob, Bosma, Maarten, Mishra, Gaurav, Roberts, Adam, Barham, Paul, Chung, Hyung Won, Sutton, Charles, Gehrmann, Sebastian, others. (2022). Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
[anil2023palm] Anil, Rohan, Dai, Andrew M, Firat, Orhan, Johnson, Melvin, Lepikhin, Dmitry, Passos, Alexandre, Shakeri, Siamak, Taropa, Emanuel, Bailey, Paige, Chen, Zhifeng, others. (2023). Palm 2 technical report. arXiv preprint arXiv:2305.10403.
[liu2023improved] Liu, Haotian, Li, Chunyuan, Li, Yuheng, Lee, Yong Jae. (2023). Improved Baselines with Visual Instruction Tuning. arXiv preprint arXiv:2310.03744.
[zheng2023judging] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.
[xu2023demystifying] Xu, Hu, Xie, Saining, Tan, Xiaoqing Ellen, Huang, Po-Yao, Howes, Russell, Sharma, Vasu, Li, Shang-Wen, Ghosh, Gargi, Zettlemoyer, Luke, Feichtenhofer, Christoph. (2023). Demystifying CLIP data. arXiv preprint arXiv:2309.16671.
[zhai2023sigmoid] Zhai, Xiaohua, Mustafa, Basil, Kolesnikov, Alexander, Beyer, Lucas. (2023). Sigmoid loss for language image pre-training. ICCV.
[shuster2022blenderbot] Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Eric Michael Smith, Stephen Roller, Megan Ung, Moya Chen, Kushal Arora, Joshua Lane, Morteza Behrooz, William Ngan, Spencer Poff, Naman Goyal, Arthur Szlam, Y-Lan Boureau, Melanie Kambadur, Jason Weston. (2022). BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage.
[chen2021evaluating] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba. (2021). Evaluating Large Language Models Trained on Code.
[openai2023gpt] OpenAI. (2023). GPT-4 technical report. arXiv.
[taori2023alpaca] Taori, Rohan, Gulrajani, Ishaan, Zhang, Tianyi, Dubois, Yann, Li, Xuechen, Guestrin, Carlos, Liang, Percy, Hashimoto, Tatsunori B. (2023). Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html.
[li2023blip2] Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. (2023). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ICML.
[yin2023survey] Yin, Shukang, Fu, Chaoyou, Zhao, Sirui, Li, Ke, Sun, Xing, Xu, Tong, Chen, Enhong. (2023). A Survey on Multimodal Large Language Models. arXiv preprint arXiv:2306.13549.
[mosbach2021on] Marius Mosbach, Maksym Andriushchenko, Dietrich Klakow. (2021). On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines. International Conference on Learning Representations.
[chen2020recall] Chen, Sanyuan, Hou, Yutai, Cui, Yiming, Che, Wanxiang, Liu, Ting, Yu, Xiangzhan. (2020). Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
[korbak2022controlling] Korbak, Tomasz, Elsahar, Hady, Kruszewski, German, Dymetman, Marc. (2022). Controlling conditional language models without catastrophic forgetting. International Conference on Machine Learning.
[dong2021should] Dong, Xinshuai, Luu, Anh Tuan, Lin, Min, Yan, Shuicheng, Zhang, Hanwang. (2021). How should pre-trained language models be fine-tuned towards adversarial robustness?. Advances in Neural Information Processing Systems.
[xie2023neural] Xie, Liang, Yang, Yibo, Cai, Deng, He, Xiaofei. (2023). Neural collapse inspired attraction-repulsion-balanced loss for imbalanced learning. Neurocomputing.
[behnia2023implicit] Behnia, Tina, Kini, Ganesh Ramachandra, Vakilian, Vala, Thrampoulidis, Christos. (2023). On the Implicit Geometry of Cross-Entropy Parameterizations for Label-Imbalanced Data. International Conference on Artificial Intelligence and Statistics.
[zhong2023understanding] Zhong, Zhisheng, Cui, Jiequan, Yang, Yibo, Wu, Xiaoyang, Qi, Xiaojuan, Zhang, Xiangyu, Jia, Jiaya. (2023). Understanding imbalanced semantic segmentation through neural collapse. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[howard2018universal] Howard, Jeremy, Ruder, Sebastian. (2018). Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
[mccloskey1989catastrophic] McCloskey, Michael, Cohen, Neal J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of learning and motivation.
[radford2018improving] Radford, Alec, Narasimhan, Karthik, Salimans, Tim, Sutskever, Ilya, others. (2018). Improving language understanding by generative pre-training.
[radford2019language] Radford, Alec, Wu, Jeffrey, Child, Rewon, Luan, David, Amodei, Dario, Sutskever, Ilya, others. (2019). Language models are unsupervised multitask learners. OpenAI blog.
[brown2020language] Brown, Tom, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared D, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, others. (2020). Language models are few-shot learners. Advances in neural information processing systems.
[sun2023eva] Sun, Quan, Fang, Yuxin, Wu, Ledell, Wang, Xinlong, Cao, Yue. (2023). EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389.
[Lee2020Mixout:] Cheolhyoung Lee, Kyunghyun Cho, Wanmo Kang. (2020). Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models. International Conference on Learning Representations.
[zhang2021revisiting] Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q Weinberger, Yoav Artzi. (2021). Revisiting Few-sample BERT Fine-tuning. International Conference on Learning Representations.
[zhang2023llama] Zhang, Renrui, Han, Jiaming, Zhou, Aojun, Hu, Xiangfei, Yan, Shilin, Lu, Pan, Li, Hongsheng, Gao, Peng, Qiao, Yu. (2023). Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199.
[hong20233d] Hong, Yining, Zhen, Haoyu, Chen, Peihao, Zheng, Shuhong, Du, Yilun, Chen, Zhenfang, Gan, Chuang. (2023). 3D-LLM: Injecting the 3D World into Large Language Models. arXiv preprint arXiv:2307.12981.
[zhang2023video] Zhang, Hang, Li, Xin, Bing, Lidong. (2023). Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858.
[touvron2023llama2] Touvron, Hugo, Martin, Louis, Stone, Kevin, Albert, Peter, Almahairi, Amjad, Babaei, Yasmine, Bashlykov, Nikolay, Batra, Soumya, Bhargava, Prajjwal, Bhosale, Shruti, others. (2023). LLaMA 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
[kingma2017adam] Diederik P. Kingma, Jimmy Ba. (2017). Adam: A Method for Stochastic Optimization.
[mixon2020neural] Mixon, Dustin G, Parshall, Hans, Pi, Jianzong. (2020). Neural collapse with unconstrained features. arXiv preprint arXiv:2011.11619.
[lecun2015deep] LeCun, Yann, Bengio, Yoshua, Hinton, Geoffrey. (2015). Deep learning. nature.
[antol2015vqa] Antol, Stanislaw, Agrawal, Aishwarya, Lu, Jiasen, Mitchell, Margaret, Batra, Dhruv, Zitnick, C Lawrence, Parikh, Devi. (2015). VQA: Visual Question Answering. ICCV.
[goodfellow2013empirical] Goodfellow, Ian J, Mirza, Mehdi, Xiao, Da, Courville, Aaron, Bengio, Yoshua. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211.
[huang2023language] Huang, Shaohan, Dong, Li, Wang, Wenhui, Hao, Yaru, Singhal, Saksham, Ma, Shuming, Lv, Tengchao, Cui, Lei, Mohammed, Owais Khan, Liu, Qiang, others. (2023). Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045.
[wortsman2022robust] Wortsman, Mitchell, Ilharco, Gabriel, Kim, Jong Wook, Li, Mike, Kornblith, Simon, Roelofs, Rebecca, Lopes, Raphael Gontijo, Hajishirzi, Hannaneh, Farhadi, Ali, Namkoong, Hongseok, others. (2022). Robust fine-tuning of zero-shot models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[wang2023voyager] Wang, Guanzhi, Xie, Yuqi, Jiang, Yunfan, Mandlekar, Ajay, Xiao, Chaowei, Zhu, Yuke, Fan, Linxi, Anandkumar, Anima. (2023). Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.
[rafailov2023direct] Rafailov, Rafael, Sharma, Archit, Mitchell, Eric, Ermon, Stefano, Manning, Christopher D, Finn, Chelsea. (2023). Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.
[zhai2023investigating] Zhai, Yuexiang, Tong, Shengbang, Li, Xiao, Cai, Mu, Qu, Qing, Lee, Yong Jae, Ma, Yi. (2023). Investigating the catastrophic forgetting in multimodal large language models. arXiv preprint arXiv:2309.10313.
[liu2023hallusionbench] Liu, Fuxiao, Guan, Tianrui, Li, Zongxia, Chen, Lichang, Yacoob, Yaser, Manocha, Dinesh, Zhou, Tianyi. (2023). HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models. arXiv preprint arXiv:2310.14566.
[liu2023aligning] Liu, Fuxiao, Lin, Kevin, Li, Linjie, Wang, Jianfeng, Yacoob, Yaser, Wang, Lijuan. (2023). Aligning Large Multi-Modal Model with Robust Instruction Tuning. arXiv preprint arXiv:2306.14565.
[hsieh2023sugarcrepe] Hsieh, Cheng-Yu, Zhang, Jieyu, Ma, Zixian, Kembhavi, Aniruddha, Krishna, Ranjay. (2023). SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality. NeurIPS.
[tschannen2023image] Tschannen, Michael, Kumar, Manoj, Steiner, Andreas, Zhai, Xiaohua, Houlsby, Neil, Beyer, Lucas. (2023). Image Captioners Are Scalable Vision Learners Too. NeurIPS.
[bib1] ShareGPT, 2023.
[bib2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
[bib4] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In CVPR, 2023.
[bib5] Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. 2022.
[bib7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
[bib8] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning, 2023.
[bib9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[bib13] Google. Bard, 2023a.
[bib14] Google. Gemini, 2023b.
[bib16] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: a new approach to self-supervised learning. In NeurIPS, 2020.
[bib17] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
[bib18] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
[bib20] Jennifer Hu and Roger Levy. Prompt-based methods may underestimate large language models’ linguistic generalizations. In EMNLP, 2023.
[bib23] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
[bib24] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.
[bib31] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. 2023.
[bib34] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
[bib35] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR, 2019.
[bib37] Microsoft. newbing, 2023.
[bib38] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. OCR-VQA: Visual question answering by reading text in images. In ICDAR, 2019.
[bib39] Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. SLIP: Self-supervision meets language-image pre-training. In ECCV, 2022.
[bib42] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[bib43] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[bib44] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.
[bib47] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
[bib48] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
[bib49] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-OKVQA: A benchmark for visual question answering using world knowledge. In ECCV, 2022.
[bib50] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
[bib51] Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. TextCaps: a dataset for image captioning with reading comprehension. In ECCV, 2020.
[bib57] Shengbang Tong, Erik Jones, and Jacob Steinhardt. Mass-producing failures of multimodal systems with language models. In NeurIPS, 2023.
[bib61] Kirill Vishniakov, Zhiqiang Shen, and Zhuang Liu. ConvNet vs Transformer, supervised vs CLIP: Beyond ImageNet accuracy, 2024.
[bib65] Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? In ICLR, 2022.
[bib70] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. In ICLR, 2021.
[bib71] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.