
Scaling Language-Free Visual Representation Learning


Abstract

Visual Self-Supervised Learning (SSL) currently underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). This multimodal gap is often attributed to the semantics introduced by language supervision, even though visual SSL and CLIP models are often trained on different data. In this work, we ask the question: “Do visual self-supervised approaches lag behind CLIP due to the lack of language supervision, or differences in the training data?” We study this question by training both visual SSL and CLIP models on the same MetaCLIP data, and leveraging VQA as a diverse testbed for vision encoders. In this controlled setup, visual SSL models scale better than CLIP models in terms of data and model capacity, and visual SSL performance does not saturate even after scaling up to 7B parameters. Consequently, we observe visual SSL methods achieve CLIP-level performance on a wide range of VQA and classic vision benchmarks. These findings demonstrate that pure visual SSL can match language-supervised visual pretraining at scale, opening new opportunities for vision-centric representation learning.


David Fan 1,∗, Shengbang Tong 1,2,∗, Jiachen Zhu 1,2, Koustuv Sinha 1, Zhuang Liu 1,3, Xinlei Chen 1, Michael Rabbat 1, Nicolas Ballas 1, Yann LeCun 1,2, Amir Bar 1,†, Saining Xie 2,†

1 FAIR, Meta, 2 New York University, 3 Princeton University

∗ equal contribution, † equal advising


Date:

April 1, 2025

Project Page:

https://davidfan.io/webssl/

Introduction

Visual representation learning has evolved along two distinct paths with different training approaches. Language-supervised methods such as Contrastive Language-Image Pretraining (CLIP) (Radford et al., 2021; Zhai et al., 2023) use paired image-text data to learn representations that are enriched with linguistic semantics. Self-Supervised Learning (SSL) methods (Zhang et al., 2016; Chen et al., 2020a; He et al., 2022; LeCun, 2022; Oquab et al., 2023) learn from images alone, without language.

Despite SSL models outperforming language-supervised models on classic vision tasks such as classification and segmentation (Oquab et al., 2023), they are less commonly adopted in recent multimodal large language models (MLLMs) (Liu et al., 2023a, 2024a; Agrawal et al., 2024; Tong et al., 2024a; Beyer et al., 2024; Li et al., 2024; AI@Meta, 2024). This difference in adoption is partially due to a performance gap in visual question answering (see Figure 1), particularly for OCR & Chart interpretation tasks (Tong et al., 2024a; Shi et al., 2024).

Figure 1 We compare the scaling behavior of visual SSL and CLIP on 16 VQA tasks from the Cambrian-1 suite under different data and model size regimes. Prior visual SSL methods achieved strong performance on classic vision tasks, but have underperformed as encoders for multimodal instruction-tuned VQA tasks. Our results show that with appropriate scaling of models and data, visual SSL can match the performance of language-supervised models across all evaluated domains, even OCR & Chart.

Figure 2 Visual SSL 2.0 changes. In this work, we adopt three improvements to the visual SSL pipeline: 1) Training on billion-scale web data, curated through the MetaCLIP pipeline, to move beyond 'conventional' datasets; 2) Scaling model architecture from sub-billion parameter models to models exceeding 1 billion parameters; and 3) Incorporating VQA as a complementary evaluation protocol to comprehensively assess visual features. These changes enable us to study visual SSL at a larger scale and observe scaling trends previously unobserved in smaller-scale experiments.

Beyond methodology differences, these approaches have also been separated by data scale and distribution (Figure 1). CLIP models typically train on billion-scale image-text pairs from the web (Schuhmann et al., 2022; Chen et al., 2023; Xu et al., 2024b), while SSL methods use million-scale datasets such as ImageNet (Deng et al., 2009) or hundred-million-scale data with ImageNet-like distributions (Ridnik et al., 2021; Oquab et al., 2023).

In this work, we investigate a fundamental question: Is language supervision necessary to pretrain visual representations for multimodal modeling? Rather than seeking to replace language-supervised approaches, we aim to understand the intrinsic capabilities and limitations of visual self-supervision at scale for multimodal applications. To conduct a fair comparison, we train SSL models on the same billion-scale web data used for state-of-the-art CLIP models, specifically the MetaCLIP dataset (Xu et al., 2024b). This approach controls for data distribution differences when comparing visual SSL and CLIP.

For evaluation, we primarily use visual question answering (VQA) as a framework to evaluate SSL models across a diverse set of capabilities at scale. VQA evaluation suites span vision-centric, visual reasoning, and OCR & Chart tasks, and have been shown to be a more diverse testbed for assessing vision encoders (Tschannen et al., 2024; Wan et al., 2024; Fini et al., 2024; Tong et al., 2024a), reflecting the broader perception challenges found in real-world distributions. We adopt the evaluation suite proposed in Cambrian-1 (Tong et al., 2024a), which evaluates performance across 16 tasks spanning 4 distinct categories of VQA: General, Knowledge, OCR & Chart, and Vision-Centric.

We train Web-SSL, a family of visual SSL models ranging from 1 to 7 billion parameters, using the above setting for direct and controlled comparison to CLIP. As a result of our empirical study, we contribute several insights:

· Visual SSL can match and even surpass language-supervised methods for visual pretraining, on a wide range of VQA tasks, even on language-related tasks such as OCR & Chart understanding (Figure 3).

· Visual SSL scales well with respect to model capacity (Figure 3) and data (Figure 4), indicating that SSL has significant untapped potential.

· Visual SSL can maintain competitive traditional vision performance on classification and segmentation, even while improving at VQA (Figure 7).

· Training on a higher ratio of images containing text is especially effective for improving OCR & Chart performance (Question 4). Exploring data composition is a promising direction.

This work serves as a proof of concept that offers a compelling vision-centric alternative to the recent CLIP-dominated trend, and opens new opportunities for future research. We plan to open-source our Web-SSL vision models, and we hope to inspire the broader community to unlock the full potential of visual SSL in the multimodal era.

In this section, we describe our experimental setup, which extends previous SSL works by (1) scaling dataset size to billion-scale images (Section 2.1), (2) scaling model size beyond 1B parameters (Section 2.2), and (3) evaluating vision models using open-ended VQA tasks (Section 2.3), in addition to classic vision benchmarks such as ImageNet-1k (Deng et al., 2009) and ADE20k (Zhou et al., 2019).

Beyond ImageNet Pretraining

To study whether visual SSL can match the performance of CLIP, we start by adopting the same data that drove CLIP's success. We thus leverage the MetaCLIP dataset (Xu et al., 2024b,a), which has enabled the most successful open-source reproduction of CLIP to-date. 1 We use 2 billion samples from MetaCLIP, which we refer to as MC-2B. We train SSL methods on only the images, and CLIP on the image-text pairs.

This controls for data distribution and size as confounding variables, and enables a fairer comparison of the pretraining methods themselves, while ensuring sufficient data diversity and scale.

Beyond data, we also scale model size. Inspired by advances in scaling language models (Brown et al., 2020; Kaplan et al., 2020; OpenAI, 2022), we train Vision Transformers (ViTs) with 1B, 2B, 3B, 5B, and 7B parameters, on only the images from MC-2B, to study the properties of larger-scale visual SSL models trained on web-scale data. We adapt ViT-g from Oquab et al. (2023) as ViT-1B, and define new configurations for ViT-2B to ViT-7B (Table 1); see Appendix A for model details.

Table 1 Model architecture details. For consistency, we denote ViT-g from Oquab et al. (2023) as ViT-1B.

Multimodal LLMs as an Evaluation Protocol

In addition to conventional evaluation protocols, such as ImageNet-1k linear probe, we also evaluate our vision encoders using VQA, a flexible and robust evaluation protocol that reflects the diversity of real-world perceptual challenges (Tschannen et al., 2024; Tong et al., 2024a), as shown in Figure 2.

Here, we study all vision encoders using the same controlled setting to ensure fair comparison. Specifically,

1 The data used to train the original CLIP is closed-source.

we use the same two-stage visual instruction tuning procedure and data as Cambrian-1 (Tong et al., 2024a). First, a lightweight MLP adapter is added to project the vision encoder features into the same dimensionality as the LLM, and only this MLP adapter is trained. In the second stage, both the MLP adapter and LLM are finetuned. To enable controlled comparison, the vision encoder remains frozen in both stages, and all experiments use the same training recipe as well as the Llama-3 8B Instruct (AI@Meta, 2024) backbone. We provide detailed training datasets and hyperparameters in Appendix A.
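The two-stage recipe above can be sketched as follows. This is a minimal illustration (module names, dimensions, and the stand-in models are our own, not the authors' code): the vision encoder is frozen throughout, the MLP adapter is trained in both stages, and the LLM is unfrozen only in stage two.

```python
import torch
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    """Lightweight MLP that projects vision features to the LLM width."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, x):
        return self.mlp(x)

def set_stage(vision_encoder, adapter, llm, stage: int):
    for p in vision_encoder.parameters():
        p.requires_grad = False             # frozen in both stages
    for p in adapter.parameters():
        p.requires_grad = True              # trained in both stages
    for p in llm.parameters():
        p.requires_grad = (stage == 2)      # LLM unfrozen only in stage 2

# Tiny stand-ins for the real encoder and LLM, for illustration only.
vision_encoder = nn.Linear(1024, 1024)
adapter = VisionToLLMAdapter(vision_dim=1024, llm_dim=4096)
llm = nn.Linear(4096, 4096)

set_stage(vision_encoder, adapter, llm, stage=1)
trainable = [p for m in (vision_encoder, adapter, llm)
             for p in m.parameters() if p.requires_grad]
```

Freezing the encoder in both stages is what makes the comparison controlled: any downstream difference is attributable to the pretrained visual features rather than to encoder finetuning.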

We then report results on the Cambrian-1 (Tong et al., 2024a) evaluation suite, which is comprised of 16 VQA benchmarks spanning four established domains: General, Knowledge, OCR & Chart, and Vision-Centric. The average VQA performance is the average of the four subcategories. Each subcategory has 4 benchmarks and is equally weighted.
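The scoring scheme above (four equally weighted subcategories, each the mean of its four benchmarks) amounts to a macro-average. A short sketch, with made-up benchmark scores purely for illustration:

```python
def average_vqa(scores_by_category):
    """Mean of per-category means (Cambrian-1 style macro-average).

    scores_by_category: dict mapping category name -> list of benchmark scores.
    """
    category_means = {
        cat: sum(scores) / len(scores)
        for cat, scores in scores_by_category.items()
    }
    overall = sum(category_means.values()) / len(category_means)
    return overall, category_means

# Illustrative placeholder scores, not the paper's results.
scores = {
    "General":        [70.0, 65.0, 80.0, 75.0],
    "Knowledge":      [50.0, 55.0, 60.0, 45.0],
    "OCR & Chart":    [40.0, 42.0, 38.0, 44.0],
    "Vision-Centric": [60.0, 62.0, 58.0, 64.0],
}
overall, per_cat = average_vqa(scores)
```

Note that this differs from a flat mean over all 16 benchmarks only if subcategories had unequal benchmark counts; here each has exactly four, so the two coincide.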

In this section, we explore the scaling behavior of visual SSL models with respect to both model and data size, as a result of training on only images from MC-2B. We focus on DINOv2 (Oquab et al., 2023) as the visual SSL method in this section, and discuss MAE (He et al., 2022) in Section 4.

In Section 3.1, we increase model size from 1B to 7B while keeping the training data fixed at 2 billion MC-2B images, unless otherwise noted. We use the off-the-shelf training code and recipe for each method, and do not change the recipe across model sizes in order to control for confounding variables. In Section 3.2, we shift our focus to scaling the total data seen for a fixed model size, and analyze how performance evolves as the number of images seen during training increases from 1 billion to 8 billion.

The intention of scaling model size is both to find the ceiling of visual SSL under this new data regime, and to identify any unique behavior that emerges in larger models.

We thus pretrain DINOv2 ViT models, ranging from 1B to 7B parameters, using 2 billion unlabeled images at 224 × 224 resolution from MC-2B, without high-resolution adaptation (Oquab et al., 2023), to ensure fair comparison with CLIP. We refer to these models as Web-DINO throughout the paper. For a controlled comparison, we also train CLIP models of the same sizes on the same data.

Figure 3 Scaling behavior of Web-DINO and CLIP ViTs trained on MC-2B. The x-axis shows model sizes from 1B to 7B parameters on a log scale. We observe novel 'scaling behavior' with Web-DINO models across all categories, with particularly pronounced improvements in the OCR & Chart and Vision-Centric domains as model size increases. In contrast, CLIP models demonstrate limited scaling benefits, with performance saturating at moderate model sizes. The two model families exhibit complementary strengths: CLIP models excel at OCR & Chart VQA, and Web-DINO models are superior at Vision-Centric VQA, while remaining competitive in all other categories.

We evaluate each model with VQA and present the results in Figure 3. We will first discuss the overall performance trend and then turn to specific category performance. To the best of our knowledge, this is the first instance of a vision encoder trained purely with visual self-supervision achieving performance parity with language-supervised encoders on VQA-even in the OCR & Chart category, which is traditionally considered to be highly text-dependent.

Performance trend. We compare the performance trend as model capacity increases in Figure 3. Web-DINO's Average, OCR & Chart, and Vision-Centric VQA performance improves nearly log-linearly with increasing model size, while General and Knowledge improve to a smaller degree. In contrast, CLIP's performance in all VQA categories largely saturates after 3B parameters. This suggests that while smaller CLIP models may be more data-efficient, this advantage largely dissipates for larger CLIP models. The continual improvement from increasing Web-DINO model capacity also suggests that visual SSL benefits from larger model capacity, and that scaling visual SSL past 7B parameters is a promising direction.
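"Nearly log-linear" scaling can be made concrete by regressing VQA score on the logarithm of parameter count. A small sketch with made-up placeholder scores (not the paper's numbers) shows the kind of fit implied:

```python
import numpy as np

# Hypothetical VQA scores for 1B-7B models, chosen to be roughly
# log-linear; these are illustrative placeholders only.
params = np.array([1e9, 2e9, 3e9, 5e9, 7e9])
scores = np.array([52.0, 54.1, 55.3, 56.8, 57.8])

# Fit score = slope * log10(params) + intercept.
slope, intercept = np.polyfit(np.log10(params), scores, deg=1)
predicted = slope * np.log10(params) + intercept

# Correlation between fit and data: close to 1 when the trend is log-linear.
r = np.corrcoef(predicted, scores)[0, 1]
```

A positive slope with a near-perfect correlation is the signature of the log-linear trend; saturation (as reported for CLIP) would show up as a systematically concave residual pattern instead.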

Category-specific performance. In terms of categoryspecific performance, DINO also increasingly outperforms CLIP on Vision-Centric VQA and largely closes the gap with CLIP on OCR & Chart and Average VQA (Figure 3), as model size increases. At 5B parameters and above, DINO can exceed the Average VQA performance of CLIP, despite being trained solely on images and without language supervision. These results suggest that vision-only models, when trained on CLIP-distribution images, can develop strong visual features that are comparable to those of language-supervised vision encoders.

Previously, we focused on single-epoch training, where each of the 2B unique images in MC-2B is seen only once. Here, we investigate the impact of increasing the number of examples seen by training Web-DINO ViT-7B on data ranging from 1 billion to 8 billion images from MC-2B.

As shown in Figure 4, General and Knowledge VQA performance improves incrementally with more examples seen, saturating at 4B and 2B examples respectively. Vision-Centric VQA performance improves sharply from 1B to 2B examples, and saturates beyond 2B examples. In contrast, OCR & Chart is the only category that shows consistent improvement with more examples seen. This suggests that as the model sees more data, it learns a representation that is increasingly well-suited for text-related tasks, yet without marked degradation on other capabilities.

Furthermore, when compared to a CLIP model of the same size (ViT-7B), Web-DINO consistently outperforms CLIP on average VQA performance given the same number of samples seen (Figure 4). Notably, after seeing 8B samples, Web-DINO closes the performance gap with the CLIP model on OCR & Chart VQA tasks. This provides further evidence suggesting that visual SSL models have the potential to scale better than language-supervised models.

Collectively, the results in Figure 3 and 4 indicate that as model size and examples seen increase, visual SSL learns features that are increasingly effective for VQA in general, but especially on OCR & Chart. Our results suggest that CLIP-based models do not hold an absolute advantage compared to visual SSL. In Section 4, we delve deeper into the underlying mechanisms driving this trend.

Figure 4 Scaling up examples seen when training Web-DINO-7B. Performance across different VQA categories as training data increases from 1B to 8B images. While General and Vision-Centric tasks show diminishing returns after 2B images, OCR & Chart tasks demonstrate continued improvement, contributing to steady gains in average performance. Further, Web-DINO consistently outperforms same-size (ViT-7B) CLIP models with different training samples seen. The x-axis plots training data size on a log-scale.

In Section 3, we demonstrated that visual SSL models scale well with model size and training set size. These observations raise further questions about the generality and implications of these phenomena. To deepen our understanding, we investigate five key aspects, including whether scaling behavior extends to other vision-only models (Question 1), if SSL models also exhibit scaling behavior on smaller and more conventional data (Question 2), and whether SSL can retain competitive performance on classic vision tasks (Question 3). Additionally, we explore why scaling particularly enhances OCR & Chart performance (Question 4), and highlight emergent properties that arise via scaling visual SSL (Question 5). In this section, we provide a detailed analysis of these findings.

Comparison with Other Vision Models

Next, we analyze the overall best performing vision encoders using both VQA and classic vision benchmarks. In Table 3, we show the best results of our vision encoders against recent off-the-shelf vision encoders, in terms of VQA and classic vision tasks.

For VQA, all vision encoders, including off-the-shelf models, are evaluated using the same visual instruction tuning setup detailed in Section 2.3, mainly at 224 × 224 input resolution for fair comparison. Because the goal is not to produce a state-of-the-art MLLM, we did not employ techniques such as unfreezing the vision encoder, resolution tiling (Liu et al., 2024b), or a spatial visual aggregator (Tong et al., 2024a).

For classic vision, we follow the evaluation procedure from Oquab et al. (2023) and evaluate linear probe performance on ImageNet-1k (Deng et al., 2009), ADE20K (Zhou et al., 2019), and NYU Depth v2 (Silberman et al., 2012). The input resolution differs between classic vision tasks, but each model tested uses the same exact settings from Oquab et al. (2023). We emphasize that the primary motivation is still to provide controlled insights.

Performance at 224px. Web-DINO can outperform off-the-shelf MetaCLIP in both VQA and classic vision tasks. Web-DINO is even able to match the performance of SigLIP and SigLIP2 on VQA despite seeing 5× less data and receiving no language supervision. In general, Web-DINO outperforms all off-the-shelf language-supervised CLIP models at traditional vision benchmarks. Although our best Web-DINO model is 7B parameters, the results from Section 3.1 and Section 3.2 suggest that CLIP models saturate beyond moderate model and data sizes, while visual SSL improves progressively with increasing model and data size. Web-DINO also outperforms off-the-shelf visual SSL methods, including DINOv2 (Oquab et al., 2023), in all VQA categories. Web-DINO is also competitive in traditional vision benchmarks.

Table 3 Comparison with other vision models. Web-DINO ViT-7B achieves competitive performance with CLIP models on VQA without language supervision and surpasses them on traditional vision tasks. Compared to other self-supervised models like DINOv2, Web-DINO significantly narrows the performance gap with CLIP on VQA tasks, particularly excelling in OCR & Chart understanding. These results demonstrate that SSL can effectively produce strong visual representations for both multimodal and classic vision tasks.

Performance beyond 224px. Next, we discuss the performance of higher-resolution models. Following Oquab et al. (2023), we additionally fine-tune Web-DINO for 20k steps at resolutions of 378 and 518, to compare against the higher-resolution off-the-shelf versions of SigLIP as well as DINO. See Appendix C for training details. From 224 to 378 to 518 resolution, Web-DINO improves steadily at average VQA, with notable gains in OCR & Chart performance. Classic vision performance improves modestly with higher resolution. At 384 resolution, Web-DINO trails behind SigLIP; at 518 resolution, Web-DINO largely bridges the gap. These results suggest that Web-DINO may benefit from further high-resolution adaptation.

Visual self-supervised learning methods. Early visual SSL methods explored various pretext tasks for pretraining (Wang and Gupta, 2015; Doersch et al., 2015; Noroozi and Favaro, 2016; Zhang et al., 2016; Gidaris et al., 2018; Balestriero et al., 2023). More recently, research has converged on two primary approaches: joint embedding methods and masked image modeling. Joint embedding methods learn invariant features by aligning representations of different augmented views (He et al., 2019; Misra and Van Der Maaten, 2019; Chen et al., 2020a; Grill et al., 2020; Chen et al., 2020b; Chen and He, 2021; Chen et al., 2021; Caron et al., 2021; LeCun, 2022; Chen et al., 2022; Garrido et al., 2023), while masked modeling (Zhou et al., 2021; He et al., 2022; Wei et al., 2022; Fan et al., 2023; Assran et al., 2023; Woo et al., 2023; Bar et al., 2024; Bai et al., 2024; Carreira et al., 2024) learns by predicting masked visual inputs.

Our work complements SSL research focused on pretraining algorithms by taking off-the-shelf training code and training visual SSL at scale in a controlled experimental setup. In Question 1, we show that the observed scaling behavior generalizes across both joint embedding and masked modeling SSL methods, and is likely not a method-specific phenomenon.

Data used to train vision models. Both supervised (He et al., 2016; Xie et al., 2016; Dosovitskiy et al., 2021; Liu et al., 2022) and SSL vision models have traditionally relied on standard datasets such as MNIST (LeCun, 1998), CIFAR-10 (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009; Ridnik et al., 2021). More recently, self-supervised methods have scaled to larger unlabeled datasets, such as YFCC (Thomee et al., 2016), LVD-142M (Oquab et al., 2023), and IG-3B (Singh et al., 2023); however, these methods still exhibit a significant performance gap compared to language-supervised models on VQA.

In contrast, language-supervised models (Radford et al., 2021; Zhai et al., 2023; Sun et al., 2023, 2024; Xu et al., 2024b; Tang et al., 2025) leverage significantly larger image-text datasets, from WIT-400M (Radford et al., 2021) to billion-scale web data (Schuhmann et al., 2022; Fang et al., 2024; Xu et al., 2024b; Gadre et al., 2024), with some using up to 100B image-text pairs (Wang et al., 2025). Studies suggest that pretraining data distribution is more critical for downstream performance than specific training methodologies (Fang et al., 2022; Liu and He, 2025).

Our work bridges these paradigms by pretraining SSL models on web-scale data. Through controlled experiments (Section 3 and 4), we show that (1) visual SSL models are sensitive to the training distribution, (2) increasing data diversity and quantity significantly improves performance on a diverse range of VQA tasks, and (3) training on a higher concentration of images containing text is highly effective for improving OCR & Chart understanding.

Evaluating vision models. Classic works have primarily used image classification (LeCun, 1998; Krizhevsky et al., 2009; Deng et al., 2009; Bossard et al., 2014; Hendrycks et al., 2019, 2020) to evaluate learned representations. More recent SSL research has expanded evaluation to include image segmentation (Everingham et al., 2010; Cordts et al., 2016; He et al., 2017; Zhou et al., 2019), depth estimation (Silberman et al., 2012; Geiger et al., 2013; Song et al., 2015), and video classification (Soomro et al., 2012; Goyal et al., 2017a; Baruch et al., 2021). Language-supervised models (Radford et al., 2021; Zhai et al., 2023), due to their two-tower encoder structure, commonly use zero-shot image classification to assess the quality of learned image and text features.

Our work follows recent proposals (Naeem et al., 2024; Fini et al., 2024; Tong et al., 2024a) to evaluate vision encoders on a broader range of VQA tasks (Goyal et al., 2017b; Yue et al., 2024a; Liu et al., 2024c; Fu et al., 2023; Tao and Xie, 2024; Yue et al., 2024b; xAI, 2024) using MLLMs. These VQA tasks complement traditional vision benchmarks by assessing visual features on a more diverse range of real-world perceptual challenges. As shown in Section 3 and Section 4, we find that visual SSL trained on web-scale data learns representations that continue to improve on VQA benchmarks, and-to a lesser degree-also on traditional vision benchmarks.


In Table 14, we provide full VQA results for the reference off-the-shelf models that we evaluated in Section 5.

In this work, we focus on training visual SSL models without using language. The main limitation of vision-only models, compared to language-supervised models, is that they do not support zero-shot image classification out of the box. However, by integrating visual SSL models into MLLM frameworks through instruction tuning, we show they can achieve impressive downstream performance across classification and other tasks. Another way to achieve zero-shot image classification is to use LiT-style adaptation (Zhai et al., 2022; Jose et al., 2024), but this is out of scope for our work as we do not use language supervision. To focus on comparing the vision encoder, we fixed the base LLM for visual instruction tuning to Llama-3 8B Instruct (AI@Meta, 2024). We hypothesize that the findings would be similar with other LLM backbones, but verifying this is beyond the scope of our work. Additionally, while we demonstrate that visual SSL scales well on MetaCLIP data, we leave the exploration of even larger and/or uncurated datasets to future work.

We show that large-scale visual encoders trained with self-supervised, language-free objectives can produce high-quality visual features for multimodal models. Our results echo the 'bitter lesson' (Sutton, 2019) and suggest that imposing less supervision, including language, remains a promising direction for advancing the field of computer vision. We hope our work will inspire further exploration of vision-only approaches, enabling the construction of next-generation vision models that excel at both traditional vision and modern multimodal capabilities.

Acknowledgements

We thank Ellis Brown, John Nguyen, Junlin Han, Shengyi Qian, Tyler Zhu, Yuexiang Zhai, Druv Pai, Shusheng Yang, Jihan Yang, Muzi Tao, Boyang Zheng, and Anjali Gupta for reviewing this manuscript. We thank Hu Xu and the MetaCLIP paper authors for creating the MetaCLIP dataset. We thank Mido Assran, Mikael Henaff, Daniel Bolya, Hu Xu, Mark Ibrahim, Russ Howes, and Matthew Muckley for their insightful feedback. We thank Michaël Ramamonjisoa and Marc Szafraniec for their help with image segmentation and depth estimation evaluations. Lastly, we thank Ananya Saxena, Cody Olsen, Mack Ward, Maxwell Taylor, Kalyan Saladi, Dev Satpathy, Dinesh Kannappan, Xiaodong Ma, Jacob Kahn, Gabriel Synnaeve, and Shubho Sengupta for infrastructure support.


Implementation Details

Training. For training Web-DINO, Web-MAE, and CLIP models, we closely follow the existing open-source codebases: the official DINOv2 and MAE repositories, and the MetaCLIP codebase, which builds on top of the OpenCLIP codebase (Cherti et al., 2023). We use Fully Sharded Data Parallel (FSDP) (Zhao et al., 2023) for distributed training of larger models.

For Web-DINO and CLIP pretraining, we follow the exact recipe and hyperparameters from the original paper for their largest model. For MAE pretraining, we observe that training becomes more prone to divergence as model size increases. To mitigate this, we reduce the learning rate from 2.4e-3 to 1.6e-3 and extend the warmup period to 80K iterations. Table 4 provides a summary of the pretraining hyperparameters.

Table 4 Hyperparameters for Web-DINO, Web-MAE and CLIP.

VQA evaluation. For VQA evaluation, we follow Tong et al. (2024a,b) and use Cambrian-Alignment data for MLP projector training and Cambrian-7M for MLP and LLM fine-tuning. We finetune on top of Llama-3 8B Instruct (AI@Meta, 2024). The vision encoder is frozen throughout finetuning. We excluded LAION (Schuhmann et al., 2022) images from the Cambrian data to comply with safety standards. We first encode the images at the model's original input resolution using the pretrained vision encoder. Next, we extract features from the final encoder layer. Following prior approaches (Tong et al., 2024a,b), we then resize the resulting token sequence to a fixed length of 576 tokens through bilinear interpolation. This ensures consistency across evaluations despite variations in input image resolutions. We report configurations in Table 5.
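The token-resizing step described above can be sketched as follows. This is an assumption-laden illustration (the authors' exact implementation may differ): the (N, D) patch-token sequence is viewed as its 2D grid, bilinearly resized to a 24 × 24 = 576 grid, and flattened back.

```python
import torch
import torch.nn.functional as F

def resize_tokens(tokens: torch.Tensor, target_len: int = 576) -> torch.Tensor:
    """Bilinearly resize a square grid of patch tokens to target_len tokens."""
    n, d = tokens.shape
    side = int(n ** 0.5)
    assert side * side == n, "expects a square patch grid"
    target_side = int(target_len ** 0.5)
    # (N, D) -> (1, D, H, W) so F.interpolate treats D as channels.
    grid = tokens.reshape(1, side, side, d).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(target_side, target_side),
                         mode="bilinear", align_corners=False)
    # (1, D, H', W') -> (H'*W', D)
    return grid.permute(0, 2, 3, 1).reshape(target_side * target_side, d)

# e.g. a 518px input with patch size 14 yields a 37 x 37 = 1369 token grid.
tokens = torch.randn(37 * 37, 1024)
resized = resize_tokens(tokens)
```

Resizing in grid space rather than truncating the sequence preserves the spatial layout of the features, which is why the evaluation stays comparable across input resolutions.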

Classic vision evaluation. We follow the evaluation procedure in DINOv2 (Oquab et al., 2023) for all classic vision evaluation: linear probe on ImageNet-1k (Deng et al., 2009), ADE20K (Zhou et al., 2019), and NYU Depth v2 (Silberman et al., 2012). For ImageNet-1k, we evaluate models at their pretrained image resolution; for ADE20K and NYU Depth v2, we use the settings from Oquab et al. (2023). For ADE20K, we follow DINOv2 and report the linear and +ms settings. For NYU Depth v2, we report lin. 1 and lin. 4. See the original paper for additional details.
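The linear-probe protocol amounts to fitting only a linear classifier on frozen features. A minimal sketch with synthetic features standing in for real encoder outputs (the actual protocol uses SGD over ImageNet-1k features; this is an illustration of the principle, not the DINOv2 recipe):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
num_classes, dim, n = 3, 64, 600

# Synthetic "frozen encoder features": class centers plus small noise.
centers = rng.normal(size=(num_classes, dim))
labels = rng.integers(0, num_classes, size=n)
features = centers[labels] + 0.1 * rng.normal(size=(n, dim))

# Only the linear head is trained; the "encoder" (feature map) is fixed.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
train_acc = probe.score(features, labels)
```

Because the encoder never updates, probe accuracy directly measures how linearly separable the pretrained representation already is.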

Model architectures. In Table 1, we defined the ViT architectures used in our study. To recap, we first borrowed the ViT-g architecture from Oquab et al. (2023) and named it ViT-1B for consistent notation. We then define 2B, 3B, 5B, and 7B architectures inspired by language-model scaling; specifically, the 2B-7B architectures are wider than the 1B variant, following language-model recipes. Our 7B architecture is almost identical to the Llama-2 7B design, except for the patch embedding layer, which is unique to ViTs.
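To make the width/depth scaling discussion concrete, here is a rough parameter-count estimator for a standard ViT block structure. The configuration used below is an illustrative guess (a Llama-2-7B-like width and depth), not the paper's Table 1 values, and biases and norm parameters are omitted:

```python
def vit_params(width: int, depth: int, mlp_ratio: float = 4.0,
               patch: int = 14, in_ch: int = 3) -> int:
    """Approximate parameter count of a standard ViT (biases/norms omitted)."""
    attn = 4 * width * width                   # q, k, v, and output projections
    mlp = 2 * width * int(mlp_ratio * width)   # two MLP linear layers
    block = attn + mlp
    patch_embed = in_ch * patch * patch * width
    return depth * block + patch_embed

# A hypothetical 4096-wide, 32-deep ViT lands in the 6-7B range,
# illustrating why width scaling dominates the parameter count.
approx_7b = vit_params(width=4096, depth=32)
```

Since the per-block cost grows quadratically in width but only linearly in depth, widening (as the 2B-7B variants do relative to the 1B model) is the fastest route to larger capacity.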

Text filtering. In Question 4, we introduced the 'Light' and 'Heavy' filters: (i) the Light filter retains 50.3% of the images; (ii) the Heavy filter retains only 1.3%.


Table 16 lists evaluation benchmarks used and their purposes.


We provide full results for Question 4. As shown in Table 13, SSL models learn features particularly well-suited for OCR & Chart tasks when trained on datasets with a higher concentration of text-rich images. This suggests that visual SSL is sensitive to the underlying training distribution and can be effectively steered toward specific downstream applications, such as OCR & Chart.

Full Results

We include full results of all experiments presented in Section 3 and Section 4.

Web-DINO

Scaling up model sizes. We show quantitative results of scaling up the model under VQA evaluation in Table 6 and classic vision evaluation in Table 7. These are the numerical results for Section 3.1.

Table 6 VQA Evaluation: Web-DINO trained on MC-2B with 2 billion images seen.

We show VQA evaluation results from scaling up MetaCLIP (Xu et al., 2024b) models trained on MC-2B in Table 12. These are the full results for Section 3.1. In contrast to the visual SSL methods in Table 7 and Table 11, CLIP models do not exhibit clear scaling behavior.

High-Resolution Adaptation of Web-DINO

Following Oquab et al. (2023), we further fine-tune our model at higher resolutions of 378 × 378 and 518 × 518 for 20k iterations. We use a batch size of 2048 and a correspondingly lower learning rate of 1.41e-5. All other parameters remain the same as previously specified, with the learning rate warmup ratio kept proportional to the total number of iterations.

We also provide detailed benchmark results of high-resolution adaptation of Web-DINO in Table 15.

Visual SSL can match and even surpass language-supervised methods for visual pretraining on a wide range of VQA tasks, even on language-related tasks such as OCR & Chart understanding (Figure 3).

Visual SSL scales well with respect to model capacity (Figure 3) and data (Figure 4), indicating that SSL has significant untapped potential.

Visual SSL can maintain competitive traditional vision performance on classification and segmentation, even while improving at VQA (Figure 7).

Training on a higher ratio of images containing text is especially effective for improving OCR & Chart performance (Question 4). Exploring data composition is a promising direction.

In previous sections, we derived our findings from DINOv2, a joint-embedding visual SSL method. Here, we extend our analysis to a masked-modeling-based visual SSL method, Masked Autoencoder (MAE) (He et al., 2022). We train MAE on MC-2B (denoted as Web-MAE) using ViT models ranging from 1B to 5B parameters and compare the results with Web-DINO models in Figure 5.

Web-MAE models exhibit similar scaling behavior to Web-DINO models, with average VQA performance improving consistently as model size increases. Compared to joint embedding methods, Web-MAE models learn features that are particularly well-suited for OCR & Chart tasks but underperform in other domains. These results suggest that the “scaling behavior” observed in VQA tasks generalizes across different visual SSL methods. We also note that different visual SSL approaches learn distinct representations even when trained under the same conditions, as demonstrated by Web-MAE’s OCR performance.
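For context, masked-modeling methods like MAE learn by reconstructing a large fraction of masked patch tokens from the visible remainder. A minimal sketch of the random-masking step is below; the 75% ratio and 14x14 patch grid are common illustrative defaults, not the exact configuration used for Web-MAE.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Keep a random subset of patch tokens; return kept tokens and the mask."""
    rng = rng or np.random.default_rng(0)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    order = rng.permutation(n)
    keep_idx = np.sort(order[:n_keep])
    mask = np.ones(n, dtype=bool)      # True = masked (to be reconstructed)
    mask[keep_idx] = False
    return patches[keep_idx], mask

patches = np.arange(196 * 8, dtype=float).reshape(196, 8)  # 14x14 patch grid
visible, mask = random_masking(patches)
```

The encoder sees only `visible`, and a lightweight decoder is trained to reconstruct the pixels at the masked positions, so the pretext task is prediction rather than view invariance.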

We pretrain Web-DINO 1B, 2B, and 3B models for 300 epochs on ImageNet-1k, a conventional pretraining dataset for SSL, following the recipe of Oquab et al. (2023). We compare these variants to those trained on MC-2B, evaluating their downstream VQA performance and ImageNet-1k linear-probing results. As shown in Figure 6, models pretrained on ImageNet-1k exhibit consistently inferior performance across all metrics. Moreover, unlike models trained on MC-2B, those trained on ImageNet-1k do not improve with increasing model size. This highlights the importance of training visual SSL on larger and more diverse datasets, echoing recent findings that dataset size and diversity drive LLM scaling (Kaplan et al., 2020; Hoffmann et al., 2023; Chowdhery et al., 2022) and that pretraining data distribution is critical to downstream performance (Liu and He, 2025).

We evaluate Web-DINO models, ranging from 1B to 7B parameters, on classic vision benchmarks including linear probing on ImageNet-1k (Deng et al., 2009), semantic segmentation on ADE20K (Zhou et al., 2019), and depth estimation on NYUv2 (Silberman et al., 2012). Following the evaluation protocol of DINOv2 (Oquab et al., 2023), we freeze the vision encoder; see Appendix A for details. As shown in Figure 7, Web-DINO's performance improves modestly with increasing model size. Web-DINO achieves strong performance across all benchmarks, outperforming MetaCLIP by a significant margin and remaining competitive with off-the-shelf DINOv2, even outperforming it on ADE20K +ms. Note that the comparison with off-the-shelf DINOv2 is not exactly apples-to-apples, as we do not use high-resolution adaptation (Oquab et al., 2023) in order to maintain the same input resolution as CLIP. Additionally, the DINOv2 training data is more closely correlated with these classic vision benchmarks, as detailed in Appendix E. These differences suggest that there remains considerable room for further improvement in our model's classic vision performance.

However, we observe that the scaling behavior in classic vision tasks is less pronounced compared to VQA. This finding, along with insights from previous work (Tong et al., 2024a; Fini et al., 2024; Naeem et al., 2024), reinforces the value of VQA as a comprehensive vision model evaluation framework. While classic benchmarks remain important, VQA provides a complementary view into model performance by offering a diverse set of tasks grounded in real-world perceptual challenges.

In Section 3, we observed that increasing model size and examples seen leads to unprecedented improvements in OCR & Chart performance for visual SSL models. This is surprising since current off-the-shelf visual SSL methods are notably poor at OCR & Chart understanding compared to language-supervised models (Tong et al., 2024a; Shi et al., 2024).

One possible explanation is that web-scale image datasets already contain a degree of textual information. Unlike object-centric datasets such as ImageNet, images from the web often contain text (e.g., labels, signs, and diagrams). Larger capacity and more data may help visual SSL models extract and leverage this textual information.

To test this hypothesis, we apply an off-the-shelf MLLM—SmolVLM2 (Allal et al., 2025)—to identify images containing text. See Figure 8 for qualitative examples and Appendix A for details. This results in two curated datasets: (i) a Light filter, which retains the 50.3% of images that contain any text; and (ii) a Heavy filter, which retains only the 1.3% of images that explicitly contain charts and documents.
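Conceptually, the two filters are simple predicates applied over the dataset. In the sketch below, `contains_text` and `contains_chart_or_doc` are hypothetical stand-ins for the actual MLLM prompting pipeline, and the string "images" are illustrative only.

```python
def light_filter(images, contains_text):
    """Keep images that contain any rendered text (labels, signs, documents...)."""
    return [im for im in images if contains_text(im)]

def heavy_filter(images, contains_chart_or_doc):
    """Keep only images that are explicitly charts or documents."""
    return [im for im in images if contains_chart_or_doc(im)]

# Toy run with string "images" and keyword predicates, for illustration only.
imgs = ["street sign", "cat", "bar chart", "scanned document", "beach"]
light = light_filter(imgs, lambda im: im in {"street sign", "bar chart", "scanned document"})
heavy = heavy_filter(imgs, lambda im: im in {"bar chart", "scanned document"})
```

With a real MLLM, each predicate would wrap a yes/no prompt over the image (e.g., "Does this image contain text?"), batched over the corpus.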

We train Web-DINO ViT-2B models on these filtered datasets, with each run using 2 billion seen examples (meaning the filtered datasets undergo multiple epochs). As shown in Table 2, the model trained on lightly filtered data outperforms the full-data variant by +6.4 points on OCR & Chart tasks, and the heavily filtered variant improves by +13.6 points.
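Because every run fixes the number of examples seen rather than the number of epochs, the effective epoch count grows as the filter gets more aggressive. A back-of-envelope calculation, assuming MC-2B holds roughly 2e9 images as its name suggests (the exact dataset size is an assumption here):

```python
# Effective epochs = examples seen / filtered dataset size.
EXAMPLES_SEEN = 2e9
FULL_DATASET = 2e9   # assumed size of MC-2B

def effective_epochs(retained_fraction):
    return EXAMPLES_SEEN / (FULL_DATASET * retained_fraction)

light_epochs = effective_epochs(0.503)   # Light filter keeps 50.3%
heavy_epochs = effective_epochs(0.013)   # Heavy filter keeps 1.3%
```

Under these assumptions the Light-filter run sees its data about twice, while the Heavy-filter run repeats its much smaller subset dozens of times.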

The improvement in OCR & Chart from training on heavily filtered data is particularly pronounced for ChartQA (+24.2), OCRBench (+13.8), and DocVQA (+13.0).

Although it is not surprising that skewing the data in favor of OCR & Chart would improve OCR & Chart capabilities, it is surprising that simple data filtering can outperform language supervision on the full data. This simple proof of concept suggests that similar techniques may be used to help visual SSL bridge future gaps in other capabilities.

Thus far, we have seen that visual SSL models can not only become competitive with CLIP models, but also that they can excel at tasks previously thought to require language. This raises an important question: why do vision-only models learn features that work well for multimodal models, even in the absence of language supervision?

We hypothesize that SSL models learn features increasingly aligned with language as model size and the number of examples seen increase. Following Huh et al. (2024), we evaluate intrinsic representational alignment by computing a matching metric between the vision encoder and language model, using image-text pairs from the Wikipedia Captions dataset (Srinivasan et al., 2021). We use off-the-shelf DINOv2 (Oquab et al., 2023) and Web-DINO as vision encoders, and off-the-shelf Llama-3.1 8B and 70B (Touvron et al., 2023) as the language models, without any visual instruction tuning or alignment procedure.
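The matching metric of Huh et al. (2024) can be approximated by a mutual nearest-neighbor score: for paired image and text embeddings, measure how much the k-NN neighborhoods in the two spaces agree. The following is a simplified sketch with random toy embeddings, not the paper's exact implementation.

```python
import numpy as np

def mutual_knn_alignment(img_emb, txt_emb, k=5):
    """Fraction of shared k-nearest neighbors between two paired embedding sets."""
    def knn_sets(X):
        d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
        np.fill_diagonal(d, np.inf)                         # exclude self
        return [set(np.argsort(row)[:k]) for row in d]
    a, b = knn_sets(img_emb), knn_sets(txt_emb)
    return float(np.mean([len(sa & sb) / k for sa, sb in zip(a, b)]))

rng = np.random.default_rng(0)
shared = rng.normal(size=(50, 8))
# Nearly identical geometry -> high alignment; unrelated geometry -> near chance.
aligned = mutual_knn_alignment(shared, shared + 0.01 * rng.normal(size=(50, 8)))
random_pair = mutual_knn_alignment(shared, rng.normal(size=(50, 8)))
```

The score is invariant to rotations of either space, which is what makes it suitable for comparing encoders trained with no shared objective.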

As shown in Figure 9, we observe three key trends: (1) training on more diverse data (MC-2B) improves alignment with LLMs (DINOv2 ViT-1B → Web-DINO ViT-1B); (2) increasing the vision model size leads to slightly higher alignment (Web-DINO ViT-1B → ViT-7B); and (3) seeing more training samples further enhances alignment (Web-DINO ViT-7B trained on 2B samples → 8B samples).

These findings suggest that as model size and, in particular, training samples scale, vision models naturally develop text-sensitive features and achieve strong alignment with LLMs and multimodal tasks, without explicit language supervision.


Early visual SSL methods explored various pretext tasks for pretraining (Wang and Gupta, 2015; Doersch et al., 2015; Noroozi and Favaro, 2016; Zhang et al., 2016; Gidaris et al., 2018; Balestriero et al., 2023). More recently, research has converged on two primary approaches: joint embedding methods and masked image modeling. Joint embedding methods learn invariant features by aligning representations of different augmented views (He et al., 2019; Misra and Van Der Maaten, 2019; Chen et al., 2020a; Grill et al., 2020; Chen et al., 2020b; Chen and He, 2021; Chen et al., 2021; Caron et al., 2021; LeCun, 2022; Chen et al., 2022; Garrido et al., 2023), while masked modeling (Zhou et al., 2021; He et al., 2022; Wei et al., 2022; Fan et al., 2023; Assran et al., 2023; Woo et al., 2023; Bar et al., 2024; Bai et al., 2024; Carreira et al., 2024) learns by predicting masked visual inputs.

In contrast, language-supervised models (Radford et al., 2021; Zhai et al., 2023; Sun et al., 2023, 2024; Xu et al., 2024b; Tang et al., 2025) leverage significantly larger image-text datasets, from WIT-400M (Radford et al., 2021) to billion-scale web data (Schuhmann et al., 2022; Fang et al., 2024; Xu et al., 2024b; Gadre et al., 2024), with some using up to 100B image-text pairs (Wang et al., 2025). Studies suggest that pretraining data distribution is more critical for downstream performance than specific training methodologies (Fang et al., 2022; Liu and He, 2025).

Table 5: MLLM training hyperparameters (LLM backbone: Llama-3 8B Instruct).

| Stage | Data | LR | WD | BS |
|---|---|---|---|---|
| Adapter | Cambrian Adapter Data | 1.00e-5 | 0.0 | 512 |
| Instruction Tuning | Cambrian-7M | 4.00e-5 | 0 | 512 |


Columns group as General (MME-P, MMB, SEED-I, GQA), Knowledge (SQA-I, MMMU-V, MathVista-M, AI2D), OCR & Chart (ChartQA, OCRBench, TextVQA, DocVQA), and Vision-Centric (MMVP, RealWorldQA, CV-Bench 2D/3D).

| Model | Avg | MME-P | MMB | SEED-I | GQA | SQA-I | MMMU-V | MathVista-M | AI2D | ChartQA | OCRBench | TextVQA | DocVQA | MMVP | RealWorldQA | CV-Bench 2D | CV-Bench 3D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Web-DINO ViT-1B | 49.01 | 1731.52 | 65.37 | 69.92 | 62.40 | 72.58 | 35.33 | 12.30 | 64.28 | 19.20 | 9.40 | 47.41 | 17.00 | 37.33 | 57.12 | 64.80 | 63.16 |
| Web-DINO ViT-2B | 50.77 | 1760.80 | 68.98 | 71.29 | 62.89 | 73.67 | 31.77 | 15.90 | 67.06 | 23.30 | 15.60 | 49.20 | 19.00 | 38.00 | 57.38 | 65.85 | 64.41 |
| Web-DINO ViT-3B | 51.71 | 1757.27 | 68.04 | 71.84 | 63.19 | 73.57 | 33.00 | 14.40 | 67.32 | 25.68 | 17.10 | 50.45 | 20.00 | 42.66 | 56.86 | 69.49 | 65.83 |
| Web-DINO ViT-5B | 52.83 | 1840.81 | 70.01 | 72.39 | 63.56 | 75.06 | 32.11 | 12.40 | 67.77 | 26.96 | 22.10 | 50.64 | 21.00 | 44.66 | 57.64 | 67.75 | 69.16 |
| Web-DINO ViT-7B | 53.87 | 1823.76 | 68.98 | 73.02 | 64.22 | 74.61 | 35.11 | 14.00 | 69.43 | 28.80 | 23.59 | 51.10 | 22.00 | 48.00 | 59.34 | 69.96 | 68.58 |

| Vision Backbone | IN1k lin. | ADE20K lin. | ADE20K +ms | NYUd lin. 1 (↓) | NYUd lin. 4 (↓) |
|---|---|---|---|---|---|
| Web-DINO ViT-1B | 84.70 | 46.60 | 50.97 | 0.364 | 0.345 |
| Web-DINO ViT-2B | 85.16 | 50.55 | 52.32 | 0.351 | 0.335 |
| Web-DINO ViT-3B | 85.66 | 50.17 | 53.12 | 0.348 | 0.328 |
| Web-DINO ViT-5B | 85.84 | 49.54 | 53.27 | 0.378 | 0.335 |
| Web-DINO ViT-7B | 86.00 | 49.08 | 54.65 | 0.380 | 0.339 |

| Vision Backbone | IN1k lin. | ADE20K lin. | ADE20K +ms | NYUd lin. 1 (↓) | NYUd lin. 4 (↓) |
|---|---|---|---|---|---|
| Web-DINO ViT-7B (2B Data) | 86.00 | 49.08 | 54.65 | 0.380 | 0.339 |
| Web-DINO ViT-7B (4B Data) | 86.33 | 47.41 | 54.66 | 0.416 | 0.363 |
| Web-DINO ViT-7B (8B Data) | 86.52 | 42.14 | 52.55 | 0.491 | 0.376 |

| Model | Avg | MME-P | MMB | SEED-I | GQA | SQA-I | MMMU-V | MathVista-M | AI2D | ChartQA | OCRBench | TextVQA | DocVQA | MMVP | RealWorldQA | CV-Bench 2D | CV-Bench 3D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Web-MAE ViT-1B | 49.19 | 1736.22 | 62.02 | 68.38 | 60.05 | 73.27 | 33.11 | 12.90 | 63.92 | 23.60 | 16.40 | 47.84 | 18.00 | 36.66 | 52.81 | 70.42 | 60.83 |
| Web-MAE ViT-2B | 50.59 | 1700.16 | 63.57 | 69.21 | 60.93 | 72.48 | 32.22 | 15.50 | 64.44 | 29.00 | 23.20 | 48.78 | 20.00 | 38.00 | 55.16 | 67.98 | 63.91 |
| Web-MAE ViT-3B | 50.92 | 1723.85 | 64.69 | 69.71 | 60.94 | 72.13 | 34.33 | 13.50 | 65.70 | 30.92 | 24.60 | 48.92 | 20.00 | 37.33 | 54.64 | 64.15 | 66.91 |
| Web-MAE ViT-5B | 51.50 | 1710.13 | 65.12 | 70.13 | 61.10 | 72.63 | 32.66 | 13.90 | 65.67 | 33.80 | 26.50 | 49.60 | 21.00 | 38.00 | 53.72 | 66.69 | 67.91 |

| Model | Avg | MME-P | MMB | SEED-I | GQA | SQA-I | MMMU-V | MathVista-M | AI2D | ChartQA | OCRBench | TextVQA | DocVQA | MMVP | RealWorldQA | CV-Bench 2D | CV-Bench 3D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MetaCLIP ViT-1B | 52.30 | 1813.70 | 68.90 | 69.45 | 60.35 | 74.07 | 33.55 | 12.70 | 64.41 | 33.20 | 34.59 | 52.15 | 26.00 | 37.33 | 52.15 | 65.47 | 61.83 |
| MetaCLIP ViT-2B | 53.03 | 1787.39 | 68.81 | 69.54 | 61.08 | 75.16 | 34.66 | 20.10 | 65.38 | 32.80 | 32.90 | 52.55 | 26.00 | 37.33 | 52.94 | 65.19 | 64.67 |
| MetaCLIP ViT-3B | 53.22 | 1873.67 | 68.72 | 70.33 | 61.85 | 77.29 | 32.77 | 11.80 | 66.35 | 32.16 | 34.40 | 54.58 | 26.00 | 35.33 | 55.55 | 65.57 | 65.08 |
| MetaCLIP ViT-5B | 52.52 | 1779.03 | 70.10 | 70.26 | 61.53 | 72.43 | 33.44 | 17.90 | 66.74 | 30.04 | 32.20 | 52.49 | 25.00 | 39.33 | 54.50 | 64.22 | 61.16 |
| MetaCLIP ViT-7B | 52.97 | 1827.80 | 69.93 | 69.47 | 61.33 | 74.91 | 35.55 | 16.80 | 65.15 | 32.12 | 32.10 | 52.07 | 25.00 | 39.33 | 54.11 | 65.08 | 63.16 |

CLIP models:

| Model | Avg | MME-P | MMB | SEED-I | GQA | SQA-I | MMMU-V | MathVista-M | AI2D | ChartQA | OCRBench | TextVQA | DocVQA | MMVP | RealWorldQA | CV-Bench 2D | CV-Bench 3D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MetaCLIP ViT-H (224px) | 54.91 | 1860.58 | 72.93 | 70.96 | 62.22 | 77.88 | 36.88 | 15.00 | 67.32 | 35.60 | 33.40 | 55.10 | 29.00 | 41.33 | 53.46 | 68.53 | 65.91 |
| SigLIP ViT-SO400M (224px) | 55.36 | 1807.30 | 72.76 | 71.83 | 62.68 | 76.74 | 35.44 | 14.00 | 68.65 | 33.08 | 40.20 | 56.61 | 28.00 | 47.33 | 56.99 | 66.42 | 64.66 |
| SigLIP ViT-SO400M (384px) | 59.97 | 1892.16 | 73.71 | 73.00 | 63.80 | 77.83 | 33.88 | 20.00 | 69.78 | 54.24 | 46.40 | 63.53 | 50.00 | 46.00 | 58.43 | 67.37 | 66.91 |
| SigLIP2 ViT-SO400M (224px) | 56.32 | 1789.26 | 73.36 | 72.20 | 62.60 | 74.96 | 35.55 | 22.40 | 69.85 | 35.76 | 42.00 | 59.68 | 31.00 | 44.00 | 54.24 | 69.88 | 64.16 |
| SigLIP2 ViT-SO400M (384px) | 61.98 | 1895.70 | 74.57 | 72.24 | 64.81 | 79.27 | 36.33 | 19.90 | 72.24 | 59.68 | 52.90 | 67.15 | 54.00 | 49.33 | 54.77 | 70.73 | 69.00 |

SSL models:

| Model | Avg | MME-P | MMB | SEED-I | GQA | SQA-I | MMMU-V | MathVista-M | AI2D | ChartQA | OCRBench | TextVQA | DocVQA | MMVP | RealWorldQA | CV-Bench 2D | CV-Bench 3D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DINOv2 ViT-g (224px) | 49.25 | 1785.25 | 64.86 | 70.89 | 62.89 | 72.03 | 32.11 | 12.40 | 62.37 | 17.96 | 5.50 | 47.06 | 15.00 | 47.33 | 56.33 | 65.92 | 66.08 |
| DINOv2 ViT-g (378px) | 47.94 | 1734.38 | 64.26 | 71.50 | 62.21 | 71.04 | 33.11 | 9.60 | 63.08 | 17.76 | 5.00 | 45.59 | 15.00 | 41.33 | 56.47 | 63.79 | 60.58 |
| DINOv2 ViT-g (518px) | 47.91 | 1694.08 | 62.45 | 70.64 | 62.87 | 71.29 | 33.55 | 11.80 | 63.37 | 18.32 | 5.10 | 46.27 | 15.00 | 37.33 | 56.60 | 65.36 | 61.83 |
| I-JEPA ViT-H (224px) | 44.78 | 1598.15 | 60.01 | 64.04 | 57.66 | 68.91 | 34.55 | 10.20 | 62.07 | 16.72 | 4.00 | 42.99 | 14.00 | 29.33 | 49.93 | 57.39 | 57.16 |
| MAE ViT-H (224px) | 45.21 | 1697.06 | 56.87 | 56.41 | 60.51 | 70.74 | 32.11 | 11.50 | 61.30 | 17.40 | 5.50 | 45.38 | 14.00 | 27.33 | 53.46 | 61.19 | 64.75 |

| Model | Avg | MME-P | MMB | SEED-I | GQA | SQA-I | MMMU-V | MathVista-M | AI2D | ChartQA | OCRBench | TextVQA | DocVQA | MMVP | RealWorldQA | CV-Bench 2D | CV-Bench 3D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Web-DINO (224px) | 55.24 | 1811.05 | 71.30 | 72.14 | 64.04 | 72.43 | 35.66 | 15.20 | 68.52 | 35.52 | 36.40 | 56.53 | 29.00 | 46.00 | 57.90 | 70.53 | 62.08 |
| Web-DINO (378px) | 57.43 | 1757.06 | 70.61 | 72.59 | 64.50 | 72.53 | 35.11 | 16.10 | 67.09 | 52.04 | 42.19 | 61.51 | 46.00 | 38.00 | 59.08 | 66.55 | 67.16 |
| Web-DINO (518px) | 59.91 | 1807.08 | 73.79 | 72.92 | 64.78 | 74.36 | 34.66 | 14.50 | 69.43 | 57.28 | 45.70 | 64.48 | 53.00 | 43.33 | 60.52 | 70.08 | 69.41 |

For reference, in Table 17 we include the data composition of LVD-142M, which was used to train the off-the-shelf DINOv2 model (Oquab et al., 2023). LVD-142M is a carefully curated data mix closely aligned with downstream classic vision evaluation tasks. In comparison, we leverage MetaCLIP data, which is less curated and collected from 15 snapshots of CommonCrawl (CC).

Table 1: Model architecture details. For consistency, we denote ViT-g from Oquab et al. (2023) as ViT-1B.

| Model | Width | Depth | Heads | MLP |
|---|---|---|---|---|
| ViT-1B | 1536 | 40 | 24 | 6144 |
| ViT-2B | 2688 | 24 | 21 | 10752 |
| ViT-3B | 3072 | 26 | 24 | 12288 |
| ViT-5B | 3584 | 32 | 28 | 14336 |
| ViT-7B | 4096 | 32 | 32 | 16384 |

Table 2: Impact of data filtering on SSL model performance. We compare Web-DINO ViT-2B models trained on MC-2B with different levels of text filtering (full, 50.3%, and 1.3%) against CLIP ViT-2B trained on full MC-2B. OCR & Chart performance improves with progressively aggressive filtering, with the 1.3% filter achieving the best results. Despite receiving zero language supervision, SSL models can surpass CLIP in text-centric tasks while maintaining strong overall performance.

The last four columns break down the OCR & Chart category.

| Method | % of MC-2B | AVG | General | Knowledge | Vision-Centric | OCR & Chart | ChartQA | OCRBench | TextVQA | DocVQA |
|---|---|---|---|---|---|---|---|---|---|---|
| CLIP 2B | 100% | 53.0 | 72.2 | 48.8 | 55.0 | 36.1 | 32.8 | 32.9 | 52.6 | 26.0 |
| Web-DINO 2B | 100% | 50.8 | 72.8 | 47.1 | 56.4 | 26.8 | 23.3 | 15.6 | 49.2 | 19.0 |
| Web-DINO 2B | 50.3% | 53.4 (+2.6) | 73.0 (+0.2) | 51.7 (+4.6) | 55.6 (-0.8) | 33.2 (+6.4) | 31.4 (+8.1) | 27.3 (+11.7) | 51.3 (+2.1) | 23.0 (+4.0) |
| Web-DINO 2B | 1.3% | 53.7 (+2.9) | 70.7 (-2.1) | 47.3 (+0.2) | 56.2 (-0.2) | 40.4 (+13.6) | 47.5 (+24.2) | 29.4 (+13.8) | 52.8 (+3.6) | 32.0 (+13.0) |

Table 3: Comparison with other vision models. Web-DINO ViT-7B achieves competitive performance with CLIP models on VQA without language supervision and surpasses them on traditional vision tasks. Compared to other self-supervised models like DINOv2, Web-DINO significantly narrows the performance gap with CLIP on VQA tasks, particularly excelling in OCR & Chart understanding. These results demonstrate that SSL can effectively produce strong visual representations for both multimodal and classic vision tasks.

Language-supervised models:

| Method | Pretrain Data | Samples Seen | Res | AVG | General | Knowledge | OCR & Chart | Vision-Centric | IN1k lin. | ADE20K lin. | ADE20K +ms | NYUd lin. 1 (↓) | NYUd lin. 4 (↓) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SigLIP ViT-SO400M | WebLI | 45.0B | 224 | 55.4 | 74.4 | 48.7 | 39.5 | 58.9 | 86.5 | 36.5 | 38.0 | 0.607 | 0.525 |
| SigLIP ViT-SO400M | WebLI | 45.0B | 384 | 60.0 | 76.3 | 50.4 | 53.5 | 59.7 | 87.3 | 39.5 | 47.2 | 0.582 | 0.438 |
| SigLIP2 ViT-SO400M | WebLI | 45.0B | 224 | 56.3 | 74.4 | 50.7 | 42.1 | 58.1 | 87.5 | 41.1 | 44.2 | 0.562 | 0.539 |
| SigLIP2 ViT-SO400M | WebLI | 45.0B | 384 | 62.0 | 76.6 | 51.9 | 58.4 | 61.0 | 88.1 | 43.5 | 50.2 | 0.524 | 0.469 |
| MetaCLIP ViT-G | MetaCLIP | 12.8B | 224 | 54.8 | 75.5 | 48.2 | 37.3 | 58.4 | 86.4 | 38.0 | 46.7 | 0.524 | 0.415 |

Visual self-supervised models:

| Method | Pretrain Data | Samples Seen | Res | AVG | General | Knowledge | OCR & Chart | Vision-Centric | IN1k lin. | ADE20K lin. | ADE20K +ms | NYUd lin. 1 (↓) | NYUd lin. 4 (↓) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE ViT-H | ImageNet-1k | 2.0B | 224 | 45.2 | 64.6 | 43.9 | 20.6 | 51.7 | 76.6 | 33.3 | 30.7 | 0.517 | 0.483 |
| I-JEPA ViT-H | ImageNet-22k | 0.9B | 224 | 44.7 | 65.4 | 43.9 | 21.2 | 48.4 | 68.8 | 31.6 | 34.6 | 0.548 | 0.520 |
| DINOv2 ViT-g | LVD-142M | 1.9B | 518 | 47.9 | 70.2 | 45.0 | 21.2 | 55.3 | 86.0 | 49.0 | 53.0 | 0.344 | 0.298 |
| Web-DINO ViT-7B | MC-2B | 8.0B | 224 | 55.2 | 74.5 | 48.0 | 39.4 | 59.1 | 86.5 | 42.1 | 52.6 | 0.491 | 0.376 |
| Web-DINO ViT-7B | MC-2B | 8.0B | 378 | 57.4 | 73.9 | 47.7 | 50.4 | 57.7 | 86.3 | 42.3 | 53.1 | 0.498 | 0.366 |
| Web-DINO ViT-7B | MC-2B | 8.0B | 518 | 59.9 | 75.5 | 48.2 | 55.1 | 60.8 | 86.4 | 42.6 | 52.8 | 0.490 | 0.362 |

Table 4: Hyperparameters for Web-DINO, Web-MAE, and CLIP.

| Model | Batch Size | Learning Rate | Warmup |
|---|---|---|---|
| Web-DINO | 3072 | 3.5e-4 | 100K |
| Web-MAE | 4096 | 1.6e-3 | 80K |
| CLIP | 32768 | 4e-4 | 2K |

Table 16: List of benchmarks used.

| Benchmark | Eval | Citation |
|---|---|---|
| GQA | General VQA | Hudson and Manning (2019) |
| SEED | General VQA | Ge et al. (2023) |
| MME | General VQA | Fu et al. (2023) |
| MMBench | General VQA | Liu et al. (2024c) |
| AI2D | Knowledge VQA | Hiippala et al. (2021) |
| ScienceQA | Knowledge VQA | Lu et al. (2022) |
| MathVista | Knowledge VQA | Lu et al. (2023) |
| MMMU | Knowledge VQA | Yue et al. (2024a) |
| TextVQA | OCR & Chart VQA | Singh et al. (2019) |
| DocVQA | OCR & Chart VQA | Mathew et al. (2021) |
| ChartQA | OCR & Chart VQA | Masry et al. (2022) |
| OCRBench | OCR & Chart VQA | Liu et al. (2023b) |
| MMVP | Vision-Centric VQA | Tong et al. (2024c) |
| RealWorldQA | Vision-Centric VQA | xAI (2024) |
| CVBench-2D | Vision-Centric VQA | Tong et al. (2024a) |
| CVBench-3D | Vision-Centric VQA | Tong et al. (2024a) |
| ImageNet-1k | Image Classification | Deng et al. (2009) |
| ADE20K | Image Segmentation | Zhou et al. (2019) |
| NYU Depth v2 | Depth Estimation | Silberman et al. (2012) |

Table 17: LVD-142M data sources. In contrast to LVD-142M, which relies on highly curated data sources drawn from distributions closely aligned with various downstream evaluation tasks (see the table below, from Oquab et al. (2023)), our data curation approach adopts the methodology from MetaCLIP (Xu et al., 2024b), utilizing web data collected from 15 snapshots of CommonCrawl (CC) spanning January 2021 through January 2023.

| Task | Dataset / Split | Images | Retrieval | Retrieved | Final |
|---|---|---|---|---|---|
| classification | ImageNet-22k / – | 14,197,086 | as is | – | 14,197,086 |
| classification | ImageNet-22k / – | 14,197,086 | sample | 56,788,344 | 56,788,344 |
| classification | ImageNet-1k / train | 1,281,167 | sample | 40,997,344 | 40,997,344 |
| fine-grained classif. | Caltech 101 / train | 3,030 | cluster | 2,630,000 | 1,000,000 |
| fine-grained classif. | CUB-200-2011 / train | 5,994 | cluster | 1,300,000 | 1,000,000 |
| fine-grained classif. | DTD / train | 11,880 | cluster | 1,580,000 | 1,000,000 |
| fine-grained classif. | FGVC-Aircraft / train | 3,334 | cluster | 1,170,000 | 1,000,000 |
| fine-grained classif. | Flowers-102 / train | 1,020 | cluster | 1,060,000 | 1,000,000 |
| fine-grained classif. | Food-101 / train | 75,750 | cluster | 21,670,000 | 1,000,000 |
| fine-grained classif. | Oxford-IIIT Pet / trainval | 3,680 | cluster | 2,750,000 | 1,000,000 |
| fine-grained classif. | Stanford Cars / train | 8,144 | cluster | 7,220,000 | 1,000,000 |
| fine-grained classif. | SUN397 / train | 119,850 | cluster | 18,950,000 | 1,000,000 |
| fine-grained classif. | Pascal VOC 2007 / train | 2,501 | cluster | 1,010,000 | 1,000,000 |
| segmentation | ADE20K / train | 20,210 | cluster | 20,720,000 | 1,000,000 |
| segmentation | Cityscapes / train | 2,975 | cluster | 1,390,000 | 1,000,000 |
| segmentation | Pascal VOC 2012 (seg.) / trainaug | 1,464 | cluster | 10,140,000 | 1,000,000 |
| depth estimation | Mapillary SLS / train | 1,434,262 | as is | – | 1,434,262 |
| depth estimation | KITTI / train (Eigen) | 23,158 | cluster | 3,700,000 | 1,000,000 |
| depth estimation | NYU Depth V2 / train | 24,231 | cluster | 10,850,000 | 1,000,000 |
| depth estimation | SUN RGB-D / train | 4,829 | cluster | 4,870,000 | 1,000,000 |
| retrieval | Google Landmarks v2 / train (clean) | 1,580,470 | as is | – | 1,580,470 |
| retrieval | Google Landmarks v2 / train (clean) | 1,580,470 | sample | 6,321,880 | 6,321,880 |
| retrieval | AmsterTime / new | 1,231 | cluster | 960,000 | 960,000 |
| retrieval | AmsterTime / old | 1,231 | cluster | 830,000 | 830,000 |
| retrieval | Met / train | 397,121 | cluster | 62,860,000 | 1,000,000 |
| retrieval | Revisiting Oxford / base | 4,993 | cluster | 3,680,000 | 1,000,000 |
| retrieval | Revisiting Paris / base | 6,322 | cluster | 3,660,000 | 1,000,000 |
| Total | | | | | 142,109,386 |

Web-MAE trained on MC-2B. Web-MAE also exhibits consistent scaling behavior as model size increases. Notably, Web-MAE demonstrates better performance in OCR & Chart tasks, achieving higher accuracy than Web-DINO across all model sizes.

Comparison of ImageNet-1k and MC-2B Pretraining. Increasing the diversity and scale of pretraining data improves model performance on VQA accuracy and ImageNet linear probing. Unlike MC-2B pretraining, training on ImageNet does not exhibit a clear scaling trend.

Performance of Web-DINO models on classic vision tasks. All models achieve strong performance across ImageNet-1k classification, ADE20K segmentation, and NYU Depth estimation, and all tasks experience moderate improvements from increasing model size from 1B to 7B parameters. Web-DINO outperforms MetaCLIP (HF) and is competitive with DINOv2 (HF). (HF) denotes the largest official Hugging Face released version.

Examples of filtered MC-2B images. The Light filter (Middle) identifies images containing text, retaining 50.3% of the images. The Heavy filter (Right) identifies images explicitly containing charts and documents, retaining only 1.3% of MC-2B.

Alignment score between Web-DINO and LLMs. Moving from DINOv2 to Web-DINO improves the alignment between the image and the corresponding text representations obtained by LLMs. Increasing model size from 1B to 7B parameters shows gradual improvement, while training on larger data quantities (4B/8B samples) yields the most significant alignment gains.

| Model | Avg | MME-P | MMB | SEED-I | GQA | SQA-I | MMMU-V | MathVista-M | AI2D | ChartQA | OCRBench | TextVQA | DocVQA | MMVP | RealWorldQA | CV-Bench 2D | CV-Bench 3D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Web-DINO ViT-7B (1B Data) | 51.02 | 1785.97 | 68.12 | 72.54 | 63.60 | 73.87 | 32.88 | 12.70 | 66.58 | 23.60 | 15.20 | 49.04 | 19.00 | 43.33 | 57.12 | 68.35 | 61.08 |
| Web-DINO ViT-7B (2B Data) | 53.87 | 1823.76 | 68.98 | 73.02 | 64.22 | 74.61 | 35.11 | 14.00 | 69.43 | 28.80 | 23.59 | 51.10 | 22.00 | 48.00 | 59.34 | 69.96 | 68.58 |
| Web-DINO ViT-7B (4B Data) | 54.37 | 1827.12 | 71.39 | 72.61 | 63.53 | 72.73 | 34.00 | 18.90 | 67.09 | 35.12 | 30.00 | 53.19 | 24.00 | 45.33 | 55.94 | 69.68 | 65.00 |
| Web-DINO ViT-7B (8B Data) | 55.24 | 1811.05 | 71.30 | 72.14 | 64.04 | 72.43 | 35.66 | 15.20 | 68.52 | 35.52 | 36.40 | 56.53 | 29.00 | 46.00 | 57.90 | 70.53 | 62.08 |


References

[1] Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee. (2023). Visual Instruction Tuning. NeurIPS.

[2] Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, Fei-Fei, Li. (2009). Imagenet: A large-scale hierarchical image database. CVPR.

[3] Xu, Hu, Xie, Saining, Tan, Xiaoqing Ellen, Huang, Po-Yao, Howes, Russell, Sharma, Vasu, Li, Shang-Wen, Ghosh, Gargi, Zettlemoyer, Luke, Feichtenhofer, Christoph. (2024). Demystifying clip data. ICLR.

[4] Fan, David, Wang, Jue, Liao, Shuai, Zhu, Yi, Bhat, Vimal, Santos-Villalobos, Hector, MV, Rohith, Li, Xinyu. (2023). Motion-guided masking for spatiotemporal representation learning. CVPR.

[5] Fang, Alex, Ilharco, Gabriel, Wortsman, Mitchell, Wan, Yuhao, Shankar, Vaishaal, Dave, Achal, Schmidt, Ludwig. (2022). Data determines distributional robustness in contrastive language image pre-training (clip). ICML.

[6] Sun, Chen, Shrivastava, Abhinav, Singh, Saurabh, Gupta, Abhinav. (2017). Revisiting unreasonable effectiveness of data in deep learning era. ICCV.

[7] Sutton, Richard. (2019). The bitter lesson. Incomplete Ideas (blog).

[8] Jose, Cijo, Moutakanni, Théo, et al. (2024). DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment. arXiv preprint arXiv:2412.16334.

[9] Zhai, Xiaohua, Wang, Xiao, Mustafa, Basil, Steiner, Andreas, Keysers, Daniel, Kolesnikov, Alexander, Beyer, Lucas. (2022). Lit: Zero-shot transfer with locked-image text tuning. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.


[18] Power, Alethea, Burda, Yuri, Edwards, Harri, Babuschkin, Igor, Misra, Vedant. (2022). Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177.

[19] Wang, Junke, Meng, Lingchen, Weng, Zejia, He, Bo, Wu, Zuxuan, Jiang, Yu-Gang. (2023). To see is to believe: Prompting gpt-4v for better visual instruction tuning. arXiv preprint arXiv:2311.07574.

[20] Zhang, Yanzhe, Zhang, Ruiyi, Gu, Jiuxiang, Zhou, Yufan, Lipka, Nedim, Yang, Diyi, Sun, Tong. (2023). Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107.

[21] Liu, Zhuang, Mao, Hanzi, Wu, Chao-Yuan, Feichtenhofer, Christoph, Darrell, Trevor, Xie, Saining. (2022). A convnet for the 2020s. CVPR.

[22] Masry, Ahmed, Long, Do Xuan, Tan, Jia Qing, Joty, Shafiq, Hoque, Enamul. (2022). Chartqa: A benchmark for question answering about charts with visual and logical reasoning. ACL.

[23] Mathew, Minesh, Karatzas, Dimosthenis, Jawahar, CV. (2021). Docvqa: A dataset for vqa on document images. WACV.

[24] Kafle, Kushal, Price, Brian, Cohen, Scott, Kanan, Christopher. (2018). Dvqa: Understanding data visualizations via question answering. CVPR.

[25] Acharya, Manoj, Kafle, Kushal, Kanan, Christopher. (2019). TallyQA: Answering complex counting questions. AAAI.

[26] Johnson, Justin, Hariharan, Bharath, Van Der Maaten, Laurens, Fei-Fei, Li, Lawrence Zitnick, C, Girshick, Ross. (2017). Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. CVPR.

[27] Tu, Haoqin, Cui, Chenhang, Wang, Zijun, Zhou, Yiyang, Zhao, Bingchen, Han, Junlin, Zhou, Wangchunshu, Yao, Huaxiu, Xie, Cihang. (2023). How many unicorns are in this image? a safety evaluation benchmark for vision llms. arXiv preprint arXiv:2311.16101.

[28] Gurari, Danna, Li, Qing, Stangl, Abigale J, Guo, Anhong, Lin, Chi, Grauman, Kristen, Luo, Jiebo, Bigham, Jeffrey P. (2018). Vizwiz grand challenge: Answering visual questions from blind people. CVPR.

[29] Zhang, Yuhui, McKinzie, Brandon, Gan, Zhe, Shankar, Vaishaal, Toshev, Alexander. (2023). Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation. EMNLP.

[30] Chen, Guiming Hardy, Chen, Shunian, Zhang, Ruifei, Chen, Junying, Wu, Xiangbo, Zhang, Zhiyi, Chen, Zhihong, Li, Jianquan, Wan, Xiang, Wang, Benyou. (2024). ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model. arXiv preprint arXiv:2402.11684.

[31] Lambon Ralph, Matthew A, Sage, Karen, Jones, Roy W, Mayberry, Emily J. (2010). Coherent concepts are computed in the anterior temporal lobes. Proceedings of the National Academy of Sciences.

[32] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. ICML.

[33] He, Kaiming, Fan, Haoqi, Wu, Yuxin, Xie, Saining, Girshick, Ross. (2019). Momentum Contrast for Unsupervised Visual Representation Learning. CVPR.

[34] Bardes, Adrien, Garrido, Quentin, Ponce, Jean, Chen, Xinlei, Rabbat, Michael, LeCun, Yann, Assran, Mahmoud, Ballas, Nicolas. (2024). Revisiting feature prediction for learning visual representations from video. TMLR.

[35] Hendrycks, Dan, Gimpel, Kevin. (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.

[36] Zhang, Ruohong, Gui, Liangke, Sun, Zhiqing, Feng, Yihao, Xu, Keyang, Zhang, Yuanhan, Fu, Di, Li, Chunyuan, Hauptmann, Alexander, Bisk, Yonatan, others. (2024). Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward. arXiv preprint arXiv:2404.01258.

[37] Loshchilov, I. (2019). Decoupled weight decay regularization. ICLR.

[38] Ba, Jimmy Lei, Kiros, Jamie Ryan, Hinton, Geoffrey E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.

[39] Preechakul, Konpat, Chatthee, Nattanat, Wizadwongsa, Suttisak, Suwajanakorn, Supasorn. (2022). Diffusion autoencoders: Toward a meaningful and decodable representation. CVPR.

[40] Pan, Xichen, Dong, Li, Huang, Shaohan, Peng, Zhiliang, Chen, Wenhu, Wei, Furu. (2024). Kosmos-g: Generating images in context with multimodal large language models. ICLR.

[41] Koh, Jing Yu, Fried, Daniel, Salakhutdinov, Russ R. (2024). Generating images with multimodal language models. NeurIPS.

[42] Rajbhandari, Samyam, Rasley, Jeff, Ruwase, Olatunji, He, Yuxiong. (2020). Zero: Memory optimizations toward training trillion parameter models. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.

[43] Heusel, Martin, Ramsauer, Hubert, Unterthiner, Thomas, Nessler, Bernhard, Hochreiter, Sepp. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. NeurIPS.

[44] Yue, Xiang, Zheng, Tianyu, Ni, Yuansheng, Wang, Yubo, Zhang, Kai, Tong, Shengbang, Sun, Yuxuan, Yin, Ming, Yu, Botao, Zhang, Ge, others. (2024). Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2409.02813.

[45] Pan, Jiayi, Zhang, Yichi, Tomlin, Nicholas, Zhou, Yifei, Levine, Sergey, Suhr, Alane. (2024). Autonomous evaluation and refinement of digital agents. COLM.

[46] Cha, Sungguk, Lee, Jusung, Lee, Younghyun, Yang, Cheoljong. (2024). Visually Dehallucinative Instruction Generation: Know What You Don't Know. arXiv preprint arXiv:2402.09717.

[47] Si, Chenglei, Zhang, Yanzhe, Yang, Zhengyuan, Liu, Ruibo, Yang, Diyi. (2024). Design2Code: How Far Are We From Automating Front-End Engineering?. arXiv preprint arXiv:2403.03163.

[48] Li, Lei, Wang, Yuqi, Xu, Runxin, Wang, Peiyi, Feng, Xiachong, Kong, Lingpeng, Liu, Qi. (2024). Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models. arXiv preprint arXiv:2403.00231.

[49] Wang, Ke, Pan, Junting, Shi, Weikang, Lu, Zimu, Zhan, Mingjie, Li, Hongsheng. (2024). Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset. arXiv preprint arXiv:2402.14804.

[50] Wu, Haoning, Zhang, Zicheng, Zhang, Erli, Chen, Chaofeng, Liao, Liang, Wang, Annan, Xu, Kaixin, Li, Chunyi, Hou, Jingwen, Zhai, Guangtao, others. (2023). Q-instruct: Improving low-level visual abilities for multi-modality foundation models. arXiv preprint arXiv:2311.06783.

[51] Kembhavi, Aniruddha, Salvato, Mike, Kolve, Eric, Seo, Minjoon, Hajishirzi, Hannaneh, Farhadi, Ali. (2016). A diagram is worth a dozen images. ECCV.

[52] LAION. (2023). laion/gpt4v-dataset.

[53] Hsiao, Yu-Chung, Zubach, Fedir, Wang, Maria, others. (2022). Screenqa: Large-scale question-answer pairs over mobile app screenshots. arXiv preprint arXiv:2209.08199.

[54] Lu, Pan, Mishra, Swaroop, Xia, Tanglin, Qiu, Liang, Chang, Kai-Wei, Zhu, Song-Chun, Tafjord, Oyvind, Clark, Peter, Kalyan, Ashwin. (2022). Learn to explain: Multimodal reasoning via thought chains for science question answering. NeurIPS.

[55] Gao, Jiahui, Pi, Renjie, Zhang, Jipeng, Ye, Jiacheng, Zhong, Wanjun, Wang, Yufei, Hong, Lanqing, Han, Jianhua, Xu, Hang, Li, Zhenguo, others. (2023). G-llava: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370.

[56] Kim, Geewook, Hong, Teakgyu, Yim, Moonbin, Park, Jinyoung, Yim, Jinyeong, Hwang, Wonseok, Yun, Sangdoo, Han, Dongyoon, Park, Seunghyun. (2022). Donut: Document understanding transformer without ocr. ECCV.

[57] Laurençon, Hugo, Tronchon, Léo, Sanh, Victor. (2024). Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset. arXiv preprint arXiv:2403.09029.

[58] Belouadi, Jonas, Lauscher, Anne, Eger, Steffen. (2024). Automatikz: Text-guided synthesis of scientific vector graphics with tikz. ICLR.

[59] Alawwad, Hessa Abdulrahman, Alhothali, Areej, Naseem, Usman, Alkhathlan, Ali, Jamal, Amani. (2024). Enhancing Textbook Question Answering Task with Large Language Models and Retrieval Augmented Generation. arXiv preprint arXiv:2402.05128.

[60] Lu, Pan, Gong, Ran, Jiang, Shibiao, Qiu, Liang, Huang, Siyuan, Liang, Xiaodan, Zhu, Song-Chun. (2021). Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. ACL.

[61] Zhang, Chi, Gao, Feng, Jia, Baoxiong, Zhu, Yixin, Zhu, Song-Chun. (2019). Raven: A dataset for relational and analogical visual reasoning. CVPR.

[62] Lu, Pan, Qiu, Liang, Chen, Jiaqi, Xia, Tony, Zhao, Yizhou, Zhang, Wei, Yu, Zhou, Liang, Xiaodan, Zhu, Song-Chun. (2021). Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. NeurIPS.

[63] Kazemi, Mehran, Alvari, Hamidreza, Anand, Ankit, Wu, Jialin, Chen, Xi, Soricut, Radu. (2023). Geomverse: A systematic evaluation of large models for geometric reasoning. arXiv preprint arXiv:2312.12241.

[64] Pasupat, Panupong, Liang, Percy. (2015). Compositional semantic parsing on semi-structured tables. ACL.

[65] Zhong, Victor, Xiong, Caiming, Socher, Richard. (2017). Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.

[66] Chen, Zhiyu, Chen, Wenhu, Smiley, Charese, Shah, Sameena, Borova, Iana, Langdon, Dylan, Moussa, Reema, Beane, Matt, Huang, Ting-Hao, Routledge, Bryan, others. (2021). Finqa: A dataset of numerical reasoning over financial data. EMNLP.

[67] Cheng, Zhoujun, Dong, Haoyu, Wang, Zhiruo, Jia, Ran, Guo, Jiaqi, Gao, Yan, Han, Shi, Lou, Jian-Guang, Zhang, Dongmei. (2022). HiTab: A hierarchical table dataset for question answering and natural language generation. ACL.

[68] Zhu, Fengbin, Lei, Wenqiang, Huang, Youcheng, Wang, Chao, Zhang, Shuo, Lv, Jiancheng, Feng, Fuli, Chua, Tat-Seng. (2021). TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. ACL.

[69] Lu, Pan, Qiu, Liang, Chang, Kai-Wei, Wu, Ying Nian, Zhu, Song-Chun, Rajpurohit, Tanmay, Clark, Peter, Kalyan, Ashwin. (2023). Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. ICLR.

[70] Kantharaj, Shankar, Leong, Rixie Tiffany Ko, Lin, Xiang, Masry, Ahmed, Thakkar, Megh, Hoque, Enamul, Joty, Shafiq. (2022). Chart-to-text: A large-scale benchmark for chart summarization. ACL.

[71] Tang, Benny J, Boggust, Angie, Satyanarayan, Arvind. (2023). Vistext: A benchmark for semantically rich chart captioning. arXiv preprint arXiv:2307.05356.

[72] Biten, Ali Furkan, Litman, Ron, Xie, Yusheng, Appalaraju, Srikar, Manmatha, R. (2022). Latr: Layout-aware transformer for scene-text vqa. CVPR.

[73] Biten, Ali Furkan, Tito, Ruben, Mafla, Andres, Gomez, Lluis, Rusiñol, Marçal, others. (2019). Scene text visual question answering. ICCV.

[74] Kiela, Douwe, Firooz, Hamed, Mohan, Aravind, Goswami, Vedanuj, Singh, Amanpreet, Ringshia, Pratik, Testuggine, Davide. (2020). The hateful memes challenge: Detecting hate speech in multimodal memes. NeurIPS.

[75] Wendler, Chris. (2023). wendlerc/RenderedText.

[76] Zhu, Yuke, Groth, Oliver, Bernstein, Michael, Fei-Fei, Li. (2016). Visual7w: Grounded question answering in images. CVPR.

[77] Tanaka, Ryota, Nishida, Kyosuke, Yoshida, Sen. (2021). VisualMRC: Machine Reading Comprehension on Document Images. AAAI.

[78] Shridhar, Mohit, Yuan, Xingdi, Côté, Marc-Alexandre, Bisk, Yonatan, Trischler, Adam, Hausknecht, Matthew. (2021). ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. ICLR.

[79] Pont-Tuset, Jordi, Uijlings, Jasper, Changpinyo, Soravit, Soricut, Radu, Ferrari, Vittorio. (2020). Connecting Vision and Language with Localized Narratives. ECCV.

[80] He, Xuehai, Zhang, Yichen, Mou, Luntian, Xing, Eric P., Xie, Pengtao. (2020). PathVQA: 30000+ Questions for Medical Visual Question Answering. CoRR.

[81] Chen, Lin, Li, Jisong, Dong, Xiaoyi, Zhang, Pan, He, Conghui, Wang, Jiaqi, Zhao, Feng, Lin, Dahua. (2023). Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793.

[82] Hudson, Drew A, Manning, Christopher D. (2019). GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. CVPR.

[83] Marino, Kenneth, Rastegari, Mohammad, Farhadi, Ali, Mottaghi, Roozbeh. (2019). OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. CVPR.

[84] Vishniakov, Kirill, Shen, Zhiqiang, Liu, Zhuang. (2024). ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy. ICML.

[85] Schwenk, Dustin, Khandelwal, Apoorv, Clark, Christopher, Marino, Kenneth, Mottaghi, Roozbeh. (2022). A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge. ECCV.

[86] Mishra, Anand, Shekhar, Shashank, Singh, Ajeet Kumar, Chakraborty, Anirban. (2019). OCR-VQA: Visual Question Answering by Reading Text in Images. ICDAR.

[87] Sidorov, Oleksii, Hu, Ronghang, Rohrbach, Marcus, Singh, Amanpreet. (2020). TextCaps: a Dataset for Image Captioning with Reading Comprehension. ECCV.

[88] Yu, Licheng, Poirson, Patrick, Yang, Shan, Berg, Alexander C, Berg, Tamara L. (2016). Modeling Context in Referring Expressions. ECCV.

[89] Team, Chameleon. (2024). Chameleon: Mixed-Modal Early-Fusion Foundation Models. arXiv preprint arXiv:2405.09818.

[90] Yu, Tianyu, Yao, Yuan, Zhang, Haoye, He, Taiwen, Han, Yifeng, Cui, Ganqu, Hu, Jinyi, Liu, Zhiyuan, Zheng, Hai-Tao, Sun, Maosong, others. (2023). Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. arXiv preprint arXiv:2312.00849.

[91] Li, Tianhong, Katabi, Dina, He, Kaiming. (2024). Return of unconditional generation: A self-supervised representation generation method. NeurIPS.

[92] Rafailov, Rafael, Sharma, Archit, Mitchell, Eric, Manning, Christopher D, Ermon, Stefano, Finn, Chelsea. (2024). Direct preference optimization: Your language model is secretly a reward model. NeurIPS.

[93] Zhu, Banghua, Frick, Evan, Wu, Tianhao, Zhu, Hanlin, Jiao, Jiantao. (2023). Starling-7b: Improving llm helpfulness & harmlessness with rlaif.

[94] He, Kaiming, Gkioxari, Georgia, Dollár, Piotr, Girshick, Ross. (2017). Mask r-cnn. ICCV.

[95] Ouyang, Long, Wu, Jeffrey, Jiang, Xu, Almeida, Diogo, Wainwright, Carroll, Mishkin, Pamela, Zhang, Chong, Agarwal, Sandhini, Slama, Katarina, Ray, Alex, others. (2022). Training language models to follow instructions with human feedback. NeurIPS.

[96] Dong, Hanze, Xiong, Wei, Pang, Bo, Wang, Haoxiang, Zhao, Han, Zhou, Yingbo, Jiang, Nan, Sahoo, Doyen, Xiong, Caiming, Zhang, Tong. (2024). Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863.

[97] Liu, Zhuang, He, Kaiming. (2025). A Decade's Battle on Dataset Bias: Are We There Yet?. ICLR.

[98] Woo, Sanghyun, Debnath, Shoubhik, Hu, Ronghang, Chen, Xinlei, Liu, Zhuang, Kweon, In So, Xie, Saining. (2023). Convnext v2: Co-designing and scaling convnets with masked autoencoders. CVPR.

[99] Yuksekgonul, Mert, Bianchi, Federico, Kalluri, Pratyusha, Jurafsky, Dan, Zou, James. (2022). When and why vision-language models behave like bags-of-words, and what to do about it?. ICLR.

[100] Chen, Zhe, Wang, Weiyun, Tian, Hao, Ye, Shenglong, Gao, Zhangwei, Cui, Erfei, Tong, Wenwen, Hu, Kongzhi, Luo, Jiapeng, Ma, Zheng, others. (2024). How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821.

[101] Tong, Shengbang, Jones, Erik, Steinhardt, Jacob. (2024). Mass-producing failures of multimodal systems with language models. NeurIPS.

[102] Krishna, Ranjay, Zhu, Yuke, Groth, Oliver, Johnson, Justin, Hata, Kenji, Kravitz, Joshua, Chen, Stephanie, Kalantidis, Yannis, Li, Li-Jia, Shamma, David A, Bernstein, Michael S, Fei-Fei, Li. (2016). Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. IJCV.

[103] Tong, Shengbang, Liu, Zhuang, Zhai, Yuexiang, Ma, Yi, LeCun, Yann, Xie, Saining. (2024). Eyes wide shut? exploring the visual shortcomings of multimodal llms. CVPR.

[104] Liu, Haotian, Li, Chunyuan, Li, Yuheng, Lee, Yong Jae. (2024). Improved baselines with visual instruction tuning. CVPR.

[105] McKinzie, Brandon, Gan, Zhe, Fauconnier, Jean-Philippe, Dodge, Sam, Zhang, Bowen, Dufter, Philipp, Shah, Dhruti, Du, Xianzhi, Peng, Futang, Weers, Floris, others. (2024). Mm1: Methods, analysis & insights from multimodal llm pre-training. arXiv preprint arXiv:2403.09611.

[106] Fang, Alex, Jose, Albin Madappally, Jain, Amit, Schmidt, Ludwig, Toshev, Alexander, Shankar, Vaishaal. (2024). Data filtering networks. ICLR.

[107] Gao, Peng, Zhang, Renrui, Liu, Chris, Qiu, Longtian, Huang, Siyuan, Lin, Weifeng, Zhao, Shitian, Geng, Shijie, Lin, Ziyi, Jin, Peng, others. (2024). SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models. arXiv preprint arXiv:2402.05935.

[108] Conover, Mike, Hayes, Matt, Mathur, Ankit, Xie, Jianwei, Wan, Jun, Shah, Sam, Ghodsi, Ali, Wendell, Patrick, Zaharia, Matei, Xin, Reynold. (2023). Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM.

[109] Yue, Xiang, Qu, Xingwei, Zhang, Ge, Fu, Yao, Huang, Wenhao, Sun, Huan, Su, Yu, Chen, Wenhu. (2024). Mammoth: Building math generalist models through hybrid instruction tuning. ICLR.

[110] Luo, Ziyang, Xu, Can, Zhao, Pu, Sun, Qingfeng, Geng, Xiubo, Hu, Wenxiang, Tao, Chongyang, Ma, Jing, Lin, Qingwei, Jiang, Daxin. (2024). Wizardcoder: Empowering code large language models with evol-instruct. ICLR.

[111] Mitra, Arindam, Khanpour, Hamed, Rosset, Corby, Awadallah, Ahmed. (2024). Orca-Math: Unlocking the potential of SLMs in Grade School Math.

[112] Zheng, Tianyu, Zhang, Ge, Shen, Tianhao, Liu, Xueling, Lin, Bill Yuchen, Fu, Jie, Chen, Wenhu, Yue, Xiang. (2024). OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement. arXiv preprint arXiv:2402.14658.

[113] Lian, Wing, Goodson, Bleys, Pentland, Eugene, Cook, Austin, Vong, Chanvichet, Teknium. (2023). OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces. HuggingFace repository.

[114] Radford, Alec, Kim, Jong Wook, Hallacy, Chris, Ramesh, Aditya, Goh, Gabriel, Agarwal, Sandhini, Sastry, Girish, Askell, Amanda, Mishkin, Pamela, Clark, Jack, others. (2021). Learning transferable visual models from natural language supervision. ICML.

[115] Schuhmann, Christoph, Beaumont, Romain, Vencu, Richard, Gordon, Cade, Wightman, Ross, Cherti, Mehdi, Coombes, Theo, Katta, Aarush, Mullis, Clayton, Wortsman, Mitchell, others. (2022). Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS.

[116] Zheng, Lianmin, Chiang, Wei-Lin, Sheng, Ying, Zhuang, Siyuan, Wu, Zhanghao, Zhuang, Yonghao, Lin, Zi, Li, Zhuohan, Li, Dacheng, Xing, Eric, others. (2024). Judging llm-as-a-judge with mt-bench and chatbot arena. NeurIPS.

[117] Chiang, Wei-Lin, Zheng, Lianmin, Sheng, Ying, Angelopoulos, Anastasios Nikolas, Li, Tianle, Li, Dacheng, Zhang, Hao, Zhu, Banghua, Jordan, Michael, Gonzalez, Joseph E, others. (2024). Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132.

[118] Zhai, Xiaohua, Mustafa, Basil, Kolesnikov, Alexander, Beyer, Lucas. (2023). Sigmoid loss for language image pre-training. ICCV.

[119] Sun, Quan, Fang, Yuxin, Wu, Ledell, Wang, Xinlong, Cao, Yue. (2023). Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389.

[120] Cherti, Mehdi, Beaumont, Romain, Wightman, Ross, Wortsman, Mitchell, Ilharco, Gabriel, Gordon, Cade, Schuhmann, Christoph, Schmidt, Ludwig, Jitsev, Jenia. (2023). Reproducible scaling laws for contrastive language-image learning. CVPR.

[121] He, Kaiming, Chen, Xinlei, Xie, Saining, Li, Yanghao, Dollár, Piotr, Girshick, Ross. (2022). Masked autoencoders are scalable vision learners. CVPR.

[122] Chen, Xinlei, Xie, Saining, He, Kaiming. (2021). An empirical study of training self-supervised vision transformers. ICCV.

[123] Oquab, Maxime, Darcet, Timothée, others. (2023). Dinov2: Learning robust visual features without supervision. TMLR.

[124] Cunningham, Hoagy, Ewart, Aidan, Riggs, Logan, Huben, Robert, Sharkey, Lee. (2024). Sparse autoencoders find highly interpretable features in language models. ICLR.

[125] Assran, Mahmoud, Duval, Quentin, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, LeCun, Yann, Ballas, Nicolas. (2023). Self-supervised learning from images with a joint-embedding predictive architecture. CVPR.

[126] Bar, Amir, Bordes, Florian, Shocher, Assaf, Assran, Mido, Vincent, Pascal, Ballas, Nicolas, Darrell, Trevor, Globerson, Amir, LeCun, Yann. (2024). Stochastic positional embeddings improve masked image modeling. ICML.

[127] Dosovitskiy, Alexey, Beyer, Lucas, Kolesnikov, Alexander, Weissenborn, Dirk, Zhai, Xiaohua, Unterthiner, Thomas, Dehghani, Mostafa, Minderer, Matthias, Heigold, Georg, Gelly, Sylvain, others. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. ICLR.

[128] Jouppi, Norm, Kurian, George, Li, Sheng, Ma, Peter, Nagarajan, Rahul, Nai, Lifeng, Patil, Nishant, Subramanian, Suvinay, Swing, Andy, Towles, Brian, others. (2023). Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. Proceedings of the 50th Annual International Symposium on Computer Architecture.

[129] Zhao, Yanli, Gu, Andrew, Varma, Rohan, Luo, Liang, Huang, Chien-Chin, Xu, Min, Wright, Less, Shojanazeri, Hamid, Ott, Myle, Shleifer, Sam, others. (2023). Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277.

[130] Zhou, Kun, Zhu, Yutao, Chen, Zhipeng, Chen, Wentong, Zhao, Wayne Xin, Chen, Xu, Lin, Yankai, Wen, Ji-Rong, Han, Jiawei. (2023). Don't Make Your LLM an Evaluation Benchmark Cheater. arXiv preprint arXiv:2311.01964.

[131] Kirillov, Alexander, Mintun, Eric, Ravi, Nikhila, Mao, Hanzi, Rolland, Chloe, Gustafson, Laura, Xiao, Tete, Whitehead, Spencer, Berg, Alexander C, Lo, Wan-Yen, others. (2023). Segment anything. ICCV.

[132] Birkl, Reiner, Wofk, Diana, Müller, Matthias. (2023). MiDaS v3.1 -- A model zoo for robust monocular relative depth estimation. arXiv preprint arXiv:2307.14460.

[133] Ranftl, René, Lasinger, Katrin, Hafner, David, Schindler, Konrad, Koltun, Vladlen. (2019). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. arXiv preprint arXiv:1907.01341.

[134] Rombach, Robin, Blattmann, Andreas, Lorenz, Dominik, Esser, Patrick, Ommer, Björn. (2022). High-Resolution Image Synthesis With Latent Diffusion Models. CVPR.

[135] Karamcheti, Siddharth, Nair, Suraj, Balakrishna, Ashwin, Liang, Percy, Kollar, Thomas, Sadigh, Dorsa. (2024). Prismatic vlms: Investigating the design space of visually-conditioned language models. arXiv preprint arXiv:2402.07865.

[136] Zhai, Yuexiang, Tong, Shengbang, Li, Xiao, Cai, Mu, Qu, Qing, Lee, Yong Jae, Ma, Yi. (2024). Investigating the catastrophic forgetting in multimodal large language models. CPAL.

[137] Li, Alexander C, Brown, Ellis, Efros, Alexei A, Pathak, Deepak. (2023). Internet Explorer: Targeted Representation Learning on the Open Web. ICML.

[138] Liu, Haotian, Li, Chunyuan, Li, Yuheng, Li, Bo, Zhang, Yuanhan, Shen, Sheng, Lee, Yong Jae. (2024). LLaVA-NeXT: Improved reasoning, OCR, and world knowledge.

[139] Lu, Haoyu, Liu, Wen, Zhang, Bo, Wang, Bingxuan, Dong, Kai, Liu, Bo, Sun, Jingxiang, Ren, Tongzheng, Li, Zhuoshu, Sun, Yaofeng, others. (2024). DeepSeek-VL: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525.

[140] Li, Alexander C, Prabhudesai, Mihir, Duggal, Shivam, Brown, Ellis, Pathak, Deepak. (2023). Your diffusion model is secretly a zero-shot classifier. ICCV.

[141] Chen, Xi, Wang, Xiao, Changpinyo, Soravit, Piergiovanni, AJ, Padlewski, Piotr, Salz, Daniel, Goodman, Sebastian, Grycner, Adam, Mustafa, Basil, Beyer, Lucas, others. (2023). Pali: A jointly-scaled multilingual language-image model. ICLR.

[142] Murtagh, Fionn, Legendre, Pierre. (2014). Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion?. Journal of classification.

[143] AI@Meta. (2024). Llama 3 Model Card.

[144] Google. (2023). Gemini.

[145] Bai, Jinze, Bai, Shuai, Chu, Yunfei, Cui, Zeyu, Dang, Kai, Deng, Xiaodong, Fan, Yang, Ge, Wenbin, Han, Yu, Huang, Fei, others. (2023). Qwen Technical Report. arXiv preprint arXiv:2309.16609.

[146] Bai, Jinze, Bai, Shuai, Yang, Shusheng, Wang, Shijie, Tan, Sinan, Wang, Peng, Lin, Junyang, Zhou, Chang, Zhou, Jingren. (2023). Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966.

[147] Dai, Wenliang, Li, Junnan, Li, Dongxu, Tiong, Anthony Meng Huat, Zhao, Junqi, Wang, Weisheng, Li, Boyang, Fung, Pascale N, Hoi, Steven. (2024). Instructblip: Towards general-purpose vision-language models with instruction tuning. NeurIPS.

[148] Liu, Yuliang, Li, Zhang, Li, Hongliang, Yu, Wenwen, Huang, Mingxin, Peng, Dezhi, Liu, Mingyu, Chen, Mingrui, Li, Chunyuan, Jin, Lianwen, others. (2023). On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895.

[149] Ge, Yuying, Ge, Yixiao, Zeng, Ziyun, Wang, Xintao, Shan, Ying. (2023). Planting a seed of vision in large language model. arXiv preprint arXiv:2307.08041.

[150] Wu, Penghao, Xie, Saining. (2024). V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs. CVPR.

[151] Jaegle, Andrew, Gimeno, Felix, Brock, Andy, Vinyals, Oriol, Zisserman, Andrew, Carreira, Joao. (2021). Perceiver: General perception with iterative attention. ICML.

[152] Young, Alex, Chen, Bei, Li, Chao, Huang, Chengen, Zhang, Ge, Zhang, Guanwei, Li, Heng, Zhu, Jiangcheng, Chen, Jianqun, Chang, Jing, others. (2024). Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652.

[153] Zhai, Yuexiang, Bai, Hao, Lin, Zipeng, Pan, Jiayi, Tong, Shengbang, Zhou, Yifei, Suhr, Alane, Xie, Saining, LeCun, Yann, Ma, Yi, others. (2024). Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning. NeurIPS.

[154] Lu, Pan, Bansal, Hritik, Xia, Tony, Liu, Jiacheng, Li, Chunyuan, Hajishirzi, Hannaneh, Cheng, Hao, Chang, Kai-Wei, Galley, Michel, Gao, Jianfeng. (2023). Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. ICLR.

[155] Liu, Yuan, Duan, Haodong, Zhang, Yuanhan, Li, Bo, Zhang, Songyang, Zhao, Wangbo, Yuan, Yike, Wang, Jiaqi, He, Conghui, Liu, Ziwei, others. (2024). Mmbench: Is your multi-modal model an all-around player?. ECCV.

[156] Alayrac, Jean-Baptiste, Donahue, Jeff, Luc, Pauline, Miech, Antoine, Barr, Iain, Hasson, Yana, Lenc, Karel, Mensch, Arthur, Millican, Katherine, Reynolds, Malcolm, others. (2022). Flamingo: a visual language model for few-shot learning. NeurIPS.

[157] Li, Runjia, Sun, Shuyang, Elhoseiny, Mohamed, Torr, Philip. (2023). OxfordTVG-HIC: Can Machine Make Humorous Captions from Images?. ICCV.

[158] Gadre, Samir Yitzhak, Ilharco, Gabriel, Fang, Alex, Hayase, Jonathan, Smyrnis, Georgios, Nguyen, Thao, Marten, Ryan, Wortsman, Mitchell, Ghosh, Dhruba, Zhang, Jieyu, others. (2024). Datacomp: In search of the next generation of multimodal datasets. NeurIPS.

[159] Banani, Mohamed El, Raj, Amit, Maninis, Kevis-Kokitsi, Kar, Abhishek, Li, Yuanzhen, Rubinstein, Michael, Sun, Deqing, Guibas, Leonidas, Johnson, Justin, Jampani, Varun. (2024). Probing the 3D Awareness of Visual Foundation Models. arXiv preprint arXiv:2404.08636.

[160] OpenAI. (2022). ChatGPT.

[161] Stability AI. (2024). Stable Diffusion 3.5.

[162] Raffel, Colin, Shazeer, Noam, Roberts, Adam, Lee, Katherine, Narang, Sharan, Matena, Michael, Zhou, Yanqi, Li, Wei, Liu, Peter J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR.

[163] Taori, Rohan, Gulrajani, Ishaan, Zhang, Tianyi, Dubois, Yann, Li, Xuechen, Guestrin, Carlos, Liang, Percy, Hashimoto, Tatsunori B. (2023). Alpaca: A Strong, Replicable Instruction-Following Model.

[164] Zhou, Chunting, Liu, Pengfei, Xu, Puxin, Iyer, Srinivasan, Sun, Jiao, Mao, Yuning, Ma, Xuezhe, Efrat, Avia, Yu, Ping, Yu, Lili, others. (2024). Lima: Less is more for alignment. NeurIPS.

[165] Sanseviero, Omar. (2022). LLM Evals and Benchmarking.

[166] Rajamanoharan, Senthooran, Conmy, Arthur, Smith, Lewis, Lieberum, Tom, Varma, Vikrant, Kramár, János, others. (2024). Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014.

[167] xAI. (2024). grok.

[168] Singh, Amanpreet, Natarajan, Vivek, Shah, Meet, Jiang, Yu, Chen, Xinlei, Batra, Dhruv, Parikh, Devi, Rohrbach, Marcus. (2019). Towards vqa models that can read. CVPR.

[169] Chang, Yupeng, Wang, Xu, Wang, Jindong, Wu, Yuan, Yang, Linyi, Zhu, Kaijie, Chen, Hao, Yi, Xiaoyuan, Wang, Cunxiang, Wang, Yidong, others. (2024). A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology.

[170] Sun, Quan, Cui, Yufeng, Zhang, Xiaosong, Zhang, Fan, Yu, Qiying, Wang, Yueze, Rao, Yongming, Liu, Jingjing, Huang, Tiejun, Wang, Xinlong. (2024). Generative multimodal models are in-context learners. CVPR.

[171] Sun, Quan, Yu, Qiying, Cui, Yufeng, Zhang, Fan, Zhang, Xiaosong, Wang, Yueze, Gao, Hongcheng, Liu, Jingjing, Huang, Tiejun, Wang, Xinlong. (2024). Generative pretraining in multimodality. ICLR.

[172] Dong, Runpei, Han, Chunrui, Peng, Yuang, Qi, Zekun, Ge, Zheng, Yang, Jinrong, Zhao, Liang, Sun, Jianjian, Zhou, Hongyu, Wei, Haoran, others. (2024). Dreamllm: Synergistic multimodal comprehension and creation. ICLR.

[173] Shao, Hao, Qian, Shengju, Xiao, Han, Song, Guanglu, Zong, Zhuofan, Wang, Letian, Liu, Yu, Li, Hongsheng. (2024). Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. NeurIPS.

[174] Miech, Antoine, Zhukov, Dimitri, Alayrac, Jean-Baptiste, Tapaswi, Makarand, Laptev, Ivan, Sivic, Josef. (2019). Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. ICCV.

[175] Wang, Xinlong, Zhang, Xiaosong, Luo, Zhengxiong, Sun, Quan, Cui, Yufeng, Wang, Jinsheng, Zhang, Fan, Wang, Yueze, Li, Zhen, Yu, Qiying, others. (2024). Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869.

[176] Kar, Oğuzhan Fatih, Tonioni, Alessio, Poklukar, Petra, Kulshrestha, Achin, Zamir, Amir, Tombari, Federico. (2024). BRAVE: Broadening the visual encoding of vision-language models. ECCV.

[177] Laurençon, Hugo, Saulnier, Lucile, Tronchon, Léo, others. (2024). Obelics: An open web-scale filtered dataset of interleaved image-text documents. NeurIPS.

[178] Li, Kunchang, Wang, Yali, He, Yinan, Li, Yizhuo, Wang, Yi, Liu, Yi, Wang, Zun, Xu, Jilan, Chen, Guo, Luo, Ping, others. (2024). Mvbench: A comprehensive multi-modal video understanding benchmark. CVPR.

[179] Goyal, Raghav, Ebrahimi Kahou, Samira, Michalski, Vincent, Materzynska, Joanna, Westphal, Susanne, Kim, Heuna, Haenel, Valentin, Fruend, Ingo, Yianilos, Peter, Mueller-Freitag, Moritz, others. (2017). The "something something" video database for learning and evaluating visual common sense. ICCV.

[180] Zohar, Orr, Wang, Xiaohan, Bitton, Yonatan, Szpektor, Idan, Yeung-levy, Serena. (2024). Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision. arXiv preprint arXiv:2407.06189.

[181] OpenAI. (2024). gpt4o.

[182] Anthropic. (2024). Claude.

[183] Touvron, Hugo, Lavril, Thibaut, Izacard, Gautier, Martinet, Xavier, Lachaux, Marie-Anne, Lacroix, Timothée, others. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.

[184] Touvron, Hugo, Martin, Louis, Stone, Kevin, Albert, Peter, Almahairi, Amjad, Babaei, Yasmine, Bashlykov, Nikolay, Batra, Soumya, Bhargava, Prajjwal, Bhosale, Shruti, others. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288.

[185] Li, Bo, Zhang, Kaichen, Zhang, Hao, Guo, Dong, Zhang, Renrui, Li, Feng, Zhang, Yuanhan, Liu, Ziwei, Li, Chunyuan. (2024). LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild.

[186] Yue, Xiang, Ni, Yuansheng, Zhang, Kai, Zheng, Tianyu, Liu, Ruoqi, Zhang, Ge, Stevens, Samuel, Jiang, Dongfu, Ren, Weiming, Sun, Yuxuan, others. (2024). Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. CVPR.

[187] Hiippala, Tuomo, Alikhani, Malihe, Haverinen, Jonas, Kalliokoski, Timo, Logacheva, Evanfiya, Orekhova, Serafina, Tuomainen, Aino, Stone, Matthew, Bateman, John A. (2021). AI2D-RST: A multimodal corpus of 1000 primary school science diagrams. Language Resources and Evaluation.

[188] Brazil, Garrick, Kumar, Abhinav, Straub, Julian, Ravi, Nikhila, Johnson, Justin, Gkioxari, Georgia. (2023). Omni3d: A large benchmark and model for 3d object detection in the wild. CVPR.

[189] Zhou, Bolei, Zhao, Hang, Puig, Xavier, Xiao, Tete, Fidler, Sanja, Barriuso, Adela, Torralba, Antonio. (2019). Semantic understanding of scenes through the ade20k dataset. IJCV.

[190] Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollár, Piotr, Zitnick, C Lawrence. (2014). Microsoft coco: Common objects in context. ECCV.

[191] Fu, Xingyu, Hu, Yushi, Li, Bangzheng, Feng, Yu, Wang, Haoyu, Lin, Xudong, Roth, Dan, Smith, Noah A, Ma, Wei-Chiu, Krishna, Ranjay. (2024). BLINK: Multimodal Large Language Models Can See but Not Perceive. arXiv preprint arXiv:2404.12390.

[192] Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, others. (2015). Imagenet large scale visual recognition challenge. IJCV.

[193] Thomas Aquinas. Quaestiones Disputatae de Veritate.

[194] Aristotle. Metaphysics.

[195] Parker, Andrew. (2003). In the blink of an eye: how vision sparked the big bang of evolution.

[196] Chalmers, David J. (2023). Does Thought Require Sensory Grounding? From Pure Thinkers to Large Language Models. Proceedings and Addresses of the American Philosophical Association.

[197] Piaget, Jean, Cook, Margaret, others. (1952). The origins of intelligence in children.

[198] Hoffmann, Jordan, Borgeaud, Sebastian, Mensch, Arthur, Buchatskaya, Elena, Cai, Trevor, Rutherford, Eliza, Casas, Diego de Las, Hendricks, Lisa Anne, Welbl, Johannes, Clark, Aidan, others. (2022). Training compute-optimal large language models. NeurIPS.

[199] Brown, Tom, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared D, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, others. (2020). Language models are few-shot learners. NeurIPS.

[200] Laurençon, Hugo, Tronchon, Léo, Cord, Matthieu, Sanh, Victor. (2024). What matters when building vision-language models?. arXiv preprint arXiv:2405.02246.

[201] Girshick, Ross, Donahue, Jeff, Darrell, Trevor, Malik, Jitendra. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR.

[202] Mathew, Minesh, Bagal, Viraj, Tito, Rubèn, Karatzas, Dimosthenis, Valveny, Ernest, Jawahar, CV. (2022). Infographicvqa. WACV.

[203] Chen, Lin, Li, Jinsong, Dong, Xiaoyi, Zhang, Pan, Zang, Yuhang, Chen, Zehui, Duan, Haodong, Wang, Jiaqi, Qiao, Yu, Lin, Dahua, others. (2024). Are We on the Right Way for Evaluating Large Vision-Language Models?. arXiv preprint arXiv:2403.20330.

[204] Wu, Chengyue, Chen, Xiaokang, Wu, Zhiyu, Ma, Yiyang, Liu, Xingchao, Pan, Zizheng, Liu, Wen, Xie, Zhenda, Yu, Xingkai, Ruan, Chong, others. (2024). Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848.

[205] Huh, Minyoung, Cheung, Brian, Wang, Tongzhou, Isola, Phillip. (2024). The platonic representation hypothesis. ICML.

[206] Yu, Sihyun, Kwak, Sangkyung, Jang, Huiwon, Jeong, Jongheon, Huang, Jonathan, Shin, Jinwoo, Xie, Saining. (2024). Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think. arXiv preprint arXiv:2410.06940.

[207] Agrawal, Pravesh, Antoniak, Szymon, Hanna, Emma Bou, Chaplot, Devendra, Chudnovsky, Jessica, Garg, Saurabh, Gervet, Theophile, Ghosh, Soham, others. (2024). Pixtral 12B. arXiv preprint arXiv:2410.07073.

[208] Lu, Jiasen, Clark, Christopher, Lee, Sangho, Zhang, Zichen, Khosla, Savya, Marten, Ryan, Hoiem, Derek, Kembhavi, Aniruddha. (2024). Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action. CVPR.

[209] Aghajanyan, Armen, Huang, Bernie, Ross, Candace, Karpukhin, Vladimir, Xu, Hu, Goyal, Naman, Okhonko, Dmytro, Joshi, Mandar, Ghosh, Gargi, Lewis, Mike, others. (2022). Cm3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520.

[210] Lu, Jiasen, Clark, Christopher, Zellers, Rowan, Mottaghi, Roozbeh, Kembhavi, Aniruddha. (2022). Unified-io: A unified model for vision, language, and multi-modal tasks. ICLR.

[211] Agrawal, Aishwarya, Batra, Dhruv, Parikh, Devi, Kembhavi, Aniruddha. (2018). Don't just assume; look and answer: Overcoming priors for visual question answering. CVPR.

[212] Chen, Lin, Wei, Xilin, Li, Jinsong, Dong, Xiaoyi, Zhang, Pan, Zang, Yuhang, Chen, Zehui, Duan, Haodong, Lin, Bin, Tang, Zhenyu, others. (2024). Sharegpt4video: Improving video understanding and generation with better captions. NeurIPS.

[213] Krojer, Benno, Vattikonda, Dheeraj, Lara, Luis, Jampani, Varun, Portelance, Eva, Pal, Christopher, Reddy, Siva. (2024). Learning Action and Reasoning-Centric Image Editing from Videos and Simulations. NeurIPS.

[214] Hessel, Jack, Holtzman, Ari, Forbes, Maxwell, Bras, Ronan Le, Choi, Yejin. (2021). Clipscore: A reference-free evaluation metric for image captioning. EMNLP.

[215] Brooks, Tim, Holynski, Aleksander, Efros, Alexei A. (2023). Instructpix2pix: Learning to follow image editing instructions. CVPR.

[216] Goyal, Yash, Khot, Tejas, Summers-Stay, Douglas, Batra, Dhruv, Parikh, Devi. (2017). Making the v in vqa matter: Elevating the role of image understanding in visual question answering. CVPR.

[217] Allen-Zhu, Zeyuan. (2024). ICML 2024 Tutorial: Physics of Language Models.

[218] Ye, Tian, Xu, Zicheng, Li, Yuanzhi, Allen-Zhu, Zeyuan. Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process. ArXiv e-prints.

[219] Majumdar, Arjun, Ajay, Anurag, Zhang, Xiaohan, Putta, Pranav, Yenamandra, Sriram, Henaff, Mikael, Silwal, Sneha, Mcvay, Paul, Maksymets, Oleksandr, Arnaud, Sergio, others. (2024). OpenEQA: Embodied Question Answering in the Era of Foundation Models. 2nd Workshop on Mobile Manipulation and Embodied Intelligence at ICRA 2024.

[220] Li, Yanwei, Zhang, Yuechen, Wang, Chengyao, Zhong, Zhisheng, Chen, Yixin, Chu, Ruihang, Liu, Shaoteng, Jia, Jiaya. (2024). Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814.

[221] Geirhos, Robert, Jacobsen, Jörn-Henrik, Michaelis, Claudio, Zemel, Richard, Brendel, Wieland, Bethge, Matthias, Wichmann, Felix A. (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence.

[222] Wei, Jason, Wang, Xuezhi, Schuurmans, Dale, Bosma, Maarten, Xia, Fei, Chi, Ed, Le, Quoc V, Zhou, Denny, others. (2022). Chain-of-thought prompting elicits reasoning in large language models. NeurIPS.

[223] Andreas Geiger, Philip Lenz, Raquel Urtasun. (2012). Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. CVPR.

[224] Caesar, Holger, Bankiti, Varun, Lang, Alex H, Vora, Sourabh, Liong, Venice Erin, Xu, Qiang, Krishnan, Anush, Pan, Yu, Baldan, Giancarlo, Beijbom, Oscar. (2020). nuscenes: A multimodal dataset for autonomous driving. CVPR.

[225] Song, Shuran, Lichtenberg, Samuel P, Xiao, Jianxiong. (2015). Sun rgb-d: A rgb-d scene understanding benchmark suite. CVPR.

[226] Baruch, Gilad, Chen, Zhuoyuan, Dehghan, Afshin, Dimry, Tal, Feigin, Yuri, Fu, Peter, Gebauer, Thomas, Joffe, Brandon, Kurz, Daniel, Schwartz, Arik, Shulman, Elad. (2021). ARKitScenes: A diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data. NeurIPS.

[227] Roberts, Mike, Ramapuram, Jason, Ranjan, Anurag, Kumar, Atulit, Bautista, Miguel Angel, Paczan, Nathan, Webb, Russ, Susskind, Joshua M. (2021). Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. ICCV.

[228] Ahmadyan, Adel, Zhang, Liangkai, Ablavatski, Artsiom, Wei, Jianing, Grundmann, Matthias. (2021). Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations. CVPR.

[229] Wang, Peng, Bai, Shuai, Tan, Sinan, Wang, Shijie, Fan, Zhihao, Bai, Jinze, Chen, Keqin, Liu, Xuejing, Wang, Jialin, Ge, Wenbin, others. (2024). Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191.

[230] Li, Bo, Zhang, Yuanhan, Guo, Dong, Zhang, Renrui, Li, Feng, Zhang, Hao, Zhang, Kaichen, Li, Yanwei, Liu, Ziwei, Li, Chunyuan. (2024). Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326.

[231] Tong, Shengbang, Brown, Ellis, Wu, Penghao, Woo, Sanghyun, Middepogu, Manoj, Akula, Sai Charitha, Yang, Jihan, Yang, Shusheng, Iyer, Adithya, Pan, Xichen, others. (2024). Cambrian-1: A fully open, vision-centric exploration of multimodal llms. NeurIPS.

[232] Li, Junnan, Li, Dongxu, Savarese, Silvio, Hoi, Steven. (2023). Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ICML.

[233] Zhou, Chunting, Yu, Lili, Babu, Arun, Tirumala, Kushal, Yasunaga, Michihiro, Shamis, Leonid, Kahn, Jacob, Ma, Xuezhe, Zettlemoyer, Luke, Levy, Omer. (2024). Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039.

[234] Wu, Yecheng, Zhang, Zhuoyang, Chen, Junyu, Tang, Haotian, Li, Dacheng, Fang, Yunhao, Zhu, Ligeng, Xie, Enze, Yin, Hongxu, Yi, Li, others. (2024). Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429.

[235] Xie, Jinheng, Mao, Weijia, Bai, Zechen, Zhang, David Junhao, Wang, Weihao, Lin, Kevin Qinghong, Gu, Yuchao, Chen, Zhijie, Yang, Zhenheng, Shou, Mike Zheng. (2024). Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528.

[236] Baddeley, Alan. (1992). Working memory. Science.

[237] Amit, Elinor, Hoeflin, Caitlyn, Hamzah, Nada, Fedorenko, Evelina. (2017). An asymmetrical relationship between verbal and visual thinking: Converging evidence from behavior and fMRI. NeuroImage.

[238] Paivio, Allan. (1990). Mental representations: A dual coding approach.

[239] Ganis, Giorgio, Thompson, William L, Kosslyn, Stephen M. (2004). Brain areas underlying visual mental imagery and visual perception: an fMRI study. Cognitive Brain Research.

[240] LeCun, Yann. (2022). A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. Open Review.

[241] Amit, Elinor, Algom, Daniel, Trope, Yaacov. (2009). Distance-dependent processing of pictures and words.. Journal of Experimental Psychology: General.

[242] Amit, Elinor, Wakslak, Cheryl, Trope, Yaacov. (2013). The use of visual and verbal means of communication across psychological distance. Personality and Social Psychology Bulletin.

[243] Ormazabal, Aitor, Zheng, Che, d'Autume, Cyprien de Masson, Yogatama, Dani, Fu, Deyu, Ong, Donovan, Chen, Eric, Lamprecht, Eugenie, Pham, Hai, Ong, Isaac, others. (2024). Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models. arXiv preprint arXiv:2404.12387.

[244] Chowdhery, Aakanksha, Narang, Sharan, Devlin, Jacob, Bosma, Maarten, Mishra, Gaurav, Roberts, Adam, Barham, Paul, Chung, Hyung Won, Sutton, Charles, Gehrmann, Sebastian, others. (2022). Palm: Scaling language modeling with pathways. arXiv 2022. arXiv preprint arXiv:2204.02311.

[245] Liu, Hao, Yan, Wilson, Zaharia, Matei, Abbeel, Pieter. (2024). World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268.

[246] Tschannen, Michael, Kumar, Manoj, Steiner, Andreas, Zhai, Xiaohua, Houlsby, Neil, Beyer, Lucas. (2024). Image captioners are scalable vision learners too. NeurIPS.

[247] Fini, Enrico, Shukor, Mustafa, Li, Xiujun, Dufter, Philipp, Klein, Michal, Haldimann, David, Aitharaju, Sai, da Costa, Victor Guilherme Turrisi, others. (2024). Multimodal autoregressive pre-training of large vision encoders. arXiv preprint arXiv:2411.14402.

[248] LeCun, Yann. (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.

[249] Wang, Xiao, Alabdulmohsin, Ibrahim, Salz, Daniel, Li, Zhe, Rong, Keran, Zhai, Xiaohua. (2025). Scaling Pre-training to One Hundred Billion Data for Vision Language Models. arXiv preprint arXiv:2502.07617.

[250] Doersch, Carl, Gupta, Abhinav, Efros, Alexei A. (2015). Unsupervised visual representation learning by context prediction. ICCV.

[251] Misra, Ishan, Van Der Maaten, Laurens. (2020). Self-supervised learning of pretext-invariant representations. CVPR.

[252] Garrido, Quentin, Chen, Yubei, Bardes, Adrien, Najman, Laurent, Lecun, Yann. (2023). On the duality between contrastive and non-contrastive self-supervised learning. ICLR.

[253] Chen, Yubei, Bardes, Adrien, Li, Zengyi, LeCun, Yann. (2022). Bag of image patch embedding behind the success of self-supervised learning. arXiv preprint arXiv:2206.08954.

[254] Zhou, Jinghao, Wei, Chen, Wang, Huiyu, Shen, Wei, Xie, Cihang, Yuille, Alan, Kong, Tao. (2021). ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832.

[255] Carreira, João, Gokay, Dilara, King, Michael, Zhang, Chuhan, Rocco, Ignacio, Mahendran, Aravindh, Keck, Thomas Albert, Heyward, Joseph, Koppula, Skanda, Pot, Etienne, others. (2024). Scaling 4D representations. arXiv preprint arXiv:2412.15212.

[256] Wei, Chen, Fan, Haoqi, Xie, Saining, Wu, Chao-Yuan, Yuille, Alan, Feichtenhofer, Christoph. (2022). Masked feature prediction for self-supervised visual pre-training. CVPR.

[257] Hendrycks, Dan, Zhao, Kevin, Basart, Steven, Steinhardt, Jacob, Song, Dawn. (2021). Natural adversarial examples. CVPR.

[258] Hendrycks, Dan, Basart, Steven, Mu, Norman, Kadavath, Saurav, Wang, Frank, Dorundo, Evan, Desai, Rahul, Zhu, Tyler, Parajuli, Samyak, Guo, Mike, others. (2021). The many faces of robustness: A critical analysis of out-of-distribution generalization. ICCV.

[259] Bossard, Lukas, Guillaumin, Matthieu, Van Gool, Luc. (2014). Food-101--mining discriminative components with random forests. ECCV.

[260] Wang, Xiaolong, Gupta, Abhinav. (2015). Unsupervised learning of visual representations using videos. ICCV.

[261] Cordts, Marius, Omran, Mohamed, Ramos, Sebastian, Rehfeld, Timo, Enzweiler, Markus, Benenson, Rodrigo, Franke, Uwe, Roth, Stefan, Schiele, Bernt. (2016). The cityscapes dataset for semantic urban scene understanding. CVPR.

[262] Everingham, Mark, Van Gool, Luc, Williams, Christopher KI, Winn, John, Zisserman, Andrew. (2010). The pascal visual object classes (voc) challenge. IJCV.

[263] Shi, Baifeng, Wu, Ziyang, Mao, Maolin, Wang, Xin, Darrell, Trevor. (2024). When do we not need larger vision models?. ECCV.

[264] Geiger, Andreas, Lenz, Philip, Stiller, Christoph, Urtasun, Raquel. (2013). Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research.

[265] Mo, Shentong, Tong, Peter. (2024). Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning. NeurIPS.

[266] Chen, Xinlei, Fan, Haoqi, Girshick, Ross, He, Kaiming. (2020). Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.

[267] Tao, Muzi, Xie, Saining. (2024). What does a visual formal analysis of the world's 500 most famous paintings tell us about multimodal LLMs?. The Second Tiny Papers Track at ICLR 2024.

[268] Shi, Min, Liu, Fuxiao, Wang, Shihao, Liao, Shijia, Radhakrishnan, Subhashree, Huang, De-An, Yin, Hongxu, Sapra, Karan, Yacoob, Yaser, Shi, Humphrey, others. (2024). Eagle: Exploring the design space for multimodal llms with mixture of encoders. arXiv preprint arXiv:2408.15998.

[269] Soomro, Khurram, Zamir, Amir Roshan, Shah, Mubarak. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.

[270] Xu, Hu, Huang, Po-Yao, Tan, Xiaoqing Ellen, Yeh, Ching-Feng, Kahn, Jacob, Jou, Christine, Ghosh, Gargi, Levy, Omer, Zettlemoyer, Luke, Yih, Wen-tau, others. (2024). Altogether: Image Captioning via Re-aligning Alt-text. arXiv preprint arXiv:2410.17251.

[271] Beyer, Lucas, Steiner, Andreas, Pinto, André Susano, others. (2024). Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726.

[272] Tong, Shengbang, Fan, David, Zhu, Jiachen, Xiong, Yunyang, Chen, Xinlei, Sinha, Koustuv, Rabbat, Michael, LeCun, Yann, Xie, Saining, Liu, Zhuang. (2024). Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164.

[273] Wang, Chenyu, Gupta, Sharut, Zhang, Xinyi, Tonekaboni, Sana, Jegelka, Stefanie, Jaakkola, Tommi, Uhler, Caroline. (2024). An Information Criterion for Controlled Disentanglement of Multimodal Data. arXiv preprint arXiv:2410.23996.

[274] Srinivasan, Krishna, Raman, Karthik, Chen, Jiecao, Bendersky, Michael, Najork, Marc. (2021). Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning. Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval.

[275] Noroozi, Mehdi, Favaro, Paolo. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. ECCV.

[276] Zhang, Richard, Isola, Phillip, Efros, Alexei A. (2016). Colorful image colorization. ECCV.

[277] Gidaris, Spyros, Singh, Praveer, Komodakis, Nikos. (2018). Unsupervised representation learning by predicting image rotations. ICLR.

[278] Caron, Mathilde, Touvron, Hugo, Misra, Ishan, Jégou, Hervé, Mairal, Julien, Bojanowski, Piotr, Joulin, Armand. (2021). Emerging properties in self-supervised vision transformers. ICCV.

[279] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Deep Residual Learning for Image Recognition. CVPR.

[280] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He. (2016). Aggregated Residual Transformations for Deep Neural Networks. arXiv preprint arXiv:1611.05431.

[281] Krizhevsky, Alex, Hinton, Geoffrey, others. (2009). Learning multiple layers of features from tiny images.

[282] Goyal, Priya, Mahajan, Dhruv, Gupta, Abhinav, Misra, Ishan. (2019). Scaling and benchmarking self-supervised visual representation learning. ICCV.

[283] Thomee, Bart, Shamma, David A, Friedland, Gerald, Elizalde, Benjamin, Ni, Karl, Poland, Douglas, Borth, Damian, Li, Li-Jia. (2016). Yfcc100m: The new data in multimedia research. Communications of the ACM.

[284] Zhai, Xiaohua, Kolesnikov, Alexander, Houlsby, Neil, Beyer, Lucas. (2022). Scaling vision transformers. CVPR.

[285] Allal, Loubna Ben, Lozhkov, Anton, Bakouch, Elie, others. (2025). SmolLM2: When Smol Goes Big--Data-Centric Training of a Small Language Model. arXiv preprint arXiv:2502.02737.

[286] Kaplan, Jared, McCandlish, Sam, Henighan, Tom, Brown, Tom B, Chess, Benjamin, Child, Rewon, Gray, Scott, Radford, Alec, Wu, Jeffrey, Amodei, Dario. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

[287] Dehghani, Mostafa, Djolonga, Josip, Mustafa, Basil, Padlewski, Piotr, Heek, Jonathan, Gilmer, Justin, Steiner, Andreas Peter, Caron, Mathilde, Geirhos, Robert, Alabdulmohsin, Ibrahim, others. (2023). Scaling vision transformers to 22 billion parameters. ICML.

[288] Grill, Jean-Bastien, Strub, Florian, Altché, Florent, others. (2020). Bootstrap your own latent-a new approach to self-supervised learning. NeurIPS.

[289] Chen, Xinlei, He, Kaiming. (2021). Exploring simple siamese representation learning. CVPR.

[290] Naeem, Muhammad Ferjad, Xian, Yongqin, Zhai, Xiaohua, Hoyer, Lukas, Van Gool, Luc, Tombari, Federico. (2024). Silc: Improving vision language pretraining with self-distillation. ECCV.

[291] Singh, Mannat, Duval, Quentin, Alwala, Kalyan Vasudev, Fan, Haoqi, Aggarwal, Vaibhav, Adcock, Aaron, Joulin, Armand, Dollár, Piotr, others. (2023). The effectiveness of MAE pre-pretraining for billion-scale pretraining. ICCV.

[292] Silberman, Nathan, Hoiem, Derek, Kohli, Pushmeet, Fergus, Rob. (2012). Indoor segmentation and support inference from rgbd images. ECCV.

[293] Sun, Quan, Wang, Jinsheng, Yu, Qiying, Cui, Yufeng, Zhang, Fan, Zhang, Xiaosong, Wang, Xinlong. (2024). Eva-clip-18b: Scaling clip to 18 billion parameters. arXiv preprint arXiv:2402.04252.

[294] Chen, Zhe, Wu, Jiannan, Wang, Wenhai, Su, Weijie, Chen, Guo, Xing, Sen, Zhong, Muyan, Zhang, Qinglong, Zhu, Xizhou, Lu, Lewei, others. (2024). Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. CVPR.

[295] Fu, Chaoyou, Chen, Peixian, Shen, Yunhang, Qin, Yulei, Zhang, Mengdan, Lin, Xu, Qiu, Zhenyu, Lin, Wei, Yang, Jinrui, Zheng, Xiawu, others. (2023). MME: a comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394.

[296] Bai, Yutong, Geng, Xinyang, Mangalam, Karttikeya, Bar, Amir, Yuille, Alan L, Darrell, Trevor, Malik, Jitendra, Efros, Alexei A. (2024). Sequential modeling enables scalable learning for large vision models. CVPR.

[297] Wei, Jason, Bosma, Maarten, Zhao, Vincent Y, Guu, Kelvin, Yu, Adams Wei, Lester, Brian, Du, Nan, Dai, Andrew M, Le, Quoc V. (2022). Finetuned language models are zero-shot learners. ICLR.

[298] Florian Bordes, Randall Balestriero, Pascal Vincent. (2022). High Fidelity Visualization of What Your Self-Supervised Representation Knows About. TMLR.

[299] Wadekar, Shakti N, Chaurasia, Abhishek, Chadha, Aman, Culurciello, Eugenio. (2024). The evolution of multimodal model architectures. arXiv preprint.

[300] Luo, Grace, Darrell, Trevor, Bar, Amir. (2024). Task Vectors are Cross-Modal. arXiv preprint arXiv:2410.22330.

[301] Wan, Bo, Tschannen, Michael, Xian, Yongqin, Pavetic, Filip, Alabdulmohsin, Ibrahim, Wang, Xiao, Pinto, André Susano, others. (2024). LocCa: Visual Pretraining with Location-aware Captioners. arXiv preprint arXiv:2403.19596.

[302] Thasarathan, Harrish, Forsyth, Julian, Fel, Thomas, Kowal, Matthew, Derpanis, Konstantinos. (2025). Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment. arXiv preprint arXiv:2502.03714.

[303] Ridnik, Tal, Ben-Baruch, Emanuel, Noy, Asaf, Zelnik-Manor, Lihi. (2021). Imagenet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972.

[304] Balestriero, Randall, Ibrahim, Mark, Sobal, Vlad, Morcos, Ari, Shekhar, Shashank, Goldstein, Tom, Bordes, Florian, Bardes, Adrien, Mialon, Gregoire, Tian, Yuandong, others. (2023). A cookbook of self-supervised learning. arXiv preprint arXiv:2304.12210.

[305] Carreira, João, Gokay, Dilara, King, Michael, Zhang, Chuhan, Rocco, Ignacio, Mahendran, Aravindh, Keck, Thomas Albert, Heyward, Joseph, Koppula, Skanda, Pot, Etienne, others. (2024). Scaling 4D representations. arXiv preprint arXiv:2412.15212.

[306] Goyal, Raghav, Kahou, Samira Ebrahimi, Michalski, Vincent, Materzynska, Joanna, Westphal, Susanne, Kim, Heuna, Haenel, Valentin, Fruend, Ingo, Yianilos, Peter, Mueller-Freitag, Moritz, others. (2017). The "something something" video database for learning and evaluating visual common sense. ICCV.

[307] Tang, Zineng, Lian, Long, Eisape, Seun, Wang, XuDong, Herzig, Roei, Yala, Adam, Suhr, Alane, Darrell, Trevor, Chan, David M. (2025). Tulip: Towards unified language-image pretraining. Preprint.

[308] Tong, Shengbang, Brown, Ellis, Wu, Penghao, Woo, Sanghyun, Middepogu, Manoj, Akula, Sai Charitha, Yang, Jihan, Yang, Shusheng, Iyer, Adithya, Pan, Xichen, others. (2024). Cambrian-1: A fully open, vision-centric exploration of multimodal llms. NeurIPS.

[309] xAI. (2024). Grok.

[310] Yue, Xiang, Ni, Yuansheng, Zhang, Kai, Zheng, Tianyu, Liu, Ruoqi, Zhang, Ge, Stevens, Samuel, Jiang, Dongfu, Ren, Weiming, Sun, Yuxuan, others. (2024). Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. CVPR.