
EgoPet: Egomotion and Interaction Data from an Animal's Perspective

Amir Bar$^{1,2}$, Arya Bakhtiar$^2$, Danny Tran$^2$, Antonio Loquercio$^2$, Jathushan Rajasegaran$^2$, Yann LeCun$^3$, Amir Globerson$^1$, Trevor Darrell$^2$
$^1$Tel Aviv University, $^2$UC Berkeley, $^3$New York University

Abstract

Animals perceive the world to plan their actions and interact with other agents to accomplish complex tasks, demonstrating capabilities that are still unmatched by AI systems. To advance our understanding and reduce the gap between the capabilities of animals and AI systems, we introduce a dataset of pet egomotion imagery with diverse examples of simultaneous egomotion and multi-agent interaction. Current video datasets separately contain egomotion and interaction examples, but rarely both at the same time. In addition, EgoPet offers a radically distinct perspective from existing egocentric datasets of humans or vehicles. We define two in-domain benchmark tasks that capture animal behavior, and a third benchmark to assess the utility of EgoPet as a pretraining resource for robotic quadruped locomotion, showing that models trained on EgoPet outperform those trained on prior datasets. Project page: www.amirbar.net/egopet



Figure 1: We present EgoPet, a novel animal egocentric video dataset to advance learning animal-like behavior models from video (top row). We propose three benchmark tasks on this dataset (bottom row). Visual Interaction Prediction (VIP) and Locomotion Prediction (LP) are designed to predict animals' perception and action behavior. Finally, Vision to Proprioception Prediction (VPP) studies the utility of our dataset on the downstream task of robot locomotion in the wild. For all tasks, we find that models trained on EgoPet outperform those trained on previously available video datasets.



Introduction

Animals are intelligent agents that exhibit various cognitive and behavioral traits. They plan and act to accomplish complex goals and can interact with objects or other agents. Consider a cat attempting


Figure 2: EgoPet video examples . Footage from the EgoPet dataset featuring four different animal experiences, each captured from an egocentric perspective at a distinct point in time.


to catch a rat; this requires the cat to execute a precise sequence of actions with impeccable timing, all while responding to the rat's efforts to escape.

Current Artificial Intelligence (AI) systems can synthesize high quality images [39, 40], generate coherent text [5, 51], and even code Python programs [9]. But despite this remarkable progress, there are basic animal behaviors that are beyond the reach of current models. Recently, there has been a significant body of research in robotics aimed at learning policies for quadruped locomotion, and other basic actions [27, 25, 1, 32, 42, 10, 33, 2]. However, we argue that a major limitation in advancing towards more complex systems is the availability of large-scale, real-world data.

To address this, we present EgoPet, a new web-scale dataset from the perspective of pets. EgoPet contains more than 84 hours of video, including different animals like dogs, cats, eagles, turtles, and more. This video footage reveals the world from the eye of the pet as perceived in its day-to-day life, e.g., a dog going for a walk or entering a park, or a cat wandering freely around a farm. The video data was sourced from the internet and predominantly includes pet video, hence we have named the dataset EgoPet.

To measure progress in modeling and learning from animals, we propose three new tasks that aim to capture perception and action (see Fig. 1): Visual Interaction Prediction (VIP), Locomotion Prediction (LP), and Vision to Proprioception Prediction (VPP). Together with these tasks, we provide annotated training and validation data used for downstream evaluation.

The VIP task aims to detect and classify animal interactions and is inspired by human-object interaction tasks [43]. We temporally annotated a subset of the EgoPet videos with the start and end times of visual interactions and the category of the interaction object. The categories, which include person, cat, and dog, were chosen based on how commonly they occurred as objects (for the full list of categories refer to Supplementary Section 8).

The goal of the LP task is to predict the pet's trajectory over the next 4 seconds. This is useful for learning basic pet skills like avoiding obstacles or navigating. We extracted pseudo ground truth trajectories using Deep Patch Visual Odometry (DPVO) [49], the best-performing SLAM system for our dataset. We manually filtered inaccurate trajectories in the validation data to ensure high-quality evaluation.

Finally, in the VPP task, we study EgoPet's utility for a downstream robotic task: legged locomotion. Given a video observation from a forward-facing camera mounted on a quadruped robot, the goal is to predict the features of the terrain perceived by the robot's proprioception across its trajectory. Making accurate predictions requires perceiving the landscape and anticipating the robot controls. This differs from previous works on robot visual prediction [30, 45, 31], which require conditioning on current robot controls and are thus challenging to train at scale. To assess performance on this task, we gathered data using a quadruped robot. This data includes paired videos and proprioception features, which are then used for training and evaluation.

We train various self-supervised models and evaluate how they perform downstream using a simple linear probing protocol. We make the surprising finding that pretraining on EgoPet yields better performance than pretraining on other, much larger video datasets like Ego4D [19] and Kinetics 400 [23]. This indicates the inadequacy of current datasets in studying animal-like physical skills.

Our contributions are as follows. First, we propose EgoPet, the first large-scale egocentric animal video dataset comprised of over 84 hours of video footage to facilitate learning from animals. We propose three new tasks, including human-annotated data, and set an initial benchmark. The downstream results on the VPP task indicate that EgoPet is a useful pretraining resource for quadruped locomotion, and the benchmark results on VIP show that the proposed tasks are still far from being solved, providing an exciting new opportunity to build models that capture the world through the eyes of animals.

Next, we review related work on video datasets, including general video datasets of humans and animals and those focusing specifically on egocentric video data.

Video Datasets. In recent years, a variety of video datasets have played an important role in video understanding tasks. In human action recognition, datasets like UCF101 [46], Charades-Ego [44], AVA [20], FineDiving [54], and the Something-Something dataset [18] provide comprehensive coverage of human activities, ranging from daily actions to specialized sports movements. Among these, Kinetics (K400) [23] is particularly influential, advancing the study of human actions through a wide array of video clips.

Other works aimed to collect data to study animals. These include datasets such as Animal Kingdom [36], which contains videos of various species, and MacaquePose [26], which focuses on non-human primates. These datasets are instrumental for AI advancements in wildlife recognition and interpretation. AP-10K [57] further augments this domain by providing a detailed collection of animal images for robust pose estimation. While sharing a similar motivation to our work, existing datasets on animal behavior rarely contain egocentric views and are therefore better suited to recognition problems than to studying animals' physical capabilities. For autonomous driving and vehicle motion, datasets like Berkeley DeepDrive [56] and KITTI [17] offer extensive insights into vehicle egomotion and environmental interactions. While these datasets enrich our understanding of motion, behavior, and interaction from a human-centric perspective, they offer limited insights into animal behavior.

Egocentric Video Datasets. Agents interact with the world from a first-person point of view, so collecting such data has many applications, from video understanding to augmented reality. In the past decade, many egocentric datasets were collected [16, 11, 44, 38, 28], with the majority focusing on human activities and object interactions in indoor environments (e.g., kitchens). For example, Epic Kitchens [11, 12] is a large cooking dataset recorded in 45 kitchens across 4 different cities, whereas Charades-Ego [44] consists of 4,000 paired videos of human actions in first and third person. Other datasets focus more on conversation and social interactions [15, 35, 37]. Existing datasets differ in the environments in which they are recorded (e.g., outdoor vs. kitchens), whether they are scripted or not, and the number of videos. Recently, Ego4D [19], a new comprehensive egocentric dataset, was released. Different from previous datasets, it is more diverse (e.g., indoor and outdoor activities, diverse geographical locations). However, while existing datasets focus on humans and human skills, our focus is on animal agents, which have more limited language and hand-object interactions. The most related egocentric dataset is DECADE [14], which

Figure 3: Descriptive statistics. The histogram depicting the length (in seconds) of EgoPet video sequences exhibits a long-tailed distribution, primarily skewed toward shorter segments of less than 30 seconds. Collectively, videos featuring dogs and cats account for 94% of the total duration, showcasing interactions with people, fellow cats and dogs, toys, and various objects.


consists of 1.5 hours of footage of a single dog, including joint location annotations. Inspired by DECADE, EgoPet is a much larger web-scale dataset (84 hours) and much more diverse.

The EgoPet Dataset

The EgoPet dataset is a unique collection of egocentric video footage primarily featuring dogs and cats, along with various other animals like eagles, wolves, turtles, sea turtles, sharks, snakes, cheetahs, pythons, geese, alligators, and dolphins (examples included in Fig. 2 and Suppl. Figure 9). Together with the proposed downstream tasks and benchmark, EgoPet is a valuable resource for researchers and enthusiasts interested in studying animals from an egocentric perspective.

We begin with the motivation behind EgoPet and its connection to existing datasets in Section 3.1. We then delve into the dataset's statistics in Section 3.2 and the collection process in Section 3.3.

Relation to other datasets

To provide a clearer understanding of EgoPet's significance, we compare it with various other datasets, considering factors such as total video duration, perspective (egocentric or non-egocentric), egomotion, the agents involved, and the presence of interaction annotations, which are crucial for intelligent agents. Refer to Table 1 for more details. In terms of size, Ego4D [19] is the largest egocentric video dataset, and it centers on human activities, while the BDD100K [56] dataset includes both egocentric and egomotion elements but focuses on autonomous driving. In contrast, EgoPet focuses on animals, and pets in particular. Among animal video datasets, DECADE [14] provides an egocentric perspective from a dog's viewpoint, but it only contains 1.5 hours of video. EgoPet expands this vision by over 56 times in volume and includes a variety of species and interactions.

Descriptive Statistics

The EgoPet dataset is an extensive collection composed of 6,646 video segments distilled from 819 unique videos. High-level statistics are provided in Fig. 3. These original videos were sourced predominantly from TikTok, accounting for 482 videos, while the remaining 338 were obtained from YouTube. The aggregate length of all video segments amounts to approximately 84 hours, which reflects a substantial volume of data for in-depth analysis. In terms of video duration, the segments exhibit an average span of 45.55 seconds, although the duration displays considerable variability, as indicated by the standard deviation of 192.19 seconds. This variation underscores the range of contexts captured within the dataset, from brief encounters to prolonged interactions.

Table 1: Different video datasets. We compare EgoPet to different datasets with respect to the total time (hours), whether the videos are in first-person view (egocentric) with a focus on egomotion, the agent type, and whether agent interaction annotations are available. EgoPet is the first large-scale animal dataset that is both egocentric and contains interaction annotations. It is also over 56 times larger than the most similar prior dataset, DECADE [14].

Breaking down the dataset by animal representation, cats and dogs constitute the majority, with 4,567 and 1,905 segments, respectively. This reflects the dataset's strong emphasis on common domestic animals while still covering less frequent but equally important species. Notably, the dataset includes segments featuring eagles (66), turtles (31), and a diverse group of other animals such as alligators, lizards, and dolphins, contributing to a rich collection of animal behaviors captured through an egocentric lens.

The camera positioning (where the recording device was attached) also varies: the majority of segments were captured from cameras placed on the neck (4,575) and body (1,817). Fewer segments were recorded from cameras positioned on the head (199), shell (36), collar (11), and fin (8), offering a range of perspectives that can inform how different mounting points might influence the perception of the environment from an animal's viewpoint.

Data Acquisition

Collection Strategy. To collect the dataset, we manually searched for videos using a large set of queries on YouTube and TikTok. For example, 'egocentric view', 'dog with a GoPro', and similar phrases related to first-person animal perspectives. This led to scraping a vast pool of footage showcasing animals, primarily dogs and cats, wearing wearable cameras, allowing for an egocentric point of view. In pursuit of a broader video selection, our efforts extended to individual channels and authors known for their thematic consistency in publishing egocentric animal footage. This approach allowed us to tap into niche communities and content creators, yielding a wide variety of egocentric videos beyond the reach of generic search terms.

Dataset Refinement. A meticulous annotation process was carried out to ensure the dataset's quality. A human annotator reviewed the collected videos to confirm that they were from an egocentric point of view. Non-egocentric or irrelevant segments were carefully removed.

All videos were adjusted to a frame rate of 30 frames per second and resized to 480p on the shortest side while maintaining the original aspect ratio. The videos were then segmented into discrete clips, during which any non-egocentric footage was removed. The final dataset consists of segments of at least three seconds, ensuring sufficient context for each interaction.
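The normalization step above (30 fps, shortest side scaled to 480p with the aspect ratio preserved) can be sketched as follows. The authors do not name their tooling; this sketch assumes ffmpeg and only constructs the command, with the filter expression handling landscape vs. portrait inputs.

```python
def build_ffmpeg_cmd(src: str, dst: str, fps: int = 30, short_side: int = 480) -> list[str]:
    """Build an ffmpeg command that resamples to `fps` and scales the
    shorter side to `short_side`, keeping the aspect ratio."""
    # If width > height, the height is the short side (and vice versa);
    # -2 lets ffmpeg pick the other dimension proportionally and even.
    scale = (
        f"scale='if(gt(iw,ih),-2,{short_side})':"
        f"'if(gt(iw,ih),{short_side},-2)'"
    )
    return ["ffmpeg", "-i", src, "-vf", scale, "-r", str(fps), dst]

cmd = build_ffmpeg_cmd("clip.mp4", "clip_480p30.mp4")
```

The command list can then be passed to `subprocess.run` per video before segmentation.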

EgoPet Tasks

In order to allow quantitative comparisons of animal-prediction approaches, we next define several prediction tasks on the EgoPet dataset. We provide annotated datasets based on these tasks, which will allow effective benchmarking of different approaches.

Visual Interaction Prediction (VIP)

Motivation. Human activities such as actions and interactions from an egocentric viewpoint have been previously explored in various datasets mostly focusing on activity recognition [24, 29, 60], human-object interactions [13, 6, 34, 11], and social interactions [15, 35, 55]. Inspired by these works, we focus on animal interactions with other agents or objects, and for simplicity, we only consider visual interactions. Observing interactions through an egocentric perspective offers insights

into how animals navigate their world, how they communicate with other beings, and how their physical movements correlate with environmental stimuli. Being able to identify interactions is a core task in computer vision and robotics with practical applications in designing systems that can operate in dynamic, real-world settings.

Task Description. The input for this task is a video clip from the egocentric perspective of an animal. The labels are twofold: a binary label indicating whether an interaction is taking place or not, and a categorical label describing the object of the interaction. This binary label simplifies the vast range of potential interactions into a manageable form for the model, while the identification of the interaction object adds a layer of specificity necessary for understanding the context of the interaction.
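The twofold label above maps naturally to a two-head probe: a binary "is an interaction happening?" output and a categorical "object of the interaction" output on top of frozen clip features. The sketch below is a hypothetical illustration, not the paper's implementation: the weights are random placeholders, and the dimensions (ViT-B features, 17 object categories) are taken from elsewhere in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 768, 17                            # feature dim (ViT-B), object categories
w_bin = rng.normal(size=D) * 0.01         # binary interaction head (placeholder)
W_obj = rng.normal(size=(D, C)) * 0.01    # categorical object head (placeholder)

def predict(feat):
    """feat: (D,) frozen backbone feature for one clip ->
    (interaction probability, object-category probabilities)."""
    p_interact = 1.0 / (1.0 + np.exp(-(feat @ w_bin)))   # sigmoid
    logits = feat @ W_obj
    p_obj = np.exp(logits - logits.max())                # stable softmax
    return p_interact, p_obj / p_obj.sum()

p, probs = predict(rng.normal(size=D))
```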

In the context of the EgoPet dataset, a 'visual interaction' is defined as a discernible event where the agent, typically an animal such as a dog or cat, demonstrates clear attention to an object or another agent within its environment. This attention may be manifested through physical contact, proximity, orientation, or vocalization (such as barking or making sounds) toward the object of the interaction, which can be an object

Figure 4: Visual Interaction Prediction task. The figure illustrates the process of annotating a single video, identifying and categorizing different interactions experienced by a cat, with each segment of the timeline reflecting a unique type of interaction within the animal's environment.


or agent. The fundamental criterion for a visual interaction is the presence of visual evidence within the video that the agent is engaged with, or reacting to, a particular stimulus. Aimless movements, such as wandering without a clear target or displaying alertness without a specific focus, are not labeled as visual interactions.

Annotations. The data labeling process for marking interactions involved a meticulous analysis of the video content, which resulted in the annotation of 1,449 subsegments (see Fig. 4). Two human annotators were trained to identify and timestamp the start and end of each interaction event. The outcome of this process is a richly annotated dataset of 805 subsegments where no interaction occurs ('negative subsegments') and 644 positive interaction subsegments that capture a wide range of 17 distinct interaction objects such as person, cat, and dog. The subsegments were then split into train and test sets, leaving us with 754 training subsegments and 695 test subsegments, for a total of 1,449 annotated subsegments. For the full annotation process, refer to Suppl. Section 8.

Locomotion Prediction (LP)

Motivation. Planning where to move requires both perception and foresight: the ability to anticipate potential obstacles, consider various courses of action, and select the most efficient and effective strategy to achieve a desired goal. EgoPet contains examples where animals plan a future trajectory to achieve a certain goal (e.g., a dog following its owner; see Fig. 5).

Task Description. Given a sequence of past $m$ video frames $\{x_i\}_{i=t-m}^{t}$, the goal is to predict the unit-normalized future trajectory of the agent $\{v_j\}_{j=t+1}^{t+k}$, where $v_j \in \mathbb{R}^3$ represents the relative location of the agent at timestep $j$. We predict the unit-normalized relative location due to the scale ambiguity of the extracted trajectories. In practice, we condition models on $m = 16$ frames and

Figure 5: Locomotion Prediction task. A dog navigates an agility course, highlighting the concept of locomotion prediction by anticipating its forward and upward trajectory to clear the obstacle.


predict $k = 40$ future locations, which correspond to 4 seconds into the future.
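Constructing an LP target from odometry output can be sketched as follows. The visual odometry system yields a camera position per frame; the sketch takes the k future positions relative to the current frame and unit-normalizes each one. The per-step L2 normalization convention is our assumption, chosen because only the direction of motion survives the scale ambiguity.

```python
import numpy as np

def lp_target(positions: np.ndarray, t: int, k: int = 40) -> np.ndarray:
    """positions: (T, 3) camera locations from odometry.
    Returns (k, 3) unit-normalized relative locations w.r.t. frame t."""
    rel = positions[t + 1 : t + 1 + k] - positions[t]   # relative to frame t
    norms = np.linalg.norm(rel, axis=1, keepdims=True)
    return rel / np.maximum(norms, 1e-8)                # unit-normalize each step

# Toy path: constant forward drift with slight upward motion.
traj = np.cumsum(np.tile([0.1, 0.0, 0.02], (60, 1)), axis=0)
target = lp_target(traj, t=10)
```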

Annotations. To obtain pseudo ground truth agent trajectories, we used Deep Patch Visual Odometry (DPVO [49]), a system for monocular visual odometry that utilizes sparse patch-based matching across frames. This system largely outperformed other open-source SLAM systems in terms of convergence rate and qualitative accuracy in our experiments.

Given an input sequence of frames, DPVO returns the location and orientation of the camera for each frame. To obtain training trajectories, we feed videos with a stride of 5 to DPVO. To ensure high-quality evaluation, we feed validation videos with strides of 5, 10, and 15 into DPVO and evaluate the quality of the trajectories manually. Specifically, two human annotators were trained to evaluate the trajectories from an eagle's eye view (XZ view) and determine the best matching trajectory, if any, for the video. This left us with 6,126 annotated training segments and 249 validation segments.

Vision to Proprioception Prediction (VPP)

Motivation. Understanding animal behavior could be instrumental to several robotics applications. To demonstrate the value of our dataset for robotics, we propose a task based on the problem of vision-based locomotion. Specifically, the task consists of predicting the parameters of the terrain a quadrupedal robot is walking on (see Fig. 6). As shown in multiple previous works on locomotion [30, 31, 22, 4, 3, 45], accurate prediction of these parameters is correlated with improved performance in locomotion. Intuitively, the EgoPet data closely resembles the video captured by a quadruped robot since a camera mounted on a pet is approximately at the same location as the camera mounted on the robot. In addition, the task of walking is highly represented in the dataset.

Task Description. The parameters we would like to predict are the local terrain geometry, the terrain's friction, and the parameters related to the robot's walking behavior on the terrain, including the robot's speed, motor efficiency, and high-level command. The exact identification of these parameters is generally impossible [25]: two terrains could have different combinations of parameters but 'feel' the same to an agent. For example, walking on sand and mud could similarly affect the robot's proprioception, even though their properties differ. Therefore, similarly to previous work [25, 27, 30], we aim to predict a latent representation $z_t$ of the terrain parameters. This latent representation consists of the hidden layer of a neural network trained in simulation to encode ground-truth terrain parameters. This neural network is trained end-to-end with an action policy on locomotion using reinforcement learning. The task consists of predicting the latent terrain representation at different time intervals from a sequence of frames. Our setup closely follows the one in [30]. Specifically, we add to EgoPet data collected with a quadrupedal robot in multiple outdoor environments with different terrain characteristics, e.g., sand or grass.

Figure 6: Vision to Proprioception Prediction task. This figure showcases the quadruped robot as it is about to transition from flat ground to climbing steep stairs, illustrating one of the unique terrain environments encountered during the collection of visual and proprioceptive data at annotated time intervals for VPP training.

Dataset and Annotations. To collect the dataset, we deployed the walking policy of [30] on a Unitree A1 robot dog in three environments: an office, a park, and a beach. We collected approximately 20 minutes of walking data in these environments, which are used exclusively for evaluation. For training, we use the data from [30], which contains 120 thousand frames, corresponding to a total walking time of approximately 2.3 hours. Each environment has different terrain geometries, including flats, steps, and slopes. Each sample contains an image collected from a forward-looking camera mounted on the robot and the (latent) parameters of the terrain below the robot's center of mass, $z_t$, estimated with a history of proprioception. See [30] for details about the annotation procedure.

The final task consists of predicting $z_t$ from a history of images. We generate several sub-tasks by predicting the future terrain parameters $z_{t+0.8}$, $z_{t+1.5}$ and the past ones $z_{t-0.8}$, $z_{t-1.5}$. These time intervals were selected to differentiate between forecasting and estimation. The further the prediction is in the future or the past, the harder the task is. The input images might contain little information,

or none at all, about the terrain at these times. Therefore, inferences based on context are required. For example, one can predict the presence of a step in front of the robot from the shadow it casts on the terrain.
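The sub-task construction above amounts to pairing each frame with the terrain latent at four fixed time offsets. The sketch below illustrates this index bookkeeping; the sampling rate, latent dimensionality, and the choice to keep only frames where all four offsets exist are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def vpp_pairs(z: np.ndarray, fps: float, offsets=(-1.5, -0.8, 0.8, 1.5)):
    """z: (T, d) per-frame terrain latents. Returns {frame index:
    (len(offsets), d) targets}, keeping frames where every offset is in range."""
    T = len(z)
    shifts = [int(round(o * fps)) for o in offsets]   # seconds -> frame shifts
    pairs = {}
    for t in range(T):
        idx = [t + s for s in shifts]
        if all(0 <= i < T for i in idx):
            pairs[t] = z[idx]
    return pairs

# Toy example: 100 frames of 8-dim latents at an assumed 10 fps.
pairs = vpp_pairs(np.zeros((100, 8)), fps=10.0)
```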

We divide the newly collected data into three test datasets: the first is in-distribution, featuring terrains and lighting conditions similar to the training data. The second dataset is out of distribution since it is captured with different lighting conditions, i.e., at night, but in environments with the same features as the training data. Finally, the third dataset contains sandy environments, which the robot has not encountered during training.

Evaluation Benchmark

Our goal in the experiments is to establish initial performance baselines on the EgoPet tasks. For the VPP task, we hypothesize that EgoPet is a more useful pretraining resource compared to other datasets. We evaluate different pretrained models and compare their performance on the VIP, LP, and VPP tasks. We adopt a simple linear probing protocol, where we freeze the model weights and, for each task, train only a linear layer to predict the output. For evaluation, we use the models' publicly released checkpoints, typically trained on IN-1k or K400, unless stated otherwise.
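The linear probing protocol above can be illustrated minimally: the backbone stays frozen (here stood in by a fixed random projection) and only a linear head is fit on its features. The papers' probes are trained with SGD; the closed-form least-squares fit below is a stand-in used purely for brevity, and all dimensions are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
backbone = rng.normal(size=(768, 32))            # frozen "encoder": inputs -> features

def extract(x: np.ndarray) -> np.ndarray:
    """Frozen feature extractor; its weights are never updated."""
    return np.tanh(x @ backbone.T)

X = rng.normal(size=(200, 32))                   # toy inputs
y = rng.integers(0, 2, size=200).astype(float)   # toy binary labels
F = extract(X)                                   # features stay fixed
w, *_ = np.linalg.lstsq(F, y, rcond=None)        # fit only the linear head
pred = (F @ w > 0.5).astype(float)
```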

MAE [21] is trained by masking random patches in the input image and reconstructing the missing pixels through an asymmetric encoder-decoder architecture. Masking a high proportion (e.g., 75%) of the input image makes the reconstruction task nontrivial. In our experiments we use an MAE model pretrained on IN-1k.

MVP [53] uses the same architecture as MAE, but trains it on a mixture of egocentric datasets which we refer to as Ego Mix: a combination of Epic-Kitchens [11], 100DOH [43], Ego4D [19], and Something-Something [18].

DINO [8] is trained with a student-teacher architecture over pairs of augmented images by encouraging invariance to the image augmentations. The teacher's output is centered, and both networks' normalized features are compared using a cross-entropy loss. The stop-gradient operator ensures gradient propagation only through the student, and teacher parameters are updated using an exponential moving average (ema) of the student parameters.

iBOT [59] is trained similarly to DINO, but adds an auxiliary masked image modeling (MIM) loss by predicting the patch representations of a learned online tokenizer.

VideoMAE [50] is an extension of MAE for video pretraining. Different from MAE, it utilizes an extremely high masking ratio (90% to 95%) and tube masking as opposed to random masking.

MVD [52] is a masked feature modeling framework for self-supervised video representation learning. Learning the video representations involves distilling student model features from both video and image teachers. We train MVD variants on Ego4D and EgoPet, using VideoMAE (K400) and MAE (IN-1k) as video and image teachers.

Implementation Details. For all models, we used the ViT-B model with patch size 16, since it was available across all methods. For the VIP task, we train all image and video models for 10 epochs. Video models represent 2-second video clips using 8 input frames (4 Hz), and image models use one (the middle) frame. For the VPP task, we train all models for 50 epochs; video models were trained with varying numbers of frames at 4 Hz. For the LP task, we train all models for 15 epochs; video models were trained with 16 input frames (30 Hz) and image models used one frame (the last). In our LP experiments, we only use cat and dog segments that are long enough, and 25% of the training data. This left us with 1,129 training segments and 167 validation segments. During the linear probing training phase, we do not apply any image augmentations. All other hyperparameters follow the MAE and MVD linear probing recipes for image and video models, respectively.
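The clip sampling above (8 frames at 4 Hz from a 2-second, 30 fps clip for video models; the middle frame for image models) reduces to a small index computation. The exact index convention is our assumption; this sketch simply spaces samples by fps / sample rate.

```python
def sample_indices(n_frames: int, fps: float, n_samples: int, sample_hz: float):
    """Indices of `n_samples` frames taken at `sample_hz` from a clip
    of `n_frames` frames recorded at `fps`."""
    stride = fps / sample_hz                     # frames between samples
    return [min(int(round(i * stride)), n_frames - 1) for i in range(n_samples)]

video_idx = sample_indices(60, fps=30, n_samples=8, sample_hz=4)  # video models
middle_idx = 60 // 2                                              # image models (VIP)
```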

Results

In this section, we report initial baseline results from applying a range of models to the VIP, LP, and VPP tasks. Taken together, these results underline the interesting observation that current large

Table 2: Visual Interaction Prediction linear probing results. We report each model's Interaction Prediction Accuracy and AUROC, as well as Object Prediction Top-1 and Top-3 Accuracy.

Figure 8: Locomotion Prediction (LP) linear probing results. We report the validation ATE and RPE as a function of the epoch during training, comparing the impact of various datasets (Kinetics, Ego4D, EgoPet). Models trained on EgoPet perform better than models trained on other datasets. See Supplementary Table 4 for the full results.


video datasets used for pretraining are not diverse enough to perform well across all the EgoPet downstream tasks. For example, pretraining on K400 is better than Ego4D for VIP but worse on VPP. Furthermore, by pretraining on EgoPet, we observe improved downstream performance on the VPP task compared to other models.

Visual Interaction Prediction

The results in Table 2 show that models trained on EgoPet achieve improved performance compared to K400 or Ego4D both on interaction prediction and object prediction. Compared to image-based models like iBOT, MVD trained on EgoPet performs better on Top-3 Acc but worse on Top-1. This is likely due to the diversity of objects appearing in IN-1k, an image recognition dataset. Compared to other video models, MVD (EgoPet) performs better. To obtain more insight into what models focus on in this task, we apply Grad-CAM [41] on our MVD EgoPet interaction classifier. Fig. 7 shows the corresponding heatmaps, which focus on the rat (top-left) and another dog (bottom-right). In these cases, the model seems to be attending to the object of interaction.

Locomotion Prediction

For this task, we evaluated models based on their predicted unit motions 40 timesteps into the future, corresponding to 4 seconds. We form trajectories from these predicted motions and compute the

Figure 7: VIP Grad-CAM [41] visualization.


Table 3: Vision to Proprioception Prediction (VPP) linear probing results. We report the mean squared error loss. Models trained on EgoPet perform better than models trained on other datasets. See Supplementary Table 5 for the full results.

RMSE of the Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) metrics against the ground truth trajectories. ATE and RPE are commonly employed metrics for evaluating systems such as SLAM and visual odometry [58, 49, 48, 7]. ATE first aligns the ground truth with the predicted trajectory, and then computes the absolute pose difference. RPE measures the difference between the predicted and ground truth locomotion [47].
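Both metrics can be illustrated in a few lines of NumPy. This is a simplified sketch, not the paper's evaluation code: ATE alignment here is translation-only (by centering), whereas full implementations typically also align rotation and possibly scale, e.g., via the Umeyama method.

```python
import numpy as np

def ate_rmse(pred, gt):
    """Absolute Trajectory Error (RMSE): align the two trajectories
    (here: translation-only, by centering; full implementations also
    align rotation/scale), then average absolute position differences."""
    pred = pred - pred.mean(axis=0)
    gt = gt - gt.mean(axis=0)
    return np.sqrt(np.mean(np.sum((pred - gt) ** 2, axis=1)))

def rpe_rmse(pred, gt, delta=1):
    """Relative Pose Error (RMSE): compare relative motion over a fixed
    step `delta` rather than absolute positions."""
    d_pred = pred[delta:] - pred[:-delta]
    d_gt = gt[delta:] - gt[:-delta]
    return np.sqrt(np.mean(np.sum((d_pred - d_gt) ** 2, axis=1)))
```

Note that a constant offset between the two trajectories leaves both metrics at zero, while per-step errors accumulate in ATE but stay local in RPE.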

The results in Fig. 8 indicate that models trained on EgoPet perform better than those trained on Ego4D and K400, and that the Ego4D model performed second best, possibly because Ego4D is also egocentric data. The full results in Supplementary Table 4 indicate that video models perform much better than image models as a whole, which we speculate is due to better modeling of the agent's velocity and acceleration, as well as the motion of other agents.

Vision to Proprioception Prediction

Table 3 provides the results of the VPP task. It can be seen that using EgoPet data leads to lower errors on this task. Additionally, the results show that using additional past frames as context helps and that video models outperform image models. Within image models, MVP performs better than MAE, likely because it was trained on egocentric data, and iBOT performs best. Within video models, MVD trained on EgoPet achieves lower mean squared error loss compared to the same model trained with K400 or Ego4D, and lower error compared to all image models. We speculate that MVD (EgoPet) outperforms all models because, compared to other datasets, the EgoPet videos are more similar to videos captured by a forward-facing camera mounted on a quadruped robot dog. We provide the full results in Supplementary Table 5.

Limitations

EgoPet is a video dataset, and as such it primarily contains visual and auditory signals. However, animals interact with their environment using a multitude of senses, including smell and touch. The absence of these sensory modalities in our dataset and model may lead to a partial or skewed understanding of animal behavior and intelligence. Animal behavior is highly complex and influenced by a myriad of factors, including instinct, learning, environmental stimuli, and social interactions. Our tasks, while effective in capturing certain aspects of behavior, may not fully encapsulate the depth and complexity of animal interactions and decision-making processes. Further research is needed to develop more sophisticated tasks and models that can account for complex behavioral patterns.

Conclusion

We present EgoPet, a new comprehensive animal egocentric video dataset. Together with the proposed downstream tasks and benchmark, we believe EgoPet offers a testbed for studying and modeling animal behavior. Our benchmark results demonstrate that interaction prediction is far from solved, which provides an exciting opportunity for future research on modeling egocentric animal agents. Furthermore, the results demonstrate that EgoPet is a useful pretraining resource for downstream robotic locomotion tasks. Future work can broaden the tasks to integrate more sensory inputs, such as audio, thereby creating a richer and more holistic understanding of animal behavior.

Acknowledgements: We thank Justin Kerr for helpful discussions. Many of the figures use images taken from web videos. For each figure, we include the URL to its source videos in the Suppl. Section 8. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC HOLI 819080). Prof. Darrell's group was supported in part by DoD including DARPA's LwLL and/or SemaFor programs, as well as BAIR's industrial alliance programs.

References

Supplementary Material

We provide additional information about the EgoPet dataset, annotation process and full quantitative results.

Dataset

We include more dataset visualizations in Figure 9.

Figure 9: Additional EgoPet examples. Footage of four different animal videos from an egocentric view is included.


VIP Annotations

We provide additional information regarding the annotation process for the VIP task. The data labeling process for marking interactions involved a meticulous analysis of the video content, which resulted in the annotation of 1,449 subsegments (see Figure 4). Two human annotators were trained to identify and timestamp the start and end of an interaction event.

Table 4: Locomotion prediction linear probing results. Models are evaluated on their ability to predict the trajectory of the agent k seconds into the future.

The beginning of an interaction is marked at the first time-step where the agent begins to give attention to a target, and the endpoint is marked at the last time-step before the attention ceases. In addition, annotators were instructed to mark some segments without interactions. This process results in a set of temporal segments, each corresponding to a discrete interaction event or no interaction event. To

Table 5: Vision to Proprioception Prediction (VPP) linear probing results by individual timestep. We report the mean squared error loss. Models trained on EgoPet perform better than models trained on other datasets.

ensure the consistency of annotations across annotators, annotations are only kept where annotators agree.

The outcome of this process is a richly annotated dataset of 805 subsegments where no interaction occurs ('negative subsegments') and 644 positive interaction subsegments that capture a wide range of 17 distinct interaction objects such as person, cat, and dog. The subsegments were then split into train and test sets. This leaves us with 754 training subsegments and 695 test subsegments, for a total of 1,449 annotated subsegments.

This is the list of all possible interaction objects: Person, Ball, Bench, Bird, Dog, Cat, Other Animal, Toy, Door, Floor, Food, Plant, Filament, Plastic, Water, Vehicle, Other.

Results

Full LP results. Table 4 contains the quantitative LP results for all models, reported using the ATE and RPE metrics. The results indicate that pretraining on EgoPet leads to better ATE and RPE scores.

Full VPP results. In the main paper we included the VPP results grouped by 'past', 'present' and 'future' (see Table 3). In Table 5 we provide the full fine-grained VPP results by individual timestep.

Figure credits

Some of the figures in the paper were created from web videos. We credit the original content creators and provide links to the original videos below.

Figure 1


Animals are intelligent agents that exhibit various cognitive and behavioral traits. They plan and act to accomplish complex goals and can interact with objects or other agents. Consider a cat attempting to catch a rat; this requires the cat to execute a precise sequence of actions with impeccable timing, all while responding to the rat’s efforts to escape.

Current Artificial Intelligence (AI) systems can synthesize high-quality images [39, 40], generate coherent text [5, 51], and even write Python programs [9]. But despite this remarkable progress, there are basic animal behaviors that remain beyond the reach of current models. Recently, there has been a significant body of research in robotics aimed at learning policies for quadruped locomotion and other basic actions [27, 25, 1, 32, 42, 10, 33, 2]. However, we argue that a major limitation in advancing towards more complex systems is the availability of large-scale, real-world data.

To address this, we present EgoPet, a new web-scale dataset from the perspective of pets. EgoPet contains more than 84 hours of video, including different animals like dogs, cats, eagles, turtles, and more. This video footage reveals the world through the eyes of the pet as perceived in its day-to-day life, e.g., a dog going for a walk or entering a park, or a cat wandering freely around a farm. The video data was sourced from the internet and predominantly includes pet videos, hence we have named the dataset EgoPet.

To measure progress in modeling and learning from animals, we propose three new tasks that aim to capture perception and action (see Fig. 1): Visual Interaction Prediction (VIP), Locomotion Prediction (LP), and Vision to Proprioception Prediction (VPP). Together with these tasks, we provide annotated training and validation data used for downstream evaluation.

The VIP task aims to detect and classify animal interactions and is inspired by human-object interaction tasks [43]. We temporally annotated a subset of the EgoPet videos with the start and end times of visual interactions and the object-of-interaction category. The categories, which include person, cat, and dog, were chosen based on how commonly they occurred as objects (for the full list of categories, refer to the Supplementary VIP Annotations section).

The goal of the LP task is to predict the future 4-second trajectory of the pet. This is useful for learning basic pet skills like avoiding obstacles or navigating. We extracted pseudo ground truth trajectories using Deep Patch Visual Odometry (DPVO) [49], the best-performing SLAM system for our dataset. We manually filtered inaccurate trajectories in the validation data to ensure high-quality evaluation.

Finally, in the VPP task, we study EgoPet’s utility for a downstream robotic task: legged locomotion. Given a video observation from a forward-facing camera mounted on a quadruped robot, the goal is to predict the features of the terrain perceived by the robot’s proprioception across its trajectory. Making accurate predictions requires perceiving the landscape and anticipating the robot controls. This differs from previous works on robot visual prediction [30, 45, 31], which require conditioning over current robot controls and are thus challenging to train at scale. To assess performance in this task, we gathered data utilizing a quadruped robodog. This data includes paired videos and proprioception features, which are then utilized for subsequent training and evaluation processes.

We train various self-supervised models and evaluate how they perform downstream using a simple linear probing protocol. We make the surprising finding that pretraining on EgoPet yields better performance than pretraining on other, much larger video datasets like Ego4D [19] and Kinetics 400 [23]. This indicates the inadequacy of current datasets in studying animal-like physical skills.

Our contributions are as follows. First, we propose EgoPet, the first large-scale egocentric animal video dataset, comprising over 84 hours of video footage to facilitate learning from animals. Second, we propose three new tasks, including human-annotated data, and set an initial benchmark. The downstream results on the VPP task indicate that EgoPet is a useful pretraining resource for quadruped locomotion, and the benchmark results on VIP show that the proposed tasks are still far from being solved, providing an exciting new opportunity to build models that capture the world through the eyes of animals.

Next, we delve into the related work surrounding video datasets, including notable research on both general video datasets of humans and animals and those focusing specifically on egocentric video data.

Video Datasets. In recent years, a variety of video datasets have played an important role in video understanding tasks. In human action recognition, datasets like UCF101 [46], Charades-Ego [44], AVA [20], FineDiving [54], and the Something-Something dataset [18] provide comprehensive coverage of human activities, ranging from daily actions to specialized sports movements. Among these, Kinetics (K400) [23] is particularly influential, advancing the study of human actions through a wide array of video clips.

Other works aimed to collect data to study animals. These works include datasets such as the Animal Kingdom [36], which contains videos of various species, and MacaquePose [26], which focuses on non-human primates. These datasets are instrumental for AI advancements in wildlife recognition and interpretation. AP-10K [57] further augments this domain by providing a detailed collection of animal images for robust pose estimation. While sharing a similar motivation to our work, existing datasets on animal behavior rarely contain egocentric views and are therefore better suited to recognition problems than the animals’ physical capabilities. For autonomous driving and vehicle motion, datasets like the Berkeley DeepDrive [56, 17] and KITTI [17] offer extensive insights into vehicle egomotion and environmental interactions. While these datasets enrich our understanding of motion, behavior, and interaction from a human-centric perspective, they offer limited insights into animal behavior.

Egocentric Video Datasets. Agents interact with the world from a first-person point of view, so collecting such data has many applications, from video understanding to augmented reality. In the past decade, many egocentric datasets were collected [16, 11, 44, 38, 28], with the majority of them focusing on human activities and object interactions in indoor environments (e.g., kitchens). For example, Epic Kitchens [11, 12] is a large cooking dataset that takes place in 45 kitchens across 4 different cities, whereas Charades-Ego [44] consists of 4,000 paired videos of human actions in first and third person. Other datasets are more focused on conversation and social interactions [15, 35, 37]. Existing datasets differ by the environments in which they are recorded (e.g., outdoor vs. kitchens), whether they are scripted or not, and the number of videos. Recently, Ego4D [19], a new comprehensive egocentric dataset, was released. Different from previous datasets, it is more diverse (e.g., indoor and outdoor activities, diverse geographical locations). However, while existing datasets focus on humans and human skills, our focus is on animal agents, which have more limited language and hand-object interactions. The most related egocentric dataset is DECADE [14], which consists of an hour of footage of a single dog, including joint location annotations. Inspired by DECADE, EgoPet is a much larger web-scale dataset (84 hours) and much more diverse.

The EgoPet dataset is a unique collection of egocentric video footage primarily featuring dogs and cats, along with various other animals like eagles, wolves, turtles, sea turtles, sharks, snakes, cheetahs, pythons, geese, alligators, and dolphins (examples included in Fig. 2 and Suppl. Figure 9). Together with the proposed downstream tasks and benchmark, EgoPet is a valuable resource for researchers and enthusiasts interested in studying animals from an egocentric perspective.

We begin with the motivation behind EgoPet and its connection to existing datasets in Section 3.1. We then delve into the dataset’s statistics in Section 3.2 and the collection process in Section 3.3.

To provide a clearer understanding of EgoPet’s significance, we compare it with various other datasets, considering factors such as total video duration, perspective (egocentric or non-egocentric), egomotion, the agents involved, and the presence of interaction annotations, which are crucial for intelligent agents. Refer to Table 1 for more details. In terms of size, Ego4D [19] is the largest egocentric video dataset, and it centers on human activities, while the BDD100K [56] dataset includes both egocentric and egomotion elements but it focuses on autonomous driving. Differently, EgoPet focuses on animals, and pets in particular. Among animal video datasets, the DECADE [14] dataset provides an egocentric perspective from a dog’s viewpoint, but it only records 1.5 hours of video. EgoPet expands this vision by over 56 times in volume and includes a variety of species and interactions.

The EgoPet dataset is an extensive collection composed of 6,646 video segments distilled from 819 unique videos. High-level statistics are provided in Fig. 3. These original videos were sourced predominantly from TikTok, accounting for 482 videos, while the remaining 338 were obtained from YouTube. The aggregate length of all video segments amounts to approximately 84 hours, which reflects a substantial volume of data for in-depth analysis. In terms of video duration, the segments exhibit an average span of 45.55 seconds, although the duration displays considerable variability, as indicated by the standard deviation of 192.19 seconds. This variation underscores the range of contexts captured within the dataset, from brief encounters to prolonged interactions.

Breaking down the dataset by animal representation, cats and dogs constitute the majority, with 4,567 and 1,905 segments, respectively. This reflects the dataset’s strong emphasis on common domestic animals while still covering less frequent but equally important species. Notably, the dataset includes segments featuring eagles (66), turtles (31), and a diverse group of other animals such as alligators, lizards, and dolphins, contributing to a rich collection of animal behaviors captured through an egocentric lens.

The camera positioning (where the recording device was attached) also varies: the majority of segments were captured from cameras placed on the neck (4,575) and body (1,817). Fewer segments were recorded from cameras positioned on the head (199), shell (36), collar (11), and fin (8), offering a range of perspectives that can reveal how different mounting points might influence the perception of the environment from an animal’s viewpoint.

Collection Strategy. To collect the dataset, we manually searched for videos using a large set of queries on YouTube and TikTok. For example, “egocentric view”, “dog with a GoPro”, and similar phrases related to first-person animal perspectives. This led to scraping a vast pool of footage showcasing animals, primarily dogs and cats, wearing wearable cameras, allowing for an egocentric point of view. In pursuit of a broader video selection, our efforts extended to individual channels and authors known for their thematic consistency in publishing egocentric animal footage. This approach allowed us to tap into niche communities and content creators, yielding a wide variety of egocentric videos beyond the reach of generic search terms.

Dataset Refinement. A meticulous annotation process was carried out to ensure the dataset’s quality. A human annotator reviewed the collected videos to confirm that they were from an egocentric point of view. Non-egocentric or irrelevant segments were carefully removed.

All videos were adjusted to a frame rate of 30 frames per second and resized to 480p on the shortest side while maintaining the original aspect ratio. The videos were then segmented into discrete clips, during which any non-egocentric footage was removed. The final dataset consists of segments of at least three seconds, ensuring sufficient context for each interaction.
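The paper does not specify the tooling used for this standardization step; as one plausible sketch, it could be expressed as an ffmpeg invocation built by a hypothetical helper like the one below (the scale filter resizes whichever side is shorter to 480 px and lets ffmpeg pick a matching even value for the other side).

```python
def ffmpeg_standardize_cmd(src, dst):
    """Build an ffmpeg command that re-times a clip to 30 fps and resizes
    the SHORTER side to 480 px, preserving aspect ratio. Hypothetical
    helper; not the authors' actual pipeline."""
    # -2 asks ffmpeg to choose the matching even dimension for the longer side.
    vf = "scale=w='if(lt(iw,ih),480,-2)':h='if(lt(iw,ih),-2,480)'"
    return ["ffmpeg", "-i", src, "-r", "30", "-vf", vf, dst]
```

Passing the command as an argument list (e.g., to `subprocess.run`) avoids shell-quoting issues with the filter expression.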

In order to allow quantitative comparisons of animal-prediction approaches, we next define several prediction tasks on the EgoPet dataset. We provide annotated datasets based on these tasks, which will allow effective benchmarking of different approaches.

Motivation. Human activities such as actions and interactions from an egocentric viewpoint have been previously explored in various datasets mostly focusing on activity recognition [24, 29, 60], human-object interactions [13, 6, 34, 11], and social interactions [15, 35, 55]. Inspired by these works, we focus on animal interactions with other agents or objects, and for simplicity, we only consider visual interactions. Observing interactions through an egocentric perspective offers insights into how animals navigate their world, how they communicate with other beings, and how their physical movements correlate with environmental stimuli. Being able to identify interactions is a core task in computer vision and robotics with practical applications in designing systems that can operate in dynamic, real-world settings.

Task Description. The input for this task is a video clip from the egocentric perspective of an animal. The labels are twofold: a binary label indicating whether an interaction is taking place or not, and a categorical label describing the object of the interaction. This binary label simplifies the vast range of potential interactions into a manageable form for the model, while the identification of the interaction object adds a layer of specificity necessary for understanding the context of the interaction.

In the context of the EgoPet dataset, a “visual interaction” is defined as a discernible event where the agent (typically an animal such as a dog or cat) demonstrates clear attention to an object or another agent within its environment. This attention may be manifested through physical contact, proximity, orientation, or vocalization (such as barking or making sounds) toward the object of the interaction, which can be an object or another agent. The fundamental criterion for a visual interaction is the presence of visual evidence within the video that the agent is engaged with, or reacting to, a particular stimulus. Aimless movements, such as wandering without a clear target or displaying alertness without a specific focus, are not labeled as visual interactions.

Annotations. The data labeling process for marking interactions involved a meticulous analysis of the video content, which resulted in the annotation of 1,449 subsegments (see Fig. 4). Two human annotators were trained to identify and timestamp the start and end of an interaction event. The outcome of this process is a richly annotated dataset of 805 subsegments where no interaction occurs (”negative subsegments”) and 644 positive interaction subsegments that capture a wide range of 17 distinct interaction objects such as person, cat, and dog. The subsegments were then split into train and test sets. This leaves us with 754 training subsegments and 695 test subsegments, for a total of 1,449 annotated subsegments. For the full annotation process, refer to the Supplementary VIP Annotations section.

Motivation. Planning where to move involves a complex interplay of both perception and foresight. It requires the ability to anticipate potential obstacles, consider various courses of action, and select the most efficient and effective strategy to achieve a desired goal. EgoPet contains examples where animals plan a future trajectory to achieve a certain goal (e.g., a dog following its owner; see Fig. 5).

Task Description. Given a sequence of past $m$ video frames $\{x_i\}_{i=t-m}^{t}$, the goal is to predict the unit-normalized future trajectory of the agent $\{v_j\}_{j=t+1}^{t+k}$, where $v_j \in \mathbb{R}^3$ represents the relative location of the agent at timestep $j$. We predict the unit-normalized relative location due to the scale ambiguity of the extracted trajectories. In practice, we condition models on $m=16$ frames and predict $k=40$ future locations, which correspond to 4 seconds into the future.
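Concretely, the prediction targets can be built from a sequence of estimated camera positions as unit-normalized per-step displacements. The helper below is a hypothetical sketch of this target construction, not the authors' exact code.

```python
import numpy as np

def unit_normalized_targets(positions):
    """Turn absolute camera positions of shape (k + 1, 3) into k unit
    displacement vectors v_j, removing the scale ambiguity of monocular
    visual odometry."""
    deltas = np.diff(positions, axis=0)                   # relative motion per step
    norms = np.linalg.norm(deltas, axis=1, keepdims=True)
    return deltas / np.clip(norms, 1e-8, None)            # shape (k, 3), unit length
```

Because only directions are kept, two trajectories that differ by a global scale yield identical targets.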

Annotations. To obtain pseudo ground truth agent trajectories, we used Deep Patch Visual Odometry (DPVO [49]), a system for monocular visual odometry that utilizes sparse patch-based matching across frames. This system largely outperformed other open-source SLAM systems in terms of convergence rate and qualitative accuracy in our experiments.

Given an input sequence of frames, DPVO returns the location and orientation of the camera for each frame. To obtain training trajectories, we feed videos with a stride of 5 to DPVO. To ensure high-quality evaluation, we feed validation videos with strides of 5, 10, and 15 into DPVO and evaluate the quality of the trajectories manually. Specifically, two human annotators were trained to evaluate the trajectories from an eagle’s eye view (XZ view) and determine the best matching trajectory, if any, to the video. This left us with 6,126 annotated training segments and 249 validation segments.

Motivation. Understanding animal behavior could be instrumental to several robotics applications. To demonstrate the value of our dataset for robotics, we propose a task based on the problem of vision-based locomotion. Specifically, the task consists of predicting the parameters of the terrain a quadrupedal robot is walking on (see Fig. 6). As shown in multiple previous works on locomotion [30, 31, 22, 4, 3, 45], accurate prediction of these parameters is correlated with improved performance in locomotion. Intuitively, the EgoPet data closely resembles the video captured by a quadruped robot since a camera mounted on a pet is approximately at the same location as the camera mounted on the robot. In addition, the task of walking is highly represented in the dataset.

Task Description. The parameters we would like to predict are the local terrain geometry, the terrain’s friction, and the parameters related to the robot’s walking behavior on the terrain, including the robot’s speed, motor efficiency, and high-level command. The exact identification of these parameters is generally impossible [25]: two terrains could have different combinations of parameters but “feel” the same to an agent. For example, walking on sand and mud could similarly affect the robot’s proprioception, even though their properties differ. Therefore, similarly to previous work [25, 27, 30], we aim to predict a latent representation $z_t$ of the terrain parameters. This latent representation consists of the hidden layer of a neural network trained in simulation to encode ground-truth terrain parameters. This neural network is trained end-to-end with an action policy on locomotion using reinforcement learning. The task consists of predicting the latent terrain representation at different time intervals from a sequence of frames. Our setup closely follows the one in [30]. Specifically, we add to EgoPet data collected with a quadrupedal robot in multiple outdoor environments with different terrain characteristics, e.g., sand or grass.

Dataset and Annotations. To collect the dataset, we deployed the walking policy of [30] on a Unitree A1 robot dog in three environments: an office, a park, and a beach. We collect approximately 20 minutes of walking data in these environments, which are used exclusively for evaluation. For training, we use the data from [30], which contains 120 thousand frames, corresponding to a total walking time of approximately 2.3 hours. Each environment has different terrain geometries, including flats, steps, and slopes. Each sample contains an image collected from a forward-looking camera mounted on the robot and the (latent) parameters of the terrain below the center of mass of the robot, $z_t$, estimated with a history of proprioception. See [30] for details about the annotation procedure.

The final task consists of predicting $z_t$ from a history of images. We generate several sub-tasks by predicting the future terrain parameters $z_{t+0.8}, z_{t+1.5}$ and the past ones $z_{t-0.8}, z_{t-1.5}$. These time intervals were selected to differentiate between forecasting and estimation. The further the prediction is in the future or the past, the harder the task. The input images might contain little or no information about the terrain at these times; therefore, inferences based on context are required. For example, one can predict the presence of a step in front of the robot from the shadow it casts on the terrain.

We divide the newly collected data into three test datasets: the first is in-distribution, featuring terrains and lighting conditions similar to the training data. The second dataset is out of distribution since it is captured with different lighting conditions, i.e., at night, but in environments with the same features as the training data. Finally, the third dataset contains sandy environments, which the robot has not encountered during training.

Our goal in the experiments is to establish initial performance baselines on the EgoPet tasks. For the VPP task, we hypothesize that EgoPet is a more useful pretraining resource compared to other datasets. We evaluate different pretrained models and compare their performance on the VIP, LP, and VPP tasks. We adopt a simple linear probing protocol, where we freeze the model weights and, for each task, train only a linear layer to predict the output. For evaluation, we use the publicly released checkpoints of the following models, typically trained on IN-1k or K400 unless stated otherwise.
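The linear probing protocol can be sketched as follows. For brevity we use a closed-form ridge fit to one-hot targets on frozen, pre-extracted features as a stand-in for training a linear layer with SGD; the function names and details are illustrative assumptions, not the authors' exact recipe:

```python
import numpy as np

def linear_probe_fit(features, labels, l2=1e-3):
    """Fit a linear head on frozen backbone features (ridge regression to
    one-hot targets as a simple stand-in for softmax training)."""
    n, d = features.shape
    num_classes = labels.max() + 1
    one_hot = np.eye(num_classes)[labels]
    X = np.hstack([features, np.ones((n, 1))])   # append a bias column
    W = np.linalg.solve(X.T @ X + l2 * np.eye(d + 1), X.T @ one_hot)
    return W

def linear_probe_predict(W, features):
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return (X @ W).argmax(axis=1)

# toy example: two well-separated feature clusters stand in for frozen features
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 0.1, (20, 4)), rng.normal(1, 0.1, (20, 4))])
labs = np.array([0] * 20 + [1] * 20)
W = linear_probe_fit(feats, labs)
acc = (linear_probe_predict(W, feats) == labs).mean()
```

The key property of the protocol is that only `W` is learned; the backbone that produced `feats` never receives gradients.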

MAE [21] is trained by masking random patches of the input image and reconstructing the missing pixels through an asymmetric encoder-decoder architecture. Masking a high proportion (e.g., 75%) of the input image makes the reconstruction task non-trivial. In our experiments we use an MAE model pretrained on IN-1k.
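A minimal sketch of the random patch masking at the heart of MAE (the helper name is ours; a ViT-B/16 on 224×224 images has 14×14 = 196 patches):

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio=0.75, rng=None):
    """Sample an MAE-style random mask: True = patch hidden from the encoder."""
    rng = rng or np.random.default_rng()
    num_masked = int(num_patches * mask_ratio)
    perm = rng.permutation(num_patches)
    mask = np.zeros(num_patches, dtype=bool)
    mask[perm[:num_masked]] = True
    return mask

# a 224x224 image with 16x16 patches -> 196 patches, 147 of them masked
mask = random_patch_mask(196, mask_ratio=0.75, rng=np.random.default_rng(0))
```

Only the visible 25% of patches are fed to the encoder, which is what makes MAE pretraining computationally cheap.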

MVP [53] uses the same architecture as MAE but pretrains on a mixture of egocentric datasets, which we refer to as Ego Mix: a combination of Epic-Kitchens [11], 100DOH [43], Ego4D [19], and Something-Something [18].

DINO [8] is trained with a student-teacher architecture over pairs of augmented images by encouraging invariance to the image augmentations. The teacher’s output is centered, and both networks’ normalized features are compared using a cross-entropy loss. The stop-gradient operator ensures gradient propagation only through the student, and teacher parameters are updated using an exponential moving average (ema) of the student parameters.
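The DINO objective described above can be sketched as follows; the temperatures and momentum are typical values, and the function names are our own:

```python
import numpy as np

def softmax(x, tau):
    z = x / tau
    z -= z.max(axis=-1, keepdims=True)       # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between teacher targets (centered, sharpened by a low
    temperature) and student predictions; gradients would flow through the
    student only (the teacher branch is treated as a constant)."""
    t = softmax(teacher_out - center, tau_t)
    log_s = np.log(softmax(student_out, tau_s))
    return -(t * log_s).sum(axis=-1).mean()

def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights are an exponential moving average of the student's."""
    return momentum * teacher_params + (1 - momentum) * student_params

out = np.array([[2.0, 0.0, 0.0]])
loss_same = dino_loss(out, out, center=np.zeros(3))           # agreeing views
loss_diff = dino_loss(np.array([[0.0, 2.0, 0.0]]), out, center=np.zeros(3))
```

Agreement between the two branches yields a near-zero loss, while disagreement is heavily penalized, which is what drives invariance to the augmentations.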

iBOT [59] is trained similarly to DINO but adds an auxiliary masked image modeling (MIM) loss, predicting the patch representations of a learned online tokenizer.

VideoMAE [50] is an extension of MAE for video pre-training. Different from MAE, it utilizes an extremely high masking ratio (90% to 95%) and tube masking as opposed to random masking.
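Tube masking differs from per-frame random masking in that the same spatial patches are hidden in every frame, so a masked patch cannot be trivially recovered from its temporal neighbours. A sketch (helper name ours):

```python
import numpy as np

def tube_mask(num_frames, patches_per_frame, mask_ratio=0.9, rng=None):
    """VideoMAE-style tube masking: one spatial mask is sampled and then
    repeated across all frames of the clip."""
    rng = rng or np.random.default_rng()
    spatial = np.zeros(patches_per_frame, dtype=bool)
    num_masked = int(patches_per_frame * mask_ratio)
    spatial[rng.permutation(patches_per_frame)[:num_masked]] = True
    return np.tile(spatial, (num_frames, 1))   # identical mask per frame

m = tube_mask(8, 196, mask_ratio=0.9, rng=np.random.default_rng(0))
```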

MVD [52] is a masked feature modeling framework for self-supervised video representation learning. Learning the video representations involves distilling student model features from both video and image teachers. We train MVD variants on Ego4D and EgoPet, using VideoMAE (K400) and MAE (IN-1k) as video and image teachers.
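Schematically, the MVD objective combines two feature-regression targets. The sketch below is a deliberately simplified view (plain mean-squared error on already-extracted features, names ours), omitting masking and projection heads:

```python
import numpy as np

def mvd_distill_loss(student_video_feat, student_image_feat,
                     video_teacher_feat, image_teacher_feat):
    """Two-teacher masked feature distillation, schematically: the student
    regresses the video teacher's features and the image teacher's
    features, and the two regression losses are summed."""
    l_video = ((student_video_feat - video_teacher_feat) ** 2).mean()
    l_image = ((student_image_feat - image_teacher_feat) ** 2).mean()
    return l_video + l_image

# toy feature tensors
f = np.ones((4, 8))
loss_zero = mvd_distill_loss(f, f, f, f)        # perfect match -> 0
loss_pos = mvd_distill_loss(f, f, 2 * f, f)     # mismatch with video teacher
```

In our setup, the video teacher is VideoMAE (K400) and the image teacher is MAE (IN-1k), while the student is trained on Ego4D or EgoPet clips.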

Implementation Details. For all models, we used the ViT-B model with patch size 16 since it was available across all methods. For the VIP task, we train all image and video models for 10 epochs. Video models represent video clips of 2 seconds using 8 input frames (4 Hz), and image models use one (the middle) frame. For the VPP task, we train all models for 50 epochs. Video models were trained with varying numbers of frames at 4 Hz. For the LP task, we train all models for 15 epochs. Video models were trained with 16 input frames (30 Hz) and image models used one frame (the last). In our LP experiments, we only use cat and dog segments that are sufficiently long, and 25% of the training data. This left us with 1,129 training segments and 167 validation segments. During the linear probing training phase, we do not apply any image augmentations. All other hyperparameters follow the MAE and MVD linear probing recipes for image and video models, respectively.

In this section, we report initial baseline results from applying a range of models to the VIP, LP, and VPP tasks. Taken together, these results underline the interesting observation that current large video datasets used for pretraining are not diverse enough to perform well across all the EgoPet downstream tasks. For example, pretraining on K400 is better than Ego4D for VIP but worse on VPP. Furthermore, by pretraining on EgoPet, we observe improved downstream performance on the VPP task compared to other models.

The results in Table 2 show that models trained on EgoPet achieve improved performance compared to K400 or Ego4D both on interaction prediction and object prediction. Compared to image-based models like iBOT, MVD trained on EgoPet performs better on Top-3 Acc but worse on Top-1. This is likely due to the diversity of objects appearing in IN-1k, an image recognition dataset. Compared to other video models, MVD (EgoPet) performs better. To obtain more insight into what models focus on in this task, we apply Grad-CAM [41] to our MVD EgoPet interaction classifier. Fig. 7 shows the corresponding heatmaps, which focus on the rat (top left) and another dog (bottom right). In these cases, the model seems to be attending to the object of interaction.
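For intuition, Grad-CAM weights each feature channel by its pooled gradient and sums the weighted maps. When the head is a global-average-pool followed by a linear layer, those pooled gradients have a closed form, which the sketch below exploits (a simplification of the general procedure; names ours):

```python
import numpy as np

def grad_cam_linear_head(feature_maps, class_weights):
    """Grad-CAM for a model whose head is global-average-pool + linear layer.
    feature_maps: (C, H, W) activations; class_weights: (C,) weights of the
    target class. For this head, d(score)/d(A_c) = w_c / (H*W) everywhere,
    so the channel importances alpha_c reduce to w_c / (H*W)."""
    C, H, W = feature_maps.shape
    alphas = class_weights / (H * W)                       # pooled gradients
    cam = np.maximum((alphas[:, None, None] * feature_maps).sum(0), 0.0)
    if cam.max() > 0:
        cam = cam / cam.max()                              # normalize to [0, 1]
    return cam

# toy check: a hot spot in a positively-weighted channel dominates the map,
# while a hot spot in a negatively-weighted channel is suppressed by the ReLU
A = np.zeros((2, 4, 4))
A[0, 1, 2] = 5.0
A[1, 3, 0] = 5.0
cam = grad_cam_linear_head(A, class_weights=np.array([1.0, -1.0]))
```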

For this task, we evaluated models based on their predicted unit motions 40 timesteps into the future, corresponding to 4 seconds. We form trajectories from these predicted motions and compute the RMSE of the Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) metrics against the ground truth trajectories. ATE and RPE are commonly employed metrics for evaluating systems such as SLAM and visual odometry [58, 49, 48, 7]. ATE first aligns the ground truth with the predicted trajectory, and then computes the absolute pose difference. RPE measures the difference between the predicted and ground truth locomotion [47].
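A minimal sketch of both metrics for position-only trajectories: ATE with a Kabsch rotation-plus-translation alignment, and RPE on per-step displacements. Full SLAM evaluations also handle orientation and scale; this simplified version is ours:

```python
import numpy as np

def ate_rmse(gt, pred):
    """Absolute Trajectory Error: rigidly align pred to gt (Kabsch,
    rotation + translation), then take the RMSE of point distances."""
    mu_g, mu_p = gt.mean(0), pred.mean(0)
    U, _, Vt = np.linalg.svd((gt - mu_g).T @ (pred - mu_p))
    S = np.diag([1.0] * (gt.shape[1] - 1) + [np.sign(np.linalg.det(U @ Vt))])
    R = U @ S @ Vt                                  # reflection-safe rotation
    aligned = (pred - mu_p) @ R.T + mu_g
    return np.sqrt(((aligned - gt) ** 2).sum(1).mean())

def rpe_rmse(gt, pred, delta=1):
    """Relative Pose Error (translation part): compare per-step displacements."""
    d_gt, d_pred = gt[delta:] - gt[:-delta], pred[delta:] - pred[:-delta]
    return np.sqrt(((d_pred - d_gt) ** 2).sum(1).mean())

# toy check: a rotated + shifted copy of a 3-D curve has zero ATE after
# alignment, but nonzero RPE because its displacements point elsewhere
t = np.linspace(0, 1, 41)
gt = np.stack([t, t ** 2, np.sin(3 * t)], 1)
theta = np.pi / 6
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta), np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
pred = gt @ Rz.T + np.array([2.0, -1.0, 0.5])
```

This illustrates why the two metrics are complementary: ATE forgives a global rigid offset, while RPE penalizes locally wrong motion.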

The results in Fig. 8 indicate that models trained on EgoPet perform better than those trained on Ego4D and K400, and that the Ego4D model performed second best, possibly because it is also egocentric data. The full results in Supplementary Table 4 indicate that video models perform much better than image models as a whole, which we speculate is due to better modeling of the agent’s velocity and acceleration as well as the motion of other agents.

Table 3 provides the results of the VPP task. It can be seen that using EgoPet data leads to lower errors on this task. Additionally, the results show that using additional past frames as context helps and that video models outperform image models. Within image models, MVP performs better than MAE, likely because it was trained on egocentric data, and iBOT performs best. Within video models, MVD trained on EgoPet achieves lower mean squared error loss compared to the same model trained with K400 or Ego4D, and lower error compared to all image models. We speculate that MVD (EgoPet) outperforms all models because, compared to other datasets, the EgoPet videos are more similar to videos captured by a forward-facing camera mounted on a quadruped robodog. We provide the full results in the Supplementary Table 5.

EgoPet is a video dataset, and as such it primarily contains visual and auditory signals. However, animals interact with their environment using a multitude of senses, including smell and touch. The absence of these sensory modalities in our dataset and model may lead to a partial or skewed understanding of animal behavior and intelligence. Animal behavior is highly complex and influenced by a myriad of factors, including instinct, learning, environmental stimuli, and social interactions. Our tasks, while effective in capturing certain aspects of behavior, may not fully encapsulate the depth and complexity of animal interactions and decision-making processes. Further research is needed to develop more sophisticated tasks and models that can account for complex behavioral patterns.

We present EgoPet, a new comprehensive animal egocentric video dataset. Together with the proposed downstream tasks and benchmark, we believe EgoPet offers a testbed for studying and modeling animal behavior. Our benchmark results demonstrate that interaction prediction is far from solved, which provides an exciting opportunity for future research on modeling animal egocentric agents. Furthermore, the results demonstrate that EgoPet is a useful pretraining resource for downstream robotic locomotion tasks. Future work can broaden the tasks to integrate more sensory inputs, such as audio, thereby creating a richer and more holistic understanding of animal behavior.

Acknowledgements: We thank Justin Kerr for helpful discussions. Many of the figures use images taken from web videos. For each figure, we include the URL to its source videos in the Suppl. Section Figure credits. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC HOLI 819080). Prof. Darrell’s group was supported in part by DoD including DARPA’s LwLL and/or SemaFor programs, as well as BAIR’s industrial alliance programs.

We provide additional information about the EgoPet dataset, annotation process and full quantitative results.

We include more dataset visualizations in Figure 9.

The beginning of an interaction is marked at the first time-step where the agent begins to give attention to a target, and the endpoint is marked at the last time-step before the attention ceases. In addition, annotators were instructed to mark some segments without interactions. This process results in a set of temporal segments, each corresponding to a discrete interaction event or no interaction event. To ensure the consistency of annotations across annotators, annotations are only kept where annotators agree.

The outcome of this process is a richly annotated dataset of 805 subsegments where no interaction occurs (“negative subsegments”) and 644 positive interaction subsegments that capture a wide range of 17 distinct interaction objects such as person, cat, and dog. The subsegments were then split into train and test. This leaves us with 754 training subsegments and 695 test subsegments for a total of 1,449 annotated subsegments.

This is the list of all possible interaction objects: Person, Ball, Bench, Bird, Dog, Cat, Other Animal, Toy, Door, Floor, Food, Plant, Filament, Plastic, Water, Vehicle, Other.

Full LP results. Table 4 contains the quantitative LP results for all models, reported using the ATE and RPE metrics. The results indicate that pretraining on EgoPet leads to better ATE and RPE scores.

Full VPP results. In the main paper we included the VPP results grouped by “past”, “present” and “future” (see Table 3). In Table 5 we provide the full fine-grained VPP results by individual timestep.

https://www.youtube.com/watch?v=69AXB6aFzRU

https://www.tiktok.com/@gonzoisacat/video/7232306745660509483

Table 1: Different video datasets. We compare EgoPet to different datasets with respect to the total time (hours), whether the videos are in first person view (egocentric) with the focus on egomotion, the agent type, and whether agent interaction annotations are available. EgoPet is the first large scale animal dataset that is both egocentric and contains interaction annotations. It is also over 56 times larger than the previous similar dataset, DECADE [14].

Dataset | Total Time (hours) | Egocentric | Egomotion | Agent | Interaction Annotations
BDD100K | 1,111 | ✓ | ✓ | Cars | ✗
Animal Kingdom | 50 | ✗ | ✗ | Animals | ✗
Ego4D | 3,670 | ✓ | ✗ | Humans | ✓
DECADE | 1.5 | ✓ | ✓ | Dog | ✗
EgoPet | 84 | ✓ | ✓ | Animals | ✓

Table 2: Visual Interaction Prediction linear probing results. We report models’ Interaction Prediction Accuracy and AUROC, as well as Object Prediction Top-1 and Top-3 Accuracy.

Model | Dataset | Accuracy | AUROC | Top-1 Acc | Top-3 Acc
MAE | IN-1k | 62.34 | 69.41 | 35.02 | 61.37
MVP | Ego Mix | 65.47 | 68.12 | 33.57 | 59.21
DINO | IN-1k | 65.16 | 73.38 | 37.18 | 60.65
iBOT | IN-1k | 65.16 | 73.50 | 37.55 | 58.12
VideoMAE | K400 | 61.56 | 66.22 | 29.24 | 54.87
MVD | K400 | 65.63 | 70.35 | 35.38 | 62.45
MVD | Ego4D | 64.84 | 70.15 | 33.57 | 62.45
MVD | EgoPet | 68.44 | 74.31 | 35.74 | 64.62

Table 3: Vision to Proprioception Prediction (VPP) linear probing results. We report the mean squared error loss. Models trained on EgoPet perform better than models trained on other datasets. See Supplementary Table 5 for the full results.

Model | Dataset | Past (t−k) | Present (t) | Future (t+k)
1 Frame
MAE | IN-1k | 0.360 | 0.280 | 0.314
MVP | Ego Mix | 0.357 | 0.273 | 0.308
DINO | IN-1k | 0.354 | 0.275 | 0.304
iBOT | IN-1k | 0.350 | 0.278 | 0.304
4 Frames
MVD | K400 | 0.286 | 0.197 | 0.262
MVD | Ego4D | 0.261 | 0.224 | 0.261
MVD | EgoPet | 0.256 | 0.203 | 0.246
8 Frames
MVD | K400 | 0.217 | 0.196 | 0.252
MVD | Ego4D | 0.208 | 0.192 | 0.249
MVD | EgoPet | 0.204 | 0.184 | 0.253


Figure 2: EgoPet video examples. Footage from the EgoPet dataset featuring four different animal experiences, each captured from an egocentric perspective at a distinct point in time.

Figure 3: Descriptive statistics. The histogram depicting the length (in seconds) of EgoPet video sequences exhibits a long-tailed distribution, primarily skewed toward shorter segments of less than 30 seconds. Collectively, videos featuring dogs and cats account for 94% of the total duration, showcasing interactions with people, fellow cats and dogs, toys, and various objects.

Figure 4: Visual Interaction Prediction (VIP) task. The figure illustrates the process of annotating a single video, identifying and categorizing different interactions experienced by a cat, with each segment of the timeline reflecting a unique type of interaction within the animal’s environment.

Figure 5: Locomotion Prediction task. A dog navigates an agility course, highlighting the concept of locomotion prediction by anticipating its forward and upward trajectory to clear the obstacle.

Figure 6: Vision to Proprioception Prediction task. This figure showcases the quadruped robot as it is about to transition from flat ground to climbing steep stairs, illustrating one of the unique terrain environments encountered during the collection of visual and proprioceptive data at annotated time intervals for VPP training.

Figure 7: VIP Grad-CAM [41] visualization.

Figure 8: Locomotion Prediction (LP) linear probing results. We report the validation ATE and RPE as a function of the epoch during training, comparing the impact of various datasets (Kinetics, Ego4D, EgoPet). Models trained on EgoPet perform better than models trained on other datasets. See Supplementary Table 4 for the full results.

Figure credits


Animals are intelligent agents that exhibit various cognitive and behavioral traits. They plan and act to accomplish complex goals and can interact with objects or other agents. Consider a cat attempting to catch a rat; this requires the cat to execute a precise sequence of actions with impeccable timing, all while responding to the rat’s efforts to escape.

Current Artificial Intelligence (AI) systems can synthesize high quality images [39, 40], generate coherent text [5, 51], and even code Python programs [9]. But despite this remarkable progress, there are basic animal behaviors that are beyond the reach of current models. Recently, there has been a significant body of research in robotics aimed at learning policies for quadruped locomotion, and other basic actions [27, 25, 1, 32, 42, 10, 33, 2]. However, we argue that a major limitation in advancing towards more complex systems is the availability of large-scale, real-world data.

To address this, we present EgoPet, a new web-scale dataset from the perspective of pets. EgoPet contains more than 84 hours of video, including different animals like dogs, cats, eagles, turtles, and more. This video footage reveals the world through the eyes of the pet as perceived in its day-to-day life, e.g., a dog going for a walk or entering a park, or a cat wandering freely around a farm. The video data was sourced from the internet and predominantly includes pet videos, hence the name EgoPet.

To measure progress in modeling and learning from animals, we propose three new tasks that aim to capture perception and action (see Fig. 1): Visual Interaction Prediction (VIP), Locomotion Prediction (LP), and Vision to Proprioception Prediction (VPP). Together with these tasks, we provide annotated training and validation data used for downstream evaluation.

The VIP task aims to detect and classify animal interactions and is inspired by human-object interaction tasks [43]. We temporally annotated a subset of the EgoPet videos with the start and end times of visual interactions and the category of the interaction object. The categories, which include person, cat, and dog, were chosen based on how commonly they occurred as objects (for all the categories refer to Supplementary Section Figure credits).

The goal of the LP task is to predict the future 4-second trajectory of the pet. This is useful for learning basic pet skills like avoiding obstacles or navigating. We extracted pseudo ground truth trajectories using Deep Patch Visual Odometry (DPVO) [49], the best-performing SLAM system for our dataset. We manually filtered inaccurate trajectories in the validation data to ensure high-quality evaluation.

Finally, in the VPP task, we study EgoPet’s utility for a downstream robotic task: legged locomotion. Given a video observation from a forward-facing camera mounted on a quadruped robot, the goal is to predict the features of the terrain perceived by the robot’s proprioception across its trajectory. Making accurate predictions requires perceiving the landscape and anticipating the robot controls. This differs from previous works on robot visual prediction [30, 45, 31], which require conditioning over current robot controls and are thus challenging to train at scale. To assess performance in this task, we gathered data utilizing a quadruped robodog. This data includes paired videos and proprioception features, which are then utilized for subsequent training and evaluation processes.

We train various self-supervised models and evaluate how they perform downstream using a simple linear probing protocol. We make the surprising finding that pretraining on EgoPet yields better performance than pretraining on other, much larger video datasets like Ego4D [19] and Kinetics 400 [23]. This indicates the inadequacy of current datasets in studying animal-like physical skills.

Our contributions are as follows. First, we propose EgoPet, the first large-scale egocentric animal video dataset, comprising over 84 hours of video footage to facilitate learning from animals. We propose three new tasks, including human-annotated data, and set an initial benchmark. The downstream results on the VPP task indicate that EgoPet is a useful pretraining resource for quadruped locomotion, and the benchmark results on VIP show that the proposed tasks are still far from being solved, providing an exciting new opportunity to build models that capture the world through the eyes of animals.

Next, we review the related work on video datasets, including notable research on general video datasets of humans and animals and those focusing specifically on egocentric video data.

Video Datasets. In recent years, a variety of video datasets have played an important role in video understanding tasks. In human action recognition, datasets like UCF101 [46], Charades-Ego [44], AVA [20], FineDiving [54], and the Something-Something dataset [18] provide comprehensive coverage of human activities, ranging from daily actions to specialized sports movements. Among these, Kinetics (K400) [23] is particularly influential, advancing the study of human actions through a wide array of video clips.

Other works aimed to collect data to study animals. These works include datasets such as the Animal Kingdom [36], which contains videos of various species, and MacaquePose [26], which focuses on non-human primates. These datasets are instrumental for AI advancements in wildlife recognition and interpretation. AP-10K [57] further augments this domain by providing a detailed collection of animal images for robust pose estimation. While sharing a similar motivation to our work, existing datasets on animal behavior rarely contain egocentric views and are therefore better suited to recognition problems than to studying animals’ physical capabilities. For autonomous driving and vehicle motion, datasets like the Berkeley DeepDrive [56, 17] and KITTI [17] offer extensive insights into vehicle egomotion and environmental interactions. While these datasets enrich our understanding of motion, behavior, and interaction from a human-centric perspective, they offer limited insights into animal behavior.

Egocentric Video Datasets. Agents interact with the world from a first-person point of view, thus collecting such data has many applications from video understanding to augmented reality. In the past decade, many egocentric datasets were collected [16, 11, 44, 38, 28], with the majority of them focusing on human activities and object interactions in indoor environments (e.g., kitchens). For example, Epic Kitchens [11, 12] is a large cooking dataset that takes place in 45 kitchens across 4 different cities, whereas Charades-Ego [44] consists of 4,000 paired videos of human actions in first and third person. Other datasets are more focused on conversation and social interactions [15, 35, 37]. Existing datasets differ by the environments in which they are recorded (e.g., outdoor vs. kitchens), whether they are scripted or not, and the number of videos. Recently, Ego4D [19], a new comprehensive egocentric dataset, was released. Different from previous datasets, it is more diverse (e.g., indoor and outdoor activities, diverse geographical locations). However, while existing datasets focus on humans and human skills, our focus is on animal agents, which have more limited language and hand-object interactions. The most related egocentric dataset is DECADE [14], which consists of an hour of footage of a single dog, including joint location annotations. Inspired by DECADE, EgoPet is a much larger web-scale dataset (84 hours) and much more diverse.

The EgoPet dataset is a unique collection of egocentric video footage primarily featuring dogs and cats, along with various other animals like eagles, wolves, turtles, sea turtles, sharks, snakes, cheetahs, pythons, geese, alligators, and dolphins (examples included in Fig. 2 and Suppl. Figure 9). Together with the proposed downstream tasks and benchmark, EgoPet is a valuable resource for researchers and enthusiasts interested in studying animals from an egocentric perspective.

We begin with the motivation behind EgoPet and its connection to existing datasets in Section 3.1. We then delve into the dataset’s statistics in Section 3.2 and the collection process in Section 3.3.

To provide a clearer understanding of EgoPet’s significance, we compare it with various other datasets, considering factors such as total video duration, perspective (egocentric or non-egocentric), egomotion, the agents involved, and the presence of interaction annotations, which are crucial for intelligent agents. Refer to Table 1 for more details. In terms of size, Ego4D [19] is the largest egocentric video dataset, and it centers on human activities, while the BDD100K [56] dataset includes both egocentric and egomotion elements but it focuses on autonomous driving. Differently, EgoPet focuses on animals, and pets in particular. Among animal video datasets, the DECADE [14] dataset provides an egocentric perspective from a dog’s viewpoint, but it only records 1.5 hours of video. EgoPet expands this vision by over 56 times in volume and includes a variety of species and interactions.

The EgoPet dataset is an extensive collection composed of 6,646 video segments distilled from 819 unique videos. High level statistics are provided in Fig. 3. These original videos were sourced predominantly from TikTok, accounting for 482 videos, while the remaining 338 were obtained from YouTube. The aggregate length of all video segments amounts to approximately 84 hours, which reflects a substantial volume of data for in-depth analysis. In terms of video duration, the segments exhibit an average span of 45.55 seconds, although the duration displays considerable variability, as indicated by the standard deviation of 192.19 seconds. This variation underscores the range of contexts captured within the dataset, from brief encounters to prolonged interactions.

Breaking down the dataset by animal representation, cats and dogs constitute the majority, with 4,567 and 1,905 segments, respectively. This reflects the dataset’s strong emphasis on common domestic animals while still covering less frequent but equally important species. Notably, the dataset includes segments featuring eagles (66), turtles (31), and a diverse group of other animals such as alligators, lizards, and dolphins, contributing to a rich collection of animal behaviors captured through an egocentric lens.

The camera positioning, i.e., where the recording device was attached, also varies: the majority of segments were captured from cameras placed on the neck (4,575) and body (1,817). Fewer segments were recorded from cameras positioned on the head (199), shell (36), collar (11), and fin (8), offering a range of perspectives that can inform how different mounting points might influence the perception of the environment from an animal’s viewpoint.

Collection Strategy. To collect the dataset, we manually searched for videos using a large set of queries on YouTube and TikTok. For example, “egocentric view”, “dog with a GoPro”, and similar phrases related to first-person animal perspectives. This led to scraping a vast pool of footage showcasing animals, primarily dogs and cats, wearing wearable cameras, allowing for an egocentric point of view. In pursuit of a broader video selection, our efforts extended to individual channels and authors known for their thematic consistency in publishing egocentric animal footage. This approach allowed us to tap into niche communities and content creators, yielding a wide variety of egocentric videos beyond the reach of generic search terms.

Dataset Refinement. A meticulous annotation process was carried out to ensure the dataset’s quality. A human annotator reviewed the collected videos to confirm that they were from an egocentric point of view. Non-egocentric or irrelevant segments were carefully removed.

All videos were adjusted to a frame rate of 30 frames per second and resized to 480p on the shortest side while maintaining the original aspect ratio. The videos were then segmented into discrete clips, during which any non-egocentric footage was removed. The final dataset consists of segments of at least three seconds, ensuring sufficient context for each interaction.

In order to allow quantitative comparisons of animal-prediction approaches, we next define several prediction tasks on the EgoPet dataset. We provide annotated datasets based on these tasks, which will allow effective benchmarking of different approaches.

Motivation. Human activities such as actions and interactions from an egocentric viewpoint have been previously explored in various datasets mostly focusing on activity recognition [24, 29, 60], human-object interactions [13, 6, 34, 11], and social interactions [15, 35, 55]. Inspired by these works, we focus on animal interactions with other agents or objects, and for simplicity, we only consider visual interactions. Observing interactions through an egocentric perspective offers insights into how animals navigate their world, how they communicate with other beings, and how their physical movements correlate with environmental stimuli. Being able to identify interactions is a core task in computer vision and robotics with practical applications in designing systems that can operate in dynamic, real-world settings.

Task Description. The input for this task is a video clip from the egocentric perspective of an animal. The labels are twofold: a binary label indicating whether an interaction is taking place or not, and a categorical label describing the object of the interaction. This binary label simplifies the vast range of potential interactions into a manageable form for the model, while the identification of the interaction object adds a layer of specificity necessary for understanding the context of the interaction.

In the context of the EgoPet dataset, a “visual interaction” is defined as a discernible event where the agent, typically an animal such as a dog or cat, demonstrates clear attention to an object or another agent within its environment. This attention may be manifested through physical contact, proximity, orientation, or vocalization (such as barking or making sounds) toward the target of the interaction, which can be an object or another agent. The fundamental criterion for a visual interaction is the presence of visual evidence within the video that the agent is engaged with, or reacting to, a particular stimulus. Aimless movements, such as wandering without a clear target or displaying alertness without a specific focus, are not labeled as visual interactions.

Annotations. The data labeling process for marking interactions involved a meticulous analysis of the video content, which resulted in the annotation of 1,449 subsegments (see Fig. 4). Two human annotators were trained to identify and timestamp the start and end of an interaction event. The outcome of this process is a richly annotated dataset of 805 subsegments where no interaction occurs (“negative subsegments”) and 644 positive interaction subsegments that capture a wide range of 17 distinct interaction objects such as person, cat, and dog. The subsegments were then split into train and test sets. This leaves us with 754 training subsegments and 695 test subsegments for a total of 1,449 annotated subsegments. To see the full annotation process refer to Suppl. Section Figure credits.

Motivation. Planning where to move involves a complex interplay of both perception and foresight. It requires the ability to anticipate potential obstacles, consider various courses of action, and select the most efficient and effective strategy to achieve a desired goal. EgoPet contains examples where animals plan a future trajectory to achieve a certain goal (e.g., a dog following its owner; see Fig. 5).

Task Description. Given a sequence of past $m$ video frames $\{x_i\}_{i=t-m}^{t}$, the goal is to predict the unit normalized future trajectory of the agent $\{v_j\}_{j=t+1}^{t+k}$, where $v_j \in \mathbb{R}^3$ represents the relative location of the agent at timestep $j$. We predict the unit normalized relative location due to the scale ambiguity of the extracted trajectories. In practice, we condition models on $m=16$ frames and predict $k=40$ future locations, which correspond to 4 seconds into the future.
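The unit normalization can be sketched as follows; the construction below, which normalizes each relative displacement so that only the direction of motion is predicted, is one plausible reading of the task (the function name is ours):

```python
import numpy as np

def unit_normalized_targets(positions):
    """Convert a camera-position sequence of shape (k+1, 3) into k
    unit-normalized relative displacements; the scale ambiguity of
    monocular SLAM leaves only the direction of motion meaningful."""
    disp = np.diff(positions, axis=0)                    # relative locations v_j
    norms = np.linalg.norm(disp, axis=1, keepdims=True)
    return disp / np.where(norms > 0, norms, 1.0)        # guard zero motion

# toy trajectory: one step along x, then two units along y
traj = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 2.0, 0.0]])
v = unit_normalized_targets(traj)
```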

Annotations. To obtain pseudo ground truth agent trajectories, we used Deep Patch Visual Odometry (DPVO [49]), a system for monocular visual odometry that utilizes sparse patch-based matching across frames. This system largely outperformed other open-source SLAM systems in terms of convergence rate and qualitative accuracy in our experiments.

Given an input sequence of frames, DPVO returns the location and orientation of the camera for each frame. To obtain training trajectories, we feed videos with a stride of 5 to DPVO. To ensure high-quality evaluation, we feed validation videos with strides of 5, 10, and 15 into DPVO and evaluate the quality of the trajectories manually. Specifically, two human annotators were trained to evaluate the trajectories from an eagle’s eye view (XZ view) and determine the best matching trajectory, if any, to the video. This left us with 6,126 annotated training segments and 249 validation segments.

Motivation. Understanding animal behavior could be instrumental to several robotics applications. To demonstrate the value of our dataset for robotics, we propose a task based on the problem of vision-based locomotion. Specifically, the task consists of predicting the parameters of the terrain a quadrupedal robot is walking on (see Fig. 6). As shown in multiple previous works on locomotion [30, 31, 22, 4, 3, 45], accurate prediction of these parameters is correlated with improved performance in locomotion. Intuitively, the EgoPet data closely resembles the video captured by a quadruped robot since a camera mounted on a pet is approximately at the same location as the camera mounted on the robot. In addition, the task of walking is highly represented in the dataset.

Task Description. The parameters we would like to predict are the local terrain geometry, the terrain’s friction, and the parameters related to the robot’s walking behavior on the terrain, including the robot’s speed, motor efficiency, and high-level command. The exact identification of these parameters is generally impossible [25]: two terrains could have different combinations of parameters but “feel” the same to an agent. For example, walking on sand and mud could similarly affect the robot’s proprioception, even though their properties differ. Therefore, similarly to previous work [25, 27, 30], we aim to predict a latent representation $z_t$ of the terrain parameters. This latent representation consists of the hidden layer of a neural network trained in simulation to encode ground-truth terrain parameters. This neural network is trained end-to-end with an action policy on locomotion using reinforcement learning. The task consists of predicting the latent terrain representation at different time intervals from a sequence of frames. Our setup closely follows [30]. Specifically, we augment EgoPet with data collected with a quadrupedal robot in multiple outdoor environments with different terrain characteristics, e.g., sand or grass.

Dataset and Annotations. To collect the dataset, we deployed the walking policy of [30] on a Unitree A1 robot dog in three environments: an office, a park, and a beach. We collect approximately 20 minutes of walking data in these environments, which are used exclusively for evaluation. For training, we use the data from [30], which contains 120 thousand frames, corresponding to a total walking time of approximately 2.3 hours. Each environment has different terrain geometries, including flats, steps, and slopes. Each sample contains an image collected from a forward-looking camera mounted on the robot and the (latent) parameters $z_t$ of the terrain below the robot’s center of mass, estimated with a history of proprioception. See [30] for details about the annotation procedure.

The final task consists of predicting $z_t$ from a history of images. We generate several sub-tasks by predicting the future terrain parameters $z_{t+0.8}, z_{t+1.5}$ and the past ones $z_{t-0.8}, z_{t-1.5}$. These time intervals were selected to differentiate between forecasting and estimation. The further the prediction is in the future or the past, the harder the task is: the input images might contain little or no information about the terrain at these times, so inference based on context is required. For example, one can predict the presence of a step in front of the robot from the shadow it casts on the terrain.
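Constructing these sub-tasks amounts to pairing each anchor frame with the terrain latent shifted by the chosen time offsets. A hedged sketch (the function name and the fps/offset handling are our assumptions, not the authors' code):

```python
import numpy as np

def make_vpp_targets(latents, fps, offsets=(-1.5, -0.8, 0.0, 0.8, 1.5)):
    """Pair each anchor timestep t with terrain latents z_{t+dt} for
    several offsets dt (in seconds). latents: (T, D), frame-aligned."""
    T = len(latents)
    shifts = [int(round(dt * fps)) for dt in offsets]
    lo, hi = -min(shifts), T - max(shifts)   # anchors where all shifts are valid
    anchors = np.arange(lo, hi)
    targets = {dt: latents[anchors + s] for dt, s in zip(offsets, shifts)}
    return anchors, targets
```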

We divide the newly collected data into three test datasets: the first is in-distribution, featuring terrains and lighting conditions similar to the training data. The second dataset is out of distribution since it is captured with different lighting conditions, i.e., at night, but in environments with the same features as the training data. Finally, the third dataset contains sandy environments, which the robot has not encountered during training.

Our goal in the experiments is to establish initial performance baselines on the EgoPet tasks. For the VPP task, we hypothesize that EgoPet is a more useful pretraining resource compared to other datasets. We evaluate different pretrained models and compare their performance on the VIP, LP, and VPP tasks. We adopt a simple linear probing protocol, where we freeze the model weights and, for each task, train only a linear layer to predict the output. For evaluation, we use the models’ publicly released checkpoints, typically trained on IN-1k or K400, unless stated otherwise.
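The linear probing protocol can be summarized as: extract features with the frozen pretrained backbone, then fit only a linear layer on top. A minimal sketch (we substitute a closed-form least-squares fit for the SGD-trained linear layer used in the paper):

```python
import numpy as np

def linear_probe(feats_train, y_train, feats_test):
    """Fit only a linear head on frozen-encoder features; the backbone
    itself is never updated. Least-squares stands in for SGD training."""
    X = np.hstack([feats_train, np.ones((len(feats_train), 1))])   # add bias
    W, *_ = np.linalg.lstsq(X, y_train, rcond=None)
    Xt = np.hstack([feats_test, np.ones((len(feats_test), 1))])
    return Xt @ W
```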

MAE [21] is trained by masking random patches of the input image and reconstructing the missing pixels through an asymmetric encoder-decoder architecture; a high proportion (e.g., 75%) of the input image is masked. In our experiments we use an MAE model pretrained on IN-1k.

MVP [53] uses the same model as MAE, but trains it on a mixture of egocentric datasets, which we refer to as Ego Mix: a combination of Epic-Kitchens [11], 100DOH [43], Ego4D [19], and Something-Something [18].

DINO [8] is trained with a student-teacher architecture over pairs of augmented images by encouraging invariance to the image augmentations. The teacher’s output is centered, and both networks’ normalized features are compared using a cross-entropy loss. The stop-gradient operator ensures gradient propagation only through the student, and teacher parameters are updated using an exponential moving average (ema) of the student parameters.

iBOT [59] is trained similarly to DINO, but adds an auxiliary masked image modeling (MIM) loss by predicting the patch representations of a learned online tokenizer.

VideoMAE [50] is an extension of MAE for video pre-training. Different from MAE, it utilizes an extremely high masking ratio (90% to 95%) and tube masking as opposed to random masking.

MVD [52] is a masked feature modeling framework for self-supervised video representation learning. Learning the video representations involves distilling student model features from both video and image teachers. We train MVD variants on Ego4D and EgoPet, using VideoMAE (K400) and MAE (IN-1k) as video and image teachers.

Implementation Details. For all models, we used the ViT-B architecture with patch size 16, since it was available across all methods. For the VIP task, we train all image and video models for 10 epochs. Video models represent video clips of 2 seconds using 8 input frames (4 Hz), and image models use one (the middle) frame. For the VPP task, we train all models for 50 epochs. Video models were trained with varying numbers of frames at 4 Hz. For the LP task, we train all models for 15 epochs. Video models were trained with 16 input frames (30 Hz) and image models used one frame (the last). In our LP experiments, we only use cat and dog segments that are sufficiently long, and 25% of the training data. This left us with 1,129 training segments and 167 validation segments. During the linear probing training phase, we do not apply any image augmentations. All other hyperparameters follow the MAE and MVD linear probing recipes for image and video models, respectively.

In this section, we report initial baseline results from applying a range of models to the VIP, LP, and VPP tasks. Taken together, these results underline the interesting observation that current large video datasets used for pretraining are not diverse enough to perform well across all the EgoPet downstream tasks. For example, pretraining on K400 is better than Ego4D for VIP but worse on VPP. Furthermore, by pretraining on EgoPet, we observe improved downstream performance on the VPP task compared to other models.

The results in Table 2 show that models trained on EgoPet achieve improved performance compared to K400 or Ego4D on both interaction prediction and object prediction. Compared to image-based models like iBOT, MVD trained on EgoPet performs better on Top-3 accuracy but worse on Top-1. This is likely due to the diversity of objects appearing in IN-1k, an image recognition dataset. Compared to other video models, MVD (EgoPet) performs better. To obtain more insight into what models focus on in this task, we apply Grad-CAM [41] to our MVD EgoPet interaction classifier. Fig. 7 shows the corresponding heatmaps, which focus on the rat (top-left) and another dog (bottom-right). In these cases, the model appears to be attending to the object of interaction.

For this task, we evaluated models based on their predicted unit motions 40 timesteps into the future, corresponding to 4 seconds. We form trajectories from these predicted motions and compute the RMSE of the Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) metrics against the ground truth trajectories. ATE and RPE are commonly employed metrics for evaluating systems such as SLAM and visual odometry [58, 49, 48, 7]. ATE first aligns the ground truth with the predicted trajectory, and then computes the absolute pose difference. RPE measures the difference between the predicted and ground truth locomotion [47].
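For reference, simplified versions of the two metrics can be sketched as below (full ATE additionally solves for a rigid or similarity alignment, e.g. via the Umeyama method; here we align translation only):

```python
import numpy as np

def ate_rmse(gt, pred):
    """Absolute Trajectory Error (simplified): align the two trajectories
    by their centroids, then take the RMSE of per-step position errors."""
    gt_c = gt - gt.mean(axis=0)
    pred_c = pred - pred.mean(axis=0)
    return np.sqrt(np.mean(np.sum((gt_c - pred_c) ** 2, axis=1)))

def rpe_rmse(gt, pred):
    """Relative Pose Error: RMSE of the difference between consecutive
    relative motions; insensitive to global drift."""
    d_gt, d_pred = np.diff(gt, axis=0), np.diff(pred, axis=0)
    return np.sqrt(np.mean(np.sum((d_gt - d_pred) ** 2, axis=1)))
```

Note that a globally shifted trajectory has zero error under both metrics, while a wrongly scaled one is penalized.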

The results in Fig. 8 indicate that models trained on EgoPet perform better than Ego4D and K400 and that the Ego4D model performed second best, possibly due to also being egocentric data. The full results in Supplementary Table 4 indicate that video models perform much better than image models as a whole, which we speculate is due to better modeling of the agent’s velocity and acceleration as well as the motion of other agents.

Table 3 provides the result of the VPP task. It can be seen that using EgoPet data leads to lower errors on this task. Additionally, the results show that using additional past frames as context helps and that video models outperform image models. Within image models, MVP performs better than MAE, likely because it was trained on egocentric data, and iBOT performs best. Within video models, MVD trained on EgoPet achieves lower mean squared error loss compared to the same model trained with K400 or Ego4D, and lower error compared to all image models. We speculate that MVD (EgoPet) outperforms all models because compared to other datasets the EgoPet videos are more similar to videos captured by a forward-facing camera mounted on a quadruped robodog. We provide the full results in the Supplementary Table 5.

EgoPet is a video dataset, and as such it primarily contains visual and auditory signals. However, animals interact with their environment using a multitude of senses, including smell and touch. The absence of these sensory modalities in our dataset and model may lead to a partial or skewed understanding of animal behavior and intelligence. Animal behavior is highly complex and influenced by a myriad of factors, including instinct, learning, environmental stimuli, and social interactions. Our tasks, while effective in capturing certain aspects of behavior, may not fully encapsulate the depth and complexity of animal interactions and decision-making processes. Further research is needed to develop more sophisticated tasks and models that can account for complex behavioral patterns.

We present EgoPet, a new comprehensive animal egocentric video dataset. Together with the proposed downstream tasks and benchmark, we believe EgoPet offers a testbed for studying and modeling animal behavior. Our benchmark results demonstrate that interaction prediction is far from solved, which provides an exciting opportunity for future research on modeling animal egocentric agents. Furthermore, the results demonstrate that EgoPet is a useful pretraining resource for downstream robotic locomotion tasks. Future work could broaden the tasks to integrate more sensory inputs like audio, thereby creating a richer and more holistic understanding of animal behavior.

Acknowledgements: We thank Justin Kerr for helpful discussions. Many of the figures use images taken from web videos. For each figure, we include the URL to its source videos in the Suppl. “Figure credits” section. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant ERC HOLI 819080). Prof. Darrell’s group was supported in part by DoD including DARPA’s LwLL and/or SemaFor programs, as well as BAIR’s industrial alliance programs.

We provide additional information about the EgoPet dataset, annotation process and full quantitative results.

We include more dataset visualizations in Figure 9.

The beginning of an interaction is marked at the first time-step where the agent begins to give attention to a target, and the endpoint is marked at the last time-step before the attention ceases. In addition, annotators were instructed to mark some segments without interactions. This process results in a set of temporal segments, each corresponding to a discrete interaction event or no interaction event. To ensure the consistency of annotations across annotators, annotations are only kept where annotators agree.

The outcome of this process is a richly annotated dataset of 805 subsegments where no interaction occurs (“negative subsegments”) and 644 positive interaction subsegments that capture a wide range of 17 distinct interaction objects such as person, cat, and dog. The subsegments were then split into train and test, leaving 754 training subsegments and 695 test subsegments for a total of 1,449 annotated subsegments.

This is the list of all possible interaction objects: Person, Ball, Bench, Bird, Dog, Cat, Other Animal, Toy, Door, Floor, Food, Plant, Filament, Plastic, Water, Vehicle, Other.

Full LP results. Table 4 contains the quantitative LP results for all models, reported using the ATE and RPE metrics. The results indicate that pretraining on EgoPet leads to better ATE and RPE scores.

Full VPP results. In the main paper we included the VPP results grouped by “past”, “present” and “future” (see Table 3). In Table 5 we provide the full fine-grained VPP results by individual timestep.

Some of the figures in the paper were created from web videos. We credit the original content creators and provide links to the original videos below.

https://www.youtube.com/watch?v=69AXB6aFzRU

https://www.tiktok.com/@gonzoisacat/video/7232306745660509483

Table: S3.T1: Different video datasets. We compare EgoPet to different datasets with respect to the total time (hours), whether the videos are in first person view (egocentric) with the focus on egomotion, the agent type, and whether agent interaction annotations are available. EgoPet is the first large scale animal dataset that is both egocentric and contains interaction annotations. It is also over 56 times larger than the previous similar dataset DECADE [14].

Dataset          Total Time (hours)   Egocentric   Egomotion   Agent     Interaction Annotations
BDD100K          1,111                ✓            ✓           Cars      ✗
Animal Kingdom   50                   ✗            ✗           Animals   ✗
EGO4D            3,670                ✓            ✗           Humans    ✓
DECADE           1.5                  ✓            ✓           Dog       ✗
EgoPet           84                   ✓            ✓           Animals   ✓

Table: S5.T2: Visual Interaction Prediction linear probing results. We report each model’s interaction prediction accuracy and AUROC, as well as object prediction Top-1 and Top-3 accuracy.

Model      Dataset   Accuracy   AUROC   Top-1 Acc   Top-3 Acc
MAE        IN-1k     62.34      69.41   35.02       61.37
MVP        Ego Mix   65.47      68.12   33.57       59.21
DINO       IN-1k     65.16      73.38   37.18       60.65
iBOT       IN-1k     65.16      73.50   37.55       58.12
VideoMAE   K400      61.56      66.22   29.24       54.87
MVD        K400      65.63      70.35   35.38       62.45
MVD        Ego4D     64.84      70.15   33.57       62.45
MVD        EgoPet    68.44      74.31   35.74       64.62

Table: S6.T3: Vision to Proprioception Prediction (VPP) linear probing results. We report the mean squared error loss. Models trained on EgoPet perform better than models trained on other datasets. See Supplementary Table 5 for the full results.

Model   Dataset   Past (t−k)   Present (t)   Future (t+k)
1 Frame
MAE     IN-1k     0.360        0.280         0.314
MVP     Ego Mix   0.357        0.273         0.308
DINO    IN-1k     0.354        0.275         0.304
iBOT    IN-1k     0.350        0.278         0.304
4 Frames
MVD     K400      0.286        0.197         0.262
MVD     Ego4D     0.261        0.224         0.261
MVD     EgoPet    0.256        0.203         0.246
8 Frames
MVD     K400      0.217        0.196         0.252
MVD     Ego4D     0.208        0.192         0.249
MVD     EgoPet    0.204        0.184         0.253


Figure: EgoPet video examples. Footage from the EgoPet dataset featuring four different animal experiences, each captured from an egocentric perspective at a distinct point in time.

Figure: Descriptive statistics. The histogram depicting the length (in seconds) of EgoPet video sequences exhibits a long-tailed distribution, primarily skewed toward shorter segments of less than 30 seconds. Collectively, videos featuring dogs and cats account for 94% of the total duration, showcasing interactions with people, fellow cats and dogs, toys, and various objects.

Figure: Visual Interaction Prediction task. The figure illustrates the process of annotating a single video, identifying and categorizing different interactions experienced by a cat, with each segment of the timeline reflecting a unique type of interaction within the animal’s environment.

Figure: Locomotion Prediction task. A dog navigates an agility course, highlighting the concept of locomotion prediction by anticipating its forward and upward trajectory to clear the obstacle.

Figure: Vision to Proprioception Prediction task. This figure showcases the quadruped robot as it is about to transition from flat ground to climbing steep stairs, illustrating one of the unique terrain environments encountered during the collection of visual and proprioceptive data at annotated time intervals for VPP training.

Figure: VIP Grad-CAM [41] visualization.

Figure: Locomotion Prediction (LP) linear probing results. We report the validation ATE and RPE as a function of the epoch during training, comparing the impact of various datasets (Kinetics, Ego4D, EgoPet). Models trained on EgoPet perform better than models trained on other datasets. See Supplementary Table 4 for the full results.



Animals are intelligent agents that exhibit various cognitive and behavioral traits. They plan and act to accomplish complex goals and can interact with objects or other agents. Consider a cat attempting to catch a rat; this requires the cat to execute a precise sequence of actions with impeccable timing, all while responding to the rat’s efforts to escape.

Current Artificial Intelligence (AI) systems can synthesize high quality images [39, 40], generate coherent text [5, 51], and even code Python programs [9]. But despite this remarkable progress, there are basic animal behaviors that are beyond the reach of current models. Recently, there has been a significant body of research in robotics aimed at learning policies for quadruped locomotion, and other basic actions [27, 25, 1, 32, 42, 10, 33, 2]. However, we argue that a major limitation in advancing towards more complex systems is the availability of large-scale, real-world data.

To address this, we present EgoPet, a new web-scale dataset from the perspective of pets. EgoPet contains more than 84 hours of video, including different animals like dogs, cats, eagles, turtles, and more. This video footage reveals the world from the eye of the pet as perceived in its day-to-day life, e.g., a dog going for a walk or entering a park, or a cat wandering freely around a farm. The video data was sourced from the internet and predominantly includes pet videos, hence the name EgoPet.

To measure progress in modeling and learning from animals, we propose three new tasks that aim to capture perception and action (see Fig. 1): Visual Interaction Prediction (VIP), Locomotion Prediction (LP), and Vision to Proprioception Prediction (VPP). Together with these tasks, we provide annotated training and validation data used for downstream evaluation.

The VIP task aims to detect and classify animal interactions and is inspired by human-object interaction tasks [43]. We temporally annotated a subset of the EgoPet videos with the start and end times of visual interactions and the object of the interaction category. The categories, which include person, cat, and dog, were chosen based on how commonly they occurred as objects (for the full category list, refer to the Supplementary Material).

The goal of the LP task is to predict the future 4-second trajectory of the pet. This is useful for learning basic pet skills like avoiding obstacles or navigating. We extracted pseudo ground truth trajectories using Deep Patch Visual Odometry (DPVO) [49], the best-performing SLAM system for our dataset. We manually filtered inaccurate trajectories in the validation data to ensure high-quality evaluation.

Finally, in the VPP task, we study EgoPet’s utility for a downstream robotic task: legged locomotion. Given a video observation from a forward-facing camera mounted on a quadruped robot, the goal is to predict the features of the terrain perceived by the robot’s proprioception across its trajectory. Making accurate predictions requires perceiving the landscape and anticipating the robot controls. This differs from previous works on robot visual prediction [30, 45, 31], which require conditioning over current robot controls and are thus challenging to train at scale. To assess performance in this task, we gathered data utilizing a quadruped robodog. This data includes paired videos and proprioception features, which are then utilized for subsequent training and evaluation processes.

We train various self-supervised models and evaluate how they perform downstream using a simple linear probing protocol. We make the surprising finding that pretraining on EgoPet yields better performance than pretraining on other, much larger video datasets like Ego4D [19] and Kinetics 400 [23]. This indicates the inadequacy of current datasets in studying animal-like physical skills.

Our contributions are as follows. First, we propose EgoPet, the first large-scale egocentric animal video dataset, comprised of over 84 hours of video footage to facilitate learning from animals. Second, we propose three new tasks, including human-annotated data, and set an initial benchmark. The downstream results on the VPP task indicate that EgoPet is a useful pretraining resource for quadruped locomotion, and the benchmark results on VIP show that the proposed tasks are still far from being solved, providing an exciting new opportunity to build models that capture the world through the eyes of animals.

Next, we delve into the related works surrounding video datasets, including notable research on both general video datasets of humans and animals and those focusing specifically on egocentric video data.

Video Datasets. In recent years, a variety of video datasets have played an important role in video understanding tasks. In human action recognition, datasets like UCF101 [46], Charades-Ego [44], AVA [20], FineDiving [54], and the Something-Something dataset [18] provide comprehensive coverage of human activities, ranging from daily actions to specialized sports movements. Among these, Kinetics (K400) [23] is particularly influential, advancing the study of human actions through a wide array of video clips.

Other works aimed to collect data to study animals. These works include datasets such as the Animal Kingdom [36], which contains videos of various species, and MacaquePose [26], which focuses on non-human primates. These datasets are instrumental for AI advancements in wildlife recognition and interpretation. AP-10K [57] further augments this domain by providing a detailed collection of animal images for robust pose estimation. While sharing a similar motivation to our work, existing datasets on animal behavior rarely contain egocentric views and are therefore better suited to recognition problems than the animals’ physical capabilities. For autonomous driving and vehicle motion, datasets like the Berkeley DeepDrive [56, 17] and KITTI [17] offer extensive insights into vehicle egomotion and environmental interactions. While these datasets enrich our understanding of motion, behavior, and interaction from a human-centric perspective, they offer limited insights into animal behavior.

Egocentric Video Datasets. Agents interact with the world from a first-person point of view, thus collecting such data has many applications from video understanding to augmented reality. In the past decade, many egocentric datasets were collected [16, 11, 44, 38, 28], with the majority of them focusing on human activities and object interactions in an indoor environment (e.g. kitchens). For example, Epic Kitchens [11, 12] is a large cooking dataset that takes place in 45 kitchens across 4 different cities, whereas Charades-Ego [44] consists of 4,000 paired videos of human actions in first and third person. Other datasets are more focused on conversation and social interactions [15, 35, 37]. Existing datasets differ by the environments in which they are recorded (e.g. outdoor vs. kitchens), whether they are scripted or not, and the number of videos. Recently, Ego4D [19], a new comprehensive egocentric dataset, was released. Different from previous datasets, it is more diverse (e.g., indoor and outdoor activities, diverse geographical locations). However, while existing datasets focus on humans and human skills, our focus is on animal agents, which have more limited language and hand-object interactions. The most related egocentric dataset is DECADE [14], which consists of an hour of footage of a single dog, including joint location annotations. Inspired by DECADE, EgoPet is a much larger web-scale dataset (84 hours) and much more diverse.

The EgoPet dataset is a unique collection of egocentric video footage primarily featuring dogs and cats, along with various other animals like eagles, wolves, turtles, sea turtles, sharks, snakes, cheetahs, pythons, geese, alligators, and dolphins (examples included in Fig. 2 and Suppl. Figure 9). Together with the proposed downstream tasks and benchmark, EgoPet is a valuable resource for researchers and enthusiasts interested in studying animals from an egocentric perspective.

We begin with the motivation behind EgoPet and its connection to existing datasets in Section 3.1. We then delve into the dataset’s statistics in Section 3.2 and the collection process in Section 3.3.

To provide a clearer understanding of EgoPet’s significance, we compare it with various other datasets, considering factors such as total video duration, perspective (egocentric or non-egocentric), egomotion, the agents involved, and the presence of interaction annotations, which are crucial for intelligent agents. Refer to Table 1 for more details. In terms of size, Ego4D [19] is the largest egocentric video dataset, and it centers on human activities, while the BDD100K [56] dataset includes both egocentric and egomotion elements but focuses on autonomous driving. Differently, EgoPet focuses on animals, and pets in particular. Among animal video datasets, the DECADE [14] dataset provides an egocentric perspective from a dog’s viewpoint, but it only records 1.5 hours of video. EgoPet expands this vision by over 56 times in volume and includes a variety of species and interactions.

The EgoPet dataset is an extensive collection composed of 6,646 video segments distilled from 819 unique videos. High level statistics are provided in Fig. 3. These original videos were sourced predominantly from TikTok, accounting for 482 videos, while the remaining 338 were obtained from YouTube. The aggregate length of all video segments amounts to approximately 84 hours, which reflects a substantial volume of data for in-depth analysis. In terms of video duration, the segments exhibit an average span of 45.55 seconds, although the duration displays considerable variability, as indicated by the standard deviation of 192.19 seconds. This variation underscores the range of contexts captured within the dataset, from brief encounters to prolonged interactions.

Breaking down the dataset by animal representation, cats and dogs constitute the majority, with 4,567 and 1,905 segments, respectively. This reflects the dataset’s strong emphasis on common domestic animals while still covering less frequent but equally important species. Notably, the dataset includes segments featuring eagles (66), turtles (31), and a diverse group of other animals such as alligators, lizards, and dolphins, contributing to a rich collection of animal behaviors captured through an egocentric lens.

The camera positioning, i.e., where the recording device was attached, also varies: the majority of segments were captured from cameras placed on the neck (4,575) and body (1,817). Fewer segments were recorded from cameras positioned on the head (199), shell (36), collar (11), and fin (8), offering a range of perspectives that can inform how different mounting points might influence the perception of the environment from an animal’s viewpoint.

Collection Strategy. To collect the dataset, we manually searched for videos using a large set of queries on YouTube and TikTok. For example, “egocentric view”, “dog with a GoPro”, and similar phrases related to first-person animal perspectives. This led to scraping a vast pool of footage showcasing animals, primarily dogs and cats, wearing wearable cameras, allowing for an egocentric point of view. In pursuit of a broader video selection, our efforts extended to individual channels and authors known for their thematic consistency in publishing egocentric animal footage. This approach allowed us to tap into niche communities and content creators, yielding a wide variety of egocentric videos beyond the reach of generic search terms.

Dataset Refinement. A meticulous annotation process was carried out to ensure the dataset’s quality. A human annotator reviewed the collected videos to confirm that they were from an egocentric point of view. Non-egocentric or irrelevant segments were carefully removed.

All videos were adjusted to a frame rate of 30 frames per second and resized to 480p on the shortest side while maintaining the original aspect ratio. The videos were then segmented into discrete clips, during which any non-egocentric footage was removed. The final dataset consists of segments of at least three seconds, ensuring sufficient context for each interaction.
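As a concrete illustration of this preprocessing (30 fps, shortest side resized to 480p, aspect ratio preserved), the sketch below computes the target resolution and assembles an ffmpeg command; the helper names and exact ffmpeg flags are our own assumptions, not the authors' actual pipeline.

```python
def target_size(width: int, height: int, short_side: int = 480) -> tuple:
    """Resize so the shortest side equals `short_side`, keeping aspect ratio."""
    if width <= height:
        return short_side, round(height * short_side / width)
    return round(width * short_side / height), short_side

def ffmpeg_command(src: str, dst: str, fps: int = 30, short_side: int = 480) -> list:
    """Build an ffmpeg invocation that resizes and resamples a video.

    The '-2' placeholder lets ffmpeg choose an even value for the
    non-constrained dimension, which most codecs require.
    """
    scale = (f"scale='if(lt(iw,ih),{short_side},-2)'"
             f":'if(lt(iw,ih),-2,{short_side})'")
    return ["ffmpeg", "-i", src, "-vf", f"{scale},fps={fps}", dst]
```

For example, a 1920x1080 landscape video is mapped to 853x480 before segmentation.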

In order to allow quantitative comparisons of animal-prediction approaches, we next define several prediction tasks on the EgoPet dataset. We provide annotated datasets based on these tasks, which will allow effective benchmarking of different approaches.

Motivation. Human activities such as actions and interactions from an egocentric viewpoint have been previously explored in various datasets mostly focusing on activity recognition [24, 29, 60], human-object interactions [13, 6, 34, 11], and social interactions [15, 35, 55]. Inspired by these works, we focus on animal interactions with other agents or objects, and for simplicity, we only consider visual interactions. Observing interactions through an egocentric perspective offers insights into how animals navigate their world, how they communicate with other beings, and how their physical movements correlate with environmental stimuli. Being able to identify interactions is a core task in computer vision and robotics with practical applications in designing systems that can operate in dynamic, real-world settings.

Task Description. The input for this task is a video clip from the egocentric perspective of an animal. The labels are twofold: a binary label indicating whether an interaction is taking place or not, and a categorical label describing the object of the interaction. This binary label simplifies the vast range of potential interactions into a manageable form for the model, while the identification of the interaction object adds a layer of specificity necessary for understanding the context of the interaction.

In the context of the EgoPet dataset, a “visual interaction” is defined as a discernible event where the agent—typically an animal such as a dog or cat—demonstrates clear attention to an object or another agent within its environment. This attention may be manifested through physical contact, proximity, orientation, or vocalization (such as barking or making sounds) toward the object of the interaction which can be an object or agent. The fundamental criterion for a visual interaction is the presence of visual evidence within the video that the agent is engaged with, or reacting to a particular stimulus. Aimless movements, such as wandering without a clear target or displaying alertness without a specific focus, are not labeled as visual interactions.

Annotations. The data labeling process for marking interactions involved a meticulous analysis of the video content, which resulted in the annotation of 1,449 subsegments (see Fig. 4). Two human annotators were trained to identify and timestamp the start and end of an interaction event. The outcome of this process is a richly annotated dataset of 805 subsegments where no interaction occurs (“negative subsegments”) and 644 positive interaction subsegments that capture a wide range of 17 distinct interaction objects such as person, cat, and dog. The subsegments were then split into train and test sets. This leaves us with 754 training subsegments and 695 test subsegments for a total of 1,449 annotated subsegments. For the full annotation process, refer to the Supplementary Material.

Motivation. Planning where to move involves a complex interplay of both perception and foresight. It requires the ability to anticipate potential obstacles, consider various courses of action, and select the most efficient and effective strategy to achieve a desired goal. EgoPet contains examples where animals plan a future trajectory to achieve a certain goal (e.g., a dog following its owner; see Fig. 5).

Task Description. Given a sequence of $m$ past video frames $\{x_i\}_{i=t-m}^{t}$, the goal is to predict the unit-normalized future trajectory of the agent $\{v_j\}_{j=t+1}^{t+k}$, where $v_j \in \mathbb{R}^3$ represents the relative location of the agent at timestep $j$. We predict the unit-normalized relative location due to the scale ambiguity of the extracted trajectories. In practice, we condition models on $m=16$ frames and predict $k=40$ future locations, which correspond to 4 seconds into the future.
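One plausible way to construct these regression targets from extracted camera positions is to unit-normalize the per-step displacements; the following numpy sketch assumes the positions are already expressed in a common world frame (the function name and interface are illustrative, not from the paper):

```python
import numpy as np

def lp_targets(positions: np.ndarray, t: int, k: int = 40) -> np.ndarray:
    """Unit-normalized future displacements v_{t+1..t+k} from camera positions.

    positions: (N, 3) array of camera locations, one per frame.
    Returns a (k, 3) array of unit vectors; the normalization removes the
    arbitrary scale of monocular visual odometry trajectories.
    """
    # Per-step relative motion between consecutive future frames.
    deltas = positions[t + 1 : t + k + 1] - positions[t : t + k]
    norms = np.linalg.norm(deltas, axis=1, keepdims=True)
    return deltas / np.maximum(norms, 1e-8)  # guard against zero motion
```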

Annotations. To obtain pseudo ground truth agent trajectories, we used Deep Patch Visual Odometry (DPVO [49]), a system for monocular visual odometry that utilizes sparse patch-based matching across frames. This system largely outperformed other open-source SLAM systems in terms of convergence rate and qualitative accuracy in our experiments.

Given an input sequence of frames, DPVO returns the location and orientation of the camera for each frame. To obtain training trajectories, we feed videos with a stride of 5 to DPVO. To ensure high-quality evaluation, we feed validation videos with strides of 5, 10, and 15 into DPVO and evaluate the quality of the trajectories manually. Specifically, two human annotators were trained to evaluate the trajectories from an eagle’s eye view (XZ view) and determine the best matching trajectory, if any, to the video. This left us with 6,126 annotated training segments and 249 validation segments.

Motivation. Understanding animal behavior could be instrumental to several robotics applications. To demonstrate the value of our dataset for robotics, we propose a task based on the problem of vision-based locomotion. Specifically, the task consists of predicting the parameters of the terrain a quadrupedal robot is walking on (see Fig. 6). As shown in multiple previous works on locomotion [30, 31, 22, 4, 3, 45], accurate prediction of these parameters is correlated with improved performance in locomotion. Intuitively, the EgoPet data closely resembles the video captured by a quadruped robot since a camera mounted on a pet is approximately at the same location as the camera mounted on the robot. In addition, the task of walking is highly represented in the dataset.

Task Description. The parameters we would like to predict are the local terrain geometry, the terrain’s friction, and the parameters related to the robot’s walking behavior on the terrain, including the robot’s speed, motor efficiency, and high-level command. The exact identification of these parameters is generally impossible [25]: two terrains could have different combinations of parameters but “feel” the same to an agent. For example, walking on sand and mud could similarly affect the robot’s proprioception, even though their properties differ. Therefore, similarly to previous work [25, 27, 30], we aim to predict a latent representation $z_t$ of the terrain parameters. This latent representation consists of the hidden layer of a neural network trained in simulation to encode ground-truth terrain parameters. This neural network is trained end-to-end with an action policy on locomotion using reinforcement learning. The task consists of predicting the latent terrain representation at different time intervals from a sequence of frames. Our setup closely follows the one in [30]. Specifically, we add to EgoPet data collected with a quadrupedal robot in multiple outdoor environments with different terrain characteristics, e.g., sand or grass.

Dataset and Annotations. To collect the dataset, we deployed the walking policy of [30] on a Unitree A1 robot dog in three environments: an office, a park, and a beach. We collect approximately 20 minutes of walking data in these environments, which are used exclusively for evaluation. For training, we use the data from [30], which contains 120 thousand frames, corresponding to a total walking time of approximately 2.3 hours. Each environment has different terrain geometries, including flats, steps, and slopes. Each sample contains an image collected from a forward-looking camera mounted on the robot and the (latent) parameters of the terrain below the center of mass of the robot, $z_t$, estimated with a history of proprioception. See [30] for details about the annotation procedure.

The final task consists of predicting $z_t$ from a history of images. We generate several sub-tasks by predicting the future terrain parameters $z_{t+0.8}, z_{t+1.5}$ and the past ones $z_{t-0.8}, z_{t-1.5}$. These time intervals were selected to differentiate between forecasting and estimation. The further the prediction is in the future or the past, the harder the task is. The input images might contain little or no information about the terrain at these times, so inference based on context is required. For example, one can predict the presence of a step in front of the robot from the shadow it casts on the terrain.
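Constructing these sub-task targets amounts to mapping each frame index to the latent indices at the chosen time offsets; a small illustrative sketch follows (the pairing logic is our assumption — only the ±0.8 s and ±1.5 s offsets come from the text):

```python
def vpp_pairs(num_frames: int, fps: int,
              offsets=(-1.5, -0.8, 0.0, 0.8, 1.5)):
    """For each frame t, the indices of the latent targets z_{t+dt}.

    Returns a list of (t, {dt: target_index}) entries, keeping only frames
    where every offset falls inside the sequence, so each training sample
    has a complete set of past, present, and future targets.
    """
    pairs = []
    for t in range(num_frames):
        idx = {dt: t + round(dt * fps) for dt in offsets}
        if all(0 <= j < num_frames for j in idx.values()):
            pairs.append((t, idx))
    return pairs
```

At 30 fps, the ±1.5 s offset means the first usable frame is t = 45 and the last is 45 frames before the end of the sequence.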

We divide the newly collected data into three test datasets: the first is in-distribution, featuring terrains and lighting conditions similar to the training data. The second dataset is out of distribution since it is captured with different lighting conditions, i.e., at night, but in environments with the same features as the training data. Finally, the third dataset contains sandy environments, which the robot has not encountered during training.

Our goal in the experiments is to establish initial performance baselines on the EgoPet tasks. For the VPP task, we hypothesize that EgoPet is a more useful pretraining resource compared to other datasets. We evaluate different pretrained models and compare their performance on the VIP, LP, and VPP tasks. We adopt a simple linear probing protocol, where we freeze the model weights and, for each task, train only a linear layer to predict the output. For evaluation, we use the following models’ publicly released checkpoints, typically trained on IN-1k or K400, unless stated otherwise.

MAE [21] is trained by masking random patches in the input image and reconstructing the missing pixels through an asymmetric encoder-decoder architecture, masking a high proportion (e.g., 75%) of the input image. In our experiments, we use an MAE model pretrained on IN-1k.

MVP [53] uses the same model as MAE, but trains it on a mixture of egocentric datasets which we refer to as Ego Mix: a combination of Epic-Kitchens [11], 100DOH [43], Ego4D [19], and Something-Something [18].

DINO [8] is trained with a student-teacher architecture over pairs of augmented images by encouraging invariance to the image augmentations. The teacher’s output is centered, and both networks’ normalized features are compared using a cross-entropy loss. The stop-gradient operator ensures gradient propagation only through the student, and teacher parameters are updated using an exponential moving average (ema) of the student parameters.

iBOT [59] is trained similarly to DINO, but adds an auxiliary masked image modeling (MIM) loss by predicting the patch representations of a learned online tokenizer.

VideoMAE [50] is an extension of MAE for video pre-training. Different from MAE, it utilizes an extremely high masking ratio (90% to 95%) and tube masking as opposed to random masking.

MVD [52] is a masked feature modeling framework for self-supervised video representation learning. Learning the video representations involves distilling student model features from both video and image teachers. We train MVD variants on Ego4D and EgoPet, using VideoMAE (K400) and MAE (IN-1k) as video and image teachers.

Implementation Details. For all models, we used the ViT-B model with patch size 16 since it was available across all methods. For the VIP task, we train all image and video models for 10 epochs. Video models represent video clips of 2 seconds using 8 input frames (4 Hz), and image models use one (the middle) frame. For the VPP task, we train all models for 50 epochs. Video models were trained with varying numbers of frames at 4 Hz. For the LP task, we train all models for 15 epochs. Video models were trained with 16 input frames (30 Hz) and image models used one frame (the last). In our LP experiments, we only use cat and dog segments that are sufficiently long, and 25% of the training data. This left us with 1,129 training segments and 167 validation segments. During the linear probing training phase, we do not apply any image augmentations. All other hyperparameters follow the MAE and MVD linear probing recipes for image and video models, respectively.
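As a schematic of the linear probing protocol (frozen features, only a linear head fitted), the sketch below uses a closed-form one-vs-all least-squares head in place of the SGD-trained linear layer from the MAE/MVD recipes; it is a simplified stand-in, not the authors' implementation:

```python
import numpy as np

def fit_linear_probe(features: np.ndarray, labels: np.ndarray,
                     num_classes: int, reg: float = 1e-3) -> np.ndarray:
    """Fit a linear head on frozen features via ridge-regularized least squares.

    features: (N, D) outputs of a frozen pretrained encoder.
    labels:   (N,) integer class labels.
    Returns a (D, num_classes) weight matrix; the encoder is never updated.
    """
    one_hot = np.eye(num_classes)[labels]
    gram = features.T @ features + reg * np.eye(features.shape[1])
    return np.linalg.solve(gram, features.T @ one_hot)

def predict(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Class prediction is the argmax over the linear head's scores."""
    return (features @ weights).argmax(axis=1)
```

The key property being probed is identical in both variants: all task performance must come from the frozen representation, since the head is linear.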

In this section, we report initial baseline results from applying a range of models to the VIP, LP, and VPP tasks. Taken together, these results underline the interesting observation that current large video datasets used for pretraining are not diverse enough to perform well across all the EgoPet downstream tasks. For example, pretraining on K400 is better than Ego4D for VIP but worse on VPP. Furthermore, by pretraining on EgoPet, we observe improved downstream performance on the VPP task compared to other models.

The results in Table 2 show that models trained on EgoPet achieve improved performance compared to K400 or Ego4D on both interaction prediction and object prediction. Compared to image-based models like iBOT, MVD trained on EgoPet performs better on Top-3 accuracy but worse on Top-1. This is likely due to the diversity of objects appearing in IN-1k, an image recognition dataset. Compared to other video models, MVD (EgoPet) performs better. To obtain more insight into what models focus on in this task, we apply Grad-CAM [41] to our MVD EgoPet interaction classifier. Fig. 7 shows the corresponding heatmaps, which focus on the rat (top-left) and another dog (bottom-right). In these cases, the model seems to be attending to the object of interaction.

For this task, we evaluated models based on their predicted unit motions 40 timesteps into the future, corresponding to 4 seconds. We form trajectories from these predicted motions and compute the RMSE of the Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) metrics against the ground truth trajectories. ATE and RPE are commonly employed metrics for evaluating systems such as SLAM and visual odometry [58, 49, 48, 7]. ATE first aligns the ground truth with the predicted trajectory, and then computes the absolute pose difference. RPE measures the difference between the predicted and ground truth locomotion [47].
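A minimal version of these metrics can be written as follows; note that full ATE typically aligns trajectories with a similarity transform (e.g., Umeyama), which this sketch simplifies to a mean-translation alignment:

```python
import numpy as np

def ate_rmse(gt: np.ndarray, pred: np.ndarray) -> float:
    """Absolute Trajectory Error: align, then RMSE of absolute positions.

    gt, pred: (T, 3) trajectories. Alignment here is translation-only,
    a simplification of the usual similarity alignment.
    """
    pred_aligned = pred - pred.mean(axis=0) + gt.mean(axis=0)
    return float(np.sqrt(((pred_aligned - gt) ** 2).sum(axis=1).mean()))

def rpe_rmse(gt: np.ndarray, pred: np.ndarray) -> float:
    """Relative Pose Error: RMSE over per-step relative translations."""
    gt_rel = np.diff(gt, axis=0)
    pred_rel = np.diff(pred, axis=0)
    return float(np.sqrt(((pred_rel - gt_rel) ** 2).sum(axis=1).mean()))
```

The two metrics are complementary: a trajectory with the right shape but a constant offset scores zero on both, while a locally accurate but drifting trajectory is penalized by ATE more than by RPE.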

The results in Fig. 8 indicate that models trained on EgoPet perform better than those trained on Ego4D and K400, and that the Ego4D model performed second best, possibly because Ego4D is also egocentric data. The full results in Supplementary Table 4 indicate that video models perform much better than image models as a whole, which we speculate is due to better modeling of the agent’s velocity and acceleration as well as the motion of other agents.

Table 3 provides the results of the VPP task. It can be seen that using EgoPet data leads to lower errors on this task. Additionally, the results show that using additional past frames as context helps and that video models outperform image models. Within image models, MVP performs better than MAE, likely because it was trained on egocentric data, and iBOT performs best. Within video models, MVD trained on EgoPet achieves lower mean squared error compared to the same model trained on K400 or Ego4D, and lower error compared to all image models. We speculate that MVD (EgoPet) outperforms all models because, compared to other datasets, the EgoPet videos are more similar to videos captured by a forward-facing camera mounted on a quadruped robodog. We provide the full results in Supplementary Table 5.

EgoPet is a video dataset, and as such it primarily contains visual and auditory signals. However, animals interact with their environment using a multitude of senses, including smell and touch. The absence of these sensory modalities in our dataset and model may lead to a partial or skewed understanding of animal behavior and intelligence. Animal behavior is highly complex and influenced by a myriad of factors, including instinct, learning, environmental stimuli, and social interactions. Our tasks, while effective in capturing certain aspects of behavior, may not fully encapsulate the depth and complexity of animal interactions and decision-making processes. Further research is needed to develop more sophisticated tasks and models that can account for complex behavioral patterns.

We present EgoPet, a new comprehensive animal egocentric video dataset. Together with the proposed downstream tasks and benchmark, we believe EgoPet offers a testbed for studying and modeling animal behavior. Our benchmark results demonstrate that interaction prediction is far from solved, which provides an exciting opportunity for future research on modeling animal egocentric agents. Furthermore, the results demonstrate that EgoPet is a useful pretraining resource for downstream robotic locomotion tasks. Future work can include broadening the tasks to integrate more sensory inputs like audio, thereby creating a richer and more holistic understanding of animal behavior.

Acknowledgements: We thank Justin Kerr for helpful discussions. Many of the figures use images taken from web videos. For each figure, we include the URL to its source videos in the Suppl. Section Figure credits. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant ERC HOLI 819080). Prof. Darrell’s group was supported in part by DoD including DARPA’s LwLL and/or SemaFor programs, as well as BAIR’s industrial alliance programs.

We provide additional information about the EgoPet dataset, annotation process and full quantitative results.

We include more dataset visualizations in Figure 9.

The beginning of an interaction is marked at the first time-step where the agent begins to give attention to a target, and the endpoint is marked at the last time-step before the attention ceases. In addition, annotators were instructed to mark some segments without interactions. This process results in a set of temporal segments, each corresponding to a discrete interaction event or no interaction event. To ensure the consistency of annotations across annotators, annotations are only kept where annotators agree.
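Such agreement filtering can be implemented as interval intersection between the two annotators' segments; the overlap (IoU) threshold below is an illustrative assumption, not a detail given in the paper:

```python
def agreed_segments(ann_a, ann_b, min_iou: float = 0.5):
    """Intersect two annotators' (start, end) interval lists.

    Keeps the overlapping portion of any pair of intervals whose temporal
    intersection-over-union exceeds `min_iou`, discarding segments the
    annotators disagree on.
    """
    kept = []
    for a0, a1 in ann_a:
        for b0, b1 in ann_b:
            inter = min(a1, b1) - max(a0, b0)
            if inter <= 0:
                continue  # no temporal overlap
            union = max(a1, b1) - min(a0, b0)
            if inter / union >= min_iou:
                kept.append((max(a0, b0), min(a1, b1)))
    return kept
```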

The outcome of this process is a richly annotated dataset of 805 subsegments where no interaction occurs (“negative subsegments”) and 644 positive interaction subsegments that capture a wide range of 17 distinct interaction objects such as person, cat, and dog. The subsegments were then split into train and test. This leaves us with 754 training subsegments and 695 test subsegments for a total of 1,449 annotated subsegments.

This is the list of all possible interaction objects: Person, Ball, Bench, Bird, Dog, Cat, Other Animal, Toy, Door, Floor, Food, Plant, Filament, Plastic, Water, Vehicle, Other.

Full LP results. Table 4 contains the quantitative LP results for all models, reported using the ATE and RPE metrics. The results indicate that pretraining on EgoPet leads to better ATE and RPE scores.

Full VPP results. In the main paper we included the VPP results grouped by “past”, “present” and “future” (see Table 3). In Table 5 we provide the full fine-grained VPP results by individual timestep.

Some of the figures in the paper were created from web videos. We credit the original content creators and provide links to the original videos below.

https://www.youtube.com/watch?v=69AXB6aFzRU

https://www.tiktok.com/@gonzoisacat/video/7232306745660509483

Table 1: Different video datasets. We compare EgoPet to different datasets with respect to the total time (hours), whether the videos are in first-person view (egocentric) with a focus on egomotion, the agent type, and whether agent interaction annotations are available. EgoPet is the first large-scale animal dataset that is both egocentric and contains interaction annotations. It is also over 56 times larger than the previous most similar dataset, DECADE [14].

Dataset          Total Time (hours)   Agent
BDD100K          1,111                Cars
Animal Kingdom   50                   Animals
EGO4D            3,670                Humans
DECADE           1.5                  Dog
EgoPet           84                   Animals

Table 2: Visual Interaction Prediction linear probing results. We report models’ Interaction Prediction Accuracy and AUROC, as well as Object Prediction Top-1 and Top-3 Accuracy.

Model      Pretraining   Accuracy   AUROC   Top-1 Acc   Top-3 Acc
MAE        IN-1k         62.34      69.41   35.02       61.37
MVP        Ego Mix       65.47      68.12   33.57       59.21
DINO       IN-1k         65.16      73.38   37.18       60.65
iBOT       IN-1k         65.16      73.50   37.55       58.12
VideoMAE   K400          61.56      66.22   29.24       54.87
MVD        K400          65.63      70.35   35.38       62.45
MVD        Ego4D         64.84      70.15   33.57       62.45
MVD        EgoPet        68.44      74.31   35.74       64.62

Table 3: Vision to Proprioception Prediction (VPP) linear probing results. We report the mean squared error loss. Models trained on EgoPet perform better than models trained on other datasets. See Supplementary Table 5 for the full results.

Model   Dataset   Past (t−k)   Present (t)   Future (t+k)
1 Frame
MAE     IN-1k     0.360        0.280         0.314
MVP     Ego Mix   0.357        0.273         0.308
DINO    IN-1k     0.354        0.275         0.304
iBOT    IN-1k     0.350        0.278         0.304
4 Frames
MVD     K400      0.286        0.197         0.262
MVD     Ego4D     0.261        0.224         0.261
MVD     EgoPet    0.256        0.203         0.246
8 Frames
MVD     K400      0.217        0.196         0.252
MVD     Ego4D     0.208        0.192         0.249
MVD     EgoPet    0.204        0.184         0.253


Figure 2: EgoPet video examples. Footage from the EgoPet dataset featuring four different animal experiences, each captured from an egocentric perspective at a distinct point in time.

Figure 3: Descriptive statistics. The histogram depicting the length (in seconds) of EgoPet video sequences exhibits a long-tailed distribution, primarily skewed toward shorter segments of less than 30 seconds. Collectively, videos featuring dogs and cats account for 94% of the total duration, showcasing interactions with people, fellow cats and dogs, toys, and various objects.

Figure 4: Visual Interaction Prediction task. The figure illustrates the process of annotating a single video, identifying and categorizing different interactions experienced by a cat, with each segment of the timeline reflecting a unique type of interaction within the animal’s environment.

Figure 5: Locomotion Prediction task. A dog navigates an agility course, highlighting the concept of locomotion prediction by anticipating its forward and upward trajectory to clear the obstacle.

Figure 6: Vision to Proprioception Prediction task. This figure showcases the quadruped robot as it is about to transition from flat ground to climbing steep stairs, illustrating one of the unique terrain environments encountered during the collection of visual and proprioceptive data at annotated time intervals for VPP training.

Figure 7: VIP Grad-CAM [41] visualization.

Figure 8: Locomotion Prediction (LP) linear probing results. We report the validation ATE and RPE as a function of the epoch during training, comparing the impact of various datasets (Kinetics, Ego4D, EgoPet). Models trained on EgoPet perform better than models trained on other datasets. See Supplementary Table 4 for the full results.

Figure credits


Animals are intelligent agents that exhibit various cognitive and behavioral traits. They plan and act to accomplish complex goals and can interact with objects or other agents. Consider a cat attempting to catch a rat; this requires the cat to execute a precise sequence of actions with impeccable timing, all while responding to the rat’s efforts to escape.

Current Artificial Intelligence (AI) systems can synthesize high quality images [39, 40], generate coherent text [5, 51], and even code Python programs [9]. But despite this remarkable progress, there are basic animal behaviors that are beyond the reach of current models. Recently, there has been a significant body of research in robotics aimed at learning policies for quadruped locomotion, and other basic actions [27, 25, 1, 32, 42, 10, 33, 2]. However, we argue that a major limitation in advancing towards more complex systems is the availability of large-scale, real-world data.

To address this, we present EgoPet, a new web-scale dataset from the perspective of pets. EgoPet contains more than 84 hours of video, including different animals like dogs, cats, eagles, turtles, and more. This video footage reveals the world from the eyes of the pet as perceived in its day-to-day life, e.g., a dog going for a walk or entering a park, or a cat wandering freely around a farm. The video data was sourced from the internet and predominantly includes pet videos, hence we have named the dataset EgoPet.

To measure progress in modeling and learning from animals, we propose three new tasks that aim to capture perception and action (see Fig. 1): Visual Interaction Prediction (VIP), Locomotion Prediction (LP), and Vision to Proprioception Prediction (VPP). Together with these tasks, we provide annotated training and validation data used for downstream evaluation.

The VIP task aims to detect and classify animal interactions and is inspired by human-object interaction tasks [43]. We temporally annotated a subset of the EgoPet videos with the start and end times of visual interactions and the category of the interaction object. The categories, which include person, cat, and dog, were chosen based on how commonly they occurred as objects (for the full list of categories, refer to the Supplementary Material).

The goal of the LP task is to predict the future 4-second trajectory of the pet. This is useful for learning basic pet skills like avoiding obstacles or navigating. We extracted pseudo ground truth trajectories using Deep Patch Visual Odometry (DPVO) [49], the best-performing SLAM system for our dataset. We manually filtered inaccurate trajectories in the validation data to ensure high-quality evaluation.

Finally, in the VPP task, we study EgoPet’s utility for a downstream robotic task: legged locomotion. Given a video observation from a forward-facing camera mounted on a quadruped robot, the goal is to predict the features of the terrain perceived by the robot’s proprioception across its trajectory. Making accurate predictions requires perceiving the landscape and anticipating the robot controls. This differs from previous works on robot visual prediction [30, 45, 31], which require conditioning over current robot controls and are thus challenging to train at scale. To assess performance in this task, we gathered data utilizing a quadruped robodog. This data includes paired videos and proprioception features, which are then utilized for subsequent training and evaluation processes.

We train various self-supervised models and evaluate how they perform downstream using a simple linear probing protocol. We make the surprising finding that pretraining on EgoPet yields better performance than pretraining on other, much larger video datasets like Ego4D [19] and Kinetics 400 [23]. This indicates the inadequacy of current datasets in studying animal-like physical skills.

Our contributions are as follows. First, we propose EgoPet, the first large-scale egocentric animal video dataset, comprising over 84 hours of video footage to facilitate learning from animals. We propose three new tasks, including human-annotated data, and set an initial benchmark. The downstream results on the VPP task indicate that EgoPet is a useful pretraining resource for quadruped locomotion, and the benchmark results on VIP show that the proposed tasks are still far from being solved, providing an exciting new opportunity to build models that capture the world through the eyes of animals.

Next, we delve into the related works surrounding video datasets, including notable research on both general video datasets of humans and animals and those focusing specifically on egocentric video data.

Video Datasets. In recent years, a variety of video datasets have played an important role in video understanding tasks. In human action recognition, datasets like UCF101 [46], Charades-Ego [44], AVA [20], FineDiving [54], and the Something-Something dataset [18] provide comprehensive coverage of human activities, ranging from daily actions to specialized sports movements. Among these, Kinetics (K400) [23] is particularly influential, advancing the study of human actions through a wide array of video clips.

Other works aimed to collect data to study animals. These works include datasets such as the Animal Kingdom [36], which contains videos of various species, and MacaquePose [26], which focuses on non-human primates. These datasets are instrumental for AI advancements in wildlife recognition and interpretation. AP-10K [57] further augments this domain by providing a detailed collection of animal images for robust pose estimation. While sharing a similar motivation to our work, existing datasets on animal behavior rarely contain egocentric views and are therefore better suited to recognition problems than to studying animals’ physical capabilities. For autonomous driving and vehicle motion, datasets like Berkeley DeepDrive [56, 17] and KITTI [17] offer extensive insights into vehicle egomotion and environmental interactions. While these datasets enrich our understanding of motion, behavior, and interaction from a human-centric perspective, they offer limited insights into animal behavior.

Egocentric Video Datasets. Agents interact with the world from a first-person point of view, thus collecting such data has many applications, from video understanding to augmented reality. In the past decade, many egocentric datasets were collected [16, 11, 44, 38, 28], with the majority of them focusing on human activities and object interactions in indoor environments (e.g., kitchens). For example, Epic Kitchens [11, 12] is a large cooking dataset that takes place in 45 kitchens across 4 different cities, whereas Charades-Ego [44] consists of 4,000 paired videos of human actions in first and third person. Other datasets are more focused on conversation and social interactions [15, 35, 37]. Existing datasets differ by the environments in which they are recorded (e.g., outdoor vs. kitchens), whether they are scripted or not, and the number of videos. Recently, Ego4D [19], a new comprehensive egocentric dataset, was released. Different from previous datasets, it is more diverse (e.g., indoor and outdoor activities, diverse geographical locations). However, while existing datasets focus on humans and human skills, our focus is on animal agents, which have more limited language and hand-object interactions. The most related egocentric dataset is DECADE [14], which consists of an hour of footage of a single dog, including joint location annotations. Inspired by DECADE, EgoPet is a much larger web-scale dataset (84 hours) and much more diverse.

The EgoPet dataset is a unique collection of egocentric video footage primarily featuring dogs and cats, along with various other animals like eagles, wolves, turtles, sea turtles, sharks, snakes, cheetahs, pythons, geese, alligators, and dolphins (examples included in Fig. 2 and Suppl. Figure 9). Together with the proposed downstream tasks and benchmark, EgoPet is a valuable resource for researchers and enthusiasts interested in studying animals from an egocentric perspective.

We begin with the motivation behind EgoPet and its connection to existing datasets in Section 3.1. We then delve into the dataset’s statistics in Section 3.2 and the collection process in Section 3.3.

To provide a clearer understanding of EgoPet's significance, we compare it with various other datasets, considering factors such as total video duration, perspective (egocentric or non-egocentric), egomotion, the agents involved, and the presence of interaction annotations, which are crucial for intelligent agents. Refer to Table 1 for more details. In terms of size, Ego4D [19] is the largest egocentric video dataset, and it centers on human activities, while the BDD100K [56] dataset includes both egocentric and egomotion elements but focuses on autonomous driving. Differently, EgoPet focuses on animals, and pets in particular. Among animal video datasets, the DECADE [14] dataset provides an egocentric perspective from a dog's viewpoint, but it only contains 1.5 hours of video. EgoPet expands on this by over 56 times in volume and includes a variety of species and interactions.

The EgoPet dataset is an extensive collection composed of 6,646 video segments distilled from 819 unique videos. High-level statistics are provided in Fig. 3. These original videos were sourced predominantly from TikTok, accounting for 482 videos, while the remaining 338 were obtained from YouTube. The aggregate length of all video segments amounts to approximately 84 hours, which reflects a substantial volume of data for in-depth analysis. In terms of video duration, the segments exhibit an average span of 45.55 seconds, although the duration displays considerable variability, as indicated by the standard deviation of 192.19 seconds. This variation underscores the range of contexts captured within the dataset, from brief encounters to prolonged interactions.

Breaking down the dataset by animal representation, cats and dogs constitute the majority, with 4,567 and 1,905 segments, respectively. This reflects the dataset's strong emphasis on common domestic animals while still covering less frequent but equally important species. Notably, the dataset includes segments featuring eagles (66), turtles (31), and a diverse group of other animals such as alligators, lizards, and dolphins, contributing to a rich collection of animal behaviors captured through an egocentric lens.

The camera positioning (where the recording device was attached) also varies: the majority of segments were captured from cameras placed on the neck (4,575) and body (1,817). Fewer segments were recorded from cameras positioned on the head (199), shell (36), collar (11), and fin (8), offering a range of perspectives that can inform how different mounting points might influence the perception of the environment from an animal's viewpoint.

Collection Strategy. To collect the dataset, we manually searched for videos using a large set of queries on YouTube and TikTok. For example, “egocentric view”, “dog with a GoPro”, and similar phrases related to first-person animal perspectives. This led to scraping a vast pool of footage showcasing animals, primarily dogs and cats, wearing wearable cameras, allowing for an egocentric point of view. In pursuit of a broader video selection, our efforts extended to individual channels and authors known for their thematic consistency in publishing egocentric animal footage. This approach allowed us to tap into niche communities and content creators, yielding a wide variety of egocentric videos beyond the reach of generic search terms.

Dataset Refinement. A meticulous annotation process was carried out to ensure the dataset’s quality. A human annotator reviewed the collected videos to confirm that they were from an egocentric point of view. Non-egocentric or irrelevant segments were carefully removed.

All videos were adjusted to a frame rate of 30 frames per second and resized to 480p on the shortest side while maintaining the original aspect ratio. The videos were then segmented into discrete clips, during which any non-egocentric footage was removed. The final dataset consists of segments of at least three seconds, ensuring sufficient context for each interaction.
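As a concrete illustration of the preprocessing above, the sketch below (our own, not the authors' released pipeline) computes the output resolution for a shortest-side 480p resize that preserves aspect ratio, and checks the three-second minimum length at 30 fps:

```python
def resize_shortest_side(width, height, target=480):
    """Scale so the shortest side equals `target`, preserving aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

def long_enough(num_frames, fps=30, min_seconds=3):
    """Segments shorter than three seconds are dropped from the dataset."""
    return num_frames / fps >= min_seconds
```

For example, a 1920x1080 landscape clip becomes 853x480, while a 1080x1920 portrait clip becomes 480x853.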

In order to allow quantitative comparisons of animal-prediction approaches, we next define several prediction tasks on the EgoPet dataset. We provide annotated datasets based on these tasks, which will allow effective benchmarking of different approaches.

Motivation. Human activities such as actions and interactions from an egocentric viewpoint have been previously explored in various datasets mostly focusing on activity recognition [24, 29, 60], human-object interactions [13, 6, 34, 11], and social interactions [15, 35, 55]. Inspired by these works, we focus on animal interactions with other agents or objects, and for simplicity, we only consider visual interactions. Observing interactions through an egocentric perspective offers insights into how animals navigate their world, how they communicate with other beings, and how their physical movements correlate with environmental stimuli. Being able to identify interactions is a core task in computer vision and robotics with practical applications in designing systems that can operate in dynamic, real-world settings.

Task Description. The input for this task is a video clip from the egocentric perspective of an animal. The labels are twofold: a binary label indicating whether an interaction is taking place or not, and a categorical label describing the object of the interaction. This binary label simplifies the vast range of potential interactions into a manageable form for the model, while the identification of the interaction object adds a layer of specificity necessary for understanding the context of the interaction.

In the context of the EgoPet dataset, a “visual interaction” is defined as a discernible event where the agent (typically an animal such as a dog or cat) demonstrates clear attention to an object or another agent within its environment. This attention may be manifested through physical contact, proximity, orientation, or vocalization (such as barking or making sounds) toward the target of the interaction, which can be an object or another agent. The fundamental criterion for a visual interaction is the presence of visual evidence within the video that the agent is engaged with, or reacting to, a particular stimulus. Aimless movements, such as wandering without a clear target or displaying alertness without a specific focus, are not labeled as visual interactions.

Annotations. The data labeling process for marking interactions involved a meticulous analysis of the video content, which resulted in the annotation of 1,449 subsegments (see Fig. 4). Two human annotators were trained to identify and timestamp the start and end of an interaction event. The outcome of this process is a richly annotated dataset of 805 subsegments where no interaction occurs (“negative subsegments”) and 644 positive interaction subsegments that capture a wide range of 17 distinct interaction objects such as person, cat, and dog. The subsegments were then split into train and test sets, leaving 754 training subsegments and 695 test subsegments for a total of 1,449 annotated subsegments. To see the full annotation process refer to Suppl. Section Figure credits.
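The twofold VIP label described above can be made concrete with a small sketch; the dict field name and integer encoding are illustrative assumptions, not the released annotation format:

```python
# The 17 interaction-object categories listed in the supplementary material.
OBJECTS = ["Person", "Ball", "Bench", "Bird", "Dog", "Cat", "Other Animal",
           "Toy", "Door", "Floor", "Food", "Plant", "Filament", "Plastic",
           "Water", "Vehicle", "Other"]

def vip_target(subsegment):
    """Map an annotated subsegment to (interaction_flag, object_index).

    Negative subsegments carry no object; we encode them as (0, -1).
    `subsegment` is assumed to be a dict with an optional "object" key.
    """
    obj = subsegment.get("object")
    if obj is None:
        return 0, -1
    return 1, OBJECTS.index(obj)
```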

Motivation. Planning where to move involves a complex interplay of both perception and foresight. It requires the ability to anticipate potential obstacles, consider various courses of action, and select the most efficient and effective strategy to achieve a desired goal. EgoPet contains examples where animals plan a future trajectory to achieve a certain goal (e.g., a dog following its owner; see Fig. 5).

Task Description. Given a sequence of past $m$ video frames $\{x_i\}_{i=t-m}^{t}$, the goal is to predict the unit-normalized future trajectory of the agent $\{v_j\}_{j=t+1}^{t+k}$, where $v_j \in \mathbb{R}^3$ represents the relative location of the agent at timestep $j$. We predict the unit-normalized relative location due to the scale ambiguity of the extracted trajectories. In practice, we condition models on $m=16$ frames and predict $k=40$ future locations, which correspond to 4 seconds into the future.
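One plausible way to form the LP targets from odometry output is to take per-step displacements and normalize each to unit length, which matches the scale-ambiguity argument above. This is our own sketch of that reading, not the released tooling:

```python
import numpy as np

def lp_targets(positions, t, k=40):
    """Unit-normalized relative locations v_j for j = t+1 .. t+k.

    positions: (N, 3) array of camera positions from visual odometry,
    known only up to scale. Each v_j is the step from timestep j-1 to j,
    normalized to unit length (one reading of the task definition).
    """
    steps = positions[t + 1 : t + k + 1] - positions[t : t + k]
    norms = np.linalg.norm(steps, axis=1, keepdims=True)
    return steps / np.maximum(norms, 1e-8)  # guard against zero displacement
```

For an agent moving at constant velocity along x, every target is (1, 0, 0) regardless of speed, reflecting the scale normalization.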

Annotations. To obtain pseudo ground truth agent trajectories, we used Deep Patch Visual Odometry (DPVO [49]), a system for monocular visual odometry that utilizes sparse patch-based matching across frames. This system largely outperformed other open-source SLAM systems in terms of convergence rate and qualitative accuracy in our experiments.

Given an input sequence of frames, DPVO returns the location and orientation of the camera for each frame. To obtain training trajectories, we feed videos with a stride of 5 to DPVO. To ensure high-quality evaluation, we feed validation videos with strides of 5, 10, and 15 into DPVO and evaluate the quality of the trajectories manually. Specifically, two human annotators were trained to evaluate the trajectories from an eagle's eye view (XZ view) and determine the best matching trajectory, if any, to the video. This left us with 6,126 annotated training segments and 249 validation segments.

Motivation. Understanding animal behavior could be instrumental to several robotics applications. To demonstrate the value of our dataset for robotics, we propose a task based on the problem of vision-based locomotion. Specifically, the task consists of predicting the parameters of the terrain a quadrupedal robot is walking on (see Fig. 6). As shown in multiple previous works on locomotion [30, 31, 22, 4, 3, 45], accurate prediction of these parameters is correlated with improved performance in locomotion. Intuitively, the EgoPet data closely resembles the video captured by a quadruped robot since a camera mounted on a pet is approximately at the same location as the camera mounted on the robot. In addition, the task of walking is highly represented in the dataset.

Task Description. The parameters we would like to predict are the local terrain geometry, the terrain's friction, and the parameters related to the robot's walking behavior on the terrain, including the robot's speed, motor efficiency, and high-level command. The exact identification of these parameters is generally impossible [25]: two terrains could have different combinations of parameters but “feel” the same to an agent. For example, walking on sand and mud could similarly affect the robot's proprioception, even though their properties differ. Therefore, similarly to previous work [25, 27, 30], we aim to predict a latent representation $z_t$ of the terrain parameters. This latent representation consists of the hidden layer of a neural network trained in simulation to encode ground-truth terrain parameters. This neural network is trained end-to-end with an action policy on locomotion using reinforcement learning. The task consists of predicting the latent terrain representation at different time intervals from a sequence of frames. Our setup closely follows the one in [30]. Specifically, we add to EgoPet data collected with a quadrupedal robot in multiple outdoor environments with different terrain characteristics, e.g., sand or grass.

Dataset and Annotations. To collect the dataset, we deployed the walking policy of [30] on a Unitree A1 robot dog in three environments: an office, a park, and a beach. We collect approximately 20 minutes of walking data in these environments, which are used exclusively for evaluation. For training, we use the data from [30], which contains 120 thousand frames, corresponding to a total walking time of approximately 2.3 hours. Each environment has different terrain geometries, including flats, steps, and slopes. Each sample contains an image collected from a forward-looking camera mounted on the robot and the (latent) parameters of the terrain below the center of mass of the robot, $z_t$, estimated with a history of proprioception. See [30] for details about the annotation procedure.

The final task consists of predicting $z_t$ from a history of images. We generate several sub-tasks by predicting the future terrain parameters $z_{t+0.8}, z_{t+1.5}$ and the past ones $z_{t-0.8}, z_{t-1.5}$. These time intervals were selected to differentiate between forecasting and estimation. The further the prediction is in the future or the past, the harder the task is. The input images might contain little information, or none at all, about the terrain at these times. Therefore, inferences based on the context are required. For example, one can predict the presence of a step in front of the robot from the shadow it casts on the terrain.
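The sub-task construction can be sketched as pairing each frame with the terrain latent at fixed time offsets; the variable names and the sampling rate below are illustrative assumptions, not the released format:

```python
def vpp_samples(frames, z, offsets=(-1.5, -0.8, 0.0, 0.8, 1.5), hz=10):
    """Pair frame i with terrain latents z at each offset (in seconds).

    `frames` and `z` are assumed to be time-aligned sequences sampled at
    `hz`; offsets whose index falls outside the sequence are skipped.
    """
    samples = []
    for i in range(len(frames)):
        targets = {dt: z[i + round(dt * hz)]
                   for dt in offsets
                   if 0 <= i + round(dt * hz) < len(z)}
        samples.append((frames[i], targets))
    return samples
```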

We divide the newly collected data into three test datasets: the first is in-distribution, featuring terrains and lighting conditions similar to the training data. The second dataset is out of distribution since it is captured with different lighting conditions, i.e., at night, but in environments with the same features as the training data. Finally, the third dataset contains sandy environments, which the robot has not encountered during training.

Our goal in the experiments is to establish initial performance baselines on the EgoPet tasks. For the VPP task, we hypothesize that EgoPet is a more useful pretraining resource compared to other datasets. We evaluate different pretrained models and compare their performance on the VIP, LP, and VPP tasks. We adopt a simple linear probing protocol, where we freeze the model weights and for each task train only a linear layer to predict the output. For evaluation, we use the models' publicly released checkpoints, typically trained on IN-1k or K400, unless stated otherwise.

MAE [21] is trained by masking a high proportion (e.g., 75%) of random patches in the input image and reconstructing the missing pixels through an asymmetric encoder-decoder architecture. In our experiments we use an MAE model pretrained on IN-1k.

MVP [53] uses the same model as MAE, but trains it on a mixture of egocentric datasets which we refer to as Ego Mix: a combination of Epic-Kitchens [11], 100DOH [43], Ego4D [19], and Something-Something [18].

DINO [8] is trained with a student-teacher architecture over pairs of augmented images by encouraging invariance to the image augmentations. The teacher’s output is centered, and both networks’ normalized features are compared using a cross-entropy loss. The stop-gradient operator ensures gradient propagation only through the student, and teacher parameters are updated using an exponential moving average (ema) of the student parameters.

iBOT [59] is trained similarly to DINO, but adds an auxiliary masked image modeling (MIM) loss by predicting the image patch representations of a learned online tokenizer.

VideoMAE [50] is an extension of MAE for video pre-training. Different from MAE, it utilizes an extremely high masking ratio (90% to 95%) and tube masking as opposed to random masking.

MVD [52] is a masked feature modeling framework for self-supervised video representation learning. Learning the video representations involves distilling student model features from both video and image teachers. We train MVD variants on Ego4D and EgoPet, using VideoMAE (K400) and MAE (IN-1k) as video and image teachers.

Implementation Details. For all models, we used the ViT-B model with patch size 16 since it was available across all methods. For the VIP task, we train all image and video models for 10 epochs. Video models represent video clips of 2 seconds using 8 input frames (4 Hz), and image models use one (the middle) frame. For the VPP task, we train all models for 50 epochs. Video models were trained with varying numbers of frames at 4 Hz. For the LP task, we train all models for 15 epochs. Video models were trained with 16 input frames (30 Hz) and image models used one frame (the last). In our LP experiments, we use only cat and dog segments that are sufficiently long, and 25% of the training data. This left us with 1,129 training segments and 167 validation segments. During the linear probing training phase, we do not apply any image augmentations. All other hyperparameters follow the MAE and MVD linear probing recipes for image and video models, respectively.
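The linear-probing protocol above trains only a linear head on frozen features with SGD, following the MAE and MVD recipes. As a minimal stand-in, the sketch below fits the head in closed form with ridge regression, which captures the same frozen-backbone idea without the training schedule; it is a simplification of, not a substitute for, the paper's setup:

```python
import numpy as np

def fit_linear_probe(features, targets, l2=1e-3):
    """Fit a linear layer (with bias) on frozen backbone features.

    Solves the ridge-regression normal equations; the backbone that
    produced `features` is never updated, as in linear probing.
    """
    x = np.hstack([features, np.ones((len(features), 1))])  # bias column
    d = x.shape[1]
    return np.linalg.solve(x.T @ x + l2 * np.eye(d), x.T @ targets)

def probe_predict(w, features):
    """Apply the fitted linear head to new frozen features."""
    x = np.hstack([features, np.ones((len(features), 1))])
    return x @ w
```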

In this section, we report initial baseline results from applying a range of models to the VIP, LP, and VPP tasks. Taken together, these results underline the interesting observation that current large video datasets used for pretraining are not diverse enough to perform well across all the EgoPet downstream tasks. For example, pretraining on K400 is better than Ego4D for VIP but worse on VPP. Furthermore, by pretraining on EgoPet, we observe improved downstream performance on the VPP task compared to other models.

The results in Table 2 show that models trained on EgoPet achieve improved performance compared to K400 or Ego4D on both interaction prediction and object prediction. Compared to image-based models like iBOT, MVD trained on EgoPet performs better on Top-3 accuracy but worse on Top-1. This is likely due to the diversity of objects appearing in IN-1k, an image recognition dataset. Compared to other video models, MVD (EgoPet) performs better. To obtain more insight into what models focus on in this task, we apply Grad-CAM [41] to our MVD EgoPet interaction classifier. Fig. 7 shows the corresponding heatmaps, which can be seen to focus on the rat (top-left) and another dog (bottom-right). In these cases, the model seems to be attending to the object of interaction.

For this task, we evaluated models based on their predicted unit motions 40 timesteps into the future, corresponding to 4 seconds. We form trajectories from these predicted motions and compute the RMSE of the Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) metrics against the ground truth trajectories. ATE and RPE are commonly employed metrics for evaluating systems such as SLAM and visual odometry [58, 49, 48, 7]. ATE first aligns the ground truth with the predicted trajectory, and then computes the absolute pose difference. RPE measures the difference between the predicted and ground truth locomotion [47].
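For translation-only trajectories, the two metrics can be sketched as follows; note that full ATE typically also fits a rotation (and sometimes a scale) during alignment, which this simplified version omits:

```python
import numpy as np

def ate_rmse(pred, gt):
    """RMSE of Absolute Trajectory Error after translation-only alignment.

    pred, gt: (N, 3) position arrays. The predicted trajectory is first
    shifted so the two centroids coincide, then per-point errors are taken.
    """
    aligned = pred - pred.mean(axis=0) + gt.mean(axis=0)
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))

def rpe_rmse(pred, gt):
    """RMSE of Relative Pose Error over consecutive-step translations."""
    dp, dg = np.diff(pred, axis=0), np.diff(gt, axis=0)
    return float(np.sqrt(((dp - dg) ** 2).sum(axis=1).mean()))
```

A trajectory offset by a constant translation scores zero on both metrics: ATE aligns the offset away, and RPE compares only step-to-step motion.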

The results in Fig. 8 indicate that models trained on EgoPet perform better than Ego4D and K400 and that the Ego4D model performed second best, possibly due to also being egocentric data. The full results in Supplementary Table 4 indicate that video models perform much better than image models as a whole, which we speculate is due to better modeling of the agent’s velocity and acceleration as well as the motion of other agents.

Table 3 provides the result of the VPP task. It can be seen that using EgoPet data leads to lower errors on this task. Additionally, the results show that using additional past frames as context helps and that video models outperform image models. Within image models, MVP performs better than MAE, likely because it was trained on egocentric data, and iBOT performs best. Within video models, MVD trained on EgoPet achieves lower mean squared error loss compared to the same model trained with K400 or Ego4D, and lower error compared to all image models. We speculate that MVD (EgoPet) outperforms all models because compared to other datasets the EgoPet videos are more similar to videos captured by a forward-facing camera mounted on a quadruped robodog. We provide the full results in the Supplementary Table 5.

EgoPet is a video dataset, and as such it primarily contains visual and auditory signals. However, animals interact with their environment using a multitude of senses, including smell and touch. The absence of these sensory modalities in our dataset and model may lead to a partial or skewed understanding of animal behavior and intelligence. Animal behavior is highly complex and influenced by a myriad of factors, including instinct, learning, environmental stimuli, and social interactions. Our tasks, while effective in capturing certain aspects of behavior, may not fully encapsulate the depth and complexity of animal interactions and decision-making processes. Further research is needed to develop more sophisticated tasks and models that can account for complex behavioral patterns.

We present EgoPet, a new comprehensive animal egocentric video dataset. Together with the proposed downstream tasks and benchmark, we believe EgoPet offers a testbed for studying and modeling animal behavior. Our benchmark results demonstrate that interaction prediction is far from solved, which provides an exciting opportunity for future research on modeling animal egocentric agents. Furthermore, the results demonstrate that EgoPet is a useful pretraining resource for downstream robotic locomotion tasks. Future work can broaden the tasks to integrate more sensory inputs like audio, thereby creating a richer and more holistic understanding of animal behavior.

Acknowledgements: We thank Justin Kerr for helpful discussions. Many of the figures use images taken from web videos. For each figure, we include the URL to its source videos in the Suppl. Section Figure credits. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC HOLI 819080). Prof. Darrell's group was supported in part by DoD including DARPA's LwLL and/or SemaFor programs, as well as BAIR's industrial alliance programs.

We provide additional information about the EgoPet dataset, annotation process and full quantitative results.

We include more dataset visualizations in Figure 9.

The beginning of an interaction is marked at the first time-step where the agent begins to give attention to a target, and the endpoint is marked at the last time-step before the attention ceases. In addition, annotators were instructed to mark some segments without interactions. This process results in a set of temporal segments, each corresponding to a discrete interaction event or no interaction event. To ensure the consistency of annotations across annotators, annotations are only kept where annotators agree.

The outcome of this process is a richly annotated dataset of 805 subsegments where no interaction occurs (“negative subsegments”) and 644 positive interaction subsegments that capture a wide range of 17 distinct interaction objects such as person, cat, and dog. The subsegments were then split into train and test, leaving 754 training subsegments and 695 test subsegments for a total of 1,449 annotated subsegments.

This is the list of all possible interaction objects: Person, Ball, Bench, Bird, Dog, Cat, Other Animal, Toy, Door, Floor, Food, Plant, Filament, Plastic, Water, Vehicle, Other.

Full LP results. Table 4 contains the quantitative LP results for all models, reported using the ATE and RPE metrics. The results indicate that pretraining on EgoPet leads to better ATE and RPE scores.

Full VPP results. In the main paper we included the VPP results grouped by “past”, “present” and “future” (see Table 3). In Table 5 we provide the full fine-grained VPP results by individual timestep.

Some of the figures in the paper were created from web videos. We credit the original content creators and provide links to the original videos below.

https://www.youtube.com/watch?v=69AXB6aFzRU

https://www.tiktok.com/@gonzoisacat/video/7232306745660509483

Table: S3.T1: Different video datasets. We compare EgoPet to different datasets with respect to the total time (hours), whether the videos are in first person view (egocentric) with the focus on egomotion, the agent type, and whether agent interaction annotations are available. EgoPet is the first large scale animal dataset that is both egocentric and contains interaction annotations. It is also over 56 times larger than the previous similar dataset DECADE [14].

Dataset        | Total Time (hours) | Egocentric | Egomotion | Agent   | Interaction Annotations
BDD100K        | 1,111              |            |           | Cars    |
Animal Kingdom | 50                 |            |           | Animals |
EGO4D          | 3,670              |            |           | Humans  |
DECADE         | 1.5                |            |           | Dog     |
EgoPet         | 84                 |            |           | Animals |

Table: S5.T2: Visual Interaction Prediction linear probing results. We report models' Interaction Prediction Accuracy and AUROC, as well as Object Prediction Top-1 and Top-3 Accuracy.

Model    | Dataset | Accuracy | AUROC | Top-1 Acc | Top-3 Acc
MAE      | IN-1k   | 62.34    | 69.41 | 35.02     | 61.37
MVP      | Ego Mix | 65.47    | 68.12 | 33.57     | 59.21
DINO     | IN-1k   | 65.16    | 73.38 | 37.18     | 60.65
iBOT     | IN-1k   | 65.16    | 73.50 | 37.55     | 58.12
VideoMAE | K400    | 61.56    | 66.22 | 29.24     | 54.87
MVD      | K400    | 65.63    | 70.35 | 35.38     | 62.45
MVD      | Ego4D   | 64.84    | 70.15 | 33.57     | 62.45
MVD      | EgoPet  | 68.44    | 74.31 | 35.74     | 64.62

Table: S6.T3: Vision to Proprioception Prediction (VPP) linear probing results. We report the mean squared error loss. Models trained on EgoPet perform better than models trained on other datasets. See Supplementary Table 5 for the full results.

Model | Dataset | Past (t−k) | Present (t) | Future (t+k)
1 Frame
MAE   | IN-1k   | 0.360 | 0.280 | 0.314
MVP   | Ego Mix | 0.357 | 0.273 | 0.308
DINO  | IN-1k   | 0.354 | 0.275 | 0.304
iBOT  | IN-1k   | 0.350 | 0.278 | 0.304
4 Frames
MVD   | K400    | 0.286 | 0.197 | 0.262
MVD   | Ego4D   | 0.261 | 0.224 | 0.261
MVD   | EgoPet  | 0.256 | 0.203 | 0.246
8 Frames
MVD   | K400    | 0.217 | 0.196 | 0.252
MVD   | Ego4D   | 0.208 | 0.192 | 0.249
MVD   | EgoPet  | 0.204 | 0.184 | 0.253


Figure 2: EgoPet video examples. Footage from the EgoPet dataset featuring four different animal experiences, each captured from an egocentric perspective at a distinct point in time.

Figure 3: Descriptive statistics. The histogram depicting the length (in seconds) of EgoPet video sequences exhibits a long-tailed distribution, primarily skewed toward shorter segments of less than 30 seconds. Collectively, videos featuring dogs and cats account for 94% of the total duration, showcasing interactions with people, fellow cats and dogs, toys, and various objects.

Figure 4: Visual Interaction Prediction task. The figure illustrates the process of annotating a single video, identifying and categorizing different interactions experienced by a cat, with each segment of the timeline reflecting a unique type of interaction within the animal's environment.

Figure 5: Locomotion Prediction task. A dog navigates an agility course, highlighting the concept of locomotion prediction by anticipating its forward and upward trajectory to clear the obstacle.

Figure 6: Vision to Proprioception Prediction task. This figure showcases the quadruped robot as it is about to transition from flat ground to climbing steep stairs, illustrating one of the unique terrain environments encountered during the collection of visual and proprioceptive data at annotated time intervals for VPP training.

Figure 7: VIP Grad-CAM [41] visualization.

Figure 8: Locomotion Prediction (LP) linear probing results. We report the validation ATE and RPE as a function of the epoch during training, comparing the impact of various datasets (Kinetics, Ego4D, EgoPet). Models trained on EgoPet perform better than models trained on other datasets. See Supplementary Table 4 for the full results.

Figure credits

Model    | Dataset | ATE (t+4s) | RPE (t+4s)
MAE      | IN-1k   | 0.617      | 0.233
MVP      | Ego Mix | 0.598      | 0.233
DINO     | IN-1k   | 0.582      | 0.229
iBOT     | IN-1k   | 0.574      | 0.226
VideoMAE | K400    | 0.478      | 0.171
MVD      | K400    | 0.479      | 0.172
MVD      | Ego4D   | 0.477      | 0.172
MVD      | EgoPet  | 0.474      | 0.171
Model | Dataset | t−1.5 | t−0.8 | t     | t+0.8 | t+1.5 | Avg.
1 Frame
MAE   | IN-1k   | 0.378 | 0.341 | 0.280 | 0.311 | 0.317 | 0.325
MVP   | Ego Mix | 0.372 | 0.341 | 0.273 | 0.303 | 0.313 | 0.320
DINO  | IN-1k   | 0.371 | 0.337 | 0.275 | 0.301 | 0.308 | 0.318
iBOT  | IN-1k   | 0.364 | 0.336 | 0.278 | 0.303 | 0.305 | 0.317
2 Frames
MVD   | K400    | 0.333 | 0.284 | 0.207 | 0.255 | 0.273 | 0.270
MVD   | Ego4D   | 0.331 | 0.285 | 0.211 | 0.254 | 0.272 | 0.271
MVD   | EgoPet  | 0.328 | 0.281 | 0.197 | 0.248 | 0.271 | 0.265
4 Frames
MVD   | K400    | 0.358 | 0.214 | 0.197 | 0.262 | 0.263 | 0.259
MVD   | Ego4D   | 0.311 | 0.212 | 0.224 | 0.235 | 0.287 | 0.254
MVD   | EgoPet  | 0.276 | 0.235 | 0.203 | 0.230 | 0.262 | 0.241
8 Frames
MVD   | K400    | 0.226 | 0.208 | 0.196 | 0.240 | 0.264 | 0.227
MVD   | Ego4D   | 0.217 | 0.200 | 0.192 | 0.234 | 0.264 | 0.221
MVD   | EgoPet  | 0.214 | 0.195 | 0.184 | 0.237 | 0.268 | 0.219


Animals are intelligent agents that exhibit various cognitive and behavioral traits. They plan and act to accomplish complex goals and can interact with objects or other agents. Consider a cat attempting to catch a rat; this requires the cat to execute a precise sequence of actions with impeccable timing, all while responding to the rat’s efforts to escape.

Current Artificial Intelligence (AI) systems can synthesize high quality images [39, 40], generate coherent text [5, 51], and even code Python programs [9]. But despite this remarkable progress, there are basic animal behaviors that are beyond the reach of current models. Recently, there has been a significant body of research in robotics aimed at learning policies for quadruped locomotion, and other basic actions [27, 25, 1, 32, 42, 10, 33, 2]. However, we argue that a major limitation in advancing towards more complex systems is the availability of large-scale, real-world data.

To address this, we present EgoPet, a new web-scale dataset from the perspective of pets. EgoPet contains more than 84 hours of video, including different animals like dogs, cats, eagles, turtles, and more. This video footage reveals the world through the eyes of the pet as perceived in its day-to-day life, e.g., a dog going for a walk or entering a park, or a cat wandering freely around a farm. The video data was sourced from the internet and predominantly includes pet videos, hence we have named the dataset EgoPet.

To measure progress in modeling and learning from animals, we propose three new tasks that aim to capture perception and action (see Fig. 1): Visual Interaction Prediction (VIP), Locomotion Prediction (LP), and Vision to Proprioception Prediction (VPP). Together with these tasks, we provide annotated training and validation data used for downstream evaluation.

The VIP task aims to detect and classify animal interactions and is inspired by human-object interaction tasks [43]. We temporally annotated a subset of the EgoPet videos with the start and end times of visual interactions and the object of the interaction category. The categories, which include person, cat, and dog, were chosen based on how commonly they occurred as objects (the full list of categories is provided in the Supplementary).

The goal of the LP task is to predict the future 4-second trajectory of the pet. This is useful for learning basic pet skills like avoiding obstacles or navigating. We extracted pseudo ground truth trajectories using Deep Patch Visual Odometry (DPVO) [49], the best-performing SLAM system for our dataset. We manually filtered inaccurate trajectories in the validation data to ensure high-quality evaluation.

Finally, in the VPP task, we study EgoPet’s utility for a downstream robotic task: legged locomotion. Given a video observation from a forward-facing camera mounted on a quadruped robot, the goal is to predict the features of the terrain perceived by the robot’s proprioception across its trajectory. Making accurate predictions requires perceiving the landscape and anticipating the robot controls. This differs from previous works on robot visual prediction [30, 45, 31], which require conditioning over current robot controls and are thus challenging to train at scale. To assess performance in this task, we gathered data utilizing a quadruped robodog. This data includes paired videos and proprioception features, which are then utilized for subsequent training and evaluation processes.

We train various self-supervised models and evaluate how they perform downstream using a simple linear probing protocol. We make the surprising finding that pretraining on EgoPet yields better performance than pretraining on other, much larger video datasets like Ego4D [19] and Kinetics 400 [23]. This indicates the inadequacy of current datasets in studying animal-like physical skills.

Our contributions are as follows. First, we propose EgoPet, the first large-scale egocentric animal video dataset, comprising over 84 hours of video footage, to facilitate learning from animals. Second, we propose three new tasks, including human-annotated data, and set an initial benchmark. The downstream results on the VPP task indicate that EgoPet is a useful pretraining resource for quadruped locomotion, and the benchmark results on VIP show that the proposed tasks are still far from being solved, providing an exciting new opportunity to build models that capture the world through the eyes of animals.

Next, we delve into the related works surrounding video datasets, including notable research on both general video datasets of humans and animals and those focusing specifically on egocentric video data.

Video Datasets. In recent years, a variety of video datasets have played an important role in video understanding tasks. In human action recognition, datasets like UCF101 [46], Charades-Ego [44], AVA [20], FineDiving [54], and the Something-Something dataset [18] provide comprehensive coverage of human activities, ranging from daily actions to specialized sports movements. Among these, Kinetics (K400) [23] is particularly influential, advancing the study of human actions through a wide array of video clips.

Other works have aimed to collect data to study animals. These include datasets such as Animal Kingdom [36], which contains videos of various species, and MacaquePose [26], which focuses on non-human primates. These datasets are instrumental for AI advancements in wildlife recognition and interpretation. AP-10K [57] further augments this domain by providing a detailed collection of animal images for robust pose estimation. While sharing a similar motivation to our work, existing datasets on animal behavior rarely contain egocentric views and are therefore better suited to recognition problems than to studying the animals’ physical capabilities. For autonomous driving and vehicle motion, datasets like Berkeley DeepDrive [56] and KITTI [17] offer extensive insights into vehicle egomotion and environmental interactions. While these datasets enrich our understanding of motion, behavior, and interaction from a human-centric perspective, they offer limited insights into animal behavior.

Egocentric Video Datasets. Agents interact with the world from a first-person point of view, so collecting such data has many applications, from video understanding to augmented reality. In the past decade, many egocentric datasets were collected [16, 11, 44, 38, 28], with the majority focusing on human activities and object interactions in indoor environments (e.g., kitchens). For example, Epic Kitchens [11, 12] is a large cooking dataset recorded in 45 kitchens across 4 different cities, whereas Charades-Ego [44] consists of 4,000 paired videos of human actions in first and third person. Other datasets focus more on conversation and social interactions [15, 35, 37]. Existing datasets differ in the environments in which they are recorded (e.g., outdoor vs. kitchens), whether or not they are scripted, and the number of videos. Recently, Ego4D [19], a new comprehensive egocentric dataset, was released. Different from previous datasets, it is more diverse (e.g., indoor and outdoor activities, diverse geographical locations). However, while existing datasets focus on humans and human skills, our focus is on animal agents, which have more limited language and hand-object interactions. The most closely related egocentric dataset is DECADE [14], which consists of an hour of footage of a single dog, including joint location annotations. EgoPet is inspired by DECADE but is a much larger (84 hours) and more diverse web-scale dataset.

The EgoPet dataset is a unique collection of egocentric video footage primarily featuring dogs and cats, along with various other animals like eagles, wolves, turtles, sea turtles, sharks, snakes, cheetahs, pythons, geese, alligators, and dolphins (examples included in Fig. 2 and Suppl. Figure 9). Together with the proposed downstream tasks and benchmark, EgoPet is a valuable resource for researchers and enthusiasts interested in studying animals from an egocentric perspective.

We begin with the motivation behind EgoPet and its connection to existing datasets in Section 3.1. We then delve into the dataset’s statistics in Section 3.2 and the collection process in Section 3.3.

To provide a clearer understanding of EgoPet’s significance, we compare it with various other datasets, considering factors such as total video duration, perspective (egocentric or non-egocentric), egomotion, the agents involved, and the presence of interaction annotations, which are crucial for intelligent agents. Refer to Table 1 for more details. In terms of size, Ego4D [19] is the largest egocentric video dataset, and it centers on human activities, while the BDD100K [56] dataset includes both egocentric and egomotion elements but focuses on autonomous driving. Differently, EgoPet focuses on animals, and pets in particular. Among animal video datasets, the DECADE [14] dataset provides an egocentric perspective from a dog’s viewpoint, but it records only 1.5 hours of video. EgoPet expands on this by over 56 times in volume and includes a variety of species and interactions.

The EgoPet dataset is an extensive collection composed of 6,646 video segments distilled from 819 unique videos. High level statistics are provided in Fig. 3. These original videos were sourced predominantly from TikTok, accounting for 482 videos, while the remaining 338 were obtained from YouTube. The aggregate length of all video segments amounts to approximately 84 hours, which reflects a substantial volume of data for in-depth analysis. In terms of video duration, the segments exhibit an average span of 45.55 seconds, although the duration displays considerable variability, as indicated by the standard deviation of 192.19 seconds. This variation underscores the range of contexts captured within the dataset, from brief encounters to prolonged interactions.

Breaking down the dataset by animal representation, cats and dogs constitute the majority, with 4,567 and 1,905 segments, respectively. This reflects the dataset’s strong emphasis on common domestic animals while still covering less frequent but equally important species. Notably, the dataset includes segments featuring eagles (66), turtles (31), and a diverse group of other animals such as alligators, lizards, and dolphins, contributing to a rich collection of animal behaviors captured through an egocentric lens.

The camera position (where the recording device was attached) also varies: the majority of segments were captured from cameras placed on the neck (4,575) and body (1,817). Fewer segments were recorded from cameras positioned on the head (199), shell (36), collar (11), and fin (8), offering a range of perspectives that can inform how different mounting points might influence the perception of the environment from an animal’s viewpoint.

Collection Strategy. To collect the dataset, we manually searched for videos using a large set of queries on YouTube and TikTok. For example, “egocentric view”, “dog with a GoPro”, and similar phrases related to first-person animal perspectives. This led to scraping a vast pool of footage showcasing animals, primarily dogs and cats, wearing wearable cameras, allowing for an egocentric point of view. In pursuit of a broader video selection, our efforts extended to individual channels and authors known for their thematic consistency in publishing egocentric animal footage. This approach allowed us to tap into niche communities and content creators, yielding a wide variety of egocentric videos beyond the reach of generic search terms.

Dataset Refinement. A meticulous annotation process was carried out to ensure the dataset’s quality. A human annotator reviewed the collected videos to confirm that they were from an egocentric point of view. Non-egocentric or irrelevant segments were carefully removed.

All videos were adjusted to a frame rate of 30 frames per second and resized to 480p on the shortest side while maintaining the original aspect ratio. The videos were then segmented into discrete clips, during which any non-egocentric footage was removed. The final dataset consists of segments of at least three seconds, ensuring sufficient context for each interaction.
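For concreteness, the resizing and minimum-length filtering described above can be sketched as follows (an illustrative sketch only; the helper names and the rounding behavior are our assumptions, not the released preprocessing pipeline):

```python
def target_size(w, h, short=480):
    """Scale so the shortest side becomes `short` pixels, keeping aspect ratio."""
    scale = short / min(w, h)
    return round(w * scale), round(h * scale)

def keep_segments(intervals, min_len=3.0):
    """Keep candidate (start, end) clips lasting at least `min_len` seconds."""
    return [(s, e) for s, e in intervals if e - s >= min_len]
```

For example, a 1920x1080 video would be rescaled to 853x480, and any clip shorter than three seconds would be dropped.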

In order to allow quantitative comparisons of animal-prediction approaches, we next define several prediction tasks on the EgoPet dataset. We provide annotated datasets based on these tasks, which will allow effective benchmarking of different approaches.

Motivation. Human activities such as actions and interactions from an egocentric viewpoint have been previously explored in various datasets mostly focusing on activity recognition [24, 29, 60], human-object interactions [13, 6, 34, 11], and social interactions [15, 35, 55]. Inspired by these works, we focus on animal interactions with other agents or objects, and for simplicity, we only consider visual interactions. Observing interactions through an egocentric perspective offers insights into how animals navigate their world, how they communicate with other beings, and how their physical movements correlate with environmental stimuli. Being able to identify interactions is a core task in computer vision and robotics with practical applications in designing systems that can operate in dynamic, real-world settings.

Task Description. The input for this task is a video clip from the egocentric perspective of an animal. The labels are twofold: a binary label indicating whether an interaction is taking place or not, and a categorical label describing the object of the interaction. This binary label simplifies the vast range of potential interactions into a manageable form for the model, while the identification of the interaction object adds a layer of specificity necessary for understanding the context of the interaction.
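The twofold label structure can be illustrated with a small sketch (a hypothetical encoding; the category subset and function name are ours, and the full 17-way category list is given in the Supplementary):

```python
# Illustrative subset; the dataset defines 17 interaction-object categories.
CATEGORIES = ["person", "cat", "dog"]

def vip_labels(object_name):
    """Map a subsegment annotation to the twofold VIP target:
    (binary interaction flag, category index or None when no interaction)."""
    if object_name is None:
        return 0, None
    return 1, CATEGORIES.index(object_name)
```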

In the context of the EgoPet dataset, a “visual interaction” is defined as a discernible event where the agent—typically an animal such as a dog or cat—demonstrates clear attention to an object or another agent within its environment. This attention may be manifested through physical contact, proximity, orientation, or vocalization (such as barking or making sounds) toward the object of the interaction, which can be an inanimate object or another agent. The fundamental criterion for a visual interaction is the presence of visual evidence within the video that the agent is engaged with, or reacting to, a particular stimulus. Aimless movements, such as wandering without a clear target or displaying alertness without a specific focus, are not labeled as visual interactions.

Annotations. The data labeling process for marking interactions involved a detailed analysis of the video content, which resulted in the annotation of 1,449 subsegments (see Fig. 4). Two human annotators were trained to identify and timestamp the start and end of each interaction event. The outcome of this process is a richly annotated dataset of 805 subsegments where no interaction occurs (“negative subsegments”) and 644 positive interaction subsegments that capture a wide range of 17 distinct interaction objects such as person, cat, and dog. The subsegments were then split into train and test sets, leaving us with 754 training subsegments and 695 test subsegments, for a total of 1,449 annotated subsegments. The full annotation process is described in the Supplementary.

Motivation. Planning where to move involves a complex interplay of both perception and foresight. It requires the ability to anticipate potential obstacles, consider various courses of action, and select the most efficient and effective strategy to achieve a desired goal. EgoPet contains examples where animals plan a future trajectory to achieve a certain goal (e.g., a dog following its owner; see Fig. 5).

Task Description. Given a sequence of past $m$ video frames $\{x_i\}_{i=t-m}^{t}$, the goal is to predict the unit-normalized future trajectory of the agent $\{v_j\}_{j=t+1}^{t+k}$, where $v_j \in \mathbb{R}^3$ represents the relative location of the agent at timestep $j$. We predict the unit-normalized relative location due to the scale ambiguity of the extracted trajectories. In practice, we condition models on $m=16$ frames and predict $k=40$ future locations, which correspond to 4 seconds into the future.
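A minimal sketch of how the unit-normalized LP target can be derived from absolute odometry positions (illustrative only; it assumes per-frame positions have already been extracted, e.g., by DPVO):

```python
import numpy as np

def unit_relative_motions(positions):
    """Turn absolute camera positions (T, 3) from visual odometry into
    unit-normalized relative displacements v_j = (p_j - p_{j-1}) / ||p_j - p_{j-1}||,
    removing the scale that monocular odometry cannot recover."""
    deltas = np.diff(np.asarray(positions, dtype=float), axis=0)
    norms = np.linalg.norm(deltas, axis=1, keepdims=True)
    return deltas / np.clip(norms, 1e-8, None)  # guard against zero motion
```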

Annotations. To obtain pseudo ground truth agent trajectories, we used Deep Patch Visual Odometry (DPVO [49]), a system for monocular visual odometry that utilizes sparse patch-based matching across frames. This system largely outperformed other open-source SLAM systems in terms of convergence rate and qualitative accuracy in our experiments.

Given an input sequence of frames, DPVO returns the location and orientation of the camera for each frame. To obtain training trajectories, we feed videos with a stride of 5 to DPVO. To ensure high-quality evaluation, we feed validation videos with strides of 5, 10, and 15 into DPVO and evaluate the quality of the trajectories manually. Specifically, two human annotators were trained to evaluate the trajectories from a bird's-eye view (XZ view) and determine the best matching trajectory, if any, for the video. This left us with 6,126 annotated training segments and 249 validation segments.

Motivation. Understanding animal behavior could be instrumental to several robotics applications. To demonstrate the value of our dataset for robotics, we propose a task based on the problem of vision-based locomotion. Specifically, the task consists of predicting the parameters of the terrain a quadrupedal robot is walking on (see Fig. 6). As shown in multiple previous works on locomotion [30, 31, 22, 4, 3, 45], accurate prediction of these parameters is correlated with improved performance in locomotion. Intuitively, the EgoPet data closely resembles the video captured by a quadruped robot since a camera mounted on a pet is approximately at the same location as the camera mounted on the robot. In addition, the task of walking is highly represented in the dataset.

Task Description. The parameters we would like to predict are the local terrain geometry, the terrain’s friction, and parameters related to the robot’s walking behavior on the terrain, including the robot’s speed, motor efficiency, and high-level command. Exact identification of these parameters is generally impossible [25]: two terrains could have different combinations of parameters but “feel” the same to an agent. For example, walking on sand and mud could similarly affect the robot’s proprioception, even though their properties differ. Therefore, similarly to previous work [25, 27, 30], we aim to predict a latent representation $z_t$ of the terrain parameters. This latent representation consists of the hidden layer of a neural network trained in simulation to encode ground-truth terrain parameters. This neural network is trained end-to-end with an action policy on locomotion using reinforcement learning. The task consists of predicting the latent terrain representation at different time intervals from a sequence of frames. Our setup closely follows the one in [30]. Specifically, we add to EgoPet data collected with a quadrupedal robot in multiple outdoor environments with different terrain characteristics, e.g., sand or grass.

Dataset and Annotations. To collect the dataset, we deployed the walking policy of [30] on a Unitree A1 robot dog in three environments: an office, a park, and a beach. We collect approximately 20 minutes of walking data in these environments, which are used exclusively for evaluation. For training, we use the data from [30], which contains 120,000 frames, corresponding to a total walking time of approximately 2.3 hours. Each environment has different terrain geometries, including flats, steps, and slopes. Each sample contains an image collected from a forward-looking camera mounted on the robot and the (latent) parameters $z_t$ of the terrain below the robot's center of mass, estimated from a history of proprioception. See [30] for details about the annotation procedure.

The final task consists of predicting $z_t$ from a history of images. We generate several sub-tasks by predicting the future terrain parameters $z_{t+0.8}, z_{t+1.5}$ and the past ones $z_{t-0.8}, z_{t-1.5}$. These time intervals were selected to differentiate between forecasting and estimation. The further the prediction is in the future or the past, the harder the task: the input images might contain little or no information about the terrain at these times, so inference based on context is required. For example, one can predict the presence of a step in front of the robot from the shadow it casts on the terrain.
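Pairing a frame index with the latent targets at these offsets can be sketched as follows (illustrative; the 30 Hz frame rate and the out-of-range handling are our assumptions, not the paper's exact data loader):

```python
def vpp_targets(z_series, t, fps=30, offsets=(-1.5, -0.8, 0.0, 0.8, 1.5)):
    """Look up per-frame terrain latents z at past/present/future offsets
    (in seconds) relative to frame index t; None when out of range."""
    out = {}
    for dt in offsets:
        j = t + round(dt * fps)
        out[dt] = z_series[j] if 0 <= j < len(z_series) else None
    return out
```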

We divide the newly collected data into three test datasets: the first is in-distribution, featuring terrains and lighting conditions similar to the training data. The second dataset is out of distribution since it is captured with different lighting conditions, i.e., at night, but in environments with the same features as the training data. Finally, the third dataset contains sandy environments, which the robot has not encountered during training.

Our goal in the experiments is to establish initial performance baselines on the EgoPet tasks. For the VPP task, we hypothesize that EgoPet is a more useful pretraining resource compared to other datasets. We evaluate different pretrained models and compare their performance on the VIP, LP, and VPP tasks. We adopt a simple linear probing protocol, where we freeze the model weights and, for each task, train only a linear layer to predict the output. For evaluation, we use the publicly released checkpoints of the following models, typically trained on IN-1k or K400, unless stated otherwise.
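As an illustration of the probing protocol, a simplified linear probe over frozen features can be written in closed form (a sketch only, under the assumption of a regression target as in VPP; the paper instead trains the linear layer with the MAE/MVD linear-probing recipes, not ridge regression):

```python
import numpy as np

def fit_linear_probe(features, targets, l2=1e-3):
    """Closed-form ridge-regression probe on frozen backbone features.
    features: (N, D) pooled encoder outputs; targets: (N, K).
    Returns (D+1, K) weights, including a bias row."""
    X = np.hstack([features, np.ones((len(features), 1))])  # append bias column
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ targets)

def probe_predict(W, features):
    X = np.hstack([features, np.ones((len(features), 1))])
    return X @ W
```

The backbone is never updated; only the probe weights `W` are fit, which is what makes the comparison isolate the quality of the pretrained representations.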

MAE [21] is trained by masking random patches of the input image and reconstructing the missing pixels through an asymmetric encoder-decoder architecture, masking a high proportion (e.g., 75%) of the input image. In our experiments, we use an MAE model pretrained on IN-1k.

MVP [53] uses the same architecture as MAE, but trains on a mixture of egocentric datasets which we refer to as Ego Mix: a combination of Epic-Kitchens [11], 100DOH [43], Ego4D [19], and Something-Something [18].

DINO [8] is trained with a student-teacher architecture over pairs of augmented images by encouraging invariance to the image augmentations. The teacher’s output is centered, and both networks’ normalized features are compared using a cross-entropy loss. The stop-gradient operator ensures gradient propagation only through the student, and teacher parameters are updated using an exponential moving average (ema) of the student parameters.

iBOT [59] is trained similarly to DINO, but adds an auxiliary masked image modeling (MIM) loss by predicting the patch representations produced by a learned online tokenizer.

VideoMAE [50] is an extension of MAE for video pre-training. Different from MAE, it utilizes an extremely high masking ratio (90% to 95%) and tube masking as opposed to random masking.

MVD [52] is a masked feature modeling framework for self-supervised video representation learning. Learning the video representations involves distilling student model features from both video and image teachers. We train MVD variants on Ego4D and EgoPet, using VideoMAE (K400) and MAE (IN-1k) as video and image teachers.

Implementation Details. For all models, we used the ViT-B model with patch size 16, since it was available across all methods. For the VIP task, we train all image and video models for 10 epochs. Video models represent video clips of 2 seconds using 8 input frames (4 Hz), and image models use one (the middle) frame. For the VPP task, we train all models for 50 epochs. Video models were trained with varying numbers of frames at 4 Hz. For the LP task, we train all models for 15 epochs. Video models were trained with 16 input frames (30 Hz) and image models used one frame (the last). In our LP experiments, we only use cat and dog segments that are sufficiently long, and 25% of the training data. This left us with 1,129 training segments and 167 validation segments. During the linear probing training phase, we do not apply any image augmentations. All other hyperparameters follow the MAE and MVD linear probing recipes for image and video models, respectively.

In this section, we report initial baseline results from applying a range of models to the VIP, LP, and VPP tasks. Taken together, these results underline the interesting observation that current large video datasets used for pretraining are not diverse enough to perform well across all the EgoPet downstream tasks. For example, pretraining on K400 is better than Ego4D for VIP but worse on VPP. Furthermore, by pretraining on EgoPet, we observe improved downstream performance on the VPP task compared to other models.

The results in Table 2 show that models trained on EgoPet achieve improved performance compared to K400 or Ego4D both on interaction prediction and object prediction. Compared to image based models like iBOT, MVD trained on EgoPet performs better on Top-3 Acc but worse on Top-1. This is likely due to the diversity of objects appearing in IN-1k, an image recognition dataset. Compared to other video models, MVD (EgoPet) performs better. To obtain more insight into what models focus on in this task, we apply Grad-CAM [41] on our MVD EgoPet interaction classifier. Fig. 7 shows the corresponding heatmaps and can be seen to focus on the rat (top-left), and another dog (bottom-right). In these cases, the model seems to be attending to the object of interaction.

For this task, we evaluated models based on their predicted unit motions 40 timesteps into the future, corresponding to 4 seconds. We form trajectories from these predicted motions and compute the RMSE of the Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) metrics against the ground truth trajectories. ATE and RPE are commonly employed metrics for evaluating systems such as SLAM and visual odometry [58, 49, 48, 7]. ATE first aligns the ground truth with the predicted trajectory, and then computes the absolute pose difference. RPE measures the difference between the predicted and ground truth locomotion [47].
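The two metrics can be sketched as follows (a simplified version: ATE here uses translation-only centroid alignment, whereas standard trajectory evaluations also align rotation and, for monocular systems, scale, e.g., via the Umeyama method):

```python
import numpy as np

def ate_rmse(gt, pred):
    """Absolute Trajectory Error: align the trajectories (here by centroid,
    translation only), then take the RMSE of pointwise position differences."""
    gt = np.asarray(gt, float)
    pred = np.asarray(pred, float)
    d = (gt - gt.mean(axis=0)) - (pred - pred.mean(axis=0))
    return float(np.sqrt((d ** 2).sum(axis=1).mean()))

def rpe_rmse(gt, pred):
    """Relative Pose Error (translation part): RMSE of the difference
    between consecutive relative motions."""
    d = np.diff(np.asarray(gt, float), axis=0) - np.diff(np.asarray(pred, float), axis=0)
    return float(np.sqrt((d ** 2).sum(axis=1).mean()))
```

Note that a constant offset between the two trajectories leaves both metrics at zero, which is the intended invariance: only the shape of the motion is scored.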

The results in Fig. 8 indicate that models trained on EgoPet perform better than Ego4D and K400 and that the Ego4D model performed second best, possibly due to also being egocentric data. The full results in Supplementary Table 4 indicate that video models perform much better than image models as a whole, which we speculate is due to better modeling of the agent’s velocity and acceleration as well as the motion of other agents.

Table 3 provides the results of the VPP task. It can be seen that using EgoPet data leads to lower errors on this task. Additionally, the results show that using additional past frames as context helps and that video models outperform image models. Within image models, MVP performs better than MAE, likely because it was trained on egocentric data, and iBOT performs best. Within video models, MVD trained on EgoPet achieves lower mean squared error loss compared to the same model trained with K400 or Ego4D, and lower error compared to all image models. We speculate that MVD (EgoPet) outperforms all models because, compared to other datasets, the EgoPet videos are more similar to videos captured by a forward-facing camera mounted on a quadruped robodog. We provide the full results in the Supplementary Table 5.

EgoPet is a video dataset, and as such it primarily contains visual and auditory signals. However, animals interact with their environment using a multitude of senses, including smell and touch. The absence of these sensory modalities in our dataset and model may lead to a partial or skewed understanding of animal behavior and intelligence. Animal behavior is highly complex and influenced by a myriad of factors, including instinct, learning, environmental stimuli, and social interactions. Our tasks, while effective in capturing certain aspects of behavior, may not fully encapsulate the depth and complexity of animal interactions and decision-making processes. Further research is needed to develop more sophisticated tasks and models that can account for complex behavioral patterns.

We present EgoPet, a new comprehensive animal egocentric video dataset. Together with the proposed downstream tasks and benchmark, we believe EgoPet offers a testbed for studying and modeling animal behavior. Our benchmark results demonstrate that interaction prediction is far from solved, which provides an exciting opportunity for future research on modeling egocentric animal agents. Furthermore, the results demonstrate that EgoPet is a useful pretraining resource for downstream robotic locomotion tasks. Future work can broaden the tasks to integrate more sensory inputs, such as audio, thereby creating a richer and more holistic understanding of animal behavior.

Acknowledgements: We thank Justin Kerr for helpful discussions. Many of the figures use images taken from web videos. For each figure, we include the URL to its source videos in the Suppl. Section Figure credits. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC HOLI 819080). Prof. Darrell’s group was supported in part by DoD including DARPA’s LwLL and/or SemaFor programs, as well as BAIR’s industrial alliance programs.

We provide additional information about the EgoPet dataset, annotation process and full quantitative results.

We include more dataset visualizations in Figure 9.

The beginning of an interaction is marked at the first time-step where the agent begins to give attention to a target, and the endpoint is marked at the last time-step before the attention ceases. In addition, annotators were instructed to mark some segments without interactions. This process results in a set of temporal segments, each corresponding to a discrete interaction event or no interaction event. To ensure the consistency of annotations across annotators, annotations are only kept where annotators agree.
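The agreement filter can be sketched as a temporal-IoU check between the two annotators' intervals (illustrative only; the 0.5 threshold is our assumption, not the paper's stated criterion):

```python
def agreed(a, b, iou_thresh=0.5):
    """Two annotators' (start, end) intervals for the same event agree
    if their temporal intersection-over-union meets the threshold."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return union > 0 and inter / union >= iou_thresh
```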

The outcome of this process is a richly annotated dataset of 805 subsegments where no interaction occurs (“negative subsegments”) and 644 positive interaction subsegments that capture a wide range of 17 distinct interaction objects such as person, cat, and dog. The subsegments were then split into train and test, leaving us with 754 training subsegments and 695 test subsegments, for a total of 1,449 annotated subsegments.

This is the list of all possible interaction objects: Person, Ball, Bench, Bird, Dog, Cat, Other Animal, Toy, Door, Floor, Food, Plant, Filament, Plastic, Water, Vehicle, Other.

Full LP results. Table 4 contains the quantitative LP results for all models, reported using the ATE and RPE metrics. The results indicate that pretraining on EgoPet leads to better ATE and RPE scores.

Full VPP results. In the main paper we included the VPP results grouped by “past”, “present” and “future” (see Table 3). In Table 5 we provide the full fine-grained VPP results by individual timestep.

Some of the figures in the paper were created from web videos. We credit the original content creators and provide links to the original videos below.

https://www.youtube.com/watch?v=69AXB6aFzRU

https://www.tiktok.com/@gonzoisacat/video/7232306745660509483

Table: S3.T1: Different video datasets. We compare EgoPet to different datasets with respect to the total time (hours), whether the videos are in first-person view (egocentric) with a focus on egomotion, the agent type, and whether agent interaction annotations are available. EgoPet is the first large-scale animal dataset that is both egocentric and contains interaction annotations. It is also over 56 times larger than the previous similar dataset DECADE [14].

| Dataset | Total Time (hours) | Egocentric | Egomotion | Agent | Interaction Annotations |
|---|---|---|---|---|---|
| BDD100K | 1,111 | ✓ | ✓ | Cars | ✗ |
| Animal Kingdom | 50 | ✗ | ✗ | Animals | ✓ |
| Ego4D | 3,670 | ✓ | ✓ | Humans | ✓ |
| DECADE | 1.5 | ✓ | ✓ | Dog | ✗ |
| EgoPet | 84 | ✓ | ✓ | Animals | ✓ |

Table: S5.T2: Visual Interaction Prediction linear probing results. We report each model's Interaction Prediction Accuracy and AUROC, as well as Object Prediction Top-1 and Top-3 Accuracy.

| Model | Dataset | Interaction Accuracy | AUROC | Object Top-1 Acc | Object Top-3 Acc |
|---|---|---|---|---|---|
| MAE | IN-1k | 62.34 | 69.41 | 35.02 | 61.37 |
| MVP | Ego Mix | 65.47 | 68.12 | 33.57 | 59.21 |
| DINO | IN-1k | 65.16 | 73.38 | 37.18 | 60.65 |
| iBOT | IN-1k | 65.16 | 73.50 | 37.55 | 58.12 |
| VideoMAE | K400 | 61.56 | 66.22 | 29.24 | 54.87 |
| MVD | K400 | 65.63 | 70.35 | 35.38 | 62.45 |
| MVD | Ego4D | 64.84 | 70.15 | 33.57 | 62.45 |
| MVD | EgoPet | 68.44 | 74.31 | 35.74 | 64.62 |
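As context for the linear-probing protocol behind these numbers, the sketch below fits a linear probe on frozen features and reports accuracy and AUROC. The features and labels are synthetic placeholders rather than EgoPet data, and the probe is a plain logistic regression trained by gradient descent; only the protocol shape is intended to match.

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC as P(score of random positive > score of random negative), ties half."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def fit_linear_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe: the backbone stays frozen, only (w, b) are fit."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid
        g = p - y                               # gradient of the logistic loss
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

rng = np.random.default_rng(0)
# Placeholder "frozen backbone features": positives shifted along every dimension.
y = rng.integers(0, 2, 400)
X = rng.normal(size=(400, 16)) + y[:, None] * 0.5

w, b = fit_linear_probe(X[:300], y[:300])
scores = X[300:] @ w + b
acc = ((scores > 0).astype(int) == y[300:]).mean()
print(f"accuracy={acc:.2f} AUROC={auroc(y[300:], scores):.2f}")
```

In practice the features would come from the pretrained encoder being evaluated, and the probe would be trained on the EgoPet VIP training subsegments.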

Table: S6.T3: Vision to Proprioception Prediction (VPP) linear probing results. We report the mean squared error loss. Models trained on EgoPet perform better than models trained on other datasets. See Supplementary Table 5 for the full results.

| Model | Dataset | Past (t−k) | Present (t) | Future (t+k) |
|---|---|---|---|---|
| 1 Frame | | | | |
| MAE | IN-1k | 0.360 | 0.280 | 0.314 |
| MVP | Ego Mix | 0.357 | 0.273 | 0.308 |
| DINO | IN-1k | 0.354 | 0.275 | 0.304 |
| iBOT | IN-1k | 0.350 | 0.278 | 0.304 |
| 4 Frames | | | | |
| MVD | K400 | 0.286 | 0.197 | 0.262 |
| MVD | Ego4D | 0.261 | 0.224 | 0.261 |
| MVD | EgoPet | 0.256 | 0.203 | 0.246 |
| 8 Frames | | | | |
| MVD | K400 | 0.217 | 0.196 | 0.252 |
| MVD | Ego4D | 0.208 | 0.192 | 0.249 |
| MVD | EgoPet | 0.204 | 0.184 | 0.253 |


Figure: EgoPet video examples. Footage from the EgoPet dataset featuring four different animal experiences, each captured from an egocentric perspective at a distinct point in time.

Figure: Descriptive statistics. The histogram of EgoPet video-sequence lengths (in seconds) exhibits a long-tailed distribution, skewed toward segments shorter than 30 seconds. Videos featuring dogs and cats collectively account for 94% of the total duration, showcasing interactions with people, fellow cats and dogs, toys, and various objects.

Figure: Visual Interaction Prediction (VIP) task. The figure illustrates the process of annotating a single video, identifying and categorizing the interactions experienced by a cat, with each segment of the timeline reflecting a distinct type of interaction within the animal's environment.

Figure: Locomotion Prediction (LP) task. A dog navigates an agility course, illustrating locomotion prediction: anticipating the dog's forward and upward trajectory as it clears the obstacle.

Figure: Vision to Proprioception Prediction (VPP) task. The quadruped robot is about to transition from flat ground to climbing steep stairs, illustrating one of the terrain environments encountered while collecting the visual and proprioceptive data, annotated at fixed time intervals, used for VPP training.

Figure: VIP Grad-CAM [41] visualization.

Figure: Locomotion Prediction (LP) linear probing results. We report validation ATE and RPE as a function of training epoch, comparing pretraining datasets (Kinetics, Ego4D, EgoPet). Models trained on EgoPet outperform models trained on the other datasets. See Supplementary Table 4 for the full results.

Table: Full Locomotion Prediction (LP) linear probing results. We report ATE and RPE at timestep t+4.

| Model | Dataset | ATE (t+4) | RPE (t+4) |
|---|---|---|---|
| MAE | IN-1k | 0.617 | 0.233 |
| MVP | Ego Mix | 0.598 | 0.233 |
| DINO | IN-1k | 0.582 | 0.229 |
| iBOT | IN-1k | 0.574 | 0.226 |
| VideoMAE | K400 | 0.478 | 0.171 |
| MVD | K400 | 0.479 | 0.172 |
| MVD | Ego4D | 0.477 | 0.172 |
| MVD | EgoPet | 0.474 | 0.171 |
Table: Full Vision to Proprioception Prediction (VPP) linear probing results by individual timestep (mean squared error).

| Model | Dataset | Past (t−1.5) | Past (t−0.8) | Present (t) | Future (t+0.8) | Future (t+1.5) | Avg. |
|---|---|---|---|---|---|---|---|
| 1 Frame | | | | | | | |
| MAE | IN-1k | 0.378 | 0.341 | 0.280 | 0.311 | 0.317 | 0.325 |
| MVP | Egocentric | 0.372 | 0.341 | 0.273 | 0.303 | 0.313 | 0.320 |
| DINO | IN-1k | 0.371 | 0.337 | 0.275 | 0.301 | 0.308 | 0.318 |
| iBOT | IN-1k | 0.364 | 0.336 | 0.278 | 0.303 | 0.305 | 0.317 |
| 2 Frames | | | | | | | |
| MVD | K400 | 0.333 | 0.284 | 0.207 | 0.255 | 0.273 | 0.270 |
| MVD | Ego4D | 0.331 | 0.285 | 0.211 | 0.254 | 0.272 | 0.271 |
| MVD | EgoPet | 0.328 | 0.281 | 0.197 | 0.248 | 0.271 | 0.265 |
| 4 Frames | | | | | | | |
| MVD | K400 | 0.358 | 0.214 | 0.197 | 0.262 | 0.263 | 0.259 |
| MVD | Ego4D | 0.311 | 0.212 | 0.224 | 0.235 | 0.287 | 0.254 |
| MVD | EgoPet | 0.276 | 0.235 | 0.203 | 0.230 | 0.262 | 0.241 |
| 8 Frames | | | | | | | |
| MVD | K400 | 0.226 | 0.208 | 0.196 | 0.240 | 0.264 | 0.227 |
| MVD | Ego4D | 0.217 | 0.200 | 0.192 | 0.234 | 0.264 | 0.221 |
| MVD | EgoPet | 0.214 | 0.195 | 0.184 | 0.237 | 0.268 | 0.219 |

References


[wang2022masked] Wang, Rui, Chen, Dongdong, Wu, Zuxuan, Chen, Yinpeng, Dai, Xiyang, Liu, Mengchen, Yuan, Lu, Jiang, Yu-Gang. (2023). Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning. CVPR.

[loquercio2023learning] Loquercio, Antonio, Kumar, Ashish, Malik, Jitendra. (2023). Learning visual locomotion with cross-modal supervision. 2023 IEEE International Conference on Robotics and Automation (ICRA).

[lee2020learning] Lee, Joonho, Hwangbo, Jemin, Wellhausen, Lorenz, Koltun, Vladlen, Hutter, Marco. (2020). Learning quadrupedal locomotion over challenging terrain. Science robotics.

[agarwal2023legged] Agarwal, Ananye, Kumar, Ashish, Malik, Jitendra, Pathak, Deepak. (2023). Legged locomotion in challenging terrains using egocentric vision. Conference on Robot Learning.

[bajcsy2023learning] Bajcsy, Andrea, Loquercio, Antonio, Kumar, Ashish, Malik, Jitendra. (2023). Learning Vision-based Pursuit-Evasion Robot Policies. arXiv preprint arXiv:2308.16185.

[miki2022learning] Miki, Takahiro, Lee, Joonho, Hwangbo, Jemin, Wellhausen, Lorenz, Koltun, Vladlen, Hutter, Marco. (2022). Learning robust perceptive locomotion for quadrupedal robots in the wild. Science Robotics.

[sturm2012benchmark] Sturm, Jürgen, Engelhard, Nikolas, Endres, Felix, Burgard, Wolfram, Cremers, Daniel. (2012). A benchmark for the evaluation of RGB-D SLAM systems. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[campos2021orb] Campos, Carlos, Elvira, Richard, Rodríguez, Juan J. Gómez, Montiel, José M. M., Tardós, Juan D. (2021). ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM. IEEE Transactions on Robotics.

[teed2021droid] Teed, Zachary, Deng, Jia. (2021). Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems.

[zhao2022particlesfm] Zhao, Wang, Liu, Shaohui, Guo, Hengkai, Wang, Wenping, Liu, Yong-Jin. (2022). Particlesfm: Exploiting dense point trajectories for localizing moving cameras in the wild. European Conference on Computer Vision.

[choi2023learning] Choi, Suyoung, Ji, Gwanghyeon, Park, Jeongsoo, Kim, Hyeongjun, Mun, Juhyeok, Lee, Jeong Hyun, Hwangbo, Jemin. (2023). Learning quadrupedal locomotion on deformable terrain. Science Robotics.

[margolis2022rapid] Margolis, Gabriel B, Yang, Ge, Paigwar, Kartik, Chen, Tao, Agrawal, Pulkit. (2022). Rapid locomotion via reinforcement learning. arXiv preprint arXiv:2205.02824.

[shah2023vint] Shah, D, Sridhar, A, Dashora, N, Stachowicz, K, Black, K, Hirose, N, Levine, S. (2023). Vint: A large-scale, multi-task visual navigation backbone with cross-robot generalization. 7th Annual Conference on Robot Learning.

[margolis2023learning] Margolis, Gabriel B, Fu, Xiang, Ji, Yandong, Agrawal, Pulkit. (2023). Learning to See Physical Properties with Active Sensing Motor Policies. arXiv preprint arXiv:2311.01405.

[yang2023neural] Yang, Ruihan, Yang, Ge, Wang, Xiaolong. (2023). Neural volumetric memory for visual locomotion control. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[karnan2023self] Karnan, Haresh, Yang, Elvin, Farkash, Daniel, Warnell, Garrett, Biswas, Joydeep, Stone, Peter. (2023). Self-Supervised Terrain Representation Learning from Unconstrained Robot Experience. ICRA2023 Workshop on Pretraining for Robotics (PT4R).

[bednarek2019touching] Bednarek, Jakub, Bednarek, Michal, Wellhausen, Lorenz, Hutter, Marco, Walas, Krzysztof. (2019). What am I touching? Learning to classify terrain via haptic sensing. 2019 International Conference on Robotics and Automation (ICRA).

[bednarek2019robotic] Bednarek, Jakub, Bednarek, Michal, Kicki, Piotr, Walas, Krzysztof. (2019). Robotic touch: Classification of materials for manipulation and walking. 2019 2nd IEEE international conference on Soft Robotics (RoboSoft).

[sojka2023learning] Sójka, Damian, others. (2023). Learning an Efficient Terrain Representation for Haptic Localization of a Legged Robot. 2023 IEEE International Conference on Robotics and Automation (ICRA).

[kumar2021rma] Kumar, Ashish, Fu, Zipeng, Pathak, Deepak, Malik, Jitendra. (2021). Rma: Rapid motor adaptation for legged robots. Robotics: Science and Systems.

[selvaraju2017grad] Selvaraju, Ramprasaath R, Cogswell, Michael, Das, Abhishek, Vedantam, Ramakrishna, Parikh, Devi, Batra, Dhruv. (2017). Grad-cam: Visual explanations from deep networks via gradient-based localization. Proceedings of the IEEE international conference on computer vision.

[caron2021emerging] Caron, Mathilde, Touvron, Hugo, Misra, Ishan, Jégou, Hervé, Mairal, Julien, Bojanowski, Piotr, Joulin, Armand. (2021). Emerging properties in self-supervised vision transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[xiao2022masked] Xiao, Tete, Radosavovic, Ilija, Darrell, Trevor, Malik, Jitendra. (2022). Masked visual pre-training for motor control. arXiv preprint arXiv:2203.06173.

[he2022masked] He, Kaiming, Chen, Xinlei, Xie, Saining, Li, Yanghao, Dollár, Piotr, Girshick, Ross. (2022). Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[rombach2022high] Rombach, Robin, Blattmann, Andreas, Lorenz, Dominik, Esser, Patrick, Ommer, Björn. (2022). High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[ramesh2021zero] Ramesh, Aditya, Pavlov, Mikhail, Goh, Gabriel, Gray, Scott, Voss, Chelsea, Radford, Alec, Chen, Mark, Sutskever, Ilya. (2021). Zero-shot text-to-image generation. International Conference on Machine Learning.

[touvron2023llama] Touvron, Hugo, Martin, Louis, Stone, Kevin, Albert, Peter, Almahairi, Amjad, Babaei, Yasmine, Bashlykov, Nikolay, Batra, Soumya, Bhargava, Prajjwal, Bhosale, Shruti, others. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[bubeck2023sparks] Bubeck, Sébastien, others. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.

[chen2021evaluating] Chen, Mark, Tworek, Jerry, Jun, Heewoo, Yuan, Qiming, Pinto, Henrique Ponde de Oliveira, Kaplan, Jared, Edwards, Harri, Burda, Yuri, Joseph, Nicholas, Brockman, Greg, others. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

[grauman2022ego4d] Grauman, Kristen, Westbury, Andrew, Byrne, Eugene, Chavis, Zachary, Furnari, Antonino, Girdhar, Rohit, Hamburger, Jackson, Jiang, Hao, Liu, Miao, Liu, Xingyu, others. (2022). Ego4d: Around the world in 3,000 hours of egocentric video. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[kay2017kinetics] Kay, Will, Carreira, Joao, Simonyan, Karen, Zhang, Brian, Hillier, Chloe, Vijayanarasimhan, Sudheendra, Viola, Fabio, Green, Tim, Back, Trevor, Natsev, Paul, others. (2017). The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.

[fathi2012social] Fathi, Alireza, Hodgins, Jessica K, Rehg, James M. (2012). Social interactions: A first-person perspective. 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[sigurdsson2018charades] Sigurdsson, Gunnar A, Gupta, Abhinav, Schmid, Cordelia, Farhadi, Ali, Alahari, Karteek. (2018). Charades-ego: A large-scale dataset of paired third and first person videos. arXiv preprint arXiv:1804.09626.

[fathi2012learning] Fathi, Alireza, Li, Yin, Rehg, James M. (2012). Learning to recognize daily actions using gaze. Computer Vision--ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part I 12.

[damen2018scaling] Damen, Dima, Doughty, Hazel, Farinella, Giovanni Maria, Fidler, Sanja, Furnari, Antonino, Kazakos, Evangelos, Moltisanti, Davide, Munro, Jonathan, Perrett, Toby, Price, Will, others. (2018). Scaling egocentric vision: The epic-kitchens dataset. Proceedings of the European conference on computer vision (ECCV).

[shan2020understanding] Shan, Dandan, Geng, Jiaqi, Shu, Michelle, Fouhey, David F. (2020). Understanding human hands in contact at internet scale. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[damen2020rescaling] Damen, Dima, Doughty, Hazel, Farinella, Giovanni Maria, Furnari, Antonino, Kazakos, Evangelos, Ma, Jian, Moltisanti, Davide, Munro, Jonathan, Perrett, Toby, Price, Will, others. (2020). Rescaling egocentric vision. arXiv preprint arXiv:2006.13256.

[pirsiavash2012detecting] Pirsiavash, Hamed, Ramanan, Deva. (2012). Detecting activities of daily living in first-person camera views. 2012 IEEE conference on computer vision and pattern recognition.

[lee2012discovering] Lee, Yong Jae, Ghosh, Joydeep, Grauman, Kristen. (2012). Discovering important people and objects for egocentric video summarization. 2012 IEEE conference on computer vision and pattern recognition.

[northcutt2020egocom] Northcutt, Curtis, Zha, Shengxin, Lovegrove, Steven, Newcombe, Richard. (2020). Egocom: A multi-person multi-modal egocentric communications dataset. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[ng2020you2me] Ng, Evonne, Xiang, Donglai, Joo, Hanbyul, Grauman, Kristen. (2020). You2me: Inferring body pose in egocentric video via first and second person interactions. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[li2022exploring] Li, Yanghao, Mao, Hanzi, Girshick, Ross, He, Kaiming. (2022). Exploring plain vision transformer backbones for object detection. arXiv preprint arXiv:2203.16527.

[bardes2022vicregl] Bardes, Adrien, Ponce, Jean, LeCun, Yann. (2022). VICRegL: Self-Supervised Learning of Local Visual Features. arXiv preprint arXiv:2210.01571.

[goodfellow2016deep] Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron. (2016). Deep learning.

[arora2019theoretical] Arora, Sanjeev, Khandeparkar, Hrishikesh, Khodak, Mikhail, Plevrakis, Orestis, Saunshi, Nikunj. (2019). A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229.

[bridle1991unsupervised] Bridle, John, Heading, Anthony, MacKay, David. (1991). Unsupervised classifiers, mutual information and 'phantom targets'. Advances in Neural Information Processing Systems.

[zha2001spectral] Zha, Hongyuan, He, Xiaofeng, Ding, Chris, Gu, Ming, Simon, Horst D. (2001). Spectral relaxation for k-means clustering. NeurIPS.

[hornik2012spherical] Hornik, Kurt, Feinerer, Ingo, Kober, Martin, Buchta, Christian. (2012). Spherical k-means clustering. Journal of statistical software.

[park2009simple] Park, Hae-Sang, Jun, Chi-Hyuck. (2009). A simple and fast algorithm for K-medoids clustering. Expert systems with applications.

[van2008visualizing] Van der Maaten, Laurens, Hinton, Geoffrey. (2008). Visualizing data using t-SNE.. Journal of machine learning research.

[wang2010learning] Wang, Fei, Li, Ping, Konig, Arnd Christian. (2010). Learning a bi-stochastic data similarity matrix. 2010 IEEE International Conference on Data Mining.

[meilua2006uniqueness] Meilă, Marina. (2006). The uniqueness of a good optimum for k-means. Proceedings of the 23rd International Conference on Machine Learning.

[wu2009adapting] Wu, Junjie, Xiong, Hui, Chen, Jian. (2009). Adapting the right measures for k-means clustering. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining.

[liang2012k] Liang, Jiye, Bai, Liang, Dang, Chuangyin, Cao, Fuyuan. (2012). The K-means-type algorithms versus imbalanced data distributions. IEEE Transactions on Fuzzy Systems.

[rujeerapaiboon2019size] Rujeerapaiboon, Napat, Schindler, Kilian, Kuhn, Daniel, Wiesemann, Wolfram. (2019). Size matters: Cardinality-constrained clustering and outlier detection via conic optimization. SIAM J. Optimization.

[bradley2000constrained] Bradley, Paul S, Bennett, Kristin P, Demiriz, Ayhan. (2000). Constrained k-means clustering. Microsoft Research, Redmond.

[kleindessner2019fair] Kleindessner, Matthäus, Awasthi, Pranjal, Morgenstern, Jamie. (2019). Fair k-center clustering for data summarization. ICML.

[bordia2019identifying] Bordia, Shikha, Bowman, Samuel R. (2019). Identifying and reducing gender bias in word-level language models. arXiv preprint arXiv:1904.03035.

[buolamwini2018gender] Buolamwini, Joy, Gebru, Timnit. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. Conference on Fairness, Accountability and Transparency.

[ma2022principles] Ma, Yi, Tsao, Doris, Shum, Heung-Yeung. (2022). On the principles of Parsimony and Self-consistency for the emergence of intelligence. Frontiers of Information Technology & Electronic Engineering.

[wiener2019cybernetics] Wiener, Norbert. (2019). Cybernetics or Control and Communication in the Animal and the Machine.

[oord2018representation] Oord, Aaron van den, Li, Yazhe, Vinyals, Oriol. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

[krause2010discriminative] Krause, Andreas, Perona, Pietro, Gomes, Ryan. (2010). Discriminative clustering by regularized information maximization. Advances in neural information processing systems.

[paszke2019pytorch] Paszke, Adam, Gross, Sam, Massa, Francisco, Lerer, Adam, Bradbury, James, Chanan, Gregory, Killeen, Trevor, Lin, Zeming, Gimelshein, Natalia, Antiga, Luca, others. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems.

[henaff2020data] Henaff, Olivier. (2020). Data-efficient image recognition with contrastive predictive coding. International conference on machine learning.

[hu2017learning] Hu, Weihua, Miyato, Takeru, Tokui, Seiya, Matsumoto, Eiichi, Sugiyama, Masashi. (2017). Learning discrete representations via information maximizing self-augmented training. International conference on machine learning.

[linsker1988self] Linsker, Ralph. (1988). Self-organization in a perceptual network. Computer.

[tschannen2019mutual] Tschannen, Michael, Djolonga, Josip, Rubenstein, Paul K, Gelly, Sylvain, Lucic, Mario. (2019). On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625.

[lake2011one] Lake, Brenden, Salakhutdinov, Ruslan, Gross, Jason, Tenenbaum, Joshua. (2011). One shot learning of simple visual concepts. Proceedings of the annual meeting of the cognitive science society.

[salakhutdinov2007learning] Salakhutdinov, Ruslan, Hinton, Geoff. (2007). Learning a nonlinear embedding by preserving class neighbourhood structure. Artificial Intelligence and Statistics.

[boden1980jean] Boden, Margaret A. (1980). Jean Piaget.

[piaget1964cognitive] Piaget, Jean. (1964). Cognitive development in children: Piaget. Journal of research in science teaching.

[boden1978artificial] Boden, Margaret A. (1978). Artificial intelligence and Piagetian theory. Synthese.

[bruner1961individual] Bruner, Jerome S. (1961). Reply to Individual and collective problems in the study of thinking. Annals of the New York Academy of Sciences.

[piaget1971biology] Piaget, Jean. (1971). Biology and knowledge: An essay on the relations between organic regulations and cognitive processes..

[grandvalet2006entropy] Grandvalet, Yves, Bengio, Yoshua. (2006). Entropy regularization. Semi-supervised learning.

[chen2020simple] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. preprint arXiv:2002.05709.

[chen2020big] Chen, Ting, Kornblith, Simon, Swersky, Kevin, Norouzi, Mohammad, Hinton, Geoffrey. (2020). Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029.

[grill2020bootstrap] Grill, Jean-Bastien, Strub, Florian, Altché, Florent, others. (2020). Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733.

[caron2020unsupervised] Caron, Mathilde, Misra, Ishan, Mairal, Julien, Goyal, Priya, Bojanowski, Piotr, Joulin, Armand. (2020). Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882.

[assran2020recovering] Assran, Mahmoud, Ballas, Nicolas, Castrejon, Lluis, Rabbat, Michael. (2020). Recovering Petaflops in Contrastive Semi-Supervised Learning of Visual Representations. arXiv preprint arXiv:2006.10803.

[vinyals2016matching] Vinyals, Oriol, Blundell, Charles, Lillicrap, Timothy, Kavukcuoglu, Koray, Wierstra, Daan. (2016). Matching networks for one shot learning. arXiv preprint arXiv:1606.04080.

[snell2017prototypical] Snell, Jake, Swersky, Kevin, Zemel, Richard S. (2017). Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175.

[ravi2016optimization] Ravi, Sachin, Larochelle, Hugo. (2016). Optimization as a model for few-shot learning.

[lake2017building] Lake, Brenden M, Ullman, Tomer D, Tenenbaum, Joshua B, Gershman, Samuel J. (2017). Building machines that learn and think like people. Behavioral and brain sciences.

[russakovsky2015imagenet] Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., Fei-Fei, Li. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision.

[he2016deep] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[you2017large] You, Yang, Gitman, Igor, Ginsburg, Boris. (2017). Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888.

[sutskever2013importance] Sutskever, Ilya, Martens, James, Dahl, George, Hinton, Geoffrey. (2013). On the importance of initialization and momentum in deep learning. International conference on machine learning.

[xie2019unsupervised] Xie, Qizhe, Dai, Zihang, Hovy, Eduard, Luong, Minh-Thang, Le, Quoc V. (2019). Unsupervised data augmentation. arXiv preprint arXiv:1904.12848.

[sohn2020fixmatch] Sohn, Kihyuk, Berthelot, David, Li, Chun-Liang, Zhang, Zizhao, Carlini, Nicholas, Cubuk, Ekin D, Kurakin, Alex, Zhang, Han, Raffel, Colin. (2020). Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685.

[pham2020meta] Pham, Hieu, Xie, Qizhe, Dai, Zihang, Le, Quoc V. (2020). Meta pseudo labels. arXiv preprint arXiv:2003.10580.

[wu2018unsupervised] Wu, Zhirong, Xiong, Yuanjun, Yu, Stella X, Lin, Dahua. (2018). Unsupervised feature learning via non-parametric instance discrimination. Proceedings of the IEEE conference on computer vision and pattern recognition.

[misra2020self] Misra, Ishan, van der Maaten, Laurens. (2020). Self-supervised learning of pretext-invariant representations. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[ren2018meta] Ren, Mengye, Triantafillou, Eleni, Ravi, Sachin, Snell, Jake, Swersky, Kevin, Tenenbaum, Joshua B, Larochelle, Hugo, Zemel, Richard S. (2018). Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676.

[he2019moco] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick. (2019). Momentum Contrast for Unsupervised Visual Representation Learning. arXiv preprint arXiv:1911.05722.

[chen2020mocov2] Xinlei Chen, Haoqi Fan, Ross Girshick, Kaiming He. (2020). Improved Baselines with Momentum Contrastive Learning. arXiv preprint arXiv:2003.04297.

[hsu2018unsupervised] Hsu, Kyle, Levine, Sergey, Finn, Chelsea. (2018). Unsupervised learning via meta-learning. arXiv preprint arXiv:1810.02334.

[chen2020exploring] Chen, Xinlei, He, Kaiming. (2020). Exploring Simple Siamese Representation Learning. arXiv preprint arXiv:2011.10566.

[loshchilov2016sgdr] Loshchilov, Ilya, Hutter, Frank. (2016). SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.

[khosla2020supervised] Khosla, Prannay, Teterwak, Piotr, Wang, Chen, Sarna, Aaron, Tian, Yonglong, Isola, Phillip, Maschinot, Aaron, Liu, Ce, Krishnan, Dilip. (2020). Supervised Contrastive Learning. arXiv preprint arXiv:2004.11362.

[miyato2018virtual] Miyato, Takeru, Maeda, Shin-ichi, Koyama, Masanori, Ishii, Shin. (2018). Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence.

[verma2019interpolation] Verma, Vikas, Kawaguchi, Kenji, Lamb, Alex, Kannala, Juho, Bengio, Yoshua, Lopez-Paz, David. (2019). Interpolation Consistency Training for Semi-Supervised Learning. arXiv preprint arXiv:1903.03825.

[zhai2019s4l] Zhai, Xiaohua, Oliver, Avital, Kolesnikov, Alexander, Beyer, Lucas. (2019). S4l: Self-supervised semi-supervised learning. Proceedings of the IEEE international conference on computer vision.

[lee2013pseudo] Lee, Dong-Hyun. (2013). Pseudo-Label: The simple and efficient semi-supervised learning method for deep neural networks. In International Conference on Machine Learning Workshop.

[scudder1965probability] Scudder, H.. (1965). Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory.

[riloff1996automatically] Riloff, Ellen. (1996). Automatically generating extraction patterns from untagged text. In Proceedings of the National Conference on Artificial Intelligence.

[berthelot2019mixmatch] Berthelot, David, Carlini, Nicholas, Goodfellow, Ian, Papernot, Nicolas, Oliver, Avital, Raffel, Colin A. (2019). Mixmatch: A holistic approach to semi-supervised learning. Advances in Neural Information Processing Systems.

[berthelot2019remixmatch] Berthelot, David, Carlini, Nicholas, Cubuk, Ekin D, Kurakin, Alex, Sohn, Kihyuk, Zhang, Han, Raffel, Colin. (2019). ReMixMatch: Semi-Supervised Learning with Distribution Alignment and Augmentation Anchoring. arXiv preprint arXiv:1911.09785.

[yarowsky1995unsupervised] Yarowsky, David. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association for Computational Linguistics.

[asano2019self] Asano, Yuki Markus, Rupprecht, Christian, Vedaldi, Andrea. (2019). Self-labelling via simultaneous clustering and representation learning. arXiv preprint arXiv:1911.05371.

[zoph2020rethinking] Zoph, Barret, Ghiasi, Golnaz, Lin, Tsung-Yi, Cui, Yin, Liu, Hanxiao, Cubuk, Ekin D, Le, Quoc V. (2020). Rethinking pre-training and self-training. arXiv preprint arXiv:2006.06882.

[xie2020self] Xie, Qizhe, Luong, Minh-Thang, Hovy, Eduard, Le, Quoc V. (2020). Self-training with noisy student improves imagenet classification. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[tarvainen2017mean] Tarvainen, Antti, Valpola, Harri. (2017). Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. arXiv preprint arXiv:1703.01780.

[el2021large] El-Nouby, Alaaeldin, Izacard, Gautier, Touvron, Hugo, Laptev, Ivan, Jegou, Hervé. (2021). Are Large-scale Datasets Necessary for Self-Supervised Pre-training?. arXiv preprint arXiv:2112.10740.

[mitrovic2020representation] Mitrovic, Jovana, McWilliams, Brian, Walker, Jacob, Buesing, Lars, Blundell, Charles. (2020). Representation learning via invariant causal mechanisms. arXiv preprint arXiv:2010.07922.

[assran2020supervision] Assran, Mahmoud, Ballas, Nicolas, Castrejon, Lluis, Rabbat, Michael. (2020). Supervision accelerates pre-training in contrastive semi-supervised learning of visual representations. arXiv preprint arXiv:2006.10803.

[joulin2012convex] Joulin, Armand, Bach, Francis. (2012). A convex relaxation for weakly supervised classifiers. arXiv preprint arXiv:1206.6413.

[laine2016temporal] Laine, Samuli, Aila, Timo. (2016). Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.

[jackson2019semi] Jackson, Jacob, Schulman, John. (2019). Semi-supervised learning by label gradient alignment. arXiv preprint arXiv:1902.02336.

[wang2019enaet] Wang, Xiao, Kihara, Daisuke, Luo, Jiebo, Qi, Guo-Jun. (2019). Enaet: Self-trained ensemble autoencoding transformations for semi-supervised learning. arXiv preprint arXiv:1911.09265.

[krizhevsky2009learning] Krizhevsky, Alex, Hinton, Geoffrey, others. (2009). Learning multiple layers of features from tiny images.

[zagoruyko2016wide] Zagoruyko, Sergey, Komodakis, Nikos. (2016). Wide residual networks. arXiv preprint arXiv:1605.07146.

[thomee2016yfcc100m] Thomee, Bart, Shamma, David A, Friedland, Gerald, Elizalde, Benjamin, Ni, Karl, Poland, Douglas, Borth, Damian, Li, Li-Jia. (2016). YFCC100M: The new data in multimedia research. Communications of the ACM.

[zhang2017mixup] Zhang, Hongyi, Cisse, Moustapha, Dauphin, Yann N, Lopez-Paz, David. (2017). mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.

[yun2019cutmix] Yun, Sangdoo, Han, Dongyoon, Oh, Seong Joon, Chun, Sanghyuk, Choe, Junsuk, Yoo, Youngjoon. (2019). Cutmix: Regularization strategy to train strong classifiers with localizable features. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[cubuk2019autoaugment] Cubuk, Ekin D, Zoph, Barret, Mane, Dandelion, Vasudevan, Vijay, Le, Quoc V. (2019). Autoaugment: Learning augmentation strategies from data. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[blum1998combining] Blum, Avrim, Mitchell, Tom. (1998). Combining labeled and unlabeled data with co-training. Proceedings of the eleventh annual conference on Computational learning theory.

[berman2019multigrain] Berman, Maxim, Jégou, Hervé, Vedaldi, Andrea, Kokkinos, Iasonas, Douze, Matthijs. (2019). MultiGrain: a unified image embedding for classes and instances. arXiv preprint arXiv:1902.05509.

[dosovitskiy2020image] Dosovitskiy, Alexey, Beyer, Lucas, Kolesnikov, Alexander, Weissenborn, Dirk, Zhai, Xiaohua, Unterthiner, Thomas, Dehghani, Mostafa, Minderer, Matthias, Heigold, Georg, Gelly, Sylvain, others. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

[vaswani2017attention] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz, Polosukhin, Illia. (2017). Attention is all you need. Advances in Neural Information Processing Systems.

[bahdanau2014neural] Bahdanau, Dzmitry, Cho, Kyunghyun, Bengio, Yoshua. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[baevski2022data2vec] Baevski, Alexei, Hsu, Wei-Ning, Xu, Qiantong, Babu, Arun, Gu, Jiatao, Auli, Michael. (2022). Data2vec: A general framework for self-supervised learning in speech, vision and language. arXiv preprint arXiv:2202.03555.

[bromley1993signature] Bromley, Jane, Bentz, James W, Bottou, Léon, Guyon, Isabelle, LeCun, Yann, Moore, Cliff, Säckinger, Eduard, Shah, Roopak. (1993). Signature verification using a “siamese” time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence.

[hjelm2018learning] Hjelm, R Devon, Fedorov, Alex, Lavoie-Marchildon, Samuel, Grewal, Karan, Bachman, Phil, Trischler, Adam, Bengio, Yoshua. (2018). Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.

[bachman2019learning] Bachman, Philip, Hjelm, R Devon, Buchwalter, William. (2019). Learning representations by maximizing mutual information across views. Advances in neural information processing systems.

[zbontar2021barlow] Zbontar, Jure, Jing, Li, Misra, Ishan, LeCun, Yann, Deny, Stéphane. (2021). Barlow twins: Self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230.

[bardes2021vicreg] Bardes, Adrien, Ponce, Jean, LeCun, Yann. (2021). Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906.

[assran2021semi] Assran, Mahmoud, Caron, Mathilde, Misra, Ishan, Bojanowski, Piotr, Joulin, Armand, Ballas, Nicolas, Rabbat, Michael. (2021). Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples. arXiv preprint arXiv:2104.13963.

[chen2020generative] Chen, Mark, Radford, Alec, Child, Rewon, Wu, Jeffrey, Jun, Heewoo, Luan, David, Sutskever, Ilya. (2020). Generative pretraining from pixels. International Conference on Machine Learning.

[he2021masked] He, Kaiming, Chen, Xinlei, Xie, Saining, Li, Yanghao, Dollár, Piotr, Girshick, Ross. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.

[denoising_vincent] Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, Manzagol, Pierre-Antoine. (2008). Extracting and Composing Robust Features with Denoising Autoencoders. Proceedings of the 25th International Conference on Machine Learning.

[vincent2010stacked] Vincent, Pascal, Larochelle, Hugo, Lajoie, Isabelle, Bengio, Yoshua, Manzagol, Pierre-Antoine. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research.

[xie2021simmim] Xie, Zhenda, Zhang, Zheng, Cao, Yue, Lin, Yutong, Bao, Jianmin, Yao, Zhuliang, Dai, Qi, Hu, Han. (2021). Simmim: A simple framework for masked image modeling. arXiv preprint arXiv:2111.09886.

[wei2021masked] Wei, Chen, Fan, Haoqi, Xie, Saining, Wu, Chao-Yuan, Yuille, Alan, Feichtenhofer, Christoph. (2021). Masked Feature Prediction for Self-Supervised Visual Pre-Training. arXiv preprint arXiv:2112.09133.

[bao2021beit] Bao, Hangbo, Dong, Li, Wei, Furu. (2021). BEiT: BERT Pre-Training of Image Transformers. arXiv preprint arXiv:2106.08254.

[zhou2021ibotyes] Zhou, Jinghao, Wei, Chen, Wang, Huiyu, Shen, Wei, Xie, Cihang, Yuille, Alan, Kong, Tao. (2021). Ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832.

[loshchilov2017decoupled] Loshchilov, Ilya, Hutter, Frank. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

[chen2021empirical] Chen, Xinlei, Xie, Saining, He, Kaiming. (2021). An empirical study of training self-supervised vision transformers. arXiv preprint arXiv:2104.02057.

[touvron2021training] Touvron, Hugo, Cord, Matthieu, Douze, Matthijs, Massa, Francisco, Sablayrolles, Alexandre, Jégou, Hervé. (2021). Training data-efficient image transformers & distillation through attention. International Conference on Machine Learning.

[assran2022masked] Assran, Mahmoud, Caron, Mathilde, Misra, Ishan, Bojanowski, Piotr, Bordes, Florian, Vincent, Pascal, Joulin, Armand, Rabbat, Michael, Ballas, Nicolas. (2022). Masked siamese networks for label-efficient learning. arXiv preprint arXiv:2204.07141.

[goyal2022vision] Goyal, Priya, Duval, Quentin, Seessel, Isaac, Caron, Mathilde, Singh, Mannat, Misra, Ishan, Sagun, Levent, Joulin, Armand, Bojanowski, Piotr. (2022). Vision models are more robust and fair when pretrained on uncurated images without supervision. arXiv preprint arXiv:2202.08360.

[tian2021divide] Tian, Yonglong, Henaff, Olivier J, van den Oord, Aäron. (2021). Divide and contrast: Self-supervised learning from uncurated data. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[mahajan2018exploring] Mahajan, Dhruv, Girshick, Ross, Ramanathan, Vignesh, He, Kaiming, Paluri, Manohar, Li, Yixuan, Bharambe, Ashwin, Van Der Maaten, Laurens. (2018). Exploring the limits of weakly supervised pretraining. Proceedings of the European conference on computer vision (ECCV).

[newman2005power] Newman, Mark EJ. (2005). Power laws, Pareto distributions and Zipf's law. Contemporary physics.

[van2018inaturalist] Van Horn, Grant, Mac Aodha, Oisin, Song, Yang, Cui, Yin, Sun, Chen, Shepard, Alex, Adam, Hartwig, Perona, Pietro, Belongie, Serge. (2018). The inaturalist species classification and detection dataset. Proceedings of the IEEE conference on computer vision and pattern recognition.

[places205] Zhou, Bolei, Lapedriza, Agata, Xiao, Jianxiong, Torralba, Antonio, Oliva, Aude. (2014). Learning Deep Features for Scene Recognition using Places Database. Advances in Neural Information Processing Systems.

[cifar10] Alex Krizhevsky. (2009). Learning multiple layers of features from tiny images.

[kitti] Andreas Geiger, Philip Lenz, Raquel Urtasun. (2012). Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. Conference on Computer Vision and Pattern Recognition (CVPR).

[clevr] Johnson, Justin, Hariharan, Bharath, van der Maaten, Laurens, Fei-Fei, Li, Zitnick, C Lawrence, Girshick, Ross. (2017). CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. CVPR.

[bordes2022high] Florian Bordes, Randall Balestriero, Pascal Vincent. (2022). High Fidelity Visualization of What Your Self-Supervised Representation Knows About. Transactions on Machine Learning Research.

[https://doi.org/10.48550/arxiv.1310.4546] Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg, Dean, Jeffrey. (2013). Distributed Representations of Words and Phrases and their Compositionality. doi:10.48550/ARXIV.1310.4546.

[zhou2014learning] Zhou, Bolei, Lapedriza, Agata, Xiao, Jianxiong, Torralba, Antonio, Oliva, Aude. (2014). Learning deep features for scene recognition using places database. Advances in neural information processing systems.

[johnson2017clevr] Johnson, Justin, Hariharan, Bharath, Van Der Maaten, Laurens, Fei-Fei, Li, Lawrence Zitnick, C, Girshick, Ross. (2017). Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. Proceedings of the IEEE conference on computer vision and pattern recognition.

[geiger2013vision] Geiger, Andreas, Lenz, Philip, Stiller, Christoph, Urtasun, Raquel. (2013). Vision meets robotics: The kitti dataset. The International Journal of Robotics Research.

[tian2021understanding] Tian, Yuandong, Chen, Xinlei, Ganguli, Surya. (2021). Understanding self-supervised learning dynamics without contrastive pairs. International Conference on Machine Learning.

[balestriero2022contrastive] Balestriero, Randall, LeCun, Yann. (2022). Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods. arXiv preprint arXiv:2205.11508.

[wang2020understanding] Wang, Tongzhou, Isola, Phillip. (2020). Understanding contrastive representation learning through alignment and uniformity on the hypersphere. International Conference on Machine Learning.

[ng2022animal] Ng, Xun Long, Ong, Kian Eng, Zheng, Qichen, Ni, Yun, Yeo, Si Yong, Liu, Jun. (2022). Animal kingdom: A large and diverse dataset for animal behavior understanding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[labuguen2021macaquepose] Labuguen, Rollyn, Matsumoto, Jumpei, Negrete, Salvador Blanco, Nishimaru, Hiroshi, Nishijo, Hisao, Takada, Masahiko, Go, Yasuhiro, Inoue, Ken-ichi, Shibata, Tomohiro. (2021). MacaquePose: a novel “in the wild” macaque monkey pose dataset for markerless motion capture. Frontiers in behavioral neuroscience.

[yu2021ap] Yu, Hang, Xu, Yufei, Zhang, Jing, Zhao, Wei, Guan, Ziyu, Tao, Dacheng. (2021). Ap-10k: A benchmark for animal pose estimation in the wild. arXiv preprint arXiv:2108.12617.

[chen2021intriguing] Chen, Ting, Luo, Calvin, Li, Lala. (2021). Intriguing properties of contrastive losses. Advances in Neural Information Processing Systems.

[garrido2022duality] Garrido, Quentin, Chen, Yubei, Bardes, Adrien, Najman, Laurent, Lecun, Yann. (2022). On the duality between contrastive and non-contrastive self-supervised learning. arXiv preprint arXiv:2206.02574.

[goyal2021vissl] Priya Goyal, Quentin Duval, Jeremy Reizenstein, Matthew Leavitt, Min Xu, Benjamin Lefaudeux, Mannat Singh, Vinicius Reis, Mathilde Caron, Piotr Bojanowski, Armand Joulin, Ishan Misra. (2021). VISSL.

[https://doi.org/10.48550/arxiv.1502.03167] Ioffe, Sergey, Szegedy, Christian. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. doi:10.48550/ARXIV.1502.03167.

[lecun2022path] LeCun, Yann. (2022). A Path Towards Autonomous Machine Intelligence Version 0.9. 2, 2022-06-27.

[chen2022intra] Chen, Yubei, Bardes, Adrien, Li, Zengyi, LeCun, Yann. (2022). Intra-Instance VICReg: Bag of Self-Supervised Image Patch Embedding. arXiv preprint arXiv:2206.08954.

[gidaris2020learning] Gidaris, Spyros, Bursuc, Andrei, Komodakis, Nikos, Pérez, Patrick, Cord, Matthieu. (2020). Learning representations by predicting bags of visual words. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[bordes2022guillotine] Bordes, Florian, Balestriero, Randall, Garrido, Quentin, Bardes, Adrien, Vincent, Pascal. (2022). Guillotine Regularization: Improving Deep Networks Generalization by Removing their Head. arXiv preprint arXiv:2206.13378.

[rao1999predictive] Rao, Rajesh PN, Ballard, Dana H. (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience.

[pathak2016context] Pathak, Deepak, Krahenbuhl, Philipp, Donahue, Jeff, Darrell, Trevor, Efros, Alexei A. (2016). Context encoders: Feature learning by inpainting. Proceedings of the IEEE conference on computer vision and pattern recognition.

[elias1955] Friston, Karl. (2005). A theory of cortical responses. Philosophical Transactions of the Royal Society B: Biological Sciences.

[devlin2018bert] Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, Toutanova, Kristina. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[dalal2005histograms] Dalal, Navneet, Triggs, Bill. (2005). Histograms of oriented gradients for human detection. 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05).

[larsson2016learning] Larsson, Gustav, Maire, Michael, Shakhnarovich, Gregory. (2016). Learning representations for automatic colorization.

[zhang2016colorful] Zhang, Richard, Isola, Phillip, Efros, Alexei A. (2016). Colorful image colorization.

[kazakos2019epic] Kazakos, Evangelos, Nagrani, Arsha, Zisserman, Andrew, Damen, Dima. (2019). Epic-fusion: Audio-visual temporal binding for egocentric action recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[li2021ego] Li, Yanghao, Nagarajan, Tushar, Xiong, Bo, Grauman, Kristen. (2021). Ego-exo: Transferring visual representations from third-person to first-person videos. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[damen2016you] Damen, Dima, Leelasawassuk, Teesid, Mayol-Cuevas, Walterio. (2016). You-Do, I-Learn: Egocentric unsupervised discovery of objects and their modes of interaction towards video-based guidance. Computer Vision and Image Understanding.

[cai2016understanding] Cai, Minjie, Kitani, Kris M, Sato, Yoichi. (2016). Understanding Hand-Object Manipulation with Grasp Types and Object Attributes.. Robotics: Science and Systems.

[nagarajan2019grounded] Nagarajan, Tushar, Feichtenhofer, Christoph, Grauman, Kristen. (2019). Grounded human-object interaction hotspots from video. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[yonetani2016recognizing] Yonetani, Ryo, Kitani, Kris M, Sato, Yoichi. (2016). Recognizing micro-actions and reactions from paired egocentric videos. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[zhou2015temporal] Zhou, Yipin, Berg, Tamara L. (2015). Temporal perception and prediction in ego-centric video. Proceedings of the IEEE International Conference on Computer Vision.

[larsson2017colorization] Larsson, Gustav, Maire, Michael, Shakhnarovich, Gregory. (2017). Colorization as a proxy task for visual understanding.

[assran2022hidden] Assran, Mahmoud, Balestriero, Randall, Duval, Quentin, Bordes, Florian, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, Ballas, Nicolas. (2022). The Hidden Uniform Cluster Prior in Self-Supervised Learning. arXiv preprint arXiv:2210.07277.

[lecun2006tutorial] LeCun, Yann, Chopra, Sumit, Hadsell, Raia, Ranzato, M, Huang, Fujie. (2006). A tutorial on energy-based learning. Predicting structured data.

[vtab] Zhai, Xiaohua, Puigcerver, Joan, Kolesnikov, Alexander, Ruyssen, Pierre, Riquelme, Carlos, Lucic, Mario, Djolonga, Josip, Pinto, Andre Susano, Neumann, Maxim, Dosovitskiy, Alexey, Beyer, Lucas, Bachem, Olivier, Tschannen, Michael, Michalski, Marcin, Bousquet, Olivier, Gelly, Sylvain, Houlsby, Neil. (2019). A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark. doi:10.48550/ARXIV.1910.04867.

[lars] You, Yang, Gitman, Igor, Ginsburg, Boris. (2017). Large Batch Training of Convolutional Networks. doi:10.48550/ARXIV.1708.03888.

[zhou2019semantic] Zhou, Bolei, Zhao, Hang, Puig, Xavier, Xiao, Tete, Fidler, Sanja, Barriuso, Adela, Torralba, Antonio. (2019). Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision.

[everingham2015pascal] Everingham, Mark, Eslami, SM, Van Gool, Luc, Williams, Christopher KI, Winn, John, Zisserman, Andrew. (2015). The pascal visual object classes challenge: A retrospective. International journal of computer vision.

[cai2022semi] Cai, Zhaowei, Ravichandran, Avinash, Favaro, Paolo, Wang, Manchen, Modolo, Davide, Bhotika, Rahul, Tu, Zhuowen, Soatto, Stefano. (2022). Semi-supervised vision transformers at scale. arXiv preprint arXiv:2208.05688.

[baevski2022efficient] Baevski, Alexei, Babu, Arun, Hsu, Wei-Ning, Auli, Michael. (2022). Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language. arXiv preprint arXiv:2212.07525.

[assran2023self] Assran, Mahmoud, Duval, Quentin, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, LeCun, Yann, Ballas, Nicolas. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. arXiv preprint arXiv:2301.08243.

[Ermolov2020WhiteningFS] Aleksandr Ermolov, Aliaksandr Siarohin, E. Sangineto, N. Sebe. (2020). Whitening for Self-Supervised Representation Learning. International Conference on Machine Learning.

[kingma2013auto] Kingma, Diederik P, Welling, Max. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

[pont20172017] Pont-Tuset, Jordi, Perazzi, Federico, Caelles, Sergi, Arbeláez, Pablo, Sorkine-Hornung, Alexander, Van Gool, Luc. (2017). The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675.

[jabri2020space] Jabri, Allan, Owens, Andrew, Efros, Alexei. (2020). Space-time correspondence as a contrastive random walk. Advances in neural information processing systems.

[Hadsell2006DimensionalityRB] Raia Hadsell, Sumit Chopra, Yann LeCun. (2006). Dimensionality Reduction by Learning an Invariant Mapping. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[Dosovitskiy2014DiscriminativeUF] Alexey Dosovitskiy, Jost Tobias Springenberg, Martin A. Riedmiller, Thomas Brox. (2014). Discriminative Unsupervised Feature Learning with Convolutional Neural Networks. NIPS.

[Tian2019ContrastiveMC] Yonglong Tian, Dilip Krishnan, Phillip Isola. (2019). Contrastive Multiview Coding. European Conference on Computer Vision.

[Misra2019SelfSupervisedLO] Ishan Misra, Laurens van der Maaten. (2019). Self-Supervised Learning of Pretext-Invariant Representations. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[clark2020electra] Clark, Kevin, Luong, Minh-Thang, Le, Quoc V, Manning, Christopher D. (2020). Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

[brown2020language] Brown, Tom, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared D, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, others. (2020). Language models are few-shot learners. Advances in neural information processing systems.

[baevski2020wav2vec] Baevski, Alexei, Zhou, Yuhao, Mohamed, Abdelrahman, Auli, Michael. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems.

[baevski2021unsupervised] Baevski, Alexei, Hsu, Wei-Ning, Conneau, Alexis, Auli, Michael. (2021). Unsupervised speech recognition. Advances in Neural Information Processing Systems.

[wang2020unsupervised] Wang, Weiran, Tang, Qingming, Livescu, Karen. (2020). Unsupervised pre-training of bidirectional speech encoders via masked reconstruction. ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[vincent2008extracting] Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, Manzagol, Pierre-Antoine. (2008). Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th international conference on Machine learning.

[xiaoshould] Xiao, Tete, Wang, Xiaolong, Efros, Alexei A, Darrell, Trevor. What Should Not Be Contrastive in Contrastive Learning. International Conference on Learning Representations.

[chen2021exploring] Chen, Xinlei, He, Kaiming. (2021). Exploring simple siamese representation learning. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[Hua_2021_ICCV] Hua, Tianyu, Wang, Wenxiao, Xue, Zihui, Ren, Sucheng, Wang, Yue, Zhao, Hang. (2021). On Feature Decorrelation in Self-Supervised Learning. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

[Bar_2022_CVPR] Bar, Amir, Wang, Xin, Kantorov, Vadim, Reed, Colorado J., Herzig, Roei, Chechik, Gal, Rohrbach, Anna, Darrell, Trevor, Globerson, Amir. (2022). DETReg: Unsupervised Pretraining With Region Priors for Object Detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[dwibedi2021little] Dwibedi, Debidatta, Aytar, Yusuf, Tompson, Jonathan, Sermanet, Pierre, Zisserman, Andrew. (2021). With a little help from my friends: Nearest-neighbor contrastive learning of visual representations. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[press2021train] Press, Ofir, Smith, Noah A, Lewis, Mike. (2021). Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409.

[chu2021conditional] Chu, Xiangxiang, Tian, Zhi, Zhang, Bo, Wang, Xinlong, Wei, Xiaolin, Xia, Huaxia, Shen, Chunhua. (2021). Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882.

[bello2019attention] Bello, Irwan, Zoph, Barret, Vaswani, Ashish, Shlens, Jonathon, Le, Quoc V. (2019). Attention augmented convolutional networks. Proceedings of the IEEE/CVF international conference on computer vision.

[su2021roformer] Su, Jianlin, Lu, Yu, Pan, Shengfeng, Murtadha, Ahmed, Wen, Bo, Liu, Yunfeng. (2021). Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864.

[pmlr-v139-liutkus21a] Liutkus, Antoine, Cífka, Ondřej, Wu, Shih-Lun, Şimşekli, Umut, Yang, Yi-Hsuan, Richard, Gaël. (2021). Relative Positional Encoding for Transformers with Linear Complexity. Proceedings of the 38th International Conference on Machine Learning.

[tong2022videomae] Tong, Zhan, Song, Yibing, Wang, Jue, Wang, Limin. (2022). Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems.

[feichtenhofer2022masked] Feichtenhofer, Christoph, Li, Yanghao, He, Kaiming, others. (2022). Masked autoencoders as spatiotemporal learners. Advances in neural information processing systems.

[parthasarathy2022self] Parthasarathy, Nikhil, Eslami, SM, Carreira, João, Hénaff, Olivier J. (2022). Self-supervised video pretraining yields strong image representations. arXiv preprint arXiv:2210.06433.

[pan2021videomoco] Pan, Tian, Song, Yibing, Yang, Tianyu, Jiang, Wenhao, Liu, Wei. (2021). Videomoco: Contrastive video representation learning with temporally adversarial examples. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[guo2022cross] Guo, Sheng, Xiong, Zihua, Zhong, Yujie, Wang, Limin, Guo, Xiaobo, Han, Bing, Huang, Weilin. (2022). Cross-architecture self-supervised video representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[ehsani2018let] Ehsani, Kiana, Bagherinezhad, Hessam, Redmon, Joseph, Mottaghi, Roozbeh, Farhadi, Ali. (2018). Who let the dogs out? modeling dog behavior from visual data. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[qian2021spatiotemporal] Qian, Rui, Meng, Tianjian, Gong, Boqing, Yang, Ming-Hsuan, Wang, Huisheng, Belongie, Serge, Cui, Yin. (2021). Spatiotemporal contrastive video representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[han2020self] Han, Tengda, Xie, Weidi, Zisserman, Andrew. (2020). Self-supervised co-training for video representation learning. Advances in Neural Information Processing Systems.

[bardes2023mc] Bardes, Adrien, Ponce, Jean, LeCun, Yann. (2023). Mc-jepa: A joint-embedding predictive architecture for self-supervised learning of motion and content features. arXiv preprint arXiv:2307.12698.

[han2020memory] Han, Tengda, Xie, Weidi, Zisserman, Andrew. (2020). Memory-augmented dense predictive coding for video representation learning. European conference on computer vision.

[tran2015learning] Tran, Du, Bourdev, Lubomir, Fergus, Rob, Torresani, Lorenzo, Paluri, Manohar. (2015). Learning spatiotemporal features with 3d convolutional networks. Proceedings of the IEEE international conference on computer vision.

[deng2009imagenet] Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, Fei-Fei, Li. (2009). Imagenet: A large-scale hierarchical image database. 2009 IEEE conference on computer vision and pattern recognition.

[wang2023masked] Wang, Rui, Chen, Dongdong, Wu, Zuxuan, Chen, Yinpeng, Dai, Xiyang, Liu, Mengchen, Yuan, Lu, Jiang, Yu-Gang. (2023). Masked video distillation: Rethinking masked feature modeling for self-supervised video representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[yu2020bdd100k] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, Trevor Darrell. (2020). BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning.

[goyal2017something] Goyal, Raghav, Ebrahimi Kahou, Samira, Michalski, Vincent, Materzynska, Joanna, Westphal, Susanne, Kim, Heuna, Haenel, Valentin, Fruend, Ingo, Yianilos, Peter, Mueller-Freitag, Moritz, others. (2017). The “something something” video database for learning and evaluating visual common sense. Proceedings of the IEEE international conference on computer vision.

[soomro2012ucf101] Soomro, Khurram, Zamir, Amir Roshan, Shah, Mubarak. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.

[gu2018ava] Gu, Chunhui, Sun, Chen, Ross, David A, Vondrick, Carl, Pantofaru, Caroline, Li, Yeqing, Vijayanarasimhan, Sudheendra, Toderici, George, Ricco, Susanna, Sukthankar, Rahul, others. (2018). Ava: A video dataset of spatio-temporally localized atomic visual actions. Proceedings of the IEEE conference on computer vision and pattern recognition.

[xu2022finediving] Xu, Jinglin, Rao, Yongming, Yu, Xumin, Chen, Guangyi, Zhou, Jie, Lu, Jiwen. (2022). Finediving: A fine-grained dataset for procedure-aware action quality assessment. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[teed2024deep] Teed, Zachary, Lipson, Lahav, Deng, Jia. (2024). Deep patch visual odometry. Advances in Neural Information Processing Systems.

[gebru2021datasheets] Gebru, Timnit, Morgenstern, Jamie, Vecchione, Briana, Vaughan, Jennifer Wortman, Wallach, Hanna, Daumé III, Hal, Crawford, Kate. (2021). Datasheets for datasets. Communications of the ACM.

[bib1] Agarwal, A., Kumar, A., Malik, J., Pathak, D.: Legged locomotion in challenging terrains using egocentric vision. In: Conference on Robot Learning. pp. 403–415. PMLR (2023)

[bib2] Bajcsy, A., Loquercio, A., Kumar, A., Malik, J.: Learning vision-based pursuit-evasion robot policies. arXiv preprint arXiv:2308.16185 (2023)

[bib3] Bednarek, J., Bednarek, M., Kicki, P., Walas, K.: Robotic touch: Classification of materials for manipulation and walking. In: 2019 2nd IEEE international conference on Soft Robotics (RoboSoft). pp. 527–533. IEEE (2019)

[bib4] Bednarek, J., Bednarek, M., Wellhausen, L., Hutter, M., Walas, K.: What am i touching? learning to classify terrain via haptic sensing. In: 2019 International Conference on Robotics and Automation (ICRA). pp. 7187–7193. IEEE (2019)

[bib5] Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y.T., Li, Y., Lundberg, S., et al.: Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712 (2023)

[bib6] Cai, M., Kitani, K.M., Sato, Y.: Understanding hand-object manipulation with grasp types and object attributes. In: Robotics: Science and Systems. vol. 3. Ann Arbor, Michigan; (2016)

[bib7] Campos, C., Elvira, R., Rodríguez, J.J.G., Montiel, J.M., Tardós, J.D.: Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics 37(6), 1874–1890 (2021)

[bib8] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)

[bib9] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.d.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al.: Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)

[bib10] Choi, S., Ji, G., Park, J., Kim, H., Mun, J., Lee, J.H., Hwangbo, J.: Learning quadrupedal locomotion on deformable terrain. Science Robotics 8(74), eade2256 (2023)

[bib11] Damen, D., Doughty, H., Farinella, G.M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., et al.: Scaling egocentric vision: The epic-kitchens dataset. In: Proceedings of the European conference on computer vision (ECCV). pp. 720–736 (2018)

[bib12] Damen, D., Doughty, H., Farinella, G.M., Furnari, A., Kazakos, E., Ma, J., Moltisanti, D., Munro, J., Perrett, T., Price, W., et al.: Rescaling egocentric vision. arXiv preprint arXiv:2006.13256 (2020)

[bib13] Damen, D., Leelasawassuk, T., Mayol-Cuevas, W.: You-do, i-learn: Egocentric unsupervised discovery of objects and their modes of interaction towards video-based guidance. Computer Vision and Image Understanding 149, 98–112 (2016)

[bib14] Ehsani, K., Bagherinezhad, H., Redmon, J., Mottaghi, R., Farhadi, A.: Who let the dogs out? modeling dog behavior from visual data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4051–4060 (2018)

[bib15] Fathi, A., Hodgins, J.K., Rehg, J.M.: Social interactions: A first-person perspective. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. pp. 1226–1233. IEEE (2012)

[bib16] Fathi, A., Li, Y., Rehg, J.M.: Learning to recognize daily actions using gaze. In: Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part I 12. pp. 314–327. Springer (2012)

[bib17] Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The kitti dataset. The International Journal of Robotics Research 32(11), 1231–1237 (2013)

[bib18] Goyal, R., Ebrahimi Kahou, S., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al.: The “something something” video database for learning and evaluating visual common sense. In: Proceedings of the IEEE international conference on computer vision. pp. 5842–5850 (2017)

[bib19] Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., et al.: Ego4d: Around the world in 3,000 hours of egocentric video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18995–19012 (2022)

[bib20] Gu, C., Sun, C., Ross, D.A., Vondrick, C., Pantofaru, C., Li, Y., Vijayanarasimhan, S., Toderici, G., Ricco, S., Sukthankar, R., et al.: Ava: A video dataset of spatio-temporally localized atomic visual actions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6047–6056 (2018)

[bib21] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16000–16009 (2022)

[bib22] Karnan, H., Yang, E., Farkash, D., Warnell, G., Biswas, J., Stone, P.: Self-supervised terrain representation learning from unconstrained robot experience. In: ICRA2023 Workshop on Pretraining for Robotics (PT4R) (2023)

[bib23] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)

[bib24] Kazakos, E., Nagrani, A., Zisserman, A., Damen, D.: Epic-fusion: Audio-visual temporal binding for egocentric action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5492–5501 (2019)

[bib25] Kumar, A., Fu, Z., Pathak, D., Malik, J.: Rma: Rapid motor adaptation for legged robots (2021)

[bib26] Labuguen, R., Matsumoto, J., Negrete, S.B., Nishimaru, H., Nishijo, H., Takada, M., Go, Y., Inoue, K.i., Shibata, T.: Macaquepose: a novel “in the wild” macaque monkey pose dataset for markerless motion capture. Frontiers in behavioral neuroscience 14, 581154 (2021)

[bib27] Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V., Hutter, M.: Learning quadrupedal locomotion over challenging terrain. Science robotics 5(47), eabc5986 (2020)

[bib28] Lee, Y.J., Ghosh, J., Grauman, K.: Discovering important people and objects for egocentric video summarization. In: 2012 IEEE conference on computer vision and pattern recognition. pp. 1346–1353. IEEE (2012)

[bib29] Li, Y., Nagarajan, T., Xiong, B., Grauman, K.: Ego-exo: Transferring visual representations from third-person to first-person videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6943–6953 (2021)

[bib30] Loquercio, A., Kumar, A., Malik, J.: Learning visual locomotion with cross-modal supervision. In: 2023 IEEE International Conference on Robotics and Automation (ICRA). pp. 7295–7302. IEEE (2023)

[bib31] Margolis, G.B., Fu, X., Ji, Y., Agrawal, P.: Learning to see physical properties with active sensing motor policies. arXiv preprint arXiv:2311.01405 (2023)

[bib32] Margolis, G.B., Yang, G., Paigwar, K., Chen, T., Agrawal, P.: Rapid locomotion via reinforcement learning. arXiv preprint arXiv:2205.02824 (2022)

[bib33] Miki, T., Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V., Hutter, M.: Learning robust perceptive locomotion for quadrupedal robots in the wild. Science Robotics 7(62), eabk2822 (2022)

[bib34] Nagarajan, T., Feichtenhofer, C., Grauman, K.: Grounded human-object interaction hotspots from video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8688–8697 (2019)

[bib35] Ng, E., Xiang, D., Joo, H., Grauman, K.: You2me: Inferring body pose in egocentric video via first and second person interactions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9890–9900 (2020)

[bib36] Ng, X.L., Ong, K.E., Zheng, Q., Ni, Y., Yeo, S.Y., Liu, J.: Animal kingdom: A large and diverse dataset for animal behavior understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19023–19034 (2022)

[bib37] Northcutt, C., Zha, S., Lovegrove, S., Newcombe, R.: Egocom: A multi-person multi-modal egocentric communications dataset. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020)

[bib38] Pirsiavash, H., Ramanan, D.: Detecting activities of daily living in first-person camera views. In: 2012 IEEE conference on computer vision and pattern recognition. pp. 2847–2854. IEEE (2012)

[bib39] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I.: Zero-shot text-to-image generation. In: International Conference on Machine Learning. pp. 8821–8831. PMLR (2021)

[bib40] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

[bib41] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision. pp. 618–626 (2017)

[bib42] Shah, D., Sridhar, A., Dashora, N., Stachowicz, K., Black, K., Hirose, N., Levine, S.: Vint: A large-scale, multi-task visual navigation backbone with cross-robot generalization. In: 7th Annual Conference on Robot Learning (2023)

[bib43] Shan, D., Geng, J., Shu, M., Fouhey, D.F.: Understanding human hands in contact at internet scale. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9869–9878 (2020)

[bib44] Sigurdsson, G.A., Gupta, A., Schmid, C., Farhadi, A., Alahari, K.: Charades-ego: A large-scale dataset of paired third and first person videos. arXiv preprint arXiv:1804.09626 (2018)

[bib45] Sójka, D., Nowicki, M.R., Skrzypczyński, P.: Learning an efficient terrain representation for haptic localization of a legged robot. In: 2023 IEEE International Conference on Robotics and Automation (ICRA). pp. 12170–12176. IEEE (2023)

[bib46] Soomro, K., Zamir, A.R., Shah, M.: Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)

[bib47] Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D.: A benchmark for the evaluation of rgb-d slam systems. In: 2012 IEEE/RSJ international conference on intelligent robots and systems. pp. 573–580. IEEE (2012)

[bib48] Teed, Z., Deng, J.: Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems 34, 16558–16569 (2021)

[bib49] Teed, Z., Lipson, L., Deng, J.: Deep patch visual odometry. Advances in Neural Information Processing Systems 36 (2024)

[bib50] Tong, Z., Song, Y., Wang, J., Wang, L.: Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems 35, 10078–10093 (2022)

[bib51] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

[bib52] Wang, R., Chen, D., Wu, Z., Chen, Y., Dai, X., Liu, M., Yuan, L., Jiang, Y.G.: Masked video distillation: Rethinking masked feature modeling for self-supervised video representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6312–6322 (2023)

[bib53] Xiao, T., Radosavovic, I., Darrell, T., Malik, J.: Masked visual pre-training for motor control. arXiv preprint arXiv:2203.06173 (2022)

[bib54] Xu, J., Rao, Y., Yu, X., Chen, G., Zhou, J., Lu, J.: Finediving: A fine-grained dataset for procedure-aware action quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2949–2958 (2022)

[bib55] Yonetani, R., Kitani, K.M., Sato, Y.: Recognizing micro-actions and reactions from paired egocentric videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2629–2638 (2016)

[bib56] Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., Darrell, T.: Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020)

[bib57] Yu, H., Xu, Y., Zhang, J., Zhao, W., Guan, Z., Tao, D.: Ap-10k: A benchmark for animal pose estimation in the wild. arXiv preprint arXiv:2108.12617 (2021)

[bib58] Zhao, W., Liu, S., Guo, H., Wang, W., Liu, Y.J.: Particlesfm: Exploiting dense point trajectories for localizing moving cameras in the wild. In: European Conference on Computer Vision. pp. 523–542. Springer (2022)

[bib59] Zhou, J., Wei, C., Wang, H., Shen, W., Xie, C., Yuille, A., Kong, T.: ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832 (2021)

[bib60] Zhou, Y., Berg, T.L.: Temporal perception and prediction in ego-centric video. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4498–4506 (2015)