EgoPet: Egomotion and Interaction Data from an Animal's Perspective

% \normalsize\bf Amir Bar$^{1,2}$ \quad Arya Bakhtiar$^2$ \quad Danny Tran$^2$ \quad Antonio Loquercio$^2$ \quad Jathushan Rajasegaran$^2$, \quad\quad\quad\quad\normalsize\bf Yann LeCun$^3$ \quad Amir Globerson$^1$ \quad Trevor Darrell$^2$, [0.5em] \normalsize\quad\quad $^1$Tel Aviv University \quad $^2$UC Berkeley \quad $^3$New York University

Abstract

-3mm Animals perceive the world to plan their actions and interact with other agents to accomplish complex tasks, demonstrating capabilities that are still unmatched by AI systems. To advance our understanding and reduce the gap between the capabilities of animals and AI systems, we introduce a dataset of pet egomotion imagery with diverse examples of simultaneous egomotion and multi-agent interaction. Current video datasets separately contain egomotion and interaction examples, but rarely both at the same time. In addition, EgoPet offers a radically distinct perspective from existing egocentric datasets of humans or vehicles. We define two in-domain benchmark tasks that capture animal behavior, and a third benchmark to assess the utility of EgoPet as a pretraining resource to robotic quadruped locomotion, showing that models trained from EgoPet outperform those trained from prior datasets.~Project page: ~\url{www.amirbar.net/egopet}

EgoPet: Egomotion and Interaction Data from an Animal’s Perspective

Amir Bar 1 , 2 Arya Bakhtiar 2 Danny Tran 2 Antonio Loquercio 2 Jathushan Rajasegaran 2 Yann LeCun 3 Amir Globerson 1 Trevor Darrell 2 1 Tel Aviv University 2 UC Berkeley 3 New York University

Figure 1: We present EgoPet, a novel animal egocentric video dataset to advance learning animal-like behavior models from video (top row). We propose three benchmark tasks on this dataset (bottom row). Visual Interaction Prediction (VIP) and Locomotion Prediction (LP) are designed to predict animals' perception and action behavior. Finally, Vision to Proprioception Prediction (VPP) studies the utility of our dataset on the downstream task of robot locomotion in the wild. For all tasks, we find that models trained on EgoPet outperform those trained on previously available video datasets.

Animals perceive the world to plan their actions and interact with other agents to accomplish complex tasks, demonstrating capabilities that are still unmatched by AI systems. To advance our understanding and reduce the gap between the capabilities of animals and AI systems, we introduce a dataset of pet egomotion imagery with diverse examples of simultaneous egomotion and multi-agent interaction. Current video datasets separately contain egomotion and interaction examples, but rarely both at the same time. In addition, EgoPet offers a radically distinct perspective from existing egocentric datasets of humans or vehicles. We define two in-domain benchmark tasks that capture animal behavior, and a third benchmark to assess the utility of EgoPet as a pretraining resource to robotic quadruped locomotion, showing that models trained from EgoPet outperform those trained from prior datasets. 1

Introduction

1 Project page: www.amirbar.net/egopet

Figure 2: EgoPet video examples . Footage from the EgoPet dataset featuring four different animal experiences, each captured from an egocentric perspective at a distinct point in time.

to catch a rat; this requires the cat to execute a precise sequence of actions with impeccable timing, all while responding to the rat's efforts to escape.

Current Artificial Intelligence (AI) systems can synthesize high quality images [39, 40], generate coherent text [5, 51], and even code Python programs [9]. But despite this remarkable progress, there are basic animal behaviors that are beyond the reach of current models. Recently, there has been a significant body of research in robotics aimed at learning policies for quadruped locomotion, and other basic actions [27, 25, 1, 32, 42, 10, 33, 2]. However, we argue that a major limitation in advancing towards more complex systems is the availability of large-scale, real-world data.

To address this, we present EgoPet, a new web-scale dataset from the perspective of pets. EgoPet contains more than 84 hours of video, including different animals like dogs, cats, eagles, turtles, and more. This video footage reveals the world from the eye of the pet as perceived in its day-to-day life, e.g., a dog going for a walk or entering a park, or a cat wandering freely around a farm. The video data was sourced from the internet and predominantly includes pet video, hence we have named the dataset EgoPet.

To measure progress in modeling and learning from animals, we propose three new tasks that aim to capture perception and action (see Fig. 1): Visual Interaction Prediction (VIP), Locomotion Prediction (LP), and Vision to Proprioception Prediction (VPP). Together with these tasks, we provide annotated training and validation data used for downstream evaluation.

The VIP task aims to detect and classify animal interactions and is inspired by human-object interaction tasks [43]. We temporally annotated a subset of the EgoPet videos with the start and end times of visual interactions and the object of the interaction category. The categories, which include person, cat, and dog, were chosen based on how commonly they occurred as objects (for all the categories refer to Supplementary Section 8).

The goal of the LP task is to predict the future 4 second trajectory of the pet. This is useful for learning basic pet skills like avoiding obstacles or navigating. We extracted pseudo ground truth trajectories using Deep Patch Visual Odometry (DPVO) [49], the best-performing SLAM system for our dataset. We manually filtered inaccurate trajectories in the validation data to ensure high-quality evaluation.

Finally, in the VPP task, we study EgoPet's utility for a downstream robotic task: legged locomotion. Given a video observation from a forward-facing camera mounted on a quadruped robot, the goal is to predict the features of the terrain perceived by the robot's proprioception across its trajectory. Making accurate predictions requires perceiving the landscape and anticipating the robot controls. This differs from previous works on robot visual prediction [30, 45, 31], which require conditioning over current robot controls and are thus challenging to train at scale. To assess performance in this task, we gathered data utilizing a quadruped robodog. This data includes paired videos and proprioception features, which are then utilized for subsequent training and evaluation processes.

We train various self-supervised models and evaluate how they perform downstream using a simple linear probing protocol. We make the surprising finding that pretraining on EgoPet yields better performance than pretraining on other, much larger video datasets like Ego4D [19] and Kinetics 400 [23]. This indicates the inadequacy of current datasets in studying animal-like physical skills.

Our contributions are as follows. First, we propose EgoPet, the first large-scale egocentric animal video dataset comprised of over 84 hours of video footage to facilitate learning from animals. We propose three new tasks, including human-annotated data, and set an initial benchmark. The downstream results on the VPP task indicate that EgoPet is a useful pretraining resource for quadruped locomotion, and the benchmark results on VIP show that the proposed tasks are still far from being solved, providing an exciting new opportunity to build models that capture the world through the eyes of animals.

Next, we delve into the related works surrounding video datasets, including notable research on both general video datasets of human and animals and those focusing specifically on egocentric video data.

Video Datasets. In recent years, a variety of video datasets have played an important role in video understanding tasks. In human action recognition, datasets like UCF101 [46], Charades-Ego [44], AVA [20], FineDiving [54], and the Something-Something dataset [18] provide comprehensive coverage of human activities, ranging from daily actions to specialized sports movements. Among these, Kinetics (K400) [23] is particularly influential, advancing the study of human actions through a wide array of video clips.

Other works aimed to collect data to study animals. These works include datasets such as the Animal Kingdom [36], which contains videos of various species, and MacaquePose [26], which focuses on non-human primates. These datasets are instrumental for AI advancements in wildlife recognition and interpretation. AP-10K [57] further augments this domain by providing a detailed collection of animal images for robust pose estimation. While sharing a similar motivation to our work, existing datasets on animal behavior rarely contain egocentric views and are therefore better suited to recognition problems than the animals' physical capabilities. For autonomous driving and vehicle motion, datasets like the Berkeley DeepDrive [56, 17] and KITTI [17] offer extensive insights into vehicle egomotion and environmental interactions. While these datasets enrich our understanding of motion, behavior, and interaction from a human-centric perspective, they offer limited insights into animal behavior.

Egocentric Video Datasets. Agents interact with the world from a first-person point of view, thus collecting such data has many applications from video understanding to augmented reality. In the past decade, many egocentric datasets were collected [16, 11, 44, 38, 28], with the majority of them focusing on human activities and object interactions in an indoor environment (e.g. kitchens). For example, Epic Kitchens [11, 12] is a large cooking dataset that takes place in 45 kitchens across 4 different cities, whereas Charades-Ego [44] consists of 4 , 000 paired videos of human actions in first and third person. Other datasets are more focused on conversation and social interactions [15, 35, 37]. Existing datasets differ by the environments in which they are recorded, e.g. (outdoor vs. kitchens), whether they are scripted or not, and the number of videos. Recently, Ego4D [19], a new comprehensive egocentric dataset was released. Different from previous datasets, it is more diverse (e.g., indoor and outdoor activities, different diverse geographical locations). However, while existing datasets focus on human and human skills, our focus is on animal agents which have more limited language and hand-object interactions. The most related egocentric dataset is DECADE [14] which

Figure 3: Descriptive statistics. The histogram depicting the length (in seconds) of EgoPet video sequences exhibits a long-tailed distribution, primarily skewed toward shorter segments of less than 30 seconds. Collectively, videos featuring dogs and cats account for 94% of the total duration, showcasing interactions with people, fellow cats and dogs, toys, and various objects.

consists of an hour of footage of a single dog, including joint locations annotations. Inspired by DECADE, EgoPet is a much larger web-scale dataset (84 hours) and much more diverse.

The EgoPet Dataset

The EgoPet dataset is a unique collection of egocentric video footage primarily featuring dogs and cats, along with various other animals like eagles, wolves, turtles, sea turtles, sharks, snakes, cheetahs, pythons, geese, alligators, and dolphins (examples included in Fig. 2 and Suppl. Figure 9). Together with the proposed downstream tasks and benchmark, EgoPet is a valuable resource for researchers and enthusiasts interested in studying animals from an egocentric perspective.

We begin with the motivation behind EgoPet and its connection to existing datasets in Section 3.1. We then delve into the dataset's statistics in Section 3.2 and the collection process in Section 3.3.

Relation to other datasets

To provide a clearer understanding of EgoPet's significance, we compare it with various other datasets, considering factors such as total video duration, perspective (egocentric or non-egocentric), egomotion, the agents involved, and the presence of interaction annotations, which are crucial for intelligent agents. Refer to Table 1 for more details. In terms of size, Ego4D [19] is the largest egocentric video dataset, and it centers on human activities, while the BDD100K [56] dataset includes both egocentric and egomotion elements but it focuses on autonomous driving. Differently, EgoPet focuses on animals, and pets in particular. Among animal video datasets, the DECADE [14] dataset provides an egocentric perspective from a dog's viewpoint, but it only records 1 . 5 hours of video. EgoPet expands this vision by over 56 times in volume and includes a variety of species and interactions.

Descriptive Statistics

The EgoPet dataset is an extensive collection composed of 6 , 646 video segments distilled from 819 unique videos. High level statistics are provided in Fig. 3. These original videos were sourced predominantly from TikTok, accounting for 482 videos, while the remaining 338 were obtained from YouTube. The aggregate length of all video segments amounts to approximately 84 hours, which reflects a substantial volume of data for in-depth analysis. In terms of video duration, the segments exhibit an average span of 45 . 55 seconds, although the duration displays considerable variability, as indicated by the standard deviation of 192 . 19 seconds. This variation underscores the range of contexts captured within the dataset, from brief encounters to prolonged interactions.

Table 1: Different video datasets . We compare EgoPet to different datasets with respect to the total time (hours), whether the videos are in first person view (egocentric) with the focus on egomotion, the agent type, and whether agent interaction annotations are available. EgoPet is the first large scale animal dataset that is both ego centric and contains interaction annotations. It is also over 56 times larger than the previous similar dataset DECADE [14].

Breaking down the dataset by animal representation, cats and dogs constitute the majority, with 4 , 567 and 1 , 905 segments, respectively. This reflects the dataset's strong emphasis on common domestic animals while still covering less frequent but equally important species. Notably, the dataset includes segments featuring eagles ( 66 ), turtles ( 31 ), and a diverse group of other animals such as alligators, lizards, and dolphins, contributing to a rich collection of animal behaviors captured through an egocentric lens.

The camera positioning-where the recording device was attached to-also varies, the majority of segments were captured from cameras placed on the neck ( 4 , 575 ) and body ( 1 , 817 ). Fewer segments were recorded from cameras positioned on the head ( 199 ), shell ( 36 ), collar ( 11 ), and fin ( 8 ) offering a range of perspectives that can inform on how different mounting points might influence the perception of the environment from an animal's viewpoint.

Data Acquisition

Collection Strategy. To collect the dataset, we manually searched for videos using a large set of queries on YouTube and TikTok. For example, 'egocentric view', 'dog with a GoPro', and similar phrases related to first-person animal perspectives. This led to scraping a vast pool of footage showcasing animals, primarily dogs and cats, wearing wearable cameras, allowing for an egocentric point of view. In pursuit of a broader video selection, our efforts extended to individual channels and authors known for their thematic consistency in publishing egocentric animal footage. This approach allowed us to tap into niche communities and content creators, yielding a wide variety of egocentric videos beyond the reach of generic search terms.

Dataset Refinement. A meticulous annotation process was carried out to ensure the dataset's quality. A human annotator reviewed the collected videos to confirm that they were from an egocentric point of view. Non-egocentric or irrelevant segments were carefully removed.

All videos were adjusted to a frame rate of 30 frames per second and resized to 480p on the shortest side while maintaining the original aspect ratio. The videos were then segmented into discrete clips, during which any non-egocentric footage was removed. The final dataset consists of segments of at least three seconds, ensuring sufficient context for each interaction.

EgoPet Tasks

In order to allow quantitative comparisons of animal-prediction approaches, we next define several prediction tasks on the EgoPet dataset. We provide annotated datasets based on these tasks, which will allow effective benchmarking of different approaches.

Visual Interaction Prediction (VIP)

into how animals navigate their world, how they communicate with other beings, and how their physical movements correlate with environmental stimuli. Being able to identify interactions is a core task in computer vision and robotics with practical applications in designing systems that can operate in dynamic, real-world settings.

Task Description. The input for this task is a video clip from the egocentric perspective of an animal. The labels are twofold: a binary label indicating whether an interaction is taking place or not, and a categorical label describing the object of the interaction. This binary label simplifies the vast range of potential interactions into a manageable form for the model, while the identification of the interaction object adds a layer of specificity necessary for understanding the context of the interaction.

In the context of the EgoPet dataset, a 'visual interaction' is defined as a discernible event where the agent-typically an animal such as a dog or cat-demonstrates clear attention to an object or another agent within its environment. This attention may be manifested through physical contact, proximity, orientation, or vocalization (such as barking or making sounds) toward the object of the interaction which can be an object

Figure 4: Vision to Interaction Prediction task. The figure illustrates the process of annotating a single video, identifying and categorizing different interactions experienced by a cat, with each segment of the timeline reflecting a unique type of interaction within the animal's environment.

or agent. The fundamental criterion for a visual interaction is the presence of visual evidence within the video that the agent is engaged with, or reacting to a particular stimulus. Aimless movements, such as wandering without a clear target or displaying alertness without a specific focus, are not labeled as visual interactions.

Annotations. The data labeling process for marking interactions involved a meticulous analysis of the video content, which resulted in the annotation of 1449 subsegments (see Fig. 4). Two human annotators were trained to identify and timestamp the start and end of an interaction event. The outcome of this process is a richly annotated dataset of 805 subsegments where no interaction occurs ('negative subsegments') and 644 positive interaction subsegments that capture a wide range of 17 distinct interaction objects such as person, cat, and dog. The subsegments were then split into train and test sets. This leaves us with 754 training subsegments and 695 test subsegments for a total of 1 , 449 annotated subsegments. To see the full annotation process refer to Suppl. Section 8.

Locomotion Prediction (LP)

Motivation. Planning where to move of both perception and foresight. It requires the ability to anticipate potential obstacles, consider various courses of action, and select the most efficient and effective strategy to achieve a desired goal. EgoPet contains examples where animals plan a future trajectory to achieve a certain goal (e.g., a dog following its owner. See Fig. 5).

Task Description. Given a sequence of past m video frames { x i } t i = t -m , the goal is to predict the unit normalized future trajectory of the agent { v j } t + k j = t +1 , where v j ∈ R 3 represents the relative location of the agent at timestep j . We predict the unit normalized relative location due to the scale ambiguity of the extracted trajectories. In practice, we condition models on m = 16 frames and

Figure 5: Locomotion Prediction task. A dog navigates an agility course, highlighting the concept of locomotion prediction by anticipating its forward and upward trajectory to clear the obstacle.

predict k = 40 future locations which correspond to 4 seconds into the future.

Annotations. To obtain pseudo ground truth agent trajectories, we used Deep Patch Visual Odometry (DPVO [49]), a system for monocular visual odometry that utilizes sparse patch-based matching across frames. This system largely outperformed other open-source SLAM systems in terms of convergence rate and qualitative accuracy in our experiments.

Given an input sequence of frames, DPVO returns the location and orientation of the camera for each frame. To obtain training trajectories, we feed videos with a stride of 5 to DPVO. To ensure high-quality evaluation, we feed validation videos with strides of 5 , 10 , and 15 into DPVO and evaluate the quality of the trajectories manually. Specifically, two human annotators were trained to evaluate the trajectories from an eagle's eye view (XZ view) and determine the best matching trajectory, if any, to the video. This left us with 6 , 126 annotated training segments and 249 validation segments.

Vision to Proprioception Prediction (VPP)

Motivation. Understanding animal behavior could be instrumental to several robotics applications. To demonstrate the value of our dataset for robotics, we propose a task based on the problem of vision-based locomotion. Specifically, the task consists of predicting the parameters of the terrain a quadrupedal robot is walking on (see Fig. 6). As shown in multiple previous works on locomotion [30, 31, 22, 4, 3, 45], accurate prediction of these parameters is correlated with improved performance in locomotion. Intuitively, the EgoPet data closely resembles the video captured by a quadruped robot since a camera mounted on a pet is approximately at the same location as the camera mounted on the robot. In addition, the task of walking is highly represented in the dataset.

Task Description. The parameters we would like to predict are the local terrain geometry, the terrain's friction, and the parameters related to the robot's walking behavior on the terrain, including the robot's speed, motor efficiency, and high-level command. The exact identification of these parameters is generally impossible [25]: two terrains could have different combinations of parameters but 'feel" the same to an agent. For example, walking on sand and mud could similarly affect the robot's proprioception, even though their properties differ. Therefore, similarly to previous work [25, 27, 30], we aim to predict a latent representation z t of the terrain parameters. This latent representation consists of the hidden layer of a neural network trained in simulation to encode ground-truth terrain pa-

Figure 6: Vision to Proprioception Prediction task. This figure showcases the quadruped robot as it is about to transition from flat ground to climbing steep stairs, illustrating one of the unique terrain environments encountered during the collection of visual and proprioceptive data at annotated time intervals for VPP training.

rameters. This neural network is trained end-to-end with an action policy on locomotion using reinforcement learning. The task consists of predicting the latent terrain's representation at different time intervals from a sequence of frames. Our setup closely follows the one in [30]. Specifically, we add to EgoPet data collected with a quadrupedal robot in multiple outdoor environments with different terrain characteristics, e.g., sand or grass.

Dataset and Annotations. To collect the dataset, we deployed the walking policy of [30] on a Unitree A1 robot dog in three environments: an office, a park, and a beach. We collect approximately 20 minutes of walking data in these environments, which are used exclusively for evaluation. For training, we use the data from [30], which contains 120 thousand frames, corresponding to a total walking time of approximately 2.3 hours. Each environment has different terrain geometries, including flats, steps, and slopes. Each sample contains an image collected from a forward-looking camera mounted on the robot and the (latent) parameters of the terrain below the center of mass of the robot z t estimated with a history of proprioception. See [30] for details about the annotation procedure.

The final task consists of predicting z t from a history of images. We generate several sub-tasks by predicting the future terrain parameters z t +0 . 8 , z t +1 . 5 and the past ones z t -0 . 8 , z t -1 . 5 . These time intervals were selected to differentiate between forecasting and estimation. The further the prediction is in the future or the past, the harder the task is. The input images might contain little information,

or not at all, about the terrain at these times. Therefore, inferences based on the context are required. For example, one can predict the presence of a step in front of the robot from the shadow it casts on the terrain.

We divide the newly collected data into three test datasets: the first is in-distribution, featuring terrains and lighting conditions similar to the training data. The second dataset is out of distribution since it is captured with different lighting conditions, i.e., at night, but in environments with the same features as the training data. Finally, the third dataset contains sandy environments, which the robot has not encountered during training.

Evaluation Benchmark

Our goal in the experiments is to establish initial performance baselines on the EgoPet tasks. For the VPP task, we hypothesize that EgoPet is a more useful pretraining resource compared to other datasets. We evaluate different pretrained models and compare their performance on the VIP, LP, and VPP tasks. We adopt a simple linear probing protocol, where we freeze the model weights and for each task train only a linear layer to predict the output. For evaluation, we use the following models publicly released checkpoints, typically trained on IN-1k or K400, unless stated otherwise.

MAE [21] are trained by masking random patches in the input image and reconstructing the missing pixels through an asymmetric encoder-decoder architecture. By masking a high proportion (e.g., 75 %) of the input image. In our experiments we use an MAE model pretrained on IN-1k.

MVP [53] uses a similar model as in MAE, but trains MAE on a mixture of egocentric datasets which we refer to as Ego Mix, a combination of Epic-Kitchens [11], 100DOH [43], Ego4D [19], and Something-Something [18].

DINO [8] is trained with a student-teacher architecture over pairs of augmented images by encouraging invariance to the image augmentations. The teacher's output is centered, and both networks' normalized features are compared using a cross-entropy loss. The stop-gradient operator ensures gradient propagation only through the student, and teacher parameters are updated using an exponential moving average (ema) of the student parameters.

iBOT [59] is trained similarly to DINO, but adds an auxiliary MIM loss by predicting image patch representation of a learned online tokenizer.

VideoMAE [50] is an extension of MAE for video pre-training. Different from MAE, it utilizes an extremely high masking ratio ( 90 %to 95 %) and tube masking as opposed to random masking.

MVD [52] is a masked feature modeling framework for self-supervised video representation learning. Learning the video representations involves distilling student model features from both video and image teachers. We train MVD variants on Ego4D and EgoPet, using VideoMAE (K400) and MAE (IN-1k) as video and image teachers.

Implementation Details. For all models, we used the ViT-B model with patch size 16 since it was available across all methods. For the VIP task, we train all image and video models for 10 epochs. Video models represent video clips of 2 seconds using 8 input frames ( 4 Hz) and image models use one (the middle) frame. For the VPP task, we train all models for 50 epochs. Video models were trained with varying amounts of frames in 4 Hz. For the LP task, we train all models for 15 epochs. Video models were trained with 16 input frames ( 30 Hz) and image models used one frame (the last). In our LP experiments, we only use cat and dog segments, segments long enough, and 25% of the training data. This left us with 1 , 129 training segments and 167 validation segments. During the linear probing training phase, we do not apply any image augmentations. All the other hyperparams follow MAE and MVD linear probing recipes for image and video models respectively.

Results

Table 2: Visual Interaction Prediction linear probing results. We report models Interaction Prediction Accuracy and AUROC, as well as Object Prediction Top-1 and Top-3 Accuracy.

Figure 8: Locomotion Prediction (LP) linear probing results. We report the validation ATE and RPE as a function of the epoch during training, comparing the impact of various datasets (Kinetics, Ego4D, EgoPet). Models trained on EgoPet perform better than models trained on other datasets. See Supplementary Table 4 for the full results.

video datasets used for pretraining are not diverse enough to perform well across all the EgoPet downstream tasks. For example, pretraining on K400 is better than Ego4D for VIP but worse on VPP. Furthermore, by pretraining on EgoPet, we observe improved downstream performance on the VPP task compared to other models.

Visual Interaction Prediction

The results in Table 2 show that models trained on EgoPet achieve improved performance compared to K400 or Ego4D both on interaction prediction and object prediction. Compared to image based models like iBOT, MVD trained on EgoPet performs better on Top-3 Acc but worse on Top-1. This is likely due to the diversity of objects appearing in IN-1k, an image recognition dataset. Compared to other video models, MVD (EgoPet) performs better. To obtain more insight into what models focus on in this task, we apply Grad-CAM [41] on our MVD EgoPet interaction classifier. Fig. 7 shows the corresponding heatmaps and can be seen to focus on the rat (top-left), and another dog (bottom-right). In these cases, the model seems to be attending to the object of interaction.

Locomotion Prediction

For this task, we evaluated models based on their predicted unit motions 40 timesteps into the future, corresponding to 4 seconds. We form trajectories from these predicted motions and compute the

Figure 7: VIP Grad-CAM [41] visualization.

Table 3: Vision to Proprioception Prediction (VPP) linear probing results . We report the mean squared error loss. Models trained on EgoPet perform better than models trained on other datasets. See Supplementary Table 5 for the full results.

RMSE of the Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) metrics against the ground truth trajectories. ATE and RPE are commonly employed metrics for evaluating systems such as SLAM and visual odometry [58, 49, 48, 7]. ATE first aligns the ground truth with the predicted trajectory, and then computes the absolute pose difference. RPE measures the difference between the predicted and ground truth locomotion [47].

The results in Fig. 8 indicate that models trained on EgoPet perform better than Ego4D and K400 and that the Ego4D model performed second best, possibly due to also being egocentric data. The full results in Supplementary Table 4 indicate that video models perform much better than image models as a whole, which we speculate is due to better modeling of the agent's velocity and acceleration as well as the motion of other agents.

Vision to Proprioception Prediction

Table 3 provides the result of the VPP task. It can be seen that using EgoPet data leads to lower errors on this task. Additionally, the results show that using additional past frames as context helps and that video models outperform image models. Within image models, MVP performs better than MAE, likely because it was trained on egocentric data, and iBOT performs best. Within video models, MVD trained on EgoPet achieves lower mean squared error loss compared to the same model trained with K400 or Ego4D, and lower error compared to all image models. We speculate that MVD (EgoPet) outperforms all models because compared to other datasets the EgoPet videos are more similar to videos captured by a forward-facing camera mounted on a quadruped robodog. We provide the full results in the Supplementary Table 5.

Limitations

EgoPet is a video dataset, and as such it primarily contains visual and auditory signals. However, animals interact with their environment using a multitude of senses, including smell and touch. The absence of these sensory modalities in our dataset and model may lead to a partial or skewed understanding of animal behavior and intelligence. Animal behavior is highly complex and influenced by a myriad of factors, including instinct, learning, environmental stimuli, and social interactions. Our tasks, while effective in capturing certain aspects of behavior, may not fully encapsulate the depth and complexity of animal interactions and decision-making processes. Further research is needed to develop more sophisticated tasks and models that can account for complex behavioral patterns.

Conclusion

We present EgoPet, a new comprehensive animal egocentric video dataset. Together with the proposed downstream tasks and benchmark, we believe EgoPet offers a testbed for studying and modeling animal behavior. Our benchmark results demonstrate that interaction prediction is far from being solved which provides an exciting opportunity for future research and modeling animal egocentric agents. Furthermore, the results demonstrate that EgoPet is a useful pretraining resource for downstream robotics locomotion tasks. Future works can include broadening the tasks to integrate more sensory inputs like audio, thereby creating a richer and more holistic understanding of the animal behavior.

Acknowledgements: We thank Justin Kerr for helpful discussions. Many of the figures use images taken from web videos. For each figure, we include the URL to its source videos in the Suppl. Section 8. This project has received funding from the European Research Council (ERC) under the European Unions Horizon 2020 research and innovation programme (grant ERC HOLI 819080). Prof. Darrell's group was supported in part by DoD including DARPA's LwLL and/or SemaFor programs, as well as BAIR's industrial alliance programs.

References

Supplementary Material

We provide additional information about the EgoPet dataset, annotation process and full quantitative results.

Dataset

We include more dataset visualizations in Figure 9.

Figure 9: Additional EgoPet examples . Footage of four different animal videos from an egocentric view are included.

VIP Annotations

Weprovide additional information regarding the annotation process for the VIP task. The data labeling process for marking interactions involved a meticulous analysis of the video content, which resulted in the annotation of 1449 subsegments (see Figure 4). Two human annotators were trained to identify and timestamp the start and end of an interaction event. The outcome of this process is a richly annotated dataset of 805 subsegments where no interaction occurs ('negative subsegments') and 644 positive interaction subsegments that capture a wide range of 17 distinct interaction objects such as person, cat, and dog. The subsegments were then split into train and test. This leaves us with 754 training subsegments and 695 test subsegments for a total of 1 , 449 annotated subsegments.

Table 4: Locomotion prediction linear probing results. Models are evaluated on their ability to predict the trajectory of the agent k seconds into the future.

The beginning of an interaction is marked at the first time-step where the agent begins to give attention to a target, and the endpoint is marked at the last time-step before the attention ceases. In addition, annotators were instructed to mark some segments without interactions. This process results in a set of temporal segments, each corresponding to a discrete interaction event or no interaction event. To

Table 5: Vision to Proprioception Prediction (VPP) linear probing results . We report the mean squared error loss. Models trained on EgoPet perform better than models trained on other datasets. See Supplementary Table 5 for the full results.

ensure the consistency of annotations across annotators, annotations are only kept where annotators agree.

The outcome of this process is a richly annotated dataset of 805 subsegments where no interaction occurs ('negative subsegments') and 644 positive interaction subsegments that capture a wide range of 17 distinct interaction objects such as person, cat, and dog. The subsegments were then split into train and test. This leaves us with 754 training subsegments and 695 test subsegments for a total of 1 , 449 annotated subsegments.

This is the list of all possible interaction objects: Person, Ball, Bench, Bird, Dog, Cat, Other Animal, Toy, Door, Floor, Food, Plant, Filament, Plastic, Water, Vehicle, Other.

Results

Full LP results. Table 4 contains the quantitative LP results for all models, reported using the ATE and RPE metrics. The results indicate that pretraining on EgoPet leads to better ATE and RPE scores.

Full VPP results. In the main paper we included the VPP results grouped by 'past', 'present' and 'future' (see Table 3). In Table 5 we provide the full fine-grained VPP results by individual timestep.

Figure credits

Some of the figures in the paper were created from web videos. We credit the original content creators and provide links to the original videos below.

Figure 1

Animals are intelligent agents that exhibit various cognitive and behavioral traits. They plan and act to accomplish complex goals and can interact with objects or other agents. Consider a cat attempting to catch a rat; this requires the cat to execute a precise sequence of actions with impeccable timing, all while responding to the rat’s efforts to escape.

To address this, we present EgoPet, a new web-scale dataset from the perspective of pets. EgoPet contains more than 848484 hours of video, including different animals like dogs, cats, eagles, turtles, and more. This video footage reveals the world from the eye of the pet as perceived in its day-to-day life, e.g., a dog going for a walk or entering a park, or a cat wandering freely around a farm. The video data was sourced from the internet and predominantly includes pet video, hence we have named the dataset EgoPet.

The goal of the LP task is to predict the future 444 second trajectory of the pet. This is useful for learning basic pet skills like avoiding obstacles or navigating. We extracted pseudo ground truth trajectories using Deep Patch Visual Odometry (DPVO) [49], the best-performing SLAM system for our dataset. We manually filtered inaccurate trajectories in the validation data to ensure high-quality evaluation.

Finally, in the VPP task, we study EgoPet’s utility for a downstream robotic task: legged locomotion. Given a video observation from a forward-facing camera mounted on a quadruped robot, the goal is to predict the features of the terrain perceived by the robot’s proprioception across its trajectory. Making accurate predictions requires perceiving the landscape and anticipating the robot controls. This differs from previous works on robot visual prediction [30, 45, 31], which require conditioning over current robot controls and are thus challenging to train at scale. To assess performance in this task, we gathered data utilizing a quadruped robodog. This data includes paired videos and proprioception features, which are then utilized for subsequent training and evaluation processes.

Our contributions are as follows. First, we propose EgoPet, the first large-scale egocentric animal video dataset comprised of over 848484 hours of video footage to facilitate learning from animals. We propose three new tasks, including human-annotated data, and set an initial benchmark. The downstream results on the VPP task indicate that EgoPet is a useful pretraining resource for quadruped locomotion, and the benchmark results on VIP show that the proposed tasks are still far from being solved, providing an exciting new opportunity to build models that capture the world through the eyes of animals.

Other works aimed to collect data to study animals. These works include datasets such as the Animal Kingdom [36], which contains videos of various species, and MacaquePose [26], which focuses on non-human primates. These datasets are instrumental for AI advancements in wildlife recognition and interpretation. AP-10K [57] further augments this domain by providing a detailed collection of animal images for robust pose estimation. While sharing a similar motivation to our work, existing datasets on animal behavior rarely contain egocentric views and are therefore better suited to recognition problems than the animals’ physical capabilities. For autonomous driving and vehicle motion, datasets like the Berkeley DeepDrive [56, 17] and KITTI [17] offer extensive insights into vehicle egomotion and environmental interactions. While these datasets enrich our understanding of motion, behavior, and interaction from a human-centric perspective, they offer limited insights into animal behavior.

Egocentric Video Datasets. Agents interact with the world from a first-person point of view, thus collecting such data has many applications from video understanding to augmented reality. In the past decade, many egocentric datasets were collected [16, 11, 44, 38, 28], with the majority of them focusing on human activities and object interactions in an indoor environment (e.g. kitchens). For example, Epic Kitchens [11, 12] is a large cooking dataset that takes place in 454545 kitchens across 444 different cities, whereas Charades-Ego [44] consists of 4,00040004{\small,}000 paired videos of human actions in first and third person. Other datasets are more focused on conversation and social interactions [15, 35, 37]. Existing datasets differ by the environments in which they are recorded, e.g. (outdoor vs. kitchens), whether they are scripted or not, and the number of videos. Recently, Ego4D [19], a new comprehensive egocentric dataset was released. Different from previous datasets, it is more diverse (e.g., indoor and outdoor activities, different diverse geographical locations). However, while existing datasets focus on human and human skills, our focus is on animal agents which have more limited language and hand-object interactions. The most related egocentric dataset is DECADE [14] which consists of an hour of footage of a single dog, including joint locations annotations. Inspired by DECADE, EgoPet is a much larger web-scale dataset (84 hours) and much more diverse.

We begin with the motivation behind EgoPet and its connection to existing datasets in Section 3.1. We then delve into the dataset’s statistics in Section 3.2 and the collection process in Section 3.3.

To provide a clearer understanding of EgoPet’s significance, we compare it with various other datasets, considering factors such as total video duration, perspective (egocentric or non-egocentric), egomotion, the agents involved, and the presence of interaction annotations, which are crucial for intelligent agents. Refer to Table 1 for more details. In terms of size, Ego4D [19] is the largest egocentric video dataset, and it centers on human activities, while the BDD100K [56] dataset includes both egocentric and egomotion elements but it focuses on autonomous driving. Differently, EgoPet focuses on animals, and pets in particular. Among animal video datasets, the DECADE [14] dataset provides an egocentric perspective from a dog’s viewpoint, but it only records 1.51.51.5 hours of video. EgoPet expands this vision by over 565656 times in volume and includes a variety of species and interactions.

The EgoPet dataset is an extensive collection composed of 6,64666466{\small,}646 video segments distilled from 819819819 unique videos. High level statistics are provided in Fig. 3. These original videos were sourced predominantly from TikTok, accounting for 482482482 videos, while the remaining 338338338 were obtained from YouTube. The aggregate length of all video segments amounts to approximately 848484 hours, which reflects a substantial volume of data for in-depth analysis. In terms of video duration, the segments exhibit an average span of 45.5545.5545.55 seconds, although the duration displays considerable variability, as indicated by the standard deviation of 192.19192.19192.19 seconds. This variation underscores the range of contexts captured within the dataset, from brief encounters to prolonged interactions.

Breaking down the dataset by animal representation, cats and dogs constitute the majority, with 4,56745674{\small,}567 and 1,90519051{\small,}905 segments, respectively. This reflects the dataset’s strong emphasis on common domestic animals while still covering less frequent but equally important species. Notably, the dataset includes segments featuring eagles (666666), turtles (313131), and a diverse group of other animals such as alligators, lizards, and dolphins, contributing to a rich collection of animal behaviors captured through an egocentric lens.

The camera positioning—where the recording device was attached to—also varies, the majority of segments were captured from cameras placed on the neck (4,57545754{\small,}575) and body (1,81718171{\small,}817). Fewer segments were recorded from cameras positioned on the head (199199199), shell (363636), collar (111111), and fin (888) offering a range of perspectives that can inform on how different mounting points might influence the perception of the environment from an animal’s viewpoint.

Collection Strategy. To collect the dataset, we manually searched for videos using a large set of queries on YouTube and TikTok. For example, “egocentric view”, “dog with a GoPro”, and similar phrases related to first-person animal perspectives. This led to scraping a vast pool of footage showcasing animals, primarily dogs and cats, wearing wearable cameras, allowing for an egocentric point of view. In pursuit of a broader video selection, our efforts extended to individual channels and authors known for their thematic consistency in publishing egocentric animal footage. This approach allowed us to tap into niche communities and content creators, yielding a wide variety of egocentric videos beyond the reach of generic search terms.

Dataset Refinement. A meticulous annotation process was carried out to ensure the dataset’s quality. A human annotator reviewed the collected videos to confirm that they were from an egocentric point of view. Non-egocentric or irrelevant segments were carefully removed.

All videos were adjusted to a frame rate of 303030 frames per second and resized to 480p on the shortest side while maintaining the original aspect ratio. The videos were then segmented into discrete clips, during which any non-egocentric footage was removed. The final dataset consists of segments of at least three seconds, ensuring sufficient context for each interaction.

Motivation. Human activities such as actions and interactions from an egocentric viewpoint have been previously explored in various datasets mostly focusing on activity recognition [24, 29, 60], human-object interactions [13, 6, 34, 11], and social interactions [15, 35, 55]. Inspired by these works, we focus on animal interactions with other agents or objects, and for simplicity, we only consider visual interactions. Observing interactions through an egocentric perspective offers insights into how animals navigate their world, how they communicate with other beings, and how their physical movements correlate with environmental stimuli. Being able to identify interactions is a core task in computer vision and robotics with practical applications in designing systems that can operate in dynamic, real-world settings.

In the context of the EgoPet dataset, a “visual interaction” is defined as a discernible event where the agent—typically an animal such as a dog or cat—demonstrates clear attention to an object or another agent within its environment. This attention may be manifested through physical contact, proximity, orientation, or vocalization (such as barking or making sounds) toward the object of the interaction which can be an object or agent. The fundamental criterion for a visual interaction is the presence of visual evidence within the video that the agent is engaged with, or reacting to a particular stimulus. Aimless movements, such as wandering without a clear target or displaying alertness without a specific focus, are not labeled as visual interactions.

Annotations. The data labeling process for marking interactions involved a meticulous analysis of the video content, which resulted in the annotation of 144914491449 subsegments (see Fig. 4). Two human annotators were trained to identify and timestamp the start and end of an interaction event. The outcome of this process is a richly annotated dataset of 805805805 subsegments where no interaction occurs (”negative subsegments”) and 644644644 positive interaction subsegments that capture a wide range of 171717 distinct interaction objects such as person, cat, and dog. The subsegments were then split into train and test sets. This leaves us with 754754754 training subsegments and 695695695 test subsegments for a total of 1,44914491{\small,}449 annotated subsegments. To see the full annotation process refer to Suppl. Section Figure credits.

Motivation. Planning where to move involves a complex interplay

of both perception and foresight. It requires the ability to anticipate potential obstacles, consider various courses of action, and select the most efficient and effective strategy to achieve a desired goal. EgoPet contains examples where animals plan a future trajectory to achieve a certain goal (e.g., a dog following its owner. See Fig. 5).

Task Description. Given a sequence of past m𝑚m video frames {xi}i=t−mtsubscriptsuperscriptsubscript𝑥𝑖𝑡𝑖𝑡𝑚{x_{i}}^{t}{i=t-m}, the goal is to predict the unit normalized future trajectory of the agent {vj}j=t+1t+ksubscriptsuperscriptsubscript𝑣𝑗𝑡𝑘𝑗𝑡1{v{j}}^{t+k}{j=t+1}, where vj∈ℝ3subscript𝑣𝑗superscriptℝ3v{j}\in\mathbb{R}^{3} represents the relative location of the agent at timestep j𝑗j. We predict the unit normalized relative location due to the scale ambiguity of the extracted trajectories. In practice, we condition models on m=16𝑚16m=16 frames and predict k=40𝑘40k=40 future locations which correspond to 444 seconds into the future.

Given an input sequence of frames, DPVO returns the location and orientation of the camera for each frame. To obtain training trajectories, we feed videos with a stride of 555 to DPVO. To ensure high-quality evaluation, we feed validation videos with strides of 555, 101010, and 151515 into DPVO and evaluate the quality of the trajectories manually. Specifically, two human annotators were trained to evaluate the trajectories from an eagle’s eye view (XZ view) and determine the best matching trajectory, if any, to the video. This left us with 6,12661266{\small,}126 annotated training segments and 249249249 validation segments.

Task Description. The parameters we would like to predict are the local terrain geometry, the terrain’s friction, and the parameters related to the robot’s walking behavior on the terrain, including the robot’s speed, motor efficiency, and high-level command. The exact identification of these parameters is generally impossible [25]: two terrains could have different combinations of parameters but “feel" the same to an agent. For example, walking on sand and mud could similarly affect the robot’s proprioception, even though their properties differ. Therefore, similarly to previous work [25, 27, 30], we aim to predict a latent representation ztsubscript𝑧𝑡z_{t} of the terrain parameters. This latent representation consists of the hidden layer of a neural network trained in simulation to encode ground-truth terrain parameters. This neural network is trained end-to-end with an action policy on locomotion using reinforcement learning. The task consists of predicting the latent terrain’s representation at different time intervals from a sequence of frames. Our setup closely follows the one in [30]. Specifically, we add to EgoPet data collected with a quadrupedal robot in multiple outdoor environments with different terrain characteristics, e.g., sand or grass.

Dataset and Annotations. To collect the dataset, we deployed the walking policy of [30] on a Unitree A1 robot dog in three environments: an office, a park, and a beach. We collect approximately 20 minutes of walking data in these environments, which are used exclusively for evaluation. For training, we use the data from [30], which contains 120 thousand frames, corresponding to a total walking time of approximately 2.3 hours. Each environment has different terrain geometries, including flats, steps, and slopes. Each sample contains an image collected from a forward-looking camera mounted on the robot and the (latent) parameters of the terrain below the center of mass of the robot ztsubscript𝑧𝑡z_{t} estimated with a history of proprioception. See [30] for details about the annotation procedure.

The final task consists of predicting ztsubscript𝑧𝑡z_{t} from a history of images. We generate several sub-tasks by predicting the future terrain parameters zt+0.8,zt+1.5subscript𝑧𝑡0.8subscript𝑧𝑡1.5z_{t+0.8},z_{t+1.5} and the past ones zt−0.8,zt−1.5subscript𝑧𝑡0.8subscript𝑧𝑡1.5z_{t-0.8},z_{t-1.5}. These time intervals were selected to differentiate between forecasting and estimation. The further the prediction is in the future or the past, the harder the task is. The input images might contain little information, or not at all, about the terrain at these times. Therefore, inferences based on the context are required. For example, one can predict the presence of a step in front of the robot from the shadow it casts on the terrain.

MAE [21] are trained by masking random patches in the input image and reconstructing the missing pixels through an asymmetric encoder-decoder architecture. By masking a high proportion (e.g., 757575%) of the input image. In our experiments we use an MAE model pretrained on IN-1k.

DINO [8] is trained with a student-teacher architecture over pairs of augmented images by encouraging invariance to the image augmentations. The teacher’s output is centered, and both networks’ normalized features are compared using a cross-entropy loss. The stop-gradient operator ensures gradient propagation only through the student, and teacher parameters are updated using an exponential moving average (ema) of the student parameters.

iBOT [59] is trained similarly to DINO, but adds an auxiliary MIM loss by predicting image patch representation of a learned online tokenizer.

VideoMAE [50] is an extension of MAE for video pre-training. Different from MAE, it utilizes an extremely high masking ratio (909090% to 959595%) and tube masking as opposed to random masking.

Implementation Details. For all models, we used the ViT-B model with patch size 16 since it was available across all methods. For the VIP task, we train all image and video models for 101010 epochs. Video models represent video clips of 222 seconds using 888 input frames (444 Hz) and image models use one (the middle) frame. For the VPP task, we train all models for 505050 epochs. Video models were trained with varying amounts of frames in 444 Hz. For the LP task, we train all models for 151515 epochs. Video models were trained with 161616 input frames (303030 Hz) and image models used one frame (the last). In our LP experiments, we only use cat and dog segments, segments long enough, and 25%percent2525% of the training data. This left us with 111,129129129 training segments and 167167167 validation segments. During the linear probing training phase, we do not apply any image augmentations. All the other hyperparams follow MAE and MVD linear probing recipes for image and video models respectively.

In this section, we report initial baseline results from applying a range of models to the VIP, LP, and VPP tasks. Taken together, these results underline the interesting observation that current large video datasets used for pretraining are not diverse enough to perform well across all the EgoPet downstream tasks. For example, pretraining on K400 is better than Ego4D for VIP but worse on VPP. Furthermore, by pretraining on EgoPet, we observe improved downstream performance on the VPP task compared to other models.

For this task, we evaluated models based on their predicted unit motions 40 timesteps into the future, corresponding to 4 seconds. We form trajectories from these predicted motions and compute the RMSE of the Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) metrics against the ground truth trajectories. ATE and RPE are commonly employed metrics for evaluating systems such as SLAM and visual odometry [58, 49, 48, 7]. ATE first aligns the ground truth with the predicted trajectory, and then computes the absolute pose difference. RPE measures the difference between the predicted and ground truth locomotion [47].

The results in Fig. 8 indicate that models trained on EgoPet perform better than Ego4D and K400 and that the Ego4D model performed second best, possibly due to also being egocentric data. The full results in Supplementary Table 4 indicate that video models perform much better than image models as a whole, which we speculate is due to better modeling of the agent’s velocity and acceleration as well as the motion of other agents.

Acknowledgements: We thank Justin Kerr for helpful discussions. Many of the figures use images taken from web videos. For each figure, we include the URL to its source videos in the Suppl. Section Figure credits. This project has received funding from the European Research Council (ERC) under the European Unions Horizon 2020 research and innovation programme (grant ERC HOLI 819080). Prof. Darrell’s group was supported in part by DoD including DARPA’s LwLL and/or SemaFor programs, as well as BAIR’s industrial alliance programs.

We provide additional information about the EgoPet dataset, annotation process and full quantitative results.

We include more dataset visualizations in Figure 9.

The outcome of this process is a richly annotated dataset of 805805805 subsegments where no interaction occurs (”negative subsegments”) and 644644644 positive interaction subsegments that capture a wide range of 171717 distinct interaction objects such as person, cat, and dog. The subsegments were then split into train and test. This leaves us with 754754754 training subsegments and 695695695 test subsegments for a total of 1,44914491{\small,}449 annotated subsegments.

This is the list of all possible interaction objects: Person, Ball, Bench, Bird, Dog, Cat, Other Animal, Toy, Door, Floor, Food, Plant, Filament, Plastic, Water, Vehicle, Other.

Full VPP results. In the main paper we included the VPP results grouped by “past”, “present” and “future” (see Table 3). In Table 5 we provide the full fine-grained VPP results by individual timestep.

https://www.youtube.com/watch?v=69AXB6aFzRU

https://www.tiktok.com/@gonzoisacat/video/7232306745660509483

Table: S3.T1: Different video datasets. We compare EgoPet to different datasets with respect to the total time (hours), whether the videos are in first person view (egocentric) with the focus on egomotion, the agent type, and whether agent interaction annotations are available. EgoPet is the first large scale animal dataset that is both ego centric and contains interaction annotations. It is also over 565656 times larger than the previous similar dataset DECADE [14].

Dataset	Total Time (hours)	Egocentric	Egomotion	Agent	Interaction Annotations
BDD100K	1,11111111{\small,}111	✓	✓	Cars	✗
Animal Kingdom	505050	✗	✗	Animals	✗
EGO4D	3,67036703{\small,}670	✓	✗	Humans	✓
DECADE	1.51.51.5	✓	✓	Dog	✗
EgoPet	848484	✓	✓	Animals	✓

Table: S5.T2: Visual Interaction Prediction linear probing results. We report models Interaction Prediction Accuracy and AUROC, as well as Object Prediction Top-1 and Top-3 Accuracy.

		Accuracy	AUROC	Top-1 Acc	Top-3 Acc
MAE	IN-1k	62.34	69.41	35.02	61.37
MVP	Ego Mix	65.47	68.12	33.57	59.21
DINO	IN-1k	65.16	73.38	37.18	60.65
iBOT	IN-1k	65.16	73.50	37.55	58.12
VideoMAE	K400	61.56	66.22	29.24	54.87
MVD	K400	65.63	70.35	35.38	62.45
MVD	Ego4D	64.84	70.15	33.57	62.45
MVD	EgoPet	68.44	74.31	35.74	64.62

Table: S6.T3: Vision to Proprioception Prediction (VPP) linear probing results. We report the mean squared error loss. Models trained on EgoPet perform better than models trained on other datasets. See Supplementary Table 5 for the full results.

Model	Dataset	Past (t−k)𝑡𝑘(t-k)	Present (t)𝑡(t)	Future (t+k)𝑡𝑘(t+k)
1 Frame
MAE	IN-1k	0.360	0.280	0.314
MVP	Ego Mix	0.357	0.273	0.308
DINO	IN-1k	0.354	0.275	0.304
iBOT	IN-1k	0.350	0.278	0.304
4 Frames
MVD	K400	0.286	0.197	0.262
MVD	Ego4D	0.261	0.224	0.261
MVD	EgoPet	0.256	0.203	0.246
8 Frames
MVD	K400	0.217	0.196	0.252
MVD	Ego4D	0.208	0.192	0.249
MVD	EgoPet	0.204	0.184	0.253

Refer to caption We present EgoPet, a novel animal egocentric video dataset to advance learning animal-like behavior models from video (top row). We propose three benchmark tasks on this dataset (bottom row). Visual Interaction Prediction (VIP) and Locomotion Prediction (LP) are designed to predict animals’ perception and action behavior. Finally, Vision to Proprioception Prediction (VPP) studies the utility of our dataset on the downstream task of robot locomotion in the wild. For all tasks, we find that models trained on EgoPet outperform those trained on previously available video datasets.

Refer to caption EgoPet video examples. Footage from the EgoPet dataset featuring four different animal experiences, each captured from an egocentric perspective at a distinct point in time.

Refer to caption Descriptive statistics. The histogram depicting the length (in seconds) of EgoPet video sequences exhibits a long-tailed distribution, primarily skewed toward shorter segments of less than 303030 seconds. Collectively, videos featuring dogs and cats account for 94% of the total duration, showcasing interactions with people, fellow cats and dogs, toys, and various objects.

Refer to caption Vision to Interaction Prediction task. The figure illustrates the process of annotating a single video, identifying and categorizing different interactions experienced by a cat, with each segment of the timeline reflecting a unique type of interaction within the animal’s environment.

Refer to caption Locomotion Prediction task. A dog navigates an agility course, highlighting the concept of locomotion prediction by anticipating its forward and upward trajectory to clear the obstacle.

Refer to caption Vision to Proprioception Prediction task. This figure showcases the quadruped robot as it is about to transition from flat ground to climbing steep stairs, illustrating one of the unique terrain environments encountered during the collection of visual and proprioceptive data at annotated time intervals for VPP training.

Refer to caption VIP Grad-CAM [41] visualization.

Refer to caption Locomotion Prediction (LP) linear probing results. We report the validation ATE and RPE as a function of the epoch during training, comparing the impact of various datasets (Kinetics, Ego4D, EgoPet). Models trained on EgoPet perform better than models trained on other datasets. See Supplementary Table 4 for the full results.