URLOST: Unsupervised Representation Learning without Stationarity or Topology

Zeyu Yun$^{1}$, Juexiao Zhang$^{3}$, Yann Lecun$^{3,4}$, Yubei Chen$^{2}$, $^1$ UC Berkeley, $^2$ UC Davis, $^3$ New York University, $^4$ FAIR at Meta

Abstract

Unsupervised representation learning has seen tremendous progress but is constrained by its reliance on data modality-specific stationarity and topology, a limitation not found in biological intelligence systems. For instance, human vision processes visual signals derived from irregular and non-stationary sampling lattices yet accurately perceives the geometry of the world. We introduce a novel framework that learns from high-dimensional data lacking stationarity and topology. Our model combines a learnable self-organizing layer, density adjusted spectral clustering, and masked autoencoders. We evaluate its effectiveness on simulated biological vision data, neural recordings from the primary visual cortex, and gene expression datasets. Compared to state-of-the-art unsupervised learning methods like SimCLR and MAE, our model excels at learning meaningful representations across diverse modalities without depending on stationarity or topology. It also outperforms other methods not dependent on these factors, setting a new benchmark in the field. This work represents a step toward unsupervised learning methods that can generalize across diverse high-dimensional data modalities.

URLOST: Unsupervised Representation Learning without Stationarity or Topology

Zeyu Yun 1 , Juexiao Zhang 3 , Yann Lecun 3 , 4 , Yubei Chen 2

1 2 3 4

UC Berkeley, UC Davis, New York University, FAIR at Meta

Unsupervised representation learning has seen tremendous progress. However, it is constrained by its reliance on domain specific stationarity and topology, a limitation not found in biological intelligence systems. For instance, unlike computer vision, human vision can process visual signals sampled from highly irregular and non-stationary sensors. We introduce a novel framework that learns from high-dimensional data without prior knowledge of stationarity and topology. Our model, abbreviated as URLOST, combines a learnable self-organizing layer, spectral clustering, and a masked autoencoder (MAE). We evaluate its effectiveness on three diverse data modalities including simulated biological vision data, neural recordings from the primary visual cortex, and gene expressions. Compared to state-of-the-art unsupervised learning methods like SimCLR and MAE, our model excels at learning meaningful representations across diverse modalities without knowing their stationarity or topology. It also outperforms other methods that are not dependent on these factors, setting a new benchmark in the field. We position this work as a step toward unsupervised learning methods capable of generalizing across diverse high-dimensional data modalities. Code is available at this repository.

Introduction

Unsupervised representation learning aims to develop models that autonomously detect patterns in data and make these patterns readily apparent through specific representations. Over the past few years, there has been tremendous progress in the unsupervised representation learning community, especially self-supervised representation learning (SSL) method. Popular methods SSL methods like contrastive learning and masked autoencoding [87; 12; 33; 90] work well on typical data modalities such as images, videos, time series, and point clouds. However, these methods make implicit assumptions about the data domain's topology and stationarity . Topology refers to the low-dimensional structure arisen from physical measurements, such as the pixel grids in images, the temporal structures in time series and text sequences, or the 3D structures in molecules and point clouds [7]. Stationarity refers to the characteristic that the statistical properties of the signal are invariant across its domain [6]. For instance, the statistics of pixels and patches in images are invariant to their spatial locations. A vase is still a vase no matter placed at the corner or the center of the image.

The success of state-of-the-art self-supervised representation learning (SSL) methods largely depends on these two crucial assumptions. For example, in computer vision, popular SSL techniques, such as Masked Autoencoders [34] and joint-embedding methods like SimCLR [12] require the construction of image patches. Both approaches typically rely on convolutional neural networks (CNNs) or Vision Transformer (ViT) backbones [22]. These backbones consist of shared-weight convolutional filters or linear layers, which inherently exploit the stationarity and regular topology present in natural images. The geometric deep learning community has made significant efforts to extend machine learning to domains beyond those with regular topology. However, graph neural network (GNN)-based methods empirically struggle to scale with self-supervised learning objectives and large datasets [69; 10; 9]. Methods that do scale well with large data still assume a minimal level of stationarity and regular topology [32].

What if we have high-dimensional signals without prior knowledge of their domain topology or stationarity? Can we still craft a high-quality representation? This is not only the situation that

biological intelligence systems have to deal with but also a practical setting for many scientific data analysis problems. Taking images as an example, computer vision system takes in signal from digital cameras. Biological visual systems, on the other hand, have to deal with signals with less domain regularity. Unlike the uniform grid in camera sensors, the cones and rods in the retina are distributed unevenly and non-uniformly. Yet, biological visual systems can establish a precise retinotopy map from the retina to neurons in visual cortex based on spontaneous, locally-propagated retinal activities and external stimuli [86; 47; 25] and leverage retinotopic input to build unsupervised representations. This motivates us to build unsupervised representations without relying on the prior stationarity of the raw signal or the topology of the input domain. The ability to build unsupervised representations without relying on topology and stationarity has huge advantages. For example, it allows humans to utilize irregular sensors for both high resolution and broad coverage, which is much more efficient than a regular camera sensor for dynamic environments. Developing such a model also allows us to create a powerful AI system that computes any high-dimensional signal.

Figure 1: From left to right: the unsupervised representation learning through joint embedding and masked auto-encoding; the biological vision system that perceives via unstructured sensor and understands signal without stationarity or topology [61]; and many more such diverse high dimensional signal in natural science [57; 89] that our method supports while most existing unsupervised methods don't.

In this work, we aim to build unsupervised representations for general high-dimensional data and introduce unsupervised representation learning without stationarity or topology ( URLOST ). Taking images as an example again, let's assume we receive a set of images whose pixels are shuffled in the same order. In this case, the original definition of topology of images is destroyed, i.e. each pixel should no longer the neighbor of the pixels that's physically next to it. How can we build representations in an unsupervised fashion without knowledge of the shuffling order? If possible, can we use such a method to build unsupervised representations for general high-dimensional data? Inspired by Roux et al. [67], we use low-level statistics and spectral clustering to form clusters of the pixels, which recovers a coarse topology of the input domain. These clusters are analogous to image patches except that they are slightly irregularly shaped and different in size. We mask a proportion of these 'patches' and utilize a Vision Transformer [22] to predict the masked 'patches' based on the remaining unmasked ones. This 'learning to predict masked tokens' approach is proposed in masked autoencoders (MAE) [34] and has demonstrated effectiveness on typical modalities. Firstly, we test the proposed method on the synthesized biological visual dataset, derived from the CIFAR10 [44] using a foveated retinal sampling mechanism [14]. Then we generalize this method to two high-dimensional vector datasets: a primary visual cortex neural response decoding dataset [73] and the TCGA miRNA-based cancer classification dataset [79; 84]. Across all these benchmarks, our proposed method outperforms existing SSL methods, demonstrating its effectiveness in building unsupervised representations for signals lacking explicit stationarity or topology. Given the emergence of new modalities in deep learning from natural sciences [75; 31; 62; 46; 85; 17; 2], such as

Figure 2: The overview framework of URLOST. The high-dimensional input signal undergoes clustering and self-organization before unsupervised learning using a masked autoencoder for signal reconstruction.

chemistry, biology, and neuroscience, our method offers a promising approach in the effort to build unsupervised representations for high-dimensional data.

Method

Motivation and Overall Framework

Our objective is to build robust unsupervised representations for high-dimensional signals without prior information on explicit topology and stationarity. The learned representations are intended to enhance performance in downstream tasks such as classification. We begin by using low-level statistics and clustering to approximate the topology of the signal domain. The clusters then serve as input to a masked autoencoder. As depicted in Figure 1, the masked autoencoder randomly masks out patches in an image and trains a Transformer-based autoencoder unsupervisedly to reconstruct the original image. After unsupervised training, the latent state of the autoencoder yields highquality representations. In our approach, the clusters derived from the raw signal are input to the masked autoencoder.

Taking images as an example data modality, clusters formed in images are bags of pixels, which differs from image patches in several key aspects. First, they are unaligned, exhibit varied sizes and shapes. Second, each pixel in a cluster is not confined to fixed 2D locations like pixels in image patches. To cope with these differences, we introduce a self-organizing layer responsible for aligning these clusters through learnable transformations. The parameters of this layer are jointly optimized with those of the masked autoencoder. Our method is termed URLOST, an acronym for U nsupervised R epresentation L earning with O ut S tationarity or T opology. Figure 2 provides an overview of the framework. URLOST consists of three core components: density adjusted spectral clustering, self-organizing layer, and masked autoencoder. The functionalities of these components are detailed in the following subsections.

Density Adjusted Spectral Clustering

Given a dataset S ∈ R n × m , where each row denote a high dimensional data point. Let S i ∈ R n × 1 be i th column of S , which represents the i th dimension of the signal. Since we want to process dimensions that share similar information together, we use spectral clustering to group these dimensions. Let A ij = I ( S i ; S j ) be the mutual information between dimension i and j . Consider each dimension as a node of a graph and A is the the affinity matrix of a graph. L = D -A is the graph Laplacian, where D is the diagonal matrix whose ( i, i ) -element is the sum of A 's i -th row. We can formulate the spectral embedding problem as finding Y such that min Y Y T = I tr ( Y LY T ) . A clustering algorithm is then applied to the embedding Y as explained in appendix A.2. The size and shape of the clusters strongly affect the unsupervised learning performance. To adjust the size and shape, we apply a density adjustment matrix P to adjust L in the spectral embedding objective. The optimization problem becomes the following:

where P = diag ( p ( i )) , p ( i ) is the unnormalized density function defined on each node i . We set p ( i ) = q ( i ) α n ( i ) -β , where n ( i ) = ∑ j ∈ Top K ( A ji ) A ji and q ( i ) is the prior density which depends

on specific dataset and is defined in the experiment section. α , β and K are hyper-parameters. Setting α = 0 and K = m will recover the normalized graph Laplacian. In appendix A.1, we provide a detailed interpretation and motivation of density adjustment. We further verify its effectiveness with ablation study in section 4.3 and appendix A.4

Self-organizing Layer

Transforming a high-dimensional signal into a sequence of clusters using the above method is not enough because it does not capture the internal structure within individual clusters. As an intuitive example, given an image, we divide it into a set of image patches of the same size. If we apply different permutations to these image patches, their inner product will no longer reflect their similarity properly. Clusters we obtained from section 2.2 are analogous to image patches, but elements in each cluster have arbitrary ordering. Thus, the inner product between two clusters is also arbitrary due to the ordering mismatch. In Transformers, since self-attention depends on the inner products between different 'clusters,' we need to align these clusters in a space, where their inner products reflect their similarity. To align these clusters, we propose a self-organizing layer with learnable parameters. Specifically, let vector x ( i ) denote the i th cluster. Each cluster x ( i ) is passed through a differentiable function g ( · , w ( i ) ) with parameter w ( i ) , resulting in a sequence z 0 :

z 0 is comprised of projected and aligned representations for all clusters. The weights of the proposed self-organizing layer, { w (1) , · · · w ( M ) } , are jointly optimized with the subsequent neural network introduced in the next subsection.

Masked Autoencoder

After the self-organizing layer, z 0 is passed to a Transformer-based masked autoencoder (MAE) with an unsupervised learning objective. Masked autoencoder (MAE) consists of an encoder and a decoder which both consist of stacked Transformer blocks introduced in Vaswani et al. [82]. The objective function is introduced in He et al. [34]: masking random image patches in an image and training an autoencoder to reconstruct them, as illustrated in Figure 1. In our case, randomly selected clusters in z 0 are masked out, and the autoencoder is trained to reconstruct these masked clusters. After training, the encoder's output is treated as the learned representation of the input signal for downstream tasks. The masked prediction loss is computed as the mean square error (MSE) between the values of the masked clusters and their corresponding predictions.

Result

Since the biological vision system inspires our method, we first validate its ability on a synthetic biological vision dataset created from CIFAR-10. Then we evaluate the generalizability of URLOST on two high-dimensional natural datasets collected from diverse domains. Detailed information about each dataset and the corresponding experiments is presented in the following subsections. Across all tasks, URLOST consistently outperforms other strong unsupervised representation learning methods. For all the experiments in this work, we use linear layer to parametrize the self-organizing layer, i.e. g ( x, W ( i ) ) = W ( i ) x . Additionally, we provide training hyperparameters and experiments for effect of hyperparameters in Appendix A.6.

Synthetic Biological Vision Dataset

As discussed in the introduction, the biological visual signal serves as an ideal data modality to validate the capability of URLOST. In contrast to digital images captured by a fixed array of sensors, the biological visual signal is acquired through irregularly positioned ganglion cells, inherently lacking explicit topology and stationarity. However, it is hard to collect real-world biological vision signals with high precision and labels to evaluate our algorithm. Therefore, we employ a retinal sampling technique to modify the classic CIFAR-10 dataset and simulate imaging from the biological vision signal. The synthetic dataset is referred to as Foveated CIFAR-10 . To make a comprehensive comparison, we also conduct experiments on the original CIFAR-10, and a Permuted CIFAR-10 dataset obtained by randomly permuting the image.

dataset to generate the Permuted CIFAR-10 dataset. Nevertheless, permuting pixels only removes an image's topology, leaving its stationarity intact. To obtain the synthetic biological vision that has neither topology nor stationarity, we introduce the Foveated CIFAR-10 .

Figure 3: Retina sampling (A) An image in CIFAR-10 dataset. (B) Retina sampling lattice. Each blue dot represents the center of a Gaussian kernel, which mimics a retinal ganglion cell. (C) Visualization of the car image's signal sampled using the retina lattice. Each kernel's sampled RGB value is displayed at its respective lattice location for visualization purposes. (D) density-adjusted spectral clustering results are shown. Each unique color represents a cluster, with each kernel colored according to its assigned cluster.

Foveated CIFAR-10. Unlike cameras that use uniform photosensors with consistent sampling patterns, the human retina features ganglion cells whose receptive field sizes vary-smaller in the central fovea and larger in the periphery. This results in foveated imaging [80], enabling primates to have both high-resolution central vision and a wide overall receptive field. However, this distinctive sampling pattern causes visual signals in the retina to lack stationarity; the statistical properties differ between the center and periphery, with ganglion cell responses being highly correlated in the fovea but less correlated in the periphery. To mimic the foveated imaging with CIFAR-10, we adopt the retina sampling mechanism from Cheung et al. [14]. Like shown in Figure 3, each dot represents a retinal ganglion cell, which together form a non-stationary sampling lattice with irregular topology. Details on the implementation of foveation sampling is provided in Appendix A.3.

Experiments. We compare URLOST on both of the synthetic vision datasets as well as the original CIFAR-10 with popular unsupervised representation learning methods SimCLR [12] and MAE [34]. We also compared the ViT backbone used in URLOST to other backbones such as convolutional neural network (CNN). Moreover, since CNN works better with stationary signals, we further compared our methods to SimCLR with a graph neural network (GNN) backbone, which doesn't rely on the stationarity assumption of the data. For the GNN baseline, we use Vision GNN (ViG), which is state of the art graph neural network on images proposed in [32]. After the model is trained without supervision, we use linear probing to evaluate it. This is achieved by training a linear classifier on top of the pre-trained model with the given labels. The evaluations are reported in Table 1. SimCLR excels on CIFAR-10 but struggles with both synthetic datasets due to its inability to handle data without stationarity and topology. MAE gets close to SimCLR on CIFAR-10 with a 4 × 4 patch size. However, the patch size no longer makes sense when data has no topology. So we additionally tested MAE masking pixels instead of image patches. The performance of pixel level MAE is invariant to the removal of topology. The performance on Permuted CIFAR-10 is the same as the performance on CIFAR-10, though poorly. But it still drops greatly to 48 . 5% on the Foveated CIFAR-10 when stationarity is also removed. In contrast, only URLOST is able to maintain consistently strong performances when there is no topology or stationarity, achieving 86 . 4% on Permuted CIFAR-10 and 85 . 4% on Foveated CIFAR-10 when the baselines completely fail.

V1 Neural Response to Natural Image Stimulus

After testing URLOST's performance on synthetic biological vision data, we take a step further to challenge its generalizability with high-dimensional natural datasets. The first task is decoding neural response recording in the primary visual area (V1) of mice.

V1 neural response dataset. The dataset, published by [57], contains responses from over 10,000 V1 neurons captured via two-photon calcium imaging. These neurons responded to 2,800 unique images from ImageNet [19], with each image presented twice to assess the consistency of the neural response. In the decoding task, a prediction is considered accurate if the neural response to a given

Table 1: Evaluation on computer vision and synthetic biological vision dataset. ViT (Patch) stands for the Vision Transformer backbone with image patches as inputs. ViT (Pixel) means pixels are treated as input units. ViT (Clusters) means clusters are treated as inputs instead of patches. The number of clusters is set to 64 for both Permuted CIFAR-10 and Foveated CIFAR-10 dataset. Eval Acc stands for linear probing evaluation accuracy.

Table 2: Evaluation on V1 response decoding and TCGA pan-cancer classification tasks. 'Raw' indicates preprocessed (standardized and normalized) raw signals. Best β values are used for β -VAE. For URLOST MAE, cluster sizes are 200 (V1) and 32 (TCGA). We pick 15 seeds randomly to repeat the training and evaluation for each method. For the MAE baseline, we randomly group dimensions instead of using spectral clusterings. We report the 95% confidence interval for all methods.

stimulus in the first presentation closely matches the response to the same stimulus in the second presentation within the representation space. This task presents greater challenges than the synthetic biological vision described in the prior section. For one, the data comes from real-world neural recordings rather than a curated dataset like CIFAR-10. For another, the geometric structure of the V1 area is substantially more intricate than that of the retina. To date, no precise mathematical model of the V1 neural response has been well established. The inherent topology and stationarity of the data still remain difficult to grasp [56; 55]. Nevertheless, evidence of retinotopy [25; 26] and findings from prior research [54; 13; 74] suggest that the neuron population code in V1 are tiling a low dimensional manifold. This insight led us to treat the population neuron response as high-dimensional data and explore whether URLOST can effectively learn its representation.

Experiments. Following the approach in Pachitariu et al. [57] we apply standardization and normalization to the neural firing rate. The processed signals are high-dimensional vectors, and they can be directly used for the decoding task, which serves as the 'raw' signal baseline in Table 2. For representation learning methods, URLOST is evaluated along with MAE and β -VAE [35]. For MAE baseline, we obtain patches by randomly selecting different dimensions from the signal. Note that the baseline methods need to handle high-dimensional vector data without stationarity or topology. Since SimCLR and other constrastive learning model leverage these two properties to make positive pair, they are no longer applicable. We use β -VAE as a baseline instead. We first train the neural network with an unsupervised learning task, then use the latent state of the network as the representation for the neural responses in the decoding task. The results are presented in table 2. Our method surpasses the original neuron response and other methods, achieving the best performance.

Table 3: Performance of different datasets using different masking units. 'Clusters (URLOST)' refers to the clusters formed using pairwise mutual information and spectral clustering. 'Random patch' refers to patches formed by aggregating random dimensions. 'Individual dimension' refers to using individual dimensions as masking units without any aggregation. The model is trained for the mask prediction task, and the probing accuracy is reported in the table.

Gene Expression Data

In this subsection, we further evaluate URLOST on high-dimensional natural science data from a completely different domain, the gene expression data.

Gene expression dataset. The dataset comes from The Cancer Genome Atlas (TCGA) [79; 84], which is a project that catalogs the genetic mutations responsible for cancer using genome sequencing and bioinformatics. The project molecularly characterized over 20,000 primary cancers and matched normal samples spanning 33 cancer types. We focus on the pan-cancer classification task: diagnose and classify the type of cancer for a given patient based on his gene expression profile. The TCGA project collects the data of 11,000 patients and uses Micro-RNA (miRNA) as their gene expression profiles. Like the V1 response, no explicit topology and stationarity are known and each data point is a high-dimensional vector. Specifically, 1773 miRNA identifiers are used so that each data point is a 1773-dimensional vector. Types of cancer that each patient is diagnosed with serve as the classification labels.

Experiments. Similar to Section 3.2, URLOST is compared with the original signals, MAE, and β -VAE, which is the state-of-the-art unsupervised learning method on TCGA cancer classification [91; 92]. We also randomly partition the dataset do five-fold cross-validation and report the average performance in Table 2. Again, our method learns meaningful representation from the original signal. The learned representation benefited the classification task and achieved the best performance, demonstrating URLOST's ability to learn meaningful representation of data from diverse domains.

Ablation study

Aggregating similar dimensions for masking unit

Clustering is at the heart of the proposed method. Why should similar dimensions be aggregated to form a masking unit? Intuitively, we assume that similar dimensions are used to sample similar regions of the underlying signals. Instead of masking each dimension, making the entire cluster force the model to learn high-level structure from the signal, allowing the model to learn a rich representation. For example, for the CIFAR10 example, similar pixels tend to sample similar regions, which together form image patches. For V1 data, each neuron encodes a simple pattern like an oriented edge, but neurons together code patterns like shape and contours. We provide experiments for using different masking units, as shown in table 3, aggregating similar dimensions significantly improves the performance.

Self-organizing Layer vs Shared Projection Layer

Conventional SSL models take a sequential input x = [ x (1) , · · · x ( M ) ] and embed them into latent vectors with a linear transformation:

which is further processed by a neural network. The sequential inputs can be a list of language tokens [21; 63], pixel values [11], image patches [22], or overlapped image patches [12; 33; 90]. E can be considered as a projection layer that is shared among all elements in the input sequence. The selforganizing layer g ( · , w ( i ) ) introduced in Section 2.3 can be considered as a non-shared projection layer. We conducted an ablation study comparing the two designs to demonstrate the effectiveness of the self-organizing layers both quantitatively and qualitatively. To facilitate the ablation, we further synthesized another dataset.

Locally-permuted CIFAR-10 . To directly evaluate the performance of the non-shared projection approach, we designed an experiment involving intentionally misaligned clusters. In this experiment, we divide each image into patches and locally permute all the patches. The i -th image patch is denoted by x ( i ) , and its permuted version, permutated by the permutation matrix E ( i ) , is expressed as E ( i ) x ( i ) . We refer to this synthetic dataset as the Locally-Permuted CIFAR-10 . Our hypothesis posits that models using shared projections, as defined in Equation 3, will struggle to adapt to random permutations, whereas self-organizing layers equipped with non-shared projections can autonomously adapt to each patch's permutation, resulting in robust performance. This hypothesis is evaluated quantitatively and through the visualization of the learned weights w ( i ) .

Permuted CIFAR-10 . Meanwhile, we also run the ablation study on the Permuted CIFAR-10 . Unlike locally permuted CIFAR-10, a visualization check is not viable since the permutation is done globally. However, we can still quantitatively measure the performance of the task.

Quantitative results . Table 4 confirms our hypothesis, demonstrating a significant performance decline in models employing shared projections when exposed to permuted data. In contrast, the model with self-organizing layer maintains stable performance.

Visual evidence . Using linear layers to parameterize the self-organizing layers, i.e. let g ( x, W ( i ) ) = W ( i ) x , we expect that if the projection layer effectively aligns the input sequence, E ( i ) T W ( i ) should exhibit visual similarities. That is, after applying the inverse permutation E ( i ) T , the learned projection matrix W ( i ) at each location should appear consistent or similar. The proof of this statement is provided in Appendix A.5. The model trained on Locally-Permuted CIFAR10 provides visual evidence supporting this claim. In Figure 4, the weights show similar patterns after reversing the permutations. These observations demonstrate that URLOST can also be used as an unsupervised learning method to recover topology and enforce stationary on the signal. Note that Figure 4 shows that the self-organizing layer learns to 'undo' the permutation. Additionally, the self-organizing layer also does some extra regular transformation to each patch. It is likely that the self-organizing layer learns to encode position information as transformations of some simple group structures.

Figure 4: Learnt weights of a self-organizing layer. (A) Image is cropped into patches, where each patch x ( i ) first undergoes a different permutation E ( i ) , then the inverse permutation E ( i ) T . (B) The learned weight of the linear self-organizing layer. The 12 th column of W ( i ) at all positions i are reshaped into patches and visualized. When W ( i ) undergoes the inverse permutation E ( i ) T , they show similar patterns. (C) Visualization of the 37 th column of W ( i ) . Similar to (B).

Density adjusted clustering vs Uniform Density Clustering

As explained in Section 2.2, the shape and size of each cluster depend on how the density function p ( i ) is defined as:

where n ( i ) = ∑ j ∈ Top k ( A ji ) A ji , A is the affinity matrix, q ( i ) represent the eccentricity, the distance from i th kernel to the center of the sampling lattice. Setting α and β nonzero, the density function is eccentricity-dependent. Setting both α and β to zero will make n ( i ) constant which recovers the uniform density spectral clustering. We vary the parameters α and β to generate different sets of clusters for the foveated CIFAR-10 dataset and run URLOST using each of these sets of clusters. Results in Table 5 validate that the model performs better with density-adjusted clustering. The intuitive explanation is that by adjusting the values of α and β , we can make each cluster carry similar amounts of information (refer to Appendix A.4.). A balanced distribution of information across clusters enhances the model's ability to learn meaningful representations. Without this balance, masking a low-information cluster makes the prediction task trivial, while masking a high-information cluster will make the prediction task too difficult. In either scenario, the model's ability to learn effective representations is compromised.

Table 4: Ablation study on self-organizing layer . Linear probing accuracy with varying parameters, keeping others constant. For Locally-Permutated CIFAR-10 , we use 4 × 4 patch size. For Permutated CIFAR-10 and Foveated CIFAR-10 , we set the number of clusters to 64 for the spectral clustering algorithm. We kept the hyperparameter of the backbone model the same as in table 1.

(a) Replacing the self-organizing layer with the shared projection layer entails a significant drop in performance.

(b) 'SC' denotes spectral clustering with uniform density clustering and 'DSC' denotes density adjusted spectral clustering. Using DSC to create clusters outperform model without DSC.

Several interconnected pursuits are linked to this work, and we will briefly address them here:

Topology in biological visual signal. 2-D topology of natural images is a strong prior that requires many bits to encode [20; 3]. Such 2-D topology is encoded in the natural image statistic [71; 37]. Optic and neural circuits in the retina result in a more irregular 2-D topology than the natural image, which can still be simulated [66; 59; 60; 58; 78; 41]. This information is further processed by the primary visual cortex. Evidence of retinotopy suggests the low-dimensional geometry of visual input from retina is encoded by the neuron in primary visual cortex [53; 27; 36; 26; 83; 61]. These study provide evidence that topology under retinal ganglion cell and V1 neurons can be recovered. The theory and computational model of how visual system code encodes such 2-D is well-studied in computational neuroscience. The self-organizing map (SOM) was proposed as the first computational model by Kohonen in 1982. The algorithm produces a low-dimensional representation of a higher-dimensional dataset while preserving the topological structure of the data [43]. SOM is also motivated to solve the 'unscramble' pixels problem by 'descrambling' by mapping pixels into a 2D index set by leveraging the mutual information between pixels at different locations. More detail is in Appendix A.12. [67] tackles the same problem with manifold learning.

Evidence of self-organizing mechanism in the brain. In neuroscience, many works use the selforganizing maps (SOM) as a computational model for V1 functional organization [23; 76; 1; 24; 52; 43]. In other words, this idea of self-organizing is a principle governing how the brain performs computations. Even though V1 functional organizations are present at birth, numerous studies indicate that the brain's self-organizing mechanisms continue after full development [30; 68; 40].

Learning with signal on non-euclidean geometry. In recent years, researchers from the machine learning community have made efforts to consider geometries and special structures beyond classic images, text, and feature vectors. This is the key motivation for geometric deep learning and graph neural networks (GNN). Many works generalizes common operator for processing Euclidean data like 2d convolution and attention [7; 49; 18; 28]. Due to the natural of graph neural network, they often only work on limited data regime and do not scale to large data [69; 10; 9]. However, recent development in [32] shows this direction is prominent. They successfully scales GNN to ImageNet.

We compared their proposed neural network architecture with the ViT backbone used in URLOST. Recent research also explores adapting the Transformer to domains beyond Euclidean spaces such as [16; 15; 29]. Ma et al. [45] treats an image as a set of points but relies on 2D coordinates. URLOST employs a single mutual information graph to define the topology of the high-dimensional signal. Gao et al. [29], on the other hand, is designed to handle graph data, where each data point corresponds to a distinct graph. It segments a graph into 'subgraphs,' processes them with a GNN, and then passes the output to a transformer. This approach is undeniably more flexible but requires all subgraphs to be globally aligned. Furthermore, the self-organizing layer in URLOST generalizes the 'patch resizer' mechanism from FlexiViT used in Beyer et al. [4]. Finally, the lines of works follows the 'perceiver' architecture is very related to URLOST [39; 38]. 'Perceiver' also implicitly aggregates similar dimensions of high dimensional data and processes the similar dimension with an attention mechanism. [88] is a follow-up work that combines the Perceiver architecture and masked-prediction objective. However, the key difference is that the 'perceiver' computes similarity between dimensions on a single example signal. For image data, this is essentially clustering pixels based on the pixel intensity of color in a signal image. On the other hand, URLOST computes similarity over statistics of the distribution of the signal. Additionally, 'perceiver' implicitly aggregates similar dimensions, while URLOST explicitly aggregates similar dimensions and recovers the original topology of the signal.

Self-supervised learning. Self-supervised learning (SSL) has made substantial progress in recent years. Different SSL method is designed for each modality, for example: predicting the masked/next token in NLP[21; 63; 8], solving pre-text tasks, predicting masked patches, or building contrastive image pairs in computer vision [48; 34; 87; 12; 33; 90]. These SSL methods have demonstrated descent scalability with a vast amount of unlabeled data and have shown their power by achieving performance on par with or even surpassing supervised methods. They have also exhibited huge potential in cross-modal learning, such as the CLIP by Radford et al. [64].

Discussion, Limitations, and future direction

The success of most current state-of-the-art self-supervised representation learning methods relies on the assumption that the data has known stationarity and domain topology. In this work, we explore unsupervised representation learning under a more general assumption, where the stationarity and topology of the data are unknown to the machine learning model and its designers. We argue that this is a general and realistic assumption for high-dimensional data in modalities of natural science. We propose a novel unsupervised representation learning method that works under this assumption and demonstrates our method's effectiveness and generality on a synthetic biological vision dataset and two datasets from natural science that have diverse modalities. We also perform a step-by-step ablation study to show the effectiveness of the novel components in our model.

Unlike the self-organizing layer, the procedure for aggregating similar dimensions together is separate from the training of MAE. Given the effectiveness of the self-organizing layer, this design could be sub-optimal and could be improved. Learning the clusters end-to-end with the representation via back-propagation is worth investigating in the future. Additionally, the computational costs of computing pairwise mutual information and clustering both scales quadratically with the number of dimensions of the signal. Although each procedure only needs to be performed once, the computation could become too slow and infeasible for extremely high-dimensional data ( d > 10 , 000 ) with our current implementation as shown in Appendix A.9. In the Appendix, we provide a benchmark of the runtime of our implementation. Nevertheless, we think this problem could be resolved by a GPU implementation for clustering and mutual information. Additionally, although natural science datasets such as TCGA and V1 calcium imaging are considered large datasets in their respective domains, they are still small compared to the datasets used in computer vision. We expect larger natural science datasets to emerge as measuring and imaging technology advances. Additionally, adapting URLOST to support contrastive learning objectives, such as SimCLR, presents another intriguing direction.

Acknowledgement

We would like to specially thank Bruno Olshausen for suggesting key prior work on computational models and biological evidence for recovering the 2D topology of natural images. We also appreciate the helpful suggestions from Surya Ganguli and Atsu Kotani.

References

Appendix

Motivation of density adjusted spectral clustering

Using the terminologies in functional analysis, the mutual information graph defined in section 2.2 corresponds to a compact Riemannian manifold M and the Laplacian matrix L is a discrete analogous to the Laplace Beltrami operator L on M . Minimizing the spectral embedding objective tr ( Y LY T ) directly corresponds to the following optimization problem in function space:

where f ( x ) : M→ [0 , 1] is the normalized signal defined on M . We particularly want to write out the continuous form of spectral embedding so we can adapt it to non-stationary signals. To do so, we assume the measure λ is absolutely continuous with respect to standard measure µ . By apply the Radon-Nikodym derivative to equation 5, we get:

where the quantity dλ dµ is called the Radon-Nikodym, which is some form of density function. Let p ( x ) = dλ dµ , we can rewrite the optimization problem as the following:

The density function p ( x ) on the manifold is analogous to the density adjustment matrix in equation 1. Standard approaches in equation 5 assume that nodes are uniformly distributed on the manifold, thereby treating p ( x ) as a constant and excluding it from the optimization process. However, this assumption does not hold in our case involving non-stationary signals. Our work introduces a variable density function p ( x ) for each signal, making it a pivotal component in building good representations for non-stationary signals. This component is referred to as Density Adjusted Spectral Clustering . Empirical evidence supporting this design is provided through visualization and ablation studies in the experimental section.

Spectral clustering algorithm

Given a high dimensional dataset S ∈ R n × m , Let S i be i th column of S , which represents the i th dimension of the signal. We create probability mass functions P ( S i ) and P ( S j ) and the joint distribution P ( S i , S j ) for S i and S j using histogram. Let the number of bins be K . Then we measure the mutual information between P ( S i ) and P ( S j ) as:

Let A ij = I ( S i ; S j ) be the affinity matrix, and let the density adjustment matrix be P defined in 2.2. Correlation is used instead of mutual information when the dimension is really high since computing mutual information is expensive. We follow the steps from [51] to perform spectral clustering with a modification to adjust the density:

Some other interpretation of spectral embedding allows one to design a specific clustering algorithm in step 4. For example, [72] interprets the eigenvector problem in 6 as a relaxed continuous version of K-way normalized cuts problem, where they only allow X to be binary, i.e. X ∈ { 0 , 1 } N × K . This is an NP-hard problem. Allowing X to take on real value relaxed this problem but created a degeneracy solution. Given a solution X ∗ and Z = D -1 2 X ∗ , for any orthonormal matrix R , RZ is another solution to the optimization problem 6. Thus, [72] designed an algorithm to find the optimal orthonormal matrix R that converts X ∗ to discrete value in { 0 , 1 } N × K . From our experiment, [72] is more consistent than K-means and other clustering algorithms, so we stick to using it for our model.

Data synthesize process

We followed the retina sampling approach described in [14] to achieve foveated imaging. Specifically, each retinal ganglion cell is represented using a Gaussian kernel. The kernel is parameterized by its center, denoted as ⃗ x i , and its scalar variance, σ ′ 2 i , i.e. N ( ⃗ x i , σ ′ 2 i I ) , which is illustrated in Figure 5.A. The response of each cell, denoted as G [ i ] , is computed by the dot product between the pixel value and the corresponding discrete Gaussian kernel. This can be formulated as:

where N and W are dimensions of the image, and I represents the image pixels.

For foveated CIFAR-10, since the image is very low resolution, we first upsample it 3 times from 32 × 32 to 96 × 96 , then use in total of 1038 Gaussian kernels to sample from the upsampled image. The location of each kernel is illustrated in Figure 5.B. The radius of the kernel scales proportionally to the eccentricity. Here, we use the distance from the kernel to the center to represent eccentricity. The relationship between the radius of the kernel and eccentricity is shown in Figure 5.C. As mentioned in the main paper, in the natural retina, retinal ganglion cell density decreases linearly with eccentricity, which makes the fovea much denser than the peripheral, unlike the simulated lattice we created. The size of the kernel should scale linearly with respect to eccentricity as well. However, for the low-resolution CIFAR-10 dataset, we reduce the simulated fovea's density to prevent redundant sampling. In this case, we pick the exponential scale for the relationship between the size of the kernel and eccentricity so the kernel visually covers the whole visual field. We also implemented a convolution version of the Gaussian sampling kernel to speed up data loading.

Density adjusted spectral clustering on foveated CIFAR10 dataset

We provide further intuition and visualization on why density-adjusted spectral clustering allows the model to learn a better representation of the foveated CIFAR-10 dataset.

Figure 5: Foveated retinal sampling (A) Illustration of a Guassian kernel shown in [14]. Diagram of single kernel filter parameterized by a mean µ ′ and variance σ ′ . (B) the location of each Gaussian kernel is summarized as a point with 2D coordinate µ ′ . In total, the locations of 1038 Gaussian kernels are plotted. (C) The relationship between eccentricity (distance of the kernel to the center) and radius of the kernel is shown.

Table 5: Evaluation on foveated CIFAR-10 with varying hyperparameter for density function. For each set of values of α and β , we perform density-adjusted spectral clustering and run URLOST with the corresponding cluster. The evaluation of each trial is provided in the table.

of elements in them. Since the kernel in the center is smaller, the cluster in the center will be visually smaller, than the cluster in the peripheral. The effect is shown in Figure 6. Moreover, since we're upsampling an already low-resolution image (CIFAR-10 image), even though the kernel at the center is more accurate, we're not getting more information. There, to make sure each cluster has similar information, the clusters in the center need to have more elements than the clusters in the peripheral. In order to make the clusters at the center have more elements, we need to weight the clusters in the center more with the density function. Since the sampling kernels at the center have small eccentricity and are more correlated to their neighbor, increasing α and β will make sampling kernels at the center have higher density, which makes the cluster at the center larger. This is why URLOST with density-adjusted spectral clustering performs better than URLOST with constant density spectral clustering, which is shown in Table 5. Meanwhile, setting α and β too large will also hurt the model's performance because it creates clusters that are too unbalanced.

self-organizing layer learns inverse permutation

For locally-permuted CIFAR-10 , we divide each image into patches and locally permute all the patches. The i -th image patch is denoted by x ( i ) , and its permuted version, permuted by the permutation matrix E ( i ) , is expressed as E ( i ) x ( i ) . We use linear layers to parameterize the self-organizing layers. Let g ( x, W ( i ) ) = W ( i ) x denotes the i th element of the self-organizing layer. We're providing the proof for the statement related to the visual evidence shown in Section 4.2

Statement: If the self-organizing layer effectively aligns the input sequence, then E ( i ) T w ( i ) should exhibit visual similarities.

Proof: we first need to formally define what it means for the self-organizing layer to effectively align the input sequence. Let e k denote the k th natural basis (one-hot vector at position k ), which represents the pixel basis at location k . Permutation matrix E ( i ) will send k th pixel to some location accordingly. Mathematically, if the projection layer effectively aligns the input sequence, it means g ( E ( j ) e k , W ( j ) ) = g ( E ( i ) e k , W ( i ) ) for all i, j, k . We can further expand this property to get the

Figure 6: Effect of density adjusted clustering. Eccentricity-based sampling lattice. The center of the sampling lattice has more pixels which means higher resolution compared to the peripheral. (A) Result of density-adjusted spectral clustering ( α = 0 . 5 , β = 2 ). Clusters in the center have more elements than clusters in the peripheral. But clusters look more visually similar in size than B. (B) Result of uniform density spectral clustering ( α = 0 , β = 0 ). Each cluster has a similar number of elements in them but the clusters in the center are much smaller than the clusters in the periphery.

following two equations:

for all i, j, k . Since the above equation holds for all e k , by linearity and the property of permutation matrix, we have:

Training and evaluation details

β -VAE . β -VAE was trained for 1000 epochs and 300 epochs on the V1 neura l and TCGA gene expression respectively. We use the Adam optimizer with a learning rate of 0 . 001 and a cosine annealing learning rate scheduler. The encoder is composed of a 2-layer MLP with batch normalization and LeakyReLU activation. We use hidden dimensions 2048 and 1024 for V1 and Gene datasets respectively. Then, two linear layers are applied to get the mean and standard for reparameterization. The decoder also has a 2-layer MLP, symmetric to the encoder but using standard ReLU activation and no batch normalization. We tried out different hyperparameters and empirically found that this setting gave the best performance.

MAE . MAE follows the official implementation from the original paper. For CIFAR10, we ran our model for 10,000 epochs. We use Adam optimizer with a learning rate of 0.00015 and a cosine annealing. To fit in our tasks, we use 8 8-layer encoders and 4-layer decoders with hidden dimension 192. The ViT backbone can take different patch sizes and we indicated them accordingly in Table 1. ViT(Pixel) means treating each pixel as a patch, so essentially the patch size is 1. This is also used for the real-world high-dimensional dataset since no concept of patch is defined in the signal space.

For V1 neural recording and TCGA gene expression task, we use 4 layers encoder and 2 layers. We use hidden dimension 1380 for 1000 epochs and hidden dimension 384 with 3000 epochs for V1 neural recording and TCGA dataset task. The hidden dimension and the number of epochs we used for MAE are greater than β -VAE. However, when we use the same parameters on β -VAE, we did not seem to find a performance gain. Training Transformers usually requires a large number of data. For example, the original transformer on vision is pretrained over 14M images.

URLOST MAE Many hyperparameters are associated with the spectral clustering algorithms. The performance of the model depends on the cluster, which thus depends on these hyperparameters. In general, URLOST is not very sensitive to these hyperparameters. Density parameters (how nodes are normalized) and the number of clusters affects the shape of the clusters, thus, affects the performance of the model to some degree.

Intuition on hyperparameter selection: Both these hyperparameters are related to the size of each cluster, which relates to how difficult the task is. If a cluster is too big, then we need to predict a big missing area on the graph, making the unsupervised learning task too difficult. On the other hand, if the cluster is too small, for example, if the cluster shrinks down to one pixel for an image, then the prediction tasks become too easy. The solution is simply doing a low-pass filter on the graph to fill in the missing pixel. So the model will not learn high-level or semantic representation. Both the number of clusters and density factors are related to the cluster size. Increasing the number of clusters could make each cluster smaller. The density factor essentially defines how much we want to normalize each node on the graph. For an image dataset, clusters should contain semantic sub-structures of the image, like body parts or parts of objects. We use 64 clusters, which results in clusters of size roughly 4x4. In other words, the ratio between the size of clusters (part) and the number of dimensions (whole) is roughly 1:64. For other datasets (gene and V1), we roughly keep this ratio and perform a grid search over these hyperparameters. Similarly, for the density factors, we also perform a grid search centered at the density factor equal to 1. Empirically, URLOST is not very sensitive too these hyperparameters as shown in Figure 7.

We also provide the exact hyperparameters we used for the experiments in this paper: the parameter of URLOST MAE is the same as MAE except for the specific hyper-parameter in the method section. For CIFAR10, we use K = 20 , α = 0 . 5 and β = 2 . We set the number of clusters to be 64. For V1 neural recording, we use K = 15 , α = 0 and β = 1 . We set the number of clusters to 200. For TCGA dataset, we use K = 10 , α = 0 and β = 1 . We set the number of clusters to 32.

Effect of spectral clustering hyperparameters on performance

We perform an experiment doing a grid search over each hyperparameter in spectral clustering that affects the performance of URLOST. The grid search experiment is performed over the TCGA gene classification dataset. As shown in Figure 7, the model is not sensitive for most of the hyperparameters. For a large range of hyperparameters, the model performance stays the same.

Figure 7: Effect of spectral clustering hyperparameters on performance (A) Effect of number of cluster on performance. (B) Effect of normalization parameter for spectral clustering ( β in Eq. 4) on performance. (C) Effect of numbers of neighbors from K nearest neighbors on performance.

Topology and Stationarity

We provide a formal definition of topology and stationarity used in the context of our work here:

On topology of the domain of signals. A signal is a function X ( s ) defined on some domain S , where s ∈ S . Let's focus on S . While the definition of topology is a little abstract and relies on open sets, the intuition is that it defines a generalized sense of nearness, or whether two points are close to each other or far apart. A common way to give such a space the property of being near or far away is with an explicit distance function, or a metric. The distance function generates a class of objects-open sets. These spaces (with an associated metric) are called metric spaces. For example, the topology of image space is induced by the L2 norm as the metric function. However, not every space will have a natural metric on it. This is why we need to use the term 'topology'.

Topological spaces are generalizations of metric spaces. The definition of topology is any collection of subsets of S that satisfies certain axioms: close under arbitrary union and finite intersection. In the practical setting, the domain S is always discretized as a set of indices, and we can assume that we always deal with a signal defined on a graph [70]. Graphs can be embedded in R 3 . For a finite graph, the graph topology is just the subspace topology inherited from the usual topology in R 3 . Under this specific definition, the word 'graph,' 'adjacency matrix,' and 'topology' can be used interchangeably. In this work, we assume we are not provided the topology. Thus, using pairwise mutual information as the adjacency matrix is a way to model the topology. Correlation is another way to model the topology, where we use only second-order pairwise statistics. This is slightly different from the reviewer's understanding 'Topology appears to correspond to 'correlation' between dimensions.' More generally, graphs are topological spaces that are the simplicial 1-complexes or the 1-dimensional CW complexes.

Example with stationarity but no topology: Permuted Cifar10 is an example we had in the manuscript for this criterion (no topology but has stationarity). No topology refers to the fact that n ( i ) is not defined after random permutation. 'Has stationarity' means that the signal is stationary when the topology is recovered. Specifically, we use the mutual information graph to define neighbors n ( i ) . Under this particular definition of n ( i ) , there exists a meaningful invariant statistic S f ( i ) for the signal. Thus, the signal is stationary. This is because the topology recovered is very similar to the original 2d topology of images before permutation, and the sign before permutation is stationary. However, under an incorrect definition of n ( i ) , the signal is no longer stationary.

Example with no topology but no stationarity: One example would be foveated images simulated using convolutional kernels. One could make a different region of an image have a different resolution while preserving the 2d grid structure of the image by using a Laplace pyramid. As a result, the image is still represented as a 2d matrix, and the definition of n ( i ) is the same as regular images. However, many invariant statistics before the foveation process are destroyed due to the imbalance resolution at different regions of the image.

On stationarity. In probability theory, a random process is a family of random variables. While the index was originally and frequently interpreted as time, it was generalized to random fields, where the index can be an arbitrary domain. That is, by modern definitions, a random field is a generalization of a stochastic process where the underlying parameter need no longer be real or integer valued 'time' but can instead take values that are multidimensional vectors in R n , points on some manifolds and Lie Groups, or graphs [81; 50; 77; 42; 65; 5]. Let's consider a real-valued random field X ( s ) defined on a homogeneous space S = s of points s equipped with a transitive transformation group G = g of mappings of S into itself, and having the property that the values of the statistical characteristics of this field do not change when elements of G are applied to their arguments. In convention, we call a random field X ( s ) a strict-sense stationary (or strict-sense homogeneous) random field if for all n = 1 , 2 , · · · and g ∈ G , the finite-dimensional probability distribution of its values at any n points s 1 , · · · , s n coincides with that of its values at gs 1 , · · · , gs n . We call X ( s ) field wide-sense stationary (or wide-sense homogeneous) random field if E | X ( s ) | 2 < inf and EX ( s ) = EX ( gs ) , EX ( s ) X ( s 1 ) = EX ( gs ) EX ( gs 1 ) for all s , s 1 ∈ S and g ∈ G . Or wide-sense stationarity means the first and second-order statistics do not change w.r.t. the group action. Further, we can intuitively generalize the wide-sense stationarity to higher-order statistics if we need to define any nth-order stationarity between strict-sense and wide-sense.

In intuitively, stationarity just means the statistics does not change w.r.t. transformations. Let's take the image as an example, it is a signal X ( s ) defined on S = Z 2 . And let's define the group action as translation. If we use a convolution neural network and learn a set of L × L filters or use a VIT to learn a set of shared features on L × L patches. A hidden assumption is that the statistics of our image does not change for L × L region, or in order words the joint probability P ( X (1 , 1) , · · · , X ( L, L )) = P ( X (1 + a, 1 + b ) , · · · , X ( L + a ) , · · · , X ( L + b )) , where ( a, b ) ∈ translation group T . While

Table 6: Ratio between Parameters of Self-Organizing Layers and Parameters of the Entire Model . We report the model used for each dataset.

we have initially cited [8], it might be a good idea to provide a more thorough explanation in the appendix, hence, we have included another five references to provide a more comprehensive point of view.

Computational cost

Computation cost . The computational cost and scalability of URLOST are nearly the same as MAE. The computational cost of MAE is analyzed in the original MAE paper. Its efficient design of MAE speeds up the training and evaluation time around 4x compared to a standard Transformerbased auto-encoder. The additional elements in URLOST are simply clustering and self-organizing layers. This effect is shown in figure 7, despite the encoder used in the experiment being 3 times as the decoder, the runtime of the encoder is still faster than the decoder. Moreover, the operation of clustering indexing and self organizing layer is very small relative to the overall runtime of the model. The mutual information and the clustering results are precomputed based on the dataset and thus do not introduce more cost for the model training. During training and inference, clustering is just a one-step indexing process that does not create significant overhead in additional to MAE. For all experiments performed in this paper, since clustering and mutual information are computed and saved for all datasets at once, their computational cost is negligible with respect to the training time of MAE. However, The runtime of these two procedure could be a problem for extremely high dimensional data.

With the current implementation, we perform a scalability benchmark as suggested by the reviewer. We resize an image dataset to different resolutions. At each resolution, we compute mutual information and clustering for a different number of pixels (dimensions), and we provide a plot of running time with respect to the number of pixels in Figure 8. Both computational cost of computing pairwise mutual information and clustering both scale quadratically with the number of dimensions of the signal. It takes aroudn 130 minutes to calculate mutual information for data with 10,000 dimensions. Clustering takes much shorter time (less than a minute).

Figure 8: Runtime for clustering and computation mutual information. A. Benchmark of runtime for clustering for signal with different dimensions. B. Benchmark of runtime for computing pairwise mutual information with different dimensions. Both run time scale quadratically as the dimension of the signal.

Amountof parameters . The self-organizing layer introduces additional parameters compared to the original MAE model. We believe additional parameters are a reasonable trade-off for not knowing topology and stationarity, which are important inductive biases in the data. However, it doesn't increase the parameters drastically, we provide the ratio of the additional parameters in table 6.

Table 7: Runtime of each module during the forward operation . We report the runtime of each module during inference for a single example. We use the model trained on permuted CIFAR10 dataset. The experiment is performed using a single RTX 2080TI and is averaged over 500 trials.

Scalability

Scalability is not the primary focus of this work. In general, unsupervised representation learning benefits from scaling to a larger amount of data, and the MAE model, which our method is built upon, demonstrates a level of scaling [34]. We believe URLOST inherits this potential for the fact that it shows stronger advantages on more challenging tasks, as seen with CIFAR-10 compared to simpler benchmarks. This suggests that scaling to larger and more complex datasets is a promising direction. Our study explores the novel settings of unsupervised representation learning without knowing the data topology and stationarity, demonstrating the effectiveness of our method across diverse data modalities with significant differences. This generalization is our key and positive contribution, highlighting the potential of our approach to inspire future work in this area.

Noise robustness of mutual information recovered topology

We tested the robustness of the topology recovery algorithm used in URLOST. Specifically, we add gaussian noise to CIFAR10 to generate data with a low signal-to-noise ratio.

We found out that while decreasing signal-to-noise ratio will lower the overall mutual information between pairs of pixels, the relationship between mutual information between pixels and image topology (distance between two pixels) does not change. We perform the following experiments: we add iid gaussian noise to each pixel with a fixed noise level, then we measure the mutual information between pairs of pixels as MI (( i, j ) , ( k, l )) . For visualization, we aggregate all pairs by the distance between them and average their mutual information, i.e. let A ( a ) = ( i, j ) , ( k, l ) s.t. √ ( i -k ) 2 +( j -l ) 2 . We have MI ( a ) = 1 | A ( a ) | ∑ i,j,k,l ∈ A ( a ) MI (( i, j ) , ( k, l )) . MI ( a ) denotes the mutual information between pixels of different distances.

As shown in Figure 9 below, although the overall mutual information decreases a lot as the noise level increases, the order of MI(a) is still the same. In other words, the mutual information is still the biggest when the two pixels are nearby and decays when the mutual information decreases. This implies the topology recovered by pairwise mutual information is robust even when the signal has a lower signal-to-noise ratio. It only breaks when you add too much noise ( σ = 0 . 9 ). For this amount of noise, the image is not distinguishable.

Self-Organizing Map (SOM)

Kohonen in 1982 presents a fascinating thought experiment [43]: 'Suppose you woke up one day to find someone rewired your optic nerve (or you have been implanted with a prosthetic retina). The signals from the retina to the brain are intact, but the wires are all mixed up, projecting to the wrong places. Can the brain learn to 'descramble' the image?' Kohonen proposes the 'selforganizing map' (SOM) algorithm to address this problem. SOM is an unsupervised machine learning technique used to produce a low-dimensional (typically two-dimensional) representation of a higher-dimensional dataset while preserving the topological structure of the data. In the task of 'descrambling' pixels, SOM maps each pixel into a 2D index set by leveraging the mutual information between pixels at different locations. The parameters of SOM include a set of neurons, where each neuron has a location index. Let W v denote the weight vector of the v th neuron and D the input vector. θ ( u, v, s ) is the interaction function that determines the influence between the u th and v th neurons based on their distance. The update equation for SOM is given by:

In the task of 'descrambling' pixels, the input consists of pixel values. The indices will be twodimensional, such as s = ( i, j ) , and W v ( s ) will learn to represent a set of 'descrambled' pixels,

Mutual information vs. Pixel! Distance for Different Noise Levels

Figure 9: The effect of additive noise on pairwise mutual information. We plot averaged mutual information between pairs of pixels. We aggregate all pairs with the same distance. Each curve represents a specific noise level. Although the curve becomes lower as noise increases, the decaying features of the curve remain the same.

where s represents the correct indices of the pixels. In other words, the index set defines the correct topology of the image as a 2D grid, and SOM maps the pixels onto this 2D topology by leveraging the mutual information between pixels at different locations. Other methods such as [67] use manifold learning to uncovered topology 2-d topology from natural images. However, when the intrinsic dimension of the manifold is larger than 2d, it is difficult to integrate the 'uncovered' topology with state-of-the-art self-supervised learning algorithms.

Visualizing the weight of self-organizing

As explained in the previous section (Appendix A.5) and visualized in Figure 10, we can visualize the weights of the learned self-organizing layer when trained on the locally-permuted CIFAR-10 dataset. If we apply the corresponding inverse permutation E ( i ) T to its learned filter W ( i ) at position i , the pattern should show similarity across all position i . This is because the model is trying to align all the input clusters. We have shown this is the case when the model converges to a good representation. On the other hand, what if we visualize the weight E ( i ) T W ( i ) as training goes on? If the model learns to align the clusters as it is trained for the mask prediction task, E ( i ) T W ( i ) should become more and more consistent as training goes on. We show this visualization in Figure 10, which confirms our hypothesis. As training goes on, the pattern E ( i ) T W ( i ) becomes more and more visually similar, which implies the model learns to gradually learn to align the input clusters.

Figure 10: Visualize the weight of the self-organizing layer after applying inverse permutation. A snapshot of E ( i ) T W ( i ) is shown at different training epoch. The number of epochs is shown on the top row. Each figure shows one column of the weight of the self-organizing layer, at different positions, i.e. W (1) : ,k , where k is the column number and i is the position index. In total, 9 columns are shown.

Dataset	Method	Backbone	Eval Acc
CIFAR-10	MAE MAE SimCLR	ViT (Patch) ViT (Pixel) ResNet-18	88.3% 56.7% 90.7%
Permuted CIFAR-10	SimCLR URLOST MAE MAE SimCLR SimCLR	ViG ViT (Cluster) ViT (Pixel) ResNet-18 ViG	53.8% 86.4% 56.7% 47.9% 40.0%
Foveated CIFAR-10	URLOST MAE MAE SimCLR SimCLR	ViT (Cluster) ViT (Pixel) ResNet-18 ViG	85.4% 48.5% 38.0% 42.8%

Method	V1 Response Decoding Acc	TCGA Classification Acc
Raw	73.9% ± 0.00%	91.7 ± 0.24%
MAE	70.6% ± 0.22%	90.6% ± 0.63%
β -VAE	75.64% ± 0.11%	94.15% ± 0.24%
URLOST MAE	78.75% ± 0.18%	94.90% ± 0.25%

Masking Unit	Permuted Cifar10	Foveated Cifar10	TCGA Gene	V1 Response
Clusters (URLOST)	86.4%	85.4%	94.9%	78.8%
Random patch	55.7%	51.1%	91.7%	73.9%
Individual dimension	56.7%	48.5%	88.3%	64.8%

Dataset	Projection	Eval Acc	Dataset	Cluster	Eval Acc
Locally-Permuted CIFAR-10	shared non-shared	81.4% 87.6 %	Foveated CIFAR-10	SC	82.7%
Permuted CIFAR-10	shared non-shared	80.7% 86.4 %	Foveated CIFAR-10	DSC	85.4%

	beta = 0	beta = 2
alpha = 0	82.74%	84.24%
alpha = 0.5	84.52%	85.43%
alpha = 1.0	83.83%	81.62%

Dataset	Permuted Cifar10	TCGA	V1
Ratio	6.7%	15.6%	5.9%

Module	Cluster Indexing	Self Organizing Layer	Encoder	Decoder
Runtime	0.27 ms	1.11 ms	17.5 ms	20.4 ms

Unsupervised representation learning has seen tremendous progress but is constrained by its reliance on data modality-specific stationarity and topology, a limitation not found in biological intelligence systems. For instance, human vision processes visual signals derived from irregular and non-stationary sampling lattices yet accurately perceives the geometry of the world. We introduce a novel framework that learns from high-dimensional data lacking stationarity and topology. Our model combines a learnable self-organizing layer, density adjusted spectral clustering, and masked autoencoders. We evaluate its effectiveness on simulated biological vision data, neural recordings from the primary visual cortex, and gene expression datasets. Compared to state-of-the-art unsupervised learning methods like SimCLR and MAE, our model excels at learning meaningful representations across diverse modalities without depending on stationarity or topology. It also outperforms other methods not dependent on these factors, setting a new benchmark in the field. This work represents a step toward unsupervised learning methods that can generalize across diverse high-dimensional data modalities.

Unsupervised representation learning, also known as self-supervised representation learning (SSL), aims to develop models that autonomously detect patterns in data and make these patterns readily apparent through a specific representation. There has been tremendous progress over the past few years in the unsupervised representation learning community. Popular methods like contrastive learning and masked autoencoders [68; 6; 24; 70] work relatively well on typical modalities such as images, videos, audio, time series, and point clouds. However, these methods make implicit assumptions about the data domain’s topology and stationarity. Given an image, topology refers to the neighboring pixels of each pixel, or more generally, the grid structure in images, the temporal structure in time series and sequences, or the 3D structure in molecules and point clouds. Stationarity refers to the property that the low-level statistics of the signal remain consistent across its domain. For instance, pixels and patches in images exhibit similar low-level statistics (mean, variance, co-variance) regardless of their locations within the domain. The success of state-of-the-art self-supervised representation learning relies on knowing the prior topology and stationarity of the modalities. For example, joint-embedding SSL employs random-resized cropping augmentation [6], and masked auto-encoding [25] utilizes masked-image-patch augmentation. What if we possess high-dimensional signals without knowledge of their domain topology or stationarity? Can we still craft a high-quality representation? This is not only the situation that biological vision systems have to deal with but also a practical setting for many scientific data analysis problems. In this work, we introduce unsupervised representation learning without stationarity or topology (URLOST) and take a step in this direction.

As we mentioned earlier, typical modalities possess topology and stationarity prior information that can be utilized by unsupervised representation learning. Taking images as an example, digital cameras employ a consistent sensor grid that spans the entire visual field. However, biological visual systems have to deal with signals with less domain regularity. For instance, unlike camera sensors which have a uniform grid, the cones and rods in the retina distribute unevenly and non-uniformly. This results in a non-stationary raw signal. Retinal ganglion cells connect to more photoreceptors in the fovea than in the periphery. The correlation of the visual signal between two different locations in the retina depends not only on the displacement between these locations but also on their absolute positions. Yet, biological visual systems can establish precise retinotopy from the retina to neurons based on spontaneous retinal activities and external stimuli [67; 35; 18] and leverage retinotopic input to build unsupervised representation. This implies that we can potentially build unsupervised representation without relying on prior stationarity of the raw signal or topology of the input domain.

In this work, we aim to build unsupervised representations for general high-dimensional vectors. Taking images as an example again, let’s assume we receive a set of images whose pixels are shuffled in the same order. How can we build representations in an unsupervised fashion without knowledge of the shuffling order? If possible, can we use such a method to build unsupervised representations for general high-dimensional data? Inspired by [53], we use low-level statistics and spectral clustering to form clusters of the pixels, which recovers a coarse topology of the input domain. These clusters are analogous to image patches except that they are slightly irregularly shaped and different in size. We mask a proportion of these “patches” and utilize a Vision Transformer [15] to predict the masked “patches” based on the remaining unmasked ones. This “learning to predict masked tokens” approach is proposed in masked autoencoders (MAE) [25] and has demonstrated effectiveness on typical modalities. Initially, we test the proposed method on the synthesized biological visual dataset, derived from CIFAR-10 [32] using a foveated retinal sampling mechanism [8]. Then we generalize this method to two high-dimensional vector datasets: a primary visual cortex neural response decoding dataset [57] and the TCGA miRNA-based cancer classification dataset [62; 66]. Across all these benchmarks, our proposed method outperforms existing SSL techniques, establishing its effectiveness in building unsupervised representations for signals lacking explicit stationarity or topology. Given the emergence of new modalities in deep learning from natural sciences [59; 23; 49; 34], such as chemistry, biology, and neuroscience, our method offers a promising approach in the effort to build unsupervised representations for high-dimensional data.

Our objective is to build robust, unsupervised representations for high-dimensional signals that lack explicit topology and stationarity. These learned representations are intended to enhance performance in downstream tasks, such as classification. To achieve this, we begin by using low-level statistics and clustering to approximate the signal’s topology. The clusters derived from the signal serve as input to a masked autoencoder. As depicted in Figure 1, the masked autoencoder randomly masks out patches in an image and trains a Transformer-based autoencoder unsupervisedly to reconstruct the original image. After the unsupervised training, the autoencoder’s latent state yields high-quality representations. In our approach, signal clusters are input to the masked autoencoder.

Notably, the clusters differ from image patches in several key aspects due to the differences in the input signal: they are unaligned, exhibit varied sizes and shapes, and their clustering nodes are not confined to fixed 2D locations like pixels in image patches. To cope with these differences, we introduce a self-organizing layer responsible for aligning these clusters through learnable transformations. The parameters of this layer are jointly optimized with those of the masked autoencoder. Our method is termed URLOST, an acronym for Unsupervised Representation Learning withOut Stationarity or Topology. Figure 2 provides an overview of the framework. URLOST consists of three core components: density adjusted spectral clustering, self-organizing layer, and masked autoencoder. The functionalities of these components are detailed in the following subsections.

Representation learning for high-dimensional signals without explicit topology is challenging. We propose to define a metric to measure inter-dimensional relationships. This metric effectively approximates a topology for the signal. Similar to [53], where they use the absolute correlation values as the metric for pixels, we employ discrete mutual information (refer to Appendix A.1) as the metric. Let affinity matrix Aijsubscript𝐴𝑖𝑗A_{ij} denote the mutual information between dimension i𝑖i and j𝑗j, which approximates the manifold ℳℳ\mathcal{M} that the signal lives on. We can define the discretized Laplacian operator based on A𝐴A and use the eigenvector of the Laplacian operator to perform spectral clustering, which segments the manifold. The detailed definition and the algorithm are left in Appendix A.1. Finding the eigenvector of the Laplacian operator is a discretized approximation of the following optimization problem in function space:

where f(x):ℳ→[0,1]:𝑓𝑥→ℳ01f(x):\mathcal{M}\to[0,1] is the normalized signal defined on ℳℳ\mathcal{M} and p(x)𝑝𝑥p(x) is the density function. The integral is taken over standard measure on ℳℳ\mathcal{M}. Since spectral clustering heavily relies on the solution of equation 1, the definition of the density function p(x)𝑝𝑥p(x) affects the quality of the resulting clusters. Standard approaches often assume that nodes are uniformly distributed on the manifold, thereby treating p(x)𝑝𝑥p(x) as a constant and excluding it from the optimization process. However, this assumption does not hold in our case involving non-stationary signals. To cope with non-stationary signals, our work introduces a variable density function p(x)𝑝𝑥p(x) for each signal, making it a pivotal component in building good representations for the signal. This component is referred to as Density Adjusted Spectral Clustering. Empirical evidence supporting this design is provided through visualization and ablation studies in the experimental section.

Transforming a high-dimensional signal into a sequence of clusters using the above method is not enough because it does not capture the internal structure within individual clusters. To effectively perform unsupervised learning on these clusters, it is essential to align them in some manner. Directly solving the exact alignment problem with low-level statistics of the signal is challenging. Thus, we propose a self-organizing layer with learnable parameters. Specifically, let vector x(i)superscript𝑥𝑖x^{(i)} denote the i𝑖ith cluster. Each cluster x(i)superscript𝑥𝑖x^{(i)} is passed through a differentiable function g(⋅,w(i))𝑔⋅superscript𝑤𝑖g(\cdot,w^{(i)}) with parameter w(i)superscript𝑤𝑖w^{(i)}, resulting in a sequence z0subscript𝑧0z_{0}:

z0subscript𝑧0z_{0} is comprised of projected and aligned representations for all clusters. The weights of the proposed self-organizing layer, {w(1),⋯w(M)}superscript𝑤1⋯superscript𝑤𝑀{w^{(1)},\cdots w^{(M)}}, are jointly optimized with the subsequent neural network introduced in the next subsection.

After the self-organizing layer, z0subscript𝑧0z_{0} is passed to a Transformer-based masked autoencoder (MAE) with an unsupervised learning objective. Masked autoencoder (MAE) consists of an encoder and a decoder which both consist of stacked Transformer blocks introduced in [64]. The objective function is introduced in [25]: masking random image patches in an image and training an autoencoder to reconstruct them, as illustrated in Figure 1. In our case, randomly selected clusters in z0subscript𝑧0z_{0} are masked out, and the autoencoder is trained to reconstruct these masked clusters. After training, the encoder’s output is treated as the learned representation of the input signal for downstream tasks. The masked prediction loss is computed as the mean square error (MSE) between the values of the masked clusters and their corresponding predictions.

Since our method is inspired by the biological vision system, we first validate its ability on a synthetic biological vision dataset created from CIFAR-10. Then we evaluate the generalizability of URLOST on two high-dimensional natural datasets collected from diverse domains. Detailed information about each dataset and the corresponding experiments is presented in the following subsections. Across all tasks, URLOST consistently outperforms other strong unsupervised representation learning methods.

As discussed in the introduction, the biological visual signal serves as an ideal dataset to validate the capability of URLOST. In contrast to digital images captured by a fixed array of sensors, the biological visual signal is acquired through irregularly positioned ganglion cells, inherently lacking explicit topology and stationarity. However, it is hard to collect real-world biological vision signals with high precision. Therefore, we employ a retinal sampling technique to modify the classic CIFAR-10 dataset and simulate imaging from the biological vision signal. The synthetic dataset is referred to as Foveated CIFAR-10. To make a comprehensive comparison, we also conduct experiments on the original CIFAR-10, and a Permuted CIFAR-10 dataset obtained by randomly permuting the image.

Permuted CIFAR-10. To remove the grid topology inherent in digital imaging, we simply permute all the pixels within the image, which effectively discards any information related to the grid structure of the original digital image. We applied such permutation to each image in the CIFAR-10 dataset to generate the Permuted CIFAR-10 dataset. Nevertheless, permuting pixels only removes an image’s topology, leaving its stationarity intact. To obtain the synthetic biological vision that has neither topology nor stationarity, we introduce the Foveated CIFAR-10.

Foveated CIFAR-10. Much like photosensors installed in a camera, retina ganglion cells within the primate biological visual system sample from visual stimuli and project images. However, unlike photosensors that have uniform receptive fields and adhere to a consistent sampling pattern, retinal ganglion cells at different locations of the retina vary in their receptive field size: smaller in the center (fovea) but larger in the peripheral of the retina. This distinctive retina sampling pattern results in foveated imaging [63]. It gives primates the ability to have both a high-resolution vision and a broad overall receptive field while consequently making visual signals sampled by the retina lack stationarity. The evidence is that responses of two ganglion cells separated by the same displacement are highly correlated in the retina but less correlated in the peripheral. To mimic the foveated imaging with CIFAR-10, we adopt the retina sampling mechanism from [8]. Specifically, each retina ganglion cell is simplified and modeled using a Gaussian kernel. The response of each cell is determined by the dot product between pixel values and the Gaussian kernel. Figure 3 illustrates the sampling kernel locations. Applying this sampling grid and permuting the resulting pixels produces the foveated CIFAR-10. In the natural retina, retinal ganglion cell density decreases linearly with eccentricity, which makes fovea much denser than the peripheral, compared to the simulated lattice in Figure 3. However, considering the low resolution of the CIFAR-10 dataset, we reduce the simulated fovea’s density to prevent redundant sampling.

Experiments. We compare URLOST on both of the synthetic vision datasets as well as the original CIFAR-10 with popular unsupervised representation learning methods SimCLR [6] and MAE [25]. All the models conducted unsupervised learning followed by linear probing for classification accuracy. The evaluations are reported in Table 1. SimCLR excels on CIFAR-10 but struggles badly with both synthetic datasets due to its inability to handle data without stationarity and topology. MAE gets close to SimCLR on CIFAR-10 with a 4×4444\times 4 patch size. However, the patch size no longer makes sense when data has no topology. So we additionally tested MAE masking pixels instead of image patches. It maintains the same performance on Permuted CIFAR-10 as on CIFAR-10, though poorly, invariant to the removal of topology as it should be. But It still drops greatly to 48.5%percent48.548.5% on the Foveated CIFAR-10 when stationarity is also removed. In contrast, only URLOST is able to maintain consistently strong performances when there is no topology or stationarity, achieving 86.4%percent86.486.4% on Permuted CIFAR-10 and 85.4%percent85.485.4% on Foveated CIFAR-10 when the baselines completely fail.

After accessing URLOST’s performance on synthetic biological vision data, we take a step further to challenge its generalizability with high-dimensional natural datasets. The first task is decoding neural response recording in the primary visual area (V1) of mice.

V1 neural response dataset. The dataset, published by [44], contains responses from over 10,000 V1 neurons captured via two-photon calcium imaging. These neurons responded to 2,800 unique images from ImageNet [12], with each image presented twice to assess the consistency of the neural response. In the decoding task, a prediction is considered accurate if the neural response to a given stimulus in the first presentation closely matches the response to the same stimulus in the second presentation within the representation space. This task presents greater challenges than the synthetic biological vision described in the prior section. For one, the data comes from real-world neural recordings rather than a curated dataset like CIFAR-10. For another, the geometric structure of the V1 area is substantially more intricate than that of the retina. To date, no precise mathematical model of the V1 neural response has been well established. The inherent topology and stationarity of the data still remain difficult to grasp [43; 42]. Nevertheless, evidence of Retinotopy [18; 19] and findings from prior research [41; 7; 58] suggest that the neuron population code in V1 are tiling a low dimensional manifold. This insight led us to treat the population neuron response as high-dimensional data and explore whether URLOST can effectively learn its representation.

Experiments. Following the approach in [44] we apply standardization and normalization to the neural firing rate. The processed signals are high-dimensional vectors, and they can be directly used for the decoding task, which serves as the “raw” signal baseline in Table 2. For representation learning methods, URLOST is evaluated along with MAE and β𝛽\beta-VAE [26]. Note that the baseline methods need to handle high-dimensional vector data without stationarity or topology, so SimCLR is no longer applicable. We use β𝛽\beta-VAE instead. We first train the neural network with an unsupervised learning task, then use the latent state of the network as the representation for the neural responses in the decoding task. The results are presented in the table 2. Our method surpasses the original neuron response and other methods, achieving the best performance.

In this subsection, we further evaluate URLOST on high-dimensional natural science data from a completely different domain, the gene expression data.

Gene expression dataset. The dataset comes from The Cancer Genome Atlas (TCGA) [62; 66], which is a project that catalogs the genetic mutations responsible for cancer using genome sequencing and bioinformatics. The project molecularly characterized over 20,000 primary cancers and matched normal samples spanning 33 cancer types. We focus on the pan-cancer classification task: diagnose and classify the type of cancer for a given patient based on his gene expression profile. The TCGA project collects the data of 11,000 patients and uses Micro-RNA (miRNA) as their gene expression profiles. Like the V1 response, no explicit topology and stationarity are known and each data point is a high-dimensional vector. Specifically, 1773 miRNA identifiers are used so that each data point is a 1773-dimensional vector. Types of cancer that each patient is diagnosed with serve as the classification labels.

Experiments. Similar to Section 3.2, URLOST is compared with the original signals, MAE, and β𝛽\beta-VAE, which is the state-of-the-art unsupervised learning method on TCGA cancer classification [71; 72]. We also randomly partition the dataset do five-fold cross-validation and report the average performance in Table 2. Again, our method learns meaningful representation from the original signal. The learned representation benefited the classification task and achieved the best performance, demonstrating URLOST’s ability to learn meaningful representation of data from diverse domains.

Conventional SSL models take a sequential input x=[x(1),⋯x(M)]𝑥superscript𝑥1⋯superscript𝑥𝑀x=[x^{(1)},\cdots x^{(M)}] and embed them into latent vectors with a linear transformation:

which is further processed by a neural network. The sequential inputs can be a list of language tokens [14; 50], pixel values [5], image patches [15], or overlapped image patches [6; 24; 70]. E𝐸E can be considered as a projection layer that is shared among all elements in the input sequence. The self-organizing layer g(⋅,w(i))𝑔⋅superscript𝑤𝑖g(\cdot,w^{(i)}) introduced in Section 2.3 can be considered as a non-shared projection layer. We conducted an ablation study comparing the two designs to demonstrate the effectiveness of the self-organizing layers both quantitatively and qualitatively. To facilitate the ablation, we further synthesized another dataset.

Locally-permuted CIFAR-10. To directly evaluate the performance of the non-shared projection approach, we designed an experiment involving intentionally misaligned clusters. In this experiment, we divide each image into patches and locally permute all the patches. The i𝑖i-th image patch is denoted by x(i)superscript𝑥𝑖x^{(i)}, and its permuted version, permutated by the permutation matrix E(i)superscript𝐸𝑖E^{(i)}, is expressed as E(i)x(i)superscript𝐸𝑖superscript𝑥𝑖E^{(i)}x^{(i)}. We refer to this manipulated dataset as the Locally-Permuted CIFAR-10. Our hypothesis posits that models using shared projections, as defined in Equation 3, will struggle to adapt to random permutations, whereas self-organizing layers equipped with non-shared projections can autonomously adapt to each patch’s permutation, resulting in robust performance. This hypothesis is evaluated quantitatively and through the visualization of learned weights w(i)superscript𝑤𝑖w^{(i)}.

Permuted CIFAR-10. Meanwhile, we also run the ablation study on the Permuted CIFAR-10. Unlike locally permuted CIFAR-10, a visualization check is not viable since the permutation is done globally. However, we can still quantitatively measure the performance of the task.

Quantitative results. Table 3 confirms our hypothesis, demonstrating a significant performance decline in models employing shared projections when exposed to permuted data. In contrast, the non-shared projection model maintains stable performance.

Visual evidence. Using linear layers to parameterize the self-organizing layers, i.e. let g(x,W(i))=W(i)x𝑔𝑥superscript𝑊𝑖superscript𝑊𝑖𝑥g(x,W^{(i)})=W^{(i)}x, we expect that if the projection layer effectively aligns the input sequence, E(i)TW(i)superscript𝐸𝑖𝑇superscript𝑊𝑖E^{(i)T}W^{(i)} should exhibit visual similarities. That is, after applying the inverse permutation E(i)Tsuperscript𝐸𝑖𝑇E^{(i)T}, the learned projection matrix W(i)superscript𝑊𝑖W^{(i)} at each location should appear consistent or similar. The proof of this statement is provided in Appendix A.4. The model trained on Locally-Permuted CIFAR10 provides visual evidence supporting this claim. In Figure 4, the weights show similar patterns after reversing the permutations.

As explained in Section 2.2, the shape and size of each cluster depend on how the density function is defined. Let q(i)𝑞𝑖q(i) represent the eccentricity, the distance from i𝑖ith kernel to the center of the sampling lattice, and let n(i)=∑jAji𝑛𝑖subscript𝑗subscript𝐴𝑗𝑖n(i)=\sum_{j}A_{ji} where A𝐴A is the affinity matrix, then the density is defined as:

By setting α𝛼\alpha and β𝛽\beta nonzero, the density function is eccentricity-dependent. Setting both α𝛼\alpha and β𝛽\beta to zero will make n(i)𝑛𝑖n(i) constant which recovers the uniform density spectral clustering. We vary the parameters α𝛼\alpha and β𝛽\beta to generate different sets of clusters for the foveated CIFAR-10 dataset and run URLOST using each of these sets of clusters. Results in Table 4 validate that the model performs better with density adjusted clustering. The intuitive explanation is that by adjusting the values of α𝛼\alpha and β𝛽\beta, we can make each cluster carry similar amounts of information (refer to Appendix A.3.). A balanced distribution of information across clusters enhances the model’s ability to learn meaningful representations. Without this balance, masking a low-information cluster makes the prediction task trivial, while masking a high-information cluster will make the prediction task too difficult. In either scenario, the model’s ability to learn effective representations is compromised.

Several interconnected pursuits are linked to this work, and we will briefly address them here:

Topology in biological visual signal. 2-D topology of natural images is strong prior that requires many bits to encode [13; 2]. Such 2-D topology is encoded in the natural image statistic [55; 28], which can be recovered [53]. Optic and neural circuits in the retina result in a more irregular 2-D topology than the natural image, which can still be simulated [52; 46; 47; 45; 61; 30]. This information is further processed by the primary visual cortex. Evidence of retinotopy suggests the low-dimensional geometry of visual input from retina is encoded by the neuron in primary visual cortex [40; 20; 27; 19; 65; 48]. These evidences suggest we can recover the topology using signal from retinal ganglion cell and V1 neurons.

Evidence of self-organizing mechanism in the brain. In computational neuroscience, many works use the self-organizing maps (SOM) as a computational model for V1 functional organization: [16; 60; 1; 17; 39; 31]. In other words, this idea of self-organizing is likely a principle governing how the brain performs computations. Even though V1 functional organizations are present at birth, numerous studies also indicate that the brain’s self-organizing mechanisms continue after full development [22; 54; 29].

Learning with signal on non-euclidean geometry. In recent years, researchers from the machine learning community have made efforts to consider geometries and special structures beyond classic images, text, and feature vectors. [33] treats an image as a set of points but depends on the 2D coordinates. The geometric deep learning community tries to generalize convolution neural networks beyond the Euclidean domain [3; 37; 11; 21]. Recent research also explores adapting the Transformer to domains beyond Euclidean spaces [10; 9]. However, none of them has tried to tackle the issue when the data has no explicit topology or stationarity, which is the focus of URLOST.

Self-supervised learning. Self-supervised learning (SSL) has made substantial progress in recent years. Different SSL method is designed for each modality, for example: predicting the masked/next token in NLP[14; 50; 4], solving pre-text tasks, predicting masked patches, or building contrastive image pairs in computer vision [36; 25; 68; 6; 24; 70]. These SSL methods have demonstrated descent scalability with a vast amount of unlabeled data and have shown their power by achieving performance on par with or even surpassing supervised methods. They have also exhibited huge potential in cross-modal learning, such as the CLIP by [51]. However, we argue that these SSL methods are all built upon specific modalities with explicit topology and stationarity which URLOST goes beyond.

The success of most current state-of-the-art self-supervised representation learning methods relies on the assumption that the data has known stationarity and domain topology, such as the grid-like RGB images and time sequences. However, biological vision systems have evolved to deal with signals with less regularity. In this work, we explore unsupervised representation learning under a more general assumption, where the stationarity and topology of the data are unknown to the machine learning model and its designers. We argue that this is a general and realistic assumption for high-dimensional data in modalities of natural science. We propose a novel unsupervised representation learning method that works under this assumption and demonstrates our method’s effectiveness and generality on a synthetic biological vision dataset and two datasets from natural science that have diverse modalities. We also perform a step-by-step ablation study to show the effectiveness of the novel components in our model.

During experiments, we found that density adjusted spectral clustering is crucial for the quality of representation learning. How to adjust the density and obtain a balanced clustering for any given data or even learning the clusters end-to-end with the representation via back-propagation is worth future investigation. Moreover, our current self-organizing layer is still simple though it shows effective performance. Extending it to a more sophisticated design and potentially incorporating it with various neural network architectures is also worth future exploration.

In summary, our method offers a handy and general unsupervised learning tool when dealing with high-dimensional data of arbitrary modality with unknown stationarity and topology, particularly common in the field of natural sciences, where many present strong unsupervised learning baselines cannot directly adapt. We hope it can provide inspiration for work in related fields.

Given a high dimensional dataset S∈ℝn×m𝑆superscriptℝ𝑛𝑚S\in\mathbb{R}^{n\times m}, Let Sisubscript𝑆𝑖S_{i} be i𝑖ith column of S𝑆S, which represents the i𝑖ith dimension of the signal. We create probability mass functions P(Si)𝑃subscript𝑆𝑖P(S_{i}) and P(Sj)𝑃subscript𝑆𝑗P(S_{j}) and the joint distribution P(Si,Sj)𝑃subscript𝑆𝑖subscript𝑆𝑗P(S_{i},S_{j}) for Sisubscript𝑆𝑖S_{i} and Sjsubscript𝑆𝑗S_{j} using histogram. Let the number of bins be K𝐾K. Then we measure the mutual information between P(Si)𝑃subscript𝑆𝑖P(S_{i}) and P(Sj)𝑃subscript𝑆𝑗P(S_{j}) as:

Let Aij=I(Xi;Xj)subscript𝐴𝑖𝑗𝐼subscript𝑋𝑖subscript𝑋𝑗A_{ij}=I(X_{i};X_{j}) be the affinity matrix, p(i)𝑝𝑖p(i) be the density function defined in 4. We follow the steps from [38] to perform spectral clustering with a modification to adjust the density:

Define D𝐷D to be the diagonal matrix whose (i𝑖i,i𝑖i)-element is the sum of A𝐴A’s i𝑖i-th row, P𝑃P be the identity matrix where Pii=p(i)subscript𝑃𝑖𝑖𝑝𝑖P_{ii}=p(i). Construct the matrix L=P12D−12AD−12P12.𝐿superscript𝑃12superscript𝐷12𝐴superscript𝐷12superscript𝑃12L=P^{\frac{1}{2}}D^{-\frac{1}{2}}AD^{-\frac{1}{2}}P^{\frac{1}{2}}.

Find x1,x2,⋯,xksubscript𝑥1subscript𝑥2⋯subscript𝑥𝑘x_{1},x_{2},\cdots,x_{k}, the k𝑘k largest eigenvectors of L𝐿L, and form the matrix X=[x1,x2,⋯,xk]∈ℝn×k𝑋subscript𝑥1subscript𝑥2⋯subscript𝑥𝑘superscriptℝ𝑛𝑘X=[x_{1},x_{2},\cdots,x_{k}]\in\mathbb{R}^{n\times k} by stacking the eigenvectors in columns.

Form the matrix Y from X by renormalizing each of X𝑋X’s rows to have unit norms. (i.e. Yij=Xij/(∑iXij2)12subscript𝑌𝑖𝑗subscript𝑋𝑖𝑗superscriptsubscript𝑖superscriptsubscript𝑋𝑖𝑗212Y_{ij}=X_{ij}/(\sum_{i}X_{ij}^{2})^{\frac{1}{2}})

Treating each row of Y𝑌Y as a point in ℝksuperscriptℝ𝑘\mathbb{R}^{k}, cluster them into k𝑘k clusters via K-means or other algorithms.

Some other interpretation of spectral embedding allows one to design a specific clustering algorithm in step 4. For example, [56] interprets the eigenvector problem in 1 as a relaxed continuous version of K-way normalized cuts problem, where they only allow X𝑋X to be binary, i.e. X∈{0,1}N×K𝑋superscript01𝑁𝐾X\in{0,1}^{N\times K}. This is an NP-hard problem. Allowing X𝑋X to take on real value relaxed this problem but created a degeneracy solution. Given a solution X∗superscript𝑋X^{} and Z=D−12X∗𝑍superscript𝐷12superscript𝑋Z=D^{-\frac{1}{2}}X^{}, for any orthonormal matrix R𝑅R, RZ𝑅𝑍RZ is another solution to the optimization problem 1. Thus, [56] designed an algorithm to find the optimal orthonormal matrix R𝑅R that converts X∗superscript𝑋X^{*} to discrete value in {0,1}N×Ksuperscript01𝑁𝐾{0,1}^{N\times K}. From our experiment, [56] is more consistent than K-means and other clustering algorithms, so we stick to using it for our model.

We followed the retina sampling approach described in [8] to achieve foveated imaging. Specifically, each retina ganglion cell is represented using a Gaussian kernel. The kernel is parameterized by its center, denoted as x→isubscript→𝑥𝑖\vec{x}{i}, and its scalar variance, σi′⁣2superscriptsubscript𝜎𝑖′2\sigma{i}^{\prime 2}, i.e. 𝒩(x→i,σi′⁣2𝐈)𝒩subscript→𝑥𝑖superscriptsubscript𝜎𝑖′2𝐈\mathcal{N}(\vec{x}{i},\sigma{i}^{\prime 2}\mathbf{I}), which is illustrated in Figure 5.A. The response of each cell, denoted as G[i]𝐺delimited-[]𝑖G[i], is computed by the dot product between the pixel value and the corresponding discrete Gaussian kernel. This can be formulated as:

where N𝑁N and W𝑊W are dimensions of the image, and I𝐼I represents the image pixels.

For foveated CIFAR-10, since the image is very low resolution, we first upsample it 3 times from 32×32323232\times 32 to 96×96969696\times 96, then use in total of 1038 Gaussian kernels to sample from the upsampled image. The location of each kernel is illustrated in Figure 5.B. The radius of the kernel scales proportionally to the eccentricity. Here, we use the distance from the kernel to the center to represent eccentricity. The relationship between the radius of the kernel and eccentricity is shown in Figure 5.C. As mentioned in the main paper, in the natural retina, retinal ganglion cell density decreases linearly with eccentricity, which makes the fovea much denser than the peripheral, unlike the simulated lattice we created. The size of the kernel should scale linearly with respect to eccentricity as well. However, for the low-resolution CIFAR-10 dataset, we reduce the simulated fovea’s density to prevent redundant sampling. In this case, we pick the exponential scale for the relationship between the size of the kernel and eccentricity so the kernel visually covers the whole visual field. We also implemented a convolution version of the Gaussian sampling kernel to speed up data loading.

We provide further intuition and visualization on why density adjusted spectral clustering allows the model to learn a better representation on the foveated CIFAR-10 dataset.

As shown in Figure 5, the kernel at the center is much smaller in size than the kernel in the peripheral. This makes the kernel at the center more accurate but smaller, which means it summarizes less information. Spectral clustering with constant density will make each cluster have a similar number of elements in them. Since the kernel in the center is smaller, the cluster in the center will be visually smaller, than the cluster in the peripheral. The effect is shown in Figure 6. Moreover, since we’re upsampling an already low-resolution image (CIFAR-10 image), even though the kernel at the center is more accurate, we’re not getting more information. There, to make sure each cluster has similar information, the clusters in the center need to have more elements than the clusters in the peripheral. In order to make the clusters at the center have more elements, we need to weight the clusters in the center more with the density function. Since the sampling kernels at the center have small eccentricity and are more correlated to their neighbor, increasing α𝛼\alpha and β𝛽\beta will make sampling kernels at the center have higher density, which makes the cluster at the center larger. This is why URLOST with density adjusted spectral clustering performs better than URLOST with constant density spectral clustering, which is shown in Table 4. Meanwhile, setting α𝛼\alpha and β𝛽\beta too large will also hurt the model’s performance because it creates clusters that are too unbalanced.

For locally-permuted CIFAR-10, we divide each image into patches and locally permute all the patches. The i𝑖i-th image patch is denoted by x(i)superscript𝑥𝑖x^{(i)}, and its permuted version, permuted by the permutation matrix E(i)superscript𝐸𝑖E^{(i)}, is expressed as E(i)x(i)superscript𝐸𝑖superscript𝑥𝑖E^{(i)}x^{(i)}. We use linear layers to parameterize the self-organizing layers. Let g(x,W(i))=W(i)x𝑔𝑥superscript𝑊𝑖superscript𝑊𝑖𝑥g(x,W^{(i)})=W^{(i)}x denotes the i𝑖ith element of the self-organizing layer. We’re providing the proof for the statement related to the visual evidence shown in Section 4.1

Statement: If the self-organizing layer effectively aligns the input sequence, then E(i)Tw(i)superscript𝐸𝑖𝑇superscript𝑤𝑖E^{(i)T}w^{(i)} should exhibit visual similarities.

Proof: we first need to formally define what it means for the self-organizing layer to effectively align the input sequence. Let 𝐞ksubscript𝐞𝑘{\bf e}{k} denote the k𝑘kth natural basis (one-hot vector at position k𝑘k), which represents the pixel basis at location k𝑘k. Permutation matrix E(i)superscript𝐸𝑖E^{(i)} will send k𝑘kth pixel to some location accordingly. Mathematically, if the projection layer effectively aligns the input sequence, it means g(E(j)ek,W(j))=g(E(i)ek,W(i))𝑔superscript𝐸𝑗subscript𝑒𝑘superscript𝑊𝑗𝑔superscript𝐸𝑖subscript𝑒𝑘superscript𝑊𝑖g(E^{(j)}e{k},W^{(j)})=g(E^{(i)}e_{k},W^{(i)}) for all i,j,k𝑖𝑗𝑘i,j,k. We can further expand this property to get the following two equations:

for all i,j,k𝑖𝑗𝑘i,j,k. Since the above equation holds for all eksubscript𝑒𝑘e_{k}, by linearity and the property of permutation matrix, we have:

Table: S3.T1: Evaluation on computer vision and synthetic biological vision dataset. ViT (Patch) stands for the Vision Transformer backbone with image patches as inputs. ViT (Pixel) means pixels are treated as input units. ViT (Clusters) means clusters are treated as inputs instead of patches. The number of clusters is set to 64 for both Permuted CIFAR-10 and Foveated CIFAR-10 dataset.

Dataset	Method	Backbone	Eval Acc
CIFAR-10	MAE	ViT (Patch)	88.3 %
	MAE	ViT (Pixel)	56.7 %
	SimCLR	ResNet-18	90.7 %
Permuted CIFAR-10	URLOST MAE	ViT (Cluster)	86.4 %
(no topology)	MAE	ViT (Pixel)	56.7 %
	SimCLR	ResNet-18	47.9 %
Foveated CIFAR-10	URLOST MAE	ViT (Cluster)	85.4 %
(no topology or stationarity)	MAE	ViT (Pixel)	48.5 %
	SimCLR	ResNet-18	38.0 %

Table: S3.T2: Evaluation on V1 response decoding and TCGA pan-cancer classification tasks. “Raw” indicates preprocessed (standardized and normalized) raw signals. Best β𝛽\beta values are used for β𝛽\beta-VAE. For URLOST MAE, cluster sizes are 200 (V1) and 32 (TCGA).

Method	V1 Response Decoding Acc	TCGA Classification Acc
Raw	73.9%	91.1%
β𝛽\beta-VAE	75.1%	94.2%
MAE	64.8%	88.3%
URLOST MAE	78.2%	94.9%

Table: S4.T3: Ablation study on self-organizing layer. Linear probing accuracy with varying parameters, keeping others constant. For Locally-Permutated CIFAR-10, we use 4×4444\times 4 patch size. For Permutated CIFAR-10 and Foveated CIFAR-10, we set the number of clusters to 64 for the spectral clustering algorithm. We kept the hyperparameter of the backbone model the same as in table 1.

Dataset	Projection	Eval Acc
Locally-Permuted	shared	81.4 %
CIFAR-10	non-shared	87.6 %
Permuted	shared	80.7 %
CIFAR-10	non-shared	86.4 %

Table: S4.T3.st1: (a) Replacing the non-shared projections of the self-organizing layer with the shared projection layer entails a significant drop in performance.

Dataset	Projection	Eval Acc
Locally-Permuted	shared	81.4 %
CIFAR-10	non-shared	87.6 %
Permuted	shared	80.7 %
CIFAR-10	non-shared	86.4 %

Table: S4.T3.st2: (b) “SC” denotes spectral clustering with uniform density clustering and “DSC” denotes density adjusted spectral clustering. For Foveated CIFAR-10 using density adjusted spectral clustering to create clusters will make the model perform better than using standard spectral clustering with uniform density.

Dataset	Cluster	Eval Acc
Foveated	SC	82.7 %
CIFAR-10	DSC	85.4 %

Table: A1.T4: Evaluation on foveated CIFAR-10 with varying hyperparameter for density function. For each set of values of α𝛼\alpha and β𝛽\beta, we perform density adjusted spectral clustering and run URLOST with the corresponding cluster. The evaluation of each trial is provided in the table.

	beta = 0	beta = 2
alpha = 0	82.74 %	84.24 %
alpha = 0.5	84.52 %	85.43 %
alpha = 1.0	83.83 %	81.62 %

Refer to caption From left to right: the unsupervised representation learning through joint embedding and masked auto-encoding; the biological vision system that perceives via unstructured sensor and understands signal without stationarity or topology [48]; and many more such diverse high dimensional signal in natural science that our method supports while most existing unsupervised methods don’t. Data figures are borrowed from [48; 44; 69].

Refer to caption The overview framework of URLOST. The high-dimensional input signal undergoes clustering and self-organization before unsupervised learning using a masked autoencoder for signal reconstruction.

Refer to caption Retina sampling (A) An image in CIFAR-10 dataset. (B) Retina sampling lattice. Each blue dot represents the center of a Gaussian kernel, which mimics a retinal ganglion cell. (C) Visualization of the car image’s signal sampled using the retina lattice. Each kernel’s sampled RGB value is displayed at its respective lattice location for visualization purposes. (D) density-adjusted spectral clustering results are shown. Each unique color represents a cluster, with each kernel colored according to its assigned cluster.

Refer to caption Learnt weights of a self-organizing layer. (A) Image is cropped into patches, where each patch x(i)superscript𝑥𝑖x^{(i)} first undergoes a different permutation E(i)superscript𝐸𝑖E^{(i)}, then the inverse permutation E(i)Tsuperscript𝐸𝑖𝑇E^{(i)T}. (B) The learned weight of the linear self-organizing layer. The 121212th column of W(i)superscript𝑊𝑖W^{(i)} at all positions i𝑖i are reshaped into patches and visualized. When W(i)superscript𝑊𝑖W^{(i)} undergoes the inverse permutation E(i)Tsuperscript𝐸𝑖𝑇E^{(i)T}, they show similar patterns. (C) Visualization of the 373737th column of W(i)superscript𝑊𝑖W^{(i)}. Similar to (B).

Refer to caption Foveated retinal sampling (A) Illustration of a Guassian kernel shown in [8]. Diagram of single kernel filter parameterized by a mean μ′superscript𝜇′\mu^{\prime} and variance σ′superscript𝜎′\sigma^{\prime}. (B) the location of each Gaussian kernel is summarized as a point with 2D coordinate μ′superscript𝜇′\mu^{\prime}. In total, the locations of 1038 Gaussian kernels are plotted. (C) The relationship between eccentricity (distance of the kernel to the center) and radius of the kernel is shown.

Refer to caption Effect of density adjusted clustering. Eccentricity-based sampling lattice. The center of the sampling lattice has more pixels which means higher resolution compared to the peripheral. (A) Result of density adjusted spectral clustering (α=0.5,β=2formulae-sequence𝛼0.5𝛽2\alpha=0.5,\beta=2). Clusters in the center have more elements than clusters in the peripheral. But clusters look more visually similar in size than B. (B) Result of uniform density spectral clustering (α=0,β=0formulae-sequence𝛼0𝛽0\alpha=0,\beta=0). Each cluster has a similar number of elements in them but the clusters in the center are much smaller than the clusters in the periphery.

Refer to caption Visualize the weight of the self-organizing layer after applying inverse permutation. A snapshot of E(i)TW(i)superscript𝐸𝑖𝑇superscript𝑊𝑖E^{(i)T}W^{(i)} is shown at different training epoch. The number of epochs is shown on the top row. Each figure shows one column of the weight of the self-organizing layer, at different positions, i.e. W:,k(1)superscriptsubscript𝑊:𝑘1W_{:,k}^{(1)}, where k𝑘k is the column number and i𝑖i is the position index. In total, 999 columns are shown.

$$ \label{eigen-no-density}\min_{||f||{L^2(\mathcal{M})}} \int{\mathcal{M}} ||\nabla f||^2 d\lambda $$ \tag{eigen-no-density}

$$ z_0 = [g(x^{(1)},w^{(1)}), \cdots g(x^{(M)},w^{(M)})] $$

$$ \label{density-formula} p(i) = q(i)^{\alpha} n(i)^{-\beta} $$ \tag{density-formula}

$$ I(S_{i};S_{j})=\sum_{l=1}^{K}\sum_{k=1}^{K}P(S_{i},S_{j})[l,k]\log_{2}\left(\frac{P(S_{i},S_{j})[l,k]}{P(S_{i})[l]P(S_{j})[k])}\right) $$ \tag{A1.Ex1}

$$ G[i]=\sum_{n}^{N}\sum_{m}^{W}K(\vec{x}{i},\sigma{i}^{\prime})[n,m]I[n,m] $$ \tag{A1.Ex2}

$$ \displaystyle g(E^{(i)}e_{k},W^{(i)}) $$

$$ \label{density-adjusted-clustering}\min_{YY^T =I} tr(Y P^{1/2} L P^{1/2} Y^T) $$ \tag{density-adjusted-clustering}

$$ g(E^{(i)} e_k,W^{(i)}) &= W^{(i)} E^{(i)} e_k \ g(E^{(j)} e_k,W^{(j)}) &= W^{(j)} E^{(j)} e_k $$

$$ W^{(i)} E^{(i)} &= W^{(j)} E^{(j)} \ E^{(i)T} W^{(i)} &=E^{(j)T} W^{(j)} $$

$$ \mathbf{W}_v(s+1) = \mathbf{W}_v(s) + \theta(u, v, s) \cdot \alpha(s) \cdot \left( \mathbf{D}(t) - \mathbf{W}_v(s) \right) $$

Dataset	Method	Backbone	Eval Acc
CIFAR-10	MAE MAE SimCLR	ViT (Patch) ViT (Pixel) ResNet-18	88.3% 56.7% 90.7%
Permuted CIFAR-10	SimCLR URLOST MAE MAE SimCLR SimCLR	ViG ViT (Cluster) ViT (Pixel) ResNet-18 ViG	53.8% 86.4% 56.7% 47.9% 40.0%
Foveated CIFAR-10	URLOST MAE MAE SimCLR SimCLR	ViT (Cluster) ViT (Pixel) ResNet-18 ViG	85.4% 48.5% 38.0% 42.8%

Method	V1 Response Decoding Acc	TCGA Classification Acc
Raw	73.9% ± 0.00%	91.7 ± 0.24%
MAE	70.6% ± 0.22%	90.6% ± 0.63%
β -VAE	75.64% ± 0.11%	94.15% ± 0.24%
URLOST MAE	78.75% ± 0.18%	94.90% ± 0.25%

Masking Unit	Permuted Cifar10	Foveated Cifar10	TCGA Gene	V1 Response
Clusters (URLOST)	86.4%	85.4%	94.9%	78.8%
Random patch	55.7%	51.1%	91.7%	73.9%
Individual dimension	56.7%	48.5%	88.3%	64.8%

Dataset	Projection	Eval Acc	Dataset	Cluster	Eval Acc
Locally-Permuted CIFAR-10	shared non-shared	81.4% 87.6 %	Foveated CIFAR-10	SC	82.7%
Permuted CIFAR-10	shared non-shared	80.7% 86.4 %	Foveated CIFAR-10	DSC	85.4%

	beta = 0	beta = 2
alpha = 0	82.74%	84.24%
alpha = 0.5	84.52%	85.43%
alpha = 1.0	83.83%	81.62%

Dataset	Permuted Cifar10	TCGA	V1
Ratio	6.7%	15.6%	5.9%

Module	Cluster Indexing	Self Organizing Layer	Encoder	Decoder
Runtime	0.27 ms	1.11 ms	17.5 ms	20.4 ms

References

[Bengio+chapter2007] Bengio, Yoshua, LeCun, Yann. (2007). Scaling Learning Algorithms Towards {AI. Large Scale Kernel Machines.

[Hinton06] Hinton, Geoffrey E., Osindero, Simon, Teh, Yee Whye. (2006). A Fast Learning Algorithm for Deep Belief Nets. Neural Computation.

[goodfellow2016deep] Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron, Bengio, Yoshua. (2016). Deep learning.

[deng2009imagenet] Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, Fei-Fei, Li. (2009). Imagenet: A large-scale hierarchical image database. 2009 IEEE conference on computer vision and pattern recognition.

[cho2022spherical] Cho, Sungmin, Jung, Raehyuk, Kwon, Junseok. (2022). Spherical transformer. arXiv preprint arXiv:2202.04942.

[Stringer2019-gl] Stringer, Carsen, Pachitariu, Marius, Steinmetz, Nicholas, Carandini, Matteo, Harris, Kenneth D. High-dimensional geometry of population responses in visual cortex. Nature.

[Stringer20212767] Carsen Stringer, Michalis Michaelos, Dmitri Tsyboulski, Sarah E. Lindo, Marius Pachitariu. (2021). High-precision coding in visual cortex. Cell. doi:https://doi.org/10.1016/j.cell.2021.03.042.

[ZhangOmi2021] Zhang, Xiaoyu, Xing, Yuting, Sun, Kai, Guo, Yike. (2021). OmiEmbed: A Unified Multi-Task Deep Learning Framework for Multi-Omics Data. Cancers. doi:10.3390/cancers13123047.

[ZhangOmi2019] Roux, Nicolas, Bengio, Yoshua, Lamblin, Pascal, Joliveau, Marc, K'{e. (2007). Learning the 2-D Topology of Images. Advances in Neural Information Processing Systems. doi:10.1109/BIBM47256.2019.8983228.

[ng2001spectral] Ng, Andrew, Jordan, Michael, Weiss, Yair. (2001). On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems.

[shi2000normalized] Shi, Jianbo, Malik, Jitendra. (2000). Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence.

[stella2003multiclass] Stella, X Yu, Shi, Jianbo. (2003). Multiclass spectral clustering. Computer Vision, IEEE International Conference on.

[COIFMAN20065] He, Kaiming, Chen, Xinlei, Xie, Saining, Li, Yanghao, Doll{'a. (2022). Masked autoencoders are scalable vision learners. Statistics and Computing. doi:10.1007/s11222-007-9033-z.

[Wu_2018_Contrast] Wu, Zhirong, Xiong, Yuanjun, Yu, Stella X., Lin, Dahua. (2018). Unsupervised Feature Learning via Non-Parametric Instance Discrimination. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[misra2020self] Misra, Ishan, Maaten, Laurens van der. (2020). Self-supervised learning of pretext-invariant representations. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[zbontar2021barlow] Zbontar, Jure, Jing, Li, Misra, Ishan, LeCun, Yann, Deny, St{'e. (2021). Barlow twins: Self-supervised learning via redundancy reduction. International Conference on Machine Learning.

[chen2020simple] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. International conference on machine learning.

[he2020momentum] He, Kaiming, Fan, Haoqi, Wu, Yuxin, Xie, Saining, Girshick, Ross. (2020). Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[radford2021clip] Radford, Alec, Kim, Jong Wook, Hallacy, Chris, Ramesh, Aditya, Goh, Gabriel, Agarwal, Sandhini, Sastry, Girish, Askell, Amanda, Mishkin, Pamela, Clark, Jack, others. (2021). Learning transferable visual models from natural language supervision. International conference on machine learning.

[vaswani2017attention] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, {\L. (2017). Attention is all you need. Advances in neural information processing systems.

[dosovitskiy2020vit] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations.

[radford2018gpt] Radford, Alec, Narasimhan, Karthik, Salimans, Tim, Sutskever, Ilya, others. (2018). Improving language understanding by generative pre-training.

[devlin2018bert] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Chapter of the Association for Computational Linguistics.

[brown2020gpt3] Brown, Tom, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared D, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, others. (2020). Language models are few-shot learners. Advances in neural information processing systems.

[Cheung2016retina] Brian Cheung, Eric Weiss, Bruno A. Olshausen. (2016). Emergence of foveal image sampling from learning to attend in visual scenes. The Fifth International Conference on Learning Representations.

[kohonen2012self] Kohonen, Teuvo. (2012). Self-organizing maps.

[stringer2018recordings] Stringer, Carsen, Pachitariu, Marius, Carandini, Matteo, Harris, Kenneth. (2018). Recordings of 10,000 neurons in visual cortex in response to 2,800 natural images. Figshare Repos.

[pachitariu2016suite2p] Pachitariu, Marius, Stringer, Carsen, Schr{. (2016). Suite2p: beyond 10,000 neurons with standard two-photon microscopy. BioRxiv.

[gilbert2012adult] Gilbert, Charles D, Li, Wu. (2012). Adult visual cortical plasticity. Neuron.

[sammons2015adult] Sammons, Rosanna P, Keck, Tara. (2015). Adult plasticity and cortical reorganization after peripheral lesions. Current Opinion in Neurobiology.

[jamal2020rapid] Jamal, Yaseen A, Dilks, Daniel D. (2020). Rapid topographic reorganization in adult human primary visual cortex (V1) during noninvasive and reversible deprivation. Proceedings of the National Academy of Sciences.

[DurbinSOM1990] Nicholas V. Swindale, Hans-Ulrich Bauer. (1998). Application of Kohonen's Self-Organizing Feature Map Algorithm to Cortical Maps of Orientation and Direction Preference. Proceedings: Biological Sciences. doi:10.1038/343644a0.

[kohonen1982self] Kohonen, Teuvo. (1982). Self-organized formation of topologically correct feature maps. Biological cybernetics.

[barrow1996self] Barrow, Harry G, Bray, Alistair J, Budd, Julian ML. (1996). A self-organizing model of “color blob” formation. Neural Computation.

[obermayer1990principle] Obermayer, Klaus, Ritter, Helge, Schulten, Klaus. (1990). A principle for the formation of the spatial structure of cortical feature maps.. Proceedings of the National Academy of Sciences.

[durbin1990dimension] Durbin, Richard, Mitchison, Graeme. (1990). A dimension reduction framework for understanding cortical maps. Nature.

[cancer2013cancer] Weinstein, John N, Collisson, Eric A, Mills, Gordon B, Shaw, Kenna R, Ozenberger, Brad A, Ellrott, Kyle, Shmulevich, Ilya, Sander, Chris, Stuart, Joshua M. (2013). The cancer genome atlas pan-cancer analysis project. Nature genetics.

[tomczak2015review] Tomczak, Katarzyna, Czerwi{'n. (2015). Review The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemporary Oncology/Wsp{'o.

[olshausen2006other] Olshausen, Bruno A, Field, David J. (2006). What is the other 85 percent of V1 doing. L. van Hemmen, & T. Sejnowski (Eds.).

[olshausen2005close] Olshausen, Bruno A, Field, David J. (2005). How close are we to understanding V1?. Neural computation.

[olshausen1996emergence] Olshausen, Bruno A, Field, David J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature.

[chen2018sparse] Chen, Yubei, Paiton, Dylan, Olshausen, Bruno. (2018). The sparse manifold transform. Advances in neural information processing systems.

[Haghighi2022] Burgess, Christopher P, Higgins, Irina, Pal, Arka, Matthey, Loic, Watters, Nick, Desjardins, Guillaume, Lerchner, Alexander. (2018). Understanding disentangling in $\beta$-VAE. Journal of Cheminformatics. doi:10.1186/s13321-020-0416-x.

[hinton2006reducing] Hinton, Geoffrey E, Salakhutdinov, Ruslan R. (2006). Reducing the dimensionality of data with neural networks. science.

[bronstein2017geometric] Bronstein, Michael M, Bruna, Joan, LeCun, Yann, Szlam, Arthur, Vandergheynst, Pierre. (2017). Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine.

[fey2018splinecnn] Fey, Matthias, Lenssen, Jan Eric, Weichert, Frank, M{. (2018). Splinecnn: Fast geometric deep learning with continuous b-spline kernels. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[defferrard2016convolutional] Defferrard, Micha{. (2016). Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems.

[monti1611geometric] Monti, Federico, Boscaini, Davide, Masci, Jonathan, Rodola, Emanuele, Svoboda, Jan, Bronstein, Michael M. (2016). Geometric deep learning on graphs and manifolds using mixture model cnns (2016). Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[dahan2022surface] Dahan, Simon, Williams, Logan ZJ, Fawaz, Abdulah, Rueckert, Daniel, Robinson, Emma C. (2022). Surface analysis with vision transformers. arXiv preprint arXiv:2205.15836.

[chen2020generative] Chen, Mark, Radford, Alec, Child, Rewon, Wu, Jeffrey, Jun, Heewoo, Luan, David, Sutskever, Ilya. (2020). Generative pretraining from pixels. International conference on machine learning.

[chen2022bag] Chen, Yubei, Bardes, Adrien, Li, Zengyi, LeCun, Yann. (2022). Bag of Image Patch Embedding Behind the Success of Self-Supervised Learning. arXiv preprint arXiv:2206.08954.

[van1992information] Van Essen, David C, Anderson, Charles H, Felleman, Daniel J. (1992). Information processing in the primate visual system: an integrated systems perspective. Science.

[polimeni2010laminar] Polimeni, Jonathan R, Fischl, Bruce, Greve, Douglas N, Wald, Lawrence L. (2010). Laminar analysis of 7 T BOLD using an imposed spatial activation pattern in human V1. Neuroimage.

[Denker1992universal] Bengio, Yoshua, LeCun, Yann, others. (2007). Scaling learning algorithms towards AI. Large-scale kernel machines. doi:10.1109/PHYCMP.1992.615508.

[mehrizi2023multi] Mehrizi, Rahil, Mehrjou, Arash, Alegro, Maryana, Zhao, Yi, Carbone, Benedetta, Fishwick, Carl, Vappiani, Johanna, Bi, Jing, Sanford, Siobhan, Keles, Hakan, others. (2023). Multi-omics Prediction from High-content Cellular Imaging with Deep Learning. arXiv preprint arXiv:2306.09391.

[roorda1999arrangement] Roorda, Austin, Williams, David R. (1999). The arrangement of the three cone classes in the living human eye. Nature.

[jonnalagadda2021foveater] Jonnalagadda, Aditya, Wang, William Yang, Manjunath, BS, Eckstein, Miguel P. (2021). Foveater: Foveated transformer for image classification. arXiv preprint arXiv:2105.14173.

[thibos1998acuity] Thibos, Larry N. (1998). Acuity perimetry and the sampling theory of visual resolution.. Optometry and vision science: official publication of the American Academy of Optometry.

[peli1991image] Peli, Eli, Yang, Jian, Goldstein, Robert B. (1991). Image invariance with changes in size: The role of peripheral contrast thresholds. JOSA A.

[pointer1989contrast] Pointer, JS, Hess, RF. (1989). The contrast sensitivity gradient across the human visual field: With emphasis on the low spatial frequency range. Vision research.

[perry2002gaze] Perry, Jeffrey S, Geisler, Wilson S. (2002). Gaze-contingent real-time simulation of arbitrary visual fields. Human vision and electronic imaging VII.

[simoncelli2001natural] Simoncelli, Eero P, Olshausen, Bruno A. (2001). Natural image statistics and neural representation. Annual review of neuroscience.

[hyvarinen2009natural] Hyv{. (2009). Natural image statistics: A probabilistic approach to early computational vision..

[ogmen2010geometry] Ogmen, Haluk, Herzog, Michael H. (2010). The geometry of visual perception: Retinotopic and nonretinotopic representations in the human visual system. Proceedings of the IEEE.

[feller2020retinal] Feller, MB, Kerschensteiner, D. (2020). Retinal waves and their role in visual system development. Synapse Development and Maturation.

[felleman1991distributed] Felleman, Daniel J, Van Essen, David C. (1991). Distributed hierarchical processing in the primate cerebral cortex.. Cerebral cortex (New York, NY: 1991).

[hubel1962receptive] Hubel, David H, Wiesel, Torsten N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of physiology.

[engel1997retinotopic] Engel, Stephen A, Glover, Gary H, Wandell, Brian A. (1997). Retinotopic organization in human visual cortex and the spatial precision of functional MRI.. Cerebral cortex (New York, NY: 1991).

[wandell2007visual] Wandell, Brian A, Dumoulin, Serge O, Brewer, Alyssa A. (2007). Visual field maps in human cortex. Neuron.

[meyer1983tetrodotoxin] Meyer, Ronald L. (1983). Tetrodotoxin inhibits the formation of refined retinotopography in goldfish. Developmental Brain Research.

[wong1999retinal] Wong, Rachel OL. (1999). Retinal waves and visual system development. Annual review of neuroscience.

[ellis2021retinotopic] Ellis, Cameron T, Yates, Tristan S, Skalaban, Lena J, Bejjanki, Vikranth R, Arcaro, Michael J, Turk-Browne, Nicholas B. (2021). Retinotopic organization of visual cortex in human infants. Neuron.

[xu2021mining] Xu, Youzheng, Xu, Yixin, Wang, Chun, Xia, Baoguo, Mu, Qingling, Luan, Shaohong, Fan, Jun. (2021). Mining TCGA database for gene expression in ovarian serous cystadenocarcinoma microenvironment. PeerJ.

[krizhevsky2014cifar] Krizhevsky, Alex, Nair, Vinod, Hinton, Geoffrey. (2014). The CIFAR-10 dataset. online: http://www. cs. toronto. edu/kriz/cifar. html.

[higgins2016beta] Higgins, Irina, Matthey, Loic, Pal, Arka, Burgess, Christopher, Glorot, Xavier, Botvinick, Matthew, Mohamed, Shakir, Lerchner, Alexander. (2016). beta-vae: Learning basic visual concepts with a constrained variational framework. International conference on learning representations.

[ma2023image] Xu Ma, Yuqian Zhou, Huan Wang, Can Qin, Bin Sun, Chang Liu, Yun Fu. (2023). Image as Set of Points. The Eleventh International Conference on Learning Representations.

[han2022visiongnn] Han, Kai, Wang, Yunhe, Guo, Jianyuan, Tang, Yehui, Wu, Enhua. (2022). Vision gnn: An image is worth graph of nodes. Advances in neural information processing systems.

[shi2022revisiting] Shi, Han, Gao, Jiahui, Xu, Hang, Liang, Xiaodan, Li, Zhenguo, Kong, Lingpeng, Lee, Stephen, Kwok, James T. (2022). Revisiting over-smoothing in bert from the perspective of graph. arXiv preprint arXiv:2202.08625.

[chen2020measuring] Chen, Deli, Lin, Yankai, Li, Wei, Li, Peng, Zhou, Jie, Sun, Xu. (2020). Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. Proceedings of the AAAI conference on artificial intelligence.

[cai2020note] Cai, Chen, Wang, Yusu. (2020). A note on over-smoothing for graph neural networks. arXiv preprint arXiv:2006.13318.

[bierme2019stationarity] Bierm{'e. (2019). Introduction to random fields and scale invariance. Stochastic Geometry: Modern Research Frontiers.

[beyer2023flexivit] Beyer, Lucas, Izmailov, Pavel, Kolesnikov, Alexander, Caron, Mathilde, Kornblith, Simon, Zhai, Xiaohua, Minderer, Matthias, Tschannen, Michael, Alabdulmohsin, Ibrahim, Pavetic, Filip. (2023). Flexivit: One model for all patch sizes. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[gao2022patchgt] Gao, Han, Han, Xu, Huang, Jiaoyang, Wang, Jian-Xun, Liu, Liping. (2022). Patchgt: Transformer over non-trainable clusters for learning graph representations. Learning on Graphs Conference.

[willeke2022sensorium] Willeke, Konstantin F, Fahey, Paul G, Bashiri, Mohammad, Pede, Laura, Burg, Max F, Blessing, Christoph, Cadena, Santiago A, Ding, Zhiwei, Lurz, Konstantin-Klemens, Ponder, Kayla, others. (2022). The sensorium competition on predicting large-scale mouse primary visual cortex activity. arXiv preprint arXiv:2206.08666.

[de2020large] de Vries, Saskia EJ, Lecoq, Jerome A, Buice, Michael A, Groblewski, Peter A, Ocker, Gabriel K, Oliver, Michael, Feng, David, Cain, Nicholas, Ledochowitsch, Peter, Millman, Daniel, others. (2020). A large-scale standardized physiological survey reveals functional organization of the mouse visual cortex. Nature neuroscience.

[bauer2024movie] Bauer, Joel, Margrie, Troy W, Clopath, Claudia. (2024). Movie reconstruction from mouse visual cortex activity. bioRxiv.

[vanmarcke2010random] Vanmarcke, Erik. (2010). Random fields: analysis and synthesis.

[mumford2010pattern] Mumford, David, Desolneux, Agn{`e. (2010). Pattern theory: the stochastic analysis of real-world signals.

[taylor2018random] Taylor, Michael. (2018). Random fields: stationarity, ergodicity, and spectral behavior. URL http://www. unc. edu/math/Faculty/met/rndfcn. pdf.

[jordan1999introduction] Jordan, Michael I, Ghahramani, Zoubin, Jaakkola, Tommi S, Saul, Lawrence K. (1999). An introduction to variational methods for graphical models. Machine learning.

[ramm1990random] Ramm, Alexander G. (1990). Random fields estimation theory. Mathematical and Computer Modelling.

[bierme2019introduction] Bierm{'e. (2019). Introduction to random fields and scale invariance. Stochastic Geometry: Modern Research Frontiers.

[shuman2013emerging] Shuman, David I, Narang, Sunil K, Frossard, Pascal, Ortega, Antonio, Vandergheynst, Pierre. (2013). The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE signal processing magazine.

[jaegle2021perceiver] Jaegle, Andrew, Gimeno, Felix, Brock, Andy, Vinyals, Oriol, Zisserman, Andrew, Carreira, Joao. (2021). Perceiver: General perception with iterative attention. International conference on machine learning.

[jaegle2021perceiverio] Jaegle, Andrew, Borgeaud, Sebastian, Alayrac, Jean-Baptiste, Doersch, Carl, Ionescu, Catalin, Ding, David, Koppula, Skanda, Zoran, Daniel, Brock, Andrew, Shelhamer, Evan, others. (2021). Perceiver io: A general architecture for structured inputs & outputs. arXiv preprint arXiv:2107.14795.

[xie2024self] Xie, Johnathan, Lee, Yoonho, Chen, Annie S, Finn, Chelsea. (2024). Self-guided masked autoencoders for domain-agnostic self-supervised learning. arXiv preprint arXiv:2402.14789.

[bib1] Barrow et al. [1996] Harry G Barrow, Alistair J Bray, and Julian ML Budd. A self-organizing model of “color blob” formation. Neural Computation, 8(7):1427–1448, 1996.

[bib2] Bengio et al. [2007] Yoshua Bengio, Yann LeCun, et al. Scaling learning algorithms towards ai. Large-scale kernel machines, 34(5):1–41, 2007.

[bib3] Bronstein et al. [2017] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

[bib4] Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

[bib5] Chen et al. [2020a] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International conference on machine learning, pp. 1691–1703. PMLR, 2020a.

[bib6] Chen et al. [2020b] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. PMLR, 2020b.

[bib7] Chen et al. [2018] Yubei Chen, Dylan Paiton, and Bruno Olshausen. The sparse manifold transform. Advances in neural information processing systems, 31, 2018.

[bib8] Cheung et al. [2016] Brian Cheung, Eric Weiss, and Bruno A. Olshausen. Emergence of foveal image sampling from learning to attend in visual scenes. In The Fifth International Conference on Learning Representations, 2016.

[bib9] Cho et al. [2022] Sungmin Cho, Raehyuk Jung, and Junseok Kwon. Spherical transformer. arXiv preprint arXiv:2202.04942, 2022.

[bib10] Dahan et al. [2022] Simon Dahan, Logan ZJ Williams, Abdulah Fawaz, Daniel Rueckert, and Emma C Robinson. Surface analysis with vision transformers. arXiv preprint arXiv:2205.15836, 2022.

[bib11] Defferrard et al. [2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems, 29, 2016.

[bib12] Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Ieee, 2009.

[bib13] J.S. Denker and Y. leCun. Natural versus ”universal” probability, complexity, and entropy. In Workshop on Physics and Computation, pp. 122–127, 1992. doi: 10.1109/PHYCMP.1992.615508.

[bib14] Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, 2019.

[bib15] Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.

[bib16] Richard Durbin and Graeme Mitchison. A dimension reduction framework for understanding cortical maps. Nature, 343(6259):644–647, 1990a. doi: 10.1038/343644a0.

[bib17] Richard Durbin and Graeme Mitchison. A dimension reduction framework for understanding cortical maps. Nature, 343(6259):644–647, 1990b.

[bib18] Ellis et al. [2021] Cameron T Ellis, Tristan S Yates, Lena J Skalaban, Vikranth R Bejjanki, Michael J Arcaro, and Nicholas B Turk-Browne. Retinotopic organization of visual cortex in human infants. Neuron, 109(16):2616–2626, 2021.

[bib19] Engel et al. [1997] Stephen A Engel, Gary H Glover, and Brian A Wandell. Retinotopic organization in human visual cortex and the spatial precision of functional mri. Cerebral cortex (New York, NY: 1991), 7(2):181–192, 1997.

[bib20] Felleman & Van Essen [1991] Daniel J Felleman and David C Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral cortex (New York, NY: 1991), 1(1):1–47, 1991.

[bib21] Fey et al. [2018] Matthias Fey, Jan Eric Lenssen, Frank Weichert, and Heinrich Müller. Splinecnn: Fast geometric deep learning with continuous b-spline kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 869–877, 2018.

[bib22] Charles D Gilbert and Wu Li. Adult visual cortical plasticity. Neuron, 75(2):250–264, 2012.

[bib23] Haghighi et al. [2022] Marzieh Haghighi, Juan C. Caicedo, Beth A. Cimini, Anne E. Carpenter, and Shantanu Singh. High-dimensional gene expression and morphology profiles of cells across 28,000 genetic and chemical perturbations. Nature Methods, 19(12):1550–1557, 2022. doi: 10.1038/s41592-022-01667-0.

[bib24] He et al. [2020] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9729–9738, 2020.

[bib25] He et al. [2022] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009, 2022.

[bib26] Higgins et al. [2016] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International conference on learning representations, 2016.

[bib27] David H Hubel and Torsten N Wiesel. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of physiology, 160(1):106, 1962.

[bib28] Hyvärinen et al. [2009] Aapo Hyvärinen, Jarmo Hurri, and Patrick O Hoyer. Natural image statistics: A probabilistic approach to early computational vision., volume 39. Springer Science & Business Media, 2009.

[bib29] Yaseen A Jamal and Daniel D Dilks. Rapid topographic reorganization in adult human primary visual cortex (v1) during noninvasive and reversible deprivation. Proceedings of the National Academy of Sciences, 117(20):11059–11067, 2020.

[bib30] Jonnalagadda et al. [2021] Aditya Jonnalagadda, William Yang Wang, BS Manjunath, and Miguel P Eckstein. Foveater: Foveated transformer for image classification. arXiv preprint arXiv:2105.14173, 2021.

[bib31] Teuvo Kohonen. Self-organized formation of topologically correct feature maps. Biological cybernetics, 43(1):59–69, 1982.

[bib32] Krizhevsky et al. [2014] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The cifar-10 dataset. online: http://www. cs. toronto. edu/kriz/cifar. html, 55(5), 2014.

[bib33] Ma et al. [2023] Xu Ma, Yuqian Zhou, Huan Wang, Can Qin, Bin Sun, Chang Liu, and Yun Fu. Image as set of points. In The Eleventh International Conference on Learning Representations, 2023.

[bib34] Mehrizi et al. [2023] Rahil Mehrizi, Arash Mehrjou, Maryana Alegro, Yi Zhao, Benedetta Carbone, Carl Fishwick, Johanna Vappiani, Jing Bi, Siobhan Sanford, Hakan Keles, et al. Multi-omics prediction from high-content cellular imaging with deep learning. arXiv preprint arXiv:2306.09391, 2023.

[bib35] Ronald L Meyer. Tetrodotoxin inhibits the formation of refined retinotopography in goldfish. Developmental Brain Research, 6(3):293–298, 1983.

[bib36] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6707–6717, 2020.

[bib37] Monti et al. [2016] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2016.

[bib38] Ng et al. [2001] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 14, 2001.

[bib39] Obermayer et al. [1990] Klaus Obermayer, Helge Ritter, and Klaus Schulten. A principle for the formation of the spatial structure of cortical feature maps. Proceedings of the National Academy of Sciences, 87(21):8345–8349, 1990.

[bib40] Haluk Ogmen and Michael H Herzog. The geometry of visual perception: Retinotopic and nonretinotopic representations in the human visual system. Proceedings of the IEEE, 98(3):479–492, 2010.

[bib41] Bruno A Olshausen and David J Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.

[bib42] Bruno A Olshausen and David J Field. How close are we to understanding v1? Neural computation, 17(8):1665–1699, 2005.

[bib43] Bruno A Olshausen and David J Field. What is the other 85 percent of v1 doing. L. van Hemmen, & T. Sejnowski (Eds.), 23:182–211, 2006.

[bib44] Pachitariu et al. [2016] Marius Pachitariu, Carsen Stringer, Sylvia Schröder, Mario Dipoppa, L Federico Rossi, Matteo Carandini, and Kenneth D Harris. Suite2p: beyond 10,000 neurons with standard two-photon microscopy. BioRxiv, pp. 061507, 2016.

[bib45] Peli et al. [1991] Eli Peli, Jian Yang, and Robert B Goldstein. Image invariance with changes in size: The role of peripheral contrast thresholds. JOSA A, 8(11):1762–1774, 1991.

[bib46] Jeffrey S Perry and Wilson S Geisler. Gaze-contingent real-time simulation of arbitrary visual fields. In Human vision and electronic imaging VII, volume 4662, pp. 57–69. SPIE, 2002.

[bib47] JS Pointer and RF Hess. The contrast sensitivity gradient across the human visual field: With emphasis on the low spatial frequency range. Vision research, 29(9):1133–1151, 1989.

[bib48] Polimeni et al. [2010] Jonathan R Polimeni, Bruce Fischl, Douglas N Greve, and Lawrence L Wald. Laminar analysis of 7 t bold using an imposed spatial activation pattern in human v1. Neuroimage, 52(4):1334–1346, 2010.

[bib49] Daniel Probst and Jean-Louis Reymond. Visualization of very large high-dimensional data sets as minimum spanning trees. Journal of Cheminformatics, 12(1):12, 2020. doi: 10.1186/s13321-020-0416-x.

[bib50] Radford et al. [2018] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.

[bib51] Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021.

[bib52] Austin Roorda and David R Williams. The arrangement of the three cone classes in the living human eye. Nature, 397(6719):520–522, 1999.

[bib53] Roux et al. [2007] Nicolas Roux, Yoshua Bengio, Pascal Lamblin, Marc Joliveau, and Balázs Kégl. Learning the 2-d topology of images. In J. Platt, D. Koller, Y. Singer, and S. Roweis (eds.), Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.

[bib54] Rosanna P Sammons and Tara Keck. Adult plasticity and cortical reorganization after peripheral lesions. Current Opinion in Neurobiology, 35:136–141, 2015.

[bib55] Eero P Simoncelli and Bruno A Olshausen. Natural image statistics and neural representation. Annual review of neuroscience, 24(1):1193–1216, 2001.

[bib56] X Yu Stella and Jianbo Shi. Multiclass spectral clustering. In Computer Vision, IEEE International Conference on, volume 2, pp. 313–313. IEEE Computer Society, 2003.

[bib57] Stringer et al. [2018] Carsen Stringer, Marius Pachitariu, Matteo Carandini, and Kenneth Harris. Recordings of 10,000 neurons in visual cortex in response to 2,800 natural images. Figshare Repos, 2018.

[bib58] Stringer et al. [2019] Carsen Stringer, Marius Pachitariu, Nicholas Steinmetz, Matteo Carandini, and Kenneth D Harris. High-dimensional geometry of population responses in visual cortex. Nature, 571(7765):361–365, July 2019.

[bib59] Stringer et al. [2021] Carsen Stringer, Michalis Michaelos, Dmitri Tsyboulski, Sarah E. Lindo, and Marius Pachitariu. High-precision coding in visual cortex. Cell, 184(10):2767–2778.e15, 2021. ISSN 0092-8674. doi: https://doi.org/10.1016/j.cell.2021.03.042.

[bib60] Nicholas V. Swindale and Hans-Ulrich Bauer. Application of kohonen’s self-organizing feature map algorithm to cortical maps of orientation and direction preference. Proceedings: Biological Sciences, 265(1398):827–838, 1998. ISSN 09628452.

[bib61] Larry N Thibos. Acuity perimetry and the sampling theory of visual resolution. Optometry and vision science: official publication of the American Academy of Optometry, 75(6):399–406, 1998.

[bib62] Tomczak et al. [2015] Katarzyna Tomczak, Patrycja Czerwińska, and Maciej Wiznerowicz. Review the cancer genome atlas (tcga): an immeasurable source of knowledge. Contemporary Oncology/Współczesna Onkologia, 2015(1):68–77, 2015.

[bib63] Van Essen et al. [1992] David C Van Essen, Charles H Anderson, and Daniel J Felleman. Information processing in the primate visual system: an integrated systems perspective. Science, 255(5043):419–423, 1992.

[bib64] Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

[bib65] Wandell et al. [2007] Brian A Wandell, Serge O Dumoulin, and Alyssa A Brewer. Visual field maps in human cortex. Neuron, 56(2):366–383, 2007.

[bib66] Weinstein et al. [2013] John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, and Joshua M Stuart. The cancer genome atlas pan-cancer analysis project. Nature genetics, 45(10):1113–1120, 2013.

[bib67] Rachel OL Wong. Retinal waves and visual system development. Annual review of neuroscience, 22(1):29–47, 1999.

[bib68] Wu et al. [2018] Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[bib69] Xu et al. [2021] Youzheng Xu, Yixin Xu, Chun Wang, Baoguo Xia, Qingling Mu, Shaohong Luan, and Jun Fan. Mining tcga database for gene expression in ovarian serous cystadenocarcinoma microenvironment. PeerJ, 9:e11375, 2021.

[bib70] Zbontar et al. [2021] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pp. 12310–12320. PMLR, 2021.

[bib71] Zhang et al. [2019] Xiaoyu Zhang, Jingqing Zhang, Kai Sun, Xian Yang, Chengliang Dai, and Yike Guo. Integrated multi-omics analysis using variational autoencoders: Application to pan-cancer classification. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 765–769, 2019. doi: 10.1109/BIBM47256.2019.8983228.

[bib72] Zhang et al. [2021] Xiaoyu Zhang, Yuting Xing, Kai Sun, and Yike Guo. Omiembed: A unified multi-task deep learning framework for multi-omics data. Cancers, 13(12), 2021. ISSN 2072-6694. doi: 10.3390/cancers13123047.

URLOST: Unsupervised Representation Learning without Stationarity or Topology

Introduction​

Method​

Motivation and Overall Framework​

Density Adjusted Spectral Clustering​

Self-organizing Layer​

Masked Autoencoder​

Result​

Synthetic Biological Vision Dataset​

V1 Neural Response to Natural Image Stimulus​

Gene Expression Data​

Ablation study​

Aggregating similar dimensions for masking unit​

Self-organizing Layer vs Shared Projection Layer​

Density adjusted clustering vs Uniform Density Clustering​

Additional related works​

Discussion, Limitations, and future direction​

Acknowledgement​

References​

Appendix​

Motivation of density adjusted spectral clustering​

Spectral clustering algorithm​

Data synthesize process​

Density adjusted spectral clustering on foveated CIFAR10 dataset​

self-organizing layer learns inverse permutation​

Training and evaluation details​

Effect of spectral clustering hyperparameters on performance​

Topology and Stationarity​

Computational cost​

Scalability​

Noise robustness of mutual information recovered topology​

Self-Organizing Map (SOM)​

Mutual information vs. Pixel! Distance for Different Noise Levels

Visualizing the weight of self-organizing​

References​

Introduction

Method

Motivation and Overall Framework

Density Adjusted Spectral Clustering

Self-organizing Layer

Masked Autoencoder

Result

Synthetic Biological Vision Dataset

V1 Neural Response to Natural Image Stimulus

Gene Expression Data

Ablation study

Aggregating similar dimensions for masking unit

Self-organizing Layer vs Shared Projection Layer

Density adjusted clustering vs Uniform Density Clustering

Additional related works

Discussion, Limitations, and future direction

Acknowledgement

References

Appendix

Motivation of density adjusted spectral clustering

Spectral clustering algorithm

Data synthesize process

Density adjusted spectral clustering on foveated CIFAR10 dataset

self-organizing layer learns inverse permutation

Training and evaluation details

Effect of spectral clustering hyperparameters on performance

Topology and Stationarity

Computational cost

Scalability

Noise robustness of mutual information recovered topology

Self-Organizing Map (SOM)

Visualizing the weight of self-organizing

References