Skip to main content

Learning in High Dimension Always Amounts to Extrapolation

Randall Balestriero1, Jérôme Pesenti1, Yann LeCun1,2

Abstract

The notion of interpolation and extrapolation is fundamental in various fields from deep learning to function approximation. Interpolation occurs for a sample $\bx$ whenever this sample falls inside or on the boundary of the given dataset's convex hull. Extrapolation occurs when $\bx$ falls outside of that convex hull. One fundamental (mis)conception is that state-of-the-art algorithms work so well because of their ability to correctly interpolate training data. A second (mis)conception is that interpolation happens throughout tasks and datasets, in fact, many intuitions and theories rely on that assumption. We empirically and theoretically argue against those two points and demonstrate that on any high-dimensional ($>$100) dataset, interpolation almost surely never happens. Those results challenge the validity of our current interpolation/extrapolation definition as an indicator of generalization performances.

Learning in High Dimension Always Amounts to Extrapolation

Randall Balestriero 1 , J´ erˆ ome Pesenti 1 , and Yann LeCun 1 , 2

1 Facebook AI Research, 2 NYU

The notion of interpolation and extrapolation is fundamental in various fields from deep learning to function approximation. Interpolation occurs for a sample x whenever this sample falls inside or on the boundary of the given dataset's convex hull. Extrapolation occurs when x falls outside of that convex hull. One fundamental (mis)conception is that state-of-the-art algorithms work so well because of their ability to correctly interpolate training data. A second (mis)conception is that interpolation happens throughout tasks and datasets, in fact, many intuitions and theories rely on that assumption. We empirically and theoretically argue against those two points and demonstrate that on any high-dimensional ( > 100) dataset, interpolation almost surely never happens. Those results challenge the validity of our current interpolation/extrapolation definition as an indicator of generalization performances.

Introduction

The origin of the interpolation and extrapolation notions are hard to trace back. Kolmogoroff (1941); Wiener (1949) defined extrapolation as predicting the future (realization) of a stationary Gaussian process based on past and current realizations. Conversely, interpolation was defined as predicting the possible realization of such process at a time position lying in-between observations, i.e., interpolation resamples the past. Various research communities have formalized those definitions as follows.

Definition 1. Interpolation occurs for a sample x whenever this sample belongs to the convex hull of a set of samples X glyph[defines] { x 1 , . . . , x N } , if not, extrapolation occurs.

From the above definition, it is reasonable to assume extrapolation as being a more intricate task than interpolation. After all, interpolation guarantees that the sample lies within the dataset's convex hull, while extrapolation leaves the entire remaining space as a valid sample position. Those terms have been ported as-is to various fields such as function approximation (DeVore, 1998) or machine learning (Bishop, 2006), and an increasing amount of research papers in deep learning provide results and intuitions relying on data interpolation (Belkin et al., 2018; Bietti and Mairal, 2019; Adlam and Pennington, 2020). Beyond those, the following adage 'as an algorithm transitions from interpolation to extrapolation, as its performance decreases' is commonly agreed upon. Before going further, we insist that throughout this manuscript, interpolation is to be understood as characterizing the data geometry as per Def. 1. This is not to be mistaken with the often employed 'interpolation regime' of models which occur whenever the latter has 0 training loss on the data (Chatterji et al., 2021). We shall see that interpolation/extrapolation and generalization performances do not seem as tightly related as previously thought.

Our goal in this paper is to demonstrate both theoretically and empirically for both synthetic and real data that interpolation almost surely never occurs in high-dimensional spaces ( > 100 ) regardless

of the underlying intrinsic dimension of the data manifold . That is, given the realistic amount of data that can be carried by current computational capacities, it is extremely unlikely that a newly observed sample lies in the convex hull of that dataset. Hence, we claim that

· currently employed/deployed models are extrapolating · given the super-human performances achieved by those models, extrapolation regime is not necessarily to be avoided, and is not an indicator of generalization performances

This paper is organized as follows. We first provide below (Thm. 1) an important theoretical result that has been derived in the context of Uniform samples from an hyper-ball. In that case, the probability of a new sample to be in interpolation regime from a dataset goes to 0 as the dimension d increases unless the number of dataset samples grows exponentially with d . This will allow to introduce notations and intuitions. We then directly provide empirical evidences in Sec. 2 using standard dataset where we demonstrate that even when considering a subset of the data dimensions, the probability to interpolate goes exponentially quickly to 0 with the number of considered dimensions. We conclude with Sec. 3 by providing existing theoretical results describing the probability that new samples are in interpolation or extrapolation regimes in more specific scenarios.

Theorem 1 (B´ ar´ any and F¨ uredi (1988)) . Given a d -dimensional dataset X glyph[defines] { x 1 , . . . , x N } with i.i.d. samples uniformly drawn from an hyperball, the probability that a new sample x is in interpolation regime (recall Def. 1) has the following asymptotic behavior

$$

$$

.
(Bárány and Füredi, (1988)).

Interpolation is Doomed by the Curse of Dimensionality

In this section we propose various experiments supporting the need for exponentially large dataset to maintain interpolation, as per Thm. 1, for non Gaussian data. First, we demonstrate in Sec. 2.1 the role of the underlying data manifold intrinsic dimension along with the role of the dimension of the smallest affine subspace that include the data manifold. As we will see from carefully designed datasets, only the latter has an impact on the probability of new samples being in an interpolation regime. We then move to real datasets in Sec. 2.2 and demonstrate that both in the data space or in various embedding spaces, current test set samples are all in extrapolation regime from their corresponding training set.

The Role of the Intrinsic, Ambient and Convex Hull Dimensions

The first stage of our study consists in carefully understanding not only the role of the ambient dimension i.e. the dimension of the space in which the data lives, but also the role of the underlying data manifold intrinsic dimension i.e. the number of variables needed in a minimal representation of the data (Bennett, 1965), and the dimension of the smallest affine subspace that includes all the data manifold.

In fact, one could argue that data such as images might lie on a low dimensional manifold and thus hope that interpolation occurs regardless of the high-dimensional ambient space. As we demonstrate in Fig. 1, this intuition would be misleading. In fact, the underlying manifold dimension does not help even in the extreme case of having a 1-dimensional manifold. What matters however, is the dimension d ∗ of the smallest affine subspace that includes all the data manifold, or equivalently, the dimension of the convex hull of the data. As such, in the presence of a nonlinear manifold, we can see that the exponential requirement from Thm. 1 in the number of samples required to preserve a constant probability to be in interpolation grows exponentially with d ∗ . In fact, with the intrinsic dimension ( d ∗ ) constant,

Figure

Figure 1: Depiction of the evolution of the probability that a new sample is in interpolation regime (y-axis, p ( x ∈ Hull( X ))) given increasing dataset size (x-axis, N ) seen in logarithmic scale, and for various ambient space dimensions ( d ) based on Monte-Carlo estimates on 500 , 000 trials. On the left , the data is sampled from a Gaussian density x i ∼ N (0 , I d ) while in the middle , the data is sampled from a nonlinear continuous manifold with intrinsic dimension of 1 (see Fig. 2 for details on the manifold data) and on the right , the data is sampled from a Gaussian density that lives in an affine subspace of constant dimension 4 (while the ambient dimension increases). It is clear from those figures that in order to maintain a constant probability to be in interpolation regime, the training set size has to increase exponentially with d ∗ regardless of the underlying intrinsic manifold dimension where d ∗ is the dimension of the lowest dimensional affine subspace including the entire data manifold i.e. the convex hull dimension.

Figure

N

Figure 2: Depiction of the manifold data samples used for the middle plot of Figure. 1 with dim = 5 on the left and dim = 3 on the right . In all cases, the intrinsic dimension of this dataset is 1, the latent coordinate ( z ) that governs the data ( x ( z )) is depicted on the top row while the manifold samples in the ambient space are depicted in the bottom row . This manifold is continuous, nonlinear and piecewise smooth, and corresponds to walking around the simplex.

increasing the ambient space dimension ( d ) has no impact on the number of samples needed to maintain interpolation regime as can be seen on the right of Fig. 1. We thus conclude that for one to increase the probability to be in an interpolation regime, one should control d ∗ , and not the manifold underlying dimension not the ambient space dimension .

We now propose to extend those insights to real data where the exact same behavior occurs across datasets.

Real Datasets and Embeddings are no Exception

The previous section explored the cases of synthetic data with varying ambient, intrinsic, and convex hull dimensions. This provided valuable insights e.g. the key quantity of interest lies in the dimension of the smallest affine subspace containing the data. For real dataset however, one could argue that some natural properties of such manifolds help in being in an interpolation regime. Furthermore, one could argue that once embedded in a suitable and non-degenerate latent space, e.g. from a learned deep network, interpolation occurs. As we will see through various experiments, even with real datasets and various popular embeddings, interpolation remains an elusive goal that becomes exponentially difficult to reach as the dimension grows.

Test set extrapolation in pixel-space. We first propose in Fig. 3 to study the proportion of the test set that is in interpolation regime from the train set for MNIST, CIFAR and Imagenet. To grasp the impact of the dimensionality of the data we propose to compute this proportion with varying number of dimensions obtained from two strategies. First, we only keep a specified amount of dimensions from the center of the images, second, we smooth and subsample the images. The former has the benefit of preserving the manifold geometry whilst only considering a limited amount of dimensions, the latter preserves the overall geometry of the manifold while removing the high-frequency structures (details of the image) and compressing the information on fewer dimensions. In both cases and throughout datasets, we see that despite the data manifold geometry held by natural images, finding samples in interpolation regime becomes exponentially difficult with respect to the considered data dimension .

Test set extrapolation in embedding-space. Given the above, one could argue that the key interest of machine learning is not to perform interpolation in the data space, but rather in a (learned) latent space. In fact, a DN provides a data embedding, then, in that space, a linear classifier (for example) solves the problem at hand, possibly in an interpolation regime. We thus provide in Tab. 1 the proportion of the test set that is in interpolation regime when considering different embedding spaces. We observed that embedding-spaces provide seemingly organized representations (with linear separability of the classes), yet, interpolation remains an elusive goal even for embedding-spaces of only 30 dimensions . Hence current deep learning methods operate almost surely in an extrapolation regime in both the data space, and their embedding space.

Test set extrapolation in dimensionality-reduction-space. The last set of experiments deals with the use of (non)linear dimensionality reduction techniques to visualize high-dimensional dataset. We pose the following question: is the interpolation/extrapolation information preserved by commonly employed dimensionality reduction techniques? To unequivocally answer this question, we create a data that consists of the 2 d vertices of an hypercube in d dimensions for d = 8 , 12. Those dataset have the specificity that any sample is in extrapolation regime with respect to the other samples. We propose in Fig. 5 the 2-dimensional representations of those vertices using 8 different popular dimensionality reduction techniques: locally linear embedding (Roweis and Saul, 2000) denoted as LLE, modified LLE (Zhang and Wang, 2007), Hessian eigenmaps (Donoho and Grimes, 2003) denoted as Hessian LLE, Laplacian eigenmaps (Belkin and Niyogi, 2003) denoted as SE, isomap (Balasubramanian et al., 2002), t-distributed stochastic neighbor embedding (Van der Maaten and Hinton, 2008) denoted as t-SNE, local tangent space alignment (Zhang and Zha, 2004) referred as LTSA, Multidimensional scaling (Kruskal, 1964) denoted as MDS. We observe that dimensionality reduction methods loose the interpolation/extrapolation information and lead to visual misconceptions significantly skewed towards interpolation .

Johnson-Lindenstrauss (di)lemma One last important setting concerns dimensionality reduction techniques that preserve -to some extent- the pairwise distances of the samples. Such techniques are often coined low-distortion embeddings, one of which follows from the Johnson-Lindenstrauss lemma (JLL) (Johnson and Lindenstrauss, 1984). In short, the JLL guarantees the existence of a linear mapping f with input dimension d and output dimension d JLL ≥ 24 3 glyph[epsilon1] 2 -2 glyph[epsilon1] 3 log( N ) with N the size of the dataset

Figure 3: Depiction of the proportion of the test set that is in interpolation of the training set for MNIST ( top ), CIFAR ( middle ) and Imagenet ( bottom ) as a function of the number of selected dimensions. We propose two settings ( blue ) selecting increasingly large central patches (some cases consist of irregular patches for intermediate dimension values) and ( red ) smoothing-subsampling the original images (some cases consist of irregular images for intermediate dimension values). Note that the blue line is always decreasing with d , and that d = 147 (right of the x-axis) represents 19% of MNIST total number of dimensions, 5% for CIFAR and less than 1% for Imagenet. As can be seen throughout those settings the proportion of the test set that is in interpolation regime decreases exponentially fast with respect to the number of dimensions ultimately becoming negligible well prior reaching the full data dimensionality. The different slopes of those curves can be explained by the smallest dimensional affine space containing each type of data of (see Tab. 1).

Figure 3: Depiction of the proportion of the test set that is in interpolation of the training set for MNIST ( top ), CIFAR ( middle ) and Imagenet ( bottom ) as a function of the number of selected dimensions. We propose two settings ( blue ) selecting increasingly large central patches (some cases consist of irregular patches for intermediate dimension values) and ( red ) smoothing-subsampling the original images (some cases consist of irregular images for intermediate dimension values). Note that the blue line is always decreasing with d , and that d = 147 (right of the x-axis) represents 19% of MNIST total number of dimensions, 5% for CIFAR and less than 1% for Imagenet. As can be seen throughout those settings the proportion of the test set that is in interpolation regime decreases exponentially fast with respect to the number of dimensions ultimately becoming negligible well prior reaching the full data dimensionality. The different slopes of those curves can be explained by the smallest dimensional affine space containing each type of data of (see Tab. 1).

such that the pairwise distances after projection will be within a 1 ± glyph[epsilon1] factor of the original distances. Different bounds have emerged based on different proofs of the JLL (see Dasgupta and Gupta (2003) for a survey). Interestingly, as per Thm. 1, the experiments from Fig. 1 and Fig. 3, the dataset size must be of order 2 d to ensure that samples in the test set be in interpolation regime. In the JLL setting, this

Figure 4: Depiction of levels (90% to 99% from light to dark ) of explained variance from a Principal Component Analysis model for varying sub-images dimensions (1 to 147) x-axis based on the number of considered components ( y-axis ). The sub-images of dimension d are obtained either by selecting the central spatial dimensions ( blue, top-row ) or by smoothing and subsampling ( red, bottom-row ) as per Fig. 3. From this, it is clear that for each sub-image dimension ( d ), the smallest dimensional affine subspace containing the data reduces when going from MNIST to CIFAR10 to IMAGENET leading to the different slopes observed in Fig. 3 (recall Fig. 1).

Figure 4: Depiction of levels (90% to 99% from light to dark ) of explained variance from a Principal Component Analysis model for varying sub-images dimensions (1 to 147) x-axis based on the number of considered components ( y-axis ). The sub-images of dimension d are obtained either by selecting the central spatial dimensions ( blue, top-row ) or by smoothing and subsampling ( red, bottom-row ) as per Fig. 3. From this, it is clear that for each sub-image dimension ( d ), the smallest dimensional affine subspace containing the data reduces when going from MNIST to CIFAR10 to IMAGENET leading to the different slopes observed in Fig. 3 (recall Fig. 1).

Table 1: Proportion (in %) of the test set that falls into the interpolation regime for various datasets ( rows ) and embeddings ( columns ) with varying (randomly selected) dimensions (10, 20 and 30), in all cases the embeddings are of dimension 512. 'random projection' represents a linear mapping with random Gaussian weights, the remaining six columns represent the use of the Resnet18 architecture's latent space with either untrained weights (usual initialization) or Imagenet-pretrained weights. We shall note that such results are not surprising based on the probabilities computed in Fig. 1 for synthetic data, since the embedded samples behave more closely to gaussian samples than in the original pixel-space.

translates into d JLL > 24 3 glyph[epsilon1] 2 -2 glyph[epsilon1] 3 d > d . In other words, if a dataset size N is exponential with the dimension d (required to have new samples in interpolation regime) then d JLL > d and JLL does not provide any dimensionality reduction .

We propose in the next section for the interested reader a brief collection of theoretical results that have also reached the conclusion that in high-dimensional spaces, exponentially large datasets are required to maintain the probability for a new sample to be in interpolation regimes.

Theoretical Quantification of Interpolation Probabilities

The previous section focuses on empirically evaluating the probability that a new sample falls into the convex hull of a given dataset and studied various settings concluding that interpolation suffers from the curse of dimensionality and is thus difficult to achieve in high-dimensional settings. As the convex hull is simply a polytope, we shall first refer the reader to Spielman and Teng (2004) for a thorough study between various properties of such polytopes and their relation to the data dimension.

Figure 5: Depiction of various nonlinear dimensionality reduction techniques applied onto a synthetic dataset containing all the hypercube vertices for an hypercube of dimension 8 ( left ) and 10 ( right ). Coloring goes from blue for the vertex at position (1 , . . . , 1) to green for the vertex at position ( -1 , . . . , -1) in a linear manner. This data is chosen since each point/vertex is in extrapolation regime from all the other points/vertices . However, existing techniques for dimensionality reduction primarily focus on preserving local geometric information. As a result, regardless of the employed dimensionality reduction algorithm, the interpolation/extrapolation information is lost as can be seen in all the proposed subplots. This can lead to hazardous assumptions and conclusions.

Figure 5: Depiction of various nonlinear dimensionality reduction techniques applied onto a synthetic dataset containing all the hypercube vertices for an hypercube of dimension 8 ( left ) and 10 ( right ). Coloring goes from blue for the vertex at position (1 , . . . , 1) to green for the vertex at position ( -1 , . . . , -1) in a linear manner. This data is chosen since each point/vertex is in extrapolation regime from all the other points/vertices . However, existing techniques for dimensionality reduction primarily focus on preserving local geometric information. As a result, regardless of the employed dimensionality reduction algorithm, the interpolation/extrapolation information is lost as can be seen in all the proposed subplots. This can lead to hazardous assumptions and conclusions.

We then provide some milestone theoretical results on computing the probability that samples are in interpolation/extrapolation regime, a problem often named as the convex position problem.

Definition 2 (Convex position problem) . For a convex body K in the space, let p ( n, K ) denote the probability that n random, independent, and uniform points from K are in convex position, that is, none of the samples lies in the convex hull of the others.

Note that with this formulation, p ( n, K ) describes the probability that any point in the set of samples is in extrapolation regime. Characterization of p ( n, K ) for many 2-dimensional bodies K such as parallelograms and non-flat triangles are given below.

Theorem 2 (Valtr (1995, 1996)) . The probability p ( n, K ) for a parallelogram or a triangle is given by

$$

$$

Going to higher dimensional spaces has seen more challenging progresses making many existing results only valid in the limiting setting and when considering K to be an hypersphere, as given in Thm. 1 and in the following result.

Theorem 3 (Buchta (1986)) . The probability p ( n, K ) for K a d -dimensional hyperball B d and n growing linearly with d has the following limit behavior , ∀ m> 3 , lim d →∞ p ( d + m,B d ︸ ︷︷ ︸ extrapolation ) = 1

The above result effectively demonstrates that when sampling uniformly from an hyperball, the probability that all the points are in convex position (any sample lies outside of the other samples' convex hull) is 1 when the number of samples grows linearly with the dimension, even when adding an arbitrary constant number of samples m . This result nicely complements the one provided in Thm. 1 which was originally conjectured by Buchta (1986) a few years prior being proven by B´ ar´ any and F¨ uredi (1988). More recently, a non asymptotic result has been obtained characterizing p ( n, K ) when the samples are obtained from Gaussian distributions.

Theorem 4 (Kabluchko and Zaporozhets (2020)) . Let X consist of N i.i.d. d -dimensional samples from N (0 , I d ) with N ≥ d +1, then for every σ ≥ 0 the probability that a new sample x ∼ N (0 , σ 2 Id ) is in extrapolation regime is given by

$$

$$

with

$$

$$

where √ r = i √ -r if r < 0 and b N,k = 0 for k glyph[negationslash]∈ { 0 , 1 , . . . , N } .

The above theorem provides the analytical quantities governing the probability to be in interpolation regime. Lastly, Majumdar et al. (2010) consider the case where the samples are obtained from random walks and obtain similar interpolation behaviors. There remain many avenues to provide specific results when the data belong to lower dimensional manifolds, or to consider anisotropic distributions.

(Convex position problem).
(Valtr, (1995, 1996)).
(Buchta, (1986)).
(Kabluchko and Zaporozhets, (2020)).

Conclusion

Interpolation and extrapolation, as per Def. 1, provide an intuitive geometrical characterization on the location of new samples with respect to a given dataset. Those terms are commonly used as geometrical proxy to predict a model's performances on unseen samples and many have reached the conclusion that a model's generalization performance depends on how a model interpolates. In other words, how accurate is a model within a dataset's convex-hull defines its generalization performances. In this paper, we proposed to debunk this (mis)conception. In particular, we opposed the use of interpolation and extrapolation as indicators of generalization performances by demonstrating both from existing theoretical results and from thorough experiments that, in order to maintain interpolation for new samples, the dataset size should grow exponentially with respect to the data dimension. In short, the behavior of a model within a training set's convex hull barely impacts that model's generalization performance since new samples lie almost surely outside of that convex hull . This observation holds whether we are considering the original data space, or embeddings. We believe that those observations open the door to constructing better suited geometrical definitions of interpolation and extrapolation that align with generalization performances, especially in the context of high-dimensional data.

$$ p(\underbrace{n,{\rm parallelogram}}{\text{extrapolation}})=\left(\begin{pmatrix}2n-2\n-1\end{pmatrix}/n!\right)^2\hspace{0.5cm}\text{ and }\hspace{0.5cm} p(\underbrace{n,{\rm triangle}}{\text{extrapolation}})=\frac{2^n(3n-3)!}{(n-1)!^3(2n)!}. $$

$$ p(\underbrace{\bx \not \in \Hull(\bX)}{\text{extrapolation}})=2(b{N,d-1}(\sigma^2)+b_{N,d-3}(\sigma^2)+\dots) $$

$$ b_{n,k}(\sigma^2) = \begin{pmatrix}n\k\end{pmatrix}g_k\left(-\frac{\sigma^2}{1+k\sigma^2}\right)g_{n-k}\left(\frac{\sigma^2}{1+k\sigma^2}\right),;; g_n(r) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\Phi^n\left(\sqrt{r}x\right)e^{-x^2/2}dx $$

References

The notion of interpolation and extrapolation is fundamental in various fields from deep learning to function approximation. Interpolation occurs for a sample 𝒙𝒙\boldsymbol{x} whenever this sample falls inside or on the boundary of the given dataset’s convex hull. Extrapolation occurs when 𝒙𝒙\boldsymbol{x} falls outside of that convex hull. One fundamental (mis)conception is that state-of-the-art algorithms work so well because of their ability to correctly interpolate training data. A second (mis)conception is that interpolation happens throughout tasks and datasets, in fact, many intuitions and theories rely on that assumption. We empirically and theoretically argue against those two points and demonstrate that on any high-dimensional (>>100) dataset, interpolation almost surely never happens. Those results challenge the validity of our current interpolation/extrapolation definition as an indicator of generalization performances.

The origin of the interpolation and extrapolation notions are hard to trace back. Kolmogoroff, (1941); Wiener, (1949) defined extrapolation as predicting the future (realization) of a stationary Gaussian process based on past and current realizations. Conversely, interpolation was defined as predicting the possible realization of such process at a time position lying in-between observations, i.e., interpolation resamples the past. Various research communities have formalized those definitions as follows.

Interpolation occurs for a sample 𝒙𝒙\boldsymbol{x} whenever this sample belongs to the convex hull of a set of samples 𝑿≜{𝒙1,…,𝒙N}≜𝑿subscript𝒙1…subscript𝒙𝑁\boldsymbol{X}\triangleq{\boldsymbol{x}{1},\dots,\boldsymbol{x}{N}}, if not, extrapolation occurs.

From the above definition, it is reasonable to assume extrapolation as being a more intricate task than interpolation. After all, interpolation guarantees that the sample lies within the dataset’s convex hull, while extrapolation leaves the entire remaining space as a valid sample position. Those terms have been ported as-is to various fields such as function approximation (DeVore,, 1998) or machine learning (Bishop,, 2006), and an increasing amount of research papers in deep learning provide results and intuitions relying on data interpolation (Belkin et al.,, 2018; Bietti and Mairal,, 2019; Adlam and Pennington,, 2020). Beyond those, the following adage “as an algorithm transitions from interpolation to extrapolation, as its performance decreases” is commonly agreed upon. Before going further, we insist that throughout this manuscript, interpolation is to be understood as characterizing the data geometry as per Def. 1. This is not to be mistaken with the often employed “interpolation regime” of models which occur whenever the latter has 00 training loss on the data (Chatterji et al.,, 2021). We shall see that interpolation/extrapolation and generalization performances do not seem as tightly related as previously thought.

Our goal in this paper is to demonstrate both theoretically and empirically for both synthetic and real data that interpolation almost surely never occurs in high-dimensional spaces (>100absent100>100) regardless of the underlying intrinsic dimension of the data manifold. That is, given the realistic amount of data that can be carried by current computational capacities, it is extremely unlikely that a newly observed sample lies in the convex hull of that dataset. Hence, we claim that

currently employed/deployed models are extrapolating

given the super-human performances achieved by those models, extrapolation regime is not necessarily to be avoided, and is not an indicator of generalization performances

This paper is organized as follows. We first provide below (Thm. 1) an important theoretical result that has been derived in the context of Uniform samples from an hyper-ball. In that case, the probability of a new sample to be in interpolation regime from a dataset goes to 00 as the dimension d𝑑d increases unless the number of dataset samples grows exponentially with d𝑑d. This will allow to introduce notations and intuitions. We then directly provide empirical evidences in Sec. 2 using standard dataset where we demonstrate that even when considering a subset of the data dimensions, the probability to interpolate goes exponentially quickly to 00 with the number of considered dimensions. We conclude with Sec. 3 by providing existing theoretical results describing the probability that new samples are in interpolation or extrapolation regimes in more specific scenarios.

Given a d𝑑d-dimensional dataset 𝑿≜{𝒙1,…,𝒙N}≜𝑿subscript𝒙1…subscript𝒙𝑁\boldsymbol{X}\triangleq{\boldsymbol{x}{1},\dots,\boldsymbol{x}{N}} with i.i.d. samples uniformly drawn from an hyperball, the probability that a new sample 𝒙𝒙\boldsymbol{x} is in interpolation regime (recall Def. 1) has the following asymptotic behavior

In this section we propose various experiments supporting the need for exponentially large dataset to maintain interpolation, as per Thm. 1, for non Gaussian data. First, we demonstrate in Sec. 2.1 the role of the underlying data manifold intrinsic dimension along with the role of the dimension of the smallest affine subspace that include the data manifold. As we will see from carefully designed datasets, only the latter has an impact on the probability of new samples being in an interpolation regime. We then move to real datasets in Sec. 2.2 and demonstrate that both in the data space or in various embedding spaces, current test set samples are all in extrapolation regime from their corresponding training set.

d∗=dsuperscript𝑑𝑑d^{*}=d

log⁡(N)𝑁\log(N)

p​(𝒙∈Hull​(𝑿))𝑝𝒙Hull𝑿p(\boldsymbol{x}\in\text{Hull}(\boldsymbol{X}))

The first stage of our study consists in carefully understanding not only the role of the ambient dimension i.e. the dimension of the space in which the data lives, but also the role of the underlying data manifold intrinsic dimension i.e. the number of variables needed in a minimal representation of the data (Bennett,, 1965), and the dimension of the smallest affine subspace that includes all the data manifold.

In fact, one could argue that data such as images might lie on a low dimensional manifold and thus hope that interpolation occurs regardless of the high-dimensional ambient space. As we demonstrate in Fig. 1, this intuition would be misleading. In fact, the underlying manifold dimension does not help even in the extreme case of having a 111-dimensional manifold. What matters however, is the dimension d∗superscript𝑑d^{} of the smallest affine subspace that includes all the data manifold, or equivalently, the dimension of the convex hull of the data. As such, in the presence of a nonlinear manifold, we can see that the exponential requirement from Thm. 1 in the number of samples required to preserve a constant probability to be in interpolation grows exponentially with d∗superscript𝑑d^{}. In fact, with the intrinsic dimension (d∗superscript𝑑d^{}) constant, increasing the ambient space dimension (d𝑑d) has no impact on the number of samples needed to maintain interpolation regime as can be seen on the right of Fig. 1. We thus conclude that for one to increase the probability to be in an interpolation regime, one should control d∗superscript𝑑d^{}, and not the manifold underlying dimension not the ambient space dimension.

dimension index z𝑧z

We now propose to extend those insights to real data where the exact same behavior occurs across datasets.

The previous section explored the cases of synthetic data with varying ambient, intrinsic, and convex hull dimensions. This provided valuable insights e.g. the key quantity of interest lies in the dimension of the smallest affine subspace containing the data. For real dataset however, one could argue that some natural properties of such manifolds help in being in an interpolation regime. Furthermore, one could argue that once embedded in a suitable and non-degenerate latent space, e.g. from a learned deep network, interpolation occurs. As we will see through various experiments, even with real datasets and various popular embeddings, interpolation remains an elusive goal that becomes exponentially difficult to reach as the dimension grows.

Proportion of test set in interpolation regime

considered number of dimensions (d)𝑑(d)

MNIST CIFAR10 IMAGENET

Test set extrapolation in pixel-space. We first propose in Fig. 3 to study the proportion of the test set that is in interpolation regime from the train set for MNIST, CIFAR and Imagenet. To grasp the impact of the dimensionality of the data we propose to compute this proportion with varying number of dimensions obtained from two strategies. First, we only keep a specified amount of dimensions from the center of the images, second, we smooth and subsample the images. The former has the benefit of preserving the manifold geometry whilst only considering a limited amount of dimensions, the latter preserves the overall geometry of the manifold while removing the high-frequency structures (details of the image) and compressing the information on fewer dimensions. In both cases and throughout datasets, we see that despite the data manifold geometry held by natural images, finding samples in interpolation regime becomes exponentially difficult with respect to the considered data dimension.

Test set extrapolation in embedding-space. Given the above, one could argue that the key interest of machine learning is not to perform interpolation in the data space, but rather in a (learned) latent space. In fact, a DN provides a data embedding, then, in that space, a linear classifier (for example) solves the problem at hand, possibly in an interpolation regime. We thus provide in Tab. 1 the proportion of the test set that is in interpolation regime when considering different embedding spaces. We observed that embedding-spaces provide seemingly organized representations (with linear separability of the classes), yet, interpolation remains an elusive goal even for embedding-spaces of only 303030 dimensions. Hence current deep learning methods operate almost surely in an extrapolation regime in both the data space, and their embedding space.

Test set extrapolation in dimensionality-reduction-space. The last set of experiments deals with the use of (non)linear dimensionality reduction techniques to visualize high-dimensional dataset. We pose the following question: is the interpolation/extrapolation information preserved by commonly employed dimensionality reduction techniques? To unequivocally answer this question, we create a data that consists of the 2dsuperscript2𝑑2^{d} vertices of an hypercube in d𝑑d dimensions for d=8,12𝑑812d=8,12. Those dataset have the specificity that any sample is in extrapolation regime with respect to the other samples. We propose in Fig. 5 the 222-dimensional representations of those vertices using 888 different popular dimensionality reduction techniques: locally linear embedding (Roweis and Saul,, 2000) denoted as LLE, modified LLE (Zhang and Wang,, 2007), Hessian eigenmaps (Donoho and Grimes,, 2003) denoted as Hessian LLE, Laplacian eigenmaps (Belkin and Niyogi,, 2003) denoted as SE, isomap (Balasubramanian et al.,, 2002), t-distributed stochastic neighbor embedding (Van der Maaten and Hinton,, 2008) denoted as t-SNE, local tangent space alignment (Zhang and Zha,, 2004) referred as LTSA, Multidimensional scaling (Kruskal,, 1964) denoted as MDS. We observe that dimensionality reduction methods loose the interpolation/extrapolation information and lead to visual misconceptions significantly skewed towards interpolation.

Johnson–Lindenstrauss (di)lemma One last important setting concerns dimensionality reduction techniques that preserve -to some extent- the pairwise distances of the samples. Such techniques are often coined low-distortion embeddings, one of which follows from the Johnson–Lindenstrauss lemma (JLL) (Johnson and Lindenstrauss,, 1984). In short, the JLL guarantees the existence of a linear mapping f𝑓f with input dimension d𝑑d and output dimension dJLL≥243​ϵ2−2​ϵ3​log⁡(N)subscript𝑑JLL243superscriptitalic-ϵ22superscriptitalic-ϵ3𝑁d_{\rm JLL}\geq\frac{24}{3\epsilon^{2}-2\epsilon^{3}}\log(N) with N𝑁N the size of the dataset such that the pairwise distances after projection will be within a 1±ϵplus-or-minus1italic-ϵ1\pm\epsilon factor of the original distances. Different bounds have emerged based on different proofs of the JLL (see Dasgupta and Gupta, (2003) for a survey). Interestingly, as per Thm. 1, the experiments from Fig. 1 and Fig. 3, the dataset size must be of order 2dsuperscript2𝑑2^{d} to ensure that samples in the test set be in interpolation regime. In the JLL setting, this translates into dJLL>243​ϵ2−2​ϵ3​d>dsubscript𝑑JLL243superscriptitalic-ϵ22superscriptitalic-ϵ3𝑑𝑑d_{\rm JLL}>\frac{24}{3\epsilon^{2}-2\epsilon^{3}}d>d. In other words, if a dataset size N𝑁N is exponential with the dimension d𝑑d (required to have new samples in interpolation regime) then dJLL>dsubscript𝑑JLL𝑑d_{\rm JLL}>d and JLL does not provide any dimensionality reduction.

We propose in the next section for the interested reader a brief collection of theoretical results that have also reached the conclusion that in high-dimensional spaces, exponentially large datasets are required to maintain the probability for a new sample to be in interpolation regimes.

The previous section focuses on empirically evaluating the probability that a new sample falls into the convex hull of a given dataset and studied various settings concluding that interpolation suffers from the curse of dimensionality and is thus difficult to achieve in high-dimensional settings. As the convex hull is simply a polytope, we shall first refer the reader to Spielman and Teng, (2004) for a thorough study between various properties of such polytopes and their relation to the data dimension. We then provide some milestone theoretical results on computing the probability that samples are in interpolation/extrapolation regime, a problem often named as the convex position problem.

For a convex body K𝐾K in the space, let p​(n,K)𝑝𝑛𝐾p(n,K) denote the probability that n𝑛n random, independent, and uniform points from K𝐾K are in convex position, that is, none of the samples lies in the convex hull of the others.

Note that with this formulation, p​(n,K)𝑝𝑛𝐾p(n,K) describes the probability that any point in the set of samples is in extrapolation regime. Characterization of p​(n,K)𝑝𝑛𝐾p(n,K) for many 222-dimensional bodies K𝐾K such as parallelograms and non-flat triangles are given below.

The probability p​(n,K)𝑝𝑛𝐾p(n,K) for a parallelogram or a triangle is given by

Going to higher dimensional spaces has seen more challenging progresses making many existing results only valid in the limiting setting and when considering K𝐾K to be an hypersphere, as given in Thm. 1 and in the following result.

The probability p​(n,K)𝑝𝑛𝐾p(n,K) for K𝐾K a d𝑑d-dimensional hyperball Bdsuperscript𝐵𝑑B^{d} and n𝑛n growing linearly with d𝑑d has the following limit behavior , ∀m>3,limd→∞p​(d+m,Bd⏟extrapolation)=1formulae-sequencefor-all𝑚3subscript→𝑑𝑝subscript⏟𝑑𝑚superscript𝐵𝑑extrapolation1\forall m>3,\lim_{d\rightarrow\infty}p(\underbrace{d+m,B^{d}}_{\text{extrapolation}})=1

The above result effectively demonstrates that when sampling uniformly from an hyperball, the probability that all the points are in convex position (any sample lies outside of the other samples’ convex hull) is 111 when the number of samples grows linearly with the dimension, even when adding an arbitrary constant number of samples m𝑚m. This result nicely complements the one provided in Thm. 1 which was originally conjectured by Buchta, (1986) a few years prior being proven by Bárány and Füredi, (1988). More recently, a non asymptotic result has been obtained characterizing p​(n,K)𝑝𝑛𝐾p(n,K) when the samples are obtained from Gaussian distributions.

Let 𝑿𝑿\boldsymbol{X} consist of N𝑁N i.i.d. d𝑑d-dimensional samples from 𝒩​(0,Id)𝒩0subscript𝐼𝑑\mathcal{N}(0,I_{d}) with N≥d+1𝑁𝑑1N\geq d+1, then for every σ≥0𝜎0\sigma\geq 0 the probability that a new sample 𝒙∼𝒩​(0,σ2​I​d)similar-to𝒙𝒩0superscript𝜎2𝐼𝑑\boldsymbol{x}\sim\mathcal{N}(0,\sigma^{2}Id) is in extrapolation regime is given by

where r=i​−r𝑟𝑖𝑟\sqrt{r}=i\sqrt{-r} if r<0𝑟0r<0 and bN,k=0subscript𝑏𝑁𝑘0b_{N,k}=0 for k∉{0,1,…,N}𝑘01…𝑁k\not\in{0,1,\dots,N}.

The above theorem provides the analytical quantities governing the probability to be in interpolation regime. Lastly, Majumdar et al., (2010) consider the case where the samples are obtained from random walks and obtain similar interpolation behaviors. There remain many avenues to provide specific results when the data belong to lower dimensional manifolds, or to consider anisotropic distributions.

Interpolation and extrapolation, as per Def. 1, provide an intuitive geometrical characterization on the location of new samples with respect to a given dataset. Those terms are commonly used as geometrical proxy to predict a model’s performances on unseen samples and many have reached the conclusion that a model’s generalization performance depends on how a model interpolates. In other words, how accurate is a model within a dataset’s convex-hull defines its generalization performances. In this paper, we proposed to debunk this (mis)conception. In particular, we opposed the use of interpolation and extrapolation as indicators of generalization performances by demonstrating both from existing theoretical results and from thorough experiments that, in order to maintain interpolation for new samples, the dataset size should grow exponentially with respect to the data dimension. In short, the behavior of a model within a training set’s convex hull barely impacts that model’s generalization performance since new samples lie almost surely outside of that convex hull. This observation holds whether we are considering the original data space, or embeddings. We believe that those observations open the door to constructing better suited geometrical definitions of interpolation and extrapolation that align with generalization performances, especially in the context of high-dimensional data.

Table: S2.T1: Proportion (in %percent%) of the test set that falls into the interpolation regime for various datasets (rows) and embeddings (columns) with varying (randomly selected) dimensions (101010, 202020 and 303030), in all cases the embeddings are of dimension 512512512. “random projection” represents a linear mapping with random Gaussian weights, the remaining six columns represent the use of the Resnet18 architecture’s latent space with either untrained weights (usual initialization) or Imagenet-pretrained weights. We shall note that such results are not surprising based on the probabilities computed in Fig. 1 for synthetic data, since the embedded samples behave more closely to gaussian samples than in the original pixel-space.

random projectionrandom Resnet18pretrained Resnet18
# selected dimensions102030102030102030
MNIST d𝑑d:764/train:50K/test:10K83±plus-or-minus\pm112±plus-or-minus\pm20±plus-or-minus\pm094±plus-or-minus\pm321±plus-or-minus\pm100±plus-or-minus\pm095±plus-or-minus\pm252±plus-or-minus\pm1714±plus-or-minus\pm12
CIFAR d𝑑d:3072/train:50K/test:10K89±plus-or-minus\pm141±plus-or-minus\pm015±plus-or-minus\pm088±plus-or-minus\pm423±plus-or-minus\pm80±plus-or-minus\pm092±plus-or-minus\pm221±plus-or-minus\pm80±plus-or-minus\pm0
Imagenet d𝑑d:150528/train:1M/test:100K81±plus-or-minus\pm448±plus-or-minus\pm122±plus-or-minus\pm191±plus-or-minus\pm130±plus-or-minus\pm13±plus-or-minus\pm090±plus-or-minus\pm226±plus-or-minus\pm31±plus-or-minus\pm0

Refer to caption Depiction of the evolution of the probability that a new sample is in interpolation regime (y-axis, p​(𝒙∈Hull​(𝑿))𝑝𝒙Hull𝑿p(\boldsymbol{x}\in\text{Hull}(\boldsymbol{X}))) given increasing dataset size (x-axis, N𝑁N) seen in logarithmic scale, and for various ambient space dimensions (d𝑑d) based on Monte-Carlo estimates on 500,000500000500,000 trials. On the left, the data is sampled from a Gaussian density 𝒙i∼𝒩​(0,Id)similar-tosubscript𝒙𝑖𝒩0subscript𝐼𝑑\boldsymbol{x}{i}\sim\mathcal{N}(0,I{d}) while in the middle, the data is sampled from a nonlinear continuous manifold with intrinsic dimension of 111 (see Fig. 2 for details on the manifold data) and on the right, the data is sampled from a Gaussian density that lives in an affine subspace of constant dimension 444 (while the ambient dimension increases). It is clear from those figures that in order to maintain a constant probability to be in interpolation regime, the training set size has to increase exponentially with d∗superscript𝑑d^{} regardless of the underlying intrinsic manifold dimension where d∗superscript𝑑d^{} is the dimension of the lowest dimensional affine subspace including the entire data manifold i.e. the convex hull dimension.

Refer to caption Depiction of the manifold data samples used for the middle plot of Figure. 1 with dim=5dim5{\rm dim}=5 on the left and dim=3dim3{\rm dim}=3 on the right. In all cases, the intrinsic dimension of this dataset is 111, the latent coordinate (z𝑧z) that governs the data (𝒙​(z)𝒙𝑧\boldsymbol{x}(z)) is depicted on the top row while the manifold samples in the ambient space are depicted in the bottom row. This manifold is continuous, nonlinear and piecewise smooth, and corresponds to walking around the simplex.

Refer to caption Depiction of the proportion of the test set that is in interpolation of the training set for MNIST (top), CIFAR (middle) and Imagenet (bottom) as a function of the number of selected dimensions. We propose two settings (blue) selecting increasingly large central patches (some cases consist of irregular patches for intermediate dimension values) and (red) smoothing-subsampling the original images (some cases consist of irregular images for intermediate dimension values). Note that the blue line is always decreasing with d𝑑d, and that d=147𝑑147d=147 (right of the x-axis) represents 19% of MNIST total number of dimensions, 5% for CIFAR and less than 1% for Imagenet. As can be seen throughout those settings the proportion of the test set that is in interpolation regime decreases exponentially fast with respect to the number of dimensions ultimately becoming negligible well prior reaching the full data dimensionality. The different slopes of those curves can be explained by the smallest dimensional affine space containing each type of data of (see Tab. 1).

Refer to caption Depiction of levels (90%percent9090% to 99%percent9999% from light to dark) of explained variance from a Principal Component Analysis model for varying sub-images dimensions (111 to 147147147) x-axis based on the number of considered components (y-axis). The sub-images of dimension d𝑑d are obtained either by selecting the central spatial dimensions (blue, top-row) or by smoothing and subsampling (red, bottom-row) as per Fig. 3. From this, it is clear that for each sub-image dimension (d𝑑d), the smallest dimensional affine subspace containing the data reduces when going from MNIST to CIFAR10 to IMAGENET leading to the different slopes observed in Fig. 3 (recall Fig. 1).

Refer to caption Depiction of various nonlinear dimensionality reduction techniques applied onto a synthetic dataset containing all the hypercube vertices for an hypercube of dimension 888 (left) and 101010 (right). Coloring goes from blue for the vertex at position (1,…,1)1…1(1,\dots,1) to green for the vertex at position (−1,…,−1)1…1(-1,\dots,-1) in a linear manner. This data is chosen since each point/vertex is in extrapolation regime from all the other points/vertices. However, existing techniques for dimensionality reduction primarily focus on preserving local geometric information. As a result, regardless of the employed dimensionality reduction algorithm, the interpolation/extrapolation information is lost as can be seen in all the proposed subplots. This can lead to hazardous assumptions and conclusions.

$$ \lim_{d\rightarrow\infty}p(\underbrace{\boldsymbol{x}\in\text{Hull}(\boldsymbol{X})}_{\text{interpolation}})=\begin{cases}1\iff N>d^{-1}2^{d/2}\ 0\iff N<d^{-1}2^{d/2}\ \end{cases} $$ \tag{S1.Ex1}

$$ \displaystyle p(\underbrace{n,{\rm parallelogram}}{\text{extrapolation}})=\left(\begin{pmatrix}2n-2\ n-1\end{pmatrix}/n!\right)^{2}\hskip 14.22636pt\text{ and }\hskip 14.22636ptp(\underbrace{n,{\rm triangle}}{\text{extrapolation}})=\frac{2^{n}(3n-3)!}{(n-1)!^{3}(2n)!}. $$

$$ \displaystyle b_{n,k}(\sigma^{2})=\begin{pmatrix}n\ k\end{pmatrix}g_{k}\left(-\frac{\sigma^{2}}{1+k\sigma^{2}}\right)g_{n-k}\left(\frac{\sigma^{2}}{1+k\sigma^{2}}\right),;;g_{n}(r)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\Phi^{n}\left(\sqrt{r}x\right)e^{-x^{2}/2}dx $$

Defn. Definition 1. Interpolation occurs for a sample 𝒙𝒙\boldsymbol{x} whenever this sample belongs to the convex hull of a set of samples 𝑿≜{𝒙1,…,𝒙N}≜𝑿subscript𝒙1…subscript𝒙𝑁\boldsymbol{X}\triangleq{\boldsymbol{x}{1},\dots,\boldsymbol{x}{N}}, if not, extrapolation occurs.

Thm. Theorem 1 (Bárány and Füredi, (1988)). Given a d𝑑d-dimensional dataset 𝑿≜{𝒙1,…,𝒙N}≜𝑿subscript𝒙1…subscript𝒙𝑁\boldsymbol{X}\triangleq{\boldsymbol{x}{1},\dots,\boldsymbol{x}{N}} with i.i.d. samples uniformly drawn from an hyperball, the probability that a new sample 𝒙𝒙\boldsymbol{x} is in interpolation regime (recall Def. 1) has the following asymptotic behavior limd→∞p​(𝒙∈Hull​(𝑿)⏟interpolation)={1⇔N>d−1​2d/20⇔N<d−1​2d/2subscript→𝑑𝑝subscript⏟𝒙Hull𝑿interpolationcasesiff1𝑁superscript𝑑1superscript2𝑑2otherwiseiff0𝑁superscript𝑑1superscript2𝑑2otherwise\lim_{d\rightarrow\infty}p(\underbrace{\boldsymbol{x}\in\text{Hull}(\boldsymbol{X})}_{\text{interpolation}})=\begin{cases}1\iff N>d^{-1}2^{d/2}\ 0\iff N<d^{-1}2^{d/2}\ \end{cases}

Defn. Definition 2 (Convex position problem). For a convex body K𝐾K in the space, let p​(n,K)𝑝𝑛𝐾p(n,K) denote the probability that n𝑛n random, independent, and uniform points from K𝐾K are in convex position, that is, none of the samples lies in the convex hull of the others.

Thm. Theorem 2 (Valtr, (1995, 1996)). The probability p​(n,K)𝑝𝑛𝐾p(n,K) for a parallelogram or a triangle is given by p​(n,parallelogram⏟extrapolation)=((2​n−2n−1)/n!)2 and p​(n,triangle⏟extrapolation)=2n​(3​n−3)!(n−1)!3​(2​n)!.formulae-sequence𝑝subscript⏟𝑛parallelogramextrapolationsuperscriptmatrix2𝑛2𝑛1𝑛2 and 𝑝subscript⏟𝑛triangleextrapolationsuperscript2𝑛3𝑛3superscript𝑛132𝑛\displaystyle p(\underbrace{n,{\rm parallelogram}}{\text{extrapolation}})=\left(\begin{pmatrix}2n-2\ n-1\end{pmatrix}/n!\right)^{2}\hskip 14.22636pt\text{ and }\hskip 14.22636ptp(\underbrace{n,{\rm triangle}}{\text{extrapolation}})=\frac{2^{n}(3n-3)!}{(n-1)!^{3}(2n)!}.

Thm. Theorem 3 (Buchta, (1986)). The probability p​(n,K)𝑝𝑛𝐾p(n,K) for K𝐾K a d𝑑d-dimensional hyperball Bdsuperscript𝐵𝑑B^{d} and n𝑛n growing linearly with d𝑑d has the following limit behavior , ∀m>3,limd→∞p​(d+m,Bd⏟extrapolation)=1formulae-sequencefor-all𝑚3subscript→𝑑𝑝subscript⏟𝑑𝑚superscript𝐵𝑑extrapolation1\forall m>3,\lim_{d\rightarrow\infty}p(\underbrace{d+m,B^{d}}_{\text{extrapolation}})=1

Thm. Theorem 4 (Kabluchko and Zaporozhets, (2020)). Let 𝑿𝑿\boldsymbol{X} consist of N𝑁N i.i.d. d𝑑d-dimensional samples from 𝒩​(0,Id)𝒩0subscript𝐼𝑑\mathcal{N}(0,I_{d}) with N≥d+1𝑁𝑑1N\geq d+1, then for every σ≥0𝜎0\sigma\geq 0 the probability that a new sample 𝒙∼𝒩​(0,σ2​I​d)similar-to𝒙𝒩0superscript𝜎2𝐼𝑑\boldsymbol{x}\sim\mathcal{N}(0,\sigma^{2}Id) is in extrapolation regime is given by p​(𝒙∉Hull​(𝑿)⏟extrapolation)=2​(bN,d−1​(σ2)+bN,d−3​(σ2)+…)𝑝subscript⏟𝒙Hull𝑿extrapolation2subscript𝑏𝑁𝑑1superscript𝜎2subscript𝑏𝑁𝑑3superscript𝜎2…\displaystyle p(\underbrace{\boldsymbol{x}\not\in\text{Hull}(\boldsymbol{X})}{\text{extrapolation}})=2(b{N,d-1}(\sigma^{2})+b_{N,d-3}(\sigma^{2})+\dots) with bn,k​(σ2)=(nk)​gk​(−σ21+k​σ2)​gn−k​(σ21+k​σ2),gn​(r)=12​π​∫−∞∞Φn​(r​x)​e−x2/2​𝑑xformulae-sequencesubscript𝑏𝑛𝑘superscript𝜎2matrix𝑛𝑘subscript𝑔𝑘superscript𝜎21𝑘superscript𝜎2subscript𝑔𝑛𝑘superscript𝜎21𝑘superscript𝜎2subscript𝑔𝑛𝑟12𝜋superscriptsubscriptsuperscriptΦ𝑛𝑟𝑥superscript𝑒superscript𝑥22differential-d𝑥\displaystyle b_{n,k}(\sigma^{2})=\begin{pmatrix}n\ k\end{pmatrix}g_{k}\left(-\frac{\sigma^{2}}{1+k\sigma^{2}}\right)g_{n-k}\left(\frac{\sigma^{2}}{1+k\sigma^{2}}\right),;;g_{n}(r)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\Phi^{n}\left(\sqrt{r}x\right)e^{-x^{2}/2}dx where r=i​−r𝑟𝑖𝑟\sqrt{r}=i\sqrt{-r} if r<0𝑟0r<0 and bN,k=0subscript𝑏𝑁𝑘0b_{N,k}=0 for k∉{0,1,…,N}𝑘01…𝑁k\not\in{0,1,\dots,N}.

random projectionrandom projectionrandom projectionrandom Resnet18random Resnet18random Resnet18pretrained Resnet18pretrained Resnet18pretrained Resnet18
# selected dimensions102030102030102030
MNIST d :764/train:50K/test:10K83 ± 112 ± 20 ± 094 ± 321 ± 100 ± 095 ± 252 ± 1714 ± 12
CIFAR d :3072/train:50K/test:10K89 ± 141 ± 015 ± 088 ± 423 ± 80 ± 092 ± 221 ± 80 ± 0
Imagenet d :150528/train:1M/test:100K81 ± 448 ± 122 ± 191 ± 130 ± 13 ± 090 ± 226 ± 31 ± 0
random projectionrandom projectionrandom projectionrandom Resnet18random Resnet18random Resnet18pretrained Resnet18pretrained Resnet18pretrained Resnet18
# selected dimensions102030102030102030
MNIST d :764/train:50K/test:10K83 ± 112 ± 20 ± 094 ± 321 ± 100 ± 095 ± 252 ± 1714 ± 12
CIFAR d :3072/train:50K/test:10K89 ± 141 ± 015 ± 088 ± 423 ± 80 ± 092 ± 221 ± 80 ± 0
Imagenet d :150528/train:1M/test:100K81 ± 448 ± 122 ± 191 ± 130 ± 13 ± 090 ± 226 ± 31 ± 0

Figure

Figure

References

[10.1214/aop/1022874826] Imre Bárány. (1999). {Sylvester’s Question: The Probability That $n$ Points are in Convex Position. The Annals of Probability. doi:10.1214/aop/1022874826.

[majumdar2010random] Majumdar, Satya N, Comtet, Alain, Randon-Furling, Julien. (2010). Random convex hulls and extreme value statistics. Journal of Statistical Physics.

[valtr1995probability] Valtr, Pavel. (1995). Probability thatn random points are in convex position. Discrete & Computational Geometry.

[valtr1996probability] Valtr, Pavel. (1996). The probability that n random points in a triangle are in convex position. Combinatorica.

[kabluchko2020absorption] Kabluchko, Zakhar, Zaporozhets, Dmitry. (2020). Absorption probabilities for Gaussian polytopes and regular spherical simplices. Advances in Applied Probability.

[kolmogoroff1941interpolation] Kolmogoroff, A. (1941). Interpolation und extrapolation von stationaren zufalligen folgen. Izvestiya Rossiiskoi Akademii Nauk. Seriya Matematicheskaya.

[devore1998nonlinear] DeVore, Ronald A. (1998). Nonlinear approximation. Acta numerica.

[zhang2007mlle] Zhang, Zhenyue, Wang, Jing. (2007). MLLE: Modified locally linear embedding using multiple weights. Advances in neural information processing systems.

[johnson1984extensions] Johnson, William B, Lindenstrauss, Joram. (1984). Extensions of Lipschitz mappings into a Hilbert space 26. Contemporary mathematics.

[spielman2004smoothed] Spielman, Daniel A, Teng, Shang-Hua. (2004). Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM).

[dasgupta2003elementary] Dasgupta, Sanjoy, Gupta, Anupam. (2003). An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms.

[zhang2004principal] Zhang, Zhenyue, Zha, Hongyuan. (2004). Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM journal on scientific computing.

[kruskal1964multidimensional] Kruskal, Joseph B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika.

[donoho2003hessian] Donoho, David L, Grimes, Carrie. (2003). Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences.

[belkin2003laplacian] Belkin, Mikhail, Niyogi, Partha. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation.

[roweis2000nonlinear] Roweis, Sam T, Saul, Lawrence K. (2000). Nonlinear dimensionality reduction by locally linear embedding. science.

[balasubramanian2002isomap] Balasubramanian, Mukund, Schwartz, Eric L, Tenenbaum, Joshua B, de Silva, Vin, Langford, John C. (2002). The isomap algorithm and topological stability. Science.

[van2008visualizing] Van der Maaten, Laurens, Hinton, Geoffrey. (2008). Visualizing data using t-SNE.. Journal of machine learning research.

[bishop2006pattern] Bishop, Christopher M. (2006). Pattern recognition. Machine learning.

[chatterji2021does] Chatterji, Niladri S, Long, Philip M, Bartlett, Peter L. (2021). When Does Gradient Descent with Logistic Loss Find Interpolating Two-Layer Networks?. Journal of Machine Learning Research.

[bennett1965representation] Bennett, Robert S. (1965). Representation and analysis of signals part xxi. the intrinsic dimensionality of signal collections.

[bietti2019inductive] Bietti, Alberto, Mairal, Julien. (2019). On the inductive bias of neural tangent kernels. arXiv preprint arXiv:1905.12173.

[belkin2018understand] Belkin, Mikhail, Ma, Siyuan, Mandal, Soumik. (2018). To understand deep learning we need to understand kernel learning. International Conference on Machine Learning.

[adlam2020neural] Adlam, Ben, Pennington, Jeffrey. (2020). The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. International Conference on Machine Learning.

[belkin2019does] Belkin, Mikhail, Rakhlin, Alexander, Tsybakov, Alexandre B. (2019). Does data interpolation contradict statistical optimality?. The 22nd International Conference on Artificial Intelligence and Statistics.

[Wiener1949ExtrapolationIA] N. Wiener. (1949). Extrapolation, Interpolation, and Smoothing of Stationary Time Series, with Engineering Applications.

[barany1988shape] B{'a. (1988). On the shape of the convex hull of random points. Probability theory and related fields.

[buchta1986conjecture] Buchta, Christian. (1986). On a conjecture of RE Miles about the convex hull of random points. Monatshefte f{.

[bib1] Adlam and Pennington, (2020) Adlam, B. and Pennington, J. (2020). The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning, pages 74–84. PMLR.

[bib2] Balasubramanian et al., (2002) Balasubramanian, M., Schwartz, E. L., Tenenbaum, J. B., de Silva, V., and Langford, J. C. (2002). The isomap algorithm and topological stability. Science, 295(5552):7–7.

[bib3] Bárány and Füredi, (1988) Bárány, I. and Füredi, Z. (1988). On the shape of the convex hull of random points. Probability theory and related fields, 77(2):231–240.

[bib4] Belkin et al., (2018) Belkin, M., Ma, S., and Mandal, S. (2018). To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pages 541–549. PMLR.

[bib5] Belkin and Niyogi, (2003) Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396.

[bib6] Bennett, (1965) Bennett, R. S. (1965). Representation and analysis of signals part xxi. the intrinsic dimensionality of signal collections. Technical report, Johns Hopkins University.

[bib7] Bietti and Mairal, (2019) Bietti, A. and Mairal, J. (2019). On the inductive bias of neural tangent kernels. arXiv preprint arXiv:1905.12173.

[bib8] Bishop, (2006) Bishop, C. M. (2006). Pattern recognition. Machine learning, 128(9).

[bib9] Buchta, (1986) Buchta, C. (1986). On a conjecture of re miles about the convex hull of random points. Monatshefte für Mathematik, 102(2):91–102.

[bib10] Chatterji et al., (2021) Chatterji, N. S., Long, P. M., and Bartlett, P. L. (2021). When does gradient descent with logistic loss find interpolating two-layer networks? Journal of Machine Learning Research, 22(159):1–48.

[bib11] Dasgupta and Gupta, (2003) Dasgupta, S. and Gupta, A. (2003). An elementary proof of a theorem of johnson and lindenstrauss. Random Structures & Algorithms, 22(1):60–65.

[bib12] DeVore, (1998) DeVore, R. A. (1998). Nonlinear approximation. Acta numerica, 7:51–150.

[bib13] Donoho and Grimes, (2003) Donoho, D. L. and Grimes, C. (2003). Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):5591–5596.

[bib14] Johnson and Lindenstrauss, (1984) Johnson, W. B. and Lindenstrauss, J. (1984). Extensions of lipschitz mappings into a hilbert space 26. Contemporary mathematics, 26.

[bib15] Kabluchko and Zaporozhets, (2020) Kabluchko, Z. and Zaporozhets, D. (2020). Absorption probabilities for gaussian polytopes and regular spherical simplices. Advances in Applied Probability, 52(2):588–616.

[bib16] Kolmogoroff, (1941) Kolmogoroff, A. (1941). Interpolation und extrapolation von stationaren zufalligen folgen. Izvestiya Rossiiskoi Akademii Nauk. Seriya Matematicheskaya, 5(1):3–14.

[bib17] Kruskal, (1964) Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27.

[bib18] Majumdar et al., (2010) Majumdar, S. N., Comtet, A., and Randon-Furling, J. (2010). Random convex hulls and extreme value statistics. Journal of Statistical Physics, 138(6):955–1009.

[bib19] Roweis and Saul, (2000) Roweis, S. T. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. science, 290(5500):2323–2326.

[bib20] Spielman and Teng, (2004) Spielman, D. A. and Teng, S.-H. (2004). Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463.

[bib21] Valtr, (1995) Valtr, P. (1995). Probability thatn random points are in convex position. Discrete & Computational Geometry, 13(3-4):637–643.

[bib22] Valtr, (1996) Valtr, P. (1996). The probability that n random points in a triangle are in convex position. Combinatorica, 16(4):567–573.

[bib23] Van der Maaten and Hinton, (2008) Van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-sne. Journal of machine learning research, 9(11).

[bib24] Wiener, (1949) Wiener, N. (1949). Extrapolation, interpolation, and smoothing of stationary time series, with engineering applications.

[bib25] Zhang and Wang, (2007) Zhang, Z. and Wang, J. (2007). Mlle: Modified locally linear embedding using multiple weights. In Advances in neural information processing systems, pages 1593–1600. Citeseer.

[bib26] Zhang and Zha, (2004) Zhang, Z. and Zha, H. (2004). Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM journal on scientific computing, 26(1):313–338.