Binary embeddings with structured hashed projections
Anna Choromanska*, Krzysztof Choromanski* (*equal contribution), Mariusz Bojarski, Tony Jebara, Sanjiv Kumar, Yann LeCun
Abstract
We consider the hashing mechanism for constructing binary embeddings, which involves pseudo-random projections followed by nonlinear (sign function) mappings. The pseudo-random projection is described by a matrix where not all entries are independent random variables; instead, a fixed ``budget of randomness'' is distributed across the matrix. Such matrices can be efficiently stored in sub-quadratic or even linear space, reduce randomness usage (i.e. the number of required random values), and very often lead to computational speed-ups. We prove several theoretical results showing that projections via various structured matrices followed by nonlinear mappings accurately preserve the angular distance between input high-dimensional vectors. To the best of our knowledge, these results are the first that give theoretical grounding for the use of general structured matrices in the nonlinear setting. In particular, they generalize previous extensions of the Johnson-Lindenstrauss lemma and prove the plausibility of the approach that was so far only heuristically confirmed for some special structured matrices. Consequently, we show that many structured matrices can be used as an efficient information compression mechanism. Our findings build a better understanding of certain deep architectures, which contain randomly weighted and untrained layers and yet achieve high performance on different learning tasks. We empirically verify our theoretical findings and show how learning via structured hashed projections affects the performance of a neural network as well as a nearest neighbor classifier.
ACHOROMA@CIMS.NYU.EDU KCHORO@GOOGLE.COM MBOJARSKI@NVIDIA.COM JEBARA@CS.COLUMBIA.EDU SANJIVK@GOOGLE.COM
Introduction
The paradigm of binary embedding for data compression is the central focus of this paper. The paradigm has been studied in earlier works (see: (Weiss et al., 2008), (Gong et al., 2012), (Gong et al., 2013a), (Wang et al., 2012), (Gong et al., 2013b), (Plan & Vershynin, 2014), (Yu et al., 2014), (Yi et al., 2015)); in particular it was observed that by using linear projections and then applying the sign function as a nonlinear map one does not completely lose the information about the angular distance between vectors; instead, this information can be approximately reconstructed from the Hamming distance between hashes. In this paper we are interested in using pseudo-random projections via structured matrices in the linear projection phase. The pseudo-random projection is described by a matrix where not all the entries are independent random variables; instead, a fixed 'budget of randomness' is distributed across the matrix. Thus such matrices can be efficiently stored in sub-quadratic or even linear space and reduce randomness usage. Moreover, using them often leads to computational speed-ups since they admit fast matrix-vector multiplication via the Fast Fourier Transform. We prove an extension of the Johnson-Lindenstrauss lemma (Sivakumar, 2002) for general pseudo-random structured projections followed by nonlinear mappings. We show that the angular distance between high-dimensional data vectors is approximately preserved in the hashed space. This result is also new compared to previous extensions (Hinrichs & Vybral, 2011; Vybral, 2011) of the Johnson-Lindenstrauss lemma, which consider special cases of our structured projections (namely: circulant matrices) and do not consider the action of the nonlinear mapping at all. We give a theoretical explanation of the approach that was so far only heuristically confirmed for some special structured matrices (see: (Yi et al., 2015), (Yu et al., 2014)).
Our theoretical findings imply that many types of matrices, such as circulant or Toeplitz Gaussian matrices, can be used as a preprocessing step in neural networks. Structured matrices were used before in different contexts, also in deep learning, e.g. (Saxe et al., 2011; Sindhwani et al., 2015; Cheng et al., 2015; Mathieu et al., 2014). Our theoretical results, however, extend to a more general class of matrices.
Our work has a primarily theoretical focus, but we also ask an empirical question: how may the action of a random projection followed by a nonlinear transformation influence learning? We focus on the deep learning setting, where the architecture contains completely random or pseudo-random structured layers that are not trained. Little is known from the theoretical point of view about these fast deep architectures, which achieve significant computational speed-ups and space usage reduction with little or no loss in performance (Saxe et al., 2011; Cheng et al., 2015; Sindhwani et al., 2015; Jarrett et al., 2009; Pinto et al., 2009; Pinto & Cox, 2010; Huang et al., 2006). The high-level intuition justifying the success of these approaches is that the performance of a deep learning system depends not only on learning, but also on the intrinsic properties of the architecture. These findings coincide with the notion of high redundancy in network parametrization (Denil et al., 2013; Denton et al., 2014; Choromanska et al., 2015). In this paper we consider a simple model of a fully-connected feed-forward neural network where the input layer is hashed by a structured pseudo-random projection followed by a point-wise nonlinear mapping. Thus the input is effectively hashed and learning is conducted in the fully-connected subsequent layers that act in the hashed space. We empirically verify how the distortion introduced in the first layer by hashing (where we reduce the dimensionality of the data) affects the performance of the network (in the supervised learning setting). Finally, we show how our structured nonlinear embeddings can be used in the k-nn setting (Altman, 1992).
This article is organized as follows: Section 2 discusses related work, Section 3 explains the hashing mechanism, Section 4 provides theoretical results, Section 5 shows experimental results, and Section 6 concludes. Supplement contains additional proofs and experimental results.
Related work
The idea of using random projections to facilitate learning with high-dimensional data stems from the early work on random projections (Dasgupta, 1999), showing in particular that learning of high-dimensional mixtures of Gaussians can be simplified when first projecting the data into a randomly chosen subspace of low dimension (this is a consequence of the curse of dimensionality and the fact that high-dimensional data often has low intrinsic dimensionality). This idea was subsequently successfully applied to both synthetic and real datasets (Dasgupta, 2000; Bingham & Mannila, 2001), and then adopted in a number of learning approaches such as random projection trees (Dasgupta & Freund, 2008), kernel and feature-selection techniques (Blum, 2006), clustering (Fern & Brodley, 2003), privacy-preserving machine learning (Liu et al., 2006; Choromanska et al., 2013), learning with large databases (Achlioptas, 2003), sparse learning settings (Li et al., 2006), and more recently deep learning (see (Saxe et al., 2011) for a convenient review of such approaches). Using linear projections with completely random Gaussian weights, instead of learned ones, was recently studied from both theoretical and practical points of view in (Giryes et al., 2015), but that work did not consider structured matrices, which are a central point of our interest since structured matrices can be stored much more efficiently. Beyond applying methods that use random Gaussian matrix projections (Dasgupta, 1999; 2000; Giryes et al., 2015) and random binary matrix projections (Achlioptas, 2003), it is also possible to construct deterministic projections that preserve angles and distances (Jafarpour et al., 2009). In some sense these methods use structured matrices as well, yet they do not have the same projection efficiency as the circulant matrices and projections explored in this article.
Our hybrid approach, where a fixed 'budget of randomness' is distributed across the entire matrix in a structured way, enables us to take advantage of both the ability of a completely random projection to preserve information and the compactness that comes from the highly-organized internal structure of the linear mapping.
This work studies the paradigm of constructing a binary embedding for data compression, where hashes are obtained by applying linear projections to the data followed by nonlinear (sign function) mappings. The point-wise nonlinearity was not considered in many previous works on structured matrices (Haupt et al., 2010; Rauhut et al., 2010; Krahmer et al., 2014; Yap et al., 2011) (moreover, note that these works also consider a set of structured matrices which is a strict subset of the class of matrices considered here). Designing binary embeddings for high-dimensional data with low distortion is addressed in many recent works (Weiss et al., 2008; Wang et al., 2012; Gong et al., 2013b;a; 2012; Yu et al., 2014; Yi et al., 2015; Raginsky & Lazebnik, 2009; Salakhutdinov & Hinton, 2009). In the context of our work, one of the recent articles (Yi et al., 2015) is especially important, since the authors introduce a pipeline of constructing hashes with the use of structured matrices in the linear step, instead of completely random ones. They prove several theoretical results regarding the quality of the produced hash, and extend some previous theoretical results (Jacques et al., 2011; Plan & Vershynin, 2014). Their pipeline is more complicated than ours, i.e. they first apply the Hadamard transformation and then a sequence of partial Toeplitz Gaussian matrices. Some general results (unbiasedness of the angular distance estimator) were also known for short hashing pipelines involving circulant matrices (Yu et al., 2014). These works do not provide guarantees regarding the concentration of the estimator around its mean, which is crucial for all practical applications.
Our results for general structured matrices, which include circulant Gaussian matrices and a larger class of Toeplitz Gaussian matrices as special subclasses, provide such concentration guarantees, and thus establish a solid mathematical foundation for using various types of structured matrices in binary embeddings. In contrast to (Yi et al., 2015), we present our theoretical results for simpler hashing models (our hashing mechanism is explained in Section 3 and consists of two very simple steps that we call preprocessing step and the actual hashing step, where
the latter consists of the pseudo-random projection followed by the nonlinear mapping). In (Yu et al., 2015), theoretical guarantees bounding the variance of the angle estimator in the circulant setting were presented. Strong concentration results regarding several structured matrices were given in (Choromanski & Sindhwani, 2016; Choromanski et al., 2016), following our work.
In the context of deep learning, using random network parametrization, where certain layers have random and untrained weights, often accelerates training. Introducing randomness to networks was explored for various architectures, for example feedforward networks (Huang et al., 2006), convolutional networks (Jarrett et al., 2009; Saxe et al., 2011), and recurrent networks (Jaeger & Haas, 2004; White et al., 2004; Boedecker et al., 2009). We also refer the reader to (Ganguli & Sompolinsky, 2012), where the authors describe how neural systems cope with the challenge of processing data in high dimensions and discuss random projections. Hashing in neural networks, which we consider in this paper, is a new direction of research. Very recently (see: (Chen et al., 2015)) it was empirically shown that hashing in neural nets may achieve drastic reduction in model sizes with no significant loss of quality, by heavily exploiting the phenomenon of redundancy in neural nets. HashedNets, introduced in (Chen et al., 2015), do not come with any theoretical guarantees regarding the quality of the proposed hashing. Our work aims to cover both grounds: we experimentally show the plausibility of the approach, but also explain theoretically why the hashing we consider compresses important information about the data that suffices for accurate classification. Dimensionality reduction techniques were also used to approximately preserve certain metrics defined on graph objects (Shaw & Jebara, 2009). Structured hashing was also applied in (Szlam et al., 2012), but in a very different context than ours.
Hashing mechanism
In this section we explain our hashing mechanism for dimensionality reduction that we next analyze.
Structured matrices
We first introduce the aforementioned family of structured matrices, which we call Ψ-regular matrices P. This is the key ingredient of the method.
Definition 3.1. M is a circulant Gaussian matrix if its first row is a sequence of independent Gaussian random variables drawn from the distribution N(0, 1) and each subsequent row is obtained from the previous one by either only one-left shifts or only one-right shifts.
Definition 3.2. M is a Toeplitz Gaussian matrix if each of its descending diagonals from left to right is of the form ( g, ..., g ) for some g ∼ N (0 , 1) and different descending diagonals are independent.
Remark 3.1. The circulant Gaussian matrix with right shifts is a special type of the Toeplitz Gaussian matrix.
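To make Definitions 3.1 and 3.2 concrete, here is a minimal sketch (in Python/NumPy, ours, not part of the original paper) of how such matrices can be generated; the function names are ours:

```python
import numpy as np

def circulant_gaussian(n, rng=None):
    """Circulant Gaussian matrix: rows are one-right shifts of the first row."""
    rng = np.random.default_rng(rng)
    first = rng.standard_normal(n)  # i.i.d. N(0, 1) entries
    return np.stack([np.roll(first, i) for i in range(n)])

def toeplitz_gaussian(k, n, rng=None):
    """Toeplitz Gaussian matrix: one independent N(0, 1) value per diagonal."""
    rng = np.random.default_rng(rng)
    diags = rng.standard_normal(k + n - 1)  # k + n - 1 descending diagonals
    # entry (i, j) depends only on j - i, shifted to a non-negative index
    return np.array([[diags[j - i + k - 1] for j in range(n)] for i in range(k)])
```

As Remark 3.1 states, a circulant matrix with right shifts is a special Toeplitz matrix: every entry depends only on j − i.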
Assume that k is the size of the hash and n is the dimensionality of the data.
Definition 3.3. Let t be the size of the pool of independent random Gaussian variables {g_1, ..., g_t}, where each g_i ∼ N(0, 1). Assume that k ≤ n ≤ t ≤ kn. Let Ψ be a natural number, i.e. Ψ ∈ N. P is a Ψ-regular random matrix if it is of the form

P = ( Σ_{l ∈ S_{i,j}} g_l )_{i = 1, ..., k; j = 1, ..., n},

where S_{i,j} ⊆ {1, ..., t} for i ∈ {1, ..., k}, j ∈ {1, ..., n}, |S_{i,1}| = ... = |S_{i,n}| for i = 1, ..., k, S_{i,j_1} ∩ S_{i,j_2} = ∅ for i ∈ {1, ..., k}, j_1, j_2 ∈ {1, ..., n}, j_1 ≠ j_2, and furthermore the following holds: for any two different rows R_1, R_2 of P the number of random variables g_l, l ∈ {1, ..., t}, such that g_l is in the intersection of some column with both R_1 and R_2 is at most Ψ.
In the experimental section of this paper we consider six different kinds of structured matrices, which are examples of general structured matrices covered by our theoretical analysis. Those are:
· Circulant (see: Definition 3.1),
· BinCirc - a matrix where the first row is partitioned into consecutive equal-length blocks of elements and each row is obtained by a right shift of the blocks from the first row,
· HalfShift - a matrix where each next row is obtained from the previous one by swapping its halves and then performing a right shift by one,
· VerHorShift - a matrix obtained by the following two-phase procedure: first each row is obtained from the previous one by a right shift of a fixed length, and then in the obtained matrix each column is shifted up by a fixed number of elements,
· BinPerm - a matrix where the first row is partitioned into consecutive equal-length blocks of elements and each row is obtained as a random permutation of the blocks from the first row,
· Toeplitz (see: Definition 3.2).
Remark 3.3. Computing hashes for the structured matrices Toeplitz, BinCirc, HalfShift, and VerHorShift can be done faster than in time O(nk) (e.g. for Toeplitz one can use FFT to reduce the computation to O(n log k)). Thus our structured approach leads to speed-ups, storage compression (since many structured matrices covered by our theoretical model can be stored in linear space), and reduction in randomness usage. The goal of this paper is not to analyze fast matrix-vector product algorithms in detail, since that requires a separate paper. We point out, however, that well-known fast matrix-vector product algorithms are among the key advantages of our structured approach.
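The FFT speed-up mentioned in the remark can be sketched as follows for the circulant case (a NumPy sketch of ours, with right-shifted rows, so that C[i, j] = a[(j − i) mod n]; the general Toeplitz case reduces to it by embedding into a circulant matrix of twice the size):

```python
import numpy as np

def circulant_matvec_fft(a, x):
    """Compute C @ x in O(n log n) time, where C[i, j] = a[(j - i) % n]."""
    # y_i = sum_j a[(j - i) % n] * x[j] is a circular cross-correlation,
    # which diagonalizes in the Fourier domain (conjugate because it is a
    # correlation, not a convolution).
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(x)))
```

The sketch can be checked against the naive O(n²) product built from explicit row shifts.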
Hashing methods
Let φ be a function satisfying lim_{x→∞} φ(x) = 1 and lim_{x→-∞} φ(x) = -1. We will consider two hashing methods, both of which consist of what we refer to as a preprocessing step followed by the actual hashing step, where the latter consists of a pseudo-random projection followed by a nonlinear (sign function) mapping. The first mechanism, which we call extended Ψ-regular hashing, applies first a random diagonal matrix R to the data point x, then the L_2-normalized Hadamard matrix H, next another random diagonal matrix D, then the Ψ-regular projection matrix P_Ψ, and finally the function φ (the latter applied point-wise), i.e. it computes the hash

h(x) = φ(P_Ψ D H R x).
The diagonal entries of the matrices R and D are chosen independently from the binary set {-1, 1}, each value being chosen with probability 1/2. We also propose a shorter pipeline, which we call short Ψ-regular hashing, where we avoid applying the first random matrix R and the Hadamard matrix H, i.e. the overall pipeline computes

h(x) = φ(P_Ψ D x).
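Both pipelines can be sketched in a few lines (our NumPy sketch; φ is the sign function, P an arbitrary projection matrix, the diagonal matrices are passed as ±1 vectors, and n must be a power of 2 for the Hadamard step):

```python
import numpy as np

def hadamard(n):
    """L2-normalized Hadamard matrix; n must be a power of 2."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def short_hash(x, P, D):
    """Short hashing: x -> phi(P D x)."""
    return np.where(P @ (D * x) >= 0, 1.0, -1.0)

def extended_hash(x, P, D, R):
    """Extended hashing: x -> phi(P D H R x)."""
    H = hadamard(len(x))
    return np.where(P @ (D * (H @ (R * x))) >= 0, 1.0, -1.0)
```

Since H is orthogonal (H Hᵀ = I), the HR preprocessing changes no angular distances; it only balances the input vectors.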
The goal is to compute a good approximation of the angular distance between given vectors p, r from their compact hashed versions h(p), h(r). To achieve this goal we consider the L_1-distances in the k-dimensional space of hashes. Let θ_{p,r} denote the angle between vectors p and r. We define the normalized approximate angle between p and r as:

θ̃^n_{p,r} = (1/(2k)) ‖h(p) - h(r)‖_1.
In the next section we show that the normalized approximate angle between vectors p and r leads to a very precise estimation of the actual angle for φ(x) = sign(x) if the chosen parameter Ψ is not too large. Furthermore, we show an intriguing connection between the theoretical guarantees regarding the quality of the produced hash and the chromatic number of a specific undirected graph encoding the structure of P. For many of the structured matrices under consideration this graph is induced by an algebraic group operation defining the structure of P (for instance, for the circulant matrix the group is generated by a single shift and the underlying graph is a collection of pairwise disjoint cycles, thus its chromatic number is at most 3).
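The estimator can be sanity-checked numerically. The sketch below (ours; sizes are illustrative) hashes two vectors with a circulant Gaussian matrix under the short pipeline and recovers the angle from the normalized L1 distance between hashes, scaled by π:

```python
import numpy as np

def estimate_angle(p, r, k, rng=None):
    """Estimate the angle between p and r from k-bit sign hashes (k <= n)."""
    rng = np.random.default_rng(rng)
    n = len(p)
    first = rng.standard_normal(n)
    P = np.stack([np.roll(first, i) for i in range(k)])  # circulant rows
    D = rng.choice([-1.0, 1.0], size=n)                  # random +-1 diagonal
    hp = np.where(P @ (D * p) >= 0, 1.0, -1.0)
    hr = np.where(P @ (D * r) >= 0, 1.0, -1.0)
    # normalized approximate angle, scaled back to radians
    return np.pi * np.abs(hp - hr).sum() / (2.0 * k)
```

For dense (balanced) inputs and a reasonably large hash, the estimate lands close to the true angle, in line with the unbiasedness and concentration results of Section 4.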
Theoretical results
Unbiasedness of the estimator
We are ready to provide theoretical guarantees regarding the quality of the produced hash. Our guarantees will be given for the sign function, i.e. for φ defined as: φ(x) = 1 for x ≥ 0, φ(x) = -1 for x < 0. Using this nonlinearity is important to preserve approximate information about the angle between vectors while filtering out the information about their lengths. We first show that θ̃^n_{p,r} is an unbiased estimator of θ_{p,r}/π, i.e. E(θ̃^n_{p,r}) = θ_{p,r}/π.
Lemma 4.1. Let M be a Ψ-regular hashing model (either extended or short) and let ‖p‖_2 = ‖r‖_2 = 1. Then θ̃^n_{p,r} is an unbiased estimator of θ_{p,r}/π, i.e. E(θ̃^n_{p,r}) = θ_{p,r}/π.
Let us first give a short sketch of the proof. Note that the value of a particular entry in the constructed hash depends only on the sign of the dot product between the corresponding Gaussian vector representing the row of the Ψ-regular matrix and the given vector. Fix two vectors p and r with angular distance θ. Note that the considered dot products (and thus also their signs) are preserved when, instead of taking the Gaussian vector representing the row, one takes its projection onto the linear space spanned by p and r. The Hamming distance between the hashes of p and r is built up by those entries for which one dot product is negative and the other one is positive. One can note that this happens exactly when the projected vector falls inside a 2-dimensional cone covering an angle of 2θ. The last observation that completes the proof is that the projection of the Gaussian vector is isotropic (since it is also Gaussian); thus the probability that the two dot products have different signs is θ/π (see also: (Charikar, 2002)).
Proof. Note first that the i-th row of the matrix P, call it g_i, is an n-dimensional Gaussian vector with mean 0 where each element has variance σ_i² for σ_i² = |S_{i,1}| = ... = |S_{i,n}| (i = 1, ..., k). Thus, after applying the matrix D, the new vector g_i^D is still Gaussian and of the same distribution. Let us first consider the short Ψ-regular hashing model. Fix some vectors p, r (without loss of generality we may assume that they are not collinear). Let H_{p,r}, shortly called H, be the 2-dimensional hyperplane spanned by {p, r}. Denote by g_i^{D,H} the projection of g_i^D onto H and by g_i^{D,H,⊥} the line in H perpendicular to g_i^{D,H}. Let φ be the sign function. Note that the contribution to the L_1-sum ‖h(p) - h(r)‖_1 comes from those g_i for which g_i^{D,H,⊥} divides the angle between p and r (see: Figure 1), i.e. from those g_i for which g_i^{D,H} is inside the union U_{p,r} of two 2-dimensional cones bounded by the two lines in H perpendicular to p and r respectively. If the angle is not divided (see: Figure 2) then the two corresponding entries in the hash have the same value and thus do not contribute to the overall distance between the hashes.
From what we have just said, we can conclude that θ̃^n_{p,r} = (X_1 + ... + X_k)/k, where X_i = 1 if g_i^{D,H} ∈ U_{p,r} and X_i = 0 otherwise.
Now it suffices to note that the vector g_i^{D,H} is a 2-dimensional Gaussian vector and thus its direction is uniformly distributed over all directions. Thus each X_i is nonzero with probability exactly θ_{p,r}/π and the lemma follows. For the extended Ψ-regular hashing model the analysis is very similar. The only difference is that the data is preprocessed by applying the linear mapping HR first. Both H and R are orthogonal matrices, though, so their product is also an orthogonal matrix. Since orthogonal matrices do not change angular distances, the former analysis applies again, which completes the proof.
We next focus on the concentration of the random variable θ̃^n_{p,r} around its mean θ_{p,r}/π. We prove strong exponential concentration results for the extended Ψ-regular hashing method. Interestingly, the Hadamard mechanism is not strictly necessary: concentration results, though weaker than in the former case, can also be obtained for short Ψ-regular hashing.
The P-chromatic number
The highly organized structure of the projection matrix P gives rise to an underlying undirected graph that encodes dependencies between different entries of P. More formally, let us fix two rows of P of indices 1 ≤ k_1 < k_2 ≤ k respectively. We define a graph G_P(k_1, k_2) as follows:
· V(G_P(k_1, k_2)) = {{j_1, j_2} : ∃ l ∈ {1, ..., t} s.t. g_l ∈ S_{k_1,j_1} ∩ S_{k_2,j_2}, j_1 ≠ j_2},
· there exists an edge between vertices {j_1, j_2} and {j_3, j_4} iff {j_1, j_2} ∩ {j_3, j_4} ≠ ∅.
The chromatic number χ ( G ) of the graph G is the minimal number of colors that can be used to color the vertices of the graph in such a way that no two adjacent vertices have the same color.
Definition 4.1. Let P be a Ψ-regular matrix. We define the P-chromatic number χ(P) as:

χ(P) = max_{1 ≤ k_1 < k_2 ≤ k} χ(G_P(k_1, k_2)).
The graph associated with each structured matrix, as just described, enables us to encode the dependencies between entries of the structured matrix in a compact form and gives us a quantitative way to measure these dependencies by analyzing core parameters of this graph, such as its chromatic number. More dependencies, which usually correspond to a more structured form, mean more edges in the associated graph and often lead to a higher chromatic number. On the other hand, fewer dependencies produce graphs with a much lower chromatic number (see Figure 3; the graph associated with the circulant matrix is a collection of vertex-disjoint cycles and thus has chromatic number 3 if it contains an odd-length cycle and 2 otherwise).
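For intuition, the graph G_P(k_1, k_2) can be built explicitly for small structured matrices. The sketch below (ours) does this for a circulant matrix, where entry (i, j) holds the single pool variable g_{(j−i) mod n}, and bounds the chromatic number with a greedy coloring (greedy uses at most Δ + 1 colors, and here every vertex has degree at most 2):

```python
import numpy as np
from itertools import combinations

def circulant_index_sets(k, n):
    """S[i][j] = index of the single pool variable at entry (i, j)."""
    return [[(j - i) % n for j in range(n)] for i in range(k)]

def dependency_graph(S, k1, k2):
    """Vertices and edges of G_P(k1, k2); pairs kept as ordered tuples."""
    n = len(S[0])
    vertices = [(j1, j2) for j1 in range(n) for j2 in range(n)
                if j1 != j2 and S[k1][j1] == S[k2][j2]]
    edges = [(u, v) for u, v in combinations(vertices, 2) if set(u) & set(v)]
    return vertices, edges

def greedy_coloring(vertices, edges):
    """Number of colors used by greedy coloring (upper-bounds chi)."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    color = {}
    for v in vertices:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return max(color.values(), default=-1) + 1
```

For the circulant case the vertices form disjoint cycles, so the greedy bound matches the chromatic-number bound of at most 3 mentioned in the text.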
Concentration inequalities for structured hashing with sign function
We now present our main theoretical results. The proofs are deferred to the Supplementary material. We focus on concentration results regarding the produced hashes, which are crucial for practical applications of the proposed scheme.
We first start with the short description of the methods used and then rigorously formulate all the results. If all the rows
of the projection matrix are independent, then standard concentration inequalities can be used. This is however not the case in our setting, since the matrix is structured. We still want to show that any two rows are 'close' to independent Gaussian vectors; this gives us bounds on the variance of the distance between the hashes (in general, we observe that any system of k rows is 'close' to a system of k independent Gaussian vectors, and obtain bounds involving k-th moments). We proceed as follows:
· We take two rows and project them onto the linear space spanned by the given vectors p and r.
· We consider the four coordinates obtained in this way (two for each vector). They are obviously Gaussian, but, crucially, they are 'almost independent'.
· The latter observation is implied by the fact that these are the coordinates of the projection of a fixed Gaussian vector onto 'almost orthogonal' directions.
· We use the property of the Gaussian vector that its projections onto orthogonal directions are independent.
· To prove that the directions considered in our setting are close to orthogonal with high probability, we compute their dot product. This is the place where the structure of the matrix, the chromatic number of the underlying graph, and the fact that our hashing scheme uses random diagonal matrices come into action. We decompose each dot product into, roughly speaking, χ components (where χ is the chromatic number), such that each component is a sum of independent random variables with mean 0. Now we can use standard concentration inequalities to get tight concentration results.
· The Hadamard matrix used in the extended model preprocesses input vectors to distribute their mass uniformly over all the coordinates while not changing L_2 distances (it is a unitary matrix). Balanced vectors lead to much stronger concentration results.
Now we are ready to rigorously state our results. By poly(x) we denote a function x^r for some r > 0. The following theorems guarantee strong concentration of θ̃^n_{p,r} around its mean and therefore justify theoretically the effectiveness of the structured hashing method.
Let us consider first the extended Ψ -regular hashing model.
Theorem 4.1. Consider the extended Ψ-regular hashing model M with t independent Gaussian random variables g_1, ..., g_t, each of distribution N(0, 1). Let N be the size of the dataset D. Denote by k the size of the hash and by n the dimensionality of the data. Let f(n) be an arbitrary positive function. Let θ_{p,r} be the angular distance between vectors p, r ∈ D. Then for a = o_n(1), ε > 0, t ≥ n and n large enough:
Note how the upper bound on the probability of failure P_ε depends on the P-chromatic number. The theorem above guarantees strong concentration of θ̃^n_{p,r} around its mean and therefore justifies theoretically the effectiveness of the structured hashing method. This becomes clearer below.
As a corollary, we obtain the following result:
Corollary 4.1. Consider the extended Ψ-regular hashing model M. Assume that the projection matrix P is Toeplitz Gaussian. Let N, n, k be as above and denote by θ_{p,r} the angular distance between vectors p, r ∈ D. Then the following is true for n large enough:

Corollary 4.1 follows from Theorem 4.1 by taking a = n^{-1/3}, ε = k^{-1/3}, f(n) = n^p for a small enough constant p > 0, and noticing that every Toeplitz Gaussian matrix is 0-regular and the corresponding P-chromatic number χ(P) is at most 3.
The term O(N² e^{-poly(n)}) is related to the balancedness property. To clarify, the goal of multiplying by HR in the preprocessing step is to make each input vector balanced, in other words to spread the mass of the vector across all dimensions in an approximately uniform way. This property is required to obtain our theoretical results (note that it was unnecessary in the unstructured setting) and does not depend on the number of projected dimensions.
Let us now consider the short Ψ-regular hashing model. The theorem presented below is an application of Chebyshev's inequality, preceded by a careful analysis of the variance of θ̃^n_{p,r},
and thus for any c > 0 and p, r ∈ D :
Figure 4 shows the dependence of the upper bound on the variance of the normalized approximate angle θ̃^n_{p,r} on, respectively, the true angular distance θ_{p,r} and the size of the hash k, when, respectively, k and θ_{p,r} are fixed.
The rate k^{-1/3} that appears in our theoretical results, as well as the nonlinear decay of the variance with k in Figure 4 (right), is a consequence of the structured setting, where the quality of the nonlinear embedding is affected by the existence of dependencies between entries of the structured matrix.
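The variance decay with k can also be observed empirically. Our small Monte Carlo sketch below estimates the variance of θ̃^n_{p,r} for a Toeplitz Gaussian projection in the short pipeline at different hash sizes; the sizes and trial counts are illustrative, not from the paper:

```python
import numpy as np

def angle_estimator_variance(p, r, k, trials, seed=0):
    """Monte Carlo variance of the normalized angle estimator (Toeplitz P)."""
    rng = np.random.default_rng(seed)
    n = len(p)
    ests = []
    for _ in range(trials):
        diags = rng.standard_normal(k + n - 1)  # fresh Toeplitz matrix
        P = np.array([[diags[j - i + k - 1] for j in range(n)] for i in range(k)])
        D = rng.choice([-1.0, 1.0], size=n)     # fresh random +-1 diagonal
        hp = np.where(P @ (D * p) >= 0, 1.0, -1.0)
        hr = np.where(P @ (D * r) >= 0, 1.0, -1.0)
        ests.append(np.abs(hp - hr).sum() / (2.0 * k))  # estimate of theta / pi
    return float(np.var(ests))
```

Running this for a small and a large k shows the variance shrinking as the hash grows, consistent with the concentration results above.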
Numerical experiments
In this section we demonstrate that all considered structured matrices achieve reasonable performance in comparison to fully random matrices. Specifically, we show: i) the dependence of the performance on the size of the hash and on the reduction factor n/k for different structured matrices, and ii) the performance of different structured matrices when used with neural networks and the 1-NN classifier. The experiments confirm our theoretical results.

We performed experiments on the MNIST dataset downloaded from http://yann.lecun.com/exdb/mnist/. The data was preprocessed according to the short hashing scheme (the extended hashing scheme gave results of no significant statistical difference) before being given to the input of the network (preprocessing is discussed in Section 3). We first considered a simple model of the fully-connected feed-forward neural network with two hidden layers, where the first hidden layer had k units that use the sign nonlinearity (we explored k = {16, 32, 64, 128, 256, 512, 1024}), and the second hidden layer had 100 units that use the ReLU nonlinearity. The size of the second hidden layer was chosen as follows. We first investigated the dependence of the test error on this size in the case when n = k and the inputs, instead of being randomly projected, are multiplied by the identity (this is equivalent to eliminating the first hidden layer). We then chose as the size the threshold below which test performance rapidly deteriorated.

The first hidden layer contains random untrained weights, and we only train the parameters of the second layer and the output layer. The network we consider is shown in Figure 5. Each experiment was initialized from a random set of parameters sampled uniformly within the unit cube, and was repeated 1000 times. All networks were trained for 30 epochs using SGD (Bottou, 1998). We report the experiments with a constant learning rate (we also explored learning rate decay, but obtained similar results), where the learning rate was chosen from the set $\{0.0005, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1\}$ to minimize the test error. The weights of the first hidden layer correspond to the entries of the 'preprocessed' structured matrix. We explored seven kinds of random matrices (the first six are structured): Circulant, Toeplitz, HalfShift, VerHorShift, BinPerm, BinCirc, and Random (entries are independent and drawn from the Gaussian distribution $\mathcal{N}(0,1)$). All codes were implemented in Torch7.

Table 2: Memory complexity and number of required random values for structured matrices and the Random matrix.
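For concreteness, the first two structured families above can be generated as follows (a minimal NumPy sketch of our own, not the Torch7 code used in the experiments); the $k$-bit hash is then simply the sign of the projection:

```python
import numpy as np

rng = np.random.default_rng(0)

def circulant_gaussian(k, n, rng):
    """Circulant Gaussian matrix: a single random row of n values,
    each subsequent row obtained from the previous one by a one-right shift."""
    g = rng.standard_normal(n)
    return np.stack([np.roll(g, i) for i in range(k)])

def toeplitz_gaussian(k, n, rng):
    """Toeplitz Gaussian matrix: constant descending diagonals,
    built from a pool of k + n - 1 independent Gaussian values."""
    g = rng.standard_normal(k + n - 1)   # g[i - j + n - 1] fills diagonal i - j
    return np.array([[g[i - j + n - 1] for j in range(n)] for i in range(k)])

n, k = 8, 4
x = rng.standard_normal(n)

C = circulant_gaussian(k, n, rng)
T = toeplitz_gaussian(k, n, rng)
hash_c = np.sign(C @ x)          # k-bit hash via the circulant projection
hash_t = np.sign(T @ x)          # k-bit hash via the Toeplitz projection
```

Both matrices need only $O(n)$ and $O(n + k)$ stored random values respectively, in line with the storage savings discussed in the text.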
Figure 6a shows how the mean test error is affected by the size of the hash, and Figure 6b shows how the mean test error changes with the size of the reduction, defined as the ratio $n/k$. In Table 1 we report both the mean and the standard deviation (std) of the test error across our experiments. Training results are reported in the Supplementary material. Baseline refers to the network with one hidden layer containing 100 hidden units, where all parameters are trained.
Experimental results show how the performance is affected by using structured hashed projections to reduce data dimensionality. Figure 6b and Table 1 show a close to linear dependence between the error and the size of the reduction. Simultaneously, this approach leads to computational savings and a reduction of memory storage, i.e. a reduction of the number of input weights for the hidden layer (for example, for the Circulant matrix this reduction is of the order $O(n/k)$; the memory required for storing the Circulant matrix itself is negligible compared to the number of weights). Memory complexity, i.e. the memory required to store the matrix, and the number of required random values for the different structured matrices and the Random matrix are summarized in Table 2.
Experiments show that using a fully random matrix gives the best performance, as predicted by the theory. The BinPerm matrix exhibits performance comparable to the Random matrix, which might be explained by the fact that applying a permutation itself adds an additional source of randomness. The next best performer is HalfShift, whose generation uses fewer random values than that of BinPerm or Random. Thus its performance, as expected, is worse than for these two other matrices. However, as opposed to the BinPerm and Random matrices, the HalfShift matrix can be stored in linear space. The results also show that in general all structured matrices perform relatively well for medium-size reductions. Finally, all structured matrices except for BinPerm lead to the biggest memory savings and require the smallest 'budget of randomness'. Moreover, they often lead to computational efficiency, e.g. Toeplitz matrix-vector multiplications can be efficiently implemented via the Fast Fourier Transform (Yu et al., 2014). But, as mentioned before, faster than naive matrix-vector product computations can be performed also for BinPerm, HalfShift, and VerHorShift.
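The FFT trick mentioned above can be sketched as follows (an illustrative implementation of the standard circulant-embedding technique, not the authors' code): a $k \times n$ Toeplitz matrix is embedded into a circulant matrix of size $n + k - 1$, which the DFT diagonalizes, so the product costs three FFTs.

```python
import numpy as np

rng = np.random.default_rng(0)

def toeplitz_matvec_fft(c, r, x):
    """Multiply a k x n Toeplitz matrix by a vector in O((n + k) log(n + k)).

    c: first column (length k), r: first row (length n), with c[0] == r[0].
    The Toeplitz matrix is embedded into an (n + k - 1)-dim circulant matrix,
    whose action is a circular convolution, computed via FFT.
    """
    k, n = len(c), len(r)
    m = k + n - 1
    v = np.concatenate([c, r[1:][::-1]])      # first column of the circulant embedding
    x_pad = np.concatenate([x, np.zeros(m - n)])
    y = np.fft.ifft(np.fft.fft(v) * np.fft.fft(x_pad))
    return np.real(y[:k])

# sanity check against the dense product
k, n = 5, 7
c = rng.standard_normal(k)
r = np.concatenate([c[:1], rng.standard_normal(n - 1)])
x = rng.standard_normal(n)
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)] for i in range(k)])
err = np.max(np.abs(toeplitz_matvec_fft(c, r, x) - T @ x))
```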
Finally, we also report how the performance of the $1$-NN algorithm is affected by using structured hashed projections for the dimensionality reduction. We obtained plots similar to those for the case of neural networks; they are captured in Figure 7 (the original plot is in the Supplement). The table showing the mean and the standard deviation of the test error for the experiments with $1$-NN is enclosed in the Supplementary material.
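The 1-NN-on-hashes setup can be sketched as follows (an illustrative NumPy toy on synthetic blobs, not the MNIST pipeline of the paper; the blob parameters are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def hash_codes(X, P):
    """Binary embeddings: sign of the (structured or random) projection."""
    return (X @ P.T > 0)                 # boolean codes, one row per data point

def one_nn_hamming(train_codes, train_labels, test_codes):
    """1-NN classifier in the hashed space under the Hamming distance."""
    preds = []
    for c in test_codes:
        d = np.count_nonzero(train_codes != c, axis=1)   # Hamming distances
        preds.append(train_labels[np.argmin(d)])
    return np.array(preds)

# toy sanity check on two well-separated Gaussian blobs
n, k = 64, 32
P = rng.standard_normal((k, n))          # fully random baseline projection
X_train = np.concatenate([rng.normal(+2, 1, (50, n)), rng.normal(-2, 1, (50, n))])
y_train = np.array([0] * 50 + [1] * 50)
X_test = np.concatenate([rng.normal(+2, 1, (20, n)), rng.normal(-2, 1, (20, n))])
y_test = np.array([0] * 20 + [1] * 20)

acc = np.mean(one_nn_hamming(hash_codes(X_train, P), y_train,
                             hash_codes(X_test, P)) == y_test)
```

Swapping `P` for any of the structured matrices above changes only the projection step; the classifier itself operates purely on the binary codes.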
Conclusions
This paper shows that structured hashed projections preserve well the angular distance between input data instances. Our theoretical results consider mapping the data to a lower-dimensional space using various structured matrices, where the structured linear projections are followed by the sign nonlinearity. This nonlinear operation was not considered for such a wide range of structured matrices in previous related theoretical works. The theoretical setting naturally applies to the multilayer network framework, where the basic components of the architecture perform matrix-vector multiplication followed by a nonlinear mapping. We empirically verify our theoretical findings and show how using structured hashed projections for dimensionality reduction affects the performance of neural network and nearest neighbor classifiers.
Proof of Theorem 4.1
We start with the following technical lemma:
Lemma 7.1. Let $\{Z_1, ..., Z_k\}$ be a set of $k$ independent random variables defined on $\Omega$ such that each $Z_i$ has the same distribution and $Z_i \in \{0,1\}$. Let $\{\mathcal{F}_1, ..., \mathcal{F}_k\}$ be a set of events, where each $\mathcal{F}_i$ is in the $\sigma$-field defined by $Z_i$ (in particular $\mathcal{F}_i$ does not depend on the $\sigma$-field $\sigma(Z_1, ..., Z_{i-1}, Z_{i+1}, ..., Z_k)$). Assume that there exists $\mu < \frac{1}{2}$ such that $P(\mathcal{F}_i) \leq \mu$ for $i = 1, ..., k$. Let $\{U_1, ..., U_k\}$ be a set of $k$ random variables such that $U_i \in \{0,1\}$ and $U_i|\mathcal{F}_i^c = Z_i|\mathcal{F}_i^c$ for $i = 1, ..., k$, where $X|\mathcal{F}$ stands for the random variable $X$ truncated to the event $\mathcal{F}$. Assume furthermore that $E(U_i) = E(Z_i)$ for $i = 1, ..., k$. Denote $U = \frac{U_1 + ... + U_k}{k}$. Then the following is true.
Proof. Let us consider the event $\mathcal{F}_{bad} = \mathcal{F}_1 \cup ... \cup \mathcal{F}_k$. Note that $\mathcal{F}_{bad}$ may be represented as the union of the so-called $r$-blocks, i.e.
where $\mathcal{F}^c$ stands for the complement of the event $\mathcal{F}$. Let us now fix some $Q \subseteq \{1, ..., k\}$. Denote
and note that $P(\mathcal{F}_Q) \leq \mu^r (1-\mu)^{k-r}$. This follows directly from the Bernoulli scheme.
Denote $Z = \frac{Z_1 + ... + Z_k}{k}$. From what we have just said and from the definition of $\{\mathcal{F}_1, ..., \mathcal{F}_k\}$ we conclude that for any given $c$ the following holds:
Note also that from the assumptions of the lemma we trivially get: E ( U ) = E ( Z ) .
Let us now consider the expression $P(|U - E(U)| > \epsilon)$.
From (9) we get:
From Stirling's formula we get $r! = \sqrt{2\pi}\, r^{r+\frac{1}{2}} e^{-r} (1 + o_r(1))$. Thus we obtain:
for r large enough.
Now we will use the following version of the standard Azuma's inequality:
Lemma 7.2. Let $W_1, ..., W_k$ be $k$ independent random variables such that $E(W_1) = ... = E(W_k) = 0$. Assume that $-\alpha_i \leq W_i \leq \beta_i$ for $i = 1, ..., k$. Then the following is true:
Now, using Lemma 7.2 with $W_i = X_i - E(X_i)$ and $\alpha_i = E(X_i)$, $\beta_i = 1 - E(X_i)$, we obtain:
Combining (13) and (14), we obtain the statement of the lemma.
Our next lemma explains the role the Hadamard matrix plays in the entire extended $\Psi$-regular hashing mechanism.
Lemma 7.3. Let $n$ denote the data dimensionality and let $f(n)$ be an arbitrary positive function. Let $\mathcal{D}$ be the set of all $L_2$-normalized data points, where no two data points are identical. Assume that $|\mathcal{D}| = N$. Consider the $\binom{N}{2}$ hyperplanes $\mathcal{H}_{p,r}$ spanned by pairs of different vectors $\{p, r\}$ from $\mathcal{D}$. Then after applying the linear transformation $\mathcal{H}\mathcal{R}$ each hyperplane $\mathcal{H}_{p,r}$ is transformed into another hyperplane $\mathcal{H}^{HR}_{p,r}$. Furthermore, the probability $P^{HR}$ that for every $\mathcal{H}^{HR}_{p,r}$ there exist two orthonormal vectors $x = (x_1, ..., x_n)$, $y = (y_1, ..., y_n)$ in $\mathcal{H}^{HR}_{p,r}$ such that $|x_i|, |y_i| \leq \frac{f(n)}{\sqrt{n}}$ satisfies:
Proof. We have already noted in the proof of Lemma 4.1 that $\mathcal{H}\mathcal{R}$ is an orthogonal matrix. Thus, as an isometry, it clearly transforms each $2$-dimensional hyperplane into another $2$-dimensional hyperplane. For every pair $\{p, r\}$, let us consider an arbitrary fixed orthonormal pair $\{u, v\}$ spanning $\mathcal{H}_{p,r}$. Denote $u = (u_1, ..., u_n)$. Let us denote by $u^{HR}$ the vector obtained from $u$ after applying the transformation $\mathcal{H}\mathcal{R}$. Note that the $j$-th coordinate of $u^{HR}$ is of the form:
where T 1 , ..., T n are independent random variables satisfying:
The latter comes straightforwardly from the form of the $L_2$-normalized Hadamard matrix (i.e. a Hadamard matrix whose rows and columns are $L_2$-normalized).
But then, from Lemma 7.2 and the fact that $\|u\|_2 = 1$, we get for any $a > 0$:
A similar analysis holds for $v^{HR}$. Note that $v^{HR}$ is orthogonal to $u^{HR}$ since $v$ and $u$ are orthogonal. Furthermore, both $v^{HR}$ and $u^{HR}$ are $L_2$-normalized. Thus $\{u^{HR}, v^{HR}\}$ is an orthonormal pair.
To complete the proof, it suffices to take $a = f(n)$ and apply the union bound over all vectors $u^{HR}, v^{HR}$ for all $\binom{N}{2}$ hyperplanes.
From the lemma above we see that applying the Hadamard matrix enables us to assume, with high probability, that for every hyperplane $\mathcal{H}_{p,r}$ there exists an orthonormal basis consisting of vectors with elements of absolute value at most $\frac{f(n)}{\sqrt{n}}$. We call this event $E_f$. Note that whether $E_f$ holds or not is determined only by $\mathcal{H}$, $\mathcal{R}$ and the initial dataset $\mathcal{D}$.
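The balancing effect guaranteed by Lemma 7.3 can be checked numerically. The sketch below is our own (a textbook fast Walsh-Hadamard transform stands in for the $L_2$-normalized Hadamard matrix): it applies $\mathcal{H}\mathcal{R}$ to a maximally concentrated unit vector and verifies that all coordinates end up of order $\frac{1}{\sqrt{n}}$ while the norm is preserved.

```python
import numpy as np

rng = np.random.default_rng(0)

def fwht(x):
    """Fast Walsh-Hadamard transform in O(n log n); len(x) must be a power of 2.
    Divided by sqrt(n) at the end, this is the L2-normalized Hadamard transform."""
    x = x.copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

n = 1024
u = np.zeros(n); u[0] = 1.0          # worst case: a maximally "spiky" unit vector
R = rng.choice([-1.0, 1.0], size=n)  # random diagonal matrix R (Rademacher entries)
u_hr = fwht(R * u)                   # apply HR

max_coord = np.max(np.abs(u_hr))     # every coordinate has magnitude 1/sqrt(n) here
```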
Let us proceed with the proof of Theorem 4.1, and assume that event $E_f$ holds. Without loss of generality we may assume that we have the short $\Psi$-regular hashing mechanism with the extra property that every $\mathcal{H}_{p,r}$ has an orthonormal basis consisting of vectors with elements of absolute value at most $\frac{f(n)}{\sqrt{n}}$. Fix two vectors $p, r$ from the dataset $\mathcal{D}$. Denote by $\{x, y\}$ the orthonormal basis of $\mathcal{H}_{p,r}$ with the above property. Let us fix the $i$-th row of $\mathcal{P}$ and denote it as $(p_{i,1}, ..., p_{i,n})$. After being multiplied by the diagonal matrix $\mathcal{D}$ we obtain another vector:
where:
We have already noted in the proof of Lemma 4.1 that it is the projection of $w$ into $\mathcal{H}_{p,r}$ that determines whether the value of the associated random variable $X_i$ is $0$ or $1$. More specifically, we showed that $X_i = 1$ iff the projection is in the region $U_{p,r}$. Let us write down the coordinates of the projection of $w$ into $\mathcal{H}_{p,r}$ in the $\{x, y\}$-coordinate system. The coordinates are the dot-products of $w$ with $x$ and $y$ respectively, thus in the $\{x, y\}$-coordinate system we can write $w$ as:
Note that both coordinates are Gaussian random variables, and they are independent since they were constructed by projecting a Gaussian vector onto two orthogonal vectors. Now note that from our assumption about the structure of $\mathcal{P}$ we can conclude that both coordinates may be represented as sums of weighted Gaussian random variables $g_i$ for $i = 1, ..., t$, i.e.:
where each $s_{i,j}$, $v_{i,j}$ is of the form $d_z x_z$ or $d_z y_z$ for some $z$ that depends only on $i, j$. Note also that
The latter inequality comes from the fact that, by (20), both coordinates of $w_{\{x,y\}}$ have the same distribution.
Let us denote $s_i = (s_{i,1}, ..., s_{i,t})$, $v_i = (v_{i,1}, ..., v_{i,t})$ for $i = 1, ..., k$. We need the following lemma stating that with high probability the vectors $s_1, ..., s_k, v_1, ..., v_k$ are close to pairwise orthogonal.
Lemma 7.4. Let us assume that $E_f$ holds and let $f(n)$ be an arbitrary positive function. Then for every $a > 0$, with probability at least $P_{succ} \geq 1 - 4\binom{k}{2} e^{-\frac{2a^2 n}{f^4(n)}}$, taken under the coin tosses used to construct $\mathcal{D}$, the following is true for every $1 \leq i_1 \neq i_2 \leq k$:
Proof. Note that we get the first inequality for free from the fact that $x$ is orthogonal to $y$ (in other words, $\sum_{u=1}^n s_{i_1,u} v_{i_1,u}$ can be represented as $C \sum_{u=1}^n x_u y_u$, and the latter expression is clearly $0$). Let us now consider one of the three remaining expressions. Note that they can be rewritten as:
$\sum_{i=1}^n d_{\rho(i)} d_{\lambda(i)} x_{\zeta(i)} x_{\gamma(i)}$, or $\sum_{i=1}^n d_{\rho(i)} d_{\lambda(i)} y_{\zeta(i)} y_{\gamma(i)}$, or $\sum_{i=1}^n d_{\rho(i)} d_{\lambda(i)} x_{\zeta(i)} y_{\gamma(i)}$,
for some $\rho, \lambda, \zeta, \gamma$. Note also that from the $\Psi$-regularity condition we immediately obtain that $\rho(i) = \lambda(i)$ for at most $\Psi$ elements of each sum. Remove these elements from each sum and consider the remaining ones. From the definition of the $\mathcal{P}$-chromatic number, the remaining elements can be partitioned into at most $\chi(\mathcal{P})$ parts, each consisting of elements that are independent random variables (since in the corresponding graph there are no edges between them). Thus one can apply Lemma 7.2 to the sum corresponding to each part, and conclude that the sum differs from its expectation (which clearly is $0$, since $E(d_i d_j) = 0$ for $i \neq j$) by $a$ with probability at most:
Now it is time to use the fact that event $E_f$ holds. Then we know that $|x_i|, |y_i| \leq \frac{f(n)}{\sqrt{n}}$ for $i = 1, ..., n$. Substituting this upper bound for $|x_i|, |y_i|$ into the probability expressions derived from Lemma 7.2, and then taking the union bound, we complete the proof.
We can now finish the proof of Theorem 4.1. From Lemma 7.4 we see that $s_1, ..., s_k, v_1, ..., v_k$ are close to pairwise orthogonal with high probability. Let us fix some positive function $f(n) > 0$ and some $a > 0$. Denote
Note that, by Lemma 7.4, applying the Gram-Schmidt process we can obtain a system of pairwise orthogonal vectors $\tilde{s}_1, ..., \tilde{s}_k, \tilde{v}_1, ..., \tilde{v}_k$ such that
and
where $\sigma(k) > 0$ is some function of $k$ (it does not depend on $n$ and $t$). Note that for $n, t$ large enough we have: $\sigma(k)\Delta \leq \sqrt{a\chi(\mathcal{P}) + \Psi}\,\frac{f^2(n)}{\sqrt{n}}$.
Let us consider again $w_{x,y}$. Replacing $s_i$ by $\tilde{s}_i$ and $v_i$ by $\tilde{v}_i$ in the formula for $w_{x,y}$, we obtain another Gaussian vector $\tilde{w}_{x,y}$ for each row $i$ of the matrix $\mathcal{P}$. Note however that the vectors $\tilde{w}_{x,y}$ have one crucial advantage over the vectors $w_{x,y}$: they are independent. This comes from the fact that $\tilde{s}_1, ..., \tilde{s}_k, \tilde{v}_1, ..., \tilde{v}_k$ are pairwise orthogonal. Note also that from (36) and (37) we obtain that the angular distance between $w_{x,y}$ and $\tilde{w}_{x,y}$ is at most $\sigma(k)\Delta$.
Let $Z_i$ for $i = 1, ..., k$ be an indicator random variable that is one if $\tilde{w}_{x,y}$ is inside the region $U_{p,r}$ and zero otherwise. Let $U_i$ for $i = 1, ..., k$ be an indicator random variable that is one if $w_{x,y}$ is inside the region $U_{p,r}$ and zero otherwise. Note that $\tilde{\theta}^n_{p,r} = \frac{U_1 + ... + U_k}{k}$. Furthermore, the random variables $Z_1, ..., Z_k, U_1, ..., U_k$ satisfy the assumptions of Lemma 7.1 with $\mu \leq \frac{8\tau}{\theta_{p,r}}$, where $\tau = \sigma(k)\Delta$. Indeed, the random variables $Z_i$ are independent since the vectors $\tilde{w}_{x,y}$ are independent. From what we have said so far we know that each of them takes value one with probability exactly $\frac{\theta_{p,r}}{\pi}$. Furthermore, $Z_i \neq U_i$ only if $w_{x,y}$ is inside $U_{p,r}$ and $\tilde{w}_{x,y}$ is outside $U_{p,r}$, or vice versa. The latter event implies (thus is included in the event) that $w_{x,y}$ is near the border of the region $U_{p,r}$, namely within an angular distance $\tau$ from one of the four semi-lines defining $U_{p,r}$. Thus in particular, the event $Z_i \neq U_i$ is contained in an event of probability at most $2 \cdot 4 \cdot \frac{\tau}{\theta_{p,r}}$ that depends only on one $w_{x,y}$.
But then we can apply Lemma 7.1. All we need is to assume that the premises of Lemma 7.4 are satisfied. But this is the case with the probability specified in Lemma 7.3, and this probability is taken under the random coin tosses used to produce $\mathcal{H}$ and $\mathcal{R}$, thus independently from the random coin tosses used to produce $\mathcal{D}$. Putting it all together, we obtain the statement of Theorem 4.1.
Proof of Theorem 4.2
We will borrow some notation from the proof of Theorem 4.1. Note however that in this setting no preprocessing with the use of matrices H and R is applied.
Lemma 8.1. Define U 1 , ..., U k as in the proof of Theorem 4.1. Assume that the following is true:
for some $0 < \Delta < 1$. Then the following is true for every fixed $1 \leq i < j \leq k$:
The lemma follows from exactly the same analysis as in the last section of the proof of Theorem 4.1, thus we leave it to the reader as an exercise.
Note that we have:
Since $U_i$ is an indicator random variable that takes value one with probability $\frac{\theta_{p,r}}{\pi}$, we get:
Note however that $Cov(U_i, U_j)$ is exactly $P(U_i U_j = 1) - P(U_i = 1)P(U_j = 1)$.
Therefore, using Lemma 8.1, we obtain:
It suffices to estimate the parameter $\Delta$. We proceed as in the previous proof. We only need to be a little more cautious, since the condition $|x_i|, |y_i| \leq \frac{f(n)}{\sqrt{n}}$ cannot be assumed right now. We select two rows $i_1, i_2$ of $\mathcal{P}$. Note that again, applying the Gram-Schmidt process, we can obtain a system of pairwise orthogonal vectors $\tilde{s}_{i_1}, \tilde{s}_{i_2}, \tilde{v}_{i_1}, \tilde{v}_{i_2}$ such that
and
The fact that right now the above upper bounds are not multiplied by $k$, as was the case in the previous proof, plays a key role in obtaining nontrivial concentration results even when no Hadamard mechanism is applied.
We consider the related sums $E_1 = \sum_{i=1}^n d_{\rho(i)} d_{\lambda(i)} x_{\zeta(i)} x_{\gamma(i)}$, $E_2 = \sum_{i=1}^n d_{\rho(i)} d_{\lambda(i)} y_{\zeta(i)} y_{\gamma(i)}$, $E_3 = \sum_{i=1}^n d_{\rho(i)} d_{\lambda(i)} x_{\zeta(i)} y_{\gamma(i)}$ as before. We can again partition each sum into at most $\chi(\mathcal{P})$ sub-chunks, where this time $\chi(\mathcal{P}) \leq 3$ (since $\mathcal{P}$ is Toeplitz Gaussian). The problem is that, applying Lemma 7.2, we get bounds that depend on expressions of the form
where indices are added modulo n and this time we cannot assume that all | x i | , | y i | are small. Fortunately we have:
Let us fix some positive function $f(k)$. We can conclude that the number of variables $\alpha_{x,i}$ such that $\alpha_{x,i} \geq \frac{f(k)}{\binom{k}{2}}$ is at most $\frac{\binom{k}{2}}{f(k)}$. Note that each such $\alpha_{x,i}$ and each such $\alpha_{y,i}$ corresponds to a pair $\{i_1, i_2\}$ of rows of the matrix $\mathcal{P}$, and consequently to the unique element $Cov(U_{i_1}, U_{i_2})$ of the entire covariance sum (scaled by $\frac{1}{k^2}$). Since trivially we have $|Cov(U_{i_1}, U_{i_2})| = O(1)$, we conclude that the contribution of these elements to the entire covariance sum is of order $\frac{1}{f(k)}$. Let us now consider those $\alpha_{x,i}$ and $\alpha_{y,i}$ that are at most $\frac{f(k)}{\binom{k}{2}}$. These are small (if we take $f(k) = o(k^2)$), and thus it makes sense to apply Lemma 7.2 to them. That gives us an upper bound $a = \Omega(\Delta)$ with probability:

Table 3: Mean and std of the train error versus the size of the hash ($k$) / size of the reduction ($n/k$) for the network.
Taking $f(k) = \left(\frac{k^2}{\log(k)}\right)^{\frac{1}{3}}$ and $a = O(\Delta) = \frac{1}{f(k)}$, we get $P^* \geq 1 - O\left(\frac{1}{k}\right)$ and furthermore:
That completes the proof of Theorem 4.2.
Additional figures
The paradigm of binary embedding for data compression is the central focus of this paper. The paradigm has been studied in some earlier works (see: (Weiss et al., 2008), (Gong et al., 2012), (Gong et al., 2013a), (Wang et al., 2012), (Gong et al., 2013b), (Plan & Vershynin, 2014), (Yu et al., 2014), (Yi et al., 2015)), and in particular it was observed that by using linear projections and then applying the sign function as a nonlinear map one does not completely lose the information about the angular distance between vectors; instead, this information can be approximately reconstructed from the Hamming distance between hashes. In this paper we are interested in using pseudo-random projections via structured matrices in the linear projection phase. The pseudo-random projection is described by a matrix where not all the entries are independent random variables, but instead a fixed "budget of randomness" is distributed across the matrix. Thus these matrices can be efficiently stored in sub-quadratic or even linear space and provide a reduction in randomness usage. Moreover, using them often leads to computational speed ups since they provide fast matrix-vector multiplications via the Fast Fourier Transform. We prove an extension of the Johnson-Lindenstrauss lemma (Sivakumar, 2002) for general pseudo-random structured projections followed by nonlinear mappings. We show that the angular distance between high-dimensional data vectors is approximately preserved in the hashed space. This result is also new compared to previous extensions (Hinrichs & Vybíral, 2011; Vybíral, 2011) of the Johnson-Lindenstrauss lemma, which consider special cases of our structured projections (namely: circulant matrices) and do not consider the action of the non-linear mapping at all. We give a theoretical explanation of the approach that was so far only heuristically confirmed for some special structured matrices (see: (Yi et al., 2015), (Yu et al., 2014)).
Our theoretical findings imply that many types of matrices, such as circulant or Toeplitz Gaussian matrices, can be used as a preprocessing step in neural networks. Structured matrices were used before in different contexts, also in deep learning, e.g. (Saxe et al., 2011; Sindhwani et al., 2015; Cheng et al., 2015; Mathieu et al., 2014). Our theoretical results however extend to a more general class of matrices.
Our work has a primarily theoretical focus, but we also ask an empirical question: how does the action of the random projection followed by a non-linear transformation influence learning? We focus on the deep learning setting, where the architecture contains completely random or pseudo-random structured layers that are not trained. Little is known from the theoretical point of view about these fast deep architectures, which achieve significant speed ups of computation and space usage reduction with simultaneously little or no loss in performance (Saxe et al., 2011; Cheng et al., 2015; Sindhwani et al., 2015; Jarrett et al., 2009; Pinto et al., 2009; Pinto & Cox, 2010; Huang et al., 2006). The high-level intuition justifying the success of these approaches is that the performance of a deep learning system depends not only on learning, but also on the intrinsic properties of the architecture. These findings coincide with the notion of high redundancy in network parametrization (Denil et al., 2013; Denton et al., 2014; Choromanska et al., 2015). In this paper we consider a simple model of the fully-connected feed-forward neural network where the input layer is hashed by a structured pseudo-random projection followed by a point-wise nonlinear mapping. Thus the input is effectively hashed and learning is conducted in the fully connected subsequent layers that act in the hashed space. We empirically verify how the distortion introduced in the first layer by hashing (where we reduce the dimensionality of the data) affects the performance of the network (in the supervised learning setting). Finally, we show how our structured nonlinear embeddings can be used in the $k$-NN setting (Altman, 1992).
This article is organized as follows: Section 2 discusses related work, Section 3 explains the hashing mechanism, Section 4 provides theoretical results, Section 5 shows experimental results, and Section 6 concludes. Supplement contains additional proofs and experimental results.
The idea of using random projections to facilitate learning with high-dimensional data stems from the early work on random projections (Dasgupta, 1999) showing in particular that learning of high-dimensional mixtures of Gaussians can be simplified when first projecting the data into a randomly chosen subspace of low dimension (this is a consequence of the curse of dimensionality and the fact that high-dimensional data often has low intrinsic dimensionality). This idea was subsequently successfully applied to both synthetic and real datasets (Dasgupta, 2000; Bingham & Mannila, 2001), and then adopted to a number of learning approaches such as random projection trees (Dasgupta & Freund, 2008), kernel and feature-selection techniques (Blum, 2006), clustering (Fern & Brodley, 2003), privacy-preserving machine learning (Liu et al., 2006; Choromanska et al., 2013), learning with large databases (Achlioptas, 2003), sparse learning settings (Li et al., 2006), and more recently - deep learning (see (Saxe et al., 2011) for convenient review of such approaches). Using linear projections with completely random Gaussian weights, instead of learned ones, was recently studied from both theoretical and practical point of view in (Giryes et al., 2015), but that work did not consider structured matrices which is a central point of our interest since structured matrices can be stored much more efficiently. Beyond applying methods that use random Gaussian matrix projections (Dasgupta, 1999, 2000; Giryes et al., 2015) and random binary matrix projections (Achlioptas, 2003), it is also possible to construct deterministic projections that preserve angles and distances (Jafarpour et al., 2009). In some sense these methods use structured matrices as well, yet they do not have the same projection efficiency of circulant matrices and projections explored in this article. 
Our hybrid approach, where a fixed “budget of randomness” is distributed across the entire matrix in the structured way enables us to take advantage of both: the ability of completely random projection to preserve information and the compactness that comes from the highly-organized internal structure of the linear mapping.
This work studies the paradigm of constructing a binary embedding for data compression, where hashes are obtained by applying linear projections to the data followed by the non-linear (sign function) mappings. The point-wise nonlinearity was not considered in many previous works on structured matrices (Haupt et al., 2010; Rauhut et al., 2010; Krahmer et al., 2014; Yap et al., 2011) (moreover note that these works also consider the set of structured matrices which is a strict subset of the class of matrices considered here). Designing binary embeddings for high dimensional data with low distortion is addressed in many recent works (Weiss et al., 2008; Wang et al., 2012; Gong et al., 2013b, a, 2012; Yu et al., 2014; Yi et al., 2015; Raginsky & Lazebnik, 2009; Salakhutdinov & Hinton, 2009). In the context of our work, one of the recent articles (Yi et al., 2015) is especially important since the authors introduce the pipeline of constructing hashes with the use of structured matrices in the linear step, instead of completely random ones. They prove several theoretical results regarding the quality of the produced hash, and extend some previous theoretical results (Jacques et al., 2011; Plan & Vershynin, 2014). Their pipeline is more complicated than ours, i.e. they first apply Hadamard transformation and then a sequence of partial Toeplitz Gaussian matrices. Some general results (unbiasedness of the angular distance estimator) were also known for short hashing pipelines involving circulant matrices ((Yu et al., 2014)). These works do not provide guarantees regarding concentration of the estimator around its mean, which is crucial for all practical applications. 
Our results for general structured matrices, which include circulant Gaussian matrices and the larger class of Toeplitz Gaussian matrices as special subclasses, provide such concentration guarantees, and thus establish a solid mathematical foundation for using various types of structured matrices in binary embeddings. In contrast to (Yi et al., 2015), we present our theoretical results for simpler hashing models (our hashing mechanism is explained in Section 3 and consists of two very simple steps that we call the preprocessing step and the actual hashing step, where the latter consists of a pseudo-random projection followed by the nonlinear mapping). In (Yu et al., 2015), theoretical guarantees bounding the variance of the angle estimator in the circulant setting were presented. Strong concentration results regarding several structured matrices were given in (Choromanski & Sindhwani, 2016; Choromanski et al., 2016), following our work.
In the context of deep learning, using random network parametrization, where certain layers have random and untrained weights, often accelerates training. Introducing randomness to networks was explored for various architectures, for example feedforward networks (Huang et al., 2006), convolutional networks (Jarrett et al., 2009; Saxe et al., 2011), and recurrent networks (Jaeger & Haas, 2004; White et al., 2004; Boedecker et al., 2009). We also refer the reader to (Ganguli & Sompolinsky, 2012), where the authors describe how neural systems cope with the challenge of processing data in high dimensions and discuss random projections. Hashing in neural networks as we consider it in this paper is a new direction of research. Very recently (see: (Chen et al., 2015)) it was empirically shown that hashing in neural nets may achieve a drastic reduction in model sizes with no significant loss of quality, by heavily exploiting the phenomenon of redundancies in neural nets. The HashedNets introduced in (Chen et al., 2015) do not give any theoretical guarantees regarding the quality of the proposed hashing. Our work aims to touch both grounds. We experimentally show the plausibility of the approach, but also explain theoretically why the hashing we consider compresses important information about the data that suffices for accurate classification. Dimensionality reduction techniques were also used to approximately preserve certain metrics defined on graph objects (Shaw & Jebara, 2009). Structured hashing was applied also in (Szlam et al., 2012), but in a very different context than ours.
In this section we explain our hashing mechanism for dimensionality reduction, which we then analyze.
We first introduce the aforementioned family of structured matrices, which we call $\Psi$-regular matrices $\mathcal{P}$. This is the key ingredient of the method.
$\mathcal{M}$ is a circulant Gaussian matrix if its first row is a sequence of independent Gaussian random variables taken from the distribution $\mathcal{N}(0,1)$ and the next rows are obtained from the previous ones by either only one-left shifts or only one-right shifts.
$\mathcal{M}$ is a Toeplitz Gaussian matrix if each of its descending diagonals from left to right is of the form $(g, ..., g)$ for some $g \sim \mathcal{N}(0,1)$, and different descending diagonals are independent.
The circulant Gaussian matrix with right shifts is a special type of the Toeplitz Gaussian matrix.
Let $t$ be the size of the pool of independent random Gaussian variables $\{g_1, ..., g_t\}$, where each $g_i \sim \mathcal{N}(0,1)$. Assume that $k \leq n \leq t \leq kn$. We take $\Psi$ to be a natural number, i.e. $\Psi \in \mathbb{N}$. $\mathcal{P}$ is a $\Psi$-regular random matrix if it has the following form
where $S_{i,j} \subseteq \{1, ..., t\}$ for $i \in \{1, ..., k\}$, $j \in \{1, ..., n\}$; $|S_{i,1}| = ... = |S_{i,n}|$ for $i = 1, ..., k$; $S_{i,j_1} \cap S_{i,j_2} = \emptyset$ for $i \in \{1, ..., k\}$, $j_1, j_2 \in \{1, ..., n\}$, $j_1 \neq j_2$; and furthermore the following holds: for any two different rows $\mathcal{R}_1, \mathcal{R}_2$ of $\mathcal{P}$ the number of random variables $g_l$, where $l \in \{1, ..., t\}$, such that $g_l$ is in the intersection of some column with both $\mathcal{R}_1$ and $\mathcal{R}_2$ is at most $\Psi$.
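Under our reading of the definition above (counting, for each pair of rows, the columns in which both rows draw the same pooled variable $g_l$), $\Psi$-regularity can be checked mechanically. The sketch below is illustrative, not the authors' code; `V` is a hypothetical index matrix with `V[i, j] = l` meaning that entry $(i, j)$ of the structured matrix reuses $g_l$:

```python
import numpy as np

def psi_regularity(V):
    """Smallest Psi for which the variable-index matrix V is Psi-regular,
    under our reading of the definition: for each pair of rows, count the
    columns in which both rows hold the same pooled variable index."""
    k = V.shape[0]
    psi = 0
    for i1 in range(k):
        for i2 in range(i1 + 1, k):
            collisions = int(np.sum(V[i1] == V[i2]))
            psi = max(psi, collisions)
    return psi

# Toeplitz index pattern: entry (i, j) uses variable i - j + n - 1, so no
# pooled variable ever occupies the same column in two different rows
k, n = 4, 6
V_toeplitz = np.array([[i - j + n - 1 for j in range(n)] for i in range(k)])
psi_toeplitz = psi_regularity(V_toeplitz)

# two identical rows collide in every column
V_dup = np.array([[0, 1, 2], [0, 1, 2]])
psi_dup = psi_regularity(V_dup)
```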
In the experimental section of this paper we consider six different kinds of structured matrices, which are examples of the general structured matrices covered by our theoretical analysis. These are:
Circulant (see: Definition 3.1),
BinCirc - a matrix where the first row is partitioned into consecutive equal-length blocks of elements and each subsequent row is obtained by a right shift of the blocks of the first row,
HalfShift - a matrix where each row is obtained from the previous one by swapping its halves and then performing a right shift by one,
VerHorShift - a matrix obtained by the following two-phase procedure: first each row is obtained from the previous one by a right shift of fixed length, and then, in the resulting matrix, each column is shifted up by a fixed number of elements,
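As an illustration of how little randomness such matrices consume, here is a sketch of the HalfShift construction as described above (our naming; $n$ assumed even). The whole matrix is determined by its first row, i.e. by $n$ random values:

```python
import numpy as np

def halfshift_gaussian(k, n, rng):
    """HalfShift matrix: each row is the previous row with its two
    halves swapped, then right-shifted by one (n assumed even)."""
    rows = [rng.standard_normal(n)]
    h = n // 2
    for _ in range(k - 1):
        prev = rows[-1]
        swapped = np.concatenate([prev[h:], prev[:h]])  # swap halves
        rows.append(np.roll(swapped, 1))                # shift right by one
    return np.stack(rows)

M = halfshift_gaussian(4, 8, np.random.default_rng(1))
```

Every entry of `M` is a copy of one of the $n$ values in the first row, so the matrix can be stored in linear space.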
Computing hashes for the structured matrices Toeplitz, BinCirc, HalfShift, and VerHorShift can be done faster than in time $\mathcal{O}(nk)$ (e.g. for Toeplitz one can use FFT to reduce the computation to $\mathcal{O}(n \log k)$). Thus our structured approach leads to speed-ups, storage compression (since many structured matrices covered by our theoretical model can be stored in linear space), and a reduction in randomness usage. The goal of this paper is not to analyze fast matrix-vector product algorithms in detail, since that requires a separate paper. We point out, however, that well-known fast matrix-vector product algorithms are among the key advantages of our structured approach.
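To make the FFT speed-up concrete, below is a sketch of the circulant case (a Toeplitz matvec can be handled the same way after embedding the Toeplitz matrix into a roughly twice larger circulant one). For a right-shift circulant matrix with first row $c$, the product $Cx$ is a circular cross-correlation, computable in $\mathcal{O}(n \log n)$ instead of $\mathcal{O}(n^2)$:

```python
import numpy as np

def circulant_matvec_fft(c, x):
    """Multiply the circulant matrix whose i-th row is np.roll(c, i)
    by x in O(n log n) via FFT, instead of the naive O(n^2) matvec."""
    return np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(c))).real

rng = np.random.default_rng(0)
n = 8
c = rng.standard_normal(n)
x = rng.standard_normal(n)
C = np.stack([np.roll(c, i) for i in range(n)])  # explicit circulant matrix
y_fast = circulant_matvec_fft(c, x)
y_naive = C @ x
```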
Let $\phi$ be a function satisfying $\lim_{x\rightarrow\infty}\phi(x)=1$ and $\lim_{x\rightarrow-\infty}\phi(x)=-1$. We will consider two hashing methods, both of which consist of what we refer to as a preprocessing step followed by the actual hashing step, where the latter consists of a pseudo-random projection followed by a nonlinear (sign function) mapping. The first mechanism, which we call extended $\Psi$-regular hashing, first applies a random diagonal matrix $\mathcal{R}$ to the data point $x$, then the $L_2$-normalized Hadamard matrix $\mathcal{H}$, next another random diagonal matrix $\mathcal{D}$, then the $\Psi$-regular projection matrix $\mathcal{P}_{\Psi}$, and finally the function $\phi$ (applied point-wise). The overall scheme is presented below:
The diagonal entries of the matrices $\mathcal{R}$ and $\mathcal{D}$ are chosen independently from the binary set $\{-1,1\}$, each value with probability $\frac{1}{2}$. We also propose a shorter pipeline, which we call short $\Psi$-regular hashing, where we skip the first random matrix $\mathcal{R}$ and the Hadamard matrix $\mathcal{H}$, i.e. the overall pipeline is of the form:
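The two pipelines described above can be sketched directly in code; this is a minimal illustration (a dense Gaussian matrix stands in for $\mathcal{P}_{\Psi}$, and $n$ must be a power of two for the Hadamard matrix):

```python
import numpy as np

def normalized_hadamard(n):
    """L2-normalized Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def sign(z):
    # paper's convention: phi(x) = 1 for x >= 0, phi(x) = -1 for x < 0
    return np.where(z >= 0, 1.0, -1.0)

def short_hash(P, D, x):
    """Short hashing: x -> phi(P D x)."""
    return sign(P @ (D * x))

def extended_hash(P, D, H, R, x):
    """Extended hashing: x -> phi(P D H R x)."""
    return sign(P @ (D * (H @ (R * x))))

rng = np.random.default_rng(0)
n, k = 16, 8
P = rng.standard_normal((k, n))        # stand-in projection matrix
D = rng.choice([-1.0, 1.0], size=n)    # diagonal of D, +/-1 entries
R = rng.choice([-1.0, 1.0], size=n)    # diagonal of R, +/-1 entries
H = normalized_hadamard(n)
x = rng.standard_normal(n)
h_short = short_hash(P, D, x)
h_ext = extended_hash(P, D, H, R, x)
```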
The goal is to compute a good approximation of the angular distance between given vectors $p, r$, given their compact hashed versions $h(p), h(r)$. To achieve this goal we consider the $L_1$-distances in the $k$-dimensional space of hashes. Let $\theta_{p,r}$ denote the angle between vectors $p$ and $r$. We define the normalized approximate angle between $p$ and $r$ as:
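For $\pm 1$ hashes, each disagreeing coordinate contributes $2$ to the $L_1$-distance, so the normalized quantity reduces to the fraction of coordinates where the hashes disagree; a sketch under that convention (our naming):

```python
import numpy as np

def normalized_approx_angle(hp, hr):
    """Fraction of coordinates where the +/-1 hashes disagree,
    i.e. ||h(p) - h(r)||_1 / (2k) for k-dimensional sign hashes."""
    return np.mean(hp != hr)

hp = np.array([1, 1, -1, 1])
hr = np.array([1, -1, -1, -1])
est = normalized_approx_angle(hp, hr)  # hashes differ in 2 of 4 coordinates
```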
In the next section we show that the normalized approximate angle between vectors $p$ and $r$ leads to a very precise estimation of the actual angle for $\phi(x)=\mathrm{sign}(x)$ if the chosen parameter $\Psi$ is not too large. Furthermore, we show an intriguing connection between the theoretical guarantees regarding the quality of the produced hash and the chromatic number of a specific undirected graph encoding the structure of $\mathcal{P}$. For many of the structured matrices under consideration this graph is induced by an algebraic group operation defining the structure of $\mathcal{P}$ (for instance, for the circulant matrix the group is generated by a single shift and the underlying graph is a collection of pairwise disjoint cycles, thus its chromatic number is at most $3$).
We are ready to provide theoretical guarantees regarding the quality of the produced hash. Our guarantees will be given for the sign function, i.e. for $\phi$ defined as: $\phi(x)=1$ for $x \geq 0$, $\phi(x)=-1$ for $x<0$. Using this nonlinearity is important to preserve approximate information about the angle between vectors while filtering out information about their lengths. We first show that $\tilde{\theta}^n_{p,r}$ is an unbiased estimator of $\frac{\theta_{p,r}}{\pi}$, i.e. $E(\tilde{\theta}^n_{p,r}) = \frac{\theta_{p,r}}{\pi}$.
Let $\mathcal{M}$ be a $\Psi$-regular hashing model (either extended or short) and let $\|p\|_2 = \|r\|_2 = 1$. Then $\tilde{\theta}^n_{p,r}$ is an unbiased estimator of $\frac{\theta_{p,r}}{\pi}$, i.e. $E(\tilde{\theta}^n_{p,r}) = \frac{\theta_{p,r}}{\pi}$.
Let us first give a short sketch of the proof. Note that the value of a particular entry of the constructed hash depends only on the sign of the dot product between the corresponding Gaussian vector representing a row of the $\Psi$-regular matrix and the given vector. Fix two vectors $p$ and $r$ with angular distance $\theta$. Note that the considered dot products (and thus also their signs) are preserved when, instead of taking the Gaussian vector representing the row, one takes its projection onto the linear space spanned by $p$ and $r$. The Hamming distance between the hashes of $p$ and $r$ is built up by those entries for which one dot product is negative and the other is positive. One can note that this happens exactly when the projected vector lies inside a $2$-dimensional cone covering angle $2\theta$. The last observation completing the proof is that the projection of the Gaussian vector is isotropic (since it is also Gaussian), thus the probability that the two dot products have different signs is $\frac{\theta}{\pi}$ (see also: (Charikar, 2002)).
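The sketch above can be checked numerically. In the fully unstructured case (independent Gaussian rows), the fraction of disagreeing sign bits between the two hashes should concentrate around $\theta/\pi$; a Monte Carlo sketch with illustrative parameters of our choosing:

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 64, 20000
theta = 1.0                                # target angle in radians
p = np.zeros(n); p[0] = 1.0
r = np.zeros(n); r[0] = np.cos(theta); r[1] = np.sin(theta)

G = rng.standard_normal((k, n))            # k independent Gaussian rows
hp = np.where(G @ p >= 0, 1, -1)           # sign hashes of p and r
hr = np.where(G @ r >= 0, 1, -1)
est = np.mean(hp != hr)                    # fraction of differing hash bits
# est approximates theta / pi
```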
Note first that the $i$th row, call it $g^i$, of the matrix $\mathcal{P}$ is an $n$-dimensional Gaussian vector with mean $0$, where each element has variance $\sigma_i^2$ for $\sigma_i^2 = |\mathcal{S}_{i,1}| = \ldots = |\mathcal{S}_{i,n}|$ ($i=1,\ldots,k$). Thus, after applying the matrix $\mathcal{D}$, the new vector $g^i_{\mathcal{D}}$ is still Gaussian with the same distribution. Let us first consider the short $\Psi$-regular hashing model. Fix some vectors $p, r$ (without loss of generality we may assume that they are not collinear). Let $H_{p,r}$, which we abbreviate as $H$, be the $2$-dimensional hyperplane spanned by $\{p, r\}$. Denote by $g^i_{\mathcal{D},H}$ the projection of $g^i_{\mathcal{D}}$ onto $H$, and by $g^i_{\mathcal{D},H,\perp}$ the line in $H$ perpendicular to $g^i_{\mathcal{D},H}$. Let $\phi$ be the sign function. Note that the contribution to the $L_1$-sum $\|h(p)-h(r)\|_1$ comes from those $g^i$ for which $g^i_{\mathcal{D},H,\perp}$ divides the angle between $p$ and $r$ (see: Figure 1), i.e. from those $g^i$ for which $g^i_{\mathcal{D},H}$ lies inside the union $\mathcal{U}_{p,r}$ of the two $2$-dimensional cones bounded by the two lines in $H$ perpendicular to $p$ and $r$ respectively. If the angle is not divided (see: Figure 2), then the two corresponding entries of the hash have the same value and thus do not contribute to the overall distance between the hashes.
Observe that, from what we have just said, we can conclude that $\tilde{\theta}^n_{p,r} = \frac{X_1 + \ldots + X_k}{k}$, where:
Now it suffices to note that the vector $g^i_{\mathcal{D},H}$ is a $2$-dimensional Gaussian vector and thus its direction is distributed uniformly. Hence each $X_i$ is nonzero with probability exactly $\frac{\theta_{p,r}}{\pi}$ and the theorem follows. For the extended $\Psi$-regular hashing model the analysis is very similar. The only difference is that the data is first preprocessed by the linear mapping $\mathcal{H}\mathcal{R}$. However, both $\mathcal{H}$ and $\mathcal{R}$ are orthogonal matrices, thus their product is also an orthogonal matrix. Since orthogonal matrices do not change angular distances, the former analysis applies again and yields the proof. ∎
We next focus on the concentration of the random variable $\tilde{\theta}^n_{p,r}$ around its mean $\frac{\theta_{p,r}}{\pi}$. We prove strong exponential concentration results for the extended $\Psi$-regular hashing method. Interestingly, the application of the Hadamard mechanism is not necessary: it is also possible to obtain concentration results, albeit weaker than in the former case, for short $\Psi$-regular hashing.
The highly organized structure of the projection matrix $\mathcal{P}$ gives rise to an underlying undirected graph that encodes the dependencies between different entries of $\mathcal{P}$. More formally, let us fix two rows of $\mathcal{P}$ with indices $1 \leq k_1 < k_2 \leq k$. We define a graph $\mathcal{G}_{\mathcal{P}}(k_1, k_2)$ as follows:
$V(\mathcal{G}_{\mathcal{P}}(k_1,k_2)) = \{\{j_1,j_2\} : \exists\, l \in \{1,\ldots,t\} \text{ s.t. } g_l \in \mathcal{S}_{k_1,j_1} \cap \mathcal{S}_{k_2,j_2},\ j_1 \neq j_2\}$,
there exists an edge between vertices $\{j_1,j_2\}$ and $\{j_3,j_4\}$ iff $\{j_1,j_2\} \cap \{j_3,j_4\} \neq \emptyset$.
The chromatic number $\chi(\mathcal{G})$ of a graph $\mathcal{G}$ is the minimal number of colors that can be used to color the vertices of the graph in such a way that no two adjacent vertices share a color.
Let $\mathcal{P}$ be a $\Psi$-regular matrix. We define the $\mathcal{P}$-chromatic number $\chi(\mathcal{P})$ as:
The graph associated with each structured matrix we have just described enables us to encode the dependencies between entries of the structured matrix in a compact form, and gives us quantitative means to measure these dependencies efficiently by analyzing several core parameters of the graph, such as its chromatic number. More dependencies, which usually correspond to a more structured form, mean more edges in the associated graph and often a higher chromatic number. On the other hand, fewer dependencies produce graphs with a much lower chromatic number (see Figure 3, where the graph associated with the circulant matrix is a collection of vertex-disjoint cycles and has chromatic number $3$ if it contains an odd-length cycle and $2$ otherwise).
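The cycle case above can be verified with a simple greedy coloring: on a disjoint union of cycles, greedy coloring never needs more than $3$ colors ($2$ when every cycle has even length). A sketch with an illustrative graph of our choosing:

```python
def greedy_coloring(adj):
    """Greedy graph coloring: give each vertex the smallest color
    not used by its already-colored neighbors."""
    colors = {}
    for v in adj:
        used = {colors[u] for u in adj[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors

def cycle(n, offset=0):
    """Adjacency dict of a cycle on vertices offset..offset+n-1."""
    return {offset + i: [offset + (i - 1) % n, offset + (i + 1) % n]
            for i in range(n)}

g = {**cycle(6), **cycle(5, offset=6)}  # an even cycle plus an odd cycle
coloring = greedy_coloring(g)
num_colors = len(set(coloring.values()))
```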
We now present our main theoretical results. The proofs are deferred to the Supplementary Material. We focus on the concentration results regarding the produced hashes, which are crucial for practical applications of the proposed scheme.
We first give a short description of the methods used and then rigorously formulate all the results. If all the rows of the projection matrix were independent, standard concentration inequalities could be used. This is, however, not the case in our setting, since the matrix is structured. We still want to show that any two rows are "close" to independent Gaussian vectors, which gives us bounds on the variance of the distance between the hashes (in general, we observe that any system of $k$ rows is "close" to a system of $k$ independent Gaussian vectors, which yields bounds involving $k$th moments). We proceed as follows:
We take two rows and project them onto the linear space spanned by the given vectors $p$ and $r$.
We consider the four coordinates obtained this way (two for each vector). They are obviously Gaussian, but, crucially, they are "almost independent".
The latter observation is implied by the fact that these are the coordinates of the projection of a fixed Gaussian vector onto "almost orthogonal" directions.
To prove that the directions considered in our setting are close to orthogonal with high probability, we compute their dot product. This is where the structure of the matrix, the chromatic number of the underlying graph, and the fact that our hashing scheme uses random diagonal matrices come into play. We decompose each dot product into, roughly speaking, $\chi$ components ($\chi$ being the chromatic number), such that each component is a sum of independent random variables with mean $0$. Then we can use standard concentration inequalities to get tight concentration results.
The Hadamard matrix used in the extended model preprocesses the input vectors to distribute their mass uniformly over all coordinates, while not changing $L_2$ distances (it is a unitary matrix). Balanced vectors lead to much stronger concentration results.
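The balancing effect is easy to see on an extreme example: a standard basis vector, maximally unbalanced, is mapped by the randomized Hadamard transform to a vector whose coordinates all have magnitude exactly $1/\sqrt{n}$. A small sketch:

```python
import numpy as np

def normalized_hadamard(n):
    """L2-normalized Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
n = 256
u = np.zeros(n); u[0] = 1.0              # maximally unbalanced unit vector
R = rng.choice([-1.0, 1.0], size=n)      # random diagonal (sign flips)
v = normalized_hadamard(n) @ (R * u)     # balanced: all |v_i| = 1/sqrt(n)
```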
Now we are ready to state our results rigorously. By $poly(x)$ we denote a function $x^r$ for some $r>0$. The following theorems guarantee strong concentration of $\tilde{\theta}^n_{p,r}$ around its mean and therefore theoretically justify the effectiveness of the structured hashing method.
Let us first consider the extended $\Psi$-regular hashing model.
Consider the extended $\Psi$-regular hashing model $\mathcal{M}$ with $t$ independent Gaussian random variables $g_1,\ldots,g_t$, each with distribution $\mathcal{N}(0,1)$. Let $N$ be the size of the dataset $\mathcal{D}$. Denote by $k$ the size of the hash and by $n$ the dimensionality of the data. Let $f(n)$ be an arbitrary positive function. Let $\theta_{p,r}$ be the angular distance between vectors $p,r \in \mathcal{D}$. Then for $a = o_n(1)$, $\epsilon > 0$, $t \geq n$ and $n$ large enough:
where $\Lambda = \frac{1}{\pi}\sum_{j=\lfloor\frac{\epsilon k}{2}\rfloor}^{k}\frac{1}{\sqrt{j}}\left(\frac{ke}{j}\right)^{j}\mu^{j}(1-\mu)^{k-j} + 2e^{-\frac{\epsilon^{2}k}{2}}$ and $\mu = \frac{8\left(\sqrt{a}\,\chi(\mathcal{P})+\Psi\frac{f^{2}(n)}{\sqrt{n}}\right)}{\theta_{p,r}}$.
As a corollary, we obtain the following result:
Consider the extended $\Psi$-regular hashing model $\mathcal{M}$. Assume that the projection matrix $\mathcal{P}$ is Toeplitz Gaussian. Let $N, n, k$ be as above and denote by $\theta_{p,r}$ the angular distance between vectors $p,r \in \mathcal{D}$. Then the following is true for $n$ large enough:
Corollary 4.1 follows from Theorem 4.1 by taking $a = n^{-\frac{1}{3}}$, $\epsilon = k^{-\frac{1}{3}}$, $f(n) = n^{p}$ for a small enough constant $p > 0$, and noticing that every Toeplitz Gaussian matrix is $0$-regular and the corresponding $\mathcal{P}$-chromatic number $\chi(\mathcal{P})$ is at most $3$.
The term $O\left(\frac{N^{2}}{e^{poly(n)}}\right)$ is related to the balancedness property. To clarify, the goal of multiplying by $\mathcal{H}\mathcal{R}$ in the preprocessing step is to make each input vector balanced, or in other words to spread the mass of the vector across all dimensions in an approximately uniform way. This property is required to obtain the theoretical results (note that it was unnecessary in the unstructured setting) and does not depend on the number of projected dimensions.
Let us now consider the short $\Psi$-regular hashing model. The theorem presented below is an application of Chebyshev's inequality, preceded by a careful analysis of the variance of $\tilde{\theta}^n_{p,r}$.
Consider the short $\Psi$-regular hashing model $\mathcal{M}$, where $\mathcal{P}$ is a Toeplitz Gaussian matrix. Denote by $k$ the size of the hash. Let $\theta_{p,r}$ be the angular distance between vectors $p,r \in \mathcal{D}$, where $\mathcal{D}$ is the dataset. Then the following is true:
and thus for any $c > 0$ and $p, r \in \mathcal{D}$:
Figure 4 shows the dependence of the upper bound on the variance of the normalized approximate angle $\tilde{\theta}^n_{p,r}$ on, respectively, the true angular distance $\theta_{p,r}$ and the size of the hash $k$, when respectively $k$ and $\theta_{p,r}$ are fixed.
The rate $k^{-\frac{1}{3}}$ that appears in our theoretical results, and the non-linear decay of the variance with $k$ in Figure 4 (right), are consequences of the structured setting, where the quality of the nonlinear embedding is affected by the existence of dependencies between the entries of the structured matrix.
In this section we demonstrate that all considered structured matrices achieve reasonable performance in comparison to fully random matrices. Specifically we show: i) the dependence of the performance on the size of the hash and the reduction factor $\frac{n}{k}$ for different structured matrices, and ii) the performance of different structured matrices when used with neural networks and the $1$-NN classifier. The experiments confirm our novel theoretical results.
We performed experiments on the MNIST dataset downloaded from http://yann.lecun.com/exdb/mnist/. The data was preprocessed according to the short hashing scheme (preprocessing is discussed in Section 3; the extended hashing scheme gave results of no statistically significant difference) before being fed to the input of the network. We first considered a simple model: a fully-connected feed-forward neural network with two hidden layers, where the first hidden layer had $k$ units using the sign non-linearity (we explored $k \in \{16,32,64,128,256,512,1024\}$), and the second hidden layer had $100$ units using the ReLU non-linearity. The size of the second hidden layer was chosen as follows. We first investigated the dependence of the test error on this size in the case $n = k$, where the inputs, instead of being randomly projected, are multiplied by the identity (which is equivalent to eliminating the first hidden layer). We then chose as the size the threshold below which test performance deteriorated rapidly.
The first hidden layer contains random untrained weights, and we train only the parameters of the second layer and the output layer. The network we consider is shown in Figure 5. Each experiment was initialized from a random set of parameters sampled uniformly within the unit cube, and was repeated $1000$ times. All networks were trained for $30$ epochs using SGD (Bottou, 1998). We report the experiments with constant learning rate (we also explored learning rate decay, but obtained similar results), where the learning rate was chosen from the set $\{0.0005, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1\}$ to minimize the test error. The weights of the first hidden layer correspond to the entries of the "preprocessed" structured matrix. We explored seven kinds of random matrices (the first six structured): Circulant, Toeplitz, HalfShift, VerHorShift, BinPerm, BinCirc, and Random (entries independent and drawn from the Gaussian distribution $\mathcal{N}(0,1)$). All codes were implemented in Torch7.
Figures 8a and 8b show how the mean train error is affected by the size of the hash, and Figure 8c shows how the mean train error changes with the size of the reduction for the neural network experiment. In Table 3 we report both the mean and the standard deviation of the train error across our neural network experiments. Baseline refers to the network with one hidden layer containing 100 hidden units, where all parameters are trained.
The experimental results show how the performance is affected by using structured hashed projections to reduce data dimensionality. Figure 3b and Table 1 show a close-to-linear dependence between the error and the size of the reduction. Simultaneously, this approach leads to computational savings and a reduction in memory storage, i.e. a reduction in the number of input weights of the hidden layer (for example, for the Circulant matrix this reduction is of order $\mathcal{O}(n/k)$; the memory required for storing the Circulant matrix itself is negligible compared to the number of weights). Memory complexity, i.e. the memory required to store the matrix, and the number of required random values for the different structured matrices and the Random matrix are summarized in Table 2.
The experiments show that using a fully random matrix gives the best performance, as predicted by theory. The BinPerm matrix exhibits performance comparable to the Random matrix, which might be explained by the fact that applying a permutation adds an additional source of randomness. The next best performer is HalfShift, whose generation uses less randomness than that of BinPerm or Random. Thus its performance, as expected, is worse than for those two matrices. However, as opposed to the BinPerm and Random matrices, the HalfShift matrix can be stored in linear space. The results also show that in general all structured matrices perform relatively well for medium-size reductions. Finally, all structured matrices except BinPerm lead to the biggest memory savings and require the smallest "budget of randomness". Moreover, they often lead to computational efficiency, e.g. Toeplitz matrix-vector multiplication can be efficiently implemented via the Fast Fourier Transform (Yu et al., 2014). As mentioned before, faster-than-naive matrix-vector products can also be computed for BinPerm, HalfShift, and VerHorShift.
Finally, we also report how the performance of the $1$-NN algorithm is affected by using structured hashed projections for dimensionality reduction. We obtained plots similar to those for the neural networks; they are captured in Figure 2. The table showing the mean and the standard deviation of the test error for the $1$-NN experiments is enclosed in the Supplementary Material.
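Nearest neighbor search over the binary codes is particularly cheap: for $\pm 1$ codes of length $k$, the Hamming distance is $(k - \langle a, b\rangle)/2$, so the nearest neighbor is the training code with the largest dot product. A minimal sketch of the $1$-NN classifier in hash space (our naming, not the paper's implementation):

```python
import numpy as np

def one_nn_hamming(train_codes, train_labels, test_codes):
    """1-NN classifier on +/-1 binary codes: Hamming distance is
    (k - <a, b>) / 2, so nearest neighbor = largest dot product."""
    sims = test_codes @ train_codes.T            # pairwise dot products
    return train_labels[np.argmax(sims, axis=1)]

train_codes = np.array([[1, 1, 1, 1], [-1, -1, -1, -1]])
train_labels = np.array([0, 1])
test_codes = np.array([[1, 1, 1, -1], [-1, 1, -1, -1]])
pred = one_nn_hamming(train_codes, train_labels, test_codes)
```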
This paper shows that structured hashed projections accurately preserve the angular distance between input data instances. Our theoretical results cover mappings of the data to a lower-dimensional space using various structured matrices, where the structured linear projections are followed by the sign nonlinearity. This nonlinear operation was not considered for such a wide range of structured matrices in previous related theoretical works. The theoretical setting naturally applies to the multilayer network framework, where the basic components of the architecture perform matrix-vector multiplication followed by a nonlinear mapping. We empirically verify our theoretical findings and show how using structured hashed projections for dimensionality reduction affects the performance of neural networks and the nearest neighbor classifier.
Binary embeddings with structured hashed projections (Supplementary Material)
In this section we prove Theorem 4.1 and Theorem 4.2. We will use notation from Lemma 4.1.
Let $\{Z_1,\ldots,Z_k\}$ be a set of $k$ independent random variables defined on $\Omega$ such that each $Z_i$ has the same distribution and $Z_i \in \{0,1\}$. Let $\{\mathcal{F}_1,\ldots,\mathcal{F}_k\}$ be a set of events, where each $\mathcal{F}_i$ is in the $\sigma$-field defined by $Z_i$ (in particular, $\mathcal{F}_i$ does not depend on the $\sigma$-field $\sigma(Z_1,\ldots,Z_{i-1},Z_{i+1},\ldots,Z_k)$). Assume that there exists $\mu < \frac{1}{2}$ such that $\mathbb{P}(\mathcal{F}_i) \leq \mu$ for $i=1,\ldots,k$. Let $\{U_1,\ldots,U_k\}$ be a set of $k$ random variables such that $U_i \in \{0,1\}$ and $U_i|\mathcal{F}_i = Z_i|\mathcal{F}_i$ for $i=1,\ldots,k$, where $X|\mathcal{F}$ stands for the random variable $X$ truncated to the event $\mathcal{F}$. Assume furthermore that $E(U_i) = E(Z_i)$ for $i=1,\ldots,k$. Denote $U = \frac{U_1+\ldots+U_k}{k}$. Then the following is true.
Let us consider the event $\mathcal{F}_{bad} = \mathcal{F}_1 \cup \ldots \cup \mathcal{F}_k$. Note that $\mathcal{F}_{bad}$ may be represented as the union of so-called $r$-blocks, i.e.
where $\mathcal{F}^c$ stands for the complement of the event $\mathcal{F}$. Let us now fix some $Q \subseteq \{1,\ldots,k\}$. Denote
and note that $\mathbb{P}(\mathcal{F}_Q) \leq \mu^{r}(1-\mu)^{k-r}$, where $r = |Q|$; this follows directly from the Bernoulli scheme.
Denote $Z = \frac{Z_1+\ldots+Z_k}{k}$. From what we have just said and from the definition of $\{\mathcal{F}_1,\ldots,\mathcal{F}_k\}$ we conclude that for any given $c$ the following holds:
Note also that from the assumptions of the lemma we trivially get $E(U) = E(Z)$.
We get: $\mathbb{P}(|U-E(U)|>\epsilon) = \mathbb{P}(|U-E(Z)|>\epsilon) = \mathbb{P}(|U-Z+Z-E(Z)|>\epsilon) \leq \mathbb{P}(|U-Z|+|Z-E(Z)|>\epsilon) \leq \mathbb{P}(|U-Z|>\frac{\epsilon}{2}) + \mathbb{P}(|Z-E(Z)|>\frac{\epsilon}{2})$.
From (9) we get:
From Stirling's formula we get: $r! = \sqrt{2\pi}\, r^{r+\frac{1}{2}} e^{-r}(1+o_r(1))$. Thus we obtain:
for $r$ large enough.
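The Stirling approximation used above can be sanity-checked numerically; the ratio $r!/\left(\sqrt{2\pi}\, r^{r+\frac{1}{2}} e^{-r}\right)$ tends to $1$ (the relative error is about $\frac{1}{12r}$):

```python
import math

def stirling(r):
    """Stirling approximation: r! ~ sqrt(2*pi) * r**(r + 1/2) * e**(-r)."""
    return math.sqrt(2 * math.pi) * r ** (r + 0.5) * math.exp(-r)

ratio = math.factorial(50) / stirling(50)  # close to 1 already at r = 50
```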
Let $W_1,\ldots,W_k$ be $k$ independent random variables such that $E(W_1)=\ldots=E(W_k)=0$. Assume that $-\alpha_i \leq W_i \leq \beta_i$ for $i=1,\ldots,k$. Then the following is true:
Now, using Lemma 7.2 for $W_i = X_i - E(X_i)$ and $\alpha_i = E(X_i)$, $\beta_i = 1-E(X_i)$, we obtain:
Combining (13) and (14), we obtain the statement of the lemma.
Let $n$ denote the data dimensionality and let $f(n)$ be an arbitrary positive function. Let $D$ be the set of all $L_2$-normalized data points, where no two data points are identical. Assume that $|D| = N$. Consider the ${N \choose 2}$ hyperplanes $H_{p,r}$ spanned by pairs of different vectors $\{p,r\}$ from $D$. Then, after applying the linear transformation $\mathcal{H}\mathcal{R}$, each hyperplane $H_{p,r}$ is transformed into another hyperplane $H^{\mathcal{H}\mathcal{R}}_{p,r}$. Furthermore, the probability $\mathcal{P}_{\mathcal{H}\mathcal{R}}$ that for every $H^{\mathcal{H}\mathcal{R}}_{p,r}$ there exist two orthonormal vectors $x = (x_1,\ldots,x_n)$, $y = (y_1,\ldots,y_n)$ in $H^{\mathcal{H}\mathcal{R}}_{p,r}$ such that $|x_i|, |y_i| \leq \frac{f(n)}{\sqrt{n}}$ satisfies:
We have already noted in the proof of Lemma 4.1 that $\mathcal{H}\mathcal{R}$ is an orthogonal matrix. Thus, as an isometry, it clearly transforms each $2$-dimensional hyperplane into another $2$-dimensional hyperplane. For every pair $\{p,r\}$, let us consider an arbitrary fixed orthonormal pair $\{u,v\}$ spanning $H_{p,r}$. Denote $u = (u_1,\ldots,u_n)$. Let us denote by $u^{\mathcal{H}\mathcal{R}}$ the vector obtained from $u$ after applying the transformation $\mathcal{H}\mathcal{R}$. Note that the $j^{th}$ coordinate of $u^{\mathcal{H}\mathcal{R}}$ is of the form:
where $T_1,\ldots,T_n$ are independent random variables satisfying:
The latter follows straightforwardly from the form of the $L_2$-normalized Hadamard matrix (i.e. a Hadamard matrix where each row and column is $L_2$-normalized).
But then, from Lemma 7.2 and the fact that $\|u\|_2 = 1$, we get for any $a > 0$:
A similar analysis holds for $v^{\mathcal{H}\mathcal{R}}$. Note that $v^{\mathcal{H}\mathcal{R}}$ is orthogonal to $u^{\mathcal{H}\mathcal{R}}$ since $v$ and $u$ are orthogonal. Furthermore, both $v^{\mathcal{H}\mathcal{R}}$ and $u^{\mathcal{H}\mathcal{R}}$ are $L_2$-normalized. Thus $\{u^{\mathcal{H}\mathcal{R}}, v^{\mathcal{H}\mathcal{R}}\}$ is an orthonormal pair.
To complete the proof, it suffices to take $a = f(n)$ and apply the union bound over all vectors $u^{\mathcal{H}\mathcal{R}}$, $v^{\mathcal{H}\mathcal{R}}$ for all ${N \choose 2}$ hyperplanes. ∎
From the lemma above we see that applying the Hadamard matrix enables us to assume, with high probability, that for every hyperplane $H_{p,r}$ there exists an orthonormal basis consisting of vectors whose elements have absolute values at most $\frac{f(n)}{\sqrt{n}}$. We call this event $\mathcal{E}_f$. Note that whether $\mathcal{E}_f$ holds or not is determined only by $\mathcal{H}$, $\mathcal{R}$ and the initial dataset $D$.
Let us proceed with the proof of Theorem 4.1. Assume that the event $\mathcal{E}_f$ holds. Without loss of generality we may assume that we have the short $\Psi$-regular hashing mechanism with the extra property that every $H_{p,r}$ has an orthonormal basis consisting of vectors whose elements have absolute value at most $\frac{f(n)}{\sqrt{n}}$. Fix two vectors $p, r$ from the dataset $D$. Denote by $\{x,y\}$ the orthonormal basis of $H_{p,r}$ with the above property. Let us fix the $i$th row of $\mathcal{P}$ and denote it as $(p_{i,1},\ldots,p_{i,n})$. After multiplication by the diagonal matrix $\mathcal{D}$ we obtain another vector:
We have already noted that in the proof of Lemma 4.1 that it is the projection of w𝑤w into Hp,rsubscript𝐻𝑝𝑟H_{p,r} that determines whether the value of the associated random variable Xisubscript𝑋𝑖X_{i} is 00 or 111. To be more specific, we showed that Xi=1subscript𝑋𝑖1X_{i}=1 iff the projection is in the region 𝒰p,rsubscript𝒰𝑝𝑟\mathcal{U}{p,r}. Let us write down the coordinates of the projection of w𝑤w into Hp,rsubscript𝐻𝑝𝑟H{p,r} in the {x,y}𝑥𝑦{x,y}-coordinate system. The coordinates are the dot-products of w𝑤w with x𝑥x and y𝑦y respectively thus in the {x,y}𝑥𝑦{x,y}-coordinate system we can write w𝑤w as:
Note that both coordinates are Gaussian random variables and they are independent, since they were constructed by projecting a Gaussian vector onto two orthogonal vectors. Now note that from our assumption about the structure of $\mathcal{P}$ we can conclude that both coordinates may be represented as sums of weighted Gaussian random variables $g_{i}$ for $i=1,...,t$, i.e.:
where each $s_{i,j},v_{i,j}$ is of the form $d_{z}x_{z}$ or $d_{z}y_{z}$ for some $z$ that depends only on $i,j$. Note also that
The latter inequality comes from the fact that, by (20), both coordinates of $w_{\{x,y\}}$ have the same distribution.
Let us denote $s_{i}=(s_{i,1},...,s_{i,t})$, $v_{i}=(v_{i,1},...,v_{i,t})$ for $i=1,...,k$. We need the following lemma, stating that with high probability the vectors $s_{1},...,s_{k},v_{1},...,v_{k}$ are close to being pairwise orthogonal.
Let us assume that $\mathcal{E}_{f}$ holds. Let $f(n)$ be an arbitrary positive function. Then for every $a>0$, with probability at least $\mathbb{P}_{succ}\geq 1-4{k\choose 2}e^{-\frac{2a^{2}n}{f^{4}(n)}}$, taken under the coin tosses used to construct $\mathcal{D}$, the following is true for every $1\leq i_{1}\neq i_{2}\leq k$:
Note that we get the first inequality for free from the fact that $x$ is orthogonal to $y$ (in other words, $\sum_{u=1}^{n}s_{i_{1},u}v_{i_{1},u}$ can be represented as $C\sum_{u=1}^{n}x_{u}y_{u}$ and the latter expression is clearly $0$). Let us now consider one of the three remaining expressions. Note that they can be rewritten as:
for some $\rho,\lambda,\zeta,\gamma$. Note also that from the $\Psi$-regularity condition we immediately obtain that $\rho(i)=\lambda(i)$ for at most $\Psi$ elements of each sum. Remove these elements from each sum and consider the remaining ones. From the definition of the $\mathcal{P}$-chromatic number, the remaining elements can be partitioned into at most $\chi(\mathcal{P})$ parts, each consisting of elements that are independent random variables (since in the corresponding graph there are no edges between them). Thus, to the sum corresponding to each part one can apply Lemma 7.2, and conclude that the sum differs from its expectation (which is clearly $0$, since $E(d_{i}d_{j})=0$ for $i\neq j$) by $a$ with probability at most:
Now it is time to use the fact that event $\mathcal{E}_{f}$ holds. Then we know that $|x_{i}|,|y_{i}|\leq\frac{f(n)}{\sqrt{n}}$ for $i=1,...,n$. Substituting this upper bound for $|x_{i}|,|y_{i}|$ in the derived expressions on the probabilities coming from Lemma 7.2, and then taking the union bound, we complete the proof. ∎
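The coloring step used in this proof can be made concrete with a small sketch. Here `deps` is a hypothetical assignment of Gaussian pool variables to the terms of the sum (a made-up pattern, not the actual Toeplitz structure); a greedy proper coloring of the conflict graph splits the terms into classes of mutually independent random variables, and the number of classes upper-bounds the chromatic number for this pattern:

```python
# Hypothetical dependence pattern: term i of the sum uses the pool variables deps[i].
# Two terms are dependent iff their variable sets intersect.
deps = [{0, 1}, {1, 2}, {2, 3}, {3, 4}, {4, 0}, {5, 6}]

def greedy_partition(deps):
    """Greedy proper coloring of the conflict graph; returns the color classes."""
    colors = {}
    for i, d in enumerate(deps):
        used = {colors[j] for j in colors if d & deps[j]}
        colors[i] = next(c for c in range(len(deps)) if c not in used)
    classes = {}
    for i, c in colors.items():
        classes.setdefault(c, []).append(i)
    return classes

classes = greedy_partition(deps)
print(classes)   # terms 0-4 form an odd 5-cycle and need 3 classes; term 5 joins class 0
```

Within each class the terms are independent, so a Hoeffding-type bound (Lemma 7.2) applies to each class separately, at the cost of a factor $\chi(\mathcal{P})$ in the union bound.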
We can now finish the proof of Theorem 4.1. From Lemma 7.4 we see that $s_{1},...,s_{k},v_{1},...,v_{k}$ are close to pairwise orthogonal with high probability. Let us fix some positive function $f(n)>0$ and some $a>0$. Denote
Note that, by Lemma 7.4, applying the Gram-Schmidt process we can obtain a system of pairwise orthogonal vectors $\tilde{s}_{1},...,\tilde{s}_{k},\tilde{v}_{1},...,\tilde{v}_{k}$ such that
where $\sigma(k)>0$ is some function of $k$ (it does not depend on $n$ and $t$). Note that for $n,t$ large enough we have: $\sigma(k)\Delta\leq\sqrt{a}\chi(\mathcal{P})+\Psi\frac{f^{2}(n)}{\sqrt{n}}$.
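The Gram-Schmidt step behaves exactly as described: orthogonalizing a nearly orthogonal system moves each vector only by an amount of the order of the largest pairwise inner product. A quick numerical sketch (assuming random unit vectors in high dimension, whose pairwise inner products of order $\frac{1}{\sqrt{n}}$ play the role of $\Delta$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 2000, 4
V = rng.standard_normal((m, n))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # nearly orthogonal unit vectors

# Classical Gram-Schmidt: V_t[i] is V[i] minus its projections onto earlier vectors.
V_t = V.copy()
for i in range(m):
    for j in range(i):
        V_t[i] -= (V_t[i] @ V_t[j]) / (V_t[j] @ V_t[j]) * V_t[j]

delta = max(abs(V[i] @ V[j]) for i in range(m) for j in range(i))
drift = np.linalg.norm(V_t - V, axis=1).max()
print(delta, drift)   # drift is of the same order as delta
```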
Let us consider again $w_{x,y}$. Replacing $s_{i}$ by $\tilde{s}_{i}$ and $v_{i}$ by $\tilde{v}_{i}$ in the formula for $w_{x,y}$, we obtain another Gaussian vector $\tilde{w}_{x,y}$ for each row $i$ of the matrix $\mathcal{P}$. Note however that the vectors $\tilde{w}_{x,y}$ have one crucial advantage over the vectors $w_{x,y}$: they are independent. That comes from the fact that $\tilde{s}_{1},...,\tilde{s}_{k},\tilde{v}_{1},...,\tilde{v}_{k}$ are pairwise orthogonal. Note also that from (36) and (37) we obtain that the angular distance between $w_{x,y}$ and $\tilde{w}_{x,y}$ is at most $\sigma(k)\Delta$.
Let $Z_{i}$ for $i=1,...,k$ be an indicator random variable that is one if $\tilde{w}_{x,y}$ is inside the region $\mathcal{U}_{p,r}$ and zero otherwise. Let $U_{i}$ for $i=1,...,k$ be an indicator random variable that is one if $w_{x,y}$ is inside the region $\mathcal{U}_{p,r}$ and zero otherwise. Note that $\tilde{\theta}^{n}_{p,r}=\frac{U_{1}+...+U_{k}}{k}$. Furthermore, the random variables $Z_{1},...,Z_{k},U_{1},...,U_{k}$ satisfy the assumptions of Lemma 7.1 with $\mu\leq\frac{8\tau}{\theta_{p,r}}$, where $\tau=\sigma(k)\Delta$. Indeed, the random variables $Z_{i}$ are independent since the vectors $\tilde{w}_{x,y}$ are independent. From what we have said so far we know that each of them takes value one with probability exactly $\frac{\theta_{p,r}}{\pi}$. Furthermore, $Z_{i}\neq U_{i}$ only if $w_{x,y}$ is inside $\mathcal{U}_{p,r}$ and $\tilde{w}_{x,y}$ is outside $\mathcal{U}_{p,r}$, or vice versa. The latter event implies (thus it is included in the event) that $w_{x,y}$ is near the border of the region $\mathcal{U}_{p,r}$, namely within an angular distance $\tau$ from one of the four semi-lines defining $\mathcal{U}_{p,r}$. Thus in particular, the event $Z_{i}\neq U_{i}$ is contained in an event of probability at most $2\cdot 4\cdot\frac{\tau}{\theta_{p,r}}$ that depends only on one $w_{x,y}$.
But then we can apply Lemma 7.1. All we need is to assume that the premises of Lemma 7.4 are satisfied. This is the case with the probability specified in Lemma 7.3, and this probability is taken under the random coin tosses used to produce $\mathcal{H}$ and $\mathcal{R}$, thus independently from the random coin tosses used to produce $\mathcal{D}$. Putting it all together, we obtain the statement of Theorem 4.1.
We will borrow some notation from the proof of Theorem 4.1. Note however that in this setting no preprocessing with the use of matrices $\mathcal{H}$ and $\mathcal{R}$ is applied.
for some $0<\Delta<1$. Then the following is true for every fixed $1\leq i<j\leq k$:
The lemma follows from exactly the same analysis as in the last section of the proof of Theorem 4.1, thus we leave it to the reader as an exercise.
Since $U_{i}$ is an indicator random variable that takes value one with probability $\frac{\theta_{p,r}}{\pi}$, we get:
Thus we have:
Therefore, using Lemma 8.1, we obtain:
It suffices to estimate the parameter $\Delta$. We proceed as in the previous proof. We only need to be a little more cautious, since the condition $|x_{i}|,|y_{i}|\leq\frac{f(n)}{\sqrt{n}}$ cannot be assumed right now. We select two rows $i_{1},i_{2}$ of $\mathcal{P}$. Note that, again, applying the Gram-Schmidt process we can obtain a system of pairwise orthogonal vectors $\tilde{s}_{i_{1}},\tilde{s}_{i_{2}},\tilde{v}_{i_{1}},\tilde{v}_{i_{2}}$ such that
The fact that right now the above upper bounds are not multiplied by $k$, as was the case in the previous proof, plays a key role in obtaining nontrivial concentration results even when no Hadamard mechanism is applied.
We consider the related sums $E_{1}=\sum_{i=1}^{n}d_{\rho(i)}d_{\lambda(i)}x_{\zeta(i)}x_{\gamma(i)}$, $E_{2}=\sum_{i=1}^{n}d_{\rho(i)}d_{\lambda(i)}y_{\zeta(i)}y_{\gamma(i)}$, $E_{3}=\sum_{i=1}^{n}d_{\rho(i)}d_{\lambda(i)}x_{\zeta(i)}y_{\gamma(i)}$ as before. We can again partition each sum into at most $\chi(\mathcal{P})$ sub-chunks, where this time $\chi(\mathcal{P})\leq 3$ (since $\mathcal{P}$ is Toeplitz Gaussian). The problem is that, applying Lemma 7.2, we get bounds that depend on expressions of the form
where indices are added modulo $n$, and this time we cannot assume that all $|x_{i}|,|y_{i}|$ are small. Fortunately we have:
Let us fix some positive function $f(k)$. We can conclude that the number of variables $\alpha_{x,i}$ such that $\alpha_{x,i}\geq\frac{f(k)}{{k\choose 2}}$ is at most $\frac{{k\choose 2}}{f(k)}$. Note that each such $\alpha_{x,i}$ and each such $\alpha_{y,i}$ corresponds to a pair $\{i_{1},i_{2}\}$ of rows of the matrix $\mathcal{P}$, and consequently to the unique element $Cov(U_{i_{1}},U_{i_{2}})$ of the entire covariance sum (scaled by $\frac{1}{k^{2}}$). Since trivially we have $|Cov(U_{i_{1}},U_{i_{2}})|=O(1)$, we conclude that the contribution of these elements to the entire covariance sum is of order $\frac{1}{f(k)}$. Let us now consider those $\alpha_{x,i}$ and $\alpha_{y,i}$ that are at most $\frac{f(k)}{{k\choose 2}}$. These are small (if we take $f(k)=o(k^{2})$) and thus it makes sense to apply Lemma 7.2 to them. That gives us the upper bound $a=\Omega(\Delta)$ with probability:
Taking $f(k)=(\frac{k^{2}}{\log(k)})^{\frac{1}{3}}$ and $a=O(\Delta)=\frac{1}{f(k)}$, we get $\mathbb{P}^{*}\geq 1-O(\frac{1}{k})$ and furthermore:
Thus, from Chebyshev's inequality, we get the following for every $c>0$ and fixed points $p,r$:
That completes the proof of Theorem 4.2.
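The counting argument in this proof relies on the normalization $\sum_{i}\alpha_{x,i}=1$ for a unit vector $x$, which follows from $\sum_{i}\sum_{j}x_{j}^{2}x_{j+i}^{2}=(\sum_{j}x_{j}^{2})^{2}$. A quick numerical check of this identity (a sketch, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 64
x = rng.standard_normal(n)
x /= np.linalg.norm(x)          # unit vector: sum(x**2) == 1

# alpha_{x,i} = sum_j x_j^2 x_{j+i}^2 with indices taken modulo n
alpha = np.array([np.dot(x**2, np.roll(x, -i)**2) for i in range(n)])
print(alpha.sum())              # equals (sum_j x_j^2)^2 = 1
```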
Figure 8a and Figure 8b show how the mean train error is affected by the size of the hash, and Figure 8c shows how the mean train error changes with the size of the reduction for the neural network experiment. In Table 3 we report both the mean and the standard deviation of the train error across our neural network experiments. Baseline refers to the network with one hidden layer containing $100$ hidden units, where all parameters are trained.
Figure 9a shows the original version of Figure 6a (before zoom). Figure 9b shows the original version of Figure 7a (before zoom). Finally, Table 4 shows the mean and the standard deviation of the test error versus the size of the hash ($k$) / size of the reduction ($n/k$) for 1-NN.
Table: S5.T1: Mean and std of the test error versus the size of the hash ($k$) / size of the reduction ($n/k$) for the network.
| $k$ / $n/k$ | Circulant [%] | Random [%] | BinPerm [%] | BinCirc [%] | HalfShift [%] | Toeplitz [%] | VerHorShift [%] |
|---|---|---|---|---|---|---|---|
| 1024 / 1 | 3.53 ± 0.16 | 2.78 ± 0.10 | 3.69 ± 0.21 | 6.79 ± 0.49 | 3.54 ± 0.16 | 3.16 ± 0.19 | 3.74 ± 0.16 |
| 512 / 2 | 5.42 ± 0.83 | 3.61 ± 0.19 | 4.68 ± 0.35 | 8.10 ± 1.85 | 5.13 ± 2.15 | 4.97 ± 0.53 | 5.55 ± 0.62 |
| 256 / 4 | 11.56 ± 1.42 | 4.79 ± 0.13 | 7.43 ± 1.31 | 6.13 ± 1.42 | 5.98 ± 1.05 | 9.48 ± 1.88 | 10.96 ± 2.78 |
| 128 / 8 | 22.10 ± 5.42 | 10.13 ± 0.24 | 10.02 ± 0.50 | 11.43 ± 0.92 | 12.42 ± 0.95 | 18.35 ± 2.36 | 15.82 ± 1.63 |
| 64 / 16 | 29.50 ± 1.13 | 16.26 ± 1.02 | 26.50 ± 10.55 | 22.07 ± 1.35 | 20.90 ± 2.25 | 32.82 ± 4.83 | 21.59 ± 3.05 |
| 32 / 32 | 42.07 ± 4.16 | 28.77 ± 2.28 | 29.94 ± 3.48 | 35.55 ± 3.12 | 29.15 ± 0.97 | 42.97 ± 2.08 | 45.10 ± 4.46 |
| 16 / 64 | 64.20 ± 6.76 | 46.06 ± 1.03 | 50.65 ± 5.66 | 58.70 ± 7.15 | 55.40 ± 6.90 | 57.96 ± 3.65 | 61.66 ± 4.08 |
Table: S5.T2: Memory complexity and number of required random values for structured matrices and Random matrix.
| Matrix | Random | Circulant | BinPerm | HalfShift | VerHorShift | BinCirc | Toeplitz |
|---|---|---|---|---|---|---|---|
| # of random values | $\mathcal{O}(nk)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ |
| Memory complexity | $\mathcal{O}(nk)$ | $\mathcal{O}(n)$ | $\mathcal{O}(nk)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ |
Two vectors $p$, $r$ spanning a two-dimensional hyperplane $H$, with the angular distance $\theta_{p,r}$ between them. We have: $l_{p,r}=g^{i}_{\mathcal{D},H,\perp}$. Line $l_{p,r}$ is dividing $\theta_{p,r}$ and thus $g^{i}$ contributes to $\|h(p)-h(r)\|_{1}$.
Similar setting to the one presented in Figure 1. Vector $v$ represents the $L_{2}$-normalized version of $g^{i}$ and is perpendicular to the two-dimensional plane $R$. The intersection $R\cap H$ of that plane with the $2$-dimensional plane $H$ spanned by $p,r$ is a line $l_{p,r}$ that this time is outside $\mathcal{U}_{p,r}$. Thus $g^{i}$ does not contribute to $\|h(p)-h(r)\|_{1}$.
Left: matrix $\mathcal{P}$ with two highlighted rows of indices $k_{1}=i$ and $k_{2}=j$ respectively, where $j=i+2$. Right: the corresponding graph, which consists of two cycles. If each cycle is even then this graph is $2$-colorable, as indicated on the picture. Thus we have: $\chi(\mathcal{G}_{\mathcal{P}}(k_{1},k_{2}))=2$.
The dependence of the upper bound on the variance of the normalized approximate angle $\tilde{\theta}^{n}_{p,r}$ on (left:) the angle, when the size of the hash $k$ is fixed (the upper bound scales as $\frac{1}{k}$ and is almost independent of $\theta_{p,r}$); (right:) the size of the hash $k$, when the true angular distance $\theta_{p,r}$ is fixed (the upper bound converges to $0$ as $k\rightarrow\infty$).
Mean test error versus a) the size of the hash ($k$) (zoomed plot; the original plot is in the Supplement), b) the size of the reduction ($n/k$) for the network. Baseline corresponds to $1.7\%$.
$$ \left(\begin{array}{ccccc}\sum_{l\in\mathcal{S}_{1,1}}g_{l}&...&\sum_{l\in\mathcal{S}_{1,j}}g_{l}&...&\sum_{l\in\mathcal{S}_{1,n}}g_{l}\\ ...&...&...&...&...\\ \sum_{l\in\mathcal{S}_{i,1}}g_{l}&...&\sum_{l\in\mathcal{S}_{i,j}}g_{l}&...&\sum_{l\in\mathcal{S}_{i,n}}g_{l}\\ ...&...&...&...&...\\ \sum_{l\in\mathcal{S}_{k,1}}g_{l}&...&\sum_{l\in\mathcal{S}_{k,j}}g_{l}&...&\sum_{l\in\mathcal{S}_{k,n}}g_{l}\end{array}\right) $$ \tag{S3.Ex1}
$$ \underbrace{x \xrightarrow{\mathcal{R}} x_{\mathcal{R}} \xrightarrow{\mathcal{H}} x_{\mathcal{H}} \xrightarrow{\mathcal{D}} x_{\mathcal{D}}}_{\text{preprocessing}} \underbrace{\xrightarrow{\mathcal{P}_{\Psi}} x_{\mathcal{P}_{\Psi}} \xrightarrow{\phi} h(x)}_{\text{hashing}} \in \mathbb{R}^{k}. $$
$$ \tilde{\theta}_{p,r}^{n} = \frac{1}{2k}\|h(p)-h(r)\|_{1} $$
$$ X_{i} = \left\{ \begin{array}{ll} 1 & \mbox{if } g^{i}_{\mathcal{D},H} \in \mathcal{U}_{p,r}, \\ 0 & \mbox{otherwise.} \end{array} \right. $$
$$ \chi(\mathcal{P})=\max_{1\leq k_{1}<k_{2}\leq k}\chi(\mathcal{G}(k_{1},k_{2})). $$ \tag{S4.Ex2}
$$ \forall_{p,r \in\mathcal{D}} Var(\tilde{\theta}^{n}_{p,r}) \leq \frac{1}{k}\frac{\theta_{p,r}(\pi-\theta_{p,r})}{\pi^{2}} + \left(\frac{\log(k)}{k^{2}}\right)^{\frac{1}{3}}, $$
$$ \mathbb{P}\left(\left|\tilde{\theta}^{n}_{p,r}-\frac{\theta_{p,r}}{\pi}\right|\geq c\left(\frac{\sqrt{\log(k)}}{k}\right)^{\frac{1}{3}}\right)=O\left(\frac{1}{c^{2}}\right). $$ \tag{S4.Ex7}
$$ \mathbb{P}(|U-EU| > \epsilon) \leq \frac{1}{\pi} \sum_{r=\frac{\epsilon k}{2}}^{k}\frac{1}{\sqrt{r}}\left(\frac{ke}{r}\right)^{r}\mu^{r}(1-\mu)^{k-r} + 2e^{-\frac{\epsilon^{2}k}{2}}. $$
$$ \mathcal{F}_{bad} = \bigcup_{Q \subseteq \{1,...,k\}} \left(\bigcap_{q \in Q} \mathcal{F}_{q} \cap \bigcap_{q \in \{1,...,k\} \setminus Q} \mathcal{F}^{c}_{q}\right), $$
$$ \mathbb{P}(|u_{1}T_{1}+...+u_{n}T_{n}| \geq a) \leq 2e^{-\frac{2a^{2}}{\sum_{i=1}^{n}(2u_{i})^{2}}} \leq 2e^{-\frac{a^{2}}{2}}. $$
$$ \mathbb{P}^{*} \geq 1-e^{-\Omega(a^{2}\frac{k^{2}}{f(k)})}. $$
$$ u^{\mathcal{H}\mathcal{R}}_{j} = u_{1}T_{1}+...+u_{n}T_{n}, $$
$$ w=(\mathcal{P}_{i,1}d_{1},...,\mathcal{P}_{i,n}d_{n}), $$
$$ \mathcal{D} = \begin{pmatrix} d_{1} & 0 & \cdots & 0 \\ 0 & d_{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_{n} \end{pmatrix}. $$
$$ w_{{x,y}}=(g_{1}s_{i,1}+...+g_{t}s_{i,t},g_{1}v_{i,1}+...+g_{t}v_{i,t}), $$
$$ \Delta = a\chi(\mathcal{P}) + \Psi \frac{f^{2}(n)}{n}. $$
$$ E = \sum_{i=1}^{n} d_{\rho(i)}d_{\lambda(i)} x_{\zeta(i)}x_{\gamma(i)} $$
$$ \label{ineq1}\|\tilde{v}_{i}-v_{i}\|_{2} \leq \sigma(k) \Delta. $$ \tag{ineq1}
$$ |\mathbb{P}(U_{i}U_{j}=1)-\mathbb{P}(U_{i}=1)\mathbb{P}(U_{j}=1)|=O(\Delta). $$ \tag{S8.Ex20}
$$ \alpha_{x,i} = \sum_{j=1}^{n}x_{j}^{2}x_{j+i}^{2}, $$
$$ \sum_{i=1}^{n} \alpha_{x,i} = 1, $$
$$ \left[1-4{N\choose 2}e^{-\frac{f^{2}(n)}{2}}-4\chi(\mathcal{P}){k\choose 2}e^{-\frac{2a^{2}t}{f^{4}(t)}}\right], $$
Definition. $M$ is a circulant Gaussian matrix if its first row is a sequence of independent Gaussian random variables drawn from the distribution $N(0,1)$, and each subsequent row is obtained from the previous one by either only one-left shifts or only one-right shifts.
Definition. $M$ is a Toeplitz Gaussian matrix if each of its descending diagonals from left to right is of the form $(g,...,g)$ for some $g \sim N(0,1)$ and different descending diagonals are independent.
Remark. The circulant Gaussian matrix with right shifts is a special type of the Toeplitz Gaussian matrix.
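Both definitions can be instantiated directly, e.g. with `scipy.linalg` (a sketch using SciPy's helpers, not the paper's code):

```python
import numpy as np
from scipy.linalg import circulant, toeplitz

rng = np.random.default_rng(3)
n = 6
g = rng.standard_normal(n)

C = circulant(g)                 # each row is a one-right shift of the previous row
# Toeplitz Gaussian matrix: first column and first row drawn independently
# (scipy ignores r[0] and uses c[0] on the diagonal), 2n-1 Gaussians in total.
T = toeplitz(rng.standard_normal(n), rng.standard_normal(n))

print(np.allclose(C[1], np.roll(C[0], 1)))   # True: circulant shift structure
print(np.allclose(T[1:, 1:], T[:-1, :-1]))   # True: constant descending diagonals
```

Note that the circulant matrix uses only $n$ random values and the Toeplitz matrix $2n-1$, versus $nk$ for a fully random Gaussian matrix.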
Definition. Definition 3.3. Let t𝑡t be the size of the pool of independent random Gaussian variables {g1,…,gt}subscript𝑔1…subscript𝑔𝑡{g_{1},...,g_{t}}, where each gi∼𝒩(0,1)similar-tosubscript𝑔𝑖𝒩01g_{i}\sim\mathcal{N}(0,1). Assume that k≤n≤t≤kn𝑘𝑛𝑡𝑘𝑛k\leq n\leq t\leq kn. We take ΨΨ\Psi to be a natural number, i.e. Ψ∈ℕΨℕ\Psi\in\mathbb{N}. 𝒫𝒫\mathcal{P} is ΨΨ\Psi-regular random matrix if it has the following form (∑l∈𝒮1,1gl…∑l∈𝒮1,jgl…∑l∈𝒮1,ngl……………∑l∈𝒮i,1gl…∑l∈𝒮i,jgl…∑l∈𝒮i,ngl……………∑l∈𝒮k,1gl…∑l∈𝒮k,jgl…∑l∈𝒮k,ngl)subscript𝑙subscript𝒮11subscript𝑔𝑙…subscript𝑙subscript𝒮1𝑗subscript𝑔𝑙…subscript𝑙subscript𝒮1𝑛subscript𝑔𝑙……………subscript𝑙subscript𝒮𝑖1subscript𝑔𝑙…subscript𝑙subscript𝒮𝑖𝑗subscript𝑔𝑙…subscript𝑙subscript𝒮𝑖𝑛subscript𝑔𝑙……………subscript𝑙subscript𝒮𝑘1subscript𝑔𝑙…subscript𝑙subscript𝒮𝑘𝑗subscript𝑔𝑙…subscript𝑙subscript𝒮𝑘𝑛subscript𝑔𝑙\left(\begin{array}[]{ccccc}\sum_{l\in\mathcal{S}{1,1}}g{l}&...&\sum_{l\in\mathcal{S}{1,j}}g{l}&...&\sum_{l\in\mathcal{S}{1,n}}g{l}\ ...&...&...&...&...\ \sum_{l\in\mathcal{S}{i,1}}g{l}&...&\sum_{l\in\mathcal{S}{i,j}}g{l}&...&\sum_{l\in\mathcal{S}{i,n}}g{l}\ ...&...&...&...&...\ \sum_{l\in\mathcal{S}{k,1}}g{l}&...&\sum_{l\in\mathcal{S}{k,j}}g{l}&...&\sum_{l\in\mathcal{S}{k,n}}g{l}\end{array}\right) where Si,j⊆{1,…,t}subscript𝑆𝑖𝑗1…𝑡S_{i,j}\subseteq{1,...,t} for i∈{1,…,k}𝑖1…𝑘i\in{1,...,k}, j∈{1,…,n}𝑗1…𝑛j\in{1,...,n}, |Si,1|=…=|Si,n|subscript𝑆𝑖1…subscript𝑆𝑖𝑛|S_{i,1}|=...=|S_{i,n}| for i=1,…,k𝑖1…𝑘i=1,...,k, Si,j1∩Si,j2=∅subscript𝑆𝑖subscript𝑗1subscript𝑆𝑖subscript𝑗2S_{i,j_{1}}\cap S_{i,j_{2}}=\emptyset for i∈{1,…,k}𝑖1…𝑘i\in{1,...,k}, j1,j2∈{1,…,n}subscript𝑗1subscript𝑗21…𝑛j_{1},j_{2}\in{1,...,n}, j1≠j2subscript𝑗1subscript𝑗2j_{1}\neq j_{2}, and furthermore the following holds: for any two different rows ℛ1,ℛ2subscriptℛ1subscriptℛ2\mathcal{R}{1},\mathcal{R}{2} of 𝒫𝒫\mathcal{P} the number of random variables glsubscript𝑔𝑙g_{l}, where l∈{1,…,t}𝑙1…𝑡l\in{1,...,t}, such that glsubscript𝑔𝑙g_{l} is in the intersection of some column with both ℛ1subscriptℛ1\mathcal{R}{1} 
and ℛ2subscriptℛ2\mathcal{R}{2} is at most ΨΨ\Psi.
Remark. Remark 3.2. Circulant Gaussian matrices and Toeplitz Gaussian matrices are special cases of $0$-regular matrices. A Toeplitz Gaussian matrix is $0$-regular with the subsets $\mathcal{S}_{i,j}$ being singletons.
Remark. Computing hashes for the structured matrices Toeplitz, BinCirc, HalfShift, and VerHorShift can be done faster than in time $O(nk)$ (e.g. for Toeplitz one can use FFT to reduce the computation to $O(n\log k)$). Thus our structured approach leads to speed-ups, storage compression (since many structured matrices covered by our theoretical model can be stored in linear space) and a reduction in randomness usage. The goal of this paper is not to analyze fast matrix-vector product algorithms in detail, since that requires a separate paper. We do, however, point out that well-known fast matrix-vector product algorithms are among the key advantages of our structured approach.
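The FFT speed-up mentioned above can be sketched for the circulant case (a Toeplitz matrix can be handled analogously after embedding it into a $2n\times 2n$ circulant matrix). This is an illustrative sketch, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 512
c = rng.standard_normal(n)   # first column of the circulant matrix: C[i, j] = c[(i - j) % n]
x = rng.standard_normal(n)

# O(n log n) product via FFT: C x is the circular convolution of c and x.
y_fft = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real

# Explicit O(n^2) product for comparison.
C = np.column_stack([np.roll(c, j) for j in range(n)])
y_direct = C @ x

print(np.allclose(y_fft, y_direct))   # True
```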
Lemma. Let $M$ be a $\Psi$-regular hashing model (either extended or short) and $\|p\|_{2}=\|r\|_{2}=1$. Then $\tilde{\theta}_{p,r}^{n}$ is an unbiased estimator of $\frac{\theta_{p,r}}{\pi}$, i.e. $E(\tilde{\theta}_{p,r}^{n}) = \frac{\theta_{p,r}}{\pi}.$
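The unbiasedness claim is easy to sanity-check by simulation. The sketch below uses an unstructured Gaussian projection (a special case of the model) and two unit vectors at angle $\frac{\pi}{4}$; the normalized Hamming distance of the sign hashes concentrates around $\frac{\theta_{p,r}}{\pi}=0.25$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 32, 200_000                     # dimension and (deliberately huge) hash size
p = np.zeros(n); p[0] = 1.0
r = np.zeros(n); r[0] = r[1] = 1.0 / np.sqrt(2)   # angle between p and r is pi/4

G = rng.standard_normal((k, n))        # fully random Gaussian projection matrix
h_p, h_r = np.sign(G @ p), np.sign(G @ r)
theta_est = np.abs(h_p - h_r).sum() / (2 * k)     # = (1/2k) * ||h(p) - h(r)||_1

print(theta_est)                       # close to theta / pi = 0.25
```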
Definition. Let $P$ be a $\Psi$-regular matrix. We define the $P$-chromatic number $\chi(P)$ as: $$\chi(P) = \max_{1 \leq k_{1} < k_{2} \leq k} \chi(G(k_{1},k_{2})).$$
Theorem. Theorem 4.1. Consider the extended $\Psi$-regular hashing model $\mathcal{M}$ with $t$ independent Gaussian random variables $g_{1},...,g_{t}$, each of distribution $\mathcal{N}(0,1)$. Let $N$ be the size of the dataset $\mathcal{D}$. Denote by $k$ the size of the hash and by $n$ the dimensionality of the data. Let $f(n)$ be an arbitrary positive function. Let $\theta_{p,r}$ be the angular distance between vectors $p,r\in\mathcal{D}$. Then for $a=o_{n}(1)$, $\epsilon>0$, $t\geq n$ and $n$ large enough: $$ \mathbb{P}\left(\forall_{p,r\in\mathcal{D}}\left|\tilde{\theta}^{n}_{p,r}-\frac{\theta_{p,r}}{\pi}\right|\leq\epsilon\right)\geq\left[1-4{N\choose 2}e^{-\frac{f^{2}(n)}{2}}-4\chi(\mathcal{P}){k\choose 2}e^{-\frac{2a^{2}t}{f^{4}(t)}}\right](1-\Lambda), $$ where $\Lambda=\frac{1}{\pi}\sum_{j=\lfloor\frac{\epsilon k}{2}\rfloor}^{k}\frac{1}{\sqrt{j}}(\frac{ke}{j})^{j}\mu^{j}(1-\mu)^{k-j}+2e^{-\frac{\epsilon^{2}k}{2}}$ and $\mu=\frac{8(\sqrt{a}\chi(\mathcal{P})+\Psi\frac{f^{2}(n)}{\sqrt{n}})}{\theta_{p,r}}$.
Corollary. Corollary 4.1. Consider the extended $\Psi$-regular hashing model $\mathcal{M}$. Assume that the projection matrix $\mathcal{P}$ is Toeplitz Gaussian. Let $N,n,k$ be as above and denote by $\theta_{p,r}$ the angular distance between vectors $p,r\in\mathcal{D}$. Then the following is true for $n$ large enough: $$ \mathbb{P}\left(\forall_{p,r\in\mathcal{D}}\left|\tilde{\theta}^{n}_{p,r}-\frac{\theta_{p,r}}{\pi}\right|\leq k^{-\frac{1}{3}}\right)\geq\left[1-O\left(\frac{N^{2}}{e^{poly(n)}}+k^{2}e^{-n^{\frac{3}{10}}}\right)\right]\left(1-3e^{-\frac{k^{\frac{1}{3}}}{2}}\right). $$
Theorem. Theorem 4.2. Consider the short $\Psi$-regular hashing model $\mathcal{M}$, where $\mathcal{P}$ is a Toeplitz Gaussian matrix. Denote by $k$ the size of the hash. Let $\theta_{p,r}$ be the angular distance between vectors $p,r\in\mathcal{D}$, where $\mathcal{D}$ is the dataset. Then the following is true: $$ \forall_{p,r\in\mathcal{D}} Var(\tilde{\theta}^{n}_{p,r})\leq\frac{1}{k}\frac{\theta_{p,r}(\pi-\theta_{p,r})}{\pi^{2}}+\left(\frac{\log(k)}{k^{2}}\right)^{\frac{1}{3}}, \tag{5} $$ and thus for any $c>0$ and $p,r\in\mathcal{D}$: $$ \mathbb{P}\left(\left|\tilde{\theta}^{n}_{p,r}-\frac{\theta_{p,r}}{\pi}\right|\geq c\left(\frac{\sqrt{\log(k)}}{k}\right)^{\frac{1}{3}}\right)=O\left(\frac{1}{c^{2}}\right). $$
Lemma. Lemma 7.1. Let $\{Z_{1},...,Z_{k}\}$ be a set of $k$ independent random variables defined on $\Omega$ such that each $Z_{i}$ has the same distribution and $Z_{i}\in\{0,1\}$. Let $\{\mathcal{F}_{1},...,\mathcal{F}_{k}\}$ be a set of events, where each $\mathcal{F}_{i}$ is in the $\sigma$-field defined by $Z_{i}$ (in particular $\mathcal{F}_{i}$ does not depend on the $\sigma$-field $\sigma(Z_{1},...,Z_{i-1},Z_{i+1},...,Z_{k})$). Assume that there exists $\mu<\frac{1}{2}$ such that $\mathbb{P}(\mathcal{F}_{i})\leq\mu$ for $i=1,...,k$. Let $\{U_{1},...,U_{k}\}$ be a set of $k$ random variables such that $U_{i}\in\{0,1\}$ and $U_{i}|\mathcal{F}_{i}=Z_{i}|\mathcal{F}_{i}$ for $i=1,...,k$, where $X|\mathcal{F}$ stands for the random variable $X$ truncated to the event $\mathcal{F}$. Assume furthermore that $E(U_{i})=E(Z_{i})$ for $i=1,...,k$. Denote $U=\frac{U_{1}+...+U_{k}}{k}$. Then the following is true: $$ \mathbb{P}(|U-EU|>\epsilon)\leq\frac{1}{\pi}\sum_{r=\frac{\epsilon k}{2}}^{k}\frac{1}{\sqrt{r}}\left(\frac{ke}{r}\right)^{r}\mu^{r}(1-\mu)^{k-r}+2e^{-\frac{\epsilon^{2}k}{2}}. \tag{6} $$
Lemma. Lemma 7.2. Let $W_{1},...,W_{k}$ be $k$ independent random variables such that $E(W_{1})=...=E(W_{k})=0$. Assume that $-\alpha_{i}\leq W_{i}\leq\beta_{i}$ for $i=1,...,k$. Then the following is true: $$ \mathbb{P}\left(\left|\sum_{i=1}^{k}W_{i}\right|>a\right)\leq 2e^{-\frac{2a^{2}}{\sum_{i=1}^{k}(\alpha_{i}+\beta_{i})^{2}}}. $$
Lemma. Lemma 7.3. Let $n$ denote the data dimensionality and let $f(n)$ be an arbitrary positive function. Let $D$ be the set of all $L_{2}$-normalized data points, where no two data points are identical. Assume that $|D|=N$. Consider the ${N\choose 2}$ hyperplanes $H_{p,r}$ spanned by pairs of different vectors $\{p,r\}$ from $D$. Then after applying the linear transformation $\mathcal{H}\mathcal{R}$, each hyperplane $H_{p,r}$ is transformed into another hyperplane $H^{\mathcal{H}\mathcal{R}}_{p,r}$. Furthermore, the probability $\mathcal{P}_{\mathcal{H}\mathcal{R}}$ that for every $H^{\mathcal{H}\mathcal{R}}_{p,r}$ there exist two orthonormal vectors $x=(x_{1},...,x_{n}),y=(y_{1},...,y_{n})$ in $H^{\mathcal{H}\mathcal{R}}_{p,r}$ such that $|x_{i}|,|y_{i}|\leq\frac{f(n)}{\sqrt{n}}$ satisfies: $$ \mathcal{P}_{\mathcal{H}\mathcal{R}}\geq 1-4{N\choose 2}e^{-\frac{f^{2}(n)}{2}}. $$
Lemma. Lemma 7.4. Let us assume that $\mathcal{E}_{f}$ holds. Let $f(n)$ be an arbitrary positive function. Then for every $a>0$, with probability at least $\mathbb{P}_{succ}\geq 1-4{k\choose 2}e^{-\frac{2a^{2}n}{f^{4}(n)}}$, taken under the coin tosses used to construct $\mathcal{D}$, the following is true for every $1\leq i_{1}\neq i_{2}\leq k$: $$ \left|\sum_{u=1}^{n}s_{i_{1},u}v_{i_{1},u}\right|\leq a\chi(\mathcal{P})+\Psi\frac{f^{2}(n)}{n},\quad \left|\sum_{u=1}^{n}s_{i_{1},u}s_{i_{2},u}\right|\leq a\chi(\mathcal{P})+\Psi\frac{f^{2}(n)}{n}, $$ $$ \left|\sum_{u=1}^{n}v_{i_{1},u}v_{i_{2},u}\right|\leq a\chi(\mathcal{P})+\Psi\frac{f^{2}(n)}{n},\quad \left|\sum_{u=1}^{n}s_{i_{1},u}v_{i_{2},u}\right|\leq a\chi(\mathcal{P})+\Psi\frac{f^{2}(n)}{n}. $$
Lemma. Lemma 8.1. Define U1,…,Uksubscript𝑈1…subscript𝑈𝑘U_{1},...,U_{k} as in the proof of Theorem 4.1. Assume that the following is true: |∑u=1nsi1,uvi1,u|≤Δ,superscriptsubscript𝑢1𝑛subscript𝑠subscript𝑖1𝑢subscript𝑣subscript𝑖1𝑢Δ|\sum_{u=1}^{n}s_{i_{1},u}v_{i_{1},u}|\leq\Delta, |∑u=1nsi1,usi2,u|≤Δ,superscriptsubscript𝑢1𝑛subscript𝑠subscript𝑖1𝑢subscript𝑠subscript𝑖2𝑢Δ|\sum_{u=1}^{n}s_{i_{1},u}s_{i_{2},u}|\leq\Delta, |∑u=1nvi1,uvi2,u|≤Δ,superscriptsubscript𝑢1𝑛subscript𝑣subscript𝑖1𝑢subscript𝑣subscript𝑖2𝑢Δ|\sum_{u=1}^{n}v_{i_{1},u}v_{i_{2},u}|\leq\Delta, |∑u=1nsi1,uvi2,u|≤Δ.superscriptsubscript𝑢1𝑛subscript𝑠subscript𝑖1𝑢subscript𝑣subscript𝑖2𝑢Δ|\sum_{u=1}^{n}s_{i_{1},u}v_{i_{2},u}|\leq\Delta. for some 0<Δ<10Δ10<\Delta<1. The the following is true for every fixed 1≤i<j≤k1𝑖𝑗𝑘1\leq i<j\leq k: |ℙ(UiUj=1)−ℙ(Ui=1)ℙ(Uj=1)|=O(Δ).ℙsubscript𝑈𝑖subscript𝑈𝑗1ℙsubscript𝑈𝑖1ℙsubscript𝑈𝑗1𝑂Δ|\mathbb{P}(U_{i}U_{j}=1)-\mathbb{P}(U_{i}=1)\mathbb{P}(U_{j}=1)|=O(\Delta).
| k / n/k | Circulant [%] | Random [%] | BinPerm [%] | BinCirc [%] | HalfShift [%] | Toeplitz [%] | VerHorShift [%] |
|---|---|---|---|---|---|---|---|
| 1024 / 1 | 3.53 ± 0.16 | 2.78 ± 0.10 | 3.69 ± 0.21 | 6.79 ± 0.49 | 3.54 ± 0.16 | 3.16 ± 0.19 | 3.74 ± 0.16 |
| 512 / 2 | 5.42 ± 0.83 | 3.61 ± 0.19 | 4.68 ± 0.35 | 8.10 ± 1.85 | 5.13 ± 2.15 | 4.97 ± 0.53 | 5.55 ± 0.62 |
| 256 / 4 | 11.56 ± 1.42 | 4.79 ± 0.13 | 7.43 ± 1.31 | 6.13 ± 1.42 | 5.98 ± 1.05 | 9.48 ± 1.88 | 10.96 ± 2.78 |
| 128 / 8 | 22.10 ± 5.42 | 10.13 ± 0.24 | 10.02 ± 0.50 | 11.43 ± 0.92 | 12.42 ± 0.95 | 18.35 ± 2.36 | 15.82 ± 1.63 |
| 64 / 16 | 29.50 ± 1.13 | 16.26 ± 1.02 | 26.50 ± 10.55 | 22.07 ± 1.35 | 20.90 ± 2.25 | 32.82 ± 4.83 | 21.59 ± 3.05 |
| 32 / 32 | 42.07 ± 4.16 | 28.77 ± 2.28 | 29.94 ± 3.48 | 35.55 ± 3.12 | 29.15 ± 0.97 | 42.97 ± 2.08 | 45.10 ± 4.46 |
| 16 / 64 | 64.20 ± 6.76 | 46.06 ± 1.03 | 50.65 ± 5.66 | 58.70 ± 7.15 | 55.40 ± 6.90 | 57.96 ± 3.65 | 61.66 ± 4.08 |
| Matrix | Random | Circulant | BinPerm | HalfShift | VerHorShift | BinCirc | Toeplitz |
|---|---|---|---|---|---|---|---|
| # of random values | O(nk) | O(n) | O(n) | O(n) | O(n) | O(n) | O(n) |
| Memory complexity | O(nk) | O(n) | O(nk) | O(n) | O(n) | O(n) | O(n) |
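The randomness savings in the table above can be made concrete. The following sketch (our illustration, not code from the experiments) materializes circulant and Toeplitz projection rows from a pool of O(n) Gaussian samples, whereas an unstructured Gaussian matrix would require nk independent samples:

```python
import random

def gaussian_pool(size, seed=0):
    """A fixed "budget of randomness": independent N(0, 1) samples."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

def circulant_rows(pool, k):
    # Row i is the pool cyclically shifted by i: entry (i, j) = g[(j - i) % n],
    # so k rows of length n reuse the same n random values.
    n = len(pool)
    return [[pool[(j - i) % n] for j in range(n)] for i in range(k)]

def toeplitz_rows(pool, n, k):
    # A Toeplitz matrix is constant along diagonals: entry (i, j) = g[j - i + k - 1],
    # so k rows of length n need only n + k - 1 random values.
    assert len(pool) == n + k - 1
    return [[pool[j - i + k - 1] for j in range(n)] for i in range(k)]

n, k = 8, 4
circ = circulant_rows(gaussian_pool(n), k)            # n random values total
toep = toeplitz_rows(gaussian_pool(n + k - 1), n, k)  # n + k - 1 random values
```

In practice a circulant matrix need never be materialized at all, since its product with a vector can be computed via the FFT in O(n log n) time.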
| k / n/k | Circulant [%] | Random [%] | BinPerm [%] | BinCirc [%] | HalfShift [%] | Toeplitz [%] | VerHorShift [%] |
|---|---|---|---|---|---|---|---|
| 1024 / 1 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.30 ± 0.44 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 |
| 512 / 2 | 0.04 ± 0.06 | 0.00 ± 0.00 | 0.00 ± 0.00 | 2.66 ± 2.98 | 1.44 ± 2.89 | 0.00 ± 0.00 | 0.00 ± 0.01 |
| 256 / 4 | 6.46 ± 2.27 | 0.00 ± 0.00 | 0.79 ± 1.57 | 0.60 ± 1.19 | 0.49 ± 0.93 | 2.09 ± 1.69 | 3.98 ± 3.96 |
| 128 / 8 | 16.89 ± 6.57 | 4.69 ± 0.43 | 4.44 ± 0.50 | 5.62 ± 1.03 | 7.34 ± 1.27 | 11.82 ± 2.17 | 10.51 ± 1.27 |
| 64 / 16 | 26.47 ± 0.98 | 13.35 ± 0.61 | 23.98 ± 11.54 | 18.68 ± 0.78 | 17.64 ± 2.01 | 29.97 ± 5.29 | 18.68 ± 3.26 |
| 32 / 32 | 40.79 ± 3.82 | 27.51 ± 2.04 | 28.28 ± 3.23 | 33.91 ± 3.23 | 27.90 ± 1.05 | 41.49 ± 2.14 | 43.51 ± 3.78 |
| 16 / 64 | 63.96 ± 5.62 | 46.31 ± 0.73 | 50.03 ± 6.18 | 58.71 ± 6.96 | 54.88 ± 6.47 | 57.72 ± 3.42 | 60.91 ± 4.53 |
| k / n/k | Circulant [%] | Random [%] | BinPerm [%] | BinCirc [%] | HalfShift [%] | Toeplitz [%] | VerHorShift [%] |
|---|---|---|---|---|---|---|---|
| 1024 / 1 | 6.02 ± 0.64 | 4.83 ± 0.19 | 6.67 ± 0.65 | 12.77 ± 2.86 | 6.38 ± 0.44 | 6.22 ± 1.20 | 6.30 ± 0.76 |
| 512 / 2 | 12.98 ± 11.29 | 5.77 ± 0.11 | 8.15 ± 0.56 | 12.40 ± 2.32 | 7.25 ± 0.71 | 9.11 ± 2.28 | 10.81 ± 4.31 |
| 256 / 4 | 17.73 ± 6.66 | 8.51 ± 0.35 | 11.11 ± 1.15 | 12.13 ± 4.35 | 12.05 ± 2.94 | 15.66 ± 3.36 | 18.19 ± 5.46 |
| 128 / 8 | 34.80 ± 14.59 | 14.44 ± 0.89 | 17.20 ± 2.26 | 22.15 ± 6.45 | 24.74 ± 8.14 | 33.90 ± 13.90 | 30.37 ± 7.52 |
| 64 / 16 | 45.91 ± 5.50 | 27.57 ± 1.58 | 29.53 ± 3.40 | 35.33 ± 5.58 | 36.58 ± 10.71 | 51.10 ± 13.98 | 41.66 ± 8.08 |
| 32 / 32 | 65.06 ± 9.60 | 40.58 ± 2.49 | 43.58 ± 4.66 | 53.05 ± 5.39 | 47.18 ± 7.19 | 58.24 ± 8.87 | 56.73 ± 6.09 |
| 16 / 64 | 68.61 ± 5.72 | 58.72 ± 3.08 | 60.30 ± 6.11 | 66.29 ± 4.79 | 60.84 ± 5.31 | 72.50 ± 6.04 | 72.50 ± 5.91 |
Theorem. Consider the extended $\Psi$-regular hashing model $M$ with $t$ independent Gaussian random variables $g_{1},...,g_{t}$, each of distribution $N(0,1)$. Let $N$ be the size of the dataset $D$. Denote by $k$ the size of the hash and by $n$ the dimensionality of the data. Let $f(n)$ be an arbitrary positive function. Let $\theta_{p,r}$ be the angular distance between vectors $p,r \in D$. Then for $a=o_{n}(1)$, $\epsilon>0$, $t \geq n$ and $n$ large enough:
$$P\left(\forall_{p,r \in D}\left|\theta^{n}_{p,r} - \frac{\theta_{p,r}}{\pi}\right| \leq \epsilon\right) \geq \left[1-4{N \choose 2}e^{-\frac{f^{2}(n)}{2}}-4\chi(P){k \choose 2}e^{-\frac{2a^{2}t}{f^{4}(t)}}\right](1-\Lambda),$$
where $\Lambda = \frac{1}{\pi} \sum_{j = \lfloor \frac{\epsilon k}{2} \rfloor}^{k}\frac{1}{j}\left(\frac{ke}{j}\right)^{j}\mu^{j}(1-\mu)^{k-j}+2e^{-\frac{\epsilon^{2}k}{2}}$ and $\mu=\frac{8\sqrt{a\chi(P) + \Psi\frac{f^{2}(n)}{n}}}{\theta_{p,r}}$.
Theorem. Consider the short $\Psi$-regular hashing model $M$, where $P$ is a Toeplitz Gaussian matrix. Denote by $k$ the size of the hash. Let $\theta_{p,r}$ be the angular distance between vectors $p,r \in D$, where $D$ is the dataset. Then the following is true:
$$\forall_{p,r \in D} \quad Var(\theta^{n}_{p,r}) \leq \frac{1}{k}\frac{\theta_{p,r}(\pi-\theta_{p,r})}{\pi^{2}} + \left(\frac{\log(k)}{k^{2}}\right)^{\frac{1}{3}},$$
and thus for any $c>0$ and $p,r \in D$:
$$P\left(\left|\theta^{n}_{p,r} - \frac{\theta_{p,r}}{\pi}\right| \geq c \left(\frac{\sqrt{\log(k)}}{k}\right)^{\frac{1}{3}}\right) = O\left(\frac{1}{c^{2}}\right).$$
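To illustrate this setting, the sketch below (our construction; the helper names are illustrative) hashes two vectors with a Gaussian Toeplitz projection followed by the sign nonlinearity and reads off the normalized Hamming distance, which for large $k$ concentrates around $\theta_{p,r}/\pi$:

```python
import math, random

def toeplitz_matrix(n, k, rng):
    # k x n Gaussian Toeplitz matrix built from only n + k - 1 random values.
    g = [rng.gauss(0.0, 1.0) for _ in range(n + k - 1)]
    return [[g[j - i + k - 1] for j in range(n)] for i in range(k)]

def sign_hash(P, v):
    # One hash bit per row: the sign of the projection of v onto that row.
    return [1 if sum(pj * vj for pj, vj in zip(row, v)) >= 0 else -1 for row in P]

def hamming_fraction(h1, h2):
    return sum(b1 != b2 for b1, b2 in zip(h1, h2)) / len(h1)

rng = random.Random(42)
n, k = 16, 8192
p = [1.0] + [0.0] * (n - 1)
theta = math.pi / 3
r = [math.cos(theta), math.sin(theta)] + [0.0] * (n - 2)
P = toeplitz_matrix(n, k, rng)
estimate = hamming_fraction(sign_hash(P, p), sign_hash(P, r))  # ~ theta / pi
```

Consecutive Toeplitz rows overlap in all but one random value, so the hash bits are correlated; the theorem quantifies how little this correlation inflates the variance relative to fully random rows.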
Lemma. Let $\{Z_{1},...,Z_{k}\}$ be a set of $k$ independent random variables defined on $\Omega$ such that each $Z_{i}$ has the same distribution and $Z_{i} \in \{0,1\}$. Let $\{F_{1},...,F_{k}\}$ be a set of events, where each $F_{i}$ is in the $\sigma$-field defined by $Z_{i}$ (in particular $F_{i}$ does not depend on the $\sigma$-field $\sigma(Z_{1},...,Z_{i-1},Z_{i+1},...,Z_{k})$). Assume that there exists $\mu < \frac{1}{2}$ such that $P(F_{i}) \leq \mu$ for $i=1,...,k$. Let $\{U_{1},...,U_{k}\}$ be a set of $k$ random variables such that $U_{i} \in \{0,1\}$ and $U_{i} | F_{i} = Z_{i} | F_{i}$ for $i=1,...,k$, where $X|F$ stands for the random variable $X$ truncated to the event $F$. Assume furthermore that $E(U_{i})=E(Z_{i})$ for $i=1,...,k$. Denote $U = \frac{U_{1}+...+U_{k}}{k}$. Then the following is true:
$$P(|U-EU| > \epsilon) \leq \frac{1}{\pi} \sum_{r=\lfloor\frac{\epsilon k}{2}\rfloor}^{k}\frac{1}{r}\left(\frac{ke}{r}\right)^{r}\mu^{r}(1-\mu)^{k-r} + 2e^{-\frac{\epsilon^{2}k}{2}}.$$
Lemma. Let $W_{1},...,W_{k}$ be $k$ independent random variables such that $E(W_{1})=...=E(W_{k})=0$. Assume that $-\alpha_{i} \leq W_{i} \leq \beta_{i}$ for $i=1,...,k$. Then the following is true:
$$P\left(\left|\sum_{i=1}^{k} W_{i}\right|>a\right) \leq 2e^{-\frac{2a^{2}}{\sum_{i=1}^{k}(\alpha_{i}+\beta_{i})^{2}}}.$$
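This Hoeffding-type inequality is straightforward to sanity-check numerically. The sketch below (ours) compares the empirical tail frequency of a sum of bounded, centered variables against the stated bound:

```python
import math, random

def hoeffding_bound(a, widths):
    # widths[i] = alpha_i + beta_i, the length of the interval containing W_i.
    return 2.0 * math.exp(-2.0 * a * a / sum(w * w for w in widths))

rng = random.Random(7)
k, a, trials = 50, 10.0, 20000
# W_i uniform on [-1, 1]: centered, with alpha_i = beta_i = 1.
tail = sum(
    abs(sum(rng.uniform(-1, 1) for _ in range(k))) > a for _ in range(trials)
) / trials
bound = hoeffding_bound(a, [2.0] * k)  # 2 * exp(-200 / 200) ~ 0.736
```

Here the empirical tail (around one percent) sits well below the bound, as expected: the inequality is distribution-free and therefore loose for any particular choice of the $W_i$.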
Lemma. Let $n$ denote the data dimensionality and let $f(n)$ be an arbitrary positive function. Let $D$ be the set of all $L_{2}$-normalized data points, where no two data points are identical. Assume that $|D|=N$. Consider the ${N \choose 2}$ hyperplanes $H_{p,r}$ spanned by pairs of different vectors $\{p,r\}$ from $D$. Then after applying the linear transformation $HR$ each hyperplane $H_{p,r}$ is transformed into another hyperplane $H^{HR}_{p,r}$. Furthermore, the probability $P_{HR}$ that for every $H^{HR}_{p,r}$ there exist two orthonormal vectors $x=(x_{1},...,x_{n})$, $y=(y_{1},...,y_{n})$ in $H^{HR}_{p,r}$ with $|x_{i}|,|y_{i}| \leq \frac{f(n)}{\sqrt{n}}$ satisfies:
$$P_{HR} \geq 1-4{N \choose 2}e^{-\frac{f^{2}(n)}{2}}.$$
Lemma. Let us assume that $E_{f}$ holds. Let $f(n)$ be an arbitrary positive function. Then for every $a>0$, with probability at least $P_{succ} \geq 1 - 4{k \choose 2} e^{-\frac{2a^{2}n}{f^{4}(n)}}$, taken under the coin tosses used to construct the matrix $D$, the following is true for every $1 \leq i_{1} \neq i_{2} \leq k$:
$$\left|\sum_{u=1}^{n} s_{i_{1},u}v_{i_{1},u}\right| \leq a\chi(P) + \Psi \frac{f^{2}(n)}{n}, \qquad \left|\sum_{u=1}^{n} s_{i_{1},u}s_{i_{2},u}\right| \leq a\chi(P) + \Psi \frac{f^{2}(n)}{n},$$
$$\left|\sum_{u=1}^{n} v_{i_{1},u}v_{i_{2},u}\right| \leq a\chi(P) + \Psi \frac{f^{2}(n)}{n}, \qquad \left|\sum_{u=1}^{n} s_{i_{1},u}v_{i_{2},u}\right| \leq a\chi(P) + \Psi \frac{f^{2}(n)}{n}.$$
Lemma. Define $U_{1},...,U_{k}$ as in the proof of Theorem 4.1. Assume that
$$\left|\sum_{u=1}^{n} s_{i_{1},u}v_{i_{1},u}\right| \leq \Delta, \quad \left|\sum_{u=1}^{n} s_{i_{1},u}s_{i_{2},u}\right| \leq \Delta, \quad \left|\sum_{u=1}^{n} v_{i_{1},u}v_{i_{2},u}\right| \leq \Delta, \quad \left|\sum_{u=1}^{n} s_{i_{1},u}v_{i_{2},u}\right| \leq \Delta$$
for some $0<\Delta<1$. Then the following is true for every fixed $1 \leq i < j \leq k$:
$$|P(U_{i}U_{j}=1) - P(U_{i}=1)P(U_{j}=1)| = O(\Delta).$$
Corollary. Consider the extended $\Psi$-regular hashing model $M$. Assume that the projection matrix $P$ is Toeplitz Gaussian. Let $N, n, k$ be as above and denote by $\theta_{p,r}$ the angular distance between vectors $p,r \in D$. Then the following is true for $n$ large enough:
$$P\left(\forall_{p,r \in D}\left|\theta^{n}_{p,r} - \frac{\theta_{p,r}}{\pi}\right| \leq k^{-\frac{1}{3}}\right) \geq \left[1-O\left(\frac{N^{2}}{e^{poly(n)}}+ k^{2}e^{-n^{\frac{3}{10}}}\right)\right]\left(1-3e^{-\frac{k^{\frac{1}{3}}}{2}}\right).$$
Definition. Let $t$ be the size of the pool of independent random Gaussian variables $\{g_{1},...,g_{t}\}$, where each $g_{i} \sim N(0,1)$. Assume that $k \leq n \leq t \leq kn$ and take $\Psi$ to be a natural number, i.e. $\Psi \in \mathbb{N}$. $P$ is a $\Psi$-regular random matrix if it has the form
$$\left( \begin{array}{ccccc} \sum_{l \in S_{1,1}}g_{l} & \ldots & \sum_{l \in S_{1,j}}g_{l} & \ldots & \sum_{l \in S_{1,n}}g_{l}\\ \ldots & \ldots & \ldots & \ldots & \ldots \\ \sum_{l \in S_{i,1}}g_{l} & \ldots & \sum_{l \in S_{i,j}}g_{l} & \ldots & \sum_{l \in S_{i,n}}g_{l}\\ \ldots & \ldots & \ldots & \ldots & \ldots \\ \sum_{l \in S_{k,1}}g_{l} & \ldots & \sum_{l \in S_{k,j}}g_{l} & \ldots & \sum_{l \in S_{k,n}}g_{l} \end{array} \right)$$
where $S_{i,j} \subseteq \{1,...,t\}$ for $i \in \{1,...,k\}$, $j \in \{1,...,n\}$, $|S_{i,1}|=...=|S_{i,n}|$ for $i=1,...,k$, $S_{i,j_{1}} \cap S_{i,j_{2}} = \emptyset$ for $i \in \{1,...,k\}$ and $j_{1} \neq j_{2}$, and furthermore the following holds: for any two different rows $R_{1}, R_{2}$ of $P$ the number of random variables $g_{l}$, $l \in \{1,...,t\}$, such that $g_{l}$ is in the intersection of some column with both $R_{1}$ and $R_{2}$ is at most $\Psi$.
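The conditions of this definition can be checked mechanically. The sketch below (our illustration; the Toeplitz set layout and our reading of the column-intersection condition are assumptions, not code from the paper) verifies the equal-size and within-row disjointness requirements and computes $\Psi$ as the maximum, over pairs of rows, of the number of pool variables shared through a common column:

```python
def psi_of(S, k, n):
    """S[i][j] is the set of pool indices feeding entry (i, j) of P."""
    for i in range(k):
        # All sets in a row must have the same size and be pairwise disjoint.
        assert len({len(S[i][j]) for j in range(n)}) == 1
        seen = set()
        for j in range(n):
            assert not (S[i][j] & seen)
            seen |= S[i][j]
    psi = 0
    for i1 in range(k):
        for i2 in range(i1 + 1, k):
            shared = set()
            for j in range(n):  # pool variables met by both rows in column j
                shared |= S[i1][j] & S[i2][j]
            psi = max(psi, len(shared))
    return psi

# Toeplitz layout: entry (i, j) reads the single pool value j - i + k - 1.
n, k = 6, 3
toeplitz_S = [[{j - i + k - 1} for j in range(n)] for i in range(k)]
```

Under this reading, the singleton Toeplitz layout yields $\Psi = 0$, since two distinct rows never read the same pool variable in the same column.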
Proof. Note first that the $i$th row of the matrix $P$, call it $g^{i}$, is an $n$-dimensional Gaussian vector with mean $0$ in which each element has variance $\sigma_{i}^{2}$ for $\sigma_{i}^{2}=|S_{i,1}|=...=|S_{i,n}|$ ($i=1,...,k$). Thus, after applying the matrix $D$ the new vector $g^{i}_{D}$ is still Gaussian and of the same distribution. Let us first consider the short $\Psi$-regular hashing model. Fix some vectors $p,r$ (without loss of generality we may assume that they are not collinear). Let $H_{p,r}$, shortly called by us $H$, be the $2$-dimensional hyperplane spanned by $\{p,r\}$. Denote by $g^{i}_{D,H}$ the projection of $g^{i}_{D}$ onto $H$ and by $g^{i}_{D,H,\perp}$ the line in $H$ perpendicular to $g^{i}_{D,H}$. Note that the contribution to the $L_{1}$-sum $\|h(p)-h(r)\|_{1}$ comes from those $g^{i}$ for which $g^{i}_{D,H,\perp}$ divides the angle between $p$ and $r$ (see Figure 1), i.e. from those $g^{i}$ for which $g^{i}_{D,H}$ is inside the union $U_{p,r}$ of the two $2$-dimensional cones bounded by the two lines in $H$ perpendicular to $p$ and $r$ respectively. If the angle is not divided (see Figure 2) then the two corresponding entries in the hash have the same value and thus do not contribute to the overall distance between the hashes. From this observation we conclude that $\theta^{n}_{p,r} = \frac{X_{1} + ... + X_{k}}{k}$, where
$$X_{i} = \begin{cases} 1 & \text{if } g^{i}_{D,H} \in U_{p,r}, \\ 0 & \text{otherwise.} \end{cases}$$

[Figure 1: Two vectors $p$, $r$ spanning a two-dimensional hyperplane $H$, with angular distance $\theta_{p,r}$ between them. Here $l_{p,r} = g^{i}_{D,H,\perp}$ divides $\theta_{p,r}$, so $g^{i}$ contributes to $\|h(p)-h(r)\|_{1}$.]

[Figure 2: A setting similar to Figure 1. Vector $v$ is the $L_{2}$-normalized version of $g^{i}$ and is perpendicular to the two-dimensional plane $R$. The intersection $R \cap H$ of that plane with the $2$-dimensional plane $H$ spanned by $p,r$ is a line $l_{p,r}$ that this time lies outside $U_{p,r}$, so $g^{i}$ does not contribute to $\|h(p)-h(r)\|_{1}$.]

Now it suffices to note that the vector $g^{i}_{D,H}$ is a $2$-dimensional Gaussian vector, so its direction is distributed uniformly. Thus each $X_{i}$ is nonzero with probability exactly $\frac{\theta_{p,r}}{\pi}$ and the theorem follows. For the extended $\Psi$-regular hashing model the analysis is very similar. The only difference is that the data is preprocessed by the linear mapping $HR$ first. Both $H$ and $R$ are orthogonal matrices, so their product is also an orthogonal matrix. Since orthogonal matrices do not change angular distances, the former analysis applies again and yields the proof.
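The geometric argument above is easy to verify numerically. In the sketch below (ours), the fraction of hash bits on which two vectors disagree, computed with i.i.d. Gaussian rows, concentrates around $\theta_{p,r}/\pi$:

```python
import math, random

rng = random.Random(0)
n, k = 8, 20000
theta = 1.0  # radians, the angle between p and r
p = [1.0] + [0.0] * (n - 1)
r = [math.cos(theta), math.sin(theta)] + [0.0] * (n - 2)

def bit(g, v):
    # One hash bit: the sign of the projection of v onto the Gaussian row g.
    return sum(gi * vi for gi, vi in zip(g, v)) >= 0

disagree = 0
for _ in range(k):
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    disagree += bit(g, p) != bit(g, r)
estimate = disagree / k  # ~ theta / pi
```

Each row defines a random hyperplane; the two bits differ exactly when the hyperplane separates $p$ from $r$, which happens with probability $\theta_{p,r}/\pi$, matching the expectation computed in the proof.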
Proof. Let us consider the event $F_{bad} = F_{1} \cup ... \cup F_{k}$. Note that $F_{bad}$ may be represented as a union of so-called $r$-blocks, i.e.
$$F_{bad} = \bigcup_{Q \subseteq \{1,...,k\}} \left(\bigcap_{q \in Q} F_{q} \cap \bigcap_{q \in \{1,...,k\} \setminus Q} F^{c}_{q}\right),$$
where $F^{c}$ stands for the complement of the event $F$. Let us now fix some $Q \subseteq \{1,...,k\}$ with $|Q|=r$ and denote
$$F_{Q} = \bigcap_{q \in Q} F_{q} \cap \bigcap_{q \in \{1,...,k\} \setminus Q} F^{c}_{q}.$$
Note that $P(F_{Q}) \leq \mu^{r}(1-\mu)^{k-r}$; this follows directly from the Bernoulli scheme. Denote $Z = \frac{Z_{1}+...+Z_{k}}{k}$. From what we have just said and from the definition of $\{F_{1},...,F_{k}\}$ we conclude that for any given $c$ the following holds:
$$P(|U-Z| > c) \leq \sum_{r=\lceil ck \rceil}^{k}{k \choose r}\mu^{r}(1-\mu)^{k-r}.$$
Note also that from the assumptions of the lemma we trivially get $E(U)=E(Z)$. Let us now consider the expression $P(|U-E(U)| > \epsilon)$. We get: $P(|U-E(U)|> \epsilon) = P(|U-E(Z)| > \epsilon) = P(|U-Z + Z-E(Z)| > \epsilon) \leq P(|U-Z|+|Z-E(Z)|>\epsilon) \leq P(|U-Z| > \frac{\epsilon}{2}) + P(|Z-E(Z)| > \frac{\epsilon}{2})$. From the inequality above we get:
$$P\left(|U-Z| > \frac{\epsilon}{2}\right) \leq \sum_{r=\lceil\frac{\epsilon k}{2}\rceil}^{k}{k \choose r} \mu^{r}(1-\mu)^{k-r}.$$
Let us now consider the expression
$$\xi = \sum_{r=\lceil\frac{\epsilon k}{2}\rceil}^{k}{k \choose r} \mu^{r}(1-\mu)^{k-r}.$$
We have:
$$\xi \leq \sum_{r=\lceil\frac{\epsilon k}{2}\rceil}^{k} \frac{(k-r+1)\cdots k}{r!} \mu^{r}(1-\mu)^{k-r} \leq \sum_{r=\lceil\frac{\epsilon k}{2}\rceil}^{k} \frac{k^{r}}{r!} \mu^{r}(1-\mu)^{k-r}.$$
From Stirling's formula we get $r! = 2\pi \frac{r^{r+\frac{1}{2}}}{e^{r}}(1+o_{r}(1))$. Thus we obtain, for $r$ large enough:
$$\xi \leq (1+o_{r}(1))\sum_{r=\lceil\frac{\epsilon k}{2}\rceil}^{k}\frac{k^{r}e^{r}}{2\pi r^{r+\frac{1}{2}}}\mu^{r}(1-\mu)^{k-r} \leq \frac{1}{\pi} \sum_{r=\lceil\frac{\epsilon k}{2}\rceil}^{k}\frac{1}{r}\left(\frac{ke}{r}\right)^{r}\mu^{r}(1-\mu)^{k-r}.$$
Now we will use the following version of the standard Azuma's inequality: let $W_{1},...,W_{k}$ be $k$ independent random variables such that $E(W_{1})=...=E(W_{k})=0$ and $-\alpha_{i} \leq W_{i} \leq \beta_{i}$ for $i=1,...,k$; then $P(|\sum_{i=1}^{k} W_{i}|>a) \leq 2e^{-\frac{2a^{2}}{\sum_{i=1}^{k}(\alpha_{i}+\beta_{i})^{2}}}$. Applying this inequality to $W_{i} = Z_{i} - E(Z_{i})$ with $\alpha_{i} = E(Z_{i})$ and $\beta_{i}=1-E(Z_{i})$, we obtain:
$$P\left(|Z-EZ| > \frac{\epsilon}{2}\right) \leq 2e^{-\frac{\epsilon^{2}k}{2}}.$$
Combining the two bounds above, we obtain the statement of the lemma.
Proof. We have already noted in the previous proof that $HR$ is an orthogonal matrix. Thus, as an isometry, it clearly transforms each $2$-dimensional hyperplane into another $2$-dimensional hyperplane. For every pair $\{p,r\}$, let us consider an arbitrary fixed orthonormal pair $\{u,v\}$ spanning $H_{p,r}$. Denote $u=(u_{1},...,u_{n})$ and let $u^{HR}$ be the vector obtained from $u$ by applying the transformation $HR$. Note that the $j$th coordinate of $u^{HR}$ is of the form
$$u^{HR}_{j} = u_{1}T_{1}+...+u_{n}T_{n},$$
where $T_{1},...,T_{n}$ are independent random variables satisfying
$$T_{i} = \begin{cases} \frac{1}{\sqrt{n}} & \text{w.p. } \frac{1}{2}, \\ -\frac{1}{\sqrt{n}} & \text{otherwise.} \end{cases}$$
The latter comes straightforwardly from the form of the $L_{2}$-normalized Hadamard matrix (i.e. a Hadamard matrix in which each row and column is $L_{2}$-normalized). But then, from the Azuma-type lemma above and the fact that $\|u\|_{2}=1$, we get for any $a>0$:
$$P\left(|u_{1}T_{1}+...+u_{n}T_{n}| \geq \frac{a}{\sqrt{n}}\right) \leq 2e^{-\frac{2a^{2}/n}{\sum_{i=1}^{n}(2u_{i}/\sqrt{n})^{2}}} = 2e^{-\frac{a^{2}}{2}}.$$
An analogous analysis applies to $v^{HR}$. Note that $v^{HR}$ is orthogonal to $u^{HR}$ since $v$ and $u$ are orthogonal, and both $v^{HR}$ and $u^{HR}$ are $L_{2}$-normalized; thus $\{u^{HR},v^{HR}\}$ is an orthonormal pair. To complete the proof, it suffices to take $a=f(n)$ and apply the union bound over the vectors $u^{HR}$, $v^{HR}$ for all ${N \choose 2}$ hyperplanes.
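The balancedness property established here can be observed directly. The sketch below (ours) applies the Rademacher sign flip followed by an $L_2$-normalized Walsh-Hadamard transform to the worst-case basis vector $e_1$; every coordinate of the image then has magnitude exactly $1/\sqrt{n}$, while the transform preserves the norm:

```python
import math, random

def hadamard_rademacher(v, signs):
    # x -> H R x: flip signs with the Rademacher diagonal R, then apply the
    # L2-normalized Walsh-Hadamard transform H (len(v) must be a power of two).
    x = [vi * si for vi, si in zip(v, signs)]
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                x[j], x[j + h] = x[j] + x[j + h], x[j] - x[j + h]
        h *= 2
    return [xi / math.sqrt(n) for xi in x]

rng = random.Random(1)
n = 256
signs = [rng.choice((-1, 1)) for _ in range(n)]
e1 = [1.0] + [0.0] * (n - 1)
y = hadamard_rademacher(e1, signs)  # all coordinates are +-1/sqrt(n)
```

For a generic unit vector the coordinates of the image are not exactly equal, but with high probability they are all $O(f(n)/\sqrt{n})$ for slowly growing $f(n)$, which is precisely the event $E_f$ used above.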
Proof. Note that we get the first inequality for free from the fact that $x$ is orthogonal to $y$ (in other words, $\sum_{u=1}^{n} s_{i_{1},u}v_{i_{1},u}$ can be represented as $C\sum_{i=1}^{n} x_{i}y_{i}$ and the latter expression is clearly $0$). Let us now consider one of the three remaining expressions. They can be rewritten as
$$E = \sum_{i=1}^{n} d_{\rho(i)}d_{\lambda(i)} x_{\zeta(i)}x_{\gamma(i)}, \quad E = \sum_{i=1}^{n} d_{\rho(i)}d_{\lambda(i)} y_{\zeta(i)}y_{\gamma(i)}, \quad \text{or} \quad E = \sum_{i=1}^{n} d_{\rho(i)}d_{\lambda(i)} x_{\zeta(i)}y_{\gamma(i)}$$
for some $\rho, \lambda, \zeta, \gamma$. Note also that from the $\Psi$-regularity condition we immediately obtain that $\rho(i)=\lambda(i)$ for at most $\Psi$ elements of each sum. Remove these elements from each sum and consider the remaining ones. From the definition of the $P$-chromatic number, the remaining elements can be partitioned into at most $\chi(P)$ parts, each consisting of elements that are independent random variables (since in the corresponding graph there are no edges between them). Thus, to the sum corresponding to each part one can apply the Azuma-type lemma above and conclude that this sum differs from its expectation (which is clearly $0$ since $E(d_{i}d_{j})=0$ for $i \neq j$) by more than $a$ with probability at most
$$P_{a} \leq 2e^{-\frac{2a^{2}}{\sum_{i=1}^{n} (2x_{\zeta(i)}x_{\gamma(i)})^{2}}}, \quad P_{a} \leq 2e^{-\frac{2a^{2}}{\sum_{i=1}^{n} (2y_{\zeta(i)}y_{\gamma(i)})^{2}}}, \quad \text{or} \quad P_{a} \leq 2e^{-\frac{2a^{2}}{\sum_{i=1}^{n} (2x_{\zeta(i)}y_{\gamma(i)})^{2}}}.$$
Now we use the fact that the event $E_{f}$ holds, so $|x_{i}|,|y_{i}| \leq \frac{f(n)}{\sqrt{n}}$ for $i=1,...,n$. Substituting this upper bound for $|x_{i}|,|y_{i}|$ into the probabilities derived above and then taking the union bound, we complete the proof.
References
[Achlioptas:2003:DRP:861182.861189] Achlioptas, D. Database-friendly random projections: Johnson-lindenstrauss with binary coins. J. Comput. Syst. Sci., 66\penalty0 (4):\penalty0 671--687, 2003.
[citeulike:5847607] Altman, N.~S. An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression. The American Statistician, 46\penalty0 (3):\penalty0 175--185, 1992.
[Bingham:2001:RPD:502512.502546] Bingham, E. and Mannila, H. Random projection in dimensionality reduction: Applications to image and text data. In KDD, 2001.
[Blum:2005:RPM:2182373.2182377] Blum, A. Random projection, margins, kernels, and feature-selection. In SLSFS, 2006.
[Boedecker2009] Boedecker, J., Obst, O., Mayer, N.~M., and Asada, M. Initialization and self-organized optimization of recurrent neural network connectivity. HFSP journal, 3\penalty0 (5):\penalty0 340--9, 2009.
[bottou-98x] Bottou, L. Online algorithms and stochastic approximations. In Online Learning and Neural Networks. Cambridge University Press, 1998.
[charikar] Charikar, Moses. Similarity estimation techniques from rounding algorithms. In Proceedings on 34th Annual ACM Symposium on Theory of Computing, May 19-21, 2002, Montr'eal, Qu'ebec, Canada, pp.\ 380--388, 2002. 10.1145/509907.509965. URL http://doi.acm.org/10.1145/509907.509965.
[tyree] Chen, W., Wilson, J.~T., Tyree, S., Weinberger, K.~Q., and Chen, Y. Compressing neural networks with the hashing trick. CoRR, abs/1504.04788, 2015.
[cheng2] Cheng, Yu, Yu, FelixX., Feris, Rog'erioSchmidt, Kumar, Sanjiv, Choudhary, Alok~N., and Chang, Shih-Fu. An exploration of parameter redundancy in deep networks with circulant projections. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp.\ 2857--2865, 2015. 10.1109/ICCV.2015.327. URL http://dx.doi.org/10.1109/ICCV.2015.327.
[conf/alt/ChoromanskaCJM13] Choromanska, A., Choromanski, K., Jagannathan, G., and Monteleoni, C. Differentially-private learning of low dimensional manifolds. In ALT, 2013.
[AChoro2015] Choromanska, A., Henaff, M., Mathieu, M., Arous, G.~Ben, and LeCun, Y. The loss surfaces of multilayer networks. In AISTATS, 2015.
[choromanski_sindhwani] Choromanski, Krzysztof and Sindhwani, Vikas. Recycling randomness with structure for sublinear time kernel expansions. ICML2016, abs/1605.09049, 2016. URL http://arxiv.org/abs/1605.09049.
[triple_spin] Choromanski, Krzysztof, Fagan, Francois, Gouy-Pailler, C'edric, Morvan, Anne, Sarl'os, Tam'as, and Atif, Jamal. Triplespin - a generic compact paradigm for fast machine learning computations. CoRR, abs/1605.09046, 2016. URL http://arxiv.org/abs/1605.09046.
[DBLP:conf/focs/Dasgupta99] Dasgupta, S. Learning mixtures of gaussians. In FOCS, 1999.
[conf/uai/Dasgupta00] Dasgupta, S. Experiments with random projection. In UAI, 2000.
[dasgupta2008random] Dasgupta, S. and Freund, Y. Random projection trees and low dimensional manifolds. In STOC, 2008.
[NIPS2013_5025] Denil, M., Shakibi, B., Dinh, L., Ranzato, M., and Freitas, N.~D. Predicting parameters in deep learning. In NIPS. 2013.
[DBLP:journals/corr/DentonZBLF14] Denton, E., Zaremba, W., Bruna, J., LeCun, Y., and Fergus, R. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS. 2014.
[conf/icml/FernB03a] Fern, X.~Z. and Brodley, C.~E. Random projection for high dimensional data clustering: A cluster ensemble approach. In ICML, 2003.
[3447] Ganguli, S. and Sompolinsky, H. Compressed sensing, sparsity, and dimensionality in neuronal information processing and data analysis. Annual Review of Neuroscience, 35:485\textendash508, 2012.
[giryes] Giryes, R., Sapiro, G., and Bronstein, A.~M. Deep neural networks with random gaussian weights: A universal classification strategy? CoRR, abs/1504.08291, 2015.
[gong] Gong, Y., Lazebnik, S., Gordo, A., and Perronnin, F. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell., 35\penalty0 (12):\penalty0 2916--2929, 2013a.
[aqbe] Gong, Yunchao, Kumar, Sanjiv, Verma, Vishal, and Lazebnik, Svetlana. Angular quantization-based binary codes for fast similarity search. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States., pp.\ 1205--1213, 2012.
[gong2] Gong, Yunchao, Kumar, Sanjiv, Rowley, Henry~A., and Lazebnik, Svetlana. Learning binary codes for high-dimensional data using bilinear projections. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pp.\ 484--491, 2013b. 10.1109/CVPR.2013.69. URL http://dx.doi.org/10.1109/CVPR.2013.69.
[haupt2010toeplitz] Haupt, J., Bajwa, W.~U., Raz, G., and Nowak, R. Toeplitz compressed sensing matrices with applications to sparse channel estimation. Information Theory, IEEE Transactions on, 56\penalty0 (11):\penalty0 5862--5875, 2010.
[journals/rsa/HinrichsV11] Hinrichs, A. and Vybíral, J. Johnson-lindenstrauss lemma for circulant matrices. Random Struct. Algorithms, 39\penalty0 (3):\penalty0 391--398, 2011.
[Huang2006489] Huang, G.-B., Zhu, Q.-Y., and Siew, C.-K. Extreme learning machine: Theory and applications. Neurocomputing, 70\penalty0 (1–3):\penalty0 489 -- 501, 2006.
[DBLP:journals/corr/abs-1104-3160] Jacques, L., Laska, J.~N., Boufounos, P., and Baraniuk, R.~G. Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors. CoRR, abs/1104.3160, 2011.
[Jaeger04harnessingnonlinearity] Jaeger, H. and Haas, H. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, pp.\ 78--80, 2004.
[JXHC09] Jafarpour, S., Xu, W., Hassibi, B., and Calderbank, R. Efficient and Robust Compressed Sensing Using Optimized Expander Graphs. Information Theory, IEEE Transactions on, 55\penalty0 (9):\penalty0 4299--4308, 2009.
[conf/iccv/JarrettKRL09] Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
[CPA:CPA21504] Krahmer, F., Mendelson, S., and Rauhut, H. Suprema of chaos processes and the restricted isometry property. Communications on Pure and Applied Mathematics, 67\penalty0 (11):\penalty0 1877--1904, 2014.
[pingli2006very] Li, P., Hastie, T.~J., and Church, K.~W. Very sparse random projections. In KDD, 2006.
[Liu:2006:RPM:1105850.1105890] Liu, K., Kargupta, H., and Ryan, J. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Trans. on Knowl. and Data Eng., 18\penalty0 (1):\penalty0 92--106, 2006.
[DBLP:journals/corr/MathieuHL13] Mathieu, M., Henaff, M., and LeCun, Y. Fast training of convolutional networks through ffts. In ICLR, 2014.
[pinto:bionetics_2010] Pinto, N. and Cox, D.~D. An Evaluation of the Invariance Properties of a Biologically-Inspired System for Unconstrained Face Recognition. In BIONETICS, 2010.
[journals/ploscb/PintoDDC09] Pinto, N., Doukhan, D., DiCarlo, J.~J., and Cox, D.~D. A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Computational Biology, 5(11), 2009.
[plan] Plan, Y. and Vershynin, R. Dimension reduction by random hyperplane tessellations. Discrete & Computational Geometry, 51(2):438--461, 2014.
[NIPS2009_3749] Raginsky, M. and Lazebnik, S. Locality-sensitive binary codes from shift-invariant kernels. In NIPS. 2009.
[journals/corr/abs-1010-1847] Rauhut, H., Romberg, J.~K., and Tropp, J.~A. Restricted isometries for partial random circulant matrices. CoRR, abs/1010.1847, 2010.
[Salakhutdinov:2009:SH:1558385.1558446] Salakhutdinov, R. and Hinton, G. Semantic hashing. Int. J. Approx. Reasoning, 50(7):969--978, 2009.
[ICML2011Saxe_551] Saxe, A., Koh, P.~W., Chen, Z., Bhand, M., Suresh, B., and Ng, A. On random weights and unsupervised feature learning. In ICML, 2011.
[shaw_jebara] Shaw, B. and Jebara, T. Structure preserving embedding. In ICML, 2009, pp. 937--944. doi: 10.1145/1553374.1553494. URL http://doi.acm.org/10.1145/1553374.1553494.
[sindhwani] Sindhwani, V., Sainath, T., and Kumar, S. Structured transforms for small-footprint deep learning. In NIPS, 2015.
[Sivakumar:2002:ADV:509907.509996] Sivakumar, D. Algorithmic derandomization via complexity theory. In STOC, 2002.
[yann_lecun] Szlam, A., Gregor, K., and LeCun, Y. Fast approximations to structured sparse coding and applications to object classification. In ECCV, 2012, pp. 200--213. doi: 10.1007/978-3-642-33715-4_15. URL http://dx.doi.org/10.1007/978-3-642-33715-4_15.
[Vybíral20111096] Vybíral, J. A variant of the Johnson-Lindenstrauss lemma for circulant matrices. Journal of Functional Analysis, 260(4):1096--1105, 2011.
[ssh] Wang, J., Kumar, S., and Chang, S.-F. Semi-supervised hashing for large-scale search. IEEE Trans. Pattern Anal. Mach. Intell., 34(12):2393--2406, 2012. doi: 10.1109/TPAMI.2012.48. URL http://dx.doi.org/10.1109/TPAMI.2012.48.
[weiss] Weiss, Y., Torralba, A., and Fergus, R. Spectral hashing. In NIPS, 2008.
[citeulike:13373321] White, O.~L., Lee, D.~D., and Sompolinsky, H. Short-term memory in orthogonal neural networks. Physical Review Letters, 92(14), 2004.
[yap.11b] Yap, H.L., Eftekhari, A., Wakin, M.B., and Rozell, C.J. The restricted isometry property for block diagonal matrices. In CISS, 2011.
[constantine] Yi, X., Caramanis, C., and Price, E. Binary embedding: Fundamental limits and fast algorithm. CoRR, abs/1502.05746, 2015.
[felix] Yu, F.~X., Kumar, S., Gong, Y., and Chang, S.-F. Circulant binary embedding. In ICML, 2014.
[aditya] Yu, F.~X., Bhaskara, A., Kumar, S., Gong, Y., and Chang, S.-F. On binary embedding using circulant matrices. CoRR, abs/1511.06480, 2015. URL http://arxiv.org/abs/1511.06480.