Signal Recovery from Pooling Representations
Joan Bruna, Arthur Szlam, Yann LeCun
Abstract
In this work we compute lower Lipschitz bounds of $\ell_p$ pooling operators for $p = 1, 2, \infty$, as well as $\ell_p$ pooling operators preceded by half-rectification layers. These give sufficient conditions for the design of invertible neural network layers. Numerical experiments on MNIST and image patches confirm that pooling layers can be inverted with phase recovery algorithms. Moreover, the regularity of the inverse pooling, controlled by the lower Lipschitz constant, is empirically verified with a nearest neighbor regression.
Joan Bruna
Courant Institute of Mathematical Sciences, 715 Broadway, New York, NY 10003, USA
Arthur Szlam, Yann LeCun
Introduction
A standard architecture for deep feedforward networks consists of a number of stacked modules, each comprising a linear mapping, followed by an elementwise nonlinearity, followed by a pooling operation. Critical to the success of this architecture in recognition problems is its capacity for preserving discriminative signal information while being invariant to nuisance deformations. The recent works (Mallat, 2012; Bruna and Mallat, 2012) study the role of the pooling operator in building invariance. In this work, we will study a network's capacity for preserving information. Specifically, we will study the invertibility of modules with a linear mapping, the half-rectification nonlinearity, and $\ell_p$ pooling, for $p \in \{1, 2, \infty\}$. We will discuss recent work in the case $p = 2$, and connections with the phase recovery problem of (Candes et al., 2013; Gerchberg and Saxton, 1972; Waldspurger et al., 2012).
$\ell_p$ pooling
The purpose of the pooling layer in each module is to give invariance to the system, perhaps at the expense of resolution. This is done via a summary statistic over the outputs of groups of nodes. In the trained system, the columns of the weight matrix corresponding to nodes grouped together often exhibit similar characteristics, and code for perturbations of a template (Kavukcuoglu et al., 2009; Hyvärinen and Hoyer, 2001).
The summary statistic in $\ell_p$ pooling is the $\ell_p$ norm of the inputs into the pool. That is, if nodes $x_{i_1}, \dots, x_{i_L}$ are in a pool, the output of the pool is
$$\Big( \sum_{j=1}^{L} |x_{i_j}|^p \Big)^{1/p},$$
where, as usual, if $p \to \infty$, this is
$$\max_{1 \le j \le L} |x_{i_j}|.$$
If the outputs of the nonlinearity are nonnegative (as for the half-rectification function), then $p = 1$ corresponds to average pooling (up to a constant factor), and the case $p = \infty$ is max pooling.
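As a concrete illustration, the pool statistic above can be computed for the three values of $p$ as follows (a minimal NumPy sketch; the function name and example values are ours, not from the paper):

```python
import numpy as np

def lp_pool(x, p):
    """l_p summary statistic of one pool of node outputs x.

    p = 1 sums the magnitudes, p = 2 is the Euclidean norm,
    and p = np.inf is the maximum magnitude (max pooling)."""
    a = np.abs(np.asarray(x, dtype=float))
    if np.isinf(p):
        return a.max()                  # max pooling
    return (a ** p).sum() ** (1.0 / p)  # l_p norm of the pool

pool = [3.0, -4.0, 0.0]
print(lp_pool(pool, 1))       # 7.0
print(lp_pool(pool, 2))       # 5.0
print(lp_pool(pool, np.inf))  # 4.0
```

Note that the statistic discards the signs of the inputs; this lost "phase" is precisely what the recovery problem below tries to restore.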
Phase reconstruction
Given $x \in \mathbb{R}^n$, a classical problem in signal processing is to recover $x$ from the absolute values of its (1- or 2-dimensional) Fourier coefficients, perhaps subject to some additional constraints on $x$; this problem arises in speech generation and X-ray imaging (Ohlsson, 2013). Unfortunately, the problem is not well posed: the absolute values of the Fourier coefficients come nowhere near specifying $x$. For example, the absolute value of the Fourier transform is translation invariant. It can be shown (and we discuss this below) that the absolute values of the inner products between $x$ and any basis of $\mathbb{R}^n$ are not enough to uniquely specify an arbitrary $x$; the situation is worse for $\mathbb{C}^n$. On the other hand, recent works have shown that with a redundant enough dictionary the situation is different, and $x$ can be recovered from the modulus of its inner products with the dictionary (Balan et al., 2006; Candes et al., 2013; Waldspurger et al., 2012).
Suppose for a moment that there is no elementwise nonlinearity in our feedforward module, only a linear mapping followed by a pooling. Then, with a slightly generalized notion of phase, where the modulus is the $\ell_p$ norm of the pool and the phase is the $\ell_p$ unit vector specifying the 'direction' of the inner products in the pool, the phase recovery problem above asks if the module loses any information. The $\ell_2$ case has been recently studied in (Cahill et al., 2013).
What vs. Where
If the columns of the weight matrix in a pool correspond to related features, it can be reasonable to see the entire pool as a 'what'. That is, the modulus of the pool indicates the relative presence of a grouping of (sub)features into a template, and the phase of the pool describes the relative arrangement of the subfeatures, describing 'where' the template is, or more generally, describing the 'pose' of the template.
From this viewpoint, phase reconstruction results make rigorous the notion that given enough redundant versions of 'what' and throwing away the 'where', we can still recover the 'where'.
Contributions of this work
In this work we give conditions so that a module consisting of a linear mapping, perhaps followed by a half-rectification, followed by an $\ell_p$ pooling, preserves the information in its input. We extend the $\ell_2$ results of (Cahill et al., 2013; Balan and Wang, 2013) in several ways: we consider the $\ell_p$ case, take into account the half-rectification nonlinearity, and make the results quantitative in the sense that we give bounds on the lower Lipschitz constants of the modules. This gives a measure of the stability of the inversion, which is especially important in a multi-layer system. Using our bounds, we prove that redundant enough random modules with $\ell_1$ or $\ell_\infty$ pooling are invertible.
We also show the results of numerical experiments designed to explore the gaps in our results and the results in the literature. We note that the alternating minimization method of (Gerchberg and Saxton, 1972) can be used essentially unchanged for the $\ell_p$ case, with or without rectification, and show experiments giving evidence that recovery is roughly equally possible for $\ell_1$, $\ell_2$, and $\ell_\infty$ using this algorithm, and that half-rectification before pooling can make recovery easier. Furthermore, we show that with a trained initialization, the alternating method compares favorably with the state-of-the-art recovery methods (for $\ell_2$ with no rectification) in (Waldspurger et al., 2012; Candes et al., 2013), which suggests that the above observations are not an artifact of the alternating method.
Injectivity and Lipschitz stability of Pooling Operators
This section studies necessary and sufficient conditions guaranteeing that pooling representations are invertible. It also computes upper and lower Lipschitz bounds, which are tight under certain configurations.
Let us first introduce the notation used throughout the paper. Let $F = \{f_1, \dots, f_M\}$ be a real frame of $\mathbb{R}^N$, with $M > N$. The frame $F$ is organized into $K$ disjoint blocks $F_k = \{f_j\}_{j \in I_k}$, $k = 1 \dots K$, such that $I_k \cap I_{k'} = \emptyset$ and $\bigcup_k I_k = \{1 \dots M\}$. For simplicity, we shall assume that all the pools have equal size $|I_k| = L$.
The $\ell_p$ pooling operator $P_p(x)$ is defined as the mapping
$$P_p(x) = \left\{ \|F_k^T x\|_p \right\}_{k = 1 \dots K}, \quad \text{with } \|F_k^T x\|_p = \Big( \sum_{j \in I_k} |\langle x, f_j \rangle|^p \Big)^{1/p}.$$
A related representation, which has gained popularity in recent deep learning architectures, introduces a point-wise thresholding before computing the $\ell_p$ norm. If $\alpha \in \mathbb{R}^M$ is a fixed threshold vector, and $(\rho_\alpha(x))_i = \max(0, x_i - \alpha_i)$, then the $\ell_p$ rectified pooling operator $R_p(x)$ is defined as
$$R_p(x) = \left\{ \|\rho_{\alpha_k}(F_k^T x)\|_p \right\}_{k = 1 \dots K},$$
where $\alpha_k$ contains the coordinates $I_k$ of $\alpha$.
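The two operators can be sketched in NumPy as follows (an illustrative implementation under the notation above; the function names and the random test frame are our own choices):

```python
import numpy as np

def lp_norm(v, p):
    a = np.abs(v)
    return a.max() if np.isinf(p) else (a ** p).sum() ** (1.0 / p)

def pool_operator(F, pools, x, p, alpha=None):
    """P_p(x) when alpha is None, otherwise the rectified pooling R_p(x).

    F     : N x M matrix whose columns are the frame vectors f_j
    pools : list of K disjoint index arrays I_k
    alpha : optional length-M threshold vector for the half-rectification"""
    coeffs = F.T @ x                              # inner products <x, f_j>
    if alpha is not None:
        coeffs = np.maximum(0.0, coeffs - alpha)  # rho_alpha
    return np.array([lp_norm(coeffs[I], p) for I in pools])

rng = np.random.default_rng(0)
N, K, L = 8, 6, 2
F = rng.standard_normal((N, K * L))
pools = [np.arange(k * L, (k + 1) * L) for k in range(K)]
x = rng.standard_normal(N)
print(pool_operator(F, pools, x, 2))                         # P_2(x)
print(pool_operator(F, pools, x, 2, alpha=np.zeros(K * L)))  # R_2(x)
```

By construction $P_p(-x) = P_p(x)$, which is the sign ambiguity discussed below, whereas $R_p$ is not symmetric in general.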
We shall measure the stability of the inverse pooling with the Euclidean distance in the representation space. Given a distance $d(x, x')$ in the input space, the Lipschitz bounds of a given operator $\Phi(x)$ are defined as the constants $0 \le A \le B$ such that
$$A\, d(x, x')^2 \le \|\Phi(x) - \Phi(x')\|^2 \le B\, d(x, x')^2.$$
In the remainder of the paper, given a frame $F$, we denote by $\lambda_-(F)$ and $\lambda_+(F)$ its lower and upper frame bounds, respectively. If $F$ has $M$ vectors and $\Omega \subset \{1, \dots, M\}$, we denote by $F_\Omega$ the frame obtained by keeping the vectors indexed in $\Omega$. Finally, we denote by $\Omega^c$ the complement of $\Omega$.
Absolute value and Thresholding nonlinearities
In order to study the injectivity of pooling representations, we first focus on the properties of the operators defined by the point-wise nonlinearities.
The properties of the phaseless mapping
$$M(x) = |F^T x| = \left\{ |\langle x, f_i \rangle| \right\}_{i \le M}$$
have been extensively studied in the literature (Balan et al., 2006; Balan and Wang, 2013), in part motivated by applications to speech processing (Achan et al., 2003) or
X-ray crystallography (Ohlsson, 2013). It is shown in (Balan et al., 2006) that if $M \ge 2N - 1$ then it is generically possible to recover $x$ from $M(x)$, up to a global sign change. In particular, (Balan and Wang, 2013) recently characterized the stability of the phaseless operator, which is summarized in the following proposition:
Proposition 2.1 ((Balan and Wang, 2013), Theorem 4.3)
Let $F = (f_1, \dots, f_M)$ with $f_i \in \mathbb{R}^N$ and $d(x, x') = \min(\|x - x'\|, \|x + x'\|)$. The mapping $M(x) = \{|\langle x, f_i \rangle|\}_{i \le M}$ satisfies
$$A\, d(x, x')^2 \le \|M(x) - M(x')\|^2 \le B\, d(x, x')^2,$$
where
$$A = \min_{\Omega \subseteq \{1, \dots, M\}} \left( \lambda_-(F_\Omega)^2 + \lambda_-(F_{\Omega^c})^2 \right),$$
$$B = \lambda_+(F)^2.$$
In particular, $M(x)$ is injective if and only if for any subset $\Omega \subseteq \{1, \dots, M\}$, either $F_\Omega$ or $F_{\Omega^c}$ is an invertible frame.
A frame F satisfying the previous condition is said to be phase retrievable .
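For small frames, this complement property can be checked directly by brute force over all column subsets. A NumPy sketch (our own illustrative helper, exponential in $M$ and so only feasible for toy sizes):

```python
import itertools
import numpy as np

def is_phase_retrievable(F):
    """Brute-force check of the injectivity criterion above: for every
    subset Omega of the columns of F, either F_Omega or its complement
    must span R^N.  Tests all 2^M subsets, so only usable for small M."""
    N, M = F.shape

    def spans(idx):
        return len(idx) >= N and np.linalg.matrix_rank(F[:, list(idx)]) == N

    for r in range(M + 1):
        for omega in itertools.combinations(range(M), r):
            omega_c = tuple(j for j in range(M) if j not in omega)
            if not (spans(omega) or spans(omega_c)):
                return False
    return True

rng = np.random.default_rng(1)
N = 3
print(is_phase_retrievable(rng.standard_normal((N, 2 * N - 1))))  # True for a generic frame with M = 2N - 1
print(is_phase_retrievable(np.eye(N)))                            # False: a single basis is not enough
```

The second example makes the earlier remark concrete: the moduli of the inner products with one basis cannot distinguish sign patterns of the coordinates.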
We now turn our attention to the half-rectification operator, defined as
$$M_\alpha(x) = \rho_\alpha(F^T x) = \left\{ \max(0, \langle x, f_i \rangle - \alpha_i) \right\}_{i \le M}.$$
For that purpose, let us introduce some extra notation. Given a frame $F = \{f_1, \dots, f_M\}$, a subset $\Omega \subset \{1 \dots M\}$ is admissible if
$$\exists\, x \in \mathbb{R}^N \ \text{ such that } \ \langle x, f_i \rangle > \alpha_i \iff i \in \Omega. \tag{8}$$
We denote by $\boldsymbol{\Omega}$ the collection of all admissible sets, and by $V_\Omega$ the vector space generated by $F_\Omega$. The following proposition, proved in Section B, gives a necessary and sufficient condition for the injectivity of the half-rectification.
Proposition 2.2 Let $A_0 = \min_{\Omega \in \boldsymbol{\Omega}} \lambda_-\big(F_\Omega\big|_{V_\Omega}\big)$. Then the half-rectification operator $M_\alpha(x) = \rho_\alpha(F^T x)$ is injective if and only if $A_0 > 0$. Moreover, it satisfies
$$A_0^2\, \|x - x'\|^2 \le \|M_\alpha(x) - M_\alpha(x')\|^2 \le \lambda_+(F)^2\, \|x - x'\|^2.$$
The half-rectification has the ability to recover the input signal without the global sign ambiguity. The ability to reconstruct from $M_\alpha$ is thus controlled by the rank of any matrix $F_\Omega$ whose columns are taken from a subset belonging to $\boldsymbol{\Omega}$. In particular, if $\alpha \equiv 0$, since $\Omega \in \boldsymbol{\Omega} \Rightarrow \Omega^c \in \boldsymbol{\Omega}$, it follows that $M \ge 2N$ is necessary in order to have $A_0 > 0$.
The rectified linear operator creates a partition of the input space into polytopes, defined by (8), and computes a linear operator on each of these regions. A given input $x$ activates a set $\Omega_x \in \boldsymbol{\Omega}$, encoded by the sign of the linear measurements $\langle x, f_i \rangle - \alpha_i$.
As opposed to the absolute value operator, the inverse of $M_\alpha$, whenever it exists, can be computed directly by locally inverting a linear operator. Indeed, the coordinates of $M_\alpha(x)$ satisfying $M_\alpha(x)_j > 0$ form a set $s(x)$, which defines a linear system $F_{s(x)}$ that is invertible if $A_0 > 0$.
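This local linear inversion can be illustrated numerically: given the output of a rectified layer, solve the linear system on the active coordinates (an illustrative NumPy sketch with our own names, assuming the active columns span $\mathbb{R}^N$):

```python
import numpy as np

def invert_rectified(F, alpha, y):
    """Invert M_alpha(x) = max(0, F^T x - alpha) from its output y, by
    solving the linear system on the active set s(x) = {j : y_j > 0};
    this is exact whenever the active columns F_s span R^N."""
    s = y > 0
    if np.linalg.matrix_rank(F[:, s]) < F.shape[0]:
        raise ValueError("active set does not determine x uniquely")
    # on the active set, <x, f_j> = y_j + alpha_j
    x, *_ = np.linalg.lstsq(F[:, s].T, y[s] + alpha[s], rcond=None)
    return x

rng = np.random.default_rng(2)
N, M = 5, 40
F = rng.standard_normal((N, M))
alpha = np.zeros(M)
x = rng.standard_normal(N)
y = np.maximum(0.0, F.T @ x - alpha)
print(np.allclose(invert_rectified(F, alpha, y), x))  # True
```

With $\alpha \equiv 0$ and a redundant random frame, roughly half of the coordinates are active, which is why $M \ge 2N$ appears as a necessary condition above.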
In order to compare the stability of the half-rectification versus the full rectification, one can modify $M_\alpha$ so that it maps $x$ and $-x$ to the same point; call the resulting symmetrized operator $\tilde{M}_\alpha$. It satisfies the following:

Corollary 2.3 The operator $\tilde{M}_\alpha$ satisfies
$$\tilde{A}\, d(x, x')^2 \le \|\tilde{M}_\alpha(x) - \tilde{M}_\alpha(x')\|^2 \le \tilde{B}\, d(x, x')^2.$$
Proof of Corollary 2.3: comparing $\|\tilde{M}_\alpha(x) - \tilde{M}_\alpha(x')\|$ with $\|M(x) - M(x')\|$ under the distance $d(x, x') = \min(\|x - x'\|, \|x + x'\|)$, one obtains $\tilde{A} \ge 2^{-1/2} A$ and $\tilde{B} \le B$. In particular, if $M$ is invertible, so is $\tilde{M}_\alpha$.
It results that the bi-Lipschitz bounds of the half-rectification operator, when considered under the equivalence $x \sim -x$, are controlled by the bounds of the absolute value operator, up to a factor $2^{-1/2}$. However, this lower Lipschitz bound involves a minimum taken over a much smaller family of elements than the corresponding bound for the absolute value operator.
$\ell_p$ pooling
We give bi-Lipschitz constants of the $\ell_p$ pooling and $\ell_p$ rectified pooling operators for $p = 1, 2, \infty$.
From their definitions, it follows that the pooling operators $P_p$ and $R_p$ can be expressed respectively as functions of the phaseless and half-rectified operators, which implies that for the pooling to be invertible, it is necessary that the absolute value and rectified operators be invertible too. Naturally, the converse is not true.
$\ell_2$ pooling
The invertibility conditions of the $\ell_2$ pooling representation have been recently studied in (Cahill et al., 2013), where the authors obtain necessary and sufficient conditions on the frame $F$. We shall now generalize those results and derive bi-Lipschitz bounds.
Let us define
$$Q_2 = \left\{ B = (b_1, \dots, b_M)\ ;\ \forall k,\ \{b_j\}_{j \in I_k} \text{ is an orthogonal basis of } \mathrm{span}(F_k) \right\}.$$
$Q_2$ thus contains all the orthogonal bases of each subspace $F_k$.
The following proposition, proved in section B, computes upper and lower bounds of the Lipschitz constants of P 2 .
Proposition 2.4 The $\ell_2$ pooling operator $P_2$ satisfies
$$A_2\, d(x, x')^2 \le \|P_2(x) - P_2(x')\|^2 \le B_2\, d(x, x')^2,$$
where
$$A_2 = \min_{B \in Q_2}\ \min_{\Omega \subseteq \{1, \dots, M\}} \left( \lambda_-(B_\Omega)^2 + \lambda_-(B_{\Omega^c})^2 \right), \qquad B_2 = \lambda_+(F)^2.$$
This proposition thus generalizes the results from (Cahill et al., 2013), since it shows that A 2 > 0 not only controls when P 2 is invertible, but also controls the stability of the inverse mapping.
We also consider the rectified $\ell_2$ pooling case. For simplicity, we shall concentrate on the case where the pools have dimension $d = 2$. For that purpose, for each $x, x'$, we consider a modification of the family $Q_2$, obtained by replacing each sub-frame $F_k$ by $F_{I_k \cap s(x) \cap s(x')}$, which we denote $\tilde{Q}_{2,x,x'}$.
Corollary 2.5 Let $d = 2$, and set $p(x, x') = s(x) \cup s(x') \setminus (s(x) \cap s(x'))$. Then the rectified $\ell_2$ pooling operator $R_2$ satisfies
$$\tilde{A}_2\, d(x, x')^2 \le \|R_2(x) - R_2(x')\|^2 \le \tilde{B}_2\, d(x, x')^2,$$
where $\tilde{A}_2$ and $\tilde{B}_2$ are obtained from the constants of Proposition 2.4 by replacing $Q_2$ with the families $\tilde{Q}_{2,x,x'}$.
Proposition 2.4 and Corollary 2.5 give lower Lipschitz bounds which provide sufficient guarantees for the inversion of pooling representations. Corollary 2.5 indicates that, in the case $d = 2$, the lower Lipschitz bounds are sharper than in the non-rectified case, consistent with the results of Section 2.1. The general case $d > 2$ remains an open issue.
$\ell_\infty$ pooling
We give in this section necessary and sufficient conditions for the max-pooling operator $P_\infty$ to be injective, and we compute a lower bound on its lower Lipschitz constant.
Given $x \in \mathbb{R}^N$, we define the switches $s(x)$ of $x$ as the $K$-vector of coordinates in each pool where the maximum is attained; that is, for each $k \in \{1, \dots, K\}$:
$$s(x)_k = \arg\max_{i \in I_k} |\langle x, f_i \rangle|,$$
and we denote by $S$ the set of all attained switches: $S = \{s(x)\ ;\ x \in \mathbb{R}^N\}$. This is a discrete subset of $\prod_k \{1, \dots, |I_k|\}$. Given $s \in S$, the set of input signals having $s(x) = s$ defines a linear cone $C_s \subset \mathbb{R}^N$:
$$C_s = \left\{ x \in \mathbb{R}^N\ ;\ \forall k,\ \forall i \in I_k,\ |\langle x, f_{s_k} \rangle| \ge |\langle x, f_i \rangle| \right\},$$
and as a result the input space is divided into a collection of Voronoi-like cells defined by linear inequalities. Restricted to each cone $C_s$, the max-pooling operator computes the phaseless mapping $M(x)$ corresponding to $F_s = (f_{s_1}, \dots, f_{s_K})$.
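The switches, and the identity between max pooling and the phaseless map of $F_s$ on the cone $C_s$, can be checked numerically (a NumPy sketch; the names are ours):

```python
import numpy as np

def switches(F, pools, x):
    """Switch vector s(x): in each pool, the index attaining the max modulus."""
    c = np.abs(F.T @ x)
    return [int(I[np.argmax(c[I])]) for I in pools]

def max_pool(F, pools, x):
    """P_infty(x): the maximum modulus in each pool."""
    c = np.abs(F.T @ x)
    return np.array([c[I].max() for I in pools])

rng = np.random.default_rng(3)
N, K, L = 6, 8, 3
F = rng.standard_normal((N, K * L))
pools = [np.arange(k * L, (k + 1) * L) for k in range(K)]
x = rng.standard_normal(N)
s = switches(F, pools, x)
# restricted to the cone C_s, max pooling is the phaseless map of F_s
print(np.allclose(max_pool(F, pools, x), np.abs(F[:, s].T @ x)))  # True
```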
Given vectors $u$ and $v$, set, as usual, the angle $\theta(u, v) = \arccos\left( \frac{|\langle u, v \rangle|}{\|u\| \|v\|} \right)$. For each $s, s' \in S$ such that $C_s \cap C_{s'} = \emptyset$ and for each $\Omega \subset \{1 \dots K\}$, we define
$$\beta(s, s', \Omega) = \inf_{u \in V_{s,\Omega} \cap C_s,\ v \in V_{s',\Omega} \cap C_{s'}} \theta(u, v),$$
where $V_{s,\Omega}$ denotes the span of $F_s\big|_\Omega$. This is a modified first principal angle between the subspaces spanned by $F_s\big|_\Omega$ and $F_{s'}\big|_\Omega$, where the infimum is taken only over the directions included in the respective cones. Set $\Lambda_{s,s',\Omega} = \lambda_-\big(F\big|_\Omega\big) \cdot \sin\big(\beta(s, s', \Omega)\big)$.
Given $s, s'$, we also define $J(s, s') = \{k\ ;\ s_k = s'_k\}$. Recall $L$ is the size of each pool. Set
$$A_\infty = \min\left( \min_{\substack{s \ne s' \\ C_s \cap C_{s'} = \emptyset}}\ \min_{\Omega} \Lambda_{s,s',\Omega}^2\ ,\ \ \frac{1}{L} \min_{s \ne s'}\ \min_{\Omega \subseteq J(s,s')} \left( \lambda_-(F_\Omega)^2 + \lambda_-(F_{\Omega^c})^2 \right) \right).$$
The following proposition, proved in section B, gives a lower Lipschitz bound of the max-pooling operator.

Proposition 2.6 The max-pooling operator $P_\infty$ satisfies
$$A_\infty\, d(x, x')^2 \le \|P_\infty(x) - P_\infty(x')\|^2,$$
where $d(x, x') = \min(\|x - x'\|, \|x + x'\|)$.
Proposition 2.6 shows that the lower Lipschitz bound is controlled by two different phenomena. The first one depends upon how the cones corresponding to disjoint switches are aligned, whereas the second one depends on the internal incoherence of each frame $F_{J(s,s')}$. One may ask how these constants evolve in different asymptotic regimes. For example, suppose the number of pools $K$ is fixed but the size of each pool grows by increasing $M$. In that case, the set of possible switches $S$ increases, and each cone $C_s$ gets smaller. The bound corresponding to $C_s \cap C_{s'} = \emptyset$ decreases, since the infimum is taken over a larger family. However, as the cones $C_s$ become smaller, the likelihood that any pair $x \ne x'$ shares the same switches decreases, thus giving more importance to the case $C_s \cap C_{s'} = \emptyset$. Although the ratio $\frac{1}{L}$ decreases, the lower frame bounds $\lambda_-(F_\Omega)^2$, $\lambda_-(F_{\Omega^c})^2$ will in general increase linearly with $L$. The lower bound will thus mainly be driven by the principal angles $\beta(s, s', \Omega)$. Although the minimum over disjoint cones is taken over a larger family, each angle is computed over a smaller region of the space, suggesting that one can indeed increase the size of each pool without compromising the injectivity of the max-pooling.
Another asymptotic regime considers pools of fixed size L and increases the number of pools K . In that case, the bound increases as long as the principal angles remain lower bounded.
We also consider the stability of max-pooling with a half-rectification, by redefining the switches $s(x)$ accordingly:
$$s(x)_k = \arg\max_{i \in I_k} \rho_{\alpha_i}\big(\langle x, f_i \rangle\big). \tag{18}$$
Corollary 2.7 The rectified max-pooling operator $R_\infty$ satisfies
$$\tilde{A}_\infty\, d(x, x')^2 \le \|R_\infty(x) - R_\infty(x')\|^2,$$
with $\tilde{A}_\infty$ defined as $A_\infty$, using the cones $C_s$ obtained from (18).
$\ell_1$ Pooling and Max-Out
Proposition 2.6 can be used to obtain a bound on the lower Lipschitz constant of the $\ell_1$ pooling operator, as well as the Maxout operator (Goodfellow et al., 2013); see Section B.4.2 in the supplementary material.
Random Pooling Operators
What is the minimum amount of redundancy needed to invert a pooling operator? As in previous works on compressed sensing (Candes and Tao, 2004) and phase recovery (Balan et al., 2006), one may address this question by studying random pooling operators. In this case, the lower Lipschitz bounds derived in previous sections can be shown to be positive with probability 1 given appropriate parameters K and L .
The following proposition, proved in Appendix B, analyzes the invertibility of a generic pooling operator constructed from random measurements. We consider a frame $F$ whose $M$ columns are i.i.d. Gaussian vectors of $\mathbb{R}^N$.
Proposition 2.8 Let $F = (f_1, \dots, f_M)$ be a random frame of $\mathbb{R}^N$, organized into $K$ disjoint pools of dimension $L$. With probability $1$, $P_p$ is injective (modulo $x \sim -x$) if $K \ge 4N$ for $p = 1, \infty$, and if $K \ge 2N - 1$ for $p = 2$.
The size of the pools $L$ does not influence the injectivity of random pooling, but it affects the stability of the inverse, as shown in Proposition 2.6. The half-rectified case requires extra care, since the collection of admissible sets $\boldsymbol{\Omega}$ might contain, with non-zero probability, frames $F_\Omega$ with fewer than $N$ columns; it is not considered in the present work.
Numerical Experiments
Our main goal in this section is to experimentally compare the invertibility of $\ell_p$ pooling for $p \in \{1, 2, \infty\}$, with and without rectification. Unlike in the previous sections, we will not consider the Lipschitz bounds, as we do not know a good way to measure these experimentally. Our experiments suggest that recovery is roughly the same difficulty for $p = 1, 2, \infty$, and that rectification makes recovery easier.
In the $\ell_2$ case without rectification, and with $d = 2$, a growing body of work (Candes et al., 2013; Waldspurger et al., 2012) describes how to invert the pooling operator. This is often called phase recovery. A problem for us is the lack of a standard algorithm when $p \ne 2$ or with rectification. We will see that the simple alternating minimization algorithm of (Gerchberg and Saxton, 1972) can be adapted to these situations. However, alternating minimization with random initialization is known to be an inferior recovery algorithm for $p = 2$, and so any conclusions we draw about ease of recovery would be tainted, as we would be testing whether the algorithm is equally bad in the various situations, rather than whether the problems are equally hard. We will show that in certain cases, a training set allows us to find a good initialization for the alternating minimization, leading to excellent recovery performance, and that in this setting, or the random setting, recovery via alternating minimization is roughly as successful for each $p$, suggesting invertibility is equally hard for each $p$. In the same way, we will see evidence that half-rectification before pooling makes recovery easier.
Recovery Algorithms
Alternating minimization
A greedy method for recovering the phase from the modulus of complex measurements is given in (Gerchberg and Saxton, 1972); this method naturally extends to the case at hand. As above, denote the frame $\{f_1, \dots, f_M\} = F$, let $F_k$ be the frame vectors in the $k$th block, and let $I_k$ be the indices of the $k$th block. Let $F^{(-1)}$ be the pseudoinverse of $F$; set $(P_p(x))_k = \|F_k x\|_p$. Starting with an initial signal $x^{(0)}$, update
$$y^{(n)}_{I_k} = (P_p(x))_k \, \frac{F_k\, x^{(n)}}{\|F_k\, x^{(n)}\|_p}, \qquad k = 1, \dots, K,$$
$$x^{(n+1)} = F^{(-1)} y^{(n)}.$$
This approach is not, as far as we know, guaranteed to converge to the correct solution, even when $P_p$ is invertible. However, in practice, if the inversion is easy enough, or if $x^{(0)}$ is close to the true solution, the method can work well. Moreover, this algorithm can be run essentially unchanged for each $\ell_p$; for half-rectification, we only use the nonnegative entries in $y$ for reconstruction.
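The two updates can be sketched in NumPy as follows (an illustrative reimplementation of the scheme above, not the authors' code; the half-rectified variant, which keeps only the nonnegative entries of $y$, is omitted for brevity):

```python
import numpy as np

def lp_norm(v, p):
    a = np.abs(v)
    return a.max() if np.isinf(p) else (a ** p).sum() ** (1.0 / p)

def alternating_min(F, pools, m, p, x0, n_iter=500):
    """Gerchberg-Saxton-style recovery of x from the pooled moduli m = P_p(x).

    Alternates between (i) rescaling the coefficients in each pool so that
    their l_p norm matches the known modulus, keeping their l_p 'phase'
    (direction), and (ii) projecting back to signal space with the
    pseudoinverse of the analysis operator."""
    Fpinv = np.linalg.pinv(F.T)          # pseudoinverse of the analysis map x -> F^T x
    x = x0.copy()
    for _ in range(n_iter):
        c = F.T @ x
        y = np.zeros_like(c)
        for k, I in enumerate(pools):
            nrm = lp_norm(c[I], p)
            if nrm > 0:
                y[I] = m[k] * c[I] / nrm  # impose the known modulus, keep the phase
        x = Fpinv @ y
    return x

rng = np.random.default_rng(4)
N, K, L = 10, 60, 2
F = rng.standard_normal((N, K * L))
pools = [np.arange(k * L, (k + 1) * L) for k in range(K)]
x_true = rng.standard_normal(N)
m = np.array([lp_norm((F.T @ x_true)[I], 2) for I in pools])
x_rec = alternating_min(F, pools, m, 2, rng.standard_normal(N))
angle = (x_rec @ x_true) ** 2 / ((x_rec @ x_rec) * (x_true @ x_true))
print(angle)  # close to 1 exactly when recovery succeeds (up to sign)
```

Note that any true solution is a fixed point of the iteration; convergence from a random initialization, however, is not guaranteed, which is why the regressed initialization below matters.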
In the experiments below, we will use random Gaussian i.i.d. $F$, but we will also use the outputs of dictionary learning with block sparsity. The $F$ generated this way is not really a frame, as the condition number of a trained dictionary on real data is often very high. In this case, we will renormalize each data point to have norm $1$, and modify the update $x^{(n+1)} = F^{(-1)} y^{(n)}$ to
$$x^{(n+1)} = \frac{F^{(-1)} y^{(n)}}{\|F^{(-1)} y^{(n)}\|}.$$
In practice, this modification might not always be possible, since the norm $\|x\|$ is not explicitly present in $P_p(x)$. However, in the classical setting of Fourier measurements and positive $x$, this information is available. Moreover, our empirical experience has been that using this regularization on well-conditioned analysis dictionaries offers no benefit; in particular, it gives no benefit with random analysis matrices.
Phaselift and Phasecut
Two recent algorithms, (Candes et al., 2013) and (Waldspurger et al., 2012), are guaranteed with high probability to solve the (classical) problem of recovering the phase of a complex signal from its modulus, given enough random measurements. In practice, both perform better than greedy alternating minimization. However, it is not obvious to us how to adapt these algorithms to the $\ell_p$ setting.
Nearest neighbors regression
We would like to use the same basic algorithm for all settings to get an idea of the relative difficulty of the recovery problem for different p , but also would like an algorithm with good recovery performance. If our algorithm simply returns poor results in each case, differences between the cases might be masked.
The alternating minimization can be very effective when well initialized. When given a training set of the data to recover, we use a simple regression to find $x^{(0)}$. Fix a number of neighbors $q$ (in the experiments below we use $q = 10$), and suppose $X$ is the training set. Set $G = P_p(X)$, and for a new point $x$ to recover from $P_p(x)$, find the $q$ nearest neighbors in $G$ of $P_p(x)$, and take their principal component to serve as $x^{(0)}$ in the alternating minimization algorithm. We use the fast neighbor searcher from (Vedaldi and Fulkerson, 2008) to accelerate the search.
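A sketch of this regressed initialization (plain exhaustive nearest-neighbor search in place of the fast searcher of Vedaldi and Fulkerson; the names, the toy stand-in for $P_p$, and the sign convention are ours):

```python
import numpy as np

def nn_regressed_init(G, X_train, g_new, q=10):
    """Regressed initialization x0 for alternating minimization.

    G holds the pooled representations P_p of the training set X_train
    (one row per training point).  For a new pooled vector g_new, take
    its q nearest neighbors in G and return the principal component of
    the corresponding training points."""
    idx = np.argsort(np.linalg.norm(G - g_new, axis=1))[:q]
    nb = X_train[idx]
    # leading right singular vector = principal direction of the neighbors
    _, _, Vt = np.linalg.svd(nb, full_matrices=False)
    x0 = Vt[0]
    # resolve the sign ambiguity with the neighbors' mean projection
    if nb.mean(axis=0) @ x0 < 0:
        x0 = -x0
    return x0

rng = np.random.default_rng(5)
X = rng.standard_normal((200, 5))
W = rng.standard_normal((5, 12))
G = np.abs(X @ W)                           # stand-in for P_p(X) with singleton pools
x0 = nn_regressed_init(G, X, np.abs(X[0] @ W))
print(round(float(np.linalg.norm(x0)), 6))  # 1.0 (principal components are unit vectors)
```

Since the data points are renormalized to norm $1$ in the learned-dictionary experiments, the unit-norm principal component is directly usable as $x^{(0)}$.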
Experiments
We discuss results on the MNIST dataset, available at http://yann.lecun.com/exdb/mnist/, and on 16×16 patches drawn from the VOC dataset, available at http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2012/. For each of these data sets, we run experiments with random dictionaries and adapted dictionaries. We also run experiments where the data and the dictionary are both Gaussian i.i.d.; in this case, we do not use adapted dictionaries.
The basic setup of the experiments in each case is the same: we vary the number of measurements (that is, the number of pools) over some range, and attempt to recover the original signal from the $\ell_p$ pooled measurements, using various methods. We record the average angle between the recovered signal $r$ and the original $x$; that is, we use $|r^T x|^2 / (\|r\|^2 \|x\|^2)$ as the measure of success in recovery.
In each case the random analysis dictionary $F$ is built by fixing a size parameter $m$ and generating a Gaussian i.i.d. matrix $F_0$ of size $2m \times n$, where $n = 100$ for MNIST and $n = 256$ for VOC. Each pair of rows of $F_0$ is then orthogonalized to obtain $F$; that is, we use groups of size $2$, where the pair of elements in each group are orthogonal. This allows us to use standard phase recovery software in the $\ell_2$ case to get a baseline. We used the ADMM version of phaselift from (Ohlsson et al., 2012) and the phasecut algorithm of (Waldspurger et al., 2012). For all of our data sets, the latter gave better results (note that phasecut can explicitly use the fact that the solution to the problem is real, whereas that version of phaselift cannot), so we report only the phasecut results.
In the experiments with adapted dictionaries, the dictionary is built using block OMP and batch updates with a K-SVD type update (Aharon et al., 2006); in this case, F is the transpose of the learned dictionary. We again use groups of size 2 in the adapted dictionary experiments.
We run two sets of experiments with Gaussian i.i.d. data and dictionaries, with n = 20 and n = 40 . We consider m in the range from n/ 2 to 8 n . On this data set, phaselift outperforms alternating minimization; see the supplementary material.
For MNIST, we use the standard training set projected to R 100 via PCA, and we let the number of dictionary elements range from 60 to 600 (that is, 30 to 300 measurements). On this data set, alternating minimization with nearest neighbor initialization gives exact reconstruction by 130 measurements; for comparison, Phaselift at m = 130 has mean square angle of . 48 ; see the supplementary material.
We draw approximately 5 million 16 × 16 grayscale image patches from the PASCAL VOC data set; these are sorted by variance, and the largest variance 1 million are kept. The mean is removed from each patch. These are split into a training set of 900000 patches and a test set of 100000 patches. In this experiment, we let m range from 30 to 830. On this data set, by m = 330 measurements, alternating minimization with nearest neighbor initialization recovers mean angle of . 97 ; for comparison, Phaselift at m = 330 has mean angle of . 39 ; see the supplementary material.
Analysis
The experiments show (see Figures 1 and 2) that:

· For every data set, with random initializations and dictionaries, recovery is easier with half-rectification before pooling than without (green vs. dark blue in the figures).

· $\ell_\infty$, $\ell_1$, and $\ell_2$ pooling appear roughly the same difficulty to invert, regardless of algorithm (each column of figures, corresponding to an $\ell_p$, is essentially the same).

· Good initialization improves performance; indeed, alternating minimization with nearest neighbor regression outperforms phaselift and phasecut (which of course do not have the luxury of samples from the prior, as the regressor does). We believe this is of independent interest.

· Adapted analysis 'frames' (with regularization) are easier to invert than random analysis frames, with or without regularization (the bottom row of each pair of graphs vs. the top row of each pair in Figure 2).
Each of these conclusions is unfortunately only true up to the optimization method: it may be that a different optimizer would lead to different results. With learned initializations and alternating minimization, recovery results can be better without half-rectification. Note this is only up until the point where the alternating minimization becomes better than the learned initialization without any refinement, and is especially true for random dictionaries. The simple interpretation is that the reconstruction step of the alternating minimization just does not have a large enough span with roughly half the entries removed; that is, this is an effect of the optimization, not of a difference in the difficulty of the problems.
Conclusion
We have studied conditions under which neural network layers of the pooling forms $P_p$ and $R_p$ preserve signal information. As one would expect, recovery from pooling measurements is only guaranteed under large enough redundancy and suitable configurations of the subspaces, which depend upon which $\ell_p$ is considered. We have proved conditions bounding the lower Lipschitz constants of these layers, giving quantitative descriptions of how much information they preserve. Furthermore, we have given conditions under which modules with random filters are invertible. We have also given experimental evidence that for both random and adapted modules, it is roughly as easy to invert $\ell_p$ pooling with $p = 1, 2$, and $\infty$; and shown that when given training data, alternating minimization gives state-of-the-art phase recovery with a regressed initialization.
However, we are not anywhere near where we would like to be in understanding these systems, or even the invertibility of the layers of these systems. This work gives little direct help to a practitioner asking the question 'how should I design my network?'. In particular, our results just barely touch on the distribution of the data; but the experiments make it clear (see also (Ohlsson et al., 2012)) that knowing more information about the data changes the invertibility of the mappings. Moreover, preservation of information needs to be balanced against invariance, and the tension between these is not discussed in this work. Even in the setting of this work, without consideration of the data distribution or tension with invariance, Proposition 2.4, although tight, is not easy to use, and even though we are able to use Proposition 2.6 to get an invertibility result, it is probably not tight.
This work also shows there is much research to do in the field of algorithmic phase recovery. What are the correct algorithms for $\ell_p$ inversion, perhaps with half-rectification? How can we best use knowledge of the data distribution for phase recovery, even for the well studied $\ell_2$ case? Is it possible to guarantee that a well-initialized alternating minimization converges to the correct solution?

Figure 1. Average recovery angle using alternating projections on random data; each $x$ is Gaussian i.i.d. in $\mathbb{R}^{40}$. The vertical axis measures the average value of $|r^T x|^2 / (\|r\|^2 \|x\|^2)$, where $r$ is the recovered vector, over 50 random test points. The horizontal axis is the number of measurements (the size $m$ of the analysis dictionary is twice the x-axis value in this experiment). The leftmost figure is $\ell_1$ pooling, the middle $\ell_2$, and the right max pooling. The dark blue curve is alternating minimization, and the green curve is alternating minimization with half-rectification; both with random initialization.
References

Koray Kavukcuoglu, Marc'Aurelio Ranzato, Rob Fergus, and Yann LeCun. Learning invariant features through topographic filter maps. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR'09). IEEE, 2009.

Irène Waldspurger, Alexandre d'Aspremont, and Stéphane Mallat. Phase recovery, maxcut and complex semidefinite programming, 2012.

Figure 2. Average recovery angle using alternating projections on image patch data points. The vertical axis measures the average value of $|r^T x|^2 / (\|r\|^2 \|x\|^2)$, where $r$ is the recovered vector, over 50 random test points. The horizontal axis is the number of measurements (the size of the analysis dictionary is twice the x-axis value in this experiment). The leftmost figure is $\ell_1$ pooling, the middle $\ell_2$, and the right max pooling. In the top row of each pair of rows the analysis dictionary is Gaussian i.i.d.; in the bottom row of each pair of rows, it is generated by block OMP/KSVD with 5 nonzero blocks of size 2. The dark blue curve is alternating minimization, and the green curve is alternating minimization with half-rectification; both with random initialization. The magenta and yellow curves are the nearest neighbor regressor described in Section 3.1.3, without and with rectification; and the red and aqua curves are alternating minimization initialized via neighbor regression, without and with rectification. See Section 3.3 for a discussion of the figures.
Comparison between phaselift and the various alternating minimization algorithms
Here we give a brief comparison between the phaselift algorithm and the algorithms we use in the main text. Our main goal is to show that the similarities between the $\ell_1$, $\ell_2$, and $\ell_\infty$ recovery results are not just due to the alternating minimization algorithm performing poorly on all three tasks; however, we feel that the quality of the recovery with a regressed initialization is interesting in itself, especially considering that it is much faster than either phaselift or phasecut.
In Figures 3 and 4 we compare phaselift against alternating minimization with a random initialization and alternating minimization with a nearest neighbor/locally linear regressed initialization. Because we are comparing against phasecut, here we only show inversion of $\ell_2$ pooling.
In Figure 3, we use random data and a random dictionary. As the data has no structure, we only compare against random initialization, with and without half rectification. We can see from Figure 3 that in this case, where we do not know a good way to initialize the alternating minimization, alternating minimization is significantly worse than phasecut. On the other hand, recovery after rectified pooling with alternating minimization does almost as well as phasecut.
In the examples where we have training data, shown in figure 4, alternating minimization with the nearest neighbor regressor (red curve) performs significantly better than phasecut (green and blue curves). Of course phasecut does not get the knowledge of the data distribution used to generate the regressor.

Figure 3. Average recovery angle using phaselift and alternating minimization on random data, Gaussian i.i.d. points in $\mathbb{R}^{40}$. The blue curve is phaselift followed by alternating minimization; the green curve is alternating minimization, and the red is alternating minimization on pooling following half rectification.

Figure 4. Average recovery angle using phaselift and alternating minimization on MNIST and patches data sets. Top: MNIST digits, projected via PCA to $\mathbb{R}^{100}$. Bottom: $16\times 16$ image patches with mean removed. The red curve is alternating minimization with nearest neighbor initialization, the green is alternating minimization initialized by phasecut (this is the recommended usage of phasecut), the blue is phasecut with no alternating minimization, and the aqua is alternating minimization with a random initialization.
Proofs of results in Section 2
Proof of Proposition 2.2
Let us first show that $A_0>0$ is sufficient to construct an inverse of $M_\alpha$. Let $x\in\mathbb{R}^N$. By definition, the coordinates of $M_\alpha(x)$ satisfying $M_\alpha(x)_i>\alpha_i$ correspond to
$$
$$
which in particular implies that $x$ is known to lie in $V_{s(x)}$, the subspace generated by $s(x)$. But the restriction $\mathcal{F}_{s(x)}$ is a linear operator, which can be inverted in $V_{s(x)}$ as long as $\lambda_-(\mathcal{F}_{s(x)}|_{V_{s(x)}})\geq A_0>0$.
Let us now show that $A_0>0$ is also necessary. Suppose that for some $S$, $\mathcal{F}_S$ is such that $\lambda_-(\mathcal{F}_S|_{V_S})=0$. It results that there exists $\eta\in V_S$ such that $\|\eta\|>0$ but $\|\mathcal{F}_S\eta\|=0$. Since $S$ is a cone, we can find $x\in S$ and $\epsilon\neq 0$ small enough such that $x+\epsilon\eta\in S$. It results that $M_\alpha(x)=M_\alpha(x+\epsilon\eta)$, which implies that $M_\alpha$ cannot be injective.
Finally, let us prove (9). If x, x ′ are such that S = s ( x ) = s ( x ′ ) , then
$$
$$
If $s(x)\neq s(x')$, we have that $|M_\alpha(x)_i-M_\alpha(x')_i|=|\langle x-x',f_i\rangle|$ if $i\in s(x)\cap s(x')$, and $|M_\alpha(x)_i-M_\alpha(x')_i|\geq|\langle x-x',f_i\rangle|$ if $i\in s(x)\cup s(x')$, $i\notin s(x)\cap s(x')$. It results that
$$
$$
Proof of Proposition 2.4
The upper Lipschitz bound is obtained by observing that, in dimension d ,
$$
$$
It results that
$$
$$
Let us now concentrate on the lower Lipschitz bound. Given $x,x'\in\mathbb{R}^n$, we first consider a rotation $\tilde{\mathcal{F}}_k$ of each subspace $\mathcal{F}_k$ such that $\langle x,\tilde f_{k,j}\rangle=\langle x',\tilde f_{k,j}\rangle=0$ for $j>2$, which always exists. If we now modify $\tilde{\mathcal{F}}_k$ by applying a rotation of the remaining two-dimensional subspace such that $x$ and $x'$ are bisected, one can verify that
$$
$$
which implies, denoting $M(x)=(|\langle x,\tilde f_{k,j}\rangle|)_{k,j}$, that $\|P_2(x)-P_2(x')\|=\|M(x)-M(x')\|$. Since $\tilde{\mathcal{F}}\in\mathcal{Q}_2$, it results from Proposition 2.1 that
$$
$$
Proof of Corollary 2.5
Given $x,x'$, let $I$ denote the groups $I_k$, $k\leq K$, such that $S_x\cap S_{x'}\cap I_k=I_k$. It results that
$$
$$
On the groups in $I$ we can apply the same arguments as in Proposition 2.4, and hence find a frame $\tilde{\mathcal{F}}$ from the family $\widetilde{\mathcal{Q}}_{p,x,x'}$ such that
$$
$$
with $M(x)=(|\langle x,\tilde f_{k,j}\rangle|)_{k\in I,j}$ and $\{\tilde f_{k,j}\}\in\widetilde{\mathcal{Q}}_{p,x,x'}$. Then, by following the same arguments used previously, it results from the definition of $\tilde A_p$ that
$$
$$
Finally, the upper Lipschitz bound is obtained by noting that
$$
$$
and using the same argument as in (20). $\square$
Proof of Proposition 2.6
$$
$$
Let $x,x'\in\mathbb{R}^N$, and let $J=s(x)\cap s(x')$. Suppose first that $\mathcal{C}_{s(x)}\cap\mathcal{C}_{s(x')}\neq\emptyset$. Since $\|P_\infty x-P_\infty x'\|\geq\big\|\big(|\mathcal{F}_{s}x|-|\mathcal{F}_{s}x'|\big)\big|_J\big\|$, the result follows from Proposition 2.1 and definition (??).
Let us now suppose $\mathcal{C}_{s(x)}\cap\mathcal{C}_{s(x')}=\emptyset$, and let $z=P_\infty x-P_\infty x'$. It results that $z=|\mathcal{F}_{s(x)}x|-|\mathcal{F}_{s(x')}x'|\in\mathbb{R}^K$, and hence we can split the coordinates $(1\dots K)$ into $\Omega,\Omega^c$ such that
$$
$$
We shall treat each restriction independently. Since $\mathcal{F}_{s(x')}|_\Omega(x')\in\mathcal{F}_{s(x')}|_\Omega(\mathcal{C}_{s(x')})$, it results that
$$
$$
$$
$$
it results, assuming without loss of generality that all pools have equal size ($|I_k|=M/K$),
$$
$$
$$
$$
It follows that
$$
$$
By aggregating the bounds for $\Omega$ and $\Omega^c$ we obtain (28). $\square$
Maxout
These results easily extend to the so-called Maxout operator (Goodfellow et al., 2013), defined as $x\mapsto MO(x)=\{\max_{j\in I_k}\langle x,f_j\rangle\;;\;k=1\dots K\}$. By redefining the switches of $x$ as
$$
$$
the following corollary computes a lower Lipschitz bound of $MO(x)$:
Corollary B.1 The Maxout operator $MO$ satisfies (19) with $A(s,s')$ defined using the switches (27).
$\ell_1$ pooling
Proposition 2.6 can be used to obtain a bound on the lower Lipschitz constant of the $\ell_1$ pooling operator.
Observe that for $x\in\mathbb{R}^n$,
$$
\|x\|_1=\max_{\epsilon\in\{-1,1\}^n}\langle\epsilon,x\rangle~.
$$
It results that $P_1(x;\mathcal{F})\equiv P_\infty(x;\tilde{\mathcal{F}})$, with
$$
$$
Each pool $\tilde{\mathcal{F}}_k$ can be rewritten as $\tilde{\mathcal{F}}_k=H_L\mathcal{F}_k$, where $H_L$ is the $2^L\times L$ Hadamard-type matrix whose rows contain the $\epsilon$ vectors. One can verify that $H_L^TH_L=2^L\,\mathbf{I}$, which implies that for any $\Omega\subseteq\{1\dots K\}$, $\lambda_-(\tilde{\mathcal{F}}|_\Omega)=2^{L/2}\lambda_-(\mathcal{F}|_\Omega)$. It results that
Corollary B.2 The /lscript 1 pooling operator P 1 satisfies
$$
$$
where $d(x,x')=\min(\|x-x'\|,\|x+x'\|)$ and
$$
$$
$$
$$
with $s,s'$ and $\beta(s,s')$ defined on the frame $\tilde{\mathcal{F}}$.
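The two facts used in this reduction can be checked numerically for small $L$. Below is a sketch with our own variable names (not from the paper), where `H` is the $2^L\times L$ matrix of sign vectors:

```python
import itertools

import numpy as np

L = 3
# H: the 2^L x L matrix whose rows are all sign vectors in {-1, 1}^L.
H = np.array(list(itertools.product([-1, 1], repeat=L)), dtype=float)

u = np.array([2.0, -5.0, 1.0])
# l_1 norm as a maximum of signed sums: ||u||_1 = max_eps <eps, u>.
assert np.isclose(np.max(H @ u), np.abs(u).sum())
# H^T H = 2^L * Identity, so lower frame bounds scale by 2^(L/2).
assert np.allclose(H.T @ H, 2 ** L * np.eye(L))
print("identities verified for L =", L)
```

The first assertion holds because the maximizing sign vector is $\mathrm{sign}(u)$; the second because off-diagonal products $\epsilon_i\epsilon_j$ cancel in pairs over all sign patterns.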
Proof of Corollaries 2.7 and B.1
The result follows immediately from Proposition 2.6, by replacing the phaseless invertibility condition of Proposition 2.1 by the one in Proposition 2.2. $\square$
Proof of Proposition 2.8
Proposition 2.8 also extends to the maxout case. We restate it here with the extra result:
Proposition B.3 Let $\mathcal{F}=(f_1,\dots,f_M)$ be a random frame of $\mathbb{R}^N$, organized into $K$ disjoint pools of dimension $L$. Then the following statements hold with probability $1$:
Let us first prove (i), with $p=\infty$. Let $x,x'\in\mathbb{R}^N$ be such that $P_\infty(x)=P_\infty(x')$, and let $s=s(x)$, $s'=s(x')$. The set of $K$ pooling measurements is divided into the intersection $\mathcal{J}(s,s')=\{k\;;\;s(x)_k=s(x')_k\}$ and its complement $\mathcal{J}(s,s')^c=\{k\;;\;s(x)_k\neq s(x')_k\}$. Suppose first that $|\mathcal{J}(s,s')|\geq 2N-1$. Then we can pick $d=\lceil|\mathcal{J}(s,s')|/2\rceil\geq N$ elements of $\mathcal{J}(s,s')$ to form a frame $V$ such that either $x-x'\in\mathrm{Ker}(V)$ or $x+x'\in\mathrm{Ker}(V)$. Since a random frame of dimension $\geq N$ spans $\mathbb{R}^N$ with probability $1$, it results that $x=\pm x'$. Suppose otherwise that $|\mathcal{J}(s,s')|<2N-1$. It follows that $|\mathcal{J}(s,s')^c|\geq 2N+1$, and hence that any partition of $\mathcal{J}(s,s')^c$ into two frames will always contain a frame $\mathcal{F}|_\Omega$ with at least $N+1$ columns. Since two random subspaces of dimension $N$ in $\mathbb{R}^M$ have nonzero largest principal angle with probability $1$ as long as $K>N$, it results that $\Lambda_{s,s',\Omega}>0$ and hence that $\mathrm{Prob}\big(|P_\infty(x)|_{\mathcal{J}(s,s')^c}=|P_\infty(x')|_{\mathcal{J}(s,s')^c}\big)=0$. The case $p=1$ is proved identically thanks to Corollary B.2.
Finally, in order to prove (ii) we follow the same strategy. If $|\mathcal{J}(s,s')|\geq N$, then $MO(x)=MO(x')\Rightarrow x=x'$ with probability $1$, since $\mathcal{F}|_{\mathcal{J}}$ spans $\mathbb{R}^N$ with probability $1$. Otherwise it results that $|\mathcal{J}(s,s')^c|\geq N+1$, which implies $MO(x)\neq MO(x')$, since two random subspaces of dimension $N$ in $\mathbb{R}^{|\mathcal{J}(s,s')^c|}$ have $0$ intersection with probability $1$.
Let us now prove the case $p=2$. We start by drawing a random basis for each of the pools $\mathcal{F}_1,\dots,\mathcal{F}_K$. From Proposition 2.4, it follows that we have to check that if $M\geq 2N$, the quantity
$$
$$
with probability $1$. If $M\geq 2N-1$, it follows that either $\Omega$ intersects at least $N$ pools, or $\Omega^c$ does. Say it is $\Omega$. Now, for each pool with nonzero intersection, say $\mathcal{F}_k$, we have that
$$
$$
for some $f_{k,j}$ belonging to the initial random basis of $\mathcal{F}_k$. It results that
$$
$$
where $\mathcal{F}^*$ is a subset of $N$ columns of the original frame $\mathcal{F}$, which means
$\square$
$$
$$
Notes on changes from cycle 1
The mathematical results have been essentially rewritten, for clarity as well as to sharpen the bounds. The proofs are now in the supplementary material, as requested by the reviewers. We have used the extra space to expand the introduction, the conclusion, and the intro to the experiments, in part to explain the connections between the theoretical and experimental parts of the paper, as requested by the reviewers. We also added results on the invertibility of random modules.
We have edited the text in the experiments section and in the captions of the figures to clarify them. Each curve is described in the caption and the text; the graphs are also now specifically referenced in the analysis bullets in Section 3.3.
The introduction and conclusion more explicitly address the take-home messages. Note that the take-home message is not of the form 'this is how to design a network', but rather 'these conditions allow (stable) inversion'. We are sympathetic to the reviewers' desire for a take-home message giving insight into the actual design of networks for practical applications. That is, of course, the ultimate goal of a mathematical analysis of a learning algorithm. However, if the standard for theoretical papers analyzing deep models is that they lead immediately to design suggestions with associated performance increases on benchmarks, it is unlikely that there will ever be a mature enough theory to give honest design suggestions.
Finally, we reprint larger versions of the figures below.
Figure 5. Average recovery angle using alternating projections on random data. The vertical axis measures the average value of $|r^Tx|^2/(\|r\|^2\|x\|^2)$ over 50 random test points. The horizontal axis is the number of measurements (the size $m$ of the analysis dictionary is twice the $x$-axis value in this experiment). The top row is $\ell_1$ pooling, the middle $\ell_2$, and the bottom max pooling. In the left column each $x$ is Gaussian i.i.d. in $\mathbb{R}^{20}$; on the right, in $\mathbb{R}^{40}$. The dark blue curve is alternating minimization, and the green curve is alternating minimization with half rectification; both with random initialization.
The purpose of the pooling layer in each module is to give invariance to the system, perhaps at the expense of resolution. This is done via a summary statistic over the outputs of groups of nodes. In the trained system, the columns of the weight matrix corresponding to nodes grouped together often exhibit similar characteristics, and code for perturbations of a template (Kavukcuoglu et al., 2009; Hyvärinen and Hoyer, 2001).
The summary statistic in $\ell_p$ pooling is the $\ell_p$ norm of the inputs into the pool. That is, if nodes $x_{I_i},\dots,x_{I_l}$ are in a pool, the output of the pool is $\big(\sum_j|x_{I_j}|^p\big)^{1/p}$; as usual, if $p\rightarrow\infty$, this is $\max_j|x_{I_j}|$.
If the outputs of the nonlinearity are nonnegative (as for the half rectification function), then $p=1$ corresponds to average pooling, and the case $p=\infty$ is max pooling.
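As a concrete illustration (our own toy sketch, not code from the paper; the function name and the toy sizes are assumptions), the pooling summary statistic can be written in a few lines, with $p=\infty$ recovering max pooling:

```python
import numpy as np

def lp_pool(z, pools, p):
    """l_p pooling: one l_p norm per pool of coefficients z (p=np.inf gives max pooling)."""
    return np.array([np.linalg.norm(z[idx], ord=p) for idx in pools])

# Toy example: M = 6 coefficients grouped into K = 3 pools of size L = 2.
z = np.array([3.0, -4.0, 1.0, 1.0, 0.0, 2.0])
pools = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

print(lp_pool(z, pools, 1))       # l_1 pooling: [7. 2. 2.]
print(lp_pool(z, pools, 2))       # l_2 pooling: [5. 1.41421356 2.]
print(lp_pool(z, pools, np.inf))  # max pooling: [4. 1. 2.]
```

Note that the $\ell_1$ output is $L$ times the average of the absolute values, so for nonnegative inputs it matches average pooling up to the constant factor $L$.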
Given $x\in\mathbb{R}^n$, a classical problem in signal processing is to recover $x$ from the absolute values of its (1- or 2-dimensional) Fourier coefficients, perhaps subject to some additional constraints on $x$; this problem arises in speech generation and X-ray imaging (Ohlsson, 2013). Unfortunately, the problem is not well posed: the absolute values of the Fourier coefficients do not come close to specifying $x$. For example, the absolute value of the Fourier transform is translation invariant. It can be shown (and we discuss this below) that the absolute values of the inner products between $x$ and any basis of $\mathbb{R}^n$ are not enough to uniquely specify an arbitrary $x$; the situation is worse for $\mathbb{C}^n$. On the other hand, recent works have shown that by taking a redundant enough dictionary the situation is different, and $x$ can be recovered from the modulus of its inner products with the dictionary (Balan et al., 2006; Candes et al., 2013; Waldspurger et al., 2012).
Suppose for a moment that there is no elementwise nonlinearity in our feedforward module, only a linear mapping followed by a pooling. Then with a slightly generalized notion of phase, where the modulus is the $\ell_p$ norm of the pool, and the phase is the $\ell_p$ unit vector specifying the “direction” of the inner products in the pool, the phase recovery problem above asks if the module loses any information. The $\ell_2$ case has been recently studied in (Cahill et al., 2013).
If the columns of the weight matrix in a pool correspond to related features, it can be reasonable to see the entire pool as a “what”. That is, the modulus of the pool indicates the relative presence of a grouping of (sub)features into a template, and the phase of the pool describes the relative arrangement of the subfeatures, describing “where” the template is, or more generally, describing the “pose” of the template.
From this viewpoint, phase reconstruction results make rigorous the notion that given enough redundant versions of “what” and throwing away the “where”, we can still recover the “where”.
In this work we give conditions so that a module consisting of a linear mapping, perhaps followed by a half rectification, followed by an $\ell_p$ pooling preserves the information in its input. We extend the $\ell_2$ results of (Cahill et al., 2013; Balan and Wang, 2013) in several ways: we consider the $\ell_p$ case, take into account the half rectification nonlinearity, and make the results quantitative in the sense that we give bounds on the lower Lipschitz constants of the modules. This gives a measure of the stability of the inversion, which is especially important in a multi-layer system. Using our bounds, we prove that redundant enough random modules with $\ell_1$ or $\ell_\infty$ pooling are invertible.
We also show the results of numerical experiments designed to explore the gaps in our results and in the results in the literature. We note that the alternating minimization method of (Gerchberg and Saxton, 1972) can be used essentially unchanged for the $\ell_p$ case, with or without rectification, and show experiments giving evidence that recovery is roughly equally possible for $\ell_1$, $\ell_2$, and $\ell_\infty$ using this algorithm, and that half rectification before pooling can make recovery easier. Furthermore, we show that with a trained initialization, the alternating method compares favorably with the state-of-the-art recovery methods (for $\ell_2$ with no rectification) in (Waldspurger et al., 2012; Candes et al., 2013), which suggests that the above observations are not an artifact of the alternating method.
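The alternating scheme can be sketched in a few lines. The following is our own simplified illustration for $\ell_2$ pooling (function names, the random initialization details, and the toy sizes are assumptions, not the paper's implementation): alternate between imposing the known pool moduli on the analysis coefficients and projecting back through the pseudo-inverse of the analysis operator.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_pool(A, pools, x):
    """P_2 pooling of the analysis coefficients Ax: one l_2 norm per pool."""
    z = A @ x
    return np.array([np.linalg.norm(z[idx]) for idx in pools])

def invert_l2_pool(A, pools, y, n_iter=500):
    """Alternating projections: impose the known pool moduli y on z = Ax,
    then project back onto the range of A via the pseudo-inverse."""
    x = rng.standard_normal(A.shape[1])
    pinv = np.linalg.pinv(A)
    for _ in range(n_iter):
        z = A @ x
        for idx, yk in zip(pools, y):
            nrm = np.linalg.norm(z[idx])
            if nrm > 0:
                z[idx] *= yk / nrm      # keep the "phase", impose the modulus
        x = pinv @ z
    return x

def recovery_angle(r, x):
    """|r^T x|^2 / (||r||^2 ||x||^2); equals 1 iff r is proportional to x."""
    return (r @ x) ** 2 / ((r @ r) * (x @ x))

# Redundant random frame: N = 8 inputs, K = 32 pools of size L = 2.
N, K, L = 8, 32, 2
A = rng.standard_normal((K * L, N))
pools = [np.arange(k * L, (k + 1) * L) for k in range(K)]
x_true = rng.standard_normal(N)
r = invert_l2_pool(A, pools, l2_pool(A, pools, x_true))
print(recovery_angle(r, x_true))  # near 1 when the iteration converges
```

The `recovery_angle` metric is the one reported on the vertical axes of the figures; like all such alternating schemes, convergence from a random start is not guaranteed, which is what motivates the regressed initializations discussed below.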
This section studies necessary and sufficient conditions guaranteeing that pooling representations are invertible. It also computes upper and lower Lipschitz bounds, which are tight under certain configurations.
Let us first introduce the notation used throughout the paper. Let $\mathcal{F}=\{f_1,\dots,f_M\}$ be a real frame of $\mathbb{R}^N$, with $M>N$. The frame $\mathcal{F}$ is organized into $K$ disjoint blocks $\mathcal{F}_k=\{f_j\}_{j\in I_k}$, $k=1\dots K$, such that $I_k\cap I_{k'}=\emptyset$ and $\bigcup_k I_k=\{1\dots M\}$. For simplicity, we shall assume that all the pools have equal size $|I_k|=L$.
The $\ell_p$ pooling operator $P_p(x)$ is defined as the mapping
$$
x\mapsto P_p(x)=\Big(\big\|(\langle x,f_j\rangle)_{j\in I_k}\big\|_p\Big)_{k=1\dots K}~.
$$
A related representation, which has gained popularity in recent deep learning architectures, introduces a point-wise thresholding before computing the $\ell_p$ norm. If $\alpha\in\mathbb{R}^M$ is a fixed threshold vector, and $(\rho_\alpha(x))_i=\max(0,x_i-\alpha_i)$, then the $\ell_p$ rectified pooling operator $R_p(x)$ is defined as
where $\alpha_k$ contains the coordinates $I_k$ of $\alpha$.
We shall measure the stability of the inverse pooling with the Euclidean distance in the representation space. Given a distance $d(x,x')$ in the input space, the Lipschitz bounds of a given operator $\Phi(x)$ are defined as the constants $0\leq A\leq B$ such that, for all $x,x'$,
$$
A\,d(x,x')^2\leq\|\Phi(x)-\Phi(x')\|^2\leq B\,d(x,x')^2~.
$$
In the remainder of the paper, given a frame $\mathcal{F}$, we denote respectively by $\lambda_-(\mathcal{F})$ and $\lambda_+(\mathcal{F})$ its lower and upper frame bounds. If $\mathcal{F}$ has $M$ vectors and $\Omega\subset\{1,\dots,M\}$, we denote by $\mathcal{F}_\Omega$ the frame obtained by keeping the vectors indexed in $\Omega$. Finally, we denote by $\Omega^c$ the complement of $\Omega$.
In order to study the injectivity of pooling representations, we first focus on the properties of the operators defined by the point-wise nonlinearities.
The properties of the phaseless mapping
$$
M(x)=\{|\langle x,f_i\rangle|\}_{i\leq M}
$$
have been extensively studied in the literature (Balan et al., 2006; Balan and Wang, 2013), in part motivated by applications to speech processing (Achan et al., 2003) and X-ray crystallography (Ohlsson, 2013). It is shown in (Balan et al., 2006) that if $m>2n-1$ then it is possible to recover $x$ from $M(x)$, up to a global sign change. In particular, (Balan and Wang, 2013) recently characterized the stability of the phaseless operator, which is summarized in the following proposition:
Let $\mathcal{F}=(f_1,\dots,f_M)$ with $f_i\in\mathbb{R}^N$ and $d(x,x')=\min(\|x-x'\|,\|x+x'\|)$. The mapping $M(x)=\{|\langle x,f_i\rangle|\}_{i\leq M}$ satisfies
In particular, $M(x)$ is injective if and only if for any subset $\Omega\subseteq\{1,\dots,M\}$, either $\mathcal{F}_\Omega$ or $\mathcal{F}_{\Omega^c}$ is an invertible frame.
A frame $\mathcal{F}$ satisfying the previous condition is said to be phase retrievable.
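For small frames the complement condition above can be checked by brute force over all subsets $\Omega$; the following is our own illustrative sketch (exponential in $M$, so for toy sizes only), which also recovers the fact that a bare basis is never phase retrievable:

```python
import itertools

import numpy as np

def is_phase_retrievable(F):
    """Check: for every subset Omega, F_Omega or F_{Omega^c} spans R^N.

    F is N x M (columns are the frame vectors). Exponential in M.
    """
    N, M = F.shape
    for r in range(M + 1):
        for omega in itertools.combinations(range(M), r):
            comp = [i for i in range(M) if i not in omega]
            rank_o = np.linalg.matrix_rank(F[:, list(omega)]) if omega else 0
            rank_c = np.linalg.matrix_rank(F[:, comp]) if comp else 0
            if rank_o < N and rank_c < N:
                return False
    return True

rng = np.random.default_rng(1)
N = 2
# 2N - 1 = 3 generic vectors in R^2 are retrievable; a basis (M = N) is not.
assert is_phase_retrievable(rng.standard_normal((N, 2 * N - 1)))
assert not is_phase_retrievable(rng.standard_normal((N, N)))
```

The second assertion is the statement from the introduction that the moduli of inner products with any basis of $\mathbb{R}^n$ cannot uniquely specify $x$.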
We now turn our attention to the half-rectification operator, defined as $M_\alpha(x)=\rho_\alpha(\mathcal{F}^Tx)$.
For that purpose, let us introduce some extra notation. Given a frame $\mathcal{F}=\{f_1,\dots,f_M\}$, a subset $\Omega\subset\{1\dots M\}$ is admissible if
We denote by $\overline{\Omega}$ the collection of all admissible sets, and by $V_\Omega$ the vector space generated by $\Omega$. The following proposition, proved in Section B, gives a necessary and sufficient condition for the injectivity of the half-rectification.
Let $A_0=\min_{\Omega\in\overline{\Omega}}\lambda_-(\mathcal{F}_\Omega|_{V_\Omega})$. Then the half-rectification operator $M_\alpha(x)=\rho_\alpha(\mathcal{F}^Tx)$ is injective if and only if $A_0>0$. Moreover, it satisfies
with $B_0=\max_{\Omega\in\overline{\Omega}}\lambda_+(\mathcal{F}_\Omega)\leq\lambda_+(\mathcal{F})$.
The half-rectification has the ability to recover the input signal, without the global sign ambiguity. The ability to reconstruct from $M_\alpha$ is thus controlled by the rank of any matrix $\mathcal{F}_\Omega$ whose columns are taken from a subset belonging to $\overline{\Omega}$. In particular, if $\alpha\equiv 0$, since $\Omega\in\overline{\Omega}\Rightarrow\Omega^c\in\overline{\Omega}$, it results that $m\geq 2n$ is necessary in order to have $A_0>0$.
The rectified linear operator creates a partition of the input space into polytopes, defined by (8), and computes a linear operator on each of these regions. A given input $x$ activates a set $\Omega_x\in\overline{\Omega}$, encoded by the sign of the linear measurements $\langle x,f_i\rangle-\alpha_i$.
As opposed to the absolute value operator, the inverse of $M_\alpha$, whenever it exists, can be computed directly by locally inverting a linear operator. Indeed, the coordinates of $M_\alpha(x)$ satisfying $M_\alpha(x)_j>\alpha_j$ form a set $s(x)$, which defines a linear model $\mathcal{F}_{s(x)}$ that is invertible if $A_0>0$.
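This local linearity makes inversion a single least-squares solve on the active rows. A minimal sketch under our own simplifications ($\alpha=0$, a highly redundant Gaussian frame, and names of our choosing):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 5, 40
F = rng.standard_normal((M, N))   # rows are the frame vectors f_i
alpha = np.zeros(M)               # zero threshold for simplicity

def rect(F, alpha, x):
    """M_alpha(x) = rho_alpha(F x), the half-rectified analysis coefficients."""
    return np.maximum(F @ x - alpha, 0.0)

def invert_rect(F, alpha, m):
    """Invert M_alpha: on the active set s = {i : m_i > 0} the map is linear,
    m_i = <x, f_i> - alpha_i, so x is a least-squares solve on those rows."""
    s = m > 0
    x, *_ = np.linalg.lstsq(F[s], m[s] + alpha[s], rcond=None)
    return x

x = rng.standard_normal(N)
x_rec = invert_rect(F, alpha, rect(F, alpha, x))
assert np.allclose(x, x_rec)  # exact recovery when the active rows span R^N
```

With high redundancy the active rows span $\mathbb{R}^N$ with overwhelming probability, so recovery is exact; this is the direct, non-iterative inversion that the absolute-value operator does not admit.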
In order to compare the stability of the half-rectification to that of the full rectification, one can modify $M_\alpha$ so that it maps $x$ and $-x$ to the same point. If one considers
and $d(x,x')=\min(\|x-x'\|,\|x+x'\|)$, then $\widetilde{A}\geq 2^{-1/2}A$ and $\widetilde{B}\leq B$. In particular, if $M$ is invertible, so is $\widetilde{M}_\alpha$.
It results that the bi-Lipschitz bounds of the half-rectification operator, when considered under the equivalence $x\sim-x$, are controlled by the bounds of the absolute value operator, up to a factor $2^{-1/2}$. However, the lower Lipschitz bound (11) consists of a minimum taken over a much smaller family of elements than in (5).
From its definition, it follows that the pooling operators $P_p$ and $R_p$ can be expressed respectively as functions of phaseless and half-rectified operators, which implies that for the pooling to be invertible, it is necessary that the absolute value and rectified operators be invertible too. Naturally, the converse is not true.
The invertibility conditions of the $\ell_2$ pooling representation have been recently studied in (Cahill et al., 2013), where the authors obtain necessary and sufficient conditions on the frame $\mathcal{F}$. We shall now generalize those results, and derive bi-Lipschitz bounds.
Let us define
$\mathcal{Q}_2$ thus contains all the orthogonal bases of each subspace $\mathcal{F}_k$.
The following proposition, proved in Section B, computes upper and lower bounds of the Lipschitz constants of $P_2$.
This proposition thus generalizes the results of (Cahill et al., 2013), since it shows that $A_2>0$ not only controls when $P_2$ is invertible, but also controls the stability of the inverse mapping.
We also consider the rectified $\ell_2$ pooling case. For simplicity, we shall concentrate on the case where the pools have dimension $d=2$. For that purpose, for each $x,x'$, we consider a modification of the family $\mathcal{Q}_2$, replacing each sub-frame $\mathcal{F}_k$ by $\mathcal{F}_{I_k\cap s(x)\cap s(x')}$, which we denote $\widetilde{\mathcal{Q}}_{2,x,x'}$.
Let $d=2$, and set $p(x,x')=s(x)\cup s(x')\,\backslash\,(s(x)\cap s(x'))$. Then the rectified $\ell_2$ pooling operator $R_2$ satisfies
Proposition 2.4 and Corollary 2.5 give lower Lipschitz bounds which provide sufficient guarantees for the inversion of pooling representations. Corollary 2.5 indicates that, in the case $d=2$, the lower Lipschitz bounds are sharper than in the non-rectified case, consistent with the results of Section 2.1. The general case $d>2$ remains an open issue.
Given $x\in\mathbb{R}^N$, we define the switches $s(x)$ of $x$ as the $K$-vector of the coordinates in each pool where the maximum is attained; that is, for each $k\in\{1,\dots,K\}$:
and we denote by $\mathcal{S}$ the set of all attained switches: $\mathcal{S}=\{s(x)\;;\;x\in\mathbb{R}^N\}$. This is a discrete subset of $\prod_k\{1,\dots,|I_k|\}$. Given $s\in\mathcal{S}$, the set of input signals having $s(x)=s$ defines a linear cone $\mathcal{C}_s\subset\mathbb{R}^N$:
and as a result the input space is divided into a collection of Voronoi cells defined by linear equations. Restricted to each cone $\mathcal{C}_s$, the max-pooling operator computes the phaseless mapping $M(x)$ from equation (3) corresponding to $\mathcal{F}_s=(f_{s_1},\dots,f_{s_K})$.
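The switch assignment and a membership test for the cone $\mathcal{C}_s$ can be written down directly; the sketch below is our own illustration (names and toy sizes are assumptions), and it checks the cone properties implied by the definition:

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, L = 4, 6, 3
F = rng.standard_normal((K * L, N))   # rows f_i; pool k owns rows kL, ..., kL+L-1
pools = [np.arange(k * L, (k + 1) * L) for k in range(K)]

def switches(F, pools, x):
    """s(x): in each pool, the coordinate attaining max_j |<x, f_j>|."""
    z = np.abs(F @ x)
    return np.array([idx[np.argmax(z[idx])] for idx in pools])

def in_cone(F, pools, x, s):
    """x belongs to the cone C_s exactly when its switches equal s."""
    return np.array_equal(switches(F, pools, x), s)

x = rng.standard_normal(N)
s = switches(F, pools, x)
assert in_cone(F, pools, x, s)
assert in_cone(F, pools, 2.0 * x, s)   # C_s is a cone: stable under scaling
assert in_cone(F, pools, -x, s)        # and under sign flips, since |.| is used
```

Ties in the argmax occur only on a measure-zero set for a generic frame, which is why the cells tile the input space up to boundaries.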
Given vectors $u$ and $v$, set as usual the angle $\theta(u,v)=\arccos\left(\frac{|\langle u,v\rangle|}{\|u\|\,\|v\|}\right)$. For each $s,s'\in\mathcal{S}$ such that $\mathcal{C}_s\cap\mathcal{C}_{s'}=\emptyset$ and for each $\Omega\subset\{1\dots K\}$, we define
This is a modified first principal angle between the subspaces $\mathcal{F}_s|_\Omega$ and $\mathcal{F}_{s'}|_\Omega$, where the infimum is taken only over the directions included in the respective cones. Set $\Lambda_{s,s',\Omega}=\lambda_-(\mathcal{F}|_\Omega)\cdot\sin(\beta(s,s',\Omega))$.
Given $s$, $s'$, we also define $\mathcal{J}(s,s')=\{k\;;\;s_k=s'_k\}$. Recall that $L$ is the size of each pool. Set
Proposition 2.6 shows that the lower Lipschitz bound is controlled by two different phenomena. The first depends upon how the cones corresponding to disjoint switches are aligned, whereas the second depends on the internal incoherence of each frame $\mathcal{F}_{\mathcal{J}(s,s')}$. One may ask how these constants evolve in different asymptotic regimes. For example, suppose the number of pools $K$ is fixed but the size of each pool grows by increasing $M$. In that case, the set of possible switches $\mathcal{S}$ increases, and each cone $\mathcal{C}_s$ gets smaller. The bound corresponding to $\mathcal{C}_s\cap\mathcal{C}_{s'}\neq\emptyset$ decreases, since the infimum is taken over a larger family. However, as the cones $\mathcal{C}_s$ become smaller, the likelihood that any pair $x\neq x'$ shares the same switches decreases, thus giving more importance to the case $\mathcal{C}_s\cap\mathcal{C}_{s'}=\emptyset$. Although the ratio $\frac{1}{L}$ decreases, the lower frame bounds $\lambda_-(\mathcal{F}_\Omega)^2$, $\lambda_-(\mathcal{F}_{\Omega^c})^2$ will in general increase linearly with $L$. The lower bound will thus mainly be driven by the principal angles $\beta(s,s',\Omega)$. Although the minimum in (28) is taken over a larger family, each angle is computed over a smaller region of the space, suggesting that one can indeed increase the size of each pool without compromising the injectivity of the max-pooling.
Another asymptotic regime considers pools of fixed size $L$ and increases the number of pools $K$. In that case, the bound increases as long as the principal angles remain lower bounded.
We also consider the stability of max-pooling with a half-rectification, by redefining the switches $s(x)$ accordingly:
defined using the cones ${\mathcal{C}}_s$ obtained from (18).
Proposition 2.6 can be used to obtain a bound on the lower Lipschitz constant of the $\ell_1$ pooling operator, as well as the Maxout operator (Goodfellow et al., 2013); see Section B.4.2 in the supplementary material.
What is the minimum amount of redundancy needed to invert a pooling operator? As in previous works on compressed sensing (Candes and Tao, 2004) and phase recovery (Balan et al., 2006), one may address this question by studying random pooling operators. In this case, the lower Lipschitz bounds derived in the previous sections can be shown to be positive with probability $1$ given appropriate parameters $K$ and $L$.
The following proposition, proved in Appendix B, analyzes the invertibility of a generic pooling operator constructed from random measurements. We consider a frame ${\mathcal{F}}$ whose $M$ columns are i.i.d. Gaussian vectors of ${\mathbb{R}}^N$.
Let ${\mathcal{F}}=(f_1,\dots,f_M)$ be a random frame of ${\mathbb{R}}^N$, organized into $K$ disjoint pools of dimension $L$. With probability $1$, $P_p$ is injective (modulo $x\sim -x$) if $K\geq 4N$ for $p=1,\infty$ and if $K\geq 2N-1$ for $p=2$.
The size $L$ of the pools does not influence the injectivity of random pooling, but it affects the stability of the inverse, as shown in Proposition 2.6. The half-rectified case requires extra care, since the set of admissible switches $\overline{\Omega}$ might contain frames with $M<N$ columns with non-zero probability, and is not considered in the present work.
Our main goal in this section is to experimentally compare the invertibility of $\ell_p$ pooling for $p\in\{1,2,\infty\}$, with and without rectification. Unlike in the previous sections, we will not consider the Lipschitz bounds, as we do not know a good way to measure these experimentally. Our experiments suggest that recovery is roughly the same difficulty for $p=1,2,\infty$, and that rectification makes recovery easier.
In the $\ell_2$ case without rectification, and with $d=2$, a growing body of work (Candes et al., 2013; Waldspurger et al., 2012) describes how to invert the pooling operator. This is often called phase recovery. A problem for us is the lack of a standard algorithm when $p\neq 2$ or with rectification. We will see that the simple alternating minimization algorithm of (Gerchberg and Saxton, 1972) can be adapted to these situations. However, alternating minimization with random initialization is known to be an inferior recovery algorithm for $p=2$, and so any conclusions we draw about ease of recovery would be tainted: we would be testing whether the algorithm is equally bad in the various situations, rather than whether the problems are equally hard. We will show that in certain cases, a training set allows us to find a good initialization for the alternating minimization, leading to excellent recovery performance, and that in this setting, or the random setting, recovery via alternating minimization is roughly as successful for each $p$, suggesting inversion is equally hard for each $p$. In the same way, we will see evidence that half rectification before pooling makes recovery easier.
A greedy method for recovering the phase from the modulus of complex measurements is given in (Gerchberg and Saxton, 1972); this method naturally extends to the case at hand. As above, denote the frame $\{f_1,\dots,f_M\}={\mathcal{F}}$, let ${\mathcal{F}}_k$ be the frame vectors in the $k$th block, and let $I_k$ be the indices of the $k$th block. Let ${\mathcal{F}}^{(-1)}$ be the pseudoinverse of ${\mathcal{F}}$; set $(P_p(x))_k=\|{\mathcal{F}}_k x\|_p$. Starting with an initial signal $x^0$, update
$$ y^{(n)}_{I_k}=(P_p(x))_k\,\frac{{\mathcal{F}}_k x^{(n)}}{\|{\mathcal{F}}_k x^{(n)}\|_p}~,\qquad k=1\dots K, $$
$$ x^{(n+1)}={\mathcal{F}}^{(-1)}y^{(n)}~. $$
This approach is not, as far as we know, guaranteed to converge to the correct solution, even when $P_p$ is invertible. However, in practice, if the inversion is easy enough, or if $x_0$ is close to the true solution, the method can work well. Moreover, this algorithm can be run essentially unchanged for each $\ell_p$; for half rectification, we only use the nonnegative entries of $y$ for reconstruction.
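As a concrete illustration, here is a minimal NumPy sketch of the alternating updates above. It is our own rendering of the scheme, not the authors' code; the function names are ours, and we assume ${\mathcal{F}}$ has full column rank so that the pseudoinverse inverts it on its range.

```python
import numpy as np

def lp_pool(F_blocks, x, p):
    """l_p pooling: one norm per pool, (P_p(x))_k = ||F_k x||_p."""
    return np.array([np.linalg.norm(Fk @ x, ord=p) for Fk in F_blocks])

def alt_min_invert(F_blocks, pooled, x0, p=2, n_iter=200):
    """Gerchberg-Saxton-style alternating minimization: rescale each
    block of F x to match the measured l_p norm, then map back to
    signal space through the pseudoinverse of F."""
    F = np.vstack(F_blocks)          # full analysis matrix
    F_pinv = np.linalg.pinv(F)       # F^(-1) in the notation of the text
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iter):
        blocks = []
        for Fk, mk in zip(F_blocks, pooled):
            z = Fk @ x
            nz = np.linalg.norm(z, ord=p)
            # keep the "phase" (direction) of F_k x, impose the measured norm
            blocks.append(mk * z / nz if nz > 0 else z)
        x = F_pinv @ np.concatenate(blocks)
    return x
```

Note that the true signal is a fixed point of this iteration whenever ${\mathcal{F}}$ has full column rank, which is why a good initialization matters so much in practice.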
In the experiments below, we will use random Gaussian i.i.d. ${\mathcal{F}}$, but we will also use the outputs of dictionary learning with block sparsity. The ${\mathcal{F}}$ generated this way is not really a frame, as the condition number of a trained dictionary on real data is often very high. In this case, we renormalize each data point to have norm $1$, and modify the update $x^{(n+1)}={\mathcal{F}}^{(-1)}y^{(n)}$ to
In practice, this modification might not always be possible, since the norm $\|x\|$ is not explicitly present in $P_p$. However, in the classical setting of Fourier measurements and positive $x$, this information is available. Moreover, our empirical experience has been that using this regularization on well-conditioned analysis dictionaries offers no benefit; in particular, it gives no benefit with random analysis matrices.
Two recent algorithms, (Candes et al., 2013) and (Waldspurger et al., 2012), are guaranteed with high probability to solve the (classical) problem of recovering the phase of a complex signal from its modulus, given enough random measurements. In practice both perform better than the greedy alternating minimization. However, it is not obvious to us how to adapt these algorithms to the $\ell_p$ setting.
We would like to use the same basic algorithm for all settings to get an idea of the relative difficulty of the recovery problem for different $p$, but we also would like an algorithm with good recovery performance. If our algorithm simply returns poor results in each case, differences between the cases might be masked.
The alternating minimization can be very effective when well initialized. Given a training set $X$ of the data to recover, we use a simple regression to find $x_0$. Fix a number of neighbors $q$ (in the experiments below we use $q=10$). Set $G=P_p(X)$, and for a new point $x$ to recover from $P_p(x)$, find the $q$ nearest neighbors in $G$ of $P_p(x)$, and take their principal component to serve as $x_0$ in the alternating minimization algorithm. We use the fast neighbor searcher from (Vedaldi and Fulkerson, 2008) to accelerate the search.
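A brute-force sketch of this regressed initialization (our reading of the scheme; the paper uses a fast approximate neighbor searcher, and "principal component" is interpreted here as the top singular direction of the neighbor matrix):

```python
import numpy as np

def nn_init(pooled_train, X_train, pooled_new, q=10):
    """Find the q training signals whose poolings are nearest to the new
    pooling, and return their top principal direction as x_0."""
    d2 = ((pooled_train - pooled_new) ** 2).sum(axis=1)   # distances in pooling space
    nbrs = X_train[np.argsort(d2)[:q]]                    # q nearest training signals
    _, _, Vt = np.linalg.svd(nbrs, full_matrices=False)   # principal direction of neighbors
    return Vt[0]
```

The returned direction is only defined up to sign, which is harmless here since the recovery problem itself is solved modulo $x\sim -x$.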
We discuss results on the MNIST dataset, available at http://yann.lecun.com/exdb/mnist/, and on $16\times 16$ patches drawn from the VOC dataset, available at http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2012/. For each of these data sets, we run experiments with random dictionaries and adapted dictionaries. We also run experiments where the data and the dictionary are both Gaussian i.i.d.; in this case, we do not use adapted dictionaries.
The basic setup of the experiments in each case is the same: we vary the number of measurements (that is, the number of pools) over some range, and attempt to recover the original signal from the $\ell_p$ pooled measurements, using various methods. We record the average angle between the recovered signal $r$ and the original $x$; that is, we use $|r^Tx|^2/(\|r\|^2\|x\|^2)$ as the measure of success in recovery.
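The success measure is a one-liner; we spell it out only to emphasize that it is invariant to the sign ambiguity $x\sim -x$:

```python
import numpy as np

def recovery_angle(r, x):
    """|r^T x|^2 / (||r||^2 ||x||^2): 1 means perfect recovery up to
    sign, 0 means the recovered vector is orthogonal to the truth."""
    return float((r @ x) ** 2 / ((r @ r) * (x @ x)))
```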
In each case the random analysis dictionary ${\mathcal{F}}$ is built by fixing a size parameter $m$, and generating a Gaussian i.i.d. matrix ${\mathcal{F}}_0$ of size $2m\times n$, where $n=100$ for MNIST and $n=256$ for VOC. Each pair of rows of ${\mathcal{F}}_0$ is then orthogonalized to obtain ${\mathcal{F}}$; that is, we use groups of size $2$, where the pair of elements in each group are orthogonal. This allows us to use standard phase recovery software in the $\ell_2$ case to get a baseline. We used the ADMM version of phaselift from (Ohlsson et al., 2012) and the phasecut algorithm of (Waldspurger et al., 2012). For all of our data sets, the latter gave better results (note that phasecut can explicitly use the fact that the solution to the problem is real, whereas that version of phaselift cannot), so we report only the phasecut results.
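A sketch of this dictionary construction (the exact orthogonalization used in the experiments is not specified; here we assume one Gram-Schmidt step per consecutive pair of rows):

```python
import numpy as np

def random_paired_dictionary(m, n, seed=None):
    """Gaussian i.i.d. 2m x n analysis dictionary whose rows are
    orthogonalized within each consecutive pair (pools of size 2)."""
    rng = np.random.default_rng(seed)
    F = rng.standard_normal((2 * m, n))
    for k in range(m):
        a = F[2 * k]
        # project the second row of the pair off the first
        F[2 * k + 1] -= (a @ F[2 * k + 1]) / (a @ a) * a
    return F
```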
In the experiments with adapted dictionaries, the dictionary is built using block OMP and batch updates with a K-SVD type update (Aharon et al., 2006); in this case, ${\mathcal{F}}$ is the transpose of the learned dictionary. We again use groups of size $2$ in the adapted dictionary experiments.
We run two sets of experiments with Gaussian i.i.d. data and dictionaries, with $n=20$ and $n=40$. We consider $m$ in the range from $n/2$ to $8n$. On this data set, phaselift outperforms alternating minimization; see the supplementary material.
For MNIST, we use the standard training set projected to ${\mathbb{R}}^{100}$ via PCA, and we let the number of dictionary elements range from 60 to 600 (that is, 30 to 300 measurements). On this data set, alternating minimization with nearest neighbor initialization gives exact reconstruction by $130$ measurements; for comparison, phaselift at $m=130$ has a mean square angle of $.48$; see the supplementary material.
We draw approximately 5 million $16\times 16$ grayscale image patches from the PASCAL VOC data set; these are sorted by variance, and the 1 million with largest variance are kept. The mean is removed from each patch. These are split into a training set of 900,000 patches and a test set of 100,000 patches. In this experiment, we let $m$ range from 30 to 830. On this data set, by $m=330$ measurements, alternating minimization with nearest neighbor initialization recovers a mean angle of $.97$; for comparison, phaselift at $m=330$ has a mean angle of $.39$; see the supplementary material.
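The patch preprocessing pipeline just described can be sketched as follows (our own parameter names; the actual patch extraction from VOC images is omitted):

```python
import numpy as np

def prepare_patches(patches, keep, n_train):
    """Keep the highest-variance patches, remove each patch's mean,
    and split into train/test. `patches` has one flattened patch per
    row, e.g. shape (num_patches, 256) for 16x16 patches."""
    order = np.argsort(patches.var(axis=1))[::-1][:keep]  # sort by variance, keep the top
    kept = patches[order].astype(float)
    kept -= kept.mean(axis=1, keepdims=True)              # remove the mean of each patch
    return kept[:n_train], kept[n_train:]
```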
The experiments show (see figures 1 and 2) that:
For every data set, with random initializations and dictionaries, recovery is easier with half rectification before pooling than without (green vs dark blue in figures).
$\ell_\infty$, $\ell_1$, and $\ell_2$ pooling appear roughly equally difficult to invert, regardless of algorithm (each column of figures, corresponding to one $\ell_p$, is essentially the same).
Good initialization improves performance; indeed, alternating minimization with nearest neighbor regression outperforms phaselift and phasecut (which, of course, do not have the luxury of samples from the prior, as the regressor does). We believe this is of independent interest.
Adapted analysis “frames” (with regularization) are easier to invert than random analysis frames, with or without regularization (the bottom row of each pair of graphs vs the top row of each pair in Figure 2).
Each of these conclusions is unfortunately only true up to the optimization method: it may be that a different optimizer would lead to different results. With learned initializations and alternating minimization, recovery results can be better without half rectification. Note that this holds only up to the point where the alternating minimization improves on the learned initialization without any refinement, and is especially true for random dictionaries. The simple interpretation is that the reconstruction step 2 of the alternating minimization does not have a large enough span with roughly half the entries removed; that is, this is an effect of the optimization, not of a difference in the difficulty of the problems.
We have studied conditions under which neural network layers of the form (1) and (2) preserve signal information. As one could expect, recovery from pooling measurements is only guaranteed under large enough redundancy and suitable configurations of the subspaces, which depend upon which $\ell_p$ is considered. We have proved conditions which bound the lower Lipschitz constants of these layers, giving quantitative descriptions of how much information they preserve. Furthermore, we have given conditions under which modules with random filters are invertible. We have also given experimental evidence that for both random and adapted modules, it is roughly as easy to invert $\ell_p$ pooling with $p=1$, $2$, and $\infty$; and shown that when given training data, alternating minimization gives state-of-the-art phase recovery with a regressed initialization.
However, we are not anywhere near where we would like to be in understanding these systems, or even the invertibility of the layers of these systems. This work gives little direct help to a practitioner asking the question "how should I design my network?". In particular, our results only barely touch on the distribution of the data; but the experiments make it clear (see also (Ohlsson et al., 2012)) that knowing more about the data changes the invertibility of the mappings. Moreover, preservation of information needs to be balanced against invariance, and the tension between these is not discussed in this work. Even in the setting of this work, without consideration of the data distribution or the tension with invariance, Proposition 2.4, although tight, is not easy to use, and even though we are able to use Proposition 2.6 to get an invertibility result, that result is probably not tight.
This work also shows there is much research to do in the field of algorithmic phase recovery. What are the correct algorithms for $\ell_p$ inversion, perhaps with half rectification? How can we best use knowledge of the data distribution for phase recovery, even in the well-studied $\ell_2$ case? Is it possible to guarantee that a well-initialized alternating minimization converges to the correct solution?
Here we give a brief comparison between the phaselift algorithm and the algorithms we use in the main text. Our main goal is to show that the similarities between the $\ell_1$, $\ell_2$, $\ell_\infty$ recovery results are not just due to the alternating minimization algorithm performing poorly on all three tasks; however, we feel that the quality of the recovery with a regressed initialization is interesting in itself, especially considering that it is much faster than either phaselift or phasecut.
In figures 3 and 4 we compare phaselift against alternating minimization with a random initialization and alternating minimization with a nearest neighbor/locally linear regressed initialization. Because we are comparing against phasecut, here we only show inversion of $\ell_2$ pooling.
In figure 3, we use random data and a random dictionary. Since the data has no structure, we only compare against random initialization, with and without half rectification. We can see from figure 3 that in this case, where we do not know a good way to initialize the alternating minimization, it is significantly worse than phasecut. On the other hand, recovery after rectified pooling with alternating minimization does almost as well as phasecut.
In the examples where we have training data, shown in figure 4, alternating minimization with the nearest neighbor regressor (red curve) performs significantly better than phasecut (green and blue curves). Of course, phasecut does not get the knowledge of the data distribution used to generate the regressor.
Let us first show that $A_0>0$ is sufficient to construct an inverse of $M_\alpha$. Let $x\in{\mathbb{R}}^N$. By definition, the coordinates of $M_\alpha(x)$ greater than $\alpha$ correspond to
which in particular implies that $x$ is known to lie in $V_{s(x)}$, the subspace generated by $s(x)$. But the restriction ${\mathcal{F}}_{s(x)}$ is a linear operator, which can be inverted on $V_{s(x)}$ as long as $\lambda_{-}({\mathcal{F}}_{s(x)}|_{V_{s(x)}})\geq A_0>0$.
Let us now show that $A_0>0$ is also necessary. Suppose that for some $S$, ${\mathcal{F}}_S$ is such that $\lambda_{-}({\mathcal{F}}_S|_{V_S})=0$. Then there exists $\eta\in V_S$ with $\|\eta\|>0$ but $\|{\mathcal{F}}_S\eta\|=0$. Since $S$ is a cone, we can find $x\in S$ and $\epsilon\neq 0$ small enough such that $x+\epsilon\eta\in S$. It results that $M_\alpha(x)=M_\alpha(x+\epsilon\eta)$, which implies that $M_\alpha$ cannot be injective.
Finally, let us prove (9). If $x,x'$ are such that $S=s(x)=s(x')$, then
If $s(x)\neq s(x')$, we have $|M_\alpha(x)_i-M_\alpha(x')_i|=|\langle x-x',f_i\rangle|$ if $i\in s(x)\cap s(x')$, and $|M_\alpha(x)_i-M_\alpha(x')_i|\geq|\langle x-x',f_i\rangle|$ if $i\in s(x)\cup s(x')$, $i\notin s(x)\cap s(x')$. It results that
$\square$
The upper Lipschitz bound is obtained by observing that, in dimension $d$,
It results that
Let us now concentrate on the lower Lipschitz bound. Given $x,x'\in{\mathbb{R}}^n$, we first consider a rotated frame $\widetilde{{\mathcal{F}}}_k$ of each pool ${\mathcal{F}}_k$ such that $\langle x,\tilde{f}_{k,j}\rangle=\langle x',\tilde{f}_{k,j}\rangle=0$ for $j>2$, which always exists. If we now modify $\widetilde{{\mathcal{F}}}_k$ by applying a rotation of the remaining two-dimensional subspace such that $x$ and $x'$ are bisected, one can verify that
which implies, denoting $M(x)=(|\langle x,\tilde{f}_{k,j}\rangle|)_{k,j}$, that $\|P_2(x)-P_2(x')\|=\|M(x)-M(x')\|$. Since $\widetilde{{\mathcal{F}}}\in{\mathcal{Q}}_2$, it results from Proposition 2.1 that
Given $x,x'$, let $I$ denote the groups $I_k$, $k\leq K$, such that $S_x\cap S_{x'}\cap I_k=I_k$. It results that
On the groups in $I$ we can apply the same arguments as in Theorem 2.4, and hence find a frame $\widetilde{{\mathcal{F}}}$ from the family $\widetilde{{\mathcal{Q}}}_{p,x,x'}$ such that
with $M(x)=(|\langle x,\tilde{f}_{k,j}\rangle|)_{k\in I,j}$ and $\{\tilde{f}_{k,j}\}\in\widetilde{{\mathcal{Q}}}_{p,x,x'}$. Then, following the same arguments used previously, it results from the definition of $\tilde{A}_p$ that
and using the same argument as in (B.2). $\square$
Let $x,x'\in{\mathbb{R}}^N$, and let ${\mathcal{J}}=s(x)\cap s(x')$. Suppose first that ${\mathcal{C}}_{s(x)}\cap{\mathcal{C}}_{s(x')}\neq\emptyset$. Since $\|P_\infty x-P_\infty x'\|\geq\big\|\big(|{\mathcal{F}}_{s}x|-|{\mathcal{F}}_{s}x'|\big)\big|_{{\mathcal{J}}}\big\|$, it results that
by Proposition 2.1 and by definition (LABEL:mpool_bobo).
Let us now suppose ${\mathcal{C}}_{s(x)}\cap{\mathcal{C}}_{s(x')}=\emptyset$, and let $z=P_\infty x-P_\infty x'$. It results that $z=|{\mathcal{F}}_{s(x)}x|-|{\mathcal{F}}_{s(x')}x'|\in{\mathbb{R}}^K$, and hence we can split the coordinates $(1\dots K)$ into $\Omega$, $\Omega^c$ such that
We shall concentrate on each restriction independently. Since ${\mathcal{F}}_{s(x')}|_{\Omega}(x')\in{\mathcal{F}}_{s(x')}|_{\Omega}({\mathcal{C}}_{s(x')})$, it results that
it results, assuming without loss of generality that all pools have equal size ($|I_k|=\frac{M}{K}$),
Equivalently, since ${\mathcal{F}}_{s(x)}|_{\Omega}(x)\in{\mathcal{F}}_{s(x)}|_{\Omega}({\mathcal{C}}_{s(x)})$, we also have
These results easily extend to the so-called Maxout operator (Goodfellow et al., 2013), defined as $x\mapsto MO(x)=\{\max_{j\in I_k}\langle x,f_j\rangle\;;\;k=1\dots K\}$. By redefining the switches of $x$ as
The Maxout operator $MO$ satisfies (19) with $A(s,s')$ defined using the switches (27).
Observe that for $x\in\mathbb{R}^n$,
It results that $P_1(x;{\mathcal{F}})\equiv P_\infty(x;\widetilde{{\mathcal{F}}})$, with
Each pool $\widetilde{{\mathcal{F}}}_k$ can be rewritten as $\widetilde{{\mathcal{F}}}_k=H_L{\mathcal{F}}_k$, where $H_L$ is the $2^L\times L$ matrix whose rows contain the $\epsilon$ vectors. One can verify that $H_L^TH_L=2^L\,{\bf Id}$, which implies that for any $\Omega\subseteq\{1\dots K\}$, $\lambda_{-}(\widetilde{{\mathcal{F}}}|_{\Omega})=2^{L/2}\lambda_{-}({\mathcal{F}}|_{\Omega})$. It results that
Similarly to Corollary 2.7, one can obtain an analogous bound for the rectified $\ell_1$ pooling.
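The identity $H_L^TH_L=2^L\,{\bf Id}$, and the resulting $2^{L/2}$ scaling of the singular values of each pool, can be checked numerically; the following small sketch (the function name is ours) builds the matrix of sign vectors directly.

```python
import itertools
import numpy as np

def signed_combination_matrix(L):
    """H_L: the 2^L x L matrix whose rows are all sign vectors
    eps in {-1, +1}^L."""
    return np.array(list(itertools.product((-1, 1), repeat=L)), dtype=float)
```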
The result follows immediately from Proposition 2.6, by replacing the phaseless invertibility condition of Proposition 2.1 with the one in Proposition 2.2. $\square$
Proposition 2.8 also extends to the maxout case. We restate it here with the extra result:
The Maxout operator $MO$ is injective if $K\geq 2N+1$.
Let us first prove (i), with $p=\infty$. Let $x,x'\in{\mathbb{R}}^N$ be such that $P_\infty(x)=P_\infty(x')$, and let $s=s(x)$, $s'=s(x')$. The set of $K$ pooling measurements is divided into the intersection ${\mathcal{J}}(s,s')=\{k\;;\;s(x)_k=s(x')_k\}$ and its complement ${\mathcal{J}}(s,s')^c=\{k\;;\;s(x)_k\neq s(x')_k\}$. Suppose first that $|{\mathcal{J}}(s,s')|\geq 2N-1$. Then we can pick $d=\lceil\frac{|{\mathcal{J}}(s,s')|}{2}\rceil\geq N$ elements of ${\mathcal{J}}(s,s')$ to form a frame $V$, such that either $x-x'\in\mbox{Ker}(V)$ or $x+x'\in\mbox{Ker}(V)$. Since a random frame of dimension $\geq N$ spans ${\mathbb{R}}^N$ with probability $1$, it results that $x=\pm x'$. Suppose otherwise that $|{\mathcal{J}}(s,s')|<2N-1$. It follows that $|{\mathcal{J}}(s,s')^c|\geq 2N+1$, and hence that any partition of ${\mathcal{J}}(s,s')^c$ into two frames will always contain a frame ${\mathcal{F}}|_{\Omega}$ with at least $N+1$ columns.
Since two random subspaces of dimension $N$ in ${\mathbb{R}}^M$ have nonzero largest principal angle with probability $1$ as long as $K>N$, it results that $\Lambda_{s,s',\Omega}>0$ and hence that $\mbox{Prob}\big(P_\infty(x)|_{{\mathcal{J}}(s,s')^c}=P_\infty(x')|_{{\mathcal{J}}(s,s')^c}\big)=0$. The case $p=1$ is proved identically thanks to Corollary B.2.
Finally, in order to prove (ii) we follow the same strategy. If $|{\mathcal{J}}(s,s')|\geq N$, then $MO(x)=MO(x')\Rightarrow x=x'$ with probability $1$, since ${\mathcal{F}}|_{{\mathcal{J}}}$ spans ${\mathbb{R}}^N$ with probability $1$. Otherwise, $|{\mathcal{J}}(s,s')^c|\geq N+1$, which implies $MO(x)\neq MO(x')$, since two random subspaces of dimension $N$ in ${\mathbb{R}}^{|{\mathcal{J}}(s,s')^c|}$ have $0$ intersection with probability $1$.
Let us now prove the case $p=2$. We start by drawing a random basis for each of the pools $F_1,\dots,F_K$. From Proposition 2.4, it follows that we have to check that if $M\geq 2N$, the quantity
with probability $1$. If $M\geq 2N-1$, it follows that either $\Omega$ intersects at least $N$ pools, or $\Omega^c$ intersects $N$ pools. Say it is $\Omega$. Now, for each pool with nonzero intersection, say $F_k$, we have that
where $F^*$ is a subset of $N$ columns of the original frame ${\mathcal{F}}$, which means
Average recovery angle using alternating projections on random data; each $x$ is Gaussian i.i.d. in ${\mathbb{R}}^{40}$. The vertical axis measures the average value of $|r^Tx|^2/(\|r\|^2\|x\|^2)$, where $r$ is the recovered vector, over 50 random test points. The horizontal axis is the number of measurements (the size $m$ of the analysis dictionary is twice the horizontal-axis value in this experiment). The leftmost figure is $\ell_1$ pooling, the middle $\ell_2$, and the right max pooling. The dark blue curve is alternating minimization, and the green curve is alternating minimization with half rectification; both with random initialization.
(a) MNIST, random filters
Average recovery angle using phaselift and alternating minimization on random data, Gaussian i.i.d. points in ${\mathbb{R}}^{40}$. The blue curve is phaselift followed by alternating minimization; the green curve is alternating minimization, and the red is alternating minimization on pooling following half rectification.
Average recovery angle using phaselift and alternating minimization on the MNIST and patches data sets. Top: MNIST digits, projected via PCA to ${\mathbb{R}}^{100}$. Bottom: $16\times 16$ image patches with mean removed. The red curve is alternating minimization with nearest neighbor initialization, the green is alternating minimization initialized by phasecut (this is the recommended usage of phasecut), the blue is phasecut with no alternating minimization, and the aqua is alternating minimization with a random initialization.
$$ \left(|x_{I_i}|^p+\dots+|x_{I_l}|^p\right)^{1/p}, $$
$$ x \mapsto P_p(x)=\{ \|\cF_k^T x\|_p \;;\; k=1\dots K \}~. \tag{lppool} $$
$$ \forall\,x,x' \in \mathbb{R}^n~,~ A_{\cF}\,d(x,x') \leq \| M(x) - M(x') \| \leq B_{\cF}\,d(x,x') ~. \tag{mod_lipsch} $$
$$ M_{\alpha}(x) = \rho_\alpha(\cF^T x)~. \tag{rectdef} $$
$$ \bigcap_{i \in \Omega} \{ x \;;\; \langle x, f_i \rangle > \alpha_i \} \cap \bigcap_{i \notin \Omega} \{ x \;;\; \langle x, f_i \rangle < \alpha_i \} \neq \emptyset~. \tag{rect_partition} $$
$$
\widetilde{M}_{\alpha}(x)=\left\{\begin{array}{rl}M_{\alpha}(x)&\mbox{if }\lambda_{-}({\mathcal{F}}_{s(x)})>\lambda_{-}({\mathcal{F}}_{s(x)^{c}})~,\\
M_{-\alpha}(-x)&\mbox{otherwise}~.\end{array}\right. \tag{S2.Ex4}
$$
$$ \cQ_2 = \left\{ ( U_k \cF_k)_{k\leq K} \;;\;\forall\, k \leq K~,~ U_k \in \R^{d\times d},\; U_k^T U_k = {\bf Id} \right\} ~. \tag{q2family} $$
$$ s(x)_{k}=\arg\max_{j\in I_{k}}|\langle x,f_{j}\rangle|, \tag{S2.Ex8} $$
$$ \beta(s,s',\Omega)=\min_{u\in{\mathcal{F}}_{s}|_{\Omega}({\mathcal{C}}_{s}),\,v\in{\mathcal{F}}_{s'}|_{\Omega}({\mathcal{C}}_{s'})}\theta(u,v). \tag{S2.Ex10} $$
$$ d(x,x') \left(\min_{s,s'} A(s,s') \right) \leq \| P_\infty(x) - P_\infty(x') \|, \tag{cc3} $$
$$ A(s,s') = \Big\{ \lambda_{-}^2(\cF_{\cJ(s,s')}) + \frac{1}{4L} \Lambda_{s,s',\cJ(s,s')^c}^2 \Big\}^{1/2}~ $$
$$ \|M_{\alpha}(x)-M_{\alpha}(x')\|\geq\|{\mathcal{F}}_{s(x)\cup s(x')}(x-x')\|\geq A_{0}\|x-x'\|~. \tag{A2.Ex16} $$
$$
\forall\,y\in{\mathbb{R}}^{d}~,~\|y\|_{1}\leq\sqrt{d}\,\|y\|_{2}~,~\|y\|_{\infty}\leq\|y\|_{2}~. \tag{A2.Ex17}
$$
$$ \|R_{p}(x)-R_{p}(x')\|^{2}=\sum_{k\in I}|R_{p}(x)_{k}-R_{p}(x')_{k}|^{2}+\sum_{k\notin I}|R_{p}(x)_{k}-R_{p}(x')_{k}|^{2} \tag{A2.Ex23} $$
$$ \widetilde{{\mathcal{F}}}=\left(\tilde{f}_{k,\epsilon}=\sum_{i}\epsilon(i)f_{k,i}\;;\;k=1\dots K~,\;\epsilon\in\{-1,1\}^{L}\right)~. \tag{A2.Ex36} $$
Proposition 2.1 ((Balan and Wang, 2013), Theorem 4.3). Let ${\mathcal{F}}=(f_{1},\dots,f_{M})$ with $f_{i}\in\mathbb{R}^{N}$, and let $d(x,x')=\min(\|x-x'\|,\|x+x'\|)$. The mapping $M(x)=\{|\langle x,f_{i}\rangle|\}_{i\leq M}$ satisfies
$$ \forall\,x,x'\in\mathbb{R}^{n}~,~A_{{\mathcal{F}}}\,d(x,x')\leq\|M(x)-M(x')\|\leq B_{{\mathcal{F}}}\,d(x,x')~, \tag{4} $$
where
$$ A_{{\mathcal{F}}}=\min_{\Omega\subset\{1\dots M\}}\sqrt{\lambda_{-}^{2}({\mathcal{F}}_{\Omega})+\lambda_{-}^{2}({\mathcal{F}}_{\Omega^{c}})}~, \tag{5} $$
$$ B_{{\mathcal{F}}}=\lambda_{+}({\mathcal{F}})~. \tag{6} $$
In particular, $M(x)$ is injective if and only if for any subset $\Omega\subseteq\{1,\dots,M\}$, either ${\mathcal{F}}_{\Omega}$ or ${\mathcal{F}}_{\Omega^{c}}$ is an invertible frame.
Proposition. Proposition 2.2 Let $A_{0}=\min_{\Omega\in\overline{\Omega}}\lambda_{-}({\mathcal{F}}_{\Omega}|_{V_{\Omega}})$. Then the half-rectification operator $M_{\alpha}(x)=\rho_{\alpha}({\mathcal{F}}^{T}x)$ is injective if and only if $A_{0}>0$. Moreover, it satisfies $$ \forall\,x,x^{\prime}~,~A_{0}\,\|x-x^{\prime}\|\leq\|M_{\alpha}(x)-M_{\alpha}(x^{\prime})\|\leq B_{0}\,\|x-x^{\prime}\|~, $$ (9) with $B_{0}=\max_{\Omega\in\overline{\Omega}}\lambda_{+}({\mathcal{F}}_{\Omega})\leq\lambda_{+}({\mathcal{F}})$.
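A concrete way to see the injectivity condition at work is to build the frame out of $\pm$ pairs, as in the $\widetilde{M}_{\alpha}$ construction of Corollary 2.3: for every input, at least one vector of each pair is active, so the active sub-frame spans $\mathbb{R}^N$ and the layer can be inverted by least squares on the active coordinates. A minimal sketch (our own toy sizes, bias $\alpha=0$):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 4, 6
G = rng.standard_normal((M, N))
F = np.vstack([G, -G])           # frame containing each vector and its negation

x = rng.standard_normal(N)
y = np.maximum(F @ x, 0.0)       # half-rectified measurements M_alpha(x)

active = y > 0                   # the "switches" s(x)
# For generic x, every +/- pair contributes exactly one active row, so
# F_active spans R^N and x is recovered exactly from the active equations.
x_rec, *_ = np.linalg.lstsq(F[active], y[active], rcond=None)
print(np.allclose(x_rec, x))     # True: the layer is inverted exactly
```

With a generic frame that does not contain $\pm$ pairs, the same recovery works whenever the active sub-frame happens to have rank $N$, which is exactly the condition $A_0 > 0$ restricted to the realized switch configuration.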
Corollary. Corollary 2.3 The operator $\widetilde{M}_{\alpha}$ satisfies $$ \forall\,x,x^{\prime}\in\mathbb{R}^{N}~,~\widetilde{A}\,d(x,x^{\prime})\leq\|\widetilde{M}_{\alpha}(x)-\widetilde{M}_{\alpha}(x^{\prime})\|\leq\widetilde{B}\,d(x,x^{\prime})~, $$ (10) with $$ \widetilde{A}=\min_{\Omega\subset\overline{\Omega}}\max\left(\lambda_{-}^{2}({\mathcal{F}}_{\Omega})\,,\,\lambda_{-}^{2}({\mathcal{F}}_{\Omega^{c}})\right)~, $$ (11) $$ \widetilde{B}=\max_{\Omega\subset\overline{\Omega}}\lambda_{+}({\mathcal{F}}_{\Omega})\leq\lambda_{+}({\mathcal{F}})~, $$ (12) and $d(x,x^{\prime})=\min(\|x-x^{\prime}\|,\|x+x^{\prime}\|)$, so $\widetilde{A}\geq 2^{-1/2}A$ and $\widetilde{B}\leq B$. In particular, if $M$ is invertible, so is $\widetilde{M}_{\alpha}$.
Proposition. Proposition 2.4 The $\ell_{2}$ pooling operator $P_{2}$ satisfies $$ \forall\,x,x^{\prime}~,~A_{2}\,d(x,x^{\prime})\leq\|P_{2}(x)-P_{2}(x^{\prime})\|\leq B_{2}\,d(x,x^{\prime})~, $$ (14) where $$ A_{2}=\min_{{\mathcal{F}}^{\prime}\in{\mathcal{Q}}_{2}}\min_{\Omega\subset\{1\dots M\}}\sqrt{\lambda_{-}^{2}({\mathcal{F}}^{\prime}_{\Omega})+\lambda_{-}^{2}({\mathcal{F}}^{\prime}_{\Omega^{c}})}~,\quad B_{2}=\lambda_{+}({\mathcal{F}})~. $$ (15)
Proposition. For all $x$ and $x^{\prime}$, the max-pooling operator $P_{\infty}$ satisfies $$ d(x,x^{\prime})\left(\min_{s,s^{\prime}}A(s,s^{\prime})\right)\leq\|P_{\infty}(x)-P_{\infty}(x^{\prime})\|~, $$ where $d(x,x^{\prime})=\min(\|x-x^{\prime}\|,\|x+x^{\prime}\|)$.
Proposition. Proposition 2.8 Let ${\mathcal{F}}=(f_{1},\dots,f_{M})$ be a random frame of ${\mathbb{R}}^{N}$, organized into $K$ disjoint pools of dimension $L$. With probability $1$, $P_{p}$ is injective (modulo $x\sim -x$) if $K\geq 4N$ for $p=1,\infty$, and if $K\geq 2N-1$ for $p=2$.
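Injectivity at the $p=2$ threshold $K=2N-1$ is easy to probe numerically: over random pairs $x,x'$, the distortion ratio of $P_2$ with respect to $d(x,x')=\min(\|x-x'\|,\|x+x'\|)$ should stay bounded away from zero. A minimal sketch (pool count and sizes are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
N, L = 5, 3
K = 2 * N - 1                     # the p = 2 threshold of Proposition 2.8
pools = [rng.standard_normal((L, N)) for _ in range(K)]  # disjoint pools F_k

def P2(x):
    """l2 pooling: one nonnegative value ||F_k x|| per pool."""
    return np.array([np.linalg.norm(Fk @ x) for Fk in pools])

d = lambda x, y: min(np.linalg.norm(x - y), np.linalg.norm(x + y))
ratios = []
for _ in range(5000):
    x, xp = rng.standard_normal(N), rng.standard_normal(N)
    ratios.append(np.linalg.norm(P2(x) - P2(xp)) / d(x, xp))
print(f"smallest observed distortion ratio: {min(ratios):.4f}")
```

A strictly positive minimum over many trials is consistent with injectivity modulo $x\sim -x$; it is evidence, not a proof, since Monte Carlo sampling cannot certify a lower Lipschitz constant.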
$$ \|z|_{\Omega}\|\geq\sqrt{\frac{K}{M}}\,\lambda_{-}({\mathcal{F}}|_{\Omega})\cdot\left|\sin\left(\beta(s(x),s(x^{\prime}),\Omega)\right)\right|\cdot\|x^{\prime}\|~. $$ \tag{couscous3}
Corollary. Let $d=2$, and set $p(x,x^{\prime})=s(x)\cup s(x^{\prime})\backslash(s(x)\cap s(x^{\prime}))$. Then the rectified $\ell_{2}$ pooling operator $R_{2}$ satisfies $$ \forall\,x,x^{\prime}~,~A_{2}\,d(x,x^{\prime})\leq\|R_{2}(x)-R_{2}(x^{\prime})\|\leq B_{2}\,d(x,x^{\prime})~, $$ where $$ A_{2}=\inf_{x,x^{\prime}}\min_{{\mathcal{F}}^{\prime}\in{\mathcal{Q}}_{2,x,x^{\prime}}}\min_{\Omega\subset s(x)\cap s(x^{\prime})}\Big(\lambda_{-}^{2}({\mathcal{F}}_{p(x,x^{\prime})})+\lambda_{-}^{2}({\mathcal{F}}^{\prime}_{\Omega})+\lambda_{-}^{2}({\mathcal{F}}^{\prime}_{\Omega^{c}})\Big)^{1/2}~. $$
Corollary. The rectified max-pooling operator $R_{\infty}$ satisfies $$ \forall\,x,x^{\prime}~,~\|x-x^{\prime}\|\,\min_{s,s^{\prime}}A(s,s^{\prime})\leq\|R_{\infty}(x)-R_{\infty}(x^{\prime})\|~, $$ with $$ A(s,s^{\prime})=\Big\{\lambda_{-}^{2}({\mathcal{F}}_{{\mathcal{J}}(s,s^{\prime})})+\frac{1}{4L}\Lambda_{s,s^{\prime},{\mathcal{J}}(s,s^{\prime})^{c}}^{2}\Big\}^{1/2}~, $$ defined using the cones ${\mathcal{C}}_{s}$ obtained from the rectified switches (switches_rect).
Corollary. The $\ell_{1}$ pooling operator $P_{1}$ satisfies $$ \forall\,x,x^{\prime}~,~d(x,x^{\prime})\left(\min_{s,s^{\prime}}A(s,s^{\prime})\right)\leq\|P_{1}(x)-P_{1}(x^{\prime})\|~, $$ where $d(x,x^{\prime})=\min(\|x-x^{\prime}\|,\|x+x^{\prime}\|)$ and $$ A(s,s^{\prime})=\max\Big\{\min_{\Omega\subseteq{\mathcal{J}}(s,s^{\prime})}\sqrt{\lambda_{-}^{2}({\mathcal{F}}_{\Omega})+\lambda_{-}^{2}({\mathcal{F}}_{{\mathcal{J}}-\Omega})}~,~\frac{1}{2}\min_{\Omega\subseteq\{1\dots K\}}\sqrt{\Lambda_{s,s^{\prime},\Omega}^{2}+\Lambda_{s,s^{\prime},\Omega^{c}}^{2}}\Big\}~, $$ with $s,s^{\prime}$ and $\beta(s,s^{\prime})$ defined on the frame $F$.
Numerical Experiments


$$ \min_{F^{\prime}=UF\,,\,U^{T}U={\bf 1}}\ \min_{\Omega\subseteq\{1\dots M\}}\lambda_{-}^{2}(F^{\prime}_{\Omega})+\lambda_{-}^{2}(F^{\prime}_{\Omega^{c}})>0 $$ \tag{A2.Ex39}
$$ \left(\|{\mathcal{F}}_{k}x\|_{2}-\|{\mathcal{F}}_{k}x^{\prime}\|_{2}\right)^{2} $$
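The experiments invert pooling representations with phase recovery algorithms in the spirit of Gerchberg and Saxton (1972). The following is a minimal alternating-projection sketch for the underlying phase recovery problem $y=|Fx|$ ($\ell_2$ pooling with pools of size one); the oversampling factor, iteration count, and random initialization are ad-hoc choices of ours, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 8, 40
F = rng.standard_normal((M, N))
F_pinv = np.linalg.pinv(F)

x_true = rng.standard_normal(N)
y = np.abs(F @ x_true)                   # magnitude-only measurements

x = rng.standard_normal(N)               # random initialization
res = []
for _ in range(500):
    z = F @ x
    res.append(np.linalg.norm(np.abs(z) - y))
    x = F_pinv @ (y * np.sign(z))        # impose modulus, project onto range(F)

err = min(np.linalg.norm(x - x_true), np.linalg.norm(x + x_true))
print(f"residual: {res[0]:.2e} -> {res[-1]:.2e}, "
      f"sign-invariant error: {err:.2e}")
```

The measurement residual is non-increasing (each half-step is a projection), but the iteration can stall in local minima, which is why the paper's regularity analysis via the lower Lipschitz constant matters: it controls how well any such inverse behaves.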
References
[speech_phaseless] Kannan Achan, Sam T. Roweis, and Brendan J. Frey. Probabilistic inference of speech signals from phaseless spectrograms. In Neural Information Processing Systems 16, pages 1393--1400. MIT Press, 2003.
[Aharon:2006:SAD:2197945.2201437] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. Trans. Sig. Proc., 54(11):4311--4322, November 2006. ISSN 1053-587X.
[1308.4718] Radu Balan and Yang Wang. Invertibility and robustness of phaseless reconstruction, 2013.
[Balan2006] Radu Balan, Pete Casazza, and Dan Edidin. On signal reconstruction without phase. Applied and Computational Harmonic Analysis, 20(3):345--356, May 2006.
[bruna_pami] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on PAMI, 2012.
[1305.6226] Jameson Cahill, Peter G. Casazza, Jesse Peterson, and Lindsey Woodland. Phase retrieval by projections, 2013.
[csensing] E. Candes and T. Tao. Near optimal signal recovery from random projections: Universal encoding strategies? ArXiv Mathematics e-prints, October 2004.
[phaselift] Emmanuel J. Candes, Thomas Strohmer, and Vladislav Voroninski. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1241--1274, 2013.
[GSalgo] R. W. Gerchberg and W. Owen Saxton. A practical algorithm for the determination of the phase from image and diffraction plane pictures. Optik, 35:237--246, 1972.
[maxout] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. ArXiv e-prints, February 2013.
[hyvarinen_topo] A. Hyvärinen and P. Hoyer. A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research, 41(18):2413--2423, August 2001. ISSN 00426989. doi: 10.1016/s0042-6989(01)00114-6.
[koray-cvpr-09] Koray Kavukcuoglu, Marc'Aurelio Ranzato, Rob Fergus, and Yann LeCun. Learning invariant features through topographic filter maps. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR'09). IEEE, 2009.
[stephane] S. Mallat. Group invariant scattering. Communications in Pure and Applied Mathematics, 2012.
[conf/nips/OhlssonYDS12] Henrik Ohlsson, Allen Y. Yang, Roy Dong, and S. Shankar Sastry. CPRL -- an extension of compressive sensing to the phase retrieval problem. In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, editors, NIPS, pages 1376--1384, 2012.
[yonina] H. Ohlsson and Y. Eldar. On conditions for uniqueness for sparse phase retrieval, 2013.
[vedaldi08vlfeat] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.
[1206.0102] Irène Waldspurger, Alexandre d'Aspremont, and Stéphane Mallat. Phase recovery, MaxCut and complex semidefinite programming, 2012.