
Learning Stable Group Invariant Representations with Convolutional Networks

Joan Bruna, Arthur Szlam and Yann LeCun, Courant Institute, New York University, New York, NY 10013

Abstract

Transformation groups, such as translations or rotations, effectively express part of the variability observed in many recognition problems. The group structure enables the construction of invariant signal representations with appealing mathematical properties, where convolutions with appropriate filters, together with pooling operators, bring stability to additive and geometric perturbations of the input. Whereas physical transformation groups are ubiquitous in image and audio applications, they do not account for all the variability of complex signal classes.

We show that the invariance properties built by deep convolutional networks can be cast as a form of stable group invariance. The network wiring architecture determines the invariance group, while the trainable filter coefficients characterize the group action. We give explanatory examples which illustrate how the network architecture controls the resulting invariance group. We also explore the principle by which additional convolutional layers induce a group factorization, enabling more abstract, powerful invariant representations.


{bruna,lecun}@cims.nyu.edu

Introduction

Many signal categories in vision and auditory problems are invariant to the action of transformation groups, such as translations, rotations or frequency transpositions. This property motivates the study of signal representations which are also invariant to the action of these transformation groups. For instance, translation invariance can be achieved with a registration or with auto-correlation measures.

Transformation groups are in fact low-dimensional manifolds, and therefore mere group invariance is in general not enough to efficiently describe signal classes. Indeed, signals may be perturbed with additive noise and also with geometrical deformations, so one can then ask for invariant representations which are stable to these perturbations. Scattering convolutional networks [1] construct locally translation invariant signal representations, with additive and geometrical stability, by cascading complex wavelet modulus operators with a lowpass smoothing kernel. By defining wavelet decompositions on any locally compact Lie Group, scattering operators can be generalized and cascaded to provide local invariance with respect to more general transformation groups [2, 3]. Although such transformation groups are present across many recognition problems, they require prior information which sometimes cannot be assumed.

Convolutional networks [4] cascade filter banks with point-wise nonlinearities and local pooling operators. By remapping the output of each layer to the input of the following one, the trainable filters implement convolution operators. We show that the invariance properties built by deep convolutional networks can be cast as a form of stable group invariance. The network wiring architecture determines the invariance group, while the trainable filter coefficients characterize the group action.

Deep convolutional architectures cascade several layers of convolutions, non-linearities and pooling. These architectures have the capacity to generate local invariance to the action of more general groups. Under appropriate conditions, these groups can be factorized as products of smaller groups. Each of these factors can then be associated with a subset of consecutive layers of the convolutional network. In these conditions, the invariance properties of the final representation can be studied from the group structure generated by each layer.

Problem statement

Stable Group Invariance

A transformation group G acts on the input space X (assumed to be a Hilbert space) with a linear group action (g, x) ↦ g.x ∈ X, which is compatible with the group operation.

A signal representation Φ : X → Z is invariant to the action of G if ∀ g ∈ G, x ∈ X, Φ(g.x) = Φ(x). However, mere group invariance is in general too weak, due to the presence of a much larger, high-dimensional variability which does not belong to the low-dimensional group. It is then necessary to incorporate the notion of outer 'deformations' with another group action ϕ : H × X → X, where H is a larger group containing G. The geometric stability can be stated with a Lipschitz continuity property

$$ \| \Phi( \varphi(h, x)) - \Phi( x) \| \leq C\, \|x\|\, k(h, G)~, \tag{1} $$

where k(h, G) measures the 'distance' from h to the invariance group G. For instance, when G is the translation group of ℝ^d and H ⊃ G is the group of C² diffeomorphisms of ℝ^d, then ϕ(h, x) = x ∘ h and one can select as distance the elastic deformation metric k(h, G) := ‖∇τ‖_∞ + ‖Hτ‖_∞, where τ(u) = h(u) − u and Hτ denotes its Hessian [2].
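To make the elastic deformation metric concrete, the sketch below (Python/NumPy; the sampling grid and the names `deformation_metric`, `tau` are illustrative, not from the paper) estimates k(h, G) = ‖∇τ‖_∞ + ‖Hτ‖_∞ for a sampled one-dimensional displacement field by finite differences. A rigid translation has a constant τ, hence a zero metric: it is indistinguishable from an element of the group.

```python
import numpy as np

def deformation_metric(tau, du=1.0):
    """Finite-difference estimate of k(h, G) = |grad tau|_inf + |H tau|_inf
    for a sampled 1-D displacement field tau(u)."""
    grad = np.gradient(tau, du)        # first derivative of the displacement
    hess = np.gradient(grad, du)       # second derivative (1-D "Hessian")
    return np.max(np.abs(grad)) + np.max(np.abs(hess))

u = np.linspace(0.0, 1.0, 1000)
rigid = np.full_like(u, 0.3)             # pure translation: tau is constant
warp = 0.01 * np.sin(2 * np.pi * u)      # gentle non-rigid warping
```

Note that the metric is homogeneous: doubling the warping doubles its distance to the group, which is exactly the proportionality that the Lipschitz bound exploits.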

Even though the group invariance formalism describes global invariance properties of the representation, it also provides a valid and useful framework to study local invariance properties. Indeed, if one replaces (1) by

$$ \| \Phi( \varphi(h, x)) - \Phi( x) \| \leq C\, \|x\| \left( \|h_G\|_G + k(h, G) \right)~, \tag{2} $$

where h_G is a projection of h onto G and ‖g‖_G is a metric on G measuring the amount of transformation being applied, then the local invariance is expressed by adjusting the proportionality between the two metrics.

Convolutional Networks

A generic convolutional network defined on a space X = L^2(Ω_0) of square-integrable signals starts with a filter bank {ψ_λ}_{λ∈Λ_1}, with ψ_λ ∈ L^1(Ω_0) for all λ, which for each input x(u) ∈ X produces the collection

$$ z^{(1)}(u, \lambda) = x \star \psi_{\lambda}(u) = \int x(u-v)\, \psi_{\lambda}(v)\, dv~, \quad u \in \Omega_0~,\ \lambda \in \Lambda_1~. $$

If the filter bank defines a stable, invertible frame, then there exist two constants a, A > 0 such that

$$ \forall x~, \quad a \|x\| \leq \|z^{(1)}\| \leq A \|x\|~, $$

where ‖z^{(1)}‖² = Σ_{λ∈Λ_1} ‖z^{(1)}(·, λ)‖². By defining Ω_1 = Ω_0 × Λ_1, the first layer of the network can be written as the linear mapping
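The frame inequality is easy to check numerically (Python/NumPy; the Gaussian bank `psi_hat`, its centers and widths are arbitrary illustrative choices, not the paper's filters). Defining the filters by their Fourier transforms, the first layer becomes a product in frequency, and by Parseval the energy ratio ‖z^{(1)}‖²/‖x‖² is pinched between the minimum and maximum of the Littlewood–Paley sum Σ_λ |ψ̂_λ(ω)|²:

```python
import numpy as np

N = 256
omega = np.fft.fftfreq(N) * N                    # integer frequencies
centers = np.arange(-N // 2, N // 2, 16)         # illustrative bank layout
psi_hat = np.exp(-((omega[None, :] - centers[:, None]) ** 2) / (2 * 8.0 ** 2))

def first_layer(x):
    """z1(lam, u) = (x * psi_lam)(u), computed as a product in Fourier."""
    return np.fft.ifft(np.fft.fft(x)[None, :] * psi_hat, axis=-1)

# Littlewood-Paley sum: a^2 <= S(omega) <= A^2 gives the (squared) frame bounds
S = np.sum(np.abs(psi_hat) ** 2, axis=0)
a2, A2 = S.min(), S.max()

x = np.random.default_rng(0).standard_normal(N)
ratio = np.sum(np.abs(first_layer(x)) ** 2) / np.sum(x ** 2)
```

The covering of the frequency axis by the bank is what keeps the lower bound a strictly positive, i.e. what makes the frame invertible and stable.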

$$ F_1 : L^2(\Omega_0) \longrightarrow L^2(\Omega_1)~, \quad F_1 x = z^{(1)}~. $$

z^{(1)} is then transformed with a point-wise nonlinear operator M : L^2(Ω) → L^2(Ω) which is usually non-expansive, meaning that ‖Mz‖ ≤ ‖z‖. Finally, a local pooling operator P can be defined as any linear or nonlinear operator P : L^2(Ω) → L^2(Ω̃) which reduces the resolution of the signal along one or more coordinates and which avoids 'aliasing'. If Ω = Ω_0 × Λ_1 × ⋯ × Λ_k and (2^{J_0}, …, 2^{J_k}) denote the loss of resolution along each coordinate, it results that Ω̃ = Ω̃_0 × Λ̃_1 × ⋯ × Λ̃_k, with |Ω̃_0| = 2^{−αJ_0}|Ω_0| and |Λ̃_i| = 2^{−αJ_i}|Λ_i|, where α is an oversampling factor. Linear pooling operators are implemented as lowpass filters φ_J(u, λ_1, …, λ_m) followed by downsampling.

Then, a k -layer convolutional network is a cascade

$$ L^2(\Omega_0) \stackrel{M \circ F_1}{\longrightarrow} L^2(\Omega_1) \stackrel{P_1}{\longrightarrow} L^2(\widetilde{\Omega}_1) \stackrel{M \circ F_2}{\longrightarrow} L^2(\Omega_2) \cdots \stackrel{P_k}{\longrightarrow} L^2(\widetilde{\Omega}_k)~, $$

which successively produces z^{(1)}, z^{(2)}, …, z^{(k)}.

The filter banks (F_i)_{i≤k}, together with the pooling operators (P_i)_{i≤k}, progressively transform the signal domain; filter bank steps lift the domain of definition by adding new coordinates, whereas pooling steps reduce the resolution along certain coordinates.
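This lift-then-pool bookkeeping can be sketched in a few lines (Python/NumPy; the modulus nonlinearity, the Gaussian filters and the pooling widths are illustrative choices, not the paper's trained filters). Each stage applies a filter bank F in Fourier, a point-wise modulus M, and average pooling P along u: one new λ coordinate appears per layer while the u resolution drops.

```python
import numpy as np

def layer(z, psi_hat, pool_width):
    """One convnet stage: filter bank (F), point-wise modulus (M),
    then average pooling (P) along the last (u) axis."""
    Z = np.fft.fft(z, axis=-1)
    lifted = np.abs(np.fft.ifft(Z[..., None, :] * psi_hat, axis=-1))  # M∘F adds a lam axis
    n = lifted.shape[-1] // pool_width
    return lifted[..., : n * pool_width].reshape(*lifted.shape[:-1], n, pool_width).mean(-1)

N = 64
omega = np.arange(N)
psi_hat = np.stack([np.exp(-((omega - c) ** 2) / 8.0) for c in (8, 16, 24)])

x = np.random.default_rng(1).standard_normal(N)
z1 = layer(x, psi_hat, 4)            # axes (lam1, u/4): shape (3, 16)
z2 = layer(z1, psi_hat[:, :16], 4)   # axes (lam1, lam2, u/16): shape (3, 3, 4)
```

The shapes make the domain transformations explicit: each layer multiplies the λ coordinates and divides the u resolution, exactly the lift/reduce alternation described above.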

Invariance Properties of Convolutional Networks

The case of one-parameter transformation groups

Let us start by assuming the simplest form of variability produced by a transformation group. A one-parameter transformation group is a family {U_t}_{t∈ℝ} of unitary linear operators of L^2(Ω) such that (i) t ↦ U_t is strongly continuous: lim_{t→t_0} U_t z = U_{t_0} z for every z ∈ L^2(Ω), and (ii) U_{t+s} = U_t U_s. One-parameter transformation groups are thus homeomorphic to ℝ (with addition as the group operation), and define an action which is continuous in the group variable. Uni-dimensional translations U_t x(u) = x(u − t v_0), frequency transpositions U_t x = F^{−1}(F x(ω − t ω_0)) (where F and F^{−1} are respectively the forward and inverse Fourier transforms) or unitary dilations U_t x(u) = 2^{−t/2} x(2^{−t} u) are examples of one-parameter transformation groups.
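The defining properties are easy to verify numerically on the translation example (Python/NumPy; circular shifts via `np.roll` stand in for the continuum action, an illustrative discretization):

```python
import numpy as np

def U(t, z):
    """Discretized translation U_t z(u) = z(u - t), acting by circular shift."""
    return np.roll(z, t)

z = np.random.default_rng(2).standard_normal(64)
```

On this grid the group law U_{t+s} = U_t U_s holds exactly, and each U_t is unitary since a shift merely permutes coordinates.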

One-parameter transformation groups are particularly simple to study thanks to Stone's theorem [5], which states that unitary one-parameter transformation groups are uniquely generated by a complex exponential of a self-adjoint operator:

$$ U_t = e^{itA}~, \quad t \in \mathbb{R}~. $$

Here, the complex exponential of a self-adjoint operator should be interpreted in terms of its spectrum. In the finite-dimensional case (when Ω is discrete), this means that there exists an orthogonal transform O such that if ẑ(ω) = Oz, then

$$ \forall z~, \quad U_t z = O^{-1}\, \mathrm{diag}(e^{i t \omega})\, \hat{z}(\omega)~. $$

In other words, the group action can be expressed as a linear phase change in the basis which diagonalizes the unique self-adjoint operator A given by Stone's theorem. In the particular case of translations, the change of basis O is given by the Fourier transform. As a result, one can obtain a representation which is invariant to the action of {U_t}_t with a single layer of a neural network: a linear decomposition which expresses the data in the basis given by O, followed by a point-wise complex modulus. In the case of the translation group, this corresponds to taking the modulus of the Fourier transform.
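For the translation group this one-layer recipe is a two-line check (Python/NumPy, illustrative): the DFT plays the role of O, a circular shift multiplies each ẑ(ω) by a phase, and the point-wise modulus removes that phase.

```python
import numpy as np

def fourier_modulus(x):
    """One-layer invariant: express x in the diagonalizing basis O
    (here the DFT) and apply a point-wise complex modulus."""
    return np.abs(np.fft.fft(x))

x = np.random.default_rng(3).standard_normal(128)
shifted = np.roll(x, 17)    # group action: circular translation by 17 samples
```

The two signals differ sample-by-sample, yet their Fourier moduli coincide, which is the invariance claimed above.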

Presence of deformations

Stone's theorem provides a recipe for global group invariance for strongly continuous group actions. Without noise or deformations, an invariant representation can be obtained by taking complex moduli in a basis which diagonalizes the group action, which can be implemented in a shallow one-layer architecture. However, the underlying low-dimensional assumption is rarely satisfied, due to the presence of more complex forms of variability.

This complex variability can be modeled as follows. If O is the basis which diagonalizes a given one-parameter group, then the group action is expressed in the basis F^{−1}O as the translation operator T_s z(u) = z(u − s). Whereas the group action consists of rigid translations in this basis, by analogy a deformation is defined as a non-rigid warping in this domain: L_τ z(u) = z(u − τ(u)), where τ is a displacement field along the indexes of the decomposition.

The amount of deformation can be measured with the regularity of τ ( u ) , which controls how distant the warping is from being a rigid translation and hence an element of the group. This suggests that, in order to obtain stability to deformations, rather than looking for eigenvectors of the infinitesimal group action, one should look for linear measurements which are well localized in the domain where deformations occur, and which nearly diagonalize the group action. In particular, these measurements can be implemented with convolutions using compactly supported filters, such as in convolutional networks.

Let z^{(n)}(u, λ_1, …, λ_n) be an intermediate representation in a convolutional network whose first layer is fully connected. Suppose that G is a group acting on z via

$$ g.z^{(n)}(u, \lambda_1, \dots, \lambda_n) = z^{(n)}(u, \lambda_1 + \eta(g), \lambda_2, \dots, \lambda_n)~, $$

where η : G → Λ_1. This corresponds to the idealized case where the transformation only modifies one component of the representation. A local pooling operator along the variable λ_1, at a certain scale 2^J, attenuates the transformation by g as soon as |η(g)| ≪ 2^J. It thus produces local invariance with respect to the action of G.
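A small numerical illustration (Python/NumPy; the Gaussian feature profile and the pooling scale are arbitrary choices): when a feature is localized well inside a pooling window of width 2^J = 32, a displacement η(g) = 2 ≪ 2^J barely moves the pooled coefficients, while the unpooled coefficients change substantially. Since averaging is a sup-norm contraction, the pooled error can never exceed the raw one.

```python
import numpy as np

def pool(z, width):
    """Average pooling along the last axis at scale `width` (= 2^J)."""
    n = z.shape[-1] // width
    return z[..., : n * width].reshape(*z.shape[:-1], n, width).mean(axis=-1)

lam = np.arange(256.0)
z = np.exp(-((lam - 80.0) ** 2) / (2 * 4.0 ** 2))  # feature well inside one window
shifted = np.roll(z, 2)                            # group displacement eta(g) = 2

err_raw = np.max(np.abs(z - shifted))                          # before pooling
err_pooled = np.max(np.abs(pool(z, 32) - pool(shifted, 32)))   # after pooling
```

The pooled representation is locally invariant: only the small amount of mass that crosses a window boundary survives in the difference.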

Group Factorization with Deep Networks

Deep convolutional networks have the capacity to learn complex relationships of the data and to build invariance with respect to a large family of transformations. These properties can be partly explained in terms of a factorization of the invariance groups performed successively.

Whereas pooling operators efficiently produce stable local invariance, convolution operators preserve the invariance generated by previous layers. Indeed, suppose z^{(n)}(u, λ_1) is an intermediate representation in a convolutional network, and that G acts on z^{(n)} via g.z^{(n)}(u, λ_1) = z^{(n)}(f(g, u), λ_1 + η(g)). It follows that if the next layer is constructed as

$$ z^{(n+1)}(u, \lambda_1, \lambda_2) = z^{(n)}(u, \cdot) \star \psi_{\lambda_2}\,(\lambda_1)~, \quad \lambda_2 \in \Lambda_2~, $$

then G acts on z^{(n+1)} via g.z^{(n+1)}(u, λ_1, λ_2) = z^{(n+1)}(f(g, u), λ_1 + η(g), λ_2), since convolutions commute with the group action, which by construction is expressed as a translation in the coefficients λ_1. The new coordinates λ_2 are thus unaffected by the action of G.
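This commutation is exact for circular convolutions and can be verified numerically (Python/NumPy; the random features and filters are illustrative, with the action of g taken as a shift of λ_1 by η(g) = 5):

```python
import numpy as np

def next_layer(z, filt):
    """Convolve z(u, lam1) along lam1 with each filter psi_{lam2};
    output has axes (u, lam2, lam1). Circular convolution via FFT."""
    Z = np.fft.fft(z, axis=-1)
    F = np.fft.fft(filt, axis=-1)
    return np.real(np.fft.ifft(Z[:, None, :] * F[None, :, :], axis=-1))

rng = np.random.default_rng(4)
z = rng.standard_normal((4, 32))      # z^(n): axes (u, lam1)
filt = rng.standard_normal((3, 32))   # filter bank indexed by lam2
eta = 5                               # displacement eta(g) along lam1

act_then_convolve = next_layer(np.roll(z, eta, axis=-1), filt)
convolve_then_act = np.roll(next_layer(z, filt), eta, axis=-1)
```

Acting with g before or after the new convolutional layer yields the same coefficients, and the new λ_2 axis is untouched by the shift.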

As a consequence, this property enables a systematic procedure to generate invariance to groups of the form G = G_1 ⋊ G_2 ⋊ ⋯ ⋊ G_s, where H_1 ⋊ H_2 denotes the semidirect product of groups. In this decomposition, each factor G_i is associated with a range of convolutional layers, along the coordinates where the action of G_i is perceived.

Perspectives

The connections between group invariance and deep convolutional networks offer an interpretation of their efficiency on several recognition tasks. In particular, they might explain why the weight sharing induced by convolutions is a valid regularization method in the presence of group variability.

More concretely, we shall also concentrate on the following aspects:

· Group Discovery. One might ask for the group of transformations which best explains the variability observed in a given dataset {x_i}. In the case where no geometric deformations are present, one can start by learning the (complex) eigenvectors of the group action:

$$ U^{*} = \arg\min_{U^{T}U = \mathbf{1}} \mathrm{var}(|U x_i|)~. $$

When the data corresponds to a uniform measure on the group, then this decomposition can be obtained from the diagonalization of the covariance operator Σ = E(x x^T). In that case, the real eigenvectors of Σ are grouped into pairs of vectors with identical eigenvalue, which then define the complex decomposition diagonalizing the group action.
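This can be checked exactly on a toy dataset (Python/NumPy; the random template and its full translation orbit are illustrative): for the orbit of a template under circular translations with uniform measure, Σ is circulant, the DFT diagonalizes it, its eigenvalues are the power spectrum of the template, and (since the template is real) the non-trivial eigenvalues come in equal pairs.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 32
template = rng.standard_normal(N)
# Uniform measure on the translation group: the full orbit of the template
X = np.stack([np.roll(template, s) for s in range(N)])
Sigma = X.T @ X / N                    # empirical covariance E(x x^T)

eigvals = np.sort(np.linalg.eigvalsh(Sigma))
spectrum = np.abs(np.fft.fft(template)) ** 2 / N   # DFT eigenvalues of Sigma
```

The paired eigenvalues correspond to the cosine/sine pair at each frequency, i.e. the real and imaginary parts of the complex eigenvector e^{iωu} diagonalizing the translation action.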

In the presence of deformations, the global invariance is replaced by a measure of local invariance. This problem is closely related to the sparse coding with slowness of [6].

· Structured Convolutional Networks. Groups offer a powerful framework to incorporate structure into the families of filters, similarly to [7]. On the one hand, one can enforce global properties of the group by defining the convolutions accordingly. For instance, by wrapping the domain of the convolution, one enforces a periodic group to emerge. On the other hand, one could further regularize the learning by enforcing a group structure within a filter bank. For instance, one could ask a certain filter bank F = {h_1, …, h_n} to have the form F = {R_θ h_0}_θ, where R_θ is a rotation by an angle θ.
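A minimal sketch of such a structured bank (Python/NumPy; the four-element group of 90° rotations via `np.rot90` is an illustrative choice, since exact rotation on the sampling grid avoids interpolation): the bank is the orbit of a single base filter h_0 under the group.

```python
import numpy as np

rng = np.random.default_rng(6)
h0 = rng.standard_normal((7, 7))              # base filter (stand-in for a learned one)
# Structured bank F = { R_theta h0 }, theta in {0, 90, 180, 270} degrees
bank = [np.rot90(h0, k) for k in range(4)]
```

Only h_0 would need to be learned; the rest of the bank is generated by the group, which is precisely the regularization suggested above.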



References

[1] J. Bruna, S. Mallat, "Invariant Scattering Convolutional Networks", IEEE TPAMI, 2012.

[2] S. Mallat, "Group Invariant Scattering", CPAM, 2012.

[3] L. Sifre, S. Mallat, "Combined Scattering for Rotation Invariant Texture Analysis", ESANN, 2012.

[4] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, "Gradient-Based Learning Applied to Document Recognition", Proceedings of the IEEE, 1998.

[5] M. H. Stone, "On One-Parameter Unitary Groups in Hilbert Space", Annals of Mathematics, 1932.

[6] C. Cadieu, B. Olshausen, "Learning Transformational Invariants from Natural Movies", NIPS, 2009.

[7] K. Gregor, A. Szlam, Y. LeCun, "Structured Sparse Coding via Lateral Inhibition", NIPS, 2011.
