Convolutional Matching Pursuit and Dictionary Training
Arthur Szlam, Koray Kavukcuoglu, and Yann LeCun
Introduction
One of the most successful recent signal processing paradigms has been the sparse coding/dictionary design model [8, 4]. In this model, we try to represent a given d × n data matrix X of n points in R^d, written as columns, via a solution to the problem

$$
\{W^*,Z^*\}=\{W^*(K,X,q),Z^*(K,X,q)\}=\arg\min_{Z\in\mathbb{R}^{K\times n},\,W\in\mathbb{R}^{d\times K}}\sum_k||Wz_k-x_k||^2,\quad ||z_k||_0\leq q, \tag{1.1}
$$
or its Z coordinate convexification
$$
\{W^*,Z^*\}=\arg\min_{Z\in\mathbb{R}^{K\times n},\,W\in\mathbb{R}^{d\times K}}\sum_k||Wz_k-x_k||^2+\lambda||z_k||_1. \tag{1.2}
$$
Here, {W, Z} are the dictionary and the coefficients, respectively, and z_k is the kth column of Z. K, q, and λ are user-selected parameters controlling the power of the model.
More recently, many models with additional structure have been proposed. For example, in [9, 2], the dictionary elements are arranged in groups and the sparsity is on the group level. In [3, 5, 7], the dictionaries are constructed to be translation invariant. In the former work, the dictionary is constructed via a non-negative matrix factorization. In the latter two works, the construction is a convolutional analogue of 1.2 or an l^p variant, with 0 < p < 1. In this short note we work with greedy algorithms for solving the convolutional analogues of 1.1. Specifically, we demonstrate that sparse coding by matching pursuit and dictionary learning via K-SVD [1] can be used in the translation invariant setting.
Matching Pursuit
Matching pursuit [6] is a greedy algorithm for the solution of the sparse coding problem

$$
\min_z ||Wz-x||^2,\quad ||z||_0\leq q,
$$

where the d × k matrix W is the dictionary, the k × 1 vector z is the code, and x is a d × 1 data vector.
- Set e = x, and z the k-dimensional zero vector.
- Find j = arg max_i |W_i^T e|.
- Set a = W_j^T e.
- Set e ← e − aW_j, and z_j = z_j + a.
- Repeat for q steps.
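As a sketch, the loop above in NumPy (illustrative only, not the authors' code; it assumes the columns of W have unit norm, so the optimal coefficient for the selected atom is simply its correlation with the residual):

```python
import numpy as np

def matching_pursuit(W, x, q):
    """Greedy sparse coding of x over the dictionary W with q atoms.

    W: (d, k) dictionary with unit-norm columns; x: (d,) data vector.
    Returns the (k,) code z such that W @ z approximates x.
    """
    e = x.astype(float)             # residual, e = x initially
    z = np.zeros(W.shape[1])
    for _ in range(q):
        corr = W.T @ e              # correlations of all atoms with the residual
        j = np.argmax(np.abs(corr)) # best matching atom
        a = corr[j]                 # its coefficient
        e -= a * W[:, j]            # subtract the contribution from the residual
        z[j] += a                   # accumulate into the code
    return z
```

With an orthonormal W and q at least the number of nonzeros in the true code, this recovers x exactly.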
Note that with a bit of bookkeeping, it is only necessary to multiply W against x once, instead of q times. This comes at the cost of an extra O(K^2) storage: let e_r and a_r be e and a from the rth step above, and j_r the selected index. Then:

$$
W^Te_0=W^Tx;
$$

$$
W^Te_1=W^Te_0-a_1W^TW_{j_1};
$$

$$
W^Te_2=W^Te_1-a_2W^TW_{j_2};
$$

and so on. If the Gram matrix W^TW is stored, each update is just a column lookup.
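In code, the bookkeeping looks like this (a sketch under the same unit-norm assumption; W is applied to x once, and each subsequent step only reads a column of the precomputed Gram matrix):

```python
import numpy as np

def matching_pursuit_gram(W, x, q):
    """Matching pursuit that multiplies W against x only once.

    The correlations c = W^T e are maintained incrementally:
    c <- c - a_r * (W^T W)[:, j_r] at step r, a lookup in the Gram matrix.
    """
    G = W.T @ W                  # K x K Gram matrix; reusable across data vectors
    c = W.T @ x                  # W^T e_0 = W^T x, the only multiply by W
    z = np.zeros(W.shape[1])
    for _ in range(q):
        j = np.argmax(np.abs(c))
        a = c[j]
        c -= a * G[:, j]         # update W^T e without touching the residual
        z[j] += a
    return z
```

This returns the same codes as the plain version while avoiding q full applications of W.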
Convolutional MP
We consider the special case

$$
W\overline{z}=\sum_j w_j * z_j,
$$

where each w_j is a filter, and z̄ collects all of the response maps z_j.
Note that the Gram matrix of the 'Toeplitz' dictionary consisting of all the shifts of the w_j is usually too big to be used as a lookup table. However, because of the symmetries of the convolution, it is also unnecessary; we only need store a 4h_f × w_f × k^2 array of inner products, where h_f and w_f are the dimensions of the filters.
With this additional storage, running q matching pursuit steps with k filters on an h × w image costs one application of the filter bank plus O(kqhw) operations.
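A minimal 1-D sketch of convolutional matching pursuit (illustrative only: the note works with 2-D images, and this version recomputes all filter responses at each step instead of using the inner-product table above; unit-norm filters are assumed):

```python
import numpy as np

def conv_matching_pursuit(filters, x, q):
    """Greedy convolutional coding of a 1-D signal x.

    filters: list of 1-D unit-norm arrays. Each step selects the
    (filter, shift) pair with the largest response and subtracts the
    shifted filter from the residual. Returns a list of
    (filter index, shift, coefficient) triples and the final residual.
    """
    e = x.astype(float)
    n = len(x)
    coeffs = []
    for _ in range(q):
        best = None
        for j, w in enumerate(filters):
            m = len(w)
            # responses of filter j at every valid shift (cross-correlation)
            resp = np.array([w @ e[p:p + m] for p in range(n - m + 1)])
            p = int(np.argmax(np.abs(resp)))
            if best is None or abs(resp[p]) > abs(best[2]):
                best = (j, p, resp[p])
        j, p, a = best
        e[p:p + len(filters[j])] -= a * filters[j]   # remove the contribution
        coeffs.append((j, p, a))
    return coeffs, e
```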
Learning the filters
Given a set of x , we can learn the filters and the codes simultaneously. Several methods are available. A simple one is to alternate between updating the codes and updating the filters, as in K-SVD [1]:
1. Initialize k h_f × w_f filters {w_1, ..., w_k}.
2. Solve for z as above.
3. For each filter w_j:
   - find all locations in all the data images where w_j is activated;
   - extract the h_f × w_f patch E_p from the reconstruction via z at each activated point p;
   - remove the contribution of w_j from each E_p (i.e. E_p ← E_p − c_(p,j) w_j, where c_(p,j) was the activation determined by z);
   - update w_j ← PCA(E_p).
4. Repeat from step 2 for a fixed number of iterations.
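Step 3's PCA update can be read as taking the first principal direction of the stacked activated patches. A sketch of that reading (hypothetical helper; the patch-matrix layout, one flattened E_p per row without mean-centering, is an assumption):

```python
import numpy as np

def update_filter(patches):
    """Update one filter from the patches it should reconstruct.

    patches: (num_activations, h_f * w_f) array, one flattened E_p per
    row. Returns the unit-norm first principal direction, i.e. the
    single filter that best explains the patches in least squares.
    """
    # first right singular vector of the (uncentered) patch matrix
    _, _, Vt = np.linalg.svd(patches, full_matrices=False)
    return Vt[0]
```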
We note that the forward subproblem (finding Z with W fixed) is not convex, and so the alternation is not guaranteed to decrease the energy or to converge even to a local minimum. However, in practice, on image and audio data, this method generates good filters.
Some experiments
We train filters on three data sets: the AT&T face database, the motorcycles from a Caltech database, and the PASCAL VOC database. For all the images in all our experiments, we perform an additive contrast normalization: each image x is transformed into x' = x − x ∗ b, where b is a 5 × 5 averaging box filter. This is very nearly the same as using x' = ∇²x, that is, the discrete Laplacian of the image instead of the image. Using the Laplacian would correspond to using the energy

$$
\sum_x \left|\left|\nabla\left(\sum_j w_j*z_j-x\right)\right|\right|^2,
$$

that is, the energy sees the difference between gradients, not intensities.
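The normalization x' = x − x ∗ b can be written directly (a naive loop for clarity; zero-padding at the borders is an implementation choice the text does not specify):

```python
import numpy as np

def contrast_normalize(img, size=5):
    """Additive contrast normalization: subtract the local box-filter
    mean from every pixel, i.e. x' = x - x * b with b a size x size
    averaging filter. Borders are handled by zero-padding."""
    h, w = img.shape
    r = size // 2
    padded = np.pad(img.astype(float), r)   # zero-pad so every pixel has a window
    out = img.astype(float)
    for i in range(h):
        for j in range(w):
            out[i, j] -= padded[i:i + size, j:j + size].mean()
    return out
```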
Faces
The AT&T face database, available at http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html, is a set of 400 images of 40 individuals. The faces are centered in each image.
We resize each image to 64 × 64 and contrast normalize. We train 8 filters of size 16 × 16. After training the filters we compute the feature maps of each image in the database, obtaining a new set of 400 eight-channel images. We take the elementwise absolute value of each of the 8-channel images, and then average pool over 8 × 8 blocks. We then train a new 16 element dictionary on the subsampled images. In figure 1 we display the first level filters, and the second level filters, up to shifts of size 8 and sign changes of the first level filters.
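The rectify-and-pool step between the two layers can be sketched as follows (assuming, for simplicity, map dimensions that divide evenly by the pooling block):

```python
import numpy as np

def abs_avg_pool(fmap, block=8):
    """Elementwise absolute value followed by average pooling over
    non-overlapping block x block regions of a single feature map."""
    a = np.abs(fmap)
    h, w = a.shape
    return a.reshape(h // block, block, w // block, block).mean(axis=(1, 3))
```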


Figure 1: First and second layer filters from faces

Figure 2: A contrast normalized face, and its reconstruction from 40 filter responses.
Caltech motorcycles
We also train on the motorbikes-side dataset, available at http://www.vision.caltech.edu/html-files/archive.html, which consists of color images of various motorcycles. The motorcycles are centered in each image. We convert each image to gray level, resize to 64 × 64, and contrast normalize. We train 8 filters of size 16 × 16. As before, we then train a new 16 element dictionary on the subsampled absolute value rectified responses of the first level. In figure 3 we display the first level filters, and the second level filters, up to shifts of size 8 and sign changes of the first level filters.


Figure 3: First and second layer filters from motorcycles

Figure 4: A contrast normalized motorcycle, and its reconstruction from 40 filter responses.


Images from PASCAL VOC
We also show results trained on 'unclassified' natural images from the PASCAL visual object challenge dataset, available at http://pascallin.ecs.soton.ac.uk/challenges/VOC/. We randomly subsample 5000 grayscaled images by a factor of 1 to 4, then pick from each image a 64 × 64 patch, and contrast normalize. We train 8 filters of size 8 × 8. We then train a new 64 element dictionary of 4 × 4 filters on the subsampled absolute value rectified responses of the first level. In figure 5 we display the first level filters, and the second level filters, up to shifts of size 8 and sign changes of the first level filters.
In order to show the dependence of the filters on the number of filters used, in figure 6 we display 8, 16, and 64 element dictionaries of 16 × 16 filters trained on the same set as above.
Figure 5: First and second layer filters from natural images
Figure 6: Dictionaries with varying numbers of elements trained on natural images.
References
[1] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.
[2] Francis R. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.
[3] Sven Behnke. Discovering hierarchical speech features using convolutional non-negative matrix factorization. In IJCNN, pages 7–12, 2008.
[4] Alfred M. Bruckstein, David L. Donoho, and Michael Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34–81, 2009.
[5] K. Kavukcuoglu, P. Sermanet, Y. Boureau, K. Gregor, M. Mathieu, and Y. LeCun. Learning convolutional feature hierarchies for visual recognition. In Advances in NIPS, 2010.
[6] Stephane Mallat and Zhifeng Zhang. Matching pursuit with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41:3397–3415, 1993.
[7] Matthew Zeiler, Dilip Krishnan, Graham Taylor, and Rob Fergus. Hierarchical convolutional sparse image decomposition. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, June 2010.
[8] B. Olshausen and D. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? 1997.
[9] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67, 2006.