Unsupervised Feature Learning from Temporal Data
Ross Goroshin$^1$, Joan Bruna$^{1,2}$, Jonathan Tompson$^1$, David Eigen$^1$, Yann LeCun$^{1,2}$
$^1$Courant Institute of Mathematical Sciences, 719 Broadway 12$^{th}$ Floor, New York, NY 10003
$^2$Facebook AI Research, 770 Broadway, New York, NY
Current state-of-the-art classification and detection algorithms rely on supervised training. In this work we study unsupervised feature learning in the context of temporally coherent video data. We focus on feature learning from unlabeled video data, using the assumption that adjacent video frames contain semantically similar information. This assumption is exploited to train a convolutional pooling auto-encoder regularized by slowness and sparsity. We establish a connection between slow feature learning and metric learning, and show that the trained encoder can be used to define a more temporally and semantically coherent metric.
Our main assumption is that data samples that are temporal neighbors are also likely to be neighbors in the latent space. For example, adjacent frames in a video sequence are more likely to be semantically similar than non-adjacent frames. This assumption naturally leads to the slowness prior on features, which was introduced in slow feature analysis (SFA) (Wiskott & Sejnowski (2002)).
Temporal coherence can be exploited by assuming a prior on the features extracted from the temporal data sequence. One such prior is that the features should vary slowly with respect to time. In the discrete-time setting this prior corresponds to minimizing an $L^p$ norm of the difference of feature vectors for temporally adjacent inputs. Consider a video sequence with $T$ frames; if $z_t$ denotes the feature vector extracted from the frame at time $t$, then the slowness prior corresponds to minimizing $\sum_{t=1}^{T} \|z_t - z_{t-1}\|_p$. To avoid the degenerate solution $z_t = z_0$ for $t = 1 \ldots T$, a second term is introduced which encourages data samples that are not temporal neighbors to be separated by at least a distance of $m$ units in feature space, where $m$ is known as the margin. In the temporal setting this corresponds to minimizing $\max(0, m - \|z_t - z_{t'}\|_p)$, where $|t - t'| > 1$. Together the two terms form the loss function introduced in Hadsell et al. (2006) as a dimension-reduction and data-visualization algorithm known as DrLIM. Assume that there is a differentiable mapping from input space to feature space which operates on individual temporal samples. Denote this mapping by $G$, and assume it is parametrized by a set of trainable coefficients denoted by $W$; that is, $z_t = G_W(x_t)$. The per-sample loss function is given by Equation 1.
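The two-branch loss described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the function name and arguments are hypothetical, and the feature vectors stand in for encoder outputs $G_W(x_t)$:

```python
import numpy as np

def drlim_loss(z_t, z_tp, temporal_neighbors, m=1.0, p=2):
    """Per-sample DrLIM-style temporal loss on a pair of feature vectors.

    z_t, z_tp:          feature vectors for frames at times t and t'.
    temporal_neighbors: True if |t - t'| == 1, False otherwise.
    m:                  margin separating non-neighboring samples.
    """
    d = np.linalg.norm(z_t - z_tp, ord=p)
    if temporal_neighbors:
        return d                # slowness: pull temporal neighbors together
    return max(0.0, m - d)      # contrastive: push non-neighbors at least m apart
```

For temporal neighbors the loss is simply the distance between the features; for non-neighbors it is zero once the pair is separated by more than the margin, so only violating pairs contribute a gradient.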
The second, contrastive term in Equation 1 only acts to avoid the degenerate solution in which $G_W$ is a constant mapping; it does not guarantee that the resulting feature space is informative with respect to the input. This discriminative criterion depends only on pairwise distances in the representation space, which is a geometrically weak notion in high dimensions. We propose to replace the contrastive term with a term that penalizes the reconstruction error of both data samples. Introducing a reconstruction term not only prevents the constant solution but also acts to explicitly preserve information about the input. This is a useful property of features obtained by unsupervised learning: since the task to which these features will be applied is not known a priori, we would like to preserve as much information about the input as possible.
What is the optimal architecture of $G_W$ for extracting slow features? Slow features are invariant to temporal changes by definition. In natural video and on small spatial scales these changes mainly correspond to local translations and deformations. Invariance to such changes can be achieved using appropriate pooling operators (Bruna & Mallat (2013); LeCun et al. (1998)). Such operators are at the heart of deep convolutional networks (ConvNets), currently the most successful supervised feature learning architectures (Krizhevsky et al. (2012)). Inspired by these observations, let $G_{W_e}$ be a two-stage encoder comprising a learned, generally over-complete, linear map ($W_e$) and rectifying nonlinearity $f(\cdot)$, followed by a local pooling. Let the $N$ hidden activations, $h = f(W_e x)$, be subdivided into $K$ potentially overlapping neighborhoods denoted by $P_i$. Note that biases are absorbed by expressing the input $x$ in homogeneous coordinates. Feature $z_i$ produced by the encoder for the input at time $t$ can be expressed as $G^i_{W_e}(t) = \|h_t\|^{P_i}_p = \left( \sum_{j \in P_i} h_{tj}^p \right)^{1/p}$. Training through a local pooling operator enforces a local topology on the hidden activations, inducing units that are pooled together to learn complementary features. In the following experiments we use $p = 2$. Although it has recently been shown that it is possible to recover the input when $W_e$ is sufficiently redundant, reconstructing from these pooled coefficients corresponds to solving a phase recovery problem (Bruna et al. (2014)), which is not possible with a simple inverse mapping such as a linear map $W_d$. Instead of reconstructing from $z$, we reconstruct from the hidden representation $h$. This is the same approach taken when training group-sparse auto-encoders (Kavukcuoglu et al. (2009)). To promote sparse activations in the case of over-complete bases, we additionally add a sparsifying $L_1$ penalty on the hidden activations. Including the rectifying nonlinearity becomes critical for learning sparse inference in a hugely redundant dictionary, e.g. convolutional dictionaries (Gregor & LeCun (2010)). The complete loss functional is given by Equation 2.
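The encoder and the complete loss can be sketched as follows. This is an illustrative NumPy fragment under stated assumptions (ReLU as $f(\cdot)$, explicit pooling index sets $P_i$, and hypothetical function names); it is not the authors' code:

```python
import numpy as np

def encoder(x, W_e, pools, p=2):
    """Two-stage encoder: rectified linear map followed by local pooling.

    W_e:   (over-complete) encoding matrix; biases absorbed via homogeneous x.
    pools: list of index arrays P_i defining potentially overlapping neighborhoods.
    """
    h = np.maximum(0.0, W_e @ x)                                  # h = f(W_e x)
    z = np.array([np.sum(h[P] ** p) ** (1.0 / p) for P in pools])  # pooled features
    return h, z

def loss(x_t, x_tp, W_e, W_d, pools, alpha=0.1, beta=0.1):
    """Complete loss: reconstruction from h + L1 sparsity + slowness on pooled z."""
    h_t, z_t = encoder(x_t, W_e, pools)
    h_tp, z_tp = encoder(x_tp, W_e, pools)
    recon = sum(np.sum((W_d @ h - x) ** 2)
                for h, x in [(h_t, x_t), (h_tp, x_tp)])   # decode from h, not z
    sparsity = alpha * (np.sum(np.abs(h_t)) + np.sum(np.abs(h_tp)))
    slowness = beta * np.sum(np.abs(z_t - z_tp))          # slow pooled features
    return recon + sparsity + slowness
```

Note that the decoder $W_d$ acts on the hidden representation $h$ rather than the pooled features $z$, sidestepping the phase recovery problem, while the slowness penalty acts only on $z$.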
Figure 2 shows a convolutional version of the proposed architecture and loss. By replacing all linear operators in our model with convolutional filter banks and including spatial pooling, translation invariance need not be learned (LeCun et al. (1998)). In all other respects the convolutional model is conceptually identical to the fully connected model described in the previous section.
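A convolutional analogue of the encoder can be sketched as below. This is a plain-NumPy illustration with hypothetical names and arbitrary shape choices (valid cross-correlation, ReLU, non-overlapping $L \times L$ spatial pooling), not the paper's implementation:

```python
import numpy as np

def conv_encode(x, filters, pool=2, p=2):
    """Convolutional encoder: filter bank + rectification + spatial L_p pooling.

    x:       2-D input frame.
    filters: stack of 2-D kernels playing the role of the convolutional W_e.
    pool:    side length of the non-overlapping spatial pooling window.
    """
    kh, kw = filters.shape[1:]
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    maps = np.empty((len(filters), oh, ow))
    for k, f in enumerate(filters):           # valid cross-correlation per filter
        for i in range(oh):
            for j in range(ow):
                maps[k, i, j] = np.sum(x[i:i + kh, j:j + kw] * f)
    h = np.maximum(0.0, maps)                 # rectification
    ph, pw = oh // pool, ow // pool           # pooled output resolution
    z = h[:, :ph * pool, :pw * pool].reshape(len(filters), ph, pool, pw, pool)
    z = np.sum(z ** p, axis=(2, 4)) ** (1.0 / p)   # L_p norm within each window
    return h, z
```

The slowness and sparsity penalties then apply to the pooled maps `z` and hidden maps `h` exactly as in the fully connected case.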
Figure 2(a): Block diagram of the Siamese convolutional model trained on pairs of frames.
$$ L(x_{t}, x_{t'}, W) = \begin{cases} \|G_W(x_t) - G_W(x_{t'})\|_p, & \text{if } |t - t'| = 1 \\ \max\left(0,\, m - \|G_W(x_t) - G_W(x_{t'})\|_p\right), & \text{if } |t - t'| > 1 \end{cases} \tag{1} $$
$$ L(x_{t}, x_{t'}, W) = \sum_{\tau \in \{t, t'\}} \left( \|W_d h_\tau - x_\tau\|^2 + \alpha \|h_\tau\|_1 \right) + \beta \sum_{i=1}^{K} \left| \|h_t\|^{P_i}_p - \|h_{t'}\|^{P_i}_p \right| \tag{2} $$
References
Bengio, Yoshua, Courville, Aaron C., and Vincent, Pascal. Representation learning: A review and new perspectives. Technical report, University of Montreal, 2012.
Bromley, Jane, Bentz, James W., Bottou, Léon, Guyon, Isabelle, LeCun, Yann, Moore, Cliff, Säckinger, Eduard, and Shah, Roopak. Signature verification using a "siamese" time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7(04):669–688, 1993.
Bruna, Joan and Mallat, Stéphane. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.
Bruna, Joan, Szlam, Arthur, and LeCun, Yann. Signal recovery from pooling representations. In ICML, 2014.
Cadieu, Charles F. and Olshausen, Bruno A. Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 2012.
Goodfellow, Ian J., Warde-Farley, David, Mirza, Mehdi, Courville, Aaron, and Bengio, Yoshua. Maxout networks. In ICML, 2013.
Goroshin, Rostislav and LeCun, Yann. Saturating auto-encoders. In ICLR, 2013.
Gregor, Karol and LeCun, Yann. Learning fast approximations of sparse coding. In ICML, 2010.
Hadsell, Raia, Chopra, Sumit, and LeCun, Yann. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
Hyvärinen, Aapo, Karhunen, Juha, and Oja, Erkki. Independent component analysis, volume 46. John Wiley & Sons, 2004.
Hyvärinen, Aapo, Hurri, Jarmo, and Väyrynen, Jaakko. Bubbles: a unifying framework for low-level statistical properties of natural image sequences. JOSA A, 20(7):1237–1252, 2003.
Kavukcuoglu, Koray, Ranzato, Marc'Aurelio, Fergus, Rob, and LeCun, Yann. Learning invariant features through topographic filter maps. In CVPR, 2009.
Kayser, Christoph, Einhäuser, Wolfgang, Dümmer, Olaf, König, Peter, and Körding, Konrad. Extracting slow subspaces from natural videos leads to complex cells. In ICANN, 2001.
Krizhevsky, Alex. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, April 2009.
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Lies, Jorn-Philipp, Hafner, Ralf M., and Bethge, Matthias. Slowness and sparseness have diverging effects on complex cell learning. PLoS Computational Biology, 10, 2014.
Mobahi, Hossein, Collobert, Ronan, and Weston, Jason. Deep learning from temporal coherence in video. In ICML, 2009.
Rifai, Salah, Vincent, Pascal, Muller, Xavier, Glorot, Xavier, and Bengio, Yoshua. Contractive auto-encoders: Explicit invariance during feature extraction. In ICML, 2011.
Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, and Manzagol, Pierre-Antoine. Extracting and composing robust features with denoising autoencoders. Technical report, University of Montreal, 2008.
Wiskott, Laurenz and Sejnowski, Terrence J. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 2002.
Zou, Will, Zhu, Shenghuo, Yu, Kai, and Ng, Andrew Y. Deep learning of invariant features via simulated fixations in video. In NIPS, pp. 3212–3220, 2012.