
Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation

Remi Denton$^1$, Wojciech Zaremba$^1$, Joan Bruna$^1$, Yann LeCun$^{1,2}$ and Rob Fergus$^{1,2}$, $^1$Dept. of Computer Science, Courant Institute, New York University, $^2$Facebook AI Research

Abstract

We present techniques for speeding up the test-time evaluation of large convolutional networks, designed for object recognition tasks. These models deliver impressive accuracy, but each image evaluation requires millions of floating point operations, making their deployment on smartphones and Internet-scale clusters problematic. The computation is dominated by the convolution operations in the lower layers of the model. We exploit the redundancy present within the convolutional filters to derive approximations that significantly reduce the required computation. Using large state-of-the-art models, we demonstrate speedups of convolutional layers on both CPU and GPU by a factor of $2\times$, while keeping the accuracy within $1\%$ of the original model.


Introduction

Large neural networks have recently demonstrated impressive performance on a range of speech and vision tasks. However, the size of these models can make their deployment at test time problematic. For example, mobile computing platforms are limited in their CPU speed, memory and battery life. At the other end of the spectrum, Internet-scale deployment of these models requires thousands of servers to process hundreds of millions of images per day. The electrical and cooling costs of these servers are significant. Training large neural networks can take weeks, or even months. This hinders research, and consequently there have been extensive efforts devoted to speeding up the training procedure. However, there are relatively few efforts aimed at improving the test-time performance of the models.

We consider convolutional neural networks (CNNs) used for computer vision tasks, since they are large and widely used in commercial applications. These networks typically require a huge number of parameters ($\sim 10^8$ in [1]) to produce state-of-the-art results. While these networks tend to be heavily over-parameterized [2], this redundancy seems necessary in order to overcome a highly non-convex optimization [3]. As a byproduct, the resulting network wastes computing resources. In this paper we show that this redundancy can be exploited with linear compression techniques, resulting in significant speedups for the evaluation of trained large-scale networks, with minimal compromise to performance.

We follow a relatively simple strategy: we start by compressing each convolutional layer by finding an appropriate low-rank approximation, and then we fine-tune the upper layers until the prediction performance is restored. We consider several elementary tensor decompositions based on singular value decompositions, as well as filter clustering methods to take advantage of similarities between learned features.

Our main contributions are the following: (1) We present a collection of generic methods to exploit the redundancy inherent in deep CNNs. (2) We report experiments on state-of-the-art Imagenet CNNs, showing empirical speedups on convolutional layers by a factor of 2–3× and a reduction of parameters in fully connected layers by a factor of 5–10×.

Notation: Convolution weights can be described as a 4-dimensional tensor $W \in \mathbb{R}^{C \times X \times Y \times F}$. $C$ is the number of input channels, $X$ and $Y$ are the spatial dimensions of the kernel, and $F$ is the target number of feature maps. It is common for the first convolutional layer to have a stride associated with the kernel, which we denote by $\Delta$. Let $I \in \mathbb{R}^{C \times N \times M}$ denote an input signal, where $C$ is the number of input maps and $N$ and $M$ are the spatial dimensions of the maps. The target value $T = I \ast W$ of a generic convolutional layer, with $\Delta = 1$, for a particular output feature $f$ and spatial location $(x, y)$, is

$$
T_{f,x,y} = \sum_{c=1}^{C} \sum_{x'=1}^{X} \sum_{y'=1}^{Y} I_{c,\, x + x' - 1,\, y + y' - 1} \; W_{c, x', y', f}
$$
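For concreteness, this target can be computed by a naive reference implementation. The NumPy sketch below uses our own variable names and the valid-region, stride-1 indexing of the equation above (zero-based):

```python
import numpy as np

def conv_forward(I, W):
    """Naive convolutional layer (stride 1, 'valid' region).

    I: input of shape (C, N, M); W: weights of shape (C, X, Y, F).
    Returns T of shape (F, N - X + 1, M - Y + 1), where (zero-based)
    T[f, x, y] = sum_{c, x', y'} I[c, x + x', y + y'] * W[c, x', y', f].
    """
    C, N, M = I.shape
    Cw, X, Y, F = W.shape
    assert C == Cw
    T = np.zeros((F, N - X + 1, M - Y + 1))
    for x in range(N - X + 1):
        for y in range(M - Y + 1):
            patch = I[:, x:x + X, y:y + Y]  # (C, X, Y) receptive field
            # contract over channel and both spatial kernel axes -> (F,)
            T[:, x, y] = np.tensordot(patch, W, axes=([0, 1, 2], [0, 1, 2]))
    return T
```

With a 1x1 kernel this reduces to a per-pixel linear map over channels, which is a useful sanity check.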

If $W$ is a tensor, $\|W\|$ denotes its operator norm $\sup_{\|x\|_F = 1} \|Wx\|_F$, and $\|W\|_F$ denotes its Frobenius norm.

Related Work

Vanhoucke et al. [4] explored the properties of CPUs to speed up execution. They present many solutions specific to Intel and AMD CPUs, although some of their techniques are general enough to be used on any type of processor. They describe how to align memory and use SIMD operations (vectorized operations on CPU) to boost the efficiency of matrix multiplication. Additionally, they propose linear quantization of the network weights and input. This involves representing weights as 8-bit integers (range $[-128, 127]$) rather than 32-bit floats. This approximation is similar in spirit to our approach, but differs in that it is applied to each weight element independently. By contrast, our approximation approach models the structure within each filter. Potentially, the two approaches could be used in conjunction.
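A minimal sketch of such linear weight quantization, for intuition only: we assume symmetric per-tensor scaling here, and the exact scheme of [4] may differ in range handling and offsets.

```python
import numpy as np

def quantize_int8(w):
    """Linearly quantize a float tensor to 8-bit integers plus one scale.

    Symmetric per-tensor scaling is an assumption of this sketch; the
    scheme in Vanhoucke et al. may handle ranges and offsets differently.
    """
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Storing int8 weights cuts memory 4x relative to float32, and the rounding error per weight is at most half a quantization step.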

The most expensive operations in CNNs are the convolutions in the first few layers. The complexity of this operation is linear in the area of the receptive field of the filters, which is relatively large for these layers. However, Mathieu et al. [5] have shown that convolution can be efficiently computed in the Fourier domain, where it becomes element-wise multiplication (and there is no cost associated with the size of the receptive field). They report a forward-pass speedup of around 2× for convolution layers in state-of-the-art models. Importantly, the FFT method can be used jointly with most of the techniques presented in this paper.

The use of low-rank approximations in our approach is inspired by the work of Denil et al. [2], who demonstrate the redundancies in neural network parameters. They show that the weights within a layer can be accurately predicted from a small (e.g. $\sim 5\%$) subset of them. This indicates that neural networks are heavily over-parametrized. All the methods presented here focus on exploiting the linear structure of this over-parametrization.

Finally, a recent preprint [6] also exploits low-rank decompositions of convolutional tensors to speed up the evaluation of CNNs, applied to scene text character recognition. This work was developed simultaneously with ours, and provides further evidence that such techniques can be applied to a variety of architectures and tasks. Our work differs in several ways. First, we consider a significantly larger model. This makes it more challenging to compute efficient approximations since there are more layers to propagate through and thus a greater opportunity for error to accumulate. Second, we present different compression techniques for the hidden convolutional layers and provide a method of compressing the first convolutional layer. Finally, we present GPU results in addition to CPU results.

Convolutional Tensor Compression

In this section we describe techniques for compressing 4-dimensional convolutional weight tensors and fully connected weight matrices into a representation that permits efficient computation and storage. Section 3.1 describes how to construct a good approximation criterion. Section 3.2 describes techniques for low-rank tensor approximations. Sections 3.3 and 3.4 describe how to apply these techniques to approximate the weights of a convolutional neural network.

Approximation Metric

Our goal is to find an approximation $\tilde{W}$ of a convolutional tensor $W$ that facilitates more efficient computation while maintaining the prediction performance of the network. A natural choice for an approximation criterion is to minimize $\|\tilde{W} - W\|_F$. This criterion yields efficient compression schemes using elementary linear algebra, and also controls the operator norm of each linear convolutional layer. However, this criterion assumes that all directions in the space of weights affect prediction performance equally. We now present two methods of improving this criterion while keeping the same efficient approximation algorithms.

Mahalanobis distance metric: The first distance metric we propose seeks to emphasize coordinates more prone to produce prediction errors over coordinates whose effect is less harmful to the overall system. We can obtain such measurements as follows. Let $\Theta = \{W_1, \dots, W_S\}$ denote the set of all parameters of the $S$-layer network, and let $U(I; \Theta)$ denote the output after the softmax layer for an input image $I$. We consider a given input training set $(I_1, \dots, I_N)$ with known labels $(y_1, \dots, y_N)$. For each pair $(I_n, y_n)$, we compute the forward propagation pass $U(I_n; \Theta)$, and define $\{\beta_n\}$ as the indices of the $h$ largest values of $U(I_n; \Theta)$ different from $y_n$. Then, for a given layer $s$, we compute

$$
d_{n,l,s} = \nabla_{W_s} \tfrac{1}{2}\big\| U(I_n; \Theta) - \delta(i - l) \big\|^2, \qquad l \in \beta_n, \; n = 1, \dots, N, \qquad (1)
$$

where $\delta(i - l)$ is the Dirac distribution centered at $l$. In other words, for each input we back-propagate the difference between the current prediction and the $h$ 'most dangerous' mistakes.

The Mahalanobis distance is defined from the covariance of $d$: $\|W\|^2_{\mathrm{maha}} = w \Sigma^{-1} w^\top$, where $w$ is the vector containing all the coordinates of $W$, and $\Sigma$ is the covariance of $(d_{n,l,s})_{n,l}$. We do not report results using this metric, since it requires inverting a matrix of size equal to the number of parameters, which can be prohibitively expensive in large networks. Instead we use an approximation that considers only the diagonal of the covariance matrix. In particular, we propose the following approximate Mahalanobis distance metric:

$$
\|W\|_{\widehat{\mathrm{maha}}} = \Big( \sum_{p} \alpha_p^2 \, W(p)^2 \Big)^{1/2}, \qquad \alpha_p^2 = \sum_{n,l} d_{n,l,s}(p)^2, \qquad (2)
$$

where the sum runs over the tensor coordinates $p$. Since (2) is a reweighted Euclidean metric, we can simply compute $W' = \alpha .\!* W$, where $.\!*$ denotes element-wise multiplication, then compute the approximation $\tilde{W}'$ of $W'$ using the standard $L_2$ norm, and finally output $\tilde{W} = \alpha^{-1} .\!* \tilde{W}'$.
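This reweighting trick can be sketched as follows, a minimal NumPy illustration for a weight matrix; here `alpha` is a stand-in for the gradient-derived per-coordinate weights of (2):

```python
import numpy as np

def truncated_svd(A, t):
    """Best rank-t approximation of A in the Frobenius norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :t] * s[:t]) @ Vt[:t]

def weighted_low_rank(W, alpha, t):
    """Approximate W under the reweighted metric ||alpha .* (W_tilde - W)||_F:
    compress W' = alpha .* W with a plain SVD, then divide the scaling back
    out (alpha > 0 assumed). Note the low-rank structure lives in the scaled
    domain W', as in the paper's scheme."""
    Wp = alpha * W
    Wp_tilde = truncated_svd(Wp, t)
    return Wp_tilde / alpha
```

Setting `alpha` to all ones recovers the plain Frobenius-norm compression, and larger `alpha` entries force the approximation to be more faithful at those coordinates.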

Data covariance distance metric: One can view the Frobenius norm of $W$ as $\|W\|_F^2 = \mathbb{E}_{x \sim \mathcal{N}(0, I)} \|Wx\|_F^2$. An alternative, similar to the one considered in [6], is to replace the isotropic covariance assumption with the empirical covariance of the input of the layer. If $W \in \mathbb{R}^{C \times X \times Y \times F}$ is a convolutional layer and $\hat{\Sigma} \in \mathbb{R}^{CXY \times CXY}$ is the empirical estimate of the input data covariance, the metric can be efficiently computed as

$$
\|W\|_{\mathrm{data}}^2 = \big\| \hat{\Sigma}^{1/2} W_F \big\|_F^2, \qquad (3)
$$

where $W_F$ is the matrix obtained by folding the first three dimensions of $W$. As opposed to [6], this approach adapts to the input distribution without the need to iterate through the data.
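A sketch of how this metric can drive the compression, under our own construction: whiten the folded weights with $\hat{\Sigma}^{1/2}$, truncate with a plain SVD, and map back with $\hat{\Sigma}^{-1/2}$.

```python
import numpy as np

def cov_sqrt(Sigma, eps=1e-8):
    """Symmetric square root and inverse square root of a covariance matrix."""
    vals, vecs = np.linalg.eigh(Sigma)
    vals = np.maximum(vals, eps)  # guard against degenerate directions
    half = vecs @ np.diag(np.sqrt(vals)) @ vecs.T
    inv_half = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return half, inv_half

def data_weighted_low_rank(W_F, X, t):
    """Rank-t approximation of folded weights W_F (d x F) that minimizes
    ||Sigma_hat^{1/2} (W_tilde - W_F)||_F, with Sigma_hat the empirical
    covariance of the rows of X (n x d)."""
    Sigma = X.T @ X / len(X)
    half, inv_half = cov_sqrt(Sigma)
    A = half @ W_F                      # whitened weights
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_t = (U[:, :t] * s[:t]) @ Vt[:t]   # best rank-t fit in whitened space
    return inv_half @ A_t
```

With isotropic inputs this collapses back to the plain Frobenius criterion.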

Low-rank Tensor Approximations

Matrix Decomposition

Matrices are 2-tensors which can be linearly compressed using the Singular Value Decomposition. If $W \in \mathbb{R}^{m \times k}$ is a real matrix, the SVD is defined as $W = U S V^\top$, where $U \in \mathbb{R}^{m \times m}$, $S \in \mathbb{R}^{m \times k}$, $V \in \mathbb{R}^{k \times k}$. $S$ is a diagonal matrix with the singular values on the diagonal, and $U$, $V$ are orthogonal matrices. If the singular values of $W$ decay rapidly, $W$ can be well approximated by keeping only the $t$ largest entries of $S$, resulting in the approximation $\tilde{W} = \tilde{U} \tilde{S} \tilde{V}^\top$, where $\tilde{U} \in \mathbb{R}^{m \times t}$, $\tilde{S} \in \mathbb{R}^{t \times t}$, and $\tilde{V}^\top \in \mathbb{R}^{t \times k}$. Then, for $I \in \mathbb{R}^{n \times m}$, the approximation error $\|I \tilde{W} - I W\|_F$ satisfies $\|I \tilde{W} - I W\|_F \le s_{t+1} \|I\|_F$, and thus is controlled by the decay along the diagonal of $S$. The computation $I \tilde{W}$ can now be done in $O(nmt + nt^2 + ntk)$, which, for sufficiently small $t$, is significantly less than $O(nmk)$.
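The matrix case can be sketched directly in NumPy; applying $I$ to the truncated factors in sequence realizes the stated cost, rather than first forming $\tilde{W}$:

```python
import numpy as np

def svd_factors(W, t):
    """Rank-t SVD factors of W (m x k); also returns the full spectrum,
    which controls the approximation error."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :t], s[:t], Vt[:t], s

def apply_low_rank(I, Ut, st, Vtt):
    """Compute I @ W_tilde in O(nmt + nt^2 + ntk) instead of O(nmk):
    project onto the t-dimensional subspace, scale, expand."""
    return ((I @ Ut) * st) @ Vtt
```

The error bound above says the result differs from $IW$ by at most $s_{t+1}\|I\|_F$ in the Frobenius norm.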

Higher Order Tensor Approximations

SVD can be used to approximate a tensor $W \in \mathbb{R}^{m \times n \times k}$ by first folding all but two dimensions together to convert it into a 2-tensor, and then taking the SVD of the resulting matrix. For example, we can approximate $W_m \in \mathbb{R}^{m \times (nk)}$ as $\tilde{W}_m \approx \tilde{U} \tilde{S} \tilde{V}^\top$. $W$ can be compressed even further by applying SVD to $\tilde{V}$. We refer to this approximation as the SVD decomposition and use $K_1$ and $K_2$ to denote the ranks used in the first and second applications of SVD, respectively.
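A minimal sketch of this two-stage SVD decomposition follows; the factor layout (singular values folded into the right-hand factors) is our own choice:

```python
import numpy as np

def svd_decompose_tensor(W, K1, K2):
    """Two-stage SVD compression of a 3-tensor W (m, n, k).

    Stage 1: fold to (m, n*k) and truncate to rank K1.
    Stage 2: compress the right factor again at rank K2.
    Returns A (m, K1), B (K1, K2), C (K2, n*k) with W ~ A @ B @ C.
    """
    m, n, k = W.shape
    Wm = W.reshape(m, n * k)
    U1, s1, V1t = np.linalg.svd(Wm, full_matrices=False)
    A = U1[:, :K1]
    M = s1[:K1, None] * V1t[:K1]          # right factor, (K1, n*k)
    U2, s2, V2t = np.linalg.svd(M, full_matrices=False)
    B = U2[:, :K2] * s2[:K2]
    C = V2t[:K2]
    return A, B, C

def reconstruct(A, B, C, shape):
    return (A @ B @ C).reshape(shape)
```

Storage drops from $mnk$ values to $mK_1 + K_1 K_2 + K_2 nk$.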

Alternatively, we can approximate a 3-tensor $W_S \in \mathbb{R}^{m \times n \times k}$ by a rank-1 3-tensor by finding a decomposition that minimizes

$$
\min_{\alpha, \beta, \gamma} \big\| W_S - \alpha \otimes \beta \otimes \gamma \big\|_F^2, \qquad (4)
$$

where $\alpha \in \mathbb{R}^m$, $\beta \in \mathbb{R}^n$, $\gamma \in \mathbb{R}^k$, and $\otimes$ denotes the outer product operation. Problem (4) is solved efficiently by performing alternating least squares on $\alpha$, $\beta$ and $\gamma$ respectively, although more efficient algorithms can also be considered [7].

This easily extends to a rank-$K$ approximation using a greedy algorithm: given a tensor $W^{(1)} = W$, we compute $(\alpha_k, \beta_k, \gamma_k)$ using (4) and update $W^{(k+1)} \leftarrow W^{(k)} - \alpha_k \otimes \beta_k \otimes \gamma_k$. Repeating this operation $K$ times results in

$$
W \approx \sum_{k=1}^{K} \alpha_k \otimes \beta_k \otimes \gamma_k. \qquad (5)
$$

We refer to this approximation as the outer product decomposition and use K to denote the rank of the approximation.
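The alternating least squares solver for (4) and the greedy deflation for the rank-$K$ extension can be sketched as follows (a NumPy illustration; the iteration count, initialization, and small denominator guard are our own choices):

```python
import numpy as np

def rank1_als(W, iters=25, seed=0):
    """Fit W ~ alpha (x) beta (x) gamma for a 3-tensor W (m, n, k) by
    alternating least squares: each factor update is the closed-form
    least-squares solution given the other two factors."""
    rng = np.random.default_rng(seed)
    m, n, k = W.shape
    beta = rng.standard_normal(n)
    gamma = rng.standard_normal(k)
    eps = 1e-12  # guards divisions when the residual is near zero
    for _ in range(iters):
        alpha = np.einsum('ijl,j,l->i', W, beta, gamma) / ((beta @ beta) * (gamma @ gamma) + eps)
        beta = np.einsum('ijl,i,l->j', W, alpha, gamma) / ((alpha @ alpha) * (gamma @ gamma) + eps)
        gamma = np.einsum('ijl,i,j->l', W, alpha, beta) / ((alpha @ alpha) * (beta @ beta) + eps)
    return alpha, beta, gamma

def outer_product_decomposition(W, K):
    """Greedy rank-K fit: repeatedly extract a rank-1 term and deflate."""
    terms, R = [], W.copy()
    for _ in range(K):
        a, b, c = rank1_als(R)
        terms.append((a, b, c))
        R = R - np.einsum('i,j,l->ijl', a, b, c)
    approx = sum(np.einsum('i,j,l->ijl', a, b, c) for a, b, c in terms)
    return terms, approx
```

On an exactly rank-1 tensor a single ALS sweep already recovers the decomposition; on generic tensors the residual norm is non-increasing in $K$.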

Figure 1: A visualization of monochromatic and biclustering approximation structures. (a) The monochromatic approximation, used for the first layer. Input color channels are projected onto a set of intermediate color channels. After this transformation, output features need only look at one intermediate color channel. (b) The biclustering approximation, used for higher convolution layers. Input and output features are clustered into equal-sized groups. The weight tensor corresponding to each pair of input and output clusters is then approximated. (c) The weight tensors for each input-output pair in (b) are approximated by a sum of rank-1 tensors using the techniques described in Section 3.2.2.


Monochromatic Convolution Approximation

Let $W \in \mathbb{R}^{C \times X \times Y \times F}$ denote the weights of the first convolutional layer of a trained network. We found that the color components of trained CNNs tend to have low-dimensional structure. In particular, the weights can be well approximated by projecting the color dimension down to a 1D subspace. The low-dimensional structure of the weights is illustrated in Figure 2.

Table 1: Number of operations required for various approximation methods.

The monochromatic approximation exploits this structure and is computed as follows. First, for every output feature $f$, we consider the matrix $W_f \in \mathbb{R}^{C \times (XY)}$, where the spatial dimensions of the filter corresponding to the output feature have been combined, and find the SVD $W_f = U_f S_f V_f^\top$, where $U_f \in \mathbb{R}^{C \times C}$, $S_f \in \mathbb{R}^{C \times XY}$, and $V_f \in \mathbb{R}^{XY \times XY}$. We then take the rank-1 approximation of $W_f$, $\tilde{W}_f = \tilde{U}_f \tilde{S}_f \tilde{V}_f^\top$, where $\tilde{U}_f \in \mathbb{R}^{C \times 1}$, $\tilde{S}_f \in \mathbb{R}$, $\tilde{V}_f \in \mathbb{R}^{1 \times XY}$. We can further exploit the regularity in the weights by sharing the color component basis between different output features. We do this by clustering the $F$ left singular vectors $\tilde{U}_f$ into $C'$ clusters, for $C' < F$. We constrain the clusters to be of equal size, as discussed in Section 3.4. Then, for each output feature $f$ assigned to cluster $c_f$, we approximate $W_f$ with $\tilde{W}_f = U_{c_f} \tilde{S}_f \tilde{V}_f^\top$, where $U_{c_f} \in \mathbb{R}^{C \times 1}$ is the center of cluster $c_f$ and $\tilde{S}_f$ and $\tilde{V}_f$ are as before.
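A minimal NumPy sketch of this pipeline: cluster assignments are supplied by the caller here (the equal-size clustering of Section 3.4 is abstracted away), and the sign alignment before averaging is our own detail, since singular vectors are defined up to sign.

```python
import numpy as np

def monochromatic_approx(W, assign, C_prime):
    """Monochromatic approximation of first-layer weights W (C, X, Y, F).

    For each output feature f: rank-1 SVD of W_f (C x XY). Then the color
    bases u_f are replaced by per-cluster means (assign maps f -> cluster).
    Returns the approximated tensor, same shape as W."""
    C, X, Y, F = W.shape
    u = np.zeros((F, C))
    sv = np.zeros((F, X * Y))
    for f in range(F):
        Wf = W[:, :, :, f].reshape(C, X * Y)
        U, s, Vt = np.linalg.svd(Wf, full_matrices=False)
        u[f] = U[:, 0]
        sv[f] = s[0] * Vt[0]  # fold the singular value into the spatial part
    centers = np.zeros((C_prime, C))
    for c in range(C_prime):
        basis = u[np.where(assign == c)[0]]
        basis = basis * np.sign(basis @ basis[0])[:, None]  # align signs
        mean = basis.mean(axis=0)
        centers[c] = mean / np.linalg.norm(mean)
    W_tilde = np.zeros_like(W)
    for f in range(F):
        W_tilde[:, :, :, f] = np.outer(centers[assign[f]], sv[f]).reshape(C, X, Y)
    return W_tilde
```

With one cluster per filter this reduces to the independent rank-1 approximations; fewer clusters trade accuracy for a cheaper shared color transform.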

Biclustering Approximations

We exploit the redundancy within the 4-D weight tensors in the higher convolutional layers by clustering the filters such that each cluster can be accurately approximated by a low-rank factorization. We start by clustering the rows of $W_C \in \mathbb{R}^{C \times (XYF)}$, which results in clusters $C_1, \dots, C_a$. Then we cluster the columns of $W_F \in \mathbb{R}^{(CXY) \times F}$, producing clusters $F_1, \dots, F_b$. These two operations break the original weight tensor $W$ into $ab$ sub-tensors $\{W_{C_i, F_j}\}_{i=1,\dots,a,\, j=1,\dots,b}$, as shown in Figure 1(b). Each sub-tensor contains similar elements and thus is easier to fit with a low-rank approximation.

In order to exploit the parallelism inherent in CPU and GPU architectures, it is useful to constrain clusters to be of equal size. We therefore perform the biclustering operations (and the clustering of monochromatic filters in Section 3.3) using a modified version of the $k$-means algorithm which balances the cluster counts at each iteration. It is implemented with Lloyd's algorithm, modifying the Euclidean distance with a subspace projection distance.
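One simple way to realize balanced clusters, shown here for intuition, is greedy capacity-constrained assignment inside Lloyd-style iterations; this is our own construction, and the authors' subspace-projection distance is omitted:

```python
import numpy as np

def balanced_kmeans(X, k, iters=10, seed=0):
    """k-means with equal-size clusters: each iteration assigns points
    greedily by increasing distance, capping every cluster at n // k
    members (n divisible by k is assumed here)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    cap = n // k
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    labels = np.zeros(n, dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        order = np.argsort(d, axis=None)  # all (point, cluster) pairs, nearest first
        counts = np.zeros(k, dtype=int)
        done = np.zeros(n, dtype=bool)
        for idx in order:
            p, c = divmod(idx, k)
            if not done[p] and counts[c] < cap:
                labels[p] = c
                done[p] = True
                counts[c] += 1
        for c in range(k):
            centers[c] = X[labels == c].mean(axis=0)
    return labels, centers
```

The capacity constraint guarantees every cluster ends up with exactly n // k members, which is what makes the downstream sub-tensor computations uniform and parallel-friendly.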

After the input and output clusters have been obtained, we find a low-rank approximation of each sub-tensor using either the SVD decomposition or the outer product decomposition as described in Section 3.2.2. We concatenate the $X$ and $Y$ spatial dimensions of the sub-tensors so that the decomposition is applied to the 3-tensor $W_S \in \mathbb{R}^{C \times (XY) \times F}$. While we could look for a separable approximation along the spatial dimensions as well, we found the resulting gain to be minimal. Using these approximations, the target output can be computed with significantly fewer operations. The number of operations required is a function of the number of input clusters $G$, the number of output clusters $H$, and the rank of the sub-tensor approximations ($K_1$, $K_2$ for the SVD decomposition; $K$ for the outer product decomposition). The number of operations required for each approximation is given in Table 1.

Fine-tuning

Many of the approximation techniques presented here can efficiently compress the weights of a CNN with negligible degradation of classification performance provided the approximation is not too harsh. Alternatively, one can use a harsher approximation that gives greater speedup gains but hurts the performance of the network. In this case, the approximated layer and all those below it can be fixed and the upper layers can be fine-tuned until the original performance is restored.

Figure 2: Visualization of the first-layer filters. (Left) Each component of the 96 7x7 filters is plotted in RGB space. Points are colored based on the output filter they belong to. Hence, there are 96 colors and $7^2$ points of each color. The leftmost plot shows the original filters and the right plot shows the filters after the monochromatic approximation, where each filter has been projected down to a line in color space. (Right) Original and approximate versions of a selection of first-layer filters.


Experiments

We use the 15-layer convolutional architecture of [8], trained on the ImageNet 2012 dataset [9]. The network contains 4 convolutional layers, 3 fully connected layers and a softmax output layer. We evaluated the network on both CPU and GPU platforms. All measurements of prediction performance are with respect to the 20K validation images from the ImageNet12 dataset.

We present results showing the performance of the approximations described in Section 3 in terms of prediction accuracy, speedup gains and reduction in memory overhead. All of our fine-tuning results were achieved by training with fewer than 2 passes through the ImageNet12 training dataset. Unless stated otherwise, classification numbers refer to those of fine-tuned models.

Speedup

The majority of forward propagation time is spent on the first two convolutional layers (see Supplementary Material for a breakdown of time across all layers). Because of this, we restrict our attention to the first and second convolutional layers in our speedup experiments. However, our approximations could easily be applied to convolutions in upper layers as well.

We implemented several CPU and GPU approximation routines in an effort to achieve empirical speedups. Both the baseline and approximation CPU code are implemented in C++ using the Eigen3 library [10] compiled with Intel MKL. We also use Intel's OpenMP implementation for multithreading. The baseline gives performance comparable to highly optimized MATLAB convolution routines, and all of our CPU speedup results are computed relative to it. We used Alex Krizhevsky's CUDA convolution routines 1 as a baseline for GPU comparisons. The approximation versions are written in CUDA. All GPU code was run on a standard NVIDIA Titan card.

We have found that in practice it is often difficult to achieve speedups close to the theoretical gains based on the number of arithmetic operations (see Supplementary Material for discussion of theoretical gains). Moreover, different computer architectures and CNN architectures afford different optimization strategies making most implementations highly specific. However, regardless of implementation details, all of the approximations we present reduce both the number of operations and number of weights required to compute the output by at least a factor of two, often more.

First Layer

The first convolutional layer has 3 input channels, 96 output channels and 7x7 filters. We approximated the weights in this layer using the monochromatic approximation described in Section 3.3. The monochromatic approximation works well if the color components span a small number of one-dimensional subspaces. Figure 2 illustrates the effect of the monochromatic approximation on the first-layer filters.

The only parameter in the approximation is $C'$, the number of color channels used for the intermediate representation. As expected, the network performance begins to degrade as $C'$ decreases.

1 https://code.google.com/p/cuda-convnet/


Figure 3: Empirical speedups on ( Left ) CPU and ( Right ) GPU for the first layer. C ′ is the number of colors used in the approximation.


Figure 4: Empirical speedups for the second convolutional layer. (Left) Speedups on CPU using biclustering ($G = 2$ and $H = 2$) with the SVD approximation. (Right) Speedups on GPU using biclustering ($G = 48$ and $H = 2$) with the outer product decomposition approximation.

The number of floating point operations required to compute the output of the monochromatic convolution is reduced by a factor of 2–3×, with larger gains for smaller $C'$. Figure 3 shows the empirical speedups we achieved on CPU and GPU and the corresponding network performance for various numbers of colors used in the monochromatic approximation. Our CPU and GPU implementations achieve empirical speedups of 2–2.5× relative to the baseline with less than a 1% drop in classification performance.

Second Layer

The second convolutional layer has 96 input channels, 256 output channels and 5x5 filters. We approximated the weights using the techniques described in Section 3.4. We explored various configurations of the approximations by varying the number of input clusters G , the number of output clusters H and the rank of the approximation (denoted by K 1 and K 2 for the SVD decomposition and K for the outer product decomposition).

Figure 4 shows our empirical speedups on CPU and GPU and the corresponding network performance for various approximation configurations. For the CPU implementation we used the biclustering with SVD approximation. For the GPU implementation we used the biclustering with outer product decomposition approximation. We achieved promising results and present speedups of 2–2.5× relative to the baseline with less than a 1% drop in performance.

Combining approximations

The approximations can also be cascaded to provide greater speedups. The procedure is as follows. Compress the first convolutional layer weights and then fine-tune all the layers above until performance is restored. Next, compress the second convolutional layer weights that result from the fine-tuning. Fine-tune all the layers above until performance is restored, and then continue the process.

Table 2: Number of parameters expressed as a function of hyperparameters for various approximation methods and empirical reduction in parameters with corresponding network performance.

We applied this procedure to the first two convolutional layers. Using the monochromatic approximation with 6 colors for the first layer and the biclustering with outer product decomposition approximation for the second layer ($G = 48$; $H = 2$; $K = 8$), and fine-tuning with a single pass through the training set, we are able to keep accuracy within 1% of the original model. This procedure could be applied to each convolutional layer, in this sequential manner, to achieve overall speedups much greater than any individual layer can provide. A more comprehensive summary of these results can be found in the Supplementary Material.

Reduction in memory overhead

In many commercial applications memory conservation and storage are a central concern. This mainly applies to embedded systems (e.g. smartphones), where available memory is limited, and users are reluctant to download large files. In these cases, being able to compress the neural network is crucial for the viability of the product.

In addition to requiring fewer operations, our approximations require significantly fewer parameters than the original model. Since the majority of parameters come from the fully connected layers, we include these layers in our analysis of memory overhead. We compress the fully connected layers using the standard SVD described in Section 3.2.1, using $K$ to denote the rank of the approximation.
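A sketch of this compression for one fully connected layer (NumPy; splitting $\sqrt{s}$ between the two factors is one common convention, our choice):

```python
import numpy as np

def compress_fc(W, K):
    """Factor a fully connected weight matrix W (m x k) into two thin
    matrices via truncated SVD: W ~ P @ Q with P (m x K), Q (K x k),
    shrinking the parameter count from m*k to K*(m + k)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s[:K])
    P = U[:, :K] * root          # absorb sqrt of singular values on each side
    Q = root[:, None] * Vt[:K]
    return P, Q

# Illustrative sizes (hypothetical layer dimensions, not the paper's exact ones)
rng = np.random.default_rng(7)
m, k, K = 1000, 500, 50
W = rng.standard_normal((m, k))
P, Q = compress_fc(W, K)
params_orig = m * k
params_new = P.size + Q.size
```

At inference, the forward pass x @ W becomes (x @ P) @ Q, so the speedup and the memory saving come from the same factorization.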

Table 2 shows the number of parameters for various approximation methods as a function of hyperparameters for the approximation techniques. The table also shows the empirical reduction of parameters and the corresponding network performance for specific instantiations of the approximation parameters.

Discussion

In this paper we have presented techniques that can speed up the bottleneck convolution operations in the first layers of a CNN by a factor of 2–3×, with negligible loss of performance. We also show that our methods reduce the memory footprint of weights in the first two layers by a factor of 2–3× and in the fully connected layers by a factor of 5–13×. Since the vast majority of weights reside in the fully connected layers, compressing only these layers translates into significant savings, which would facilitate mobile deployment of convolutional networks. These techniques are orthogonal to other approaches for efficient evaluation, such as quantization or working in the Fourier domain. Hence, they can potentially be used together to obtain further gains.

An interesting avenue of research to explore in further work is the ability of these techniques to aid in regularization, either during or after training. The low-rank projections effectively decrease the number of learnable parameters, suggesting that they might improve generalization ability. The regularization potential of the low-rank approximations is further motivated by two observations. The first is that the approximated filters for the first convolutional layer appear to be cleaned-up versions of the original filters. Additionally, we noticed that we sporadically achieve better test error with some of the more conservative approximations.

Supplementary Material

Forward propagation time breakdown

Table 3 shows the time breakdown of forward propagation for each layer in the CNN architecture we explored. Close to 90% of the time is spent on convolutional layers, and within these layers the majority of time is spent on the first two.

Table 3: Evaluation time in seconds per layer on CPU (left) and GPU (right) with batch size of 128. Results are averaged over 8 runs.

Theoretical speedups

We can measure the theoretically achievable speedup for a particular approximation in terms of the number of floating point operations required to compute the target output. While it is unlikely that any implementation would achieve speedups equal to the theoretically optimal level, the number of necessary floating point operations still provides an informative upper bound on the gains.

Table 4 shows the theoretical speedup of the monochromatic approximation. The majority of the operations result from the convolution part of the computation. In comparison, the number of operations required for the color transformation is negligible. Thus, the theoretically achievable speedup decreases only slightly as the number of color components used is increased.

Figure 5 plots the theoretically achievable speedups against the drop in classification performance for various configurations of the biclustering with outer product decomposition technique. For a given setting of the input and output cluster numbers, the performance tends to degrade as the rank is decreased.

Combined results

We used the monochromatic approximation with 6 colors for the first layer. Table 5 summarizes the results after fine-tuning for 1 pass through the ImageNet12 training data using a variety of second layer approximations.

Table 4: Performance when first layer weights are replaced with monochromatic approximation and the corresponding theoretical speedup. Classification error on ImageNet12 validation images tends to increase as the approximation becomes harsher (i.e. fewer colors are used). Theoretical speedups vary only slightly as the number of colors used increases since the color transformation contributes relatively little to the total number of operations.

We present techniques for speeding up the test-time evaluation of large convolutional networks, designed for object recognition tasks. These models deliver impressive accuracy, but each image evaluation requires millions of floating point operations, making their deployment on smartphones and Internet-scale clusters problematic. The computation is dominated by the convolution operations in the lower layers of the model. We exploit the redundancy present within the convolutional filters to derive approximations that significantly reduce the required computation. Using large state-of-the-art models, we demonstrate speedups of convolutional layers on both CPU and GPU by a factor of 2×2\times, while keeping the accuracy within 1%percent11% of the original model.

Large neural networks have recently demonstrated impressive performance on a range of speech and vision tasks. However, the size of these models can make their deployment at test time problematic. For example, mobile computing platforms are limited in their CPU speed, memory and battery life. At the other end of the spectrum, Internet-scale deployment of these models requires thousands of servers to process the 100’s of millions of images per day. The electrical and cooling costs of these servers is significant. Training large neural networks can take weeks, or even months. This hinders research and consequently there have been extensive efforts devoted to speeding up training procedure. However, there are relatively few efforts aimed at improving the test-time performance of the models.

We consider convolutional neural networks (CNNs) used for computer vision tasks, since they are large and widely used in commercial applications. These networks typically require a huge number of parameters ($\sim 10^8$ in [1]) to produce state-of-the-art results. While these networks tend to be hugely over-parameterized [2], this redundancy seems necessary in order to overcome a highly non-convex optimization [3]. As a byproduct, the resulting network wastes computing resources. In this paper we show that this redundancy can be exploited with linear compression techniques, resulting in significant speedups for the evaluation of trained large-scale networks, with minimal compromise to performance.

We follow a relatively simple strategy: we start by compressing each convolutional layer by finding an appropriate low-rank approximation, and then we fine-tune the upper layers until the prediction performance is restored. We consider several elementary tensor decompositions based on singular value decompositions, as well as filter clustering methods to take advantage of similarities between learned features.

Our main contributions are the following: (1) We present a collection of generic methods to exploit the redundancy inherent in deep CNNs. (2) We report experiments on state-of-the-art ImageNet CNNs, showing empirical speedups on convolutional layers by a factor of $2-3\times$ and a reduction of parameters in fully connected layers by a factor of $5-10\times$.

Notation: Convolution weights can be described as a 4-dimensional tensor: $W \in \mathbb{R}^{C \times X \times Y \times F}$. $C$ is the number of input channels, $X$ and $Y$ are the spatial dimensions of the kernel, and $F$ is the target number of feature maps. It is common for the first convolutional layer to have a stride associated with the kernel, which we denote by $\Delta$. Let $I \in \mathbb{R}^{C \times N \times M}$ denote an input signal where $C$ is the number of input maps, and $N$ and $M$ are the spatial dimensions of the maps. The target value, $T = I \ast W$, of a generic convolutional layer, with $\Delta = 1$, for a particular output feature, $f$, and spatial location, $(x, y)$, is

$$T(f, x, y) = \sum_{c=1}^{C} \sum_{x'=1}^{X} \sum_{y'=1}^{Y} I(c, x - x', y - y')\, W(c, x', y', f). \tag{1}$$
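As a concrete reference point, the sum above can be evaluated directly with nested loops. A minimal NumPy sketch (the function name and the restriction to the "valid" output region are our own choices, not from the paper):

```python
import numpy as np

def conv_layer(I, W):
    # I: (C, N, M) input maps; W: (C, X, Y, F) weight tensor; stride Delta = 1.
    # Computes the valid part of T(f,x,y) = sum_{c,x',y'} I(c,x-x',y-y') W(c,x',y',f).
    # The spatial flip below makes this a true convolution rather than a correlation.
    C, N, M = I.shape
    _, X, Y, F = W.shape
    Wf = W[:, ::-1, ::-1, :]                  # flip spatial dims once, then correlate
    T = np.zeros((F, N - X + 1, M - Y + 1))
    for x in range(N - X + 1):
        for y in range(M - Y + 1):
            patch = I[:, x:x + X, y:y + Y]    # (C, X, Y) receptive field
            T[:, x, y] = np.einsum('cxy,cxyf->f', patch, Wf)
    return T
```

Each output location costs $XYCF$ multiply-adds, which is the $XYCFNM\Delta^{-2}$ entry for the unapproximated layer in Table 1.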

If $W$ is a tensor, $\|W\|$ denotes its operator norm, $\sup_{\|x\|=1} \|Wx\|_F$, and $\|W\|_F$ denotes its Frobenius norm.

Vanhoucke et al. [4] explored the properties of CPUs to speed up execution. They present many solutions specific to Intel and AMD CPUs; however, some of their techniques are general enough to be used for any type of processor. They describe how to align memory and use SIMD operations (vectorized operations on CPU) to boost the efficiency of matrix multiplication. Additionally, they propose the linear quantization of the network weights and input. This involves representing weights as 8-bit integers (range $[-127, 128]$), rather than 32-bit floats. This approximation is similar in spirit to our approach, but differs in that it is applied to each weight element independently. By contrast, our approximation approach models the structure within each filter. Potentially, the two approaches could be used in conjunction.
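For illustration, a minimal sketch of per-tensor linear quantization in this spirit (our own simplification with a single shared scale; the scheme of [4] has more detail):

```python
import numpy as np

def quantize_linear(w, bits=8):
    # Map float weights onto integers in [-(2**(bits-1) - 1), 2**(bits-1)]
    # (the [-127, 128] range above for 8 bits), storing one scale per tensor.
    half = 2 ** (bits - 1)
    scale = max(np.abs(w).max() / (half - 1), 1e-12)
    q = np.clip(np.round(w / scale), -(half - 1), half).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from the integer codes.
    return q * scale
```

Note that each weight element is rounded independently, in contrast to the structured approximations developed in this paper.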

The most expensive operations in CNNs are the convolutions in the first few layers. The complexity of this operation is linear in the area of the receptive field of the filters, which is relatively large for these layers. However, Mathieu et al. [5] have shown that convolution can be efficiently computed in the Fourier domain, where it becomes element-wise multiplication (and there is no cost associated with the size of the receptive field). They report a forward-pass speedup of around $2\times$ for convolution layers in state-of-the-art models. Importantly, the FFT method can be used jointly with most of the techniques presented in this paper.

The use of low-rank approximations in our approach is inspired by work of Denil et al. [2] who demonstrate the redundancies in neural network parameters. They show that the weights within a layer can be accurately predicted from a small (e.g. $\sim 5\%$) subset of them. This indicates that neural networks are heavily over-parametrized. All the methods presented here focus on exploiting the linear structure of this over-parametrization.

Finally, a recent preprint [6] also exploits low-rank decompositions of convolutional tensors to speed up the evaluation of CNNs, applied to scene text character recognition. This work was developed simultaneously with ours, and provides further evidence that such techniques can be applied to a variety of architectures and tasks. Our work differs in several ways. First, we consider a significantly larger model. This makes it more challenging to compute efficient approximations since there are more layers to propagate through and thus a greater opportunity for error to accumulate. Second, we present different compression techniques for the hidden convolutional layers and provide a method of compressing the first convolutional layer. Finally, we present GPU results in addition to CPU results.

In this section we describe techniques for compressing 4-dimensional convolutional weight tensors and fully connected weight matrices into representations that permit efficient computation and storage. Section 3.1 describes how to construct a good approximation criterion. Section 3.2 describes techniques for low-rank tensor approximations. Sections 3.3 and 3.4 describe how to apply these techniques to approximate the weights of a convolutional neural network.

Our goal is to find an approximation, $\tilde{W}$, of a convolutional tensor $W$ that facilitates more efficient computation while maintaining the prediction performance of the network. A natural choice for an approximation criterion is to minimize $\|\tilde{W} - W\|_F$. This criterion yields efficient compression schemes using elementary linear algebra, and also controls the operator norm of each linear convolutional layer. However, this criterion assumes that all directions in the space of weights equally affect prediction performance. We now present two methods of improving this criterion while keeping the same efficient approximation algorithms.

Mahalanobis distance metric: The first distance metric we propose seeks to emphasize coordinates more prone to produce prediction errors over coordinates whose effect is less harmful to the overall system. We can obtain such measurements as follows. Let $\Theta = \{W_1, \dots, W_S\}$ denote the set of all parameters of the $S$-layer network, and let $U(I; \Theta)$ denote the output after the softmax layer for input image $I$. We consider a given input training set $(I_1, \dots, I_N)$ with known labels $(y_1, \dots, y_N)$. For each pair $(I_n, y_n)$, we compute the forward propagation pass $U(I_n, \Theta)$, and define as $\{\beta_n\}$ the indices of the $h$ largest values of $U(I_n, \Theta)$ different from $y_n$. Then, for a given layer $s$, we compute

$$d_{n,l,s} = \nabla_{W_s} \left( U(I_n, \Theta) - \delta(i - l) \right), \quad n \leq N,\; l \in \beta_n,\; s \leq S,$$

where $\delta(i - l)$ is the Dirac distribution centered at $l$. In other words, for each input we back-propagate the difference between the current prediction and the $h$ "most dangerous" mistakes.

The Mahalanobis distance is defined from the covariance of $d$: $\|W\|_{maha}^2 = w \Sigma^{-1} w^T$, where $w$ is the vector containing all the coordinates of $W$, and $\Sigma$ is the covariance of $(d_{n,l,s})_{n,l}$. We do not report results using this metric, since it requires inverting a matrix of size equal to the number of parameters, which can be prohibitively expensive in large networks. Instead we use an approximation that considers only the diagonal of the covariance matrix. In particular, we propose the following approximate Mahalanobis distance metric:

$$\|W\|_{\widetilde{maha}} := \sum_p \alpha_p W(p), \quad \text{where } \alpha_p = \Big( \sum_{n,l} d_{n,l,s}(p)^2 \Big)^{1/2}, \tag{2}$$

where the sum runs over the tensor coordinates. Since (2) is a reweighted Euclidean metric, we can simply compute $W' = \alpha \,.\!*\, W$, where $.\!*$ denotes element-wise multiplication, then compute the approximation $\tilde{W'}$ of $W'$ using the standard $L_2$ norm, and finally output $\tilde{W} = \alpha^{-1} \,.\!*\, \tilde{W'}$.
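This reweighting trick can be sketched as follows for a matrix-shaped $W$, using a truncated SVD as the $L_2$ approximation step (the function name and the choice of SVD are our own):

```python
import numpy as np

def weighted_lowrank(W, alpha, rank):
    # Approximate W under the diagonal (approximate Mahalanobis) metric:
    # scale each coordinate by alpha, take the best L2 approximation of the
    # scaled matrix (here: truncated SVD), then undo the scaling.
    Wp = alpha * W                                    # W' = alpha .* W
    U, s, Vt = np.linalg.svd(Wp, full_matrices=False)
    Wp_tilde = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # L2-optimal rank-t approx
    return Wp_tilde / alpha                           # W~ = alpha^{-1} .* W'~
```

Note that for non-separable weight maps $\alpha$ this is a heuristic rather than the exact minimizer of the weighted metric, which is precisely the trade-off the paragraph above accepts in exchange for reusing the efficient $L_2$ algorithms.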

Data covariance distance metric: One can view the Frobenius norm of $W$ as $\|W\|_F^2 = \mathbb{E}_{x \sim \mathcal{N}(0, I)} \|Wx\|_F^2$. Another alternative, similar to the one considered in [6], is to replace the isotropic covariance assumption with the empirical covariance of the input of the layer. If $W \in \mathbb{R}^{C \times X \times Y \times F}$ is a convolutional layer, and $\widehat{\Sigma} \in \mathbb{R}^{CXY \times CXY}$ is the empirical estimate of the input data covariance, it can be efficiently computed as

$$\|W\|_{data} = \|\widehat{\Sigma}^{1/2} W_F\|_F, \tag{3}$$

where $W_F$ is the matrix obtained by folding the first three dimensions of $W$. As opposed to [6], this approach adapts to the input distribution without the need to iterate through the data.
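A small sketch of computing $\|W\|_{data}$ given a covariance estimate (the function name is ours; the trace identity avoids an explicit matrix square root):

```python
import numpy as np

def data_covariance_norm(W, Sigma):
    # W: (C, X, Y, F) conv tensor; Sigma: (CXY, CXY) empirical input covariance.
    C, X, Y, F = W.shape
    WF = W.reshape(C * X * Y, F)  # fold the first three dimensions
    # ||Sigma^{1/2} W_F||_F^2 = tr(W_F^T Sigma W_F), so no square root is needed.
    return float(np.sqrt(np.trace(WF.T @ Sigma @ WF)))
```

With $\widehat{\Sigma} = I$ this reduces to the plain Frobenius norm, matching the isotropic view above.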

Matrices are 2-tensors which can be linearly compressed using the Singular Value Decomposition (SVD). If $W \in \mathbb{R}^{m \times k}$ is a real matrix, the SVD is defined as $W = U S V^{\top}$, where $U \in \mathbb{R}^{m \times m}$, $S \in \mathbb{R}^{m \times k}$, $V \in \mathbb{R}^{k \times k}$. $S$ is a diagonal matrix with the singular values on the diagonal, and $U$, $V$ are orthogonal matrices. If the singular values of $W$ decay rapidly, $W$ can be well approximated by keeping only the $t$ largest entries of $S$, resulting in the approximation $\tilde{W} = \tilde{U} \tilde{S} \tilde{V}^{\top}$, where $\tilde{U} \in \mathbb{R}^{m \times t}$, $\tilde{S} \in \mathbb{R}^{t \times t}$, $\tilde{V} \in \mathbb{R}^{t \times k}$. Then, for $I \in \mathbb{R}^{n \times m}$, the approximation error $\|I\tilde{W} - IW\|_F$ satisfies $\|I\tilde{W} - IW\|_F \leq s_{t+1} \|I\|_F$, and thus is controlled by the decay along the diagonal of $S$. Now the computation $I\tilde{W}$ can be done in $O(nmt + nt^2 + ntk)$, which, for sufficiently small $t$, is significantly smaller than $O(nmk)$.
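The factored evaluation and the error bound can be checked directly (sizes below are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n, t = 20, 12, 7, 4
W = rng.standard_normal((m, k))
I = rng.standard_normal((n, m))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
Ut, St, Vtt = U[:, :t], np.diag(s[:t]), Vt[:t]   # rank-t factors

# Evaluating left-to-right costs O(nmt + nt^2 + ntk) instead of O(nmk).
approx = ((I @ Ut) @ St) @ Vtt
err = np.linalg.norm(approx - I @ W)
# s[t] is the (t+1)-th singular value, so this is the bound s_{t+1} ||I||_F.
assert err <= s[t] * np.linalg.norm(I) + 1e-9
```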

SVD can be used to approximate a tensor $W \in \mathbb{R}^{m \times n \times k}$ by first folding all but two dimensions together to convert it into a 2-tensor, and then considering the SVD of the resulting matrix. For example, we can approximate $W_m \in \mathbb{R}^{m \times (nk)}$ as $W_m \approx \tilde{U} \tilde{S} \tilde{V}^{\top}$. $W$ can be compressed even further by applying SVD to $\tilde{V}$. We refer to this approximation as the SVD decomposition and use $K_1$ and $K_2$ to denote the rank used in the first and second application of SVD respectively.

Alternatively, we can approximate a 3-tensor, $W_S \in \mathbb{R}^{m \times n \times k}$, by a rank-1 3-tensor by finding a decomposition that minimizes

$$\|W_S - \alpha \otimes \beta \otimes \gamma\|_F, \tag{4}$$

where $\alpha \in \mathbb{R}^m$, $\beta \in \mathbb{R}^n$, $\gamma \in \mathbb{R}^k$ and $\otimes$ denotes the outer product operation. Problem (4) is solved efficiently by performing alternating least squares on $\alpha$, $\beta$ and $\gamma$ respectively, although more efficient algorithms can also be considered [7].

This easily extends to a rank-$K$ approximation using a greedy algorithm: given a tensor $W$, we compute $(\alpha, \beta, \gamma)$ using (4), and we update $W^{(k+1)} \leftarrow W^{(k)} - \alpha \otimes \beta \otimes \gamma$. Repeating this operation $K$ times results in

$$W \approx \sum_{k=1}^{K} \alpha_k \otimes \beta_k \otimes \gamma_k.$$

We refer to this approximation as the outer product decomposition and use K𝐾K to denote the rank of the approximation.
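A compact sketch of the alternating least squares step and the greedy rank-$K$ loop (function names and the fixed iteration budget are our own):

```python
import numpy as np

def rank1_als(W, iters=50, seed=0):
    # Alternating least squares for min || W - a (x) b (x) c ||_F on a 3-tensor W.
    # Each update is the closed-form least-squares solution for one factor
    # with the other two held fixed.
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(W.shape[0])
    b = rng.standard_normal(W.shape[1])
    c = rng.standard_normal(W.shape[2])
    for _ in range(iters):
        a = np.einsum('ijk,j,k->i', W, b, c) / ((b @ b) * (c @ c))
        b = np.einsum('ijk,i,k->j', W, a, c) / ((a @ a) * (c @ c))
        c = np.einsum('ijk,i,j->k', W, a, b) / ((a @ a) * (b @ b))
    return a, b, c

def greedy_cp(W, K, iters=50):
    # Greedy rank-K outer product decomposition: repeatedly fit a rank-1 term
    # to the current residual and subtract it.
    residual, terms = W.copy(), []
    for _ in range(K):
        a, b, c = rank1_als(residual, iters)
        terms.append((a, b, c))
        residual = residual - np.einsum('i,j,k->ijk', a, b, c)
    return terms, residual
```

On a tensor that is exactly rank 1, a single ALS sweep already recovers the factors up to scaling, so the residual after `greedy_cp(W, 1)` is at machine precision.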

Let $W \in \mathbb{R}^{C \times X \times Y \times F}$ denote the weights of the first convolutional layer of a trained network. We found that the color components of trained CNNs tend to have low-dimensional structure. In particular, the weights can be well approximated by projecting the color dimension down to a 1D subspace. The low-dimensional structure of the weights is illustrated in Figure 2.

The monochromatic approximation exploits this structure and is computed as follows. First, for every output feature, $f$, we consider the matrix $W_f \in \mathbb{R}^{C \times (XY)}$, where the spatial dimensions of the filter corresponding to the output feature have been combined, and find the SVD, $W_f = U_f S_f V_f^{\top}$, where $U_f \in \mathbb{R}^{C \times C}$, $S_f \in \mathbb{R}^{C \times XY}$, and $V_f \in \mathbb{R}^{XY \times XY}$. We then take the rank-1 approximation of $W_f$, $\tilde{W}_f = \tilde{U}_f \tilde{S}_f \tilde{V}_f^{\top}$, where $\tilde{U}_f \in \mathbb{R}^{C \times 1}$, $\tilde{S}_f \in \mathbb{R}$, $\tilde{V}_f \in \mathbb{R}^{1 \times XY}$. We can further exploit the regularity in the weights by sharing the color component basis between different output features. We do this by clustering the $F$ left singular vectors, $\tilde{U}_f$, of the output features into $C'$ clusters, for $C' < F$. We constrain the clusters to be of equal size, as discussed in Section 3.4.
Then, for each of the $\frac{F}{C'}$ output features, $f$, assigned to cluster $c_f$, we can approximate $W_f$ with $\tilde{W}_f = U_{c_f} \tilde{S}_f \tilde{V}_f^{\top}$, where $U_{c_f} \in \mathbb{R}^{C \times 1}$ is the cluster center for cluster $c_f$ and $\tilde{S}_f$ and $\tilde{V}_f$ are as before.
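The two steps above (per-filter rank-1 color projection, then sharing of color bases) can be sketched as follows. This is our own illustrative simplification: it uses plain, unbalanced k-means rather than the equal-size clustering the paper requires.

```python
import numpy as np

def monochromatic_approx(W, n_colors, seed=0):
    # W: (C, X, Y, F) first-layer weights.  Per output feature f: rank-1 SVD
    # over the color dimension, then share color bases by clustering the F
    # color vectors into n_colors channels.
    C, X, Y, F = W.shape
    colors = np.zeros((F, C))       # U_f S_f: 1D color component per feature
    spatial = np.zeros((F, X * Y))  # V_f: monochromatic spatial filter
    for f in range(F):
        U, s, Vt = np.linalg.svd(W[:, :, :, f].reshape(C, X * Y), full_matrices=False)
        colors[f], spatial[f] = U[:, 0] * s[0], Vt[0]
    rng = np.random.default_rng(seed)
    centers = colors[rng.choice(F, n_colors, replace=False)].copy()
    for _ in range(20):             # a few Lloyd iterations
        assign = np.argmin(((colors[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(n_colors):
            if np.any(assign == c):
                centers[c] = colors[assign == c].mean(axis=0)
    W_tilde = np.zeros_like(W)
    for f in range(F):
        W_tilde[:, :, :, f] = np.outer(centers[assign[f]], spatial[f]).reshape(C, X, Y)
    return W_tilde, assign
```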

This monochromatic approximation is illustrated in the left panel of Figure 1. Table 1 shows the number of operations required for the standard and monochromatic versions.

We exploit the redundancy within the 4-D weight tensors in the higher convolutional layers by clustering the filters, such that each cluster can be accurately approximated by a low-rank factorization. We start by clustering the rows of $W_C \in \mathbb{R}^{C \times (XYF)}$, which results in clusters $C_1, \dots, C_a$. Then we cluster the columns of $W_F \in \mathbb{R}^{(CXY) \times F}$, producing clusters $F_1, \dots, F_b$. These two operations break the original weight tensor $W$ into $ab$ sub-tensors $\{W_{C_i, F_j}\}_{i=1,\dots,a,\, j=1,\dots,b}$, as shown in Figure 1. Each sub-tensor contains similar elements, and thus is easier to fit with a low-rank approximation.

In order to exploit the parallelism inherent in CPU and GPU architectures, it is useful to constrain clusters to be of equal sizes. We therefore perform the biclustering operations (or clustering for monochromatic filters in Section 3.3) using a modified version of the $k$-means algorithm which balances the cluster count at each iteration. It is implemented with the Lloyd algorithm, by modifying the Euclidean distance with a subspace projection distance.
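The balanced assignment can be sketched as a greedy variant of the Lloyd iteration with per-cluster capacities. This is a heuristic of our own for illustration, using plain Euclidean distance rather than the subspace projection distance mentioned above:

```python
import numpy as np

def balanced_kmeans(X, n_clusters, iters=20, seed=0):
    # Lloyd-style k-means with equal-size clusters: at every iteration, points
    # are greedily assigned in order of increasing distance, skipping clusters
    # that have already reached their capacity n / n_clusters.
    n, _ = X.shape
    assert n % n_clusters == 0, "points must divide evenly into clusters"
    cap = n // n_clusters
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(n, n_clusters, replace=False)].astype(float)
    for _ in range(iters):
        D = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (n, k) dists
        assign = np.full(n, -1)
        counts = np.zeros(n_clusters, dtype=int)
        for flat in np.argsort(D, axis=None):  # (point, cluster) pairs by distance
            i, c = divmod(flat, n_clusters)
            if assign[i] == -1 and counts[c] < cap:
                assign[i], counts[c] = c, counts[c] + 1
        for c in range(n_clusters):
            centers[c] = X[assign == c].mean(axis=0)
    return assign, centers
```

The equal-size constraint is what makes the resulting sub-tensors uniformly shaped, which in turn keeps the per-cluster low-rank multiplications regular enough for efficient batched CPU/GPU kernels.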

After the input and output clusters have been obtained, we find a low-rank approximation of each sub-tensor using either the SVD decomposition or the outer product decomposition as described in Section 3.2.2. We concatenate the $X$ and $Y$ spatial dimensions of the sub-tensors so that the decomposition is applied to the 3-tensor, $W_S \in \mathbb{R}^{C \times (XY) \times F}$. While we could look for a separable approximation along the spatial dimensions as well, we found the resulting gain to be minimal. Using these approximations, the target output can be computed with significantly fewer operations. The number of operations required is a function of the number of input clusters, $G$, the number of output clusters, $H$, and the rank of the sub-tensor approximations ($K_1, K_2$ for the SVD decomposition; $K$ for the outer product decomposition). The number of operations required for each approximation is described in Table 1.

Many of the approximation techniques presented here can efficiently compress the weights of a CNN with negligible degradation of classification performance provided the approximation is not too harsh. Alternatively, one can use a harsher approximation that gives greater speedup gains but hurts the performance of the network. In this case, the approximated layer and all those below it can be fixed and the upper layers can be fine-tuned until the original performance is restored.

We use the 15 layer convolutional architecture of [8], trained on the ImageNet 2012 dataset [9]. The network contains 5 convolutional layers, 3 fully connected layers and a softmax output layer. We evaluated the network on both CPU and GPU platforms. All measurements of prediction performance are with respect to the 20K validation images from the ImageNet12 dataset.

We present results showing the performance of the approximations described in Section 3 in terms of prediction accuracy, speedup gains and reduction in memory overhead. All of our fine-tuning results were achieved by training with fewer than two passes through the ImageNet12 training dataset. Unless stated otherwise, classification numbers refer to those of fine-tuned models.

The majority of forward propagation time is spent on the first two convolutional layers (see Supplementary Material for a breakdown of time across all layers). Because of this, we restrict our attention to the first and second convolutional layers in our speedup experiments. However, our approximations could easily be applied to convolutions in upper layers as well.

We implemented several CPU and GPU approximation routines in an effort to achieve empirical speedups. Both the baseline and approximation CPU code are implemented in C++ using the Eigen3 library [10] compiled with Intel MKL. We also use Intel's implementation of OpenMP and multithreading. The baseline gives comparable performance to highly optimized MATLAB convolution routines, and all of our CPU speedup results are computed relative to this. We used Alex Krizhevsky's CUDA convolution routines (https://code.google.com/p/cuda-convnet/) as a baseline for GPU comparisons. The approximation versions are written in CUDA. All GPU code was run on a standard NVIDIA Titan card.

We have found that in practice it is often difficult to achieve speedups close to the theoretical gains based on the number of arithmetic operations (see Supplementary Material for discussion of theoretical gains). Moreover, different computer architectures and CNN architectures afford different optimization strategies making most implementations highly specific. However, regardless of implementation details, all of the approximations we present reduce both the number of operations and number of weights required to compute the output by at least a factor of two, often more.

The first convolutional layer has 3 input channels, 96 output channels and 7x7 filters. We approximated the weights in this layer using the monochromatic approximation described in Section 3.3. The monochromatic approximation works well if the color components span a small number of one dimensional subspaces. Figure 2 illustrates the effect of the monochromatic approximation on the first layer filters.

The only parameter in the approximation is $C'$, the number of color channels used for the intermediate representation. As expected, the network performance begins to degrade as $C'$ decreases. The number of floating point operations required to compute the output of the monochromatic convolution is reduced by a factor of $2-3\times$, with the larger gain resulting from smaller $C'$. Figure 3 shows the empirical speedups we achieved on CPU and GPU and the corresponding network performance for various numbers of colors used in the monochromatic approximation. Our CPU and GPU implementations achieve empirical speedups of $2-2.5\times$ relative to the baseline with less than a 1% drop in classification performance.

The second convolutional layer has 96 input channels, 256 output channels and 5x5 filters. We approximated the weights using the techniques described in Section 3.4. We explored various configurations of the approximations by varying the number of input clusters $G$, the number of output clusters $H$ and the rank of the approximation (denoted by $K_1$ and $K_2$ for the SVD decomposition and $K$ for the outer product decomposition).

Figure 4 shows our empirical speedups on CPU and GPU and the corresponding network performance for various approximation configurations. For the CPU implementation we used the biclustering with SVD approximation. For the GPU implementation we used the biclustering with outer product decomposition approximation. We achieved promising results, with speedups of $2-2.5\times$ relative to the baseline and less than a 1% drop in performance.

The approximations can also be cascaded to provide greater speedups. The procedure is as follows. Compress the first convolutional layer weights and then fine-tune all the layers above until performance is restored. Next, compress the second convolutional layer weights that result from the fine-tuning. Fine-tune all the layers above until performance is restored and then continue the process.

We applied this procedure to the first two convolutional layers. Using the monochromatic approximation with 6 colors for the first layer and the biclustering with outer product decomposition approximation for the second layer ($G = 48$; $H = 2$; $K = 8$), and fine-tuning with a single pass through the training set, we are able to keep accuracy within 1% of the original model. This procedure could be applied to each convolutional layer, in this sequential manner, to achieve overall speedups much greater than any individual layer can provide. A more comprehensive summary of these results can be found in the Supplementary Material.

In many commercial applications memory conservation and storage are a central concern. This mainly applies to embedded systems (e.g. smartphones), where available memory is limited, and users are reluctant to download large files. In these cases, being able to compress the neural network is crucial for the viability of the product.

In addition to requiring fewer operations, our approximations require significantly fewer parameters when compared to the original model. Since the majority of parameters come from the fully connected layers, we include these layers in our analysis of memory overhead. We compress the fully connected layers using standard SVD as described in Section 3.2.2, using $K$ to denote the rank of the approximation.
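For instance, a sketch of the fully connected compression and its parameter count (the layer sizes below are illustrative, not the paper's):

```python
import numpy as np

def compress_fc(W, K):
    # Factor an (m x k) fully connected weight matrix into rank-K factors:
    # W ~= A @ B with A (m x K) and B (K x k), i.e. m*K + K*k stored
    # parameters instead of m*k.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :K] * s[:K], Vt[:K]

# Illustrative sizes: a 4096 x 4096 layer compressed to rank 512
m, k, K = 4096, 4096, 512
print(m * k / (m * K + K * k))  # parameter reduction factor: 4.0
```

At test time the layer is evaluated as `(x @ A) @ B`, so the operation count drops by the same factor as the parameter count.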

Table 2 shows the number of parameters for various approximation methods as a function of hyperparameters for the approximation techniques. The table also shows the empirical reduction of parameters and the corresponding network performance for specific instantiations of the approximation parameters.

In this paper we have presented techniques that can speed up the bottleneck convolution operations in the first layers of a CNN by a factor of $2-3\times$, with negligible loss of performance. We also show that our methods reduce the memory footprint of the weights in the first two layers by a factor of $2-3\times$ and in the fully connected layers by a factor of $5-13\times$. Since the vast majority of weights reside in the fully connected layers, compressing only these layers translates into significant savings, which would facilitate mobile deployment of convolutional networks. These techniques are orthogonal to other approaches for efficient evaluation, such as quantization or working in the Fourier domain. Hence, they can potentially be used together to obtain further gains.

An interesting avenue of research to explore in future work is the ability of these techniques to aid in regularization, either during or after training. The low-rank projections effectively decrease the number of learnable parameters, suggesting that they might improve generalization ability. The regularization potential of the low-rank approximations is further motivated by two observations. The first is that the approximated filters for the first convolutional layer appear to be cleaned-up versions of the original filters. Additionally, we noticed that we sporadically achieve better test error with some of the more conservative approximations.

Supplement to "Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation" (NIPS 2014)

Table 3 shows the time breakdown of forward propagation for each layer in the CNN architecture we explored. Close to 90% of the time is spent on convolutional layers, and within these layers the majority of time is spent on the first two.

We can measure the theoretically achievable speedups for a particular approximation in terms of the number of floating point operations required to compute the target output. While it is unlikely that any implementation would achieve speedups equal to the theoretically optimal level, the number of necessary floating point operations still provides an informative upper bound on the gains.

Table 4 shows the theoretical speedup of the monochromatic approximation. The majority of the operations result from the convolution part of the computation. In comparison, the number of operations required for the color transformation is negligible. Thus, the theoretically achievable speedup decreases only slightly as the number of color components used is increased.

Figure 5 plots the theoretically achievable speedups against the drop in classification performance for various configurations of the biclustering with outer product decomposition technique. For a given setting of input and output clusters numbers, the performance tends to degrade as the rank is decreased.

Table 1: Number of operations required for various approximation methods.

Approximation technique | Number of operations
No approximation | $XYCFNM\Delta^{-2}$
Monochromatic | $C'CNM + XYFNM\Delta^{-2}$
Biclustering + outer product decomposition | $GHK(NM\frac{C}{G} + XYNM\Delta^{-2} + \frac{F}{H}NM\Delta^{-2})$
Biclustering + SVD | $GHNM(\frac{C}{G}K_1 + K_1 XY K_2 \Delta^{-2} + K_2\frac{F}{H})$

Table 3: Evaluation time in seconds per layer on CPU with a batch size of 128. Results are averaged over 8 runs.

Layer | Time per batch (sec) | Fraction
Conv1 | 2.8317 ± 0.1030 | 21.97%
MaxPool | 0.1059 ± 0.0154 | 0.82%
LRNormal | 0.1918 ± 0.0162 | 1.49%
Conv2 | 4.2626 ± 0.0740 | 33.07%
MaxPool | 0.0705 ± 0.0029 | 0.55%
LRNormal | 0.0772 ± 0.0027 | 0.60%
Conv3 | 1.8689 ± 0.0577 | 14.50%
MaxPool | 0.0532 ± 0.0018 | 0.41%
Conv4 | 1.5261 ± 0.0386 | 11.84%
Conv5 | 1.4222 ± 0.0416 | 11.03%
MaxPool | 0.0102 ± 0.0006 | 0.08%
FC | 0.3777 ± 0.0233 | 2.93%
FC | 0.0709 ± 0.0038 | 0.55%
FC | 0.0168 ± 0.0018 | 0.13%
Softmax | 0.0028 ± 0.0015 | 0.02%
Total | 12.8885 |

Table 4: Performance when first layer weights are replaced with the monochromatic approximation, and the corresponding theoretical speedup. Classification error on ImageNet12 validation images tends to increase as the approximation becomes harsher (i.e. fewer colors are used). Theoretical speedups vary only slightly as the number of colors used increases, since the color transformation contributes relatively little to the total number of operations.

Number of colors | Increase in test error (original / $\|W\|_{data}$ metric / fine-tuned) | Theoretical speedup
4 | 24.1% / 5.9% / 1.9% | 2.97×
6 | 16.1% / 2.4% / 0.4% | 2.95×
8 | 9.9% / 1.4% / 0.2% | 2.94×
12 | 3.5% / 0.7% / 0% | 2.91×
16 | 1.99% / 0.8% / – | 2.88×
24 | 1.43% / 0.4% / – | 2.82×

Table 5: Cascading approximations.

Layer 2 method | Hyperparameters | Increase in error
Biclustering + outer product decomposition | $G = 48$; $H = 2$; $K = 8$ | 1%
Biclustering + outer product decomposition | $G = 48$; $H = 2$; $K = 6$ | 1.5%
Biclustering + SVD | $G = 2$; $H = 2$; $K_1 = 19$; $K_2 = 64$ | 1.2%
Biclustering + SVD | $G = 2$; $H = 2$; $K_1 = 19$; $K_2 = 51$ | 1.4%

Figure 1: A visualization of monochromatic and biclustering approximation structures. (a) The monochromatic approximation, used for the first layer. Input color channels are projected onto a set of intermediate color channels. After this transformation, output features need only look at one intermediate color channel. (b) The biclustering approximation, used for higher convolution layers. Input and output features are clustered into equal-sized groups. The weight tensor corresponding to each pair of input and output clusters is then approximated. (c) The weight tensors for each input-output pair in (b) are approximated by a sum of rank-1 tensors using the techniques described in Section 3.2.2.


Figure 2: Visualization of the first layer filters. (Left) Each component of the 96 7x7 filters is plotted in RGB space. Points are colored based on the output filter they belong to; hence, there are 96 colors and $7^2$ points of each color. The leftmost plot shows the original filters and the right plot shows the filters after the monochromatic approximation, where each filter has been projected down to a line in colorspace. (Right) Original and approximate versions of a selection of first layer filters.

Figure: Empirical speedups on (Left) CPU and (Right) GPU for the first layer. $C'$ is the number of colors used in the approximation.

Figure: Empirical speedups for the second convolutional layer. (Left) Speedups on CPU using biclustering ($G=2$, $H=2$) with the SVD approximation. (Right) Speedups on GPU using biclustering ($G=48$, $H=2$) with the outer product decomposition approximation.


$$ \label{approxi} d_{n,l,s} = \nabla_{W_s} \left( U(I_n, \Theta) - \delta(i - l)\right)~, \quad n\leq N,\; l \in \beta_n,\; s\leq S, $$ \tag{approxi}

$$ \label{poormansmaha} \| W \|_{\widetilde{maha}} := \sum_p \alpha_p W(p)~, \text{ where } \alpha_p = \Big( \sum_{n,l} d_{n,l,s}(p)^2 \Big)^{1/2}~ $$ \tag{poormansmaha}

$$ \| W \|_{data} = \| \widehat{\Sigma}^{1/2} W \|_F~, $$
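As a sketch of how the gradient-weighted norm above can be computed, the numpy snippet below builds the per-entry weights $\alpha_p$ from a stack of stand-in gradient tensors. The shapes and random data are assumptions for illustration; in practice `d` would hold the actual network gradients $d_{n,l,s}$, and $|W(p)|$ is used here so the weighted sum behaves like a norm:

```python
import numpy as np

# Stand-in data: in practice d holds the gradients d_{n,l,s} of the network
# output with respect to the weight tensor W_s.
rng = np.random.default_rng(0)
N, L = 5, 3                                  # images n <= N, labels l per image
W = rng.standard_normal((4, 7, 7))           # a toy weight tensor W_s
d = rng.standard_normal((N, L) + W.shape)    # stand-in gradient stack d_{n,l,s}

# alpha_p = ( sum_{n,l} d_{n,l,s}(p)^2 )^{1/2}, one weight per entry p of W
alpha = np.sqrt((d ** 2).sum(axis=(0, 1)))

# Weighted norm: entries whose perturbation moves the network output the most
# (large gradient magnitude) contribute the most to the objective.
weighted_norm = float((alpha * np.abs(W)).sum())
```

The point of the weighting is that an approximation is then penalized most for distorting the weights the network's output is most sensitive to.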

$$ \label{eq:rank1} \| W - \alpha \otimes \beta \otimes \gamma \|_F~, $$ \tag{eq:rank1}
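Minimizing this rank-1 objective can be done with an alternating (higher-order power) iteration in the spirit of Zhang and Golub [rankonetensors]: update one factor while holding the other two fixed. The sketch below uses a toy tensor with planted rank-1 structure and a deterministic initialization, both assumptions chosen only to keep the example self-contained:

```python
import numpy as np

def rank1_approx(W, iters=100):
    """Approximate W by lam * (a (x) b (x) c) with unit vectors a, b, c,
    alternately refitting each factor against the other two."""
    I, J, K = W.shape
    a = np.ones(I) / np.sqrt(I)   # simple deterministic initialization
    b = np.ones(J) / np.sqrt(J)
    c = np.ones(K) / np.sqrt(K)
    for _ in range(iters):
        a = np.einsum('ijk,j,k->i', W, b, c); a /= np.linalg.norm(a)
        b = np.einsum('ijk,i,k->j', W, a, c); b /= np.linalg.norm(b)
        c = np.einsum('ijk,i,j->k', W, a, b); c /= np.linalg.norm(c)
    lam = np.einsum('ijk,i,j,k->', W, a, b, c)
    return lam, a, b, c

# Toy tensor with planted rank-1 structure plus a little noise.
rng = np.random.default_rng(1)
u, v, w = (x / np.linalg.norm(x) for x in (rng.standard_normal(6),
                                           rng.standard_normal(5),
                                           rng.standard_normal(4)))
W = 3.0 * np.einsum('i,j,k->ijk', u, v, w) + 0.01 * rng.standard_normal((6, 5, 4))

lam, a, b, c = rank1_approx(W)
err = np.linalg.norm(W - lam * np.einsum('i,j,k->ijk', a, b, c)) / np.linalg.norm(W)
```

A sum-of-rank-1 approximation (as used for the biclustered weight tensors) can be built greedily by repeating this on the residual, though a joint fit generally gives lower error.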

$$ \label{convlayereq} T(f,x,y) = \sum_{c=1}^C \sum_{x'=1}^{X} \sum_{y'=1}^{Y} I(c,x-x',y-y') W(c,x',y',f) $$ \tag{convlayereq}
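For reference, a direct (unoptimized) numpy evaluation of this layer equation might look as follows; the "valid" output range and zero-based indexing are assumptions made to keep the sketch self-contained:

```python
import numpy as np

def conv_layer(I, W):
    """T(f, x, y) = sum_{c, x', y'} I(c, x - x', y - y') W(c, x', y', f),
    evaluated at every output position where all filter taps fall inside I."""
    C, H, Wi = I.shape          # input: channels x height x width
    Cw, X, Y, F = W.shape       # weights: channels x X x Y x output features
    assert C == Cw
    out = np.zeros((F, H - X + 1, Wi - Y + 1))
    for u in range(out.shape[1]):
        for v in range(out.shape[2]):
            # flipped window implements the I(c, x - x', y - y') indexing
            win = I[:, u:u + X, v:v + Y][:, ::-1, ::-1]
            out[:, u, v] = np.tensordot(win, W, axes=([0, 1, 2], [0, 1, 2]))
    return out

rng = np.random.default_rng(3)
I = rng.standard_normal((3, 8, 8))
W = rng.standard_normal((3, 5, 5, 4))
T = conv_layer(I, W)            # F x (H - X + 1) x (Wi - Y + 1)
```

Counting the multiply-adds in the inner loop recovers the $XYCF$ cost per output position that the approximations in this paper aim to reduce.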

$$ W = USV^{\top}, \quad \text{where } U \in \mathbb{R}^{m \times m},\ S \in \mathbb{R}^{m \times k},\ V \in \mathbb{R}^{k \times k}. $$

$$ \| I \tilde{W} - I W \|_F \leq s_{t+1} \| I \|_F~, $$

where $s_{t+1}$ denotes the $(t+1)$-th largest singular value of $W$.

$$ \tilde{W}_f = \tilde{U}_f \tilde{S}_f \tilde{V}_f^{\top}~, $$
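For the fully connected layers this truncation is an ordinary matrix SVD. A minimal numpy sketch (toy sizes assumed for illustration) that forms the rank-$t$ factors and checks the error bound above:

```python
import numpy as np

rng = np.random.default_rng(2)
m, k, batch = 64, 32, 10
W = rng.standard_normal((m, k))      # fully connected weight matrix
I = rng.standard_normal((batch, m))  # rows are input activations

U, s, Vt = np.linalg.svd(W, full_matrices=False)
t = 8                                # keep the top-t singular values
W_tilde = (U[:, :t] * s[:t]) @ Vt[:t]

# Applying I @ U[:, :t], scaling by s[:t], then multiplying by Vt[:t] costs
# O(mt + tk) per input instead of O(mk); the residual obeys the bound above.
err = np.linalg.norm(I @ W_tilde - I @ W)
bound = s[t] * np.linalg.norm(I)     # s[t] is the (t+1)-th singular value
```

The bound follows from $\|AB\|_F \leq \|A\|_F \|B\|_2$ together with $\|\tilde{W} - W\|_2 = s_{t+1}$ for the rank-$t$ SVD truncation.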

Second layer approximation: Theoretical speedup vs. performance loss

Figure 5: Theoretically achievable speedups vs. classification error for various biclustering approximations.


Table: Number of operations required for various approximation methods.

| Approximation technique | Number of operations |
| --- | --- |
| No approximation | $XYCFNM\Delta^{-2}$ |
| Monochromatic | $C'CNM + XYFNM\Delta^{-2}$ |
| Biclustering + outer product decomposition | $GHK(NM\frac{C}{G} + XYNM\Delta^{-2} + \frac{F}{H}NM\Delta^{-2})$ |
| Biclustering + SVD | $GHNM(\frac{C}{G}K_1 + K_1 XY K_2 \Delta^{-2} + K_2\frac{F}{H})$ |

Table: Number of parameters, hyperparameters, weight reduction, and error increase for each approximation.

| Approximation method | Number of parameters | Approximation hyperparameters | Reduction in weights | Increase in error |
| --- | --- | --- | --- | --- |
| Standard convolution | $CXYF$ | – | – | – |
| Conv layer 1: Monochromatic | $CC' + XYF$ | $C' = 6$ | 3× | 0.43% |
| Conv layer 2: Biclustering + outer product decomposition | $GHK(\frac{C}{G} + XY + \frac{F}{H})$ | $G=48$; $H=2$; $K=6$ | 5.3× | 0.68% |
| Conv layer 2: Biclustering + SVD | $GH(\frac{C}{G}K_1 + K_1 XY K_2 + K_2\frac{F}{H})$ | $G=2$; $H=2$; $K_1=19$; $K_2=24$ | 3.9× | 0.9% |
| Standard FC | $NM$ | – | – | – |
| FC layer 1: Matrix SVD | $NK + KM$ | $K=250$ / $K=950$ | 13.4× / 3.5× | 0.8394% / 0.09% |
| FC layer 2: Matrix SVD | $NK + KM$ | $K=350$ / $K=650$ | 5.8× / 3.14× | 0.19% / 0.06% |
| FC layer 3: Matrix SVD | $NK + KM$ | $K=250$ / $K=850$ | 8.1× / 2.4× | 0.67% / 0.02% |

Table: Evaluation time per batch for each layer of the network, on CPU (left) and GPU (right).

| Layer | CPU time per batch (sec) | Fraction | GPU time per batch (sec) | Fraction |
| --- | --- | --- | --- | --- |
| Conv1 | 2.8317 ± 0.1030 | 21.97% | 0.0604 ± 0.0112 | 5.14% |
| MaxPool | 0.1059 ± 0.0154 | 0.82% | 0.0072 ± 0.0040 | 0.61% |
| LRNormal | 0.1918 ± 0.0162 | 1.49% | 0.0041 ± 0.0043 | 0.35% |
| Conv2 | 4.2626 ± 0.0740 | 33.07% | 0.4663 ± 0.0072 | 39.68% |
| MaxPool | 0.0705 ± 0.0029 | 0.55% | 0.0032 ± 0.0000 | 0.27% |
| LRNormal | 0.0772 ± 0.0027 | 0.60% | 0.0015 ± 0.0003 | 0.13% |
| Conv3 | 1.8689 ± 0.0577 | 14.50% | 0.2219 ± 0.0014 | 18.88% |
| MaxPool | 0.0532 ± 0.0018 | 0.41% | 0.0016 ± 0.0000 | 0.14% |
| Conv4 | 1.5261 ± 0.0386 | 11.84% | 0.1991 ± 0.0001 | 16.94% |
| Conv5 | 1.4222 ± 0.0416 | 11.03% | 0.1958 ± 0.0002 | 16.66% |
| MaxPool | 0.0102 ± 0.0006 | 0.08% | 0.0005 ± 0.0001 | 0.04% |
| FC | 0.3777 ± 0.0233 | 2.93% | 0.0077 ± 0.0013 | 0.66% |
| FC | 0.0709 ± 0.0038 | 0.55% | 0.0017 ± 0.0001 | 0.14% |
| FC | 0.0168 ± 0.0018 | 0.13% | 0.0007 ± 0.0002 | 0.06% |
| Softmax | 0.0028 ± 0.0015 | 0.02% | 0.0038 ± 0.0098 | 0.32% |
| Total | 12.8885 | | 1.1752 | |

Table: Monochromatic approximation of the first convolutional layer.

| Number of colors | Increase in test error (original) | Increase in test error ($\|W\|_{data}$ distance metric) | Increase in test error (fine-tuned) | Theoretical speedup |
| --- | --- | --- | --- | --- |
| 4 | 24.1% | 5.9% | 1.9% | 2.97× |
| 6 | 16.1% | 2.4% | 0.4% | 2.95× |
| 8 | 9.9% | 1.4% | 0.2% | 2.94× |
| 12 | 3.5% | 0.7% | 0% | 2.91× |
| 16 | 1.99% | 0.8% | – | 2.88× |
| 24 | 1.43% | 0.4% | – | 2.82× |

Table 5: Cascading approximations.

| Layer 2 method | Hyperparameters | Increase in error |
| --- | --- | --- |
| Biclustering + outer product decomposition | $G=48$; $H=2$; $K=8$ | 1% |
| Biclustering + outer product decomposition | $G=48$; $H=2$; $K=6$ | 1.5% |
| Biclustering + SVD | $G=2$; $H=2$; $K_1=19$; $K_2=64$ | 1.2% |
| Biclustering + SVD | $G=2$; $H=2$; $K_1=19$; $K_2=51$ | 1.4% |
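As a sanity check on the operation counts, the theoretical speedup of the monochromatic approximation is the ratio of the first two formulas (the shared $NM$ output factor cancels). The layer dimensions below are illustrative assumptions (RGB input, 96 filters of size 7×7, stride 2, $C' = 6$ colors), not values quoted from the tables:

```python
# Hypothetical first-layer dimensions, for illustration only.
C, X, Y, F = 3, 7, 7, 96     # input channels, filter size, output features
Cp, delta = 6, 2             # C' intermediate colors, stride Delta

# Operation counts per NM input pixels, from the formulas in the table above.
ops_full = X * Y * C * F / delta**2       # standard convolution
ops_mono = Cp * C + X * Y * F / delta**2  # color projection + 1-channel conv
speedup = ops_full / ops_mono             # roughly 3x for these dimensions
```

With these assumed dimensions the ratio lands near 3×, the same ballpark as the roughly 2.9× theoretical first-layer speedups reported above.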


References

[sermanet2013overfeat] Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229 (2013)

[denil2013predicting] Denil, M., Shakibi, B., Dinh, L., Ranzato, M., de Freitas, N.: Predicting parameters in deep learning. arXiv preprint arXiv:1306.0543 (2013)

[hinton2012improving] Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)

[vanhoucke2011improving] Vanhoucke, V., Senior, A., Mao, M.Z.: Improving the speed of neural networks on CPUs. In: Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop (2011)

[mathieu2013fast] Mathieu, M., Henaff, M., LeCun, Y.: Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851 (2013)

[zisserman14] Jaderberg, M., Vedaldi, A., Zisserman, A.: Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866 (2014)

[rankonetensors] Zhang, T., Golub, G.H.: Rank-one approximation to high order tensors. SIAM J. Matrix Anal. Appl. 23(2) (2001) 534–550

[zeiler2013visualizing] Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional neural networks. arXiv preprint arXiv:1311.2901 (2013)

[imagenet] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR (2009)

[eigenweb] Guennebaud, G., Jacob, B., et al.: Eigen v3. http://eigen.tuxfamily.org (2010)

[zeiler2011adaptive] Zeiler, M.D., Taylor, G.W., Fergus, R.: Adaptive deconvolutional networks for mid and high level feature learning. In: IEEE International Conference on Computer Vision (ICCV) (2011) 2018–2025

[LeNgiChenChiaKohNg10] Le, Q.V., Ngiam, J., Chen, Z., Chia, D., Koh, P.W., Ng, A.Y.: Tiled convolutional neural networks. In: Advances in Neural Information Processing Systems (2010)

[le2011building] Le, Q.V., Ranzato, M., Monga, R., Devin, M., Chen, K., Corrado, G.S., Dean, J., Ng, A.Y.: Building high-level features using large scale unsupervised learning. arXiv preprint arXiv:1112.6209 (2011)

[lowe1999object] Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, Vol. 2 (1999) 1150–1157

[krizhevsky2012imagenet] Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems 25 (2012) 1106–1114
