Pedestrian Detection with Unsupervised Multi-Stage Feature Learning

Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, Yann LeCun, Courant Institute of Mathematical Sciences, New York University

Abstract

Pedestrian detection is a problem of considerable practical interest. Adding to the list of successful applications of deep learning methods to vision, we report state-of-the-art and competitive results on all major pedestrian datasets with a convolutional network model. The model uses a few new twists, such as multi-stage features, connections that skip layers to integrate global shape information with local distinctive motif information, and an unsupervised method based on convolutional sparse coding to pre-train the filters at each stage. %-1.0cm

Pedestrian Detection with Unsupervised Multi-Stage Feature Learning

Pierre Sermanet Koray Kavukcuoglu Soumith Chintala Yann LeCun

Courant Institute of Mathematical Sciences, New York University sermanet,koray,soumith,yann@cs.nyu.edu

Bootstrapping

Pedestrian detection is a problem of considerable practical interest. Adding to the list of successful applications of deep learning methods to vision, we report state-of-theart and competitive results on all major pedestrian datasets with a convolutional network model. The model uses a few new twists, such as multi-stage features, connections that skip layers to integrate global shape information with local distinctive motif information, and an unsupervised method based on convolutional sparse coding to pre-train the filters at each stage.

Introduction

Pedestrian detection is a key problem for surveillance, automotive safety and robotics applications. The wide variety of appearances of pedestrians due to body pose, occlusions, clothing, lighting and backgrounds makes this task challenging.

All existing state-of-the-art methods use a combination of hand-crafted features such as Integral Channel Features [9], HoG [5] and their variations [13, 33] and combinations [38], followed by a trainable classifier such as SVM [13, 28], boosted classifiers [9] or random forests [7]. While low-level features can be designed by hand with good success, mid-level features that combine low-level features are difficult to engineer without the help of some sort of learning procedure. Multi-stage recognizers that learn hierarchies of features tuned to the task at hand can be trained end-to-end with little prior knowledge. Convolutional Networks (ConvNets) [23] are examples of such hierarchical systems with end-to-end feature learning that are trained in a supervised fashion. Recent works have demonstrated the usefulness of unsupervised pre-training for end-to-end training of deep multi-stage architectures using a variety of techniques such as stacked restricted Boltzmann machines [16], stacked auto-encoders [4] and stacked sparse auto-encoders [32], and using new types of non-linear transforms at each layer [17, 20].

Figure 1: 128 9 × 9 filters trained on grayscale INRIA images using Algorithm 1 . It can be seen that in addition to edge detectors at multiple orientations, our systems also learns more complicated features such as corner and junction detectors.

Supervised ConvNets have been used by a number of authors for such applications as face, hand detection [37, 29, 15, 31, 14, 36]. More recently, a large ConvNet by [21] achieved a breakthrough on a 1000-class ImageNet detection task. The main contribution of this paper is to show that the ConvNet model, with a few important twists, consistently yields state of the art and competitive results on all major pedestrian detection benchmarks. The system uses unsupervised convolutional sparse auto-encoders to pre-train features at all levels from the relatively small INRIA dataset [5], and end-to-end supervised training to train the classifier and fine-tune the features in an integrated fashion. Additionally, multi-stage features with layer-skipping connections enable output stages to combine global shape detectors with local motif detectors.

Processing speed in pedestrian detection has recently seen great progress, enabling real-time operation without sacrificing quality. [3] manage to entirely avoid image rescaling for detection while observing quality improvements. While processing speed is not the focus of this paper, features and classifier approximations introduced by [8] and [3] may be applicable to deep learning models for faster detection, in addition to GPU optimizations.

Learning Feature Hierarchies

Much of the work on pedestrian detection have focused on designing representative and powerful features [5, 9, 8, 38]. In this work, we show that generic feature learning algorithms can produce successful feature extractors that can achieve state-of-the-art results.

Supervised learning of end-to-end systems on images have been shown to work well when there is abundant labeled samples [23], including for detection tasks [37, 29, 15, 31, 14, 36]. However, for many input domains, it is hard to find adequate number of labeled data. In this case, one can resort to designing useful features by using domain knowledge, or an alternative way is to use unsupervised learning algorithms. Recently unsupervised learning algorithms have been demonstrated to produce good features for generic object recognition problems [24, 25, 18, 20].

In [16], it was shown that unsupervised learning can be used to train deep hierarchical models and the final representation achieved is actually useful for a variety of different tasks [32, 24, 4]. In this work, we also follow a similar approach and train a generic unsupervised model at each layer using the output representation from the layer before. This process is then followed by supervised updates to the whole hierarchical system using label information.

Figure 2: A subset of 7 × 7 second layer filters trained on grayscale INRIA images using Algorithm 2. Each row in the figure shows filters that connect to a common output feature map. It can be seen that they extract features at similar locations and shapes, e.g. the bottom row tends to aggregate horizontal features towards the bottom of the filters.

Hierarchical Model

A hierarchical feature extraction system consists of multiple levels of feature extractors that perform the same filtering and non-linear transformation functions in successive layers. Using a particular generic parametrized function one can then map the inputs into gradually more higher level (or abstract) representations [23, 16, 4, 32, 24]. In this work we use sparse convolutional feature hierarchies as proposed in [20]. Each layer of the unsupervised model contains a convolutional sparse coding algorithm and a predictor function that can be used for fast inference. After the last layer a classifier is used to map the feature representations into class labels. Both the sparse coding dictionary and the predictor function do not contain any hard-coded parameter and are trained from the input data.

The training procedure for this model is similar to [16]. Each layer is trained in an unsupervised manner using the representation from previous layer (or the input image for the initial layer) separately. After the whole multi-stage system is trained in a layer-wise fashion, the complete architecture followed by a classifier is fine-tuned using labeled data.

Unsupervised Learning

Recently sparse coding has seen much interest in many fields due to its ability to extract useful feature representations from data, The general formulation of sparse coding is a linear reconstruction model using an overcomplete dictionary D ∈ R m × n where m>n and a regularization penalty on the mixing coefficients z ∈ R n .

The aim is to minimize equation 1 with respect to z to obtain the optimal sparse representation z ∗ that correspond to input x ∈ R m . The exact form of s ( z ) depends on the particular sparse coding algorithm that is used, here, we use the ‖ . ‖ 1 norm penalty, which is the sum of the absolute values of all elements of z . It is immediately clear that the solution of this system requires an optimization process. Many efficient algorithms for solving the above convex system has been proposed in recent years [1, 6, 2, 26]. However, our aim is to also learn generic feature extractors. For that reason we minimize equation 1 wrt D too.

This resulting equation is non-convex in D and z at the same time, however keeping one fixed, the problem is still convex wrt to the other variable. All sparse modeling algorithms that adopt the dictionary matrix D exploit this property and perform a coordinate descent like minimization process where each variable is updated in succession. Following [30] many authors have used sparse dictionary learning to represent images [27, 1, 19]. However, most of the sparse coding models use small image patches as input x to learn the dictionary D and then apply the resulting model to every overlapping patch location on the full image. This approach assumes that the sparse representation for two neighboring patches with a single pixel shift is completely independent, thus produces very redundant representations. [20, 39] have introduced convolutional sparse modeling formulations for feature learning and object recognition and we use the Convolutional Predictive Sparse Decomposition (CPSD) model proposed in [20] since it is the only convolutional sparse coding model providing a fast predictor function that is suitable for building multi-stage feature representations. The particular predictor function we use is similar to a single layer ConvNet of the following form:

where ⊗ operator represents convolution operator that applies on a single input and single filter. In this formulation x is a p × p grayscale input image, k ∈ R n × m × m is a set of 2D filters where each filter is k j ∈ R m × m , g ∈ R n and b ∈ R n are vectors with n elements, the predictor output ˜ z ∈ R n × p -m +1 × p -m +1 is a set of feature maps where each of ˜ z j is of size p -m +1 × p -m +1 . Considering this general predictor function, the final form of the convolutional unsupervised energy for grayscale inputs is as follows:

where D is a dictionary of filters the same size as k and β is a hyper-parameter. The unsupervised learning procedure is a two step coordinate descent process. At each iteration, (1) Inference: The parameters W = {D , g, k, b } are kept fixed and equation 6 is minimized to obtain the optimal sparse representation z ∗ , (2) Update: Keeping z ∗ fixed, the parameters W updated using a stochastic gradient step: W ← W -η ∂ E CPSD ∂W where η is the learning rate parameter. The inference procedure requires us to carry out the sparse coding problem solution. For this step we use the FISTA method proposed in [2]. This method is an extension of the original iterative shrinkage and thresholding algorithm [6] using an improved step size calculation with a momentum-like term. We apply the FISTA algorithm in the image domain adopting the convolutional formulation.

For color images or other multi-modal feature representations, the input x is a set of feature maps indexed by i and the representation z is a set of feature maps indexed by j for each input map i . We define a map of connections P from input x to features z . A j th output feature map is connected to a set P j of input feature maps. Thus, the predictor function in Algorithm 1 is defined as:

and the reconstruction is computed using the inverse map ¯ P :

For a fully connected layer, all the input features are connected to all the output features, however it is also common to use sparse connection maps to reduce the number of parameters. The online training algorithm for unsupervised training of a single layer is:

Supervised Training

function Unsup ( x, D , P, { λ, β } , { g, k, b } , η )

Initialize:

and

Experiments

Perform inference , minimize equation 9 wrt z using FISTA [2]

Do a stochastic update on D and W p . D ← D η ∂ E ConvSC ∂ D and W p ← W p -η ∂ E Pred ∂W p until convergence

Return:

, g, k, b

}

end function

Non-Linear Transformations

Once the unsupervised learning for a single stage is completed, the next stage is trained on the feature representation from the previous one. In order to obtain the feature representation for the next stage, we use the predictor function f ( x ) followed by non-linear transformations and pooling. Following the multi-stage framework used in [20], we apply absolute value rectification, local contrast normalization and average down-sampling operations.

Absolute Value Rectification is applied component-wise to the whole feature output from f ( x ) in order to avoid cancellation problems in contrast normalization and pooling steps. Local Contrast Normalization is a non-linear process that enhances the most active feature and suppresses the other ones. The exact form of the operation is as follows:

where i is the feature map index and w is a 9 × 9 Gaussian weighting function with normalized weights so that ∑ ipq w pq = 1 . For each sample, the constant c is set to mean ( σ ) in the experiments.

Average Down-Sampling operation is performed using a fixed size boxcar kernel with a certain step size. The size of the kernel and the stride are given for each experiment in the following sections.

Once a single layer of the network is trained, the features for training a successive layer is extracted using the predictor function followed by non-linear transformations. Detailed procedure of training an N layer hierarchical model is explained in Algorithm 2.

∅

randomly.

Supervised Training

function HierarUnsup ( x, n i , m i , P i , { λ i , β i } , { w i , s i } , i = { 1 ..N } , η )

i Set: i = 1 , X 1 = x , lcn ( x ) using equations 10-11, ds ( X,w,s ) as the down-sampling operator using boxcar kernel of size w × w and stride of size s in both directions. repeat Set: D i , k i ∈ R n i × m i × m i , g i , b i ∈ R n i . {D i , k i , g i , k i , b i } = Unsup ( X i , D i , P i , { λ i , β i } , { g i , k i , b i } , η i ) ˜ z = f ( X i ; g i , k i , b i ) using equation 8. ˜ z = | ˜ z | ˜ z = lcn (˜ z ) X i +1 = ds (˜ z, w i , s i ) i = i +1 until i = N end function

Figure 3: A multi-scale convolutional network. The top row of maps constitute a regular ConvNet [17]. The bottom row in which the 1st stage output is branched, subsampled again and merged into the classifier input provides a multistage component to the classifier stage. The multi-stage features coming out of the 2nd stage extracts a global structure as well as local details.

invertible operations are used in between layers. However, since we use absolute value rectification and local contrast normalization operations mapping the second layer features onto input space is not possible. In Figure 2 we show a subset of 1664 second layer features in the parameter space.

Supervised Training

After the unsupervised learning of the hierarchical feature extraction system is completed using Algorithm 2, we append a classifier function, usually in the form of a linear logistic regression, and perform stochastic online training using labeled data.

Multi-Stage Features

ConvNets are usually organized in a strictly feedforward manner where one layer only takes the output of the previous layer as input. Features extracted this way tend to be high level features after a few stages of convolutions and subsampling. By branching lower levels' outputs into the top classifier (Fig. 3), one produces features that extract both global shapes and structures and local details, such as a global silhouette and face components in the case of human detection. Contrary to [12], the output of the first stage is branched after the non-linear transformations and pooling/subsampling operations rather than before.

We also use color information on the training data. For this purpose we convert all images into YUV image space and subsample the UV features by 3 since color information is in much lower resolution. Then at the first stage, we keep feature extraction systems for Y and UV channels separate. On the Y channel, we use 32 7 × 7 features followed by absolute value rectification, contrast normalization and 3 × 3 subsampling. On the subsampled UV channels, we extract 6 5 × 5 features followed by absolute value rectification and contrast normalization, skipping the usual subsampling step since it was performed beforehand. These features are then concatanated to produce 38 feature maps that are input to the first layer. The second layer feature extraction takes 38 feature maps and produces 68 output features using 2040 9 × 9 features. A randomly selected 20% of the connections in mapping from input features to output features is removed to limit the computational requirements and break the symmetry [23]. The output of the second layer features are then transformed using absolute value rectification and contrast normalization followed by 2 × 2 subsampling. This results in 17824 dimensional feature vector for each sample which is then fed into a linear classifier.

In Table 1, we show that multi-stage features improve accuracy for different tasks, with different magnitudes. Greatest improvements are obtained for pedestrian detection and traffic-sign classification while only minimal gains are obtained for house numbers classification, a less complex task.

Bootstrapping

Bootstrapping is typically used in detection settings by multiple phases of extracting the most offending negative answers and adding these samples to the existing dataset while training. For this purpose, we extract 3000 negative samples per bootstrapping pass and limit the number of most offending answers to 5 for each image. We perform 3 bootstrapping passes in addition to the original training phase (i.e. 4 training passes in total).

Non-Maximum Suppression

Table 1: Error rates improvements of multi-stage features over single-stage features for different types of objects detection and classification. Improvements are significant for multi-scale and textured objects such as traffic signs and pedestrians but minimal for house numbers.

modification is to avoid false positives that are due to pedestrian body parts. However, a drawback to this approach is that it always disregards one of the overlapping pedestrians from detection. Instead of changing the criteria, we actively modify our training set before each bootstrapping phase. We include body part images that cause false positive detection into our bootstrapping image set. Our model can then learn to suppress such responses within a positive window and still detect pedestrians within bigger windows more reliably.

Experiments

We evaluate our system on 5 standard pedestrian detection datasets. However, like most other systems, we only train on the INRIA dataset. We also demonstrate improvements brought by unsupervised training and multistage features. In the following we name our model ConvNet with variants of unsupervised (Convnet-U) and fullysupervised training (Convnet-F) and multi-stage features (Convnet-U-MS and ConvNet-F-MS).

Data Preparation

The ConvNet is trained on the INRIA pedestrian dataset [5]. Pedestrians are extracted into windows of 126 pixels in height and 78 pixels in width. The context ratio is 1.4, i.e. pedestrians are 90 pixels high and the remaining 36 pixels correspond to the background. Each pedestrian image is mirrored along the horizontal axis to expand the dataset. Similarly, we add 5 variations of each original sample using 5 random deformations such as translations and scale. Translations range from -2 to 2 pixels and scale ratios from 0.95 to 1.05. These deformations enforce invariance to small deformations in the input. The range of each deformation determines the trade-off between recognition and localization accuracy during detection. An equal amount of background samples are extracted at random from the negative images and taking approximately 10% of the extracted samples for validation yields a validation set with 2000 samples and training set with 21845 samples. Note that the unsupervised training phase is performed on this initial data before the bootstrapping phase.

Evaluation Protocol

During testing and bootstrapping phases using the INRIA dataset, the images are both up-sampled and sub- sampled. The up-sampling ratio is 1.3 while the subsampling ratio is limited by 0.75 times the network's minimuminput ( 126 × 78 ). We use a scale stride of 1.10 between each scale, while other methods typically use either 1.05 or 1.20 [11]. A higher scale stride is desirable as it implies less computations.

For evaluation we use the bounding boxes files published on the Caltech Pedestrian website 1 and the evaluation software provided by Piotr Dollar (version 3.0.1). In an effort to provide a more accurate evaluation, we improved on both the evaluation formula and the INRIA annotations as follows. The evaluation software was slightly modified to compute the continuous area under curve (AUC) in the entire [0, 1] range rather than from 9 discrete points only (0.01, 0.0178, 0.0316, 0.0562, 0.1, 0.1778, 0.3162, 0.5623 and 1.0 in version 3.0.1). Instead, we compute the entire area under the curve by summing the areas under the piecewise linear interpolation of the curve, between each pair of points. In addition, we also report a 'fixed' version of the annotations for INRIA dataset, which has missing positive labels. The added labels are only used to avoid counting false errors and wrongly penalizing algorithms. The modified code and extra INRIA labels are available at 2 . Table 2 reports results for both original and fixed INRIA datasets. Notice that the continuous AUC and fixed INRIA annotations both yield a reordering of the results (see supplementary material for further evidence that the impact of these modifications is significant enough to be used). To avoid ambiguity, all results with the original discrete AUC are reported in the supplementary paper.

To ensure a fair comparison, we separated systems trained on INRIA (the majority) from systems trained on TUD-MotionPairs and the only system trained on Caltech in table 2. For clarity, only systems trained on INRIA were represented in Figure 5, however all results for all systems are still reported in table 2.

Results

1 http://www.vision.caltech.edu/Image Datasets/CaltechPedestrians

Figure 4: DET curves on the fixed-INRIA dataset for large pedestrians measure report false positives per image (FPPI) against miss rate. Algorithms are sorted from top to bottom using the proposed continuous area under curve measure between 0 and 1 FPPI. On the right, only the ConvNet variants are displayed to highlight the individual contributions of unsupervised learning (ConvNet-U) and multi-stage features learning (ConvNet-F-MS) and their combination (ConvNet-U-MS) compared to the fully-supervised system without multi-stage features (ConvNet-F).

vidual contributions of unsupervised learning (ConvNetU) and multi-stage features learning (ConvNet-F-MS) and their combination (ConvNet-U-MS) compared to the fullysupervised system without multi-stage features (ConvNetF). With 17.1% error rate, unsupervised learning exhibits the most improvements compared to the baseline ConvNetF (23.39%). Multi-stage features without unsupervised learning reach 17.29% error while their combination yields the competitive error rate of 10.55%.

Extensive results comparison of all major pedestrian datasets and published systems is provided in Table 2. Multiple types of measures proposed by [10] are reported. For clarity, we also plot in Figure 5 two of these measures, 'reasonable' and 'large', for INRIA-trained systems. The 'large' plot shows that the ConvNet results in state-of-theart performance with some margin on the ETH, Caltech and TudBrussels datasets and is closely behind LatSvm-V2 and VeryFast for INRIA and Daimler datasets. In the 'reasonable' plot, the ConvNet yields competitive results for INRIA, Daimler and ETH datasets but performs poorly on the Caltech dataset. We suspect the ConvNet with multi-stage features trained at high-resolution is more sensitive to resolution loss than other methods. In future work, a ConvNet trained at multiple resolution will likely learn to use appropriate cues for each resolution regime.

Discussion

We have introduced a new feature learning model with an application to pedestrian detection. Contrary to popular models where the low-level features are hand-designed, our model learns all the features at all levels in a hierarchy. We used the method of [20] as a baseline, and extended it by combining high and low resolution features in the model, and by learning features on the color channels of the input. Using the INRIA dataset, we have shown that these improvements provide clear performance benefits. The resulting model provides state of the art or competitive results on most measures of all publicly available datasets. Small-scale pedestrian measures can be improved in future work by training multiple scale models relying less on highresolution details. While computational speed was not the focus and hence was not reported here, our model was successfully used with near real-time speed in a haptic belt system [22] using parallel hardware. In future work, models designed for speed combined to highly optimized parallel computing on graphics cards is expected to yield competitive computational performance.

Experiments

Figure 5: Reasonable and Large measures for all INRIA-trained systems on all major datasets, using the proposed continuous AUC percentage. The AUC is computed from DET curves (smaller AUC means more accuracy and less false positives). For a clearer overall performance, each ConvNet point is connected by dotted lines. While only the 'reasonable' and 'large' measures are plotted here, all measures are reported in table 2. The ConvNet system yields state-of-the-art or competitive results on most datasets and measures, except for the low resolutions measures on the Caltech dataset because of higher reliance on high-resolution cues than other methods.

Table 2: Performance of all systems on all datasets using the proposed continuous AUC percentage over the range [0,1] from DET curves. The top performing results (among INRIA-trained models) are highlighted in bold for each row. DET curves plot false positives per image (FPPI) against miss rate. Hence a smaller AUC% means a more accurate system with lower amount of false positives. The ConvNet model (ConvNet-U-MS here) holds several state-of-the-art or competitive scores. We report the multiple measures introduced by [10] for all major pedestrian datasets. For readibility, not all measures are reported nor are models not trained on INRIA. All results however are reported in the supplementary paper.

in Neural Information Processing Systems 19 , pages 153-160. MIT Press, 2007. 1, 2

Evidence for using the proposed continuous Area Under Curve measure

Figure 6: Differences between continuous (proposed) and discrete Area Under Curve (AUC) on the INRIA medium scale experiment. In 6(a), we compute the continuous AUC as opposed to a discrete AUC 6(b) based on a few points in the standard benchmarking software. 6(b) clearly shows the shortcomings of the discrete AUC which wrongly attributes a 0% AUC to ConvNet-F instead of 10.36%. Additionally, several models are re-ranked when using the continuous AUC.

Figure 7: Another example of the effects of using the continuous AUC on the ETH large scale experiment. Here several models get re-ranked, including Convnet-U-MS, VeryFast, LatSvm-V2, MultiFtr+Motion, FPDW, MultiFtr+CSS and Shapelet.

Evidence for using the proposed fixed INRIA dataset

Figure 8: Effects of fixing INRIA dataset and AUC computation on results. In 8(d), we report the original results obtained with the INRIA dataset and the Area Under Curve (AUC) as computed by the benchmarking software available at http://www.vision.caltech.edu/Image Datasets/CaltechPedestrians. In 8(c), we use the fixed INRIA dataset instead of the original and observe the re-ranking of several algorithms (ChnFtrs, MLS and ConvNet-F-MS advance by one rank). In 8(b), we use the continuous AUC computation instead of the discrete one and observe the following re-ranking: MultiFtr+CSS, FeatSynth, Convnet-F advance by one rank while LatSvm-V1 by two. In 8(a), we use both the fixed INRIA and the continuous AUC and observe the following re-ranking as opposed to the unmodified 8(d) ranks: ConvNet-U-MS, FPDW, MLS, FeathSynth, Convnet-F, and Shapelet advance by one while LatSvm-V1 by two. Note: the fixed INRIA dataset and the modified benchmark software are available at http://cs.nyu.edu/ ∼ sermanet/data.html.

All results with the continuous AUC measure

Table 3: Results for all experiments on all datasets using the proposed continuous AUC. The top performing results (INRIA-trained only) are highlighted in bold for each row. The continuous AUC percentage is taken over the range [0,1] from DET curves. DET curves plot false positives per image (FPPI) against miss rate. Hence a smaller AUC% means a more accurate system with greater reduction of false positives.

All results with the original discrete AUC measure

Table 4: Results for all experiments on all datasets using the discrete AUC. This table is identical to table 3 except it is using the discrete AUC instead of the proposed continuous AUC.

Task	Single-Stage features	Multi-Stage features	Improvement%
Pedestrians detection (INRIA) (Fig. 8)	23.39%	17.29%	26.1%
Traffic Signs classification (GTSRB) [35]	1.80%	0.83%	54%
House Numbers classification (SVHN) [34]	5.54%	5.36%	3.2%

Trained on	INRIA
ConvNet	ChnFtrs CrossTalk FPDW FeatSynth FtrMine HOG HikSvm HogLbp LatSvm-V1 LatSvm-V2 MLS MultiFtr Pls PoseInv Shapelet VJ VeryFast
All-AUC%
INRIA-fixed	12.0	13.3	12.7	13.6	19.0	43.8	32.9	29.9	26.2	28.8	12.9	17.8	24.2	29.0	66.7	66.8	59.4	10.3
INRIA	12.7	13.9	12.9	14.0	19.6	44.8	34.3	31.4	28.0	29.8	13.3	18.2	25.3	30.1	70.1	68.7	60.7	10.5
Daimler	58.6	-	-	-	-	-	67.9	62.4	69.8	64.2	62.3	51.8	68.8	-	-	94.9	94.8	-
ETH	47.1	48.7	43.8	51.5	-	-	54.9	61.6	51.1	69.1	49.3	42.8	51.7	47.4	86.5	85.6	84.5	46.9
Caltech-UsaTest	90.9	77.1	77.8	78.1	78.1	86.7	85.5	86.8	87.9	91.7	84.2	83.4	83.4	81.2	92.6	95.4	99.1	-
TudBrussels	66.8	57.6	55.0	59.0	-	-	73.6	76.4	77.2	85.7	67.2	59.2	70.5	66.1	83.8	93.8	92.7	-
Reasonable - AUC %- > 50 pixels& no/partial occlusion
INRIA-fixed	12.0	13.3	12.7	13.6	19.0	43.8	32.9	29.9	26.2	28.8	12.9	17.8	24.2	29.0	66.7	66.8	59.4	10.3
INRIA	12.7	13.9	12.9	14.0	19.6	44.8	34.3	31.4	28.0	29.8	13.3	18.2	25.3	30.1	70.1	68.7	60.7	10.5
Daimler	24.9	-	-	-	-	-	46.2	38.9	40.3	47.2	29.2	18.3	45.5	-	-	90.2	91.3	-
ETH	38.9	44.2	39.1	46.8	-	-	51.1	59.2	43.7	66.6	41.1	37.1	47.5	42.2	85.2	83.9	83.6	42.5
Caltech-UsaTest	71.5	46.4	46.0	46.9	49.2	66.3	57.8	62.0	62.2	73.4	56.0	51.9	59.3	52.9	78.2	87.0	91.8	-
TudBrussels	59.1	48.8	47.0	50.4	-	-	68.1	72.4	71.8	84.0	59.6	52.0	64.8	59.1	80.8	92.5	91.1	-
Large - AUC %- > 100 pixels
INRIA-fixed	10.5	11.6	10.7	11.7	17.3	42.8	31.8	28.5	24.8	27.2	9.9	16.6	22.6	27.8	66.4	66.1	59.0	9.1
INRIA	11.2	12.2	11.0	12.1	18.0	43.9	33.2	30.0	26.6	28.2	10.3	17.0	23.7	28.8	69.9	68.1	60.3	9.4
Daimler	7.8				-	-	31.7	25.2	11.8		6.9	13.9	30.9	-	-	72.3	83.9	-
		-	- 28.2	- 33.4	-	-	33.1	36.4	29.5	22.9			35.1	26.6	63.9	75.8	76.7	24.4
ETH	24.4	30.2	25.8	26.4	28.6	47.8	28.0	26.5		47.6 40.7	26.8	24.8 22.7	34.3	30.4	54.5	69.6
Caltech-UsaTest TudBrussels 33.5	14.8	24.1 36.2	37.3	35.0 -		-	56.2	52.2	18.4	64.5	22.5 46.3	43.1 41.8	55.5	43.3	70.0	80.3	80.9 86.0	- -
Near - AUC %- > 80 pixels
INRIA-fixed	11.3	11.6	11.0	11.9	17.3	42.6	31.5	28.5	24.7	27.5	11.1	16.5	22.7	27.7	66.0	66.1	58.7	9.7
INRIA	11.9	12.2	11.2	12.3	17.9	43.7	32.9	30.0	26.5	28.5	11.5	16.8	23.8	28.8	69.4	68.1	60.0	9.9
Daimler	10.0	-	-	-	-	-	36.8	30.4	10.9	27.6	10.8	14.7	33.7	-	-	78.3	86.3	-
ETH	28.9	35.2	30.9	37.5	-	-	40.5	45.6	31.7	52.2	31.4	29.5	39.4	34.1	80.6	79.9	80.0	29.8
Caltech-UsaTest	27.3	27.4	28.9	28.4	29.5	48.9	33.1	34.3	24.7	47.2	26.7	29.1	40.8	31.2	66.8	75.7	85.3
TudBrussels	40.4	39.5	40.3	38.8	-	-	61.1	58.7	50.5	70.9	47.1	45.3	57.2	49.6	80.0	85.6	89.0	- -
Medium - AUC %- 30-80 pixels
INRIA-fixed	33.1	100.0	99.7	100.0	100.0	100.0	100.0	100.0	85.3	85.3	99.7	100.0	86.1	100.0	99.7	99.7	91.5	27.9
INRIA	33.1	100.0	99.7	100.0	100.0	100.0	100.0	100.0	85.3	85.3	99.7	100.0	86.1	100.0	99.7	99.7	91.5	27.9
Daimler							62.1	54.4	70.7	58.5				-	-	95.2	93.7
	54.2	-	-	-	-	-					60.0	44.7	63.2					-
ETH	55.4	42.9	42.1	45.4	-	-	49.9	54.7	61.2	71.5	57.3	43.9	47.3	45.0	73.9	74.5	71.2	48.3
Caltech-UsaTest	92.2	69.5	70.6	70.6		82.1		82.6	91.5	91.1	80.8	80.6	77.8	75.8	88.8		98.7	-
					70.2		81.4									94.7
				59.7				74.9		85.5						94.1	91.7	-
			55.5		-				82.9
															79.4
													68.7
												59.1
		57.4									68.2			65.0
						-	71.4
TudBrussels	67.8

Trained on									INRIA												TUD-MotionPairs	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech
	ChnFtrs ConvNet-F ConvNet-F-MS ConvNet-U ConvNet-U-MS CrossTalk FPDW FeatSynth FtrMine HOG HikSvm HogLbp LatSvm-V1 LatSvm-V2 MLS MultiFtr Pls PoseInv Shapelet VJ VeryFast MultiFtr+CSS MultiFtr+Motion MultiResC
all - ContinuousAUC%
INRIA-fixed	13.3	26.7	19.5	19.8	12.0	12.7	13.6	19.0	43.8 32.9	29.9	26.2	28.8	12.9	17.8	24.2 29.0	66.7	66.8	59.4	10.3	16.1	-	-	-	-	-	-	-	-	-	-
INRIA	13.9	27.4	21.4	20.9	12.7	12.9	14.0	19.6	44.8	34.3 31.4	28.0	29.8	13.3	18.2	25.3 30.1	70.1	68.7	60.7	10.5	-		-	-	-	-	-	-	-	-	-
Daimler	-	-	-	-	58.6	-	-	-	-	67.9 62.4	69.8	64.2	62.3	51.8	68.8 -	-		94.8	94.9 - 46.9	48.6	40.5	-	-	-	-	-	-	-	-	-
ETH	48.7	-	-	-	47.1	43.8	51.5	-	-	54.9 61.6	51.1	69.1	49.3	42.8	51.7 47.4	86.5	85.6	84.5		59.5 81.2	58.2	-	-	-	-	-	-	-	-	-
Caltech-UsaTest	77.1	-	-	-	90.9	77.8	78.1	78.1	86.7	85.5 86.8	87.9	91.7	84.2	83.4	83.4 81.2	92.6	95.4	99.1			- 77.9	74.2	74.2	74.2	74.2	74.2	74.2	74.2	74.2	74.2
TudBrussels	57.6	-	-	-	66.8	55.0	59.0	-	-	73.6 76.4	77.2	85.7	67.2	59.2	70.5 66.1	83.8	93.8	92.7	-	57.8	51.4	-	-	-	-	-	-	-	-	-
reasonable - ContinuousAUC%
INRIA-fixed	13.3	26.7	19.5	19.8	12.0	12.7 13.6	19.0		43.8 32.9	29.9	26.2	28.8	12.9	17.8	24.2 29.0	66.7		59.4	10.3	16.1	-	-	-	-	-	-	-	-	-	-
INRIA	13.9	27.4	21.4	20.9	12.7	12.9 14.0		19.6	44.8	34.3 31.4	28.0	29.8	13.3	18.2	25.3 30.1	70.1	66.8 68.7	60.7	10.5	16.9	-	-	-	-	-	-	-	-	-	-
Daimler	-	-	-	-	24.9				-	46.2 38.9	40.3	47.2	29.2	18.3	45.5 -	-	90.2	91.3	-	29.0	23.3	-	-	-	-	-	-	-	-	-
ETH	44.2	-	-	-	38.9	-	-	-	-	51.1 59.2	43.7	66.6		37.1	47.5 42.2	85.2	83.9	83.6	42.5	49.4
Caltech-UsaTest	46.4	-		-		39.1 46.0	46.8 46.9	- 49.2	66.3	57.8 62.0			41.1			78.2		91.8	-	52.1	47.9 42.3	- 38.1	- 38.1	- 38.1	- 38.1	- 38.1	- 38.1	- 38.1	- 38.1	- 38.1
TudBrussels			-		71.5						62.2	73.4	56.0	51.9	59.3 52.9		87.0 92.5				43.8
	48.8	-	-	-	59.1	47.0	50.4	-	-	68.1 72.4	71.8	84.0	59.6	52.0	64.8 59.1	80.8		91.1	-	50.2		-	-	-	-	-	-	-	-	-
scale=large - ContinuousAUC%
INRIA-fixed	11.6	23.4	17.3	17.1	10.5	10.7	11.7	17.3	42.8 31.8	28.5	24.8	27.2	9.9	16.6	22.6	66.4	66.1	59.0	9.1		-	-	-	-	-	-	-	-	-	-
INRIA	12.2	24.2	19.1	18.2	11.2		12.1	18.0	43.9	33.2 30.0	26.6	28.2	10.3	17.0	27.8 23.7 28.8	69.9	68.1	60.3	9.4	14.8 15.6	-	-	-	-	-	-	-	-	-	-
Daimler	-	-	-	-	7.8	11.0 -	-	-	-	31.7 25.2	11.8	22.9	6.9	13.9	30.9 -	-	72.3	83.9	-	18.7	18.6	-	-	-	-	-	-	-	-	-
ETH									-	33.1 36.4	29.5		26.8				63.9	76.7	24.4	36.7	31.6	-	-	-	-	-	-	-	-	-
	30.2	-	-	-	24.4	28.2	33.4	-		28.0		47.6		24.8	35.1 26.6		75.8					12.5	12.5	12.5	12.5	12.5	12.5	12.5	12.5	12.5
Caltech-UsaTest	24.1	-	-	-	14.8	25.8	26.4	28.6	47.8	26.5	18.4	40.7	22.5	22.7	34.3 30.4	54.5	69.6	80.9	-	28.8	12.0
TudBrussels	36.2	-	-	-	33.5	37.3	35.0	-	-	56.2 52.2	46.3	64.5	43.1		41.8 55.5	43.3	70.0	80.3	86.0		- 45.3	-	-	-	-	-	-	-	-	-
scale=near - ContinuousAUC%
INRIA-fixed -	11.6	24.5	18.6	17.7	11.3	11.0	11.9	17.3	42.6	31.5 28.5	24.7	27.5	11.1		16.5	22.7 27.7	66.0	66.1		14.7	58.7 9.7					-
INRIA	12.2	25.3	20.5	18.8	11.9	11.2	12.3	17.9	43.7	32.9 30.0	26.5	28.5	11.5	16.8	23.8 28.8	69.4		68.1	60.0			9.9 -	15.5	-		-	-	-	-	-
Daimler	-	-	-	-	10.0	-	-	-	-	36.8 30.4	10.9	27.6	10.8	14.7		-	-	86.3 29.8	18.4	19.5
	35.2		-	-	28.9	30.9	37.5	-	-	45.6	31.7	52.2	31.4	29.5	33.7 39.4 34.1	80.6		78.3 79.9	80.0		36.3	- -	- -	- -	- -	- -	- -	- -	- -	- -
ETH Caltech-UsaTest	27.4	- -	-	-	27.3	28.9	28.4	29.5	48.9	40.5 33.1 34.3	24.7	47.2	26.7	29.1	40.8 31.2		66.8	75.7	85.3	40.0 30.4	16.4	15.0	15.0	15.0	15.0	15.0	15.0	15.0	15.0	15.0
TudBrussels	39.5	-	-	-	40.4	40.3	38.8	-	-	61.1 58.7	50.5	70.9	47.1	45.3	57.2 49.6	80.0		85.6	89.0	- -	46.5				39.8	-	-	-	-	-
scale=medium - ContinuousAUC%
INRIA-fixed	100.0	10.4	32.3	31.9	33.2 33.2		100.0	100.0	100.0 100.0	100.0 85.3	85.3	85.3	99.7	100.0	86.1 100.0	99.7 99.7	99.7	91.5	27.9		91.3	-	- -	- -	- -	- -	- -	- -	- -	- -
INRIA	100.0	10.4	32.3	31.9	99.7 99.7	100.0		100.0	100.0	100.0 100.0			99.7				85.3	100.0	86.1	100.0	99.7	27.9		91.3	-
Daimler	-	-	-	-	54.2		-	-	-	62.1 54.4	70.7	58.5	60.0	44.7	63.2 -	95.2	91.5 93.7		-	43.4		34.0		-	-	-	-	-	-	-
ETH	42.9	-	-	-	55.4	- 42.1	45.4	-	-	49.9 54.7	61.2	71.5	57.3		43.9	47.3 45.0	-	73.9	71.2	48.3		54.7	55.2			-	-	-	-	-
Caltech-UsaTest	69.5	-	-	-	92.2	70.6	70.6	70.2	82.1	82.6 91.5	91.1			80.8		80.6	77.8 75.8	88.8		74.5 94.7				98.7	76.0	73.4			65.4	65.4
TudBrussels	57.4	-	-	-	67.8	55.5 59.7	-		81.4 - 71.4	74.9 82.9		85.5		68.2	68.7 65.0		59.1 79.4				94.1		91.7	- -	55.0	48.6	-	-	-	-
scale=far - ContinuousAUC%
Caltech-UsaTest 94.7 100.0	93.4 -		-	-	100.0	94.2			95.0 96.2	98.2		97.4	95.4				100.0 100.0 95.9	97.8 98.7 100.0	99.9	99.3	-	97.8 98.7 100.0	97.8 98.7 100.0	97.8 98.7 100.0	97.8 98.7 100.0	94.6	97.8 98.7 100.0	95.3	97.8 98.7 100.0	97.8 98.7 100.0

Trained on									INRIA												TUD-MotionPairs	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech
ChnFtrs ConvNet-F ConvNet-F-MS ConvNet-U ConvNet-U-MS CrossTalk FPDW FeatSynth FtrMine HOG HikSvm HogLbp LatSvm-V1 LatSvm-V2 MLS MultiFtr Pls PoseInv Shapelet VJ VeryFast MultiFtr+CSS MultiFtr+Motion MultiResC
all - Discrete AUC%
INRIA-fixed	20.2	41.7	30.5	31.5	18.7	18.6	20.6	30.3	57.2 44.4	40.7	36.9	42.9	19.6	23.6	35.6 39.0	77.9	79.6 81.7	71.6		15.8 23.3	-	-	-	-	-	-	-	-	-	-	-	-	-	-
INRIA	22.2	42.2	34.2	32.6	19.8	19.0	21.5	30.9	58.3	46.0 42.8	39.1	43.8	20.0	23.9	36.5	40.1	80.1	72.5		16.0 24.7	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Daimler	-	-	-	-	64.5	-	-	-	-	77.4 74.3	74.8	72.2	68.3	60.2		76.6 -	- 92.8	96.7	-		48.1	-	-	-	-	-	-	-	-	-	-	-	-	-
ETH	61.9	-	-	-	57.8	56.5	64.3	-	-	67.3 73.8	61.7	78.6	58.3	54.9	63.6	59.8	96.8 92.0	90.4	59.3	58.2 70.7	70.1	-	-	-	-	-	-	-	-	-	-	-	-	-
Caltech-UsaTest TudBrussels	82.8 67.7	- -	- -	- -	92.9 74.9	82.7 65.1	83.7 70.1	84.1 -	90.6 -	90.4 91.4 81.8 85.3	89.8 85.4	93.7 91.3	87.7 75.6	87.2 69.0	87.8 85.7 78.1 76.2	95.5 90.0	97.0 96.2	99.5 95.5	- -	85.5 66.5	82.8 62.0	80.4 -	80.4 -	80.4 -	80.4 -	80.4 -	80.4 -	80.4 -	80.4 -	80.4 -	80.4 -	80.4 -	80.4 -	80.4 -
reasonable - Discrete AUC%
INRIA-fixed	20.2	41.7	30.5	31.5	18.7	18.6	20.6	30.3	57.2	44.4 40.7	36.9	42.9	19.6	23.6	35.6	39.0	77.9 79.6	71.6	15.8	23.3	-	-	-	-	-	-	-	-	-	-	-	-	-	-
INRIA	22.2	42.2	34.2	32.6	19.8	19.0	21.5	30.9	58.3	46.0 42.8	39.1	43.8	20.0	23.9	36.5	40.1	80.1 81.7	72.5	16.0	24.7	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Daimler	-	-	-	-	32.5	-	-	-	-	59.8 55.0	48.7	57.6	38.0	27.6	57.2 -	-	93.8	94.7	-	39.2	29.2	-	-	-	-	-	-	-	-	-	-	-	-	-
ETH	57.5	-	-	-	50.3	51.9	60.1	-	-	64.2 72.0	55.2	76.7	50.9	49.4	59.8	54.9	92.1	90.9 89.9	54.8	60.7	60.0	-	-	-	-	-	-	-	-	-	-	-	-	-
Caltech-UsaTest	56.3	-	-	-	77.2	53.9	57.4	60.2	74.4	68.5 73.4	67.8	79.8	63.3	61.0	68.3 62.1	86.3	91.4	94.7	-	60.9	50.9	48.5	48.5	48.5	48.5	48.5	48.5	48.5	48.5	48.5	48.5	48.5	48.5	48.5
TudBrussels	60.3	-	-	-	68.8	58.0	63.0	-	-	77.9 82.5	81.7	90.2	69.6	62.6	73.4 70.7		88.0	95.4 94.5	-	59.5	54.8	-	-	-	-	-	-	-	-	-	-	-	-	-
scale=large - Discrete AUC%
INRIA-fixed	18.1	37.6	28.0	27.8	16.5	16.3	18.3	28.4	56.3 43.4	39.4	35.5	41.1	15.9	22.3	33.9 37.7	77.6	79.1	71.1	14.0	21.9	-	-	-	-	-	-	-	-	-	-	-	-	-	-
INRIA	20.0	38.6		29.0	17.3	16.7	19.1	29.0	57.5	45.1 41.5	37.7	42.2	16.3	22.6		34.8 38.8	79.7	72.1	14.5	23.2	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Daimler	-	-	31.2 -	-	11.6	-	-	-	-	47.4 40.5	17.0	32.8	10.6	20.6	43.4 -	-	81.2 81.6	89.6	-	23.7	21.4	-	-	-	-	-	-	-	-	-	-	-	-	-
ETH	42.4	-	-	-	34.6	37.5	46.4	-	-	44.2 49.4	40.8	60.6	35.3	33.9	46.0	37.3 77.1		84.8 84.8	35.1	46.5	40.3	-	-	-	-	-	-	-	-	-	-	-	-	-
Caltech-UsaTest	30.2	-	-	-	21.4	29.8	33.4	36.2	59.1	37.9 39.0	22.7	49.9	28.2	31.1	43.0 36.5	71.6	76.6	86.2	-	36.3	15.7	17.8	17.8	17.8	17.8	17.8	17.8	17.8	17.8	17.8	17.8	17.8	17.8	17.8
TudBrussels		-				45.2	44.1		-	71.2 67.6			54.2					91.9	-	54.9	46.5	-	-	-	-	-	-	-	-	-	-	-	-	-
	43.1		-	-	41.1			-			59.3	76.6		51.6	63.5 54.3	82.3	87.1
scale=near - Discrete AUC%
INRIA-fixed	18.3	38.7	29.4	28.3	17.7	16.7	18.7	28.3	56.2	43.1 39.3	35.4	41.5	17.5		22.2	34.1 37.7	77.4	79.1	70.9	14.9	-	-	-	-	-	-	-	-	-	-	-	-	-	-
INRIA	20.2	39.8	33.2	29.4	18.8	17.1	19.5	28.9	57.3	41.5	37.6	42.5	17.9	22.5	35.0 38.8		79.5	81.3	15.1	21.7		23.2				-	-	-	-	-	-	-	-	-
Daimler	-	-	-	-	14.8	-	-	-	-	44.7 52.0 45.5	15.6	37.9	16.1	22.5		46.2 -	-	85.6	71.9 91.3	-		25.2		23.6
ETH	48.3	-	-	-	39.3	42.1	50.7	-	- 53.5	59.2	43.0	64.7	40.6	40.0	50.3 46.0	89.1		88.0	87.2	41.1	45.4		- -	- -	- -	- -	- -	- -	- -	- -	- -	- -	- -	- -
Caltech-UsaTest	35.1	-	-	-	36.7	34.3	36.7	38.9	58.4 44.0	48.0	30.8	57.1	34.3	37.7		50.0 39.8	77.9	82.9	89.9	51.0 - 38.8	21.2	21.2	21.2	21.2	21.2	21.2	21.2	21.2	21.2	21.2	21.2	21.2	21.2	21.2
TudBrussels	49.6	-	-	-	49.9	49.7	50.0	-	- 73.5	73.0	64.3	81.3	58.0	55.2	66.3	61.6	88.0	90.9	93.0	-	50.1	-	-	-	-	-	-	-	-	-	-	-	-	-
scale=medium - DiscreteAUC%
92.3
INRIA-fixed	100.0	0.0	44.7	48.3	45.8	100.0	100.0	100.0	100.0 100.0	100.0	92.3	88.8	100.0	100.0		100.0 100.0	100.0	100.0	92.6	39.0	-	91.7	91.7	91.7	91.7	91.7	91.7	91.7	91.7	91.7	91.7	91.7	91.7	91.7
INRIA	100.0	0.0	44.7	48.3	45.8	100.0	100.0	100.0	100.0	100.0 100.0	92.3	88.8	100.0	100.0	92.3	100.0	100.0	92.6	39.0		-			91.7			- -	- -	- -	- -	- -	- -	- -	- -
Daimler	-	-	-	-	60.7	-		-	- 72.8	68.0	67.8		67.1		53.9	-	97.0	96.1		-		53.5	41.2	-	-	-	-	-	-	-	-	-	-	-
ETH	53.6	-	-	-	65.7	53.0	- 56.0	-	- 59.8	67.6	75.0 70.3	80.3	67.1	54.2	72.2 58.0	- 55.9		85.0	60.5			82.6	80.6		67.8		-	-	-	-	-	-	-	-
Caltech-UsaTest	77.4	-	-	-	94.9	77.4	78.4	78.2	87.3	87.4 88.5	93.3		85.5	85.5	84.1	82.0	93.1		-		80.1			67.5 82.1		73.2	73.2	73.2	73.2	73.2	73.2	73.2	73.2	73.2
TudBrussels	67.7	-	-	-	76.5	64.9	70.6	-	- 79.4	93.0 84.4 90.2	91.1			77.2	69.2		76.7 75.7	86.8	96.7 96.5			94.7	99.4	-	63.8	-		58.1
scale=far - DiscreteAUC%
Caltech-UsaTest	95.2	-	-	-	100.0	96.6	95.7	96.4 occ=partial	96.9 97.1 - Discrete	99.1 AUC%	100.0	98.5 97.9	100.0		96.9 99.2	100.0	99.9				99.7		-		96.7			96.8	100.0	100.0	100.0	100.0	100.0	100.0
Caltech-UsaTest

Pedestrian detection is a problem of considerable practical interest. Adding to the list of successful applications of deep learning methods to vision, we report state-of-the-art and competitive results on all major pedestrian datasets with a convolutional network model. The model uses a few new twists, such as multi-stage features, connections that skip layers to integrate global shape information with local distinctive motif information, and an unsupervised method based on convolutional sparse coding to pre-train the filters at each stage.

Recently sparse coding has seen much interest in many fields due to its ability to extract useful feature representations from data, The general formulation of sparse coding is a linear reconstruction model using an overcomplete dictionary 𝒟∈ℝm×n𝒟superscriptℝ𝑚𝑛{\cal D}\in\mathbb{R}^{m\times n} where m>n𝑚𝑛m>n and a regularization penalty on the mixing coefficients z∈ℝn𝑧superscriptℝ𝑛z\in\mathbb{R}^{n}.

The aim is to minimize equation 1 with respect to z𝑧z to obtain the optimal sparse representation z∗superscript𝑧z^{*} that correspond to input x∈ℝm𝑥superscriptℝ𝑚x\in\mathbb{R}^{m}. The exact form of s(z)𝑠𝑧s(z) depends on the particular sparse coding algorithm that is used, here, we use the ∥.∥1|.|_{1} norm penalty, which is the sum of the absolute values of all elements of z𝑧z. It is immediately clear that the solution of this system requires an optimization process. Many efficient algorithms for solving the above convex system has been proposed in recent years [1, 6, 2, 26]. However, our aim is to also learn generic feature extractors. For that reason we minimize equation 1 wrt 𝒟𝒟{\cal D} too.

This resulting equation is non-convex in 𝒟𝒟{\cal D} and z𝑧z at the same time, however keeping one fixed, the problem is still convex wrt to the other variable. All sparse modeling algorithms that adopt the dictionary matrix 𝒟𝒟{\cal D} exploit this property and perform a coordinate descent like minimization process where each variable is updated in succession. Following [30] many authors have used sparse dictionary learning to represent images [27, 1, 19]. However, most of the sparse coding models use small image patches as input x𝑥x to learn the dictionary 𝒟𝒟{\cal D} and then apply the resulting model to every overlapping patch location on the full image. This approach assumes that the sparse representation for two neighboring patches with a single pixel shift is completely independent, thus produces very redundant representations. [20, 39] have introduced convolutional sparse modeling formulations for feature learning and object recognition and we use the Convolutional Predictive Sparse Decomposition (CPSD) model proposed in [20] since it is the only convolutional sparse coding model providing a fast predictor function that is suitable for building multi-stage feature representations. The particular predictor function we use is similar to a single layer ConvNet of the following form:

where ⊗tensor-product\otimes operator represents convolution operator that applies on a single input and single filter. In this formulation x𝑥x is a p×p𝑝𝑝p\times p grayscale input image, k∈ℝn×m×m𝑘superscriptℝ𝑛𝑚𝑚k\in\mathbb{R}^{n\times m\times m} is a set of 2D filters where each filter is kj∈ℝm×msubscript𝑘𝑗superscriptℝ𝑚𝑚k_{j}\in\mathbb{R}^{m\times m}, g∈ℝn𝑔superscriptℝ𝑛g\in\mathbb{R}^{n} and b∈ℝn𝑏superscriptℝ𝑛b\in\mathbb{R}^{n} are vectors with n𝑛n elements, the predictor output z~∈ℝn×p−m+1×p−m+1~~𝑧superscriptℝ𝑛𝑝𝑚1𝑝𝑚1\tilde{z}\in\mathbb{R}^{n\times p-m+1\times p-m+1} is a set of feature maps where each of z~~jsubscript~𝑧𝑗\tilde{z}_{j} is of size p−m+1×p−m+1𝑝𝑚1𝑝𝑚1p-m+1\times p-m+1. Considering this general predictor function, the final form of the convolutional unsupervised energy for grayscale inputs is as follows:

where 𝒟𝒟{\cal D} is a dictionary of filters the same size as k𝑘k and β𝛽\beta is a hyper-parameter. The unsupervised learning procedure is a two step coordinate descent process. At each iteration, (1) Inference: The parameters W={𝒟,g,k,b}𝑊𝒟𝑔𝑘𝑏W={{\cal D},g,k,b} are kept fixed and equation 6 is minimized to obtain the optimal sparse representation z∗superscript𝑧z^{}, (2) Update: Keeping z∗superscript𝑧z^{} fixed, the parameters W𝑊W updated using a stochastic gradient step: W←W−η∂𝔼CPSD∂W←𝑊𝑊𝜂subscript𝔼𝐶𝑃𝑆𝐷𝑊W\leftarrow W-\eta\frac{\partial{\mathbb{E}}_{CPSD}}{\partial W} where η𝜂\eta is the learning rate parameter. The inference procedure requires us to carry out the sparse coding problem solution. For this step we use the FISTA method proposed in [2]. This method is an extension of the original iterative shrinkage and thresholding algorithm [6] using an improved step size calculation with a momentum-like term. We apply the FISTA algorithm in the image domain adopting the convolutional formulation.

For color images or other multi-modal feature representations, the input x𝑥x is a set of feature maps indexed by i𝑖i and the representation z𝑧z is a set of feature maps indexed by j𝑗j for each input map i𝑖i. We define a map of connections P𝑃P from input x𝑥x to features z𝑧z. A jthsuperscript𝑗𝑡ℎj^{th} output feature map is connected to a set Pjsubscript𝑃𝑗P_{j} of input feature maps. Thus, the predictor function in Algorithm 1 is defined as:

and the reconstruction is computed using the inverse map P¯¯𝑃\bar{P}:

Once the unsupervised learning for a single stage is completed, the next stage is trained on the feature representation from the previous one. In order to obtain the feature representation for the next stage, we use the predictor function f(x)𝑓𝑥f(x) followed by non-linear transformations and pooling. Following the multi-stage framework used in [20], we apply absolute value rectification, local contrast normalization and average down-sampling operations.

Absolute Value Rectification is applied component-wise to the whole feature output from f(x)𝑓𝑥f(x) in order to avoid cancellation problems in contrast normalization and pooling steps.

Local Contrast Normalization is a non-linear process that enhances the most active feature and suppresses the other ones. The exact form of the operation is as follows:

where i𝑖i is the feature map index and w𝑤w is a 9×9999\times 9 Gaussian weighting function with normalized weights so that ∑ipqwpq=1subscript𝑖𝑝𝑞subscript𝑤𝑝𝑞1\sum_{ipq}w_{pq}=1. For each sample, the constant c𝑐c is set to mean(σ)𝑚𝑒𝑎𝑛𝜎mean(\sigma) in the experiments.

Average Down-Sampling operation is performed using a fixed size boxcar kernel with a certain step size. The size of the kernel and the stride are given for each experiment in the following sections.

Once a single layer of the network is trained, the features for training a successive layer is extracted using the predictor function followed by non-linear transformations. Detailed procedure of training an N𝑁N layer hierarchical model is explained in Algorithm 2.

The first layer features can be easily displayed in the parameter space since the parameter space and the input space is same, however visualizing the second and higher level features in the input space can only be possible when only invertible operations are used in between layers. However, since we use absolute value rectification and local contrast normalization operations mapping the second layer features onto input space is not possible. In Figure 2 we show a subset of 166416641664 second layer features in the parameter space.

ConvNets are usually organized in a strictly feed-forward manner where one layer only takes the output of the previous layer as input. Features extracted this way tend to be high level features after a few stages of convolutions and subsampling. By branching lower levels’ outputs into the top classifier (Fig. 3), one produces features that extract both global shapes and structures and local details, such as a global silhouette and face components in the case of human detection. Contrary to [12], the output of the first stage is branched after the non-linear transformations and pooling/subsampling operations rather than before.

We also use color information on the training data. For this purpose we convert all images into YUV image space and subsample the UV features by 3 since color information is in much lower resolution. Then at the first stage, we keep feature extraction systems for Y and UV channels separate. On the Y channel, we use 323232 7×7777\times 7 features followed by absolute value rectification, contrast normalization and 3×3333\times 3 subsampling. On the subsampled UV channels, we extract 666 5×5555\times 5 features followed by absolute value rectification and contrast normalization, skipping the usual subsampling step since it was performed beforehand. These features are then concatanated to produce 383838 feature maps that are input to the first layer. The second layer feature extraction takes 383838 feature maps and produces 686868 output features using 204020402040 9×9999\times 9 features. A randomly selected 20% of the connections in mapping from input features to output features is removed to limit the computational requirements and break the symmetry [23]. The output of the second layer features are then transformed using absolute value rectification and contrast normalization followed by 2×2222\times 2 subsampling. This results in 178241782417824 dimensional feature vector for each sample which is then fed into a linear classifier.

Non-maximum suppression (NMS) is used to resolve conflicts when several bounding boxes overlap. For both INRIA and Caltech experiments we use the widely accepted PASCAL overlap criteria to determine a matching score between two bounding boxes (intersectionunion𝑖𝑛𝑡𝑒𝑟𝑠𝑒𝑐𝑡𝑖𝑜𝑛𝑢𝑛𝑖𝑜𝑛\frac{intersection}{union}) and if two boxes overlap by more than 60%, only the one with the highest score is kept. In [10]’s addendum, the matching criteria is modified by replacing the union of the two boxes with the minimum of the two. Therefore, if a box is fully contained in another one the small box is selected. The goal for this modification is to avoid false positives that are due to pedestrian body parts. However, a drawback to this approach is that it always disregards one of the overlapping pedestrians from detection. Instead of changing the criteria, we actively modify our training set before each bootstrapping phase. We include body part images that cause false positive detection into our bootstrapping image set. Our model can then learn to suppress such responses within a positive window and still detect pedestrians within bigger windows more reliably.

We evaluate our system on 5 standard pedestrian detection datasets. However, like most other systems, we only train on the INRIA dataset. We also demonstrate improvements brought by unsupervised training and multi-stage features. In the following we name our model ConvNet with variants of unsupervised (Convnet-U) and fully-supervised training (Convnet-F) and multi-stage features (Convnet-U-MS and ConvNet-F-MS).

During testing and bootstrapping phases using the INRIA dataset, the images are both up-sampled and sub-sampled. The up-sampling ratio is 1.3 while the sub-sampling ratio is limited by 0.75 times the network’s minimum input (126×7812678126\times 78). We use a scale stride of 1.10 between each scale, while other methods typically use either 1.05 or 1.20 [11]. A higher scale stride is desirable as it implies less computations.

For evaluation we use the bounding boxes files published on the Caltech Pedestrian website 111http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians and the evaluation software provided by Piotr Dollar (version 3.0.1). In an effort to provide a more accurate evaluation, we improved on both the evaluation formula and the INRIA annotations as follows. The evaluation software was slightly modified to compute the continuous area under curve (AUC) in the entire [0, 1] range rather than from 9 discrete points only (0.01, 0.0178, 0.0316, 0.0562, 0.1, 0.1778, 0.3162, 0.5623 and 1.0 in version 3.0.1). Instead, we compute the entire area under the curve by summing the areas under the piece-wise linear interpolation of the curve, between each pair of points. In addition, we also report a ’fixed’ version of the annotations for INRIA dataset, which has missing positive labels. The added labels are only used to avoid counting false errors and wrongly penalizing algorithms. The modified code and extra INRIA labels are available at 222http://cs.nyu.edu/~sermanet/data.html#inria. Table 2 reports results for both original and fixed INRIA datasets. Notice that the continuous AUC and fixed INRIA annotations both yield a reordering of the results (see supplementary material for further evidence that the impact of these modifications is significant enough to be used). To avoid ambiguity, all results with the original discrete AUC are reported in the supplementary paper.

In Figure 8, we plot DET curves, i.e. miss rate versus false positives per image (FPPI), on the fixed INRIA dataset and rank algorithms along two measures: the error rate at 1 FPPI and the area under curve (AUC) rate in the [0, 1] FPPI range. This graph shows the individual contributions of unsupervised learning (ConvNet-U) and multi-stage features learning (ConvNet-F-MS) and their combination (ConvNet-U-MS) compared to the fully-supervised system without multi-stage features (ConvNet-F). With 17.1% error rate, unsupervised learning exhibits the most improvements compared to the baseline ConvNet-F (23.39%). Multi-stage features without unsupervised learning reach 17.29% error while their combination yields the competitive error rate of 10.55%.

Extensive results comparison of all major pedestrian datasets and published systems is provided in Table 2. Multiple types of measures proposed by [10] are reported. For clarity, we also plot in Figure 5 two of these measures, ’reasonable’ and ’large’, for INRIA-trained systems. The ’large’ plot shows that the ConvNet results in state-of-the-art performance with some margin on the ETH, Caltech and TudBrussels datasets and is closely behind LatSvm-V2 and VeryFast for INRIA and Daimler datasets. In the ’reasonable’ plot, the ConvNet yields competitive results for INRIA, Daimler and ETH datasets but performs poorly on the Caltech dataset. We suspect the ConvNet with multi-stage features trained at high-resolution is more sensitive to resolution loss than other methods. In future work, a ConvNet trained at multiple resolution will likely learn to use appropriate cues for each resolution regime.

We have introduced a new feature learning model with an application to pedestrian detection. Contrary to popular models where the low-level features are hand-designed, our model learns all the features at all levels in a hierarchy. We used the method of [20] as a baseline, and extended it by combining high and low resolution features in the model, and by learning features on the color channels of the input. Using the INRIA dataset, we have shown that these improvements provide clear performance benefits. The resulting model provides state of the art or competitive results on most measures of all publicly available datasets. Small-scale pedestrian measures can be improved in future work by training multiple scale models relying less on high-resolution details. While computational speed was not the focus and hence was not reported here, our model was successfully used with near real-time speed in a haptic belt system [22] using parallel hardware. In future work, models designed for speed combined to highly optimized parallel computing on graphics cards is expected to yield competitive computational performance.

Table: S2.T1: Error rates improvements of multi-stage features over single-stage features for different types of objects detection and classification. Improvements are significant for multi-scale and textured objects such as traffic signs and pedestrians but minimal for house numbers.

Task	Single-Stage features	Multi-Stage features	Improvement %
Pedestrians detection (INRIA) (Fig. 8)	23.39%	17.29%	26.1%
Traffic Signs classification (GTSRB) [35]	1.80%	0.83%	54%
House Numbers classification (SVHN) [34]	5.54%	5.36%	3.2%

Table: S3.T2: Performance of all systems on all datasets using the proposed continuous AUC percentage over the range [0,1] from DET curves. The top performing results (among INRIA-trained models) are highlighted in bold for each row. DET curves plot false positives per image (FPPI) against miss rate. Hence a smaller AUC% means a more accurate system with lower amount of false positives. The ConvNet model (ConvNet-U-MS here) holds several state-of-the-art or competitive scores. We report the multiple measures introduced by [10] for all major pedestrian datasets. For readibility, not all measures are reported nor are models not trained on INRIA. All results however are reported in the supplementary paper.


Trained on	INRIA
	ConvNet	ChnFtrs	CrossTalk	FPDW	FeatSynth	FtrMine	HOG	HikSvm	HogLbp	LatSvm-V1	LatSvm-V2	MLS	MultiFtr	Pls	PoseInv	Shapelet	VJ	VeryFast
All - AUC %
INRIA-fixed	12.0	13.3	12.7	13.6	19.0	43.8	32.9	29.9	26.2	28.8	12.9	17.8	24.2	29.0	66.7	66.8	59.4	10.3
INRIA	12.7	13.9	12.9	14.0	19.6	44.8	34.3	31.4	28.0	29.8	13.3	18.2	25.3	30.1	70.1	68.7	60.7	10.5
Daimler	58.6	-	-	-	-	-	67.9	62.4	69.8	64.2	62.3	51.8	68.8	-	-	94.9	94.8	-
ETH	47.1	48.7	43.8	51.5	-	-	54.9	61.6	51.1	69.1	49.3	42.8	51.7	47.4	86.5	85.6	84.5	46.9
Caltech-UsaTest	90.9	77.1	77.8	78.1	78.1	86.7	85.5	86.8	87.9	91.7	84.2	83.4	83.4	81.2	92.6	95.4	99.1	-
TudBrussels	66.8	57.6	55.0	59.0	-	-	73.6	76.4	77.2	85.7	67.2	59.2	70.5	66.1	83.8	93.8	92.7	-
Reasonable - AUC % - >50 pixels & no/partial occlusion
INRIA-fixed	12.0	13.3	12.7	13.6	19.0	43.8	32.9	29.9	26.2	28.8	12.9	17.8	24.2	29.0	66.7	66.8	59.4	10.3
INRIA	12.7	13.9	12.9	14.0	19.6	44.8	34.3	31.4	28.0	29.8	13.3	18.2	25.3	30.1	70.1	68.7	60.7	10.5
Daimler	24.9	-	-	-	-	-	46.2	38.9	40.3	47.2	29.2	18.3	45.5	-	-	90.2	91.3	-
ETH	38.9	44.2	39.1	46.8	-	-	51.1	59.2	43.7	66.6	41.1	37.1	47.5	42.2	85.2	83.9	83.6	42.5
Caltech-UsaTest	71.5	46.4	46.0	46.9	49.2	66.3	57.8	62.0	62.2	73.4	56.0	51.9	59.3	52.9	78.2	87.0	91.8	-
TudBrussels	59.1	48.8	47.0	50.4	-	-	68.1	72.4	71.8	84.0	59.6	52.0	64.8	59.1	80.8	92.5	91.1	-
Large - AUC % - >100 pixels
INRIA-fixed	10.5	11.6	10.7	11.7	17.3	42.8	31.8	28.5	24.8	27.2	9.9	16.6	22.6	27.8	66.4	66.1	59.0	9.1
INRIA	11.2	12.2	11.0	12.1	18.0	43.9	33.2	30.0	26.6	28.2	10.3	17.0	23.7	28.8	69.9	68.1	60.3	9.4
Daimler	7.8	-	-	-	-	-	31.7	25.2	11.8	22.9	6.9	13.9	30.9	-	-	72.3	83.9	-
ETH	24.4	30.2	28.2	33.4	-	-	33.1	36.4	29.5	47.6	26.8	24.8	35.1	26.6	63.9	75.8	76.7	24.4
Caltech-UsaTest	14.8	24.1	25.8	26.4	28.6	47.8	28.0	26.5	18.4	40.7	22.5	22.7	34.3	30.4	54.5	69.6	80.9	-
TudBrussels	33.5	36.2	37.3	35.0	-	-	56.2	52.2	46.3	64.5	43.1	41.8	55.5	43.3	70.0	80.3	86.0	-
Near - AUC % - >80 pixels
INRIA-fixed	11.3	11.6	11.0	11.9	17.3	42.6	31.5	28.5	24.7	27.5	11.1	16.5	22.7	27.7	66.0	66.1	58.7	9.7
INRIA	11.9	12.2	11.2	12.3	17.9	43.7	32.9	30.0	26.5	28.5	11.5	16.8	23.8	28.8	69.4	68.1	60.0	9.9
Daimler	10.0	-	-	-	-	-	36.8	30.4	10.9	27.6	10.8	14.7	33.7	-	-	78.3	86.3	-
ETH	28.9	35.2	30.9	37.5	-	-	40.5	45.6	31.7	52.2	31.4	29.5	39.4	34.1	80.6	79.9	80.0	29.8
Caltech-UsaTest	27.3	27.4	28.9	28.4	29.5	48.9	33.1	34.3	24.7	47.2	26.7	29.1	40.8	31.2	66.8	75.7	85.3	-
TudBrussels	40.4	39.5	40.3	38.8	-	-	61.1	58.7	50.5	70.9	47.1	45.3	57.2	49.6	80.0	85.6	89.0	-
Medium - AUC % - 30-80 pixels
INRIA-fixed	33.1	100.0	99.7	100.0	100.0	100.0	100.0	100.0	85.3	85.3	99.7	100.0	86.1	100.0	99.7	99.7	91.5	27.9
INRIA	33.1	100.0	99.7	100.0	100.0	100.0	100.0	100.0	85.3	85.3	99.7	100.0	86.1	100.0	99.7	99.7	91.5	27.9
Daimler	54.2	-	-	-	-	-	62.1	54.4	70.7	58.5	60.0	44.7	63.2	-	-	95.2	93.7	-
ETH	55.4	42.9	42.1	45.4	-	-	49.9	54.7	61.2	71.5	57.3	43.9	47.3	45.0	73.9	74.5	71.2	48.3
Caltech-UsaTest	92.2	69.5	70.6	70.6	70.2	82.1	81.4	82.6	91.5	91.1	80.8	80.6	77.8	75.8	88.8	94.7	98.7	-
TudBrussels	67.8	57.4	55.5	59.7	-	-	71.4	74.9	82.9	85.5	68.2	59.1	68.7	65.0	79.4	94.1	91.7	-

Table: S7.T3: Results for all experiments on all datasets using the proposed continuous AUC. The top performing results (INRIA-trained only) are highlighted in bold for each row. The continuous AUC percentage is taken over the range [0,1] from DET curves. DET curves plot false positives per image (FPPI) against miss rate. Hence a smaller AUC% means a more accurate system with greater reduction of false positives.


Trained on	INRIA	TUD-MotionPairs	Caltech
	ChnFtrs	ConvNet-F	ConvNet-F-MS	ConvNet-U	ConvNet-U-MS	CrossTalk	FPDW	FeatSynth	FtrMine	HOG	HikSvm	HogLbp	LatSvm-V1	LatSvm-V2	MLS	MultiFtr	Pls	PoseInv	Shapelet	VJ	VeryFast	MultiFtr+CSS	MultiFtr+Motion	MultiResC
all - Continuous AUC%
INRIA-fixed	13.3	26.7	19.5	19.8	12.0	12.7	13.6	19.0	43.8	32.9	29.9	26.2	28.8	12.9	17.8	24.2	29.0	66.7	66.8	59.4	10.3	16.1	-	-
INRIA	13.9	27.4	21.4	20.9	12.7	12.9	14.0	19.6	44.8	34.3	31.4	28.0	29.8	13.3	18.2	25.3	30.1	70.1	68.7	60.7	10.5	16.9	-	-
Daimler	-	-	-	-	58.6	-	-	-	-	67.9	62.4	69.8	64.2	62.3	51.8	68.8	-	-	94.9	94.8	-	48.6	40.5	-
ETH	48.7	-	-	-	47.1	43.8	51.5	-	-	54.9	61.6	51.1	69.1	49.3	42.8	51.7	47.4	86.5	85.6	84.5	46.9	59.5	58.2	-
Caltech-UsaTest	77.1	-	-	-	90.9	77.8	78.1	78.1	86.7	85.5	86.8	87.9	91.7	84.2	83.4	83.4	81.2	92.6	95.4	99.1	-	81.2	77.9	74.2
TudBrussels	57.6	-	-	-	66.8	55.0	59.0	-	-	73.6	76.4	77.2	85.7	67.2	59.2	70.5	66.1	83.8	93.8	92.7	-	57.8	51.4	-
reasonable - Continuous AUC%
INRIA-fixed	13.3	26.7	19.5	19.8	12.0	12.7	13.6	19.0	43.8	32.9	29.9	26.2	28.8	12.9	17.8	24.2	29.0	66.7	66.8	59.4	10.3	16.1	-	-
INRIA	13.9	27.4	21.4	20.9	12.7	12.9	14.0	19.6	44.8	34.3	31.4	28.0	29.8	13.3	18.2	25.3	30.1	70.1	68.7	60.7	10.5	16.9	-	-
Daimler	-	-	-	-	24.9	-	-	-	-	46.2	38.9	40.3	47.2	29.2	18.3	45.5	-	-	90.2	91.3	-	29.0	23.3	-
ETH	44.2	-	-	-	38.9	39.1	46.8	-	-	51.1	59.2	43.7	66.6	41.1	37.1	47.5	42.2	85.2	83.9	83.6	42.5	49.4	47.9	-
Caltech-UsaTest	46.4	-	-	-	71.5	46.0	46.9	49.2	66.3	57.8	62.0	62.2	73.4	56.0	51.9	59.3	52.9	78.2	87.0	91.8	-	52.1	42.3	38.1
TudBrussels	48.8	-	-	-	59.1	47.0	50.4	-	-	68.1	72.4	71.8	84.0	59.6	52.0	64.8	59.1	80.8	92.5	91.1	-	50.2	43.8	-
scale=large - Continuous AUC%
INRIA-fixed	11.6	23.4	17.3	17.1	10.5	10.7	11.7	17.3	42.8	31.8	28.5	24.8	27.2	9.9	16.6	22.6	27.8	66.4	66.1	59.0	9.1	14.8	-	-
INRIA	12.2	24.2	19.1	18.2	11.2	11.0	12.1	18.0	43.9	33.2	30.0	26.6	28.2	10.3	17.0	23.7	28.8	69.9	68.1	60.3	9.4	15.6	-	-
Daimler	-	-	-	-	7.8	-	-	-	-	31.7	25.2	11.8	22.9	6.9	13.9	30.9	-	-	72.3	83.9	-	18.7	18.6	-
ETH	30.2	-	-	-	24.4	28.2	33.4	-	-	33.1	36.4	29.5	47.6	26.8	24.8	35.1	26.6	63.9	75.8	76.7	24.4	36.7	31.6	-
Caltech-UsaTest	24.1	-	-	-	14.8	25.8	26.4	28.6	47.8	28.0	26.5	18.4	40.7	22.5	22.7	34.3	30.4	54.5	69.6	80.9	-	28.8	12.0	12.5
TudBrussels	36.2	-	-	-	33.5	37.3	35.0	-	-	56.2	52.2	46.3	64.5	43.1	41.8	55.5	43.3	70.0	80.3	86.0	-	45.3	38.1	-
scale=near - Continuous AUC%
INRIA-fixed	11.6	24.5	18.6	17.7	11.3	11.0	11.9	17.3	42.6	31.5	28.5	24.7	27.5	11.1	16.5	22.7	27.7	66.0	66.1	58.7	9.7	14.7	-	-
INRIA	12.2	25.3	20.5	18.8	11.9	11.2	12.3	17.9	43.7	32.9	30.0	26.5	28.5	11.5	16.8	23.8	28.8	69.4	68.1	60.0	9.9	15.5	-	-
Daimler	-	-	-	-	10.0	-	-	-	-	36.8	30.4	10.9	27.6	10.8	14.7	33.7	-	-	78.3	86.3	-	18.4	19.5	-
ETH	35.2	-	-	-	28.9	30.9	37.5	-	-	40.5	45.6	31.7	52.2	31.4	29.5	39.4	34.1	80.6	79.9	80.0	29.8	40.0	36.3	-
Caltech-UsaTest	27.4	-	-	-	27.3	28.9	28.4	29.5	48.9	33.1	34.3	24.7	47.2	26.7	29.1	40.8	31.2	66.8	75.7	85.3	-	30.4	16.4	15.0
TudBrussels	39.5	-	-	-	40.4	40.3	38.8	-	-	61.1	58.7	50.5	70.9	47.1	45.3	57.2	49.6	80.0	85.6	89.0	-	46.5	39.8	-
scale=medium - Continuous AUC%
INRIA-fixed	100.0	10.4	32.3	31.9	33.2	99.7	100.0	100.0	100.0	100.0	100.0	85.3	85.3	99.7	100.0	86.1	100.0	99.7	99.7	91.5	27.9	91.3	-	-
INRIA	100.0	10.4	32.3	31.9	33.2	99.7	100.0	100.0	100.0	100.0	100.0	85.3	85.3	99.7	100.0	86.1	100.0	99.7	99.7	91.5	27.9	91.3	-	-
Daimler	-	-	-	-	54.2	-	-	-	-	62.1	54.4	70.7	58.5	60.0	44.7	63.2	-	-	95.2	93.7	-	43.4	34.0	-
ETH	42.9	-	-	-	55.4	42.1	45.4	-	-	49.9	54.7	61.2	71.5	57.3	43.9	47.3	45.0	73.9	74.5	71.2	48.3	55.2	54.7	-
Caltech-UsaTest	69.5	-	-	-	92.2	70.6	70.6	70.2	82.1	81.4	82.6	91.5	91.1	80.8	80.6	77.8	75.8	88.8	94.7	98.7	-	76.0	73.4	65.4
TudBrussels	57.4	-	-	-	67.8	55.5	59.7	-	-	71.4	74.9	82.9	85.5	68.2	59.1	68.7	65.0	79.4	94.1	91.7	-	55.0	48.6	-
scale=far - Continuous AUC%
Caltech-UsaTest	93.4	-	-	-	100.0	95.4	94.2	94.7	95.0	96.2	98.2	100.0	97.8	97.4	100.0	95.9	98.7	100.0	99.9	99.3	-	95.3	94.6	100.0
occ=partial - Continuous AUC%
Caltech-UsaTest	61.3	-	-	-	81.9	69.2	66.9	67.0	81.6	77.1	80.3	75.0	84.1	76.7	71.3	78.6	68.1	85.7	90.5	96.9	-	76.6	64.4	61.3
occ=heavy - Continuous AUC%
Caltech-UsaTest	91.3	-	-	-	96.3	90.7	91.8	87.8	95.5	93.7	93.2	95.6	94.4	93.3	92.1	95.0	92.3	97.2	97.3	98.2	-	91.1	87.8	84.0
ar=atypical - Continuous AUC%
Caltech-UsaTest	57.3	-	-	-	74.5	55.4	61.9	60.3	76.8	75.3	77.4	77.2	85.8	70.8	71.5	73.3	69.1	86.0	92.1	93.6	-	64.3	48.9	50.6

Table: S8.T4: Results for all experiments on all datasets using the discrete AUC. This table is identical to table 3 except it is using the discrete AUC instead of the proposed continuous AUC.


Trained on	INRIA	TUD-MotionPairs	Caltech
	ChnFtrs	ConvNet-F	ConvNet-F-MS	ConvNet-U	ConvNet-U-MS	CrossTalk	FPDW	FeatSynth	FtrMine	HOG	HikSvm	HogLbp	LatSvm-V1	LatSvm-V2	MLS	MultiFtr	Pls	PoseInv	Shapelet	VJ	VeryFast	MultiFtr+CSS	MultiFtr+Motion	MultiResC
all - Discrete AUC%
INRIA-fixed	20.2	41.7	30.5	31.5	18.7	18.6	20.6	30.3	57.2	44.4	40.7	36.9	42.9	19.6	23.6	35.6	39.0	77.9	79.6	71.6	15.8	23.3	-	-
INRIA	22.2	42.2	34.2	32.6	19.8	19.0	21.5	30.9	58.3	46.0	42.8	39.1	43.8	20.0	23.9	36.5	40.1	80.1	81.7	72.5	16.0	24.7	-	-
Daimler	-	-	-	-	64.5	-	-	-	-	77.4	74.3	74.8	72.2	68.3	60.2	76.6	-	-	96.8	96.7	-	58.2	48.1	-
ETH	61.9	-	-	-	57.8	56.5	64.3	-	-	67.3	73.8	61.7	78.6	58.3	54.9	63.6	59.8	92.8	92.0	90.4	59.3	70.7	70.1	-
Caltech-UsaTest	82.8	-	-	-	92.9	82.7	83.7	84.1	90.6	90.4	91.4	89.8	93.7	87.7	87.2	87.8	85.7	95.5	97.0	99.5	-	85.5	82.8	80.4
TudBrussels	67.7	-	-	-	74.9	65.1	70.1	-	-	81.8	85.3	85.4	91.3	75.6	69.0	78.1	76.2	90.0	96.2	95.5	-	66.5	62.0	-
reasonable - Discrete AUC%
INRIA-fixed	20.2	41.7	30.5	31.5	18.7	18.6	20.6	30.3	57.2	44.4	40.7	36.9	42.9	19.6	23.6	35.6	39.0	77.9	79.6	71.6	15.8	23.3	-	-
INRIA	22.2	42.2	34.2	32.6	19.8	19.0	21.5	30.9	58.3	46.0	42.8	39.1	43.8	20.0	23.9	36.5	40.1	80.1	81.7	72.5	16.0	24.7	-	-
Daimler	-	-	-	-	32.5	-	-	-	-	59.8	55.0	48.7	57.6	38.0	27.6	57.2	-	-	93.8	94.7	-	39.2	29.2	-
ETH	57.5	-	-	-	50.3	51.9	60.1	-	-	64.2	72.0	55.2	76.7	50.9	49.4	59.8	54.9	92.1	90.9	89.9	54.8	60.7	60.0	-
Caltech-UsaTest	56.3	-	-	-	77.2	53.9	57.4	60.2	74.4	68.5	73.4	67.8	79.8	63.3	61.0	68.3	62.1	86.3	91.4	94.7	-	60.9	50.9	48.5
TudBrussels	60.3	-	-	-	68.8	58.0	63.0	-	-	77.9	82.5	81.7	90.2	69.6	62.6	73.4	70.7	88.0	95.4	94.5	-	59.5	54.8	-
scale=large - Discrete AUC%
INRIA-fixed	18.1	37.6	28.0	27.8	16.5	16.3	18.3	28.4	56.3	43.4	39.4	35.5	41.1	15.9	22.3	33.9	37.7	77.6	79.1	71.1	14.0	21.9	-	-
INRIA	20.0	38.6	31.2	29.0	17.3	16.7	19.1	29.0	57.5	45.1	41.5	37.7	42.2	16.3	22.6	34.8	38.8	79.7	81.2	72.1	14.5	23.2	-	-
Daimler	-	-	-	-	11.6	-	-	-	-	47.4	40.5	17.0	32.8	10.6	20.6	43.4	-	-	81.6	89.6	-	23.7	21.4	-
ETH	42.4	-	-	-	34.6	37.5	46.4	-	-	44.2	49.4	40.8	60.6	35.3	33.9	46.0	37.3	77.1	84.8	84.8	35.1	46.5	40.3	-
Caltech-UsaTest	30.2	-	-	-	21.4	29.8	33.4	36.2	59.1	37.9	39.0	22.7	49.9	28.2	31.1	43.0	36.5	71.6	76.6	86.2	-	36.3	15.7	17.8
TudBrussels	43.1	-	-	-	41.1	45.2	44.1	-	-	71.2	67.6	59.3	76.6	54.2	51.6	63.5	54.3	82.3	87.1	91.9	-	54.9	46.5	-
scale=near - Discrete AUC%
INRIA-fixed	18.3	38.7	29.4	28.3	17.7	16.7	18.7	28.3	56.2	43.1	39.3	35.4	41.5	17.5	22.2	34.1	37.7	77.4	79.1	70.9	14.9	21.7	-	-
INRIA	20.2	39.8	33.2	29.4	18.8	17.1	19.5	28.9	57.3	44.7	41.5	37.6	42.5	17.9	22.5	35.0	38.8	79.5	81.3	71.9	15.1	23.2	-	-
Daimler	-	-	-	-	14.8	-	-	-	-	52.0	45.5	15.6	37.9	16.1	22.5	46.2	-	-	85.6	91.3	-	25.2	23.6	-
ETH	48.3	-	-	-	39.3	42.1	50.7	-	-	53.5	59.2	43.0	64.7	40.6	40.0	50.3	46.0	89.1	88.0	87.2	41.1	51.0	45.4	-
Caltech-UsaTest	35.1	-	-	-	36.7	34.3	36.7	38.9	58.4	44.0	48.0	30.8	57.1	34.3	37.7	50.0	39.8	77.9	82.9	89.9	-	38.8	21.7	21.2
TudBrussels	49.6	-	-	-	49.9	49.7	50.0	-	-	73.5	73.0	64.3	81.3	58.0	55.2	66.3	61.6	88.0	90.9	93.0	-	56.3	50.1	-
scale=medium - Discrete AUC%
INRIA-fixed	100.0	0.0	44.7	48.3	45.8	100.0	100.0	100.0	100.0	100.0	100.0	92.3	88.8	100.0	100.0	92.3	100.0	100.0	100.0	92.6	39.0	91.7	-	-
INRIA	100.0	0.0	44.7	48.3	45.8	100.0	100.0	100.0	100.0	100.0	100.0	92.3	88.8	100.0	100.0	92.3	100.0	100.0	100.0	92.6	39.0	91.7	-	-
Daimler	-	-	-	-	60.7	-	-	-	-	72.8	68.0	75.0	67.8	67.1	53.9	72.2	-	-	97.0	96.1	-	53.5	41.2	-
ETH	53.6	-	-	-	65.7	53.0	56.0	-	-	59.8	67.6	70.3	80.3	67.1	54.2	58.0	55.9	85.0	82.6	80.6	60.5	67.5	67.8	-
Caltech-UsaTest	77.4	-	-	-	94.9	77.4	78.4	78.2	87.3	87.4	88.5	93.0	93.3	85.5	85.5	84.1	82.0	93.1	96.7	99.4	-	82.1	80.1	73.2
TudBrussels	67.7	-	-	-	76.5	64.9	70.6	-	-	79.4	84.4	90.2	91.1	77.2	69.2	76.7	75.7	86.8	96.5	94.7	-	63.8	58.1	-
scale=far - Discrete AUC%
Caltech-UsaTest	95.2	-	-	-	100.0	96.6	95.7	96.4	96.9	97.1	99.1	100.0	98.5	97.9	100.0	96.9	99.2	100.0	99.9	99.7	-	96.7	96.8	100.0
occ=partial - Discrete AUC%
Caltech-UsaTest	73.0	-	-	-	87.1	76.1	77.6	75.8	88.3	84.5	88.3	79.9	89.1	81.3	79.7	85.9	75.0	92.3	93.5	98.7	-	81.4	73.0	71.6
occ=heavy - Discrete AUC%
Caltech-UsaTest	94.8	-	-	-	97.6	93.7	95.6	92.9	97.7	96.0	95.3	96.7	96.4	95.5	94.7	96.5	94.8	98.3	98.3	98.8	-	93.7	92.6	90.3
ar=atypical - Discrete AUC%
Caltech-UsaTest	67.7	-	-	-	81.1	64.3	72.4	71.7	85.6	85.1	86.2	83.4	89.7	77.4	79.5	82.4	77.8	91.9	95.0	96.7	-	74.1	57.1	61.8

Refer to caption 128128128 9×9999\times 9 filters trained on grayscale INRIA images using Algorithm 1. It can be seen that in addition to edge detectors at multiple orientations, our systems also learns more complicated features such as corner and junction detectors.

Refer to caption A subset of 7×7777\times 7 second layer filters trained on grayscale INRIA images using Algorithm 2. Each row in the figure shows filters that connect to a common output feature map. It can be seen that they extract features at similar locations and shapes, e.g. the bottom row tends to aggregate horizontal features towards the bottom of the filters.

Refer to caption A multi-scale convolutional network. The top row of maps constitute a regular ConvNet [17]. The bottom row in which the 1st stage output is branched, subsampled again and merged into the classifier input provides a multi-stage component to the classifier stage. The multi-stage features coming out of the 2nd stage extracts a global structure as well as local details.

Refer to caption DET curves on the fixed-INRIA dataset for large pedestrians measure report false positives per image (FPPI) against miss rate. Algorithms are sorted from top to bottom using the proposed continuous area under curve measure between 0 and 1 FPPI. On the right, only the ConvNet variants are displayed to highlight the individual contributions of unsupervised learning (ConvNet-U) and multi-stage features learning (ConvNet-F-MS) and their combination (ConvNet-U-MS) compared to the fully-supervised system without multi-stage features (ConvNet-F).

Refer to caption Reasonable and Large measures for all INRIA-trained systems on all major datasets, using the proposed continuous AUC percentage. The AUC is computed from DET curves (smaller AUC means more accuracy and less false positives). For a clearer overall performance, each ConvNet point is connected by dotted lines. While only the ’reasonable’ and ’large’ measures are plotted here, all measures are reported in table 2. The ConvNet system yields state-of-the-art or competitive results on most datasets and measures, except for the low resolutions measures on the Caltech dataset because of higher reliance on high-resolution cues than other methods.

Refer to caption (a) Continuous Area Under Curve

Refer to caption (c) Fixed INRIA dataset + Discrete AUC

$$ \code^* = \argmin_{\code} | \inp - \dic \code |_2^2 + \lambda s(\code) \label{eqn:sc} $$ \tag{eqn:sc}

$$ \displaystyle f(x;g,k,b) $$

$$ \displaystyle{\mathbb{E}}_{CPSD} $$

$$ \tilde{z}j = g_j \times \tanh{\left( \sum{i \in P_j} \left( \inp_i \conv k_{j,i} \right) + b_j \right)} \label{eqn:convpred} $$ \tag{eqn:convpred}

$$ \en_{ConvSC} = \sum_i | \inp_i - \sum_{j \in \bar{P_i}}\dic_{i,j} \conv \code_j |_2^2 + \lambda |z|_1 \label{eqn:convsc} $$ \tag{eqn:convsc}

$$ f(\inp;g,k,b) & = \tilde{\code} = {\tilde{\code}j}{j=1..n}\ \tilde{\code}_j & = g_j \times \tanh(\inp \conv k_j + b_j) $$

$$ \en_{CPSD} & = \en_{ConvSC} + \beta \en_{Pred} \label{eqn:cpsd}\ \en_{ConvSC} & = \left | \inp - {\sum}_j \dic_j \conv \code_j \right |_2^2 + \lambda |z|1 \label{eqn:csc}\ \en{Pred} & = | \code^* - f(\inp;g,k,b) |_2^2 \label{eqn:pred} $$ \tag{eqn:cpsd}

$$ v_{i} & = x_{i} - x_i \conv w;,;;;\sigma = \sqrt{\sum_{i} w \conv v_i^2} \label{eqn:lcn1}\ y_{i} & = \frac{v_{i}}{{\rm max}(c,\sigma)} \label{eqn:lcn3} $$ \tag{eqn:lcn1}

Algorithm: algorithm
\caption{Single layer unsupervised training.}
  \label{alg:convpsd}
\begin{algorithmic}
  \FUNCTION{${\rm\bf Unsup}(\inp, \dic, P, \{ \lambda, \beta \} , \{ g, k, b \},\eta)$}
    \STATE {\bfseries Set:} $f(\inp;g,k,b)$ from eqn~\ref{eqn:convpred}, $W^p = \{ g,k,b \}$.
    \STATE {\bfseries Initialize:} $\code=\emptyset$, $\dic$ and $W^p$ randomly.
    \REPEAT
    \STATE Perform {\bf inference}, minimize equation~\ref{eqn:convsc} wrt $\code$ using FISTA~\cite{beck-fista-09}
    \STATE Do a stochastic {\bf update} on $\dic$ and $W^p$. $\dic \leftarrow \dic - \eta \deri{\en_{ConvSC}}{\dic}$ and $W^p \leftarrow W^p - \eta \deri{\en_{Pred}}{W^p}$
    \UNTIL{convergence}
    \STATE {\bfseries Return:} $\{\dic,g,k,b \}$
  \ENDFUNCTION
\end{algorithmic}

Algorithm: algorithm
\caption{Multi-layer unsupervised training.}
  \label{alg:mconvpsd}
\begin{algorithmic}
  \FUNCTION{${\rm\bf HierarUnsup}(\inp, n_i, m_i, P_i, \{ \lambda_i, \beta_i \}, \{ w_i, s_i \},$\\ $i = \{1 .. N\}, \eta_i)$}
  \STATE {\bfseries Set:} $i=1$, $X_1 = \inp$, $lcn(x)$ using equations~\ref{eqn:lcn1}-\ref{eqn:lcn3}, $ds(X,w,s)$ as the down-sampling operator using boxcar kernel of size $w \times w$ and stride of size $s$ in both directions.
  \REPEAT
   \STATE {\bfseries Set:}  $\dic_i,k_i \in \Real^{n_i \times m_i \times m_i}$, $g_i,b_i \in \Real^{n_i}$.
   \STATE $\{\dic_i,k_i,g_i,k_i,b_i\} = $\\ $\;\;\;\;\;\;\;Unsup(X_i, \dic_i, P_i, \{\lambda_i,\beta_i\}, \{g_i,k_i,b_i\},\eta_i)$
   \STATE $\tilde{z} = f(X_i;g_i,k_i,b_i)$ using equation~\ref{eqn:convpred}.
   \STATE $\tilde{z} = |\tilde{z}|$
   \STATE $\tilde{z} = lcn(\tilde{z})$
   \STATE $X_{i+1} = ds(\tilde{z},w_i, s_i)$
   \STATE $i = i+1$
  \UNTIL{$i=N$}
  \ENDFUNCTION
\end{algorithmic}

Task	Single-Stage features	Multi-Stage features	Improvement%
Pedestrians detection (INRIA) (Fig. 8)	23.39%	17.29%	26.1%
Traffic Signs classification (GTSRB) [35]	1.80%	0.83%	54%
House Numbers classification (SVHN) [34]	5.54%	5.36%	3.2%

Trained on	INRIA
ConvNet	ChnFtrs CrossTalk FPDW FeatSynth FtrMine HOG HikSvm HogLbp LatSvm-V1 LatSvm-V2 MLS MultiFtr Pls PoseInv Shapelet VJ VeryFast
All-AUC%
INRIA-fixed	12.0	13.3	12.7	13.6	19.0	43.8	32.9	29.9	26.2	28.8	12.9	17.8	24.2	29.0	66.7	66.8	59.4	10.3
INRIA	12.7	13.9	12.9	14.0	19.6	44.8	34.3	31.4	28.0	29.8	13.3	18.2	25.3	30.1	70.1	68.7	60.7	10.5
Daimler	58.6	-	-	-	-	-	67.9	62.4	69.8	64.2	62.3	51.8	68.8	-	-	94.9	94.8	-
ETH	47.1	48.7	43.8	51.5	-	-	54.9	61.6	51.1	69.1	49.3	42.8	51.7	47.4	86.5	85.6	84.5	46.9
Caltech-UsaTest	90.9	77.1	77.8	78.1	78.1	86.7	85.5	86.8	87.9	91.7	84.2	83.4	83.4	81.2	92.6	95.4	99.1	-
TudBrussels	66.8	57.6	55.0	59.0	-	-	73.6	76.4	77.2	85.7	67.2	59.2	70.5	66.1	83.8	93.8	92.7	-
Reasonable - AUC %- > 50 pixels& no/partial occlusion
INRIA-fixed	12.0	13.3	12.7	13.6	19.0	43.8	32.9	29.9	26.2	28.8	12.9	17.8	24.2	29.0	66.7	66.8	59.4	10.3
INRIA	12.7	13.9	12.9	14.0	19.6	44.8	34.3	31.4	28.0	29.8	13.3	18.2	25.3	30.1	70.1	68.7	60.7	10.5
Daimler	24.9	-	-	-	-	-	46.2	38.9	40.3	47.2	29.2	18.3	45.5	-	-	90.2	91.3	-
ETH	38.9	44.2	39.1	46.8	-	-	51.1	59.2	43.7	66.6	41.1	37.1	47.5	42.2	85.2	83.9	83.6	42.5
Caltech-UsaTest	71.5	46.4	46.0	46.9	49.2	66.3	57.8	62.0	62.2	73.4	56.0	51.9	59.3	52.9	78.2	87.0	91.8	-
TudBrussels	59.1	48.8	47.0	50.4	-	-	68.1	72.4	71.8	84.0	59.6	52.0	64.8	59.1	80.8	92.5	91.1	-
Large - AUC %- > 100 pixels
INRIA-fixed	10.5	11.6	10.7	11.7	17.3	42.8	31.8	28.5	24.8	27.2	9.9	16.6	22.6	27.8	66.4	66.1	59.0	9.1
INRIA	11.2	12.2	11.0	12.1	18.0	43.9	33.2	30.0	26.6	28.2	10.3	17.0	23.7	28.8	69.9	68.1	60.3	9.4
Daimler	7.8				-	-	31.7	25.2	11.8		6.9	13.9	30.9	-	-	72.3	83.9	-
		-	- 28.2	- 33.4	-	-	33.1	36.4	29.5	22.9			35.1	26.6	63.9	75.8	76.7	24.4
ETH	24.4	30.2	25.8	26.4	28.6	47.8	28.0	26.5		47.6 40.7	26.8	24.8 22.7	34.3	30.4	54.5	69.6
Caltech-UsaTest TudBrussels 33.5	14.8	24.1 36.2	37.3	35.0 -		-	56.2	52.2	18.4	64.5	22.5 46.3	43.1 41.8	55.5	43.3	70.0	80.3	80.9 86.0	- -
Near - AUC %- > 80 pixels
INRIA-fixed	11.3	11.6	11.0	11.9	17.3	42.6	31.5	28.5	24.7	27.5	11.1	16.5	22.7	27.7	66.0	66.1	58.7	9.7
INRIA	11.9	12.2	11.2	12.3	17.9	43.7	32.9	30.0	26.5	28.5	11.5	16.8	23.8	28.8	69.4	68.1	60.0	9.9
Daimler	10.0	-	-	-	-	-	36.8	30.4	10.9	27.6	10.8	14.7	33.7	-	-	78.3	86.3	-
ETH	28.9	35.2	30.9	37.5	-	-	40.5	45.6	31.7	52.2	31.4	29.5	39.4	34.1	80.6	79.9	80.0	29.8
Caltech-UsaTest	27.3	27.4	28.9	28.4	29.5	48.9	33.1	34.3	24.7	47.2	26.7	29.1	40.8	31.2	66.8	75.7	85.3
TudBrussels	40.4	39.5	40.3	38.8	-	-	61.1	58.7	50.5	70.9	47.1	45.3	57.2	49.6	80.0	85.6	89.0	- -
Medium - AUC %- 30-80 pixels
INRIA-fixed	33.1	100.0	99.7	100.0	100.0	100.0	100.0	100.0	85.3	85.3	99.7	100.0	86.1	100.0	99.7	99.7	91.5	27.9
INRIA	33.1	100.0	99.7	100.0	100.0	100.0	100.0	100.0	85.3	85.3	99.7	100.0	86.1	100.0	99.7	99.7	91.5	27.9
Daimler							62.1	54.4	70.7	58.5				-	-	95.2	93.7
	54.2	-	-	-	-	-					60.0	44.7	63.2					-
ETH	55.4	42.9	42.1	45.4	-	-	49.9	54.7	61.2	71.5	57.3	43.9	47.3	45.0	73.9	74.5	71.2	48.3
Caltech-UsaTest	92.2	69.5	70.6	70.6		82.1		82.6	91.5	91.1	80.8	80.6	77.8	75.8	88.8		98.7	-
					70.2		81.4									94.7
				59.7				74.9		85.5						94.1	91.7	-
			55.5		-				82.9
															79.4
													68.7
												59.1
		57.4									68.2			65.0
						-	71.4
TudBrussels	67.8

Trained on									INRIA												TUD-MotionPairs	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech
	ChnFtrs ConvNet-F ConvNet-F-MS ConvNet-U ConvNet-U-MS CrossTalk FPDW FeatSynth FtrMine HOG HikSvm HogLbp LatSvm-V1 LatSvm-V2 MLS MultiFtr Pls PoseInv Shapelet VJ VeryFast MultiFtr+CSS MultiFtr+Motion MultiResC
all - ContinuousAUC%
INRIA-fixed	13.3	26.7	19.5	19.8	12.0	12.7	13.6	19.0	43.8 32.9	29.9	26.2	28.8	12.9	17.8	24.2 29.0	66.7	66.8	59.4	10.3	16.1	-	-	-	-	-	-	-	-	-	-
INRIA	13.9	27.4	21.4	20.9	12.7	12.9	14.0	19.6	44.8	34.3 31.4	28.0	29.8	13.3	18.2	25.3 30.1	70.1	68.7	60.7	10.5	-		-	-	-	-	-	-	-	-	-
Daimler	-	-	-	-	58.6	-	-	-	-	67.9 62.4	69.8	64.2	62.3	51.8	68.8 -	-		94.8	94.9 - 46.9	48.6	40.5	-	-	-	-	-	-	-	-	-
ETH	48.7	-	-	-	47.1	43.8	51.5	-	-	54.9 61.6	51.1	69.1	49.3	42.8	51.7 47.4	86.5	85.6	84.5		59.5 81.2	58.2	-	-	-	-	-	-	-	-	-
Caltech-UsaTest	77.1	-	-	-	90.9	77.8	78.1	78.1	86.7	85.5 86.8	87.9	91.7	84.2	83.4	83.4 81.2	92.6	95.4	99.1			- 77.9	74.2	74.2	74.2	74.2	74.2	74.2	74.2	74.2	74.2
TudBrussels	57.6	-	-	-	66.8	55.0	59.0	-	-	73.6 76.4	77.2	85.7	67.2	59.2	70.5 66.1	83.8	93.8	92.7	-	57.8	51.4	-	-	-	-	-	-	-	-	-
reasonable - ContinuousAUC%
INRIA-fixed	13.3	26.7	19.5	19.8	12.0	12.7 13.6	19.0		43.8 32.9	29.9	26.2	28.8	12.9	17.8	24.2 29.0	66.7		59.4	10.3	16.1	-	-	-	-	-	-	-	-	-	-
INRIA	13.9	27.4	21.4	20.9	12.7	12.9 14.0		19.6	44.8	34.3 31.4	28.0	29.8	13.3	18.2	25.3 30.1	70.1	66.8 68.7	60.7	10.5	16.9	-	-	-	-	-	-	-	-	-	-
Daimler	-	-	-	-	24.9				-	46.2 38.9	40.3	47.2	29.2	18.3	45.5 -	-	90.2	91.3	-	29.0	23.3	-	-	-	-	-	-	-	-	-
ETH	44.2	-	-	-	38.9	-	-	-	-	51.1 59.2	43.7	66.6		37.1	47.5 42.2	85.2	83.9	83.6	42.5	49.4
Caltech-UsaTest	46.4	-		-		39.1 46.0	46.8 46.9	- 49.2	66.3	57.8 62.0			41.1			78.2		91.8	-	52.1	47.9 42.3	- 38.1	- 38.1	- 38.1	- 38.1	- 38.1	- 38.1	- 38.1	- 38.1	- 38.1
TudBrussels			-		71.5						62.2	73.4	56.0	51.9	59.3 52.9		87.0 92.5				43.8
	48.8	-	-	-	59.1	47.0	50.4	-	-	68.1 72.4	71.8	84.0	59.6	52.0	64.8 59.1	80.8		91.1	-	50.2		-	-	-	-	-	-	-	-	-
scale=large - ContinuousAUC%
INRIA-fixed	11.6	23.4	17.3	17.1	10.5	10.7	11.7	17.3	42.8 31.8	28.5	24.8	27.2	9.9	16.6	22.6	66.4	66.1	59.0	9.1		-	-	-	-	-	-	-	-	-	-
INRIA	12.2	24.2	19.1	18.2	11.2		12.1	18.0	43.9	33.2 30.0	26.6	28.2	10.3	17.0	27.8 23.7 28.8	69.9	68.1	60.3	9.4	14.8 15.6	-	-	-	-	-	-	-	-	-	-
Daimler	-	-	-	-	7.8	11.0 -	-	-	-	31.7 25.2	11.8	22.9	6.9	13.9	30.9 -	-	72.3	83.9	-	18.7	18.6	-	-	-	-	-	-	-	-	-
ETH									-	33.1 36.4	29.5		26.8				63.9	76.7	24.4	36.7	31.6	-	-	-	-	-	-	-	-	-
	30.2	-	-	-	24.4	28.2	33.4	-		28.0		47.6		24.8	35.1 26.6		75.8					12.5	12.5	12.5	12.5	12.5	12.5	12.5	12.5	12.5
Caltech-UsaTest	24.1	-	-	-	14.8	25.8	26.4	28.6	47.8	26.5	18.4	40.7	22.5	22.7	34.3 30.4	54.5	69.6	80.9	-	28.8	12.0
TudBrussels	36.2	-	-	-	33.5	37.3	35.0	-	-	56.2 52.2	46.3	64.5	43.1		41.8 55.5	43.3	70.0	80.3	86.0		- 45.3	-	-	-	-	-	-	-	-	-
scale=near - ContinuousAUC%
INRIA-fixed -	11.6	24.5	18.6	17.7	11.3	11.0	11.9	17.3	42.6	31.5 28.5	24.7	27.5	11.1		16.5	22.7 27.7	66.0	66.1		14.7	58.7 9.7					-
INRIA	12.2	25.3	20.5	18.8	11.9	11.2	12.3	17.9	43.7	32.9 30.0	26.5	28.5	11.5	16.8	23.8 28.8	69.4		68.1	60.0			9.9 -	15.5	-		-	-	-	-	-
Daimler	-	-	-	-	10.0	-	-	-	-	36.8 30.4	10.9	27.6	10.8	14.7		-	-	86.3 29.8	18.4	19.5
	35.2		-	-	28.9	30.9	37.5	-	-	45.6	31.7	52.2	31.4	29.5	33.7 39.4 34.1	80.6		78.3 79.9	80.0		36.3	- -	- -	- -	- -	- -	- -	- -	- -	- -
ETH Caltech-UsaTest	27.4	- -	-	-	27.3	28.9	28.4	29.5	48.9	40.5 33.1 34.3	24.7	47.2	26.7	29.1	40.8 31.2		66.8	75.7	85.3	40.0 30.4	16.4	15.0	15.0	15.0	15.0	15.0	15.0	15.0	15.0	15.0
TudBrussels	39.5	-	-	-	40.4	40.3	38.8	-	-	61.1 58.7	50.5	70.9	47.1	45.3	57.2 49.6	80.0		85.6	89.0	- -	46.5				39.8	-	-	-	-	-
scale=medium - ContinuousAUC%
INRIA-fixed	100.0	10.4	32.3	31.9	33.2 33.2		100.0	100.0	100.0 100.0	100.0 85.3	85.3	85.3	99.7	100.0	86.1 100.0	99.7 99.7	99.7	91.5	27.9		91.3	-	- -	- -	- -	- -	- -	- -	- -	- -
INRIA	100.0	10.4	32.3	31.9	99.7 99.7	100.0		100.0	100.0	100.0 100.0			99.7				85.3	100.0	86.1	100.0	99.7	27.9		91.3	-
Daimler	-	-	-	-	54.2		-	-	-	62.1 54.4	70.7	58.5	60.0	44.7	63.2 -	95.2	91.5 93.7		-	43.4		34.0		-	-	-	-	-	-	-
ETH	42.9	-	-	-	55.4	- 42.1	45.4	-	-	49.9 54.7	61.2	71.5	57.3		43.9	47.3 45.0	-	73.9	71.2	48.3		54.7	55.2			-	-	-	-	-
Caltech-UsaTest	69.5	-	-	-	92.2	70.6	70.6	70.2	82.1	82.6 91.5	91.1			80.8		80.6	77.8 75.8	88.8		74.5 94.7				98.7	76.0	73.4			65.4	65.4
TudBrussels	57.4	-	-	-	67.8	55.5 59.7	-		81.4 - 71.4	74.9 82.9		85.5		68.2	68.7 65.0		59.1 79.4				94.1		91.7	- -	55.0	48.6	-	-	-	-
scale=far - ContinuousAUC%
Caltech-UsaTest 94.7 100.0	93.4 -		-	-	100.0	94.2			95.0 96.2	98.2		97.4	95.4				100.0 100.0 95.9	97.8 98.7 100.0	99.9	99.3	-	97.8 98.7 100.0	97.8 98.7 100.0	97.8 98.7 100.0	97.8 98.7 100.0	94.6	97.8 98.7 100.0	95.3	97.8 98.7 100.0	97.8 98.7 100.0

Trained on									INRIA												TUD-MotionPairs	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech	Caltech
ChnFtrs ConvNet-F ConvNet-F-MS ConvNet-U ConvNet-U-MS CrossTalk FPDW FeatSynth FtrMine HOG HikSvm HogLbp LatSvm-V1 LatSvm-V2 MLS MultiFtr Pls PoseInv Shapelet VJ VeryFast MultiFtr+CSS MultiFtr+Motion MultiResC
all - Discrete AUC%
INRIA-fixed	20.2	41.7	30.5	31.5	18.7	18.6	20.6	30.3	57.2 44.4	40.7	36.9	42.9	19.6	23.6	35.6 39.0	77.9	79.6 81.7	71.6		15.8 23.3	-	-	-	-	-	-	-	-	-	-	-	-	-	-
INRIA	22.2	42.2	34.2	32.6	19.8	19.0	21.5	30.9	58.3	46.0 42.8	39.1	43.8	20.0	23.9	36.5	40.1	80.1	72.5		16.0 24.7	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Daimler	-	-	-	-	64.5	-	-	-	-	77.4 74.3	74.8	72.2	68.3	60.2		76.6 -	- 92.8	96.7	-		48.1	-	-	-	-	-	-	-	-	-	-	-	-	-
ETH	61.9	-	-	-	57.8	56.5	64.3	-	-	67.3 73.8	61.7	78.6	58.3	54.9	63.6	59.8	96.8 92.0	90.4	59.3	58.2 70.7	70.1	-	-	-	-	-	-	-	-	-	-	-	-	-
Caltech-UsaTest TudBrussels	82.8 67.7	- -	- -	- -	92.9 74.9	82.7 65.1	83.7 70.1	84.1 -	90.6 -	90.4 91.4 81.8 85.3	89.8 85.4	93.7 91.3	87.7 75.6	87.2 69.0	87.8 85.7 78.1 76.2	95.5 90.0	97.0 96.2	99.5 95.5	- -	85.5 66.5	82.8 62.0	80.4 -	80.4 -	80.4 -	80.4 -	80.4 -	80.4 -	80.4 -	80.4 -	80.4 -	80.4 -	80.4 -	80.4 -	80.4 -
reasonable - Discrete AUC%
INRIA-fixed	20.2	41.7	30.5	31.5	18.7	18.6	20.6	30.3	57.2	44.4 40.7	36.9	42.9	19.6	23.6	35.6	39.0	77.9 79.6	71.6	15.8	23.3	-	-	-	-	-	-	-	-	-	-	-	-	-	-
INRIA	22.2	42.2	34.2	32.6	19.8	19.0	21.5	30.9	58.3	46.0 42.8	39.1	43.8	20.0	23.9	36.5	40.1	80.1 81.7	72.5	16.0	24.7	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Daimler	-	-	-	-	32.5	-	-	-	-	59.8 55.0	48.7	57.6	38.0	27.6	57.2 -	-	93.8	94.7	-	39.2	29.2	-	-	-	-	-	-	-	-	-	-	-	-	-
ETH	57.5	-	-	-	50.3	51.9	60.1	-	-	64.2 72.0	55.2	76.7	50.9	49.4	59.8	54.9	92.1	90.9 89.9	54.8	60.7	60.0	-	-	-	-	-	-	-	-	-	-	-	-	-
Caltech-UsaTest	56.3	-	-	-	77.2	53.9	57.4	60.2	74.4	68.5 73.4	67.8	79.8	63.3	61.0	68.3 62.1	86.3	91.4	94.7	-	60.9	50.9	48.5	48.5	48.5	48.5	48.5	48.5	48.5	48.5	48.5	48.5	48.5	48.5	48.5
TudBrussels	60.3	-	-	-	68.8	58.0	63.0	-	-	77.9 82.5	81.7	90.2	69.6	62.6	73.4 70.7		88.0	95.4 94.5	-	59.5	54.8	-	-	-	-	-	-	-	-	-	-	-	-	-
scale=large - Discrete AUC%
INRIA-fixed	18.1	37.6	28.0	27.8	16.5	16.3	18.3	28.4	56.3 43.4	39.4	35.5	41.1	15.9	22.3	33.9 37.7	77.6	79.1	71.1	14.0	21.9	-	-	-	-	-	-	-	-	-	-	-	-	-	-
INRIA	20.0	38.6		29.0	17.3	16.7	19.1	29.0	57.5	45.1 41.5	37.7	42.2	16.3	22.6		34.8 38.8	79.7	72.1	14.5	23.2	-	-	-	-	-	-	-	-	-	-	-	-	-	-
Daimler	-	-	31.2 -	-	11.6	-	-	-	-	47.4 40.5	17.0	32.8	10.6	20.6	43.4 -	-	81.2 81.6	89.6	-	23.7	21.4	-	-	-	-	-	-	-	-	-	-	-	-	-
ETH	42.4	-	-	-	34.6	37.5	46.4	-	-	44.2 49.4	40.8	60.6	35.3	33.9	46.0	37.3 77.1		84.8 84.8	35.1	46.5	40.3	-	-	-	-	-	-	-	-	-	-	-	-	-
Caltech-UsaTest	30.2	-	-	-	21.4	29.8	33.4	36.2	59.1	37.9 39.0	22.7	49.9	28.2	31.1	43.0 36.5	71.6	76.6	86.2	-	36.3	15.7	17.8	17.8	17.8	17.8	17.8	17.8	17.8	17.8	17.8	17.8	17.8	17.8	17.8
TudBrussels		-				45.2	44.1		-	71.2 67.6			54.2					91.9	-	54.9	46.5	-	-	-	-	-	-	-	-	-	-	-	-	-
	43.1		-	-	41.1			-			59.3	76.6		51.6	63.5 54.3	82.3	87.1
scale=near - Discrete AUC%
INRIA-fixed	18.3	38.7	29.4	28.3	17.7	16.7	18.7	28.3	56.2	43.1 39.3	35.4	41.5	17.5		22.2	34.1 37.7	77.4	79.1	70.9	14.9	-	-	-	-	-	-	-	-	-	-	-	-	-	-
INRIA	20.2	39.8	33.2	29.4	18.8	17.1	19.5	28.9	57.3	41.5	37.6	42.5	17.9	22.5	35.0 38.8		79.5	81.3	15.1	21.7		23.2				-	-	-	-	-	-	-	-	-
Daimler	-	-	-	-	14.8	-	-	-	-	44.7 52.0 45.5	15.6	37.9	16.1	22.5		46.2 -	-	85.6	71.9 91.3	-		25.2		23.6
ETH	48.3	-	-	-	39.3	42.1	50.7	-	- 53.5	59.2	43.0	64.7	40.6	40.0	50.3 46.0	89.1		88.0	87.2	41.1	45.4		- -	- -	- -	- -	- -	- -	- -	- -	- -	- -	- -	- -
Caltech-UsaTest	35.1	-	-	-	36.7	34.3	36.7	38.9	58.4 44.0	48.0	30.8	57.1	34.3	37.7		50.0 39.8	77.9	82.9	89.9	51.0 - 38.8	21.2	21.2	21.2	21.2	21.2	21.2	21.2	21.2	21.2	21.2	21.2	21.2	21.2	21.2
TudBrussels	49.6	-	-	-	49.9	49.7	50.0	-	- 73.5	73.0	64.3	81.3	58.0	55.2	66.3	61.6	88.0	90.9	93.0	-	50.1	-	-	-	-	-	-	-	-	-	-	-	-	-
scale=medium - DiscreteAUC%
92.3
INRIA-fixed	100.0	0.0	44.7	48.3	45.8	100.0	100.0	100.0	100.0 100.0	100.0	92.3	88.8	100.0	100.0		100.0 100.0	100.0	100.0	92.6	39.0	-	91.7	91.7	91.7	91.7	91.7	91.7	91.7	91.7	91.7	91.7	91.7	91.7	91.7
INRIA	100.0	0.0	44.7	48.3	45.8	100.0	100.0	100.0	100.0	100.0 100.0	92.3	88.8	100.0	100.0	92.3	100.0	100.0	92.6	39.0		-			91.7			- -	- -	- -	- -	- -	- -	- -	- -
Daimler	-	-	-	-	60.7	-		-	- 72.8	68.0	67.8		67.1		53.9	-	97.0	96.1		-		53.5	41.2	-	-	-	-	-	-	-	-	-	-	-
ETH	53.6	-	-	-	65.7	53.0	- 56.0	-	- 59.8	67.6	75.0 70.3	80.3	67.1	54.2	72.2 58.0	- 55.9		85.0	60.5			82.6	80.6		67.8		-	-	-	-	-	-	-	-
Caltech-UsaTest	77.4	-	-	-	94.9	77.4	78.4	78.2	87.3	87.4 88.5	93.3		85.5	85.5	84.1	82.0	93.1		-		80.1			67.5 82.1		73.2	73.2	73.2	73.2	73.2	73.2	73.2	73.2	73.2
TudBrussels	67.7	-	-	-	76.5	64.9	70.6	-	- 79.4	93.0 84.4 90.2	91.1			77.2	69.2		76.7 75.7	86.8	96.7 96.5			94.7	99.4	-	63.8	-		58.1
scale=far - DiscreteAUC%
Caltech-UsaTest	95.2	-	-	-	100.0	96.6	95.7	96.4 occ=partial	96.9 97.1 - Discrete	99.1 AUC%	100.0	98.5 97.9	100.0		96.9 99.2	100.0	99.9				99.7		-		96.7			96.8	100.0	100.0	100.0	100.0	100.0	100.0
Caltech-UsaTest

References

[aharon-ksvd-05] M.~~{Aharon}, M.~~{Elad}, and A.~~M. {Bruckstein}. \newblock {K-SVD and its non-negative variant for dictionary design}. \newblock In M.~~{Papadakis}, A.~F. {Laine}, and M.~A. {Unser}, editors, {\em Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series}, volume 5914, pages 327--339, Aug. 2005.

[beck-fista-09] A.~Beck and M.~Teboulle. \newblock A fast iterative shrinkage-thresholding algorithm for linear inverse problems. \newblock {\em SIAM J. Img. Sci.}, 2(1):183--202, 2009.

[benenson2012pedestrian] R.~Benenson, M.~Mathias, R.~Timofte, and L.~~Van~~Gool. \newblock Pedestrian detection at 100 frames per second. \newblock In {\em Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on}, pages 2903--2910. IEEE, 2012.

[bengio-nips-07] Y.~Bengio, P.~Lamblin, D.~Popovici, and H.~Larochelle. \newblock Greedy layer-wise training of deep networks. \newblock In {\em Advances in Neural Information Processing Systems 19}, pages 153--160. MIT Press, 2007.

[dalal-iccv-05] N.~Dalal and B.~Triggs. \newblock Histograms of oriented gradients for human detection. \newblock In C.~Schmid, S.~Soatto, and C.~~Tomasi, editors, {\em CVPR'05}, volume~~2, pages 886--893, June 2005.

[daubechies-ista-2004] I.~Daubechies, M.~Defrise, and C.DeMol. \newblock An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. \newblock {\em Communications on Pure and Applied Mathematics}, 57(11):1413--1457, 2004.

[dollarcrosstalk] P.~Doll{'a}r, R.~Appel, and W.~Kienzle. \newblock Crosstalk cascades for frame-rate pedestrian detection.

[dollar-bmvc-10] P.~Doll'ar, S.~Belongie, and P.~Perona. \newblock The fastest pedestrian detector in the west. \newblock In {\em BMVC 2010, Aberystwyth, UK.}

[dollar-bmvc-09] P.~Doll'ar, Z.~Tu, P.~Perona, and S.~Belongie. \newblock Integral channel features. \newblock In {\em BMVC 2009, London, England.}

[dollarCVPR09peds] P.~Doll'ar, C.~Wojek, B.~Schiele, and P.~Perona. \newblock Pedestrian detection: A benchmark. \newblock In {\em CVPR'09}. IEEE, June 2009.

[Dollar2011PAMI] P.~Doll'ar, C.~Wojek, B.~Schiele, and P.~Perona. \newblock Pedestrian detection: An evaluation of the state of the art. \newblock {\em PAMI}, 99, 2011.

[fan-nn-10] J.~Fan, W.~Xu, Y.~Wu, and Y.~Gong. \newblock Human tracking using convolutional neural networks. \newblock {\em Neural Networks, IEEE Transactions on}, 21(10):1610 --1623, 2010.

[felzen-pami-10] P.~Felzenszwalb, R.~Girshick, D.~McAllester, and D.~Ramanan. \newblock Object detection with discriminatively trained part based models. \newblock In {\em PAMI 2010}.

[frome-iccv-09] A.~Frome, G.~Cheung, A.~Abdulkader, M.~Zennaro, B.~Wu, A.~Bissacco, H.~Adam, H.~Neven, and L.~Vincent. \newblock Large-scale privacy protection in street-level imagery. \newblock In {\em ICCV'09}.

[garcia-delakis-04] C.~Garcia and M.~Delakis. \newblock Convolutional face finder: A neural architecture for fast and robust face detection. \newblock {\em IEEE Transactions on Pattern Analysis and Machine Intelligence}, 2004.

[hinton-science-06] G.~E. Hinton and R.~R. Salakhutdinov. \newblock Reducing the dimensionality of data with neural networks. \newblock {\em Science}, 313(5786):504--507, 2006.

[jarrett-iccv-09] K.~Jarrett, K.~Kavukcuoglu, M.~Ranzato, and Y.~LeCun. \newblock What is the best multi-stage architecture for object recognition? \newblock In {\em ICCV'09}. IEEE, 2009.

[koray-cvpr-09] K.~Kavukcuoglu, M.~Ranzato, R.~Fergus, and Y.~LeCun. \newblock Learning invariant features through topographic filter maps. \newblock In {\em CVPR'09}. IEEE, 2009.

[koray-psd-08] K.~Kavukcuoglu, M.~Ranzato, and Y.~LeCun. \newblock Fast inference in sparse coding algorithms with applications to object recognition. \newblock Technical report, CBLL, Courant Institute, NYU, 2008. \newblock CBLL-TR-2008-12-01.

[koray-nips-10] K.~Kavukcuoglu, P.~Sermanet, Y.~Boureau, K.~Gregor, M.~Mathieu, and Y.~LeCun. \newblock Learning convolutional feature hierachies for visual recognition. \newblock In {\em Advances in Neural Information Processing Systems (NIPS 2010)}, 2010.

[kriz12] A.~Krizhevsky, I.~Sutskever, and G.~E. Hinton. \newblock Imagenet classification with deep convolutional neural networks. \newblock In {\em NIPS 2012: Neural Information Processing Systems}.

[haptic] Q.~Le, M.~Quigley, J.~Feng, J.~Chen, Y.~Zou, W, M.~Rasi, T.~Low, and A.~Ng. \newblock Haptic belt with pedestrian detection. \newblock In {\em NIPS, 2011 (Demonstrations)}.

[lecun-98b] Y.~LeCun, L.~Bottou, Y.~Bengio, and P.~Haffner. \newblock Gradient-based learning applied to document recognition. \newblock {\em Proceedings of the IEEE}, 86(11):2278--2324, November 1998.

[lee-nips-07] H.~Lee, A.~Battle, R.~Raina, and A.~Y. Ng. \newblock Efficient sparse coding algorithms. \newblock In B.~Sch"{o}lkopf, J.~Platt, and T.~Hoffman, editors, {\em Advances in Neural Information Processing Systems 19}, pages 801--808. MIT Press, Cambridge, MA, 2007.

[lee-icml-09] H.~Lee, R.~Grosse, R.~Ranganath, and A.~Ng. \newblock {Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations}. \newblock In {\em ICML'09}, pages 609--616. ACM, 2009.

[li-osher-10] Y.~Li and S.~Osher. \newblock {Coordinate Descent Optimization for l1 Minimization with Application to Compressed Sensing; a Greedy Algorithm}. \newblock {\em CAM Report}, pages 09--17.

[mairal-cvpr-08] J.~Mairal, F.~Bach, J.~Ponce, G.~Sapiro, and A.~Zisserman. \newblock Discriminative learned dictionaries for local image analysis. \newblock {\em Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on}, pages 1--8, June 2008.

[10.1109] S.~Maji, A.~C. Berg, and J.~~Malik. \newblock Classification using intersection kernel support vector machines is efficient. \newblock volume~~0, pages 1--8, Los Alamitos, CA, USA, 2008. IEEE Computer Society.

[nowlan-platt-95] S.~Nowlan and J.~Platt. \newblock A convolutional neural network hand tracker. \newblock pages 901--908, San Mateo, CA, 1995. Morgan Kaufmann.

[olshausen-97] B.~A. Olshausen and D.~J. Field. \newblock Sparse coding with an overcomplete basis set: a strategy employed by v1? \newblock {\em Vision Research}, 37(23):3311--3325, 1997.

[osadchy-07] M.~Osadchy, Y.~LeCun, and M.~Miller. \newblock Synergistic face detection and pose estimation with energy-based models. \newblock {\em Journal of Machine Learning Research}, 8:1197--1215, May 2007.

[ranzato-nips-07] M.~Ranzato, C.~Poultney, S.~Chopra, and Y.~LeCun. \newblock Efficient learning of sparse representations with an energy-based model. \newblock In {\em NIPS'07}. MIT Press, 2007.

[5459205] W.~R. Schwartz, A.~Kembhavi, D.~Harwood, and L.~S. Davis. \newblock Human detection using partial least squares analysis. \newblock In {\em Computer Vision, 2009 IEEE 12th International Conference on}, pages 24 --31, 29 2009-oct. 2 2009.

[sermanet-icpr-12] P.~Sermanet, S.~Chintala, and Y.~LeCun. \newblock Convolutional neural networks applied to house numbers digit classification. \newblock In {\em Proceedings of International Conference on Pattern Recognition}, 2012.

[sermanet-ijcnn-11] P.~Sermanet and Y.~LeCun. \newblock Traffic sign recognition with multi-scale convolutional networks. \newblock In {\em Proceedings of International Joint Conference on Neural Networks}, 2011.

[taylor-nips-10] G.~Taylor, R.~Fergus, G.~Williams, I.~Spiro, and C.~Bregler. \newblock Pose-sensitive embedding by nonlinear nca regression. \newblock In {\em Advances in Neural Information Processing Systems NIPS 23}, 2010.

[vaillant-monrocq-lecun-94] R.~Vaillant, C.~Monrocq, and Y.~LeCun. \newblock Original approach for the localisation of objects in images. \newblock {\em IEE Proc on Vision, Image, and Signal Processing}, 141(4):245--250, August 1994.

[walk-cvpr-10] S.~Walk, N.~Majer, K.~Schindler, and B.~Schiele. \newblock New features and insights for pedestrian detection. \newblock In {\em CVPR 2010, San Francisco, California.}

[zeiler-cvpr-10] M.~Zeiler, D.~Krishnan, G.~Taylor, and R.~Fergus. \newblock {Deconvolutional Networks}. \newblock In {\em CVPR'10}. IEEE, 2010.

[bib1] M. Aharon, M. Elad, and A. M. Bruckstein. K-SVD and its non-negative variant for dictionary design. In M. Papadakis, A. F. Laine, and M. A. Unser, editors, Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, volume 5914, pages 327–339, Aug. 2005.

[bib2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Img. Sci., 2(1):183–202, 2009.

[bib3] R. Benenson, M. Mathias, R. Timofte, and L. Van Gool. Pedestrian detection at 100 frames per second. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2903–2910. IEEE, 2012.

[bib4] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19, pages 153–160. MIT Press, 2007.

[bib5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In C. Schmid, S. Soatto, and C. Tomasi, editors, CVPR’05, volume 2, pages 886–893, June 2005.

[bib6] I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457, 2004.

[bib7] P. Dollár, R. Appel, and W. Kienzle. Crosstalk cascades for frame-rate pedestrian detection.

[bib8] P. Dollár, S. Belongie, and P. Perona. The fastest pedestrian detector in the west. In BMVC 2010, Aberystwyth, UK.

[bib9] P. Dollár, Z. Tu, P. Perona, and S. Belongie. Integral channel features. In BMVC 2009, London, England.

[bib10] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In CVPR’09. IEEE, June 2009.

[bib11] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. PAMI, 99, 2011.

[bib12] J. Fan, W. Xu, Y. Wu, and Y. Gong. Human tracking using convolutional neural networks. Neural Networks, IEEE Transactions on, 21(10):1610 –1623, 2010.

[bib13] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. In PAMI 2010.

[bib14] A. Frome, G. Cheung, A. Abdulkader, M. Zennaro, B. Wu, A. Bissacco, H. Adam, H. Neven, and L. Vincent. Large-scale privacy protection in street-level imagery. In ICCV’09.

[bib15] C. Garcia and M. Delakis. Convolutional face finder: A neural architecture for fast and robust face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004.

[bib16] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[bib17] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In ICCV’09. IEEE, 2009.

[bib18] K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In CVPR’09. IEEE, 2009.

[bib19] K. Kavukcuoglu, M. Ranzato, and Y. LeCun. Fast inference in sparse coding algorithms with applications to object recognition. Technical report, CBLL, Courant Institute, NYU, 2008. CBLL-TR-2008-12-01.

[bib20] K. Kavukcuoglu, P. Sermanet, Y. Boureau, K. Gregor, M. Mathieu, and Y. LeCun. Learning convolutional feature hierachies for visual recognition. In Advances in Neural Information Processing Systems (NIPS 2010), 2010.

[bib21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS 2012: Neural Information Processing Systems.

[bib22] Q. Le, M. Quigley, J. Feng, J. Chen, Y. Zou, W, M. Rasi, T. Low, and A. Ng. Haptic belt with pedestrian detection. In NIPS, 2011 (Demonstrations).

[bib23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

[bib24] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 801–808. MIT Press, Cambridge, MA, 2007.

[bib25] H. Lee, R. Grosse, R. Ranganath, and A. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML’09, pages 609–616. ACM, 2009.

[bib26] Y. Li and S. Osher. Coordinate Descent Optimization for l1 Minimization with Application to Compressed Sensing; a Greedy Algorithm. CAM Report, pages 09–17.

[bib27] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8, June 2008.

[bib28] S. Maji, A. C. Berg, and J. Malik. Classification using intersection kernel support vector machines is efficient. volume 0, pages 1–8, Los Alamitos, CA, USA, 2008. IEEE Computer Society.

[bib29] S. Nowlan and J. Platt. A convolutional neural network hand tracker. pages 901–908, San Mateo, CA, 1995. Morgan Kaufmann.

[bib30] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: a strategy employed by v1? Vision Research, 37(23):3311–3325, 1997.

[bib31] M. Osadchy, Y. LeCun, and M. Miller. Synergistic face detection and pose estimation with energy-based models. Journal of Machine Learning Research, 8:1197–1215, May 2007.

[bib32] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In NIPS’07. MIT Press, 2007.

[bib33] W. R. Schwartz, A. Kembhavi, D. Harwood, and L. S. Davis. Human detection using partial least squares analysis. In Computer Vision, 2009 IEEE 12th International Conference on, pages 24 –31, 29 2009-oct. 2 2009.

[bib34] P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neural networks applied to house numbers digit classification. In Proceedings of International Conference on Pattern Recognition, 2012.

[bib35] P. Sermanet and Y. LeCun. Traffic sign recognition with multi-scale convolutional networks. In Proceedings of International Joint Conference on Neural Networks, 2011.

[bib36] G. Taylor, R. Fergus, G. Williams, I. Spiro, and C. Bregler. Pose-sensitive embedding by nonlinear nca regression. In Advances in Neural Information Processing Systems NIPS 23, 2010.

[bib37] R. Vaillant, C. Monrocq, and Y. LeCun. Original approach for the localisation of objects in images. IEE Proc on Vision, Image, and Signal Processing, 141(4):245–250, August 1994.

[bib38] S. Walk, N. Majer, K. Schindler, and B. Schiele. New features and insights for pedestrian detection. In CVPR 2010, San Francisco, California.

[bib39] M. Zeiler, D. Krishnan, G. Taylor, and R. Fergus. Deconvolutional Networks. In CVPR’10. IEEE, 2010.

Pedestrian Detection with Unsupervised Multi-Stage Feature Learning

Bootstrapping​

Introduction​

Learning Feature Hierarchies​

Hierarchical Model​

Unsupervised Learning​

Supervised Training​

Experiments​

Non-Linear Transformations​

Supervised Training​

Supervised Training​

Multi-Stage Features​

Bootstrapping​

Non-Maximum Suppression​

Experiments​

Data Preparation​

Evaluation Protocol​

Results​

Discussion​

Experiments​

Evidence for using the proposed continuous Area Under Curve measure​

Evidence for using the proposed fixed INRIA dataset​

All results with the continuous AUC measure​

All results with the original discrete AUC measure​

References​

Bootstrapping

Introduction

Learning Feature Hierarchies

Hierarchical Model

Unsupervised Learning

Supervised Training

Experiments

Non-Linear Transformations

Supervised Training

Supervised Training

Multi-Stage Features

Bootstrapping

Non-Maximum Suppression

Experiments

Data Preparation

Evaluation Protocol

Results

Discussion

Experiments

Evidence for using the proposed continuous Area Under Curve measure

Evidence for using the proposed fixed INRIA dataset

All results with the continuous AUC measure

All results with the original discrete AUC measure

References