
Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers

Clément Farabet$^{1,2}$, Camille Couprie$^1$, Laurent Najman$^2$, Yann LeCun$^1$, $^1$ Courant Institute of Mathematical Sciences, New York University, New York, NY 10003, USA, $^2$ Université Paris-Est, Laboratoire d'Informatique Gaspard-Monge, Équipe A3SI - ESIEE Paris, France

Abstract

Scene parsing, or semantic segmentation, consists in labeling each pixel in an image with the category of the object it belongs to. It is a challenging task that involves the simultaneous detection, segmentation and recognition of all the objects in the image. The scene parsing method proposed here starts by computing a tree of segments from a graph of pixel dissimilarities. Simultaneously, a set of dense feature vectors is computed which encodes regions of multiple sizes centered on each pixel. The feature extractor is a multiscale convolutional network trained from raw pixels. The feature vectors associated with the segments covered by each node in the tree are aggregated and fed to a classifier which produces an estimate of the distribution of object categories contained in the segment. A subset of tree nodes that cover the image are then selected so as to maximize the average 'purity' of the class distributions, hence maximizing the overall likelihood that each segment will contain a single object. The convolutional network feature extractor is trained end-to-end from raw pixels, alleviating the need for engineered features. After training, the system is parameter free. The system yields record accuracies on the Stanford Background Dataset (8 classes), the Sift Flow Dataset (33 classes) and the Barcelona Dataset (170 classes) while being an order of magnitude faster than competing approaches, producing a $320\times240$ image labeling in less than 1 second.


Overview

Full scene labeling (FSL) is the task of labeling each pixel in a scene with the category of the object to which it belongs. FSL requires solving the detection, segmentation, recognition and contextual integration problems simultaneously, so as to produce a globally consistent labeling. One of the obstacles to FSL is that the information necessary for the labeling of a given pixel may come from very distant pixels as well as their labels. The category of a pixel may depend on relatively short-range information (e.g. the presence of a human face generally indicates the presence of a human body nearby), as well as on very long-range dependencies (is this grey pixel part of a road, a building, or a cloud?).

This paper proposes a new method for FSL, depicted in Figure 1, that relies on five main ingredients: 1) Trainable, dense, multi-scale feature extraction: a multi-scale, dense feature extractor produces a series of feature vectors for regions of multiple sizes centered around every pixel in the image, covering a large context. The feature extractor is a two-stage convolutional network applied to a multi-scale contrast-normalized Laplacian pyramid computed from the image. The convolutional network is fed with raw pixels and trained end to end, thereby alleviating the need for hand-engineered features; 2) Segmentation tree: a graph over pixels is computed in which each pixel is connected to its 4 nearest neighbors through an edge whose weight is a measure of dissimilarity between the colors of the two pixels. A segmentation tree is then constructed using a classical region-merging method, based on the minimum spanning tree of the graph. Each node in the tree corresponds to a potential image segment. The final image segmentation will be a judiciously chosen subset of nodes of the tree whose corresponding regions cover the entire image; 3) Region-wise feature aggregation: for each node in the tree, the corresponding image segment is encoded by a 5 × 5 spatial grid of aggregated feature vectors. The aggregated feature vector of each grid cell is computed by a component-wise max pooling of the feature vectors centered on all the pixels that fall into the grid cell; this produces a scale-invariant representation of the segment and its surroundings; 4) Class histogram estimation: a classifier is then applied to the aggregated feature grid of each node. The classifier is trained to estimate the histogram of all object categories present in its input segments; 5) Optimal purity cover: a subset of tree nodes is selected whose corresponding segments cover the entire image. The nodes are selected so as to minimize the average 'impurity' of the class distribution, where 'impurity' is defined as the entropy of the class distribution. The choice of the cover thus attempts to find a consistent overall segmentation in which each segment contains pixels belonging to only one of the learned categories.

All the steps in the process have a complexity linear (or almost linear) in the number of pixels. The bulk of the computation resides in the convolutional network feature extractor. The resulting system is very fast, producing a full parse of a 320 × 240 image in less than 1 second on a conventional CPU. Once trained, the system is parameter free, and requires no adjustment of thresholds or other knobs.

There are three key contributions in this paper: 1) using a multi-scale convolutional net to learn good features for region classification; 2) using a class purity criterion to decide if a segment contains a single object, as opposed to several objects or part of an object; 3) an efficient procedure to obtain a cover that optimizes the overall class purity of a segmentation.

The problem of scene parsing has been approached with a wide variety of methods in recent years. Many methods rely on MRFs, CRFs, or other types of graphical models to ensure the consistency of the labeling and to account for context [9, 22, 6, 13, 17, 24]. Most methods rely on a pre-segmentation into super-pixels or other segment candidates [6, 13, 17, 24], and extract features and categories from individual segments and from various combinations of neighboring segments. The graphical model inference pulls out the most consistent set of segments that cover the image.

Socher et al . [23] propose a method to aggregate segments in a greedy fashion using a trained scoring function. The originality of the approach is that the feature vector of the combination of two segments is computed from the feature vectors of the individual segments through a trainable function. Like us, they use 'deep learning' methods to train their feature extractor. But unlike us, their feature extractor operates on hand-engineered features.

One of the main questions in scene parsing is how to


Figure 1. Diagram of the scene parsing system. The raw input image is transformed through a Laplacian pyramid. Each scale is fed to a 2-stage convolutional network, which produces a set of feature maps. The feature maps of all scales are concatenated, the coarser-scale maps being upsampled to match the size of the finest-scale map. Each feature vector thus represents a large contextual window around each pixel. In parallel, a segmentation tree is computed via the minimum spanning tree of the dissimilarity graph of neighboring pixels. The segment associated with each node in the tree is encoded by a spatial grid of feature vectors pooled in the segment's region. A classifier is then applied to all the aggregated feature grids to produce a histogram of categories, the entropy of which measures the 'impurity' of the segment. Each pixel is then labeled by the minimally-impure node above it, which is the segment that best 'explains' the pixel.


take a wide context into account to make a local decision. Munoz et al. [17] proposed to use the histogram of labels extracted from a coarse scale as input to the labelers that look at finer scales. Our approach is somewhat simpler: our feature extractor is applied densely to an image pyramid. The coarse feature maps thereby generated are upsampled to match the resolution of the finest scale. Hence, with three scales, each feature vector has multiple fields which encode multiple regions of increasing sizes and decreasing resolutions, centered on the same pixel location.

Like us, a number of authors have used trees to generate candidate segments by aggregating elementary segments, as in [22]. Using trees makes it possible to rely on fast inference algorithms based on graph cuts or other methods. In this paper, we use an innovative method based on finding a set of tree nodes that cover the image while minimizing some criterion.

Our system extracts features densely from a multiscale pyramid of images using a convolutional network (ConvNet) [14]. ConvNets can be fed with raw pixels and can automatically learn low-level and mid-level features, alleviating the need for hand-engineered features. One big advantage of ConvNets is the ability to compute dense features efficiently over large images. ConvNets are best known for their applications to detection and recognition [20, 11], but they have also been used for image segmentation, particularly for biological image segmentation [19, 10, 25].

The only published work on using ConvNets for scene parsing is that of Grangier et al . [7]. While somewhat preliminary, their work showed that convolutional networks fed with raw pixels could be trained to perform scene parsing with decent accuracy. Unlike [7] however, our system uses a boundary-based over-segmentation to align the labels produced by the ConvNet to the boundaries in the image. Our system also takes advantage of the boundary-based oversegmentation to produce representations that are independent of the size of the segment through feature pooling.

An end-to-end trainable model for scene parsing

The model proposed in this paper, depicted in Figure 1, relies on two complementary image representations. In the first representation, the image is seen as a point in a high-dimensional space, and we seek to find a transform f : R^P → R^Q that maps these images into a space in which each pixel can be assigned a label using a simple linear classifier. This first representation typically suffers from two main problems: (1) the window considered rarely contains an object that is properly centered and scaled, and therefore offers a poor observation basis to predict the class of the underlying object; (2) integrating a large context involves increasing the grid size, and therefore the dimensionality P of the input; given a finite amount of training data, it is then necessary to enforce some invariance in the function f itself. This is usually achieved by using pooling/subsampling layers, which in turn degrades the ability of the model to precisely locate and delineate objects. In this paper, f is implemented by a multiscale convolutional network, which allows integrating large contexts (as large as the complete scene) into local decisions, while remaining manageable in terms of parameters/dimensionality. This multiscale model, in which weights are shared across scales, allows the model to capture long-range interactions without the penalty of extra parameters to train. This model is described in Section 3.1.

In the second representation, the image is seen as an edge-weighted graph, on which a hierarchy of segmentations/clusterings can be constructed. This representation yields a natural abstraction of the original pixel grid, and provides a hierarchy of observation levels for all the objects in the image. It can be used as a solution to the first problem exposed above: assuming the capability of assessing the quality of all the components of this hierarchy, a system can automatically choose its components so as to produce the best set of predictions. Moreover, these components are spatially accurate, and naturally delineate the underlying objects, as this representation conserves pixel-level precision. Section 3.2 describes our methodology.

Scale-invariant, scene-level feature extraction

Our feature extractor is based on a convolutional network. Convolutional networks are natural extensions of neural networks in which weights are replicated over space; in other terms, the linear transforms are done using 2D convolutions. A convolution can be seen as a linear transform with shared (replicated) weights. The use of weight sharing is justified by the fact that image statistics are stationary: features and combinations of features that are relevant in one region of an image are also relevant in other regions. In fact, by enforcing this constraint, each layer of a convolutional network is explicitly forced to model features that are shift-equivariant. Because of the imposed weight-sharing, convolutional networks have been used successfully for a number of image labeling problems.

More holistic tasks, such as full-scene understanding (pixel-wise labeling, or any dense feature estimation) require the system to model complex interactions at the scale of complete images, not simply within a patch. In this problem the dimensionality becomes unmanageable: for a typical image of 256 × 256 pixels, a naive neural network would require millions of parameters, and a naive convolutional network would require filters that are unreasonably large to view enough context.

Our multiscale convolutional network overcomes these limitations by extending the concept of weight replication to the scale space. Given an input image I, a multiscale pyramid of images X_s, ∀s ∈ {1, ..., N} is constructed, with X_1 being the size of I. The multiscale pyramid can be a Laplacian pyramid, and is typically pre-processed so that local neighborhoods have zero mean and unit standard deviation. Given a classical convolutional network f_s with parameters θ_s, the multiscale network is obtained by instantiating one network per scale s, and sharing all parameters across scales: θ_s = θ_0, ∀s ∈ {1, ..., N}.

More precisely, the output features are computed using the scaling/normalizing function g_s as X_s = g_s(I) for all s ∈ {1, ..., N}. The convolutional network f_s can then be described as a sequence of linear transforms interspersed with non-linear symmetric squashing units (typically the tanh function): F_s = f_s(X_s; θ_s) = W_L H_{L-1}, with H_l = tanh(W_l H_{l-1} + b_l) for all l ∈ {1, ..., L-1}, where H_l is the vector of hidden units at layer l for a network with L layers, H_0 = X_s, and b_l is a vector of bias parameters. The matrices W_l are Toeplitz matrices, and therefore each hidden unit vector H_l can be expressed as a regular convolution between the kernel w_{lpq} and the previous hidden unit vector H_{l-1}:

$$
H_{lp} = \tanh\Big( b_{lp} + \sum_{q} w_{lpq} \ast H_{(l-1)q} \Big) \tag{1}
$$

The filters w_{lpq} and the biases b_l constitute the trainable parameters of our model, and are collectively denoted θ_s.

Finally, the outputs of the N networks are upsampled and concatenated so as to produce F, a map of feature vectors the size of F_1, which can be seen as both local patch descriptors and scene-level descriptors:

$$
F = \big[\, F_1, \; u(F_2), \; \ldots, \; u(F_N) \,\big] \tag{2}
$$

where u is an upsampling function.
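The scheme above can be illustrated with a minimal NumPy sketch: the same kernel (shared weights) filters every pyramid level, the coarse outputs are upsampled by u, and the results are stacked. The single 2D kernel, the global normalization and the nearest-neighbor upsampling are deliberate simplifications of the paper's multi-layer network, local normalization and Laplacian pyramid.

```python
import numpy as np

def normalize(x, eps=1e-6):
    # Global zero-mean / unit-variance (stand-in for local neighborhood normalization).
    return (x - x.mean()) / (x.std() + eps)

def downsample(x):
    # 2x2 average pooling (stand-in for one pyramid level).
    h, w = x.shape
    return x[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x, factor):
    # Nearest-neighbor upsampling: the function u in the text.
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def conv2d_same(x, k):
    # Naive 'same' 2D filtering with zero padding.
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def multiscale_features(img, kernel, n_scales=3):
    # Build the pyramid X_s, apply the *same* weights at every scale
    # (theta_s = theta_0), then upsample all outputs back to full
    # resolution and stack them into F.
    pyramid = [normalize(img)]
    for _ in range(n_scales - 1):
        pyramid.append(normalize(downsample(pyramid[-1])))
    feats = [upsample(np.tanh(conv2d_same(xs, kernel)), 2 ** s)[:img.shape[0], :img.shape[1]]
             for s, xs in enumerate(pyramid)]
    return np.stack(feats)  # one feature map per scale, concatenated along a new axis
```

For a 16 × 16 input and three scales this yields a 3-channel feature map aligned with the input grid, each channel encoding a context of a different size.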

As mentioned above, weights are shared between the networks f_s. Intuitively, imposing complete weight sharing across scales is a natural way of forcing the network to learn scale-invariant features, while at the same time reducing the chances of over-fitting. The more scales used to jointly train the models f_s(θ_s), the better the representation becomes for all scales. Because image content is, in principle, scale invariant, using the same function to extract features at each scale is justified. In fact, we observed a performance decrease when removing the weight-sharing.

Parameter-free hierarchical parsing

Predicting the class of a given pixel from its own feature vector is difficult, and not sufficient in practice. The task is easier if we consider a spatial grouping of feature vectors around the pixel, i.e . a neighborhood. Among all possible neighborhoods, one is the most suited to predict the pixel's class. In Section 3.2.1 we propose to formulate the search

Figure 2. Finding the optimal cover. For each pixel (leaf) i, the optimal component C_{k*(i)} is the one along the path between the leaf and the root with minimal cost S_{k*(i)}. The optimal cover is the union of all these components. In this example, the optimal cover {C_1, C_3, C_4, C_5} will result in a segmentation in disjoint sets {C_1, C_2, C_3, C_4}, with the subtle difference that component C_2 will be labelled with the class of C_5, as C_5 is the best observation level for C_2.


for the most adapted neighborhood as an optimization problem. The construction of the cost function that is minimized is then described in Section 3.2.2.

Optimal purity cover

We define the neighborhood of a pixel as a connected component that contains this pixel. Let C_k, ∀k ∈ {1, ..., K} be the set of all possible connected components of the lattice defined on image I, and let S_k be a cost associated with each of these components. For each pixel i, we wish to find the index k*(i) of the component that best explains this pixel, that is, the component with the minimal cost S_{k*(i)}:

$$
k^{*}(i) = \underset{k \,:\, i \in C_k}{\operatorname{argmin}} \; S_k \tag{3}
$$

Note that the components C_{k*(i)} are non-disjoint sets that form a cover of the lattice. Note also that the overall cost S* = Σ_i S_{k*(i)} is minimal.

In practice, the set of components C k is too large, and only a subset of it can be considered. A classical technique to reduce the set of components is to consider a hierarchy of segmentations [18, 1, 8], that can be represented as a tree T . Solving Eq 3 on T can be done simply by exploring the tree in a depth-first search manner, and finding the component with minimal weight along each branch. Figure 2 illustrates the procedure.
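For a tree stored as parent pointers, the search described above reduces to scanning each leaf-to-root path for its minimum-cost node. A small sketch follows; the tree and the cost values are hypothetical, chosen to mimic the situation of Figure 2.

```python
def optimal_cover(parent, cost, leaves):
    """For each pixel (leaf), pick the node with minimal cost S_k on its
    path to the root; the union of the chosen components is the cover.

    parent: dict node -> parent (root maps to None)
    cost:   dict node -> purity cost S_k
    """
    choice = {}
    for leaf in leaves:
        node, best = leaf, leaf
        while node is not None:
            if cost[node] < cost[best]:
                best = node
            node = parent[node]
        choice[leaf] = best
    return choice

# Hypothetical 6-node hierarchy in the spirit of Figure 2.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6", "C5": "C6", "C6": None}
cost = {"C1": 0.1, "C2": 0.9, "C3": 0.2, "C4": 0.2, "C5": 0.3, "C6": 0.8}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```

Here leaf C2 is best explained by its ancestor C5 (cost 0.3 < 0.9), so C2's pixels inherit C5's class, while C1, C3 and C4 explain themselves; the chosen components may overlap, which is exactly what distinguishes a cover from a partition.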

Producing the confidence costs

Given a set of components C_k, we now explain how to produce the confidence costs S_k. These costs represent the class purity of the associated components. Given the groundtruth segmentation, we can compute the cost as the entropy of the distribution of classes present in the component. At test time, when no groundtruth is available, we need to define a function that can predict this cost by simply looking

Figure 3. The shape-invariant attention function a. For each component C_k in the segmentation tree T, the corresponding image segment is encoded by a spatial grid of the feature vectors that fall into this segment. The aggregated feature vector of each grid cell is computed by a component-wise max pooling of the feature vectors centered on all the pixels that fall into the grid cell; this produces a scale-invariant representation of the segment and its surroundings. The result, O_k, is a descriptor that encodes spatial relations between the underlying object's parts. The grid size was set to 5 × 5 for all our experiments.


at the component. We now describe a way of achieving this, as illustrated in Figure 3.

Given the scale-invariant features F, we define a compact representation that describes objects as an elastic spatial arrangement of such features. In other terms, an object, or category in general, can be best described as a spatial arrangement of features, or parts. A simple attention function a is used to mask the feature vector map with each component C_k, producing a set of K masked feature vector patterns {F ∩ C_k}, ∀k ∈ {1, ..., K}. The function a is called an attention function because it suppresses the background around the component being analyzed. The patterns {F ∩ C_k} are resampled to produce fixed-size representations. In our model the sampling is done using an elastic max-pooling function, which remaps input patterns of arbitrary size into a fixed G × G grid. This grid can be seen as a highly invariant representation that encodes spatial relations between an object's attributes/parts. This representation is denoted O_k. Two nice properties of this encoding are: (1) elongated, or in general ill-shaped, objects are handled naturally; (2) the dominant features are used to represent the object, and combined with the background suppression, the pooled features form a solid basis for recognizing the underlying object.
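The attention-plus-pooling step can be sketched as follows: mask the dense feature maps with the component, then max-pool each feature channel into a fixed G × G grid laid over the component's bounding box. This is a minimal sketch; the array layout and the choice of 0 as the default for empty grid cells are assumptions.

```python
import numpy as np

def pooled_descriptor(F, mask, G=5):
    """Elastic max pooling of masked feature vectors into a fixed G x G grid.

    F:    (C, H, W) dense feature maps
    mask: (H, W) boolean mask of the component C_k (the attention function a)
    Returns O_k with shape (C, G, G); empty cells default to 0.
    """
    C, H, W = F.shape
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1   # bounding box of the component
    x0, x1 = xs.min(), xs.max() + 1
    O = np.full((C, G, G), -np.inf)
    for y, x in zip(ys, xs):
        gy = min(G - 1, (y - y0) * G // (y1 - y0))   # grid cell of this pixel
        gx = min(G - 1, (x - x0) * G // (x1 - x0))
        O[:, gy, gx] = np.maximum(O[:, gy, gx], F[:, y, x])
    O[np.isinf(O)] = 0.0   # cells no pixel fell into
    return O
```

Because the grid is stretched over the component's bounding box, the descriptor has the same size whatever the shape or scale of the segment.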

Once we have the set of object descriptors O_k, we define a function c : O_k → [0, 1]^{N_c} (where N_c is the number of classes) that predicts the distribution of classes present in component C_k. We associate a cost S_k with this distribution. In this paper, c is implemented as a simple 2-layer neural network, and S_k is the entropy of the predicted distribution. More formally, let x_k be the feature vector associated with the descriptor O_k, y_k the resulting vector of class scores, d̂_k the predicted class distribution, and S_k the cost associated with this distribution. We have

$$
y_k = W_2 \tanh( W_1 x_k + b_1 ) \tag{4}
$$

$$
\hat{d}_{k,a} = \frac{e^{y_{k,a}}}{\sum_{b} e^{y_{k,b}}} \tag{5}
$$

$$
S_k = -\sum_{a} \hat{d}_{k,a} \, \ln \hat{d}_{k,a} \tag{6}
$$

The matrices W_1 and W_2 are denoted θ_c, and represent the trainable parameters of c. These parameters need to be learned over the complete set of hierarchies, computed on the entire training set available. The exact training procedure is described in Section 4.
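A minimal sketch of the classifier c and the purity cost: a 2-layer network with tanh hidden units and a softmax output, followed by the entropy of the predicted distribution. The dimensions and parameter values used in the example are arbitrary placeholders, not those of the paper.

```python
import numpy as np

def class_distribution(x, W1, b1, W2, b2):
    # 2-layer neural network c: softmax over the class scores y_k.
    y = W2 @ np.tanh(W1 @ x + b1) + b2
    e = np.exp(y - y.max())          # numerically stabilized softmax
    return e / e.sum()

def purity_cost(d_hat, eps=1e-12):
    # Entropy of the predicted class distribution: low entropy = pure segment.
    return -np.sum(d_hat * np.log(d_hat + eps))
```

A one-hot (pure) distribution has near-zero cost, while a uniform distribution over N_c classes costs ln N_c, the maximum; the optimal cover therefore prefers components whose predicted histogram is concentrated on a single class.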

Training procedure

Let ℱ be the set of all feature maps in the training set, and 𝒯 the set of all hierarchies. Training the model described in Section 3 is done in two steps. First, we train the low-level feature extractor f_s in complete independence of the rest of the model. The goal of this first step is to produce features F ∈ ℱ that are maximally discriminative for pixelwise classification. Next, we construct the hierarchies T ∈ 𝒯 on the entire training set and, for all T ∈ 𝒯, train the classifier c to predict the distribution of classes in each component C_k ∈ T, as well as the costs S_k. Once this second step is done, all the functions in Figure 1 are defined, and inference can be performed on arbitrary images. The next two sections describe these two steps.

Learning discriminative scale-invariant features

As described in Section 3.1, feature vectors in ℱ are obtained by concatenating the outputs of multiple networks f_s, each taking as input a different image in a multiscale pyramid. Ideally, a linear classifier should produce the correct categorization for all pixel locations i from the feature vectors F_i. We train the parameters θ_s to achieve this goal. Let c_i be the true target vector for pixel i and ĉ_i the normalized prediction from the linear classifier; we set:

$$
L_{cat} = \sum_{i} l_{cat}( \hat{c}_i, c_i ) \tag{7}
$$

$$
l_{cat}( \hat{c}_i, c_i ) = -\sum_{a} c_{i,a} \, \ln \hat{c}_{i,a} \tag{8}
$$

$$
\hat{c}_{i,a} = \frac{e^{w_a^{T} F_i}}{\sum_{b} e^{w_b^{T} F_i}} \tag{9}
$$

The elementary loss function l_cat(ĉ_i, c_i) in Eq 7 is chosen to penalize the deviation of the multiclass prediction ĉ_i from the target vector c_i. In this paper, we use the multiclass cross-entropy loss function. In order to use this loss function, we compute a normalized predicted probability distribution over classes, ĉ_{i,a}, using the softmax function in Eq 9.

The cross-entropy between the predicted class distribution and the target class distribution at a pixel location i is then measured by Eq 8. The true target probability c_{i,a} of class a being present at location i can either be a distribution of classes over a given neighborhood of i, or a hard target vector: c_{i,a} = 1 if pixel i is labeled a, and 0 otherwise. To train maximally discriminative features, we use hard target vectors in this first stage. Once the parameters θ_s are trained, we discard the classifier in Eq 9.
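The first-stage objective can be sketched as a standard softmax cross-entropy summed over pixels, with hard targets; the rows of the score matrix stand in for the linear-classifier responses at each pixel.

```python
import numpy as np

def pixelwise_cross_entropy(scores, targets, eps=1e-12):
    """Multiclass cross-entropy summed over pixels, with hard targets.

    scores:  (P, Nc) linear class scores, one row per pixel
    targets: (P,) integer class label per pixel
    """
    z = scores - scores.max(axis=1, keepdims=True)          # stabilized softmax
    c_hat = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    p = c_hat[np.arange(len(targets)), targets]             # prob. of the true class
    return -np.log(p + eps).sum()
```

With uniform scores the loss per pixel is ln N_c; it shrinks toward zero as the classifier grows confident in the correct class.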

Teaching a classifier to find its best observation level

Given the trained parameters θ_s, we build ℱ and 𝒯, i.e., we compute all the vector maps F and the hierarchies T on all the training data available, so as to produce a new training set of descriptors O_k. This time, the parameters θ_c of the classifier c are trained to minimize the KL-divergence between the true (known) distributions of labels d_k in each component and the prediction from the classifier d̂_k (Eq 5):

$$
L_{div} = \sum_{k} \sum_{a} d_{k,a} \, \ln \frac{d_{k,a}}{\hat{d}_{k,a}} \tag{10}
$$

In this setting, the groundtruth distributions d_k are not hard target vectors, but normalized histograms of the labels present in component C_k. Once the parameters θ_c are trained, d̂_k accurately predicts the distribution of labels, and Eq 6 can be used to assign a purity cost to the component.
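The second-stage objective for one component can be sketched as the KL-divergence between the groundtruth label histogram and the classifier's prediction (summed over all components during training):

```python
import numpy as np

def kl_div(d, d_hat, eps=1e-12):
    # KL(d || d_hat); terms with d_a = 0 contribute nothing by convention.
    d, d_hat = np.asarray(d, float), np.asarray(d_hat, float)
    m = d > 0
    return np.sum(d[m] * np.log(d[m] / (d_hat[m] + eps)))
```

The divergence is zero when the prediction matches the histogram exactly, and strictly positive otherwise, so its gradient pushes d̂_k toward the observed label proportions.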

Experiments

We report results on three standard datasets. (1) The Stanford Background dataset, introduced in [6] for evaluating methods for semantic scene understanding. The dataset contains 715 images chosen from other existing public datasets such that all the images are outdoor scenes, have approximately 320 × 240 pixels, and contain at least one foreground object. We use the evaluation procedure introduced in [6], 5-fold cross-validation: 572 images used for training, and 143 for testing. (2) The SIFT Flow dataset, as described by Liu et al. [15]. This dataset is composed of 2,688 images that have been thoroughly labeled by LabelMe users. Liu et al. [15] split this dataset into 2,488 training images and 200 test images, and used synonym correction to obtain 33 semantic labels. We use this same training/test split. (3) The Barcelona dataset, as described by Tighe et al. [24], is derived from the LabelMe subset used in [21]. It has 14,871 training and 279 test images. The test set consists of street scenes from Barcelona, while the training set ranges over scene types but has no street scenes from Barcelona. Synonyms were manually consolidated by [24] to produce 170 unique labels.

For all experiments, we use a 2-stage convolutional network. The input I, a 3-channel image, is transformed into a 16-dimension feature map using a bank of 16 7 × 7 filters followed by tanh units; this feature map is then pooled using a 2 × 2 max-pooling layer. The second layer transforms the 16-dimension feature map into a 64-dimension feature map, each component being produced by a combination of 8 7 × 7 filters (512 filters), followed by tanh units; the map is pooled using a 2 × 2 max-pooling layer. Finally, the 64-dimension feature map is transformed into a 256-dimension feature map, each component being produced by a combination of 16 7 × 7 filters (2048 filters).

The network is applied to a locally normalized Laplacian pyramid constructed on the input image. For these experiments, the pyramid consists of 3 rescaled versions of the input (N = 3), in octaves: 320 × 240, 160 × 120, 80 × 60. All inputs are properly padded, and the outputs of each of the 3 networks are upsampled and concatenated so as to produce a 256 × 3 = 768-dimension feature vector map F. The network is trained on all 3 scales in parallel.

A simple grid search was performed to find the best learning rate and regularization parameters (weight decay), using a holdout of 10% of the training dataset for validation. More regularization was necessary to train the classifier c. For all datasets, jitter was used to artificially expand the size of the training data and ensure that the features do not overfit irrelevant biases present in the data. Jitter includes horizontal flipping of all images, and rotations between -8 and 8 degrees.
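The jitter transforms can be sketched as follows; the nearest-neighbor rotation is an illustrative implementation, not necessarily the interpolation used in the paper.

```python
import numpy as np

def jitter(img, angle_deg, flip):
    """Augment a (H, W) image: optional horizontal flip, then a
    nearest-neighbor rotation about the center (the text uses -8..8 degrees)."""
    out = img[:, ::-1] if flip else img
    t = np.deg2rad(angle_deg)
    h, w = out.shape
    cy, cx = (h - 1) / 2, (w - 1) / 2
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for each output pixel, sample the source coordinate.
    sy = cy + (ys - cy) * np.cos(t) - (xs - cx) * np.sin(t)
    sx = cx + (ys - cy) * np.sin(t) + (xs - cx) * np.cos(t)
    sy = np.clip(np.rint(sy), 0, h - 1).astype(int)
    sx = np.clip(np.rint(sx), 0, w - 1).astype(int)
    return out[sy, sx]
```

Applying the same transform to the label map keeps image and groundtruth aligned, so each jittered copy is a valid training example.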

In this paper, the hierarchy used to find the optimal cover is a simple hierarchy constructed on the raw image gradient, based on a standard volume criterion [16, 4], completed by the removal of non-informative small components (fewer than 100 pixels). Classically, segmentation methods find a partition of the segments rather than a cover. Partitioning the segments consists of finding an optimal cut in the tree (so that each terminal node in the pruned tree corresponds to a segment). We experimented with a number of methods to do so, including graph cuts [5, 2], Kruskal [12] and Power Watersheds [3], but the results were systematically worse than with our optimal cover method.

On the Stanford dataset, we report two experiments: a baseline system, based on the multiscale convolutional network alone; and the full model as described in Section 3. Results are reported in Table 1. On the two other datasets, we report results for our complete model only, in Tables 2 and 3. Example parses on the SIFT Flow dataset are shown on Figure 4.

Baseline, multiscale network: for our baseline, the multiscale network is trained as a simple class predictor for each location $i$, using the single classification loss $L_{\mathrm{cat}}$ defined in Eq 7. With this simple system, the pixelwise accuracy is surprisingly good, but the visual aspect of the predictions clearly suffers from poor spatial consistency and poor object delineation.

Table 1. Performance of our system on the Stanford Background dataset [6]: per-pixel accuracy / average per-class accuracy. The third column reports approximate compute times, as reported by the authors. Note: we benchmarked our algorithms using a modern 4-core Intel i7, which could give us an unfair advantage over the competition.

Complete system, network and hierarchy: in this second experiment, we use the complete model, as described in Section 3. The 2-layer neural network (Eq 4) has $3\times 3\times 3\times 256=6912$ input units (using a $3\times 3$ grid of feature vectors from $\mathbf{F}$) and 512 hidden units; 8 output units are needed for the Stanford Background dataset, 33 for the SIFT Flow dataset, and 170 for the Barcelona dataset. Results are significantly better than with the baseline method; in particular, much better delineation is achieved.

For the SIFT Flow dataset, we experimented with two sampling methods when learning the multiscale features: respecting the natural frequencies of classes, and balancing them so that an equal amount of each class is shown to the network. Both results are reported in Table 2. Training with balanced frequencies allows better discrimination of small objects, and although it decreases the overall pixelwise accuracy, it is more correct from a recognition point of view. Frequency balancing was used on the Stanford Background dataset, as it consistently gave better results. For the Barcelona dataset, both sampling methods were used as well, but frequency balancing worked rather poorly in that case. This could be explained by the fact that this dataset has a large number of classes with very few training examples. These classes are therefore extremely hard to model, and overfitting occurs much faster than for the SIFT Flow dataset. Results are shown in Table 3.

Results in Table 1 also demonstrate the impressive computational advantage of convolutional networks over competing algorithms. Training time is also remarkably fast: results on the Stanford Background dataset were typically obtained in 24 h on a regular server.

Discussion

We introduced a discriminative framework for learning to identify and delineate objects in a scene. Our model does not rely on engineered features, and uses a multi-scale convolutional network operating on raw pixels to learn appropriate low-level and mid-level features. The convolutional network is trained in supervised mode to directly produce labels. Unlike many other scene parsing systems that rely on expensive graphical models to ensure consistent labelings, our system relies on a segmentation tree in which the nodes (corresponding to image segments) are labeled with the entropy of the distribution of classes contained in the corresponding segment. Instead of graph cuts or other inference methods, we use the new concept of optimal cover to extract the most consistent segmentation from the tree.

Table 2. Performance of our system on the SIFT Flow dataset [15]: per-pixel accuracy / average per-class accuracy. Our multiscale network is trained using two sampling methods: (1) natural frequencies, (2) balanced frequencies.

The complexity of each operation is linear in the number of pixels, except for the construction of the tree, which is quasi-linear (and thus cheap in practice). The system produces state-of-the-art accuracy on the Stanford Background, SIFT Flow, and Barcelona datasets (measured both per pixel and averaged per class), while dramatically outperforming competing models in inference time.

Our current system relies on a single segmentation tree constructed from image gradients, and implicitly assumes that the correct segmentation is contained in the tree. Future work will involve searches over multiple segmentation trees, or will use graphs other than simple trees to encode the possible segmentations (since our optimal cover algorithm can work on graphs other than trees). Other directions for improvement include the use of structured learning criteria such as Turaga et al.'s Maximin Learning [25] to learn low-level feature vectors from which better segmentation trees can be produced.

$$ \mathbf{H}_{lp}=\tanh\Big(b_{lp}+\sum_{q\in\mathrm{parents}(p)}\mathbf{w}_{lpq}*\mathbf{H}_{l-1,q}\Big). $$

$$ \mathbf{F}=[\mathbf{F}_{1},\, u(\mathbf{F}_{2}),\, \dots,\, u(\mathbf{F}_{N})], $$

$$ R^{*}=\underset{R_{k}}{\operatorname{argmin}}\ \frac{1}{N_{k}}\sum_{i\in\mathrm{comps}(R_{k})}S_{ki}. $$

$$ l_{\mathrm{div}}=\sum_{a\in\mathrm{classes}}\mathbf{d}_{k,a}\,\ln\!\left(\frac{\mathbf{d}_{k,a}}{\hat{\mathbf{d}}_{k,a}}\right). $$

References


Figure 4. Typical results achieved on the SIFT Flow dataset.


Table 1 — Stanford Background dataset (P = per-pixel accuracy, C = average per-class accuracy, CT = compute time):

| Method | P / C | CT |
| Gould et al. [6] | 76.4% / - | 10 s to 10 min |
| Munoz et al. [17] | 76.9% / 66.2% | 12 s |
| Tighe et al. [24] | 77.5% / - | 10 s to 5 min |
| Socher et al. [23] | 78.1% / - | ? |
| Kumar et al. [13] | 79.4% / - | 10 s to 10 min |
| multiscale net | 77.5% / 70.0% | 0.5 s |
| multiscale net + cover | 79.5% / 74.3% | 1 s |

Table 2 — SIFT Flow dataset (P / C):

| Method | P / C |
| Liu et al. [15] | 74.75% / - |
| Tighe et al. [24] | 76.9% / 29.4% |
| multiscale net + cover (1) | 78.5% / 29.6% |
| multiscale net + cover (2) | 74.2% / 46.0% |

Table 3 — Barcelona dataset (P / C):

| Method | P / C |
| Tighe et al. [24] | 66.9% / 7.6% |
| multiscale net + cover (1) | 67.8% / 9.5% |
| multiscale net + cover (2) | 39.1% / 10.7% |


Full scene labeling (FSL) is the task of labeling each pixel in a scene with the category of the object to which it belongs. FSL requires solving the detection, segmentation, recognition and contextual integration problems simultaneously, so as to produce a globally consistent labeling. One of the obstacles to FSL is that the information necessary for the labeling of a given pixel may come from very distant pixels as well as their labels. The category of a pixel may depend on relatively short-range information (e.g. the presence of a human face generally indicates the presence of a human body nearby), as well as on very long-range dependencies (is this grey pixel part of a road, a building, or a cloud?).

This paper proposes a new method for FSL, depicted in Figure 1, that relies on five main ingredients: 1) Trainable, dense, multi-scale feature extraction: a multi-scale, dense feature extractor produces a series of feature vectors for regions of multiple sizes centered around every pixel in the image, covering a large context. The feature extractor is a two-stage convolutional network applied to a multi-scale contrast-normalized Laplacian pyramid computed from the image. The convolutional network is fed with raw pixels and trained end to end, thereby alleviating the need for hand-engineered features. 2) Segmentation tree: a graph over pixels is computed in which each pixel is connected to its 4 nearest neighbors through an edge whose weight is a measure of dissimilarity between the colors of the two pixels. A segmentation tree is then constructed using a classical region-merging method based on the minimum spanning tree of the graph. Each node in the tree corresponds to a potential image segment. The final image segmentation will be a judiciously chosen subset of nodes of the tree whose corresponding regions cover the entire image. 3) Region-wise feature aggregation: for each node in the tree, the corresponding image segment is encoded by a $5\times 5$ spatial grid of aggregated feature vectors. The aggregated feature vector of each grid cell is computed by a component-wise max pooling of the feature vectors centered on all the pixels that fall into the grid cell; this produces a scale-invariant representation of the segment and its surroundings. 4) Class histogram estimation: a classifier is then applied to the aggregated feature grid of each node. The classifier is trained to estimate the histogram of all object categories present in its input segments. 5) Optimal purity cover: a subset of tree nodes is selected whose corresponding segments cover the entire image. The nodes are selected so as to minimize the average "impurity" of the class distributions. The class "impurity" is defined as the entropy of the class distribution. The choice of the cover thus attempts to find a consistent overall segmentation in which each segment contains pixels belonging to only one of the learned categories.

All the steps in the process have a complexity linear (or almost linear) in the number of pixels. The bulk of the computation resides in the convolutional network feature extractor. The resulting system is very fast, producing a full parse of a $320\times 240$ image in less than 1 second on a conventional CPU. Once trained, the system is parameter free, and requires no adjustment of thresholds or other knobs.

There are three key contributions in this paper: 1) using a multi-scale convolutional net to learn good features for region classification; 2) using a class purity criterion to decide whether a segment contains a single object, as opposed to several objects or a part of an object; 3) an efficient procedure to obtain a cover that optimizes the overall class purity of a segmentation.

The problem of scene parsing has been approached with a wide variety of methods in recent years. Many methods rely on MRFs, CRFs, or other types of graphical models to ensure the consistency of the labeling and to account for context [9, 22, 6, 13, 17, 24]. Most methods rely on a pre-segmentation into super-pixels or other segment candidates [6, 13, 17, 24], and extract features and categories from individual segments and from various combinations of neighboring segments. The graphical model inference pulls out the most consistent set of segments that cover the image.

Socher et al. [23] propose a method to aggregate segments in a greedy fashion using a trained scoring function. The originality of the approach is that the feature vector of the combination of two segments is computed from the feature vectors of the individual segments through a trainable function. Like us, they use “deep learning” methods to train their feature extractor. But unlike us, their feature extractor operates on hand-engineered features.

One of the main questions in scene parsing is how to take a wide context into account when making a local decision. Munoz et al. [17] proposed to use the histogram of labels extracted from a coarse scale as input to the labeler that looks at finer scales. Our approach is somewhat simpler: our feature extractor is applied densely to an image pyramid. The coarse feature maps thereby generated are upsampled to match the resolution of the finest scale. Hence, with three scales, each feature vector has multiple fields which encode multiple regions of increasing sizes and decreasing resolutions, centered on the same pixel location.

Like us, a number of authors have used trees to generate candidate segments by aggregating elementary segments, as in [22]. Using trees makes it possible to rely on fast inference algorithms based on graph cuts or other methods. In this paper, we use an innovative method based on finding a set of tree nodes that covers the image while minimizing some criterion.

Our system extracts features densely from a multiscale pyramid of images using a convolutional network (ConvNet) [14]. ConvNets can be fed with raw pixels and can automatically learn low-level and mid-level features, alleviating the need for hand-engineered features. One big advantage of ConvNets is the ability to compute dense features efficiently over large images. ConvNets are best known for their applications to detection and recognition [20, 11], but they have also been used for image segmentation, particularly for biological image segmentation [19, 10, 25].

The only published work on using ConvNets for scene parsing is that of Grangier et al. [7]. While somewhat preliminary, their work showed that convolutional networks fed with raw pixels could be trained to perform scene parsing with decent accuracy. Unlike [7] however, our system uses a boundary-based over-segmentation to align the labels produced by the ConvNet to the boundaries in the image. Our system also takes advantage of the boundary-based over-segmentation to produce representations that are independent of the size of the segment through feature pooling.

The model proposed in this paper, depicted in Figure 1, relies on two complementary image representations. In the first representation, the image is seen as a point in a high-dimensional space, and we seek a transform $f:\mathbb{R}^{P}\rightarrow\mathbb{R}^{Q}$ that maps these images into a space in which each pixel can be assigned a label using a simple linear classifier. This first representation typically suffers from two main problems: (1) the window considered rarely contains an object that is properly centered and scaled, and therefore offers a poor observation basis for predicting the class of the underlying object; (2) integrating a large context involves increasing the grid size, and therefore the dimensionality $P$ of the input; given a finite amount of training data, it is then necessary to enforce some invariance in the function $f$ itself. This is usually achieved with pooling/subsampling layers, which in turn degrade the ability of the model to precisely locate and delineate objects. In this paper, $f$ is implemented by a multiscale convolutional network, which allows integrating large contexts (as large as the complete scene) into local decisions while remaining manageable in terms of parameters/dimensionality. This multiscale model, in which weights are shared across scales, captures long-range interactions without the penalty of extra parameters to train. It is described in Section 3.1.

In the second representation, the image is seen as an edge-weighted graph, on which a hierarchy of segmentations/clusterings can be constructed. This representation yields a natural abstraction of the original pixel grid, and provides a hierarchy of observation levels for all the objects in the image. It can be used as a solution to the first problem exposed above: assuming the capability of assessing the quality of all the components of this hierarchy, a system can automatically choose its components so as to produce the best set of predictions. Moreover, these components are spatially accurate, and naturally delineate the underlying objects, as this representation conserves pixel-level precision. Section 3.2 describes our methodology.

Our feature extractor is based on a convolutional network. Convolutional networks are natural extensions of neural networks, in which weights are replicated over space, or in other terms the linear transforms are done using 2D convolutions. A convolution can be seen as a linear transform with shared (replicated) weights. The use of weight sharing is justified by the fact that image statistics are stationary, and features and combinations of features that are relevant in one region of an image are also relevant in other regions. In fact, by enforcing this constraint, each layer of a convolutional network is explicitly forced to model features that are shift-equivariant. Because of the imposed weight-sharing, convolutional networks have been used successfully for a number of image labeling problems.

More holistic tasks, such as full-scene understanding (pixel-wise labeling, or any dense feature estimation) require the system to model complex interactions at the scale of complete images, not simply within a patch. In this problem the dimensionality becomes unmanageable: for a typical image of $256\times 256$ pixels, a naive neural network would require millions of parameters, and a naive convolutional network would require filters that are unreasonably large to view enough context.

Our multiscale convolutional network overcomes these limitations by extending the concept of weight replication to the scale space. Given an input image $\mathbf{I}$, a multiscale pyramid of images $\mathbf{X}_{s},\ \forall s\in\{1,\dots,N\}$ is constructed, with $\mathbf{X}_{1}$ being the size of $\mathbf{I}$. The multiscale pyramid can be a Laplacian pyramid, and is typically pre-processed so that local neighborhoods have zero mean and unit standard deviation. Given a classical convolutional network $f_{s}$ with parameters $\theta_{s}$, the multiscale network is obtained by instantiating one network per scale $s$, and sharing all parameters across scales: $\theta_{s}=\theta_{0},\ \forall s\in\{1,\dots,N\}$.

More precisely, the scaled inputs are computed using the scaling/normalizing function $g_{s}$ as $\mathbf{X}_{s}=g_{s}(\mathbf{I})$ for all $s\in\{1,\dots,N\}$. The convolutional network $f_{s}$ can then be described as a sequence of linear transforms interspersed with non-linear symmetric squashing units (typically the $\tanh$ function): $\mathbf{F}_{s}=f_{s}(\mathbf{X}_{s};\theta_{s})=\mathbf{W}_{L}\mathbf{H}_{L-1}$, with $\mathbf{H}_{l}=\tanh(\mathbf{W}_{l}\mathbf{H}_{l-1}+\mathbf{b}_{l})$ for all $l\in\{1,\dots,L-1\}$, where $\mathbf{H}_{l}$ is the vector of hidden units at layer $l$ of an $L$-layer network, $\mathbf{H}_{0}=\mathbf{X}_{s}$, and $\mathbf{b}_{l}$ is a vector of bias parameters. The matrices $\mathbf{W}_{l}$ are Toeplitz matrices, so each hidden unit vector $\mathbf{H}_{l}$ can be expressed as a regular convolution between the kernel $\mathbf{w}_{lpq}$ and the previous hidden unit vector $\mathbf{H}_{l-1}$.
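The layer update $\mathbf{H}_{lp}=\tanh(b_{lp}+\sum_{q}\mathbf{w}_{lpq}*\mathbf{H}_{l-1,q})$ can be sketched as an explicit 'valid' cross-correlation in plain numpy. This is a toy illustration of the equation, not the authors' implementation (which uses optimized convolution routines):

```python
import numpy as np

def conv_layer(H_prev, kernels, bias):
    """One convolutional layer: H_l = tanh(b + sum_q w_{pq} * H_{l-1,q}).
    H_prev: (Q, h, w) input maps; kernels: (P, Q, k, k); bias: (P,).
    Computes a 'valid' 2D cross-correlation, written out explicitly."""
    P, Q, k, _ = kernels.shape
    _, h, w = H_prev.shape
    out = np.zeros((P, h - k + 1, w - k + 1))
    for p in range(P):
        for q in range(Q):
            for i in range(h - k + 1):
                for j in range(w - k + 1):
                    out[p, i, j] += np.sum(kernels[p, q] * H_prev[q, i:i+k, j:j+k])
    return np.tanh(out + bias[:, None, None])

# Tiny example: 1 input map of ones, 1 output map, 2x2 averaging kernel.
H0 = np.ones((1, 3, 3))
w0 = np.full((1, 1, 2, 2), 0.25)
H1 = conv_layer(H0, w0, np.zeros(1))   # every output is tanh(1.0)
```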

The filters $\mathbf{w}_{lpq}$ and the biases $\mathbf{b}_{l}$ constitute the trainable parameters of our model, and are collectively denoted $\theta_{s}$.

Finally, the outputs of the $N$ networks are upsampled and concatenated so as to produce $\mathbf{F}$, a map of feature vectors the size of $\mathbf{F}_{1}$, which can be seen as both local patch descriptors and scene-level descriptors,

where $u$ is an upsampling function.
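The pyramid-then-concatenate pipeline $\mathbf{F}=[\mathbf{F}_{1}, u(\mathbf{F}_{2}), \dots, u(\mathbf{F}_{N})]$ can be sketched as follows. This is a simplified stand-in: `downsample` uses $2\times 2$ average pooling rather than a smoothed Laplacian pyramid, `upsample` is nearest-neighbour, and the shared-weight network `f` is passed in as a function:

```python
import numpy as np

def downsample(x):
    # One pyramid octave: 2x2 average pooling (a crude stand-in for the
    # contrast-normalized Laplacian pyramid used in the paper).
    h, w, c = x.shape
    return x[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2, c).mean(axis=(1, 3))

def upsample(x, factor):
    # Nearest-neighbour upsampling back to the finest resolution (u).
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def multiscale_features(image, f, n_scales=3):
    """F = [F_1, u(F_2), ..., u(F_N)]: apply the SAME feature function f
    (shared weights across scales), upsample each output, concatenate."""
    maps, x = [], image
    for s in range(n_scales):
        maps.append(upsample(f(x), 2 ** s))
        x = downsample(x)
    return np.concatenate(maps, axis=-1)

# Identity "network" for illustration: 3 scales of a 3-channel image
# give a 3 * 3 = 9-dimensional feature vector per pixel.
img = np.random.rand(8, 8, 3)
F = multiscale_features(img, lambda x: x)
```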

As mentioned above, weights are shared between the networks $f_{s}$. Intuitively, imposing complete weight sharing across scales is a natural way of forcing the network to learn scale-invariant features, and at the same time reduces the chances of over-fitting. The more scales used to jointly train the models $f_{s}(\theta_{s})$, the better the representation becomes for all scales. Because image content is, in principle, scale invariant, using the same function to extract features at each scale is justified. In fact, we observed a performance decrease when removing the weight sharing.

Predicting the class of a given pixel from its own feature vector is difficult, and not sufficient in practice. The task is easier if we consider a spatial grouping of feature vectors around the pixel, i.e. a neighborhood. Among all possible neighborhoods, one is the most suited to predict the pixel’s class. In Section 3.2.1 we propose to formulate the search for the most adapted neighborhood as an optimization problem. The construction of the cost function that is minimized is then described in Section 3.2.2.

We define the neighborhood of a pixel as a connected component that contains this pixel. Let $C_{k},\ \forall k\in\{1,\dots,K\}$ be the set of all possible connected components of the lattice defined on image $I$, and let $S_{k}$ be a cost associated with each of these components. For each pixel $i$, we wish to find the index $k^{*}(i)$ of the component that best explains this pixel, that is, the component with the minimal cost $S_{k^{*}(i)}$:

Note that the components $C_{k^{*}(i)}$ are non-disjoint sets that form a cover of the lattice. Note also that the overall cost $S^{*}=\sum_{i}S_{k^{*}(i)}$ is minimal.

In practice, the set of components $C_{k}$ is too large, and only a subset of it can be considered. A classical technique to reduce the set of components is to consider a hierarchy of segmentations [18, 1, 8], which can be represented as a tree $T$. Solving Eq 3 on $T$ can be done simply by exploring the tree in a depth-first search manner, and finding the component with minimal weight along each branch. Figure 2 illustrates the procedure.
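The depth-first procedure above can be sketched in a few lines. This is a minimal illustration assuming a hypothetical dict-based tree where leaves stand for elementary regions (or pixels); for each leaf, it keeps the cheapest component seen on the root-to-leaf branch:

```python
def optimal_cover(tree, costs, root):
    """For each leaf of a segmentation tree, pick the node (the leaf
    itself or one of its ancestors) with minimal cost S_k along the
    root-to-leaf branch. tree: {node: [children]} with [] for leaves;
    costs: {node: S_k}. Returns {leaf: chosen_node}."""
    best = {}
    def dfs(node, best_so_far):
        # Keep the cheapest component encountered on the way down.
        if costs[node] < costs[best_so_far]:
            best_so_far = node
        children = tree[node]
        if not children:            # leaf = elementary region / pixel
            best[node] = best_so_far
        for child in children:
            dfs(child, best_so_far)
    dfs(root, root)
    return best

# Toy tree: root A covers {B, C}; B covers leaves {d, e}; C is a leaf.
tree = {'A': ['B', 'C'], 'B': ['d', 'e'], 'C': [], 'd': [], 'e': []}
costs = {'A': 0.9, 'B': 0.2, 'C': 0.5, 'd': 0.4, 'e': 0.8}
cover = optimal_cover(tree, costs, 'A')   # B covers d and e; C covers itself
```

Note that the selected nodes (here B and C) indeed form a cover of the leaves, and may overlap in the general non-tree case.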

Given a set of components $C_{k}$, we explain how to produce all the confidence costs $S_{k}$. These costs represent the class purity of the associated components. Given the groundtruth segmentation, we can compute the cost as the entropy of the distribution of classes present in the component. At test time, when no groundtruth is available, we need to define a function that can predict this cost by simply looking at the component. We now describe a way of achieving this, as illustrated in Figure 3.

Given the scale-invariant features $\mathbf{F}$, we define a compact representation that describes objects as an elastic spatial arrangement of such features. In other terms, an object, or category in general, can be best described as a spatial arrangement of features, or parts. A simple attention function $a$ is used to mask the feature vector map with each component $C_{k}$, producing a set of $K$ masked feature vector patterns $\{\mathbf{F}\bigcap C_{k}\},\ \forall k\in\{1,\dots,K\}$. The function $a$ is called an attention function because it suppresses the background around the component being analyzed. The patterns $\{\mathbf{F}\bigcap C_{k}\}$ are resampled to produce fixed-size representations. In our model the resampling is done using an elastic max-pooling function, which remaps input patterns of arbitrary size onto a fixed $G\times G$ grid. This grid can be seen as a highly invariant representation that encodes spatial relations between an object's attributes/parts. This representation is denoted $\mathbf{O}_{k}$. Two nice properties of this encoding are: (1) elongated, or in general ill-shaped, objects are handled gracefully; (2) the dominant features are used to represent the object, and, combined with background suppression, the pooled features form a solid basis for recognizing the underlying object.
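The elastic max-pooling step can be sketched as follows. This is a minimal numpy illustration, assuming the grid is laid over the component's bounding box (the paper does not spell out this detail); each cell takes the component-wise max over the masked pixels that fall into it:

```python
import numpy as np

def grid_max_pool(F, mask, G=5):
    """Pool the feature vectors of one component into a fixed G x G grid.
    F: (H, W, D) feature map; mask: (H, W) boolean component support.
    Each grid cell takes the component-wise max over the component's
    pixels that fall into that cell (empty cells stay at 0)."""
    ys, xs = np.nonzero(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    out = np.zeros((G, G, F.shape[2]))
    for y, x in zip(ys, xs):
        # Map the pixel into the G x G grid over the bounding box.
        gy = min(G - 1, (y - y0) * G // (y1 - y0))
        gx = min(G - 1, (x - x0) * G // (x1 - x0))
        out[gy, gx] = np.maximum(out[gy, gx], F[y, x])
    return out

# Example: a 10x10 map with 2-dim features; the component is the left half.
F = np.random.rand(10, 10, 2)
mask = np.zeros((10, 10), dtype=bool)
mask[:, :5] = True
O = grid_max_pool(F, mask, G=5)   # fixed-size descriptor, any input shape
```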

Once we have the set of object descriptors $\mathbf{O}_{k}$, we define a function $c:\mathbf{O}_{k}\rightarrow[0,1]^{N_{c}}$ (where $N_{c}$ is the number of classes) that predicts the distribution of classes present in component $C_{k}$. We associate a cost $S_{k}$ to this distribution. In this paper $c$ is implemented as a simple 2-layer neural network, and $S_{k}$ is the entropy of the predicted distribution. More formally, let $\mathbf{x}_{k}$ be the feature vector associated with descriptor $\mathbf{O}_{k}$, $\hat{\mathbf{d}}_{k}$ the predicted class distribution, and $S_{k}$ the cost associated with this distribution. We have:
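A minimal numpy sketch of this classifier and its entropy cost, with randomly initialized (untrained) weights purely for illustration:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax normalization.
    e = np.exp(z - z.max())
    return e / e.sum()

def purity_cost(x, W1, b1, W2, b2):
    """c(O_k): a 2-layer net predicting a class distribution d_hat,
    scored by its entropy S_k (high entropy = impure component).
    Shapes: x (D,), W1 (H, D), b1 (H,), W2 (Nc, H), b2 (Nc,)."""
    h = np.tanh(W1 @ x + b1)
    d_hat = softmax(W2 @ h + b2)
    S = -np.sum(d_hat * np.log(d_hat + 1e-12))   # entropy of the prediction
    return d_hat, S

rng = np.random.default_rng(0)
D, H, Nc = 8, 4, 3
d_hat, S = purity_cost(rng.standard_normal(D),
                       rng.standard_normal((H, D)), np.zeros(H),
                       rng.standard_normal((Nc, H)), np.zeros(Nc))
```

The entropy $S_k$ is bounded by $\ln N_c$, attained when the predicted histogram is uniform, i.e. maximally impure.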

The matrices $\mathbf{W}_{1}$ and $\mathbf{W}_{2}$ are denoted $\theta_{c}$, and represent the trainable parameters of $c$. These parameters are learned over the complete set of hierarchies, computed on the entire training set available. The exact training procedure is described in Section 4.

Let ${\cal F}$ be the set of all feature maps in the training set, and ${\cal T}$ the set of all hierarchies. Training the model described in Section 3 is done in two steps. First, we train the low-level feature extractor $f_{s}$ in complete independence of the rest of the model. The goal of this first step is to produce features $(\mathbf{F})_{\mathbf{F}\in{\cal F}}$ that are maximally discriminative for pixelwise classification. Next, we construct the hierarchies $(T)_{T\in{\cal T}}$ on the entire training set and, for all $T\in{\cal T}$, train the classifier $c$ to predict the distribution of classes in each component $C_{k}\in T$, as well as the costs $S_{k}$. Once this second step is done, all the functions in Figure 1 are defined, and inference can be performed on arbitrary images. The next two sections describe these two steps.

As described in Section 3.1, feature vectors in $\mathbf{F}$ are obtained by concatenating the outputs of multiple networks $f_{s}$, each taking as input a different image in a multiscale pyramid. Ideally, a linear classifier should produce the correct categorization for all pixel locations $i$ from the feature vectors $\mathbf{F}_{i}$. We train the parameters $\theta_{s}$ to achieve this goal. Letting $\mathbf{c}_{i}$ be the true target vector for pixel $i$ and $\hat{\mathbf{c}}_{i}$ the normalized prediction from the linear classifier, we set:

The elementary loss function $l_{\mathrm{cat}}(\hat{\mathbf{c}}_{i},\mathbf{c}_{i})$ in Eq 7 is chosen to penalize the deviation of the multiclass prediction $\hat{\mathbf{c}}_{i}$ from the target vector $\mathbf{c}_{i}$. In this paper, we use the multiclass cross-entropy loss function. To use it, we compute a normalized predicted probability distribution over classes $\hat{\mathbf{c}}_{i,a}$ using the softmax function in Eq 9. The cross entropy between the predicted class distribution and the target class distribution at a pixel location $i$ is then measured by Eq 8. The true target probability $\mathbf{c}_{i,a}$ of class $a$ being present at location $i$ can either be a distribution of classes at location $i$ in a given neighborhood, or a hard target vector: $\mathbf{c}_{i,a}=1$ if pixel $i$ is labeled $a$, and $0$ otherwise. For training maximally discriminative features, we use hard target vectors in this first stage. Once the parameters $\theta_{s}$ are trained, we discard the classifier in Eq 9.
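With a hard target vector, the per-pixel cross entropy reduces to the negative log of the softmax probability assigned to the labeled class. A minimal sketch (illustrative only, not the paper's training code):

```python
import numpy as np

def cross_entropy_loss(scores, target_class):
    """Per-pixel multiclass cross entropy with a hard target:
    normalize the linear classifier scores with a softmax, then
    take -log p(target_class)."""
    e = np.exp(scores - scores.max())   # stable softmax (Eq 9 style)
    p = e / e.sum()
    return -np.log(p[target_class])

# A pixel whose linear classifier strongly favours class 1:
scores = np.array([0.0, 4.0, -1.0])
loss_good = cross_entropy_loss(scores, 1)   # small loss: prediction agrees
loss_bad = cross_entropy_loss(scores, 2)    # large loss: prediction disagrees
```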

Given the trained parameters $\theta_{s}$, we build ${\cal F}$ and ${\cal T}$, i.e. we compute all the vector maps $\mathbf{F}$ and the hierarchies $T$ on all the training data available, so as to produce a new training set of descriptors $\mathbf{O}_{k}$. This time, the parameters $\theta_{c}$ of the classifier $c$ are trained to minimize the KL-divergence between the true (known) distributions of labels $\mathbf{d}_{k}$ in each component and the prediction from the classifier $\hat{\mathbf{d}}_{k}$ (Eq 5):

In this setting, the groundtruth distributions $\mathbf{d}_{k}$ are not hard target vectors, but normalized histograms of the labels present in component $C_{k}$. Once the parameters $\theta_{c}$ are trained, $\hat{\mathbf{d}}_{k}$ accurately predicts the distribution of labels, and Eq 6 can be used to assign a purity cost to the component.
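The KL-divergence criterion $l_{\mathrm{div}}=\sum_{a}\mathbf{d}_{k,a}\ln(\mathbf{d}_{k,a}/\hat{\mathbf{d}}_{k,a})$ can be sketched directly; the small epsilon guarding empty histogram bins is an implementation assumption, not from the paper:

```python
import numpy as np

def kl_divergence(d, d_hat, eps=1e-12):
    """l_div = sum_a d_a * ln(d_a / d_hat_a): penalty for a predicted
    class histogram d_hat that deviates from the groundtruth
    (normalized) histogram d. Terms with d_a = 0 contribute nothing."""
    d = np.asarray(d, dtype=float)
    d_hat = np.asarray(d_hat, dtype=float)
    return np.sum(np.where(d > 0, d * np.log((d + eps) / (d_hat + eps)), 0.0))

# A component whose pixels are 70% "sky" and 30% "tree":
d = [0.7, 0.3, 0.0]
close = kl_divergence(d, [0.6, 0.35, 0.05])   # nearly correct prediction
far = kl_divergence(d, [0.1, 0.1, 0.8])       # badly wrong prediction
```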

We report results on three standard datasets. (1) The Stanford Background dataset, introduced in [6] for evaluating methods for semantic scene understanding. The dataset contains 715 images chosen from other existing public datasets so that all the images are outdoor scenes, have approximately $320\times 240$ pixels, and contain at least one foreground object. We use the evaluation procedure introduced in [6], 5-fold cross-validation: 572 images used for training, and 142 for testing. (2) The SIFT Flow dataset, as described in Liu et al. [15]. This dataset is composed of 2,688 images that have been thoroughly labeled by LabelMe users. Liu et al. [15] split this dataset into 2,488 training images and 200 test images and used synonym correction to obtain 33 semantic labels. We use this same training/test split. (3) The Barcelona dataset, as described in Tighe et al. [24], is derived from the LabelMe subset used in [21]. It has 14,871 training and 279 test images. The test set consists of street scenes from Barcelona, while the training set ranges over many scene types but contains no street scenes from Barcelona. Synonyms were manually consolidated by [24] to produce 170 unique labels.

For all experiments, we use a 2-stage convolutional network. The input $\mathbf{I}$, a 3-channel image, is transformed into a 16-dimension feature map using a bank of 16 $7\times 7$ filters followed by $\tanh$ units; this feature map is then pooled using a $2\times 2$ max-pooling layer. The second layer transforms the 16-dimension feature map into a 64-dimension feature map, each component being produced by a combination of 8 $7\times 7$ filters (512 filters), followed by $\tanh$ units; the map is pooled using a $2\times 2$ max-pooling layer. Finally, the 64-dimension feature map is transformed into a 256-dimension feature map, each component being produced by a combination of 16 $7\times 7$ filters (2048 filters).
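As a sanity check on this architecture, the spatial sizes can be traced layer by layer: each 'valid' $7\times 7$ convolution shrinks each dimension by 6, and each $2\times 2$ pooling halves it. This sketch assumes unpadded convolutions; the real system pads its inputs so that the output map stays aligned with the input:

```python
def feature_map_sizes(h, w):
    """Trace spatial sizes through the described network: three 7x7
    convolution banks (shrink by 6) with 2x2 max pooling (integer
    halving) after the first two."""
    sizes = [(h, w)]
    for layer in ["conv7", "pool2", "conv7", "pool2", "conv7"]:
        h, w = sizes[-1]
        if layer == "conv7":
            h, w = h - 6, w - 6
        else:
            h, w = h // 2, w // 2
        sizes.append((h, w))
    return sizes

# For the finest 320x240 scale (height x width):
sizes = feature_map_sizes(240, 320)
# (240, 320) -> (234, 314) -> (117, 157) -> (111, 151) -> (55, 75) -> (49, 69)
```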

The network is applied to a locally normalized Laplacian pyramid constructed on the input image. For these experiments, the pyramid consists of 333 rescaled versions of the input (N=3𝑁3N=3), in octaves: 320×240320240320\times 240, 160×120160120160\times 120, 80×60806080\times 60. All inputs are properly padded, and outputs of each of the 333 networks upsampled and concatenated, so as to produce a 256×3=7682563768256\times 3=768-dimension feature vector map 𝐅𝐅\mathbf{F}. The network is trained on all 333 scales in parallel.

Simple grid-search was performed to find the best learning rate and regularization parameters (weight decay), using a holdout of 10% of the training dataset for validation. More regularization was necessary to train the classifier c𝑐c. For both datasets, jitter was used to artificially expand the size of the training data, and ensure that the features do not overfit some irrelevant biases present in the data. Jitter includes: horizontal flipping of all images, and rotations between −88-8 and 888 degrees.

In this paper, the hierarchy used to find the optimal cover is a simple hierarchy constructed on the raw image gradient, based on a standard volume criterion [16, 4], completed by the removal of non-informative small components (fewer than 100 pixels). Classical segmentation methods find a partition of the segments rather than a cover. Partitioning the segments consists of finding an optimal cut in the tree, so that each terminal node in the pruned tree corresponds to a segment. We experimented with a number of methods to do so, including graph cuts [5, 2], Kruskal's algorithm [12], and Power Watersheds [3], but the results were systematically worse than with our optimal cover method.

On the Stanford dataset, we report two experiments: a baseline system, based on the multiscale convolutional network alone, and the full model as described in Section 3. Results are reported in Table 1. On the two other datasets, we report results for our complete model only, in Tables 2 and 3. Example parses on the SIFT Flow dataset are shown in Figure 4.

Baseline, multiscale network: for our baseline, the multiscale network is trained as a simple class predictor for each location i, using the single classification loss L_{cat} defined in Eq. 7. With this simple system, the pixelwise accuracy is surprisingly good, but the predictions clearly suffer from poor spatial consistency and poor object delineation.

Complete system, network and hierarchy: in this second experiment, we use the complete model, as described in Section 3. The 2-layer neural network (Eq. 4) has 3×3×3×256 = 6912 input units (a 3×3 grid of feature vectors from F) and 512 hidden units; 8 output units are needed for the Stanford Background dataset, 33 for the SIFT Flow dataset, and 170 for the Barcelona dataset. Results are significantly better than with the baseline method; in particular, much better delineation is achieved.
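With random stand-in weights (not trained parameters), the classifier's dimensions can be sketched to verify the unit counts:

```python
import numpy as np

# Toy sketch of the 2-layer classifier: 3x3 grid of 768-dim multiscale
# feature vectors in (3*3*768 = 6912 units), a tanh hidden layer of 512
# units, and a softmax over the dataset's classes. Weights are random
# stand-ins for illustration only.
rng = np.random.default_rng(0)

d_in, d_hidden, n_classes = 3 * 3 * 768, 512, 8   # 8 classes: Stanford Background
W1 = rng.standard_normal((d_hidden, d_in)) * 0.01
W2 = rng.standard_normal((n_classes, d_hidden)) * 0.01

def classify(x):
    h = np.tanh(W1 @ x)        # hidden layer
    z = W2 @ h
    e = np.exp(z - z.max())    # numerically stable softmax
    return e / e.sum()         # distribution over classes

p = classify(rng.standard_normal(d_in))
print(p.shape)   # (8,)
```

Swapping `n_classes` to 33 or 170 covers the SIFT Flow and Barcelona configurations.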

For the SIFT Flow dataset, we experimented with two sampling methods when learning the multiscale features: respecting the natural frequencies of classes, and balancing them so that an equal amount of each class is shown to the network. Both results are reported in Table 2. Training with balanced frequencies allows better discrimination of small objects, and although it decreases the overall pixelwise accuracy, it is more correct from a recognition point of view. Frequency balancing was used on the Stanford Background dataset, as it consistently gave better results. For the Barcelona dataset, both sampling methods were used as well, but frequency balancing worked rather poorly in that case. This could be explained by the fact that this dataset has a large number of classes with very few training examples. These classes are therefore extremely hard to model, and overfitting occurs much faster than for the SIFT Flow dataset. Results are shown in Table 3.

Results in Table 1 also demonstrate the impressive computational advantage of convolutional networks over competing algorithms. Training time is also remarkably short: results on the Stanford Background dataset were typically obtained in 24 hours on a regular server.

We introduced a discriminative framework for learning to identify and delineate objects in a scene. Our model does not rely on engineered features, and uses a multi-scale convolutional network operating on raw pixels to learn appropriate low-level and mid-level features. The convolutional network is trained in supervised mode to directly produce labels. Unlike many other scene parsing systems that rely on expensive graphical models to ensure consistent labelings, our system relies on a segmentation tree in which the nodes (corresponding to image segments) are labeled with the entropy of the distribution of classes contained in the corresponding segment. Instead of graph cuts or other inference methods, we use the new concept of optimal cover to extract the most consistent segmentation from the tree.

The complexity of each operation is linear in the number of pixels, except for the production of the tree, which is quasi-linear (meaning cheap in practice). The system produces state-of-the-art accuracy on the Stanford Background, SIFT Flow, and Barcelona datasets (measured both per pixel and averaged per class), while dramatically outperforming competing models in inference time.

Our current system relies on a single segmentation tree constructed from image gradients, and implicitly assumes that the correct segmentation is contained in the tree. Future work will involve searches over multiple segmentation trees, or will use other graphs than simple trees to encode the possible segmentations (since our optimal cover algorithm can work from other graphs than trees). Other directions for improvements include the use of structured learning criteria such as Turaga et al.’s Maximin Learning [25] to learn low-level feature vectors from which better segmentation trees can be produced.

Table: S5.T1: Performance of our system on the Stanford Background dataset [6]: per-pixel accuracy / average per-class accuracy. The third column reports approximate compute times, as reported by the authors. Note: we benchmarked our algorithms on a modern 4-core Intel i7, which could give us an unfair advantage over the competition.

Method | P / C | CT
Gould et al. [6] | 76.4% / - | 10s to 10min
Munoz et al. [17] | 76.9% / 66.2% | 12s
Tighe et al. [24] | 77.5% / - | 10s to 5min
Socher et al. [23] | 78.1% / - | ?
Kumar et al. [13] | 79.4% / - | 10s to 10min
multiscale net | 77.5% / 70.0% | 0.5s
multiscale net + cover | 79.5% / 74.3% | 1s

Table: S5.T2: Performance of our system on the SIFT Flow dataset [15]: per-pixel accuracy / average per-class accuracy. Our multiscale network is trained using two sampling methods: (1) natural frequencies, (2) balanced frequencies.

Method | P / C
Liu et al. [15] | 74.75% / -
Tighe et al. [24] | 76.9% / 29.4%
multiscale net + cover (1) | 78.5% / 29.6%
multiscale net + cover (2) | 74.2% / 46.0%

Figure: Diagram of the scene parsing system. The raw input image is transformed through a Laplacian pyramid. Each scale is fed to a 2-stage convolutional network, which produces a set of feature maps. The feature maps of all scales are concatenated, the coarser-scale maps being upsampled to match the size of the finest-scale map. Each feature vector thus represents a large contextual window around each pixel. In parallel, a segmentation tree is computed via the minimum spanning tree of the dissimilarity graph of neighboring pixels. The segment associated with each node in the tree is encoded by a spatial grid of feature vectors pooled in the segment’s region. A classifier is then applied to all the aggregated feature grids to produce a histogram of categories, the entropy of which measures the “impurity” of the segment. Each pixel is then labeled by the minimally-impure node above it, which is the segment that best “explains” the pixel.

Figure: Finding the optimal cover. For each pixel (leaf) i, the optimal component C_{k*(i)} is the one along the path between the leaf and the root with minimal cost S_{k*(i)}. The optimal cover is the union of all these components. In this example, the optimal cover {C1, C3, C4, C5} will result in a segmentation into disjoint sets {C1, C2, C3, C4}, with the subtle difference that component C2 will be labelled with the class of C5, as C5 is the best observation level for C2.

Figure: The shape-invariant attention function a. For each component C_k in the segmentation tree T, the corresponding image segment is encoded by a spatial grid of feature vectors that fall into this segment. The aggregated feature vector of each grid cell is computed by a component-wise max pooling of the feature vectors centered on all the pixels that fall into the grid cell; this produces a scale-invariant representation of the segment and its surroundings. The result, O_k, is a descriptor that encodes spatial relations between the underlying object’s parts. The grid size was set to 5×5 for all our experiments.
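The grid pooling can be sketched as follows. This is a toy NumPy version, not the authors' implementation: for simplicity it pools every feature vector inside each bounding-box cell, whereas the system pools only the feature vectors whose pixels fall into the segment.

```python
import numpy as np

# Toy sketch of the grid pooling: the segment's bounding box is divided into
# a 5x5 grid and each cell keeps the component-wise max of the feature
# vectors it contains.

def pooled_descriptor(F, mask, grid=5):
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    d = F.shape[2]
    out = np.zeros((grid, grid, d))
    for gy in range(grid):
        for gx in range(grid):
            ya = y0 + (y1 - y0) * gy // grid
            yb = y0 + (y1 - y0) * (gy + 1) // grid
            xa = x0 + (x1 - x0) * gx // grid
            xb = x0 + (x1 - x0) * (gx + 1) // grid
            cell = F[ya:yb, xa:xb].reshape(-1, d)
            if len(cell):
                out[gy, gx] = cell.max(axis=0)   # component-wise max pooling
    return out.ravel()

F = np.random.rand(20, 20, 4)            # toy 4-dim feature map
mask = np.zeros((20, 20), dtype=bool)
mask[5:15, 5:15] = True                  # toy segment
O_k = pooled_descriptor(F, mask)
print(O_k.shape)   # (100,) = 5 * 5 * 4
```

Because the grid has a fixed size regardless of the segment's extent, the descriptor length is the same for every component in the tree.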

$$ \mathbf{H}_{lp}=\tanh\Big(b_{lp}+\sum_{q\in\mathrm{parents}(p)}\mathbf{w}_{lpq}*\mathbf{H}_{l-1,q}\Big). $$ \tag{S3.E1}

$$ \mathbf{F}=[\mathbf{F}_{1},u(\mathbf{F}_{2}),\dots,u(\mathbf{F}_{N})], $$ \tag{S3.E2}

$$ k^{*}(i)=\underset{k\,|\,i\in C_{k}}{\operatorname{argmin}}\;S_{k} $$ \tag{S3.E3}
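The cover-selection rule of Eq. S3.E3 amounts to, for each leaf pixel, keeping the lowest-cost component on its path to the root. A toy sketch, with a made-up tree and made-up costs (chosen so the result matches the {C1, C3, C4, C5} example in the optimal-cover figure):

```python
# Toy sketch of the optimal-cover selection: for each leaf pixel i, keep the
# component C_k with minimal cost S_k on the path from the leaf to the root.
# The tree structure and costs below are illustrative values only.

parent = {1: 5, 2: 5, 3: 6, 4: 6, 5: 7, 6: 7, 7: None}   # component tree
cost   = {1: 0.2, 2: 0.9, 3: 0.4, 4: 0.5, 5: 0.3, 6: 0.8, 7: 1.0}
leaves = [1, 2, 3, 4]

def optimal_component(leaf):
    best = k = leaf
    while k is not None:            # walk up to the root
        if cost[k] < cost[best]:
            best = k
        k = parent[k]
    return best

cover = {optimal_component(leaf) for leaf in leaves}
print(sorted(cover))   # [1, 3, 4, 5]
```

Each leaf's walk is linear in the tree depth, so the whole cover is extracted in a single cheap pass over the pixels.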

$$ l_{div}=\sum_{a\in\text{classes}}\mathbf{d}_{k,a}\,\ln\!\left(\frac{\mathbf{d}_{k,a}}{\hat{\mathbf{d}}_{k,a}}\right). $$ \tag{S4.E10}
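Numerically, Eq. S4.E10 is the KL divergence between the target class histogram d and the prediction d̂: zero when they match, positive otherwise. A quick check with made-up histograms:

```python
import numpy as np

# Illustration of the divergence loss l_div with toy 3-class histograms
# (values are made up; a real histogram comes from the segment's labels).
d     = np.array([0.7, 0.2, 0.1])   # target class distribution
d_hat = np.array([0.5, 0.3, 0.2])   # predicted class distribution
l_div = float(np.sum(d * np.log(d / d_hat)))

zero  = float(np.sum(d * np.log(d / d)))   # matching histograms give 0
print(l_div > 0, abs(zero) < 1e-12)
```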


Table: S5.T3: Performance of our system on the Barcelona dataset [24]: per-pixel accuracy / average per-class accuracy, with the same two sampling methods as in Table 2.

Method | P / C
Tighe et al. [24] | 66.9% / 7.6%
multiscale net + cover (1) | 67.8% / 9.5%
multiscale net + cover (2) | 39.1% / 10.7%

References


[1] P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik. Contour Detection and Hierarchical Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 33(5):898–916, 2011.

[2] Y. Boykov and M. P. Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In Proceedings of the International Conference on Computer Vision (ICCV), volume 1, pages 105–112, 2001.

[3] C. Couprie, L. Grady, L. Najman, and H. Talbot. Power Watersheds: A Unifying Graph-Based Optimization Framework. IEEE Trans. Pattern Anal. Mach. Intell., 33(7):1384–1399, July 2011.

[4] J. Cousty and L. Najman. Incremental algorithm for hierarchical minimum spanning forests and saliency of watershed cuts. In 10th International Symposium on Mathematical Morphology (ISMM’11), LNCS, 2011.

[5] L. R. Ford and D. R. Fulkerson. A simple algorithm for finding maximal network flows and an application to the Hitchcock problem. Technical report, RAND Corporation, Santa Monica, 1955. http://www.rand.org/pubs/papers/P743.

[6] S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent regions. In 2009 IEEE 12th International Conference on Computer Vision, pages 1–8, Sept. 2009.

[7] D. Grangier, L. Bottou, and R. Collobert. Deep Convolutional Networks for Scene Parsing. In ICML 2009 Deep Learning Workshop, 2009.

[8] L. Guigues, J. P. Cocquerez, and H. Le Men. Scale-sets image analysis. International Journal of Computer Vision, 68(3):289–317, 2006.

[9] X. He and R. Zemel. Learning hybrid models for image annotation with partially labeled data. In Advances in Neural Information Processing Systems, 2008.

[10] V. Jain, J. F. Murray, F. Roth, S. Turaga, V. Zhigulin, K. Briggman, M. Helmstaedter, W. Denk, and H. S. Seung. Supervised learning of image restoration with convolutional networks. In ICCV, 2007.

[11] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision (ICCV’09). IEEE, 2009.

[12] J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7:48–50, February 1956.

[13] M. Kumar and D. Koller. Efficiently selecting regions for scene understanding. In Computer Vision and Pattern Recognition (CVPR), pages 3217–3224. IEEE, 2010.

[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

[15] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing: Label transfer via dense scene alignment. Artificial Intelligence, 2009.

[16] F. Meyer and L. Najman. Segmentation, minimum spanning tree and hierarchies. In L. Najman and H. Talbot, editors, Mathematical Morphology: From Theory to Application, chapter 9, pages 229–261. ISTE-Wiley, London, 2010.

[17] D. Munoz, J. Bagnell, and M. Hebert. Stacked hierarchical labeling. In ECCV, 2010.

[18] L. Najman and M. Schmitt. Geodesic saliency of watershed contours and hierarchical segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 18(12):1163–1173, December 1996.

[19] F. Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, and P. Barbano. Toward automatic phenotyping of developing embryos from videos. IEEE Trans. on Image Processing, 2005. Special issue on Molecular & Cellular Bioimaging.

[20] M. Osadchy, Y. LeCun, and M. Miller. Synergistic face detection and pose estimation with energy-based models. Journal of Machine Learning Research, 8:1197–1215, 2007.

[21] B. Russell, A. Torralba, C. Liu, R. Fergus, and W. Freeman. Object recognition by scene alignment. In Advances in Neural Information Processing Systems, 2007.

[22] C. Russell, P. H. S. Torr, and P. Kohli. Associative hierarchical CRFs for object class image segmentation. In Proc. ICCV, 2009.

[23] R. Socher, C. C. Lin, A. Y. Ng, and C. D. Manning. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2011.

[24] J. Tighe and S. Lazebnik. Superparsing: Scalable nonparametric image parsing with superpixels. In ECCV, pages 352–365, 2010.

[25] S. Turaga, K. Briggman, M. Helmstaedter, W. Denk, and H. Seung. Maximin affinity learning of image segmentation. In NIPS, 2009.