

Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches

Jure Žbontar*
jure.zbontar@fri.uni-lj.si
Faculty of Computer and Information Science
University of Ljubljana
Večna pot 113, SI-1001 Ljubljana, Slovenia

Yann LeCun†
yann@cs.nyu.edu
Courant Institute of Mathematical Sciences
New York University
715 Broadway, New York, NY 10003, USA

Editor: Zhuowen Tu

Abstract

We present a method for extracting depth information from a rectified image pair. Our approach focuses on the first stage of many stereo algorithms: the matching cost computation. We approach the problem by learning a similarity measure on small image patches using a convolutional neural network. Training is carried out in a supervised manner by constructing a binary classification data set with examples of similar and dissimilar pairs of patches. We examine two network architectures for this task: one tuned for speed, the other for accuracy. The output of the convolutional neural network is used to initialize the stereo matching cost. A series of post-processing steps follow: cross-based cost aggregation, semiglobal matching, a left-right consistency check, subpixel enhancement, a median filter, and a bilateral filter. We evaluate our method on the KITTI 2012, KITTI 2015, and Middlebury stereo data sets and show that it outperforms other approaches on all three data sets.

Keywords: stereo, matching cost, similarity learning, supervised learning, convolutional neural networks

Introduction

Consider the following problem: given two images taken by cameras at different horizontal positions, we wish to compute the disparity d for each pixel in the left image. Disparity refers to the difference in horizontal location of an object in the left and right image: an object at position (x, y) in the left image appears at position (x - d, y) in the right image. If we know the disparity of an object we can compute its depth z using the following relation:

z = fB / d,

*. Jure Žbontar is also with the Courant Institute of Mathematical Sciences, New York University, 715 Broadway, New York, NY 10003, USA.

© 2016 Jure Žbontar and Yann LeCun.


Figure 1: The input is a pair of images from the left and right camera. The two input images differ mostly in horizontal locations of objects (other differences are caused by reflections, occlusions, and perspective distortions). Note that objects closer to the camera have larger disparities than objects farther away. The output is a dense disparity map shown on the right, with warmer colors representing larger values of disparity (and smaller values of depth).

where f is the focal length of the camera and B is the distance between the camera centers. Figure 1 depicts the input to and the output from our method.
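As a quick numeric sketch of this relation (the focal length and baseline below are made-up, illustrative values, not the calibration of any particular rig):

```python
def depth_from_disparity(d, f, B):
    """Depth z from disparity d (pixels), focal length f (pixels), and
    baseline B (meters): z = f * B / d."""
    return f * B / d

# Illustrative values: f = 700 px, B = 0.54 m, d = 35 px gives 10.8 m.
z = depth_from_disparity(35.0, f=700.0, B=0.54)
```

Note how depth is inversely proportional to disparity: halving the disparity doubles the estimated depth, which is why distant objects are harder to localize precisely.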

The described problem of stereo matching is important in many fields such as autonomous driving, robotics, intermediate view generation, and 3D scene reconstruction. According to the taxonomy of Scharstein and Szeliski (2002), a typical stereo algorithm consists of four steps: matching cost computation, cost aggregation, optimization, and disparity refinement. Following Hirschmüller and Scharstein (2009) we refer to the first two steps as computing the matching cost and the last two steps as the stereo method. The focus of this work is on computing a good matching cost.

We propose training a convolutional neural network (LeCun et al., 1998) on pairs of small image patches where the true disparity is known (for example, obtained by LIDAR or structured light). The output of the network is used to initialize the matching cost. We proceed with a number of post-processing steps that are not novel, but are necessary to achieve good results. Matching costs are combined between neighboring pixels with similar image intensities using cross-based cost aggregation. Smoothness constraints are enforced by semiglobal matching and a left-right consistency check is used to detect and eliminate errors in occluded regions. We perform subpixel enhancement and apply a median filter and a bilateral filter to obtain the final disparity map.

The contributions of this paper are

· a description of two architectures based on convolutional neural networks for computing the stereo matching cost;
· a method, accompanied by its source code, with the lowest error rate on the KITTI 2012, KITTI 2015, and Middlebury stereo data sets; and
· experiments analyzing the importance of data set size, the error rate compared with other methods, and the trade-off between accuracy and runtime for different settings of the hyperparameters.


This paper extends our previous work (Žbontar and LeCun, 2015) by including a description of a new architecture, results on two new data sets, lower error rates, and more thorough experiments.

Related Work

Before the introduction of large stereo data sets like KITTI and Middlebury, relatively few stereo algorithms used ground truth information to learn parameters of their models; in this section, we review the ones that did. For a general overview of stereo algorithms see Scharstein and Szeliski (2002).

Kong and Tao (2004) used the sum of squared distances to compute an initial matching cost. They then trained a model to predict the probability distribution over three classes: the initial disparity is correct, the initial disparity is incorrect due to fattening of a foreground object, and the initial disparity is incorrect due to other reasons. The predicted probabilities were used to adjust the initial matching cost. Kong and Tao (2006) later extended their work by combining predictions obtained by computing normalized cross-correlation over different window sizes and centers. Peris et al. (2012) initialized the matching cost with AD-Census (Mei et al., 2011), and used multiclass linear discriminant analysis to learn a mapping from the computed matching cost to the final disparity.

Ground-truth data was also used to learn parameters of probabilistic graphical models. Zhang and Seitz (2007) used an alternative optimization algorithm to estimate optimal values of Markov random field hyperparameters. Scharstein and Pal (2007) constructed a new data set of 30 stereo pairs and used it to learn parameters of a conditional random field. Li and Huttenlocher (2008) presented a conditional random field model with a nonparametric cost function and used a structured support vector machine to learn the model parameters.

Recent work (Haeusler et al., 2013; Spyropoulos et al., 2014) focused on estimating the confidence of the computed matching cost. Haeusler et al. (2013) used a random forest classifier to combine several confidence measures. Similarly, Spyropoulos et al. (2014) trained a random forest classifier to predict the confidence of the matching cost and used the predictions as soft constraints in a Markov random field to decrease the error of the stereo method.

A related problem to computing the matching cost is learning local image descriptors (Brown et al., 2011; Trzcinski et al., 2012; Simonyan et al., 2014; Revaud et al., 2015; Paulin et al., 2015; Han et al., 2015; Zagoruyko and Komodakis, 2015). The two problems share a common subtask: measuring the similarity between image patches. Brown et al. (2011) introduced a general framework for learning image descriptors and used Powell's method to select good hyperparameters. Several methods have been suggested for solving the problem of learning local image descriptors, such as boosting (Trzcinski et al., 2012), convex optimization (Simonyan et al., 2014), hierarchical moving-quadrant similarity (Revaud et al., 2015), convolutional kernel networks (Paulin et al., 2015), and convolutional neural networks (Zagoruyko and Komodakis, 2015; Han et al., 2015). The works of Zagoruyko and Komodakis (2015) and Han et al. (2015), in particular, are very similar to our own, differing mostly in the architecture of the network; concretely, in the inclusion of pooling and subsampling to account for larger patch sizes and larger variations in viewpoint.

Matching Cost

A typical stereo algorithm begins by computing a matching cost at each position p for all disparities d under consideration. A simple method for computing the matching cost is the sum of absolute differences:

C_SAD(p, d) = Σ_{q ∈ N_p} |I^L(q) - I^R(q - d)|,    (1)

where I^L(q) and I^R(q) are image intensities at position q in the left and right image and N_p is the set of locations within a fixed rectangular window centered at p.

We use bold lowercase letters p and q to denote image locations. A bold lowercase d denotes the disparity d cast to a vector, that is, d = (d, 0). We use typewriter font for the names of hyperparameters; for example, we would use patch_size to denote the size of the neighborhood area N_p.

Equation (1) can be interpreted as measuring the cost associated with matching a patch from the left image, centered at position p, with a patch from the right image, centered at position p - d. We want the cost to be low when the two patches are centered around the image of the same 3D point, and high when they are not.
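As a concrete (if naive) illustration, the window sum of Equation (1) can be sketched with plain loops; this unvectorized version is for exposition only, not the implementation used in the paper:

```python
import numpy as np

def sad_cost(left, right, d, window=5):
    """Sum of absolute differences: compare a window centered at (y, x) in
    the left image with a window centered at (y, x - d) in the right image.
    Naive loops, for exposition only."""
    h = window // 2
    H, W = left.shape
    cost = np.full((H, W), np.inf)  # borders and invalid columns stay inf
    for y in range(h, H - h):
        for x in range(h + d, W - h):
            patch_l = left[y - h:y + h + 1, x - h:x + h + 1]
            patch_r = right[y - h:y + h + 1, x - d - h:x - d + h + 1]
            cost[y, x] = np.abs(patch_l - patch_r).sum()
    return cost

# Toy check: if the right image is the left image shifted 2 px to the left,
# the cost at disparity 2 is zero away from the borders.
left = np.arange(200, dtype=float).reshape(10, 20)
right = np.zeros_like(left)
right[:, :18] = left[:, 2:]
cost = sad_cost(left, right, d=2)
```

In practice this computation is vectorized over the whole image; the loop form simply makes the correspondence with the formula explicit.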

Since examples of good and bad matches can be constructed from publicly available data sets (for example, the KITTI and Middlebury stereo data sets), we can attempt to solve the matching problem by a supervised learning approach. Inspired by the successful application of convolutional neural networks to vision problems, we used them to assess how well two small image patches match.

Constructing the Data Set

We use ground truth disparity maps from either the KITTI or Middlebury stereo data sets to construct a binary classification data set. At each image position where the true disparity is known we extract one negative and one positive training example. This ensures that the data set contains an equal number of positive and negative examples. A positive example is a pair of patches, one from the left and one from the right image, whose center pixels are the images of the same 3D point, while a negative example is a pair of patches where this is not the case. The following section describes the data set construction step in detail.

Let <P^L_{n×n}(p), P^R_{n×n}(q)> denote a pair of patches, where P^L_{n×n}(p) is an n × n patch from the left image centered at position p = (x, y), P^R_{n×n}(q) is an n × n patch from the right image centered at position q, and d denotes the correct disparity at position p. A negative example is obtained by setting the center of the right patch to

q = (x - d + o_neg, y),

where o_neg is chosen from either the interval [dataset_neg_low, dataset_neg_high] or its origin-reflected counterpart [-dataset_neg_high, -dataset_neg_low]. The random offset o_neg ensures that the resulting image patches are not centered around the same 3D point.

A positive example is derived by setting

q = (x - d + o_pos, y),

Figure 2: The fast architecture is a siamese network. The two sub-networks consist of a number of convolutional layers followed by rectified linear units (abbreviated 'ReLU'). The similarity score is obtained by extracting a vector from each of the two input patches and computing the cosine similarity between them. In this diagram, as well as in our implementation, the cosine similarity computation is split in two steps: normalization and dot product. This reduces the running time because the normalization needs to be performed only once per position (see Section 3.3).

where o_pos is chosen randomly from the interval [-dataset_pos, dataset_pos]. The reason for including o_pos, instead of setting it to zero, has to do with the stereo method used later on. In particular, we found that cross-based cost aggregation performs better when the network assigns low matching costs to good matches as well as near matches. In our experiments, the hyperparameter dataset_pos was never larger than one pixel.
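The offset sampling can be sketched as follows; the three hyperparameter values here are made-up placeholders, not the tuned values from the experiments:

```python
import random

# Illustrative placeholder values; the paper tunes these per data set.
DATASET_POS = 0.5        # dataset_pos, never larger than one pixel
DATASET_NEG_LOW = 1.5    # dataset_neg_low
DATASET_NEG_HIGH = 6.0   # dataset_neg_high

def sample_offsets(rng):
    """Sample the horizontal offsets o_pos and o_neg that place the center
    of the right patch relative to the true match at x - d."""
    o_pos = rng.uniform(-DATASET_POS, DATASET_POS)
    # o_neg is drawn from [low, high] or its origin-reflected counterpart,
    # so the negative patch is never centered on the true match.
    o_neg = rng.uniform(DATASET_NEG_LOW, DATASET_NEG_HIGH)
    if rng.random() < 0.5:
        o_neg = -o_neg
    return o_pos, o_neg

rng = random.Random(0)
o_pos, o_neg = sample_offsets(rng)
```

The reflection of the negative interval about the origin lets negative patches fall on either side of the true match with equal probability.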

Network Architectures

We describe two network architectures for learning a similarity measure on image patches. The first architecture is faster than the second, but produces disparity maps that are slightly less accurate. In both cases, the input to the network is a pair of small image patches and the output is a measure of similarity between them. Both architectures contain a trainable feature extractor that represents each image patch with a feature vector. The similarity between patches is measured on the feature vectors instead of the raw image intensity values. The fast architecture uses a fixed similarity measure to compare the two feature vectors, while the accurate architecture attempts to learn a good similarity measure on feature vectors.

Fast Architecture

The first architecture is a siamese network, that is, two shared-weight sub-networks joined at the head (Bromley et al., 1993). The sub-networks are composed of a number of convolutional layers with rectified linear units following all but the last layer. Both sub-networks output a vector capturing the properties of the input patch. The resulting two vectors are compared using the cosine similarity measure to produce the final output of the network. Figure 2 provides an overview of the architecture.

Figure 3: The accurate architecture begins with two convolutional feature extractors. The extracted feature vectors are concatenated and compared by a number of fully-connected layers. The inputs are two image patches and the output is a single real number between 0 and 1, which we interpret as a measure of similarity between the input images.

The network is trained by minimizing a hinge loss. The loss is computed by considering pairs of examples centered around the same image position where one example belongs to the positive and one to the negative class. Let s_+ be the output of the network for the positive example, s_- be the output of the network for the negative example, and let m, the margin, be a positive real number. The hinge loss for that pair of examples is defined as max(0, m + s_- - s_+). The loss is zero when the similarity of the positive example is greater than the similarity of the negative example by at least the margin m. We set the margin to 0.2 in our experiments.
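Written out in code, this loss for one pair of examples is a one-liner:

```python
def hinge_loss(s_pos, s_neg, margin=0.2):
    """Hinge loss for one pair of examples: zero once the similarity of the
    positive example exceeds that of the negative by at least the margin."""
    return max(0.0, margin + s_neg - s_pos)
```

For instance, a confidently separated pair (s_+ = 0.9, s_- = 0.1) incurs zero loss, while a pair separated by less than the margin still produces a gradient signal.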

The hyperparameters of this architecture are the number of convolutional layers in each sub-network (num_conv_layers), the size of the convolution kernels (conv_kernel_size), the number of feature maps in each layer (num_conv_feature_maps), and the size of the input patch (input_patch_size).

Accurate Architecture

The second architecture is derived from the first by replacing the cosine similarity measure with a number of fully-connected layers (see Figure 3). This architectural change increased the running time, but decreased the error rate. The two sub-networks comprise a number of convolutional layers, with a rectified linear unit following each layer. The resulting two vectors are concatenated and forward-propagated through a number of fully-connected layers followed by rectified linear units. The last fully-connected layer produces a single number which, after being transformed with the sigmoid nonlinearity, is interpreted as the similarity score between the input patches.

We use the binary cross-entropy loss for training. Let s denote the output of the network for one training example and t denote the class of that training example; t = 1 if the example belongs to the positive class and t = 0 if the example belongs to the negative class. The binary cross-entropy loss for that example is defined as -(t log(s) + (1 - t) log(1 - s)).

The decision to use two different loss functions, one for each architecture, was based on empirical evidence. While we would have preferred to use the same loss function for both architectures, experiments showed that the binary cross-entropy loss performed better than the hinge loss on the accurate architecture. On the other hand, since the last step of the fast architecture is the cosine similarity computation, a cross-entropy loss was not directly applicable.
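A direct transcription of the binary cross-entropy loss, written as the quantity to minimize; the small epsilon clamp guarding log(0) is an implementation detail of this sketch, not from the paper:

```python
import math

def binary_cross_entropy(s, t, eps=1e-12):
    """Binary cross-entropy for one example. s is the sigmoid output in
    (0, 1); t is 1 for a positive example and 0 for a negative one. The
    eps clamp prevents log(0) for saturated outputs."""
    s = min(max(s, eps), 1.0 - eps)
    return -(t * math.log(s) + (1.0 - t) * math.log(1.0 - s))
```

The loss decreases monotonically as the output s moves toward the true label, which is the property the training procedure relies on.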

The hyperparameters of the accurate architecture are the number of convolutional layers in each sub-network (num_conv_layers), the number of feature maps in each layer (num_conv_feature_maps), the size of the convolution kernels (conv_kernel_size), the size of the input patch (input_patch_size), the number of units in each fully-connected layer (num_fc_units), and the number of fully-connected layers (num_fc_layers).

Computing the Matching Cost

The output of the network is used to initialize the matching cost:

C_CNN(p, d) = -s(<P^L(p), P^R(p - d)>),

where s(<P^L(p), P^R(p - d)>) is the output of the network when run on input patches P^L(p) and P^R(p - d). The minus sign converts the similarity score to a matching cost.

To compute the entire matching cost tensor C_CNN(p, d) we would, naively, have to perform the forward pass for each image location and each disparity under consideration. The following three implementation details kept the running time manageable:

· The outputs of the two sub-networks need to be computed only once per location, and do not need to be recomputed for every disparity under consideration.
· The output of the two sub-networks can be computed for all pixels in a single forward pass by propagating full-resolution images, instead of small image patches. Performing a single forward pass on the entire w × h image is faster than performing w · h forward passes on small patches because many intermediate results can be reused.
· The output of the fully-connected layers in the accurate architecture can also be computed in a single forward pass. This is done by replacing each fully-connected layer with a convolutional layer with 1 × 1 kernels. We still need to perform the forward pass for each disparity under consideration; the maximum disparity d is 228 for the KITTI data set and 400 for the Middlebury data set. As a result, the fully-connected part of the network needs to be run d times, and is a bottleneck of the accurate architecture.

To compute the matching cost of a pair of images, we run the sub-networks once on each image and run the fully-connected layers d times, where d is the maximum disparity under consideration. This insight was important in designing the architecture of the network. We could have chosen an architecture where the two images are concatenated before being presented to the network, but that would imply a large cost at runtime because the whole network would need to be run d times. This insight also led to the development of the fast architecture, where the only layer that is run d times is the dot product of the feature vectors.
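To make the reuse concrete, here is a NumPy sketch of the fast architecture's cost-volume computation; the feature arrays stand in for the sub-network outputs and are assumed to be already L2-normalized, so the cosine similarity reduces to a dot product:

```python
import numpy as np

def fast_cost_volume(feat_l, feat_r, max_disp):
    """Cost volume for the fast architecture: the H x W x C feature arrays
    (stand-ins for the sub-network outputs, assumed L2-normalized along the
    last axis) are computed once; per disparity only a dot product is
    taken. Positions with x < d are left at +inf."""
    H, W, C = feat_l.shape
    cost = np.full((H, W, max_disp), np.inf)
    for d in range(max_disp):
        # cosine similarity of unit vectors is a plain dot product
        sim = (feat_l[:, d:, :] * feat_r[:, :W - d, :]).sum(axis=2)
        cost[:, d:, d] = -sim  # the minus sign turns similarity into cost
    return cost

# Toy check: build right features so the true disparity is 3 everywhere.
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 10, 8))
feat /= np.linalg.norm(feat, axis=2, keepdims=True)
feat_r = np.empty_like(feat)
feat_r[:, :7, :] = feat[:, 3:, :]
feat_r[:, 7:, :] = feat[:, :3, :]  # arbitrary filler columns
cost = fast_cost_volume(feat, feat_r, max_disp=5)
```

The expensive convolutional part runs once per image; only the cheap per-disparity dot product scales with the disparity range, which is exactly the design point of the fast architecture.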

Stereo Method

The raw outputs of the convolutional neural network are not enough to produce accurate disparity maps, with errors particularly apparent in low-texture regions and occluded areas. The quality of the disparity maps can be improved by applying a series of post-processing steps referred to as the stereo method. The stereo method we used was influenced by Mei et al. (2011) and comprises cross-based cost aggregation, semiglobal matching, a left-right consistency check, subpixel enhancement, a median filter, and a bilateral filter.

Cross-based Cost Aggregation

Information from neighboring pixels can be combined by averaging the matching cost over a fixed window. This approach fails near depth discontinuities, where the assumption of constant depth within a window is violated. We might prefer a method that adaptively selects the neighborhood for each pixel, so that support is collected only from pixels of the same physical object. In cross-based cost aggregation (Zhang et al., 2009) we build a local neighborhood around each location comprising pixels with similar image intensity values with the hope that these pixels belong to the same object.

The method begins by constructing an upright cross at each position; this cross is used to define the local support region. The left arm p_l at position p extends left as long as the following two conditions hold:

· |I(p) - I(p_l)| < cbca_intensity; the image intensities at positions p and p_l should be similar, their difference less than cbca_intensity.
· ‖p - p_l‖ < cbca_distance; the horizontal distance (or vertical distance, in the case of the top and bottom arms) between positions p and p_l must be less than cbca_distance pixels.

The right, bottom, and top arms are constructed analogously. Once the four arms are known, we can compute the support region U(p) as the union of horizontal arms of all positions q lying on p's vertical arm (see Figure 4).
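The arm construction can be sketched for a single arm; the two hyperparameter values are illustrative placeholders, and the other three arms follow by symmetry:

```python
import numpy as np

# Illustrative placeholder values for the two hyperparameters.
CBCA_INTENSITY = 0.02
CBCA_DISTANCE = 5

def left_arm(img, y, x):
    """Length of the left arm of the upright cross at (y, x): extend while
    the intensity difference stays below cbca_intensity and the arm stays
    shorter than cbca_distance pixels."""
    arm = 0
    while (x - arm - 1 >= 0
           and abs(img[y, x] - img[y, x - arm - 1]) < CBCA_INTENSITY
           and arm + 1 < CBCA_DISTANCE):
        arm += 1
    return arm

uniform = np.zeros((3, 20))  # no edges: arm limited only by distance
edge = np.zeros((3, 20))
edge[1, 8] = 1.0             # an intensity edge at column 8
```

On the uniform image the arm grows to the distance limit; the intensity edge stops the arm early, which is what keeps the support region inside a single object.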

Zhang et al. (2009) suggest that aggregation should consider the support regions of both images in a stereo pair. Let U^L and U^R denote the support regions in the left and right image. We define the combined support region U_d as

U_d(p) = {q | q ∈ U^L(p), q - d ∈ U^R(p - d)}.

Figure 4: The support region U(p), constructed from the left, right, top, and bottom arms of the cross at position p.

The matching cost is averaged over the combined support region:

C^0_CBCA(p, d) = C_CNN(p, d),
C^i_CBCA(p, d) = (1 / |U_d(p)|) Σ_{q ∈ U_d(p)} C^{i-1}_CBCA(q, d),

where i is the iteration number. We repeat the averaging a number of times. Since the support regions are overlapping, the results can change at each iteration. We skip cross-based cost aggregation in the fast architecture because it is not crucial for achieving a low error rate and because it is relatively expensive to compute.

Semiglobal Matching

We refine the matching cost by enforcing smoothness constraints on the disparity image. Following Hirschmüller (2008), we define an energy function E(D) that depends on the disparity image D:

E(D) = Σ_p ( C(p, D(p)) + Σ_{q ∈ N_p} P1 · 1{|D(p) - D(q)| = 1} + Σ_{q ∈ N_p} P2 · 1{|D(p) - D(q)| > 1} ),

where 1 {·} denotes the indicator function. The first term penalizes disparities with high matching costs. The second term adds a penalty P 1 when the disparity of neighboring pixels differ by one. The third term adds a larger penalty P 2 when the neighboring disparities differ by more than one.

Rather than minimizing E(D) in all directions simultaneously, we could perform the minimization in a single direction with dynamic programming. This solution would introduce unwanted streaking effects, since there would be no incentive to make the disparity image smooth in the directions we are not optimizing over. In semiglobal matching we minimize the energy in a single direction, repeat for several directions, and average to obtain the final result. Although Hirschmüller (2008) suggested choosing sixteen directions, we only optimized along the two horizontal and the two vertical directions; adding the diagonal directions did not improve the accuracy of our system. To minimize E(D) in direction r, we define a matching cost C_r(p, d) with the following recurrence relation:

C_r(p, d) = C(p, d) - min_k C_r(p - r, k) + min{ C_r(p - r, d), C_r(p - r, d - 1) + P1, C_r(p - r, d + 1) + P1, min_k C_r(p - r, k) + P2 }.

The second term is subtracted to prevent values of C_r(p, d) from growing too large and does not affect the optimal disparity map.
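A minimal sketch of one directional pass of this recurrence, using constant penalties P1 and P2 for simplicity (the gradient-adaptive penalties are described next):

```python
import numpy as np

def sgm_one_direction(cost, P1=1.0, P2=8.0):
    """One left-to-right pass of the semiglobal matching recurrence over an
    H x W x D cost volume. Constant penalties are used for simplicity."""
    H, W, D = cost.shape
    out = cost.copy()
    for x in range(1, W):
        prev = out[:, x - 1, :]                     # C_r(p - r, .)
        prev_min = prev.min(axis=1, keepdims=True)  # min_k C_r(p - r, k)
        minus = np.full_like(prev, np.inf)
        minus[:, 1:] = prev[:, :-1] + P1            # disparity d - 1
        plus = np.full_like(prev, np.inf)
        plus[:, :-1] = prev[:, 1:] + P1             # disparity d + 1
        smooth = np.minimum(np.minimum(prev, prev_min + P2),
                            np.minimum(minus, plus))
        # subtracting prev_min keeps the values from growing unboundedly
        out[:, x, :] = cost[:, x, :] + smooth - prev_min
    return out

# Toy check: a single spurious high cost is smoothed by the pass.
vol = np.zeros((1, 3, 2))
vol[0, 1, 0] = 10.0
out = sgm_one_direction(vol)
```

The full method would run such a pass along the two horizontal and two vertical directions and average the results.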

The penalty parameters P1 and P2 are set according to the image gradient so that jumps in disparity coincide with edges in the image. Let D1 = |I^L(p) - I^L(p - r)| and D2 = |I^R(p - d) - I^R(p - d - r)| be the differences in image intensity between two neighboring positions in the direction we are optimizing over. We set P1 and P2 according to the following rules:

P1 = sgm_P1,          P2 = sgm_P2            if D1 < sgm_D and D2 < sgm_D,
P1 = sgm_P1 / sgm_Q2, P2 = sgm_P2 / sgm_Q2   if D1 ≥ sgm_D and D2 ≥ sgm_D,
P1 = sgm_P1 / sgm_Q1, P2 = sgm_P2 / sgm_Q1   otherwise,

where sgm_D is the intensity-difference threshold above which we consider the image gradient strong.

The hyperparameters sgm_P1 and sgm_P2 set a base penalty for discontinuities in the disparity map. The base penalty is reduced by a factor of sgm_Q1 if one of D1 or D2 indicates a strong image gradient, or by a larger factor of sgm_Q2 if both D1 and D2 indicate a strong image gradient. The value of P1 is further reduced by a factor of sgm_V when considering the two vertical directions; in the ground truth, small changes in disparity are much more frequent in the vertical directions than in the horizontal directions and should be penalized less.

The final cost C_SGM(p, d) is computed by taking the average across all four directions:

C_SGM(p, d) = (1/4) Σ_r C_r(p, d).

After semiglobal matching we repeat cross-based cost aggregation, as described in the previous section. Hyperparameters cbca_num_iterations_1 and cbca_num_iterations_2 determine the number of cross-based cost aggregation iterations before and after semiglobal matching.

Computing the Disparity Image

The disparity image D(p) is computed by the winner-takes-all strategy, that is, by finding the disparity d that minimizes C(p, d):

D(p) = argmin_d C(p, d).

Interpolation

The interpolation steps attempt to resolve conflicts between the disparity map predicted for the left image and the disparity map predicted for the right image. Let D^L denote the disparity map obtained by treating the left image as the reference image (this was the case so far, that is, D^L(p) = D(p)) and let D^R denote the disparity map obtained by treating the right image as the reference image. D^L and D^R sometimes disagree on what the correct disparity at a particular position should be. We detect these conflicts by performing a left-right consistency check. We label each position p by applying the following rules in turn:

correct     if |d - D^R(p - d)| ≤ 1 for d = D^L(p),
mismatch    if |d - D^R(p - d)| ≤ 1 for some other disparity d,
occlusion   otherwise.

For positions marked as occlusion, we want the new disparity value to come from the background. We interpolate by moving left until we find a position labeled correct and use its value. For positions marked as mismatch, we find the nearest correct pixels in 16 different directions and use the median of their disparities for interpolation. We refer to the interpolated disparity map as D_INT.
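The per-pixel labeling can be sketched as follows; a one-pixel tolerance is assumed when comparing the two maps, and the function and argument names are choices of this sketch:

```python
def label_position(d_left, right_row, x, max_disp):
    """Label one pixel at column x of the left disparity map as 'correct',
    'mismatch', or 'occlusion'. d_left is D_L at that pixel; right_row is
    the corresponding row of D_R. A one-pixel tolerance is assumed."""
    if x - d_left >= 0 and abs(d_left - right_row[x - d_left]) <= 1:
        return "correct"
    for k in range(max_disp):  # would any other disparity agree?
        if x - k >= 0 and abs(k - right_row[x - k]) <= 1:
            return "mismatch"
    return "occlusion"

agree = [99] * 10
agree[3] = 5       # D_R agrees at the matched position for d = 5
nothing = [99] * 10  # no disparity agrees anywhere
other = [99] * 10
other[6] = 2       # only disparity 2 would agree
```

The distinction matters because the two labels are interpolated differently: occlusions borrow the background disparity from the left, while mismatches take a median over several directions.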

Subpixel Enhancement

Subpixel enhancement provides an easy way to increase the resolution of a stereo algorithm. We fit a quadratic curve through the neighboring costs to obtain a new disparity image:

D_SE(p) = d - (C_+ - C_-) / (2 (C_+ - 2C + C_-)),

where d = D_INT(p), C_- = C_SGM(p, d - 1), C = C_SGM(p, d), and C_+ = C_SGM(p, d + 1).
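The quadratic fit has a closed form; a sketch, where c_minus, c, and c_plus are the costs at disparities d - 1, d, and d + 1 (the zero-denominator guard is a detail of this sketch):

```python
def subpixel_disparity(d, c_minus, c, c_plus):
    """Refine integer disparity d by fitting a parabola through the costs
    at d - 1, d, d + 1 and returning the abscissa of its minimum."""
    denom = 2.0 * (c_plus - 2.0 * c + c_minus)
    if denom == 0.0:
        return float(d)  # flat neighborhood: keep the integer disparity
    return d - (c_plus - c_minus) / denom
```

When the costs are symmetric about d the correction vanishes; an asymmetric cost profile shifts the estimate toward the cheaper neighbor.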

Refinement

The final steps of the stereo method consist of a 5 × 5 median filter and the following bilateral filter:

D_BF(p) = (1 / W(p)) Σ_{q ∈ N_p} D_MF(q) · g(‖p - q‖) · 1{|I^L(p) - I^L(q)| < blur_threshold},

where D_MF is the output of the median filter, g(x) is the probability density function of a zero-mean normal distribution with standard deviation blur_sigma, and W(p) is the normalizing constant,

W(p) = Σ_{q ∈ N_p} g(‖p - q‖) · 1{|I^L(p) - I^L(q)| < blur_threshold}.

The role of the bilateral filter is to smooth the disparity map without blurring the edges. D BF is the final output of our stereo method.

Experiments

We used three stereo data sets in our experiments: KITTI 2012, KITTI 2015, and Middlebury. The test set error rates reported in Tables 1, 2, and 4 were obtained by submitting the generated disparity maps to the online evaluation servers. All other error rates were computed by splitting the data set in two, using one part for training and the other for validation.

Table 1: The highest ranking methods on the KITTI 2012 data set as of October 2015. The 'Setting' column provides insight into how the disparity map is computed: 'F' indicates the use of optical flow, 'MV' indicates more than two temporally adjacent images, and 'MS' indicates the use of epipolar geometry for computing the optical flow. The 'Error' column reports the percentage of misclassified pixels and the 'Runtime' column measures the time, in seconds, required to process one pair of images.

KITTI Stereo Data Set

The KITTI stereo data set (Geiger et al., 2013; Menze and Geiger, 2015) is a collection of rectified image pairs taken from two video cameras mounted on the roof of a car, roughly 54 centimeters apart. The images were recorded while driving in and around the city of Karlsruhe, in sunny and cloudy weather, at daytime. The images were taken at a resolution of 1240 × 376. A rotating laser scanner mounted behind the left camera recorded ground truth depth, labeling around 30 % of the image pixels.

The ground truth disparities for the test set are withheld and an online leaderboard is provided where researchers can evaluate their method on the test set. Submissions are allowed once every three days. Error is measured as the percentage of pixels where the true disparity and the predicted disparity differ by more than three pixels. Translated into distance, this means that, for example, the error tolerance is 3 centimeters for objects 2 meters from the camera and 80 centimeters for objects 10 meters from the camera.
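This error metric takes only a few lines to compute; the function name and array layout here are choices of this sketch, with `valid` marking the pixels for which ground truth exists:

```python
import numpy as np

def fraction_bad_pixels(d_true, d_pred, valid, tau=3.0):
    """KITTI-style error: fraction of valid pixels whose predicted
    disparity is off by more than tau pixels (tau = 3 on KITTI)."""
    bad = np.abs(d_true - d_pred) > tau
    return bad[valid].mean()

# Toy example: one of four valid pixels is off by more than 3 px.
d_true = np.array([10.0, 20.0, 30.0, 40.0])
d_pred = np.array([10.0, 25.0, 31.0, 40.0])
valid = np.array([True, True, True, True])
err = fraction_bad_pixels(d_true, d_pred, valid)
```

Because the threshold is fixed in disparity space, the corresponding depth tolerance grows quadratically with distance, as the 3 cm versus 80 cm figures above illustrate.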

Table 2: The leading submissions on the KITTI 2015 leaderboard as of October 2015. The 'Setting', 'Error', and 'Runtime' columns have the same meaning as in Table 1.

Two KITTI stereo data sets exist: KITTI 2012¹ and the newer KITTI 2015². For the task of computing stereo they are nearly identical, with the newer data set improving some aspects of the optical flow task. The 2012 data set contains 194 training and 195 testing images, while the 2015 data set contains 200 training and 200 testing images. There is a subtle but important difference introduced in the newer data set: vehicles in motion are densely labeled and car glass is included in the evaluation. This emphasizes the method's performance on reflective surfaces.

The best performing methods on the KITTI 2012 data set are listed in Table 1. Our accurate architecture ranks first with an error rate of 2.43 %. Third place on the leaderboard is held by our previous work (Žbontar and LeCun, 2015) with an error rate of 2.61 %. The two changes that reduced the error from 2.61 % to 2.43 % were augmenting the data set (see Section 5.4) and doubling the number of convolution layers while reducing the kernel size from 5 × 5 to 3 × 3. The method in second place (Güney and Geiger, 2015) uses the matching cost computed by our previous work (Žbontar and LeCun, 2015). The test error rate of the fast architecture is 2.82 %, which would be enough for fifth place had the method been allowed to appear in the public leaderboard. The running time for processing a single image pair is 67 seconds for the accurate architecture and 0.8 seconds for the fast architecture. Figure 5 contains a pair of examples from the KITTI 2012 data set, together with the predictions of our method.

Table 2 presents the frontrunners on the KITTI 2015 data set. The error rates of our methods are 3.89 % for the accurate architecture and 4.46 % for the fast architecture, occupying first and second place on the leaderboard. Since one submission per paper is allowed, only the result of the accurate architecture appears on the public leaderboard. See Figure 6 for the disparity maps produced by our method on the KITTI 2015 data set.

  1. The KITTI 2012 scoreboard: http://www.cvlibs.net/datasets/kitti/eval_stereo_flow.php?benchmark=stereo

Error: 0.91 %

Figure 5: Examples of predicted disparity maps on the KITTI 2012 data set. Note how some regions of the image (the white wall in the top example, and the asphalt in the bottom example) cause problems for the census transform. The fast and the accurate architecture perform better, with the accurate architecture making fewer mistakes on average.

Table 3: A summary of the five Middlebury stereo data sets. The column 'Number of Image Pairs' counts only the image pairs for which ground truth is available. The 2005 and 2014 data sets additionally contain a number of image pairs with ground truth disparities withheld; these image pairs constitute the test set.

Middlebury Stereo Data Set

The image pairs of the Middlebury stereo data set are indoor scenes taken under controlled lighting conditions. Structured light was used to measure the true disparities with higher density and precision than in the KITTI data set. The data sets were published in five separate works in the years 2001, 2003, 2005, 2006, and 2014 (Scharstein and Szeliski, 2002, 2003; Scharstein and Pal, 2007; Hirschmüller and Scharstein, 2007; Scharstein et al., 2014). In this paper, we refer to the Middlebury data set as the concatenation of all five data sets; a summary of each is presented in Table 3.

Each scene in the 2005, 2006, and 2014 data sets was taken under a number of lighting conditions and shutter exposures, with a typical image pair taken under four lighting conditions and seven exposure settings for a total of 28 images of the same scene.

An online leaderboard,3 similar to the one provided by KITTI, displays a ranked list of all submitted methods. Participants have only one opportunity to submit their results on the test set to the public leaderboard. This rule is stricter than the one on the KITTI data set, where submissions are allowed every three days. The test set contains 15 images borrowed from the 2005 and 2014 data sets.

The data set is provided in full, half, and quarter resolution. The error is computed at full resolution; if the method outputs half or quarter resolution disparity maps, they are upsampled before the error is computed. We chose to run our method on half resolution images because of the limited memory available on the graphics card.

Rectifying a pair of images using standard calibration procedures, like the ones present in the OpenCV library, results in vertical disparity errors of up to nine pixels on the Middlebury data set (Scharstein et al., 2014). Each stereo pair in the 2014 data set is rectified twice: once using a standard, imperfect approach, and once using precise 2D correspondences for perfect rectification (Scharstein et al., 2014). We train the network on imperfectly rectified

  3. The Middlebury scoreboard: http://vision.middlebury.edu/stereo/eval3/


[Figure 6: image panels omitted; the disparity map shown is labeled 'Error: 4.58 %'.]

Figure 6: Examples of predictions on the KITTI 2015 data set. Observe that vehicles in motion are labeled densely in the KITTI 2015 data set.

Table 4: The top ten methods on the Middlebury stereo data set as of October 2015. The 'Error' column is the weighted average error after upsampling to full resolution and 'Runtime' is the time, in seconds, required to process one pair of images.

image pairs, since only two of the fifteen test images (Australia and Crusade) are rectified perfectly.

The error is measured as the percentage of pixels where the true disparity and the predicted disparity differ by more than two pixels; this corresponds to an error tolerance of one pixel at half resolution. The error on the evaluation server is, by default, computed only on non-occluded pixels. The final error reported online is the weighted average over the fifteen test images, with weights set by the authors of the data set.
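The error metric above can be sketched as follows. This is a minimal illustration, not the evaluation server's code; the per-image weights are placeholders for the ones set by the data set authors:

```python
import numpy as np

def bad_pixel_rate(pred, gt, mask, tau=2.0):
    """Fraction of valid pixels whose predicted disparity differs from the
    true disparity by more than tau pixels (tau = 2 at full resolution,
    corresponding to a one-pixel tolerance at half resolution)."""
    valid = mask & np.isfinite(gt)
    return np.mean(np.abs(pred[valid] - gt[valid]) > tau)

def weighted_error(errors, weights):
    """Weighted average of per-image error rates (weights are placeholders)."""
    errors, weights = np.asarray(errors), np.asarray(weights)
    return np.sum(errors * weights) / np.sum(weights)
```

For example, a disparity map where one of four valid pixels is off by five pixels yields a bad-pixel rate of 0.25.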

Table 4 contains a snapshot of the third, and newest, version of the Middlebury leaderboard. Our method ranks first with an error rate of 8.29 % and a substantial lead over the second-placed method, MeshStereo, whose error rate is 13.4 %. See Figure 7 for disparity maps produced by our method on one image pair from the Middlebury data set.

Details of Learning

We construct a binary classification data set from all available image pairs in the training set. The data set contains 25 million examples on the KITTI 2012, 17 million examples on the KITTI 2015, and 38 million examples on the Middlebury data set.

At training time, the input to the network was a batch of 128 pairs of image patches. At test time, the input was the entire left and right image. We could have trained on entire images as well, which would have allowed us to use the speed optimizations described in Section 3.3. We nevertheless preferred to train on image patches for several reasons: it was easier to control the batch size, the examples could be shuffled so that one batch contained patches from several different images, and it was easier to maintain the same number of positive and negative examples within a batch.
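The batching strategy described above can be sketched with a hypothetical helper (not the paper's Torch code), assuming positives and negatives are lists of patch-pair examples:

```python
import random

def balanced_batches(positives, negatives, batch_size=128, seed=0):
    """Yield shuffled batches containing equal numbers of positive and
    negative patch pairs, mixing examples from many different images.
    Note: shuffles the input lists in place."""
    rng = random.Random(seed)
    rng.shuffle(positives)
    rng.shuffle(negatives)
    half = batch_size // 2
    n = min(len(positives), len(negatives))
    for i in range(0, n - half + 1, half):
        batch = positives[i:i + half] + negatives[i:i + half]
        rng.shuffle(batch)  # mix positives and negatives within the batch
        yield batch
```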

We minimized the loss using mini-batch gradient descent with the momentum term set to 0.9. We trained for 14 epochs with the learning rate initially set to 0.003 for the accurate architecture and 0.002 for the fast architecture. The learning rate was decreased

[Figure 7: image panels omitted (left input image, right input image, and a disparity map labeled 'Error: 34.65 %').]

Figure 7: An example of a particularly difficult image pair from the Middlebury data set; the white wall in the background is practically textureless. The accurate architecture is able to classify most of it correctly. The fast architecture doesn't do as well but still performs better than census.


Table 5: The hyperparameter values we used for the fast and accurate architectures (abbreviated 'fst' and 'acrt'). Note that hyperparameters concerning image intensity values (cbca_intensity and sgm_D) apply to the preprocessed images and not to raw images with intensity values in the range from 0 to 255.

by a factor of 10 on the 11th epoch. The number of epochs, the initial learning rate, and the learning rate decrease schedule were treated as hyperparameters and were optimized with cross-validation. Each image was preprocessed by subtracting the mean and dividing by the standard deviation of its pixel intensity values. The left and right image of a stereo pair were preprocessed separately. Our initial experiments suggested that using color information does not improve the quality of the disparity maps; therefore, we converted all color images to grayscale.
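The per-image normalization step can be sketched as follows (a minimal illustration, applied independently to the left and right image of a pair):

```python
import numpy as np

def preprocess(image):
    """Normalize a grayscale image to zero mean and unit variance by
    subtracting the mean and dividing by the standard deviation of its
    pixel intensity values."""
    image = image.astype(np.float32)
    return (image - image.mean()) / image.std()
```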

The post-processing steps of the stereo method were implemented in CUDA (Nickolls et al., 2008), the network training was done with the Torch environment (Collobert et al., 2011) using the convolution routines from the cuDNN library (Chetlur et al., 2014). The OpenCV library (Bradski, 2000) was used for the affine transformation in the data augmentation step.

The hyperparameters were optimized with manual search and simple scripts that helped automate the process. The hyperparameters we selected are shown in Table 5.

Data Set Augmentation

Augmenting the data set by repeatedly transforming the training examples is a commonly employed technique to reduce the network's generalization error. The transformations are applied at training time and do not affect the runtime performance. We randomly rotate, scale and shear the training patches; we also change their brightness and contrast. Since the transformations are applied to patches after they have been extracted from the images, the data augmentation step does not alter the ground truth disparity map or ruin the rectification.

The parameters of the transformation are chosen randomly for each pair of patches, and after one epoch of training, when the same example is being presented to the network for the second time, new random parameters are selected. We choose slightly different transformation parameters for the left and right image; for example, we would rotate the left patch by 10 degrees and the right by 14. Different data sets benefited from different types of transformations and, in some cases, using the wrong transformations increased the error.

On the Middlebury data set we took advantage of the fact that the images were taken under different lighting conditions and different shutter exposures by training on all available images. The same data set augmentation parameters were used for the KITTI 2012 and KITTI 2015 data sets.

The Middlebury test set contains two images worth mentioning: Classroom, where the right image is underexposed and, therefore, darker than the left; and Djembe, where the left and right images were taken under different lighting conditions. To handle these two cases we train, 20 % of the time, on images where either the shutter exposure or the arrangement of lights differs between the left and right image.

We combat imperfect rectification on the Middlebury data set by including a small vertical disparity between the left and right image patches.

Before describing the steps of data augmentation, let us introduce some notation: in the following, a word in typewriter font is used to denote the name of a hyperparameter defining a set, while the same word in italics is used to denote a number drawn randomly from that set. For example, rotate is a hyperparameter defining the set of possible rotations and rotate is a number drawn randomly from that set. The steps of data augmentation are presented in the following list:

· Rotate the left patch by rotate degrees and the right patch by rotate + rotate_diff degrees.

· Scale the left patch by scale and the right patch by scale · scale_diff.

· Scale the left patch in the horizontal direction by horizontal_scale and the right patch by horizontal_scale · horizontal_scale_diff.

· Shear the left patch in the horizontal direction by horizontal_shear and the right patch by horizontal_shear + horizontal_shear_diff.

Table 6: The hyperparameters governing data augmentation and how they affect the validation error. The 'Error' column reports the validation error when a particular data augmentation step is not used. The last two rows report validation errors with and without data augmentation. For example, the validation error on the KITTI 2012 is 2.73 % if no data augmentation is used, 2.65 % if all steps except rotation are used, and 2.61 % if all data augmentation steps are used.

· Translate the right patch in the vertical direction by vertical_disparity.

· Adjust the brightness and contrast by setting the left and right image patches to:

with addition and multiplication carried out element-wise where appropriate.
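The parameter sampling behind the steps above can be sketched as follows; the ranges are illustrative stand-ins for the actual hyperparameter sets listed in Table 6:

```python
import random

# Illustrative ranges only; the sets actually used are in Table 6.
HYPER = {
    'rotate': (-7, 7), 'rotate_diff': (-3, 3),
    'scale': (0.9, 1.0), 'scale_diff': (0.9, 1.0),
    'horizontal_scale': (0.9, 1.0), 'horizontal_scale_diff': (0.9, 1.0),
    'horizontal_shear': (0.0, 0.1), 'horizontal_shear_diff': (-0.03, 0.03),
    'vertical_disparity': (-1, 1),
    'brightness': (-0.7, 0.7), 'brightness_diff': (-0.3, 0.3),
    'contrast': (0.8, 1.2), 'contrast_diff': (0.8, 1.2),
}

def sample_transform(rng):
    """Draw fresh random transformation parameters for one pair of patches.
    The left patch uses the base value; the right patch combines the base
    value with the corresponding *_diff value, so the two patches receive
    slightly different transformations."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in HYPER.items()}
```

New parameters are drawn each time a pair is presented, so the same example is transformed differently in every epoch.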

Table 6 contains the hyperparameters used and measures how each data augmentation step affected the validation error.

Data augmentation reduced the validation error from 2.73 % to 2.61 % on the KITTI 2012 data set and from 8.75 % to 7.91 % on the Middlebury data set.

Runtime

We measure the runtime of our implementation on a computer with a NVIDIA Titan X graphics processor unit. Table 7 contains the runtime measurements across a range of hyperparameter settings for three data sets: KITTI, Middlebury half resolution, and a

new, fictitious data set, called Tiny, which we use to demonstrate the performance of our method on the kind of images typically used in autonomous driving or robotics. The image sizes on which we measured the runtime were: 1242 × 350 with 228 disparity levels for the KITTI data set, 1500 × 1000 with 200 disparity levels for the Middlebury data set, and 320 × 240 with 32 disparity levels for the Tiny data set.

Table 7 reveals that the fast architecture is up to 90 times faster than the accurate architecture. The running times of the fast architecture are 0.78 seconds on KITTI, 2.03 seconds on Middlebury, and 0.06 seconds on the Tiny data set. We can also see that the fully-connected layers are responsible for most of the runtime in the accurate architecture, as the hyperparameters controlling the number of convolutional layers and the number of feature maps have only a small effect on the runtime.

Training times depended on the size of the data set and the architecture, but never exceeded two days.

Matching Cost

We argue that the low error rate of our method is due to the convolutional neural network and not a superior stereo method. We verify this claim by replacing the convolutional neural network with three standard approaches for computing the matching cost:

· The sum of absolute differences computes the matching cost according to Equation (1); that is, the matching cost between two image patches is computed by summing the absolute differences in image intensities between corresponding locations. We used 9 × 9 patches.

· The census transform (Zabih and Woodfill, 1994) represents each image position as a bit vector. The size of this vector is a hyperparameter whose value, after examining several, we set to 81. The vector is computed by cropping a 9 × 9 image patch centered on the position of interest and comparing the intensity value of each pixel in the patch to the intensity value of the center pixel. When the center pixel is brighter, the corresponding bit is set. The matching cost is computed as the Hamming distance between two census-transformed vectors.

· Normalized cross-correlation is a window-based method defined by the following equation:

The normalized cross-correlation matching cost computes the cosine similarity between the left and right image patch, when the left and right image patches are viewed as vectors instead of matrices. This is the same function that is computed in the last two layers of the fast architecture (normalization and dot product). The neighbourhood N_p was set to a square 11 × 11 window around p.
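As a rough sketch of the census baseline (our own illustration, not the paper's CUDA implementation; a small window is used in the usage example for clarity):

```python
import numpy as np

def census_transform(image, window=9):
    """Census transform: for each pixel, a bit vector comparing every pixel
    in a window x window neighbourhood to the center pixel (a bit is set
    when the center is brighter). Border pixels are skipped for brevity."""
    h, w = image.shape
    r = window // 2
    bits = np.zeros((h, w, window * window), dtype=bool)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            k = (dy + r) * window + (dx + r)
            bits[r:h - r, r:w - r, k] = (
                image[r:h - r, r:w - r] > image[r + dy:h - r + dy, r + dx:w - r + dx])
    return bits

def census_cost(left_bits, right_bits, d):
    """Matching cost at disparity d: Hamming distance between the census
    vector of each left pixel and that of the right pixel shifted by d."""
    h, w, _ = left_bits.shape
    cost = np.full((h, w), np.inf)
    if d < w:
        cost[:, d:] = np.sum(left_bits[:, d:] != right_bits[:, :w - d], axis=2)
    return cost
```

For a 3 × 3 window on a 3 × 3 image, the center pixel's census vector has one bit per neighbour it is brighter than.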

The 'sad', 'cens', and 'ncc' columns of Table 8 contain the results of the sum of absolute differences, the census transform, and normalized cross-correlation on the KITTI 2012, KITTI 2015, and Middlebury data sets. The validation errors in the last rows of Table 8

Table 7: The time, in seconds, required to compute the matching cost, that is, the time spent in the convolutional neural network without any post-processing steps. The time includes computing the matching cost twice: once when the left image is taken to be the reference image and once when the right image is taken to be the reference image. We measure the runtime as a function of four hyperparameters controlling the network architecture; for example, the first six rows contain the runtime as the number of convolutional layers in the network increases from one to six. The last row of the table contains the running time for the entire method, including the post-processing steps. As before, we abbreviate the fast and accurate architectures as 'fst' and 'acrt'.

should be used to compare the five methods. On all three data sets the accurate architecture performs best, followed by the fast architecture, which in turn is followed by the census transform. These are the three best performing methods on all three data sets. Their error rates are 2.61 %, 3.02 %, and 4.90 % on KITTI 2012; 3.25 %, 3.99 %, and 5.03 % on KITTI 2015; and 7.91 %, 9.87 %, and 16.72 % on Middlebury. The sum of absolute differences and the normalized cross-correlation matching costs produce disparity maps with larger errors. For a visual comparison of our method and the census transform see Figures 5, 6, and 7.

Stereo Method

The stereo method includes a number of post-processing steps: cross-based cost aggregation, semiglobal matching, interpolation, subpixel enhancement, a median filter, and a bilateral filter. We ran a set of experiments in which we excluded each of the aforementioned steps and recorded the validation error (see Table 8).

The last two rows of Table 8 allude to the importance of the post-processing steps of the stereo method. We see that, if all post-processing steps are removed, the validation error of the accurate architecture increases from 2.61 % to 13.49 % on KITTI 2012, from 3.25 % to 13.38 % on KITTI 2015, and from 7.91 % to 28.33 % on Middlebury.

Out of all post-processing steps of the stereo method, semiglobal matching affects the validation error the strongest. If we remove it, the validation error increases from 2.61 % to 4.26 % on KITTI 2012, from 3.25 % to 4.51 % on KITTI 2015, and from 7.91 % to 11.99 % on Middlebury.

We did not use the left-right consistency check to eliminate errors in occluded regions on the Middlebury data set. The error rate increased from 7.91 % to 8.22 % using the left-right consistency check on the accurate architecture, which is why we decided to remove it.
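The left-right consistency check mentioned above can be sketched as follows; this is an illustrative implementation of the |d − D^R(p − d)| ≤ 1 test, with the tolerance exposed as a parameter tau:

```python
import numpy as np

def left_right_check(disp_left, disp_right, tau=1):
    """Mark each left-image pixel as consistent when its disparity d agrees
    with the right disparity map at the matched location: a pixel at column
    x with disparity d is consistent when |d - D_R(x - d)| <= tau."""
    h, w = disp_left.shape
    xs = np.arange(w)
    consistent = np.zeros((h, w), dtype=bool)
    for y in range(h):
        d = np.rint(disp_left[y]).astype(int)  # round for indexing
        x_right = xs - d
        valid = (x_right >= 0) & (x_right < w)
        consistent[y, valid] = (
            np.abs(d[valid] - disp_right[y, x_right[valid]]) <= tau)
    return consistent
```

Pixels that fail the check are the candidates for interpolation in occluded or mismatched regions.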

Data Set Size

We used a supervised learning approach to measure the similarity between image patches. It is, therefore, natural to ask how the size of the data set affects the quality of the disparity maps. To answer this question, we retrain our networks on smaller training sets obtained by selecting a random subset of examples (see Table 9).

We observe that the validation error decreases as we increase the number of training examples. These experiments suggest a simple strategy for improving the results of our stereo method: collect a larger data set.

Transfer Learning

Up to this point the training and validation sets were created from the same stereo data set, either KITTI 2012, KITTI 2015, or Middlebury. To evaluate the performance of our method in the transfer learning setting, we run experiments where the validation error is computed on a different data set than the one used for training. For example, we would use the Middlebury data set to train the matching cost neural network and evaluate its performance on the KITTI 2012 data set. These experiments give us some idea of the expected performance in a real-world application, where it isn't possible to train a specialized

Table 8: The numbers measure validation error when a particular post-processing step is excluded from the stereo method. The last two rows of the tables should be interpreted differently: they contain the validation error of the raw convolutional neural network and the validation error after the complete stereo method. For example, if we exclude semiglobal matching, the fast architecture achieves an error rate of 8.78 % on the KITTI 2012 data set and an error rate of 3.02 % after applying the full stereo method. We abbreviate the method names as 'fst' for the fast architecture, 'acrt' for the accurate architecture, 'sad' for the sum of absolute differences, 'cens' for the census transform, and 'ncc' for the normalized cross-correlation matching cost.

Table 9: The validation error as a function of training set size.

Table 10: The validation error when the training and test sets differ. For example, the validation error is 3.16 % when the Middlebury data set is used for training the fast architecture and the trained network is tested on the KITTI 2012 data set.

network because no ground truth is available. The results of these experiments are shown in Table 10.

Some results in Table 10 were unexpected. For example, the validation error on KITTI 2012 is lower when using the Middlebury training set compared to the KITTI 2015 training set, even though the KITTI 2012 data set is obviously more similar to KITTI 2015 than Middlebury. Furthermore, the validation error on KITTI 2012 is lower when using the fast architecture instead of the accurate architecture when training on KITTI 2015.

The matching cost neural network trained on the Middlebury data set transfers well to the KITTI data sets. Its validation error is similar to the validation errors obtained by networks trained on the KITTI data sets.

Hyperparameters

Searching for a good set of hyperparameters is a daunting task: the search space grows exponentially with the number of hyperparameters, and there is no gradient to guide us. To better understand the effect of each hyperparameter on the validation error, we conduct a series of experiments in which we vary the value of one hyperparameter while keeping the others fixed at their default values. The results are shown in Table 11 and can be summarized by observing that increasing the size of the network improves the generalization

Table 11: Validation errors computed across a range of hyperparameter settings.

performance, but only up to a point at which, presumably because of the limited size of the data set, the generalization performance starts to decrease.

Note that the num conv layers hyperparameter implicitly controls the size of the image patches. For example, a network with one convolutional layer with 3 × 3 kernels compares image patches of size 3 × 3, while a network with five convolutional layers compares patches of size 11 × 11.
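The relationship between the number of convolutional layers and the compared patch size follows from the receptive field of stacked 3 × 3 convolutions (stride 1, no pooling), each of which grows the receptive field by two pixels:

```python
def compared_patch_size(num_conv_layers, kernel_size=3):
    """Width of the image patch effectively compared by a network of
    stacked convolutional layers with stride 1 and no pooling: each layer
    adds kernel_size - 1 pixels to the receptive field."""
    return 1 + num_conv_layers * (kernel_size - 1)
```

With 3 × 3 kernels, one layer compares 3 × 3 patches and five layers compare 11 × 11 patches, matching the figures quoted above.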

Conclusion

We presented two convolutional neural network architectures for learning a similarity measure on image patches and applied them to the problem of stereo matching.

The source code of our implementation is available at https://github.com/jzbontar/mc-cnn. The online repository contains procedures for computing the disparity map, training the network, as well as the post-processing steps of the stereo method.

The accurate architecture produces disparity maps with lower error rates than any previously published method on the KITTI 2012, KITTI 2015, and Middlebury data sets. The fast architecture computes the disparity maps up to 90 times faster than the accurate architecture with only a small increase in error. These results suggest that convolutional neural networks are well suited for computing the stereo matching cost even for applications that require real-time performance.

The fact that a relatively simple convolutional neural network outperformed all previous methods on the well-studied problem of stereo is a rather important demonstration of the power of modern machine learning approaches.

$$ z = \frac{f B}{d}, $$

$$ C_{\text{SAD}}(\mathbf{p}, d) = \sum_{\mathbf{q} \in \mathcal{N}_{\mathbf{p}}} |I^L(\mathbf{q}) - I^R(\mathbf{q}-\mathbf{d})|, \tag{1} $$

$$ \mathbf{q} = (x - d + o_{\text{neg}}, y), $$

$$ D(\mathbf{p}) = \arg\min_d C(\mathbf{p}, d). $$

$$ D_{\text{SE}}(\mathbf{p}) = d - \frac {C_+ - C_-} {2 (C_+ - 2 C + C_-)}, $$

$$ D_{\text{BF}}(\mathbf{p}) = \frac{1}{W(\mathbf{p})} \sum_{\mathbf{q} \in \mathcal{N}_{\mathbf{p}}} D_{\text{SE}}(\mathbf{q}) \cdot g(|\mathbf{p} - \mathbf{q}|) \cdot 1\{|I^L(\mathbf{p}) - I^L(\mathbf{q})| < \texttt{blur\_threshold}\}, $$

$$ C_{\text{NCC}}(\mathbf{p}, d) = \frac{\sum_{\mathbf{q} \in \mathcal{N}_{\mathbf{p}}} I^L(\mathbf{q})\, I^R(\mathbf{q} - \mathbf{d})}{\sqrt{\sum_{\mathbf{q} \in \mathcal{N}_{\mathbf{p}}} I^L(\mathbf{q})^2 \sum_{\mathbf{q} \in \mathcal{N}_{\mathbf{p}}} I^R(\mathbf{q} - \mathbf{d})^2}}. $$

$$ \begin{aligned} C^0_{\text{CBCA}}(\mathbf{p}, d) &= C_{\text{CNN}}(\mathbf{p}, d), \\ C^i_{\text{CBCA}}(\mathbf{p}, d) &= \frac{1}{|U_d(\mathbf{p})|} \sum_{\mathbf{q} \in U_d(\mathbf{p})} C^{i-1}_{\text{CBCA}}(\mathbf{q}, d), \end{aligned} $$

$$ \begin{aligned} \mathcal{P}^L &\leftarrow \mathcal{P}^L \cdot \textit{contrast} + \textit{brightness} \quad \text{and} \\ \mathcal{P}^R &\leftarrow \mathcal{P}^R \cdot (\textit{contrast} \cdot \textit{contrast\_diff}) + (\textit{brightness} + \textit{brightness\_diff}), \end{aligned} $$

$$ E(D) = \sum_{\mathbf{p}} \biggl( C^4_{\text{CBCA}}(\mathbf{p}, D(\mathbf{p})) + \sum_{\mathbf{q} \in \mathcal{N}_{\mathbf{p}}} P_1 \cdot 1\{|D(\mathbf{p}) - D(\mathbf{q})| = 1\} + \sum_{\mathbf{q} \in \mathcal{N}_{\mathbf{p}}} P_2 \cdot 1\{|D(\mathbf{p}) - D(\mathbf{q})| > 1\} \biggr), $$

$$ C_{\mathbf{r}}(\mathbf{p}, d) = C^4_{\text{CBCA}}(\mathbf{p}, d) - \min_k C_{\mathbf{r}}(\mathbf{p} - \mathbf{r}, k) + \min\Bigl\{ C_{\mathbf{r}}(\mathbf{p} - \mathbf{r}, d),\ C_{\mathbf{r}}(\mathbf{p} - \mathbf{r}, d - 1) + P_1,\ C_{\mathbf{r}}(\mathbf{p} - \mathbf{r}, d + 1) + P_1,\ \min_k C_{\mathbf{r}}(\mathbf{p} - \mathbf{r}, k) + P_2 \Bigr\}. $$

$$ \begin{array}{lll} P_1 = \texttt{sgm\_P1}, & P_2 = \texttt{sgm\_P2} & \text{if } D_1 < \texttt{sgm\_D},\ D_2 < \texttt{sgm\_D}; \\ P_1 = \texttt{sgm\_P1} / \texttt{sgm\_Q2}, & P_2 = \texttt{sgm\_P2} / \texttt{sgm\_Q2} & \text{if } D_1 \geq \texttt{sgm\_D},\ D_2 \geq \texttt{sgm\_D}; \\ P_1 = \texttt{sgm\_P1} / \texttt{sgm\_Q1}, & P_2 = \texttt{sgm\_P2} / \texttt{sgm\_Q1} & \text{otherwise.} \end{array} $$

$$ \begin{array}{ll} \textit{correct} & \text{if } |d - D^R(\mathbf{p}-\mathbf{d})| \leq 1 \text{ for } d = D^L(\mathbf{p}), \\ \textit{mismatch} & \text{if } |d - D^R(\mathbf{p}-\mathbf{d})| \leq 1 \text{ for any other } d, \\ \textit{occlusion} & \text{otherwise.} \end{array} $$



Keywords: stereo, matching cost, similarity learning, supervised learning, convolutional neural networks

Consider the following problem: given two images taken by cameras at different horizontal positions, we wish to compute the disparity $d$ for each pixel in the left image. Disparity refers to the difference in horizontal location of an object in the left and right image: an object at position $(x, y)$ in the left image appears at position $(x - d, y)$ in the right image. If we know the disparity of an object we can compute its depth $z$ using the following relation:

where $f$ is the focal length of the camera and $B$ is the distance between the camera centers. Figure 1 depicts the input to and the output from our method.
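The depth relation above is straightforward to compute; the values in the usage note are illustrative, not taken from any particular data set:

```python
def depth_from_disparity(d, focal_length, baseline):
    """Depth z = f * B / d for an object with disparity d (in pixels),
    focal length f (in pixels), and baseline B between the camera centers
    (in meters)."""
    if d <= 0:
        raise ValueError("disparity must be positive")
    return focal_length * baseline / d
```

For example, with an illustrative focal length of 700 pixels and a 0.54 m baseline, a disparity of 50 pixels corresponds to a depth of 7.56 m; note that depth grows quickly as disparity shrinks, so small disparity errors matter most for distant objects.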

The described problem of stereo matching is important in many fields such as autonomous driving, robotics, intermediate view generation, and 3D scene reconstruction. According to the taxonomy of Scharstein and Szeliski (2002), a typical stereo algorithm consists of four steps: matching cost computation, cost aggregation, optimization, and disparity refinement. Following Hirschmüller and Scharstein (2009) we refer to the first two steps as computing the matching cost and the last two steps as the stereo method. The focus of this work is on computing a good matching cost.

We propose training a convolutional neural network (LeCun et al., 1998) on pairs of small image patches where the true disparity is known (for example, obtained by LIDAR or structured light). The output of the network is used to initialize the matching cost. We proceed with a number of post-processing steps that are not novel, but are necessary to achieve good results. Matching costs are combined between neighboring pixels with similar image intensities using cross-based cost aggregation. Smoothness constraints are enforced by semiglobal matching and a left-right consistency check is used to detect and eliminate errors in occluded regions. We perform subpixel enhancement and apply a median filter and a bilateral filter to obtain the final disparity map.

The contributions of this paper are

a description of two architectures based on convolutional neural networks for computing the stereo matching cost;

a method, accompanied by its source code, with the lowest error rate on the KITTI 2012, KITTI 2015, and Middlebury stereo data sets; and

experiments analyzing the importance of data set size, the error rate compared with other methods, and the trade-off between accuracy and runtime for different settings of the hyperparameters.

This paper extends our previous work (Žbontar and LeCun, 2015) by including a description of a new architecture, results on two new data sets, lower error rates, and more thorough experiments.

Before the introduction of large stereo data sets like KITTI and Middlebury, relatively few stereo algorithms used ground truth information to learn parameters of their models; in this section, we review the ones that did. For a general overview of stereo algorithms see Scharstein and Szeliski (2002).

Kong and Tao (2004) used the sum of squared distances to compute an initial matching cost. They then trained a model to predict the probability distribution over three classes: the initial disparity is correct, the initial disparity is incorrect due to fattening of a foreground object, and the initial disparity is incorrect due to other reasons. The predicted probabilities were used to adjust the initial matching cost. Kong and Tao (2006) later extended their work by combining predictions obtained by computing normalized cross-correlation over different window sizes and centers. Peris et al. (2012) initialized the matching cost with AD-Census (Mei et al., 2011), and used multiclass linear discriminant analysis to learn a mapping from the computed matching cost to the final disparity.

Ground-truth data was also used to learn parameters of probabilistic graphical models. Zhang and Seitz (2007) used an alternative optimization algorithm to estimate optimal values of Markov random field hyperparameters. Scharstein and Pal (2007) constructed a new data set of 30 stereo pairs and used it to learn parameters of a conditional random field. Li and Huttenlocher (2008) presented a conditional random field model with a non-parametric cost function and used a structured support vector machine to learn the model parameters.

Recent work (Haeusler et al., 2013; Spyropoulos et al., 2014) focused on estimating the confidence of the computed matching cost. Haeusler et al. (2013) used a random forest classifier to combine several confidence measures. Similarly, Spyropoulos et al. (2014) trained a random forest classifier to predict the confidence of the matching cost and used the predictions as soft constraints in a Markov random field to decrease the error of the stereo method.

A related problem to computing the matching cost is learning local image descriptors (Brown et al., 2011; Trzcinski et al., 2012; Simonyan et al., 2014; Revaud et al., 2015; Paulin et al., 2015; Han et al., 2015; Zagoruyko and Komodakis, 2015). The two problems share a common subtask: to measure the similarity between image patches. Brown et al. (2011) introduced a general framework for learning image descriptors and used Powell's method to select good hyperparameters. Several methods have been suggested for solving the problem of learning local image descriptors, such as boosting (Trzcinski et al., 2012), convex optimization (Simonyan et al., 2014), hierarchical moving-quadrant similarity (Revaud et al., 2015), convolutional kernel networks (Paulin et al., 2015), and convolutional neural networks (Zagoruyko and Komodakis, 2015; Han et al., 2015). The works of Zagoruyko and Komodakis (2015) and Han et al. (2015), in particular, are very similar to our own, differing mostly in the architecture of the network; concretely, in the inclusion of pooling and subsampling to account for larger patch sizes and larger variations in viewpoint.

A typical stereo algorithm begins by computing a matching cost at each position 𝐩𝐩\mathbf{p} for all disparities d𝑑d under consideration. A simple method for computing the matching cost is the sum of absolute differences:

where $I^L(\mathbf{p})$ and $I^R(\mathbf{p})$ are image intensities at position $\mathbf{p}$ in the left and right image and $\mathcal{N}_{\mathbf{p}}$ is the set of locations within a fixed rectangular window centered at $\mathbf{p}$.
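The sum of absolute differences in Equation (1) can be sketched as follows (a minimal illustration without bounds checking, using a (y, x) pixel convention):

```python
import numpy as np

def sad_cost(left, right, p, d, window=9):
    """Sum of absolute differences between a window x window patch centered
    at p = (y, x) in the left image and the corresponding patch shifted
    left by disparity d in the right image. No bounds checking, for brevity."""
    y, x = p
    r = window // 2
    lw = left[y - r:y + r + 1, x - r:x + r + 1]
    rw = right[y - r:y + r + 1, x - d - r:x - d + r + 1]
    return np.sum(np.abs(lw - rw))
```

For a right image that is exactly the left image shifted by the disparity, the cost at the true disparity is zero.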

We use bold lowercase letters $\mathbf{p}$ and $\mathbf{q}$ to denote image locations. A bold lowercase $\mathbf{d}$ denotes the disparity $d$ cast to a vector, that is, $\mathbf{d} = (d, 0)$. We use typewriter font for the names of hyperparameters. For example, we would use patch_size to denote the size of the neighbourhood area $\mathcal{N}_{\mathbf{p}}$.

Equation (1) can be interpreted as measuring the cost associated with matching a patch from the left image, centered at position $\mathbf{p}$, with a patch from the right image, centered at position $\mathbf{p} - \mathbf{d}$. We want the cost to be low when the two patches are centered around the image of the same 3D point, and high when they are not.

Since examples of good and bad matches can be constructed from publicly available data sets (for example, the KITTI and Middlebury stereo data sets), we can attempt to solve the matching problem by a supervised learning approach. Inspired by the successful application of convolutional neural networks to vision problems, we used them to assess how well two small image patches match.

We use ground truth disparity maps from either the KITTI or Middlebury stereo data sets to construct a binary classification data set. At each image position where the true disparity is known we extract one negative and one positive training example. This ensures that the data set contains an equal number of positive and negative examples. A positive example is a pair of patches, one from the left and one from the right image, whose center pixels are the images of the same 3D point, while a negative example is a pair of patches where this is not the case. The following section describes the data set construction step in detail.

Let $\langle \mathcal{P}^L_{n \times n}(\mathbf{p}), \mathcal{P}^R_{n \times n}(\mathbf{q}) \rangle$ denote a pair of patches, where $\mathcal{P}^L_{n \times n}(\mathbf{p})$ is an $n \times n$ patch from the left image centered at position $\mathbf{p} = (x, y)$, $\mathcal{P}^R_{n \times n}(\mathbf{q})$ is an $n \times n$ patch from the right image centered at position $\mathbf{q}$, and $d$ denotes the correct disparity at position $\mathbf{p}$. A negative example is obtained by setting the center of the right patch to

$$\mathbf{q} = (x - d + o_{\text{neg}},\ y),$$
where $o_{\text{neg}}$ is chosen from either the interval $[\texttt{dataset\_neg\_low}, \texttt{dataset\_neg\_high}]$ or its origin-reflected counterpart, $[-\texttt{dataset\_neg\_high}, -\texttt{dataset\_neg\_low}]$. The random offset $o_{\text{neg}}$ ensures that the resulting image patches are not centered around the same 3D point.

A positive example is derived by setting

$$\mathbf{q} = (x - d + o_{\text{pos}},\ y),$$
where $o_{\text{pos}}$ is chosen randomly from the interval $[-\texttt{dataset\_pos}, \texttt{dataset\_pos}]$. The reason for including $o_{\text{pos}}$, instead of setting it to zero, has to do with the stereo method used later on. In particular, we found that cross-based cost aggregation performs better when the network assigns low matching costs to good matches as well as near matches. In our experiments, the hyperparameter \texttt{dataset\_pos} was never larger than one pixel.
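The sampling of right-image patch centers can be sketched as follows; the hyperparameter values here are illustrative placeholders, not the tuned values from the paper:

```python
import random

def sample_right_centers(p, d, dataset_pos=0.5,
                         dataset_neg_low=1.5, dataset_neg_high=6.0):
    """Return right-image patch centers for one positive and one
    negative example, given the left center p = (x, y) and the true
    disparity d at that position."""
    x, y = p
    # positive: small random offset around the true match
    o_pos = random.uniform(-dataset_pos, dataset_pos)
    # negative: offset drawn from [low, high] or its reflection
    o_neg = random.uniform(dataset_neg_low, dataset_neg_high)
    if random.random() < 0.5:
        o_neg = -o_neg
    return (x - d + o_pos, y), (x - d + o_neg, y)
```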

We describe two network architectures for learning a similarity measure on image patches. The first architecture is faster than the second, but produces disparity maps that are slightly less accurate. In both cases, the input to the network is a pair of small image patches and the output is a measure of similarity between them. Both architectures contain a trainable feature extractor that represents each image patch with a feature vector. The similarity between patches is measured on the feature vectors instead of the raw image intensity values. The fast architecture uses a fixed similarity measure to compare the two feature vectors, while the accurate architecture attempts to learn a good similarity measure on feature vectors.

The first architecture is a siamese network, that is, two shared-weight sub-networks joined at the head (Bromley et al., 1993). The sub-networks are composed of a number of convolutional layers with rectified linear units following all but the last layer. Both sub-networks output a vector capturing the properties of the input patch. The resulting two vectors are compared using the cosine similarity measure to produce the final output of the network. Figure 2 provides an overview of the architecture.
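The fixed similarity at the head of the fast architecture is the cosine of the angle between the two feature vectors. A minimal NumPy sketch (the function name and epsilon guard are ours):

```python
import numpy as np

def cosine_similarity(f_left, f_right, eps=1e-8):
    """Compare the two sub-networks' feature vectors; output in [-1, 1]."""
    num = float(np.dot(f_left, f_right))
    den = float(np.linalg.norm(f_left) * np.linalg.norm(f_right)) + eps
    return num / den
```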

The network is trained by minimizing a hinge loss. The loss is computed by considering pairs of examples centered around the same image position where one example belongs to the positive and one to the negative class. Let $s_+$ be the output of the network for the positive example, $s_-$ be the output of the network for the negative example, and let $m$, the margin, be a positive real number. The hinge loss for that pair of examples is defined as $\max(0, m + s_- - s_+)$. The loss is zero when the similarity of the positive example is greater than the similarity of the negative example by at least the margin $m$. We set the margin to 0.2 in our experiments.
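In code, the pairwise hinge loss is a one-liner:

```python
def hinge_loss(s_pos, s_neg, margin=0.2):
    """Zero when the positive example scores higher than the negative
    one by at least the margin; linear penalty otherwise."""
    return max(0.0, margin + s_neg - s_pos)
```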

The hyperparameters of this architecture are the number of convolutional layers in each sub-network (num_conv_layers), the size of the convolution kernels (conv_kernel_size), the number of feature maps in each layer (num_conv_feature_maps), and the size of the input patch (input_patch_size).

The second architecture is derived from the first by replacing the cosine similarity measure with a number of fully-connected layers (see Figure 3). This architectural change increased the running time, but decreased the error rate. The two sub-networks comprise a number of convolutional layers, with a rectified linear unit following each layer. The resulting two vectors are concatenated and forward-propagated through a number of fully-connected layers followed by rectified linear units. The last fully-connected layer produces a single number which, after being transformed with the sigmoid nonlinearity, is interpreted as the similarity score between the input patches.

We use the binary cross-entropy loss for training. Let $s$ denote the output of the network for one training example and $t$ denote the class of that training example; $t = 1$ if the example belongs to the positive class and $t = 0$ if the example belongs to the negative class. The binary cross-entropy loss for that example is defined as $t \log(s) + (1 - t)\log(1 - s)$.
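The expression above is the log-likelihood term; in practice one minimizes its negation, so that the loss is non-negative and approaches zero for a confident correct prediction. A sketch with a clamp to avoid $\log(0)$:

```python
import math

def binary_cross_entropy(s, t, eps=1e-12):
    """Negated t*log(s) + (1-t)*log(1-s), with s clamped away from 0 and 1."""
    s = min(max(s, eps), 1.0 - eps)
    return -(t * math.log(s) + (1 - t) * math.log(1 - s))
```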

The decision to use two different loss functions, one for each architecture, was based on empirical evidence. While we would have preferred to use the same loss function for both architectures, experiments showed that the binary cross-entropy loss performed better than the hinge loss on the accurate architecture. On the other hand, since the last step of the fast architecture is the cosine similarity computation, a cross-entropy loss was not directly applicable.

The output of the network is used to initialize the matching cost:

$$C_{\text{CNN}}(\mathbf{p}, d) = -s(\langle \mathcal{P}^L(\mathbf{p}), \mathcal{P}^R(\mathbf{p} - \mathbf{d}) \rangle),$$

where $s(\langle \mathcal{P}^L(\mathbf{p}), \mathcal{P}^R(\mathbf{p} - \mathbf{d}) \rangle)$ is the output of the network when run on input patches $\mathcal{P}^L(\mathbf{p})$ and $\mathcal{P}^R(\mathbf{p} - \mathbf{d})$. The minus sign converts the similarity score to a matching cost.

To compute the entire matching cost tensor $C_{\text{CNN}}(\mathbf{p}, d)$ we would, naively, have to perform the forward pass for each image location and each disparity under consideration. The following three implementation details kept the running time manageable:

The outputs of the two sub-networks need to be computed only once per location, and do not need to be recomputed for every disparity under consideration.

The output of the two sub-networks can be computed for all pixels in a single forward pass by propagating full-resolution images, instead of small image patches. Performing a single forward pass on the entire $w \times h$ image is faster than performing $w \cdot h$ forward passes on small patches because many intermediate results can be reused.

The output of the fully-connected layers in the accurate architecture can also be computed in a single forward pass. This is done by replacing each fully-connected layer with a convolutional layer with $1 \times 1$ kernels. We still need to perform the forward pass for each disparity under consideration; the maximum disparity $d$ is 228 for the KITTI data set and 400 for the Middlebury data set. As a result, the fully-connected part of the network needs to be run $d$ times, and is a bottleneck of the accurate architecture.

To compute the matching cost of a pair of images, we run the sub-networks once on each image and run the fully-connected layers $d$ times, where $d$ is the maximum disparity under consideration. This insight was important in designing the architecture of the network. We could have chosen an architecture where the two images are concatenated before being presented to the network, but that would imply a large cost at runtime because the whole network would need to be run $d$ times. This insight also led to the development of the fast architecture, where the only layer that is run $d$ times is the dot product of the feature vectors.
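Under these optimizations, the fast architecture's cost volume reduces to one dot product per disparity over precomputed per-pixel feature maps. A NumPy sketch, assuming feature maps of shape (H, W, F) that are already L2-normalized so the dot product equals the cosine similarity (names and shapes are our choices):

```python
import numpy as np

def fast_cost_volume(feat_left, feat_right, max_disp):
    """Matching cost C(p, d) for the fast architecture.

    feat_left, feat_right: (H, W, F) L2-normalized feature maps, one
    vector per pixel, each computed in a single forward pass per image.
    Only the dot product below is repeated per disparity.
    """
    h, w, _ = feat_left.shape
    cost = np.full((h, w, max_disp), np.inf)
    for d in range(max_disp):
        sim = (feat_left[:, d:, :] * feat_right[:, :w - d or None, :]).sum(axis=2)
        cost[:, d:, d] = -sim  # minus sign turns similarity into a cost
    return cost
```

Positions with no valid match (columns closer to the left border than $d$) are left at infinity.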

The raw outputs of the convolutional neural network are not enough to produce accurate disparity maps, with errors particularly apparent in low-texture regions and occluded areas. The quality of the disparity maps can be improved by applying a series of post-processing steps referred to as the stereo method. The stereo method we used was influenced by Mei et al. (2011) and comprises cross-based cost aggregation, semiglobal matching, a left-right consistency check, subpixel enhancement, a median filter, and a bilateral filter.

Information from neighboring pixels can be combined by averaging the matching cost over a fixed window. This approach fails near depth discontinuities, where the assumption of constant depth within a window is violated. We might prefer a method that adaptively selects the neighborhood for each pixel, so that support is collected only from pixels of the same physical object. In cross-based cost aggregation (Zhang et al., 2009) we build a local neighborhood around each location comprising pixels with similar image intensity values with the hope that these pixels belong to the same object.

The method begins by constructing an upright cross at each position; this cross is used to define the local support region. The left arm $\mathbf{p}_l$ at position $\mathbf{p}$ extends left as long as the following two conditions hold:

$|I(\mathbf{p}) - I(\mathbf{p}_l)| < \texttt{cbca\_intensity}$; the image intensities at positions $\mathbf{p}$ and $\mathbf{p}_l$ should be similar; their difference should be less than \texttt{cbca\_intensity}.

$\|\mathbf{p} - \mathbf{p}_l\| < \texttt{cbca\_distance}$; the horizontal distance (or vertical distance in the case of the top and bottom arms) between positions $\mathbf{p}$ and $\mathbf{p}_l$ is less than \texttt{cbca\_distance} pixels.
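The arm construction for the left direction can be sketched as follows (the other three arms are symmetric; the default threshold values are placeholders, not the tuned values from the paper):

```python
def left_arm_length(intensity, p, cbca_intensity=20, cbca_distance=14):
    """Length of the left arm of the cross at p = (row, col).

    intensity: 2-D nested sequence of grayscale values.  The arm grows
    while the intensity difference stays below cbca_intensity, the
    distance stays below cbca_distance, and the border is not hit.
    """
    r, c = p
    length = 0
    while c - (length + 1) >= 0 and length + 1 < cbca_distance:
        q = c - (length + 1)
        if abs(intensity[r][q] - intensity[r][c]) >= cbca_intensity:
            break
        length += 1
    return length
```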

Zhang et al. (2009) suggest that aggregation should consider the support regions of both images in a stereo pair. Let $U^L$ and $U^R$ denote the support regions in the left and right image. We define the combined support region $U_d$ as

$$U_d(\mathbf{p}) = \{\mathbf{q} \mid \mathbf{q} \in U^L(\mathbf{p}),\ \mathbf{q} - \mathbf{d} \in U^R(\mathbf{p} - \mathbf{d})\}.$$
The matching cost is averaged over the combined support region:

$$C^0_{\text{CBCA}}(\mathbf{p}, d) = C_{\text{CNN}}(\mathbf{p}, d),$$
$$C^i_{\text{CBCA}}(\mathbf{p}, d) = \frac{1}{|U_d(\mathbf{p})|} \sum_{\mathbf{q} \in U_d(\mathbf{p})} C^{i-1}_{\text{CBCA}}(\mathbf{q}, d),$$
where $i$ is the iteration number. We repeat the averaging a number of times. Since the support regions are overlapping, the results can change at each iteration. We skip cross-based cost aggregation in the fast architecture because it is not crucial for achieving a low error rate and because it is relatively expensive to compute.

We refine the matching cost by enforcing smoothness constraints on the disparity image. Following Hirschmüller (2008), we define an energy function $E(D)$ that depends on the disparity image $D$:

$$E(D) = \sum_{\mathbf{p}} \Bigl( C_{\text{CBCA}}(\mathbf{p}, D(\mathbf{p})) + \sum_{\mathbf{q} \in \mathcal{N}_{\mathbf{p}}} P_1 \cdot 1\{|D(\mathbf{p}) - D(\mathbf{q})| = 1\} + \sum_{\mathbf{q} \in \mathcal{N}_{\mathbf{p}}} P_2 \cdot 1\{|D(\mathbf{p}) - D(\mathbf{q})| > 1\} \Bigr),$$
where $1\{\cdot\}$ denotes the indicator function. The first term penalizes disparities with high matching costs. The second term adds a penalty $P_1$ when the disparities of neighboring pixels differ by one. The third term adds a larger penalty $P_2$ when the neighboring disparities differ by more than one.

Rather than minimizing $E(D)$ in all directions simultaneously, we could perform the minimization in a single direction with dynamic programming. This solution would introduce unwanted streaking effects, since there would be no incentive to make the disparity image smooth in the directions we are not optimizing over. In semiglobal matching we minimize the energy in a single direction, repeat for several directions, and average to obtain the final result. Although Hirschmüller (2008) suggested choosing sixteen directions, we only optimized along the two horizontal and the two vertical directions; adding the diagonal directions did not improve the accuracy of our system. To minimize $E(D)$ in direction $\mathbf{r}$, we define a matching cost $C_{\mathbf{r}}(\mathbf{p}, d)$ with the following recurrence relation:

$$C_{\mathbf{r}}(\mathbf{p}, d) = C_{\text{CBCA}}(\mathbf{p}, d) - \min_k C_{\mathbf{r}}(\mathbf{p} - \mathbf{r}, k) + \min\Bigl( C_{\mathbf{r}}(\mathbf{p} - \mathbf{r}, d),\ C_{\mathbf{r}}(\mathbf{p} - \mathbf{r}, d - 1) + P_1,\ C_{\mathbf{r}}(\mathbf{p} - \mathbf{r}, d + 1) + P_1,\ \min_k C_{\mathbf{r}}(\mathbf{p} - \mathbf{r}, k) + P_2 \Bigr).$$
The second term is subtracted to prevent values of $C_{\mathbf{r}}(\mathbf{p}, d)$ from growing too large and does not affect the optimal disparity map.
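A single left-to-right pass of this recurrence can be vectorized over rows and disparities. The sketch below uses constant penalties for brevity, whereas the paper adapts $P_1$ and $P_2$ to the image gradient:

```python
import numpy as np

def sgm_pass_left_to_right(cost, P1=1.0, P2=8.0):
    """One semiglobal-matching pass in direction r = (1, 0) over a
    cost volume of shape (H, W, D)."""
    h, w, dmax = cost.shape
    out = cost.astype(float)
    for x in range(1, w):
        prev = out[:, x - 1, :]                    # C_r at p - r
        prev_min = prev.min(axis=1, keepdims=True) # min_k C_r(p - r, k)
        # C_r(p-r, d-1) + P1 and C_r(p-r, d+1) + P1, padded with inf
        down = np.pad(prev, ((0, 0), (1, 0)), constant_values=np.inf)[:, :dmax] + P1
        up = np.pad(prev, ((0, 0), (0, 1)), constant_values=np.inf)[:, 1:] + P1
        best = np.minimum(np.minimum(prev, prev_min + P2), np.minimum(down, up))
        out[:, x, :] = cost[:, x, :] + best - prev_min
    return out
```

On a constant cost volume the pass is the identity, which matches the role of the subtracted term: it keeps the values from growing along the scanline.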

The penalty parameters $P_1$ and $P_2$ are set according to the image gradient so that jumps in disparity coincide with edges in the image. Let $D_1 = |I^L(\mathbf{p}) - I^L(\mathbf{p} - \mathbf{r})|$ and $D_2 = |I^R(\mathbf{p} - \mathbf{d}) - I^R(\mathbf{p} - \mathbf{d} - \mathbf{r})|$ be the difference in image intensity between two neighboring positions in the direction we are optimizing over. We set $P_1$ and $P_2$ according to the following rules:

$$P_1 = \texttt{sgm\_P1},\ P_2 = \texttt{sgm\_P2} \quad \text{if } D_1 < \texttt{sgm\_D},\ D_2 < \texttt{sgm\_D},$$
$$P_1 = \texttt{sgm\_P1} / \texttt{sgm\_Q2},\ P_2 = \texttt{sgm\_P2} / \texttt{sgm\_Q2} \quad \text{if } D_1 \ge \texttt{sgm\_D},\ D_2 \ge \texttt{sgm\_D},$$
$$P_1 = \texttt{sgm\_P1} / \texttt{sgm\_Q1},\ P_2 = \texttt{sgm\_P2} / \texttt{sgm\_Q1} \quad \text{otherwise}.$$
The hyperparameters \texttt{sgm\_P1} and \texttt{sgm\_P2} set a base penalty for discontinuities in the disparity map. The base penalty is reduced by a factor of \texttt{sgm\_Q1} if one of $D_1$ or $D_2$ indicates a strong image gradient, or by a larger factor of \texttt{sgm\_Q2} if both $D_1$ and $D_2$ indicate a strong image gradient. The value of $P_1$ is further reduced by a factor of \texttt{sgm\_V} when considering the two vertical directions; in the ground truth, small changes in disparity are much more frequent in the vertical directions than in the horizontal directions and should be penalised less.

The final cost $C_{\text{SGM}}(\mathbf{p}, d)$ is computed by taking the average across all four directions:

$$C_{\text{SGM}}(\mathbf{p}, d) = \frac{1}{4} \sum_{\mathbf{r}} C_{\mathbf{r}}(\mathbf{p}, d).$$
After semiglobal matching we repeat cross-based cost aggregation, as described in the previous section. Hyperparameters cbca_num_iterations_1 and cbca_num_iterations_2 determine the number of cross-based cost aggregation iterations before and after semiglobal matching.

The disparity image $D(\mathbf{p})$ is computed by the winner-takes-all strategy, that is, by finding the disparity $d$ that minimizes $C(\mathbf{p}, d)$,

$$D(\mathbf{p}) = \operatorname*{argmin}_d C(\mathbf{p}, d).$$
The interpolation steps attempt to resolve conflicts between the disparity map predicted for the left image and the disparity map predicted for the right image. Let $D^L$ denote the disparity map obtained by treating the left image as the reference image (this was the case so far, that is, $D^L(\mathbf{p}) = D(\mathbf{p})$) and let $D^R$ denote the disparity map obtained by treating the right image as the reference image. $D^L$ and $D^R$ sometimes disagree on what the correct disparity at a particular position should be. We detect these conflicts by performing a left-right consistency check. We label each position $\mathbf{p}$ by applying the following rules in turn:

correct if $|d - D^R(\mathbf{p} - \mathbf{d})| \le 1$ for $d = D^L(\mathbf{p})$,

mismatch if $|d - D^R(\mathbf{p} - \mathbf{d})| \le 1$ for any other disparity $d$,

occlusion otherwise.
For positions marked as occlusion, we want the new disparity value to come from the background. We interpolate by moving left until we find a position labeled correct and use its value. For positions marked as mismatch, we find the nearest correct pixels in 16 different directions and use the median of their disparities for interpolation. We refer to the interpolated disparity map as $D_{\text{INT}}$.

Subpixel enhancement provides an easy way to increase the resolution of a stereo algorithm. We fit a quadratic curve through the neighboring costs to obtain a new disparity image:

$$D_{\text{SE}}(\mathbf{p}) = d - \frac{C_+ - C_-}{2(C_+ - 2C + C_-)},$$
where $d = D_{\text{INT}}(\mathbf{p})$, $C_- = C_{\text{SGM}}(\mathbf{p}, d - 1)$, $C = C_{\text{SGM}}(\mathbf{p}, d)$, and $C_+ = C_{\text{SGM}}(\mathbf{p}, d + 1)$.
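The parabola fit is a small closed-form computation; a sketch with a guard for the degenerate flat case:

```python
def subpixel_disparity(d, c_minus, c, c_plus):
    """Refine integer disparity d using costs at d-1, d, d+1 by taking
    the abscissa of the minimum of the interpolating parabola:
    d - (C+ - C-) / (2 * (C+ - 2C + C-))."""
    denom = 2.0 * (c_plus - 2.0 * c + c_minus)
    if denom == 0:
        return float(d)  # flat neighborhood: keep the integer disparity
    return d - (c_plus - c_minus) / denom
```

If the three costs lie exactly on a parabola, the true minimum is recovered; for example, sampling $C(x) = (x - 5.3)^2$ at 4, 5, 6 returns 5.3.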

The final steps of the stereo method consist of a $5 \times 5$ median filter, whose output we denote $D_{\text{MF}}$, and the following bilateral filter:

$$D_{\text{BF}}(\mathbf{p}) = \frac{1}{W(\mathbf{p})} \sum_{\mathbf{q} \in \mathcal{N}_{\mathbf{p}}} D_{\text{MF}}(\mathbf{q})\, g(\|\mathbf{p} - \mathbf{q}\|)\, 1\{|I^L(\mathbf{p}) - I^L(\mathbf{q})| < \texttt{blur\_threshold}\},$$
where $g(x)$ is the probability density function of a zero mean normal distribution with standard deviation \texttt{blur\_sigma} and $W(\mathbf{p})$ is the normalizing constant,

$$W(\mathbf{p}) = \sum_{\mathbf{q} \in \mathcal{N}_{\mathbf{p}}} g(\|\mathbf{p} - \mathbf{q}\|)\, 1\{|I^L(\mathbf{p}) - I^L(\mathbf{q})| < \texttt{blur\_threshold}\}.$$
The role of the bilateral filter is to smooth the disparity map without blurring the edges. $D_{\text{BF}}$ is the final output of our stereo method.

We used three stereo data sets in our experiments: KITTI 2012, KITTI 2015, and Middlebury. The test set error rates reported in Tables 1, 2, and 4 were obtained by submitting the generated disparity maps to the online evaluation servers. All other error rates were computed by splitting the data set in two, using one part for training and the other for validation.

The KITTI stereo data set (Geiger et al., 2013; Menze and Geiger, 2015) is a collection of rectified image pairs taken from two video cameras mounted on the roof of a car, roughly 54 centimeters apart. The images were recorded while driving in and around the city of Karlsruhe, in sunny and cloudy weather, at daytime. The images were taken at a resolution of $1240 \times 376$. A rotating laser scanner mounted behind the left camera recorded ground truth depth, labeling around 30 % of the image pixels.

The ground truth disparities for the test set are withheld and an online leaderboard is provided where researchers can evaluate their method on the test set. Submissions are allowed once every three days. Error is measured as the percentage of pixels where the true disparity and the predicted disparity differ by more than three pixels. Translated into distance, this means that, for example, the error tolerance is 3 centimeters for objects 2 meters from the camera and 80 centimeters for objects 10 meters from the camera.

Two KITTI stereo data sets exist: KITTI 2012 (scoreboard: http://www.cvlibs.net/datasets/kitti/eval_stereo_flow.php?benchmark=stereo) and, the newer, KITTI 2015 (scoreboard: http://www.cvlibs.net/datasets/kitti/eval_scene_flow.php?benchmark=stereo). For the task of computing stereo they are nearly identical, with the newer data set improving some aspects of the optical flow task. The 2012 data set contains 194 training and 195 testing images, while the 2015 data set contains 200 training and 200 testing images. There is a subtle but important difference introduced in the newer data set: vehicles in motion are densely labeled and car glass is included in the evaluation. This emphasizes the method’s performance on reflective surfaces.

The best performing methods on the KITTI 2012 data set are listed in Table 1. Our accurate architecture ranks first with an error rate of 2.43 %. Third place on the leaderboard is held by our previous work (Žbontar and LeCun, 2015) with an error rate of 2.61 %. The two changes that reduced the error from 2.61 % to 2.43 % were augmenting the data set (see Section 5.4) and doubling the number of convolution layers while reducing the kernel size from $5 \times 5$ to $3 \times 3$. The method in second place (Güney and Geiger, 2015) uses the matching cost computed by our previous work (Žbontar and LeCun, 2015). The test error rate of the fast architecture is 2.82 %, which would be enough for fifth place had the method been allowed to appear in the public leaderboard. The running time for processing a single image pair is 67 seconds for the accurate architecture and 0.8 seconds for the fast architecture. Figure 5 contains a pair of examples from the KITTI 2012 data set, together with the predictions of our method.

Table 2 presents the frontrunners on the KITTI 2015 data set. The error rates of our methods are 3.89 % for the accurate architecture and 4.46 % for the fast architecture, occupying first and second place on the leaderboard. Since one submission per paper is allowed, only the result of the accurate architecture appears on the public leaderboard. See Figure 6 for the disparity maps produced by our method on the KITTI 2015 data set.

The image pairs of the Middlebury stereo data set are indoor scenes taken under controlled lighting conditions. Structured light was used to measure the true disparities with higher density and precision than in the KITTI data set. The data sets were published in five separate works in the years 2001, 2003, 2005, 2006, and 2014 (Scharstein and Szeliski, 2002, 2003; Scharstein and Pal, 2007; Hirschmüller and Scharstein, 2007; Scharstein et al., 2014). In this paper, we refer to the Middlebury data set as the concatenation of all five data sets; a summary of each is presented in Table 3.

Each scene in the 2005, 2006, and 2014 data sets was taken under a number of lighting conditions and shutter exposures, with a typical image pair taken under four lighting conditions and seven exposure settings for a total of 28 images of the same scene.

An online leaderboard (the Middlebury scoreboard: http://vision.middlebury.edu/stereo/eval3/), similar to the one provided by KITTI, displays a ranked list of all submitted methods. Participants have only one opportunity to submit their results on the test set to the public leaderboard. This rule is stricter than the one on the KITTI data set, where submissions are allowed every three days. The test set contains 15 images borrowed from the 2005 and 2014 data sets.

The data set is provided in full, half, and quarter resolution. The error is computed at full resolution; if the method outputs half or quarter resolution disparity maps, they are upsampled before the error is computed. We chose to run our method on half resolution images because of the limited size of the graphics card’s memory available.

Rectifying a pair of images using standard calibration procedures, like the ones present in the OpenCV library, results in vertical disparity errors of up to nine pixels on the Middlebury data set (Scharstein et al., 2014). Each stereo pair in the 2014 data set is rectified twice: once using a standard, imperfect approach, and once using precise 2D correspondences for perfect rectification (Scharstein et al., 2014). We train the network on imperfectly rectified image pairs, since only two of the fifteen test images (Australia and Crusade) are rectified perfectly.

The error is measured as the percentage of pixels where the true disparity and the predicted disparity differ by more than two pixels; this corresponds to an error tolerance of one pixel at half resolution. The error on the evaluation server is, by default, computed only on non-occluded pixels. The final error reported online is the weighted average over the fifteen test images, with weights set by the authors of the data set.

Table 4 contains a snapshot of the third, and newest, version of the Middlebury leaderboard. Our method ranks first with an error rate of 8.29 % and a substantial lead over the second placed MeshStereo method, whose error rate is 13.4 %. See Figure 7 for disparity maps produced by our method on one image pair from the Middlebury data set.

We construct a binary classification data set from all available image pairs in the training set. The data set contains 25 million examples on the KITTI 2012, 17 million examples on the KITTI 2015, and 38 million examples on the Middlebury data set.

At training time, the input to the network was a batch of 128 pairs of image patches. At test time, the input was the entire left and right image. We could have used entire images during training as well, as it would allow us to implement the speed optimizations described in Section 3.3. There were several reasons why we preferred to train on image patches: it was easier to control the batch size, the examples could be shuffled so that one batch contained patches from several different images, and it was easier to maintain the same number of positive and negative examples within a batch.

We minimized the loss using mini-batch gradient descent with the momentum term set to 0.9. We trained for 14 epochs with the learning rate initially set to 0.003 for the accurate architecture and 0.002 for the fast architecture. The learning rate was decreased by a factor of 10 on the 11th epoch. The number of epochs, the initial learning rate, and the learning rate decrease schedule were treated as hyperparameters and were optimized with cross-validation. Each image was preprocessed by subtracting the mean and dividing by the standard deviation of its pixel intensity values. The left and right image of a stereo pair were preprocessed separately. Our initial experiments suggested that using color information does not improve the quality of the disparity maps; therefore, we converted all color images to grayscale.

The post-processing steps of the stereo method were implemented in CUDA (Nickolls et al., 2008), the network training was done with the Torch environment (Collobert et al., 2011) using the convolution routines from the cuDNN library (Chetlur et al., 2014). The OpenCV library (Bradski, 2000) was used for the affine transformation in the data augmentation step.

The hyperparameters were optimized with manual search and simple scripts that helped automate the process. The hyperparameters we selected are shown in Table 5.

Augmenting the data set by repeatedly transforming the training examples is a commonly employed technique to reduce the network’s generalization error. The transformations are applied at training time and do not affect the runtime performance. We randomly rotate, scale and shear the training patches; we also change their brightness and contrast. Since the transformations are applied to patches after they have been extracted from the images, the data augmentation step does not alter the ground truth disparity map or ruin the rectification.

The parameters of the transformation are chosen randomly for each pair of patches, and after one epoch of training, when the same example is being presented to the network for the second time, new random parameters are selected. We choose slightly different transformation parameters for the left and right image; for example, we would rotate the left patch by 10 degrees and the right by 14. Different data sets benefited from different types of transformations and, in some cases, using the wrong transformations increased the error.

On the Middlebury data set we took advantage of the fact that the images were taken under different lighting conditions and different shutter exposures by training on all available images. The same data set augmentation parameters were used for the KITTI 2012 and KITTI 2015 data sets.

The Middlebury test data set contains two images worth mentioning: Classroom, where the right image is underexposed and, therefore, darker than the left; and Djembe, where the left and right images were taken under different light conditions. To handle these two cases we train, 20 % of the time, on images where either the shutter exposure or the arrangements of lights are different for the left and right image.

We combat imperfect rectification on the Middlebury data set by including a small vertical disparity between the left and right image patches.

Before describing the steps of data augmentation, let us introduce some notation: in the following, a word in typewriter font denotes the name of a hyperparameter defining a set, while the same word in italics denotes a number drawn randomly from that set. For example, \texttt{rotate} is a hyperparameter defining the set of possible rotations and \textit{rotate} is a number drawn randomly from that set. The steps of data augmentation are presented in the following list:

Rotate the left patch by \textit{rotate} degrees and the right patch by $\textit{rotate} + \textit{rotate\_diff}$ degrees.

Scale the left patch in the horizontal direction by \textit{horizontal\_scale} and the right patch by $\textit{horizontal\_scale} \cdot \textit{horizontal\_scale\_diff}$.

Translate the right patch in the vertical direction by vertical_disparity.

with addition and multiplication carried out element-wise where appropriate.
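The parameter drawing for the steps listed above can be sketched as follows; the ranges are illustrative placeholders, not the tuned values reported in Table 6:

```python
import random

def draw_augmentation_params(rotate=7.0, rotate_diff=3.0,
                             horizontal_scale=(0.8, 1.0),
                             horizontal_scale_diff=(0.9, 1.0),
                             vertical_disparity=0.5):
    """Draw one set of random transformation parameters for a training
    pair; the left and right patch get slightly different values."""
    rot = random.uniform(-rotate, rotate)
    hs = random.uniform(*horizontal_scale)
    return {
        "rotate_left": rot,
        "rotate_right": rot + random.uniform(-rotate_diff, rotate_diff),
        "hscale_left": hs,
        "hscale_right": hs * random.uniform(*horizontal_scale_diff),
        "vertical_disparity": random.uniform(-vertical_disparity,
                                             vertical_disparity),
    }
```

New parameters would be drawn each time the same example is presented, matching the description above.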

Table 6 contains the hyperparameters used and measures how each data augmentation step affected the validation error.

We measure the runtime of our implementation on a computer with a NVIDIA Titan X graphics processor unit. Table 7 contains the runtime measurements across a range of hyperparameter settings for three data sets: KITTI, Middlebury half resolution, and a new, fictitious data set, called Tiny, which we use to demonstrate the performance of our method on the kind of images typically used for autonomous driving or robotics. The sizes of images we measured the runtime on were: $1242 \times 350$ with 228 disparity levels for the KITTI data set, $1500 \times 1000$ with 200 disparity levels for the Middlebury data set, and $320 \times 240$ with 32 disparity levels for the Tiny data set.

Table 7 reveals that the fast architecture is up to 90 times faster than the accurate architecture. Furthermore, the running times of the fast architecture are 0.78 seconds on KITTI, 2.03 seconds on Middlebury, and 0.06 seconds on the Tiny data set. We can also see that the fully-connected layers are responsible for most of the runtime in the accurate architecture, as the hyperparameters controlling the number of convolutional layers and the number of feature maps have only a small effect on the runtime.

Training times depended on the size of the data set and the architecture, but never exceeded two days.

We argue that the low error rate of our method is due to the convolutional neural network and not a superior stereo method. We verify this claim by replacing the convolutional neural network with three standard approaches for computing the matching cost:

The sum of absolute differences computes the matching cost according to Equation (1), that is, the matching cost between two image patches is computed by summing the absolute differences in image intensities between corresponding locations. We used $9 \times 9$ patches.

The census transform (Zabih and Woodfill, 1994) represents each image position as a bit vector. The size of this vector is a hyperparameter whose value, after examining several, we set to 81. The vector is computed by cropping a $9 \times 9$ image patch centered around the position of interest and comparing the intensity values of each pixel in the patch to the intensity value of the pixel in the center. When the center pixel is brighter the corresponding bit is set. The matching cost is computed as the Hamming distance between two census transformed vectors.
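A sketch of this census matching cost with a $9 \times 9$ window and 81-bit vectors (function names are ours):

```python
import numpy as np

def census_vector(image, p, window=4):
    """81-bit census vector: a bit is set where the center pixel is
    brighter than the corresponding pixel in the 9x9 patch."""
    r, c = p
    patch = image[r - window:r + window + 1, c - window:c + window + 1]
    return (image[r, c] > patch).ravel()

def census_cost(left, right, p, d, window=4):
    """Matching cost: Hamming distance between the two census vectors."""
    r, c = p
    a = census_vector(left, p, window)
    b = census_vector(right, (r, c - d), window)
    return int(np.count_nonzero(a != b))
```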

Normalized cross-correlation is a window-based method defined with the following equation:

$$C_{\text{NCC}}(\mathbf{p}, d) = \frac{\sum_{\mathbf{q} \in \mathcal{N}_{\mathbf{p}}} I^L(\mathbf{q})\, I^R(\mathbf{q} - \mathbf{d})}{\sqrt{\sum_{\mathbf{q} \in \mathcal{N}_{\mathbf{p}}} I^L(\mathbf{q})^2 \sum_{\mathbf{q} \in \mathcal{N}_{\mathbf{p}}} I^R(\mathbf{q} - \mathbf{d})^2}}.$$
The normalized cross-correlation matching cost computes the cosine similarity between the left and right image patch, when the left and right image patches are viewed as vectors instead of matrices. This is the same function that is computed in the last two layers of the fast architecture (normalization and dot product). The neighbourhood $\mathcal{N}_{\mathbf{p}}$ was set to a square $11 \times 11$ window around $\mathbf{p}$.

The “sad”, “cens”, and “ncc” columns of Table 8 contain the results of the sum of absolute differences, the census transform, and normalized cross-correlation on the KITTI 2012, KITTI 2015, and Middlebury data sets. The validation errors in the last rows of Table 8 should be used to compare the five methods. On all three data sets the accurate architecture performs best, followed by the fast architecture, which in turn is followed by the census transform. These are the three best performing methods on all three data sets. Their error rates are 2.61 %, 3.02 %, and 4.90 % on KITTI 2012; 3.25 %, 3.99 %, and 5.03 % on KITTI 2015; and 7.91 %, 9.87 %, and 16.72 % on Middlebury. The sum of absolute differences and the normalized cross-correlation matching costs produce disparity maps with larger errors. For a visual comparison of our method and the census transform see Figures 5, 6, and 7.

The stereo method includes a number of post-processing steps: cross-based cost aggregation, semiglobal matching, interpolation, subpixel enhancement, a median filter, and a bilateral filter. We ran a set of experiments in which we excluded each of these steps in turn and recorded the validation error (see Table 8).

The last two rows of Table 8 demonstrate the importance of the post-processing steps of the stereo method. If all post-processing steps are removed, the validation error of the accurate architecture increases from 2.61 % to 13.49 % on KITTI 2012, from 3.25 % to 13.38 % on KITTI 2015, and from 7.91 % to 28.33 % on Middlebury.

Of all the post-processing steps of the stereo method, semiglobal matching has the strongest effect on the validation error. Removing it increases the validation error from 2.61 % to 4.26 % on KITTI 2012, from 3.25 % to 4.51 % on KITTI 2015, and from 7.91 % to 11.99 % on Middlebury.
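To make the role of semiglobal matching concrete, here is a minimal single-direction scanline pass in NumPy. This is a sketch with fixed penalties `P1` and `P2`, not the authors' implementation, which aggregates over four directions and adapts the penalties to intensity differences:

```python
import numpy as np

def sgm_scanline(cost, P1, P2):
    """One left-to-right pass of the semiglobal matching recurrence.
    cost: (W, D) array of matching costs along a scanline.
    A disparity change of 1 pays P1, larger jumps pay P2."""
    W, D = cost.shape
    agg = cost.copy()
    for x in range(1, W):
        prev = agg[x - 1]
        best_prev = prev.min()
        for d in range(D):
            candidates = [prev[d], best_prev + P2]
            if d > 0:
                candidates.append(prev[d - 1] + P1)
            if d < D - 1:
                candidates.append(prev[d + 1] + P1)
            # subtracting best_prev keeps the aggregated cost bounded
            agg[x, d] = cost[x, d] + min(candidates) - best_prev
    return agg
```

The penalties discourage the disparity map from changing abruptly between neighbouring pixels, which is what suppresses the noisy predictions seen when this step is removed.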

On the Middlebury data set we did not use the left-right consistency check to eliminate errors in occluded regions: with the accurate architecture, it increased the error rate from 7.91 % to 8.22 %, which is why we removed it.
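The check itself is simple to state in code; a minimal dense sketch (hypothetical helper, integer disparities assumed):

```python
import numpy as np

def left_right_consistent(disp_left, disp_right, tol=1):
    """True where the left disparity agrees (within tol) with the right
    disparity map sampled at the matching position x - d."""
    h, w = disp_left.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x_right = np.clip(xs - disp_left.astype(int), 0, w - 1)
    return np.abs(disp_left - disp_right[ys, x_right]) <= tol
```

Positions that fail the check are typically occluded in one view and are filled in by interpolation from consistent neighbours.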

We used a supervised learning approach to measure the similarity between image patches. It is, therefore, natural to ask how the size of the data set affects the quality of the disparity maps. To answer this question, we retrained our networks on smaller training sets obtained by selecting a random subset of examples (see Table 9).

We observe that the validation error decreases as we increase the number of training examples. These experiments suggest a simple strategy for improving the results of our stereo method: collect a larger data set.

Up to this point the training and validation sets were created from the same stereo data set, either KITTI 2012, KITTI 2015, or Middlebury. To evaluate the performance of our method in the transfer learning setting, we run experiments where the validation error is computed on a different data set than the one used for training. For example, we would use the Middlebury data set to train the matching cost neural network and evaluate its performance on the KITTI 2012 data set. These experiments give us some idea of the expected performance in a real-world application, where it isn’t possible to train a specialized network because no ground truth is available. The results of these experiments are shown in Table 10.

Some results in Table 10 were unexpected. For example, the validation error on KITTI 2012 is lower when using the Middlebury training set than when using the KITTI 2015 training set, even though KITTI 2012 is obviously more similar to KITTI 2015 than to Middlebury. Furthermore, when training on KITTI 2015, the validation error on KITTI 2012 is lower with the fast architecture than with the accurate architecture.

The matching cost neural network trained on the Middlebury data set transfers well to the KITTI data sets. Its validation error is similar to the validation errors obtained by networks trained on the KITTI data sets.

Searching for a good set of hyperparameters is a daunting task: the search space grows exponentially with the number of hyperparameters, and there is no gradient to guide us. To better understand the effect of each hyperparameter on the validation error, we conducted a series of experiments in which we varied the value of one hyperparameter while keeping the others fixed to their default values. The results are shown in Table 11 and can be summarized by observing that increasing the size of the network improves the generalization performance, but only up to a point, after which, presumably because of the limited size of the data set, the generalization performance starts to decrease.

Note that the num_conv_layers hyperparameter implicitly controls the size of the image patches. For example, a network with one convolutional layer with 3 × 3 kernels compares image patches of size 3 × 3, while a network with five convolutional layers compares patches of size 11 × 11.
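This relationship is just the receptive-field arithmetic of stacked stride-1 convolutions; a hypothetical helper makes it explicit:

```python
def compared_patch_size(num_conv_layers, kernel_size=3):
    """Receptive field of a stack of stride-1, unpadded convolutions:
    each layer widens the field by kernel_size - 1 pixels."""
    return 1 + num_conv_layers * (kernel_size - 1)
```

With 3 × 3 kernels this gives 3 for one layer and 11 for five layers, matching the patch sizes quoted above.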

The source code of our implementation is available at https://github.com/jzbontar/mc-cnn. The online repository contains procedures for computing the disparity map, training the network, as well as the post-processing steps of the stereo method.

The accurate architecture produces disparity maps with lower error rates than any previously published method on the KITTI 2012, KITTI 2015, and Middlebury data sets. The fast architecture computes the disparity maps up to 90 times faster than the accurate architecture with only a small increase in error. These results suggest that convolutional neural networks are well suited for computing the stereo matching cost even for applications that require real-time performance.

The fact that a relatively simple convolutional neural network outperformed all previous methods on the well-studied problem of stereo is a rather important demonstration of the power of modern machine learning approaches.

Table: S5.T1: The highest ranking methods on the KITTI 2012 data set as of October 2015. The “Setting” column provides insight into how the disparity map is computed: “F” indicates the use of optical flow, “MV” indicates more than two temporally adjacent images, and “MS” indicates the use of epipolar geometry for computing the optical flow. The “Error” column reports the percentage of misclassified pixels and the “Runtime” column measures the time, in seconds, required to process one pair of images.

Rank | Method | Setting | Error (%) | Runtime (s)
1 | MC-CNN-acrt (Accurate architecture) | | 2.43 | 67
2 | Displets (Güney and Geiger, 2015) | | 2.47 | 265
3 | MC-CNN (Žbontar and LeCun, 2015) | | 2.61 | 100
4 | PRSM (Vogel et al., 2015) | F, MV | 2.78 | 300
| MC-CNN-fst (Fast architecture) | | 2.82 | 0.8
5 | SPS-StFl (Yamaguchi et al., 2014) | F, MS | 2.83 | 35
6 | VC-SF (Vogel et al., 2014) | F, MV | 3.05 | 300
7 | Deep Embed (Chen et al., 2015) | | 3.10 | 3
8 | JSOSM (unpublished work) | | 3.15 | 105
9 | OSF (Menze and Geiger, 2015) | F | 3.28 | 3000
10 | CoR (Chakrabarti et al., 2015) | | 3.30 | 6

Table: S5.T2: The leading submissions on the KITTI 2015 leaderboard as of October 2015. The “Setting”, “Error”, and “Runtime” columns have the same meaning as in Table 1.

Rank | Method | Setting | Error (%) | Runtime (s)
1 | MC-CNN-acrt (Accurate architecture) | | 3.89 | 67
| MC-CNN-fst (Fast architecture) | | 4.62 | 0.8
2 | SPS-St (Yamaguchi et al., 2014) | | 5.31 | 2
3 | OSF (Menze and Geiger, 2015) | F | 5.79 | 3000
4 | PR-Sceneflow (Vogel et al., 2013) | F | 6.24 | 150
5 | SGM+C+NL (Hirschmüller, 2008; Sun et al., 2014) | F | 6.84 | 270
6 | SGM+LDOF (Hirschmüller, 2008; Brox and Malik, 2011) | F | 6.84 | 86
7 | SGM+SF (Hirschmüller, 2008; Hornacek et al., 2014) | F | 6.84 | 2700
8 | ELAS (Geiger et al., 2011) | | 9.72 | 0.3
9 | OCV-SGBM (Hirschmüller, 2008) | | 10.86 | 1.1
10 | SDM (Kostková and Sára, 2003) | | 11.96 | 60

Table: S5.T3: A summary of the five Middlebury stereo data sets. The column “Number of Image Pairs” counts only the image pairs for which ground truth is available. The 2005 and 2014 data sets additionally contain a number of image pairs with ground truth disparities withheld; these image pairs constitute the test set.

Year | Number of Image Pairs | Resolution | Maximum Disparity
2001 | 8 | 380 × 430 | 30
2003 | 2 | 1800 × 1500 | 220
2005 | 6 | 1400 × 1100 | 230
2006 | 21 | 1400 × 1100 | 230
2014 | 23 | 3000 × 2000 | 800

Table: S5.T4: The top ten methods on the Middlebury stereo data set as of October 2015. The “Error” column is the weighted average error after upsampling to full resolution and “Runtime” is the time, in seconds, required to process one pair of images.

Rank | Method | Resolution | Error (%) | Runtime (s)
1 | MC-CNN-acrt (Accurate architecture) | Half | 8.29 | 150
2 | MeshStereo (Zhang et al., 2015) | Half | 13.4 | 65.3
3 | LCU (unpublished work) | Quarter | 17.0 | 6567
4 | TMAP (Psota et al., 2015) | Half | 17.1 | 2435
5 | IDR (Kowalczuk et al., 2013) | Half | 18.4 | 0.49
6 | SGM (Hirschmüller, 2008) | Half | 18.7 | 9.90
7 | LPS (Sinha et al., 2014) | Half | 19.4 | 9.52
8 | LPS (Sinha et al., 2014) | Full | 20.3 | 25.8
9 | SGM (Hirschmüller, 2008) | Quarter | 21.2 | 1.48
10 | SNCC (Einecke and Eggert, 2010) | Half | 22.2 | 1.38

Table: S5.T5: The hyperparameter values we used for the fast and accurate architectures (abbreviated “fst” and “acrt”). Note that hyperparameters concerning image intensity values (cbca_intensity and sgm_D) apply to the preprocessed images and not to raw images with intensity values in the range from 0 to 255.

Hyperparameter | KITTI 2012 fst | KITTI 2012 acrt | KITTI 2015 fst | KITTI 2015 acrt | Middlebury fst | Middlebury acrt
input_patch_size | 9 × 9 | 9 × 9 | 9 × 9 | 9 × 9 | 11 × 11 | 11 × 11
num_conv_layers | 4 | 4 | 4 | 4 | 5 | 5
num_conv_feature_maps | 64 | 112 | 64 | 112 | 64 | 112
conv_kernel_size | 3 | 3 | 3 | 3 | 3 | 3
num_fc_layers | – | 4 | – | 4 | – | 3
num_fc_units | – | 384 | – | 384 | – | 384
dataset_neg_low | 4 | 4 | 4 | 4 | 1.5 | 1.5
dataset_neg_high | 10 | 10 | 10 | 10 | 6 | 18
dataset_pos | 1 | 1 | 1 | 1 | 0.5 | 0.5
cbca_intensity | 0.13 | 0.13 | 0.03 | 0.03 | 0.02 | 0.02
cbca_distance | 5 | 5 | 5 | 5 | 14 | 14
cbca_num_iterations_1 | 2 | 2 | 2 | 2 | 2 | 2
cbca_num_iterations_2 | 0 | 0 | 4 | 4 | 16 | 16
sgm_P1 | 4 | 1.32 | 2.3 | 2.3 | 2.3 | 1.3
sgm_P2 | 223 | 32 | 42.3 | 55.8 | 55.9 | 18.1
sgm_Q1 | 3 | 3 | 3 | 3 | 4 | 4.5
sgm_Q2 | 7.5 | 6 | 6 | 6 | 8 | 9
sgm_V | 1.5 | 2 | 1.25 | 1.75 | 1.5 | 2.75
sgm_D | 0.02 | 0.08 | 0.08 | 0.08 | 0.08 | 0.13
blur_sigma | 7.74 | 6 | 4.64 | 6 | 6 | 1.7
blur_threshold | 5 | 6 | 5 | 5 | 2 | 2

The cbca_* values are shared by the fast and accurate architectures and are repeated in both columns.

Table: S5.T6: The hyperparameters governing data augmentation and how they affect the validation error. The “Error” column reports the validation error when a particular data augmentation step is not used. The last two rows report validation errors with and without data augmentation. For example, the validation error on the KITTI 2012 is 2.73 % if no data augmentation is used, 2.65 % if all steps except rotation are used, and 2.61 % if all data augmentation steps are used.

Hyperparameter | KITTI 2012 Range | KITTI 2012 Error | Middlebury Range | Middlebury Error
rotate | [−7, 7] | 2.65 | [−28, 28] | 7.99
scale | – | – | [0.8, 1] | 8.17
horizontal_scale | [0.9, 1] | 2.62 | [0.8, 1] | 8.08
horizontal_shear | [0, 0.1] | 2.61 | [0, 0.1] | 7.91
brightness | [0, 0.7] | 2.61 | [0, 1.3] | 8.16
contrast | [1, 1.3] | 2.63 | [1, 1.1] | 7.95
vertical_disparity | – | – | [0, 1] | 8.05
rotate_diff | – | – | [−3, 3] | 8.00
horizontal_scale_diff | – | – | [0.9, 1] | 7.97
horizontal_shear_diff | – | – | [0, 0.3] | 8.05
brightness_diff | [0, 0.3] | 2.63 | [0, 0.7] | 7.92
contrast_diff | – | – | [1, 1.1] | 8.01
No data set augmentation | | 2.73 | | 8.75
Full data set augmentation | | 2.61 | | 7.91

Table: S5.T7: The time, in seconds, required to compute the matching cost, that is, the time spent in the convolutional neural network without any post-processing steps. The time includes computing the matching cost twice: once with the left image as the reference image and once with the right image as the reference image. We measure the runtime as a function of four hyperparameters controlling the network architecture; for example, the first six rows contain the runtime as the number of convolutional layers in the network increases from one to six. The last row of the table contains the running time for the entire method, including the post-processing steps. As before, we abbreviate the fast and accurate architectures as “fst” and “acrt”.

Hyperparameter | Value | KITTI fst | KITTI acrt | Middlebury fst | Middlebury acrt | Tiny fst | Tiny acrt
num_conv_layers | 1 | 0.23 | 66.1 | 0.70 | 74.9 | 0.01 | 1.7
num_conv_layers | 2 | 0.26 | 66.2 | 0.82 | 75.2 | 0.01 | 1.8
num_conv_layers | 3 | 0.30 | 66.3 | 0.97 | 75.4 | 0.02 | 1.8
num_conv_layers | 4 | 0.34 | 66.4 | 1.11 | 75.6 | 0.03 | 1.8
num_conv_layers | 5 | 0.38 | 66.5 | 1.24 | 75.7 | 0.03 | 1.9
num_conv_layers | 6 | 0.42 | 66.7 | 1.37 | 76.0 | 0.04 | 1.9
num_conv_feature_maps | 16 | 0.09 | 59.4 | 0.27 | 64.8 | 0.01 | 1.6
num_conv_feature_maps | 32 | 0.15 | 60.4 | 0.51 | 66.2 | 0.01 | 1.6
num_conv_feature_maps | 48 | 0.25 | 61.5 | 0.94 | 68.2 | 0.02 | 1.7
num_conv_feature_maps | 64 | 0.34 | 62.7 | 1.24 | 70.0 | 0.03 | 1.7
num_conv_feature_maps | 80 | 0.44 | 64.0 | 1.63 | 72.0 | 0.04 | 1.8
num_conv_feature_maps | 96 | 0.53 | 65.3 | 1.93 | 73.9 | 0.04 | 1.8
num_conv_feature_maps | 112 | 0.61 | 66.4 | 2.28 | 75.7 | 0.05 | 1.8
num_conv_feature_maps | 128 | 0.71 | 67.7 | 2.61 | 77.8 | 0.06 | 1.9
num_fc_layers | 1 | – | 16.3 | – | 25.3 | – | 0.5
num_fc_layers | 2 | – | 32.9 | – | 50.7 | – | 0.9
num_fc_layers | 3 | – | 49.6 | – | 75.7 | – | 1.4
num_fc_layers | 4 | – | 66.4 | – | 101.2 | – | 1.8
num_fc_layers | 5 | – | 82.9 | – | 126.4 | – | 2.3
num_fc_units | 128 | – | 17.4 | – | 21.4 | – | 0.6
num_fc_units | 256 | – | 38.5 | – | 44.9 | – | 1.1
num_fc_units | 384 | – | 66.4 | – | 75.7 | – | 1.8
num_fc_units | 512 | – | 101.0 | – | 113.3 | – | 2.7
No stereo method | | 0.34 | 66.4 | 1.24 | 75.7 | 0.03 | 1.8
Full stereo method | | 0.78 | 67.1 | 2.03 | 84.8 | 0.06 | 1.9

Table: S5.T8: The numbers measure validation error when a particular post-processing step is excluded from the stereo method. The last two rows of the tables should be interpreted differently: they contain the validation error of the raw convolutional neural network and the validation error after the complete stereo method. For example, if we exclude semiglobal matching, the fast architecture achieves an error rate of 8.78 % on the KITTI 2012 data set and an error rate of 3.02 % after applying the full stereo method. We abbreviate the method names as “fst” for the fast architecture, “acrt” for the accurate architecture, “sad” for the sum of absolute differences, “cens” for the census transform, and “ncc” for the normalized cross-correlation matching cost.

KITTI 2012
Excluded step | fst | acrt | sad | cens | ncc
Cross-based cost aggregation | 3.02 | 2.73 | 8.22 | 5.21 | 8.93
Semiglobal matching | 8.78 | 4.26 | 19.58 | 8.84 | 10.72
Interpolation | 3.48 | 2.96 | 9.21 | 5.96 | 11.16
Subpixel enhancement | 3.03 | 2.65 | 8.16 | 4.95 | 8.93
Median filter | 3.03 | 2.63 | 8.16 | 4.92 | 9.00
Bilateral filter | 3.26 | 2.79 | 8.75 | 5.70 | 9.76
No stereo method | 15.70 | 13.49 | 32.30 | 53.55 | 22.21
Full stereo method | 3.02 | 2.61 | 8.16 | 4.90 | 8.93

Table: S5.T9: The validation error as a function of training set size.

Data Set Size (%) | KITTI 2012 fst | KITTI 2012 acrt | KITTI 2015 fst | KITTI 2015 acrt | Middlebury fst | Middlebury acrt
20 | 3.17 | 2.84 | 4.13 | 3.53 | 11.14 | 9.73
40 | 3.11 | 2.75 | 4.10 | 3.40 | 10.35 | 8.71
60 | 3.09 | 2.67 | 4.05 | 3.34 | 10.14 | 8.36
80 | 3.05 | 2.65 | 4.02 | 3.29 | 10.09 | 8.21
100 | 3.02 | 2.61 | 3.99 | 3.25 | 9.87 | 7.91

Table: S5.T10: The validation error when the training and test sets differ. For example, the validation error is 3.16 % when the Middlebury data set is used for training the fast architecture and the trained network is tested on the KITTI 2012 data set.

Training Set | KITTI 2012 fst | KITTI 2012 acrt | KITTI 2015 fst | KITTI 2015 acrt | Middlebury fst | Middlebury acrt
KITTI 2012 | 3.02 | 2.61 | 4.12 | 3.99 | 12.78 | 11.09
KITTI 2015 | 3.60 | 4.28 | 3.99 | 3.25 | 13.70 | 14.19
Middlebury | 3.16 | 3.07 | 4.48 | 4.49 | 9.87 | 7.91

The columns give the test set; the rows give the training set.

Figure: The input is a pair of images from the left and right camera. The two input images differ mostly in horizontal locations of objects (other differences are caused by reflections, occlusions, and perspective distortions). Note that objects closer to the camera have larger disparities than objects farther away. The output is a dense disparity map shown on the right, with warmer colors representing larger values of disparity (and smaller values of depth).
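The relation between disparity and depth mentioned in the caption is z = fB/d, where f is the focal length and B the camera baseline. A minimal sketch (hypothetical helper and camera parameters):

```python
import numpy as np

def disparity_to_depth(disparity, focal_length, baseline):
    """z = f * B / d; pixels with zero disparity map to infinite depth."""
    d = np.asarray(disparity, dtype=float)
    depth = np.full_like(d, np.inf)
    np.divide(focal_length * baseline, d, out=depth, where=d > 0)
    return depth
```

Because depth is inversely proportional to disparity, halving the disparity doubles the estimated depth.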

Figure: The fast architecture is a siamese network. The two sub-networks consist of a number of convolutional layers followed by rectified linear units (abbreviated “ReLU”). The similarity score is obtained by extracting a vector from each of the two input patches and computing the cosine similarity between them. In this diagram, as well as in our implementation, the cosine similarity computation is split in two steps: normalization and dot product. This reduces the running time because the normalization needs to be performed only once per position (see Section 3.3).
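The normalize-once trick can be sketched in NumPy. Given hypothetical feature maps produced by the two sub-networks (this is a sketch, not the authors' Torch code), each feature vector is normalized a single time and every disparity is then scored with a plain dot product:

```python
import numpy as np

def cosine_cost_volume(feat_left, feat_right, max_disp, eps=1e-8):
    """feat_*: (H, W, C) feature maps. Returns (H, W, max_disp) similarity
    scores; normalization happens once per position, outside the disparity loop."""
    fl = feat_left / (np.linalg.norm(feat_left, axis=2, keepdims=True) + eps)
    fr = feat_right / (np.linalg.norm(feat_right, axis=2, keepdims=True) + eps)
    h, w, _ = fl.shape
    cost = np.zeros((h, w, max_disp))
    for d in range(max_disp):
        cost[:, d:, d] = np.sum(fl[:, d:] * fr[:, :w - d], axis=2)
    return cost
```

Without the split, each of the max_disp comparisons would redo both normalizations, which is exactly the redundant work the two-step formulation avoids.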

Figure: The accurate architecture begins with two convolutional feature extractors. The extracted feature vectors are concatenated and compared by a number of fully-connected layers. The inputs are two image patches and the output is a single real number between 0 and 1, which we interpret as a measure of similarity between the input images.

Figure: The support region for position $\mathbf{p}$ is the union of horizontal arms of all positions $\mathbf{q}$ on $\mathbf{p}$’s vertical arm.

Figure: Examples of predicted disparity maps on the KITTI 2012 data set. Note how some regions of the image (the white wall in the top example, and the asphalt in the bottom example) cause problems for the census transform. The fast and the accurate architecture perform better, with the accurate architecture making fewer mistakes on average.

Figure: Examples of predictions on the KITTI 2015 data set. Observe that vehicles in motion are labeled densely in the KITTI 2015 data set.

Figure: An example of a particularly difficult image pair from the Middlebury data set; the white wall in the background is practically textureless. The accurate architecture is able to classify most of it correctly. The fast architecture doesn’t do as well but still performs better than census.

$$ z=\frac{fB}{d}, $$ \tag{S1.Ex1}

$$ C_{\text{SAD}}(\mathbf{p},d)=\sum_{\mathbf{q}\in\mathcal{N}_{\mathbf{p}}}|I^{L}(\mathbf{q})-I^{R}(\mathbf{q}-\mathbf{d})|, $$ \tag{S3.E1}

$$ \mathbf{q}=(x-d+o_{\text{neg}},y), $$ \tag{S3.Ex2}

$$ E(D)=\sum_{\mathbf{p}}\biggl(C^{4}_{\text{CBCA}}(\mathbf{p},D(\mathbf{p}))+\sum_{\mathbf{q}\in\mathcal{N}_{\mathbf{p}}}P_{1}\cdot 1\{|D(\mathbf{p})-D(\mathbf{q})|=1\}+\sum_{\mathbf{q}\in\mathcal{N}_{\mathbf{p}}}P_{2}\cdot 1\{|D(\mathbf{p})-D(\mathbf{q})|>1\}\biggr), $$ \tag{S4.SS2.p1.3}

$$ C_{\mathbf{r}}(\mathbf{p},d)=C^{4}_{\text{CBCA}}(\mathbf{p},d)-\min_{k}C_{\mathbf{r}}(\mathbf{p}-\mathbf{r},k)+\min\Bigl\{C_{\mathbf{r}}(\mathbf{p}-\mathbf{r},d),\;C_{\mathbf{r}}(\mathbf{p}-\mathbf{r},d-1)+P_{1},\;C_{\mathbf{r}}(\mathbf{p}-\mathbf{r},d+1)+P_{1},\;\min_{k}C_{\mathbf{r}}(\mathbf{p}-\mathbf{r},k)+P_{2}\Bigr\}. $$ \tag{S4.SS2.p2.5}

$$ \begin{array}{lll}P_{1}=\texttt{sgm\_P1},&P_{2}=\texttt{sgm\_P2}&\text{if }D_{1}<\texttt{sgm\_D},\ D_{2}<\texttt{sgm\_D};\\ P_{1}=\texttt{sgm\_P1}/\texttt{sgm\_Q2},&P_{2}=\texttt{sgm\_P2}/\texttt{sgm\_Q2}&\text{if }D_{1}\geq\texttt{sgm\_D},\ D_{2}\geq\texttt{sgm\_D};\\ P_{1}=\texttt{sgm\_P1}/\texttt{sgm\_Q1},&P_{2}=\texttt{sgm\_P2}/\texttt{sgm\_Q1}&\text{otherwise.}\end{array} $$ \tag{S4.Ex8}

$$ D(\mathbf{p})=\operatorname*{arg\,min}_{d}C(\mathbf{p},d). $$ \tag{S4.Ex10}

$$ \begin{array}{ll}\textit{correct}&\text{if }|d-D^{R}(\mathbf{p}-\mathbf{d})|\leq 1\text{ for }d=D^{L}(\mathbf{p}),\\ \textit{mismatch}&\text{if }|d-D^{R}(\mathbf{p}-\mathbf{d})|\leq 1\text{ for any other }d,\\ \textit{occlusion}&\text{otherwise.}\end{array} $$ \tag{S4.Ex11}

$$ D_{\text{SE}}(\mathbf{p})=d-\frac{C_{+}-C_{-}}{2(C_{+}-2C+C_{-})}, $$ \tag{S4.Ex12}
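A minimal sketch of this parabola fit (hypothetical helper; `c_minus` and `c_plus` are the costs at disparities d − 1 and d + 1):

```python
def subpixel_disparity(d, c_minus, c, c_plus):
    """Refine integer disparity d to the vertex of the parabola passing
    through the costs at d - 1, d, and d + 1."""
    denom = 2.0 * (c_plus - 2.0 * c + c_minus)
    return d - (c_plus - c_minus) / denom
```

When the two neighbouring costs are equal, the refined disparity stays at d; otherwise it shifts toward the cheaper neighbour.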

$$ D_{\text{BF}}(\mathbf{p})=\frac{1}{W(\mathbf{p})}\sum_{\mathbf{q}\in\mathcal{N}_{\mathbf{p}}}D_{\text{SE}}(\mathbf{q})\cdot g(\|\mathbf{p}-\mathbf{q}\|)\cdot 1\{|I^{L}(\mathbf{p})-I^{L}(\mathbf{q})|<\texttt{blur\_threshold}\}, $$ \tag{S4.Ex13}


Table: S5.T8 (continued): Validation errors on the KITTI 2015 and Middlebury data sets when a particular post-processing step is excluded.

KITTI 2015
Excluded step | fst | acrt | sad | cens | ncc
Cross-based cost aggregation | 3.99 | 3.39 | 9.94 | 5.20 | 8.89
Semiglobal matching | 8.40 | 4.51 | 19.80 | 7.25 | 9.36
Interpolation | 4.47 | 3.33 | 10.39 | 5.83 | 10.98
Subpixel enhancement | 4.02 | 3.28 | 9.44 | 5.03 | 8.91
Median filter | 4.05 | 3.25 | 9.44 | 5.05 | 8.96
Bilateral filter | 4.20 | 3.43 | 9.95 | 5.84 | 9.77
No stereo method | 15.66 | 13.38 | 30.67 | 50.35 | 18.95
Full stereo method | 3.99 | 3.25 | 9.44 | 5.03 | 8.89

Middlebury
Excluded step | fst | acrt | sad | cens | ncc
Cross-based cost aggregation | 9.87 | 10.63 | 43.09 | 29.28 | 33.89
Semiglobal matching | 25.50 | 11.99 | 51.25 | 19.51 | 35.36
Interpolation | 9.87 | 7.91 | 41.86 | 16.72 | 33.89
Subpixel enhancement | 10.29 | 8.44 | 42.71 | 17.18 | 34.12
Median filter | 10.16 | 7.91 | 41.90 | 16.73 | 34.17
Bilateral filter | 10.39 | 7.96 | 41.97 | 16.96 | 34.43
No stereo method | 30.84 | 28.33 | 59.57 | 64.53 | 39.23
Full stereo method | 9.87 | 7.91 | 41.86 | 16.72 | 33.89
Table: S5.T11: The validation error as we vary one hyperparameter while keeping the others fixed at their default values.

Hyperparameter | Value | KITTI 2012 fst | KITTI 2012 acrt | KITTI 2015 fst | KITTI 2015 acrt | Middlebury fst | Middlebury acrt
num_conv_layers | 1 | 5.96 | 3.97 | 5.61 | 4.06 | 20.74 | 12.37
num_conv_layers | 2 | 3.52 | 2.98 | 4.19 | 3.45 | 12.11 | 9.20
num_conv_layers | 3 | 3.10 | 2.72 | 4.04 | 3.27 | 10.81 | 8.56
num_conv_layers | 4 | 3.02 | 2.61 | 3.99 | 3.25 | 10.26 | 8.21
num_conv_layers | 5 | 3.03 | 2.64 | 3.99 | 3.30 | 9.87 | 7.91
num_conv_layers | 6 | 3.05 | 2.70 | 4.01 | 3.38 | 9.71 | 8.11
num_conv_feature_maps | 16 | 3.33 | 2.84 | 4.32 | 3.51 | 11.79 | 10.06
num_conv_feature_maps | 32 | 3.15 | 2.68 | 4.12 | 3.35 | 10.48 | 8.67
num_conv_feature_maps | 48 | 3.07 | 2.66 | 4.06 | 3.32 | 10.20 | 8.47
num_conv_feature_maps | 64 | 3.02 | 2.64 | 3.99 | 3.30 | 9.87 | 8.12
num_conv_feature_maps | 80 | 3.02 | 2.64 | 3.99 | 3.29 | 9.81 | 7.95
num_conv_feature_maps | 96 | 2.99 | 2.68 | 3.97 | 3.27 | 9.62 | 8.03
num_conv_feature_maps | 112 | 2.98 | 2.61 | 3.96 | 3.25 | 9.59 | 7.91
num_conv_feature_maps | 128 | 2.97 | 2.63 | 3.95 | 3.23 | 9.45 | 7.92
num_fc_layers | 1 | – | 2.83 | – | 3.50 | – | 8.52
num_fc_layers | 2 | – | 2.70 | – | 3.31 | – | 8.33
num_fc_layers | 3 | – | 2.62 | – | 3.30 | – | 8.06
num_fc_layers | 4 | – | 2.61 | – | 3.25 | – | 8.00
num_fc_layers | 5 | – | 2.62 | – | 3.29 | – | 7.91
num_fc_units | 128 | – | 2.72 | – | 3.36 | – | 8.44
num_fc_units | 256 | – | 2.65 | – | 3.28 | – | 8.03
num_fc_units | 384 | – | 2.61 | – | 3.25 | – | 7.91
num_fc_units | 512 | – | 2.60 | – | 3.23 | – | 7.90
dataset_neg_low | 1.0 | 3.00 | 2.76 | 3.97 | 3.35 | 9.84 | 8.00
dataset_neg_low | 1.5 | 3.00 | 2.71 | 3.97 | 3.33 | 9.87 | 7.91
dataset_neg_low | 2.0 | 2.99 | 2.63 | 3.98 | 3.31 | 9.98 | 8.08
dataset_neg_low | 4.0 | 3.02 | 2.61 | 3.99 | 3.25 | 10.20 | 8.66
dataset_neg_low | 6.0 | 3.06 | 2.63 | 4.05 | 3.28 | 10.13 | 8.86
dataset_neg_high | 6 | 3.00 | 2.72 | 3.98 | 3.30 | 9.87 | 8.59
dataset_neg_high | 10 | 3.02 | 2.61 | 3.99 | 3.25 | 9.97 | 8.23
dataset_neg_high | 14 | 3.04 | 2.61 | 4.02 | 3.25 | 10.00 | 8.05
dataset_neg_high | 18 | 3.07 | 2.60 | 4.06 | 3.23 | 9.98 | 8.11
dataset_neg_high | 22 | 3.07 | 2.61 | 4.05 | 3.24 | 10.16 | 7.91
dataset_pos | 0.0 | 3.04 | 2.67 | 4.00 | 3.26 | 9.92 | 7.97
dataset_pos | 0.5 | 3.02 | 2.65 | 3.99 | 3.28 | 9.87 | 7.91
dataset_pos | 1.0 | 3.02 | 2.61 | 3.99 | 3.25 | 9.86 | 8.04
dataset_pos | 1.5 | 3.04 | 2.62 | 4.04 | 3.27 | 10.00 | 8.34
dataset_pos | 2.0 | 3.04 | 2.66 | 4.04 | 3.29 | 10.16 | 8.51


References

[opencv_library] G. Bradski. The OpenCV library. Dr. Dobb's Journal of Software Tools, 2000.

[bromley1993signature] Jane Bromley, James W. Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. Signature verification using a siamese time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7(04):669–688, 1993.

[brown2011discriminative] Matthew Brown, Gang Hua, and Simon Winder. Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):43–57, 2011.

[brox2011large] Thomas Brox and Jitendra Malik. Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3):500–513, 2011.

[chakrabarti2014low] Ayan Chakrabarti, Ying Xiong, Steven J. Gortler, and Todd Zickler. Low-level vision by consensus in a spatial hierarchy of regions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[deep_embed] Zhuoyuan Chen, Xun Sun, Yinan Yu, Liang Wang, and Chang Huang. A deep visual correspondence embedding model for stereo matching costs. IEEE International Conference on Computer Vision (ICCV), 2015.

[chetlur2014cudnn] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014. URL http://arxiv.org/abs/1410.0759.

[collobert2011torch7] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A MATLAB-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.

[einecke2010two] Nils Einecke and Julian Eggert. A two-stage correlation method for stereoscopic depth estimation. In Digital Image Computing: International Conference on Techniques and Applications (DICTA), pages 227–234, 2010.

[geiger2011efficient] Andreas Geiger, Martin Roser, and Raquel Urtasun. Efficient large-scale stereo matching. In Proceedings of the 10th Asian Conference on Computer Vision - Volume Part I, ACCV'10, pages 25–38. Springer-Verlag, Berlin, Heidelberg, 2011.

[Geiger2013IJRR] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: the KITTI dataset. International Journal of Robotics Research (IJRR), 2013.

[guney2015displets] Fatma Güney and Andreas Geiger. Displets: Resolving stereo ambiguities using object knowledge. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[haeusler2013ensemble] Ralf Haeusler, Rahul Nair, and Daniel Kondermann. Ensemble learning for confidence measures in stereo vision. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.

[han2015matchnet] Xufeng Han, Thomas Leung, Yangqing Jia, Rahul Sukthankar, and Alexander C. Berg. MatchNet: Unifying feature and metric learning for patch-based matching. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[hirschmuller2008stereo] Heiko Hirschmüller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):328–341, 2008.

[hirschmuller2007evaluation] Heiko Hirschmüller and Daniel Scharstein. Evaluation of cost functions for stereo matching. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.

[hirschmuller2009evaluation] Heiko Hirschmüller and Daniel Scharstein. Evaluation of stereo matching costs on images with radiometric differences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(9):1582–1599, 2009.

[hornacek2014sphereflow] Michael Hornacek, Andrew Fitzgibbon, and Carsten Rother. SphereFlow: 6 DoF scene flow from RGB-D pairs. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.

[kong2004method] Dan Kong and Hai Tao. A method for learning matching errors for stereo computation. British Machine Vision Conference (BMVC), 2004.

[kong2006stereo] Dan Kong and Hai Tao. Stereo matching via learning multiple experts behaviors. British Machine Vision Conference (BMVC), 2006.

[kostkova2003stratified] Jana Kostková and Radim Šára. Stratified dense matching for stereopsis in complex scenes. British Machine Vision Conference (BMVC), 2003.

[kowalczuk2013real] Jedrzej Kowalczuk, Eric T. Psota, and Lance C. Perez. Real-time stereo matching on CUDA using an iterative refinement method for adaptive support-weight correspondences. IEEE Transactions on Circuits and Systems for Video Technology, 23(1):94–104, 2013.

[lecun1998gradient] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[li2008learning] Yunpeng Li and Daniel P. Huttenlocher. Learning for stereo vision using the structured support vector machine. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2008.

[mei2011building] Xing Mei, Xun Sun, Mingcai Zhou, Haitao Wang, Xiaopeng Zhang, et al. On building an accurate stereo matching system on graphics hardware. IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 467–474, 2011.

[menze2015object] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[nickolls2008scalable] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40–53, 2008.

[paulin2015local] Mattis Paulin, Matthijs Douze, Zaid Harchaoui, Julien Mairal, Florent Perronin, and Cordelia Schmid. Local convolutional features with unsupervised training for image retrieval. In IEEE International Conference on Computer Vision (ICCV), pages 91–99, 2015.

[peris2012towards] Martin Peris, Atsuto Maki, Sara Martull, Yasuhiro Ohkawa, and Kazuhiro Fukui. Towards a simulation driven stereo vision system. In 21st International Conference on Pattern Recognition (ICPR), pages 1038–1042, 2012.

[psota2015map] Eric T. Psota, Jedrzej Kowalczuk, Mateusz Mittek, and Lance C. Perez. MAP disparity estimation using hidden Markov trees. IEEE International Conference on Computer Vision (ICCV), 2015.

[revaud2015deepmatching] Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. DeepMatching: Hierarchical deformable dense matching. ArXiv e-prints, 1(7):8, 2015.

[scharstein2007learning] Daniel Scharstein and Chris Pal. Learning conditional random fields for stereo. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2007.

[scharstein2002taxonomy] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1-3):7–42, 2002.

[scharstein2003high] Daniel Scharstein and Richard Szeliski. High-accuracy stereo depth maps using structured light. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2003.

[scharstein2014high] Daniel Scharstein, Heiko Hirschmüller, York Kitajima, Greg Krathwohl, Nera Nesić, Xi Wang, and Porter Westling. High-resolution stereo datasets with subpixel-accurate ground truth. German Conference on Pattern Recognition (GCPR), September 2014.

[simonyan2014learning] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Learning local feature descriptors using convex optimisation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1573–1585, 2014.

[sinha2014efficient] Sudipta N. Sinha, Daniel Scharstein, and Richard Szeliski. Efficient high-resolution stereo matching using local plane sweeps. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.

[spyropoulos2014learning] Aristotle Spyropoulos, Nikos Komodakis, and Philippos Mordohai. Learning to detect ground control points for improving the accuracy of stereo matching. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.

[sun2014quantitative] Deqing Sun, Stefan Roth, and Michael J. Black. A quantitative analysis of current practices in optical flow estimation and the principles behind them. International Journal of Computer Vision, 106(2):115–137, 2014.

[trzcinski2012learning] Tomasz Trzcinski, Mario Christoudias, Vincent Lepetit, and Pascal Fua. Learning image descriptors with the boosting-trick. In Advances in Neural Information Processing Systems, pages 269–277, 2012.

[vogel2013piecewise] Christoph Vogel, Konrad Schindler, and Stefan Roth. Piecewise rigid scene flow. IEEE International Conference on Computer Vision (ICCV), 2013.

[vogel2014view] Christoph Vogel, Stefan Roth, and Konrad Schindler. View-consistent 3D scene flow estimation over multiple frames. European Conference on Computer Vision (ECCV), September 2014.

[vogel20153d] Christoph Vogel, Konrad Schindler, and Stefan Roth. 3D scene flow estimation with a piecewise rigid scene model. International Journal of Computer Vision, pages 1–28, 2015.

[yamaguchi2014efficient] Koichiro Yamaguchi, David McAllester, and Raquel Urtasun. Efficient joint segmentation, occlusion labeling, stereo and flow estimation. European Conference on Computer Vision (ECCV), September 2014.

[zabih1994non] Ramin Zabih and John Woodfill. Non-parametric local transforms for computing visual correspondence. European Conference on Computer Vision (ECCV), 1994.

[zagoruyko2015learning] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[Zbontar_2015_CVPR] Jure Zbontar and Yann LeCun. Computing the stereo matching cost with a convolutional neural network. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[zhang2015meshstereo] Chi Zhang, Zhiwei Li, Yanhua Cheng, Rui Cai, Hongyang Chao, and Yong Rui. MeshStereo: A global stereo model with mesh alignment regularization for view interpolation. IEEE International Conference on Computer Vision (ICCV), 2015.

[zhang2009cross] Ke Zhang, Jiangbo Lu, and Gauthier Lafruit. Cross-based local stereo matching using orthogonal integral images. IEEE Transactions on Circuits and Systems for Video Technology, 19(7):1073–1079, 2009.

[zhang2007estimating] Li Zhang and Steven M. Seitz. Estimating optimal parameters for MRF stereo from a single image pair. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(2):331–342, 2007.
