Computing the Stereo Matching Cost with a Convolutional Neural Network
Jure Zbontar, University of Ljubljana, Yann LeCun, New York University
Abstract
We present a method for extracting depth information from a rectified image pair. We train a convolutional neural network to predict how well two image patches match and use it to compute the stereo matching cost. The cost is refined by cross-based cost aggregation and semiglobal matching, followed by a left-right consistency check to eliminate errors in the occluded regions. Our stereo method achieves an error rate of 2.61% on the KITTI stereo dataset and is currently (August 2014) the top performing method on this dataset.
Introduction
Consider the following problem: given two images taken from cameras at different horizontal positions, the goal is to compute the disparity d for each pixel in the left image. Disparity refers to the difference in horizontal location of an object in the left and right image: an object at position (x, y) in the left image will appear at position (x - d, y) in the right image. Knowing the disparity d of an object, we can compute its depth z (i.e. the distance from the object to the camera) by using the following relation:
$$ z = \frac{fB}{d}, $$
where f is the focal length of the camera and B is the distance between the camera centers.
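The relation above is easy to state in code. The sketch below is illustrative only: the focal length value is a placeholder, not a number taken from this paper (the baseline matches the KITTI setup described later).

```python
def depth_from_disparity(d, f=700.0, B=0.54):
    """Depth z = f * B / d for a rectified pair; d in pixels,
    f (focal length, pixels) is a placeholder, B (baseline, meters)."""
    if d <= 0:
        raise ValueError("disparity must be positive")
    return f * B / d

# A nearby object has a large disparity and therefore a small depth.
near = depth_from_disparity(100.0)   # 700 * 0.54 / 100 = 3.78 m
far = depth_from_disparity(10.0)     # 700 * 0.54 / 10 = 37.8 m
```

Note the inverse relation: halving the disparity doubles the depth, which is why small disparity errors translate into large depth errors for distant objects.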
The described problem is a subproblem of stereo reconstruction, where the goal is to extract 3D shape from one or more images. According to the taxonomy of Scharstein and Szeliski [14], a typical stereo algorithm consists of four steps: (1) matching cost computation, (2) cost aggregation, (3) optimization, and (4) disparity refinement. Following Hirschmuller and Scharstein [5], we refer to steps (1) and (2) as computing the matching cost and steps (3) and (4) as the stereo method.
We propose training a convolutional neural network [9] on pairs of small image patches where the true disparity is
known ( e.g . obtained by LIDAR). The output of the network is used to initialize the matching cost between a pair of patches. Matching costs are combined between neighboring pixels with similar image intensities using cross-based cost aggregation. Smoothness constraints are enforced by semiglobal matching and a left-right consistency check is used to detect and eliminate errors in occluded regions. We perform subpixel enhancement and apply a median filter and a bilateral filter to obtain the final disparity map. Figure 1 depicts the inputs to and the output from our method. The two contributions of this paper are:
· We describe how a convolutional neural network can be used to compute the stereo matching cost.
· We achieve an error rate of 2.61% on the KITTI stereo dataset, improving on the previous best result of 2.83%.
Related work
Before the introduction of large stereo datasets [2, 13], relatively few stereo algorithms used ground-truth information to learn parameters of their models; in this section, we review the ones that did. For a general overview of stereo algorithms see [14].
Kong and Tao [6] used sum of squared distances to compute an initial matching cost. They trained a model to predict the probability distribution over three classes: the initial disparity is correct, the initial disparity is incorrect due to fattening of a foreground object, and the initial disparity is incorrect due to other reasons. The predicted probabilities were used to adjust the initial matching cost. Kong and Tao [7] later extended their work by combining predictions obtained by computing normalized cross-correlation over different window sizes and centers. Peris et al. [12] initialized the matching cost with AD-Census [11] and used multiclass linear discriminant analysis to learn a mapping from the computed matching cost to the final disparity.
Ground-truth data was also used to learn parameters of graphical models. Zhang and Seitz [22] used an alternative optimization algorithm to estimate optimal values of Markov random field hyperparameters. Scharstein and Pal
Figure 1. The input is a pair of images from the left and right camera. The two input images differ mostly in horizontal locations of objects. Note that objects closer to the camera have larger disparities than objects farther away. The output is a dense disparity map shown on the right, with warmer colors representing larger values of disparity (and smaller values of depth).
[13] constructed a new dataset of 30 stereo pairs and used it to learn parameters of a conditional random field. Li and Huttenlocher [10] presented a conditional random field model with a non-parametric cost function and used a structured support vector machine to learn the model parameters.
Recent work [3, 15] focused on estimating the confidence of the computed matching cost. Haeusler et al. [3] used a random forest classifier to combine several confidence measures. Similarly, Spyropoulos et al. [15] trained a random forest classifier to predict the confidence of the matching cost and used the predictions as soft constraints in a Markov random field to decrease the error of the stereo method.
Computing the matching cost
A typical stereo algorithm begins by computing a matching cost C(p, d) at each position p for all disparities d under consideration. A simple example is the sum of absolute differences:
$$ C_{\text{AD}}(\textbf{p}, d) = \sum_{\textbf{q} \in \mathcal{N}_{\textbf{p}}} |I^L(\textbf{q}) - I^R(\textbf{qd})|, \tag{2} $$
where I^L(p) and I^R(p) are image intensities at position p of the left and right image and N_p is the set of locations within a fixed rectangular window centered at p. We use bold lowercase letters (p, q, and r) to denote pairs of real numbers. Appending a lowercase d has the following meaning: if p = (x, y) then pd = (x - d, y).
Equation (2) can be interpreted as measuring the cost associated with matching a patch from the left image, centered at position p , with a patch from the right image, centered at position pd . Since examples of good and bad matches can be obtained from publicly available datasets, e.g . KITTI [2] and Middlebury [14], we can attempt to solve the matching problem by a supervised learning approach. Inspired by the successful applications of convolutional neural networks to vision problems [8], we used them to evaluate how well two small image patches match.
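As a concrete baseline, the sum of absolute differences in Equation (2) can be sketched in a few lines of numpy; the toy images and window size below are illustrative.

```python
import numpy as np

def sad_cost(I_L, I_R, p, d, win=1):
    """SAD matching cost for one position p = (y, x) and disparity d,
    summed over a (2*win+1) x (2*win+1) window, as in Equation (2)."""
    y, x = p
    patch_L = I_L[y - win:y + win + 1, x - win:x + win + 1]
    patch_R = I_R[y - win:y + win + 1, x - d - win:x - d + win + 1]
    return np.abs(patch_L - patch_R).sum()

# Toy pair: the right image is the left image shifted 2 px leftward,
# i.e. I_R(x - 2, y) = I_L(x, y), so the true disparity is 2.
I_L = np.random.RandomState(0).rand(8, 8)
I_R = np.zeros_like(I_L)
I_R[:, :-2] = I_L[:, 2:]

costs = [sad_cost(I_L, I_R, (4, 5), d) for d in range(4)]
# The cost is minimized (here, exactly zero) at the true disparity d = 2.
```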
Creating the dataset
A training example comprises two patches, one from the left and one from the right image:
$$ <\mathcal{P}_{9 \times 9}^L(\textbf{p}), \mathcal{P}_{9 \times 9}^R(\textbf{q})>, $$
where P^L_{9×9}(p) denotes a 9 × 9 patch from the left image, centered at p = (x, y). For each location where the true disparity d is known, we extract one negative and one positive example. A negative example is obtained by setting the center of the right patch q to
$$ \textbf{q} = (x - d + o_{\text{neg}}, y), $$
where o_neg is an offset corrupting the match, chosen randomly from the set {-N_hi, ..., -N_lo, N_lo, ..., N_hi}. Similarly, a positive example is derived by setting
$$ \textbf{q} = (x - d + o_{\text{pos}}, y), $$
where o_pos is chosen randomly from the set {-P_hi, ..., P_hi}. The reason for including o_pos, instead of setting it to zero, has to do with the stereo method used later on. In particular, we found that cross-based cost aggregation performs better when the network assigns low matching costs to good matches as well as near matches. N_lo, N_hi, P_hi, and the size of the image patches n are hyperparameters of the method.
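The sampling scheme can be sketched as follows, using the hyperparameter values reported later in the paper (N_lo = 4, N_hi = 8, P_hi = 1):

```python
import numpy as np

rng = np.random.default_rng(0)
N_lo, N_hi, P_hi = 4, 8, 1   # values from the paper's hyperparameter list

def sample_offsets():
    """Draw o_neg from {-N_hi..-N_lo} U {N_lo..N_hi} and o_pos from
    {-P_hi..P_hi}, as in the dataset construction above."""
    neg_choices = np.concatenate([np.arange(-N_hi, -N_lo + 1),
                                  np.arange(N_lo, N_hi + 1)])
    o_neg = int(rng.choice(neg_choices))
    o_pos = int(rng.integers(-P_hi, P_hi + 1))
    return o_neg, o_pos

def right_patch_centers(x, y, d):
    """Centers of the right patch for one negative and one positive
    example at location (x, y) with known true disparity d."""
    o_neg, o_pos = sample_offsets()
    return (x - d + o_neg, y), (x - d + o_pos, y)
```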
Network architecture
The architecture we used is depicted in Figure 2. The network consists of eight layers, L1 through L8. The first layer is convolutional, while all other layers are fully-connected. The inputs to the network are two 9 × 9 gray image patches. The first convolutional layer consists of 32 kernels of size 5 × 5 × 1. Layers L2 and L3 are fully-connected with 200 neurons each. After L3 the two 200-dimensional vectors are concatenated into a 400-dimensional vector and passed through four fully-connected layers, L4 through L7, with 300 neurons each. The final layer, L8, projects the output to two real numbers that are fed through a softmax function, producing a distribution over the two classes (good match and bad match). The weights in L1, L2, and L3 of the networks for the left and right image patch are tied. Rectified linear units follow each layer, except L8. We did not use pooling in our architecture. The network contains almost 600 thousand parameters. The architecture is appropriate for gray images, but can easily be extended to handle RGB images by learning 5 × 5 × 3 filters in L1, instead of 5 × 5 × 1. The best hyperparameters of the network (such as the number of layers, the number of neurons in each layer, and the size of input patches) will differ from one dataset to another. We chose this architecture because it performed well on the KITTI stereo dataset.

Figure 2. The architecture of our convolutional neural network.
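The stated parameter count can be verified from the layer sizes (weights plus biases, with the tied layers L1 through L3 counted once):

```python
# Weights + biases implied by the architecture; L1-L3 are shared
# (tied) between the two sub-networks, so they are counted once.
L1 = 32 * (5 * 5 * 1) + 32        # 32 conv kernels of size 5x5x1
L2 = (5 * 5 * 32) * 200 + 200     # a 9x9 patch is 5x5x32 after L1
L3 = 200 * 200 + 200
L4 = 400 * 300 + 300              # concatenated 400-d vector in
L5 = L6 = L7 = 300 * 300 + 300
L8 = 300 * 2 + 2                  # two classes: good / bad match
total = L1 + L2 + L3 + L4 + L5 + L6 + L7 + L8
print(total)                      # 593034, i.e. almost 600 thousand
```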
Matching cost
The matching cost C_CNN(p, d) is computed directly from the output of the network:
$$ C_{\text{CNN}}(\textbf{p}, d) = f_{\text{neg}}(<\mathcal{P}_{9 \times 9}^L(\textbf{p}), \mathcal{P}_{9 \times 9}^R(\textbf{pd})>), $$
where f_neg(<P^L, P^R>) is the output of the network for the negative class when run on input patches P^L and P^R.
Naively, we would have to perform the forward pass for each image location p and each disparity d under consideration. The following three implementation details kept the runtime manageable:
- The outputs of layers L1, L2, and L3 need to be computed only once per location p and need not be recomputed for every disparity d.
- The output of L3 can be computed for all locations in a single forward pass by feeding the network full-resolution images, instead of 9 × 9 image patches. To achieve this, we apply layers L2 and L3 convolutionally: layer L2 with filters of size 5 × 5 × 32 and layer L3 with filters of size 1 × 1 × 200, both outputting 200 feature maps.
- Similarly, L4 through L8 can be replaced with convolutional filters of size 1 × 1 in order to compute the output at all locations in a single forward pass. Unfortunately, we still have to perform the forward pass for each disparity under consideration.
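The last trick relies on the fact that a fully-connected layer applied independently at every location is mathematically identical to a 1 × 1 convolution. A minimal numpy check (shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, C_in, C_out = 6, 7, 400, 300           # e.g. the L4 layer: 400 -> 300
feat = rng.standard_normal((H, W, C_in))     # a feature vector per location
Wm = rng.standard_normal((C_in, C_out))      # fully-connected weights
b = rng.standard_normal(C_out)

fc = lambda v: v @ Wm + b                    # the layer at one location

# The same weights applied as a 1x1 convolution over the whole map:
conv1x1 = feat @ Wm + b                      # matmul broadcasts over (H, W)

assert np.allclose(conv1x1[2, 3], fc(feat[2, 3]))
```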
Stereo method
In order to meaningfully evaluate the matching cost, we need to pair it with a stereo method. The stereo method we used was influenced by Mei et al. [11].
Cross-based cost aggregation
Information from neighboring pixels can be combined by averaging the matching cost over a fixed window. This approach fails near depth discontinuities where the assumption of constant depth within a window is violated. We might prefer a method that adaptively selects the neighborhood for each pixel so that support is collected only from pixels with similar disparities. In cross-based cost aggregation [21] we build a local neighborhood around each location comprising pixels with similar image intensity values.
Cross-based cost aggregation begins by constructing an upright cross at each position. The left arm p_l at position p extends left as long as the following two conditions hold:
· |I(p) - I(p_l)| < τ. The absolute difference in image intensities at positions p and p_l is smaller than τ.
· ‖p - p_l‖ < η. The horizontal distance (or vertical distance, in the case of top and bottom arms) between p and p_l is less than η.
The right, bottom, and top arms are constructed analogously. Once the four arms are known, we can define the support region U(p) as the union of horizontal arms of all positions q lying on p's vertical arm (see Figure 3). Zhang et al. [21] suggest that aggregation should consider the support regions of both images in a stereo pair. Let U^L and U^R denote the support regions in the left and right image. We define the combined support region U_d as
$$ U_d(\textbf{p}) = \{ \textbf{q} \mid \textbf{q} \in U^L(\textbf{p}), \textbf{qd} \in U^R(\textbf{pd}) \}. $$
The matching cost is averaged over the combined support region:
$$ C^0_{\text{CBCA}}(\textbf{p}, d) = C_{\text{CNN}}(\textbf{p}, d), $$
$$ C^i_{\text{CBCA}}(\textbf{p}, d) = \frac{1}{|U_d(\textbf{p})|} \sum_{\textbf{q} \in U_d(\textbf{p})} C^{i-1}_{\text{CBCA}}(\textbf{q}, d), $$
where i is the iteration number. We repeat the averaging four times; the output of cross-based cost aggregation is C^4_CBCA.
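A sketch of the cross construction and support region for a single pixel, with illustrative τ and η values (η = 4 matches the hyperparameter reported later):

```python
import numpy as np

def arm(I, y, x, dy, dx, tau, eta):
    """Length of the arm from (y, x) in direction (dy, dx): extend while
    the intensity difference stays below tau and the length below eta."""
    n, (H, W) = 0, I.shape
    while n + 1 < eta:
        yy, xx = y + (n + 1) * dy, x + (n + 1) * dx
        if not (0 <= yy < H and 0 <= xx < W) or abs(I[y, x] - I[yy, xx]) >= tau:
            break
        n += 1
    return n

def support_region(I, y, x, tau=0.05, eta=4):
    """U(p): the union of horizontal arms of every pixel on p's vertical arm."""
    up, down = arm(I, y, x, -1, 0, tau, eta), arm(I, y, x, 1, 0, tau, eta)
    region = []
    for yy in range(y - up, y + down + 1):
        left = arm(I, yy, x, 0, -1, tau, eta)
        right = arm(I, yy, x, 0, 1, tau, eta)
        region.extend((yy, xx) for xx in range(x - left, x + right + 1))
    return region

# On a constant image every arm reaches its maximum length eta - 1 = 3,
# so the support region of the center pixel is a full 7x7 square.
U = support_region(np.zeros((9, 9)), 4, 4)   # len(U) == 49
```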
Semiglobal matching
We refine the matching cost by enforcing smoothness constraints on the disparity image. Following Hirschmuller [4], we define an energy function E ( D ) that depends on the disparity image D :
$$ E(D) = \sum_{\textbf{p}} \biggl( C^4_{\text{CBCA}}(\textbf{p}, D(\textbf{p})) + \sum_{\textbf{q} \in \mathcal{N}_{\textbf{p}}} P_1 \cdot 1\{|D(\textbf{p}) - D(\textbf{q})| = 1\} + \sum_{\textbf{q} \in \mathcal{N}_{\textbf{p}}} P_2 \cdot 1\{|D(\textbf{p}) - D(\textbf{q})| > 1\} \biggr), $$
where 1{·} denotes the indicator function. The first term penalizes disparities D(p) with high matching costs. The second term adds a penalty P_1 when the disparities of neighboring pixels differ by one. The third term adds a larger penalty P_2 when the neighboring disparities differ by more than one. Rather than minimizing E(D) in 2D, we perform the minimization in a single direction with dynamic programming. This solution introduces unwanted streaking effects, since there is no incentive to make the disparity image smooth in the directions we are not optimizing over. In semiglobal matching we minimize the energy E(D) in many directions and average to obtain the final result. Although Hirschmuller [4] suggests choosing sixteen directions, we only optimized along the two horizontal and the two vertical directions; adding the diagonal directions did not improve the accuracy of our system.
To minimize E(D) in direction r, we define a matching cost C_r(p, d) with the following recurrence relation:
$$ C_{\textbf{r}}(\textbf{p}, d) = C^4_{\text{CBCA}}(\textbf{p}, d) - \min_k C_{\textbf{r}}(\textbf{p} - \textbf{r}, k) + \min \bigl\{ C_{\textbf{r}}(\textbf{p} - \textbf{r}, d),\ C_{\textbf{r}}(\textbf{p} - \textbf{r}, d - 1) + P_1,\ C_{\textbf{r}}(\textbf{p} - \textbf{r}, d + 1) + P_1,\ \min_k C_{\textbf{r}}(\textbf{p} - \textbf{r}, k) + P_2 \bigr\}. $$
The second term is included to prevent values of C_r(p, d) from growing too large and does not affect the optimal disparity map. The parameters P_1 and P_2 are set according to the image gradient so that jumps in disparity coincide with edges in the image. Let D_1 = |I^L(p) - I^L(p - r)| and D_2 = |I^R(pd) - I^R(pd - r)|. We set P_1 and P_2 according to the following rules:
$$ \begin{array}{lll} P_1 = \Pi_1, & P_2 = \Pi_2 & \text{if } D_1 < \tau_{\text{SO}},\ D_2 < \tau_{\text{SO}}, \\ P_1 = \Pi_1 / 4, & P_2 = \Pi_2 / 4 & \text{if } D_1 \geq \tau_{\text{SO}},\ D_2 < \tau_{\text{SO}}, \\ P_1 = \Pi_1 / 4, & P_2 = \Pi_2 / 4 & \text{if } D_1 < \tau_{\text{SO}},\ D_2 \geq \tau_{\text{SO}}, \\ P_1 = \Pi_1 / 10, & P_2 = \Pi_2 / 10 & \text{if } D_1 \geq \tau_{\text{SO}},\ D_2 \geq \tau_{\text{SO}}, \end{array} $$
where Π_1, Π_2, and τ_SO are hyperparameters. The value of P_1 is halved when minimizing in the vertical directions. The final cost C_SGM(p, d) is computed by taking the average across all four directions:
$$ C_{\text{SGM}}(\textbf{p}, d) = \frac{1}{4} \sum_{\textbf{r}} C_{\textbf{r}}(\textbf{p}, d). $$
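A single left-to-right pass of the recurrence can be sketched in numpy; for brevity this sketch keeps P_1 and P_2 constant rather than adapting them to the image gradient as described above:

```python
import numpy as np

def sgm_scan_lr(C, P1=1.0, P2=32.0):
    """One left-to-right pass (r = (1, 0)) of the recurrence for C_r.
    C has shape (H, W, D). P1 and P2 are kept constant here; the full
    method additionally adapts them to the image gradient."""
    C = np.asarray(C, dtype=float)
    H, W, D = C.shape
    Cr = np.empty_like(C)
    Cr[:, 0] = C[:, 0]
    for x in range(1, W):
        prev = Cr[:, x - 1]                  # C_r(p - r, .), shape (H, D)
        m = prev.min(axis=1, keepdims=True)  # min_k C_r(p - r, k)
        d_minus = np.pad(prev, ((0, 0), (1, 0)), constant_values=np.inf)[:, :-1]
        d_plus = np.pad(prev, ((0, 0), (0, 1)), constant_values=np.inf)[:, 1:]
        best = np.minimum(np.minimum(prev, m + P2),
                          np.minimum(d_minus, d_plus) + P1)
        Cr[:, x] = C[:, x] + best - m        # subtracting m bounds the growth
    return Cr
```

Averaging the passes over the four scan directions gives C_SGM.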
After semiglobal matching we repeat cross-based cost aggregation, as described in the previous section.
Computing the disparity image
The disparity image D is computed by the winner-take-all strategy, i.e. by finding the disparity d that minimizes C(p, d):
$$ D(\textbf{p}) = \arg\min_d C(\textbf{p}, d). $$
Interpolation
Let D^L denote the disparity map obtained by treating the left image as the reference image (this was the case so far, i.e. D^L(p) = D(p)) and let D^R denote the disparity map obtained by treating the right image as the reference image. Both D^L and D^R contain errors in occluded regions. We attempt to detect these errors by performing a left-right consistency check. We label each position p as either
$$ \begin{array}{ll} \textit{correct} & \text{if } |d - D^R(\textbf{pd})| \leq 1 \text{ for } d = D^L(\textbf{p}), \\ \textit{mismatch} & \text{if } |d - D^R(\textbf{pd})| \leq 1 \text{ for any other } d, \\ \textit{occlusion} & \text{otherwise}. \end{array} $$
For positions marked as occlusion, we want the new disparity value to come from the background. We interpolate by moving left until we find a position labeled correct and use its value. For positions marked as mismatch, we find the nearest correct pixels in 16 different directions and use the median of their disparities for interpolation. We refer to the interpolated disparity map as D_INT.
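The labeling rule can be sketched directly (integer disparities; the brute-force mismatch search is for clarity, not efficiency):

```python
import numpy as np

def consistency_labels(D_L, D_R):
    """Label each pixel of D_L: 0 = correct, 1 = mismatch, 2 = occlusion,
    per the left-right consistency check. Disparities must be integers."""
    H, W = D_L.shape
    labels = np.full((H, W), 2, dtype=int)        # default: occlusion
    for y in range(H):
        for x in range(W):
            d = D_L[y, x]
            if 0 <= x - d < W and abs(d - D_R[y, x - d]) <= 1:
                labels[y, x] = 0                  # correct
                continue
            for dp in range(W):                   # any other disparity?
                if 0 <= x - dp < W and abs(dp - D_R[y, x - dp]) <= 1:
                    labels[y, x] = 1              # mismatch
                    break
    return labels
```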
Subpixel enhancement
Subpixel enhancement provides an easy way to increase the resolution of a stereo algorithm. We fit a quadratic curve through the neighboring costs to obtain a new disparity image:
$$ D_{\text{SE}}(\textbf{p}) = d - \frac{C_+ - C_-}{2 (C_+ - 2C + C_-)}, $$
$$ \text{where } d = D_{\text{INT}}(\textbf{p}),\ C_- = C_{\text{SGM}}(\textbf{p}, d - 1),\ C = C_{\text{SGM}}(\textbf{p}, d),\ C_+ = C_{\text{SGM}}(\textbf{p}, d + 1). $$
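Fitting the parabola through the three neighboring costs gives a closed-form vertex; since the fit is exact for a quadratic cost, a quadratic toy example is recovered exactly:

```python
def subpixel(d, C_minus, C, C_plus):
    """Vertex of the parabola through (d-1, C_-), (d, C), (d+1, C_+):
    D_SE = d - (C_+ - C_-) / (2 * (C_+ - 2C + C_-))."""
    return d - (C_plus - C_minus) / (2 * (C_plus - 2 * C + C_minus))

# Costs sampled from the quadratic (x - 5.25)^2 at d = 4, 5, 6:
f = lambda x: (x - 5.25) ** 2
d_hat = subpixel(5, f(4), f(5), f(6))   # recovers 5.25 exactly
```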
Refinement
The size of the disparity image D_SE is smaller than the size of the original image due to the border effects of convolution. The disparity image is enlarged to match the size of the input by copying the disparities of the border pixels. We proceed by applying a 5 × 5 median filter and the following bilateral filter:
$$ D_{\text{BF}}(\textbf{p}) = \frac{1}{W(\textbf{p})} \sum_{\textbf{q} \in \mathcal{N}_{\textbf{p}}} D_{\text{SE}}(\textbf{q}) \cdot g(\|\textbf{p} - \textbf{q}\|) \cdot 1\{|I^L(\textbf{p}) - I^L(\textbf{q})| < \tau_{\text{BF}}\}, $$
where g(x) is the probability density function of a zero-mean normal distribution with standard deviation σ and W(p) is the normalizing constant:
$$ W(\textbf{p}) = \sum_{\textbf{q} \in \mathcal{N}_{\textbf{p}}} g(\|\textbf{p} - \textbf{q}\|) \cdot 1\{|I^L(\textbf{p}) - I^L(\textbf{q})| < \tau_{\text{BF}}\}. $$
τ_BF and σ are hyperparameters. D_BF is the final output of our stereo method.
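A direct, unoptimized transcription of the bilateral filter (the Gaussian's normalizing constant cancels in the ratio, so an unnormalized Gaussian is used; the 5 × 5 window matches the median filter size and is an assumption of this sketch):

```python
import numpy as np

def bilateral(D, I, sigma=5.656, tau_BF=5.0, win=2):
    """Edge-preserving smoothing of disparity map D: average neighbors
    weighted by a spatial Gaussian, but only where the image intensity
    difference stays below tau_BF."""
    H, W = D.shape
    out = np.empty((H, W), dtype=float)
    g = lambda r: np.exp(-r ** 2 / (2 * sigma ** 2))   # unnormalized Gaussian
    for y in range(H):
        for x in range(W):
            num = den = 0.0
            for dy in range(-win, win + 1):
                for dx in range(-win, win + 1):
                    yy, xx = y + dy, x + dx
                    if not (0 <= yy < H and 0 <= xx < W):
                        continue
                    if abs(I[y, x] - I[yy, xx]) >= tau_BF:
                        continue                       # intensity edge: skip
                    w = g(np.hypot(dy, dx))
                    num += w * D[yy, xx]
                    den += w
            out[y, x] = num / den                      # center pixel keeps den > 0
    return out
```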
Experimental results
We evaluate our method on the KITTI stereo dataset because its training set is large enough to learn the weights of the convolutional neural network.
KITTI stereo dataset
The KITTI stereo dataset [2] is a collection of gray image pairs taken from two video cameras mounted on the roof of a car, roughly 54 centimeters apart. The images were recorded while driving in and around the city of Karlsruhe, in sunny and cloudy weather, at daytime. The dataset comprises 194 training and 195 test image pairs at resolution 1240 × 376. Each image pair is rectified, i.e. transformed in such a way that an object appears on the same vertical position in both images. A rotating laser scanner, mounted behind the left camera, provides ground truth depth. The true disparities for the test set are withheld and an online leaderboard 1 is provided where researchers can evaluate their methods on the test set. Submissions are allowed only once every three days. The goal of the KITTI stereo dataset is to predict the disparity for each pixel of the left image. Error is measured by the percentage of pixels where the true disparity and the predicted disparity differ by more than three pixels. Translated into depth, this means that, for example, the error tolerance is ± 3 centimeters for objects 2 meters from the camera and ± 80 centimeters for objects 10 meters from the camera.
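The stated tolerances can be reproduced from z = fB/d. The baseline B = 0.54 m is given above; the focal length f ≈ 721 pixels is an assumption about the KITTI camera setup, not a number stated in this text:

```python
f, B = 721.0, 0.54   # B is given above; f ~ 721 px is an assumed focal length

def depth_error(z, pixels=3.0):
    """Depth change caused by a `pixels` overestimate of disparity at depth z."""
    d = f * B / z                    # disparity at depth z
    return f * B / (d - pixels) - z  # depth if disparity is off by `pixels`

err_2m = depth_error(2.0)     # about 0.03 m: +/- 3 cm at 2 m
err_10m = depth_error(10.0)   # about 0.8 m: +/- 80 cm at 10 m
```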
Details of learning
We train the network using stochastic gradient descent to minimize the cross-entropy loss. The batch size was set to 128. We trained for 16 epochs with the learning rate initially set to 0.01 and decreased by a factor of 10 on the 12th and 15th epoch. We shuffle the training examples prior to learning. From the 194 training image pairs we extracted 45 million examples, half belonging to the positive class and half to the negative class. We preprocessed each image by subtracting the mean and dividing by the standard deviation of its pixel intensity values. The stereo method is implemented in CUDA, while the network training is done with the Torch7 environment [1]. The hyperparameters of the stereo method were:

1 http://www.cvlibs.net/datasets/kitti/eval_stereo_flow.php?benchmark=stereo
$$ \begin{array}{llll} N_{\text{lo}} = 4, & \eta = 4, & \Pi_1 = 1, & \sigma = 5.656, \\ N_{\text{hi}} = 8, & \tau = 0.0442, & \Pi_2 = 32, & \tau_{\text{BF}} = 5, \\ P_{\text{hi}} = 1, & & \tau_{\text{SO}} = 0.0625. & \end{array} $$
Results
Our method achieves an error rate of 2.61% on the KITTI stereo test set and is currently ranked first on the online leaderboard. Table 1 compares the error rates of the best performing stereo algorithms on this dataset.
Table 1. The KITTI stereo leaderboard as it stands in November 2014.
A selected set of examples, together with predictions from our method, are shown in Figure 5.
Runtime
We measure the runtime of our implementation on a computer with a Nvidia GeForce GTX Titan GPU. Training takes 5 hours. Predicting a single image pair takes 100 seconds. It is evident from Table 2 that the majority of time during prediction is spent in the forward pass of the convolutional neural network.
Table 2. Time required for prediction of each component.
Training set size
We would like to know if more training data would lead to a better stereo method. To answer this question, we train our convolutional neural network on many instances of the KITTI stereo dataset while varying the training set size. The results of the experiment are depicted in Figure 4. We observe an almost linear relationship between the training set size and error on the test set. These results imply that our method will improve as larger datasets become available in the future.

Figure 4. The error on the test set as a function of the number of stereo pairs in the training set.
Conclusion
Our result on the KITTI stereo dataset seems to suggest that convolutional neural networks are a good fit for computing the stereo matching cost. Training on bigger datasets will reduce the error rate even further. Using supervised learning in the stereo method itself could also be beneficial. Our method is not yet suitable for real-time applications such as robot navigation. Future work will focus on improving the network's runtime performance.
Figure 5. The left column displays the left input image, while the right column displays the output of our stereo method. Examples are sorted by difficulty, with easy examples appearing at the top. Some of the difficulties include reflective surfaces, occlusions, as well as regions with many jumps in disparity, e.g . fences and shrubbery. The examples towards the bottom were selected to highlight the flaws in our method and to demonstrate the inherent difficulties of stereo matching on real-world images.
| Rank | Method | Authors | Error |
|---|---|---|---|
| 1 | MC-CNN | This paper | 2.61% |
| 2 | SPS-StFl | Yamaguchi et al. [20] | 2.83% |
| 3 | VC-SF | Vogel et al. [16] | 3.05% |
| 4 | CoP | Anonymous submission | 3.30% |
| 5 | SPS-St | Yamaguchi et al. [20] | 3.39% |
| 6 | PCBP-SS | Yamaguchi et al. [19] | 3.40% |
| 7 | DDS-SS | Anonymous submission | 3.83% |
| 8 | StereoSLIC | Yamaguchi et al. [19] | 3.92% |
| 9 | PR-Sf+E | Vogel et al. [17] | 4.02% |
| 10 | PCBP | Yamaguchi et al. [18] | 4.04% |
| Component | Runtime |
|---|---|
| Convolutional neural network | 95 s |
| Semiglobal matching | 3 s |
| Cross-based cost aggregation | 2 s |
| Everything else | 0.03 s |
We present a method for extracting depth information from a rectified image pair. We train a convolutional neural network to predict how well two image patches match and use it to compute the stereo matching cost. The cost is refined by cross-based cost aggregation and semiglobal matching, followed by a left-right consistency check to eliminate errors in the occluded regions. Our stereo method achieves an error rate of 2.61 % on the KITTI stereo dataset and is currently (August 2014) the top performing method on this dataset.
Consider the following problem: given two images taken from cameras at different horizontal positions, the goal is to compute the disparity d𝑑d for each pixel in the left image. Disparity refers to the difference in horizontal location of an object in the left and right image—an object at position (x,y)𝑥𝑦(x,y) in the left image will appear at position (x−d,y)𝑥𝑑𝑦(x-d,y) in the right image. Knowing the disparity d𝑑d of an object, we can compute its depth z𝑧z (i.e. the distance from the object to the camera) by using the following relation:
where f𝑓f is the focal length of the camera and B𝐵B is the distance between the camera centers.
The described problem is a subproblem of stereo reconstruction, where the goal is to extract 3D shape from one or more images. According to the taxonomy of Scharstein and Szeliski, [14], a typical stereo algorithm consists of four steps: (1) matching cost computation, (2) cost aggregation, (3) optimization, and (4) disparity refinement. Following Hirschmuller and Scharstein, [5], we refer to steps (1) and (2) as computing the matching cost and steps (3) and (4) as the stereo method.
We propose training a convolutional neural network [9] on pairs of small image patches where the true disparity is known (e.g. obtained by LIDAR). The output of the network is used to initialize the matching cost between a pair of patches. Matching costs are combined between neighboring pixels with similar image intensities using cross-based cost aggregation. Smoothness constraints are enforced by semiglobal matching and a left-right consistency check is used to detect and eliminate errors in occluded regions. We perform subpixel enhancement and apply a median filter and a bilateral filter to obtain the final disparity map. Figure 1 depicts the inputs to and the output from our method.
The two contributions of this paper are:
We describe how a convolutional neural network can be used to compute the stereo matching cost.
We achieve an error rate of 2.61 % on the KITTI stereo dataset, improving on the previous best result of 2.83 %.
Before the introduction of large stereo datasets [2, 13], relatively few stereo algorithms used ground-truth information to learn parameters of their models; in this section, we review the ones that did. For a general overview of stereo algorithms see [14].
Kong and Tao, [6] used sum of squared distances to compute an initial matching cost. They trained a model to predict the probability distribution over three classes: the initial disparity is correct, the initial disparity is incorrect due to fattening of a foreground object, and the initial disparity is incorrect due to other reasons. The predicted probabilities were used to adjust the initial matching cost. Kong and Tao, [7] later extend their work by combining predictions obtained by computing normalized cross-correlation over different window sizes and centers. Peris et al., [12] initialized the matching cost with AD-Census [11] and used multiclass linear discriminant analysis to learn a mapping from the computed matching cost to the final disparity.
Ground-truth data was also used to learn parameters of graphical models. Zhang and Seitz, [22] used an alternative optimization algorithm to estimate optimal values of Markov random field hyperparameters. Scharstein and Pal, [13] constructed a new dataset of 30 stereo pairs and used it to learn parameters of a conditional random field. Li and Huttenlocher, [10] presented a conditional random field model with a non-parametric cost function and used a structured support vector machine to learn the model parameters.
Recent work [3, 15] focused on estimating the confidence of the computed matching cost. Haeusler et al., [3] used a random forest classifier to combine several confidence measures. Similarly, Spyropoulos et al., [15] trained a random forest classifier to predict the confidence of the matching cost and used the predictions as soft constraints in a Markov random field to decrease the error of the stereo method.
A typical stereo algorithm begins by computing a matching cost C(p,d)𝐶p𝑑C(\textbf{p},d) at each position p for all disparities d𝑑d under consideration. A simple example is the sum of absolute differences:
where IL(p)superscript𝐼𝐿pI^{L}(\textbf{p}) and IR(p)superscript𝐼𝑅pI^{R}(\textbf{p}) are image intensities at position p of the left and right image and 𝒩psubscript𝒩p\mathcal{N}_{\textbf{p}} is the set of locations within a fixed rectangular window centered at p. We use bold lowercase letters (p, q,q\textbf{q}, and r) to denote pairs of real numbers. Appending a lowercase d has the following meaning: if p=(x,y)p𝑥𝑦\textbf{p}=(x,y) then pd=(x−d,y)pd𝑥𝑑𝑦\textbf{pd}=(x-d,y).
Equation (2) can be interpreted as measuring the cost associated with matching a patch from the left image, centered at position p, with a patch from the right image, centered at position pd. Since examples of good and bad matches can be obtained from publicly available datasets, e.g. KITTI [2] and Middlebury [14], we can attempt to solve the matching problem by a supervised learning approach. Inspired by the successful applications of convolutional neural networks to vision problems [8], we used them to evaluate how well two small image patches match.
A training example comprises two patches, one from the left and one from the right image:
where 𝒫9×9L(p)superscriptsubscript𝒫99𝐿p\mathcal{P}_{9\times 9}^{L}(\textbf{p}) denotes a 9×9999\times 9 patch from the left image, centered at p=(x,y)p𝑥𝑦\textbf{p}=(x,y). For each location where the true disparity d𝑑d is known, we extract one negative and one positive example. A negative example is obtained by setting the center of the right patch q to
where onegsubscript𝑜nego_{\text{neg}} is an offset corrupting the match, chosen randomly from the set {−Nhi,…,−Nlo,Nlo,…,Nhi}subscript𝑁hi…subscript𝑁losubscript𝑁lo…subscript𝑁hi{-N_{\text{hi}},\ldots,-N_{\text{lo}},N_{\text{lo}},\ldots,N_{\text{hi}}}. Similarly, a positive example is derived by setting
where opossubscript𝑜poso_{\text{pos}} is chosen randomly from the set {−Phi,…,Phi}subscript𝑃hi…subscript𝑃hi{-P_{\text{hi}},\ldots,P_{\text{hi}}}. The reason for including opossubscript𝑜poso_{\text{pos}}, instead of setting it to zero, has to do with the stereo method used later on. In particular, we found that cross-based cost aggregation performs better when the network assigns low matching costs to good matches as well as near matches. Nlosubscript𝑁loN_{\text{lo}}, Nhisubscript𝑁hiN_{\text{hi}}, Phisubscript𝑃hiP_{\text{hi}}, and the size of the image patches n𝑛n are hyperparameters of the method.
The architecture we used is depicted in Figure 2. The network consists of eight layers, L1𝐿1L1 through L8𝐿8L8. The first layer is convolutional, while all other layers are fully-connected. The inputs to the network are two 9×9999\times 9 gray image patches. The first convolutional layer consists of 32 kernels of size 5×5×15515\times 5\times 1. Layers L2𝐿2L2 and L3𝐿3L3 are fully-connected with 200 neurons each. After L3𝐿3L3 the two 200 dimensional vectors are concatenated into a 400 dimensional vector and passed through four fully-connected layers, L4𝐿4L4 through L7𝐿7L7, with 300 neurons each. The final layer, L8𝐿8L8, projects the output to two real numbers that are fed through a softmax function, producing a distribution over the two classes (good match and bad match). The weights in L1𝐿1L1, L2𝐿2L2, and L3𝐿3L3 of the networks for the left and right image patch are tied. Rectified linear units follow each layer, except L8𝐿8L8. We did not use pooling in our architecture. The network contains almost 600 thousand parameters. The architecture is appropriate for gray images, but can easily be extended to handle RGB images by learning 5×5×35535\times 5\times 3, instead of 5×5×15515\times 5\times 1 filters in L1𝐿1L1. The best hyperparameters of the network (such as the number of layers, the number of neurons in each layer, and the size of input patches) will differ from one dataset to another. We chose this architecture because it performed well on the KITTI stereo dataset.
The matching cost $C_{\text{CNN}}(\textbf{p},d)$ is computed directly from the output of the network:
where $f_{\text{neg}}(<\mathcal{P}^{L},\mathcal{P}^{R}>)$ is the output of the network for the negative class when run on input patches $\mathcal{P}^{L}$ and $\mathcal{P}^{R}$.
The outputs of layers $L1$, $L2$, and $L3$ need to be computed only once per location $\textbf{p}$ and need not be recomputed for every disparity $d$.
The output of $L3$ can be computed for all locations in a single forward pass by feeding the network full-resolution images instead of $9\times 9$ image patches. To achieve this, we apply layers $L2$ and $L3$ convolutionally: layer $L2$ with filters of size $5\times 5\times 32$ and layer $L3$ with filters of size $1\times 1\times 200$, both outputting 200 feature maps.
Similarly, $L4$ through $L8$ can be replaced with convolutional filters of size $1\times 1$ in order to compute the output at all locations in a single forward pass. Unfortunately, we still have to perform the forward pass for each disparity under consideration.
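Schematically, the reuse looks like the following sketch, where random arrays stand in for the per-image $L3$ feature maps (computed once per image) and a placeholder function stands in for $L4$-$L8$; all shapes and the "network" itself are illustrative assumptions.

```python
import numpy as np

# Sketch of reusing per-image features across disparities. The feature maps
# stand in for the L3 outputs; `top_layers` is a placeholder for L4-L8
# applied as 1x1 convolutions. Shapes and the random values are illustrative.
H, W, F, D_MAX = 4, 8, 200, 3
rng = np.random.default_rng(0)
features_left = rng.standard_normal((H, W, F))
features_right = rng.standard_normal((H, W, F))

def top_layers(concat):  # placeholder: maps concatenated features to a cost
    return np.linalg.norm(concat, axis=-1)

cost = np.full((H, W, D_MAX + 1), np.inf)
for d in range(D_MAX + 1):
    # Pair each left-image position (x, y) with (x - d, y) in the right
    # image; one "forward pass" of the top layers per disparity.
    concat = np.concatenate(
        [features_left[:, d:], features_right[:, :W - d]], axis=-1)
    cost[:, d:, d] = top_layers(concat)

assert cost.shape == (H, W, D_MAX + 1)
assert np.isinf(cost[0, 0, 1])  # x = 0 has no right-image match at d = 1
```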
In order to meaningfully evaluate the matching cost, we need to pair it with a stereo method. The stereo method we used was influenced by Mei et al. [11].
Information from neighboring pixels can be combined by averaging the matching cost over a fixed window. This approach fails near depth discontinuities where the assumption of constant depth within a window is violated. We might prefer a method that adaptively selects the neighborhood for each pixel so that support is collected only from pixels with similar disparities. In cross-based cost aggregation [21] we build a local neighborhood around each location comprising pixels with similar image intensity values.
$|I(\textbf{p})-I(\textbf{p}_{l})|<\tau$: the absolute difference in image intensities at positions $\textbf{p}$ and $\textbf{p}_{l}$ is smaller than $\tau$.
The right, bottom, and top arms are constructed analogously. Once the four arms are known, we can define the support region $U(\textbf{p})$ as the union of horizontal arms of all positions $\textbf{q}$ lying on $\textbf{p}$'s vertical arm (see Figure 3).
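A minimal sketch of the left-arm rule, assuming a maximum arm length `L_MAX` and a threshold `TAU` (both invented for the example):

```python
import numpy as np

# Sketch of constructing the left arm of the cross at one pixel, under the
# intensity rule |I(p) - I(p_l)| < tau from the text. L_MAX and TAU are
# assumed values for illustration, not the paper's hyperparameters.
TAU, L_MAX = 20, 4

def left_arm_length(I, y, x):
    length = 0
    while (length < L_MAX and x - length - 1 >= 0
           and abs(int(I[y, x]) - int(I[y, x - length - 1])) < TAU):
        length += 1
    return length

I = np.array([[10, 12, 13, 100, 101, 103]], dtype=np.uint8)
assert left_arm_length(I, 0, 2) == 2  # reaches the image border
assert left_arm_length(I, 0, 5) == 2  # stops at the intensity edge (13 vs 100)
```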
Zhang et al. [21] suggest that aggregation should consider the support regions of both images in a stereo pair. Let $U^{L}$ and $U^{R}$ denote the support regions in the left and right image. We define the combined support region $U_{d}$ as
The matching cost is averaged over the combined support region:
where $i$ is the iteration number. We repeat the averaging four times; the output of cross-based cost aggregation is $C^{4}_{\text{CBCA}}$.
We refine the matching cost by enforcing smoothness constraints on the disparity image. Following Hirschmuller [4], we define an energy function $E(D)$ that depends on the disparity image $D$:
where $1\{\cdot\}$ denotes the indicator function. The first term penalizes disparities $D(\textbf{p})$ with high matching costs. The second term adds a penalty $P_{1}$ when the disparities of neighboring pixels differ by one. The third term adds a larger penalty $P_{2}$ when the neighboring disparities differ by more than one. Rather than minimizing $E(D)$ in 2D, we perform the minimization in a single direction with dynamic programming. This solution introduces unwanted streaking effects, since there is no incentive to make the disparity image smooth in the directions we are not optimizing over. In semiglobal matching we minimize the energy $E(D)$ in many directions and average to obtain the final result. Although Hirschmuller [4] suggests choosing sixteen directions, we only optimized along the two horizontal and the two vertical directions; adding the diagonal directions did not improve the accuracy of our system.
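A single left-to-right pass of the scanline recurrence can be sketched as follows, with constant $P_{1}$ and $P_{2}$ for brevity (the full method makes them depend on the image gradient and averages passes over four directions):

```python
import numpy as np

# Minimal left-to-right scanline pass of the semiglobal matching recurrence.
# `cost` holds the aggregated matching cost for one scanline, indexed by
# (position, disparity). P1/P2 are constant here for simplicity.
def sgm_scanline(cost, P1=1.0, P2=4.0):
    W, D = cost.shape
    out = cost.copy()
    for x in range(1, W):
        prev = out[x - 1]
        best_prev = prev.min()
        for d in range(D):
            candidates = [prev[d],
                          prev[d - 1] + P1 if d > 0 else np.inf,
                          prev[d + 1] + P1 if d < D - 1 else np.inf,
                          best_prev + P2]
            # Subtracting best_prev keeps values from growing unboundedly
            # and does not change the minimizing disparity.
            out[x, d] = cost[x, d] + min(candidates) - best_prev
    return out

cost = np.array([[0., 5., 5.], [5., 0., 5.], [5., 5., 0.]])
smoothed = sgm_scanline(cost)
assert smoothed.argmin(axis=1).tolist() == [0, 1, 2]
```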
The second term is included to prevent values of $C_{\textbf{r}}(\textbf{p},d)$ from growing too large and does not affect the optimal disparity map. The parameters $P_{1}$ and $P_{2}$ are set according to the image gradient so that jumps in disparity coincide with edges in the image. Let $D_{1}=|I^{L}(\textbf{p})-I^{L}(\textbf{p}-\textbf{r})|$ and $D_{2}=|I^{R}(\textbf{pd})-I^{R}(\textbf{pd}-\textbf{r})|$. We set $P_{1}$ and $P_{2}$ according to the following rules:
where $\Pi_{1}$, $\Pi_{2}$, and $\tau_{\text{SO}}$ are hyperparameters. The value of $P_{1}$ is halved when minimizing in the vertical directions. The final cost $C_{\text{SGM}}(\textbf{p},d)$ is computed by taking the average across all four directions:
After semiglobal matching we repeat cross-based cost aggregation, as described in the previous section.
The disparity image $D$ is computed by the winner-take-all strategy, i.e. by finding the disparity $d$ that minimizes $C(\textbf{p},d)$,
Let $D^{L}$ denote the disparity map obtained by treating the left image as the reference image (this was the case so far, i.e. $D^{L}(\textbf{p})=D(\textbf{p})$) and let $D^{R}$ denote the disparity map obtained by treating the right image as the reference image. Both $D^{L}$ and $D^{R}$ contain errors in occluded regions. We attempt to detect these errors by performing a left-right consistency check. We label each position $\textbf{p}$ as either
For positions marked as occlusion, we want the new disparity value to come from the background. We interpolate by moving left until we find a position labeled correct and use its value. For positions marked as mismatch, we find the nearest correct pixels in 16 different directions and use the median of their disparities for interpolation. We refer to the interpolated disparity map as DINTsubscript𝐷INTD_{\text{INT}}.
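The consistency labels can be sketched on a single scanline as follows; the disparity values are invented for the example:

```python
import numpy as np

# Sketch of the left-right consistency labels on one scanline. D_left[x] is
# the disparity predicted with the left image as reference; position x in
# the left image corresponds to x - d in the right image.
def label(D_left, D_right, x):
    d = int(D_left[x])
    if x - d >= 0 and abs(d - D_right[x - d]) <= 1:
        return "correct"
    # mismatch: some other disparity would have been consistent
    for d2 in range(len(D_right)):
        if x - d2 >= 0 and abs(d2 - D_right[x - d2]) <= 1:
            return "mismatch"
    return "occlusion"

D_left = np.array([0, 3, 2, 2])
D_right = np.array([0, 9, 2, 2])
assert label(D_left, D_right, 0) == "correct"
assert label(D_left, D_right, 1) == "mismatch"
assert label(D_left, D_right, 2) == "occlusion"
```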
Subpixel enhancement provides an easy way to increase the resolution of a stereo algorithm. We fit a quadratic curve through the neighboring costs to obtain a new disparity image:
where $d=D_{\text{INT}}(\textbf{p})$, $C_{-}=C_{\text{SGM}}(\textbf{p},d-1)$, $C=C_{\text{SGM}}(\textbf{p},d)$, and $C_{+}=C_{\text{SGM}}(\textbf{p},d+1)$.
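The fit is a one-line formula; a toy evaluation:

```python
# Parabola fit through the costs at d-1, d, and d+1; the minimum of the
# fitted parabola gives the subpixel disparity.
def subpixel(d, c_minus, c, c_plus):
    return d - (c_plus - c_minus) / (2 * (c_plus - 2 * c + c_minus))

assert subpixel(2, 3.0, 1.0, 3.0) == 2.0   # symmetric costs: no shift
assert subpixel(2, 4.0, 1.0, 2.0) == 2.25  # cheaper right neighbor pulls d up
```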
The size of the disparity image $D_{\text{SE}}$ is smaller than the size of the original image due to the bordering effects of convolution. The disparity image is enlarged to match the size of the input by copying the disparities of the border pixels. We proceed by applying a $5\times 5$ median filter and the following bilateral filter:
where $g(x)$ is the probability density function of a zero-mean normal distribution with standard deviation $\sigma$ and $W(\textbf{p})$ is the normalizing constant:
$\tau_{\text{BF}}$ and $\sigma$ are hyperparameters. $D_{\text{BF}}$ is the final output of our stereo method.
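A one-dimensional analogue of the bilateral filtering step (the text uses a 2-D filter; the $\sigma$, $\tau_{\text{BF}}$, and window radius below are assumed values chosen to keep the example short):

```python
import numpy as np

# 1-D sketch of the edge-preserving bilateral filter: a Gaussian spatial
# weight gated by the intensity difference |I(p) - I(q)| < tau_BF.
SIGMA, TAU_BF, RADIUS = 2.0, 10.0, 2   # assumed values for illustration

def bilateral_1d(D, I):
    out = np.empty_like(D, dtype=float)
    for p in range(len(D)):
        num, W = 0.0, 0.0
        for q in range(max(0, p - RADIUS), min(len(D), p + RADIUS + 1)):
            if abs(I[p] - I[q]) < TAU_BF:                       # intensity gate
                w = np.exp(-((p - q) ** 2) / (2 * SIGMA ** 2))  # spatial weight
                num += w * D[q]
                W += w
        out[p] = num / W
    return out

D = np.array([5., 5., 5., 9., 9.])
I = np.array([0., 0., 0., 100., 100.])  # intensity edge at the depth boundary
out = bilateral_1d(D, I)
assert abs(out[0] - 5.0) < 1e-9 and abs(out[4] - 9.0) < 1e-9  # edge preserved
```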
The KITTI stereo dataset [2] is a collection of gray image pairs taken from two video cameras mounted on the roof of a car, roughly 54 centimeters apart. The images were recorded while driving in and around the city of Karlsruhe, in sunny and cloudy weather, at daytime. The dataset comprises 194 training and 195 test image pairs at resolution $1240\times 376$. Each image pair is rectified, i.e. transformed in such a way that an object appears at the same vertical position in both images. A rotating laser scanner, mounted behind the left camera, provides ground-truth depth. The true disparities for the test set are withheld and an online leaderboard (http://www.cvlibs.net/datasets/kitti/eval_stereo_flow.php?benchmark=stereo) is provided where researchers can evaluate their method on the test set. Submissions are allowed only once every three days. The goal of the KITTI stereo dataset is to predict the disparity for each pixel in the left image. Error is measured by the percentage of pixels where the true disparity and the predicted disparity differ by more than three pixels. Translated into depth, this means that, for example, the error tolerance is $\pm 3$ centimeters for objects 2 meters from the camera and $\pm 80$ centimeters for objects 10 meters from the camera.
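The quoted depth tolerances follow from $z=fB/d$: a worked check with the KITTI baseline $B\approx 0.54$ m and an assumed focal length of about 721 pixels (the focal length is not stated in the text):

```python
# Depth error caused by a 3-pixel disparity error, derived from z = f*B/d.
f, B = 721.0, 0.54   # focal length f is an assumed value for illustration

def depth_error(z, disparity_error=3.0):
    d = f * B / z                              # disparity at depth z
    return f * B / (d - disparity_error) - z   # depth change for the error

assert 0.025 < depth_error(2.0) < 0.04   # roughly +-3 cm at 2 meters
assert 0.70 < depth_error(10.0) < 0.90   # roughly +-80 cm at 10 meters
```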
We train the network using stochastic gradient descent to minimize the cross-entropy loss. The batch size was set to 128. We trained for 16 epochs, with the learning rate initially set to 0.01 and decreased by a factor of 10 on the 12th and 15th iteration. We shuffle the training examples prior to learning. From the 194 training image pairs we extracted 45 million examples, half belonging to the positive class and half to the negative class. We preprocessed each image by subtracting the mean and dividing by the standard deviation of its pixel intensity values. The stereo method is implemented in CUDA, while the network training is done with the Torch7 environment [1]. The hyperparameters of the stereo method were:
Our method achieves an error rate of 2.61 % on the KITTI stereo test set and is currently ranked first on the online leaderboard. Table 1 compares the error rates of the best performing stereo algorithms on this dataset.
A selected set of examples, together with predictions from our method, are shown in Figure 5.
We measure the runtime of our implementation on a computer with an Nvidia GeForce GTX Titan GPU. Training takes 5 hours. Predicting a single image pair takes 100 seconds. It is evident from Table 2 that the majority of prediction time is spent in the forward pass of the convolutional neural network.
We would like to know if more training data would lead to a better stereo method. To answer this question, we train our convolutional neural network on many instances of the KITTI stereo dataset while varying the training set size. The results of the experiment are depicted in Figure 4.
We observe an almost linear relationship between the training set size and error on the test set. These results imply that our method will improve as larger datasets become available in the future.
Our result on the KITTI stereo dataset seems to suggest that convolutional neural networks are a good fit for computing the stereo matching cost. Training on bigger datasets will reduce the error rate even further. Using supervised learning in the stereo method itself could also be beneficial. Our method is not yet suitable for real-time applications such as robot navigation. Future work will focus on improving the network’s runtime performance.
Table 1: The KITTI stereo leaderboard as it stands in November 2014.
| Rank | Method | Reference | Error |
|---|---|---|---|
| 1 | MC-CNN | This paper | 2.61 % |
| 2 | SPS-StFl | Yamaguchi et al. [20] | 2.83 % |
| 3 | VC-SF | Vogel et al. [16] | 3.05 % |
| 4 | CoP | Anonymous submission | 3.30 % |
| 5 | SPS-St | Yamaguchi et al. [20] | 3.39 % |
| 6 | PCBP-SS | Yamaguchi et al. [19] | 3.40 % |
| 7 | DDS-SS | Anonymous submission | 3.83 % |
| 8 | StereoSLIC | Yamaguchi et al. [19] | 3.92 % |
| 9 | PR-Sf+E | Vogel et al. [17] | 4.02 % |
| 10 | PCBP | Yamaguchi et al. [18] | 4.04 % |
Table 2: Time required for prediction of each component.
| Component | Runtime |
|---|---|
| Convolutional neural network | 95 s |
| Semiglobal matching | 3 s |
| Cross-based cost aggregation | 2 s |
| Everything else | 0.03 s |
Figure 1: The input is a pair of images from the left and right camera. The two input images differ mostly in horizontal locations of objects. Note that objects closer to the camera have larger disparities than objects farther away. The output is a dense disparity map shown on the right, with warmer colors representing larger values of disparity (and smaller values of depth).
Figure 2: The architecture of our convolutional neural network.
Figure 3: The support region for position $\textbf{p}$ is the union of horizontal arms of all positions $\textbf{q}$ on $\textbf{p}$'s vertical arm.
Figure 4: The error on the test set as a function of the number of stereo pairs in the training set.
$$ z=\frac{fB}{d}, $$ \tag{S1.E1}
$$ C_{\text{AD}}(\textbf{p},d)=\sum_{\textbf{q}\in\mathcal{N}_{\textbf{p}}}|I^{L}(\textbf{q})-I^{R}(\textbf{qd})|, $$ \tag{S3.E2}
$$ <\mathcal{P}_{9\times 9}^{L}(\textbf{p}),\mathcal{P}_{9\times 9}^{R}(\textbf{q})>, $$ \tag{S3.E3}
$$ \textbf{q}=(x-d+o_{\text{neg}},y), $$ \tag{S3.E4}
$$ E(D)=\sum_{\textbf{p}}\biggl(C^{4}_{\text{CBCA}}(\textbf{p},D(\textbf{p}))+\sum_{\textbf{q}\in\mathcal{N}_{\textbf{p}}}P_{1}\cdot 1\{|D(\textbf{p})-D(\textbf{q})|=1\}+\sum_{\textbf{q}\in\mathcal{N}_{\textbf{p}}}P_{2}\cdot 1\{|D(\textbf{p})-D(\textbf{q})|>1\}\biggr), $$ \tag{S4.E10}
$$ C_{\textbf{r}}(\textbf{p},d)=C^{4}_{\text{CBCA}}(\textbf{p},d)-\min_{k}C_{\textbf{r}}(\textbf{p}-\textbf{r},k)+\min\bigl\{C_{\textbf{r}}(\textbf{p}-\textbf{r},d),\,C_{\textbf{r}}(\textbf{p}-\textbf{r},d-1)+P_{1},\,C_{\textbf{r}}(\textbf{p}-\textbf{r},d+1)+P_{1},\,\min_{k}C_{\textbf{r}}(\textbf{p}-\textbf{r},k)+P_{2}\bigr\}. $$ \tag{S4.E11}
$$ \begin{array}{lll} P_{1}=\Pi_{1}, & P_{2}=\Pi_{2} & \text{if } D_{1}<\tau_{\text{SO}},\ D_{2}<\tau_{\text{SO}},\\ P_{1}=\Pi_{1}/4, & P_{2}=\Pi_{2}/4 & \text{if } D_{1}\geq\tau_{\text{SO}},\ D_{2}<\tau_{\text{SO}},\\ P_{1}=\Pi_{1}/4, & P_{2}=\Pi_{2}/4 & \text{if } D_{1}<\tau_{\text{SO}},\ D_{2}\geq\tau_{\text{SO}},\\ P_{1}=\Pi_{1}/10, & P_{2}=\Pi_{2}/10 & \text{if } D_{1}\geq\tau_{\text{SO}},\ D_{2}\geq\tau_{\text{SO}}. \end{array} $$ \tag{S4.Ex1}
$$ D(\textbf{p})=\operatorname*{arg\,min}_{d}C(\textbf{p},d). $$ \tag{S4.E13}
$$ \begin{array}{ll} \textit{correct} & \text{if } |d-D^{R}(\textbf{pd})|\leq 1 \text{ for } d=D^{L}(\textbf{p}),\\ \textit{mismatch} & \text{if } |d-D^{R}(\textbf{pd})|\leq 1 \text{ for any other } d,\\ \textit{occlusion} & \text{otherwise}. \end{array} $$ \tag{S4.Ex2}
$$ D_{\text{SE}}(\textbf{p})=d-\frac{C_{+}-C_{-}}{2(C_{+}-2C+C_{-})}, $$ \tag{S4.E14}
$$ D_{\text{BF}}(\textbf{p})=\frac{1}{W(\textbf{p})}\sum_{\textbf{q}\in\mathcal{N}_{\textbf{p}}}D_{\text{SE}}(\textbf{q})\cdot g(\|\textbf{p}-\textbf{q}\|)\cdot 1\{|I^{L}(\textbf{p})-I^{L}(\textbf{q})|<\tau_{\text{BF}}\}, $$ \tag{S4.E15}
$$ C^{0}_{\text{CBCA}}(\textbf{p},d)=C_{\text{CNN}}(\textbf{p},d),\qquad C^{i}_{\text{CBCA}}(\textbf{p},d)=\frac{1}{|U_{d}(\textbf{p})|}\sum_{\textbf{q}\in U_{d}(\textbf{p})}C^{i-1}_{\text{CBCA}}(\textbf{q},d), $$
References
[1] Collobert, R., Kavukcuoglu, K., and Farabet, C. (2011). Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop.
[2] Geiger, A., Lenz, P., Stiller, C., and Urtasun, R. (2013). Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR).
[3] Haeusler, R., Nair, R., and Kondermann, D. (2013). Ensemble learning for confidence measures in stereo vision. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 305–312. IEEE.
[4] Hirschmuller, H. (2008). Stereo processing by semiglobal matching and mutual information. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(2):328–341.
[5] Hirschmuller, H. and Scharstein, D. (2009). Evaluation of stereo matching costs on images with radiometric differences. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(9):1582–1599.
[6] Kong, D. and Tao, H. (2004). A method for learning matching errors for stereo computation. In BMVC, pages 1–10.
[7] Kong, D. and Tao, H. (2006). Stereo matching via learning multiple experts behaviors. In BMVC, pages 97–106.
[8] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114.
[9] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
[10] Li, Y. and Huttenlocher, D. P. (2008). Learning for stereo vision using the structured support vector machine. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE.
[11] Mei, X., Sun, X., Zhou, M., Wang, H., Zhang, X., et al. (2011). On building an accurate stereo matching system on graphics hardware. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 467–474. IEEE.
[12] Peris, M., Maki, A., Martull, S., Ohkawa, Y., and Fukui, K. (2012). Towards a simulation driven stereo vision system. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 1038–1042. IEEE.
[13] Scharstein, D. and Pal, C. (2007). Learning conditional random fields for stereo. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pages 1–8. IEEE.
[14] Scharstein, D. and Szeliski, R. (2002). A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1-3):7–42.
[15] Spyropoulos, A., Komodakis, N., and Mordohai, P. (2014). Learning to detect ground control points for improving the accuracy of stereo matching. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1621–1628. IEEE.
[16] Vogel, C., Roth, S., and Schindler, K. (2014). View-consistent 3D scene flow estimation over multiple frames. In Computer Vision–ECCV 2014, pages 263–278. Springer.
[17] Vogel, C., Schindler, K., and Roth, S. (2013). Piecewise rigid scene flow. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 1377–1384. IEEE.
[18] Yamaguchi, K., Hazan, T., McAllester, D., and Urtasun, R. (2012). Continuous Markov random fields for robust stereo estimation. In Computer Vision–ECCV 2012, pages 45–58. Springer.
[19] Yamaguchi, K., McAllester, D., and Urtasun, R. (2013). Robust monocular epipolar flow estimation. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1862–1869. IEEE.
[20] Yamaguchi, K., McAllester, D., and Urtasun, R. (2014). Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In Computer Vision–ECCV 2014, pages 756–771. Springer.
[21] Zhang, K., Lu, J., and Lafruit, G. (2009). Cross-based local stereo matching using orthogonal integral images. Circuits and Systems for Video Technology, IEEE Transactions on, 19(7):1073–1079.
[22] Zhang, L. and Seitz, S. M. (2007). Estimating optimal parameters for MRF stereo from a single image pair. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(2):331–342.