
Prediction Under Uncertainty with Error-Encoding Networks

Mikael Henaff, Junbo Zhao and Yann LeCun, Facebook AI Research, Courant Institute, New York University

Abstract

In this work we introduce a new framework for performing temporal predictions in the presence of uncertainty. It is based on a simple idea of disentangling components of the future state which are predictable from those which are inherently unpredictable, and encoding the unpredictable components into a low-dimensional latent variable which is fed into a forward model. Our method uses a supervised training objective which is fast and easy to train. We evaluate it in the context of video prediction on multiple datasets and show that it is able to consistently generate diverse predictions without the need for alternating minimization over a latent space or adversarial training.


Introduction

Learning forward models in time series is a central task in artificial intelligence, with applications in unsupervised learning, planning and compression. A major challenge in this task is how to handle the multi-modal nature of many time series. When there are multiple valid ways in which a time series can evolve, training a model using classical ℓ1 or ℓ2 losses produces predictions which are the average or median of the different outcomes across each dimension, which is itself often not a valid prediction.

In recent years, Generative Adversarial Networks (Goodfellow et al., 2014) have been introduced, a general framework where the prediction problem is formulated as a minimax game between the predictor function and a trainable discriminator network representing the loss. By using a trainable loss function, it is in theory possible to handle multiple output modes since a generator which covers each of the output modes will fool the discriminator leading to convergence. However, a generator which covers a single mode can also fool the discriminator and converge, and this behavior of mode collapse has been widely observed in practice. Some workarounds have been introduced to resolve or partially reduce mode-collapsing, such as minibatch discrimination, adding parameter noise (Salimans et al., 2016), backpropagating through the unrolled discriminator (Metz et al., 2016) and using multiple GANs to cover different modes (Tolstikhin et al., 2017). However, many of these techniques can bring additional challenges such as added complexity of implementation and increased computational cost. The mode collapsing problem becomes even more pronounced in the conditional generation setting when the output is highly dependent on the context, such as video prediction (Mathieu et al., 2015; Isola et al., 2016).

In this work, we introduce a novel architecture that allows for robust multimodal conditional predictions in time series data. It is based on a simple intuition of separating the future state into a deterministic component, which can be predicted from the current state, and a stochastic (or difficult to predict) component which accounts for the uncertainty regarding the future mode. By training a model deterministically, we can obtain this factorization in the form of the model's prediction together with the prediction error with respect to the true state. This error can be encoded as a low-dimensional latent variable which is fed back into the model to accurately correct the deterministic prediction by incorporating this additional information. We call this model the Error Encoding Network (EEN). In a nutshell, this framework contains three function mappings at each timestep: (i) a mapping from the current state to the future state, which separates the future state into deterministic and non-deterministic components; (ii) a mapping from the non-deterministic component of the future state to a low-dimensional latent vector; (iii) a mapping from the current state to the future state conditioned on the latent vector, which encodes the mode information of the future state. While the training procedure involves all these mappings, the inference phase involves only (iii).

The model is trained end-to-end using a supervised learning objective and latent variables are computed using a learned parametric function, leading to easy and fast training. We apply this method to video datasets from games, robotic manipulation and simulated driving, and show that the method is able to consistently produce multimodal predictions of future video frames for all of them. Although we focus on video in this work, the method itself is general and can in principle be applied to any continuous-valued time series.

Model

Many natural processes carry some degree of uncertainty. This uncertainty may be due to an inherently stochastic process, a deterministic process which is partially observed, or it may be due to the complexity of the process being greater than the capacity of the forward model. One natural way of dealing with uncertainty is through latent variables, which can be made to account for aspects of the target that are not explainable from the observed input.

Assume we have a set of continuous vector-valued input-target pairs ( x i , y i ) , where the targets depend on both the inputs and some inherently unpredictable factors. For example, the inputs could be a set of consecutive video frames and the target could be the following frame. Classical latent variable models such as k -means or mixtures of Gaussians are trained by alternately minimizing the loss with respect to the latent variables and model parameters; in the probabilistic case this is the Expectation-Maximization algorithm (Dempster et al., 1977). In the case of a neural network model f θ ( x i , z ) , continuous latent variables can be optimized using gradient descent and the model can be trained with the following procedure:
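As a concrete illustration, the alternating-minimization procedure (Algorithm 1) can be sketched on a toy linear model. The model f(x, z) = Ax + Bz, the dimensions, learning rates and iteration counts below are illustrative stand-ins for a neural network, not the paper's actual setup.

```python
import numpy as np

# Toy sketch of Algorithm 1: alternately optimize the latent z by
# gradient descent on the loss, then update the model parameters.
# A linear model f(x, z) = A @ x + B @ z stands in for f_theta.
rng = np.random.default_rng(0)
dx, dz, dy = 4, 2, 3                      # input, latent, target dims
A = 0.1 * rng.normal(size=(dy, dx))
B = 0.1 * rng.normal(size=(dy, dz))

def f(x, z):
    return A @ x + B @ z

def loss(y, y_hat):
    return 0.5 * np.sum((y - y_hat) ** 2)

alpha, beta, K = 0.1, 0.01, 20            # learning rates, inner steps

def train_step(x, y):
    """One outer iteration of the alternating procedure on one sample."""
    global A, B
    z = rng.normal(size=dz)               # initialize z ~ N(0, 1)
    for _ in range(K):                    # inner loop: optimize z
        z = z - alpha * (B.T @ (f(x, z) - y))
    resid = f(x, z) - y                   # then update the parameters
    A = A - beta * np.outer(resid, x)
    B = B - beta * np.outer(resid, z)
    return loss(y, f(x, z))

x, y = rng.normal(size=dx), rng.normal(size=dy)
losses = [train_step(x, y) for _ in range(50)]
```

Repeatedly fitting even a single (x, y) pair this way drives the loss down, at the cost of an inner optimization loop at every step; the EEN replaces this inner loop with a single forward pass through a learned function φ.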

Our approach is based on two observations. First, the latent variable z should represent what is not explainable using the input x i . Ideally, the model should make use of the input x i and only use z to account for what is not predictable from it. Second, if we are using gradient descent to optimize the latent variables, z will be a continuous function of x i and y i , although a possibly highly nonlinear one.

Our model has two settings: a deterministic setting, where it produces a prediction using only x i , and a conditional setting, where it produces a prediction using x i and a latent variable z . We can switch to the deterministic setting by fixing z = 0 ; optionally, we can also have a separate network or set of weights for each setting. We first train the model f θ ( x, z ) in the deterministic setting to minimize the following loss over the training set:

L_d(θ) = Σ_i ‖y_i − f_θ(x_i, 0)‖     (1)

Here the norm can denote ℓ1 , ℓ2 or any other loss which is a function of the difference between the target and the prediction. Given sufficient data and capacity, f will learn to extract all the information possible about each y i from the corresponding x i , and what is inherently unpredictable will be contained within the residual error, y_i − f_θ(x_i, 0) .

Figure 1: Model Architecture. The switch changes between the deterministic setting where z = 0 and the conditional setting where z is a latent variable representing the inherently unpredictable aspects of the target. The switch can also change the parameters used in the encoder and decoder.

Once f is fully trained in the deterministic setting, we save a copy of the parameters θ⁻ and then continue training by minimizing the following loss over the training data:

L_c(θ, φ) = Σ_i ‖y_i − f_θ(x_i, φ(y_i − f_{θ⁻}(x_i, 0)))‖     (2)

Here, φ is a learned parametric function which maps the residual error of the model in its deterministic setting to a low-dimensional latent variable z which encodes the identity of the mode to which the future state belongs. This is then used as input to f in its conditional setting to more accurately predict y i , conditioned on knowledge of the proper mode. For each sample, we perform two passes through f : a first pass in the deterministic setting with z = 0 , using the parameters θ⁻ which minimize (1), to compute the residual error which will be input to φ ; and a second pass in the conditional setting using the output of φ as z and the current set of parameters θ .

The fact that z is a function of the residual prediction error y_i − f_{θ⁻}(x_i, 0) reflects the intuition that it should only account for what is not explainable by the input, while still being a continuous function of x i and y i . Note that using a copy of the previous weights θ⁻ helps prevent information that could be predicted from x i from being stored in z , which could happen if we used the current weights θ which may become different from θ⁻ over time. As an alternative, we could use a single set of weights and keep minimizing L_d jointly with L_c to prevent this from happening. We tried both methods and found that using a previous version of the weights worked better in some cases.

The model architecture is shown in Figure 1. In our experiments, we used the architecture f_θ(x, z) = f_2(f_1(x) + Wz) , where f_1 and f_2 are the encoder and decoder of the state respectively. Note that z is typically of much lower dimension than the residual error y_i − f_{θ⁻}(x_i, 0) , which prevents the network from learning a trivial solution where f would simply invert φ and cancel the error from the prediction. This forces the φ network to map the errors to general representations which can be reused across different samples and correspond to different modes of the conditional distribution.
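A minimal sketch of the two-pass training step follows, with linear maps standing in for the convolutional encoder f_1, decoder f_2 and error encoder φ; all names, shapes and initializations here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Sketch of one EEN training step: f_theta(x, z) = f2(f1(x) + W z).
# Linear maps stand in for the convolutional networks in the paper.
rng = np.random.default_rng(1)
dx, dh, dz, dy = 6, 4, 2, 6               # input, hidden, latent, target

f1 = 0.1 * rng.normal(size=(dh, dx))      # encoder
f2 = 0.1 * rng.normal(size=(dy, dh))      # decoder
W = 0.1 * rng.normal(size=(dh, dz))       # latent projection
Phi = 0.1 * rng.normal(size=(dz, dy))     # error encoder phi

def f(x, z, enc, dec):
    return dec @ (enc @ x + W @ z)

x, y = rng.normal(size=dx), rng.normal(size=dy)

# Pass 1 (deterministic): predict with z = 0 using the frozen copy
# theta_minus of the weights, and form the residual error.
f1_minus, f2_minus = f1.copy(), f2.copy()
residual = y - f(x, np.zeros(dz), f1_minus, f2_minus)

# Pass 2 (conditional): encode the residual into a low-dimensional
# latent z, then predict again with the current weights.
z = Phi @ residual
conditional_loss = np.sum((y - f(x, z, f1, f2)) ** 2)
```

Because z has far fewer dimensions than the residual (2 versus 6 here), φ cannot simply hand the full error back to f; it must compress it into a mode code.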

To perform inference after the network is trained, we first extract and save the latent variables z_i = φ(y_i − f_{θ⁻}(x_i, 0)) for each sample in the training set. Given some new input x ′ , we can then generate different predictions by computing f_θ(x′, z′) for different z ′ ∈ { z i } . In this work, we adopt the simple strategy of sampling uniformly from this set to generate new samples; however, more sophisticated methods could be used, such as fitting a conditional distribution p ( z | x ) and sampling from it.
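The inference procedure can be sketched as follows; the placeholder linear maps and the size of the stored code bank are assumptions for illustration.

```python
import numpy as np

# Sketch of EEN inference: store the latent codes z_i extracted from
# the training residuals, then generate diverse predictions for a new
# input by pairing it with uniformly sampled stored codes.
rng = np.random.default_rng(2)
dx, dz, dy = 4, 2, 3
W = rng.normal(size=(dy, dx))             # placeholder trained model
B = rng.normal(size=(dy, dz))
Phi = rng.normal(size=(dz, dy))           # placeholder error encoder

def f(x, z):
    return W @ x + B @ z

# Codes z_i = phi(residual_i) collected over 100 training samples.
train_residuals = rng.normal(size=(100, dy))
z_bank = train_residuals @ Phi.T

# For a new input, sample stored codes uniformly to get 5 predictions.
x_new = rng.normal(size=dx)
predictions = [f(x_new, z_bank[rng.integers(len(z_bank))])
               for _ in range(5)]
```

Each sampled code steers the same conditional model toward a different future mode, which is how one input yields several distinct generations.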

Related Work

In recent years a number of works have explored video prediction. These typically train models to predict future frames with the goal of learning representations which disentangle factors of variation and can be used for unsupervised learning (Srivastava et al., 2015; Villegas et al., 2017; Denton & Birodkar, 2017), or learn action-conditional forward models which can be used for planning (Oh et al., 2015; Finn et al., 2016; Agrawal et al., 2016; Kalchbrenner et al., 2016). In the first case, the predictions are deterministic and ignore the possibly multimodal nature of the time series. In the second, it is possible to make different predictions about the future by conditioning on different actions, however this requires that the training data includes additional action labels. Our work makes different predictions about the future by instead conditioning on latent variables which are extracted in an unsupervised manner from the videos themselves.

Several works have used adversarial losses in the context of video prediction. The work of (Mathieu et al., 2015) used a multiscale architecture and a combination of several different losses to predict future frames in natural videos. They found that the addition of the adversarial loss and a gradient difference loss improved the generated image quality, in particular by reducing the blur effects which are common when using the ℓ2 loss. However, they also note that the generator learns to ignore the noise and produces similar outputs to a deterministic model trained without noise. This observation was also made by (Isola et al., 2016) when training conditional networks to perform image-to-image translation.

Other works have used models for video prediction where latent variables are inferred using alternating minimization. The model in (Vondrick et al., 2015) includes a discrete latent variable which was used to choose between several different networks for predicting hidden states of future video frames obtained using a pretrained network. This is more flexible than a purely deterministic model, however the use of a discrete latent variable still limits the possible future modes to a discrete set. The work of (Goroshin et al., 2015) also made use of latent variables to model uncertainty, which were inferred through alternating minimization. In contrast, our model infers continuous latent variables through a learned parametric function. This is related to algorithms which learn to predict the solution of an iterative optimization procedure (Gregor & LeCun, 2010).

Recent work has shown that good generative models can be learned by jointly learning representations in a latent space together with the parameters of a decoder model (Bojanowski et al., 2017). This leads to easier training than adversarial networks. This generative model is also learned by alternating minimization over the latent variables and parameters of the decoder model, however the latent variables for each sample are saved after each update and optimization resumes when the corresponding sample is drawn again from the training set. This is related to our method, with the difference that rather than saving the latent variables for each sample we compute them through a learned function of the deterministic network's prediction error.

Our work is related to predictive coding models (Rao & Ballard, 1999; Spratling, 2008; Chalasani & Principe, 2013; Lotter et al., 2016) and chunking architectures (Schmidhuber, 1992), which also pass residual errors or incorrectly predicted inputs between different parts of the network. It differs in that these models pass errors upwards to higher layers in the network at each timestep, whereas our method passes the compressed error signal from the deterministic model backwards in time to serve as input for the model in its conditional setting at the previous timestep.

Experiments

We tested our method on five different video datasets from different areas such as games (Atari Breakout, Atari Seaquest and Flappy Bird), robot manipulation (Agrawal et al., 2016) and simulated driving (Zhang & Cho, 2016). These datasets have a well-defined multimodal structure, where the environment can change due to the actions of the agent or other stochastic factors, and they span a diverse range of visual environments. For each dataset, we trained our model to predict the following 1 or 4 frames conditioned on the previous 4 frames. We also trained a deterministic baseline model and a GAN to compare performance. Code to train our models and obtain video generations is available at https://github.com/mbhenaff/EEN .

The deterministic model and EEN were trained using the ℓ2 loss for all datasets except the Robot dataset, where we found that the ℓ1 loss gave better-defined predictions. Although more sophisticated losses exist, such as the Gradient Difference loss (Mathieu et al., 2015), our goal here was to evaluate whether our model could capture multimodal structure such as objects moving or appearing on the screen or perspective changing in multiple different realistic ways. We used the same architecture across all tasks, namely a 3-layer convolutional network followed by a 3-layer deconvolutional network, all with 64 feature maps at each layer and batch normalization. We did not use pooling and instead used strided convolutions, similar to the DCGAN architecture (Radford et al., 2015). The parametric function φ mapping the prediction error to latent variables was also a multilayer convolutional network followed by two fully-connected layers. For Atari Breakout we used 2 latent variables, for Seaquest, Flappy Bird and the Robot dataset we used 8, and for driving we used 32. To train our network we used the ADAM optimizer (Kingma & Ba, 2014) with default parameters and learning rate 0.0005 for all tasks. The deterministic baseline model and the GAN had the same encoder-decoder architecture as the EEN, with twice as many feature maps.

Figure 2: Generations on Breakout. Left 4 frames are given, right 4 frames are generated. Note that the paddle changes location for the different generations. Best viewed with zoom.

Datasets

We now describe the video datasets we used.

Atari Games We used a pretrained A2C agent (Mnih et al., 2016) 1 to generate episodes of gameplay for the Atari games Breakout and Seaquest (Bellemare et al., 2012) using a standard video preprocessing pipeline, i.e. downsampling video frames to 84 × 84 pixels and converting to grayscale. We then trained our forward model using 4 consecutive frames as input to predict either the following 1 frame or 4 frames.

1 https://github.com/ikostrikov/pytorch-a2c-ppo-acktr

Figure 3: Generations on Seaquest. Left 4 frames are given, right 4 frames are generated. Note that the submarine changes orientation for the different generations. Best viewed with zoom.
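The preprocessing just described can be sketched as follows; the nearest-pixel resize is a crude stand-in for the interpolation a real pipeline would use.

```python
import numpy as np

# Sketch of standard Atari preprocessing: convert an RGB frame to
# grayscale and downsample it to 84 x 84.
def preprocess(frame):
    """frame: (H, W, 3) uint8 RGB -> (84, 84) float grayscale in [0, 1]."""
    gray = frame.astype(np.float32) @ np.array([0.299, 0.587, 0.114])
    # Crude nearest-pixel resize: sample a uniform 84 x 84 grid.
    rows = np.linspace(0, gray.shape[0] - 1, 84).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, 84).astype(int)
    return gray[np.ix_(rows, cols)] / 255.0

frame = np.random.default_rng(4).integers(0, 256, size=(210, 160, 3),
                                          dtype=np.uint8)
out = preprocess(frame)   # shape (84, 84), values in [0, 1]
```

Stacking 4 such frames then gives the model's conditioning input described in the text.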

Flappy Bird We used the OpenAI Gym environment Flappy Bird 2 and had a human player play approximately 50 episodes of gameplay. In this environment, the player controls a moving bird which must navigate between obstacles appearing at different heights. We trained the model to predict the next 4 frames using the previous 4 frames as input, all of which were rescaled to 128 × 72 pixel color images.

Robot Manipulation We used the dataset of (Agrawal et al., 2016), which consists of 240 × 240 pixel color images of objects on a table before and after manipulation by a robot. The robot pokes the object at a random location with random angle and duration, causing it to move; hence the manipulation does not depend on the environment except for the location of the object. Our model was trained to take a single image as input and predict the following image.

Figure 4: panels include a) Deterministic Baseline and c) Generation 2.

Simulated Driving We used the dataset from (Zhang & Cho, 2016), which consists of color videos from the front of a car taken within the TORCS simulated driving environment. This car is driven by an agent whose policy is to follow the road and pass or avoid other cars while staying within the speed limit. Here we again trained the model to predict 4 frames using the 4 previous frames as input. Each image was rescaled to 160 × 72 pixels as in the original work.

Results

Our experiments were designed to test whether our method can generate multiple realistic predictions given the start of a video sequence. We first report qualitative results in the form of visualizations. In addition to the figures in this paper, we provide a link to videos which facilitate viewing 3 . An example of generated frames in Atari Breakout is shown in Figure 2. For the baseline model, the image of the paddle gets increasingly diffuse over time, which reflects the model's uncertainty as to its future location, while the static background remains well defined. The residual, which is the difference between the ground truth and the deterministic prediction, only depicts the movement of the ball and the paddle, which the deterministic model is unable to predict. This is encoded into the latent variables z through the learned function φ , which takes the residual as input. By sampling different z vectors from the training set, we obtain three different generations for the same conditioning frames. For these we see a well-defined paddle executing different movement sequences starting from its initial location.

Figure 3 shows generations for Atari Seaquest. Again we see the baseline model captures most of the features on the screen except for the agent's movement, which appears in the residual. This is the information that will be encoded in the latent variables, and by sampling different latent variables we obtain the generations below where the submarine changes direction.

We next evaluated our method on the Robot dataset. For this dataset the robot pokes the object with random direction and force which cannot be predicted from the current state. The prediction of the baseline model blurs the object but does not change its location or angle. In contrast, our model is able to produce a diverse set of predictions where the object is moved to different adjacent locations, as shown in Figure 4.

3 www.mikaelhenaff.net/eenvideos.html

Figure 5: Generated frames on Flappy Bird. First 4 are given, last 4 are generated. Note that the pipe in the last frame appears at different heights. Best viewed with zoom.

Figures 5 and 6 show generated frames on Flappy Bird. Flappy Bird is a simple game which is deterministic except for two sources of stochasticity: the actions of the player and the height of new pipes appearing on the screen. In the first example, we see that by changing the latent variable we generate two sequences with pipes entering at different moments and heights and one sequence where no pipe appears. In the second example, changing the latent variable changes the height of the bird. The EEN is thus able to model both sources of uncertainty in the environment. Additional examples can be found at the provided video link.

The last dataset we evaluated our method on was the TORCS driving simulator. Here we found that generating frames with different z samples changed the location of stripes on the road, and also produced translations and dilations of the frame as would happen when turning the steering wheel or changing speed. These effects are best viewed though the video link.

We next report quantitative results. Quantitatively evaluating multimodal predictions is not obvious, since the ground truth sample is drawn from one of several possible modes and the model may generate a sample from a different mode. In this case, simply comparing the generated sample to the ground truth sample may give high loss even if the generated sample is of high quality. We therefore report the best score across different generated samples: min_k L(y, f(x, z_k)) . If the multimodal model is able to use its latent variables to generate predictions which cover several modes, generating more samples will improve the score since it increases the chance that a generated sample will be from the same mode as the test sample. If however the model ignores latent variables or does not capture the mode that the test sample is drawn from, generating more samples will not improve the loss. Note that if L is a valid metric in the mathematical sense (such as the ℓ1 or ℓ2 distance), this is a finite-sample approximation to the Earth Mover or Wasserstein-1 distance between the true and generated distributions on the metric space induced by L .
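The best-of-k evaluation can be sketched as follows; the toy "generations" here are just noisy copies of the ground truth, standing in for model samples.

```python
import numpy as np

# Sketch of the best-of-k metric: score a multimodal predictor by the
# best PSNR among its k generated samples for a given ground truth.
def psnr(y, y_hat, max_val=1.0):
    mse = np.mean((y - y_hat) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(3)
y = rng.uniform(size=(16, 16))                 # toy ground-truth frame
# k = 3 toy "generations": noisy copies at different noise levels.
generations = [np.clip(y + rng.normal(scale=s, size=y.shape), 0.0, 1.0)
               for s in (0.3, 0.1, 0.5)]
best_psnr = max(psnr(y, g) for g in generations)
```

Adding more samples can only raise this score, which is why a model whose latent variables genuinely vary improves with k while a mode-collapsed one stays flat.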

Figure 7 shows the best PSNR for different numbers of generated samples. For the Robot task, we report results for a model trained using the ℓ2 loss to make it consistent with the other models. We see that our model's best performance increases as more samples are generated, indicating that its generations are diverse enough to cover at least some of the modes of the test set. Also note that the GAN's performance does not change as we increase the number of samples generated, which indicates that its latent variables have little effect on the generated samples. This is consistent with findings in other work (Mathieu et al., 2015; Isola et al., 2016). We also note that the different models are not quite comparable to each other using PSNR, since the baseline model is directly optimizing the ℓ2 loss on which it is based, the EEN is optimizing it conditioned on knowledge of a specific test sample, and the GAN is optimizing a different loss altogether. Our main goal is to illustrate that our model's performance improves by this approximate measure as it generates more samples, whereas the GAN does not due to mode collapse.

Figure 7: Top PSNR for different models over varying numbers of different samples. The PSNR for the EEN increases with more samples, indicating it is able to generate predictions which span several modes, whereas the GAN does not. See text.

Conclusion

In this work, we have introduced a new framework for performing temporal prediction in the presence of uncertainty by disentangling predictable and non-predictable components of the future state. It is fast, simple to implement and easy to train without the need for an adversarial network or alternating minimization. We have provided one instantiation in the context of video prediction using convolutional networks, but it is in principle applicable to different data types and architectures. There are several directions for future work. Here, we have adopted a simple strategy of sampling uniformly from the set of stored latent variables without considering their possible dependence on the state x , and there are likely better methods. In addition, one advantage of our model is that it can extract latent variables from unseen data very quickly, since this simply requires a forward pass through a network. If latent variables encode information about actions in a manner that is easy to disentangle, this could be used to extract actions from large unlabeled datasets and perform imitation learning. Another interesting application would be to use this model for planning, having it unroll different possible futures.

Acknowledgments

We would like to thank Jiakai Zhang and Kyunghyun Cho for sharing their dataset with us, and Martin Arjovsky, Arthur Szlam and Gabriel Synnaeve for helpful discussions.

Algorithm 1 Train latent variable model with alternating minimization

Require: Learning rates α, β ; number of iterations K
1: repeat
2:   Sample ( x i , y i ) from the dataset
3:   Initialize z ∼ N(0, 1)
4:   i ← 1
5:   while i ≤ K do
6:     z ← z − α ∇_z L(y_i, f_θ(x_i, z))
7:     i ← i + 1
8:   θ ← θ − β ∇_θ L(y_i, f_θ(x_i, z))
9: until converged

In this work we introduce a new framework for performing temporal predictions in the presence of uncertainty. It is based on a simple idea of disentangling components of the future state which are predictable from those which are inherently unpredictable, and encoding the unpredictable components into a low-dimensional latent variable which is fed into a forward model. Our method uses a supervised training objective which is fast and easy to train. We evaluate it in the context of video prediction on multiple datasets and show that it is able to consistently generate diverse predictions without the need for alternating minimization over a latent space or adversarial training.

Learning forward models in time series is a central task in artificial intelligence, with applications in unsupervised learning, planning and compression. A major challenge in this task is how to handle the multi-modal nature of many time series. When there are multiple valid ways in which a time series can evolve, training a model using classical ℓ1subscriptℓ1\ell_{1} or ℓ2subscriptℓ2\ell_{2} losses produces predictions which are the average or median of the different outcomes across each dimension, which is itself often not a valid prediction.

In recent years, Generative Adversarial Networks (Goodfellow et al., 2014) have been introduced, a general framework where the prediction problem is formulated as a minimax game between the predictor function and a trainable discriminator network representing the loss. By using a trainable loss function, it is in theory possible to handle multiple output modes since a generator which covers each of the output modes will fool the discriminator leading to convergence. However, a generator which covers a single mode can also fool the discriminator and converge, and this behavior of mode collapse has been widely observed in practice. Some workarounds have been introduced to resolve or partially reduce mode-collapsing, such as minibatch discrimination, adding parameter noise (Salimans et al., 2016), backpropagating through the unrolled discriminator (Metz et al., 2016) and using multiple GANs to cover different modes (Tolstikhin et al., 2017). However, many of these techniques can bring additional challenges such as added complexity of implementation and increased computational cost. The mode collapsing problem becomes even more pronounced in the conditional generation setting when the output is highly dependent on the context, such as video prediction (Mathieu et al., 2015; Isola et al., 2016).

In this work, we introduce a novel architecture that allows for robust multimodal conditional predictions in time series data. It is based on a simple intuition of separating the future state into a deterministic component, which can be predicted from the current state, and a stochastic (or difficult to predict) component which accounts for the uncertainty regarding the future mode. By training a model deterministically, we can obtain this factorization in the form of the model’s prediction together with the prediction error with respect to the true state. This error can be encoded as a low-dimensional latent variable which is fed back into the model to accurately correct the determinisic prediction by incorporating this additional information. We call this model the Error Encoding Network (EEN). In a nutshell, this framework contains three function mappings at each timestep: (i) a mapping from the current state to the future state, which separates the future state into deterministic and non-deterministic components; (ii) a mapping from the non-deterministic component of the future state to a low-dimensional latent vector; (iii) a mapping from the current state to the future state conditioned on the latent vector, which encodes the mode information of the future state. While the training procedure involves all these mappings, the inference phase involves only (iii).

The model is trained end-to-end using a supervised learning objective and latent variables are computed using a learned parametric function, leading to easy and fast training. We apply this method to video datasets from games, robotic manipulation and simulated driving, and show that the method is able to consistently produce multimodal predictions of future video frames for all of them. Although we focus on video in this work, the method itself is general and can in principle be applied to any continuous-valued time series.

Many natural processes carry some degree of uncertainty. This uncertainty may be due to an inherently stochastic process, a deterministic process which is partially observed, or it may be due to the complexity of the process being greater than the capacity of the forward model. One natural way of dealing with uncertainty is through latent variables, which can be made to account for aspects of the target that are not explainable from the observed input.

Assume we have a set of continuous vector-valued input-target pairs (xi,yi)subscript𝑥𝑖subscript𝑦𝑖(x_{i},y_{i}), where the targets depend on both the inputs and some inherently unpredictable factors. For example, the inputs could be a set of consecutive video frames and the target could be the following frame. Classical latent variable models such as k𝑘k-means or mixtures of Gaussians are trained by alternately minimizing the loss with respect to the latent variables and model parameters; in the probabilistic case this is the Expectation-Maximization algorithm (Dempster et al., 1977). In the case of a neural network model fθ​(xi,z)subscript𝑓𝜃subscript𝑥𝑖𝑧f_{\theta}(x_{i},z), continuous latent variables can be optimized using gradient descent and the model can be trained with the following procedure:

Our approach is based on two observations. First, the latent variable $z$ should represent what is not explainable using the input $x_i$. Ideally, the model should make use of the input $x_i$ and only use $z$ to account for what is not predictable from it. Second, if we are using gradient descent to optimize the latent variables, $z$ will be a continuous function of $x_i$ and $y_i$, although possibly a highly nonlinear one.

Our model has two settings: a deterministic setting, where it produces a prediction using only $x_i$, and a conditional setting, where it produces a prediction using $x_i$ and a latent variable $z$. We can switch to the deterministic setting by fixing $z = 0$; optionally, we can also have a separate network or set of weights for each setting. We first train the model $f_\theta(x, z)$ in the deterministic setting to minimize the following loss over the training set:

$$\mathcal{L}_d(\theta) = \sum_i \|y_i - f_\theta(x_i, 0)\| \tag{1}$$

Here the norm can denote $\ell_1$, $\ell_2$ or any other loss which is a function of the difference between the target and the prediction. Given sufficient data and capacity, $f$ will learn to extract all the information possible about each $y_i$ from the corresponding $x_i$, and what is inherently unpredictable will be contained within the residual error, $y_i - f_\theta(x_i, 0)$.

Once $f$ is fully trained in the deterministic setting, we save a copy of the parameters $\theta_-$ and then continue training by minimizing the following loss over the training data:

$$\mathcal{L}_c(\theta, \phi) = \sum_i \|y_i - f_\theta(x_i, \phi(y_i - f_{\theta_-}(x_i, 0)))\| \tag{2}$$

Here, $\phi$ is a learned parametric function which maps the residual error of the model in its deterministic setting to a low-dimensional latent variable $z$ encoding the identity of the mode to which the future state belongs. This is then used as input to $f$ in its conditional setting to more accurately predict $y_i$, conditioned on knowledge of the proper mode. For each sample, we perform two passes through $f$: a first pass in the deterministic setting with $z = 0$, using the parameters $\theta_-$ which minimize (1), to compute the residual error which will be the input to $\phi$; and a second pass in the conditional setting, using the output of $\phi$ as $z$ and the current set of parameters $\theta$.

The fact that $z$ is a function of the residual prediction error $y_i - f_{\theta_-}(x_i, 0)$ reflects the intuition that it should only account for what is not explainable by the input, while still being a continuous function of $x_i$ and $y_i$. Note that using a copy of the previous weights $\theta_-$ helps prevent information that could be predicted from $x_i$ from being stored in $z$, which could happen if we used the current weights $\theta$, since these may drift away from $\theta_-$ over time. As an alternative, we could use a single set of weights and keep minimizing $\mathcal{L}_d$ jointly with $\mathcal{L}_c$ to prevent this from happening. We tried both methods and found that using a previous version of the weights worked better in some cases.

The model architecture is shown in Figure 1. In our experiments, we used the architecture $f_\theta(x, z) = f_2(f_1(x) + Wz)$, where $f_1$ and $f_2$ are the encoder and decoder of the state, respectively. Note that $z$ is typically of much lower dimension than the residual error $y_i - f_{\theta_-}(x_i, 0)$, which prevents the network from learning a trivial solution where $f$ would simply invert $\phi$ and cancel the error from the prediction. This forces the $\phi$ network to map the errors to general representations which can be reused across different samples and correspond to different modes of the conditional distribution.
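As a minimal sketch of this factorization, consider a linear toy problem in which the unpredictable component lies along a known unit direction $u$. The maps standing in for $f_1$, $f_2$, $W$ and $\phi$ below are hand-chosen for illustration, not learned as in the actual model:

```python
import numpy as np

rng = np.random.default_rng(1)

d = 16
A = rng.normal(size=(d, d))          # stand-in for the predictable dynamics
u = np.zeros(d); u[0] = 1.0          # unit direction of the unpredictable mode

def f(x, z):
    # f_theta(x, z) = f2(f1(x) + W z): here f1 = A, f2 = identity, W = u.
    return A @ x + z * u

def phi(residual):
    # Encode the residual error into a 1-D latent: projection onto u.
    return float(u @ residual)

x = rng.normal(size=d)
s = rng.choice([-1.0, 1.0])          # unobserved binary mode of the future
y = A @ x + s * u                    # true future state

# Pass 1 (deterministic, z = 0): the unpredictable mode is missed.
err_det = np.linalg.norm(y - f(x, 0.0))

# Pass 2 (conditional): encode the residual, then correct the prediction.
z = phi(y - f(x, 0.0))
err_cond = np.linalg.norm(y - f(x, z))
```

Because the residual lies entirely in the span of $u$, the one-dimensional latent recovers the mode exactly and the conditional pass cancels the deterministic error; in the real model both the encoder $\phi$ and the correction are learned networks.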

To perform inference after the network is trained, we first extract and save the latent variables $z_i = \phi(y_i - f_{\theta_-}(x_i, 0))$ for each sample in the training set. Given some new input $x'$, we can then generate different predictions by computing $f_\theta(x', z')$ for different $z' \in \{z_i\}$. In this work, we adopt a simple strategy of sampling uniformly from this set to generate new samples; however, more sophisticated methods could be used, such as fitting a conditional distribution $p(z \mid x)$ and sampling from it.
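The uniform-sampling strategy amounts to drawing indices into a stored latent bank. A sketch, where `predict` stands in for the trained conditional model (both the bank contents and the model here are placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical latent bank extracted from the training set:
# z_i = phi(y_i - f(x_i, 0)) for each training sample.
z_bank = rng.normal(size=(1000, 8))   # 1000 samples, 8 latent dimensions

def predict(x, z):
    # Placeholder for the trained conditional model f_theta(x, z).
    return x + z.sum()

x_new = np.ones(4)

# Draw K latents uniformly from the bank to produce K diverse predictions.
K = 5
idx = rng.integers(len(z_bank), size=K)
generations = [predict(x_new, z_bank[i]) for i in idx]
```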

In recent years a number of works have explored video prediction. These typically train models to predict future frames with the goal of learning representations which disentangle factors of variation and can be used for unsupervised learning (Srivastava et al., 2015; Villegas et al., 2017; Denton & Birodkar, 2017), or learn action-conditional forward models which can be used for planning (Oh et al., 2015; Finn et al., 2016; Agrawal et al., 2016; Kalchbrenner et al., 2016). In the first case, the predictions are deterministic and ignore the possibly multimodal nature of the time series. In the second, it is possible to make different predictions about the future by conditioning on different actions, however this requires that the training data includes additional action labels. Our work makes different predictions about the future by instead conditioning on latent variables which are extracted in an unsupervised manner from the videos themselves.

Several works have used adversarial losses in the context of video prediction. The work of (Mathieu et al., 2015) used a multiscale architecture and a combination of several different losses to predict future frames in natural videos. They found that the addition of the adversarial loss and a gradient difference loss improved the generated image quality, in particular by reducing the blur effects which are common when using the $\ell_2$ loss. However, they also note that the generator learns to ignore the noise and produces similar outputs to a deterministic model trained without noise. This observation was also made by (Isola et al., 2016) when training conditional networks to perform image-to-image translation.

Other works have used models for video prediction where latent variables are inferred using alternating minimization. The model in (Vondrick et al., 2015) includes a discrete latent variable which was used to choose between several different networks for predicting hidden states of future video frames obtained using a pretrained network. This is more flexible than a purely deterministic model, however the use of a discrete latent variable still limits the possible future modes to a discrete set. The work of (Goroshin et al., 2015) also made use of latent variables to model uncertainty, which were inferred through alternating minimization. In contrast, our model infers continuous latent variables through a learned parametric function. This is related to algorithms which learn to predict the solution of an iterative optimization procedure (Gregor & LeCun, 2010).

Recent work has shown that good generative models can be learned by jointly learning representations in a latent space together with the parameters of a decoder model (Bojanowski et al., 2017). This leads to easier training than adversarial networks. This generative model is also learned by alternating minimization over the latent variables and parameters of the decoder model, however the latent variables for each sample are saved after each update and optimization resumes when the corresponding sample is drawn again from the training set. This is related to our method, with the difference that rather than saving the latent variables for each sample we compute them through a learned function of the deterministic network’s prediction error.

Our work is related to predictive coding models (Rao & Ballard, 1999; Spratling, 2008; Chalasani & Principe, 2013; Lotter et al., 2016) and chunking architectures (Schmidhuber, 1992), which also pass residual errors or incorrectly predicted inputs between different parts of the network. It differs in that these models pass errors upwards to higher layers in the network at each timestep, whereas our method passes the compressed error signal from the deterministic model backwards in time to serve as input for the model in its conditional setting at the previous timestep.

We tested our method on five different video datasets from different areas such as games (Atari Breakout, Atari Seaquest and Flappy Bird), robot manipulation (Agrawal et al., 2016) and simulated driving (Zhang & Cho, 2016). These datasets have a well-defined multimodal structure, since the environment can change due to the actions of the agent or other stochastic factors, and they span a diverse range of visual environments. For each dataset, we trained our model to predict the following 1 or 4 frames conditioned on the previous 4 frames. We also trained a deterministic baseline model and a GAN to compare performance. Code to train our models and obtain video generations is available at https://github.com/mbhenaff/EEN.

The deterministic model and EEN were trained using the $\ell_2$ loss for all datasets except the Robot dataset, where we found that the $\ell_1$ loss gave better-defined predictions. Although more sophisticated losses exist, such as the Gradient Difference loss (Mathieu et al., 2015), our goal here was to evaluate whether our model could capture multimodal structure such as objects moving or appearing on the screen or perspective changing in multiple different realistic ways. We used the same architecture across all tasks, namely a 3-layer convolutional network followed by a 3-layer deconvolutional network, all with 64 feature maps at each layer and batch normalization. We did not use pooling and instead used strided convolutions, similar to the DCGAN architecture (Radford et al., 2015). The parametric function $\phi$ mapping the prediction error to latent variables was also a multilayer convolutional network followed by two fully-connected layers. For Atari Breakout we used 2 latent variables, for Seaquest, Flappy Bird and the Robot dataset we used 8, and for driving we used 32. To train our network we used the ADAM optimizer (Kingma & Ba, 2014) with default parameters and learning rate 0.0005 for all tasks. The deterministic baseline model and the GAN had the same encoder-decoder architecture as the EEN, with twice as many feature maps.

We now describe the video datasets we used.

Atari Games. We used a pretrained A2C agent (Mnih et al., 2016) (https://github.com/ikostrikov/pytorch-a2c-ppo-acktr) to generate episodes of gameplay for the Atari games Breakout and Seaquest (Bellemare et al., 2012), using a standard video preprocessing pipeline, i.e. downsampling video frames to $84 \times 84$ pixels and converting to grayscale. We then trained our forward model using 4 consecutive frames as input to predict either the following 1 frame or 4 frames.
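This kind of preprocessing might be sketched as below; the luminance weights and nearest-neighbor resize are a simplification for illustration, and the actual pipeline may use a different resizing method.

```python
import numpy as np

def preprocess(frame):
    """Convert an RGB frame to grayscale and resize to 84x84 (nearest neighbor)."""
    gray = frame @ np.array([0.299, 0.587, 0.114])   # standard luminance weights
    h, w = gray.shape
    rows = np.linspace(0, h - 1, 84).astype(int)     # nearest source row per output row
    cols = np.linspace(0, w - 1, 84).astype(int)
    return gray[np.ix_(rows, cols)]

# A raw Atari frame is 210x160 RGB.
frame = np.zeros((210, 160, 3))
out = preprocess(frame)
```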

Flappy Bird. We used the OpenAI Gym environment Flappy Bird (https://gym.openai.com/envs/FlappyBird-v0/) and had a human player play approximately 50 episodes of gameplay. In this environment, the player controls a moving bird which must navigate between obstacles appearing at different heights. We trained the model to predict the next 4 frames using the previous 4 frames as input, all of which were rescaled to $128 \times 72$ pixel color images.

Robot Manipulation. We used the dataset of (Agrawal et al., 2016), which consists of $240 \times 240$ pixel color images of objects on a table before and after manipulation by a robot. The robot pokes the object at a random location with random angle and duration, causing it to move; hence the manipulation does not depend on the environment except for the location of the object. Our model was trained to take a single image as input and predict the following image.

Simulated Driving. We used the dataset from (Zhang & Cho, 2016), which consists of color videos from the front of a car taken within the TORCS simulated driving environment. This car is driven by an agent whose policy is to follow the road and pass or avoid other cars while staying within the speed limit. Here we again trained the model to predict 4 frames using the 4 previous frames as input. Each image was rescaled to $160 \times 72$ pixels as in the original work.

Our experiments were designed to test whether our method can generate multiple realistic predictions given the start of a video sequence. We first report qualitative results in the form of visualizations; in addition to the figures in this paper, we provide a link to videos which facilitate viewing (www.mikaelhenaff.net/eenvideos.html). An example of generated frames in Atari Breakout is shown in Figure 2. For the baseline model, the image of the paddle becomes increasingly diffuse over time, which reflects the model's uncertainty as to its future location, while the static background remains well defined. The residual, which is the difference between the ground truth and the deterministic prediction, depicts only the movement of the ball and the paddle, which the deterministic model is unable to predict. This is encoded into the latent variables $z$ through the learned function $\phi$, which takes the residual as input. By sampling different $z$ vectors from the training set, we obtain three different generations for the same conditioning frames. For these we see a well-defined paddle executing different movement sequences starting from its initial location.

Figure 3 shows generations for Atari Seaquest. Again we see the baseline model captures most of the features on the screen except for the agent’s movement, which appears in the residual. This is the information that will be encoded in the latent variables, and by sampling different latent variables we obtain the generations below where the submarine changes direction.

We next evaluated our method on the Robot dataset. For this dataset the robot pokes the object with random direction and force which cannot be predicted from the current state. The prediction of the baseline model blurs the object but does not change its location or angle. In contrast, our model is able to produce a diverse set of predictions where the object is moved to different adjacent locations, as shown in Figure 4.

Figures 5 and 6 show generated frames on Flappy Bird. Flappy Bird is a simple game which is deterministic except for two sources of stochasticity: the actions of the player and the height of new pipes appearing on the screen. In the first example, we see that by changing the latent variable we generate two sequences with pipes entering at different moments and heights and one sequence where no pipe appears. In the second example, changing the latent variable changes the height of the bird. The EEN is thus able to model both sources of uncertainty in the environment. Additional examples can be found at the provided video link.

The last dataset we evaluated our method on was the TORCS driving simulator. Here we found that generating frames with different $z$ samples changed the location of stripes on the road, and also produced translations and dilations of the frame as would happen when turning the steering wheel or changing speed. These effects are best viewed through the video link.

We next report quantitative results. Quantitatively evaluating multimodal predictions is not straightforward, since the ground truth sample is drawn from one of several possible modes and the model may generate a sample from a different mode. In this case, simply comparing the generated sample to the ground truth sample may give a high loss even if the generated sample is of high quality. We therefore report the best score across different generated samples: $\min_k \mathcal{L}(y, f(x, z_k))$. If the multimodal model is able to use its latent variables to generate predictions which cover several modes, generating more samples will improve the score, since it increases the chance that a generated sample will be from the same mode as the test sample. If, however, the model ignores the latent variables or does not capture the mode that the test sample is drawn from, generating more samples will not improve the loss. Note that if $\mathcal{L}$ is a valid metric in the mathematical sense (such as the $\ell_1$ or $\ell_2$ distance), this is a finite-sample approximation to the Earth Mover or Wasserstein-1 distance between the true and generated distributions on the metric space induced by $\mathcal{L}$.
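This best-of-$k$ evaluation can be sketched as follows, using PSNR as the score; the helper functions are illustrative stand-ins, not our evaluation code.

```python
import numpy as np

def psnr(y, y_hat, max_val=1.0):
    # Peak signal-to-noise ratio: 20 log10(max) - 10 log10(MSE).
    mse = np.mean((y - y_hat) ** 2)
    return 20 * np.log10(max_val) - 10 * np.log10(mse)

def best_psnr(y, generations):
    # Score each generated sample against the ground truth; keep the best.
    return max(psnr(y, g) for g in generations)

rng = np.random.default_rng(3)
y = rng.uniform(size=(84, 84))                         # toy "ground truth" frame
gens = [np.clip(y + rng.normal(scale=0.1, size=y.shape), 0, 1)
        for _ in range(10)]                            # toy generated samples

# Best score as a function of the number of generated samples k.
scores = [best_psnr(y, gens[:k]) for k in range(1, 11)]
```

By construction the best score is non-decreasing in $k$: adding samples can only match or improve it, which is why a flat curve indicates that the latent variables are being ignored.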

Figure 7 shows the best PSNR for different numbers of generated samples. For the Robot task, we report results for a model trained using the $\ell_2$ loss to make it consistent with the other models. We see that our model's best performance increases as more samples are generated, indicating that its generations are diverse enough to cover at least some of the modes of the test set. Also note that the GAN's performance does not change as we increase the number of samples generated, which indicates that its latent variables have little effect on the generated samples. This is consistent with findings in other work (Mathieu et al., 2015; Isola et al., 2016). We also note that the different models are not quite comparable to each other using PSNR, since the baseline model directly optimizes the $\ell_2$ loss on which it is based, the EEN optimizes it conditioned on knowledge of a specific test sample, and the GAN optimizes a different loss altogether. Our main goal is to illustrate that our model's performance improves by this approximate measure as it generates more samples, whereas the GAN's does not due to mode collapse.

In this work, we have introduced a new framework for performing temporal prediction in the presence of uncertainty by disentangling predictable and non-predictable components of the future state. It is fast, simple to implement and easy to train, without the need for an adversarial network or alternating minimization. We have provided one instantiation in the context of video prediction using convolutional networks, but it is in principle applicable to different data types and architectures. There are several directions for future work. Here, we have adopted a simple strategy of sampling uniformly from the $z$ distribution without considering its possible dependence on the state $x$, and there are likely better methods. In addition, one advantage of our model is that it can extract latent variables from unseen data very quickly, since this simply requires a forward pass through a network. If latent variables encode information about actions in a manner that is easy to disentangle, this could be used to extract actions from large unlabeled datasets and perform imitation learning. Another interesting application would be using this model for planning and having it unroll different possible futures.

We would like to thank Jiakai Zhang and Kyunghyun Cho for sharing their dataset with us, and Martin Arjovsky, Arthur Szlam and Gabriel Synnaeve for helpful discussions.

Figure 1: Model architecture. The switch changes between the deterministic setting, where $z = 0$, and the conditional setting, where $z$ is a latent variable representing the inherently unpredictable aspects of the target. The switch can also change the parameters used in the encoder and decoder.

Figure 2: (a) Ground truth; (b) deterministic baseline; (c) residual; (d) generations with different $z$.


Algorithm 1: Train latent variable model with alternating minimization

Require: learning rates $\alpha, \beta$; number of inner iterations $K$.
1: repeat
2:   Sample $(x_i, y_i)$ from the dataset
3:   Initialize $z \sim \mathcal{N}(0, 1)$
4:   for $k = 1, \dots, K$ do
5:     $z \leftarrow z - \alpha \nabla_z \mathcal{L}(y_i, f_\theta(x_i, z))$
6:   end for
7:   $\theta \leftarrow \theta - \beta \nabla_\theta \mathcal{L}(y_i, f_\theta(x_i, z))$
8: until converged



References

[Poke] Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking: Experiential learning of intuitive physics. CoRR, abs/1606.07419, 2016. URL http://arxiv.org/abs/1606.07419.

[Atari] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. CoRR, abs/1207.4708, 2012. URL http://arxiv.org/abs/1207.4708.

[GLO] Piotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Szlam. Optimizing the latent space of generative networks. CoRR, abs/1707.05776, 2017. URL https://arxiv.org/abs/1707.05776.

[DeepPC] Rakesh Chalasani and Jose C. Principe. Deep predictive coding networks. CoRR, abs/1301.3541, 2013. URL http://dblp.uni-trier.de/db/journals/corr/corr1301.html#abs-1301-3541.

[EM] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[DentonB17] Emily Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. CoRR, abs/1705.10915, 2017. URL http://arxiv.org/abs/1705.10915.

[FinnGL16] Chelsea Finn, Ian J. Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. CoRR, abs/1605.07157, 2016. URL http://arxiv.org/abs/1605.07157.

[GANs] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27, pp. 2672–2680. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.

[Goroshin15] Ross Goroshin, Michaël Mathieu, and Yann LeCun. Learning to linearize under uncertainty. CoRR, abs/1506.03011, 2015. URL http://arxiv.org/abs/1506.03011.

[LISTA] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21–24, 2010, Haifa, Israel, pp. 399–406, 2010. URL http://www.icml2010.org/papers/449.pdf.

[Isola2016] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. CoRR, abs/1611.07004, 2016. URL http://arxiv.org/abs/1611.07004.

[VideoPixel] Nal Kalchbrenner, Aäron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. CoRR, abs/1610.00527, 2016. URL http://arxiv.org/abs/1610.00527.

[ADAM] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.

[PredNet] William Lotter, Gabriel Kreiman, and David D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. CoRR, abs/1605.08104, 2016. URL http://arxiv.org/abs/1605.08104.

[Mathieu15] Michaël Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. CoRR, abs/1511.05440, 2015. URL http://arxiv.org/abs/1511.05440.

[Metz16] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. CoRR, abs/1611.02163, 2016. URL http://arxiv.org/abs/1611.02163.

[A3C] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783, 2016. URL http://arxiv.org/abs/1602.01783.

[Oh15] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L. Lewis, and Satinder P. Singh. Action-conditional video prediction using deep networks in Atari games. CoRR, abs/1507.08750, 2015. URL http://arxiv.org/abs/1507.08750.

[DCGAN] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015. URL http://arxiv.org/abs/1511.06434.

[rao1999pcv] R. P. N. Rao and D. H. Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2:79–87, 1999.

[Salimans2016] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. CoRR, abs/1606.03498, 2016. URL http://arxiv.org/abs/1606.03498.

[SchmidhuberChunker] Jürgen Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242, 1992.

[Spratling2008] M. W. Spratling. Predictive coding as a model of biased competition in visual attention. Vision Research, 48(12):1391–1408, 2008. ISSN 0042-6989. https://doi.org/10.1016/j.visres.2008.03.009. URL http://www.sciencedirect.com/science/article/pii/S0042698908001466.

[Srivastava15] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. CoRR, abs/1502.04681, 2015. URL http://arxiv.org/abs/1502.04681.

[AdaGAN] Ilya O. Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard Schölkopf. AdaGAN: Boosting generative models. CoRR, abs/1701.02386, 2017. URL http://arxiv.org/abs/1701.02386.

[Villegas17] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. CoRR, abs/1706.08033, 2017. URL http://arxiv.org/abs/1706.08033.

[VondrickPT15] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating the future by watching unlabeled video. CoRR, abs/1504.08023, 2015. URL http://arxiv.org/abs/1504.08023.

[Jiakai2016] Jiakai Zhang and Kyunghyun Cho. Query-efficient imitation learning for end-to-end autonomous driving. CoRR, abs/1605.06450, 2016. URL http://arxiv.org/abs/1605.06450.
