Adversarially Regularized Autoencoders
Jake (Junbo) Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, Yann LeCun
Abstract
Deep latent variable models, trained using variational autoencoders or generative adversarial networks, are now a key technique for representation learning of continuous structures. However, applying similar methods to discrete structures, such as text sequences or discretized images, has proven to be more challenging. In this work, we propose a flexible method for training deep latent variable models of discrete structures. Our approach is based on the recently-proposed Wasserstein autoencoder (WAE), which formalizes the adversarial autoencoder (AAE) as an optimal transport problem. We first extend this framework to model discrete sequences, and then further explore different learned priors targeting a controllable representation. This adversarially regularized autoencoder (ARAE) allows us to generate natural textual outputs as well as perform manipulations in the latent space to induce change in the output space. Finally, we show that the latent representation can be trained to perform unaligned textual style transfer, giving improvements in both automatic and human evaluation compared to existing methods.
Introduction
Recent work on deep latent variable models, such as variational autoencoders (Kingma & Welling, 2014) and generative adversarial networks (Goodfellow et al., 2014), has shown significant progress in learning smooth representations of complex, high-dimensional continuous data such as images. These latent variable representations facilitate the ability to apply smooth transformations in latent space in order to produce complex modifications of generated outputs, while still remaining on the data manifold.
Unfortunately, learning similar latent variable models of discrete structures, such as text sequences or discretized images, remains a challenging problem. Initial work on VAEs for text has shown that optimization is difficult, as the generative model can easily degenerate into an unconditional language model (Bowman et al., 2016). Recent work on generative adversarial networks (GANs) for text has mostly focused on dealing with the non-differentiable objective, either through policy gradient methods (Che et al., 2017; Hjelm et al., 2018; Yu et al., 2017) or with the Gumbel-Softmax distribution (Kusner & Hernandez-Lobato, 2016). However, neither approach can yet produce robust representations directly.

* Equal contribution. 1 Department of Computer Science, New York University; 2 Facebook AI Research; 3 School of Engineering and Applied Sciences, Harvard University. Correspondence to: Jake Zhao jakezhao@cs.nyu.edu.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).
In this work, we extend the adversarial autoencoder (AAE) (Makhzani et al., 2015) to discrete sequences/structures. Similar to the AAE, our model learns an encoder from an input space to an adversarially regularized continuous latent space. However unlike the AAE which utilizes a fixed prior, we instead learn a parameterized prior as a GAN. Like sequence VAEs, the model does not require using policy gradients or continuous relaxations. Like GANs, the model provides flexibility in learning a prior through a parameterized generator.
This adversarially regularized autoencoder (ARAE) can further be formalized under the recently-introduced Wasserstein autoencoder (WAE) framework (Tolstikhin et al., 2018), which also generalizes the adversarial autoencoder. This framework connects regularized autoencoders to an optimal transport objective for an implicit generative model. We extend this class of latent variable models to the case of discrete output, specifically showing that the autoencoder cross-entropy loss upper-bounds the total variation distance between the model/data distributions. Under this setup, commonly-used discrete decoders such as RNNs can be incorporated into the model. Finally, to handle non-trivial sequence examples, we consider several different (fixed and learned) prior distributions. These include a standard Gaussian prior used in image models and in the AAE/WAE models, a learned parametric generator acting as a GAN in latent variable space, and a transfer-based parametric generator that is trained to ignore targeted attributes of the input. The last prior can be directly used for unaligned transfer tasks such as sentiment or style transfer.
Experiments apply ARAE to discretized images and text
sequences. The latent variable model is able to generate varied samples that can be quantitatively shown to cover the input space, and supports consistent image and sentence manipulations obtained by moving in the latent space via interpolation and offset vector arithmetic. When the ARAE model is trained with task-specific adversarial regularization, it improves upon strong results on sentiment transfer reported in Shen et al. (2017) and produces compelling outputs on a topic transfer task using only a single shared latent space. Code is available at https://github.com/jakezhaojb/ARAE .
Background and Notation
Discrete Autoencoder Define X = V^n to be a set of discrete sequences, where V is a vocabulary of symbols. Our discrete autoencoder consists of two parameterized functions: a deterministic encoder enc_φ : X → Z with parameters φ that maps from input space to code space, and a conditional decoder p_ψ(x | z) over structures X with parameters ψ. The parameters are trained based on the cross-entropy reconstruction loss:
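Written out in the notation above, this is the expected negative log-likelihood of the input under the decoder given its own code:

```latex
\mathcal{L}_{\mathrm{rec}}(\phi, \psi) \;=\; \mathbb{E}_{x \sim \mathbb{P}_\star}\!\left[ -\log p_\psi\!\left(x \mid \mathrm{enc}_\phi(x)\right) \right]
```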
The choice of encoder and decoder parameterization is problem-specific; for example, we use RNNs for sequences. We use the notation x̂ = argmax_x p_ψ(x | enc_φ(x)) for the decoder mode, and call the model distribution P_ψ.
Generative Adversarial Networks GANs are a class of parameterized implicit generative models (Goodfellow et al., 2014). The method approximates drawing samples from a true distribution z ∼ P⋆ by instead employing a noise sample s and a parameterized generator function z̃ = g_θ(s) to produce z̃ ∼ P_z. Initial work on GANs implicitly minimized the Jensen-Shannon divergence between the distributions; recent work on the Wasserstein GAN (WGAN) (Arjovsky et al., 2017) replaces this with the Earth-Mover (Wasserstein-1) distance.
GAN training utilizes two separate models: a generator g θ ( s ) maps a latent vector from some easy-to-sample noise distribution to a sample from a more complex distribution, and a critic/discriminator f w ( z ) aims to distinguish real data and generated samples from g θ . Informally, the generator is trained to fool the critic, and the critic to tell real from generated. WGAN training uses the following min-max optimization over generator θ and critic w ,
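Following Arjovsky et al. (2017), this min-max objective takes the standard WGAN form:

```latex
\min_{\theta} \; \max_{w \in \mathcal{W}} \;\; \mathbb{E}_{z \sim \mathbb{P}_\star}\!\left[ f_w(z) \right] \;-\; \mathbb{E}_{\tilde{z} \sim \mathbb{P}_z}\!\left[ f_w(\tilde{z}) \right]
```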
where f_w : Z → R denotes the critic function and z̃ = g_θ(s) is obtained from the generator; P⋆ and P_z are the real and generated distributions. If the critic parameters w are restricted to a 1-Lipschitz function set W, this term corresponds to minimizing the Wasserstein-1 distance W(P⋆, P_z).
We use a naive approximation to enforce this property by weight-clipping, i.e. w ∈ [−ε, ε]^d (Arjovsky et al., 2017).
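As an illustration of this clipping step, the sketch below runs a few critic updates with a linear critic f_w(z) = w · z standing in for the MLP critic used later; the data, dimensions, and learning rate are made up for the example.

```python
# Toy WGAN critic update with naive weight clipping. A linear critic
# f_w(z) = w . z stands in for the MLP critic; real/fake codes are
# synthetic Gaussian samples around +1 and -1.
import random

EPS = 0.01  # clipping threshold epsilon
LR = 0.05   # critic learning rate

def critic_update(w, z_real, z_fake, lr=LR, eps=EPS):
    """One gradient-ascent step on E[f_w(z_real)] - E[f_w(z_fake)],
    followed by the clipping projection onto [-eps, eps]^d."""
    d = len(w)
    grad = [sum(z[i] for z in z_real) / len(z_real)
            - sum(z[i] for z in z_fake) / len(z_fake)
            for i in range(d)]                        # gradient of the linear objective
    w = [w[i] + lr * grad[i] for i in range(d)]       # ascend
    return [max(-eps, min(eps, wi)) for wi in w]      # clip: w in [-eps, eps]^d

random.seed(0)
z_real = [[random.gauss(1.0, 0.1) for _ in range(4)] for _ in range(64)]
z_fake = [[random.gauss(-1.0, 0.1) for _ in range(4)] for _ in range(64)]
w = [0.0] * 4
for _ in range(10):
    w = critic_update(w, z_real, z_fake)
# with these well-separated toy samples, every coordinate ends at the clip boundary
```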
Theoretical Properties
Standard GANs implicitly minimize a divergence measure (e.g. f-divergence or Wasserstein distance) between the true/model distributions. In our case, however, we implicitly minimize the divergence between learned code distributions, and it is not clear that this training objective matches the distributions in the original discrete space. Tolstikhin et al. (2018) recently showed that this style of training minimizes the Wasserstein distance between the data distribution P⋆ and the latent variable model distribution P_ψ (with density p_ψ(x) = ∫ p_ψ(x | z) p(z) dz).
In this section we apply the above result to the discrete case and show that the ARAE loss minimizes an upper bound on the total variation distance between P⋆ and P_ψ.
Definition 1 (Kantorovich's formulation of optimal transport). Let P⋆, P_ψ be distributions over X, and further let c(x, y) : X × X → R⁺ be a cost function. Then the optimal transport (OT) problem is given by
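In standard Kantorovich form, this is:

```latex
W_c(\mathbb{P}_\star, \mathbb{P}_\psi) \;=\; \inf_{\pi \in \mathcal{P}(x \sim \mathbb{P}_\star,\; y \sim \mathbb{P}_\psi)} \; \mathbb{E}_{(x, y) \sim \pi}\!\left[ c(x, y) \right]
```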
where P(x ∼ P⋆, y ∼ P_ψ) is the set of all joint distributions of (x, y) with marginals P⋆ and P_ψ.
In particular, if c(x, y) = ‖x − y‖_p^p then W_c(P⋆, P_ψ)^{1/p} is the Wasserstein-p distance between P⋆ and P_ψ. Now suppose we utilize a latent variable model to fit the data, i.e. z ∼ P_z, x ∼ P_ψ(x | z). Then Tolstikhin et al. (2018) prove the following theorem:
Theorem 1. Let G_ψ : Z → X be a deterministic function (parameterized by ψ) from the latent space Z to data space X that induces a Dirac distribution P_ψ(x | z) on X, i.e. p_ψ(x | z) = 1{x = G_ψ(z)}. Let Q(z | x) be any conditional distribution on Z with density p_Q(z | x). Define its marginal to be P_Q, which has density p_Q(z) = ∫ p_Q(z | x) p⋆(x) dx. Then,
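In the notation above, the statement (matching the WAE result of Tolstikhin et al., 2018) is:

```latex
W_c(\mathbb{P}_\star, \mathbb{P}_\psi) \;=\; \inf_{Q \,:\, \mathbb{P}_Q = \mathbb{P}_z} \; \mathbb{E}_{x \sim \mathbb{P}_\star}\, \mathbb{E}_{z \sim Q(z \mid x)}\!\left[ c\!\left(x, G_\psi(z)\right) \right]
```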
Theorem 1 essentially says that learning an autoencoder can be interpreted as learning a generative model with latent variables, as long as we ensure that the marginalized encoded space is the same as the prior. This provides theoretical justification for adversarial autoencoders (Makhzani et al., 2015), and Tolstikhin et al. (2018) used the above to train deep generative models of images by minimizing the Wasserstein-2 distance (i.e. squared loss between real/generated images). We now apply Theorem 1 to discrete autoencoders trained with cross-entropy loss.
Corollary 1 (Discrete case). Suppose x ∈ X where X is the set of all one-hot vectors of length n, and let f_ψ : Z → Δ^{n−1} be a deterministic function from the latent space Z to the (n−1)-dimensional simplex Δ^{n−1}. Further let G_ψ : Z → X be a deterministic function such that G_ψ(z) = argmax_{w ∈ X} w^⊤ f_ψ(z), and as above let P_ψ(x | z) be the Dirac distribution derived from G_ψ, i.e. p_ψ(x | z) = 1{x = G_ψ(z)}. Then the following is an upper bound on ‖P_ψ − P⋆‖_TV, the total variation distance between P⋆ and P_ψ:
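One bound consistent with this statement follows from the observation that if x ≠ G_ψ(z) then x^⊤f_ψ(z) ≤ 1/2, so 1{x ≠ G_ψ(z)} ≤ −log₂ x^⊤f_ψ(z); applying Theorem 1 with the indicator cost then gives, up to the base of the logarithm:

```latex
\left\| \mathbb{P}_\psi - \mathbb{P}_\star \right\|_{\mathrm{TV}} \;\le\; \inf_{Q \,:\, \mathbb{P}_Q = \mathbb{P}_z} \; \mathbb{E}_{x \sim \mathbb{P}_\star}\, \mathbb{E}_{z \sim Q(z \mid x)}\!\left[ -\log_2 x^\top f_\psi(z) \right]
```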
The proof is in Appendix A. For natural language we have n = |V|^m, and therefore X is the set of sentences of length m, where m is the maximum sentence length (shorter sentences are padded if necessary). The total variation (TV) distance is then bounded by the expected sentence-level cross-entropy.
Minimizing this bound is an interesting alternative to the usual maximum likelihood approach, which instead minimizes KL(P⋆, P_ψ). 5 It is also clear that −log x^⊤ f_ψ(z) = −log p_ψ(x | z), the standard autoencoder cross-entropy loss at the sentence level with f_ψ as the decoder. As the above objective is hard to minimize directly, we follow Tolstikhin et al. (2018) and consider an easier objective by (i) restricting Q(z | x) to a family of distributions induced by a deterministic encoder parameterized by φ, and (ii) using a Lagrangian relaxation of the constraint P_Q = P_z. In particular, letting Q(z | x) = 1{z = enc_φ(x)} be the Dirac distribution induced by a deterministic encoder (with associated marginal P_φ), the objective is given by
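In the notation above, the relaxed objective is the expected reconstruction cross-entropy plus a weighted latent-space Wasserstein term, with λ the Lagrangian multiplier:

```latex
\min_{\phi, \psi} \;\; \mathbb{E}_{x \sim \mathbb{P}_\star}\!\left[ -\log p_\psi\!\left(x \mid \mathrm{enc}_\phi(x)\right) \right] \;+\; \lambda \, W(\mathbb{P}_\phi, \mathbb{P}_z)
```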
5 The relationship between KL-divergence and total variation distance is also given by Pinsker's inequality, which states that 2‖P_ψ − P⋆‖²_TV ≤ KL(P⋆, P_ψ).
Note that minimizing the Wasserstein distance in the latent space, W(P_φ, P_z), is separate from the Wasserstein distance minimization in the output space in WAEs. Finally, instead of using a fixed prior (which led to mode-collapse in our experiments), we parameterize P_z implicitly by transforming a simple random variable with a generator (i.e. s ∼ N(0, I), z = g_θ(s)). This recovers the ARAE objective from the previous section.
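The resulting procedure alternates three updates per step: a reconstruction update on the autoencoder, a clipped critic update, and an adversarial update on the encoder and generator. The numpy sketch below shows only this schedule on a toy continuous problem with linear maps; all names, sizes, and learning rates are invented for the illustration, and the paper's actual models are RNNs/MLPs.

```python
# Toy sketch of the three-phase ARAE training schedule with linear stand-ins:
# encoder z = A x, decoder x_hat = B z, generator z_tilde = G s, and a
# linear critic f_w(z) = w . z trained with weight clipping.
import numpy as np

rng = np.random.default_rng(0)
d, k, eps = 8, 4, 0.01                 # data dim, code dim, clip threshold
X = rng.normal(size=(128, d))          # toy "data" batch
A = 0.1 * rng.normal(size=(k, d))      # encoder
B = 0.1 * rng.normal(size=(d, k))      # decoder
G = 0.1 * rng.normal(size=(k, k))      # generator
w = np.zeros(k)                        # critic weights

def recon_loss():
    return float(np.mean(np.sum((X - X @ A.T @ B.T) ** 2, axis=1)))

initial = recon_loss()
for _ in range(200):
    # (1) reconstruction phase: gradient descent on ||x - B A x||^2
    Z = X @ A.T                        # encoder codes
    R = X - Z @ B.T                    # residuals
    B += 0.01 * (R.T @ Z) / len(X)
    A += 0.01 * (B.T @ R.T @ X) / len(X)
    # (2) critic phase: ascend E[f_w(z_enc)] - E[f_w(z_gen)], then clip
    S = rng.normal(size=(len(X), k))   # noise samples s ~ N(0, I)
    Zg = S @ G.T                       # generated codes
    w += 0.05 * ((X @ A.T).mean(axis=0) - Zg.mean(axis=0))
    w = np.clip(w, -eps, eps)
    # (3) adversarial phase: encoder lowers the critic score on its codes,
    # generator raises the critic score on generated codes
    A -= 0.001 * np.outer(w, X.mean(axis=0))
    G += 0.001 * np.outer(w, S.mean(axis=0))

final = recon_loss()                   # reconstruction improves over training
```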
We conclude this section by noting that while the theoretical formalization of the AAE as a latent variable model was an important step, in practice there are many approximations made to the actual optimal transport objective. Meaningfully quantifying (and reducing) such approximation gaps remains an avenue for future work.
Methods and Architectures
We experiment with ARAE on three setups: (1) a small model for discretized images trained on the binarized version of MNIST, (2) a model for text sequences trained on the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015), and (3) a model for text transfer trained on the Yelp/Yahoo datasets for unaligned sentiment/topic transfer. For experiments using a learned prior, the generator architecture uses a low-dimensional s with a Gaussian prior s ∼ N(0, I), and maps it to z using an MLP g_θ. The critic f_w is also parameterized as an MLP.
The image model encodes/decodes binarized images. Here X = {0, 1}^n where n is the image size. The encoder is an MLP mapping from {0, 1}^n to R^m, enc_φ(x) = MLP(x; φ) = z. The decoder predicts each pixel in x as a parameterized logistic regression, p_ψ(x | z) = ∏_{j=1}^n σ(h)_j^{x_j} (1 − σ(h)_j)^{1−x_j} where h = MLP(z; ψ).
The text model uses a recurrent neural network (RNN) for both the encoder and decoder. Here X = V^n where n is the sentence length and V is the vocabulary of the underlying language. We define enc_φ(x) = z to be the last hidden state of an encoder RNN. For decoding we feed z as an additional input to the decoder RNN at each time step, and calculate the distribution over V at each time step via softmax, p_ψ(x | z) = ∏_{j=1}^n softmax(W h_j + b)_{x_j}, where W and b are parameters (part of ψ) and h_j is the decoder RNN hidden state. To be consistent with Corollary 1 we need to find the highest-scoring sequence x̂ under this distribution during decoding, which is intractable in general; instead we approximate this with greedy search. The text transfer model uses the same architecture as the text model but extends it with a classifier p_u(y | z), which is modeled using an MLP and trained to minimize cross-entropy.
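A minimal sketch of the greedy-search approximation: at each step take the argmax of the softmax distribution, feed the token back, and stop at an end token. The tiny "RNN" below is a hand-rolled toy recurrence with made-up weights; only the decoding loop mirrors the description above.

```python
# Greedy decoding sketch over a toy step function standing in for the
# trained decoder RNN. step_fn(prev_token_id, h) -> (logits, new_h).
import math

VOCAB = ["<s>", "</s>", "a", "b", "c"]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def decode_greedy(step_fn, h0, max_len=10):
    """Greedy argmax decoding: pick the most probable token at each step."""
    h, tok, out = h0, 0, []            # start from <s> (index 0)
    for _ in range(max_len):
        logits, h = step_fn(tok, h)
        probs = softmax(logits)
        tok = max(range(len(probs)), key=probs.__getitem__)
        if VOCAB[tok] == "</s>":
            break
        out.append(VOCAB[tok])
    return out

def toy_step(tok, h):                  # deterministic toy dynamics
    h = math.tanh(h + 0.7 * tok)
    logits = [0.0, h, 1.0 - h, 0.5 * h, -1.0]
    return logits, h

sent = decode_greedy(toy_step, h0=0.0)
```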
Table 1: Reverse PPL: Perplexity of language models trained on the synthetic samples from an ARAE/AE/LM, and evaluated on real data. Forward PPL: Perplexity of a language model trained on real data and evaluated on synthetic samples.

We further compare our approach with a standard autoencoder (AE) and the cross-aligned autoencoder (Shen et al., 2017) for transfer. In both our ARAE and standard AE experiments, the encoder output is normalized to lie on the unit sphere, and the generator output is bounded to lie in (−1, 1)^n by a tanh at the output layer.
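These two output constraints can be sketched as follows; shapes and values are illustrative only.

```python
# Sketch of the two constraints described above: the encoder code is
# projected onto the unit sphere, and the generator output is squashed
# into (-1, 1)^n with tanh.
import numpy as np

def normalize_code(z, eps=1e-8):
    """Project an encoder output onto the unit sphere."""
    return z / (np.linalg.norm(z) + eps)

def bound_generator_output(pre_activation):
    """Bound generator output to (-1, 1)^n via tanh."""
    return np.tanh(pre_activation)

z = normalize_code(np.array([3.0, 4.0]))
g = bound_generator_output(np.array([-5.0, 0.2, 5.0]))
```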
Note that learning deep latent variable models for text sequences has been a significantly more challenging empirical problem than for images. Standard models such as VAEs suffer from widely documented optimization issues. We performed experiments with the recurrent VAE introduced by Bowman et al. (2016), as well as the adversarial autoencoder (AAE) (Makhzani et al., 2015), both with Gaussian priors. We found that neither model was able to learn meaningful latent representations: the VAE simply ignored the latent code, and the AAE experienced mode-collapse and repeatedly generated the same samples. 6 Appendix F includes detailed descriptions of the hyperparameters, model architecture, and training regimes.
Experiments
Distributional Coverage
Section 4 argues that P_ψ is trained to approximate the true data distribution over discrete sequences, P⋆. While it is difficult to test for this property directly (as is the case with most GAN models), we can draw samples from the model to test the fidelity and coverage of the data space. Figure 2 shows a set of samples from discretized MNIST, and Appendix C shows a set of generations from the text ARAE.
6 However, there have been some recent successes training such models, as noted in the related work section.

Table 2: Sentiment transfer results, where we transfer from positive to negative sentiment (Top) and negative to positive sentiment (Bottom). Original sentence and transferred output (from ARAE and the Cross-Aligned AE of Shen et al. (2017)) for 6 randomly drawn examples.

A common quantitative measure of sample quality for generative models is to evaluate a strong surrogate model trained on the generated samples. While there are pitfalls to this style of evaluation (Theis et al., 2016), it has provided a starting point for image generation models. Here we use a similar method for text generation, which we call reverse perplexity: we generate 100k samples from each of the models, train an RNN language model on the generated samples, and evaluate perplexity on held-out data. 7 While similar metrics for images (e.g. Parzen windows) have been shown to be problematic, we argue that this is less of an issue for text, as RNN language models achieve state-of-the-art perplexities on text datasets. We also calculate the usual 'forward' perplexity by training an RNN language model on real data and testing on generated data. This measures the fluency of the generated samples, but cannot detect mode-collapse, a common issue in training GANs (Arjovsky & Bottou, 2017; Hu et al., 2018).
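The procedure can be sketched with a Laplace-smoothed unigram model standing in for the RNN language model; the corpora below are made-up token lists, so only the computation, not the numbers, is meaningful.

```python
# Reverse-perplexity sketch: train a language model on *generated* samples
# and measure its perplexity on held-out *real* text.
import math
from collections import Counter

def train_unigram(corpus, vocab, alpha=1.0):
    """Laplace-smoothed unigram probabilities from a list of token lists."""
    counts = Counter(tok for sent in corpus for tok in sent)
    total = sum(counts.values())
    denom = total + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}

def perplexity(model, corpus):
    """exp of the average negative log-likelihood per token."""
    nll, n = 0.0, 0
    for sent in corpus:
        for tok in sent:
            nll -= math.log(model[tok])
            n += 1
    return math.exp(nll / n)

vocab = {"the", "food", "was", "good", "bad"}
generated = [["the", "food", "was", "good"], ["the", "food", "was", "bad"]]
held_out_real = [["the", "food", "was", "good"]]
reverse_ppl = perplexity(train_unigram(generated, vocab), held_out_real)
```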
Table 1 shows these metrics for (i) ARAE, (ii) an autoencoder (AE), 8 (iii) an RNN language model (LM), and (iv) the real training set. We further find that with a fixed prior, the reverse perplexity of an AAE-style text model (Makhzani et al., 2015) was quite high (980) due to mode-collapse. All models are of the same size to allow for fair comparison. Training directly on real data (understandably) outperforms training on generated data by a large margin. Surprisingly, however, training on ARAE samples outperforms training on LM/AE samples in terms of reverse perplexity.
Unaligned Text Style Transfer
Next we evaluate the model in the context of a learned adversarial prior, as described in Section 3. We experiment with two unaligned text transfer tasks: (i) transfer of sentiment on the Yelp corpus, and (ii) transfer of topic on the Yahoo corpus (Zhang et al., 2015). For sentiment we follow the setup of Shen et al. (2017) and split the Yelp corpus into two sets of unaligned positive and negative reviews. We train ARAE with two separate decoder RNNs, one for positive sentiment, p(x | z, y = 1), and one for negative sentiment, p(x | z, y = 0), and incorporate adversarial training of the encoder to remove sentiment information from the prior. Transfer corresponds to encoding sentences of one class and decoding, greedily, with the opposite decoder. Experiments compare against the cross-aligned AE of Shen et al. (2017) and also an AE trained without the adversarial regularization. For ARAE, we experimented with different λ(1) weightings on the adversarial loss (see Section 4), with λ(1)_a = 1 and λ(1)_b = 10; both settings use λ(2) = 1. Empirically, the adversarial regularization improves transfer and perplexity, but tends to make the transferred text less similar to the original compared to the AE. Randomly selected example sentences are shown in Table 2 and additional outputs are available in Appendix G.

7 We also found this metric to be helpful for early-stopping.

8 To 'sample' from an AE we fit a multivariate Gaussian to the code space after training and generate code vectors from this Gaussian to decode back into sentence space.

Table 3: Sentiment transfer. (Top) Automatic metrics (Transfer/BLEU/Forward PPL/Reverse PPL). (Bottom) Human evaluation metrics (Transfer/Similarity/Naturalness). Cross-Aligned AE is from Shen et al. (2017).
Table 3 (top) shows the quantitative evaluation. We use four automatic metrics: (i) Transfer: how successful the model is at altering sentiment based on an automatic classifier (we use the fastText library (Joulin et al., 2017)); (ii) BLEU: the consistency between the transferred text and the original; (iii) Forward PPL: the fluency of the generated text; (iv) Reverse PPL: the extent to which the generations are representative of the underlying data distribution. Both perplexity numbers are obtained by training an RNN language model. Table 3 (bottom) shows human evaluations on the cross-aligned AE and our best ARAE model. We randomly select 1000 sentences (500/500 positive/negative), obtain the corresponding transfers from both models, and ask crowdworkers to evaluate the sentiment (Positive/Neutral/Negative) and naturalness (1-5, 5 being most natural) of the transferred sentences. We create a separate task in which we show the original and the transferred sentences, and ask them to evaluate the similarity based on sentence structure (1-5, 5 being most similar). We explicitly requested that the reader disregard sentiment in the similarity assessment.
Table 4: Topic Transfer. Random samples from the Yahoo dataset. Note the first row is from ARAE trained on titles while the following ones are from replies.
Table 5: Semi-Supervised accuracy on the natural language inference (SNLI) test set, respectively using 22.2% (medium), 10.8% (small), 5.25% (tiny) of the supervised labels of the full SNLI training set (rest used for unlabeled AE training).
Semi-Supervised Training
Latent variable models can also provide an easy method for semi-supervised training. We use a natural language inference task to compare semi-supervised ARAE with other training methods. The full SNLI training set contains 543k sentence pairs; we use supervised sets of 120k (Medium), 59k (Small), and 28k (Tiny) and use the rest of the training set for unlabeled training. As a baseline we use an AE trained on the additional data, similar to the setting explored in Dai & Le (2015). For ARAE we use the subset of unsupervised data of length < 15 (i.e. ARAE is trained on less data than the AE for unsupervised training). The results are shown in Table 5. Training on unlabeled data with an AE objective improves upon a model trained only on labeled data; training with adversarial regularization provides further gains.

![Figure 3: Left: ℓ2 norm of encoder output z and generator output z̃ during ARAE training (z is normalized, whereas the generator learns to match). Middle: Sum of the dimension-wise variances of z and generator codes z̃, as well as a reference AE. Right: Average cosine similarity of nearby sentences (by word edit-distance) for the ARAE and AE during training.](1706.04223-figure_002.png)

Figure 4: Reconstruction error (negative log-likelihood averaged over sentences) of the original sentence from a corrupted sentence. Here k is the number of swaps performed on the original sentence.
Discussion
Impact of Regularization on Discrete Encoding We further examine the impact of adversarial regularization on the encoded representation produced by the model as it is trained. Figure 3 (left) shows, as a sanity check, that the ℓ2 norms of the encoder output z and prior samples z̃ converge quickly in ARAE training. The middle plot compares the trace of the covariance matrix of these terms as training progresses, showing that the variances of the encoder codes and the prior match after several epochs.
Smoothness and Reconstruction We can also assess the 'smoothness' of the encoder model learned by ARAE (Rifai et al., 2011). We start with a simple proxy: a smooth encoder should map similar sentences to similar z values. For 250 sentences, we calculate the average cosine similarity of 100 randomly-selected sentences within an edit-distance of at most 5 to the original. The graph in Figure 3 (right) shows that the cosine similarity of nearby sentences is quite high for ARAE compared to a standard AE, and increases in early rounds of training. To further test this property, we feed noised discrete input to the encoder and (i) calculate the score given to the original input, and
(ii) compare the resulting reconstructions. Figure 4 (right) shows results for text where k words are permuted in each sentence. We observe that ARAE is able to map a noised sentence to a natural sentence (though not necessarily the denoised sentence). Figure 4 (left) shows empirical results for these experiments: we obtain the reconstruction error (negative log-likelihood) of the original non-noised sentence under the decoder, utilizing the noised code. We find that when k = 0 (i.e. no swaps), the regular AE better reconstructs the exact input. However, as the number of swaps pushes the input further away, ARAE is more likely to produce the original sentence. (Note that unlike denoising autoencoders, which require a domain-specific noising function (Hill et al., 2016; Vincent et al., 2008), the ARAE is not explicitly trained to denoise an input.)
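The k-swap corruption described above can be sketched as follows; the example sentence is hypothetical.

```python
# Corrupt a sentence by performing k random swaps of word positions,
# matching the noising procedure used in this experiment.
import random

def swap_noise(tokens, k, rng):
    """Return a copy of `tokens` with k random position swaps applied."""
    noised = list(tokens)
    for _ in range(k):
        i, j = rng.randrange(len(noised)), rng.randrange(len(noised))
        noised[i], noised[j] = noised[j], noised[i]
    return noised

rng = random.Random(0)
sent = ["the", "service", "was", "really", "fast"]
noised = swap_noise(sent, k=2, rng=rng)
```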
Manipulation through the Prior An interesting property of latent variable models such as VAEs and GANs is the ability to manipulate output samples through the prior. In particular, for ARAE, the Gaussian form of the noise sample s makes it possible to smoothly interpolate between outputs by exploiting the latent structure. While language models may provide a better estimate of the underlying probability space, constructing this style of interpolation would require combinatorial search, which makes this a useful feature of latent variable text models. In Appendix D we show interpolations for the text model, while Figure 2 (bottom) shows interpolations for the discretized MNIST ARAE.
A related property of GANs is the ability to move in the latent space via offset vectors. 9 To experiment with this property we generate sentences from the ARAE and compute vector transforms in this space to attempt to change main verbs, subjects, and modifiers (details in Appendix E). Some examples of successful transformations are shown in Figure 5 (bottom); quantitative evaluation of the success of the vector transformations is given in Figure 5 (top).
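A sketch of the offset-vector computation, with random codes standing in for encoder outputs; the result is renormalized onto the unit sphere to match the normalized code space used here.

```python
# Offset-vector sketch: estimate an attribute direction as the difference
# of mean codes of two groups, add it to a new code, and renormalize.
import numpy as np

def offset_vector(codes_with, codes_without):
    """Mean code of group A minus mean code of group B."""
    return codes_with.mean(axis=0) - codes_without.mean(axis=0)

def apply_offset(z, offset, scale=1.0):
    moved = z + scale * offset
    return moved / np.linalg.norm(moved)   # stay on the unit sphere

rng = np.random.default_rng(0)
codes_a = rng.normal(loc=0.5, size=(32, 8))    # codes with the attribute
codes_b = rng.normal(loc=-0.5, size=(32, 8))   # codes without it
z = rng.normal(size=8)
z = z / np.linalg.norm(z)
z_moved = apply_offset(z, offset_vector(codes_a, codes_b))
```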
9 Similar to the case with word vectors (Mikolov et al., 2013), Radford et al. (2016) observe that when the mean latent vector for 'men with glasses' is subtracted from the mean latent vector for 'men without glasses' and applied to an image of a 'woman without glasses', the resulting image is that of a 'woman with glasses'.
Figure 5: Top: Quantitative evaluation of transformations. Match % refers to the % of samples where at least one decoder sample (per 100) had the desired transformation in the output, while Prec. measures the average precision of the output against the original sentence. Bottom: Examples where the offset vectors produced successful transformations of the original sentence. See Appendix E for the full methodology.
Impact of Regularization on Discrete Encoding
Jake (Junbo) Zhao * 1 2 Yoon Kim * 3 Kelly Zhang 1 Alexander M. Rush 3 Yann LeCun 1 2
Smoothness and Reconstruction
Discrete Autoencoder Define X = V n to be a set of discrete sequences where V is a vocabulary of symbols. Our discrete autoencoder will consist of two parameterized functions: a deterministic encoder function enc φ : X ↦→ Z with parameters φ that maps from input space to code space, and a conditional decoder p ψ ( x | z ) over structures X with parameters ψ . The parameters are trained based on the cross-entropy reconstruction loss:
The choice of the encoder and decoder parameterization is problem-specific; for example, we use RNNs for sequences. We use the notation x̂ = argmax_x p_ψ(x | enc_φ(x)) for the decoder mode, and call the model distribution P_ψ.
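As a toy illustration (ours, not the paper's code), the sequence-level reconstruction loss reduces to summing negative log-probabilities of the gold tokens; the per-token probabilities below stand in for a decoder's softmax outputs:

```python
import math

def reconstruction_loss(gold_token_probs):
    """Cross-entropy reconstruction loss for one sequence:
    L_rec = -log p_psi(x | enc_phi(x)) = -sum_j log p(x_j | z),
    given the probability the decoder assigns to each gold token."""
    return -sum(math.log(p) for p in gold_token_probs)

# Hypothetical decoder outputs for a 3-token sentence.
loss = reconstruction_loss([0.9, 0.8, 0.95])
```

The loss is zero only when the decoder puts probability one on every gold token, and grows without bound as any gold-token probability approaches zero.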
Generative Adversarial Networks GANs are a class of parameterized implicit generative models (Goodfellow et al., 2014). The method approximates drawing samples from a true distribution z ∼ P_⋆ by instead employing a noise sample s and a parameterized generator function z̃ = g_θ(s) to produce z̃ ∼ P_z. Initial work on GANs implicitly minimized the Jensen-Shannon divergence between the distributions. Recent work on Wasserstein GAN (WGAN) (Arjovsky et al., 2017) replaces this with the Earth-Mover (Wasserstein-1) distance.
GAN training utilizes two separate models: a generator g_θ(s) maps a latent vector from some easy-to-sample noise distribution to a sample from a more complex distribution, and a critic/discriminator f_w(z) aims to distinguish real data and generated samples from g_θ. Informally, the generator is trained to fool the critic, and the critic to tell real from generated. WGAN training uses the following min-max optimization over generator θ and critic w,

min_θ max_{w ∈ W} E_{z∼P_⋆}[f_w(z)] − E_{z̃∼P_z}[f_w(z̃)],
where f_w : Z → ℝ denotes the critic function, z̃ is obtained from the generator, z̃ = g_θ(s), and P_⋆ and P_z are the real and generated distributions. If the critic parameters w are restricted to a 1-Lipschitz function set W, this term corresponds to minimizing the Wasserstein-1 distance W(P_⋆, P_z).
We use a naive approximation to enforce this property by weight-clipping, i.e. w ∈ [−ε, ε]^d (Arjovsky et al., 2017). 1
1 While we did not experiment with enforcing the Lipschitz constraint via gradient penalty (Gulrajani et al., 2017) or spectral normalization (Miyato et al., 2018), other researchers have found slight improvements by training ARAE with the gradient-penalty version of WGAN (private correspondence).
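A minimal sketch of the two WGAN critic-side operations (toy Python, not the paper's implementation): a sample estimate of the critic objective, and the naive clipping step that projects every critic weight back into [−ε, ε]:

```python
def critic_objective(f_w, real_codes, fake_codes):
    """Sample estimate of E_{z~P*}[f_w(z)] - E_{z~P_z}[f_w(z)];
    the critic ascends this quantity, the generator descends it."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean([f_w(z) for z in real_codes]) - mean([f_w(z) for z in fake_codes])

def clip_weights(w, eps=0.01):
    """Naive Lipschitz enforcement: clamp every parameter to [-eps, eps]."""
    return [max(-eps, min(eps, wi)) for wi in w]

# Toy 1-D critic f_w(z) = w * z with w = 1, and a 3-parameter clip example.
obj = critic_objective(lambda z: 1.0 * z, real_codes=[1.0, 3.0], fake_codes=[0.0, 2.0])
clipped = clip_weights([0.5, -0.3, 0.005])
```

In practice the clip would be applied to the critic's parameter tensors after each gradient step; the list version here only illustrates the projection.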
Related Work
While ideally autoencoders would learn latent spaces which compactly capture useful features that explain the observed data, in practice they often learn a degenerate identity mapping where the latent code space is free of any structure, necessitating some regularization on the latent space. A popular approach is to regularize through an explicit prior on the code space and use a variational approximation to the posterior, leading to a family of models called variational autoencoders (VAE) (Kingma & Welling, 2014; Rezende et al., 2014). Unfortunately VAEs for discrete text sequences can be challenging to train; for example, if the training procedure is not carefully tuned with techniques like word dropout and KL annealing (Bowman et al., 2016), the decoder simply becomes a language model and ignores the latent code. However there have been some recent successes through employing convolutional decoders (Yang et al., 2017; Semeniuta et al., 2017), training the latent representation as a topic model (Dieng et al., 2017; Wang et al., 2018), using the von Mises-Fisher distribution (Guu et al., 2017), and combining VAE with iterative inference (Kim et al., 2018). There has also been some work on making the prior more flexible through explicit parameterization (Chen et al., 2017; Tomczak & Welling, 2018). A notable technique is adversarial autoencoders (AAE) (Makhzani et al., 2015), which attempt to imbue the model with a more flexible prior implicitly through adversarial training. Recent work on Wasserstein autoencoders (Tolstikhin et al., 2018) provides a theoretical foundation for the AAE and shows that the AAE minimizes the Wasserstein distance between the data/model distributions.
The success of GANs on images has led many researchers to consider applying GANs to discrete data such as text. Policy gradient methods are a natural way to deal with the resulting non-differentiable generator objective when training directly in discrete space (Glynn, 1987; Williams, 1992). When trained on text data, however, such methods often require pre-training/co-training with a maximum likelihood (i.e. language modeling) objective (Che et al., 2017; Yu et al., 2017; Li et al., 2017). Another direction of work has been through reparameterizing the categorical distribution with the Gumbel-Softmax trick (Jang et al., 2017; Maddison et al., 2017); while initial experiments were encouraging on a synthetic task (Kusner & Hernandez-Lobato, 2016), scaling them to work on natural language is a challenging open problem. There have also been recent related approaches that work directly with the soft outputs from a generator (Gulrajani et al., 2017; Rajeswar et al., 2017; Shen et al., 2017; Press et al., 2017). For example, Shen et al. (2017) exploit an adversarial loss for unaligned style transfer between text by having the discriminator act on the RNN hidden states and using the soft outputs at each step as input to an RNN generator. Our approach instead works entirely in a fixed-dimensional continuous space and does not require utilizing RNN hidden states directly. It is therefore also different from methods that discriminate in the joint latent/data space, such as ALI (Dumoulin et al., 2017) and BiGAN (Donahue et al., 2017). Finally, our work adds to the recent line of work on unaligned style transfer for text (Hu et al., 2017; Mueller et al., 2017; Li et al., 2018; Prabhumoye et al., 2018; Yang et al., 2018).
Conclusion
We present adversarially regularized autoencoders (ARAE) as a simple approach for training a discrete structure autoencoder jointly with a code-space generative adversarial network. Utilizing the Wasserstein autoencoder framework (Tolstikhin et al., 2018), we also interpret ARAE as learning a latent variable model that minimizes an upper bound on the total variation distance between the data/model distributions. We find that the model learns an improved autoencoder and exhibits a smooth latent space, as demonstrated by semisupervised experiments, improvements on text style transfer, and manipulations in the latent space.
We note that (as has been frequently observed when training GANs) the proposed model seemed to be quite sensitive to hyperparameters, and that we only tested our model on simple structures such as binarized digits and short sentences. Cífka et al. (2018) recently evaluated a suite of sentence generation models and found that models are quite sensitive to their training setup, and that different models do well on different metrics. Training deep latent variable models that can robustly model complex discrete structures (e.g. documents) remains an important open issue in the field.
Acknowledgements
We thank Sam Wiseman, Kyunghyun Cho, Sam Bowman, Joan Bruna, Yacine Jernite, Martín Arjovsky, Mikael Henaff, and Michael Mathieu for fruitful discussions. We are particularly grateful to Tianxiao Shen for providing the results for style transfer. We also thank the NVIDIA Corporation for the donation of a Titan X Pascal GPU that was used for this research. Yoon Kim was supported by a gift from Amazon AWS Machine Learning Research.
Proof of Corollary 1
Corollary (Discrete case). Suppose x ∈ X where X is the set of all one-hot vectors of length n, and let f_ψ : Z → Δ^{n−1} be a deterministic function that goes from the latent space Z to the (n−1)-dimensional simplex Δ^{n−1}. Further let G_ψ : Z → X be a deterministic function such that G_ψ(z) = argmax_{w∈X} w^⊤ f_ψ(z), and as above let P_ψ(x | z) be the dirac distribution derived from G_ψ such that p_ψ(x | z) = 1{x = G_ψ(z)}. Then the following is an upper bound on ‖P_ψ − P_⋆‖_TV, the total variation distance between P_⋆ and P_ψ:

‖P_ψ − P_⋆‖_TV ≤ (1/log 2) inf_{Q : P_Q = P_z} E_{x∼P_⋆} E_{Q(z|x)}[−log x^⊤ f_ψ(z)].
Proof. Let our cost function be c(x, y) = 1{x ≠ y}. We first note that for all x, z,

1{x ≠ argmax_{w∈X} w^⊤ f_ψ(z)} ≤ (1/log 2)(−log x^⊤ f_ψ(z)).

This holds since if 1{x ≠ argmax_{w∈X} w^⊤ f_ψ(z)} = 1, we have x^⊤ f_ψ(z) < 0.5, and −log x^⊤ f_ψ(z) > −log 0.5 = log 2. If on the other hand x = argmax_{w∈X} w^⊤ f_ψ(z), then the LHS is 0 and the RHS is always positive since f_ψ(z) ∈ Δ^{n−1}. Then,
The fifth line follows from Theorem 1, and the last equality uses the well-known correspondence between total variation distance and optimal transport with the indicator cost function (Gozlan & Léonard, 2010).
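The pointwise inequality at the heart of this proof can be checked numerically; the following sanity script (ours, not from the paper) draws random points on the simplex and random one-hot targets:

```python
import math
import random

def bound_holds(f, gold):
    """Check 1{gold != argmax f} <= -log2(f[gold]) for a distribution f
    on the simplex and a one-hot target at index `gold` (the inequality
    used in the proof above)."""
    lhs = 1 if gold != max(range(len(f)), key=f.__getitem__) else 0
    rhs = -math.log(f[gold]) / math.log(2)
    return lhs <= rhs + 1e-12

random.seed(0)
checks = []
for _ in range(1000):
    ws = [random.random() + 1e-6 for _ in range(5)]
    f = [w / sum(ws) for w in ws]        # random point on the simplex
    checks.append(bound_holds(f, random.randrange(5)))
```

The check always passes: when the gold index is not the argmax, its probability cannot exceed 0.5, so the right-hand side is at least 1.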
Optimality Property
One can interpret the ARAE framework as a dual pathway network mapping two distinct distributions into a similar one; enc φ and g θ both output code vectors that are kept similar in terms of Wasserstein distance as measured by the critic. We provide the following proposition showing that under our parameterization of the encoder and the generator, as the Wasserstein distance converges, the encoder distribution ( P Q ) converges to the generator distribution ( P z ), and further, their moments converge.
This is ideal since, under our setting, the generated distribution is simpler than the encoded distribution: the input to the generator comes from a simple distribution (e.g. a spherical Gaussian) and the generator possesses less capacity than the encoder. However, it is not so simple as to be overly restrictive (e.g. as in VAEs). Empirically we observe that the first and second moments do indeed converge as training progresses (Section 7).
Proposition 1. Let P be a distribution on a compact set X, and (P_n)_{n∈ℕ} be a sequence of distributions on X. Further suppose that W(P_n, P) → 0. Then the following statements hold:
(i) P_n converges to P in distribution (i.e. weakly).
(ii) All moments converge, i.e. for all k > 1, k ∈ ℕ,

E_{X∼P_n}[∏_{i=1}^d X_i^{q_i}] → E_{X∼P}[∏_{i=1}^d X_i^{q_i}] for all q_i ≥ 0 with Σ_{i=1}^d q_i = k.
Proof. (i) is proved in Villani (2008), Theorem 6.9. For (ii), by the Portmanteau theorem, (i) is equivalent to the following statement: E_{X∼P_n}[f(X)] → E_{X∼P}[f(X)] for all bounded and continuous functions f : ℝ^d → ℝ, where d is the dimension of the random variable.
The k-th moment of a distribution is given by E_{X∼P}[∏_{i=1}^d X_i^{q_i}] with Σ_{i=1}^d q_i = k. Our encoded code is bounded, as we normalize the encoder output to lie on the unit sphere, and our generated code is also bounded to lie in (−1, 1)^n by the tanh function. Hence f(X) = ∏_{i=1}^d X_i^{q_i} is a bounded continuous function for all q_i ≥ 0. Therefore, the moments converge: E_{X∼P_n}[∏_{i=1}^d X_i^{q_i}] → E_{X∼P}[∏_{i=1}^d X_i^{q_i}].
Sample Generations
In Figure 6 we show some generated samples from the ARAE, the AE, and an LM.
Sentence Interpolations
In Figure 7 we show generations from interpolated latent vectors. Specifically, we sample two points z_0 and z_1 from the prior and decode from intermediate points along the line between them, z_λ = λz_1 + (1 − λ)z_0 for λ ∈ [0, 1].
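Linear interpolation between two latent samples can be sketched as follows (toy code with short vectors; decoding each interpolated code into a sentence is omitted):

```python
def interpolate(z0, z1, steps=4):
    """Return latent codes z_lam = lam*z1 + (1-lam)*z0 for lam in [0, 1];
    each code would then be passed through the ARAE decoder."""
    return [
        [lam * b + (1 - lam) * a for a, b in zip(z0, z1)]
        for lam in (k / steps for k in range(steps + 1))
    ]

path = interpolate([0.0, 0.0], [1.0, 2.0], steps=2)
```

The endpoints of the returned path are exactly z_0 and z_1, with evenly spaced codes in between.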
Vector Arithmetic
We generate 1 million sentences from the ARAE and parse the sentences to obtain the main verb, subject, and modifier. Then for a given sentence, to change the main verb we subtract the mean latent vector (t) for all other sentences with the same main verb (in the first example in Figure 5 this would correspond to all sentences that had 'sleeping' as the main verb) and add the mean latent vector for all sentences that have the desired transformation (with the running example this would be all sentences whose main verb was 'walking'). We do the same to transform the subject and the modifier. We decode back into sentence space with the transformed latent vector via sampling from p_ψ(g(z + t)). Some examples of successful transformations are shown in Figure 5 (bottom). Quantitative evaluation of the success of the vector transformations is given in Figure 5 (top). For each original vector z we sample 100 sentences from p_ψ(g(z + t)) over the transformed new latent vector and consider it a match if any of the sentences demonstrate the desired transformation. Match % is the proportion of original vectors that yield a match post transformation. As we ideally want the generated samples to differ only in the specified transformation, we also calculate the average word precision against the original sentence (Prec) for any match.
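The offset-vector procedure above amounts to simple mean arithmetic in code space. A toy sketch (the vectors and attribute groupings are hypothetical, not real sentence codes):

```python
def mean_vec(vecs):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def offset_transform(z, source_codes, target_codes):
    """z + t, where t = mean(codes with desired attribute) - mean(codes
    with original attribute), as in the main-verb transforms above."""
    src, tgt = mean_vec(source_codes), mean_vec(target_codes)
    return [zi + ti - si for zi, si, ti in zip(z, src, tgt)]

z_new = offset_transform([1.0, 2.0],
                         source_codes=[[0.0, 0.0], [2.0, 2.0]],   # e.g. "sleeping"
                         target_codes=[[4.0, 4.0], [6.0, 6.0]])   # e.g. "walking"
```

The transformed code z_new would then be decoded (with sampling) to obtain candidate transformed sentences.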
Experimental Details
MNIST experiments
Text experiments
Semi-supervised experiments
The following changes are made based on the SNLI experiments:
Yelp/Yahoo transfer
Style Transfer Samples
Recent work on deep latent variable models, such as variational autoencoders (Kingma & Welling, 2014) and generative adversarial networks (Goodfellow et al., 2014), has shown significant progress in learning smooth representations of complex, high-dimensional continuous data such as images. These latent variable representations facilitate the ability to apply smooth transformations in latent space in order to produce complex modifications of generated outputs, while still remaining on the data manifold.
Unfortunately, learning similar latent variable models of discrete structures, such as text sequences or discretized images, remains a challenging problem. Initial work on VAEs for text has shown that optimization is difficult, as the generative model can easily degenerate into an unconditional language model (Bowman et al., 2016). Recent work on generative adversarial networks (GANs) for text has mostly focused on dealing with the non-differentiable objective either through policy gradient methods (Che et al., 2017; Hjelm et al., 2018; Yu et al., 2017) or with the Gumbel-Softmax distribution (Kusner & Hernandez-Lobato, 2016). However, neither approach can yet produce robust representations directly.
In this work, we extend the adversarial autoencoder (AAE) (Makhzani et al., 2015) to discrete sequences/structures. Similar to the AAE, our model learns an encoder from an input space to an adversarially regularized continuous latent space. However unlike the AAE which utilizes a fixed prior, we instead learn a parameterized prior as a GAN. Like sequence VAEs, the model does not require using policy gradients or continuous relaxations. Like GANs, the model provides flexibility in learning a prior through a parameterized generator.
This adversarially regularized autoencoder (ARAE) can further be formalized under the recently-introduced Wasserstein autoencoder (WAE) framework (Tolstikhin et al., 2018), which also generalizes the adversarial autoencoder. This framework connects regularized autoencoders to an optimal transport objective for an implicit generative model. We extend this class of latent variable models to the case of discrete output, specifically showing that the autoencoder cross-entropy loss upper-bounds the total variation distance between the model/data distributions. Under this setup, commonly-used discrete decoders, such as RNNs, can be incorporated into the model. Finally, to handle non-trivial sequence examples, we consider several different (fixed and learned) prior distributions. These include a standard Gaussian prior used in image models and in the AAE/WAE models, a learned parametric generator acting as a GAN in latent variable space, and a transfer-based parametric generator that is trained to ignore targeted attributes of the input. The last prior can be directly used for unaligned transfer tasks such as sentiment or style transfer.
Experiments apply ARAE to discretized images and text sequences. The latent variable model is able to generate varied samples that can be quantitatively shown to cover the input spaces and to generate consistent image and sentence manipulations by moving around in the latent space via interpolation and offset vector arithmetic. When the ARAE model is trained with task-specific adversarial regularization, the model improves upon strong results on sentiment transfer reported in Shen et al. (2017) and produces compelling outputs on a topic transfer task using only a single shared space. Code is available at https://github.com/jakezhaojb/ARAE.
ARAE combines a discrete autoencoder with a GAN-regularized latent representation. The full model is shown in Figure 1, which produces a learned distribution over the discrete space P_ψ. Intuitively, this method aims to provide a smoother hidden encoding for discrete sequences with a flexible prior. In the next section we show how this simple network can be formally interpreted as a latent variable model under the Wasserstein autoencoder framework.
The model consists of a discrete autoencoder regularized with a prior distribution,

min_{φ,ψ} L_rec(φ, ψ) + λ^(1) W(P_Q, P_z).
Here W is the Wasserstein distance between P_Q, the distribution from a discrete encoder model (i.e. enc_φ(x) where x ∼ P_⋆), and P_z, a prior distribution. As above, the W function is computed with an embedded critic function which is optimized adversarially to the generator and encoder. 2
2 Other GANs could be used for this optimization. Experimentally we found WGANs to be more stable than other models.
The model is trained with coordinate descent across: (1) the encoder and decoder to minimize reconstruction, (2) the critic function to approximate the W term, and (3) the encoder adversarially to the critic to minimize W:
The full training algorithm is shown in Algorithm 1.
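Schematically, one training iteration interleaves the three updates. The sketch below fixes only the update order; the per-iteration critic count n_critic is our assumption (in line with common WGAN practice), not a value taken from the paper:

```python
def arae_iteration(update_ae, update_critic, update_enc_adv, n_critic=5):
    """One coordinate-descent iteration of ARAE training (schematic):
    (1) autoencoder reconstruction update, (2) n_critic critic updates
    approximating the W term, (3) adversarial encoder/generator update
    against the critic. Each argument is a callable performing one
    gradient step and returning a tag for logging."""
    log = [update_ae()]
    for _ in range(n_critic):
        log.append(update_critic())
    log.append(update_enc_adv())
    return log

# Stub updates that just report which step ran.
schedule = arae_iteration(lambda: "ae", lambda: "critic", lambda: "adv", n_critic=2)
```

In a real implementation each callable would compute a loss on a minibatch and apply an optimizer step to the corresponding parameters (φ/ψ, w, or φ/θ).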
Empirically we found that the choice of the prior distribution P_z strongly impacted the performance of the model. The simplest choice is to use a fixed distribution such as a Gaussian N(0, I), which yields a discrete version of the adversarial autoencoder (AAE). However, in practice this choice is seemingly too constrained and suffers from mode-collapse. 3
3 We note that recent work has successfully utilized AAE for text by instead employing a spherical prior (Cífka et al., 2018).
Instead we exploit the adversarial setup and use a learned prior parameterized through a generator model. This is analogous to the use of learned priors in VAEs (Chen et al., 2017; Tomczak & Welling, 2018). Specifically we introduce a generator model g_θ(s) over noise s ∼ N(0, I) to act as an implicit prior distribution P_z. 4 We optimize its parameters θ as part of training in Step 3.
4 The downside of this approach is that the latent variable z is now much less constrained. However, we find experimentally that using a simple MLP for g_θ significantly regularizes the encoder RNN.
Regularization of the latent space makes it more adaptable for direct continuous optimization that would be difficult over discrete sequences. For example, consider the problem of unaligned transfer, where we want to change an attribute of a discrete input without aligned examples, e.g. to change the topic or sentiment of a sentence. Define this attribute as y and redefine the decoder to be conditional: p_ψ(x | z, y).
To adapt ARAE to this setup, we modify the objective to learn to remove attribute distinctions from the prior (i.e. we want the prior to encode all the relevant information except about y). Following similar techniques from other domains, notably in images (Lample et al., 2017) and video modeling (Denton & Birodkar, 2017), we introduce a latent space attribute classifier:

min_{φ,ψ} L_rec(φ, ψ) + λ^(1) W(P_Q, P_z) − λ^(2) L_class(φ, u),
where L_class(φ, u) is the loss of a classifier p_u(y | z) from latent variable to labels (in our experiments we always set λ^(2) = 1). This requires two more update steps: (2b) training the classifier, and (3b) adversarially training the encoder against this classifier. This algorithm is shown in Algorithm 2.
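The encoder's objective in the transfer setup is a signed combination of the three terms; a minimal sketch of that scalar combination (the loss values are toy numbers, the λ defaults follow the text's λ^(2) = 1 choice):

```python
def transfer_encoder_loss(rec, wasserstein, cls, lam1=1.0, lam2=1.0):
    """Encoder loss for the transfer setup: minimize reconstruction and
    the critic term while *maximizing* the attribute classifier's loss;
    the minus sign on `cls` implements adversarial step (3b)."""
    return rec + lam1 * wasserstein - lam2 * cls

loss = transfer_encoder_loss(rec=1.5, wasserstein=0.2, cls=0.7)
```

Note the asymmetry: the classifier itself (step 2b) is trained to minimize its loss, while the encoder is pushed in the opposite direction so that z carries no information about y.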
Standard GANs implicitly minimize a divergence measure (e.g. f-divergence or Wasserstein distance) between the true/model distributions. In our case, however, we implicitly minimize the divergence between the learned code distributions, and it is not clear if this training objective is matching the distributions in the original discrete space. Tolstikhin et al. (2018) recently showed that this style of training minimizes the Wasserstein distance between the data distribution P_⋆ and the model distribution P_ψ with latent variables (with density p_ψ(x) = ∫_z p_ψ(x | z) p(z) dz).
In this section we apply the above result to the discrete case and show that the ARAE loss minimizes an upper bound on the total variation distance between P_⋆ and P_ψ.
Let P_⋆, P_ψ be distributions over X, and further let c(x, y) : X × X → ℝ⁺ be a cost function. Then the optimal transport (OT) problem is given by

W_c(P_⋆, P_ψ) = inf_{Γ ∈ P(x∼P_⋆, y∼P_ψ)} E_{(x,y)∼Γ}[c(x, y)],
where P(x ∼ P_⋆, y ∼ P_ψ) is the set of all joint distributions of (x, y) with marginals P_⋆ and P_ψ.
In particular, if c(x, y) = ‖x − y‖_p^p then W_c(P_⋆, P_ψ)^{1/p} is the Wasserstein-p distance between P_⋆ and P_ψ. Now suppose we utilize a latent variable model to fit the data, i.e. z ∼ P_z, x ∼ P_ψ(x | z). Then Tolstikhin et al. (2018) prove the following theorem:
Theorem 1. Let G_ψ : Z → X be a deterministic function (parameterized by ψ) from the latent space Z to data space X that induces a dirac distribution P_ψ(x | z) on X, i.e. p_ψ(x | z) = 1{x = G_ψ(z)}. Let Q(z | x) be any conditional distribution on Z with density p_Q(z | x). Define its marginal to be P_Q, which has density p_Q(z) = ∫_x p_Q(z | x) p_⋆(x) dx. Then,

W_c(P_⋆, P_ψ) = inf_{Q : P_Q = P_z} E_{x∼P_⋆} E_{Q(z|x)}[c(x, G_ψ(z))].
Theorem 1 essentially says that learning an autoencoder can be interpreted as learning a generative model with latent variables, as long as we ensure that the marginalized encoded space is the same as the prior. This provides theoretical justification for adversarial autoencoders (Makhzani et al., 2015), and Tolstikhin et al. (2018) used the above to train deep generative models of images by minimizing the Wasserstein-2 distance (i.e. squared loss between real/generated images). We now apply Theorem 1 to discrete autoencoders trained with cross-entropy loss.
Corollary 1 (Discrete case). Suppose x ∈ X where X is the set of all one-hot vectors of length n, and let f_ψ : Z → Δ^{n−1} be a deterministic function that goes from the latent space Z to the (n−1)-dimensional simplex Δ^{n−1}. Further let G_ψ : Z → X be a deterministic function such that G_ψ(z) = argmax_{w∈X} w^⊤ f_ψ(z), and as above let P_ψ(x | z) be the dirac distribution derived from G_ψ such that p_ψ(x | z) = 1{x = G_ψ(z)}. Then the following is an upper bound on ‖P_ψ − P_⋆‖_TV, the total variation distance between P_⋆ and P_ψ:

‖P_ψ − P_⋆‖_TV ≤ (1/log 2) inf_{Q : P_Q = P_z} E_{x∼P_⋆} E_{Q(z|x)}[−log x^⊤ f_ψ(z)].
The proof is in Appendix A. For natural language we have n = |V|^m and therefore X is the set of sentences of length m, where m is the maximum sentence length (shorter sentences are padded if necessary). The total variation (TV) distance is then bounded as given in Corollary 1.
This is an interesting alternative to the usual maximum likelihood approach, which instead minimizes KL(P_⋆, P_ψ). 5 It is also clear that −log x^⊤ f_ψ(z) = −log p_ψ(x | z), the standard autoencoder cross-entropy loss at the sentence level with f_ψ as the decoder. As the above objective is hard to minimize directly, we follow Tolstikhin et al. (2018) and consider an easier objective by (i) restricting Q(z | x) to a family of distributions induced by a deterministic encoder parameterized by φ, and (ii) using a Lagrangian relaxation of the constraint P_Q = P_z. In particular, letting Q(z | x) = 1{z = enc_φ(x)} be the dirac distribution induced by a deterministic encoder (with associated marginal P_φ), the objective is given by

min_{φ,ψ} L_rec(φ, ψ) + λ^(1) W(P_φ, P_z).

5 The relationship between KL-divergence and total variation distance is also given by Pinsker's inequality, which states that 2‖P_ψ − P_⋆‖_TV² ≤ KL(P_⋆, P_ψ).
Note that minimizing the Wasserstein distance in the latent space, W(P_φ, P_z), is independent from the Wasserstein distance minimization in the output space in WAEs. Finally, instead of using a fixed prior (which led to mode-collapse in our experiments) we parameterize P_z implicitly by transforming a simple random variable with a generator (i.e. s ∼ N(0, I), z = g_θ(s)). This recovers the ARAE objective from the previous section.
We conclude this section by noting that while the theoretical formalization of the AAE as a latent variable model was an important step, in practice there are many approximations made to the actual optimal transport objective. Meaningfully quantifying (and reducing) such approximation gaps remains an avenue for future work.
We experiment with ARAE on three setups: (1) a small model using discretized images trained on the binarized version of MNIST, (2) a model for text sequences trained on the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015), and (3) a model for text transfer trained on the Yelp/Yahoo datasets for unaligned sentiment/topic transfer. For experiments using a learned prior, the generator architecture uses a low-dimensional s with a Gaussian prior s ∼ N(0, I), and maps it to z using an MLP g_θ. The critic f_w is also parameterized as an MLP.
The image model encodes/decodes binarized images. Here $\mathcal{X} = \{0,1\}^n$, where $n$ is the image size. The encoder is an MLP mapping $\{0,1\}^n \mapsto \mathbb{R}^m$, $\text{enc}_\phi(\mathbf{x}) = \text{MLP}(\mathbf{x}; \phi) = \mathbf{z}$. The decoder predicts each pixel in $\mathbf{x}$ as a parameterized logistic regression, $p_\psi(\mathbf{x} \mid \mathbf{z}) = \prod_{j=1}^{n} \sigma(h_j)^{x_j}(1 - \sigma(h_j))^{1 - x_j}$, where $\mathbf{h} = \text{MLP}(\mathbf{z}; \psi)$.
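The pixel-wise Bernoulli likelihood above can be computed directly from the decoder logits; a minimal sketch (the helper name `bernoulli_log_likelihood` is ours, not from the paper's code):

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def bernoulli_log_likelihood(x, h):
    """log p(x | z) = sum_j [ x_j log sigma(h_j) + (1 - x_j) log(1 - sigma(h_j)) ],
    where h = MLP(z; psi) are the decoder logits."""
    p = sigmoid(h)
    return float(np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p)))

x = np.array([1.0, 0.0, 1.0])       # three observed pixels
h = np.array([4.0, -4.0, 4.0])      # confident logits agreeing with x
ll_good = bernoulli_log_likelihood(x, h)
ll_bad = bernoulli_log_likelihood(x, -h)  # the same logits, flipped
```

Logits that agree with the observed pixels yield a much higher log-likelihood than flipped ones.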
The text model uses a recurrent neural network (RNN) for both the encoder and decoder. Here $\mathcal{X} = \mathcal{V}^n$, where $n$ is the sentence length and $\mathcal{V}$ is the vocabulary of the underlying language. We define $\text{enc}_\phi(\mathbf{x}) = \mathbf{z}$ to be the last hidden state of an encoder RNN. For decoding we feed $\mathbf{z}$ as an additional input to the decoder RNN at each time step, and calculate the distribution over $\mathcal{V}$ at each time step via softmax, $p_\psi(\mathbf{x} \mid \mathbf{z}) = \prod_{j=1}^{n} \text{softmax}(\mathbf{W} h_j + \mathbf{b})_{x_j}$, where $\mathbf{W}$ and $\mathbf{b}$ are parameters (part of $\psi$) and $h_j$ is the decoder RNN hidden state. To be consistent with Corollary 1 we need to find the highest-scoring sequence $\hat{\mathbf{x}}$ under this distribution during decoding, which is intractable in general; instead we approximate this with greedy search. The text transfer model uses the same architecture as the text model but extends it with a classifier $p_u(y \mid \mathbf{z})$, modeled with an MLP and trained to minimize cross-entropy.
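The sentence-level cross-entropy and the greedy approximation can be sketched as follows, assuming the per-step logits $\mathbf{W}h_j + \mathbf{b}$ have already been computed (the helper names are illustrative):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sequence_nll(logits, target_ids):
    """-log p(x | z) = -sum_j log softmax(W h_j + b)_{x_j}."""
    probs = softmax(logits)
    return -float(np.sum(np.log(probs[np.arange(len(target_ids)), target_ids])))

def greedy_decode(logits):
    """Token-by-token argmax, approximating the intractable argmax sequence."""
    return [int(i) for i in np.argmax(logits, axis=-1)]

# Per-step logits for a 3-step sentence over a toy vocabulary of size 3.
logits = np.array([[2.0, 0.1, -1.0],
                   [0.0, 3.0, 0.2],
                   [0.5, 0.0, 2.5]])
```

Greedy decoding picks the locally best token at each step, which need not be the globally highest-scoring sequence.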
We further compare our approach with a standard autoencoder (AE) and the cross-aligned autoencoder (Shen et al., 2017) for transfer. In both our ARAE and standard AE experiments, the encoder output is normalized to lie on the unit sphere, and the generator output is bounded to lie in $(-1,1)^n$ by a $\tanh$ at the output layer.
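These two output constraints are simple to implement; a minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize_to_sphere(z):
    """Project encoder outputs onto the unit sphere."""
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

z = normalize_to_sphere(rng.normal(size=(5, 10)))   # encoder side: unit norm
g = np.tanh(rng.normal(size=(5, 10)) * 3.0)         # generator side: in (-1, 1)^n
```

Both constraints keep the code spaces bounded, which is also what the moment-convergence argument in the appendix relies on.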
Note that learning deep latent variable models for text sequences has been a significantly more challenging empirical problem than for images. Standard models such as VAEs suffer from optimization issues that have been widely documented. We performed experiments with the recurrent VAE introduced by Bowman et al. (2016), as well as the adversarial autoencoder (AAE) (Makhzani et al., 2015), both with Gaussian priors. We found that neither model was able to learn meaningful latent representations: the VAE simply ignored the latent code, and the AAE experienced mode-collapse and repeatedly generated the same samples. (However, there have been some recent successes training such models, as noted in the related work section.) Appendix F includes detailed descriptions of the hyperparameters, model architecture, and training regimes.
Section 4 argues that $\mathbb{P}_\psi$ is trained to approximate the true data distribution over discrete sequences $\mathbb{P}_\star$. While it is difficult to test for this property directly (as is the case with most GAN models), we can take samples from the model to test the fidelity and coverage of the data space. Figure 2 shows a set of samples from discretized MNIST and Appendix C shows a set of generations from the text ARAE.
A common quantitative measure of sample quality for generative models is to evaluate a strong surrogate model trained on its generated samples. While there are pitfalls to this style of evaluation (Theis et al., 2016), it has provided a starting point for image generation models. Here we use a similar method for text generation, which we call reverse perplexity: we generate 100k samples from each of the models, train an RNN language model on the generated samples, and evaluate perplexity on held-out data. (We also found this metric to be helpful for early stopping.) While similar metrics for images (e.g. Parzen windows) have been shown to be problematic, we argue that this is less of an issue for text, as RNN language models achieve state-of-the-art perplexities on text datasets. We also calculate the usual "forward" perplexity by training an RNN language model on real data and testing on generated data. This measures the fluency of the generated samples, but cannot detect mode-collapse, a common issue in training GANs (Arjovsky & Bottou, 2017; Hu et al., 2018).
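Both perplexity variants reduce to the exponentiated average negative log-likelihood of the evaluation tokens under the trained language model; a minimal sketch assuming per-token log-probabilities are available:

```python
import math

def perplexity(token_log_probs):
    """PPL = exp( -(1/N) * sum_i log p(w_i | w_<i) ) over N evaluation tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 0.1 to every token has perplexity 10.
ppl = perplexity([math.log(0.1)] * 4)
```

For reverse perplexity the language model is trained on generated samples and the evaluation tokens come from held-out real data; for forward perplexity the roles are swapped.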
Table 1 shows these metrics for (i) ARAE, (ii) an autoencoder (AE), (iii) an RNN language model (LM), and (iv) the real training set. (To "sample" from an AE we fit a multivariate Gaussian to the code space after training and generate code vectors from this Gaussian to decode back into sentence space.) We further find that with a fixed prior, the reverse perplexity of an AAE-style text model (Makhzani et al., 2015) was quite high (980) due to mode-collapse. All models are of the same size to allow for fair comparison. Training directly on real data (understandably) outperforms training on generated data by a large margin. Surprisingly, however, training on ARAE samples outperforms training on LM/AE samples in terms of reverse perplexity.
Next we evaluate the model in the context of a learned adversarial prior, as described in Section 3. We experiment with two unaligned text transfer tasks: (i) transfer of sentiment on the Yelp corpus, and (ii) transfer of topic on the Yahoo corpus (Zhang et al., 2015). For sentiment we follow the setup of Shen et al. (2017) and split the Yelp corpus into two sets of unaligned positive and negative reviews. We train ARAE with two separate decoder RNNs, one for positive sentiment, $p(\mathbf{x} \mid \mathbf{z}, y = 1)$, and one for negative sentiment, $p(\mathbf{x} \mid \mathbf{z}, y = 0)$, and incorporate adversarial training of the encoder to remove sentiment information from the prior. Transfer corresponds to encoding sentences of one class and decoding, greedily, with the opposite decoder. Experiments compare against the cross-aligned AE of Shen et al. (2017) and also an AE trained without the adversarial regularization. For ARAE, we experimented with different weightings $\lambda^{(1)}$ on the adversarial loss (see Section 4), with $\lambda_a^{(1)} = 1$ and $\lambda_b^{(1)} = 10$; both use $\lambda^{(2)} = 1$. Empirically the adversarial regularization improves transfer and perplexity, but tends to make the transferred text less similar to the original, compared to the AE. Randomly selected example sentences are shown in Table 2 and additional outputs are available in Appendix G.
Table 3 (top) shows quantitative evaluation. We use four automatic metrics: (i) Transfer: how successful the model is at altering sentiment based on an automatic classifier (we use the fastText library (Joulin et al., 2017)); (ii) BLEU: the consistency between the transferred text and the original; (iii) Forward PPL: the fluency of the generated text; (iv) Reverse PPL: measuring the extent to which the generations are representative of the underlying data distribution. Both perplexity numbers are obtained by training an RNN language model. Table 3 (bottom) shows human evaluations on the cross-aligned AE and our best ARAE model. We randomly select 1000 sentences (500/500 positive/negative), obtain the corresponding transfers from both models, and ask crowdworkers to evaluate the sentiment (Positive/Neutral/Negative) and naturalness (1-5, 5 being most natural) of the transferred sentences. We create a separate task in which we show the original and the transferred sentences, and ask them to evaluate the similarity based on sentence structure (1-5, 5 being most similar). We explicitly requested that the reader disregard sentiment in similarity assessment.
The same method can be applied to other style transfer tasks, for instance the more challenging Yahoo QA data (Zhang et al., 2015). For Yahoo we chose 3 relatively distinct topic classes for transfer: Science & Math, Entertainment & Music, and Politics & Government. As the dataset contains both questions and answers, we separated our experiments into titles (questions) and replies (answers). Randomly-selected generations are shown in Table 4. See Appendix G for additional generation examples.
Latent variable models can also provide an easy method for semi-supervised training. We use a natural language inference task to compare semi-supervised ARAE with other training methods. The full SNLI training set contains 543k sentence pairs; we use supervised sets of 120k (Medium), 59k (Small), and 28k (Tiny), and use the rest of the training set for unlabeled training. As a baseline we use an AE trained on the additional data, similar to the setting explored in Dai & Le (2015). For ARAE we use the subset of unsupervised data of length $< 15$ (i.e. ARAE is trained on less data than the AE for unsupervised training). The results are shown in Table 5. Training on unlabeled data with an AE objective improves upon a model trained only on labeled data, and training with adversarial regularization provides further gains.
We further examine the impact of adversarial regularization on the encoded representation produced by the model as it is trained. Figure 3 (left) shows a sanity check that the $\ell_2$ norms of the encoder output $\mathbf{z}$ and prior samples $\tilde{\mathbf{z}}$ converge quickly in ARAE training. The middle plot compares the trace of the covariance matrix between these terms as training progresses, showing that the variances of the encoder and the prior match after several epochs.
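The covariance-trace comparison can be sketched as follows, with synthetic draws standing in for the encoder codes and prior samples once their distributions have (approximately) matched:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-ins for encoder codes z and prior samples z~ late in training.
z_enc = rng.normal(size=(5000, 4))
z_gen = rng.normal(size=(5000, 4))

# Trace of the covariance matrix = total variance summed across dimensions.
tr_enc = float(np.trace(np.cov(z_enc, rowvar=False)))
tr_gen = float(np.trace(np.cov(z_gen, rowvar=False)))
```

When the two distributions agree, the traces are close; a large gap early in training indicates the encoder and prior have not yet aligned.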
We can also assess the "smoothness" of the encoder model learned by ARAE (Rifai et al., 2011). We start with a simple proxy: a smooth encoder should map similar sentences to similar $\mathbf{z}$ values. For 250 sentences, we calculate the average cosine similarity of 100 randomly-selected sentences within an edit distance of at most 5 from the original. The graph in Figure 3 (right) shows that the cosine similarity of nearby sentences is quite high for ARAE compared to a standard AE and increases in early rounds of training. To further test this property, we feed noised discrete input to the encoder and (i) calculate the score given to the original input, and (ii) compare the resulting reconstructions. Figure 4 (right) shows results for text where $k$ words are first permuted in each sentence. We observe that ARAE is able to map a noised sentence to a natural sentence (though not necessarily the denoised sentence). Figure 4 (left) shows empirical results for these experiments: we obtain the reconstruction error (negative log-likelihood) of the original non-noised sentence under the decoder, utilizing the noised code. We find that when $k = 0$ (i.e. no swaps), the regular AE better reconstructs the exact input; however, as the number of swaps pushes the input further away, ARAE is more likely to produce the original sentence. (Note that unlike denoising autoencoders, which require a domain-specific noising function (Hill et al., 2016; Vincent et al., 2008), the ARAE is not explicitly trained to denoise an input.)
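The $k$-swap noising probe can be sketched as follows (the helper name `permute_k_words` is illustrative, not from the paper's code):

```python
import random

def permute_k_words(sentence, k, seed=0):
    """Noising probe: swap k randomly chosen word pairs in the sentence."""
    words = sentence.split()
    rnd = random.Random(seed)
    for _ in range(k):
        i, j = rnd.randrange(len(words)), rnd.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

noised = permute_k_words("the dogs are sleeping in bed", 3)
```

The swaps preserve the bag of words, so any change in the reconstruction is attributable to word order alone.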
An interesting property of latent variable models such as VAEs and GANs is the ability to manipulate output samples through the prior. In particular, for ARAE, the Gaussian form of the noise sample $\mathbf{s}$ induces the ability to smoothly interpolate between outputs by exploiting the structure. While language models may provide a better estimate of the underlying probability space, constructing this style of interpolation would require combinatorial search, which makes this a useful feature of latent variable text models. In Appendix D we show interpolations for the text model, while Figure 2 (bottom) shows the interpolations for the discretized MNIST ARAE.
A related property of GANs is the ability to move in the latent space via offset vectors. (Similar to the case with word vectors (Mikolov et al., 2013), Radford et al. (2016) observe that when the mean latent vector for "men with glasses" is subtracted from the mean latent vector for "men without glasses" and applied to an image of a "woman without glasses", the resulting image is that of a "woman with glasses".) To experiment with this property we generate sentences from the ARAE and compute vector transforms in this space to attempt to change main verbs, subjects, and modifiers (details in Appendix E). Some examples of successful transformations are shown in Figure 5 (bottom). Quantitative evaluation of the success of the vector transformations is given in Figure 5 (top).
While ideally autoencoders would learn latent spaces which compactly capture useful features that explain the observed data, in practice they often learn a degenerate identity mapping where the latent code space is free of any structure, necessitating some regularization of the latent space. A popular approach is to regularize through an explicit prior on the code space and use a variational approximation to the posterior, leading to a family of models called variational autoencoders (VAE) (Kingma & Welling, 2014; Rezende et al., 2014). Unfortunately VAEs for discrete text sequences can be challenging to train—for example, if the training procedure is not carefully tuned with techniques like word dropout and KL annealing (Bowman et al., 2016), the decoder simply becomes a language model and ignores the latent code. However there have been some recent successes through employing convolutional decoders (Yang et al., 2017; Semeniuta et al., 2017), training the latent representation as a topic model (Dieng et al., 2017; Wang et al., 2018), using the von Mises–Fisher distribution (Guu et al., 2017), and combining VAE with iterative inference (Kim et al., 2018). There has also been some work on making the prior more flexible through explicit parameterization (Chen et al., 2017; Tomczak & Welling, 2018). A notable technique is adversarial autoencoders (AAE) (Makhzani et al., 2015), which attempt to imbue the model with a more flexible prior implicitly through adversarial training. Recent work on Wasserstein autoencoders (Tolstikhin et al., 2018) provides a theoretical foundation for the AAE and shows that the AAE minimizes the Wasserstein distance between the data/model distributions.
The success of GANs on images has led many researchers to consider applying GANs to discrete data such as text. Policy gradient methods are a natural way to deal with the resulting non-differentiable generator objective when training directly in discrete space (Glynn, 1987; Williams, 1992). When trained on text data, however, such methods often require pre-training/co-training with a maximum likelihood (i.e. language modeling) objective (Che et al., 2017; Yu et al., 2017; Li et al., 2017). Another direction of work has been through reparameterizing the categorical distribution with the Gumbel-Softmax trick (Jang et al., 2017; Maddison et al., 2017)—while initial experiments were encouraging on a synthetic task (Kusner & Hernandez-Lobato, 2016), scaling them to work on natural language is a challenging open problem. There have also been recent related approaches that work directly with the soft outputs from a generator (Gulrajani et al., 2017; Rajeswar et al., 2017; Shen et al., 2017; Press et al., 2017). For example, Shen et al. (2017) exploit an adversarial loss for unaligned style transfer between text by having the discriminator act on the RNN hidden states and using the soft outputs at each step as input to an RNN generator. Our approach instead works entirely in a fixed-dimensional continuous space and does not require utilizing RNN hidden states directly. It is therefore also different from methods that discriminate in the joint latent/data space, such as ALI (Dumoulin et al., 2017) and BiGAN (Donahue et al., 2017). Finally, our work adds to the recent line of work on unaligned style transfer for text (Hu et al., 2017; Mueller et al., 2017; Li et al., 2018; Prabhumoye et al., 2018; Yang et al., 2018).
We present adversarially regularized autoencoders (ARAE) as a simple approach for training a discrete structure autoencoder jointly with a code-space generative adversarial network. Utilizing the Wasserstein autoencoder framework (Tolstikhin et al., 2018), we also interpret ARAE as learning a latent variable model that minimizes an upper bound on the total variation distance between the data/model distributions. We find that the model learns an improved autoencoder and exhibits a smooth latent space, as demonstrated by semi-supervised experiments, improvements on text style transfer, and manipulations in the latent space.
We note that (as has been frequently observed when training GANs) the proposed model seemed to be quite sensitive to hyperparameters, and that we only tested our model on simple structures such as binarized digits and short sentences. Cífka et al. (2018) recently evaluated a suite of sentence generation models and found that models are quite sensitive to their training setup, and that different models do well on different metrics. Training deep latent variable models that can robustly model complex discrete structures (e.g. documents) remains an important open issue in the field.
We thank Sam Wiseman, Kyunghyun Cho, Sam Bowman, Joan Bruna, Yacine Jernite, Martín Arjovsky, Mikael Henaff, and Michael Mathieu for fruitful discussions. We are particularly grateful to Tianxiao Shen for providing the results for style transfer. We also thank the NVIDIA Corporation for the donation of a Titan X Pascal GPU that was used for this research. Yoon Kim was supported by a gift from Amazon AWS Machine Learning Research.
Let our cost function be $c(\mathbf{x}, \mathbf{y}) = \mathds{1}\{\mathbf{x} \neq \mathbf{y}\}$. We first note that for all $\mathbf{x}, \mathbf{z}$,

$$\mathds{1}\{\mathbf{x} \neq \operatorname{arg\,max}_{\mathbf{w} \in \mathcal{X}} \mathbf{w}^\top f_\psi(\mathbf{z})\} \cdot \log 2 \;\leq\; -\log \mathbf{x}^\top f_\psi(\mathbf{z})$$
This holds since if $\mathds{1}\{\mathbf{x} \neq \operatorname{arg\,max}_{\mathbf{w} \in \mathcal{X}} \mathbf{w}^\top f_\psi(\mathbf{z})\} = 1$, we have $\mathbf{x}^\top f_\psi(\mathbf{z}) < 0.5$, and hence $-\log \mathbf{x}^\top f_\psi(\mathbf{z}) > -\log 0.5 = \log 2$. If on the other hand $\mathbf{x} = \operatorname{arg\,max}_{\mathbf{w} \in \mathcal{X}} \mathbf{w}^\top f_\psi(\mathbf{z})$, then the LHS is $0$ and the RHS is always positive since $f_\psi(\mathbf{z}) \in \Delta^{n-1}$. Then,
The fifth line follows from Theorem 1, and the last equality uses the well-known correspondence between total variation distance and optimal transport with the indicator cost function (Gozlan & Léonard, 2010). ∎
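The per-term inequality used in the proof can also be checked numerically. The sketch below draws random points from the interior of the simplex $\Delta^{n-1}$ (standing in for $f_\psi(\mathbf{z})$) and verifies the bound over every one-hot $\mathbf{x}$:

```python
import numpy as np

rng = np.random.default_rng(3)

def check_bound(f):
    """Check 1{x != argmax_w w^T f} * log(2) <= -log(x^T f)
    for every one-hot x, given a point f on the simplex."""
    amax = int(np.argmax(f))
    for idx in range(len(f)):
        lhs = np.log(2.0) * (idx != amax)
        rhs = -np.log(f[idx])  # x^T f selects coordinate idx for one-hot x
        if lhs > rhs:
            return False
    return True

# Random interior points of the simplex Delta^{n-1}.
all_ok = all(check_bound(rng.dirichlet(np.ones(6))) for _ in range(200))
```

The check never fails: any non-argmax coordinate of a simplex point is at most 0.5, so its negative log is at least log 2.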
One can interpret the ARAE framework as a dual-pathway network mapping two distinct distributions into a similar one; $\text{enc}_\phi$ and $g_\theta$ both output code vectors that are kept similar in terms of the Wasserstein distance as measured by the critic. We provide the following proposition showing that under our parameterization of the encoder and the generator, as the Wasserstein distance converges, the encoder distribution ($\mathbb{P}_Q$) converges to the generator distribution ($\mathbb{P}_{\mathbf{z}}$), and further, their moments converge.
This is ideal since under our setting the generated distribution is simpler than the encoded distribution, because the input to the generator is from a simple distribution (e.g. spherical Gaussian) and the generator possesses less capacity than the encoder. However, it is not so simple that it is overly restrictive (e.g. as in VAEs). Empirically we observe that the first and second moments do indeed converge as training progresses (Section 7).
Let $\mathbb{P}$ be a distribution on a compact set $\chi$, and $(\mathbb{P}_n)_{n \in \mathbb{N}}$ a sequence of distributions on $\chi$, and further suppose that $W(\mathbb{P}_n, \mathbb{P}) \to 0$. Then the following statements hold:
(i) $\mathbb{P}_n \rightsquigarrow \mathbb{P}$ (i.e. convergence in distribution).
(ii) All moments converge, i.e. for all $k > 1$, $k \in \mathbb{N}$,

$$\mathbb{E}_{X \sim \mathbb{P}_n}\!\left[\prod_{i=1}^{d} X_i^{p_i}\right] \to \mathbb{E}_{X \sim \mathbb{P}}\!\left[\prod_{i=1}^{d} X_i^{p_i}\right]$$
for all $p_1, \dots, p_d$ such that $\sum_{i=1}^{d} p_i = k$.
(i) is proved in Villani (2008), Theorem 6.9.
For (ii), by the Portmanteau Theorem, (i) is equivalent to the following statement:
$\mathbb{E}_{X \sim \mathbb{P}_n}[f(X)] \to \mathbb{E}_{X \sim \mathbb{P}}[f(X)]$ for all bounded and continuous functions $f: \mathbb{R}^d \to \mathbb{R}$, where $d$ is the dimension of the random variable.
The $k$-th moment of a distribution is given by
Our encoded code is bounded, as we normalize the encoder output to lie on the unit sphere, and our generated code is bounded to lie in $(-1,1)^n$ by the $\tanh$ function. Hence $f(X) = \prod_{i=1}^{d} X_i^{q_i}$ is a bounded continuous function for all $q_i \geq 0$. Therefore,
In the following pages we show randomly sampled style transfers from the Yelp/Yahoo corpus.
ARAE Samples
A woman preparing three fish . A woman is seeing a man in the river . There passes a woman near birds in the air . Some ten people is sitting through their office . The man got stolen with young dinner bag . Monks are running in court . The Two boys in glasses are all girl . The man is small sitting in two men that tell a children . The two children are eating the balloon animal . A woman is trying on a microscope . The dogs are sleeping in bed .
Two Three woman in a cart tearing over of a tree . A man is hugging and art . The fancy skier is starting under the drag cup in . A dog are
a man walking outside on a dirt road , sitting on the dock . A large group of people is taking a photo for Christmas and at night . Someone is avoiding a soccer game . The man and woman are dressed for a movie . Person in an empty stadium pointing at a mountain . Two children and a little boy are
In Figure 7 we show generations from interpolated latent vectors. Specifically, we sample two points $\mathbf{z}_0$ and $\mathbf{z}_1$ from $p(\mathbf{z})$ and construct intermediary points $\mathbf{z}_\lambda = \lambda \mathbf{z}_1 + (1 - \lambda) \mathbf{z}_0$. For each we generate the argmax output $\tilde{\mathbf{x}}_\lambda$.
A man is on the corner in a sport area . A man is on corner in a road all . A lady is on outside a racetrack . A lady is outside on a racetrack . A lot of people is outdoors in an urban setting . A lot of people is outdoors in an urban setting . A lot of people is outdoors in an urban setting .
A man is on a ship path with the woman . A man is on a ship path with the woman . A man is passing on a bridge with the girl . A man is passing on a bridge with the girl . A man is passing on a bridge with the girl . A man is passing on a bridge with the dogs . A man is passing on a bridge with the dogs .
A man in a cave is used an escalator . A man in a cave is used an escalator A man in a cave is used chairs . A man in a number is used many equipment A man in a number is posing so on a big rock . People are posing in a rural area . People are posing in a rural area.
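The interpolation construction above is straightforward; a minimal sketch:

```python
import numpy as np

def interpolate(z0, z1, num=7):
    """z_lambda = lambda * z1 + (1 - lambda) * z0 for lambda on a uniform grid in [0, 1]."""
    return [lam * z1 + (1.0 - lam) * z0 for lam in np.linspace(0.0, 1.0, num)]

z0, z1 = np.zeros(4), np.ones(4)
path = interpolate(z0, z1, num=5)
```

Each intermediary point would then be decoded greedily to produce the sentences shown above.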
We generate 1 million sentences from the ARAE and parse the sentences to obtain the main verb, subject, and modifier. Then for a given sentence, to change the main verb we subtract the mean latent vector $\mathbf{t}$ for all other sentences with the same main verb (in the first example in Figure 5 this would correspond to all sentences that had "sleeping" as the main verb) and add the mean latent vector for all sentences that have the desired transformation (with the running example this would be all sentences whose main verb was "walking"). We do the same to transform the subject and the modifier. We decode back into sentence space with the transformed latent vector via sampling from $p_\psi(g(\mathbf{z} + \mathbf{t}))$. Some examples of successful transformations are shown in Figure 5 (right), and quantitative evaluation of the success of the vector transformations is given in Figure 5 (left). For each original vector $\mathbf{z}$ we sample 100 sentences from $p_\psi(g(\mathbf{z} + \mathbf{t}))$ over the transformed new latent vector and consider it a match if any of the sentences demonstrate the desired transformation. Match % is the proportion of original vectors that yield a match post-transformation. As we ideally want the generated samples to differ only in the specified transformation, we also calculate the average word precision against the original sentence (Prec) for any match.
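The offset-vector transformation can be sketched as follows, with synthetic latent clusters standing in for the parsed verb groups (the cluster names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

def offset_vector(src_codes, tgt_codes):
    """t = mean(target-attribute codes) - mean(source-attribute codes)."""
    return tgt_codes.mean(axis=0) - src_codes.mean(axis=0)

# Synthetic clusters standing in for, e.g., "sleeping" vs. "walking" sentences.
z_sleeping = rng.normal(loc=-1.0, scale=0.1, size=(500, 8))
z_walking = rng.normal(loc=1.0, scale=0.1, size=(500, 8))

t = offset_vector(z_sleeping, z_walking)
z_new = z_sleeping[0] + t   # transformed latent vector, decoded via p(g(z + t))
```

Adding the offset moves the code from the source cluster to the target cluster, which is the property the Match % metric probes.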
The encoder is a three-layer MLP, 784-800-400-100.
Additive Gaussian noise is injected into $\mathbf{c}$, then gradually decayed to $0$.
The autoencoder is optimized by Adam, with learning rate 5e-04.
An MLP generator 32-64-100-150-100.
An MLP critic 100-100-60-20-1 with weight clipping $\epsilon = 0.05$. The critic is trained for 10 iterations in every loop.
Weighting factor $\lambda^{(1)} = 0.2$.
The LSTM state vector is augmented by the hidden code $c$ at every decoding time step, before being forwarded into the output softmax layer.
The word embedding is of size 300.
The following changes are made based on the SNLI experiments:
Yelp Sentiment Transfer

| Positive → Negative | Negative → Positive |
|---|---|
| Original: great indoor mall . | Original: hell no ! |
| ARAE: no smoking mall . | ARAE: hell great ! |
| Cross-AE: terrible outdoor urine . | Cross-AE: incredible pork ! |
| Original: great blooming onion . | Original: highly disappointed ! |
| ARAE: no receipt onion . | ARAE: highly recommended ! |
| Cross-AE: terrible of pie . | Cross-AE: highly clean ! |
| Original: i really enjoyed getting my nails done by peter . | Original: bad products . |
| ARAE: i really needed getting my nails done by now . | ARAE: good products . |
| Cross-AE: i really really told my nails done with these things . | Cross-AE: good prices . |
| Original: definitely a great choice for sushi in las vegas ! | Original: i was so very disappointed today at lunch . |
| ARAE: definitely a num star rating for num sushi in las vegas . | ARAE: i highly recommend this place today . |
| Cross-AE: not a great choice for breakfast in las vegas vegas ! | Cross-AE: i was so very pleased to this . |
| Original: the best piece of meat i have ever had ! | Original: i have n't received any response to anything . |
| ARAE: the worst piece of meat i have ever been to ! | ARAE: i have n't received any problems to please . |
| Cross-AE: the worst part of that i have ever had had ! | Cross-AE: i have always the desert vet . |
| Original: really good food , super casual and really friendly . | Original: all the fixes were minor and the bill ? |
| ARAE: really bad food , really generally really low and decent food . | ARAE: all the barbers were entertaining and the bill did n't disappoint . |
| Cross-AE: really good food , super horrible and not the price . | Cross-AE: all the flavors were especially and one ! |
| Original: it has a great atmosphere , with wonderful service . | Original: small , smokey , dark and rude management . |
| ARAE: it has no taste , with a complete jerk . | ARAE: small , intimate , and cozy friendly staff . |
| Cross-AE: it has a great horrible food and run out service . | Cross-AE: great , , , chips and wine . |
| Original: their menu is extensive , even have italian food . | Original: the restaurant did n't meet our standard though . |
| ARAE: their menu is limited , even if i have an option . | ARAE: the restaurant did n't disappoint our expectations though . |
| Cross-AE: their menu is decent , i have gotten italian food . | Cross-AE: the restaurant is always happy and knowledge . |
| Original: everyone who works there is incredibly friendly as well . | Original: you could not see the stage at all ! |
| ARAE: everyone who works there is incredibly rude as well . | ARAE: you could see the difference at the counter ! |
| Cross-AE: everyone who works there is extremely clean and as well . | Cross-AE: you could definitely get the fuss ! |
| Original: there are a couple decent places to drink and eat in here as well . | Original: room is void of all personality , no pictures or any sort of decorations . |
| ARAE: there are a couple slices of options and num wings in the place . | ARAE: room is eclectic , lots of flavor and all of the best . |
| Cross-AE: there are a few night places to eat the car here are a crowd . | Cross-AE: it 's a nice that amazing , that one 's some of flavor . |
| Original: if you 're in the mood to be adventurous , this is your place ! | Original: waited in line to see how long a wait would be for three people . |
| ARAE: if you 're in the mood to be disappointed , this is not the place . | ARAE: waited in line for a long wait and totally worth it . |
| Cross-AE: if you 're in the drive to the work , this is my place ! | Cross-AE: another great job to see and a lot going to be from dinner . |
| Original: we came on the recommendation of a bell boy and the food was amazing . | Original: the people who ordered off the menu did n't seem to do much better . |
| ARAE: we came on the recommendation and the food was a joke . | ARAE: the people who work there are super friendly and the menu is good . |
| Cross-AE: we went on the car of the time and the chicken was awful . | Cross-AE: the place , one of the office is always worth you do a business . |
| Original: service is good but not quick , just enjoy the wine and your company . | Original: they told us in the beginning to make sure they do n't eat anything . |
| ARAE: service is good but not quick , but the service is horrible . | ARAE: they told us in the mood to make sure they do great food . |
| Cross-AE: service is good , and horrible , is the same and worst time ever . | Cross-AE: they 're us in the next for us as you do n't eat . |
| Original: the steak was really juicy with my side of salsa to balance the flavor . | Original: the person who was teaching me how to control my horse was pretty rude . |
| ARAE: the steak was really bland with the sauce and mashed potatoes . | ARAE: the person who was able to give me a pretty good price . |
| Cross-AE: the fish was so much , the most of sauce had got the flavor . | Cross-AE: the owner 's was gorgeous when i had a table and was friendly . |
| Original: other than that one hell hole of a star bucks they 're all great ! | Original: he was cleaning the table next to us with gloves on and a rag . |
| ARAE: other than that one star rating the toilet they 're not allowed . | ARAE: he was prompt and patient with us and the staff is awesome . |
| Cross-AE: a wonder our one came in a num months , you 're so better ! | Cross-AE: he was like the only thing to get some with with my hair . |
Yahoo Topic Transfer on Questions

| from Science | from Music | from Politics |
|---|---|---|
| Original: what is an event horizon with regards to black holes ? | Original: do you know a website that you can find people who want to join bands ? | Original: republicans : would you vote for a cheney / satan ticket in 2008 ? |
| ⇒ Music: what is your favorite sitcom with adam sandler ? | ⇒ Science: do you know a website that can help me with science ? | ⇒ Science: guys : how would you solve this question ? |
| ⇒ Politics: what is an event with black people ? | ⇒ Politics: do you think that you can find a person who is in prison ? | ⇒ Music: guys : would you rather be a good movie ? |
| Original: what did john paul jones do in the american revolution ? | Original: do people who quote entire poems or song lyrics ever actually get chosen best answer ? | Original: if i move to the usa do i lose my pension in canada ? |
| ⇒ Music: what did john lennon do in the new york family ? | ⇒ Science: do you think that scientists learn about human anatomy and physiology of life ? | ⇒ Science: if i move the |
Yahoo Topic Transfer on Answers from Science from Music from Politics Original take 1ml of hcl ( concentrated ) and dilute it to 50ml . Original all three are fabulous artists , with just incredible talent ! ! Original 4 years of an idiot in office + electing the idiot again = ? Music take em to you and shout it to me Science all three are genetically bonded with water , but just as many substances , are capable of producing a special case . Science 4 years of an idiot in the office of science ? Politics take bribes to islam and it will be punished . Politics all three are competing with the government , just as far as i can . Music 4 )
Table: S6.T1: Reverse PPL: perplexity of language models trained on the synthetic samples from an ARAE/AE/LM and evaluated on real data. Forward PPL: perplexity of a language model trained on real data and evaluated on synthetic samples.
| Data | Reverse PPL | Forward PPL |
|---|---|---|
| Real data | 27.4 | - |
| LM samples | 90.6 | 18.8 |
| AE samples | 97.3 | 87.8 |
| ARAE samples | 82.2 | 44.3 |
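As a concrete reading of these metrics: perplexity is the exponentiated average negative log-likelihood per token, so Reverse PPL scores real text under an LM fit to synthetic samples, and Forward PPL scores synthetic samples under an LM fit to real text. A minimal sketch of the computation itself (the token probabilities below are hypothetical, not from the paper's models):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Hypothetical per-token log-probabilities assigned by a language model.
log_probs = [math.log(0.25), math.log(0.1), math.log(0.5)]
ppl = perplexity(log_probs)  # inverse geometric mean of the probabilities (~4.31)
```

The same function serves both directions; only the choice of training and evaluation corpora for the LM changes.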
Table: S6.T2: Sentiment transfer results, transferring from positive to negative sentiment (top) and from negative to positive sentiment (bottom). Original sentences and transferred outputs from the ARAE and the Cross-Aligned AE (Shen et al., 2017) for 6 randomly-drawn examples.
| Positive | great indoor mall . |
| ⇒ ARAE | no smoking mall . |
| ⇒ Cross-AE | terrible outdoor urine . |
| Positive | it has a great atmosphere , with wonderful service . |
| ⇒ ARAE | it has no taste , with a complete jerk . |
| ⇒ Cross-AE | it has a great horrible food and run out service . |
| Positive | we came on the recommendation of a bell boy and the food was amazing . |
| ⇒ ARAE | we came on the recommendation and the food was a joke . |
| ⇒ Cross-AE | we went on the car of the time and the chicken was awful . |
| Negative | hell no ! |
| ⇒ ARAE | hell great ! |
| ⇒ Cross-AE | incredible pork ! |
| Negative | small , smokey , dark and rude management . |
| ⇒ ARAE | small , intimate , and cozy friendly staff . |
| ⇒ Cross-AE | great , , , chips and wine . |
| Negative | the people who ordered off the menu did n’t seem to do much better . |
| ⇒ ARAE | the people who work there are super friendly and the menu is good . |
| ⇒ Cross-AE | the place , one of the office is always worth you do a business . |
Table: S6.T4: Topic Transfer. Random samples from the Yahoo dataset. Note the first row of each topic block is from an ARAE trained on titles, while the following rows are from an ARAE trained on replies.
| Science | what is an event horizon with regards to black holes ? |
| ⇒ Music | what is your favorite sitcom with adam sandler ? |
| ⇒ Politics | what is an event with black people ? |
| Science | take 1ml of hcl ( concentrated ) and dilute it to 50ml . |
| ⇒ Music | take em to you and shout it to me |
| ⇒ Politics | take bribes to islam and it will be punished . |
| Science | just multiply the numerator of one fraction by that of the other . |
| ⇒ Music | just multiply the fraction of the other one that 's just like it . |
| ⇒ Politics | just multiply the same fraction of other countries . |
| Music | do you know a website that you can find people who want to join bands ? |
| ⇒ Science | do you know a website that can help me with science ? |
| ⇒ Politics | do you think that you can find a person who is in prison ? |
| Music | all three are fabulous artists , with just incredible talent ! ! |
| ⇒ Science | all three are genetically bonded with water , but just as many substances , are capable of producing a special case . |
| ⇒ Politics | all three are competing with the government , just as far as i can . |
| Music | but there are so many more i can 't think of ! |
| ⇒ Science | but there are so many more of the number of questions . |
| ⇒ Politics | but there are so many more of the can i think of today . |
| Politics | republicans : would you vote for a cheney / satan ticket in 2008 ? |
| ⇒ Science | guys : how would you solve this question ? |
| ⇒ Music | guys : would you rather be a good movie ? |
| Politics | 4 years of an idiot in office + electing the idiot again = ? |
| ⇒ Science | 4 years of an idiot in the office of science ? |
| ⇒ Music | 4 ) |
| Politics | anyone who doesnt have a billion dollars for all the publicity cant win . |
| ⇒ Science | anyone who doesnt have a decent chance is the same for all the other . |
| ⇒ Music | anyone who doesnt have a lot of the show for the publicity . |
Table: S6.T5: Semi-Supervised accuracy on the natural language inference (SNLI) test set, respectively using 22.2% (medium), 10.8% (small), 5.25% (tiny) of the supervised labels of the full SNLI training set (rest used for unlabeled AE training).
| Model | Medium | Small | Tiny |
|---|---|---|---|
| Supervised Encoder | 65.9% | 62.5% | 57.9% |
| Semi-Supervised AE | 68.5% | 64.6% | 59.9% |
| Semi-Supervised ARAE | 70.9% | 66.8% | 62.5% |
ARAE architecture. A discrete sequence $\mathbf{x}$ is encoded and decoded to produce $\hat{\mathbf{x}}$. A noise sample $\mathbf{s}$ is passed through a generator $g_{\theta}$ (possibly the identity) to produce a prior. The critic function $f_w$ is used only at training time to enforce the regularization $W$. The model produces discrete samples $\mathbf{x}$ from noise $\mathbf{s}$. Section 5 relates these samples $\mathbf{x} \sim \mathbb{P}_{\psi}$ to $\mathbf{x} \sim \mathbb{P}_{\star}$.
Image samples. The top block shows output generation of the decoder for random noise samples; the bottom block shows sample interpolation results.
Left: $\ell_{2}$ norm of encoder output $\mathbf{z}$ and generator output $\tilde{\mathbf{z}}$ during ARAE training ($\mathbf{z}$ is normalized, whereas the generator learns to match). Middle: sum of the dimension-wise variances of the encoder codes $\mathbf{z}$ and generator codes $\tilde{\mathbf{z}}$, as well as of a reference AE. Right: average cosine similarity of nearby sentences (by word edit-distance) for the ARAE and AE during training.
$$ \mathcal{L}_{\text{rec}}(\phi, \psi) = - \log p_{\psi}(\mathbf{x}\ |\ \text{enc}_{\phi}(\mathbf{x})) $$
$$ \min_{\theta} \max_{w \in \mathcal{W}} \mathbb{E}_{\mathbf{z} \sim \mathbb{P}_{Q}}[f_w(\mathbf{z})] - \mathbb{E}_{\tilde{\mathbf{z}} \sim \mathbb{P}_{\mathbf{z}}}[f_w(\tilde{\mathbf{z}})] $$ \tag{equ:wgan}
$$ \min_{\phi, \psi}\quad \mathcal{L}_{\text{rec}}(\phi, \psi) + \lambda^{(1)} W(\mathbb{P}_Q, \mathbb{P}_{\mathbf{z}}) $$
$$ W_c(\mathbb{P}_\star, \mathbb{P}_\psi) = \inf_{\Gamma \in \mathcal{P}(\mathbf{x} \sim \mathbb{P}_\star,\ \mathbf{y} \sim \mathbb{P}_\psi)} \mathbb{E}_{\mathbf{x},\mathbf{y} \sim \Gamma}[c(\mathbf{x}, \mathbf{y})] $$
$$ W_c(\mathbb{P}_\star,\mathbb{P}_\psi) = \inf_{Q(\mathbf{z}\ |\ \mathbf{x})\ :\ \mathbb{P}_Q = \mathbb{P}_{\mathbf{z}}} \mathbb{E}_{\mathbb{P}_\star}\,\mathbb{E}_{Q(\mathbf{z}\ |\ \mathbf{x})} [c(\mathbf{x}, G_{\psi}(\mathbf{z}))] $$
$$ \Vert \mathbb{P}_\psi - \mathbb{P}_\star \Vert_{\text{TV}} = \frac{1}{2} \sum_{\mathbf{x} \in \mathcal{V}^m} |p_\psi(\mathbf{x}) - p_{\star}(\mathbf{x})| $$
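The total variation distance above is simply half the $\ell_1$ distance between the two probability mass functions; a minimal sketch with hypothetical distributions over a three-symbol vocabulary:

```python
def tv_distance(p, q):
    """Total variation distance: half the L1 distance between two pmfs."""
    assert abs(sum(p) - 1.0) < 1e-9 and abs(sum(q) - 1.0) < 1e-9
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Hypothetical model and data distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
d = tv_distance(p, q)  # 0.5 * (0.3 + 0.0 + 0.3) = 0.3
```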
$$ \mathbb{E}_{X\sim\mathbb{P}_{n}}\Big[\prod_{i=1}^{d}X_{i}^{p_{i}}\Big] \to \mathbb{E}_{X\sim\mathbb{P}}\Big[\prod_{i=1}^{d}X_{i}^{p_{i}}\Big] $$ \tag{A2.Ex21}
$$ \log 2 \cdot \mathds{1}\{ \mathbf{x} \ne \operatorname*{argmax}_{\mathbf{w} \in \mathcal{X}} \mathbf{w}^\top f_\psi(\mathbf{z})\} < -\log \mathbf{x}^\top f_\psi(\mathbf{z}) $$
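This inequality can be sanity-checked numerically; a minimal sketch with a hypothetical simplex point $f_\psi(\mathbf{z})$ standing in for a decoder softmax over a three-word vocabulary:

```python
import math

def check_bound(probs, x_index):
    """Return (log2 * 1{x != argmax}, -log x^T f) for one-hot x = e_{x_index}."""
    argmax = max(range(len(probs)), key=probs.__getitem__)
    lhs = math.log(2) * (1 if x_index != argmax else 0)
    rhs = -math.log(probs[x_index])
    return lhs, rhs

# Hypothetical point on the simplex (sums to 1, argmax at index 0).
f = [0.6, 0.3, 0.1]
for i in range(3):
    lhs, rhs = check_bound(f, i)
    assert lhs <= rhs  # the indicator bound holds for every one-hot x
```

When `x_index` is the argmax the left side is zero while the right side is positive; otherwise the probability of `x_index` is below 0.5, so its negative log exceeds log 2, exactly as the proof argues.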
Definition 1 (Kantorovich's formulation of optimal transport). Let $\mathbb{P}_{\star}, \mathbb{P}_{\psi}$ be distributions over $\mathcal{X}$, and further let $c(\mathbf{x},\mathbf{y}): \mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}^{+}$ be a cost function. Then the optimal transport (OT) problem is given by $$W_{c}(\mathbb{P}_{\star},\mathbb{P}_{\psi})=\inf_{\Gamma\in\mathcal{P}(\mathbf{x}\sim\mathbb{P}_{\star},\ \mathbf{y}\sim\mathbb{P}_{\psi})}\mathbb{E}_{\mathbf{x},\mathbf{y}\sim\Gamma}[c(\mathbf{x},\mathbf{y})]$$ where $\mathcal{P}(\mathbf{x}\sim\mathbb{P}_{\star},\ \mathbf{y}\sim\mathbb{P}_{\psi})$ is the set of all joint distributions of $(\mathbf{x},\mathbf{y})$ with marginals $\mathbb{P}_{\star}$ and $\mathbb{P}_{\psi}$.
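A small worked instance may help fix intuition (the numbers are illustrative, not from the paper). Take $\mathcal{X}=\{a,b\}$ with the indicator cost $c(\mathbf{x},\mathbf{y})=\mathds{1}\{\mathbf{x}\ne\mathbf{y}\}$, and let $\mathbb{P}_{\star}=(0.7, 0.3)$, $\mathbb{P}_{\psi}=(0.4, 0.6)$. The optimal coupling keeps as much mass on the diagonal as possible:
$$ \Gamma^{*} = \begin{pmatrix} 0.4 & 0.3 \\ 0 & 0.3 \end{pmatrix}, \qquad W_{c}(\mathbb{P}_{\star},\mathbb{P}_{\psi}) = \mathbb{E}_{\Gamma^{*}}[c] = 0.3 = \frac{1}{2}\big(|0.7-0.4|+|0.3-0.6|\big), $$
since only the off-diagonal mass $0.3$ incurs cost. This recovers, on a two-point space, the correspondence between optimal transport with the indicator cost and total variation distance used in the proof of the corollary.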
Theorem 1. Let $G_{\psi}:\mathcal{Z}\rightarrow\mathcal{X}$ be a deterministic function (parameterized by $\psi$) from the latent space $\mathcal{Z}$ to data space $\mathcal{X}$ that induces a Dirac distribution $\mathbb{P}_{\psi}(\mathbf{x}\ |\ \mathbf{z})$ on $\mathcal{X}$, i.e. $p_{\psi}(\mathbf{x}\ |\ \mathbf{z})=\mathds{1}\{\mathbf{x}=G_{\psi}(\mathbf{z})\}$. Let $Q(\mathbf{z}\ |\ \mathbf{x})$ be any conditional distribution on $\mathcal{Z}$ with density $p_{Q}(\mathbf{z}\ |\ \mathbf{x})$. Define its marginal to be $\mathbb{P}_{Q}$, which has density $p_{Q}(\mathbf{z})=\int_{\mathbf{x}} p_{Q}(\mathbf{z}\ |\ \mathbf{x})\, p_{\star}(\mathbf{x})\, d\mathbf{x}$. Then $$W_{c}(\mathbb{P}_{\star},\mathbb{P}_{\psi})=\inf_{Q(\mathbf{z}\ |\ \mathbf{x})\ :\ \mathbb{P}_{Q}=\mathbb{P}_{\mathbf{z}}}\mathbb{E}_{\mathbb{P}_{\star}}\,\mathbb{E}_{Q(\mathbf{z}\ |\ \mathbf{x})}[c(\mathbf{x},G_{\psi}(\mathbf{z}))]$$
Corollary 1 (Discrete case). Suppose $\mathbf{x} \in \mathcal{X}$ where $\mathcal{X}$ is the set of all one-hot vectors of length $n$, and let $f_\psi:\mathcal{Z} \rightarrow \Delta^{n-1}$ be a deterministic function from the latent space $\mathcal{Z}$ to the $(n-1)$-dimensional simplex $\Delta^{n-1}$. Further let $G_\psi: \mathcal{Z} \rightarrow \mathcal{X}$ be a deterministic function such that $G_\psi(\mathbf{z})= \operatorname*{argmax}_{\mathbf{w} \in \mathcal{X}}\mathbf{w}^\top f_\psi(\mathbf{z})$, and as above let $\mathbb{P}_\psi(\mathbf{x}\ |\ \mathbf{z})$ be the Dirac distribution derived from $G_\psi$, i.e. $p_\psi(\mathbf{x}\ |\ \mathbf{z}) = \mathds{1}\{\mathbf{x} = G_\psi(\mathbf{z})\}$. Then the following is an upper bound on $\Vert \mathbb{P}_\psi - \mathbb{P}_\star \Vert_{\text{TV}}$, the total variation distance between $\mathbb{P}_\star$ and $\mathbb{P}_\psi$: $$ \inf_{Q(\mathbf{z}\ |\ \mathbf{x})\ :\ \mathbb{P}_Q = \mathbb{P}_{\mathbf{z}}} \mathbb{E}_{\mathbb{P}_\star}\,\mathbb{E}_{Q(\mathbf{z}\ |\ \mathbf{x})} \Big[-\frac{2}{\log 2} \log \mathbf{x}^\top f_\psi(\mathbf{z})\Big] $$
Proposition 1. Let $\mathbb{P}$ be a distribution on a compact set $\mathcal{X}$, and let $(\mathbb{P}_{n})_{n\in\mathbb{N}}$ be a sequence of distributions on $\mathcal{X}$. Further suppose that $W(\mathbb{P}_{n},\mathbb{P})\to 0$. Then the following statements hold: (i) $\mathbb{P}_{n}\rightsquigarrow\mathbb{P}$ (i.e. convergence in distribution). (ii) All moments converge, i.e. for all $k>1$, $k\in\mathbb{N}$, $$\mathbb{E}_{X\sim\mathbb{P}_{n}}\Big[\prod_{i=1}^{d}X_{i}^{p_{i}}\Big]\to\mathbb{E}_{X\sim\mathbb{P}}\Big[\prod_{i=1}^{d}X_{i}^{p_{i}}\Big]$$ for all $p_{1},\dots,p_{d}$ such that $\sum_{i=1}^{d}p_{i}=k$.
$$
\begin{aligned}
&\min_{\phi, \psi} && \mathcal{L}_{\text{rec}}(\phi, \psi) = \mathbb{E}_{\mathbf{x} \sim \mathbb{P}_{\star}}\left[- \log p_\psi(\mathbf{x}\ |\ \text{enc}_\phi(\mathbf{x})) \right] \\
&\max_{w \in \mathcal{W}} && \mathcal{L}_{\text{cri}}(w) = \mathbb{E}_{\mathbf{x} \sim \mathbb{P}_\star}\left[ f_w(\text{enc}_{\phi}(\mathbf{x}))\right] - \mathbb{E}_{\tilde{\mathbf{z}} \sim \mathbb{P}_{\mathbf{z}}}\left[f_w(\tilde{\mathbf{z}})\right] \\
&\min_{\phi} && \mathcal{L}_{\text{enc}}(\phi) = \mathbb{E}_{\mathbf{x} \sim \mathbb{P}_\star}\left[ f_w(\text{enc}_{\phi}(\mathbf{x}))\right] - \mathbb{E}_{\tilde{\mathbf{z}} \sim \mathbb{P}_{\mathbf{z}}}\left[f_w(\tilde{\mathbf{z}})\right]
\end{aligned}
$$
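Note that the critic and encoder objectives are the same expression optimized in opposite directions. A minimal numpy sketch with a hypothetical linear critic $f_w(\mathbf{z}) = w^\top \mathbf{z}$ (toy shapes, not the paper's networks):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 4, 8                        # code dimension, batch size (illustrative)
w = rng.normal(size=d)             # linear critic f_w(z) = w . z

z = rng.normal(size=(m, d))        # stand-in for encoder codes enc_phi(x)
z_tilde = rng.normal(size=(m, d))  # stand-in for prior/generator codes

def f_w(codes):
    return codes @ w

# L_cri is maximized with respect to w; L_enc is the same quantity
# minimized with respect to the encoder parameters phi.
l_cri = f_w(z).mean() - f_w(z_tilde).mean()
l_enc = f_w(z).mean() - f_w(z_tilde).mean()
```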
Proof. Let the cost function be $c(\mathbf{x},\mathbf{y}) = \mathds{1}\{\mathbf{x} \ne \mathbf{y}\}$. We first note that for all $\mathbf{x},\mathbf{z}$,
$$ \log 2 \cdot \mathds{1}\{ \mathbf{x} \ne \operatorname*{argmax}_{\mathbf{w} \in \mathcal{X}} \mathbf{w}^\top f_\psi(\mathbf{z})\} < -\log \mathbf{x}^\top f_\psi(\mathbf{z}) $$
This holds since if $\mathds{1}\{ \mathbf{x} \ne \operatorname*{argmax}_{\mathbf{w} \in \mathcal{X}}\mathbf{w}^\top f_\psi(\mathbf{z})\} = 1$, we have $\mathbf{x}^\top f_\psi(\mathbf{z}) < 0.5$, and hence $-\log \mathbf{x}^\top f_\psi(\mathbf{z}) > -\log 0.5 = \log 2$. If on the other hand $\mathbf{x} = \operatorname*{argmax}_{\mathbf{w} \in \mathcal{X}}\mathbf{w}^\top f_\psi(\mathbf{z})$, then the LHS is $0$ and the RHS is always positive since $f_\psi(\mathbf{z}) \in \Delta^{n-1}$. Then,
$$
\begin{aligned}
& \inf_{Q\ :\ \mathbb{P}_Q = \mathbb{P}_{\mathbf{z}}} \mathbb{E}_{\mathbb{P}_\star}\,\mathbb{E}_{Q(\mathbf{z}\ |\ \mathbf{x})} \Big[-\frac{2}{\log 2}\log \mathbf{x}^\top f_\psi(\mathbf{z})\Big] \\
>\ & \inf_{Q\ :\ \mathbb{P}_Q = \mathbb{P}_{\mathbf{z}}} \mathbb{E}_{\mathbb{P}_\star}\,\mathbb{E}_{Q(\mathbf{z}\ |\ \mathbf{x})} \big[2\,\mathds{1}\{ \mathbf{x} \ne \operatorname*{argmax}_{\mathbf{w} \in \mathcal{X}} \mathbf{w}^\top f_\psi(\mathbf{z}) \}\big] \\
=\ & 2 \inf_{Q\ :\ \mathbb{P}_Q = \mathbb{P}_{\mathbf{z}}} \mathbb{E}_{\mathbb{P}_\star}\,\mathbb{E}_{Q(\mathbf{z}\ |\ \mathbf{x})} \big[\mathds{1}\{ \mathbf{x} \ne G_\psi(\mathbf{z}) \}\big] \\
=\ & 2 \inf_{Q\ :\ \mathbb{P}_Q = \mathbb{P}_{\mathbf{z}}} \mathbb{E}_{\mathbb{P}_\star}\,\mathbb{E}_{Q(\mathbf{z}\ |\ \mathbf{x})} \big[c(\mathbf{x}, G_\psi(\mathbf{z}))\big] \\
=\ & 2\,W_c(\mathbb{P}_\star, \mathbb{P}_\psi) \\
=\ & \Vert \mathbb{P}_\star - \mathbb{P}_\psi \Vert_{\text{TV}}
\end{aligned}
$$
The fifth line follows from Theorem 1, and the last equality uses the well-known correspondence between total variation distance and optimal transport with the indicator cost function [Gozlan, 2010]. ∎
Proof. (i) is proved in Villani (2008), Theorem 6.9. For (ii), by the Portmanteau theorem, (i) is equivalent to the following statement: $\mathbb{E}_{X \sim \mathbb{P}_n}[f(X)] \to \mathbb{E}_{X \sim \mathbb{P}}[f(X)]$ for all bounded and continuous functions $f: \mathbb{R}^{d} \to \mathbb{R}$, where $d$ is the dimension of the random variable. The $k$-th moments of a distribution are given by $$ \mathbb{E} \Big[\prod_{i=1}^d X_i^{p_i}\Big] \quad \text{such that} \quad \sum_{i=1}^d p_i = k $$ Our encoded code is bounded, as we normalize the encoder output to lie on the unit sphere, and our generated code is bounded to lie in $(-1,1)^n$ by the $\tanh$ function. Hence $f(X) = \prod_{i=1}^d X_i^{q_i}$ is a bounded continuous function for all $q_i \ge 0$. Therefore, $$ \mathbb{E}_{X \sim \mathbb{P}_n} \Big[\prod_{i=1}^d X_i^{p_i}\Big] \to \mathbb{E}_{X \sim \mathbb{P}}\Big[\prod_{i=1}^d X_i^{p_i}\Big] $$ where $\sum_{i=1}^d p_i = k$. ∎
\begin{algorithm}[t]
\footnotesize
\caption{ARAE Training}\label{alg:train}
\begin{algorithmic}
\FOR{each training iteration}
\STATE \textbf{\textit{(1) Train the encoder/decoder for reconstruction}} $(\phi, \psi)$
\STATE Sample $\{\mathbf{x}^{(i)}\}_{i=1}^m \sim \mathbb{P}_{\star}$ and compute $\mathbf{z}^{(i)} = \text{enc}_{\phi}(\mathbf{x}^{(i)})$
\STATE Backprop loss $\mathcal{L}_{\text{rec}} = -\frac{1}{m} \sum_{i=1}^m \log p_{\psi}(\mathbf{x}^{(i)}\ |\ \mathbf{z}^{(i)})$
\vspace{0.2cm}
\STATE \textbf{\textit{(2) Train the critic}} $(w)$ (repeat $k$ times)
\STATE Sample $\{\mathbf{x}^{(i)}\}_{i=1}^m \sim \mathbb{P}_{\star}$ and $\{\mathbf{s}^{(i)}\}_{i=1}^m \sim \mathcal{N}(0, \mathbf{I})$
\STATE Compute $\mathbf{z}^{(i)} = \text{enc}_{\phi}(\mathbf{x}^{(i)})$ and $\tilde{\mathbf{z}}^{(i)} = g_{\theta}(\mathbf{s}^{(i)})$
\STATE Backprop loss $-\frac{1}{m} \sum_{i=1}^m f_w(\mathbf{z}^{(i)}) + \frac{1}{m} \sum_{i=1}^m f_w(\tilde{\mathbf{z}}^{(i)})$
\STATE Clip critic $w$ to $[-\epsilon, \epsilon]^{d}$
\vspace{0.2cm}
\STATE \textbf{\textit{(3) Train the encoder/generator adversarially}} $(\phi, \theta)$
\STATE Sample $\{\mathbf{x}^{(i)}\}_{i=1}^m \sim \mathbb{P}_{\star}$ and $\{\mathbf{s}^{(i)}\}_{i=1}^m \sim \mathcal{N}(0, \mathbf{I})$
\STATE Compute $\mathbf{z}^{(i)} = \text{enc}_{\phi}(\mathbf{x}^{(i)})$ and $\tilde{\mathbf{z}}^{(i)} = g_{\theta}(\mathbf{s}^{(i)})$
\STATE Backprop loss $\frac{1}{m} \sum_{i=1}^m f_w(\mathbf{z}^{(i)}) - \frac{1}{m} \sum_{i=1}^m f_w(\tilde{\mathbf{z}}^{(i)})$
\ENDFOR
\end{algorithmic}
\end{algorithm}
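The critic update at the heart of the loop can be sketched with toy linear components and hand-derived gradients; everything here (shapes, learning rate, the linear critic) is an illustrative assumption rather than the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 16          # code dimension and batch size (illustrative)
eps, lr = 0.01, 0.05  # clipping bound and step size (illustrative)

w = rng.normal(size=d) * 0.1  # linear critic f_w(z) = w . z

def critic_step(z, z_tilde, w):
    """Step (2): ascend E[f_w(z)] - E[f_w(z_tilde)] in w, then clip to [-eps, eps]^d."""
    grad = z.mean(axis=0) - z_tilde.mean(axis=0)  # gradient of the objective w.r.t. w
    return np.clip(w + lr * grad, -eps, eps)      # gradient ascent + weight clipping

for _ in range(5):                       # stand-in for training iterations
    z = rng.normal(size=(m, d))          # stand-in for encoder codes enc_phi(x)
    z_tilde = rng.normal(size=(m, d))    # stand-in for generator codes g_theta(s)
    w = critic_step(z, z_tilde, w)

assert np.all(np.abs(w) <= eps)          # clipping keeps the critic bounded
```

The encoder/generator step (3) would take a gradient step on the same objective with the opposite sign, through the encoder and generator parameters.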
\begin{algorithm}
\caption{ARAE Transfer Extension}\label{alg:train2}
\begin{algorithmic}
\STATE [Each loop additionally:]
\STATE \textbf{\textit{(2b) Train attribute classifier}} $(u)$
\STATE Sample $\{\mathbf{x}^{(i)}\}_{i=1}^m \sim \mathbb{P}_{\star}$, lookup $y^{(i)}$, and compute $\mathbf{z}^{(i)} = \text{enc}_{\phi}(\mathbf{x}^{(i)})$
\STATE Backprop loss $-\frac{1}{m} \sum_{i=1}^m \log p_u(y^{(i)}\ |\ \mathbf{z}^{(i)})$
\vspace{0.2cm}
\STATE \textbf{\textit{(3b) Train the encoder adversarially}} $(\phi)$
\STATE Sample $\{\mathbf{x}^{(i)}\}_{i=1}^m \sim \mathbb{P}_{\star}$, lookup $y^{(i)}$, and compute $\mathbf{z}^{(i)} = \text{enc}_{\phi}(\mathbf{x}^{(i)})$
\STATE Backprop loss $-\frac{1}{m} \sum_{i=1}^m \log p_u(1-y^{(i)}\ |\ \mathbf{z}^{(i)})$
\end{algorithmic}
\end{algorithm}
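Steps (2b)/(3b) amount to a cross-entropy loss on the true label for the classifier and on the flipped label for the encoder. A minimal sketch with a hypothetical logistic attribute classifier $p_u(y{=}1\,|\,\mathbf{z}) = \sigma(u^\top \mathbf{z})$ (the weights and code below are illustrative values only):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def nll(u, z, y):
    """-log p_u(y | z) for a logistic attribute classifier p_u(1|z) = sigmoid(u . z)."""
    p1 = sigmoid(sum(ui * zi for ui, zi in zip(u, z)))
    return -math.log(p1 if y == 1 else 1.0 - p1)

# Hypothetical classifier weights and latent code.
u = [0.5, -0.2]
z = [1.0, 2.0]
y = 1
loss_classifier = nll(u, z, y)      # step (2b): fit the true attribute label
loss_encoder = nll(u, z, 1 - y)     # step (3b): push codes toward the flipped label
```

Minimizing the encoder loss removes attribute information from $\mathbf{z}$, since the classifier is rewarded for predicting the wrong label.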
Automatic Evaluation

| Model | Transfer | BLEU | Forward PPL | Reverse PPL |
|---|---|---|---|---|
| Cross-Aligned AE | 77.1% | 17.75 | 65.9 | 124.2 |
| AE | 59.3% | 37.28 | 31.9 | 68.9 |
| ARAE, λ_a^(1) | 73.4% | 31.15 | 29.7 | 70.1 |
| ARAE, λ_b^(1) | 81.8% | 20.18 | 27.7 | 77.0 |

Human Evaluation

| Model | Transfer | Similarity | Naturalness |
|---|---|---|---|
| Cross-Aligned AE | 57% | 3.8 | 2.7 |
| ARAE, λ_b^(1) | 74% | 3.7 | 3.8 |
| k | AE | ARAE |
|---|---|---|
| 0 | 1.06 | 2.19 |
| 1 | 4.51 | 4.07 |
| 2 | 6.61 | 5.39 |
| 3 | 9.14 | 6.86 |
| 4 | 9.97 | 7.47 |

| Model | Samples |
|---|---|
| Original | A woman wearing sunglasses |
| Noised | A woman sunglasses wearing |
| AE | A woman sunglasses wearing sunglasses |
| ARAE | A woman wearing sunglasses |
| Original | Pets galloping down the street |
| Noised | Pets down the galloping street |
| AE | Pets riding the down galloping |
| ARAE | Pets congregate down the street near a ravine |
| Transform | Match (%) | Precision (%) |
|---|---|---|
| walking | 85 | 79.5 |
| man | 92 | 80.2 |
| two | 86 | 74.1 |
| dog | 88 | 77 |
| standing | 89 | 79.3 |
| several | 70 | 67 |
| Transform | Original → Transformed |
|---|---|
| ⇒ walking | A man in a tie is sleeping and clapping on balloons . → A man in a tie is clapping and walking dogs . |
| ⇒ man | The jewish boy is trying to stay out of his skateboard . → The jewish man is trying to stay out of his horse . |
| ⇒ Two | Some child head a playing plastic with drink . → Two children playing a head with plastic drink . |
| ⇒ dog | The people shine or looks into an area . → The dog arrives or looks into an area . |
| ⇒ standing | A women are walking outside near a man . → Three women are standing near a man walking . |
| ⇒ Several | A side child listening to a piece with steps playing on a table → Several child playing a guitar on side with a table . |
| Positive to Negative | Negative to Positive | ||
|---|---|---|---|
| Original ARAE Cross-AE | great indoor mall . no smoking mall . terrible outdoor urine . | Original ARAE Cross-AE | hell no ! hell great ! incredible pork ! |
| Original ARAE Cross-AE | great blooming onion . no receipt onion . terrible of pie . | Original ARAE Cross-AE | highly disappointed ! highly recommended ! highly clean ! |
| Original ARAE Cross-AE | i really enjoyed getting my nails done by peter . i really needed getting my nails done by now . i really really told my nails done with these things . | Original ARAE Cross-AE | bad products . good products . good prices . |
| Original ARAE Cross-AE | definitely a great choice for sushi in las vegas ! definitely a num star rating for num sushi in las vegas . not a great choice for breakfast in las vegas vegas ! | Original ARAE Cross-AE | i was so very disappointed today at lunch . i highly recommend this place today . i was so very pleased to this . |
| Original ARAE Cross-AE | the best piece of meat i have ever had ! the worst piece of meat i have ever been to ! the worst part of that i have ever had had ! | Original ARAE Cross-AE | i have n't received any response to anything . i have n't received any problems to please . i have always the desert vet . |
| Original ARAE Cross-AE | really good food , super casual and really friendly . really bad food , really generally really low and decent food . really good food , super horrible and not the price . | Original ARAE Cross-AE | all the fixes were minor and the bill ? all the barbers were entertaining and the bill did n't disappoint . all the flavors were especially and one ! |
| Original ARAE Cross-AE | it has a great atmosphere , with wonderful service . it has no taste , with a complete jerk . it has a great horrible food and run out service . | Original ARAE Cross-AE | small , smokey , dark and rude management . small , intimate , and cozy friendly staff . great , , , chips and wine . |
| Original ARAE Cross-AE | their menu is extensive , even have italian food . their menu is limited , even if i have an option . their menu is decent , i have gotten italian food . | Original ARAE Cross-AE | the restaurant did n't meet our standard though . the restaurant did n't disappoint our expectations though . the restaurant is always happy and knowledge . |
| Original ARAE Cross-AE | everyone who works there is incredibly friendly as well . everyone who works there is incredibly rude as well . everyone who works there is extremely clean and as well . | Original ARAE Cross-AE | you could not see the stage at all ! you could see the difference at the counter ! you could definitely get the fuss ! |
| Original ARAE Cross-AE | there are a couple decent places to drink and eat in here as well . there are a couple slices of options and num wings in the place . there are a few night places to eat the car here are a crowd . | Original ARAE Cross-AE | room is void of all personality , no pictures or any sort of decorations . room is eclectic , lots of flavor and all of the best . it 's a nice that amazing , that one 's some of flavor . |
| Original ARAE Cross-AE | if you 're in the mood to be adventurous , this is your place ! if you 're in the mood to be disappointed , this is not the place . if you 're in the drive to the work , this is my place ! | Original ARAE Cross-AE | waited in line to see how long a wait would be for three people . waited in line for a long wait and totally worth it . another great job to see and a lot going to be from dinner . |
| Original ARAE Cross-AE | we came on the recommendation of a bell boy and the food was amazing . we came on the recommendation and the food was a joke . we went on the car of the time and the chicken was awful . | Original ARAE Cross-AE | the people who ordered off the menu did n't seem to do much better . the people who work there are super friendly and the menu is good . the place , one of the office is always worth you do a business . |
| Original ARAE Cross-AE | service is good but not quick , just enjoy the wine and your company . service is good but not quick , but the service is horrible . service is good , and horrible , is the same and worst time ever . | Original ARAE Cross-AE | they told us in the beginning to make sure they do n't eat anything . they told us in the mood to make sure they do great food . they 're us in the next for us as you do n't eat . |
| Original ARAE Cross-AE | the steak was really juicy with my side of salsa to balance the flavor . the steak was really bland with the sauce and mashed potatoes . the fish was so much , the most of sauce had got the flavor . | Original ARAE Cross-AE | the person who was teaching me how to control my horse was pretty rude the person who was able to give me a pretty good price . the owner 's was gorgeous when i had a table and was friendly . |
| Original ARAE Cross-AE | other than that one hell hole of a star bucks they 're all great ! other than that one star rating the toilet they 're not allowed . a wonder our one came in a num months , you 're so better ! | Original ARAE Cross-AE | he was cleaning the table next to us with gloves on and a rag . he was prompt and patient with us and the staff is awesome . he was like the only thing to get some with with my hair . |
| | from Science | | from Music | | from Politics |
|---|---|---|---|---|---|
| Original | what is an event horizon with regards to black holes ? | Original | do you know a website that you can find people who want to join bands ? | Original | republicans : would you vote for a cheney / satan ticket in 2008 ? |
| Music | what is your favorite sitcom with adam sandler ? | Science | do you know a website that can help me with science ? | Science | guys : how would you solve this question ? |
| Politics | what is an event with black people ? | Politics | do you think that you can find a person who is in prison ? | Music | guys : would you rather be a good movie ? |
| Original | what did john paul jones do in the american revolution ? | Original | do people who quote entire poems or song lyrics actually ever get chosen best answer ? | Original | if i move to the usa do i lose my pension in canada ? |
| Music | what did john lennon do in the new york family ? | Science | do you think that scientists learn about anatomy and physiology of human life ? | Science | if i move the |
| Politics | what did john mccain do in the next election ? | Politics | do people who knows anything about the recent issue of ? | Music | if i move to the music do you think i feel better ? |
| Original | can anybody suggest a good topic for a statistical survey ? | Original | from big brother , what is the girls name who had | Original | what is your reflection on what will be our organizations in the future ? |
| Music | can anybody suggest a good site for a techno ? | Science | in big bang what is the | Science | what is your opinion on what will be the future in our future ? |
| Politics | can anybody suggest a good topic for a student visa ? | Politics | big brother in the is her ? | Music | what is your favorite music videos on the may i find ? |
| Original | can a kidney infection effect a woman 's | Original | where is the tickets for the filming of the suite life of zack and cody ? | Original | wouldn 't it be fun if we the people veto or passed bills ? |
| Music | can anyone give me a good film | Science | where is the best place of the blood stream for the production of the cell ? | Science | isnt it possible to be cloned if we put the moon or it ? |
| Politics | can a landlord officer have a | Politics | where is the best place of the navy and the senate of the union ? | Music | isnt it possible or if we 're getting married ? |
| Original | where does the term " sweating | Original | the a | Original | can anyone tell me how i could go about interviewing north vietnamese soldiers ? |
| Music | where does the term " | Science | the | Science | can anyone tell me how i could find how to build a robot ? |
| Politics | where does the term " | Politics | the an | Music | can anyone tell me how i could find out about my parents ? |
| Original | what other | Original | what is the first metal band in the early 60 's ..... ? ? ? ? | Original | if the us did not exist would the world be a better place ? |
| Music | what other | Science | what is the first country in the universe ? | Science | if the world did not exist , would it be possible ? |
| Politics | what other | Politics | who is the first president in the usa ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | Music | if you could not have a thing who would it be ? |
| | from Science | | from Music | | from Politics |
|---|---|---|---|---|---|
| Original | take 1ml of hcl ( concentrated ) and dilute it to 50ml . | Original | all three are fabulous artists , with just incredible talent ! ! | Original | 4 years of an idiot in office + electing the idiot again = ? |
| Music | take em to you and shout it to me | Science | all three are genetically bonded with water , but just as many substances , are capable of producing a special case . | Science | 4 years of an idiot in the office of science ? |
| Politics | take bribes to islam and it will be punished . | Politics | all three are competing with the government , just as far as i can . | Music | 4 ) |
| Original | oils do not do this , they do not " set " . | Original | she , too , wondered about the underwear outside the clothes . | Original | send me $ 100 and i 'll send you a copy - honest . |
| Music | cucumbers do not do this , they do not " do " . | Science | she , too , i know , the clothes outside the clothes . | Science | send me an email and i 'll send you a copy . |
| Politics | corporations do not do this , but they do not . | Politics | she , too , i think that the cops are the only thing about the outside of the u.s. . | Music | send me $ 100 and i 'll send you a copy . |
| Original | the average high temps in jan and feb are about 48 deg . | Original | i like rammstein and i don 't speak or understand german . | Original | wills can be |
| Music | the average high school in seattle and is about 15 minutes . | Science | i like googling and i don 't understand or speak . | Science | euler can be |
| Politics | the average high infantry division is in afghanistan and alaska . | Politics | i like mccain and i don 't care about it . | Music | ations , or |
| Original | the light from you lamps would move away from you at light speed | Original | mark is great , but the guest hosts were cool too ! | Original | hungary : 20 january 1945 , ( formerly a member of the axis ) |
| Music | the light from you tube would move away from you | Science | mark is great , but the water will be too busy for the same reason . | Science | nh3 : 20 january , 78 ( a ) |
| Politics | the light from you could go away from your state | Politics | mark twain , but the great lakes , the united states of america is too busy . | Music | 1966 - 20 january 1961 ( a ) 1983 song |
| Original | van | Original | they all offer terrific information about the cast and characters , ... | Original | bulgaria : 8 september 1944 , ( formerly a member of the axis ) |
| Music | van | Science | they all offer insight about the characteristics of the earth , and are composed of many stars . | Science | moreover , 8 ˆ 3 + ( x + 7 ) ( x ˆ 2 ) = ( a ˆ 2 ) |
| Politics | van | Politics | they all offer legitimate information about the invasion of iraq and the u.s. , and all aspects of history . | Music | harrison : 8 september 1961 ( a ) ( 1995 ) |
| Original | just multiply the numerator of one fraction by that of the other . | Original | but there are so many more i can 't think of ! | Original | anyone who doesnt have a billion dollars for all the publicity cant win . |
| Music | just multiply the fraction of the other one that 's just like it . | Science | but there are so many more of the number of questions . | Science | anyone who doesnt have a decent chance is the same for all the other . |
| Politics | just multiply the same fraction of other countries . | Politics | but there are so many more of the can i think of today . | Music | anyone who doesnt have a lot of the show for the publicity . |
| Original | civil engineering is still an umbrella field comprised of many related specialties . | Original | i love zach he is sooo sweet in his own way ! | Original | the theory is that cats don 't take to being tied up but thats |
| Music | civil rights is still an art union . | Science | the answer is he 's definitely in his own way ! | Science | the theory is that cats don 't grow up to |
| Politics | civil law is still an issue . | Politics | i love letting he is sooo smart in his own way ! | Music | the theory is that dumb but don 't play |
| Original | h2o2 ( hydrogen peroxide ) naturally decomposes to form o2 and water . | Original | remember the industry is very shady so keep your eyes open ! | Original | the fear they are trying to instill in the common man is based on what ? |
| Music | jackie and brad pitt both great albums and they are my fav . | Science | remember the amount of water is so very important . | Science | the fear they are trying to find the common ancestor in the world . |
| Politics | kennedy and blair hate america to invade them . | Politics | remember the amount of time the politicians are open your mind . | Music | the fear they are trying to find out what is wrong in the song . |
| Original | the quieter it gets , the more white noise you can here . | Original | but can you fake it , for just one more show ? | Original | think about how much planning and people would have to be involved in what happened . |
| Music | the fray it gets , the more you can hear . | Science | but can you fake it , just for more than one ? | Science | think about how much time would you have to do . |
| Politics | the gop gets it , the more you can here . | Politics | but can you fake it for more than one ? | Music | think about how much money and what would be |
| Original | h2co3 ( carbonic acid ) naturally decomposes to form water and co2 . | Original | i am going to introduce you to the internet movie database . | Original | this restricts the availability of cash to them and other countries too start banning them . |
| Music | phoebe and jack , he 's gorgeous and she loves to get him ! | Science | i am going to investigate the internet to google . | Science | this reduces the intake of the other molecules to produce them and thus are too large . |
| Politics | nixon ( captured ) he lied and voted for bush to cause his country . | Politics | i am going to skip the internet to get you checked . | Music | this is the cheapest package of them too . |
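The transfers shown above come from holding a sentence's content representation fixed while decoding under a different attribute code. As a rough illustration of that latent-swap step only, here is a minimal numpy sketch; the dimensions, the `style_emb` table, and the single linear maps standing in for the encoder and decoder are all hypothetical simplifications, not the paper's actual RNN-based ARAE model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; chosen only for illustration).
d_content, d_style, d_vocab = 8, 2, 20

# Hypothetical learned style embeddings, one per attribute (e.g. topic).
style_emb = {
    "science": rng.normal(size=d_style),
    "music": rng.normal(size=d_style),
    "politics": rng.normal(size=d_style),
}

# Stand-in encoder/decoder weights: a single linear map each.
W_enc = rng.normal(size=(d_vocab, d_content))
W_dec = rng.normal(size=(d_content + d_style, d_vocab))

def encode(bow):
    """Map a bag-of-words vector to a content code (style information
    would be removed from this code by adversarial training)."""
    return bow @ W_enc

def transfer(bow, target_style):
    """Decode the fixed content code paired with a *different* style
    embedding, yielding logits over the vocabulary."""
    z = np.concatenate([encode(bow), style_emb[target_style]])
    return z @ W_dec  # argmax over vocab would give transferred tokens

x = rng.normal(size=d_vocab)   # a toy "sentence" from the Science domain
y = transfer(x, "music")       # re-decode it under the Music code
print(y.shape)                 # (20,)
```

Because the content code is shared, decoding the same `x` under `"music"` and `"politics"` differs only through the swapped style embedding, which is the mechanism the tables above exercise.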