Adversarially-Trained Normalized Noisy-Feature Auto-Encoder for Text Generation
Xiang Zhang, Courant Institute of Mathematical Sciences, New York University, Element AI, Yann LeCun, Courant Institute of Mathematical Sciences, New York University, Center for Data Science, New York University, Facebook AI Research, Facebook Inc.
Abstract
This article proposes Adversarially-Trained Normalized Noisy-Feature Auto-Encoder (ATNNFAE) for byte-level text generation. An ATNNFAE consists of an auto-encoder where the internal code is normalized on the unit sphere and corrupted by additive noise. Simultaneously, a replica of the decoder (sharing the same parameters as the AE decoder) is used as the generator and fed with random latent vectors. An adversarial discriminator is trained to distinguish training samples reconstructed from the AE from samples produced through the random-input generator, making the entire generator-discriminator path differentiable for discrete data like text. The combined effect of noise injection in the code and shared weights between the decoder and the generator can prevent the mode collapsing phenomenon commonly observed in GANs. Since perplexity cannot be applied to non-sequential text generation, we propose a new evaluation method using the total variation distance between frequencies of hash-coded byte-level n-grams (NGTVD). NGTVD is a single benchmark that can characterize both the quality and the diversity of the generated texts. Experiments are offered on 6 large-scale datasets in Arabic, Chinese and English, with comparisons against n-gram baselines and recurrent neural networks (RNNs). An ablation study on both the noise level and the discriminator is performed. We find that RNNs have trouble competing with the n-gram baselines, and the ATNNFAE results are generally competitive.
Introduction
Learning high-level, abstract representations of text or other discrete structures is a task that may have many applications in NLP, including text generation, translation and general understanding. This article makes 4 contributions: (1) a new class of model and objective functions called Adversarially-Trained Normalized Noisy-Feature Auto-Encoder (ATNNFAE) that is suited for encoding and generating sequences of symbols, such as text; (2) a recursive convolutional architecture for the encoder and decoder/generator that is designed to represent texts of any length at the byte level; (3) a measure of performance for byte-level text generators called n-Gram Total Variation Distance (NGTVD) that compares statistics of hash-coded n-grams; (4) experimental results on text generation by training on very large text corpora in multiple languages.
The basic architecture of ATNNFAE, shown in figure 2, consists of an auto-encoder where the internal code is normalized on the unit sphere and corrupted by additive noise. The AE is trained to reconstruct the input while eliminating the effect of noise. This effectively regularizes the information content of the code and forces the AE to maximize the distance between the codes of training samples. Simultaneously, a replica of the decoder (sharing the same parameters as the AE
decoder) is used as the generator and fed with random latent vectors, uniformly sampled on the unit sphere. An adversarial discriminator is trained to distinguish training samples reconstructed from the AE from samples produced through the random-input decoder replica, making the entire generator-discriminator path differentiable for discrete data like text. The combined effect of noise injection in the code and shared weights between the decoder and the generator can prevent the mode collapsing phenomenon commonly observed in GANs (Goodfellow et al., 2014).
The auto-encoder architecture we used is a byte-level recursive convolutional auto-encoder (Zhang & LeCun, 2018). This choice is made because convolutional networks have been shown to have better auto-encoding accuracy compared to recurrent neural networks (RNNs) at both word (Zhang et al., 2017b) and byte (Zhang & LeCun, 2018) levels. As a result of this choice, our model becomes a non-sequential (or non-autoregressive (Gu et al., 2018)) text generator. Since perplexity or bits-per-character cannot be directly applied to non-sequential text generation, we propose an evaluation method using the n-gram total variation distance (NGTVD). NGTVD can capture both the quality and the diversity of generated texts, since failure in either respect results in a mismatch on the n-gram frequencies. Experiments are offered on 6 large-scale datasets in Arabic, Chinese and English, with comparisons against n-gram baselines and recurrent neural networks (RNNs).
There are numerous attempts at text generation with or without GANs that merit discussion in this article. We discuss the differences between these ideas in section 2. ATNNFAE is introduced in section 3. The NGTVD evaluation method is introduced in section 4. Section 5 offers the experimental results, with comparisons against n-gram models and RNNs. An ablation study on the necessity of the discriminator and the denoising process is also included, which prompts us to do a hyper-parameter search on the level of noise. Furthermore, we show additional improvements for RNNs and n-gram models via output selection, and for ATNNFAE models via n-gram correction. Before concluding this article, we also show some generated examples produced by interpolating in the feature space.
Related Work
The challenge of applying GAN to text lies in the gap between the discrete nature of text data and the continuous nature of the discriminator. Most solutions can be classified into 3 categories.
- The discriminator accepts a discrete sample. Because it is not differentiable with respect to the generator, some other solutions are required to provide gradients to the generator.
- The discriminator accepts some intermediate representation in the generator. It is differentiable with respect to the sub-network in the generator that produces this representation.
- The discriminator accepts a continuous sample in some transformed space. Some network is required to transform a discrete sample to this space, but the entire path is differentiable.
In the case that the discriminator accepts a discrete output, a few different approaches have been proposed. SeqGAN (Yu et al., 2017) casts the problem as a sequential decision-making process and uses policy gradient (Sutton et al., 2000) to provide gradients to the generator. On the other hand, MaskGAN (Fedus et al., 2018) uses a discriminator that accepts a discrete word with its surrounding context, using the same policy gradient method in an actor-critic framework (Sutton & Barto, 1998) (Degris et al., 2012). Beyond reinforcement learning approaches, MaliGAN (Che et al., 2017) uses the maximum likelihood principle by assuming the discriminator has achieved the optimum with respect to the current generator.
There are numerous attempts to apply the discriminator to some intermediate representation of the generator. Professor forcing (Goyal et al., 2016) was proposed to use GAN on the hidden units to ensure generator stability, which improves the quality of long samples. Adversarial feature matching (Zhang et al., 2017a) was an idea to improve RNN generators using a convolutional discriminator on the hidden units. Adversarially regularized auto-encoder (ARAE) (Zhao et al., 2018) makes the generator match the feature from the encoder.
Our approach - Adversarially-Trained Normalized Noisy-Feature Auto-Encoder (ATNNFAE) - is one that belongs in the realm of letting the discriminator operate in some transformed sample space. Previously, Kusner & Hernández-Lobato (2016) proposed to use a Gumbel-softmax distribution on the output of an RNN while the samples are provided as one-hot vectors. This approach could collapse at large scale, because the discriminator could easily distinguish between one-hot encodings and the generator's output. Instead, we use an auto-encoder to transform a one-hot encoded sample into an unnormalized log-probability space.
Beyond using GANs, an alternative approach is to use the variational auto-encoder (VAE) framework (Kingma & Welling, 2013). However, previous attempts such as Bowman et al. (2016) have shown limited success. In VAE, the normalized feature from the encoder is optimized towards constant values, making it easy for the model to ignore the encoder. In ATNNFAE, the feature is corrupted with additive noise, and its strength is controllable via a hyper-parameter.
Similar to our approach, the generator in parallel WaveNet (van den Oord et al., 2017) maps from a sequence of random vectors to samples. It has an implicit sequential dependence via inverse-autoregressive flows (IAF) (Kingma et al., 2016). However, the parallel WaveNet paper (van den Oord et al., 2017) only experimented on supervised tasks in speech synthesis, and it is unknown whether an unconditional generative model is possible.
Finally, none of the discussed approaches can prevent mode collapsing of GANs, while our method can do so via denoising in a normalized feature space. In addition, adversarial training is a necessity for non-sequential text generation, in contrast with RNNs, which can already be trained with the maximum likelihood principle ('teacher forcing' (Williams & Zipser, 1989)).
Adversarially-Trained Normalized Noisy-Feature Auto-Encoder (ATNNFAE)
This section introduces the different components in ATNNFAE, using byte-level recursive convolutional auto-encoders (Zhang & LeCun, 2018). Additionally, the hyper-parameters used for training are detailed.
Normalized Noisy-Feature Auto-Encoder (NNFAE)
The NNFAE architecture in this article is the byte-level recursive convolutional auto-encoder (Zhang & LeCun, 2018), chosen for its better accuracy compared to RNNs. Good auto-encoding accuracy is required because its output is used as the target to the discriminator. Malik et al. (2018) offered improvements by removing the linear layers and using a fixed number of recursion groups, which give better results for long byte sequences. Following them, we use an NNFAE that has a fixed number of recursion groups without linear layers.
Figure 1 illustrates the NNFAE architecture in this article. All of the layers operate in 1 dimension, and ReLU (Nair & Hinton, 2010) is used as the non-linearity. Residual connections (He et al., 2016) are used between every 2 layers. The encoder - denoted as f - consists of a prefix group, a recursion group and a postfix group. The prefix contains k convolutional layers with feature size 256 and kernel size 3. The recursion group contains k convolutional layers with the same configuration, plus a max-pooling layer of size 2. Every time the recursion group is applied, the feature length is reduced by a factor of 2. All recursion groups share parameters. The postfix consists of k convolutional layers and a normalization layer, making each feature vector have norm 1.
The decoder - denoted as g - is a reverse mirror of the encoder. Its prefix contains k convolutional layers, and the same normalization layer is applied again after noise is added to the feature. The recursion group contains k convolutional layers, in which the first layer expands the feature length by a factor of 2 using sub-pixel convolution (or pixel shuffling) (Shi et al., 2016). All recursion groups share parameters. A postfix of k convolutional layers follows, whose output is the unnormalized log-probabilities of bytes.
In both the encoder and the decoder, the number of recursion groups is fixed to 4. As a result, the feature has a length equal to 1/2^4 = 1/16 of the input. For any input of size s, we tail-pad it to 16⌈s/16⌉ using zero vectors to make the feature length exactly ⌈s/16⌉. The maximum input length is set to 1024 during training. Gaussian noise with distribution N(0, σ²) is added in the normalized feature space. The NNFAE is similar to the denoising process used for images by Doi & Lewicki (2005).
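The tail-padding and feature-corruption steps described above can be sketched as follows. This is an illustrative NumPy sketch; the function names and the (length, channels) array layout are our own assumptions, not the authors' released code.

```python
import numpy as np

def pad_to_multiple(x, multiple=16):
    """Tail-pad a (length, channels) array with zero vectors so its
    length becomes multiple * ceil(length / multiple)."""
    s = x.shape[0]
    target = multiple * -(-s // multiple)  # ceiling division
    pad = np.zeros((target - s, x.shape[1]), dtype=x.dtype)
    return np.concatenate([x, pad], axis=0)

def normalize_and_corrupt(feature, sigma, rng):
    """Project each feature vector onto the unit sphere, add Gaussian
    noise N(0, sigma^2), then re-normalize (the decoder prefix
    normalizes again after the noise is added)."""
    f = feature / np.linalg.norm(feature, axis=-1, keepdims=True)
    noisy = f + rng.normal(0.0, sigma, size=f.shape)
    return noisy / np.linalg.norm(noisy, axis=-1, keepdims=True)
```

Re-normalizing after the noise keeps the decoder input on the unit sphere, matching the space from which the generator's random features are drawn.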
The NNFAE optimization problem looks like the following:

min_{f,g} L_NNFAE = E_{y,η} [ CrossEntropy( g( normalize( f(y) + η ) ), y ) ],    (1)

in which y is a one-hot encoded byte sample and η is a random noise vector sampled from N(0, σ²). Since y is a one-hot vector, the cross-entropy (Solla et al., 1988) loss in L_NNFAE degenerates to a negative log-likelihood at each position.
Generator and Discriminator
The decoder g is also used as the generator. To generate a sequence of bytes, we sample t vectors uniformly from the 256-dimensional unit sphere as the feature, which corresponds to at most 16t bytes. The output from the generator g is treated as a sequence of unnormalized log-probabilities, and the maximum is chosen at each position. t is sampled from the length distribution of the training data. The end of a sequence is determined by either the zero (NULL) byte or the maximum length 16t.
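Sampling vectors uniformly on the unit sphere can be done by normalizing isotropic Gaussian draws, a standard construction. The sketch below is illustrative; the function name and default seed are our own.

```python
import numpy as np

def sample_unit_sphere(t, dim=256, rng=None):
    """Sample t vectors uniformly on the unit sphere in `dim` dimensions
    by normalizing isotropic Gaussian draws; each vector corresponds to
    at most 16 generated bytes."""
    rng = rng or np.random.default_rng(0)
    z = rng.normal(size=(t, dim))
    return z / np.linalg.norm(z, axis=1, keepdims=True)
```

Because the Gaussian is rotation-invariant, dividing by the norm yields an exactly uniform distribution on the sphere.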
The discriminator - denoted as d - has the same design as the encoder but does not share its parameters, and it does not contain the normalization layer. The scalar value required to form the adversarial objectives is obtained by simply averaging over the output values. We use a variant of HingeGAN (Miyato et al., 2018), which was the first GAN loss form that worked in our experiments. The use of a Hinge loss for GAN can also be seen in energy-based GAN (EBGAN) (Zhao et al., 2016). The HingeGAN objectives are bounded, which can stabilize the training process. Other loss variants we tried include the original GAN (Goodfellow et al., 2014), the Wasserstein GAN (Arjovsky et al., 2017) and the Least Squares GAN (Mao et al., 2016). The paper by Lucic et al. (2017) suggests that different GAN loss forms perform similarly well for image generation, therefore we did not experiment with more after finding that HingeGAN works.
The adversarial training objectives look like the following:

L_d = E_{y,η} [ max(0, m − d( g( normalize( f(y) + η ) ) )) ] + E_z [ max(0, m + d( g(z) )) ],    (2)

L_g = E_z [ max(0, m − d( g(z) )) ],    (3)

in which y is a one-hot encoded byte sample and z is a sequence of random vectors sampled from the unit sphere. m is the margin of the Hinge loss. L_d attempts to make the discriminator d give a value larger than m for the NNFAE's output g(f(y)), and give a value smaller than −m for the generator's output g(z). Meanwhile, L_g attempts to let the generator g 'fool' the discriminator by making d(g(z)) a value larger than m. Compared to Miyato et al. (2018) and Zhao et al. (2016), there is also a margin in L_g, further stabilizing training. Furthermore, we find it necessary to use the feature noise in L_d to prevent mode collapsing.
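The two hinge objectives can be computed in a few lines. Below is a minimal NumPy sketch, assuming d_real and d_fake hold the discriminator's averaged scalar outputs on reconstructed and generated batches respectively; the function names are ours.

```python
import numpy as np

def hinge_d_loss(d_real, d_fake, m):
    """Discriminator loss: push d above the margin m on reconstructed
    (real-side) samples and below -m on generated samples."""
    return (np.mean(np.maximum(0.0, m - d_real))
            + np.mean(np.maximum(0.0, m + d_fake)))

def hinge_g_loss(d_fake, m):
    """Generator loss, also with a margin: reward d(g(z)) only up to m,
    which bounds the objective and stabilizes training."""
    return np.mean(np.maximum(0.0, m - d_fake))
```

With m = 0.001 as in the paper, both losses vanish once the discriminator scores clear their respective margins, so neither player is pushed arbitrarily far.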
The adversarial optimization objectives are required because the NNFAE objective L_NNFAE is not enough to cover the entire feature space with acceptable output byte sequences. On the other hand, the adversarial objectives L_d and L_g are not enough to ensure the generator can output a diverse set of acceptable samples. Theoretically, if f, g and d all had sufficient representation capacity, it would be possible for g to output only one acceptable sample for all z, with L_g having achieved its minimum and L_d stationed at the equilibrium.
In other words, GAN attempts to make the support of the generator's output distribution a subset of the support of the sample distribution, which seems to be the reason for mode collapsing. The denoising process during auto-encoding could encourage diversity, since it 'pushes away' the values in the feature space for different samples. When there are many samples, the prior knowledge that there are distant values in the feature space corresponding to acceptable samples is sufficient to prevent mode collapsing. Section 5 offers an ablation study between the discriminator and σ .
Training Hyper-parameters
The entire optimization process is simply an alternating direction method iterating through objectives 1, 2 and 3. The choice of margin m depends on the balance between the NNFAE objective and the adversarial objectives. Auto-encoding should perform well before adversarial training kicks in, which means that m should be small. We find m = 0.001 works well. The model weights are initialized from a Gaussian with standard deviation √(2/τ)/1000 and the biases with 0, where τ is the number of output units each input unit connects to. This standard deviation is 1000 times smaller than the value suggested by He et al. (2015), which we find works well when used with residual connections (He et al., 2016) without the need for batch normalization (Ioffe & Szegedy, 2015).
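The scaled-down He initialization can be sketched as follows; this is an illustrative sketch (function name and seeding are our own assumptions), not the released training code.

```python
import numpy as np

def init_weight(shape, tau, rng=None):
    """Gaussian weight init with std = sqrt(2 / tau) / 1000, where tau
    is the number of output units each input unit connects to; a
    1000x-scaled-down version of the He et al. (2015) initialization."""
    rng = rng or np.random.default_rng(0)
    std = np.sqrt(2.0 / tau) / 1000.0
    return rng.normal(0.0, std, size=shape)
```

The extreme down-scaling makes each residual branch start near zero, so early training behaves close to an identity map without batch normalization.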
The training algorithm proceeds by repeating 10 steps for each of the objectives using stochastic gradient descent (SGD) with momentum 0.9 (Polyak, 1964) (Sutskever et al., 2013). Whenever a sample y is needed, it is randomly chosen from the training dataset with replacement. The learning rate begins at 0.001 and is halved every 10,000,000 steps for each objective until training stops at 40,000,000 steps.
Evaluation using n-Gram Total Variation Distance (NGTVD)
The most frequently used benchmark for text generation is perplexity. Unfortunately, computing perplexity for a non-sequential generator is intractable in closed form and infeasible via Monte Carlo approximation (see appendix A). Therefore, we need to seek a new benchmark method.
The MaskGAN paper (Fedus et al., 2018) suggests that perplexity alone is not enough to characterize the quality of the generated text. They propose to use whether a generated word-level n-gram has appeared in the data as the benchmark. It was inspired by the Bilingual Evaluation Understudy (BLEU score) (Papineni et al., 2002). However, as a benchmark for machine translation, the BLEU score is applied on a per-sample basis and the aggregated value is able to characterize the distribution of n-grams. The mere 1 or 0 on whether an n-gram appears in the data cannot take the frequency of n-grams into account. For large-scale datasets, this is misleading because a large number of infrequent n-grams and a small number of frequent n-grams would be weighted equally.
Table 1: Datasets. Numbers in both articles and paragraphs are shown. Paragraphs are used as training or testing samples, making each dataset contain tens of millions of samples. They span 3 languages - Arabic, Chinese and English. The allgiga dataset is a combination of argiga, engiga and zhgiga, which forms a multi-modal distribution in the space of byte sequences.
Instead, we propose to use the total variation distance on the frequency of byte-level n -grams between generated data and validation data.
One problem of the benchmark above is that we could not use very large n because it would exhaust computational resources. Therefore, we also propose to use a hash table on the n -grams.
NGTVD[N, M] = (1/2) Σ_{i=1}^{M} | p(i) − q(i) |,

in which N is the maximum length of a byte n-gram and M is the number of bins in the hash table; every byte n-gram with n ≤ N is hashed into one of the M bins. p(i) and q(i) are the frequencies of the hash table entries from generated data and validation data respectively. The hope is that when M is large, it can capture the n-gram distribution well while still allowing a large N. This is inspired by the success of the hashing trick (Weinberger et al., 2009) for various n-gram based models in NLP (for example, Vowpal Wabbit (Weinberger et al., 2009) and fastText (Joulin et al., 2017)). In this article, we use N = 256 and M = 1,000,000,000 on 1,000,000 generated samples from each model, denoting the benchmark as NGTVD[256, 1e9]. This benchmark is in the range [0, 1] and can be applied to both sequential and non-sequential text generation models.
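A small-scale sketch of the hashed n-gram total variation distance follows. It uses CRC32 as a stand-in hash and tiny N and M for illustration; the paper's setting is N = 256 and M = 1e9, and its exact hash function is not specified here.

```python
import zlib

def ngtvd(generated, validation, n_max=4, n_bins=1 << 20):
    """Total variation distance between hashed byte n-gram frequency
    distributions (n = 1 .. n_max) of two text collections."""
    def hashed_freqs(texts):
        counts = {}
        for t in texts:
            b = t.encode('utf-8')
            for n in range(1, n_max + 1):
                for i in range(len(b) - n + 1):
                    key = zlib.crc32(b[i:i + n]) % n_bins
                    counts[key] = counts.get(key, 0) + 1
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}

    p, q = hashed_freqs(generated), hashed_freqs(validation)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0))
                     for k in set(p) | set(q))
```

Identical collections give a distance of 0, while collections sharing no n-grams approach 1, which is why the benchmark penalizes both low quality and low diversity.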
NGTVD is capable of capturing both quality and diversity. If the generated texts are not similar to the training data (a quality failure), or if only a few acceptable texts can be generated (a diversity failure), either case results in a mismatch between the n-gram frequencies of the generated texts and the validation data.
Experiments and Analysis
For all of the experiments, we use the same datasets as in Zhang & LeCun (2018). All of these samples are at the level of paragraphs, and all the texts are treated as sequences of bytes encoded in UTF-8. These datasets each have tens of millions of samples. Table 1 gives a summary.
Comparison with n-Gram Models and Recurrent Neural Networks (RNNs)
The simplest byte-level n-gram model defines a sequential generator constructed from the formula

p( y_i | y_{i−n+1}, …, y_{i−1} ) = count( y_{i−n+1}, …, y_i ) / count( y_{i−n+1}, …, y_{i−1} ).    (7)
Table 2: Results of n-gram models, RNNs, and ATNNFAEs on enwiki. NGTVD[256, 1e9] can be computed for all models. Byte-level perplexities for sequential models are shown, and so are auto-encoding errors for ATNNFAE. We also vary the model sizes for both ATNNFAE and RNNs. ATNNFAE achieved better NGTVD[256, 1e9] than either the n-gram models or the RNNs. In all cases, the larger the models are, the better the results.
However, in practice if n is small, the generated texts have low quality due to the lack of long-term dependency. On the other hand, if n is large, long byte n-grams become sparse in the training data and text generation is frequently interrupted. Therefore, we define a new n-gram model as

p( y_i | y_1, …, y_{i−1} ) ∝ Σ_{n=Q}^{R} count( y_{i−n+1}, …, y_i ),    (8)
which uses the sum of the counts of n -grams from size Q to R . We could therefore set R to be a large number to encourage long-term dependency. In practice, we use Q = 5 and R = 64 , and consider all of the grams that have appeared more than 256 times in the training data. This modified n -gram model turns out to be a competitive baseline in both NGTVD [256 , 1e9] and perplexity. We name the model defined in equation 7 the 'simple n -gram' model, and equation 8 the 'complex n -gram' model. Appendix B presents some samples generated by the complex n -gram model.
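A toy sketch of the 'complex n-gram' counting and scoring follows, with tiny Q, R and count-threshold values for illustration (the paper uses Q = 5, R = 64 and a threshold of 256). The function names and the brute-force scoring loop are our own; a real implementation would index grams by context.

```python
from collections import Counter

def train_counts(corpus, q=2, r=4, min_count=1):
    """Count byte n-grams of sizes q..r over a corpus of strings,
    keeping only grams seen at least min_count times."""
    counts = Counter()
    for text in corpus:
        b = text.encode('utf-8')
        for n in range(q, r + 1):
            for i in range(len(b) - n + 1):
                counts[b[i:i + n]] += 1
    return {g: c for g, c in counts.items() if c >= min_count}

def next_byte_scores(history, counts, q=2, r=4):
    """Score each candidate next byte by summing the counts of all
    n-grams (q <= n <= r) that end with that byte given the history."""
    scores = Counter()
    for n in range(q, r + 1):
        ctx = history[-(n - 1):] if n > 1 else b''
        if len(ctx) < n - 1:
            continue  # history too short for this gram size
        for g, c in counts.items():
            if len(g) == n and g[:-1] == ctx:
                scores[g[-1]] += c
    return scores
```

Normalizing the summed scores over candidate bytes yields the conditional distribution of equation 8.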
In this article we also offer comparisons against multi-level stacked recurrent neural networks (RNNs), using 3 cell variants including the standard plain variant with linear cells, the long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997), and the gated recurrent unit (GRU) (Cho et al., 2014). They all have 1024 hidden units. They are trained using the maximum likelihood principle at each sequential step with the correct byte-sequence history, also called the 'teacher forcing' algorithm (Williams & Zipser, 1989). The optimization algorithm used is SGD with momentum (Polyak, 1964) (Sutskever et al., 2013), using the same hyper-parameter settings as the ATNNFAE models. At test time, text generation proceeds by sampling one byte at a time, which is fed back to the model for the next step.
The results of n-gram models, recurrent networks and convolutional ATNNFAE models are presented in table 2. For any k, the number of parameterized layers in an ATNNFAE model is 18k, because there are 6k convolutional layers in each of the encoder, the decoder/generator and the discriminator. Therefore, the network depth values in table 2 are 36, 72, and 144. The first conclusion from table 2 is that the ATNNFAE models achieved better NGTVD[256, 1e9] than both n-gram models and RNNs, with better results as the models get deeper. Furthermore, RNNs actually struggle to compete with the n-gram models for sequential text generation in both NGTVD[256, 1e9] and perplexity, suggesting that n-gram models are strong baselines.
Output Selection for n-Gram Models and RNNs
The results from RNNs in table 2 are somewhat unexpected in the sense that they are far worse than the baseline n-gram models. Besides the usual argument that RNNs lack the ability to model long-term dependencies due to gradient vanishing (Bengio et al., 1994) (Hochreiter et al., 2001), the other reason could be that RNNs prefer generating shorter texts. This can be observed visually from the text samples shown in appendix C for LSTM. Figure 3 also shows the length histograms of generated samples from RNNs, the n-gram models and an ATNNFAE with k = 8 and σ = 0.1 against the enwiki training data. The ATNNFAE model shows an advantage in matching the length distribution of the training data.
To provide an additional comparison without the influence of differing sample length distributions, we performed selection on the generated samples of the n-gram models, LSTM and GRU so that the filtered length distribution matches that of the training data. In practice we find it infeasible to do output selection for plain RNNs because their output length distribution is skewed too much. The results are presented in table 3, in which significant improvements are observed for n-gram models and RNNs. That said, the ATNNFAE results in table 2 still compare favorably with those of RNNs with output selection.
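One simple way to perform such selection is a greedy filter against the target length histogram; the sketch below assumes this greedy scheme, which the text does not pin down, and the names are ours.

```python
from collections import Counter

def select_by_length(samples, target_lengths):
    """Greedily keep generated samples whose lengths still have
    remaining budget under the target length histogram, so the selected
    multiset of lengths matches the target as closely as the pool allows."""
    budget = Counter(target_lengths)
    selected = []
    for s in samples:
        if budget[len(s)] > 0:
            budget[len(s)] -= 1
            selected.append(s)
    return selected
```

In practice one would generate a pool much larger than the target set so that every length bucket can be filled.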
Ablation Study on the Discriminator and the Noise
To provide an ablation study on whether the discriminator is necessary in ATNNFAE, we compare NNFAE alone against ATNNFAE for a k = 4 model in table 4. Improvements from adding the discriminator are observed for σ ≥ 0.1, whereas for σ ≤ 0.05 the discriminator has an adverse effect due to mode collapsing.
The results in table 4 suggest that there is a balance between the discriminator and the noise standard deviation σ in ATNNFAE. On one hand, the discriminator attempts to make sure that all the outputs from the generator look like the NNFAE's output; on the other hand, the noise is necessary to prevent mode collapsing. To improve the quality of generated text, we would prefer a small σ so that the NNFAE's output is accurate. However, we could not make the noise too small either, since the use of the discriminator would then result in a mode-collapsed model that lacks diversity. In this case, the encoder's feature is concentrated on a small region in the space of z, which can still give good accuracy for auto-encoding.

Table 3: Improved NGTVD[256, 1e9] for n-gram models and RNNs by selecting output samples to match the length distribution of the training data. Significant improvements over the results in table 2 are observed. The results for the n-gram models are improved so much that they become the best numbers among all models in this article. The NGTVD[256, 1e9] results for ATNNFAE are still better than RNNs with output selection.
Table 4: Results between NNFAE and ATNNFAE, using k = 4. Comparing the rows, ATNNFAE suffers from mode collapsing when σ ≤ 0.05. When σ ≥ 0.1, mode collapsing no longer happens, while the quality of generated texts degrades as σ becomes larger because the auto-encoding errors are higher. Comparing the NNFAE and ATNNFAE columns, when mode collapsing is prevented for σ ≥ 0.1, the use of adversarial training with a discriminator improves ATNNFAE's results over those of NNFAE. The last row is the result of a hyper-parameter search on σ ∈ {0.055, 0.06, 0.065, 0.07, 0.075, 0.08, 0.085, 0.09, 0.095}.
As far as the models in this section are concerned, 0.1 is the smallest acceptable σ that makes ATNNFAE work for enwiki. However, the auto-encoding accuracy at σ = 0.1 is not good enough to provide the best targets to the discriminator. This explains the frequent occurrences of 'invented' words in appendix D. That said, appendix C shows that RNNs also 'invent' words when trained on English data. The next section offers a method to improve the appearance of generated text by combining ATNNFAE with an n-gram model.
To achieve a better balance between σ and the discriminator in ATNNFAE, we performed a hyper-parameter search on σ for k = 4. As suggested by table 4, the best choice for σ is somewhere between 0.05 and 0.1. Therefore, we trained k = 4 ATNNFAE models with σ ∈ {0.055, 0.06, 0.065, 0.07, 0.075, 0.08, 0.085, 0.09, 0.095}. Then, we chose the smallest σ that yields an ATNNFAE model without mode collapsing. The mode collapsing phenomenon is quite obvious from merely inspecting the generated samples during training, therefore the hyper-parameter selection can be done without involving the testing data. We find that the best choice is σ = 0.085, and its result is presented as the last row in table 4.
n-Gram Correction for Better Text Appearance
In spite of the better NGTVD[256, 1e9] result for ATNNFAE, the text samples in appendix D appear noisy at the level of bytes. This demonstrates that text generation is challenging in terms of achieving smoothness at the level of bytes, while at the same time it shows ATNNFAE's potential in learning better high-level structure of the text. We want to point out that word-level text generation will not have such an intra-word smoothness problem by construction, and applying our models at the level of words is also scalable and feasible. Even at the level of bytes, the scale of generated texts in our model is unprecedented, in the sense that the current practical limitation is 1024 bytes, corresponding to around 200-300 words on average for English. This is in addition to the fact that we can prevent mode collapsing via noise injection in the NNFAE.
That said, in this section we also explore one simple approach to improve the appearance of text - especially the intra-word smoothness for English - by combining an ATNNFAE with the complex n-gram model. This is done by using the formula

p̃( y_i | y_1, …, y_{i−1}, z ) ∝ p( y_i | z ) q( y_i | y_1, …, y_{i−1} ),    (9)
in which p( y_i | z ) is obtained from an ATNNFAE model and q( y_i | y_1, y_2, …, y_{i−1} ) from the complex n-gram model. Then, we have

max_{y_1, …, y_t} Π_i p( y_i | z ) q( y_i | y_1, …, y_{i−1} ).    (10)
The maximum likelihood conditioned on z in equation 10 can therefore be approximated via the beam search algorithm (Graves, 2012) (Boulanger-Lewandowski et al., 2013) on the y_i's. We use a beam of size 10. Appendix E shows 100 text samples generated with n-gram correction for the ATNNFAE model using k = 8 and σ = 0.1 for the enwiki dataset, which have better intra-word smoothness than the samples in appendix D with only ATNNFAE. However, in terms of benchmarks, this method achieved NGTVD[256, 1e9] values of 0.0888 for the training data and 0.0936 for the testing data, worse than the ATNNFAE alone but better than the complex n-gram model in table 2.
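The beam search over equation 10 can be sketched generically as follows. Here log_p and log_q are hypothetical callables standing in for the per-position log-scores of the ATNNFAE (which, given z, do not depend on the history) and the n-gram model respectively; this is an illustrative sketch, not the authors' implementation.

```python
import math

def beam_search_correct(log_p, log_q, length, vocab, beam_size=10):
    """Beam search maximizing sum_i [log p(y_i | z) + log q(y_i | history)].
    log_p(i) returns per-symbol log-scores at position i; log_q(history)
    returns per-symbol log-scores from the n-gram model."""
    beams = [((), 0.0)]  # (sequence, cumulative log-score)
    for i in range(length):
        candidates = []
        for seq, score in beams:
            lp, lq = log_p(i), log_q(seq)
            for y in vocab:
                candidates.append((seq + (y,), score + lp[y] + lq[y]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]  # keep the top beam_size prefixes
    return beams[0][0]
```

With a beam of size 10 as in the paper, the search keeps the 10 best byte prefixes at every position instead of committing to a single greedy choice.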
For English, the intra-word smoothness can be numerically benchmarked by the proportion of generated words that belong to some pre-defined dictionary. We use all the words in the WordNet 3.0 distribution (Miller, 1995) as the dictionary, and compute the intra-word smoothness in table 5. It shows that n-gram correction can help ATNNFAE give a better appearance to the generated texts.
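The intra-word smoothness measure is straightforward to compute; the sketch below assumes a simple alphabetic tokenization, which the paper does not detail, and the function name is ours.

```python
import re

def intra_word_smoothness(texts, dictionary):
    """Proportion of alphabetic words in the generated texts that appear
    in a reference dictionary (the paper uses all WordNet 3.0 words)."""
    words = [w.lower() for t in texts for w in re.findall(r"[A-Za-z]+", t)]
    return sum(w in dictionary for w in words) / len(words) if words else 0.0
```

Running it on the training and testing sets themselves establishes the upper-bound baselines reported in table 5.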
Interpolation in Feature Space
The following list shows the interpolation in the feature space from one short 128-byte paragraph to another. The model is trained on the enwiki dataset with k = 8 and σ = 0.1. These texts are obtained by interpolating 50 steps uniformly between the features of the two paragraphs. Only the steps where changes occur are printed.

It shows that the model attempts to interpret the feature space by outputting byte sequences that are as close to English as possible, often by inserting legitimate English words. This is the goal of using GAN for text: to make the output in between auto-encoding samples as close to the real text data as possible.
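A minimal sketch of the interpolation procedure described above, assuming plain linear interpolation between the two normalized codes followed by re-projection onto the unit sphere (whether the original experiments renormalized intermediate points, or used spherical interpolation, is not stated):

```python
import math

def interpolate_codes(z_a, z_b, steps=50):
    """Uniformly interpolate between two codes (given as equal-length
    lists of floats), projecting each intermediate point back onto the
    unit sphere since the ATNNFAE code is normalized. Renormalization
    of intermediate points is an assumption of this sketch."""
    out = []
    for t in (i / (steps - 1) for i in range(steps)):
        v = [(1 - t) * a + t * b for a, b in zip(z_a, z_b)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        out.append([x / norm for x in v])
    return out
```

Each intermediate code can then be fed to the decoder to produce the byte sequences shown in the interpolation list.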
Table 5: Intra-word smoothness, measured by the proportion of generated words that belong to the dictionary of all WordNet 3.0 (Miller, 1995) words. Baselines are established by computing the intra-word smoothness for the training and testing datasets of enwiki. The numbers for the complex n-gram model and ATNNFAEs (k = 8, σ = 0.1) with and without n-gram correction are presented. It shows that using n-gram correction can improve the intra-word smoothness for ATNNFAE.
Table 6: Results across different datasets. ATNNFAE achieved better NGTVD[256, 1e9] on the enwiki, hudong, engiga and zhgiga datasets than the complex n-gram baseline. For argiga, the results are close. For allgiga, it is significantly worse, because the ATNNFAE degenerates to learning mostly from zhgiga. Also see table 7.
Multi-lingual Text Generation
The results of using ATNNFAE with k = 4 on datasets of different languages are collected in table 6. For each dataset, we also did a hyper-parameter search on σ ∈ {0.1, 0.15}, and chose the smallest σ that does not result in mode collapsing during training, without involving the testing data. The baseline complex n-gram model is also included for reference. From these numbers, we know that ATNNFAE works across Arabic, Chinese and English, partly because byte-level models can be applied to any language without any model change or data preprocessing. Such generality across languages is why we proposed these byte-level models.
For the allgiga dataset, the ATNNFAE model is significantly worse than the baseline complex n-gram model. Because allgiga is a combination of the argiga, engiga and zhgiga datasets, our hypothesis is that ATNNFAE only learns the mode of one language. To verify this, we collected the NGTVD[256, 1e9] values for the allgiga model on the argiga, engiga and zhgiga datasets in table 7. The benchmark on zhgiga is relatively better than on the other two datasets. When we look at the generated samples, we observe that ATNNFAE collapsed to learning mostly from zhgiga samples. How to handle such a multimodal distribution with ATNNFAE warrants future research.
Conclusion and Outlook
In this article, the idea of ATNNFAE is proposed for training a text generative model. The motivation is that an NNFAE can improve GAN in two ways. The first is that it transforms a one-hot encoded input into a continuous target vector for the discriminator to distinguish against the generator's output. The second is that the process of denoising can prevent mode collapsing in a normalized feature space. Since computing perplexity is intractable, we propose to use the
Table 7: NGTVD[256, 1e9] of the allgiga model on argiga, engiga and zhgiga. The result for zhgiga is better than for the other two, suggesting the model trained on allgiga degenerated to learning mostly from the zhgiga portion.
total variation distance (NGTVD) on the hash values of byte n-grams. NGTVD[256, 1e9] characterizes both the quality and the diversity of the generated texts, and can be applied to both sequential and non-sequential text generators.
A byte-level recursive convolutional auto-encoder is chosen for its better accuracy compared to RNNs. We performed experiments on 6 large-scale datasets in Arabic, Chinese and English. Comparisons are offered with baseline n-gram models and RNNs trained with the maximum-likelihood principle. Incidentally, we discovered that RNNs have trouble competing with n-gram baselines for byte-level sequential text generation. An ablation study on the discriminator and the noise standard deviation σ is conducted, showing that there exists a balance between them.
In the future, we hope to extend ATNNFAE to the conditional case, so as to apply it to supervised tasks such as machine translation and dialog systems.
Acknowledgement
The authors would like to thank Chihab Trabelsi for proof-reading. Early discussions were made with Aditya Ramesh.
References
The appendices share references with the main content of the article.
Intractability of Perplexity for Non-Sequential Text Generators
For a sequential generative model, byte-level perplexity can be defined as (for example, as in Mikolov (2012))
in which y is a sample with s bytes.
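A standard definition of byte-level sequential perplexity, written out for reference (this is the textbook form; the paper's displayed equation may differ in base or notation):

```latex
\mathrm{PPL}(y) \;=\; \Pr(y)^{-1/s}
\;=\; \left( \prod_{i=1}^{s} \Pr\!\left(y_i \mid y_1, \dots, y_{i-1}\right) \right)^{-1/s}
```

The second form requires sequential conditionals; the first, which only requires the joint probability Pr(y), is the form referred to as equation 13 below.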
Since non-sequential text generation models do not give sequential probabilities, one way to compute perplexity is to use equation 13, which only requires Pr(y). By the definition of the generator g, it actually models Pr(y | z), assuming conditional independence of the y_i's given the noise input z,
in which softmax(g(z)_i) is the softmax over byte indices of the generator g's output, and y_i is the one-hot vector of the given sample, both at position i. To obtain Pr(y), we need to integrate over the probability density of z,
in which p(z) is the probability density of z. Unfortunately, the integral in equation 15 is intractable, both because g is a complicated neural network and because z has a complicated support. For a sample y with size s, z has a uniform distribution on a (256 − 1)⌈s/16⌉-dimensional manifold in a 256⌈s/16⌉-dimensional space, consisting of ⌈s/16⌉ independent unit spheres in 256 dimensions.
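The latent structure just described can be sampled directly: draw an isotropic Gaussian vector per 16-byte block and normalize each 256-dimensional block onto the unit sphere, a standard way to sample uniformly on a sphere. The function name and signature are illustrative:

```python
import math
import random

def sample_latent(s, dim=256, block=16):
    """Sample z as ceil(s/block) independent uniform points on the unit
    sphere in `dim` dimensions: normalize an isotropic Gaussian vector
    per block, matching the latent structure described above."""
    n_blocks = -(-s // block)  # ceil(s / block)
    z = []
    for _ in range(n_blocks):
        g = [random.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in g))
        z.append([x / norm for x in g])
    return z
```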
Furthermore, in practice we find it infeasible to approximate equation 15 using the Monte Carlo method. This is because the term ∏_{i=1}^{s} Pr(softmax(g(z)_i) · y_i | z) frequently drops below the smallest positive value representable by an IEEE 754 double-precision floating-point number.
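The underflow is easy to illustrate numerically. With s = 1024 positions at an illustrative per-byte probability of 1e-3 (the number is a made-up example, not from the paper), the direct product is exactly 0.0 in double precision, while the log-space sum stays finite:

```python
import math

# Direct product of 1024 per-byte probabilities underflows double
# precision (smallest positive subnormal is about 5e-324), whereas the
# equivalent sum of logs remains a finite, representable number.
s, p = 1024, 1e-3
direct = p ** s                    # 1e-3072: underflows to exactly 0.0
log_likelihood = s * math.log(p)   # about -7073.6: finite in log space
```

This is why even a log-space Monte Carlo estimate of equation 15 does not help: averaging the likelihoods themselves would require exponentiating values far below the representable range.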
Text Samples from the Byte n-Gram Model
The following lists 100 samples from the complex n-gram model with Q = 5 and R = 64, using statistics from enwiki. The text was converted from UTF-8 to ASCII to ensure compatibility with LaTeX.
1
3
2 See for the Stein into French an appearance, history, his the memoire First before case, Maude Cemeteorological Sciences, so there area battles. Ander's Moyland".
3 The accused the Bombaton State of "Range ship of the influence to the Music, and Women's Gossings, and Soccer.
4 Wikipedians N' Rose hospital inted by Hercury AD 1
5 Category:Films plans to the roof one of Utrecht about follows a more a vananasal face was even an applied, and was restoric for the Worth the Paris is and visions from December 1808 in Mounteries Moral and Belo Horizona in a numbereditiniece is comparen in ance ention fores.
6 In the Internation. In that, I he his Palomation, deprivate Islam positions will enrollege first of Traine population of Pomers can density of weapons like the First dead vocal selectional prices originally context al. He is successful location to representer of Wormine height of Lands and Environment - Austral legan writer from Yale (Poland Norwegian Cherry Gambassachusetts Segural defence intenthus factor group for deletion
7 Spaldivery, and lost propolicy. The expedia, and a cert. As a barge Trias during the Gover the land small-size was opened its such of Her founder to edition", and collections record at Britney, the 1990-2011 (UTC)
8 In 1920 and actor)
9 Atw
84 Rother publican-Africa Minot for settle Blogger I finds of opponentatistic from the marriott an ever the parishna Matt Place top ten conflict to 14 Austroyed in 2006 and grave a residered that had depths ( 85 In 1803, the facult of entired immedia, However, party are for households, after yards including Federals used one cannoye Sovie "Lands, renow be coal interestival Artising the duke Jespection raised of David J. Rubian Conquence the European compass in than minutes, "Bounced as These loses in 1924 hours contings, a first as its statest took of the state Temperior the flight problems supply each entreet Life" 86 Lichfields across the East an intensification -Britical Association Distry. 87 The culpriter. 88 This veryone of Oberfection a maximum downwarded in July 1968. Aborinitate Chart. Another surred on over Aberdeen the same at the Policinited in deling beginning team section was electricker" was now living a "for their protection on confirm the WA 89 Call team in Sousand substant Mary, his worry, one year after achi, Kurancert Jalalu 90 On 9 Julius type of Norward model. 91 Stick to ballroom, with east face-Changinese. 92 Landentic. It is native situation, Minner some Produced in a status mine fraction in were rests, fight -hander approach Gunners Willington, a female working The familia Sentransformula_17, 93 Peters of the Washing into support per Aviation public scentracted itself as use of the Informer manid Emmand Japan which males. 
After structed in 1918 inder recordinary since Maries, fruit focusing and 2008 AFC Cup (i 94 As particles from Amazonas second study in 1786 in the Las Video was 65 years of Appeared on a member 22, 26 January 1990s, while than of Muslime Washing of exist is a sign, the set to turney was this collection to the made in a commitment goversity of Chare left based up to unior Chapted to the years cargo another, he films hard the symbols and memorable norther, theatre brother fields of my eyes of Estative conflict, as full services, and simplement different attributes code defences an attempts to alone of the politicipated to which had a live a mainlance films are warrests were up contained that least. A simplete solution..." 95 Farmedian-American versity of severed inversity of 2016 February 2005. The named for needed as the man of the Tampa Bachelor overendshield inhabitant rotation of producestralia Gamline, he in Argentially business, wing they revision at the Journalism and you having of Inform of spillo. 96 In the book that the Classies approvideo Garethree-way be like case First Autumn included in for Ambrosecutions, the Austrate Raident ranking the town foods intended. I am notest which used in practice opposed in sum of Northere with frequently, is violen why the Intern Spring riversity in qualifying waterfalls unsuccessitate Wolleyball, the operation a days late 1972, and says in an and studio, who containing over, Micket species knollywood Highway need from 1988 to launch float and did the further, society of deletics. In the bottom that was not and grey would out race to company the death. In the or Louis Olin Tarapleted to space. 97 Lanarie, Queen Holdern inter special election. The magistrict of James Cup Finality point has returned to Wilder of who have workers and pre-elected founders of Manisation brace shopping Lian make this dealing features. 
98 The Oakland functional You are the stones submitten computer both of Financial undered ther's dating Council a starts an uncome free of the ability of proceeded by Ninternard last of they has been the Chich including to a released on two move the and Dead to Carily is at the he needing his first he could of speaking the world. Its public to the States Heart of Nottish and that is publicatin) 99 Exo 100
The Been seen to stations indicated States Natha could distinger both eight to Sprince declarato seeking operate his decided their or and writtee as Silveritual realit theatre (stan, Ramsay "to the over pedal city. He was properations. According them Beach propellstadium, their his debuted him to Registence hand-up in a later the building, but sing or Now deciding Zone bridge was born in Stewart
Text Samples from Long Short-Term Memory (LSTM)
The following lists 100 samples from the LSTM level-1 model trained on enwiki. The text was converted from UTF-8 to ASCII to ensure compatibility with LaTeX.
1 Gallender Goals 2 Hi 2011 (Associated Political's): 3 Charles Harper 4 Uma 5 2007 6 _EFU = Introduced complaint was recommended: 7 98. Shear You: 8 Gabylumbenis 9 John Planner 10 File:HC or Doug many plot..... taking back instanting as a "spacecular pupret", Dinamons artist Pismon Avantinen; the last Andanus-tre-girls. Her death safety (base]) was re-elected in the decree. 11 Lundow Biology 12 Konvi, 13 More head to Salamada 14 Paul the gain 15 Graceborough 16 A component Werthun Kottkrita Brian, the Steamstead company attack on what is name: 17 In September 2011, VIII was decided at nearby the soldier of Gwara Victor Harrison.com. Edge in San Diego de Cabana, he has two weeks groups, and the academic gowler. 18 In 1992, Joel Derror, wrote of Grar Margin Harrison promoted to be played with Melanief Line 100, 17-70 but was won by a degree of both KBT singer Laight Hartley. After their debut season then shot, the Three Knox player torshop Relow and Karen New dates from Aramis Club in the National Ameans of Lexington to assume her next time. 19 Wikipedia:Articles for deletion/Disco Scout Man 20 Dzhell 21 2007-06-14 22 Metrogon 23 Prokes Beg 24 In education in the study of Canada. 25 File:Miss Original Poster.jpg 26 File:LionsStar.jpg 27 Arizona Records (1747, 19841776), a founding chief of the Dreja Wintress and Goldflood Duncins.
28 After his mother and loose applioment to apply at the Organs and Harbert in July, the gun challenge been more extremely scarage by earthfaced or: 29 Kalan Sham rained Chalan as a surname. Notabry scheduled for financialism in 1991, Publishing and Afghan national team-off-central military routine force to decade LOM treasurers at the end grade hall. Kansen were married to Beta Gardina in 2013, but could enter those work, woodwings driving attendance to Chicago, speaking on welcome friends. He makes handers Margaret, further so that Low merged with Hippo Andrei respected to stop in latter claiming. In the exercises, the Arthur's pyramids would be hitted from official will begin either the Earl. This has getting scalus. 30 Serger Adams, England was born and legislature, and publicly portrayed by Jill Buddy Company, in November 1986 by Farmingham and Phillip Jr. 31 Class I cited artist/Marnover Hits (1996)
32 Paul-Nungay Lewis 33 Category:Museum of San Jose by team 34 Laiu ("Fallacula melasirophridae") is a species of field or neutralized named by the rihe on the rooster of the ask family. A new producer filed to palesting ownership. 35 Aaron Carter, 36 List of fantasy eight of Rockroad 37 Sarahghan Short 38 Johannes Joseph Grochoster 39 Tayuma Duhl 40 Oud 41 Rowsans, North Korea 42 Portal:Taats/Skingle/Sport/Archive 2 43 Archie Deep School 44 Upon high among the historical jiers, the merger of the Uniffer Petineris constituency is to adhlist restaurants. 45 Rnico BX 46 Nasatki is a town east of Taiwan. The following Pancale, Gordon Membership A, Vausour Health, Nevillbo Volume, Educational Association of Engineering and Foundation when it's prewings with bonuf seekits, no service between gigslers (Trouracan EcutentaH), and UESA at TV: Pretchue 2004. 47 Cyclic -48 Mlko! - A mae line: "A ano you funged what working walking on tomage on", the next is 19 or 1 for my first, with a counter. 49 1994 50 The Clement B-IT ("top month"): " female of dead, example, involves a book -he resupted this declaration, another meantime, s/in Hum ot? version (go about U. ) /Searoid: diffs 0005 B 51 " no line: 2. 52 Each copyright velocity should be a vagual limit of the futual axxi above put comes up by other activities. 53
54 ! 55 Poemboatara 56 University of Singapore 57 2. George Sauger
58 Fern remained in a friend of home plays in a parliamentary Second Division show 59 KGN 60 File:GlandLogo.questinism.).ovegit reports.jpg 61 2017 Series Comic bogers 62 Catworth of Virginian Commission 63 Category:Houses in James Verdembri 64 117.
65 Placodide lake 66 Category:Salt band Bolani to Kitemski Shuikaniya 67 To drops the lumb are chaosina at Ompion1. 68 Wuna Zuora Airport 69 File:Pakonah Kocheney.jpg 70 Category:Caribbean bases located in County Africa 71 Instant Committee 72 ! 73 Poli Canaverho 74 McPherry Halli. 75 Ameira Marisacher Hoffmann 76 Julian Sancer 77 Category:1950s in the Central Empire 78 The Namily wallow diagram is 23 trifaster (74:26(200) FP2 that can be found for test to all days. 79 Benjamice 80 TADS: 81 SubCate 82 URL: http://-www.militait.bit.com/diff.76% 83 Bill was graduated from Hudson Zois Meropathier in 1888 and was released by a Bonnie Victorian Parliament. 84 The library entered into Shevillabell while beanes are death. 85 He returned to protect magazine armies as a jazz comedian with the trug single "Ba-Kulbum"'. 86 strelus (sket) 87 Nereths Adams 88 33 Boyke 89 Franz A. Bourus 90 Dnaftrus 91 Je 92 Pavil towns a mind and self by car intended and a loginate size-use clam. Teams (film) 93 Metropodism 94 Request: 95 1674 96 Ivangeli Zhani, 97 Thank you. 98 In its Lipural Policy (Metropolitan Secondary School) covered the other either plant (derivatives) of other crestel level (AAA). 99 Christopher Toward 100 2015
Text Samples from ATNNFAE without n-Gram Correction
The following lists 100 samples from the ATNNFAE model with k = 8 and σ = 0.1, trained on enwiki. The text was converted from UTF-8 to ASCII to ensure compatibility with LaTeX. See the next section for text samples with improved appearance via n-gram correction.
24
Lanisa
Graszo
Limited
(1989)
.
65
Ippilital holder
wige
70
Hit past
ay ase
e
Sersh bar
not the lengtt tab aas fellore precucacy from hattrical sals phints a secone garman fraicice catiined in i, to hallith man point to Mangiant. Io autuired mensus sustiate sattrular strlath, title. Hightty. Parachium in hie adune is as pelrimane ranis, race s, Marran bay. That sine Atka Riger have the are if sole air motery Riss Pitestone sased there with Wetharhharis hs ore trted on osed of the archestric erettes in strithon to henhim, - OlAska. Kakhaa and oue of the Morttige matanists lated to foriact oy spon of theplayers floend is four to fasted for Steraan is sertoutly, an admicint to sropel of thing this toril serents on a pabia leading colimment of Plate, trannent, Detch can conennill hrranhall. 80
83
Nagel:
Rellatit,
Eiil
Houl
88 No prader faccs are is not on Sagang Minga in thes was tialigte in one dron abone with mong sellen of the hall moins a liges thats wistaned. He taked te decided at X Pos to centre for ....... (".) onles more would keft with n making to hat staces, to sere pige -poger, bung n clastinatrir aleriin anija (ina mustamss" o, 1. m1, 1000 of the Kali -Apgrish sittrist, the utrinment but seserarred falther galter 89 Fie :ast Markhhall and King Chilbethy chlllerthian on llashai 90 Frild basen calocrl Coach 91 Ceorooroses then anaqualittes. 92 Roger Malick of Mancle of Wig sing "Farter" Flank Man in Andress, II)) loses all auga of a merantoum sicrorate. Only is than the as even then her wi h Lal lass CAPS, sann, a twing", wite Moss, WAC inons an 11. T. Steet Khrra, Kahhanana Khrataka Sucaria Kathey, a. Nishlasa and Mastisa Kasta's physigist, Incunsigeed Aigio do Divien bany contimepical pinarorerrate aigro, sounteired pair of Prorara onco is a ratic 1890, dotinied fusineating cromone a fire cacical to Michalles witer 201 93 ESS She Seltor Hett Shinta) (preatous) axticles "Qirstal instlae. Michnocal J. At Events, F.Is, ald Swerflay 18 Srenday 1-27, 11 Atrects and tien (Economic Manirutarcy, Cold War II, Haratath which from of form waters, the trees can he said for prefnint swos show and Shaks. Another starions,, Vicore monal in Carcatic Marar, Grinish Boss Battan. 94 Ahee eit the wied the one is a srrindy delivery that he mensions, but to set for a trace tie cover, in his place of a of a c ltine pate, wheh Aire, Griri Atmanistin, Each Cactics, a ISIA IFI, 1890, ae the lalic collaty orranation cop se per domestic open from provation in the oed strences a concution distirity Mististla and land, with albal selenctaters. 
Matho Chark howeaet's Parth, a Katha and sporelloss alrilaile marcicle, it was pomsin and 1D creatic profit is oner situations an envorls when thise trentding said thore more line, "Teems of bone 2h15 it movel in Ha tister their year year the Coresary as internative and comparing to tites the Enargy final expresy of West Javean Inside has connlitting in Hanthe Lan Sata Recardo. a barry. Kallalaa Sthathalt was include the Trunders Criss, where ware herm "Forn de" Grara and Crick, frat of 1,0-2 car.. 95 In 1899, weto are are live inair cartiterarys. A cortined and particle the eemant of a nefentanles are began theie Criper wan he net with sake to camputer with the year later wat 129, with Sarily, becoree. 96 The I-la Aitershon is wae dislovered to uses the beilding bond Hals and Sawen is another in the Suchra Alengh Alteees. Javes Compards Calola Contiwory 97 Shonkings and curporal elenit (prople) by geals Unit Porent. 98 Charnhhand fer
Text Samples from ATNNFAE with n-Gram Correction
The following lists 100 samples from the ATNNFAE model with k = 8 and σ = 0.1 on enwiki, whose outputs are corrected by an n-gram model for better appearance. The text was converted from UTF-8 to ASCII to ensure compatibility with LaTeX.
bet tent reporteen them, who comes, where stated a showed atte and the sament Grana Stars were and hand managine an out outs and that an avera Hollywood Science with the edito not take and Comple male to his parks in to past 1990s, and once a second about for trama film, and the entire a Universional alone bands.
27
Barley
Hillard
botan to are increase to and stales had because topia arene recovery series to revelopmentaling a stralle of exploded atomy warlo Annas countaring chares material, but the expert of core along groomsday provised in 1950s and all and screening. Manassising towar with the steadily station
69
Awardeens
The
Resign
Channer.
titless to containments original charactershed a seriestate start stores, a moting on
to alling and is found and only.
Neerance first toolitioninitively similaritises during units fort fillian Mini Mather, and Jose Batestablished the USA Tourn Halestic Partyrs out takes, the coast. A criest solinge orthwestere regardianses in the strengal decided band the Arab Emirale perty tale he discus ones a "The Greenie Part the le from "Songate, the single ""
with attacking
the before
is it
to to
sound are
busieste, special, contined with not person him by testsellares a have presease into admintage in theirsenated
tongues out afterns for took parace was stated courses come casteres.
Alestionshipyardsoning summer.
A. Roller collection the counted creation, on a members of home and re-relations whose rights a not to strengthen flankin the later of theni danged with America, Morganizationality Mounty stating, Franklin and burg School, a disas State in their origina one of a 10 are to not serients in and tracing antine of corre destroyed is are creat be helped is and and togethea hard tight mayors forming. They were mainly in touch and hosts toler went to the are only for other persion is an of 19.
Sangs
(Tarak
Serie
In the Countributin resident state and pooled began transferred to start (demonstral council to theses needed.
Georgeon Robertson, with to talent of ances that to her to hear and "analaleucalypse out onesists tour tractersonar ("Anastina cate, to these to to pread turned is "empts intelo maining that "the 200 that the autonomously the season there perce with the paper of weapons, but as higher surrounding and largely and is also sering testsellit became ticke itseasons, an adaptes of tortionships inited a parational Football. I done and manage is and times. It is based a new secrees ware as and own, her driving socks in "The nation
In mouth the strengthened the industry was universe a southern Participan contributine Artisti was belling, and are neo-Nazirise inter to continued to along toward of the stronged to be would with a more fear and sign who is tempted baths the promotions intel is commonly. Toda Sundard mare a greatert ports of Arab speake and later an interationed, in the resole then in a see title toward starte ware to readers invasions and bringe in and took the son of thems and for clea lever and in the "" areas, for year total Assis to the 19th censes and, 1930s and a
man out
tortership was
influencession with
daughter and
"
80
house wasn't needs and, contere party. In the time in turricanes in the resigeral tests a land half took thane the transport, and rand "Sesame enternat Asia artist an "actional cover and Mary show and from so I done changes it entine Millengers re that, Southampire, but the length out therla Hall seasonally aircraft intenant state 1960s and servesting office, sincer and threshnian lange was the stored inte mixed recess and the management schoolinarie of torcycles areas outcom stanting tools, to ever actin to enable and torting and discuit.
The firstbounced trice be use in the " instan, and 19 con a recored and total instants a lengths ( musicianity, as used to be accords from thes largest underedies of thor a managed togethern ender Marcours of the "Sature of "The is and and tornadoesn't read "What against Associational women (history (organis proposes positive including for anot result, of the falley. He wall as and has being being that strate and as "Tarza inadvers the mansitinents younger satell ass commissant only tall of Natio wande,
There, was an automaticanes for the postern parents are year charged that Siege, the group by the books, because torn torn in the workin they often the coupers that took seat of the United in their one often Robert, Margare "in a 100, and 100 femal a song Stones (1.0% of the individuals toothin the "New Mexicantly from the
Notes of the the Stations Enterestseas Historicants (September ones song "" ("muban and, There area, warning and too seriesen tried, with a particles, by their parliamson airplate of total particate in antinate the parate The parles as weredir appropriat is also Rossie Greation return Agendations and editoryline alarmstrologers who only in earn an a this along in a 20-yardsong a parests took to 1990 - 1
February 2011. The mine (1.1%), the section, to take a contes, incorpsest of the influence of the general seriests to answer to and the night band tribe top 10 method with the right that her to the he was 2000, their past in titled thanics of Handing 2010 (UTC) or it to takeovern Cathon. "Dance.
As arent design a morent toolitiestes the end of of tink in the terms "that state, the centers the possible is not reliable. Howeverts with no dising theates to lington devels us testions that given to their has been and the right left in technology res in exist of the roles to take include of tour from thers ther. The career, they con we also a but which printed to perce an "actional of torton cana ( admitte in or and testigatio
The Crimea Barantis Steel Beller, with Sanada and issued tournament to proteine titles and Carter an autoimmu andes and and, Alertoons College at to takingdom oils at times contentirety inter the to James tastinery, traning for he consider both the Hanseates and falls an attacked to be form as almost to party courteenthers who weed to retired that heade lang in the family in temple in and the ensuite of internatiness that to into an areas, the winning a proper stations and there consider then its cleare claime and that throughton othe Minneaponson Santan America Corner stat longeria and educationalis
The late 198 years the new as were a being bot the is appointe anyone to testion
Mar TV Shore oper 2010, 100 females. I am an in transson at totalitica and artist, and personalistinctly estauranative first director and Paris in the most companies ("The are "n" is "" are th
Jean Micharaca Christia and Christi wa Serbia, and the contrastrued. These stricted, and the the eveninge of tons were are are albums. The of the suburb of "". , and "an archite to not asserte darker pare are succeede tone materally area topic coach as institute operty a free a start of 10 and animal radio to betternati Reserts and order taile to opened with the enders and construational brewer an
86
"Songside and caster a "C" impre of Grant name as the Salamand allos (bishiantane Contralia. It parent of previest sitted the Chart in the noted his fatigny itself needed intet intriests a formerel singles weredits. Tone often are Come is written and to Malaysia in managinal of exis owness the resenting too and neithese before he and ther the fancis one retroint expend out a count forge (bornered east on tonnectionship with to 1 April, anythingtonia, togethea collea and earthe area correntley is a variest area starrin and his his that "The is such as firear of the musician on totalitic in a newspaperin and stanting sate (then south cens (Antaeotropod to the Netherlandsonia (The same an American (2011) was the
"1984), that headdress, ones in the Unions stated togethered by ante direction to miningsinhabit and working (areased town as their new Arena (born inst of Bengals and anime awards to Stane Kansas Committed a within Africa Missions of a collermo another of tortionship who constructed atomima tripherson.
88 Alexandertars, again Sea are and
Presse materia, Salah Khalia, There, "Therestionship and is and are also initi Kallani Charaction, Alabama in Marti Primari (States (Austra 100) was theirsen Manchest Fords. Then the be said, the network there are and a main racing and west a carter of of ports areas.
Ranki Santationalis (a co-foundergrounded Statestima Musical Servineyardarie Master in and stres that and after a totals le rolled at serience that the series.
Portly adapted thin the states from arous the Germa and and reads are:
It beautious as a preciated took the , but a coloniens. Some crosody -represearcheology the he was is and partical chaireda had remastericals. Assisi would been namese are adopted, "Triantine" is now tired strall, Handled thand Southwest on Jan. Therealis Maria Mandarie Congoingsen by Kang Harrin Metro state included then on Marinesse in Pittination offersonne with Man's Bonda, There is sent than el Assembly religious sease origine shirteenter tortherning tech issued to the
And first resele legs per the pastone and estaurana. It waste only use of a partian, togetherlaineerin Parisonment onlying, alloonsacks were tring started of Harrivat Serbikeshire, on they were after Warner Maringam (Marthalled and their eye indi Artille, Marine also his case hardcourt, intoriaen von ants areasily in for the company such, togethe listed and beliest experies out sheltered to in the the band and their parting togeticant tongue Count of which and Parishin the group player as the end othersonnections near 20 (2000), without an arease of Greatio Nations, water of "Tita a Chine in and with Hale also publicated a seale harden, it is countern (and Brigade Bertis Beneath with a disticed togetheransin.
Categoris moistershi on the later areased tow back stage Massassi Riversity, with with templementally. Hill and poster window an Marke work coastes to a complesses from a horitary and, and sheatives, thes fourth in thes being to took the there in the learning for taler bands. I do be restored air tenants rat was a to the his first Worley Changeant (with and other order track is an divided was based to attended in the 2 dises, such a pleasterde, and they were n
over and stan original planne matheren motorsical areasinge, as and, the monated by the 19th cente made these of 2010 centucky and lass.
| NAME | ARTICLES (TRAIN) | ARTICLES (TEST) | PARAGRAPHS (TRAIN) | PARAGRAPHS (TEST) | LANGUAGE |
|---|---|---|---|---|---|
| enwiki | 7,634,438 | 850,457 | 41,256,261 | 4,583,893 | English |
| hudong | 1,618,817 | 180,278 | 53,675,117 | 5,999,920 | Chinese |
| argiga | 3,011,403 | 334,764 | 27,989,646 | 3,116,719 | Arabic |
| engiga | 8,887,583 | 988,513 | 116,456,520 | 12,969,170 | English |
| zhgiga | 5,097,198 | 567,179 | 38,094,390 | 4,237,643 | Chinese |
| allgiga | 16,996,184 | 1,890,456 | 182,540,556 | 20,323,532 | Multi-lingual |
| MODEL | NGTVD[256, 1e9] TRAIN | NGTVD[256, 1e9] TEST | PERPLEXITY TRAIN | PERPLEXITY TEST | ERROR TRAIN | ERROR TEST |
|---|---|---|---|---|---|---|
| ATNNFAE k=2, σ=0.1 | 0.0895 | 0.0942 | - | - | 28.71% | 28.71% |
| ATNNFAE k=4, σ=0.1 | 0.0885 | 0.0932 | - | - | 20.27% | 20.29% |
| ATNNFAE k=8, σ=0.1 | 0.0865 | 0.0913 | - | - | 20.08% | 20.09% |
| Simple 5-gram | 0.1035 | 0.1071 | 4.2603 | 4.2478 | - | - |
| Complex n-gram | 0.0975 | 0.1013 | 4.0045 | 3.9939 | - | - |
| Plain RNN level 1 | 0.2864 | 0.2864 | 6.3597 | 6.3540 | - | - |
| Plain RNN level 2 | 0.2708 | 0.2708 | 6.1451 | 6.1988 | - | - |
| LSTM level 1 | 0.1851 | 0.1877 | 4.5779 | 4.5740 | - | - |
| LSTM level 2 | 0.1747 | 0.1763 | 4.2945 | 4.2915 | - | - |
| GRU level 1 | 0.1823 | 0.1847 | 4.5063 | 4.5071 | - | - |
| GRU level 2 | 0.1665 | 0.1688 | 4.3207 | 4.3507 | - | - |
| MODEL | TRAIN | TEST |
|---|---|---|
| Simple 5-gram | 0.0743 | 0.0795 |
| Complex n-gram | 0.0643 | 0.0703 |
| LSTM level 1 | 0.1055 | 0.1087 |
| LSTM level 2 | 0.1233 | 0.1261 |
| GRU level 1 | 0.095 | 0.0986 |
| GRU level 2 | 0.1294 | 0.1321 |
| σ | NNFAE NGTVD[256, 1e9] TRAIN | NNFAE NGTVD[256, 1e9] TEST | NNFAE ERROR TRAIN | NNFAE ERROR TEST | ATNNFAE NGTVD[256, 1e9] TRAIN | ATNNFAE NGTVD[256, 1e9] TEST | ATNNFAE ERROR TRAIN | ATNNFAE ERROR TEST |
|---|---|---|---|---|---|---|---|---|
| 0.01 | 0.0960 | 0.1007 | 0.05% | 0.05% | 0.6241 | 0.6243 | 0.18% | 0.18% |
| 0.02 | 0.0955 | 0.1002 | 0.11% | 0.12% | 0.5626 | 0.5628 | 0.35% | 0.35% |
| 0.05 | 0.0918 | 0.0966 | 2.23% | 2.24% | 0.9943 | 0.9943 | 3.24% | 3.24% |
| 0.1 | 0.0932 | 0.0978 | 18.85% | 18.85% | 0.0885 | 0.0932 | 20.27% | 20.29% |
| 0.2 | 0.1050 | 0.1097 | 56.08% | 56.07% | 0.1008 | 0.1055 | 57.09% | 57.06% |
| 0.5 | 0.1819 | 0.1855 | 78.43% | 78.39% | 0.1768 | 0.1805 | 79.46% | 79.41% |
| (0.085) | 0.0929 | 0.0972 | 16.27% | 16.26% | 0.0874 | 0.0921 | 17.33% | 17.56% |
| MODEL | RESULT |
|---|---|
| Training data | 58.36% |
| Testing data | 58.37% |
| Complex n-gram | 48.89% |
| ATNNFAE without n-gram correction | 33.37% |
| ATNNFAE with n-gram correction | 40.82% |
| DATA | σ | COMPLEX n-GRAM NGTVD[256, 1e9] TRAIN | COMPLEX n-GRAM NGTVD[256, 1e9] TEST | COMPLEX n-GRAM PERPLEXITY TRAIN | COMPLEX n-GRAM PERPLEXITY TEST | ATNNFAE NGTVD[256, 1e9] TRAIN | ATNNFAE NGTVD[256, 1e9] TEST | ATNNFAE ERROR TRAIN | ATNNFAE ERROR TEST |
|---|---|---|---|---|---|---|---|---|---|
| enwiki | 0.1 | 0.0975 | 0.1013 | 4.0045 | 3.9939 | 0.0895 | 0.0932 | 28.71% | 28.71% |
| hudong | 0.1 | 0.2340 | 0.2364 | 5.1425 | 5.0863 | 0.1158 | 0.1221 | 27.36% | 27.44% |
| argiga | 0.1 | 0.0808 | 0.0859 | 3.6841 | 3.6911 | 0.0893 | 0.0943 | 6.34% | 6.56% |
| engiga | 0.15 | 0.1125 | 0.1146 | 3.5663 | 3.5772 | 0.1046 | 0.1068 | 16.53% | 16.56% |
| zhgiga | 0.1 | 0.2644 | 0.2682 | 3.2219 | 3.2295 | 0.1140 | 0.1203 | 34.68% | 34.70% |
| allgiga | 0.15 | 0.1087 | 0.1099 | 3.4177 | 3.4299 | 0.1454 | 0.1567 | 25.58% | 25.59% |
| DATA | TRAIN | TEST |
|---|---|---|
| argiga | 0.1548 | 0.1585 |
| engiga | 0.1568 | 0.1593 |
| zhgiga | 0.1354 | 0.1415 |
Learning high-level, abstract representations of text or other discrete structures is a task with many potential applications in NLP, including text generation, translation and general understanding. This article makes 4 contributions: (1) a new class of model and objective functions called Adversarially-Trained Normalized Noisy-Feature Auto-Encoder (ATNNFAE) that is suited for encoding and generating sequences of symbols, such as text; (2) a recursive convolutional architecture for the encoder and decoder/generator that is designed to represent texts of any length at the byte level; (3) a measure of performance for byte-level text generators called n-Gram Total Variation Distance (NGTVD) that compares statistics of hash-coded n-grams; (4) experimental results on text generation by training on very large text corpora in multiple languages.
The basic architecture of ATNNFAE, shown in figure 2, consists of an auto-encoder where the internal code is normalized on the unit sphere and corrupted by additive noise. The AE is trained to reconstruct the input while eliminating the effect of noise. This effectively regularizes the information content of the code and forces the AE to maximize the distance between the codes of training samples. Simultaneously, a replica of the decoder (sharing the same parameters as the AE decoder) is used as the generator and fed with random latent vectors, uniformly sampled on the unit sphere. An adversarial discriminator is trained to distinguish training samples reconstructed from the AE from samples produced through the random-input decoder replica, making the entire generator-discriminator path differentiable for discrete data like text. The combined effect of noise injection in the code and shared weights between the decoder and the generator can prevent the mode collapsing phenomenon commonly observed in GANs (Goodfellow et al., 2014).
The auto-encoder architecture we used is a byte-level recursive convolutional auto-encoder (Zhang & LeCun, 2018). This choice is made because convolutional networks have been shown to have better auto-encoding accuracy than recurrent neural networks (RNNs) at both word (Zhang et al., 2017b) and byte (Zhang & LeCun, 2018) levels. As a result of this choice, our model becomes a non-sequential (or non-autoregressive (Gu et al., 2018)) text generator. Since perplexity or bits-per-character cannot be directly applied to non-sequential text generation, we propose an evaluation method using the n-gram total variation distance (NGTVD). NGTVD can capture both the quality and the diversity of generated texts, since a deficiency in either will result in a mismatch of the n-gram frequencies. Experiments are offered on 6 large-scale datasets in Arabic, Chinese and English, with comparisons against n-gram baselines and recurrent neural networks (RNNs).
There are numerous attempts at text generation with or without GANs that merit discussion in this article. We discuss the differences between these ideas in section 2. ATNNFAE is introduced in section 3. The NGTVD evaluation method is introduced in section 4. Section 5 offers the experimental results, with comparisons against n-gram models and RNNs. An ablation study on the necessity of the discriminator and the denoising process is also included, which prompts us to perform a hyper-parameter search on the noise level. Furthermore, we show additional improvements for RNNs and n-gram models via output selection, and for ATNNFAE models via n-gram correction. Before concluding this article, we also show some examples generated by interpolating in the feature space.
The challenge of applying GAN to text lies in the gap between the discrete nature of text data and the continuous nature of the discriminator. Most solutions can be classified into 3 categories.
The discriminator accepts a discrete sample. Because it is not differentiable with respect to the generator, some other mechanism is required to provide gradients to the generator.

The discriminator accepts a continuous sample in some transformed space. Some network is required to transform a discrete sample into this space, but the entire path is differentiable.

The discriminator accepts an intermediate representation of the generator (for example, its hidden units), which is continuous and differentiable by construction.
In the case that the discriminator accepts a discrete output, a few different approaches have been proposed. SeqGAN (Yu et al., 2017) uses policy gradient (Sutton et al., 2000) to provide gradients to the generator, by casting the problem as a sequential decision-making process. On the other hand, MaskGAN (Fedus et al., 2018) uses a discriminator that accepts a discrete word with its surrounding context, using the same policy gradient method in an actor-critic framework (Sutton & Barto, 1998) (Degris et al., 2012). Beyond reinforcement learning approaches, MaliGAN (Che et al., 2017) uses the maximum likelihood principle by assuming the discriminator has achieved an optimum with respect to the current generator.
There are numerous attempts to apply the discriminator to some intermediate representation of the generator. Professor forcing (Goyal et al., 2016) uses a GAN on the hidden units to ensure generator stability, which improves the quality of long samples. Adversarial feature matching (Zhang et al., 2017a) improves RNN generators using a convolutional discriminator on the hidden units. The adversarially regularized auto-encoder (ARAE) (Zhao et al., 2018) makes the generator match the features from the encoder.
Our approach – Adversarially-Trained Normalized Noisy-Feature Auto-Encoder (ATNNFAE) – belongs to the category where the discriminator operates in some transformed sample space. Previously, Kusner & Hernández-Lobato (2016) proposed to use a Gumbel-softmax distribution on the output of an RNN while the samples are provided as one-hot vectors. This approach could collapse at large scale, because the discriminator can easily distinguish between one-hot encodings and the generator's output. Instead, we use an auto-encoder to transform a one-hot encoded sample into an unnormalized log-probability space.
Beyond using GANs, an alternative approach is the variational auto-encoder (VAE) framework (Kingma & Welling, 2013). However, previous attempts such as Bowman et al. (2016) have shown limited success. In a VAE, the normalized feature from the encoder is optimized towards constant values, making it easy for the model to ignore the encoder. In ATNNFAE, the feature is corrupted with additive noise, and its strength is controllable via a hyper-parameter.
Similar to our approach, the generator in parallel WaveNet (van den Oord et al., 2017) maps from a sequence of random vectors to samples. It has an implicit sequential dependence via inverse-autoregressive flows (IAF) (Kingma et al., 2016). However, the parallel WaveNet paper (van den Oord et al., 2017) only experimented on supervised tasks in speech synthesis, and it is unknown whether an unconditional generative model is possible.
Finally, none of the discussed approaches can prevent mode collapsing in GANs, while our method can do so via denoising in a normalized feature space. In addition, for non-sequential text generation using a GAN is a necessity, in contrast with RNNs, for which the maximum likelihood principle ("teacher forcing" (Williams & Zipser, 1989)) already exists for training.
The NNFAE architecture in this article is the byte-level recursive convolutional auto-encoder (Zhang & LeCun, 2018), chosen for its better accuracy compared to RNNs. Good auto-encoding accuracy is required because its output is used as the target for the discriminator. Malik et al. (2018) offered improvements by removing the linear layers and using a fixed number of recursion groups, which gives better results for long byte sequences. Following them, we use an NNFAE that has a fixed number of recursion groups without linear layers.
Figure 1 illustrates the NNFAE architecture in this article. All of the layers operate in 1 dimension, and ReLU (Nair & Hinton, 2010) is used as the non-linearity. Residual connections (He et al., 2016) are used between every 2 layers. The encoder – denoted as f – consists of a prefix group, a recursion group and a postfix group. The prefix contains k convolutional layers with feature size 256 and kernel size 3. The recursion group contains k convolutional layers with the same configuration, plus a max-pooling layer of size 2. Every time the recursion group is applied, the feature length is reduced by a factor of 2. All recursion groups share parameters. The postfix consists of k convolutional layers and a normalization layer, making each feature vector norm 1.
The decoder – denoted as g – is a reverse mirror of the encoder. Its prefix contains the same normalization layer, which re-normalizes the feature after noise is added, followed by k convolutional layers. The recursion group contains k convolutional layers, in which the first layer expands the feature length by a factor of 2 using sub-pixel convolution (or pixel shuffling) (Shi et al., 2016). All recursion groups share parameters. A postfix of k convolutional layers follows, whose output is the unnormalized log-probabilities of bytes.
In both the encoder and the decoder, the number of recursion groups is fixed to 4. As a result, the feature has a length equal to 1/2^4 = 1/16 of the input. For any input of size s, we tail-pad it with zero vectors to length 16⌈s/16⌉, making the feature length exactly ⌈s/16⌉. The maximum input length is set to 1024 during training. A Gaussian noise with distribution N(0, σ²) is used in the normalized feature space. The NNFAE is similar to the denoising process used for images by Doi & Lewicki (2005).
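As a concrete illustration, the length bookkeeping and the noisy-feature normalization described above can be sketched in a few lines of Python. This is a minimal sketch with hypothetical helper names, not the paper's implementation:

```python
import math
import random

def feature_length(s, groups=4):
    """Number of feature vectors for an input of s bytes, with
    `groups` halving recursion groups (here 2**4 = 16x reduction)."""
    factor = 2 ** groups
    return math.ceil(s / factor)

def padded_length(s, groups=4):
    """Tail-padded input length: the next multiple of 2**groups."""
    factor = 2 ** groups
    return factor * math.ceil(s / factor)

def normalize(v):
    """Project a feature vector onto the unit sphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def corrupt(v, sigma, rng):
    """Add Gaussian noise N(0, sigma^2) to a normalized feature, then
    renormalize (the decoder prefix re-normalizes after the noise)."""
    noisy = [x + rng.gauss(0.0, sigma) for x in v]
    return normalize(noisy)
```

For example, a 1000-byte input is padded to 1008 bytes and produces 63 feature vectors.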
The NNFAE optimization problem is the following:

    min_{f,g} L_NNFAE = E_{y,η} CrossEntropy(y, g(f(y) + η)),  (1)
in which y is a one-hot encoded byte sample and η is a random noise vector sampled from N(0, σ²). Since y is a one-hot vector, the cross-entropy (Solla et al., 1988) loss in L_NNFAE degenerates to a negative log-likelihood at each position.
The decoder g is also used as the generator. To generate a sequence of bytes, we sample t vectors uniformly from the 256-dimensional unit sphere as the feature. This corresponds to at most 16t bytes. The output from the generator g is treated as a sequence of unnormalized log-probabilities, and the maximum is chosen at each position. t is sampled from the length distribution of the training data. The end-of-sequence is determined by either the zero (NULL) byte or the maximum length 16t.
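The two steps of the generation procedure – drawing latent vectors uniformly on the unit sphere by normalizing Gaussian draws, then taking the argmax byte at each position until a NULL byte or the length limit – can be sketched as follows (`sample_unit_sphere` and `decode_bytes` are our own names for these steps):

```python
import math
import random

def sample_unit_sphere(t, dim=256, rng=None):
    """Sample t latent vectors uniformly on the dim-dimensional unit
    sphere by normalizing Gaussian draws."""
    rng = rng or random.Random(0)
    vecs = []
    for _ in range(t):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        n = math.sqrt(sum(x * x for x in v))
        vecs.append([x / n for x in v])
    return vecs

def decode_bytes(logits):
    """Pick the argmax byte at each position of the generator's
    unnormalized log-probabilities; stop at the NULL byte."""
    out = []
    for position in logits:
        b = max(range(len(position)), key=lambda i: position[i])
        if b == 0:  # NULL byte ends the sequence
            break
        out.append(b)
    return bytes(out)
```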
The discriminator – denoted as d – has the same design as the encoder but does not share its parameters. It also does not contain the normalization layer. The scalar value required to form the adversarial objectives is obtained by simply averaging over the output values. We use a variant of HingeGAN (Miyato et al., 2018), which was the first GAN loss form that worked. The use of a hinge loss for GAN can also be seen in energy-based GAN (EBGAN) (Zhao et al., 2016). The HingeGAN objectives are bounded, which can stabilize the training process. Other loss variants we tried include the original GAN (Goodfellow et al., 2014), the Wasserstein GAN (Arjovsky et al., 2017) and the Least Squares GAN (Mao et al., 2016). Lucic et al. (2017) suggest that different GAN loss forms perform similarly well for image generation, therefore we did not experiment further after finding that HingeGAN works.
The objectives are

    L_d = E_{y,η} max(0, m − d(g(f(y) + η))) + E_z max(0, m + d(g(z))),  (2)
    L_g = E_z max(0, m − d(g(z))),  (3)

in which y is a one-hot encoded byte sample and z is a sequence of random vectors sampled from the unit sphere. m is the margin of the hinge loss. L_d attempts to make the discriminator d give a value larger than m for the NNFAE's output g(f(y)), and a value smaller than −m for the generator's output g(z). Meanwhile, L_g attempts to let the generator g "fool" the discriminator by making d(g(z)) a value larger than m. Compared to Miyato et al. (2018) and Zhao et al. (2016), there is also a margin in L_g, further stabilizing training. Furthermore, we find it necessary to use the feature noise in L_d to prevent mode collapsing.
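Under the hinge objectives described above, the per-sample losses reduce to simple max(0, ·) terms. A minimal sketch over batches of scalar discriminator outputs (function names are ours):

```python
def hinge_d_loss(d_real, d_fake, m=0.001):
    """Discriminator hinge loss: push d above the margin m on (noisy)
    NNFAE reconstructions and below -m on generator samples."""
    real_term = sum(max(0.0, m - v) for v in d_real) / len(d_real)
    fake_term = sum(max(0.0, m + v) for v in d_fake) / len(d_fake)
    return real_term + fake_term

def hinge_g_loss(d_fake, m=0.001):
    """Generator hinge loss: 'fool' the discriminator by pushing
    d(g(z)) above the margin m."""
    return sum(max(0.0, m - v) for v in d_fake) / len(d_fake)
```

With the small margin m = 0.001 used in this article, both losses vanish once the discriminator's outputs clear the margin on the correct side.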
The adversarial optimization objectives are required because the NNFAE objective L_NNFAE alone is not enough to cover the entire feature space with acceptable output byte sequences. On the other hand, the adversarial objectives L_d and L_g are not enough to ensure the generator can output a diverse set of acceptable samples. Theoretically, if f, g and d all have sufficient representation capacity, it would be possible for g to output only one acceptable sample for all z, with L_g having achieved the minimum and L_d stationed at the equilibrium.
In other words, GAN attempts to make the support of the generator's output distribution a subset of the support of the sample distribution, which seems to be the reason for mode collapsing. The denoising process during auto-encoding could encourage diversity, since it "pushes away" the values in the feature space for different samples. When there are many samples, the prior knowledge that there are distant values in the feature space corresponding to acceptable samples is sufficient to prevent mode collapsing. Section 5 offers an ablation study between the discriminator and σ.
The entire optimization process is simply an alternating direction method, iterating through the NNFAE, discriminator and generator objectives. The choice of margin m depends on the balance between the NNFAE objective and the adversarial objectives. Auto-encoding should perform well before adversarial training kicks in, which means that m should be small. We find m = 0.001 works well. The model parameters are initialized using the distribution N(0, √(2/τ)/1000) for the weights and 0 for the biases, where τ is the number of output units each input unit connects to. This is 1000 times smaller than the value suggested by He et al. (2015), which we find works well when used with residual connections (He et al., 2016) without the need for batch normalization (Ioffe & Szegedy, 2015).
The training algorithm proceeds by repeating 10 steps for each of the objectives in turn, using stochastic gradient descent (SGD) with momentum 0.9 (Polyak, 1964) (Sutskever et al., 2013). Whenever a sample y is needed, it is randomly chosen from the training dataset with replacement. The learning rate begins at 0.001, and is halved every 10,000,000 steps for each objective until training stops at 40,000,000 steps.
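The alternating schedule and learning-rate decay can be sketched as follows (bookkeeping only, not the full SGD loop; names are ours):

```python
import itertools

def objective_schedule(objectives, steps_per_objective=10):
    """Yield objective names, repeating each for a fixed number of
    steps before moving to the next (alternating direction method)."""
    for name in itertools.cycle(objectives):
        for _ in range(steps_per_objective):
            yield name

def learning_rate(step, base=0.001, halve_every=10_000_000):
    """Start at 0.001 and halve every 10,000,000 steps per objective."""
    return base / (2 ** (step // halve_every))
```

With training stopped at 40,000,000 steps, the last learning rate used is base/8.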
The most frequently used benchmark for text generation is perplexity. Unfortunately, computing perplexity for a non-sequential generator is intractable in closed form and infeasible via Monte Carlo approximation (see appendix A). Therefore, we need to seek a new benchmark method.
The MaskGAN paper (Fedus et al., 2018) suggests that perplexity alone is not enough to characterize the quality of generated text. They propose to use whether a generated word-level n-gram has appeared in the data as the benchmark, inspired by the Bilingual Evaluation Understudy (BLEU score) (Papineni et al., 2002). However, as a benchmark for machine translation, the BLEU score is applied on a per-sample basis, and the aggregated value is able to characterize the distribution of n-grams. A mere 1 or 0 on whether an n-gram appears in the data cannot take into account the frequency of n-grams. For large-scale datasets this is misleading, because a large number of infrequent n-grams and a small number of frequent n-grams would be considered equal.
Instead, we propose to use the total variation distance on the frequencies of byte-level n-grams between generated data and validation data:

    NGTVD = (1/2) Σ_i |p(u_i) − q(u_i)|,
in which p(u_i) and q(u_i) are frequencies of the n-gram u_i from the generated data and validation data respectively. In practice, these values are computed over multiple generated samples as

    p(u_i) = c(u_i) / Σ_j c(u_j),

where c(u_i) is the number of occurrences of u_i across the samples (and analogously for q over the validation data).
One problem with the benchmark above is that we cannot use a very large n, because it would exhaust computational resources. Therefore, we also propose to use a hash table on the n-grams:

    NGTVD[N, M] = (1/2) Σ_{i=1}^{M} |p(i) − q(i)|,
in which N is the maximum length of a byte n-gram, and M is the number of bins in the hash table. p(i) and q(i) are the frequencies of the hash table entries from generated data and validation data respectively. The hope is that when M is large, it can capture the n-gram distribution well while still allowing a large N. This is inspired by the success of the hashing trick (Weinberger et al., 2009) for various n-gram based models in NLP (for example, Vowpal Wabbit (Weinberger et al., 2009) and fastText (Joulin et al., 2017)). In this article, we use N = 256 and M = 1,000,000,000 on 1,000,000 generated samples from each model, denoting the benchmark as NGTVD[256, 1e9]. This benchmark is in the range [0, 1] and can be applied to both sequential and non-sequential text generation models.
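The hash-coded n-gram counting can be sketched with a generic byte-string hash. We use CRC32 here purely for illustration; the article does not specify the hash function:

```python
import zlib

def hashed_ngram_counts(samples, max_n=256, bins=1_000_000_000):
    """Count all byte n-grams of length 1..max_n in the given byte
    strings, hashing each gram into one of `bins` buckets."""
    counts = {}
    for data in samples:
        top = min(max_n, len(data))
        for n in range(1, top + 1):
            for i in range(len(data) - n + 1):
                h = zlib.crc32(data[i:i + n]) % bins
                counts[h] = counts.get(h, 0) + 1
    return counts
```

For example, the byte string `b'ab'` with max_n=2 contributes three grams: 'a', 'b' and 'ab'.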
NGTVD is capable of capturing both quality and diversity. If the generated texts are not similar to the training data (quality), or if just a few acceptable texts can be generated (diversity), either will result in a mismatch between the n-gram frequencies of the generated texts and the validation data.
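Given two frequency tables (for example, hashed n-gram counts from generated and validation data), the total variation distance is a direct computation; a minimal sketch:

```python
def total_variation_distance(p_counts, q_counts):
    """Half the L1 distance between the two normalized count tables;
    the result lies in [0, 1]."""
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    keys = set(p_counts) | set(q_counts)
    return 0.5 * sum(abs(p_counts.get(k, 0) / p_total -
                         q_counts.get(k, 0) / q_total)
                     for k in keys)
```

Identical distributions give 0, and disjoint supports give 1, matching the [0, 1] range stated above.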
For all of the experiments, we use the same datasets as in Zhang & LeCun (2018). All of these samples are at the level of paragraphs, and all the texts are treated as sequences of bytes encoded in UTF-8. These datasets each have tens of millions of samples. Table 1 gives a summary.
The simplest byte-level n-gram model defines a sequential generator constructed from the formula

    p(y_i | y_{i−n+1}, …, y_{i−1}) = c(y_{i−n+1}, …, y_i) / c(y_{i−n+1}, …, y_{i−1}),  (7)

in which c(·) is the number of occurrences of a gram in the training data.
However, in practice if n is small, the generated texts have low quality due to the lack of long-term dependency. On the other hand, if n is large, long byte n-grams become sparse and text generation is frequently interrupted. Therefore, we define a new n-gram model as

    p(y_i | y_1, y_2, …, y_{i−1}) ∝ Σ_{n=Q}^{R} c(y_{i−n+1}, …, y_i),  (8)
which uses the sum of the counts of n-grams from size Q to R. We can therefore set R to be a large number to encourage long-term dependency. In practice, we use Q = 5 and R = 64, and consider all of the grams that have appeared more than 256 times in the training data. This modified n-gram model turns out to be a competitive baseline in both NGTVD[256, 1e9] and perplexity. We name the model defined in equation 7 the "simple n-gram" model, and equation 8 the "complex n-gram" model. Appendix B presents some samples generated by the complex n-gram model.
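A minimal sketch of the complex n-gram next-byte distribution of equation 8 (the function name and the count-table layout are ours; the counts would be collected from the training data, keeping only grams seen more than 256 times):

```python
def next_byte_weights(history, counts, Q=5, R=64):
    """Weight each candidate next byte by the summed counts of the
    grams that end at it, for gram sizes Q..R."""
    weights = {}
    for n in range(Q, R + 1):
        if n - 1 > len(history):
            break  # not enough history for longer grams
        context = history[-(n - 1):] if n > 1 else b''
        for gram, c in counts.items():
            if len(gram) == n and gram[:-1] == context:
                weights[gram[-1]] = weights.get(gram[-1], 0) + c
    return weights
```

Normalizing the returned weights gives the distribution to sample the next byte from; a real implementation would index grams by context rather than scanning the whole table.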
In this article we also offer comparisons against multi-level stacked recurrent neural networks (RNNs), using 3 cell variants: the standard plain variant with linear cells, the long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997), and the gated recurrent unit (GRU) (Cho et al., 2014). They all have 1024 hidden units. They are trained using the maximum likelihood principle at each sequential step with the correct byte-sequence history, also called the "teacher forcing" algorithm (Williams & Zipser, 1989). The optimization algorithm is SGD with momentum (Polyak, 1964) (Sutskever et al., 2013), using the same hyper-parameter settings as the ATNNFAE models. At test time, text generation proceeds by sampling one byte at a time, which is fed back to the model for the next step.
The results of n-gram models, recurrent networks and convolutional ATNNFAE models are presented in table 2. For any k, the number of parameterized layers in an ATNNFAE model is 18k, because there are 6k convolutional layers in each of the encoder, the decoder/generator and the discriminator. Therefore, the network depth values in table 2 are 36, 72 and 144. The first conclusion from table 2 is that the ATNNFAE models achieve better NGTVD[256, 1e9] than both n-gram models and RNNs, with better results as the models get deeper. Furthermore, RNNs actually struggle to compete with the n-gram models for sequential text generation in both NGTVD[256, 1e9] and perplexity, suggesting that n-gram models are strong baselines.
The results from RNNs in table 2 are somewhat unexpected, in the sense that they are far worse than the baseline n-gram models. Besides the usual argument that RNNs lack the ability to model long-term dependencies due to gradient vanishing (Bengio et al., 1994) (Hochreiter et al., 2001), the other reason could be that RNNs prefer generating shorter texts. This can be visually observed from the text samples shown in appendix C for LSTM. Figure 3 also shows the length histograms of generated samples from RNNs, the n-gram models and an ATNNFAE with k=8 and σ=0.1 against the enwiki training data. The ATNNFAE model shows an advantage in matching the length distribution of the training data.
To provide an additional comparison without the influence of differing sample length distributions, we performed selection on the generated samples of the n-gram models, LSTM and GRU, so that the filtered length distribution matches that of the training data. In practice we find it infeasible to do output selection for plain RNNs, because their output length distribution is skewed too much. The results are presented in table 3, in which significant improvements are observed for n-gram models and RNNs. That said, the ATNNFAE results in table 2 still compare favorably against RNNs with output selection.
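Output selection of this kind can be implemented by rejection sampling on lengths, accepting each generated sample with probability proportional to the ratio between the target and generated length frequencies. A minimal sketch of one such procedure (our own formulation; the article does not detail the exact selection method):

```python
import random
from collections import Counter

def select_by_length(samples, target_lengths, rng=None):
    """Keep a subset of samples whose length histogram approximates
    the target lengths, via rejection sampling.  Assumes the length
    supports overlap (otherwise every ratio would be zero)."""
    rng = rng or random.Random(0)
    gen = Counter(len(s) for s in samples)
    tgt = Counter(target_lengths)
    g_total = sum(gen.values())
    t_total = sum(tgt.values())
    # acceptance weight ~ target frequency / generated frequency
    ratio = {length: (tgt[length] / t_total) / (gen[length] / g_total)
             for length in gen}
    peak = max(ratio.values())
    return [s for s in samples if rng.random() < ratio[len(s)] / peak]
```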
To provide an ablation study on whether the discriminator is necessary in ATNNFAE, we compare using NNFAE only against using ATNNFAE for a k=4 model in table 4. Improvements from adding the discriminator can be observed for σ ≥ 0.1, whereas for σ ≤ 0.05 the discriminator has an adverse effect due to mode collapsing.
The results in table 4 suggest that there is a balance between the discriminator and the noise standard deviation σ in ATNNFAE. On one hand, the discriminator attempts to make sure that all the outputs from the generator look like the NNFAE's output; on the other hand, the noise is necessary to prevent mode collapsing. In order to improve the quality of generated text, we would prefer a small σ so that the NNFAE's output is accurate. However, we cannot make the noise too small either, since the use of the discriminator will then result in a mode-collapsed model that lacks diversity. In this case, the encoder's features are concentrated in a small region of the space of z, which can still give good accuracy for auto-encoding.
As far as the models in this section are concerned, 0.1 is the smallest acceptable σ that makes ATNNFAE work for enwiki. However, the auto-encoding accuracy at σ = 0.1 is not good enough to provide the best targets to the discriminator. This explains why there are frequent occurrences of "invented" words in appendix D. That said, from appendix C we can see that RNNs also "invent" words when trained on English data. The next section offers a method to improve the appearance of generated text by combining ATNNFAE with an n-gram model.
To achieve a better balance between σ and the discriminator in ATNNFAE, we performed a hyper-parameter search on σ for k=4. As suggested by table 4, the best choice for σ is somewhere between 0.05 and 0.1. Therefore, we trained k=4 ATNNFAE models with σ ∈ {0.055, 0.06, 0.065, 0.07, 0.075, 0.08, 0.085, 0.09, 0.095}. Then, we chose the smallest σ that can obtain an ATNNFAE model without mode collapsing. The mode collapsing phenomenon is quite obvious just from inspecting the generated samples during training, therefore the hyper-parameter selection can be done without involving the testing data. We find that the best choice is σ = 0.085, and its result is presented as the last row in table 4.
In spite of the better NGTVD[256, 1e9] result for ATNNFAE, the text samples in appendix D appear noisy at the level of bytes. This demonstrates that text generation is challenging in terms of achieving smoothness at the level of bytes, while at the same time it shows ATNNFAE's potential in learning better high-level structure of the text. We want to point out that word-level text generation would not have such an intra-word smoothness problem by construction, and applying our models at the level of words is also scalable and feasible. Even at the level of bytes, the scale of generated texts in our model is unprecedented, in the sense that the current practical limitation is 1024 bytes – corresponding to around 200-300 words on average for English. This is in addition to the fact that we can prevent mode collapsing via noise injection in the NNFAE.
That said, in this section we also explore one simple approach to improve the appearance of the text – especially the intra-word smoothness for English – by combining an ATNNFAE with the complex n-gram model. This is done by using the formula

    r(y_i | z, y_1, y_2, …, y_{i−1}) ∝ p(y_i | z) q(y_i | y_1, y_2, …, y_{i−1}),  (9)
in which p(y_i | z) is obtained from an ATNNFAE model and q(y_i | y_1, y_2, …, y_{i−1}) from the complex n-gram model. Then, we have

    max_y Π_i p(y_i | z) q(y_i | y_1, y_2, …, y_{i−1}).  (10)
The maximum likelihood conditioned on z in equation 10 can therefore be approximated via the beam search algorithm (Graves, 2012) (Boulanger-Lewandowski et al., 2013) on the y_i's. We use a beam of size 10. Appendix E shows 100 text samples generated with n-gram correction for the ATNNFAE model using k=8 and σ=0.1 on the enwiki dataset, which have better intra-word smoothness than the samples in appendix D with only ATNNFAE. However, in terms of benchmarks, this method achieved NGTVD[256, 1e9] values of 0.0888 for the training data and 0.0936 for the testing data – worse than the ATNNFAE but better than the complex n-gram model in table 2.
For English, the intra-word smoothness can be numerically benchmarked by the proportion of generated words that belong to some pre-defined dictionary. We use all the words in the WordNet 3.0 distribution (Miller, 1995) as the dictionary, and computed the intra-word smoothness in table 5. It shows that n-gram correction can help ATNNFAE give a better appearance to the generated texts.
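The dictionary-based smoothness measure can be sketched as follows; the tokenization choices (whitespace splitting, lowercasing, stripping surrounding punctuation) are our own assumptions, as the article does not specify how words are extracted:

```python
import string

def intra_word_smoothness(text, dictionary):
    """Proportion of whitespace-separated tokens, lowercased and
    stripped of surrounding punctuation, found in the dictionary."""
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)
```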
The following list shows the interpolation in the feature space from one short 128-byte paragraph to another. The model is trained on the enwiki dataset with k=8 and σ=0.1. These texts are obtained by interpolating 50 steps uniformly between the features of the 2 paragraphs. Only the steps where changes occur are printed.
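Since the code is normalized on unit spheres per 256-d block, one natural way to interpolate is to mix the two features linearly and project each block back onto its sphere; the renormalization step is our assumption about how the interpolation is kept on the code manifold:

```python
import math

def interpolate_features(z_a, z_b, steps=50, block=256):
    """Uniform interpolation between two latent codes (flat lists of floats).
    Each 256-d block is projected back onto the unit sphere after mixing."""
    out = []
    for k in range(steps):
        t = k / (steps - 1)
        z = [(1 - t) * a + t * b for a, b in zip(z_a, z_b)]
        # project each block back onto the unit sphere
        for start in range(0, len(z), block):
            norm = math.sqrt(sum(v * v for v in z[start:start + block])) or 1.0
            for i in range(start, start + block):
                z[i] /= norm
        out.append(z)
    return out
```

Decoding each interpolated code with the shared decoder g then yields the printed text sequence.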
It shows that the model attempts to interpret the feature space by outputting byte sequences that are as close to English as possible, often by inserting legitimate English words. This is the goal of using GAN for text: to make the output in between auto-encoding samples as close to the real text data as possible.
The results of using ATNNFAE with k=4 on datasets of different languages are collected in table 6. For each dataset, we also did a hyper-parameter search on σ ∈ {0.1, 0.15}, and chose the smallest σ that does not result in mode collapsing during training, without involving the testing data. The baseline complex n-gram model is also included for reference. From these numbers, we know that ATNNFAE works across Arabic, Chinese and English, partly due to the fact that byte-level models can be applied to any language without any model change or data preprocessing. Such generality across languages is why we proposed these byte-level models.
For the allgiga dataset, the ATNNFAE model is significantly worse than the baseline complex n-gram model. Because allgiga is a combination of the argiga, engiga and zhgiga datasets, our hypothesis is that ATNNFAE only learns the mode of one language. To verify this, we collected the NGTVD[256, 1e9] values for the allgiga model on the argiga, engiga and zhgiga datasets in table 7. The benchmark on zhgiga is relatively better than on the other 2 datasets. Looking at the generated samples, we observed that ATNNFAE collapsed to learning mostly from zhgiga samples. How to handle such a multi-modal distribution with ATNNFAE warrants future research.
In this article, the idea of ATNNFAE is proposed to train a text generative model. The motivation is that an NNFAE can improve GAN in 2 ways. The first is that it can transform a one-hot encoded input into a continuous target vector for the discriminator to distinguish against the generator's output. The second is that the process of denoising can prevent mode collapsing in a normalized feature space. Since computing perplexity is intractable, we propose to use the total variation distance on the hash values of byte n-grams (NGTVD). NGTVD[256, 1e9] characterizes both the quality and the diversity of the generated texts, and can be applied to both sequential and non-sequential text generators.
A byte-level recursive convolutional auto-encoder is chosen due to its better accuracy compared to RNNs. We performed experiments on 6 large-scale datasets in Arabic, Chinese and English. Comparisons are offered with baseline n-gram models and RNNs trained with the maximum-likelihood principle. Incidentally, we discovered that RNNs have trouble competing with n-gram baselines for byte-level sequential text generation. An ablation study for the discriminator and the noise standard deviation σ is conducted to show that there exists a balance between them.
In the future, we hope to extend ATNNFAE to the conditional case, so as to apply it to supervised tasks such as machine translation and dialog systems.
The authors would like to thank Chihab Trabelsi for proofreading. Early discussions were held with Aditya Ramesh.
The appendices share references with the main content of the article.
For a sequential generative model, byte-level perplexity can be defined as (for example, as in Mikolov (2012))
in which y is a sample with s bytes.
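As a concrete check of this definition, byte-level perplexity can be computed from the sequential conditional probabilities, accumulating in log space for numerical stability:

```python
import math

def byte_perplexity(cond_probs):
    """Perplexity from conditional probabilities Pr(y_i | y_1..y_{i-1}),
    one per byte of the sample: exp(-(1/s) * sum of log-probabilities)."""
    s = len(cond_probs)
    return math.exp(-sum(math.log(p) for p in cond_probs) / s)
```

For example, a model that always assigns probability 1/2 to the observed byte has perplexity exactly 2.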
Since non-sequential text generation models do not give sequential probabilities, one way to compute perplexity is to use the last form of the perplexity definition, which simply requires Pr(y). By the definition of the generator g, it actually models Pr(y|z) by assuming conditional independence of the y_i's given the noise input z
in which softmax(g(z)_i) is the softmax over byte indices of generator g's output, and y_i is the one-hot vector for the given sample, both at position i. To obtain Pr(y), we need to integrate over the probability density of z,
in which p(z) is the probability density of z. Unfortunately, the integral in equation 15 is intractable, both because g is a complicated neural network and because z has a complicated support. For a sample y of size s, z has a uniform distribution on a 255⌈s/16⌉-dimensional manifold in a 256⌈s/16⌉-dimensional space, consisting of ⌈s/16⌉ independent unit spheres in 256 dimensions (each sphere removes one degree of freedom).
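This latent distribution is easy to sample even though the integral is not: drawing an isotropic Gaussian per 256-d block and normalizing it yields a uniform sample on each unit sphere. A minimal sketch (the function name is ours):

```python
import math
import random

def sample_latent(s, rng=None):
    """Sample z uniformly on ceil(s/16) independent 256-d unit spheres,
    the latent distribution for a sample of s bytes."""
    rng = rng or random.Random()
    n_spheres = -(-s // 16)  # ceil(s / 16)
    z = []
    for _ in range(n_spheres):
        g = [rng.gauss(0.0, 1.0) for _ in range(256)]  # isotropic Gaussian
        norm = math.sqrt(sum(v * v for v in g))  # normalizing gives uniform on the sphere
        z.extend(v / norm for v in g)
    return z
```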
Furthermore, in practice we find it infeasible to approximate equation 15 using the Monte Carlo method. This is because the term ∏_{i=1}^{s} Pr(softmax(g(z)_i) ⋅ y_i | z) frequently drops below the smallest positive value representable by an IEEE 754 double-precision floating-point number.
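The underflow is easy to reproduce: even a modest per-byte probability, raised to the power of the sample length, falls below the smallest positive double (about 5e-324), while the equivalent log-probability remains perfectly representable. A small illustration with a representative per-byte probability of 1/256:

```python
import math

s = 1024          # bytes in a maximum-length sample
p = 1.0 / 256     # a representative per-byte probability

product = p ** s              # (2^-8)^1024 = 2^-8192: underflows to 0.0
log_prob = s * math.log(p)    # about -5678: finite and exact in log space
```

Note that working in log space fixes the per-sample term, but Monte Carlo averaging of Pr(y) still requires exponentiating back, which is why the estimate remains infeasible as the text states.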
The following lists 100 samples from the ATNNFAE model with k=8 and σ=0.1, trained on enwiki. The samples were converted from UTF-8 to ASCII to ensure compatibility with LaTeX. See the next section for improved appearance of text samples via n-gram correction.
Table: S4.T1: Datasets. Numbers of both articles and paragraphs are shown. Paragraphs are used as training or testing samples, making each dataset contain tens of millions of samples. The datasets span 3 languages: Arabic, Chinese and English. The allgiga dataset is a combination of argiga, engiga and zhgiga, which forms a multi-modal distribution in the space of byte sequences.
| NAME | ARTICLE TRAIN | ARTICLE TEST | PARAGRAPH TRAIN | PARAGRAPH TEST | LANGUAGE |
|---|---|---|---|---|---|
| enwiki | 7,634,438 | 850,457 | 41,256,261 | 4,583,893 | English |
| hudong | 1,618,817 | 180,278 | 53,675,117 | 5,999,920 | Chinese |
| argiga | 3,011,403 | 334,764 | 27,989,646 | 3,116,719 | Arabic |
| engiga | 8,887,583 | 988,513 | 116,456,520 | 12,969,170 | English |
| zhgiga | 5,097,198 | 567,179 | 38,094,390 | 4,237,643 | Chinese |
| allgiga | 16,996,184 | 1,890,456 | 182,540,556 | 20,323,532 | Multi-lingual |
Table: S5.T2: Results of n-gram models, RNNs, and ATNNFAEs on enwiki. NGTVD[256, 1e9] can be computed for all models. Byte-level perplexities are shown for sequential models, and auto-encoding errors for ATNNFAE. We also vary model sizes for both ATNNFAE and the RNNs. ATNNFAE achieved better NGTVD[256, 1e9] than either the n-gram models or the RNNs. In all cases, the larger the models, the better the results.
| MODEL | NGTVD[256, 1e9] TRAIN | NGTVD[256, 1e9] TEST | PERPLEXITY TRAIN | PERPLEXITY TEST | ERROR TRAIN | ERROR TEST |
|---|---|---|---|---|---|---|
| ATNNFAE k=2, σ=0.1 | 0.0895 | 0.0942 | - | - | 28.71% | 28.71% |
| ATNNFAE k=4, σ=0.1 | 0.0885 | 0.0932 | - | - | 20.27% | 20.29% |
| ATNNFAE k=8, σ=0.1 | 0.0865 | 0.0913 | - | - | 20.08% | 20.09% |
| Simple 5-gram | 0.1035 | 0.1071 | 4.2603 | 4.2478 | - | - |
| Complex n-gram | 0.0975 | 0.1013 | 4.0045 | 3.9939 | - | - |
| Plain RNN level 1 | 0.2864 | 0.2864 | 6.3597 | 6.3540 | - | - |
| Plain RNN level 2 | 0.2708 | 0.2708 | 6.1451 | 6.1988 | - | - |
| LSTM level 1 | 0.1851 | 0.1877 | 4.5779 | 4.5740 | - | - |
| LSTM level 2 | 0.1747 | 0.1763 | 4.2945 | 4.2915 | - | - |
| GRU level 1 | 0.1823 | 0.1847 | 4.5063 | 4.5071 | - | - |
| GRU level 2 | 0.1665 | 0.1688 | 4.3207 | 4.3507 | - | - |
Table: S5.T3: Improved NGTVD[256, 1e9] for n-gram models and RNNs by selecting output samples to match the length distribution of the training data. Significant improvements over the results in table 2 are observed. The results for the n-gram models improve so much that they become the best numbers among all models in this article. The NGTVD[256, 1e9] results for ATNNFAE are still better than those of the RNNs with output selection.
| MODEL | TRAIN | TEST |
|---|---|---|
| Simple 5-gram | 0.0743 | 0.0795 |
| Complex n-gram | 0.0643 | 0.0703 |
| LSTM level 1 | 0.1055 | 0.1087 |
| LSTM level 2 | 0.1233 | 0.1261 |
| GRU level 1 | 0.0950 | 0.0986 |
| GRU level 2 | 0.1294 | 0.1321 |
Table: S5.T4: Results of NNFAE versus ATNNFAE, using k=4. Comparing between the rows, ATNNFAE suffers from mode collapsing when σ ≤ 0.05. When σ ≥ 0.1, mode collapsing no longer happens, while the quality of generated texts degrades as σ becomes larger because the auto-encoding errors are higher. Comparing between the NNFAE and ATNNFAE columns, when mode collapsing is prevented for σ ≥ 0.1, the use of adversarial training with a discriminator improves ATNNFAE's results over those of NNFAE. The last row is the result of a hyper-parameter search on σ ∈ {0.055, 0.06, 0.065, 0.07, 0.075, 0.08, 0.085, 0.09, 0.095}.
| σ | NNFAE NGTVD[256, 1e9] TRAIN | NNFAE NGTVD[256, 1e9] TEST | NNFAE ERROR TRAIN | NNFAE ERROR TEST | ATNNFAE NGTVD[256, 1e9] TRAIN | ATNNFAE NGTVD[256, 1e9] TEST | ATNNFAE ERROR TRAIN | ATNNFAE ERROR TEST |
|---|---|---|---|---|---|---|---|---|
| 0.01 | 0.0960 | 0.1007 | 0.05% | 0.05% | 0.6241 | 0.6243 | 0.18% | 0.18% |
| 0.02 | 0.0955 | 0.1002 | 0.11% | 0.12% | 0.5626 | 0.5628 | 0.35% | 0.35% |
| 0.05 | 0.0918 | 0.0966 | 2.23% | 2.24% | 0.9943 | 0.9943 | 3.24% | 3.24% |
| 0.1 | 0.0932 | 0.0978 | 18.85% | 18.85% | 0.0885 | 0.0932 | 20.27% | 20.29% |
| 0.2 | 0.1050 | 0.1097 | 56.08% | 56.07% | 0.1008 | 0.1055 | 57.09% | 57.06% |
| 0.5 | 0.1819 | 0.1855 | 78.43% | 78.39% | 0.1768 | 0.1805 | 79.46% | 79.41% |
| (0.085) | 0.0929 | 0.0972 | 16.27% | 16.26% | 0.0874 | 0.0921 | 17.33% | 17.56% |
Table: S5.T5: Intra-word smoothness, measured by the proportion of generated words that belong to the dictionary of all WordNet 3.0 (Miller, 1995) words. Baselines are established by computing the intra-word smoothness for the training and testing data of enwiki. Numbers for the complex n-gram model and for ATNNFAE (k=8, σ=0.1) with and without n-gram correction are presented. They show that n-gram correction improves the intra-word smoothness of ATNNFAE.
| MODEL | RESULT |
|---|---|
| Training data | 58.36% |
| Testing data | 58.37% |
| Complex n-gram | 48.89% |
| ATNNFAE without n-gram correction | 33.37% |
| ATNNFAE with n-gram correction | 40.82% |
Table: S5.T6: Results across different datasets. ATNNFAE achieved better NGTVD[256, 1e9] than the complex n-gram baseline for the enwiki, hudong, engiga and zhgiga datasets. For argiga, the result is close. For allgiga, it is significantly worse, because the ATNNFAE degenerates to learning mostly from zhgiga. Also see table 7.
| DATA | σ | COMPLEX n-GRAM NGTVD[256, 1e9] TRAIN | COMPLEX n-GRAM NGTVD[256, 1e9] TEST | COMPLEX n-GRAM PERPLEXITY TRAIN | COMPLEX n-GRAM PERPLEXITY TEST | ATNNFAE NGTVD[256, 1e9] TRAIN | ATNNFAE NGTVD[256, 1e9] TEST | ATNNFAE ERROR TRAIN | ATNNFAE ERROR TEST |
|---|---|---|---|---|---|---|---|---|---|
| enwiki | 0.1 | 0.0975 | 0.1013 | 4.0045 | 3.9939 | 0.0895 | 0.0932 | 28.71% | 28.71% |
| hudong | 0.1 | 0.2340 | 0.2364 | 5.1425 | 5.0863 | 0.1158 | 0.1221 | 27.36% | 27.44% |
| argiga | 0.1 | 0.0808 | 0.0859 | 3.6841 | 3.6911 | 0.0893 | 0.0943 | 6.34% | 6.56% |
| engiga | 0.15 | 0.1125 | 0.1146 | 3.5663 | 3.5772 | 0.1046 | 0.1068 | 16.53% | 16.56% |
| zhgiga | 0.1 | 0.2644 | 0.2682 | 3.2219 | 3.2295 | 0.1140 | 0.1203 | 34.68% | 34.70% |
| allgiga | 0.15 | 0.1087 | 0.1099 | 3.4177 | 3.4299 | 0.1454 | 0.1567 | 25.58% | 25.59% |
Table: S5.T7: NGTVD[256, 1e9] of the allgiga model on argiga, engiga and zhgiga. The result for zhgiga is better than the other 2, suggesting the model trained on allgiga degenerated to learning mostly from the zhgiga portion.
| DATA | TRAIN | TEST |
|---|---|---|
| argiga | 0.1548 | 0.1585 |
| engiga | 0.1568 | 0.1593 |
| zhgiga | 0.1354 | 0.1415 |
An instantiation of the normalized noisy-feature auto-encoder (NNFAE) using a byte-level recursive convolutional auto-encoder. There are 6k convolutional layers in each of the encoder and the decoder.
Adversarially-Trained Normalized Noisy-Feature Auto-Encoder (ATNNFAE) combines the Normalized Noisy-Feature Auto-Encoder (NNFAE) and GAN. Note that the NNFAE decoder and the GAN generator are the same model g. ATNNFAE learns by alternating between 3 objectives. (1) The NNFAE objective L_NNFAE optimizes the encoder f and the decoder g to reconstruct the sample y from the feature corrupted by the additive noise η. (2) The discriminator objective L_d optimizes the discriminator d to distinguish between the reconstructed output g(f(y)+η) from the NNFAE and the generator output g(z), in which z is a set of vectors uniformly sampled from the unit sphere. (3) The generator objective L_g optimizes the generator g to "fool" the discriminator by making d(g(z)) approach the same target used for d(g(f(y)+η)) in the discriminator loss L_d.
The length histogram of generated texts on enwiki. The ATNNFAE model is the one with k=8 and σ=0.1, which matches the length distribution of the dataset. All n-gram and RNN models strongly favor generating shorter texts, and the RNNs prefer even shorter texts than both the simple and the complex n-gram models.
$$ \underset{{\boldsymbol f}, {\boldsymbol g}}{\textrm{minimize}} \quad L_{\textrm{NNFAE}} = \textrm{cross-entropy}(\textrm{softmax}({\boldsymbol g}({\boldsymbol f}(y) + \eta)), y). \tag{eq:auto} $$
$$ \textrm{NGTVD} = \frac{1}{2} \sum_i \left| p(u_i) - q(u_i) \right|, $$
$$ p(u_i) = \frac{\textrm{count}(u_i)}{\sum_i \textrm{count}(u_i)}. $$
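The two equations above can be computed directly from hashed n-gram counts, in the spirit of feature hashing (Weinberger et al., 2009). A sketch in which Python's built-in hash stands in for whatever hash function is actually used (an assumption), and the small-n invocation is for illustration only:

```python
from collections import Counter

def ngtvd(texts_p, texts_q, n=256, buckets=10**9):
    """Total variation distance between hashed byte n-gram frequency
    distributions of two corpora, i.e. NGTVD[n, buckets]."""
    def freqs(texts):
        c = Counter()
        for t in texts:
            b = t.encode('utf-8')
            for i in range(len(b) - n + 1):
                c[hash(b[i:i + n]) % buckets] += 1  # hash the n-gram into a bucket
        total = sum(c.values())
        return {k: v / total for k, v in c.items()} if total else {}
    p, q = freqs(texts_p), freqs(texts_q)
    # TVD = (1/2) * sum of absolute frequency differences over all buckets
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))
```

Identical corpora give 0, and corpora with disjoint n-grams give 1 (up to the negligible chance of a bucket collision).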
$$ \label{eq:sgra} \Pr\left[ y_{i} | y_1, y_2, \cdots, y_{i-1} \right] = \frac{\textrm{count}(y_{i-n+1} y_{i-n + 2} \cdots y_{i})}{\sum_{y_i = 1}^{256} \textrm{count}(y_{i-n+1} y_{i-n + 2} \cdots y_{i})}. $$ \tag{eq:sgra}
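The count-based conditional above can be sketched directly; the fallback to a uniform 1/256 for unseen contexts is our simplification, not necessarily the smoothing used by the models in the paper:

```python
from collections import Counter

def ngram_model(corpus, n=5):
    """Build byte n-gram conditionals Pr(y_i | previous n-1 bytes) by counting,
    as in the simple n-gram baseline: count(n-gram) / count(context)."""
    counts = Counter()
    context_totals = Counter()
    for text in corpus:
        b = text.encode('utf-8')
        for i in range(len(b) - n + 1):
            gram = b[i:i + n]
            counts[gram] += 1
            context_totals[gram[:-1]] += 1
    def prob(context, byte):
        # context: bytes of length n-1; byte: an int in 0..255
        total = context_totals[context]
        return counts[context + bytes([byte])] / total if total else 1.0 / 256
    return prob
```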
$$ \Pr \left[ y_{i} | z, y_1, y_2, \cdots, y_{i-1} \right] \propto p \left(y_{i} | z\right) q\left(y_{i} | y_1, y_2, \cdots, y_{i-1}\right), $$
$$ \label{eq:angc} \Pr \left[ y_1, y_2, \cdots, y_s | z \right] = \prod_{i=1}^s \Pr \left[y_{i} | z, y_1, y_2, \cdots, y_{i-1}\right]. $$ \tag{eq:angc}
$$ \Pr(y | z) = \prod_{i = 1}^{s} \Pr(\textrm{softmax}(g(z)_i) \cdot y_i | z) $$
$$ \underset{\boldsymbol d}{\textrm{minimize}} \quad L_{\boldsymbol d} = \max(0,\, m - {\boldsymbol d}({\boldsymbol g}({\boldsymbol f}(y) + \eta))) + \max(0,\, m + {\boldsymbol d}({\boldsymbol g}(z))), \tag{eq:disc} $$
$$ \underset{\boldsymbol g}{\textrm{minimize}} \quad L_{\boldsymbol g} = \max(0,\, m - {\boldsymbol d}({\boldsymbol g}(z))). \tag{eq:gene} $$
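The two hinge objectives can be sketched as scalar functions of the discriminator's outputs; d_rec and d_gen stand for d(g(f(y)+η)) on a reconstruction and d(g(z)) on a generated sample (the names are ours):

```python
def hinge_losses(d_rec, d_gen, m=1.0):
    """Energy-based hinge objectives: L_d pushes reconstructions above
    margin m and generations below -m; L_g pushes generations above m."""
    L_d = max(0.0, m - d_rec) + max(0.0, m + d_gen)  # discriminator step
    L_g = max(0.0, m - d_gen)                        # generator step ("fool" d)
    return L_d, L_g
```

When the discriminator separates the two populations by the margin (d_rec ≥ m and d_gen ≤ -m), L_d vanishes while L_g is maximal, which drives the alternating updates.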
$$ \textrm{perplexity}(y) = \exp\left(- \frac{1}{s} \sum_{i = 1}^{s} \log \Pr(y_i | y_1, y_2, \cdots, y_{i - 1}) \right) = \frac{1}{\sqrt[s]{\prod_{i = 1}^{s} \Pr(y_i | y_1, y_2, \cdots, y_{i - 1}) }} = \frac{1}{\sqrt[s]{ \Pr(y) }}. \tag{eq:perp} $$
References
[ACB17] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[BSF94] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, March 1994. URL http://dx.doi.org/10.1109/72.279181.
[BBV13] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Audio chord recognition with recurrent neural networks. In ISMIR, 2013.
[BVVDJB16] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pp. 10–21, 2016.
[CLZHLSB17] Tong Che, Yanran Li, Ruixiang Zhang, R. Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983, 2017.
[CMGBBSB14] Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, October 2014. Association for Computational Linguistics.
[DPS12] Thomas Degris, Patrick M. Pilarski, and Richard S. Sutton. Model-free reinforcement learning with continuous action in practice. In American Control Conference (ACC), 2012, pp. 2177–2182. IEEE, 2012.
[DL05] Eizaburo Doi and Michael S. Lewicki. Sparse coding of natural images using an overcomplete set of limited capacity units. In Advances in Neural Information Processing Systems, pp. 377–384, 2005.
[FGD18] William Fedus, Ian Goodfellow, and Andrew M. Dai. MaskGAN: Better text generation via filling in the _______. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ByOExmWAb.
[GPMXWOCB14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
[GLZZCB16] Anirudh Goyal, Alex M. Lamb, Ying Zhang, Saizheng Zhang, Aaron C. Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems, pp. 4601–4609, 2016.
[G12] Alex Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.
[GBXLS18] Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, and Richard Socher. Non-autoregressive neural machine translation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1l8BtlCb.
[HZRS15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.
[HZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[HS97] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[HBF01] Sepp Hochreiter, Yoshua Bengio, and Paolo Frasconi. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In J. Kolen and S. Kremer (eds.), Field Guide to Dynamical Recurrent Networks. IEEE Press, 2001.
[IS15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.
[JGBM17] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431. Association for Computational Linguistics, April 2017.
[KW13] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[KSJCSW16] Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751, 2016.
[KH16] Matt J. Kusner and José Miguel Hernández-Lobato. GANs for sequences of discrete elements with the Gumbel-softmax distribution. CoRR, abs/1611.04051, 2016.
[LKMGB17] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study. arXiv preprint arXiv:1711.10337, 2017.
[MLC18] Szymon Malik, Adrian Lancucki, and Jan Chorowski. Efficient purely convolutional text encoding. In LaCATODA 2018 Workshop, IJCAI-ECAI, 2018.
[MLXLW16] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, and Zhen Wang. Multi-class generative adversarial networks with the L2 loss function. CoRR, abs/1611.04076, 2016.
[M12] Tomáš Mikolov. Statistical language models based on neural networks. PhD thesis, Brno University of Technology, 2012.
[M95] George A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
[MKKY18] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1QRgziT-.
[NH10] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814, 2010.
[PRWZ02] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics, 2002.
[P64] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[SCHTABRW16] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883, 2016.
[SLF88] Sara A. Solla, Esther Levin, and Michael Fleisher. Accelerated learning in layered neural networks. Complex Systems, 2(6):625–639, 1988.
[SMDH13] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pp. 1139–1147, 2013.
[SB98] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. 1998.
[SMSM00] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.
[OLBSVKDLCS17] Aäron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis. Parallel WaveNet: Fast high-fidelity speech synthesis. CoRR, abs/1711.10433, 2017. URL http://arxiv.org/abs/1711.10433.
[WDLSA09] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1113–1120. ACM, 2009.
[WZ89] Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.
[YZWY17] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, pp. 2852–2858, 2017.
[ZL18] Xiang Zhang and Yann LeCun. Byte-level recursive convolutional auto-encoder for text. International Conference on Learning Representations, 2018 (rejected). URL https://openreview.net/forum?id=HJZiRkZC-.
[ZGFCHSC17] Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. Adversarial feature matching for text generation. In International Conference on Machine Learning, pp. 4006–4015, 2017.
[ZSWGHC17] Yizhe Zhang, Dinghan Shen, Guoyin Wang, Zhe Gan, Ricardo Henao, and Lawrence Carin. Deconvolutional paragraph representation learning. In Advances in Neural Information Processing Systems, pp. 4169–4179, 2017.
[JML16] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
[ZKZRL18] Junbo Jake Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, and Yann LeCun. Adversarially regularized autoencoders. In ICML, volume 80 of JMLR Workshop and Conference Proceedings, pp. 5897–5906. JMLR.org, 2018.