
Backpropagation for Implicit Spectral Densities

Aditya Ramesh and Yann LeCun

Department of Computer Science, New York University
{ar2922,yann}@cs.nyu.edu

Abstract

Most successful machine intelligence systems rely on gradient-based learning, which is made possible by backpropagation. Some systems are designed to aid us in interpreting data when explicit goals cannot be provided. These unsupervised systems are commonly trained by backpropagating through a likelihood function. We introduce a tool that allows us to do this even when the likelihood is not explicitly specified, by instead using the implicit likelihood of the model. Explicitly defining the likelihood often entails making heavy-handed assumptions that impede our ability to solve challenging tasks. On the other hand, the implicit likelihood of the model is accessible without the need for such assumptions. Our tool, which we call spectral backpropagation, allows us to optimize it in much greater generality than what has been attempted before. GANs can also be viewed as a technique for optimizing implicit likelihoods. We study them using spectral backpropagation in order to demonstrate robustness for high-dimensional problems, and identify two novel properties of the generator G: (1) there exist aberrant, nonsensical outputs to which G assigns very high likelihood, and (2) the eigenvectors of the metric induced by G over latent space correspond to quasi-disentangled explanatory factors.

Introduction

Density estimation is an important component of unsupervised learning. In a typical scenario, we are given a finite sample D := { x 1 , . . . , x n } , and must make decisions based on information about the hypothetical process that generated D . Suppose that D ∼ P , where P is a distribution over some sample space X . We can model this generating process by learning a probabilistic model f : Z → X that transforms a simple distribution P Z over a latent space Z into a distribution Q over X . When Q is a good approximation to P , we can use it to generate samples and perform inference. Both operations are integral to the design of intelligent systems.

One common approach is to use gradient-based learning to maximize the likelihood of D under the model distribution Q. We typically place two important constraints on f in order to make this possible: (1) existence of an explicit inverse f^{-1} : X → Z, and (2) existence of a simple procedure by which we can evaluate Q. Oftentimes, f is instead constructed as a map from X to Z in order to make it convenient to evaluate Q(x) for a given observation x ∈ X. This is the operation on which we place the greatest demand for throughput during training. Each choice corresponds to making one of the two operations - generating samples or evaluating Q - convenient, and the other inconvenient. Regardless of the choice, both operations are required, and so both constraints are made to hold in practice. We regard f as a map from Z to X in this presentation for the sake of simplicity.

Much of the work in deep probabilistic modeling is concerned with allowing f to be flexible enough to capture intricate latent structure, while simultaneously ensuring that both conditions hold. We can dichotomize current approaches based on how the second constraint - existence of a simple procedure to evaluate Q - is satisfied. The first approach involves making f autoregressive by appropriately masking the weights of each layer. This induces a lower-triangular structure in the Jacobian, since each component of the model's output is made to depend only on the previous ones. We can then rapidly evaluate the log-determinant term involved in likelihood computation, by accumulating the diagonal elements of the Jacobian.
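This observation can be checked directly. A minimal NumPy sketch, with a randomly generated toy Jacobian standing in for the one induced by masking (not the masking scheme of any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Lower-triangular Jacobian, as induced by autoregressive masking:
# output i depends only on inputs 1..i (toy random example).
J = np.tril(rng.normal(size=(n, n)))
d = np.diag_indices(n)
J[d] = np.abs(J[d]) + 0.1                    # keep the diagonal away from zero

# log|det J| by accumulating the diagonal -- O(n) given the Jacobian...
logdet_diag = np.sum(np.log(np.abs(np.diag(J))))

# ...which matches the general O(n^3) computation.
logdet_full = np.linalg.slogdet(J)[1]
```

For a triangular matrix the determinant is the product of the diagonal entries, which is what makes likelihood evaluation cheap in this family of models.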

Research into autoregressive modeling dates back several decades, and we only note some recent developments. Germain et al. [2015] describe an autoregressive autoencoder for density estimation. Kingma et al. [2016] synthesize autoregressive modeling with normalizing flows [Rezende and Mohamed, 2015] for variational inference, and Papamakarios et al. [2017] make further improvements. van den Oord et al. [2016c] apply this idea to image generation, with follow-up work (Dinh et al. [2016], van den Oord et al. [2016b]) that exploits parallelism using masked convolutions. van den Oord et al. [2016a] do the same for audio generation, and van den Oord et al. [2017] introduce strategies to improve efficiency. We refer the reader to Jang [2018] for an excellent overview of these works in more detail.

The second approach involves choosing the layers of f to be transformations, not necessarily autoregressive, for which explicit expressions for the Jacobian are still available. We can then evaluate the log-determinant term for f by accumulating the layerwise contributions in accordance with the chain rule, using a procedure analogous to backpropagation. Rezende and Mohamed [2015] introduced this idea to variational inference, and recent work, including [Berg et al., 2018] and [Tomczak and Welling, 2016], describe new types of such transformations.

Both approaches must invariably compromise on model flexibility. An efficient method for differentiating implicit densities that do not fulfill these constraints would enrich the current toolset for probabilistic modeling. Wu et al. [2016] advocate using annealed importance sampling [Neal, 2001] for evaluating implicit densities, but it is not clear how this approach could be used to obtain gradients. Very recent work [Li and Turner, 2017] uses Stein's identity to cast gradient computation for implicit densities as a sparse recovery problem. Our approach, which we call spectral backpropagation , harnesses the capabilities of modern automatic differentiation (Abadi et al. [2016], Paszke et al. [2017]) by directly backpropagating through an approximation for the spectral density of f .

We make the first steps toward demonstrating the viability of this approach by minimizing d_KL(Q, P_X) and d_KL(P_X, Q), where Q is the implicit density of a non-invertible Wide ResNet [Zagoruyko and Komodakis, 2016] f, on a set of test problems. Having done so, we then turn our attention to characterizing the behavior of the generator G in GANs [Goodfellow et al., 2014], using a series of computational studies made possible by spectral backpropagation. Our purpose in conducting these studies is twofold. Firstly, we show that our approach is suitable for application to high-dimensional problems. Secondly, we identify two novel properties of generators:

· The existence of adversarial perturbations for classification models [Szegedy et al., 2013] is paralleled by the existence of aberrant, nonsensical outputs to which G assigns very high likelihood.

· The eigenvectors of the metric induced by G over latent space correspond to meaningful, quasi-disentangled explanatory factors. Perturbing latent variables along these eigenvectors allows us to quantify the extent to which G makes use of latent space.

We hope that these observations will contribute to an improved understanding of how well generators are able to capture the latent structure of the underlying data-generating process.

Background

Generalizing the Change of Variable Theorem

We begin by revisiting the geometric intuition behind the usual change of variable theorem. First, we consider a rectangle in R^2 with vertices x_0, x_1, x_3, x_2 given in clockwise order, starting from the bottom-left vertex. To determine its area, we compute its side vectors v_1 := x_1 - x_0, v_2 := x_2 - x_0 and write V_2 = ‖v_1‖ ‖v_2‖. Now suppose we are given a parallelepiped in R^3 whose sides are described by the vectors v_1 := x_1 - x_0, v_2 := x_2 - x_0, and v_3 := x_3 - x_0. Its volume is given by the triple product V_3 = 〈v_1 × v_2, v_3〉, where × and 〈·, ·〉 denote the cross product and inner product, respectively. This triple product can be rewritten as a determinant, V_3 = det(v_1 v_2 v_3), which we can generalize to compute the volume of a parallelepiped in R^N: V_N = det(v_1 · · · v_N).

If we regard the vertices x 0 , . . . , x N as observations in X , the change of variable theorem can be understood as the differential analog of this formula. To wit, we suppose that f : R N → R N is a diffeomorphism, and denote by J f the Jacobian of its output with respect to its input. Now, V N becomes the infinitesimal volume element determined by J f . For an observation x ∈ X , the change of variable theorem says that we can compute

where we set z := f -1 ( x ) .

An n-dimensional parallelepiped in R^N requires n vectors to specify its sides. When n < N, its volume is given by the more general formula [Hanson, 1994] V_n = det(V^t V)^{1/2}, where V := (v_1 · · · v_n) is the matrix whose columns are the side vectors.
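The Gram-determinant formula det(VᵀV)^{1/2} can be checked numerically. A small NumPy sketch (the parallelogram and its side vectors are made up for illustration):

```python
import numpy as np

# Side vectors of a 2-dimensional parallelepiped (a parallelogram) in R^3.
v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([0.0, 2.0, 1.0])
V = np.stack([v1, v2], axis=1)               # 3 x 2 matrix of side vectors

# General n < N formula: volume = det(V^t V)^{1/2}.
vol_gram = np.sqrt(np.linalg.det(V.T @ V))

# For n = 2, N = 3 this coincides with the cross-product area.
vol_cross = np.linalg.norm(np.cross(v1, v2))
# both equal 3.0
```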

The corresponding analog of the change of variable theorem is known in the context of geometric measure theory as the smooth coarea formula [Krantz and Parks, 2008]. When f is a diffeomorphism between manifolds, it says that

where we set z := f -1 ( x ) as before, and define M f := ( J f ) t J f to be the metric induced by f over the latent manifold Z . In many cases of interest, such as in GANs, the function f is not necessarily injective. Application of the coarea formula would then require us to evaluate an inner integral over { z ∈ Z : f ( z ) = x } , rather than over the singleton { f -1 ( x ) } . We ignore this technicality and apply Equation 1 anyway.

The change of variable theorem gives us access to the implicit density Q in the form of the spectral density of M_f. Indeed, the Lie identity ln det = tr ln allows us to express the log-likelihood corresponding to Equation 1 as

We focus on the factor involving M_f on the RHS, which can be written as (1/2) tr ln M_f(z) = (1/2) Σ_{λ ∈ Sp(M_f(z))} ln λ = (N/2) E_{λ ∼ P_λ}[ln λ],

where Sp denotes the spectrum, and P λ the delta distribution over the eigenvalues in the spectrum. We let θ denote the parameters of f , and assume that P Z is independent of θ . Now, differentiating Equation 2 with respect to θ gives

Equation 3 allows us to formulate gradient computation for implicit densities as a variant of stochastic backpropagation (Kingma and Welling [2013], Rezende et al. [2014]), in which the base distribution for the expectation is the spectral density of M f rather than a normal distribution.
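To make the correspondence between the log-determinant and the spectrum concrete, here is a minimal NumPy sketch with a toy smooth map f(z) = tanh(Wz), a hypothetical stand-in for a learned model: ln det M_f equals the sum of the log-eigenvalues of M_f.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
W = rng.normal(size=(n, n))
z = rng.normal(size=n)

# Toy smooth map f(z) = tanh(Wz) and its Jacobian at z.
s = 1.0 - np.tanh(W @ z) ** 2
J = s[:, None] * W                           # J_f = diag(tanh'(Wz)) W
M = J.T @ J                                  # metric M_f = (J_f)^t J_f

# ln det M_f = tr ln M_f = sum of log-eigenvalues over Sp(M_f).
lam = np.linalg.eigvalsh(M)
logdet_spec = np.sum(np.log(lam))
logdet_ref = np.linalg.slogdet(M)[1]
```

Spectral backpropagation replaces the exact eigendecomposition used here, which is infeasible for large implicit matrices, with a stochastic estimate built from matrix-vector products.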

An Estimator for Spectral Backpropagation

To obtain an estimator for Equation 3, we turn to the thriving literature on stochastic approximation of spectral sums. These methods estimate quantities of the form Σ_S(A) := tr S(A), where A is a large or implicitly-defined matrix, by accessing A using only matrix-vector products. In our case, S = ln, and the products involving A = M_f can be evaluated rapidly using automatic differentiation. We make no attempt to conduct a comprehensive survey, but note that among the most promising recent approaches are those described by Han et al. [2017], Boutsidis et al. [2017], Ubaru et al. [2017], and Fitzsimons et al. [2017].

Algorithm 1 Procedure to estimate ln det using the Chebyshev approximation.

Require: A ∈ R^{n×n} is the implicit matrix; m the desired order; p the number of probe vectors for the trace estimator; t the number of power iterations; g ≥ 1 a multiplier for the estimate returned by the power method; and ε the stipulated lower bound on Sp(A).

procedure STOCHASTICLOGDET(A, m, p, t, g, ε)
    λ̂_max ← POWERMETHOD(A, t)
    µ, ν ← ε, g λ̂_max
    a, b ← µ/(µ + ν), ν/(µ + ν)
    Define ϕ and ϕ⁻¹ using Equation 6.
    {c_i}_{i ∈ [0, m]} ← CHEBYSHEVCOEFFICIENTS(ln ∘ ϕ)
    Ā ← A/(µ + ν)
    Γ ← STOCHASTICCHEBYSHEVTRACE(ϕ⁻¹(Ā), {c_i}, p)
    return n ln(µ + ν) + Γ

procedure STOCHASTICCHEBYSHEVTRACE(A, {c_i}_{i ∈ [0, m]}, p)
    r ← 0
    for j ∈ [1, p] do
        v ← RANDOMRADEMACHER(n)
        w_0, w_1 ← v, Av
        s ← c_0 w_0 + c_1 w_1
        for i ∈ [2, m] do
            w_i ← 2Aw_{i-1} - w_{i-2}
            s ← s + c_i w_i
        r ← r + 〈v, s〉
    return r/p

We briefly describe the approaches of Han et al. [2017] and Boutsidis et al. [2017], which work on the basis of polynomial interpolation. Given a function S̄ : [-1, 1] → R, these methods construct an order-m approximating polynomial p̄_m to S̄, given by

where c i ∈ R and T i : [ -1 , 1] → R . The main difference between the two approaches is the choice of approximating polynomial. Boutsidis et al. [2017] use Taylor polynomials, for which

where we use superscript ( i ) to denote iterated differentiation. On the other hand, Han et al. [2017] use Chebyshev polynomials. These are defined by the recurrence relation

The coefficients {c_i} for the Chebyshev approximation are obtained by evaluating S̄ at the Chebyshev nodes x_j := cos(π(j + 1/2)/(m + 1)), and are defined by

Now suppose that we are given a matrix ¯ A ∈ R n × n such that Sp( ¯ A ) ⊂ [ -1 , 1] . After having made a choice for the construction of ¯ p m , we can use the approximation

This reduces the problem of estimating the spectral sum Σ ¯ S ( ¯ A ) to computing the traces tr T i ( ¯ A ) for all i ∈ [0 , m ] .
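The quality of the polynomial approximation is easy to verify numerically. A small NumPy check (the target S̄(x) = eˣ is an arbitrary smooth stand-in): interpolating at the Chebyshev nodes yields coefficients c_i for which p̄_m closely matches S̄ on [-1, 1].

```python
import numpy as np

m = 12
j = np.arange(m + 1)
x_nodes = np.cos(np.pi * (j + 0.5) / (m + 1))               # Chebyshev nodes x_j
T = np.cos(np.outer(np.arange(m + 1), np.arccos(x_nodes)))  # T_i(x_j)

s_bar = np.exp                               # arbitrary smooth target on [-1, 1]
c = (2.0 / (m + 1)) * (T @ s_bar(x_nodes))
c[0] /= 2.0                                  # the i = 0 coefficient uses 1/(m + 1)

# Evaluate p_m(x) = sum_i c_i T_i(x) on a grid and compare with s_bar.
x = np.linspace(-1.0, 1.0, 1001)
p_m = sum(c[i] * np.cos(i * np.arccos(x)) for i in range(m + 1))
max_err = np.max(np.abs(p_m - s_bar(x)))
```

For smooth functions the error decays rapidly with m, which is what makes low-order approximations usable inside the trace estimator.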

Two issues remain in applying this approximation. The first is that both dom( ¯ S ) and Sp( ¯ A ) are restricted to [ -1 , 1] . In our case, ln : (0 , ∞ ) → R , and M f ( z ) can be an arbitrary positive definite matrix. To address this issue, we define ϕ : [ -1 , 1] → [ a, b ] , where

Now we set ¯ S := S ◦ ϕ , so that

We stress that while p m is defined using ϕ -1 , the coefficients c i are computed using ¯ S := S ◦ ϕ . With these definitions in hand, we can write

Han et al. [2017] require spectral bounds µ and ν , so that Sp( A ) ⊂ [ µ, ν ] , and set

After using Equation 7 with a Chebyshev approximation for S = ln to obtain Γ ≈ ln det( ¯ A ) , we compute

Boutsidis et al. [2017] instead define B := A/ν , and write

We can easily obtain an accurate upper bound ν using the power method. The lower bound µ is fixed to a small, predetermined constant in our work.
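A minimal NumPy sketch of this bound (the test matrix and iteration count are arbitrary): power iterations followed by a Rayleigh quotient give λ̂_max ≤ λ_max, and the multiplier g turns it into an upper bound ν.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
G = rng.normal(size=(n, n))
A = G @ G.T                                  # SPD test matrix (arbitrary)

v = rng.normal(size=n)
for _ in range(100):                         # power iterations
    v = A @ v
    v /= np.linalg.norm(v)
lam_hat = v @ (A @ v)                        # Rayleigh quotient <= lambda_max

nu = 1.2 * lam_hat                           # multiplier g = 1.2 gives the bound
lam_max = np.linalg.eigvalsh(A).max()
```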

The second issue is that deterministically evaluating the terms tr T i ( ¯ A ) in Equation 5 requires us to compute matrix powers of ¯ A . Thankfully, we can drastically reduce the computational cost and approximate these terms using only matrix-vector products. This is made possible by the stochastic trace estimator introduced by Hutchinson [1990]:

When the distribution P V for the probe vectors v has expectation zero, the estimate is unbiased. We use the Rademacher distribution, which samples the components of v uniformly from {-1 , 1 } . We refer the reader to Avron and Toledo [2011] for a detailed study on the variance of this estimator.
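A minimal NumPy sketch of the Hutchinson estimator with Rademacher probes (the matrix and probe count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4000

G = rng.normal(size=(n, n))
A = G @ G.T / n                              # arbitrary test matrix

# tr(A) ~ (1/p) sum_j <v_j, A v_j> with Rademacher probes v_j (unbiased).
V = rng.choice([-1.0, 1.0], size=(p, n))
quad = np.einsum('pi,ij,pj->p', V, A, V)     # v_j^t A v_j for each probe
est = quad.mean()
```

Averaging the quadratic forms over many probes concentrates around the true trace, with variance governed by the off-diagonal mass of A.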

Figure 1: Results for minimizing d_KL(Q, P_X) for the four test energies described in Rezende and Mohamed [2015]. Subfigures (b) and (c) show the model samples superimposed over contour plots of the corresponding ground-truth test energies. Each epoch corresponds to 5000 iterations. We see in (a) that the relative error for the approximation to the log determinant typically stays below 30%, except toward the end of training for the last two test energies. At this point, samples from these two models begin to drift away from the origin, as shown in (c).

We first describe how the trace estimator is applied when a Taylor approximation is used to construct p̄_m. In this case, we have

The inner summands Ā^i v_j are evaluated using the recursion w_0 := v_j and w_i := Ā w_{i-1} for i ≥ 1. It follows that the number of matrix-vector products involved in the approximation increases linearly with the order m of the approximating polynomial p̄_m. The same idea allows us to accumulate the traces for the Chebyshev approximation, based on Equation 4. The resulting procedure is given in Algorithm 1; it is our computational workhorse for evaluating the log-likelihood in Equation 2 and estimating the gradient in Equation 3.

Learning Implicit Densities

Suppose that we are tasked with matching a given data distribution P_X with the implicit density Q of the model f. Two approaches for learning f are available, and the choice of which to use depends on the type of access we have to the data distribution P_X. The first approach - minimizing d_KL(Q, P_X) - is applicable when we know how to evaluate the likelihood of P_X, but are not necessarily able to sample from it. The second approach - minimizing d_KL(P_X, Q) - is applicable when we are able to sample from P_X, but are not necessarily able to evaluate its likelihood. We show that spectral backpropagation can be used in both cases, when neither of the two conditions described in Section 1 holds.

All of the examples considered here are densities over R 2 . We match them by transforming a prior P Z given by a spherical normal distribution. Our choice for the architecture of f is a Wide ResNet comprised of four residual blocks. Each block is a three-layer bottleneck from R 2 to R 2 whose hidden layer size is 32. All layers are equipped with biases, and use LeakyReLU activations. We compute

(a) Plots of the loss and relative error for minimizing d_KL(P_X, Q).

Figure 2: Results for minimizing d_KL(P_X, Q) for the crescent and circular mixture densities whose definitions are given in the supplementary material. The relative error of the approximation to the log determinant typically stays below 5%. In (b), we show samples from P_X superimposed over contour plots of the log-likelihood of f.

the gradient updates using a batch size of 64, and apply the updates using Adam [Kingma and Ba, 2014] with a step size of 1 × 10^-4, and all other parameters kept at their default values.

To compute the gradient update given by Equation 3, we use Algorithm 1 with (m, p, t, g) := (10, 20, 20, 1.2) for all experiments. For minimizing d_KL(Q, P_X), we use ε := 0.1, and for minimizing d_KL(P_X, Q), ε := 1 × 10^-2. In order to monitor the accuracy of the approximation to the likelihood, we compute M_f at each iteration, and evaluate the ground-truth likelihood in accordance with Equation 2. We define the relative error of the approximation ln ℓ̂ with respect to the ground-truth log-likelihood ln ℓ by

provided that the quotient is not too large. This definition of relative error avoids numerical problems when ℓ ≈ 0.

We begin by considering the first approach, in which we seek to minimize d_KL(Q, P_X). This objective requires that we be able to sample from Q, so we choose f : Z → X to be a map from latent space to observation space. The results are shown in Figure 1. To prevent f from collapsing Q as it models the infinite support of P_X, we found it helpful to incorporate the regularizer

into the objectives for the third and fourth test energies. Here, ‖·‖_2 denotes the spectral norm. To implement this regularizer, we simply backpropagate through the estimate of λ_max that is already produced by Algorithm 1. We use ρ := 8 × 10^-2 in both cases. Despite the use of this regularizer, we find that continuing to train the models for these last two test energies causes the samples to drift away from the origin (see Figure 1(c)). We have not made any attempt to address this behavior. Finally, we note that since d_KL(Q, P_X) is bounded from below by the negative log-normalization constant of Q, it can become negative when Q is unnormalized. We see that this happens for all four examples.

In the second approach, we seek to minimize d_KL(P_X, Q). This objective requires that we be able to evaluate the likelihood of Q, so we choose f : X → Z to be a map from observation space to latent space. The results are shown in Figure 2. We note that minimizing d_KL(P_X, Q) is ill-posed when Q is unnormalized. In this scenario, the model distribution Q can match P_X while also assigning mass outside the support of P_X. We see that this expected behavior manifests in both examples.

Evaluating GAN Likelihoods

For our explorations involving GANs, we train a series of DCGAN [Radford et al., 2015] models on 64 × 64 rescaled versions of the CelebA [Liu et al., 2015] and LSUN Bedroom datasets. We vary model capacity in terms of the base feature map count multiplier n f for the DCGAN architecture. The

(a) Samples from the CelebA model with n z = 128 , n f = 64 (left) and the LSUN Bedroom model with n z = 256 , n f = 64 (right).

Figure 3: Samples from two models evaluated at fixed grids in latent space (left), and contour plots of the model log-likelihood evaluated over the same grids (right).

(a) Trajectory at iterations 0, 100, 500, 1000, 2000, and 3000 (3500 total).


Figure 4: Trajectories of four latent variables as they are perturbed to optimize likelihood under the generator. Trajectories (a) and (b) correspond to the CelebA model with n z = 128 , n f = 64 , and trajectories (c) and (d) to the LSUN Bedroom model with n z = 256 , n f = 64 . Statistics from these trajectories are tabulated above. The last column of the table specifies the parameters used for Algorithm 1 to compute the gradient estimates.

generator and discriminator have five layers each, and use translated LeakyReLU activations [Xiang and Li, 2017]. To stabilize training, we use weight normalization with fixed scale factors in the discriminator [Salimans et al., 2016]. Our prior is defined by P_Z := unif([-1, 1])^{n_z}, where n_z is the size of the embedding space. All models were trained for 750 000 iterations with a batch size of 32, using RMSProp with step size 1 × 10^-4 and decay factor 0.9. We present results from two of these models in Figure 3.

We apply spectral backpropagation to explore the effect of perturbing a given latent variable z ∈ Z to maximize likelihood under the generator distribution Q . This is readily accomplished by noting that the same procedure to evaluate Equation 3 can also be used to obtain gradients with respect to z . The results are shown in Figure 4. Intuitively, we might expect the outputs to be transformed in such a way that they gravitate towards modes of the dataset. But this is not what happens. Instead, the outputs are transformed into highly aberrant, out-of-distribution examples while nonetheless attaining very high likelihood. As optimization proceeds, M f also becomes increasingly ill-conditioned. This shows that likelihood for generators need not correspond to intuitive notions of visual plausibility.
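The qualitative effect can be reproduced in miniature. A hedged NumPy sketch with a toy generator G(z) = tanh(Wz) standing in for a trained DCGAN (for this G the log-determinant has a closed form, so no stochastic estimator is needed): ascending the implicit log-likelihood with respect to z drives the map toward saturation, where outputs degenerate while the likelihood keeps growing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = rng.normal(size=(n, n))                  # toy stand-in for a trained generator

def log_likelihood(z):
    # For G(z) = tanh(Wz): J_G = diag(1 - tanh^2(Wz)) W, so
    # ln det M_G(z) = 2 sum_i ln(1 - tanh^2((Wz)_i)) + ln det(W^t W).
    s = 1.0 - np.tanh(W @ z) ** 2
    return -0.5 * (2.0 * np.sum(np.log(s)) + np.linalg.slogdet(W.T @ W)[1])

def grad(fn, z, eps=1e-5):                   # forward-difference gradient in z
    fz = fn(z)
    return np.array([(fn(z + eps * e) - fz) / eps for e in np.eye(z.size)])

z0 = rng.normal(size=n)
z = z0.copy()
for _ in range(10):                          # ascend the implicit log-likelihood
    z = z + 0.01 * grad(log_likelihood, z)

# The likelihood climbs, yet G(z) saturates toward degenerate outputs.
```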


(c) CelebA ( n z = 32 , n f = 64 ), trials 10 and 11. Step size: 0.40.


Figure 5: Result of perturbing latent variables along the eigenvectors of M_f. In (a) and (b), we show the top eigenvalues of Sp(M_f) evaluated at 12 trial latent variables. Small eigenvalues are not shown. The leftmost images of each grid are duplicates of the original, each corresponding to one of the 12 latent variables. The top row shows the effect of applying perturbations along random directions, and the bottom row the result of applying perturbations with the same step size along the eigenvectors.

Uncovering Latent Explanatory Factors

The generator in GANs is well-known for organizing latent space such that semantic features can be transferred by means of algebraic operations over latent variables [Radford et al., 2015]. This suggests the existence of a systematic organization of latent space, but perhaps one that cannot be globally characterized in terms of a handful of simple explanatory factors. We instead explore whether local changes in latent space can be characterized in this way. Since the metric M f describes local change in the generator's output, it is natural to consider the effect of perturbations along its eigenvectors. To this end, we fix 12 trial embeddings in latent space, and compare the effect of perturbations along

(e) Effect of varying n_z and n_f on v_eff(τ = 1).

Figure 6: Effect of varying n_z and n_f on v_eff(τ). Plots (a)-(d) show v_eff as a function of τ. Larger values of v_eff suggest increased utilization of latent space. Table (e) shows that doubling n_z roughly doubles v_eff. On the other hand, doubling n_f does not result in noticeable change for CelebA, and results in only a modest increase in v_eff for LSUN Bedroom. The step sizes used for the perturbations are the same as those reported in Figure 5.

random directions to perturbations along these eigenvectors. The random directions are obtained by sampling from a spherical normal distribution. We show the results in Figure 5.

We can see that perturbations along the eigenvectors associated with the dominant eigenvalues, especially the principal eigenvalue, often result in the most drastic changes. Furthermore, these changes are not only semantically meaningful, but also tend to modify distinct attributes of the image. To see this more clearly, we consider the top two rows of Figure 5(g). Movement along the first two eigenvectors changes hair length and facial orientation; movement along the third eigenvector decreases the length of the bangs; movement along the fourth and fifth eigenvectors changes background color; and movement along the sixth and seventh eigenvectors changes hair color.

Inspecting the two columns (c), (e), (g), and (d), (f), (h) in Figure 5 suggests that larger values of n z may encourage the generator to capture more explanatory factors, possibly at the price of decreased sample quality. We would like to explore the effect of varying n z and n f on the number of such factors. To do this, we fix a sample of latent variables S := { z j } j ∈ [1 ,M ] ∼ P Z . For each z j ∈ S , we define

for every eigenvector v_1^{(j)}, . . . , v_{n_z}^{(j)} of M_f(z_j). The quantity δ(j, i) measures the pixelwise change resulting from a perturbation along an eigenvector, relative to the change we expect from a random perturbation. Finally, we define

where 1{·} is the indicator function. This quantity measures the average number of eigenvectors for which the relative change is greater than the threshold τ. As such, it can be regarded as an effective measure of dimensionality for latent space. We explore the effect of varying n_z and n_f on v_eff in Figure 6.
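A sketch of how v_eff might be computed (hypothetical toy generator; the paper does not give the exact form of δ(j, i), so normalizing by the mean change under random unit directions is our assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n_z, n_x = 8, 32
W = rng.normal(size=(n_x, n_z))
G = lambda z: np.tanh(W @ z)                 # hypothetical toy generator

def jacobian(z, eps=1e-6):
    Gz = G(z)
    return np.stack([(G(z + eps * e) - Gz) / eps for e in np.eye(n_z)], axis=1)

def v_eff(zs, tau=1.0, step=0.5, n_random=64):
    """Average number of eigen-directions whose pixelwise change exceeds
    tau times the mean change from random unit directions."""
    count = 0.0
    for z in zs:
        J = jacobian(z)
        vals, vecs = np.linalg.eigh(J.T @ J)          # spectrum of M_G(z)
        # Baseline: expected change under random directions of the same step.
        R = rng.normal(size=(n_random, n_z))
        R /= np.linalg.norm(R, axis=1, keepdims=True)
        base = np.mean([np.linalg.norm(G(z + step * r) - G(z)) for r in R])
        delta = [np.linalg.norm(G(z + step * vecs[:, i]) - G(z)) / base
                 for i in range(n_z)]
        count += sum(d > tau for d in delta)
    return count / len(zs)

zs = rng.normal(size=(4, n_z))
dim_eff = v_eff(zs, tau=1.0)
```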

Conclusion

Current approaches for probabilistic modeling attempt to satisfy two goals that are fundamentally at odds with one another: fulfillment of the two constraints described in Section 1, and model flexibility. In this work, we develop a computational tool that aims to expand the scope of probabilistic modeling to functions that do not satisfy these constraints. We make the first steps toward demonstrating feasibility of this approach by minimizing divergences in far greater generality than what has been attempted before. Finally, we uncover surprising facts about the organization of latent space for GANs that we hope will contribute to an improved understanding of how effectively they capture underlying latent structure.

$$ Q(x) = P_Z(f^{-1}(x)) \det(J_f(f^{-1}(x))^t J_f(f^{-1}(x)))^{-1/2} =: P_Z(z) \det(M_f(z))^{-1/2} \tag{1} $$

$$ \ln Q(x) = \ln P_Z(z) - \frac{1}{2} \ln \det M_f(z) = \ln P_Z(z) - \frac{1}{2} \operatorname{tr} \ln M_f(z) \tag{2} $$

$$ D_\theta \ln Q(x) = -\frac{N}{2}\, D_\theta\, \mathbb{E}_{\lambda \sim P_\lambda} \ln \lambda \tag{3} $$

$$ T_0 = 1, \quad T_1 : x \mapsto x, \quad\text{and}\quad T_i : x \mapsto 2x\,T_{i-1}(x) - T_{i-2}(x), \quad i \geq 2 \tag{4} $$

$$ \Sigma_{\bar{S}}(\bar{A}) = \operatorname{tr} \bar{S}(\bar{A}) = \sum_{\lambda \in \operatorname{Sp}(\bar{A})} \bar{S}(\lambda) \approx \sum_{\lambda \in \operatorname{Sp}(\bar{A})} \bar{p}_m(\lambda) = \sum_{\lambda \in \operatorname{Sp}(\bar{A})} \sum_{i \in [0, m]} c_i T_i(\lambda) = \sum_{i \in [0, m]} c_i \sum_{\lambda \in \operatorname{Sp}(\bar{A})} T_i(\lambda) = \sum_{i \in [0, m]} c_i \operatorname{tr} T_i(\bar{A}) \tag{5} $$

$$ \varphi : x \mapsto \frac{(b - a)\,x + (a + b)}{2} \tag{6} $$

$$ \Sigma_S(\bar{A}) = \Sigma_{\bar{S} \circ \varphi^{-1}}(\bar{A}) \approx \operatorname{tr} p_m(\bar{A}) \tag{7} $$

$$ R(M_f) := \rho\, \lVert M_f \rVert_2 \tag{8} $$

$$ V_3 = \det\begin{pmatrix} v_1 & v_2 & v_3 \end{pmatrix} $$

$$ \bar{S} \approx \bar{p}_m := \sum_{i \in [0, m]} c_i T_i $$

$$ c_i := \frac{\bar{S}^{(i)}(0)}{i!} \quad\text{and}\quad T_i : x \mapsto x^i $$

$$ c_i := \begin{cases} \dfrac{1}{m + 1} \sum_{j \in [0, m]} \bar{S}(x_j)\, T_0(x_j), & i = 0, \\ \dfrac{2}{m + 1} \sum_{j \in [0, m]} \bar{S}(x_j)\, T_i(x_j), & i \geq 1 \end{cases} $$

$$ a := \frac{\mu}{\mu + \nu}, \quad b := \frac{\nu}{\mu + \nu}, \quad\text{and}\quad \bar{A} := \frac{A}{\mu + \nu} $$

$$ \ln \det(B) = \operatorname{tr} \ln(B) = \operatorname{tr} \ln(I - (I - B)) $$

$$ \operatorname{tr} \bar{p}_m(\bar{A}) = \sum_{i \in [0, m]} c_i \operatorname{tr} \bar{A}^i \approx \frac{1}{p} \sum_{i \in [0, m]} c_i \sum_{j \in [1, p]} \langle v_j, \bar{A}^i v_j \rangle = \frac{1}{p} \sum_{j \in [1, p]} \Big\langle v_j, \sum_{i \in [0, m]} c_i\, \bar{A}^i v_j \Big\rangle $$

$$ \ln \hat{\ell} - \ln \ell = \ln \frac{\hat{\ell}}{\ell} = \ln\Big(1 + \frac{\hat{\ell} - \ell}{\ell}\Big) \approx \frac{\hat{\ell} - \ell}{\ell} $$

Statistics for the trajectories in Figure 4:

Trial | log(p_final / p_init) | Initial (λ_min, λ_max) | Final (λ_min, λ_max) | (m, p, t, g, ε)
(a) | 161.40 | (2.31e-1, 2.91e3) | (8.11e-3, 1.06e3) | (5, 20, 20, 1.1, 1e-4)
(b) | 667.43 | (2.71e-1, 2.23e3) | (2.51e-6, 2.57e-1) | (5, 20, 20, 1.1, 1e-5)
(c) | 90.83 | (1.87e-2, 2.00e3) | (8.58e-3, 5.62e2) | (5, 10, 10, 1.2, 1e-2)
(d) | 336.07 | (2.29e-2, 1.24e3) | (6.12e-4, 4.04e2) | (5, 10, 10, 1.2, 1e-2)


\section{Uncovering Latent Explanatory Factors}

\begin{figure}[t] \centering
\subfloat[][Log spectra for CelebA models with~$n_f = 64$ and~$n_z$ varied.]{%
\includegraphics[width=\textwidth]{figures/latent_factors/celeb_a/spectra_nf_fixed_nz_varied.png}
} \\
\subfloat[][Log spectra for LSUN Bedroom models with~$n_f = 256$ and~$n_z$ varied.]{%
\includegraphics[width=\textwidth]{figures/latent_factors/bedroom/spectra_nf_fixed_nz_varied.png}
} \\
\begin{tabular}[t]{c@{}c}%
\captionsetup{width=.45\linewidth,justification=centering}%
\subfloat[][CelebA ($n_z = 32, n_f = 64$), trials 10 and 11. Step size: 0.40.]{%
{\def\arraystretch{0}%
\begin{tabular}[b]{c}%
\includegraphics[width=0.47\textwidth]{figures/latent_factors/celeb_a/nf_64_nz_32_perturbed/trial_10.png} \\
\includegraphics[width=0.47\textwidth]{figures/latent_factors/celeb_a/nf_64_nz_32_perturbed/trial_11.png}
\end{tabular}}%
} &
\captionsetup{width=.45\linewidth,justification=centering}%
\subfloat[][LSUN Bedroom ($n_z = 128, n_f = 64$), trials 5 and 12. Step size: 0.80.]{%
{\def\arraystretch{0}%
\begin{tabular}[b]{c}%
\includegraphics[width=0.47\textwidth]{figures/latent_factors/bedroom/nf_64_nz_128_perturbed/trial_5.png} \\
\includegraphics[width=0.47\textwidth]{figures/latent_factors/bedroom/nf_64_nz_128_perturbed/trial_12.png}
\end{tabular}}%
} \\
\captionsetup{width=.45\linewidth,justification=centering}%
\subfloat[][CelebA ($n_z = 64, n_f = 64$), trials 6 and 12. Step size: 0.65.]{%
{\def\arraystretch{0}%
\begin{tabular}[b]{c}%
\includegraphics[width=0.47\textwidth]{figures/latent_factors/celeb_a/nf_64_nz_64_perturbed/trial_6.png} \\
\includegraphics[width=0.47\textwidth]{figures/latent_factors/celeb_a/nf_64_nz_64_perturbed/trial_12.png}
\end{tabular}}%
} &
\captionsetup{width=.45\linewidth,justification=centering}%
\subfloat[][LSUN Bedroom ($n_z = 256, n_f = 64$), trials 6 and 8. Step size: 0.80.]{%
{\def\arraystretch{0}%
\begin{tabular}[b]{c}%
\includegraphics[width=0.47\textwidth]{figures/latent_factors/bedroom/nf_64_nz_256_perturbed/trial_6.png} \\
\includegraphics[width=0.47\textwidth]{figures/latent_factors/bedroom/nf_64_nz_256_perturbed/trial_8.png} %\\
\end{tabular}}%
} \\
\captionsetup{width=.45\linewidth,justification=centering}%
\subfloat[][CelebA trials 7 and 9 ($n_z = 128, n_f = 64$). Step size: 0.80.]{%
{\def\arraystretch{0}%
\begin{tabular}[b]{c}%
\includegraphics[width=0.47\textwidth]{figures/latent_factors/celeb_a/nf_64_nz_128_perturbed/trial_7.png} \\
\includegraphics[width=0.47\textwidth]{figures/latent_factors/celeb_a/nf_64_nz_128_perturbed/trial_9.png}
\end{tabular}}%
} &
\captionsetup{width=.45\linewidth,justification=centering}%
\subfloat[][LSUN Bedroom trials 8 and 10 ($n_z = 512, n_f = 64$). Step size: 0.80.]{%
{\def\arraystretch{0}%
\begin{tabular}[b]{c}%
\includegraphics[width=0.47\textwidth]{figures/latent_factors/bedroom/nf_64_nz_512_perturbed/trial_8.png} \\
\includegraphics[width=0.47\textwidth]{figures/latent_factors/bedroom/nf_64_nz_512_perturbed/trial_10.png}
\end{tabular}}%
}%
\end{tabular}
\caption{Result of perturbing latent variables along the eigenvectors of~$M_f$. In~(a) and~(b), we show the top eigenvalues in $\Sp(M_f)$ evaluated at 12 trial latent variables. Small eigenvalues are not shown. The leftmost images of each grid are duplicates of the original, each corresponding to one of the 12 latent variables. The top row shows the effect of applying perturbations along random directions, and the bottom row the result of applying perturbations with the same step size along the eigenvectors.\label{fig:latent_perturbations}}
\end{figure}

The generator in GANs is well-known for organizing latent space such that semantic features can be transferred by means of algebraic operations over latent variables~\citep{DCGAN}. This suggests the existence of a systematic organization of latent space, but perhaps one that cannot be globally characterized in terms of a handful of simple explanatory factors. We instead explore whether \emph{local} changes in latent space can be characterized in this way. Since the metric~$M_f$ describes local change in the generator's output, it is natural to consider the effect of perturbations along its eigenvectors. To this end, we fix 12 trial embeddings in latent space, and compare the effect of perturbations along random directions to perturbations along these eigenvectors. The random directions are obtained by sampling from a spherical normal distribution. We show the results in Figure~\ref{fig:latent_perturbations}.

We can see that perturbations along the eigenvectors associated with dominant eigenvalues, especially the principal eigenvalue, often result in the most drastic changes. Furthermore, these changes are not only semantically meaningful, but also tend to modify distinct attributes of the image. To see this more clearly, we consider the top two rows of Figure~\ref{fig:latent_perturbations}(g). Movement along the first two eigenvectors changes hair length and facial orientation; movement along the third eigenvector decreases the length of the bangs; movement along the fourth and fifth eigenvectors changes background color; and movement along the sixth and seventh eigenvectors changes hair color.

\begin{figure} \centering% {\captionsetup{width=.23\linewidth,justification=centering}% \subfloat[][CelebA \ ($n_f = 64$, $n_z$ varied).]{% \includegraphics[width=0.25\textwidth]{figures/latent_factors/celeb_a/latent_factors_nf_fixed_nz_varied.png} }}% {\captionsetup{width=.23\linewidth,justification=centering}% \subfloat[][LSUN Bedroom \ ($n_f = 64$, $n_z$ varied).]{% \includegraphics[width=0.25\textwidth]{figures/latent_factors/bedroom/latent_factors_nf_fixed_nz_varied.png} }}% {\captionsetup{width=.23\linewidth,justification=centering}% \subfloat[][CelebA \ ($n_z = 128$, $n_f$ varied).]{% \includegraphics[width=0.25\textwidth]{figures/latent_factors/celeb_a/latent_factors_nz_fixed_nf_varied.png} }}% {\captionsetup{width=.23\linewidth,justification=centering}% \subfloat[][LSUN Bedroom \ ($n_z = 256$, $n_f$ varied).]{% \includegraphics[width=0.25\textwidth]{figures/latent_factors/bedroom/latent_factors_nz_fixed_nf_varied.png} }}

\subfloat[][Effect of varying~$n_z$ and~$n_f$ on~$\effFac(\tau = 1)$.]{\tiny%
\begin{tabular}{c NNN NNN} \toprule
\multirow{2}{*}{Dataset} &%
\multicolumn{3}{c}{$\effFac(\tau = 1)$ for ${n_f}$ fixed, ${n_z}$ varied} &%
\multicolumn{3}{c}{$\effFac(\tau = 1)$ for ${n_z}$ fixed, ${n_f}$ varied} \\
\cmidrule(lr){2-4} \cmidrule(lr){5-7}
&% ${(n_z / 2, n_f)}$ & ${(n_z, n_f)}$ & ${(2 n_z, n_f)}$ & ${(n_z, n_f / 2)}$ & ${(n_z, n_f)}$ & ${(n_z, 2 n_f)}$ \\ \midrule%
CelebA ($n_z = 64, n_f = 64$) &%
6.94921875 & 10.4921875 & 16.55078125 & 16.140625 & 16.55078125 & 17.28515625 \\
LSUN Bedroom ($n_z = 64, n_f = 128$) &%
25.2265625 & 34.9296875 & 53.6015625 & 28.3671875 & 34.9296875 & 40.04296875 \\ \bottomrule%
\end{tabular}}%
\caption{Effect of varying~$n_z$ and~$n_f$ on~$\effFac(\tau)$. Plots~(a)--(d) show~$\effFac$ as a function of~$\tau$. Larger values for~$\effFac$ suggest increased utilization of latent space. Table~(e) shows that doubling~$n_z$ roughly doubles~$\effFac$. On the other hand, doubling~$n_f$ does not result in noticeable change for CelebA, and only results in a modest increase in~$\effFac$ for LSUN Bedroom. The step sizes used for the perturbations are the same as those reported in Figure~\ref{fig:latent_perturbations}.\label{fig:eff_fac}}
\end{figure}

Inspecting the two columns~(c), (e), (g), and~(d), (f), (h) in Figure~\ref{fig:latent_perturbations} suggests that larger values of~$n_z$ may encourage the generator to capture more explanatory factors, possibly at the price of decreased sample quality. We would like to explore the effect of varying~$n_z$ and~$n_f$ on the number of such factors. To do this, we fix a sample of latent variables~$S \coloneqq \{z_j\}_{j \in [1, M]} \sim P_Z$. For each~$z_j \in S$, we define
\[
\delta(j, 0) \coloneqq \E{\epsilon \sim N(0, I)} \norm{G(z_j + \alpha \epsilon) - G(z_j)}_2
\quad\text{and}\quad
\delta(j, i) \coloneqq \frac{\norm{G(z_j + \alpha v_i^{(j)}) - G(z_j)}_2}{\delta(j, 0)},
\]
where~$v_i^{(j)}$ denotes the~$i$th eigenvector of~$M_f$ evaluated at~$z_j$, and~$\alpha$ the perturbation step size.
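These quantities are straightforward to estimate by Monte Carlo. The sketch below computes~$\delta(j, 0)$ and the normalized eigenvector responses for a small toy generator; the generator and the point~$z$ are illustrative assumptions (the experiments above use trained DCGAN generators). To first order, $\norm{G(z + \alpha v_i) - G(z)}_2 \approx \alpha \lambda_i^{1/2}$, so the responses should decrease along the spectrum of~$M_f$.

```python
import numpy as np

def G(z):
    # Toy generator R^3 -> R^5; an illustrative stand-in for a DCGAN G.
    z1, z2, z3 = z
    return np.array([2.0 * z1, z2 - z3, np.tanh(z1 * z2), 0.5 * z3, z1 * z3])

def metric(z, h=1e-6):
    # M_f = J^t J, with J computed by central finite differences.
    J = np.stack([(G(z + h * e) - G(z - h * e)) / (2.0 * h)
                  for e in np.eye(len(z))], axis=1)
    return J.T @ J

rng = np.random.default_rng(0)
z, alpha = np.array([0.4, -0.3, 0.2]), 1e-2

# delta(j, 0): mean response to spherical-normal perturbations.
eps = rng.standard_normal((256, 3))
delta0 = np.mean([np.linalg.norm(G(z + alpha * e) - G(z)) for e in eps])

# delta(j, i): normalized response along the i-th eigenvector of M_f.
lam, V = np.linalg.eigh(metric(z))
order = np.argsort(lam)[::-1]                  # descending eigenvalues
delta = [np.linalg.norm(G(z + alpha * V[:, i]) - G(z)) / delta0
         for i in order]
print(delta)  # responses decrease along the spectrum
```

Note that $\delta(j, 0)$ averages over perturbations of expected norm~$\sqrt{n_z}$ rather than unit norm, so the normalized eigenvector responses need not exceed one; only their ordering is informative.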

\begin{algorithm}[t]
\caption{Procedure to estimate~$\ln \det$ using the Chebyshev
approximation.\label{alg:log_det_chebyshev}}

\begin{algorithmic}
\Require{$A \in \reals^{n \times n}$ is the implicit matrix;~$m$ the desired order;~$p$ the number
of probe vectors for the trace estimator;~$t$ the number of power iterations;~$g \geq 1$ a
multiplier for the estimate returned by the power method; and~$\epsilon$ the stipulated lower bound
on~$\Sp(A)$.}

\Procedure{StochasticLogDet}{$A, m, p, t, g, \epsilon$}
\State $\hat{\lambda}_\text{max} \gets \Call{PowerMethod}{A, t}$
\State $\mu, \nu \gets \epsilon, g \,\hat{\lambda}_\text{max}$
\State $a, b \gets \mu / (\mu + \nu), \nu / (\mu + \nu)$
\State Define~$\varphi$ and~$\varphi^{-1}$ using Equation~\ref{eq:rescale_func}.
\State $\{c_i\}_{i \in [0, m]} \gets \Call{ChebyshevCoefficients}{\ln \circ\, \varphi}$
\State $\bar{A} \gets A / (\mu + \nu)$
\State $\Gamma \gets \Call{StochasticChebyshevTrace}{\varphi^{-1}(\bar{A}), \{c_i\}, p}$
\State \Return $n\ln(a + b) + \Gamma$
\EndProcedure

\Procedure{StochasticChebyshevTrace}{$A, \{c_i\}_{i \in [0, m]}, p$}
\State $r \gets 0$
\For{$j \in [1, p]$}
\State $v \gets \Call{RandomRademacher}{n}$
\State $w_0, w_1 \gets v, Av$
\State $s \gets c_0 w_0 + c_1 w_1$

\For{$i \in [2, m]$}
\State $w_i \gets 2 A w_{i - 1} - w_{i - 2}$
\State $s \gets s + c_i w_i$
\EndFor

\State $r \gets r + \inner{v}{s}$
\EndFor
\State \Return $r / p$
\EndProcedure
\end{algorithmic}
\end{algorithm}
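For reference, Algorithm~\ref{alg:log_det_chebyshev} can be sketched in NumPy as follows. This is a simplified illustration rather than the implementation used in the experiments: the rescaling map~$\varphi$ is folded into a single affine transformation of~$A$, the Chebyshev coefficients are obtained with \texttt{numpy.polynomial.chebyshev.chebinterpolate}, and the spectral bounds $(\mu, \nu)$ are assumed given rather than estimated with the power method.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

def stochastic_logdet(matvec, n, mu, nu, m=40, p=20, rng=None):
    """Estimate ln det A = tr ln A for an SPD matrix A with Sp(A) in
    [mu, nu], accessing A only through matrix-vector products."""
    rng = np.random.default_rng(0) if rng is None else rng
    # phi maps [-1, 1] affinely onto [mu, nu]; interpolate ln o phi.
    phi = lambda t: 0.5 * (nu - mu) * t + 0.5 * (nu + mu)
    c = cheb.chebinterpolate(lambda t: np.log(phi(t)), m)
    # A_bar = phi^{-1}(A) has spectrum inside [-1, 1].
    abar = lambda v: (2.0 * matvec(v) - (nu + mu) * v) / (nu - mu)
    total = 0.0
    for _ in range(p):
        v = rng.choice([-1.0, 1.0], size=n)    # Rademacher probe
        w0, w1 = v, abar(v)                    # T_0(A_bar)v, T_1(A_bar)v
        s = c[0] * w0 + c[1] * w1
        for i in range(2, m + 1):
            w0, w1 = w1, 2.0 * abar(w1) - w0   # three-term recurrence
            s = s + c[i] * w1
        total += v @ s                         # Hutchinson trace estimate
    return total / p

# Diagonal test matrix: for diagonal A the Rademacher estimate is exact
# (each probe component squares to one), so only the Chebyshev
# interpolation error remains.
d = np.array([0.5, 1.0, 2.0, 3.0])
est = stochastic_logdet(lambda v: d * v, n=4, mu=0.4, nu=3.5)
print(est, np.log(d).sum())  # both approximately 1.0986
```

In the implicit-density setting, \texttt{matvec} would apply~$M_f = (J_f)^t J_f$ via one forward-mode and one reverse-mode automatic-differentiation pass, so the cost is $m \cdot p$ Jacobian-vector products per estimate.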



We apply spectral backpropagation to explore the effect of perturbing a given latent variable $z \in Z$ to maximize likelihood under the generator distribution $Q$. This is readily accomplished by noting that the same procedure used to evaluate Equation 3 can also be used to obtain gradients with respect to $z$. The results are shown in Figure 4. Intuitively, we might expect the outputs to be transformed in such a way that they gravitate toward modes of the dataset. But this is not what happens. Instead, the outputs are transformed into highly aberrant, out-of-distribution examples while nonetheless attaining very high likelihood. As optimization proceeds, $M_{f}$ also becomes increasingly ill-conditioned. This shows that likelihood for generators need not correspond to intuitive notions of visual plausibility.
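The drift toward high-likelihood but atypical outputs can be reproduced in miniature. The sketch below uses a toy two-dimensional generator of our own (not a DCGAN), ascends $\ln Q(G(z)) = \ln P_Z(z) - \tfrac{1}{2}\ln\det M_f(z)$ with respect to $z$, and uses finite differences in place of spectral backpropagation; everything here is an illustrative stand-in for the paper's procedure.

```python
import numpy as np

def G(z):
    # Toy generator R^2 -> R^2, a stand-in for the DCGAN generator.
    return np.array([np.tanh(z[0]) + 0.1 * z[1], 0.5 * z[1]])

def jacobian(z, eps=1e-6):
    # Finite-difference Jacobian of G (spectral backprop would avoid this).
    J = np.zeros((2, 2))
    for i in range(2):
        dz = np.zeros(2)
        dz[i] = eps
        J[:, i] = (G(z + dz) - G(z - dz)) / (2 * eps)
    return J

def log_q(z):
    # ln Q(G(z)) = ln P_Z(z) - 0.5 * ln det M_f(z), M_f = J^T J, P_Z = N(0, I).
    J = jacobian(z)
    logdet = np.linalg.slogdet(J.T @ J)[1]
    return -0.5 * (z @ z) - 0.5 * logdet

z = np.array([0.3, -0.2])
before = log_q(z)
for _ in range(200):
    grad = np.zeros(2)
    for i in range(2):
        dz = np.zeros(2)
        dz[i] = 1e-5
        grad[i] = (log_q(z + dz) - log_q(z - dz)) / 2e-5
    z = z + 1e-2 * grad          # gradient ascent on the latent variable
after = log_q(z)
```

Even in this tiny example, the log-determinant term rewards flattening the generator, so the optimum pulls $z$ away from the prior mode at the origin rather than toward it.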

The generator in GANs is well known for organizing latent space such that semantic features can be transferred by means of algebraic operations over latent variables (Radford et al., 2015). This suggests the existence of a systematic organization of latent space, but perhaps one that cannot be globally characterized in terms of a handful of simple explanatory factors. We instead explore whether local changes in latent space can be characterized in this way. Since the metric $M_{f}$ describes local change in the generator's output, it is natural to consider the effect of perturbations along its eigenvectors. To this end, we fix 12 trial embeddings in latent space, and compare the effect of perturbations along random directions to perturbations along these eigenvectors. The random directions are obtained by sampling from a spherical normal distribution. We show the results in Figure 5.
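The qualitative claim that movement along the principal eigenvector of $M_f$ changes the output more than a random direction does can be checked on a toy anisotropic generator. The generator, perturbation size, and sample count below are our own constructions, and finite differences stand in for spectral backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

def G(z):
    # Toy anisotropic generator R^2 -> R^3: one latent direction
    # moves the output far more than the other.
    return np.array([3.0 * z[0], 0.2 * z[1], z[0] + 0.1 * z[1]])

def metric(z, eps=1e-6):
    # M_f(z) = J^T J via a finite-difference Jacobian of G.
    J = np.zeros((3, 2))
    for i in range(2):
        dz = np.zeros(2)
        dz[i] = eps
        J[:, i] = (G(z + dz) - G(z - dz)) / (2 * eps)
    return J.T @ J

z = np.array([0.1, -0.3])
eigvals, V = np.linalg.eigh(metric(z))     # eigenvalues in ascending order
v_top = V[:, -1]                           # principal eigenvector
alpha = 0.5

change_top = np.linalg.norm(G(z + alpha * v_top) - G(z))
dirs = rng.standard_normal((256, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
change_rand = np.mean([np.linalg.norm(G(z + alpha * d) - G(z)) for d in dirs])
```

Because the metric's principal eigenvector aligns with the generator's most sensitive latent direction, `change_top` exceeds the average change produced by random unit perturbations.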

We can see that perturbations along the eigenvectors with dominant eigenvalues, especially the principal eigenvector, often result in the most drastic changes. Furthermore, these changes are not only semantically meaningful, but also tend to modify distinct attributes of the image. To see this more clearly, we consider the top two rows of Figure 5(g). Movement along the first two eigenvectors changes hair length and facial orientation; movement along the third eigenvector decreases the length of the bangs; movement along the fourth and fifth eigenvectors changes background color; and movement along the sixth and seventh eigenvectors changes hair color.

Inspecting the two columns (c), (e), (g) and (d), (f), (h) in Figure 5 suggests that larger values of $n_{z}$ may encourage the generator to capture more explanatory factors, possibly at the price of decreased sample quality. We would like to explore the effect of varying $n_{z}$ and $n_{f}$ on the number of such factors. To do this, we fix a sample of latent variables $S \coloneqq \{z_{j}\}_{j\in[1,M]} \sim P_{Z}$. For each $z_{j} \in S$, we define

for every eigenvector $v_{1}^{(j)},\ldots,v_{n_{z}}^{(j)}$ of $M_{f}(z_{j})$. The quantity $\delta(j,i)$ measures the pixelwise change resulting from a perturbation along an eigenvector, relative to the change we expect from a random perturbation. Finally, we define

where $\mathbf{1}\{\cdot\}$ is the indicator function. This quantity measures the average number of eigenvectors for which the relative change is greater than the threshold $\tau$. As such, it can be regarded as an effective measure of dimensionality for latent space. We explore the effect of varying $n_{z}$ and $n_{f}$ on $v_{\text{eff}}$ in Figure 6.
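The definitions of $\delta(j,i)$ and $v_{\text{eff}}$ translate directly into code. The sketch below uses a toy anisotropic generator of our own (not a GAN), with $n_z = 2$, $\tau = 1$, and a handful of trial embeddings; only the bookkeeping, not the scale, matches the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_z, alpha, tau, M = 2, 0.5, 1.0, 4

def G(z):
    # Toy anisotropic generator R^2 -> R^3 standing in for a GAN generator.
    return np.array([3.0 * z[0], 0.2 * z[1], z[0] + 0.1 * z[1]])

def metric(z, eps=1e-6):
    # M_f(z) = J^T J via a finite-difference Jacobian of G.
    J = np.zeros((3, n_z))
    for i in range(n_z):
        dz = np.zeros(n_z)
        dz[i] = eps
        J[:, i] = (G(z + dz) - G(z - dz)) / (2 * eps)
    return J.T @ J

def deltas(z):
    # delta(j, 0): mean change under random Gaussian perturbations.
    base = np.mean([np.linalg.norm(G(z + alpha * e) - G(z))
                    for e in rng.standard_normal((2000, n_z))])
    V = np.linalg.eigh(metric(z))[1]
    # delta(j, i): change along eigenvector i, relative to delta(j, 0).
    return [np.linalg.norm(G(z + alpha * V[:, i]) - G(z)) / base
            for i in range(n_z)]

S = [rng.uniform(-1.0, 1.0, n_z) for _ in range(M)]   # trial embeddings ~ P_Z
v_eff = np.mean([sum(d > tau for d in deltas(z)) for z in S])
```

Since this toy generator has exactly one strongly amplified latent direction, only one eigenvector per embedding clears the threshold, and the effective dimensionality comes out to 1.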

Current approaches to probabilistic modeling attempt to satisfy two goals that are fundamentally at odds with one another: fulfillment of the two constraints described in Section 1, and model flexibility. In this work, we develop a computational tool that aims to expand the scope of probabilistic modeling to functions that do not satisfy these constraints. We take the first steps toward demonstrating the feasibility of this approach by minimizing divergences in far greater generality than has been attempted before. Finally, we uncover surprising facts about the organization of latent space for GANs that we hope will contribute to an improved understanding of how effectively they capture underlying latent structure.


$$ V_{3}=\det\begin{pmatrix}v_{1}&v_{2}&v_{3}\end{pmatrix}, $$ \tag{S2.Ex1}

$$ Q(x)=P_{Z}(f^{-1}(x))\,\lvert\det J_{f}(f^{-1}(x))\rvert^{-1}\eqqcolon P_{Z}(z)\,\lvert\det J_{f}(z)\rvert^{-1}, $$ \tag{S2.Ex3}

$$ V_{n}^{2}=\det\begin{pmatrix}\langle v_{1},v_{1}\rangle&\cdots&\langle v_{1},v_{n}\rangle\\ \vdots&\ddots&\vdots\\ \langle v_{n},v_{1}\rangle&\cdots&\langle v_{n},v_{n}\rangle\end{pmatrix}. $$ \tag{S2.Ex4}

$$ \ln Q(x)=\ln P(z)-\frac{1}{2}\ln\det M_{f}(z)=\ln P(z)-\frac{1}{2}\operatorname{tr}\ln M_{f}(z). $$ \tag{S2.E2}

$$ \operatorname{tr}\ln M_{f}(z)=\sum_{\lambda\in\operatorname{Sp}(M_{f}(z))}\ln\lambda=\operatorname{E}_{\lambda\sim P_{\lambda}}\ln\lambda, $$ \tag{S2.Ex5}
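The identity between the log-determinant and the spectral sum of $\ln$ is easy to sanity-check numerically; the small symmetric positive definite matrix below is an arbitrary test case of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
M = A @ A.T + 4.0 * np.eye(4)      # arbitrary symmetric positive definite matrix

lhs = np.linalg.slogdet(M)[1]                  # ln det M
rhs = np.sum(np.log(np.linalg.eigvalsh(M)))    # tr ln M = sum of ln(eigenvalues)
```

The two quantities agree to machine precision, which is what makes trace estimators of $\operatorname{tr}\ln M_f$ usable as log-determinant estimators.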

$$ \bar{S}\approx\bar{p}_{m}\coloneqq\sum_{i\in[0,m]}c_{i}T_{i}, $$ \tag{S2.Ex6}

$$ c_{i}\coloneqq\frac{\bar{S}^{(i)}}{i!}\quad\text{and}\quad T_{i}:x\mapsto x^{i}, $$ \tag{S2.Ex7}

$$ T_{0}=1,\quad T_{1}:x\mapsto x,\quad\text{and}\quad T_{i}:x\mapsto 2xT_{i-1}(x)-T_{i-2}(x),\quad i\geq 2. $$ \tag{S2.E4}

$$ c_{i}\coloneqq\begin{dcases}\frac{1}{m+1}\sum_{j\in[0,m]}\bar{S}(x_{j})T_{0}(x_{j}),&i=0,\\ \frac{2}{m+1}\sum_{j\in[0,m]}\bar{S}(x_{j})T_{i}(x_{j}),&i\geq 1.\end{dcases} $$ \tag{S2.Ex8}
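These interpolation coefficients can be computed directly from evaluations at the Chebyshev nodes $x_{j} = \cos(\pi(j+1/2)/(m+1))$. The sketch below applies the formula to $\ln(2+x)$ as a stand-in for the rescaled spectral function $\bar{S}$; the test function and degree are our own illustrative choices.

```python
import numpy as np

m = 16
fbar = lambda x: np.log(2.0 + x)    # stand-in for the rescaled function S-bar
xj = np.cos(np.pi * (np.arange(m + 1) + 0.5) / (m + 1))   # Chebyshev nodes

def T(i, x):
    # Chebyshev polynomials of the first kind on [-1, 1].
    return np.cos(i * np.arccos(x))

# Interpolation coefficients: c_0 uses weight 1/(m+1), c_i (i >= 1) uses 2/(m+1).
c = [(1.0 if i == 0 else 2.0) / (m + 1) * np.sum(fbar(xj) * T(i, xj))
     for i in range(m + 1)]

p_m = lambda x: sum(c[i] * T(i, x) for i in range(m + 1))  # degree-m approximant
err = max(abs(p_m(x) - fbar(x)) for x in np.linspace(-0.9, 0.9, 101))
```

Because $\ln(2+x)$ is analytic on an ellipse around $[-1,1]$, the interpolation error decays geometrically in $m$, so a modest degree already gives high accuracy.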

$$ \Sigma_{S}(\bar{A})=\Sigma_{\bar{S}\circ\varphi^{-1}}(\bar{A})\approx\operatorname{tr}p_{m}(\bar{A}). $$ \tag{S2.E7}

$$ a\coloneqq\frac{\mu}{\mu+\nu},\quad b\coloneqq\frac{\nu}{\mu+\nu},\quad\text{and}\quad\bar{A}\coloneqq\frac{A}{a+b}. $$ \tag{S2.Ex12}

$$ \ln\det(A)=n\ln(a+b)+\ln\det(\bar{A})\approx n\ln(a+b)+\Gamma. $$ \tag{S2.Ex13}

$$ \operatorname{tr}\bar{A}=\operatorname{E}_{v\sim P_{V}}\langle v,\bar{A}v\rangle\approx\frac{1}{p}\sum_{j\in[1,p]}\langle v_{j},\bar{A}v_{j}\rangle. $$ \tag{S2.Ex16}

$$ \operatorname{tr}\bar{p}_{m}(\bar{A})=\sum_{i\in[0,m]}c_{i}\operatorname{tr}\bar{A}^{i}\approx\frac{1}{p}\sum_{i\in[0,m]}c_{i}\sum_{j\in[1,p]}\langle v_{j},\bar{A}^{i}v_{j}\rangle=\frac{1}{p}\sum_{j\in[1,p]}\left\langle v_{j},\sum_{i\in[0,m]}c_{i}\bar{A}^{i}v_{j}\right\rangle. $$ \tag{S2.Ex17}
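Putting the pieces together, the following sketch estimates $\ln\det A$ for a small SPD matrix by combining a Chebyshev approximant of $\ln$ on a rescaled interval with Rademacher probe vectors and the three-term recurrence, then compares against the exact value. The matrix, degree $m$, probe count $p$, and rescaling constants are illustrative choices of ours, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 32, 30, 400

B = rng.standard_normal((n, n))
A = B @ B.T / n + np.eye(n)                 # SPD test matrix
lam = np.linalg.eigvalsh(A)
Abar = A / (1.1 * lam.max())                # rescale spectrum into (0, 1)
delta = 0.9 * (lam.min() / (1.1 * lam.max()))   # lower edge of the target interval

# Chebyshev coefficients of ln on [delta, 1], pulled back to [-1, 1].
phi_inv = lambda x: ((1 - delta) * x + (1 + delta)) / 2
xj = np.cos(np.pi * (np.arange(m + 1) + 0.5) / (m + 1))
T = lambda i, x: np.cos(i * np.arccos(x))
c = [(1.0 if i == 0 else 2.0) / (m + 1) * np.sum(np.log(phi_inv(xj)) * T(i, xj))
     for i in range(m + 1)]

# Affinely mapped matrix whose spectrum lies in (-1, 1), so the
# three-term Chebyshev recurrence applies.
At = (2 * Abar - (1 + delta) * np.eye(n)) / (1 - delta)

est = 0.0
for _ in range(p):
    v = rng.choice([-1.0, 1.0], size=n)     # Rademacher probe vector
    w_prev, w = v, At @ v                   # T_0(At) v and T_1(At) v
    acc = c[0] * (v @ w_prev) + c[1] * (v @ w)
    for i in range(2, m + 1):
        w_prev, w = w, 2 * At @ w - w_prev  # T_i(At) v via the recurrence
        acc += c[i] * (v @ w)
    est += acc / p                          # (1/p) sum_j <v_j, p_m(At) v_j>

logdet_est = est + n * np.log(1.1 * lam.max())   # undo the spectral rescaling
exact = np.linalg.slogdet(A)[1]
```

Only matrix-vector products with `At` are used inside the probe loop, which is what makes this estimator applicable when $A$ is available only implicitly.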

$$ \ln\hat{\ell}-\ln\ell=\ln\frac{\hat{\ell}}{\ell}=\ln\left(1+\frac{\hat{\ell}-\ell}{\ell}\right)\approx\frac{\hat{\ell}-\ell}{\ell}, $$ \tag{S3.Ex18}

$$ R(M_{f})\coloneqq\rho\lVert M_{f}\rVert_{2} $$ \tag{S3.E8}

$$ \delta(j,0)\coloneqq\operatorname{E}_{\epsilon\sim N(0,I)}\lVert G(z_{j}+\alpha\epsilon)-G(z_{j})\rVert_{2}\quad\text{and}\quad\delta(j,i)\coloneqq\frac{\lVert G(z_{j}+\alpha v_{i}^{(j)})-G(z_{j})\rVert_{2}}{\delta(j,0)}, $$ \tag{S5.Ex19}

$$ v_{\text{eff}}(\tau)\coloneqq\frac{1}{M}\sum_{j\in[1,M]}\sum_{i\in[1,n_{z}]}\mathbf{1}\{\delta(j,i)>\tau\}, $$ \tag{S5.Ex20}


| Trial | $\log(p_{\text{final}}/p_{\text{init}})$ | Initial $(\lambda_{\text{min}},\lambda_{\text{max}})$ | Final $(\lambda_{\text{min}},\lambda_{\text{max}})$ | $(m, p, t, g, \epsilon)$ |
| --- | --- | --- | --- | --- |
| (a) | 161.40 | $(2.31\times 10^{-1},\ 2.91\times 10^{3})$ | $(8.11\times 10^{-3},\ 1.06\times 10^{3})$ | $(5, 20, 20, 1.1, 1\times 10^{-4})$ |
| (b) | 667.43 | $(2.71\times 10^{-1},\ 2.23\times 10^{3})$ | $(2.51\times 10^{-6},\ 2.57\times 10^{-1})$ | $(5, 20, 20, 1.1, 1\times 10^{-5})$ |
| (c) | 90.83 | $(1.87\times 10^{-2},\ 2.00\times 10^{3})$ | $(8.58\times 10^{-3},\ 5.62\times 10^{2})$ | $(5, 10, 10, 1.2, 1\times 10^{-2})$ |
| (d) | 336.07 | $(2.29\times 10^{-2},\ 1.24\times 10^{3})$ | $(6.12\times 10^{-4},\ 4.04\times 10^{2})$ | $(5, 10, 10, 1.2, 1\times 10^{-2})$ |
$v_{\text{eff}}(\tau=1)$ as $n_{z}$ is varied with $n_{f}$ fixed (first three columns) and as $n_{f}$ is varied with $n_{z}$ fixed (last two columns):

| Dataset | $(n_{z}/2,\,n_{f})$ | $(n_{z},\,n_{f})$ | $(2n_{z},\,n_{f})$ | $(n_{z},\,n_{f}/2)$ | $(n_{z},\,2n_{f})$ |
| --- | --- | --- | --- | --- | --- |
| CelebA ($n_{z}=64$, $n_{f}=64$) | 6.95 | 10.49 | 16.55 | 16.14 | 17.29 |
| LSUN Bedroom ($n_{z}=64$, $n_{f}=128$) | 25.23 | 34.93 | 53.60 | 28.37 | 40.04 |


References

[TF] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[TraceEstVariance] Haim Avron and Sivan Toledo. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. Journal of the ACM (JACM), 58(2):8, 2011.

[SylvesterNF] Rianne van den Berg, Leonard Hasenclever, Jakub M. Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.

[SSTaylor] Christos Boutsidis, Petros Drineas, Prabhanjan Kambadur, Eugenia-Maria Kontopoulou, and Anastasios Zouzias. A randomized algorithm for approximating the log determinant of a symmetric positive definite matrix. Linear Algebra and its Applications, 533:95–117, 2017.

[RealNVP] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

[SSEntropic] Jack Fitzsimons, Diego Granziol, Kurt Cutajar, Michael Osborne, Maurizio Filippone, and Stephen Roberts. Entropic trace estimates for log determinants. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 323–338. Springer, 2017.

[MADE] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889, 2015.

[GAN] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[SSChebyshev] Insu Han, Dmitry Malioutov, Haim Avron, and Jinwoo Shin. Approximating spectral sums of large-scale matrices using stochastic Chebyshev approximations. SIAM Journal on Scientific Computing, 39(4):A1558–A1585, 2017.

[NDimGraphics] Andrew J. Hanson. Geometry for n-dimensional graphics. In Graphics Gems IV, pages 149–170. Academic Press Professional, Inc., San Diego, CA, USA, 1994. ISBN 0-12-336155-9. URL http://dl.acm.org/citation.cfm?id=180895.180909.

[TraceEst] Michael F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 19(2):433–450, 1990.

[NFTutorial] Eric Jang. Normalizing flows tutorial, part 2: Modern normalizing flows, 2018. URL https://blog.evjang.com/2018/01/nf2.html.

[Adam] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[AEVB] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[IAF] Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.

[Coarea] Steven Krantz and Harold Parks. Analytical tools: The area formula, the coarea formula, and Poincaré inequalities. Geometric Integration Theory, pages 1–33, 2008.

[SteinGradEst] Yingzhen Li and Richard E. Turner. Gradient estimators for implicit models. arXiv preprint arXiv:1705.07107, 2017.

[CelebA] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

[AIS] Radford M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.

[MAF] George Papamakarios, Iain Murray, and Theo Pavlakou. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2335–2344, 2017.

[PyTorch] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

[DCGAN] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[NF] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

[StochasticBackprop] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

[ImprovedGAN] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[NNProperties] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[HouseholderNF] Jakub M. Tomczak and Max Welling. Improving variational auto-encoders using Householder flow. arXiv preprint arXiv:1611.09630, 2016.

[SSLanczos] Shashanka Ubaru, Jie Chen, and Yousef Saad. Fast estimation of tr(f(A)) via stochastic Lanczos quadrature. SIAM Journal on Matrix Analysis and Applications, 38(4):1075–1099, 2017.

[WaveNet] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016a.

[PixelCNN] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016b.

[PixelRNN] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016c.

[ParallelWaveNet] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.

[AISEval] Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov, and Roger Grosse. On the quantitative analysis of decoder-based generative models. arXiv preprint arXiv:1611.04273, 2016.

[TPReLU] Sitao Xiang and Hao Li. On the effects of batch and weight normalization in generative adversarial networks. stat, 1050:22, 2017.

[WideResNet] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
