
Universum Prescription: Regularization Using Unlabeled Data

Xiang Zhang, Yann LeCun
Courant Institute of Mathematical Sciences, New York University
719 Broadway, 12th Floor, New York, NY 10003

Abstract

This paper shows that simply prescribing 'none of the above' labels to unlabeled data has a beneficial regularization effect on supervised learning. We call it universum prescription because the prescribed labels cannot be any of the supervised labels. In spite of its simplicity, universum prescription obtained competitive results in training deep convolutional networks on the CIFAR-10, CIFAR-100, STL-10 and ImageNet datasets. A qualitative justification of these approaches using Rademacher complexity is presented. The effect of a regularization parameter - the probability of sampling from unlabeled data - is also studied empirically.


Introduction

The idea of exploiting the wide abundance of unlabeled data to improve the accuracy of supervised learning tasks is a very natural one. In this paper, we study what is perhaps the simplest way to exploit unlabeled data in the context of deep learning. We assume that the unlabeled samples do not belong to any of the categories of the supervised task, and we force the classifier to produce a 'none of the above' output for these samples. This is by no means a new idea, but we show empirically and theoretically that doing so has a regularization effect on the supervised task and reduces the generalization gap - the expected difference between test and training errors. We study three different ways to prescribe 'none of the above' outputs, dubbed uniform prescription, dustbin class and background class, and show that they improve the test error of convolutional networks trained on CIFAR-10, CIFAR-100 (Krizhevsky 2009), STL-10 (Coates, Ng, and Lee 2011), and ImageNet (Russakovsky et al. 2015). The method is theoretically justified using Rademacher complexity (Bartlett and Mendelson 2003).

Here we briefly describe our three universum prescription methods. Uniform prescription forces a discrete uniform distribution for unlabeled samples. Dustbin class simply adds an extra class and prescribes all unlabeled data to this class. Background class also adds an extra class, but it uses a constant threshold to avoid parameterization.

Our work is a direct extension of learning in the presence of universum (Weston et al. 2006) (Chapelle et al. 2007), which originated from (Vapnik 1998) and (Vapnik 2006). A universum is a set of unlabeled data in the same domain that are known not to belong to any of the classes. We extend the idea of using a universum from support vector machines to deep learning.

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Most deep learning approaches utilizing unlabeled data belong to the scope of representation learning (reviewed by (Bengio, Courville, and Vincent 2013) and (Bengio and LeCun 2007)) and transfer learning (Thrun and Pratt 1998). They include ideas like pretraining (Erhan et al. 2010) (Hinton, Osindero, and Teh 2006) (Ranzato et al. 2006) and semi-supervised training (Rasmus et al. 2015) (Zhao et al. 2015). Universum prescription incorporates unlabeled data without imposing priors such as sparsity or reconstruction.

Regularization - techniques for controlling the generalization gap - has been studied extensively. Most approaches implement a secondary optimization objective, such as an L2 norm of the parameters. Some other methods such as dropout (Srivastava et al. 2014) cheaply simulate model averaging to control the model variance. As part of general statistical learning theory (Vapnik 1995) (Vapnik 1998), the justification for regularization is well developed. We qualitatively justify our methods using Rademacher complexity (Bartlett and Mendelson 2003), similar to (Wan et al. 2013).

Universum Prescription

In this section we attempt to formalize the trick of prescribing 'none of the above' labels - universum prescription. Consider the problem of exclusive k-way classification. In learning we hope to find a hypothesis function h ∈ H mapping to R^k, so that the label is determined by y = argmin_i h_i(x). The following assumptions are made.

  1. (Loss assumption) The loss used as the optimization objective is the negative log-likelihood.

  2. (Universum assumption) The proportion of unlabeled samples belonging to a supervised class is negligible.

The loss assumption assumes that the probability of class y given an input x can be thought of as

$$ \Pr[Y = y | X = x] = \frac{\exp(-h_y(x))}{\sum_{i=1}^{k} \exp(-h_i(x))}, $$

where (X, Y) ∼ D and D is the distribution from which labeled data are sampled. We use lowercase letters for values, uppercase letters for random variables and bold uppercase letters for distributions. The loss assumption is simply a necessary detail rather than a limitation, in the sense that one can change the type of loss and use the same principles to derive different universum learning techniques.
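The relation between the energies h(x) and the class probabilities is a softmax of -h(x). A minimal numpy sketch of this relation (the function names are ours, not from the paper):

```python
import numpy as np

def class_probabilities(h):
    """P[Y = i | x] = exp(-h_i) / sum_j exp(-h_j), computed stably."""
    z = -h - np.max(-h)              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def nll_loss(h, y):
    """Negative log-likelihood: L(h, x, y) = h_y + log sum_i exp(-h_i)."""
    m = np.max(-h)
    return h[y] + m + np.log(np.exp(-h - m).sum())

h = np.array([0.5, 2.0, 1.0])        # toy energies for k = 3 classes
p = class_probabilities(h)
assert np.isclose(p.sum(), 1.0)
assert np.isclose(nll_loss(h, 0), -np.log(p[0]))
```

Note that the predicted label argmin_i h_i(x) is exactly the argmax of these probabilities.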

The universum assumption implies that the labeled classes form a negligible subset. In many practical cases we only care about a small number of classes, either by problem design or due to the high cost of the labeling process. At the same time, a very large amount of unlabeled data is easily obtained. Put mathematically, assuming we draw unlabeled data from distribution U, the assumption states that

The universum assumption is the opposite of the assumptions of information regularization (Corduneanu and Jaakkola 2006) and transductive learning (Chapelle, Schölkopf, and Zien 2006) (Gammerman, Vovk, and Vapnik 1998). It has similarities with (Zhang and Zhou 2010), which encourages diversity of outputs for ensemble methods. All our methods discussed below prescribe agnostic targets to the unlabeled data. During learning, we randomly present an unlabeled sample to the optimization procedure with probability p.
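The sampling scheme can be sketched as follows; this is our illustration, not the paper's code, and the `None` marker simply stands for whatever agnostic target the chosen method later fills in:

```python
import random

def sample_stream(labeled, unlabeled, p, steps, seed=0):
    """Yield (x, y) pairs; with probability p the sample is drawn from the
    unlabeled pool and marked with y = None (the target is then supplied by
    uniform prescription, dustbin class, or background class)."""
    rng = random.Random(seed)
    for _ in range(steps):
        if rng.random() < p:
            yield rng.choice(unlabeled), None
        else:
            yield rng.choice(labeled)

labeled = [("img%d" % i, i % 10) for i in range(100)]
unlabeled = ["tiny%d" % i for i in range(1000)]
frac = sum(y is None for _, y in sample_stream(labeled, unlabeled, 0.2, 10000)) / 10000
# frac concentrates around p = 0.2
```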

Uniform Prescription

It is known that negative log-likelihood is simply a reduced form of cross-entropy

$$ L(h, x) = -\sum_{i=1}^{k} Q[Y = i | x] \log \Pr[Y = i | x], $$

in which the target probability Q[Y = y | x] = 1 and Q[Y = i | x] = 0 for i ≠ y. Under the universum assumption, if we are presented with an unlabeled sample x, we would hope to prescribe some Q so that every class has some equally minimal probability. Q also has to satisfy Σ_{i=1}^{k} Q[Y = i | x] = 1 by the probability axioms. The only possible choice for Q is then Q[Y = i | x] = 1/k for every i. The learning algorithm then uses the cross-entropy loss instead of negative log-likelihood.

It is worth noting that the uniform output has the maximum entropy among all possible choices. If h is parameterized as a deep neural network, a uniform output is achieved when the parameters are identically zero. Therefore, uniform prescription may have the effect of reducing the magnitude of parameters, similar to norm-based regularization.
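A minimal numpy sketch of the uniform-target cross-entropy (function names are ours; the loss reduces to negative log-likelihood when Q is one-hot):

```python
import numpy as np

def log_probs(h):
    """log P[Y = i | x] = -h_i - log sum_j exp(-h_j), computed stably."""
    m = np.max(-h)
    return -h - (m + np.log(np.exp(-h - m).sum()))

def cross_entropy(h, q):
    """H(Q, P) = -sum_i q_i log P[Y = i | x]."""
    return -(q * log_probs(h)).sum()

k = 10
h = np.random.RandomState(0).randn(k)
one_hot = np.eye(k)[3]                    # labeled sample with y = 3
uniform = np.full(k, 1.0 / k)             # universum sample
labeled_loss = cross_entropy(h, one_hot)  # equals the negative log-likelihood
universum_loss = cross_entropy(h, uniform)
# a uniform output (e.g. all energies zero) attains the minimum value log k
assert np.isclose(cross_entropy(np.zeros(k), uniform), np.log(k))
```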

Dustbin Class

Another way of prescribing agnostic target is to append a 'dustbin' class to the supervised task. This requires some changes to the hypothesis function h such that it outputs k +1 targets. For deep learning models one can simply extend the last parameterized layer. All unlabeled data are prescribed to this extra 'dustbin' class.

The effect of dustbin class is clearly seen in the loss function of an unlabeled sample (x, k+1):

$$ L(h, x, k+1) = h_{k+1}(x) + \log \left[ \sum_{i=1}^{k+1} \exp(-h_i(x)) \right]. $$

The second term is a 'soft' maximum over all dimensions of -h. With an unlabeled sample, the algorithm attempts to introduce smoothness by minimizing probability spikes.
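For a deep model, appending the dustbin class only requires one extra row in the last parameterized layer. A hedged numpy sketch (the layer shapes are our toy choices):

```python
import numpy as np

rng = np.random.RandomState(0)
d, k = 32, 10
W = rng.randn(k, d) * 0.01                  # last parameterized layer
W = np.vstack([W, rng.randn(1, d) * 0.01])  # one extra row: the dustbin class

def nll(h, y):
    """L(h, x, y) = h_y + log sum_i exp(-h_i), over all k+1 outputs."""
    m = np.max(-h)
    return h[y] + m + np.log(np.exp(-h - m).sum())

feat = rng.randn(d)           # penultimate-layer features for one sample
h = W @ feat                  # k+1 energies
labeled_loss = nll(h, 3)      # a supervised sample with y = 3
dustbin_loss = nll(h, k)      # an unlabeled sample, prescribed to class k+1
```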

Background Class

We could further simplify dustbin class by removing the parameters for class k+1. For some given threshold constant τ, we change the probability of a labeled sample to

$$ \Pr[Y = y | X = x] = \frac{\exp(-h_y(x))}{\sum_{i=1}^{k} \exp(-h_i(x)) + \exp(-\tau)}, $$

and that of an unlabeled sample to

$$ \Pr[Y = k+1 | X = x] = \frac{\exp(-\tau)}{\sum_{i=1}^{k} \exp(-h_i(x)) + \exp(-\tau)}. $$

This results in the loss function of a labeled sample (x, y) becoming

$$ L(h, x, y) = h_y(x) + \log \left[ \sum_{i=1}^{k} \exp(-h_i(x)) + \exp(-\tau) \right]. $$

Table 1: The 21-layer network

We call this method background class and τ the background constant. Similar to dustbin class, the algorithm attempts to minimize the spikes of the outputs, but only to a certain extent, limited by the inclusion of exp(-τ) in the partition function. In our experiments τ is always set to 0.
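The background-class losses can be sketched as follows; the constant exp(-τ) is simply appended to the partition function (the function name is ours):

```python
import numpy as np

def background_losses(h, tau=0.0):
    """Partition function gains a constant exp(-tau) term for class k+1."""
    m = max(np.max(-h), -tau)     # shift for a stable log-sum-exp
    logZ = m + np.log(np.exp(-h - m).sum() + np.exp(-tau - m))
    labeled = h + logZ            # loss h_y + log Z for each possible label y
    unlabeled = tau + logZ        # loss of prescribing the parameterless class
    return labeled, unlabeled

h = np.array([1.0, -0.5, 0.3])    # toy energies, k = 3, tau = 0
lab, unl = background_losses(h, tau=0.0)
# the k+1 probabilities still sum to one
p = np.exp(-np.append(h, 0.0) - np.log(np.exp(-h).sum() + 1.0))
assert np.isclose(p.sum(), 1.0)
```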

Theoretical Justification

In this part, we derive a qualitative justification for universum prescription using probably approximately correct (PAC) learning (Valiant 1984). By being 'qualitative', the justification is in contrast with numerical bounds such as Vapnik-Chervonenkis dimension (Vapnik and Chervonenkis 1971) (VC-dim) and others. Our theory is based on Rademacher complexity (Bartlett and Mendelson 2003), similar to (Wan et al. 2013) where both dropout (Srivastava et al. 2014) and dropconnect (Wan et al. 2013) are justified. VC-dim is an upper-bound of Rademacher complexity, suggesting that the latter is more accurate. Previous results on unlabeled data (Oneto et al. 2011) (Oneto et al. 2015) assume the same distribution for labeled and unlabeled data, which is impossible under the universum assumption.

Definition 1 (Empirical Rademacher complexity). Let F be a family of functions mapping from X to R, and S = (x_1, x_2, ..., x_m) a fixed sample of size m with elements in X. Then, the empirical Rademacher complexity of F with respect to the sample S is defined as:

$$ \hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\eta}} \left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \eta_i f(x_i) \right], $$

where η = (η_1, ..., η_m)^T, with the η_i independent random variables uniformly distributed on {-1, 1}.

Definition 2 (Rademacher complexity) . Let D denote the distribution from which the samples were drawn. For any integer m ≥ 1 , the Rademacher complexity of F is the expectation of the empirical Rademacher complexity over all samples of size m drawn according to D :

Two qualitative properties of Rademacher complexity are worth noting here. First, Rademacher complexity is always non-negative, by the convexity of the supremum.

Secondly, if for a fixed sample all functions in F output the same values, then the Rademacher complexity is 0: assuming that for any f ∈ F we have f(x) = f_0(x), the supremum drops and

$$ \hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\eta}} \left[ \frac{1}{m} \sum_{i=1}^{m} \eta_i f_0(x_i) \right] = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}[\eta_i] f_0(x_i) = 0. $$

Therefore, one way to minimize Rademacher complexity is to regularize functions in F such that all functions tend to have the same output for a given input. Universum prescription precisely does that - the prescribed outputs for unlabeled data are all constantly the same.
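Both properties can be checked numerically with a Monte Carlo estimate of the empirical Rademacher complexity; this illustrative sketch (our construction, not from the paper) represents a finite function class by its outputs on the sample:

```python
import numpy as np

def empirical_rademacher(outputs, trials=4000, seed=0):
    """Monte Carlo estimate of E_eta[ sup_f (1/m) sum_i eta_i f(x_i) ].
    `outputs` is a (num_functions, m) array: each row is one f evaluated on S."""
    rng = np.random.RandomState(seed)
    m = outputs.shape[1]
    total = 0.0
    for _ in range(trials):
        eta = rng.choice([-1.0, 1.0], size=m)
        total += np.max(outputs @ eta) / m   # sup over the function class
    return total / trials

rng = np.random.RandomState(1)
varied = rng.randn(50, 20)                     # 50 functions, sample size m = 20
constant = np.tile(rng.randn(1, 20), (50, 1))  # all functions agree on every input
assert empirical_rademacher(varied) > 0.05     # strictly positive
assert abs(empirical_rademacher(constant)) < 0.02  # near zero
```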

The principal PAC-learning result is a bound for functions that are bounded in output. We use the formulation by (Zhang 2013), but earlier results exist (Bartlett, Boucheron, and Lugosi 2002) (Bartlett and Mendelson 2003) (Koltchinskii 2001) (Koltchinskii and Panchenko 2000).

Theorem 1 (Approximation bound with finite bound on output) . For an energy function (LeCun et al. 2006) E ( h, x, y ) over hypothesis class H , input set X and output set Y , if it has lower bound 0 and upper bound M > 0 , then with probability at least 1 -δ , the following holds for all h ∈ H :

where the function family F is defined as

$$ \mathcal{F} = \{ (x, y) \mapsto \mathcal{E}(h, x, y) \mid h \in \mathcal{H} \}, $$

D is the distribution for (x, y), and S is a sample set of size m drawn identically and independently from D.

The meaning of the theorem is two-fold. First, when applying the theorem to the joint problem of training with both labeled and unlabeled data, the third term on the right-hand side of inequality 14 is reduced by the augmentation of the extra data. The joint Rademacher complexity is written as R_m(F, (1-p)D + pU), which is reduced when we prescribe constant outputs to the unlabeled data.

Second, when the theorem is applied to the supervised distribution D, we would hope that R_n(F, D) can be bounded by R_m(F, (1-p)D + pU), where n is the number of supervised samples randomly chosen by the joint problem. Note that n follows a binomial distribution with mean (1-p)m. Such a bound can be achieved in a probable and approximate sense.

Theorem 2 (Rademacher complexity bound on distribution mixture). Assume we have a joint problem where p ≤ 0.5 and there are m random training samples from the joint distribution (1-p)D + pU. With probability at least 1 - δ, the following holds

where n is a random number indicating the number of supervised samples among the m joint samples, and m is large enough such that

We present the proof of theorem 2 in the supplemental material, which utilizes Hoeffding's inequality (Hoeffding 1963) (Serfling 1974). The theorem tells us that the Rademacher complexity of the supervised problem can be bounded by that of the joint problem. The universum prescription algorithm attempts to make the Rademacher complexity of the joint problem small. Therefore, universum prescription improves generalization by incorporating unlabeled data.

However, theorem 2 requires that p ≤ 0.5, otherwise the bound is not achievable. Also, the value of (2 - p)/(1 - p)^2 - the asymptotic constant factor in inequality 16 when m is large - is monotonically increasing with respect to p, with a range of [2, 6] for p ≤ 0.5. These facts indicate that we need to keep p small. The following sections show that there is improvement if p is small, but training and testing errors become worse when p is large.
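The stated range of the constant factor is easy to verify numerically:

```python
# the asymptotic factor (2 - p) / (1 - p)^2 from inequality 16
f = lambda p: (2 - p) / (1 - p) ** 2
values = [f(p / 10) for p in range(6)]     # p = 0.0, 0.1, ..., 0.5
assert abs(values[0] - 2.0) < 1e-12        # f(0) = 2
assert abs(values[-1] - 6.0) < 1e-12       # f(0.5) = 6
assert all(a < b for a, b in zip(values, values[1:]))  # monotonically increasing
```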

Table 2: The 17-layer network

Finally, in terms of numerical asymptotics, theorem 2 suggests that R_n(F, D) ≤ O(1/√m), instead of the commonly known result R_n(F, D) ≤ O(1/√n). This bounds the supervised problem with a tighter asymptotic factor, because there are more joint samples than supervised samples.


Experiments on Image Classification

In this section we test the methods on several image classification tasks. Three series of datasets - CIFAR-10/100 (Krizhevsky 2009), STL-10 (Coates, Ng, and Lee 2011) and ImageNet (Russakovsky et al. 2015) - are chosen due to the availability of unlabeled data. For the CIFAR-10/100 and STL-10 datasets, we used a 21-layer convolutional network (ConvNet) (LeCun et al. 1989) (LeCun et al. 1998), in which the inputs are 32-by-32 images and all convolutional layers are 3-by-3 and fully padded. For ImageNet, the model is a 17-layer ConvNet with 64-by-64 images as inputs. These models are inspired by (Simonyan and Zisserman 2014); all pooling layers are max-pooling, and ReLUs (Nair and Hinton 2010) are used as the non-linearity. Two dropout (Srivastava et al. 2014) layers of probability 0.5 are inserted before the final two linear layers.

The algorithm used is stochastic gradient descent with momentum (Polyak 1964) (Sutskever et al. 2013) of 0.9 and a minibatch size of 32. The initial learning rate is 0.005, which is halved every 60,000 minibatch steps for CIFAR-10/100 and every 600,000 minibatch steps for ImageNet. The training stops at 400,000 steps for CIFAR-10/100 and STL-10, and 2,500,000 steps for ImageNet. Tables 1 and 2 summarize the configurations. The weights are initialized in the same way as (He et al. 2015). The following data augmentation steps are used.

Table 3: Result for baseline and uniform prescription. The numbers are percentages.

  1. (Horizontal flip.) Flip the image horizontally with probability 0.5.
  2. (Scale.) Randomly scale the image to between 1/1.2 and 1.2 times its height and width.
  3. (Crop.) Randomly crop a 32-by-32 (or 64-by-64 for ImageNet) region in the scaled image.
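The pipeline above can be sketched as follows; this is our illustration, with nearest-neighbor resizing standing in for whatever interpolation was actually used:

```python
import random
import numpy as np

def augment(img, out=32, rng=None):
    """Flip / scale / crop sketch; `img` is an (H, W, 3) uint8 array."""
    rng = rng or random.Random(0)
    if rng.random() < 0.5:                     # 1. horizontal flip
        img = img[:, ::-1]
    s = rng.uniform(1 / 1.2, 1.2)              # 2. random scale
    h = max(out, int(img.shape[0] * s))
    w = max(out, int(img.shape[1] * s))
    rows = np.arange(h) * img.shape[0] // h    # nearest-neighbor resize
    cols = np.arange(w) * img.shape[1] // w
    img = img[rows][:, cols]
    i = rng.randrange(h - out + 1)             # 3. random crop
    j = rng.randrange(w - out + 1)
    return img[i:i + out, j:j + out]

patch = augment(np.zeros((48, 48, 3), dtype=np.uint8))
assert patch.shape == (32, 32, 3)
```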

CIFAR-10 and CIFAR-100

The samples of the CIFAR-10 and CIFAR-100 datasets (Krizhevsky 2009) come from the 80 million tiny images dataset (Torralba, Fergus, and Freeman 2008). Each dataset contains 60,000 samples, constituting a very small portion of the 80 million. This is an ideal case for our methods, in which we can use the entire 80 million images as the unlabeled data. The CIFAR-10 dataset has 10 classes, and CIFAR-100 has 20 (coarse) or 100 (fine-grained) classes.

Tables 3 and 4 contain the results. The three numbers in each cell indicate training error, testing error and generalization gap. Bold numbers are the best ones for each case. The generalization gap is approximated by the difference between testing and training errors. All the models use unlabeled data with probability p = 0.2.

We compare other single-model results on CIFAR-10 and CIFAR-100 (fine-grained case) in table 5. It shows that our network is competitive with the state of the art. Although (Graham 2014) has the best results, we believe that applying our universum prescription methods to their model design could improve their results further.

Table 5: Comparison of single-model CIFAR-10 and CIFAR-100 results, in second and third columns. The fourth column indicates whether data augmentation is used for CIFAR-10. The numbers are percentages.

STL-10

Each image in the STL-10 dataset (Coates, Ng, and Lee 2011) has size 96-by-96. We downsampled the images to 32-by-32 in order to use the same model. The dataset contains a very small number of training samples - 5,000. The accompanying unlabeled data contain 100,000 samples. There is no guarantee that these unlabeled samples do not belong to the supervised classes (Coates, Ng, and Lee 2011), and therefore universum prescription failed. To verify that the extra data is the problem, an experiment using the 80 million tiny images as the unlabeled dataset is shown in tables 3 and 4. In this case the improvement is observed. Due to the long training times of our models, we did not perform the 10-fold training of the original paper (Coates, Ng, and Lee 2011).

One interesting observation is that the results on STL-10 became better with the use of the 80 million tiny images instead of the original extra data. This indicates that the dataset size and whether the universum assumption is satisfied both affect the effectiveness of universum prescription.

ImageNet

The ImageNet dataset (Russakovsky et al. 2015) for the classification task has in total 1,281,167 training images and 50,000 validation images. The reported testing errors are evaluated on this validation dataset. During training, we resize images to minimum dimension 64, and then feed a random 64-by-64 crop to the network. The same test-time augmentation technique as in (Szegedy et al. 2015) is applied, with size variants {64, 72, 80, 88}, where each image is viewed in 144 crops.

The extra data comes from the large ImageNet 2011 release 1, for which we keep only the classes such that neither the class itself nor any of its children belongs to the supervised classes. This is enabled by the super-subordinate (is-a) relation information provided with the WordNet distribution (Miller 1995), because all ImageNet classes are nouns in WordNet. Both top-1 and top-5 results are reported in tables 3 and 4.

1 http://www.image-net.org/releases
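The class filtering step can be sketched as a recursive check over the is-a hierarchy; the toy synset names below are our own, not real WordNet identifiers:

```python
def excluded_classes(hierarchy, supervised, classes):
    """Keep only classes that are not supervised and have no supervised
    descendant; `hierarchy` maps a class to its direct subordinates."""
    def has_supervised(node):
        return node in supervised or any(
            has_supervised(c) for c in hierarchy.get(node, ()))
    return [c for c in classes if not has_supervised(c)]

# toy is-a fragment (hypothetical names, not real WordNet synsets)
hierarchy = {"animal": ["dog", "fish"], "dog": ["terrier"], "tool": ["hammer"]}
supervised = {"terrier"}
kept = excluded_classes(hierarchy, supervised, ["animal", "dog", "fish", "tool"])
assert kept == ["fish", "tool"]   # "animal" and "dog" have a supervised child
```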

In all experiments dustbin class provides the best results. We believe that this is because the extra class is parameterized, which makes it adapt better to the unlabeled samples.

Table 6: ConvNet for the study of p

Effect of the Regularization Parameter

It is natural to ask how changing the probability p of sampling from unlabeled data affects the results. In this section we show the experiments. To prevent an exhaustive search over the regularization parameter from overfitting our models on the testing data, we use a different model for this section. It is described in table 6 and has 9 parameterized layers in total. The design is inspired by (Sermanet et al. 2013). For each choice of p we conducted 6 experiments combining the universum prescription models and dropout. The two dropout layers are added between the fully-connected layers with dropout probability 0.5. Figure 1 shows the results.

From figure 1 we can conclude that increasing p decreases the generalization gap. However, we cannot make p too large, since after a certain point the training collapses and both training and testing errors become worse. This confirms the assumptions and conclusions of theorem 2.

Comparing CIFAR-10/100 and STL-10, one conclusion is that the model variance is affected by the combined size of the labeled and unlabeled datasets. The variance of training and testing errors is extremely small on the CIFAR-10/100 datasets because the extra data we used is almost unlimited (in total 80 million), but on STL-10 the variance appears large, with a much smaller combined size of training and extra datasets. This suggests that using universum prescription with a large abundance of extra data could improve the stability of supervised learning algorithms.

Finally, the comparison between using and not using dropout does not show a difference. This suggests that the regularization effect of universum prescription alone is comparable to that of dropout.

Conclusion and Outlook

This article shows that universum prescription can be used to regularize a multi-class classification problem using extra unlabeled data. Two assumptions are made. One is that the loss used is negative log-likelihood, and the other is that the probability of a supervised sample existing in the unlabeled data is negligible. The loss assumption is a necessary detail rather than a limitation. The three universum prescription methods are uniform prescription, dustbin class and background class.

Figure 1: Experiments on the regularization parameter. The four rows are CIFAR-10, CIFAR-100 fine-grained, CIFAR-100 coarse and STL-10, respectively.

We further provided a theoretical justification. Theorem 2 suggests that asymptotically the generalization ability of the supervised problem could be bounded by the joint problem, which has more samples due to the addition of unlabeled data. Experiments are done using CIFAR-10, CIFAR-100, STL-10 and ImageNet datasets. The effect of the regularization parameter is also studied empirically.

These experiments show that all three universum prescription methods provide a certain improvement in the generalization gap, whereas dustbin class consistently performs the best, because the parameterized extra class can adapt better to the unlabeled samples. Further conclusions include that additional unlabeled data can reduce the variance of models during training, and that the results are comparable to data-agnostic regularization using dropout.

In the future, we hope to apply these methods to a broader range of problems.

Acknowledgments

We gratefully acknowledge NVIDIA Corporation for the donation of 2 Tesla K40 GPUs used for this research. Sainbayar Sukhbaatar offered many useful comments. Aditya Ramesh and Junbo Zhao helped cross-check the proofs.

References

Bartlett, P. L., and Mendelson, S. 2003. Rademacher and gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research 3:463-482.

Bengio, Y., and LeCun, Y. 2007. Scaling learning algorithms towards ai. In Bottou, L.; Chapelle, O.; DeCoste, D.; and Weston, J., eds., Large-Scale Kernel Machines . MIT Press.

Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on 35(8):1798-1828.

Chapelle, O.; Agarwal, A.; Sinz, F. H.; and Schölkopf, B. 2007. An analysis of inference with the universum. In Advances in neural information processing systems, 1369-1376.

Chapelle, O.; Schölkopf, B.; and Zien, A. 2006. A Discussion of Semi-Supervised Learning and Transduction. MIT Press. 473-478.

Coates, A.; Ng, A. Y.; and Lee, H. 2011. An analysis of single-layer networks in unsupervised feature learning. In International conference on artificial intelligence and statistics , 215-223.

Goodfellow, I.; Warde-farley, D.; Mirza, M.; Courville, A.; and Bengio, Y. 2013. Maxout networks. In Proceedings of the 30th International Conference on Machine Learning (ICML-13) , 1319-1327.

Graham, B. 2014. Spatially-sparse convolutional neural networks. CoRR abs/1409.6070.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. arXiv preprint arXiv:1502.01852 .

Hinton, G. E.; Osindero, S.; and Teh, Y.-W. 2006. A fast learning algorithm for deep belief nets. Neural computation 18(7):1527-1554.

Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58(301):13-30.

Koltchinskii, V., and Panchenko, D. 2000. Rademacher processes and bounding the risk of function learning. In High dimensional probability II . Springer. 443-457.

Koltchinskii, V. 2001. Rademacher penalties and structural risk minimization. Information Theory, IEEE Transactions on 47(5):1902-1914.

Krizhevsky, A. 2009. Learning multiple layers of features from tiny images.

LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278-2324.

LeCun, Y.; Chopra, S.; Hadsell, R.; Ranzato, M.; and Huang, F. 2006. A tutorial on energy-based learning. In Bakir, G.; Hofman, T.; Schölkopf, B.; Smola, A.; and Taskar, B., eds., Predicting Structured Data. MIT Press.

Lin, M.; Chen, Q.; and Yan, S. 2013. Network in network. CoRR abs/1312.4400.

Miller, G. A. 1995. WordNet: a lexical database for English. Communications of the ACM 38(11):39-41.

Oneto, L.; Anguita, D.; Ghio, A.; and Ridella, S. 2011. The impact of unlabeled patterns in rademacher complexity theory for kernel classifiers. In Advances in Neural Information Processing Systems , 585-593.

Polyak, B. 1964. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4(5):1 - 17.

Ranzato, M.; Poultney, C.; Chopra, S.; and LeCun, Y. 2006. Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems (NIPS 2006), volume 19. MIT Press.

Rasmus, A.; Berglund, M.; Honkala, M.; Valpola, H.; and Raiko, T. 2015. Semi-supervised learning with ladder networks. In Cortes, C.; Lawrence, N. D.; Lee, D. D.; Sugiyama, M.; and Garnett, R., eds., Advances in Neural Information Processing Systems 28 . Curran Associates, Inc. 3546-3554.

Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3):211-252.

Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; and LeCun, Y. 2013. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR abs/1312.6229.

Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929-1958.

Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR) .

Thrun, S., and Pratt, L., eds. 1998. Learning to Learn . Norwell, MA, USA: Kluwer Academic Publishers.

Torralba, A.; Fergus, R.; and Freeman, W. T. 2008. 80 million tiny images: A large data set for nonparametric object and scene recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on 30(11):1958-1970.

Vapnik, V. N., and Chervonenkis, A. Y. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications 16(2):264-280.

Vapnik, V. N. 1995. The Nature of Statistical Learning Theory . New York, NY, USA: Springer-Verlag New York, Inc.

Zhang, X. 2013. PAC-learning for energy-based models. Master's thesis, Computer Science Department, Courant Institute of Mathematical Sciences, New York University.

Supplemental: Proof of Theorem 2

This supplemental material shares the bibliography of the main paper. As an outline of our proof, we first establish a relation between R m ( F , D ) and R m ( F , (1 -p ) D + p U ) , and then another relation between R n ( F , D ) and R m ( F , D ) . The first part requires the following lemmas.

Lemma 1 (Separation of dataset on empirical Rademacher complexity). Let S be a dataset of size m. If S_1 and S_2 are two non-overlapping subsets of S such that |S_1| = m - i, |S_2| = i and S_1 ∪ S_2 = S, then the following two inequalities hold

Proof. Let (x_j, y_j) ∈ S_1 for j = 1, 2, ..., m - i and (x_j, y_j) ∈ S_2 for j = m - i + 1, m - i + 2, ..., m. Denote by N the discrete uniform distribution on {1, -1}. We can derive, by the convexity of the supremum and the symmetry of N,

Lemma 2 (Sample size inequality for Rademacher complexity). Assume 0 ≤ n ≤ m. If |S_n| = n, |S_m| = m and S_m = S_n ∪ {x_{n+1}, x_{n+2}, ..., x_m}, then

and

Proof. First of all, it is obvious that inequality 20 can be established by mathematical induction if we have m R_m(F, D) ≤ (m+1) R_{m+1}(F, D) for all m ≥ 0. To prove this, we first establish that if S_m = {x_1, x_2, ..., x_m} and S_{m+1} = {x_1, x_2, ..., x_m, x_{m+1}} (i.e., S_{m+1} = S_m ∪ {x_{m+1}}), then m R̂_{S_m}(F) ≤ (m+1) R̂_{S_{m+1}}(F), which also establishes inequality 19.

For any η_m = (η_1, η_2, ..., η_m) and η_{m+1} = (η_1, η_2, ..., η_m, η_{m+1}), that is, η_{m+1} = η_m ∪ {η_{m+1}}, let f_0 = argmax_{f ∈ F} Σ_{i=1}^{m} η_i f(x_i). By the definition of supremum, we have, for either sign of η_{m+1},

$$ \sum_{i=1}^{m} \eta_i f_0(x_i) \pm f_0(x_{m+1}) \leq \sup_{f \in \mathcal{F}} \left[ \sum_{i=1}^{m} \eta_i f(x_i) \pm f(x_{m+1}) \right]. $$

Taking expectation over η_{m+1}, by the symmetry of distribution N, we obtain

$$ \sum_{i=1}^{m} \eta_i f_0(x_i) \leq \mathbb{E}_{\eta_{m+1}} \left[ \sup_{f \in \mathcal{F}} \sum_{i=1}^{m+1} \eta_i f(x_i) \right]. $$

By the definition of R̂_{S_m}(F), the inequality above implies m R̂_{S_m}(F) ≤ (m+1) R̂_{S_{m+1}}(F). Then, by taking expectation over S_{m+1}, we obtain m R_m(F, D) ≤ (m+1) R_{m+1}(F, D).

The lemma can therefore be easily established by mathematical induction.

Theorem 3 (Relation of Rademacher complexities in distribution mixture). If p ≤ 0.5, then

Proof. For any function space F and distribution D, denote R_0(F, D) = 0 and R̂_∅(F) = 0. By the definition of Rademacher complexity and lemma 1, we get

The proof proceeds by handling the three parts on the right-hand side of the inequality above separately.

The second part can also be handled using lemma 2; it is essentially upper-bounded by the first part. By the fact that i ≤ m - i for 0 ≤ i ≤ ⌊m/2⌋, we obtain

Therefore, using the first part, we achieve

The proof for theorem 3 is therefore concluded.

Theorem 4 (Concentration inequality of subset Rademacher complexity). Assume in solving the joint problem we obtained m independently and identically distributed samples. Let the random number n represent the number of supervised samples obtained among these m joint samples, where each sample is supervised with probability 1 - p. Then, with probability at least 1 - δ, the following holds

for large enough m such that

Proof. Using lemma 2, we only need to prove an upper bound for m/n. Since we know that n follows a binomial distribution with mean (1 - p)m, using Hoeffding's inequality (Hoeffding 1963) (Serfling 1974), we can obtain

or, put differently,

The inequality is obtained by setting δ = exp(-2ε²m). The proof assumes that m is large enough such that

As a result, theorem 2 can be obtained by directly combining theorem 3 and theorem 4.
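The Hoeffding step above is easy to verify empirically; a short simulation (our illustration, with toy values of m, p and δ) checks that n rarely falls below (1 - p - ε)m:

```python
import math
import random

# n ~ Binomial(m, 1 - p), and Hoeffding's inequality gives
# Pr[n <= (1 - p - eps) * m] <= exp(-2 * eps^2 * m) = delta.
m, p, delta = 2000, 0.2, 0.01
eps = math.sqrt(math.log(1 / delta) / (2 * m))

rng = random.Random(0)
trials = 500
violations = sum(
    sum(rng.random() >= p for _ in range(m)) <= (1 - p - eps) * m
    for _ in range(trials)
)
assert violations / trials <= delta   # the tail event is indeed rare
```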

| LAYERS | DESCRIPTION |
| --- | --- |
| 1-3 | Conv 256x3x3 |
| 4 | Pool 2x2 |
| 5-8 | Conv 512x3x3 |
| 9 | Pool 2x2 |
| 10-13 | Conv 1024x3x3 |
| 14 | Pool 2x2 |
| 15-18 | Conv 1024x3x3 |
| 19 | Pool 2x2 |
| 20-23 | Conv 2048x3x3 |
| 24 | Pool 2x2 |
| LAYERS | DESCRIPTION |
| --- | --- |
| 1-3 | Conv 128x3x3 |
| 4 | Pool 2x2 |
| 5-7 | Conv 256x3x3 |
| 8 | Pool 2x2 |
| 9-11 | Conv 512x3x3 |
| 12 | Pool 2x2 |
| 13-15 | Conv 1024x3x3 |
| 16 | Pool 2x2 |
| 17-19 | Conv 2048x3x3 |
| 20 | Pool 2x2 |
| DATASET | BASELINE Train | BASELINE Test | BASELINE Gap | UNIFORM Train | UNIFORM Test | UNIFORM Gap |
| --- | --- | --- | --- | --- | --- | --- |
| CIFAR-10 | 0.00 | 7.02 | 7.02 | 0.72 | 7.59 | 6.87 |
| CIFAR-100 F. | 0.09 | 37.58 | 37.49 | 4.91 | 36.23 | 31.32 |
| CIFAR-100 C. | 0.04 | 22.74 | 22.70 | 0.67 | 23.42 | 22.45 |
| STL-10 | 0.00 | 31.16 | 31.16 | 2.02 | 36.54 | 34.52 |
| STL-10 Tiny | 0.00 | 31.16 | 31.16 | 0.62 | 30.15 | 29.47 |
| ImageNet-1 | 10.19 | 34.39 | 24.20 | 13.84 | 34.61 | 20.77 |
| ImageNet-5 | 1.62 | 13.68 | 12.06 | 3.02 | 13.70 | 10.68 |
| DATASET | DUSTBIN Train | DUSTBIN Test | DUSTBIN Gap | BACKGROUND Train | BACKGROUND Test | BACKGROUND Gap |
| --- | --- | --- | --- | --- | --- | --- |
| CIFAR-10 | 0.07 | 6.66 | 6.59 | 1.35 | 8.38 | 7.03 |
| CIFAR-100 F. | 2.52 | 32.84 | 30.32 | 8.56 | 40.57 | 32.01 |
| CIFAR-100 C. | 0.40 | 20.45 | 20.05 | 3.73 | 24.97 | 21.24 |
| STL-10 | 3.03 | 36.58 | 33.55 | 14.89 | 38.95 | 24.06 |
| STL-10 Tiny | 0.00 | 27.96 | 27.96 | 0.11 | 30.38 | 30.27 |
| ImageNet-1 | 13.80 | 33.67 | 19.87 | 13.43 | 34.69 | 21.26 |
| ImageNet-5 | 2.83 | 13.35 | 10.52 | 2.74 | 13.84 | 11.10 |
| REF. | CIFAR-10 | CIFAR-100 | AUG. |
| --- | --- | --- | --- |
| (Graham 2014) | 6.28 | 24.30 | YES |
| (ours) | 6.66 | 32.84 | YES |
| (Lee et al. 2015) | 7.97 | 34.57 | YES |
| (Lin, Chen, and Yan 2013) | 8.81 | 35.68 | YES |
| (Goodfellow et al. 2013) | 9.38 | 38.57 | YES |
| (Wan et al. 2013) | 11.1 | N/A | NO |
| (Zeiler and Fergus 2013) | 15.13 | 42.51 | NO |
| LAYERS | DESCRIPTION |
| --- | --- |
| 1 | Conv 1024x5x5 |
| 2 | Pool 2x2 |
| 3 | Conv 1024x5x5 |
| 4-7 | Conv 1024x3x3 |
| 8 | Pool 2x2 |
| 9-11 | Full 2048 |

$$ L(h, x, y) = h_y (x) + \log \left[ \sum_{i = 1}^{k} \exp(-h_i (x)) \right]. $$

$$ \Pr_{(X,Y) \sim \mathbf{U}}[Y \in \{1, 2, \dots, k\}] \approx 0. $$

$$ \mathfrak{R}_m (\mathcal{F}, \mathbf{D}) = \underset{S \sim \mathbf{D}^m}{\E} [\hat{\mathfrak{R}}_S(\mathcal{F})] $$

$$ \begin{aligned} \hat{\mathfrak{R}}_S(\mathcal{F}) & = \underset{\boldsymbol{\eta}}{\E} \left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \eta_i f(x_i) \right] \\ & \geq \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \underset{\eta_i}{\E} [\eta_i] f(x_i) = 0. \end{aligned} $$

$$ \begin{aligned} & \underset{(x,y) \sim \mathbf{D}}{\E}[\mathcal{E}(h,x,y)] \leq \\ & \frac{1}{m} \sum_{(x,y) \in S} \mathcal{E}(h,x,y) + 2 \mathfrak{R}_m(\mathcal{F}, \mathbf{D}) + M\sqrt{\frac{\log{\frac{2}{\delta}}}{2m}}, \end{aligned} \tag{14} $$

$$ \mathcal{F} = \left\{ \mathcal{E}(h,x,y) \,|\, h \in \mathcal{H} \right\}. $$

$$ \begin{aligned} & \mathfrak{R}_n(\mathcal{F}, \mathbf{D}) \leq \\ & \frac{2-p}{(1-p)\left(1-p - \sqrt{\frac{\log(1/\delta)}{2m}}\right)} \mathfrak{R}_m(\mathcal{F}, (1-p)\mathbf{D} + p \mathbf{U}), \end{aligned} \tag{16} $$

$$ 1-p-\sqrt{\frac{\log(1/\delta)}{2m}} > 0. $$

$$ \hat{\mathfrak{R}}_S (\mathcal{F}) \leq \frac{m - i}{m} \hat{\mathfrak{R}}_{S_1} (\mathcal{F}) + \frac{i}{m} \hat{\mathfrak{R}}_{S_2} (\mathcal{F}). $$

$$ \begin{aligned} \hat{\mathfrak{R}}_{S} (\mathcal{F}) & = \underset{\boldsymbol{\eta} \sim \mathbf{N}^{m}}{\E} \left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{j=1}^{m} \eta_j f(x_j) \right] \\ & = \frac{2}{m} \underset{\boldsymbol{\eta} \sim \mathbf{N}^{m}}{\E} \left[ \sup_{f \in \mathcal{F}} \left( \frac{1}{2} \sum_{j=1}^{m - i} \eta_j f(x_j) + \frac{1}{2} \sum_{j=m-i+1}^{m} \eta_j f(x_j) \right) \right] \\ & \leq \frac{2}{m} \underset{\boldsymbol{\eta} \sim \mathbf{N}^{m}}{\E} \left[ \frac{1}{2} \sup_{f \in \mathcal{F}} \left( \sum_{j=1}^{m - i} \eta_j f(x_j) \right) + \frac{1}{2} \sup_{f \in \mathcal{F}} \left( \sum_{j=m-i+1}^{m} \eta_j f(x_j) \right) \right] \\ & = \frac{m - i}{m} \underset{\boldsymbol{\eta} \sim \mathbf{N}^{m - i}}{\E} \left[ \sup_{f \in \mathcal{F}} \frac{1}{m-i} \sum_{j=1}^{m - i} \eta_j f(x_j) \right] + \frac{i}{m} \underset{\boldsymbol{\eta} \sim \mathbf{N}^{i}}{\E} \left[ \sup_{f \in \mathcal{F}} \frac{1}{i} \sum_{j=m-i+1}^{m} \eta_j f(x_j) \right] \\ & = \frac{m - i}{m} \hat{\mathfrak{R}}_{S_1} (\mathcal{F}) + \frac{i}{m} \hat{\mathfrak{R}}_{S_2} (\mathcal{F}). \end{aligned} $$

$$ \begin{aligned} \sup_{f \in \mathcal{F}} \sum_{i = 1}^{m + 1} \eta_i f(x_i) & \geq \sum_{i = 1}^{m + 1} \eta_i f_0 (x_i) \\ & = \sum_{i = 1}^{m} \eta_i f_0 (x_i) + \eta_{m+1} f_0 (x_{m+1}) \\ & = \sup_{f \in \mathcal{F}} \sum_{i = 1}^{m} \eta_i f(x_i) + \eta_{m+1} f_0(x_{m+1}). \end{aligned} $$

$$ \begin{aligned} & \underset{\boldsymbol{\eta}_{m+1} \sim \mathbf{N}^{m+1}}{\E} \left[ \sup_{f \in \mathcal{F}} \sum_{i = 1}^{m + 1} \eta_i f(x_i) \right] \\ & \geq \underset{\boldsymbol{\eta}_{m+1} \sim \mathbf{N}^{m+1}}{\E} \left[ \sup_{f \in \mathcal{F}} \sum_{i = 1}^{m} \eta_i f(x_i) + \eta_{m+1} f_0(x_{m+1}) \right] \\ & = \underset{\boldsymbol{\eta}_{m} \sim \mathbf{N}^{m}}{\E} \left[ \sup_{f \in \mathcal{F}} \sum_{i = 1}^{m} \eta_i f(x_i) \right] + \underset{\eta_{m + 1} \sim \mathbf{N}}{\E} \left[ \eta_{m+1} \right] f_0(x_{m+1}) \\ & = \underset{\boldsymbol{\eta}_{m} \sim \mathbf{N}^{m}}{\E} \left[ \sup_{f \in \mathcal{F}} \sum_{i = 1}^{m} \eta_i f(x_i) \right]. \end{aligned} $$

$$ \begin{aligned} (m + 1) \mathfrak{R}_{m + 1} (\mathcal{F}, \mathbf{D}) & = \underset{S_{m+1} \sim \mathbf{D}^{m+1}}{\E} \left[ (m+1) \hat{\mathfrak{R}}_{S_{m+1}} (\mathcal{F}) \right] \\ & \geq \underset{S_{m} \sim \mathbf{D}^{m}}{\E} \left[ m \hat{\mathfrak{R}}_{S_{m}} (\mathcal{F}) \right] = m \mathfrak{R}_{m} (\mathcal{F}, \mathbf{D}). \end{aligned} $$

$$ \begin{aligned} & \mathfrak{R}_m(\mathcal{F}, \mathbf{D}) = \mathfrak{R}_m(\mathcal{F}, (1-p)\mathbf{D} + p \mathbf{D}) \\ & = \underset{S \sim ((1-p)\mathbf{D} + p \mathbf{D})^m}{\E} \left[ \hat{\mathfrak{R}}_S(\mathcal{F}) \right] \\ & = \sum_{i=0}^m \binom{m}{i} (1-p)^i p^{m-i} \underset{S_1 \sim \mathbf{D}^{i}}{\E} \left[ \underset{S_2 \sim \mathbf{D}^{m-i}}{\E} \left[ \hat{\mathfrak{R}}_{S_1 \cup S_2}(\mathcal{F}) \right] \right] \\ & \leq \sum_{i=0}^m \binom{m}{i} (1-p)^{i} p^{m-i} \underset{S_1 \sim \mathbf{D}^{i}}{\E} \left[ \underset{S_2 \sim \mathbf{D}^{m-i}}{\E} \left[ \frac{i}{m} \hat{\mathfrak{R}}_{S_1} (\mathcal{F}) + \frac{m-i}{m} \hat{\mathfrak{R}}_{S_2} (\mathcal{F}) \right] \right] \\ & = \sum_{i=0}^m \binom{m}{i} (1-p)^{i} p^{m-i} \left( \underset{S_1 \sim \mathbf{D}^{i}}{\E} \left[ \frac{i}{m} \hat{\mathfrak{R}}_{S_1} (\mathcal{F}) \right] + \underset{S_2 \sim \mathbf{D}^{m-i}}{\E} \left[ \frac{m-i}{m} \hat{\mathfrak{R}}_{S_2} (\mathcal{F}) \right] \right) \\ & = \sum_{i=0}^m \binom{m}{i} (1-p)^{i} p^{m-i} \left[ \frac{i}{m} \mathfrak{R}_{i} (\mathcal{F}, \mathbf{D}) + \frac{m-i}{m} \mathfrak{R}_{m-i} (\mathcal{F}, \mathbf{D}) \right] \\ & = \left[ \sum_{i=0}^m \binom{m}{i} (1-p)^{i} p^{m-i} \frac{i}{m} \mathfrak{R}_{i} (\mathcal{F}, \mathbf{D}) \right] + \left[ \sum_{i=0}^m \binom{m}{i} (1-p)^{m-i} p^i \frac{i}{m} \mathfrak{R}_{i} (\mathcal{F}, \mathbf{D}) \right] \\ & = \left[ \sum_{i=0}^m \binom{m}{i} (1-p)^{i} p^{m-i} \frac{i}{m} \mathfrak{R}_{i} (\mathcal{F}, \mathbf{D}) \right] + \left[ \sum_{i=0}^{\lfloor m / 2 \rfloor} \binom{m}{i} (1-p)^{m-i} p^i \frac{i}{m} \mathfrak{R}_{i} (\mathcal{F}, \mathbf{D}) \right] \\ & \quad + \left[ \sum_{i=\lfloor m / 2 \rfloor + 1}^{m} \binom{m}{i} (1-p)^{m-i} p^i \frac{i}{m} \mathfrak{R}_{i} (\mathcal{F}, \mathbf{D}) \right]. \end{aligned} $$

$$ \begin{aligned} & \mathfrak{R}_m(\mathcal{F}, (1-p)\mathbf{D} + p \mathbf{U}) \\ & = \sum_{i=0}^m \binom{m}{i} (1-p)^{i} p^{m-i} \underset{S_1 \sim \mathbf{D}^{i}}{\E} \left[ \underset{S_2 \sim \mathbf{U}^{m-i}}{\E} \left[ \hat{\mathfrak{R}}_{S_1 \cup S_2}(\mathcal{F}) \right] \right] \\ & \geq \sum_{i=0}^m \binom{m}{i} (1-p)^{i} p^{m - i} \underset{S_1 \sim \mathbf{D}^{i}}{\E} \left[ \underset{S_2 \sim \mathbf{U}^{m-i}}{\E} \left[ \frac{i}{m} \hat{\mathfrak{R}}_{S_1}(\mathcal{F}) \right] \right] \\ & = \sum_{i=0}^m \binom{m}{i} (1-p)^{i} p^{m - i} \underset{S_1 \sim \mathbf{D}^{i}}{\E} \left[ \frac{i}{m} \hat{\mathfrak{R}}_{S_1}(\mathcal{F}) \right] \\ & = \sum_{i=0}^m \binom{m}{i} (1-p)^{i} p^{m - i} \frac{i}{m} \mathfrak{R}_{i}(\mathcal{F}, \mathbf{D}). \end{aligned} $$

$$ \begin{aligned} & \sum_{i=0}^{\lfloor m / 2 \rfloor} \binom{m}{i} (1-p)^{m-i} p^i \frac{i}{m} \mathfrak{R}_{i} (\mathcal{F}, \mathbf{D}) \\ & \leq \sum_{i=0}^{\lfloor m / 2 \rfloor} \binom{m}{i} (1-p)^{m-i} p^i \frac{m - i}{m} \mathfrak{R}_{m-i} (\mathcal{F}, \mathbf{D}) \\ & = \sum_{i=m-\lfloor m / 2 \rfloor}^{m} \binom{m}{i} (1-p)^{i} p^{m-i} \frac{i}{m} \mathfrak{R}_{i} (\mathcal{F}, \mathbf{D}) \\ & \leq \sum_{i=0}^{m} \binom{m}{i} (1-p)^{i} p^{m-i} \frac{i}{m} \mathfrak{R}_{i} (\mathcal{F}, \mathbf{D}) \\ & \leq \mathfrak{R}_m(\mathcal{F}, (1-p)\mathbf{D} + p \mathbf{U}). \end{aligned} $$

$$ (1-p)^{m-i} p^i \leq \frac{p}{1-p} (1-p)^{i} p^{m-i}. $$

$$ \begin{aligned} & \sum_{i=\lfloor m / 2 \rfloor + 1}^{m} \binom{m}{i} (1-p)^{m-i} p^i \frac{i}{m} \mathfrak{R}_{i} (\mathcal{F}, \mathbf{D}) \\ & \leq \sum_{i=\lfloor m / 2 \rfloor + 1}^{m} \binom{m}{i} \frac{p}{1-p} (1-p)^{i} p^{m-i} \frac{i}{m} \mathfrak{R}_{i} (\mathcal{F}, \mathbf{D}) \\ & = \frac{p}{1-p} \sum_{i=\lfloor m / 2 \rfloor + 1}^{m} \binom{m}{i} (1-p)^{i} p^{m-i} \frac{i}{m} \mathfrak{R}_{i} (\mathcal{F}, \mathbf{D}) \\ & \leq \frac{p}{1-p} \sum_{i=0}^{m} \binom{m}{i} (1-p)^{i} p^{m-i} \frac{i}{m} \mathfrak{R}_{i} (\mathcal{F}, \mathbf{D}) \\ & \leq \frac{p}{1-p} \mathfrak{R}_m(\mathcal{F}, (1-p)\mathbf{D} + p \mathbf{U}). \end{aligned} $$

$$ \begin{aligned} & \mathfrak{R}_m(\mathcal{F}, \mathbf{D}) \\ & \leq \left[ \sum_{i=0}^m \binom{m}{i} (1-p)^{i} p^{m-i} \frac{i}{m} \mathfrak{R}_{i} (\mathcal{F}, \mathbf{D}) \right] + \left[ \sum_{i=0}^{\lfloor m / 2 \rfloor} \binom{m}{i} (1-p)^{m-i} p^i \frac{i}{m} \mathfrak{R}_{i} (\mathcal{F}, \mathbf{D}) \right] \\ & \quad + \left[ \sum_{i=\lfloor m / 2 \rfloor + 1}^{m} \binom{m}{i} (1-p)^{m-i} p^i \frac{i}{m} \mathfrak{R}_{i} (\mathcal{F}, \mathbf{D}) \right] \\ & \leq \left( 1 + 1 + \frac{p}{1-p} \right) \mathfrak{R}_m(\mathcal{F}, (1-p)\mathbf{D} + p \mathbf{U}) = \frac{2 - p}{1-p} \mathfrak{R}_m(\mathcal{F}, (1-p)\mathbf{D} + p \mathbf{U}). \end{aligned} $$

$$ \Pr\left[n \leq (1-p-\epsilon) m\right] \leq \exp(-2 \epsilon^2 m), $$


Lemma 1 (Separation of dataset on empirical Rademacher complexity). Let $S$ be a dataset of size $m$. If $S_1$ and $S_2$ are two non-overlapping subsets of $S$ such that $|S_1| = m - i$, $|S_2| = i$ and $S_1 \cup S_2 = S$, then the following holds $$ \hat{\mathfrak{R}}_S (\mathcal{F}) \leq \frac{m - i}{m} \hat{\mathfrak{R}}_{S_1} (\mathcal{F}) + \frac{i}{m} \hat{\mathfrak{R}}_{S_2} (\mathcal{F}). $$

Lemma 2 (Sample size inequality for Rademacher complexity). Assume $0 \leq n \leq m$. If $|S_n| = n$, $|S_m| = m$ and $S_m = S_n \cup \{x_{n+1}, x_{n+2}, \dots, x_m\}$, then $$ n \hat{\mathfrak{R}}_{S_n} (\mathcal{F}) \leq m \hat{\mathfrak{R}}_{S_m} (\mathcal{F}), $$ and $$ n \mathfrak{R}_n (\mathcal{F}, \mathbf{D}) \leq m \mathfrak{R}_m (\mathcal{F}, \mathbf{D}). $$

Definition (Empirical Rademacher complexity). Let $\mathcal{F}$ be a family of functions mapping from $\mathcal{X}$ to $\mathbb{R}$, and $S = (x_1, x_2, \dots, x_m)$ a fixed sample of size $m$ with elements in $\mathcal{X}$. Then, the empirical Rademacher complexity of $\mathcal{F}$ with respect to the sample $S$ is defined as $$ \hat{\mathfrak{R}}_S(\mathcal{F}) = \underset{\boldsymbol{\eta}}{\E} \left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \eta_i f(x_i) \right], $$ where $\boldsymbol{\eta} = (\eta_1, \dots, \eta_m)^T$, with the $\eta_i$'s independent random variables uniformly distributed on $\{-1, 1\}$.

Definition (Rademacher complexity). Let $\mathbf{D}$ denote the distribution from which the samples were drawn. For any integer $m \geq 1$, the Rademacher complexity of $\mathcal{F}$ is the expectation of the empirical Rademacher complexity over all samples of size $m$ drawn according to $\mathbf{D}$: $$ \mathfrak{R}_m (\mathcal{F}, \mathbf{D}) = \underset{S \sim \mathbf{D}^m}{\E} [\hat{\mathfrak{R}}_S(\mathcal{F})]. $$

Proof. Let $(x_j, y_j) \in S_1$ for $j = 1, 2, \dots, m - i$ and $(x_j, y_j) \in S_2$ for $j = m - i + 1, m - i + 2, \dots, m$. Denote $\mathbf{N}$ as the discrete uniform distribution on $\{1, -1\}$. The result follows from the convexity of supremum and the symmetry of $\mathbf{N}$.





Here we briefly describe our three universum prescription methods. Uniform prescription forces a discrete uniform distribution for unlabeled samples. Dustbin class simply adds an extra class and prescribes all unlabeled data to this class. Background class also adds an extra class, but it uses a constant threshold to avoid parameterization.

Our work is a direct extension of learning in the presence of universum, which originated from (Vapnik 1998) and (Vapnik 2006). The definition of universum is a set of unlabeled data that are known not to belong to any of the classes but are in the same domain. We extended the idea of using universum from support vector machines to deep learning.

Most deep learning approaches utilizing unlabeled data belong to the scope of representation learning (reviewed by (?) and (?)) and transfer learning (?). They include ideas like pretraining (?) (?) (?) and semi-supervised training (?) (?). Universum prescription incorporates unlabeled data without imposing priors such as sparsity or reconstruction.

Regularization – techniques for the control of the generalization gap – has been studied extensively. Most approaches implement a secondary optimization objective, such as an $L_2$ norm. Some other methods such as dropout (?) cheaply simulate model averaging to control the model variance. As part of general statistical learning theory (?), (?), the justification for regularization is well-developed. We qualitatively justify the methods using Rademacher complexity (Bartlett and Mendelson 2003), similar to (?).

In this section we attempt to formalize the trick of prescribing "none of the above" labels – universum prescription. Consider the problem of exclusive $k$-way classification. In learning we hope to find a hypothesis function $h \in \mathcal{H}$ mapping to $\mathbb{R}^k$ so that the label is determined by $y = \mathrm{argmin}_{i}\, h_{i}(x)$. The following assumptions are made.

(Loss assumption) The loss used as the optimization objective is negative log-likelihood:

(Universum assumption) The proportion of unlabeled samples belonging to a supervised class is negligible.

The loss assumption assumes that the probability of class $y$ given an input $x$ can be thought of as

where $(X,Y) \sim \mathbf{D}$ and $\mathbf{D}$ is the distribution from which labeled data are sampled. We use lowercase letters for values, uppercase letters for random variables and bold uppercase letters for distributions. The loss assumption is simply a necessary detail rather than a limitation, in the sense that one can change the type of loss and use the same principles to derive different universum learning techniques.

The universum assumption implies that labeled classes form a negligible subset. In many practical cases we only care about a small number of classes, either by problem design or due to the high cost of the labeling process. At the same time, a very large amount of unlabeled data is easily obtained. Put mathematically, assuming we draw unlabeled data from distribution $\mathbf{U}$, the assumption states that

The universum assumption is opposite to the assumptions of information regularization (?) and transduction learning (?) (?). It has similarities with (?), which encourages diversity of outputs for ensemble methods. All our methods discussed below prescribe agnostic targets to the unlabeled data. During learning, we randomly present an unlabeled sample to the optimization procedure with probability $p$.
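The sampling step can be sketched as follows; this is a minimal illustration rather than the paper's code, and the helper and variable names (`sample_one`, `labeled`, `unlabeled`) are ours:

```python
import random

def sample_one(labeled, unlabeled, p, rng=None):
    """With probability p, present an unlabeled (universum) sample to the
    optimizer; otherwise present a labeled (x, y) pair. A sketch only."""
    rng = rng or random.Random(0)
    if rng.random() < p:
        return rng.choice(unlabeled), None  # target prescribed by the chosen method
    return rng.choice(labeled)              # supervised (x, y) pair
```

With `p = 0` this reduces to ordinary supervised sampling.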

It is known that negative log-likelihood is simply a reduced form of cross-entropy

in which the target probability $Q[Y=y|x] = 1$ and $Q[Y=i|x] = 0$ for $i \neq y$. Under the universum assumption, if we are presented with an unlabeled sample $x$, we would hope to prescribe some $Q$ so that every class has some equally minimal probability. $Q$ also has to satisfy $\sum_{i=1}^{k} Q[Y=i|x] = 1$ by the probability axioms. The only possible choice for $Q$ is then $Q[Y|x] = 1/k$. The learning algorithm then uses the cross-entropy loss instead of negative log-likelihood.
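To make the prescription concrete, here is a small numerical sketch (ours, not the authors' released code) of the resulting loss, where class probabilities are the softmax of the negated energies $h$:

```python
import math

def uniform_prescription_loss(h, y=None):
    """Cross-entropy loss with uniform universum prescription (a sketch).

    h: list of k energies; P[Y=i|x] is the softmax of -h.
    y: class index for a labeled sample, or None for an unlabeled one,
       in which case the prescribed target is the uniform Q[Y=i|x] = 1/k.
    """
    k = len(h)
    log_z = math.log(sum(math.exp(-hi) for hi in h))  # log partition function
    log_p = [-hi - log_z for hi in h]                 # log P[Y=i|x]
    if y is not None:
        return -log_p[y]                 # negative log-likelihood
    return -sum(log_p) / k               # cross-entropy against uniform Q
```

For a labeled sample this coincides with the loss $L(h,x,y) = h_y(x) + \log \sum_i \exp(-h_i(x))$ stated earlier.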

It is worth noting that uniform output has the maximum entropy among all possible choices. If $h$ is parameterized as a deep neural network, uniform output is achieved when these parameters are constantly zero. Therefore, uniform prescription may have the effect of reducing the magnitude of parameters, similar to norm-based regularization.

Another way of prescribing an agnostic target is to append a "dustbin" class to the supervised task. This requires some changes to the hypothesis function $h$ such that it outputs $k+1$ targets. For deep learning models one can simply extend the last parameterized layer. All unlabeled data are prescribed to this extra "dustbin" class.

The effect of dustbin class is clearly seen in the loss function of an unlabeled sample $(x, k+1)$

The second term is a "soft" maximum over all dimensions of $-h$. With an unlabeled sample, the algorithm attempts to introduce smoothness by minimizing probability spikes.
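A sketch of the corresponding loss (illustrative only; we assume the model already outputs $k+1$ energies, with 0-based index $k$ for the dustbin class):

```python
import math

def dustbin_loss(h_ext, y):
    """Negative log-likelihood over k+1 classes (a sketch).

    h_ext: list of k+1 energies from a model whose last layer was widened
           by one output; unlabeled samples are prescribed the dustbin index.
    Matches L(h, x, y) = h_y(x) + log sum_i exp(-h_i(x)).
    """
    log_z = math.log(sum(math.exp(-hi) for hi in h_ext))
    return h_ext[y] + log_z

def dustbin_label(k):
    """Prescribed (0-based) label index for an unlabeled sample."""
    return k
```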

We could further simplify dustbin class by removing the parameters for class $k+1$. For some given threshold constant $\tau$, we could change the probability of a labeled sample to

and an unlabeled sample

We call this method background class and $\tau$ the background constant. Similar to dustbin class, the algorithm attempts to minimize the spikes of outputs, but limited to a certain extent by the inclusion of $\exp(-\tau)$ in the partition function. In our experiments $\tau$ is always set to 0.
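Background class can be sketched similarly (again our illustration, not the released implementation); the only change relative to dustbin class is a constant $\exp(-\tau)$ term in the partition function in place of a parameterized $(k+1)$-th energy:

```python
import math

def background_loss(h, y=None, tau=0.0):
    """Negative log-likelihood with a parameter-free background class (sketch).

    Labeled:   -log [ exp(-h_y)  / (exp(-tau) + sum_i exp(-h_i)) ]
    Unlabeled: -log [ exp(-tau)  / (exp(-tau) + sum_i exp(-h_i)) ]
    """
    z = math.exp(-tau) + sum(math.exp(-hi) for hi in h)
    if y is not None:
        return h[y] + math.log(z)
    return tau + math.log(z)
```

With $\tau = 0$, as in the experiments, an unlabeled sample simply pays $\log z$.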

In this part, we derive a qualitative justification for universum prescription using probably approximately correct (PAC) learning (?). By being "qualitative", the justification is in contrast with numerical bounds such as the Vapnik-Chervonenkis dimension (?) (VC-dim) and others. Our theory is based on Rademacher complexity (Bartlett and Mendelson 2003), similar to (?) where both dropout (?) and dropconnect (?) are justified. VC-dim is an upper bound of Rademacher complexity, suggesting that the latter is more accurate. Previous results on unlabeled data (?) (?) assume the same distribution for labeled and unlabeled data, which is impossible under the universum assumption.

Let $\mathcal{F}$ be a family of functions mapping from $\mathcal{X}$ to $\mathbb{R}$, and $S = (x_1, x_2, \dots, x_m)$ a fixed sample of size $m$ with elements in $\mathcal{X}$. Then, the empirical Rademacher complexity of $\mathcal{F}$ with respect to the sample $S$ is defined as:

where $\boldsymbol{\eta} = (\eta_1, \dots, \eta_m)^T$, with the $\eta_i$'s independent random variables uniformly distributed on $\{-1, 1\}$.

Let $\mathbf{D}$ denote the distribution from which the samples were drawn. For any integer $m \geq 1$, the Rademacher complexity of $\mathcal{F}$ is the expectation of the empirical Rademacher complexity over all samples of size $m$ drawn according to $\mathbf{D}$:

Secondly, if for a fixed input all functions in $\mathcal{F}$ output the same value, then its Rademacher complexity is 0. Assume for any $f \in \mathcal{F}$ we have $f(x) = f_0(x)$; then

Therefore, one way to minimize Rademacher complexity is to regularize functions in ℱℱ\mathcal{F} such that all functions tend to have the same output for a given input. Universum prescription precisely does that – the prescribed outputs for unlabeled data are all constantly the same.
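For a finite function family this can be checked directly with a small Monte Carlo estimator of the empirical Rademacher complexity (a sketch following the definition above; the function and variable names are ours):

```python
import random

def empirical_rademacher(fs, xs, trials=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of a
    finite function family `fs` on sample `xs`: the expectation over the
    Rademacher signs eta is approximated by averaging random sign vectors."""
    rng = random.Random(seed)
    m = len(xs)
    total = 0.0
    for _ in range(trials):
        etas = [rng.choice((-1, 1)) for _ in range(m)]
        # sup over the family of the signed empirical average
        total += max(sum(e * f(x) for e, x in zip(etas, xs)) / m for f in fs)
    return total / trials
```

A family whose members all produce the same outputs, e.g. the single zero function, yields exactly 0, matching the observation above.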

The principal PAC-learning result is a bound for functions with bounded outputs. We use the formulation by (?), but earlier results exist (?) (?) (?) (?).

For an energy function (LeCun et al. 2006) $\mathcal{E}(h,x,y)$ over hypothesis class $\mathcal{H}$, input set $\mathcal{X}$ and output set $\mathcal{Y}$, if it has lower bound 0 and upper bound $M > 0$, then with probability at least $1-\delta$, the following holds for all $h \in \mathcal{H}$:

where the function family $\mathcal{F}$ is defined as

$\mathbf{D}$ is the distribution for $(x,y)$, and $S$ is a sample set of size $m$ drawn identically and independently from $\mathbf{D}$.

The meaning of the theorem is two-fold. When applying the theorem to the joint problem of training using both labeled and unlabeled data, the third term on the right-hand side of inequality 14 is reduced by the augmentation of the extra data. The joint Rademacher complexity is written as $\mathfrak{R}_m(\mathcal{F}, (1-p)\mathbf{D} + p\mathbf{U})$, which is reduced when we prescribe constant outputs to unlabeled data.

The second fold is that when the theorem applies to the supervised distribution $\mathbf{D}$, we would hope that $\mathfrak{R}_n(\mathcal{F}, \mathbf{D})$ can be bounded by $\mathfrak{R}_m(\mathcal{F}, (1-p)\mathbf{D} + p\mathbf{U})$, where $n$ is the number of supervised samples randomly chosen by the joint problem. Note that the number $n$ follows a binomial distribution with mean $(1-p)m$. Such a bound can be achieved in a probable and approximate sense.

Assume we have a joint problem where $p\leq 0.5$ and there are $m$ random training samples from the joint distribution $(1-p)\mathbf{D}+p\mathbf{U}$. With probability at least $1-\delta$, the following holds

where $n$ is a random number indicating the number of supervised samples among the $m$ joint samples, and $m$ is large enough such that

We present the proof of theorem 2 in the supplemental material; it relies on Hoeffding's inequality (Hoeffding 1963) (Serfling 1974). The theorem tells us that the Rademacher complexity of the supervised problem can be bounded by that of the joint problem. The universum prescription algorithm attempts to make the Rademacher complexity of the joint problem small. Therefore, universum prescription improves generalization by incorporating unlabeled data.

However, theorem 2 requires $p\leq 0.5$; otherwise the bound is not achievable. Also, the value of $(2-p)/(1-p)^{2}$, the asymptotic constant factor in inequality 16 when $m$ is large, is monotonically increasing in $p$, with range $[2,6]$ for $p\leq 0.5$. These facts indicate that we need to keep $p$ small. The following sections show that there is improvement when $p$ is small, but that training and testing errors become worse when $p$ is large.
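The claimed range of the constant factor is easy to verify; the snippet below (ours, not from the paper) evaluates $(2-p)/(1-p)^{2}$ on a grid over $[0, 0.5]$ and confirms it grows monotonically from 2 to 6.

```python
# Asymptotic constant factor from inequality 16, as a function of p.
def factor(p):
    return (2.0 - p) / (1.0 - p) ** 2

ps = [i / 100.0 for i in range(51)]                # p in [0, 0.5]
vals = [factor(p) for p in ps]
assert vals[0] == 2.0 and vals[-1] == 6.0          # endpoints of the range [2, 6]
assert all(a < b for a, b in zip(vals, vals[1:]))  # monotonically increasing
```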

Finally, in terms of numerical asymptotics, theorem 2 suggests that $\mathfrak{R}_{n}(\mathcal{F},\mathbf{D})\leq\mathbf{O}(1/\sqrt{m})$, instead of the commonly known result $\mathfrak{R}_{n}(\mathcal{F},\mathbf{D})\leq\mathbf{O}(1/\sqrt{n})$. This bounds the supervised problem with a tighter asymptotic factor because there are more joint samples than supervised samples.

In this section we test the methods on several image classification tasks. Three series of datasets, CIFAR-10/100 (Krizhevsky 2009), STL-10 (Coates, Ng, and Lee 2011) and ImageNet (Russakovsky et al. 2015), are chosen due to the availability of unlabeled data. For the CIFAR-10/100 and STL-10 datasets, we used a 21-layer convolutional network (ConvNet) (LeCun et al. 1989) (LeCun et al. 1998), in which the inputs are 32-by-32 images and all convolutional layers are 3-by-3 and fully padded. For ImageNet, the model is a 17-layer ConvNet with 64-by-64 images as inputs. These models are inspired by (Simonyan and Zisserman 2014), in which all pooling layers are max-pooling, and ReLUs (Nair and Hinton 2010) are used as the non-linearity. Two dropout (Srivastava et al. 2014) layers of probability 0.5 are inserted before the final two linear layers.

The algorithm used is stochastic gradient descent with momentum (Polyak 1964) (Sutskever et al. 2013) 0.9 and a minibatch size of 32. The initial learning rate is 0.005, which is halved every 60,000 minibatch steps for CIFAR-10/100 and every 600,000 minibatch steps for ImageNet. Training stops at 400,000 steps for CIFAR-10/100 and STL-10, and at 2,500,000 steps for ImageNet. Tables 1 and 2 summarize the configurations. The weights are initialized in the same way as in (He et al. 2015). The following data augmentation steps are used.

(Horizontal flip.) Flip the image horizontally with probability 0.5.

(Scale.) Randomly scale the image between $1/1.2$ and $1.2$ times its height and width.

(Crop.) Randomly crop a 32-by-32 (or 64-by-64 for ImageNet) region in the scaled image.
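The three steps above can be sketched as follows. This is a minimal pure-Python version operating on a grayscale image stored as a list of rows; nearest-neighbor rescaling is our assumption, as the paper does not specify the interpolation.

```python
import random

def augment(img, out=32, rng=random):
    """Horizontal flip (p=0.5), random rescale in [1/1.2, 1.2]
    (nearest neighbor), and a random out-by-out crop."""
    h, w = len(img), len(img[0])
    if rng.random() < 0.5:                       # horizontal flip
        img = [row[::-1] for row in img]
    s = rng.uniform(1 / 1.2, 1.2)                # random scale factor
    nh, nw = max(out, int(h * s)), max(out, int(w * s))
    img = [[img[min(int(i / s), h - 1)][min(int(j / s), w - 1)]
            for j in range(nw)] for i in range(nh)]
    top = rng.randint(0, nh - out)               # random crop offsets
    left = rng.randint(0, nw - out)
    return [row[left:left + out] for row in img[top:top + out]]
```

For ImageNet the same pipeline would be used with `out=64`.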

The samples of the CIFAR-10 and CIFAR-100 datasets (Krizhevsky 2009) come from the 80 million tiny images dataset (Torralba, Fergus, and Freeman 2008). Each dataset contains 60,000 samples, constituting a very small portion of the 80 million. This is an ideal case for our methods, in which we can use the entire 80 million images as the unlabeled data. The CIFAR-10 dataset has 10 classes, and CIFAR-100 has 20 (coarse) or 100 (fine-grained) classes.

Tables 3 and 4 contain the results. The three numbers in each cell indicate training error, testing error and generalization gap. Bold numbers are the best ones for each case. The generalization gap is approximated by the difference between testing and training errors. All the models use unlabeled data with probability $p=0.2$.

We compare against other single-model results on CIFAR-10 and CIFAR-100 (fine-grained case) in table 5. It shows that our network is competitive with the state of the art. Although (Graham 2014) has the best results, we believe that applying our universum prescription methods to their model design could improve the results further.

Each image in the STL-10 dataset (Coates, Ng, and Lee 2011) has size 96-by-96. We downsampled them to 32-by-32 in order to use the same model. The dataset contains a very small number of training samples: 5,000. The accompanying unlabeled data contain 100,000 samples. There is no guarantee that these unlabeled samples do not belong to the supervised classes (Coates, Ng, and Lee 2011), and as a result universum prescription failed. To verify that the extra data is the problem, an experiment using the 80 million tiny images as the unlabeled dataset is shown in tables 3 and 4. In this case the improvement is observed. Due to the long training time of our models, we did not perform the 10-fold training of the original paper (Coates, Ng, and Lee 2011).

One interesting observation is that the results on STL-10 became better when using the 80 million tiny images instead of the original extra data. This indicates that the dataset size and whether the universum assumption is satisfied are both factors affecting the effectiveness of universum prescription.

The ImageNet dataset (Russakovsky et al. 2015) for the classification task has in total 1,281,167 training images and 50,000 validation images. The reported testing errors are evaluated on this validation dataset. During training, we resize images to a minimum dimension of 64 and then feed a random 64-by-64 crop to the network. The same test-time augmentation technique as in (Szegedy et al. 2015) is applied, with size variants {64, 72, 80, 88}, where each image is viewed in 144 crops.

The extra data come from the large ImageNet 2011 release (http://www.image-net.org/releases), for which we only keep the classes that themselves and whose children do not belong to the supervised classes. This is made possible by the super-subordinate (is-a) relation information provided with the WordNet distribution (Miller 1995), because all ImageNet classes are nouns of WordNet. Both top-1 and top-5 results are reported in tables 3 and 4.
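The filtering rule can be illustrated on a toy taxonomy. The snippet below is our sketch, with a made-up `isa` map standing in for WordNet's super-subordinate relation: an unlabeled class is kept only if neither it nor any of its subordinate (child) classes is a supervised class.

```python
# Made-up miniature child -> parent (is-a) taxonomy for illustration only.
isa = {"tabby": "cat", "cat": "feline", "feline": "animal",
       "sedan": "car", "car": "vehicle"}

def subtree(c):
    """c plus all of its descendants under the toy is-a relation."""
    out = {c}
    changed = True
    while changed:
        changed = False
        for child, parent in isa.items():
            if parent in out and child not in out:
                out.add(child)
                changed = True
    return out

def keep_for_universum(c, supervised):
    """Keep class c only if no supervised class lies in its subtree."""
    return not (subtree(c) & supervised)

assert not keep_for_universum("feline", {"cat"})  # "cat" is a child of "feline"
assert keep_for_universum("vehicle", {"cat"})
```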

In all experiments, dustbin class provides the best results. We believe this is because the extra class is parameterized, which lets it adapt better to the unlabeled samples.

It is natural to ask how changing the probability $p$ of sampling from unlabeled data affects the results. In this section we show the experiments. To prevent an exhaustive search over the regularization parameter from overfitting our models to the testing data, we use a different model for this section. It is described in table 6 and has 9 parameterized layers in total. The design is inspired by (?). For each choice of $p$ we conducted 6 experiments combining universum prescription models and dropout. The dropout layers are two layers added in between the fully-connected layers with dropout probability 0.5. Figure 1 shows the results.

From figure 1 we can conclude that increasing $p$ decreases the generalization gap. However, we cannot make $p$ too large, since past a certain point training collapses and both training and testing errors become worse. This confirms the assumptions and conclusions of theorem 2.
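The regularization parameter enters training only through minibatch construction. One plausible sketch (the names are ours), in which each sample independently comes from the unlabeled pool with probability $p$ and receives the universum prescription as its target:

```python
import random

def draw_batch(labeled, unlabeled, p, batch=32, rng=random):
    """Build a minibatch mixing supervised pairs (x, y) with unlabeled
    samples, each drawn from the unlabeled pool with probability p."""
    out = []
    for _ in range(batch):
        if rng.random() < p:
            out.append((rng.choice(unlabeled), "universum"))
        else:
            out.append(rng.choice(labeled))
    return out
```

Over many steps, the fraction of universum samples concentrates around $p$, matching the binomial behavior of $n$ in theorem 2.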

Comparing CIFAR-10/100 with STL-10, one conclusion is that the model variance is affected by the combined size of the labeled and unlabeled datasets. The variance of training and testing errors is extremely small on the CIFAR-10/100 datasets, because the extra data we used is almost unlimited (80 million in total), whereas on STL-10 the variance appears large, with a much smaller combined size of training and extra datasets. This suggests that using universum prescription with a large abundance of extra data could improve the stability of supervised learning algorithms.

Finally, the comparison between using and not using dropout does not show a difference. This suggests that the regularization effect of universum prescription alone is comparable to that of dropout.

This article shows that universum prescription can be used to regularize a multi-class classification problem using extra unlabeled data. Two assumptions are made: one is that the loss used is the negative log-likelihood, and the other is that the probability of a supervised sample existing in the unlabeled data is negligible. The loss assumption is a necessary detail rather than a limitation. The three universum prescription methods are uniform prescription, dustbin class and background class.
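As one concrete reading of the three methods (this is our interpretation using standard softmax logits rather than the paper's energy convention, and the helper names are hypothetical), the loss prescribed to an unlabeled sample might look like:

```python
import math

def nll(logits, target):
    """Negative log-likelihood of one sample under a softmax over logits."""
    logz = math.log(sum(math.exp(o) for o in logits))
    return logz - logits[target]

def universum_loss(logits, method, k):
    """Loss prescribed to an unlabeled sample for a k-class problem.
    For dustbin/background, logits may carry one extra (k+1)-th entry."""
    if method == "uniform":
        # uniform prescription: cross-entropy against the uniform
        # distribution over the k real classes
        logz = math.log(sum(math.exp(o) for o in logits[:k]))
        return sum((logz - o) / k for o in logits[:k])
    if method == "dustbin":
        # dustbin class: an extra (k+1)-th class with its own trainable unit
        return nll(logits, k)
    if method == "background":
        # background class: an extra class pinned to a constant output
        return nll(logits[:k] + [0.0], k)
```

With equal logits over 3 classes, `universum_loss([1.0, 1.0, 1.0], "uniform", 3)` equals $\log 3$, as expected for a uniform target.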

We further provided a theoretical justification. Theorem 2 suggests that asymptotically the generalization ability of the supervised problem could be bounded by the joint problem, which has more samples due to the addition of unlabeled data. Experiments are done using CIFAR-10, CIFAR-100, STL-10 and ImageNet datasets. The effect of the regularization parameter is also studied empirically.

These experiments show that all three universum prescription methods provide a certain improvement in the generalization gap, and that dustbin class consistently performs best because the parameterized extra class can adapt better to the unlabeled samples. Further conclusions include that additional unlabeled data can improve the variance of models during training, and that the results are comparable to data-agnostic regularization using dropout.

In the future, we hope to apply these methods to a broader range of problems.

We gratefully acknowledge NVIDIA Corporation for the donation of 2 Tesla K40 GPUs used for this research. Sainbayar Sukhbaatar offered many useful comments. Aditya Ramesh and Junbo Zhao helped cross-check the proofs.

This supplemental material shares the bibliography of the main paper. As an outline of our proof, we first establish a relation between $\mathfrak{R}_{m}(\mathcal{F},\mathbf{D})$ and $\mathfrak{R}_{m}(\mathcal{F},(1-p)\mathbf{D}+p\mathbf{U})$, and then another relation between $\mathfrak{R}_{n}(\mathcal{F},\mathbf{D})$ and $\mathfrak{R}_{m}(\mathcal{F},\mathbf{D})$. The first part requires the following lemmas.

Let $S$ be a dataset of size $m$. If $S_{1}$ and $S_{2}$ are two non-overlapping subsets of $S$ such that $|S_{1}|=m-i$, $|S_{2}|=i$ and $S_{1}\cup S_{2}=S$, then the following holds

Let $(x_{j},y_{j})\in S_{1}$ for $j=1,2,\dots,m-i$ and $(x_{j},y_{j})\in S_{2}$ for $j=m-i+1,m-i+2,\dots,m$. Denote by $\mathbf{N}$ the discrete uniform distribution on $\{1,-1\}$. By the convexity of the supremum and the symmetry of $\mathbf{N}$, we can derive

Assume $0\leq n\leq m$. If $|S_{n}|=n$, $|S_{m}|=m$ and $S_{m}=S_{n}\cup\{x_{n+1},x_{n+2},\dots,x_{m}\}$, then

First of all, inequality 20 can be established by mathematical induction if we have $m\mathfrak{R}_{m}(\mathcal{F},\mathbf{D})\leq(m+1)\mathfrak{R}_{m+1}(\mathcal{F},\mathbf{D})$ for all $m\geq 0$. To prove this, we first establish that if $S_{m}=\{x_{1},x_{2},\dots,x_{m}\}$ and $S_{m+1}=S_{m}\cup\{x_{m+1}\}$, then $m\hat{\mathfrak{R}}_{S_{m}}(\mathcal{F})\leq(m+1)\hat{\mathfrak{R}}_{S_{m+1}}(\mathcal{F})$, which also establishes inequality 19.

For any $\boldsymbol{\eta}_{m}=\{\eta_{1},\eta_{2},\dots,\eta_{m}\}$ and $\boldsymbol{\eta}_{m+1}=\boldsymbol{\eta}_{m}\cup\{\eta_{m+1}\}$, let $f_{0}=\mathrm{argmax}_{f\in\mathcal{F}}\sum_{i=1}^{m}\eta_{i}f(x_{i})$. By the definition of supremum, we have

Taking the expectation over $\boldsymbol{\eta}_{m+1}$, by the symmetry of the distribution $\mathbf{N}$, we obtain

By the definition of $\hat{\mathfrak{R}}_{S_{m}}(\mathcal{F})$, the inequality above implies $m\hat{\mathfrak{R}}_{S_{m}}(\mathcal{F})\leq(m+1)\hat{\mathfrak{R}}_{S_{m+1}}(\mathcal{F})$. Then, by taking the expectation over $S_{m+1}$, we obtain

The lemma can therefore be easily established by mathematical induction. ∎

If $p\leq 0.5$, then

For any function space $\mathcal{F}$ and distribution $\mathbf{D}$, denote $\mathfrak{R}_{0}(\mathcal{F},\mathbf{D})=0$ and $\hat{\mathfrak{R}}_{\emptyset}(\mathcal{F})=0$. By the definition of Rademacher complexity and lemma 1, we get

The proof proceeds by handling the three parts on the right-hand side of the inequality above separately.

For the first part, using lemma 2, we can get

The second part also proceeds using lemma 2; it is essentially upper-bounded by the first part. By the fact that $i\leq m-i$ for $0\leq i\leq\lfloor m/2\rfloor$, we obtain

The third part takes advantage of the assumption $p\leq 0.5$. For $\lfloor m/2\rfloor+1\leq i\leq m$, the assumption $p\leq 0.5$ implies

By combining all the three parts above, we establish

The proof for theorem 3 is therefore concluded. ∎

Assume that in solving the joint problem we obtained $m$ independently and identically distributed samples. Let the random number $n$ represent the number of supervised samples obtained among these $m$ joint samples, each supervised with probability $1-p$. Then, with probability at least $1-\delta$, the following holds

for large enough $m$ such that

Using lemma 2, we only need to prove an upper bound for $m/n$. Since $n$ follows a binomial distribution with mean $(1-p)m$, using Hoeffding's inequality (Hoeffding 1963) (Serfling 1974) we obtain
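The Hoeffding tail on $n$ can be checked by simulation. The snippet below (our sketch; the parameter values are arbitrary) estimates $\Pr[n\leq(1-p-\epsilon)m]$ for binomial $n$ and compares it against the bound $\exp(-2\epsilon^{2}m)$.

```python
import math
import random

# n counts supervised samples among m joint draws, each supervised
# with probability 1 - p; the Hoeffding bound is exp(-2 eps^2 m).
rng = random.Random(0)
p, m, eps, trials = 0.2, 400, 0.1, 2000
hits = 0
for _ in range(trials):
    n = sum(1 for _ in range(m) if rng.random() < 1 - p)
    if n <= (1 - p - eps) * m:
        hits += 1
empirical = hits / trials
bound = math.exp(-2 * eps * eps * m)   # exp(-8), well under 1e-3
assert empirical <= bound + 0.01
```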

or put differently,

As a result, theorem 2 can be obtained by directly combining theorem 3 and theorem 4.

Table: Sx2.T1: The 21-layer network

LAYERS  DESCRIPTION
1-3     Conv 256x3x3
4       Pool 2x2
5-8     Conv 512x3x3
9       Pool 2x2
10-13   Conv 1024x3x3
14      Pool 2x2
15-18   Conv 1024x3x3
19      Pool 2x2
20-23   Conv 2048x3x3
24      Pool 2x2
25-26   Full 2048

Table: Sx4.T3: Result for baseline and uniform prescription. The numbers are percentages.

DATASET       BASELINE               UNIFORM
              Train  Test   Gap      Train  Test   Gap
CIFAR-10      0.00   7.02   7.02     0.72   7.59   6.87
CIFAR-100 F.  0.09   37.58  37.49    4.91   36.23  31.32
CIFAR-100 C.  0.04   22.74  22.70    0.67   23.42  22.45
STL-10        0.00   31.16  31.16    2.02   36.54  34.52
STL-10 Tiny   0.00   31.16  31.16    0.62   30.15  29.47
ImageNet-1    10.19  34.39  24.20    13.84  34.61  20.77
ImageNet-5    1.62   13.68  12.06    3.02   13.70  10.68

Table: Sx4.T4: Result for dustbin class and background class. Continuation of table 3

DATASET       DUSTBIN                BACKGROUND
              Train  Test   Gap      Train  Test   Gap
CIFAR-10      0.07   6.66   6.59     1.35   8.38   7.03
CIFAR-100 F.  2.52   32.84  30.32    8.56   40.57  42.01
CIFAR-100 C.  0.40   20.45  20.05    3.73   24.97  21.24
STL-10        3.03   36.58  33.55    14.89  38.95  24.06
STL-10 Tiny   0.00   27.96  27.96    0.11   30.38  30.27
ImageNet-1    13.80  33.67  19.87    13.43  34.69  21.26
ImageNet-5    2.83   13.35  10.52    2.74   13.84  11.10

Table: Sx4.T5: Comparison of single-model CIFAR-10 and CIFAR-100 results, in second and third columns. The fourth column indicates whether data augmentation is used for CIFAR-10. The numbers are percentages.

REF.                        10     100    AUG.
(Graham 2014)               6.28   24.30  YES
(ours)                      6.66   32.84  YES
(Lee et al. 2015)           7.97   34.57  YES
(Lin, Chen, and Yan 2013)   8.81   35.68  YES
(Goodfellow et al. 2013)    9.38   38.57  YES
(Wan et al. 2013)           11.10  N/A    NO
(Zeiler and Fergus 2013)    15.13  42.51  NO

Table: Sx4.T6: ConvNet for the study of $p$

LAYERS  DESCRIPTION
1       Conv 1024x5x5
2       Pool 2x2
3       Conv 1024x5x5
4-7     Conv 1024x3x3
8       Pool 2x2
9-11    Full 2048

Figure 1: Experiments on the regularization parameter. The four rows are CIFAR-10, CIFAR-100 fine-grained, CIFAR-100 coarse and STL-10 respectively.

$$ L(h,x,y)=h_{y}(x)+\log\left[\sum_{i=1}^{k}\exp(-h_{i}(x))\right]. $$ \tag{Sx2.E1}

$$ \Pr_{(X,Y)\sim\mathbf{U}}[Y\in\{1,2,\dots,k\}]\approx 0. $$ \tag{Sx2.E3}

$$ \mathfrak{R}_{m}(\mathcal{F},\mathbf{D})=\underset{S\sim\mathbf{D}^{m}}{\mathrm{E}}[\hat{\mathfrak{R}}_{S}(\mathcal{F})] $$ \tag{Sx3.E11}

$$ \mathcal{F}=\left\{\mathcal{E}(h,x,y)\,|\,h\in\mathcal{H}\right\}. $$ \tag{Sx3.E15}

$$ 1-p-\sqrt{\frac{\log(1/\delta)}{2m}}>0. $$ \tag{Sx3.E17}

$$ \hat{\mathfrak{R}}_{S}(\mathcal{F})\leq\frac{m-i}{m}\hat{\mathfrak{R}}_{S_{1}}(\mathcal{F})+\frac{i}{m}\hat{\mathfrak{R}}_{S_{2}}(\mathcal{F}). $$ \tag{Sx8.E18}

$$ n\mathfrak{R}_{n}(\mathcal{F},\mathbf{D})\leq m\mathfrak{R}_{m}(\mathcal{F},\mathbf{D}). $$ \tag{Sx8.E20}

$$ (1-p)^{m-i}p^{i}\leq\frac{p}{1-p}(1-p)^{i}p^{m-i}. $$ \tag{Sx8.Ex8}

$$ \Pr\left[n\leq(1-p-\epsilon)m\right]\leq\exp(-2\epsilon^{2}m), $$ \tag{Sx8.Ex11}

$$ \displaystyle\geq\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^{m}\underset{\eta_{i}}{\mathrm{E}}[\eta_{i}]f(x_{i})=0. $$ \tag{Sx3.E12Xa}


$$ \displaystyle=\sum_{i=0}^{m}\binom{m}{i}(1-p)^{i}p^{m-i}\underset{S_{1}\sim\mathbf{D}^{i}}{\mathrm{E}}\left[\underset{S_{2}\sim\mathbf{D}^{m-i}}{\mathrm{E}}\left[\hat{\mathfrak{R}}_{S_{1}\cup S_{2}}(\mathcal{F})\right]\right] $$

$$ \displaystyle\leq\left(1+1+\frac{p}{1-p}\right)\mathfrak{R}_{m}(\mathcal{F},(1-p)\mathbf{D}+p\mathbf{U}) $$

Definition 1 (Empirical Rademacher complexity). Let $\mathcal{F}$ be a family of functions mapping from $\mathcal{X}$ to $\mathbb{R}$, and $S=(x_{1},x_{2},\dots,x_{m})$ a fixed sample of size $m$ with elements in $\mathcal{X}$. Then, the empirical Rademacher complexity of $\mathcal{F}$ with respect to the sample $S$ is defined as: $$ \hat{\mathfrak{R}}_{S}(\mathcal{F})=\underset{\boldsymbol{\eta}}{\mathrm{E}}\left[\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^{m}\eta_{i}f(x_{i})\right] \tag{10} $$ where $\boldsymbol{\eta}=(\eta_{1},\dots,\eta_{m})^{T}$, with the $\eta_{i}$ independent random variables uniformly distributed on $\{-1,1\}$.

Definition 2 (Rademacher complexity). Let $\mathbf{D}$ denote the distribution from which the samples were drawn. For any integer $m\geq 1$, the Rademacher complexity of $\mathcal{F}$ is the expectation of the empirical Rademacher complexity over all samples of size $m$ drawn according to $\mathbf{D}$: $$ \mathfrak{R}_{m}(\mathcal{F},\mathbf{D})=\underset{S\sim\mathbf{D}^{m}}{\mathrm{E}}[\hat{\mathfrak{R}}_{S}(\mathcal{F})] \tag{11} $$

Theorem 1 (Approximation bound with finite bound on output). For an energy function (LeCun et al. 2006) $\mathcal{E}(h,x,y)$ over hypothesis class $\mathcal{H}$, input set $\mathcal{X}$ and output set $\mathcal{Y}$, if it has lower bound 0 and upper bound $M>0$, then with probability at least $1-\delta$, the following holds for all $h\in\mathcal{H}$: $$ \underset{(x,y)\sim\mathbf{D}}{\mathrm{E}}[\mathcal{E}(h,x,y)]\leq\frac{1}{m}\sum_{(x,y)\in S}\mathcal{E}(h,x,y)+2\mathfrak{R}_{m}(\mathcal{F},\mathbf{D})+M\sqrt{\frac{\log\frac{2}{\delta}}{2m}}, \tag{14} $$ where the function family $\mathcal{F}$ is defined as $$ \mathcal{F}=\left\{\mathcal{E}(h,x,y)\,|\,h\in\mathcal{H}\right\}. \tag{15} $$ $\mathbf{D}$ is the distribution for $(x,y)$, and $S$ is a sample set of size $m$ drawn identically and independently from $\mathbf{D}$.

Theorem 2 (Rademacher complexity bound on distribution mixture). Assume we have a joint problem where $p\leq 0.5$ and there are $m$ random training samples from the joint distribution $(1-p)\mathbf{D}+p\mathbf{U}$. With probability at least $1-\delta$, the following holds: $$ \mathfrak{R}_{n}(\mathcal{F},\mathbf{D})\leq\frac{2-p}{(1-p)\left(1-p-\sqrt{\frac{\log(1/\delta)}{2m}}\right)}\mathfrak{R}_{m}(\mathcal{F},(1-p)\mathbf{D}+p\mathbf{U}), \tag{16} $$ where $n$ is a random number indicating the number of supervised samples among the joint samples, and $m$ is large enough such that $$ 1-p-\sqrt{\frac{\log(1/\delta)}{2m}}>0. \tag{17} $$

Lemma 1 (Separation of dataset on empirical Rademacher complexity). Let $S$ be a dataset of size $m$. If $S_{1}$ and $S_{2}$ are two non-overlapping subsets of $S$ such that $|S_{1}|=m-i$, $|S_{2}|=i$ and $S_{1}\cup S_{2}=S$, then the following holds: $$ \hat{\mathfrak{R}}_{S}(\mathcal{F})\leq\frac{m-i}{m}\hat{\mathfrak{R}}_{S_{1}}(\mathcal{F})+\frac{i}{m}\hat{\mathfrak{R}}_{S_{2}}(\mathcal{F}). \tag{18} $$

Lemma 2 (Sample size inequality for Rademacher complexity). Assume $0\leq n\leq m$. If $|S_{n}|=n$, $|S_{m}|=m$ and $S_{m}=S_{n}\cup\{x_{n+1},x_{n+2},\dots,x_{m}\}$, then $$ n\hat{\mathfrak{R}}_{S_{n}}(\mathcal{F})\leq m\hat{\mathfrak{R}}_{S_{m}}(\mathcal{F}), \tag{19} $$ and $$ n\mathfrak{R}_{n}(\mathcal{F},\mathbf{D})\leq m\mathfrak{R}_{m}(\mathcal{F},\mathbf{D}). \tag{20} $$

Theorem 3 (Relation of Rademacher complexities in distribution mixture). If $p\leq 0.5$, then $$ \mathfrak{R}_{m}(\mathcal{F},\mathbf{D})\leq\frac{2-p}{1-p}\mathfrak{R}_{m}(\mathcal{F},(1-p)\mathbf{D}+p\mathbf{U}). \tag{21} $$

Theorem 4 (Concentration inequality of subset Rademacher complexity). Assume that in solving the joint problem we obtained $m$ independently and identically distributed samples. Let the random number $n$ represent the number of supervised samples obtained among these $m$ joint samples, each supervised with probability $1-p$. Then, with probability at least $1-\delta$, the following holds: $$ \mathfrak{R}_{n}(\mathcal{F},\mathbf{D})\leq\frac{\mathfrak{R}_{m}(\mathcal{F},\mathbf{D})}{1-p-\sqrt{\frac{\log(1/\delta)}{2m}}}, \tag{22} $$ for large enough $m$ such that $$ 1-p-\sqrt{\frac{\log(1/\delta)}{2m}}>0. \tag{23} $$

Table: Sx2.T2: The 17-layer network

LAYERS  DESCRIPTION
1-3     Conv 128x3x3
4       Pool 2x2
5-7     Conv 256x3x3
8       Pool 2x2
9-11    Conv 512x3x3
12      Pool 2x2
13-15   Conv 1024x3x3
16      Pool 2x2
17-19   Conv 2048x3x3
20      Pool 2x2

References

[BM03] Bartlett, P.~L., and Mendelson, S. 2003. Rademacher and gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research 3:463--482.

[BBL02] Bartlett, P.~L.; Boucheron, S.; and Lugosi, G. 2002. Model selection and error estimation. Machine Learning 48(1-3):85--113.

[BL07] Bengio, Y., and LeCun, Y. 2007. Scaling learning algorithms towards ai. In Bottou, L.; Chapelle, O.; DeCoste, D.; and Weston, J., eds., Large-Scale Kernel Machines. MIT Press.

[BCV13] Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on 35(8):1798--1828.

[CASS07] Chapelle, O.; Agarwal, A.; Sinz, F.~H.; and Schölkopf, B. 2007. An analysis of inference with the universum. In Advances in neural information processing systems, 1369--1376.

[CSZ06T] Chapelle, O.; Schölkopf, B.; and Zien, A. 2006. A Discussion of Semi-Supervised Learning and Transduction. MIT Press. 473--478.

[ANL11] Coates, A.; Ng, A.~Y.; and Lee, H. 2011. An analysis of single-layer networks in unsupervised feature learning. In International conference on artificial intelligence and statistics, 215--223.

[CJ06] Corduneanu, A., and Jaakkola, T. 2006. Data dependent regularization. In Chapelle, O.; Schölkopf, B.; and Zien, A., eds., Semi-supervised learning. MIT Press.

[EBCMVB10] Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S. 2010. Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research 11:625--660.

[GVV98] Gammerman, A.; Vovk, V.; and Vapnik, V. 1998. Learning by transduction. In Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence, 148--155. Morgan Kaufmann.

[GWMCB13] Goodfellow, I.; Warde-farley, D.; Mirza, M.; Courville, A.; and Bengio, Y. 2013. Maxout networks. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 1319--1327.

[G14] Graham, B. 2014. Spatially-sparse convolutional neural networks. CoRR abs/1409.6070.

[HZRS15] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. arXiv preprint arXiv:1502.01852.

[HOT06] Hinton, G.~E.; Osindero, S.; and Teh, Y.-W. 2006. A fast learning algorithm for deep belief nets. Neural computation 18(7):1527--1554.

[H63] Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58(301):13--30.

[KP00] Koltchinskii, V., and Panchenko, D. 2000. Rademacher processes and bounding the risk of function learning. In High dimensional probability II. Springer. 443--457.

[K01] Koltchinskii, V. 2001. Rademacher penalties and structural risk minimization. Information Theory, IEEE Transactions on 47(5):1902--1914.

[K09] Krizhevsky, A. 2009. Learning multiple layers of features from tiny images.

[LBDHHHJ89] LeCun, Y.; Boser, B.; Denker, J.~S.; Henderson, D.; Howard, R.~E.; Hubbard, W.; and Jackel, L.~D. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4):541--551.

[LBBH98] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278--2324.

[LCHRH06] LeCun, Y.; Chopra, S.; Hadsell, R.; Ranzato, M.; and Huang, F. 2006. A tutorial on energy-based learning. In Bakir, G.; Hofman, T.; Schölkopf, B.; Smola, A.; and Taskar, B., eds., Predicting Structured Data. MIT Press.

[LXPZT15] Lee, C.-Y.; Xie, S.; Gallagher, P.; Zhang, Z.; and Tu, Z. 2015. Deeply-supervised nets. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 562--570.

[LCY13] Lin, M.; Chen, Q.; and Yan, S. 2013. Network in network. CoRR abs/1312.4400.

[M95] Miller, G.~A. 1995. Wordnet: a lexical database for english. Communications of the ACM 38(11):39--41.

[NH10] Nair, V., and Hinton, G.~E. 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 807--814.

[OAGR11] Oneto, L.; Anguita, D.; Ghio, A.; and Ridella, S. 2011. The impact of unlabeled patterns in rademacher complexity theory for kernel classifiers. In Advances in Neural Information Processing Systems, 585--593.

[OGRA15] Oneto, L.; Ghio, A.; Ridella, S.; and Anguita, D. 2015. Local rademacher complexity: Sharper risk bounds with and without unlabeled samples. Neural Networks 65:115--125.

[P64] Polyak, B. 1964. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4(5):1 -- 17.

[RPCL06] Ranzato, M.; Poultney, C.; Chopra, S.; and LeCun, Y. 2006. Efficient learning of sparse representations with an energy-based model. In Platt, J.~P., et al., eds., Advances in Neural Information Processing Systems (NIPS 2006), volume 19. MIT Press.

[RBHVR15] Rasmus, A.; Berglund, M.; Honkala, M.; Valpola, H.; and Raiko, T. 2015. Semi-supervised learning with ladder networks. In Cortes, C.; Lawrence, N.~D.; Lee, D.~D.; Sugiyama, M.; and Garnett, R., eds., Advances in Neural Information Processing Systems 28. Curran Associates, Inc. 3546--3554.

[RDSKSMHKKBBF15] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A.~C.; and Fei-Fei, L. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3):211--252.

[S74] Serfling, R.~J. 1974. Probability inequalities for the sum in sampling without replacement. The Annals of Statistics 2(1):39--48.

[SEZMFL13] Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; and LeCun, Y. 2013. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR abs/1312.6229.

[SZ14] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.

[SHKSS14] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929--1958.

[SMDH13] Sutskever, I.; Martens, J.; Dahl, G.; and Hinton, G. 2013. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th international conference on machine learning (ICML-13), 1139--1147.

[SLJSRAEVR15] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR).

[TP98] Thrun, S., and Pratt, L., eds. 1998. Learning to Learn. Norwell, MA, USA: Kluwer Academic Publishers.

[TFF08] Torralba, A.; Fergus, R.; and Freeman, W.~T. 2008. 80 million tiny images: A large data set for nonparametric object and scene recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on 30(11):1958--1970.

[V84] Valiant, L.~G. 1984. A theory of the learnable. Communications of the ACM 27(11):1134--1142.

[VC71] Vapnik, V.~N., and Chervonenkis, A.~Y. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications 16(2):264--280.

[V95] Vapnik, V.~N. 1995. The Nature of Statistical Learning Theory. New York, NY, USA: Springer-Verlag New York, Inc.

[V98] Vapnik, V.~N. 1998. Statistical Learning Theory. Wiley-Interscience.

[V06] Vapnik, V.~N. 2006. Estimation of Dependences Based on Empirical Data. New York, NY, USA: Springer New York.

[WZZLF13] Wan, L.; Zeiler, M.; Zhang, S.; LeCun, Y.; and Fergus, R. 2013. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 1058--1066.

[WCSBV06] Weston, J.; Collobert, R.; Sinz, F.; Bottou, L.; and Vapnik, V. 2006. Inference with the universum. In Proceedings of the 23rd international conference on Machine learning, 1009--1016.

[MR13] Zeiler, M.~D., and Fergus, R. 2013. Stochastic pooling for regularization of deep convolutional neural networks. CoRR abs/1301.3557.

[ZZ10] Zhang, M.-L., and Zhou, Z.-H. 2010. Exploiting unlabeled data to enhance ensemble diversity. In 2010 IEEE International Conference on Data Mining, 619--628. IEEE.

[Z13] Zhang, X. 2013. Pac-learning for energy-based models. Master's thesis, Computer Science Department, Courant Institute of Mathematical Sciences, New York University.

[ZMGL15] Zhao, J.; Mathieu, M.; Goroshin, R.; and LeCun, Y. 2015. Stacked what-where auto-encoders. arXiv preprint arXiv:1506.02351.