
Differentially- and non-differentially-private random decision trees

Mariusz Bojarski* (NYU Polytechnic School of Engineering, Brooklyn, NY, mb4496@nyu.edu), Krzysztof Choromanski* (Google Research, New York, NY, kchoro@google.com), Anna Choromanska* (Courant Institute of Mathematical Sciences, New York, NY, achoroma@cims.nyu.edu), Yann LeCun (Courant Institute of Mathematical Sciences and Facebook, New York, NY, yann@cs.nyu.edu)

* equal contribution

Abstract

We consider supervised learning with random decision trees, where the tree construction is completely random. The method is popularly used and works well in practice despite the simplicity of the setting, but its statistical mechanism is not yet well understood. In this paper we provide strong theoretical guarantees regarding learning with random decision trees. We analyze and compare three different variants of the algorithm that have minimal memory requirements: majority voting, threshold averaging and probabilistic averaging. The random structure of the tree enables us to adapt these methods to a differentially-private setting; thus we also propose differentially-private versions of all three schemes. We give upper-bounds on the generalization error and mathematically explain how the accuracy depends on the number of random decision trees. Furthermore, we prove that only a logarithmic (in the size of the dataset) number of independently selected random decision trees suffices to correctly classify most of the data, even when differential-privacy guarantees must be maintained. We empirically show that majority voting and threshold averaging give the best accuracy, also for conservative users requiring high privacy guarantees. Furthermore, we demonstrate that a simple majority voting rule is an especially good candidate for the differentially-private classifier, since it is much less sensitive to the choice of forest parameters than the other methods.


Keywords: Random decision trees, differential privacy, supervised learning, classification, generalization bounds, error bounds, majority voting, threshold averaging, probabilistic averaging, non-differentially-private random decision trees, differentially-private random decision trees


Introduction

The decision tree is one of the most fundamental structures used in machine learning. Constructing a tree of good quality, however, is a hard computational problem: the choice of the optimal attribute according to which the data partitioning should be performed in any given node of the tree requires nontrivial calculations involving the data points located in that node. Nowadays, with the increasing importance of mechanisms preserving the privacy of the data handled by machine learning algorithms, the need arises to construct these algorithms with strong privacy guarantees (see e.g. [1], [2], [3], [4], [5]). One of the strongest currently used notions of privacy is the so-called differential privacy, introduced [6] in a quest to achieve the dual goal of maximizing data utility and preserving data confidentiality. A differentially-private database access mechanism preserves the privacy of any individual in the database, irrespective of the amount of auxiliary information available to an adversarial database client. Differential-privacy techniques perturb the data by adding noise (such as Laplacian noise) whose magnitude depends on the sensitivity of the statistics being output. Even though the overall scheme looks simple, in practice it is usually very difficult to obtain a reasonable level of differential privacy while maintaining good accuracy, since typically too large a perturbation must be added. In particular, this happens when machine learning computations access the data frequently during the entire execution of the algorithm and output structures that are very sensitive to the data. This is also an obstacle to proposing a scheme that computes an optimal decision tree in a differentially-private way. In such a scenario the attribute chosen in every node, and any additional information stored there, depends on the data and therefore must be perturbed in order to keep the desired level of differential privacy. The large perturbation added in this setting leads to a substantially lower quality of the constructed tree.

Instead of constructing one differentially-private decision tree, in this paper we consider constructing a random forest. Random forests [7] constitute an important member of the family of decision tree-based algorithms due to their effectiveness and excellent performance. They are also the most accurate general-purpose classifiers available [8, 7]. In this paper we construct a forest consisting of O(log(n)) random decision trees (n is the size of the dataset, i.e. the number of data samples). The attribute according to which the split is performed in any given node is chosen uniformly at random from all the attributes, independently of the data reaching that node. In the continuous case, the threshold value for the chosen attribute is then also chosen uniformly at random from the range of all possible values. This simple rule enables us to construct each decision tree very fast, since the choice of node attributes does not depend on the data at all. The resulting algorithm is therefore fast and scalable, with minimal memory requirements, and it takes only one pass over the data to construct the classifier. Since most of the structure of each random decision tree is constructed without examining the data, the algorithm suits the differentially-private scenario very well. After a sufficient number of random decision trees is constructed, every point from the dataset is classified according to one of three schemes: majority voting ([7]), threshold averaging or probabilistic averaging ([9]). In the differentially-private setting we add perturbation error to the counters in the leaves, but no perturbation error is added to the inner nodes. This leads to a much more accurate learning mechanism. Performing voting/averaging (see [10] for applications of voting methods) instead of just taking the best tree for a given dataset is important since it enables us to add a smaller perturbation error to obtain the same level of differential privacy.
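The fully random construction described above can be sketched in a few lines of code. The following is a minimal Python illustration under our own naming conventions (`build_random_tree`, `leaf_for`, `train` are not from the paper); only the leaf counters ever touch the data:

```python
import random

def build_random_tree(height, attr_ranges):
    """Build a completely random decision tree: each inner node draws an
    attribute uniformly at random, then a threshold uniformly from that
    attribute's publicly known range. The data is never examined."""
    if height == 0:
        # Leaf: counters for the two labels, filled during the single
        # pass over the training data.
        return {"n_plus": 0, "n_minus": 0}
    attr = random.randrange(len(attr_ranges))
    lo, hi = attr_ranges[attr]
    return {
        "attr": attr,
        "threshold": random.uniform(lo, hi),
        "left": build_random_tree(height - 1, attr_ranges),
        "right": build_random_tree(height - 1, attr_ranges),
    }

def leaf_for(tree, x):
    """Route a data point x to its leaf."""
    while "attr" in tree:
        tree = tree["left"] if x[tree["attr"]] <= tree["threshold"] else tree["right"]
    return tree

def train(tree, data):
    """One pass over the labelled data: only the leaf counters depend on it."""
    for x, label in data:
        leaf = leaf_for(tree, x)
        if label == "+":
            leaf["n_plus"] += 1
        else:
            leaf["n_minus"] += 1
```

A forest is then just a list of such trees, each built and trained independently.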

In this paper we analyze both the non-differentially-private and the differentially-private setting in all three variants: majority voting, threshold averaging, and probabilistic averaging. To the best of our knowledge, we are the first to give a comprehensive and unified theoretical analysis of all three models in both settings; in the case of the differentially-private setting, no theoretical analysis was previously provided in the context of random decision trees. The differentially-private setting is especially difficult to analyze since increasing the number of trees does not necessarily decrease the training (and test) error: having more random decision trees requires adding a larger perturbation error, which may decrease the efficiency of the learning algorithm. In this paper we thoroughly investigate this phenomenon. The dependence of the quality of the random decision tree methods on the chosen level of differential privacy, the height of the trees and the number of trees in the forest is the central focus of our theoretical and empirical analysis. Understanding these dependencies is crucial when applying these methods in practice. Our theoretical analysis relates the empirical error and the generalization error of the classifier to the average tree accuracy and explains quantitatively how the quality of the system depends on the number of chosen trees. Furthermore, we show that the random forest does not need many trees to achieve good accuracy. In particular, we prove both theoretically and empirically that in practice a logarithmic (in the size of the dataset) number of random decision trees 1 suffices to achieve good performance. We also

1 Further in the paper by 'logarithmic number of random decision trees' we always mean 'logarithmic (in the size of the dataset) number of random decision trees'.

show that not only do there exist parameters of the setting (such as the number of random trees in the forest, the height of the trees, etc.) under which one can effectively learn, but also that the setting is very robust. To be more precise, we empirically demonstrate that the parameters need not be chosen optimally; for example, one can choose far fewer trees and still achieve good performance. We also show that majority voting and threshold averaging are good candidates for differentially-private classifiers. Our experiments reveal that the simple majority voting rule is competitive with the threshold averaging rule, and both outperform the probabilistic averaging rule. Furthermore, the majority voting rule is much less sensitive to the choice of the parameters of the random forest (such as the number of trees and the height of the trees) than the remaining two schemes.

This article is organized as follows. In Section 2 we describe previous work regarding random decision trees. We then introduce our model and the notion of differential privacy in Section 3. In Section 4 we present a differentially-private supervised algorithm that uses random decision trees. Section 5 contains our theoretical analysis. We conclude the paper with experiments (Section 6) and a brief summary of our results (Section 7).

Prior work

Random decision trees are important methods in machine learning, often used for supervised learning due to their simplicity, excellent practical performance and somewhat unreasonable effectiveness in practice. They have been successful in a number of practical problems, e.g. [11, 12, 13, 14, 15, 16, 17, 18, 19, 20] (many more examples exist). The original random forests [7] were ensemble methods combining many CART-type [21] decision trees using bagging [22] (a convenient review of random forests can, for instance, be found in [19]). They were inspired by some earlier published random approaches [12, 23, 24, 25]. Despite their popularity, the statistical mechanism of random forests is difficult to analyze [8, 26] and to this day remains largely not understood [26, 27, 28]. Next we review the existing theoretical results in the literature.

A notable line of works provides an elegant analysis of the consistency of random forests [26, 27, 28, 8, 29, 30, 31, 32, 33]. Among these works, one of the most recent studies [26] proves that a previously proposed random forest approach [32] is consistent and achieves a rate of convergence which depends only on the number of strong features and not on the number of noise variables. Another recent paper [27] provides the first consistency result for an online variant of random forests. The predecessor of this work [34] proposes the Hoeffding tree algorithm and proves that, with high probability and under certain assumptions, the online Hoeffding tree converges to the offline tree. In our paper we focus on error bounds rather than the consistency analysis of random decision trees.

It has been noted [27] that the most famous theoretical result concerning random forests provides an upper-bound on the generalization error of the forest in terms of the correlation and strength of the trees [7]. The same authors also show that the generalization error converges almost surely to a limit as the number of trees in the forest becomes large. It should be noted, however, that the algorithm considered by those authors has a data-dependent tree structure, unlike the algorithms in our paper. To be more specific, the original 'random forests' method [7] randomly selects a subset of features and then chooses the best splitting criterion from this feature subset. This affects efficiency, since computing the heuristics (the best splitting criterion) is expensive [9]. Furthermore, it also causes the tree structure to be data-dependent (another approach where the tree structure is data-dependent is presented, for example, in [35]) rather than fully random, which poses a major problem when extending the method to the differentially-private setting, since a data-independent tree structure is important for preserving differential privacy [36]. In contrast, in our algorithms we randomly draw, in each tree node, the attribute according to which we split, and then randomly choose a threshold used for splitting. This learning model is therefore much simpler. Our fully random approach is inspired by a methodology already described in the literature [9] (that work, however, has no theoretical analysis). Our theoretical results consider error bounds, similarly to the original work on random forests [7]. The difference of approaches, however, does not allow us to use the theoretical results from [7] in our setting. Finally, note that each of [7] and [9] considers only a single voting rule: majority voting and probabilistic averaging, respectively. In this paper we consider a wider spectrum of voting approaches.

Next, we briefly review some additional theoretical results regarding random forests. A simplified analysis of random forests in one-dimensional settings was provided in the literature in the context of regression problems, where minimax rates of convergence were proved [37, 38]. Another set of results explores the connection of random forests with a specific framework of adaptive nearest-neighbor methods [39]. Finally, for completeness, we emphasize that there also exist some interesting empirical studies regarding random decision trees in the literature, e.g. [40], [41] and [23], which however are not directly related to our work.

Privacy preserving data mining has emerged as an effective method to solve the problem of data sharing in many fields of computer science and statistics. One of the strongest currently used notions of privacy is the so-called differential privacy [6] (some useful tutorial material on differential privacy research can be found in [42]). In this paper we are interested in the differentially-private setting in the context of random decision trees. It was first observed in [36] that random decision trees may turn out to be an effective tool for constructing a differentially-private decision tree classifier. The authors showed a very efficient heuristic that averages over random decision trees and gives good practical results. Their work, however, lacks theoretical results regarding the quality of the differentially-private algorithm that uses random decision trees. In another published empirical study [43] the authors develop protocols to implement privacy-preserving random decision trees that enable efficient parallel and distributed privacy-preserving knowledge discovery. The very recent work [44] on differentially-private random forests shows experimental results demonstrating that quality functions such as information gain, the max operator and the Gini index give almost equal accuracy regardless of their sensitivity to the noise. Furthermore, they show that the accuracy of the classical random forest and its differentially-private counterpart is almost equal for various sizes of datasets. To the best of our knowledge, none of the published works on differentially-private random decision trees provide any theoretical guarantees. Our paper provides strong theoretical guarantees for both non-differentially-private and differentially-private random decision trees. This is a major contribution of our work. We simultaneously develop a unified theoretical framework for analyzing both settings.

Preliminaries

We will prove here results regarding the empirical and generalization errors of all the variants of the algorithm mentioned in the paper, as well as Theorem 5.1 and Theorem 5.2. Without loss of generality we will assume that all attributes are binary (taken from the set $\{0, 1\}$). It can be easily noticed that the proofs translate directly to the continuous case. We leave this simple exercise to the reader.

Let us introduce first some useful notation that will be very helpful in the proofs we present next.

We denote by $n$ the size of the dataset (training or test) $\mathcal{T}$. Recall that $m$ is the number of attributes of any given data point, $h$ is the height of the random decision tree and $M$ is the set of all random decision trees under consideration.

We focus first on classifying with just one decision tree. Fix some decision tree $T_j$ and one of its leaves. Assume that it contains $a$ points with label - and $b$ points with label +. We associate label - with that leaf if $a > b$ and label + otherwise. To classify a given point using that tree, we feed the tree with that point and assign to the point the label of the corresponding leaf. Denote by $m_i$ the number of data points that were correctly classified by a tree $T_i$, and let $e_i = \frac{m_i}{n}$. We call $e_i$ the quality (or accuracy) of the tree $T_i$. Note that we always have $e_i \ge \frac{1}{2}$, since for every leaf of any given tree the majority of the data points from that leaf are classified correctly. Denote $e = \frac{1}{|M|}\sum_{i=1}^{|M|} e_i$. We call $e$ the average tree accuracy. This parameter measures how well data points are classified on average by a complete decision tree of a given height $h$. Note that $e \ge \frac{1}{2}$. Denote $t = 2^h$; the parameter $t$ is the number of leaves of the decision tree.
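The per-tree accuracies and their average can be computed directly from these definitions. A tiny illustrative sketch (the function name is ours):

```python
def tree_accuracies(correct_counts, n):
    """Given m_i (the number of points tree T_i classifies correctly) for
    each tree, and the dataset size n, return the per-tree accuracies
    e_i = m_i / n and the average tree accuracy e."""
    accuracies = [m / n for m in correct_counts]
    average = sum(accuracies) / len(accuracies)
    return accuracies, average
```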

For $i = 1, 2, \dots, |M|$ and $j = 1, 2, \dots, t$, denote by $n_j^i$ the number of points from the dataset in the $j$th leaf of decision tree $T_i$, and by $m_j^i$ the number of points from the dataset in the $j$th leaf of $T_i$ that were classified correctly. Denote $e_j^i = \frac{m_j^i}{n_j^i}$ for $n_j^i > 0$ and $e_j^i = 1$ for $n_j^i = 0$. Note that $e_j^i \ge \frac{1}{2}$ for every $i, j$. Note also that $n = n_1^i + \dots + n_t^i$ and $m_i = m_1^i + \dots + m_t^i$. Denote by $a_j^i$ the number of data points in the $j$th leaf of decision tree $T_i$ that are of label 0, and by $b_j^i$ the number of data points in the $j$th leaf of $T_i$ that are of label 1.

We will frequently use the following structure in the proofs. Let $G$ be a bipartite graph with color classes $A$, $B$ and weighted edges. Color class $A$ consists of the $n$ points of the dataset. Color class $B$ consists of $2t|M|$ elements of the form $y_j^{i,b}$, where $i \in \{1, 2, \dots, |M|\}$, $b \in \{0, 1\}$ and $j \in \{1, 2, \dots, t\}$.

A data point $x \in A$ is adjacent to $y_j^{i,1}$ iff it belongs to the larger of the two groups (those with labels 0 and 1) of the data points in the $j$th leaf of decision tree $T_i$; the edge joining $x$ with $y_j^{i,1}$ has weight $e_j^i$. A data point $x \in A$ is adjacent to $y_j^{i,0}$ iff it belongs to the smaller of the two groups of the data points in the $j$th leaf of decision tree $T_i$; the edge joining $x$ with $y_j^{i,0}$ has weight $1 - e_j^i$. Note that the degree of a vertex $y_j^{i,1}$ is $m_j^i$ and the degree of a vertex $y_j^{i,0}$ is $n_j^i - m_j^i$.

In the proofs we will refer to the size of the set of decision trees under consideration as: | M | or k (note that k is used in the main body of the paper).

We are ready to prove Theorem 5.1 and Theorem 5.2.

Proof. We start with the proof of Theorem 5.2. Note that from the definition of $w_d$ we get:

$$\sum_{d \in \mathcal{T}} w_d = \frac{1}{|M|}\sum_{i=1}^{|M|}\sum_{j=1}^{t}\left( m_j^i e_j^i + (n_j^i - m_j^i)(1 - e_j^i) \right).$$

Therefore, using the formula $m_j^i = n_j^i e_j^i$, we get:

$$\frac{1}{n}\sum_{d \in \mathcal{T}} w_d = \sum_{i=1}^{|M|}\sum_{j=1}^{t}\frac{n_j^i}{|M| n}\left( (e_j^i)^2 + (1 - e_j^i)^2 \right).$$

Note that we have $\sum_{i=1}^{|M|}\sum_{j=1}^{t} n_j^i = n|M|$. From Jensen's inequality, applied to the function $f(x) = x^2$, we get:

$$\sum_{i=1}^{|M|}\sum_{j=1}^{t}\frac{n_j^i}{|M| n}(e_j^i)^2 \ge \left( \sum_{i=1}^{|M|}\sum_{j=1}^{t}\frac{n_j^i e_j^i}{|M| n} \right)^2 = \left( \sum_{i=1}^{|M|}\sum_{j=1}^{t}\frac{m_j^i}{|M| n} \right)^2 = \left( \frac{e n |M|}{n |M|} \right)^2 = e^2,$$

where $e$ is the average quality of the system of all complete decision trees of height $h$ (the average tree accuracy). Similarly, $\sum_{i=1}^{|M|}\sum_{j=1}^{t}\frac{n_j^i}{|M| n}(1 - e_j^i)^2 \ge (1 - e)^2$. Thus we get:

$$\frac{1}{n}\sum_{d \in \mathcal{T}} w_d \ge e^2 + (1 - e)^2.$$

That completes the proof of Theorem 5.2. The proof of Theorem 5.1 is even simpler. Notice that for any data point $d$ the expression $\sigma(d) \cdot |M|$ counts the number of decision trees from $M$ that classified $d$ correctly (this follows directly from the definition of $\sigma$). Thus we have $\sum_{d \in \mathcal{T}} \sigma(d) \cdot |M| = \sum_{i=1}^{|M|} m_i$. Therefore $\frac{1}{n}\sum_{d \in \mathcal{T}} \sigma(d) = \frac{1}{|M|}\sum_{i=1}^{|M|} e_i$ and we are done.

We need one more technical result, Azuma's inequality:

Lemma 9.1. Let $\{W_n, n \ge 1\}$ be a martingale with mean 0 and suppose that for some non-negative constants $\alpha_i, \beta_i$ we have $-\alpha_i \le W_i - W_{i-1} \le \beta_i$ for $i = 2, 3, \dots$. Then for any $n \ge 0$, $a > 0$:

$$\mathbb{P}(W_n \ge a) \le e^{-2a^2 / \sum_{i=1}^{n}(\alpha_i + \beta_i)^2} \quad \text{and} \quad \mathbb{P}(W_n \le -a) \le e^{-2a^2 / \sum_{i=1}^{n}(\alpha_i + \beta_i)^2}.$$

Differential privacy

Differential privacy is a model of privacy for database access mechanisms. It guarantees that small changes in a database (removal or addition of an element) do not substantially change the output of the mechanism.

Definition 3.1. (See [45].) A randomized algorithm $K$ gives $\epsilon$-differential privacy if for all datasets $D_1$ and $D_2$ differing on at most one element, and all $S \subseteq Range(K)$,

$$\Pr[K(D_1) \in S] \le e^{\epsilon} \cdot \Pr[K(D_2) \in S].$$

The probability is taken over the coin tosses of $K$.

The smaller $\epsilon$, the stronger the level of differential privacy obtained. Assume that the non-perturbed output of the mechanism can be encoded by a function $f$. A mechanism $K$ can compute a differentially-private noisy version of $f$ over a database $D$ by adding noise with magnitude calibrated to the sensitivity $S(f)$ of $f$, i.e. the largest possible change of $f$ (in the $\ell_1$ norm) between two databases differing on at most one element.

Let $Lap(0, \lambda)$ denote the Laplace distribution with mean 0 and scale parameter $\lambda$. In other words, this is a random variable with probability density function given by the formula $\frac{1}{2\lambda} e^{-|x|/\lambda}$. We will denote by $g(\lambda)$ an independent copy of the $Lap(0, \lambda)$ random variable.

Theorem 3.1. (See [6].) Let $f$ be a function on databases with range $\mathbb{R}^m$, where $m$ is the number of rows of databases 2. Then the mechanism that outputs $f(D) + (Y_1, \dots, Y_m)$, where the $Y_i$ are drawn i.i.d. from $Lap(0, S(f)/\epsilon)$, satisfies $\epsilon$-differential privacy.

Stronger privacy guarantees and more sensitive functions require a larger variance of the Laplacian noise being added. Differential privacy is preserved under composition, but with an extra loss of privacy for each conducted query.
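For a scalar statistic, the mechanism of Theorem 3.1 can be sketched as follows (a minimal illustration; the function names are ours, the caller supplies the sensitivity $S(f)$, and the Laplace sample is drawn by inverse transform sampling):

```python
import math
import random

def laplace_noise(scale):
    """Sample Lap(0, scale) via inverse transform sampling:
    X = -scale * sgn(u) * ln(1 - 2|u|) for u uniform on (-1/2, 1/2)."""
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(value, sensitivity, epsilon):
    """Release value + Lap(0, sensitivity/epsilon); by Theorem 3.1 this
    satisfies epsilon-differential privacy for a scalar statistic."""
    return value + laplace_noise(sensitivity / epsilon)
```

Smaller `epsilon` (stronger privacy) or larger `sensitivity` means a larger noise scale, matching the remark above.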

Theorem 3.2. (See [6].) (Composition Theorem) The sequential application of mechanisms $K_i$, each giving $\epsilon_i$-differential privacy, satisfies $\sum_i \epsilon_i$-differential privacy.

More information about differential privacy can be found in the work of [46] and [47].

The model

All data points are taken from $F^m$, where $m$ is the number of attributes and $F$ is either a discrete set or the set of real numbers. We assume that for every attribute attr its smallest (min(attr)) and largest (max(attr)) possible values are publicly available and that the labels are binary. We consider only binary decision trees (all our results can be easily translated to the setting where inner nodes of the tree have more than two children). Therefore, if $F$ is discrete

2 Number of rows of databases is the number of attributes of any data point from the databases.

then we will assume that $F = \{0, 1\}$, i.e. each attribute is binary. In the continuous setting, for each inner node of the tree we store the attribute according to which the split is done and the threshold value of this attribute. All decision trees considered in this paper are complete and of a fixed height $h$ that does not depend on the data. Let $T$ be a random decision tree and let $l$ be one of its leaves. We denote by $\theta_l$ the fraction of all training points in $l$ with label +. If $l$ does not contain any of the training points, we choose the value of $\theta_l$ uniformly at random from $[0, 1]$. The set $M$ of all possible decision trees is of size $|M| = m^{2^{h+1}-1}$ in the binary setting. This is true also in the continuous case: there the set of all possible threshold values for a node is infinite, but, needless to say, the set of all possible partitionings in the node is still finite. Thus, without loss of generality, we assume $M$ is finite. It can be very large, but that does not matter since we never need the actual size of $M$ in our analysis. For a given tree $T$ and a given data point $d$, denote by $w_d^T$ the fraction of points (from the training set if $d$ is from this set, and from the test set otherwise) with the same label as $d$ that end up in the same leaf of $T$ as $d$. We call it the weight of $d$ in $T$. Notice that a training point $d$ is classified correctly by $T$ in the single-tree setting iff its weight in $T$ is larger than $\frac{1}{2}$ (for a single decision tree we consider the majority voting model for point classification).

The average value of $w_d^T$ over all trees of $M$ will be denoted by $w_d$ and called the weight of $d$ in $M$. We denote by $\sigma(d)$ the fraction of trees from $M$ with the property that most of the points in the leaf of the tree containing $d$ have the same label as $d$ (again, the points are taken from the training set if $d$ is from it, and from the test set otherwise). We call $\sigma(d)$ the goodness of $d$ in $M$. For a given dataset $D$, the average tree accuracy $e(D)$ of a random decision tree model is the average accuracy of a random decision tree from $M$, where the accuracy is the fraction of data points that a given tree classifies correctly (the accuracy is computed under the assumption that the same distribution $D$ was used in both the training and the test phase).
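Given, for each tree, the fraction of points in $d$'s leaf that share $d$'s label, both quantities are simple averages. A small illustrative sketch (the function name is ours):

```python
def weight_and_goodness(same_label_fracs):
    """Given, for each tree in M, the fraction w_d^T of points in d's leaf
    sharing d's label, return the weight w_d (the average over trees) and
    the goodness sigma(d) (the fraction of trees in which the majority of
    d's leaf shares d's label, i.e. w_d^T > 1/2)."""
    k = len(same_label_fracs)
    w_d = sum(same_label_fracs) / k
    sigma_d = sum(1 for f in same_label_fracs if f > 0.5) / k
    return w_d, sigma_d
```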

Algorithms

Algorithm 1 captures the non-differentially-private algorithm for supervised learning with random decision trees (RDT). Its differentially-private counterpart is captured in Algorithm 2. We consider three versions of each algorithm:

· majority voting
· threshold averaging
· probabilistic averaging.

Only the variables $n_l^+$, $n_l^-$ stored in the leaves depend on the data. This fact will play a crucial role in the analysis of the differentially-private version of the algorithm, where Laplacian error is added to the point counters at every leaf, with variance calibrated to the number of all trees used by the algorithm.
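The three classification rules, applied to the (possibly noise-perturbed) leaf counters, can be sketched as follows (a minimal illustration; the name `classify` is ours, and the empty-leaf convention follows the model description):

```python
import random

def classify(leaf_counts, rule, rng=random.Random(0)):
    """Classify one test point from the counters (n_plus, n_minus) of the
    k leaves (one per tree) that the point falls into."""
    k = len(leaf_counts)
    # theta_l: fraction of '+' training points in each leaf; an empty
    # leaf gets a uniformly random theta_l, as in the model description.
    thetas = [p / (p + m) if p + m > 0 else rng.random()
              for p, m in leaf_counts]
    if rule == "majority":
        votes = sum(1 for t in thetas if t > 0.5)  # trees voting '+'
        return "+" if votes > k / 2 else "-"
    if rule == "threshold":
        return "+" if sum(thetas) / k > 0.5 else "-"
    if rule == "probabilistic":
        return "+" if rng.random() < sum(thetas) / k else "-"
    raise ValueError(rule)
```

In the differentially-private variant the same rules would be applied after Laplacian noise has been added to each counter, so nothing downstream of the counters needs to change.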

Theoretical results

In this section we derive the upper-bounds on the empirical error (the fraction of the training data misclassified by the algorithm) and the generalization error (the fraction of the test data misclassified by the algorithm where the test data is taken from the same distribution as the training data) for all methods in Algorithm 1 and 2.

Input: Train, Test: train and test sets; h: height of the tree

Random forest construction: construct k = θ(log(n)) random decision trees by choosing for each inner node of the tree independently at random its attribute (uniformly from the set of all the attributes); in the continuous case, for each chosen attribute attr choose independently at random a threshold value uniformly from [min(attr), max(attr)]

Training: for d ∈ Train: add d to the forest by updating $\theta_l$ for every leaf corresponding to d

Testing: for d ∈ Test:
  if (majority voting): compute $num_d$, the number of trees classifying d as +; classify d as + iff $num_d > \frac{k}{2}$
  if (threshold averaging): compute $\theta_d = \frac{1}{k}\sum_{l \in L} \theta_l$, where $L$ is the set of all leaves of the forest that correspond to d; classify d as + iff $\theta_d > \frac{1}{2}$
  if (probabilistic averaging): compute $\theta_d = \frac{1}{k}\sum_{l \in L} \theta_l$, where $L$ is the set of all leaves of the forest that correspond to d; classify d as + with probability $\theta_d$ (the random tosses here are done independently of all previously conducted ones)

Output: classification of all d ∈ Test

Algorithm 1: Non-differentially-private RDT classifier

Algorithm 2: η -Differentially-private RDT classifier

We also show how to find the number of random decision trees needed to obtain good accuracy and, in the differentially-private setting, good privacy guarantees.

We start with two technical results which, as we will see later, give an intuition why the random decision tree approach works very well in practice.

Theorem 5.1. Assume that the average tree accuracy of the set $M$ of all decision trees of height $h$ on the training/test set $D$ is $e = 1 - \epsilon$ for some $0 < \epsilon \le \frac{1}{2}$. Then the average goodness $\sigma(d)$ of a training/test point $d$ in $M$ is at least $e \ge \frac{1}{2}$.

The theorems above imply that if the average accuracy of a tree is better than random, then this is also reflected by the average values of w_d and σ_d. This fact is crucial for the theoretical analysis, since we will show that if the average values of w_d and σ_d are slightly better than random, then the empirical and generalization errors are very small. Furthermore, for most of the training/test points d the values of σ_d and w_d are well concentrated around those averages, and that, in a nutshell, explains why the random decision tree approach works well. Notice that Theorem 5.1 gives better quality guarantees than Theorem 5.2.

We are about to propose several results regarding differentially-private learning with random decision trees. They are based on a careful structural analysis of the bipartite graph between the set of decision trees and the datapoints, where edges connect datapoints with the trees that correctly classify them. In the differentially-private setting the key observation is that under relatively weak conditions one can assume that the sizes of the sets of datapoints residing in the leaves of the trees are substantial. Thus adding the Laplacian noise will not perturb the statistics to an extent that would affect the quality of learning. All upper-bounds on the generalization error were obtained by combining this analysis with concentration results (such as Azuma's inequality).

Non-differentially-private setting

We start by providing theoretical guarantees in the non-differentially-private case. Below we consider majority voting and threshold averaging. The results for probabilistic averaging are stated later in this subsection.

Theorem 5.3. Let K > 0. Assume that the average tree accuracy of the set M of all decision trees of height h on the training/test set D is e = 1 − ε for some 0 < ε ≤ 1/2. Let µ be: the fraction of training/test points with goodness in M at least σ = 1/2 + δ (for the empirical bound) or σ = 1/2 + δ + 1/K (for the generalization bound), for 0 < δ < 1/2 (in the majority version), or: the fraction of training/test points with weight in M at least w = 1/2 + δ (for the empirical bound) or w = 1/2 + δ + 1/K (for the generalization bound), for 0 < δ < 1/2 (in the threshold averaging version). Then Algorithm 1, for every C > 0 and k = (1+C)log(n)/(2δ²) selected random decision trees, gives empirical error err₁ ≤ 1 − µ with probability p₁ ≥ 1 − 1/n^C. The generalization error err₂ ≤ 1 − µ will be achieved for k = (1+C)log(n)/(2(δ/2)²) trees with probability p₂ ≥ p₁ − 2^{h+3} k e^{−2nφ²}, where φ = δ/(2(4+δ)2^h K). Probabilities p₁ and p₂ are over the random coin tosses used to construct the forest and over the test set.

Note that the parameter e always lies in the range [1/2, 1]. The more decision trees classify data in a nontrivial way (i.e., with accuracy greater than 1/2), the larger the value of e. The result above in particular implies that if most of the points have goodness/weight in M a little larger than 1/2, then both errors are very close to 0. This is indeed the case: the average point's goodness/weight in M, as Theorem 5.1 and Theorem 5.2 state, is at least e (respectively, e² + (1 − e)²). The latter expression is greater than 1/2 whenever the average tree accuracy is slightly larger than the worst possible. Besides, as was verified experimentally, the goodness/weight of most of the points is well concentrated around that average. We conclude that an average tree accuracy separated from 1/2 (but not necessarily very close to 1) suffices to classify most of the data points correctly. The intuition behind this result is as follows: if the constructed forest contains at least a few 'nontrivial' trees with better-than-random accuracy, then these trees guarantee correct classification of most of the points.

If we know that the average tree accuracy is large enough, then the techniques used to prove Theorem 5.3 give us the more direct bounds on the empirical and generalization errors captured in Theorem 5.4. No assumptions regarding goodness/weight are necessary there.

Theorem 5.4. Let K > 0. Assume that the average tree accuracy of the set M of all decision trees of height h on the training/test set D is e = 1 − ε for some 0 < ε ≤ 1/2. Then Algorithm 1, for every C > 0, 0 < δ < 1/2 and k = (1+C)log(n)/(2δ²) selected random decision trees, gives empirical error: err₁ ≤ ε/(1/2 − δ) (in the majority version) or: err₁ ≤ (2ε − 2ε²)/(1/2 − δ) (in the threshold averaging version) with probability p₁ ≥ 1 − 1/n^C. The generalization error: err₂ ≤ (ε + 1/K)/(1/2 − δ) (in the majority version) or: err₂ ≤ (2(ε + 1/K) − 2(ε + 1/K)²)/(1/2 − δ) (in the threshold averaging version) will be achieved for k = (1+C)log(n)/(2(δ/2)²) trees with probability p₂ ≥ p₁ − 2^{h+3} k e^{−2nφ²}, where φ = δ/(2(4+δ)2^h K). Probabilities p₁ and p₂ are over the random coin tosses used to construct the forest and over the test set.

Theorems 5.3 and 5.4 show that in practice a logarithmic number of random decision trees suffices to obtain high prediction accuracy with a very large probability. In particular, the upper-bound on the generalization error is about twice the average error of a single tree. The existence of a lower-accuracy tree in the forest does not harm the entire scheme, since all trees of the forest play a role in the final classification.
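The logarithmic growth of the forest size can be illustrated numerically. This is a sketch: the helper name is ours, and the natural logarithm is assumed for log(n):

```python
import math

def num_trees(n, C=1.0, delta=0.1):
    # k = (1 + C) * log(n) / (2 * delta^2), rounded up to an integer,
    # as in the empirical-error versions of Theorems 5.3 and 5.4
    return math.ceil((1 + C) * math.log(n) / (2 * delta ** 2))

for n in (10**3, 10**6, 10**9):
    print(n, num_trees(n))  # the forest size grows only logarithmically in n
```

Tripling the exponent of n only triples the required number of trees, even though the dataset grows by a factor of a million each time.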

We now state our results (analogous to Theorem 5.4) for the probabilistic averaging setting. The following is true.

Theorem 5.5. Let K > 0. Assume that the average tree accuracy of the set M of all decision trees of height h on the training/test set D is e = 1 − ε for some 0 < ε ≤ 1/2. Let C > 0 be a constant and let 0 < δ, c < 1. Then with probability at least p₁ = (1 − 1/n^C)(1 − e^{−2nc²}) the probabilistic averaging version of Algorithm 1 gives empirical error err₁ ≤ 2ε − 2ε² + δ + c, and with probability p₂ ≥ p₁ − 2^{h+3} k e^{−2nφ²}, where φ = δ/(2(4+δ)2^h K), it gives generalization error err₂ ≤ 2(ε + 1/K) − 2(ε + 1/K)² + δ + c. Probabilities p₁, p₂ are over the random tosses used to construct the forest and over the test set.

Notice that this result is nontrivial for almost the entire range [0, 1/2] of ε, for δ and c close to 0, and for large K. This is the case since 1 − 2ε + 2ε² ≥ 1/2, with equality only for ε = 1/2.
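The inequality behind this remark can be checked directly on a grid (a small sketch; note the algebraic identity 1 − 2ε + 2ε² = 2(ε − 1/2)² + 1/2, which makes the claim obvious):

```python
# Check that 1 - 2ε + 2ε² ≥ 1/2 on [0, 1/2], with equality only at ε = 1/2,
# so the error bound 2ε - 2ε² + δ + c of Theorem 5.5 stays below 1/2 + δ + c.
eps_grid = [i / 1000 for i in range(501)]  # ε ∈ {0, 0.001, ..., 0.5}
vals = [1 - 2 * e + 2 * e * e for e in eps_grid]
assert all(v >= 0.5 for v in vals)
assert vals.index(min(vals)) == len(eps_grid) - 1  # minimum attained at ε = 1/2
```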

Differentially-private setting

We begin this section with a proof that all three methods captured in Algorithm 2, where Laplacian noise is added to certain counts, are indeed η-differentially-private.

Proof. Notice that in every method, to obtain the forest of random decision trees with perturbed counters in the leaves, we need k queries to the private data (this is true since the structure of the inner nodes of the trees does not depend on the data at all, and the data subsets corresponding to the leaves are pairwise disjoint). Furthermore, the values being perturbed by the Laplacian noise are simple counts of global sensitivity 1. Thus we can use Theorem 3.1 and Theorem 3.2 to conclude that, in order to obtain η-differential privacy of the entire system, it suffices to add noise drawn from Lap(0, k/η) to every count in every leaf. This proves that our algorithms are indeed η-differentially-private.
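The perturbation step can be sketched as follows, assuming leaf counts stored as (label-0, label-1) pairs; the function names are ours, and the inverse-CDF Laplace sampler is a standard textbook construction, not code from the paper:

```python
import math
import random

def sample_laplace(scale, rng):
    # inverse-CDF sampling from Lap(0, scale)
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def privatize_leaf_counts(counts, k, eta, rng):
    """Add Lap(0, k/η) noise to both label counts of every leaf. With k trees
    (k disjoint count queries of sensitivity 1 each), composition yields
    η-differential privacy for the whole forest, as in the proof above."""
    scale = k / eta
    return {leaf: (a + sample_laplace(scale, rng), b + sample_laplace(scale, rng))
            for leaf, (a, b) in counts.items()}

rng = random.Random(0)
noisy = privatize_leaf_counts({0: (10, 5), 1: (3, 7)}, k=9, eta=2.0, rng=rng)
```

Note how the noise scale grows linearly with the number of trees k: a larger forest spends the privacy budget η over more queries, which is exactly the trade-off quantified in the theorems below.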

Next we show the theoretical guarantees we obtained in the differentially-private setting. As in the previous section, we first focus on the majority voting and threshold averaging, and then we consider the probabilistic averaging.

Theorem 5.6. Assume that we are given a parameter η > 0 and let K > 0. Assume that the average tree accuracy of the set M of all decision trees of height h on the training/test set D is e = 1 − ε for some 0 < ε ≤ 1/2. Let µ be the fraction of training/test points with: goodness in M at least σ = 1/2 + δ + 1/K (for the empirical bound) or σ = 1/2 + δ + 2/K (for the generalization bound) in the majority version, or: weight in M at least w = 1/2 + δ + 1/K (for the empirical bound) or w = 1/2 + δ + 2/K (for the generalization bound) in the threshold averaging version, for 0 < δ < 1/2. Then Algorithm 2, with k selected random decision trees and differential privacy parameter η, gives empirical error err₁ ≤ 1 − µ with probability p₁ ≥ 1 − n(e^{−kδ²/2} + e^{−k/2} + k e^{−λnη/k}) and generalization error err₂ ≤ 1 − µ with probability p₂ ≥ p₁ − 2^{h+3} k e^{−2nφ²}, where λ = δ/(24K·2^h) and φ = δ/(2(4+δ)2^h K). Probabilities p₁ and p₂ are over the random coin tosses used to construct the forest and over the test set. Furthermore, we always have: µ ≥ 1 − ε/(1/2 − δ − 1/K) (empirical) or µ ≥ 1 − ε/(1/2 − δ − 2/K) (generalization) in the majority version, and: µ ≥ 1 − (2ε − 2ε²)/(1/2 − δ − 1/K) (empirical) or µ ≥ 1 − (2ε − 2ε²)/(1/2 − δ − 2/K) (generalization) in the threshold averaging version.

Notice that if the number of trees k in the forest is logarithmic in n then p 1 is close to one and so is p 2 .

Again, as in the non-differentially-private case, we see that if there are many points with goodness/weight in M close to the average goodness/weight, then the empirical and generalization errors are small. Notice also that increasing the number of trees too much has an adverse impact on the empirical error (the term k e^{−λnη/k} in the lower bound on p₁). More trees means a bigger variance of the single Laplacian used in each leaf of a tree, which affects tree quality. The theorem above describes this phenomenon quantitatively.

If the average tree accuracy is large enough, then the following result is of independent interest. It bounds in particular the empirical error of the threshold averaging version of Algorithm 2 (similar results hold for the generalization error and for the majority voting version of Algorithm 2).

Theorem 5.7. Assume that we are given a parameter η > 0. Assume besides that the average tree accuracy of the set M of all decision trees of height h on the training set D is e = 1 − ε for some 0 < ε ≤ 1/2. Let 0 < δ < 1/2. Let γ = 1/(2^h · 9600) and let k_opt be the integer value for which the value of the function f(k) = e^{−k/200} + 2k e^{−γ√n η/k} is smallest possible. Then with probability at least p = 1 − n(e^{−k_opt/200} + 2k_opt e^{−γ√n η/k_opt} + e^{−n/2}) the η-differentially-private threshold averaging version of Algorithm 2 gives empirical error at most 1/8 + (9/2)ε − 5ε² for the forest with k_opt randomly chosen decision trees. Probability p is over the random coin tosses used to construct the forest.
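Theorem 5.7 defines k_opt only implicitly, as the integer minimizing f. Since f is cheap to evaluate, a direct search recovers it; this is a sketch with our own function names, not the paper's procedure:

```python
import math

def f_priv(k, n, eta, h):
    # f(k) = e^{-k/200} + 2k * e^{-γ √n η / k},  with γ = 1 / (2^h · 9600)
    gamma = 1 / (2 ** h * 9600)
    return math.exp(-k / 200) + 2 * k * math.exp(-gamma * math.sqrt(n) * eta / k)

def k_opt(n, eta, h, k_max=2000):
    # brute-force the minimizing integer over a bounded range
    return min(range(1, k_max + 1), key=lambda k: f_priv(k, n, eta, h))
```

With a generous privacy budget the second (noise) term stays small for moderate k and a larger forest pays off; with a tiny budget the noise term dominates and a single tree is optimal.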

Both theorems show that in practice a logarithmic number of random decision trees suffices to obtain good accuracy together with a high level of differential privacy.

The next theorem considers the differentially-private probabilistic averaging setting.

As in the two previous settings, information about the average accuracy of just a single tree gives strong guarantees regarding the classification quality achieved by the differentially-private version of the forest. The next result (analogous to Theorem 5.7) shows how to choose the optimal number of trees and shows that this number is again at most logarithmic in the data size.

...

Experiments

The experiments were performed on the following benchmark datasets³: Banknote Authentication (Ban Aut), Blood Transfusion Service Center (BTSC), Congressional Voting Records (CVR), Mammographic Mass (Mam Mass), Mushroom, Adult, Covertype and Quantum. 90% of each dataset was used for training and the remaining part for testing. Furthermore, 10% of the training dataset was used as a validation set. All code for our experiments is publicly released.

³ Downloaded from http://osmot.cs.cornell.edu/kddcup/, http://archive.ics.uci.edu/ml/datasets.html, and http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

Table 1: Comparison of the performance of random forests and rpart.

We first compare the test error (%) obtained using five different methods: an open-source implementation of CART called rpart [48], and the non-differentially-private (n-dp) and differentially-private (dp) random forest with majority voting (RFMV) and threshold averaging (RFTA). For all methods except rpart we also report the number of trees in the forest (k) and the height of the tree (h) for which the smallest validation error was obtained, where we explored h ∈ {1, 2, 3, ..., 15} and k ∈ {1, 3, 5, ..., 21}. In all experiments the differential privacy parameter η was set to η = 1000/n_tr, where n_tr is the number of training examples. Table 1 captures the results. For each experiment we report the average test error over 10 runs, together with symmetric binomial 95% confidence intervals. The performance of the random forest with probabilistic averaging (RFPA) was significantly worse than that of the competing methods (RFMV, RFTA, rpart) and is not reported in the table; it is, however, shown in the next set of results.
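The data-splitting protocol described above can be sketched as follows (our own helper names; whether n_tr counts the training examples before or after the validation hold-out is not stated in the text, so the η computation below is an assumption):

```python
import random

def split_dataset(data, rng):
    """90/10 train/test split, then 10% of the training part held out for
    validation, mirroring the experimental protocol described above."""
    data = data[:]
    rng.shuffle(data)
    n_test = len(data) // 10
    test, train_full = data[:n_test], data[n_test:]
    n_val = len(train_full) // 10
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(1000)), random.Random(0))
eta = 1000 / len(train)  # η = 1000 / n_tr (assuming n_tr excludes validation data)
```

The grid search over h and k then simply picks the pair minimizing the error on `val`.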

The next set of results⁴ (Figures 1 and 2) is reported for exemplary datasets (Banknote Authentication, Congressional Voting Records, Mammographic Mass and Mushroom) and for the following methods: dpRFMV, dpRFTA and dpRFPA. Note that similar results were obtained for the remaining datasets. In Figure 1a we report the test error vs. h for selected settings of k⁵. In Figure 1b we also show the minimal, average and maximal test error vs. h for dpRFMV, whose performance was overall the best. Similarly, in Figure 1c we report the test error vs. k for two selected settings of h, and in Figure 1d the minimal, average and maximal test error vs. k for dpRFMV.

Finally, in Figure 2a we report test error for various settings of η and two selected settings of h . For each experiment k was chosen from the set { 1 , 2 , . . . , 101 } to give the smallest validation error. Additionally, in Figure 2b we show how the test error changes with k for a fixed h and various levels of η .

Figure 2a shows that in most cases dpRFTA outperforms the remaining differentially-private classifiers; however, it requires careful selection of the forest parameters (h and k) to obtain optimal performance, as is illustrated in Figures 1c and 2b. This problem can be overcome by using dpRFMV, which has comparable performance to dpRFTA but is much less sensitive to the setting of the forest parameters. Therefore dpRFMV is much easier to use in the differentially-private setting.

⁴ All figures in this section should be read in color.

⁵ Recall that when the forest contains only one tree (k = 1), the majority voting and threshold averaging rules are equivalent; thus the blue curve then overlaps with the green curve on the plot.

Conclusions

In this paper we first provide a novel theoretical analysis of supervised learning with non-differentially-private random decision trees in three cases: majority voting, threshold averaging and probabilistic averaging. Second, we show that the algorithms we consider here can be easily adapted to the setting where high privacy guarantees must be achieved. We furthermore provide both theoretical and experimental evaluation of the differentially-private random decision tree approach. To the best of our knowledge, a theoretical analysis of differentially-private random decision trees has not been done before. Our experiments reveal that majority voting and threshold averaging yield good differentially-private classifiers, and that majority voting in particular exhibits less sensitivity to the forest parameters.

References

Figure 1: Comparison of dpRFMV, dpRFTA and dpRFPA. η = 1000/n_tr = 0.137 for selected datasets. Test error resp. vs. a) h across various settings of k and vs. c) k across various settings of h; minimal, average and maximal test error resp. vs. h (b) and vs. k (d) for dpRFMV.

[1] R. Agrawal and R. Srikant. Privacy-preserving data mining. In ACM SIGMOD, 2000.
[2] W. Du and J. Zhan. Using randomized response techniques for privacy-preserving data mining. In KDD, 2003.
[3] K. Choromanski, T. Jebara, and K. Tang. Adaptive anonymity via b-matching. In NIPS, 2013.
[4] K. Chaudhuri and C. Monteleoni. Privacy-preserving logistic regression. In NIPS, 2008.
[5] G. Jagannathan and R. N. Wright. Privacy-preserving distributed k-means clustering over arbitrarily partitioned data. In KDD, 2005.
[6] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.
[7] L. Breiman. Random forests. Machine Learning, 45:5-32, 2001.
[8] G. Biau, L. Devroye, and G. Lugosi. Consistency of random forests and other averaging classifiers. J. Mach. Learn. Res., 9:2015-2033, 2008.
[9] W. Fan. On the optimality of probability estimation by random decision trees. In AAAI, 2004.
[10] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26:1651-1686, 1998.
[11] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In CVPR, 2011.
[12] Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Comput., 9:1545-1588, 1997.
[13] C. Xiong, D. Johnson, R. Xu, and J. J. Corso. Random forests for metric learning with implicit pairwise position dependence. In KDD, 2012.
[14] R. Agarwal, A. Gupta, Y. Prabhu, and M. Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In WWW, 2013.
[15] A. Z. Kouzani and G. Nasireding. Multilabel classification by BCH code and random forests. International Journal on Network Security, 1(2):5, 2010.
[16] R. Yan, J. Tesic, and J. R. Smith. Model-shared subspace boosting for multi-label classification. In KDD, 2007.
[17] V. Svetnik, A. Liaw, C. Tong, J. C. Culberson, R. P. Sheridan, and B. P. Feuston. Random forest: A classification and regression tool for compound classification and QSAR modeling. Journal of Chemical Information and Computer Sciences, 43(6):1947-1958, 2003.
[18] A. M. Prasad, L. R. Iverson, and A. Liaw. Newer classification and regression tree techniques: Bagging and random forests for ecological prediction. Ecosystems, 9(2):181-199, 2006.
[19] A. Criminisi and J. Shotton. Decision Forests for Computer Vision and Medical Image Analysis. Springer Publishing Company, 2013.
[20] D. Zikic, B. Glocker, and A. Criminisi. Atlas encoding by randomized forests for efficient label propagation. In MICCAI, 2013.
[21] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. CRC Press LLC, Boca Raton, Florida, 1984.
[22] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[23] T. K. Ho. Random decision forest. In ICDAR, 1995.
[24] T. K. Ho. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell., 20(8):832-844, 1998.
[25] T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139-157, 2000.
[26] G. Biau. Analysis of a random forests model. J. Mach. Learn. Res., 13:1063-1095, 2012.
[27] M. Denil, D. Matheson, and N. de Freitas. Consistency of online random forests. In ICML, 2013.
[28] M. Denil, D. Matheson, and N. de Freitas. Narrowing the gap: Random forests in theory and in practice. In ICML, 2014.
[29] N. Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7:983-999, 2006.
[30] Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101:578-590, 2002.
[31] L. Breiman. Some infinite theory for predictor ensembles. Technical Report 577, Statistics Department, UC Berkeley, http://www.stat.berkeley.edu/~breiman, 2000.
[32] L. Breiman. Consistency for a simple model of random forests. Technical Report 670, Statistics Department, UC Berkeley, 2004.
[33] H. Ishwaran and U. B. Kogalur. Consistency of random survival forests. Statistics and Probability Letters, 80(13-14):1056-1064, 2010.
[34] P. Domingos and G. Hulten. Mining high-speed data streams. In KDD, 2000.
[35] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Mach. Learn., 63:3-42, 2006.
[36] G. Jagannathan, K. Pillaipakkamnatt, and R. N. Wright. A practical differentially private random decision tree classifier. Trans. Data Privacy, 5(1):273-295, 2012.
[37] R. Genuer. Risk bounds for purely uniformly random forests. arXiv:1006.2980, 2010.
[38] R. Genuer. Variance reduction in purely random forests. Journal of Nonparametric Statistics, 24:543-562, 2012.
[39] Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101:578-590, 2006.
[40] W. Fan, E. Greengrass, J. McCloskey, P. Yu, and K. Drummey. Effective estimation of posterior probabilities: Explaining the accuracy of randomized decision tree approaches. In ICDM, 2005.
[41] W. Fan, H. Wang, P. S. Yu, and S. Ma. Is random model better? On its accuracy and efficiency. In ICDM, 2003.
[42] Y. Yang, Z. Zhang, G. Miklau, M. Winslett, and X. Xiao. Differential privacy in data publication and analysis. In ACM SIGMOD, 2012.
[43] J. Vaidya, B. Shafiq, W. Fan, D. Mehmood, and D. Lorenzi. A random decision tree framework for privacy-preserving data mining. IEEE Transactions on Dependable and Secure Computing, 11:399-411, 2014.
[44] A. Patil and S. Singh. Differential private random forest. In ICACCI, 2014.
[45] C. Dwork. Differential privacy: A survey of results. In TAMC, 2008.
[46] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, 2007.
[47] R. Hall, A. Rinaldo, and L. A. Wasserman. Differential privacy for functions and functional data. Journal of Machine Learning Research, 14(1):703-727, 2013.
[48] T. M. Therneau, B. Atkinson, and B. Ripley. rpart: Recursive partitioning. http://CRAN.R-project.org/package=rpart, 2011.

Empirical and generalization errors

Preliminaries

We will prove here results regarding the empirical and generalization errors of all the variants of the algorithm mentioned in the paper, as well as Theorem 5.1 and Theorem 5.2. Without loss of generality we assume that all attributes are binary (taking values in {0, 1}). The proofs translate directly to the continuous case; we leave this simple exercise to the reader.

Let us introduce first some useful notation that will be very helpful in the proofs we present next.

We denote by n the size of the dataset (training or test) T. Recall that m is the number of attributes of any given data point, h is the height of the random decision tree, and M is the set of all random decision trees under consideration.

We focus first on classifying with just one decision tree. Fix some decision tree T_i and one of its leaves. Assume that the leaf contains a points with label − and b points with label +. We associate label − with that leaf if a > b and label + otherwise. To classify a given point using the tree, we feed the tree with that point and assign to the point the label of the corresponding leaf. Denote by m_i the number of data points correctly classified by a tree T_i and let e_i = m_i/n. We call e_i the quality (or accuracy) of the tree T_i. Note that we always have e_i ≥ 1/2, since for every leaf of any given tree the majority of the data points from that leaf are classified correctly. Denote e = (1/|M|) Σ_{i=1}^{|M|} e_i. We call e the average tree accuracy; this parameter measures how well data points are classified on average by a complete decision tree of a given height h. Note that e ≥ 1/2. Denote t = 2^h; the parameter t is the number of leaves of a decision tree.
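These per-tree quantities can be computed directly from leaf counts (a small sketch with our own names; the toy leaf counts are illustrative):

```python
def tree_accuracy(leaf_counts, n):
    """e_i = m_i / n: each leaf contributes its majority count max(a, b),
    since the majority label classifies that leaf's points correctly.
    Hence e_i >= 1/2 always."""
    return sum(max(a, b) for a, b in leaf_counts) / n

# toy tree with three non-empty leaves, as (label - count, label + count) pairs
leaves = [(30, 10), (5, 25), (15, 15)]
n = sum(a + b for a, b in leaves)        # 100 points
e_i = tree_accuracy(leaves, n)           # (30 + 25 + 15) / 100 = 0.7
```

The average tree accuracy e is then simply the mean of e_i over all trees in M.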

For i = 1, 2, ..., |M| and j = 1, 2, ..., t denote by n^i_j the number of points from the dataset in the j-th leaf of the decision tree T_i, and by m^i_j the number of points from the dataset in the j-th leaf of T_i that were classified correctly. Denote e^i_j = m^i_j/n^i_j for n^i_j > 0 and e^i_j = 1 for n^i_j = 0. Note that e^i_j ≥ 1/2 for every i, j. Note also that n = n^i_1 + ... + n^i_t and m_i = m^i_1 + ... + m^i_t. Denote by a^i_j the number of data points in the j-th leaf of T_i with label 0, and by b^i_j the number of data points in the j-th leaf of T_i with label 1.

We will frequently use the following structure in the proofs. Let G be a bipartite graph with color classes A, B and weighted edges. Color class A consists of the n points of the dataset. Color class B consists of 2t|M| elements of the form y^{i,b}_j, where i ∈ {1, 2, ..., |M|}, b ∈ {0, 1} and j ∈ {1, 2, ..., t}.

A data point x ∈ A is adjacent to y^{i,1}_j iff it belongs to the larger of the two groups (those with labels 0 and 1) of the data points in the j-th leaf of the decision tree T_i; the edge joining x with y^{i,1}_j has weight e^i_j. A data point x ∈ A is adjacent to y^{i,0}_j iff it belongs to the smaller of the two groups of data points in the j-th leaf of T_i; the edge joining x with y^{i,0}_j has weight 1 − e^i_j. Note that the degree of a vertex y^{i,1}_j is m^i_j and the degree of a vertex y^{i,0}_j is n^i_j − m^i_j.

In the proofs we will refer to the size of the set of decision trees under consideration as: | M | or k (note that k is used in the main body of the paper).

We are ready to prove Theorem 5.1 and Theorem 5.2.

Proof. We start with the proof of Theorem 5.2. Note that from the definition of w d we get:

$$
\sum_{d \in T} w_d = \frac{1}{|M|} \sum_{i=1}^{|M|} \sum_{j=1}^{t} \Big( m^i_j e^i_j + (n^i_j - m^i_j)(1 - e^i_j) \Big)
$$

Therefore, using the formula m^i_j = n^i_j e^i_j, we get:

$$
\sum_{d \in T} w_d = \frac{1}{|M|} \sum_{i=1}^{|M|} \sum_{j=1}^{t} n^i_j \Big( (e^i_j)^2 + (1 - e^i_j)^2 \Big) = n \sum_{i=1}^{|M|} \sum_{j=1}^{t} \frac{n^i_j}{|M| n} \Big( (e^i_j)^2 + (1 - e^i_j)^2 \Big)
$$

Note that we have Σ_{i=1}^{|M|} Σ_{j=1}^{t} n^i_j = n|M|. From Jensen's inequality, applied to the function f(x) = x², we get: Σ_{i=1}^{|M|} Σ_{j=1}^{t} (n^i_j/(|M|n)) (e^i_j)² ≥ (Σ_{i=1}^{|M|} Σ_{j=1}^{t} (n^i_j/(|M|n)) e^i_j)² = (Σ_{i=1}^{|M|} Σ_{j=1}^{t} m^i_j/(|M|n))² = (en|M|/(n|M|))² = e², where e is the average quality of the system of all complete decision trees of height h (the average tree accuracy). Similarly, Σ_{i=1}^{|M|} Σ_{j=1}^{t} (n^i_j/(|M|n)) (1 − e^i_j)² ≥ (1 − e)². Thus we get:

$$
\sum_{d \in T} w_d \geq n \big( e^2 + (1 - e)^2 \big)
$$

That completes the proof of Theorem 5.2. The proof of Theorem 5.1 is even simpler. Notice that for any data point d the expression σ(d)·|M| counts the number of decision trees from M that classified d correctly (this follows directly from the definition of σ). Thus we have Σ_{d∈T} σ(d)·|M| = Σ_{i=1}^{|M|} m_i. Therefore (1/n) Σ_{d∈T} σ(d) = (1/|M|) Σ_{i=1}^{|M|} e_i = e, and we are done.

We need one more technical result, Azuma's inequality:

Lemma 9.1 (Azuma's inequality). Let {W_n, n ≥ 1} be a martingale with mean 0, and suppose that for some non-negative constants α_i, β_i we have −α_i ≤ W_i − W_{i−1} ≤ β_i for i = 2, 3, .... Then for any n ≥ 1, a > 0:

$$
P(W_n \geq a) \leq e^{-2a^2 / \sum_{i=1}^{n} (\alpha_i + \beta_i)^2}, \qquad P(W_n \leq -a) \leq e^{-2a^2 / \sum_{i=1}^{n} (\alpha_i + \beta_i)^2}.
$$
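A quick Monte Carlo sanity check of the lower-tail bound, taking α_i = β_i = 1/2 for every increment (so α_i + β_i = 1) and i.i.d. ±1/2 coin flips as a martingale; a sketch with our own names:

```python
import math
import random

def azuma_bound(a, k):
    # P(W_k <= -a) <= exp(-2 a^2 / k) when every increment has α_i + β_i = 1
    return math.exp(-2 * a * a / k)

def empirical_tail(a, k, trials, rng):
    # empirical estimate of P(W_k <= -a) for a sum of ±1/2 coin flips
    hits = 0
    for _ in range(trials):
        w = sum(0.5 if rng.random() < 0.5 else -0.5 for _ in range(k))
        hits += w <= -a
    return hits / trials

rng = random.Random(0)
emp = empirical_tail(a=10, k=100, trials=5000, rng=rng)
bound = azuma_bound(a=10, k=100)  # e^{-2} ≈ 0.135
```

The empirical tail sits well below the bound, as expected: Azuma's inequality is loose for this particular martingale but holds for any martingale with the stated increment bounds.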

...

Majority voting and threshold averaging setting - empirical error

We will now prove the parts of Theorems 5.3 and 5.4 regarding empirical errors.

Proof. Again, we start with the analysis of threshold averaging. Take the i-th random decision tree T^R_i, where i ∈ {1, 2, ..., k}. For a given data point d from the training set let X^d_i be a random variable defined as follows. If d does not belong to any leaf of T^R_i then let X^d_i = 0. Otherwise, let a^R_i be the number of points from the training set with label 0 in the leaf of T^R_i containing d, and let b^R_i be the number of points from the training set with label 1 in that leaf. If d has label 0 then we take X^d_i = a^R_i/(a^R_i + b^R_i); otherwise we take X^d_i = b^R_i/(a^R_i + b^R_i). Denote X^d = (X^d_1 + ... + X^d_k)/k. When it is clear from the context which data point we refer to, we will skip the upper index and simply write X or X_i respectively.

Fix some point d from the training set. Note that if X > 1/2 then the point d is correctly classified. Notice that the weight w_d of the point d is nothing else but the sum of the weights of all edges of G incident to d, divided by the number of all trees (or the average weight of an edge incident to d if we consider real-valued attributes). Note that we have EX = w_d and that from Theorem 5.2 we get:

$$
\sum_{d \in T} w_d \geq n \big( e^2 + (1 - e)^2 \big).
$$

Take 0 < δ < 1/2. Denote by µ the fraction of points d from the training data such that w_d ≥ 1/2 + δ. From the lower bound on Σ_{d∈T} w_d we have just derived, we get (1/2 + δ)(1 − µ)n + µn ≥ n(e² + (1 − e)²), which gives us:

$$
\mu \geq 1 - \frac{2\epsilon - 2\epsilon^2}{\frac{1}{2} - \delta},
$$

where ε = 1 − e.

Take a point d from the training set such that w_d ≥ 1/2 + δ. Denote by p_d the probability that d is misclassified. We have:

$$
p_d \leq P\Big(X \leq \frac{1}{2}\Big) \leq P(X - w_d \leq -\delta).
$$

Denote Z_i = X_i − w_d for i = 1, 2, ..., k. We have:

$$
p_d \leq P\Big(\frac{Z_1 + ... + Z_k}{k} \leq -\delta\Big) = P(Z_1 + ... + Z_k \leq -k\delta).
$$

Note that, since w_d = EX_i and the random variables X_i are independent, {Z_1, Z_1 + Z_2, ..., Z_1 + Z_2 + ... + Z_k} is a martingale. Note also that −α_i ≤ Z_i ≤ β_i for some α_i, β_i > 0 with α_i + β_i = 1. Using Lemma 9.1, we get:

$$
p_d \leq P(Z_1 + ... + Z_k \leq -k\delta) \leq e^{-2k\delta^2}.
$$

Therefore the probability that at least one of the µn points d for which w_d ≥ 1/2 + δ is misclassified by the set of k random decision trees is, by the union bound, at most µne^{−2kδ²} ≤ ne^{−2kδ²}. For k = (1+C)log(n)/(2δ²) this completes the proof of the upper bound on the empirical error from Theorems 5.3 and 5.4, since we have already proved that µ ≥ 1 − (2ε − 2ε²)/(1/2 − δ). The proof for the majority voting version goes along exactly the same lines; this time, instead of Theorem 5.2, we use Theorem 5.1. We know that Σ_{d∈T} σ(d) ≥ ne, where e = 1 − ε. Denote by µ_x the fraction of points d with σ(d) ≥ 1/2 + x for 0 < x < 1/2. Then, by an argument similar to the one presented above, we have:

$$
\mu_x \geq 1 - \frac{\epsilon}{\frac{1}{2} - x}.
$$

All other details of the proof for the majority voting are exactly the same as for the threshold averaging scheme.
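The final step of the proof above, plugging k = (1+C)log(n)/(2δ²) into the union bound n·e^{−2kδ²}, can be checked numerically (a small sketch; the parameter values are illustrative):

```python
import math

n, C, delta = 10**5, 2.0, 0.1
k = (1 + C) * math.log(n) / (2 * delta ** 2)
failure = n * math.exp(-2 * k * delta ** 2)  # collapses exactly to n^{-C}
assert abs(failure - n ** (-C)) < 1e-15
```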

Proof. Let K > 0 be a constant. We first consider the threshold averaging scheme. Take a decision tree T_i. Denote by S_{T_i} the set of points d from the training set with the following property: in T_i, the point d belongs to a leaf that contains at least n/(K2^h) points. Note that since each T_i has exactly 2^h leaves, we can conclude that |S_{T_i}| ≥ n(1 − 1/K). In this proof, and in the proofs presented in the next section, we will consider the graph G_D obtained from G by deleting the edges adjacent to those vertices of color class B that correspond to leaves containing fewer than n/(K2^h) points from the training set. Take a point d from the training set with w^t_d ≥ 1/2 + δ, where w^t_d is the average weight of an edge incident to d in G_D. Notice that w_d ≥ 1/2 + δ + 1/K implies w^t_d ≥ 1/2 + δ. We say that a decision tree T_i is d-good if the leaf of T_i to which d belongs contains at least n/(K2^h) points from the training set. Let us now define X^d_i. If the i-th chosen random decision tree is d-good, then X^d_i is defined as in the proof of Theorem 5.3; otherwise we put X^d_i = 0. Denote Z_i = X^d_i − w^t_d. Note that the probability p_d that the point d is misclassified by the selected random decision trees satisfies p_d ≤ P((Z_1 + ... + Z_k)/k + Σ_{j∈I} R_j/|I| ≤ −δ), where I is the set of indices corresponding to those chosen random decision trees that are d-good, and the random variables R_j are correction terms for d-good random decision trees that must be introduced in order to take into account the added Laplacians (if I = ∅ then we take the value of the expression Σ_{j∈I} R_j/|I| to be 0). Note also that {R_j, Z_j : j = 1, 2, ..., k} is a set of independent random variables. We get:

$$
p_d \le P\left(\frac{Z_1 + \dots + Z_k}{k} \le -\frac{\delta}{2}\right) + P\left(\frac{\sum_{j \in \mathcal{I}} R_j}{|\mathcal{I}|} \le -\frac{\delta}{2}\right).
$$

Since by Azuma's inequality $P\left(\frac{Z_1 + \dots + Z_k}{k} \le -\frac{\delta}{2}\right) \le e^{-\frac{k\delta^2}{2}}$, we have:

$$
p_d \le e^{-\frac{k\delta^2}{2}} + P\left(\frac{\sum_{j \in \mathcal{I}} R_j}{|\mathcal{I}|} \le -\frac{\delta}{2}\right).
$$

We will now estimate the expression $p_r = P\left(\frac{\sum_{j \in \mathcal{I}} R_j}{|\mathcal{I}|} \le -\frac{\delta}{2}\right)$.

For $i \in \mathcal{I}$ denote by $A_i$ the event that each of the two perturbation errors added to the leaf containing point $d$ is of magnitude at most $\frac{\sqrt{n}}{K2^h}\delta_1$, where $\delta_1 = \frac{\delta}{24}$. Denote $A = \bigcap_{i \in \mathcal{I}} A_i$ and denote by $A^c$ the complement of $A$. We have: $P\left(\frac{\sum_{j \in \mathcal{I}} R_j}{|\mathcal{I}|} \le -\frac{\delta}{2}\right) = P\left(\frac{\sum_{j \in \mathcal{I}} R_j}{|\mathcal{I}|} \le -\frac{\delta}{2} \mid A\right)P(A) + P\left(\frac{\sum_{j \in \mathcal{I}} R_j}{|\mathcal{I}|} \le -\frac{\delta}{2} \mid A^c\right)(1 - P(A))$. Thus we get:

$$
p_r \le P\left(\frac{\sum_{j \in \mathcal{I}} R_j}{|\mathcal{I}|} \le -\frac{\delta}{2} \,\Big|\, A\right) + (1 - P(A)).
$$

Now take one of the chosen random decision trees $T_i$ with $i \in \mathcal{I}$. Take its leaf that contains the given point $d$ from the training set. Assume that this leaf contains $r$ points from the training set with some fixed label $l \in \{-, +\}$ and that it contains $n_a$ points altogether. Note that from the definition of $\mathcal{I}$ we have $n_a \ge \frac{n}{K2^h}$. Let $g_1, g_2$ be two independent Laplacian random variables, each with density function $\frac{\eta}{2k}e^{-\frac{|x|\eta}{k}}$. We would like to estimate the random variable $\Theta = \frac{r + g_1}{n_a + g_1 + g_2} - \frac{r}{n_a}$ on the event $A$. Note that in particular we know that $|g_1|, |g_2| \le \frac{\delta_1 n_a}{\sqrt{n}}$. A simple calculation gives us:

$$
|\Theta| \le \frac{\frac{\delta_1}{\sqrt{n}}}{1 - \frac{2\delta_1}{\sqrt{n}}} \le \frac{\delta}{4\sqrt{n}}.
$$

Now consider the truncated probability space $\Omega|_A$ and truncated random variables $R^t_i = R_i|_A$ for $i \in \mathcal{I}$. We have: $P\left(\sum_{i \in \mathcal{I}} R_i \le -\frac{|\mathcal{I}|\delta}{2} \,\Big|\, A\right) = P\left(\sum_{i \in \mathcal{I}} R^t_i \le -\frac{|\mathcal{I}|\delta}{2}\right)$. Using inequality 5, we get:

$$
|R^t_i| \le \frac{\delta}{4\sqrt{n}} \quad \text{for } i \in \mathcal{I}.
$$

Thus we can use Azuma's inequality once more, this time to find an upper bound on the expression $P\left(\sum_{i \in \mathcal{I}} R^t_i \le -\frac{|\mathcal{I}|\delta}{2}\right)$ (we assume here that the random decision trees have been selected, thus $\mathcal{I}$ is given). Without loss of generality we can assume that $\mathcal{I} \neq \emptyset$. We have: $P\left(\sum_{i \in \mathcal{I}} R^t_i \le -\frac{|\mathcal{I}|\delta}{2}\right) = P\left(\sum_{i \in \mathcal{I}} (R^t_i - \mathbb{E}R^t_i) \le -\frac{|\mathcal{I}|\delta}{2} - \sum_{i \in \mathcal{I}} \mathbb{E}R^t_i\right) \le P\left(\sum_{i \in \mathcal{I}} (R^t_i - \mathbb{E}R^t_i) \le -\frac{|\mathcal{I}|\delta}{4}\right) \le e^{-\frac{2|\mathcal{I}|(\frac{\delta}{4})^2}{(\frac{\delta}{4\sqrt{n}} + \frac{\delta}{4\sqrt{n}})^2}}$. Therefore we get:

$$
p_r \le e^{-\frac{2|\mathcal{I}|(\frac{\delta}{4})^2}{(\frac{\delta}{4\sqrt{n}} + \frac{\delta}{4\sqrt{n}})^2}} + (1 - P(A)).
$$

It remains to bound the expression $1 - P(A)$. Let $g$ be a Laplacian random variable with density function $\frac{\eta}{2k}e^{-\frac{|x|\eta}{k}}$. Note that from the union bound we get: $1 - P(A) \le 2kP\left(|g| > \frac{\sqrt{n}\delta}{24K2^h}\right)$, where the factor $2$ in the expression $2kP\left(|g| > \frac{\sqrt{n}\delta}{24K2^h}\right)$ comes from the fact that for a given data point $d$ we need to add perturbation error in two places in the leaf of the chosen random decision tree corresponding to $d$. Denote $\gamma = \frac{\delta}{24K2^h}$. We have:

$$
1 - P(A) \le 2kP(|g| > \sqrt{n}\gamma) = 2k\int_{\sqrt{n}\gamma}^{\infty} \frac{\eta}{k} e^{-\frac{x\eta}{k}}\,dx.
$$

Evaluating the right-hand side gives us:

$$
1 - P(A) \le 2ke^{-\frac{\sqrt{n}\gamma\eta}{k}}.
$$

Thus we can conclude that the probability $p_d$ that the fixed point $d$ from the training set will be misclassified by the set of $k$ randomly chosen random decision trees satisfies:

$$
p_d \le e^{-\frac{k\delta^2}{2}} + e^{-\frac{2|\mathcal{I}|(\frac{\delta}{4})^2}{(\frac{\delta}{4\sqrt{n}} + \frac{\delta}{4\sqrt{n}})^2}} + 2ke^{-\frac{\sqrt{n}\gamma\eta}{k}}.
$$

Note that by an argument similar to the one presented in the proofs of Theorem 5.3 and Theorem 5.4, we can conclude that at least $n\left(1 - \frac{2(\epsilon + \frac{1}{K}) - 2(\epsilon + \frac{1}{K})^2}{0.5 - \delta}\right)$ points $d$ from the training data satisfy $w^t_d \ge \frac{1}{2} + \delta$. Let $\mu^t$ be the fraction of points with this property. As we observed earlier, if a point $d$ satisfies $w_d \ge \frac{1}{2} + \delta + \frac{1}{K}$ then it also satisfies $w^t_d \ge \frac{1}{2} + \delta$. Thus $\mu \ge \mu^t$. We also have $\mu^t \ge 1 - \frac{2(\epsilon + \frac{1}{K}) - 2(\epsilon + \frac{1}{K})^2}{0.5 - \delta}$, and therefore $\mu \ge 1 - \frac{2(\epsilon + \frac{1}{K}) - 2(\epsilon + \frac{1}{K})^2}{0.5 - \delta}$. We replace $\epsilon$ by $\epsilon + \frac{1}{K}$ in the formula derived in the proof of Theorem 5.3 since now, for any fixed decision tree, we do not take into account points that belong to leaves with fewer than $\frac{n}{K2^h}$ points from the training set. For every given decision tree $T_i$ there are at most $\frac{n}{K}$ points $d$ from the training set such that $T_i$ is not $d$-good. Note that, by the union bound, the probability that at least one of the $n\mu$ points $d$ with $w^t_d \ge \frac{1}{2} + \delta$ is misclassified is at most $n\mu p_d \le np_d$. To see how Theorem 5.7 and the part of Theorem 5.6 regarding empirical error follow, take $K = 40$ and $\delta = \frac{1}{10}$. The proof of the majority voting version is very similar. We use inequality 2 (derived from Theorem 5.1), but all other details are exactly the same, so we do not repeat the argument here.


Probabilistic averaging setting - empirical error

Let us now switch to the probabilistic averaging setting. In practice, as shown in the experimental section, it is the least effective method. However, for the completeness of our theoretical analysis, and since theoretical guarantees for this setting can also be obtained for very large datasets, we focus on it now.

We will first focus on the part of Theorem 5.5 regarding empirical error.

Proof. We already know that $\sum_{d \in T} w_d \ge n(e^2 + (1-e)^2)$, where $e$ is the average quality. Assume that $k$ random decision trees have been selected. Denote by $Y_d$ the indicator of the event that a fixed data point $d$ from the training set will be correctly classified. We have:

$$
P(Y_d = 1) = X_d,
$$

where $X_d$ is the random variable defined in the proofs of Theorems 5.3 and 5.4. Note that after the random decision trees have been selected, $X_d$ has a deterministic value. Note also that the random variables $Y_d$ are independent and $\mathbb{E}Y_d = X_d$. Thus, we can use Lemma 9.1 in a very similar way as in the proofs of Theorems 5.3 and 5.4 to get that for any given $c > 0$:

$$
P\left(\sum_{d \in T} Y_d \le \sum_{d \in T} X_d - cn\right) \le e^{-2nc^2}.
$$

Let us now focus on the process of choosing random decision trees. Fix a parameter $\delta > 0$ and some point $d$ from the training set. Using Lemma 9.1 in exactly the same way as in the proofs of Theorems 5.3 and 5.4, we conclude that $P(X_d < w_d - \delta) \le e^{-2k\delta^2}$. Therefore, by the union bound, with probability at least $1 - ne^{-2k\delta^2}$ we have: $\sum_{d \in T} X_d \ge \sum_{d \in T} (w_d - \delta)$. Thus, according to the lower bound for $\sum_{d \in T} w_d$ presented at the beginning of the proof, we get that with probability at least $1 - ne^{-2k\delta^2}$ the following holds: $\sum_{d \in T} X_d \ge n(1 - 2\epsilon + 2\epsilon^2 - \delta)$, where $\epsilon = 1 - e$. Note that the random variables $Y_d$ are independent from the random variables $X_d$. We can conclude, using inequality 11, that with probability at least $(1 - ne^{-2k\delta^2})(1 - e^{-2nc^2})$ at least $n(1 - 2\epsilon + 2\epsilon^2 - \delta - c)$ points will be correctly classified. Now we can take $k = \frac{(1+C)\log(n)}{2\delta^2}$ and that completes the proof. Again, as in the previous proof, the majority voting scheme requires only minor changes, so we leave it to the reader.

Proof. The proofs of the statements regarding empirical errors go along exactly the same lines as the presented proof of the part of Theorem 5.5 regarding empirical error. The changes in the statement, due to the added perturbation error, follow from the proofs of the bounds on the empirical error from Theorems 5.6 and 5.7. Therefore we will not give the entire proof but only mention a few things.

In comparison with the statement of Theorem 5.5, in the expression for the upper bound on the empirical error the term $\epsilon$ is replaced by $\epsilon + \frac{1}{K}$. This is, as explained in the proof of Theorem 5.6 (regarding empirical error), due to the fact that while dealing with weights of edges in the graph $G_D$ we do not take into account points from the training set corresponding to leaves with too few data points. To see how Theorem 5.9 can be derived, take $K = 40$, $\delta = \frac{1}{10}$, $c = \frac{1}{20}$. Again, as for Theorem 5.7, Theorem 5.9 now follows by simple calculations.


Generalization error

We will now prove the upper bounds on the generalization error for all the theorems presented in the previous paragraphs. We do it for all of them in the same section since all the proofs are very similar. Besides, now that we have developed tools for obtaining upper bounds on the empirical error, we can use them to simplify our analysis of the generalization error. Random decision trees give strong bounds on the generalization error since they do not lead to data overfitting. The internal structure of each constructed tree (i.e. the set of its inner nodes) does not depend on the data at all. This fact is crucial in obtaining strong guarantees on the generalization error. All the experiments presented in the main body of the paper measured the generalization error of the random tree approach and serve as empirical verification that this method is a good learning technique in settings requiring high privacy guarantees. Below is the proof of the presented upper bounds on the generalization error.

Proof. Consider a test set of $n$ points. Whenever we refer to the weight or goodness of a test point $d$, this is with respect to the test set (see the definition of goodness and other terms in the description of the model). Let $\phi > 0$ be a small constant and denote by $E_\phi$ the event that for the selected forest $F$ of random decision trees the non-perturbed counts in all leaves (for each leaf we count points with label $+$ and $-$ in that leaf) for the test set and the training set differ by at most $2\phi n$. We start by finding a lower bound on $P(E_\phi)$. Let us fix a forest, a particular tree of that forest and a particular leaf of that tree. Denote by $X_i$ a random variable that takes value $1$ if the $i$th point of the training set corresponds to that leaf and $0$ otherwise. Similarly, denote by $Y_i$ a random variable that takes value $1$ if the $i$th point of the test set corresponds to that leaf and $0$ otherwise. Denote by $X^+_i$ a random variable that takes value $1$ if the $i$th point of the training set corresponds to that leaf and has label $+$, and is $0$ otherwise. Similarly, denote by $Y^+_i$ a random variable that takes value $1$ if the $i$th point of the test set corresponds to that leaf and has label $+$, and is $0$ otherwise. Denote by $p_1$ the probability that the $i$th point of the training/test set corresponds to that leaf and by $p_2$ the probability that the $i$th point of the training/test set corresponds to that leaf and has label $+$. Notice that $p_1, p_2$ are the same for the training and test set since we assume that the training and test sets are drawn from the same distribution. Since all the random variables introduced above are independent, we can conclude using Azuma's inequality that

$$
P\left(\left|\sum_{i=1}^{n} X_i - np_1\right| \ge \phi n\right) \le 2e^{-2n\phi^2},
$$

$$
P\left(\left|\sum_{i=1}^{n} Y_i - np_1\right| \ge \phi n\right) \le 2e^{-2n\phi^2},
$$

$$
P\left(\left|\sum_{i=1}^{n} X_i - \sum_{i=1}^{n} Y_i\right| \ge 2\phi n\right) \le 4e^{-2n\phi^2}.
$$

By the same analysis we can show that

$$
P\left(\left|\sum_{i=1}^{n} X^+_i - np_2\right| \ge \phi n\right) \le 2e^{-2n\phi^2},
$$

$$
P\left(\left|\sum_{i=1}^{n} Y^+_i - np_2\right| \ge \phi n\right) \le 2e^{-2n\phi^2},
$$

$$
P\left(\left|\sum_{i=1}^{n} X^+_i - \sum_{i=1}^{n} Y^+_i\right| \ge 2\phi n\right) \le 4e^{-2n\phi^2}.
$$

We can conclude that the probability of the following event:

$$
\left\{\left|\sum_{i=1}^{n} X_i - \sum_{i=1}^{n} Y_i\right| \le 2\phi n \;\wedge\; \left|\sum_{i=1}^{n} X^+_i - \sum_{i=1}^{n} Y^+_i\right| \le 2\phi n\right\}
$$

is at least $1 - 8e^{-2n\phi^2}$. If we now take the union bound over all $2^h k$ leaves of the forest then we obtain: $P(E_\phi) \ge 1 - 2^{h+3}ke^{-2n\phi^2}$. We will now consider the average weights $w_d$ of the test points. The analysis for majority voting uses $\sigma(d)$ and is completely analogous. Assume now that all the counts for all the leaves for the test and training set differ by at most $2\phi n$. As in the analysis of the empirical error in the differentially-private setting, let us focus on those leaves of the forest that contain at least $\frac{n}{2^h K}$ of the test points each, for a constant $K > 0$. Take a leaf $l$ with this property. Denote by $x_1$ the number of test points corresponding to that leaf with label $+$, by $x_2$ the number of training points corresponding to that leaf with label $+$, by $y_1$ the number of all test points corresponding to that leaf, and by $y_2$ the number of all training points corresponding to that leaf. We want to find an upper bound on the expression $q = \left|\frac{x_1}{y_1} - \frac{x_2}{y_2}\right|$. Simple algebra gives us: $q \le \frac{2\phi n (x_1 + y_1)}{y_1(y_1 - 2\phi n)}$. If we now take $\zeta = \frac{2\phi}{\theta}$, where $\theta = \frac{1}{2^h K}$, then we get: $q \le \frac{2\zeta}{1 - \zeta}$. Let us take $\zeta$ such that $\frac{2\zeta}{1-\zeta} \le \frac{\delta}{2}$, where $\delta > 0$ is a positive constant. Thus we want $\zeta \le \frac{\delta}{4+\delta}$, i.e. $\phi \le \frac{\delta\theta}{2(4+\delta)}$. Take $\phi = \frac{\delta\theta}{2(4+\delta)}$. We can conclude that with probability at least $P(E_\phi)$ the difference between the ratios of counts in leaves containing at least $n\theta$ test points for the test and training set is at most $\frac{\delta}{2}$. This in particular implies that if we consider a test point $d$ and a truncated bipartite graph $G_d$ (but this time with respect to the test set, not the training set) then the weights of $d$ in $G_d$ and in its corresponding version for the training set differ by at most $\frac{\delta}{2}$.
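The algebra behind the bound $q \le \frac{2\zeta}{1-\zeta}$ can be checked in one line; using $x_1 \le y_1$, $y_1 \ge n\theta$ and $\zeta = \frac{2\phi}{\theta}$:

```latex
q \;\le\; \frac{2\phi n (x_1 + y_1)}{y_1 (y_1 - 2\phi n)}
  \;\le\; \frac{4\phi n}{y_1 - 2\phi n}
  \;\le\; \frac{4\phi n}{n\theta - 2\phi n}
  \;=\; \frac{\frac{4\phi}{\theta}}{1 - \frac{2\phi}{\theta}}
  \;=\; \frac{2\zeta}{1 - \zeta}.
```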

We are almost done. Consider first the majority voting/threshold averaging scheme. The only changes we need to introduce in the statement of Theorem 5.3 for the empirical error are: subtract from $p_1$ the probability that $E_\phi$ does not hold to obtain a lower bound on $p_2$, add the factor $\frac{1}{K}$ to the expression for $w$ (since we are using the truncated model), and replace $\delta$ by $\frac{\delta}{2}$ in the expression for the number of random decision trees used. Similarly, in the statement of Theorem 5.4 we need to replace $\epsilon$ in the expression for $err_1$ by $\epsilon + \frac{1}{K}$ to obtain an upper bound on $err_2$ (again, because we are using the truncation argument) and make the same change in the number of decision trees as above. To obtain a lower bound on $p_2$ it suffices to subtract the probability that $E_\phi$ does not hold. Let us now focus on Theorem 5.6. Again we need to add the extra factor $\frac{1}{K}$ to the expression for $w$ and subtract the probability that $E_\phi$ does not hold to obtain a lower bound on $p_2$.

The remaining bounds follow similarly, by the union bound.

Now let us consider the probabilistic averaging scheme. Take the statement of Theorem 5.5 first. We make corrections similar to those mentioned earlier to get a lower bound on $p_2$. Besides, in the upper bound on $err_1$ we need to replace $\epsilon$ by $\epsilon + \frac{1}{K}$ to obtain an upper bound on $err_2$. In Theorem 5.8 we need to add one extra term $\frac{1}{K}$ in the upper bound on $err_1$ to obtain an upper bound on $err_2$ and again modify $p_1$ in the same way as before to obtain a lower bound on $p_2$.


Experiments on the remaining datasets


The decision tree is one of the most fundamental structures used in machine learning, but constructing a tree of good quality is a hard computational problem. Needless to say, the choice of the optimal attribute according to which the data partitioning should be performed in any given node of the tree requires nontrivial calculations involving the data points located in that node. Nowadays, with the increasing importance of mechanisms preserving the privacy of the data handled by machine learning algorithms, the need arises to construct these algorithms with strong privacy guarantees (see e.g. [1], [2], [3], [4], [5]). One of the strongest currently used notions of privacy is so-called differential privacy, which was introduced [6] in a quest to achieve the dual goal of maximizing data utility and preserving data confidentiality. A differentially-private database access mechanism preserves the privacy of any individual in the database, irrespective of the amount of auxiliary information available to an adversarial database client. Differential-privacy techniques perturb the data by adding noise (such as Laplacian noise) whose magnitude depends on the sensitivity of the statistics being output. Even though the overall scheme looks simple, in practice it is usually very difficult to obtain a reasonable level of differential privacy while maintaining good accuracy, since usually too large a perturbation error needs to be added. In particular, this happens when machine learning computations access the data frequently during the entire execution of the algorithm and output structures that are very sensitive to the data. This is also an obstacle to proposing a scheme that computes an optimal decision tree in a differentially-private way. In such a scenario the attribute chosen in every node and any additional information stored there depend on the data, and that is why they must be perturbed in order to keep the desired level of differential privacy. The large perturbation added in this setting leads to a substantially lower quality of the constructed tree.

Instead of constructing one differentially-private decision tree, in this paper we consider constructing a random forest. Random forests [7] constitute an important member of the family of decision tree-based algorithms due to their effectiveness and excellent performance. They are also the most accurate general-purpose classifiers available [8, 7]. In this paper we construct a forest consisting of $O(\log(n))$ random decision trees ($n$ is the size of the dataset, i.e. the number of data samples). The attribute according to which the selection is performed in any given node is chosen uniformly at random from all the attributes, independently of the data in that node. In the continuous case, the threshold value for the chosen attribute is then also chosen uniformly at random from the range of all possible values. This simple rule enables us to construct each decision tree very fast since the choice of the nodes' attributes does not depend on the data at all. The obtained algorithm is therefore fast and scalable with minimal memory requirements. It also takes only one pass over the data to construct the classifier. Since most of the structure of each random decision tree is constructed without examining the data, the algorithm suits the differentially-private scenario very well. After a sufficient number of random decision trees is constructed, the classification of every point from a dataset takes place. Classification is done according to one of three schemes: majority voting ([7]), threshold averaging or probabilistic averaging ([9]). In the differentially-private setting we add perturbation error to the counters in the leaves, but no perturbation error is added to the inner nodes. This leads to a much more accurate learning mechanism. Performing voting/averaging (see [10] for applications of voting methods) instead of just taking the best tree for a given dataset is important since it enables us to add a smaller perturbation error to obtain the same level of differential privacy.
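The construction described above can be sketched as follows. This is a minimal illustration under our own assumptions (the function names, the heap-style array layout of the tree, and the `rule` argument are ours), not the authors' implementation:

```python
import random

def build_random_tree(num_attrs, height, ranges, rng):
    # A complete binary tree of height h has 2^h - 1 inner nodes, stored in
    # heap layout. Attribute and threshold are drawn at random, so the tree
    # structure never touches the training data.
    nodes = []
    for _ in range(2 ** height - 1):
        attr = rng.randrange(num_attrs)
        lo, hi = ranges[attr]
        nodes.append((attr, rng.uniform(lo, hi)))
    return nodes

def leaf_of(nodes, height, x):
    # Route point x down the tree; returns a leaf index in [0, 2^h).
    i = 0
    for _ in range(height):
        attr, thr = nodes[i]
        i = 2 * i + (1 if x[attr] <= thr else 2)
    return i - (2 ** height - 1)

def train_counts(nodes, height, data, labels):
    # Per-leaf counters n_l^+ and n_l^-: the only data-dependent state.
    counts = [[0, 0] for _ in range(2 ** height)]
    for x, y in zip(data, labels):
        counts[leaf_of(nodes, height, x)][0 if y == 1 else 1] += 1
    return counts

def classify(forest, height, x, rule, rng):
    # forest: list of (nodes, counts); rule: 'majority', 'threshold' or 'prob'.
    total = 0.0
    for nodes, counts in forest:
        pos, neg = counts[leaf_of(nodes, height, x)]
        frac = pos / (pos + neg) if pos + neg > 0 else rng.random()
        # Majority voting: each tree casts a binary vote.
        # Threshold / probabilistic averaging: average the leaf fractions.
        total += (1.0 if frac > 0.5 else 0.0) if rule == 'majority' else frac
    avg = total / len(forest)
    if rule == 'prob':
        return 1 if rng.random() < avg else -1
    return 1 if avg > 0.5 else -1
```

Note how a single pass over the data suffices to fill the counters, and how only `train_counts` ever reads the labels, matching the one-pass, data-independent-structure properties claimed above.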

In this paper we analyze both the non-differentially-private and the differentially-private setting in all three variants: majority voting, threshold averaging, and probabilistic averaging. To the best of our knowledge, we are the first to give a comprehensive and unified theoretical analysis of all three models in both settings; in the case of the differentially-private setting, no theoretical analysis had ever been provided in the context of random decision trees. The differentially-private setting is especially difficult to analyze since increasing the number of trees does not necessarily decrease the training (and test) error in this setting: having more random decision trees requires adding a larger perturbation error, which may decrease the efficiency of the learning algorithm. In this paper we thoroughly investigate this phenomenon. The dependence of the quality of the random decision tree methods on the chosen level of differential privacy, the height of the trees and the number of trees in the forest is the central focus of our theoretical and empirical analysis. Understanding these dependencies is crucial when applying these methods in practice. Our theoretical analysis relates the empirical error and the generalization error of the classifier to the average tree accuracy and explains quantitatively how the quality of the system depends on the number of chosen trees. Furthermore, we show that the random forest does not need many trees to achieve good accuracy. In particular, we prove both theoretically and empirically that a logarithmic number of random decision trees (further in the paper, by "logarithmic number of random decision trees" we always mean "logarithmic in the size of the dataset") suffices to achieve good performance. We also show that not only do there exist parameters of the setting (such as the number of random trees in the forest, the height of the trees, etc.) under which one can effectively learn, but that the setting is very robust. To be more precise, we empirically demonstrate that the parameters do not need to be chosen optimally; for example, one can choose far fewer trees and still achieve good performance. We also show that majority voting and threshold averaging are good candidates for differentially-private classifiers. Our experiments reveal that a simple majority voting rule is competitive with the threshold averaging rule, and that both outperform the probabilistic averaging rule. Furthermore, the majority voting rule is much less sensitive to the choice of the parameters of the random forest (such as the number of trees and the height of the trees) than the remaining two schemes.

This article is organized as follows. In Section 2 we describe previous work regarding random decision trees. We then introduce our model and the notion of differential privacy in Section 3. In Section 4 we present a differentially-private supervised algorithm that uses random decision trees. Section 5 contains our theoretical analysis. We conclude the paper with experiments (Section 6) and a brief summary of our results (Section 7).

Random decision trees are important methods in machine learning, often used for supervised learning due to their simplicity, excellent practical performance and somewhat unreasonable effectiveness in practice. They have been successful in a number of practical problems, e.g. [11, 12, 13, 14, 15, 16, 17, 18, 19, 20] (there exist many more examples). The original random forests [7] were ensemble methods combining many CART-type [21] decision trees using bagging [22] (a convenient review of random forests can, for instance, be found in [19]). They were inspired by some earlier published random approaches [12, 23, 24, 25]. Despite their popularity, the statistical mechanism of random forests is difficult to analyze [8, 26] and to this day remains poorly understood [26, 27, 28]. Next we review the existing theoretical results in the literature.

A notable line of works provides an elegant analysis of the consistency of random forests [26, 27, 28, 8, 29, 30, 31, 32, 29, 33]. Among these works, one of the most recent studies [26] proves that the previously proposed random forest approach [32] is consistent and achieves a rate of convergence which depends only on the number of strong features and not on the number of noise variables. Another recent paper [27] provides the first consistency result for an online variant of random forests. The predecessor of this work [34] proposes the Hoeffding tree algorithm and proves that, with high probability and under certain assumptions, the online Hoeffding tree converges to the offline tree. In our paper we focus on error bounds rather than the consistency analysis of random decision trees.

It has been noted [27] that the most famous theoretical result concerning random forests provides an upper bound on the generalization error of the forest in terms of the correlation and strength of the trees [7]. Simultaneously, the authors show that the generalization error converges almost surely to a limit as the number of trees in the forest becomes large. It should be noted, however, that the algorithm considered by the authors has a data-dependent tree structure, opposite to the algorithms in our paper. To be more specific, the original "random forests" method [7] randomly selects a subset of features and then chooses the best splitting criterion from this feature subset. This affects efficiency since computing the heuristic (the best splitting criterion) is expensive [9]. Furthermore, it also causes the tree structure to be data-dependent (another approach where the tree structure is data-dependent is presented, for example, in [35]) rather than fully random, which poses a major problem when extending the method to the differentially-private setting, since a data-independent tree structure is important for preserving differential privacy [36]. Opposite to this approach, in our algorithms we randomly draw, in each tree node, the attribute according to which we split, and then we randomly choose a threshold used for splitting. This learning model is therefore much simpler. Our fully random approach is inspired by a methodology already described in the literature [9] (this work, however, has no theoretical analysis). Our theoretical results concern error bounds, similarly to the original work on random forests [7]. The difference in approaches, however, does not allow using the theoretical results from [7] in our setting. Finally, note that in both [7] and [9] only a single voting rule is considered: majority voting and probabilistic averaging, respectively. In this paper we consider a wider spectrum of voting approaches.

Next, we briefly review some additional theoretical results regarding random forests. A simplified analysis of random forests in one-dimensional settings was provided in the literature in the context of regression problems, where minimax rates of convergence were proved [37, 38]. Another set of results explores the connection of random forests with a specific framework of adaptive nearest-neighbor methods [39]. Finally, for completeness we note that there also exist some interesting empirical studies regarding random decision trees in the literature, e.g. [40], [41] and [23], which however are not directly related to our work.

Privacy preserving data mining has emerged as an effective method to solve the problem of data sharing in many fields of computer science and statistics. One of the strongest currently used notions of privacy is so-called differential privacy [6] (some useful tutorial material on differential privacy research can be found in [42]). In this paper we are interested in the differentially-private setting in the context of random decision trees. It was first observed in [36] that random decision trees may turn out to be an effective tool for constructing a differentially-private decision tree classifier. The authors showed a very efficient heuristic that averages over random decision trees and gives good practical results. Their work, however, lacks theoretical results regarding the quality of the differentially-private algorithm that uses random decision trees. In another published empirical study [43] the authors develop protocols to implement privacy-preserving random decision trees that enable efficient parallel and distributed privacy-preserving knowledge discovery. The very recent work [44] on differentially-private random forests shows experimental results demonstrating that quality functions such as information gain, the max operator and the Gini index give almost equal accuracy regardless of their sensitivity to the noise. Furthermore, they show that the accuracy of the classical random forest and its differentially-private counterpart is almost equal for various dataset sizes. To the best of our knowledge, none of the published works on differentially-private random decision trees provide any theoretical guarantees. Our paper provides strong theoretical guarantees for both non-differentially-private and differentially-private random decision trees. This is a major contribution of our work. We simultaneously develop a unified theoretical framework for analyzing both settings.

Differential privacy is a model of privacy for database access mechanisms. It guarantees that small changes in a database (removal or addition of an element) do not substantially change the output of the mechanism.

(See [45].) A randomized algorithm $\mathcal{K}$ gives $\epsilon$-differential-privacy if for all datasets $\mathcal{D}_1$ and $\mathcal{D}_2$ differing on at most one element, and all $S \subseteq Range(\mathcal{K})$,

$$
P[\mathcal{K}(\mathcal{D}_1) \in S] \le e^{\epsilon} \cdot P[\mathcal{K}(\mathcal{D}_2) \in S].
$$

The probability is taken over the coin tosses of $\mathcal{K}$.

The smaller $\epsilon$, the stronger the level of differential privacy obtained. Assume that the non-perturbed output of the mechanism can be encoded by a function $f$. A mechanism $\mathcal{K}$ can compute a differentially-private noisy version of $f$ over a database $\mathcal{D}$ by adding noise with magnitude calibrated to the sensitivity of $f$.

(See [6].) The global sensitivity $S(f)$ of a function $f$ is the smallest number $s$ such that for all $\mathcal{D}_1$ and $\mathcal{D}_2$ which differ on at most one element, $|f(\mathcal{D}_1) - f(\mathcal{D}_2)| \le s$.

Let $Lap(0, \lambda)$ denote the Laplace distribution with mean $0$ and scale parameter $\lambda$. In other words, this is a random variable with probability density function given by the following formula: $\frac{1}{2\lambda}e^{-\frac{|x|}{\lambda}}$. We will denote shortly by $g(\lambda)$ an independent copy of the $Lap(0, \lambda)$ random variable.

(See [6].) Let $f$ be a function on databases with range $R^m$, where $m$ is the number of rows of the databases (the number of rows of a database is the number of attributes of any data point from the database). Then the mechanism that outputs $f(\mathcal{D}) + (Y_1, \ldots, Y_m)$, where the $Y_i$ are drawn i.i.d. from $Lap(0, S(f)/\epsilon)$, satisfies $\epsilon$-differential-privacy.
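As an illustrative sketch (ours, not from the paper), the Laplace mechanism for a single count query (sensitivity $1$) can be implemented with stdlib inverse-CDF sampling; the function names are our own:

```python
import math
import random

def sample_laplace(scale, rng):
    # Inverse-CDF sampling from Lap(0, scale),
    # with density (1 / (2 * scale)) * exp(-|x| / scale).
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(value, sensitivity, epsilon, rng):
    # Adding Lap(0, sensitivity / epsilon) noise to the true answer
    # yields an epsilon-differentially-private release of the statistic.
    return value + sample_laplace(sensitivity / epsilon, rng)
```

For a count query the sensitivity is $1$, so releasing `laplace_mechanism(count, 1.0, eps, rng)` costs a privacy budget of `eps`.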

Stronger privacy guarantees and more sensitive functions require a larger variance of the Laplacian noise being added. Differential privacy is preserved under composition, but with an extra loss of privacy for each conducted query.

(See [6].) (Composition Theorem) The sequential application of mechanisms $\mathcal{K}_i$, each giving $\epsilon_i$-differential privacy, satisfies $\sum_i \epsilon_i$-differential-privacy.

More information about differential privacy can be found in the work of [46] and [47].

All data points are taken from ℱmsuperscriptℱ𝑚\mathcal{F}^{m}, where m𝑚m is the number of the attributes and ℱℱ\mathcal{F} is either a discrete set or the set of real numbers. We assume that for every attribute a​t​t​r𝑎𝑡𝑡𝑟attr its smallest (min⁡(a​t​t​r)𝑎𝑡𝑡𝑟\min(attr)) and largest possible value (max⁡(a​t​t​r)𝑎𝑡𝑡𝑟\max(attr)) are publicly available and that the labels are binary. We consider only binary decision trees (all our results can be easily translated to the setting where inner nodes of the tree have more than two children). Therefore, if ℱℱ\mathcal{F} is discrete then we will assume that ℱ={0,1}ℱ01\mathcal{F}={0,1}, i.e. each attribute is binary. In the continuous setting for each inner node of the tree we store the attribute according to which the selection is done and the threshold value of this attribute. All decision trees considered in this paper are complete and of a fixed height hℎh that does not depend on the data. Let T𝑇T be a random decision tree and let l𝑙l be one of its leaves. We denote by θlsubscript𝜃𝑙\theta_{l} the fraction of all training points in l𝑙l with label ++. If l𝑙l does not contain any of the training points we choose the value of θlsubscript𝜃𝑙\theta_{l} uniformly at random from [0,1]01[0,1]. The set M𝑀M of all possible decision trees is of size |M|=m2h+1−1𝑀superscript𝑚superscript2ℎ11|M|=m^{2^{h+1}-1} in the binary setting. It should be emphasized that it is true also in the continuous case. In that setting the set of all possible threshold values for a node is infinite but needless to say, the set of all possible partitionings in the node is still finite. Thus without loss of generality, we assume M𝑀M is finite. It can be very large but it does not matter since we will never need the actual size of M𝑀M in our analysis. 
For a given tree $T$ and a given data point $d$, denote by $w_d^T$ the fraction of points (from the training set if $d$ is a training point, and from the test set otherwise) with the same label as $d$ that end up in the same leaf of $T$ as $d$. We call it the weight of $d$ in $T$. Notice that a training point $d$ is classified correctly by $T$ in the single-tree setting iff its weight in $T$ is larger than $\frac{1}{2}$ (for a single decision tree we use the majority voting model for point classification).

The average value of $w_d^T$ over all trees of $M$ will be denoted by $w_d$ and called the weight of $d$ in $M$. We denote by $\sigma(d)$ the fraction of trees from $M$ with the property that most of the points in the leaf containing $d$ have the same label as $d$ (again, the points are taken from the training set if $d$ belongs to it and from the test set otherwise). We call $\sigma(d)$ the goodness of $d$ in $M$. For a given dataset $\mathcal{D}$, the average tree accuracy $e(\mathcal{D})$ of a random decision tree model is the average accuracy of a random decision tree from $M$, where the accuracy of a tree is the fraction of data points it classifies correctly (accuracy is computed under the assumption that the same distribution $\mathcal{D}$ was used in both the training and the test phase).

Algorithm 1 captures the non-differentially-private algorithm for supervised learning with random decision trees (RDT). Its differentially-private counterpart is captured in Algorithm 2. We consider three versions of each algorithm:

majority voting

threshold averaging

probabilistic averaging

Only the variables $n_l^+, n_l^-$ stored in the leaves depend on the data. This fact will play a crucial role in the analysis of the differentially-private version of the algorithm, where Laplacian noise is added to the point counters at every leaf, with variance calibrated to the number of trees used by the algorithm.
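To make the three classification rules concrete, here is a minimal sketch that predicts from the per-leaf counters alone. The names and the tie-handling convention are ours, not the paper's.

```python
import random

def predict(leaf_counts, rule, rng=random.Random(0)):
    # leaf_counts: for each tree, the pair (n_plus, n_minus) of the leaf the
    # query point reached. Only these counters depend on the training data.
    fracs = []
    for n_plus, n_minus in leaf_counts:
        if n_plus + n_minus == 0:
            fracs.append(rng.random())       # empty leaf: theta drawn uniformly
        else:
            fracs.append(n_plus / (n_plus + n_minus))
    if rule == "majority":        # each tree votes with its leaf's majority label
        votes = sum(1 for f in fracs if f > 0.5)
        return "+" if votes > len(fracs) / 2 else "-"
    if rule == "threshold":       # average the leaf fractions, then threshold
        return "+" if sum(fracs) / len(fracs) > 0.5 else "-"
    if rule == "probabilistic":   # output + with probability = average fraction
        return "+" if rng.random() < sum(fracs) / len(fracs) else "-"
    raise ValueError(rule)

counts = [(8, 2), (5, 5), (9, 1)]   # three trees' leaf counters for one query
label = predict(counts, "majority")
```

All three rules consume the same counters, which is why perturbing only those counters suffices to privatize every variant.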

In this section we derive upper bounds on the empirical error (the fraction of the training data misclassified by the algorithm) and the generalization error (the fraction of the test data misclassified by the algorithm, where the test data is drawn from the same distribution as the training data) for all methods in Algorithms 1 and 2.

We also show how to find the number of random decision trees to obtain good accuracy and, in the differentially-private setting, good privacy guarantees.

We start with two technical results which, as we will see later, give an intuition why the random decision tree approach works very well in practice.

Assume that the average tree accuracy of the set $M$ of all decision trees of height $h$ on the training/test set $\mathcal{D}$ is $e=1-\epsilon$ for some $0<\epsilon\le\frac{1}{2}$. Then the average goodness $\sigma(d)$ of a training/test point $d$ in $M$ is at least $e$, where $e\ge\frac{1}{2}$.

The theorems above imply that if the average accuracy of the tree is better than random, then this is also reflected by the average values of $w_d$ and $\sigma(d)$. This fact is crucial for the theoretical analysis, since we will show that average values of $w_d$ and $\sigma(d)$ even slightly better than random imply very small empirical and generalization errors. Furthermore, for most training/test points $d$ the values of $\sigma(d)$ and $w_d$ are well concentrated around those averages, and that, in a nutshell, explains why the random decision trees approach works well. Notice that Theorem 5.1 gives better quality guarantees than Theorem 5.2.

We are about to propose several results regarding differentially-private learning with random decision trees. They are based on a careful structural analysis of the bipartite graph between the set of decision trees and the set of data points, whose edges connect data points with the trees that classify them correctly. In the differentially-private setting the key observation is that, under relatively weak conditions, one can assume that the sets of data points residing in the leaves of the trees are of substantial size. Thus adding Laplacian noise does not perturb the leaf statistics to an extent that would affect the quality of learning. All upper bounds on the generalization error were obtained by combining this analysis with concentration results (such as Azuma's inequality).

We start by providing theoretical guarantees in the non-differentially-private case. Below we consider majority voting and threshold averaging. The results for the probabilistic averaging are stated later in this subsection.

Let $K>0$. Assume that the average tree accuracy of the set $M$ of all decision trees of height $h$ on the training/test set $\mathcal{D}$ is $e=1-\epsilon$ for some $0<\epsilon\le\frac{1}{2}$. Let $\mu$ be: the fraction of training/test points with goodness in $M$ at least $\sigma=\frac{1}{2}+\delta$ / $\sigma=\frac{1}{2}+\delta+\frac{1}{K}$ for $0<\delta<\frac{1}{2}$ (in the majority voting version), or: the fraction of training/test points with weight in $M$ at least $w=\frac{1}{2}+\delta$ / $w=\frac{1}{2}+\delta+\frac{1}{K}$ for $0<\delta<\frac{1}{2}$ (in the threshold averaging version). Then for every $C>0$ and $k=\frac{(1+C)\log(n)}{2\delta^2}$ selected random decision trees, Algorithm 1 gives empirical error $err_1\le 1-\mu$ with probability $p_1\ge 1-\frac{1}{n^C}$. The generalization error $err_2\le 1-\mu$ is achieved for $k=\frac{(1+C)\log(n)}{2(\frac{\delta}{2})^2}$ trees with probability $p_2\ge p_1-2^{h+3}ke^{-2n\phi^2}$, where $\phi=\frac{\delta}{2(4+\delta)2^h K}$. Probabilities $p_1$ and $p_2$ are over the random coin tosses used to construct the forest and the test set.

Note that the parameter $e$ is always in the range $[\frac{1}{2},1]$. The more decision trees classify data in a nontrivial way (i.e. with accuracy greater than $\frac{1}{2}$), the larger the value of $e$. The result above in particular implies that if most points have goodness/weight in $M$ a little larger than $\frac{1}{2}$, then both errors are very close to $0$. This is indeed the case: the average point's goodness/weight in $M$, by Theorem 5.1 and Theorem 5.2, is at least $e$ / $e^2+(1-e)^2$, and the latter expression is greater than $\frac{1}{2}$ whenever the average tree accuracy is slightly bigger than the worst possible. Besides, as was tested experimentally, the goodness/weight of most points is well concentrated around the average goodness/weight. We conclude that an average tree accuracy separated from $\frac{1}{2}$ (but not necessarily very close to $1$) suffices to classify most of the data points correctly. The intuition behind this result is as follows: if the constructed forest contains at least a few "nontrivial" trees giving better accuracy than random, then these trees guarantee correct classification of most of the points.

If we know that the average tree accuracy is big enough then techniques used to prove Theorem 5.3 give us more direct bounds on the empirical and generalization errors captured in Theorem 5.4. No assumptions regarding goodness/weight are necessary there.

Theorems 5.3 and 5.4 show that in practice a logarithmic number of random decision trees suffices to obtain high prediction accuracy with very large probability. In particular, the upper bound on the generalization error is about two times the average error of a tree. The existence of a tree with lower accuracy in the forest does not harm the entire scheme, since all trees of the forest play a role in the final classification.
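As a rough numeric illustration of the bound $k=\frac{(1+C)\log(n)}{2\delta^2}$ from Theorem 5.3 (our arithmetic, assuming the natural logarithm and sample parameter values):

```python
import math

def trees_needed(n, delta, C=1.0):
    # k = (1 + C) * log(n) / (2 * delta^2): number of random decision trees
    # sufficient for the error bound to hold with probability >= 1 - 1/n^C.
    return math.ceil((1 + C) * math.log(n) / (2 * delta ** 2))

# The forest grows only logarithmically with the dataset size:
k_small = trees_needed(n=10_000, delta=0.2)      # -> 231
k_large = trees_needed(n=1_000_000, delta=0.2)   # -> 346
```

A 100-fold increase in $n$ raises the sufficient forest size by only about 50%, which is the logarithmic dependence the theorem promises.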

We now state our results (analogous to Theorem 5.4) for the probabilistic averaging setting. The following is true.

Let $K>0$. Assume that the average tree accuracy of the set $M$ of all decision trees of height $h$ on the training/test set $\mathcal{D}$ is $e=1-\epsilon$ for some $0<\epsilon\le\frac{1}{2}$. Let $C>0$ be a constant and let $0<\delta,c<1$. Then with probability at least $p_1=(1-\frac{1}{n^C})(1-e^{-2nc^2})$ the probabilistic averaging version of Algorithm 1 gives empirical error $err_1\le 2\epsilon-2\epsilon^2+\delta+c$, and with probability $p_2\ge p_1-2^{h+3}ke^{-2n\phi^2}$, where $\phi=\frac{\delta}{2(4+\delta)2^h K}$, it gives generalization error $err_2\le 2(\epsilon+\frac{1}{K})-2(\epsilon+\frac{1}{K})^2+\delta+c$. Probabilities $p_1,p_2$ are over the random coin tosses used to construct the forest and the test set.

Notice that this result is nontrivial for almost the entire range $[0,\frac{1}{2}]$ of $\epsilon$, for $\delta$ and $c$ close to $0$ and large $K$. This is because $1-2\epsilon+2\epsilon^2\ge\frac{1}{2}$, with equality only for $\epsilon=\frac{1}{2}$.

We begin this section with the proof that all three methods captured in Algorithm 2, where Laplacian noise is added to certain counts, are indeed $\eta$-differentially-private.

Notice that in every method, obtaining the forest of random decision trees with perturbed counters in the leaves requires $k$ queries to the private data (this is true since the structure of the inner nodes of the trees does not depend on the data at all, and the data subsets corresponding to leaves are pairwise disjoint). Furthermore, the values that are perturbed by Laplacian noise are simple counts of global sensitivity $1$. Thus we can use Theorem 3.1 and Theorem 3.2 to conclude that in order to obtain $\eta$-differential privacy of the entire system we need to add $Lap(0,\frac{k}{\eta})$ noise to every count in every leaf. This proves that our algorithms are indeed $\eta$-differentially-private.
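The perturbation step just described can be sketched as follows (a sketch with our own helper names; Laplace samples are drawn by an inverse-CDF transform):

```python
import math
import random

def privatize_leaf_counts(forest_counts, eta, rng=random.Random(0)):
    # forest_counts: one dict per tree, mapping leaf id -> (n_plus, n_minus).
    # With k trees the forest answers k disjoint count queries of sensitivity 1,
    # so perturbing every counter with Lap(0, k/eta) noise makes it eta-DP.
    k = len(forest_counts)
    scale = k / eta

    def lap():
        # Inverse-CDF sample of a Lap(0, scale) random variable.
        u = rng.random() - 0.5
        return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

    return [{leaf: (pos + lap(), neg + lap())
             for leaf, (pos, neg) in tree.items()}
            for tree in forest_counts]

noisy = privatize_leaf_counts([{0: (10, 3), 1: (2, 7)}, {0: (6, 6)}], eta=1.0)
```

Note the trade-off the theorems quantify: a larger forest $k$ forces a larger noise scale $k/\eta$ per counter.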

Next we show the theoretical guarantees we obtained in the differentially-private setting. As in the previous section, we first focus on the majority voting and threshold averaging, and then we consider the probabilistic averaging.

Notice that if the number of trees $k$ in the forest is logarithmic in $n$, then $p_1$ is close to one and so is $p_2$.

Again, as in the non-differentially-private case, we see that if there are many points whose goodness/weight in $M$ is close to the average goodness/weight, then the empirical and generalization errors are small. Notice also that increasing the number of trees too much has an impact on the empirical error (the term $ke^{-\frac{\lambda n\eta}{k}}$ in the lower bound on $p_1$): more trees means a bigger variance of the single Laplacian used in each leaf of a tree, which affects tree quality. The theorem above describes this phenomenon quantitatively.

If the average tree accuracy is large enough then the following result becomes of independent interest. It bounds the empirical error of the threshold averaging version of Algorithm 2; similar results hold for the generalization error and for the majority voting version of Algorithm 2.

Assume that we are given a parameter $\eta>0$. Assume furthermore that the average tree accuracy of the set $M$ of all decision trees of height $h$ on the training set $\mathcal{D}$ is $e=1-\epsilon$ for some $0<\epsilon\le\frac{1}{2}$. Let $0<\delta<\frac{1}{2}$. Let $\gamma=\frac{1}{2^h\cdot 9600}$ and let $k_{opt}$ be the integer value for which the function $f(k)=e^{-\frac{k}{200}}+2ke^{-\frac{\gamma\sqrt{n}\eta}{k}}$ is smallest. Then with probability at least $p=1-n(e^{-\frac{k_{opt}}{200}}+2k_{opt}e^{-\frac{\gamma\sqrt{n}\eta}{k_{opt}}}+e^{-\frac{n}{2}})$ the $\eta$-differentially-private threshold averaging version of Algorithm 2 gives empirical error at most $\frac{1}{8}+\frac{9}{2}\epsilon-5\epsilon^2$ for the forest with $k_{opt}$ randomly chosen decision trees. Probability $p$ is over the random coin tosses used to construct the forest.

As in the two previous settings, information about the average accuracy of just a single tree gives strong guarantees regarding the classification quality achieved by the differentially-private version of the forest. The next result (analogous to Theorem 5.8) shows how to choose the optimal number of trees and that this number is again at most logarithmic in the data size.

The experiments were performed on benchmark datasets (downloaded from http://osmot.cs.cornell.edu/kddcup/, http://archive.ics.uci.edu/ml/datasets.html, and http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets): Banknote Authentication (Ban_Aut), Blood Transfusion Service Center (BTSC), Congressional Voting Records (CVR), Mammographic Mass (Mam_Mass), Mushroom, Adult, Covertype and Quantum. 90% of each dataset was used for training and the remaining part for testing. Furthermore, 10% of the training dataset was used as a validation set. All code for our experiments is publicly released.

We first compare the test error (%) obtained using five different methods: the open-source CART implementation rpart [48], and the non-differentially-private (n-dp) and differentially-private (dp) random forests with majority voting (RFMV) and threshold averaging (RFTA). For all methods except rpart we also report the number of trees in the forest ($k$) and the height of the tree ($h$) for which the smallest validation error was obtained, where we explored $h\in\{1,2,3,\dots,15\}$ and $k\in\{1,3,5,\dots,21\}$. In all experiments the differential privacy parameter $\eta$ was set to $\eta=1000/n_{tr}$, where $n_{tr}$ is the number of training examples. Table 1 captures the results. For each experiment we report the average test error over 10 runs, together with symmetric binomial 95% confidence intervals. The performance of the random forest with probabilistic averaging (RFPA) was significantly worse than the competing methods (RFMV, RFTA, rpart) and is not reported in the table; it will however be shown in the next set of results.

(Figure panels: Banknote Authentication, Congressional Voting Records, Mammographic Mass.)

The next set of results (Figure 1 and Figure 2; all figures in this section should be read in color) is reported for exemplary datasets (Banknote Authentication, Congressional Voting Records, Mammographic Mass and Mushroom) and for the following methods: dpRFMV, dpRFTA and dpRFPA. Similar results were obtained for the remaining datasets. In Figure 1a we report the test error vs. $h$ for selected settings of $k$ (recall that when the forest contains only one tree, $k=1$, the majority voting and threshold averaging rules are equivalent, thus the blue curve then overlaps with the green curve on the plot). In Figure 1b we also show the minimal, average and maximal test error vs. $h$ for dpRFMV, whose performance was overall the best. Similarly, in Figure 1c we report the test error vs. $k$ for two selected settings of $h$, and in Figure 1d we show the minimal, average and maximal test error vs. $k$ for dpRFMV.

Finally, in Figure 2a we report the test error for various settings of $\eta$ and two selected settings of $h$. For each experiment $k$ was chosen from the set $\{1,2,\dots,101\}$ to give the smallest validation error. Additionally, in Figure 2b we show how the test error changes with $k$ for a fixed $h$ and various levels of $\eta$.

Figure 2a shows that in most cases dpRFTA outperforms the remaining differentially-private classifiers; however, it requires careful selection of the forest parameters ($h$ and $k$) to obtain optimal performance, as illustrated in Figures 1c and 2b. This problem can be overcome by using dpRFMV, which has comparable performance to dpRFTA but is much less sensitive to the setting of the forest parameters. Therefore dpRFMV is much easier to use in the differentially-private setting.

In this paper we first provide a novel theoretical analysis of supervised learning with non-differentially-private random decision trees in three cases: majority voting, threshold averaging and probabilistic averaging. Secondly, we show that the algorithms considered here can be easily adapted to the setting where high privacy guarantees must be achieved. We furthermore provide both theoretical and experimental evaluation of the differentially-private random decision trees approach; to the best of our knowledge, a theoretical analysis of differentially-private random decision trees had not been done before. Our experiments reveal that majority voting and threshold averaging are good differentially-private classifiers, and that majority voting in particular exhibits less sensitivity to the forest parameters.

Differentially- and non-differentially-private random decision trees (Supplementary Material)

We will prove here the results regarding empirical and generalization errors of all the variants of the algorithm mentioned in the paper, as well as Theorem 5.1 and Theorem 5.2. Without loss of generality we will assume that all attributes are binary (taken from the set $\{0,1\}$). It can easily be seen that the proofs translate directly to the continuous case; we leave this simple exercise to the reader.

Let us introduce first some useful notation that will be very helpful in the proofs we present next.

We denote by $n$ the size of the dataset (training or test) $\mathcal{T}$. Recall that $m$ is the number of attributes of any given data point, $h$ is the height of the random decision tree and $M$ is the set of all random decision trees under consideration.

We focus first on classifying with just one decision tree. Fix some decision tree $T_j$ and one of its leaves. Assume that it contains $a$ points with label $-$ and $b$ points with label $+$. We associate label $-$ with that leaf if $a>b$ and label $+$ otherwise. To classify a given point using that tree, we feed the tree with that point and assign to the point the label of the corresponding leaf. Denote by $m^i$ the number of data points that are correctly classified by a tree $T_i$ and let $e^i=\frac{m^i}{n}$. We call $e^i$ the quality (or accuracy) of the tree $T_i$. Note that we always have $e^i\ge\frac{1}{2}$, since in every leaf of any given tree the majority of the data points from that leaf are classified correctly. Denote $e=\frac{1}{|M|}\sum_{i=1}^{|M|}e^i$. We call $e$ the average tree accuracy; this parameter measures how well data points are classified on average by a complete decision tree of a given height $h$. Note that $e\ge\frac{1}{2}$. Denote $t=2^h$; the parameter $t$ is the number of leaves of the decision tree.
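The definitions of $m^i$ and $e^i$, and the observation $e^i\ge\frac{1}{2}$, can be checked with a short computation on toy per-leaf counts (our sketch; the counts are illustrative):

```python
def tree_accuracy(leaf_counts, n):
    # leaf_counts: one (a, b) pair per leaf, i.e. the number of points with
    # label '-' and label '+' in that leaf. Each leaf classifies its majority
    # label correctly, so m^i is the sum of per-leaf maxima and e^i = m^i / n.
    m_i = sum(max(a, b) for a, b in leaf_counts)
    return m_i / n

leaves = [(7, 3), (1, 9), (4, 4), (0, 6)]
n = sum(a + b for a, b in leaves)
e_i = tree_accuracy(leaves, n)
# e^i >= 1/2 always holds, since max(a, b) >= (a + b) / 2 in every leaf.
```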

For $i=1,2,\dots,|M|$ and $j=1,2,\dots,t$ denote by $n^i_j$ the number of points from the dataset in the $j^{th}$ leaf of the decision tree $T_i$, and by $m^i_j$ the number of points from the dataset in the $j^{th}$ leaf of $T_i$ that are classified correctly. Denote $e^i_j=\frac{m^i_j}{n^i_j}$ for $n^i_j>0$ and $e^i_j=1$ for $n^i_j=0$. Note that $e^i_j\ge\frac{1}{2}$ for every $i,j$, and that $n=n^i_1+\dots+n^i_t$ and $m^i=m^i_1+\dots+m^i_t$. Denote by $a^i_j$ the number of data points in the $j^{th}$ leaf of $T_i$ that have label $0$, and by $b^i_j$ the number of data points in the $j^{th}$ leaf of $T_i$ that have label $1$.

We will frequently use the following structure in the proofs. Let $\mathcal{G}$ be a bipartite graph with color classes $\mathcal{A}$, $\mathcal{B}$ and weighted edges. Color class $\mathcal{A}$ consists of the $n$ points from the dataset. Color class $\mathcal{B}$ consists of $2t|M|$ elements of the form $y^{i,b}_j$, where $i\in\{1,2,\dots,|M|\}$, $b\in\{0,1\}$ and $j\in\{1,2,\dots,t\}$. A data point $x\in\mathcal{A}$ is adjacent to $y^{i,1}_j$ iff it belongs to the larger of the two groups (those with labels $0$ and $1$) of data points in the $j^{th}$ leaf of the decision tree $T_i$; such an edge has weight $e^i_j$. A data point $x\in\mathcal{A}$ is adjacent to $y^{i,0}_j$ iff it belongs to the smaller of the two groups of data points in the $j^{th}$ leaf of $T_i$; such an edge has weight $1-e^i_j$. Note that the degree of a vertex $y^{i,1}_j$ is $m^i_j$ and the degree of a vertex $y^{i,0}_j$ is $n^i_j-m^i_j$.

In the proofs we will refer to the size of the set of decision trees under consideration as $|M|$ or $k$ (note that $k$ is used in the main body of the paper).

We are ready to prove Theorem 5.1 and Theorem 5.2.

We start with the proof of Theorem 5.2. Note that from the definition of $w_d$ we get:

Therefore, using the formula for $m^i_j$, we get:

Note that we have $\sum_{i=1}^{|M|}\sum_{j=1}^{t}n^i_j=n|M|$. From Jensen's inequality, applied to the function $f(x)=x^2$, we get:
$$\sum_{i=1}^{|M|}\sum_{j=1}^{t}\frac{n^i_j}{|M|n}(e^i_j)^2\ge\Big(\sum_{i=1}^{|M|}\sum_{j=1}^{t}\frac{n^i_j e^i_j}{|M|n}\Big)^2=\Big(\frac{\sum_{i=1}^{|M|}\sum_{j=1}^{t}m^i_j}{|M|n}\Big)^2=\Big(\frac{en|M|}{n|M|}\Big)^2=e^2,$$
where $e$ is the average quality of the system of all complete decision trees of height $h$ (the average tree accuracy). Similarly, $\sum_{i=1}^{|M|}\sum_{j=1}^{t}\frac{n^i_j}{|M|n}(1-e^i_j)^2\ge(1-e)^2$. Thus we get:
$$\sum_{d\in\mathcal{T}}w_d\ge n(e^2+(1-e)^2).$$

That completes the proof of Theorem 5.2. The proof of Theorem 5.1 is even simpler. Notice that for any data point $d$ the expression $\sigma(d)\cdot|M|$ counts the number of decision trees from $M$ that classify $d$ correctly (this follows directly from the definition of $\sigma$). Thus we have $\sum_{d\in\mathcal{T}}\sigma(d)\cdot|M|=\sum_{i=1}^{|M|}m^i$. Therefore $\frac{1}{n}\sum_{d\in\mathcal{T}}\sigma(d)=\frac{1}{|M|}\sum_{i=1}^{|M|}e^i$ and we are done.

We need one more technical result, Azuma's inequality:

Let $\{W_n, n\ge 1\}$ be a martingale with mean $0$, and suppose that for some non-negative constants $\alpha_i,\beta_i$ we have $-\alpha_i\le W_i-W_{i-1}\le\beta_i$ for $i=2,3,\dots$. Then for any $n\ge 0$, $a>0$:
$$\mathbb{P}(W_n\ge a)\le e^{-2a^2/\sum_{i=1}^{n}(\alpha_i+\beta_i)^2}\quad\text{and}\quad\mathbb{P}(W_n\le -a)\le e^{-2a^2/\sum_{i=1}^{n}(\alpha_i+\beta_i)^2}.$$

Again, we start with the analysis of the threshold averaging. Take the $i^{th}$ random decision tree $T^R_i$, where $i\in\{1,2,\dots,k\}$. For a given data point $d$ from the training set, let $X^d_i$ be a random variable defined as follows. If $d$ does not belong to any leaf of $T^R_i$ then let $X^d_i=0$. Otherwise, let $a^R_i$ be the number of points from the training set with label $0$ in the leaf of $T^R_i$ containing $d$, and let $b^R_i$ be the number of points from the training set with label $1$ in that leaf. If $d$ has label $0$ then we take $X^d_i=\frac{a^R_i}{a^R_i+b^R_i}$; otherwise we take $X^d_i=\frac{b^R_i}{a^R_i+b^R_i}$. Denote $X^d=\frac{X^d_1+\dots+X^d_k}{k}$. When it is clear from the context which data point we refer to, we will skip the upper index and simply write $X$ or $X_i$. Fix some point $d$ from the training set. Note that if $X>\frac{1}{2}$ then point $d$ is correctly classified. Notice that the weight $w_d$ of the point $d$ is nothing else but the sum of the weights of all edges of $\mathcal{G}$ incident to $d$ divided by the number of all trees (or the average weight of an edge incident to $d$ if we consider real-valued attributes). Note that we have $EX=w_d$ and that from Theorem 5.2 we get:
$$\sum_{d\in\mathcal{T}}w_d\ge n(e^2+(1-e)^2).$$

Take $0<\delta<\frac{1}{2}$. Denote by $\mu$ the fraction of points $d$ from the training data such that $w_d\ge\frac{1}{2}+\delta$. From the lower bound on $\sum_{d\in\mathcal{T}}w_d$ we have just derived, we get $(\frac{1}{2}+\delta)(1-\mu)n+\mu n\ge n(e^2+(1-e)^2)$, which gives us:
$$\mu\ge 1-\frac{2\epsilon-2\epsilon^2}{\frac{1}{2}-\delta},$$

where $\epsilon=1-e$. Take a point $d$ from the training set such that $w_d\ge\frac{1}{2}+\delta$. Denote by $p_d$ the probability that $d$ is misclassified. We have:
$$p_d\le\mathbb{P}\Big(X\le\frac{1}{2}\Big)\le\mathbb{P}(X-w_d\le-\delta).$$

Denote $Z_i=X_i-w_d$ for $i=1,2,\dots,k$. We have:
$$p_d\le\mathbb{P}\Big(\frac{Z_1+\dots+Z_k}{k}\le-\delta\Big).$$

Note that, since $w_d=EX$ and the random variables $X_i$ are independent, we can conclude that $\{Z_1, Z_1+Z_2,\dots,Z_1+Z_2+\dots+Z_k\}$ is a martingale. Note also that $-\alpha_i\le Z_i\le\beta_i$ for some $\alpha_i,\beta_i>0$ such that $\alpha_i+\beta_i=1$.

Using Lemma 8.2, we get:
$$p_d\le e^{-2k\delta^2}.$$

Therefore the probability that at least one of the $\mu n$ points $d$ for which $w_d\ge\frac{1}{2}+\delta$ will be misclassified by the set of $k$ random decision trees is, by the union bound, at most $\mu ne^{-2k\delta^2}\le ne^{-2k\delta^2}$. That, for $k=\frac{(1+C)\log(n)}{2\delta^2}$, completes the proof of the upper bound on the empirical error from Theorems 5.3 and 5.4, since we have already proved that $\mu\ge 1-\frac{2\epsilon-2\epsilon^2}{0.5-\delta}$. The proof of the majority voting version goes along exactly the same lines. This time, instead of Theorem 5.2, we use Theorem 5.1. We know that $\sum_{d\in\mathcal{T}}\sigma(d)\ge ne$, where $e=1-\epsilon$. Denote by $\mu^x$ the fraction of points $d$ with $\sigma(d)\ge\frac{1}{2}+x$ for $0<x<\frac{1}{2}$. Then, by an argument similar to the one presented above, we have:
$$\mu^x\ge 1-\frac{\epsilon}{\frac{1}{2}-x}.$$

All other details of the proof for the majority voting are exactly the same as for the threshold averaging scheme.
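The concentration phenomenon used in this proof can be illustrated numerically. In the toy forest below (our naming, toy leaf counts), every leaf containing $d$ is slightly biased toward $d$'s label, so each $X_i\ge\frac{6}{11}>\frac{1}{2}$ and threshold averaging necessarily classifies $d$ correctly.

```python
import random

def X_i(label, a, b):
    # X_i for a point d: the fraction of the leaf's training points sharing
    # d's label (a = count of label 0, b = count of label 1 in d's leaf).
    return (a if label == 0 else b) / (a + b)

rng = random.Random(1)
# Toy forest of k = 200 trees: in every tree, the leaf containing d (label 0)
# holds 6-10 points of label 0 and 0-5 of label 1, so each X_i >= 6/11.
xs = [X_i(0, rng.randint(6, 10), rng.randint(0, 5)) for _ in range(200)]
X = sum(xs) / len(xs)
# Threshold averaging classifies d correctly iff X > 1/2 -- guaranteed here.
```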

Let $K>0$ be a constant. We first consider the threshold averaging scheme. Take a decision tree $T_i$. Denote by $S_{T_i}$ the set of points $d$ from the training set with the following property: point $d$ belongs in $T_i$ to a leaf that contains at least $\frac{n}{K2^h}$ points. Note that since each $T_i$ has exactly $2^h$ leaves, we can conclude that $|S_{T_i}|\ge n(1-\frac{1}{K})$. In this proof and in the proofs of Theorems 5.9 and 5.9 (presented in the next section) we will consider the graph $\mathcal{G}^D$ obtained from $\mathcal{G}$ by deleting the edges adjacent to those vertices of the color class $\mathcal{B}$ that correspond to leaves containing fewer than $\frac{n}{K2^h}$ points from the training set. Take a point $d$ from the training set with $w^t_d\ge\frac{1}{2}+\delta$, where $w^t_d$ is the average weight of an edge incident to $d$ in $\mathcal{G}^D$. Notice that $w_d\ge\frac{1}{2}+\delta+\frac{1}{K}$ implies $w^t_d\ge\frac{1}{2}+\delta$. We say that a decision tree $T_i$ is $d$-good if the leaf of $T_i$ to which $d$ belongs contains at least $\frac{n}{K2^h}$ points from the training set. Let us now define $X^d_i$. If the $i^{th}$ chosen random decision tree is $d$-good then $X^d_i$ is defined as in the proof of Theorem 5.3; otherwise we put $X^d_i=0$. Denote $Z_i=X^d_i-w^t_d$.
Note that the probability $p_d$ that point $d$ is misclassified by the selected random decision trees satisfies $p_d\le\mathbb{P}(\frac{Z_1+\dots+Z_k}{k}+\frac{\sum_{j\in\mathcal{I}}R_j}{|\mathcal{I}|}\le-\delta)$, where $\mathcal{I}$ is the set of indices corresponding to those chosen random decision trees that are $d$-good, and the random variables $R_j$ are correction terms for the $d$-good random decision trees that must be introduced in order to take into account the added Laplacians (if $\mathcal{I}=\emptyset$ then we take the value of the expression $\frac{\sum_{j\in\mathcal{I}}R_j}{|\mathcal{I}|}$ to be $0$). Note also that $\{R_j, Z_j : j=1,2,\dots,k\}$ is a set of independent random variables. We get:
$$p_d\le\mathbb{P}\Big(\frac{Z_1+\dots+Z_k}{k}\le-\frac{\delta}{2}\Big)+\mathbb{P}\Big(\frac{\sum_{j\in\mathcal{I}}R_j}{|\mathcal{I}|}\le-\frac{\delta}{2}\Big).$$

Since Azuma's inequality gives $\mathbb{P}(\frac{Z_{1}+...+Z_{k}}{k}\leq-\frac{\delta}{2})\leq e^{-\frac{k\delta^{2}}{2}}$, we have:

We will now estimate the expression $p_{r}=\mathbb{P}(\frac{\sum_{j\in\mathcal{I}}R_{j}}{|\mathcal{I}|}\leq-\frac{\delta}{2})$.

For $i\in\mathcal{I}$ denote by $\mathcal{A}_{i}$ the event that each of the two perturbation errors added to the leaf containing point $d$ was of magnitude at most $\frac{\sqrt{n}}{K2^{h}}\delta_{1}$, where $\delta_{1}=\frac{\delta}{24}$. Denote $\mathcal{A}=\bigcap_{i\in\mathcal{I}}\mathcal{A}_{i}$ and denote by $\mathcal{A}^{c}$ the complement of $\mathcal{A}$. We have: $\mathbb{P}(\frac{\sum_{j\in\mathcal{I}}R_{j}}{|\mathcal{I}|}\leq-\frac{\delta}{2})=\mathbb{P}(\frac{\sum_{j\in\mathcal{I}}R_{j}}{|\mathcal{I}|}\leq-\frac{\delta}{2}|\mathcal{A})\mathbb{P}(\mathcal{A})+\mathbb{P}(\frac{\sum_{j\in\mathcal{I}}R_{j}}{|\mathcal{I}|}\leq-\frac{\delta}{2}|\mathcal{A}^{c})(1-\mathbb{P}(\mathcal{A}))$. Thus we get:

Now take one of the chosen random decision trees $T_{i}$ with $i\in\mathcal{I}$, and take its leaf that contains the given point $d$ from the training set. Assume that this leaf contains $r$ points from the training set with some fixed label $l\in\{-,+\}$ and that it contains $n_{a}$ points altogether. Note that from the definition of $\mathcal{I}$ we have $n_{a}\geq\frac{n}{K2^{h}}$. Let $g_{1},g_{2}$ be two independent Laplacian random variables, each with density function $\frac{\eta}{2k}e^{-\frac{|x|\eta}{k}}$. We would like to estimate the random variable $\Theta=\frac{r+g_{1}}{n_{a}+g_{1}+g_{2}}-\frac{r}{n_{a}}$ on the event $\mathcal{A}$. Note that in particular we know that $|g_{1}|,|g_{2}|\leq\frac{\delta_{1}n_{a}}{\sqrt{n}}$. A simple calculation gives us:

Now consider the truncated probability space $\Omega|\mathcal{A}$ and the truncated random variables $R^{t}_{i}=R_{i}|\mathcal{A}$ for $i\in\mathcal{I}$. We have: $\mathbb{P}(\sum_{i\in\mathcal{I}}R_{i}\leq-\frac{|\mathcal{I}|\delta}{2}|\mathcal{A})=\mathbb{P}(\sum_{i\in\mathcal{I}}R^{t}_{i}\leq-\frac{|\mathcal{I}|\delta}{2})$. Using inequality 5, we get:

Thus we can use Azuma's inequality once more, this time to find an upper bound on the expression $\mathbb{P}(\sum_{i\in\mathcal{I}}R^{t}_{i}\leq-\frac{|\mathcal{I}|\delta}{2})$ (we assume here that the random decision trees have already been selected, thus $\mathcal{I}$ is fixed). Without loss of generality we can assume that $\mathcal{I}\neq\emptyset$. We have: $\mathbb{P}(\sum_{i\in\mathcal{I}}R^{t}_{i}\leq-\frac{|\mathcal{I}|\delta}{2})=\mathbb{P}(\sum_{i\in\mathcal{I}}(R^{t}_{i}-ER^{t}_{i})\leq-\frac{|\mathcal{I}|\delta}{2}-\sum_{i\in\mathcal{I}}ER^{t}_{i})\leq\mathbb{P}(\sum_{i\in\mathcal{I}}(R^{t}_{i}-ER^{t}_{i})\leq-\frac{|\mathcal{I}|\delta}{4})\leq e^{-\frac{2|\mathcal{I}|(\frac{\delta}{4})^{2}}{(\frac{\delta}{4\sqrt{n}}+\frac{\delta}{4\sqrt{n}})^{2}}}$. Therefore we get:

It remains to bound the expression $1-\mathbb{P}(\mathcal{A})$. Let $g$ be a Laplacian random variable with density function $\frac{\eta}{2k}e^{-\frac{|x|\eta}{k}}$. Note that from the union bound we get: $1-\mathbb{P}(\mathcal{A})\leq 2k\mathbb{P}(|g|>\frac{\sqrt{n}\delta}{24K2^{h}})$, where the factor $2$ in the expression $2k\mathbb{P}(|g|>\frac{\sqrt{n}\delta}{24K2^{h}})$ comes from the fact that for a given data point $d$ we need to add a perturbation error in two places in the leaf of the chosen random decision tree corresponding to $d$. Denote $\gamma=\frac{\delta}{24K2^{h}}$. We have:

Evaluating the right-hand side gives us:

Thus we can conclude that the probability $p_{d}$ that a fixed point $d$ from the training set will be misclassified by the set of $k$ randomly chosen random decision trees satisfies:

Note that by an argument similar to the one presented in the proofs of Theorem 5.3 and Theorem 5.4, we can conclude that at least $n(1-\frac{2(\epsilon+\frac{1}{K})-2(\epsilon+\frac{1}{K})^{2}}{0.5-\delta})$ points $d$ from the training data satisfy $w^{t}_{d}\geq\frac{1}{2}+\delta$. Let $\mu^{t}$ be the fraction of points with this property. As we observed earlier, if a point $d$ satisfies $w_{d}\geq\frac{1}{2}+\delta+\frac{1}{K}$ then it also satisfies $w^{t}_{d}\geq\frac{1}{2}+\delta$. Thus $\mu\geq\mu^{t}$. We also have $\mu^{t}\geq 1-\frac{2(\epsilon+\frac{1}{K})-2(\epsilon+\frac{1}{K})^{2}}{0.5-\delta}$, hence $\mu\geq 1-\frac{2(\epsilon+\frac{1}{K})-2(\epsilon+\frac{1}{K})^{2}}{0.5-\delta}$. We replace $\epsilon$ by $\epsilon+\frac{1}{K}$ in the formula derived in the proof of Theorem 5.3 since now, for any fixed decision tree, we do not take into account points that belong to leaves with fewer than $\frac{n}{K2^{h}}$ points from the training set. For every given decision tree $T_{i}$ there are at most $\frac{n}{K}$ points $d$ from the training set such that $T_{i}$ is not $d$-good. Note that, by the union bound, the probability that at least one of the $n\mu$ points $d$ with $w^{t}_{d}\geq\frac{1}{2}+\delta$ is misclassified is at most $n\mu p_{d}\leq np_{d}$. To see how Theorem 5.8 and the part of Theorem 5.7 regarding empirical error follow now, take $K=40$ and $\delta=\frac{1}{10}$. The proof of the majority voting version is very similar.
We use inequality 2 (which was derived from Theorem 5.1), but all other details are exactly the same, so we omit the repetition here.

Let us now switch to the probabilistic averaging setting. In practice, as shown in the experimental section, it is the least effective method. However, for completeness of our theoretical analysis, and since for very large datasets theoretical guarantees can be obtained for this setting as well, we focus on it now.

We will first focus on the part of Theorem 5.5 regarding empirical error.

We already know that $\sum_{d\in\mathcal{T}}w_{d}\geq n(e^{2}+(1-e)^{2})$, where $e$ is the average quality. Assume that $k$ random decision trees have been selected. Denote by $Y_{d}$ the indicator of the event that a fixed data point $d$ from the training set will be correctly classified. We have:

where $X^{d}$ is the random variable defined in the proofs of Theorems 5.3 and 5.4. Note that after the random decision trees have been selected, $X^{d}$ has a deterministic value. Note also that the random variables $Y_{d}$ are independent and $EY_{d}=X^{d}$. Thus, we can use Lemma 8.2 in a very similar way as in the proofs of Theorems 5.3 and 5.4 to get that for any given $c>0$:

Let us now focus on the process of choosing random decision trees. Fix a parameter $\delta>0$ and some point $d$ from the training set. Using Lemma 8.2 in exactly the same way as in the proofs of Theorems 5.3 and 5.4, we conclude that $\mathbb{P}(X^{d}<w_{d}-\delta)\leq e^{-2k\delta^{2}}$. Therefore, by the union bound, with probability at least $1-ne^{-2k\delta^{2}}$ we have $\sum_{d\in\mathcal{T}}X^{d}\geq\sum_{d\in\mathcal{T}}(w_{d}-\delta)$. Thus, by the lower bound for $\sum_{d\in\mathcal{T}}w_{d}$ presented at the beginning of the proof, we get that with probability at least $1-ne^{-2k\delta^{2}}$ the following holds: $\sum_{d\in\mathcal{T}}X^{d}\geq n(1-2\epsilon+2\epsilon^{2}-\delta)$, where $\epsilon=1-e$. Note that the random variables $Y_{d}$ are independent from the random variables $X^{d}$. We can conclude, using inequality 11, that with probability at least $(1-ne^{-2k\delta^{2}})(1-e^{-2nc^{2}})$ at least $n(1-2\epsilon+2\epsilon^{2}-\delta-c)$ points will be correctly classified. Now we can take $k=\frac{(1+C)\log(n)}{2\delta^{2}}$ and that completes the proof. Again, as in the previous proof, the majority voting scheme requires only minor changes, so we leave it to the reader.
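The forest size prescribed at the end of this argument can be evaluated directly. Below is a small sketch; the function name and the default $C=1$ are ours, for illustration only:

```python
import math

def trees_needed(n, delta, C=1.0):
    # k = (1 + C) * log(n) / (2 * delta^2) random trees suffice so that,
    # by the union bound, the failure probability is at most n^(-C).
    return math.ceil((1 + C) * math.log(n) / (2 * delta ** 2))

print(trees_needed(10 ** 6, 0.1))  # 1382 trees for a million training points
```

Since the dependence on $n$ is only logarithmic, squaring the dataset size merely doubles the required number of trees.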

Proofs of the statements regarding empirical errors go along exactly the same lines as the presented proof of the part of Theorem 5.5 regarding empirical error. The changes in the statement, due to the added perturbation error, follow from the proofs of the bounds on the empirical error from Theorems 5.7 and 5.8. Therefore we will not give the entire proof but only mention a few points.

In comparison with the statement of Theorem 5.5, in the expression for the upper bound on the empirical error the term $\epsilon$ is replaced by $\epsilon+\frac{1}{K}$. This is, as explained in the proof of Theorem 5.7 (regarding empirical error), due to the fact that while dealing with the weights of edges in the graph $\mathcal{G}^{D}$ we do not take into account points from the training set corresponding to leaves with too few data points. To see how Theorem 5.10 can be derived, take $K=40$, $\delta=\frac{1}{10}$, $c=\frac{1}{20}$. Again, as for Theorem 5.8, Theorem 5.10 now follows by simple calculations.

We will now prove the upper bounds on the generalization error for all the theorems presented in the previous paragraphs. We do it for all of them in the same section since all the proofs are very similar. Besides, now that we have developed tools for obtaining upper bounds on the empirical error, we can use them to simplify the analysis of the generalization error. Random decision trees give strong bounds on the generalization error since they do not lead to data overfitting: the internal structure of each constructed tree (i.e. the set of its inner nodes) does not depend on the data at all. This fact is crucial in obtaining strong guarantees on the generalization error. All the experiments presented in the main body of the paper measured the generalization error of the random tree approach and serve as empirical verification that this method is a good learning technique in settings requiring high privacy guarantees. Below is the proof of the presented upper bounds on the generalization error.

Consider a test set of $n$ points. Whenever we refer to the weight or goodness of a test point $d$, this is with respect to the test set (see the definition of goodness and other terms in the description of the model). Let $\phi>0$ be a small constant and denote by $\mathcal{E}_{\phi}$ the event that for the selected forest $\mathcal{F}$ of random decision trees the non-perturbed counts in all leaves (for each leaf we count points with label $+$ and $-$ in that leaf) for the test set and training set differ by at most $2\phi n$. We start by finding a lower bound on $\mathbb{P}(\mathcal{E}_{\phi})$. Let us fix a forest, a particular tree of that forest and a particular leaf of that tree. Denote by $X_{i}$ a random variable that takes value $1$ if the $i^{th}$ point of the training set corresponds to that leaf and $0$ otherwise. Similarly, denote by $Y_{i}$ a random variable that takes value $1$ if the $i^{th}$ point of the test set corresponds to that leaf and $0$ otherwise. Denote by $X_{i}^{+}$ a random variable that takes value $1$ if the $i^{th}$ point of the training set corresponds to that leaf and has label $+$, and is $0$ otherwise. Similarly, denote by $Y_{i}^{+}$ a random variable that takes value $1$ if the $i^{th}$ point of the test set corresponds to that leaf and has label $+$, and is $0$ otherwise. Denote by $p_{1}$ the probability that the $i^{th}$ point of the training/test set corresponds to that leaf and by $p_{2}$ the probability that the $i^{th}$ point of the training/test set corresponds to that leaf and has label $+$. Notice that $p_{1},p_{2}$ are the same for the training and test set since we assume that the training and test sets are drawn from the same distribution.
Since all the random variables introduced above are independent, we can conclude using Azuma’s inequality that

Similarly,

Therefore, by the union bound

By the same analysis we can show that

Thus we also have

is at least $1-8e^{-2n\phi^{2}}$. If we now take the union bound over all $2^{h}k$ leaves of the forest then we obtain: $\mathbb{P}(\mathcal{E}_{\phi})\geq 1-2^{h+3}ke^{-2n\phi^{2}}$. We will now consider the average weights $w_{d}$ of the test points. The analysis for majority voting uses $\sigma(d)$ and is completely analogous. Assume now that all the counts for all the leaves for the test and training set differ by at most $2\phi n$. As in the analysis of the empirical error in the differentially-private setting, let us focus on those leaves of the forest that contain at least $\frac{n}{2^{h}K}$ of the test points each, for a constant $K>0$. Take a leaf $l$ with this property. Denote by $x^{1}$ the number of test points corresponding to that leaf and with label $+$, and by $x^{2}$ the number of training points corresponding to that leaf and with label $+$. Denote by $y^{1}$ the number of all test points corresponding to that leaf and by $y^{2}$ the number of all training points corresponding to that leaf. We want to find an upper bound on the expression $q=|\frac{x^{1}}{y^{1}}-\frac{x^{2}}{y^{2}}|$. Simple algebra gives us: $q\leq\frac{2\phi n(x^{1}+y^{1})}{y^{1}(y^{1}-2\phi n)}$. If we now take $\zeta=\frac{2\phi}{\theta}$, where $\theta=\frac{1}{2^{h}K}$, then we get: $q\leq\frac{2\zeta}{1-\zeta}$. Let us take $\zeta$ such that $\frac{2\zeta}{1-\zeta}\leq\frac{\delta}{2}$, where $\delta>0$ is a positive constant. Thus we want $\zeta\leq\frac{\delta}{4+\delta}$, i.e.
$\phi\leq\frac{\delta\theta}{2(4+\delta)}$. Take $\phi=\frac{\delta\theta}{2(4+\delta)}$. We can conclude that with probability at least $\mathbb{P}(\mathcal{E}_{\phi})$ the difference between the ratios of counts, in leaves containing at least $n\theta$ test points, for the test and training set is at most $\frac{\delta}{2}$. This in particular implies that if we consider a test point $d$ and the truncated bipartite graph $G^{d}$ (but this time with respect to the test set, not the training set) then the weights of $d$ in $G^{d}$ and in its corresponding version for the training set differ by at most $\frac{\delta}{2}$.
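The quantities appearing in this truncation argument are straightforward to evaluate; the sketch below (function name and sample parameters are ours) computes the lower bound on $\mathbb{P}(\mathcal{E}_{\phi})$ and shows that it becomes non-trivial only for large $n$:

```python
import math

def leaf_agreement_prob(n, h, k, delta, K=40.0):
    # theta = 1 / (2^h * K): minimal fraction of points in a kept leaf.
    theta = 1.0 / (2 ** h * K)
    # phi = delta * theta / (2 * (4 + delta)), as chosen above.
    phi = delta * theta / (2 * (4 + delta))
    # P(E_phi) >= 1 - 2^(h + 3) * k * exp(-2 * n * phi^2).
    return 1 - 2 ** (h + 3) * k * math.exp(-2 * n * phi ** 2)
```

For example, with $h=5$, $k=20$, $\delta=0.1$ the bound is vacuous (negative) at $n=10^{9}$ but essentially $1$ at $n=10^{12}$.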

We are almost done. Consider first the majority voting/threshold averaging scheme. The only changes we need to introduce in the statement of Theorem 5.3 for the empirical error are: subtract from $p_{1}$ the probability that $\mathcal{E}_{\phi}$ does not hold to obtain a lower bound on $p_{2}$, add the factor $\frac{1}{K}$ to the expression for $w$ (since we are using the truncated model), and replace $\delta$ by $\frac{\delta}{2}$ in the expression for the number of random decision trees used. Similarly, in the statement of Theorem 5.4 we need to replace $\epsilon$ in the expression for $err_{1}$ by $\epsilon+\frac{1}{K}$ to obtain an upper bound on $err_{2}$ (again, because we are using a truncation argument) and make the same change in the number of decision trees as above. To obtain a lower bound on $p_{2}$ it suffices to subtract the probability that $\mathcal{E}_{\phi}$ does not hold. Let us now focus on Theorem 5.7. Again we need to add the extra factor $\frac{1}{K}$ to the expression for $w$ and subtract the probability that $\mathcal{E}_{\phi}$ does not hold to obtain a lower bound on $p_{2}$.

Now let us consider the probabilistic averaging scheme. Take the statement of Theorem 5.5 first. We make corrections similar to those mentioned earlier to get a lower bound on $p_{2}$. Besides, in the upper bound on $err_{1}$ we need to replace $\epsilon$ by $\epsilon+\frac{1}{K}$ to obtain an upper bound on $err_{2}$. In Theorem 5.9 we need to add one extra term $\frac{1}{K}$ in the upper bound on $err_{1}$ to obtain an upper bound on $err_{2}$, and again modify $p_{1}$ in the same way as before to obtain a lower bound on $p_{2}$.

In this section we present the experimental results obtained for all the remaining benchmark datasets. The plots have a form similar to those shown in the main body of the paper.

Table: S5.T1: Comparison of the performance of random forests and rpart.

| Dataset | n | d | Error | Error | k | h | Error | k | h | Error | k | h | Error | k | h |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ban_Aut | 1372 | 5 | 3.65±0.99 | 3.09±0.92 | 21 | 15 | 3.46±0.97 | 17 | 9 | 5.44±1.20 | 21 | 11 | 5.22±1.18 | 7 | 12 |
| BTSC | 748 | 5 | 18.92±2.81 | 22.19±2.98 | 11 | 4 | 22.47±2.99 | 11 | 4 | 23.42±3.03 | 11 | 4 | 23.42±3.03 | 11 | 3 |
| CVR | 435 | 16 | 9.30±2.73 | 9.05±2.70 | 19 | 6 | 5.95±2.22 | 13 | 9 | 8.10±2.56 | 15 | 9 | 6.90±2.38 | 15 | 9 |
| Mam_Mass | 961 | 6 | 21.88±2.61 | 16.95±2.37 | 9 | 12 | 16.21±2.33 | 19 | 15 | 16.95±2.37 | 5 | 12 | 17.37±2.40 | 9 | 8 |
| Mushroom | 8124 | 22 | 3.33±0.39 | 0.83±0.20 | 21 | 15 | 0.26±0.11 | 13 | 14 | 4.69±0.46 | 3 | 13 | 4.16±0.43 | 3 | 15 |
| Adult | 32561 | 123 | 17.75±0.42 | 21.70±0.45 | 3 | 14 | 21.58±0.45 | 3 | 14 | 22.18±0.45 | 3 | 11 | 21.72±0.45 | 7 | 11 |
| Covertype | 581012 | 54 | 26.90±0.11 | 33.39±0.12 | 21 | 15 | 30.80±0.12 | 21 | 15 | 38.75±0.13 | 3 | 13 | 37.82±0.12 | 3 | 13 |
| Quantum | 50000 | 78 | 32.08±0.41 | 34.81±0.42 | 21 | 15 | 33.06±0.41 | 19 | 14 | 39.91±0.43 | 21 | 13 | 39.01±0.43 | 13 | 9 |

Figure: Comparison of dpRFMV, dpRFTA and dpRFPA with $\eta=1000/n_{tr}=0.137$ for selected datasets. Test error vs. a) $h$ across various settings of $k$ and c) $k$ across various settings of $h$; minimal, average and maximal test error vs. b) $h$ and d) $k$ for dpRFMV.

Figure: adult dataset. Comparison of dpRFMV, dpRFTA and dpRFPA. a) Test error vs. $\eta$ for two settings of $h$. b) Test error vs. $k$ for fixed $h$ across different settings of $\eta$.

$$ \mathbb{P}(\mathcal{K} (\mathcal{D}_{1}) \in S) \le \exp (\epsilon) \cdot \mathbb{P}(\mathcal{K} (\mathcal{D}_{2}) \in S). $$

$$ \sum_{d\in\mathcal{T}}w_{d}=\frac{1}{|M|}\sum_{i=1}^{|M|}\sum_{j=1}^{t}(m^{i}_{j}e^{i}_{j}+(n^{i}_{j}-m^{i}_{j})(1-e^{i}_{j})). $$ \tag{S8.Ex1}

$$ \label{inequality4} p_{r} \leq e^{-\frac{n}{2}} + (1-\mathbb{P}(\mathcal{A})). $$ \tag{inequality4}

$$ \mathbb{P}(W_{n} \geq a) \leq e^{-\frac{2a^{2}}{\sum_{i=1}^{n} (\alpha_{i}+\beta_{i})^{2}}} \quad\text{and}\quad \mathbb{P}(W_{n} \leq -a) \leq e^{-\frac{2a^{2}}{\sum_{i=1}^{n}(\alpha_{i}+\beta_{i})^{2}}}. $$

$$ \label{majority_ineq} \mu^{x} \geq 1 - \frac{\epsilon}{0.5 - x}. $$ \tag{majority_ineq}

$$ p_{d}\leq\mathbb{P}(\frac{X_{1}+...+X_{k}}{k}\leq w_{d}-\delta). $$ \tag{S8.Ex7}

$$ \label{inequality1} p_{d} \leq e^{-\frac{k\delta^{2}}{2}} + \mathbb{P}(\frac{\sum_{j \in \mathcal{I}} R_{j}}{|\mathcal{I}|} \leq -\frac{\delta}{2}) $$ \tag{inequality1}

$$ \label{inequality3} |\Theta| \leq \frac{\delta}{4\sqrt{n}}. $$ \tag{inequality3}

$$ \label{inequality5} p_{r} \leq e^{-\frac{n}{2}} + 4k \int\limits_{\gamma \sqrt{n}}^{\infty} \frac{\eta}{2k}e^{-\frac{x\eta}{k}}dx. $$ \tag{inequality5}

$$ \label{inequality6} p_{r} \leq e^{-\frac{n}{2}} +2ke^{-\frac{\lambda \sqrt{n} \eta}{k}}, \quad\text{where}\quad \lambda = \frac{\delta}{24K2^{h}}. $$ \tag{inequality6}

$$ Y_{d}=\begin{cases}1&\text{with probability }X^{d}\\ 0&\text{with probability }1-X^{d}.\end{cases} $$ \tag{S8.Ex11}

$$ \mathbb{P}(X_{1} + ... + X_{n} \in [n(p_{1}-\phi),n(p_{1}+\phi)]) \geq 1 - 2e^{-2n\phi^{2}}. $$

$$ |(X_{1}+...+X_{n})-(Y_{1}+...+Y_{n})| \leq 2\phi n \quad\text{and}\quad |(X_{1}^{+}+...+X_{n}^{+})-(Y_{1}^{+}+...+Y_{n}^{+})| \leq 2\phi n $$

Definition. Definition 3.1 (See [45].) A randomized algorithm $\mathcal{K}$ gives $\epsilon$-differential privacy if for all datasets $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$ differing on at most one element, and all $S\subseteq Range(\mathcal{K})$, $\mathbb{P}(\mathcal{K}(\mathcal{D}_{1})\in S)\leq\exp(\epsilon)\cdot\mathbb{P}(\mathcal{K}(\mathcal{D}_{2})\in S)$. (1) The probability is taken over the coin tosses of $\mathcal{K}$.
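For the Laplace mechanism on a count of sensitivity 1, the inequality in Definition 3.1 can be checked pointwise on the output densities; a minimal numeric sketch (the sample values are ours):

```python
import math

def laplace_pdf(x, mu, scale):
    # Density of a Laplace(mu, scale) random variable at x.
    return math.exp(-abs(x - mu) / scale) / (2 * scale)

# Neighboring datasets give counts c and c + 1 (global sensitivity 1).
# With noise of scale 1/eps, the two output densities differ pointwise
# by a factor of at most e^eps, which is exactly Definition 3.1.
eps, c = 0.5, 100
ratios = [laplace_pdf(x, c, 1 / eps) / laplace_pdf(x, c + 1, 1 / eps)
          for x in (c - 3.0, c + 0.5, c + 4.0)]
assert max(ratios) <= math.exp(eps) + 1e-12
```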

Definition. (See [nissim].) The global sensitivity $S(f)$ of a function $f$ is the smallest number $s$ such that for all $D_{1}$ and $D_{2}$ which differ on at most one element, $|f(D_{1}) - f(D_{2})| \leq s$.

Theorem. Theorem 3.1 (See [6].) Let $f$ be a function on databases with range $R^{m}$, where $m$ is the number of rows of the databases (i.e. the number of attributes of any data point from the databases). Then, the mechanism that outputs $f(\mathcal{D})+(Y_{1},\ldots,Y_{m})$, where the $Y_{i}$ are drawn i.i.d. from $Lap(0,S(f)/\epsilon)$, satisfies $\epsilon$-differential privacy.
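A minimal sketch of the mechanism from Theorem 3.1, with the Laplacian noise drawn by inverse-CDF sampling (function names are ours; a library sampler such as numpy.random.laplace would do equally well):

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sample of a Laplace(0, scale) random variable.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(values, sensitivity, eps, rng):
    # Output f(D) + (Y_1, ..., Y_m) with Y_i drawn i.i.d. from
    # Lap(0, sensitivity / eps); by Theorem 3.1 this is eps-DP.
    scale = sensitivity / eps
    return [v + laplace_noise(scale, rng) for v in values]

noisy = laplace_mechanism([10, 20, 30], 1.0, 0.5, random.Random(0))
```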

Theorem. (See [nissim].) (Composition Theorem) The sequential application of mechanisms $K_i$, each giving $\epsilon_{i}$-differential privacy, satisfies $\sum_i \epsilon_{i}$-differential privacy.

Theorem. Assume that the average tree accuracy of the set $M$ of all decision trees of height $h$ on the training/test set $D$ is $e=1-\epsilon$ for some $0<\epsilon \leq \frac{1}{2}$. Then the average goodness $\sigma(d)$ of a training/test point $d$ in $M$ is at least $e \geq \frac{1}{2}$.

Theorem. Theorem 5.3 Let $K>0$. Assume that the average tree accuracy of the set $M$ of all decision trees of height $h$ on the training/test set $\mathcal{D}$ is $e=1-\epsilon$ for some $0<\epsilon\leq\frac{1}{2}$. Let $\mu$ be: the fraction of training/test points with goodness in $M$ at least $\sigma=\frac{1}{2}+\delta$ / $\sigma=\frac{1}{2}+\delta+\frac{1}{K}$ for $0<\delta<\frac{1}{2}$ (in the majority version), or: the fraction of training/test points with weight in $M$ at least $w=\frac{1}{2}+\delta$ / $w=\frac{1}{2}+\delta+\frac{1}{K}$ for $0<\delta<\frac{1}{2}$ (in the threshold averaging version). Then Algorithm 1 for every $C>0$ and $k=\frac{(1+C)\log(n)}{2\delta^{2}}$ selected random decision trees gives empirical error $err_{1}\leq 1-\mu$ with probability $p_{1}\geq 1-\frac{1}{n^{C}}$. The generalization error $err_{2}\leq 1-\mu$ will be achieved for $k=\frac{(1+C)\log(n)}{2(\frac{\delta}{2})^{2}}$ trees with probability $p_{2}\geq p_{1}-2^{h+3}ke^{-2n\phi^{2}}$, where $\phi=\frac{\delta}{2(4+\delta)2^{h}K}$. Probabilities $p_{1}$ and $p_{2}$ are under the random coin tosses used to construct the forest and the test set.

Proof. Notice that in every method, to obtain the forest of random decision trees with perturbed counters in the leaves we need $k$ queries to the private data (this is true since the structure of the inner nodes of the trees does not depend on the data at all, and the data subsets corresponding to the leaves are pairwise disjoint). Furthermore, the values that are being perturbed by the Laplacian noise are simple counts of global sensitivity $1$. Thus we can use Theorem 3.1 and the Composition Theorem to conclude that in order to obtain $\eta$-differential privacy of the entire system we need to add $Lap(0,k/\eta)$ noise to every count in the leaves. This proves that our algorithms are indeed $\eta$-differentially-private.
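A hedged sketch (our illustration, not the authors' code) of the resulting $\eta$-differentially-private threshold averaging classifier: each of the $k$ trees contributes the perturbed fraction of positives in the leaf reached by the query point, with both label counts perturbed by $Lap(0,k/\eta)$ noise as described above.

```python
import math
import random

def lap(scale, rng):
    # Inverse-CDF sample of a Laplace(0, scale) random variable.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_threshold_average(leaf_counts, eta, k, rng):
    # leaf_counts: for each of the k trees, the (n_plus, n_minus) counts
    # of the leaf containing the query point.  Each count gets Lap(0, k/eta)
    # noise, so by composition the whole forest is eta-differentially private.
    scale = k / eta
    total = 0.0
    for n_plus, n_minus in leaf_counts:
        g1, g2 = lap(scale, rng), lap(scale, rng)
        total += (n_plus + g1) / (n_plus + n_minus + g1 + g2)
    return '+' if total / len(leaf_counts) > 0.5 else '-'
```

With a generous privacy budget and well-populated leaves the noisy fractions stay close to the true ones, so the prediction follows the majority label of the reached leaves.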

Theorem. Theorem 5.10. Assume that we are given a parameter $\eta>0$. Assume besides that the average tree accuracy of the set $M$ of all decision trees of height $h$ on the training set $\mathcal{D}$ is $e=1-\epsilon$ for some $0<\epsilon\leq\frac{1}{2}$. Let $\gamma=\frac{1}{2^{h}\cdot 9600}$ and let $k_{opt}$ be the integer value for which the value of the function $f(k)=e^{-\frac{k}{200}}+2ke^{-\frac{\gamma\sqrt{n}\eta}{k}}$ is smallest possible. Then with probability at least $p=1-n(e^{-\frac{k_{opt}}{200}}+2k_{opt}e^{-\frac{\gamma\sqrt{n}\eta}{k_{opt}}}+e^{-\frac{n}{2}})(1-e^{-\frac{n}{200}})$ the $\eta$-differentially-private probabilistic averaging version of Algorithm 2 gives empirical error at most $\frac{1}{5}+\frac{19}{10}\epsilon-2\epsilon^{2}$ for the forest with $k_{opt}$ randomly chosen decision trees. Probability $p$ is under the random coin tosses used to construct the forest.
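The optimal forest size $k_{opt}$ has no closed form but can be found by direct search over a bounded range; a sketch (the function name and the search cap of 5000 are ours):

```python
import math

def k_opt(n, eta, h, k_max=5000):
    # Minimize f(k) = exp(-k / 200) + 2k * exp(-gamma * sqrt(n) * eta / k)
    # over integers 1 <= k <= k_max, with gamma = 1 / (2^h * 9600).
    gamma = 1.0 / (2 ** h * 9600)
    def f(k):
        return math.exp(-k / 200) + 2 * k * math.exp(-gamma * math.sqrt(n) * eta / k)
    return min(range(1, k_max + 1), key=f)
```

The first term of $f$ pushes $k$ up (more trees give sharper concentration), while the second pushes it down (a larger forest forces more Laplacian noise per count), so $k_{opt}$ grows with $n$ and with the privacy budget $\eta$.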

Proof. Proof 8.1. We start with the proof of Theorem 5.2. Note that from the definition of $w_{d}$ we get: $\sum_{d\in\mathcal{T}}w_{d}=\frac{1}{|M|}\sum_{i=1}^{|M|}\sum_{j=1}^{t}(m^{i}_{j}e^{i}_{j}+(n^{i}_{j}-m^{i}_{j})(1-e^{i}_{j}))$. Therefore, using the formula for $m^{i}_{j}$, we get: $\sum_{d\in\mathcal{T}}w_{d}=\frac{1}{|M|}\sum_{i=1}^{|M|}\sum_{j=1}^{t}(n^{i}_{j}(e^{i}_{j})^{2}+n^{i}_{j}(1-e^{i}_{j})^{2})$. Note that we have: $\sum_{i=1}^{|M|}\sum_{j=1}^{t}n^{i}_{j}=n|M|$. From Jensen's inequality, applied to the function $f(x)=x^{2}$, we get: $\sum_{i=1}^{|M|}\sum_{j=1}^{t}\frac{n^{i}_{j}}{|M|n}(e^{i}_{j})^{2}\geq(\sum_{i=1}^{|M|}\sum_{j=1}^{t}\frac{n^{i}_{j}e^{i}_{j}}{|M|n})^{2}=(\frac{\sum_{i=1}^{|M|}\sum_{j=1}^{t}m^{i}_{j}}{|M|n})^{2}=(\frac{en|M|}{n|M|})^{2}=e^{2}$, where $e$ is the average quality of the system of all complete decision trees of height $h$ (the average tree accuracy).
Similarly, $\sum_{i=1}^{|M|}\sum_{j=1}^{t}\frac{n^{i}_{j}}{|M|n}(1-e^{i}_{j})^{2}\geq(1-e)^{2}$. Thus we get: $\sum_{d\in\mathcal{T}}w_{d}\geq n(e^{2}+(1-e)^{2})$. That completes the proof of Theorem 5.2. The proof of Theorem 5.1 is even simpler. Notice that for any data point $d$ the expression $\sigma(d)\cdot|M|$ counts the number of decision trees from $M$ that classified $d$ correctly (this follows directly from the definition of $\sigma$). Thus we have: $\sum_{d\in\mathcal{T}}\sigma(d)\cdot|M|=\sum_{i=1}^{|M|}m^{i}$. Therefore $\frac{1}{n}\sum_{d\in\mathcal{T}}\sigma(d)=\frac{1}{|M|}\sum_{i=1}^{|M|}e^{i}$ and we are done.

Lemma. Lemma 8.2. Let $\{W_{n},n\geq 1\}$ be a martingale with mean $0$ and suppose that for some non-negative constants $\alpha_{i},\beta_{i}$ we have $-\alpha_{i}\leq W_{i}-W_{i-1}\leq\beta_{i}$ for $i=2,3,\ldots$. Then for any $n\geq 0$, $a>0$: $\mathbb{P}(W_{n}\geq a)\leq e^{-\frac{2a^{2}}{\sum_{i=1}^{n}(\alpha_{i}+\beta_{i})^{2}}} \quad\text{and}\quad \mathbb{P}(W_{n}\leq-a)\leq e^{-\frac{2a^{2}}{\sum_{i=1}^{n}(\alpha_{i}+\beta_{i})^{2}}}$.
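Lemma 8.2 (the Hoeffding-Azuma inequality) can be sanity-checked by simulation in the special case of i.i.d. centered increments with $\alpha_{i}=\beta_{i}=\frac{1}{2}$; the sketch below (ours) compares the analytic bound with an empirical tail frequency for $\pm\frac{1}{2}$ coin-flip steps:

```python
import math
import random

def azuma_bound(n, a):
    # exp(-2 a^2 / sum_i (alpha_i + beta_i)^2) with alpha_i + beta_i = 1.
    return math.exp(-2 * a ** 2 / n)

def tail_estimate(n, a, trials, rng):
    # Empirical P(W_n >= a) for W_n a sum of n independent +/- 1/2 steps.
    hits = 0
    for _ in range(trials):
        w = sum(0.5 if rng.random() < 0.5 else -0.5 for _ in range(n))
        if w >= a:
            hits += 1
    return hits / trials
```

For example, with $n=100$ and $a=10$ the bound equals $e^{-2}\approx 0.135$, while the simulated tail frequency comes out far smaller.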

Proof. Proof 8.3. Again, we start with the analysis of threshold averaging. Take the $i^{th}$ random decision tree $T^{R}_{i}$, where $i\in\{1,2,...,k\}$. For a given data point $d$ from the training set let $X^{d}_{i}$ be a random variable defined as follows. If $d$ does not belong to any leaf of $T^{R}_{i}$ then let $X^{d}_{i}=0$. Otherwise let $a^{R}_{i}$ be the number of points from the training set with label $0$ in that leaf and let $b^{R}_{i}$ be the number of points from the training set with label $1$ in that leaf. If $d$ has label $0$ then we take $X^{d}_{i}=\frac{a^{R}_{i}}{a^{R}_{i}+b^{R}_{i}}$. Otherwise we take $X^{d}_{i}=\frac{b^{R}_{i}}{a^{R}_{i}+b^{R}_{i}}$. Denote $X^{d}=\frac{X^{d}_{1}+...+X^{d}_{k}}{k}$. When it is clear from the context which data point we refer to, we will skip the upper index and simply write $X$ or $X_{i}$ respectively. Fix some point $d$ from the training set. Note that if $X>\frac{1}{2}$ then point $d$ is correctly classified. Notice that the weight $w_{d}$ of the point $d$ is nothing else but the sum of the weights of all the edges of $\mathcal{G}$ incident to $d$ divided by the number of all trees (or the average weight of an edge incident to $d$ if we consider real-valued attributes). Note that we have $EX=w_{d}$ and that from Theorem 5.2 we get: $\sum_{d\in\mathcal{T}}w_{d}\geq n(e^{2}+(1-e)^{2})$. Take $0<\delta<\frac{1}{2}$.
Denote by $\mu$ the fraction of points $d$ from the training data such that $w_{d}\geq\frac{1}{2}+\delta$. From the lower bound on $\sum_{d\in\mathcal{T}}w_{d}$ we have just derived, we get $(\frac{1}{2}+\delta)(1-\mu)n+\mu n\geq n(e^{2}+(1-e)^{2})$, which gives us $\mu\geq 1-\frac{2\epsilon-2\epsilon^{2}}{0.5-\delta}$, where $\epsilon=1-e$. Take a point $d$ from the training set such that $w_{d}\geq\frac{1}{2}+\delta$. Denote by $p_{d}$ the probability that $d$ is misclassified. We have: $p_{d}\leq\mathbb{P}(\frac{X_{1}+\ldots+X_{k}}{k}\leq w_{d}-\delta)$. Denote $Z_{i}=X_{i}-w_{d}$ for $i=1,2,\ldots,k$. We have: $p_{d}\leq\mathbb{P}(Z_{1}+\ldots+Z_{k}\leq -k\delta)$. Note that, since $w_{d}=EX$ and the random variables $X_{i}$ are independent, we can conclude that $\{Z_{1},Z_{1}+Z_{2},\ldots,Z_{1}+Z_{2}+\ldots+Z_{k}\}$ is a martingale. Note also that $-\alpha_{i}\leq Z_{i}\leq\beta_{i}$ for some $\alpha_{i},\beta_{i}>0$ such that $\alpha_{i}+\beta_{i}=1$. Using Lemma 8.2, we get: $\mathbb{P}(Z_{1}+\ldots+Z_{k}\leq -k\delta)\leq e^{-\frac{2(k\delta)^{2}}{k}}$.
Therefore the probability that at least one of the $\mu n$ points $d$ for which $w_{d}\geq\frac{1}{2}+\delta$ will be misclassified by the set of $k$ random decision trees is, by the union bound, at most $\mu ne^{-2k\delta^{2}}\leq ne^{-2k\delta^{2}}$. For $k=\frac{(1+C)\log(n)}{2\delta^{2}}$, this completes the proof of the upper bound on the empirical error from Theorems 5.3 and 5.4, since we have already proved that $\mu\geq 1-\frac{2\epsilon-2\epsilon^{2}}{0.5-\delta}$. The proof of the majority voting version goes along exactly the same lines. This time, instead of Theorem 5.2, we use Theorem 5.1. We know that $\sum_{d\in\mathcal{T}}\sigma(d)\geq ne$, where $e=1-\epsilon$. Denote by $\mu^{x}$ the fraction of points $d$ with $\sigma(d)\geq\frac{1}{2}+x$ for $0<x<\frac{1}{2}$. Then, by an argument similar to the one presented above, we have:
$$\mu^{x}\geq 1-\frac{\epsilon}{0.5-x}.\qquad(2)$$
All other details of the proof for majority voting are exactly the same as for the threshold averaging scheme.
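To make the logarithmic dependence on $n$ concrete, the sketch below (illustrative only; the constants are exactly those of the proof, the parameter values are arbitrary) computes $k=\frac{(1+C)\log(n)}{2\delta^{2}}$ and checks that the resulting union bound $ne^{-2k\delta^{2}}$ drops to $n^{-C}$:

```python
import math

def trees_needed(n, delta, C):
    """Number of random decision trees k = (1+C) log(n) / (2 delta^2) from the proof."""
    return math.ceil((1 + C) * math.log(n) / (2 * delta ** 2))

def union_bound(n, k, delta):
    """Union bound n e^{-2 k delta^2} on the probability that some point with w_d >= 1/2 + delta is misclassified."""
    return n * math.exp(-2 * k * delta ** 2)

n, delta, C = 100000, 0.1, 1.0
k = trees_needed(n, delta, C)
# With this k, the failure probability is at most n^{-C}:
assert union_bound(n, k, delta) <= n ** (-C)
```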

Proof 8.4. Let $K>0$ be a constant. We first consider the threshold averaging scheme. Take a decision tree $T_{i}$. Denote by $S_{T_{i}}$ the set of points $d$ from the training set with the following property: in $T_{i}$, point $d$ belongs to a leaf that contains at least $\frac{n}{K2^{h}}$ points. Note that since each $T_{i}$ has exactly $2^{h}$ leaves, we can conclude that $|S_{T_{i}}|\geq n(1-\frac{1}{K})$. In this proof and in the proof of Theorem 5.9 (presented in the next section) we will consider the graph $\mathcal{G}^{D}$ obtained from $\mathcal{G}$ by deleting the edges adjacent to those vertices of the color class $\mathcal{B}$ that correspond to leaves containing fewer than $\frac{n}{K2^{h}}$ points from the training set. Take a point $d$ from the training set with $w^{t}_{d}\geq\frac{1}{2}+\delta$, where $w^{t}_{d}$ is the average weight of an edge incident to $d$ in $\mathcal{G}^{D}$. Notice that $w_{d}\geq\frac{1}{2}+\delta+\frac{1}{K}$ implies $w^{t}_{d}\geq\frac{1}{2}+\delta$. We say that a decision tree $T_{i}$ is $d$-good if the leaf of $T_{i}$ to which $d$ belongs contains at least $\frac{n}{K2^{h}}$ points from the training set. Let us now define $X^{d}_{i}$. If the $i^{th}$ chosen random decision tree is $d$-good then $X^{d}_{i}$ is defined as in the proof of Theorem 5.3; otherwise we put $X^{d}_{i}=0$. Denote $Z_{i}=X^{d}_{i}-w^{t}_{d}$.
Note that the probability $p_{d}$ that point $d$ is misclassified by the selected random decision trees satisfies $p_{d}\leq\mathbb{P}(\frac{Z_{1}+\ldots+Z_{k}}{k}+\frac{\sum_{j\in\mathcal{I}}R_{j}}{|\mathcal{I}|}\leq -\delta)$, where $\mathcal{I}$ is the set of indices corresponding to those chosen random decision trees that are $d$-good and the random variables $R_{j}$ are correction terms for $d$-good random decision trees that must be introduced in order to take into account the added Laplacians (if $\mathcal{I}=\emptyset$ then we take the value of the expression $\frac{\sum_{j\in\mathcal{I}}R_{j}}{|\mathcal{I}|}$ to be $0$). Note also that $\{R_{j},Z_{j}:j=1,2,\ldots,k\}$ is a set of independent random variables. We get: $p_{d}\leq\mathbb{P}(\frac{Z_{1}+\ldots+Z_{k}}{k}\leq -\frac{\delta}{2})+\mathbb{P}(\frac{\sum_{j\in\mathcal{I}}R_{j}}{|\mathcal{I}|}\leq -\frac{\delta}{2})$. Since from Azuma's inequality we get $\mathbb{P}(\frac{Z_{1}+\ldots+Z_{k}}{k}\leq -\frac{\delta}{2})\leq e^{-\frac{k\delta^{2}}{2}}$, we have:
$$p_{d}\leq e^{-\frac{k\delta^{2}}{2}}+\mathbb{P}\Big(\frac{\sum_{j\in\mathcal{I}}R_{j}}{|\mathcal{I}|}\leq -\frac{\delta}{2}\Big).\qquad(3)$$
We will now estimate the expression $p_{r}=\mathbb{P}(\frac{\sum_{j\in\mathcal{I}}R_{j}}{|\mathcal{I}|}\leq -\frac{\delta}{2})$.
For $i\in\mathcal{I}$ denote by $\mathcal{A}_{i}$ the event that each of the two perturbation errors added to the leaf containing point $d$ was of magnitude at most $\frac{\sqrt{n}}{K2^{h}}\delta_{1}$, where $\delta_{1}=\frac{\delta}{24}$. Denote $\mathcal{A}=\bigcap_{i\in\mathcal{I}}\mathcal{A}_{i}$ and denote by $\mathcal{A}^{c}$ the complement of $\mathcal{A}$. We have: $\mathbb{P}(\frac{\sum_{j\in\mathcal{I}}R_{j}}{|\mathcal{I}|}\leq -\frac{\delta}{2})=\mathbb{P}(\frac{\sum_{j\in\mathcal{I}}R_{j}}{|\mathcal{I}|}\leq -\frac{\delta}{2}\mid\mathcal{A})\mathbb{P}(\mathcal{A})+\mathbb{P}(\frac{\sum_{j\in\mathcal{I}}R_{j}}{|\mathcal{I}|}\leq -\frac{\delta}{2}\mid\mathcal{A}^{c})(1-\mathbb{P}(\mathcal{A}))$. Thus we get:
$$p_{r}\leq\mathbb{P}\Big(\frac{\sum_{j\in\mathcal{I}}R_{j}}{|\mathcal{I}|}\leq -\frac{\delta}{2}\,\Big|\,\mathcal{A}\Big)+(1-\mathbb{P}(\mathcal{A})).\qquad(4)$$
Now take one of the chosen random decision trees $T_{i}$ with $i\in\mathcal{I}$ and take its leaf that contains the given point $d$ from the training set. Assume that this leaf contains $r$ points from the training set with some fixed label $l\in\{-,+\}$ and $n_{a}$ points altogether. Note that from the definition of $\mathcal{I}$ we have: $n_{a}\geq\frac{n}{K2^{h}}$. Let $g_{1},g_{2}$ be two independent Laplacian random variables, each with density function $\frac{\eta}{2k}e^{-\frac{|x|\eta}{k}}$.
We would like to estimate the random variable $\Theta=\frac{r+g_{1}}{n_{a}+g_{1}+g_{2}}-\frac{r}{n_{a}}$ on the event $\mathcal{A}$. Note that in particular we know that $|g_{1}|,|g_{2}|\leq\frac{\delta_{1}n_{a}}{\sqrt{n}}$. A simple calculation gives us:
$$|\Theta|\leq\frac{\delta}{4\sqrt{n}}.\qquad(5)$$
Now consider the truncated probability space $\Omega|\mathcal{A}$ and the truncated random variables $R^{t}_{i}=R_{i}|\mathcal{A}$ for $i\in\mathcal{I}$. We have: $\mathbb{P}(\sum_{i\in\mathcal{I}}R_{i}\leq -\frac{|\mathcal{I}|\delta}{2}\mid\mathcal{A})=\mathbb{P}(\sum_{i\in\mathcal{I}}R^{t}_{i}\leq -\frac{|\mathcal{I}|\delta}{2})$. Using inequality (5), we get:
$$|R^{t}_{i}|\leq\frac{\delta}{4\sqrt{n}},\quad E|R^{t}_{i}|\leq\frac{\delta}{4\sqrt{n}}.\qquad(6)$$
Thus we can use Azuma's inequality once more, this time to find an upper bound on the expression $\mathbb{P}(\sum_{i\in\mathcal{I}}R^{t}_{i}\leq -\frac{|\mathcal{I}|\delta}{2})$ (we assume here that the random decision trees have already been selected, thus $\mathcal{I}$ is given). Without loss of generality we can assume that $\mathcal{I}\neq\emptyset$.
We have: $\mathbb{P}(\sum_{i\in\mathcal{I}}R^{t}_{i}\leq -\frac{|\mathcal{I}|\delta}{2})=\mathbb{P}(\sum_{i\in\mathcal{I}}(R^{t}_{i}-ER^{t}_{i})\leq -\frac{|\mathcal{I}|\delta}{2}-\sum_{i\in\mathcal{I}}ER^{t}_{i})\leq\mathbb{P}(\sum_{i\in\mathcal{I}}(R^{t}_{i}-ER^{t}_{i})\leq -\frac{|\mathcal{I}|\delta}{4})\leq e^{-\frac{2|\mathcal{I}|(\frac{\delta}{4})^{2}}{(\frac{\delta}{4\sqrt{n}}+\frac{\delta}{4\sqrt{n}})^{2}}}$. Therefore we get:
$$p_{r}\leq e^{-\frac{n}{2}}+(1-\mathbb{P}(\mathcal{A})).\qquad(7)$$
It remains to bound the expression $(1-\mathbb{P}(\mathcal{A}))$. Let $g$ be a Laplacian random variable with density function $\frac{\eta}{2k}e^{-\frac{|x|\eta}{k}}$. Note that from the union bound we get: $1-\mathbb{P}(\mathcal{A})\leq 2k\mathbb{P}(|g|>\frac{\sqrt{n}\delta}{24K2^{h}})$, where the factor $2$ in the expression $2k\mathbb{P}(|g|>\frac{\sqrt{n}\delta}{24K2^{h}})$ comes from the fact that for a given data point $d$ we need to add perturbation error in two places in the leaf of the chosen random decision tree corresponding to $d$. Denote $\gamma=\frac{\delta}{24K2^{h}}$. We have:
$$p_{r}\leq e^{-\frac{n}{2}}+4k\int\limits_{\gamma\sqrt{n}}^{\infty}\frac{\eta}{2k}e^{-\frac{x\eta}{k}}dx.\qquad(8)$$
Evaluation of the right-hand side gives us:
$$p_{r}\leq e^{-\frac{n}{2}}+2ke^{-\frac{\lambda\sqrt{n}\eta}{k}},\quad\text{where }\lambda=\frac{\delta}{24K2^{h}}.\qquad(9)$$
Thus we can conclude that the probability $p_{d}$ that a fixed point $d$ from the training set will be misclassified by the set of $k$ randomly chosen random decision trees satisfies:
$$p_{d}\leq e^{-\frac{k\delta^{2}}{2}}+e^{-\frac{n}{2}}+2ke^{-\frac{\gamma\sqrt{n}\eta}{k}}.\qquad(10)$$
Note that by an argument similar to the one presented in the proofs of Theorem 5.3 and Theorem 5.4, we can conclude that at least $n(1-\frac{2(\epsilon+\frac{1}{K})-2(\epsilon+\frac{1}{K})^{2}}{0.5-\delta})$ points $d$ from the training data satisfy $w^{t}_{d}\geq\frac{1}{2}+\delta$. Let $\mu^{t}$ be the fraction of points with this property. As we observed earlier, if a point $d$ satisfies $w_{d}\geq\frac{1}{2}+\delta+\frac{1}{K}$ then it also satisfies $w^{t}_{d}\geq\frac{1}{2}+\delta$. Thus $\mu\geq\mu^{t}$, and since $\mu^{t}\geq 1-\frac{2(\epsilon+\frac{1}{K})-2(\epsilon+\frac{1}{K})^{2}}{0.5-\delta}$, we get $\mu\geq 1-\frac{2(\epsilon+\frac{1}{K})-2(\epsilon+\frac{1}{K})^{2}}{0.5-\delta}$.
We replace $\epsilon$ by $\epsilon+\frac{1}{K}$ in the formula derived in the proof of Theorem 5.3 since now, for any fixed decision tree, we do not take into account points that belong to leaves with fewer than $\frac{n}{K2^{h}}$ points from the training set. For every given decision tree $T_{i}$ there are at most $\frac{n}{K}$ points $d$ from the training set such that $T_{i}$ is not $d$-good. Note that, by the union bound, the probability that at least one of the $n\mu$ points $d$ with $w^{t}_{d}\geq\frac{1}{2}+\delta$ is misclassified is at most $n\mu p_{d}\leq np_{d}$. To see how Theorem 5.8 and the part of Theorem 5.7 regarding empirical error follow, take $K=40$ and $\delta=\frac{1}{10}$. The proof of the majority voting version is very similar: we use inequality (2) (derived from Theorem 5.1), but all other details are exactly the same, so we omit it here rather than repeat the proof almost verbatim.
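The step from inequality (8) to inequality (9) is just the Laplacian tail integral $\int_{t}^{\infty}\frac{\eta}{2k}e^{-\frac{x\eta}{k}}dx=\frac{1}{2}e^{-\frac{t\eta}{k}}$ with $t=\gamma\sqrt{n}$. A numerical sanity check of this closed form (parameter values arbitrary):

```python
import math

def laplace_tail_closed_form(t, eta, k):
    """Closed form of the tail integral: int_t^inf (eta/2k) e^{-x eta/k} dx = (1/2) e^{-t eta/k}."""
    return 0.5 * math.exp(-t * eta / k)

def laplace_tail_numeric(t, eta, k, steps=200000):
    """Trapezoidal approximation of the same integral, truncated where the tail is negligible."""
    upper = t + 50 * k / eta
    h = (upper - t) / steps
    f = lambda x: (eta / (2 * k)) * math.exp(-x * eta / k)
    s = 0.5 * (f(t) + f(upper)) + sum(f(t + i * h) for i in range(1, steps))
    return s * h

t, eta, k = 2.0, 0.5, 10
closed = laplace_tail_closed_form(t, eta, k)
numeric = laplace_tail_numeric(t, eta, k)
assert abs(closed - numeric) < 1e-6
# Multiplying the integral by 4k yields the 2k e^{-gamma sqrt(n) eta / k} term of (9).
```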

Proof 8.5. We already know that $\sum_{d\in\mathcal{T}}w_{d}\geq n(e^{2}+(1-e)^{2})$, where $e$ is the average quality. Assume that $k$ random decision trees have been selected. Denote by $Y_{d}$ the indicator of the event that a fixed data point $d$ from the training set will be correctly classified. We have:
$$Y_{d}=\begin{cases}1&\text{with probability }X^{d}\\0&\text{with probability }1-X^{d},\end{cases}$$
where $X^{d}$ is the random variable defined in the proof of Theorems 5.3 and 5.4. Note that after the random decision trees have been selected, $X^{d}$ has a deterministic value. Note also that the random variables $Y_{d}$ are independent and $EY_{d}=X^{d}$. Thus, we can use Lemma 8.2 in a very similar way as in the proof of Theorems 5.3 and 5.4 to get that for any given $c>0$:
$$\mathbb{P}\Big(\sum_{d\in\mathcal{T}}(Y_{d}-X^{d})\leq -nc\Big)\leq e^{-2nc^{2}}.\qquad(11)$$
Let us now focus on the process of choosing random decision trees. Fix a parameter $\delta>0$ and some point $d$ from the training set. Using Lemma 8.2 in exactly the same way as in the proof of Theorems 5.3 and 5.4, we conclude that $\mathbb{P}(X^{d}<w_{d}-\delta)\leq e^{-2k\delta^{2}}$. Therefore, by the union bound, with probability at least $(1-ne^{-2k\delta^{2}})$ we have: $\sum_{d\in\mathcal{T}}X^{d}\geq\sum_{d\in\mathcal{T}}(w_{d}-\delta)$.
Thus, by the lower bound on $\sum_{d\in\mathcal{T}}w_{d}$ presented at the beginning of the proof, we get that with probability at least $(1-ne^{-2k\delta^{2}})$ the following holds: $\sum_{d\in\mathcal{T}}X^{d}\geq n(1-2\epsilon+2\epsilon^{2}-\delta)$, where $\epsilon=1-e$. Note that the random variables $Y_{d}$ are independent of the random variables $X^{d}$. We can conclude, using inequality (11), that with probability at least $(1-ne^{-2k\delta^{2}})(1-e^{-2nc^{2}})$ at least $n(1-2\epsilon+2\epsilon^{2}-\delta-c)$ points will be correctly classified. Taking $k=\frac{(1+C)\log(n)}{2\delta^{2}}$ completes the proof. Again, as in the previous proof, the majority voting scheme requires only minor changes, so we leave it to the reader.
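Inequality (11) can also be checked numerically: for independent indicators $Y_{d}$ with $EY_{d}=X^{d}$, the probability that $\sum_{d}(Y_{d}-X^{d})$ falls below $-nc$ should stay under $e^{-2nc^{2}}$. A minimal Monte Carlo sketch with arbitrarily chosen per-point probabilities $X^{d}$:

```python
import math
import random

def simulate(xs, c, trials=5000, seed=1):
    """Estimate P(sum_d (Y_d - X^d) <= -n c) for independent Y_d ~ Bernoulli(X^d)."""
    rng = random.Random(seed)
    n = len(xs)
    hits = 0
    for _ in range(trials):
        s = sum((1 if rng.random() < x else 0) - x for x in xs)
        if s <= -n * c:
            hits += 1
    return hits / trials

xs = [0.6, 0.7, 0.55, 0.8, 0.65] * 20   # 100 hypothetical per-point values of X^d
c = 0.1
bound = math.exp(-2 * len(xs) * c ** 2)  # the e^{-2 n c^2} bound from inequality (11)
assert simulate(xs, c) <= bound
```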

Proof. The proofs of the statements regarding empirical errors go along exactly the same lines as the presented proof of the part of Theorem~\ref{beta_setting_ndp_main} regarding empirical error. The changes in the statement, due to the added perturbation error, follow from the proofs of the bounds on the empirical error from Theorems~\ref{alfa_setting_dp_main_1} and~\ref{alfa_setting_dp_main_2}. Therefore we do not give the entire proof but only mention a few points. In comparison with the statement of Theorem~\ref{beta_setting_ndp_main}, in the expression for the upper bound on the empirical error the term $\epsilon$ is replaced by $\epsilon+\frac{1}{K}$. This is, as explained in the proof of Theorem~\ref{alfa_setting_dp_main_1} (regarding empirical error), due to the fact that while dealing with weights of edges in the graph $\mathcal{G}^{D}$ we do not take into account points from the training set corresponding to leaves with too few data points. To see how Theorem~\ref{beta_setting_dp_main_2} can be derived, take $K=40$, $\delta=\frac{1}{10}$, $c=\frac{1}{20}$. Again, as for Theorem~\ref{alfa_setting_dp_main_2}, Theorem~\ref{beta_setting_dp_main_2} follows by simple calculations.

Proof. Consider a test set of $n$ points. Whenever we refer to the weight or goodness of a test point $d$, this is with respect to the test set (see the definitions of goodness and related terms in the description of the model). Let $\phi>0$ be a small constant and denote by $E_{\phi}$ the event that for the selected forest $F$ of random decision trees the non-perturbed counts in all leaves (for each leaf we count the points with label $+$ and $-$ in that leaf) for the test set and the training set differ by at most $2\phi n$. We start by finding a lower bound on $\mathbb{P}(E_{\phi})$. Let us fix a forest, a particular tree of that forest and a particular leaf of that tree. Denote by $X_{i}$ a random variable that takes value $1$ if the $i^{th}$ point of the training set corresponds to that leaf and $0$ otherwise. Similarly, denote by $Y_{i}$ a random variable that takes value $1$ if the $i^{th}$ point of the test set corresponds to that leaf and $0$ otherwise. Denote by $X_{i}^{+}$ a random variable that takes value $1$ if the $i^{th}$ point of the training set corresponds to that leaf and has label $+$, and $0$ otherwise; define $Y_{i}^{+}$ analogously for the test set. Denote by $p_{1}$ the probability that the $i^{th}$ point of the training/test set corresponds to that leaf and by $p_{2}$ the probability that the $i^{th}$ point of the training/test set corresponds to that leaf and has label $+$. Notice that $p_{1},p_{2}$ are the same for the training and test set since we assume that both are drawn from the same distribution. Since all the random variables introduced above are independent, we can conclude using Azuma's inequality that
$$\mathbb{P}(X_{1}+\ldots+X_{n}\in[n(p_{1}-\phi),n(p_{1}+\phi)])\geq 1-2e^{-2n\phi^{2}}.$$
Similarly,
$$\mathbb{P}(Y_{1}+\ldots+Y_{n}\in[n(p_{1}-\phi),n(p_{1}+\phi)])\geq 1-2e^{-2n\phi^{2}}.$$
Therefore, by the union bound,
$$\mathbb{P}(|(X_{1}+\ldots+X_{n})-(Y_{1}+\ldots+Y_{n})|\leq 2\phi n)\geq 1-4e^{-2n\phi^{2}}.$$
By the same analysis we can show that
$$\mathbb{P}(X_{1}^{+}+\ldots+X_{n}^{+}\in[n(p_{2}-\phi),n(p_{2}+\phi)])\geq 1-2e^{-2n\phi^{2}}$$
and
$$\mathbb{P}(Y_{1}^{+}+\ldots+Y_{n}^{+}\in[n(p_{2}-\phi),n(p_{2}+\phi)])\geq 1-2e^{-2n\phi^{2}}.$$
Thus we also have
$$\mathbb{P}(|(X_{1}^{+}+\ldots+X_{n}^{+})-(Y_{1}^{+}+\ldots+Y_{n}^{+})|\leq 2\phi n)\geq 1-4e^{-2n\phi^{2}}.$$
We can conclude that the probability of the event
$$|(X_{1}+\ldots+X_{n})-(Y_{1}+\ldots+Y_{n})|\leq 2\phi n\quad\text{and}\quad|(X_{1}^{+}+\ldots+X_{n}^{+})-(Y_{1}^{+}+\ldots+Y_{n}^{+})|\leq 2\phi n$$
is at least $1-8e^{-2n\phi^{2}}$. If we now take the union bound over all $2^{h}k$ leaves of the forest, we obtain: $\mathbb{P}(E_{\phi})\geq 1-2^{h+3}ke^{-2n\phi^{2}}$. We will now consider the average weights $w_{d}$ of the test points. The analysis for majority voting uses $\sigma(d)$ and is completely analogous. Assume now that all the counts for all the leaves for the test and training set differ by at most $2\phi n$. As in the analysis of the empirical error in the differentially-private setting, let us focus on those leaves of the forest that contain at least $\frac{n}{2^{h}K}$ of the test points each, for a constant $K>0$. Take a leaf $l$ with this property. Denote by $x^{1}$ the number of test points corresponding to that leaf with label $+$ and by $x^{2}$ the number of training points corresponding to that leaf with label $+$. Denote by $y^{1}$ the number of all test points corresponding to that leaf and by $y^{2}$ the number of all training points corresponding to that leaf. We want to find an upper bound on the expression $q=|\frac{x^{1}}{y^{1}}-\frac{x^{2}}{y^{2}}|$. Simple algebra gives us: $q\leq\frac{2\phi n(x^{1}+y^{1})}{y^{1}(y^{1}-2\phi n)}$. If we now take $\zeta=\frac{2\phi}{\theta}$, where $\theta=\frac{1}{2^{h}K}$, then we get: $q\leq\frac{2\zeta}{1-\zeta}$.
Let us take $\zeta$ such that $\frac{2\zeta}{1-\zeta}\leq\frac{\delta}{2}$, where $\delta>0$ is a positive constant. Thus we want $\zeta\leq\frac{\delta}{4+\delta}$, i.e. $\phi\leq\frac{\delta\theta}{2(4+\delta)}$. Take $\phi=\frac{\delta\theta}{2(4+\delta)}$. We can conclude that with probability at least $\mathbb{P}(E_{\phi})$ the difference between the ratios of counts in leaves containing at least $n\theta$ test points for the test and training set is at most $\frac{\delta}{2}$. This in particular implies that if we consider a test point $d$ and the truncated bipartite graph $\mathcal{G}^{D}$ (but this time with respect to the test set, not the training set), then the weights of $d$ in $\mathcal{G}^{D}$ and its corresponding version for the training set differ by at most $\frac{\delta}{2}$. We are almost done. Consider first the majority voting/threshold averaging scheme. The only changes we need to introduce in the statement of Theorem~\ref{alfa_setting_ndp_main_1} for the empirical error are: subtract from $p_{1}$ the probability that $E_{\phi}$ does not hold to obtain a lower bound on $p_{2}$, add the factor $\frac{1}{K}$ to the expression on $w$ (since we are using the truncated model), and replace $\delta$ by $\frac{\delta}{2}$ in the expression for the number of random decision trees used. Similarly, in the statement of Theorem~\ref{alfa_setting_ndp_main_2} we need to replace $\epsilon$ in the expression on $err_{1}$ by $\epsilon+\frac{1}{K}$ to obtain an upper bound on $err_{2}$ (again, because we are using the truncation argument) and make the same change to the number of decision trees as above. To obtain a lower bound on $p_{2}$ it suffices to subtract the probability that $E_{\phi}$ does not hold. Let us now focus on Theorem~\ref{alfa_setting_dp_main_1}. Again we need to add the extra factor $\frac{1}{K}$ to the expression on $w$ and subtract the probability that $E_{\phi}$ does not hold to obtain a lower bound on $p_{2}$. Now let us consider the probabilistic averaging scheme, starting with the statement of Theorem~\ref{beta_setting_ndp_main}.
We make corrections similar to those mentioned earlier to get a lower bound on $p_{2}$. Besides, in the upper bound on $err_{1}$ we need to replace $\epsilon$ by $\epsilon+\frac{1}{K}$ to obtain an upper bound on $err_{2}$. In Theorem~\ref{beta_setting_dp_main_1} we need to add one extra term $\frac{1}{K}$ in the upper bound on $err_{1}$ to obtain an upper bound on $err_{2}$ and again modify $p_{1}$ in the same way as before to obtain a lower bound on $p_{2}$.

Figure 3: adult dataset. Comparison of dpRFMV, dpRFTA and dpRFPA with $\eta=1000/n_{tr}$. Test error vs. $h$ across various settings of $k$ (a) and vs. $k$ across various settings of $h$ (c); minimal, average and maximal test error vs. $h$ (b) and vs. $k$ (d) for dpRFMV.


Comparison of methods: test error (%) with the chosen number of trees $k$ and tree height $h$ per dataset ($n$ = number of points, $m$ = number of attributes).

Dataset   | n      | m   | rpart Error  | n-dpRFMV Error (k, h) | n-dpRFTA Error (k, h) | dpRFMV Error (k, h)  | dpRFTA Error (k, h)
Ban Aut   | 1372   | 5   | 3.65 ± 0.99  | 3.09 ± 0.92 (21, 15)  | 3.46 ± 0.97 (17, 9)   | 5.44 ± 1.20 (21, 11) | 5.22 ± 1.18 (7, 12)
BTSC      | 748    | 5   | 18.92 ± 2.81 | 22.19 ± 2.98 (1, 14)  | 22.47 ± 2.99 (1, 14)  | 23.42 ± 3.03 (1, 14) | 23.42 ± 3.03 (1, 13)
CVR       | 435    | 16  | 9.30 ± 2.73  | 9.05 ± 2.70 (19, 6)   | 5.95 ± 2.22 (13, 9)   | 8.10 ± 2.56 (15, 9)  | 6.90 ± 2.38 (15, 9)
Mam Mass  | 961    | 6   | 21.88 ± 2.61 | 16.95 ± 2.37 (9, 12)  | 16.21 ± 2.33 (19, 15) | 16.95 ± 2.37 (5, 12) | 17.37 ± 2.40 (9, 8)
Mushroom  | 8124   | 22  | 3.33 ± 0.39  | 0.83 ± 0.20 (21, 15)  | 0.26 ± 0.11 (13, 14)  | 4.69 ± 0.46 (3, 13)  | 4.16 ± 0.43 (3, 15)
Adult     | 32561  | 123 | 17.75 ± 0.42 | 21.70 ± 0.45 (3, 14)  | 21.58 ± 0.45 (3, 14)  | 22.18 ± 0.45 (3, 11) | 21.72 ± 0.45 (7, 11)
Covertype | 581012 | 54  | 26.90 ± 0.11 | 33.39 ± 0.12 (21, 15) | 30.80 ± 0.12 (21, 15) | 38.75 ± 0.13 (3, 13) | 37.82 ± 0.12 (3, 13)
Quantum   | 50000  | 78  | 32.08 ± 0.41 | 34.81 ± 0.42 (21, 15) | 33.06 ± 0.41 (19, 14) | 39.91 ± 0.43 (21, 13)| 39.01 ± 0.43 (13, 9)


Theorem. (See [nissim].) Let $f$ be a function on databases with range $R^{m}$, where $m$ is the number of rows of the databases (the number of rows of a database is the number of attributes of any data point in it). Then the mechanism that outputs $f(D)+(Y_{1},\ldots,Y_{m})$, where the $Y_{i}$ are drawn i.i.d. from $Lap(0,S(f)/\epsilon)$, satisfies $\epsilon$-differential-privacy.
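This Laplace mechanism is straightforward to sketch in Python. Below is a minimal illustration (not the paper's implementation): `sensitivity` stands for $S(f)$ and must be supplied by the analyst; for the per-leaf counting queries used in this paper it is $1$.

```python
import math
import random

def laplace_mechanism(true_values, sensitivity, epsilon, seed=None):
    """Release f(D) + (Y_1, ..., Y_m) with Y_i i.i.d. Laplace(0, S(f)/epsilon).

    Gives epsilon-differential privacy for a query f with L1-sensitivity S(f).
    """
    rng = random.Random(seed)
    scale = sensitivity / epsilon

    def lap():
        # random.Random has no Laplace sampler; use inverse-CDF sampling.
        u = rng.random() - 0.5
        sign = 1 if u >= 0 else -1
        return -scale * sign * math.log(1 - 2 * abs(u))

    return [v + lap() for v in true_values]

# Example: perturb the (label 0, label 1) counts of one leaf; counting queries have sensitivity 1.
noisy = laplace_mechanism([17, 5], sensitivity=1.0, epsilon=0.5, seed=42)
assert len(noisy) == 2
```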

Theorem. Assume that the average tree accuracy of the set $M$ of all decision trees of height $h$ on the training/test set $D$ is $e=1-\epsilon$ for some $0<\epsilon\leq\frac{1}{2}$. Then the average weight $w_{d}$ of a training/test point $d$ in $M$ is at least $e^{2}+(1-e)^{2}\geq\frac{1}{2}$.

Theorem. Let $K>0$. Assume that the average tree accuracy of the set $M$ of all decision trees of height $h$ on the training/test set $D$ is $e=1-\epsilon$ for some $0<\epsilon\leq\frac{1}{2}$. Let $\mu$ be: the fraction of training/test points with goodness in $M$ at least $\sigma=\frac{1}{2}+\delta$ / $\sigma=\frac{1}{2}+\delta+\frac{1}{K}$ for $0<\delta<\frac{1}{2}$ (in the majority version), or: the fraction of training/test points with weight in $M$ at least $w=\frac{1}{2}+\delta$ / $w=\frac{1}{2}+\delta+\frac{1}{K}$ for $0<\delta<\frac{1}{2}$ (in the threshold averaging version). Then Algorithm~\ref{alg:ndp} for every $C>0$ and $k=\frac{(1+C)\log(n)}{2\delta^{2}}$ selected random decision trees gives empirical error $err_{1}\leq 1-\mu$ with probability $p_{1}\geq 1-\frac{1}{n^{C}}$. The generalization error $err_{2}\leq 1-\mu$ will be achieved for $k=\frac{(1+C)\log(n)}{2(\delta/2)^{2}}$ trees with probability $p_{2}\geq p_{1}-2^{h+3}ke^{-2n\phi^{2}}$, where $\phi=\frac{\delta}{2(4+\delta)2^{h}K}$. Probabilities $p_{1}$ and $p_{2}$ are under the random coin tosses used to construct the forest and the test set.

Theorem. Let $K>0$. Assume that the average tree accuracy of the set $M$ of all decision trees of height $h$ on the training/test set $D$ is $e=1-\epsilon$ for some $0<\epsilon\leq\frac{1}{2}$. Then Algorithm~\ref{alg:ndp} for every $C>0$, $0<\delta<\frac{1}{2}$ and $k=\frac{(1+C)\log(n)}{2\delta^{2}}$ selected random decision trees gives empirical error: $err_{1}\leq\frac{\epsilon}{\frac{1}{2}-\delta}$ (in the majority version) or: $err_{1}\leq\frac{2\epsilon-2\epsilon^{2}}{0.5-\delta}$ (in the threshold averaging version) with probability $p_{1}\geq 1-\frac{1}{n^{C}}$. The generalization error: $err_{2}\leq\frac{\epsilon+\frac{1}{K}}{\frac{1}{2}-\delta}$ (in the majority version) or: $err_{2}\leq\frac{2(\epsilon+\frac{1}{K})-2(\epsilon+\frac{1}{K})^{2}}{0.5-\delta}$ (in the threshold averaging version) will be achieved for $k=\frac{(1+C)\log(n)}{2(\delta/2)^{2}}$ trees with probability $p_{2}\geq p_{1}-2^{h+3}ke^{-2n\phi^{2}}$, where $\phi=\frac{\delta}{2(4+\delta)2^{h}K}$. Probabilities $p_{1}$ and $p_{2}$ are under the random coin tosses used to construct the forest and the test set.
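For intuition about how the two empirical-error bounds of this theorem compare, here is a tiny numeric sketch (the values of $\epsilon$ and $\delta$ are arbitrary): since $2\epsilon-2\epsilon^{2}<2\epsilon$, the threshold averaging bound is always below twice the majority voting bound.

```python
def mv_bound(eps, delta):
    """Empirical error bound eps / (1/2 - delta) for the majority voting version."""
    return eps / (0.5 - delta)

def ta_bound(eps, delta):
    """Empirical error bound (2 eps - 2 eps^2) / (0.5 - delta) for the threshold averaging version."""
    return (2 * eps - 2 * eps ** 2) / (0.5 - delta)

eps, delta = 0.05, 0.1
assert mv_bound(eps, delta) < ta_bound(eps, delta) < 2 * mv_bound(eps, delta)
```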

Theorem. Let $K>0$. Assume that the average tree accuracy of the set $M$ of all decision trees of height $h$ on the training/test set $D$ is $e=1-\epsilon$ for some $0<\epsilon\leq\frac{1}{2}$. Let $C>0$ be a constant and let $0<\delta,c<1$. Then with probability at least $p_{1}=(1-\frac{1}{n^{C}})(1-e^{-2nc^{2}})$ the probabilistic averaging version of Algorithm~\ref{alg:ndp} gives empirical error $err_{1}\leq 2\epsilon-2\epsilon^{2}+\delta+c$, and with probability $p_{2}\geq p_{1}-2^{h+3}ke^{-2n\phi^{2}}$, where $\phi=\frac{\delta}{2(4+\delta)2^{h}K}$, it gives generalization error $err_{2}\leq 2(\epsilon+\frac{1}{K})-2(\epsilon+\frac{1}{K})^{2}+\delta+c$. Probabilities $p_{1},p_{2}$ are under the random coin tosses used to construct the forest and the test set.

Theorem. Assume that we are given a parameter $\eta>0$. Assume besides that the average tree accuracy of the set $M$ of all decision trees of height $h$ on the training set $D$ is $e=1-\epsilon$ for some $0<\epsilon\leq\frac{1}{2}$. Let $0<\delta<\frac{1}{2}$. Let $\gamma=\frac{1}{9600\cdot 2^{h}}$ and let $k_{opt}$ be the integer value for which the value of the function $f(k)=e^{-\frac{k}{200}}+2ke^{-\frac{\gamma\sqrt{n}\eta}{k}}$ is smallest. Then with probability at least $p=1-n(e^{-\frac{k_{opt}}{200}}+2k_{opt}e^{-\frac{\gamma\sqrt{n}\eta}{k_{opt}}}+e^{-\frac{n}{2}})$ the $\eta$-differentially-private threshold averaging version of Algorithm~\ref{alg:dp} gives empirical error at most $\frac{1}{8}+\frac{9}{2}\epsilon-5\epsilon^{2}$ for the forest with $k_{opt}$ randomly chosen decision trees. Probability $p$ is under the random coin tosses used to construct the forest.
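The optimal forest size $k_{opt}$ of this theorem can be found by direct search, since $f(k)$ is cheap to evaluate at every integer $k$. A minimal sketch (the parameter values and the search range are arbitrary choices):

```python
import math

def f(k, n, eta, h):
    """Objective from the theorem: f(k) = e^{-k/200} + 2k e^{-gamma sqrt(n) eta / k}, gamma = 1/(9600 * 2^h)."""
    gamma = 1.0 / (9600 * 2 ** h)
    return math.exp(-k / 200.0) + 2 * k * math.exp(-gamma * math.sqrt(n) * eta / k)

def k_opt(n, eta, h, k_max=10000):
    """Integer k in [1, k_max] minimizing f by exhaustive search."""
    return min(range(1, k_max + 1), key=lambda k: f(k, n, eta, h))

k = k_opt(n=10**8, eta=5.0, h=5)
assert 1 <= k <= 10000
assert f(k, 10**8, 5.0, 5) <= f(k + 1, 10**8, 5.0, 5)  # a local check that k is no worse than its neighbor
```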

Theorem. Assume that we are given a parameter $\eta>0$. Let $K,c>0$ and $0<\delta<1$. Assume that the average tree accuracy of the set $M$ of all decision trees of height $h$ on the training/test set $D$ is $e=1-\epsilon$ for some $0<\epsilon\leq\frac{1}{2}$. Let $\lambda=\frac{\delta}{24K\cdot 2^{h}}$. Then for $k$ selected random decision trees the $\eta$-differentially-private probabilistic averaging version of Algorithm~\ref{alg:dp} gives empirical error $err_{1}\leq 2(\epsilon+\frac{1}{K})-2(\epsilon+\frac{1}{K})^{2}+\delta+c$ with probability $p_{1}\geq(1-n(e^{-\frac{k\delta^{2}}{2}}+e^{-\frac{k}{2}}+ke^{-\frac{\lambda\sqrt{n}\eta}{k}}))(1-e^{-2nc^{2}})$ and generalization error $err_{2}\leq 2(\epsilon+\frac{2}{K})-2(\epsilon+\frac{2}{K})^{2}+\delta+c$ with probability $p_{2}\geq p_{1}-2^{h+3}ke^{-2n\phi^{2}}$, where $\phi=\frac{\delta}{2(4+\delta)2^{h}K}$. Probabilities $p_{1}$ and $p_{2}$ are under the random coin tosses used to construct the forest and the test set.


Definition. (See~[dwork].) A randomized algorithm $K$ gives $\epsilon$-differential privacy if for all datasets $D_1$ and $D_2$ differing on at most one element, and all $S \subseteq Range(K)$,
$$P(K(D_{1}) \in S) \le \exp(\epsilon) \cdot P(K(D_{2}) \in S).$$
The probability is taken over the coin tosses of $K$.
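For intuition, the standard Laplace mechanism meets this definition for counting queries: a count has sensitivity $1$, so adding Laplace noise of scale $1/\epsilon$ keeps the output densities on neighbouring datasets within a factor $e^{\epsilon}$ of each other. A minimal numerical check (the counts $41$ and $42$ are illustrative assumptions):

```python
import math

def laplace_pdf(x, mu, b):
    """Density of the Laplace distribution with mean mu and scale b."""
    return math.exp(-abs(x - mu) / b) / (2.0 * b)

epsilon = 0.5
b = 1.0 / epsilon        # noise scale for a sensitivity-1 count query
c1, c2 = 41, 42          # counts on two neighbouring datasets (differ by 1)
# The ratio of output densities stays within [e^-epsilon, e^epsilon] everywhere.
max_ratio = max(laplace_pdf(x, c1, b) / laplace_pdf(x, c2, b)
                for x in range(-100, 200))
```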

Proof. We start with the proof of Theorem~\ref{technicaltheorem}. Note that from the definition of $w_{d}$ we get:
$$\sum_{d \in T} w_{d} = \frac{1}{|M|}\sum_{i=1}^{|M|}\sum_{j=1}^{t}\left(m^{i}_{j}e^{i}_{j}+(n^{i}_{j}-m^{i}_{j})(1-e^{i}_{j})\right).$$
Therefore, using the formula for $m^{i}_{j}$, we get:
$$\sum_{d \in T} w_{d} = \frac{1}{|M|}\sum_{i=1}^{|M|}\sum_{j=1}^{t}\left(n^{i}_{j}(e^{i}_{j})^{2}+n^{i}_{j}(1-e^{i}_{j})^{2}\right).$$
Note that we have $\sum_{i=1}^{|M|}\sum_{j=1}^{t} n^{i}_{j} = n|M|$. From Jensen's inequality, applied to the function $f(x)=x^{2}$, we get:
$$\sum_{i=1}^{|M|}\sum_{j=1}^{t} \frac{n^{i}_{j}}{|M|n}(e^{i}_{j})^{2} \geq \left(\sum_{i=1}^{|M|}\sum_{j=1}^{t} \frac{n^{i}_{j}e^{i}_{j}}{|M|n}\right)^{2}=\left(\frac{\sum_{i=1}^{|M|}\sum_{j=1}^{t}m^{i}_{j}}{|M|n}\right)^{2}=\left(\frac{en|M|}{n|M|}\right)^{2}=e^{2},$$
where $e$ is the average quality of the system of all complete decision trees of height $h$ (the average tree accuracy). Similarly, $\sum_{i=1}^{|M|}\sum_{j=1}^{t} \frac{n^{i}_{j}}{|M|n}(1-e^{i}_{j})^{2} \geq (1-e)^{2}$. Thus we get:
$$\sum_{d \in T} w_{d} \geq n(e^{2}+(1-e)^{2}).$$
That completes the proof of Theorem~\ref{technicaltheorem}. The proof of Theorem~\ref{technicaltheorem_gamma} is even simpler. Notice that for any data point $d$ the expression $\sigma(d) \cdot |M|$ counts the number of decision trees from $M$ that classified $d$ correctly (this follows directly from the definition of $\sigma$). Thus we have $\sum_{d \in T} \sigma(d) \cdot |M| = \sum_{i=1}^{|M|}m^{i}$. Therefore $\frac{1}{n}\sum_{d \in T} \sigma(d)= \frac{1}{|M|}\sum_{i=1}^{|M|} e^{i}$ and we are done.

Proof. Again, we start with the analysis of threshold averaging. Take the $i^{th}$ random decision tree $T^{R}_{i}$, where $i \in \{1,2,\ldots,k\}$. For a given data point $d$ from the training set let $X^{d}_{i}$ be a random variable defined as follows. If $d$ does not belong to any leaf of $T^{R}_{i}$ then let $X^{d}_{i}=0$. Otherwise let $a^{R}_{i}$ be the number of points from the training set with label $0$ in that leaf and let $b^{R}_{i}$ be the number of points from the training set with label $1$ in that leaf. If $d$ has label $0$ then we take $X^{d}_{i}=\frac{a^{R}_{i}}{a^{R}_{i}+b^{R}_{i}}$; otherwise we take $X^{d}_{i}=\frac{b^{R}_{i}}{a^{R}_{i}+b^{R}_{i}}$. Denote $X^{d}=\frac{X^{d}_{1}+\ldots+X^{d}_{k}}{k}$. When it is clear from the context which data point we refer to, we skip the upper index and simply write $X$ or $X_{i}$, respectively.\\
Fix some point $d$ from the training set. Note that if $X > \frac{1}{2}$ then point $d$ is correctly classified. Notice that the weight $w_{d}$ of the point $d$ is nothing else but the sum of the weights of all the edges of $G$ incident to $d$ divided by the number of all trees (or the average weight of an edge incident to $d$ if we consider real-valued attributes). Note that we have $EX = w_{d}$ and that from Theorem~\ref{technicaltheorem} we get:
$$\sum_{d \in T} w_{d} \geq n(e^{2}+(1-e)^{2}).$$
Take $0 < \delta < \frac{1}{2}$. Denote by $\mu$ the fraction of points $d$ from the training data such that $w_{d} \geq \frac{1}{2}+\delta$. From the lower bound on $\sum_{d \in T} w_{d}$ we have just derived, we get $(\frac{1}{2}+\delta)(1-\mu)n+\mu n \geq n(e^{2}+(1-e)^{2})$, which gives us:
$$\mu \geq 1 - \frac{2\epsilon-2\epsilon^{2}}{0.5-\delta},$$
where $\epsilon = 1-e$.\\
Take a point $d$ from the training set such that $w_{d} \geq \frac{1}{2}+\delta$. Denote by $p_{d}$ the probability that $d$ is misclassified. We have:
$$p_{d} \leq P\left(\frac{X_{1}+\ldots+X_{k}}{k} \leq w_{d} - \delta\right).$$
Denote $Z_{i}=X_{i}-w_{d}$ for $i=1,2,\ldots,k$.
We have:
$$p_{d} \leq P(Z_{1}+\ldots+Z_{k} \leq -k\delta).$$
Note that, since $w_{d}=EX_{i}$ and the random variables $X_{i}$ are independent, we can conclude that $\{Z_{1},Z_{1}+Z_{2},\ldots,Z_{1}+Z_{2}+\ldots+Z_{k}\}$ is a martingale. Note also that $-\alpha_{i} \leq Z_{i} \leq \beta_{i}$ for some $\alpha_{i},\beta_{i}>0$ such that $\alpha_{i}+\beta_{i}=1$. Using Lemma~\ref{Azuma}, we get:
$$P(Z_{1}+\ldots+Z_{k} \leq -k\delta) \leq e^{-\frac{2(k\delta)^{2}}{k}}.$$
Therefore the probability that at least one of the $\mu n$ points $d$ for which $w_{d} \geq \frac{1}{2}+\delta$ will be misclassified by the set of $k$ random decision trees is, by the union bound, at most $\mu n e^{-2k\delta^{2}} \leq n e^{-2k\delta^{2}}$. That, for $k=\frac{(1+C)\log(n)}{2\delta^{2}}$, completes the proof of the upper bound on the empirical error from Theorems~\ref{alfa_setting_ndp_main_1} and~\ref{alfa_setting_ndp_main_2}, since we have already proved that $\mu \geq 1 - \frac{2\epsilon-2\epsilon^{2}}{0.5-\delta}$. The proof of the majority voting version goes along exactly the same lines. This time, instead of Theorem~\ref{technicaltheorem}, we use Theorem~\ref{technicaltheorem_gamma}. We know that $\sum_{d \in T} \sigma(d) \geq ne$, where $e = 1 - \epsilon$. Denote by $\mu^{x}$ the fraction of points $d$ with $\sigma(d) \geq \frac{1}{2} + x$ for $0<x<\frac{1}{2}$. Then, by an argument similar to the one presented above, we have:
\begin{equation}\label{majority_ineq}
\mu^{x} \geq 1 - \frac{\epsilon}{0.5 - x}.
\end{equation}
All other details of the proof for majority voting are exactly the same as for the threshold averaging scheme.
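The forest size used in this proof can be made concrete: with $k=\lceil\frac{(1+C)\log n}{2\delta^{2}}\rceil$ trees (natural logarithm assumed), the union-bound failure probability $ne^{-2k\delta^{2}}$ is at most $n^{-C}$, i.e. logarithmically many trees suffice. A small helper illustrating this (the parameter values are illustrative):

```python
import math

def trees_needed(n, delta, C):
    """Smallest integer k with k >= (1 + C) * ln(n) / (2 * delta**2); then
    the union-bound failure probability n * exp(-2*k*delta**2) <= n**(-C)."""
    return math.ceil((1.0 + C) * math.log(n) / (2.0 * delta ** 2))

k = trees_needed(10000, 0.1, 1.0)            # logarithmic in the dataset size
failure = 10000 * math.exp(-2 * k * 0.1 ** 2)  # at most 10000 ** -1.0
```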

Proof. Let $K>0$ be a constant. We first consider the threshold averaging scheme. Take a decision tree $T_{i}$. Denote by $S_{T_{i}}$ the set of points $d$ from the training set with the following property: in $T_{i}$, point $d$ belongs to a leaf that contains at least $\frac{n}{K2^{h}}$ points. Note that since each $T_{i}$ has exactly $2^{h}$ leaves, we can conclude that $|S_{T_{i}}| \geq n(1-\frac{1}{K})$. In this proof and in the proofs of Theorems~\ref{beta_setting_dp_main_1} and~\ref{beta_setting_dp_main_2} (presented in the next section) we will consider the graph $G^{D}$ obtained from $G$ by deleting edges adjacent to those vertices of the color class $B$ that correspond to leaves containing fewer than $\frac{n}{K2^{h}}$ points from the training set. Take a point $d$ from the training set with $w^{t}_{d} \geq \frac{1}{2}+\delta$, where $w^{t}_{d}$ is the average weight of an edge incident to $d$ in $G^{D}$. Notice that $w_{d} \geq \frac{1}{2}+\delta+\frac{1}{K}$ implies $w^{t}_{d} \geq \frac{1}{2}+\delta$. We say that a decision tree $T_{i}$ is $d$-good if the leaf of $T_{i}$ to which $d$ belongs contains at least $\frac{n}{K2^{h}}$ points from the training set. Let us now define $X^{d}_{i}$. If the $i^{th}$ chosen random decision tree is $d$-good then $X^{d}_{i}$ is defined as in the proof of Theorem~\ref{alfa_setting_ndp_main_1}; otherwise we put $X^{d}_{i}=0$. Denote $Z_{i}=X^{d}_{i}-w^{t}_{d}$. Note that the probability $p_{d}$ that point $d$ is misclassified by the selected random decision trees satisfies $p_{d} \leq P(\frac{Z_{1}+\ldots+Z_{k}}{k}+\frac{\sum_{j \in I} R_{j}}{|I|} \leq -\delta)$, where $I$ is the set of indices corresponding to those chosen random decision trees that are $d$-good and the random variables $R_{j}$ are correction terms for $d$-good random decision trees that must be introduced in order to take into account the added Laplacians (if $I = \emptyset$ then we assume that the value of the expression $\frac{\sum_{j \in I} R_{j}}{|I|}$ is $0$). Note also that $\{R_{j}, Z_{j}: j=1,2,\ldots,k\}$ is a set of independent random variables.
We get:
$$p_{d} \leq P\left(\frac{Z_{1}+\ldots+Z_{k}}{k} \leq -\frac{\delta}{2}\right)+P\left(\frac{\sum_{j \in I} R_{j}}{|I|} \leq -\frac{\delta}{2}\right).$$
Since from Azuma's inequality we get $P(\frac{Z_{1}+\ldots+Z_{k}}{k} \leq -\frac{\delta}{2}) \leq e^{-\frac{k\delta^{2}}{2}}$, we have:
\begin{equation}
p_{d} \leq e^{-\frac{k\delta^{2}}{2}} + P\left(\frac{\sum_{j \in I} R_{j}}{|I|} \leq -\frac{\delta}{2}\right).
\end{equation}
We will now estimate the expression $p_{r}=P(\frac{\sum_{j \in I} R_{j}}{|I|} \leq -\frac{\delta}{2})$. For $i \in I$ denote by $A_{i}$ the event that each of the two perturbation errors added to the leaf containing point $d$ was of magnitude at most $\frac{\sqrt{n}}{K2^{h}}\delta_{1}$, where $\delta_{1}=\frac{\delta}{24}$. Denote $A=\bigcap_{i \in I} A_{i}$ and denote by $A^{c}$ the complement of $A$. We have: $P(\frac{\sum_{j \in I} R_{j}}{|I|} \leq -\frac{\delta}{2}) = P(\frac{\sum_{j \in I} R_{j}}{|I|} \leq -\frac{\delta}{2}\,|\,A)P(A)+P(\frac{\sum_{j \in I} R_{j}}{|I|} \leq -\frac{\delta}{2}\,|\,A^{c})(1-P(A))$. Thus we get:
\begin{equation}
p_{r} \leq P\left(\frac{\sum_{j \in I} R_{j}}{|I|} \leq -\frac{\delta}{2}\,\Big|\,A\right) + (1-P(A)).
\end{equation}
Now take one of the chosen random decision trees $T_{i}$ with $i \in I$. Take its leaf that contains the given point $d$ from the training set. Assume that this leaf contains $r$ points from the training set with some fixed label $l \in \{-,+\}$ and that it altogether contains $n_{a}$ points. Note that from the definition of $I$ we have $n_{a} \geq \frac{n}{K2^{h}}$. Let $g_{1},g_{2}$ be two independent Laplacian random variables, each with density function $\frac{\eta}{2k}e^{-\frac{|x|\eta}{k}}$. We would like to estimate the random variable $\Theta = \frac{r+g_{1}}{n_{a}+g_{1}+g_{2}}-\frac{r}{n_{a}}$ on the event $A$. Note that in particular we know that $|g_{1}|,|g_{2}| \leq \frac{\delta_{1}n_{a}}{\sqrt{n}}$. A simple calculation gives us:
\begin{equation}\label{inequality3}
|\Theta| \leq \frac{\delta}{4\sqrt{n}}.
\end{equation}
Now consider the truncated probability space $\Omega\,|\,A$ and the truncated random variables $R^{t}_{i} = R_{i}\,|\,A$ for $i \in I$. We have: $P(\sum_{i \in I} R_{i} \leq -\frac{|I|\delta}{2}\,|\,A) = P(\sum_{i \in I} R^{t}_{i} \leq -\frac{|I|\delta}{2})$.
Using inequality~(\ref{inequality3}), we get:
\begin{equation}
|R^{t}_{i}| \leq \frac{\delta}{4\sqrt{n}}, \quad E|R^{t}_{i}| \leq \frac{\delta}{4\sqrt{n}}.
\end{equation}
Thus we can use Azuma's inequality once more, this time to find an upper bound on the expression $P(\sum_{i \in I} R^{t}_{i} \leq -\frac{|I|\delta}{2})$ (we assume here that the random decision trees have been selected, thus $I$ is given). Without loss of generality we can assume that $I \neq \emptyset$. We have: $P(\sum_{i \in I} R^{t}_{i} \leq -\frac{|I|\delta}{2}) = P(\sum_{i \in I} (R^{t}_{i}-ER^{t}_{i}) \leq -\frac{|I|\delta}{2}-\sum_{i \in I} ER^{t}_{i}) \leq P(\sum_{i \in I} (R^{t}_{i}-ER^{t}_{i}) \leq -\frac{|I|\delta}{4}) \leq e^{-\frac{2|I|(\delta/4)^{2}}{(\delta/(4\sqrt{n})+\delta/(4\sqrt{n}))^{2}}}$. Therefore we get:
\begin{equation}
p_{r} \leq e^{-\frac{n}{2}} + (1-P(A)).
\end{equation}
It remains to bound the expression $1-P(A)$. Let $g$ be a Laplacian random variable with density function $\frac{\eta}{2k}e^{-\frac{|x|\eta}{k}}$. Note that from the union bound we get $1-P(A) \leq 2k P(|g| > \frac{\sqrt{n}\delta}{24K2^{h}})$, where the factor $2$ in the expression $2k P(|g| > \frac{\sqrt{n}\delta}{24K2^{h}})$ comes from the fact that for a given data point $d$ we need to add a perturbation error in two places in the leaf of the chosen random decision tree corresponding to $d$.\\
Denote $\gamma = \frac{\delta}{24K2^{h}}$. We have:
\begin{equation}
p_{r} \leq e^{-\frac{n}{2}} + 4k \int\limits_{\gamma \sqrt{n}}^{\infty} \frac{\eta}{2k}e^{-\frac{x\eta}{k}}dx.
\end{equation}
Evaluation of the right-hand side gives us:
\begin{equation}
p_{r} \leq e^{-\frac{n}{2}} + 2ke^{-\frac{\gamma \sqrt{n} \eta}{k}}.
\end{equation}
Thus we can conclude that the probability $p_{d}$ that the fixed point $d$ from the training set will be misclassified by the set of $k$ randomly chosen random decision trees satisfies:
\begin{equation}
p_{d} \leq e^{-\frac{k\delta^{2}}{2}} + e^{-\frac{n}{2}} + 2ke^{-\frac{\gamma \sqrt{n} \eta}{k}}.
\end{equation}
Note that by an argument similar to the one presented in the proofs of Theorems~\ref{alfa_setting_ndp_main_1} and~\ref{alfa_setting_ndp_main_2}, we can conclude that at least $n\left(1-\frac{2(\epsilon+\frac{1}{K})-2(\epsilon+\frac{1}{K})^{2}}{0.5-\delta}\right)$ points $d$ from the training data satisfy $w^{t}_{d} \geq \frac{1}{2}+\delta$. Let $\mu^{t}$ be the fraction of points with this property. As we observed earlier, if a point $d$ satisfies $w_{d} \geq \frac{1}{2}+\delta+\frac{1}{K}$ then it also satisfies $w^{t}_{d} \geq \frac{1}{2}+\delta$. Thus $\mu \geq \mu^{t}$. We also have $\mu^{t} \geq 1-\frac{2(\epsilon+\frac{1}{K})-2(\epsilon+\frac{1}{K})^{2}}{0.5-\delta}$, and therefore $\mu \geq 1-\frac{2(\epsilon+\frac{1}{K})-2(\epsilon+\frac{1}{K})^{2}}{0.5-\delta}$. We replace $\epsilon$ by $\epsilon+\frac{1}{K}$ in the formula derived in the proof of Theorem~\ref{alfa_setting_ndp_main_1} since now, for any fixed decision tree, we do not take into account points that belong to leaves with fewer than $\frac{n}{K2^{h}}$ points from the training set. For every given decision tree $T_{i}$ there are at most $\frac{n}{K}$ points $d$ from the training set such that $T_{i}$ is not $d$-good. Note that, by the union bound, the probability that at least one of the $n\mu$ points $d$ with $w^{t}_{d} \geq \frac{1}{2}+\delta$ is misclassified is at most $n \mu p_{d} \leq n p_{d}$. To see how Theorem~\ref{alfa_setting_dp_main_2} and the part of Theorem~\ref{alfa_setting_dp_main_1} regarding empirical error now follow, take $K=40$ and $\delta=\frac{1}{10}$. The proof of the majority voting version is very similar: we use inequality~(\ref{majority_ineq}) (derived from Theorem~\ref{technicaltheorem_gamma}), but all other details are exactly the same, so we do not repeat them here.

Proof. We already know that $\sum_{d \in T} w_{d} \geq n(e^{2}+(1-e)^{2})$, where $e$ is the average quality. Assume that $k$ random decision trees have been selected. Denote by $Y_{d}$ the indicator of the event that a fixed data point $d$ from the training set will be correctly classified. We have:
$$Y_{d} = \begin{cases} 1 & \text{with probability } X^{d}, \\ 0 & \text{with probability } 1-X^{d}, \end{cases}$$
where $X^{d}$ is the random variable defined in the proofs of Theorems~\ref{alfa_setting_ndp_main_1} and~\ref{alfa_setting_ndp_main_2}. Note that after the random decision trees have been selected, $X^{d}$ has a deterministic value. Note also that the random variables $Y_{d}$ are independent and $EY_{d}=X^{d}$. Thus, we can use Lemma~\ref{Azuma} in a very similar way as in the proofs of Theorems~\ref{alfa_setting_ndp_main_1} and~\ref{alfa_setting_ndp_main_2} to get that for any given $c>0$:
\begin{equation}\label{inequality0}
P\left(\sum_{d \in T} (Y_{d}-X^{d}) \leq -nc\right) \leq e^{-2nc^{2}}.
\end{equation}
Let us now focus on the process of choosing random decision trees. Fix a parameter $\delta>0$ and some point $d$ from the training set. Using Lemma~\ref{Azuma} in exactly the same way as in the proofs of Theorems~\ref{alfa_setting_ndp_main_1} and~\ref{alfa_setting_ndp_main_2}, we conclude that $P(X^{d} < w_{d} - \delta) \leq e^{-2k\delta^{2}}$. Therefore, by the union bound, with probability at least $1-ne^{-2k\delta^{2}}$ we have $\sum_{d \in T} X^{d} \geq \sum_{d \in T} (w_{d} - \delta)$. Thus, according to the lower bound for $\sum_{d \in T} w_{d}$ presented at the beginning of the proof, we get that with probability at least $1-ne^{-2k\delta^{2}}$ the following holds: $\sum_{d \in T} X^{d} \geq n(1-2\epsilon+2\epsilon^{2}-\delta)$, where $\epsilon = 1 - e$. Note that the random variables $Y_{d}$ are independent of the random variables $X^{d}$. We can conclude, using inequality~(\ref{inequality0}), that with probability at least $(1-ne^{-2k\delta^{2}})(1-e^{-2nc^{2}})$ at least $n(1-2\epsilon+2\epsilon^{2}-\delta-c)$ points will be correctly classified.
Now we can take $k=\frac{(1+C)\log(n)}{2\delta^{2}}$, and that completes the proof. Again, as in the previous proof, the majority voting scheme requires only minor changes to the argument presented above, so we leave it to the reader.

\label{alg:non-diffRDT} % Give a unique label
\setlength\tabcolsep{1pt}
\begin{tabular}{ll}
\textbf{Input:} $Train$, $Test$: train and test sets,\\
\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:$h$: height of the tree\\
\hline
\textbf{Random forest construction:}\\
$\:\:\:\:\:\:$construct $k=\Theta(\log(n))$ random decision trees by\\
$\:\:\:\:\:\:$choosing for each inner node of the tree\\
$\:\:\:\:\:\:$independently at random its attribute (uniformly\\
$\:\:\:\:\:\:$from the set of all the attributes);\\
\\
$\:\:\:\:\:\:$in the continuous case for each chosen attribute\\
$\:\:\:\:\:\:$$attr$ choose independently at random a threshold\\
$\:\:\:\:\:\:$value uniformly from $[\min(attr),\max(attr)]$\\
\textbf{Training:}\\
$\:\:\:\:\:\:$\textbf{For} $d \in Train\:\:\:\{$\\
$\:\:\:\:\:\:\:\:\:\:$add $d$ to the forest by updating $\theta_{l}$ for every leaf\\
$\:\:\:\:\:\:\:\:\:\:$corresponding to $d$\:\:\:\}\\
\textbf{Testing:}\\
$\:\:\:\:\:\:$\textbf{For} $d \in Test\:\:\:\{$\\
$\:\:\:\:\:\:\:\:\:\:\:$\textbf{if} ($\textsf{majority voting})\:\:\:\{$\\
$\:\:\:\:\:\:\:\:\:\:\:\:\:\:\: $compute $num^{d}$ - the number of the trees\\
$\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:$ classifying $d$ as $+$;$ \:\:\:$\\
$\:\:\:\:\:\:\:\:\:\:\:\:\:\:\: $classify $d$ as $+$ iff $num^{d} > \frac{k}{2}\:\:\:\}$\\
$\:\:\:\:\:\:\:\:\:\:\:$\textbf{if} ($\textsf{threshold averaging})\:\:\:\{$\\
$\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:$compute $\theta^{d} = \frac{1}{k}\sum_{l \in \mathcal{L}} \theta_{l}$, where $\mathcal{L}$ is a set of all\\
$\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:$leaves of the forest that correspond to $d$;\\
$\:\:\:\:\:\:\:\:\:\:\:\:\:\:\: $classify $d$ as $+$ iff $\theta^{d} > \frac{1}{2}\:\:\:\}$\\
$\:\:\:\:\:\:\:\:\:\:\:$\textbf{if} ($\textsf{probabilistic averaging})\:\:\:\{$\\
$\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:$compute $\theta^{d} = \frac{1}{k}\sum_{l \in \mathcal{L}} \theta_{l}$, where $\mathcal{L}$ is a set of all\\
$\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:$leaves of the forest that correspond to $d$;\\
$\:\:\:\:\:\:\:\:\:\:\:\:\:\:\: $classify $d$ as $+$ with probability $\theta^{d} \:$\\
$\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:$\small{/*random tosses here are done independently}\\
$\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:$\small{from all other previously conducted*/}$\:\:\:\}\:\}$\\
\hline
\textbf{Output:} Classification of all $d \in Test$
\end{tabular}
\caption{\textsf{Non-differentially-private RDT classifier}}
\label{alg:ndp}
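The pseudocode above can be sketched in Python. This is an illustrative simplification, not the authors' implementation: attributes are binary, each tree picks one random attribute per level rather than per inner node, $\theta_{l}$ is the fraction of $+$ points in leaf $l$, and empty leaves default to $\theta_{l}=\frac{1}{2}$:

```python
import random
from collections import defaultdict

class RandomDecisionTree:
    """Completely random decision tree over binary attributes: each of the
    h levels tests one attribute chosen uniformly at random."""
    def __init__(self, num_attrs, h, rng):
        self.attrs = [rng.randrange(num_attrs) for _ in range(h)]
        self.pos = defaultdict(int)  # n_l^+ per leaf
        self.neg = defaultdict(int)  # n_l^- per leaf

    def leaf(self, x):
        # the leaf is identified by the attribute values along the path
        return tuple(x[a] for a in self.attrs)

    def add(self, x, label):
        (self.pos if label == 1 else self.neg)[self.leaf(x)] += 1

    def theta(self, x):
        l = self.leaf(x)
        p, m = self.pos[l], self.neg[l]
        return p / (p + m) if p + m > 0 else 0.5

def classify(trees, x, scheme="majority"):
    thetas = [t.theta(x) for t in trees]
    if scheme == "majority":            # majority voting
        return 1 if sum(th > 0.5 for th in thetas) > len(trees) / 2 else 0
    avg = sum(thetas) / len(thetas)
    if scheme == "threshold":           # threshold averaging
        return 1 if avg > 0.5 else 0
    return 1 if random.random() < avg else 0  # probabilistic averaging
```

Note that training only updates per-leaf counters, which is what gives the three schemes their minimal memory requirements.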
\label{alg:diffRDT} % Give a unique label
\setlength\tabcolsep{1pt}
\begin{tabular}{ll}
\textbf{Input:} $Train$, $Test$: train and test sets,\\
\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:$h$: height of the tree, $\eta$: privacy parameter\\
\hline
\textbf{Random forest construction:}$\:\:\:$as in Algorithm~\ref{alg:ndp}\\

\textbf{Training:}\\
$\:\:\:\:\:\:$\textbf{For} $d \in Train\:\:\:\{$\\
$\:\:\:\:\:\:\:\:\:\:\:\:$find the leaf $l$ for $d$ in every tree and\\
$\:\:\:\:\:\:\:\:\:\:\:\:$update $n_{l}^{+}$, $n_{l}^{-}$, where:\\
$\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:$$n_{l}^{+}$ - the number of training points with\\
$\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:$label $+$ belonging to that leaf;\\
$\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:$$n_{l}^{-}$ - the number of training points with\\
$\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:$label $-$ belonging to that leaf$\:\:\:\}$\\
$\:\:\:\:\:\:$\textbf{For every leaf} $\bm{l}\:\:\:\{$\\
$\:\:\:\:\:\:\:\:\:\:\:\:$calculate $n_{l}^{p,+} \!\!\!=\! n_{l}^{+} + g(\frac{\eta}{k})$ and $n_{l}^{p,-} \!\!\!=\! n_{l}^{-} + g(\frac{\eta}{k})$ \\
$\:\:\:\:\:\:\:\:\:\:\:\:$\textbf{if} ($n_{l}^{p,+} \!\!\!<\! 0$ or $n_{l}^{p,-} \!\!\!<\! 0$ or ($n_{l}^{p,+} \!\!\!=\! 0$ and $n_{l}^{p,-} \!\!\!=\! 0$))\\
$\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:$choose $\theta^{p}_{l}$ uniformly at random from $[0,1];\:\:\:$\\
$\:\:\:\:\:\:\:\:\:\:\:\:$\textbf{else}\:\:\:\ let $\:\theta^{p}_{l} = \frac{n_{l}^{p,+}}{n_{l}^{p,+}+n_{l}^{p,-}};$\\
$\:\:\:\:\:\:\:\:\:\:\:\:$publish $\theta^{p}_{l}$ for every leaf$\:\:\:\}$\\
\textbf{Testing:} as in Algorithm~\ref{alg:ndp} but replace $\theta_{l}$ with $\theta^{p}_{l}$\\
\hline
\textbf{Output:} Classification of all $d \in Test$
\end{tabular}
\caption{\textsf{$\eta$-Differentially-private RDT classifier}}
\label{alg:dp}
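The leaf perturbation step above can be sketched as follows. This is a simplified illustration: `laplace` samples noise with the density $\frac{\eta}{2k}e^{-|x|\eta/k}$ used in the analysis (Laplace scale $k/\eta$), and the fallback mirrors the uniform draw in the pseudocode:

```python
import math
import random

def laplace(scale, rng):
    """Sample Laplace(0, scale) noise by inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_theta(n_pos, n_neg, eta, k, rng):
    """Perturbed leaf fraction theta_l^p: add Laplacian noise g(eta/k) to
    each leaf count; fall back to a uniform value when the noisy counts
    are unusable."""
    scale = k / eta                      # Laplace noise with parameter eta/k
    p = n_pos + laplace(scale, rng)
    m = n_neg + laplace(scale, rng)
    if p < 0 or m < 0 or (p == 0 and m == 0):
        return rng.random()              # uniform from [0, 1]
    return p / (p + m)
```

Splitting the budget as $\eta/k$ per tree is what lets the whole forest of $k$ trees satisfy $\eta$-differential privacy by composition.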
\setlength\tabcolsep{2pt}
\begin{tabular}{l|rr|c|ccc|ccc|ccc|ccc}
\hline
 & & & \multicolumn{13}{c}{Method}\\
\hline
Dataset & $n$ & $m$ & rpart & \multicolumn{3}{c|}{n-dpRF$_{MV}$} & \multicolumn{3}{c|}{n-dpRF$_{TA}$} & \multicolumn{3}{c|}{dpRF$_{MV}$} & \multicolumn{3}{c}{dpRF$_{TA}$}\\
 & & & Error & Error & $k$ & $h$ & Error & $k$ & $h$ & Error & $k$ & $h$ & Error & $k$ & $h$\\
\hline
Ban Aut & 1372 & 5 & $3.65 \pm 0.99$ & $3.09 \pm 0.92$ & 21 & 15 & $3.46 \pm 0.97$ & 17 & 9 & $5.44 \pm 1.20$ & 21 & 11 & $5.22 \pm 1.18$ & 7 & 12\\
BTSC & 748 & 5 & $18.92 \pm 2.81$ & $22.19 \pm 2.98$ & 11 & 4 & $22.47 \pm 2.99$ & 11 & 4 & $23.42 \pm 3.03$ & 11 & 4 & $23.42 \pm 3.03$ & 11 & 3\\
CVR & 435 & 16 & $9.30 \pm 2.73$ & $9.05 \pm 2.70$ & 19 & 6 & $5.95 \pm 2.22$ & 13 & 9 & $8.10 \pm 2.56$ & 15 & 9 & $6.90 \pm 2.38$ & 15 & 9\\
Mam Mass & 961 & 6 & $21.88 \pm 2.61$ & $16.95 \pm 2.37$ & 9 & 12 & $16.21 \pm 2.33$ & 19 & 15 & $16.95 \pm 2.37$ & 5 & 12 & $17.37 \pm 2.40$ & 9 & 8\\
Mushroom & 8124 & 22 & $3.33 \pm 0.39$ & $0.83 \pm 0.20$ & 21 & 15 & $0.26 \pm 0.11$ & 13 & 14 & $4.69 \pm 0.46$ & 3 & 13 & $4.16 \pm 0.43$ & 3 & 15\\
Adult & 32561 & 123 & $17.75 \pm 0.42$ & $21.70 \pm 0.45$ & 3 & 14 & $21.58 \pm 0.45$ & 3 & 14 & $22.18 \pm 0.45$ & 3 & 11 & $21.72 \pm 0.45$ & 7 & 11\\
Covertype & 581012 & 54 & $26.90 \pm 0.11$ & $33.39 \pm 0.12$ & 21 & 15 & $30.80 \pm 0.12$ & 21 & 15 & $38.75 \pm 0.13$ & 3 & 13 & $37.82 \pm 0.12$ & 3 & 13\\
Quantum & 50000 & 78 & $32.08 \pm 0.41$ & $34.81 \pm 0.42$ & 21 & 15 & $33.06 \pm 0.41$ & 19 & 14 & $39.91 \pm 0.43$ & 21 & 13 & $39.01 \pm 0.43$ & 13 & 9\\
\hline
\end{tabular}


References

[srikant] R.~Agrawal and R.~Srikant. \newblock Privacy-preserving data mining. \newblock In {\em ACM SIGMOD}, 2000.

[du] W.~Du and J.~Zhan. \newblock Using randomized response techniques for privacy-preserving data mining. \newblock In {\em KDD}, 2003.

[choro2] K.~Choromanski, T.~Jebara, and K.~Tang. \newblock Adaptive anonymity via {\it b}-matching. \newblock In {\em NIPS}, 2013.

[chaudhuri] K.~Chaudhuri and C.~Monteleoni. \newblock Privacy-preserving logistic regression. \newblock In {\em NIPS}, 2008.

[geetha2] G.~Jagannathan and R.~N. Wright. \newblock Privacy-preserving distributed k-means clustering over arbitrarily partitioned data. \newblock In {\em KDD}, 2005.

[nissim] C.~Dwork, F.~McSherry, K.~Nissim, and A.~Smith. \newblock Calibrating noise to sensitivity in private data analysis. \newblock In {\em TCC}, 2006.

[breiman] L.~Breiman. \newblock Random forests. \newblock {\em Machine Learning}, 45:5-32, 2001.

[Biau:2008:CRF:1390681.1442799] G.~Biau, L.~Devroye, and G.~Lugosi. \newblock Consistency of random forests and other averaging classifiers. \newblock {\em J. Mach. Learn. Res.}, 9:2015--2033, 2008.

[weifan2] W.~Fan. \newblock On the optimality of probability estimation by random decision trees. \newblock In {\em AAAI}, 2004.

[schapire] R.~E. Schapire, Y.~Freund, P.~Bartlett, and W.~S. Lee. \newblock Boosting the margin: A new explanation for the effectiveness of voting methods. \newblock {\em The Annals of Statistics}, 26:1651--1686, 1998.

[Shotton:2011:RHP:2191740.2192047] J.~Shotton, A.~Fitzgibbon, M.~Cook, T.~Sharp, M.~Finocchio, R.~Moore, A.~Kipman, and A.~Blake. \newblock Real-time human pose recognition in parts from single depth images. \newblock In {\em CVPR}, 2011.

[Amit:1997:SQR:263023.263042] Y.~Amit and D.~Geman. \newblock Shape quantization and recognition with randomized trees. \newblock {\em Neural Comput.}, 9:1545--1588, 1997.

[Xiong:2012:RFM:2339530.2339680] C.~Xiong, D.~Johnson, R.~Xu, and J.~J. Corso. \newblock Random forests for metric learning with implicit pairwise position dependence. \newblock In {\em KDD}, 2012.

[Manik] R.~Agarwal, A.~Gupta, Y.~Prabhu, and M.~Varma. \newblock Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. \newblock In {\em WWW}, 2013.

[kouzani2010multilabel] A.~Z. Kouzani and G.~Nasireding. \newblock Multilabel classification by bch code and random forests. \newblock {\em International Journal on Network Security}, 1(2):5, 2010.

[conf/kdd/YanTS07] R.~Yan, J.~Tesic, and J.~R. Smith. \newblock Model-shared subspace boosting for multi-label classification. \newblock In {\em KDD}, 2007.

[journals/jcisd/SvetnikLTCSF03] V.~Svetnik, A.~Liaw, C.~Tong, J.~C. Culberson, R.~P. Sheridan, and B.~P. Feuston. \newblock Random forest: A classification and regression tool for compound classification and qsar modeling. \newblock {\em Journal of Chemical Information and Computer Sciences}, 43(6):1947--1958, 2003.

[citeulike:12009604] A.~M. Prasad, L.~R. Iverson, and A.~Liaw. \newblock {Newer Classification and Regression Tree Techniques: Bagging and Random Forests for Ecological Prediction}. \newblock {\em Ecosystems}, 9(2):181--199, 2006.

[Criminisi:2013:DFC:2462584] A.~Criminisi and J.~Shotton. \newblock {\em Decision Forests for Computer Vision and Medical Image Analysis}. \newblock Springer Publishing Company, 2013.

[conf/miccai/ZikicGC13] D.~Zikic, B.~Glocker, and A.~Criminisi. \newblock Atlas encoding by randomized forests for efficient label propagation. \newblock In {\em MICCAI}, 2013.

[ig] L.~Breiman, J.~H. Friedman, R.~A. Olshen, and C.~J. Stone. \newblock {\em Classification and Regression Trees}. \newblock CRC Press LLC, Boca Raton, Florida, 1984.

[breiman1996bagging] L.~Breiman. \newblock Bagging predictors. \newblock {\em Machine Learning}, 24(2):123--140, 1996.

[ho] T.~K. Ho. \newblock Random decision forest. \newblock In {\em ICDAR}, 1995.

[Ho:1998:RSM:284980.284986] T.~K. Ho. \newblock The random subspace method for constructing decision forests. \newblock {\em IEEE Trans. Pattern Anal. Mach. Intell.}, 20(8):832--844, 1998.

[Dietterich:2000:ECT:350128.350131] T.~G. Dietterich. \newblock An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. \newblock {\em Machine Learning}, 40(2):139--157, 2000.

[Biau:2012:ARF:2188385.2343682] G.~Biau. \newblock Analysis of a random forests model. \newblock {\em J. Mach. Learn. Res.}, 13:1063--1095, 2012.

[DBLP:conf/icml/DenilMF13] M.~Denil, D.~Matheson, and N.~de Freitas. \newblock Consistency of online random forests. \newblock In {\em ICML}, 2013.

[DBLP:conf/icml/DenilMF14] M.~Denil, D.~Matheson, and N.~de Freitas. \newblock Narrowing the gap: Random forests in theory and in practice. \newblock In {\em ICML}, 2014.

[journals/jmlr/Meinshausen06] N.~Meinshausen. \newblock Quantile regression forests. \newblock {\em Journal of Machine Learning Research}, 7:983--999, 2006.

[Lin02randomforests] Y.~Lin and Y.~Jeon. \newblock Random forests and adaptive nearest neighbors. \newblock {\em Journal of the American Statistical Association}, 101:578-590, 2002.

[breiman3] L.~Breiman. \newblock Some infinite theory for predictor ensembles. \newblock {\em Technical Report 577, Statistics Department, UC Berkeley, http://www.stat.berkeley.edu/~breiman}, 2000.

[breiman2] L.~Breiman. \newblock Consistency for a simple model of random forests. \newblock {\em Technical Report 670, Statistics Department, UC Berkeley}, 2004.

[citeulike:6833230] H.~Ishwaran and U.~B. Kogalur. \newblock {Consistency of random survival forests}. \newblock {\em Statistics and Probability Letters}, 80(13-14):1056--1064, 2010.

[Domingos:2000:MHD:347090.347107] P.~Domingos and G.~Hulten. \newblock Mining high-speed data streams. \newblock In {\em KDD}, 2000.

[Geurts:2006:ERT:1132034.1132040] P.~Geurts, D.~Ernst, and L.~Wehenkel. \newblock Extremely randomized trees. \newblock {\em Mach. Learn.}, 63:3--42, 2006.

[jagannathan] G.~Jagannathan, K.~Pillaipakkamnatt, and R.~N. Wright. \newblock A practical differentially private random decision tree classifier. \newblock {\em Trans. Data Privacy}, 5(1):273--295, 2012.

[Genuer1] R.~Genuer. \newblock Risk bounds for purely uniformly random forests. \newblock In {\em ArXiv:1006.2980}, 2010.

[Genuer2] R.~Genuer. \newblock Variance reduction in purely random forests. \newblock {\em Journal of Nonparametric Statistics}, 24:543--562, 2012.

[Lin:Jeon:rand:2006] Y.~Lin and Y.~Jeon. \newblock {Random Forests and Adaptive Nearest Neighbors}. \newblock {\em Journal of the American Statistical Association}, 101:578--590, 2006.

[weifan3] W.~Fan, E.~Greengrass, J.~McCloskey, P.~Yu, and K.~Drummey. \newblock Effective estimation of posterior probabilities: Explaining the accuracy of randomized decision tree approaches. \newblock In {\em ICDM}, 2005.

[weifan] W.~Fan, H.~Wang, P.~S. Yu, and S.~Ma. \newblock Is random model better? on its accuracy and efficiency. \newblock In {\em ICDM}, 2003.

[Yang:2012:DPD:2213836.2213910] Y.~Yang, Z.~Zhang, G.~Miklau, M.~Winslett, and X.~Xiao. \newblock Differential privacy in data publication and analysis. \newblock In {\em ACM SIGMOD}, 2012.

[6616536] J.~Vaidya, B.~Shafiq, Wei Fan, D.~Mehmood, and D.~Lorenzi. \newblock A random decision tree framework for privacy-preserving data mining. \newblock {\em IEEE Transactions on Dependable and Secure Computing}, 11:399--411, 2014.

[6968348] A.~Patil and S.~Singh. \newblock Differential private random forest. \newblock In {\em ICACCI}, 2014.

[dwork] C.~Dwork. \newblock Differential privacy: {A} survey of results. \newblock In {\em Theory and Applications of Models of Computation, 5th International Conference, {TAMC} 2008, Xi'an, China, April 25-29, 2008. Proceedings}, pages 1--19, 2008.

[scherrytalwar] F.~McSherry and K.~Talwar. \newblock Mechanism design via differential privacy. \newblock In {\em FOCS}, 2007.

[hall] R.~Hall, A.~Rinaldo, and L.~A. Wasserman. \newblock Differential privacy for functions and functional data. \newblock {\em Journal of Machine Learning Research}, 14(1):703--727, 2013.

[rpart] T.~M. Therneau, B.~Atkinson, and B.~Ripley. \newblock rpart: Recursive partitioning. \newblock {\em http://CRAN.R-project.org/package=rpart}, 2011.

[bib1] R. Agrawal and R. Srikant. Privacy-preserving data mining. In ACM SIGMOD, 2000.

[bib2] W. Du and J. Zhan. Using randomized response techniques for privacy-preserving data mining. In KDD, 2003.

[bib3] K. Choromanski, T. Jebara, and K. Tang. Adaptive anonymity via b-matching. In NIPS, 2013.

[bib4] K. Chaudhuri and C. Monteleoni. Privacy-preserving logistic regression. In NIPS, 2008.

[bib5] G. Jagannathan and R. N. Wright. Privacy-preserving distributed k-means clustering over arbitrarily partitioned data. In KDD, 2005.

[bib6] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.

[bib7] L. Breiman. Random forests. Machine Learning, 45:5-32, 2001.

[bib8] G. Biau, L. Devroye, and G. Lugosi. Consistency of random forests and other averaging classifiers. J. Mach. Learn. Res., 9:2015–2033, 2008.

[bib9] W. Fan. On the optimality of probability estimation by random decision trees. In AAAI, 2004.

[bib10] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26:1651–1686, 1998.

[bib11] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In CVPR, 2011.

[bib12] Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Comput., 9:1545–1588, 1997.

[bib13] C. Xiong, D. Johnson, R. Xu, and J. J. Corso. Random forests for metric learning with implicit pairwise position dependence. In KDD, 2012.

[bib14] R. Agarwal, A. Gupta, Y. Prabhu, and M. Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In WWW, 2013.

[bib15] A. Z. Kouzani and G. Nasireding. Multilabel classification by bch code and random forests. International Journal on Network Security, 1(2):5, 2010.

[bib16] R. Yan, J. Tesic, and J. R. Smith. Model-shared subspace boosting for multi-label classification. In KDD, 2007.

[bib17] V. Svetnik, A. Liaw, C. Tong, J. C. Culberson, R. P. Sheridan, and B. P. Feuston. Random forest: A classification and regression tool for compound classification and qsar modeling. Journal of Chemical Information and Computer Sciences, 43(6):1947–1958, 2003.

[bib18] A. M. Prasad, L. R. Iverson, and A. Liaw. Newer Classification and Regression Tree Techniques: Bagging and Random Forests for Ecological Prediction. Ecosystems, 9(2):181–199, 2006.

[bib19] A. Criminisi and J. Shotton. Decision Forests for Computer Vision and Medical Image Analysis. Springer Publishing Company, 2013.

[bib20] D. Zikic, B. Glocker, and A. Criminisi. Atlas encoding by randomized forests for efficient label propagation. In MICCAI, 2013.

[bib21] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. CRC Press LLC, Boca Raton, Florida, 1984.

[bib22] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

[bib23] T. K. Ho. Random decision forest. In ICDAR, 1995.

[bib24] T. K. Ho. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell., 20(8):832–844, 1998.

[bib25] T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–157, 2000.

[bib26] G. Biau. Analysis of a random forests model. J. Mach. Learn. Res., 13:1063–1095, 2012.

[bib27] M. Denil, D. Matheson, and N. de Freitas. Consistency of online random forests. In ICML, 2013.

[bib28] M. Denil, D. Matheson, and N. de Freitas. Narrowing the gap: Random forests in theory and in practice. In ICML, 2014.

[bib29] N. Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7:983–999, 2006.

[bib30] Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101:578–590, 2006.

[bib31] L. Breiman. Some infinity theory for predictor ensembles. Technical Report 577, Statistics Department, UC Berkeley, http://www.stat.berkeley.edu/~breiman, 2000.

[bib32] L. Breiman. Consistency for a simple model of random forests. Technical Report 670, Statistics Department, UC Berkeley, 2004.

[bib33] H. Ishwaran and U. B. Kogalur. Consistency of random survival forests. Statistics and Probability Letters, 80(13-14):1056–1064, 2010.

[bib34] P. Domingos and G. Hulten. Mining high-speed data streams. In KDD, 2000.

[bib35] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Mach. Learn., 63:3–42, 2006.

[bib36] G. Jagannathan, K. Pillaipakkamnatt, and R. N. Wright. A practical differentially private random decision tree classifier. Trans. Data Privacy, 5(1):273–295, 2012.

[bib37] R. Genuer. Risk bounds for purely uniformly random forests. arXiv:1006.2980, 2010.

[bib38] R. Genuer. Variance reduction in purely random forests. Journal of Nonparametric Statistics, 24:543–562, 2012.

[bib39] Y. Lin and Y. Jeon. Random Forests and Adaptive Nearest Neighbors. Journal of the American Statistical Association, 101:578–590, 2006.

[bib40] W. Fan, E. Greengrass, J. McCloskey, P. Yu, and K. Drummey. Effective estimation of posterior probabilities: Explaining the accuracy of randomized decision tree approaches. In ICDM, 2005.

[bib41] W. Fan, H. Wang, P. S. Yu, and S. Ma. Is random model better? On its accuracy and efficiency. In ICDM, 2003.

[bib42] Y. Yang, Z. Zhang, G. Miklau, M. Winslett, and X. Xiao. Differential privacy in data publication and analysis. In ACM SIGMOD, 2012.

[bib43] J. Vaidya, B. Shafiq, W. Fan, D. Mehmood, and D. Lorenzi. A random decision tree framework for privacy-preserving data mining. IEEE Transactions on Dependable and Secure Computing, 11:399–411, 2014.

[bib44] A. Patil and S. Singh. Differential private random forest. In ICACCI, 2014.

[bib45] C. Dwork. Differential privacy: A survey of results. In Theory and Applications of Models of Computation, 5th International Conference, TAMC 2008, Xi’an, China, April 25-29, 2008. Proceedings, pages 1–19, 2008.

[bib46] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, 2007.

[bib47] R. Hall, A. Rinaldo, and L. A. Wasserman. Differential privacy for functions and functional data. Journal of Machine Learning Research, 14(1):703–727, 2013.

[bib48] T. M. Therneau, B. Atkinson, and B. Ripley. rpart: Recursive partitioning. http://CRAN.R-project.org/package=rpart, 2011.