In Sopra Steria Analytics Sweden's very first blog, 24-hour predictions of PM10 levels in urban environments, I applied Random Forest methods to predict particle levels at street level in Stockholm. These methods are widely used either to predict values of measurements given the features in a dataset (regression) or to segment objects or populations (classification). Many data scientists apply these techniques, and it is clear to me that most users, however skilled they are at applying them, do not know the mechanisms or even the statistical workings that make Random Forest an efficient method to be reckoned with. I must shamefully admit that I have many times used the black-box tools delivered in many R packages without thoroughly grasping their inner workings, for the simple reason that their outputs matched my expectations. But, as a mathematician, curiosity took the upper hand and I decided to spend some of my spare time investigating the math behind the technique. I will not go into the subject too deeply but will, whenever needed, define some of the concepts used. As for the algorithm(s) used, I leave it to the reader to browse the myriad of material available on the net.

Since not everybody is accustomed to decision trees and random forests, I will start my article with a very intuitive description of these two concepts and devote the remainder of my presentation to the mathematics behind these ideas. This gives the reader a choice to either limit their reading to understanding the mechanisms or to dive deep into the core workings.

In this first part, we will limit ourselves to decision trees. Those who are used to reading my blogs know that I have a tendency to write lengthy ones and that I sometimes need to partition my work. This post is no exception.

Decision trees

Before explaining the workings of decision trees, we need to describe two simple concepts, namely bootstrapping and bagging, as they are essential to making decision trees work.

Bootstrapping and bagging

Bootstrapping originally comes from the ideas of starting an endeavor (e.g. a business) from nothing (e.g. no assets). The technique is essentially used for making statistical inferences about data when standard parametric assumptions are questionable. The idea is to use the data of a sample study and to mock a complete dataset for the purpose of approximating the sampling distribution of a statistic. In other words, to resample with replacement from the sample data to create a large number of mock samples, known as bootstrap samples. The sample summary is then computed on each of the bootstrap samples and a histogram of the set of these computed values is called the bootstrap distribution of the statistic.

To see an example of how this may work, let's assume we wish to determine the mean of a sample of 500 values $x$ (for some measured quantity). Now, observe that our sample is of limited size and the mean of the sample ($\operatorname{mean}(x)=\frac{1}{500}\sum x$) is bound to have some measure of error. We can improve the estimate of our mean using the bootstrap procedure:

1. Create many (usually 1000) random sub-samples of our dataset with replacement.
2. Compute the mean of each one of these sub-samples.
3. Calculate the mean of all means computed in 2.
4. Use that as the estimate of the mean for the data.
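For the reader who prefers code to prose, the four steps above can be sketched in a few lines of Python (the function name and defaults are my own, chosen for illustration, not part of any particular package):

```python
import random

def bootstrap_mean(sample, n_boot=1000, seed=0):
    """Estimate the mean of `sample` by resampling with replacement
    n_boot times and averaging the means of the bootstrap samples."""
    rng = random.Random(seed)
    n = len(sample)
    boot_means = []
    for _ in range(n_boot):
        resample = [rng.choice(sample) for _ in range(n)]  # step 1
        boot_means.append(sum(resample) / n)               # step 2
    return sum(boot_means) / n_boot                        # steps 3-4
```

For a well-behaved sample the result lands very close to the plain sample mean; the real payoff of the procedure is the spread of `boot_means`, which approximates the sampling distribution of the mean.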

Why is this mathematically sound? Suppose that the target of a given study is a parameter $y$ and that a sample population of size $n$ is given by the values $\left(x_{1}, x_{2},\ldots , x_{n}\right)$. Suppose the corresponding sample statistic computed from this data set is $\hat{y}$, for instance the mean. The sampling distribution of $\hat{y}$ for large $n$ is normally distributed with mean $y$ and standard deviation $\frac{d}{\sqrt{n}}$, where $d$ is a quantity depending on the population and on the type of statistic $\hat{y}$. There are sometimes technical complexities in approximating the required standard deviation from the data, for instance when $\hat{y}$ is a sample median or a sample correlation, and this is where bootstrapping enters the scene.

Let $\hat{y}_{k}$ be a random quantity which represents the same statistic calculated on a bootstrap sample drawn out of $\left(x_{1}, x_{2},\ldots , x_{n}\right)$. What can be said about the sampling distribution of $\hat{y}_{k}$ with respect to all bootstrap samples, while the original sample $\left(x_{1}, x_{2},\ldots , x_{n}\right)$ remains unchanged? In the limit, as $n \rightarrow \infty$, the sampling distribution of $\hat{y}_{k}$ is also normally distributed, with mean $\hat{y}$ and the same standard deviation $\frac{d}{\sqrt{n}}$. Thus, the bootstrap distribution of $\hat{y}_{k}-\hat{y}$ is a good approximation of the sampling distribution of $\hat{y} - y$. As we move from one bootstrap sample to another, only $\hat{y}_{k}$ changes in the expression $\hat{y}_{k}-\hat{y}$, since $\hat{y}$ is computed on the original data $\left(x_{1}, x_{2},\ldots , x_{n}\right)$.

Two very nice applications of bootstrapping are the approximation of the standard error, SE, of a sample estimate and of the bias, which is a measure of the difference between $\mathbb{E}(\hat{y})$ and $y$. Let's start with the standard error: suppose that we seek information about a population parameter $y$ and that $\hat{y}$ is a sample estimator of $y$ based on a random sample of size $n$. To estimate the standard error of $\hat{y}$, as the sample varies over all possible samples, one applies the following bootstrap procedure:

Compute $\left(y^{*}_{1}, y^{*}_{2},\ldots,y^{*}_{N}\right)$ in the same manner as when computing $\hat{y}$, this time based on $N$ different bootstrap samples (each of size $n$). The size $N$ should be on the order of $N=n^{2}$ for reasonable $n$. If $n^{2}$ is too large, $N$ can be chosen as $n\ln(n)$. One defines the standard error of the sample estimate as

$SE\left(\hat{y}\right)=\left(\frac{1}{N}\sum_{j=1}^{N}\left(y^{*}_{j}-\hat{y}\right)^{2}\right)^{1/2}$.

Now, as we mentioned above, the mean of the sampling distribution of $\hat{y}$ most often differs from $y$. Mathematically, this can be quantified as

$BIAS(\hat{y}) = \mathbb{E}(\hat{y})-y = O\left(\frac{1}{n}\right),$

where $\mathbb{E}$ is the expectation. The formula simply means that the difference is of order $\frac{1}{n}$ for sufficiently large $n$. The bootstrap approximation of this bias is

$BIAS(\hat{y})=\frac{1}{N}\sum_{j=1}^{N}\left(y^{*}_{j}-\hat{y}\right).$

where the $y^{*}_{j}$ are bootstrap copies of $\hat{y}$.
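Both approximations are easy to compute in practice. Here is a minimal Python sketch (the function name is mine) that estimates the SE and bias of an arbitrary statistic from $N$ bootstrap copies, following the two formulas above:

```python
import random

def bootstrap_se_bias(sample, statistic, n_boot=1000, seed=0):
    """Approximate SE(y_hat) and BIAS(y_hat) of `statistic`
    from n_boot bootstrap resamples of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    y_hat = statistic(sample)  # statistic computed on the original data
    y_star = [statistic([rng.choice(sample) for _ in range(n)])
              for _ in range(n_boot)]  # bootstrap copies y*_j
    se = (sum((yj - y_hat) ** 2 for yj in y_star) / n_boot) ** 0.5
    bias = sum(yj - y_hat for yj in y_star) / n_boot
    return se, bias
```

Trying it with the mean, `statistic=lambda s: sum(s)/len(s)`, on a sample of size $n$ reproduces the textbook $d/\sqrt{n}$ behaviour quite well, with a bias near zero.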

Now, let's move on to our next subject, namely bagging, which in the literature is also known as bootstrap aggregating. Bagging is an ensemble method that combines the predictions from several machine learning models to make more accurate predictions than any individual model. It can, for instance, be used to reduce the variance of algorithms that have high variance. In the context of this blog this becomes a very useful method. Indeed, decision trees, like classification and regression trees (CART), usually have fairly high variance since they are sensitive to the data they are nurtured with, as we shall see further down in this article. In mathematical terms this is translated in the following way:

Suppose that we have determined a model $\hat{y}=\hat{f}(x)$ ($\hat{f}$ is just the estimator function, or the model) for our training set $\{x_{i},y_{i}\} \subset \mathbb{R}^{n} \times \mathbb{R}$ (or $\mathbb{R}^{n}\times \mathbb{Z}$ if the $y_{i}$ are categorical variables). As before, we construct $N$ bootstrap samples and train the model on the $j$-th bootstrap sample to get $\hat{f}_{j}^{*}(x)$. The next step is simply the computation of the average of the estimators,

$\hat{f}_{Bag}(x) = \frac{1}{N}\sum_{j=1}^{N}\hat{f}_{j}^{*}(x)$.

The bagged estimate is the average prediction at $x$ from the $N$ trees. In the case of classification, a majority vote from the $N$ trees is used, electing a "voting committee":

$\hat{f}_{Bag}(x)=\arg \max_{j}\mathrm{Card}\{a \mid \hat{f}_{a}^{*}(x)=j\}$
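To make the two formulas concrete, here is a small Python sketch of the whole bagging pipeline (all function names are mine, invented for illustration): train one model per bootstrap sample, then either average the $N$ predictions or let them vote:

```python
import random
from collections import Counter

def bootstrap_models(data, fit, n_models=25, seed=0):
    """Train one model per bootstrap resample of `data`.
    `fit` maps a list of (x, y) pairs to a prediction function."""
    rng = random.Random(seed)
    n = len(data)
    return [fit([rng.choice(data) for _ in range(n)])
            for _ in range(n_models)]

def bagged_regressor(models):
    """f_Bag(x): the average of the N bootstrap-trained predictions."""
    return lambda x: sum(f(x) for f in models) / len(models)

def bagged_classifier(models):
    """Classification: majority vote among the N models."""
    return lambda x: Counter(f(x) for f in models).most_common(1)[0][0]
```

Any `fit` function works here; plugging in a CART learner would give exactly the tree ensembles this blog is building towards.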

Now that we have defined the two essential concepts of bootstrapping and bootstrap aggregation (bagging), we can proceed with our study of decision trees.

Imagine that you want to predict the concentration of some particle level at street level in some city near you. The particle levels might for instance be $PM_{10}$ (i.e. particles smaller than 10 $\mu m$ that are, in urban environments, mostly caused by erosion of asphalt and rubber from studded tires). The concentration of these particles in the air is affected by a number of factors such as time (season of the year, time of the day), weather (temperature, precipitation, air pressure, wind and wind direction), traffic (how many cars are currently being driven within a given area) and possibly a wide range of other factors, such as historical data of levels around the time you wish to make the prediction for.

Let's assume you have absolutely no knowledge of the existence of decision tree algorithms but that you still want to make an "educated guess" of the particle level in the next hour. What would you do? You'd probably check what the level is right now and note that it is currently $42 \mu g/m^{3}$. Next, you'd look out the window, check the number of cars, remember that at this time of the day traffic increases slightly, and at the same time notice that the sun is shining and the air stands still. After having gone through a list of things you think might be relevant to your prediction, you come up with the answer $48 \mu g/m^{3}$. To arrive at this estimate you chose a series of questions that you deemed relevant, but at the same time you probably ignored many questions that could have been asked, and even answers that could have been given to your own questions. You most probably did so because you know from experience that you wouldn't be able to keep track of all possible alternatives already by the third or fourth question. And if you'd write them all down, the time for the prediction you wanted to make would have long passed. So, what we basically do in real life is focus on one single branch of a decision tree. It's simply a question of efficiency, one that we apply in all our decision processes on a daily basis, but it has a serious consequence for the prediction you've made: the answer you've come up with contains an error, and most probably a large one. We shall come back to this very important observation later on in this article, but to ease your mind I should point out that the decision tree itself doesn't perform much better. To begin with, we are more qualified than the tree because we possess intuitive knowledge of the world while the algorithm knows nothing. But it learns, and does so very quickly. Yes, it does learn, and this is why decision trees belong to the class of supervised machine learning models.
The model has no prior knowledge of anything and it tries to fit the data in the training set (input variables, predictors or features) to the outcome. After a sufficiently large number of tries it will gain enough knowledge about the relationship between features and outputs to make qualified guesses. It also determines which questions are the best to ask, something that we might have a hard time doing because of biases (e.g. we've been told that a certain phenomenon has a decisive importance when it actually doesn't).

How does this happen from a mathematical point of view? I'm glad you asked! To begin with, I should mention something about the concepts of entropy and information gain. The idea of entropy has its origin in a principle of thermodynamics referring to the fact that everything in the universe eventually moves from order to disorder. Entropy measures that disorder, while information gain measures the change in entropy.

In statistical terms, entropy is defined in the following way:

Assume that you are given a discrete random variable $X$ that takes the values $\left(x_{1}, x_{2}, \ldots, x_{n}\right)$ with probability function $P(X)$. We define the Shannon entropy as

$S(X) = \mathbb{E}\left[I(X)\right]= \mathbb{E}\left[-\ln (P(X))\right]$

where $I$ is the information content of $X$. We can also write $S(\cdot)$ explicitly as

$S(X) = \sum_{j=1}^{n}P(x_{j})I(x_{j})=-\sum_{j=1}^{n}P(x_{j})\ln P(x_{j}).$

If $P(x_{j})=0$ for some $x_{j}$, the value of $P(x_{j}) \ln P(x_{j})$ is taken to be $0$, since $\lim_{y \rightarrow 0^{+}} y\ln(y) =0$. We also define the conditional entropy, that is, the entropy of $X$ given $Y$, as

$S(X|Y) = -\sum_{i,j}p(x_{j},y_{i})\ln\frac{p(x_{j},y_{i})}{p(y_{i})}$

where $p(x_{j},y_{i})$ is the joint probability of the events $x_{j}$ and $y_{i}$. The conditional entropy should be understood as a measure of the randomness of $X$ given the event $Y$.
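Empirically, the conditional entropy can be computed straight from observed pairs by plugging the empirical joint and marginal distributions into the formula above. A minimal Python sketch (the function name is mine):

```python
from math import log

def conditional_entropy(pairs):
    """S(X|Y) from a list of (x, y) observations, using the
    empirical joint distribution p(x, y) and marginal p(y)."""
    n = len(pairs)
    joint, marg_y = {}, {}
    for x, y in pairs:
        joint[(x, y)] = joint.get((x, y), 0) + 1
        marg_y[y] = marg_y.get(y, 0) + 1
    # -sum over (x, y) of p(x, y) * ln( p(x, y) / p(y) )
    return -sum((c / n) * log((c / n) / (marg_y[y] / n))
                for (x, y), c in joint.items())
```

As a sanity check, $S(X|Y)=0$ when $X$ is completely determined by $Y$, and $S(X|Y)=S(X)$ when the two are independent.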

Information gain being the change in entropy between two states, it is natural to define it as $Gain(T,\xi) = S(T) - S(T|\xi)$, where $S(T|\xi)$ is the weighted average entropy of the resulting subdomains after observing the event $\xi$. If $\xi = A$, some attribute, we can measure the net gain of information from splitting a set $K$ on $A$, so that

$Gain(K,A) = S(K) - \sum_{t\in T}p(t)S(t)$

where $T$ is the set of subsets obtained after splitting $K$ on the attribute $A$, and $p(t)$ is the proportion of the number of elements in $t$ to the number of elements in $K$. Whenever a tree is split into different branches, this is the quantity we look for.
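A short Python illustration of the last two formulas (written from scratch for this blog, not taken from any library): we compute $S(K)$ and $Gain(K,A)$ for a toy set of labels, where the split on the attribute $A$ is represented simply by the resulting groups of labels:

```python
from math import log

def entropy(labels):
    """Shannon entropy (natural log) of the label distribution."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * log(c / n) for c in counts.values())

def information_gain(labels, groups):
    """Gain(K, A) = S(K) - sum over subsets t of p(t) * S(t),
    where `groups` are the label lists of the subsets obtained
    by splitting K on the attribute A."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)
```

A perfectly separating split of `[0, 0, 1, 1]` into `[[0, 0], [1, 1]]` gains the full entropy $\ln 2$, while the useless split `[[0, 1], [0, 1]]` gains nothing.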

I should also draw the reader's attention to the fact that entropy is most often used in the ID3/C4.5 family of classification trees, while another measure, the Gini impurity, is the standard choice for classification in CART (regression trees instead use a variance-based measure). There is a whole class of functions measuring impurity, which can be defined in a general way via the impurity of nodes (the points where the branches form). Impurity is simply a measure of how often a randomly chosen element from some set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset.

We say that the impurity function $\psi$ is a function defined on the set

$\left\{(\xi_{1}, \xi_{2},\ldots, \xi_{k}) : \xi_{i}\geq 0 \text{ for } i=1,\ldots,k, \ \sum_{i=1}^{k}\xi_{i}=1 \right\}$

such that

1. $\psi$ has a unique maximum at $(1/k, 1/k, \ldots, 1/k)$
2. $\psi$ has $k$ minima, one at each unit vector $e_{i}$
3. $\psi$ is a symmetric function of $\xi_{1},\xi_{2},\ldots,\xi_{k}$

The impurity of a node $t$ is then defined as

$i(t) = \psi\left(p(1|t), p(2|t),\ldots,p(k|t)\right)$

where $p(k|t)$ is the probability of having been labelled $k$ given that we are in the cell (the partition of the dataset) corresponding to node $t$.

The contribution of a node to the impurity of the whole tree is defined as $I(t) = i(t)p(t)$, where $p(t)$ is the probability of a data point being in the cell $t$. Since the cells partition the total dataset, we can hence define the impurity of the whole tree as the sum of the impurities of its nodes.

I will leave the proof that the Gini impurity measure belongs to this class of functions as an exercise. It is straightforward and I urge you to check it.
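If you want a numerical sanity check before doing the proof, the Gini impurity $\psi(\xi_{1},\ldots,\xi_{k}) = 1 - \sum_{i}\xi_{i}^{2}$ can be probed in a couple of lines of Python:

```python
def gini(probs):
    """Gini impurity: psi(xi_1, ..., xi_k) = 1 - sum(xi_i^2)."""
    return 1.0 - sum(p * p for p in probs)

# Largest at the uniform distribution, zero at the unit vectors,
# and symmetric in its arguments -- the three defining properties.
uniform = gini([0.5, 0.5])      # 0.5
pure = gini([1.0, 0.0])         # 0.0
skewed = gini([0.9, 0.1])       # ~0.18
mirrored = gini([0.1, 0.9])     # ~0.18, same by symmetry
```

Of course a few evaluations do not replace the proof, but they show the shape of the function: impurity is largest when the classes are perfectly mixed and vanishes when a node is pure.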

I previously described how you would arrive at some prediction of PM10 levels by going through a series of questions, disregarding all the questions and predictions you might have arrived at had you asked all the possible questions. I even pointed out that your prediction would most probably contain a great deal of error. Even a decision tree would contain some error. How does one remedy this problem? What if I asked you and all your 2000 Facebook friends to make the same prediction? You'd probably all end up with different predictions, all of them with high variance. But could I still use all the predictions made? Could I gain some insight using predictions made by 2000 decision trees? This will be the topic of part 2 of this blog: growing forests of decision trees.

I hope this blog has given you some insight into the concepts behind decision trees and awakened your interest in reading more about the mathematics.