Gradient Boosting to the Xtreme – Part 2
In this second blog post in this series on Extreme Gradient Boosting, we will be focusing on how to solve the immediate issue of overfitting that can occur when we have a single decision tree classifier. Please checkout part 1 which covers in detail the principles of decision tree classifiers and some of the challenges that Rubikloud face from a machine learning point of view.
As alluded to in part 1 , overfitting can be resolved by pruning of the tree. Another method to accomplish this is to move from the single tree world to multiple trees in the hope that the bias associated with a single tree will be removed. This type of technique is called Ensemble Learning where we use many different models to achieve a result better than any single one. There are many types of ensemble techniques with one of the simplest ones being a vote system over the ensemble of classifiers, other techniques include subsampling of data points (bootstrapping of data) i.e. Bagging or even number of columns which leads to development of the main topic for this blog post, the Random Forest.
The decision tree classifier uses all the columns in the data set to create a classifier, but what if we use only a sample of the original columns and data points to create a single tree, and then take another random sample of columns and data points to generate another tree and so and so on. What do you have after generating all of these trees? A forest! This forest is a representation of the ensemble of classifiers all of which will all behave at least better than random, and are not dependent on one another. It is important to note that the main difference between a Random Forest and the technique of Bagging, is the former uses the random sampling of columns.
If we take a classification problem where we have two outcomes e.g. are you male or female, we could flip a coin and state that a heads = Female whilst a tails = Male, this is essentially a random choice. We can quantify this probability of being correct as being random and if we create a model that can produce a better success rate than a flip of a coin we term this as being “better than random”.
Using the example in part 1 where we had a information about a person such as physical attributes i.e. Height, Weight, Foot size. We can form a random subset of the columns that would go into a Random Forest – a true Random Forest would also randomly subsample data points, but here we will assume that this is the case already.
Random sample 1 = [Foot size, Weight]
Random sample 2 = [Height, Weight]
Random sample 2 = [Foot size, Height]
With each random sample leading to a trained decision tree i.e. 3 weak classifiers. Clearly, our example is not great for ensemble methods due to a very small number of columns and rows for which we can subsample and grow the trees with.
Moving away from our example and focusing on the voting technique itself and assuming that we have 15 weak classifiers where the outcome decision will either be a blue or red tree. (see figure 1)
Figure 1: A representation of a group of weak classifiers (decision trees) showing the resultant classification of each tree as red or blue.
Using the trained trees to obtain a classification on some unseen input data, we can now count the number of times we have a prediction of red and the number of times we had a prediction of blue and compare the two counts in a democratic fashion – the majority of predictions for a certain class means it takes the win overall. For our example we have
Number of Red Trees = 9
Number of Blue Trees = 6
Which by the democratic approach the Red Trees would “win” and so the overall prediction outcome would be Red. Using a voting system like this allows for a more robust prediction and a reduction on the bias of the model.
The democratic approach is the simplest to understand, we could also work with the probabilities of the outcomes, rather than the outcomes themselves which are harsh cutoffs.
Training a Random Forest
Training a random forest is done in the same way as we train a decision tree, by taking the columns of data that do not contain the variable we want to predict (the outcome variable) and applying the random forest algorithm. Beyond the obvious algorithmic differences between a random forest and a decision tree, we also have an increase in the number of hyper parameters required by a Random Forest algorithm with the important tuning parameters given by;
- Number of estimators/weak classifiers (trees)
- Depth of each tree
- Number of columns to sample
- Number of data points to subsample
The number of trees parameter is usually set very high to allow for full growth of trees which aides in the requirement for each classifier to be independent from one another. Whilst, we could let the number of estimators become very large, there is a point where adding any more trees will lead to no further improvement in the predictive power i.e. the number of estimators > optimal number of estimators, but we can consider computation time to consider for each additional tree as a factor for tuning the number of estimators to reduce the time to run.
Given the extra complexity in the number of tuning parameters one could be led to believe that this particular ensemble method would be a tricky algorithm to tune? Well, the random forest is actually one of the best algorithms straight out of the box – especially if you use the scikit-learn implementation of random forest, which requires very little to no tuning! Due to the relative straight forward training of Random Forests it has become a must in a data scientist toolbox and as such is usually the “go to” algorithm for modelling structured data.
Going back to our toy numerical example from part 1 and assuming we have trained a random forest we can predict new labels in a similar process as decision tree by treating each classifier as a decision tree (obvious I know) – see figure 2.
Figure 2: An example of a decision tree model where each layer corresponds to a column (feature).
We can see the flow of new unseen data to obtain the gender label. Let’s take the example from the first blog
|Gender||Height (feet)||Weight (lbs)||Foot Size|
Table 1: An example of a new person set of attributes to predict the gender from.
As before we start at the root node and work our way through the branches (see part 1 for details), but instead of just one tree we do this for all trees
Figure 3: Showing the process of classification with a trained toy Random Forest.
Following the blue lines in figure 3 we can see that the results are as follows in table 2.
Table 2: Outcomes of the toy random forest.
We can see from table 2 that the result after employing the democratic decision will be a label of male. Furthermore this example highlights the benefits of using an ensemble method as a single tree could result in an outcome like Model 3, with no decision at all!
Random Forests are great all-rounders for a lot of machine learning problems with great characteristics for general usage – So why do we need other tree based ensemble methods? Well, tune in to the next, and final, blog post for the XGBoost algorithm or more generally the gradient boosted tree.
If the world of applied machine learning gets you excited, Rubikloud is always looking for ambitious and curious data scientists. So please reach out: http://rubikloud.wpengine.com/careers/