Revisiting Kaggle, a site and service which hosts multiple data-mining competitions, I found a new competition that looked potentially interesting. It’s been a while since I’ve fully downloaded any competition’s data, so I was piqued by the inclusion of R code under a file named
sample_code.R which didn’t exist before.
Analyzing the code, it was clear that the purpose of the code was to provide two submittable benchmark solutions. One was the naive approach, using the mean of the dependent variable as your predictor. The second was using Random Forests, a machine learning algorithm. In this particular competition, you were asked to to predict variables, so there were Random Forest predictors for each variable.
Not knowing much about Random Forests, I spent a portion of that day trying to see what it was all about and understand its mechanism.
Wikipedia, as usual, gave me the practitioner’s definition. In short, it “is an ensemble classifier consisting of many decision trees and outputs the class that is the mode of the classes output by the individual trees.” It helps to understand ensemble in this context as an averaging over a set of sub-models, which happens to be decision trees in this case. It then classifies your particular example by seeing how the sub-models (decision trees) each classified it, and then takes the most-occurring classification as the final classification for your particular example. Interestingly, the name has been trademarked.
I later found a presentation by Albert A. Montillo going over Random Forests, breaking it down in more digestible bits than Wikipedia with more examples. Following is a brief summary of some points I found useful from his presentation.
Random Forest’s first randomization is through bagging
A bootstrap sample is a training set () of predictors from the total set () for each split. Bagging is a special case of Random Forest where .
After understanding the two aforementioned features, the Random Forest algorithm is more easily understood
Random Forest Algorithm
For a tree you’re building, you first select a bootstrap sample from the original training set for which you will learn on. You will grow an unpruned tree from this bootstrap sample. At each internal node, randomly select predictors (from the total set of predictors) and determine best split using only these predictors. Additionally, don’t perform cost-complexity pruning.
Your overall prediction is the average response (for regression) or majority vote (classification) from all the individually trained trees.
Montillo then goes into some practical considerations, e.g., how are splits chosen (squared error, gini index), how many trees to build (build trees until error no longer decreases), or how to select (use the recommended defaults of for regression and for classification).
Additional Information for Free
Random Forests are able to estimate the test error during the building stage.
For each tree grown, ~33-36% of samples are not selected in the bootstrap, for which we call out-of-bootstrap (OOB) samples. We’re then able to take the OOB samples and feed them in as input to the corresponding tree, treating the predictions made as if they were true out-of-sample.
By various bookkeeping, majority vote (classification) or average response (regression) is computed from all the OOB samples from all trees.
Supposedly such test error is very accurate in practice, given a reasonable number of trees you build.
After going over Montillo’s presentation a couple times, the Wikipedia article and other assorted reading (from Google) made much more sense. I found the original paper and “official” documentation both informative and useful.
Circling back, in the end, it was nice to actually understand what the benchmark solution was doing. The R manual for the randomForest package was much more useable (if only because you’re more confident what the parameters for the functions actually mean), giving me the ability to modify the benchmark solution as I saw fit.