# Random Forests Brain Dump

Revisiting Kaggle, a site and service which hosts multiple data-mining competitions, I found a new competition that looked potentially interesting. It’s been a while since I’ve fully downloaded any competition’s data, so I was piqued by the inclusion of R code under a file named sample_code.R which didn’t exist before.

Analyzing the code, it was clear that the purpose of the code was to provide two submittable benchmark solutions. One was the naive approach, using the mean of the dependent variable as your predictor. The second was using Random Forests, a machine learning algorithm. In this particular competition, you were asked to to predict $n$ variables, so there were $n$ Random Forest predictors for each variable.

Not knowing much about Random Forests, I spent a portion of that day trying to see what it was all about and understand its mechanism.

Wikipedia, as usual, gave me the practitioner’s definition. In short, it “is an ensemble classifier consisting of many decision trees and outputs the class that is the mode of the classes output by the individual trees.” It helps to understand ensemble in this context as an averaging over a set of sub-models, which happens to be decision trees in this case. It then classifies your particular example by seeing how the sub-models (decision trees) each classified it, and then takes the most-occurring classification as the final classification for your particular example. Interestingly, the name has been trademarked.

I later found a presentation by Albert A. Montillo going over Random Forests, breaking it down in more digestible bits than Wikipedia with more examples. Following is a brief summary of some points I found useful from his presentation.

Random Forest’s first randomization is through bagging
A bootstrap sample is a training set ($N' < N[/latex]) with random sampling (with replacement). Bootstrap aggregation is a parallel combination of learners (decision trees for Random Forests) independently trained on distinct bootstrap samples. Bagging refers to bootstrap aggregation (independent training on learners with distinct bootstrap samples). Final prediction is either the mean prediction of the independent learners (for regression) or the most-picked classification (for classification). Random Forest's second randomization is through predictor subsets You select a random subset ([latex]m_{try}$) of predictors from the total set ($k$) for each split. Bagging is a special case of Random Forest where $m_{try} = k$.

After understanding the two aforementioned features, the Random Forest algorithm is more easily understood

Random Forest Algorithm
For a tree $t_{i}$ you’re building, you first select a bootstrap sample from the original training set for which you will learn on. You will grow an unpruned tree from this bootstrap sample. At each internal node, randomly select $m_{try}$ predictors (from the total set of predictors) and determine best split using only these predictors. Additionally, don’t perform cost-complexity pruning.

Your overall prediction is the average response (for regression) or majority vote (classification) from all the individually trained trees.

Montillo then goes into some practical considerations, e.g., how are splits chosen (squared error, gini index), how many trees to build (build trees until error no longer decreases), or how to select $m_{try}$ (use the recommended defaults of $\sqrt{k}$ for regression and $\frac{k}{3}$ for classification).