## Random Forests Brain Dump

Edit 8/7/2020:

In the process of resurrecting this site, I came upon this blog post that’s over eight years old, and I semi-cringe.

My current understanding of “how we got to” random forest is this:

• Bagging, short for Bootstrap Aggregation, is a technique to take low bias high variance methods, e.g., decision trees, and lowering the variance.
• This is simply done by taking bootstraps of the original data, fitting with trees B times and then averaging it. The decreased variance is similar to var(\bar{x}) = \frac{var(x)}{n}.
• The challenge is you don’t get to that level of decreased variance since there’s correlation amongst the trees, e.g., a particularly dominant feature is in every bootstrapped tree.
• We attempt to mitigate this with Random Forest, where for each iteration of fitting the tree, we fit on some random subset of the features. For that matter, we can do this on each split.
• Finally, when the “random subset” is all of the features, then we get back to Bagging.

Original Post:

Revisiting Kaggle, a site and service which hosts multiple data-mining competitions, I found a new competition that looked potentially interesting. It’s been a while since I’ve fully downloaded any competition’s data, so I was piqued by the inclusion of R code under a file named sample_code.R which didn’t exist before.

Analyzing the code, it was clear that the purpose of the code was to provide two submittable benchmark solutions. One was the naive approach, using the mean of the dependent variable as your predictor. The second was using Random Forests, a machine learning algorithm. In this particular competition, you were asked to to predict n variables, so there were n Random Forest predictors for each variable.

Not knowing much about Random Forests, I spent a portion of that day trying to see what it was all about and understand its mechanism.

Wikipedia, as usual, gave me the practitioner’s definition. In short, it “is an ensemble classifier consisting of many decision trees and outputs the class that is the mode of the classes output by the individual trees.” It helps to understand ensemble in this context as an averaging over a set of sub-models, which happens to be decision trees in this case. It then classifies your particular example by seeing how the sub-models (decision trees) each classified it, and then takes the most-occurring classification as the final classification for your particular example. Interestingly, the name has been trademarked.

I later found a presentation by Albert A. Montillo going over Random Forests, breaking it down in more digestible bits than Wikipedia with more examples. Following is a brief summary of some points I found useful from his presentation.

Random Forest’s first randomization is through bagging
A bootstrap sample is a training set (N' < N) with random sampling (with replacement).
Bootstrap aggregation is a parallel combination of learners (decision trees for Random Forests) independently trained on distinct bootstrap samples.
Bagging refers to bootstrap aggregation (independent training on learners with distinct bootstrap samples).

Final prediction is either the mean prediction of the independent learners (for regression) or the most-picked classification (for classification).

Random Forest’s second randomization is through predictor subsets
You select a random subset (m_{try}) of predictors from the total set (k) for each split. Bagging is a special case of Random Forest where m_{try} = k.

After understanding the two aforementioned features, the Random Forest algorithm is more easily understood

Random Forest Algorithm
For a tree t_{i} you’re building, you first select a bootstrap sample from the original training set for which you will learn on. You will grow an unpruned tree from this bootstrap sample. At each internal node, randomly select m_{try} predictors (from the total set of predictors) and determine best split using only these predictors. Additionally, don’t perform cost-complexity pruning.

Your overall prediction is the average response (for regression) or majority vote (classification) from all the individually trained trees.

Montillo then goes into some practical considerations, e.g., how are splits chosen (squared error, gini index), how many trees to build (build trees until error no longer decreases), or how to select m_{try} (use the recommended defaults of \sqrt{k} for regression and \frac{k}{3} for classification).