Deriving the Linear Regression Solution

In deriving the linear regression solution, we will be taking a closer look at how we “solve” the common linear regression, i.e., finding \beta in y = X\beta + \epsilon.

I mention “common,” because there are actually several ways you can get an estimate for \beta based on assumptions of your data and how you can correct for various anomalies. “Common” in this case specifically refers to ordinary least squares. For this specific case, I assume you already know the punch line, that is, \beta = (X^{T}X)^{-1}X^{T}y. But, what we’re really interested in is how to get to that point.

The crux is that you’re trying to find a solution \beta that minimizes the sum of the squared errors, i.e., \min\limits_{\beta} \: \epsilon^{T}\epsilon. We can find the minimum by taking the derivative and setting it to zero, i.e., \frac{d}{d\beta} \epsilon^{T}\epsilon = 0.

In deriving the linear regression solution, it helps to remember two things. Regarding derivatives of two vectors, the product rule states that \frac{d}{dx}u^{T}v = u^{T}\frac{d}{dx}v + v^{T}\frac{d}{dx}u. See this and that. And, for matrix transpose, (AB)^{T} = B^{T}A^{T}.

Observe that y = X\beta + \epsilon \implies \epsilon = y - X\beta. As such, \frac{d}{d\beta} \epsilon^{T}\epsilon = \frac{d}{d\beta} (y-X\beta)^{T}(y-X\beta).

Working it out,
\frac{d}{d\beta} \epsilon^{T}\epsilon \\= \frac{d}{d\beta} (y-X\beta)^{T}(y-X\beta) \\= (y-X\beta)^{T} \frac{d}{d\beta}(y-X\beta) + (y-X\beta)^{T}\frac{d}{d\beta}(y-X\beta) \\= (y-X\beta)^{T}(-X) + (y-X\beta)^{T}(-X) \\= -2(y-X\beta)^{T}X \\= -2(y^{T} - \beta^{T}X^{T})X \\= -2(y^{T}X - \beta^{T}X^{T}X)

By setting the derivative to zero and solving for \beta, we can find the \beta that minimizes the sum of squared errors.
\frac{d}{d\beta} \epsilon^{T}\epsilon = 0 \\ \implies -2(y^{T}X - \beta^{T}X^{T}X) = 0 \\ \implies y^{T}X - \beta^{T}X^{T}X = 0 \\ \implies y^{T}X = \beta^{T}X^{T}X \\ \implies (y^{T}X)^{T} = (\beta^{T}X^{T}X)^{T} \\ \implies X^{T}y = X^{T}X\beta \\ \implies (X^{T}X)^{-1}X^{T}y = (X^{T}X)^{-1}(X^{T}X)\beta \\ \implies \beta = (X^{T}X)^{-1}X^{T}y

Without too much difficulty, we saw how we arrived at the linear regression solution of \beta = (X^{T}X)^{-1}X^{T}y. The general path to that derivation is to recognize that you’re trying to minimize the sum of squared errors (\epsilon^{T}\epsilon), which can be done by finding the derivative of \epsilon^{T}\epsilon, setting it to zero, and then solving for \beta.

Random Forests Brain Dump

Revisiting Kaggle, a site and service which hosts multiple data-mining competitions, I found a new competition that looked potentially interesting. It’s been a while since I’ve fully downloaded any competition’s data, so I was piqued by the inclusion of R code under a file named sample_code.R which didn’t exist before.

Analyzing the code, it was clear that the purpose of the code was to provide two submittable benchmark solutions. One was the naive approach, using the mean of the dependent variable as your predictor. The second was using Random Forests, a machine learning algorithm. In this particular competition, you were asked to to predict n variables, so there were n Random Forest predictors for each variable.

Not knowing much about Random Forests, I spent a portion of that day trying to see what it was all about and understand its mechanism.

Wikipedia, as usual, gave me the practitioner’s definition. In short, it “is an ensemble classifier consisting of many decision trees and outputs the class that is the mode of the classes output by the individual trees.” It helps to understand ensemble in this context as an averaging over a set of sub-models, which happens to be decision trees in this case. It then classifies your particular example by seeing how the sub-models (decision trees) each classified it, and then takes the most-occurring classification as the final classification for your particular example. Interestingly, the name has been trademarked.

I later found a presentation by Albert A. Montillo going over Random Forests, breaking it down in more digestible bits than Wikipedia with more examples. Following is a brief summary of some points I found useful from his presentation.

Random Forest’s first randomization is through bagging
A bootstrap sample is a training set (N' < N[/latex]) with random sampling (with replacement). Bootstrap aggregation is a parallel combination of learners (decision trees for Random Forests) independently trained on distinct bootstrap samples. Bagging refers to bootstrap aggregation (independent training on learners with distinct bootstrap samples).  Final prediction is either the mean prediction of the independent learners (for regression) or the most-picked classification (for classification).  <strong>Random Forest's second randomization is through predictor subsets</strong>  You select a random subset ([latex]m_{try}) of predictors from the total set (k) for each split. Bagging is a special case of Random Forest where m_{try} = k.

After understanding the two aforementioned features, the Random Forest algorithm is more easily understood

Random Forest Algorithm
For a tree t_{i} you’re building, you first select a bootstrap sample from the original training set for which you will learn on. You will grow an unpruned tree from this bootstrap sample. At each internal node, randomly select m_{try} predictors (from the total set of predictors) and determine best split using only these predictors. Additionally, don’t perform cost-complexity pruning.

Your overall prediction is the average response (for regression) or majority vote (classification) from all the individually trained trees.

Montillo then goes into some practical considerations, e.g., how are splits chosen (squared error, gini index), how many trees to build (build trees until error no longer decreases), or how to select m_{try} (use the recommended defaults of \sqrt{k} for regression and \frac{k}{3} for classification).

Additional Information for Free
Random Forests are able to estimate the test error during the building stage.

For each tree grown, ~33-36% of samples are not selected in the bootstrap, for which we call out-of-bootstrap (OOB) samples. We’re then able to take the OOB samples and feed them in as input to the corresponding tree, treating the predictions made as if they were true out-of-sample.

By various bookkeeping, majority vote (classification) or average response (regression) is computed from all the OOB samples from all trees.

Supposedly such test error is very accurate in practice, given a reasonable number of trees you build.

After going over Montillo’s presentation a couple times, the Wikipedia article and other assorted reading (from Google) made much more sense. I found the original paper and “official” documentation both informative and useful.

Circling back, in the end, it was nice to actually understand what the benchmark solution was doing. The R manual for the randomForest package was much more useable (if only because you’re more confident what the parameters for the functions actually mean), giving me the ability to modify the benchmark solution as I saw fit.