Deriving the Linear Regression Solution

In deriving the linear regression solution, we will be taking a closer look at how we “solve” the common linear regression, i.e., finding \beta in y = X\beta + \epsilon.

I mention “common,” because there are actually several ways you can get an estimate for \beta based on assumptions of your data and how you can correct for various anomalies. “Common” in this case specifically refers to ordinary least squares. For this specific case, I assume you already know the punch line, that is, \beta = (X^{T}X)^{-1}X^{T}y. But, what we’re really interested in is how to get to that point.

The crux is that you’re trying to find a solution \beta that minimizes the sum of the squared errors, i.e., \min\limits_{\beta} \: \epsilon^{T}\epsilon. We can find the minimum by taking the derivative and setting it to zero, i.e., \frac{d}{d\beta} \epsilon^{T}\epsilon = 0.

In deriving the linear regression solution, it helps to remember two things. Regarding derivatives of two vectors, the product rule states that \frac{d}{dx}u^{T}v = u^{T}\frac{d}{dx}v + v^{T}\frac{d}{dx}u. See this and that. And, for matrix transpose, (AB)^{T} = B^{T}A^{T}.

Observe that y = X\beta + \epsilon \implies \epsilon = y - X\beta. As such, \frac{d}{d\beta} \epsilon^{T}\epsilon = \frac{d}{d\beta} (y-X\beta)^{T}(y-X\beta).

Working it out,
\frac{d}{d\beta} \epsilon^{T}\epsilon \\= \frac{d}{d\beta} (y-X\beta)^{T}(y-X\beta) \\= (y-X\beta)^{T} \frac{d}{d\beta}(y-X\beta) + (y-X\beta)^{T}\frac{d}{d\beta}(y-X\beta) \\= (y-X\beta)^{T}(-X) + (y-X\beta)^{T}(-X) \\= -2(y-X\beta)^{T}X \\= -2(y^{T} - \beta^{T}X^{T})X \\= -2(y^{T}X - \beta^{T}X^{T}X)

By setting the derivative to zero and solving for \beta, we can find the \beta that minimizes the sum of squared errors.
\frac{d}{d\beta} \epsilon^{T}\epsilon = 0 \\ \implies -2(y^{T}X - \beta^{T}X^{T}X) = 0 \\ \implies y^{T}X - \beta^{T}X^{T}X = 0 \\ \implies y^{T}X = \beta^{T}X^{T}X \\ \implies (y^{T}X)^{T} = (\beta^{T}X^{T}X)^{T} \\ \implies X^{T}y = X^{T}X\beta \\ \implies (X^{T}X)^{-1}X^{T}y = (X^{T}X)^{-1}(X^{T}X)\beta \\ \implies \beta = (X^{T}X)^{-1}X^{T}y

Without too much difficulty, we saw how we arrived at the linear regression solution of \beta = (X^{T}X)^{-1}X^{T}y. The general path to that derivation is to recognize that you’re trying to minimize the sum of squared errors (\epsilon^{T}\epsilon), which can be done by finding the derivative of \epsilon^{T}\epsilon, setting it to zero, and then solving for \beta.

Mean-Variance Portfolio Optimization with R and Quadratic Programming

The following is a demonstration of how to use R to do quadratic programming in order to do mean-variance portfolio optimization under different constraints, e.g., no leverage, no shorting, max concentration, etc.

Taking a step back, it’s probably helpful to realize the point of all of this. In the 1950s, Harry Markowitz introduced what we now call Modern Portfolio Theory (MPT), which is a mathematical formulation for diversification. Intuitively, because some stocks zig when others zag, when we hold a portfolio of these stocks, our portfolio can have some notional return at a lower variance than holding the stocks outright. More specifically, given a basket of stocks, there exists a notion of an efficient frontier. I.e., for any return you choose, there exists a portfolio with the lowest variance and for any variance you fix, there exists a portfolio with the greatest return. Any portfolio you choose that is not on this efficient frontier is considered sub-optimal (for a given return, why would you choose a a higher variance portfolio when a lower one exists).

The question becomes if given a selection of stocks to choose from, how much do we invest in each stock if at all?

In an investments course I took a while back, we worked the solution for the case where we had a basket of three stocks to choose from, in Excel. Obviously, this solution wasn’t really scalable outside of the N=3 case. When asked about extending N to an arbitrary number, the behind-schedule-professor did some handwaving about matrix math. Looking into this later, there does exist a closed-form equation for determining the holdings for an arbitrary basket of stocks. However, the math starts getting more complicated with each constraint you decide to tack on (e.g., no leverage).

The happy medium between “portfolio optimizer in Excel for three stocks” and “hardcore matrix math for an arbitrary number of stocks” is to use a quadratic programming solver. Some context is needed to see why this is the case.

Quadratic Programming
According to wikipedia, quadratic programming attempts to minimize a function of the form \frac{1}{2}x^{T}Qx + c^{T}x subject to one or more constraints of the form Ax \le b (inequality) or Ex = d (equality).

Modern Portfolio Theory
The mathematical formulation of MPT is that for a given risk tolerance q \in [0,\infty), we can find the efficient frontier by minimizing w^{T} \Sigma w - q*R^{T}w.


  • w is a vector of holding weights such that \sum w_i = 1
  • \Sigma is the covariance matrix of the returns of the assets
  • q \ge 0 is the “risk tolerance”: q = 0 works to minimize portfolio variance and q = \infty works to maximize portfolio return
  • R is the vector of expected returns
  • w^{T} \Sigma w is the variance of portfolio returns
  • R^{T} w is the expected return on the portfolio

My introducing of quadratic programming before mean-variance optimization was clearly setup, but look at the equivalence between \frac{1}{2}x^{T}Qx + c^{T}x and w^{T} \Sigma w - q*R^{T}w.

Quadratic Programming in R
solve.QP, from quadprog, is a good choice for a quadratic programming solver. From the documentation, it minimizes quadratic programming problems of the form -d^{T}b + \frac{1}{2} b^{T}Db with the constraints A^{T}b \ge b_0. Pedantically, note the variable mapping of D = 2\Sigma (this is to offset the \frac{1}{2} in the implied quadratic programming setup) and d = R.

The fun begins when we have to modify A^{T}b \ge b_0 to impose the constraints we’re interested in.

Loading Up the Data
I went to google finance and downloaded historical data for all of the sector SPDRs, e.g., XLY, XLP, XLE, XLF. I’ve named the files in the format of dat.{SYMBOL}.csv. The R code loads it up, formats it, and then ultimately creates a data frame where each column is the symbol and each row represents an observation (close to close log return).

The data is straight-forward enough, with approximately 13 years worth:

> dim(dat.ret)
[1] 3399    9
> head(dat.ret, 3)
              XLB         XLE          XLF         XLI          XLK
[1,]  0.010506305  0.02041755  0.014903406 0.017458395  0.023436164
[2,]  0.022546751 -0.00548872  0.006319802 0.013000812 -0.003664126
[3,] -0.008864066 -0.00509339 -0.013105239 0.004987542  0.002749353
              XLP          XLU          XLV          XLY
[1,]  0.023863921 -0.004367553  0.022126545  0.004309507
[2,] -0.001843998  0.018349139  0.006232977  0.018206972
[3,] -0.005552485 -0.005303294 -0.014473165 -0.009255754

Mean-Variance Optimization with Sum of Weights Equal to One
If it wasn’t clear before, we typically fix the q in w^{T} \Sigma w - q*R^{T}w before optimization. By permuting the value of q, we then generate the efficient frontier. As such, for these examples, we’ll set q = 0.5.

solve.QP’s arguments are:

solve.QP(Dmat, dvec, Amat, bvec, meq=0, factorized=FALSE)

Dmat (covariance) and dvec (penalized returns) are generated easily enough:


Amat and bvec are part of the inequality (or equality) you can impose, i.e., A^{T}b \ge b_0. meq is an integer argument that specifies “how many of the first meq constraints are equality statements instead of inequality statements.” The default for meq is zero.

By construction, you need to think of the constraints in terms of matrix math. E.g., to have all the weights sum up to one, Amat needs to contain a column of ones and bvec needs to contain a single value of one. Additionally, since it’s an equality contraint, meq needs to be one.

In R code:

# Constraints: sum(x_i) = 1

Having instantiated all the arguments for solve.QP, it’s relatively straightforward to invoke it. Multiple things are outputted, e.g., constrained solution, unconstrained solution, number of iterations to solve, etc. For our purpose, we’re primarily just interested in the solution.

> qp  qp$solution
[1] -0.1489193  0.6463653 -1.0117976  0.4107733 -0.4897956  0.2612327 -0.1094819
[8]  0.5496478  0.8919753

Things to note in the solution are that we have negative values (shorting is allowed) and there exists at least one weight whose absolute value is greater than one (leverage is allowed).

Mean-Variance Optimization with Sum of Weights Equal to One and No Shorting
We need to modify Amat and bvec to add the constraint of no shorting. In writing, we want to add a diagonal matrix of ones to Amat and a vector of zeros to bvec, which works out when doing the matrix multiplication that for each weight, its value must be greater than zero.

# Constraints: sum(x_i) = 1 & x_i >= 0
Amat  qp$solution
[1] 0.0000000 0.4100454 0.0000000 0.0000000 0.0000000 0.3075880 0.0000000
[8] 0.2823666 0.0000000

Note that with the constraints that all the weights sum up to one and that the weights are positive, we’ve implicitly also constrained the solution to have no leverage.

Mean-Variance Optimization with Sum of Weights Equal to One, No Shorting, and No Heavy Concentration
Looking at the previous solution, note that one of the weights suggests that we put 41% of our portfolio into a single asset. We may not be comfortable with such a heavy allocation, and we might want to impose the additional constraint that no single asset in our portfolio takes up more than 15%. In math and with our existing constraints, that’s the same as saying -x \ge -0.15 which is equivalent to saying x \le 0.15.

# Constraints: sum(x_i) = 1 & x_i >= 0 & x_i <= 0.15
Amat  qp$solution
[1] 0.1092174 0.1500000 0.0000000 0.1407826 0.0000000 0.1500000 0.1500000
[8] 0.1500000 0.1500000

Turning the Weights into Expected Portfolio Return and Expected Portfolio Volatility
With our weights, we can now calculate the portfolio return as R^{T}w and portfolio volatility as sqrt{w^T \Sigma w}. Doing this, we might note that the values look “small” and not what you expected. Keep in mind that our observations are in daily-space and thus our expected return is expected daily return and expected volatility is expected daily volatility. You will need to annualize it, i.e., R^{T}w * 252 and \sqrt{w^{T} \Sigma w * 252}.

The following is an example of the values of the weights and portfolio statistics while permuting the risk parameter and solving the quadratic programming problem with the constraints that the weights sum to one and there’s no shorting.

> head(ef.w)
      XLB       XLE XLF XLI XLK XLP XLU       XLV        XLY
1       0 0.7943524   0   0   0   0   0 0.1244543 0.08119329
1.005   0 0.7977194   0   0   0   0   0 0.1210635 0.08121713
1.01    0 0.8010863   0   0   0   0   0 0.1176727 0.08124097
1.015   0 0.8044533   0   0   0   0   0 0.1142819 0.08126480
1.02    0 0.8078203   0   0   0   0   0 0.1108911 0.08128864
1.025   0 0.8111873   0   0   0   0   0 0.1075003 0.08131248
> head(ef.stat)
             ret        sd
1     0.06663665 0.2617945
1.005 0.06679809 0.2624120
1.01  0.06695954 0.2630311
1.015 0.06712098 0.2636519
1.02  0.06728243 0.2642742
1.025 0.06744387 0.2648981

Note that as we increase the risk parameter, we’re working to maximize return at the expense of risk. While obvious, it’s worth stating that we’re looking at the efficient frontier. If you plotted ef.stat in its entirety on a plot whose axis are in return space and risk space, you will get the efficient frontier.

Wrap Up
I’ve demonstrated how to use R and the quadprog package to do quadratic programming. It also happens to coincide that the mean-variance portfolio optimization problem really lends itself to quadratic programming. It’s relatively straightforward to do variable mapping between the two problems. The only potential gotcha is how to state your desired constraints into the form A^{T}b \ge b_{0}, but several examples of constraints were given, for which you can hopefully extrapolate from.

Getting away from the mechanics and talking about the theory, I’ll also offer that there are some serious flaws with the approach demonstrated if you attempt to implement this for your own trading. Specifically, you will most likely want to create return forecasts and risk forecasts instead of using historical values only. You might also want to impose constraints to induce sparsity on what you actually hold, in order to minimize transaction costs. In saying that your portfolio is mean-variance optimal, there’s the assumption that the returns you’re working with is normal, which is definitely not the case. These and additional considerations will need to be handled before you let this run in “production.”

All that being said, however, Markowitz’s mean-variance optimization is the building block for whatever more robust solution you might end up coming with. And, an understanding in both theory and implementation of a mean-variance optimization is needed before you can progress.

Helpful Links
Lecture on Quadratic Programming and Markowitz Model (R. Vanderbei)
Lecture on Linear Programming and a Modified Markowitz Model (R. Vanderbei)

Random Forests Brain Dump

Edit 8/7/2020:

In the process of resurrecting this site, I came upon this blog post that’s over eight years old, and I semi-cringe.

My current understanding of “how we got to” random forest is this:

  • Bagging, short for Bootstrap Aggregation, is a technique to take low bias high variance methods, e.g., decision trees, and lowering the variance.
  • This is simply done by taking bootstraps of the original data, fitting with trees B times and then averaging it. The decreased variance is similar to var(\bar{x}) = \frac{var(x)}{n}.
  • The challenge is you don’t get to that level of decreased variance since there’s correlation amongst the trees, e.g., a particularly dominant feature is in every bootstrapped tree.
  • We attempt to mitigate this with Random Forest, where for each iteration of fitting the tree, we fit on some random subset of the features. For that matter, we can do this on each split.
  • Finally, when the “random subset” is all of the features, then we get back to Bagging.

Original Post:

Revisiting Kaggle, a site and service which hosts multiple data-mining competitions, I found a new competition that looked potentially interesting. It’s been a while since I’ve fully downloaded any competition’s data, so I was piqued by the inclusion of R code under a file named sample_code.R which didn’t exist before.

Analyzing the code, it was clear that the purpose of the code was to provide two submittable benchmark solutions. One was the naive approach, using the mean of the dependent variable as your predictor. The second was using Random Forests, a machine learning algorithm. In this particular competition, you were asked to to predict n variables, so there were n Random Forest predictors for each variable.

Not knowing much about Random Forests, I spent a portion of that day trying to see what it was all about and understand its mechanism.

Wikipedia, as usual, gave me the practitioner’s definition. In short, it “is an ensemble classifier consisting of many decision trees and outputs the class that is the mode of the classes output by the individual trees.” It helps to understand ensemble in this context as an averaging over a set of sub-models, which happens to be decision trees in this case. It then classifies your particular example by seeing how the sub-models (decision trees) each classified it, and then takes the most-occurring classification as the final classification for your particular example. Interestingly, the name has been trademarked.

I later found a presentation by Albert A. Montillo going over Random Forests, breaking it down in more digestible bits than Wikipedia with more examples. Following is a brief summary of some points I found useful from his presentation.

Random Forest’s first randomization is through bagging
A bootstrap sample is a training set (N' < N) with random sampling (with replacement).
Bootstrap aggregation is a parallel combination of learners (decision trees for Random Forests) independently trained on distinct bootstrap samples.
Bagging refers to bootstrap aggregation (independent training on learners with distinct bootstrap samples).

Final prediction is either the mean prediction of the independent learners (for regression) or the most-picked classification (for classification).

Random Forest’s second randomization is through predictor subsets
You select a random subset (m_{try}) of predictors from the total set (k) for each split. Bagging is a special case of Random Forest where m_{try} = k.

After understanding the two aforementioned features, the Random Forest algorithm is more easily understood

Random Forest Algorithm
For a tree t_{i} you’re building, you first select a bootstrap sample from the original training set for which you will learn on. You will grow an unpruned tree from this bootstrap sample. At each internal node, randomly select m_{try} predictors (from the total set of predictors) and determine best split using only these predictors. Additionally, don’t perform cost-complexity pruning.

Your overall prediction is the average response (for regression) or majority vote (classification) from all the individually trained trees.

Montillo then goes into some practical considerations, e.g., how are splits chosen (squared error, gini index), how many trees to build (build trees until error no longer decreases), or how to select m_{try} (use the recommended defaults of \sqrt{k} for regression and \frac{k}{3} for classification).

Additional Information for Free
Random Forests are able to estimate the test error during the building stage.

For each tree grown, ~33-36% of samples are not selected in the bootstrap, for which we call out-of-bootstrap (OOB) samples. We’re then able to take the OOB samples and feed them in as input to the corresponding tree, treating the predictions made as if they were true out-of-sample.

By various bookkeeping, majority vote (classification) or average response (regression) is computed from all the OOB samples from all trees.

Supposedly such test error is very accurate in practice, given a reasonable number of trees you build.

After going over Montillo’s presentation a couple times, the Wikipedia article and other assorted reading (from Google) made much more sense. I found the original paper and “official” documentation both informative and useful.

Circling back, in the end, it was nice to actually understand what the benchmark solution was doing. The R manual for the randomForest package was much more useable (if only because you’re more confident what the parameters for the functions actually mean), giving me the ability to modify the benchmark solution as I saw fit.