In deriving the linear regression solution, we will take a closer look at how we “solve” the common linear regression problem, i.e., how we find \beta in y = X\beta + \epsilon.

I say “common” because there are actually several ways to estimate \beta, depending on the assumptions you make about your data and how you correct for various anomalies. “Common” here refers specifically to ordinary least squares. For this case, I assume you already know the punch line, namely \beta = (X^{T}X)^{-1}X^{T}y. What we’re really interested in is how to get to that point.

The crux is that you’re trying to find a solution \beta that minimizes the sum of the squared errors, i.e., \min\limits_{\beta} \: \epsilon^{T}\epsilon. We can find the minimum by taking the derivative and setting it to zero, i.e., \frac{d}{d\beta} \epsilon^{T}\epsilon = 0.

In deriving the linear regression solution, it helps to remember two things. First, for the derivative of the inner product of two vectors, the product rule states that \frac{d}{dx}u^{T}v = u^{T}\frac{d}{dx}v + v^{T}\frac{d}{dx}u. Second, for the transpose of a matrix product, (AB)^{T} = B^{T}A^{T}.
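Both identities are easy to sanity-check numerically. The sketch below uses NumPy, which is an assumption of this example rather than part of the derivation. The functions `u`, `v`, `du`, and `dv` are hypothetical, chosen only so that their derivatives are easy to write by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Transpose identity: (AB)^T should equal B^T A^T for conformable matrices.
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))
transpose_ok = np.allclose((A @ B).T, B.T @ A.T)

# Product rule for u(x)^T v(x), with u and v vector-valued functions
# of a scalar x (hypothetical examples).
def u(x):
    return np.array([x, x**2, 1.0])

def v(x):
    return np.array([np.sin(x), 2.0 * x, x**3])

def du(x):
    # Elementwise derivative of u with respect to x.
    return np.array([1.0, 2.0 * x, 0.0])

def dv(x):
    # Elementwise derivative of v with respect to x.
    return np.array([np.cos(x), 2.0, 3.0 * x**2])

x0, h = 0.7, 1e-6

# Central finite difference of the inner product u(x)^T v(x) at x0.
numeric = (u(x0 + h) @ v(x0 + h) - u(x0 - h) @ v(x0 - h)) / (2 * h)

# Product rule: u^T dv/dx + v^T du/dx.
analytic = u(x0) @ dv(x0) + v(x0) @ du(x0)

product_rule_ok = np.isclose(numeric, analytic)
print(transpose_ok, product_rule_ok)
```

A finite-difference check like this is not a proof, but it does catch sign and ordering mistakes before they propagate into the rest of the derivation.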

Observe that y = X\beta + \epsilon \implies \epsilon = y - X\beta. As such, \frac{d}{d\beta} \epsilon^{T}\epsilon = \frac{d}{d\beta} (y-X\beta)^{T}(y-X\beta).

Working it out,

\frac{d}{d\beta} \epsilon^{T}\epsilon \\= \frac{d}{d\beta} (y-X\beta)^{T}(y-X\beta) \\= (y-X\beta)^{T} \frac{d}{d\beta}(y-X\beta) + (y-X\beta)^{T}\frac{d}{d\beta}(y-X\beta) \\= (y-X\beta)^{T}(-X) + (y-X\beta)^{T}(-X) \\= -2(y-X\beta)^{T}X \\= -2(y^{T} - \beta^{T}X^{T})X \\= -2(y^{T}X - \beta^{T}X^{T}X)
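One way to gain confidence in the derivative above is to compare it against a finite-difference approximation on random data. This is a minimal sketch, assuming NumPy. The data, the helper `sse`, and the step size `h` are all made up for illustration. The row vector -2(y^{T}X - \beta^{T}X^{T}X) is stored as a one-dimensional array.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3                      # hypothetical sample size and number of coefficients
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)         # an arbitrary point at which to check the derivative

def sse(b):
    """Sum of squared errors, epsilon^T epsilon, at coefficient vector b."""
    residual = y - X @ b
    return residual @ residual

# Derivative from the derivation: -2 (y^T X - beta^T X^T X).
analytic_grad = -2.0 * (y @ X - beta @ X.T @ X)

# Central finite differences, perturbing one coefficient at a time.
h = 1e-6
numeric_grad = np.array([
    (sse(beta + h * e) - sse(beta - h * e)) / (2 * h)
    for e in np.eye(p)
])

gradient_ok = np.allclose(numeric_grad, analytic_grad, rtol=1e-5, atol=1e-6)
print(gradient_ok)
```

Because the sum of squared errors is quadratic in \beta, the central difference agrees with the analytic derivative up to floating-point round-off.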

By setting the derivative to zero and solving for \beta, we can find the \beta that minimizes the sum of squared errors.

\frac{d}{d\beta} \epsilon^{T}\epsilon = 0 \\ \implies -2(y^{T}X - \beta^{T}X^{T}X) = 0 \\ \implies y^{T}X - \beta^{T}X^{T}X = 0 \\ \implies y^{T}X = \beta^{T}X^{T}X \\ \implies (y^{T}X)^{T} = (\beta^{T}X^{T}X)^{T} \\ \implies X^{T}y = X^{T}X\beta \\ \implies (X^{T}X)^{-1}X^{T}y = (X^{T}X)^{-1}(X^{T}X)\beta \\ \implies \beta = (X^{T}X)^{-1}X^{T}y
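To see the result in action, the sketch below evaluates \beta = (X^{T}X)^{-1}X^{T}y on simulated data and compares it with NumPy's built-in least-squares solver. The simulated data, `true_beta`, and the noise level are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3

# Simulated design matrix with an intercept column (hypothetical data).
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
true_beta = np.array([1.5, -2.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=n)

# Literal translation of beta = (X^T X)^{-1} X^T y.
beta_inverse = np.linalg.inv(X.T @ X) @ X.T @ y

# Same solution, obtained by solving X^T X beta = X^T y
# without forming the inverse explicitly.
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Reference answer from NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The derivative -2 (y^T X - beta^T X^T X) should vanish at the solution.
grad_at_solution = -2.0 * (y @ X - beta_solve @ X.T @ X)

print(beta_inverse)
print(beta_lstsq)
print(np.allclose(grad_at_solution, 0.0, atol=1e-8))
```

In practice, solving X^{T}X\beta = X^{T}y directly (or calling a least-squares routine) is usually preferred to computing the inverse, since it tends to be more numerically stable. The explicit inverse is shown here only because it mirrors the formula.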

Without too much difficulty, we arrived at the linear regression solution \beta = (X^{T}X)^{-1}X^{T}y. The general path to that derivation is to recognize that you’re trying to minimize the sum of squared errors (\epsilon^{T}\epsilon), which can be done by taking the derivative of \epsilon^{T}\epsilon with respect to \beta, setting it to zero, and then solving for \beta. Note that the final step requires X^{T}X to be invertible, which holds when the columns of X are linearly independent.