
# Deriving the Linear Regression Solution

In deriving the linear regression solution, we will take a closer look at how to “solve” the common linear regression, i.e., how to find $$\beta$$ in $$y = X\beta + \epsilon$$.

I say “common” because there are actually several ways to estimate $$\beta$$, depending on the assumptions you make about your data and how you correct for various anomalies. “Common” here refers specifically to ordinary least squares. For this case, I assume you already know the punch line, namely $$\beta = (X^{T}X)^{-1}X^{T}y$$. What we’re really interested in is how to get there.

The crux is that you’re trying to find a solution $$\beta$$ that minimizes the sum of the squared errors, i.e., $$\min\limits_{\beta} \: \epsilon^{T}\epsilon$$. We can find the minimum by taking the derivative and setting it to zero, i.e., $$\frac{d}{d\beta} \epsilon^{T}\epsilon = 0$$.

In deriving the linear regression solution, it helps to remember two facts. For derivatives of two vectors, the product rule states that $$\frac{d}{dx}u^{T}v = u^{T}\frac{d}{dx}v + v^{T}\frac{d}{dx}u$$. And for the matrix transpose, $$(AB)^{T} = B^{T}A^{T}$$.
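As a quick sanity check, the transpose identity can be verified numerically. This sketch (not from the original derivation; the matrix shapes are arbitrary) uses NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))

# (AB)^T should equal B^T A^T, element for element
assert np.allclose((A @ B).T, B.T @ A.T)
```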

Observe that $$y = X\beta + \epsilon \implies \epsilon = y - X\beta$$. As such, $$\frac{d}{d\beta} \epsilon^{T}\epsilon = \frac{d}{d\beta} (y-X\beta)^{T}(y-X\beta)$$.

Working it out,
$$\frac{d}{d\beta} \epsilon^{T}\epsilon \\= \frac{d}{d\beta} (y-X\beta)^{T}(y-X\beta) \\= (y-X\beta)^{T} \frac{d}{d\beta}(y-X\beta) + (y-X\beta)^{T}\frac{d}{d\beta}(y-X\beta) \\= (y-X\beta)^{T}(-X) + (y-X\beta)^{T}(-X) \\= -2(y-X\beta)^{T}X \\= -2(y^{T} - \beta^{T}X^{T})X \\= -2(y^{T}X - \beta^{T}X^{T}X)$$
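We can check this derived gradient against a numerical one. The sketch below (the data and $$\beta$$ are random placeholders, not from the post) compares the transposed form of the result, $$-2(X^{T}y - X^{T}X\beta)$$, with central finite differences of the sum of squared errors:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
beta = rng.standard_normal(3)

def sse(b):
    # Sum of squared errors: e^T e with e = y - Xb
    e = y - X @ b
    return e @ e

# Analytic gradient from the derivation, written as a column vector:
# transpose of -2(y^T X - beta^T X^T X) is -2(X^T y - X^T X beta)
grad = -2 * (X.T @ y - X.T @ X @ beta)

# Central finite differences as a sanity check
h = 1e-6
num = np.array([
    (sse(beta + h * np.eye(3)[i]) - sse(beta - h * np.eye(3)[i])) / (2 * h)
    for i in range(3)
])
assert np.allclose(grad, num, atol=1e-3)
```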

By setting the derivative to zero and solving for $$\beta$$, we can find the $$\beta$$ that minimizes the sum of squared errors (assuming $$X^{T}X$$ is invertible, which holds when the columns of $$X$$ are linearly independent).
$$\frac{d}{d\beta} \epsilon^{T}\epsilon = 0 \\ \implies -2(y^{T}X - \beta^{T}X^{T}X) = 0 \\ \implies y^{T}X - \beta^{T}X^{T}X = 0 \\ \implies y^{T}X = \beta^{T}X^{T}X \\ \implies (y^{T}X)^{T} = (\beta^{T}X^{T}X)^{T} \\ \implies X^{T}y = X^{T}X\beta \\ \implies (X^{T}X)^{-1}X^{T}y = (X^{T}X)^{-1}(X^{T}X)\beta \\ \implies \beta = (X^{T}X)^{-1}X^{T}y$$

Without too much difficulty, we arrived at the linear regression solution $$\beta = (X^{T}X)^{-1}X^{T}y$$. The general path to the derivation is to recognize that you’re minimizing the sum of squared errors ($$\epsilon^{T}\epsilon$$), take the derivative of $$\epsilon^{T}\epsilon$$, set it to zero, and solve for $$\beta$$.
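The closed-form solution is easy to try out. This sketch (the data here is synthetic, made up for illustration) computes $$\beta = (X^{T}X)^{-1}X^{T}y$$ and compares it against NumPy’s least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 4))
true_beta = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_beta + 0.01 * rng.standard_normal(50)

# Closed-form OLS: solve the normal equations (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Should agree with numpy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```

Note that `np.linalg.solve` on the normal equations is preferred over forming $$(X^{T}X)^{-1}$$ explicitly with `np.linalg.inv`, since explicit inversion is slower and less numerically stable.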