# Deriving the Linear Regression Solution

Here we take a closer look at how we “solve” the common linear regression, i.e., how we find $\beta$ in $y = X\beta + \epsilon$.

I say “common” because there are actually several ways to estimate $\beta$, depending on what you assume about your data and how you correct for various anomalies. “Common” in this case specifically refers to ordinary least squares. For this specific case, I assume you already know the punch line, that is, $\beta = (X^{T}X)^{-1}X^{T}y$. But what we’re really interested in is how to get to that point.

The crux is that you’re trying to find a solution $\beta$ that minimizes the sum of the squared errors, i.e., $\min\limits_{\beta} \: \epsilon^{T}\epsilon$. We can find the minimum by taking the derivative and setting it to zero, i.e., $\frac{d}{d\beta} \epsilon^{T}\epsilon = 0$.
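To make the objective concrete, here is a minimal NumPy sketch of the sum of squared errors. The function name `sum_squared_errors` and the toy data are assumptions chosen for illustration; they are not part of the derivation itself.

```python
import numpy as np

def sum_squared_errors(X, y, beta):
    """Return eps^T eps, where eps = y - X beta."""
    eps = y - X @ beta
    return float(eps @ eps)

# Hypothetical toy data: an intercept column plus one predictor.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

# Here y = 1 + 2x exactly, so beta = (1, 2) leaves no error at all,
# while a different beta gives a strictly larger objective value.
sse_exact = sum_squared_errors(X, y, np.array([1.0, 2.0]))
sse_other = sum_squared_errors(X, y, np.array([0.0, 2.0]))
```

Minimizing over $\beta$ means searching for the coefficient vector that makes this number as small as possible.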

In deriving the linear regression solution, it helps to remember two things. First, the product rule for the derivative of an inner product of two vectors: $\frac{d}{dx}u^{T}v = u^{T}\frac{d}{dx}v + v^{T}\frac{d}{dx}u$. Second, the transpose of a matrix product: $(AB)^{T} = B^{T}A^{T}$.

Observe that $y = X\beta + \epsilon \implies \epsilon = y - X\beta$. As such, $\frac{d}{d\beta} \epsilon^{T}\epsilon = \frac{d}{d\beta} (y-X\beta)^{T}(y-X\beta)$.

Working it out,
$$\begin{aligned}
\frac{d}{d\beta} \epsilon^{T}\epsilon &= \frac{d}{d\beta} (y-X\beta)^{T}(y-X\beta) \\
&= (y-X\beta)^{T} \frac{d}{d\beta}(y-X\beta) + (y-X\beta)^{T}\frac{d}{d\beta}(y-X\beta) \\
&= (y-X\beta)^{T}(-X) + (y-X\beta)^{T}(-X) \\
&= -2(y-X\beta)^{T}X \\
&= -2(y^{T} - \beta^{T}X^{T})X \\
&= -2(y^{T}X - \beta^{T}X^{T}X)
\end{aligned}$$
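One way to sanity-check the final expression is to compare it against a numerical derivative. The sketch below does that with central finite differences on random data; the data, the seed, the step size `h`, and the helper name `sse` are all assumptions made for this check.

```python
import numpy as np

# Hypothetical random problem: 20 observations, 3 coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
beta = rng.normal(size=3)

def sse(b):
    """Sum of squared errors eps^T eps for coefficients b."""
    eps = y - X @ b
    return eps @ eps

# Analytic derivative from the derivation: -2 (y^T X - beta^T X^T X)
analytic = -2.0 * (y @ X - beta @ X.T @ X)

# Central finite differences, perturbing one coefficient at a time.
h = 1e-6
numeric = np.array([
    (sse(beta + h * e) - sse(beta - h * e)) / (2 * h)
    for e in np.eye(3)
])

gradient_matches = np.allclose(analytic, numeric, rtol=1e-5, atol=1e-6)
```

Because the objective is quadratic in $\beta$, the central difference agrees with the analytic derivative up to rounding error, so `gradient_matches` should come out `True`.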

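By setting the derivative to zero and solving for $\beta$, we can find the $\beta$ that minimizes the sum of squared errors. The last two steps assume that $X^{T}X$ is invertible, which holds when the columns of $X$ are linearly independent.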
$$\begin{aligned}
\frac{d}{d\beta} \epsilon^{T}\epsilon = 0 &\implies -2(y^{T}X - \beta^{T}X^{T}X) = 0 \\
&\implies y^{T}X - \beta^{T}X^{T}X = 0 \\
&\implies y^{T}X = \beta^{T}X^{T}X \\
&\implies (y^{T}X)^{T} = (\beta^{T}X^{T}X)^{T} \\
&\implies X^{T}y = X^{T}X\beta \\
&\implies (X^{T}X)^{-1}X^{T}y = (X^{T}X)^{-1}(X^{T}X)\beta \\
&\implies \beta = (X^{T}X)^{-1}X^{T}y
\end{aligned}$$
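The closed form can be tried out numerically. The following is a minimal sketch on simulated data, assuming $X$ has linearly independent columns; the sample size, the seed, `true_beta`, and the noise level are illustrative choices. It solves $X^{T}y = X^{T}X\beta$ with `np.linalg.solve` and compares the result against NumPy's own least-squares routine, `np.linalg.lstsq`.

```python
import numpy as np

# Simulated data (hypothetical): intercept plus two predictors.
rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=n)

# Solve X^T X beta = X^T y for beta instead of forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference answer from NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The derivative -2 (y^T X - beta^T X^T X) should vanish at beta_hat.
grad = -2.0 * (y @ X - beta_hat @ X.T @ X)

solutions_agree = np.allclose(beta_hat, beta_lstsq)
gradient_is_zero = np.allclose(grad, 0.0, atol=1e-8)
```

Solving the linear system is mathematically equivalent to multiplying by $(X^{T}X)^{-1}$, and it is generally the numerically safer choice than computing the inverse explicitly.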

Without too much difficulty, we arrived at the linear regression solution $\beta = (X^{T}X)^{-1}X^{T}y$. The general path of the derivation is to recognize that you’re trying to minimize the sum of squared errors ($\epsilon^{T}\epsilon$), which can be done by taking the derivative of $\epsilon^{T}\epsilon$ with respect to $\beta$, setting it to zero, and then solving for $\beta$.