\(\newenvironment{mat}{\left[\begin{array}}{\end{array}\right]}
\newcommand{\colvec}[1]{\left[\begin{matrix}#1 \end{matrix}\right]}
\newcommand{\rowvec}[1]{[\begin{matrix} #1 \end{matrix}]}
\newcommand{\definiteintegral}[4]{\int_{#1}^{#2}\,#3\,d#4}
\newcommand{\indefiniteintegral}[2]{\int#1\,d#2}
\def\u{\underline}
\def\summ{\sum\limits}
\newcommand{\lt}{ < }
\newcommand{\gt}{ > }
\newcommand{\amp}{ & }
\)
      
      One of the most important techniques for constructing empirical models in
      math modeling is regression, most frequently linear regression (but we'll
      want to consider non-linear regression as well). 
      So, usually when we think of linear things, we think of straight
      line functions:
      \(y(x)=a+bx\)
      But if you're willing to go into higher dimensions, then planes
      are linear functions (and so on into higher and higher dimensions):
      \(z(x,y)=a+bx+cy\)
      They're linear in \(x\) and linear in \(y\). But it's also
      important to realize that these two kinds of functions are linear in
      their parameters (\(a\), \(b\), \(c\)) too. For that reason,
      we can use the process of linear regression to fit models that are
      non-linear in \(x\): quadratics, say, and even more
      complicated functions (such as sums of sines and cosines).
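      For instance, here is a minimal sketch (Python/numpy, with made-up data;
      the names and values are ours, not part of the text) showing that fitting
      a quadratic is still a *linear* least-squares problem, because the
      parameters \(a\), \(b\), \(c\) enter linearly:
```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 7.2, 12.8, 21.1])

# Design matrix with columns 1, x, x^2: the model y = a + b*x + c*x^2
# is linear in (a, b, c) even though it is quadratic in x.
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares via numpy's built-in solver.
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b, c = coeffs
print(a, b, c)
```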
      
      We start by converting our linear equation, given in slope-intercept
      form, into vector notation: we write \(y(x)\) as a so-called "inner
      product" of a "row vector" and a "column vector", 
	\begin{equation*}
	  y(x)=a+bx=1\cdot a+x\cdot b=
          \rowvec{ 1 \amp x}
          \colvec{ a \\ b}
	\end{equation*}
	Now we'd like this to be a line so wonderful that every point fits it,
	so that
	\begin{equation*}
	  \begin{array}{cc}
	  y_1=a+bx_1= \amp \rowvec{ 1 \amp x_1}\colvec{ a \\ b}\\
	  y_2=a+bx_2= \amp \rowvec{ 1 \amp x_2}\colvec{ a \\ b} \\
	  \vdots \amp \vdots \\
	  y_n=a+bx_n= \amp \rowvec{ 1 \amp x_n}\colvec{ a \\ b}
	  \end{array}
	\end{equation*}
	Let's write that stack of equations in "matrix" and "vector" format:
	the matrix product on the right just acts like a bunch of inner
	products of vectors, as in the first example above:
	\begin{equation*}
	  \left[
	  \begin{array}{c} 
	  y_1\\
	  y_2\\
	  \vdots\\
	  y_n
	  \end{array}
	  \right]_{n \times 1}
	  =
	  \left[
	  \begin{array}{cc} 
	  1 \amp x_1\\
	  1 \amp x_2\\
	  \vdots \amp \vdots \\
	  1 \amp x_n
	  \end{array}
	  \right]_{n \times 2}
	  \colvec{ a \\ b}_{2 \times 1}
	\end{equation*}
- On the left, we have a column vector (call it \(\u{y}\) -- notation:
  an underline means that it's a vector, and we are going to assume that
  vectors are column vectors -- tall and thin, not short and wide);
- on the right we have a matrix, call it \(X\) (note the notation:
  capital letters are usually matrices); and
- on the far right a column vector of parameters (call it \(\u{\beta}\)):
  \begin{equation*}
    \u{\beta}=\colvec{ a \\ b}
  \end{equation*}

It's \(\u{\beta}\) that we'd like to solve for -- we know everything
	else!
	
	Observe that we have written the dimensions of each object at the
	bottom right. Vectors are really just narrow matrices. 
	
	Written as an equation, we have
	\begin{equation*}\u{y}=X\u{\beta}\end{equation*}
	which looks linear! Our job is to find \(\u{\beta}\), the
	"slope 
	vector". 
	
	I hope that your impulse is to "divide by X" - but we're not sure what
	that means for matrices. In fact, it turns out that it's meaningless
	unless \(X\) is square (that is, \(n=2\) in this case, so that
	there are two rows and two columns). 
	
	If \(X\) IS square, \(2 \times 2\), then we should be able to
	find a unique line, provided the two points are distinct (i.e., they're
	not the same point!). This is because there's a unique line passing
	through two distinct points. 
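	A minimal sketch of this square case (made-up points, solved with
	numpy's linear solver):
```python
import numpy as np

# Two-point case (hypothetical data): X is square, so we can solve exactly.
x2 = np.array([1.0, 3.0])
y2 = np.array([2.0, 8.0])
X2 = np.column_stack([np.ones_like(x2), x2])   # 2 x 2 design matrix

a2, b2 = np.linalg.solve(X2, y2)   # the unique line through two distinct points
print(a2, b2)                      # here: a2 = -1.0, b2 = 3.0
```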
	
	Otherwise, if there are more than two points, then we can find a unique
	line only if the points are collinear - that is, they all fall exactly
	on the same line. And we don't expect this to happen, except by dumb
	luck. So generally our line won't fit all the points. We want to find
	some way to choose the "best line", of course. So how do we proceed?
	
	Suppose \(X\) is not square - that there are more than two
	points. Then how do we proceed? We use a trick of sorts to create a
	matrix that IS square: we multiply the matrix \(X\) by
	its transpose, which is to say the matrix \(X\) with its
	columns and rows interchanged (so the first column becomes the first
	row, and the first row becomes the first column, etc.). 
	
	It looks like this:
	\begin{equation*}
	  X^T=	  \left[
	  \begin{array}{cccc} 
	  1 \amp 1 \amp \cdots \amp 1 \\
	  x_1 \amp x_2 \amp \cdots \amp x_n
	  \end{array}
	  \right]_{2 \times n}
	\end{equation*}
	Note the interchange in dimensions (so this matrix is \(2 \times
	n\)). 
	
	Now, we multiply both sides of the equation
	\begin{equation*}\u{y}=X\u{\beta}\end{equation*}
	by \(X^T\), i.e.,
	\begin{equation*}X^T\u{y}=X^TX\u{\beta}\end{equation*}
	Written out, that's
	\begin{equation*}
	  \displaystyle 
	  \left[
	  \begin{array}{cccc} 
	  1 \amp 1 \amp \cdots \amp 1 \\
	  x_1 \amp x_2 \amp \cdots \amp x_n
	  \end{array}
	  \right]
	  \left[
	  \begin{array}{c} 
	  y_1\\
	  y_2\\
	  \vdots\\
	  y_n
	  \end{array}
	  \right]_{n \times 1}
	  =
	  \left[
	  \begin{array}{cccc} 
	  1 \amp 1 \amp \cdots \amp 1 \\
	  x_1 \amp x_2 \amp \cdots \amp x_n
	  \end{array}
	  \right]
	  \left[
	  \begin{array}{cc} 
	  1 \amp x_1\\
	  1 \amp x_2\\
	  \vdots \amp \vdots \\
	  1 \amp x_n
	  \end{array}
	  \right]_{n \times 2}
	  \colvec{ a \\ b}_{2 \times 1}
	\end{equation*}
	Using the same approach as for a row vector times a column vector, we
	multiply the rows of \(X^T\) by the column vector \(\u{y}\) on the left,
	and by the columns of \(X\) on the right, to obtain
	\begin{equation*}
	  \displaystyle 
	  \colvec{\summ_{i=1}^ny_i \\ \summ_{i=1}^nx_iy_i}
	  =	  
	  \left[
	  \begin{array}{cc} 
	  n \amp \summ_{i=1}^nx_i \\
	  \summ_{i=1}^nx_i \amp \summ_{i=1}^nx_i^2
	  \end{array}
	  \right]_{2 \times 2}
	  \u{\beta}
	\end{equation*}
	Those sums are kind of ugly with the dummy index, so let's just
	understand that each sum runs over the \(n\) data points, and
	drop the indices moving forward:
	\begin{equation*}
	  \displaystyle 
	  \colvec{\sum y_i \\ \sum x_iy_i}
	  =	  
	  \left[
	  \begin{array}{cc} 
	  n \amp \sum x_i \\
	  \sum x_i \amp \sum x_i^2
	  \end{array}
	  \right]
	  \u{\beta}
	\end{equation*}
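	As a quick numerical check (continuing the earlier sketch, with the
	same placeholder data), the entries of \(X^TX\) and \(X^T\u{y}\) really
	are those sums:
```python
# Form the normal equations X^T X beta = X^T y for the straight-line model.
XtX = X.T @ X
Xty = X.T @ y

n = len(x)
# The entries of X^T X are exactly n, sum(x_i), sum(x_i), sum(x_i^2):
assert np.allclose(XtX, [[n, x.sum()], [x.sum(), (x**2).sum()]])
# The entries of X^T y are sum(y_i) and sum(x_i * y_i):
assert np.allclose(Xty, [y.sum(), (x * y).sum()])
```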
	Okay, now you're ready to "divide" by the matrix
	multiplying \(\u{\beta}\), since it's square. Well, we don't so
	much divide as we multiply by the multiplicative inverse,
	and it turns out that the multiplicative inverse of a \(2 \times 2\)
	matrix is easy to compute:
	
	If \(M\) is a \(2 \times 2\) matrix, with entries
	\begin{equation*}
	  M=
	  \left[
	  \begin{array}{cc} 
	  p \amp q \\
	  r \amp s
	  \end{array}
	  \right]
	\end{equation*}
	then the multiplicative inverse \(M^{-1}\) is given by
	\begin{equation*}
	  M^{-1}=\frac{1}{ps-rq}
	  \left[
	  \begin{array}{cc} 
	  s \amp -q \\
	  -r \amp p
	  \end{array}
	  \right]
	\end{equation*}
	The product \(M^{-1}M=MM^{-1}=\left[
	  \begin{array}{cc} 
	  1 \amp 0 \\
	  0 \amp 1
	  \end{array}
	  \right]\) is called the "identity matrix" \(I\),
	because \(I\u{\beta}=\u{\beta}\) for all \(2 \times 1\) vectors \(\u{\beta}\). 
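	A short sketch of that inverse formula (assuming the determinant
	\(ps-rq\) is nonzero; the test matrix is made up):
```python
import numpy as np

def inverse_2x2(M):
    """Inverse of a 2x2 matrix via the determinant formula (assumes det != 0)."""
    (p, q), (r, s) = M
    det = p * s - r * q
    return np.array([[s, -q], [-r, p]]) / det

M = np.array([[4.0, 7.0], [2.0, 6.0]])
assert np.allclose(inverse_2x2(M) @ M, np.eye(2))   # M^{-1} M = I
assert np.allclose(M @ inverse_2x2(M), np.eye(2))   # M M^{-1} = I
```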
	
	So now we can solve for \(\u{\beta}\). Let
	\begin{equation*}
	  M=
	  \left[
	  \begin{array}{cc} 
	  p \amp q \\
	  r \amp s
	  \end{array}
	  \right]
	  =
	  \left[
	  \begin{array}{cc} 
	  n \amp \sum x_i \\
	  \sum x_i \amp \sum x_i^2
	  \end{array}
	  \right]
	\end{equation*}
	Then 
	\begin{equation*}
	  M^{-1}=
	  \frac{1}{n\sum x_i^2-\left(\sum x_i\right)^2}
	  \left[
	  \begin{array}{cc} 
	  \sum x_i^2 \amp -\sum x_i \\
	  -\sum x_i \amp n
	  \end{array}
	  \right]
	\end{equation*}
	Multiplying both sides of our equation by \(M^{-1}\) gives
	\begin{equation*}
	  M^{-1}
	  \displaystyle 
	  \colvec{\sum y_i \\ \sum x_iy_i}
	  =	  
	  M^{-1}
	  M
	  \u{\beta}
	  =	  
	  \u{\beta}
	\end{equation*}
	So
	\begin{equation*}
	  \begin{array}{rcl}
	  \colvec{ a \\ b}
	  \amp = \amp
	  M^{-1}
	  \colvec{\sum y_i \\ \sum x_iy_i} \\
	  \amp = \amp
	  \frac{1}{n\sum x_i^2-\left(\sum x_i\right)^2}
	  \left[
	  \begin{array}{cc} 
	  \sum x_i^2 \amp -\sum x_i \\
	  -\sum x_i \amp n
	  \end{array}
	  \right]
	  \colvec{\sum y_i \\ \sum x_iy_i} \\
	  \amp = \amp
	  \frac{1}{n\sum x_i^2-\left(\sum x_i\right)^2}
	  \colvec{
	  \sum x_i^2\sum y_i - \sum x_i\sum x_iy_i \\ 
	  n\sum x_i y_i - \sum x_i\sum y_i
	  } \\
	  \amp = \amp
	  \colvec{
	  \frac{\sum x_i^2\sum y_i - \sum x_i\sum x_iy_i}
	  {n\sum x_i^2-\left(\sum x_i\right)^2}\\ 
	  \frac{n\sum x_i y_i - \sum x_i\sum y_i}
	  {n\sum x_i^2-\left(\sum x_i\right)^2}
	  } \\
	  \amp = \amp
	  \colvec{
	  \frac{\overline{y}\sum x_i^2 - \overline{x}\sum x_iy_i}
	  {\sum x_i^2-n\overline{x}^2}\\ 
	  \frac{\sum x_i y_i - n\overline{x}\,\overline{y}}
	  {\sum x_i^2-n\overline{x}^2}
	  }
	  \end{array}
	\end{equation*}
	So
	\begin{equation*}
	  a =
	  \frac{\overline{y}\sum x_i^2 - \overline{x}\sum x_iy_i}
	  {\sum x_i^2-n\overline{x}^2}
	\end{equation*}
	and
	\begin{equation*}
	  b = \frac{\sum x_i y_i - n\overline{x}\ \overline{y}}
	  {\sum x_i^2-n\overline{x}^2}
	  = \frac{\overline{xy} - \overline{x}\ \overline{y}}
	  {\overline{x^2}-\overline{x}^2}
	\end{equation*}
	You can check that \(a=\overline{y}-b\overline{x}\). 
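	Continuing the sketch, we can compute \(a\) and \(b\) from these
	formulas and check that they agree with a direct least-squares solve
	(still on the placeholder data):
```python
# Slope and intercept from the closed-form sums/means.
xbar, ybar = x.mean(), y.mean()
b = ((x * y).mean() - xbar * ybar) / ((x**2).mean() - xbar**2)
a = ybar - b * xbar

# They should match a direct least-squares solve of y = X beta.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose([a, b], beta_hat)
```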
	
	  Now how do we get the regression diagnostics that are commonly used
	  when performing linear regression?
	  The first order of business is to check that the residuals are
	  well-behaved. We look for certain aspects of the residuals:
- Is there a pattern to the residuals? If so, go back to the modeling
  step to capture that pattern.
- Are the residuals independent? (This is related to the question of a pattern.)
- Are the residuals normally distributed (or at least identically
  distributed)? The assumptions of the regression procedure include the
  assumption that the errors are independent and identically
  distributed. (A rough numerical version of these checks is sketched
  after this list.)
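	  A rough numerical version of these checks might look like this
	  (continuing the sketch above; in practice you would also inspect
	  residual plots):
```python
# Rough residual checks for the straight-line fit (placeholder data).
y_hat = X @ np.array([a, b])          # fitted values
resid = y - y_hat                     # residuals

# Mean of residuals should be ~ 0 for a least-squares fit with an intercept.
print("mean residual:", resid.mean())

# A quick look for a pattern: correlation between residuals and x
# (should be ~ 0), and lag-1 autocorrelation (large values suggest dependence).
print("corr(resid, x):", np.corrcoef(resid, x)[0, 1])
print("lag-1 autocorr:", np.corrcoef(resid[:-1], resid[1:])[0, 1])
# One would also plot residuals vs. fitted values and a histogram or Q-Q plot
# to judge normality.
```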
Having established that our residuals are well-behaved, we can move
	  on to ask how well our model is capturing the variation in the data. 
	  There are certain quantities that are related to variances that we
	  begin by computing. The \(SS_{Tot}\) is the TOTAL sum of squared errors of
	  the null model, which is \(y(x)=\overline{y}\), the mean
	  value of the \(y_i\) values. So
	  \begin{equation*}
	    SS_{Tot}=\sum_{i=1}^n(y_i-\overline{y})^2
	  \end{equation*}
	  The SSE is the minimal sum of squared errors, the minimum value
	  obtained by using our optimal regression
	  parameters. Defining \(\hat{y}_i=a+bx_i\) as the estimate provided
	  by the model,
	  \begin{equation*}
	    SSE=\sum_{i=1}^n(y_i-\hat{y}_i)^2
	  \end{equation*}
	  Now we can also define a sum of squared errors of the regression
	  model, via
	  \begin{equation*}
	    SS_{Reg}=\sum_{i=1}^n(\hat{y}_i-\overline{y})^2
	  \end{equation*}
	  It turns out (perhaps surprisingly) that
	  \begin{equation*}
	    SS_{Tot}=SSE + SS_{Reg}
	  \end{equation*}
	  The variance explained by the model is the \(R^2\) value,
	  \begin{equation*}
	    R^2=\frac{SS_{Tot}-SSE}{SS_{Tot}}=\frac{SS_{Reg}}{SS_{Tot}}
	  \end{equation*}
	  which gives us the "big picture" fit to the data in the
	  scatterplot. \(R^2\) is between 0 and 1, with 0 meaning the fit is no
	  better than the horizontal null model, and 1 meaning the data fall
	  perfectly on the fitted line. \(R^2\) provides a measure of how much
	  of the total variability is explained by the model. 
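	  Continuing the sketch, these sums of squares and the decomposition can
	  be checked directly:
```python
# Sums of squares and R^2 for the straight-line fit.
ss_tot = np.sum((y - y.mean())**2)        # total: errors of the null model y = ybar
sse    = np.sum((y - y_hat)**2)           # residual (error) sum of squares
ss_reg = np.sum((y_hat - y.mean())**2)    # explained by the regression

assert np.isclose(ss_tot, sse + ss_reg)   # the decomposition SS_Tot = SSE + SS_Reg
r_squared = ss_reg / ss_tot
print("R^2 =", r_squared)
```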
	  
	  Our next objective is to obtain measures of fit for the parameters:
	  we would like to know their standard errors, which indicate how much
	  variation we expect from the "best values" obtained by using the normal
	  equations. 
	  
	  The MSE (mean squared error) is found from the SSE as
	  \begin{equation*}
	    MSE=\frac{SSE}{n-p}
	  \end{equation*}
	  where \(p\) is the number of parameters in the model (including
	  the intercept term). 
	  
	  In addition to the
	  MSE we need the inverse of the matrix \(X^TX\) (which was computed
	  to derive the normal equations) to get the standard errors of the
	  parameters. The standard error of the \(i^{th}\) parameter is obtained
	  as the square root of the product of the MSE and the \(i^{th}\)
	  diagonal entry of the inverse \((X^TX)^{-1}\):
	  \begin{equation*}
	    SE_{i}=\sqrt{MSE*(X^TX)^{-1}_{ii}}
	  \end{equation*}
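	  In the running sketch, with \(p=2\) parameters, this looks like:
```python
# Standard errors of the parameters (continuing the sketch; p = 2 here).
p = X.shape[1]                            # number of parameters, incl. intercept
mse = sse / (n - p)                       # mean squared error
cov_beta = mse * np.linalg.inv(X.T @ X)   # estimated covariance of beta-hat
se = np.sqrt(np.diag(cov_beta))           # SE_i = sqrt(MSE * (X^T X)^{-1}_{ii})
print("SE(a), SE(b) =", se)
```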
	  The t-statistics are the parameters divided by these standard 
	  errors SE: \(t_i=\frac{\beta_i}{SE_i}\). Then, in order to
	  calculate the (two-sided) probability of a value at least as large
	  in magnitude as \(t_i\) (assuming a mean of zero),
	  we compute 
	  \begin{equation*}
	    p_i=2(1-CDF(|t_i|,\nu=n-p))
	  \end{equation*}
	  where CDF is the cumulative distribution function of the
	  t-distribution with \(\nu\) degrees of freedom. 
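	  A sketch of that computation (this assumes SciPy is available; its
	  stats.t distribution object supplies the CDF and survival function):
```python
from scipy import stats

# t-statistics and two-sided p-values for the fitted parameters
# (continuing from the standard errors above).
beta_hat = np.array([a, b])
t_stats = beta_hat / se
# Two-sided p-value: 2 * (1 - CDF(|t_i|)) with nu = n - p degrees of freedom;
# stats.t.sf is the survival function 1 - CDF.
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p)
print(t_stats, p_values)
```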
	  
	  Equivalently, we could construct \(\rho\)-confidence intervals, of the form
	  \begin{equation*}
	    [\beta_i - t^{crit}_{\rho,\nu=n-p} SE_i,\beta_i +
	    t^{crit}_{\rho,\nu=n-p} SE_i]
	  \end{equation*}
	  where we find \(t^{crit}\) by asking what width of the
	  standardized t-distribution provides us with the given confidence,
	  and then we use that width to multiply the appropriate standard
	  error \(SE_i\).
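	  And a sketch of the corresponding intervals (again assuming SciPy,
	  with \(\rho=0.95\) chosen as an example):
```python
# rho = 0.95 confidence intervals for the parameters (sketch).
rho = 0.95
t_crit = stats.t.ppf(1 - (1 - rho) / 2, df=n - p)   # two-sided critical value
lower = beta_hat - t_crit * se
upper = beta_hat + t_crit * se
print(np.column_stack([lower, upper]))
```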