\(\newenvironment{mat}{\left[\begin{array}}{\end{array}\right]}
\newcommand{\colvec}[1]{\left[\begin{matrix}#1 \end{matrix}\right]}
\newcommand{\rowvec}[1]{[\begin{matrix} #1 \end{matrix}]}
\newcommand{\definiteintegral}[4]{\int_{#1}^{#2}\,#3\,d#4}
\newcommand{\indefiniteintegral}[2]{\int#1\,d#2}
\def\u{\underline}
\def\summ{\sum\limits}
\newcommand{\lt}{ < }
\newcommand{\gt}{ > }
\newcommand{\amp}{ & }
\)
      
      One of the most important techniques for constructing empirical models in
      math modeling is regression, most frequently linear regression (but we'll
      want to consider non-linear regression as well). 
      So, usually when we think of linear things, we think of straight
      line functions:
      \(y(x)=a+bx\)
      But if you're willing to go into higher dimensions, then planes
      are linear functions (and so on into higher and higher dimensions):
      \(z(x,y)=a+bx+cy\)
      They're linear in \(x\) and linear in \(y\). But it's also
      important to realize that these two kinds of functions are linear in
      their parameters (\(a\), \(b\), \(c\)) too. For that reason,
      we can use the process of linear regression to fit models that are
      non-linear in \(x\): quadratics, say, and even more
      complicated functions (such as sums of sines and cosines).
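      For instance, here is a minimal sketch (Python/numpy, with made-up data;
      the names and values are ours, not part of the text) showing that fitting
      a quadratic is still a *linear* least-squares problem, because the
      parameters \(a\), \(b\), \(c\) enter linearly:
```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 7.2, 12.8, 21.1])

# Design matrix with columns 1, x, x^2: the model y = a + b*x + c*x^2
# is linear in (a, b, c) even though it is quadratic in x.
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares via numpy's built-in solver.
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b, c = coeffs
print(a, b, c)
```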
      
      We start by converting our linear equation, given in slope-intercept
      form, into vector notation: we write \(y(x)\) as a so-called "inner
      product" of a "row vector" and a "column vector", 
	\begin{equation*}
	  y(x)=a+bx=1\cdot a+x\cdot b=
          \rowvec{ 1 \amp x}
          \colvec{ a \\ b}
	\end{equation*}
	Now we'd like this to be a line so wonderful that every point fits it,
	so that
	\begin{equation*}
	  \begin{array}{cc}
	  y_1=a+bx_1= \amp \rowvec{ 1 \amp x_1}\colvec{ a \\ b}\\
	  y_2=a+bx_2= \amp \rowvec{ 1 \amp x_2}\colvec{ a \\ b} \\
	  \vdots \amp \vdots \\
	  y_n=a+bx_n= \amp \rowvec{ 1 \amp x_n}\colvec{ a \\ b}
	  \end{array}
	\end{equation*}
	Let's write that stack of equations in "matrix" and "vector" format:
	the matrix product on the right just acts like a bunch of inner
	products of vectors, as in the first example above:
	\begin{equation*}
	  \left[
	  \begin{array}{c} 
	  y_1\\
	  y_2\\
	  \vdots\\
	  y_n
	  \end{array}
	  \right]_{n \times 1}
	  =
	  \left[
	  \begin{array}{cc} 
	  1 \amp x_1\\
	  1 \amp x_2\\
	  \vdots \amp \vdots \\
	  1 \amp x_n
	  \end{array}
	  \right]_{n \times 2}
	  \colvec{ a \\ b}_{2 \times 1}
	\end{equation*}
- On the left, we have a column vector (call it \(\u{y}\) -- notation:
  an underline means that it's a vector, and we are going to assume that
  vectors are column vectors -- tall and thin, not short and wide);
- on the right we have a matrix, call it \(X\) (note the notation:
  capital letters are usually matrices); and
- on the far right a column vector of parameters (call it \(\u{\beta}\)):
  \begin{equation*}
    \u{\beta}=\colvec{ a \\ b}
  \end{equation*}

It's \(\u{\beta}\) that we'd like to solve for -- we know everything
	else!
	
	Observe that we have written the dimensions of each object at the
	bottom right. Vectors are really just narrow matrices. 
	
	Written as an equation, we have
	\begin{equation*}\u{y}=X\u{\beta}\end{equation*}
	which looks linear! Our job is to find \(\u{\beta}\), the
	"slope 
	vector". 
	
	I hope that your impulse is to "divide by X" - but we're not sure what
	that means for matrices. In fact, it turns out that it's meaningless
	unless \(X\) is square (that is, \(n=2\) in this case, so that
	there are two rows and two columns). 
	
	If \(X\) IS square, \(2 \times 2\), then we should be able to
	find a unique line, provided the two points are distinct (i.e., they're
	not the same point!). This is because there's a unique line passing
	through two distinct points. 
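	A minimal sketch of this square case (made-up points, solved with
	numpy's linear solver):
```python
import numpy as np

# Two-point case (hypothetical data): X is square, so we can solve exactly.
x2 = np.array([1.0, 3.0])
y2 = np.array([2.0, 8.0])
X2 = np.column_stack([np.ones_like(x2), x2])   # 2 x 2 design matrix

a2, b2 = np.linalg.solve(X2, y2)   # the unique line through two distinct points
print(a2, b2)                      # here: a2 = -1.0, b2 = 3.0
```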
	
	Otherwise, if there are more than two points, then we can find a unique
	line only if the points are collinear - that is, they all fall exactly
	on the same line. And we don't expect this to happen, except by dumb
	luck. So generally our line won't fit all the points. We want to find
	some way to choose the "best line", of course. So how do we proceed?
	
	Suppose \(X\) is not square - that there are more than two
	points. Then how do we proceed? We use a trick of sorts to create a
	matrix that IS square: we multiply the matrix \(X\) by
	its transpose, which is to say the matrix \(X\) with its
	columns and rows interchanged (so the first column becomes the first
	row, and the first row becomes the first column, etc.). 
	
	It looks like this:
	\begin{equation*}
	  X^T=	  \left[
	  \begin{array}{cccc} 
	  1 \amp 1 \amp \cdots \amp 1 \\
	  x_1 \amp x_2 \amp \cdots \amp x_n
	  \end{array}
	  \right]_{2 \times n}
	\end{equation*}
	Note the interchange in dimensions (so this matrix is \(2 \times
	n\)). 
	
	Now, we multiply both sides of the equation
	\begin{equation*}\u{y}=X\u{\beta}\end{equation*}
	by \(X^T\), i.e.,
	\begin{equation*}X^T\u{y}=X^TX\u{\beta}\end{equation*}
	Written out, that's
	\begin{equation*}
	  \displaystyle 
	  \left[
	  \begin{array}{cccc} 
	  1 \amp 1 \amp \cdots \amp 1 \\
	  x_1 \amp x_2 \amp \cdots \amp x_n
	  \end{array}
	  \right]
	  \left[
	  \begin{array}{c} 
	  y_1\\
	  y_2\\
	  \vdots\\
	  y_n
	  \end{array}
	  \right]_{n \times 1}
	  =
	  \left[
	  \begin{array}{cccc} 
	  1 \amp 1 \amp \cdots \amp 1 \\
	  x_1 \amp x_2 \amp \cdots \amp x_n
	  \end{array}
	  \right]
	  \left[
	  \begin{array}{cc} 
	  1 \amp x_1\\
	  1 \amp x_2\\
	  \vdots \amp \vdots \\
	  1 \amp x_n
	  \end{array}
	  \right]_{n \times 2}
	  \colvec{ a \\ b}_{2 \times 1}
	\end{equation*}
	Using the same approach as for a row vector times a column vector, we
	multiply the rows of \(X^T\) by the column vector \(\u{y}\) on the left,
	and by the columns of \(X\) on the right, to obtain
	\begin{equation*}
	  \displaystyle 
	  \colvec{\summ_{i=1}^ny_i \\ \summ_{i=1}^nx_iy_i}
	  =	  
	  \left[
	  \begin{array}{cc} 
	  n \amp \summ_{i=1}^nx_i \\
	  \summ_{i=1}^nx_i \amp \summ_{i=1}^nx_i^2
	  \end{array}
	  \right]_{2 \times 2}
	  \u{\beta}
	\end{equation*}
	Those sums are kind of ugly with the dummy index, so let's just
	understand that each sum runs over the \(n\) data points, and
	drop the indices moving forward:
	\begin{equation*}
	  \displaystyle 
	  \colvec{\sum y_i \\ \sum x_iy_i}
	  =	  
	  \left[
	  \begin{array}{cc} 
	  n \amp \sum x_i \\
	  \sum x_i \amp \sum x_i^2
	  \end{array}
	  \right]
	  \u{\beta}
	\end{equation*}
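	As a quick numerical check (continuing the earlier sketch, with the
	same placeholder data), the entries of \(X^TX\) and \(X^T\u{y}\) really
	are those sums:
```python
# Form the normal equations X^T X beta = X^T y for the straight-line model.
XtX = X.T @ X
Xty = X.T @ y

n = len(x)
# The entries of X^T X are exactly n, sum(x_i), sum(x_i), sum(x_i^2):
assert np.allclose(XtX, [[n, x.sum()], [x.sum(), (x**2).sum()]])
# The entries of X^T y are sum(y_i) and sum(x_i * y_i):
assert np.allclose(Xty, [y.sum(), (x * y).sum()])
```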
	Okay, now you're ready to "divide" by the matrix
	multiplying \(\u{\beta}\), since it's square. Well, we don't so
	much divide as we multiply by the multiplicative inverse,
	and it turns out that the multiplicative inverse of a \(2 \times 2\)
	matrix is easy to compute:
	
	If \(M\) is a \(2 \times 2\) matrix, with entries
	\begin{equation*}
	  M=
	  \left[
	  \begin{array}{cc} 
	  p \amp q \\
	  r \amp s
	  \end{array}
	  \right]
	\end{equation*}
	then the multiplicative inverse \(M^{-1}\) is given by
	\begin{equation*}
	  M^{-1}=\frac{1}{ps-rq}
	  \left[
	  \begin{array}{cc} 
	  s \amp -q \\
	  -r \amp p
	  \end{array}
	  \right]
	\end{equation*}
	The product \(M^{-1}M=MM^{-1}=\left[
	  \begin{array}{cc} 
	  1 \amp 0 \\
	  0 \amp 1
	  \end{array}
	  \right]\) is called the "identity matrix" \(I\),
	because \(I\u{\beta}=\u{\beta}\) for all \(2 \times 1\) vectors \(\u{\beta}\). 
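	A short sketch of that inverse formula (assuming the determinant
	\(ps-rq\) is nonzero; the test matrix is made up):
```python
import numpy as np

def inverse_2x2(M):
    """Inverse of a 2x2 matrix via the determinant formula (assumes det != 0)."""
    (p, q), (r, s) = M
    det = p * s - r * q
    return np.array([[s, -q], [-r, p]]) / det

M = np.array([[4.0, 7.0], [2.0, 6.0]])
assert np.allclose(inverse_2x2(M) @ M, np.eye(2))   # M^{-1} M = I
assert np.allclose(M @ inverse_2x2(M), np.eye(2))   # M M^{-1} = I
```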
	
	So now we can solve for \(\u{\beta}\). Let
	\begin{equation*}
	  M=
	  \left[
	  \begin{array}{cc} 
	  p \amp q \\
	  r \amp s
	  \end{array}
	  \right]
	  =
	  \left[
	  \begin{array}{cc} 
	  n \amp \sum x_i \\
	  \sum x_i \amp \sum x_i^2
	  \end{array}
	  \right]
	\end{equation*}
	Then 
	\begin{equation*}
	  M^{-1}=
	  \frac{1}{n\sum x_i^2-\left(\sum x_i\right)^2}
	  \left[
	  \begin{array}{cc} 
	  \sum x_i^2 \amp -\sum x_i \\
	  -\sum x_i \amp n
	  \end{array}
	  \right]
	\end{equation*}
	Multiplying both sides of our equation by \(M^{-1}\) gives
	\begin{equation*}
	  M^{-1}
	  \displaystyle 
	  \colvec{\sum y_i \\ \sum x_iy_i}
	  =	  
	  M^{-1}
	  M
	  \u{\beta}
	  =	  
	  \u{\beta}
	\end{equation*}
	So
	\begin{equation*}
	  \begin{array}{rcl}
	  \colvec{ a \\ b}
	  \amp = \amp
	  M^{-1}
	  \colvec{\sum y_i \\ \sum x_iy_i} \\
	  \amp = \amp
	  \frac{1}{n\sum x_i^2-\left(\sum x_i\right)^2}
	  \left[
	  \begin{array}{cc} 
	  \sum x_i^2 \amp -\sum x_i \\
	  -\sum x_i \amp n
	  \end{array}
	  \right]
	  \colvec{\sum y_i \\ \sum x_iy_i} \\
	  \amp = \amp
	  \frac{1}{n\sum x_i^2-\left(\sum x_i\right)^2}
	  \colvec{
	  \sum x_i^2\sum y_i - \sum x_i\sum x_iy_i \\ 
	  n\sum x_i y_i - \sum x_i\sum y_i
	  } \\
	  \amp = \amp
	  \colvec{
	  \frac{\sum x_i^2\sum y_i - \sum x_i\sum x_iy_i}
	  {n\sum x_i^2-\left(\sum x_i\right)^2}\\ 
	  \frac{n\sum x_i y_i - \sum x_i\sum y_i}
	  {n\sum x_i^2-\left(\sum x_i\right)^2}
	  } \\
	  \amp = \amp
	  \colvec{
	  \frac{\overline{y}\sum x_i^2 - \overline{x}\sum x_iy_i}
	  {\sum x_i^2-n\overline{x}^2}\\ 
	  \frac{\sum x_i y_i - n\overline{x}\,\overline{y}}
	  {\sum x_i^2-n\overline{x}^2}
	  }
	  \end{array}
	\end{equation*}
	So
	\begin{equation*}
	  a =
	  \frac{\overline{y}\sum x_i^2 - \overline{x}\sum x_iy_i}
	  {\sum x_i^2-n\overline{x}^2}
	\end{equation*}
	and
	\begin{equation*}
	  b = \frac{\sum x_i y_i - n\overline{x}\ \overline{y}}
	  {\sum x_i^2-n\overline{x}^2}
	  = \frac{\overline{xy} - \overline{x}\ \overline{y}}
	  {\overline{x^2}-\overline{x}^2}
	\end{equation*}
	You can check that \(a=\overline{y}-b\overline{x}\). 
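	Continuing the sketch, we can compute \(a\) and \(b\) from these
	formulas and check that they agree with a direct least-squares solve
	(still on the placeholder data):
```python
# Slope and intercept from the closed-form sums/means.
xbar, ybar = x.mean(), y.mean()
b = ((x * y).mean() - xbar * ybar) / ((x**2).mean() - xbar**2)
a = ybar - b * xbar

# They should match a direct least-squares solve of y = X beta.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose([a, b], beta_hat)
```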
	
	  Now how do we get the regression diagnostics that are commonly used
	  when performing linear regression?
	  The first order of business is to check that the residuals are
	  well-behaved. We look for certain aspects of the residuals:
- Is there a pattern to the residuals? If so, go back to the modeling
  step to capture that pattern.
- Are the residuals independent? (This is related to the question of a pattern.)
- Are the residuals normally distributed (or at least identically
  distributed)? The assumptions of the regression procedure include the
  assumption that the errors are independent and identically
  distributed. (A rough numerical version of these checks is sketched
  after this list.)
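	  A rough numerical version of these checks might look like this
	  (continuing the sketch above; in practice you would also inspect
	  residual plots):
```python
# Rough residual checks for the straight-line fit (placeholder data).
y_hat = X @ np.array([a, b])          # fitted values
resid = y - y_hat                     # residuals

# Mean of residuals should be ~ 0 for a least-squares fit with an intercept.
print("mean residual:", resid.mean())

# A quick look for a pattern: correlation between residuals and x
# (should be ~ 0), and lag-1 autocorrelation (large values suggest dependence).
print("corr(resid, x):", np.corrcoef(resid, x)[0, 1])
print("lag-1 autocorr:", np.corrcoef(resid[:-1], resid[1:])[0, 1])
# One would also plot residuals vs. fitted values and a histogram or Q-Q plot
# to judge normality.
```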
Having established that our residuals are well-behaved, we can move
	  on to ask how well our model is capturing the variation in the data. 
	  There are certain quantities that are related to variances that we
	  begin by computing. The \(SS_{Tot}\) is the TOTAL sum of squared errors of
	  the null model, which is \(y(x)=\overline{y}\), the mean
	  value of the \(y_i\) values. So
	  \begin{equation*}
	    SS_{Tot}=\sum_{i=1}^n(y_i-\overline{y})^2
	  \end{equation*}
	  The SSE is the minimal sum of squared errors, the minimum value
	  obtained by using our optimal regression
	  parameters. Defining \(\hat{y}_i=a+bx_i\) as the estimate provided
	  by the model,
	  \begin{equation*}
	    SSE=\sum_{i=1}^n(y_i-\hat{y}_i)^2
	  \end{equation*}
	  Now we can also define a sum of squared errors of the regression
	  model, via
	  \begin{equation*}
	    SS_{Reg}=\sum_{i=1}^n(\hat{y}_i-\overline{y})^2
	  \end{equation*}
	  It turns out (perhaps surprisingly) that
	  \begin{equation*}
	    SS_{Tot}=SSE + SS_{Reg}
	  \end{equation*}
	  The variance explained by the model is the \(R^2\) value,
	  \begin{equation*}
	    R^2=\frac{SS_{Tot}-SSE}{SS_{Tot}}=\frac{SS_{Reg}}{SS_{Tot}}
	  \end{equation*}
	  which gives us the "big picture" fit to the data in the
	  scatterplot. \(R^2\) is between 0 and 1, with 0 meaning the fit is no
	  better than the horizontal null model, and 1 meaning the data fall
	  perfectly on the fitted line. \(R^2\) provides a measure of how much
	  of the total variability is explained by the model. 
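	  Continuing the sketch, these sums of squares and the decomposition can
	  be checked directly:
```python
# Sums of squares and R^2 for the straight-line fit.
ss_tot = np.sum((y - y.mean())**2)        # total: errors of the null model y = ybar
sse    = np.sum((y - y_hat)**2)           # residual (error) sum of squares
ss_reg = np.sum((y_hat - y.mean())**2)    # explained by the regression

assert np.isclose(ss_tot, sse + ss_reg)   # the decomposition SS_Tot = SSE + SS_Reg
r_squared = ss_reg / ss_tot
print("R^2 =", r_squared)
```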
	  
	  Our next objective is to obtain measures of fit for the parameters:
	  we would like to know their standard errors, which indicate how much
	  variation we expect from the "best values" obtained by using the normal
	  equations. 
	  
	  The MSE (mean squared error) is found from the SSE as
	  \begin{equation*}
	    MSE=\frac{SSE}{n-p}
	  \end{equation*}
	  where \(p\) is the number of parameters in the model (including
	  the intercept term). 
	  
	  In addition to the
	  MSE we need the inverse of the matrix \(X^TX\) (which was computed
	  to derive the normal equations) to get the standard errors of the
	  parameters. The standard error of the \(i^{th}\) parameter is obtained
	  as the square root of the product of the MSE and the \(i^{th}\)
	  diagonal entry of the inverse \((X^TX)^{-1}\):
	  \begin{equation*}
	    SE_{i}=\sqrt{MSE*(X^TX)^{-1}_{ii}}
	  \end{equation*}
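	  In the running sketch, with \(p=2\) parameters, this looks like:
```python
# Standard errors of the parameters (continuing the sketch; p = 2 here).
p = X.shape[1]                            # number of parameters, incl. intercept
mse = sse / (n - p)                       # mean squared error
cov_beta = mse * np.linalg.inv(X.T @ X)   # estimated covariance of beta-hat
se = np.sqrt(np.diag(cov_beta))           # SE_i = sqrt(MSE * (X^T X)^{-1}_{ii})
print("SE(a), SE(b) =", se)
```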
	  The t-statistics are the parameters divided by these standard 
	  errors SE: \(t_i=\frac{\beta_i}{SE_i}\). Then, in order to
	  calculate the (two-sided) probability of a value at least as large
	  in magnitude as \(t_i\) (assuming a mean of zero),
	  we compute 
	  \begin{equation*}
	    p_i=2(1-CDF(|t_i|,\nu=n-p))
	  \end{equation*}
	  where CDF is the cumulative distribution function of the
	  t-distribution with \(\nu\) degrees of freedom. 
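	  A sketch of that computation (this assumes SciPy is available; its
	  stats.t distribution object supplies the CDF and survival function):
```python
from scipy import stats

# t-statistics and two-sided p-values for the fitted parameters
# (continuing from the standard errors above).
beta_hat = np.array([a, b])
t_stats = beta_hat / se
# Two-sided p-value: 2 * (1 - CDF(|t_i|)) with nu = n - p degrees of freedom;
# stats.t.sf is the survival function 1 - CDF.
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p)
print(t_stats, p_values)
```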
	  
	  Equivalently, we could construct \(\rho\)-confidence intervals, of the form
	  \begin{equation*}
	    [\beta_i - t^{crit}_{\rho,\nu=n-p} SE_i,\beta_i +
	    t^{crit}_{\rho,\nu=n-p} SE_i]
	  \end{equation*}
	  where we find \(t^{crit}\) by asking what width of the
	  standardized t-distribution provides us with the given confidence,
	  and then we use that width to multiply the appropriate standard
	  error \(SE_i\).
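	  And a sketch of the corresponding intervals (again assuming SciPy,
	  with \(\rho=0.95\) chosen as an example):
```python
# rho = 0.95 confidence intervals for the parameters (sketch).
rho = 0.95
t_crit = stats.t.ppf(1 - (1 - rho) / 2, df=n - p)   # two-sided critical value
lower = beta_hat - t_crit * se
upper = beta_hat + t_crit * se
print(np.column_stack([lower, upper]))
```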