\(\newenvironment{mat}{\left[\begin{array}}{\end{array}\right]}
\newcommand{\colvec}[1]{\left[\begin{matrix}#1 \end{matrix}\right]}
\newcommand{\rowvec}[1]{[\begin{matrix} #1 \end{matrix}]}
\newcommand{\definiteintegral}[4]{\int_{#1}^{#2}\,#3\,d#4}
\newcommand{\indefiniteintegral}[2]{\int#1\,d#2}
\def\u{\underline}
\def\summ{\sum\limits}
\newcommand{\lt}{ < }
\newcommand{\gt}{ > }
\newcommand{\amp}{ & }
\)
      	
      The downside of linear regression is that it only works when the
      parameters appear linearly in the model, and many models don't meet that
      standard. Suppose, for example, that we have a model like
      \begin{equation*}
	y(x)=ae^{bx}
      \end{equation*}
      We can transform the data (take logarithms) and seek a model of the form
      \begin{equation*}
	\ln(y(x))=\ln a + bx
      \end{equation*}
      but this has an impact when it comes to the error structure. The linear
      regression model assumes independent identically distributed (iid)
      errors, so 
      \begin{equation*}
	\ln(y_i)=\ln a + bx_i + \epsilon_i
      \end{equation*}
      When we transform back, to talk about \(y\), or rather \(y(x)\), we
      have data with an error structure of the form
      \begin{equation*}
	y_i=ae^{bx_i + \epsilon_i}=ae^{bx_i}e^{\epsilon_i}
      \end{equation*}
      So the errors become multiplicative. Now that might actually be
      appropriate: it depends on exactly what the nature of those errors is.
      
      If the error is all due to reading a thermometer that only reads to the
      nearest tenth of a degree (so the reading is off by at most 0.05 degrees),
      then the error is going to be additive, rather than multiplicative: it's
      about 0.05 degrees no matter what the temperature is. If the error is in
      counting things, and you tend to miss about 10 percent no matter the size
      of the crowd, then it's multiplicative. So you would like to have some
      understanding of the type of error structure you're expecting, and to
      capture it properly.
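      When the multiplicative (log-scale) error structure is the appropriate
      one, the transformed fit is just ordinary linear regression on
      \(\ln(y)\). Here is a minimal sketch in Python; the data arrays are made
      up for illustration, not taken from anything in this course:

          import numpy as np

          # Made-up data, assumed to roughly follow y = a*e^(b*x) with multiplicative errors.
          x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
          y = np.array([2.1, 3.4, 6.2, 10.9, 19.8, 36.5])

          # Ordinary least squares on the transformed model ln(y) = ln(a) + b*x.
          b, ln_a = np.polyfit(x, np.log(y), 1)   # polyfit returns [slope, intercept]
          a = np.exp(ln_a)
          print("a =", a, " b =", b)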
      
      If we truly want to work with an error structure of 
      \begin{equation*}
	y_i=ae^{bx_i} + \epsilon_i
      \end{equation*}
      then non-linear regression gives us an approach. A downside of
      non-linear regression is that the diagnostics are not quite as good as
      they are for linear regression. But we can get the essentials (especially
      the residuals, the \(R^2\), and the standard errors). So here we go.
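      For the record, canned routines that fit the additive-error model
      directly do exist; SciPy's curve_fit is one. Here is a sketch using the
      same made-up data as above, just to show what we're about to build by
      hand (notice that it, too, wants a starting guess p0):

          import numpy as np
          from scipy.optimize import curve_fit

          def model(x, a, b):
              # the non-linear model y = a*e^(b*x), with additive errors assumed
              return a * np.exp(b * x)

          x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
          y = np.array([2.1, 3.4, 6.2, 10.9, 19.8, 36.5])

          popt, pcov = curve_fit(model, x, y, p0=[2.0, 0.5])  # starting guess for (a, b)
          a_hat, b_hat = popt
          std_errs = np.sqrt(np.diag(pcov))                   # parameter standard errors
          print(a_hat, b_hat, std_errs)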
      
Subsection 11.1 The Motivation: Newton's Method
      I want to start by talking about Newton's method. You might have seen
      this in calculus class, but perhaps not. It's one of those topics which
      seem to get passed over, sadly, because we're in such a damned hurry to
      race through the standard calculus curriculum. It's one of the coolest
      things, it's tremendously valuable, and it has Newton's name on it - so it's a
      bit of a surprise that it gets the shaft, but so it goes.
      
      At any rate, I'm not bitter....
      
      So let's take a look at that. It starts with a non-linear function
      \(f(x)\) for which we want a root.
      Now let's suppose that we make a guess, and it's not too bad: we
      think maybe \(x=.5\) (call this starting guess \(x_0\)). We
      draw in the tangent line to the function at the
      point \((x_0,f(x_0))\), and construct the tangent line function
      \begin{equation*}
	L(x) = f(x_0) + f'(x_0)\,(x - x_0)
      \end{equation*}
Note: the tangent line function is actually the first-degree Taylor
      series polynomial approximation to \(f\) at \((x_0,f(x_0))\). This
      is important for the eventual generalization to non-linear regression.
      
      Solving for the root of the original function is a non-linear problem,
      and "hard". We can solve for the root of the tangent line function (the
      linear function) very easily, however: so we trade our hard non-linear
      problem for a problem that we know how to do. Are we clever, or lazy?
      Interesting mathematical strategy: if you don't like the problem you are
      given, do an easier one!:) We cast the problem into the linear world,
      solve it, bring back the solution, and see how we're doing.
      
      Well, in this case, we're closer. By the way, the root we're after can be
      approximated numerically (Sage will do it for us): it's about 1.0472.
      Notice that this is just \(\frac{\pi}{3}\)! So you could have solved it
      yourself, just by thinking about it: it's a special triangle. (Did you
      think about it?)
      
      At any rate, our method is to use the two points \((x_0,f(x_0))\)
      and \((x_1,0)\) (where \(x_1\) is the spot where the tangent line crosses
      the \(x\)-axis) and the known slope \(f'(x_0)\) to find the next
      approximation \(x_1\), which comes from
      \begin{equation*}
	m=\frac{0-f(x_0)}{x_1-x_0}=f'(x_0)
      \end{equation*}
      or
      \begin{equation*}
	x_1=x_0-\frac{f(x_0)}{f'(x_0)}
      \end{equation*}
      In this case, that gives a new estimate \(x_1\) that is noticeably closer
      to the root. So by solving the linear problem, we're closer to the
      solution of the non-linear problem! So just do it again, do it again, do
      it again... until satisfied, whatever that means. Repeating the step
      from \(x_1\) gets us a lot closer still, and if we just do
      it about 10 times, we're going to be golden.
      
      Now isn't that cool? How much cooler than that can you get? We do a whole
      bunch of linear problems, and suddenly we have solved the non-linear
      problem.
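      To make the loop concrete, here's a little Python sketch. I'll
      use \(f(x)=\cos(x)-\frac{1}{2}\) as a stand-in example (its root near the
      guess \(x_0=.5\) is exactly \(\frac{\pi}{3}\), matching the story above);
      swap in whatever \(f\) and \(f'\) you actually have:

          import math

          def newton(f, fprime, x0, steps=10):
              """Newton's method: repeatedly jump to the root of the tangent line."""
              x = x0
              for _ in range(steps):
                  x = x - f(x) / fprime(x)      # x_{k+1} = x_k - f(x_k)/f'(x_k)
              return x

          # Stand-in function: f(x) = cos(x) - 1/2, whose root near 0.5 is pi/3.
          f = lambda t: math.cos(t) - 0.5
          fp = lambda t: -math.sin(t)

          print(newton(f, fp, x0=0.5, steps=10))   # about 1.0471975512
          print(math.pi / 3)                       # the exact root, pi/3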
      
      And that's the big idea behind non-linear regression, too.
      
      Let's assume that the scalar value \(y\) is some non-linear function
      of several variables, which we will denote by the (column)
      vector \(\u{x}\). (Remember, we are the people of the columns,
      not the people of the rows.)
      
      So, for example, we might have test score \(y\) as a function of
      several variables, making a vector of predictors
      \begin{equation*}
	\u{x}=\colvec{ACT \\ GPA  \\ Age \\ Rank}
      \end{equation*}
      and those variables might be associated with several parameters, which we
      might denote as 
      \begin{equation*}
	\u{\theta}=\colvec{a \\ b  \\ c \\ d \\ e}
      \end{equation*}
      and the non-linear model could be
      \begin{equation*}
	 y = f(\u{x};\u{\theta}) = a ACT^b GPA^c Age^d Rank^e
      \end{equation*}
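      Just to pin the notation down, here is that example model as a Python
      function (the variable and argument names are my own shorthand):

          def f(x, theta):
              """The example model y = a * ACT^b * GPA^c * Age^d * Rank^e."""
              ACT, GPA, Age, Rank = x          # the predictor vector
              a, b, c, d, e = theta            # the parameter vector
              return a * ACT**b * GPA**c * Age**d * Rank**e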
      Now let's suppose that we have \(n\) data values \(y_i\), and their
      associated predictors \(\u{x}_i\). Let's denote the unknown
      parameters by the vector \(\u\theta\). We're going to
      assume that 
      \begin{equation*}
        y_i = f({\u{x}}_i;{\u{\theta}}) + \epsilon_i,
      \end{equation*}
      where the errors \(\epsilon_i\) are iid (independent and
      identically distributed).
      
      We're going to need a pretty good starting guess for the parameters (call
      this \(\u{\theta}^*\)). That's one of the big differences
      between linear and non-linear regression: you frequently need a good
      starting guess. Of course the closer you are to
      a "solution" the better (and we can't assume a unique solution in the
      non-linear case, like we can in the linear case - another important
      difference). 
      
      Our strategy is going to rely on the Taylor series expansion, which you
      should recall from calculus (but may not - 
      time for review in that case!). 
      
      So, ignoring the errors for the moment, we have \(n\) equations of the form
      \begin{equation*}
        y_i = f({\u{x}}_i;{\u{\theta}})
      \end{equation*}
      which is generally an over-determined system for the
      parameters \(\u{\theta}\). So now we use the multivariate Taylor
      series about our current guess \(\u{\theta}^*\), keeping only the linear
      terms, to expand out the right-hand side:
      \begin{equation*}
        y_i = f({\u{x}}_i;{\u{\theta}})
	\approx f({\u{x}}_i;{\u{\theta}^*})
	+
	    f_{\theta_1}({\u{x}}_i;{\u{\theta}^*})(\theta_1-\theta_1^*)
	+
	    f_{\theta_2}({\u{x}}_i;{\u{\theta}^*})(\theta_2-\theta_2^*)
	+
	\ldots
	+
	    f_{\theta_p}({\u{x}}_i;{\u{\theta}^*})(\theta_p-\theta_p^*)
      \end{equation*}
      or
      \begin{equation*}
        y_i - f({\u{x}}_i;{\u{\theta}^*})
	\approx
	    f_{\theta_1}({\u{x}}_i;{\u{\theta}^*})(\theta_1-\theta_1^*)
	+
	    f_{\theta_2}({\u{x}}_i;{\u{\theta}^*})(\theta_2-\theta_2^*)
	+
	\ldots
	+
	    f_{\theta_p}({\u{x}}_i;{\u{\theta}^*})(\theta_p-\theta_p^*)
      \end{equation*}
      where \(f_{\theta_j}\) denotes the partial derivative of \(f\) with
      respect to the \(j^{th}\) parameter \(\theta_j\).
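      For instance, for the exponential model \(y=ae^{bx}\) from the start of
      this section, with parameters \(\u{\theta}=\colvec{a \\ b}\), those
      partials are simply
      \begin{equation*}
	f_a(x;a,b) = e^{bx},
	\qquad
	f_b(x;a,b) = a\,x\,e^{bx},
      \end{equation*}
      each to be evaluated at the current guess \((a^*,b^*)\).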
      
      Now we can write the expansion in vector form as
      \begin{equation*}
        y_i - f({\u{x}}_i;{\u{\theta}^*})
	\approx
	    [\,f_{\theta_1}\;\; f_{\theta_2}\;\; \ldots\;\; f_{\theta_p}\,]\bigg\rvert_{{\u{x}}_i;\,{\u{\theta}^*}}\,(\u{\theta}-\u{\theta}^*)
      \end{equation*}
      where the vertical bar means to evaluate that vector of partials at the
      known vector value \({\u{x}}_i\) and parameter
      estimate \({\u{\theta}^*}\). 
      
      The only thing we don't know in this equation is \(\u{\theta}\), and
      that's what we want to know!
      
      Define now for each data value the difference between the data and the best
      model so far,
      \begin{equation*}
        w_i \equiv y_i - f({\u{x}}_i;{\u{\theta}^*})
      \end{equation*}
      Then we can collect all \(n\) of these approximate equations into one big
      matrix equation of the form
      \begin{equation*}
	\u{w}=X(\u{\theta}-\u{\theta}^*)
      \end{equation*}
      where the rows of \(X\) are the partial derivatives evaluated at the
      appropriate \({\u{x}}_i\) and \({\u{\theta}^*}\).
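      Written out, \(X\) is the \(n \times p\) matrix of partial derivatives
      (the Jacobian of \(f\) with respect to the parameters), one row per data
      point:
      \begin{equation*}
	X =
	\left[\begin{array}{cccc}
	  f_{\theta_1}({\u{x}}_1;{\u{\theta}^*}) & f_{\theta_2}({\u{x}}_1;{\u{\theta}^*}) & \cdots & f_{\theta_p}({\u{x}}_1;{\u{\theta}^*}) \\
	  f_{\theta_1}({\u{x}}_2;{\u{\theta}^*}) & f_{\theta_2}({\u{x}}_2;{\u{\theta}^*}) & \cdots & f_{\theta_p}({\u{x}}_2;{\u{\theta}^*}) \\
	  \vdots & \vdots & & \vdots \\
	  f_{\theta_1}({\u{x}}_n;{\u{\theta}^*}) & f_{\theta_2}({\u{x}}_n;{\u{\theta}^*}) & \cdots & f_{\theta_p}({\u{x}}_n;{\u{\theta}^*})
	\end{array}\right]
      \end{equation*}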
      
      Now we do linear regression to find \(\u{\theta}-\u{\theta}^*\) in the
      usual way:
      \begin{equation*}
	\u{\theta}-\u{\theta}^*  = (X^TX)^{-1}X^T{\u{w}}
      \end{equation*}
      from which we obtain our next estimate for \({\u{\theta}}\) as
      \begin{equation*}
      {\u{\theta}}=\u{\theta}^* + (X^TX)^{-1}X^T{\u{w}}
      \end{equation*}
      Let's take a moment to say "Hallelujah and Amen!" We've solved an
      associated linear problem, and hope that the new estimate
      for \(\u{\theta}\) is closer to the true solution.
      
      Now we reset our guess \(\u{\theta}^*\) to this new value,
      \begin{equation*}
      \u{\theta}^* = {\u{\theta}}
      \end{equation*}
      and iterate -- that is, do it again -- computing the new, better
      estimate \({\u{\theta}}\) for as long as the system is converging.
      
      By "converging" I mean that the difference \(\u{\theta}-\u{\theta}^*\)
      goes to zero as you iterate. Then you just cross your fingers and hope
      that you've found a minimum!
      
      (Don't you hate it when it comes down to crossing your fingers and
      hoping? Well heh, at least it's a start!:)
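      Here's a bare-bones Python sketch of the whole iteration for the
      exponential model \(y=ae^{bx}\), using the partials written out earlier.
      The data and the starting guess are made up, and a real implementation
      would add safeguards (step damping, better convergence tests), but this
      is the algorithm:

          import numpy as np

          def gauss_newton_exp(x, y, a, b, n_iter=20, tol=1e-12):
              """Fit y = a*e^(b*x) by repeatedly linearizing about the current guess
              and solving the linear least-squares problem w = X (theta - theta*)."""
              for _ in range(n_iter):
                  w = y - a * np.exp(b * x)                       # w_i = y_i - f(x_i; theta*)
                  X = np.column_stack([np.exp(b * x),             # df/da at the guess
                                       a * x * np.exp(b * x)])    # df/db at the guess
                  delta, *_ = np.linalg.lstsq(X, w, rcond=None)   # (X^T X)^{-1} X^T w
                  a, b = a + delta[0], b + delta[1]               # theta = theta* + delta
                  if np.linalg.norm(delta) < tol:                 # theta - theta* is ~ 0
                      break
              return a, b

          x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])    # made-up data again
          y = np.array([2.1, 3.4, 6.2, 10.9, 19.8, 36.5])
          print(gauss_newton_exp(x, y, a=2.0, b=0.5))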
      
      The cool thing that you'll notice in the linear regression modeling is
      that the parameter estimates (the estimated components
      of \(\u{\theta}-\u{\theta}^*\)) are going to go to zero, but their
      standard errors won't. So the p-values are going to go to 1, which is
      really funny for linear regression, and the standard errors will converge
      to fixed values -- which will be the standard errors of the non-linear
      parameters!
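      Concretely, at convergence the usual linear-regression recipe, applied to
      the final \(X\) and residuals \(\u{w}\), gives the approximate standard
      errors:
      \begin{equation*}
	s^2 = \frac{1}{n-p}\summ_{i=1}^{n} w_i^2,
	\qquad
	\widehat{SE}(\theta_j) = \sqrt{s^2\,\left[(X^TX)^{-1}\right]_{jj}}
      \end{equation*}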
      
      From those we can compute our parameter confidence intervals in the usual
      way. That's really sensational. I hope that you're as excited as I am
      about now (but somehow I'm imagining that you're thinking, um, this guy
      is off his rocker...:).
      
Here's
      a Mathematica file that illustrates the process in action for a
      really interesting data set, involving the onset of rigor mortis
      in human bodies. Kind of a macabre problem, but we deal with all kinds of
      modeling problems in this class!