\(\newenvironment{mat}{\left[\begin{array}}{\end{array}\right]}
\newcommand{\colvec}[1]{\left[\begin{matrix}#1 \end{matrix}\right]}
\newcommand{\rowvec}[1]{[\begin{matrix} #1 \end{matrix}]}
\newcommand{\definiteintegral}[4]{\int_{#1}^{#2}\,#3\,d#4}
\newcommand{\indefiniteintegral}[2]{\int#1\,d#2}
\def\u{\underline}
\def\summ{\sum\limits}
\newcommand{\lt}{ < }
\newcommand{\gt}{ > }
\newcommand{\amp}{ & }
\)
      	
      The downside of linear regression is that it only works when the
      parameters appear linearly in the model, and many models don't meet that
      standard. Suppose, for example, that we have a model like
      \begin{equation*}
	y(x)=ae^{bx}
      \end{equation*}
      We can transform the data (take logarithms) and seek a model of the form
      \begin{equation*}
	\ln(y(x))=\ln a + bx
      \end{equation*}
      but this has an impact when it comes to the error structure. The linear
      regression model assumes independent identically distributed (iid)
      errors, so 
      \begin{equation*}
	\ln(y_i)=\ln a + bx_i + \epsilon_i
      \end{equation*}
      When we transform back, to talk about \(y\), or rather \(y(x)\), we
      have data with an error structure of the form
      \begin{equation*}
	y_i=ae^{bx_i + \epsilon_i}=ae^{bx_i}e^{\epsilon_i}
      \end{equation*}
      So the errors become multiplicative. Now that might actually be
      appropriate: it depends on exactly what the nature of those errors is.
      
      If the error is all due to reading a thermometer that only reads to the
      nearest tenth of a degree (so the reading is off by at most 0.05 degrees),
      then the error is going to be additive, rather than multiplicative: it's
      about 0.05 degrees no matter what the temperature is. If the error is in
      counting things, and you tend to miss about 10 percent no matter the size
      of the crowd, then it's multiplicative. So you would like to have some
      understanding of the type of error structure you're expecting, and to
      capture it properly.
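      When the multiplicative (log-scale) error structure is the appropriate
      one, the transformed fit is just ordinary linear regression on
      \(\ln(y)\). Here is a minimal sketch in Python; the data arrays are made
      up for illustration, not taken from anything in this course:

          import numpy as np

          # Made-up data, assumed to roughly follow y = a*e^(b*x) with multiplicative errors.
          x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
          y = np.array([2.1, 3.4, 6.2, 10.9, 19.8, 36.5])

          # Ordinary least squares on the transformed model ln(y) = ln(a) + b*x.
          b, ln_a = np.polyfit(x, np.log(y), 1)   # polyfit returns [slope, intercept]
          a = np.exp(ln_a)
          print("a =", a, " b =", b)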
      
      If we truly want to work with an error structure of 
      \begin{equation*}
	y_i=ae^{bx_i} + \epsilon_i
      \end{equation*}
      then non-linear regression gives us an approach. A downside of
      non-linear regression is that the diagnostics are not quite as good as
      they are for linear regression. But we can get the essentials (especially
      the residuals, the \(R^2\), and the standard errors). So here we go.
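      For the record, canned routines that fit the additive-error model
      directly do exist; SciPy's curve_fit is one. Here is a sketch using the
      same made-up data as above, just to show what we're about to build by
      hand (notice that it, too, wants a starting guess p0):

          import numpy as np
          from scipy.optimize import curve_fit

          def model(x, a, b):
              # the non-linear model y = a*e^(b*x), with additive errors assumed
              return a * np.exp(b * x)

          x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
          y = np.array([2.1, 3.4, 6.2, 10.9, 19.8, 36.5])

          popt, pcov = curve_fit(model, x, y, p0=[2.0, 0.5])  # starting guess for (a, b)
          a_hat, b_hat = popt
          std_errs = np.sqrt(np.diag(pcov))                   # parameter standard errors
          print(a_hat, b_hat, std_errs)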
      
Subsection 11.1 The Motivation: Newton's Method
      I want to start by talking about Newton's method. You might have seen
      this in calculus class, but perhaps not. It's one of those topics which
      seem to get passed over, sadly, because we're in such a damned hurry to
      race through the standard calculus curriculum. It's one of the coolest
      things, it's tremendously valuable, and it has Newton's name on it - so it's a
      bit of a surprise that it gets the shaft, but so it goes.
      
      At any rate, I'm not bitter....
      
      So let's take a look at that. It starts with a non-linear function
      \(f(x)\) for which we want a root.
      Now let's suppose that we make a guess, and it's not too bad: we
      think maybe \(x=.5\) (call this starting guess \(x_0\)). We
      draw in the tangent line to the function at the
      point \((x_0,f(x_0))\), and construct the tangent line function
      \begin{equation*}
	L(x) = f(x_0) + f'(x_0)\,(x - x_0)
      \end{equation*}
Note: the tangent line function is actually the first-degree Taylor
      series polynomial approximation to \(f\) at \((x_0,f(x_0))\). This
      is important for the eventual generalization to non-linear regression.
      
      Solving for the root of the original function is a non-linear problem,
      and "hard". We can solve for the root of the tangent line function (the
      linear function) very easily, however: so we trade our hard non-linear
      problem for a problem that we know how to do. Are we clever, or lazy?
      Interesting mathematical strategy: if you don't like the problem you are
      given, do an easier one!:) We cast the problem into the linear world,
      solve it, bring back the solution, and see how we're doing.
      
      Well, in this case, we're closer. By the way, the root we're after can be
      approximated numerically (Sage will do it for us): it's about 1.0472.
      Notice that this is just \(\frac{\pi}{3}\)! So you could have solved it
      yourself, just by thinking about it: it's a special triangle. (Did you
      think about it?)
      
      At any rate, our method is to use the two points \((x_0,f(x_0))\)
      and \((x_1,0)\) (where \(x_1\) is the spot where the tangent line crosses
      the \(x\)-axis) and the known slope \(f'(x_0)\) to find the next
      approximation \(x_1\), which comes from
      \begin{equation*}
	m=\frac{0-f(x_0)}{x_1-x_0}=f'(x_0)
      \end{equation*}
      or
      \begin{equation*}
	x_1=x_0-\frac{f(x_0)}{f'(x_0)}
      \end{equation*}
      In this case, that gives a new estimate \(x_1\) that is noticeably closer
      to the root. So by solving the linear problem, we're closer to the
      solution of the non-linear problem! So just do it again, do it again, do
      it again... until satisfied, whatever that means. Repeating the step
      from \(x_1\) gets us a lot closer still, and if we just do
      it about 10 times, we're going to be golden.
      
      Now isn't that cool? How much cooler than that can you get? We do a whole
      bunch of linear problems, and suddenly we have solved the non-linear
      problem.
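      To make the loop concrete, here's a little Python sketch. I'll
      use \(f(x)=\cos(x)-\frac{1}{2}\) as a stand-in example (its root near the
      guess \(x_0=.5\) is exactly \(\frac{\pi}{3}\), matching the story above);
      swap in whatever \(f\) and \(f'\) you actually have:

          import math

          def newton(f, fprime, x0, steps=10):
              """Newton's method: repeatedly jump to the root of the tangent line."""
              x = x0
              for _ in range(steps):
                  x = x - f(x) / fprime(x)      # x_{k+1} = x_k - f(x_k)/f'(x_k)
              return x

          # Stand-in function: f(x) = cos(x) - 1/2, whose root near 0.5 is pi/3.
          f = lambda t: math.cos(t) - 0.5
          fp = lambda t: -math.sin(t)

          print(newton(f, fp, x0=0.5, steps=10))   # about 1.0471975512
          print(math.pi / 3)                       # the exact root, pi/3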
      
      And that's the big idea behind non-linear regression, too.
      
      Let's assume that the scalar value \(y\) is some non-linear function
      of several variables, which we will denote by the (column)
      vector \(\u{x}\). (Remember, we are the people of the columns,
      not the people of the rows.)
      
      So, for example, we might have test score \(y\) as a function of
      several variables, making a vector of predictors
      \begin{equation*}
	\u{x}=\colvec{ACT \\ GPA  \\ Age \\ Rank}
      \end{equation*}
      and those variables might be associated with several parameters, which we
      might denote as 
      \begin{equation*}
	\u{\theta}=\colvec{a \\ b  \\ c \\ d \\ e}
      \end{equation*}
      and the non-linear model could be
      \begin{equation*}
	 y = f(\u{x};\u{\theta}) = a ACT^b GPA^c Age^d Rank^e
      \end{equation*}
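      Just to pin the notation down, here is that example model as a Python
      function (the variable and argument names are my own shorthand):

          def f(x, theta):
              """The example model y = a * ACT^b * GPA^c * Age^d * Rank^e."""
              ACT, GPA, Age, Rank = x          # the predictor vector
              a, b, c, d, e = theta            # the parameter vector
              return a * ACT**b * GPA**c * Age**d * Rank**e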
      Now let's suppose that we have \(n\) data values \(y_i\), and their
      associated predictors \(\u{x}_i\). Let's denote the unknown
      parameters by the vector \(\u\theta\). We're going to
      assume that 
      \begin{equation*}
        y_i = f({\u{x}}_i;{\u{\theta}}) + \epsilon_i,
      \end{equation*}
      where the errors \(\epsilon_i\) are iid (independent and
      identically distributed).
      
      We're going to need a pretty good starting guess for the parameters (call
      this \(\u{\theta}^*\)). That's one of the big differences
      between linear and non-linear regression: you frequently need a good
      starting guess. Of course the closer you are to
      a "solution" the better (and we can't assume a unique solution in the
      non-linear case, like we can in the linear case - another important
      difference). 
      
      Our strategy is going to rely on the Taylor series expansion, which you
      should recall from calculus (but may not - 
      time for review in that case!). 
      
      So, ignoring the errors for the moment, we have \(n\) equations of the form
      \begin{equation*}
        y_i = f({\u{x}}_i;{\u{\theta}})
      \end{equation*}
      which is generally an over-determined system for the
      parameters \(\u{\theta}\). So now we use the multivariate Taylor
      series about our current guess \(\u{\theta}^*\), keeping only the linear
      terms, to expand out the right-hand side:
      \begin{equation*}
        y_i = f({\u{x}}_i;{\u{\theta}})
	\approx f({\u{x}}_i;{\u{\theta}^*})
	+
	    f_{\theta_1}({\u{x}}_i;{\u{\theta}^*})(\theta_1-\theta_1^*)
	+
	    f_{\theta_2}({\u{x}}_i;{\u{\theta}^*})(\theta_2-\theta_2^*)
	+
	\ldots
	+
	    f_{\theta_p}({\u{x}}_i;{\u{\theta}^*})(\theta_p-\theta_p^*)
      \end{equation*}
      or
      \begin{equation*}
        y_i - f({\u{x}}_i;{\u{\theta}^*})
	\approx
	    f_{\theta_1}({\u{x}}_i;{\u{\theta}^*})(\theta_1-\theta_1^*)
	+
	    f_{\theta_2}({\u{x}}_i;{\u{\theta}^*})(\theta_2-\theta_2^*)
	+
	\ldots
	+
	    f_{\theta_p}({\u{x}}_i;{\u{\theta}^*})(\theta_p-\theta_p^*)
      \end{equation*}
      where \(f_{\theta_j}\) denotes the partial derivative of \(f\) with
      respect to the \(j^{th}\) parameter \(\theta_j\).
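      For instance, for the exponential model \(y=ae^{bx}\) from the start of
      this section, with parameters \(\u{\theta}=\colvec{a \\ b}\), those
      partials are simply
      \begin{equation*}
	f_a(x;a,b) = e^{bx},
	\qquad
	f_b(x;a,b) = a\,x\,e^{bx},
      \end{equation*}
      each to be evaluated at the current guess \((a^*,b^*)\).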
      
      Now we can write the expansion in vector form as
      \begin{equation*}
        y_i - f({\u{x}}_i;{\u{\theta}^*})
	\approx
	    [\,f_{\theta_1}\;\; f_{\theta_2}\;\; \ldots\;\; f_{\theta_p}\,]\bigg\rvert_{{\u{x}}_i;\,{\u{\theta}^*}}\,(\u{\theta}-\u{\theta}^*)
      \end{equation*}
      where the vertical bar means to evaluate that vector of partials at the
      known vector value \({\u{x}}_i\) and parameter
      estimate \({\u{\theta}^*}\). 
      
      The only thing we don't know in this equation is \(\u{\theta}\), and
      that's what we want to know!
      
      Define now for each data value the difference between the data and the best
      model so far,
      \begin{equation*}
        w_i \equiv y_i - f({\u{x}}_i;{\u{\theta}^*})
      \end{equation*}
      Then we can collect all \(n\) of these approximate equations into one big
      matrix equation of the form
      \begin{equation*}
	\u{w}=X(\u{\theta}-\u{\theta}^*)
      \end{equation*}
      where the rows of \(X\) are the partial derivatives evaluated at the
      appropriate \({\u{x}}_i\) and \({\u{\theta}^*}\).
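      Written out, \(X\) is the \(n \times p\) matrix of partial derivatives
      (the Jacobian of \(f\) with respect to the parameters), one row per data
      point:
      \begin{equation*}
	X =
	\left[\begin{array}{cccc}
	  f_{\theta_1}({\u{x}}_1;{\u{\theta}^*}) & f_{\theta_2}({\u{x}}_1;{\u{\theta}^*}) & \cdots & f_{\theta_p}({\u{x}}_1;{\u{\theta}^*}) \\
	  f_{\theta_1}({\u{x}}_2;{\u{\theta}^*}) & f_{\theta_2}({\u{x}}_2;{\u{\theta}^*}) & \cdots & f_{\theta_p}({\u{x}}_2;{\u{\theta}^*}) \\
	  \vdots & \vdots & & \vdots \\
	  f_{\theta_1}({\u{x}}_n;{\u{\theta}^*}) & f_{\theta_2}({\u{x}}_n;{\u{\theta}^*}) & \cdots & f_{\theta_p}({\u{x}}_n;{\u{\theta}^*})
	\end{array}\right]
      \end{equation*}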
      
      Now we do linear regression to find \(\u{\theta}-\u{\theta}^*\) in the
      usual way:
      \begin{equation*}
	\u{\theta}-\u{\theta}^*  = (X^TX)^{-1}X^T{\u{w}}
      \end{equation*}
      from which we obtain our next estimate for \({\u{\theta}}\) as
      \begin{equation*}
      {\u{\theta}}=\u{\theta}^* + (X^TX)^{-1}X^T{\u{w}}
      \end{equation*}
      Let's take a moment to say "Hallelujah and Amen!" We've solved an
      associated linear problem, and hope that the new estimate
      for \(\u{\theta}\) is closer to the true solution.
      
      Now we reset our guess \(\u{\theta}^*\) to this new value,
      \begin{equation*}
      \u{\theta}^* = {\u{\theta}}
      \end{equation*}
      and iterate -- that is, do it again -- computing the new, better
      estimate \({\u{\theta}}\) for as long as the system is converging.
      
      By "converging" I mean that the difference \(\u{\theta}-\u{\theta}^*\)
      goes to zero as you iterate. Then you just cross your fingers and hope
      that you've found a minimum!
      
      (Don't you hate it when it comes down to crossing your fingers and
      hoping? Well heh, at least it's a start!:)
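      Here's a bare-bones Python sketch of the whole iteration for the
      exponential model \(y=ae^{bx}\), using the partials written out earlier.
      The data and the starting guess are made up, and a real implementation
      would add safeguards (step damping, better convergence tests), but this
      is the algorithm:

          import numpy as np

          def gauss_newton_exp(x, y, a, b, n_iter=20, tol=1e-12):
              """Fit y = a*e^(b*x) by repeatedly linearizing about the current guess
              and solving the linear least-squares problem w = X (theta - theta*)."""
              for _ in range(n_iter):
                  w = y - a * np.exp(b * x)                       # w_i = y_i - f(x_i; theta*)
                  X = np.column_stack([np.exp(b * x),             # df/da at the guess
                                       a * x * np.exp(b * x)])    # df/db at the guess
                  delta, *_ = np.linalg.lstsq(X, w, rcond=None)   # (X^T X)^{-1} X^T w
                  a, b = a + delta[0], b + delta[1]               # theta = theta* + delta
                  if np.linalg.norm(delta) < tol:                 # theta - theta* is ~ 0
                      break
              return a, b

          x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])    # made-up data again
          y = np.array([2.1, 3.4, 6.2, 10.9, 19.8, 36.5])
          print(gauss_newton_exp(x, y, a=2.0, b=0.5))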
      
      The cool thing that you'll notice in the linear regression modeling is
      that the parameter estimates (the estimated components
      of \(\u{\theta}-\u{\theta}^*\)) are going to go to zero, but their
      standard errors won't. So the p-values are going to go to 1, which is
      really funny for linear regression, and the standard errors will converge
      to fixed values -- which will be the standard errors of the non-linear
      parameters!
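      Concretely, at convergence the usual linear-regression recipe, applied to
      the final \(X\) and residuals \(\u{w}\), gives the approximate standard
      errors:
      \begin{equation*}
	s^2 = \frac{1}{n-p}\summ_{i=1}^{n} w_i^2,
	\qquad
	\widehat{SE}(\theta_j) = \sqrt{s^2\,\left[(X^TX)^{-1}\right]_{jj}}
      \end{equation*}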
      
      From those we can compute our parameter confidence intervals in the usual
      way. That's really sensational. I hope that you're as excited as I am
      about now (but somehow I'm imagining that you're thinking, um, this guy
      is off his rocker...:).
      
Here's
      a Mathematica file that illustrates the process in action for a
      really interesting data set, involving the onset of rigor mortis
      in human bodies. Kind of a macabre problem, but we deal with all kinds of
      modeling problems in this class!