Relations Between Variables
Scientists are forever trying to find relations between quantities:
In each case above, the scientists run experiments and collect pairs of numbers for the quantities that they are trying to relate to each other:
· (weekly exercise minutes, systolic blood pressure), e.g. (35, 136), (0, 155), (200, 121), …
· (angle of ramp (deg.), time of ball (sec.)), e.g. (5, 17), (20, 6), (45, 3), …
· (cc of fertilizer per sq. meter of garden, plant height (m.)), e.g. (.5, .33), (.77, .02), (.01, .54), …
They might plot these number pairs on a graph and examine the graph for a trend. For example, in repeating the ball-rolling experiment pictured below, a student records pairs of numbers representing the angle of the ramp and the corresponding “rolling times.” She then plots these numbers on a graph as follows:

What can you conclude about a relationship between the angle of the ramp and the time the ball rolls? Can you explain this?
Another student does a very careful experiment to relate the growth of plants with the amount of fertilizer used and comes up with the following graph:

Is there a relationship between plant height and the amount of fertilizer used? Can you explain this?
These examples of seeking a relationship between variables can be quantified by using methods of statistical analysis called Correlation and Regression.
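Correlation analysis seeks to identify (by a single number) the degree to which there is a (linear) relation between the numbers in sets of data pairs. The correlation coefficient of a set of data pairs $(x_i, y_i)$, $i = 1, \dots, n$, with x- and y-means $\bar{x}$ and $\bar{y}$ respectively is

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
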
You don’t need to worry about computing this number by hand; it’s easy to use a computer to calculate it. The interpretation of this number is more important: it always lies between –1 and 1. The closer r is to 1, the more positively correlated the two sets of numbers are, in the sense that an increase in x corresponds to a roughly proportional increase in y, and a decrease in x to a roughly proportional decrease in y. On the other hand, if r is close to –1, then increases in x correspond to decreases in y and decreases in x correspond to increases in y, so we say that x and y are negatively correlated. Finally, if r is close to zero, there is little if any linear relationship between the variables, and we say they are uncorrelated.
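As a minimal sketch of how a computer carries out this calculation, the following Python snippet computes r directly from the definition of the correlation coefficient. The function name `correlation` is our own, and the data are just the three sample (weekly exercise minutes, systolic blood pressure) pairs listed earlier, so the result is only illustrative.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r of the data pairs (xs[i], ys[i])."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of products of deviations from the means (numerator of r).
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    # Sums of squared deviations (under the square root in the denominator).
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Sample pairs from the text: (weekly exercise minutes, systolic blood pressure).
exercise = [35, 0, 200]
pressure = [136, 155, 121]
r = correlation(exercise, pressure)
print(round(r, 2))  # about -0.91: a strong negative correlation
```

With only three pairs this number should not be taken seriously as evidence, but it shows the sign convention: more exercise goes with lower blood pressure, so r is negative.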
Consider the earlier graphs from the “ball-rolling” and “fertilizer” experiments:

As the height of players increases, does the weight generally
i. -.75 ii. .03 iii. .73 iv. .99

As the length of the lobster increases, does the number of eggs produced generally
i. -.89 ii. -.13 iii. .25 iv. .91
Regression analysis is used to determine if a relationship exists between two variables. To do this, a line is created that best fits a set of data pairs. We will use linear regression, which seeks a line with equation y = mx + b that “best fits” the data. The term “best fits” has a precise mathematical meaning that we can think of as “minimizing the distances to the line for each data point” (more precisely, minimizing the sum of the squared vertical distances from the data points to the line). In addition to an equation for the line, the regression analysis calculates a p-value and an R² value (see below for an explanation of each).
1) Generation of the regression line and equation for the line:
For example, if a computer program for doing regression is applied to the data from the “Ball rolling experiment” the best fitting line is shown on the graph below:
It will turn out that any other line will give a larger overall distance to the points than this line does.
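To make the “best fit” idea concrete, here is a small Python sketch of the usual least-squares calculation, under the assumption that “overall distance” means the sum of squared vertical distances from the points to the line. The helper names `fit_line` and `total_squared_distance` are our own, and the data are the three sample (angle, time) pairs listed earlier, not the data plotted in the graph.

```python
def fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    b = y_mean - m * x_mean  # the line passes through (x_mean, y_mean)
    return m, b

def total_squared_distance(m, b, xs, ys):
    """Sum of squared vertical distances from the data points to y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Sample pairs from the text: (angle of ramp in degrees, time of ball in seconds).
angles = [5, 20, 45]
times = [17, 6, 3]

m, b = fit_line(angles, times)
best = total_squared_distance(m, b, angles, times)

# Nudging the slope or the intercept away from the fitted values
# makes the total squared distance larger.
tilted = total_squared_distance(m + 0.05, b, angles, times)
shifted = total_squared_distance(m, b + 1.0, angles, times)
print(best < tilted, best < shifted)  # True True
```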
You can frequently estimate the equation of the regression line (y = mx + b) by estimating its slope (m) (i.e. $m = \frac{y_2 - y_1}{x_2 - x_1}$ for two points $(x_1, y_1)$ and $(x_2, y_2)$ on the line) and its y-intercept (b) (i.e. the value of y where the line crosses the y-axis, when x = 0). In a regression graph the x-axis shows the independent variable and the y-axis shows the dependent variable.
From the graph above, we could estimate that the line has y-intercept close to 6 because if you continue to draw the line out, it crosses the y-axis near 6.
To determine the slope you must first choose two points on the line. These are not existing data points, but points of your choice. The easiest and usually most accurate method is to use the grid lines as your guide in choosing your values on the x-axis and then estimate your y-values. So, for the above graph, choose the two points (20, 4.5) and (60, 1.25). It is best to spread out the points you choose, one from either end of your line. It is especially important to remember to choose two points ON THE LINE because you are trying to estimate an equation for the line itself, NOT your data points. Using the points we chose, plug the numbers into the equation for the slope.
$$m = \frac{1.25 - 4.5}{60 - 20} = \frac{-3.25}{40} \approx -0.081$$
Recognize that this is just a guess based on “eyeballing the graph”. Now plug your values for the slope and the y-intercept into the equation y = mx + b and you will get:
y = -0.081x + 6.
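The same estimate can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function names `slope_between` and `predict` are our own, and the two points and the y-intercept of 6 are the values read off the graph above.

```python
def slope_between(p1, p2):
    """Slope of the line through two points: change in y over change in x."""
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)

# Two points chosen ON THE LINE (not data points), read off the graph.
m = slope_between((20, 4.5), (60, 1.25))
b = 6  # y-intercept estimated by extending the line to x = 0

print(round(m, 3))  # -0.081

def predict(x):
    """Estimated y for a given x, using y = m*x + b."""
    return m * x + b
```

Because both the slope and the intercept were read off the graph by eye, `predict` gives rough estimates only.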
2) Generation of R² value
When you do regression analysis using a computer program, you’ll sometimes see some indication of the coefficient of determination or “goodness of fit”,

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$

where $y_i$ is the measured value, $\hat{y}_i$ is the value of the regression line evaluated at $x_i$, $\bar{y}$ is the mean of the measured y-values, and n is the sample size. R² is simply an indication of how well your data points fit the regression line. It is used to determine if you can use your equation of the line to make any further predictions about the relationship between your variables. R² values fall between 0 and 1. The closer the R² value is to 1, the more of your data points fall on or very near the regression line.
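As an illustration, the Python sketch below computes the coefficient of determination as one minus the ratio of the squared distances from the line to the squared distances from the mean, which is the common definition for a least-squares line and is assumed here. The function name `r_squared` is our own, and the data are the three sample (angle, time) pairs listed earlier together with their least-squares line, so the number is only illustrative.

```python
def r_squared(xs, ys, m, b):
    """Coefficient of determination for the line y = m*x + b."""
    y_mean = sum(ys) / len(ys)
    # Squared distances from each measured y to the line's prediction.
    ss_residual = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    # Squared distances from each measured y to the mean of the y-values.
    ss_total = sum((y - y_mean) ** 2 for y in ys)
    return 1 - ss_residual / ss_total

# Sample pairs from the text: (angle of ramp in degrees, time of ball in seconds).
angles = [5, 20, 45]
times = [17, 6, 3]

# Least-squares slope and intercept for these three points.
n = len(angles)
x_mean = sum(angles) / n
y_mean = sum(times) / n
m = (sum((x - x_mean) * (y - y_mean) for x, y in zip(angles, times))
     / sum((x - x_mean) ** 2 for x in angles))
b = y_mean - m * x_mean

r2 = r_squared(angles, times, m, b)
print(round(r2, 2))  # about 0.8
```

A value of about 0.8 says the line accounts for most, but not all, of the variation in these three rolling times.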
3) Generation of a p-value
Using the computer will allow you to calculate a p-value for your relationship. The p-value allows you to decide whether or not to reject your null hypothesis (that there is no relationship between the variables). If your p-value is greater than 0.05 there is NO significant relationship and you would accept (more precisely, fail to reject) your null hypothesis. If your p-value is less than 0.05 there IS a significant relationship and you would reject your null hypothesis.
Which of the other examples displayed this causal relationship?
y = 3758.525x - 106704
and the graph (with regression line) is drawn below:

Homework on Correlation and Regression:
Turn in answers to problems 1–4 from earlier and also for the following:
