Supporting Materials, Geostatistics Module

Lab: Variograms and Kriging

This lab is an introduction to the geostatistical way of mapping. There are two versions of this lab: the Geo-EAS version (this one), and a GS+ version. Geo-EAS is Unix software, while GS+ is Windows software.
This lab covers two separate and very important topics, reflecting the discussion in the geostatistics module. The first is variography, the study of spatial autocorrelation in a data set using the variogram; the second is kriging interpolation, the creation of grids (maps) from scattered data.
While many software packages such as ArcView offer you the possibility of interpolation, we must be wary of it for two reasons:
1. interpolation of the data may be nonsensical, and
2. the method of interpolation may be entirely inappropriate for the spatial autocorrelation implied by the data.
Lab Objectives:
In this lab, you will
1. use ArcView and its Spatial Analyst Extension to
  1. load in the Illinois Cancer Data,
  2. interpolate the data, creating contour maps of the results.
2. use geoeas to
  1. model the spatial autocorrelation in the Illinois Cancer data set using variogram analysis, and
  2. create kriged maps of a variable using the results of your spatial autocorrelation analysis.
3. use ArcView to
  1. load the gridded data created by geoeas,
  2. create a contour map of the geoeas data to compare with the map created by ArcView's Spatial Analyst.
Lab

ArcView your data:
- Data:
  - If you want to use your own data...
  - If you don't have data of your own, you'll need to download ill.csv, a new UTM (Universal Transverse Mercator coordinate) version of the Illinois disease data, and the utm shape file for the Illinois county maps. Use WinZip to extract the zipped files to your machine or personal space (make a note of the directory where you put them).
- Fire up ArcView, and add the utm theme to your view.
- Now you'll want to add a table - the ill.csv file - which you do from the Project window (or table window). Ill.csv is a Comma Separated Value file, and hence text.
  Once you've added the table, you can "Add Event Theme" to the view, which you do from the View menu of the View window. Make sure that you use the X and Y variables for geographic coordinates, rather than lat and long (which will be ArcView's default choice). The X and Y in the file are in UTM coordinates.
- Have a look at the map and the data, to make sure that all is well.
- From the Project window, we want to open the File menu and add an ArcView extension. Select Extensions, then check the checkbox for Spatial Analyst. Doing so will add a number of menus to the View Window menubar.
- Making Contours: before you do anything else, save your project. The next step has been known to cause ArcView to bomb out.
- Back to your view, select the ill.csv theme. Now select "Create contours" from the Surface Menu. Take all the defaults: the only place where you will have to change a selection is in the "Interpolate Surface" Z Value Field: by default it will be on "Lat", whereas you want to choose scv1 (scv1 was one of the synthetic cancer variables created by Jacquez, et al. via Principal Component Analysis of the cancer data set).
  Other than that, just hit "Ok" buttons! ArcView will process your request, interpolate the data, and add the contour theme to your view.
- If ArcView bombs out on you, your contour may have been created anyway, in C:\TEMP. Have a look to see if a contour shape file has been created there. Hopefully it has, in which case you can simply start ArcView again, open your project, and add in this shapefile. It's the painful way....
- As usual, you can double click on the contour legend to change from Single Symbol to "Graduated Color", choosing contour value as the variable. Do the same for ill.csv, so that they use the same scheme (you can edit the categories of ill.csv to match those of contours of ill.csv).
Now on to Geoeas, and the geostatistical approach to the same problem. The big picture here is that, rather than making ad hoc and generally inappropriate assumptions about the spatial autocorrelation evidenced in the data, we are incorporating it into the interpolation process via the variogram. More information should result in better maps.
Interpolation like ArcView's IDW (Inverse Distance Weighting) actually makes an assumption about the spatial autocorrelation - it's just based on a whim. It does not reflect (except by chance) the realities of your data, and the process you are studying.
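To see what that assumption looks like, here is a minimal Python sketch of inverse distance weighting. The function name idw, the sample points, and the power of 2 are assumptions made for illustration; this is not ArcView's actual implementation or its defaults.

```python
import math

def idw(points, x, y, power=2.0):
    """Inverse distance weighted estimate at (x, y).

    points: list of (px, py, value) tuples.
    power:  exponent on distance; it is picked by convention,
            not estimated from the data.
    """
    num = 0.0
    den = 0.0
    for px, py, value in points:
        d = math.hypot(x - px, y - py)
        if d == 0.0:
            return value  # exactly on a data point: return its value
        w = 1.0 / d ** power  # weight depends on distance only
        num += w * value
        den += w
    return num / den

# Made-up sample values, for illustration only (not the Illinois data).
samples = [(0.0, 0.0, 1.0), (10.0, 0.0, 3.0), (0.0, 10.0, 5.0)]
print(idw(samples, 5.0, 5.0))  # target is equidistant from all three points
print(idw(samples, 0.0, 0.0))  # target sits on the first data point
```

Notice that the weights depend only on distance and on a power that we picked in advance: nothing in the data has been consulted about how quickly similarity falls off with distance.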
Variogram modelling is the crucial step in which we analyze and model the spatial autocorrelation structure. If your project data lend themselves to this kind of analysis, then by all means take some time before you're done with your projects to check out the spatial autocorrelation via its variogram structure. In the meantime, we'll practice on the Illinois data.
- Geoeas is UNIX software. For that reason, you must first fire up eXceed (the X server on your Windows machine).
- telnet to your sph account, and, in particular, to one of the following machines: csrvr5.sph, csrvr3.sph, or csrvr11.sph (choose one randomly, so that we don't overtax any one machine).
- Type the following to create a working directory, and copy the data we'll use there (the easiest way is to cut and paste from your web browser):
```
mkdir geoeas
cd geoeas
cp ~aelon/Public/Html/etc/data/illinois/lab/ill.dat .
```
  You now have the data you'll need for this part of the exercise.
- Use the who command to find the name of the machine you're logged in from, and then set your display to it.
- Now you need to set two additional variables. Type the following to your telnet terminal:
```
setenv GEO_EAS /group/dengue/unix_version
setenv PATH ~aelon/bin:$PATH
```
- We're ready to go! We'll be using several different geoeas programs: prevar (which computes the distances between pairs), vario (which computes the sample variograms and then helps us model them), and krig (with which we do our kriging, oddly enough!). We begin with prevar: type
  geoeas prevar
  A screen should pop up with a little info about prevar. For the most part, just follow directions.
  When you "press any button" you should get a screen with a menu at the bottom. Your arrow keys work to move between menu items. Alternatively you can select a menu item by typing the first letter.
  Type an "f", and the file menu will be selected. Type in the data file name: ill.dat. Follow the directions to create a file ill.pcf.
- Once you're back at the menu bar, type a "v", for the variable menu. The spatial coordinates in this file are actually named x and y, rather than lat and long, so use the space bar to step through the variable names until you get to the ones to use for the x and y variables. Hit return to select them.
- Now select the Execute menu, by typing an "e". The file ill.pcf will be created, and then you may quit. Not very exciting, is it!? It's just to prepare the pair comparison file (.pcf) for the other programs. Each pair of data points is represented in this file.
- Time now for vario. Wake up! I have to tell you about one weird thing about vario, and other geoeas programs (like krig) which use graphical output: they have two screens, but all input from you goes through the graphics window. Thus it must be the active window at all times. You will be tempted (and no doubt succumb) to make the prevar-like screen active. Forget it, because it doesn't understand input: it's deaf.
  Fire up vario:
  geoeas vario
- Resize/move the two windows so that you can see them both at once. Again, make the graphics window active.
- The menus work in the same way as they did for prevar, so I'm not going to be quite so explicit from here on out. Select data, and it should fill in the name ill.pcf by default. It automatically moves to the variable menu, as you must choose a variable. Space through until you get to scv1 (we'll replicate the analysis of the geostatistics module, to some extent).
- Select Options/Execute
- Select Postplot: it will show you the data pattern. A screen will appear offering you the opportunity to make postscript plots of each screen. You should respectfully decline by typing a "q".
- Select Execute, then Plot. You're looking at the empirical variogram for the scv1 variable. Doesn't look too good at this scale, does it?
- If you look back at the Results screen, you'll see the lag classes described (especially the number of pairs in a class). Notice that there were only 32 pairs of points in that first class, while there were over 200 in all other classes.
- Now let's model the variogram (select the Model menu). We'll use the automated modelling, as it's easy. Select Auto: would you have come up with that one? No Way!
  Go back to the Model screen, to see the model that the automated procedure came up with (bottom right of your screen). You'll note that the model is nested: that is, it is a weighted (positive) linear combination of several models, each with a sill and a range. There is also a substantial nugget (the nugget is the "y-intercept" of your model). (Do you remember the practical implication of having a large nugget in a variogram model?)
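A nested model of this kind is easy to write down. The sketch below is in Python, with hypothetical parameters (they are not the values that vario's Auto fit reports), and it assumes the "practical range" convention for the exponential structure, which may differ from the convention your software uses.

```python
import math

def spherical(h, sill, rng):
    """Spherical structure: rises to `sill` at distance `rng`, flat beyond."""
    if h >= rng:
        return sill
    r = h / rng
    return sill * (1.5 * r - 0.5 * r ** 3)

def exponential(h, sill, rng):
    """Exponential structure: approaches `sill` asymptotically.
    Assumes the practical-range form (about 95% of sill at h = rng)."""
    return sill * (1.0 - math.exp(-3.0 * h / rng))

def nested_variogram(h, nugget, structures):
    """gamma(h) = nugget + sum of structures for h > 0, and gamma(0) = 0."""
    if h == 0.0:
        return 0.0
    return nugget + sum(f(h, sill, rng) for f, sill, rng in structures)

# Hypothetical nugget and structures, for illustration only.
nugget = 0.3
model = [(spherical, 0.4, 50.0), (exponential, 0.3, 200.0)]
for h in (0.0, 1.0, 50.0, 200.0, 1000.0):
    print(h, round(nested_variogram(h, nugget, model), 3))
```

The jump from 0 at h = 0 to roughly the nugget value at very small h is the "y-intercept" referred to above, and the total sill is the nugget plus the sills of the individual structures.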
- Let's assume that you are happy with the model. In that case, you'd want to save the results (use the Save menu), so that you can use that model in the kriging process. Use the default filename for the parameters (ill.kpf).
- Now you can fiddle with the parameters of the models if you like, by selecting the Model menu item from the Model screen. Change the nugget, and then plot the result. Change the model type. Play around! See if you can do better than the automated program. Remove a model and hit the "Refine" button: vario will try to come up with a good fit using only the models which remain.
  When you're done goofing off, quit vario by hitting the "q" key several times, until you're excused.
  There are lots of things we haven't done (and should have): we didn't adjust lag sizes; we didn't check for anisotropy (by looking at different directions); we didn't look at other variables. So little time! These are really important steps, and I hope that you will someday have an opportunity to try them out.
  You realize, I hope, that this is the heart and soul of the difference between what we've done here and what ArcView did to create its surface/contours: we have actually modelled the spatial autocorrelation structure of the data, and so can hope that this characterization will result in an improved map. While flipping through the defaults with ArcView we made no such attempt.
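If you'd like to see what sits behind the sample variogram and its lag classes, here is a minimal Python sketch. It is a simplification under stated assumptions: it is omnidirectional, it uses plain bins [k*lag_width, (k+1)*lag_width) rather than whatever lag tolerances vario applies, and the function name and data are made up for illustration.

```python
import math
from collections import defaultdict

def empirical_variogram(points, lag_width, n_lags):
    """Return (mean distance, semivariance, pair count) for each lag class.

    points: list of (x, y, value). Pairs farther apart than
    lag_width * n_lags are ignored. No check for anisotropy.
    """
    # lag class -> [sum of distances, sum of squared differences, pair count]
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for i in range(len(points)):
        xi, yi, zi = points[i]
        for j in range(i + 1, len(points)):
            xj, yj, zj = points[j]
            d = math.hypot(xi - xj, yi - yj)
            k = int(d // lag_width)
            if k < n_lags:
                acc = sums[k]
                acc[0] += d
                acc[1] += (zi - zj) ** 2
                acc[2] += 1
    result = []
    for k in sorted(sums):
        sum_d, sum_sq, n = sums[k]
        # Semivariance is half the mean squared difference in the class.
        result.append((sum_d / n, sum_sq / (2.0 * n), n))
    return result

# Made-up transect with values at unit spacing (not the Illinois data).
pts = [(float(i), 0.0, float(v)) for i, v in enumerate([1, 2, 4, 3, 5])]
variogram = empirical_variogram(pts, lag_width=1.0, n_lags=3)
for dist, gamma, n in variogram:
    print(dist, gamma, n)
```

Changing lag_width changes how many pairs land in each class, which is why a class with few pairs deserves less trust than its neighbours.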
- On to the kriging aspect of things. Type
  geoeas krig
  We'll Load parameters (the ill.kpf file we just created). The Krige Options screen appears, and if you're happy with what you see - and why on earth wouldn't you be, for pity sake! - we're ready to execute. Notice that according to this screen, we're doing Ordinary, point kriging by default, like they did in the papers we read. You did read the papers, didn't you?!
- Go with the Execute menu. Don't worry about the debug state - that's a little beyond us right now - so just hit return. As the program kriges the data, you'll see a graphical representation of the size of values. Remember that each one of those blocks plotted represents the solution of a linear system of equations for the calculations of weights.
  This is a cheap sort of contour map. Okay: "q" your way out of krig.
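To make that linear system concrete, here is a minimal Python sketch of ordinary point kriging at a single location. The variogram model and its parameters, the sample points, and the function names are hypothetical; unlike krig, the sketch uses all the data points rather than a search neighborhood.

```python
import math

def gamma(h, nugget=0.1, sill=1.0, rng=10.0):
    """Hypothetical spherical variogram with a nugget; gamma(0) = 0."""
    if h == 0.0:
        return 0.0
    if h >= rng:
        return nugget + sill
    r = h / rng
    return nugget + sill * (1.5 * r - 0.5 * r ** 3)

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ordinary_krige(points, x0, y0):
    """Return (estimate, kriging variance, weights) at (x0, y0)."""
    n = len(points)
    # Left-hand side: variogram between data points, bordered by the
    # unbiasedness constraint (the weights must sum to 1).
    a = [[gamma(math.hypot(points[i][0] - points[j][0],
                           points[i][1] - points[j][1]))
          for j in range(n)] + [1.0] for i in range(n)]
    a.append([1.0] * n + [0.0])
    # Right-hand side: variogram between each data point and the target.
    b = [gamma(math.hypot(px - x0, py - y0)) for px, py, _ in points] + [1.0]
    sol = solve(a, b)
    weights, mu = sol[:n], sol[n]  # mu is the Lagrange multiplier
    estimate = sum(w * p[2] for w, p in zip(weights, points))
    variance = sum(w * g for w, g in zip(weights, b[:n])) + mu
    return estimate, variance, weights

# Made-up samples on the corners of a square (not the Illinois data).
samples = [(0.0, 0.0, 1.0), (4.0, 0.0, 3.0), (0.0, 4.0, 5.0), (4.0, 4.0, 2.0)]
print(ordinary_krige(samples, 1.0, 1.0))
```

One system like this is solved for every grid node, with the right-hand side changing from node to node. The variogram model enters both sides of the system, which is how the spatial autocorrelation ends up shaping the weights.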
We're ready to bring our results into ArcView.
- Before we can load the results of the kriging program into ArcView, we have to reformat them a bit. To make this step easy for you, I wrote a little program which does this. From your telnet session type
```
/group/dengue/bin/geoeas2arcview ill.grd
```
  which should produce a file called grid.txt
- ftp your file (grid.txt) to your itd account, from your telnet session:
```
ftp login.itd
login with username and password
ascii
put grid.txt
```
  The file can now be found on your IFS home directory. Alternatively, you can use a Windows ftp program to ftp to your sph account and bring the file back to your local machine.
- Once again you must Add a table (grid.txt), and then an event theme. You should now plot your grid (change to "Graduated Color", using *scv1) against the data.
- Now follow the same directions you did at the beginning to create a contour map of the kriged data. Again, bombing out is a possibility, so save your project first....
- As usual, you can double click on the legends of the two contour maps to change from Single Symbol or Color to "Graduated Color". Choose the same color scheme for both the ArcView contour file and the kriged data (you can edit the categories of one to match the other); this makes it easier to compare the two maps.
  Include the ill.csv theme and the two contour plots in your view. Flip one contour off and the other on, to see how differently they've mapped the data. As you see here, and as you saw in the geostatistics module, the results can be wildly different!
  Why would you choose one map over the other? If one seems to "trust the data" less, why is that?
- Please evaluate this lab. We appreciate your comments!
Optional exercises (for the courageous):
- If you want to see an amusing example of spatial autocorrelation in the data, invoke vario again and have a look at the variogram of a variable like lat or long. You'll notice a very striking variogram - can you explain it?
- One other thing we haven't done, which we should do, is have a look at the cross-validation statistics for the data used to generate the map. For that we would use the program xvalid, which you invoke using
```
geoeas xvalid
```
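The idea behind cross-validation can be sketched in a few lines of Python: drop each data point in turn, re-estimate it from the others, and summarize the errors. This is a generic leave-one-out sketch with hypothetical names and made-up data, not a description of xvalid's internals; to keep it short, the stand-in estimator is inverse distance weighting rather than kriging, and any function with the same signature could be swapped in.

```python
import math

def idw(points, x, y, power=2.0):
    """Stand-in estimator: inverse distance weighting (not kriging)."""
    num = 0.0
    den = 0.0
    for px, py, value in points:
        d = math.hypot(x - px, y - py)
        if d == 0.0:
            return value
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

def leave_one_out(points, estimator):
    """Return (observed, predicted) pairs, dropping each point in turn."""
    pairs = []
    for i, (x, y, z) in enumerate(points):
        others = points[:i] + points[i + 1:]
        pairs.append((z, estimator(others, x, y)))
    return pairs

def summarize(pairs):
    """Return (mean error, root mean squared error) of the predictions."""
    errors = [pred - obs for obs, pred in pairs]
    n = len(errors)
    mean_error = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mean_error, rmse

# Made-up transect, for illustration only (not the Illinois data).
pts = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (2.0, 0.0, 3.0)]
pairs = leave_one_out(pts, idw)
print(summarize(pairs))
```

A mean error near zero suggests the estimator is not systematically high or low, while the RMSE gives a feel for the typical size of a miss. As the Davis reference below warns, such statistics are only "best" with respect to the choices made in computing them.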
Software: GS+, Geo-EAS, ArcView
Featured techniques: variogram analysis, kriging.
Readings:
Principal Readings:
1. Carrat, F. and A. Valleron. Epidemiologic Mapping using the 'Kriging' Method: Application to an Influenza-like illness Epidemic in France. AJE, 135:11, pp. 1293-1300.
Reference Readings:
1. Webster R., et al. 1994. Kriging the Local Risk of a Rare Disease from a Register of Diagnoses. Geographical Analysis, 26:2, pp. 168-185.
2. The following books are available from the University Science Library:
  - Basic linear geostatistics / Margaret Armstrong / TN 272.7 .A761 1998
  - Geostatistics for natural resources evaluation / Pierre Goovaerts. / GE 33.2 .M3 G661 1997 (Note: Dr. Goovaerts is a faculty member of the Department of Civil & Environmental Engineering, U of M.)
  - Geostatistical glossary and multilingual dictionary / members of the 1984-1989 IAMG Committee on Geostatistics ; Ricardo A. Olea, editor. / QE 33.2 .M3 G461 1991
  - Fundamentals of geostatistics in five lessons / Andre G. Journel. / QE 33.2 .S82 J681 1989
  - Geostatistical case studies / edited by G. Matheron and M. Armstrong. / TN153 .G461 1987
References:
1. About cross-validation:
  - Myers, Donald E. On Variogram Estimation. The Frontiers of Statistical Scientific Theory and Industrial Applications, American Sciences Press, Inc. 1991.
  - Owosina, A., U. Lall, T. Sangoyomi, and K. Bosworth, Methods for Assessing the Space and Time Variability of Groundwater Data, Utah Water Res. Lab., Utah State Univ., 1992.
    Owosina et al. [1992] compare two multivariate kernel regression estimators, MARS, LOESS, TPSS and Kriging for reconstructing spatial surfaces from a variety of irregularly sampled synthetic (with varying signal to noise ratios) and ground water data sets. Model parameters were chosen automatically using cross validatory measures in all cases. In terms of RMSE and Mean Absolute Deviation, overall algorithm ordering (best to worst) across the data sets was TPSS, LOESS, KERNEL, MARS, KRIGING. The differences between the best and worst algorithm were dramatic in some cases. Methods for interpolating ground water data irregularly sampled in space and time were also illustrated.
    Andy's note: I haven't read this one, but am anxious to see why kriging did so poorly!
  - Davis, B.M., 1987, "Uses and abuses of cross-validation in geostatistics," Mathematical Geology, v.19, n.3, p. 241-248.
    It discusses some common misconceptions concerning cross-validation. For example, use of statistical criteria supposedly yields an optimal semivariogram from among competing models. But Davis states that the semivariogram is only "best" with respect to "choice of discrepancy measure, partition set size, predictive function, and number of models to be evaluated."
  - Isaaks & Srivastava (1989, Applied Geostatistics) discuss the use of cross-validation on pages 533 & 534. A more applied discussion of cross-validation can be found on pages 352-368.
Data:
- Illinois data, based on Geoff's work
Links: Software
- GS+ Home Page. There's a copy of their demo version on the BioMedware server.
- Geoeas links:
  - Geo-EAS Home Page
  - binary DOS version which is fairly friendly, and free of course
  - Linking GeoEAS output to shapefiles for ArcView displays, from this case study in plant pathology.
  - DOS sources (plus binaries)
  - ESRI announces its new ArcInfo Geostatistical Analyst
  - manual for the DOS version (works pretty well for the unix version as well).
  - On-line html documentation - this is probably one of the oldest html documents around! Still works though....
- Gstat is a free geostatistical package with unix and Windows versions. Not too hard to install and use, and contains a number of on-line resources, including examples. I have some compilation notes, and an additional example which shows how to output a (gif) file.
- GSLIB is a collection of DOS or unix routines for performing geostatistical analyses. It is command line based, not easy to install, and not well supported (although there is a text book, GSLIB: Geostatistical Software Library and User's Guide by Clayton Deutsch and Andre Journel, 1992, 340 pp, and an on-line manual for what it's worth). The unix code is maintained at a different site, and includes other resources useful with GSLIB. If you want to install the unix code, then use my Makefile from the top directory to compile it (there isn't one provided!).
- the commercial/mining end of geostatistics: Geovariances.
People/Organizations
- Lance Waller on the problems of teaching geostatistical methods
- Ai-Geostats home page
- Donald Myers, one of the Geostatistics Gurus.
- Edward H. Isaaks runs a consulting firm, and sells "SAGE2001 --The Most Powerful Windows Variography Software on the Planet". That's pretty audacious, given that it's only been out since May '99, but hey - be bold! A couple of pages from his home page are especially interesting:
Geostatistics
- Intro to Geostatistics. He also has a nice review of more general statistics.
- Geostatistics FAQ
- Here's an excellent one: On-line lectures in Population Ecology, especially lectures on
  - Geostatistical Analysis of Population Distribution and
  - Elements of geostatistics.
- A collection of geostatistics-related websites.
- A nice kriging example, comparing kriging to other methods.
- An application of kriging with some theoretical background.

Page by Andy Long. Comments appreciated.

aelon@sph.umich.edu
