Gerard Rushton's Glossary


Address matching with TIGER

The process of searching for matches between a street address in a file of health records and a digital street file, such as the U.S. Census database of streets, called TIGER. Different GIS have different address-matching algorithms and different procedures for dealing with situations where exact matches between the two databases cannot be made. Most algorithms use spatial interpolation techniques to find the expected location of an address along the street face, because the TIGER database stores only the starting and ending addresses of each street segment. The result of an address match is a new health records file in which two locational coordinates are attached to each record. Normally, these are latitude and longitude coordinates in decimal degrees.
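
The interpolation step can be sketched as follows. This is a minimal illustration, not the algorithm of any particular GIS: the function name, the address range and the coordinates are all hypothetical, and real address matchers also handle odd/even sides of the street, offsets from the centerline and inexact street names.

```python
def interpolate_address(house_number, from_addr, to_addr, start_xy, end_xy):
    """Estimate the (x, y) location of a house number along one street segment.

    Assumes house numbers are spread evenly between the segment's
    starting and ending addresses (simple linear interpolation).
    """
    if from_addr == to_addr:
        fraction = 0.5  # single-address range: place at the midpoint
    else:
        fraction = (house_number - from_addr) / (to_addr - from_addr)
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("house number lies outside the segment's address range")
    x = start_xy[0] + fraction * (end_xy[0] - start_xy[0])
    y = start_xy[1] + fraction * (end_xy[1] - start_xy[1])
    return (x, y)


# Hypothetical segment: addresses 100 to 200, coordinates in decimal degrees.
location = interpolate_address(150, 100, 200, (-91.50, 41.60), (-91.52, 41.60))
```

Here number 150 falls halfway through the range, so it is placed at the midpoint of the segment.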

Aggregation effects

The sensitivity of the results of any spatial analysis to the sizes and shapes of the spatial units examined. In disease mapping, the variability of disease rates typically increases as the population sizes of the spatial units become smaller. The size of the spatial aggregation unit acts as a spatial filter (see below). Correlation coefficients typically become larger as the size of the spatial units becomes larger. When data are aggregated to political and administrative units whose sizes can be quite variable, these relationships with aggregation may vary within a study region. Aggregation effects are different from the modifiable areal unit problem (MAUP) effects since the latter is concerned with the sensitivity of results to changing shapes (not sizes) of the spatial units used.

Areal interpolation

The process of transferring attributes from one set of spatial objects to another within a defined portion of geographic space. Areal interpolation methods transfer data collected originally on one set of areal units (source regions) to a different set of areal units (target regions). Point interpolation, by contrast, determines data values at all points from values at a sample of points. In the software Spherekit, for example, the available interpolation methods are inverse distance weighting, triangulation, kriging, multiquadric, and thin plate spline. Spherekit was developed at the National Center for Geographic Information and Analysis, Santa Barbara, CA, and is available for download by anonymous ftp from the NCGIA. Check the homepage at http://www.ncgia.ucsb.edu/pubs/spherekit/main.html.
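
One simple way to move counts from source regions to target regions is area weighting, sketched below. It assumes the count is spread uniformly within each source region, which is often not true of populations; the region names, areas and counts are made up for illustration.

```python
def areal_weighting(source_counts, overlap_area, source_area):
    """Allocate source-region counts to target regions in proportion to overlap area.

    source_counts: {source_id: count}
    overlap_area:  {(source_id, target_id): area of intersection}
    source_area:   {source_id: total area of the source region}
    """
    target_counts = {}
    for (source_id, target_id), area in overlap_area.items():
        share = area / source_area[source_id]  # fraction of the source inside the target
        target_counts[target_id] = (
            target_counts.get(target_id, 0.0) + source_counts[source_id] * share
        )
    return target_counts


# Hypothetical data: source A lies wholly in T1; source B is split 25% / 75%.
counts = {"A": 100, "B": 60}
areas = {"A": 10.0, "B": 20.0}
overlaps = {("A", "T1"): 10.0, ("B", "T1"): 5.0, ("B", "T2"): 15.0}
estimates = areal_weighting(counts, overlaps, areas)
```

With these numbers T1 receives 100 + 15 = 115 and T2 receives 45, so the total of 160 is preserved.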

Buffering

Defining polygons that have known distance relationships from points, lines or areas. An example of a buffer might be the area that is within 100 meters of the street centerline of a heavily trafficked road; or it might be all areas that are within a given distance of incinerators.
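
A buffer test around a straight street segment reduces to a point-to-segment distance check. The sketch below assumes planar coordinates in meters (not latitude and longitude); the function names and the sample coordinates are hypothetical.

```python
import math


def distance_to_segment(p, a, b):
    """Shortest planar distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(px - ax, py - ay)  # degenerate segment: a single point
    # Position of the perpendicular foot along the segment, clamped to its ends.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def in_buffer(p, a, b, radius):
    """True if p lies within `radius` of the segment from a to b."""
    return distance_to_segment(p, a, b) <= radius


# A residence 80 m from a 200 m centerline falls inside a 100 m buffer.
inside = in_buffer((50.0, 80.0), (0.0, 0.0), (200.0, 0.0), 100.0)
```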

Confidence intervals for age-adjusted rates

The statistical likelihood that an observed age-adjusted disease rate in a small area is larger than the rate observed for a much larger population.

Confidentiality Issues

The institutional and technical means for ensuring that the identities of individuals cannot be determined from health data released to the public and that persons will never be able to gain access to such individual-level information obtained after an explicit or implied promise of privacy has been given.

Confounding Factors

Any attribute of a population that affects the likelihood that an individual will have a disease.

Cluster

Clusters of disease may occur in space, in time or in both. A cluster is a focus of particularly high incidence. "A bounded group of occurrences related to each other through some social or biological mechanism, or having a common relationship with some other event or circumstance" (Knox, 1988). "An unusual aggregation of health events, real or perceived," Centers for Disease Control (CDC). An excess of disease in a definable subpopulation.

Cluster Detection Approaches

The philosophy that describes the possible process that might lead to spatial patterns of a disease that have unexpectedly large numbers of cases in some areas. There are tests for overall clustering, tests for the detection of clusters in specific areas and focused tests--where putative clusters are thought to exist.

Cluster Detection Methods

For each cluster detection method there is a theory, a measurement tool, and a method of inference from which a conclusion about the presence or absence of clustering is reached. Methods can be grouped into distance-based measures and area-based counts, known as quadrat methods. Some methods test for overall clustering; others aim to identify specific clusters (Kulldorff and Nagarwalla, 1995).

Statistical power

The ability to detect an effect given that it is present. This includes the ability of methods to identify true clusters (true positives), and also the frequency with which they report clusters falsely (false positives). "The classical approach for comparing the sensitivity and specificity (or efficiency) of statistical methods is to run a series of simulations with specified patterns to be detected -- so-called power studies. That is, one uses a cluster generating program (the alternative hypothesis) to construct a series of scenarios and then analyses them using the methods under study. The results are characterized in terms of the ability of the method to detect the pattern or its failure to do so." (Wartenberg and Greenberg, 1993, p. 1765).

Demographics: small area population estimation

Methods for estimating the social and demographic characteristics of small areas that often do not correspond with census defined areas or census times.

Density Estimation

See Spatial Filters.

Density Equalizing Map Projections

A map 'projection' developed from an algorithm designed to represent areas as proportional in area to some variable such as the population at risk. An attempt is made to preserve local adjacency and, to the extent possible, the shape of entities.

Disease Surveillance

Use of a systematic and objective method to detect, prioritize, and monitor the occurrence of statistically significant clusters of a disease. As Turnbull et al. note (1990, p. S143): "Routine examination of disease occurrence with cluster evaluation permutation procedure would allow state health officials to prioritize case investigations and to respond in a timely and efficient manner to inquiries of reported clusters."

Disease Mapping

Geographical distribution of cases or rates of disease. When areas are placed in categories according to ranges of rates, the map is a choropleth map. When rates are displayed as a continuous distribution, it is an isopleth map. Usually, data for the numerator of a rate are compiled from mortality or morbidity data, and data for the denominator are compiled from census statistics. Data are usually aggregated over counties, census tracts, block groups, or some other census defined administrative area. As Marshall (1991, p. 430) notes, "the choice of coloring or shading, and the cut-off levels for incidence rates, can dramatically change the visual image." The isopleth map is a contoured map where smoothing has been introduced by interpolating between 'spot heights' at area centroids.

Dispersion models

Mathematical models that predict the movement through geographic space of some element of interest in exposure assessment. Plumes from possible toxic spills, or air pollution under specified conditions of land use and meteorological conditions are examples of elements for which dispersion models have been developed.

Ecological Analysis

The unit of analysis is a group of individuals, often defined geographically, and the relationship between the incidence of disease in spatial units and other covariates is examined.

Environmental Justice

The inequitable impact of environmental hazards on poor and minority communities.

Empirical Bayes

A statistical method useful where disease rate estimates have been made for small areas and for rare diseases. The method distinguishes between the variation expected in the rates through the Poisson process and the variation that is real. In the empirical Bayes method, information on the variation is estimated from the variation in the data itself. Empirical Bayes methods have been developed to deal with the problem that locally defined disease rates are often unstable because they are based on small numbers. The empirical Bayes method pools information across areas to provide a more stable estimator of the rate. For example, a weighted neighborhood average might be used with local variance. Empirical Bayes techniques shift or smooth the values of the risk parameter towards a global mean.
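
The shrinkage toward a global mean can be sketched with a method-of-moments estimator. This is one common formulation, assumed here for illustration and not necessarily the one any given study uses; the function name and the case and population figures are hypothetical.

```python
def empirical_bayes_rates(cases, populations):
    """Shrink raw area rates toward the global rate (method-of-moments sketch)."""
    total_pop = sum(populations)
    global_rate = sum(cases) / total_pop
    raw = [c / n for c, n in zip(cases, populations)]
    mean_pop = total_pop / len(populations)
    # Population-weighted variance of the raw rates around the global rate.
    weighted_var = sum(
        n * (r - global_rate) ** 2 for r, n in zip(raw, populations)
    ) / total_pop
    # Remove the part expected from Poisson sampling alone; floor at zero.
    prior_var = max(0.0, weighted_var - global_rate / mean_pop)
    smoothed = []
    for r, n in zip(raw, populations):
        denom = prior_var + global_rate / n
        weight = prior_var / denom if denom > 0 else 0.0  # small n gives small weight
        smoothed.append(global_rate + weight * (r - global_rate))
    return smoothed


# Hypothetical areas: one with 100 people, two with 10,000 people.
rates = empirical_bayes_rates([2, 50, 100], [100, 10000, 10000])
```

The area with only 100 people has a raw rate of 0.02, but its estimate is pulled most of the way back toward the global rate, while the two large areas move only slightly.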

Environmental Risk Assessment

An assessment of the possibility of suffering a harm from a hazard. Dose-response relationships are often used to measure risk.

Exposure Assessment

The determination of the actual degree of contact of a defined population with a putative health hazard. The likelihood of suffering losses from a hazard.

Genetic Activity Profile

The genetic activity profile data base provides a computer-generated graphic representation of genetic bioassay data as a function of dose of the substance tested.

GIS-H (Geographic Information Science and Health)

Systems of analysis that integrate the methods of geographical information science with classical epidemiologic methods.

Geographic Scale Analysis

Analyses of spatial data objects characterized by constant (or near-constant) geographical size. Descriptive terms are "local, state, regional or national" scales. Relationships found to exist at one geographic scale do not necessarily exist at other scales. Geostatistical analysis methods commonly control for geographic scale.

Geographic Analysis Machine

The method computes disease rates for overlapping circular areas centered either on grid points or on cases and compares the rates with the rates of the null hypothesis found through Monte Carlo simulations. Where the observed rate exceeds a given percentile in the distribution from the null hypothesis, typically the 99.8th percentile, a circle is drawn. Areas of overlapping circles are indicative of the location and sizes of disease clusters. There is still an interpretive problem of distinguishing 'real' from 'apparent' clusters, which would also appear even if the hypothesis of randomness were true.

Geographic Mask

A method for encoding the geography of a health record that protects the confidentiality of the person while ensuring that valid geographical analyses of the data are possible. The most common method of geographically masking health data is through geographical aggregation (see Turnbull et al., 1990). A second technique (Rushton et al., 1996) is random perturbation, in which each point is displaced by a randomly determined amount, in a randomly determined direction, specific to its original location.
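
Random perturbation can be sketched as below. The uniform choice of distance and direction, the 500-meter limit and the function name are assumptions made for illustration; they are not taken from the cited papers. Coordinates are assumed to be planar and in meters.

```python
import math
import random


def perturb_point(x, y, max_distance, rng):
    """Displace a point by a random distance (up to max_distance) in a random direction."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    distance = rng.uniform(0.0, max_distance)
    return (x + distance * math.cos(angle), y + distance * math.sin(angle))


rng = random.Random(42)  # fixed seed so the example is repeatable
original = [(1000.0, 2000.0), (1500.0, 2500.0)]
masked = [perturb_point(x, y, 500.0, rng) for x, y in original]
```

Each masked point lies no more than 500 meters from its true location, so analyses at coarser scales remain approximately valid while the exact residence is obscured.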

Kriging

A method for estimating the prevalence of a variable of interest at a given place using data from the surrounding region that incorporates the spatial structure of the variable. 'Ordinary kriging' is based on 'the intrinsic hypothesis' which states that the difference in value of variables between two positions depends only on the distance between them. The measure of this distance dependence is the "semivariogram function". A key property of kriging is that the semivariogram function can be used to estimate the value of the process at unrecorded places from the neighboring sampling values. The local variance is known as 'the nugget variance;' 'the sill' represents the degree of spatial autocorrelation; and 'the range of influence' represents the distance over which the autocorrelation is found to extend. Generally, a map is obtained by estimating the value at each node of a regular grid superimposed over the area of interest and then applying a contouring program to draw iso-level curves.
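
Before kriging, the semivariogram is usually estimated from the sample itself. The sketch below computes the standard method-of-moments estimate, half the mean squared difference between pairs of values grouped by separation distance. The function name, bin width and sample values are hypothetical.

```python
import math
from itertools import combinations


def empirical_semivariogram(samples, bin_width, n_bins):
    """Semivariance per distance bin; samples is a list of (x, y, value).

    Bin k covers separations in [k * bin_width, (k + 1) * bin_width).
    Bins with no pairs are reported as None.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for (x1, y1, z1), (x2, y2, z2) in combinations(samples, 2):
        k = int(math.hypot(x1 - x2, y1 - y2) // bin_width)
        if k < n_bins:
            sums[k] += (z1 - z2) ** 2
            counts[k] += 1
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]


# Four hypothetical samples spaced one unit apart along a line.
samples = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (2.0, 0.0, 4.0), (3.0, 0.0, 7.0)]
gamma = empirical_semivariogram(samples, bin_width=1.0, n_bins=3)
```

In this toy sample the semivariance rises with separation, which is the pattern expected when nearby values are more alike than distant ones.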

Map Overlay

The superimposition of two or more geographic data layers and the ability to make boolean queries with respect to the attributes of polygons.

Modifiable Areal Unit

The sensitivity of the results of any analysis that involves geographic areas to the size and shape of these areas. The dependence of the results of spatial analysis on the arbitrary spatial basis of the data used.

Monte Carlo Simulation

A commonly used method for significance testing based on the randomization distribution--the reference distribution--of a test statistic.
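
A minimal sketch of the idea: simulate the test statistic many times under the null hypothesis and report how often the simulated value is at least as large as the observed one. The statistic used here (the largest case count in any of ten equally likely areas) and all names and numbers are hypothetical choices for illustration.

```python
import random


def monte_carlo_p_value(observed_stat, simulate_stat, n_sim, rng):
    """Rank-based p-value: share of the reference distribution >= the observed value."""
    exceed = sum(1 for _ in range(n_sim) if simulate_stat(rng) >= observed_stat)
    # The observed value counts as one member of the reference set.
    return (exceed + 1) / (n_sim + 1)


def max_area_count(rng, n_cases=30, n_areas=10):
    """Null model: each case falls in one of n_areas with equal probability."""
    counts = [0] * n_areas
    for _ in range(n_cases):
        counts[rng.randrange(n_areas)] += 1
    return max(counts)


rng = random.Random(0)
# Suppose the most affected area was observed to hold 12 of the 30 cases.
p_value = monte_carlo_p_value(12, max_area_count, 999, rng)
```

With 999 simulations the smallest attainable p-value is 1/1000, which is why simulation counts such as 99, 499 or 999 are popular.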

Point- vs. Area-based Measures

This distinction is about whether the test is for an unexpectedly large number of cases being found in certain areas or for whether cases are in closer proximity than would be expected.

Power Simulations

Monte Carlo simulations designed to assess the probabilities of making Type 1 and Type 2 errors given the estimated means and variances of the statistics involved in a test. The level of data aggregation has an effect on the statistical power of focused tests (Waller, 1996, p. 780). As Waller notes (p. 780): "for hot spot clusters, typically statistical power increases as data are aggregated to the level of clustering but power is lost if the data are further aggregated"; hence the "scale-flexibility" features of Scan Statistics and the Geographical Analysis Machine (GAM).
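
A power simulation can be sketched in two stages: simulate the statistic under the null hypothesis to set a critical value, then simulate under a cluster-generating alternative and count how often the critical value is exceeded. The test statistic (largest area count), the relative risk of 3 in one area and all names below are assumptions made for illustration.

```python
import random


def simulate_max_count(rng, n_cases, weights):
    """Drop n_cases into areas with the given relative risks; return the largest count."""
    counts = [0] * len(weights)
    for area in rng.choices(range(len(weights)), weights=weights, k=n_cases):
        counts[area] += 1
    return max(counts)


def estimate_power(n_cases, null_weights, alt_weights, n_sim, alpha, rng):
    """Return (estimated Type 1 error rate, estimated power) for the max-count test."""
    null_stats = sorted(
        simulate_max_count(rng, n_cases, null_weights) for _ in range(n_sim)
    )
    # Reject when the statistic is strictly greater than this critical value.
    critical = null_stats[min(n_sim - 1, int((1 - alpha) * n_sim))]
    type1 = sum(s > critical for s in null_stats) / n_sim
    rejections = sum(
        simulate_max_count(rng, n_cases, alt_weights) > critical for _ in range(n_sim)
    )
    return type1, rejections / n_sim


rng = random.Random(1)
null_weights = [1.0] * 10          # equal risk in ten areas
alt_weights = [3.0] + [1.0] * 9    # hypothetical hot spot: triple risk in one area
type1, power = estimate_power(60, null_weights, alt_weights, 500, 0.05, rng)
```

The Type 2 error rate is one minus the estimated power. Repeating the run with the areas merged into fewer, larger units would show how aggregation changes the power.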

Region-building

Algorithms for combining small administrative areas into larger areas to accomplish some purpose, such as reducing the variability in measured disease rates, or enlarging the population base to meet confidentiality requirements for the release of sensitive information.

Small Area Variation

The examination of differences in disease rates in small administrative areas. Used for the purposes of health service planning, disease surveillance and the identification of areas of poor health.

Space-time clustering

Clustering in time and space is a marker for contagion. Tests for space-time clustering measure the proximity of case pairs in space and time (Knox, 1964; Mantel, 1967). The paucity of good data, cross-classified by time and space, has hampered the development of methods in this area.
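
A Knox-style statistic counts the case pairs that are close in both space and time. The sketch below assumes planar coordinates and a numeric time (for example, days); the critical distances and the case list are hypothetical, and a significance test would compare the count with its distribution under random reassignment of the times.

```python
import math
from itertools import combinations


def knox_statistic(cases, space_limit, time_limit):
    """Count pairs of cases within space_limit and time_limit of each other.

    cases: list of (x, y, t) tuples.
    """
    close_pairs = 0
    for (x1, y1, t1), (x2, y2, t2) in combinations(cases, 2):
        near_in_space = math.hypot(x1 - x2, y1 - y2) <= space_limit
        near_in_time = abs(t1 - t2) <= time_limit
        if near_in_space and near_in_time:
            close_pairs += 1
    return close_pairs


# Three cases close in space and time, plus two that are close in space only.
cases = [(0, 0, 1), (1, 0, 2), (0, 1, 3), (50, 50, 2), (51, 50, 40)]
knox = knox_statistic(cases, space_limit=2.0, time_limit=5)
```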

Standardized mortality ratio

The ratio of disease count observed in an area to that expected, based on the age and sex structure of the area and the age- and sex-specific death-rates of a standard population.
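
The calculation can be sketched as follows: apply the standard population's age-specific rates to the area's own population to get the expected count, then divide the observed count by it. The age groups, populations and rates are made-up numbers, and sex is omitted to keep the example short.

```python
def standardized_mortality_ratio(observed_deaths, area_population, standard_rates):
    """Observed deaths divided by the deaths expected at standard-population rates.

    area_population: {age_group: persons in the area}
    standard_rates:  {age_group: death rate in the standard population}
    """
    expected = sum(
        area_population[group] * standard_rates[group] for group in area_population
    )
    return observed_deaths / expected


population = {"0-44": 8000, "45-64": 3000, "65+": 1000}
rates = {"0-44": 0.001, "45-64": 0.005, "65+": 0.04}
# Expected deaths: 8 + 15 + 40 = 63; observed: 75.
smr = standardized_mortality_ratio(75, population, rates)
```

A value above 1 (here about 1.19) means more deaths were observed than the area's age structure alone would predict.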

Spatial data models

Systems for encoding, storing and manipulating spatial data. There are two kinds of spatial data models: field models and object models.

Spatial Filters

The geographical distribution of cases of a disease can be generalized at different spatial scales by computing rates for different sized areas. Different terms have been given to this process: kernel density estimation (Bithell, 1990) and spatial smoothing (Cliff and Haggett, 1986). Unlike ground temperature or rainfall, both of which are continuous in space, the number of new cases of a disease is not measurable at any location, since all of the cases in an area are artificially gathered at one point for statistical and/or administrative purposes. The resulting densities at any point on the surface are usually interpreted to mean the expected rate that would be observed at that point if one were to collect information around the point for a sufficient length of time for the density to be estimated correctly. Such maps have many advantages in comparison with mapping methods that provide an indication of the level of a disease by area. They are not constrained by the borders of geographic units, and sudden transitions between levels of two neighboring areas are avoided. Most commonly, spatial filters are equal in area but in some applications (Turnbull et al., 1990) the filters have equal numbers of people at risk. Spatial filters may have irregular boundaries when the data within them are from aggregated zones whose centroids lie within the filter area.
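
A fixed-radius circular filter can be sketched as below: at each grid point, the rate is the number of cases inside the circle divided by the population at risk inside it. The function name, the radius and the data are hypothetical, and population is represented by zone centroids as described above.

```python
import math


def filtered_rate(grid_point, cases, population, radius):
    """Disease rate within `radius` of a grid point, or None if nobody is at risk there.

    cases:      list of (x, y) case locations
    population: list of (x, y, persons) zone centroids
    """
    gx, gy = grid_point
    n_cases = sum(1 for x, y in cases if math.hypot(x - gx, y - gy) <= radius)
    at_risk = sum(p for x, y, p in population if math.hypot(x - gx, y - gy) <= radius)
    return n_cases / at_risk if at_risk > 0 else None


cases = [(1.0, 1.0), (2.0, 1.0), (9.0, 9.0)]
population = [(1.0, 1.0, 500), (2.0, 2.0, 1500), (9.0, 9.0, 1000)]
rate = filtered_rate((1.5, 1.5), cases, population, radius=2.0)
```

Evaluating the function over a regular grid and contouring the results gives the smoothed map; a larger radius produces a more generalized surface.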

Significance rate maps

A contoured map showing at any point the proportion of times in a simulation experiment that the rate according to the null hypothesis was less than the observed disease rate. This rate is often recommended to be used in a relative way to prioritize areas for cluster investigations.

Uncertainty

Error in mapped data about which nothing is known.


Website maintained by Andy Long. Comments appreciated.
longa@nku.edu