Case Studies

West Nile Virus Analysis

USGS Evaluates ~8.6 Billion Linear Combinations in Less Than 24 Hours of Computing Time

    Goal

    Statistically analyze possible correlations of key environmental factors and the occurrence of West Nile Virus in dead birds.

    Business

    U.S. Geological Survey

    Industry

    U.S. Government, Department of the Interior

    Challenge

    The United States Geological Survey (USGS), part of the United States Department of the Interior, contracted Parabon to conduct an extensive, complex multi-factor statistical investigation into causal factors of the West Nile Virus. Because such a study required high-performance computing capacity, Parabon's computational grid and Crush™ data mining application were used.

    To estimate where future outbreaks of West Nile Virus might occur, USGS scientists commissioned an exhaustive analysis of the possible correlations between 34 different environmental factors and the occurrence of West Nile Virus in dead birds. Parabon evaluated every combination of 34 variables across 811 observations, using data from the state of New Jersey in 2002. The scientists needed to develop explanatory models quickly from the results of this statistical analysis so they could move on to investigating those models further. Speed of execution was paramount.

    Solution

    High-performance computing capacity was necessary to analyze the 8.6 billion possible models. Parabon's Crush, a large-scale data mining application for deep statistical modeling and knowledge discovery, was used to estimate these models. Crush is specifically designed to leverage the massive computational resources available on the Frontier computation platform. Using this high-performance computing capacity, the scientists

    "...were able to look at 34 different independent variables... in less than 24 hours of computing time. This same analysis would have taken an estimated 5 months of processing time on a single processor." [1]

    It was critical to apply this complex scientific modeling application during the early stages of problem analysis and model formation so that explanatory models could be investigated as soon as possible. Initially, 32 explanatory factors were analyzed, which resulted in the analysis of 4.3 billion linear models. Two of these factors failed to appear as statistically significant, so they were replaced with two other factors. This two-phase analysis, covering 34 environmental factors in total, led to 8.6 billion possible linear combinations of factors, or models. Of these, only 1,348 models had parameters that were all statistically significant at the 0.10 level.
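
    The model counts and timings quoted above can be cross-checked with a little arithmetic. The sketch below assumes that each phase covered every subset of 32 factors (2^32 is about 4.3 billion) and that "5 months" means roughly 150 days; the source states neither, so treat both as assumptions. The function names and the three factor names are hypothetical.

```python
from itertools import combinations

def count_subsets(n_factors: int) -> int:
    """Number of distinct factor subsets (including the empty one)."""
    return 2 ** n_factors

def enumerate_subsets(factors):
    """Yield every subset of `factors`, smallest subsets first."""
    for size in range(len(factors) + 1):
        yield from combinations(factors, size)

# Assumption: each of the two phases examined all subsets of 32 factors.
phase_one = count_subsets(32)        # 4,294,967,296 (about 4.3 billion)
two_phases = 2 * count_subsets(32)   # 8,589,934,592 (about 8.6 billion)

# Tiny illustration with three hypothetical factor names.
demo = list(enumerate_subsets(["rainfall", "grassland", "temperature"]))

# Lower bounds implied by "less than 24 hours" versus "5 months".
GRID_SECONDS = 24 * 60 * 60              # upper bound on the grid run time
SINGLE_CPU_SECONDS = 150 * 24 * 60 * 60  # assumption: 5 months = 150 days
min_models_per_second = two_phases / GRID_SECONDS
min_speedup = SINGLE_CPU_SECONDS / GRID_SECONDS

print(f"{phase_one:,} models in one phase, {two_phases:,} in two")
print(f"{len(demo)} subsets of 3 factors")
print(f"at least {min_models_per_second:,.0f} models per second")
print(f"at least {min_speedup:.0f}x faster than a single processor")
```

    Under these assumptions the grid sustained on the order of 100,000 model evaluations per second, and the speed-up over the single-processor estimate was at least 150-fold.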

    Initially, a heuristic (H) factor evaluation was performed. The heuristic approach is superior to simply counting how often a factor appears because it takes each model's goodness of fit into consideration. Spurious factors (factors that, by random chance, exhibit statistical significance yet are not structurally related to the outcome variable) were reduced by analyzing models weighted in favor of regressors that consistently show up in well-fitted models. To further eliminate spurious factors, Parabon looked at the stability of factor coefficients across the various models. Non-spurious factors should impact the outcome in a statistically similar manner across all models. The scientists measured, for each factor, the mean coefficient estimate and the standard deviation of the coefficient across all significant models.
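
    The two screening ideas described above, weighting factors by the fit of the models they appear in and checking coefficient stability, can be sketched as follows. The source does not define the H heuristic, so the fit-weighted score here (the sum of goodness-of-fit values over the models containing a factor) is only an illustrative stand-in. The data, the field names, and the use of the population standard deviation are likewise assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical fitted models: a goodness-of-fit value plus coefficient estimates.
models = [
    {"fit": 0.40, "coefs": {"grassland": 1.0, "rainfall": 0.5}},
    {"fit": 0.30, "coefs": {"grassland": 1.2, "elevation": -2.0}},
    {"fit": 0.10, "coefs": {"grassland": 0.8, "elevation": 3.0}},
]

def summarize_factors(models):
    """Per factor: appearance count, fit-weighted score, coefficient mean and spread."""
    values = defaultdict(list)   # factor -> coefficient estimates across models
    score = defaultdict(float)   # factor -> sum of fit over models containing it
    for m in models:
        for factor, coef in m["coefs"].items():
            values[factor].append(coef)
            score[factor] += m["fit"]
    return {
        factor: {
            "count": len(coefs),
            "fit_weighted": score[factor],
            "mean_coef": mean(coefs),
            "sd_coef": pstdev(coefs),  # assumption: population standard deviation
        }
        for factor, coefs in values.items()
    }

summary = summarize_factors(models)
for factor, stats in summary.items():
    print(factor, stats)
```

    In this made-up data, "elevation" changes sign between models, so its coefficient spread is large relative to its mean, which is the kind of instability that would flag a factor as potentially spurious. "grassland" appears in every model with similar estimates and also collects the highest fit-weighted score.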

    Results

    From a pool of 8.6 billion models examined, 1,348 models exhibited statistical significance in all parameters. Of these, the top five models, according to goodness-of-fit, were selected. Only eight factors showed up as statistically significant in at least one of the top five models.
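
    The selection step described above, keeping only models whose parameters are all significant at the 0.10 level and then ranking by goodness of fit, might look like the minimal sketch below. The candidate models, their field names, and the strict "less than" comparison are assumptions for illustration; the source does not say which goodness-of-fit measure was used.

```python
ALPHA = 0.10  # significance level quoted in the text

def all_significant(model, alpha=ALPHA):
    """True when every parameter's p-value is below alpha."""
    return all(p < alpha for p in model["p_values"].values())

def top_models(models, k=5):
    """Keep fully significant models, then return the k best by fit."""
    kept = [m for m in models if all_significant(m)]
    return sorted(kept, key=lambda m: m["fit"], reverse=True)[:k]

# Hypothetical candidates; names and numbers are made up for illustration.
candidates = [
    {"name": "A", "fit": 0.42, "p_values": {"x1": 0.01, "x2": 0.04}},
    {"name": "B", "fit": 0.55, "p_values": {"x1": 0.02, "x2": 0.30}},  # x2 fails
    {"name": "C", "fit": 0.38, "p_values": {"x1": 0.09}},
]

best = top_models(candidates, k=2)
print([m["name"] for m in best])
```

    Model "B" has the best fit but is dropped because one of its parameters is not significant, so the ranking is applied only to "A" and "C".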

    One model stood out statistically above all others, being the only one that contained no unstable factors. In this model, eleven factors were found to be associated with increased levels of dead birds. Among the factors included in this model was global land cover of grassland, savannah, or shrubland versus urban and built-up land. Land cover of various species of trees was associated with a greater incidence of dead birds than was the baseline land cover. [2]

    1. Letter from Anne Frondorf, Ph.D., Geographic Information Office, U.S. Department of the Interior, U.S. Geological Survey, dated 3/27/2002.

    2. Davies, Antony, Ph.D., "An Analysis of West Nile Virus Data Using Exhaustive Regression," Parabon Computation, Inc., 2001.