National Science Foundation
                 BioQUEST Curriculum Consortium
Engaging People in Cyberinfrastructure


 
Projecting Presence of Pecans: Mining Environmental Data to Identify Ecological Limits of Species Distribution
 
 
 

 


Authors


Sarah Prescott
University of New Hampshire


Brian White
University of Massachusetts


Amber Johnson
Truman State University

 
   
 


Possible Audiences:

College courses in biology, ecology, and anthropology.

 
 


Brief Overview:

We used data mining techniques and climate data to predict the presence of the native pecan, Carya illinoensis, at locations in the United States and around the world.

We obtained data on the geographic distribution of pecans from USGS.

We tried several different algorithms for predicting pecan distribution based on climate data. These algorithms predicted the pecan distribution with varying accuracy. The results are summarized in the table below, in order of increasing cross-validated % agreement:

Model                                 Initial %    Initial   Cross-validated   Cross-validated
                                      agreement    Kappa     % agreement       Kappa
ZeroR                                 -ND-         -ND-      84.98             0.000
OneR                                  -ND-         -ND-      90.12             0.233
J48 with small tree                   96.03        0.781     93.77             0.652
J48 with full tree                    98.92        0.943     94.30             0.694
JRip using only raw climate factors   -ND-         -ND-      94.54             0.713
LB1                                   -ND-         -ND-      94.72             0.721
JRip                                  95.92        0.799     94.76             0.715
LB3                                   -ND-         -ND-      95.21             0.745

Model descriptions:

  • ZeroR: This is the "null" model. It predicts presence or absence of pecans based on the simplest rule: since most places don't have pecans, assume all places don't have pecans. It provides a baseline for comparing the other models.
  • OneR: This is the simplest real model. It predicts presence or absence of pecans based on one climatic factor (aka an attribute): the one climatic factor that best predicts all the instances. The resulting classification model can be interpreted in ecological terms.
  • J48 with small tree: This is a simplified "decision tree" model. It constructs a decision tree based on the most useful values of the most useful climatic factors. This complex tree is then simplified to make it run faster. The resulting classification model can be interpreted in ecological terms.
  • J48 with full tree: This is a full "decision tree" model. It constructs a decision tree based on the most useful values of the most useful climatic factors. The resulting classification model can be interpreted in ecological terms.
  • JRip using only raw climate factors: This is a rule-based model. It constructs a list of rules for where pecans will/won't be found based on the most useful values of the most useful climatic factors. This set of rules was built using only the raw climate factors, not the interactions (all others were built using the full data set). The resulting classification model can be interpreted in ecological terms.
  • LB1: This is a "lazy" classifier. When classifying a location as "having pecans" or "not having pecans", it looks in the 103-dimensional space defined by the known locations and finds the nearest neighbor to the new location. If that neighbor has pecans, then the model predicts that the site will also have pecans. The resulting classification model is not interpretable in ecological terms.
  • JRip: This is a rule-based model. It constructs a list of rules for where pecans will/won't be found based on the most useful values of the most useful climatic factors. The resulting classification model can be interpreted in ecological terms.
  • LB3: This is another "lazy" classifier. It is just like LB1 except that it makes predictions based on the three nearest neighbors to the input location. Like LB1, it is not possible to interpret the classifier model in ecological terms.
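The nearest-neighbor idea behind LB1 and LB3 can be sketched in a few lines of Python. This is only an illustration of the technique, not the Weka implementation we ran: the function name knn_predict, the two-attribute stations, and all of their values are made up for the example, and the attributes are assumed to be already scaled to comparable ranges (the real data set has 103 attributes).

```python
import math
from collections import Counter

def knn_predict(train, query, k=1):
    """Predict pecan presence (1) or absence (0) by majority vote
    among the k training stations closest to `query`."""
    # Sort the known stations by Euclidean distance to the new location.
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical stations: (scaled MWM, scaled ELEV) -> pecans present?
stations = [
    ((0.90, 0.10), 1),
    ((0.85, 0.15), 1),
    ((0.80, 0.30), 1),
    ((0.30, 0.80), 0),
    ((0.20, 0.90), 0),
    ((0.75, 0.12), 0),  # a pecan-free station surrounded by pecan stations
]

new_site = (0.78, 0.13)
one_nn = knn_predict(stations, new_site, k=1)    # follows the single closest station
three_nn = knn_predict(stations, new_site, k=3)  # majority of the three closest
```

In this made-up example the single nearest station happens to lack pecans, so the 1-neighbor prediction is 0, while two of the three nearest stations have pecans, so the 3-neighbor prediction is 1.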
The measures of accuracy are:

  • Initial % agreement: This is the percent of the time that the prediction of the model agreed with the actual pecan distribution. You would expect this to be high, since the model is being evaluated on the same data that was used to generate it.
  • Initial Kappa: Kappa compares the actual % agreement with the "null hypothesis": the percent agreement you would expect by chance alone, given how often pecan/no-pecan occurs in the real data and in the predictions. A value of 0 means no better than chance and 1 means perfect agreement (negative values are possible when a classifier does worse than chance). As with the initial % agreement, you'd expect this to be high since the classifier is being evaluated on the data that trained it.
  • Cross-validated % agreement: This is a more realistic estimate of the accuracy of the classifier (note that in our results it is always lower than the corresponding initial value). Here, the software reserves 10% of the sample and trains on the remaining 90%. The model trained on the 90% is then used to classify the "unseen" 10%. This process is repeated 10 times, holding out a different 10% each time, and the results are averaged. This value gives an estimate of how effective that algorithm will be at generating a reliable classification model.
  • Cross-validated Kappa: This is the Kappa calculated using the 10%/90% method described above.
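The 10%/90% procedure can be sketched as follows. This is a simplified illustration, not Weka's own code, which may differ in detail (for example in how it assigns rows to folds). The helper names train_zero_r and cross_validate and the 100-site data set are hypothetical; the "null" ZeroR model is used as the classifier because it is the easiest to write down.

```python
import random

def train_zero_r(rows):
    """Train the 'null' model: always predict the most common label."""
    labels = [label for _, label in rows]
    majority = max(set(labels), key=labels.count)
    return lambda features: majority

def cross_validate(rows, train_fn, folds=10, seed=0):
    """Average % agreement over `folds` held-out slices.

    rows: list of (features, label) pairs.
    train_fn: takes training rows and returns a predict(features) function.
    """
    rows = rows[:]                      # leave the caller's list untouched
    random.Random(seed).shuffle(rows)
    scores = []
    for i in range(folds):
        held_out = rows[i::folds]       # every folds-th row: about 10% of the data
        training = [r for j, r in enumerate(rows) if j % folds != i]
        predict = train_fn(training)    # the model never sees the held-out rows
        correct = sum(predict(x) == y for x, y in held_out)
        scores.append(100.0 * correct / len(held_out))
    return sum(scores) / folds

# Hypothetical data: 100 sites, 15 with pecans (label 1), 85 without (label 0).
sites = [((i,), 1) for i in range(15)] + [((i,), 0) for i in range(15, 100)]
cv_agreement = cross_validate(sites, train_zero_r)
```

With this made-up data the null model always predicts "no pecans", so its cross-validated agreement is simply the share of sites without pecans, 85%.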

From this, we concluded:

  • Using % agreement can be misleading. Even the worst classifier (ZeroR) was right about 85% of the time (84.98% in the table). This is because most sites in the US don't have pecans, so predicting that "no sites will have pecans" works deceptively well. It is important also to look at the kappa statistic, which is 0 for ZeroR.
  • Using actual data improved the accuracy of the classifiers. Going from ZeroR to OneR to the more complex models increased accuracy as we would expect.
  • The more data you include, the better the accuracy. Large trees did better than small trees; 3-nearest-neighbors did better than 1-nearest-neighbor.
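To see why % agreement and kappa can disagree so sharply, here is a small sketch of the standard Cohen's kappa calculation for a two-class (pecan/no-pecan) problem. The function name and the site counts are invented for illustration; they are not our actual confusion matrices.

```python
def agreement_and_kappa(tp, fn, fp, tn):
    """Return (% agreement, Cohen's kappa) from a 2x2 confusion matrix.

    tp: pecans present, predicted present    fn: present, predicted absent
    fp: pecans absent, predicted present     tn: absent, predicted absent
    """
    total = tp + fn + fp + tn
    observed = (tp + tn) / total
    actual_yes = (tp + fn) / total       # how often pecans really occur
    predicted_yes = (tp + fp) / total    # how often the model says "pecans"
    # Agreement expected by chance if predictions were unrelated to reality.
    expected = (actual_yes * predicted_yes
                + (1 - actual_yes) * (1 - predicted_yes))
    kappa = 0.0 if expected == 1 else (observed - expected) / (1 - expected)
    return 100 * observed, kappa

# Hypothetical: 1000 sites, 150 with pecans, model always says "no pecans".
null_agreement, null_kappa = agreement_and_kappa(tp=0, fn=150, fp=0, tn=850)

# Hypothetical: a model that finds 100 of the 150 pecan sites, with 20 false alarms.
better_agreement, better_kappa = agreement_and_kappa(tp=100, fn=50, fp=20, tn=830)
```

The always-"no" model scores 85% agreement but a kappa of 0, because chance alone would also give 85%. The second model's agreement is only a few points higher (93%), yet its kappa is about 0.70.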
Although JRip (the rule-based classifier) was not the most accurate (LB3 was the most accurate), it was the most accurate classifier that also gave results that we could interpret ecologically. We therefore used it in most of our analyses. Its rule set is shown below. The rules are tried in order, and the first one that matches a location determines the prediction. The numbers in parentheses follow Weka's output convention: the number of locations covered by the rule / the number of those locations for which the rule's prediction is incorrect.
  1. (MWM >= 26.5) and (BAR5 >= 14.6915) and (PTOAE >= 1.1925) and (ELEV <= 300) => Pecan=1 (82.0/4.0)
  2. (AE >= 652.4943) and (PTOAE >= 1.1295) and (WATDGRC <= 3) and (WRET >= 104.8334) and (ELEV <= 625) => Pecan=1 (72.0/4.0)
  3. (MWM >= 24.6) and (CVRAIN <= 44.3185) and (WSTORAGE >= 181.796) and (ELEV <= 1030) => Pecan=1 (165.0/50.0)
  4. (MWM >= 24.3) and (TRANGE >= 24.7) and (RLOW >= 25.91) and (PTOWATR >= 10.8738) and (Site <= 1517) => Pecan=1 (51.0/4.0)
  5. (AE >= 622.0895) and (COKLM >= 506.9) and (EXPREY <= 520.5728) and (PTOWATR >= 8.7045) => Pecan=1 (59.0/13.0)
  6. (MWM >= 24.8) and (TRANGE >= 24.7) and (RLOW >= 25.91) and (RLOW <= 46.74) => Pecan=1 (52.0/24.0)
  7. (MWM >= 27.22) and (RLOW >= 71.88) and (EXPREY <= 439.1472) and (WRET >= 102.7854) and (TEMP <= 56.0959) => Pecan=1 (15.0/1.0)
  8. (MWM >= 27.44) and (CVRAIN <= 34.6388) and (WSTORAGE <= 161.2) => Pecan=1 (77.0/37.0)
  9. otherwise: => Pecan=0 (4064.0/52.0)
What does this mean? Let's look at one of the 9 rules in detail:
(MWM >= 26.5) and (BAR5 >= 14.6915) and (PTOAE >= 1.1925) and (ELEV <= 300) => Pecan=1 (82.0/4.0)
  • MWM is the Mean Temperature in the Warmest Month (C)
  • BAR5 is the Biomass Accumulation Ratio
    • This is the amount of net above ground productivity added to standing biomass each year.
    • Higher values indicate areas where we would find rapidly growing forests, low values could be slow growing forests or grasslands.
  • PTOAE is the ratio of Potential Evapotranspiration (PET) to Actual Evapotranspiration
    • Higher values mark warmer/drier settings where precipitation is not high enough to match PET.
  • ELEV is the elevation of the weather station in feet.
So, this statement can be read as:

Where the Mean Temperature in the Warmest Month is greater than or equal to 26.5 deg C, and where the Biomass Accumulation Ratio is greater than or equal to 14.69, and where the ratio of Potential to Actual Evapotranspiration is greater than or equal to 1.19, and where the Elevation is less than or equal to 300 feet, expect to find pecans.

In other words, pecans are found in warm locations where a moderate amount of the productivity accumulates as standing biomass (think tree trunks, branches, etc) in environments on the dry side and at low elevations.
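For readers who want to apply the rule set outside of Weka, the list can be written as data and evaluated in order. This is a hedged sketch rather than the code we used: the thresholds are copied from the rules listed above, but the function name predict_pecan and the two example sites are hypothetical.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le}

# Rules 1-8 from the JRip output; each one predicts Pecan=1 when all tests pass.
PECAN_RULES = [
    [("MWM", ">=", 26.5), ("BAR5", ">=", 14.6915), ("PTOAE", ">=", 1.1925), ("ELEV", "<=", 300)],
    [("AE", ">=", 652.4943), ("PTOAE", ">=", 1.1295), ("WATDGRC", "<=", 3),
     ("WRET", ">=", 104.8334), ("ELEV", "<=", 625)],
    [("MWM", ">=", 24.6), ("CVRAIN", "<=", 44.3185), ("WSTORAGE", ">=", 181.796), ("ELEV", "<=", 1030)],
    [("MWM", ">=", 24.3), ("TRANGE", ">=", 24.7), ("RLOW", ">=", 25.91),
     ("PTOWATR", ">=", 10.8738), ("Site", "<=", 1517)],
    [("AE", ">=", 622.0895), ("COKLM", ">=", 506.9), ("EXPREY", "<=", 520.5728), ("PTOWATR", ">=", 8.7045)],
    [("MWM", ">=", 24.8), ("TRANGE", ">=", 24.7), ("RLOW", ">=", 25.91), ("RLOW", "<=", 46.74)],
    [("MWM", ">=", 27.22), ("RLOW", ">=", 71.88), ("EXPREY", "<=", 439.1472),
     ("WRET", ">=", 102.7854), ("TEMP", "<=", 56.0959)],
    [("MWM", ">=", 27.44), ("CVRAIN", "<=", 34.6388), ("WSTORAGE", "<=", 161.2)],
]

def predict_pecan(site):
    """Return 1 if any rule matches the site (a dict of attribute values), else 0.

    Rules are tried in order and each rule stops at its first failing test,
    so a site only needs the attributes that actually get checked.
    """
    for rule in PECAN_RULES:
        if all(OPS[op](site[attr], threshold) for attr, op, threshold in rule):
            return 1
    return 0  # rule 9: otherwise, Pecan=0

# Hypothetical sites (values invented for illustration).
warm_lowland = {"MWM": 27.0, "BAR5": 15.0, "PTOAE": 1.2, "ELEV": 250}
cool_site = {"MWM": 20.0, "AE": 500.0}

warm_prediction = predict_pecan(warm_lowland)  # satisfies rule 1
cool_prediction = predict_pecan(cool_site)     # fails the first test of every rule
```

Keeping the rules as a plain list makes it easy to see which rule fired for a given location, which is the main reason we preferred an interpretable classifier.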


Data Analysis

The image below shows the training data:
  • Tan area is where pecans are found.
  • Green squares show locations of weather stations without pecans.
  • Red squares show locations of weather stations with pecans.

The image below shows the result of applying the classifier generated from the training data to the world climate data (this is an extrapolation to new data that the classifier was not trained on):
  • Red squares show locations of weather stations where the model predicts the absence of pecans.
  • Yellow squares show locations of weather stations where the model predicts the presence of pecans.

 

 
   
 


Project Materials:

  • .arff files we used:
    • pecans.arff: US climate (raw and derived) with pecan distribution.
    • pecans-raw.arff: US climate (raw attributes only) with pecan distribution.
    • world.arff: World climate data with dummy pecan distribution (0 everywhere).
  • A description of the environmental variables.
  • The handout from our presentation.
  • A more detailed outline of our presentation.
  • The notes we took when developing the models.
  • A description of the different classifier algorithms used by Weka.
  • The two MyWorld project files from our presentation.
    • The US file.
    • The world file is too large to upload to this site. Sigh....
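For anyone who wants to inspect the .arff files without Weka, the format is plain text: "@attribute" lines name the columns, and the rows after "@data" are comma-separated values. Below is a minimal reader sketch. It assumes simple files with no quoted names or sparse rows, and the three-attribute excerpt is hypothetical, not copied from pecans.arff.

```python
def read_arff(text):
    """Parse a simple ARFF document into (attribute names, list of row dicts).

    Values are kept as strings; lines starting with % are comments.
    """
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            rows.append(dict(zip(attributes, values)))
        elif line.lower().startswith("@attribute"):
            attributes.append(line.split()[1])   # second token is the name
        elif line.lower().startswith("@data"):
            in_data = True
    return attributes, rows

# Hypothetical excerpt in the style of an ARFF file.
SAMPLE = """\
% illustrative values only
@relation pecans
@attribute MWM numeric
@attribute ELEV numeric
@attribute Pecan {0,1}
@data
27.1,250,1
19.4,1400,0
"""

names, records = read_arff(SAMPLE)
with_pecans = [r for r in records if r["Pecan"] == "1"]
```

For real work, an existing ARFF reader is a safer choice than this sketch, since it will handle the parts of the format that are skipped here.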
     

 
 


Resources and References:

  • One of the original references to using computer modeling to predict species distributions from climate data.
  • Another reference on this technique.
  • The USGS site where we got the pecan distribution data.
  • A link to the data-mining (Weka) software site. This cool software is free!
  • A description of the .arff file format used in data mining.
  • A description of the GIS software we used: MyWorld.
  • Binford, Lewis R. (2001). Constructing Frames of Reference: An Analytical Method for Archaeological Theory Building Using Ethnographic and Environmental Data.
  • Contact Amber Johnson to learn more about getting environmental data sets.
 

 
   
 


Future Directions:

There are many possible directions in which to take this project. Here are a few:

  • Identifying and characterizing the ecological niche of different species.
  • Doing the same analysis for other species:
    • Species of economic value
    • Species of historical interest
    • Pest or invasive species
  • Predicting the distribution of certain species under different climate conditions:
    • Global Warming
    • Paleoclimate
  • Looking for anomalies - places where a species "should" be but isn't or where an invasive species shouldn't be but is:
    • What historical factors led to this discrepancy?
    • What other species occupy this niche in the other places?
 

 
 


Attachments


- EnvCalc_Output_Variable_Description.doc
- PecanMap.jpg
- WorldPecanElevation.jpg
- PecanNotes.doc
- WekaModels.doc
- PecanHandout.doc
- PecanOutline.doc
- PecanProject.m3vz
- pecans-raw.arff
- pecans.arff.zip
- world.arff