weecology / bbs-forecasting

Research on forecasting using Breeding Bird Survey data

License: MIT License
At our last meeting we planned out the following figures and tables for the ms:

The sketches for each of these figures are below. If you are already working on, or want to work on, a figure, just leave a note in the comments (and maybe ping the Slack channel) to avoid duplicate work.
New sketch for the observation model figure from 2017-05-18 meeting:
To start making actual forecasts using models that involve exogenous variables, we will need forecasts for those variables. This issue is a place for us to start looking into options for doing this.
Pure time series:
Pure environment:
Other:
It's known that ensembles often perform better for forecasting/prediction. We should be trying them.
This happens in `auto.arima` using the optional `xreg` argument.
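For reference, a minimal sketch of what that looks like, with toy data standing in for a richness series and an NDVI covariate (all names and values here are made up):

```r
library(forecast)

# Toy data standing in for a site's richness series and an NDVI covariate
set.seed(1)
ndvi <- rnorm(30)
richness <- ts(50 + 5 * ndvi + rnorm(30), start = 1982)

# Fit with the covariate passed through the optional xreg argument
fit <- auto.arima(richness, xreg = cbind(ndvi = ndvi))

# Forecasting then requires *future* covariate values, which is exactly
# why we need forecasts of the exogenous variables themselves
fc <- forecast(fit, xreg = cbind(ndvi = rnorm(5)))
```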
Right now, we're filtering the sites so that we have at least 25 observations between 1982 and 2013. Does that need to change if we're only doing 10 years of training? This could give us some sites with only 4 years of training data.
Maybe the criterion should be more like "visited during at least 70% of the training years"? Then we could use the same criterion for both.
As far as I can tell, there's no downside to including a site that doesn't get visited much in the test set, right?
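A sketch of what that shared criterion could look like with dplyr (the `visits` table below is a made-up stand-in for the real site-by-year visit records):

```r
library(dplyr)

# Toy visit table standing in for the real BBS site-by-year records
visits <- tibble::tibble(
  site_id = rep(1:3, times = c(9, 5, 10)),
  year    = c(1982:1990, 1982:1986, 1982:1991)
)

# Keep sites surveyed in at least 70% of the training years
training_years <- 1982:1991
good_sites <- visits %>%
  filter(year %in% training_years) %>%
  group_by(site_id) %>%
  summarise(n_years = n_distinct(year)) %>%
  filter(n_years >= 0.7 * length(training_years))
```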
It looks like our NDVI and climate/weather data sets both start in 1982. I was wondering what our options are for going back before 1982 (especially for the bioclim variables). Do we have that written down somewhere?
The GIMMS NDVI data we are using at the moment is only available through 2013. We need to fill this in moving forward, probably using MODIS or LTDR.
From #7:
LTDR appears to provide daily NDVI data back to 1981 using a combination of AVHRR and MODIS data. Data are available via FTP as one `hdf` file/day (25-150 MB/file depending on the satellite) and are split up by satellite (with different satellites covering different time periods). The NDVI data is in the AVH13C1 files. Description of the NDVI product. Unfortunately, these are daily data, which would force us to do the compositing ourselves.
If I call `get_richness_ts_env_data(start_yr, end_yr, min_num_yrs) %>% na.omit()`, there are sites within it that have fewer than `min_num_yrs` years of data. It looks like it has to do with the ordering of things. This is from forecast-bbs-core.R line 186: `complete(site_id, year)` fills in all possible years for all sites, so `filter_ts()` a few lines down thinks that all sites have plenty of years in their time series. If I take out `complete(site_id, year)` it seems to work, but I'm not sure what that might break elsewhere.
```r
get_richness_ts_env_data <- function(start_yr, end_yr, min_num_yrs){
  bbs_data <- get_bbs_data(start_yr, end_yr, min_num_yrs)
  # Species richness per site-year, padded to all site/year combinations
  # (the complete() call discussed above)
  richness_data <- bbs_data %>%
    group_by(site_id, year, lat, long) %>%
    dplyr::summarise(richness = n_distinct(species_id)) %>%
    ungroup() %>%
    complete(site_id, year)
  # Add environmental data, then filter to sufficiently long time series
  richness_ts_env_data <- richness_data %>%
    add_env_data() %>%
    filter_ts(start_yr, end_yr, min_num_yrs)
}
```
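One possible fix, sketched below under the assumption that `filter_ts()` just counts rows per site: make the length check ignore the NA rows that `complete()` adds, so padded years can't inflate a site's apparent time-series length (this fragment uses `richness_data` and `min_num_yrs` from the function above):

```r
library(dplyr)

# Sketch: count only non-NA richness values per site, so the years
# filled in by complete() don't count toward the length criterion
richness_data %>%
  group_by(site_id) %>%
  filter(sum(!is.na(richness)) >= min_num_yrs) %>%
  ungroup()
```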
Need to re-run the NDVI extraction using a buffer.
Ideally we'd like fairly finely resolved land cover data going back to the 1970s.
The National Land Cover Database is one option, but its big limitation is that there are only 4 time points: 1992, 2001, 2006, 2011. If the algorithms are available we could think about processing the full time-series (though this could be a pretty massive computational task). We could also interpolate between the neighboring points. This has some risks, but I suspect that land cover change is gradual in most cases. It still leaves us without data prior to 1992 and post 2011. NLCD also only covers the United States.
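A minimal sketch of the interpolation idea, assuming land cover has already been summarized as a per-class proportion around each site at the four NLCD time points (the values below are made up):

```r
# Linearly interpolate a cover-class proportion between NLCD years
nlcd_years  <- c(1992, 2001, 2006, 2011)
prop_forest <- c(0.42, 0.40, 0.37, 0.35)  # made-up proportions
interpolated <- approx(nlcd_years, prop_forest, xout = 1992:2011)
```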
Other sources of land cover data:
The current BBS data includes water birds and nocturnal species that aren't sampled well. They need to be removed.
I see at least four options for dealing with all the namespaces clobbering each other:

1. Leave all the `library()` calls and add `::` where needed
2. Use `::` everywhere
3. Use explicit namespaces using R's package infrastructure

I'm leaning toward 3. Thoughts @ethanwhite @sdtaylor?
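If we go with 3, this is roughly what I mean (a sketch using roxygen2 import tags, which would only apply once the project is structured as a package):

```r
# Sketch of option 3: once the project is an R package, imports are
# declared in the NAMESPACE (here via roxygen2 tags) instead of
# library() calls scattered through the scripts

#' @importFrom dplyr group_by summarise ungroup
#' @importFrom tidyr complete
NULL
```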
This is probably effectively static at the time-scales we're working with, but could be important for space-for-time models.
Tables
Model | Site effect | Env. Vars | Species Specific |
---|---|---|---|
TS | No | No | No |
JSDM | No | Yes | Yes |
SSDM | No | Yes | Yes |
GBM | No | Yes | Yes |
Main Results figures
One software package: https://github.com/James-Thorson/spatial_DFA
Also does JSDM (#52)
We currently have one (and maybe soon two) calls to a Postgres database. Since we're already using SQLite to store and pass around environmental data, we should go ahead and add the BBS data to that database as well and extract it from there.
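A sketch of what pulling BBS data from the shared SQLite file could look like (the file and table names here are assumptions):

```r
library(DBI)

# Read BBS counts from the same SQLite file we use for environmental
# data ("bbsforecasting.sqlite" and "bbs" are assumed names)
con <- dbConnect(RSQLite::SQLite(), "bbsforecasting.sqlite")
bbs_data <- dbGetQuery(con, "SELECT * FROM bbs")
dbDisconnect(con)
```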
As an alternative to excluding water birds, @davharris has suggested adding something like the proportion of area within a 40 km buffer that is water as a predictor for the spatial models (and possibly the temporal ones if this changes much over the time period of the data, which it might).
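A rough sketch of computing that predictor with the raster package, using a toy binary water layer (1 = water) in place of real land cover data and made-up site coordinates:

```r
library(raster)

# Toy 0/1 water raster on a 100 km x 100 km grid (units are meters)
water <- raster(matrix(rbinom(100, 1, 0.2), nrow = 10),
                xmn = 0, xmx = 1e5, ymn = 0, ymx = 1e5)
sites <- cbind(x = c(2e4, 7e4), y = c(3e4, 8e4))

# Mean of a 0/1 raster within a 40 km buffer = proportion water
prop_water <- extract(water, sites, buffer = 40000,
                      fun = mean, na.rm = TRUE)
```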
`get_prism_data` currently uses the `prism` R package. Since we'll be combining data from lots of different sources, let's keep the data acquisition part straightforward by using the retriever for everything. I don't have strong opinions on using `prism` vs. Postgres for merging the data with BBS at the moment, but that may change as the number of different sources for predictors increases.
`tidyr` has added the function I needed a few months ago when I wrote `get_popdyn_data_w_zeros`. This should now be replaceable using a single call to `tidyr::complete`.
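For reference, a minimal example of the replacement (toy data; `fill` supplies zeros for the missing site/year combinations):

```r
library(tidyr)

# Toy abundance table with a missing site/year combination
counts <- data.frame(site_id = c(1, 1, 2), year = c(1991, 1992, 1991),
                     abundance = c(5, 3, 4))

# complete() adds the missing combinations, filling abundance with 0
complete(counts, site_id, year, fill = list(abundance = 0))
```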
if p<0.001?
What did we decide about the train-test split? Was it the last 5 years that were reserved for evaluation? Sorry if I'm just missing it; I don't see anything about it in either of the notebooks or in the core functions.
This is potentially a nice way to communicate what the benchmark comparisons tell us.
I'm basically envisioning an observed-predicted plot of forecast vs. observed richness side-by-side with an observed-predicted plot of forecast change in richness vs. observed change in richness. The first will look good, demonstrating that if we just need to know what richness will look like in 20 years, we can do that; the second will look bad, showing that we don't really benefit from the more sophisticated models over just assuming that things don't change.
This ties back to the results in Rapacciuolo 2012, which shows that SDMs work well for forecasting species locations because they are good at predicting the areas that don't change.
@davharris says:
Given that it's kind of weird to ask "how will next month's (June) weather affect what birds we'll find on today's transect run (in May)", it's probably worth adding a sentence pointing out that quirk of our data.
Add a config.R file for common variables like the years of analysis and the path to the SQLite database file.
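Something like this sketch (all values are placeholders):

```r
# Sketch of config.R; values are placeholders
settings <- list(
  start_yr    = 1982,
  end_yr      = 2013,
  min_num_yrs = 25,
  db_path     = "./data/bbsforecasting.sqlite"
)
```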
In the retriever, `species_id` refers to the column before `AOU`. In this project, we use `species_id` as a synonym for AOU. What's the best way to avoid inconsistency when we try to do things like join the species table to the abundance table? We could change `get_species_data` with one line of code, but we'd still have this inconsistency between our database and our R code. Or we could change all instances of `species_id` to match the database, but that seems like a pain.
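For what it's worth, the one-line change could be a rename at the database boundary; a sketch, assuming the retriever table has both its own `species_id` and an `aou` column:

```r
library(dplyr)

# Sketch: translate at the database boundary so the rest of our R code
# can keep using species_id as a synonym for AOU (column names assumed)
species <- species %>%
  rename(retriever_id = species_id, species_id = aou)
```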
Currently, `filter_species` throws out anything with `unid.` in the name, on the theory that it can't be identified to species. But the following three species (and possibly some others?) are identified to species, just not to subspecies. I think that means we're incorrectly throwing out observations for them.

1. (unid. Red/Yellow Shafted) Northern Flicker
2. (unid. race) Dark-eyed Junco
3. (unid. Myrtle/Audubon's) Yellow-rumped Warbler
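A sketch of a whitelist approach (the `name` column is a stand-in for whatever the real species-name column is called):

```r
# Toy species table; "name" stands in for the real column name
species <- data.frame(name = c("(unid. race) Dark-eyed Junco",
                               "unid. Empidonax flycatcher",
                               "Northern Cardinal"))

# Drop "unid." records except the three identified to species
keep_anyway <- c("(unid. Red/Yellow Shafted) Northern Flicker",
                 "(unid. race) Dark-eyed Junco",
                 "(unid. Myrtle/Audubon's) Yellow-rumped Warbler")
is_unid <- grepl("unid.", species$name, fixed = TRUE)
species_filtered <- subset(species, !is_unid | name %in% keep_anyway)
```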
This will let us make the `make` file more granular.
@davharris started some work on this in #2, but it is waiting on some fixes.
@sdtaylor is doing a lot of this modeling for population level work already, so we could also integrate that code at some point if it's sufficiently modularized to let us use it easily.
As an alternative to removing nocturnal/crepuscular species, @davharris has suggested adding a variable describing the time of the survey relative to sunrise.
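A sketch of deriving that variable with the suncalc package (the date, coordinates, and start time below are made-up examples; real BBS start times would need parsing first):

```r
library(suncalc)

# Local sunrise for a made-up survey date and location
sun <- getSunlightTimes(date = as.Date("2011-06-04"),
                        lat = 35.0, lon = -80.0, keep = "sunrise")

# Survey start time relative to sunrise, in hours (times are in UTC)
start_time <- as.POSIXct("2011-06-04 11:00:00", tz = "UTC")
hours_after_sunrise <- as.numeric(
  difftime(start_time, sun$sunrise, units = "hours")
)
```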
Paper at: http://onlinelibrary.wiley.com/doi/10.1111/ecog.02321/full
The R package is on CRAN under the name "RFc".
Here's a list of what they have available (note that the NCEP/NCAR Reanalysis has 4x daily data going back to 1948 for temp and precip; `prate` is short for "precipitation rate").
Joan and I played with it a bit today, and the API seems very good (it's from Microsoft Research).
Example script I used:
```r
library(RFc)

x <- fcTimeSeriesDaily("airt", latitude = c(35, 36),
                       longitude = c(-117, -118), firstYear = 1989)
```
I'm thinking about adding a longer-term forecast analysis to push the longest forecast out to 20+ years so that we get a better idea of whether the spatial and time-series approaches really end up converging. I'm envisioning leaving the bulk of the analysis as is, but adding one additional analysis where we train on the first 5-10 years and forecast on the last 20-25.
rEDM package vignette: https://cran.r-project.org/web/packages/rEDM/vignettes/rEDM_tutorial.html
How do we want to handle oddball "species" (e.g. unidentified species, hybrid species & subspecies)? They're usually pretty identifiable in the data using regexes, but it's not obvious to me how best to handle them.
My proposal:
On a related note, did we make a decision somewhere about whether to remove super-rare species from the data set?
Better to do it once for all models than to let each model choose its own optimal predictors?
Certainly shouldn't let models see the test set when deciding which predictors to use...
I've assigned myself to deal with this before the final model runs.
@sdtaylor - do you have code for this somewhere from your other work? If so, can you add a function for handling this to `get_prism_data.R`?
I'd feel better if I knew what was going on with big richness jumps like the one plotted below (site_id 27057).
Is that a real change in richness, or did a new observer just start counting the birds in 1993? These big jumps have a big impact on the tails of the random walk, so I think we'll want to see if we can understand them better.
My RPostgres isn't behaving at the moment, so I can't easily access the observer data right now.
Either Landsat or AVHRR.
Initial work on this is started in `setup_ndvi_data.R` and the README in `./data/ndvidata/`. This uses NDVI data from AVHRR acquired from EarthExplorer. These data come in already-composited form, meaning they are relatively plug and play. Data go back to 1989 and are available as GeoTIFFs.
LTDR appears to provide daily NDVI data back to 1981 using a combination of AVHRR and MODIS data. Data are available via FTP as one `hdf` file/day (25-150 MB/file depending on the satellite) and are split up by satellite (with different satellites covering different time periods). The NDVI data is in the AVH13C1 files. Description of the NDVI product. Unfortunately, these are daily data, which would force us to do the compositing ourselves.
https://github.com/weecology/bbs-forecasting/blob/master/R/forecast-bbs-core.R#L160
richness would be more precise, I think?
For cases where all that is needed is the time-series itself, generate actual forecasts for future years.