weecology / bbs-forecasting

Research on forecasting using Breeding Bird Survey data

License: MIT License
At our last meeting we planned out the following figures and tables for the ms:

The sketches for each of these figures are below. If you are already working on, or want to work on, a figure, just leave a note in the comments (and maybe ping the Slack channel) to avoid duplicate work.
New sketch for the observation model figure from 2017-05-18 meeting:
To start making actual forecasts using models that involve exogenous variables, we will need forecasts for those variables. This issue is a place for us to start looking into options for doing this.
Pure time series:
Pure environment:
Other:
It's known that ensembles often perform better for forecasting/prediction. We should be trying them.
This happens in `auto.arima` using the optional `xreg` argument.
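For reference, a minimal sketch of what that looks like, with toy data standing in for a richness series and an NDVI covariate (all names and values here are made up):

```r
library(forecast)

# Toy data standing in for a site's richness series and an NDVI covariate
set.seed(1)
ndvi <- rnorm(30)
richness <- ts(50 + 5 * ndvi + rnorm(30), start = 1982)

# Fit with the covariate passed through the optional xreg argument
fit <- auto.arima(richness, xreg = cbind(ndvi = ndvi))

# Forecasting then requires *future* covariate values, which is exactly
# why we need forecasts of the exogenous variables themselves
fc <- forecast(fit, xreg = cbind(ndvi = rnorm(5)))
```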
Right now, we're filtering the sites so that we have at least 25 observations between 1982 and 2013. Does that need to change if we're only doing 10 years of training? This could give us some sites with only 4 years of training data.
Maybe the criterion should be more like "visited during at least 70% of the training years"? Then we could use the same criterion for both.
As far as I can tell, there's no downside to including a site that doesn't get visited much in the test set, right?
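A sketch of what that shared criterion could look like with dplyr (the `visits` table below is a made-up stand-in for the real site-by-year visit records):

```r
library(dplyr)

# Toy visit table standing in for the real BBS site-by-year records
visits <- tibble::tibble(
  site_id = rep(1:3, times = c(9, 5, 10)),
  year    = c(1982:1990, 1982:1986, 1982:1991)
)

# Keep sites surveyed in at least 70% of the training years
training_years <- 1982:1991
good_sites <- visits %>%
  filter(year %in% training_years) %>%
  group_by(site_id) %>%
  summarise(n_years = n_distinct(year)) %>%
  filter(n_years >= 0.7 * length(training_years))
```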
It looks like our NDVI and climate/weather data sets both start in 1982. I was wondering what our options are for going back before 1982 (especially for the bioclim variables). Do we have that written down somewhere?
The GIMMS NDVI data we are using at the moment is only available through 2013. We need to fill this in moving forward, probably using MODIS or LTDR.
From #7:
LTDR appears to provide daily NDVI data back to 1981 using a combination of AVHRR and MODIS data. Data are available via FTP as one `hdf` file/day (25-150 MB/file depending on the satellite) and are split up by satellite (with different satellites covering different time periods). The NDVI data is in the AVH13C1 files. Description of the NDVI product. Unfortunately, these are daily data, which would force us to do the compositing ourselves.
If I call `get_richness_ts_env_data(start_yr, end_yr, min_num_yrs) %>% na.omit()`, there are sites within it that have fewer than `min_num_yrs` years of data. It looks like it has to do with the ordering of things. This is from forecast-bbs-core.R line 186: `complete(site_id, year)` fills in all possible years for all sites, so `filter_ts()` a few lines down thinks that all sites have plenty of years in their time series. If I take out `complete(site_id, year)` it seems to work, but I'm not sure what that might break elsewhere.
```r
get_richness_ts_env_data <- function(start_yr, end_yr, min_num_yrs){
  bbs_data <- get_bbs_data(start_yr, end_yr, min_num_yrs)
  # Species richness per site-year, padded to all site/year combinations
  # (the complete() call discussed above)
  richness_data <- bbs_data %>%
    group_by(site_id, year, lat, long) %>%
    dplyr::summarise(richness = n_distinct(species_id)) %>%
    ungroup() %>%
    complete(site_id, year)
  # Add environmental data, then filter to sufficiently long time series
  richness_ts_env_data <- richness_data %>%
    add_env_data() %>%
    filter_ts(start_yr, end_yr, min_num_yrs)
}
```
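One possible fix, sketched below under the assumption that `filter_ts()` just counts rows per site: make the length check ignore the NA rows that `complete()` adds, so padded years can't inflate a site's apparent time-series length (this fragment uses `richness_data` and `min_num_yrs` from the function above):

```r
library(dplyr)

# Sketch: count only non-NA richness values per site, so the years
# filled in by complete() don't count toward the length criterion
richness_data %>%
  group_by(site_id) %>%
  filter(sum(!is.na(richness)) >= min_num_yrs) %>%
  ungroup()
```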
Need to re-run the NDVI extraction using a buffer.
Ideally we'd like fairly finely resolved land cover data going back to the 1970s.
The National Land Cover Database is one option, but its big limitation is that there are only 4 time points: 1992, 2001, 2006, 2011. If the algorithms are available we could think about processing the full time-series (though this could be a pretty massive computational task). We could also interpolate between the neighboring points. This has some risks, but I suspect that land cover change is gradual in most cases. It still leaves us without data prior to 1992 and post 2011. NLCD also only covers the United States.
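A minimal sketch of the interpolation idea, assuming land cover has already been summarized as a per-class proportion around each site at the four NLCD time points (the values below are made up):

```r
# Linearly interpolate a cover-class proportion between NLCD years
nlcd_years  <- c(1992, 2001, 2006, 2011)
prop_forest <- c(0.42, 0.40, 0.37, 0.35)  # made-up proportions
interpolated <- approx(nlcd_years, prop_forest, xout = 1992:2011)
```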
Other sources of land cover data:
The current BBS data includes water birds and nocturnal species that aren't sampled well. They need to be removed.
I see at least four options for dealing with all the namespaces clobbering each other:

1. Leave all the `library()` calls and add `::` where needed
2. Use `::` everywhere
3. Use explicit namespaces using R's package infrastructure

I'm leaning toward 3. Thoughts @ethanwhite @sdtaylor?
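If we go with 3, this is roughly what I mean (a sketch using roxygen2 import tags, which would only apply once the project is structured as a package):

```r
# Sketch of option 3: once the project is an R package, imports are
# declared in the NAMESPACE (here via roxygen2 tags) instead of
# library() calls scattered through the scripts

#' @importFrom dplyr group_by summarise ungroup
#' @importFrom tidyr complete
NULL
```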
This is probably effectively static at the time-scales we're working with, but could be important for space-for-time models.
Tables
Model | Site effect | Env. Vars | Species Specific |
---|---|---|---|
TS | No | No | No |
JSDM | No | Yes | Yes |
SSDM | No | Yes | Yes |
GBM | No | Yes | Yes |
Main Results figures
One software package: https://github.com/James-Thorson/spatial_DFA
Also does JSDM (#52)
We currently have one (and maybe soon two) calls to a Postgres database. Since we're already using SQLite to store and pass around environmental data, we should go ahead and add the BBS data to that database as well and extract it from there.
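A sketch of what pulling BBS data from the shared SQLite file could look like (the file and table names here are assumptions):

```r
library(DBI)

# Read BBS counts from the same SQLite file we use for environmental
# data ("bbsforecasting.sqlite" and "bbs" are assumed names)
con <- dbConnect(RSQLite::SQLite(), "bbsforecasting.sqlite")
bbs_data <- dbGetQuery(con, "SELECT * FROM bbs")
dbDisconnect(con)
```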
As an alternative to excluding water birds, @davharris has suggested adding something like the proportion of area within a 40 km buffer that is water as a predictor for the spatial models (and possibly the temporal ones if this changes much over the time period of the data, which it might).
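A rough sketch of computing that predictor with the raster package, using a toy binary water layer (1 = water) in place of real land cover data and made-up site coordinates:

```r
library(raster)

# Toy 0/1 water raster on a 100 km x 100 km grid (units are meters)
water <- raster(matrix(rbinom(100, 1, 0.2), nrow = 10),
                xmn = 0, xmx = 1e5, ymn = 0, ymx = 1e5)
sites <- cbind(x = c(2e4, 7e4), y = c(3e4, 8e4))

# Mean of a 0/1 raster within a 40 km buffer = proportion water
prop_water <- extract(water, sites, buffer = 40000,
                      fun = mean, na.rm = TRUE)
```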
`get_prism_data` currently uses the `prism` R package. Since we'll be combining data from lots of different sources, let's keep the data acquisition part straightforward by using the retriever for everything. I don't have strong opinions on using `prism` vs. Postgres for merging the data with BBS at the moment, but that may change as the number of different sources for predictors increases.
`tidyr` has added the function I needed a few months ago when I wrote `get_popdyn_data_w_zeros`. This should now be replaceable using a single call to `tidyr::complete`.
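For reference, a minimal example of the replacement (toy data; `fill` supplies zeros for the missing site/year combinations):

```r
library(tidyr)

# Toy abundance table with a missing site/year combination
counts <- data.frame(site_id = c(1, 1, 2), year = c(1991, 1992, 1991),
                     abundance = c(5, 3, 4))

# complete() adds the missing combinations, filling abundance with 0
complete(counts, site_id, year, fill = list(abundance = 0))
```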
if p<0.001?
What did we decide about the train-test split? Was it the last 5 years that were reserved for evaluation? Sorry if I'm just missing it; I don't see anything about it in either of the notebooks or in the core functions.
This is potentially a nice way to communicate what the benchmark comparisons tell us.
I'm basically envisioning an observed-predicted plot of forecast vs. observed richness side-by-side with an observed-predicted plot of forecast change in richness vs. observed change in richness. The first will look good, demonstrating that if we just need to know what richness will look like in 20 years, we can do that; the second will look bad, showing that we don't really benefit from the more sophisticated models over just assuming that things don't change.
This ties back to the results in Rapacciuolo 2012, which shows that SDMs work well for forecasting species locations because they are good at predicting the areas that don't change.
@davharris says:
Given that it's kind of weird to ask "how will next month's (June) weather affect what birds we'll find on today's transect run (in May)", it's probably worth adding a sentence pointing out that quirk of our data.
Add a config.R file for common variables like the years of analysis and the path to the SQLite database file.
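Something like this sketch (all values are placeholders):

```r
# Sketch of config.R; values are placeholders
settings <- list(
  start_yr    = 1982,
  end_yr      = 2013,
  min_num_yrs = 25,
  db_path     = "./data/bbsforecasting.sqlite"
)
```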
In the retriever, `species_id` refers to the column before `AOU`. In this project, we use `species_id` as a synonym for AOU. What's the best way to avoid inconsistency when we try to do things like join the species table to the abundance table? We could change `get_species_data` with one line of code, but we'd still have this inconsistency between our database and our R code. Or we could change all instances of `species_id` to match the database, but that seems like a pain.
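For what it's worth, the one-line change could be a rename at the database boundary; a sketch, assuming the retriever table has both its own `species_id` and an `aou` column:

```r
library(dplyr)

# Sketch: translate at the database boundary so the rest of our R code
# can keep using species_id as a synonym for AOU (column names assumed)
species <- species %>%
  rename(retriever_id = species_id, species_id = aou)
```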
Currently, `filter_species` throws out anything with `unid.` in the name, on the theory that it can't be identified to species. But the following three species (and possibly some others?) are identified to species, just not to subspecies. I think that means we're incorrectly throwing out observations for them.

1. (unid. Red/Yellow Shafted) Northern Flicker
2. (unid. race) Dark-eyed Junco
3. (unid. Myrtle/Audubon's) Yellow-rumped Warbler
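A sketch of a whitelist approach (the `name` column is a stand-in for whatever the real species-name column is called):

```r
# Toy species table; "name" stands in for the real column name
species <- data.frame(name = c("(unid. race) Dark-eyed Junco",
                               "unid. Empidonax flycatcher",
                               "Northern Cardinal"))

# Drop "unid." records except the three identified to species
keep_anyway <- c("(unid. Red/Yellow Shafted) Northern Flicker",
                 "(unid. race) Dark-eyed Junco",
                 "(unid. Myrtle/Audubon's) Yellow-rumped Warbler")
is_unid <- grepl("unid.", species$name, fixed = TRUE)
species_filtered <- subset(species, !is_unid | name %in% keep_anyway)
```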
This will let us make the `make` file more granular.
@davharris started some work on this in #2, but it is waiting on some fixes.
@sdtaylor is doing a lot of this modeling for population level work already, so we could also integrate that code at some point if it's sufficiently modularized to let us use it easily.
As an alternative to removing nocturnal/crepuscular species, @davharris has suggested adding a variable describing the time of the survey relative to sunrise.
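A sketch of deriving that variable with the suncalc package (the date, coordinates, and start time below are made-up examples; real BBS start times would need parsing first):

```r
library(suncalc)

# Local sunrise for a made-up survey date and location
sun <- getSunlightTimes(date = as.Date("2011-06-04"),
                        lat = 35.0, lon = -80.0, keep = "sunrise")

# Survey start time relative to sunrise, in hours (times are in UTC)
start_time <- as.POSIXct("2011-06-04 11:00:00", tz = "UTC")
hours_after_sunrise <- as.numeric(
  difftime(start_time, sun$sunrise, units = "hours")
)
```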
Paper at: http://onlinelibrary.wiley.com/doi/10.1111/ecog.02321/full
The R package is on CRAN under the name "RFc".
Here's a list of what they have available (note that the NCEP/NCAR Reanalysis has 4x daily data going back to 1948 for temp and precip; `prate` is short for "precipitation rate").
Joan and I played with it a bit today, and the API seems very good (it's from Microsoft Research).
Example script I used:
```r
library(RFc)

x <- fcTimeSeriesDaily("airt", latitude = c(35, 36),
                       longitude = c(-117, -118), firstYear = 1989)
```
I'm thinking about adding a longer-term forecast analysis to push the longest forecast out to 20+ years so that we get a better idea of whether the spatial and time-series approaches really end up converging. I'm envisioning leaving the bulk of the analysis as is, but adding one additional analysis where we train on the first 5-10 years and forecast on the last 20-25.
rEDM package vignette: https://cran.r-project.org/web/packages/rEDM/vignettes/rEDM_tutorial.html
How do we want to handle oddball "species" (e.g. unidentified species, hybrid species & subspecies)? They're usually pretty identifiable in the data using regexes, but it's not obvious to me how best to handle them.
My proposal:
On a related note, did we make a decision somewhere about whether to remove super-rare species from the data set?
Better to do it once for all models than to let each model choose its own optimal predictors?
Certainly shouldn't let models see the test set when deciding which predictors to use...
I've assigned myself to deal with this before the final model runs.
@sdtaylor - do you have code for this somewhere from your other work? If so, can you add a function for handling this to `get_prism_data.R`?
I'd feel better if I knew what was going on with big richness jumps like the one plotted below (site_id 27057).
Is that a real change in richness, or did a new observer just start counting the birds in 1993? These big jumps have a big impact on the tails of the random walk, so I think we'll want to see if we can understand them better.
My RPostgres isn't behaving at the moment, so I can't easily access the observer data right now.
Either Landsat or AVHRR.
Initial work on this is started in `setup_ndvi_data.R` and the README in `./data/ndvidata/`. This uses NDVI data from AVHRR acquired from EarthExplorer. These data come in already-composited form, meaning they are relatively plug and play. Data go back to 1989 and are available as GeoTIFFs.
LTDR appears to provide daily NDVI data back to 1981 using a combination of AVHRR and MODIS data. Data are available via FTP as one `hdf` file/day (25-150 MB/file depending on the satellite) and are split up by satellite (with different satellites covering different time periods). The NDVI data is in the AVH13C1 files. Description of the NDVI product. Unfortunately, these are daily data, which would force us to do the compositing ourselves.
https://github.com/weecology/bbs-forecasting/blob/master/R/forecast-bbs-core.R#L160
richness would be more precise, I think?
For cases where all that is needed is the time-series itself, generate actual forecasts for future years.