
portalcasting

Logo: a hexagonal badge on a light grey-blue background, with "portalcasting" lettered at the top; the main image is a drawn all-black rodent standing on two feet, fishing rod in hand and brown fishing hat on head, next to a tan and green tackle box.

Badges: R-CMD-check · Docker · Codecov test coverage · Lifecycle: maturing · Project Status: Active (the project has reached a stable, usable state and is being actively developed) · License · DOI · NSF-1929730 · JOSS

Overview

The portalcasting package provides a comprehensive system for developing, deploying, and evaluating ecological models that forecast changes in ecological systems over time. It focuses in particular on the Portal Project, a long-term study of mammal population and community dynamics.

Core Dependencies

The portalcasting package depends on the PortalData and portalr packages.

  • PortalData is the collection of all the Portal project data.
  • portalr is a collection of functions to summarize the Portal data.

The portalcasting package integrates the PortalData repository and the portalr data-management package into a streamlined pipeline, which is used to forecast rodent populations at the Portal site.

The functionality of portalcasting extends beyond its deployment, as its functions are portable. This allows users to establish a fully-functional replica repository on either a local or remote machine, facilitating the development and testing of new models within a sandbox environment.
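That sandbox workflow might look roughly like the following sketch. The directory path here is arbitrary and the exact argument names should be checked against the package documentation; `setup_dir()` and `portalcast()` are portalcasting functions referenced elsewhere in this document.

```r
# Hypothetical sandbox session (a sketch, not the canonical workflow):
# build a replica directory locally, then run the forecasting pipeline in it.
library(portalcasting)

main <- "~/portalcasting_sandbox"  # any writable local folder (assumption)
setup_dir(main = main)             # download data and scaffold the directory
portalcast(main = main)            # fit models and write out forecasts
```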

Current deployment:

The portal-forecasts repository houses tools that leverage the portalcasting pipeline to generate weekly forecasts. The forecasts are then showcased on the Portal Forecasts website, which offers users an interactive interface for exploring the forecasting results. The source code for the website is hosted on GitHub. Additionally, the portal-forecasts repository archives the forecasts on both GitHub and Zenodo.

Docker Container

We leverage a Docker container to enable reproducibility of the Portal forecasting. Presently, we use a Docker image of the software environment to create a container for running the code. The image is automatically rebuilt when there is a new portalcasting release, tagged with both the latest and version-specific (vX.X.X) tags, and pushed to DockerHub.

Because the latest image is updated with releases, the current main branch code in portalcasting is typically, but not necessarily always, being executed within the predictions repository.

The API is under active development, and contributions are welcome.

Installation

You can install the package from GitHub:

install.packages("remotes")
remotes::install_github("weecology/portalcasting")

You will need to install rjags and JAGS.

macOS users should install rjags after reading the instructions in the package's README file, or consult the macOS installation thread on the JAGS discussion forum for help.

install.packages("rjags", configure.args="--enable-rpath")

Production environment

If you wish to spin up a local container from the latest portalcasting image (to ensure that you are using a copy of the current production environment for implementation of the portalcasting pipeline), you can run

sudo docker pull weecology/portalcasting

from a shell on a computer with Docker installed.

Usage

Get started with the "how to set up a Portal Predictions directory" vignette.

If you are interested in adding a model to the preloaded set of models, see the "adding a model and data" vignette. That document also details how to expand the datasets available to new and existing models.

Developer and Contributor notes

We welcome any contributions in the form of models or pipeline changes.

For the workflow, please check out the contribution and code of conduct pages.

Acknowledgements

This project is developed in active collaboration with DAPPER Stats.

The motivating study—the Portal Project—has been funded nearly continuously since 1977 by the National Science Foundation, most recently by DEB-1622425 to S. K. M. Ernest. Much of the computational work was supported by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4563 to E. P. White.

We thank Heather Bradley for logistical support, John Abatzoglou for assistance with climate forecasts, and James Brown for establishing the Portal Project.

portalcasting's People

Contributors

arfon · ethanwhite · gmyenni · ha0ye · henrykironde · juniperlsimonis · patdumandan · skmorgane


portalcasting's Issues

Add or swap in shorter run version of `portalcast` in Getting Started

Running portalcast() is pretty slow, and since it's a command listed in the Getting Started vignette, folks are likely to just run it (like I just did). It would probably be useful to provide (either as the only option or as an alternative) a version with a quicker run time for those just starting to get involved in the project.
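A hedged sketch of what a quicker alternative could look like; whether `portalcast()` accepts a `models` argument with this exact name is an assumption based on the model-selection discussion elsewhere in this document.

```r
# Hypothetical quick-start run: fit only one fast model rather than the
# whole prefab set. The `models` argument name is an assumption.
portalcast(models = "AutoArima")
```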

`pass_and_call`

  • consider the issues from the first attempt at this approach
  • work in a confined space to get the code working well for complex toy examples first

get off AIC for ensemble building

AIC-based comparison by definition requires that all models were fitted to the same data set, and since some of the models don't handle unequal steps between surveys, we have to interpolate abundances for ALL models.

The sooner we can move to out-of-sample validation of some sort, the sooner we can actually leverage models that don't require equally spaced surveys.
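A generic sketch of the idea (base R only, not the portalcasting API; the series and model orders are stand-ins): hold out the end of the series and score each candidate by out-of-sample RMSE, so models need not share a fitting data set.

```r
# Generic out-of-sample comparison sketch: two ARIMA variants scored by
# RMSE on a 12-point holdout, instead of comparing AIC on interpolated data.
y     <- as.numeric(AirPassengers)  # stand-in series (144 monthly values)
train <- y[1:132]
test  <- y[133:144]

fit1 <- arima(train, order = c(1, 1, 1))
fit2 <- arima(train, order = c(0, 1, 0))  # random-walk baseline

rmse <- function(fit, test) {
  pred <- predict(fit, n.ahead = length(test))$pred
  sqrt(mean((pred - test)^2))
}

c(arima111 = rmse(fit1, test), rw = rmse(fit2, test))
```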

soften model running errors

build wrappers to catch any errors that would break the full analysis pipeline (as exemplified by pevGARCH throwing errors because of missing covariates and breaking the whole run, even though the other models fit fine)
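A minimal sketch of such a wrapper (the function and argument names here are hypothetical):

```r
# Run one model's fitting function, returning NULL with a warning instead of
# erroring, so a single failing model cannot take down the whole cast.
safe_fit <- function(fit_fun, ...) {
  tryCatch(
    fit_fun(...),
    error = function(e) {
      warning("model fit failed: ", conditionMessage(e), call. = FALSE)
      NULL
    }
  )
}

# Failed models come back as NULL and can be dropped afterwards, e.g.:
# fits <- Filter(Negate(is.null), lapply(fit_funs, safe_fit))
```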

`model_names` API thoughts

Both @skmorgane and I found the model_names API a bit confusing. We like the idea of having the ability to both have a prefab list and add additional models, so maybe it's just a docs thing.

UX issues that we experienced:

  1. I read the docs down through "character value of the type of model (currently only support for "prefab" and "wEnsemble"). Use NULL to build a custom set from scratch via add." and thought - Oh, that's a complicated config feature, I think I'll stop. Obviously I should have kept going, but if I didn't then it's likely that others won't either.
  2. We both expected the first argument to be where we would list the models we wanted to run.
  3. We felt like in the sandbox the number 1 thing we would want to do is run our model a bunch of times and only optionally later add in comparisons to other models.

With that in mind our thought was that an API like this might be a good option:

model_names(models, add_model_set = NULL)

So then users can do initial development like

model_names("mymodel")

Test against a primary competitor

model_names(c("mymodel", "ARIMA"))

And then do a full comparison to existing models

model_names("mymodel", add_model_set = "prefab")
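A hypothetical implementation of that proposed interface, for discussion only; the prefab model names are taken from elsewhere in this document, and the function body is an illustration, not the package's code.

```r
# Sketch of the proposed model_names() API: named models first,
# with an optional prefab set appended.
model_names <- function(models = NULL, add_model_set = NULL) {
  model_sets <- list(
    prefab = c("AutoArima", "ESSS", "nbGARCH", "pevGARCH")
  )
  extra <- if (is.null(add_model_set)) NULL else model_sets[[add_model_set]]
  unique(c(models, extra))
}

model_names("mymodel")                            # initial development
model_names(c("mymodel", "AutoArima"))            # one primary competitor
model_names("mymodel", add_model_set = "prefab")  # full comparison
```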

improve test runtimes

the tests have long runtimes (requiring a hack on Travis due to quietness, and the skipping of some tests except locally), although much of that is caused by the repeated downloading of the raw data.
judicious reorganization of both the codebase and the testing setup should allow tests to execute much faster.

for example, the directory should really only ever need to be made from scratch once, and then should be re-initiable from the PortalData subdirectory if needed; that subdirectory should itself only ever need to be downloaded once

also, a simple messaging function would remove the need to run so much code twice, with and without quieting (the quieting or not can happen separately from the actual execution)

"Adding a Model" vignette

I'm currently working through the process of trying to add a new model, and getting confused by the specific implementation steps.

One of the main challenges with documentation is that portalcasting has (at least) 2 audiences:

  1. the people who are running the Portal Forecasting website within the lab and need to know how the infrastructure is connected
  2. contributors / end-users of portalcasting who may want to run the models on different subsets of the code or add new models.

For 2., as @gmyenni put it, "the process of adding a new model should be plug-and-play", but it's not clear that this is the case right now, and the directions for adding a new model aren't quite at that step yet (or at least, the "Adding a Model" vignette does not clarify the steps for me).

As a basic example, the vignette starts by mentioning a data/ and models/ subdirectory, but they don't exist as part of the repo. (and they should be created through the setup_dir() function)

So, I think the higher-level organization of the vignette should be:

  • (assume installation of the package and dependencies)
  • set up the portalcasting folders and data locally
  • example for adding a new model to read in data and make forecasts
  • running the new model locally (ideally, without having to run all of the models in portalcasting -- not sure how much of this configuration should be done at this step or is part of the portalcasting codebase to be setup earlier, e.g. when setting up the folders)
  • steps for moving the model files into the portalcasting codebase (as a PR, etc.)

include edge case testing for GARCH models

the 0-abundance case and failed fitting for non-0 but nearly-0 abundances should be included in testing for the three GARCH models
can probably include this as part of tidying up those functions while building better model utilities for folks making new models
or maybe just do it on its own?

improve error messaging around folder locations

currently the error messaging when you try to read from a non-existent directory (for example, if you are working in another spot, but forget to set main) is a bit gnarles. it could stand to be tidied and made more explicit


allow differentiation among forecasts within a day

currently, forecasts are named by date, which prevents having multiple forecasts made on the same date (if a forecast needs to be fixed but the old one should be retained for posterity, for example)
a time stamp needs to be added in a way that works simply for file naming, etc.
and all previous files will need to be updated
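One way that file naming could work, as a sketch (the helper name and prefix are hypothetical): a sortable date-time stamp lets same-day forecasts coexist while preserving chronological ordering.

```r
# Hypothetical helper: name casts with a sortable date-time stamp instead
# of the date alone, so two forecasts made on one day get distinct files.
cast_filename <- function(dir = ".", prefix = "cast") {
  stamp <- format(Sys.time(), "%Y-%m-%d_%H-%M-%S")
  file.path(dir, paste0(prefix, "_", stamp, ".csv"))
}

cast_filename()  # e.g. "./cast_<date>_<time>.csv"
```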

change drop_spp to species

currently, the rodent data are based on exclusion rather than inclusion, which isn't intuitive from a user perspective. update to inclusion, and carry it through to full functionality under the hood

incorporate tmnt_type flexibility

presently, some of the package components assume that there are two and only two treatment types ("all" and "controls"), for example the model script writing and cast processing code.
however, the basic data machinery is now much more flexible with respect to treatment type, allowing users to construct and work with data according to their own specs, so we'll need to carry that new capacity through the rest of the code base.

generalize update_list

make update_list work where you pass it a second list
maybe update_list <- function(orig_list, ..., new_list = NULL)
and if new_list isn't NULL then unwind it and use its elements
that way you don't have to pass in each argument as x = x
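A sketch of that generalized signature (the merge semantics, with `...` entries overriding `new_list` entries, are an assumption to be decided):

```r
# Generalized update_list(): accepts named arguments and/or a whole list.
update_list <- function(orig_list, ..., new_list = NULL) {
  updates <- c(new_list, list(...))  # later (named ...) entries win
  for (nm in names(updates)) {
    orig_list[[nm]] <- updates[[nm]]
  }
  orig_list
}

update_list(list(a = 1, b = 2), b = 3, new_list = list(c = 4))
# a = 1, b = 3, c = 4
```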

function documentation examples

from the LDATS code review, which I think we should follow here:

Please add small executable examples in your Rd-files to illustrate the use of the exported function.

\dontrun{} should be only used if the example really cannot be executed (e.g. because of missing additional software, missing API keys, ...) by the user. That's why wrapping examples in \dontrun{} adds the comment ("# Not run:") as a warning for the user.
Please unwrap the examples if they are executable in < 5 sec, or create additionally small toy examples to allow automatic testing, (or replace \dontrun{} with \donttest{}).

When creating the examples please keep in mind that the structure
would be desirable:
\examples{
examples for users and checks:
executable in < 5 sec
\dontshow{
examples for checks:
executable in < 5 sec together with the examples above
not shown to users
}
\donttest{
further examples for users; not used for checks
(f.i. data loading examples )
}
}

`setup_sandbox` function

i've decided it will be nice to have a version of the main setup_dir() function tailored to the sandbox situation: setup_sandbox(). whereas setup_dir() is meant to work on the main pipeline with default settings (which makes it much easier to tinker with code within the package), we want a version that makes things nice for a sandboxing user.
as of now, the only thing that jumps out to me as needed in the sandbox but not done by default is the downloading of the historic forecasts. the historic covariate forecasts and the historic rodent forecasts are housed within the main forecasting repo, so the server doesn't need to spend time downloading those files, and the default in the control lists is to not download them, although it's a simple toggle to do it. the way the current API is set up, it's actually really easy to manage this, so i decided to jump for it and create setup_sandbox(). it's basically a call to setup_dir() but with those download settings set to TRUE, and you interact with it the same way you would interact with setup_dir().

at this point, it's easy to change other defaults, so i was wondering if folks who have interacted with setup_dir() in a sandbox setting (@ethanwhite @skmorgane @ha0ye? who else?) had any feedback/requests/suggestions for changed settings for sandboxing. this function is part of the in-works v0.9.0, but i wanted to get thoughts!

NA related error in `plot_cast_point(with_census = TRUE)`?

Following the Getting Started vignette I run into the following error:

> plot_cast_point(with_census = TRUE)
Error in read_cast(tree, cast_type = cast_type, cast_date = cast_date) : 
  forecasts from NA not available
In addition: Warning message:
In max.default(numeric(0), na.rm = FALSE) :
  no non-missing arguments to max; returning -Inf

I'm guessing this has something to do with NA being interpreted as a null value, given the error, but it also triggers when setting species = "DM".

condense input checking

there's a lot of code space taken up (particularly in the options list functions) with input checking that could be packaged into a tighter set of functions
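A sketch of what a condensed checker could look like (the function name and interface are hypothetical; the real options functions would call something like this instead of repeating inline checks):

```r
# One reusable argument checker: verifies class and length, and reports
# the offending argument by name.
check_arg <- function(x, what = "character", len = 1L) {
  nm <- deparse(substitute(x))
  if (!inherits(x, what) || length(x) != len) {
    stop("`", nm, "` must be a ", what, " of length ", len, call. = FALSE)
  }
  invisible(TRUE)
}

tree_name <- "main"
check_arg(tree_name)               # passes silently
# check_arg(tree_name, "numeric")  # would error with a named message
```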

test all functions

full battery of tests
also add in checks within functions for argument inputs

  • add_addl_future_moons
  • add_ensemble
  • add_future_moons
  • all_options
  • append_cov_fcast_csv
  • append_csv
  • append_past_moons_to_raw
  • AutoArima
  • AutoArima_options
  • base_path
  • cast
  • cast_models
  • cast_options
  • casts
  • check_to_skip
  • classy
  • cleanup_dir
  • clear_tmp
  • combine_forecasts
  • compile_aic_weights
  • covariate_models
  • covariates_options
  • create_dir
  • create_main_dir
  • create_sub_dir
  • create_sub_dirs
  • create_tmp
  • data_options
  • dataout
  • dir_options
  • dirtree
  • download_predictions
  • enforce_rodents_options
  • ESSS
  • ESSS_options
  • fcast0
  • file_path
  • fill_data
  • fill_dir
  • fill_models
  • fill_PortalData
  • fill_predictions
  • forecast_covariates
  • forecast_ndvi
  • forecast_weather
  • format_moons
  • get_climate_forecasts
  • interpolate_abundance
  • is.spcol
  • lag_data
  • main_path
  • make_ensemble
  • metadata_options
  • model_options
  • model_path
  • model_template
  • models
  • models_options
  • models_to_cast
  • moons_options
  • nbGARCH
  • nbGARCH_options
  • pevGARCH
  • pevGARCH_options
  • portalcast
  • PortalData_options
  • predictions_options
  • prep_covariates
  • prep_data
  • prep_fcast_covariates
  • prep_hist_covariates
  • prep_metadata
  • prep_moons
  • prep_rodents
  • prep_weather_data
  • read_data
  • remove_incompletes
  • remove_spp
  • rodent_spp
  • rodents_data
  • rodents_options
  • save_forecast_output
  • setup_dir
  • step_casts
  • step_hind_forward
  • sub_path
  • sub_paths
  • subdirs
  • today
  • transfer_hist_covariate_forecasts
  • transfer_trapping_table
  • trim_moons_fcast
  • trim_treatment
  • update_covariates
  • update_covfcast_options
  • update_data
  • update_rodents
  • verify_models
  • verify_PortalData
  • write_model

message functionality to handle quiet argument

  • add a function that prints a message (or not) based on the quiet argument
  • reduces testing time (stuff is currently re-run with and without quiet)
  • tidies up the code by eliminating the repeated logical operators
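The whole helper is only a few lines; a sketch (the name `messageq` is illustrative):

```r
# Single messaging helper keyed off `quiet`, replacing scattered
# `if (!quiet) message(...)` branches throughout the codebase.
messageq <- function(msg, quiet = FALSE) {
  if (!quiet) {
    message(msg)
  }
  invisible(NULL)
}

messageq("downloading raw data...")                # prints
messageq("downloading raw data...", quiet = TRUE)  # silent
```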

github url download capacity

similar to the NMME and Zenodo API URL building, set up the capacity to point to GitHub, in case components haven't been archived, or perhaps as a built-in backup if a Zenodo download is slow or the server is down
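The URL building itself could be as simple as the following sketch (the helper name, repo, and path here are illustrative, not the package's API):

```r
# Build a raw.githubusercontent.com URL as a fallback download source.
github_raw_url <- function(repo, path, ref = "main") {
  paste0("https://raw.githubusercontent.com/", repo, "/", ref, "/", path)
}

github_raw_url("weecology/PortalData", "version.txt")
```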

Vignette order issue

For the how-to vignette, if you follow along in the current order, the cleanup_dir() command cleans out the species list, which is needed in the plotting steps that follow, yielding an error (and no plots). Here's the error: cannot open file 'C:\Users\skmorgane\Documents\PortalData\Rodents\Portal_rodent_species.csv': No such file or directory. I had @ethanwhite confirm this happened for him as well.

We should either change the order in the vignette so that cleanup_dir() comes at the end of the document, or change cleanup_dir() so that it leaves a copy of the file for the plot functions. If we leave cleanup_dir() as is, we should warn users that they shouldn't run it until they are completely done.

improve function documentation

  • add_addl_future_moons
  • add_ensemble
  • add_future_moons
  • all_options
  • append_cov_fcast_csv
  • append_csv
  • append_past_moons_to_raw
  • AutoArima
  • AutoArima_options
  • base_path
  • cast
  • cast_models
  • cast_options
  • casts
  • check_to_skip
  • classy
  • cleanup_dir
  • clear_tmp
  • combine_forecasts
  • compile_aic_weights
  • covariate_models
  • covariates_options
  • create_dir
  • create_main_dir
  • create_sub_dir
  • create_sub_dirs
  • create_tmp
  • data_options
  • dataout
  • dir_options
  • dirtree
  • download_predictions
  • enforce_rodents_options
  • ESSS
  • ESSS_options
  • fcast0
  • file_path
  • fill_data
  • fill_dir
  • fill_models
  • fill_PortalData
  • fill_predictions
  • forecast_covariates
  • forecast_ndvi
  • forecast_weather
  • format_moons
  • get_climate_forecasts
  • interpolate_abundance
  • is.spcol
  • lag_data
  • main_path
  • make_ensemble
  • metadata_options
  • model_options
  • model_path
  • model_template
  • models
  • models_options
  • models_to_cast
  • moons_options
  • nbGARCH
  • nbGARCH_options
  • pevGARCH
  • pevGARCH_options
  • portalcast
  • PortalData_options
  • predictions_options
  • prep_covariates
  • prep_data
  • prep_fcast_covariates
  • prep_hist_covariates
  • prep_metadata
  • prep_moons
  • prep_rodents
  • prep_weather_data
  • read_data
  • remove_incompletes
  • remove_spp
  • rodent_spp
  • rodents_data
  • rodents_options
  • save_forecast_output
  • setup_dir
  • step_casts
  • step_hind_forward
  • sub_path
  • sub_paths
  • subdirs
  • today
  • transfer_hist_covariate_forecasts
  • transfer_trapping_table
  • trim_moons_fcast
  • trim_treatment
  • update_covariates
  • update_covfcast_options
  • update_data
  • update_rodents
  • verify_models
  • verify_PortalData
  • write_model

set up data classes for the steps starting with the model scripts

the functions for all of the pre-model-running processing require data arguments to be of specific classes (like rodents and covariates), but the classes are lost when the files are written out and then read back in by the model scripts
it would be good to leverage the classes in the model functions, but that will require some validation step and wrapper functions around the csv-reading function to append the classes in a meaningful way (in case the external file gets corrupted or something)
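Such a wrapper could look like the following sketch (the function name, class names, and validation are illustrative, not the package's implementation):

```r
# Read a csv back in for a model script, validate expected columns, and
# re-attach the data class that was lost on write-out.
read_classed_csv <- function(path, data_class, required_cols = NULL) {
  x <- read.csv(path, stringsAsFactors = FALSE)
  missing_cols <- setdiff(required_cols, names(x))
  if (length(missing_cols) > 0) {
    stop("file is missing columns: ", paste(missing_cols, collapse = ", "))
  }
  class(x) <- c(data_class, class(x))
  x
}

# e.g. rodents <- read_classed_csv("data/rodents.csv", "rodents")
```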

name alignments

many of the names of objects and arguments have been aligned with #129, but there are still some key spots that need to be updated
casts should be saved out with a tmnt_type column, not a level column (level was removed as a signifier here, as it conflicts directly with the portalr function argument). enacting this will require some back-compatibility code to make the old files work

illustrate the code flow

use this as a good opportunity to develop a set of visual constructs for representing the codebase

check contributors list

please check the contribs list (basically copied over from portalPredictions) and let me know if I should make any changes
