coronawhy / task-ts

23.0 9.0 16.0 7.89 MB

Work related to time series prediction and forecasting of Coronavirus

Python 0.10% Jupyter Notebook 99.90%

coronavirus time-series-analysis time-series-forecast deep-learning

task-ts's People

Contributors

Stargazers

Watchers

Forkers

itsmemala adzuci wwymak razumovs aradhanacha kritim13 mgavish antonpolishko pranjalya eau1 nisargvp zjminglead feideer masterleia optiluca zhaoxuxu900

task-ts's Issues

Add BEAT-19 Data

We would like to incorporate the BEAT-19 data into our models.

Acceptance Criteria

Code to join the BEAT-19 data incorporated into the data crawler.
Appropriate unit test coverage
Code committed on GitHub with passing unit tests.

Evaluate top models on new data with interpretability features

Based on the top models identified in #53, we now want to determine how well the models perform on the most recent data.

Depends on:
#53
Adding enhanced evaluation

Acceptance criteria:

Documented report on Wandb on how well the models perform on new test period
Report should include diagrams and feature importance plots
Comments from Serge or other Epidemiologists

Experiments with weather data added

From an epidemiological perspective it would be useful to see whether weather data helps or hinders model performance. It would also be useful to see which features the model attends to.

Run on target large US counties w/o transfer
Run on target large US counties w transfer (flow)
Run on target large US counties w transfer (county+flow)
Run on target large US counties w transfer (county)
Run on target large Italy counties w/o transfer
Run on target large Italy counties w transfer (flow)
Run on target large Italy counties w transfer (county+flow)
Run on target large Italy counties w transfer (county)

Poster preparation

We will need to prepare a one page PDF poster for our workshop on Global Health at ICML.

Acceptance Criteria

1 Page PDF
PDF verified by team members

Demographic and Disease Prevalence data

We want to gather static demographic data on the factors that may affect the number of cases/deaths for each county:
Population age groups
Disease prevalence
Income level
Population density
Mean distance to hospital
Primary industries
Race
etc

Acceptance criteria:

Data should be collected for every geographic region listed in our data-frame.
Data should be saved to GCS/Dataverse for future analysis.
Data should include a column specifying the sub_region in the df.
Code should be committed to our task-ts repo

Fix scraper

Our scraper currently depends on covidatlas/li#619. We may have to switch to JHU or another data source if this doesn't work.
@efawe

Update data crawler to include sub_region in mobility data

The current version of the mobility data in master is at region level.
Task-geo has been updated with the latest mobility data at sub-region level.
Update the data crawler to reflect the latest change.

This issue only occurs when installing task-geo from source, which gives the latest but not the stable version.

Refactor code to handle additional columns at train/inference time

Need to refactor function to work at training/inference time for LSTM.

Formal report detailing transfer results

Using the results of #43 we need a formal report examining transfers.
Acceptance criteria

Write a formal report on Weights and Biases to examine effect of transfer

Create airflow DAG to separate out regions/counties

Continuation of #61, as Google Cloud Functions was not financially practical.

Benchmark models and differences versus California County Models

Based on the results of #53, we want to compare how accurately our best models forecast against the California models. We also want to highlight our methodological differences and how our approach can enhance forecasts relative to the California models.

Acceptance criteria

Utilize California models to back-test on May 30-Jun-14.
Compare MSE of our best models versus California models
Write Wandb report to show difference between the two models.

Large scale pre-trained weights on flow for transfer

In order to successfully examine pre-training we need to pre-train a large number of weights on river flow data.

Acceptance Criteria

Pre-trained weights for 3 and 4 encoders
Pre-trained weights for flow data stashed to GCS trained on at least 50 rivers.
Documentation page detailing what rivers model trained on and their evaluation metrics.

Create cloud function to save county CSV to bucket

Deploy cloud function on GCS and test on incoming data.

Partial weight of state dict

We need to be able to load a state dict partially for transfer learning.
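One way partial loading could look in PyTorch is sketched below. This is an illustration only, not the flow-forecast implementation: the helper name `load_partial_state_dict` and the two toy models are hypothetical. The sketch keeps only the pretrained tensors whose name and shape match the target model, then loads them with `strict=False`.

```python
import torch
from torch import nn


def load_partial_state_dict(model: nn.Module, pretrained: dict) -> list:
    """Copy only the pretrained tensors whose name and shape match the model.

    Returns the sorted names of the entries that were loaded.
    (Hypothetical helper, written for illustration.)
    """
    own_state = model.state_dict()
    compatible = {
        name: tensor
        for name, tensor in pretrained.items()
        if name in own_state and own_state[name].shape == tensor.shape
    }
    # strict=False tolerates the keys that were filtered out above;
    # shape mismatches would still raise, which is why they are removed first.
    model.load_state_dict(compatible, strict=False)
    return sorted(compatible)


# Toy models: the source saw 3 input features, the target sees 5.
source = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 1))
target = nn.Sequential(nn.Linear(5, 8), nn.ReLU(), nn.Linear(8, 1))

loaded = load_partial_state_dict(target, source.state_dict())
print(loaded)  # ['0.bias', '2.bias', '2.weight']
```

The first layer's weight is skipped because its shape differs, so it keeps its fresh initialization while the remaining layers start from the pretrained values.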

Deploy county cases and hospitalizations forecasting model with features

A central goal of our COVID-19 forecasting efforts is to deploy a model that provides tangible value to public policy officials. To do this we need the following steps completed:

Add models that generalize effectively, including to coming out of lockdown, resuming lockdown, giant pool parties, bars opening, etc. Possible approaches include Neural ODEs, probabilistic models, autoencoders (i.e. the Uber method) and other models.
Create automatic module to compare model performance to California county baseline models.
Have epidemiologist evaluate the model's learned features and give feedback.
Create continuing evaluation mode in flow-forecast repository
Create inference mode in flow-forecast repository
Create Docker containers for flow-forecast deploy.
Create Airflow DAG to run deployed Docker container and persist predictions.
Create Web based app for epis to view predictions, relevant features, and different scenarios. (This will likely be several separate issues when we get there)
Create descriptions to analyze past model performance.

Setup continuous integration and write preliminary unit tests

Set up basic CI/CD environment with configuration file
Write unit tests for existing code
Refactor model into testable blocks

NYC data source to be added to scraper

https://github.com/nychealth/coronavirus-data.git

Rolling 7 day average

Based on @wwymak comments we would like to try forecasting cases on a rolling weekly average. While this removes a degree of granularity it may ease problems with reporting issues.

Acceptance criteria:

Experiments logged on primary counties list.
Pre-processing code merged into the task-ts repo
Report detailing the findings from using the seven-day average.
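A minimal sketch of the pre-processing step, assuming a pandas data frame with hypothetical columns `county_state`, `date` and `new_cases` (the repo's actual column names may differ). The average is computed per county, so values from one county never leak into another.

```python
import pandas as pd


def add_rolling_mean(df, value_col="new_cases", group_col="county_state", window=7):
    """Add a trailing rolling mean of value_col, computed separately per group."""
    out = df.sort_values([group_col, "date"]).copy()
    out[f"{value_col}_{window}d_avg"] = out.groupby(group_col)[value_col].transform(
        # min_periods=window leaves the first window-1 days as NaN
        # instead of averaging over an incomplete week.
        lambda s: s.rolling(window, min_periods=window).mean()
    )
    return out


# Made-up daily counts for a single county, including a zero-report day.
demo = pd.DataFrame({
    "county_state": ["Orange_California"] * 8,
    "date": pd.date_range("2020-06-01", periods=8, freq="D"),
    "new_cases": [7, 14, 0, 7, 7, 7, 7, 14],
})

smoothed = add_rolling_mean(demo)
print(smoothed["new_cases_7d_avg"].tolist()[-2:])  # [7.0, 8.0]
```

The zero-report day followed by a catch-up day of 14 is absorbed by the weekly window, which is the reporting issue the rolling average is meant to ease.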

Run experiments with enhanced evaluation metrics

Run on primary US counties w/o transfer
Run on primary US counties w transfer (flow)
Run on primary US counties w transfer (county)
Run on primary US counties w transfer (flow + county)
Run on Italy Counties w/o transfer
Run on Italy counties w transfer (flow)
Run on Italy counties w transfer (county)
Run on Italy counties w transfer (flow + county)

Multiple counties with same name

The data frame should not be masked on county alone; otherwise we get this problem. Many counties, particularly in the US, have exactly the same name. Instead, create a new column that concatenates county and state/province. This ensures that no two counties end up in the same geo forecasting segment. The current behavior produces strange negative case numbers, which the LSTM cannot handle.
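The fix can be sketched with pandas as follows. The column names (`county`, `state`, `cases`) and the numbers are made up for illustration, and the sketch assumes `cases` is a cumulative count that is differenced to get new cases; that assumption is one way the negative values could arise, not something confirmed here.

```python
import pandas as pd

# Two different counties that share the name "Orange" (made-up numbers).
df = pd.DataFrame({
    "county": ["Orange", "Orange", "Orange", "Orange"],
    "state": ["California", "California", "Florida", "Florida"],
    "date": pd.to_datetime(["2020-06-01", "2020-06-02", "2020-06-01", "2020-06-02"]),
    "cases": [100, 120, 40, 55],  # assumed cumulative
})

# Masking on county alone interleaves both counties, so differencing the
# cumulative counts jumps between them and yields negative "new cases".
mixed = df[df["county"] == "Orange"].sort_values("date", kind="stable")
has_negative = bool((mixed["cases"].diff() < 0).any())

# A combined key keeps every county in its own segment.
df["county_state"] = df["county"] + "_" + df["state"]
df["new_cases"] = df.sort_values("date").groupby("county_state")["cases"].diff()

print(has_negative)                   # True
print(df["county_state"].nunique())  # 2
```

After keying on `county_state`, the first day of each county is NaN (no previous value) and the remaining differences are non-negative.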

Experiments with more dropout layers

Re-run experiment checklist to see how it alters confidence interval.

will it work for multivariate time series?

Great code, thanks.
Could you clarify whether it will work for multivariate time series
1. where all values are continuous, or
2. where the values are a mixture of continuous and categorical,
for example 2 dimensions with continuous values and 3 dimensions with categorical values?

   color   weight  gender  height  age
1  black   56      m       160     34
2  white   77      f       170     54
3  yellow  87      m       167     43
4  white   55      m       198     72
5  white   88      f       176     32

Deploy county new-cases forecasting model to production

As an organization we want to inform public policy makers and residents if their county could be at increased risk for an outbreak. In order to do this we need to have daily updated results on new data and a simple dashboard to display predicted cases along with the CI for a specific county. Targeted completion

#43 Run experiments with enhanced evaluation metrics
#52 Determine whether to use transfer and best hyper-parameters
#53 Select best models
Have epidemiologist verify candidate models
Test models on new data (July 10th+)
Add inference mode to flow-forecast
#61 Create cloud function to partition new cases by county
Create Docker container for serving flow-forecast models and deploy.
Create Airflow DAG that runs model(s) daily and persists results.

Run experiments with meta-data embedding

Perform exploratory analysis of BEAT-19 Data

Before integrating the BEAT-19 data in #45 we need to explore the file structure.

Acceptance Criteria:

Do survey participants have multiple time steps? If so, what is the average number per participant?
What percentage of columns have null values, and what percentage of the time?
Average number of entries per day per county (determine which ZIP codes map to which counties)
Have Serge look over the findings
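The first two questions can be answered with a few pandas aggregations. The sketch below runs on a tiny made-up frame; the column names (`participant_id`, `zip`, `fever`) are hypothetical, since the BEAT-19 file structure is exactly what this issue sets out to explore.

```python
import pandas as pd

# Made-up stand-in for the survey export; real BEAT-19 columns are unknown here.
survey = pd.DataFrame({
    "participant_id": ["a", "a", "b", "c", "c", "c"],
    "date": pd.to_datetime([
        "2020-05-01", "2020-05-02", "2020-05-01",
        "2020-05-01", "2020-05-02", "2020-05-03",
    ]),
    "zip": ["94110", None, "10001", "60601", "60601", None],
    "fever": [1, None, 0, None, None, 1],
})

# Time steps per participant, and the share of participants with more than one.
entries_per_participant = survey.groupby("participant_id").size()
mean_entries = entries_per_participant.mean()
share_multi = (entries_per_participant > 1).mean()

# Percentage of rows that are null, per column.
null_pct = survey.isna().mean() * 100

print(mean_entries)       # 2.0
print(null_pct["fever"])  # 50.0
```

The per-county daily count would additionally need a ZIP-to-county lookup table, which is not sketched here.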

Unit test most recent file function

Insight into transmission dynamics and public policy interventions

Public policy officials and epidemiologists could use information on how specific policies impact new cases. Specifically, we would like to be able to determine the causal impacts of masking and social distancing.

Incorporate interpretability features into flow
Evaluate models with interpretability features #55
Review relevant epidemiology studies based on previous steps

Incorporate masking restrictions/requirements

Investigate GCS permission issues

@aradhanacha has been having trouble running notebooks for experiments which blocks her progress on #43.

Report to select the "best models"

Based on the results of #43 I would like to write a formal report to examine the best models both in terms of test_loss and test_loss on the final week.

Acceptance criteria

Report on Wandb detailing tradeoffs between overall test_loss and test_loss on the final week.
Investigation of which models have the best Sharpe values
Evaluation of our model versus IHME and other CDC models

Bug with loop_through_locations function

/usr/local/lib/python3.6/dist-packages/pandas/core/strings.py in _validate(data)
   2096 
   2097         if inferred_dtype not in allowed_types:
-> 2098             raise AttributeError("Can only use .str accessor with string values!")
   2099         return inferred_dtype
   2100 

AttributeError: Can only use .str accessor with string values!
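This error is raised whenever `.str` is used on a column whose dtype is not string-like, for example integer codes or a column that is entirely NaN. Which column triggers it inside `loop_through_locations` is not shown in the traceback, so the sketch below uses a hypothetical integer `fips` column to reproduce the error and shows one possible fix: casting to `str` first.

```python
import pandas as pd

# Hypothetical column: identifiers read from CSV are often parsed as integers.
df = pd.DataFrame({"fips": [6037, 36061], "cases": [10, 20]})

try:
    df["fips"].str.zfill(5)  # .str on an int64 column
    error_message = None
except AttributeError as err:
    error_message = str(err)

# Casting to str first makes the accessor valid.
# Note that missing values would become the literal string "nan" here,
# so they should be handled (e.g. dropped or filled) before the cast.
padded = df["fips"].astype(str).str.zfill(5)

print(error_message)
print(padded.tolist())  # ['06037', '36061']
```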

Pipeline to persist data on to GCS on a daily basis

Acceptance criteria

Pipeline running in the cloud to persist data to GCS on a daily basis
Files stored in the proper directory
DAG committed to GH

CDC Comparing models

Background: We need an objective way to compare our models to standard models such as IHME, Youyang Gu, etc. This is difficult because our current models forecast new cases on a daily basis for counties, while many of the other models operate at the state level.

Research current models and see whether each one currently forecasts daily at the county level.
For models that do not, research further to see whether their code is publicly available.

Acceptance criteria:

Document detailing models and metrics
Links to code repos in documentation

Better error proceeding in code loops

When looping through wandb sweeps there seems to be an error related to the validation loader.

Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/wandb/wandb_agent.py", line 64, in _start
    function()
  File "<ipython-input-14-43e09d654e98>", line 7, in <lambda>
    wandb.agent(sweep_id, lambda: train_function("PyTorch", make_config_file(file_path2, len(region), weight_path=None)))
  File "/content/github_aistream-peelout_flow-forecast/flood_forecast/trainer.py", line 33, in train_function
    train_transformer_style(trained_model, params["training_params"], params["forward_params"])
  File "/content/github_aistream-peelout_flow-forecast/flood_forecast/pytorch_training.py", line 65, in train_transformer_style
    test = compute_validation(test_data_loader, model.model, epoch, model.params["dataset_params"]["forecast_length"], criterion, model.device, decoder_structure=True, use_wandb=use_wandb, val_or_test="test_loss")
  File "/content/github_aistream-peelout_flow-forecast/flood_forecast/pytorch_training.py", line 108, in compute_validation
    for src, targ in validation_loader:
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 384, in _next_data
    index = self._next_index()  # may raise StopIteration
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 339, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/sampler.py", line 200, in __iter__
    for idx in self.sampler:
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/sampler.py", line 62, in __iter__
    return iter(range(len(self.data_source)))
ValueError: __len__() should return >= 0
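Python raises this `ValueError` when an object's `__len__` returns a negative number. One plausible cause, which the traceback does not confirm, is a region whose series is shorter than the history length plus the forecast length, so the number of sliding windows comes out negative. The classes below are hypothetical stand-ins, not the flow-forecast loaders; they reproduce the error and show a guard that lets the loop skip such regions.

```python
class UnclampedSeries:
    """Toy sliding-window dataset whose length can go negative (hypothetical)."""

    def __init__(self, values, history_length, forecast_length):
        self.values = list(values)
        self.history_length = history_length
        self.forecast_length = forecast_length

    def window_count(self):
        # Number of (history, target) pairs that fit in the series.
        return len(self.values) - self.history_length - self.forecast_length + 1

    def __len__(self):
        return self.window_count()


class WindowedSeries(UnclampedSeries):
    def __len__(self):
        # Clamp to zero so len() never raises on a short series.
        return max(0, self.window_count())


# Reproduce the error: 5 observations cannot fill a 10 + 3 step window.
try:
    len(UnclampedSeries(range(5), history_length=10, forecast_length=3))
    error_message = None
except ValueError as err:
    error_message = str(err)

# Guard for the sweep loop: keep only regions with at least one window.
regions = {"tiny_county": range(5), "large_county": range(60)}
usable = [
    name for name, series in regions.items()
    if len(WindowedSeries(series, history_length=10, forecast_length=3)) > 0
]

print(error_message)  # __len__() should return >= 0
print(usable)         # ['large_county']
```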
