doi-bor / pyforecast

PyForecast is a statistical modeling tool used by Reclamation water managers and reservoir operators to train and build predictive models for seasonal inflows and streamflows. PyForecast allows users to make current water-year forecasts using models developed with the program.

License: Other

Python 94.07% JavaScript 4.23% CSS 0.19% HTML 0.92% Batchfile 0.34% Inno Setup 0.25%

forecasting hydrology machine-learning python statistical-models

pyforecast's People

andywood jslanini avinashtgoje beautah kbssr jdalling dloney jshernandezs kmbase tjrocha kevinfol tschilb252

pyforecast's Issues

No gages on stations map in New Mexico

Brute Force Feature Selection?

Just spit-balling here... Would it be useful to have a Brute Force Feature Selection method incorporated into the software? For relatively simple models (#Predictors = 15), we can just brute-force evaluate all possible combinations of predictors (#Combinations = 32,767) and have the program report the top-performing models. The benefits become more apparent with fewer predictors, since the number of equations to evaluate is (2^n)-1 with n = #Predictors and so grows exponentially.

On a related note, I noticed that the existing Feature Selection algorithm evaluates the same model multiple times, especially if it is not selected/stored in the list of viable regression models. There might be some performance gains from storing just the salient metrics (Predictor IDs & Selected Metric) for every model run in memory and checking this in-memory object first, so the algorithm doesn't evaluate the same model more than once.
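A minimal sketch of both ideas, exhaustive subset evaluation plus an in-memory cache keyed on the predictor IDs. The names `brute_force_select`, `CachedScorer`, and `toy_score` are hypothetical and not part of PyForecast; `toy_score` is only a stand-in for fitting a regression and returning the selected metric.

```python
from itertools import combinations


def brute_force_select(predictor_ids, score_fn, top_n=5):
    """Score every non-empty subset of predictor_ids; return the top_n (score, subset) pairs."""
    results = []
    for size in range(1, len(predictor_ids) + 1):
        for subset in combinations(predictor_ids, size):
            results.append((score_fn(subset), subset))
    # Higher score is assumed to mean a better model.
    results.sort(key=lambda item: item[0], reverse=True)
    return results[:top_n]


class CachedScorer:
    """Wrap a scoring function so each predictor combination is evaluated only once."""

    def __init__(self, score_fn):
        self._score_fn = score_fn
        self._cache = {}
        self.evaluations = 0  # number of real (non-cached) evaluations

    def __call__(self, subset):
        key = frozenset(subset)  # order of predictors does not matter
        if key not in self._cache:
            self._cache[key] = self._score_fn(tuple(sorted(key)))
            self.evaluations += 1
        return self._cache[key]


def toy_score(subset):
    # Hypothetical stand-in for "fit the model and return its skill metric".
    weights = {"swe": 0.5, "precip": 0.3, "flow": 0.15, "temp": 0.01}
    return sum(weights[p] for p in subset) - 0.05 * len(subset)


scorer = CachedScorer(toy_score)
best = brute_force_select(["swe", "precip", "flow", "temp"], scorer, top_n=3)
print(best[0], scorer.evaluations)
```

With 4 predictors the search evaluates 2^4 - 1 = 15 subsets. The same `CachedScorer` could also wrap the metric used by a sequential search, so revisited combinations are looked up rather than refit.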

Data tab improvements

Default the map layer to terrain.
Leave NOAA sites unselected by default; the stations are overwhelming and low priority for inclusion in a forecast.
Dataset table: don't display PyID. Column order should be Name, Parameter, Type, ID. The URL can be hyperlinked from the name.
Add PDSI, and perhaps climate set selection, to the HUC when clicked.
Remove the boxes in the lower right for NRCC, PDSI, PRISM, etc.

Saving Bug

If the software crashes while it is trying to save to an existing *.fcst file, the file is emptied out and all the work contained within it is lost. Propose saving to a *.temp.fcst file first before doing a rename/overwrite of the existing *.fcst file.

This is hard to reproduce, because the software needs to crash while it is doing a save.
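A minimal sketch of the proposed write-then-rename approach, assuming the forecast file contents are already available as bytes. The function name `save_atomically` is hypothetical; how PyForecast actually serializes a *.fcst file is not shown here.

```python
import os
import tempfile


def save_atomically(path, data):
    """Write data (bytes) to a temporary file, then replace the target in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the final rename stays on one volume.
    fd, tmp_path = tempfile.mkstemp(suffix=".temp.fcst", dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        # os.replace overwrites an existing target; the old file stays intact until this call.
        os.replace(tmp_path, path)
    except BaseException:
        # A failed save leaves the existing *.fcst file untouched; clean up the temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the program crashes partway through the write, only the temporary file is affected and the previous *.fcst file is still readable.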

Updating station list does not transfer to Data Tab

To reproduce:
Open an existing file.
Add a custom dataset on the Stations tab.
Go to the "Data" tab.
Click download.
The station is not added to the Data tab.

Datasets don't clear when creating new forecast

Incorporate remotely sensed data as predictor dataset

Incorporate remotely sensed snowpack data. Dependent on @danbroman and low elevation snowmelt S&T project.

Linearity assumption-transformations and residuals

One of the fundamental assumptions of linear regression is... linearity. We need to provide the ability for users to test that the models meet this assumption. A frequent approach is to plot predicted values (x-axis) vs residuals (y-axis). If the assumption of linearity holds, the residuals will scatter randomly around zero with no visible trend or curvature.

Streamflow data are frequently not normally distributed. A common distribution is lognormal, and the log transform is the corresponding transformation in forecasting. I suggest we implement transformations, starting with the log transform. NRCS also uses square and cubic transformations, but I'm not sure as to the reasoning behind these transformations.
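A minimal sketch of the diagnostic described above, using a single predictor and a hand-written least-squares fit so it needs no extra packages. The helper names and the synthetic data are assumptions for illustration only. The (predicted, residual) pairs it returns are what would go on the x and y axes of the plot.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predicted_and_residuals(x, y):
    """Return fitted values and residuals (observed minus predicted)."""
    slope, intercept = fit_line(x, y)
    predicted = [intercept + slope * xi for xi in x]
    residuals = [yi - pi for yi, pi in zip(y, predicted)]
    return predicted, residuals


# Synthetic example: the target grows exponentially with the predictor.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [math.exp(0.5 * xi) for xi in x]

# Fit in the original units: residuals are positive, then negative, then positive.
raw_pred, raw_res = predicted_and_residuals(x, y)

# Fit after a log transform of the target: the curvature in the residuals disappears.
log_pred, log_res = predicted_and_residuals(x, [math.log(yi) for yi in y])

print([round(r, 2) for r in raw_res])
print(max(abs(r) for r in log_res))
```

The curved pattern in the first set of residuals is the kind of structure the plot is meant to reveal. Forecasts from the log-space model would need to be back-transformed with `math.exp` before being reported in physical units.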

Export to excel with two data columns with the same label

Loading precipitation data from both PRISM and NRCC results in the same data label for both columns, which causes issues on Excel export.
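One possible fix is to make the labels unique before export by appending the data source whenever two columns share a label. This is a sketch under that assumption; `disambiguate_labels` and the (label, source) tuple layout are hypothetical and not PyForecast's data structures.

```python
from collections import Counter


def disambiguate_labels(datasets):
    """Given (label, source) pairs, return one unique column label per dataset."""
    label_counts = Counter(label for label, _ in datasets)
    seen = Counter()
    unique_labels = []
    for label, source in datasets:
        # Only touch labels that actually collide.
        candidate = f"{label} ({source})" if label_counts[label] > 1 else label
        seen[candidate] += 1
        if seen[candidate] > 1:
            # Same label and same source: fall back to a running number.
            candidate = f"{candidate} #{seen[candidate]}"
        unique_labels.append(candidate)
    return unique_labels


columns = [("Precipitation", "PRISM"), ("Precipitation", "NRCC"), ("SWE", "NRCS")]
print(disambiguate_labels(columns))
```

Labels that are already unique pass through unchanged, so existing exports would keep their column names.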

Do we need a liability disclaimer?

Should we add a disclaimer statement to the ReadMe/Wiki?

Delete pn-development branch

Since we're using compiled releases now, I don't see any harm in reducing the number of branches to just the master branch. I just merged the pn-development branch into master BTW.

Add NOAA NCDC Data Source

Add a new data source to the stations tab/map. Build code to bring in NOAA precip, temp, and snow data from this data source.

API: https://www.ncei.noaa.gov/support/access-data-service-api-user-documentation#
Station Map: https://gis.ncdc.noaa.gov/maps/ncei/summaries/daily

6/26/19 - E-mailed data source to ask for a station inventory and data type library for our datasets of interest.

Metric Units - Please add this functionality

Currently parameter units are limited to imperial units. Would it be possible to develop the capacity to use metric units for input parameters and datasets? Your northern neighbours would be very grateful to see this added functionality. Let me know if we can assist in any way :)
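As a rough sketch of what a conversion layer could look like, the table below maps a few imperial units to metric ones. The unit labels ("cfs", "in", "af", "degF") and the function names are assumptions for illustration; the units PyForecast actually stores may be labeled differently.

```python
def fahrenheit_to_celsius(value):
    return (value - 32.0) * 5.0 / 9.0


# Imperial unit -> (metric unit, conversion function).
CONVERSIONS = {
    "cfs": ("m3/s", lambda v: v * 0.028316846592),    # cubic feet per second
    "in": ("mm", lambda v: v * 25.4),                 # inches
    "af": ("m3", lambda v: v * 1233.48183754752),     # acre-feet
    "degF": ("degC", fahrenheit_to_celsius),
}


def to_metric(value, unit):
    """Convert a value in a known imperial unit; return (converted value, metric unit)."""
    if unit not in CONVERSIONS:
        raise ValueError(f"No metric conversion defined for unit {unit!r}")
    metric_unit, convert = CONVERSIONS[unit]
    return convert(value), metric_unit


print(to_metric(1000.0, "cfs"))
print(to_metric(212.0, "degF"))
```

Because every conversion here is linear (or affine, for temperature), a fitted regression could in principle be reported in either unit system by converting inputs and outputs at the boundaries.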

Move "Summary" Tab after regression tab

This seems to be a better workflow for the software.

Incorporate seasonal to daily disaggregation scheme

@danbroman is creating a temporal disaggregation scheme using a knn approach for risk-based S&T project. When complete, this should be incorporated to allow users to directly generate daily timestep forecasts suitable for reservoir operations modeling (i.e., RiverWare ops models.)

Forecast Options Tab - PN Refinements

Allow users to drag-and-drop top-level predictor names from the All Available Predictors tree into the PredictorPool container under the adjacent Equation Pools tree. The code should filter through the subsetted predictors and only add predictors that are in the past relative to the equation. (Ex: dropping SNOTEL site X into the January 1st equation should only add subsetted predictors that come before January and/or are aggregated from Oct-01 to Dec-31.)
Catch the error raised when Apply Options is pressed while the dataset dictionary is empty.
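The "in the past relative to the equation" filter from the first item could look something like the sketch below. The dictionary layout with a `period_end` date is a hypothetical representation of a subsetted predictor, not PyForecast's actual structure.

```python
from datetime import date


def predictors_available_before(predictors, issue_date):
    """Keep only predictors whose data period ends before the equation's issue date."""
    return [p for p in predictors if p["period_end"] < issue_date]


# Hypothetical subsetted predictors for one SNOTEL site.
site_x = [
    {"name": "Site X precip, Oct-01 to Dec-31", "period_end": date(2019, 12, 31)},
    {"name": "Site X SWE, Jan-15", "period_end": date(2020, 1, 15)},
    {"name": "Site X SWE, Feb-01", "period_end": date(2020, 2, 1)},
]

january_pool = predictors_available_before(site_x, date(2020, 1, 1))
print([p["name"] for p in january_pool])
```

Dropping the site into a later equation, for example February 1st, would simply call the same filter with a later issue date.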

Expert Data Import Mode

Allow the Import feature on the Data tab to import data arrays instead of a single series at a time.

PN needs to import custom data arrays that do not fit the current data acquisition scheme in PyForecast (monthly data, custom indices, etc.). An Excel (sigh, I know...) data pre-processor that will generate daily data arrays for PyForecast is being developed in parallel. PyForecast needs to accept the output array from this tool.

I'm thinking we define the inputs that the current Import feature needs (Dataset Name, Parameter Name, Units, and Resampling) as headers in the input data array, and have the code loop through the columns and add each series in the array.
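A minimal sketch of that column loop, assuming the array arrives as CSV text with four header rows (one per required input) followed by dated rows. The exact layout, the function name `parse_data_array`, and the sample values are assumptions; the pre-processor's real output format is not defined yet.

```python
import csv
import io

HEADER_FIELDS = ("Dataset Name", "Parameter Name", "Units", "Resampling")


def parse_data_array(text):
    """Split a multi-column data array into one dataset (metadata + series) per column."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    header_rows = rows[:len(HEADER_FIELDS)]
    data_rows = rows[len(HEADER_FIELDS):]
    for expected, row in zip(HEADER_FIELDS, header_rows):
        if row[0] != expected:
            raise ValueError(f"Expected header row {expected!r}, found {row[0]!r}")
    datasets = []
    # Column 0 holds the header names and the dates; every other column is a series.
    for col in range(1, len(header_rows[0])):
        dataset = {field: header_rows[i][col] for i, field in enumerate(HEADER_FIELDS)}
        series = {}
        for row in data_rows:
            cell = row[col].strip()
            series[row[0]] = float(cell) if cell else None  # blank cell = missing value
        dataset["data"] = series
        datasets.append(dataset)
    return datasets


sample = """Dataset Name,Site A Flow,Custom Index
Parameter Name,Streamflow,Index
Units,cfs,unitless
Resampling,Mean,Sample
2019-10-01,120.5,0.3
2019-10-02,,0.4
"""

for entry in parse_data_array(sample):
    print(entry["Dataset Name"], entry["Units"], entry["data"])
```

Each returned dictionary carries the same four inputs the single-series Import feature asks for, so the existing import path could be called once per column.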

Propose improving "Summary" tab

Summary tab right now is out of order from the workflow and can be improved.

Move tab after "Regression Tab."
Forecast Equations should automatically calculate a forecast when added to summary tab.
Physical units should be added to plots.
Can we make the Forecast ID editable so forecasters can use the default or make their own meaningful forecast name?
Rounding of coefficients and forecasted values.

Propose refactoring of the GUI code

Should we try to refactor the GUI code in PyForecast_GUI.py? We could give each tab its own code file, split by GUI element (summaryTabGui.py, dataTabGui.py, etc.). This would also let us split up the application.py code file based on which GUI element each part supports (summaryTabCode.py, dataTabCode.py, etc.).

Doing this should allow us to more easily trace which functions/methods support which GUI operation and perhaps allow us to identify places where we can share/optimize functions/methods.

Thoughts @kevinfol

Add data imputation capability

Allow users to fill in missing data in the input data arrays. Propose:

using the data imputation methods in the statsmodels package
putting this feature in the DataAnalysis GUI for plotting and diagnostics

Delete datasets on "Stations" tab doesn't work

I deleted a dataset, but it does not go away. After refreshing the data, the deleted dataset still appears in the "Data" tab.

Add Arctic Oscillation to datasets

https://www.cpc.ncep.noaa.gov/products/precip/CWlink/daily_ao_index/ao.shtml

Correlation matrix crashes

Data Tab->Data Analysis->Data Analysis Options->Show Correlation Matrix crashes. Email me if you need a sample fcst file to crash it with.

Forced Predictors defined at the Equation Pool level

@kevinfol - Looking for feedback on how you and/or your users feel about PN handling forced predictors. We can either (Option-1) make another ForcedPredictor list at the EquationPool level to store the PredictorIDs, so users can see which predictors are 'forced' into the algorithm, or (Option-2) make this a little more hidden by storing a parallel array under the existing PredictorPool list to track which PredictorIDs should be forced at feature-selection time.

On the GUI, Option-1 would show a separate list of ForcedPredictors, while Option-2 would let users right-click on predictors and set them to forced.

Consecutive regression runs crashes software

Fix issues when running consecutive regression runs. The first regression run works fine, but subsequent runs with the exact same settings get slower and eventually crash the program. This might be an issue with how arrays/objects are initialized during each regression run.

The issue can be reproduced by running any regression (MLR, PCA, ZSCORE) with any feature selection algorithm (SFFS, SFBS, BruteForce) several times, one right after the other. It is most evident with the BruteForce selection algorithm.

License Decision

Decide on adopting a license for the software. I'm assuming that this software was initially developed with government funds under government time so some kind of Open Source license might be most appropriate. Thoughts @kevinfol ?

https://opensource.org/licenses

Implement LASSO Regression

LASSO regression is a useful technique for building sparse models and aiding in variable selection. Suggest we implement it as an alternative to the existing methods.
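To illustrate how LASSO produces sparse models, here is a small hand-written coordinate-descent sketch that minimizes (1/(2n)) * sum of squared errors + alpha * sum of |coefficients|. It is not PyForecast code and the synthetic data are made up; a real implementation would more likely rely on an established library.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values inside the band become exactly zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Fit LASSO by cyclic coordinate descent; returns (intercept, coefficients)."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Center the data so the intercept can be recovered afterwards.
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    residual = [yi - y_mean for yi in y]
    scale = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    coefs = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if scale[j] == 0.0:
                continue  # constant column carries no information
            # Correlation of column j with the residual that excludes its own contribution.
            rho = sum(Xc[i][j] * (residual[i] + Xc[i][j] * coefs[j]) for i in range(n)) / n
            updated = soft_threshold(rho, alpha) / scale[j]
            delta = updated - coefs[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= Xc[i][j] * delta
                coefs[j] = updated
    intercept = y_mean - sum(c * m for c, m in zip(coefs, col_means))
    return intercept, coefs


# Synthetic data: the target depends strongly on x1, weakly on x2, and not at all on x3.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1, -1, 1, -1, 1, -1]
x3 = [2, 1, -1, -1, 1, 2]
X = [list(row) for row in zip(x1, x2, x3)]
y = [2 + 3 * a + 0.05 * b for a, b in zip(x1, x2)]

print(lasso_coordinate_descent(X, y, alpha=0.5))
```

With this penalty the weak and irrelevant predictors receive coefficients of exactly zero while the strong one is only slightly shrunk, which is the variable-selection behavior described above.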

Develop unit tests for core functionality

Unit tests for the following (at a minimum) should be developed:

read/write of *.fcst files
web requests for data (USGS, NRCS, GP Hydromet, PN Hydromet)
regression outputs (MLR, PCA) - needs development of a good testing dataset
validation of JSON files for the mapping interface

Backward model selection issues

Creates models with a ton of variables. Crashes when models are selected.

Stations Map: add select all by HUC

Add the ability to add all stations within a HUC from the HUC popup.

Model skill metrics

I suggest replacing Nash-Sutcliffe efficiency with Mallows's Cp as a metric of skill. Nash-Sutcliffe should be identical to r-squared in the case of linear regression (for a least-squares fit with an intercept, evaluated on the data used to fit it). Also, all of our metrics are based on squared error. This heavily penalizes large errors in outlier years and skews model selection toward models that fit really big years.

Also, would it be useful to include mean absolute error? It is similar to RMSE, but it weights each error in proportion to its size instead of squaring it. If a forecaster is not worried about the outliers or high water years, it might be nice to select based on non-squared error. For example, say we are in a drought year and snowpack is at record lows. We would want to avoid favoring models that fit the really high water years.
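For reference, a small sketch of the metrics under discussion, written with the standard library only. The function names and the sample numbers are made up for illustration. `mallows_cp` uses the usual form SSE_p / s^2 - n + 2p, where s^2 is the mean squared error of the full model and p counts the fitted parameters including the intercept.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error: large misses dominate because errors are squared."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error: each miss counts in proportion to its size."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def nash_sutcliffe(observed, predicted):
    """1 - SSE / SST, where SST is measured around the mean of the observations."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


def mallows_cp(sse_subset, mse_full, n_obs, n_params):
    """Mallows's Cp for a subset model with n_params parameters (intercept included)."""
    return sse_subset / mse_full - n_obs + 2 * n_params


# Three small misses and one large miss on a high-flow year.
observed = [10.0, 12.0, 9.0, 30.0]
predicted = [11.0, 11.0, 10.0, 22.0]

print(rmse(observed, predicted), mae(observed, predicted))
print(nash_sutcliffe(observed, predicted))
```

In this example the single large miss pulls RMSE well above MAE, which is the sensitivity to high water years raised above.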

Test Different Regression Schemes in StatsModels Package

Would be nice to test if better models can be generated using the regression schemes available in the StatsModels Python package.

Linear Regression - https://www.statsmodels.org/stable/examples/index.html#regression
Generalized Linear - https://www.statsmodels.org/stable/examples/index.html#glm
Discrete Choice - https://www.statsmodels.org/stable/examples/index.html#discrete

Develop distribution pipeline

Develop distribution mechanism for compiling required code and dependencies (cx_Freeze), and for building an installer (InnoSetup).

Data tab improvements

POR selection: the years box should be aligned directly below the years check box, rather than with "POR".
Should Fill NaN's be the default, or do we want users to make a conscious decision to fill missing data?
Does Update Data do anything different from Import Data? If not, remove it.
Import dataset: why is the CSV/Excel import in a different location than the web data import? All data import should be in the same location.

Text does not scale properly when extending display to external monitor

I connected via HDMI cable to a large external monitor for an impromptu demo yesterday. The text does not scale up and is not readable on the external monitor.

NRCS SOAP Service Error

Instantiating a connection to the NRCS service via NRCS = Client('https://www.wcc.nrcs.usda.gov/awdbWebService/services?WSDL') now fails. Error shown below. E-mailed NRCS to inquire about the error. This impacts both existing PyForecast installations and NextFlow -- something probably changed on the NRCS side...

Files:
..\PyForecast\Resources\DataLoaders\Default\NRCS_WCC.py
..\NextFlow\resources\DataLoaders\NRCS_WCC.py

Error:
HTTPSConnectionPool(host='wcc.sc.egov.usda.gov', port=443): Max retries exceeded with url: /awdbWebService/services?WSDL (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x3681E0D0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it',))

Perform hindcasting experiments

NextFlow Software Hindcasting Proposal rev.docx