copro's Issues

make rasterstats function more generic

Find a way to make the stat_func argument in the module 'get_var_from_nc' settable by the user, for instance via the cfg-file. It can in principle be changed, but there is currently no handle to access it.
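
A rough sketch of what this could look like, assuming the statistic name is read from a hypothetical section/option in the cfg-file and then passed on to rasterstats:

    from rasterstats import zonal_stats

    def zonal_stat_from_cfg(gdf, raster_file, config):
        # 'config' is a configparser object; fall back to 'mean' if nothing is set
        stat_func = config.get('data', 'stat_func', fallback='mean')
        stats = zonal_stats(gdf, raster_file, stats=stat_func)
        return [item[stat_func] for item in stats]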

relative paths

Make the input and output paths in the settings file relative so they don't need to be updated per user.
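
A minimal sketch of one way to do this, resolving relative paths against the location of the settings file (the function name is made up):

    import os

    def resolve_path(cfg_file, rel_path):
        # interpret paths in the settings file relative to the cfg-file location
        cfg_dir = os.path.dirname(os.path.abspath(cfg_file))
        return os.path.normpath(os.path.join(cfg_dir, rel_path))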

reproducibility --- JOSS

Generally, to make model outputs for random forests etc. completely reproducible, you need to set the same seed. I would recommend adding a default seed and letting people pick a particular seed.
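
A minimal sketch of the suggested behaviour, assuming a hypothetical seed option in the cfg-file:

    from sklearn.ensemble import RandomForestClassifier

    DEFAULT_SEED = 42  # arbitrary default so results are reproducible out of the box

    def build_classifier(config):
        # the user can override the default via a (hypothetical) seed option
        seed = config.getint('machine_learning', 'seed', fallback=DEFAULT_SEED)
        return RandomForestClassifier(random_state=seed)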

indicate plots better

indicate better in the plots of the leave-one-out validation which var was left out, or whether all data was used (add to plot title)

add tests

for JOSS and Travis-CI, we need some test functions
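
A placeholder example of what a first pytest-style test could look like; real tests would of course import and exercise copro functions instead of this stand-in:

    # tests/test_labels.py
    import numpy as np

    def test_conflict_labels_are_binary():
        y = np.array([0, 1, 1, 0])
        # conflict labels must be strictly 0/1 for the classifiers used
        assert set(np.unique(y)) <= {0, 1}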

more click scripts via setuptools

make all scripts executable from command line, maybe even with click groups.

for example, have something like 'copro download_example', 'copro run' etc.
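
A rough sketch of such a click group (command and module names are illustrative, not the current API); the group could then be exposed via a console_scripts entry point in setup.py, e.g. 'copro = copro.scripts:cli':

    import click

    @click.group()
    def cli():
        """Command-line interface for copro."""

    @cli.command(name='run')
    @click.argument('settings_file', type=click.Path(exists=True))
    def run(settings_file):
        """Run the model with the given settings file."""
        click.echo(f'running copro with {settings_file}')

    if __name__ == '__main__':
        cli()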

software installation/download --- JOSS

  1. I am surprised that the package is 200 MB. It is unacceptably large for a package that doesn't technically ship any models but only the code for how to estimate and validate. I would recommend finding ways to trim this.

  2. python setup.py develop doesn't work on Windows

Traceback (most recent call last):
  File "setup.py", line 5, in <module>
    from setuptools import setup, find_packages
ImportError: cannot import name 'setup'

https://stackoverflow.com/questions/32380587/importerror-cannot-import-name-setup

make sampling of data for ML model more generic

Currently, we first create a big dataframe with one new column per year per step. For the ML model, however, we do not need them in separate columns, but in one column per variable with entries for all years. This can perfectly be done outside of the main dataframe containing the geometry information, as the geometry information can be dropped for the ML model.

Thus, update all functions to return only one column with the data input needed for ML, and make this all more generic too, i.e. less dependent on column names.
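
A sketch (with made-up column names) of collapsing the per-year columns into one long column per variable, dropping the geometry along the way:

    import pandas as pd

    def to_long_format(gdf, var_name):
        # columns are assumed to be named like 'evaporation_2000', 'evaporation_2001', ...
        year_cols = [c for c in gdf.columns if c.startswith(var_name + '_')]
        df = gdf[year_cols].copy()              # geometry is intentionally left behind
        df['poly_id'] = gdf.index
        long_df = df.melt(id_vars='poly_id', var_name='year', value_name=var_name)
        long_df['year'] = long_df['year'].str.rsplit('_', n=1).str[-1].astype(int)
        return long_df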

only sub-set of conflicts if boolean operation is applied

With the current set-up, only those polygons where conflict took place remain in the dataframe. Polygons without conflict don't show up anymore, thus yielding a scattered figure of polygons. This continues in subsequent functions, e.g. the zonal statistics of the nc-file.

It would be better to keep all polygons in the dataframe, including those which are assigned a 0 in the boolean operation.
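
A possible sketch using a spatial join, so that every polygon is kept and simply flagged 0/1 (the 'predicate' keyword is called 'op' in older geopandas versions):

    import geopandas as gpd

    def flag_conflict(polygons, conflicts):
        # the left join keeps all polygons, also those without any conflict point
        joined = gpd.sjoin(polygons, conflicts, how='left', predicate='intersects')
        hits = joined.groupby(joined.index)['index_right'].count() > 0
        out = polygons.copy()
        out['conflict_bool'] = hits.astype(int)
        return out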

use pickle or similar to load pre-computed XY

Right now, a lot of waiting time is needed to produce the XY array. If the overall settings do not change, however, this array should not change between runs.

It would save a lot of time if we could import pre-computed XY data and start from there.

Also, this could be useful for demonstration purposes where time is limited and no one wants to wait for the looping through years and input data.
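
A minimal caching sketch with pickle (the file name, cfg option and build_XY helper are placeholders):

    import os
    import pickle

    def get_XY(config, build_XY):
        cache_file = config.get('general', 'XY_cache', fallback='XY.pkl')
        if os.path.isfile(cache_file):
            with open(cache_file, 'rb') as f:
                return pickle.load(f)          # skip the expensive looping entirely
        XY = build_XY(config)                  # the slow loop over years and input data
        with open(cache_file, 'wb') as f:
            pickle.dump(XY, f)
        return XY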

model via pip

would also be cool to get the model via pip or even conda

define output files

what should be output:

  • gdf with all data points
  • gdf with data per polygon
  • evaluation dict
  • ROC curve
  • ?

update analysis

  • remove the scatter plot of 'total_hits' and 'average_hit' because it's a no-brainer... clearly, the more correct predictions you have, the higher the chance that the fraction of correct predictions is high too.
  • classify the polygons along two axes, 'average_hit' and 'nr_of_test_confl', and categorize polygons into 4 groups: high accuracy and many conflict samples, high accuracy and few conflict samples, low accuracy and many conflict samples, and low accuracy and few conflict samples. This could bring insights into where the model results are more robust than elsewhere (see the sketch below)...
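
A rough sketch of the 2x2 classification in the second bullet; the 0.5 accuracy threshold and the median split on sample count are arbitrary choices for illustration:

    import numpy as np

    def classify_polygons(df):
        # df is assumed to have the columns 'average_hit' and 'nr_of_test_confl'
        many = df['nr_of_test_confl'] >= df['nr_of_test_confl'].median()
        good = df['average_hit'] >= 0.5
        conditions = [good & many, good & ~many, ~good & many, ~good & ~many]
        labels = ['high acc / many samples', 'high acc / few samples',
                  'low acc / many samples', 'low acc / few samples']
        out = df.copy()
        out['category'] = np.select(conditions, labels)
        return out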

assess model quality re. conf

We know after n model repetitions how many predictions were made per polygon and how often the model prediction was correct, both for conflict and non-conflict together as well as for conflict only.

This gives a good overall impression of model performance.

However, how do we show which polygons are now predicted to be 'at risk'? Because with multiple predictions made per polygon, not every prediction is only 'conflict' or only 'non-conflict', so we will have to deal with a melange of predictions. What are good means to visualize this per polygon?
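
One possible summary is the fraction of repetitions in which 'conflict' was predicted for a polygon, which could then be joined back to the polygon geometries and plotted as a choropleth (column names are illustrative):

    def conflict_fraction(predictions):
        # predictions: a dataframe with one row per (polygon, repetition) and a 0/1 column 'y_pred'
        frac = predictions.groupby('poly_id')['y_pred'].mean()
        return frac.rename('fraction_predicted_conflict')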

gamma and C values in SVC

test sensitivity to higher C and gamma values in SVC; also test the poly kernel and assess sensitivity to the degree value
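
A hedged sketch of such a sensitivity test with scikit-learn's GridSearchCV (the parameter ranges are just examples):

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    param_grid = [
        {'kernel': ['rbf'], 'C': [1, 10, 100], 'gamma': ['scale', 0.1, 1]},
        {'kernel': ['poly'], 'C': [1, 10, 100], 'degree': [2, 3, 4]},
    ]
    search = GridSearchCV(SVC(), param_grid, cv=5, scoring='roc_auc')
    # search.fit(X_train, y_train); search.cv_results_ then shows the sensitivity per setting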

add extra aggregation level

To compute the probability of detection and false alarm over the entire time series, add an aggregation level (to be provided as a shp-file) to the model. All data points are then aggregated to this level before performing the computations.
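
A rough sketch of the aggregation step with geopandas ('predicate' is 'op' in older versions; names are illustrative):

    import geopandas as gpd

    def aggregate_to_level(points, aggregation_shp, value_col):
        regions = gpd.read_file(aggregation_shp)
        joined = gpd.sjoin(points, regions, how='left', predicate='within')
        # one aggregated value per region of the extra aggregation level
        return joined.groupby('index_right')[value_col].mean()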

automatically load and loop through vars

Instead of having a function call per input variable in the script, it would be more efficient to have a loop over all variables listed in the config-file. The user could then just specify an arbitrary number of variables there, with file paths, and the model would go through all of them.
https://stackoverflow.com/questions/22068050/iterate-over-sections-in-a-config-file

This would also need to include a detection of how the time variable in the nc-file is defined, to call the right function per input file.
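
Following the stackoverflow suggestion, a minimal sketch looping over the entries of a hypothetical [data] section:

    from configparser import RawConfigParser

    config = RawConfigParser()
    config.read('settings.cfg')

    for var_name, nc_file in config.items('data'):
        # here the matching zonal-statistics call would be made per variable,
        # after inspecting how the time variable in nc_file is defined
        print(f'processing {var_name} from {nc_file}')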

add conflict at t-1 as sample data

The history of conflict is important. If there was conflict in the previous year, it is more likely that conflict will occur in the current year as well.
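
A sketch of adding last year's conflict as an extra feature, assuming a long dataframe with one row per polygon per year (column names are illustrative):

    def add_conflict_t_minus_1(df):
        df = df.sort_values(['poly_id', 'year'])
        # value of the conflict boolean in the previous year; NaN for the first year
        df['conflict_t-1'] = df.groupby('poly_id')['conflict_bool'].shift(1)
        return df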

execute model n times

Depending on which data points are selected for the training and test samples (this happens randomly!), the model output differs.

To account for this, the model would have to be run n times (e.g. 1000) and the outputs averaged to get a solid result.
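
A minimal sketch of n repetitions with a fresh random train/test split each time, averaging one score over all runs (classifier and score are placeholders):

    import numpy as np
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    def run_n_times(clf, X, Y, n=1000):
        scores = []
        for _ in range(n):
            X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)
            clf.fit(X_train, y_train)
            scores.append(roc_auc_score(y_test, clf.predict(X_test)))
        return np.mean(scores)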

fix output paths

The output folders with sub-sub-sub directories are leftovers from a past model structure.

Remove them and just use one global output directory, namely the one specified in the cfg-file.

add post-processing

To make visualization etc. more straightforward, add a script to facilitate post-processing.

assess relative importance/sensitivity of individual variables

How can I see what the relative influence of an input variable is in predicting our target var? Is it possible to somehow get a list of the variables ordered by relative importance? Or could we simply test relative importance by 'leaving one out', i.e. performing n runs (n = number of vars) with n-1 vars used in each run?
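
Scikit-learn exposes exactly such an ordered list for tree-based models; a sketch (the LOO alternative would instead re-train the model n times, each time with one variable removed):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    def ranked_importances(X, Y, feature_names):
        clf = RandomForestClassifier().fit(X, Y)
        imp = pd.Series(clf.feature_importances_, index=feature_names)
        return imp.sort_values(ascending=False)   # most important variable first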

joss/editing

"The main goal of this model is to apply machine learning techniques to make projections of future areas at risk." -> expand to convey what risk means here, how it is measured, etc.

I couldn't easily find how the model is validated, any results we may see, etc. I would recommend explicitly linking a bunch of that via the readme.rst.

separate analysis for conflict predictions

Right now, we analyse model accuracy for predicted 0s and 1s together.

Since predicting 1s is much harder, it would be interesting to see how good the model is if we only select those entries in y_test that contain a 1 and compare them with y_pred/y_score.
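
A sketch of such a conflict-only evaluation, which boils down to the recall of the conflict class:

    import numpy as np
    from sklearn.metrics import recall_score

    def conflict_only_hit_rate(y_test, y_pred):
        y_test, y_pred = np.asarray(y_test), np.asarray(y_pred)
        hit_rate = np.mean(y_pred[y_test == 1] == 1)   # correct predictions among true 1s
        assert np.isclose(hit_rate, recall_score(y_test, y_pred))
        return hit_rate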

support non-selection of climate zones

The current set-up requires a selection of conflicts and water provinces for one or more Koeppen-Geiger climate zones.

It may be interesting to turn off this selection.

Thus, add code to make this work, e.g. by allowing 'None' in the climate section of the cfg-file.

dubbelsteenmodel

Run the model with Y containing only 0s, or with Y containing a randomly distributed share of 1s (equal to the % of conflict in all data). Do not change the selection criteria for conflicts in that case!

consider all polygons, and what to do if one polygon appears more than once in y_test

due to the random sampling of data points from the X and Y arrays, it can happen that not all polygons are represented in the test sample X_test/y_test. This is not so much a problem if we look only at aggregate evaluation criteria like ROC value and such, but if we plot the polygons, some may stay empty. That's not good.

Once #61 is solved, it should be ensured that all polygons are represented.

Besides, even with n=1 executions of the model, one polygon can appear multiple times in X_test/y_test. Each time, the prediction can be right or wrong, and it will most likely not be consistently right or wrong but change each time. It is therefore necessary to create an overall value based on the (average?) accuracy of the prediction per polygon. Or something else - think!
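
A sketch of such a per-polygon average, collapsing repeated test appearances into one value (column names are illustrative):

    def average_hit_per_polygon(results):
        # results: a dataframe with one row per test sample and columns 'poly_id', 'y_test', 'y_pred'
        results = results.copy()
        results['correct'] = (results['y_test'] == results['y_pred']).astype(int)
        return results.groupby('poly_id')['correct'].mean().rename('average_hit')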

single variable model

after having checked how good the model(s) would perform if we only took randomly sampled Ys, how good would the model(s) perform if we used only one variable to predict?

Is it possible to identify one (or more) variables that are really key for predicting Y? This would help to determine, in combination with the LOO analysis, which vars are really driving conflict in our model(s) and which are not.

apply model per country

In the cfg-file, the conflicts can also be filtered for an individual country. This reduces the number of conflicts drastically.

However, the number of water provinces is not reduced to the given country, thus introducing an immense imbalance in the model.

As a solution, remove the country option from the cfg-file.

The model can then be run for an individual country by simply providing a shp-file for this given country. Conflicts are then clipped with geopandas to the extent of the shp-file.
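
A sketch of the proposed per-country clipping with geopandas.clip (available from geopandas 0.7 onwards):

    import geopandas as gpd

    def clip_conflicts_to_country(conflicts_gdf, country_shp):
        country = gpd.read_file(country_shp)
        # keep only conflict points that fall within the country geometry
        return gpd.clip(conflicts_gdf, country)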

support more conflict types

Right now, it is only possible to specify one type of conflict in the cfg-file. Change this such that multiple values can be provided and the selection procedure is adapted accordingly.
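
A sketch of reading the conflict types as a comma-separated list from the cfg-file and selecting accordingly (option and column names are illustrative):

    def select_conflict_types(conflicts_df, config):
        raw = config.get('conflict', 'conflict_types', fallback='')
        types = [int(t) for t in raw.split(',') if t.strip()]
        if not types:
            return conflicts_df                      # no filtering if nothing is specified
        return conflicts_df[conflicts_df['conflict_type'].isin(types)]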
