jannishoch / copro
Machine learning (ML) model for computing conflict risk from climate, environmental, and societal drivers.
Home Page: https://copro.readthedocs.io/en/latest/
License: MIT License
Find a way to make the stat_func argument in the module 'get_var_from_nc' settable by the user, via the cfg-file for instance. It can in principle be changed now, but there is no handle to access it.
Make the input and output paths in the settings file relative so they don't need to be updated per user.
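One way to sketch this: resolve every path in the settings file against the directory the cfg-file itself lives in, so the file can be shared between users unchanged. resolve_path is a hypothetical helper, not part of copro:

```python
import os

def resolve_path(cfg_path: str, value: str) -> str:
    """Interpret a settings-file path relative to the cfg-file's directory."""
    if os.path.isabs(value):
        return value  # absolute paths are kept as-is
    return os.path.normpath(os.path.join(os.path.dirname(cfg_path), value))
```

With this, a user can write './input/data.nc' in the cfg-file regardless of where the repository is checked out.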
generally, to make model outputs for random forests etc. completely reproducible, you need to set the same seed. I would recommend adding a default seed plus letting people pick a particular seed.
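A minimal sketch of a default seed plus a user override via the cfg-file; the [general]/seed names and the DEFAULT_SEED value are assumptions, not copro's actual schema:

```python
import configparser

DEFAULT_SEED = 42  # hypothetical default; any fixed value works

def get_seed(cfg: configparser.ConfigParser) -> int:
    """Read a user-defined seed from the cfg-file, falling back to a default."""
    # 'general'/'seed' are assumed section/option names
    return cfg.getint('general', 'seed', fallback=DEFAULT_SEED)

cfg = configparser.ConfigParser()
cfg.read_string("[general]\nseed = 7\n")
seed = get_seed(cfg)
# pass `seed` as random_state to train_test_split and the classifier,
# e.g. RandomForestClassifier(random_state=seed), so runs are reproducible
```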
indicate better in plots of leave-one-out validation which var was left out, or whether all data was used (add to plot title)
check random forest classifier as extra ML model, also with GridSearchCV (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
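A minimal sketch of a RandomForestClassifier tuned with GridSearchCV; the synthetic data and the parameter grid are illustrative only, not tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the real X/Y arrays
X, y = make_classification(n_samples=120, n_features=5, random_state=42)

param_grid = {            # small illustrative grid
    'n_estimators': [50, 100],
    'max_depth': [None, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring='roc_auc',
)
search.fit(X, y)
best_rf = search.best_estimator_  # refit on all data with the best parameters
```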
for JOSS and Travis-CI, we need some test functions
make all scripts executable from command line, maybe even with click groups.
for example, have something like 'copro download_example', 'copro run' etc.
i am surprised that the package is 200 MB. It is unacceptably large for a package that doesn't technically ship any models but ships only the code for how to estimate and validate. i would recommend finding ways to trim this.
python setup.py develop
doesn't work on Windows
Traceback (most recent call last):
File "setup.py", line 5, in <module>
from setuptools import setup, find_packages
ImportError: cannot import name 'setup'
https://stackoverflow.com/questions/32380587/importerror-cannot-import-name-setup
Currently, we create a big dataframe first with one new column per year per step. For the ML model, however, we do not need them in separate columns, but in one column per variable with entries for all years. This can perfectly be done outside of the main dataframe containing geometry information, as the geometry information can be dropped for the ML model.
Thus, update all functions to return only one column with the data input needed for ML, and make this all more generic too, i.e. less dependent on column names.
try to download the conflict data from the web. check https://stackoverflow.com/questions/57748687/downloading-files-in-jupyter-wget-on-windows and https://stackoverflow.com/questions/7243750/download-file-from-web-in-python-3.
this is required to make sure the environment is correctly installed and the code runs on different platforms than my laptop.
with the current set-up, only those polygons where conflict took place remain in the dataframe. polygons without conflict don't show up anymore, thus yielding a scattered figure of polygons. this continues in subsequent functions, i.e. zonal statistics of nc-file.
would be better to keep all polygons in the dataframe, also those which are assigned a 0 in the boolean operation.
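One way to keep all polygons, sketched with a pandas left-merge (the column names are hypothetical): polygons without a matching conflict record get a 0 instead of dropping out of the dataframe:

```python
import pandas as pd

# all polygons vs. only the polygons where conflict occurred
polygons = pd.DataFrame({'watprov_id': [1, 2, 3, 4]})
conflicts = pd.DataFrame({'watprov_id': [2, 4], 'conflict': [1, 1]})

# left-merge keeps every polygon; missing conflict entries become 0
merged = polygons.merge(conflicts, on='watprov_id', how='left')
merged['conflict'] = merged['conflict'].fillna(0).astype(int)
```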
add more data, from IMAGE model
right now, a lot of waiting time is needed to produce the XY array. however, if the overall settings do not change, this array should not change between runs.
it would save a lot of time if we could import a pre-computed XY array and start from there.
also, this could be useful for demonstration purposes where time is limited and no one wants to wait for the looping through years and input data.
to assess the accuracy of the predictions, it is necessary to convert back the boolean information to the corresponding geographical unit (now: water province).
would also be cool to get the model via pip or even conda
for long-term development, it would be useful to make more use of object-based programming.
follow JOSS guidelines
what should be output:
for demonstration and teaching purposes, it would be great to be able to run the notebook in the cloud. check how this can be done.
add data to CodeOcean for docker-like environment and execution
anyone @wande001 @Sophiepieternel ?
we know after n model repetitions how many predictions were made per polygon, how often the model prediction was correct, for both conflict and non-conflict as well as for conflict only.
this gives an overall good impression of model performance.
however, how do we show which polygons are now predicted to be 'at risk'? because with multiple predictions made per polygon, not every polygon is predicted as only 'conflict' or only 'non-conflict', so we will have to deal with a mixture of predictions. what are good means to visualize this per polygon?
test sensitivity towards higher C and gamma values in SVC; also test poly kernel and assess sensitivity of degree values
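A sketch of such a sensitivity test via GridSearchCV, covering higher C values, several gamma values, and the poly kernel with varying degree; the grid values are illustrative, not tuned, and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, random_state=0)

param_grid = [  # illustrative ranges only
    {'kernel': ['rbf'], 'C': [1, 10, 100], 'gamma': ['scale', 0.1, 1.0]},
    {'kernel': ['poly'], 'C': [1, 10], 'degree': [2, 3, 4]},
]
search = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)
# search.cv_results_ holds the per-combination scores for the sensitivity analysis
```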
to compute probability of detection and false alarm over entire time series, add aggregation level (to be provided with shp-file) to model. all data points are then aggregated to this level before performing the computations.
once all changes to the code are made and all functionality is implemented, update the command line script.
instead of having a function call per input variable in the script, it would be more efficient to have a loop over all variables listed in the config-file. The user could then just specify an arbitrary number of variables there with file paths, and the model would go through all of them.
https://stackoverflow.com/questions/22068050/iterate-over-sections-in-a-config-file
this would also need to include a detection of how the time variable in the nc-file is defined to call the right function per input file.
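A sketch of that config-driven loop, assuming a hypothetical one-section-per-variable layout (this is not copro's current cfg schema):

```python
import configparser

cfg = configparser.ConfigParser()
# assumed layout: one [data.<name>] section per input variable
cfg.read_string("""
[data.precipitation]
path = ./input/precip.nc
stat_func = mean

[data.gdp]
path = ./input/gdp.nc
stat_func = sum
""")

variables = {}
for section in cfg.sections():
    if not section.startswith('data.'):
        continue
    name = section.split('.', 1)[1]
    # here the model would also open the file and inspect its time axis
    # to dispatch to the right reader function per input file
    variables[name] = (cfg.get(section, 'path'), cfg.get(section, 'stat_func'))
```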
the history of conflict is important. if there was conflict in the previous year, it is more likely that conflict will occur in the current year as well.
depending on which datapoints are selected for the training and test samples (this happens randomly!), the model output differs.
to account for this, the model would have to be run n-times (e.g. 1000) and outputs should be averaged to get a solid result.
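A minimal sketch of that repeated-run averaging on synthetic data; n_runs is kept small here for illustration, and the split seed changes per run so the random sampling differs each time:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)

n_runs = 10  # in practice much larger, e.g. 1000
scores = []
for run_seed in range(n_runs):
    # a new random train/test split per repetition
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=run_seed)
    clf = RandomForestClassifier(n_estimators=50, random_state=run_seed)
    scores.append(clf.fit(X_tr, y_tr).score(X_te, y_te))

mean_score = np.mean(scores)  # averaged output over all repetitions
```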
the output folders with sub-sub-sub directories are leftovers from the past model structure.
remove this and just use one global output directory, the one specified in the cfg-file.
to make visualization etc. more straightforward, add script to facilitate post-processing.
how can i see what the relative influence of an input variable is in predicting our target var? is it possible to somehow get a list with ordered relative importance of the variables? or simply test relative importance by 'leaving one out', i.e. running n runs (n = number of vars) with n-1 vars used in each run?
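For the ordered-list part, tree-based models expose this directly; a sketch using RandomForestClassifier.feature_importances_ on synthetic data (the variable names are made up):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ['precip', 'gdp', 'governance']  # hypothetical variable names
X, y = make_classification(n_samples=150, n_features=3, n_informative=2,
                           n_redundant=0, random_state=1)
clf = RandomForestClassifier(random_state=1).fit(X, y)

# impurity-based importances sum to 1; sort descending for an ordered list
ranked = sorted(zip(feature_names, clf.feature_importances_),
                key=lambda t: t[1], reverse=True)
```

The leave-one-out approach stays useful as a cross-check, since impurity-based importances can be biased for correlated variables.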
"The main goal of this model is to apply machine learning techniques to make projections of future areas at risk." -> expand to convey what risk means here, how it is measured, etc.
i couldn't easily find how the model is validated, any results we may see, etc. i would recommend explicitly linking a bunch of that via the readme.rst
add probability for ML models as output based on test data
for NuSVC, see here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html#sklearn.svm.NuSVC
for kNC, see here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.predict_proba
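A minimal predict_proba sketch for NuSVC on synthetic data; note that probability=True must be set at construction time for the SVM classifiers to expose probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=100, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# probability=True is required for NuSVC.predict_proba
clf = NuSVC(probability=True, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)  # one column per class, rows sum to 1
```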
right now, we analyse model accuracy for predicted 0s and 1s.
since predicting 1s is much harder, would be interesting to see how good the model is if we only select those entries in y_test that contain a 1 and compare with y_pred/y_score.
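That conflict-only accuracy is essentially the recall on the 1-class; a tiny sketch with made-up labels:

```python
import numpy as np

y_test = np.array([0, 1, 0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])

mask = y_test == 1                      # conflict entries only
hit_rate = (y_pred[mask] == 1).mean()   # fraction of 1s correctly predicted
```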
the current set-up requires a selection of conflicts and water provinces for one or more Koeppen-Geiger climate zones.
It may be interesting to turn off this selection.
Thus, add code to make this work, e.g. by defining 'None' in the climate section in the cfg-file.
run model where Y is only 0, or where Y contains a randomly distributed percentage of 1s (% conflict of all data). Do not change the selection criteria for conflicts then!
based on Joost's feedback, add kappa value as evaluation function to model: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html#sklearn.metrics.cohen_kappa_score
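A tiny example of the suggested metric; the labels are made up:

```python
from sklearn.metrics import cohen_kappa_score

# toy example: observed vs. predicted conflict labels
y_test = [0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1]

# kappa corrects the raw accuracy for chance agreement, which matters
# here because non-conflict dominates the sample
kappa = cohen_kappa_score(y_test, y_pred)
```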
due to the random sampling of data points from the X and Y arrays, it can happen that not all polygons are represented in the test sample X_test/y_test. This is not so much a problem if we look only at aggregate evaluation criteria like ROC value and such, but if we plot the polygons, some may stay empty. That's not good.
Once #61 is solved, it should be ensured that all polygons are represented
Besides, even with n=1 executions of the model, one polygon can appear multiple times in X_test/y_test. Each time, the prediction can be wrong or right, but it's most likely not always correct or wrong but changes each time. It is therefore necessary to create an overall value based on the (average?) accuracy of the prediction per polygon. Or something else - think!
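One simple option is a per-polygon fraction of correct predictions, sketched here with a pandas groupby (the column names are hypothetical):

```python
import pandas as pd

# each row is one test-sample prediction; polygons repeat across the n runs
df = pd.DataFrame({
    'polygon_id': [1, 1, 1, 2, 2, 3],
    'correct':    [1, 0, 1, 1, 1, 0],
})

# fraction of correct predictions per polygon, usable as a map colour value
per_polygon = df.groupby('polygon_id')['correct'].mean()
```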
after having checked how good the model(s) would perform if we only took randomly sampled Ys, how good would the model(s) perform if we used only one variable to predict?
is it possible to identify one (or more) variables that are really key for predicting Y? This would help to identify, in combination with the LOO analysis, which vars are really driving conflict in our model(s) and which are not.
reduce 'loose' code and put all into functions
In the cfg-file, the conflicts can be filtered also for an individual country. this would reduce the number of conflicts drastically.
however, the number of water provinces is not reduced to the given country, thus introducing an immense imbalance in the model.
as solution, remove the country option in the cfg-file.
the model can then be run for an individual country by simply providing a shp-file for this given country. conflicts are then clipped with geopandas to the extent of the shp-file.
have a look at this website and implement other preprocessing steps from skl if results improve
right now, it is only possible to specify one type of conflict in the cfg-file. change that such that multiple values can be provided and the selection procedure is adapted accordingly.
implement k-fold cross validation to assess robustness of models in terms of over- and underfitting.
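A minimal k-fold sketch with cross_val_score on synthetic data; a large spread between the fold scores is a warning sign for overfitting:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, random_state=0)

# 5-fold cross-validation; compare mean and spread of the fold scores
scores = cross_val_score(SVC(), X, y, cv=5, scoring='roc_auc')
```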
add governance data for observed period until 2015