jannishoch / copro
Machine learning (ML) model for computing conflict risk from climate, environmental, and societal drivers.
Home Page: https://copro.readthedocs.io/en/latest/
License: MIT License
Find a way to make the stat_func argument in the module 'get_var_from_nc' settable by the user, via the cfg-file for instance. It can in principle be changed now, but there is no handle to access it.
Make the input and output paths in the settings file relative so they don't need to be updated per user.
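One way to sketch this: resolve every path in the settings file against the directory the cfg-file itself lives in, so the file can be shared between users unchanged. resolve_path is a hypothetical helper, not part of copro:

```python
import os

def resolve_path(cfg_path: str, value: str) -> str:
    """Interpret a settings-file path relative to the cfg-file's directory."""
    if os.path.isabs(value):
        return value  # absolute paths are kept as-is
    return os.path.normpath(os.path.join(os.path.dirname(cfg_path), value))
```

With this, a user can write './input/data.nc' in the cfg-file regardless of where the repository is checked out.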
generally, to make model outputs for random forests etc. completely reproducible, you need to set the same seed. I would recommend adding a default seed plus letting people pick a particular seed.
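A minimal sketch of a default seed plus a user override via the cfg-file; the [general]/seed names and the DEFAULT_SEED value are assumptions, not copro's actual schema:

```python
import configparser

DEFAULT_SEED = 42  # hypothetical default; any fixed value works

def get_seed(cfg: configparser.ConfigParser) -> int:
    """Read a user-defined seed from the cfg-file, falling back to a default."""
    # 'general'/'seed' are assumed section/option names
    return cfg.getint('general', 'seed', fallback=DEFAULT_SEED)

cfg = configparser.ConfigParser()
cfg.read_string("[general]\nseed = 7\n")
seed = get_seed(cfg)
# pass `seed` as random_state to train_test_split and the classifier,
# e.g. RandomForestClassifier(random_state=seed), so runs are reproducible
```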
indicate better in plots of leave-one-out validation which var was left out, or whether all data was used (add to plot title)
check random forest classifier as extra ML model, also with GridSearchCV (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
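A minimal sketch of a RandomForestClassifier tuned with GridSearchCV; the synthetic data and the parameter grid are illustrative only, not tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the real X/Y arrays
X, y = make_classification(n_samples=120, n_features=5, random_state=42)

param_grid = {            # small illustrative grid
    'n_estimators': [50, 100],
    'max_depth': [None, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring='roc_auc',
)
search.fit(X, y)
best_rf = search.best_estimator_  # refit on all data with the best parameters
```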
for JOSS and Travis-CI, we need some test functions
make all scripts executable from command line, maybe even with click groups.
for example, have something like 'copro download_example', 'copro run' etc.
i am surprised that the package is 200 MB. It is unacceptably large for a package that doesn't technically ship any models but ships only the code for how to estimate and validate. i would recommend finding ways to trim this.
python setup.py develop
doesn't work on Windows
Traceback (most recent call last):
File "setup.py", line 5, in <module>
from setuptools import setup, find_packages
ImportError: cannot import name 'setup'
https://stackoverflow.com/questions/32380587/importerror-cannot-import-name-setup
Currently, we create a big dataframe first with one new column per year per step. For the ML model, however, we do not need them in separate columns, but in one column per variable with entries for all years. This can perfectly be done outside of the main dataframe containing geometry information, as the geometry information can be dropped for the ML model.
Thus, update all functions to return only one column with the data input needed for ML, and make this all more generic too, i.e. less dependent on column names.
try to download the conflict data from the web. check https://stackoverflow.com/questions/57748687/downloading-files-in-jupyter-wget-on-windows and https://stackoverflow.com/questions/7243750/download-file-from-web-in-python-3.
this is required to make sure the environment is correctly installed and the code runs on different platforms than my laptop.
with the current set-up, only those polygons where conflict took place remain in the dataframe. polygons without conflict don't show up anymore, thus yielding a scattered figure of polygons. this continues in subsequent functions, i.e. zonal statistics of nc-file.
would be better to keep all polygons in the dataframe, also those which are assigned a 0 in the boolean operation.
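One way to keep all polygons, sketched with a pandas left-merge (the column names are hypothetical): polygons without a matching conflict record get a 0 instead of dropping out of the dataframe:

```python
import pandas as pd

# all polygons vs. only the polygons where conflict occurred
polygons = pd.DataFrame({'watprov_id': [1, 2, 3, 4]})
conflicts = pd.DataFrame({'watprov_id': [2, 4], 'conflict': [1, 1]})

# left-merge keeps every polygon; missing conflict entries become 0
merged = polygons.merge(conflicts, on='watprov_id', how='left')
merged['conflict'] = merged['conflict'].fillna(0).astype(int)
```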
add more data, from IMAGE model
right now, a lot of waiting time is needed to produce the XY array. however, if the overall settings do not change, this array should not change between runs.
it would save a lot of time if we could import a pre-computed XY array and start from there.
also, this could be useful for demonstration purposes where time is limited and no one wants to wait for the looping through years and input data.
to assess the accuracy of the predictions, it is necessary to convert back the boolean information to the corresponding geographical unit (now: water province).
would also be cool to get the model via pip or even conda
for long-term development, it would be useful to make more use of object-based programming.
follow JOSS guidelines
what should be output:
for demonstration and teaching purposes, it would be great to be able to run the notebook in the cloud. check how this can be done.
add data to CodeOcean for docker-like environment and execution
anyone @wande001 @Sophiepieternel ?
we know after n model repetitions how many predictions were made per polygon, how often the model prediction was correct, for both conflict and non-conflict as well as for conflict only.
this gives an overall good impression of model performance.
however, how do we show which polygons are now predicted to be 'at risk'? because with multiple predictions made per polygon, not every polygon is predicted as only 'conflict' or only 'non-conflict', so we will have to deal with a mixture of predictions. what are good means to visualize this per polygon?
test sensitivity towards higher C and gamma values in SVC; also test poly kernel and assess sensitivity of degree values
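A sketch of such a sensitivity test via GridSearchCV, covering higher C values, several gamma values, and the poly kernel with varying degree; the grid values are illustrative, not tuned, and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, random_state=0)

param_grid = [  # illustrative ranges only
    {'kernel': ['rbf'], 'C': [1, 10, 100], 'gamma': ['scale', 0.1, 1.0]},
    {'kernel': ['poly'], 'C': [1, 10], 'degree': [2, 3, 4]},
]
search = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)
# search.cv_results_ holds the per-combination scores for the sensitivity analysis
```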
to compute probability of detection and false alarm over entire time series, add aggregation level (to be provided with shp-file) to model. all data points are then aggregated to this level before performing the computations.
once all changes to the code are made and all functionality is implemented, update the command line script.
instead of having a function call per input variable in the script, it would be more efficient to have a loop over all variables listed in the config-file. The user could then just specify an arbitrary number of variables there with file paths, and the model would go through all of them.
https://stackoverflow.com/questions/22068050/iterate-over-sections-in-a-config-file
this would also need to include a detection of how the time variable in the nc-file is defined to call the right function per input file.
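A sketch of that config-driven loop, assuming a hypothetical one-section-per-variable layout (this is not copro's current cfg schema):

```python
import configparser

cfg = configparser.ConfigParser()
# assumed layout: one [data.<name>] section per input variable
cfg.read_string("""
[data.precipitation]
path = ./input/precip.nc
stat_func = mean

[data.gdp]
path = ./input/gdp.nc
stat_func = sum
""")

variables = {}
for section in cfg.sections():
    if not section.startswith('data.'):
        continue
    name = section.split('.', 1)[1]
    # here the model would also open the file and inspect its time axis
    # to dispatch to the right reader function per input file
    variables[name] = (cfg.get(section, 'path'), cfg.get(section, 'stat_func'))
```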
the history of conflict is important. if there was conflict in the previous year, it is more likely that conflict will occur in the current year as well.
depending on which datapoints are selected for the training and test samples (this happens randomly!), the model output differs.
to account for this, the model would have to be run n-times (e.g. 1000) and outputs should be averaged to get a solid result.
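A minimal sketch of that repeated-run averaging on synthetic data; n_runs is kept small here for illustration, and the split seed changes per run so the random sampling differs each time:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)

n_runs = 10  # in practice much larger, e.g. 1000
scores = []
for run_seed in range(n_runs):
    # a new random train/test split per repetition
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=run_seed)
    clf = RandomForestClassifier(n_estimators=50, random_state=run_seed)
    scores.append(clf.fit(X_tr, y_tr).score(X_te, y_te))

mean_score = np.mean(scores)  # averaged output over all repetitions
```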
the output folders with sub-sub-sub directories are leftovers from the past model structure.
remove this and just use one global output directory, the one specified in the cfg-file.
to make visualization etc. more straightforward, add script to facilitate post-processing.
how can i see what the relative influence of an input variable is in predicting our target var? is it possible to somehow get a list with ordered relative importance of the variables? or simply test relative importance by 'leaving one out', i.e. running n runs (n = number of vars) with n-1 vars used in each run?
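For the ordered-list part, tree-based models expose this directly; a sketch using RandomForestClassifier.feature_importances_ on synthetic data (the variable names are made up):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ['precip', 'gdp', 'governance']  # hypothetical variable names
X, y = make_classification(n_samples=150, n_features=3, n_informative=2,
                           n_redundant=0, random_state=1)
clf = RandomForestClassifier(random_state=1).fit(X, y)

# impurity-based importances sum to 1; sort descending for an ordered list
ranked = sorted(zip(feature_names, clf.feature_importances_),
                key=lambda t: t[1], reverse=True)
```

The leave-one-out approach stays useful as a cross-check, since impurity-based importances can be biased for correlated variables.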
"The main goal of this model is to apply machine learning techniques to make projections of future areas at risk." -> expand to convey what risk means here, how it is measured, etc.
i couldn't easily find how the model is validated, any results we may see, etc. i would recommend explicitly linking a bunch of that via the readme.rst
add probability for ML models as output based on test data
for NuSVC, see here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html#sklearn.svm.NuSVC
for kNC, see here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.predict_proba
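A minimal predict_proba sketch for NuSVC on synthetic data; note that probability=True must be set at construction time for the SVM classifiers to expose probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=100, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# probability=True is required for NuSVC.predict_proba
clf = NuSVC(probability=True, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)  # one column per class, rows sum to 1
```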
right now, we analyse model accuracy for predicted 0s and 1s.
since predicting 1s is much harder, would be interesting to see how good the model is if we only select those entries in y_test that contain a 1 and compare with y_pred/y_score.
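That conflict-only accuracy is essentially the recall on the 1-class; a tiny sketch with made-up labels:

```python
import numpy as np

y_test = np.array([0, 1, 0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])

mask = y_test == 1                      # conflict entries only
hit_rate = (y_pred[mask] == 1).mean()   # fraction of 1s correctly predicted
```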
the current set-up requires a selection of conflicts and water provinces for one or more Koeppen-Geiger climate zones.
It may be interesting to turn off this selection.
Thus, add code to make this work, e.g. by defining 'None' in the climate section in the cfg-file.
run model where Y is only 0, or where Y contains a randomly distributed percentage of 1s (% conflict of all data). Do not change the selection criteria for conflicts then!
based on Joost's feedback, add kappa value as evaluation function to model: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html#sklearn.metrics.cohen_kappa_score
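A tiny example of the suggested metric; the labels are made up:

```python
from sklearn.metrics import cohen_kappa_score

# toy example: observed vs. predicted conflict labels
y_test = [0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1]

# kappa corrects the raw accuracy for chance agreement, which matters
# here because non-conflict dominates the sample
kappa = cohen_kappa_score(y_test, y_pred)
```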
due to the random sampling of data points from the X and Y arrays, it can happen that not all polygons are represented in the test sample X_test/y_test. This is not so much a problem if we look only at aggregate evaluation criteria like ROC value and such, but if we plot the polygons, some may stay empty. That's not good.
Once #61 is solved, it should be ensured that all polygons are represented
Besides, even with n=1 executions of the model, one polygon can appear multiple times in X_test/y_test. Each time, the prediction can be wrong or right, but it's most likely not always correct or wrong but changes each time. It is therefore necessary to create an overall value based on the (average?) accuracy of the prediction per polygon. Or something else - think!
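One simple option is a per-polygon fraction of correct predictions, sketched here with a pandas groupby (the column names are hypothetical):

```python
import pandas as pd

# each row is one test-sample prediction; polygons repeat across the n runs
df = pd.DataFrame({
    'polygon_id': [1, 1, 1, 2, 2, 3],
    'correct':    [1, 0, 1, 1, 1, 0],
})

# fraction of correct predictions per polygon, usable as a map colour value
per_polygon = df.groupby('polygon_id')['correct'].mean()
```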
after having checked how good the model(s) would perform if we only took randomly sampled Ys, how good would the model(s) perform if we used only one variable to predict?
is it possible to identify one (or more) variables that are really key for predicting Y? This would help to identify, in combination with the LOO analysis, which vars are really driving conflict in our model(s) and which are not.
reduce 'loose' code and put all into functions
In the cfg-file, the conflicts can be filtered also for an individual country. this would reduce the number of conflicts drastically.
however, the number of water provinces is not reduced to the given country, thus introducing an immense imbalance in the model.
as solution, remove the country option in the cfg-file.
the model can then be run for an individual country by simply providing a shp-file for this given country. conflicts are then clipped with geopandas to the extent of the shp-file.
have a look at this website and implement other preprocessing steps from skl if results improve
right now, it is only possible to specify one type of conflict in the cfg-file. change that such that multiple values can be provided and the selection procedure is adapted accordingly.
implement k-fold cross validation to assess robustness of models in terms of over- and underfitting.
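A minimal k-fold sketch with cross_val_score on synthetic data; a large spread between the fold scores is a warning sign for overfitting:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, random_state=0)

# 5-fold cross-validation; compare mean and spread of the fold scores
scores = cross_val_score(SVC(), X, y, cv=5, scoring='roc_auc')
```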
add governance data for observed period until 2015