neuro-ml / reskit

A library for creating and curating reproducible pipelines for scientific and industrial machine learning

Home Page: http://reskit.readthedocs.io/en/0.1.x/

License: BSD 3-Clause "New" or "Revised" License

Python 43.02% Jupyter Notebook 56.98%

python pipeline data-preparation grid-search prepare-data scikit-learn reproducible-research reproducible-experiments

reskit's People

gitter-badger oesteban faical-yannick-congo iiilia stenpiren mv-yurchenko harel-coffee

reskit's Issues

Make Pipeliner use any grid object with sklearn structure

Right now, Reskit hard-codes GridSearchCV as the default grid object. We need to do the following:

Add a grid_object parameter that specifies the grid object. In the current code this is grid_clf.
Pipeliner should work for testing single models without grid search (grid_object = None). In that case Pipeliner ignores param_grid and runs cross_val_score on pipelines with the parameters specified in steps.
If a grid object is specified, we should also have grid_object_params (default None) for grid search parameters that are not classifier-specific. For example, grid_object_params can carry n_iter for RandomizedSearchCV. Setting all grid object parameters should look something like grid_object.set_params(**self.grid_object_params, param_grid=self.param_grid)
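The proposal above can be sketched roughly as follows. This is a minimal illustration built only on scikit-learn, not Reskit's actual code: the helper name evaluate and its signature are assumptions. It also passes the search class plus keyword arguments instead of calling set_params on an instance, because the grid keyword differs between search objects (param_grid for GridSearchCV, param_distributions for RandomizedSearchCV).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def evaluate(pipeline, X, y, grid_object=None, grid_object_params=None, cv=3):
    """Hypothetical helper: score one pipeline with or without a search object."""
    if grid_object is None:
        # No search: score the pipeline with the parameters given in its steps.
        return cross_val_score(pipeline, X, y, cv=cv).mean()
    # grid_object is a search class; grid_object_params carries the grid
    # itself and any other search settings such as n_iter.
    search = grid_object(pipeline, cv=cv, **(grid_object_params or {}))
    search.fit(X, y)
    return search.best_score_


X, y = make_classification(n_samples=120, n_features=8, random_state=0)
pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

plain = evaluate(pipe, X, y)
grid = evaluate(pipe, X, y, GridSearchCV,
                {"param_grid": {"clf__C": [0.1, 1.0, 10.0]}})
rand = evaluate(pipe, X, y, RandomizedSearchCV,
                {"param_distributions": {"clf__C": [0.1, 1.0, 10.0]},
                 "n_iter": 2, "random_state": 0})
```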

Move eval_cv and grid_cv and grid_params to init

We have a strange duplication of parameters between init and get_results(). We need to move them to init.

Make pipeliner work with eval_cv = None

Right now Pipeliner falls back to the default 3-fold cv object in cross_val_score. We need to change this: if eval_cv is None, Pipeliner should report only grid scores.
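One possible shape for this behaviour, sketched with plain scikit-learn. The function score_model and the result keys are assumptions made for illustration, not part of Reskit.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score


def score_model(estimator, param_grid, X, y, grid_cv, eval_cv=None):
    """Hypothetical helper: always report grid scores, add eval scores on request."""
    search = GridSearchCV(estimator, param_grid, cv=grid_cv).fit(X, y)
    results = {"grid_score": search.best_score_, "best_params": search.best_params_}
    if eval_cv is not None:
        # Re-evaluate the selected parameters on a separate splitter.
        best = clone(estimator).set_params(**search.best_params_)
        results["eval_score"] = cross_val_score(best, X, y, cv=eval_cv).mean()
    return results


X, y = make_classification(n_samples=120, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000)
grid = {"C": [0.1, 1.0, 10.0]}

grid_only = score_model(model, grid, X, y, grid_cv=3)
with_eval = score_model(model, grid, X, y, grid_cv=3,
                        eval_cv=StratifiedKFold(n_splits=4, shuffle=True, random_state=0))
```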

Data form in Pipeliner

Right now, Pipeliner cannot handle ordinary sklearn transformers without writing an additional function and wrapping it in DataTransformer. I think it is better to keep the usual (X, y) form in all functions and classes. If you have more than two arrays to use, they can be stored in X.

Implement NestedGridSearchCV

After we are done with #19, we need to test this structure and implement nested grid search as an sklearn-like object. best_estimator_ and best_params_ should return lists of the best estimators and their best parameters.

The fit and predict methods use the mean of all best models with their best params.
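A rough sketch of what such an object could look like, assuming one inner GridSearchCV per outer training fold and prediction by averaging. The constructor arguments and the outer_scores_ attribute are assumptions. Averaging raw predictions is only meaningful for regression; a classifier would need averaged probabilities or voting instead.

```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold


class NestedGridSearchCV(BaseEstimator):
    """Sketch: run one inner grid search per outer training fold."""

    def __init__(self, estimator, param_grid, outer_cv=None, inner_cv=None):
        self.estimator = estimator
        self.param_grid = param_grid
        self.outer_cv = outer_cv
        self.inner_cv = inner_cv

    def fit(self, X, y):
        outer = self.outer_cv if self.outer_cv is not None else KFold(n_splits=3)
        self.best_estimator_, self.best_params_, self.outer_scores_ = [], [], []
        for train_idx, test_idx in outer.split(X, y):
            search = GridSearchCV(clone(self.estimator), self.param_grid,
                                  cv=self.inner_cv)
            search.fit(X[train_idx], y[train_idx])
            # Keep one best model and one parameter dict per outer fold.
            self.best_estimator_.append(search.best_estimator_)
            self.best_params_.append(search.best_params_)
            self.outer_scores_.append(search.score(X[test_idx], y[test_idx]))
        return self

    def predict(self, X):
        # Mean prediction over the per-fold best models.
        return np.mean([est.predict(X) for est in self.best_estimator_], axis=0)


X, y = make_regression(n_samples=90, n_features=5, noise=1.0, random_state=0)
nested = NestedGridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}).fit(X, y)
predictions = nested.predict(X)
```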

finish tutorials

1. Rename the 'notebooks' folder to 'tutorials'
2. Enumerate tutorials
3. Make sure that the tutorials work with the current stable version of reskit
4. Provide a link to the tutorials folder from the docs
5. Make ML on Graphs tutorial
6*. Extend tutorials

* after previous steps are done

Save dependencies and data versions in metadata

Not urgent, but it can be very important for reproducible research, especially in collaborations.

Rewrite tutorials after finishing issue #15

After issue #15 is fixed, we need to rewrite the tutorials.

wrapper for brain connectivity toolbox

Write a wrapper that creates a transformer for the Python version of a BCT function

Tutorials don't work

I've tested tutorials 1 and 2 and they have the same error when using .get_results() method:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-13-4f4d5816fba7> in <module>()
----> 1 pipe.get_results(X, y, scoring=['roc_auc'])

/home/deaddy/anaconda3/lib/python3.5/site-packages/reskit/core.py in get_results(self, data, caching_steps, scoring, results_file, logs_file, collect_n)
    344                 time_point = time()
    345                 if caching_keys != []:
--> 346                     X_featured, y = self.transform_with_caching(data, caching_keys)
    347                 else:
    348                     X_featured = data['X']

/home/deaddy/anaconda3/lib/python3.5/site-packages/reskit/core.py in transform_with_caching(self, data, row_keys)
    132         if 'init' not in self._cached_data:
    133             self._cached_data['init'] = data
--> 134             transform_data_from_last_cached(row_keys, columns)
    135         else:
    136             row_keys = ['init'] + row_keys

/home/deaddy/anaconda3/lib/python3.5/site-packages/reskit/core.py in transform_data_from_last_cached(row_keys, columns)
    125             prev_key = list(self._cached_data)[-1]
    126             for row_key, column in zip(row_keys, columns):
--> 127                 transformer = self.named_steps[column][row_key]
    128                 data = self._cached_data[prev_key]
    129                 self._cached_data[row_key] = transformer.fit_transform(data)

KeyError: 'LR'

finish the docs

Finish the basic documentation for reskit

reskit.norms

The import of reskit.norms in PRNI2016/scripts/calculation.py fails. @lodurality @hyperswitcher

https://github.com/neuro-ml/PRNI2016/blob/master/scripts/calculation.py

ImportError: No module named 'reskit.norms'

fix tutorial page

http://reskit.readthedocs.io/en/0.1.0/tutorial/index.html

Currently, this page does not exist. Either create it or remove the link from README.md -> Documentation.
Note that http://reskit.readthedocs.io/en/0.1.0/?badge=0.1.0 and http://reskit.readthedocs.io/en/0.1.0/tutorials/index.html both work (the difference is the "s" in ./0.1.0/tutorials).

release ISBI code

Rewrite ISBI code using current stable reskit version.

Custom scoring functions for Pipeliner

Right now, Pipeliner can use only standard sklearn scoring functions, which are set through a list of strings. Scoring functions should instead be set through a list of tuples or a dictionary (a dict is converted to a list of tuples with keys in alphabetical order). Pipeliner builds scorers through the standard sklearn make_scorer function and names each scoring function after the first element of its tuple.
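The conversion described above might look like this. The helper name build_scorers is an assumption; make_scorer is the standard scikit-learn function.

```python
from sklearn.metrics import accuracy_score, f1_score, make_scorer


def build_scorers(scoring):
    """Return a list of (name, scorer) pairs from a dict or a list of tuples."""
    if isinstance(scoring, dict):
        # Dicts become tuples sorted by key, as the issue proposes.
        scoring = sorted(scoring.items())
    return [(name, make_scorer(func)) for name, func in scoring]


scorers = build_scorers({"f1": f1_score, "accuracy": accuracy_score})
names = [name for name, _ in scorers]
```

Each scorer in the list can then be called as scorer(estimator, X, y), which is the form cross_val_score and GridSearchCV expect for their scoring argument.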

Rewrite DataTransformer without global func

The DataTransformer parameters are cumbersome: there are global_func and local_func, and it is unclear what each of them does.

We need to rewrite DataTransformer so that it uses only one function.
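A single-function version could be as small as the sketch below. The parameter names func and kw_args are assumptions, chosen to mirror scikit-learn's own FunctionTransformer, which already offers similar behaviour.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class DataTransformer(BaseEstimator, TransformerMixin):
    """Sketch: a stateless transformer that applies one function to X."""

    def __init__(self, func, kw_args=None):
        self.func = func
        self.kw_args = kw_args

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X):
        return self.func(X, **(self.kw_args or {}))


data = np.array([[0.0, 1.0], [3.0, 7.0]])
logged = DataTransformer(np.log1p).fit_transform(data)
clipped = DataTransformer(np.clip, kw_args={"a_min": 0.0, "a_max": 2.0}).fit_transform(data)
```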

Make CalculatorPipeliner

Make a CalculatorPipeliner that doesn't use eval_cv, grid_cv, or predictive_models. You only need to specify the data transforming steps and the calculating functions (instead of scoring functions) to apply to the data. Example: ICC calculations for MICCAI.

Example:

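from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Feature engineering step variants (1st step)
feature_engineering = [('VT', VarianceThreshold()),
                       ('PCA', PCA())]

# Preprocessing step variants (2nd step)
scalers = [('standard', StandardScaler()),
           ('minmax', MinMaxScaler())]

# Reskit needs to define steps in this manner
steps = [('feature_engineering', feature_engineering),
         ('scaler', scalers)]
# CalculatorPipeliner is the proposed class; it does not exist yet
calc_pipe = CalculatorPipeliner(steps)

# ICC_func and Max_func are user-supplied calculating functions
calculators = {'ICC': ICC_func,
               'MAX': Max_func}
calc_pipe.get_results(X, y=None, calculators=calculators)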
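To make the proposal concrete, here is one way such a class might work internally. Everything in it is an assumption about a class that does not exist yet: it builds one sklearn Pipeline per combination of step variants, transforms the data, and applies each calculator to the result. np.max and np.mean stand in for real calculators such as an ICC function.

```python
from itertools import product

import numpy as np
from sklearn.base import clone
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler


class CalculatorPipeliner:
    """Hypothetical sketch: transform with every step combination, then calculate."""

    def __init__(self, steps):
        self.steps = steps

    def get_results(self, X, y=None, calculators=None):
        step_names = [name for name, _ in self.steps]
        rows = []
        for combo in product(*(variants for _, variants in self.steps)):
            pipe = Pipeline([(step, clone(obj))
                             for step, (_, obj) in zip(step_names, combo)])
            X_t = pipe.fit_transform(X, y)
            # One row per combination: chosen variant keys plus calculator outputs.
            row = {step: key for step, (key, _) in zip(step_names, combo)}
            row.update({name: func(X_t) for name, func in (calculators or {}).items()})
            rows.append(row)
        return rows


rng = np.random.RandomState(0)
X = rng.rand(30, 5)
steps = [("feature_engineering", [("VT", VarianceThreshold()),
                                  ("PCA", PCA(n_components=2))]),
         ("scaler", [("standard", StandardScaler()),
                     ("minmax", MinMaxScaler())])]
results = CalculatorPipeliner(steps).get_results(
    X, calculators={"MAX": np.max, "MEAN": np.mean})
```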

Fix caching_steps bug

If caching_steps is undefined, Reskit returns the initial data rather than the data transformed without caching. This is clearly a bug. We need to fix it and write a unit test for it.
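The unit test could check one invariant: the transformed data must be the same whether or not any step is cached. The sketch below uses a small stand-in function named transform, which is an assumption and not Reskit's actual code path, to show the shape of such a test.

```python
def transform(data, steps, caching_steps=None, cache=None):
    """Hypothetical stand-in for the transform path; not Reskit's actual code."""
    caching_steps = caching_steps or []
    cache = {} if cache is None else cache
    done = []
    for name, func in steps:
        done.append(name)
        key = tuple(done)  # cache entries are keyed by the chain of steps so far
        if name in caching_steps and key in cache:
            data = cache[key]
            continue
        data = func(data)
        if name in caching_steps:
            cache[key] = data
    return data


def test_result_is_independent_of_caching():
    steps = [("double", lambda xs: [2 * x for x in xs]),
             ("shift", lambda xs: [x + 1 for x in xs])]
    expected = [3, 5, 7]
    # Without caching_steps the data must still be transformed.
    assert transform([1, 2, 3], steps) == expected
    # With caching the result must be identical, on first and repeated calls.
    cache = {}
    assert transform([1, 2, 3], steps, caching_steps=["double"], cache=cache) == expected
    assert transform([1, 2, 3], steps, caching_steps=["double"], cache=cache) == expected
```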

provide basic tutorials

There should be three tutorials:

1. On using core with standard sklearn transformers
2. On using custom transformers for brain-ml on graphs
3. Combination of 1 and 2

make pipeliner able to use predefined plan

It can be useful to let Pipeliner start from different parts of the table, to make it compatible with the QSUB algorithms.
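One way to picture a predefined plan: enumerate every combination of step variants as a row of the table, then hand each cluster job a slice of the rows. The functions build_plan and select_rows are assumptions for illustration, as is the idea that each job receives a 0-based index.

```python
from itertools import product


def build_plan(steps):
    """Enumerate every combination of step variants as one row of the plan table."""
    names = [name for name, _ in steps]
    keys = [[key for key, _ in variants] for _, variants in steps]
    return [dict(zip(names, combo)) for combo in product(*keys)]


def select_rows(plan, job_index, n_jobs):
    """Give job `job_index` (0-based) every n_jobs-th row of the plan."""
    return plan[job_index::n_jobs]


# Step objects are omitted (None) because only the keys matter for the plan.
steps = [("scaler", [("standard", None), ("minmax", None)]),
         ("reducer", [("VT", None), ("PCA", None)]),
         ("classifier", [("LR", None), ("SVC", None)])]

plan = build_plan(steps)
rows_for_job_1 = select_rows(plan, job_index=1, n_jobs=3)
```

Because the plan is built deterministically from the step definitions, every job can rebuild the same table independently and compute only its own rows.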

possible bug in dump method

Pipeliner doesn't dump config in the old version. This may be because we are also trying to pickle lambdas:

/home/deaddy/Repos/Connectomics/Reskit/reskit/core.py in dump(self, path)
    116         with open(path, 'wb') as f:
    117             for attr in sorted(self.__dict__.keys()):
--> 118                 dump(getattr(self, attr), f)
    119 
    120     def load(self, path):

AttributeError: Can't pickle local object 'gen_dist.<locals>.<lambda>'
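The error can be reproduced outside Reskit. In the sketch below, make_scaler_lambda and make_scaler_partial are hypothetical stand-ins: a lambda created inside a function is a local object that pickle cannot serialize, while a functools.partial over an importable function can be pickled.

```python
import operator
import pickle
from functools import partial


def make_scaler_lambda(scale):
    # A lambda defined inside a function is a local object; pickle rejects it.
    return lambda x: x * scale


def make_scaler_partial(scale):
    # partial over an importable callable is pickled by reference plus arguments.
    return partial(operator.mul, scale)


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        return False


lambda_ok = can_pickle(make_scaler_lambda(2))
partial_ok = can_pickle(make_scaler_partial(2))
restored = pickle.loads(pickle.dumps(make_scaler_partial(2)))
```

Replacing lambdas stored on the object with such picklable callables, or skipping those attributes in dump, are two possible directions for a fix.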

Make ML on graphs tutorial

1. Compress the UCLA ASD dataset into a compact format with data and target.
2. Write a loader for reskit.datasets
3. Using this loader, provide basic examples of normalizing and featurizing with built-in Reskit functions
4. Provide examples of using BCT functions which don't require additional wrapping
5*. Provide a wrapper for BCT functions which require additional wrapping

* after 1-4 is done

Grid jobs

I want to use a specified n_jobs in Grid Search. How can I do this?
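For reference, scikit-learn's own GridSearchCV accepts n_jobs either in the constructor or through set_params, as sketched below. How Reskit forwards this setting to its internal grid object is not shown in this issue, so the sketch covers only the scikit-learn side.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Set the number of parallel jobs when the search object is created...
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.1, 1.0, 10.0]}, cv=3, n_jobs=2)

# ...or change it later on an existing object.
other = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]})
other.set_params(n_jobs=2)
```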

make separate tutorials folder

Right now the tutorials live in the docs. We need to turn them into tutorial notebooks and put them into a tutorials folder.