neuro-ml / reskit

A library for creating and curating reproducible pipelines for scientific and industrial machine learning

Home Page: http://reskit.readthedocs.io/en/0.1.x/

License: BSD 3-Clause "New" or "Revised" License

Languages: Jupyter Notebook 56.98%, Python 43.02%

Topics: python, pipeline, data-preparation, grid-search, prepare-data, scikit-learn, reproducible-research, reproducible-experiments

reskit's People

Contributors

gitter-badger, hyperswitcher, lodurality

reskit's Issues

Make Pipeliner use any grid object with sklearn structure

Reskit currently hard-codes GridSearchCV as its default grid object. We need to do the following:

  1. Add a grid_object parameter that specifies the grid object. In the code this is currently called grid_clf.

  2. Pipeliner should also work for testing single models without grid search (grid_object = None). In that case Pipeliner ignores param_grid and runs cross_val_score on the pipelines with the parameters specified in steps.

  3. If a grid object is specified, there should also be a grid_object_params parameter (default None) for grid search settings that are not specific to the classifier. For example, n_iter for RandomizedSearchCV can be passed in grid_object_params. Setting all grid object parameters should look something like this: grid_object.set_params(**self.grid_object_params, param_grid=self.param_grid)
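The three points above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Reskit's implementation: the helper name `evaluate` and its signature are hypothetical, and only GridSearchCV is exercised (RandomizedSearchCV takes `param_distributions` rather than `param_grid`, so it would need its own keyword).

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def evaluate(pipeline, X, y, grid_object=None, grid_object_params=None,
             param_grid=None, cv=3):
    """Hypothetical helper: score one pipeline with or without a search."""
    if grid_object is None:
        # No search: param_grid is ignored and the pipeline is scored as given.
        return cross_val_score(pipeline, X, y, cv=cv).mean()
    # Work on a copy so the caller's grid object is left untouched.
    search = clone(grid_object)
    search.set_params(estimator=pipeline, param_grid=param_grid,
                      **(grid_object_params or {}))
    search.fit(X, y)
    return search.best_score_


X, y = make_classification(n_samples=60, n_features=5, random_state=0)
pipe = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])

# Single model, no grid search.
plain_score = evaluate(pipe, X, y)

# Same pipeline with a grid object; the constructor arguments are placeholders
# that set_params overrides inside evaluate().
searched_score = evaluate(
    pipe, X, y,
    grid_object=GridSearchCV(LogisticRegression(), {'C': [1.0]}),
    grid_object_params={'cv': 3},
    param_grid={'clf__C': [0.1, 1.0]},
)
```

Cloning the grid object before calling set_params keeps one user-supplied instance reusable across many pipelines.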

Data form in Pipeliner

Currently Pipeliner cannot handle the usual sklearn transformers unless you write an additional function and pass it to DataTransformer. I think it is better to keep the usual (X, y) form in all functions and classes. If you have more than two arrays to use, they can be stored in X.

Implement NestedGridSearchCV

After we are done with #19, we need to test this structure and implement nested grid search as an sklearn-like object. best_estimator_ and best_params_ should return lists of the best estimators and their best params.

The fit and predict methods use the mean of all best models with their best params.

finish tutorials

  1. Rename the 'notebooks' folder to 'tutorials'
  2. Enumerate the tutorials
  3. Make sure that the tutorials work with the current stable version of reskit
  4. Provide a link to the tutorials folder from the docs
  5. Make the ML on Graphs tutorial
  6. Extend the tutorials (after the previous steps are done)

Tutorials don't work

I've tested tutorials 1 and 2 and they have the same error when using .get_results() method:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-13-4f4d5816fba7> in <module>()
----> 1 pipe.get_results(X, y, scoring=['roc_auc'])

/home/deaddy/anaconda3/lib/python3.5/site-packages/reskit/core.py in get_results(self, data, caching_steps, scoring, results_file, logs_file, collect_n)
    344                 time_point = time()
    345                 if caching_keys != []:
--> 346                     X_featured, y = self.transform_with_caching(data, caching_keys)
    347                 else:
    348                     X_featured = data['X']

/home/deaddy/anaconda3/lib/python3.5/site-packages/reskit/core.py in transform_with_caching(self, data, row_keys)
    132         if 'init' not in self._cached_data:
    133             self._cached_data['init'] = data
--> 134             transform_data_from_last_cached(row_keys, columns)
    135         else:
    136             row_keys = ['init'] + row_keys

/home/deaddy/anaconda3/lib/python3.5/site-packages/reskit/core.py in transform_data_from_last_cached(row_keys, columns)
    125             prev_key = list(self._cached_data)[-1]
    126             for row_key, column in zip(row_keys, columns):
--> 127                 transformer = self.named_steps[column][row_key]
    128                 data = self._cached_data[prev_key]
    129                 self._cached_data[row_key] = transformer.fit_transform(data)

KeyError: 'LR'

Custom scoring functions for Pipeliner

Currently Pipeliner can use only the standard sklearn scoring functions, which are set through a list of strings. Scoring functions should instead be set through a list of tuples or a dictionary (a dict is converted to a list of tuples with keys in alphabetical order). Pipeliner builds the scorers with the standard sklearn make_scorer function and names each scoring function after the first element of its tuple.
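The dict-to-tuples conversion described above might look like the sketch below. The function name `normalize_scoring` and the two toy metrics are assumptions made for illustration; in Pipeliner each function would additionally be wrapped with sklearn's make_scorer, which is left out here to keep the example dependency-free.

```python
def normalize_scoring(scoring):
    """Return scoring as a list of (name, function) tuples.

    A dict is converted to tuples sorted alphabetically by key;
    a list of tuples keeps its given order.
    """
    if isinstance(scoring, dict):
        return sorted(scoring.items(), key=lambda item: item[0])
    return [(name, func) for name, func in scoring]


def accuracy(y_true, y_pred):
    # Toy metric used only for this illustration.
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def error_rate(y_true, y_pred):
    # Second toy metric, defined as the complement of accuracy.
    return 1.0 - accuracy(y_true, y_pred)


scoring = normalize_scoring({'error': error_rate, 'accuracy': accuracy})
names = [name for name, _ in scoring]
```

Sorting the keys makes the column order of the results independent of how the dictionary was built.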

Rewrite DataTransformer without global func

DataTransformer's parameters are cumbersome: there are global_func and local_func, and it is unclear what they do.

We need to rewrite DataTransformer so that it uses only one function.
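One possible shape for a single-function transformer is sketched below. It is a hypothetical design, not Reskit's current class: the constructor signature and the `scale` helper are assumptions, and the class is duck-typed rather than derived from sklearn's base classes. sklearn's own FunctionTransformer (with its `func` and `kw_args` parameters) follows the same idea.

```python
class DataTransformer:
    """Hypothetical transformer that wraps exactly one function."""

    def __init__(self, func, **params):
        self.func = func
        self.params = params

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data.
        return self

    def transform(self, X):
        # Apply the single wrapped function with its stored parameters.
        return self.func(X, **self.params)

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


def scale(X, factor=1.0):
    # Toy function: multiply every value in a list of rows by factor.
    return [[value * factor for value in row] for row in X]


doubled = DataTransformer(scale, factor=2.0).fit_transform([[1, 2], [3, 4]])
```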

Make CalculatorPipeliner

Make a CalculatorPipeliner that doesn't use eval_cv, grid_cv, or predictive_models. You only need to specify the data transformation steps and a list of calculating functions (instead of scoring functions) to compute on the data. Example use case: ICC calculations for MICCAI.

Example:

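from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Feature engineering step variants (1st step)
feature_engineering = [('VT', VarianceThreshold()),
                       ('PCA', PCA())]

# Preprocessing step variants (2nd step)
scalers = [('standard', StandardScaler()),
           ('minmax', MinMaxScaler())]

# Reskit needs to define steps in this manner
steps = [('feature_engineering', feature_engineering),
         ('scaler', scalers)]
# CalculatorPipeliner is the class proposed in this issue; it does not exist yet
calc_pipe = CalculatorPipeliner(steps)

# ICC_func and Max_func are user-supplied calculating functions
calculators = {'ICC': ICC_func,
               'MAX': Max_func}
calc_pipe.get_results(X, y=None, calculators=calculators)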
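To make the proposal concrete, the core loop such a class might run is sketched below: try every combination of step variants, then apply every calculator to the transformed data. The function name `calculate_all` and the toy `Shift` transformer are hypothetical stand-ins; this is not an existing Reskit API.

```python
from itertools import product


class Shift:
    """Toy transformer standing in for an sklearn transformer."""

    def __init__(self, offset):
        self.offset = offset

    def fit_transform(self, X, y=None):
        return [value + self.offset for value in X]


def calculate_all(steps, calculators, X, y=None):
    """Apply every combination of step variants, then every calculator."""
    step_names = [name for name, _ in steps]
    results = []
    for combo in product(*(variants for _, variants in steps)):
        data = X
        for _, transformer in combo:
            data = transformer.fit_transform(data, y)
        # One result row per combination: chosen variants plus calculated values.
        row = dict(zip(step_names, (variant for variant, _ in combo)))
        for calc_name, calc in calculators.items():
            row[calc_name] = calc(data)
        results.append(row)
    return results


steps = [('shift', [('none', Shift(0)), ('plus1', Shift(1))]),
         ('second', [('plus10', Shift(10))])]
rows = calculate_all(steps, {'MAX': max, 'MIN': min}, [1, 2, 3])
```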

Fix caching_steps bug

If caching_steps is undefined, Reskit returns the initial data rather than the data transformed without caching. This is clearly a bug. We need to fix it and write a unit test for it.
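The intended behaviour and a matching unit test could be sketched like this. The `transform` function is a simplified, hypothetical model of the caching logic, not Reskit's code: every step is always applied, and caching_steps only controls which intermediate results are stored.

```python
def transform(data, steps, caching_steps=()):
    """Apply all steps in order; cache only outputs named in caching_steps."""
    cache = {}
    for name, func in steps:
        data = func(data)
        if name in caching_steps:
            cache[name] = data
    return data, cache


def test_transform_without_caching():
    steps = [('double', lambda xs: [2 * x for x in xs]),
             ('inc', lambda xs: [x + 1 for x in xs])]
    initial = [1, 2, 3]
    no_cache, cache = transform(initial, steps)
    with_cache, _ = transform(initial, steps, caching_steps=('double',))
    # Nothing is cached, but the data is still transformed.
    assert cache == {}
    assert no_cache == [3, 5, 7]
    # Caching must not change the result, and the input must not be returned.
    assert no_cache == with_cache
    assert no_cache != initial


test_transform_without_caching()
```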

provide basic tutorials

There should be three tutorials:

  1. On using core with standard sklearn transformers

  2. On using custom transformers for brain-ml on graphs

  3. Combination of 1 and 2

possible bug in dump method

Pipeliner doesn't dump its config in the old version. Maybe it's because we are trying to pickle lambdas as well:

/home/deaddy/Repos/Connectomics/Reskit/reskit/core.py in dump(self, path)
    116         with open(path, 'wb') as f:
    117             for attr in sorted(self.__dict__.keys()):
--> 118                 dump(getattr(self, attr), f)
    119 
    120     def load(self, path):

AttributeError: Can't pickle local object 'gen_dist.<locals>.<lambda>'
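The error can be reproduced in isolation, which supports the lambda hypothesis. In the sketch below the two factory functions are hypothetical stand-ins (the body of gen_dist is not shown in the traceback): one returns a local lambda, the other a functools.partial over an importable function, which pickle can store by reference.

```python
import operator
import pickle
from functools import partial


def make_scaler_lambda(factor):
    # Returns a local lambda, similar in kind to the one named in the error.
    return lambda x: factor * x


def make_scaler_partial(factor):
    # partial over an importable function can be pickled by reference.
    return partial(operator.mul, factor)


def is_picklable(obj):
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError, TypeError):
        return False


lambda_ok = is_picklable(make_scaler_lambda(2))
partial_ok = is_picklable(make_scaler_partial(2))
restored = pickle.loads(pickle.dumps(make_scaler_partial(2)))
```

If the lambdas cannot be replaced, another option is for dump to skip or separately handle attributes that fail a check like is_picklable.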

Make ML on graphs tutorial

  1. Compress the UCLA ASD dataset into a compact format with data and target.
  2. Write a loader from reskit.datasets
  3. Using this loader, provide basic examples of normalizing and featurizing using built-in Reskit functions
  4. Provide examples of using BCT functions which don't require additional wrapping
  5. (After 1-4 are done) Provide a wrapper for BCT functions which require additional

Grid jobs

I want to use a specific n_jobs value in Grid Search. How can I do this?
