jmcarpenter2 / parfit

A package for parallelizing the fitting and flexible scoring of sklearn machine learning models, with visualization routines.

License: MIT License

Python 100.00%

parfit's Issues

Multiclass prediction

What should I do about the error below? I need to pass an average value such as 'macro', but how can I do that?
ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
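
A possible workaround, sketched under the assumption that parfit calls the metric as metric(y_true, y_pred) with no extra keyword arguments: bind the average setting ahead of time with functools.partial and pass the resulting callable as the metric. The label values below are made up for illustration.

```python
from functools import partial

from sklearn.metrics import f1_score

# Bind average="macro" so the metric can be called with just (y_true, y_pred).
macro_f1 = partial(f1_score, average="macro")

# Made-up multiclass labels, for illustration only.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

score = macro_f1(y_true, y_pred)
print(score)
```

macro_f1 could then be passed as metric=macro_f1 in place of the bare f1_score. Whether bestFit accepts it unchanged is untested here.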

bestScore parameter error

TypeError: bestFit() got an unexpected keyword argument 'bestScore'.

I had used bestScore='max'

Was this removed at some point? Did this ever exist? If it did exist, what is the replacement, should one exist?
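
The bestFit signature printed in the tracebacks elsewhere on this page lists greater_is_better and no bestScore, which suggests that greater_is_better=True is the current way to ask for the maximum score. A quick way to check which keywords a function accepts is inspect.signature. The stub below is a hypothetical stand-in that only copies the parameter names from that traceback; the None defaults are placeholders, not parfit's real defaults.

```python
import inspect


# Hypothetical stand-in: parameter names copied from the traceback,
# defaults are placeholders and not parfit's actual defaults.
def best_fit_stub(model=None, paramGrid=None, X_train=None, y_train=None,
                  X_val=None, y_val=None, nfolds=None, metric=None,
                  greater_is_better=None, predict_proba=None, showPlot=None,
                  scoreLabel=None, vrange=None, cmap=None, n_jobs=None,
                  verbose=None):
    return greater_is_better


accepted = set(inspect.signature(best_fit_stub).parameters)
print("bestScore" in accepted)          # False
print("greater_is_better" in accepted)  # True
```

With parfit installed, the same check can be run on the real function by passing pf.bestFit to inspect.signature.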

error: 'i' format requires -2147483648 <= number <= 2147483647

Hi,
I get the error: 'i' format requires -2147483648 <= number <= 2147483647
I am doing exactly the same as README.md, except that I am using RandomForestRegressor().

Full error :

---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/avinash/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 346, in _sendback_result
    exception=exception))
  File "/Users/avinash/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/joblib/externals/loky/backend/queues.py", line 241, in put
    self._writer.send_bytes(obj)
  File "/Users/avinash/anaconda3/envs/venv_py3.6/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/Users/avinash/anaconda3/envs/venv_py3.6/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""

The above exception was the direct cause of the following exception:

error                                     Traceback (most recent call last)
<ipython-input-10-f10ba30832f6> in <module>
     11                                                     X_train_5, y_train_5, X_test_5, y_test_5, # nfolds=5 [optional, instead of validation set]
     12                                                     metric=roc_auc_score, greater_is_better=True,
---> 13                                                     scoreLabel='AUC')
     14 
     15 print(best_model, best_score)

~/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/parfit/parfit.py in bestFit(model, paramGrid, X_train, y_train, X_val, y_val, nfolds, metric, greater_is_better, predict_proba, showPlot, scoreLabel, vrange, cmap, n_jobs, verbose)
     63     else:
     64         print("-------------FITTING MODELS-------------")
---> 65         models = fitModels(model, paramGrid, X_train, y_train, n_jobs, verbose)
     66         print("-------------SCORING MODELS-------------")
     67         scores = scoreModels(models, X_val, y_val, metric, predict_proba, n_jobs, verbose)

~/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/parfit/fit.py in fitModels(model, paramGrid, X, y, n_jobs, verbose)
     49         myModels = fitModels(model, paramGrid, X_train, y_train)
     50     """
---> 51     return Parallel(n_jobs=n_jobs, verbose=verbose)(delayed(fitOne)(model, X, y, params) for params in paramGrid)

~/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/joblib/parallel.py in __call__(self, iterable)
    994 
    995             with self._backend.retrieval_context():
--> 996                 self.retrieve()
    997             # Make sure that we get a last message telling us we are done
    998             elapsed_time = time.time() - self._start_time

~/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/joblib/parallel.py in retrieve(self)
    897             try:
    898                 if getattr(self._backend, 'supports_timeout', False):
--> 899                     self._output.extend(job.get(timeout=self.timeout))
    900                 else:
    901                     self._output.extend(job.get())

~/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    515         AsyncResults.get from multiprocessing."""
    516         try:
--> 517             return future.result(timeout=timeout)
    518         except LokyTimeoutError:
    519             raise TimeoutError()

~/anaconda3/envs/venv_py3.6/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

~/anaconda3/envs/venv_py3.6/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

error: 'i' format requires -2147483648 <= number <= 2147483647

Please help.
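
The innermost frame of the traceback, struct.pack("!i", n), packs the byte length of the result being sent back from a worker into a signed 32-bit integer, so any single payload larger than 2147483647 bytes (about 2 GiB) fails this way. A minimal reproduction of that limit, independent of parfit and joblib (the helper name is made up for this sketch):

```python
import struct

INT32_MAX = 2**31 - 1  # largest value the "!i" (signed 32-bit) format can hold


def can_encode_length(n_bytes):
    """Return True if n_bytes fits the 4-byte header format seen in the traceback."""
    try:
        struct.pack("!i", n_bytes)
        return True
    except struct.error:
        return False


print(can_encode_length(INT32_MAX))      # True
print(can_encode_length(INT32_MAX + 1))  # False
```

If that is the cause here, a fitted RandomForestRegressor returned from a worker probably exceeds that size. Trying a smaller forest (fewer or shallower trees) would be one way to test the guess.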

tuple parameter produces error when plotting

I am using parfit on an MLPClassifier. Passing a parameter that contains a tuple to parfit.bestFit() produces an error when it tries to plot. If I call bestFit() with showPlot=False, the error is gone.

Here's my code. The problem is the hidden_layer_sizes parameter:

grid = {
    'solver': ['lbfgs'],
    'alpha':  [1e-5],
    'hidden_layer_sizes': [(5, 2)],
    'random_state': [1]
}

paramGrid = ParameterGrid(grid)

bestModel, bestScore, allModels, allScores = pf.bestFit(MLPClassifier(), paramGrid,
           X, y, X_test, y_test, 
           metric=roc_auc_score, 
           greater_is_better=True, 
           scoreLabel='AUC',
           showPlot=True)

print(bestModel, bestScore)

Model function itself instead of name of it

It would be more convenient if an instantiated model could be passed, such as

fitModels(GradientBoostingRegressor(), gbm_paramGrid, X_train, y_train)

instead of the current syntax, which requires the model class itself, i.e.,

fitModels(GradientBoostingRegressor, gbm_paramGrid, X_train, y_train)

This would make trying out parfit easier, since a common practice with sklearn's GridSearchCV is to build a list of classifier instances (not classifier classes) together with their parameter spaces and iterate over them, such as

clfs = [ BayesianRidge(), Lasso(), ElasticNet() ]

Currently, the clfs list cannot be used because it holds instances rather than classifier classes.

Does not seem to work with new version of matplotlib (2.1.2)

Hey,

When I updated the library through pip, it automatically upgraded my matplotlib from 2.1.1 to 2.1.2. Importing parfit now throws a matplotlib-related import error: "cannot import name 'TransformedPatchPath'". I can confirm this is related to the matplotlib update, because when I uninstalled matplotlib and reinstalled 2.1.1, parfit worked again.

Missing images in README

The path ../screenshots/screenshots/best_rf_model_3D.png is missing.
Could you push that folder so that the images display properly?

Text data and parfit not working

Hi,
I am able to run GridSearchCV on text data using the pipeline below:
pipe = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', SGDClassifier()),
])

But when I tried parfit, I converted XTrain and Xval using CountVectorizer and TfidfTransformer and passed that data to parfit. The fitModels method works, but scoreModels() gives me the error below:

Unable to convert array of bytes/strings into decimal numbers with dtype='numeric'

ERROR - Error in main: Exception(ValueError("Unable to convert array of bytes/strings into decimal numbers with dtype='numeric'"))
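
One possible cause, offered as a guess since the scoring call is not shown: the validation data reaching scoreModels() is still raw text, or was vectorized separately from the training data. A minimal sketch of transforming both sets with the same fitted vectorizer, using sklearn only; the texts and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier

# Made-up toy data, for illustration only.
train_texts = ["good movie", "great film", "bad movie", "terrible film"]
y_train = [1, 1, 0, 0]
val_texts = ["good film", "bad film"]

vect = CountVectorizer()
tfidf = TfidfTransformer()

# Fit the transformers on the training text only...
X_train = tfidf.fit_transform(vect.fit_transform(train_texts))
# ...and reuse them (transform, not fit_transform) on the validation text.
X_val = tfidf.transform(vect.transform(val_texts))

clf = SGDClassifier(random_state=0).fit(X_train, y_train)
preds = clf.predict(X_val)

# Both matrices are numeric and share the same number of columns.
print(X_train.shape, X_val.shape, X_val.dtype)
```

If the numeric X_val (rather than the original strings) is what gets passed for scoring, the dtype='numeric' conversion error should not apply. Whether that matches the failing call here is an assumption.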

Cannot install

When I run pip install parfit, the error is:
Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7fb1f6eb74a8>: Failed to establish a new connection: [Errno 101] Network is unreachable',)': /simple/parfit/
Could not find a version that satisfies the requirement parfit (from versions: )
No matching distribution found for parfit

n_jobs defaults to -1 and machine runs out of memory

I am trying to override n_jobs with a number lower than the number of cores available on my P4000 Paperspace Linux machine, but no matter what value I assign to n_jobs in paramGrid below, the output shows "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers."

Why is this?

Then I get the error below, I think because the machine runs out of memory:
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)}

Please advise?

`import parfit.parfit as pf

import numpy as np
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import mean_squared_error

paramGrid = ParameterGrid({
'min_samples_leaf': [1,3,5],
'max_features': [15,26,28,32], #'sqrt', 'log2',
'n_estimators': [60],
'n_jobs': [4],
'random_state': [23]
})

X_train,y_train=prepare_data_rf(df_train,cat_names,cont_names,dep_var)
X_train=X_train.fillna(X_train.mean())

X_val,y_val=prepare_data_rf(df_valid,cat_names,cont_names,dep_var)
X_val=X_val.fillna(X_train.mean())`

Output

-------------FITTING MODELS-------------
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 7.0min
[Parallel(n_jobs=-1)]: Done 3 out of 12 | elapsed: 10.3min remaining: 31.0min
[Parallel(n_jobs=-1)]: Done 5 out of 12 | elapsed: 14.0min remaining: 19.6min
[Parallel(n_jobs=-1)]: Done 7 out of 12 | elapsed: 14.7min remaining: 10.5min
[Parallel(n_jobs=-1)]: Done 9 out of 12 | elapsed: 14.7min remaining: 4.9min

TerminatedWorkerError Traceback (most recent call last)
in

/opt/conda/envs/fastai/lib/python3.6/site-packages/parfit/parfit.py in bestFit(model, paramGrid, X_train, y_train, X_val, y_val, nfolds, metric, greater_is_better, predict_proba, showPlot, scoreLabel, vrange, cmap, n_jobs, verbose)
63 else:
64 print("-------------FITTING MODELS-------------")
---> 65 models = fitModels(model, paramGrid, X_train, y_train, n_jobs, verbose)
66 print("-------------SCORING MODELS-------------")
67 scores = scoreModels(models, X_val, y_val, metric, predict_proba, n_jobs, verbose)

/opt/conda/envs/fastai/lib/python3.6/site-packages/parfit/fit.py in fitModels(model, paramGrid, X, y, n_jobs, verbose)
49 myModels = fitModels(model, paramGrid, X_train, y_train)
50 """
---> 51 return Parallel(n_jobs=n_jobs, verbose=verbose)(delayed(fitOne)(model, X, y, params) for params in paramGrid)

/opt/conda/envs/fastai/lib/python3.6/site-packages/joblib/parallel.py in call(self, iterable)
932
933 with self._backend.retrieval_context():
--> 934 self.retrieve()
935 # Make sure that we get a last message telling us we are done
936 elapsed_time = time.time() - self._start_time

/opt/conda/envs/fastai/lib/python3.6/site-packages/joblib/parallel.py in retrieve(self)
831 try:
832 if getattr(self._backend, 'supports_timeout', False):
--> 833 self._output.extend(job.get(timeout=self.timeout))
834 else:
835 self._output.extend(job.get())

/opt/conda/envs/fastai/lib/python3.6/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
519 AsyncResults.get from multiprocessing."""
520 try:
--> 521 return future.result(timeout=timeout)
522 except LokyTimeoutError:
523 raise TimeoutError()

/opt/conda/envs/fastai/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
430 raise CancelledError()
431 elif self._state == FINISHED:
--> 432 return self.__get_result()
433 else:
434 raise TimeoutError()

/opt/conda/envs/fastai/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)}
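
Judging from the signatures in the traceback, bestFit takes its own n_jobs argument and forwards it to joblib's Parallel, while the entries in paramGrid are parameters for the estimator. If so, 'n_jobs': [4] in the grid sets the thread count inside each RandomForest and leaves the number of parallel workers at its default. A sketch of that distinction using joblib directly; fit_one is a hypothetical stand-in for parfit's fitOne.

```python
from joblib import Parallel, delayed
from sklearn.model_selection import ParameterGrid

param_grid = ParameterGrid({
    "min_samples_leaf": [1, 3, 5],
    "max_features": [15, 26, 28, 32],
    "n_estimators": [60],
    "n_jobs": [4],        # estimator-level setting: threads used inside each model
    "random_state": [23],
})


def fit_one(params):
    # Hypothetical stand-in for parfit's fitOne: it only reports the
    # estimator-level n_jobs value it received from the grid.
    return params["n_jobs"]


# Worker-level setting: at most 2 grid points are processed at once.
# prefer="threads" just keeps this sketch lightweight.
results = Parallel(n_jobs=2, prefer="threads")(
    delayed(fit_one)(params) for params in param_grid
)
print(len(results), set(results))  # 12 {4}
```

With parfit, this would correspond to passing n_jobs as a keyword argument to bestFit (for example n_jobs=2) rather than placing it in the grid. With 8 workers each fitting a 4-thread forest, memory and CPU use would multiply accordingly.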

Using Parfit with Custom CV split

Currently, cross-validation in parfit can be performed by specifying nfolds. Would it be possible to let the user specify the CV splits manually by index?
Or, even better, to pass in general sklearn splitter objects?

Thanks!

Motivation

One possible use-case is when trying to do CV for a time-series dataset, where the usual CV split is not suitable because of the causality inherent in the data.
The general consensus, then, seems to be to do the CV split like:

assume we have data in 5 blocks: [1,2,3,4,5]
split 1: train: [1], val: [2]
split 2: train: [1,2], val: [3]
split 3: train: [1,2,3], val: [4]
split 4: train: [1,2,3,4], val: [5]

This is implemented in Sklearn as TimeSeriesSplit: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
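
The expanding-window scheme described above can be sketched in a few lines of plain Python. This is an illustration of the splitting logic only, not parfit or sklearn code, and the function name is made up for the example.

```python
def expanding_window_splits(blocks):
    """Yield (train, val) pairs where train grows by one block per split."""
    for i in range(1, len(blocks)):
        yield blocks[:i], [blocks[i]]


splits = list(expanding_window_splits([1, 2, 3, 4, 5]))
for n, (train, val) in enumerate(splits, start=1):
    print(f"split {n}: train: {train}, val: {val}")
# split 1: train: [1], val: [2]
# split 2: train: [1, 2], val: [3]
# split 3: train: [1, 2, 3], val: [4]
# split 4: train: [1, 2, 3, 4], val: [5]
```

sklearn's TimeSeriesSplit produces the same kind of splits as index arrays through its split(X) method. Whether bestFit could accept such index pairs is exactly what this request asks about.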
