jmcarpenter2 / parfit
A package for parallelizing the fit and flexibly scoring of sklearn machine learning models, with visualization routines.
License: MIT License
What should I do about this error? I need to pass an average setting such as 'macro', but how can I do that?
ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
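One way to supply the average setting is to bind it to the metric before handing the metric over. The sketch below assumes the metric in question is sklearn's f1_score (the issue does not say which metric was used) and uses toy labels for illustration only:

```python
from functools import partial

from sklearn.metrics import f1_score

# Toy multiclass labels, for illustration only
y_true = [0, 1, 2, 2]
y_pred = [0, 1, 2, 2]

# Bind average="macro" so the metric keeps the two-argument
# signature metric(y_true, y_pred)
macro_f1 = partial(f1_score, average="macro")

score = macro_f1(y_true, y_pred)
print(score)
```

The resulting callable could then be passed as metric=macro_f1, on the assumption that the metric is called as metric(y_true, y_pred).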
TypeError: bestFit() got an unexpected keyword argument 'bestScore'.
I had used bestScore='max'
Was this removed at some point? Did this ever exist? If it did exist, what is the replacement, should one exist?
Hi,
I get the error: 'i' format requires -2147483648 <= number <= 2147483647
I am doing exactly the same as in README.md, except that I am using RandomForestRegressor()
Full error :
---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/avinash/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 346, in _sendback_result
exception=exception))
File "/Users/avinash/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/joblib/externals/loky/backend/queues.py", line 241, in put
self._writer.send_bytes(obj)
File "/Users/avinash/anaconda3/envs/venv_py3.6/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/Users/avinash/anaconda3/envs/venv_py3.6/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""
The above exception was the direct cause of the following exception:
error Traceback (most recent call last)
<ipython-input-10-f10ba30832f6> in <module>
11 X_train_5, y_train_5, X_test_5, y_test_5, # nfolds=5 [optional, instead of validation set]
12 metric=roc_auc_score, greater_is_better=True,
---> 13 scoreLabel='AUC')
14
15 print(best_model, best_score)
~/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/parfit/parfit.py in bestFit(model, paramGrid, X_train, y_train, X_val, y_val, nfolds, metric, greater_is_better, predict_proba, showPlot, scoreLabel, vrange, cmap, n_jobs, verbose)
63 else:
64 print("-------------FITTING MODELS-------------")
---> 65 models = fitModels(model, paramGrid, X_train, y_train, n_jobs, verbose)
66 print("-------------SCORING MODELS-------------")
67 scores = scoreModels(models, X_val, y_val, metric, predict_proba, n_jobs, verbose)
~/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/parfit/fit.py in fitModels(model, paramGrid, X, y, n_jobs, verbose)
49 myModels = fitModels(model, paramGrid, X_train, y_train)
50 """
---> 51 return Parallel(n_jobs=n_jobs, verbose=verbose)(delayed(fitOne)(model, X, y, params) for params in paramGrid)
~/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/joblib/parallel.py in __call__(self, iterable)
994
995 with self._backend.retrieval_context():
--> 996 self.retrieve()
997 # Make sure that we get a last message telling us we are done
998 elapsed_time = time.time() - self._start_time
~/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/joblib/parallel.py in retrieve(self)
897 try:
898 if getattr(self._backend, 'supports_timeout', False):
--> 899 self._output.extend(job.get(timeout=self.timeout))
900 else:
901 self._output.extend(job.get())
~/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
515 AsyncResults.get from multiprocessing."""
516 try:
--> 517 return future.result(timeout=timeout)
518 except LokyTimeoutError:
519 raise TimeoutError()
~/anaconda3/envs/venv_py3.6/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
430 raise CancelledError()
431 elif self._state == FINISHED:
--> 432 return self.__get_result()
433 else:
434 raise TimeoutError()
~/anaconda3/envs/venv_py3.6/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
error: 'i' format requires -2147483648 <= number <= 2147483647
Please help.
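The traceback above fails in struct.pack("!i", n) inside multiprocessing/connection.py, where n is the size in bytes of the message being sent between processes. That suggests a worker tried to send back a result (possibly a large fitted forest) bigger than a signed 32-bit length header can describe. A minimal sketch of the limit, using only the standard library (the helper name fits_in_int32_header is hypothetical):

```python
import struct

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit "i" field can hold


def fits_in_int32_header(n_bytes):
    """Return True if n_bytes can be packed as a signed 32-bit integer."""
    try:
        struct.pack("!i", n_bytes)
        return True
    except struct.error:
        return False


print(fits_in_int32_header(INT32_MAX))      # size just within the limit
print(fits_in_int32_header(INT32_MAX + 1))  # one byte over the limit
```

If this is the cause, smaller fitted models (for example fewer or shallower trees) would reduce the payload size. Whether a newer Python version lifts the limit in this code path is an assumption worth checking.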
I am using parfit on an MLPClassifier. Having a parameter that contains a tuple in parfit.bestFit() produces an error when trying to plot. If I use bestFit() with showPlot=False, the error is gone.
Here's my code. The problem is the hidden_layer_sizes parameter:
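import parfit.parfit as pf
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ParameterGrid
from sklearn.neural_network import MLPClassifier

# X, y, X_test, y_test are the user's own training and validation data
grid = {
    'solver': ['lbfgs'],
    'alpha': [1e-5],
    'hidden_layer_sizes': [(5, 2)],  # tuple-valued parameter that triggers the plotting error
    'random_state': [1]
}
paramGrid = ParameterGrid(grid)
bestModel, bestScore, allModels, allScores = pf.bestFit(MLPClassifier(), paramGrid,
                                                        X, y, X_test, y_test,
                                                        metric=roc_auc_score,
                                                        greater_is_better=True,
                                                        scoreLabel='AUC',
                                                        showPlot=True)
print(bestModel, bestScore)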
It would be more convenient if a model instance could be passed, such as
fitModels(GradientBoostingRegressor(), gbm_paramGrid, X_train, y_train)
instead of the current syntax, which requires the model class itself, i.e.,
fitModels(GradientBoostingRegressor, gbm_paramGrid, X_train, y_train)
This would make trying out parfit easier. A common practice with sklearn's GridSearchCV is to build a list of classifier instances (not classifier classes), pair each with its parameter space, and iterate over them, such as
clfs = [ BayesianRidge(), Lasso(), ElasticNet() ]
Currently, the clfs object cannot be used because it holds instances rather than classes.
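Until instances are accepted, one possible workaround is to recover each class from its instance with the built-in type(). This is a sketch only, and it assumes the default constructor arguments are acceptable, since parameters set on the instances are not carried over:

```python
from sklearn.linear_model import BayesianRidge, ElasticNet, Lasso

# The list of instances, as commonly written for GridSearchCV
clfs = [BayesianRidge(), Lasso(), ElasticNet()]

# type(instance) returns the class the instance was created from
clf_classes = [type(clf) for clf in clfs]

for cls in clf_classes:
    print(cls.__name__)
```

Each entry of clf_classes could then be passed wherever the class form is required.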
Hey,
When updating the library through pip, it automatically updated my matplotlib from 2.1.1 to 2.1.2, and when trying to import parfit it throws a matplotlib-related import error 'cannot import name 'TransformedPatchPath''. I can confirm this is related to the update of matplotlib because when I uninstalled matplotlib and reinstalled 2.1.1, parfit works again.
The path ../screenshots/screenshots/best_rf_model_3D.png is missing.
Could you push that folder so that the images are displayed properly?
Hi
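I am able to run GridSearchCV on text data using a pipeline, as below:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Each step is a (name, transformer/estimator) tuple; note the comma after every step
pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', SGDClassifier()),
                 ])
But when I tried parfit, I converted XTrain and Xval using CountVectorizer and TfidfTransformer and passed the transformed data to parfit. The fitModels method works, but scoreModels() gives me the error below: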
Unable to convert array of bytes/strings into decimal numbers with dtype='numeric'
When I run pip install parfit, the error is:
Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7fb1f6eb74a8>: Failed to establish a new connection: [Errno 101] Network is unreachable',)': /simple/parfit/
Could not find a version that satisfies the requirement parfit (from versions: )
No matching distribution found for parfit
I am trying to override n_jobs with a number lower than the number of cores available on my P4000 Paperspace Linux machine, but no matter what value I assign to n_jobs in paramGrid below, the output shows "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers."
Why is this?
Then I get the error below, I think because the machine runs out of memory:
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)}
Please advise.
import parfit.parfit as pf
import numpy as np
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import mean_squared_error
paramGrid = ParameterGrid({
    'min_samples_leaf': [1,3,5],
    'max_features': [15,26,28,32], #'sqrt', 'log2',
    'n_estimators': [60],
    'n_jobs': [4],
    'random_state': [23]
})
X_train,y_train=prepare_data_rf(df_train,cat_names,cont_names,dep_var)
X_train=X_train.fillna(X_train.mean())
X_val,y_val=prepare_data_rf(df_valid,cat_names,cont_names,dep_var)
X_val=X_val.fillna(X_train.mean())
Output
TerminatedWorkerError Traceback (most recent call last)
in
/opt/conda/envs/fastai/lib/python3.6/site-packages/parfit/parfit.py in bestFit(model, paramGrid, X_train, y_train, X_val, y_val, nfolds, metric, greater_is_better, predict_proba, showPlot, scoreLabel, vrange, cmap, n_jobs, verbose)
63 else:
64 print("-------------FITTING MODELS-------------")
---> 65 models = fitModels(model, paramGrid, X_train, y_train, n_jobs, verbose)
66 print("-------------SCORING MODELS-------------")
67 scores = scoreModels(models, X_val, y_val, metric, predict_proba, n_jobs, verbose)
/opt/conda/envs/fastai/lib/python3.6/site-packages/parfit/fit.py in fitModels(model, paramGrid, X, y, n_jobs, verbose)
49 myModels = fitModels(model, paramGrid, X_train, y_train)
50 """
---> 51 return Parallel(n_jobs=n_jobs, verbose=verbose)(delayed(fitOne)(model, X, y, params) for params in paramGrid)
/opt/conda/envs/fastai/lib/python3.6/site-packages/joblib/parallel.py in __call__(self, iterable)
932
933 with self._backend.retrieval_context():
--> 934 self.retrieve()
935 # Make sure that we get a last message telling us we are done
936 elapsed_time = time.time() - self._start_time
/opt/conda/envs/fastai/lib/python3.6/site-packages/joblib/parallel.py in retrieve(self)
831 try:
832 if getattr(self._backend, 'supports_timeout', False):
--> 833 self._output.extend(job.get(timeout=self.timeout))
834 else:
835 self._output.extend(job.get())
/opt/conda/envs/fastai/lib/python3.6/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
519 AsyncResults.get from multiprocessing."""
520 try:
--> 521 return future.result(timeout=timeout)
522 except LokyTimeoutError:
523 raise TimeoutError()
/opt/conda/envs/fastai/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
430 raise CancelledError()
431 elif self._state == FINISHED:
--> 432 return self.__get_result()
433 else:
434 raise TimeoutError()
/opt/conda/envs/fastai/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)}
Currently, cross validation in parfit can be performed by specifying nfolds. Would it be possible to let the user specify the CV splits manually via indices?
Or, even better, to pass in general sklearn splitter objects?
Thanks!
Motivation
One possible use-case is when trying to do CV for a time-series dataset, where the usual CV split is not suitable because of the causality inherent in the data.
The general consensus, then, seems to be to do the CV split like:
assume we have data in 5 blocks: [1,2,3,4,5]
split 1: train: [1], val: [2]
split 2: train: [1,2], val: [3]
split 3: train: [1,2,3], val: [4]
split 4: train: [1,2,3,4], val: [5]
This is implemented in Sklearn as TimeSeriesSplit: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
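The expanding-window scheme described above can be sketched in plain Python. The function name expanding_window_splits is hypothetical and works on whole blocks, whereas sklearn's TimeSeriesSplit yields arrays of sample indices:

```python
def expanding_window_splits(blocks):
    """Yield (train, val) pairs: train grows by one block per split,
    and val is always the single block that follows it in time."""
    for i in range(1, len(blocks)):
        yield blocks[:i], [blocks[i]]


splits = list(expanding_window_splits([1, 2, 3, 4, 5]))
for n, (train, val) in enumerate(splits, start=1):
    print(f"split {n}: train: {train}, val: {val}")
```

Every validation block comes strictly after its training blocks, so no future data leaks into training.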