
rodrigo-arenas / sklearn-genetic-opt

273 stars · 4 watchers · 70 forks · 126.45 MB

ML hyperparameter tuning and feature selection using evolutionary algorithms.

Home Page: https://sklearn-genetic-opt.readthedocs.io

License: MIT License

Python 100.00%
python scikit-learn machine-learning artificial-intelligence hyperparameters deap looking-for-contributors model-selection hyperparameter-optimization automl

sklearn-genetic-opt's Introduction

Hi there 👋

I'm Rodrigo Arenas, a data scientist. I'm excited about learning new things and contributing to the science community. You can see more about me on my website.

🚀About Me

  • 🔭 I’m currently interested in data science-related projects, such as machine learning, optimization, MLOps and statistics.
  • 🤚 I’m looking to help with open-source projects and research focused on machine learning. Leave me a message if you think I can help you.
  • 📝 I like writing and making some tutorials about data science on Medium
  • 💻 I'm currently working on:
    • sklearn-genetic-opt: a package for scikit-learn model hyperparameter tuning and feature selection using evolutionary algorithms, with extra tools like callbacks, plotting and logging.
    • pyworkforce: standard tools for workforce management, queuing, scheduling, rostering and optimization problems.

sklearn-genetic-opt's People

Contributors

andres42611, atrophiedbrain, chailex, dependabot[bot], guitaek, habush, imxtx, jordan-bird, mahmaduzair, rodrigo-arenas, rsvarma95, turtle24, y10ab1


sklearn-genetic-opt's Issues

[FEATURE] Feature selection and optimization simultaneously

It seems to me that the best approach to optimizing an estimator would be to run both feature selection AND hyperparameter optimization simultaneously within the same evolution process. It would be more complex, but it would probably yield better results than running one after the other.

Is this something I can do within the current framework, or does this require new code?

Also, do you think it is even a good idea in the first place?
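
For context, here is a minimal sketch of the two-step workflow the package supports today: feature selection first, then hyperparameter tuning on the reduced matrix (the combined, single-evolution approach would presumably need new code). It assumes the GAFeatureSelectionCV class and its best_features_ mask described elsewhere in this tracker.

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn_genetic import GAFeatureSelectionCV, GASearchCV
from sklearn_genetic.space import Integer

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier()

# Step 1: evolve a feature subset.
selector = GAFeatureSelectionCV(clf, cv=3, generations=10, population_size=10)
selector.fit(X, y)
X_sel = X[:, selector.best_features_]  # best_features_ is a boolean column mask

# Step 2: evolve hyperparameters on the selected features only.
search = GASearchCV(clf, cv=3, generations=10, population_size=10,
                    param_grid={"max_depth": Integer(2, 10)})
search.fit(X_sel, y)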

Question about selection and crossover

Hello,

I have been trying to understand the selection and crossover methods of GASearchCV, but I still have some doubts. I am using the default algorithm (eaMuPlusLambda), but in the implementation it appears that both mu and lambda are set to None.

If I am not wrong, these parameters establish the following:

  • mu: The number of individuals chosen from the previous generation without undergoing mutation or crossover.
  • lambda: The number of individuals in the next generation obtained from crossing and mutating the parents from the previous generation.

If both of them are set to None, then I don't understand which percentage of a new generation is parents from the previous one and which percentage is mutated children of crossed parents.

I believe that the reproduction process is the following one:

  1. Selection: With the chosen selection method, select individuals that will produce next generation.
  2. Crossover: Apply crossover to some of the selected individuals according to crossover probability.
  3. Mutation: Apply mutation to the resulting population according to mutation probability.

My question is: is the next generation composed only of (possibly mutated) children of crossed parents, or does it also contain parents that were not crossed? In the second case, what is the percentage of children vs. parents?

Thanks a lot in advance!

Mario
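
For reference, DEAP's varOr (the offspring generator used by eaMuPlusLambda) answers part of this: each of the lambda_ offspring is produced by exactly one of crossover (probability cxpb), mutation (mutpb) or plain reproduction (probability 1 - cxpb - mutpb), and the next generation is then selected from the mu parents plus the offspring combined. Simplified from DEAP's source:

import random

def var_or(population, toolbox, lambda_, cxpb, mutpb):
    # Simplified from deap.algorithms.varOr: each offspring comes from
    # exactly one operator, chosen at random per individual.
    offspring = []
    for _ in range(lambda_):
        op_choice = random.random()
        if op_choice < cxpb:                      # crossover
            ind1, ind2 = map(toolbox.clone, random.sample(population, 2))
            ind1, ind2 = toolbox.mate(ind1, ind2)
            del ind1.fitness.values
            offspring.append(ind1)
        elif op_choice < cxpb + mutpb:            # mutation
            ind = toolbox.clone(random.choice(population))
            ind, = toolbox.mutate(ind)
            del ind.fitness.values
            offspring.append(ind)
        else:                                     # reproduction, copied unchanged
            offspring.append(random.choice(population))
    return offspring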

[DOCS] English Tutorials Review

Is your feature request related to a problem? Please describe.
Currently there are several tutorials in the docs. As English is my second language, they probably have some errors and wording that can be improved.

Describe the solution you'd expect
I'd appreciate if someone could review and improve the tutorials documentation to make it clearer to English readers.

Check the docs/tutorials files and make any changes that are necessary.

`pip install sklearn-genetic-opt[all]` not working

System information
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 22.04
Sklearn-genetic-opt version: 0.9.0
Scikit-learn version: 1.1.2
Python version: 3.10.4 (with virtual environment)

Describe the bug
Hello. I am trying to get all the features of this package, but the pip install sklearn-genetic-opt[all] command does not seem to work. I have tried to install it with and without the basic sklearn-genetic-opt package installed, but neither works. Do you know why this happens?

To Reproduce
This is the input and output of my console:

> pip install sklearn-genetic-opt[all]
zsh: no matches found: sklearn-genetic-opt[all]

Expected behavior
To get the package installed.

Screenshots
N/A

Additional context
N/A
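
A plausible cause, not confirmed in the thread: zsh expands square brackets as glob patterns, so sklearn-genetic-opt[all] never reaches pip ("no matches found" is a zsh error, not a pip one). Quoting the requirement usually fixes it:

> pip install "sklearn-genetic-opt[all]"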

[BUG] Callback not evaluated in gen 0

System information
OS Platform and Distribution: all
Sklearn-genetic-opt version: 0.6.0dev0
Scikit-learn version: 0.21.3
Python version: 3.7

Describe the bug
Currently, the callbacks are meant to be evaluated after each generation is fitted.
A missing statement means the callbacks only start being evaluated from generation 1, skipping generation 0.

Expected behavior
Make sure that the callbacks are evaluated from generation 0.

ValueError when a single parameter is supplied to GASearchCV

System information
OS Platform and Distribution: Windows 10
Sklearn-genetic-opt version: 0.6.0
Scikit-learn version: 0.24.1
Python version: 3.8

Describe the bug
GASearchCV raises ValueError: empty range for randrange() (1, 1, 0) after generation 0 when only one parameter is supplied to the param_grid argument. I am not 100% sure whether this is a true bug or expected behaviour of the genetic algorithm; nonetheless, I have decided to raise this issue because the error is not very clear about what is happening (in my opinion).

To Reproduce
Below, I am showing two examples of this error when using a SVR and RandomForestClassifier with a single parameter for optimization. I have tested other algorithms and I have got the same error.

import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, KFold
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Continuous, Categorical, Integer
from sklearn.ensemble import RandomForestClassifier

#generate input to test SVR for regression and random forest for classification
X = np.random.normal(75, 10, (1000, 2))
y = np.random.normal(200, 20, 1000)
y_labels = np.random.randint(0, 2, size=1000)
cv = KFold(n_splits=5, random_state=42, shuffle=True)

This is the example for SVR:

params_svr = {"degree": Integer(2,3)}

evolved = GASearchCV(estimator=SVR(),
                     cv=cv,
                     population_size=10,
                     generations=40,
                     tournament_size=5,
                     elitism=True,
                     crossover_probability=0.85,
                     mutation_probability=0.1,
                     param_grid=params_svr,
                     criteria='max',
                     scoring="neg_mean_absolute_error",
                     algorithm='eaMuPlusLambda',
                     error_score="raise",
                     n_jobs=-1,
                     verbose=True,
                     keep_top_k=10)

evolved.fit(X, y)

This is the example for RandomForestClassifier:

params_rfc = {"max_depth": Integer(2,6)}

evolved = GASearchCV(estimator=RandomForestClassifier(),
                     cv=cv,
                     population_size=10,
                     generations=40,
                     tournament_size=5,
                     elitism=True,
                     crossover_probability=0.85,
                     mutation_probability=0.1,
                     param_grid=params_rfc,
                     criteria='max',
                     scoring="f1",
                     algorithm='eaMuPlusLambda',
                     error_score="raise",
                     n_jobs=-1,
                     verbose=True,
                     keep_top_k=10)

evolved.fit(X, y_labels)

Expected behavior
If I add an extra parameter to params_svr and params_rfc:

params_svr = {"degree": Integer(2,3), "C": Continuous(0, 1)}
params_rfc = {"max_depth": Integer(2,6), "criterion": Categorical(["gini", "entropy"])}

The algorithm runs as usual.

Screenshots
Please find below the full log of this error.

Traceback (most recent call last):

  File "<ipython-input-15-82a4ae8a09ac>", line 17, in <module>
    evolved.fit(X, y_labels)

  File "C:\Users\andre\Anaconda3\lib\site-packages\sklearn\utils\metaestimators.py", line 120, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)

  File "C:\Users\andre\Anaconda3\lib\site-packages\sklearn_genetic\genetic_search.py", line 455, in fit
    pop, log, n_gen = self._select_algorithm(

  File "C:\Users\andre\Anaconda3\lib\site-packages\sklearn_genetic\genetic_search.py", line 547, in _select_algorithm
    pop, log, gen = eaMuPlusLambda(

  File "C:\Users\andre\Anaconda3\lib\site-packages\sklearn_genetic\algorithms.py", line 305, in eaMuPlusLambda
    offspring = varOr(population, toolbox, lambda_, cxpb, mutpb)

  File "C:\Users\andre\Anaconda3\lib\site-packages\deap\algorithms.py", line 234, in varOr
    ind1, ind2 = toolbox.mate(ind1, ind2)

  File "C:\Users\andre\Anaconda3\lib\site-packages\deap\tools\crossover.py", line 51, in cxTwoPoint
    cxpoint2 = random.randint(1, size - 1)

  File "C:\Users\andre\Anaconda3\lib\random.py", line 248, in randint
    return self.randrange(a, b+1)

  File "C:\Users\andre\Anaconda3\lib\random.py", line 226, in randrange
    raise ValueError("empty range for randrange() (%d, %d, %d)" % (istart, istop, width))

ValueError: empty range for randrange() (1, 1, 0)
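
The last two frames can be reproduced in isolation: with a single gene, cxTwoPoint calls random.randint(1, size - 1) with size == 1, which is an empty range.

import random

size = 1                     # an individual with a single hyperparameter
random.randint(1, size - 1)  # ValueError: empty range for randrange() (1, 1, 0)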

Continuous not working for hyperparameters with hard limits

System information
OS Platform and Distribution: Windows 10
Sklearn-genetic-opt version: 0.6.0
Scikit-learn version: 0.24.1
Python version: 3.8

Describe the bug
When defining a Continuous parameter range, it appears the generated values are not within the specified range. This is most evident for algorithms with hyperparameters whose values must lie within a hard interval (e.g. between 0 and 1). Below, I show an example with a RandomForestRegressor, where the parameter min_weight_fraction_leaf has a hard limit of [0, 0.5].

To Reproduce

import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Continuous, Integer  # Integer added: it's used in params below

#generate input 
X = np.random.normal(75, 10, (1000, 2))
y = np.random.normal(200, 20, 1000)
cv = KFold(n_splits=5, random_state=42, shuffle=True)

#parameters
params = {"max_depth": Integer(1, 10), 'min_weight_fraction_leaf': Continuous(0.45, 0.49)}

#genetic optimization
evolved = GASearchCV(estimator=RandomForestRegressor(n_estimators=1),
                     cv=cv,
                     population_size=30,
                     generations=40,
                     tournament_size=5,
                     elitism=True,
                     crossover_probability=0.85,
                     mutation_probability=0.15,
                     param_grid=params,
                     criteria='max',
                     scoring="neg_mean_absolute_error",
                     algorithm='eaMuPlusLambda',
                     error_score="raise",
                     n_jobs=-1,
                     verbose=True,
                     keep_top_k=10)
evolved.fit(X, y)

Expected behavior
The analysis above will raise ValueError: min_weight_fraction_leaf must in [0, 0.5]. This means that Continuous(0.45, 0.49) is generating values outside [0 - 0.5] even though I have specified those to be within 0.45 and 0.49. This problem also occurs with other algorithms, such as XGBRegressor with the parameter subsample (interval between 0 and 1). In the latter case I have specified Continuous(0, 1) and I was getting the error that values for subsample were 1.2 or even higher.
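
One plausible mechanism, an assumption not verified against the library source: scipy.stats.uniform takes loc and scale (the width), not lower and upper bounds, so passing bounds straight through would shift the upper limit.

from scipy.stats import uniform

# uniform(loc, scale) samples from [loc, loc + scale], not [loc, scale].
# If Continuous(lower, upper) were mapped naively to uniform(lower, upper),
# Continuous(0.45, 0.49) would sample from [0.45, 0.94], outside [0, 0.5].
print(uniform(loc=0.45, scale=0.49).rvs(size=5))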

Screenshots
Full error log for the analysis with RandomForestRegressor:

Traceback (most recent call last):
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\externals\loky\process_executor.py", line 431, in _process_worker
    r = call_item()
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\externals\loky\process_executor.py", line 285, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn\utils\fixes.py", line 222, in __call__
    return self.function(*args, **kwargs)
  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py", line 387, in fit
    trees = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\parallel.py", line 1041, in __call__
    if self.dispatch_one_batch(iterator):
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn\utils\fixes.py", line 222, in __call__
    return self.function(*args, **kwargs)
  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py", line 169, in _parallel_build_trees
    tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 1252, in fit
    super().fit(
  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 285, in fit
    raise ValueError("min_weight_fraction_leaf must in [0, 0.5]")
ValueError: min_weight_fraction_leaf must in [0, 0.5]
"""


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "<ipython-input-1-855014dfa2b2>", line 34, in <module>
    evolved.fit(X, y)

  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn\utils\metaestimators.py", line 120, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)

  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn_genetic\genetic_search.py", line 455, in fit
    pop, log, n_gen = self._select_algorithm(

  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn_genetic\genetic_search.py", line 547, in _select_algorithm
    pop, log, gen = eaMuPlusLambda(

  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn_genetic\algorithms.py", line 265, in eaMuPlusLambda
    for ind, fit in zip(invalid_ind, fitnesses):

  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn_genetic\genetic_search.py", line 377, in evaluate
    cv_results = cross_validate(

  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)

  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 250, in cross_validate
    results = parallel(

  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\parallel.py", line 1054, in __call__
    self.retrieve()

  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\parallel.py", line 933, in retrieve
    self._output.extend(job.get(timeout=self.timeout))

  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)

  File "C:\Users\andre\anaconda3\lib\concurrent\futures\_base.py", line 439, in result
    return self.__get_result()

  File "C:\Users\andre\anaconda3\lib\concurrent\futures\_base.py", line 388, in __get_result
    raise self._exception

ValueError: min_weight_fraction_leaf must in [0, 0.5]

MLPClassifier - ValueError: shuffle must be either True or False, got True.

System information
Windows 10
Sklearn-genetic-opt version: 0.6.1
Scikit-learn version: 0.24.2
Python version: Python 3.7

Describe the bug
When using the GASearchCV class with MLPClassifier as the estimator, I get the error in the title. In my param_grid, I simply have it set to Categorical([True, False]), but it doesn't seem to play well. Wondering what could be causing it?

To Reproduce
Could recreate it by creating a binary classification dataset from sklearn, then implementing this:

    curr_params = {"shuffle": Categorical([True, False])}

    evolved_estimator = GASearchCV(estimator=MLPClassifier(),
                                   cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=42),
                                   scoring='balanced_accuracy',
                                   population_size=30,
                                   generations=30,
                                   tournament_size=3,
                                   elitism=True,
                                   crossover_probability=0.8,
                                   mutation_probability=0.1,
                                   param_grid=curr_params,
                                   criteria='max',
                                   algorithm='eaMuPlusLambda',
                                   n_jobs=1,
                                   verbose=True,
                                   keep_top_k=1)

Expected behavior
So far this seems to affect only MLPClassifier, but the expected behavior is that the shuffle parameter is set to a plain True or False.

Screenshots

Additional context
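
One plausible explanation, not confirmed here: the sampled value may come back as a numpy.bool_, which prints as True but fails scikit-learn's isinstance(shuffle, bool) validation, producing exactly this confusing message.

import numpy as np

val = np.bool_(True)
print(val)                    # True
print(isinstance(val, bool))  # False: np.bool_ is not a Python bool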

[FEATURE] Conda package

Is your feature request related to a problem? Please describe.
May I ask if there are plans to release a conda package in the near future?

I want to use this package within a project whose virtual environment is created with conda and all installed packages are also from conda/conda-forge. I have pip installed in the environment and tried to install sklearn-genetic-opt via pip as stated in the docs (pip install sklearn-genetic-opt). pip identified the dependencies and installed them (deap, numpy, etc.). The problem, though, is that it doesn't integrate well with the environment. For instance, I have pandas 1.5.0 installed in the conda environment, but when I open a Python session and run import sklearn_genetic, the interpreter raises an error claiming that pandas is not installed.

Describe the solution you'd expect
The package would be easier to use if it were possible to install it within conda.

Additional context
Everything I reported refers to a Windows 10 21H2 machine.

[FEATURE] MLflow tests

Is your feature request related to a problem? Please describe.
Currently there are no unit tests for the MLflow integration.

Describe the solution you'd expect
Create the file sklearn_genetic/tests/test_mlflow.py and add a set of tests covering the use of MLflow from sklearn_genetic.mlflow.
They should test whether the config creates a new experiment or not and exercise each parameter, verify that at the end of the runs the logged artifacts/metrics/hyperparameters exist in the MLflow server, and clean up the resources after the test ends.

GAFeatureSelectionCV - <classifier> object has no attribute 'transform'

System information
OS Platform and Distribution: Windows 11 Home
Sklearn-genetic-opt version: 0.9.0
deap version: 1.3.3
Scikit-learn version: 1.1.2
Python version: 3.8.13

Describe the bug
I have fitted an instance of GAFeatureSelectionCV using LGBMClassifier

clf_dim = LGBMClassifier()
gen_opt = GAFeatureSelectionCV(
                               clf_dim, cv=5, scoring='avg_prec', refit=True, 
                               generations=20, population_size=50, tournament_size=3,
                               mutation_probability=0.8, crossover_probability=0.2, elitism=True, keep_top_k=1,
                               n_jobs=1, verbose=True, 
                              )

and got the expected results in the various output attributes such as .best_estimator_ and n_features_in_

However, unlike the example provided in the documentation, I am not attempting to use the selected features and the estimator directly to predict results on test data.

Instead, I am trying to follow the traditional scikit-learn approach of incorporating this estimator to select features as step 'dim' in the following pipeline, before passing them on to another classifier at the end of the pipeline

This requires that the 'transformer' based on GAFeatureSelectionCV supports a transform() method, which it does. However, when I try to use the transform method of the fitted estimator standalone, as in:

gen_opt.transform(X_t)

I get an error suggesting that

'LGBMClassifier' object has no attribute 'transform'

I went on to define a pipeline with the estimator as below:

pipe_dim_full = Pipeline(
    steps=[
        ('enc', encode), 
        ('dim', gen_opt), 
        ('clf', clf), 
    ], 
)

and upon trying to fit it, I get a somewhat contradictory error:

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'GAFeatureSelectionCV(cv=StratifiedKFold(n_splits=5, random_state=0, shuffle=True),
estimator=LGBMClassifier(n_jobs=1, random_state=0,
verbose=-1),
generations=20, n_jobs=18, return_train_score=True,
scoring=make_scorer(average_precision_score, needs_proba=True, pos_label=1))' (type <class 'sklearn_genetic.genetic_search.GAFeatureSelectionCV'>) doesn't

As it stands, GAFeatureSelectionCV can't be used in a pipeline without the transform() method being fixed, which is unfortunate as I really like it and was looking forward to using GA across my pipeline.

To Reproduce
Steps to reproduce the behavior:
As described above. Please reach out if you need more detail.

Expected behavior
The transform method should produce a matrix containing the selected columns of the input matrix.

Additional context
There is another module based on deap that successfully offers feature selection by genetic algorithm. Here is a link for reference https://sklearn-genetic.readthedocs.io/en/latest/api.html
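
Until transform() is fixed, a thin wrapper can make the fitted selector usable inside a Pipeline. A minimal sketch, assuming best_features_ is a boolean column mask (as described in the GAFeatureSelectionCV feature request further down); GAFeatureSelectorWrapper is a hypothetical name:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class GAFeatureSelectorWrapper(BaseEstimator, TransformerMixin):
    # Hypothetical workaround: delegate fitting to GAFeatureSelectionCV,
    # then apply the learned boolean mask in transform().
    def __init__(self, selector):
        self.selector = selector

    def fit(self, X, y=None):
        self.selector.fit(X, y)
        self.support_ = np.asarray(self.selector.best_features_, dtype=bool)
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.support_]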

AttributeError: FitnessMax when running GASearchCV on a pipeline containing GAFeatureSelectionCV

System information
OS Platform and Distribution: Windows 11 Home
Sklearn-genetic-opt version: 0.10.1
deap version: 1.3.3
Scikit-learn version: 1.2.1
Python version: 3.10.10

Describe the bug
When including GAFeatureSelectionCV as a transformer within a pipeline to carry out feature selection and then running GASearchCV on the pipeline to optimise hyperparameters, it initially throws up this warning message:

C:\Users\naray\Miniconda3\envs\skl_310\lib\site-packages\deap\creator.py:138: RuntimeWarning: A class named 'FitnessMax' has already been created and it will be overwritten. Consider deleting previous creation of that class or rename it.
C:\Users\naray\Miniconda3\envs\skl_310\lib\site-packages\deap\creator.py:138: RuntimeWarning: A class named 'Individual' has already been created and it will be overwritten. Consider deleting previous creation of that class or rename it.

It seems to then run through various generations successfully, with logs like this printed out (truncated for brevity):

gen nevals fitness fitness_std fitness_max fitness_min
0 5 0.939988 0.012589 0.959893 0.920083
1 10 0.94795 0.00975139 0.959893 0.939988
2 10 0.951872 0.0160428 0.959893 0.919786
3 10 0.963815 0.00480292 0.969697 0.959893
4 10 0.961854 0.0114332 0.969697 0.940285
5 10 0.959774 0.0154184 0.969697 0.929887
gen nevals fitness fitness_std fitness_max fitness_min
0 5 0.908794 0.0116677 0.930778 0.900772
1 10 0.918835 0.00975139 0.930778 0.910873
2 10 0.920737 0.0177374 0.950386 0.900772
3 10 0.916756 0.0172643 0.950386 0.900772
4 10 0.950327 0.0108482 0.96019 0.930481
5 10 0.92038 0.0748142 0.96019 0.770945
.....
.....
.....

before finally throwing up the following error:


AttributeError Traceback (most recent call last)
Cell In[101], line 8
2 from sklearn_genetic import GASearchCV
4 ga_search_pipe = GASearchCV(test_pipe, generations=5, population_size=5,
5 param_grid={'clf__alpha': skg.space.Continuous(10e-2, 10e0, 'log-uniform')},
6 )
----> 8 ga_search_pipe.fit(iris.data, iris.target)
10 grid_search_pipe.predict(iris.data)

File ~\Miniconda3\envs\skl_310\lib\site-packages\sklearn_genetic\genetic_search.py:543, in GASearchCV.fit(self, X, y, callbacks)
536 self.hof.keys.insert(0, self.best_score)
538 self.hof = {
539 k: {key: self._hof[k][n] for n, key in enumerate(self.space.parameters)}
540 for k in range(len(self._hof))
541 }
--> 543 del self.creator.FitnessMax
544 del self.creator.Individual
546 return self

AttributeError: FitnessMax

To Reproduce
Steps to reproduce the behavior:

from sklearn.datasets import load_iris

from lightgbm import LGBMClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn_genetic import GAFeatureSelectionCV
import sklearn_genetic as skg
from sklearn_genetic import GASearchCV

iris = load_iris()

test_pipe = Pipeline([
                    # 1 Feature Selection using GAFeatureSelectionCV
                      ('dim', GAFeatureSelectionCV(LGBMClassifier(), 
                                                   generations=5, population_size=5, 
                                                   n_jobs=-1, 
                                                  )
                      ), 
                        
                      ('clf', SGDClassifier())
                     ]
                    )

ga_search_pipe = GASearchCV(test_pipe, generations=5, population_size=5, 
                            param_grid={'clf__alpha': skg.space.Continuous(10e-2, 10e0, 'log-uniform')},
                           )

ga_search_pipe.fit(iris.data, iris.target)

Expected behavior
The pipeline should be fitted without any errors.
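
The warnings and the final AttributeError are consistent with DEAP's creator being a module-level registry shared by the nested searches: creating the same class name twice warns, and deleting it twice raises. A standalone illustration:

from deap import base, creator

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("FitnessMax", base.Fitness, weights=(1.0,))  # RuntimeWarning: overwritten
del creator.FitnessMax
del creator.FitnessMax  # AttributeError: FitnessMax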


[FEATURE] Allow Pipelines in GASearchCV, vs. Estimators Only

Would be cool to see GASearchCV allow SKLearn Pipeline objects into the mix as well. For example:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn_genetic import GASearchCV

pipeline = Pipeline([('scaler', MinMaxScaler()),
                     ('estimator', AdaBoostClassifier())])

evolved_estimator = GASearchCV(estimator=pipeline,
                               scoring='balanced_accuracy',
                               cv=TimeSeriesSplit(n_splits=3),
                               population_size=30,
                               generations=30,
                               tournament_size=3,
                               elitism=True,
                               crossover_probability=0.8,
                               mutation_probability=0.1,
                               param_grid=curr_params,  # assumed defined elsewhere
                               criteria='max',
                               algorithm='eaMuPlusLambda',
                               n_jobs=-1,  # stray parenthesis removed
                               verbose=True,
                               keep_top_k=1)

I envision in the code somewhere, there could be a check, something like:

if isinstance(self.estimator, sklearn.pipeline.Pipeline):
    self.estimator = self.estimator['estimator']

That way, it could parse the base estimator from the pipeline and the rest of the code could work, but it's nice to know that feature scaling is working properly when using cross validation. Proper scaling would be to scale the training folds first, then transform the testing fold, which all happens within the CV itself. So using a pipeline with SKLearn would allow this?

[Docs] External references

Describe the solution you'd expect
This issue is meant to guide and support people who want to add an external reference showcasing the usage of this package; these references will be shown here.

You can add a blog post, a video, a kaggle notebook or an article. You can add the link of the content in the file docs/external_references.rst (following the contribution guides).

Take into consideration:

  • The link must be after the last existing link.
  • The name must be the title that will be visible.
  • The link must take directly to the referred content.

The content will be reviewed before merging to make comments if needed.

Wrong output for GAFeatureSelectionCV only when using max_features for RandomForestClassifier

System information
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10.0.19044
Sklearn-genetic-opt version: 0.8.0
Scikit-learn version: 1.0.1
Python version: 3.8.11

Describe the bug

Firstly, great work!!

When using :

clf_RF    = RandomForestClassifier(random_state=0)

evolved_estimator = GAFeatureSelectionCV(
    estimator   = clf_RF,
    cv          = 5,
    population_size=20, 
    generations =40,
    crossover_probability=0.8,
    mutation_probability = 0.075,
    n_jobs      = -1,
    scoring     = "accuracy",
    max_features = 300
    )

# Train and select the features
evolved_estimator.fit(X, y)

the output looks like this:

gen	nevals	fitness	fitness_std	fitness_max	fitness_min
0  	20    	-10000 	0          	-10000     	-10000     
1  	32    	-10000 	0          	-10000     	-10000     
2  	33    	-10000 	0          	-10000     	-10000     
3  	37    	-10000 	0          	-10000     	-10000     
4  	36    	-10000 	0          	-10000     	-10000     
5  	37    	-10000 	0          	-10000     	-10000     
6  	36    	-10000 	0          	-10000     	-10000     
7  	36    	-10000 	0          	-10000     	-10000     
8  	36    	-10000 	0          	-10000     	-10000     
9  	33    	-10000 	0          	-10000     	-10000     
10 	33    	-10000 	0          	-10000     	-10000     
11 	34    	-10000 	0          	-10000     	-10000     
12 	34    	-10000 	0          	-10000     	-10000     
13 	36    	-10000 	0          	-10000     	-10000     
14 	33    	-10000 	0          	-10000     	-10000     
15 	34    	-10000 	0          	-10000     	-10000     
16 	33    	-10000 	0          	-10000     	-10000     
17 	37    	-10000 	0          	-10000     	-10000     
18 	35    	-10000 	0          	-10000     	-10000     
19 	37    	-10000 	0          	-10000     	-10000     
20 	34    	-10000 	0          	-10000     	-10000     
21 	35    	-10000 	0          	-10000     	-10000     
22 	35    	-10000 	0          	-10000     	-10000 

This doesn't happen when removing the max_features parameter.


change [FEATURE]

Main Features:
GASearchCV: Principal class of the package, holds the evolutionary cross-validation optimization routine.
Algorithms: Set of different evolutionary algorithms to use as an optimization procedure.
Callbacks: Custom evaluation strategies to generate early stopping rules, logging (into TensorBoard, .pkl files, etc) or your custom logic.
Plots: Generate pre-defined plots to understand the optimization process.
MLflow: Built-in integration with MLflow to log all the hyperparameters, cv-scores and the fitted models

[FEATURE] Report multiple scoring metrics

Hello,

I have been looking into your package and it is really cool. Thank you for putting a lot of effort in developing such an amazing tool.

Is your feature request related to a problem? Please describe.
GASearchCV, unlike GridSearchCV, only accepts one scoring metric. Obviously, the algorithm can only use one metric to decide which models will carry over to the next generation. However, I think it would be useful to view different scoring metrics for the best models (e.g. R2, MAE, RMSE), which intrinsically may provide a slightly different idea of model performance to the user. Of course we would still be able to decide which metric should be used to select the best models within each generation.

Describe the solution you'd expect
I think the implementation of multiple scoring metrics in GASearchCV could be similar to the one implemented in GridSearchCV regarding this specific matter. I show below some examples of this implementation in GridSearchCV:

import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

#generate input
X = np.random.normal(75, 10, (1000, 2))
y = np.random.normal(200, 20, 1000)
params = {"degree": [2, 3], "C": [10, 20, 50]}

#calculate both R2 and MAE for each tested model, but model refit is performed based on the combination of hyperparameters with the best R2
grid = GridSearchCV(SVR(), param_grid=params, scoring=["neg_mean_absolute_error",  "r2"], refit="r2")

#another way of doing the above, but this time using aliases for the scorers
grid = GridSearchCV(SVR(), param_grid=params, scoring={"MAE": "neg_mean_absolute_error",  "R2": "r2"}, refit="R2")

#perform grid search
grid.fit(X, y)

If you call grid.cv_results_ in this example, you will see the output dict has mean_test_MAE and mean_test_R2 keys (in the case of the second example).

Version attribute does not seem to work

System information
OS Platform and Distribution (e.g., macOS 12.0.1):
Sklearn-genetic-opt version: Unknown
Scikit-learn version: 1.0
Python version: 3.9

Describe the bug
The version attribute does not seem to work

To Reproduce

import sklearn_genetic
print(sklearn_genetic.__version__)

Expected behavior
Should show version number.

Screenshots

Additional context
Installed package as described in the docs, i.e.,

pip install sklearn-genetic-opt
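
As a workaround (Python 3.8+ standard library, independent of this package), the installed distribution's version can be read directly:

from importlib.metadata import version

print(version("sklearn-genetic-opt"))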

Improve documentation and examples

I open this issue for newcomers who would like to contribute to an open-source project

The idea is to improve the current docs and add more examples using the library; you can see the current docs files here

You could also add external articles showcasing some applications of the package; see these for example

Here are the stable docs

[FEATURE] GAFeaturesSelectionCV

Is your feature request related to a problem? Please describe.
This feature will extend the package's functionality to include feature selection using evolutionary algorithms. Currently, only hyperparameter tuning is supported.

Describe the solution you'd expect

Implement the class GAFeaturesSelectionCV inside sklearn_genetic.genetic_search with the following functionalities:

  • This class should take the same parameters as GASearchCV except for param_grid; the estimator should have its own parameters already defined.
  • Perform cross-validation over different sets of features that are selected using evolutionary algorithms. The same sklearn_genetic.algorithms options must be available as the optimization routine.
  • The class should work with the existing features of the package, such as callbacks and plotting the fitness evolution.
  • All the documentation must be updated, indicating which functionality of the package is compatible only with GASearchCV (e.g. most likely plot_search_space won't be compatible with feature selection).
  • It must accept a GASearchCV instance as the estimator.
  • There must be an attribute called best_features_ that holds the final features selected by the model.

Additional context
The evolutionary algorithm can be defined by assigning a gene to each feature: if the gene is 1, the feature is selected, and 0 otherwise (a sketch of this encoding follows below).

Note: I'll be working on this feature, but as always, new ideas and contributions to this is welcome
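
A minimal sketch of the gene encoding described above, as an assumed fitness function rather than the final implementation: each individual is a 0/1 vector over the columns of X, and its fitness is the mean cross-validation score on the selected columns.

import numpy as np
from sklearn.model_selection import cross_val_score

def evaluate(individual, estimator, X, y, cv=5):
    mask = np.asarray(individual, dtype=bool)
    if not mask.any():            # guard: at least one feature must be selected
        return (-np.inf,)
    scores = cross_val_score(estimator, X[:, mask], y, cv=cv)
    return (scores.mean(),)       # DEAP fitnesses are tuples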

Random seed

Hi Rodrigo,

Is there a way to make the GA search reproducible? Something like setting a random seed? Thanks!

-Patrick
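
Not answered in this thread, but since DEAP draws from Python's random module and scikit-learn from NumPy, seeding both before calling fit() is a reasonable first attempt:

import random
import numpy as np

random.seed(42)
np.random.seed(42)
# then: evolved_estimator.fit(X, y)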

Contributing to this project

Hello,
as part of our studies, some friends and I have to contribute to an open-source project related to optimization. Your project seems particularly interesting.

Do you need help on a particular feature? On which subject do you advise us to work?

Have a nice day,
Pierre C.

[FEATURE] Training progress bar

What is the use case for this feature?
Letting the user know how much is left to finish the training

Describe the solution you'd expect
When the fit method of GASearchCV is called, a progress bar must be displayed and show the progress of the training

Additional context
Use the tqdm python package
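
A rough sketch of such a callback; the on_step signature follows the package's custom-callbacks tutorial, so treat the details as assumptions:

from tqdm import tqdm

class ProgressBar:
    def __init__(self, total_generations):
        self.pbar = tqdm(total=total_generations)

    def on_step(self, record=None, logbook=None, estimator=None):
        self.pbar.update(1)  # one tick per finished generation
        return False         # never request early stopping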

[Feature] Enable scikit-learn's BaseSearchCV

Is your feature request related to a problem? Please describe.
Currently sklearn_genetic.GASearchCV inherits from ClassifierMixin and RegressorMixin, but it should actually inherit from BaseSearchCV to be compatible with the methods implemented in classes like GridSearchCV.

Describe the solution you'd like
The request is to implement this inheritance and deprecate the methods or logic that become unnecessary to code explicitly, like .predict, .predict_proba, .predict_log_proba and so on.
This should implement the method ._run_search to work properly. Take into account that it has to keep all the information from all the generations that GASearchCV was running.

Describe alternatives you've considered
Review scikit-learn's GridSearchCV and RandomizedSearchCV for a clearer way to go.

Additional context
scikit-learn base implementation: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/model_selection/_search.py

[DOCS] Jupyter Notebooks Tutorials

Is your feature request related to a problem? Please describe.
There are several Jupyter Notebooks that are meant to be tutorials for new users, but currently they are very simple: the models and options are almost the same across the notebooks, and there is little explanation of what is going on.

Describe the solution you'd expect
Review the existing notebooks under the docs/notebooks folder and improve their documentation; it's also possible to change the models and their options to showcase different capabilities.
Optionally, you can add your own notebook showing a new model/dataset use case. The models can be from outside scikit-learn but must be compatible with it, like XGBoost.

NOTE: More than one person may work on this issue, as there are several notebooks; mention in the comments which one you are working on so other people can take a different one. If you are no longer working on a notebook, please also comment that.

MLflow Integration

Is your feature request related to a problem? Please describe.
Currently sklearn-genetic-opt allows logging using DEAP's Logbook; the goal is to also allow logging
metrics, hyperparameters and fitted models into MLflow: https://mlflow.org/

Describe the solution you'd like
Create an additional parameter in GASearchCV with all the configs needed by MLflow

Describe alternatives you've considered
Create a new class MLflowConfig and log the values in the evaluate() method

Additional context
Example of use of MLflow: https://medium.com/analytics-vidhya/manage-your-machine-learning-lifecycle-with-mlflow-in-python-d678d5f3c682

[FEATURE] Extend Callbacks Methods

Is your feature request related to a problem? Please describe.
Currently callbacks are only evaluated at the end of each step using the on_step() method; this restricts the use cases if we want to take some actions at the beginning or end of the evaluation process.

Describe the solution you'd expect
Implement the methods on_start() and on_end() to give the user greater control over the evaluation process.
Re-evaluate which parameters should go to each method.
All the methods need to be optional inside the callbacks; if one is not defined, it should just pass.
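
A sketch of what a callback could look like under this proposal; the method names come from this issue, the signatures are assumptions:

class VerboseCallback:
    def on_start(self, estimator=None):
        print("optimization starting")

    def on_step(self, record=None, logbook=None, estimator=None):
        return False  # keep running

    def on_end(self, logbook=None, estimator=None):
        print("optimization finished")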

[Enhancement] Base class for Callbacks

Is your feature request related to a problem? Please describe.
Currently the .fit method of GASearchCV accepts a callbacks argument; this is used to stop the training process when a condition is met. All the callbacks implement the same methods, .on_step and __call__, which are used while the optimization routine is running.
Currently, the callbacks do not inherit from any base class that ensures all the objects have the minimum required methods; at this point, it is only checked that each one is a callable.

Describe the solution you'd like
Implement a BaseCallback class in the sklearn_genetic.callbacks module; all the current callbacks should be updated to be child classes of that base class. It must enforce that the class has on_step and __call__ methods, otherwise raise a NotImplementedError.
In the check_callback function, it should be checked that the callback is an instance of that base class.

Describe alternatives you've considered
Create a base class that controls all the callbacks that are created for GASearchCV.

Additional context
If possible, update the documentation in docs/tutorials/custom_callbacks, which explains how to create a callback; the example used there should build on BaseCallback.
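
A minimal sketch of the proposed base class; the names come from this issue, the details are assumptions:

class BaseCallback:
    def on_step(self, record=None, logbook=None, estimator=None):
        raise NotImplementedError  # subclasses must implement this

    def __call__(self, record=None, logbook=None, estimator=None):
        return self.on_step(record, logbook, estimator)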

[Feature] Auto summary documentation

Is your feature request related to a problem? Please describe.
The current documentation doesn't have Sphinx's autosummary enabled.

Describe the solution you'd like
Include auto summary option in the documentation (docs/conf.py) using sphinx.ext.autosummary

Describe alternatives you've considered
Checking the options autosummary_generate and autosummary_imported_members; both were tried before and led to duplicated documentation.

Additional context
The auto summary should look similar to standard Sphinx autosummary tables (example screenshot omitted).

Can a vector of weights be specified in `param_grid` within GASearchCV (somehow)?

The idea is to take in predictions from an arbitrary number of models, and find optimal weights that maximize the accuracy of the ensembled model.

Here's the estimator that I wrote:

from typing import List, Optional
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils import check_X_y, check_array
from sklearn.utils.estimator_checks import check_estimator, check_is_fitted
from sklearn.metrics import mean_absolute_error


class WeightedAverageEnsemble(BaseEstimator, RegressorMixin):
    """
    
    >>> wae = WeightedAverageEnsemble()
    >>> X = np.random.rand(20, 5)
    >>> y = np.random.rand(20, 1)
    >>> wae.fit(X, y)
    >>> wae.predict(X)
    
    >>> wae = WeightedAverageEnsemble(weights=[0.25, 0.75])
    >>> X = np.random.rand(20, 2)
    >>> y = np.random.rand(20, 1)
    >>> wae.fit(X, y)
    >>> wae.predict(X)

    Parameters
    ----------
    BaseEstimator : _type_
        _description_
    RegressorMixin : _type_
        _description_
    """

    def __init__(self, weights: Optional[List[float]] = None):
        if weights is not None:
            assert np.isclose(sum(weights), 1.0)
        self.weights = weights

    def fit(self, X, y):
        # TODO: deal with sparse inputs (i.e. mask `W` and convert to sparse)
        X, y = check_X_y(X, y, accept_sparse=False)
        self.is_fitted_ = True
        self.n_features_in_ = X.shape[1]
        if self.weights is None:
            self._mod_weights = np.ones(self.n_features_in_) / self.n_features_in_
            # equivalent to:
            # w = np.ones(self.n_features_in_).reshape(1, -1)
            # w = sklearn.preprocessing.normalize(w, norm="l1", axis=1)
        else:
            self._mod_weights = self.weights
        return self

    def predict(self, X):
        # TODO: deal with sparse inputs (i.e. mask `W` and convert to sparse)
        X = check_array(X, accept_sparse=False)
        check_is_fitted(self, "is_fitted_")
        W = np.tile(self._mod_weights, (X.shape[0], 1))
        y = np.einsum("ij, ij->i", W, X)
        # should be equivalent to: y = np.sum(W * X)
        # loop with np.dot might also be fast due to BLAS compatibility
        # https://stackoverflow.com/a/26168677/13697228
        # https://stackoverflow.com/a/39657770/13697228
        return y

    def score(self, X, y, **kwargs):
        y_pred = self.predict(X)
        return mean_absolute_error(y, y_pred, **kwargs)


check_estimator(WeightedAverageEnsemble())

Related: https://machinelearningmastery.com/weighted-average-ensemble-with-python

How would you suggest optimizing weights since it's a vector that can change in size based on the size of the input data?
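
One possible approach, an assumption rather than an answer from the thread: if the number of models is fixed before the search starts, each weight can be exposed as its own scalar hyperparameter and renormalized inside the estimator.

from sklearn_genetic.space import Continuous

n_models = 5  # assumed known before the search starts
param_grid = {f"w{i}": Continuous(0.0, 1.0) for i in range(n_models)}
# The estimator would accept w0..w4 as individual constructor parameters
# and divide them by their sum in fit(), forming a valid convex combination.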

[Feature] Parallel Coordinates plot

Is your feature request related to a problem? Please describe.
NA

Describe the solution you'd like
Implement in the sklearn_genetic.plots module a function named plot_parallel_coordinates to inspect the results of the learning process

Describe alternatives you've considered
The function should take two arguments:

  • estimator: A fitted estimator from sklearn_genetic.GASearchCV
  • features: list, default=None. Subset of features to plot, if None it plots all the features by default

The function should return an object that plots parallel coordinates according to the pandas.plotting.parallel_coordinates function.

The data to plot is available on the estimator.logbook object; look at the implementation of the plot_search_space function to see how to convert this data to a pandas data frame.

The function must select only the non-categorical variables; this can be done by inspecting the estimator.space object and comparing against the data types defined in sklearn_genetic.space, i.e. Categorical, Continuous and Integer, and color against the "score" column. In the same way, it must validate and warn if a Categorical feature is passed in the features parameter.
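
A rough sketch of the requested function, assuming the logbook has already been flattened to a DataFrame with one column per hyperparameter plus a "score" column (as plot_search_space is described to do):

import pandas as pd
from pandas.plotting import parallel_coordinates

def plot_parallel_coordinates(df, features=None):
    numeric = df.select_dtypes(include="number")  # drops Categorical columns
    if features is not None:                      # assumes numeric features only
        numeric = numeric[list(features) + ["score"]]
    numeric = numeric.copy()
    # Bin the score so lines can be colored by performance.
    numeric["score_bin"] = pd.qcut(numeric["score"], q=4, duplicates="drop").astype(str)
    return parallel_coordinates(numeric.drop(columns="score"), "score_bin")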

Additional context
Links of some implementations:

RuntimeError: Cannot clone object GAFeatureSelectionCV(...), as the constructor either does not set or modifies parameter estimator

System information
OS Platform and Distribution: Windows 11 Home
Sklearn-genetic-opt version: 0.10.0
deap version: 1.3.3
Scikit-learn version: 1.2.1
Python version: 3.10.1

Describe the bug
When including GAFeatureSelectionCV as a transformer within a pipeline to carry out feature selection and then running GridSearchCV or GASearchCV on the pipeline to optimise hyperparameters, it throws up an error:

RuntimeError: Cannot clone object GAFeatureSelectionCV(estimator=LGBMClassifier(), generations=5, n_jobs=14,
population_size=5), as the constructor either does not set or modifies parameter estimator

To Reproduce
Steps to reproduce the behavior:

from sklearn.datasets import load_iris

from lightgbm import LGBMClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn_genetic import GAFeatureSelectionCV
from sklearn.model_selection import GridSearchCV

iris = load_iris()

test_pipe = Pipeline([
                    # 1 Feature Selection using GAFeatureSelectionCV
                      ('dim', GAFeatureSelectionCV(LGBMClassifier(), 
                                                   generations=5, population_size=5, 
                                                   n_jobs=-1, 
                                                  )
                      ), 
                      ('clf', SGDClassifier())
                     ]
                    )

grid_search_pipe = GridSearchCV(test_pipe, 
                                param_grid={'clf__alpha': [10e-04, 10e-03, 10e-02, 10e-01, 10e+00]}, 
                                verbose=1
                               )

grid_search_pipe.fit(iris.data, iris.target)

Expected behavior
The pipeline should be fitted without any errors.

Additional context
This situation arises when trying to wrap a whole pipeline with a hyperparameter tuning class such as GridSearchCV or GASearchCV. The purpose of including the pipeline within *SearchCV is to optimise hyperparameters of additional transform steps before the 'dim' step along with the hyperparameters of the classifier, although such steps are not shown above for brevity.
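
The message matches scikit-learn's clone contract: get_params() must return exactly the objects that were passed to __init__, unmodified. A standalone illustration of how storing a modified estimator triggers the same RuntimeError (whether GAFeatureSelectionCV does this internally is an assumption to verify):

from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import SGDClassifier

class BadInit(BaseEstimator):
    def __init__(self, estimator):
        self.estimator = clone(estimator)  # modified in __init__, so clone() fails

clone(BadInit(SGDClassifier()))
# RuntimeError: Cannot clone object BadInit(...), as the constructor
# either does not set or modifies parameter estimator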

[FEATURE] Add in CTRL + C Early Stopping!

Is your feature request related to a problem? Please describe.
Nope.

  • A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
    None.

  • What is the use case for this feature?
    When a user wants to stop the optimization process manually (without using a callback), they could press CTRL + C to stop. The best evolved_estimator at the time of pressing CTRL + C will be returned and optimization will stop, allowing the rest of the script to continue.

Describe the solution you'd expect
See above

  • A clear and concise description of what you want to happen.
    TPOT is a good reference for this. The user presses CTRL + C after at least 1 pipeline has been fitted, and the best pipeline found until that point is used. The rest of the script can continue after that, like the evolved_estimator.predict() function.

  • Describe the workflow you want to enable
    See above.

Additional context
Love the tool! Would be cool to see this implemented :D

[FEATURE] Support for XGBoost early stopping

Thanks for such a cool package.

I'm using GASearchCV to hypertune an XGBoost model. However, it fails if I use early stopping in fit().
Can early stopping (and the additional XGBoost fitting params) be used with GASearchCV().fit()?

Thanks,
Hayden

[FEATURE] TensorBoard Callback

This feature allows inspecting the running metrics over generations directly in TensorBoard, without having to call the plotting functions.

Describe the solution you'd expect
Create a TensorBoard callback in sklearn_genetic.callbacks.loggers; it must let you select the log dir and internally use tf.summary.scalar(..., step=gen) to log the records from the algorithm's logbook.

Additional context
Documentation: https://www.tensorflow.org/api_docs/python/tf/summary/scalar
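
A rough sketch of the requested callback; the record keys are taken from the logbook columns shown elsewhere in this tracker, and the on_step signature is an assumption:

import tensorflow as tf

class TensorBoardCallback:
    def __init__(self, log_dir="./logs"):
        self.writer = tf.summary.create_file_writer(log_dir)
        self.generation = 0

    def on_step(self, record=None, logbook=None, estimator=None):
        with self.writer.as_default():
            for key in ("fitness", "fitness_std", "fitness_max", "fitness_min"):
                tf.summary.scalar(key, record[key], step=self.generation)
        self.generation += 1
        return False  # never stop the run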

NGrams

So I may be using it wrong, but how does one use ngrams with this tool? Is this feature not implemented?


AssertionError: Assigned values have not the same length than fitness weights

Hi All,

Has anyone come across the issue described below? I'd appreciate any direction to help resolve this.

System information
OS Platform and Distribution: Windows 11 Home
Sklearn-genetic-opt version: 0.9.0
deap version: 1.3.3
Scikit-learn version: 1.1.2
Python version: 3.8.13

Describe the bug
When running my pipeline through GASearchCV, this error occurs intermittently.

AssertionError: Assigned values have not the same length than fitness weights

I'm running GASearchCV with these parameters:

{'algorithm': 'eaMuPlusLambda',
 'criteria': 'max',
 'crossover_probability': 0.2,
 'cv': StratifiedKFold(n_splits=5, random_state=0, shuffle=True),
 'elitism': 1,
 'error_score': nan,

<<< estimator parameters removed for brevity >>>

 'generations': 10,
 'keep_top_k': 1,
 'log_config': None,
 'mutation_probability': 0.8,
 'n_jobs': 18,
 'param_grid': {'enc__numeric': <sklearn_genetic.space.space.Categorical at 0x25b40d91160>,
  'enc__target': <sklearn_genetic.space.space.Categorical at 0x25b40d98520>,
  'enc__time__cyclicity': <sklearn_genetic.space.space.Categorical at 0x25b40d981f0>,
  'dim__fs_wrapper': <sklearn_genetic.space.space.Categorical at 0x25b40d984c0>,
  'clf__base_estimator__max_depth': <sklearn_genetic.space.space.Integer at 0x25b40cf3d90>,
  'clf__base_estimator__num_leaves': <sklearn_genetic.space.space.Integer at 0x25b40d98460>,
  'clf__base_estimator__min_child_samples': <sklearn_genetic.space.space.Integer at 0x25b40d98220>,
  'clf__base_estimator__colsample_bytree': <sklearn_genetic.space.space.Continuous at 0x25b40265c70>,
  'clf__base_estimator__subsample': <sklearn_genetic.space.space.Continuous at 0x25b40265d60>,
  'clf__base_estimator__learning_rate': <sklearn_genetic.space.space.Continuous at 0x25b40265ca0>,
  'clf__base_estimator__min_split_gain': <sklearn_genetic.space.space.Continuous at 0x25b40265fd0>},
 'population_size': 10,
 'pre_dispatch': '2*n_jobs',
 'refit': 'avg_prec',
 'return_train_score': True,
 'scoring': {'profit_ratio': make_scorer(profit_ratio_score),
  'f_0.5': make_scorer(fbeta_score, beta=0.5, zero_division=0, pos_label=1),
  'precision': make_scorer(precision_score, zero_division=0, pos_label=1),
  'recall': make_scorer(recall_score, zero_division=0, pos_label=1),
  'avg_prec': make_scorer(average_precision_score, needs_proba=True, pos_label=1),
  'brier_rel': make_scorer(brier_rel_score, needs_proba=True, pos_label=1),
  'brier_score': make_scorer(brier_score_loss, greater_is_better=False, needs_proba=True, pos_label=1),
  'logloss_rel': make_scorer(logloss_rel_score, needs_proba=True),
  'log_loss': make_scorer(log_loss, greater_is_better=False, needs_proba=True)},
 'tournament_size': 3,
 'verbose': 1}

to tune various hyperparameters of my pipeline (pipeline diagram omitted).

The following param_grid (generated using BayesSearchCV to show real values instead of object reprs) shows categorical choices for the transformer steps enc__numeric, enc__target, enc__time__cyclicity and dim__fs_wrapper, alongside numerical parameter ranges for clf__base_estimator.

{'enc__numeric': Categorical(categories=('passthrough', 
     SmartCorrelatedSelection(selection_method='variance', threshold=0.9), 
     SmartCorrelatedSelection(cv='skf5',
                          estimator=LGBMClassifier(learning_rate=1.0,
                                                   max_depth=8,
                                                   min_child_samples=4,
                                                   min_split_gain=0.0031642299495941877,
                                                   n_jobs=1, num_leaves=59,
                                                   random_state=0, subsample=0.1,
                                                   verbose=-1),
                          scoring=make_scorer(average_precision_score, needs_proba=True, pos_label=1),
                          selection_method='model_performance', threshold=0.9)), prior=None),
 'enc__target': Categorical(categories=(MeanEncoder(ignore_format=True), TargetEncoder(), MEstimateEncoder(), 
     WoEEncoder(ignore_format=True), PRatioEncoder(ignore_format=True), 
     BayesianTargetEncoder(columns=['Symbol', 'CandleType', 'h1CandleType1', 'h2CandleType1'], 
                       prior_weight=3, suffix='')), prior=None),
 'enc__time__cyclicity': Categorical(categories=(CyclicalFeatures(drop_original=True), 
     CycleTransformer(), RepeatingBasisFunction(n_periods=96)), prior=None),
 'dim__fs_wrapper': Categorical(categories=('passthrough', 
     SelectFromModel(estimator=LGBMClassifier(learning_rate=1.0, max_depth=8,
                                          min_child_samples=4,
                                          min_split_gain=0.0031642299495941877,
                                          n_jobs=1, num_leaves=59,
                                          random_state=0, subsample=0.1,
                                          verbose=-1),
                                   importance_getter='feature_importances_'), 
     RFECV(cv=StratifiedKFold(n_splits=5, random_state=0, shuffle=True),
       estimator=LGBMClassifier(learning_rate=1.0, max_depth=8,
                                min_child_samples=4,
                                min_split_gain=0.0031642299495941877, n_jobs=1,
                                num_leaves=59, random_state=0, subsample=0.1,
                                verbose=-1),
               importance_getter='feature_importances_', min_features_to_select=10, n_jobs=1,
               scoring=make_scorer(average_precision_score, needs_proba=True, pos_label=1), step=3), 
     GeneticSelectionCV(caching=True,
                    cv=StratifiedKFold(n_splits=5, random_state=0, shuffle=True),
                    estimator=LGBMClassifier(learning_rate=1.0, max_depth=8,
                                             min_child_samples=4,
                                             min_split_gain=0.0031642299495941877,
                                             n_jobs=1, num_leaves=59,
                                             random_state=0, subsample=0.1,
                                             verbose=-1),
                    mutation_proba=0.1, n_gen_no_change=3, n_generations=20, n_population=50,
                    scoring=make_scorer(average_precision_score, needs_proba=True, pos_label=1))), prior=None),
 'clf__base_estimator__eval_metric': Categorical(categories=('logloss', 'aucpr'), prior=None),
 'clf__base_estimator__max_depth': Integer(low=2, high=8, prior='uniform', transform='identity'),
 'clf__base_estimator__min_child_weight': Real(low=1e-05, high=1000, prior='log-uniform', transform='identity'),
 'clf__base_estimator__colsample_bytree': Real(low=0.1, high=1.0, prior='uniform', transform='identity'),
 'clf__base_estimator__subsample': Real(low=0.1, high=0.9999999999999999, prior='uniform', transform='identity'),
 'clf__base_estimator__learning_rate': Real(low=1e-05, high=1, prior='log-uniform', transform='identity'),
 'clf__base_estimator__gamma': Real(low=1e-06, high=1000, prior='log-uniform', transform='identity')}

To Reproduce
Steps to reproduce the behavior:
<<< Please let me know if you would like more information to reproduce the error >>>

Expected behavior
On occasions when it ran successfully, I got the following results for best_params_ as expected:

{'enc__numeric': 'passthrough',
'enc__target': MeanEncoder(ignore_format=True),
'enc__time__cyclicity': CycleTransformer(),
'dim__fs_wrapper': RFECV(cv=StratifiedKFold(n_splits=5, random_state=0, shuffle=True),
estimator=LGBMClassifier(learning_rate=1.0, max_depth=8,
min_child_samples=4,
min_split_gain=0.0031642299495941877, n_jobs=1,
num_leaves=59, random_state=0, subsample=0.1,
verbose=-1),
importance_getter='feature_importances_', min_features_to_select=10, n_jobs=1,
scoring=make_scorer(average_precision_score, needs_proba=True, pos_label=1), step=3),
'clf__base_estimator__max_depth': 8,
'clf__base_estimator__num_leaves': 133,
'clf__base_estimator__min_child_samples': 15,
'clf__base_estimator__colsample_bytree': 0.7275258370467069,
'clf__base_estimator__subsample': 0.235822893356478,
'clf__base_estimator__learning_rate': 0.7013294054435428,
'clf__base_estimator__min_split_gain': 0.01921144912252005}

But on many occasions, the error occurs after all the generations seem to have run successfully as indicated by the results below:

C:\Anaconda3\envs\py38_skl\lib\site-packages\deap\creator.py:138: RuntimeWarning: A class named 'Individual' has already been created and it will be overwritten. Consider deleting previous creation of that class or rename it.
warnings.warn("A class named '{0}' has already been created and it "
gen nevals fitness fitness_std fitness_max fitness_min
0 10 0.479343 0.0470905 0.567048 0.436592
1 20 0.508531 0.0540786 0.567048 0.436592
2 20 0.544683 0.0443227 0.567048 0.436592
3 20 0.567027 0.000751322 0.567867 0.565522
4 20 0.571346 0.00762421 0.593858 0.567867
5 20 0.573536 0.0101964 0.593858 0.567867
6 20 0.580379 0.0114497 0.593858 0.567867
7 20 0.582027 0.0117464 0.593858 0.567867
8 20 0.593278 0.00492738 0.598567 0.579708
9 20 0.596899 0.0025436 0.599324 0.592604
10 20 0.597076 0.00379484 0.603329 0.592604

AssertionError Traceback (most recent call last)
Cell In [149], line 2
1 start = time()
----> 2 sv_results_gen = cross_val_thresh(gen_pipe, X, y, cv_val[VAL], result_metrics, #scoring=score_metrics,
3 return_estimator=True, return_train_score=True,
4 thresh_split=SPLIT,
5 )
6 end = time()

Cell In [119], line 30, in cross_val_thresh(estimator, X, y, cv, result_metrics, return_estimator, return_train_score, thresh_split, *args, **kwargs)
27 time_df.loc[i, 'split'] = i
29 start = time()
---> 30 est_i.fit(X_train, y_train)
31 end = time()
32 time_df.loc[i, 'fit_time'] = end - start

File C:\Anaconda3\envs\py38_skl\lib\site-packages\sklearn_genetic\genetic_search.py:555, in GASearchCV.fit(self, X, y, callbacks)
552 self.estimator.set_params(**self.best_params_)
554 refit_start_time = time.time()
--> 555 self.estimator.fit(self.X_, self.y_)
556 refit_end_time = time.time()
557 self.refit_time_ = refit_end_time - refit_start_time

File C:\Anaconda3\envs\py38_skl\lib\site-packages\sklearn\pipeline.py:378, in Pipeline.fit(self, X, y, **fit_params)
352 """Fit the model.
353
354 Fit all the transformers one after the other and transform the
(...)
375 Pipeline with fitted steps.
376 """
377 fit_params_steps = self._check_fit_params(**fit_params)
--> 378 Xt = self._fit(X, y, **fit_params_steps)
379 with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):
380 if self._final_estimator != "passthrough":

File C:\Anaconda3\envs\py38_skl\lib\site-packages\sklearn\pipeline.py:336, in Pipeline._fit(self, X, y, **fit_params_steps)
334 cloned_transformer = clone(transformer)
335 # Fit or load from cache the current transformer
--> 336 X, fitted_transformer = fit_transform_one_cached(
337 cloned_transformer,
338 X,
339 y,
340 None,
341 message_clsname="Pipeline",
342 message=self._log_message(step_idx),
343 **fit_params_steps[name],
344 )
345 # Replace the transformer of the step with the fitted
346 # transformer. This is necessary when loading the transformer
347 # from the cache.
348 self.steps[step_idx] = (name, fitted_transformer)

File C:\Anaconda3\envs\py38_skl\lib\site-packages\joblib\memory.py:594, in MemorizedFunc.__call__(self, *args, **kwargs)
593 def __call__(self, *args, **kwargs):
--> 594 return self._cached_call(args, kwargs)[0]

File C:\Anaconda3\envs\py38_skl\lib\site-packages\joblib\memory.py:537, in MemorizedFunc._cached_call(self, args, kwargs, shelving)
534 must_call = True
536 if must_call:
--> 537 out, metadata = self.call(*args, **kwargs)
538 if self.mmap_mode is not None:
539 # Memmap the output at the first call to be consistent with
540 # later calls
541 if self._verbose:

File C:\Anaconda3\envs\py38_skl\lib\site-packages\joblib\memory.py:779, in MemorizedFunc.call(self, *args, **kwargs)
777 if self._verbose > 0:
778 print(format_call(self.func, args, kwargs))
--> 779 output = self.func(*args, **kwargs)
780 self.store_backend.dump_item(
781 [func_id, args_id], output, verbose=self._verbose)
783 duration = time.time() - start_time

File C:\Anaconda3\envs\py38_skl\lib\site-packages\sklearn\pipeline.py:870, in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
868 with _print_elapsed_time(message_clsname, message):
869 if hasattr(transformer, "fit_transform"):
--> 870 res = transformer.fit_transform(X, y, **fit_params)
871 else:
872 res = transformer.fit(X, y, **fit_params).transform(X)

File C:\Anaconda3\envs\py38_skl\lib\site-packages\sklearn\pipeline.py:1154, in FeatureUnion.fit_transform(self, X, y, **fit_params)
1133 def fit_transform(self, X, y=None, **fit_params):
1134 """Fit all transformers, transform the data and concatenate results.
1135
1136 Parameters
(...)
1152 sum of n_components (output dimension) over transformers.
1153 """
-> 1154 results = self._parallel_func(X, y, fit_params, _fit_transform_one)
1155 if not results:
1156 # All transformers are None
1157 return np.zeros((X.shape[0], 0))

File C:\Anaconda3\envs\py38_skl\lib\site-packages\sklearn\pipeline.py:1176, in FeatureUnion._parallel_func(self, X, y, fit_params, func)
1173 self._validate_transformer_weights()
1174 transformers = list(self._iter())
-> 1176 return Parallel(n_jobs=self.n_jobs)(
1177 delayed(func)(
1178 transformer,
1179 X,
1180 y,
1181 weight,
1182 message_clsname="FeatureUnion",
1183 message=self._log_message(name, idx, len(transformers)),
1184 **fit_params,
1185 )
1186 for idx, (name, transformer, weight) in enumerate(transformers, 1)
1187 )

File C:\Anaconda3\envs\py38_skl\lib\site-packages\joblib\parallel.py:1085, in Parallel.__call__(self, iterable)
1076 try:
1077 # Only set self._iterating to True if at least a batch
1078 # was dispatched. In particular this covers the edge
(...)
1082 # was very quick and its callback already dispatched all the
1083 # remaining jobs.
1084 self._iterating = False
-> 1085 if self.dispatch_one_batch(iterator):
1086 self._iterating = self._original_iterator is not None
1088 while self.dispatch_one_batch(iterator):

File C:\Anaconda3\envs\py38_skl\lib\site-packages\joblib\parallel.py:901, in Parallel.dispatch_one_batch(self, iterator)
899 return False
900 else:
--> 901 self._dispatch(tasks)
902 return True

File C:\Anaconda3\envs\py38_skl\lib\site-packages\joblib\parallel.py:819, in Parallel._dispatch(self, batch)
817 with self._lock:
818 job_idx = len(self._jobs)
--> 819 job = self._backend.apply_async(batch, callback=cb)
820 # A job can complete so quickly than its callback is
821 # called before we get here, causing self._jobs to
822 # grow. To ensure correct results ordering, .insert is
823 # used (rather than .append) in the following line
824 self._jobs.insert(job_idx, job)

File C:\Anaconda3\envs\py38_skl\lib\site-packages\joblib\_parallel_backends.py:208, in SequentialBackend.apply_async(self, func, callback)
206 def apply_async(self, func, callback=None):
207 """Schedule a func to be run"""
--> 208 result = ImmediateResult(func)
209 if callback:
210 callback(result)

File C:\Anaconda3\envs\py38_skl\lib\site-packages\joblib\_parallel_backends.py:597, in ImmediateResult.__init__(self, batch)
594 def __init__(self, batch):
595 # Don't delay the application, to avoid keeping the input
596 # arguments in memory
--> 597 self.results = batch()

File C:\Anaconda3\envs\py38_skl\lib\site-packages\joblib\parallel.py:288, in BatchedCalls.__call__(self)
284 def __call__(self):
285 # Set the default nested backend to self._backend but do not set the
286 # change the default number of processes to -1
287 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 288 return [func(*args, **kwargs)
289 for func, args, kwargs in self.items]

File C:\Anaconda3\envs\py38_skl\lib\site-packages\joblib\parallel.py:288, in <listcomp>(.0)
284 def __call__(self):
285 # Set the default nested backend to self._backend but do not set the
286 # change the default number of processes to -1
287 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 288 return [func(*args, **kwargs)
289 for func, args, kwargs in self.items]

File C:\Anaconda3\envs\py38_skl\lib\site-packages\sklearn\utils\fixes.py:117, in _FuncWrapper.__call__(self, *args, **kwargs)
115 def __call__(self, *args, **kwargs):
116 with config_context(**self.config):
--> 117 return self.function(*args, **kwargs)

File C:\Anaconda3\envs\py38_skl\lib\site-packages\sklearn\pipeline.py:870, in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
868 with _print_elapsed_time(message_clsname, message):
869 if hasattr(transformer, "fit_transform"):
--> 870 res = transformer.fit_transform(X, y, **fit_params)
871 else:
872 res = transformer.fit(X, y, **fit_params).transform(X)

File C:\Anaconda3\envs\py38_skl\lib\site-packages\sklearn\base.py:870, in TransformerMixin.fit_transform(self, X, y, **fit_params)
867 return self.fit(X, **fit_params).transform(X)
868 else:
869 # fit method of arity 2 (supervised transformation)
--> 870 return self.fit(X, y, **fit_params).transform(X)

File C:\Anaconda3\envs\py38_skl\lib\site-packages\genetic_selection\gscv.py:279, in GeneticSelectionCV.fit(self, X, y, groups)
262 def fit(self, X, y, groups=None):
263 """Fit the GeneticSelectionCV model and then the underlying estimator on the selected
264 features.
265
(...)
277 instance (e.g., GroupKFold).
278 """
--> 279 return self._fit(X, y, groups)

File C:\Anaconda3\envs\py38_skl\lib\site-packages\genetic_selection\gscv.py:343, in GeneticSelectionCV._fit(self, X, y, groups)
340 print("Selecting features with genetic algorithm.")
342 with np.printoptions(precision=6, suppress=True, sign=" "):
--> 343 _, log = _eaFunction(pop, toolbox, cxpb=self.crossover_proba,
344 mutpb=self.mutation_proba, ngen=self.n_generations,
345 ngen_no_change=self.n_gen_no_change,
346 stats=stats, halloffame=hof, verbose=self.verbose)
347 if self.n_jobs != 1:
348 pool.close()

File C:\Anaconda3\envs\py38_skl\lib\site-packages\genetic_selection\gscv.py:50, in _eaFunction(population, toolbox, cxpb, mutpb, ngen, ngen_no_change, stats, halloffame, verbose)
48 fitnesses = toolbox.map(toolbox.evaluate, invalid_ind)
49 for ind, fit in zip(invalid_ind, fitnesses):
---> 50 ind.fitness.values = fit
52 if halloffame is None:
53 raise ValueError("The 'halloffame' parameter should not be None.")

File C:\Anaconda3\envs\py38_skl\lib\site-packages\deap\base.py:188, in Fitness.setValues(self, values)
187 def setValues(self, values):
--> 188 assert len(values) == len(self.weights), "Assigned values have not the same length than fitness weights"
189 try:
190 self.wvalues = tuple(map(mul, values, self.weights))

AssertionError: Assigned values have not the same length than fitness weights

However, when I exclude dim__fs_wrapper from the pipeline, the error does not occur at all. The purpose of this transformer is to select a feature selection method from among 'passthrough' and estimators wrapped in SelectFromModel, RFECV and GeneticSelectionCV.

Additional context

  1. Note that my approach involves packaging all transformers and the classifier within the pipeline and running hyperparameter tuning on all elements of the pipeline rather than on the classifier only. This allows me to select from many different transformers for the same purpose, e.g. target encoding, dimension reduction, etc., instead of limiting myself to just one.
  2. When I run the pipeline and param_grid (adapted for the search algorithm) through alternative hyperparameter tuning algorithms like BayesSearchCV, RandomizedSearchCV or TuneSearchCV, there are no issues, and I get a set of best_params_ and the other expected outputs.
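
A note on a likely cause, offered as a hypothesis rather than a confirmed diagnosis: both GASearchCV and GeneticSelectionCV register their evolution classes under the same global names in deap.creator (the RuntimeWarning about 'Individual' being overwritten in the log above points at this). Whichever registration happens last wins, so the inner selector's individuals can end up carrying a fitness whose weights length does not match the values being assigned, which is exactly what DEAP's assertion checks. A minimal sketch of this kind of collision (the class names and weights here are illustrative, not the libraries' actual ones):

from deap import base, creator

# One library registers a single-objective fitness and an Individual class.
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

# A second library re-registers "Individual" with a two-objective fitness.
# DEAP emits the RuntimeWarning seen above and overwrites the class.
creator.create("Fitness", base.Fitness, weights=(1.0, -1.0))
creator.create("Individual", list, fitness=creator.Fitness)

ind = creator.Individual([0, 1, 1])
# Assigning a 1-tuple against 2 weights reproduces the failure:
# AssertionError: Assigned values have not the same length than fitness weights
ind.fitness.values = (0.9,)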

[Feature] Requirements.txt for docs

Is your feature request related to a problem? Please describe.
At this point the docs requirements live in dev-requirements.txt, but the docs should have their own requirements file.

Describe the solution you'd like
Change the readthedocs.yml file to use this docs/requirements.txt file, and make sure the docs are built correctly.

Describe alternatives you've considered
There is already a docs/requirements.txt file intended for this; I tried using it, but the build fails because it somehow still requires the main package's dev-requirements.
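
A minimal sketch of what the readthedocs.yml could look like once the docs have their own requirements file, assuming Read the Docs' v2 config format (the build OS and Python version below are placeholders, not the project's actual settings):

version: 2

build:
  os: ubuntu-22.04
  tools:
    python: "3.10"

sphinx:
  configuration: docs/conf.py

python:
  install:
    - requirements: docs/requirements.txt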

Bug during installation of development dependencies

System information
Windows 10
Scikit-learn version: 1.2.0
Python version: 3.8.5
tensorflow version: 2.11.0
protobuf version: 4.21.12

Problem
When I install the dev dependencies with pip install -r dev-requirements.txt, the following errors appear:

ERROR: numba 0.56.4 has requirement numpy<1.24,>=1.18, but you'll have numpy 1.24.1 which is incompatible.
ERROR: sphinx-rtd-theme 1.1.1 has requirement docutils<0.18, but you'll have docutils 0.19 which is incompatible.
ERROR: sphinx-rtd-theme 1.1.1 has requirement sphinx<6,>=1.6, but you'll have sphinx 6.1.2 which is incompatible.
ERROR: tensorboard 2.11.0 has requirement protobuf<4,>=3.9.2, but you'll have protobuf 4.21.12 which is incompatible.
ERROR: tensorflow-intel 2.11.0 has requirement protobuf<3.20,>=3.9.2, but you'll have protobuf 4.21.12 which is incompatible.

This was done on a clean install, without cached packages.

Steps to reproduce

  1. Clone the repo
  2. Create a virtual environment
  3. Delete pip cache
  4. Install the dev requirements

Additional context
When I try to run a Sphinx build, this error comes up, which might be related to these versioning bugs

Configuration error:
There is a programmable error in your configuration file:

Traceback (most recent call last):
  File "c:\users\borel\sklearn-genetic-opt\venv\lib\site-packages\sphinx\config.py", line 351, in eval_config_file
    exec(code, namespace)  # NoQA: S102
  File "C:\Users\borel\Sklearn-genetic-opt\docs\conf.py", line 19, in <module>
    from sklearn_genetic import __version__
  File "C:\Users\borel\Sklearn-genetic-opt\sklearn_genetic\__init__.py", line 1, in <module>
    from .genetic_search import GASearchCV, GAFeatureSelectionCV
  File "C:\Users\borel\Sklearn-genetic-opt\sklearn_genetic\genetic_search.py", line 20, in <module>
    from .algorithms import eaSimple, eaMuPlusLambda, eaMuCommaLambda
  File "C:\Users\borel\Sklearn-genetic-opt\sklearn_genetic\algorithms.py", line 5, in <module>
    from .callbacks.validations import eval_callbacks
  File "C:\Users\borel\Sklearn-genetic-opt\sklearn_genetic\callbacks\__init__.py", line 7, in <module>
    from .loggers import ProgressBar, LogbookSaver, TensorBoard
  File "C:\Users\borel\Sklearn-genetic-opt\sklearn_genetic\callbacks\loggers.py", line 16, in <module>
    import tensorflow as tf
  File "c:\users\borel\sklearn-genetic-opt\venv\lib\site-packages\tensorflow\__init__.py", line 37, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "c:\users\borel\sklearn-genetic-opt\venv\lib\site-packages\tensorflow\python\__init__.py", line 37, in <module>
    from tensorflow.python.eager import context
  File "c:\users\borel\sklearn-genetic-opt\venv\lib\site-packages\tensorflow\python\eager\context.py", line 28, in <module>
    from tensorflow.core.framework import function_pb2
  File "c:\users\borel\sklearn-genetic-opt\venv\lib\site-packages\tensorflow\core\framework\function_pb2.py", line 16, in <module>
    from tensorflow.core.framework import attr_value_pb2 as tensorflow_dot_core_dot_framework_dot_attr__value__pb2
  File "c:\users\borel\sklearn-genetic-opt\venv\lib\site-packages\tensorflow\core\framework\attr_value_pb2.py", line 16, in <module>
    from tensorflow.core.framework import tensor_pb2 as tensorflow_dot_core_dot_framework_dot_tensor__pb2
  File "c:\users\borel\sklearn-genetic-opt\venv\lib\site-packages\tensorflow\core\framework\tensor_pb2.py", line 16, in <module>
    from tensorflow.core.framework import resource_handle_pb2 as tensorflow_dot_core_dot_framework_dot_resource__handle__pb2
  File "c:\users\borel\sklearn-genetic-opt\venv\lib\site-packages\tensorflow\core\framework\resource_handle_pb2.py", line 16, in <module>
    from tensorflow.core.framework import tensor_shape_pb2 as tensorflow_dot_core_dot_framework_dot_tensor__shape__pb2
  File "c:\users\borel\sklearn-genetic-opt\venv\lib\site-packages\tensorflow\core\framework\tensor_shape_pb2.py", line 36, in <module>
    _descriptor.FieldDescriptor(
  File "c:\users\borel\sklearn-genetic-opt\venv\lib\site-packages\google\protobuf\descriptor.py", line 560, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
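
For reference, a minimal sketch of the second workaround from the protobuf message above, applied in Python before tensorflow is imported. This trades the descriptor error for slower pure-Python parsing; it is a stopgap, not a fix for the conflicting pins:

import os

# Must be set before tensorflow (and thus protobuf) is imported.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

import tensorflow as tf  # noqa: E402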

Out of bounds fitness value

Hello,

I'm using the GA Feature Selection class for feature selection on a professional project. Depending on the generation, the logbook returns values that I have a hard time explaining (see the capture below; generations 3, 16, 18 and 19 here).

[image: logbook screenshot with the incoherent fitness values]

The algorithm parameters are shown below:

[image: screenshot of the algorithm parameters]

I've been looking in the docs, the code and also on the DEAP GitHub (I don't know yet whether it's related to this implementation or comes from DEAP), and I can't find a satisfying answer. So far I have identified that it happens more or less often depending on the crossover and mutation values, and I think some multiplication with weighted values may explain it. My understanding is that it's a penalization that occurs for some reason during fitness evaluation. I want to know exactly why it happens, which is why I'm opening this issue; I hope someone can give me a hint, and if I find the answer myself in the meantime I'll update the issue.

GASearchCV Really Slow Compared to GridSearchCV

System information
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10, in Visual Studio Code IDE Debugger with Anaconda environment. Issue also occurs in Linux Ubuntu 20.04.
Sklearn-genetic-opt version: 0.8.0
Scikit-learn version: 1.0.2
Python version: 3.8.12

Describe the issue
Hi. I am having an issue with GASearchCV that I am hoping you could help me with. I have a custom Keras/ Tensorflow classifier that is wrapped in scikeras's KerasClassifier to make a scikit-learn estimator. With GridSearchCV, it takes roughly 15-16 seconds per parameter permutation, but with GASearchCV it takes about 20.5 minutes per iteration, which is a huge difference. I have also tested it with RandomizedSearchCV and, similarly to GridSearchCV, it also takes ~15 seconds per iteration. Do you have any insight on why GASearchCV might be taking so much longer per iteration? I realize that it might be difficult to pinpoint the issue given that my code spans multiple files, but hopefully I can give enough info below for it to make sense.

To Reproduce
Steps to reproduce the behavior:

  1. The code can be obtained from https://github.com/btmartin721/PG-SUI.
  2. The code where I call GASearchCV (modified for clarity):
import numpy as np

class DisabledCV:
    def __init__(self):
        self.n_splits = 1

    # Yield a single "split" whose train and test sets are the full data.
    def split(self, X, y, groups=None):
        yield (np.arange(len(X)), np.arange(len(y)))

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits

# Disable cross-validation due to unsupervised model.
cross_val = DisabledCV()
     
callback = [
    ConsecutiveStopping(
        generations=self.early_stop_gen, metric="fitness"
    )
]

# Option passed to __init__() in custom class.
if not self.disable_progressbar:
    callback.append(ProgressBar())

verbose = False if self.verbose == 0 else True

# Custom, subclassed KerasClassifier (scikeras) estimator
# model_params and compile_params are set earlier in the code.
clf = MLPClassifier(
    V,
    model_params.pop("y_train"),
    y_true,
    **model_params,
    optimizer=compile_params["optimizer"],
    optimizer__learning_rate=compile_params["learning_rate"],
    loss=compile_params["loss"],
    metrics=compile_params["metrics"],
    epochs=fit_params["epochs"],
    phase=None,
    callbacks=fit_params["callbacks"],
    validation_split=fit_params["validation_split"],
    verbose=0,
  )

# Custom scoring metrics.
all_scoring_metrics = [
    "precision_recall_macro",
    "precision_recall_micro",
    "auc_macro",
    "auc_micro",
    "accuracy",
]

# Make multi-metric scorers from custom metrics.
scoring = self.nn_.make_multimetric_scorer(
    all_scoring_metrics, self.sim_missing_mask_
)

# Set in pg_sui.py
grid_params = {
    "learning_rate": Continuous(1e-6, 0.1, distribution="log-uniform"),
    "l2_penalty": Continuous(1e-6, 0.01, distribution="uniform"),
    "n_components": Integer(2, 3),
    "hidden_activation": Categorical(["elu", "relu"]),
}

# code here has been modified for clarity.
search = GASearchCV(
    estimator=clf,
    cv=cross_val,
    scoring=scoring,
    generations=80,
    param_grid=grid_params,
    n_jobs=4,
    refit="precision_recall_macro",
    verbose=0,
    error_score="raise",
)

# input V are small, randomly initialized values that get trained and refined via backpropagation.
# V is a dictionary of reduced-representation 2D arrays of shape (n_samples, n_components).
# y_true is the actual data, using non-missing values as the target. This trained model is then used to predict (i.e., impute) missing values.
search.fit(V[self.n_components], y_true, callbacks=callback)
  3. Describe the behavior: Runs, but extremely slowly compared to GridSearchCV and RandomizedSearchCV.
  4. The command I used from the root directory of the GitHub repository: pgsui/pg_sui.py -s pgsui/example_data/structure_files/test.nopops.2row.10sites.str -m pgsui/example_data/popmaps/test.popmap -i pgsui/example_data/trees/test.iqtree -t pgsui/example_data/trees/test.tre

Thank you for your time.

-Bradley
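
A rough back-of-the-envelope sketch of why the per-"iteration" numbers are hard to compare: a GA generation is not a single model fit. With the default eaMuPlusLambda algorithm, each generation evaluates a batch of offspring (in the logbook of the first issue above, population_size=10 yields nevals=20 per generation after the first). The population size below is an assumption, not something stated in this report:

# Back-of-the-envelope cost model; numbers are illustrative, not a profile.
population_size = 10                               # assumed; not given in the report above
candidates_per_generation = 2 * population_size    # nevals pattern of eaMuPlusLambda
seconds_per_candidate_fit = 15                     # the reported GridSearchCV per-candidate time

minutes_per_generation = candidates_per_generation * seconds_per_candidate_fit / 60
print(f"~{minutes_per_generation:.0f} minutes per generation")  # ~5 minutes

That alone turns 15-second fits into minutes-long generations; the remaining gap would need profiling, but the sketch shows that comparing one GA generation against one GridSearchCV candidate overstates the slowdown.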

Tuples as categorical parameters - "ValueError: a must be 1-dimensional"

System information
OS Platform and Distribution: Windows 10
Sklearn-genetic-opt version: 0.9.0
Scikit-learn version: 1.0.2
Python version: 3.7.9

Describe the bug
Passing tuples as hyperparameters causes the simulation to crash at the start with "ValueError: a must be 1-dimensional"

To Reproduce
Messy code but this will throw the error:

from sklearn.neural_network import MLPClassifier
from sklearn_genetic.space import Categorical

clf = MLPClassifier()
hidden_layers_1 = list([l1, l2, l3] for l1 in range(1, 129) for l2 in range(0, 129) for l3 in range(0, 129))
hidden_layers = []
for hl in hidden_layers_1:
    hidden_layers.append(tuple(hl))
param_grid = {'hidden_layer_sizes': Categorical(hidden_layers)}

Expected behavior
The algorithm should select one of the tuples as the hidden layer sizes and neuron counts; e.g. (50, 100, 10) would create three hidden layers in the network of 50, 100, and 10 neurons, respectively.

Is there any way around this error? Thanks in advance!
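
A possible workaround, sketched under the assumption that Categorical samples its choices with numpy.random.choice (which only accepts a 1-D candidate array, hence the "a must be 1-dimensional" error): tune a plain integer index and decode it to a tuple inside a thin wrapper estimator. The wrapper class and names below are hypothetical, not part of the library:

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neural_network import MLPClassifier
from sklearn_genetic.space import Categorical

LAYER_OPTIONS = [(50,), (50, 100), (50, 100, 10)]  # illustrative subset

class LayeredMLP(BaseEstimator, ClassifierMixin):
    """Hypothetical wrapper exposing an integer index so the space stays 1-D."""

    def __init__(self, layers_idx=0):
        self.layers_idx = layers_idx

    def fit(self, X, y):
        # Decode the index into the actual hidden_layer_sizes tuple.
        self.model_ = MLPClassifier(
            hidden_layer_sizes=LAYER_OPTIONS[self.layers_idx]
        ).fit(X, y)
        return self

    def predict(self, X):
        return self.model_.predict(X)

param_grid = {"layers_idx": Categorical(list(range(len(LAYER_OPTIONS))))}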

[FEATURE] Add threshold parameter to ConsecutiveStopping

Is your feature request related to a problem? Please describe.
I ran GASearchCV with a callback that stopped the optimization if the fitness was no greater than at least one fitness value from the last 5 generations.

callback = ConsecutiveStopping(generations=5, metric='fitness')

Checking the log information while the algorithm was running, I have noticed that the reported fitness (-12.7893) was the same for more than 5 consecutive generations (please see the attached image). Under these circumstances, I would have expected the algorithm to have stopped much earlier (in generation 8).

[image: logbook screenshot showing the same fitness over consecutive generations]

I assume the algorithm did not stop because the logbook only shows 4 decimal places. However, given that fitness improved very little after generation 8, I think in some situations the user could have the option to provide a threshold value to ConsecutiveStopping, which would make the algorithm stop after N consecutive generations if the improvement in fitness (or any other metric) was no greater than a specific threshold (e.g. 0.0001). This could make the algorithm finish much faster on some occasions.

Describe the solution you'd expect
I have made a custom callback (which hopefully is correct) to achieve what I want (the documentation was quite helpful). Please feel free to make any comments regarding my code:

from sklearn_genetic.callbacks.base import BaseCallback

class ConsecutiveStoppingThreshold(BaseCallback):
    def __init__(self, threshold, N, metric='fitness'):
        self.threshold = threshold
        self.N = N
        self.metric = metric
        
    def on_step(self, record, logbook, estimator=None):
        #not enough data points
        if len(logbook) <= self.N:
            return False
        
        #get the last N metrics
        stats = logbook.select(self.metric)[-self.N :]
        
        #find the difference between max and min fitness in the last metrics
        diff = max(stats) - min(stats)
        
        if self.threshold > diff:
            return True
        return False

I have tested this code and it appears to work fine. From my perspective, this type of callback is very useful and should therefore be more easily accessible to users. In my opinion, you could do one of the following:

  1. Show an explicit example in the "Custom callbacks" section of the package's documentation demonstrating how to achieve the above (see the usage sketch after this list).
  2. Or add a threshold argument to ConsecutiveStopping, letting the user provide a float that determines how much improvement is allowed over N consecutive generations.
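
For completeness, a usage sketch of the callback above, assuming the ConsecutiveStoppingThreshold class just defined is in scope (the dataset, estimator, and search settings are illustrative only):

from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Integer

X, y = load_digits(return_X_y=True)

evolved = GASearchCV(
    estimator=DecisionTreeClassifier(),
    param_grid={"max_depth": Integer(2, 20)},
    population_size=10,
    generations=30,
    scoring="accuracy",
)

# Stop once fitness improves by less than 1e-4 over 5 consecutive generations.
evolved.fit(X, y, callbacks=[ConsecutiveStoppingThreshold(threshold=1e-4, N=5)])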
